MusicBrainz Ingestion
Architecture documentation for ingesting music metadata from MusicBrainz.
Overview
MusicBrainz is an open music encyclopedia maintained by the MetaBrainz Foundation. It serves as the canonical source for music metadata with community-curated data covering artists, releases, recordings, and works.
| Attribute | Value |
|---|---|
| Data Quality | High (community-curated) |
| Coverage | ~2M artists, ~3M releases, ~30M recordings |
| Update Frequency | Real-time edits, twice-weekly dumps |
| API Style | REST with Lucene search |
| Cost | Free (rate-limited) |
Data Model
MusicBrainz uses a hierarchical model that separates abstract concepts from concrete manifestations.
Entity Hierarchy
                       ┌──────────┐
                       │   WORK   │  ← Composition (the song as written)
                       │  (ISWC)  │    "Bohemian Rhapsody" by Freddie Mercury
                       └────┬─────┘
                            │ performed as
                            ▼
                       ┌──────────┐
                       │RECORDING │  ← Unique audio (specific performance)
                       │  (ISRC)  │    Studio version, live version, demo
                       └────┬─────┘
                            │ appears on
                            ▼
┌──────────┐           ┌──────────┐
│  ARTIST  │◄─────────►│ RELEASE  │  ← Physical/digital product
│  (MBID)  │ credited  │  (UPC)   │    US CD, UK Vinyl, Spotify release
└──────────┘    on     └────┬─────┘
                            │ variant of
                            ▼
                       ┌──────────┐
                       │ RELEASE  │  ← Abstract album concept
                       │  GROUP   │    "A Night at the Opera" (all editions)
                       └──────────┘
Core Entities
| Entity | Description | Identifiers | Example |
|---|---|---|---|
| Artist | Musician, band, orchestra, composer | MBID | Queen, Freddie Mercury |
| Work | Abstract composition | MBID, ISWC | "Bohemian Rhapsody" (the song) |
| Recording | Specific audio performance | MBID, ISRC | Studio recording of Bohemian Rhapsody |
| Release | Concrete product (CD, vinyl, digital) | MBID, Barcode/UPC | 1975 UK vinyl pressing |
| Release Group | Abstract album (all editions) | MBID | "A Night at the Opera" |
| Label | Record label or imprint | MBID | EMI, Hollywood Records |
Every entity has an MBID. ISWC, ISRC, and barcode are additional industry identifiers that may be missing or, for ISRCs, multiple.
Key Distinction: Release vs Release Group
Release Group = The abstract album concept
- "Nevermind" by Nirvana
Release = A specific physical or digital product
- 1991 US CD (DGC)
- 1991 UK CD (Geffen)
- 2011 Deluxe Edition (4 CDs)
- 2021 30th Anniversary Super Deluxe
This separation allows tracking all variants while maintaining a single "album" identity.
Key Distinction: Recording vs Work
Work = The composition (what was written)
- Composer: Kurt Cobain
- ISWC identifier
- No audio - just the abstract song
Recording = A specific audio capture
- Performer: Nirvana
- ISRC identifier
- Has duration, audio characteristics
- Multiple recordings of same work (studio, live, acoustic)
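As a rough illustration of the two distinctions above, the entities might be modeled like this in Go. All type names, field names, and the placeholder IDs are assumptions made for this sketch; they are neither the MusicBrainz schema nor this project's actual types.

```go
package main

import "fmt"

// All types below are hypothetical and exist only for this illustration.

// Work is the abstract composition; it carries no audio.
type Work struct {
	MBID  string
	Title string
	ISWC  string
}

// Recording is one specific audio capture of a work.
type Recording struct {
	MBID       string
	Title      string
	ISRC       string
	DurationMs int
	WorkMBID   string // the work this recording is a performance of
}

// Release is one concrete product, such as a particular CD pressing.
type Release struct {
	MBID             string
	Title            string
	Country          string
	Barcode          string
	ReleaseGroupMBID string // the abstract album this edition belongs to
}

// ReleaseGroup is the abstract album shared by every edition.
type ReleaseGroup struct {
	MBID  string
	Title string
}

// recordingsOf returns every recording linked to the given work.
func recordingsOf(w Work, all []Recording) []Recording {
	var out []Recording
	for _, r := range all {
		if r.WorkMBID == w.MBID {
			out = append(out, r)
		}
	}
	return out
}

// editionsOf returns every release that belongs to the given release group.
func editionsOf(rg ReleaseGroup, all []Release) []Release {
	var out []Release
	for _, r := range all {
		if r.ReleaseGroupMBID == rg.MBID {
			out = append(out, r)
		}
	}
	return out
}

func main() {
	// Placeholder IDs for readability; real MBIDs are UUIDs.
	album := ReleaseGroup{MBID: "rg-1", Title: "Nevermind"}
	releases := []Release{
		{MBID: "rel-1", Title: "Nevermind", Country: "US", ReleaseGroupMBID: "rg-1"},
		{MBID: "rel-2", Title: "Nevermind", Country: "GB", ReleaseGroupMBID: "rg-1"},
	}
	fmt.Println(len(editionsOf(album, releases)), "editions of", album.Title)
}
```

Keeping the link on the child side (a release points at its release group, a recording at its work) mirrors the one-to-many direction described above.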
Relationship System
MusicBrainz uses Advanced Relationships (ARs) to connect entities with typed, attributed links.
Relationship Types
Artist ↔ Artist:
- member of band (with dates)
- collaboration
- teacher of
Artist ↔ Recording:
- performer (with instrument)
- producer
- engineer
- mix
Artist ↔ Work:
- composer
- lyricist
- writer
Recording ↔ Work:
- performance of
Artist ↔ URL:
- official homepage
- social network (Spotify, YouTube, etc.)
- streaming
Relationship Attributes
Relationships carry attributes providing detail:
Artist: John Lennon
  └─► Recording: "Come Together"
        Relationship: performer
        Attributes:
          - instrument: vocals
          - instrument: rhythm guitar
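The example above could be held in memory roughly as follows. This is a minimal sketch: the `Relationship` type, its fields, and the `withType` helper are hypothetical simplifications, and the second link in `main` is invented purely to have something to filter out.

```go
package main

import "fmt"

// Relationship is a hypothetical, simplified view of one typed link.
type Relationship struct {
	Type       string   // relationship type, e.g. "performer"
	Source     string   // entity the link starts from
	Target     string   // entity the link points to
	Attributes []string // attribute values, e.g. instruments
}

// withType returns the relationships of the requested type.
func withType(rels []Relationship, relType string) []Relationship {
	var out []Relationship
	for _, r := range rels {
		if r.Type == relType {
			out = append(out, r)
		}
	}
	return out
}

func main() {
	rels := []Relationship{
		{
			Type:       "performer",
			Source:     "John Lennon",
			Target:     "Come Together",
			Attributes: []string{"vocals", "rhythm guitar"},
		},
		// Invented placeholder link, present only to exercise the filter.
		{Type: "producer", Source: "Example Producer", Target: "Come Together"},
	}
	for _, r := range withType(rels, "performer") {
		fmt.Println(r.Source, "->", r.Target, r.Attributes)
	}
}
```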
API Access Patterns
Three Methods
| Method | Purpose | Use Case |
|---|---|---|
| Lookup | Fetch single entity by MBID | Known entity, need full details |
| Browse | Paginate related entities | All albums by artist, all tracks on album |
| Search | Find entities by criteria | Find artist by name, recording by ISRC |
Lookup
Direct fetch by MusicBrainz ID (MBID). Returns a single entity, with optional related data requested through the inc parameter.
Related data options: releases, recordings, url-rels, artist-rels, genres, labels, media, isrcs
Limitation: Related entities capped at 25 per request. Use Browse for complete lists.
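A lookup URL can be assembled as in the sketch below. The `/ws/2/{entity}/{mbid}` path and the `inc` and `fmt` query parameters are those of the public MusicBrainz web service; the `lookupURL` helper and the all-zero MBID are placeholders invented for this example.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

const baseURL = "https://musicbrainz.org/ws/2"

// lookupURL builds the URL for fetching one entity by MBID.
// inc lists the related data to embed, e.g. "url-rels" or "genres".
func lookupURL(entity, mbid string, inc ...string) string {
	q := url.Values{}
	q.Set("fmt", "json")
	if len(inc) > 0 {
		// Values are joined with a space, which url.Values encodes as "+",
		// the separator the web service expects between inc values.
		q.Set("inc", strings.Join(inc, " "))
	}
	return baseURL + "/" + url.PathEscape(entity) + "/" + url.PathEscape(mbid) + "?" + q.Encode()
}

func main() {
	// Placeholder MBID, not a real artist.
	mbid := "00000000-0000-0000-0000-000000000000"
	fmt.Println(lookupURL("artist", mbid, "url-rels", "genres"))
}
```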
Browse
Paginated fetch of entities related to another entity. Each request returns at most 100 items (limit parameter, default 25), so iterate with an increasing offset until every item has been retrieved.
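The offset loop might look like the sketch below. The `page` type and the injected `fetchFunc` are assumptions made for illustration; a real implementation would issue an HTTP request inside `fetch` and respect the rate limit between pages.

```go
package main

import "fmt"

// page is a hypothetical, reduced view of one browse response.
type page struct {
	Total int      // total number of matching entities reported by the server
	Items []string // entity IDs contained in this page
}

// fetchFunc retrieves one page starting at offset with at most limit items.
type fetchFunc func(offset, limit int) (page, error)

const maxLimit = 100 // largest page size the browse endpoint accepts

// browseAll pages through the result set until every item has been collected.
func browseAll(fetch fetchFunc) ([]string, error) {
	var all []string
	offset := 0
	for {
		p, err := fetch(offset, maxLimit)
		if err != nil {
			return nil, err
		}
		all = append(all, p.Items...)
		offset += len(p.Items)
		// Stop on an empty page as well, so a wrong total cannot loop forever.
		if len(p.Items) == 0 || offset >= p.Total {
			return all, nil
		}
	}
}

func main() {
	const total = 250
	// fake stands in for the HTTP call and serves 250 generated IDs.
	fake := func(offset, limit int) (page, error) {
		end := offset + limit
		if end > total {
			end = total
		}
		items := make([]string, 0, end-offset)
		for i := offset; i < end; i++ {
			items = append(items, fmt.Sprintf("rg-%d", i))
		}
		return page{Total: total, Items: items}, nil
	}
	all, err := browseAll(fake)
	fmt.Println(len(all), err)
}
```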
Search
Lucene-syntax queries across entity fields. Useful for:
- Finding entities by name (fuzzy matching)
- Looking up by external identifier (ISRC, barcode)
- Filtering by attributes (country, type, date)
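A query for these cases can be built as sketched here. The helper names are hypothetical. Values are wrapped in quoted phrases, which means only backslashes and double quotes need escaping; that also makes each term an exact phrase match, so fuzzy matching would need unquoted terms with full Lucene escaping instead.

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
	"strings"
)

// term is one field:value pair of a search query.
type term struct {
	Field string
	Value string
}

// phraseEscaper escapes the two characters that are special inside a quoted phrase.
var phraseEscaper = strings.NewReplacer(`\`, `\\`, `"`, `\"`)

// phrase wraps a value in double quotes so it is matched as one phrase.
func phrase(v string) string {
	return `"` + phraseEscaper.Replace(v) + `"`
}

// buildQuery joins all terms with AND.
func buildQuery(terms ...term) string {
	parts := make([]string, 0, len(terms))
	for _, t := range terms {
		parts = append(parts, t.Field+":"+phrase(t.Value))
	}
	return strings.Join(parts, " AND ")
}

// searchURL builds the search URL for an entity type such as "artist".
func searchURL(entity, query string, limit int) string {
	q := url.Values{}
	q.Set("query", query)
	q.Set("fmt", "json")
	q.Set("limit", strconv.Itoa(limit))
	return "https://musicbrainz.org/ws/2/" + url.PathEscape(entity) + "?" + q.Encode()
}

func main() {
	query := buildQuery(term{"artist", "Queen"}, term{"country", "GB"})
	fmt.Println(query)
	fmt.Println(searchURL("artist", query, 5))
}
```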
Rate Limiting
| Rule | Limit |
|---|---|
| Requests per second | 1 (hard limit) |
| Burst allowance | None |
| Violation penalty | HTTP 503 until rate drops |
| User-Agent | Required (blocked without) |
User-Agent format: AppName/Version ( contact-url-or-email )
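One way to respect both rules on the client side is sketched below: a small throttle that spaces requests at least one interval apart with no burst, plus a helper that formats the User-Agent header. The `throttle` type and the application name, version, and contact address are all invented for this example, and no request is actually sent.

```go
package main

import (
	"fmt"
	"net/http"
	"sync"
	"time"
)

// throttle spaces calls at least interval apart and allows no burst.
type throttle struct {
	mu       sync.Mutex
	interval time.Duration
	next     time.Time // earliest time the next request may start
}

func newThrottle(interval time.Duration) *throttle {
	return &throttle{interval: interval}
}

// wait blocks until the caller's request slot is due.
func (t *throttle) wait() {
	t.mu.Lock()
	now := time.Now()
	start := t.next
	if start.Before(now) {
		start = now
	}
	t.next = start.Add(t.interval) // reserve the following slot
	t.mu.Unlock()
	time.Sleep(start.Sub(now))
}

// userAgent formats the header as AppName/Version ( contact ).
func userAgent(app, version, contact string) string {
	return fmt.Sprintf("%s/%s ( %s )", app, version, contact)
}

// newRequest builds a GET request that carries the required User-Agent.
func newRequest(rawURL, ua string) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodGet, rawURL, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("User-Agent", ua)
	req.Header.Set("Accept", "application/json")
	return req, nil
}

func main() {
	ua := userAgent("ExampleApp", "0.1.0", "dev@example.com")
	limiter := newThrottle(time.Second) // 1 request per second
	for i := 0; i < 3; i++ {
		limiter.wait() // the first call returns at once, later calls are spaced
		req, err := newRequest("https://musicbrainz.org/ws/2/artist?query=queen&fmt=json", ua)
		if err != nil {
			fmt.Println("build request:", err)
			return
		}
		fmt.Println(time.Now().Format("15:04:05"), req.Method, req.URL.Path, req.Header.Get("User-Agent"))
	}
}
```

Sharing a single throttle across all goroutines keeps the process as a whole under the limit; a 503 response should still be handled by backing off and retrying.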
Entity Mapping to Internal Schema
Artist
| MusicBrainz | Internal | Notes |
|---|---|---|
| id | source_id | MBID stored as external reference |
| name | name | |
| sort-name | sort_name | |
| type | artist_type | Person, Group, Orchestra, etc. |
| country | country | ISO code |
| life-span.begin | formed_date | |
| life-span.end | disbanded_date | |
| disambiguation | description | Short disambiguator |
| URL relationship (image) | image_url | From Wikimedia Commons link |
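The artist mapping might be implemented roughly as follows. The JSON keys are the ones listed in the table; the `Artist` struct, the `mapArtist` helper, and the sample payload are assumptions for this sketch. The image URL is left out because it comes from a URL relationship rather than a plain field.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// mbArtist mirrors the subset of the MusicBrainz artist JSON used by the mapping.
type mbArtist struct {
	ID             string `json:"id"`
	Name           string `json:"name"`
	SortName       string `json:"sort-name"`
	Type           string `json:"type"`
	Country        string `json:"country"`
	Disambiguation string `json:"disambiguation"`
	LifeSpan       struct {
		Begin string `json:"begin"`
		End   string `json:"end"`
	} `json:"life-span"`
}

// Artist is a hypothetical internal record with the columns from the table.
type Artist struct {
	SourceID      string
	Name          string
	SortName      string
	ArtistType    string
	Country       string
	FormedDate    string
	DisbandedDate string
	Description   string
}

// mapArtist decodes a MusicBrainz artist document into the internal shape.
// JSON nulls decode to empty strings.
func mapArtist(raw []byte) (Artist, error) {
	var in mbArtist
	if err := json.Unmarshal(raw, &in); err != nil {
		return Artist{}, fmt.Errorf("decode artist: %w", err)
	}
	return Artist{
		SourceID:      in.ID,
		Name:          in.Name,
		SortName:      in.SortName,
		ArtistType:    in.Type,
		Country:       in.Country,
		FormedDate:    in.LifeSpan.Begin,
		DisbandedDate: in.LifeSpan.End,
		Description:   in.Disambiguation,
	}, nil
}

func main() {
	// Made-up sample payload with a placeholder MBID.
	raw := []byte(`{
		"id": "00000000-0000-0000-0000-000000000000",
		"name": "Example Band",
		"sort-name": "Example Band",
		"type": "Group",
		"country": "GB",
		"disambiguation": "made-up sample",
		"life-span": {"begin": "1990", "end": null}
	}`)
	artist, err := mapArtist(raw)
	fmt.Printf("%+v %v\n", artist, err)
}
```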
Album (from Release Group)
| MusicBrainz | Internal | Notes |
|---|---|---|
| id | source_id | Release Group MBID |
| title | title | |
| primary-type | album_type | Album, EP, Single |
| first-release-date | release_date | Earliest release |
| Label from release | label_id | From canonical release |
Track (from Recording)
| MusicBrainz | Internal | Notes |
|---|---|---|
| id | source_id | Recording MBID |
| title | title | |
| length | duration_ms | In milliseconds |
| isrcs[0] | isrc | First ISRC if multiple |
| Work relationship | work_id | Link to composition |
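Two rows of the track mapping need care: length can be null, and isrcs can be empty or hold several values. A minimal sketch, assuming a hypothetical `Track` struct and made-up sample values, with the work link omitted because it requires relationship data:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// mbRecording mirrors the recording fields used by the mapping.
type mbRecording struct {
	ID     string   `json:"id"`
	Title  string   `json:"title"`
	Length *int     `json:"length"` // milliseconds; nil when the JSON value is null
	ISRCs  []string `json:"isrcs"`  // only present when requested via inc=isrcs
}

// Track is a hypothetical internal record.
type Track struct {
	SourceID   string
	Title      string
	DurationMs *int   // nil when MusicBrainz has no length
	ISRC       string // first ISRC, empty when none is known
}

// mapTrack decodes a recording document and applies the table's rules.
func mapTrack(raw []byte) (Track, error) {
	var in mbRecording
	if err := json.Unmarshal(raw, &in); err != nil {
		return Track{}, fmt.Errorf("decode recording: %w", err)
	}
	t := Track{SourceID: in.ID, Title: in.Title, DurationMs: in.Length}
	if len(in.ISRCs) > 0 {
		t.ISRC = in.ISRCs[0]
	}
	return t, nil
}

func main() {
	// Made-up sample payload; the ID and ISRCs are placeholders.
	raw := []byte(`{"id":"rec-1","title":"Example Song","length":354000,"isrcs":["AAAAA0000001","AAAAA0000002"]}`)
	track, err := mapTrack(raw)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(track.Title, *track.DurationMs, track.ISRC)
}
```

Using a pointer for the duration keeps "unknown" distinct from a zero-length track; a nullable database column would serve the same purpose.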
Work
| MusicBrainz | Internal | Notes |
|---|---|---|
| id | source_id | Work MBID |
| title | title | |
| type | work_type | Song, Symphony, Opera, etc. |
| language | language | ISO code |
Label
| MusicBrainz | Internal | Notes |
|---|---|---|
| id | source_id | Label MBID |
| name | name | |
| country | country | ISO code |
| life-span.begin | founded_date | |
Ingestion Flow
Artist Discovery
         INPUT: Artist name
                  │
                  ▼
┌─────────────────────────────────────┐
│ SEARCH by name                      │
│ → Ranked matches with scores        │
│ → Select highest + verify           │
└─────────────────┬───────────────────┘
                  │ MBID
                  ▼
┌─────────────────────────────────────┐
│ LOOKUP with relationships           │
│ → URLs, genres, band members        │
└─────────────────┬───────────────────┘
                  │
                  ▼
STORE: artist + external_id + genres
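The "select highest + verify" step could be sketched as below. Search results carry a relevance score from 0 to 100; the `candidate` type, the minimum score, and the case-insensitive exact-name check are assumptions chosen for illustration, not rules taken from this document.

```go
package main

import (
	"fmt"
	"strings"
)

// candidate is a hypothetical, reduced view of one search hit.
type candidate struct {
	MBID  string
	Name  string
	Score int // relevance score reported by the search endpoint, 0 to 100
}

// pickBest returns the highest-scoring candidate if it passes verification:
// the score must reach minScore and the name must equal the query,
// ignoring case and surrounding whitespace.
func pickBest(query string, cands []candidate, minScore int) (candidate, bool) {
	var best candidate
	found := false
	for _, c := range cands {
		if !found || c.Score > best.Score {
			best, found = c, true
		}
	}
	if !found || best.Score < minScore {
		return candidate{}, false
	}
	if !strings.EqualFold(strings.TrimSpace(best.Name), strings.TrimSpace(query)) {
		return candidate{}, false
	}
	return best, true
}

func main() {
	// Placeholder IDs and scores, invented for the example.
	cands := []candidate{
		{MBID: "id-1", Name: "Queens of the Stone Age", Score: 71},
		{MBID: "id-2", Name: "Queen", Score: 100},
	}
	best, ok := pickBest("queen", cands, 90)
	fmt.Println(best.Name, ok)
}
```

A stricter verification could also compare country or disambiguation, or reject results where two candidates tie on score.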
Discography Sync
         INPUT: Artist MBID
                  │
                  ▼
┌─────────────────────────────────────┐
│ BROWSE all release-groups           │
│ → Filter: album, ep, single         │
│ → Paginate until exhausted          │
└─────────────────┬───────────────────┘
                  │ for each
                  ▼
┌─────────────────────────────────────┐
│ LOOKUP release-group                │
│ → Get releases list                 │
│ → Select canonical release          │
└─────────────────┬───────────────────┘
                  │ release MBID
                  ▼
┌─────────────────────────────────────┐
│ LOOKUP release with tracks          │
│ → Media structure (discs)           │
│ → Track positions                   │
│ → ISRCs, label info                 │
└─────────────────┬───────────────────┘
                  │
                  ▼
STORE: album + tracks + positions
Canonical Release Selection
When a release-group has multiple releases, select one as canonical:
| Priority | Criteria |
|---|---|
| 1 | Status: Official > Promotional > Bootleg |
| 2 | Format: Digital > CD > Vinyl |
| 3 | Completeness: Has barcode, has label |
| 4 | Date: Original release preferred |
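The priority table translates naturally into a comparison function. In this sketch the `release` struct is hypothetical, the status and format strings ("Promotion", "Digital Media") are assumed to match the values the API returns, and dates are compared as strings, which works for YYYY, YYYY-MM, and YYYY-MM-DD.

```go
package main

import (
	"fmt"
	"strings"
)

// release is a hypothetical, reduced view of one release in a release group.
type release struct {
	MBID    string
	Status  string
	Format  string
	Barcode string
	Label   string
	Date    string // YYYY, YYYY-MM or YYYY-MM-DD; empty when unknown
}

var (
	statusOrder = []string{"Official", "Promotion", "Bootleg"}
	formatOrder = []string{"Digital Media", "CD", "Vinyl"}
)

// rank returns the position of value in order; unknown values rank last.
func rank(value string, order []string) int {
	for i, v := range order {
		if strings.EqualFold(v, value) {
			return i
		}
	}
	return len(order)
}

// completeness counts how many of barcode and label are present.
func completeness(r release) int {
	n := 0
	if r.Barcode != "" {
		n++
	}
	if r.Label != "" {
		n++
	}
	return n
}

// better reports whether a should be preferred over b, applying the
// criteria in priority order: status, format, completeness, date.
func better(a, b release) bool {
	if x, y := rank(a.Status, statusOrder), rank(b.Status, statusOrder); x != y {
		return x < y
	}
	if x, y := rank(a.Format, formatOrder), rank(b.Format, formatOrder); x != y {
		return x < y
	}
	if x, y := completeness(a), completeness(b); x != y {
		return x > y
	}
	if a.Date == "" || b.Date == "" {
		return a.Date != "" // a known date beats an unknown one
	}
	return a.Date < b.Date // earlier release preferred
}

// selectCanonical returns the preferred release; ties keep the earlier entry.
func selectCanonical(rs []release) (release, bool) {
	if len(rs) == 0 {
		return release{}, false
	}
	best := rs[0]
	for _, r := range rs[1:] {
		if better(r, best) {
			best = r
		}
	}
	return best, true
}

func main() {
	rs := []release{
		{MBID: "vinyl", Status: "Official", Format: "Vinyl", Date: "1991"},
		{MBID: "cd", Status: "Official", Format: "CD", Barcode: "0000000000000", Label: "Example", Date: "1991"},
	}
	best, ok := selectCanonical(rs)
	fmt.Println(best.MBID, ok)
}
```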
Cover Art
Album artwork is served by the Cover Art Archive (coverartarchive.org), not by MusicBrainz directly.
| Size | URL Pattern |
|---|---|
| Original | /release/{release_mbid}/front |
| Thumbnail | /release/{release_mbid}/front-250 |
| Medium | /release/{release_mbid}/front-500 |
| Large | /release/{release_mbid}/front-1200 |
Not all releases have cover art. Check availability via release metadata.
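The URL patterns above can be generated with a small helper such as this one. The function name and the convention of passing 0 for the original image are choices made for this sketch, and the MBID shown is a placeholder.

```go
package main

import (
	"fmt"
	"net/url"
)

const coverArtBase = "https://coverartarchive.org"

// frontCoverURL returns the front cover URL for a release.
// size is 0 for the original image or one of 250, 500, 1200 for a thumbnail.
func frontCoverURL(releaseMBID string, size int) (string, error) {
	path := coverArtBase + "/release/" + url.PathEscape(releaseMBID) + "/front"
	switch size {
	case 0:
		return path, nil
	case 250, 500, 1200:
		return fmt.Sprintf("%s-%d", path, size), nil
	default:
		return "", fmt.Errorf("unsupported cover art size %d", size)
	}
}

func main() {
	// Placeholder MBID, not a real release.
	u, err := frontCoverURL("00000000-0000-0000-0000-000000000000", 500)
	fmt.Println(u, err)
}
```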
Bulk Data Access
For large-scale ingestion, database dumps avoid rate limits.
| Source | Format | Frequency | Use Case |
|---|---|---|---|
| JSON dumps | JSONL (gzipped) | 2x/week | Initial seeding |
| PostgreSQL dumps | SQL | 2x/week | Full mirror |
| Replication packets | Incremental | Hourly | Staying in sync |
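Because the JSON dumps hold one document per line, they can be processed as a stream without loading the whole file. The sketch below assumes the caller has already wrapped the file in the appropriate decompressing reader; the `dumpEntity` type keeps only two fields, and both helper names are invented for this example.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"strings"
)

// dumpEntity keeps only the fields this example needs from each document.
type dumpEntity struct {
	ID   string `json:"id"`
	Name string `json:"name"`
}

// streamEntities decodes one JSON document after another from r and passes
// each to handle. It returns the number of documents handled.
func streamEntities(r io.Reader, handle func(dumpEntity) error) (int, error) {
	dec := json.NewDecoder(r)
	n := 0
	for {
		var e dumpEntity
		err := dec.Decode(&e)
		if err == io.EOF {
			return n, nil // clean end of input
		}
		if err != nil {
			return n, fmt.Errorf("decode entity %d: %w", n+1, err)
		}
		if err := handle(e); err != nil {
			return n, err
		}
		n++
	}
}

// streamString is a convenience wrapper for small inputs and tests.
func streamString(s string, handle func(dumpEntity) error) (int, error) {
	return streamEntities(strings.NewReader(s), handle)
}

func main() {
	// Made-up sample lines standing in for a decompressed dump.
	sample := `{"id":"id-1","name":"Example Artist"}
{"id":"id-2","name":"Another Artist"}
`
	n, err := streamString(sample, func(e dumpEntity) error {
		fmt.Println(e.ID, e.Name)
		return nil
	})
	fmt.Println(n, err)
}
```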
Recommended Strategy
| Phase | Method |
|---|---|
| Initial load | JSON dumps |
| On-demand | Live API with caching |
| Periodic refresh | JSON dumps monthly |
Caching
| Entity | TTL | Rationale |
|---|---|---|
| Artist | 30 days | Rarely changes |
| Album | 30 days | Rarely changes |
| Track | 30 days | Rarely changes |
| Search results | 24 hours | New entries may appear |
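A freshness check built on this table might look like the following sketch. The `cacheEntry` type, the kind names used as map keys, and the choice to treat unknown kinds as always stale are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"time"
)

const day = 24 * time.Hour

// ttls holds the time-to-live per cached kind, taken from the table.
var ttls = map[string]time.Duration{
	"artist": 30 * day,
	"album":  30 * day,
	"track":  30 * day,
	"search": 24 * time.Hour,
}

// cacheEntry is a hypothetical record of when something was last fetched.
type cacheEntry struct {
	Kind      string
	FetchedAt time.Time
}

// fresh reports whether the entry is still within its TTL at time now.
// Unknown kinds are treated as stale so they are always refetched.
func (e cacheEntry) fresh(now time.Time) bool {
	ttl, ok := ttls[e.Kind]
	if !ok {
		return false
	}
	return now.Sub(e.FetchedAt) < ttl
}

func main() {
	now := time.Now()
	artist := cacheEntry{Kind: "artist", FetchedAt: now.Add(-10 * day)}
	search := cacheEntry{Kind: "search", FetchedAt: now.Add(-2 * day)}
	fmt.Println(artist.fresh(now), search.fresh(now))
}
```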
External ID Storage
Store in *_external_ids tables:
| Field | Value |
|---|---|
| source | "musicbrainz" |
| source_id | MBID (UUID) |
| url | https://musicbrainz.org/{entity}/{mbid} |
Enables:
- Cross-source deduplication
- Lookup by MBID from other services
- Link back for verification
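A row for such a table could be constructed as in this sketch. The `ExternalID` struct and `newMusicBrainzID` helper are hypothetical; validating the UUID shape and lower-casing it before storage are assumptions intended to make deduplication by MBID reliable.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// mbidPattern matches the canonical lower-case UUID form of an MBID.
var mbidPattern = regexp.MustCompile(`^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$`)

// ExternalID is a hypothetical row of an *_external_ids table.
type ExternalID struct {
	Source   string
	SourceID string
	URL      string
}

// newMusicBrainzID validates the MBID and builds the row for an entity type
// such as "artist", "release-group", or "recording".
func newMusicBrainzID(entity, mbid string) (ExternalID, error) {
	id := strings.ToLower(strings.TrimSpace(mbid))
	if !mbidPattern.MatchString(id) {
		return ExternalID{}, fmt.Errorf("not a valid MBID: %q", mbid)
	}
	return ExternalID{
		Source:   "musicbrainz",
		SourceID: id,
		URL:      "https://musicbrainz.org/" + entity + "/" + id,
	}, nil
}

func main() {
	// Placeholder MBID, not a real artist.
	row, err := newMusicBrainzID("artist", "00000000-0000-0000-0000-000000000000")
	fmt.Printf("%+v %v\n", row, err)
}
```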
Go Client
Recommended: go.uploadedlobster.com/musicbrainzws2