# MusicBrainz Ingestion

Architecture documentation for ingesting music metadata from MusicBrainz.

---

## Overview

**MusicBrainz** is an open music encyclopedia maintained by the MetaBrainz Foundation. It serves as the canonical source for music metadata with community-curated data covering artists, releases, recordings, and works.

| Attribute | Value |
|-----------|-------|
| Data Quality | High (community-curated) |
| Coverage | ~2M artists, ~3M releases, ~30M recordings |
| Update Frequency | Real-time edits, full dumps twice weekly |
| API Style | REST with Lucene search |
| Cost | Free (rate-limited) |

---

## Data Model

MusicBrainz uses a hierarchical model that separates abstract concepts from concrete manifestations.

### Entity Hierarchy

```
                       ┌──────────┐
                       │   WORK   │  ← Composition (the song as written)
                       │  (ISWC)  │    "Bohemian Rhapsody" by Freddie Mercury
                       └────┬─────┘
                            │ performed as
                            ▼
                       ┌──────────┐
                       │RECORDING │  ← Unique audio (specific performance)
                       │  (ISRC)  │    Studio version, live version, demo
                       └────┬─────┘
                            │ appears on
                            ▼
┌──────────┐           ┌──────────┐
│  ARTIST  │◄─────────►│ RELEASE  │  ← Physical/digital product
│  (MBID)  │ credited  │  (UPC)   │    US CD, UK Vinyl, Spotify release
└──────────┘    on     └────┬─────┘
                            │ variant of
                            ▼
                       ┌──────────┐
                       │ RELEASE  │  ← Abstract album concept
                       │  GROUP   │    "A Night at the Opera" (all editions)
                       └──────────┘
```

### Core Entities

| Entity | Description | Identifier | Example |
|--------|-------------|------------|---------|
| **Artist** | Musician, band, orchestra, composer | MBID | Queen, Freddie Mercury |
| **Work** | Abstract composition | ISWC | "Bohemian Rhapsody" (the song) |
| **Recording** | Specific audio performance | ISRC | Studio recording of Bohemian Rhapsody |
| **Release** | Concrete product (CD, vinyl, digital) | Barcode/UPC | 1975 UK vinyl pressing |
| **Release Group** | Abstract album (all editions) | MBID | "A Night at the Opera" |
| **Label** | Record label or imprint | MBID | EMI, Hollywood Records |

### Key Distinction: Release vs Release Group

**Release Group** = The abstract album concept
- "Nevermind" by Nirvana

**Release** = A specific physical or digital product
- 1991 US CD (DGC)
- 1991 UK CD (Geffen)
- 2011 Deluxe Edition (4 CDs)
- 2021 30th Anniversary Super Deluxe

This separation allows tracking all variants while maintaining a single "album" identity.

### Key Distinction: Recording vs Work

**Work** = The composition (what was written)
- Composer: Kurt Cobain
- ISWC identifier
- No audio - just the abstract song

**Recording** = A specific audio capture
- Performer: Nirvana
- ISRC identifier
- Has duration, audio characteristics
- Multiple recordings of same work (studio, live, acoustic)
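
The two distinctions above can be sketched as plain data types. This is an illustrative Python sketch, not the project's schema: the class and field names are assumptions, and the IDs are placeholders rather than real MBIDs.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Work:
    """The composition: no audio, carries an ISWC when one is known."""
    mbid: str
    title: str
    iswc: Optional[str] = None


@dataclass
class Recording:
    """One specific audio capture of a work."""
    mbid: str
    title: str
    work_mbid: Optional[str] = None  # "performance of" link to a Work
    length_ms: Optional[int] = None
    isrcs: List[str] = field(default_factory=list)


@dataclass
class Release:
    """One concrete product (a particular CD, pressing, or digital issue)."""
    mbid: str
    release_group_mbid: str
    title: str
    country: Optional[str] = None
    date: Optional[str] = None  # may be partial, e.g. "1991"


@dataclass
class ReleaseGroup:
    """The abstract album that all of its releases belong to."""
    mbid: str
    title: str
    releases: List[Release] = field(default_factory=list)


# Placeholder IDs for illustration only; real MBIDs are UUIDs.
nevermind = ReleaseGroup(mbid="rg-nevermind", title="Nevermind")
nevermind.releases += [
    Release("rel-us-cd", nevermind.mbid, "Nevermind", country="US", date="1991"),
    Release("rel-deluxe", nevermind.mbid, "Nevermind", date="2011"),
]

song = Work(mbid="work-1", title="Come as You Are")
studio = Recording("rec-studio", song.title, work_mbid=song.mbid)
live = Recording("rec-live", song.title, work_mbid=song.mbid)
```

Both releases point at one release group, and both recordings point at one work, which is the whole point of the separation.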

---

## Relationship System

MusicBrainz uses **Advanced Relationships (ARs)** to connect entities with typed, attributed links.

### Relationship Types

**Artist ↔ Artist:**
- `member of band` (with dates)
- `collaboration`
- `teacher of`

**Artist ↔ Recording:**
- `performer` (with instrument)
- `producer`
- `engineer`
- `mix`

**Artist ↔ Work:**
- `composer`
- `lyricist`
- `writer`

**Recording ↔ Work:**
- `performance of`

**Artist ↔ URL:**
- `official homepage`
- `social network` (Spotify, YouTube, etc.)
- `streaming`

### Relationship Attributes

Relationships carry attributes providing detail:

```
Artist: John Lennon
    └─► Recording: "Come Together"
        Relationship: performer
        Attributes:
          - instrument: vocals
          - instrument: rhythm guitar
```
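
A rough Python sketch of how such relationships might be grouped after a lookup. It assumes the JSON response exposes a `relations` list whose items carry `type`, `attributes`, `target-type`, and the target entity under a key named after the target type; the sample payload is hand-written to mirror the example above and is not real API output.

```python
from collections import defaultdict


def group_relations(entity: dict) -> dict:
    """Group an entity's relationships by relationship type."""
    grouped = defaultdict(list)
    for rel in entity.get("relations", []):
        target_type = rel.get("target-type")
        # Assumption: the target entity sits under a key named after its type.
        target = rel.get(target_type) or {}
        grouped[rel.get("type", "unknown")].append({
            "target_type": target_type,
            "target": target.get("title") or target.get("name") or target.get("resource"),
            "attributes": list(rel.get("attributes", [])),
        })
    return dict(grouped)


# Hand-written sample shaped like the example above (not real API output).
sample_artist = {
    "name": "John Lennon",
    "relations": [
        {
            "type": "performer",
            "target-type": "recording",
            "recording": {"title": "Come Together"},
            "attributes": ["vocals", "rhythm guitar"],
        }
    ],
}

performer_links = group_relations(sample_artist)["performer"]
```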

---

## API Access Patterns

### Three Methods

| Method | Purpose | Use Case |
|--------|---------|----------|
| **Lookup** | Fetch single entity by MBID | Known entity, need full details |
| **Browse** | Paginate related entities | All albums by artist, all tracks on album |
| **Search** | Find entities by criteria | Find artist by name, recording by ISRC |

### Lookup

Direct fetch by MusicBrainz ID (MBID). Returns single entity with optional related data via `inc` parameter.

Related data options: `releases`, `recordings`, `url-rels`, `artist-rels`, `genres`, `labels`, `media`, `isrcs`

**Limitation:** Related entities capped at 25 per request. Use Browse for complete lists.
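
As a minimal sketch, a lookup URL can be assembled like this in Python. The `/ws/2/{entity}/{mbid}` path, `inc` values joined with `+`, and `fmt=json` follow the public web service conventions; the helper name and the all-zero MBID are placeholders.

```python
from urllib.parse import urlencode

BASE_URL = "https://musicbrainz.org/ws/2"


def lookup_url(entity: str, mbid: str, inc: tuple = ()) -> str:
    """Build a lookup URL for one entity, optionally requesting related data."""
    params = {"fmt": "json"}
    if inc:
        # Multiple inc values are joined with "+", which must stay unescaped.
        params["inc"] = "+".join(inc)
    return f"{BASE_URL}/{entity}/{mbid}?{urlencode(params, safe='+')}"


# Placeholder MBID, used only to show the resulting URL shape.
PLACEHOLDER_MBID = "00000000-0000-0000-0000-000000000000"
example_url = lookup_url("artist", PLACEHOLDER_MBID, inc=("url-rels", "genres"))
```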

### Browse

Paginated fetch of entities related to another entity. Supports up to 100 items per request. Must iterate with offset for complete data.
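
The offset loop might look like the following Python sketch. `fetch_page` is a hypothetical callable standing in for the real HTTP request, and the `release-groups` / `release-group-count` keys are assumed to match the JSON browse response; the fake fetcher exists only so the loop can be exercised offline.

```python
from typing import Callable, Iterator

PAGE_SIZE = 100  # Browse allows at most 100 items per request


def browse_all(fetch_page: Callable[[int, int], dict],
               list_key: str, count_key: str) -> Iterator[dict]:
    """Yield every item of a browse result by advancing the offset."""
    offset = 0
    while True:
        page = fetch_page(PAGE_SIZE, offset)
        items = page.get(list_key, [])
        yield from items
        offset += len(items)
        # Stop on an empty page or once the reported total has been reached.
        if not items or offset >= page.get(count_key, 0):
            break


def fake_fetch(limit: int, offset: int) -> dict:
    """Stand-in for the HTTP call: pretends the artist has 230 release groups."""
    total = 230
    ids = range(offset, min(offset + limit, total))
    return {"release-group-count": total,
            "release-groups": [{"id": i} for i in ids]}


all_groups = list(browse_all(fake_fetch, "release-groups", "release-group-count"))
```

In a real client each `fetch_page` call would also have to pass through the rate limiter, so 230 release groups cost three requests and roughly three seconds.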

### Search

Lucene-syntax queries across entity fields. Useful for:
- Finding entities by name (fuzzy matching)
- Looking up by external identifier (ISRC, barcode)
- Filtering by attributes (country, type, date)
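
A small Python sketch of building such a query. It quotes every value as an exact phrase, which is a simplification (phrase queries give up fuzzy matching), and the field names `artist`, `country`, and `isrc` are assumed to be valid search fields for the entity being queried.

```python
from urllib.parse import urlencode


def quote_phrase(value: str) -> str:
    """Wrap a value in double quotes, escaping backslashes and quotes."""
    return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'


def build_query(fields: dict) -> str:
    """Combine field:value pairs into one Lucene query joined with AND."""
    return " AND ".join(f"{name}:{quote_phrase(value)}"
                        for name, value in fields.items())


def search_url(entity: str, query: str, limit: int = 25) -> str:
    """Build the search URL; urlencode handles percent-escaping of the query."""
    params = {"query": query, "fmt": "json", "limit": limit}
    return f"https://musicbrainz.org/ws/2/{entity}?{urlencode(params)}"


artist_query = build_query({"artist": "Queen", "country": "GB"})
artist_search = search_url("artist", artist_query)
```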

---

## Rate Limiting

| Rule | Limit |
|------|-------|
| Requests per second | **1** (hard limit) |
| Burst allowance | None |
| Violation penalty | HTTP 503 until rate drops |
| User-Agent | **Required** (blocked without) |

User-Agent format: `AppName/Version ( contact-url-or-email )`
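
One way to respect both rules is a client-side limiter that spaces requests at least one second apart, plus a helper that formats the header. This is a single-threaded Python sketch; the class and function names are illustrative, and the clock and sleep functions are injectable only so the behaviour can be checked without waiting.

```python
import time


class RateLimiter:
    """Ensure consecutive calls to wait() return at least min_interval apart."""

    def __init__(self, min_interval: float = 1.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time at which the previous request was released

    def wait(self) -> None:
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def user_agent(app: str, version: str, contact: str) -> str:
    """Format the User-Agent header as AppName/Version ( contact )."""
    return f"{app}/{version} ( {contact} )"
```

Calling `limiter.wait()` immediately before every HTTP request keeps a single process under the limit; several workers sharing one IP address would need a shared limiter instead.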

---

## Entity Mapping to Internal Schema

### Artist

| MusicBrainz | Internal | Notes |
|-------------|----------|-------|
| `id` | `source_id` | MBID stored as external reference |
| `name` | `name` | |
| `sort-name` | `sort_name` | |
| `type` | `artist_type` | Person, Group, Orchestra, etc. |
| `country` | `country` | ISO code |
| `life-span.begin` | `formed_date` | |
| `life-span.end` | `disbanded_date` | |
| `disambiguation` | `description` | Short disambiguator |
| URL relationship (image) | `image_url` | From Wikimedia Commons link |
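
The artist mapping could be expressed as a small Python function like the one below. It is a sketch: the internal field names come from the table, the assumption that the image link arrives as a URL relationship of type `image` is unverified, and the sample payload is hand-written with a placeholder MBID.

```python
from typing import Optional


def image_from_relations(mb: dict) -> Optional[str]:
    """Return the first image URL relationship, if the lookup included one."""
    for rel in mb.get("relations", []):
        if rel.get("type") == "image" and "url" in rel:
            return rel["url"].get("resource")
    return None


def map_artist(mb: dict) -> dict:
    """Translate a MusicBrainz artist payload into the internal artist fields."""
    life_span = mb.get("life-span") or {}
    return {
        "source_id": mb["id"],
        "name": mb["name"],
        "sort_name": mb.get("sort-name"),
        "artist_type": mb.get("type"),
        "country": mb.get("country"),
        "formed_date": life_span.get("begin"),    # may be partial, e.g. "1970"
        "disbanded_date": life_span.get("end"),
        "description": mb.get("disambiguation") or None,
        "image_url": image_from_relations(mb),
    }


# Hand-written sample with a placeholder MBID (not real API output).
sample = {
    "id": "00000000-0000-0000-0000-000000000000",
    "name": "Queen",
    "sort-name": "Queen",
    "type": "Group",
    "country": "GB",
    "life-span": {"begin": "1970", "end": None},
    "disambiguation": "",
}
mapped = map_artist(sample)
```

Empty disambiguation strings are stored as `None` here so that the `description` column does not fill up with blanks; whether that is desirable depends on the schema.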

### Album (from Release Group)

| MusicBrainz | Internal | Notes |
|-------------|----------|-------|
| `id` | `source_id` | Release Group MBID |
| `title` | `title` | |
| `primary-type` | `album_type` | Album, EP, Single |
| `first-release-date` | `release_date` | Earliest release |
| Label from release | `label_id` | From canonical release |

### Track (from Recording)

| MusicBrainz | Internal | Notes |
|-------------|----------|-------|
| `id` | `source_id` | Recording MBID |
| `title` | `title` | |
| `length` | `duration_ms` | In milliseconds |
| `isrcs[0]` | `isrc` | First ISRC if multiple |
| Work relationship | `work_id` | Link to composition |

### Work

| MusicBrainz | Internal | Notes |
|-------------|----------|-------|
| `id` | `source_id` | Work MBID |
| `title` | `title` | |
| `type` | `work_type` | Song, Symphony, Opera, etc. |
| `language` | `language` | ISO code |

### Label

| MusicBrainz | Internal | Notes |
|-------------|----------|-------|
| `id` | `source_id` | Label MBID |
| `name` | `name` | |
| `country` | `country` | ISO code |
| `life-span.begin` | `founded_date` | |

---

## Ingestion Flow

### Artist Discovery

```
INPUT: Artist name
                  │
                  ▼
┌─────────────────────────────────────┐
│  SEARCH by name                     │
│  → Ranked matches with scores       │
│  → Select highest + verify          │
└─────────────────┬───────────────────┘
                  │ MBID
                  ▼
┌─────────────────────────────────────┐
│  LOOKUP with relationships          │
│  → URLs, genres, band members       │
└─────────────────┬───────────────────┘
                  │
                  ▼
STORE: artist + external_id + genres
```
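
The "select highest + verify" step is not specified further here, so the following Python sketch makes two assumptions: a candidate must reach a minimum search score (the threshold of 90 is arbitrary), and verification means a case-insensitive match on the candidate's name or one of its aliases.

```python
from typing import Optional


def pick_artist(name: str, candidates: list, min_score: int = 90) -> Optional[dict]:
    """Return the best-scoring candidate whose name or alias matches, else None."""
    wanted = name.strip().casefold()
    ranked = sorted(candidates, key=lambda c: int(c.get("score", 0)), reverse=True)
    for cand in ranked:
        if int(cand.get("score", 0)) < min_score:
            break  # everything after this scores lower still
        names = {cand.get("name", "").casefold()}
        names |= {a.get("name", "").casefold() for a in cand.get("aliases", [])}
        if wanted in names:
            return cand
    return None


# Hand-written candidates with placeholder IDs (not real search output).
candidates = [
    {"id": "mbid-a", "name": "Queens of the Stone Age", "score": 72},
    {"id": "mbid-b", "name": "Queen", "score": 100},
]
best = pick_artist("queen", candidates)
```

Returning `None` instead of the top hit lets the caller queue ambiguous names for manual review rather than storing a wrong MBID.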

### Discography Sync

```
INPUT: Artist MBID
                  │
                  ▼
┌─────────────────────────────────────┐
│  BROWSE all release-groups          │
│  → Filter: album, ep, single        │
│  → Paginate until exhausted         │
└─────────────────┬───────────────────┘
                  │ for each
                  ▼
┌─────────────────────────────────────┐
│  LOOKUP release-group               │
│  → Get releases list                │
│  → Select canonical release         │
└─────────────────┬───────────────────┘
                  │ release MBID
                  ▼
┌─────────────────────────────────────┐
│  LOOKUP release with tracks         │
│  → Media structure (discs)          │
│  → Track positions                  │
│  → ISRCs, label info                │
└─────────────────┬───────────────────┘
                  │
                  ▼
STORE: album + tracks + positions
```
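
The flow above can be sketched as one orchestration function in Python. Every collaborator (`browse_release_groups`, `lookup_release_group`, `lookup_release`, `pick_canonical`, `store`) is a hypothetical callable supplied by the caller, and the payload keys (`primary-type`, `releases`, `media`, `tracks`, `recording`) are assumed to match the JSON responses.

```python
WANTED_TYPES = {"Album", "EP", "Single"}


def sync_discography(artist_mbid, browse_release_groups, lookup_release_group,
                     lookup_release, pick_canonical, store) -> int:
    """Run browse -> lookup release group -> lookup release -> store.

    Returns the number of albums stored.
    """
    stored = 0
    for group in browse_release_groups(artist_mbid):
        if group.get("primary-type") not in WANTED_TYPES:
            continue
        releases = lookup_release_group(group["id"]).get("releases", [])
        if not releases:
            continue  # nothing concrete to take a track list from
        release = lookup_release(pick_canonical(releases)["id"])
        tracks = [
            {
                "disc": medium.get("position"),
                "position": track.get("position"),
                "recording_id": track["recording"]["id"],
            }
            for medium in release.get("media", [])
            for track in medium.get("tracks", [])
        ]
        store(group, release, tracks)
        stored += 1
    return stored
```

Passing the collaborators in keeps the flow testable without network access; note that each album costs two lookups, so at one request per second a large discography takes minutes.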

### Canonical Release Selection

When a release-group has multiple releases, select one as canonical:

| Priority | Criteria |
|----------|----------|
| 1 | Status: Official > Promotional > Bootleg |
| 2 | Format: Digital > CD > Vinyl |
| 3 | Completeness: Has barcode, has label |
| 4 | Date: Original release preferred |
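
These priorities translate naturally into a sort key. In the Python sketch below the status and format strings (`Official`, `Promotion`, `Bootleg`, `Digital Media`, `CD`, `Vinyl`) and the `media`, `barcode`, `label-info`, and `date` keys are assumptions about the release payload; unknown values simply rank last.

```python
STATUS_RANK = {"Official": 0, "Promotion": 1, "Bootleg": 2}
FORMAT_RANK = {"Digital Media": 0, "CD": 1, "Vinyl": 2}


def release_sort_key(release: dict) -> tuple:
    """Lower tuples are better: status, format, completeness, then date."""
    status = STATUS_RANK.get(release.get("status"), len(STATUS_RANK))
    formats = [medium.get("format") for medium in release.get("media", [])]
    best_format = min(
        (FORMAT_RANK.get(f, len(FORMAT_RANK)) for f in formats),
        default=len(FORMAT_RANK),
    )
    # Count the missing pieces: 0 = has barcode and label, 2 = has neither.
    missing = int(not release.get("barcode")) + int(not release.get("label-info"))
    # ISO-style dates sort correctly as strings; undated releases go last.
    date = release.get("date") or "9999"
    return (status, best_format, missing, date)


def pick_canonical(releases: list) -> dict:
    """Select the canonical release from a non-empty list."""
    return min(releases, key=release_sort_key)
```

Because the key is a tuple, a later criterion only matters when every earlier one ties, which matches the priority order in the table.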

---

## Cover Art

Album artwork served by **Cover Art Archive** (coverartarchive.org), not MusicBrainz directly.

| Size | URL Pattern |
|------|-------------|
| Original | `/release/{release_mbid}/front` |
| Thumbnail | `/release/{release_mbid}/front-250` |
| Medium | `/release/{release_mbid}/front-500` |
| Large | `/release/{release_mbid}/front-1200` |

Not all releases have cover art. Check availability via release metadata.
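
A Python sketch of the availability check and URL construction. It assumes the release payload carries a `cover-art-archive` object with a boolean `front` flag; the size labels mirror the table above and the helper name is illustrative.

```python
from typing import Optional

CAA_BASE = "https://coverartarchive.org"
SIZES = {
    "original": "front",
    "thumbnail": "front-250",
    "medium": "front-500",
    "large": "front-1200",
}


def front_cover_url(release: dict, size: str = "medium") -> Optional[str]:
    """Return the front cover URL, or None when the release reports no front art."""
    info = release.get("cover-art-archive") or {}
    if not info.get("front"):
        return None
    return f"{CAA_BASE}/release/{release['id']}/{SIZES[size]}"


# Placeholder release ID, used only to show the resulting URL shape.
thumb = front_cover_url({"id": "abc", "cover-art-archive": {"front": True}}, "thumbnail")
```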

---

## Bulk Data Access

For large-scale ingestion, database dumps avoid rate limits.

| Source | Format | Frequency | Use Case |
|--------|--------|-----------|----------|
| JSON dumps | JSON Lines (one `.tar.xz` archive per entity type) | 2x/week | Initial seeding |
| PostgreSQL dumps | SQL | 2x/week | Full mirror |
| Replication packets | Incremental | Hourly | Staying in sync |

### Recommended Strategy

| Phase | Method |
|-------|--------|
| Initial load | JSON dumps |
| On-demand | Live API with caching |
| Periodic refresh | JSON dumps monthly |
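
For the dump-based phases, the core of the loader is a streaming reader that handles one JSON document per line. The Python sketch below accepts any iterable of text lines, so decompression and archive handling stay outside it; the in-memory sample stands in for a real dump file.

```python
import io
import json
from typing import Iterable, Iterator


def iter_entities(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed entity per non-empty line without loading the whole file."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


# In-memory stand-in for a decompressed dump file opened in text mode.
sample_dump = io.StringIO(
    '{"id": "a", "name": "First"}\n'
    '\n'
    '{"id": "b", "name": "Second"}\n'
)
names = [entity["name"] for entity in iter_entities(sample_dump)]
```

Streaming matters here because a full entity dump is far too large to parse as a single JSON document in memory.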

---

## Caching

| Entity | TTL | Rationale |
|--------|-----|-----------|
| Artist | 30 days | Rarely changes |
| Album | 30 days | Rarely changes |
| Track | 30 days | Rarely changes |
| Search results | 24 hours | New entries may appear |
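
With a database-first cache, these TTLs reduce to a staleness check on the stored fetch timestamp. A minimal Python sketch, in which the entity keys and the function name are illustrative:

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

TTL = {
    "artist": timedelta(days=30),
    "album": timedelta(days=30),
    "track": timedelta(days=30),
    "search": timedelta(hours=24),
}


def is_stale(kind: str, fetched_at: datetime, now: Optional[datetime] = None) -> bool:
    """True when a cached row is older than the TTL for its entity kind."""
    if now is None:
        now = datetime.now(timezone.utc)
    return now - fetched_at >= TTL[kind]


fetched = datetime(2024, 1, 1, tzinfo=timezone.utc)
search_expired = is_stale("search", fetched, now=fetched + timedelta(hours=25))
artist_expired = is_stale("artist", fetched, now=fetched + timedelta(days=29))
```

A stale row can still be served while a refresh is queued, which keeps reads fast and keeps API traffic inside the rate limit.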

---

## External ID Storage

Store in `*_external_ids` tables:

| Field | Value |
|-------|-------|
| `source` | `"musicbrainz"` |
| `source_id` | MBID (UUID) |
| `url` | `https://musicbrainz.org/{entity}/{mbid}` |

Enables:
- Cross-source deduplication
- Lookup by MBID from other services
- Link back for verification
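
A sketch of building one such row in Python. Parsing the MBID with the standard `uuid` module rejects malformed values and normalises case before storage; the function name is illustrative and the sample UUID is arbitrary.

```python
import uuid


def external_id_record(entity: str, mbid: str) -> dict:
    """Build an *_external_ids row; raises ValueError for a malformed MBID."""
    canonical = str(uuid.UUID(mbid))  # lowercase, hyphenated form
    return {
        "source": "musicbrainz",
        "source_id": canonical,
        "url": f"https://musicbrainz.org/{entity}/{canonical}",
    }


# Arbitrary sample UUID, not a real MBID.
record = external_id_record("artist", "ABCDEF12-3456-7890-ABCD-EF1234567890")
```

Normalising before insert means a unique index on (`source`, `source_id`) catches duplicates regardless of how the MBID was originally written.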

---

## Go Client

Recommended: `go.uploadedlobster.com/musicbrainzws2`