metadata-agregator/docs/INGESTION_MUSICBRAINZ.md
Alexander a1f6701bac feat: initial implementation of metadata aggregator
- gRPC service with MusicBrainz provider
- PostgreSQL schema with migrations
- Service layer with database-first caching
- Repository pattern for data access
- YAML configuration support
- Research documentation for 17 music metadata projects
2026-04-28 16:28:53 +02:00

# MusicBrainz Ingestion
Architecture documentation for ingesting music metadata from MusicBrainz.
---
## Overview
**MusicBrainz** is an open music encyclopedia maintained by the MetaBrainz Foundation. It serves as the canonical source for music metadata with community-curated data covering artists, releases, recordings, and works.
| Attribute | Value |
|-----------|-------|
| Data Quality | High (community-curated) |
| Coverage | ~2M artists, ~3M releases, ~30M recordings |
| Update Frequency | Real-time edits, twice-weekly dumps |
| API Style | REST with Lucene search |
| Cost | Free (rate-limited) |
---
## Data Model
MusicBrainz uses a hierarchical model that separates abstract concepts from concrete manifestations.
### Entity Hierarchy
```
                         ┌──────────┐
                         │   WORK   │  ← Composition (the song as written)
                         │  (ISWC)  │    "Bohemian Rhapsody" by Freddie Mercury
                         └────┬─────┘
                              │ performed as
                         ┌──────────┐
                         │RECORDING │  ← Unique audio (specific performance)
                         │  (ISRC)  │    Studio version, live version, demo
                         └────┬─────┘
                              │ appears on
┌──────────┐             ┌──────────┐
│  ARTIST  │◄───────────►│ RELEASE  │  ← Physical/digital product
│  (MBID)  │  credited   │  (UPC)   │    US CD, UK Vinyl, Spotify release
└──────────┘     on      └────┬─────┘
                              │ variant of
                         ┌──────────┐
                         │ RELEASE  │  ← Abstract album concept
                         │  GROUP   │    "A Night at the Opera" (all editions)
                         └──────────┘
```
### Core Entities
| Entity | Description | Identifier | Example |
|--------|-------------|------------|---------|
| **Artist** | Musician, band, orchestra, composer | MBID | Queen, Freddie Mercury |
| **Work** | Abstract composition | MBID (external: ISWC) | "Bohemian Rhapsody" (the song) |
| **Recording** | Specific audio performance | MBID (external: ISRC) | Studio recording of Bohemian Rhapsody |
| **Release** | Concrete product (CD, vinyl, digital) | MBID (external: barcode/UPC) | 1975 UK vinyl pressing |
| **Release Group** | Abstract album (all editions) | MBID | "A Night at the Opera" |
| **Label** | Record label or imprint | MBID | EMI, Hollywood Records |
### Key Distinction: Release vs Release Group
**Release Group** = The abstract album concept
- "Nevermind" by Nirvana
**Release** = A specific physical or digital product
- 1991 US CD (DGC)
- 1991 UK CD (Geffen)
- 2011 Deluxe Edition (4 CDs)
- 2021 30th Anniversary Super Deluxe
This separation allows tracking all variants while maintaining a single "album" identity.
### Key Distinction: Recording vs Work
**Work** = The composition (what was written)
- Composer: Kurt Cobain
- ISWC identifier
- No audio - just the abstract song
**Recording** = A specific audio capture
- Performer: Nirvana
- ISRC identifier
- Has duration, audio characteristics
- Multiple recordings of same work (studio, live, acoustic)
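
The two distinctions above can be expressed as plain data types. The following is a minimal sketch, assuming hypothetical struct and field names (`Work`, `Recording`, `Release`, `ReleaseGroup`, `WorkMBID`, `ReleaseGroupMBID`); it is not the project's actual schema.

```go
package main

// Hypothetical types for illustration only; names are assumptions.

// Work is the abstract composition. It has no audio and no duration.
type Work struct {
	MBID  string
	Title string
	ISWC  string // external identifier, may be empty
}

// Recording is one specific audio capture of a work.
type Recording struct {
	MBID       string
	Title      string
	DurationMs int
	ISRCs      []string
	WorkMBID   string // the composition this recording performs
}

// ReleaseGroup is the abstract album shared by all editions.
type ReleaseGroup struct {
	MBID  string
	Title string
}

// Release is one concrete edition belonging to a release group.
type Release struct {
	MBID             string
	Title            string
	Barcode          string
	ReleaseGroupMBID string
}

// RecordingsOfWork returns every recording linked to the given work,
// for example the studio, live, and acoustic versions of one song.
func RecordingsOfWork(recs []Recording, workMBID string) []Recording {
	var out []Recording
	for _, r := range recs {
		if r.WorkMBID == workMBID {
			out = append(out, r)
		}
	}
	return out
}

// EditionsOf returns every release that belongs to the given release group.
func EditionsOf(releases []Release, groupMBID string) []Release {
	var out []Release
	for _, r := range releases {
		if r.ReleaseGroupMBID == groupMBID {
			out = append(out, r)
		}
	}
	return out
}
```
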
---
## Relationship System
MusicBrainz uses **Advanced Relationships (ARs)** to connect entities with typed, attributed links.
### Relationship Types
**Artist ↔ Artist:**
- `member of band` (with dates)
- `collaboration`
- `teacher of`
**Artist ↔ Recording:**
- `performer` (with instrument)
- `producer`
- `engineer`
- `mix`
**Artist ↔ Work:**
- `composer`
- `lyricist`
- `writer`
**Recording ↔ Work:**
- `performance of`
**Artist ↔ URL:**
- `official homepage`
- `social network` (Facebook, Instagram, etc.)
- `streaming` (Spotify, etc.)
### Relationship Attributes
Relationships carry attributes providing detail:
```
Artist: John Lennon
  └─► Recording: "Come Together"
        Relationship: performer
        Attributes:
          - instrument: vocals
          - instrument: rhythm guitar
```
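
A typed, attributed link like the one above could be held in memory roughly as follows. This is a sketch only: the `Relationship` struct and its helpers are hypothetical names chosen for illustration.

```go
package main

// Relationship is a hypothetical in-memory form of one typed link
// between two entities.
type Relationship struct {
	Type       string   // e.g. "performer", "producer"
	Source     string   // source entity (name or MBID)
	Target     string   // target entity (name or MBID)
	Attributes []string // e.g. "vocals", "rhythm guitar"
}

// HasAttribute reports whether the relationship carries the given attribute.
func (r Relationship) HasAttribute(attr string) bool {
	for _, a := range r.Attributes {
		if a == attr {
			return true
		}
	}
	return false
}

// FilterByType keeps only the relationships of one type.
func FilterByType(rels []Relationship, relType string) []Relationship {
	var out []Relationship
	for _, r := range rels {
		if r.Type == relType {
			out = append(out, r)
		}
	}
	return out
}
```
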
---
## API Access Patterns
### Three Methods
| Method | Purpose | Use Case |
|--------|---------|----------|
| **Lookup** | Fetch single entity by MBID | Known entity, need full details |
| **Browse** | Paginate related entities | All albums by artist, all tracks on album |
| **Search** | Find entities by criteria | Find artist by name, recording by ISRC |
### Lookup
Direct fetch by MusicBrainz ID (MBID). Returns a single entity, with optional related data selected via the `inc` parameter.
Related data options include `releases`, `recordings`, `url-rels`, `artist-rels`, `genres`, `labels`, `media`, and `isrcs`; which values are valid depends on the entity type.
**Limitation:** Linked entities are capped at 25 per lookup. Use Browse for complete lists.
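
As a rough illustration of a lookup request, the sketch below builds the URL for an entity and a set of `inc` values against the public web service root (`https://musicbrainz.org/ws/2`) with JSON output. The function name `LookupURL` is an assumption made for this example.

```go
package main

import (
	"net/url"
	"strings"
)

// apiRoot is the public MusicBrainz web service root.
const apiRoot = "https://musicbrainz.org/ws/2"

// LookupURL builds a lookup URL for one entity (e.g. "artist", "release")
// identified by its MBID. The inc values are joined with spaces, which
// url.Values encodes as "+", the separator the API expects.
func LookupURL(entity, mbid string, inc []string) string {
	q := url.Values{}
	q.Set("fmt", "json")
	if len(inc) > 0 {
		q.Set("inc", strings.Join(inc, " "))
	}
	return apiRoot + "/" + entity + "/" + mbid + "?" + q.Encode()
}
```

For example, `LookupURL("artist", mbid, []string{"url-rels", "genres"})` yields a URL ending in `?fmt=json&inc=url-rels+genres`.
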
### Browse
Paginated fetch of entities linked to another entity. The `limit` parameter accepts up to 100 items per request (default 25); iterate with `offset` until the reported total count is reached.
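
The offset iteration can be sketched independently of any HTTP client. In the example below, `Page`, `FetchFunc`, and `BrowseAll` are hypothetical names; `FetchFunc` stands in for whatever call actually performs one browse request and reports the total count from the response.

```go
package main

// Page is one browse response: the items returned plus the total
// number of items the server reports for the whole collection.
type Page struct {
	Items []string
	Total int
}

// FetchFunc is a hypothetical stand-in for one browse request.
type FetchFunc func(offset, limit int) (Page, error)

// BrowseAll keeps requesting pages until the reported total has been
// collected or the server returns an empty page.
func BrowseAll(fetch FetchFunc, limit int) ([]string, error) {
	if limit < 1 || limit > 100 {
		limit = 100 // the API accepts at most 100 items per request
	}
	var all []string
	for {
		// The next offset is simply the number of items seen so far.
		page, err := fetch(len(all), limit)
		if err != nil {
			return nil, err
		}
		all = append(all, page.Items...)
		if len(page.Items) == 0 || len(all) >= page.Total {
			return all, nil
		}
	}
}
```

Each call to `fetch` is still one API request, so the function passed in should apply the rate limit itself.
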
### Search
Lucene-syntax queries across entity fields. Useful for:
- Finding entities by name (fuzzy matching)
- Looking up by external identifier (ISRC, barcode)
- Filtering by attributes (country, type, date)
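
A search query is a Lucene expression passed in the `query` parameter. The sketch below combines field/value pairs into a quoted `AND` query; `Term`, `quote`, and `SearchURL` are names invented for this example, and the caller is assumed to pass field names that are valid for the entity being searched.

```go
package main

import (
	"net/url"
	"strings"
)

// Term is one field/value pair of a search query.
type Term struct {
	Field string
	Value string
}

// quote wraps a value in double quotes so it is treated as a Lucene
// phrase, escaping backslashes and embedded quotes.
func quote(v string) string {
	r := strings.NewReplacer(`\`, `\\`, `"`, `\"`)
	return `"` + r.Replace(v) + `"`
}

// SearchURL builds a search URL for an entity type, joining all terms
// with AND.
func SearchURL(entity string, terms []Term) string {
	parts := make([]string, 0, len(terms))
	for _, t := range terms {
		parts = append(parts, t.Field+":"+quote(t.Value))
	}
	q := url.Values{}
	q.Set("fmt", "json")
	q.Set("query", strings.Join(parts, " AND "))
	return "https://musicbrainz.org/ws/2/" + entity + "?" + q.Encode()
}
```
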
---
## Rate Limiting
| Rule | Limit |
|------|-------|
| Requests per second | **1** (hard limit) |
| Burst allowance | None |
| Violation penalty | HTTP 503 until rate drops |
| User-Agent | **Required** (blocked without) |
User-Agent format: `AppName/Version ( contact-url-or-email )`
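
One way to respect these rules on the client side is to space requests out before sending them. The sketch below is an assumption-laden illustration, not the project's implementation: `UserAgent`, `Throttle`, and `NewThrottle` are hypothetical names, and the throttle only enforces a minimum interval within a single process.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// UserAgent formats the header value in the documented shape:
// AppName/Version ( contact-url-or-email ).
func UserAgent(app, version, contact string) string {
	return fmt.Sprintf("%s/%s ( %s )", app, version, contact)
}

// Throttle spaces calls at least one interval apart, with no bursts.
type Throttle struct {
	mu       sync.Mutex
	interval time.Duration
	next     time.Time // earliest moment the next request may start
}

// NewThrottle allows the given number of requests per second.
// For MusicBrainz this would be NewThrottle(1).
func NewThrottle(requestsPerSecond int) *Throttle {
	if requestsPerSecond < 1 {
		requestsPerSecond = 1
	}
	return &Throttle{interval: time.Second / time.Duration(requestsPerSecond)}
}

// Wait reserves the next free slot, sleeps until it begins, and
// returns how long the caller was delayed.
func (t *Throttle) Wait() time.Duration {
	t.mu.Lock()
	now := time.Now()
	start := t.next
	if start.Before(now) {
		start = now
	}
	t.next = start.Add(t.interval)
	t.mu.Unlock()

	delay := start.Sub(now)
	time.Sleep(delay)
	return delay
}
```

Calling `Wait()` immediately before every request keeps a single process under the limit; several processes sharing one IP address would need a shared limiter instead.
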
---
## Entity Mapping to Internal Schema
### Artist
| MusicBrainz | Internal | Notes |
|-------------|----------|-------|
| `id` | `source_id` | MBID stored as external reference |
| `name` | `name` | |
| `sort-name` | `sort_name` | |
| `type` | `artist_type` | Person, Group, Orchestra, etc. |
| `country` | `country` | ISO code |
| `life-span.begin` | `formed_date` | |
| `life-span.end` | `disbanded_date` | |
| `disambiguation` | `description` | Short disambiguator |
| URL relationship (image) | `image_url` | From Wikimedia Commons link |
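
The field mapping in this table could be applied when decoding a JSON lookup response, roughly as sketched below. The `Artist` struct and `MapArtist` function are hypothetical; the JSON keys are the ones listed in the table. The image URL is left out because it comes from a URL relationship rather than a top-level field.

```go
package main

import "encoding/json"

// mbArtist mirrors the subset of the artist JSON used by the mapping.
type mbArtist struct {
	ID             string `json:"id"`
	Name           string `json:"name"`
	SortName       string `json:"sort-name"`
	Type           string `json:"type"`
	Country        string `json:"country"`
	Disambiguation string `json:"disambiguation"`
	LifeSpan       struct {
		Begin string `json:"begin"`
		End   string `json:"end"`
	} `json:"life-span"`
}

// Artist is a hypothetical internal record following the table above.
type Artist struct {
	SourceID      string
	Name          string
	SortName      string
	ArtistType    string
	Country       string
	FormedDate    string
	DisbandedDate string
	Description   string
}

// MapArtist decodes one artist JSON document into the internal shape.
// JSON null values decode to empty strings.
func MapArtist(raw []byte) (Artist, error) {
	var in mbArtist
	if err := json.Unmarshal(raw, &in); err != nil {
		return Artist{}, err
	}
	return Artist{
		SourceID:      in.ID,
		Name:          in.Name,
		SortName:      in.SortName,
		ArtistType:    in.Type,
		Country:       in.Country,
		FormedDate:    in.LifeSpan.Begin,
		DisbandedDate: in.LifeSpan.End,
		Description:   in.Disambiguation,
	}, nil
}
```
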
### Album (from Release Group)
| MusicBrainz | Internal | Notes |
|-------------|----------|-------|
| `id` | `source_id` | Release Group MBID |
| `title` | `title` | |
| `primary-type` | `album_type` | Album, EP, Single |
| `first-release-date` | `release_date` | Earliest release |
| Label from release | `label_id` | From canonical release |
### Track (from Recording)
| MusicBrainz | Internal | Notes |
|-------------|----------|-------|
| `id` | `source_id` | Recording MBID |
| `title` | `title` | |
| `length` | `duration_ms` | In milliseconds |
| `isrcs[0]` | `isrc` | First ISRC if multiple |
| Work relationship | `work_id` | Link to composition |
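
Two details of the track mapping are easy to get wrong: `length` can be null, and `isrcs` can be empty or hold several values. A minimal sketch, assuming hypothetical `Track` and `MapTrack` names and leaving the work relationship out for brevity:

```go
package main

import "encoding/json"

// mbRecording mirrors the subset of the recording JSON used here.
type mbRecording struct {
	ID     string   `json:"id"`
	Title  string   `json:"title"`
	Length *int     `json:"length"` // milliseconds, null when unknown
	ISRCs  []string `json:"isrcs"`
}

// Track is a hypothetical internal record following the table above.
type Track struct {
	SourceID   string
	Title      string
	DurationMs int    // 0 when the length is unknown
	ISRC       string // first ISRC, empty when none is present
}

// MapTrack decodes one recording JSON document into the internal shape.
func MapTrack(raw []byte) (Track, error) {
	var in mbRecording
	if err := json.Unmarshal(raw, &in); err != nil {
		return Track{}, err
	}
	t := Track{SourceID: in.ID, Title: in.Title}
	if in.Length != nil {
		t.DurationMs = *in.Length
	}
	if len(in.ISRCs) > 0 {
		t.ISRC = in.ISRCs[0]
	}
	return t, nil
}
```
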
### Work
| MusicBrainz | Internal | Notes |
|-------------|----------|-------|
| `id` | `source_id` | Work MBID |
| `title` | `title` | |
| `type` | `work_type` | Song, Symphony, Opera, etc. |
| `language` | `language` | ISO code |
### Label
| MusicBrainz | Internal | Notes |
|-------------|----------|-------|
| `id` | `source_id` | Label MBID |
| `name` | `name` | |
| `country` | `country` | ISO code |
| `life-span.begin` | `founded_date` | |
---
## Ingestion Flow
### Artist Discovery
```
INPUT: Artist name
                  │
┌─────────────────────────────────────┐
│ SEARCH by name                      │
│   → Ranked matches with scores      │
│   → Select highest + verify         │
└─────────────────┬───────────────────┘
                  │ MBID
┌─────────────────────────────────────┐
│ LOOKUP with relationships           │
│   → URLs, genres, band members      │
└─────────────────┬───────────────────┘
                  │
STORE: artist + external_id + genres
```
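
The "select highest + verify" step might look like the sketch below. The verification rule used here (a minimum score plus a case-insensitive name match) is an assumption made for illustration; the source does not specify how verification is done. `Candidate` and `BestMatch` are hypothetical names.

```go
package main

import "strings"

// Candidate is one search result with its relevance score (0-100).
type Candidate struct {
	MBID  string
	Name  string
	Score int
}

// BestMatch returns the highest-scoring candidate at or above minScore,
// but only if its name equals the query ignoring case and surrounding
// whitespace. The verification rule is an assumption for this sketch.
func BestMatch(query string, cands []Candidate, minScore int) (Candidate, bool) {
	var best Candidate
	found := false
	for _, c := range cands {
		if c.Score < minScore {
			continue
		}
		if !found || c.Score > best.Score {
			best, found = c, true
		}
	}
	if !found || !strings.EqualFold(strings.TrimSpace(query), best.Name) {
		return Candidate{}, false
	}
	return best, true
}
```

Rejecting a mismatch rather than falling back to the second result keeps ambiguous names out of the store until they can be resolved another way.
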
### Discography Sync
```
INPUT: Artist MBID
                  │
┌─────────────────────────────────────┐
│ BROWSE all release-groups           │
│   → Filter: album, ep, single       │
│   → Paginate until exhausted        │
└─────────────────┬───────────────────┘
                  │ for each
┌─────────────────────────────────────┐
│ LOOKUP release-group                │
│   → Get releases list               │
│   → Select canonical release        │
└─────────────────┬───────────────────┘
                  │ release MBID
┌─────────────────────────────────────┐
│ LOOKUP release with tracks          │
│   → Media structure (discs)         │
│   → Track positions                 │
│   → ISRCs, label info               │
└─────────────────┬───────────────────┘
                  │
STORE: album + tracks + positions
```
### Canonical Release Selection
When a release-group has multiple releases, select one as canonical:
| Priority | Criteria |
|----------|----------|
| 1 | Status: Official > Promotional > Bootleg |
| 2 | Format: Digital > CD > Vinyl |
| 3 | Completeness: Has barcode, has label |
| 4 | Date: Original release preferred |
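
The priority table translates naturally into a multi-key comparison. The sketch below assumes a hypothetical `Release` struct whose `Status` and `Format` values have already been normalized to the labels used in the table, and ISO-style date strings so that earlier dates compare lower.

```go
package main

import "sort"

// Release holds the fields the selection rules look at. Status and
// Format are assumed to be normalized to the labels in the table.
type Release struct {
	MBID    string
	Status  string
	Format  string
	Barcode string
	Label   string
	Date    string // ISO-style date; empty when unknown
}

var statusOrder = []string{"Official", "Promotional", "Bootleg"}
var formatOrder = []string{"Digital", "CD", "Vinyl"}

// rank returns the position of v in order; unknown values rank last.
func rank(order []string, v string) int {
	for i, o := range order {
		if o == v {
			return i
		}
	}
	return len(order)
}

// completeness counts how many of barcode and label are present.
func completeness(r Release) int {
	n := 0
	if r.Barcode != "" {
		n++
	}
	if r.Label != "" {
		n++
	}
	return n
}

// better reports whether a should be preferred over b.
func better(a, b Release) bool {
	if x, y := rank(statusOrder, a.Status), rank(statusOrder, b.Status); x != y {
		return x < y
	}
	if x, y := rank(formatOrder, a.Format), rank(formatOrder, b.Format); x != y {
		return x < y
	}
	if x, y := completeness(a), completeness(b); x != y {
		return x > y
	}
	if (a.Date == "") != (b.Date == "") {
		return a.Date != "" // a known date beats an unknown one
	}
	return a.Date < b.Date
}

// Canonical returns the preferred release, or false for an empty list.
func Canonical(releases []Release) (Release, bool) {
	if len(releases) == 0 {
		return Release{}, false
	}
	sorted := append([]Release(nil), releases...) // leave the input untouched
	sort.SliceStable(sorted, func(i, j int) bool { return better(sorted[i], sorted[j]) })
	return sorted[0], true
}
```
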
---
## Cover Art
Album artwork is served by the **Cover Art Archive** (coverartarchive.org), not by MusicBrainz directly.
| Size | URL Pattern |
|------|-------------|
| Original | `/release/{release_mbid}/front` |
| Thumbnail | `/release/{release_mbid}/front-250` |
| Medium | `/release/{release_mbid}/front-500` |
| Large | `/release/{release_mbid}/front-1200` |
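Not all releases have cover art; the Cover Art Archive returns HTTP 404 when no front image exists. Check availability first via the `cover-art-archive` object in the release lookup response.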
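
The URL patterns in the table can be generated with a small helper. This is a sketch; `FrontCoverURL` is a hypothetical name, and a size of `0` is used here as a convention for the original image.

```go
package main

import "fmt"

const coverArtRoot = "https://coverartarchive.org"

// FrontCoverURL returns the front cover URL for a release.
// size 0 selects the original image; 250, 500, and 1200 select the
// thumbnail sizes from the table. Other sizes are rejected.
func FrontCoverURL(releaseMBID string, size int) (string, error) {
	switch size {
	case 0:
		return coverArtRoot + "/release/" + releaseMBID + "/front", nil
	case 250, 500, 1200:
		return fmt.Sprintf("%s/release/%s/front-%d", coverArtRoot, releaseMBID, size), nil
	default:
		return "", fmt.Errorf("unsupported cover art size %d", size)
	}
}
```
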
---
## Bulk Data Access
For large-scale ingestion, database dumps avoid rate limits.
| Source | Format | Frequency | Use Case |
|--------|--------|-----------|----------|
| JSON dumps | JSON Lines (tar.xz archives) | 2x/week | Initial seeding |
| PostgreSQL dumps | PostgreSQL table data (tar.bz2 archives) | 2x/week | Full mirror |
| Replication packets | Incremental | Hourly | Staying in sync |
### Recommended Strategy
| Phase | Method |
|-------|--------|
| Initial load | JSON dumps |
| On-demand | Live API with caching |
| Periodic refresh | JSON dumps monthly |
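
For the initial load, a JSON Lines dump can be streamed one entity per line without holding the whole file in memory. The sketch below assumes the archive has already been decompressed and that only `id` and `name` are needed; `ReadArtists` and `ReadArtistsFromString` are hypothetical names.

```go
package main

import (
	"bufio"
	"encoding/json"
	"io"
	"strings"
)

// dumpArtist holds the only two fields this sketch reads from each line.
type dumpArtist struct {
	ID   string `json:"id"`
	Name string `json:"name"`
}

// ReadArtists streams a JSON Lines file, calling handle once per
// non-empty line. It stops at the first decode or handler error.
func ReadArtists(r io.Reader, handle func(id, name string) error) error {
	sc := bufio.NewScanner(r)
	// Dump lines can be far longer than the scanner's default limit.
	sc.Buffer(make([]byte, 0, 64*1024), 16*1024*1024)
	for sc.Scan() {
		line := sc.Bytes()
		if len(line) == 0 {
			continue
		}
		var a dumpArtist
		if err := json.Unmarshal(line, &a); err != nil {
			return err
		}
		if err := handle(a.ID, a.Name); err != nil {
			return err
		}
	}
	return sc.Err()
}

// ReadArtistsFromString is a convenience wrapper for small inputs.
func ReadArtistsFromString(data string, handle func(id, name string) error) error {
	return ReadArtists(strings.NewReader(data), handle)
}
```
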
---
## Caching
| Entity | TTL | Rationale |
|--------|-----|-----------|
| Artist | 30 days | Rarely changes |
| Album | 30 days | Rarely changes |
| Track | 30 days | Rarely changes |
| Search results | 24 hours | New entries may appear |
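
The TTL table can be checked with a small freshness helper. This is a sketch under assumptions: the kind names (`artist`, `album`, `track`, `search`) and the functions `TTL` and `IsFresh` are invented for illustration.

```go
package main

import "time"

const day = 24 * time.Hour

// ttlByKind mirrors the TTL table above.
var ttlByKind = map[string]time.Duration{
	"artist": 30 * day,
	"album":  30 * day,
	"track":  30 * day,
	"search": 24 * time.Hour,
}

// TTL returns the configured lifetime for a kind of cached entry.
func TTL(kind string) (time.Duration, bool) {
	d, ok := ttlByKind[kind]
	return d, ok
}

// IsFresh reports whether an entry of the given age is still within
// its TTL. Unknown kinds are never considered fresh.
func IsFresh(kind string, age time.Duration) bool {
	d, ok := ttlByKind[kind]
	return ok && age >= 0 && age < d
}
```

A caller would typically pass `time.Since(fetchedAt)` as the age and fall through to the live API when `IsFresh` returns false.
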
---
## External ID Storage
Store in `*_external_ids` tables:
| Field | Value |
|-------|-------|
| `source` | `"musicbrainz"` |
| `source_id` | MBID (UUID) |
| `url` | `https://musicbrainz.org/{entity}/{mbid}` |
Enables:
- Cross-source deduplication
- Lookup by MBID from other services
- Link back for verification
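
A row for the `*_external_ids` tables could be built as sketched below, validating that the MBID is a well-formed UUID before storing it. `ExternalID` and `NewMusicBrainzID` are hypothetical names; the `entity` argument is the URL path segment, such as `artist` or `release-group`.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// mbidPattern matches the canonical lowercase UUID form used by MBIDs.
var mbidPattern = regexp.MustCompile(`^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$`)

// ExternalID is a hypothetical row shape for the external ID tables.
type ExternalID struct {
	Source   string
	SourceID string
	URL      string
}

// NewMusicBrainzID normalizes and validates an MBID and derives the
// link-back URL for the given entity type.
func NewMusicBrainzID(entity, mbid string) (ExternalID, error) {
	mbid = strings.ToLower(strings.TrimSpace(mbid))
	if !mbidPattern.MatchString(mbid) {
		return ExternalID{}, fmt.Errorf("invalid MBID %q", mbid)
	}
	return ExternalID{
		Source:   "musicbrainz",
		SourceID: mbid,
		URL:      "https://musicbrainz.org/" + entity + "/" + mbid,
	}, nil
}
```
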
---
## Go Client
Recommended: `go.uploadedlobster.com/musicbrainzws2`