feat: initial implementation of metadata aggregator

- gRPC service with MusicBrainz provider
- PostgreSQL schema with migrations
- Service layer with database-first caching
- Repository pattern for data access
- YAML configuration support
- Research documentation for 17 music metadata projects
@@ -0,0 +1,369 @@

# MusicBrainz Ingestion

Architecture documentation for ingesting music metadata from MusicBrainz.

---

## Overview

**MusicBrainz** is an open music encyclopedia maintained by the MetaBrainz Foundation. It serves as the canonical source for music metadata, with community-curated data covering artists, releases, recordings, and works.

| Attribute | Value |
|-----------|-------|
| Data Quality | High (community-curated) |
| Coverage | ~2M artists, ~3M releases, ~30M recordings |
| Update Frequency | Real-time edits, twice-weekly dumps |
| API Style | REST with Lucene search |
| Cost | Free (rate-limited) |

---

## Data Model

MusicBrainz uses a hierarchical model that separates abstract concepts from concrete manifestations.

### Entity Hierarchy

```
                       ┌──────────┐
                       │   WORK   │  ← Composition (the song as written)
                       │  (ISWC)  │    "Bohemian Rhapsody" by Freddie Mercury
                       └────┬─────┘
                            │ performed as
                            ▼
                       ┌──────────┐
                       │RECORDING │  ← Unique audio (specific performance)
                       │  (ISRC)  │    Studio version, live version, demo
                       └────┬─────┘
                            │ appears on
                            ▼
┌──────────┐           ┌──────────┐
│  ARTIST  │◄─────────►│ RELEASE  │  ← Physical/digital product
│  (MBID)  │ credited  │  (UPC)   │    US CD, UK Vinyl, Spotify release
└──────────┘    on     └────┬─────┘
                            │ variant of
                            ▼
                       ┌──────────┐
                       │ RELEASE  │  ← Abstract album concept
                       │  GROUP   │    "A Night at the Opera" (all editions)
                       └──────────┘
```

### Core Entities

Every entity has an MBID as its primary key. ISWC, ISRC, and barcode are additional industry identifiers and may be missing.

| Entity | Description | Identifier | Example |
|--------|-------------|------------|---------|
| **Artist** | Musician, band, orchestra, composer | MBID | Queen, Freddie Mercury |
| **Work** | Abstract composition | MBID, ISWC | "Bohemian Rhapsody" (the song) |
| **Recording** | Specific audio performance | MBID, ISRC | Studio recording of Bohemian Rhapsody |
| **Release** | Concrete product (CD, vinyl, digital) | MBID, barcode (UPC/EAN) | 1975 UK vinyl pressing |
| **Release Group** | Abstract album (all editions) | MBID | "A Night at the Opera" |
| **Label** | Record label or imprint | MBID | EMI, Hollywood Records |

### Key Distinction: Release vs Release Group

**Release Group** = The abstract album concept
- "Nevermind" by Nirvana

**Release** = A specific physical or digital product
- 1991 US CD (DGC)
- 1991 UK CD (Geffen)
- 2011 Deluxe Edition (2 CDs)
- 2021 30th Anniversary Super Deluxe

This separation allows tracking all variants while maintaining a single "album" identity.

### Key Distinction: Recording vs Work

**Work** = The composition (what was written)
- Composer: Kurt Cobain
- ISWC identifier
- No audio - just the abstract song

**Recording** = A specific audio capture
- Performer: Nirvana
- ISRC identifier
- Has duration, audio characteristics
- Multiple recordings of same work (studio, live, acoustic)
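
The two distinctions above can be sketched as plain Go types. This is a minimal illustration, not the project's actual schema: the type names, field names, and the `RecordingsOf` and `EditionsOf` helpers are all assumptions made for the example.

```go
package main

// Work is the abstract composition; it carries no audio.
type Work struct {
	MBID  string
	Title string
	ISWCs []string
}

// Recording is one specific audio capture of a work.
type Recording struct {
	MBID       string
	Title      string
	DurationMs int
	ISRCs      []string
	WorkMBID   string // the work this recording is a performance of
}

// ReleaseGroup is the abstract album; Release is one concrete edition of it.
type ReleaseGroup struct {
	MBID  string
	Title string
}

type Release struct {
	MBID             string
	Title            string
	Barcode          string
	ReleaseGroupMBID string
}

// RecordingsOf returns every recording that performs the given work.
func RecordingsOf(w Work, all []Recording) []Recording {
	var out []Recording
	for _, r := range all {
		if r.WorkMBID == w.MBID {
			out = append(out, r)
		}
	}
	return out
}

// EditionsOf returns every release that belongs to the given release group.
func EditionsOf(rg ReleaseGroup, all []Release) []Release {
	var out []Release
	for _, r := range all {
		if r.ReleaseGroupMBID == rg.MBID {
			out = append(out, r)
		}
	}
	return out
}
```

Keeping the parent MBID on the concrete side (recording, release) means one work or release group can own any number of children without the parent row changing.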

---

## Relationship System

MusicBrainz uses **Advanced Relationships (ARs)** to connect entities with typed, attributed links.

### Relationship Types

**Artist ↔ Artist:**
- `member of band` (with dates)
- `collaboration`
- `teacher of`

**Artist ↔ Recording:**
- `performer` (with instrument)
- `producer`
- `engineer`
- `mix`

**Artist ↔ Work:**
- `composer`
- `lyricist`
- `writer`

**Recording ↔ Work:**
- `performance of`

**Artist ↔ URL:**
- `official homepage`
- `social network` (Instagram, Facebook, etc.)
- `streaming` (Spotify, etc.)

### Relationship Attributes

Relationships carry attributes providing detail:

```
Artist: John Lennon
  └─► Recording: "Come Together"
        Relationship: performer
        Attributes:
          - instrument: vocals
          - instrument: rhythm guitar
```
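
One way to hold such a link in memory is a small struct with the type, both endpoints, and a list of attributes. The sketch below is illustrative only: `Relationship`, its fields, and `Instruments` are hypothetical names, and attributes are simplified to plain strings.

```go
package main

// Relationship is a hypothetical internal representation of one typed link.
type Relationship struct {
	Type       string   // e.g. "performer", "member of band"
	SourceMBID string   // entity the link starts from
	TargetMBID string   // entity the link points to
	Attributes []string // e.g. "vocals", "rhythm guitar"
	Begin      string   // optional start date, empty when unknown
	End        string   // optional end date, empty when unknown
}

// Instruments collects the attributes of all performer links
// from the given artist to the given recording.
func Instruments(rels []Relationship, artistMBID, recordingMBID string) []string {
	var out []string
	for _, r := range rels {
		if r.Type == "performer" && r.SourceMBID == artistMBID && r.TargetMBID == recordingMBID {
			out = append(out, r.Attributes...)
		}
	}
	return out
}
```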

---

## API Access Patterns

### Three Methods

| Method | Purpose | Use Case |
|--------|---------|----------|
| **Lookup** | Fetch single entity by MBID | Known entity, need full details |
| **Browse** | Paginate related entities | All albums by artist, all tracks on album |
| **Search** | Find entities by criteria | Find artist by name, recording by ISRC |

### Lookup

Direct fetch by MusicBrainz ID (MBID). Returns a single entity, with optional related data requested via the `inc` parameter.

Common `inc` values: `releases`, `recordings`, `url-rels`, `artist-rels`, `genres`, `labels`, `media`, `isrcs`

**Limitation:** Related entities are capped at 25 per request. Use Browse for complete lists.
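
As a rough sketch of how a lookup URL is put together, the helper below joins the requested `inc` values with `+` and asks for JSON. `LookupURL` is a hypothetical helper, not part of any client library, and the MBID in the usage note is a placeholder.

```go
package main

import (
	"net/url"
	"strings"
)

// baseURL is the root of the MusicBrainz web service, version 2.
const baseURL = "https://musicbrainz.org/ws/2"

// LookupURL builds the URL for a single-entity lookup.
// entity is the entity path segment (for example "artist" or "release"),
// and inc lists the related data to include in the response.
func LookupURL(entity, mbid string, inc ...string) string {
	u := baseURL + "/" + url.PathEscape(entity) + "/" + url.PathEscape(mbid) + "?fmt=json"
	if len(inc) > 0 {
		// Multiple inc values are separated by "+".
		u += "&inc=" + strings.Join(inc, "+")
	}
	return u
}
```

Calling `LookupURL("artist", "00000000-0000-0000-0000-000000000000", "url-rels", "genres")` yields a URL ending in `?fmt=json&inc=url-rels+genres`.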

### Browse

Paginated fetch of entities related to another entity. Supports up to 100 items per request; iterate with `offset` to retrieve the complete list.
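
The offset loop can be sketched independently of any HTTP client. In the example below, `Page`, `FetchPage`, and `BrowseAll` are hypothetical names; `FetchPage` stands in for whatever function performs one browse request and reports the total count returned by the server.

```go
package main

// Page is one browse response: the items on this page plus the
// total number of matching entities reported by the server.
type Page struct {
	Items []string
	Total int
}

// FetchPage is a hypothetical callback that performs one browse request.
type FetchPage func(offset, limit int) (Page, error)

// browseLimit is the maximum page size for browse requests.
const browseLimit = 100

// BrowseAll keeps requesting pages until every item has been collected.
func BrowseAll(fetch FetchPage) ([]string, error) {
	var all []string
	for offset := 0; ; offset += browseLimit {
		page, err := fetch(offset, browseLimit)
		if err != nil {
			return nil, err
		}
		all = append(all, page.Items...)
		// Stop on an empty page or once the reported total is reached.
		if len(page.Items) == 0 || len(all) >= page.Total {
			return all, nil
		}
	}
}
```

Each call to `fetch` counts against the rate limit, so an artist with 250 release groups costs three requests.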

### Search

Lucene-syntax queries across entity fields. Useful for:
- Finding entities by name (fuzzy matching)
- Looking up by external identifier (ISRC, barcode)
- Filtering by attributes (country, type, date)
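
Search queries are plain strings, so user input has to be quoted before it is embedded in one. The following is a minimal sketch of building fielded phrase queries; `FieldQuery` and `And` are hypothetical helpers, and the field names in the usage note (`artist`, `country`) are examples of searchable artist fields.

```go
package main

import "strings"

// phraseEscaper escapes the two characters that are special inside a
// double-quoted Lucene phrase: the backslash and the double quote.
var phraseEscaper = strings.NewReplacer(`\`, `\\`, `"`, `\"`)

// FieldQuery builds a clause of the form field:"value".
func FieldQuery(field, value string) string {
	return field + `:"` + phraseEscaper.Replace(value) + `"`
}

// And combines clauses so that all of them must match.
func And(clauses ...string) string {
	return strings.Join(clauses, " AND ")
}
```

`And(FieldQuery("artist", "Queen"), FieldQuery("country", "GB"))` produces `artist:"Queen" AND country:"GB"`, which still needs URL encoding before it is placed in the `query` parameter.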

---

## Rate Limiting

| Rule | Limit |
|------|-------|
| Requests per second | **1** (hard limit) |
| Burst allowance | None |
| Violation penalty | HTTP 503 until rate drops |
| User-Agent | **Required** (blocked without) |

User-Agent format: `AppName/Version ( contact-url-or-email )`

---

## Entity Mapping to Internal Schema

### Artist

| MusicBrainz | Internal | Notes |
|-------------|----------|-------|
| `id` | `source_id` | MBID stored as external reference |
| `name` | `name` | |
| `sort-name` | `sort_name` | |
| `type` | `artist_type` | Person, Group, Orchestra, etc. |
| `country` | `country` | ISO 3166-1 alpha-2 code |
| `life-span.begin` | `formed_date` | |
| `life-span.end` | `disbanded_date` | |
| `disambiguation` | `description` | Short disambiguator |
| URL relationship (image) | `image_url` | From Wikimedia Commons link |
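
The table translates fairly directly into a decode-and-copy step. The sketch below assumes the JSON response format and a hypothetical internal `Artist` struct whose fields mirror the "Internal" column; the image URL is left out because it comes from a separate URL relationship.

```go
package main

import "encoding/json"

// mbArtist mirrors the subset of the MusicBrainz artist JSON used by the mapping.
type mbArtist struct {
	ID             string `json:"id"`
	Name           string `json:"name"`
	SortName       string `json:"sort-name"`
	Type           string `json:"type"`
	Country        string `json:"country"`
	Disambiguation string `json:"disambiguation"`
	LifeSpan       struct {
		Begin string `json:"begin"`
		End   string `json:"end"`
	} `json:"life-span"`
}

// Artist is a hypothetical internal row; field names follow the table above.
type Artist struct {
	SourceID      string
	Name          string
	SortName      string
	ArtistType    string
	Country       string
	FormedDate    string
	DisbandedDate string
	Description   string
}

// MapArtist decodes one artist document and copies it into the internal shape.
// JSON nulls (for example an open-ended life span) become empty strings.
func MapArtist(raw []byte) (Artist, error) {
	var a mbArtist
	if err := json.Unmarshal(raw, &a); err != nil {
		return Artist{}, err
	}
	return Artist{
		SourceID:      a.ID,
		Name:          a.Name,
		SortName:      a.SortName,
		ArtistType:    a.Type,
		Country:       a.Country,
		FormedDate:    a.LifeSpan.Begin,
		DisbandedDate: a.LifeSpan.End,
		Description:   a.Disambiguation,
	}, nil
}
```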

### Album (from Release Group)

| MusicBrainz | Internal | Notes |
|-------------|----------|-------|
| `id` | `source_id` | Release Group MBID |
| `title` | `title` | |
| `primary-type` | `album_type` | Album, EP, Single |
| `first-release-date` | `release_date` | Earliest release |
| Label from release | `label_id` | From canonical release |

### Track (from Recording)

| MusicBrainz | Internal | Notes |
|-------------|----------|-------|
| `id` | `source_id` | Recording MBID |
| `title` | `title` | |
| `length` | `duration_ms` | In milliseconds |
| `isrcs[0]` | `isrc` | First ISRC if multiple |
| Work relationship | `work_id` | Link to composition |
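
Two details in this mapping are easy to get wrong: `length` may be null, and `isrcs` may be empty or absent. The sketch below handles both, using a hypothetical internal `Track` struct; resolving `work_id` from the work relationship is omitted.

```go
package main

import "encoding/json"

// mbRecording mirrors the subset of the recording JSON used by the mapping.
type mbRecording struct {
	ID     string   `json:"id"`
	Title  string   `json:"title"`
	Length *int     `json:"length"` // nil when MusicBrainz has no duration
	ISRCs  []string `json:"isrcs"`  // present only when requested via inc=isrcs
}

// Track is a hypothetical internal row; field names follow the table above.
type Track struct {
	SourceID   string
	Title      string
	DurationMs int    // 0 when unknown
	ISRC       string // empty when the recording has no ISRC
}

// MapRecording decodes one recording document into the internal shape.
func MapRecording(raw []byte) (Track, error) {
	var r mbRecording
	if err := json.Unmarshal(raw, &r); err != nil {
		return Track{}, err
	}
	t := Track{SourceID: r.ID, Title: r.Title}
	if r.Length != nil {
		t.DurationMs = *r.Length
	}
	if len(r.ISRCs) > 0 {
		t.ISRC = r.ISRCs[0] // keep only the first ISRC
	}
	return t, nil
}
```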

### Work

| MusicBrainz | Internal | Notes |
|-------------|----------|-------|
| `id` | `source_id` | Work MBID |
| `title` | `title` | |
| `type` | `work_type` | Song, Symphony, Opera, etc. |
| `language` | `language` | ISO 639-3 code |

### Label

| MusicBrainz | Internal | Notes |
|-------------|----------|-------|
| `id` | `source_id` | Label MBID |
| `name` | `name` | |
| `country` | `country` | ISO 3166-1 alpha-2 code |
| `life-span.begin` | `founded_date` | |

---

## Ingestion Flow

### Artist Discovery

```
          INPUT: Artist name
                  │
                  ▼
┌─────────────────────────────────────┐
│ SEARCH by name                      │
│   → Ranked matches with scores      │
│   → Select highest + verify         │
└─────────────────┬───────────────────┘
                  │ MBID
                  ▼
┌─────────────────────────────────────┐
│ LOOKUP with relationships           │
│   → URLs, genres, band members      │
└─────────────────┬───────────────────┘
                  │
                  ▼
 STORE: artist + external_id + genres
```
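
The "select highest + verify" step can be sketched as a small pure function over the ranked search results. Everything here is an assumption made for illustration: the `Candidate` type, the threshold of 90, and the choice of a case-insensitive name comparison as the verification rule.

```go
package main

import "strings"

// Candidate is one search hit with its relevance score (0 to 100).
type Candidate struct {
	MBID  string
	Name  string
	Score int
}

// minScore is an assumed acceptance threshold, not a MusicBrainz rule.
const minScore = 90

// PickArtist returns the highest-scoring candidate, provided it clears the
// threshold and its name matches the query ignoring case and outer spaces.
func PickArtist(query string, candidates []Candidate) (Candidate, bool) {
	var best Candidate
	found := false
	for _, c := range candidates {
		if !found || c.Score > best.Score {
			best, found = c, true
		}
	}
	if !found || best.Score < minScore {
		return Candidate{}, false
	}
	if !strings.EqualFold(best.Name, strings.TrimSpace(query)) {
		return Candidate{}, false
	}
	return best, true
}
```

Returning `false` instead of a weak match lets the caller fall back to manual review or another provider rather than storing the wrong artist.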

### Discography Sync

```
          INPUT: Artist MBID
                  │
                  ▼
┌─────────────────────────────────────┐
│ BROWSE all release-groups           │
│   → Filter: album, ep, single       │
│   → Paginate until exhausted        │
└─────────────────┬───────────────────┘
                  │ for each
                  ▼
┌─────────────────────────────────────┐
│ LOOKUP release-group                │
│   → Get releases list               │
│   → Select canonical release        │
└─────────────────┬───────────────────┘
                  │ release MBID
                  ▼
┌─────────────────────────────────────┐
│ LOOKUP release with tracks          │
│   → Media structure (discs)         │
│   → Track positions                 │
│   → ISRCs, label info               │
└─────────────────┬───────────────────┘
                  │
                  ▼
  STORE: album + tracks + positions
```

### Canonical Release Selection

When a release group has multiple releases, select one as canonical by applying these criteria in priority order:

| Priority | Criteria |
|----------|----------|
| 1 | Status: Official > Promotional > Bootleg |
| 2 | Format: Digital > CD > Vinyl |
| 3 | Completeness: Has barcode, has label |
| 4 | Date: Original release preferred |
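
The priority table amounts to a multi-key comparison. The sketch below is one way to express it, with several assumptions: the `Release` struct is a flattened, hypothetical view of a release; the status and format strings (`"Promotion"`, `"Digital Media"`) are the values this example expects from the API and should be checked against real responses; unknown values rank last.

```go
package main

import "sort"

// Release is a hypothetical flattened view of the fields used for ranking.
type Release struct {
	MBID, Status, Format, Barcode, Label, Date string
}

// Lower rank is better. The string values are assumptions of this sketch.
var statusRank = map[string]int{"Official": 0, "Promotion": 1, "Bootleg": 2}
var formatRank = map[string]int{"Digital Media": 0, "CD": 1, "Vinyl": 2}

// rank returns the position of key in m, or a value past the end if absent.
func rank(m map[string]int, key string) int {
	if r, ok := m[key]; ok {
		return r
	}
	return len(m)
}

// missingFields counts absent completeness fields (barcode, label).
func missingFields(r Release) int {
	n := 0
	if r.Barcode == "" {
		n++
	}
	if r.Label == "" {
		n++
	}
	return n
}

// better reports whether a should be preferred over b.
func better(a, b Release) bool {
	if x, y := rank(statusRank, a.Status), rank(statusRank, b.Status); x != y {
		return x < y
	}
	if x, y := rank(formatRank, a.Format), rank(formatRank, b.Format); x != y {
		return x < y
	}
	if x, y := missingFields(a), missingFields(b); x != y {
		return x < y
	}
	if a.Date == "" || b.Date == "" {
		return a.Date != "" && b.Date == "" // a known date beats an unknown one
	}
	return a.Date < b.Date // ISO dates compare correctly as strings
}

// Canonical returns the preferred release, or false for an empty list.
func Canonical(releases []Release) (Release, bool) {
	if len(releases) == 0 {
		return Release{}, false
	}
	sorted := append([]Release(nil), releases...)
	sort.SliceStable(sorted, func(i, j int) bool { return better(sorted[i], sorted[j]) })
	return sorted[0], true
}
```

Sorting a copy keeps the caller's slice untouched, and the stable sort makes the result deterministic when two releases tie on every criterion.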

---

## Cover Art

Album artwork is served by the **Cover Art Archive** (coverartarchive.org), not by MusicBrainz directly.

| Size | URL Pattern |
|------|-------------|
| Original | `/release/{release_mbid}/front` |
| Thumbnail | `/release/{release_mbid}/front-250` |
| Medium | `/release/{release_mbid}/front-500` |
| Large | `/release/{release_mbid}/front-1200` |

Not all releases have cover art. Check availability in the release metadata before requesting an image.
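
The URL patterns in the table can be wrapped in a small builder that rejects sizes the archive does not offer. `FrontCoverURL` is a hypothetical helper; passing `0` for the original image is a convention of this sketch only.

```go
package main

import "fmt"

// coverArtBase is the Cover Art Archive host named above.
const coverArtBase = "https://coverartarchive.org"

// FrontCoverURL returns the front cover URL for a release.
// size 0 selects the original image; 250, 500 and 1200 select thumbnails.
func FrontCoverURL(releaseMBID string, size int) (string, error) {
	switch size {
	case 0:
		return fmt.Sprintf("%s/release/%s/front", coverArtBase, releaseMBID), nil
	case 250, 500, 1200:
		return fmt.Sprintf("%s/release/%s/front-%d", coverArtBase, releaseMBID, size), nil
	default:
		return "", fmt.Errorf("unsupported cover art size %d", size)
	}
}
```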

---

## Bulk Data Access

For large-scale ingestion, database dumps avoid rate limits.

| Source | Format | Frequency | Use Case |
|--------|--------|-----------|----------|
| JSON dumps | JSON Lines (`.tar.xz` archives) | 2x/week | Initial seeding |
| PostgreSQL dumps | PostgreSQL table data (`.tar.bz2` archives) | 2x/week | Full mirror |
| Replication packets | Incremental | Hourly | Staying in sync |

### Recommended Strategy

| Phase | Method |
|-------|--------|
| Initial load | JSON dumps |
| On-demand | Live API with caching |
| Periodic refresh | JSON dumps monthly |

---

## Caching

| Entity | TTL | Rationale |
|--------|-----|-----------|
| Artist | 30 days | Rarely changes |
| Album | 30 days | Rarely changes |
| Track | 30 days | Rarely changes |
| Search results | 24 hours | New entries may appear |

---

## External ID Storage

Store MusicBrainz identifiers in the `*_external_ids` tables:

| Field | Value |
|-------|-------|
| `source` | `"musicbrainz"` |
| `source_id` | MBID (UUID) |
| `url` | `https://musicbrainz.org/{entity}/{mbid}` |

This enables:
- Cross-source deduplication
- Lookup by MBID from other services
- Link back for verification
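
Building one such row is mostly string assembly, plus a sanity check that the MBID looks like a UUID. The sketch below assumes a hypothetical `ExternalID` struct matching the three fields above; the lowercase-only UUID pattern is a choice made for this example.

```go
package main

import (
	"fmt"
	"regexp"
)

// mbidPattern matches a lowercase, hyphenated UUID.
var mbidPattern = regexp.MustCompile(`^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$`)

// ExternalID is a hypothetical row for an *_external_ids table.
type ExternalID struct {
	Source   string
	SourceID string
	URL      string
}

// NewMusicBrainzID validates the MBID and fills in the source and link-back URL.
// entity is the URL path segment, for example "artist" or "release-group".
func NewMusicBrainzID(entity, mbid string) (ExternalID, error) {
	if !mbidPattern.MatchString(mbid) {
		return ExternalID{}, fmt.Errorf("invalid MBID %q", mbid)
	}
	return ExternalID{
		Source:   "musicbrainz",
		SourceID: mbid,
		URL:      fmt.Sprintf("https://musicbrainz.org/%s/%s", entity, mbid),
	}, nil
}
```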

---

## Go Client

Recommended: `go.uploadedlobster.com/musicbrainzws2`