metadata-agregator/docs/INGESTION_MUSICBRAINZ.md
Alexander a1f6701bac feat: initial implementation of metadata aggregator
- gRPC service with MusicBrainz provider
- PostgreSQL schema with migrations
- Service layer with database-first caching
- Repository pattern for data access
- YAML configuration support
- Research documentation for 17 music metadata projects
2026-04-28 16:28:53 +02:00

MusicBrainz Ingestion

Architecture documentation for ingesting music metadata from MusicBrainz.


Overview

MusicBrainz is an open music encyclopedia maintained by the MetaBrainz Foundation. It serves as the canonical source for music metadata with community-curated data covering artists, releases, recordings, and works.

| Attribute | Value |
|---|---|
| Data Quality | High (community-curated) |
| Coverage | ~2M artists, ~3M releases, ~30M recordings |
| Update Frequency | Real-time edits, twice-weekly dumps |
| API Style | REST with Lucene search |
| Cost | Free (rate-limited) |

Data Model

MusicBrainz uses a hierarchical model that separates abstract concepts from concrete manifestations.

Entity Hierarchy

                        ┌──────────┐
                        │   WORK   │  ← Composition (the song as written)
                        │  (ISWC)  │     "Bohemian Rhapsody" by Freddie Mercury
                        └────┬─────┘
                             │ performed as
                             ▼
                        ┌──────────┐
                        │RECORDING │  ← Unique audio (specific performance)
                        │  (ISRC)  │     Studio version, live version, demo
                        └────┬─────┘
                             │ appears on
                             ▼
┌──────────┐           ┌──────────┐
│  ARTIST  │◄─────────►│ RELEASE  │  ← Physical/digital product
│  (MBID)  │ credited  │  (UPC)   │     US CD, UK Vinyl, Spotify release
└──────────┘    on     └────┬─────┘
                             │ variant of
                             ▼
                        ┌──────────┐
                        │ RELEASE  │  ← Abstract album concept
                        │  GROUP   │     "A Night at the Opera" (all editions)
                        └──────────┘

Core Entities

| Entity | Description | Identifier | Example |
|---|---|---|---|
| Artist | Musician, band, orchestra, composer | MBID | Queen, Freddie Mercury |
| Work | Abstract composition | ISWC | "Bohemian Rhapsody" (the song) |
| Recording | Specific audio performance | ISRC | Studio recording of Bohemian Rhapsody |
| Release | Concrete product (CD, vinyl, digital) | Barcode/UPC | 1975 UK vinyl pressing |
| Release Group | Abstract album (all editions) | MBID | "A Night at the Opera" |
| Label | Record label or imprint | MBID | EMI, Hollywood Records |

Key Distinction: Release vs Release Group

Release Group = The abstract album concept

  • "Nevermind" by Nirvana

Release = A specific physical or digital product

  • 1991 US CD (DGC)
  • 1991 UK CD (Geffen)
  • 2011 Deluxe Edition (4 CDs)
  • 2021 30th Anniversary Super Deluxe

This separation allows tracking all variants while maintaining a single "album" identity.

Key Distinction: Recording vs Work

Work = The composition (what was written)

  • Composer: Kurt Cobain
  • ISWC identifier
  • No audio - just the abstract song

Recording = A specific audio capture

  • Performer: Nirvana
  • ISRC identifier
  • Has duration, audio characteristics
  • Multiple recordings of same work (studio, live, acoustic)
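
The two distinctions above can be sketched as Go types. This is only an illustration: the type names, field names, and the placeholder IDs are assumptions, not the project's actual schema or the MusicBrainz data model in full.

```go
package main

import "fmt"

// Hypothetical types mirroring the hierarchy described above.
// Field names are illustrative, not the project's real schema.

// Work is the abstract composition; it has no audio.
type Work struct {
	MBID  string
	Title string
	ISWC  string // may be empty
}

// Recording is one specific audio capture of a work.
type Recording struct {
	MBID     string
	Title    string
	ISRCs    []string
	LengthMs int
	WorkMBID string // "performance of" link to the Work
}

// Release is one concrete product (a CD, a vinyl pressing, a digital release).
type Release struct {
	MBID    string
	Title   string
	Barcode string
	Country string
	Date    string
}

// ReleaseGroup is the abstract album that collects all of its editions.
type ReleaseGroup struct {
	MBID     string
	Title    string
	Releases []Release
}

// RecordingsOf returns every recording linked to the given work.
func RecordingsOf(work Work, all []Recording) []Recording {
	var out []Recording
	for _, r := range all {
		if r.WorkMBID == work.MBID {
			out = append(out, r)
		}
	}
	return out
}

func main() {
	// "w1", "r1", ... are placeholder IDs, not real MBIDs.
	work := Work{MBID: "w1", Title: "Bohemian Rhapsody"}
	recs := []Recording{
		{MBID: "r1", Title: "Bohemian Rhapsody", WorkMBID: "w1"},
		{MBID: "r2", Title: "Bohemian Rhapsody (live)", WorkMBID: "w1"},
		{MBID: "r3", Title: "Unrelated recording", WorkMBID: "w2"},
	}
	fmt.Println(len(RecordingsOf(work, recs)), "recordings of", work.Title)
}
```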

Relationship System

MusicBrainz uses Advanced Relationships (ARs) to connect entities with typed, attributed links.

Relationship Types

Artist ↔ Artist:

  • member of band (with dates)
  • collaboration
  • teacher of

Artist ↔ Recording:

  • performer (with instrument)
  • producer
  • engineer
  • mix

Artist ↔ Work:

  • composer
  • lyricist
  • writer

Recording ↔ Work:

  • performance of

Artist ↔ URL:

  • official homepage
  • social network (Spotify, YouTube, etc.)
  • streaming

Relationship Attributes

Relationships carry attributes providing detail:

Artist: John Lennon
  └─► Recording: "Come Together"
      Relationship: performer
      Attributes: 
        - instrument: vocals
        - instrument: rhythm guitar
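
One possible in-memory shape for such an attributed link is sketched below. The `EntityRef`, `Attribute`, and `Relationship` types are hypothetical names introduced for illustration; they are not the MusicBrainz schema.

```go
package main

import "fmt"

// EntityRef points at one side of a relationship (hypothetical type).
type EntityRef struct {
	Kind string // "artist", "recording", "work", "url"
	Name string
}

// Attribute is one key/value detail attached to a relationship.
type Attribute struct {
	Key   string
	Value string
}

// Relationship is a typed, attributed link between two entities.
type Relationship struct {
	Source     EntityRef
	Target     EntityRef
	Type       string
	Attributes []Attribute
}

// Values returns every attribute value stored under key, in order.
func (r Relationship) Values(key string) []string {
	var out []string
	for _, a := range r.Attributes {
		if a.Key == key {
			out = append(out, a.Value)
		}
	}
	return out
}

func main() {
	rel := Relationship{
		Source: EntityRef{Kind: "artist", Name: "John Lennon"},
		Target: EntityRef{Kind: "recording", Name: "Come Together"},
		Type:   "performer",
		Attributes: []Attribute{
			{Key: "instrument", Value: "vocals"},
			{Key: "instrument", Value: "rhythm guitar"},
		},
	}
	fmt.Println(rel.Source.Name, rel.Type, rel.Values("instrument"))
}
```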

API Access Patterns

Three Methods

| Method | Purpose | Use Case |
|---|---|---|
| Lookup | Fetch single entity by MBID | Known entity, need full details |
| Browse | Paginate related entities | All albums by artist, all tracks on album |
| Search | Find entities by criteria | Find artist by name, recording by ISRC |

Lookup

Direct fetch by MusicBrainz ID (MBID). Returns single entity with optional related data via inc parameter.

Related data options: releases, recordings, url-rels, artist-rels, genres, labels, media, isrcs

Limitation: Related entities capped at 25 per request. Use Browse for complete lists.
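
A minimal sketch of building a lookup URL against the `/ws/2` web service with `fmt=json`. The helper name `LookupURL` is an assumption, and the MBID in `main` is a placeholder rather than a real entity.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

const baseURL = "https://musicbrainz.org/ws/2"

// LookupURL builds the URL for a single-entity lookup.
// inc values are joined with spaces, which url.Values encodes as "+",
// the separator the web service expects between inc values.
func LookupURL(entity, mbid string, inc []string) string {
	q := url.Values{}
	q.Set("fmt", "json")
	if len(inc) > 0 {
		q.Set("inc", strings.Join(inc, " "))
	}
	return fmt.Sprintf("%s/%s/%s?%s", baseURL, entity, url.PathEscape(mbid), q.Encode())
}

func main() {
	// Placeholder MBID, not a real artist.
	mbid := "00000000-0000-0000-0000-000000000000"
	fmt.Println(LookupURL("artist", mbid, []string{"url-rels", "genres"}))
}
```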

Browse

Paginated fetch of entities related to another entity. Supports up to 100 items per request. Must iterate with offset for complete data.
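
The offset loop can be sketched independently of the HTTP layer. Here `FetchPage` is a hypothetical callback standing in for one browse request, and `Page.Total` stands for the total count the server reports alongside each page.

```go
package main

import "fmt"

// Page is one browse response: the returned items plus the total count.
type Page struct {
	Items []string
	Total int
}

// FetchPage performs one browse request (hypothetical callback).
type FetchPage func(offset, limit int) (Page, error)

// maxLimit is the largest page size a browse request accepts.
const maxLimit = 100

// BrowseAll requests pages of maxLimit items until the total is reached
// or the server returns an empty page.
func BrowseAll(fetch FetchPage) ([]string, error) {
	var all []string
	offset := 0
	for {
		page, err := fetch(offset, maxLimit)
		if err != nil {
			return nil, err
		}
		all = append(all, page.Items...)
		offset += len(page.Items)
		if len(page.Items) == 0 || offset >= page.Total {
			return all, nil
		}
	}
}

func main() {
	// Fake data source with 250 entries, standing in for the live API.
	const total = 250
	calls := 0
	fetch := func(offset, limit int) (Page, error) {
		calls++
		end := offset + limit
		if end > total {
			end = total
		}
		var items []string
		for i := offset; i < end; i++ {
			items = append(items, fmt.Sprintf("rg-%03d", i))
		}
		return Page{Items: items, Total: total}, nil
	}
	all, err := BrowseAll(fetch)
	if err != nil {
		fmt.Println("browse failed:", err)
		return
	}
	fmt.Printf("fetched %d items in %d requests\n", len(all), calls)
}
```

In a real client each call to `fetch` would also have to respect the rate limit.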

Search

Lucene-syntax queries across entity fields. Useful for:

  • Finding entities by name (fuzzy matching)
  • Looking up by external identifier (ISRC, barcode)
  • Filtering by attributes (country, type, date)
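
A sketch of building a fielded search query. User input should be escaped before it is placed in a Lucene query; the helper names `EscapeLucene` and `SearchURL` are assumptions, and the set of escaped characters follows the classic Lucene query syntax.

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
	"strings"
)

// luceneSpecial lists characters with special meaning in Lucene queries.
const luceneSpecial = `+-&|!(){}[]^"~*?:\/`

// EscapeLucene backslash-escapes Lucene special characters in s.
func EscapeLucene(s string) string {
	var b strings.Builder
	for _, r := range s {
		if strings.ContainsRune(luceneSpecial, r) {
			b.WriteByte('\\')
		}
		b.WriteRune(r)
	}
	return b.String()
}

// SearchURL builds a search URL for a quoted field:"value" query.
func SearchURL(entity, field, value string, limit int) string {
	q := url.Values{}
	q.Set("query", fmt.Sprintf(`%s:"%s"`, field, EscapeLucene(value)))
	q.Set("fmt", "json")
	q.Set("limit", strconv.Itoa(limit))
	return "https://musicbrainz.org/ws/2/" + entity + "?" + q.Encode()
}

func main() {
	fmt.Println(EscapeLucene("AC/DC"))
	fmt.Println(SearchURL("artist", "artist", "AC/DC", 5))
}
```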

Rate Limiting

| Rule | Limit |
|---|---|
| Requests per second | 1 (hard limit) |
| Burst allowance | None |
| Violation penalty | HTTP 503 until rate drops |
| User-Agent | Required (blocked without) |

User-Agent format: AppName/Version ( contact-url-or-email )
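
A minimal client-side sketch of both rules: a throttle that spaces requests at least one interval apart with no burst, and a request builder that always sets the User-Agent. The `Throttle` type, the application name, and the contact address are illustrative assumptions.

```go
package main

import (
	"fmt"
	"net/http"
	"sync"
	"time"
)

// UserAgent formats the header value as AppName/Version ( contact ).
func UserAgent(app, version, contact string) string {
	return fmt.Sprintf("%s/%s ( %s )", app, version, contact)
}

// Throttle spaces calls at least interval apart and allows no burst.
type Throttle struct {
	mu       sync.Mutex
	interval time.Duration
	next     time.Time // earliest time the next request may start
}

func NewThrottle(interval time.Duration) *Throttle {
	return &Throttle{interval: interval}
}

// Wait reserves the next free slot and blocks until it arrives.
func (t *Throttle) Wait() {
	t.mu.Lock()
	now := time.Now()
	slot := t.next
	if slot.Before(now) {
		slot = now
	}
	t.next = slot.Add(t.interval)
	t.mu.Unlock()
	time.Sleep(slot.Sub(now)) // returns immediately if the slot is now
}

// NewRequest builds a GET request carrying the required User-Agent.
func NewRequest(rawURL, userAgent string) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodGet, rawURL, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("User-Agent", userAgent)
	return req, nil
}

func main() {
	// Placeholder application name and contact address.
	ua := UserAgent("metadata-aggregator", "0.1.0", "ops@example.com")
	throttle := NewThrottle(time.Second)
	for _, entity := range []string{"artist", "release-group"} {
		throttle.Wait()
		req, err := NewRequest("https://musicbrainz.org/ws/2/"+entity, ua)
		if err != nil {
			fmt.Println("bad request:", err)
			return
		}
		// The request is only built here, not sent.
		fmt.Println(req.Method, req.URL, req.Header.Get("User-Agent"))
	}
}
```

Sharing one `Throttle` across all goroutines keeps the whole process under the limit rather than each caller individually.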


Entity Mapping to Internal Schema

Artist

| MusicBrainz | Internal | Notes |
|---|---|---|
| id | source_id | MBID stored as external reference |
| name | name | |
| sort-name | sort_name | |
| type | artist_type | Person, Group, Orchestra, etc. |
| country | country | ISO code |
| life-span.begin | formed_date | |
| life-span.end | disbanded_date | |
| disambiguation | description | Short disambiguator |
| URL relationship (image) | image_url | From Wikimedia Commons link |
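
The artist mapping can be sketched as a JSON decode followed by a field copy. The internal `Artist` struct is a hypothetical stand-in for the project's model, the sample document is made up, and `image_url` is left out because it requires resolving a URL relationship.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// mbArtist mirrors the fields of the artist JSON that the mapping uses.
type mbArtist struct {
	ID             string `json:"id"`
	Name           string `json:"name"`
	SortName       string `json:"sort-name"`
	Type           string `json:"type"`
	Country        string `json:"country"`
	Disambiguation string `json:"disambiguation"`
	LifeSpan       struct {
		Begin string `json:"begin"`
		End   string `json:"end"` // JSON null decodes to ""
	} `json:"life-span"`
}

// Artist is a hypothetical internal model following the table above.
type Artist struct {
	SourceID      string
	Name          string
	SortName      string
	ArtistType    string
	Country       string
	FormedDate    string
	DisbandedDate string
	Description   string
}

// MapArtist decodes one artist document and maps it to the internal model.
func MapArtist(data []byte) (Artist, error) {
	var in mbArtist
	if err := json.Unmarshal(data, &in); err != nil {
		return Artist{}, fmt.Errorf("decode artist: %w", err)
	}
	return Artist{
		SourceID:      in.ID,
		Name:          in.Name,
		SortName:      in.SortName,
		ArtistType:    in.Type,
		Country:       in.Country,
		FormedDate:    in.LifeSpan.Begin,
		DisbandedDate: in.LifeSpan.End,
		Description:   in.Disambiguation,
	}, nil
}

func main() {
	// Made-up sample document, not real MusicBrainz data.
	sample := []byte(`{
		"id": "00000000-0000-0000-0000-000000000000",
		"name": "The Example Band",
		"sort-name": "Example Band, The",
		"type": "Group",
		"country": "GB",
		"disambiguation": "sample entry",
		"life-span": {"begin": "1970", "end": null}
	}`)
	artist, err := MapArtist(sample)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Printf("%+v\n", artist)
}
```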

Album (from Release Group)

| MusicBrainz | Internal | Notes |
|---|---|---|
| id | source_id | Release Group MBID |
| title | title | |
| primary-type | album_type | Album, EP, Single |
| first-release-date | release_date | Earliest release |
| Label from release | label_id | From canonical release |

Track (from Recording)

| MusicBrainz | Internal | Notes |
|---|---|---|
| id | source_id | Recording MBID |
| title | title | |
| length | duration_ms | In milliseconds |
| isrcs[0] | isrc | First ISRC if multiple |
| Work relationship | work_id | Link to composition |
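
Two details of the track mapping are easy to get wrong: `length` can be null, and `isrcs` can be empty or missing. A sketch under the assumption of a hypothetical internal `Track` struct follows; the work link is omitted because it depends on relationship data, and the sample values are placeholders.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// mbRecording mirrors the recording fields used by the mapping.
type mbRecording struct {
	ID     string   `json:"id"`
	Title  string   `json:"title"`
	Length *int     `json:"length"` // nil when the JSON value is null
	ISRCs  []string `json:"isrcs"`
}

// Track is a hypothetical internal model following the table above.
type Track struct {
	SourceID   string
	Title      string
	DurationMs int    // 0 when the length is unknown
	ISRC       string // empty when the recording has no ISRC
}

// MapRecording decodes one recording document into a Track.
func MapRecording(data []byte) (Track, error) {
	var in mbRecording
	if err := json.Unmarshal(data, &in); err != nil {
		return Track{}, fmt.Errorf("decode recording: %w", err)
	}
	t := Track{SourceID: in.ID, Title: in.Title}
	if in.Length != nil {
		t.DurationMs = *in.Length
	}
	if len(in.ISRCs) > 0 {
		t.ISRC = in.ISRCs[0] // first ISRC if multiple
	}
	return t, nil
}

func main() {
	// Placeholder ID and ISRCs, not real data.
	sample := []byte(`{"id":"rec-1","title":"Example","length":354000,
		"isrcs":["AAAAA0000001","AAAAA0000002"]}`)
	track, err := MapRecording(sample)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Printf("%+v\n", track)
}
```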

Work

| MusicBrainz | Internal | Notes |
|---|---|---|
| id | source_id | Work MBID |
| title | title | |
| type | work_type | Song, Symphony, Opera, etc. |
| language | language | ISO code |

Label

| MusicBrainz | Internal | Notes |
|---|---|---|
| id | source_id | Label MBID |
| name | name | |
| country | country | ISO code |
| life-span.begin | founded_date | |

Ingestion Flow

Artist Discovery

INPUT: Artist name
         │
         ▼
┌─────────────────────────────────────┐
│  SEARCH by name                     │
│  → Ranked matches with scores       │
│  → Select highest + verify          │
└─────────────────┬───────────────────┘
                  │ MBID
                  ▼
┌─────────────────────────────────────┐
│  LOOKUP with relationships          │
│  → URLs, genres, band members       │
└─────────────────┬───────────────────┘
                  │
                  ▼
         STORE: artist + external_id + genres
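
The "select highest + verify" step in the flow above could look like the following sketch. The verification rule used here (minimum score plus a case-insensitive name match) is an assumption chosen for illustration; the document does not specify one.

```go
package main

import (
	"fmt"
	"strings"
)

// Candidate is one ranked search match.
type Candidate struct {
	MBID  string
	Name  string
	Score int // relevance score reported with each search result
}

// PickBest returns the highest-scoring candidate, but only if it passes
// verification: score >= minScore and a case-insensitive name match.
// Both checks are assumptions, not rules taken from MusicBrainz.
func PickBest(query string, cands []Candidate, minScore int) (Candidate, bool) {
	var best Candidate
	found := false
	for _, c := range cands {
		if !found || c.Score > best.Score {
			best, found = c, true
		}
	}
	if !found || best.Score < minScore {
		return Candidate{}, false
	}
	if !strings.EqualFold(strings.TrimSpace(best.Name), strings.TrimSpace(query)) {
		return Candidate{}, false
	}
	return best, true
}

func main() {
	// Placeholder MBIDs and scores.
	cands := []Candidate{
		{MBID: "id-1", Name: "Queen", Score: 100},
		{MBID: "id-2", Name: "Queen Latifah", Score: 62},
	}
	if best, ok := PickBest("queen", cands, 90); ok {
		fmt.Println("matched", best.MBID)
	} else {
		fmt.Println("no confident match")
	}
}
```

Rejecting uncertain matches instead of storing them keeps wrong MBIDs out of the external ID table, at the cost of more manual review.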

Discography Sync

INPUT: Artist MBID
         │
         ▼
┌─────────────────────────────────────┐
│  BROWSE all release-groups          │
│  → Filter: album, ep, single        │
│  → Paginate until exhausted         │
└─────────────────┬───────────────────┘
                  │ for each
                  ▼
┌─────────────────────────────────────┐
│  LOOKUP release-group               │
│  → Get releases list                │
│  → Select canonical release         │
└─────────────────┬───────────────────┘
                  │ release MBID
                  ▼
┌─────────────────────────────────────┐
│  LOOKUP release with tracks         │
│  → Media structure (discs)          │
│  → Track positions                  │
│  → ISRCs, label info                │
└─────────────────┬───────────────────┘
                  │
                  ▼
         STORE: album + tracks + positions
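
The last step of the flow turns the release's media structure into flat rows with disc and track positions. A minimal sketch, assuming a hypothetical `AlbumTrack` row type and a made-up sample document:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// mbRelease mirrors the parts of a release document needed for tracks.
type mbRelease struct {
	Media []struct {
		Position int `json:"position"` // disc number
		Tracks   []struct {
			Position  int    `json:"position"` // track number on the disc
			Title     string `json:"title"`
			Recording struct {
				ID string `json:"id"`
			} `json:"recording"`
		} `json:"tracks"`
	} `json:"media"`
}

// AlbumTrack is a hypothetical flat row: one track at one position.
type AlbumTrack struct {
	Disc        int
	Position    int
	Title       string
	RecordingID string
}

// FlattenTracks walks media and tracks in order and returns flat rows.
func FlattenTracks(data []byte) ([]AlbumTrack, error) {
	var rel mbRelease
	if err := json.Unmarshal(data, &rel); err != nil {
		return nil, fmt.Errorf("decode release: %w", err)
	}
	var out []AlbumTrack
	for _, m := range rel.Media {
		for _, t := range m.Tracks {
			out = append(out, AlbumTrack{
				Disc:        m.Position,
				Position:    t.Position,
				Title:       t.Title,
				RecordingID: t.Recording.ID,
			})
		}
	}
	return out, nil
}

func main() {
	// Made-up two-disc release with placeholder recording IDs.
	sample := []byte(`{"media":[
		{"position":1,"tracks":[
			{"position":1,"title":"A","recording":{"id":"r1"}},
			{"position":2,"title":"B","recording":{"id":"r2"}}]},
		{"position":2,"tracks":[
			{"position":1,"title":"C","recording":{"id":"r3"}}]}]}`)
	tracks, err := FlattenTracks(sample)
	if err != nil {
		fmt.Println(err)
		return
	}
	for _, t := range tracks {
		fmt.Printf("disc %d track %d: %s (%s)\n", t.Disc, t.Position, t.Title, t.RecordingID)
	}
}
```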

Canonical Release Selection

When a release-group has multiple releases, select one as canonical:

| Priority | Criteria |
|---|---|
| 1 | Status: Official > Promotional > Bootleg |
| 2 | Format: Digital > CD > Vinyl |
| 3 | Completeness: Has barcode, has label |
| 4 | Date: Original release preferred |
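
The priority table can be read as a lexicographic comparison. The sketch below is one way to implement it; the `Release` struct is hypothetical, and the exact status and format strings ("Promotion", "Digital Media") are assumptions about how the API spells these values.

```go
package main

import "fmt"

// Release holds only the fields the selection rules need (hypothetical).
type Release struct {
	MBID, Status, Format, Barcode, Label, Date string
}

// Preference orders; values not listed rank last.
// The exact strings are assumptions about the API's spelling.
var (
	statusOrder = []string{"Official", "Promotion", "Bootleg"}
	formatOrder = []string{"Digital Media", "CD", "Vinyl"}
)

func rank(v string, order []string) int {
	for i, o := range order {
		if o == v {
			return i
		}
	}
	return len(order)
}

func completeness(r Release) int {
	n := 0
	if r.Barcode != "" {
		n++
	}
	if r.Label != "" {
		n++
	}
	return n
}

// better reports whether a should be preferred over b.
func better(a, b Release) bool {
	if x, y := rank(a.Status, statusOrder), rank(b.Status, statusOrder); x != y {
		return x < y
	}
	if x, y := rank(a.Format, formatOrder), rank(b.Format, formatOrder); x != y {
		return x < y
	}
	if x, y := completeness(a), completeness(b); x != y {
		return x > y
	}
	// Earlier date wins; an unknown date loses. Comparing date strings
	// works for YYYY, YYYY-MM and YYYY-MM-DD values.
	if a.Date != b.Date {
		return a.Date != "" && (b.Date == "" || a.Date < b.Date)
	}
	return false
}

// SelectCanonical returns the preferred release, or false if none exist.
func SelectCanonical(releases []Release) (Release, bool) {
	if len(releases) == 0 {
		return Release{}, false
	}
	best := releases[0]
	for _, r := range releases[1:] {
		if better(r, best) {
			best = r
		}
	}
	return best, true
}

func main() {
	// Placeholder IDs, barcodes, and labels.
	releases := []Release{
		{MBID: "a", Status: "Bootleg", Format: "Digital Media"},
		{MBID: "b", Status: "Official", Format: "CD", Barcode: "1", Label: "L", Date: "1991"},
		{MBID: "c", Status: "Official", Format: "Digital Media", Date: "2011"},
	}
	best, _ := SelectCanonical(releases)
	fmt.Println("canonical:", best.MBID)
}
```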

Cover Art

Album artwork served by Cover Art Archive (coverartarchive.org), not MusicBrainz directly.

| Size | URL Pattern |
|---|---|
| Original | /release/{release_mbid}/front |
| Thumbnail | /release/{release_mbid}/front-250 |
| Medium | /release/{release_mbid}/front-500 |
| Large | /release/{release_mbid}/front-1200 |

Not all releases have cover art. Check availability via release metadata.
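
The URL patterns above can be wrapped in a small builder. The function name `CoverArtURL` and the convention that size 0 means the original image are assumptions made for this sketch; the MBID is a placeholder.

```go
package main

import "fmt"

const coverArtBase = "https://coverartarchive.org"

// CoverArtURL returns the front-cover URL for a release.
// size 0 selects the original image; 250, 500 and 1200 select thumbnails.
func CoverArtURL(releaseMBID string, size int) (string, error) {
	front := fmt.Sprintf("%s/release/%s/front", coverArtBase, releaseMBID)
	switch size {
	case 0:
		return front, nil
	case 250, 500, 1200:
		return fmt.Sprintf("%s-%d", front, size), nil
	default:
		return "", fmt.Errorf("unsupported cover art size %d", size)
	}
}

func main() {
	// Placeholder MBID, not a real release.
	mbid := "00000000-0000-0000-0000-000000000000"
	for _, size := range []int{0, 250, 500, 1200} {
		u, err := CoverArtURL(mbid, size)
		if err != nil {
			fmt.Println(err)
			continue
		}
		fmt.Println(u)
	}
}
```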


Bulk Data Access

For large-scale ingestion, database dumps avoid rate limits.

| Source | Format | Frequency | Use Case |
|---|---|---|---|
| JSON dumps | JSONL (gzipped) | 2x/week | Initial seeding |
| PostgreSQL dumps | SQL | 2x/week | Full mirror |
| Replication packets | Incremental | Hourly | Staying in sync |

| Phase | Method |
|---|---|
| Initial load | JSON dumps |
| On-demand | Live API with caching |
| Periodic refresh | JSON dumps monthly |

Caching

| Entity | TTL | Rationale |
|---|---|---|
| Artist | 30 days | Rarely changes |
| Album | 30 days | Rarely changes |
| Track | 30 days | Rarely changes |
| Search results | 24 hours | New entries may appear |
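
A sketch of a staleness check driven by the TTL table. The kind keys ("artist", "search", ...) and the rule that unknown kinds are always treated as stale are assumptions for illustration.

```go
package main

import (
	"fmt"
	"time"
)

const day = 24 * time.Hour

// ttl holds the cache lifetime per entity kind, taken from the table above.
var ttl = map[string]time.Duration{
	"artist": 30 * day,
	"album":  30 * day,
	"track":  30 * day,
	"search": 24 * time.Hour,
}

// Expired reports whether a cached entry of the given kind and age
// should be refetched. Unknown kinds are treated as always stale.
func Expired(kind string, age time.Duration) bool {
	limit, ok := ttl[kind]
	if !ok {
		return true
	}
	return age >= limit
}

func main() {
	fetchedAt := time.Now().Add(-45 * day) // pretend the row is 45 days old
	fmt.Println("artist stale:", Expired("artist", time.Since(fetchedAt)))
	fmt.Println("search stale:", Expired("search", time.Hour))
}
```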

External ID Storage

Store in *_external_ids tables:

| Field | Value |
|---|---|
| source | "musicbrainz" |
| source_id | MBID (UUID) |
| url | https://musicbrainz.org/{entity}/{mbid} |

Enables:

  • Cross-source deduplication
  • Lookup by MBID from other services
  • Link back for verification
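
A row for these tables could be built as sketched below. The `ExternalID` struct and constructor name are hypothetical, and the UUID check (lowercase, hyphenated form) is an added assumption about how MBIDs are stored.

```go
package main

import (
	"fmt"
	"regexp"
)

// mbidRe matches the lowercase hyphenated UUID form.
var mbidRe = regexp.MustCompile(`^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$`)

// ExternalID is a hypothetical row for an *_external_ids table.
type ExternalID struct {
	Source   string
	SourceID string
	URL      string
}

// NewMusicBrainzID validates the MBID and builds the link-back URL.
// entity is the URL path segment, e.g. "artist" or "release-group".
func NewMusicBrainzID(entity, mbid string) (ExternalID, error) {
	if !mbidRe.MatchString(mbid) {
		return ExternalID{}, fmt.Errorf("invalid MBID %q", mbid)
	}
	return ExternalID{
		Source:   "musicbrainz",
		SourceID: mbid,
		URL:      "https://musicbrainz.org/" + entity + "/" + mbid,
	}, nil
}

func main() {
	// Placeholder UUID, not a real MBID.
	id, err := NewMusicBrainzID("release-group", "11111111-2222-3333-4444-555555555555")
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Printf("%+v\n", id)
}
```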

Go Client

Recommended: go.uploadedlobster.com/musicbrainzws2