Files
metadata-agregator/docs/research/musicbrainz-server/analysis/DATA.md
T
Alexander a1f6701bac feat: initial implementation of metadata aggregator
- gRPC service with MusicBrainz provider
- PostgreSQL schema with migrations
- Service layer with database-first caching
- Repository pattern for data access
- YAML configuration support
- Research documentation for 17 music metadata projects
2026-04-28 16:28:53 +02:00

16 KiB

MusicBrainz Server Data Layer

Database Overview

Engine: PostgreSQL 16+
Tables: 375
Foreign Key Constraints: 500+
Schema Definition: admin/sql/CreateTables.sql (4,068 lines)
Production Size: ~350GB (full dataset with indexes)

PostgreSQL Schema

Core Entity Tables

Artists:

  • artist - Artist entities (bands, musicians, orchestras, etc.)
  • artist_alias - Alternative names for artists
  • artist_credit - Artist credit configurations
  • artist_credit_name - Individual artists in a credit
  • artist_type - Artist type enumeration (person, group, etc.)
  • artist_tag - Folksonomy tags
  • artist_rating_raw - User ratings
  • artist_annotation - User annotations
  • artist_gid_redirect - MBID redirects after merges

Releases:

  • release - Release entities (albums, singles, etc.)
  • release_alias - Alternative release names
  • release_group - Logical grouping of releases
  • release_group_primary_type - Album, Single, EP, etc.
  • release_group_secondary_type - Compilation, Live, Remix, etc.
  • release_status - Official, Promotion, Bootleg, etc.
  • release_packaging - Jewel Case, Digipak, etc.
  • release_label - Labels associated with release
  • release_country - Release events by country
  • release_tag - Folksonomy tags
  • release_rating_raw - User ratings
  • release_annotation - User annotations
  • release_gid_redirect - MBID redirects

Recordings:

  • recording - Recording entities (unique audio recordings)
  • recording_alias - Alternative recording names
  • recording_tag - Folksonomy tags
  • recording_rating_raw - User ratings
  • recording_annotation - User annotations
  • recording_gid_redirect - MBID redirects
  • isrc - International Standard Recording Codes
  • recording_isrc - Recording to ISRC mapping

Works:

  • work - Musical composition entities
  • work_alias - Alternative work names
  • work_type - Song, Symphony, Opera, etc.
  • work_attribute - Work attributes (key, tempo, etc.)
  • work_attribute_type - Attribute type definitions
  • work_tag - Folksonomy tags
  • work_rating_raw - User ratings
  • work_annotation - User annotations
  • work_gid_redirect - MBID redirects
  • iswc - International Standard Musical Work Codes
  • work_iswc - Work to ISWC mapping

Labels:

  • label - Record label entities
  • label_alias - Alternative label names
  • label_type - Original Production, Bootleg Production, etc.
  • label_tag - Folksonomy tags
  • label_rating_raw - User ratings
  • label_annotation - User annotations
  • label_gid_redirect - MBID redirects

Geographic:

  • area - Geographic areas (countries, cities, etc.)
  • area_alias - Alternative area names
  • area_type - Country, Subdivision, City, etc.
  • area_tag - Folksonomy tags
  • area_annotation - User annotations
  • area_gid_redirect - MBID redirects
  • country_area - ISO country code mapping
  • iso_3166_1 - ISO 3166-1 country codes
  • iso_3166_2 - ISO 3166-2 subdivision codes
  • iso_3166_3 - ISO 3166-3 former country codes

Events:

  • event - Event entities (concerts, festivals, etc.)
  • event_alias - Alternative event names
  • event_type - Concert, Festival, etc.
  • event_tag - Folksonomy tags
  • event_rating_raw - User ratings
  • event_annotation - User annotations
  • event_gid_redirect - MBID redirects

Places:

  • place - Venue/location entities
  • place_alias - Alternative place names
  • place_type - Venue, Studio, etc.
  • place_tag - Folksonomy tags
  • place_annotation - User annotations
  • place_gid_redirect - MBID redirects

Series:

  • series - Ordered sequence entities
  • series_alias - Alternative series names
  • series_type - Release group series, etc.
  • series_ordering_type - Automatic, Manual
  • series_tag - Folksonomy tags
  • series_annotation - User annotations
  • series_gid_redirect - MBID redirects

Instruments:

  • instrument - Musical instrument entities
  • instrument_alias - Alternative instrument names
  • instrument_type - Wind, String, Percussion, etc.
  • instrument_tag - Folksonomy tags
  • instrument_annotation - User annotations
  • instrument_gid_redirect - MBID redirects

Genres:

  • genre - Genre entities
  • genre_alias - Alternative genre names
  • genre_annotation - User annotations
  • genre_gid_redirect - MBID redirects

URLs:

  • url - External URL entities
  • url_gid_redirect - MBID redirects

Relationship Tables (l_* tables)

Pattern: l_{entity1}_{entity2} for relationships between entities.

Examples:

  • l_artist_artist - Artist-to-artist relationships (member of, collaboration, etc.)
  • l_artist_recording - Artist-to-recording relationships (performer, conductor, etc.)
  • l_artist_release - Artist-to-release relationships
  • l_artist_release_group - Artist-to-release-group relationships
  • l_artist_work - Artist-to-work relationships (composer, lyricist, etc.)
  • l_artist_url - Artist-to-URL relationships (official homepage, social media, etc.)
  • l_recording_work - Recording-to-work relationships (performance of)
  • l_release_release_group - Release-to-release-group relationships
  • l_release_url - Release-to-URL relationships (purchase links, streaming, etc.)

Relationship Support Tables:

  • link - Link instances
  • link_type - Relationship type definitions
  • link_attribute - Relationship attributes
  • link_attribute_type - Attribute type definitions
  • link_crediting - Custom relationship credits
  • link_text_attribute - Text attributes for relationships

Media Tables

Physical Media:

  • medium - Physical media (CDs, vinyl, etc.)
  • medium_format - CD, Vinyl, Digital Media, etc.
  • medium_cdtoc - CD table of contents
  • cdtoc - CD TOC data
  • cdtoc_raw - Raw CD TOC data

Tracks:

  • track - Individual tracks on media
  • track_gid_redirect - Track MBID redirects

Metadata Tables

Tags:

  • tag - Tag definitions
  • tag_relation - Tag relationships
  • {entity}_tag - Tags per entity type
  • {entity}_tag_raw - Raw user tag submissions

Ratings:

  • {entity}_rating_raw - Raw user ratings per entity type

Annotations:

  • annotation - Annotation text
  • {entity}_annotation - Annotations per entity type

Collections:

  • editor_collection - User collections
  • editor_collection_type - Collection type (release, artist, etc.)
  • editor_collection_{entity} - Collection contents per entity type

Editorial Tables

Edits:

  • edit - Edit submissions
  • edit_data - Edit-specific data (JSON)
  • edit_{entity} - Edit to entity mappings
  • vote - User votes on edits
  • edit_note - Discussion notes on edits
  • edit_note_recipient - Edit note notifications

Editors:

  • editor - User accounts
  • editor_preference - User preferences
  • editor_language - User language preferences
  • editor_subscribe_artist - Artist subscriptions
  • editor_subscribe_collection - Collection subscriptions
  • editor_subscribe_label - Label subscriptions
  • editor_subscribe_series - Series subscriptions
  • editor_subscribe_editor - Editor subscriptions
  • editor_oauth_token - OAuth tokens
  • application - OAuth applications

Moderation:

  • autoeditor_election - Auto-editor elections
  • autoeditor_election_vote - Election votes
  • editor_watch_preferences - Watchlist preferences
  • editor_watch_artist - Artist watchlist
  • editor_watch_release_group_type - Release group type filters
  • editor_watch_release_status - Release status filters

Identifier Tables

Standard Identifiers:

  • isrc - International Standard Recording Code
  • iswc - International Standard Musical Work Code
  • recording_isrc - Recording to ISRC mapping
  • work_iswc - Work to ISWC mapping

MusicBrainz Identifiers:

  • {entity}_gid_redirect - MBID redirects after merges

Barcodes:

  • release_barcode - Release barcodes (EAN, UPC)

Replication Tables (dbmirror2)

Replication System:

  • dbmirror_pending - Pending replication packets
  • dbmirror_pendingdata - Replication data
  • replication_control - Replication state tracking

Modes:

  • RT_MASTER - Master database (generates replication packets)
  • RT_MIRROR - Mirror database (consumes replication packets)
  • RT_STANDALONE - Standalone database (no replication)

Auxiliary Tables

Statistics:

  • statistic - Cached statistics
  • statistic_event - Statistic calculation events

Documentation:

  • documentation.l_{entity1}_{entity2}_example - Relationship examples

Deprecated:

  • Various _deleted tables for soft deletes

Schema Management

CreateTables.sql

Location: admin/sql/CreateTables.sql
Size: 4,068 lines
Purpose: Complete schema definition for fresh installations

Structure:

-- Core entity tables
CREATE TABLE artist (...);
CREATE TABLE release (...);
CREATE TABLE recording (...);

-- Indexes
CREATE INDEX artist_idx_name ON artist (name);
CREATE INDEX artist_idx_gid ON artist (gid);

-- Foreign keys
ALTER TABLE artist_credit_name 
  ADD CONSTRAINT artist_credit_name_fk_artist 
  FOREIGN KEY (artist) REFERENCES artist(id);

-- Triggers
CREATE TRIGGER a_ins_artist AFTER INSERT ON artist ...;

Migration System

Location: admin/sql/updates/
Count: 332 migration files
Naming: Date-based (YYYYMMDD-HHMMSS-description.sql)

Example Filenames:

  • 20230115-mbs-12345-add-genre-table.sql
  • 20230220-mbs-12346-add-event-series-relationship.sql
  • 20230315-mbs-12347-add-recording-length-index.sql

Migration Structure:

\set ON_ERROR_STOP 1

BEGIN;

-- Schema changes
ALTER TABLE artist ADD COLUMN disambiguation TEXT;

-- Data migrations
UPDATE artist SET disambiguation = '' WHERE disambiguation IS NULL;

-- Constraints
ALTER TABLE artist ALTER COLUMN disambiguation SET NOT NULL;

COMMIT;

Schema Change Variants:

  • schema-change/ subdirectory contains master/mirror variants
  • Master migrations may include replication setup
  • Mirror migrations skip replication-specific changes

Migration Tracking:

  • Migrations are tracked in the database
  • Applied migrations recorded to prevent re-application
  • Rollback not supported (forward-only migrations)

Custom ORM (Moose-based Data Layer)

Architecture

NOT DBIx::Class - MusicBrainz uses a custom Moose-based data access layer.

Components:

  • 106 Data modules in lib/MusicBrainz/Server/Data/
  • DBIx::Connector for connection pooling
  • Sql.pm for query abstraction
  • Raw SQL via DBD::Pg

Data Module Pattern

Base Class: MusicBrainz::Server::Data::Entity

Example:

package MusicBrainz::Server::Data::Artist;
use Moose;
extends 'MusicBrainz::Server::Data::Entity';

with 'MusicBrainz::Server::Data::Role::Editable';
with 'MusicBrainz::Server::Data::Role::LinksToEdit';
with 'MusicBrainz::Server::Data::Role::Merge';

sub _table { 'artist' }
sub _entity_class { 'MusicBrainz::Server::Entity::Artist' }

sub _columns {
    return 'id, gid, name, sort_name, begin_date_year, begin_date_month, 
            begin_date_day, end_date_year, end_date_month, end_date_day, 
            type, area, gender, comment, edits_pending, last_updated, 
            ended, begin_area, end_area';
}

sub _column_mapping {
    return {
        id => 'id',
        gid => 'gid',
        name => 'name',
        sort_name => 'sort_name',
        type_id => 'type',
        area_id => 'area',
        gender_id => 'gender',
        comment => 'comment',
        edits_pending => 'edits_pending',
        last_updated => 'last_updated',
        ended => 'ended',
        begin_area_id => 'begin_area',
        end_area_id => 'end_area',
    };
}

sub get_by_gid {
    my ($self, $gid) = @_;
    return $self->_get_by_key('gid', $gid);
}

sub insert {
    my ($self, $data) = @_;
    my $row = $self->_hash_to_row($data);
    my $id = $self->sql->insert_row('artist', $row, 'id');
    return $self->_new_from_row($row);
}

Moose Roles

Role::Editable:

  • Entities that can be edited via the edit system
  • Provides load_meta() for edit counts

Role::Taggable:

  • Entities that support folksonomy tags
  • Provides tags(), add_tags(), remove_tags()

Role::Rateable:

  • Entities that can be rated (0-100 scale)
  • Provides rating(), user_rating()

Role::Relatable:

  • Entities that can have relationships
  • Provides relationships(), add_relationship()

Role::Aliasable:

  • Entities that can have alternative names
  • Provides aliases(), add_alias()

Role::Annotation:

  • Entities that can be annotated
  • Provides latest_annotation()

Sql.pm Abstraction

Location: lib/MusicBrainz/Server/Sql.pm

Purpose: Thin abstraction over DBI for common query patterns.

Methods:

# Single row
my $row = $sql->select_single_row_hash(
    'SELECT * FROM artist WHERE gid = ?', $gid
);

# Multiple rows
my $rows = $sql->select_list_of_hashes(
    'SELECT * FROM artist WHERE area = ?', $area_id
);

# Insert
my $id = $sql->insert_row('artist', {
    gid => $gid,
    name => $name,
    sort_name => $sort_name,
}, 'id');

# Update
$sql->update_row('artist', {
    name => $new_name,
}, { id => $artist_id });

# Delete
$sql->delete_row('artist', { id => $artist_id });

# Transaction
$sql->begin;
eval {
    $sql->insert_row(...);
    $sql->update_row(...);
    $sql->commit;
};
if ($@) {
    $sql->rollback;
    die $@;
}

DBIx::Connector

Purpose: Fast, safe DBI connection management with automatic reconnection.

Configuration:

my $conn = DBIx::Connector->new(
    $dsn, $username, $password,
    {
        RaiseError => 1,
        AutoCommit => 1,
        pg_enable_utf8 => 1,
    }
);

# Execute with automatic reconnection
$conn->run(sub {
    my $dbh = $_;
    $dbh->do('SELECT ...');
});

Search Infrastructure

Apache Solr (Primary)

Purpose: Full-text search across all entities

Cores:

  • artist - Artist search
  • release - Release search
  • release-group - Release group search
  • recording - Recording search
  • work - Work search
  • label - Label search
  • area - Area search
  • event - Event search
  • place - Place search
  • series - Series search
  • instrument - Instrument search
  • tag - Tag search

Indexing:

  • Incremental updates via edit system
  • Full reindex via admin/BuildSearchIndexes.pl
  • Real-time updates for new entities

Query Features:

  • Fuzzy matching
  • Phrase search
  • Boolean operators (AND, OR, NOT)
  • Field-specific search (artist:nirvana)
  • Wildcards (nirv*)
  • Proximity search ("smells spirit"~5)

PostgreSQL Full-Text (Fallback)

Purpose: Fallback when Solr is unavailable

Implementation:

  • mb_simple_tsvector function for text vectorization
  • GIN indexes on tsvector columns
  • to_tsquery() for query parsing

Example:

CREATE INDEX artist_idx_name_txt ON artist 
  USING gin(mb_simple_tsvector(name));

SELECT * FROM artist 
WHERE mb_simple_tsvector(name) @@ to_tsquery('simple', 'nirvana');

Limitations:

  • Less sophisticated than Solr
  • No fuzzy matching
  • Limited ranking
  • Used only as emergency fallback

Redis Caching

Architecture

Databases: 16 separate Redis databases (0-15)

Database Allocation:

  • DB 0: Entity cache (GID lookups)
  • DB 1: Session storage
  • DB 2-15: Various caches (search, statistics, etc.)

Entity Cache (GID Cache)

Purpose: Cache entity lookups by MBID (GID)

Pattern:

# Cache key: entity:gid:{gid}
my $cache_key = "artist:gid:$gid";

# Try cache first
my $cached = $redis->get($cache_key);
if ($cached) {
    return decode_json($cached);
}

# Cache miss - load from database
my $artist = $self->sql->select_single_row_hash(
    'SELECT * FROM artist WHERE gid = ?', $gid
);

# Store in cache (1 hour TTL)
$redis->setex($cache_key, 3600, encode_json($artist));

return $artist;

TTL: 1 hour (3600 seconds)

Invalidation: On edit application

Session Storage

Purpose: Store user sessions

Pattern:

# Session key: session:{session_id}
my $session_key = "session:$session_id";

# Store session
$redis->setex($session_key, 36000, encode_json({
    user_id => $user_id,
    csrf_token => $csrf_token,
    last_activity => time(),
}));

# Retrieve session
my $session = decode_json($redis->get($session_key));

TTL: 10 hours absolute, 3 hours idle

Cookie: AF_SID (SameSite=Lax, Secure, HttpOnly)

Cache Invalidation

Strategy: Invalidate on write

Example:

# After updating artist
$self->sql->update_row('artist', { name => $new_name }, { id => $id });

# Invalidate cache
$redis->del("artist:gid:$gid");

Bulk Invalidation:

  • Pattern-based deletion via SCAN + DEL
  • Used for relationship changes affecting multiple entities