Files
metadata-agregator/docs/research/lidarr-metadata-api/analysis/DATA.md
T
Alexander a1f6701bac feat: initial implementation of metadata aggregator
- gRPC service with MusicBrainz provider
- PostgreSQL schema with migrations
- Service layer with database-first caching
- Repository pattern for data access
- YAML configuration support
- Research documentation for 17 music metadata projects
2026-04-28 16:28:53 +02:00

32 KiB

Lidarr Metadata API - Data Layer

Data Source Overview

The Lidarr Metadata API integrates with four primary data storage systems:

System Purpose Size Technology Access Pattern
MusicBrainz PostgreSQL Authoritative music metadata 100GB+ PostgreSQL 12+ Direct SQL (asyncpg)
Cache Database Persistent metadata cache 10-50GB PostgreSQL 12+ Direct SQL (asyncpg)
Redis Ephemeral cache + rate limiting 512MB Redis 6+ Key-value (aioredis)
Solr Full-text search index 8GB+ Solr 8.x HTTP REST API

MusicBrainz PostgreSQL Database

Database Overview

Purpose: Authoritative source for all music metadata

Access method: Direct read-only SQL queries via asyncpg

Replication: Hourly updates from MusicBrainz master database

Container image: ghcr.io/lidarr/mb-postgres:1.0.10

Connection configuration:

DATABASE = {
    'host': 'musicbrainz',
    'port': 5432,
    'database': 'musicbrainz_db',
    'user': 'abc',
    'password': 'abc',
    'min_pool_size': 10,
    'max_pool_size': 50,
    'command_timeout': 30
}

Core Tables

artist

Purpose: Artist entities (musicians, bands, orchestras, etc.)

Key columns:

Column Type Description
id INTEGER Primary key
gid UUID MusicBrainz ID (public identifier)
name VARCHAR Artist name
sort_name VARCHAR Name for alphabetical sorting
type INTEGER Artist type (Person, Group, etc.)
gender INTEGER Gender (for Person type)
area INTEGER Geographic area
begin_date_year SMALLINT Formation/birth year
end_date_year SMALLINT Dissolution/death year
comment VARCHAR Disambiguation comment
last_updated TIMESTAMP Last modification timestamp

Indices:

CREATE INDEX idx_artist_gid ON artist (gid);
CREATE INDEX idx_artist_name ON artist (name);
CREATE INDEX idx_artist_last_updated ON artist (last_updated DESC);

Row count: ~2 million artists

release_group

Purpose: Album groupings (abstract releases)

Key columns:

Column Type Description
id INTEGER Primary key
gid UUID MusicBrainz ID
name VARCHAR Album title
artist_credit INTEGER Artist credit ID
type INTEGER Primary type (Album, Single, EP, etc.)
comment VARCHAR Disambiguation
last_updated TIMESTAMP Last modification timestamp

Indices:

CREATE INDEX idx_release_group_gid ON release_group (gid);
CREATE INDEX idx_release_group_artist_credit ON release_group (artist_credit);
CREATE INDEX idx_release_group_last_updated ON release_group (last_updated DESC);

Row count: ~3 million release groups

release

Purpose: Specific releases (physical/digital products)

Key columns:

Column Type Description
id INTEGER Primary key
gid UUID MusicBrainz ID
name VARCHAR Release title
release_group INTEGER Release group ID
artist_credit INTEGER Artist credit ID
status INTEGER Release status (Official, Promotion, etc.)
packaging INTEGER Packaging type
barcode VARCHAR Barcode
last_updated TIMESTAMP Last modification timestamp

Indices:

CREATE INDEX idx_release_gid ON release (gid);
CREATE INDEX idx_release_release_group ON release (release_group);
CREATE INDEX idx_release_last_updated ON release (last_updated DESC);

Row count: ~5 million releases

medium

Purpose: Physical/digital media (CDs, Vinyl, Digital, etc.)

Key columns:

Column Type Description
id INTEGER Primary key
release INTEGER Release ID
position INTEGER Disc number
format INTEGER Medium format (CD, Vinyl, etc.)
name VARCHAR Medium name (e.g., "Bonus Disc")
track_count INTEGER Number of tracks

Indices:

CREATE INDEX idx_medium_release ON medium (release);

Row count: ~6 million media

track

Purpose: Track listings on media

Key columns:

Column Type Description
id INTEGER Primary key
gid UUID MusicBrainz ID
recording INTEGER Recording ID
medium INTEGER Medium ID
position INTEGER Track number
number VARCHAR Track number (string, e.g., "A1")
name VARCHAR Track title
length INTEGER Duration in milliseconds

Indices:

CREATE INDEX idx_track_medium ON track (medium);
CREATE INDEX idx_track_recording ON track (recording);

Row count: ~50 million tracks

recording

Purpose: Abstract recordings (audio content)

Key columns:

Column Type Description
id INTEGER Primary key
gid UUID MusicBrainz ID
name VARCHAR Recording title
artist_credit INTEGER Artist credit ID
length INTEGER Duration in milliseconds
comment VARCHAR Disambiguation
last_updated TIMESTAMP Last modification timestamp

Row count: ~40 million recordings

url

Purpose: External URLs (websites, streaming services, etc.)

Key columns:

Column Type Description
id INTEGER Primary key
gid UUID MusicBrainz ID
url TEXT URL string

Indices:

CREATE INDEX idx_url_url ON url (url);

Row count: ~10 million URLs

l_artist_url

Purpose: Artist-URL relationships (links)

Key columns:

Column Type Description
id INTEGER Primary key
link INTEGER Link type ID
entity0 INTEGER Artist ID
entity1 INTEGER URL ID
last_updated TIMESTAMP Last modification timestamp

Indices:

CREATE INDEX idx_l_artist_url_entity0 ON l_artist_url (entity0);
CREATE INDEX idx_l_artist_url_entity1 ON l_artist_url (entity1);
CREATE INDEX idx_l_artist_url_last_updated ON l_artist_url (last_updated DESC);

Row count: ~5 million links

cover_art_archive.index_listing

Purpose: Cover art availability tracking

Key columns:

Column Type Description
release INTEGER Release ID
date_updated TIMESTAMP Last cover art update

Indices:

CREATE INDEX idx_cover_art_release ON cover_art_archive.index_listing (release);
CREATE INDEX idx_cover_art_date_updated ON cover_art_archive.index_listing (date_updated DESC);

Row count: ~2 million releases with cover art

replication_control

Purpose: Replication status tracking

Key columns:

Column Type Description
current_schema_sequence INTEGER Current schema version
current_replication_sequence INTEGER Current replication packet
last_replication_date TIMESTAMP Last replication timestamp

Usage: Monitoring replication lag and detecting updates

Custom Indices

To support efficient change detection and queries, custom indices are created:

-- Artist change detection
CREATE INDEX IF NOT EXISTS idx_artist_last_updated
ON artist (last_updated DESC)
WHERE last_updated IS NOT NULL;

-- Release group change detection
CREATE INDEX IF NOT EXISTS idx_release_group_last_updated
ON release_group (last_updated DESC)
WHERE last_updated IS NOT NULL;

-- Release change detection
CREATE INDEX IF NOT EXISTS idx_release_last_updated
ON release (last_updated DESC)
WHERE last_updated IS NOT NULL;

-- Link change detection
CREATE INDEX IF NOT EXISTS idx_l_artist_url_last_updated
ON l_artist_url (last_updated DESC)
WHERE last_updated IS NOT NULL;

-- Cover art change detection
CREATE INDEX IF NOT EXISTS idx_cover_art_date_updated
ON cover_art_archive.index_listing (date_updated DESC)
WHERE date_updated IS NOT NULL;

SQL Query Files

artist.sql

Purpose: Fetch complete artist metadata with releases

Location: lidarrmetadata/sql/artist.sql

Parameters:

  • $1: Artist MBID (UUID)
  • $2: Primary release types (array)
  • $3: Secondary release types (array)
  • $4: Release statuses (array)

Query structure:

WITH artist_data AS (
    SELECT
        a.gid AS id,
        a.name AS artist_name,
        a.sort_name,
        a.comment AS disambiguation,
        at.name AS artist_type,
        g.name AS gender,
        ar.name AS hometown,
        a.begin_date_year AS start_year,
        a.end_date_year AS end_year
    FROM artist a
    LEFT JOIN artist_type at ON a.type = at.id
    LEFT JOIN gender g ON a.gender = g.id
    LEFT JOIN area ar ON a.area = ar.id
    WHERE a.gid = $1
),
releases AS (
    SELECT
        rg.gid AS id,
        rg.name AS title,
        rg.comment AS disambiguation,
        rgt.name AS primary_type,
        rgst.name AS secondary_type,
        rs.name AS status,
        COALESCE(
            TO_CHAR(DATE(rd.date_year || '-' || COALESCE(rd.date_month, 1) || '-' || COALESCE(rd.date_day, 1)), 'YYYY-MM-DD'),
            ''
        ) AS release_date
    FROM artist a
    JOIN release_group rg ON rg.artist_credit = a.id
    LEFT JOIN release_group_primary_type rgt ON rg.type = rgt.id
    LEFT JOIN release_group_secondary_type_join rgstj ON rgstj.release_group = rg.id
    LEFT JOIN release_group_secondary_type rgst ON rgstj.secondary_type = rgst.id
    LEFT JOIN release r ON r.release_group = rg.id
    LEFT JOIN release_status rs ON r.status = rs.id
    LEFT JOIN (
        SELECT release, MIN(date_year) AS date_year, MIN(date_month) AS date_month, MIN(date_day) AS date_day
        FROM release_country
        GROUP BY release
    ) rd ON rd.release = r.id
    WHERE a.gid = $1
        AND (ARRAY_LENGTH($2::text[], 1) IS NULL OR rgt.name = ANY($2))
        AND (ARRAY_LENGTH($3::text[], 1) IS NULL OR rgst.name = ANY($3))
        AND (ARRAY_LENGTH($4::text[], 1) IS NULL OR rs.name = ANY($4))
    GROUP BY rg.id, rg.name, rg.comment, rgt.name, rgst.name, rs.name, rd.date_year, rd.date_month, rd.date_day
    ORDER BY rd.date_year DESC NULLS LAST, rd.date_month DESC NULLS LAST, rd.date_day DESC NULLS LAST
),
links AS (
    SELECT
        lt.name AS link_type,
        u.url
    FROM artist a
    JOIN l_artist_url lau ON lau.entity0 = a.id
    JOIN url u ON lau.entity1 = u.id
    JOIN link l ON lau.link = l.id
    JOIN link_type lt ON l.link_type = lt.id
    WHERE a.gid = $1
)
SELECT
    row_to_json(artist_data.*) AS artist,
    COALESCE(
        (SELECT json_agg(row_to_json(releases.*)) FROM releases),
        '[]'::json
    ) AS albums,
    COALESCE(
        (SELECT json_agg(row_to_json(links.*)) FROM links),
        '[]'::json
    ) AS links
FROM artist_data;

Performance: 100-500ms depending on artist discography size

Result format: Single row with three JSON columns (artist, albums, links)

album.sql

Purpose: Fetch complete album metadata with tracks

Location: lidarrmetadata/sql/album.sql

Parameters:

  • $1: Release group MBID (UUID)

Query structure:

WITH album_data AS (
    SELECT
        rg.gid AS id,
        rg.name AS title,
        rg.comment AS disambiguation,
        rgt.name AS primary_type,
        rs.name AS status,
        a.gid AS artist_id,
        a.name AS artist_name,
        COALESCE(
            TO_CHAR(DATE(rd.date_year || '-' || COALESCE(rd.date_month, 1) || '-' || COALESCE(rd.date_day, 1)), 'YYYY-MM-DD'),
            ''
        ) AS release_date
    FROM release_group rg
    JOIN artist_credit ac ON rg.artist_credit = ac.id
    JOIN artist_credit_name acn ON acn.artist_credit = ac.id
    JOIN artist a ON acn.artist = a.id
    LEFT JOIN release_group_primary_type rgt ON rg.type = rgt.id
    LEFT JOIN release r ON r.release_group = rg.id
    LEFT JOIN release_status rs ON r.status = rs.id
    LEFT JOIN (
        SELECT release, MIN(date_year) AS date_year, MIN(date_month) AS date_month, MIN(date_day) AS date_day
        FROM release_country
        GROUP BY release
    ) rd ON rd.release = r.id
    WHERE rg.gid = $1
    LIMIT 1
),
media AS (
    SELECT
        m.position,
        mf.name AS format,
        json_agg(
            json_build_object(
                'position', t.position,
                'title', t.name,
                'duration', t.length,
                'artist_name', ta.name
            )
            ORDER BY t.position
        ) AS tracks
    FROM release_group rg
    JOIN release r ON r.release_group = rg.id
    JOIN medium m ON m.release = r.id
    LEFT JOIN medium_format mf ON m.format = mf.id
    JOIN track t ON t.medium = m.id
    JOIN recording rec ON t.recording = rec.id
    JOIN artist_credit tac ON rec.artist_credit = tac.id
    JOIN artist_credit_name tacn ON tacn.artist_credit = tac.id
    JOIN artist ta ON tacn.artist = ta.id
    WHERE rg.gid = $1
    GROUP BY m.id, m.position, mf.name
    ORDER BY m.position
)
SELECT
    row_to_json(album_data.*) AS album,
    COALESCE(
        (SELECT json_agg(row_to_json(media.*)) FROM media),
        '[]'::json
    ) AS media
FROM album_data;

Performance: 200-1000ms depending on track count

updated_artists.sql

Purpose: Detect recently updated artists for cache invalidation

Location: lidarrmetadata/sql/updated_artists.sql

Parameters:

  • $1: Timestamp threshold (only artists updated after this)
  • $2: Result limit

Query structure (UNION of 5 change sources):

-- Source 1: Artists with updated metadata
SELECT DISTINCT
    a.gid,
    a.last_updated,
    'metadata' AS change_type
FROM artist a
WHERE a.last_updated > $1

UNION

-- Source 2: Artists with new release groups
SELECT DISTINCT
    a.gid,
    rg.last_updated,
    'new_release' AS change_type
FROM artist a
JOIN release_group rg ON rg.artist_credit = a.id
WHERE rg.last_updated > $1

UNION

-- Source 3: Artists with updated releases
SELECT DISTINCT
    a.gid,
    r.last_updated,
    'updated_release' AS change_type
FROM artist a
JOIN release_group rg ON rg.artist_credit = a.id
JOIN release r ON r.release_group = rg.id
WHERE r.last_updated > $1

UNION

-- Source 4: Artists with new/updated links
SELECT DISTINCT
    a.gid,
    lau.last_updated,
    'new_link' AS change_type
FROM artist a
JOIN l_artist_url lau ON lau.entity0 = a.id
WHERE lau.last_updated > $1

UNION

-- Source 5: Artists with updated cover art
SELECT DISTINCT
    a.gid,
    caa.date_updated AS last_updated,
    'cover_art' AS change_type
FROM artist a
JOIN release_group rg ON rg.artist_credit = a.id
JOIN release r ON r.release_group = rg.id
JOIN cover_art_archive.index_listing caa ON caa.release = r.id
WHERE caa.date_updated > $1

ORDER BY last_updated DESC
LIMIT $2;

Performance: 500-2000ms depending on time window

Use case: Crawler scheduling, cache invalidation

updated_albums.sql

Purpose: Detect recently updated albums

Location: lidarrmetadata/sql/updated_albums.sql

Parameters: Same as updated_artists.sql

Query structure: Similar UNION pattern with 5 change sources:

  1. Release group metadata updates
  2. New releases in group
  3. Updated releases in group
  4. New/updated links
  5. Cover art updates

Database Replication

Replication method: MusicBrainz replication packets

Update frequency: Hourly

Replication process:

  1. Check replication_control table for current sequence
  2. Fetch replication packets from MusicBrainz FTP
  3. Apply SQL changes from packets
  4. Update replication_control table
  5. Trigger search index updates via RabbitMQ

Monitoring:

SELECT
    current_replication_sequence,
    last_replication_date,
    NOW() - last_replication_date AS replication_lag
FROM replication_control;

Typical lag: 1-2 hours behind MusicBrainz master

Cache Database (PostgreSQL)

Database Overview

Purpose: Persistent cache storage with compression

Technology: PostgreSQL 12+ (same instance as MusicBrainz or separate)

Database name: lm_cache_db

Connection configuration:

CACHE_DATABASE = {
    'host': 'localhost',
    'port': 5432,
    'database': 'lm_cache_db',
    'user': 'abc',
    'password': 'abc'
}

Auto-Created Cache Tables

Each cache type gets its own table with identical schema:

Table names:

  • artist: Artist metadata cache
  • album: Album metadata cache
  • spotify: Spotify lookup cache
  • fanart: FanArt.tv image cache
  • tadb: TheAudioDB metadata cache
  • wikipedia: Wikipedia overview cache

Schema:

CREATE TABLE IF NOT EXISTS {cache_name} (
    key VARCHAR(255) PRIMARY KEY,
    expires TIMESTAMP,
    updated TIMESTAMP DEFAULT NOW(),
    value BYTEA
);

CREATE INDEX IF NOT EXISTS idx_{cache_name}_expires
ON {cache_name} (expires)
WHERE expires IS NOT NULL;

CREATE INDEX IF NOT EXISTS idx_{cache_name}_updated
ON {cache_name} (updated DESC);

Trigger for automatic timestamp updates:

CREATE OR REPLACE FUNCTION update_updated_column()
RETURNS TRIGGER AS $$
BEGIN
    NEW.updated = NOW();
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER update_{cache_name}_updated
    BEFORE UPDATE ON {cache_name}
    FOR EACH ROW
    EXECUTE FUNCTION update_updated_column();

Cache Entry Format

Key structure: {cache_type}:{identifier}:{parameters}

Examples:

  • artist:5b11f4ce-a62d-471e-81fc-a69a8278c7da:Album:Official
  • album:1b022e01-4da6-387b-8658-8678046e4cef
  • spotify:artist:6olE6TJLqED3rqDCT0FyPh
  • wikipedia:5b11f4ce-a62d-471e-81fc-a69a8278c7da:en

Value format: zlib-compressed pickle

Compression implementation:

import zlib
import pickle

def compress_value(value):
    """Compress Python object for storage"""
    pickled = pickle.dumps(value, protocol=pickle.HIGHEST_PROTOCOL)
    compressed = zlib.compress(pickled, level=6)
    return compressed

def decompress_value(compressed):
    """Decompress stored value to Python object"""
    pickled = zlib.decompress(compressed)
    value = pickle.loads(pickled)
    return value

Compression ratio: Typically 10:1 for JSON metadata

Example:

  • Uncompressed artist metadata: 50KB JSON
  • Pickled: 52KB
  • Compressed: 5KB
  • Storage savings: 90%

Cache Operations

Insert/Update

async def cache_set(key, value, ttl=2592000):
    """Store value in cache with optional TTL"""
    compressed = zlib.compress(pickle.dumps(value))
    expires = datetime.now() + timedelta(seconds=ttl) if ttl else None
    
    await conn.execute(
        """
        INSERT INTO artist (key, value, expires)
        VALUES ($1, $2, $3)
        ON CONFLICT (key) DO UPDATE
        SET value = $2, expires = $3, updated = NOW()
        """,
        key, compressed, expires
    )

Retrieve

async def cache_get(key):
    """Retrieve value from cache"""
    row = await conn.fetchrow(
        """
        SELECT value, expires
        FROM artist
        WHERE key = $1
        """,
        key
    )
    
    if not row:
        return None
    
    # Check expiration
    if row['expires'] and row['expires'] < datetime.now():
        # Expired, delete and return None
        await conn.execute("DELETE FROM artist WHERE key = $1", key)
        return None
    
    # Decompress and return
    value = pickle.loads(zlib.decompress(row['value']))
    return value

Delete

async def cache_delete(key):
    """Delete value from cache"""
    await conn.execute("DELETE FROM artist WHERE key = $1", key)

Cleanup Expired Entries

async def cache_cleanup():
    """Remove expired entries"""
    deleted = await conn.execute(
        """
        DELETE FROM artist
        WHERE expires IS NOT NULL AND expires < NOW()
        """
    )
    return deleted

Cleanup schedule: Daily via cron or crawler

Cache Statistics

Query for cache statistics:

SELECT
    'artist' AS cache_name,
    COUNT(*) AS total_entries,
    COUNT(*) FILTER (WHERE expires IS NOT NULL AND expires < NOW()) AS expired_entries,
    COUNT(*) FILTER (WHERE expires IS NULL OR expires >= NOW()) AS valid_entries,
    pg_size_pretty(pg_total_relation_size('artist')) AS total_size,
    AVG(LENGTH(value)) AS avg_value_size,
    MAX(updated) AS last_updated
FROM artist;

Example output:

cache_name | total_entries | expired_entries | valid_entries | total_size | avg_value_size | last_updated
-----------+---------------+-----------------+---------------+------------+----------------+---------------------
artist     | 125000        | 5000            | 120000        | 2048 MB    | 5120           | 2025-04-28 12:34:56

Redis Cache

Redis Overview

Purpose: Ephemeral cache for hot data and rate limiting

Technology: Redis 6+

Memory limit: 512MB

Eviction policy: LFU (Least Frequently Used)

Namespace: lm3.7

Connection configuration:

REDIS_URL = 'redis://localhost:6379/0'
REDIS_MIN_POOL_SIZE = 5
REDIS_MAX_POOL_SIZE = 20

Redis Configuration

redis.conf settings:

maxmemory 512mb
maxmemory-policy allkeys-lfu
save ""  # Disable persistence
appendonly no  # Disable AOF

Rationale:

  • LFU eviction keeps most-accessed data in cache
  • No persistence needed (PostgreSQL is persistent layer)
  • Maximum performance for cache operations

Key Structure

Namespace prefix: All keys prefixed with lm3.7:

Key patterns:

  • lm3.7:artist:{mbid}:{params}: Artist metadata
  • lm3.7:album:{mbid}: Album metadata
  • lm3.7:search:artist:{query}: Search results
  • lm3.7:ratelimit:{ip}:{window}: Rate limiter state
  • lm3.7:sentry:{error_hash}: Sentry deduplication
  • lm3.7:lock:invalidate:{mbid}: Invalidation locks

Cache Operations

Set with TTL

async def redis_set(key, value, ttl=604800):
    """Store value in Redis with TTL (default 7 days)"""
    compressed = zlib.compress(pickle.dumps(value))
    await redis.setex(f"lm3.7:{key}", ttl, compressed)

Get

async def redis_get(key):
    """Retrieve value from Redis"""
    compressed = await redis.get(f"lm3.7:{key}")
    if not compressed:
        return None
    value = pickle.loads(zlib.decompress(compressed))
    return value

Delete

async def redis_delete(key):
    """Delete value from Redis"""
    await redis.delete(f"lm3.7:{key}")

Batch Delete

async def redis_delete_pattern(pattern):
    """Delete all keys matching pattern"""
    cursor = 0
    while True:
        cursor, keys = await redis.scan(cursor, match=f"lm3.7:{pattern}", count=100)
        if keys:
            await redis.delete(*keys)
        if cursor == 0:
            break

Rate Limiting with Redis

Implementation: Sliding window counter

async def rate_limit_check(key, max_requests=100, window=60):
    """Check if request is within rate limit"""
    now = time.time()
    window_key = f"lm3.7:ratelimit:{key}:{int(now / window)}"
    
    # Increment counter
    count = await redis.incr(window_key)
    
    # Set expiration on first request
    if count == 1:
        await redis.expire(window_key, window)
    
    # Check limit
    if count > max_requests:
        raise RateLimitExceeded(f"Rate limit exceeded: {count}/{max_requests}")
    
    return count

Sentry Deduplication with Redis

Purpose: Prevent duplicate error reports

async def sentry_should_send(error_hash):
    """Check if error should be sent to Sentry"""
    key = f"lm3.7:sentry:{error_hash}"
    
    # Check if error seen recently
    if await redis.exists(key):
        return False
    
    # Mark error as seen for 1 hour
    await redis.setex(key, 3600, "1")
    return True

Redis Monitoring

Memory usage:

redis-cli INFO memory

Key count:

redis-cli DBSIZE

Eviction stats:

redis-cli INFO stats | grep evicted

Hit rate:

redis-cli INFO stats | grep keyspace

Solr Search Index

Solr Overview

Purpose: Full-text search for artists and albums

Technology: Apache Solr 8.x

Container image: ghcr.io/lidarr/mb-solr:3.3.1.9

Cores:

  • artist: Artist search index
  • release-group: Album search index

Update method: Real-time via RabbitMQ + SIR (Search Index Rebuilder)

Solr Configuration

solrconfig.xml settings:

<config>
  <requestHandler name="/select" class="solr.SearchHandler">
    <lst name="defaults">
      <str name="echoParams">explicit</str>
      <str name="defType">dismax</str>
      <int name="rows">10</int>
    </lst>
  </requestHandler>
  
  <requestHandler name="/update" class="solr.UpdateRequestHandler" />
  
  <requestHandler name="/admin/ping" class="solr.PingRequestHandler">
    <lst name="invariants">
      <str name="q">*:*</str>
    </lst>
  </requestHandler>
</config>

Artist Core Schema

schema.xml:

<schema name="artist" version="1.6">
  <field name="id" type="string" indexed="true" stored="true" required="true" />
  <field name="artist" type="text_general" indexed="true" stored="true" />
  <field name="sortname" type="text_general" indexed="true" stored="true" />
  <field name="alias" type="text_general" indexed="true" stored="true" multiValued="true" />
  <field name="type" type="string" indexed="true" stored="true" />
  <field name="disambiguation" type="text_general" indexed="true" stored="true" />
  <field name="_version_" type="long" indexed="true" stored="true" />
  
  <uniqueKey>id</uniqueKey>
  
  <copyField source="artist" dest="text" />
  <copyField source="sortname" dest="text" />
  <copyField source="alias" dest="text" />
</schema>

Indexed fields:

  • id: MusicBrainz artist MBID
  • artist: Artist name (boosted 2x in queries)
  • sortname: Sort name
  • alias: Artist aliases (multi-valued)
  • type: Artist type (Person, Group, etc.)
  • disambiguation: Disambiguation comment

Release Group Core Schema

Indexed fields:

  • id: MusicBrainz release group MBID
  • title: Album title (boosted 2x)
  • artist: Artist name
  • type: Primary type (Album, Single, etc.)
  • status: Release status
  • disambiguation: Disambiguation comment

Search Query Format

Dismax query example:

{
  "query": "nirvana",
  "limit": 10,
  "params": {
    "defType": "dismax",
    "qf": "artist^2 sortname alias",
    "mm": "1"
  }
}

Query field weights:

  • artist^2: Artist name (2x boost)
  • sortname: Sort name (1x)
  • alias: Aliases (1x)

Minimum match: At least 1 term must match

Solr Update Process

Real-time updates via RabbitMQ:

  1. MusicBrainz replication applies database changes
  2. Database triggers publish messages to RabbitMQ
  3. SIR (Search Index Rebuilder) consumes messages
  4. SIR queries MusicBrainz database for updated entities
  5. SIR posts updates to Solr cores
  6. Solr commits changes (soft commit every 1s)

RabbitMQ configuration:

RABBITMQ = {
    'host': 'rabbitmq',
    'port': 5672,
    'user': 'abc',
    'password': 'abc',
    'exchange': 'search.index',
    'queue': 'search.index.artist'
}

Update latency: 1-5 seconds from database change to searchable

Solr Performance

Query timeout: 5 seconds

Typical query time: 50-200ms

Index size:

  • Artist core: ~4GB
  • Release group core: ~4GB

Document count:

  • Artist core: ~2 million
  • Release group core: ~3 million

Change Detection System

Overview

Purpose: Identify recently updated entities for cache invalidation

Method: SQL queries tracking changes across multiple sources

Update frequency: Hourly (aligned with MusicBrainz replication)

Artist Change Sources

5 change sources tracked:

  1. Artist metadata updates: artist.last_updated
  2. New release groups: release_group.last_updated
  3. Updated releases: release.last_updated
  4. New/updated links: l_artist_url.last_updated
  5. Cover art updates: cover_art_archive.index_listing.date_updated

Query: updated_artists.sql (UNION of 5 queries)

Performance: 500-2000ms for 24-hour window

Typical results: 1000-5000 artists per hour

Album Change Sources

5 change sources tracked:

  1. Release group metadata updates: release_group.last_updated
  2. New releases in group: release.last_updated
  3. Updated releases in group: release.last_updated
  4. New/updated links: l_release_group_url.last_updated
  5. Cover art updates: cover_art_archive.index_listing.date_updated

Query: updated_albums.sql

Typical results: 2000-10000 albums per hour

Change Detection Workflow

Crawler process:

  1. Query updated_artists.sql with timestamp from last run
  2. For each updated artist:
    • Delete from Redis cache
    • Delete from PostgreSQL cache
    • Purge from Cloudflare CDN
  3. Optionally pre-fetch fresh metadata
  4. Update last run timestamp
  5. Sleep until next cycle

Invalidation vs. pre-fetching:

  • Invalidation only: Fast, minimal API load
  • Pre-fetching: Slower, but ensures cache is warm

Configuration:

CRAWLER_INVALIDATE_ONLY = False  # Pre-fetch after invalidation
CRAWLER_INTERVAL = 3600  # 1 hour
CRAWLER_BATCH_SIZE = 100  # Process 100 entities per batch

Data Consistency

Cache Coherence

Problem: Three cache tiers can become inconsistent

Solution: Hierarchical invalidation

Invalidation order:

  1. Cloudflare CDN (purge API)
  2. Redis (delete key)
  3. PostgreSQL (delete row)

Rationale: Invalidate from edge to origin to prevent stale data propagation

Replication Lag Handling

Problem: MusicBrainz replication has 1-2 hour lag

Solution: Accept eventual consistency

User impact: Newly added artists/albums may not appear in search for 1-2 hours

Mitigation: Manual cache refresh endpoint for urgent updates

Concurrent Update Handling

Problem: Multiple API instances may invalidate cache simultaneously

Solution: Redis-based distributed locks

Implementation:

async def invalidate_with_lock(mbid):
    """Invalidate cache with distributed lock"""
    lock_key = f"lm3.7:lock:invalidate:{mbid}"
    
    # Try to acquire lock (30 second TTL)
    acquired = await redis.set(lock_key, "1", ex=30, nx=True)
    
    if not acquired:
        # Another instance is already invalidating
        return False
    
    try:
        # Perform invalidation
        await invalidate_cdn(mbid)
        await redis_delete(f"artist:{mbid}")
        await postgres_delete(f"artist:{mbid}")
        return True
    finally:
        # Release lock
        await redis.delete(lock_key)

Data Volume Estimates

MusicBrainz Database

Table Row Count Avg Row Size Total Size
artist 2M 500B 1GB
release_group 3M 400B 1.2GB
release 5M 600B 3GB
medium 6M 200B 1.2GB
track 50M 300B 15GB
recording 40M 400B 16GB
url 10M 200B 2GB
l_artist_url 5M 100B 500MB
Total 121M - ~100GB

Cache Database

Table Entry Count Avg Compressed Size Total Size
artist 120K 5KB 600MB
album 200K 3KB 600MB
spotify 50K 1KB 50MB
fanart 80K 2KB 160MB
tadb 60K 2KB 120MB
wikipedia 100K 1KB 100MB
Total 610K - ~2GB

Redis Cache

Key Pattern Entry Count Avg Size Total Size
artist:* 10K 5KB 50MB
album:* 15K 3KB 45MB
search:* 5K 10KB 50MB
ratelimit:* 1K 100B 100KB
sentry:* 500 100B 50KB
Total 31.5K - ~150MB

Solr Index

Core Document Count Index Size
artist 2M 4GB
release-group 3M 4GB
Total 5M 8GB

Conclusion

The data layer demonstrates sophisticated multi-tier architecture:

  1. MusicBrainz PostgreSQL: Authoritative source with complex SQL aggregation
  2. Cache PostgreSQL: Persistent compressed cache with automatic expiration
  3. Redis: Hot cache with LFU eviction and rate limiting
  4. Solr: Real-time search index with RabbitMQ updates

Key strengths:

  • Direct database access for complex queries
  • Three-tier caching with compression (10:1 ratio)
  • Change detection across 5 sources per entity type
  • Real-time search index updates

The SQL queries using row_to_json and json_agg are particularly elegant, building nested JSON structures directly in the database for optimal performance.