Alexander a1f6701bac feat: initial implementation of metadata aggregator

- gRPC service with MusicBrainz provider
- PostgreSQL schema with migrations
- Service layer with database-first caching
- Repository pattern for data access
- YAML configuration support
- Research documentation for 17 music metadata projects

2026-04-28 16:28:53 +02:00

Lidarr Metadata API - Data Layer

Data Source Overview

The Lidarr Metadata API integrates with four primary data storage systems:

System	Purpose	Size	Technology	Access Pattern
MusicBrainz PostgreSQL	Authoritative music metadata	100GB+	PostgreSQL 12+	Direct SQL (asyncpg)
Cache Database	Persistent metadata cache	10-50GB	PostgreSQL 12+	Direct SQL (asyncpg)
Redis	Ephemeral cache + rate limiting	512MB	Redis 6+	Key-value (aioredis)
Solr	Full-text search index	8GB+	Solr 8.x	HTTP REST API

MusicBrainz PostgreSQL Database

Database Overview

Purpose: Authoritative source for all music metadata

Access method: Direct read-only SQL queries via asyncpg

Replication: Hourly updates from MusicBrainz master database

Container image: ghcr.io/lidarr/mb-postgres:1.0.10

Connection configuration:

DATABASE = {
    'host': 'musicbrainz',
    'port': 5432,
    'database': 'musicbrainz_db',
    'user': 'abc',
    'password': 'abc',
    'min_pool_size': 10,
    'max_pool_size': 50,
    'command_timeout': 30
}
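
A minimal sketch of how this dictionary might be handed to asyncpg, assuming `min_pool_size` and `max_pool_size` correspond to the `min_size` and `max_size` arguments of `asyncpg.create_pool()`. The helper names `pool_kwargs` and `open_pool` are hypothetical and are not taken from the Lidarr code base.

```python
from typing import Any, Dict

DATABASE: Dict[str, Any] = {
    'host': 'musicbrainz',
    'port': 5432,
    'database': 'musicbrainz_db',
    'user': 'abc',
    'password': 'abc',
    'min_pool_size': 10,
    'max_pool_size': 50,
    'command_timeout': 30,
}

# Config keys that are named differently in asyncpg.create_pool() (assumed mapping).
_RENAMES = {'min_pool_size': 'min_size', 'max_pool_size': 'max_size'}


def pool_kwargs(config: Dict[str, Any]) -> Dict[str, Any]:
    """Translate the config dict into keyword arguments for asyncpg.create_pool()."""
    return {_RENAMES.get(key, key): value for key, value in config.items()}


async def open_pool(config: Dict[str, Any] = DATABASE):
    """Open a connection pool. asyncpg is imported lazily so pool_kwargs() works without it."""
    import asyncpg
    return await asyncpg.create_pool(**pool_kwargs(config))
```

Keeping the key translation in a pure function makes it easy to check the pool settings without a running database.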

Core Tables

artist

Purpose: Artist entities (musicians, bands, orchestras, etc.)

Key columns:

Column	Type	Description
`id`	INTEGER	Primary key
`gid`	UUID	MusicBrainz ID (public identifier)
`name`	VARCHAR	Artist name
`sort_name`	VARCHAR	Name for alphabetical sorting
`type`	INTEGER	Artist type (Person, Group, etc.)
`gender`	INTEGER	Gender (for Person type)
`area`	INTEGER	Geographic area
`begin_date_year`	SMALLINT	Formation/birth year
`end_date_year`	SMALLINT	Dissolution/death year
`comment`	VARCHAR	Disambiguation comment
`last_updated`	TIMESTAMP	Last modification timestamp

Indices:

CREATE INDEX idx_artist_gid ON artist (gid);
CREATE INDEX idx_artist_name ON artist (name);
CREATE INDEX idx_artist_last_updated ON artist (last_updated DESC);

Row count: ~2 million artists

release_group

Purpose: Album groupings (abstract releases)

Key columns:

Column	Type	Description
`id`	INTEGER	Primary key
`gid`	UUID	MusicBrainz ID
`name`	VARCHAR	Album title
`artist_credit`	INTEGER	Artist credit ID
`type`	INTEGER	Primary type (Album, Single, EP, etc.)
`comment`	VARCHAR	Disambiguation
`last_updated`	TIMESTAMP	Last modification timestamp

Indices:

CREATE INDEX idx_release_group_gid ON release_group (gid);
CREATE INDEX idx_release_group_artist_credit ON release_group (artist_credit);
CREATE INDEX idx_release_group_last_updated ON release_group (last_updated DESC);

Row count: ~3 million release groups

release

Purpose: Specific releases (physical/digital products)

Key columns:

Column	Type	Description
`id`	INTEGER	Primary key
`gid`	UUID	MusicBrainz ID
`name`	VARCHAR	Release title
`release_group`	INTEGER	Release group ID
`artist_credit`	INTEGER	Artist credit ID
`status`	INTEGER	Release status (Official, Promotion, etc.)
`packaging`	INTEGER	Packaging type
`barcode`	VARCHAR	Barcode
`last_updated`	TIMESTAMP	Last modification timestamp

Indices:

CREATE INDEX idx_release_gid ON release (gid);
CREATE INDEX idx_release_release_group ON release (release_group);
CREATE INDEX idx_release_last_updated ON release (last_updated DESC);

Row count: ~5 million releases

medium

Purpose: Physical/digital media (CDs, Vinyl, Digital, etc.)

Key columns:

Column	Type	Description
`id`	INTEGER	Primary key
`release`	INTEGER	Release ID
`position`	INTEGER	Disc number
`format`	INTEGER	Medium format (CD, Vinyl, etc.)
`name`	VARCHAR	Medium name (e.g., "Bonus Disc")
`track_count`	INTEGER	Number of tracks

Indices:

CREATE INDEX idx_medium_release ON medium (release);

Row count: ~6 million media

track

Purpose: Track listings on media

Key columns:

Column	Type	Description
`id`	INTEGER	Primary key
`gid`	UUID	MusicBrainz ID
`recording`	INTEGER	Recording ID
`medium`	INTEGER	Medium ID
`position`	INTEGER	Track number
`number`	VARCHAR	Track number (string, e.g., "A1")
`name`	VARCHAR	Track title
`length`	INTEGER	Duration in milliseconds

Indices:

CREATE INDEX idx_track_medium ON track (medium);
CREATE INDEX idx_track_recording ON track (recording);

Row count: ~50 million tracks

recording

Purpose: Abstract recordings (audio content)

Key columns:

Column	Type	Description
`id`	INTEGER	Primary key
`gid`	UUID	MusicBrainz ID
`name`	VARCHAR	Recording title
`artist_credit`	INTEGER	Artist credit ID
`length`	INTEGER	Duration in milliseconds
`comment`	VARCHAR	Disambiguation
`last_updated`	TIMESTAMP	Last modification timestamp

Row count: ~40 million recordings

url

Purpose: External URLs (websites, streaming services, etc.)

Key columns:

Column	Type	Description
`id`	INTEGER	Primary key
`gid`	UUID	MusicBrainz ID
`url`	TEXT	URL string

Indices:

CREATE INDEX idx_url_url ON url (url);

Row count: ~10 million URLs

l_artist_url

Purpose: Artist-URL relationships (links)

Key columns:

Column	Type	Description
`id`	INTEGER	Primary key
`link`	INTEGER	Link ID (references `link`, which carries the link type)
`entity0`	INTEGER	Artist ID
`entity1`	INTEGER	URL ID
`last_updated`	TIMESTAMP	Last modification timestamp

Indices:

CREATE INDEX idx_l_artist_url_entity0 ON l_artist_url (entity0);
CREATE INDEX idx_l_artist_url_entity1 ON l_artist_url (entity1);
CREATE INDEX idx_l_artist_url_last_updated ON l_artist_url (last_updated DESC);

Row count: ~5 million links

cover_art_archive.index_listing

Purpose: Cover art availability tracking

Key columns:

Column	Type	Description
`release`	INTEGER	Release ID
`date_updated`	TIMESTAMP	Last cover art update

Indices:

CREATE INDEX idx_cover_art_release ON cover_art_archive.index_listing (release);
CREATE INDEX idx_cover_art_date_updated ON cover_art_archive.index_listing (date_updated DESC);

Row count: ~2 million releases with cover art

replication_control

Purpose: Replication status tracking

Key columns:

Column	Type	Description
`current_schema_sequence`	INTEGER	Current schema version
`current_replication_sequence`	INTEGER	Current replication packet
`last_replication_date`	TIMESTAMP	Last replication timestamp

Usage: Monitoring replication lag and detecting updates

Custom Indices

To support efficient change detection, the last_updated and date_updated indices listed per table above are created as partial indices. Their full definitions are:

-- Artist change detection
CREATE INDEX IF NOT EXISTS idx_artist_last_updated
ON artist (last_updated DESC)
WHERE last_updated IS NOT NULL;

-- Release group change detection
CREATE INDEX IF NOT EXISTS idx_release_group_last_updated
ON release_group (last_updated DESC)
WHERE last_updated IS NOT NULL;

-- Release change detection
CREATE INDEX IF NOT EXISTS idx_release_last_updated
ON release (last_updated DESC)
WHERE last_updated IS NOT NULL;

-- Link change detection
CREATE INDEX IF NOT EXISTS idx_l_artist_url_last_updated
ON l_artist_url (last_updated DESC)
WHERE last_updated IS NOT NULL;

-- Cover art change detection
CREATE INDEX IF NOT EXISTS idx_cover_art_date_updated
ON cover_art_archive.index_listing (date_updated DESC)
WHERE date_updated IS NOT NULL;

SQL Query Files

artist.sql

Purpose: Fetch complete artist metadata with releases

Location: lidarrmetadata/sql/artist.sql

Parameters:

$1: Artist MBID (UUID)
$2: Primary release types (array)
$3: Secondary release types (array)
$4: Release statuses (array)

Query structure:

WITH artist_data AS (
    SELECT
        a.gid AS id,
        a.name AS artist_name,
        a.sort_name,
        a.comment AS disambiguation,
        at.name AS artist_type,
        g.name AS gender,
        ar.name AS hometown,
        a.begin_date_year AS start_year,
        a.end_date_year AS end_year
    FROM artist a
    LEFT JOIN artist_type at ON a.type = at.id
    LEFT JOIN gender g ON a.gender = g.id
    LEFT JOIN area ar ON a.area = ar.id
    WHERE a.gid = $1
),
releases AS (
    SELECT
        rg.gid AS id,
        rg.name AS title,
        rg.comment AS disambiguation,
        rgt.name AS primary_type,
        rgst.name AS secondary_type,
        rs.name AS status,
        COALESCE(
            TO_CHAR(DATE(rd.date_year || '-' || COALESCE(rd.date_month, 1) || '-' || COALESCE(rd.date_day, 1)), 'YYYY-MM-DD'),
            ''
        ) AS release_date
    FROM artist a
    JOIN artist_credit_name acn ON acn.artist = a.id
    JOIN release_group rg ON rg.artist_credit = acn.artist_credit
    LEFT JOIN release_group_primary_type rgt ON rg.type = rgt.id
    LEFT JOIN release_group_secondary_type_join rgstj ON rgstj.release_group = rg.id
    LEFT JOIN release_group_secondary_type rgst ON rgstj.secondary_type = rgst.id
    LEFT JOIN release r ON r.release_group = rg.id
    LEFT JOIN release_status rs ON r.status = rs.id
    LEFT JOIN (
        SELECT release, MIN(date_year) AS date_year, MIN(date_month) AS date_month, MIN(date_day) AS date_day
        FROM release_country
        GROUP BY release
    ) rd ON rd.release = r.id
    WHERE a.gid = $1
        AND (ARRAY_LENGTH($2::text[], 1) IS NULL OR rgt.name = ANY($2))
        AND (ARRAY_LENGTH($3::text[], 1) IS NULL OR rgst.name = ANY($3))
        AND (ARRAY_LENGTH($4::text[], 1) IS NULL OR rs.name = ANY($4))
    GROUP BY rg.id, rg.name, rg.comment, rgt.name, rgst.name, rs.name, rd.date_year, rd.date_month, rd.date_day
    ORDER BY rd.date_year DESC NULLS LAST, rd.date_month DESC NULLS LAST, rd.date_day DESC NULLS LAST
),
links AS (
    SELECT
        lt.name AS link_type,
        u.url
    FROM artist a
    JOIN l_artist_url lau ON lau.entity0 = a.id
    JOIN url u ON lau.entity1 = u.id
    JOIN link l ON lau.link = l.id
    JOIN link_type lt ON l.link_type = lt.id
    WHERE a.gid = $1
)
SELECT
    row_to_json(artist_data.*) AS artist,
    COALESCE(
        (SELECT json_agg(row_to_json(releases.*)) FROM releases),
        '[]'::json
    ) AS albums,
    COALESCE(
        (SELECT json_agg(row_to_json(links.*)) FROM links),
        '[]'::json
    ) AS links
FROM artist_data;

Performance: 100-500ms depending on artist discography size

Result format: Single row with three JSON columns (artist, albums, links)
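
A sketch of how the single result row might be unpacked on the Python side. It assumes the three JSON columns arrive as text, which is asyncpg's default for `json` values when no custom codec is registered. The helper name `parse_artist_row` is hypothetical.

```python
import json
from typing import Any, Mapping, Optional


def _decode(value: Any) -> Any:
    """Decode a JSON column that may arrive as text or already decoded."""
    return json.loads(value) if isinstance(value, (str, bytes)) else value


def parse_artist_row(row: Optional[Mapping[str, Any]]) -> Optional[dict]:
    """Merge the artist, albums and links columns into one dictionary.

    Returns None when the query produced no row (unknown MBID).
    """
    if row is None:
        return None
    artist = dict(_decode(row['artist']))
    artist['albums'] = _decode(row['albums'])
    artist['links'] = _decode(row['links'])
    return artist
```

Because the query selects `FROM artist_data`, an unknown MBID yields no row at all rather than a row of nulls, which is why the `None` case is handled first.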

album.sql

Purpose: Fetch complete album metadata with tracks

Location: lidarrmetadata/sql/album.sql

Parameters:

$1: Release group MBID (UUID)

Query structure:

WITH album_data AS (
    SELECT
        rg.gid AS id,
        rg.name AS title,
        rg.comment AS disambiguation,
        rgt.name AS primary_type,
        rs.name AS status,
        a.gid AS artist_id,
        a.name AS artist_name,
        COALESCE(
            TO_CHAR(DATE(rd.date_year || '-' || COALESCE(rd.date_month, 1) || '-' || COALESCE(rd.date_day, 1)), 'YYYY-MM-DD'),
            ''
        ) AS release_date
    FROM release_group rg
    JOIN artist_credit ac ON rg.artist_credit = ac.id
    JOIN artist_credit_name acn ON acn.artist_credit = ac.id
    JOIN artist a ON acn.artist = a.id
    LEFT JOIN release_group_primary_type rgt ON rg.type = rgt.id
    LEFT JOIN release r ON r.release_group = rg.id
    LEFT JOIN release_status rs ON r.status = rs.id
    LEFT JOIN (
        SELECT release, MIN(date_year) AS date_year, MIN(date_month) AS date_month, MIN(date_day) AS date_day
        FROM release_country
        GROUP BY release
    ) rd ON rd.release = r.id
    WHERE rg.gid = $1
    LIMIT 1
),
media AS (
    SELECT
        m.position,
        mf.name AS format,
        json_agg(
            json_build_object(
                'position', t.position,
                'title', t.name,
                'duration', t.length,
                'artist_name', ta.name
            )
            ORDER BY t.position
        ) AS tracks
    FROM release_group rg
    JOIN release r ON r.release_group = rg.id
    JOIN medium m ON m.release = r.id
    LEFT JOIN medium_format mf ON m.format = mf.id
    JOIN track t ON t.medium = m.id
    JOIN recording rec ON t.recording = rec.id
    JOIN artist_credit tac ON rec.artist_credit = tac.id
    JOIN artist_credit_name tacn ON tacn.artist_credit = tac.id
    JOIN artist ta ON tacn.artist = ta.id
    WHERE rg.gid = $1
    GROUP BY m.id, m.position, mf.name
    ORDER BY m.position
)
SELECT
    row_to_json(album_data.*) AS album,
    COALESCE(
        (SELECT json_agg(row_to_json(media.*)) FROM media),
        '[]'::json
    ) AS media
FROM album_data;

Performance: 200-1000ms depending on track count

updated_artists.sql

Purpose: Detect recently updated artists for cache invalidation

Location: lidarrmetadata/sql/updated_artists.sql

Parameters:

$1: Timestamp threshold (only artists updated after this)
$2: Result limit

Query structure (UNION of 5 change sources):

-- Source 1: Artists with updated metadata
SELECT DISTINCT
    a.gid,
    a.last_updated,
    'metadata' AS change_type
FROM artist a
WHERE a.last_updated > $1

UNION

-- Source 2: Artists with new release groups
SELECT DISTINCT
    a.gid,
    rg.last_updated,
    'new_release' AS change_type
FROM artist a
JOIN artist_credit_name acn ON acn.artist = a.id
JOIN release_group rg ON rg.artist_credit = acn.artist_credit
WHERE rg.last_updated > $1

UNION

-- Source 3: Artists with updated releases
SELECT DISTINCT
    a.gid,
    r.last_updated,
    'updated_release' AS change_type
FROM artist a
JOIN artist_credit_name acn ON acn.artist = a.id
JOIN release_group rg ON rg.artist_credit = acn.artist_credit
JOIN release r ON r.release_group = rg.id
WHERE r.last_updated > $1

UNION

-- Source 4: Artists with new/updated links
SELECT DISTINCT
    a.gid,
    lau.last_updated,
    'new_link' AS change_type
FROM artist a
JOIN l_artist_url lau ON lau.entity0 = a.id
WHERE lau.last_updated > $1

UNION

-- Source 5: Artists with updated cover art
SELECT DISTINCT
    a.gid,
    caa.date_updated AS last_updated,
    'cover_art' AS change_type
FROM artist a
JOIN artist_credit_name acn ON acn.artist = a.id
JOIN release_group rg ON rg.artist_credit = acn.artist_credit
JOIN release r ON r.release_group = rg.id
JOIN cover_art_archive.index_listing caa ON caa.release = r.id
WHERE caa.date_updated > $1

ORDER BY last_updated DESC
LIMIT $2;

Performance: 500-2000ms depending on time window

Use case: Crawler scheduling, cache invalidation
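
UNION only removes rows that are identical in every column, so an artist touched by several change sources can appear more than once in the result, with a different change_type or timestamp each time. A crawler would probably want one entry per artist. The following sketch collapses the rows to the most recent change per MBID. The function name `collapse_changes` and the tuple layout are assumptions for illustration.

```python
from datetime import datetime
from typing import Dict, Iterable, List, Tuple

ChangeRow = Tuple[str, datetime, str]  # (gid, last_updated, change_type)


def collapse_changes(rows: Iterable[ChangeRow]) -> List[Tuple[str, datetime]]:
    """Keep the newest timestamp per artist MBID, newest artists first."""
    latest: Dict[str, datetime] = {}
    for gid, last_updated, _change_type in rows:
        if gid not in latest or last_updated > latest[gid]:
            latest[gid] = last_updated
    return sorted(latest.items(), key=lambda item: item[1], reverse=True)
```

Note that `LIMIT $2` applies to rows before this collapsing step, so the number of distinct artists returned can be smaller than the limit.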

updated_albums.sql

Purpose: Detect recently updated albums

Location: lidarrmetadata/sql/updated_albums.sql

Parameters: Same as updated_artists.sql

Query structure: Similar UNION pattern with 5 change sources:

Release group metadata updates
New releases in group
Updated releases in group
New/updated links
Cover art updates

Database Replication

Replication method: MusicBrainz replication packets

Update frequency: Hourly

Replication process:

Check replication_control table for current sequence
Fetch replication packets from MusicBrainz FTP
Apply SQL changes from packets
Update replication_control table
Trigger search index updates via RabbitMQ

Monitoring:

SELECT
    current_replication_sequence,
    last_replication_date,
    NOW() - last_replication_date AS replication_lag
FROM replication_control;

Typical lag: 1-2 hours behind MusicBrainz master
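
The monitoring query can feed a simple staleness check. This sketch assumes an alert threshold of two hours, taken from the typical lag quoted above. The names `replication_lag` and `is_stale` are hypothetical.

```python
from datetime import datetime, timedelta

# Assumed alert threshold, based on the typical 1-2 hour lag.
MAX_EXPECTED_LAG = timedelta(hours=2)


def replication_lag(last_replication_date: datetime, now: datetime) -> timedelta:
    """Return how far the replica is behind; clock skew never yields a negative lag."""
    return max(now - last_replication_date, timedelta(0))


def is_stale(last_replication_date: datetime, now: datetime,
             threshold: timedelta = MAX_EXPECTED_LAG) -> bool:
    """True when the replica is further behind than the threshold."""
    return replication_lag(last_replication_date, now) > threshold
```

Both arguments should be either naive or timezone-aware; mixing the two raises a TypeError in Python.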

Cache Database (PostgreSQL)

Database Overview

Purpose: Persistent cache storage with compression

Technology: PostgreSQL 12+ (same instance as MusicBrainz or separate)

Database name: lm_cache_db

Connection configuration:

CACHE_DATABASE = {
    'host': 'localhost',
    'port': 5432,
    'database': 'lm_cache_db',
    'user': 'abc',
    'password': 'abc'
}

Auto-Created Cache Tables

Each cache type gets its own table with identical schema:

Table names:

artist: Artist metadata cache
album: Album metadata cache
spotify: Spotify lookup cache
fanart: FanArt.tv image cache
tadb: TheAudioDB metadata cache
wikipedia: Wikipedia overview cache

Schema:

CREATE TABLE IF NOT EXISTS {cache_name} (
    key VARCHAR(255) PRIMARY KEY,
    expires TIMESTAMP,
    updated TIMESTAMP DEFAULT NOW(),
    value BYTEA
);

CREATE INDEX IF NOT EXISTS idx_{cache_name}_expires
ON {cache_name} (expires)
WHERE expires IS NOT NULL;

CREATE INDEX IF NOT EXISTS idx_{cache_name}_updated
ON {cache_name} (updated DESC);

Trigger for automatic timestamp updates:

CREATE OR REPLACE FUNCTION update_updated_column()
RETURNS TRIGGER AS $$
BEGIN
    NEW.updated = NOW();
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER update_{cache_name}_updated
    BEFORE UPDATE ON {cache_name}
    FOR EACH ROW
    EXECUTE FUNCTION update_updated_column();
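
Table names cannot be passed as query parameters, so the `{cache_name}` placeholder has to be substituted into the SQL text. A sketch of doing that safely is to accept only the six known cache names. The names `CACHE_TABLES` and `render_cache_ddl` are hypothetical.

```python
CACHE_TABLES = ('artist', 'album', 'spotify', 'fanart', 'tadb', 'wikipedia')

_DDL_TEMPLATE = """
CREATE TABLE IF NOT EXISTS {name} (
    key VARCHAR(255) PRIMARY KEY,
    expires TIMESTAMP,
    updated TIMESTAMP DEFAULT NOW(),
    value BYTEA
);
CREATE INDEX IF NOT EXISTS idx_{name}_expires
ON {name} (expires) WHERE expires IS NOT NULL;
CREATE INDEX IF NOT EXISTS idx_{name}_updated
ON {name} (updated DESC);
"""


def render_cache_ddl(name: str) -> str:
    """Return the DDL for one cache table, rejecting unknown table names."""
    if name not in CACHE_TABLES:
        raise ValueError(f"unknown cache table: {name!r}")
    return _DDL_TEMPLATE.format(name=name)
```

The same allow-list check could guard the cache operations, which are shown against the artist table only.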

Cache Entry Format

Key structure: {cache_type}:{identifier}:{parameters}

Examples:

artist:5b11f4ce-a62d-471e-81fc-a69a8278c7da:Album:Official
album:1b022e01-4da6-387b-8658-8678046e4cef
spotify:artist:6olE6TJLqED3rqDCT0FyPh
wikipedia:5b11f4ce-a62d-471e-81fc-a69a8278c7da:en

Value format: zlib-compressed pickle

Compression implementation:

import zlib
import pickle

def compress_value(value):
    """Compress Python object for storage"""
    pickled = pickle.dumps(value, protocol=pickle.HIGHEST_PROTOCOL)
    compressed = zlib.compress(pickled, level=6)
    return compressed

def decompress_value(compressed):
    """Decompress stored value to Python object"""
    pickled = zlib.decompress(compressed)
    value = pickle.loads(pickled)
    return value

Compression ratio: Typically 10:1 for JSON metadata

Example:

Uncompressed artist metadata: 50KB JSON
Pickled: 52KB
Compressed: 5KB
Storage savings: 90%
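
The ratio depends heavily on the payload, so it is worth measuring on real entries. This sketch reports the pickled and compressed sizes for any value, using the same pickle protocol and zlib level as the implementation above. The function name `compression_stats` is hypothetical.

```python
import pickle
import zlib


def compression_stats(value) -> dict:
    """Return pickled size, compressed size and their ratio for one cache value."""
    pickled = pickle.dumps(value, protocol=pickle.HIGHEST_PROTOCOL)
    compressed = zlib.compress(pickled, 6)
    return {
        'pickled_bytes': len(pickled),
        'compressed_bytes': len(compressed),
        'ratio': len(pickled) / len(compressed),
    }


def roundtrip(value):
    """Compress and decompress a value, as the cache does on write and read."""
    compressed = zlib.compress(pickle.dumps(value, protocol=pickle.HIGHEST_PROTOCOL), 6)
    return pickle.loads(zlib.decompress(compressed))
```

Since `pickle.loads` can execute arbitrary code, this format is only appropriate while the cache tables are written exclusively by trusted processes.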

Cache Operations

Insert/Update

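import pickle
import zlib
from datetime import datetime, timedelta

# conn is an asyncpg connection (or pool) created elsewhere
async def cache_set(key, value, ttl=2592000):
    """Store value in cache; ttl is in seconds (default 30 days), None disables expiry"""
    compressed = zlib.compress(pickle.dumps(value))
    # Naive local time, matching the TIMESTAMP column and the check in cache_get
    expires = datetime.now() + timedelta(seconds=ttl) if ttl else None

    await conn.execute(
        """
        INSERT INTO artist (key, value, expires)
        VALUES ($1, $2, $3)
        ON CONFLICT (key) DO UPDATE
        SET value = $2, expires = $3, updated = NOW()
        """,
        key, compressed, expires
    )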

Retrieve

async def cache_get(key):
    """Retrieve value from cache"""
    row = await conn.fetchrow(
        """
        SELECT value, expires
        FROM artist
        WHERE key = $1
        """,
        key
    )
    
    if not row:
        return None
    
    # Check expiration
    if row['expires'] and row['expires'] < datetime.now():
        # Expired, delete and return None
        await conn.execute("DELETE FROM artist WHERE key = $1", key)
        return None
    
    # Decompress and return
    value = pickle.loads(zlib.decompress(row['value']))
    return value

Delete

async def cache_delete(key):
    """Delete value from cache"""
    await conn.execute("DELETE FROM artist WHERE key = $1", key)

Cleanup Expired Entries

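async def cache_cleanup():
    """Remove expired entries and return the number of rows deleted"""
    status = await conn.execute(
        """
        DELETE FROM artist
        WHERE expires IS NOT NULL AND expires < NOW()
        """
    )
    # asyncpg's execute() returns the command tag (e.g. "DELETE 42"), not a count
    return int(status.split()[-1])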

Cleanup schedule: Daily via cron or crawler

Cache Statistics

Query for cache statistics:

SELECT
    'artist' AS cache_name,
    COUNT(*) AS total_entries,
    COUNT(*) FILTER (WHERE expires IS NOT NULL AND expires < NOW()) AS expired_entries,
    COUNT(*) FILTER (WHERE expires IS NULL OR expires >= NOW()) AS valid_entries,
    pg_size_pretty(pg_total_relation_size('artist')) AS total_size,
    AVG(LENGTH(value)) AS avg_value_size,
    MAX(updated) AS last_updated
FROM artist;

Example output:

cache_name | total_entries | expired_entries | valid_entries | total_size | avg_value_size | last_updated
-----------+---------------+-----------------+---------------+------------+----------------+---------------------
artist     | 125000        | 5000            | 120000        | 2048 MB    | 5120           | 2025-04-28 12:34:56

Redis Cache

Redis Overview

Purpose: Ephemeral cache for hot data and rate limiting

Technology: Redis 6+

Memory limit: 512MB

Eviction policy: LFU (Least Frequently Used)

Namespace: lm3.7

Connection configuration:

REDIS_URL = 'redis://localhost:6379/0'
REDIS_MIN_POOL_SIZE = 5
REDIS_MAX_POOL_SIZE = 20

Redis Configuration

redis.conf settings:

maxmemory 512mb
maxmemory-policy allkeys-lfu
# Disable RDB snapshots (redis.conf comments must be on their own line)
save ""
# Disable AOF
appendonly no

Rationale:

LFU eviction keeps most-accessed data in cache
No persistence needed (PostgreSQL is persistent layer)
Maximum performance for cache operations

Key Structure

Namespace prefix: All keys prefixed with lm3.7:

Key patterns:

lm3.7:artist:{mbid}:{params}: Artist metadata
lm3.7:album:{mbid}: Album metadata
lm3.7:search:artist:{query}: Search results
lm3.7:ratelimit:{ip}:{window}: Rate limiter state
lm3.7:sentry:{error_hash}: Sentry deduplication
lm3.7:lock:invalidate:{mbid}: Invalidation locks

Cache Operations

Set with TTL

async def redis_set(key, value, ttl=604800):
    """Store value in Redis with TTL (default 7 days)"""
    compressed = zlib.compress(pickle.dumps(value))
    await redis.setex(f"lm3.7:{key}", ttl, compressed)

Get

async def redis_get(key):
    """Retrieve value from Redis"""
    compressed = await redis.get(f"lm3.7:{key}")
    if not compressed:
        return None
    value = pickle.loads(zlib.decompress(compressed))
    return value

Delete

async def redis_delete(key):
    """Delete value from Redis"""
    await redis.delete(f"lm3.7:{key}")

Batch Delete

async def redis_delete_pattern(pattern):
    """Delete all keys matching pattern"""
    cursor = 0
    while True:
        cursor, keys = await redis.scan(cursor, match=f"lm3.7:{pattern}", count=100)
        if keys:
            await redis.delete(*keys)
        if cursor == 0:
            break

Rate Limiting with Redis

Implementation: Fixed window counter (requests are counted per window-sized time bucket)

import time

class RateLimitExceeded(Exception):
    """Raised when a client exceeds its request budget for the current window"""

async def rate_limit_check(key, max_requests=100, window=60):
    """Check if request is within rate limit"""
    now = time.time()
    # All requests in the same window-sized time bucket share one counter key
    window_key = f"lm3.7:ratelimit:{key}:{int(now / window)}"

    # Increment counter
    count = await redis.incr(window_key)

    # Set expiration on first request
    if count == 1:
        await redis.expire(window_key, window)

    # Check limit
    if count > max_requests:
        raise RateLimitExceeded(f"Rate limit exceeded: {count}/{max_requests}")

    return count
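
Because the counter key is derived from `int(now / window)`, counts reset at fixed bucket boundaries rather than sliding with each request. The sketch below isolates that bucket arithmetic and shows how a retry delay could be derived from it. The function names are hypothetical.

```python
def window_bucket(now: float, window: int) -> int:
    """Index of the fixed time bucket that a timestamp falls into."""
    return int(now // window)


def seconds_until_reset(now: float, window: int) -> float:
    """Time remaining until the current bucket ends and the counter starts over."""
    return (window_bucket(now, window) + 1) * window - now
```

One consequence of fixed buckets is that a client can send up to `max_requests` just before a boundary and `max_requests` again just after it, so short bursts of up to twice the limit are possible.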

Sentry Deduplication with Redis

Purpose: Prevent duplicate error reports

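async def sentry_should_send(error_hash):
    """Check if error should be sent to Sentry"""
    key = f"lm3.7:sentry:{error_hash}"

    # SET with NX marks the error as seen for 1 hour and, in the same atomic
    # step, reports whether this caller was the first to see it. A separate
    # EXISTS followed by SETEX would let two instances both report the error.
    first_seen = await redis.set(key, "1", ex=3600, nx=True)
    return bool(first_seen)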

Redis Monitoring

Memory usage:

redis-cli INFO memory

Key count:

redis-cli DBSIZE

Eviction stats:

redis-cli INFO stats | grep evicted

Hit rate:

redis-cli INFO stats | grep keyspace
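
The `keyspace_hits` and `keyspace_misses` counters in that output can be turned into a hit rate. A minimal sketch that parses raw INFO text, with hypothetical function names:

```python
from typing import Dict, Optional


def parse_info(text: str) -> Dict[str, str]:
    """Parse 'name:value' lines from INFO output, skipping section headers."""
    fields: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('#') or ':' not in line:
            continue
        name, _, value = line.partition(':')
        fields[name] = value
    return fields


def hit_rate(fields: Dict[str, str]) -> Optional[float]:
    """Fraction of lookups served from cache, or None before any lookup happened."""
    hits = int(fields.get('keyspace_hits', 0))
    misses = int(fields.get('keyspace_misses', 0))
    total = hits + misses
    return hits / total if total else None
```

These counters are cumulative since the server started (or since the last CONFIG RESETSTAT), so the result is a long-run average rather than a current value.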

Solr Search Index

Solr Overview

Purpose: Full-text search for artists and albums

Technology: Apache Solr 8.x

Container image: ghcr.io/lidarr/mb-solr:3.3.1.9

Cores:

artist: Artist search index
release-group: Album search index

Update method: Real-time via RabbitMQ + SIR (Search Index Rebuilder)

Solr Configuration

solrconfig.xml settings:

<config>
  <requestHandler name="/select" class="solr.SearchHandler">
    <lst name="defaults">
      <str name="echoParams">explicit</str>
      <str name="defType">dismax</str>
      <int name="rows">10</int>
    </lst>
  </requestHandler>
  
  <requestHandler name="/update" class="solr.UpdateRequestHandler" />
  
  <requestHandler name="/admin/ping" class="solr.PingRequestHandler">
    <lst name="invariants">
      <str name="q">*:*</str>
    </lst>
  </requestHandler>
</config>

Artist Core Schema

schema.xml:

<schema name="artist" version="1.6">
  <field name="id" type="string" indexed="true" stored="true" required="true" />
  <field name="artist" type="text_general" indexed="true" stored="true" />
  <field name="sortname" type="text_general" indexed="true" stored="true" />
  <field name="alias" type="text_general" indexed="true" stored="true" multiValued="true" />
  <field name="type" type="string" indexed="true" stored="true" />
  <field name="disambiguation" type="text_general" indexed="true" stored="true" />
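  <field name="_version_" type="long" indexed="true" stored="true" />
  <!-- Catch-all destination for the copyField rules below -->
  <field name="text" type="text_general" indexed="true" stored="false" multiValued="true" />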
  
  <uniqueKey>id</uniqueKey>
  
  <copyField source="artist" dest="text" />
  <copyField source="sortname" dest="text" />
  <copyField source="alias" dest="text" />
</schema>

Indexed fields:

id: MusicBrainz artist MBID
artist: Artist name (boosted 2x in queries)
sortname: Sort name
alias: Artist aliases (multi-valued)
type: Artist type (Person, Group, etc.)
disambiguation: Disambiguation comment

Release Group Core Schema

Indexed fields:

id: MusicBrainz release group MBID
title: Album title (boosted 2x)
artist: Artist name
type: Primary type (Album, Single, etc.)
status: Release status
disambiguation: Disambiguation comment

Search Query Format

Dismax query example:

{
  "query": "nirvana",
  "limit": 10,
  "params": {
    "defType": "dismax",
    "qf": "artist^2 sortname alias",
    "mm": "1"
  }
}

Query field weights:

artist^2: Artist name (2x boost)
sortname: Sort name (1x)
alias: Aliases (1x)

Minimum match: At least 1 term must match
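
A sketch of building this request body in Python. The function name `build_artist_query` is hypothetical, and how the body is sent (HTTP client, core URL) is left out because the source does not specify it.

```python
import json


def build_artist_query(text: str, limit: int = 10) -> dict:
    """Build the dismax request body for the artist core."""
    return {
        'query': text,
        'limit': limit,
        'params': {
            'defType': 'dismax',
            # Artist name counts double; sort name and aliases count once.
            'qf': 'artist^2 sortname alias',
            # At least one query term must match.
            'mm': '1',
        },
    }


def encode_query(text: str, limit: int = 10) -> str:
    """Serialize the request body as JSON text."""
    return json.dumps(build_artist_query(text, limit))
```

Passing the user's text as the `query` value keeps it separate from the field weights, which stay fixed in `params`.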

Solr Update Process

Real-time updates via RabbitMQ:

MusicBrainz replication applies database changes
Database triggers publish messages to RabbitMQ
SIR (Search Index Rebuilder) consumes messages
SIR queries MusicBrainz database for updated entities
SIR posts updates to Solr cores
Solr commits changes (soft commit every 1s)

RabbitMQ configuration:

RABBITMQ = {
    'host': 'rabbitmq',
    'port': 5672,
    'user': 'abc',
    'password': 'abc',
    'exchange': 'search.index',
    'queue': 'search.index.artist'
}

Update latency: 1-5 seconds from database change to searchable

Solr Performance

Query timeout: 5 seconds

Typical query time: 50-200ms

Index size:

Artist core: ~4GB
Release group core: ~4GB

Document count:

Artist core: ~2 million
Release group core: ~3 million

Change Detection System

Overview

Purpose: Identify recently updated entities for cache invalidation

Method: SQL queries tracking changes across multiple sources

Update frequency: Hourly (aligned with MusicBrainz replication)

Artist Change Sources

5 change sources tracked:

Artist metadata updates: artist.last_updated
New release groups: release_group.last_updated
Updated releases: release.last_updated
New/updated links: l_artist_url.last_updated
Cover art updates: cover_art_archive.index_listing.date_updated

Query: updated_artists.sql (UNION of 5 queries)

Performance: 500-2000ms for 24-hour window

Typical results: 1000-5000 artists per hour

Album Change Sources

5 change sources tracked:

Release group metadata updates: release_group.last_updated
New releases in group: release.last_updated
Updated releases in group: release.last_updated
New/updated links: l_release_group_url.last_updated
Cover art updates: cover_art_archive.index_listing.date_updated

Query: updated_albums.sql

Typical results: 2000-10000 albums per hour

Change Detection Workflow

Crawler process:

Query updated_artists.sql with timestamp from last run
For each updated artist:
- Purge from Cloudflare CDN
- Delete from Redis cache
- Delete from PostgreSQL cache
Optionally pre-fetch fresh metadata
Update last run timestamp
Sleep until next cycle

Invalidation vs. pre-fetching:

Invalidation only: Fast, minimal API load
Pre-fetching: Slower, but ensures cache is warm

Configuration:

CRAWLER_INVALIDATE_ONLY = False  # Pre-fetch after invalidation
CRAWLER_INTERVAL = 3600  # 1 hour
CRAWLER_BATCH_SIZE = 100  # Process 100 entities per batch
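
A sketch of how `CRAWLER_BATCH_SIZE` might be applied to the list of updated MBIDs. The generator name `batches` is hypothetical.

```python
from typing import Iterator, List, Sequence

CRAWLER_BATCH_SIZE = 100


def batches(items: Sequence[str], size: int = CRAWLER_BATCH_SIZE) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("batch size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])
```

With 250 updated artists and the default size, the crawler would process batches of 100, 100 and 50.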

Data Consistency

Cache Coherence

Problem: Three cache tiers can become inconsistent

Solution: Hierarchical invalidation

Invalidation order:

Cloudflare CDN (purge API)
Redis (delete key)
PostgreSQL (delete row)

Rationale: Invalidate from edge to origin to prevent stale data propagation

Replication Lag Handling

Problem: MusicBrainz replication has 1-2 hour lag

Solution: Accept eventual consistency

User impact: Newly added artists/albums may not appear in search for 1-2 hours

Mitigation: Manual cache refresh endpoint for urgent updates

Concurrent Update Handling

Problem: Multiple API instances may invalidate cache simultaneously

Solution: Redis-based distributed locks

Implementation:

async def invalidate_with_lock(mbid):
    """Invalidate cache with distributed lock"""
    lock_key = f"lm3.7:lock:invalidate:{mbid}"
    
    # Try to acquire lock (30 second TTL)
    acquired = await redis.set(lock_key, "1", ex=30, nx=True)
    
    if not acquired:
        # Another instance is already invalidating
        return False
    
    try:
        # Perform invalidation
        await invalidate_cdn(mbid)
        await redis_delete(f"artist:{mbid}")
        await postgres_delete(f"artist:{mbid}")
        return True
    finally:
        # Release lock
        await redis.delete(lock_key)
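
One caveat with the pattern above: if invalidation takes longer than the 30-second TTL, the lock can expire and be acquired by another instance, and the unconditional delete in `finally` would then remove that other instance's lock. A common remedy is to store a random token in the lock and delete it only if the token still matches. The sketch below assumes a redis-py style asyncio client whose `eval` takes `(script, numkeys, *keys_and_args)`. The names `acquire_lock` and `release_lock` are hypothetical.

```python
import uuid
from typing import Optional

# Lua runs atomically on the server: delete the key only if we still own it.
_RELEASE_SCRIPT = """
if redis.call('get', KEYS[1]) == ARGV[1] then
    return redis.call('del', KEYS[1])
end
return 0
"""


async def acquire_lock(redis, lock_key: str, ttl: int = 30) -> Optional[str]:
    """Try to take the lock; return the owner token, or None if it is held."""
    token = uuid.uuid4().hex
    acquired = await redis.set(lock_key, token, ex=ttl, nx=True)
    return token if acquired else None


async def release_lock(redis, lock_key: str, token: str) -> bool:
    """Release the lock only if `token` is still the stored owner token."""
    released = await redis.eval(_RELEASE_SCRIPT, 1, lock_key, token)
    return released == 1
```

A caller would keep the token returned by `acquire_lock` and pass it to `release_lock` in its `finally` block.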

Data Volume Estimates

MusicBrainz Database

Table	Row Count	Avg Row Size	Total Size
artist	2M	500B	1GB
release_group	3M	400B	1.2GB
release	5M	600B	3GB
medium	6M	200B	1.2GB
track	50M	300B	15GB
recording	40M	400B	16GB
url	10M	200B	2GB
l_artist_url	5M	100B	500MB
Total (listed tables)	121M	-	~40GB

The full database, including indexes and tables not listed here, is 100GB+.
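
As a quick arithmetic check, multiplying each row count by its average row size shows that the listed tables account for roughly 40 GB of raw row data. The rest of the 100GB+ footprint presumably comes from indexes and tables not shown here. The figures below are copied from the table; the names are hypothetical.

```python
# table name -> (row count, average row size in bytes), copied from the table
TABLES = {
    'artist': (2_000_000, 500),
    'release_group': (3_000_000, 400),
    'release': (5_000_000, 600),
    'medium': (6_000_000, 200),
    'track': (50_000_000, 300),
    'recording': (40_000_000, 400),
    'url': (10_000_000, 200),
    'l_artist_url': (5_000_000, 100),
}


def total_rows(tables: dict) -> int:
    """Sum of row counts across all listed tables."""
    return sum(rows for rows, _size in tables.values())


def total_bytes(tables: dict) -> int:
    """Sum of rows * average row size across all listed tables."""
    return sum(rows * size for rows, size in tables.values())
```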

Cache Database

Table	Entry Count	Avg Compressed Size	Total Size
artist	120K	5KB	600MB
album	200K	3KB	600MB
spotify	50K	1KB	50MB
fanart	80K	2KB	160MB
tadb	60K	2KB	120MB
wikipedia	100K	1KB	100MB
Total	610K	-	~1.6GB

Redis Cache

Key Pattern	Entry Count	Avg Size	Total Size
artist:*	10K	5KB	50MB
album:*	15K	3KB	45MB
search:*	5K	10KB	50MB
ratelimit:*	1K	100B	100KB
sentry:*	500	100B	50KB
Total	31.5K	-	~150MB

Solr Index

Core	Document Count	Index Size
artist	2M	4GB
release-group	3M	4GB
Total	5M	8GB

Conclusion

The data layer is a multi-tier architecture built from four systems:

MusicBrainz PostgreSQL: Authoritative source with complex SQL aggregation
Cache PostgreSQL: Persistent compressed cache with automatic expiration
Redis: Hot cache with LFU eviction and rate limiting
Solr: Real-time search index with RabbitMQ updates

Key strengths:

Direct database access for complex queries
Three-tier caching with compression (10:1 ratio)
Change detection across 5 sources per entity type
Real-time search index updates

The SQL queries use row_to_json and json_agg to build nested JSON structures directly in the database, so each lookup returns a single row and the application does not have to assemble the result from multiple round trips.
