Lidarr Metadata API - Data Layer
Data Source Overview
The Lidarr Metadata API integrates with four primary data storage systems:
| System | Purpose | Size | Technology | Access Pattern |
|---|---|---|---|---|
| MusicBrainz PostgreSQL | Authoritative music metadata | 100GB+ | PostgreSQL 12+ | Direct SQL (asyncpg) |
| Cache Database | Persistent metadata cache | 10-50GB | PostgreSQL 12+ | Direct SQL (asyncpg) |
| Redis | Ephemeral cache + rate limiting | 512MB | Redis 6+ | Key-value (aioredis) |
| Solr | Full-text search index | 8GB+ | Solr 8.x | HTTP REST API |
MusicBrainz PostgreSQL Database
Database Overview
Purpose: Authoritative source for all music metadata
Access method: Direct read-only SQL queries via asyncpg
Replication: Hourly updates from MusicBrainz master database
Container image: ghcr.io/lidarr/mb-postgres:1.0.10
Connection configuration:
DATABASE = {
'host': 'musicbrainz',
'port': 5432,
'database': 'musicbrainz_db',
'user': 'abc',
'password': 'abc',
'min_pool_size': 10,
'max_pool_size': 50,
'command_timeout': 30
}
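The key names in DATABASE do not all match the driver's parameter names. A minimal sketch of the translation, assuming the pool is created with asyncpg.create_pool; the helpers pool_kwargs and create_pool are hypothetical names, not taken from the project:

```python
# Hypothetical helpers: map the DATABASE dict onto asyncpg.create_pool arguments.
DATABASE = {
    'host': 'musicbrainz',
    'port': 5432,
    'database': 'musicbrainz_db',
    'user': 'abc',
    'password': 'abc',
    'min_pool_size': 10,
    'max_pool_size': 50,
    'command_timeout': 30,
}

# asyncpg names the pool bounds min_size/max_size; the other keys pass through.
_RENAMES = {'min_pool_size': 'min_size', 'max_pool_size': 'max_size'}


def pool_kwargs(config):
    """Return keyword arguments suitable for asyncpg.create_pool."""
    return {_RENAMES.get(key, key): value for key, value in config.items()}


async def create_pool(config=DATABASE):
    """Create the pool; asyncpg is imported lazily so the mapping
    above can be exercised without the driver installed."""
    import asyncpg
    return await asyncpg.create_pool(**pool_kwargs(config))
```

Keeping the mapping in a pure function makes the configuration easy to check in isolation from a live database.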
Core Tables
artist
Purpose: Artist entities (musicians, bands, orchestras, etc.)
Key columns:
| Column | Type | Description |
|---|---|---|
| id | INTEGER | Primary key |
| gid | UUID | MusicBrainz ID (public identifier) |
| name | VARCHAR | Artist name |
| sort_name | VARCHAR | Name for alphabetical sorting |
| type | INTEGER | Artist type (Person, Group, etc.) |
| gender | INTEGER | Gender (for Person type) |
| area | INTEGER | Geographic area |
| begin_date_year | SMALLINT | Formation/birth year |
| end_date_year | SMALLINT | Dissolution/death year |
| comment | VARCHAR | Disambiguation comment |
| last_updated | TIMESTAMP | Last modification timestamp |
Indices:
CREATE INDEX idx_artist_gid ON artist (gid);
CREATE INDEX idx_artist_name ON artist (name);
CREATE INDEX idx_artist_last_updated ON artist (last_updated DESC);
Row count: ~2 million artists
release_group
Purpose: Album groupings (abstract releases)
Key columns:
| Column | Type | Description |
|---|---|---|
| id | INTEGER | Primary key |
| gid | UUID | MusicBrainz ID |
| name | VARCHAR | Album title |
| artist_credit | INTEGER | Artist credit ID |
| type | INTEGER | Primary type (Album, Single, EP, etc.) |
| comment | VARCHAR | Disambiguation |
| last_updated | TIMESTAMP | Last modification timestamp |
Indices:
CREATE INDEX idx_release_group_gid ON release_group (gid);
CREATE INDEX idx_release_group_artist_credit ON release_group (artist_credit);
CREATE INDEX idx_release_group_last_updated ON release_group (last_updated DESC);
Row count: ~3 million release groups
release
Purpose: Specific releases (physical/digital products)
Key columns:
| Column | Type | Description |
|---|---|---|
| id | INTEGER | Primary key |
| gid | UUID | MusicBrainz ID |
| name | VARCHAR | Release title |
| release_group | INTEGER | Release group ID |
| artist_credit | INTEGER | Artist credit ID |
| status | INTEGER | Release status (Official, Promotion, etc.) |
| packaging | INTEGER | Packaging type |
| barcode | VARCHAR | Barcode |
| last_updated | TIMESTAMP | Last modification timestamp |
Indices:
CREATE INDEX idx_release_gid ON release (gid);
CREATE INDEX idx_release_release_group ON release (release_group);
CREATE INDEX idx_release_last_updated ON release (last_updated DESC);
Row count: ~5 million releases
medium
Purpose: Physical/digital media (CDs, Vinyl, Digital, etc.)
Key columns:
| Column | Type | Description |
|---|---|---|
| id | INTEGER | Primary key |
| release | INTEGER | Release ID |
| position | INTEGER | Disc number |
| format | INTEGER | Medium format (CD, Vinyl, etc.) |
| name | VARCHAR | Medium name (e.g., "Bonus Disc") |
| track_count | INTEGER | Number of tracks |
Indices:
CREATE INDEX idx_medium_release ON medium (release);
Row count: ~6 million media
track
Purpose: Track listings on media
Key columns:
| Column | Type | Description |
|---|---|---|
| id | INTEGER | Primary key |
| gid | UUID | MusicBrainz ID |
| recording | INTEGER | Recording ID |
| medium | INTEGER | Medium ID |
| position | INTEGER | Track number |
| number | VARCHAR | Track number (string, e.g., "A1") |
| name | VARCHAR | Track title |
| length | INTEGER | Duration in milliseconds |
Indices:
CREATE INDEX idx_track_medium ON track (medium);
CREATE INDEX idx_track_recording ON track (recording);
Row count: ~50 million tracks
recording
Purpose: Abstract recordings (audio content)
Key columns:
| Column | Type | Description |
|---|---|---|
| id | INTEGER | Primary key |
| gid | UUID | MusicBrainz ID |
| name | VARCHAR | Recording title |
| artist_credit | INTEGER | Artist credit ID |
| length | INTEGER | Duration in milliseconds |
| comment | VARCHAR | Disambiguation |
| last_updated | TIMESTAMP | Last modification timestamp |
Row count: ~40 million recordings
url
Purpose: External URLs (websites, streaming services, etc.)
Key columns:
| Column | Type | Description |
|---|---|---|
| id | INTEGER | Primary key |
| gid | UUID | MusicBrainz ID |
| url | TEXT | URL string |
Indices:
CREATE INDEX idx_url_url ON url (url);
Row count: ~10 million URLs
l_artist_url
Purpose: Artist-URL relationships (links)
Key columns:
| Column | Type | Description |
|---|---|---|
| id | INTEGER | Primary key |
| link | INTEGER | Link ID (the link type is resolved through link.link_type) |
| entity0 | INTEGER | Artist ID |
| entity1 | INTEGER | URL ID |
| last_updated | TIMESTAMP | Last modification timestamp |
Indices:
CREATE INDEX idx_l_artist_url_entity0 ON l_artist_url (entity0);
CREATE INDEX idx_l_artist_url_entity1 ON l_artist_url (entity1);
CREATE INDEX idx_l_artist_url_last_updated ON l_artist_url (last_updated DESC);
Row count: ~5 million links
cover_art_archive.index_listing
Purpose: Cover art availability tracking
Key columns:
| Column | Type | Description |
|---|---|---|
| release | INTEGER | Release ID |
| date_updated | TIMESTAMP | Last cover art update |
Indices:
CREATE INDEX idx_cover_art_release ON cover_art_archive.index_listing (release);
CREATE INDEX idx_cover_art_date_updated ON cover_art_archive.index_listing (date_updated DESC);
Row count: ~2 million releases with cover art
replication_control
Purpose: Replication status tracking
Key columns:
| Column | Type | Description |
|---|---|---|
| current_schema_sequence | INTEGER | Current schema version |
| current_replication_sequence | INTEGER | Current replication packet |
| last_replication_date | TIMESTAMP | Last replication timestamp |
Usage: Monitoring replication lag and detecting updates
Custom Indices
To support efficient change detection and queries, custom indices are created:
-- Artist change detection
CREATE INDEX IF NOT EXISTS idx_artist_last_updated
ON artist (last_updated DESC)
WHERE last_updated IS NOT NULL;
-- Release group change detection
CREATE INDEX IF NOT EXISTS idx_release_group_last_updated
ON release_group (last_updated DESC)
WHERE last_updated IS NOT NULL;
-- Release change detection
CREATE INDEX IF NOT EXISTS idx_release_last_updated
ON release (last_updated DESC)
WHERE last_updated IS NOT NULL;
-- Link change detection
CREATE INDEX IF NOT EXISTS idx_l_artist_url_last_updated
ON l_artist_url (last_updated DESC)
WHERE last_updated IS NOT NULL;
-- Cover art change detection
CREATE INDEX IF NOT EXISTS idx_cover_art_date_updated
ON cover_art_archive.index_listing (date_updated DESC)
WHERE date_updated IS NOT NULL;
SQL Query Files
artist.sql
Purpose: Fetch complete artist metadata with releases
Location: lidarrmetadata/sql/artist.sql
Parameters:
- $1: Artist MBID (UUID)
- $2: Primary release types (array)
- $3: Secondary release types (array)
- $4: Release statuses (array)
Query structure:
WITH artist_data AS (
SELECT
a.gid AS id,
a.name AS artist_name,
a.sort_name,
a.comment AS disambiguation,
at.name AS artist_type,
g.name AS gender,
ar.name AS hometown,
a.begin_date_year AS start_year,
a.end_date_year AS end_year
FROM artist a
LEFT JOIN artist_type at ON a.type = at.id
LEFT JOIN gender g ON a.gender = g.id
LEFT JOIN area ar ON a.area = ar.id
WHERE a.gid = $1
),
releases AS (
SELECT
rg.gid AS id,
rg.name AS title,
rg.comment AS disambiguation,
rgt.name AS primary_type,
rgst.name AS secondary_type,
rs.name AS status,
COALESCE(
TO_CHAR(DATE(rd.date_year || '-' || COALESCE(rd.date_month, 1) || '-' || COALESCE(rd.date_day, 1)), 'YYYY-MM-DD'),
''
) AS release_date
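FROM artist a
-- release_group.artist_credit references artist_credit.id, so the artist is resolved through artist_credit_name
JOIN artist_credit_name acn ON acn.artist = a.id
JOIN release_group rg ON rg.artist_credit = acn.artist_credit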
LEFT JOIN release_group_primary_type rgt ON rg.type = rgt.id
LEFT JOIN release_group_secondary_type_join rgstj ON rgstj.release_group = rg.id
LEFT JOIN release_group_secondary_type rgst ON rgstj.secondary_type = rgst.id
LEFT JOIN release r ON r.release_group = rg.id
LEFT JOIN release_status rs ON r.status = rs.id
LEFT JOIN (
SELECT release, MIN(date_year) AS date_year, MIN(date_month) AS date_month, MIN(date_day) AS date_day
FROM release_country
GROUP BY release
) rd ON rd.release = r.id
WHERE a.gid = $1
AND (ARRAY_LENGTH($2::text[], 1) IS NULL OR rgt.name = ANY($2))
AND (ARRAY_LENGTH($3::text[], 1) IS NULL OR rgst.name = ANY($3))
AND (ARRAY_LENGTH($4::text[], 1) IS NULL OR rs.name = ANY($4))
GROUP BY rg.id, rg.name, rg.comment, rgt.name, rgst.name, rs.name, rd.date_year, rd.date_month, rd.date_day
ORDER BY rd.date_year DESC NULLS LAST, rd.date_month DESC NULLS LAST, rd.date_day DESC NULLS LAST
),
links AS (
SELECT
lt.name AS link_type,
u.url
FROM artist a
JOIN l_artist_url lau ON lau.entity0 = a.id
JOIN url u ON lau.entity1 = u.id
JOIN link l ON lau.link = l.id
JOIN link_type lt ON l.link_type = lt.id
WHERE a.gid = $1
)
SELECT
row_to_json(artist_data.*) AS artist,
COALESCE(
(SELECT json_agg(row_to_json(releases.*)) FROM releases),
'[]'::json
) AS albums,
COALESCE(
(SELECT json_agg(row_to_json(links.*)) FROM links),
'[]'::json
) AS links
FROM artist_data;
Performance: 100-500ms depending on artist discography size
Result format: Single row with three JSON columns (artist, albums, links)
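Because the query returns one row with three JSON columns, the caller only has to decode and merge them. A sketch, assuming the driver hands json columns back as strings (asyncpg does this unless a type codec is registered); parse_artist_row is a hypothetical helper:

```python
import json


def parse_artist_row(row):
    """Merge the artist, albums and links columns into one dict.

    `row` is any mapping keyed by the three column names from artist.sql;
    values may be JSON strings or already-decoded objects.
    """
    def decode(value, default):
        if value is None:
            return default
        return json.loads(value) if isinstance(value, (str, bytes)) else value

    artist = dict(decode(row['artist'], {}))
    artist['albums'] = decode(row['albums'], [])
    artist['links'] = decode(row['links'], [])
    return artist


# Usage with a hand-written row standing in for a database result:
row = {
    'artist': '{"artist_name": "Nirvana"}',
    'albums': '[{"title": "Nevermind"}]',
    'links': '[]',
}
print(parse_artist_row(row)['albums'][0]['title'])
```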
album.sql
Purpose: Fetch complete album metadata with tracks
Location: lidarrmetadata/sql/album.sql
Parameters:
$1: Release group MBID (UUID)
Query structure:
WITH album_data AS (
SELECT
rg.gid AS id,
rg.name AS title,
rg.comment AS disambiguation,
rgt.name AS primary_type,
rs.name AS status,
a.gid AS artist_id,
a.name AS artist_name,
COALESCE(
TO_CHAR(DATE(rd.date_year || '-' || COALESCE(rd.date_month, 1) || '-' || COALESCE(rd.date_day, 1)), 'YYYY-MM-DD'),
''
) AS release_date
FROM release_group rg
JOIN artist_credit ac ON rg.artist_credit = ac.id
JOIN artist_credit_name acn ON acn.artist_credit = ac.id
JOIN artist a ON acn.artist = a.id
LEFT JOIN release_group_primary_type rgt ON rg.type = rgt.id
LEFT JOIN release r ON r.release_group = rg.id
LEFT JOIN release_status rs ON r.status = rs.id
LEFT JOIN (
SELECT release, MIN(date_year) AS date_year, MIN(date_month) AS date_month, MIN(date_day) AS date_day
FROM release_country
GROUP BY release
) rd ON rd.release = r.id
WHERE rg.gid = $1
LIMIT 1
),
media AS (
SELECT
m.position,
mf.name AS format,
json_agg(
json_build_object(
'position', t.position,
'title', t.name,
'duration', t.length,
'artist_name', ta.name
)
ORDER BY t.position
) AS tracks
FROM release_group rg
JOIN release r ON r.release_group = rg.id
JOIN medium m ON m.release = r.id
LEFT JOIN medium_format mf ON m.format = mf.id
JOIN track t ON t.medium = m.id
JOIN recording rec ON t.recording = rec.id
JOIN artist_credit tac ON rec.artist_credit = tac.id
JOIN artist_credit_name tacn ON tacn.artist_credit = tac.id
JOIN artist ta ON tacn.artist = ta.id
WHERE rg.gid = $1
GROUP BY m.id, m.position, mf.name
ORDER BY m.position
)
SELECT
row_to_json(album_data.*) AS album,
COALESCE(
(SELECT json_agg(row_to_json(media.*)) FROM media),
'[]'::json
) AS media
FROM album_data;
Performance: 200-1000ms depending on track count
updated_artists.sql
Purpose: Detect recently updated artists for cache invalidation
Location: lidarrmetadata/sql/updated_artists.sql
Parameters:
- $1: Timestamp threshold (only artists updated after this)
- $2: Result limit
Query structure (UNION of 5 change sources):
-- Source 1: Artists with updated metadata
SELECT DISTINCT
a.gid,
a.last_updated,
'metadata' AS change_type
FROM artist a
WHERE a.last_updated > $1
UNION
-- Source 2: Artists with new release groups
SELECT DISTINCT
a.gid,
rg.last_updated,
'new_release' AS change_type
FROM artist a
JOIN artist_credit_name acn ON acn.artist = a.id
JOIN release_group rg ON rg.artist_credit = acn.artist_credit
WHERE rg.last_updated > $1
UNION
-- Source 3: Artists with updated releases
SELECT DISTINCT
a.gid,
r.last_updated,
'updated_release' AS change_type
FROM artist a
JOIN artist_credit_name acn ON acn.artist = a.id
JOIN release_group rg ON rg.artist_credit = acn.artist_credit
JOIN release r ON r.release_group = rg.id
WHERE r.last_updated > $1
UNION
-- Source 4: Artists with new/updated links
SELECT DISTINCT
a.gid,
lau.last_updated,
'new_link' AS change_type
FROM artist a
JOIN l_artist_url lau ON lau.entity0 = a.id
WHERE lau.last_updated > $1
UNION
-- Source 5: Artists with updated cover art
SELECT DISTINCT
a.gid,
caa.date_updated AS last_updated,
'cover_art' AS change_type
FROM artist a
JOIN artist_credit_name acn ON acn.artist = a.id
JOIN release_group rg ON rg.artist_credit = acn.artist_credit
JOIN release r ON r.release_group = rg.id
JOIN cover_art_archive.index_listing caa ON caa.release = r.id
WHERE caa.date_updated > $1
ORDER BY last_updated DESC
LIMIT $2;
Performance: 500-2000ms depending on time window
Use case: Crawler scheduling, cache invalidation
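UNION only removes rows that are identical in every column, so an artist touched by several change sources comes back once per (timestamp, change type) pair. A consumer will usually want one entry per artist. A sketch of that reduction; collapse_changes is a hypothetical helper and the MBID strings are placeholders:

```python
from datetime import datetime


def collapse_changes(rows):
    """Reduce (gid, last_updated, change_type) rows to one entry per artist.

    Returns {gid: (latest_timestamp, set_of_change_types)}.
    """
    merged = {}
    for gid, last_updated, change_type in rows:
        latest, kinds = merged.get(gid, (last_updated, set()))
        merged[gid] = (max(latest, last_updated), kinds | {change_type})
    return merged


# Placeholder rows in the shape returned by updated_artists.sql:
rows = [
    ('mbid-a', datetime(2025, 4, 28, 10, 0), 'metadata'),
    ('mbid-a', datetime(2025, 4, 28, 12, 0), 'cover_art'),
    ('mbid-b', datetime(2025, 4, 28, 11, 0), 'new_link'),
]
for gid, (latest, kinds) in collapse_changes(rows).items():
    print(gid, latest.isoformat(), sorted(kinds))
```

Note that LIMIT $2 applies to the rows before this reduction, so the number of distinct artists can be lower than the limit.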
updated_albums.sql
Purpose: Detect recently updated albums
Location: lidarrmetadata/sql/updated_albums.sql
Parameters: Same as updated_artists.sql
Query structure: Similar UNION pattern with 5 change sources:
- Release group metadata updates
- New releases in group
- Updated releases in group
- New/updated links
- Cover art updates
Database Replication
Replication method: MusicBrainz replication packets
Update frequency: Hourly
Replication process:
- Check replication_control table for current sequence
- Fetch replication packets from MusicBrainz FTP
- Apply SQL changes from packets
- Update replication_control table
- Trigger search index updates via RabbitMQ
Monitoring:
SELECT
current_replication_sequence,
last_replication_date,
NOW() - last_replication_date AS replication_lag
FROM replication_control;
Typical lag: 1-2 hours behind MusicBrainz master
Cache Database (PostgreSQL)
Database Overview
Purpose: Persistent cache storage with compression
Technology: PostgreSQL 12+ (same instance as MusicBrainz or separate)
Database name: lm_cache_db
Connection configuration:
CACHE_DATABASE = {
'host': 'localhost',
'port': 5432,
'database': 'lm_cache_db',
'user': 'abc',
'password': 'abc'
}
Auto-Created Cache Tables
Each cache type gets its own table with identical schema:
Table names:
- artist: Artist metadata cache
- album: Album metadata cache
- spotify: Spotify lookup cache
- fanart: FanArt.tv image cache
- tadb: TheAudioDB metadata cache
- wikipedia: Wikipedia overview cache
Schema:
CREATE TABLE IF NOT EXISTS {cache_name} (
key VARCHAR(255) PRIMARY KEY,
expires TIMESTAMP,
updated TIMESTAMP DEFAULT NOW(),
value BYTEA
);
CREATE INDEX IF NOT EXISTS idx_{cache_name}_expires
ON {cache_name} (expires)
WHERE expires IS NOT NULL;
CREATE INDEX IF NOT EXISTS idx_{cache_name}_updated
ON {cache_name} (updated DESC);
Trigger for automatic timestamp updates:
CREATE OR REPLACE FUNCTION update_updated_column()
RETURNS TRIGGER AS $$
BEGIN
NEW.updated = NOW();
RETURN NEW;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER update_{cache_name}_updated
BEFORE UPDATE ON {cache_name}
FOR EACH ROW
EXECUTE FUNCTION update_updated_column();
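The {cache_name} placeholder has to be filled in by string formatting, because table names cannot be passed as bound query parameters. A sketch of one way to do that safely, checking the name against the fixed list of cache tables first; render_table_ddl is a hypothetical helper and only the CREATE TABLE statement is shown:

```python
CACHE_NAMES = ('artist', 'album', 'spotify', 'fanart', 'tadb', 'wikipedia')

TABLE_DDL = """
CREATE TABLE IF NOT EXISTS {cache_name} (
    key VARCHAR(255) PRIMARY KEY,
    expires TIMESTAMP,
    updated TIMESTAMP DEFAULT NOW(),
    value BYTEA
);
"""


def render_table_ddl(cache_name):
    """Fill the template for one cache table, rejecting unknown names.

    The name is interpolated into SQL text, so it is validated against
    the known cache tables before formatting.
    """
    if cache_name not in CACHE_NAMES:
        raise ValueError(f"unknown cache table: {cache_name!r}")
    return TABLE_DDL.format(cache_name=cache_name)


print(render_table_ddl('spotify'))
```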
Cache Entry Format
Key structure: {cache_type}:{identifier}:{parameters}
Examples:
- artist:5b11f4ce-a62d-471e-81fc-a69a8278c7da:Album:Official
- album:1b022e01-4da6-387b-8658-8678046e4cef
- spotify:artist:6olE6TJLqED3rqDCT0FyPh
- wikipedia:5b11f4ce-a62d-471e-81fc-a69a8278c7da:en
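The key structure can be produced by joining the parts with colons. A small sketch; build_cache_key is a hypothetical helper, and the order of the trailing parameters is assumed to be fixed by the caller:

```python
def build_cache_key(cache_type, identifier, *parameters):
    """Join the parts as {cache_type}:{identifier}:{parameters}."""
    parts = [cache_type, identifier, *parameters]
    return ':'.join(str(part) for part in parts)


print(build_cache_key('artist', '5b11f4ce-a62d-471e-81fc-a69a8278c7da', 'Album', 'Official'))
print(build_cache_key('album', '1b022e01-4da6-387b-8658-8678046e4cef'))
```

Since the parameters are part of the key, the same artist fetched with different type or status filters occupies separate cache entries.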
Value format: zlib-compressed pickle
Compression implementation:
import zlib
import pickle

def compress_value(value):
    """Compress Python object for storage"""
    pickled = pickle.dumps(value, protocol=pickle.HIGHEST_PROTOCOL)
    compressed = zlib.compress(pickled, level=6)
    return compressed

def decompress_value(compressed):
    """Decompress stored value to Python object"""
    pickled = zlib.decompress(compressed)
    value = pickle.loads(pickled)
    return value
Compression ratio: Typically 10:1 for JSON metadata
Example:
- Uncompressed artist metadata: 50KB JSON
- Pickled: 52KB
- Compressed: 5KB
- Storage savings: 90%
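The ratio depends heavily on the payload, so it is worth measuring on real entries. A sketch that reports the sizes for any Python object; compression_stats is a hypothetical helper and the sample payload is synthetic, so its ratio says nothing about production data:

```python
import pickle
import zlib


def compression_stats(value):
    """Return (pickled_size, compressed_size, ratio) for a Python object."""
    pickled = pickle.dumps(value, protocol=pickle.HIGHEST_PROTOCOL)
    compressed = zlib.compress(pickled, level=6)
    return len(pickled), len(compressed), len(pickled) / len(compressed)


# Synthetic, repetitive payload; real metadata will compress differently.
sample = {
    'albums': [
        {'title': f'Album {i}', 'type': 'Album', 'status': 'Official'}
        for i in range(500)
    ]
}
pickled_size, compressed_size, ratio = compression_stats(sample)
print(f"pickled={pickled_size}B compressed={compressed_size}B ratio={ratio:.1f}:1")
```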
Cache Operations
Insert/Update
from datetime import datetime, timedelta

# conn is an open asyncpg connection; the examples operate on the artist table
async def cache_set(key, value, ttl=2592000):
    """Store value in cache; ttl is in seconds (default 30 days), None means no expiry"""
    compressed = zlib.compress(pickle.dumps(value))
    expires = datetime.now() + timedelta(seconds=ttl) if ttl else None
    await conn.execute(
        """
        INSERT INTO artist (key, value, expires)
        VALUES ($1, $2, $3)
        ON CONFLICT (key) DO UPDATE
        SET value = $2, expires = $3, updated = NOW()
        """,
        key, compressed, expires
    )
Retrieve
async def cache_get(key):
    """Retrieve value from cache"""
    row = await conn.fetchrow(
        """
        SELECT value, expires
        FROM artist
        WHERE key = $1
        """,
        key
    )
    if not row:
        return None
    # Check expiration
    if row['expires'] and row['expires'] < datetime.now():
        # Expired, delete and return None
        await conn.execute("DELETE FROM artist WHERE key = $1", key)
        return None
    # Decompress and return
    value = pickle.loads(zlib.decompress(row['value']))
    return value
Delete
async def cache_delete(key):
    """Delete value from cache"""
    await conn.execute("DELETE FROM artist WHERE key = $1", key)
Cleanup Expired Entries
async def cache_cleanup():
    """Remove expired entries and return the number of rows deleted"""
    status = await conn.execute(
        """
        DELETE FROM artist
        WHERE expires IS NOT NULL AND expires < NOW()
        """
    )
    # asyncpg returns a status string such as "DELETE 42"
    return int(status.split()[-1])
Cleanup schedule: Daily via cron or crawler
Cache Statistics
Query for cache statistics:
SELECT
'artist' AS cache_name,
COUNT(*) AS total_entries,
COUNT(*) FILTER (WHERE expires IS NOT NULL AND expires < NOW()) AS expired_entries,
COUNT(*) FILTER (WHERE expires IS NULL OR expires >= NOW()) AS valid_entries,
pg_size_pretty(pg_total_relation_size('artist')) AS total_size,
AVG(LENGTH(value)) AS avg_value_size,
MAX(updated) AS last_updated
FROM artist;
Example output:
cache_name | total_entries | expired_entries | valid_entries | total_size | avg_value_size | last_updated
-----------+---------------+-----------------+---------------+------------+----------------+---------------------
artist | 125000 | 5000 | 120000 | 2048 MB | 5120 | 2025-04-28 12:34:56
Redis Cache
Redis Overview
Purpose: Ephemeral cache for hot data and rate limiting
Technology: Redis 6+
Memory limit: 512MB
Eviction policy: LFU (Least Frequently Used)
Namespace: lm3.7
Connection configuration:
REDIS_URL = 'redis://localhost:6379/0'
REDIS_MIN_POOL_SIZE = 5
REDIS_MAX_POOL_SIZE = 20
Redis Configuration
redis.conf settings:
maxmemory 512mb
maxmemory-policy allkeys-lfu
# Disable RDB snapshots (comments must be on their own line in redis.conf)
save ""
# Disable AOF
appendonly no
Rationale:
- LFU eviction keeps most-accessed data in cache
- No persistence needed (PostgreSQL is persistent layer)
- Maximum performance for cache operations
Key Structure
Namespace prefix: All keys prefixed with lm3.7:
Key patterns:
- lm3.7:artist:{mbid}:{params}: Artist metadata
- lm3.7:album:{mbid}: Album metadata
- lm3.7:search:artist:{query}: Search results
- lm3.7:ratelimit:{ip}:{window}: Rate limiter state
- lm3.7:sentry:{error_hash}: Sentry deduplication
- lm3.7:lock:invalidate:{mbid}: Invalidation locks
Cache Operations
Set with TTL
async def redis_set(key, value, ttl=604800):
    """Store value in Redis with TTL (default 7 days)"""
    compressed = zlib.compress(pickle.dumps(value))
    await redis.setex(f"lm3.7:{key}", ttl, compressed)
Get
async def redis_get(key):
    """Retrieve value from Redis"""
    compressed = await redis.get(f"lm3.7:{key}")
    if not compressed:
        return None
    value = pickle.loads(zlib.decompress(compressed))
    return value
Delete
async def redis_delete(key):
    """Delete value from Redis"""
    await redis.delete(f"lm3.7:{key}")
Batch Delete
async def redis_delete_pattern(pattern):
    """Delete all keys matching pattern"""
    cursor = 0
    while True:
        cursor, keys = await redis.scan(cursor, match=f"lm3.7:{pattern}", count=100)
        if keys:
            await redis.delete(*keys)
        if cursor == 0:
            break
Rate Limiting with Redis
Implementation: Fixed-window counter (one counter per client per time window)
import time

class RateLimitExceeded(Exception):
    """Raised when a client exceeds the request limit"""

async def rate_limit_check(key, max_requests=100, window=60):
    """Check if request is within rate limit"""
    now = time.time()
    # The window index changes every `window` seconds, which starts a new counter
    window_key = f"lm3.7:ratelimit:{key}:{int(now / window)}"
    # Increment counter
    count = await redis.incr(window_key)
    # Set expiration on first request
    if count == 1:
        await redis.expire(window_key, window)
    # Check limit
    if count > max_requests:
        raise RateLimitExceeded(f"Rate limit exceeded: {count}/{max_requests}")
    return count
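Because the counter key embeds int(now / window), counts reset at every window boundary. A small in-memory model shows the consequence: a client can send up to twice max_requests in a short burst that straddles a boundary. FixedWindowLimiter is a hypothetical stand-in for the Redis counter, used only for illustration:

```python
class FixedWindowLimiter:
    """In-memory stand-in for the INCR/EXPIRE counter (no expiry, no Redis)."""

    def __init__(self, max_requests=100, window=60):
        self.max_requests = max_requests
        self.window = window
        self.counts = {}  # (key, window_index) -> request count

    def allow(self, key, now):
        """Count one request at time `now` (seconds); True if within the limit."""
        bucket = (key, int(now // self.window))
        self.counts[bucket] = self.counts.get(bucket, 0) + 1
        return self.counts[bucket] <= self.max_requests


limiter = FixedWindowLimiter(max_requests=2, window=60)
# Two requests at the end of one window, two at the start of the next:
results = [limiter.allow('1.2.3.4', t) for t in (58, 59, 60, 61)]
print(results)  # [True, True, True, True]: four requests within four seconds
print(limiter.allow('1.2.3.4', 62))  # False: third request in the second window
```

If that burst matters, a sliding-window log or a weighted blend of the current and previous window counts is the usual alternative, at the cost of more state per client.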
Sentry Deduplication with Redis
Purpose: Prevent duplicate error reports
async def sentry_should_send(error_hash):
    """Check if error should be sent to Sentry"""
    key = f"lm3.7:sentry:{error_hash}"
    # Atomically mark the error as seen for 1 hour. SET with NX succeeds only
    # if the key does not exist yet, so two concurrent callers cannot both send.
    first_seen = await redis.set(key, "1", ex=3600, nx=True)
    return bool(first_seen)
Redis Monitoring
Memory usage:
redis-cli INFO memory
Key count:
redis-cli DBSIZE
Eviction stats:
redis-cli INFO stats | grep evicted
Hit rate:
redis-cli INFO stats | grep keyspace
Solr Search Index
Solr Overview
Purpose: Full-text search for artists and albums
Technology: Apache Solr 8.x
Container image: ghcr.io/lidarr/mb-solr:3.3.1.9
Cores:
- artist: Artist search index
- release-group: Album search index
Update method: Real-time via RabbitMQ + SIR (Search Index Rebuilder)
Solr Configuration
solrconfig.xml settings:
<config>
<requestHandler name="/select" class="solr.SearchHandler">
<lst name="defaults">
<str name="echoParams">explicit</str>
<str name="defType">dismax</str>
<int name="rows">10</int>
</lst>
</requestHandler>
<requestHandler name="/update" class="solr.UpdateRequestHandler" />
<requestHandler name="/admin/ping" class="solr.PingRequestHandler">
<lst name="invariants">
<str name="q">*:*</str>
</lst>
</requestHandler>
</config>
Artist Core Schema
schema.xml:
<schema name="artist" version="1.6">
<field name="id" type="string" indexed="true" stored="true" required="true" />
<field name="artist" type="text_general" indexed="true" stored="true" />
<field name="sortname" type="text_general" indexed="true" stored="true" />
<field name="alias" type="text_general" indexed="true" stored="true" multiValued="true" />
<field name="type" type="string" indexed="true" stored="true" />
<field name="disambiguation" type="text_general" indexed="true" stored="true" />
<field name="_version_" type="long" indexed="true" stored="true" />
<uniqueKey>id</uniqueKey>
<copyField source="artist" dest="text" />
<copyField source="sortname" dest="text" />
<copyField source="alias" dest="text" />
</schema>
Indexed fields:
- id: MusicBrainz artist MBID
- artist: Artist name (boosted 2x in queries)
- sortname: Sort name
- alias: Artist aliases (multi-valued)
- type: Artist type (Person, Group, etc.)
- disambiguation: Disambiguation comment
Release Group Core Schema
Indexed fields:
- id: MusicBrainz release group MBID
- title: Album title (boosted 2x)
- artist: Artist name
- type: Primary type (Album, Single, etc.)
- status: Release status
- disambiguation: Disambiguation comment
Search Query Format
Dismax query example:
{
"query": "nirvana",
"limit": 10,
"params": {
"defType": "dismax",
"qf": "artist^2 sortname alias",
"mm": "1"
}
}
Query field weights:
- artist^2: Artist name (2x boost)
- sortname: Sort name (1x)
- alias: Aliases (1x)
Minimum match: At least 1 term must match
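The same dismax settings can also be sent as ordinary query-string parameters to the core's /select handler. A sketch of building that URL with the standard library; the base URL is an assumption about the deployment, and artist_search_url is a hypothetical helper:

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Assumed base URL; host, port and path depend on the deployment.
SOLR_BASE = 'http://solr:8983/solr'


def artist_search_url(query, limit=10, base=SOLR_BASE):
    """Build a /select URL for the artist core with the dismax settings above."""
    params = {
        'q': query,
        'defType': 'dismax',
        'qf': 'artist^2 sortname alias',  # artist name weighted 2x
        'mm': '1',                        # at least one term must match
        'rows': limit,
        'wt': 'json',
    }
    return f"{base}/artist/select?{urlencode(params)}"


url = artist_search_url('nirvana')
print(url)
print(parse_qs(urlsplit(url).query)['qf'])
```

urlencode takes care of escaping the caret and the spaces in qf, which is easy to get wrong when the URL is assembled by hand.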
Solr Update Process
Real-time updates via RabbitMQ:
- MusicBrainz replication applies database changes
- Database triggers publish messages to RabbitMQ
- SIR (Search Index Rebuilder) consumes messages
- SIR queries MusicBrainz database for updated entities
- SIR posts updates to Solr cores
- Solr commits changes (soft commit every 1s)
RabbitMQ configuration:
RABBITMQ = {
'host': 'rabbitmq',
'port': 5672,
'user': 'abc',
'password': 'abc',
'exchange': 'search.index',
'queue': 'search.index.artist'
}
Update latency: 1-5 seconds from database change to searchable
Solr Performance
Query timeout: 5 seconds
Typical query time: 50-200ms
Index size:
- Artist core: ~4GB
- Release group core: ~4GB
Document count:
- Artist core: ~2 million
- Release group core: ~3 million
Change Detection System
Overview
Purpose: Identify recently updated entities for cache invalidation
Method: SQL queries tracking changes across multiple sources
Update frequency: Hourly (aligned with MusicBrainz replication)
Artist Change Sources
5 change sources tracked:
- Artist metadata updates: artist.last_updated
- New release groups: release_group.last_updated
- Updated releases: release.last_updated
- New/updated links: l_artist_url.last_updated
- Cover art updates: cover_art_archive.index_listing.date_updated
Query: updated_artists.sql (UNION of 5 queries)
Performance: 500-2000ms for 24-hour window
Typical results: 1000-5000 artists per hour
Album Change Sources
5 change sources tracked:
- Release group metadata updates: release_group.last_updated
- New releases in group: release.last_updated
- Updated releases in group: release.last_updated
- New/updated links: l_release_group_url.last_updated
- Cover art updates: cover_art_archive.index_listing.date_updated
Query: updated_albums.sql
Typical results: 2000-10000 albums per hour
Change Detection Workflow
Crawler process:
- Query updated_artists.sql with timestamp from last run
- For each updated artist:
  - Delete from Redis cache
  - Delete from PostgreSQL cache
  - Purge from Cloudflare CDN
  - Optionally pre-fetch fresh metadata
- Update last run timestamp
- Sleep until next cycle
Invalidation vs. pre-fetching:
- Invalidation only: Fast, minimal API load
- Pre-fetching: Slower, but ensures cache is warm
Configuration:
CRAWLER_INVALIDATE_ONLY = False # Pre-fetch after invalidation
CRAWLER_INTERVAL = 3600 # 1 hour
CRAWLER_BATCH_SIZE = 100 # Process 100 entities per batch
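A sketch of how the batch size and the invalidate-only switch could shape one crawler cycle. The helpers batches and run_cycle are hypothetical, and the invalidate and prefetch callables stand in for the real cache and API calls:

```python
from itertools import islice

CRAWLER_BATCH_SIZE = 100


def batches(items, size=CRAWLER_BATCH_SIZE):
    """Yield successive lists of at most `size` items."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def run_cycle(mbids, invalidate, prefetch, invalidate_only=False):
    """Invalidate every entity batch by batch, then optionally re-fetch it."""
    for batch in batches(mbids):
        for mbid in batch:
            invalidate(mbid)
        if not invalidate_only:
            for mbid in batch:
                prefetch(mbid)


print([len(b) for b in batches(range(250))])  # [100, 100, 50]
```

Pre-fetching per batch, not after the whole run, bounds how long any entity sits invalidated but cold.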
Data Consistency
Cache Coherence
Problem: Three cache tiers can become inconsistent
Solution: Hierarchical invalidation
Invalidation order:
- Cloudflare CDN (purge API)
- Redis (delete key)
- PostgreSQL (delete row)
Rationale: Invalidate from edge to origin to prevent stale data propagation
Replication Lag Handling
Problem: MusicBrainz replication has 1-2 hour lag
Solution: Accept eventual consistency
User impact: Newly added artists/albums may not appear in search for 1-2 hours
Mitigation: Manual cache refresh endpoint for urgent updates
Concurrent Update Handling
Problem: Multiple API instances may invalidate cache simultaneously
Solution: Redis-based distributed locks
Implementation:
async def invalidate_with_lock(mbid):
    """Invalidate cache with distributed lock"""
    lock_key = f"lm3.7:lock:invalidate:{mbid}"
    # Try to acquire lock (30 second TTL)
    acquired = await redis.set(lock_key, "1", ex=30, nx=True)
    if not acquired:
        # Another instance is already invalidating
        return False
    try:
        # Perform invalidation
        await invalidate_cdn(mbid)
        await redis_delete(f"artist:{mbid}")
        await postgres_delete(f"artist:{mbid}")
        return True
    finally:
        # Release lock
        await redis.delete(lock_key)
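One caveat with a constant lock value: if invalidation takes longer than the 30 second TTL, the lock expires, another instance can acquire it, and the final delete then removes that instance's lock. A common remedy is to store a unique token and release only when the token still matches. The sketch below illustrates the idea against a tiny in-memory stand-in; FakeRedis, acquire_lock and release_lock are hypothetical names, not the project's code:

```python
import asyncio
import secrets


class FakeRedis:
    """In-memory stand-in exposing only the calls used below (TTL is ignored)."""

    def __init__(self):
        self.store = {}

    async def set(self, key, value, ex=None, nx=False):
        if nx and key in self.store:
            return None
        self.store[key] = value
        return True

    async def get(self, key):
        return self.store.get(key)

    async def delete(self, key):
        return 1 if self.store.pop(key, None) is not None else 0


async def acquire_lock(redis, lock_key, ttl=30):
    """Return a unique token if the lock was acquired, else None."""
    token = secrets.token_hex(16)
    acquired = await redis.set(lock_key, token, ex=ttl, nx=True)
    return token if acquired else None


async def release_lock(redis, lock_key, token):
    """Delete the lock only if it still holds our token.

    GET followed by DELETE is not atomic on a real server; a production
    version would do the compare-and-delete in a single Lua script. A real
    client may also return bytes, which would need decoding before comparing.
    """
    if await redis.get(lock_key) == token:
        await redis.delete(lock_key)
        return True
    return False


async def demo():
    redis = FakeRedis()
    key = 'lm3.7:lock:invalidate:example'
    first = await acquire_lock(redis, key)
    second = await acquire_lock(redis, key)          # lock is held: None
    stale = await release_lock(redis, key, 'other')  # wrong token: False
    released = await release_lock(redis, key, first)
    return first, second, stale, released


print(asyncio.run(demo())[1:])  # (None, False, True)
```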
Data Volume Estimates
MusicBrainz Database
| Table | Row Count | Avg Row Size | Total Size |
|---|---|---|---|
| artist | 2M | 500B | 1GB |
| release_group | 3M | 400B | 1.2GB |
| release | 5M | 600B | 3GB |
| medium | 6M | 200B | 1.2GB |
| track | 50M | 300B | 15GB |
| recording | 40M | 400B | 16GB |
| url | 10M | 200B | 2GB |
| l_artist_url | 5M | 100B | 500MB |
| Total | 121M | - | ~100GB for the whole database (the tables listed sum to ~40GB) |
Cache Database
| Table | Entry Count | Avg Compressed Size | Total Size |
|---|---|---|---|
| artist | 120K | 5KB | 600MB |
| album | 200K | 3KB | 600MB |
| spotify | 50K | 1KB | 50MB |
| fanart | 80K | 2KB | 160MB |
| tadb | 60K | 2KB | 120MB |
| wikipedia | 100K | 1KB | 100MB |
| Total | 610K | - | ~2GB |
Redis Cache
| Key Pattern | Entry Count | Avg Size | Total Size |
|---|---|---|---|
| artist:* | 10K | 5KB | 50MB |
| album:* | 15K | 3KB | 45MB |
| search:* | 5K | 10KB | 50MB |
| ratelimit:* | 1K | 100B | 100KB |
| sentry:* | 500 | 100B | 50KB |
| Total | 31.5K | - | ~150MB |
Solr Index
| Core | Document Count | Index Size |
|---|---|---|
| artist | 2M | 4GB |
| release-group | 3M | 4GB |
| Total | 5M | 8GB |
Conclusion
The data layer uses a multi-tier architecture:
- MusicBrainz PostgreSQL: Authoritative source with complex SQL aggregation
- Cache PostgreSQL: Persistent compressed cache with automatic expiration
- Redis: Hot cache with LFU eviction and rate limiting
- Solr: Real-time search index with RabbitMQ updates
Key strengths:
- Direct database access for complex queries
- Three-tier caching with compression (10:1 ratio)
- Change detection across 5 sources per entity type
- Real-time search index updates
The SQL queries that use row_to_json and json_agg build nested JSON structures directly in the database, so each artist or album comes back as a single row and does not have to be assembled from several result sets in application code.