a1f6701bac
- gRPC service with MusicBrainz provider - PostgreSQL schema with migrations - Service layer with database-first caching - Repository pattern for data access - YAML configuration support - Research documentation for 17 music metadata projects
1254 lines
32 KiB
Markdown
# Lidarr Metadata API - Data Layer
|
|
|
|
## Data Source Overview
|
|
|
|
The Lidarr Metadata API integrates with four primary data storage systems:
|
|
|
|
| System | Purpose | Size | Technology | Access Pattern |
|--------|---------|------|------------|----------------|
| **MusicBrainz PostgreSQL** | Authoritative music metadata | 100GB+ | PostgreSQL 12+ | Direct SQL (asyncpg) |
| **Cache Database** | Persistent metadata cache | 10-50GB | PostgreSQL 12+ | Direct SQL (asyncpg) |
| **Redis** | Ephemeral cache + rate limiting | 512MB | Redis 6+ | Key-value (aioredis) |
| **Solr** | Full-text search index | 8GB+ | Solr 8.x | HTTP REST API |
|
|
|
|
## MusicBrainz PostgreSQL Database
|
|
|
|
### Database Overview
|
|
|
|
**Purpose**: Authoritative source for all music metadata
|
|
|
|
**Access method**: Direct read-only SQL queries via asyncpg
|
|
|
|
**Replication**: Hourly updates from MusicBrainz master database
|
|
|
|
**Container image**: `ghcr.io/lidarr/mb-postgres:1.0.10`
|
|
|
|
**Connection configuration**:
|
|
```python
|
|
DATABASE = {
|
|
'host': 'musicbrainz',
|
|
'port': 5432,
|
|
'database': 'musicbrainz_db',
|
|
'user': 'abc',
|
|
'password': 'abc',
|
|
'min_pool_size': 10,
|
|
'max_pool_size': 50,
|
|
'command_timeout': 30
|
|
}
|
|
```
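
As a rough sketch of how this dictionary could be handed to asyncpg: `asyncpg.create_pool` names its pool bounds `min_size` and `max_size`, so the two `*_pool_size` keys need renaming before the call. The helper names `pool_kwargs` and `create_musicbrainz_pool` are illustrative assumptions, not taken from the Lidarr code, and the remaining keys are assumed to pass through unchanged.

```python
# Sketch only: translate the DATABASE dict into keyword arguments
# for asyncpg.create_pool(). Helper names here are hypothetical.

DATABASE = {
    'host': 'musicbrainz',
    'port': 5432,
    'database': 'musicbrainz_db',
    'user': 'abc',
    'password': 'abc',
    'min_pool_size': 10,
    'max_pool_size': 50,
    'command_timeout': 30,
}

# asyncpg uses min_size/max_size for pool bounds; other keys map directly.
_RENAMES = {'min_pool_size': 'min_size', 'max_pool_size': 'max_size'}


def pool_kwargs(config):
    """Return a new dict with the pool-size keys renamed for asyncpg."""
    return {_RENAMES.get(key, key): value for key, value in config.items()}


async def create_musicbrainz_pool(config=DATABASE):
    """Open a connection pool (requires the asyncpg package)."""
    import asyncpg  # imported lazily so pool_kwargs() has no dependency
    return await asyncpg.create_pool(**pool_kwargs(config))
```

Keeping the renaming in a pure function makes the mapping easy to check without a running database.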
|
|
|
|
### Core Tables
|
|
|
|
#### artist
|
|
|
|
**Purpose**: Artist entities (musicians, bands, orchestras, etc.)
|
|
|
|
**Key columns**:
|
|
|
|
| Column | Type | Description |
|--------|------|-------------|
| `id` | INTEGER | Primary key |
| `gid` | UUID | MusicBrainz ID (public identifier) |
| `name` | VARCHAR | Artist name |
| `sort_name` | VARCHAR | Name for alphabetical sorting |
| `type` | INTEGER | Artist type (Person, Group, etc.) |
| `gender` | INTEGER | Gender (for Person type) |
| `area` | INTEGER | Geographic area |
| `begin_date_year` | SMALLINT | Formation/birth year |
| `end_date_year` | SMALLINT | Dissolution/death year |
| `comment` | VARCHAR | Disambiguation comment |
| `last_updated` | TIMESTAMP | Last modification timestamp |
|
|
|
|
**Indices**:
|
|
```sql
|
|
CREATE INDEX idx_artist_gid ON artist (gid);
|
|
CREATE INDEX idx_artist_name ON artist (name);
|
|
CREATE INDEX idx_artist_last_updated ON artist (last_updated DESC);
|
|
```
|
|
|
|
**Row count**: ~2 million artists
|
|
|
|
#### release_group
|
|
|
|
**Purpose**: Album groupings (abstract releases)
|
|
|
|
**Key columns**:
|
|
|
|
| Column | Type | Description |
|--------|------|-------------|
| `id` | INTEGER | Primary key |
| `gid` | UUID | MusicBrainz ID |
| `name` | VARCHAR | Album title |
| `artist_credit` | INTEGER | Artist credit ID |
| `type` | INTEGER | Primary type (Album, Single, EP, etc.) |
| `comment` | VARCHAR | Disambiguation |
| `last_updated` | TIMESTAMP | Last modification timestamp |
|
|
|
|
**Indices**:
|
|
```sql
|
|
CREATE INDEX idx_release_group_gid ON release_group (gid);
|
|
CREATE INDEX idx_release_group_artist_credit ON release_group (artist_credit);
|
|
CREATE INDEX idx_release_group_last_updated ON release_group (last_updated DESC);
|
|
```
|
|
|
|
**Row count**: ~3 million release groups
|
|
|
|
#### release
|
|
|
|
**Purpose**: Specific releases (physical/digital products)
|
|
|
|
**Key columns**:
|
|
|
|
| Column | Type | Description |
|--------|------|-------------|
| `id` | INTEGER | Primary key |
| `gid` | UUID | MusicBrainz ID |
| `name` | VARCHAR | Release title |
| `release_group` | INTEGER | Release group ID |
| `artist_credit` | INTEGER | Artist credit ID |
| `status` | INTEGER | Release status (Official, Promotion, etc.) |
| `packaging` | INTEGER | Packaging type |
| `barcode` | VARCHAR | Barcode |
| `last_updated` | TIMESTAMP | Last modification timestamp |
|
|
|
|
**Indices**:
|
|
```sql
|
|
CREATE INDEX idx_release_gid ON release (gid);
|
|
CREATE INDEX idx_release_release_group ON release (release_group);
|
|
CREATE INDEX idx_release_last_updated ON release (last_updated DESC);
|
|
```
|
|
|
|
**Row count**: ~5 million releases
|
|
|
|
#### medium
|
|
|
|
**Purpose**: Physical/digital media (CDs, Vinyl, Digital, etc.)
|
|
|
|
**Key columns**:
|
|
|
|
| Column | Type | Description |
|--------|------|-------------|
| `id` | INTEGER | Primary key |
| `release` | INTEGER | Release ID |
| `position` | INTEGER | Disc number |
| `format` | INTEGER | Medium format (CD, Vinyl, etc.) |
| `name` | VARCHAR | Medium name (e.g., "Bonus Disc") |
| `track_count` | INTEGER | Number of tracks |
|
|
|
|
**Indices**:
|
|
```sql
|
|
CREATE INDEX idx_medium_release ON medium (release);
|
|
```
|
|
|
|
**Row count**: ~6 million media
|
|
|
|
#### track
|
|
|
|
**Purpose**: Track listings on media
|
|
|
|
**Key columns**:
|
|
|
|
| Column | Type | Description |
|--------|------|-------------|
| `id` | INTEGER | Primary key |
| `gid` | UUID | MusicBrainz ID |
| `recording` | INTEGER | Recording ID |
| `medium` | INTEGER | Medium ID |
| `position` | INTEGER | Track number |
| `number` | VARCHAR | Track number (string, e.g., "A1") |
| `name` | VARCHAR | Track title |
| `length` | INTEGER | Duration in milliseconds |
|
|
|
|
**Indices**:
|
|
```sql
|
|
CREATE INDEX idx_track_medium ON track (medium);
|
|
CREATE INDEX idx_track_recording ON track (recording);
|
|
```
|
|
|
|
**Row count**: ~50 million tracks
|
|
|
|
#### recording
|
|
|
|
**Purpose**: Abstract recordings (audio content)
|
|
|
|
**Key columns**:
|
|
|
|
| Column | Type | Description |
|--------|------|-------------|
| `id` | INTEGER | Primary key |
| `gid` | UUID | MusicBrainz ID |
| `name` | VARCHAR | Recording title |
| `artist_credit` | INTEGER | Artist credit ID |
| `length` | INTEGER | Duration in milliseconds |
| `comment` | VARCHAR | Disambiguation |
| `last_updated` | TIMESTAMP | Last modification timestamp |
|
|
|
|
**Row count**: ~40 million recordings
|
|
|
|
#### url
|
|
|
|
**Purpose**: External URLs (websites, streaming services, etc.)
|
|
|
|
**Key columns**:
|
|
|
|
| Column | Type | Description |
|--------|------|-------------|
| `id` | INTEGER | Primary key |
| `gid` | UUID | MusicBrainz ID |
| `url` | TEXT | URL string |
|
|
|
|
**Indices**:
|
|
```sql
|
|
CREATE INDEX idx_url_url ON url (url);
|
|
```
|
|
|
|
**Row count**: ~10 million URLs
|
|
|
|
#### l_artist_url
|
|
|
|
**Purpose**: Artist-URL relationships (links)
|
|
|
|
**Key columns**:
|
|
|
|
| Column | Type | Description |
|--------|------|-------------|
| `id` | INTEGER | Primary key |
| `link` | INTEGER | Link type ID |
| `entity0` | INTEGER | Artist ID |
| `entity1` | INTEGER | URL ID |
| `last_updated` | TIMESTAMP | Last modification timestamp |
|
|
|
|
**Indices**:
|
|
```sql
|
|
CREATE INDEX idx_l_artist_url_entity0 ON l_artist_url (entity0);
|
|
CREATE INDEX idx_l_artist_url_entity1 ON l_artist_url (entity1);
|
|
CREATE INDEX idx_l_artist_url_last_updated ON l_artist_url (last_updated DESC);
|
|
```
|
|
|
|
**Row count**: ~5 million links
|
|
|
|
#### cover_art_archive.index_listing
|
|
|
|
**Purpose**: Cover art availability tracking
|
|
|
|
**Key columns**:
|
|
|
|
| Column | Type | Description |
|--------|------|-------------|
| `release` | INTEGER | Release ID |
| `date_updated` | TIMESTAMP | Last cover art update |
|
|
|
|
**Indices**:
|
|
```sql
|
|
CREATE INDEX idx_cover_art_release ON cover_art_archive.index_listing (release);
|
|
CREATE INDEX idx_cover_art_date_updated ON cover_art_archive.index_listing (date_updated DESC);
|
|
```
|
|
|
|
**Row count**: ~2 million releases with cover art
|
|
|
|
#### replication_control
|
|
|
|
**Purpose**: Replication status tracking
|
|
|
|
**Key columns**:
|
|
|
|
| Column | Type | Description |
|--------|------|-------------|
| `current_schema_sequence` | INTEGER | Current schema version |
| `current_replication_sequence` | INTEGER | Current replication packet |
| `last_replication_date` | TIMESTAMP | Last replication timestamp |
|
|
|
|
**Usage**: Monitoring replication lag and detecting updates
|
|
|
|
### Custom Indices
|
|
|
|
To support efficient change detection and queries, custom indices are created:
|
|
|
|
```sql
|
|
-- Artist change detection
|
|
CREATE INDEX IF NOT EXISTS idx_artist_last_updated
|
|
ON artist (last_updated DESC)
|
|
WHERE last_updated IS NOT NULL;
|
|
|
|
-- Release group change detection
|
|
CREATE INDEX IF NOT EXISTS idx_release_group_last_updated
|
|
ON release_group (last_updated DESC)
|
|
WHERE last_updated IS NOT NULL;
|
|
|
|
-- Release change detection
|
|
CREATE INDEX IF NOT EXISTS idx_release_last_updated
|
|
ON release (last_updated DESC)
|
|
WHERE last_updated IS NOT NULL;
|
|
|
|
-- Link change detection
|
|
CREATE INDEX IF NOT EXISTS idx_l_artist_url_last_updated
|
|
ON l_artist_url (last_updated DESC)
|
|
WHERE last_updated IS NOT NULL;
|
|
|
|
-- Cover art change detection
|
|
CREATE INDEX IF NOT EXISTS idx_cover_art_date_updated
|
|
ON cover_art_archive.index_listing (date_updated DESC)
|
|
WHERE date_updated IS NOT NULL;
|
|
```
|
|
|
|
### SQL Query Files
|
|
|
|
#### artist.sql
|
|
|
|
**Purpose**: Fetch complete artist metadata with releases
|
|
|
|
**Location**: `lidarrmetadata/sql/artist.sql`
|
|
|
|
**Parameters**:
|
|
- `$1`: Artist MBID (UUID)
|
|
- `$2`: Primary release types (array)
|
|
- `$3`: Secondary release types (array)
|
|
- `$4`: Release statuses (array)
|
|
|
|
**Query structure**:
|
|
```sql
|
|
WITH artist_data AS (
|
|
SELECT
|
|
a.gid AS id,
|
|
a.name AS artist_name,
|
|
a.sort_name,
|
|
a.comment AS disambiguation,
|
|
at.name AS artist_type,
|
|
g.name AS gender,
|
|
ar.name AS hometown,
|
|
a.begin_date_year AS start_year,
|
|
a.end_date_year AS end_year
|
|
FROM artist a
|
|
LEFT JOIN artist_type at ON a.type = at.id
|
|
LEFT JOIN gender g ON a.gender = g.id
|
|
LEFT JOIN area ar ON a.area = ar.id
|
|
WHERE a.gid = $1
|
|
),
|
|
releases AS (
|
|
SELECT
|
|
rg.gid AS id,
|
|
rg.name AS title,
|
|
rg.comment AS disambiguation,
|
|
rgt.name AS primary_type,
|
|
rgst.name AS secondary_type,
|
|
rs.name AS status,
|
|
COALESCE(
|
|
TO_CHAR(DATE(rd.date_year || '-' || COALESCE(rd.date_month, 1) || '-' || COALESCE(rd.date_day, 1)), 'YYYY-MM-DD'),
|
|
''
|
|
) AS release_date
|
|
FROM artist a
|
|
JOIN artist_credit_name acn ON acn.artist = a.id
JOIN release_group rg ON rg.artist_credit = acn.artist_credit
|
|
LEFT JOIN release_group_primary_type rgt ON rg.type = rgt.id
|
|
LEFT JOIN release_group_secondary_type_join rgstj ON rgstj.release_group = rg.id
|
|
LEFT JOIN release_group_secondary_type rgst ON rgstj.secondary_type = rgst.id
|
|
LEFT JOIN release r ON r.release_group = rg.id
|
|
LEFT JOIN release_status rs ON r.status = rs.id
|
|
LEFT JOIN (
|
|
SELECT release, MIN(date_year) AS date_year, MIN(date_month) AS date_month, MIN(date_day) AS date_day
|
|
FROM release_country
|
|
GROUP BY release
|
|
) rd ON rd.release = r.id
|
|
WHERE a.gid = $1
|
|
AND (ARRAY_LENGTH($2::text[], 1) IS NULL OR rgt.name = ANY($2))
|
|
AND (ARRAY_LENGTH($3::text[], 1) IS NULL OR rgst.name = ANY($3))
|
|
AND (ARRAY_LENGTH($4::text[], 1) IS NULL OR rs.name = ANY($4))
|
|
GROUP BY rg.id, rg.name, rg.comment, rgt.name, rgst.name, rs.name, rd.date_year, rd.date_month, rd.date_day
|
|
ORDER BY rd.date_year DESC NULLS LAST, rd.date_month DESC NULLS LAST, rd.date_day DESC NULLS LAST
|
|
),
|
|
links AS (
|
|
SELECT
|
|
lt.name AS link_type,
|
|
u.url
|
|
FROM artist a
|
|
JOIN l_artist_url lau ON lau.entity0 = a.id
|
|
JOIN url u ON lau.entity1 = u.id
|
|
JOIN link l ON lau.link = l.id
|
|
JOIN link_type lt ON l.link_type = lt.id
|
|
WHERE a.gid = $1
|
|
)
|
|
SELECT
|
|
row_to_json(artist_data.*) AS artist,
|
|
COALESCE(
|
|
(SELECT json_agg(row_to_json(releases.*)) FROM releases),
|
|
'[]'::json
|
|
) AS albums,
|
|
COALESCE(
|
|
(SELECT json_agg(row_to_json(links.*)) FROM links),
|
|
'[]'::json
|
|
) AS links
|
|
FROM artist_data;
|
|
```
|
|
|
|
**Performance**: 100-500ms depending on artist discography size
|
|
|
|
**Result format**: Single row with three JSON columns (artist, albums, links)
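
A minimal sketch of consuming that single row from Python. By default asyncpg hands `json` columns back as text, so each of the three columns is decoded with `json.loads`. The function name `decode_artist_row` is hypothetical, and the sketch assumes the column names `artist`, `albums` and `links` from the final `SELECT`.

```python
import json


def decode_artist_row(row):
    """Turn the single result row of artist.sql into plain Python objects.

    `row` is anything indexable by column name (an asyncpg Record or a dict).
    Values that are already decoded (e.g. via a custom JSON codec) pass through.
    """
    def load(value):
        return json.loads(value) if isinstance(value, (str, bytes)) else value

    return {
        'artist': load(row['artist']),
        'albums': load(row['albums']),  # '[]' when no release groups match
        'links': load(row['links']),    # '[]' when the artist has no URLs
    }
```

If the artist MBID does not exist, `artist_data` is empty and the query returns no row at all, so callers should check for `None` from `fetchrow` before decoding.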
|
|
|
|
#### album.sql
|
|
|
|
**Purpose**: Fetch complete album metadata with tracks
|
|
|
|
**Location**: `lidarrmetadata/sql/album.sql`
|
|
|
|
**Parameters**:
|
|
- `$1`: Release group MBID (UUID)
|
|
|
|
**Query structure**:
|
|
```sql
|
|
WITH album_data AS (
|
|
SELECT
|
|
rg.gid AS id,
|
|
rg.name AS title,
|
|
rg.comment AS disambiguation,
|
|
rgt.name AS primary_type,
|
|
rs.name AS status,
|
|
a.gid AS artist_id,
|
|
a.name AS artist_name,
|
|
COALESCE(
|
|
TO_CHAR(DATE(rd.date_year || '-' || COALESCE(rd.date_month, 1) || '-' || COALESCE(rd.date_day, 1)), 'YYYY-MM-DD'),
|
|
''
|
|
) AS release_date
|
|
FROM release_group rg
|
|
JOIN artist_credit ac ON rg.artist_credit = ac.id
|
|
JOIN artist_credit_name acn ON acn.artist_credit = ac.id
|
|
JOIN artist a ON acn.artist = a.id
|
|
LEFT JOIN release_group_primary_type rgt ON rg.type = rgt.id
|
|
LEFT JOIN release r ON r.release_group = rg.id
|
|
LEFT JOIN release_status rs ON r.status = rs.id
|
|
LEFT JOIN (
|
|
SELECT release, MIN(date_year) AS date_year, MIN(date_month) AS date_month, MIN(date_day) AS date_day
|
|
FROM release_country
|
|
GROUP BY release
|
|
) rd ON rd.release = r.id
|
|
WHERE rg.gid = $1
|
|
LIMIT 1
|
|
),
|
|
media AS (
|
|
SELECT
|
|
m.position,
|
|
mf.name AS format,
|
|
json_agg(
|
|
json_build_object(
|
|
'position', t.position,
|
|
'title', t.name,
|
|
'duration', t.length,
|
|
'artist_name', ta.name
|
|
)
|
|
ORDER BY t.position
|
|
) AS tracks
|
|
FROM release_group rg
|
|
JOIN release r ON r.release_group = rg.id
|
|
JOIN medium m ON m.release = r.id
|
|
LEFT JOIN medium_format mf ON m.format = mf.id
|
|
JOIN track t ON t.medium = m.id
|
|
JOIN recording rec ON t.recording = rec.id
|
|
JOIN artist_credit tac ON rec.artist_credit = tac.id
|
|
JOIN artist_credit_name tacn ON tacn.artist_credit = tac.id
|
|
JOIN artist ta ON tacn.artist = ta.id
|
|
WHERE rg.gid = $1
|
|
GROUP BY m.id, m.position, mf.name
|
|
ORDER BY m.position
|
|
)
|
|
SELECT
|
|
row_to_json(album_data.*) AS album,
|
|
COALESCE(
|
|
(SELECT json_agg(row_to_json(media.*)) FROM media),
|
|
'[]'::json
|
|
) AS media
|
|
FROM album_data;
|
|
```
|
|
|
|
**Performance**: 200-1000ms depending on track count
|
|
|
|
#### updated_artists.sql
|
|
|
|
**Purpose**: Detect recently updated artists for cache invalidation
|
|
|
|
**Location**: `lidarrmetadata/sql/updated_artists.sql`
|
|
|
|
**Parameters**:
|
|
- `$1`: Timestamp threshold (only artists updated after this)
|
|
- `$2`: Result limit
|
|
|
|
**Query structure** (UNION of 5 change sources):
|
|
|
|
```sql
|
|
-- Source 1: Artists with updated metadata
|
|
SELECT DISTINCT
|
|
a.gid,
|
|
a.last_updated,
|
|
'metadata' AS change_type
|
|
FROM artist a
|
|
WHERE a.last_updated > $1
|
|
|
|
UNION
|
|
|
|
-- Source 2: Artists with new release groups
|
|
SELECT DISTINCT
|
|
a.gid,
|
|
rg.last_updated,
|
|
'new_release' AS change_type
|
|
FROM artist a
|
|
JOIN artist_credit_name acn ON acn.artist = a.id
JOIN release_group rg ON rg.artist_credit = acn.artist_credit
|
|
WHERE rg.last_updated > $1
|
|
|
|
UNION
|
|
|
|
-- Source 3: Artists with updated releases
|
|
SELECT DISTINCT
|
|
a.gid,
|
|
r.last_updated,
|
|
'updated_release' AS change_type
|
|
FROM artist a
|
|
JOIN artist_credit_name acn ON acn.artist = a.id
JOIN release_group rg ON rg.artist_credit = acn.artist_credit
|
|
JOIN release r ON r.release_group = rg.id
|
|
WHERE r.last_updated > $1
|
|
|
|
UNION
|
|
|
|
-- Source 4: Artists with new/updated links
|
|
SELECT DISTINCT
|
|
a.gid,
|
|
lau.last_updated,
|
|
'new_link' AS change_type
|
|
FROM artist a
|
|
JOIN l_artist_url lau ON lau.entity0 = a.id
|
|
WHERE lau.last_updated > $1
|
|
|
|
UNION
|
|
|
|
-- Source 5: Artists with updated cover art
|
|
SELECT DISTINCT
|
|
a.gid,
|
|
caa.date_updated AS last_updated,
|
|
'cover_art' AS change_type
|
|
FROM artist a
|
|
JOIN artist_credit_name acn ON acn.artist = a.id
JOIN release_group rg ON rg.artist_credit = acn.artist_credit
|
|
JOIN release r ON r.release_group = rg.id
|
|
JOIN cover_art_archive.index_listing caa ON caa.release = r.id
|
|
WHERE caa.date_updated > $1
|
|
|
|
ORDER BY last_updated DESC
|
|
LIMIT $2;
|
|
```
|
|
|
|
**Performance**: 500-2000ms depending on time window
|
|
|
|
**Use case**: Crawler scheduling, cache invalidation
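
Because each branch of the `UNION` emits its own `change_type`, the same artist can appear in several result rows. The sketch below collapses the rows to one entry per MBID before invalidation. The function name `collapse_changes` and the tuple layout `(gid, last_updated, change_type)` are assumptions that mirror the column order of the query.

```python
from datetime import datetime


def collapse_changes(rows):
    """Collapse (gid, last_updated, change_type) rows to one entry per artist.

    Returns {gid: (newest_timestamp, set_of_change_types)}.
    """
    collapsed = {}
    for gid, last_updated, change_type in rows:
        newest, kinds = collapsed.get(gid, (last_updated, set()))
        collapsed[gid] = (max(newest, last_updated), kinds | {change_type})
    return collapsed
```

The newest timestamp across all collapsed rows is a natural candidate for the `$1` threshold of the next run.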
|
|
|
|
#### updated_albums.sql
|
|
|
|
**Purpose**: Detect recently updated albums
|
|
|
|
**Location**: `lidarrmetadata/sql/updated_albums.sql`
|
|
|
|
**Parameters**: Same as `updated_artists.sql`
|
|
|
|
**Query structure**: Similar UNION pattern with 5 change sources:
|
|
1. Release group metadata updates
|
|
2. New releases in group
|
|
3. Updated releases in group
|
|
4. New/updated links
|
|
5. Cover art updates
|
|
|
|
### Database Replication
|
|
|
|
**Replication method**: MusicBrainz replication packets
|
|
|
|
**Update frequency**: Hourly
|
|
|
|
**Replication process**:
|
|
1. Check `replication_control` table for current sequence
|
|
2. Fetch replication packets from MusicBrainz FTP
|
|
3. Apply SQL changes from packets
|
|
4. Update `replication_control` table
|
|
5. Trigger search index updates via RabbitMQ
|
|
|
|
**Monitoring**:
|
|
```sql
|
|
SELECT
|
|
current_replication_sequence,
|
|
last_replication_date,
|
|
NOW() - last_replication_date AS replication_lag
|
|
FROM replication_control;
|
|
```
|
|
|
|
**Typical lag**: 1-2 hours behind MusicBrainz master
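
A small sketch of turning the monitoring query into an alert condition. The three-hour threshold is an assumption chosen to sit just above the typical lag, and `replication_is_stale` is a hypothetical helper name.

```python
from datetime import datetime, timedelta


def replication_is_stale(last_replication_date, now, max_lag=timedelta(hours=3)):
    """Return True when the replica is further behind than `max_lag`.

    Both timestamps must be of the same kind (both naive or both tz-aware).
    """
    return (now - last_replication_date) > max_lag
```

Passing `now` explicitly keeps the check deterministic and easy to test.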
|
|
|
|
## Cache Database (PostgreSQL)
|
|
|
|
### Database Overview
|
|
|
|
**Purpose**: Persistent cache storage with compression
|
|
|
|
**Technology**: PostgreSQL 12+ (same instance as MusicBrainz or separate)
|
|
|
|
**Database name**: `lm_cache_db`
|
|
|
|
**Connection configuration**:
|
|
```python
|
|
CACHE_DATABASE = {
|
|
'host': 'localhost',
|
|
'port': 5432,
|
|
'database': 'lm_cache_db',
|
|
'user': 'abc',
|
|
'password': 'abc'
|
|
}
|
|
```
|
|
|
|
### Auto-Created Cache Tables
|
|
|
|
Each cache type gets its own table with identical schema:
|
|
|
|
**Table names**:
|
|
- `artist`: Artist metadata cache
|
|
- `album`: Album metadata cache
|
|
- `spotify`: Spotify lookup cache
|
|
- `fanart`: FanArt.tv image cache
|
|
- `tadb`: TheAudioDB metadata cache
|
|
- `wikipedia`: Wikipedia overview cache
|
|
|
|
**Schema**:
|
|
```sql
|
|
CREATE TABLE IF NOT EXISTS {cache_name} (
|
|
key VARCHAR(255) PRIMARY KEY,
|
|
expires TIMESTAMP,
|
|
updated TIMESTAMP DEFAULT NOW(),
|
|
value BYTEA
|
|
);
|
|
|
|
CREATE INDEX IF NOT EXISTS idx_{cache_name}_expires
|
|
ON {cache_name} (expires)
|
|
WHERE expires IS NOT NULL;
|
|
|
|
CREATE INDEX IF NOT EXISTS idx_{cache_name}_updated
|
|
ON {cache_name} (updated DESC);
|
|
```
|
|
|
|
**Trigger for automatic timestamp updates**:
|
|
```sql
|
|
CREATE OR REPLACE FUNCTION update_updated_column()
|
|
RETURNS TRIGGER AS $$
|
|
BEGIN
|
|
NEW.updated = NOW();
|
|
RETURN NEW;
|
|
END;
|
|
$$ LANGUAGE plpgsql;
|
|
|
|
CREATE TRIGGER update_{cache_name}_updated
|
|
BEFORE UPDATE ON {cache_name}
|
|
FOR EACH ROW
|
|
EXECUTE FUNCTION update_updated_column();
|
|
```
|
|
|
|
### Cache Entry Format
|
|
|
|
**Key structure**: `{cache_type}:{identifier}:{parameters}`
|
|
|
|
**Examples**:
|
|
- `artist:5b11f4ce-a62d-471e-81fc-a69a8278c7da:Album:Official`
|
|
- `album:1b022e01-4da6-387b-8658-8678046e4cef`
|
|
- `spotify:artist:6olE6TJLqED3rqDCT0FyPh`
|
|
- `wikipedia:5b11f4ce-a62d-471e-81fc-a69a8278c7da:en`
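
The key layout can be sketched as a small helper. `build_cache_key` is a hypothetical name; the length check reflects the `VARCHAR(255)` primary key in the schema above.

```python
MAX_KEY_LENGTH = 255  # matches the VARCHAR(255) key column


def build_cache_key(cache_type, identifier, *params):
    """Join the key parts with ':'; trailing parameters are optional."""
    key = ':'.join(str(part) for part in (cache_type, identifier, *params))
    if len(key) > MAX_KEY_LENGTH:
        raise ValueError(f"cache key too long ({len(key)} > {MAX_KEY_LENGTH})")
    return key
```

For example, `build_cache_key('album', '1b022e01-4da6-387b-8658-8678046e4cef')` reproduces the second key in the list above.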
|
|
|
|
**Value format**: zlib-compressed pickle
|
|
|
|
**Compression implementation**:
|
|
```python
import zlib
import pickle

def compress_value(value):
    """Compress Python object for storage"""
    pickled = pickle.dumps(value, protocol=pickle.HIGHEST_PROTOCOL)
    compressed = zlib.compress(pickled, level=6)
    return compressed

def decompress_value(compressed):
    """Decompress stored value to Python object"""
    pickled = zlib.decompress(compressed)
    value = pickle.loads(pickled)
    return value
```
|
|
|
|
**Compression ratio**: Typically 10:1 for JSON metadata
|
|
|
|
**Example**:
|
|
- Uncompressed artist metadata: 50KB JSON
|
|
- Pickled: 52KB
|
|
- Compressed: 5KB
|
|
- Storage savings: 90%
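
The ratio depends heavily on the payload, so it is worth measuring on real entries. The sketch below reports sizes for an arbitrary object. The synthetic `sample` only stands in for real artist metadata and its ratio should not be read as representative. `compression_report` is a hypothetical helper.

```python
import pickle
import zlib


def compression_report(value):
    """Return (pickled_bytes, compressed_bytes, ratio) for a Python object."""
    pickled = pickle.dumps(value, protocol=pickle.HIGHEST_PROTOCOL)
    compressed = zlib.compress(pickled, level=6)
    return len(pickled), len(compressed), len(pickled) / len(compressed)


# Synthetic, repetitive payload standing in for real artist metadata.
sample = {
    'albums': [
        {'id': f'00000000-0000-0000-0000-{i:012d}', 'title': f'Album {i}',
         'type': 'Album', 'status': 'Official'}
        for i in range(500)
    ]
}
pickled_size, compressed_size, ratio = compression_report(sample)
```

Running the same report over a sample of production cache values gives a grounded figure for capacity planning.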
|
|
|
|
### Cache Operations
|
|
|
|
#### Insert/Update

```python
import pickle
import zlib
from datetime import datetime, timedelta

async def cache_set(key, value, ttl=2592000):
    """Store value in cache with optional TTL (default 30 days)"""
    compressed = zlib.compress(pickle.dumps(value))
    expires = datetime.now() + timedelta(seconds=ttl) if ttl else None

    # `conn` is an open asyncpg connection to the cache database
    await conn.execute(
        """
        INSERT INTO artist (key, value, expires)
        VALUES ($1, $2, $3)
        ON CONFLICT (key) DO UPDATE
        SET value = $2, expires = $3, updated = NOW()
        """,
        key, compressed, expires
    )
```

#### Retrieve

```python
async def cache_get(key):
    """Retrieve value from cache"""
    row = await conn.fetchrow(
        """
        SELECT value, expires
        FROM artist
        WHERE key = $1
        """,
        key
    )

    if not row:
        return None

    # Check expiration
    if row['expires'] and row['expires'] < datetime.now():
        # Expired, delete and return None
        await conn.execute("DELETE FROM artist WHERE key = $1", key)
        return None

    # Decompress and return
    value = pickle.loads(zlib.decompress(row['value']))
    return value
```

#### Delete

```python
async def cache_delete(key):
    """Delete value from cache"""
    await conn.execute("DELETE FROM artist WHERE key = $1", key)
```
|
|
|
|
#### Cleanup Expired Entries
|
|
|
|
```python
async def cache_cleanup():
    """Remove expired entries and return the number of rows deleted"""
    status = await conn.execute(
        """
        DELETE FROM artist
        WHERE expires IS NOT NULL AND expires < NOW()
        """
    )
    # asyncpg returns the command status string, e.g. "DELETE 42"
    return int(status.split()[-1])
```
|
|
|
|
**Cleanup schedule**: Daily via cron or crawler
|
|
|
|
### Cache Statistics
|
|
|
|
**Query for cache statistics**:
|
|
```sql
|
|
SELECT
|
|
'artist' AS cache_name,
|
|
COUNT(*) AS total_entries,
|
|
COUNT(*) FILTER (WHERE expires IS NOT NULL AND expires < NOW()) AS expired_entries,
|
|
COUNT(*) FILTER (WHERE expires IS NULL OR expires >= NOW()) AS valid_entries,
|
|
pg_size_pretty(pg_total_relation_size('artist')) AS total_size,
|
|
AVG(LENGTH(value)) AS avg_value_size,
|
|
MAX(updated) AS last_updated
|
|
FROM artist;
|
|
```
|
|
|
|
**Example output**:
|
|
```
|
|
cache_name | total_entries | expired_entries | valid_entries | total_size | avg_value_size | last_updated
|
|
-----------+---------------+-----------------+---------------+------------+----------------+---------------------
|
|
artist | 125000 | 5000 | 120000 | 2048 MB | 5120 | 2025-04-28 12:34:56
|
|
```
|
|
|
|
## Redis Cache
|
|
|
|
### Redis Overview
|
|
|
|
**Purpose**: Ephemeral cache for hot data and rate limiting
|
|
|
|
**Technology**: Redis 6+
|
|
|
|
**Memory limit**: 512MB
|
|
|
|
**Eviction policy**: LFU (Least Frequently Used)
|
|
|
|
**Namespace**: `lm3.7`
|
|
|
|
**Connection configuration**:
|
|
```python
|
|
REDIS_URL = 'redis://localhost:6379/0'
|
|
REDIS_MIN_POOL_SIZE = 5
|
|
REDIS_MAX_POOL_SIZE = 20
|
|
```
|
|
|
|
### Redis Configuration
|
|
|
|
**redis.conf settings**:
|
|
```conf
|
|
maxmemory 512mb
maxmemory-policy allkeys-lfu
# Disable RDB snapshots (comments must be on their own line in redis.conf)
save ""
# Disable AOF
appendonly no
|
|
```
|
|
|
|
**Rationale**:
|
|
- LFU eviction keeps most-accessed data in cache
|
|
- No persistence needed (PostgreSQL is persistent layer)
|
|
- Maximum performance for cache operations
|
|
|
|
### Key Structure
|
|
|
|
**Namespace prefix**: All keys prefixed with `lm3.7:`
|
|
|
|
**Key patterns**:
|
|
- `lm3.7:artist:{mbid}:{params}`: Artist metadata
|
|
- `lm3.7:album:{mbid}`: Album metadata
|
|
- `lm3.7:search:artist:{query}`: Search results
|
|
- `lm3.7:ratelimit:{ip}:{window}`: Rate limiter state
|
|
- `lm3.7:sentry:{error_hash}`: Sentry deduplication
|
|
- `lm3.7:lock:invalidate:{mbid}`: Invalidation locks
|
|
|
|
### Cache Operations
|
|
|
|
#### Set with TTL

```python
async def redis_set(key, value, ttl=604800):
    """Store value in Redis with TTL (default 7 days)"""
    compressed = zlib.compress(pickle.dumps(value))
    await redis.setex(f"lm3.7:{key}", ttl, compressed)
```

#### Get

```python
async def redis_get(key):
    """Retrieve value from Redis"""
    compressed = await redis.get(f"lm3.7:{key}")
    if not compressed:
        return None
    value = pickle.loads(zlib.decompress(compressed))
    return value
```

#### Delete

```python
async def redis_delete(key):
    """Delete value from Redis"""
    await redis.delete(f"lm3.7:{key}")
```

#### Batch Delete

```python
async def redis_delete_pattern(pattern):
    """Delete all keys matching pattern"""
    cursor = 0
    while True:
        cursor, keys = await redis.scan(cursor, match=f"lm3.7:{pattern}", count=100)
        if keys:
            await redis.delete(*keys)
        if cursor == 0:
            break
```
|
|
|
|
### Rate Limiting with Redis
|
|
|
|
**Implementation**: Fixed window counter

```python
import time


class RateLimitExceeded(Exception):
    """Raised when a client exceeds its request budget"""


async def rate_limit_check(key, max_requests=100, window=60):
    """Check if request is within rate limit"""
    now = time.time()
    # One counter per client per fixed window of `window` seconds
    window_key = f"lm3.7:ratelimit:{key}:{int(now / window)}"

    # Increment counter
    count = await redis.incr(window_key)

    # Set expiration on first request
    if count == 1:
        await redis.expire(window_key, window)

    # Check limit
    if count > max_requests:
        raise RateLimitExceeded(f"Rate limit exceeded: {count}/{max_requests}")

    return count
```
|
|
|
|
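
The function above keys its counter on `int(now / window)`, so every client gets a fresh counter at each window boundary. An in-memory sketch of the same bucketing, with an injectable clock so the boundary behaviour can be observed without Redis, might look like this. The class `FixedWindowLimiter` is illustrative and not part of the codebase.

```python
import time


class FixedWindowLimiter:
    """In-memory stand-in for the Redis counter, for illustration only."""

    def __init__(self, max_requests=100, window=60, clock=time.time):
        self.max_requests = max_requests
        self.window = window
        self.clock = clock
        self.counts = {}  # (key, window_index) -> request count

    def allow(self, key):
        """Count one request; return False once the window's budget is spent."""
        bucket = (key, int(self.clock() / self.window))
        self.counts[bucket] = self.counts.get(bucket, 0) + 1
        return self.counts[bucket] <= self.max_requests
```

One consequence of this design is that a client can send up to `2 * max_requests` in a short burst that straddles a window boundary. That is the usual trade-off for the single `INCR` per request.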
### Sentry Deduplication with Redis
|
|
|
|
**Purpose**: Prevent duplicate error reports
|
|
|
|
```python
async def sentry_should_send(error_hash):
    """Check if error should be sent to Sentry"""
    key = f"lm3.7:sentry:{error_hash}"

    # Atomically mark the error as seen for 1 hour. SET ... NX only
    # succeeds if the key does not exist yet, so concurrent callers
    # cannot both report the same error.
    first_seen = await redis.set(key, "1", ex=3600, nx=True)
    return bool(first_seen)
```
|
|
|
|
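
How `error_hash` is derived is not specified here. One plausible scheme, shown purely as an assumption, is to hash the exception type together with its message so that repeated occurrences of the same error map to the same key. `error_fingerprint` is a hypothetical helper.

```python
import hashlib


def error_fingerprint(exc):
    """Hash the exception type and message into a stable hex string."""
    exc_type = type(exc)
    raw = f"{exc_type.__module__}.{exc_type.__qualname__}:{exc}"
    return hashlib.sha1(raw.encode('utf-8')).hexdigest()
```

Messages that embed variable data (IDs, timestamps) would defeat deduplication under this scheme, so a real implementation might hash the traceback location instead.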
### Redis Monitoring
|
|
|
|
**Memory usage**:
|
|
```bash
|
|
redis-cli INFO memory
|
|
```
|
|
|
|
**Key count**:
|
|
```bash
|
|
redis-cli DBSIZE
|
|
```
|
|
|
|
**Eviction stats**:
|
|
```bash
|
|
redis-cli INFO stats | grep evicted
|
|
```
|
|
|
|
**Hit rate**:
|
|
```bash
|
|
redis-cli INFO stats | grep keyspace
|
|
```
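
The `grep` above prints the raw `keyspace_hits` and `keyspace_misses` counters; the hit rate itself is `hits / (hits + misses)`. A small sketch that computes it from the `INFO stats` text follows. `parse_hit_rate` is a hypothetical helper name.

```python
def parse_hit_rate(info_stats):
    """Compute hits / (hits + misses) from `redis-cli INFO stats` output.

    Returns None when no lookups have been recorded yet.
    """
    fields = {}
    for line in info_stats.splitlines():
        if ':' in line and not line.startswith('#'):
            name, _, value = line.strip().partition(':')
            fields[name] = value
    hits = int(fields.get('keyspace_hits', 0))
    misses = int(fields.get('keyspace_misses', 0))
    total = hits + misses
    return hits / total if total else None
```

Both counters are cumulative since server start, so sampling them twice and using the differences gives the hit rate for a recent interval.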
|
|
|
|
## Solr Search Index
|
|
|
|
### Solr Overview
|
|
|
|
**Purpose**: Full-text search for artists and albums
|
|
|
|
**Technology**: Apache Solr 8.x
|
|
|
|
**Container image**: `ghcr.io/lidarr/mb-solr:3.3.1.9`
|
|
|
|
**Cores**:
|
|
- `artist`: Artist search index
|
|
- `release-group`: Album search index
|
|
|
|
**Update method**: Real-time via RabbitMQ + SIR (Search Index Rebuilder)
|
|
|
|
### Solr Configuration
|
|
|
|
**solrconfig.xml settings**:
|
|
```xml
|
|
<config>
|
|
<requestHandler name="/select" class="solr.SearchHandler">
|
|
<lst name="defaults">
|
|
<str name="echoParams">explicit</str>
|
|
<str name="defType">dismax</str>
|
|
<int name="rows">10</int>
|
|
</lst>
|
|
</requestHandler>
|
|
|
|
<requestHandler name="/update" class="solr.UpdateRequestHandler" />
|
|
|
|
<requestHandler name="/admin/ping" class="solr.PingRequestHandler">
|
|
<lst name="invariants">
|
|
<str name="q">*:*</str>
|
|
</lst>
|
|
</requestHandler>
|
|
</config>
|
|
```
|
|
|
|
### Artist Core Schema
|
|
|
|
**schema.xml**:
|
|
```xml
|
|
<schema name="artist" version="1.6">
|
|
<field name="id" type="string" indexed="true" stored="true" required="true" />
|
|
<field name="artist" type="text_general" indexed="true" stored="true" />
|
|
<field name="sortname" type="text_general" indexed="true" stored="true" />
|
|
<field name="alias" type="text_general" indexed="true" stored="true" multiValued="true" />
|
|
<field name="type" type="string" indexed="true" stored="true" />
|
|
<field name="disambiguation" type="text_general" indexed="true" stored="true" />
|
|
<field name="_version_" type="long" indexed="true" stored="true" />
|
|
|
|
<uniqueKey>id</uniqueKey>
|
|
|
|
<copyField source="artist" dest="text" />
|
|
<copyField source="sortname" dest="text" />
|
|
<copyField source="alias" dest="text" />
|
|
</schema>
|
|
```
|
|
|
|
**Indexed fields**:
|
|
- `id`: MusicBrainz artist MBID
|
|
- `artist`: Artist name (boosted 2x in queries)
|
|
- `sortname`: Sort name
|
|
- `alias`: Artist aliases (multi-valued)
|
|
- `type`: Artist type (Person, Group, etc.)
|
|
- `disambiguation`: Disambiguation comment
|
|
|
|
### Release Group Core Schema
|
|
|
|
**Indexed fields**:
|
|
- `id`: MusicBrainz release group MBID
|
|
- `title`: Album title (boosted 2x)
|
|
- `artist`: Artist name
|
|
- `type`: Primary type (Album, Single, etc.)
|
|
- `status`: Release status
|
|
- `disambiguation`: Disambiguation comment
|
|
|
|
### Search Query Format
|
|
|
|
**Dismax query example**:
|
|
```json
|
|
{
|
|
"query": "nirvana",
|
|
"limit": 10,
|
|
"params": {
|
|
"defType": "dismax",
|
|
"qf": "artist^2 sortname alias",
|
|
"mm": "1"
|
|
}
|
|
}
|
|
```
|
|
|
|
**Query field weights**:
|
|
- `artist^2`: Artist name (2x boost)
|
|
- `sortname`: Sort name (1x)
|
|
- `alias`: Aliases (1x)
|
|
|
|
**Minimum match**: At least 1 term must match
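
A sketch of building that request body in Python. The function name `build_artist_query` is hypothetical. The body is assumed to be sent as JSON to the `artist` core's `/select` handler, which is the handler defined in `solrconfig.xml` above; the exact URL depends on the deployment.

```python
import json


def build_artist_query(text, limit=10):
    """Build a JSON request body for a dismax artist search."""
    return {
        'query': text,
        'limit': limit,
        'params': {
            'defType': 'dismax',
            'qf': 'artist^2 sortname alias',  # artist name weighted 2x
            'mm': '1',                        # at least one term must match
        },
    }


body = json.dumps(build_artist_query('nirvana'))
```

Keeping the field weights in one place makes it straightforward to add a parallel builder for the `release-group` core with `title^2` as the boosted field.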
|
|
|
|
### Solr Update Process
|
|
|
|
**Real-time updates via RabbitMQ**:
|
|
|
|
1. MusicBrainz replication applies database changes
|
|
2. Database triggers publish messages to RabbitMQ
|
|
3. SIR (Search Index Rebuilder) consumes messages
|
|
4. SIR queries MusicBrainz database for updated entities
|
|
5. SIR posts updates to Solr cores
|
|
6. Solr commits changes (soft commit every 1s)
|
|
|
|
**RabbitMQ configuration**:
|
|
```python
|
|
RABBITMQ = {
|
|
'host': 'rabbitmq',
|
|
'port': 5672,
|
|
'user': 'abc',
|
|
'password': 'abc',
|
|
'exchange': 'search.index',
|
|
'queue': 'search.index.artist'
|
|
}
|
|
```
|
|
|
|
**Update latency**: 1-5 seconds from database change to searchable
|
|
|
|
### Solr Performance
|
|
|
|
**Query timeout**: 5 seconds
|
|
|
|
**Typical query time**: 50-200ms
|
|
|
|
**Index size**:
|
|
- Artist core: ~4GB
|
|
- Release group core: ~4GB
|
|
|
|
**Document count**:
|
|
- Artist core: ~2 million
|
|
- Release group core: ~3 million
|
|
|
|
## Change Detection System
|
|
|
|
### Overview
|
|
|
|
**Purpose**: Identify recently updated entities for cache invalidation
|
|
|
|
**Method**: SQL queries tracking changes across multiple sources
|
|
|
|
**Update frequency**: Hourly (aligned with MusicBrainz replication)
|
|
|
|
### Artist Change Sources
|
|
|
|
**5 change sources tracked**:
|
|
|
|
1. **Artist metadata updates**: `artist.last_updated`
|
|
2. **New release groups**: `release_group.last_updated`
|
|
3. **Updated releases**: `release.last_updated`
|
|
4. **New/updated links**: `l_artist_url.last_updated`
|
|
5. **Cover art updates**: `cover_art_archive.index_listing.date_updated`
|
|
|
|
**Query**: `updated_artists.sql` (UNION of 5 queries)
|
|
|
|
**Performance**: 500-2000ms for 24-hour window
|
|
|
|
**Typical results**: 1000-5000 artists per hour
|
|
|
|
### Album Change Sources
|
|
|
|
**5 change sources tracked**:
|
|
|
|
1. **Release group metadata updates**: `release_group.last_updated`
|
|
2. **New releases in group**: `release.last_updated`
|
|
3. **Updated releases in group**: `release.last_updated`
|
|
4. **New/updated links**: `l_release_group_url.last_updated`
|
|
5. **Cover art updates**: `cover_art_archive.index_listing.date_updated`
|
|
|
|
**Query**: `updated_albums.sql`
|
|
|
|
**Typical results**: 2000-10000 albums per hour
|
|
|
|
### Change Detection Workflow

**Crawler process**:

1. Query `updated_artists.sql` with timestamp from last run
2. For each updated artist:
   - Delete from Redis cache
   - Delete from PostgreSQL cache
   - Purge from Cloudflare CDN
3. Optionally pre-fetch fresh metadata
4. Update last run timestamp
5. Sleep until next cycle

**Invalidation vs. pre-fetching**:
- **Invalidation only**: Fast, minimal API load
- **Pre-fetching**: Slower, but ensures cache is warm

**Configuration**:
```python
CRAWLER_INVALIDATE_ONLY = False  # Pre-fetch after invalidation
CRAWLER_INTERVAL = 3600          # 1 hour
CRAWLER_BATCH_SIZE = 100         # Process 100 entities per batch
```

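One crawler cycle from the workflow above can be sketched as follows. This is a minimal illustration, not the project's crawler: `run_crawl_cycle`, `chunked`, and the three injected callables (`fetch_updated`, `invalidate`, `prefetch`) are hypothetical names, and only the two configuration constants are taken from the text. The caller is assumed to persist the returned timestamp and sleep for `CRAWLER_INTERVAL` between cycles.

```python
import asyncio
from typing import Awaitable, Callable, Iterator, List

CRAWLER_INVALIDATE_ONLY = False  # Pre-fetch after invalidation
CRAWLER_BATCH_SIZE = 100         # Process 100 entities per batch


def chunked(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


async def run_crawl_cycle(
    fetch_updated: Callable[[float], Awaitable[List[str]]],
    invalidate: Callable[[str], Awaitable[None]],
    prefetch: Callable[[str], Awaitable[None]],
    last_run: float,
    now: float,
    invalidate_only: bool = CRAWLER_INVALIDATE_ONLY,
    batch_size: int = CRAWLER_BATCH_SIZE,
) -> float:
    """Run one cycle and return the timestamp to store as the new last run."""
    # Step 1: ask the change-detection query for everything updated since last run.
    mbids = await fetch_updated(last_run)
    for batch in chunked(mbids, batch_size):
        # Step 2: invalidate every entity in the batch concurrently.
        await asyncio.gather(*(invalidate(mbid) for mbid in batch))
        # Step 3: optionally warm the cache again for the same batch.
        if not invalidate_only:
            await asyncio.gather(*(prefetch(mbid) for mbid in batch))
    # Step 4: the caller stores this value, then sleeps until the next cycle.
    return now
```

Processing in batches bounds the number of concurrent invalidation and pre-fetch calls, which keeps the load on the API predictable when an hourly window contains several thousand entities.
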
## Data Consistency

### Cache Coherence

**Problem**: Three cache tiers can become inconsistent

**Solution**: Hierarchical invalidation

**Invalidation order**:
1. Cloudflare CDN (purge API)
2. Redis (delete key)
3. PostgreSQL (delete row)

**Rationale**: Invalidate from edge to origin to prevent stale data propagation

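The ordering can be made explicit by treating the tiers as an ordered list and invalidating them strictly one after another. The sketch below assumes each tier exposes an async callable that takes a cache key; `invalidate_in_order` and the `Tier` alias are hypothetical names used only for illustration.

```python
import asyncio
from typing import Awaitable, Callable, List, Tuple

# Hypothetical pairing of a tier name with its async invalidation callable.
Tier = Tuple[str, Callable[[str], Awaitable[None]]]


async def invalidate_in_order(key: str, tiers: List[Tier]) -> List[str]:
    """Invalidate `key` in each tier sequentially; return the tiers completed."""
    completed: List[str] = []
    for name, invalidate in tiers:
        # Await each tier before moving on so the listed order is preserved.
        # If a tier raises, the exception propagates and later tiers are untouched.
        await invalidate(key)
        completed.append(name)
    return completed
```

A caller would pass the tiers in the documented order, for example `[("cdn", purge_cdn), ("redis", delete_redis), ("postgres", delete_row)]`, where the three callables are placeholders for whatever clients the deployment uses.
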
### Replication Lag Handling

**Problem**: MusicBrainz replication has 1-2 hour lag

**Solution**: Accept eventual consistency

**User impact**: Newly added artists/albums may not appear in search for 1-2 hours

**Mitigation**: Manual cache refresh endpoint for urgent updates

### Concurrent Update Handling

**Problem**: Multiple API instances may invalidate cache simultaneously

**Solution**: Redis-based distributed locks

**Implementation**:
```python
async def invalidate_with_lock(mbid):
    """Invalidate cache with distributed lock"""
    lock_key = f"lm3.7:lock:invalidate:{mbid}"

    # Try to acquire lock (30 second TTL)
    acquired = await redis.set(lock_key, "1", ex=30, nx=True)

    if not acquired:
        # Another instance is already invalidating
        return False

    try:
        # Perform invalidation
        await invalidate_cdn(mbid)
        await redis_delete(f"artist:{mbid}")
        await postgres_delete(f"artist:{mbid}")
        return True
    finally:
        # Release lock
        await redis.delete(lock_key)
```

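One caveat with a plain `SET ... NX` lock is that an unconditional delete in `finally` can remove a lock that has already expired and been re-acquired by another instance. A common variant stores a unique token and releases the lock only if the token still matches. The sketch below demonstrates the idea against a small in-memory stand-in so that it runs without a server; `InMemoryRedis` and `invalidate_once` are hypothetical names, the stand-in ignores TTL expiry, and against a real Redis the compare-and-delete would need to be atomic (typically a short Lua script), which this sketch does not show.

```python
import asyncio
import uuid
from typing import Awaitable, Callable, Dict, Optional


class InMemoryRedis:
    """Tiny stand-in for a Redis client; implements only what the sketch uses."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}

    async def set(self, key: str, value: str, ex: Optional[int] = None,
                  nx: bool = False) -> Optional[bool]:
        # With nx=True, refuse to overwrite an existing key (TTL `ex` is ignored here).
        if nx and key in self._data:
            return None
        self._data[key] = value
        return True

    async def get(self, key: str) -> Optional[str]:
        return self._data.get(key)

    async def delete(self, key: str) -> int:
        return 1 if self._data.pop(key, None) is not None else 0


async def invalidate_once(redis: InMemoryRedis, mbid: str,
                          do_invalidate: Callable[[str], Awaitable[None]]) -> bool:
    """Run `do_invalidate` only if this caller wins the per-MBID lock."""
    lock_key = f"lm3.7:lock:invalidate:{mbid}"
    token = uuid.uuid4().hex  # Identifies this caller as the lock owner.

    if not await redis.set(lock_key, token, ex=30, nx=True):
        return False  # Another caller holds the lock.

    try:
        await do_invalidate(mbid)
        return True
    finally:
        # Release only if we still own the lock (not atomic in this sketch).
        if await redis.get(lock_key) == token:
            await redis.delete(lock_key)
```

When two callers race on the same MBID, one performs the invalidation and the other returns `False` immediately, which matches the behaviour of the implementation above.
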
## Data Volume Estimates

### MusicBrainz Database

| Table | Row Count | Avg Row Size | Total Size |
|-------|-----------|--------------|------------|
| artist | 2M | 500B | 1GB |
| release_group | 3M | 400B | 1.2GB |
| release | 5M | 600B | 3GB |
| medium | 6M | 200B | 1.2GB |
| track | 50M | 300B | 15GB |
| recording | 40M | 400B | 16GB |
| url | 10M | 200B | 2GB |
| l_artist_url | 5M | 100B | 500MB |
| **Total (listed tables)** | **121M** | - | **~40GB** |

The full database is ~100GB; the tables listed above account for roughly 40GB of that.

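The per-table figures are simply row count multiplied by average row size, and the arithmetic is easy to check. The sketch below copies the row counts and sizes from the table above and sums them; the helper names are hypothetical, and decimal units (1GB = 10^9 bytes) are an assumption. The listed tables come to about 40GB, so the ~100GB figure for the whole database presumably also covers tables and storage not itemised here.

```python
# (table, row_count, avg_row_bytes), copied from the table above.
TABLES = [
    ("artist", 2_000_000, 500),
    ("release_group", 3_000_000, 400),
    ("release", 5_000_000, 600),
    ("medium", 6_000_000, 200),
    ("track", 50_000_000, 300),
    ("recording", 40_000_000, 400),
    ("url", 10_000_000, 200),
    ("l_artist_url", 5_000_000, 100),
]


def format_size(num_bytes: int) -> str:
    """Render a byte count using decimal units (assumed: 1GB = 10**9 bytes)."""
    if num_bytes >= 10**9:
        return f"{num_bytes / 10**9:.1f}GB"
    return f"{num_bytes / 10**6:.0f}MB"


# Estimated size per table: rows multiplied by average row size.
sizes = {name: rows * avg for name, rows, avg in TABLES}
total_rows = sum(rows for _, rows, _ in TABLES)
total_bytes = sum(sizes.values())

for name, size in sizes.items():
    print(f"{name:15s} {format_size(size)}")
print(f"{'listed total':15s} {format_size(total_bytes)} across {total_rows:,} rows")
```
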
### Cache Database

| Table | Entry Count | Avg Compressed Size | Total Size |
|-------|-------------|---------------------|------------|
| artist | 120K | 5KB | 600MB |
| album | 200K | 3KB | 600MB |
| spotify | 50K | 1KB | 50MB |
| fanart | 80K | 2KB | 160MB |
| tadb | 60K | 2KB | 120MB |
| wikipedia | 100K | 1KB | 100MB |
| **Total** | **610K** | - | **~2GB** |

### Redis Cache

| Key Pattern | Entry Count | Avg Size | Total Size |
|-------------|-------------|----------|------------|
| artist:* | 10K | 5KB | 50MB |
| album:* | 15K | 3KB | 45MB |
| search:* | 5K | 10KB | 50MB |
| ratelimit:* | 1K | 100B | 100KB |
| sentry:* | 500 | 100B | 50KB |
| **Total** | **31.5K** | - | **~150MB** |

### Solr Index

| Core | Document Count | Index Size |
|------|----------------|------------|
| artist | 2M | 4GB |
| release-group | 3M | 4GB |
| **Total** | **5M** | **8GB** |

## Conclusion

The data layer is built as a multi-tier architecture:

1. **MusicBrainz PostgreSQL**: Authoritative source with complex SQL aggregation
2. **Cache PostgreSQL**: Persistent compressed cache with automatic expiration
3. **Redis**: Hot cache with LFU eviction and rate limiting
4. **Solr**: Real-time search index with RabbitMQ updates

Key strengths:
- Direct database access for complex queries
- Three-tier caching with compression (10:1 ratio)
- Change detection across 5 sources per entity type
- Real-time search index updates

The SQL queries using `row_to_json` and `json_agg` are particularly elegant: they build nested JSON structures directly in the database, so the API does not have to assemble them row by row.