metadata-agregator/docs/research/lidarr-metadata-api/analysis/DATA.md

# Lidarr Metadata API - Data Layer

## Data Source Overview

The Lidarr Metadata API integrates with four primary data storage systems:

| System | Purpose | Size | Technology | Access Pattern |
|--------|---------|------|------------|----------------|
| **MusicBrainz PostgreSQL** | Authoritative music metadata | 100GB+ | PostgreSQL 12+ | Direct SQL (asyncpg) |
| **Cache Database** | Persistent metadata cache | 10-50GB | PostgreSQL 12+ | Direct SQL (asyncpg) |
| **Redis** | Ephemeral cache + rate limiting | 512MB | Redis 6+ | Key-value (aioredis) |
| **Solr** | Full-text search index | 8GB+ | Solr 8.x | HTTP REST API |

## MusicBrainz PostgreSQL Database

### Database Overview

**Purpose**: Authoritative source for all music metadata

**Access method**: Direct read-only SQL queries via asyncpg

**Replication**: Hourly updates from MusicBrainz master database

**Container image**: `ghcr.io/lidarr/mb-postgres:1.0.10`

**Connection configuration**:
```python
DATABASE = {
    'host': 'musicbrainz',
    'port': 5432,
    'database': 'musicbrainz_db',
    'user': 'abc',
    'password': 'abc',
    'min_pool_size': 10,
    'max_pool_size': 50,
    'command_timeout': 30
}
```

### Core Tables

#### artist

**Purpose**: Artist entities (musicians, bands, orchestras, etc.)

**Key columns**:

| Column | Type | Description |
|--------|------|-------------|
| `id` | INTEGER | Primary key |
| `gid` | UUID | MusicBrainz ID (public identifier) |
| `name` | VARCHAR | Artist name |
| `sort_name` | VARCHAR | Name for alphabetical sorting |
| `type` | INTEGER | Artist type (Person, Group, etc.) |
| `gender` | INTEGER | Gender (for Person type) |
| `area` | INTEGER | Geographic area |
| `begin_date_year` | SMALLINT | Formation/birth year |
| `end_date_year` | SMALLINT | Dissolution/death year |
| `comment` | VARCHAR | Disambiguation comment |
| `last_updated` | TIMESTAMP | Last modification timestamp |

**Indices**:
```sql
CREATE INDEX idx_artist_gid ON artist (gid);
CREATE INDEX idx_artist_name ON artist (name);
CREATE INDEX idx_artist_last_updated ON artist (last_updated DESC);
```

**Row count**: ~2 million artists

#### release_group

**Purpose**: Album groupings (abstract releases)

**Key columns**:

| Column | Type | Description |
|--------|------|-------------|
| `id` | INTEGER | Primary key |
| `gid` | UUID | MusicBrainz ID |
| `name` | VARCHAR | Album title |
| `artist_credit` | INTEGER | Artist credit ID |
| `type` | INTEGER | Primary type (Album, Single, EP, etc.) |
| `comment` | VARCHAR | Disambiguation |
| `last_updated` | TIMESTAMP | Last modification timestamp |

**Indices**:
```sql
CREATE INDEX idx_release_group_gid ON release_group (gid);
CREATE INDEX idx_release_group_artist_credit ON release_group (artist_credit);
CREATE INDEX idx_release_group_last_updated ON release_group (last_updated DESC);
```

**Row count**: ~3 million release groups

#### release

**Purpose**: Specific releases (physical/digital products)

**Key columns**:

| Column | Type | Description |
|--------|------|-------------|
| `id` | INTEGER | Primary key |
| `gid` | UUID | MusicBrainz ID |
| `name` | VARCHAR | Release title |
| `release_group` | INTEGER | Release group ID |
| `artist_credit` | INTEGER | Artist credit ID |
| `status` | INTEGER | Release status (Official, Promotion, etc.) |
| `packaging` | INTEGER | Packaging type |
| `barcode` | VARCHAR | Barcode |
| `last_updated` | TIMESTAMP | Last modification timestamp |

**Indices**:
```sql
CREATE INDEX idx_release_gid ON release (gid);
CREATE INDEX idx_release_release_group ON release (release_group);
CREATE INDEX idx_release_last_updated ON release (last_updated DESC);
```

**Row count**: ~5 million releases

#### medium

**Purpose**: Physical/digital media (CDs, Vinyl, Digital, etc.)

**Key columns**:

| Column | Type | Description |
|--------|------|-------------|
| `id` | INTEGER | Primary key |
| `release` | INTEGER | Release ID |
| `position` | INTEGER | Disc number |
| `format` | INTEGER | Medium format (CD, Vinyl, etc.) |
| `name` | VARCHAR | Medium name (e.g., "Bonus Disc") |
| `track_count` | INTEGER | Number of tracks |

**Indices**:
```sql
CREATE INDEX idx_medium_release ON medium (release);
```

**Row count**: ~6 million media

#### track

**Purpose**: Track listings on media

**Key columns**:

| Column | Type | Description |
|--------|------|-------------|
| `id` | INTEGER | Primary key |
| `gid` | UUID | MusicBrainz ID |
| `recording` | INTEGER | Recording ID |
| `medium` | INTEGER | Medium ID |
| `position` | INTEGER | Track number |
| `number` | VARCHAR | Track number (string, e.g., "A1") |
| `name` | VARCHAR | Track title |
| `length` | INTEGER | Duration in milliseconds |

**Indices**:
```sql
CREATE INDEX idx_track_medium ON track (medium);
CREATE INDEX idx_track_recording ON track (recording);
```

**Row count**: ~50 million tracks

#### recording

**Purpose**: Abstract recordings (audio content)

**Key columns**:

| Column | Type | Description |
|--------|------|-------------|
| `id` | INTEGER | Primary key |
| `gid` | UUID | MusicBrainz ID |
| `name` | VARCHAR | Recording title |
| `artist_credit` | INTEGER | Artist credit ID |
| `length` | INTEGER | Duration in milliseconds |
| `comment` | VARCHAR | Disambiguation |
| `last_updated` | TIMESTAMP | Last modification timestamp |

**Row count**: ~40 million recordings

#### url

**Purpose**: External URLs (websites, streaming services, etc.)

**Key columns**:

| Column | Type | Description |
|--------|------|-------------|
| `id` | INTEGER | Primary key |
| `gid` | UUID | MusicBrainz ID |
| `url` | TEXT | URL string |

**Indices**:
```sql
CREATE INDEX idx_url_url ON url (url);
```

**Row count**: ~10 million URLs

#### l_artist_url

**Purpose**: Artist-URL relationships (links)

**Key columns**:

| Column | Type | Description |
|--------|------|-------------|
| `id` | INTEGER | Primary key |
| `link` | INTEGER | Link type ID |
| `entity0` | INTEGER | Artist ID |
| `entity1` | INTEGER | URL ID |
| `last_updated` | TIMESTAMP | Last modification timestamp |

**Indices**:
```sql
CREATE INDEX idx_l_artist_url_entity0 ON l_artist_url (entity0);
CREATE INDEX idx_l_artist_url_entity1 ON l_artist_url (entity1);
CREATE INDEX idx_l_artist_url_last_updated ON l_artist_url (last_updated DESC);
```

**Row count**: ~5 million links

#### cover_art_archive.index_listing

**Purpose**: Cover art availability tracking

**Key columns**:

| Column | Type | Description |
|--------|------|-------------|
| `release` | INTEGER | Release ID |
| `date_updated` | TIMESTAMP | Last cover art update |

**Indices**:
```sql
CREATE INDEX idx_cover_art_release ON cover_art_archive.index_listing (release);
CREATE INDEX idx_cover_art_date_updated ON cover_art_archive.index_listing (date_updated DESC);
```

**Row count**: ~2 million releases with cover art

#### replication_control

**Purpose**: Replication status tracking

**Key columns**:

| Column | Type | Description |
|--------|------|-------------|
| `current_schema_sequence` | INTEGER | Current schema version |
| `current_replication_sequence` | INTEGER | Current replication packet |
| `last_replication_date` | TIMESTAMP | Last replication timestamp |

**Usage**: Monitoring replication lag and detecting updates

### Custom Indices

To support efficient change detection and queries, custom indices are created:

```sql
-- Artist change detection
CREATE INDEX IF NOT EXISTS idx_artist_last_updated
ON artist (last_updated DESC)
WHERE last_updated IS NOT NULL;

-- Release group change detection
CREATE INDEX IF NOT EXISTS idx_release_group_last_updated
ON release_group (last_updated DESC)
WHERE last_updated IS NOT NULL;

-- Release change detection
CREATE INDEX IF NOT EXISTS idx_release_last_updated
ON release (last_updated DESC)
WHERE last_updated IS NOT NULL;

-- Link change detection
CREATE INDEX IF NOT EXISTS idx_l_artist_url_last_updated
ON l_artist_url (last_updated DESC)
WHERE last_updated IS NOT NULL;

-- Cover art change detection
CREATE INDEX IF NOT EXISTS idx_cover_art_date_updated
ON cover_art_archive.index_listing (date_updated DESC)
WHERE date_updated IS NOT NULL;
```

### SQL Query Files

#### artist.sql

**Purpose**: Fetch complete artist metadata with releases

**Location**: `lidarrmetadata/sql/artist.sql`

**Parameters**:
- `$1`: Artist MBID (UUID)
- `$2`: Primary release types (array)
- `$3`: Secondary release types (array)
- `$4`: Release statuses (array)

**Query structure**:
```sql
WITH artist_data AS (
    SELECT
        a.gid AS id,
        a.name AS artist_name,
        a.sort_name,
        a.comment AS disambiguation,
        at.name AS artist_type,
        g.name AS gender,
        ar.name AS hometown,
        a.begin_date_year AS start_year,
        a.end_date_year AS end_year
    FROM artist a
    LEFT JOIN artist_type at ON a.type = at.id
    LEFT JOIN gender g ON a.gender = g.id
    LEFT JOIN area ar ON a.area = ar.id
    WHERE a.gid = $1
),
releases AS (
    SELECT
        rg.gid AS id,
        rg.name AS title,
        rg.comment AS disambiguation,
        rgt.name AS primary_type,
        rgst.name AS secondary_type,
        rs.name AS status,
        COALESCE(
            TO_CHAR(DATE(rd.date_year || '-' || COALESCE(rd.date_month, 1) || '-' || COALESCE(rd.date_day, 1)), 'YYYY-MM-DD'),
            ''
        ) AS release_date
    FROM artist a
    JOIN release_group rg ON rg.artist_credit = a.id
    LEFT JOIN release_group_primary_type rgt ON rg.type = rgt.id
    LEFT JOIN release_group_secondary_type_join rgstj ON rgstj.release_group = rg.id
    LEFT JOIN release_group_secondary_type rgst ON rgstj.secondary_type = rgst.id
    LEFT JOIN release r ON r.release_group = rg.id
    LEFT JOIN release_status rs ON r.status = rs.id
    LEFT JOIN (
        SELECT release, MIN(date_year) AS date_year, MIN(date_month) AS date_month, MIN(date_day) AS date_day
        FROM release_country
        GROUP BY release
    ) rd ON rd.release = r.id
    WHERE a.gid = $1
        AND (ARRAY_LENGTH($2::text[], 1) IS NULL OR rgt.name = ANY($2))
        AND (ARRAY_LENGTH($3::text[], 1) IS NULL OR rgst.name = ANY($3))
        AND (ARRAY_LENGTH($4::text[], 1) IS NULL OR rs.name = ANY($4))
    GROUP BY rg.id, rg.name, rg.comment, rgt.name, rgst.name, rs.name, rd.date_year, rd.date_month, rd.date_day
    ORDER BY rd.date_year DESC NULLS LAST, rd.date_month DESC NULLS LAST, rd.date_day DESC NULLS LAST
),
links AS (
    SELECT
        lt.name AS link_type,
        u.url
    FROM artist a
    JOIN l_artist_url lau ON lau.entity0 = a.id
    JOIN url u ON lau.entity1 = u.id
    JOIN link l ON lau.link = l.id
    JOIN link_type lt ON l.link_type = lt.id
    WHERE a.gid = $1
)
SELECT
    row_to_json(artist_data.*) AS artist,
    COALESCE(
        (SELECT json_agg(row_to_json(releases.*)) FROM releases),
        '[]'::json
    ) AS albums,
    COALESCE(
        (SELECT json_agg(row_to_json(links.*)) FROM links),
        '[]'::json
    ) AS links
FROM artist_data;
```

**Performance**: 100-500ms depending on artist discography size

**Result format**: Single row with three JSON columns (artist, albums, links)

#### album.sql

**Purpose**: Fetch complete album metadata with tracks

**Location**: `lidarrmetadata/sql/album.sql`

**Parameters**:
- `$1`: Release group MBID (UUID)

**Query structure**:
```sql
WITH album_data AS (
    SELECT
        rg.gid AS id,
        rg.name AS title,
        rg.comment AS disambiguation,
        rgt.name AS primary_type,
        rs.name AS status,
        a.gid AS artist_id,
        a.name AS artist_name,
        COALESCE(
            TO_CHAR(DATE(rd.date_year || '-' || COALESCE(rd.date_month, 1) || '-' || COALESCE(rd.date_day, 1)), 'YYYY-MM-DD'),
            ''
        ) AS release_date
    FROM release_group rg
    JOIN artist_credit ac ON rg.artist_credit = ac.id
    JOIN artist_credit_name acn ON acn.artist_credit = ac.id
    JOIN artist a ON acn.artist = a.id
    LEFT JOIN release_group_primary_type rgt ON rg.type = rgt.id
    LEFT JOIN release r ON r.release_group = rg.id
    LEFT JOIN release_status rs ON r.status = rs.id
    LEFT JOIN (
        SELECT release, MIN(date_year) AS date_year, MIN(date_month) AS date_month, MIN(date_day) AS date_day
        FROM release_country
        GROUP BY release
    ) rd ON rd.release = r.id
    WHERE rg.gid = $1
    LIMIT 1
),
media AS (
    SELECT
        m.position,
        mf.name AS format,
        json_agg(
            json_build_object(
                'position', t.position,
                'title', t.name,
                'duration', t.length,
                'artist_name', ta.name
            )
            ORDER BY t.position
        ) AS tracks
    FROM release_group rg
    JOIN release r ON r.release_group = rg.id
    JOIN medium m ON m.release = r.id
    LEFT JOIN medium_format mf ON m.format = mf.id
    JOIN track t ON t.medium = m.id
    JOIN recording rec ON t.recording = rec.id
    JOIN artist_credit tac ON rec.artist_credit = tac.id
    JOIN artist_credit_name tacn ON tacn.artist_credit = tac.id
    JOIN artist ta ON tacn.artist = ta.id
    WHERE rg.gid = $1
    GROUP BY m.id, m.position, mf.name
    ORDER BY m.position
)
SELECT
    row_to_json(album_data.*) AS album,
    COALESCE(
        (SELECT json_agg(row_to_json(media.*)) FROM media),
        '[]'::json
    ) AS media
FROM album_data;
```

**Performance**: 200-1000ms depending on track count

#### updated_artists.sql

**Purpose**: Detect recently updated artists for cache invalidation

**Location**: `lidarrmetadata/sql/updated_artists.sql`

**Parameters**:
- `$1`: Timestamp threshold (only artists updated after this)
- `$2`: Result limit

**Query structure** (UNION of 5 change sources):

```sql
-- Source 1: Artists with updated metadata
SELECT DISTINCT
    a.gid,
    a.last_updated,
    'metadata' AS change_type
FROM artist a
WHERE a.last_updated > $1

UNION

-- Source 2: Artists with new release groups
SELECT DISTINCT
    a.gid,
    rg.last_updated,
    'new_release' AS change_type
FROM artist a
JOIN release_group rg ON rg.artist_credit = a.id
WHERE rg.last_updated > $1

UNION

-- Source 3: Artists with updated releases
SELECT DISTINCT
    a.gid,
    r.last_updated,
    'updated_release' AS change_type
FROM artist a
JOIN release_group rg ON rg.artist_credit = a.id
JOIN release r ON r.release_group = rg.id
WHERE r.last_updated > $1

UNION

-- Source 4: Artists with new/updated links
SELECT DISTINCT
    a.gid,
    lau.last_updated,
    'new_link' AS change_type
FROM artist a
JOIN l_artist_url lau ON lau.entity0 = a.id
WHERE lau.last_updated > $1

UNION

-- Source 5: Artists with updated cover art
SELECT DISTINCT
    a.gid,
    caa.date_updated AS last_updated,
    'cover_art' AS change_type
FROM artist a
JOIN release_group rg ON rg.artist_credit = a.id
JOIN release r ON r.release_group = rg.id
JOIN cover_art_archive.index_listing caa ON caa.release = r.id
WHERE caa.date_updated > $1

ORDER BY last_updated DESC
LIMIT $2;
```

**Performance**: 500-2000ms depending on time window

**Use case**: Crawler scheduling, cache invalidation

#### updated_albums.sql

**Purpose**: Detect recently updated albums

**Location**: `lidarrmetadata/sql/updated_albums.sql`

**Parameters**: Same as `updated_artists.sql`

**Query structure**: Similar UNION pattern with 5 change sources:
1. Release group metadata updates
2. New releases in group
3. Updated releases in group
4. New/updated links
5. Cover art updates

### Database Replication

**Replication method**: MusicBrainz replication packets

**Update frequency**: Hourly

**Replication process**:
1. Check `replication_control` table for current sequence
2. Fetch replication packets from MusicBrainz FTP
3. Apply SQL changes from packets
4. Update `replication_control` table
5. Trigger search index updates via RabbitMQ

**Monitoring**:
```sql
SELECT
    current_replication_sequence,
    last_replication_date,
    NOW() - last_replication_date AS replication_lag
FROM replication_control;
```

**Typical lag**: 1-2 hours behind MusicBrainz master

## Cache Database (PostgreSQL)

### Database Overview

**Purpose**: Persistent cache storage with compression

**Technology**: PostgreSQL 12+ (same instance as MusicBrainz or separate)

**Database name**: `lm_cache_db`

**Connection configuration**:
```python
CACHE_DATABASE = {
    'host': 'localhost',
    'port': 5432,
    'database': 'lm_cache_db',
    'user': 'abc',
    'password': 'abc'
}
```

### Auto-Created Cache Tables

Each cache type gets its own table with identical schema:

**Table names**:
- `artist`: Artist metadata cache
- `album`: Album metadata cache
- `spotify`: Spotify lookup cache
- `fanart`: FanArt.tv image cache
- `tadb`: TheAudioDB metadata cache
- `wikipedia`: Wikipedia overview cache

**Schema**:
```sql
CREATE TABLE IF NOT EXISTS {cache_name} (
    key VARCHAR(255) PRIMARY KEY,
    expires TIMESTAMP,
    updated TIMESTAMP DEFAULT NOW(),
    value BYTEA
);

CREATE INDEX IF NOT EXISTS idx_{cache_name}_expires
ON {cache_name} (expires)
WHERE expires IS NOT NULL;

CREATE INDEX IF NOT EXISTS idx_{cache_name}_updated
ON {cache_name} (updated DESC);
```

**Trigger for automatic timestamp updates**:
```sql
CREATE OR REPLACE FUNCTION update_updated_column()
RETURNS TRIGGER AS $$
BEGIN
    NEW.updated = NOW();
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER update_{cache_name}_updated
    BEFORE UPDATE ON {cache_name}
    FOR EACH ROW
    EXECUTE FUNCTION update_updated_column();
```

### Cache Entry Format

**Key structure**: `{cache_type}:{identifier}:{parameters}`

**Examples**:
- `artist:5b11f4ce-a62d-471e-81fc-a69a8278c7da:Album:Official`
- `album:1b022e01-4da6-387b-8658-8678046e4cef`
- `spotify:artist:6olE6TJLqED3rqDCT0FyPh`
- `wikipedia:5b11f4ce-a62d-471e-81fc-a69a8278c7da:en`

**Value format**: zlib-compressed pickle

**Compression implementation**:
```python
import zlib
import pickle

def compress_value(value):
    """Compress Python object for storage"""
    pickled = pickle.dumps(value, protocol=pickle.HIGHEST_PROTOCOL)
    compressed = zlib.compress(pickled, level=6)
    return compressed

def decompress_value(compressed):
    """Decompress stored value to Python object"""
    pickled = zlib.decompress(compressed)
    value = pickle.loads(pickled)
    return value
```

**Compression ratio**: Typically 10:1 for JSON metadata

**Example**:
- Uncompressed artist metadata: 50KB JSON
- Pickled: 52KB
- Compressed: 5KB
- Storage savings: 90%

### Cache Operations

#### Insert/Update

```python
async def cache_set(key, value, ttl=2592000):
    """Store value in cache with optional TTL"""
    compressed = zlib.compress(pickle.dumps(value))
    expires = datetime.now() + timedelta(seconds=ttl) if ttl else None

    await conn.execute(
        """
        INSERT INTO artist (key, value, expires)
        VALUES ($1, $2, $3)
        ON CONFLICT (key) DO UPDATE
        SET value = $2, expires = $3, updated = NOW()
        """,
        key, compressed, expires
    )
```

#### Retrieve

```python
async def cache_get(key):
    """Retrieve value from cache"""
    row = await conn.fetchrow(
        """
        SELECT value, expires
        FROM artist
        WHERE key = $1
        """,
        key
    )

    if not row:
        return None

    # Check expiration
    if row['expires'] and row['expires'] < datetime.now():
        # Expired, delete and return None
        await conn.execute("DELETE FROM artist WHERE key = $1", key)
        return None

    # Decompress and return
    value = pickle.loads(zlib.decompress(row['value']))
    return value
```

#### Delete

```python
async def cache_delete(key):
    """Delete value from cache"""
    await conn.execute("DELETE FROM artist WHERE key = $1", key)
```

#### Cleanup Expired Entries

```python
async def cache_cleanup():
    """Remove expired entries"""
    deleted = await conn.execute(
        """
        DELETE FROM artist
        WHERE expires IS NOT NULL AND expires < NOW()
        """
    )
    return deleted
```

**Cleanup schedule**: Daily via cron or crawler

### Cache Statistics

**Query for cache statistics**:
```sql
SELECT
    'artist' AS cache_name,
    COUNT(*) AS total_entries,
    COUNT(*) FILTER (WHERE expires IS NOT NULL AND expires < NOW()) AS expired_entries,
    COUNT(*) FILTER (WHERE expires IS NULL OR expires >= NOW()) AS valid_entries,
    pg_size_pretty(pg_total_relation_size('artist')) AS total_size,
    AVG(LENGTH(value)) AS avg_value_size,
    MAX(updated) AS last_updated
FROM artist;
```

**Example output**:
```
cache_name | total_entries | expired_entries | valid_entries | total_size | avg_value_size | last_updated
-----------+---------------+-----------------+---------------+------------+----------------+---------------------
artist     | 125000        | 5000            | 120000        | 2048 MB    | 5120           | 2025-04-28 12:34:56
```

## Redis Cache

### Redis Overview

**Purpose**: Ephemeral cache for hot data and rate limiting

**Technology**: Redis 6+

**Memory limit**: 512MB

**Eviction policy**: LFU (Least Frequently Used)

**Namespace**: `lm3.7`

**Connection configuration**:
```python
REDIS_URL = 'redis://localhost:6379/0'
REDIS_MIN_POOL_SIZE = 5
REDIS_MAX_POOL_SIZE = 20
```

### Redis Configuration

**redis.conf settings**:
```conf
maxmemory 512mb
maxmemory-policy allkeys-lfu
save ""  # Disable persistence
appendonly no  # Disable AOF
```

**Rationale**:
- LFU eviction keeps most-accessed data in cache
- No persistence needed (PostgreSQL is persistent layer)
- Maximum performance for cache operations

### Key Structure

**Namespace prefix**: All keys prefixed with `lm3.7:`

**Key patterns**:
- `lm3.7:artist:{mbid}:{params}`: Artist metadata
- `lm3.7:album:{mbid}`: Album metadata
- `lm3.7:search:artist:{query}`: Search results
- `lm3.7:ratelimit:{ip}:{window}`: Rate limiter state
- `lm3.7:sentry:{error_hash}`: Sentry deduplication
- `lm3.7:lock:invalidate:{mbid}`: Invalidation locks

### Cache Operations

#### Set with TTL

```python
async def redis_set(key, value, ttl=604800):
    """Store value in Redis with TTL (default 7 days)"""
    compressed = zlib.compress(pickle.dumps(value))
    await redis.setex(f"lm3.7:{key}", ttl, compressed)
```

#### Get

```python
async def redis_get(key):
    """Retrieve value from Redis"""
    compressed = await redis.get(f"lm3.7:{key}")
    if not compressed:
        return None
    value = pickle.loads(zlib.decompress(compressed))
    return value
```

#### Delete

```python
async def redis_delete(key):
    """Delete value from Redis"""
    await redis.delete(f"lm3.7:{key}")
```

#### Batch Delete

```python
async def redis_delete_pattern(pattern):
    """Delete all keys matching pattern"""
    cursor = 0
    while True:
        cursor, keys = await redis.scan(cursor, match=f"lm3.7:{pattern}", count=100)
        if keys:
            await redis.delete(*keys)
        if cursor == 0:
            break
```

### Rate Limiting with Redis

**Implementation**: Sliding window counter

```python
async def rate_limit_check(key, max_requests=100, window=60):
    """Check if request is within rate limit"""
    now = time.time()
    window_key = f"lm3.7:ratelimit:{key}:{int(now / window)}"

    # Increment counter
    count = await redis.incr(window_key)

    # Set expiration on first request
    if count == 1:
        await redis.expire(window_key, window)

    # Check limit
    if count > max_requests:
        raise RateLimitExceeded(f"Rate limit exceeded: {count}/{max_requests}")

    return count
```

### Sentry Deduplication with Redis

**Purpose**: Prevent duplicate error reports

```python
async def sentry_should_send(error_hash):
    """Check if error should be sent to Sentry"""
    key = f"lm3.7:sentry:{error_hash}"

    # Check if error seen recently
    if await redis.exists(key):
        return False

    # Mark error as seen for 1 hour
    await redis.setex(key, 3600, "1")
    return True
```

### Redis Monitoring

**Memory usage**:
```bash
redis-cli INFO memory
```

**Key count**:
```bash
redis-cli DBSIZE
```

**Eviction stats**:
```bash
redis-cli INFO stats | grep evicted
```

**Hit rate**:
```bash
redis-cli INFO stats | grep keyspace
```

## Solr Search Index

### Solr Overview

**Purpose**: Full-text search for artists and albums

**Technology**: Apache Solr 8.x

**Container image**: `ghcr.io/lidarr/mb-solr:3.3.1.9`

**Cores**:
- `artist`: Artist search index
- `release-group`: Album search index

**Update method**: Real-time via RabbitMQ + SIR (Search Index Rebuilder)

### Solr Configuration

**solrconfig.xml settings**:
```xml
<config>
  <requestHandler name="/select" class="solr.SearchHandler">
    <lst name="defaults">
      <str name="echoParams">explicit</str>
      <str name="defType">dismax</str>
      <int name="rows">10</int>
    </lst>
  </requestHandler>

  <requestHandler name="/update" class="solr.UpdateRequestHandler" />

  <requestHandler name="/admin/ping" class="solr.PingRequestHandler">
    <lst name="invariants">
      <str name="q">*:*</str>
    </lst>
  </requestHandler>
</config>
```

### Artist Core Schema

**schema.xml**:
```xml
<schema name="artist" version="1.6">
  <field name="id" type="string" indexed="true" stored="true" required="true" />
  <field name="artist" type="text_general" indexed="true" stored="true" />
  <field name="sortname" type="text_general" indexed="true" stored="true" />
  <field name="alias" type="text_general" indexed="true" stored="true" multiValued="true" />
  <field name="type" type="string" indexed="true" stored="true" />
  <field name="disambiguation" type="text_general" indexed="true" stored="true" />
  <field name="_version_" type="long" indexed="true" stored="true" />

  <uniqueKey>id</uniqueKey>

  <copyField source="artist" dest="text" />
  <copyField source="sortname" dest="text" />
  <copyField source="alias" dest="text" />
</schema>
```

**Indexed fields**:
- `id`: MusicBrainz artist MBID
- `artist`: Artist name (boosted 2x in queries)
- `sortname`: Sort name
- `alias`: Artist aliases (multi-valued)
- `type`: Artist type (Person, Group, etc.)
- `disambiguation`: Disambiguation comment

### Release Group Core Schema

**Indexed fields**:
- `id`: MusicBrainz release group MBID
- `title`: Album title (boosted 2x)
- `artist`: Artist name
- `type`: Primary type (Album, Single, etc.)
- `status`: Release status
- `disambiguation`: Disambiguation comment

### Search Query Format

**Dismax query example**:
```json
{
  "query": "nirvana",
  "limit": 10,
  "params": {
    "defType": "dismax",
    "qf": "artist^2 sortname alias",
    "mm": "1"
  }
}
```

**Query field weights**:
- `artist^2`: Artist name (2x boost)
- `sortname`: Sort name (1x)
- `alias`: Aliases (1x)

**Minimum match**: At least 1 term must match

### Solr Update Process

**Real-time updates via RabbitMQ**:

1. MusicBrainz replication applies database changes
2. Database triggers publish messages to RabbitMQ
3. SIR (Search Index Rebuilder) consumes messages
4. SIR queries MusicBrainz database for updated entities
5. SIR posts updates to Solr cores
6. Solr commits changes (soft commit every 1s)

**RabbitMQ configuration**:
```python
RABBITMQ = {
    'host': 'rabbitmq',
    'port': 5672,
    'user': 'abc',
    'password': 'abc',
    'exchange': 'search.index',
    'queue': 'search.index.artist'
}
```

**Update latency**: 1-5 seconds from database change to searchable

### Solr Performance

**Query timeout**: 5 seconds

**Typical query time**: 50-200ms

**Index size**:
- Artist core: ~4GB
- Release group core: ~4GB

**Document count**:
- Artist core: ~2 million
- Release group core: ~3 million

## Change Detection System

### Overview

**Purpose**: Identify recently updated entities for cache invalidation

**Method**: SQL queries tracking changes across multiple sources

**Update frequency**: Hourly (aligned with MusicBrainz replication)

### Artist Change Sources

**5 change sources tracked**:

1. **Artist metadata updates**: `artist.last_updated`
2. **New release groups**: `release_group.last_updated`
3. **Updated releases**: `release.last_updated`
4. **New/updated links**: `l_artist_url.last_updated`
5. **Cover art updates**: `cover_art_archive.index_listing.date_updated`

**Query**: `updated_artists.sql` (UNION of 5 queries)

**Performance**: 500-2000ms for 24-hour window

**Typical results**: 1000-5000 artists per hour

### Album Change Sources

**5 change sources tracked**:

1. **Release group metadata updates**: `release_group.last_updated`
2. **New releases in group**: `release.last_updated`
3. **Updated releases in group**: `release.last_updated`
4. **New/updated links**: `l_release_group_url.last_updated`
5. **Cover art updates**: `cover_art_archive.index_listing.date_updated`

**Query**: `updated_albums.sql`

**Typical results**: 2000-10000 albums per hour

### Change Detection Workflow

**Crawler process**:

1. Query `updated_artists.sql` with timestamp from last run
2. For each updated artist:
   - Delete from Redis cache
   - Delete from PostgreSQL cache
   - Purge from Cloudflare CDN
3. Optionally pre-fetch fresh metadata
4. Update last run timestamp
5. Sleep until next cycle

**Invalidation vs. pre-fetching**:
- **Invalidation only**: Fast, minimal API load
- **Pre-fetching**: Slower, but ensures cache is warm

**Configuration**:
```python
CRAWLER_INVALIDATE_ONLY = False  # Pre-fetch after invalidation
CRAWLER_INTERVAL = 3600  # 1 hour
CRAWLER_BATCH_SIZE = 100  # Process 100 entities per batch
```

## Data Consistency

### Cache Coherence

**Problem**: Three cache tiers can become inconsistent

**Solution**: Hierarchical invalidation

**Invalidation order**:
1. Cloudflare CDN (purge API)
2. Redis (delete key)
3. PostgreSQL (delete row)

**Rationale**: Invalidate from edge to origin to prevent stale data propagation

### Replication Lag Handling

**Problem**: MusicBrainz replication has 1-2 hour lag

**Solution**: Accept eventual consistency

**User impact**: Newly added artists/albums may not appear in search for 1-2 hours

**Mitigation**: Manual cache refresh endpoint for urgent updates

### Concurrent Update Handling

**Problem**: Multiple API instances may invalidate cache simultaneously

**Solution**: Redis-based distributed locks

**Implementation**:
```python
async def invalidate_with_lock(mbid):
    """Invalidate cache with distributed lock"""
    lock_key = f"lm3.7:lock:invalidate:{mbid}"

    # Try to acquire lock (30 second TTL)
    acquired = await redis.set(lock_key, "1", ex=30, nx=True)

    if not acquired:
        # Another instance is already invalidating
        return False

    try:
        # Perform invalidation
        await invalidate_cdn(mbid)
        await redis_delete(f"artist:{mbid}")
        await postgres_delete(f"artist:{mbid}")
        return True
    finally:
        # Release lock
        await redis.delete(lock_key)
```

## Data Volume Estimates

### MusicBrainz Database

| Table | Row Count | Avg Row Size | Total Size |
|-------|-----------|--------------|------------|
| artist | 2M | 500B | 1GB |
| release_group | 3M | 400B | 1.2GB |
| release | 5M | 600B | 3GB |
| medium | 6M | 200B | 1.2GB |
| track | 50M | 300B | 15GB |
| recording | 40M | 400B | 16GB |
| url | 10M | 200B | 2GB |
| l_artist_url | 5M | 100B | 500MB |
| **Total** | **121M** | - | **~100GB** |

### Cache Database

| Table | Entry Count | Avg Compressed Size | Total Size |
|-------|-------------|---------------------|------------|
| artist | 120K | 5KB | 600MB |
| album | 200K | 3KB | 600MB |
| spotify | 50K | 1KB | 50MB |
| fanart | 80K | 2KB | 160MB |
| tadb | 60K | 2KB | 120MB |
| wikipedia | 100K | 1KB | 100MB |
| **Total** | **610K** | - | **~2GB** |

### Redis Cache

| Key Pattern | Entry Count | Avg Size | Total Size |
|-------------|-------------|----------|------------|
| artist:* | 10K | 5KB | 50MB |
| album:* | 15K | 3KB | 45MB |
| search:* | 5K | 10KB | 50MB |
| ratelimit:* | 1K | 100B | 100KB |
| sentry:* | 500 | 100B | 50KB |
| **Total** | **31.5K** | - | **~150MB** |

### Solr Index

| Core | Document Count | Index Size |
|------|----------------|------------|
| artist | 2M | 4GB |
| release-group | 3M | 4GB |
| **Total** | **5M** | **8GB** |

## Conclusion

The data layer demonstrates sophisticated multi-tier architecture:

1. **MusicBrainz PostgreSQL**: Authoritative source with complex SQL aggregation
2. **Cache PostgreSQL**: Persistent compressed cache with automatic expiration
3. **Redis**: Hot cache with LFU eviction and rate limiting
4. **Solr**: Real-time search index with RabbitMQ updates

Key strengths:
- Direct database access for complex queries
- Three-tier caching with compression (10:1 ratio)
- Change detection across 5 sources per entity type
- Real-time search index updates

The SQL queries using `row_to_json` and `json_agg` are particularly elegant, building nested JSON structures directly in the database for optimal performance.