Lidarr Metadata API - Overview

Project Identity

| Property | Value |
| --- | --- |
| Name | LidarrAPI.Metadata |
| Repository | https://github.com/Lidarr/LidarrAPI.Metadata |
| Version | 10.0.0.0 |
| License | GPL-3.0 |
| Primary Language | Python 3.9 |
| Purpose | Enriched metadata aggregation API for Lidarr music manager |

Core Purpose

LidarrAPI.Metadata serves as a metadata enrichment layer for the Lidarr music management application. It aggregates data from multiple authoritative sources (MusicBrainz, FanArt.tv, TheAudioDB, Wikipedia, Spotify, Last.fm, Billboard, Apple Music) to provide comprehensive artist and album metadata including:

  • Artist biographical information
  • Album release details
  • High-quality cover art and artist images
  • Genre classifications
  • Music charts and trending data
  • Cross-platform ID mappings (MusicBrainz, Spotify, TheAudioDB)

The API acts as an intelligent caching proxy that transforms raw MusicBrainz database records into enriched JSON responses suitable for consumption by Lidarr clients.

Technology Stack

Core Framework

| Component | Version | Purpose |
| --- | --- | --- |
| Python | 3.9 | Runtime environment |
| Quart | 0.14.1 | Async web framework (Flask-compatible) |
| Gunicorn | Latest | WSGI HTTP server |
| Uvicorn | Latest | ASGI server (worker class) |

Data Layer

| Component | Version | Purpose |
| --- | --- | --- |
| asyncpg | 0.26.0 | PostgreSQL async driver |
| aioredis | 1.3.1 | Redis async client |
| PostgreSQL | 12+ | MusicBrainz database + cache storage |
| Redis | 6+ | Ephemeral cache + rate limiting |
| Solr | 8.x | Full-text search engine |

External Integrations

| Library | Version | Purpose |
| --- | --- | --- |
| spotipy | 2.16.1 | Spotify API client |
| pylast | 4.3.0 | Last.fm API client |
| billboard-py | 7.0.0 | Billboard chart scraper |
| beautifulsoup4 | Latest | HTML parsing (Wikipedia) |
| sentry-sdk | 0.19.5 | Error tracking |

Application Entry Points

The project provides two executable entry points:

1. API Server

lidarr-metadata-server

Implementation: lidarrmetadata/server.py

Starts the Quart web application, which serves the metadata API on port 5001. A path prefix can be configured through the APPLICATION_ROOT environment variable.

Production command:

gunicorn -w 1 -k uvicorn.workers.UvicornWorker \
  --bind 0.0.0.0:5001 \
  --access-logfile - \
  lidarrmetadata.server:app

2. Background Crawler

lidarr-metadata-crawler

Implementation: lidarrmetadata/crawler.py

Runs background cache warming tasks to proactively fetch and cache metadata for recently updated artists and albums. Operates independently of the API server.

Crawler types:

  • Wikipedia overview crawler
  • FanArt.tv image crawler
  • TheAudioDB metadata crawler
  • Artist metadata crawler
  • Album metadata crawler

Network Configuration

| Setting | Default | Configurable Via |
| --- | --- | --- |
| Port | 5001 | Docker/Gunicorn bind |
| Path Prefix | / | APPLICATION_ROOT env var |
| Workers | 1 | Gunicorn -w flag |
| Worker Class | uvicorn | Gunicorn -k flag |

Lidarr Music Manager

The primary consumer of this API. Lidarr is an automated music collection manager for Usenet and BitTorrent users. It monitors multiple RSS feeds for new albums from favorite artists and grabs, sorts, and renames them.

Integration: Lidarr queries this API to enrich its local music library database with metadata, images, and biographical information.

MusicBrainz Database

The authoritative source for music metadata. MusicBrainz is an open music encyclopedia that collects music metadata and makes it available to the public.

Integration: Direct PostgreSQL connection to a replicated MusicBrainz database instance. The API does NOT use the MusicBrainz web API; it queries the database directly for performance.

Database size: 100 GB or more for the full MusicBrainz dataset with hourly replication.

Cover Art Archive

A joint project between the Internet Archive and MusicBrainz providing cover art images for releases in the MusicBrainz database.

Integration: Images are proxied through imagecache.lidarr.audio CDN for performance and bandwidth optimization.

Deployment Architecture

The application is designed for containerized deployment with Docker Compose. A typical production deployment includes:

| Container | Purpose | Resource Requirements |
| --- | --- | --- |
| musicbrainz | PostgreSQL with MusicBrainz schema | 100GB+ storage, 4GB+ RAM |
| solr | Search index (artist/album) | 8GB+ storage, 2GB+ RAM |
| redis | Cache + rate limiting | 512MB RAM limit |
| rabbitmq | Search index updates | 1GB RAM |
| indexer | Solr index updater (SIR) | 512MB RAM |
| api-v0.3 | Stable API version | 1GB+ RAM |
| api-testing | Development API version | 1GB+ RAM |
| crawler | Background cache warmer | 512MB RAM |

Version Strategy

The project uses semantic versioning with a unique dual-deployment strategy:

  • v0.3: Stable production version
  • testing: Development/staging version

Both versions run simultaneously in production, allowing gradual rollout and A/B testing of new features.

Configuration Management

Configuration is managed through a metaclass-based system with environment variable overrides:

# Select configuration class
LIDARR_METADATA_CONFIG=lidarrmetadata.config.ProductionConfig

# Override specific settings (double underscore for nesting)
CACHE__REDIS_URL=redis://redis:6379/0
DATABASE__HOST=musicbrainz
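
As an illustration of the double-underscore convention, the following minimal sketch applies such overrides to a nested dictionary of defaults. It is not the project's metaclass-based implementation: the function name `apply_env_overrides` and the dictionary layout are assumptions made for this example, and values are left as strings without type conversion.

```python
import copy
from typing import Any, Dict, Mapping


def apply_env_overrides(defaults: Dict[str, Any],
                        environ: Mapping[str, str]) -> Dict[str, Any]:
    """Return a copy of `defaults` with SECTION__KEY=value overrides applied.

    Hypothetical helper: only keys that already exist in `defaults`
    are overridden, so unrelated environment variables are ignored.
    """
    result = copy.deepcopy(defaults)
    for name, raw_value in environ.items():
        # "CACHE__REDIS_URL" -> parents ["CACHE"], leaf "REDIS_URL"
        *parents, leaf = name.split("__")
        target: Any = result
        for key in parents:
            target = target.get(key) if isinstance(target, dict) else None
            if not isinstance(target, dict):
                break  # path does not exist in the defaults
        else:
            if leaf in target:
                target[leaf] = raw_value
    return result


defaults = {
    "CACHE": {"REDIS_URL": "redis://localhost:6379/0"},
    "DATABASE": {"HOST": "localhost"},
}
overridden = apply_env_overrides(
    defaults,
    {"CACHE__REDIS_URL": "redis://redis:6379/0", "DATABASE__HOST": "musicbrainz"},
)
```

In a real deployment the second argument would be `os.environ`; passing a plain mapping keeps the sketch easy to test.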

Key Features

Multi-Source Aggregation

Combines data from multiple external sources into unified artist/album responses:

  • Core metadata: MusicBrainz database (direct SQL)
  • Images: Cover Art Archive, FanArt.tv, TheAudioDB
  • Biographies: Wikipedia (32-language fallback)
  • Cross-platform IDs: Spotify, TheAudioDB, MusicBrainz
  • Charts: Last.fm, Billboard, Apple Music, iTunes

Intelligent Caching

Three-tier caching strategy:

  1. Redis: Ephemeral cache (7-day TTL, 512MB limit, LFU eviction)
  2. PostgreSQL: Persistent cache with zlib compression
  3. Cloudflare CDN: Edge caching with programmatic invalidation
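
The read path through the first two tiers can be sketched as follows. This is a simplified illustration, not the project's cache code: plain dictionaries stand in for Redis and for the PostgreSQL cache table, the class name `TieredCache` is hypothetical, and TTLs, eviction, and the CDN tier are left out.

```python
import json
import zlib
from typing import Any, Callable, Dict


class TieredCache:
    """Two-tier read-through cache with zlib-compressed persistent entries."""

    def __init__(self) -> None:
        self.fast: Dict[str, Any] = {}          # stand-in for Redis
        self.persistent: Dict[str, bytes] = {}  # stand-in for a PostgreSQL table

    def get(self, key: str, loader: Callable[[str], Any]) -> Any:
        # Tier 1: ephemeral cache
        if key in self.fast:
            return self.fast[key]
        # Tier 2: persistent cache, stored as compressed JSON
        blob = self.persistent.get(key)
        if blob is not None:
            value = json.loads(zlib.decompress(blob).decode("utf-8"))
            self.fast[key] = value  # promote to the fast tier
            return value
        # Miss in both tiers: compute the value and fill both
        value = loader(key)
        self.persistent[key] = zlib.compress(json.dumps(value).encode("utf-8"))
        self.fast[key] = value
        return value

    def invalidate(self, key: str) -> None:
        self.fast.pop(key, None)
        self.persistent.pop(key, None)
```

The point of the layering is that losing the ephemeral tier (for example after an eviction) costs only a decompression step, not a new round of upstream requests.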

Change Detection

Monitors MusicBrainz replication stream to detect updated artists/albums and invalidate stale cache entries. SQL queries track changes across 5 different update sources per entity type.
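
A rough sketch of the idea: each update source yields (entity ID, change time) rows, the rows are merged into one set of changed IDs, and those IDs are dropped from the cache. The function names and the row shape are assumptions for this example; the real project derives the rows from SQL queries against the MusicBrainz database.

```python
from datetime import datetime, timezone
from typing import Dict, Iterable, Set, Tuple

Row = Tuple[str, datetime]  # (entity ID, time of last change)


def collect_updated_ids(sources: Iterable[Iterable[Row]],
                        since: datetime) -> Set[str]:
    """Union the IDs that changed after `since` across all update sources."""
    updated: Set[str] = set()
    for rows in sources:
        for entity_id, changed_at in rows:
            if changed_at > since:
                updated.add(entity_id)
    return updated


def invalidate(cache: Dict[str, object], ids: Set[str]) -> int:
    """Remove stale entries and return how many were actually cached."""
    removed = 0
    for entity_id in ids:
        if entity_id in cache:
            del cache[entity_id]
            removed += 1
    return removed


last_check = datetime(2026, 1, 1, tzinfo=timezone.utc)
```

Using a set for the merged IDs means an entity reported by several sources is invalidated only once.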

Background Crawling

Proactive cache warming for recently updated entities. Crawlers run on configurable schedules to pre-fetch expensive metadata (Wikipedia overviews, FanArt images) before user requests.
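
One way to picture a crawler pass is a bounded-concurrency loop over the recently updated IDs. The sketch below assumes a hypothetical `fetch` coroutine and a dictionary as the cache; scheduling, rate limiting, and the actual provider calls are outside its scope.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, Iterable


async def warm_cache(entity_ids: Iterable[str],
                     fetch: Callable[[str], Awaitable[Any]],
                     cache: Dict[str, Any],
                     concurrency: int = 4) -> int:
    """Fetch each entity and store it in `cache`; return the number warmed."""
    semaphore = asyncio.Semaphore(concurrency)

    async def warm_one(entity_id: str) -> bool:
        async with semaphore:  # cap the number of in-flight fetches
            try:
                cache[entity_id] = await fetch(entity_id)
            except Exception:
                # A failed fetch should not abort the whole crawl
                return False
            return True

    results = await asyncio.gather(*(warm_one(i) for i in entity_ids))
    return sum(results)
```

A caller would run it with `asyncio.run(warm_cache(ids, fetch, cache))` on whatever schedule the deployment configures.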

Provider Fallback Chain

Graceful degradation when external services are unavailable. Each metadata type has a primary provider and optional fallback providers with timeout handling.
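
A minimal sketch of such a chain, assuming each provider is an async callable that takes an ID and returns a value or `None`. The function name `first_available` and the choice of exceptions treated as "unavailable" are assumptions for illustration, not the project's API.

```python
import asyncio
from typing import Awaitable, Callable, Optional, Sequence

Provider = Callable[[str], Awaitable[Optional[str]]]


async def first_available(providers: Sequence[Provider],
                          mbid: str,
                          timeout: float = 1.0) -> Optional[str]:
    """Try providers in order; skip any that time out, fail, or return None."""
    for provider in providers:
        try:
            result = await asyncio.wait_for(provider(mbid), timeout=timeout)
        except (asyncio.TimeoutError, ConnectionError):
            continue  # degrade to the next provider in the chain
        if result is not None:
            return result
    return None  # every provider was unavailable or had no data
```

Ordering the sequence as primary first, fallbacks after, gives the behaviour described above: a slow or failing primary costs at most one timeout before the next source is consulted.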

Performance Characteristics

| Metric | Value | Notes |
| --- | --- | --- |
| Cache hit rate | ~85%+ | With crawler enabled |
| Cold request latency | 2-5s | Multiple external API calls |
| Cached request latency | 50-200ms | Redis/PostgreSQL lookup |
| CDN request latency | 10-50ms | Cloudflare edge cache |
| Database size | 100GB+ | MusicBrainz full dataset |
| Cache database size | 10-50GB | Compressed metadata cache |
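
To see how these figures combine, the expected latency of a request that reaches the API can be estimated as a weighted average of the cached and cold cases. The calculation below is only a back-of-the-envelope sketch: it assumes an 85% hit rate, uses the midpoints of the ranges above (125 ms cached, 3500 ms cold), and ignores requests answered at the CDN edge.

```python
def expected_latency_ms(hit_rate: float, cached_ms: float, cold_ms: float) -> float:
    """Weighted average latency for a given cache hit rate."""
    return hit_rate * cached_ms + (1.0 - hit_rate) * cold_ms


# 0.85 * 125 + 0.15 * 3500 = 106.25 + 525 = 631.25 ms (approximately)
estimate = expected_latency_ms(0.85, 125.0, 3500.0)
```

Under these assumptions the 15% of cold requests contribute about 525 ms of the roughly 631 ms average, which is why raising the hit rate through crawling matters more than shaving cached-path latency.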

API Response Format

All endpoints return JSON with consistent structure:

{
  "Id": "5b11f4ce-a62d-471e-81fc-a69a8278c7da",
  "ArtistName": "Nirvana",
  "Disambiguation": "90s US grunge band",
  "Overview": "Nirvana was an American rock band...",
  "Images": [
    {
      "Url": "https://imagecache.lidarr.audio/...",
      "CoverType": "poster",
      "Extension": ".jpg"
    }
  ],
  "Links": [
    {
      "Url": "https://www.spotify.com/artist/...",
      "Name": "spotify"
    }
  ],
  "Genres": ["Grunge", "Alternative Rock"],
  "Albums": [...]
}
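
A client could map such a response onto typed objects along these lines. The dataclass names and the subset of fields are choices made for this sketch, based only on the example above; the real API may return additional fields.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class Image:
    url: str
    cover_type: str
    extension: str


@dataclass
class Artist:
    id: str
    name: str
    genres: List[str]
    images: List[Image]


def parse_artist(payload: str) -> Artist:
    """Parse an artist JSON document, tolerating missing optional lists."""
    data = json.loads(payload)
    images = [
        Image(url=i["Url"], cover_type=i["CoverType"], extension=i["Extension"])
        for i in data.get("Images", [])
    ]
    return Artist(
        id=data["Id"],
        name=data["ArtistName"],
        genres=list(data.get("Genres", [])),
        images=images,
    )
```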

Security Posture

Current state: Development-focused with insecure defaults.

| Aspect | Status | Details |
| --- | --- | --- |
| API authentication | None | Read endpoints are public |
| Admin authentication | Single API key | /invalidate endpoint only |
| Database credentials | Hardcoded | abc/abc in multiple configs |
| RabbitMQ credentials | Hardcoded | abc/abc default |
| HTTPS | Not enforced | Relies on reverse proxy |
| Rate limiting | Optional | Disabled by default (NullRateLimiter) |

Production recommendation: Deploy behind authenticated reverse proxy (Cloudflare Access, OAuth2 Proxy, etc.).
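
For the single-key admin check, a constant-time comparison avoids leaking key contents through timing differences. The sketch below is an assumption about how such a check could look, not the project's code; in particular the header name `X-Api-Key` is hypothetical.

```python
import hmac
from typing import Mapping


def is_authorized(headers: Mapping[str, str],
                  expected_key: str,
                  header_name: str = "X-Api-Key") -> bool:
    """Return True only if the request carries the configured admin key."""
    if not expected_key:
        return False  # refuse everything when no key is configured
    supplied = headers.get(header_name, "")
    # compare_digest takes time independent of where the inputs differ
    return hmac.compare_digest(supplied.encode("utf-8"),
                               expected_key.encode("utf-8"))
```

Rejecting requests when the expected key is empty guards against an unset configuration value silently granting access.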

Monitoring and Observability

Error Tracking

Sentry integration with custom rate limiting to prevent alert fatigue:

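import sentry_sdk
from sentry_sdk.integrations.flask import FlaskIntegration

# Report errors to Sentry, tagged with the running package version
sentry_sdk.init(
    dsn=config.SENTRY_DSN,
    integrations=[FlaskIntegration()],
    release=f"lidarr-metadata@{__version__}",
)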

Redis-backed deduplication prevents duplicate error reports.
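
Deduplication of this kind can be hooked in through the SDK's `before_send` callback, which receives the event and a hint and drops the event when it returns `None`. The following sketch keeps the "last sent" timestamps in a local dictionary rather than Redis, and the factory name `make_before_send`, the 60-second window, and the choice of deduplication key are assumptions for illustration.

```python
import time
from typing import Any, Callable, Dict, Optional


def make_before_send(window_seconds: float = 60.0,
                     clock: Callable[[], float] = time.monotonic):
    """Build a before_send hook that drops repeats within `window_seconds`."""
    last_sent: Dict[str, float] = {}  # stand-in for a Redis key with a TTL

    def before_send(event: Dict[str, Any],
                    hint: Dict[str, Any]) -> Optional[Dict[str, Any]]:
        exc_info = hint.get("exc_info")
        # Deduplicate by exception class name, or by message if there is none
        key = exc_info[0].__name__ if exc_info else str(event.get("message"))
        now = clock()
        previous = last_sent.get(key)
        if previous is not None and now - previous < window_seconds:
            return None  # returning None tells the SDK to drop the event
        last_sent[key] = now
        return event

    return before_send
```

The hook would be passed as `sentry_sdk.init(..., before_send=make_before_send())`. A shared store such as Redis is needed for the window to hold across multiple processes, which the in-memory dictionary here does not provide.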

Metrics

StatsD/Telegraf integration for operational metrics:

  • Provider request counts
  • Response time histograms
  • Cache hit/miss rates
  • Rate limiter state

Logging

Python standard library logging with per-module handlers:

  • DEBUG: Detailed request/response logging
  • INFO: Request summaries, cache operations
  • WARNING: Provider timeouts, fallback usage
  • ERROR: Unhandled exceptions, data inconsistencies
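
A per-module setup with the standard library might look like the sketch below. The helper name `configure_module_logger` and the logger names are assumptions (they mirror the module layout, but the project may name its loggers differently).

```python
import logging


def configure_module_logger(name: str, level: int) -> logging.Logger:
    """Give one module its own level and stream handler."""
    logger = logging.getLogger(name)
    logger.setLevel(level)
    if not logger.handlers:  # avoid stacking handlers on repeated calls
        handler = logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    logger.propagate = False  # keep records from also reaching the root logger
    return logger


# Hypothetical logger names: quiet provider module, verbose cache module
provider_log = configure_module_logger("lidarrmetadata.provider", logging.WARNING)
cache_log = configure_module_logger("lidarrmetadata.cache", logging.DEBUG)
```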

Development Workflow

Local Development

# Install dependencies
poetry install

# Start infrastructure
docker-compose -f docker-compose.yml -f docker-compose.dev.yml up -d

# Run API server
LIDARR_METADATA_CONFIG=lidarrmetadata.config.DevelopmentConfig \
  python -m lidarrmetadata.server

# Run tests (currently disabled in CI)
pytest tests/

Testing

Test suite uses pytest with async support:

  • tests/test_config.py: Configuration system (152 lines, most comprehensive)
  • tests/test_provider.py: Provider mixin behavior
  • tests/test_cache.py: Cache layer functionality
  • tests/test_api.py: API endpoint responses
  • tests/test_util.py: Utility functions
  • tests/test_app.py: Application initialization

Note: Tests are commented out in Azure Pipelines CI configuration.

Project Maturity Assessment

| Aspect | Maturity | Evidence |
| --- | --- | --- |
| Production readiness | High | Running in production for Lidarr ecosystem |
| Code quality | Medium | SonarCloud integration, but tests disabled |
| Security | Low | Hardcoded credentials, no auth on read endpoints |
| Documentation | Medium | README comprehensive, inline docs sparse |
| Dependency freshness | Low | Python 3.9, aioredis 1.x (deprecated) |
| Test coverage | Unknown | Tests disabled in CI |
| Operational maturity | High | Sentry, metrics, multi-tier caching, CDN integration |

Relevance to Metadata Aggregator Project

This codebase is the closest real-world counterpart to the Metadata Aggregator project: a metadata aggregation service that already runs in production. Key learnings:

  1. Multi-source enrichment pattern: MusicBrainz as authoritative core + specialized providers for images/bios/charts
  2. Caching strategy: Three-tier approach with compression and invalidation is battle-tested
  3. Provider architecture: Mixin-based design allows flexible composition of data sources
  4. Change detection: Monitoring upstream data sources for cache invalidation is critical
  5. Background crawling: Proactive cache warming significantly improves user experience
  6. Direct database access: Querying MusicBrainz DB directly (vs API) enables complex aggregations
  7. SQL aggregation: Using row_to_json and json_agg to build nested JSON in database is highly efficient
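
The mixin-based provider design from item 3 can be sketched as follows. Each mixin declares one capability, a provider inherits the mixins it supports, and callers select providers by capability with `isinstance`. All class and function names here are hypothetical and chosen for this example; they are not taken from the project's `provider.py`.

```python
from typing import List, Optional, Sequence, Type, TypeVar

T = TypeVar("T")


class ArtistOverviewMixin:
    """Capability: can return a biography for an artist."""

    def get_artist_overview(self, mbid: str) -> Optional[str]:
        raise NotImplementedError


class ArtistImageMixin:
    """Capability: can return image URLs for an artist."""

    def get_artist_images(self, mbid: str) -> List[str]:
        raise NotImplementedError


class Provider:
    """Common base class for all data sources."""


class EncyclopediaProvider(Provider, ArtistOverviewMixin):
    def get_artist_overview(self, mbid: str) -> Optional[str]:
        return f"Overview for {mbid}"  # placeholder instead of a real lookup


class ArtworkProvider(Provider, ArtistImageMixin):
    def get_artist_images(self, mbid: str) -> List[str]:
        return [f"https://example.invalid/{mbid}.jpg"]  # placeholder URL


def providers_supporting(providers: Sequence[Provider],
                         capability: Type[T]) -> List[T]:
    """Select the providers that implement a given capability mixin."""
    return [p for p in providers if isinstance(p, capability)]
```

Adding a new data source then means writing one class that inherits the relevant mixins, without touching the code that asks for "anything that can supply an overview".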

File Structure Overview

lidarrmetadata/
├── __init__.py           # Version and package metadata
├── server.py             # API server entry point
├── crawler.py            # Background crawler entry point
├── app.py                # Quart application factory + routes
├── api.py                # Business logic layer
├── provider.py           # Provider mixins and implementations
├── cache.py              # Multi-tier cache implementation
├── config.py             # Configuration metaclass system
├── util.py               # Utility functions
├── sql/                  # MusicBrainz SQL queries
│   ├── artist.sql
│   ├── album.sql
│   ├── updated_artists.sql
│   └── updated_albums.sql
└── providers/            # Individual provider implementations
    ├── musicbrainz_db.py
    ├── solr_search.py
    ├── fanart.py
    ├── theaudiodb.py
    ├── wikipedia.py
    └── spotify.py

Dependencies Analysis

Production Dependencies (17 total)

Web framework:

  • quart==0.14.1 (async Flask alternative)
  • hypercorn (ASGI server, Quart dependency)

Database:

  • asyncpg==0.26.0 (PostgreSQL async driver)
  • aioredis==1.3.1 (Redis async client, deprecated)

External APIs:

  • spotipy==2.16.1 (Spotify)
  • pylast==4.3.0 (Last.fm)
  • billboard-py==7.0.0 (Billboard charts)
  • beautifulsoup4 (Wikipedia scraping)

Utilities:

  • python-dateutil (date parsing)
  • pytz (timezone handling)
  • requests (HTTP client for sync operations)
  • lxml (XML parsing)

Monitoring:

  • sentry-sdk==0.19.5 (error tracking)
  • statsd (metrics)

Server:

  • gunicorn (WSGI server)
  • uvicorn (ASGI worker)

Development Dependencies

  • pytest
  • pytest-asyncio
  • black (code formatting)
  • flake8 (linting)

Dependency Concerns

  1. Python 3.9: Reached end of life in October 2025; should upgrade to 3.11+
  2. aioredis 1.3.1: Deprecated, merged into redis-py 4.2+
  3. Quart 0.14.1: Current version is 0.19+, missing 5 years of updates
  4. asyncpg 0.26.0: Current version is 0.29+
  5. sentry-sdk 0.19.5: Current version is 2.0+, so the pinned release predates both the 1.0 and 2.0 major releases

Conclusion

LidarrAPI.Metadata is a production-grade metadata aggregation service with sophisticated caching, multi-source enrichment, and operational maturity. While it has technical debt (outdated dependencies, disabled tests, insecure defaults), its architecture and patterns provide an excellent reference for building a modern metadata aggregator.

The direct MusicBrainz database integration, provider fallback chain, and three-tier caching strategy are particularly valuable patterns to adopt.