Alexander a1f6701bac feat: initial implementation of metadata aggregator

- gRPC service with MusicBrainz provider
- PostgreSQL schema with migrations
- Service layer with database-first caching
- Repository pattern for data access
- YAML configuration support
- Research documentation for 17 music metadata projects

2026-04-28 16:28:53 +02:00

Lidarr Metadata API - Overview

Project Identity

Property	Value
Name	LidarrAPI.Metadata
Repository	https://github.com/Lidarr/LidarrAPI.Metadata
Version	10.0.0.0
License	GPL-3.0
Primary Language	Python 3.9
Purpose	Enriched metadata aggregation API for Lidarr music manager

Core Purpose

LidarrAPI.Metadata serves as a metadata enrichment layer for the Lidarr music management application. It aggregates data from multiple authoritative sources (MusicBrainz, FanArt.tv, TheAudioDB, Wikipedia, Spotify, Last.fm, Billboard, Apple Music) to provide comprehensive artist and album metadata including:

Artist biographical information
Album release details
High-quality cover art and artist images
Genre classifications
Music charts and trending data
Cross-platform ID mappings (MusicBrainz, Spotify, TheAudioDB)

The API acts as an intelligent caching proxy that transforms raw MusicBrainz database records into enriched JSON responses suitable for consumption by Lidarr clients.

Technology Stack

Core Framework

Component	Version	Purpose
Python	3.9	Runtime environment
Quart	0.14.1	Async web framework (Flask-compatible)
Gunicorn	Latest	Process manager (WSGI HTTP server running Uvicorn workers)
Uvicorn	Latest	ASGI server (worker class)

Data Layer

Component	Version	Purpose
asyncpg	0.26.0	PostgreSQL async driver
aioredis	1.3.1	Redis async client
PostgreSQL	12+	MusicBrainz database + cache storage
Redis	6+	Ephemeral cache + rate limiting
Solr	8.x	Full-text search engine

External Integrations

Library	Version	Purpose
spotipy	2.16.1	Spotify API client
pylast	4.3.0	Last.fm API client
billboard-py	7.0.0	Billboard chart scraper
beautifulsoup4	Latest	HTML parsing (Wikipedia)
sentry-sdk	0.19.5	Error tracking

Application Entry Points

The project provides two executable entry points:

1. API Server

lidarr-metadata-server

Implementation: lidarrmetadata/server.py

Starts the Quart web application, which serves the metadata API on port 5001. A path prefix can be configured via the APPLICATION_ROOT environment variable.

Production command:

gunicorn -w 1 -k uvicorn.workers.UvicornWorker \
  --bind 0.0.0.0:5001 \
  --access-logfile - \
  lidarrmetadata.server:app

2. Background Crawler

lidarr-metadata-crawler

Implementation: lidarrmetadata/crawler.py

Runs background cache warming tasks to proactively fetch and cache metadata for recently updated artists and albums. Operates independently of the API server.

Crawler types:

Wikipedia overview crawler
FanArt.tv image crawler
TheAudioDB metadata crawler
Artist metadata crawler
Album metadata crawler

Network Configuration

Setting	Default	Configurable Via
Port	5001	Docker/Gunicorn bind
Path Prefix	`/`	`APPLICATION_ROOT` env var
Workers	1	Gunicorn `-w` flag
Worker Class	uvicorn	Gunicorn `-k` flag

Lidarr Music Manager

The primary consumer of this API. Lidarr is an automated music collection manager for Usenet and BitTorrent users. It monitors multiple RSS feeds for new albums from favorite artists and grabs, sorts, and renames them.

Integration: Lidarr queries this API to enrich its local music library database with metadata, images, and biographical information.

MusicBrainz Database

The authoritative source for music metadata. MusicBrainz is an open music encyclopedia that collects music metadata and makes it available to the public.

Integration: Direct PostgreSQL connection to a replicated MusicBrainz database instance. The API does NOT use the MusicBrainz web API; it queries the database directly for performance.

Database size: 100GB+ for the full MusicBrainz dataset, kept current with hourly replication.

Cover Art Archive

A joint project between the Internet Archive and MusicBrainz providing cover art images for releases in the MusicBrainz database.

Integration: Images are proxied through the imagecache.lidarr.audio CDN for performance and bandwidth optimization.

Deployment Architecture

The application is designed for containerized deployment with Docker Compose. A typical production deployment includes:

Container	Purpose	Resource Requirements
musicbrainz	PostgreSQL with MusicBrainz schema	100GB+ storage, 4GB+ RAM
solr	Search index (artist/album)	8GB+ storage, 2GB+ RAM
redis	Cache + rate limiting	512MB RAM limit
rabbitmq	Search index updates	1GB RAM
indexer	Solr index updater (SIR)	512MB RAM
api-v0.3	Stable API version	1GB+ RAM
api-testing	Development API version	1GB+ RAM
crawler	Background cache warmer	512MB RAM

Version Strategy

The project uses semantic versioning together with a dual-deployment strategy in which two API versions are live at once:

v0.3: Stable production version
testing: Development/staging version

Both versions run simultaneously in production, allowing gradual rollout and A/B testing of new features.

Configuration Management

Configuration is managed through a metaclass-based system with environment variable overrides:

# Select configuration class
LIDARR_METADATA_CONFIG=lidarrmetadata.config.ProductionConfig

# Override specific settings (double underscore for nesting)
CACHE__REDIS_URL=redis://redis:6379/0
DATABASE__HOST=musicbrainz
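
The double-underscore nesting convention can be illustrated with a minimal sketch. It is not the project's metaclass implementation: the function name `apply_env_overrides` and the default values are hypothetical, and overrides are kept as plain strings rather than being type-converted.

```python
import copy
from typing import Any, Dict, Mapping


def apply_env_overrides(defaults: Dict[str, Any], environ: Mapping[str, str]) -> Dict[str, Any]:
    """Return a copy of `defaults` with SECTION__KEY style overrides applied."""
    merged = copy.deepcopy(defaults)
    for name, raw_value in environ.items():
        # "DATABASE__HOST" -> path ["DATABASE"], leaf "HOST"
        *path, leaf = name.split("__")
        top_level = path[0] if path else leaf
        if top_level not in merged:
            continue  # unrelated environment variable, e.g. PATH
        target = merged
        for part in path:
            child = target.get(part)
            if not isinstance(child, dict):
                child = target[part] = {}
            target = child
        target[leaf] = raw_value  # values stay strings in this sketch
    return merged


# Hypothetical defaults and environment, for illustration only.
defaults = {
    "CACHE": {"REDIS_URL": "redis://localhost:6379/0"},
    "DATABASE": {"HOST": "localhost", "PORT": "5432"},
}
environ = {
    "CACHE__REDIS_URL": "redis://redis:6379/0",
    "DATABASE__HOST": "musicbrainz",
    "PATH": "/usr/bin",
}
merged = apply_env_overrides(defaults, environ)
```

Only the nested keys named in the environment change; `DATABASE.PORT` keeps its default and the original `defaults` mapping is left untouched.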

Key Features

Multi-Source Aggregation

Combines data from 15+ external sources into unified artist/album responses:

Core metadata: MusicBrainz database (direct SQL)
Images: Cover Art Archive, FanArt.tv, TheAudioDB
Biographies: Wikipedia (fallback across 32 languages)
Cross-platform IDs: Spotify, TheAudioDB, MusicBrainz
Charts: Last.fm, Billboard, Apple Music, iTunes

Intelligent Caching

Three-tier caching strategy:

Redis: Ephemeral cache (7-day TTL, 512MB limit, LFU eviction)
PostgreSQL: Persistent cache with zlib compression
Cloudflare CDN: Edge caching with programmatic invalidation
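
A read-through lookup across the first two tiers might look like the sketch below. It is a simplified stand-in, not the project's `cache.py`: plain dictionaries replace Redis and PostgreSQL, the class name `TieredCache` is hypothetical, and the LFU eviction, memory limit, and CDN tier are not modeled.

```python
import json
import time
import zlib
from typing import Any, Callable, Dict, Tuple


class TieredCache:
    """Ephemeral tier with a TTL in front of a zlib-compressed persistent tier."""

    def __init__(self, ttl_seconds: float = 7 * 24 * 3600) -> None:
        self.ttl_seconds = ttl_seconds
        self.ephemeral: Dict[str, Tuple[float, Any]] = {}  # stands in for Redis
        self.persistent: Dict[str, bytes] = {}              # stands in for PostgreSQL

    def get(self, key: str, fetch: Callable[[], Any]) -> Any:
        now = time.monotonic()
        hit = self.ephemeral.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]  # tier 1 hit
        blob = self.persistent.get(key)
        if blob is not None:
            value = json.loads(zlib.decompress(blob))  # tier 2 hit
        else:
            value = fetch()  # cold path: call the upstream providers
            self.persistent[key] = zlib.compress(json.dumps(value).encode("utf-8"))
        self.ephemeral[key] = (now + self.ttl_seconds, value)
        return value

    def invalidate(self, key: str) -> None:
        """Drop a key from both tiers, e.g. after an upstream change."""
        self.ephemeral.pop(key, None)
        self.persistent.pop(key, None)
```

Storing the persistent copy as compressed JSON keeps the second tier small at the cost of a decompress step on each tier-2 hit.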

Change Detection

Monitors the MusicBrainz replication stream to detect updated artists and albums and to invalidate stale cache entries. SQL queries track changes across five update sources per entity type.

Background Crawling

Proactive cache warming for recently updated entities. Crawlers run on configurable schedules to pre-fetch expensive metadata (Wikipedia overviews, FanArt images) before user requests.

Provider Fallback Chain

Graceful degradation when external services are unavailable. Each metadata type has a primary provider and optional fallback providers with timeout handling.
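
One way to express such a fallback chain with timeouts is sketched below. The helper `first_available` and the three example providers are hypothetical and do not mirror the project's mixin classes; the sketch only shows the pattern of trying providers in order and skipping any that time out or fail.

```python
import asyncio
from typing import Awaitable, Callable, Optional, Sequence

Provider = Callable[[str], Awaitable[Optional[str]]]


async def first_available(providers: Sequence[Provider], key: str,
                          timeout: float = 1.0) -> Optional[str]:
    """Return the first non-empty result, trying providers in priority order."""
    for provider in providers:
        try:
            result = await asyncio.wait_for(provider(key), timeout=timeout)
        except (asyncio.TimeoutError, ConnectionError):
            continue  # provider too slow or unreachable: fall through to the next
        if result is not None:
            return result
    return None  # every provider failed; the caller decides how to degrade


# Hypothetical providers used only to exercise the chain.
async def slow_provider(key: str) -> Optional[str]:
    await asyncio.sleep(1.0)
    return "too late"


async def failing_provider(key: str) -> Optional[str]:
    raise ConnectionError("service unavailable")


async def working_provider(key: str) -> Optional[str]:
    return f"overview for {key}"
```

Calling `asyncio.run(first_available([slow_provider, failing_provider, working_provider], "nirvana", timeout=0.05))` skips the first two providers and returns the third provider's result.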

Performance Characteristics

Metric	Value	Notes
Cache hit rate	~85%+	With crawler enabled
Cold request latency	2-5s	Multiple external API calls
Cached request latency	50-200ms	Redis/PostgreSQL lookup
CDN request latency	10-50ms	Cloudflare edge cache
Database size	100GB+	MusicBrainz full dataset
Cache database size	10-50GB	Compressed metadata cache

API Response Format

All endpoints return JSON with consistent structure:

{
  "Id": "5b11f4ce-a62d-471e-81fc-a69a8278c7da",
  "ArtistName": "Nirvana",
  "Disambiguation": "90s US grunge band",
  "Overview": "Nirvana was an American rock band...",
  "Images": [
    {
      "Url": "https://imagecache.lidarr.audio/...",
      "CoverType": "poster",
      "Extension": ".jpg"
    }
  ],
  "Links": [
    {
      "Url": "https://www.spotify.com/artist/...",
      "Name": "spotify"
    }
  ],
  "Genres": ["Grunge", "Alternative Rock"],
  "Albums": [...]
}
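
A client consuming this structure could pick out an image by cover type roughly as follows. This is a sketch based only on the fields shown in the sample; the helper name `image_url` is hypothetical and the payload is abbreviated.

```python
import json
from typing import Any, Dict, Optional

# Abbreviated sample payload using the field names from the example response.
SAMPLE = """
{
  "Id": "5b11f4ce-a62d-471e-81fc-a69a8278c7da",
  "ArtistName": "Nirvana",
  "Images": [
    {"Url": "https://imagecache.lidarr.audio/...", "CoverType": "poster", "Extension": ".jpg"}
  ],
  "Genres": ["Grunge", "Alternative Rock"],
  "Albums": []
}
"""


def image_url(artist: Dict[str, Any], cover_type: str) -> Optional[str]:
    """Return the URL of the first image with the given CoverType, if any."""
    for image in artist.get("Images", []):
        if image.get("CoverType") == cover_type:
            return image.get("Url")
    return None


artist = json.loads(SAMPLE)
poster = image_url(artist, "poster")
banner = image_url(artist, "banner")  # absent in the sample, so None
```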

Security Posture

Current state: Development-focused with insecure defaults.

Aspect	Status	Details
API authentication	None	Read endpoints are public
Admin authentication	Single API key	`/invalidate` endpoint only
Database credentials	Hardcoded	`abc/abc` in multiple configs
RabbitMQ credentials	Hardcoded	`abc/abc` default
HTTPS	Not enforced	Relies on reverse proxy
Rate limiting	Optional	Disabled by default (NullRateLimiter)

Production recommendation: Deploy behind an authenticated reverse proxy (Cloudflare Access, OAuth2 Proxy, etc.).
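
The "disabled by default" rate limiting row follows a null-object pattern: a limiter that always says yes can be swapped for a real one without changing callers. The sketch below illustrates the idea; only the name `NullRateLimiter` comes from the table, while the `allowed` method and the `FixedWindowRateLimiter` class are assumptions made for illustration.

```python
import time
from typing import Callable, Dict, Tuple


class NullRateLimiter:
    """Never limits; a drop-in placeholder when rate limiting is disabled."""

    def allowed(self, key: str) -> bool:
        return True


class FixedWindowRateLimiter:
    """Allows at most `limit` calls per key in each fixed time window."""

    def __init__(self, limit: int, window_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.limit = limit
        self.window_seconds = window_seconds
        self._clock = clock
        self._counters: Dict[str, Tuple[int, int]] = {}  # key -> (window index, count)

    def allowed(self, key: str) -> bool:
        window = int(self._clock() // self.window_seconds)
        seen_window, count = self._counters.get(key, (window, 0))
        if seen_window != window:
            count = 0  # a new window has started: reset the counter
        if count >= self.limit:
            self._counters[key] = (window, count)
            return False
        self._counters[key] = (window, count + 1)
        return True
```

An in-process counter like this only protects a single worker; a shared store such as Redis would be needed to enforce limits across processes.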

Monitoring and Observability

Error Tracking

Sentry integration with custom rate limiting to prevent alert fatigue:

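import sentry_sdk
from sentry_sdk.integrations.flask import FlaskIntegration

# `config` and `__version__` are provided by the application package.
sentry_sdk.init(
    dsn=config.SENTRY_DSN,
    integrations=[FlaskIntegration()],
    release=f"lidarr-metadata@{__version__}"
)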

Redis-backed deduplication prevents duplicate error reports.
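
The deduplication idea can be sketched as a callable with the `(event, hint)` signature that `sentry_sdk.init` accepts for its `before_send` option. This is an illustration under assumptions, not the project's code: an in-memory dictionary replaces Redis, and the class name `ErrorDeduplicator` and the five-minute window are hypothetical.

```python
import time
from typing import Any, Callable, Dict, Optional


class ErrorDeduplicator:
    """Drops events whose exception type and message were reported recently."""

    def __init__(self, window_seconds: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.window_seconds = window_seconds
        self._clock = clock
        self._last_sent: Dict[str, float] = {}  # stands in for Redis keys with a TTL

    def __call__(self, event: Dict[str, Any], hint: Dict[str, Any]) -> Optional[Dict[str, Any]]:
        values = event.get("exception", {}).get("values") or [{}]
        last_exception = values[-1]
        key = f"{last_exception.get('type')}:{last_exception.get('value')}"
        now = self._clock()
        previous = self._last_sent.get(key)
        if previous is not None and now - previous < self.window_seconds:
            return None  # returning None tells the SDK to discard the event
        self._last_sent[key] = now
        return event
```

An instance would be passed as `before_send=ErrorDeduplicator()`; with Redis, the dictionary lookup would become a key check with an expiry so that all workers share the same view.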

Metrics

StatsD/Telegraf integration for operational metrics:

Provider request counts
Response time histograms
Cache hit/miss rates
Rate limiter state

Logging

Python standard library logging with per-module handlers:

DEBUG: Detailed request/response logging
INFO: Request summaries, cache operations
WARNING: Provider timeouts, fallback usage
ERROR: Unhandled exceptions, data inconsistencies
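
Per-module handlers with the standard library can be set up along these lines. The helper `configure_module_logger`, the format string, and the levels chosen per module are assumptions for illustration; only the module names come from the package layout.

```python
import logging


def configure_module_logger(name: str, level: int) -> logging.Logger:
    """Attach a dedicated stream handler to one module's logger."""
    logger = logging.getLogger(name)
    logger.setLevel(level)
    if not logger.handlers:  # avoid stacking handlers on repeated calls
        handler = logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    logger.propagate = False  # keep records out of the root logger's handlers
    return logger


# Hypothetical level choices: quiet providers, chattier cache layer.
provider_log = configure_module_logger("lidarrmetadata.provider", logging.WARNING)
cache_log = configure_module_logger("lidarrmetadata.cache", logging.INFO)

provider_log.warning("provider timed out, using fallback")
cache_log.info("cache miss for artist")
```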

Development Workflow

Local Development

# Install dependencies
poetry install

# Start infrastructure
docker-compose -f docker-compose.yml -f docker-compose.dev.yml up -d

# Run API server
LIDARR_METADATA_CONFIG=lidarrmetadata.config.DevelopmentConfig \
  python -m lidarrmetadata.server

# Run tests (currently disabled in CI)
pytest tests/

Testing

The test suite uses pytest with async support (pytest-asyncio):

tests/test_config.py: Configuration system (152 lines, most comprehensive)
tests/test_provider.py: Provider mixin behavior
tests/test_cache.py: Cache layer functionality
tests/test_api.py: API endpoint responses
tests/test_util.py: Utility functions
tests/test_app.py: Application initialization

Note: Tests are commented out in Azure Pipelines CI configuration.

Project Maturity Assessment

Aspect	Maturity	Evidence
Production readiness	High	Running in production for Lidarr ecosystem
Code quality	Medium	SonarCloud integration, but tests disabled
Security	Low	Hardcoded credentials, no auth on read endpoints
Documentation	Medium	README comprehensive, inline docs sparse
Dependency freshness	Low	Python 3.9, aioredis 1.x (deprecated)
Test coverage	Unknown	Tests disabled in CI
Operational maturity	High	Sentry, metrics, multi-tier caching, CDN integration

Relevance to Metadata Aggregator Project

This codebase is the closest real-world analogue to the planned metadata aggregator: a metadata aggregation service that already runs in production. Key learnings:

Multi-source enrichment pattern: MusicBrainz as authoritative core + specialized providers for images/bios/charts
Caching strategy: Three-tier approach with compression and invalidation is battle-tested
Provider architecture: Mixin-based design allows flexible composition of data sources
Change detection: Monitoring upstream data sources for cache invalidation is critical
Background crawling: Proactive cache warming significantly improves user experience
Direct database access: Querying MusicBrainz DB directly (vs API) enables complex aggregations
SQL aggregation: Using row_to_json and json_agg to build nested JSON in database is highly efficient
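
The project's actual queries are not reproduced here. As a runnable stand-in for the last point, the sketch below uses SQLite's `json_object` and `json_group_array`, which play roughly the roles of PostgreSQL's `row_to_json` and `json_agg`. The two-table schema, the `Title` key, and the album IDs are simplified placeholders rather than the MusicBrainz schema, and the sketch assumes an SQLite build with JSON functions available.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE artist (id INTEGER PRIMARY KEY, gid TEXT, name TEXT);
    CREATE TABLE album (id INTEGER PRIMARY KEY, artist_id INTEGER, gid TEXT, title TEXT);
    INSERT INTO artist VALUES (1, '5b11f4ce-a62d-471e-81fc-a69a8278c7da', 'Nirvana');
    INSERT INTO album VALUES (1, 1, 'placeholder-album-1', 'Nevermind');
    INSERT INTO album VALUES (2, 1, 'placeholder-album-2', 'In Utero');
""")

# Build the nested document inside the database: one row per artist,
# with the albums aggregated into a JSON array.
row = conn.execute(
    """
    SELECT json_object(
        'Id', artist.gid,
        'ArtistName', artist.name,
        'Albums', json_group_array(
            json_object('Id', album.gid, 'Title', album.title)
        )
    )
    FROM artist
    JOIN album ON album.artist_id = artist.id
    WHERE artist.gid = ?
    GROUP BY artist.id
    """,
    ("5b11f4ce-a62d-471e-81fc-a69a8278c7da",),
).fetchone()

artist_document = json.loads(row[0])
```

The application receives a single JSON string per artist and never has to stitch album rows together in Python, which is the efficiency argument made above.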

File Structure Overview

lidarrmetadata/
├── __init__.py           # Version and package metadata
├── server.py             # API server entry point
├── crawler.py            # Background crawler entry point
├── app.py                # Quart application factory + routes
├── api.py                # Business logic layer
├── provider.py           # Provider mixins and implementations
├── cache.py              # Multi-tier cache implementation
├── config.py             # Configuration metaclass system
├── util.py               # Utility functions
├── sql/                  # MusicBrainz SQL queries
│   ├── artist.sql
│   ├── album.sql
│   ├── updated_artists.sql
│   └── updated_albums.sql
└── providers/            # Individual provider implementations
    ├── musicbrainz_db.py
    ├── solr_search.py
    ├── fanart.py
    ├── theaudiodb.py
    ├── wikipedia.py
    └── spotify.py

Dependencies Analysis

Production Dependencies (17 total)

Web framework:

quart==0.14.1 (async Flask alternative)
hypercorn (ASGI server, Quart dependency)

Database:

asyncpg==0.26.0 (PostgreSQL async driver)
aioredis==1.3.1 (Redis async client, deprecated)

External APIs:

spotipy==2.16.1 (Spotify)
pylast==4.3.0 (Last.fm)
billboard-py==7.0.0 (Billboard charts)
beautifulsoup4 (Wikipedia scraping)

Utilities:

python-dateutil (date parsing)
pytz (timezone handling)
requests (HTTP client for sync operations)
lxml (XML parsing)

Monitoring:

sentry-sdk==0.19.5 (error tracking)
statsd (metrics)

Server:

gunicorn (WSGI server, used here as the process manager for uvicorn workers)
uvicorn (ASGI worker)

Development Dependencies

pytest
pytest-asyncio
black (code formatting)
flake8 (linting)

Dependency Concerns

Python 3.9: Reached end of life in October 2025; should upgrade to 3.11+
aioredis 1.3.1: Deprecated, merged into redis-py 4.2+
Quart 0.14.1: Current version is 0.19+, missing 5 years of updates
asyncpg 0.26.0: Current version is 0.29+
sentry-sdk 0.19.5: Current version is 2.0+, two major versions behind

Conclusion

LidarrAPI.Metadata is a production-grade metadata aggregation service with sophisticated caching, multi-source enrichment, and operational maturity. While it has technical debt (outdated dependencies, disabled tests, insecure defaults), its architecture and patterns provide an excellent reference for building a modern metadata aggregator.

The direct MusicBrainz database integration, provider fallback chain, and three-tier caching strategy are particularly valuable patterns to adopt.
