feat: initial implementation of metadata aggregator

- gRPC service with MusicBrainz provider
- PostgreSQL schema with migrations
- Service layer with database-first caching
- Repository pattern for data access
- YAML configuration support
- Research documentation for 17 music metadata projects
2026-04-28 16:27:14 +02:00
commit a1f6701bac
163 changed files with 95884 additions and 0 deletions
@@ -0,0 +1,419 @@
+# Lidarr Metadata API - Overview
+
+## Project Identity
+
+| Property | Value |
+|----------|-------|
+| **Name** | LidarrAPI.Metadata |
+| **Repository** | https://github.com/Lidarr/LidarrAPI.Metadata |
+| **Version** | 10.0.0.0 |
+| **License** | GPL-3.0 |
+| **Primary Language** | Python 3.9 |
+| **Purpose** | Enriched metadata aggregation API for Lidarr music manager |
+
+## Core Purpose
+
+LidarrAPI.Metadata serves as a metadata enrichment layer for the Lidarr music management application. It aggregates data from multiple authoritative sources (MusicBrainz, FanArt.tv, TheAudioDB, Wikipedia, Spotify, Last.fm, Billboard, Apple Music) to provide comprehensive artist and album metadata including:
+
+- Artist biographical information
+- Album release details
+- High-quality cover art and artist images
+- Genre classifications
+- Music charts and trending data
+- Cross-platform ID mappings (MusicBrainz, Spotify, TheAudioDB)
+
+The API acts as an intelligent caching proxy that transforms raw MusicBrainz database records into enriched JSON responses suitable for consumption by Lidarr clients.
+
+## Technology Stack
+
+### Core Framework
+
+| Component | Version | Purpose |
+|-----------|---------|---------|
+| **Python** | 3.9 | Runtime environment |
+| **Quart** | 0.14.1 | Async web framework (Flask-compatible) |
+| **Gunicorn** | Latest | HTTP server / process manager (runs Uvicorn workers) |
+| **Uvicorn** | Latest | ASGI server (worker class) |
+
+### Data Layer
+
+| Component | Version | Purpose |
+|-----------|---------|---------|
+| **asyncpg** | 0.26.0 | PostgreSQL async driver |
+| **aioredis** | 1.3.1 | Redis async client |
+| **PostgreSQL** | 12+ | MusicBrainz database + cache storage |
+| **Redis** | 6+ | Ephemeral cache + rate limiting |
+| **Solr** | 8.x | Full-text search engine |
+
+### External Integrations
+
+| Library | Version | Purpose |
+|---------|---------|---------|
+| **spotipy** | 2.16.1 | Spotify API client |
+| **pylast** | 4.3.0 | Last.fm API client |
+| **billboard-py** | 7.0.0 | Billboard chart scraper |
+| **beautifulsoup4** | Latest | HTML parsing (Wikipedia) |
+| **sentry-sdk** | 0.19.5 | Error tracking |
+
+## Application Entry Points
+
+The project provides two executable entry points:
+
+### 1. API Server
+
+```bash
+lidarr-metadata-server
+```
+
+**Implementation**: `lidarrmetadata/server.py`
+
+Starts the Quart web application, which serves the metadata API on port 5001. A path prefix can be configured via the `APPLICATION_ROOT` environment variable.
+
+**Production command**:
+```bash
+gunicorn -w 1 -k uvicorn.workers.UvicornWorker \
+  --bind 0.0.0.0:5001 \
+  --access-logfile - \
+  lidarrmetadata.server:app
+```
+
+### 2. Background Crawler
+
+```bash
+lidarr-metadata-crawler
+```
+
+**Implementation**: `lidarrmetadata/crawler.py`
+
+Runs background cache warming tasks to proactively fetch and cache metadata for recently updated artists and albums. Operates independently of the API server.
+
+**Crawler types**:
+- Wikipedia overview crawler
+- FanArt.tv image crawler
+- TheAudioDB metadata crawler
+- Artist metadata crawler
+- Album metadata crawler
+
+## Network Configuration
+
+| Setting | Default | Configurable Via |
+|---------|---------|------------------|
+| **Port** | 5001 | Docker/Gunicorn bind |
+| **Path Prefix** | `/` | `APPLICATION_ROOT` env var |
+| **Workers** | 1 | Gunicorn `-w` flag |
+| **Worker Class** | `uvicorn.workers.UvicornWorker` | Gunicorn `-k` flag |
+
+## Related Ecosystem Components
+
+### Lidarr Music Manager
+
+The primary consumer of this API. Lidarr is an automated music collection manager for Usenet and BitTorrent users. It monitors multiple RSS feeds for new albums from favorite artists and grabs, sorts, and renames them.
+
+**Integration**: Lidarr queries this API to enrich its local music library database with metadata, images, and biographical information.
+
+### MusicBrainz Database
+
+The authoritative source for music metadata. MusicBrainz is an open music encyclopedia that collects music metadata and makes it available to the public.
+
+**Integration**: Direct PostgreSQL connection to a replicated MusicBrainz database instance. The API does NOT use the MusicBrainz web API; it queries the database directly for performance.
+
+**Database size**: 100GB+ for the full MusicBrainz dataset, kept current through hourly replication.
+
+### Cover Art Archive
+
+A joint project between the Internet Archive and MusicBrainz providing cover art images for releases in the MusicBrainz database.
+
+**Integration**: Images are proxied through the `imagecache.lidarr.audio` CDN to reduce latency and bandwidth use.
+
+## Deployment Architecture
+
+The application is designed for containerized deployment with Docker Compose. A typical production deployment includes:
+
+| Container | Purpose | Resource Requirements |
+|-----------|---------|----------------------|
+| **musicbrainz** | PostgreSQL with MusicBrainz schema | 100GB+ storage, 4GB+ RAM |
+| **solr** | Search index (artist/album) | 8GB+ storage, 2GB+ RAM |
+| **redis** | Cache + rate limiting | 512MB RAM limit |
+| **rabbitmq** | Search index updates | 1GB RAM |
+| **indexer** | Solr index updater (SIR) | 512MB RAM |
+| **api-v0.3** | Stable API version | 1GB+ RAM |
+| **api-testing** | Development API version | 1GB+ RAM |
+| **crawler** | Background cache warmer | 512MB RAM |
+
+## Version Strategy
+
+The project pairs versioned releases with a dual-deployment strategy:
+
+- **v0.3**: Stable production version
+- **testing**: Development/staging version
+
+Both versions run simultaneously in production, allowing gradual rollout and A/B testing of new features.
+
+## Configuration Management
+
+Configuration is managed through a metaclass-based system with environment variable overrides:
+
+```bash
+# Select configuration class
+LIDARR_METADATA_CONFIG=lidarrmetadata.config.ProductionConfig
+
+# Override specific settings (double underscore for nesting)
+CACHE__REDIS_URL=redis://redis:6379/0
+DATABASE__HOST=musicbrainz
+```
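A minimal sketch of how double-underscore overrides like those above could be resolved, assuming settings live in a plain nested dict. The function `apply_env_overrides` and the `DEFAULTS` layout are hypothetical illustrations, not the project's metaclass implementation.

```python
from typing import Any, Dict, Mapping

# Hypothetical defaults: one dict per configuration section.
DEFAULTS: Dict[str, Dict[str, Any]] = {
    "CACHE": {"REDIS_URL": "redis://localhost:6379/0"},
    "DATABASE": {"HOST": "localhost"},
}


def apply_env_overrides(
    defaults: Mapping[str, Mapping[str, Any]],
    environ: Mapping[str, str],
) -> Dict[str, Dict[str, Any]]:
    """Return a copy of `defaults` with SECTION__KEY variables applied."""
    merged = {section: dict(values) for section, values in defaults.items()}
    for name, raw in environ.items():
        if "__" not in name:
            continue  # not a nested override
        section, key = name.split("__", 1)
        # Only known settings are overridden; unknown names are ignored.
        if section in merged and key in merged[section]:
            merged[section][key] = raw  # values stay strings in this sketch
    return merged


# Pass os.environ in real use; an explicit mapping keeps the example deterministic.
settings = apply_env_overrides(DEFAULTS, {"DATABASE__HOST": "musicbrainz"})
```

Ignoring unknown names is a deliberate choice here: unrelated environment variables that happen to contain `__` cannot create new settings.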
+
+## Key Features
+
+### Multi-Source Aggregation
+
+Combines data from multiple external sources (ten are named below) into unified artist/album responses:
+
+- **Core metadata**: MusicBrainz database (direct SQL)
+- **Images**: Cover Art Archive, FanArt.tv, TheAudioDB
+- **Biographies**: Wikipedia (32-language fallback)
+- **Cross-platform IDs**: Spotify, TheAudioDB, MusicBrainz
+- **Charts**: Last.fm, Billboard, Apple Music, iTunes
+
+### Intelligent Caching
+
+Three-tier caching strategy:
+
+1. **Redis**: Ephemeral cache (7-day TTL, 512MB limit, LFU eviction)
+2. **PostgreSQL**: Persistent cache with zlib compression
+3. **Cloudflare CDN**: Edge caching with programmatic invalidation
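The read path through the first two tiers could look roughly like the sketch below. It is an assumption-laden illustration: the `TieredCache` class is hypothetical, plain dicts stand in for Redis and PostgreSQL, and the CDN tier is left out because it sits in front of the application.

```python
import json
import time
import zlib
from typing import Any, Callable, Dict, Tuple


class TieredCache:
    """In-process stand-ins for the Redis and PostgreSQL tiers (hypothetical)."""

    def __init__(self, ttl_seconds: float = 7 * 24 * 3600) -> None:
        self.ttl_seconds = ttl_seconds
        self.ephemeral: Dict[str, Tuple[float, Any]] = {}  # stands in for Redis
        self.persistent: Dict[str, bytes] = {}  # stands in for PostgreSQL

    def get(self, key: str, fetch: Callable[[], Any]) -> Any:
        now = time.monotonic()
        hit = self.ephemeral.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]  # tier 1 hit, still within its TTL
        blob = self.persistent.get(key)
        if blob is not None:
            value = json.loads(zlib.decompress(blob))  # tier 2 hit
        else:
            value = fetch()  # cold path: ask the providers
            payload = json.dumps(value).encode("utf-8")
            self.persistent[key] = zlib.compress(payload)
        self.ephemeral[key] = (now + self.ttl_seconds, value)
        return value

    def invalidate(self, key: str) -> None:
        """Drop a key from both tiers, e.g. after an upstream change."""
        self.ephemeral.pop(key, None)
        self.persistent.pop(key, None)
```

A value found only in the persistent tier is copied back into the ephemeral tier, so repeated requests for the same entity stop at the first lookup.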
+
+### Change Detection
+
+Monitors MusicBrainz replication stream to detect updated artists/albums and invalidate stale cache entries. SQL queries track changes across 5 different update sources per entity type.
+
+### Background Crawling
+
+Proactive cache warming for recently updated entities. Crawlers run on configurable schedules to pre-fetch expensive metadata (Wikipedia overviews, FanArt images) before user requests.
+
+### Provider Fallback Chain
+
+Graceful degradation when external services are unavailable. Each metadata type has a primary provider and optional fallback providers with timeout handling.
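One way to express such a fallback chain with per-provider timeouts is sketched below. The names `first_available`, `slow_primary`, and `fast_fallback` are hypothetical, and the real providers may handle errors differently.

```python
import asyncio
from typing import Awaitable, Callable, Optional, Sequence

# A provider is any coroutine function that takes an ID and may return None.
Provider = Callable[[str], Awaitable[Optional[str]]]


async def first_available(
    providers: Sequence[Provider], artist_id: str, timeout: float = 2.0
) -> Optional[str]:
    """Try providers in order; skip any that time out, fail, or return None."""
    for provider in providers:
        try:
            result = await asyncio.wait_for(provider(artist_id), timeout)
        except (asyncio.TimeoutError, ConnectionError):
            continue  # degrade gracefully to the next provider
        if result is not None:
            return result
    return None


async def slow_primary(artist_id: str) -> Optional[str]:
    await asyncio.sleep(1.0)  # simulates an unresponsive upstream service
    return "primary overview"


async def fast_fallback(artist_id: str) -> Optional[str]:
    return f"fallback overview for {artist_id}"


overview = asyncio.run(
    first_available([slow_primary, fast_fallback], "abc", timeout=0.05)
)
```

Because `asyncio.wait_for` cancels the slow call when the timeout expires, a stalled primary costs at most `timeout` seconds before the fallback is tried.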
+
+## Performance Characteristics
+
+| Metric | Value | Notes |
+|--------|-------|-------|
+| **Cache hit rate** | ~85%+ | With crawler enabled |
+| **Cold request latency** | 2-5s | Multiple external API calls |
+| **Cached request latency** | 50-200ms | Redis/PostgreSQL lookup |
+| **CDN request latency** | 10-50ms | Cloudflare edge cache |
+| **Database size** | 100GB+ | MusicBrainz full dataset |
+| **Cache database size** | 10-50GB | Compressed metadata cache |
+
+## API Response Format
+
+All endpoints return JSON with consistent structure:
+
+```json
+{
+  "Id": "5b11f4ce-a62d-471e-81fc-a69a8278c7da",
+  "ArtistName": "Nirvana",
+  "Disambiguation": "90s US grunge band",
+  "Overview": "Nirvana was an American rock band...",
+  "Images": [
+    {
+      "Url": "https://imagecache.lidarr.audio/...",
+      "CoverType": "poster",
+      "Extension": ".jpg"
+    }
+  ],
+  "Links": [
+    {
+      "Url": "https://www.spotify.com/artist/...",
+      "Name": "spotify"
+    }
+  ],
+  "Genres": ["Grunge", "Alternative Rock"],
+  "Albums": [...]
+}
+```
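As an illustration of consuming this shape, the sketch below parses a trimmed copy of the sample and picks an image by `CoverType`. The helper `image_url` is hypothetical, and `Albums` is left empty so that the JSON parses.

```python
import json
from typing import Any, Dict, Optional

# Trimmed copy of the sample response; elided URLs are kept as-is.
SAMPLE_RESPONSE = """
{
  "Id": "5b11f4ce-a62d-471e-81fc-a69a8278c7da",
  "ArtistName": "Nirvana",
  "Images": [
    {"Url": "https://imagecache.lidarr.audio/...", "CoverType": "poster", "Extension": ".jpg"}
  ],
  "Links": [{"Url": "https://www.spotify.com/artist/...", "Name": "spotify"}],
  "Genres": ["Grunge", "Alternative Rock"],
  "Albums": []
}
"""


def image_url(artist: Dict[str, Any], cover_type: str) -> Optional[str]:
    """Return the first image URL with the given CoverType, or None."""
    for image in artist.get("Images", []):
        if image.get("CoverType") == cover_type:
            return image.get("Url")
    return None


artist = json.loads(SAMPLE_RESPONSE)
poster = image_url(artist, "poster")
```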
+
+## Security Posture
+
+**Current state**: Development-focused with insecure defaults.
+
+| Aspect | Status | Details |
+|--------|--------|---------|
+| **API authentication** | None | Read endpoints are public |
+| **Admin authentication** | Single API key | `/invalidate` endpoint only |
+| **Database credentials** | Hardcoded | `abc/abc` in multiple configs |
+| **RabbitMQ credentials** | Hardcoded | `abc/abc` default |
+| **HTTPS** | Not enforced | Relies on reverse proxy |
+| **Rate limiting** | Optional | Disabled by default (NullRateLimiter) |
+
+**Production recommendation**: Deploy behind an authenticated reverse proxy (Cloudflare Access, OAuth2 Proxy, etc.).
+
+## Monitoring and Observability
+
+### Error Tracking
+
+Sentry integration with custom rate limiting to prevent alert fatigue:
+
+```python
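+import sentry_sdk
+from sentry_sdk.integrations.flask import FlaskIntegration
+
+# `config` and `__version__` are provided by the application package.
+sentry_sdk.init(
+    dsn=config.SENTRY_DSN,
+    integrations=[FlaskIntegration()],
+    release=f"lidarr-metadata@{__version__}",
+)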
+```
+
+Redis-backed deduplication prevents duplicate error reports.
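A rough sketch of what such deduplication might look like as a Sentry `before_send` hook. The `ErrorDeduplicator` class is hypothetical, and an in-memory dict stands in for Redis keys with a TTL; the project's actual fingerprinting and storage are not shown in this overview.

```python
import hashlib
import time
from typing import Any, Dict, Optional


class ErrorDeduplicator:
    """Drops events whose fingerprint was already seen within the window."""

    def __init__(self, window_seconds: float = 300.0) -> None:
        self.window_seconds = window_seconds
        # In-memory stand-in for Redis keys that expire after the window.
        self._last_seen: Dict[str, float] = {}

    def _fingerprint(self, event: Dict[str, Any]) -> str:
        values = event.get("exception", {}).get("values", [])
        parts = [f"{v.get('type')}:{v.get('value')}" for v in values]
        if not parts:
            parts = [str(event.get("message"))]
        return hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()

    def before_send(
        self, event: Dict[str, Any], hint: Dict[str, Any]
    ) -> Optional[Dict[str, Any]]:
        now = time.monotonic()
        key = self._fingerprint(event)
        last = self._last_seen.get(key)
        if last is not None and now - last < self.window_seconds:
            return None  # returning None tells the SDK to drop the event
        self._last_seen[key] = now
        return event
```

An instance would be wired in by passing `before_send=dedup.before_send` to `sentry_sdk.init`. Sharing the window across multiple workers is the part that would need Redis rather than a local dict.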
+
+### Metrics
+
+StatsD/Telegraf integration for operational metrics:
+
+- Provider request counts
+- Response time histograms
+- Cache hit/miss rates
+- Rate limiter state
+
+### Logging
+
+Python standard library logging with per-module handlers:
+
+- **DEBUG**: Detailed request/response logging
+- **INFO**: Request summaries, cache operations
+- **WARNING**: Provider timeouts, fallback usage
+- **ERROR**: Unhandled exceptions, data inconsistencies
+
+## Development Workflow
+
+### Local Development
+
+```bash
+# Install dependencies
+poetry install
+
+# Start infrastructure
+docker-compose -f docker-compose.yml -f docker-compose.dev.yml up -d
+
+# Run API server
+LIDARR_METADATA_CONFIG=lidarrmetadata.config.DevelopmentConfig \
+  python -m lidarrmetadata.server
+
+# Run tests (currently disabled in CI)
+pytest tests/
+```
+
+### Testing
+
+Test suite uses pytest with async support:
+
+- `tests/test_config.py`: Configuration system (152 lines, most comprehensive)
+- `tests/test_provider.py`: Provider mixin behavior
+- `tests/test_cache.py`: Cache layer functionality
+- `tests/test_api.py`: API endpoint responses
+- `tests/test_util.py`: Utility functions
+- `tests/test_app.py`: Application initialization
+
+**Note**: Tests are commented out in Azure Pipelines CI configuration.
+
+## Project Maturity Assessment
+
+| Aspect | Maturity | Evidence |
+|--------|----------|----------|
+| **Production readiness** | High | Running in production for Lidarr ecosystem |
+| **Code quality** | Medium | SonarCloud integration, but tests disabled |
+| **Security** | Low | Hardcoded credentials, no auth on read endpoints |
+| **Documentation** | Medium | README comprehensive, inline docs sparse |
+| **Dependency freshness** | Low | Python 3.9, aioredis 1.x (deprecated) |
+| **Test coverage** | Unknown | Tests disabled in CI |
+| **Operational maturity** | High | Sentry, metrics, multi-tier caching, CDN integration |
+
+## Relevance to Metadata Aggregator Project
+
+This codebase is the closest real-world analogue to the metadata aggregator being built: a metadata aggregation service that already runs in production. Key learnings:
+
+1. **Multi-source enrichment pattern**: MusicBrainz as authoritative core + specialized providers for images/bios/charts
+2. **Caching strategy**: Three-tier approach with compression and invalidation is battle-tested
+3. **Provider architecture**: Mixin-based design allows flexible composition of data sources
+4. **Change detection**: Monitoring upstream data sources for cache invalidation is critical
+5. **Background crawling**: Proactive cache warming significantly improves user experience
+6. **Direct database access**: Querying MusicBrainz DB directly (vs API) enables complex aggregations
+7. **SQL aggregation**: Using `row_to_json` and `json_agg` to build nested JSON in database is highly efficient
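The mixin-based provider pattern from item 3 can be sketched as follows. All class and function names here are hypothetical stand-ins chosen for illustration, not the identifiers used in `provider.py`.

```python
from typing import Dict, List, Optional, Sequence, Type


class Provider:
    """Base class; concrete providers gain capabilities through mixins."""


class OverviewMixin:
    def get_overview(self, artist_id: str) -> Optional[str]:
        raise NotImplementedError


class ImagesMixin:
    def get_images(self, artist_id: str) -> List[Dict[str, str]]:
        raise NotImplementedError


class StubWikipedia(Provider, OverviewMixin):
    def get_overview(self, artist_id: str) -> Optional[str]:
        return f"Overview for {artist_id}"


class StubAudioDb(Provider, OverviewMixin, ImagesMixin):
    def get_overview(self, artist_id: str) -> Optional[str]:
        return None  # pretend this source has no biography

    def get_images(self, artist_id: str) -> List[Dict[str, str]]:
        return [{"CoverType": "poster", "Url": f"stub://{artist_id}"}]


def providers_implementing(
    providers: Sequence[Provider], capability: Type
) -> List[Provider]:
    """Select, in registration order, the providers that offer a capability."""
    return [p for p in providers if isinstance(p, capability)]


registry = [StubWikipedia(), StubAudioDb()]
image_sources = providers_implementing(registry, ImagesMixin)
```

Selecting by capability keeps the business logic independent of concrete providers: adding a new image source means registering one more class that inherits the relevant mixin.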
+
+## File Structure Overview
+
+```
+lidarrmetadata/
+├── __init__.py           # Version and package metadata
+├── server.py             # API server entry point
+├── crawler.py            # Background crawler entry point
+├── app.py                # Quart application factory + routes
+├── api.py                # Business logic layer
+├── provider.py           # Provider mixins and implementations
+├── cache.py              # Multi-tier cache implementation
+├── config.py             # Configuration metaclass system
+├── util.py               # Utility functions
+├── sql/                  # MusicBrainz SQL queries
+│   ├── artist.sql
+│   ├── album.sql
+│   ├── updated_artists.sql
+│   └── updated_albums.sql
+└── providers/            # Individual provider implementations
+    ├── musicbrainz_db.py
+    ├── solr_search.py
+    ├── fanart.py
+    ├── theaudiodb.py
+    ├── wikipedia.py
+    └── spotify.py
+```
+
+## Dependencies Analysis
+
+### Production Dependencies (17 total)
+
+**Web framework**:
+- quart==0.14.1 (async Flask alternative)
+- hypercorn (ASGI server, Quart dependency)
+
+**Database**:
+- asyncpg==0.26.0 (PostgreSQL async driver)
+- aioredis==1.3.1 (Redis async client, deprecated)
+
+**External APIs**:
+- spotipy==2.16.1 (Spotify)
+- pylast==4.3.0 (Last.fm)
+- billboard-py==7.0.0 (Billboard charts)
+- beautifulsoup4 (Wikipedia scraping)
+
+**Utilities**:
+- python-dateutil (date parsing)
+- pytz (timezone handling)
+- requests (HTTP client for sync operations)
+- lxml (XML parsing)
+
+**Monitoring**:
+- sentry-sdk==0.19.5 (error tracking)
+- statsd (metrics)
+
+**Server**:
+- gunicorn (WSGI server)
+- uvicorn (ASGI worker)
+
+### Development Dependencies
+
+- pytest
+- pytest-asyncio
+- black (code formatting)
+- flake8 (linting)
+
+### Dependency Concerns
+
+1. **Python 3.9**: Reached end of life in October 2025; should be upgraded to 3.11+
+2. **aioredis 1.3.1**: Deprecated; its functionality was merged into redis-py 4.2+
+3. **Quart 0.14.1**: Current version is 0.19+, so roughly 5 years of updates are missing
+4. **asyncpg 0.26.0**: Current version is 0.29+
+5. **sentry-sdk 0.19.5**: Current version is 2.0+, two major versions behind
+
+## Conclusion
+
+LidarrAPI.Metadata is a production-grade metadata aggregation service with sophisticated caching, multi-source enrichment, and operational maturity. While it has technical debt (outdated dependencies, disabled tests, insecure defaults), its architecture and patterns provide an excellent reference for building a modern metadata aggregator.
+
+The direct MusicBrainz database integration, provider fallback chain, and three-tier caching strategy are particularly valuable patterns to adopt.