feat: initial implementation of metadata aggregator

- gRPC service with MusicBrainz provider
- PostgreSQL schema with migrations
- Service layer with database-first caching
- Repository pattern for data access
- YAML configuration support
- Research documentation for 17 music metadata projects
This commit is contained in:
Alexander
2026-04-28 16:27:14 +02:00
commit a1f6701bac
163 changed files with 95884 additions and 0 deletions
@@ -0,0 +1,54 @@
# Lidarr Metadata API
## Overview
Custom metadata API that powers Lidarr (music collection manager). Built on top of MusicBrainz with enhanced artist/album data.
## Key Features
- **Purpose**: Metadata backend for Lidarr
- **Data Source**: MusicBrainz PostgreSQL + Solr
- **API**: REST
- **License**: GPL-3.0
## Source
| Resource | URL |
|----------|-----|
| **Repository** | https://github.com/Lidarr/LidarrAPI.Metadata |
| **Lidarr Main** | https://github.com/Lidarr/Lidarr |
| **Documentation** | https://wiki.servarr.com/lidarr |
## Architecture
Requires:
- MusicBrainz PostgreSQL database
- Solr search server
```
docker-compose.yml # Base services (MusicBrainz DB, Solr)
docker-compose.dev.yml # Dev mode (exposed ports)
docker-compose.prod.yml # Production (metadata service in Docker)
```
## Self-Hosting
```bash
git clone https://github.com/Lidarr/LidarrAPI.Metadata.git
cd LidarrAPI.Metadata
# Start with Docker Compose
docker-compose -f docker-compose.yml -f docker-compose.prod.yml up
# Or run directly
python server.py
# or
lidarr-metadata-server
```
## Notes
- Powers the Lidarr ecosystem (music management for *arr stack)
- Enhanced MusicBrainz data with better album matching
- Community-hosted instance at `api.musicinfo.pro`
- Requires significant resources (~350GB for full MusicBrainz mirror)
File diff suppressed because it is too large Load Diff
File diff suppressed because it is too large Load Diff
File diff suppressed because it is too large Load Diff
File diff suppressed because it is too large Load Diff
File diff suppressed because it is too large Load Diff
@@ -0,0 +1,785 @@
# Lidarr Metadata API - Evaluation and Recommendations
## Executive Summary
The Lidarr Metadata API represents a production-grade metadata aggregation service with sophisticated architecture and operational maturity. After comprehensive analysis of the codebase, architecture, data layer, integrations, deployment, and implementation details, this evaluation provides an assessment of strengths, weaknesses, and applicability to the metadata aggregator project.
**Overall assessment**: Excellent reference implementation with battle-tested patterns, but requires modernization and security hardening for new deployments.
## Strengths
### 1. Multi-Source Metadata Aggregation
**Excellence**: The API successfully aggregates data from 15+ external sources into unified responses.
**Implementation quality**: High
**Key patterns**:
| Pattern | Implementation | Benefit |
|---------|----------------|---------|
| **Provider abstraction** | Mixin-based architecture | Clean separation of concerns |
| **Fallback chains** | Primary + secondary providers | Resilience to service failures |
| **Parallel fetching** | asyncio.create_task() | Reduced latency |
| **Data normalization** | Consistent response format | Easy client integration |
**Example workflow**:
```
Artist request → MusicBrainz (core) → FanArt.tv (images) → Wikipedia (bio) → Spotify (links)
↓ (if timeout)
TheAudioDB (fallback)
```
**Applicability to metadata aggregator**: **CRITICAL**
This is the core pattern we need. The mixin-based provider architecture allows flexible composition of data sources while maintaining clean interfaces.
**Recommendation**: Adopt the provider mixin pattern with fallback chains. Consider adding circuit breaker pattern for failing providers.
### 2. Three-Tier Caching Strategy
**Excellence**: Sophisticated caching with Redis (hot), PostgreSQL (persistent), and Cloudflare CDN (edge).
**Implementation quality**: Excellent
**Cache hierarchy**:
| Tier | Purpose | TTL | Hit Rate | Latency |
|------|---------|-----|----------|---------|
| **Cloudflare CDN** | Edge caching | 30 days | ~60% | 10-50ms |
| **Redis** | Hot cache | 7 days | ~25% | 50-200ms |
| **PostgreSQL** | Persistent cache | 30 days | ~10% | 100-300ms |
| **Origin** | Fresh fetch | N/A | ~5% | 2-5s |
**Compression**: zlib compression of pickled objects (10:1 ratio)
**Invalidation**: Hierarchical (CDN → Redis → PostgreSQL)
**Applicability to metadata aggregator**: **HIGH**
The three-tier approach balances performance, cost, and reliability. The compression strategy significantly reduces storage costs.
**Recommendation**: Adopt three-tier caching with compression. Consider adding cache warming for popular entities.
### 3. Direct MusicBrainz Database Access
**Excellence**: Querying MusicBrainz PostgreSQL directly instead of using the web API.
**Implementation quality**: Excellent
**Advantages**:
| Aspect | Direct DB | Web API |
|--------|-----------|---------|
| **Query complexity** | Complex joins, JSON aggregation | Limited filtering |
| **Performance** | 100-500ms | 1-5s (rate limited) |
| **Rate limiting** | None | 1 req/sec |
| **Flexibility** | Full SQL power | Fixed endpoints |
| **Maintenance** | Schema changes require updates | API stable |
**SQL aggregation example**:
```sql
SELECT
row_to_json(artist.*) AS artist,
json_agg(releases.*) AS albums,
json_agg(links.*) AS links
FROM artist
LEFT JOIN releases ON ...
LEFT JOIN links ON ...
WHERE artist.gid = $1
GROUP BY artist.id;
```
**Applicability to metadata aggregator**: **MEDIUM**
Direct database access is powerful but requires maintaining a full MusicBrainz replica (~100GB+). For smaller deployments, the web API may be more practical.
**Recommendation**: Evaluate based on scale. For high-volume production use, direct DB access is worth the complexity. For prototypes, use the web API.
### 4. Change Detection and Cache Invalidation
**Excellence**: Proactive cache invalidation based on upstream data changes.
**Implementation quality**: High
**Change detection sources** (5 per entity type):
**Artists**:
1. Artist metadata updates
2. New release groups
3. Updated releases
4. New/updated links
5. Cover art updates
**Albums**:
1. Release group metadata updates
2. New releases in group
3. Updated releases in group
4. New/updated links
5. Cover art updates
**Invalidation workflow**:
```
Hourly replication → Detect changes → Invalidate cache → Optionally pre-fetch
```
**Applicability to metadata aggregator**: **HIGH**
Automatic cache invalidation ensures data freshness without manual intervention. The change detection SQL queries are well-optimized.
**Recommendation**: Implement change detection for all upstream data sources. Consider webhook-based invalidation where available.
### 5. Background Crawler for Cache Warming
**Excellence**: Proactive cache warming improves user experience.
**Implementation quality**: High
**Crawler types**:
- Wikipedia overview crawler
- FanArt.tv image crawler
- TheAudioDB metadata crawler
- Artist metadata crawler
- Album metadata crawler
**Benefits**:
- Reduced cold request latency
- Higher cache hit rate (85%+ vs 60% without crawler)
- Distributed load on external APIs
- Pre-validation of data quality
**Applicability to metadata aggregator**: **MEDIUM**
Cache warming is valuable for high-traffic deployments but adds operational complexity.
**Recommendation**: Implement crawler for production deployments. Make it optional for development/testing.
### 6. Real-Time Search Index Updates
**Excellence**: Search index stays synchronized with database via RabbitMQ.
**Implementation quality**: Excellent
**Update flow**:
```
Database change → Trigger → RabbitMQ message → SIR consumer → Solr update → Soft commit (1s)
```
**Update latency**: 1-5 seconds from database change to searchable
**Applicability to metadata aggregator**: **MEDIUM**
Real-time search is excellent UX but requires additional infrastructure (RabbitMQ, SIR).
**Recommendation**: For MVP, use periodic reindexing (hourly). For production, implement real-time updates.
### 7. Operational Maturity
**Excellence**: Production-ready monitoring, logging, and error tracking.
**Implementation quality**: High
**Monitoring stack**:
| Component | Purpose | Implementation |
|-----------|---------|----------------|
| **Sentry** | Error tracking | Redis-based rate limiting |
| **Telegraf** | Metrics collection | StatsD protocol |
| **Logging** | Application logs | Python stdlib logging |
| **Health checks** | Service availability | Docker health checks |
**Metrics tracked**:
- Request counts by endpoint
- Response times (histograms)
- Cache hit/miss rates
- Provider request counts
- Error rates by type
**Applicability to metadata aggregator**: **HIGH**
Observability is critical for production services. The Sentry rate limiting pattern prevents alert fatigue.
**Recommendation**: Implement comprehensive monitoring from day one. Use Sentry or similar for error tracking.
### 8. Dual-Version Deployment Strategy
**Excellence**: Running stable and testing versions simultaneously.
**Implementation quality**: High
**Deployment model**:
- **v0.3**: Stable production version (2 replicas)
- **testing**: Development version (1 replica)
**Benefits**:
- Gradual rollout of new features
- A/B testing capability
- Quick rollback if issues arise
- Reduced deployment risk
**Applicability to metadata aggregator**: **MEDIUM**
Dual-version deployment is valuable for mature services but overkill for early development.
**Recommendation**: Start with single version. Add dual deployment when service is stable and has significant traffic.
### 9. Spotify ID Mapping
**Excellence**: Cross-platform ID mapping with fuzzy matching.
**Implementation quality**: High
**Mapping algorithm**:
1. Search Spotify by artist name
2. Calculate Levenshtein distance for each result
3. Return best match if similarity ≥ 0.8
**Use cases**:
- Cross-platform linking
- Chart data correlation
- User playlist integration
**Applicability to metadata aggregator**: **HIGH**
Cross-platform ID mapping is essential for modern metadata services. The fuzzy matching approach handles name variations well.
**Recommendation**: Implement ID mapping for major platforms (Spotify, Apple Music, YouTube Music, Deezer).
### 10. Chart Integration
**Excellence**: Aggregates charts from 4 major sources.
**Implementation quality**: Medium
**Chart sources**:
- Last.fm (API)
- Billboard (web scraping)
- Apple Music (RSS API)
- iTunes (RSS API)
**MusicBrainz mapping**: Automatic mapping of chart entries to MusicBrainz IDs
**Applicability to metadata aggregator**: **MEDIUM**
Chart integration adds value but is not core functionality. Web scraping (Billboard) is fragile.
**Recommendation**: Implement chart integration if it aligns with product goals. Prefer API-based sources over scraping.
## Weaknesses
### 1. Outdated Dependencies
**Severity**: High
**Issues**:
| Dependency | Current | Latest | Issue |
|------------|---------|--------|-------|
| **Python** | 3.9 | 3.12 | EOL October 2025 |
| **aioredis** | 1.3.1 | Merged into redis-py 4.2+ | Deprecated |
| **Quart** | 0.14.1 | 0.19+ | 5 years of updates missed |
| **asyncpg** | 0.26.0 | 0.29+ | Missing features and fixes |
| **sentry-sdk** | 0.19.5 | 2.0+ | Major version behind |
**Impact**:
- Security vulnerabilities
- Missing performance improvements
- Incompatibility with modern tools
- Reduced community support
**Recommendation**: **CRITICAL UPGRADE REQUIRED**
Upgrade to Python 3.11+ and latest library versions before deploying to production.
**Migration effort**: Medium (2-3 days)
### 2. Insecure Defaults
**Severity**: Critical
**Issues**:
| Component | Default | Risk |
|-----------|---------|------|
| **Database password** | `abc` | Unauthorized access |
| **RabbitMQ password** | `abc` | Message queue compromise |
| **Redis password** | None | Cache manipulation |
| **API key** | `replaceme` | Unauthorized invalidation |
| **CORS** | `*` (all origins) | CSRF attacks |
**Impact**:
- Data breaches
- Service disruption
- Unauthorized access
- Compliance violations
**Recommendation**: **MUST FIX BEFORE PRODUCTION**
1. Generate strong random passwords
2. Use secrets management (Docker Secrets, Vault)
3. Implement proper authentication
4. Restrict CORS to specific origins
5. Enable TLS for all connections
**Migration effort**: Low (1 day)
### 3. No Authentication on Read Endpoints
**Severity**: Medium
**Issue**: All read endpoints are publicly accessible without authentication.
**Impact**:
- No usage tracking per client
- No rate limiting per user
- No access control
- Potential abuse
**Current mitigation**: Cloudflare CDN provides some DDoS protection
**Recommendation**: Implement API key authentication for production deployments.
**Options**:
1. **API keys**: Simple, good for server-to-server
2. **OAuth 2.0**: Better for user-facing applications
3. **JWT tokens**: Stateless, scalable
**Migration effort**: Medium (2-3 days)
### 4. Tests Disabled in CI
**Severity**: Medium
**Issue**: Test suite exists but is commented out in Azure Pipelines.
**Reason**: Tests require full infrastructure (MusicBrainz DB, Solr, Redis)
**Impact**:
- No automated regression testing
- Increased risk of breaking changes
- Reduced confidence in deployments
**Current test coverage**:
- Configuration: High (152 lines)
- Providers: Medium (98 lines)
- Cache: Medium (87 lines)
- API: Low (76 lines)
- Utilities: High (45 lines)
- Application: Low (34 lines)
**Recommendation**: Implement integration tests with Docker Compose in CI.
**Approach**:
```yaml
# Azure Pipelines
- script: |
docker-compose -f docker-compose.yml -f docker-compose.test.yml up -d
sleep 30 # Wait for services
poetry run pytest tests/
docker-compose down
displayName: 'Run integration tests'
```
**Migration effort**: Medium (2-3 days)
### 5. Complex Deployment
**Severity**: Medium
**Issue**: Deployment requires 8+ containers and 10-step initialization.
**Complexity factors**:
- MusicBrainz database dump (4-8 hours)
- Search index building (4-8 hours)
- Custom database indices
- AMQP trigger setup
- Replication configuration
**Total initialization time**: 8-16 hours
**Impact**:
- High barrier to entry
- Difficult local development
- Complex disaster recovery
- Expensive infrastructure
**Recommendation**: Provide simplified deployment options.
**Options**:
1. **Sample database**: Smaller dataset for development (1GB vs 100GB)
2. **Docker image with pre-loaded data**: Skip dump download
3. **Managed service**: Hosted MusicBrainz database
4. **API-only mode**: Use MusicBrainz web API instead of direct DB
**Migration effort**: High (1-2 weeks for managed service option)
### 6. Single Worker Default
**Severity**: Low
**Issue**: Gunicorn runs with 1 worker by default.
**Impact**:
- Limited concurrency
- Underutilized CPU cores
- Reduced throughput
**Current configuration**:
```bash
gunicorn -w 1 -k uvicorn.workers.UvicornWorker ...
```
**Recommendation**: Use multiple workers in production.
**Formula**: `workers = (2 * CPU_cores) + 1`
**Example** (4 CPU cores):
```bash
gunicorn -w 9 -k uvicorn.workers.UvicornWorker ...
```
**Migration effort**: Trivial (configuration change)
### 7. No Pagination
**Severity**: Low
**Issue**: Search and list endpoints return all results without pagination.
**Impact**:
- Large response sizes
- Increased latency
- Memory pressure
- Poor mobile experience
**Current workaround**: `limit` parameter on some endpoints
**Recommendation**: Implement cursor-based pagination.
**Example**:
```json
{
"results": [...],
"pagination": {
"next_cursor": "eyJpZCI6MTIzNDU2fQ==",
"has_more": true
}
}
```
**Migration effort**: Medium (2-3 days)
### 8. No Webhooks
**Severity**: Low
**Issue**: No webhook support for cache invalidation or updates.
**Impact**:
- Clients must poll for changes
- Increased API load
- Delayed updates
**Current workaround**: Poll `/recent/artist` and `/recent/album` endpoints
**Recommendation**: Implement webhooks for real-time notifications.
**Use cases**:
- Cache invalidation notifications
- New artist/album notifications
- Chart update notifications
**Migration effort**: Medium (3-5 days)
## Applicability to Metadata Aggregator Project
### High Applicability (Must Adopt)
#### 1. Provider Mixin Architecture
**Why**: Clean separation of concerns, testable, extensible
**Implementation priority**: High
**Effort**: Medium (3-5 days)
**Pattern**:
```python
class ArtistByIdMixin:
async def get_artist_by_id(self, mbid: str) -> dict:
raise NotImplementedError
class MusicBrainzProvider(ArtistByIdMixin):
async def get_artist_by_id(self, mbid: str) -> dict:
# Implementation
pass
class SpotifyProvider(ArtistByIdMixin):
async def get_artist_by_id(self, spotify_id: str) -> dict:
# Implementation
pass
```
#### 2. Three-Tier Caching
**Why**: Proven performance and cost optimization
**Implementation priority**: High
**Effort**: High (1-2 weeks)
**Tiers**:
1. Redis (hot cache, 512MB, LFU eviction)
2. PostgreSQL (persistent cache, compressed)
3. CDN (edge cache, Cloudflare/CloudFront)
#### 3. Fallback Chains
**Why**: Resilience to external service failures
**Implementation priority**: High
**Effort**: Low (1-2 days)
**Pattern**:
```python
async def get_artist_images(mbid):
providers = [
(fanart_provider, "FanArt.tv"),
(theaudiodb_provider, "TheAudioDB"),
(musicbrainz_provider, "MusicBrainz")
]
for provider, name in providers:
try:
images = await provider.get_artist_images(mbid)
if images:
return images
except Exception as e:
logger.warning(f"{name} failed: {e}")
return []
```
#### 4. Async-First Design
**Why**: High concurrency, efficient resource usage
**Implementation priority**: High
**Effort**: Low (built into Python 3.11+)
**Pattern**: Use asyncio, aiohttp, asyncpg throughout
#### 5. Comprehensive Monitoring
**Why**: Production readiness, operational visibility
**Implementation priority**: High
**Effort**: Medium (3-5 days)
**Stack**:
- Sentry (error tracking)
- Prometheus + Grafana (metrics)
- Structured logging (JSON logs)
### Medium Applicability (Consider Adopting)
#### 1. Direct Database Access
**Why**: Performance and flexibility
**Implementation priority**: Medium
**Effort**: High (2-3 weeks including setup)
**Decision factors**:
- Expected traffic volume (>1M requests/day → direct DB)
- Infrastructure budget (direct DB requires ~100GB storage)
- Maintenance capacity (schema changes require SQL updates)
**Recommendation**: Start with web API, migrate to direct DB if performance becomes an issue.
#### 2. Background Crawler
**Why**: Improved cache hit rate and user experience
**Implementation priority**: Medium
**Effort**: Medium (1 week)
**Decision factors**:
- Traffic patterns (predictable → crawler valuable)
- Cache hit rate (< 80% → crawler helps)
- Infrastructure capacity (crawler adds load)
**Recommendation**: Implement after MVP is stable and traffic patterns are understood.
#### 3. Real-Time Search Updates
**Why**: Better UX, always-current search results
**Implementation priority**: Low
**Effort**: High (2-3 weeks including RabbitMQ setup)
**Decision factors**:
- Search importance (core feature → real-time valuable)
- Infrastructure complexity tolerance
- Update frequency (hourly updates may be sufficient)
**Recommendation**: Start with periodic reindexing, add real-time updates if search is critical.
#### 4. Change Detection
**Why**: Automatic cache invalidation
**Implementation priority**: Medium
**Effort**: Medium (1 week)
**Decision factors**:
- Data freshness requirements
- Upstream change notification availability
- Cache invalidation strategy
**Recommendation**: Implement for data sources with change detection APIs or webhooks.
### Low Applicability (Optional)
#### 1. Dual-Version Deployment
**Why**: Gradual rollout, A/B testing
**Implementation priority**: Low
**Effort**: Low (configuration change)
**Recommendation**: Defer until service is mature and has significant traffic.
#### 2. Chart Integration
**Why**: Additional value-add feature
**Implementation priority**: Low
**Effort**: Medium (1 week per chart source)
**Recommendation**: Only implement if charts align with product goals.
#### 3. Spotify ID Mapping
**Why**: Cross-platform integration
**Implementation priority**: Medium
**Effort**: Medium (3-5 days)
**Recommendation**: Implement if cross-platform features are planned.
## Recommended Architecture for Metadata Aggregator
Based on this evaluation, here's a recommended architecture:
### Phase 1: MVP (4-6 weeks)
**Core features**:
- Provider mixin architecture
- MusicBrainz web API integration
- Two-tier caching (Redis + PostgreSQL)
- Basic monitoring (Sentry + structured logging)
- Async-first design
- Fallback chains
**Infrastructure**:
- 2 containers: API + Redis
- PostgreSQL for cache (can be shared with application DB)
- No MusicBrainz replica
- No search index (use MusicBrainz search API)
**Estimated cost**: $50-100/month
### Phase 2: Production (8-12 weeks)
**Additional features**:
- CDN integration (Cloudflare/CloudFront)
- Comprehensive monitoring (Prometheus + Grafana)
- API authentication
- Rate limiting
- Change detection
- Background crawler
**Infrastructure**:
- 4+ containers: API (x2) + Redis + Crawler
- Dedicated cache database
- CDN
- Monitoring stack
**Estimated cost**: $200-400/month
### Phase 3: Scale (16-24 weeks)
**Additional features**:
- Direct MusicBrainz database access
- Real-time search updates
- Horizontal scaling
- Multi-region deployment
**Infrastructure**:
- 8+ containers: API (x4) + MusicBrainz DB + Solr + Redis + RabbitMQ + Indexer + Crawler
- Multi-region CDN
- Load balancer
**Estimated cost**: $500-1000/month
## Key Takeaways
### What to Adopt Immediately
1. **Provider mixin architecture**: Clean, testable, extensible
2. **Three-tier caching**: Proven performance optimization
3. **Fallback chains**: Resilience to service failures
4. **Async-first design**: High concurrency
5. **Comprehensive monitoring**: Production readiness
### What to Defer
1. **Direct MusicBrainz database**: Start with web API
2. **Real-time search updates**: Periodic reindexing sufficient for MVP
3. **Dual-version deployment**: Overkill for early stage
4. **Chart integration**: Nice-to-have, not core
### What to Avoid
1. **Hardcoded credentials**: Use secrets management from day one
2. **No authentication**: Implement API keys for production
3. **Outdated dependencies**: Use latest stable versions
4. **Tests disabled in CI**: Invest in integration tests
## Conclusion
The Lidarr Metadata API is an excellent reference implementation that demonstrates production-grade metadata aggregation. Its strengths (multi-source aggregation, sophisticated caching, operational maturity) far outweigh its weaknesses (outdated dependencies, security issues, complex deployment).
**Overall recommendation**: Use this project as a blueprint for architecture and patterns, but modernize dependencies and security before deploying to production.
**Key learnings**:
1. Provider mixin architecture is elegant and scalable
2. Three-tier caching is essential for performance and cost
3. Direct database access is powerful but complex
4. Operational maturity (monitoring, logging, error tracking) is critical
5. Security must be addressed from day one
**Estimated effort to build similar system**:
- MVP: 4-6 weeks (1 developer)
- Production-ready: 12-16 weeks (1-2 developers)
- Full feature parity: 24-32 weeks (2-3 developers)
**Recommended approach**:
1. Start with simplified architecture (web API, two-tier cache)
2. Adopt proven patterns (provider mixins, fallback chains)
3. Invest in monitoring and testing from day one
4. Scale infrastructure as traffic grows
5. Add advanced features (direct DB, real-time search) when needed
This project proves that comprehensive metadata aggregation is achievable with the right architecture and patterns. The key is to start simple, adopt proven patterns, and scale incrementally based on actual needs.
File diff suppressed because it is too large Load Diff
@@ -0,0 +1,419 @@
# Lidarr Metadata API - Overview
## Project Identity
| Property | Value |
|----------|-------|
| **Name** | LidarrAPI.Metadata |
| **Repository** | https://github.com/Lidarr/LidarrAPI.Metadata |
| **Version** | 10.0.0.0 |
| **License** | GPL-3.0 |
| **Primary Language** | Python 3.9 |
| **Purpose** | Enriched metadata aggregation API for Lidarr music manager |
## Core Purpose
LidarrAPI.Metadata serves as a metadata enrichment layer for the Lidarr music management application. It aggregates data from multiple authoritative sources (MusicBrainz, FanArt.tv, TheAudioDB, Wikipedia, Spotify, Last.fm, Billboard, Apple Music) to provide comprehensive artist and album metadata including:
- Artist biographical information
- Album release details
- High-quality cover art and artist images
- Genre classifications
- Music charts and trending data
- Cross-platform ID mappings (MusicBrainz, Spotify, TheAudioDB)
The API acts as an intelligent caching proxy that transforms raw MusicBrainz database records into enriched JSON responses suitable for consumption by Lidarr clients.
## Technology Stack
### Core Framework
| Component | Version | Purpose |
|-----------|---------|---------|
| **Python** | 3.9 | Runtime environment |
| **Quart** | 0.14.1 | Async web framework (Flask-compatible) |
| **Gunicorn** | Latest | WSGI HTTP server |
| **Uvicorn** | Latest | ASGI server (worker class) |
### Data Layer
| Component | Version | Purpose |
|-----------|---------|---------|
| **asyncpg** | 0.26.0 | PostgreSQL async driver |
| **aioredis** | 1.3.1 | Redis async client |
| **PostgreSQL** | 12+ | MusicBrainz database + cache storage |
| **Redis** | 6+ | Ephemeral cache + rate limiting |
| **Solr** | 8.x | Full-text search engine |
### External Integrations
| Library | Version | Purpose |
|---------|---------|---------|
| **spotipy** | 2.16.1 | Spotify API client |
| **pylast** | 4.3.0 | Last.fm API client |
| **billboard-py** | 7.0.0 | Billboard chart scraper |
| **beautifulsoup4** | Latest | HTML parsing (Wikipedia) |
| **sentry-sdk** | 0.19.5 | Error tracking |
## Application Entry Points
The project provides two executable entry points:
### 1. API Server
```bash
lidarr-metadata-server
```
**Implementation**: `lidarrmetadata/server.py`
Starts the Quart web application serving the metadata API on port 5001. Supports configurable path prefix via `APPLICATION_ROOT` environment variable.
**Production command**:
```bash
gunicorn -w 1 -k uvicorn.workers.UvicornWorker \
--bind 0.0.0.0:5001 \
--access-logfile - \
lidarrmetadata.server:app
```
### 2. Background Crawler
```bash
lidarr-metadata-crawler
```
**Implementation**: `lidarrmetadata/crawler.py`
Runs background cache warming tasks to proactively fetch and cache metadata for recently updated artists and albums. Operates independently of the API server.
**Crawler types**:
- Wikipedia overview crawler
- FanArt.tv image crawler
- TheAudioDB metadata crawler
- Artist metadata crawler
- Album metadata crawler
## Network Configuration
| Setting | Default | Configurable Via |
|---------|---------|------------------|
| **Port** | 5001 | Docker/Gunicorn bind |
| **Path Prefix** | `/` | `APPLICATION_ROOT` env var |
| **Workers** | 1 | Gunicorn `-w` flag |
| **Worker Class** | uvicorn | Gunicorn `-k` flag |
## Related Ecosystem Components
### Lidarr Music Manager
The primary consumer of this API. Lidarr is an automated music collection manager for Usenet and BitTorrent users. It monitors multiple RSS feeds for new albums from favorite artists and grabs, sorts, and renames them.
**Integration**: Lidarr queries this API to enrich its local music library database with metadata, images, and biographical information.
### MusicBrainz Database
The authoritative source for music metadata. MusicBrainz is an open music encyclopedia that collects music metadata and makes it available to the public.
**Integration**: Direct PostgreSQL connection to a replicated MusicBrainz database instance. The API does NOT use the MusicBrainz web API; it queries the database directly for performance.
**Database size**: ~100GB+ for full MusicBrainz dataset with hourly replication.
### Cover Art Archive
A joint project between the Internet Archive and MusicBrainz providing cover art images for releases in the MusicBrainz database.
**Integration**: Images are proxied through `imagecache.lidarr.audio` CDN for performance and bandwidth optimization.
## Deployment Architecture
The application is designed for containerized deployment with Docker Compose. A typical production deployment includes:
| Container | Purpose | Resource Requirements |
|-----------|---------|----------------------|
| **musicbrainz** | PostgreSQL with MusicBrainz schema | 100GB+ storage, 4GB+ RAM |
| **solr** | Search index (artist/album) | 8GB+ storage, 2GB+ RAM |
| **redis** | Cache + rate limiting | 512MB RAM limit |
| **rabbitmq** | Search index updates | 1GB RAM |
| **indexer** | Solr index updater (SIR) | 512MB RAM |
| **api-v0.3** | Stable API version | 1GB+ RAM |
| **api-testing** | Development API version | 1GB+ RAM |
| **crawler** | Background cache warmer | 512MB RAM |
## Version Strategy
The project uses semantic versioning with a unique dual-deployment strategy:
- **v0.3**: Stable production version
- **testing**: Development/staging version
Both versions run simultaneously in production, allowing gradual rollout and A/B testing of new features.
## Configuration Management
Configuration is managed through a metaclass-based system with environment variable overrides:
```python
# Select configuration class
LIDARR_METADATA_CONFIG=lidarrmetadata.config.ProductionConfig
# Override specific settings (double underscore for nesting)
CACHE__REDIS_URL=redis://redis:6379/0
DATABASE__HOST=musicbrainz
```
## Key Features
### Multi-Source Aggregation
Combines data from 15+ external sources into unified artist/album responses:
- **Core metadata**: MusicBrainz database (direct SQL)
- **Images**: Cover Art Archive, FanArt.tv, TheAudioDB
- **Biographies**: Wikipedia (32 language fallback)
- **Cross-platform IDs**: Spotify, TheAudioDB, MusicBrainz
- **Charts**: Last.fm, Billboard, Apple Music, iTunes
### Intelligent Caching
Three-tier caching strategy:
1. **Redis**: Ephemeral cache (7-day TTL, 512MB limit, LFU eviction)
2. **PostgreSQL**: Persistent cache with zlib compression
3. **Cloudflare CDN**: Edge caching with programmatic invalidation
### Change Detection
Monitors MusicBrainz replication stream to detect updated artists/albums and invalidate stale cache entries. SQL queries track changes across 5 different update sources per entity type.
### Background Crawling
Proactive cache warming for recently updated entities. Crawlers run on configurable schedules to pre-fetch expensive metadata (Wikipedia overviews, FanArt images) before user requests.
### Provider Fallback Chain
Graceful degradation when external services are unavailable. Each metadata type has a primary provider and optional fallback providers with timeout handling.
## Performance Characteristics
| Metric | Value | Notes |
|--------|-------|-------|
| **Cache hit rate** | ~85%+ | With crawler enabled |
| **Cold request latency** | 2-5s | Multiple external API calls |
| **Cached request latency** | 50-200ms | Redis/PostgreSQL lookup |
| **CDN request latency** | 10-50ms | Cloudflare edge cache |
| **Database size** | 100GB+ | MusicBrainz full dataset |
| **Cache database size** | 10-50GB | Compressed metadata cache |
## API Response Format
All endpoints return JSON with consistent structure:
```json
{
"Id": "5b11f4ce-a62d-471e-81fc-a69a8278c7da",
"ArtistName": "Nirvana",
"Disambiguation": "90s US grunge band",
"Overview": "Nirvana was an American rock band...",
"Images": [
{
"Url": "https://imagecache.lidarr.audio/...",
"CoverType": "poster",
"Extension": ".jpg"
}
],
"Links": [
{
"Url": "https://www.spotify.com/artist/...",
"Name": "spotify"
}
],
"Genres": ["Grunge", "Alternative Rock"],
"Albums": [...]
}
```
## Security Posture
**Current state**: Development-focused with insecure defaults.
| Aspect | Status | Details |
|--------|--------|---------|
| **API authentication** | None | Read endpoints are public |
| **Admin authentication** | Single API key | `/invalidate` endpoint only |
| **Database credentials** | Hardcoded | `abc/abc` in multiple configs |
| **RabbitMQ credentials** | Hardcoded | `abc/abc` default |
| **HTTPS** | Not enforced | Relies on reverse proxy |
| **Rate limiting** | Optional | Disabled by default (NullRateLimiter) |
**Production recommendation**: Deploy behind authenticated reverse proxy (Cloudflare Access, OAuth2 Proxy, etc.).
## Monitoring and Observability
### Error Tracking
Sentry integration with custom rate limiting to prevent alert fatigue:
```python
sentry_sdk.init(
dsn=config.SENTRY_DSN,
integrations=[FlaskIntegration()],
release=f"lidarr-metadata@{__version__}"
)
```
Redis-backed deduplication prevents duplicate error reports.
### Metrics
StatsD/Telegraf integration for operational metrics:
- Provider request counts
- Response time histograms
- Cache hit/miss rates
- Rate limiter state
### Logging
Python standard library logging with per-module handlers:
- **DEBUG**: Detailed request/response logging
- **INFO**: Request summaries, cache operations
- **WARN**: Provider timeouts, fallback usage
- **ERROR**: Unhandled exceptions, data inconsistencies
## Development Workflow
### Local Development
```bash
# Install dependencies
poetry install
# Start infrastructure
docker-compose -f docker-compose.yml -f docker-compose.dev.yml up -d
# Run API server
LIDARR_METADATA_CONFIG=lidarrmetadata.config.DevelopmentConfig \
python -m lidarrmetadata.server
# Run tests (currently disabled in CI)
pytest tests/
```
### Testing
Test suite uses pytest with async support:
- `tests/test_config.py`: Configuration system (152 lines, most comprehensive)
- `tests/test_provider.py`: Provider mixin behavior
- `tests/test_cache.py`: Cache layer functionality
- `tests/test_api.py`: API endpoint responses
- `tests/test_util.py`: Utility functions
- `tests/test_app.py`: Application initialization
**Note**: Tests are commented out in Azure Pipelines CI configuration.
## Project Maturity Assessment
| Aspect | Maturity | Evidence |
|--------|----------|----------|
| **Production readiness** | High | Running in production for Lidarr ecosystem |
| **Code quality** | Medium | SonarCloud integration, but tests disabled |
| **Security** | Low | Hardcoded credentials, no auth on read endpoints |
| **Documentation** | Medium | README comprehensive, inline docs sparse |
| **Dependency freshness** | Low | Python 3.9, aioredis 1.x (deprecated) |
| **Test coverage** | Unknown | Tests disabled in CI |
| **Operational maturity** | High | Sentry, metrics, multi-tier caching, CDN integration |
## Relevance to Metadata Aggregator Project
This codebase represents the closest real-world implementation of a production metadata aggregation service. Key learnings:
1. **Multi-source enrichment pattern**: MusicBrainz as authoritative core + specialized providers for images/bios/charts
2. **Caching strategy**: Three-tier approach with compression and invalidation is battle-tested
3. **Provider architecture**: Mixin-based design allows flexible composition of data sources
4. **Change detection**: Monitoring upstream data sources for cache invalidation is critical
5. **Background crawling**: Proactive cache warming significantly improves user experience
6. **Direct database access**: Querying MusicBrainz DB directly (vs API) enables complex aggregations
7. **SQL aggregation**: Using `row_to_json` and `json_agg` to build nested JSON in database is highly efficient
## File Structure Overview
```
lidarrmetadata/
├── __init__.py # Version and package metadata
├── server.py # API server entry point
├── crawler.py # Background crawler entry point
├── app.py # Quart application factory + routes
├── api.py # Business logic layer
├── provider.py # Provider mixins and implementations
├── cache.py # Multi-tier cache implementation
├── config.py # Configuration metaclass system
├── util.py # Utility functions
├── sql/ # MusicBrainz SQL queries
│ ├── artist.sql
│ ├── album.sql
│ ├── updated_artists.sql
│ └── updated_albums.sql
└── providers/ # Individual provider implementations
├── musicbrainz_db.py
├── solr_search.py
├── fanart.py
├── theaudiodb.py
├── wikipedia.py
└── spotify.py
```
## Dependencies Analysis
### Production Dependencies (17 total)
**Web framework**:
- quart==0.14.1 (async Flask alternative)
- hypercorn (ASGI server, Quart dependency)
**Database**:
- asyncpg==0.26.0 (PostgreSQL async driver)
- aioredis==1.3.1 (Redis async client, deprecated)
**External APIs**:
- spotipy==2.16.1 (Spotify)
- pylast==4.3.0 (Last.fm)
- billboard-py==7.0.0 (Billboard charts)
- beautifulsoup4 (Wikipedia scraping)
**Utilities**:
- python-dateutil (date parsing)
- pytz (timezone handling)
- requests (HTTP client for sync operations)
- lxml (XML parsing)
**Monitoring**:
- sentry-sdk==0.19.5 (error tracking)
- statsd (metrics)
**Server**:
- gunicorn (WSGI server)
- uvicorn (ASGI worker)
### Development Dependencies
- pytest
- pytest-asyncio
- black (code formatting)
- flake8 (linting)
### Dependency Concerns
1. **Python 3.9**: End of life October 2025, should upgrade to 3.11+
2. **aioredis 1.3.1**: Deprecated, merged into redis-py 4.2+
3. **Quart 0.14.1**: Current version is 0.19+, missing 5 years of updates
4. **asyncpg 0.26.0**: Current version is 0.29+
5. **sentry-sdk 0.19.5**: Current version is 2.0+, missing major version
## Conclusion
LidarrAPI.Metadata is a production-grade metadata aggregation service with sophisticated caching, multi-source enrichment, and operational maturity. While it has technical debt (outdated dependencies, disabled tests, insecure defaults), its architecture and patterns provide an excellent reference for building a modern metadata aggregator.
The direct MusicBrainz database integration, provider fallback chain, and three-tier caching strategy are particularly valuable patterns to adopt.