feat: initial implementation of metadata aggregator

- gRPC service with MusicBrainz provider
- PostgreSQL schema with migrations
- Service layer with database-first caching
- Repository pattern for data access
- YAML configuration support
- Research documentation for 17 music metadata projects
2026-04-28 16:27:14 +02:00
commit a1f6701bac
163 changed files with 95884 additions and 0 deletions
@@ -0,0 +1,761 @@
+# Music Metadata API - Evaluation
+
+## Executive Summary
+
+Music Metadata API is a **simple, focused, self-contained** service for querying metadata on 256 million music tracks. It excels at batch lookups and ISRC-based queries but has no authentication, no tests, and no mechanism for refreshing its data.
+
+**Best for:** Self-hosted metadata enrichment, high-volume batch processing, ISRC resolution  
+**Not suitable for:** Real-time data, production systems requiring authentication, mission-critical applications without testing
+
+## Strengths
+
+### 1. Massive Dataset
+
+**256 million tracks** across two SQLite databases (~216GB)
+
+**Coverage:**
+- Tracks with ISRC codes
+- Albums with artwork, labels, release dates
+- Artists with genres, follower counts, popularity
+- Extended metadata (lyrics flags, languages, artist roles)
+
+**Comparison:**
+- Spotify Web API: Full catalog (real-time)
+- MusicBrainz: ~40M recordings
+- Discogs: ~15M releases
+
+**Value:** Comprehensive coverage for metadata enrichment without API rate limits.
+
+### 2. Extremely Simple Architecture
+
+**No framework, no ORM, minimal dependencies:**
+- Go stdlib for HTTP, JSON, database
+- 2 external packages (sqlite driver, rate limiter)
+- ~1,100 lines of code
+- Single binary deployment
+
+**Benefits:**
+- Easy to understand and modify
+- Fast compilation
+- No framework lock-in
+- Minimal attack surface
+
+**Comparison:**
+- Typical web service: 10+ dependencies, framework overhead
+- Music Metadata API: 2 dependencies, stdlib for everything else
+
+### 3. High-Performance Batch API
+
+**Batch endpoint:** Process up to 400 items per request
+
+**Performance gain:**
+- Individual requests: 400 × ~50ms = 20 seconds
+- Batch request: ~200-500ms total
+- **40-100x faster**
+
+**Query optimization:**
+- Without batching: 2,800+ queries for 400 tracks
+- With batching: 7 queries for 400 tracks
+- **400x fewer queries**
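
The figures above follow from simple arithmetic. A small sketch, using only the estimates quoted in this section (they are the evaluation's approximations, not benchmark results):

```python
# Estimates quoted in this section; none of these are measured values.
ITEMS = 400
PER_REQUEST_S = 0.050        # ~50 ms per individual lookup
BATCH_S = (0.200, 0.500)     # ~200-500 ms per batch request (fastest, slowest)

sequential_s = ITEMS * PER_REQUEST_S
speedup_low = sequential_s / BATCH_S[1]    # against the slowest batch
speedup_high = sequential_s / BATCH_S[0]   # against the fastest batch

QUERIES_UNBATCHED = 2800     # "2,800+" queries for 400 tracks
QUERIES_BATCHED = 7
queries_per_track = QUERIES_UNBATCHED / ITEMS
query_reduction = QUERIES_UNBATCHED / QUERIES_BATCHED

print(f"sequential: {sequential_s:.0f} s")
print(f"speedup: {speedup_low:.0f}x to {speedup_high:.0f}x")
print(f"queries per track without batching: {queries_per_track:.0f}")
print(f"query reduction: {query_reduction:.0f}x")
```

Read this way, 2,800 queries for 400 tracks is 7 queries per track, while the batch path issues 7 queries for the whole request.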
+
+**Use case:** Enriching large music libraries efficiently.
+
+### 4. Pure Go (No CGO)
+
+**CGO_ENABLED=0** - No C dependencies
+
+**Benefits:**
+- Cross-compilation trivial (GOOS/GOARCH)
+- No C toolchain required
+- Smaller attack surface
+- Easier deployment (static binary)
+
+**Tradeoff:** Larger binary than with a CGO SQLite driver (~2MB vs ~500KB)
+
+### 5. Read-Only Safety
+
+**Databases opened in read-only mode:**
+- No accidental writes
+- No data corruption risk
+- Safe concurrent reads
+- No write locks
+
+**Connection parameters:**
+```
+mode=ro
+_journal_mode=off
+_query_only=true
+```
+
+**Benefit:** Multiple instances can share database files safely.
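
The effect of a read-only open can be checked with any SQLite client. The sketch below uses Python's standard `sqlite3` module rather than the service's Go driver; the `tracks` table and the file name are invented for the demonstration. `mode=ro` is a standard SQLite URI parameter and `query_only` is a standard PRAGMA.

```python
import os
import sqlite3
import tempfile


def open_read_only(path):
    """Open an existing SQLite file so that writes are rejected."""
    # uri=True makes sqlite3 parse the "file:...?mode=ro" form.
    conn = sqlite3.connect(f"file:{path}?mode=ro", uri=True)
    # query_only additionally blocks statements that would modify the database.
    conn.execute("PRAGMA query_only = ON")
    return conn


# Build a throwaway database (hypothetical schema) to demonstrate the behaviour.
tmp_dir = tempfile.mkdtemp()
db_path = os.path.join(tmp_dir, "demo.sqlite")
writer = sqlite3.connect(db_path)
writer.execute("CREATE TABLE tracks (isrc TEXT PRIMARY KEY, name TEXT)")
writer.execute("INSERT INTO tracks VALUES ('XX0000000001', 'Example Track')")
writer.commit()
writer.close()

reader = open_read_only(db_path)
rows = reader.execute("SELECT name FROM tracks").fetchall()

try:
    reader.execute("INSERT INTO tracks VALUES ('XX0000000002', 'Another')")
    write_rejected = False
except sqlite3.OperationalError:
    # SQLite refuses the write on a read-only connection.
    write_rejected = True
finally:
    reader.close()

print(rows, write_rejected)
```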
+
+### 6. OpenAPI Documentation
+
+**Comprehensive OpenAPI 3.1 spec:**
+- All endpoints documented
+- Request/response schemas
+- Example payloads
+- Interactive Swagger UI at `/docs`
+
+**Value:** Self-documenting API, easy integration.
+
+### 7. MIT License
+
+**Permissive license:**
+- Free for commercial use
+- Attribution limited to keeping the copyright and license notice in copies
+- Modify and redistribute freely
+
+**Comparison:**
+- Spotify Web API: Proprietary, rate limited
+- MusicBrainz: CC0/Public Domain (data), GPL (server)
+
+### 8. Easy Deployment
+
+**Multiple deployment options:**
+- Standalone binary (single executable)
+- Docker container (official image)
+- Kubernetes (example manifests)
+- Cloud platforms (ECS, Cloud Run, ACI)
+
+**Minimal requirements:**
+- 216GB disk (databases)
+- 4GB RAM
+- 1 CPU core
+
+**No external dependencies:**
+- No database server (SQLite embedded)
+- No cache server (SQLite cache)
+- No message queue
+- No authentication service
+
+## Weaknesses
+
+### 1. Zero Test Coverage
+
+**No test files, no test framework, no CI testing**
+
+**Risks:**
+- No regression protection
+- Bugs discovered in production
+- Difficult to refactor safely
+- No documentation via tests
+
+**Evidence:**
+- `.gitignore` includes `coverage.out`, which suggests testing was planned but never implemented
+- GitHub Actions workflow has no test step
+
+**Impact:** High risk for production use without extensive manual testing.
+
+### 2. No Authentication
+
+**Public API with no access control:**
+- No OAuth
+- No API keys
+- No rate limiting per user (only per IP)
+- No usage tracking per user
+
+**Risks:**
+- Abuse (unlimited queries)
+- No accountability
+- No quota enforcement
+- Data scraping
+
+**Workarounds:**
+- Deploy behind reverse proxy with auth (nginx, Caddy)
+- Use API gateway (Kong, Tyk)
+- Implement custom middleware
+
+**Impact:** Not suitable for public internet deployment without additional security layer.
+
+### 3. Naive Health Check
+
+**Health endpoint always returns OK:**
+```go
+func handleHealth(w http.ResponseWriter, r *http.Request) {
+    json.NewEncoder(w).Encode(map[string]string{"status": "ok"})
+}
+```
+
+**Problem:** Doesn't verify database connectivity
+
+**Scenario:**
+- Database file deleted/corrupted
+- Health check returns 200 OK
+- Actual queries fail with 500 errors
+- Monitoring systems don't detect failure
+
+**Impact:** Monitoring reports a healthy service during an outage, which delays incident detection.
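
A deeper check would run a trivial query against each database before reporting healthy. The sketch below shows that logic in Python for consistency with the client examples in this document; the service itself is written in Go, and the function names and response shape here are assumptions, not the project's code.

```python
import os
import sqlite3
import tempfile


def check_database(path):
    """Return True only if the file opens read-only and answers a trivial query."""
    try:
        conn = sqlite3.connect(f"file:{path}?mode=ro", uri=True)
        try:
            # Reading sqlite_master forces SQLite to validate the file header.
            conn.execute("SELECT count(*) FROM sqlite_master").fetchone()
        finally:
            conn.close()
        return True
    except sqlite3.Error:
        return False


def health_status(db_paths):
    """Return (HTTP status code, JSON-serialisable body) for a health endpoint."""
    results = {path: check_database(path) for path in db_paths}
    healthy = all(results.values())
    body = {"status": "ok" if healthy else "unavailable", "databases": results}
    return (200 if healthy else 503), body


# Demonstration: one valid database, one corrupted file, one missing file.
tmp_dir = tempfile.mkdtemp()
good = os.path.join(tmp_dir, "good.sqlite")
conn = sqlite3.connect(good)
conn.execute("CREATE TABLE t (x)")
conn.commit()
conn.close()

corrupt = os.path.join(tmp_dir, "corrupt.sqlite")
with open(corrupt, "wb") as fh:
    fh.write(b"not a database " * 400)

missing = os.path.join(tmp_dir, "missing.sqlite")

status, body = health_status([good, corrupt, missing])
print(status, body["status"])
```

Returning 503 when any database fails the probe lets a load balancer or orchestrator take the instance out of rotation instead of routing queries that will fail.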
+
+### 4. Rate Limiter Memory Leak
+
+**Visitor map grows unbounded:**
+```go
+type RateLimiter struct {
+    visitors map[string]*rate.Limiter  // Never cleaned up
+    mu       sync.RWMutex
+}
+```
+
+**Impact:**
+- Long-running servers accumulate IPs
+- Memory usage grows over time
+- 1M unique IPs ≈ 100MB of retained memory (assuming roughly 100 bytes per entry)
+
+**Workaround:** Restart server periodically
+
+**Fix required:** Implement visitor cleanup (remove inactive IPs after 24 hours)
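
The cleanup idea is to record a last-seen time per visitor and periodically drop entries that have been idle too long. The sketch below illustrates it in Python; the class and method names are hypothetical, and a Go version would keep the same timestamp next to each `*rate.Limiter` and run the eviction from a background goroutine while holding the mutex.

```python
import time


class VisitorTable:
    """Per-IP state with a last-seen timestamp so idle entries can be evicted."""

    def __init__(self, max_idle_seconds=24 * 60 * 60, clock=time.monotonic):
        self._max_idle = max_idle_seconds
        self._clock = clock
        self._last_seen = {}  # ip -> time of most recent request

    def touch(self, ip):
        """Record a request from this IP."""
        self._last_seen[ip] = self._clock()

    def evict_idle(self):
        """Remove entries idle for longer than max_idle_seconds; return the count."""
        cutoff = self._clock() - self._max_idle
        stale = [ip for ip, seen in self._last_seen.items() if seen < cutoff]
        for ip in stale:
            del self._last_seen[ip]
        return len(stale)

    def __len__(self):
        return len(self._last_seen)


# Demonstration with a fake clock so the example runs instantly.
now = [0.0]
table = VisitorTable(max_idle_seconds=60, clock=lambda: now[0])
table.touch("203.0.113.1")    # seen at t=0
now[0] = 50.0
table.touch("203.0.113.2")    # seen at t=50
now[0] = 100.0
evicted = table.evict_idle()  # cutoff is t=40, so only the first address is stale
print(evicted, len(table))
```

With eviction in place, memory is bounded by the number of distinct IPs seen within the idle window rather than over the lifetime of the process.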
+
+### 5. No CORS Support
+
+**No CORS headers:**
+- Browser-based clients blocked
+- Can't call from web apps directly
+- OPTIONS preflight requests fail
+
+**Workarounds:**
+- Add CORS middleware (custom implementation)
+- Use server-side proxy
+- Deploy API on same origin as web app
+
+**Impact:** Limited to server-side integrations.
+
+### 6. No Metrics/Monitoring
+
+**No instrumentation:**
+- No Prometheus metrics
+- No request counters
+- No latency histograms
+- No error rate tracking
+
+**Visibility gaps:**
+- Can't track usage patterns
+- Can't identify slow endpoints
+- Can't detect error spikes
+- No performance baselines
+
+**Workarounds:**
+- Parse logs for metrics
+- Use reverse proxy metrics (nginx)
+- Implement custom metrics middleware
+
+**Impact:** Blind operation, difficult to optimize.
+
+### 7. Database Provenance Unclear
+
+**Repository disclaimer:**
+> "This project is not affiliated with Spotify."
+
+**Concerns:**
+- Data source unclear (likely scraped)
+- Legal status uncertain
+- No official Spotify endorsement
+- Potential copyright issues
+
+**Risks:**
+- Takedown requests
+- Legal liability
+- Data quality unknown
+- No support/updates
+
+**Recommendation:** Verify legal compliance before production use.
+
+### 8. No Data Freshness Mechanism
+
+**Static snapshot:**
+- No update mechanism
+- Data frozen at time of database creation
+- No real-time sync with Spotify
+
+**Staleness:**
+- New releases not included
+- Popularity scores outdated
+- Artist follower counts stale
+- Deleted tracks still present
+
+**Workarounds:**
+- Periodically obtain updated database (if available)
+- Complement with real-time APIs for fresh data
+- Treat as historical snapshot
+
+**Impact:** Not suitable for applications requiring current data.
+
+### 9. Search Performance
+
+**LIKE %query% on 256M rows:**
+- Full table scan (can't use indexes)
+- 10-second timeout (can be hit)
+- CPU-intensive
+
+**Slow searches:**
+- Common words ("love", "the"): 5-10 seconds
+- Rare queries: 10+ seconds (full scan)
+
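+**Alternative:** SQLite FTS5 (Full-Text Search)
+- Building the index requires write access, so it cannot be created through the service's read-only connections
+- The index would need to be built offline, for example in a separate FTS5 database
+
+**Impact:** Search is practical only for occasional queries that can tolerate multi-second latency.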
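
SQLite's `EXPLAIN QUERY PLAN` makes the indexing problem visible. The sketch below uses an in-memory database with an invented `tracks` table and index (the real schema is not shown in this evaluation) and compares a substring `LIKE` with an exact match.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema, used only to illustrate the planner's behaviour.
conn.execute("CREATE TABLE tracks (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE INDEX idx_tracks_name ON tracks(name)")


def plan(sql, params=()):
    """Return the planner's description of how it would run the query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The last column of each row holds the human-readable detail text.
    return " | ".join(row[-1] for row in rows)


substring_plan = plan("SELECT id FROM tracks WHERE name LIKE ?", ("%love%",))
exact_plan = plan("SELECT id FROM tracks WHERE name = ?", ("love",))

print("substring:", substring_plan)
print("exact:    ", exact_plan)
```

The substring query is reported as a `SCAN` (every row is visited) while the exact match is a `SEARCH` using the index. The exact wording of the plan varies between SQLite versions.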
+
+### 10. Hardcoded Configuration
+
+**All limits/timeouts hardcoded:**
+- Rate limit: 100 req/s, 200 burst
+- Search timeout: 10 seconds
+- Batch limit: 400 items
+- Connection pool: 8 connections
+- SQLite cache: 64MB
+
+**Problems:**
+- No flexibility
+- Requires recompilation to change
+- No environment-specific config
+
+**Workaround:** Fork and modify code
+
+**Impact:** Limited adaptability to different workloads.
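
One common way to lift such limits out of the code is to read them from environment variables and fall back to the current values. The sketch below shows the pattern in Python; the variable names are hypothetical (the service defines none of them), and the defaults are the hardcoded values listed above.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    # Defaults mirror the hardcoded values listed in this section.
    rate_limit_rps: int = 100
    rate_limit_burst: int = 200
    search_timeout_s: int = 10
    batch_limit: int = 400
    db_pool_size: int = 8
    sqlite_cache_mb: int = 64


# Hypothetical environment variable names, one per field.
ENV_NAMES = {
    "rate_limit_rps": "RATE_LIMIT_RPS",
    "rate_limit_burst": "RATE_LIMIT_BURST",
    "search_timeout_s": "SEARCH_TIMEOUT_SECONDS",
    "batch_limit": "BATCH_LIMIT",
    "db_pool_size": "DB_POOL_SIZE",
    "sqlite_cache_mb": "SQLITE_CACHE_MB",
}


def load_config(env=None):
    """Build a Config from the environment, keeping defaults for unset variables."""
    env = os.environ if env is None else env
    overrides = {}
    for field, name in ENV_NAMES.items():
        raw = env.get(name)
        if raw is None:
            continue
        value = int(raw)  # raises ValueError on non-numeric input
        if value <= 0:
            raise ValueError(f"{name} must be positive, got {value}")
        overrides[field] = value
    return Config(**overrides)


defaults = load_config({})
tuned = load_config({"BATCH_LIMIT": "200", "SEARCH_TIMEOUT_SECONDS": "30"})
print(defaults.batch_limit, tuned.batch_limit, tuned.search_timeout_s)
```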
+
+## Use Case Evaluation
+
+### Ideal Use Cases
+
+#### 1. Music Library Enrichment
+
+**Scenario:** Enrich local music library with metadata
+
+**Flow:**
+1. Read ISRCs from file tags, or resolve them by fingerprint (AcoustID returns MusicBrainz recording IDs, which carry ISRCs)
+2. Batch lookup ISRCs (400 at a time)
+3. Store metadata in local database
+4. Display in music player UI
+
+**Why suitable:**
+- Batch API optimized for bulk lookups
+- ISRC-based lookup (industry standard)
+- No external API rate limits (self-hosted; only the built-in per-IP limiter applies)
+- Comprehensive metadata (genres, images, popularity)
+
+**Example:**
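+```python
+import requests
+
+def chunks(items, size):
+    # Yield successive slices of at most `size` items
+    for start in range(0, len(items), size):
+        yield items[start:start + size]
+
+# Enrich 10,000 tracks (extract_isrcs_from_library and store_metadata are placeholders)
+isrcs = extract_isrcs_from_library()  # 10,000 ISRCs
+
+# Batch lookup (25 requests for 10,000 tracks)
+for batch in chunks(isrcs, 400):
+    response = requests.post("http://localhost:8080/batch/lookup", json={"isrcs": batch})
+    response.raise_for_status()  # stop on HTTP errors instead of storing an error body
+    store_metadata(response.json())
+```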
+
+#### 2. Metadata Aggregator Pipeline
+
+**Scenario:** Combine data from multiple sources (MusicBrainz + Music Metadata API)
+
+**Flow:**
+1. Query MusicBrainz for recording by MBID
+2. Extract ISRC from MusicBrainz response
+3. Lookup ISRC in Music Metadata API
+4. Merge metadata (MusicBrainz credits + Spotify-style data)
+
+**Why suitable:**
+- Complements MusicBrainz (different data models)
+- ISRC as common key
+- Fast batch lookups
+- No external API dependencies
+
+**Example:**
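+```python
+import requests
+
+# Get MusicBrainz data (`musicbrainz` is a placeholder for the client library in use)
+mb_data = musicbrainz.get_recording(mbid)
+isrcs = mb_data.get('isrcs') or []
+if not isrcs:
+    raise LookupError(f"No ISRC on MusicBrainz recording {mbid}")
+isrc = isrcs[0]
+
+# Get Spotify-style data
+response = requests.get(f"http://localhost:8080/lookup/isrc/{isrc}")
+response.raise_for_status()
+mm_data = response.json()
+
+# Merge
+merged = {
+    "mbid": mbid,
+    "isrc": isrc,
+    "title": mm_data['name'],
+    "popularity": mm_data['popularity'],
+    "credits": mb_data['artist-credit'],
+    "genres": mm_data['artists'][0]['genres']
+}
+```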
+
+#### 3. Self-Hosted Alternative to Spotify API
+
+**Scenario:** Replace Spotify Web API with local service
+
+**Why suitable:**
+- No OAuth complexity
+- No external API rate limits (only the built-in per-IP limiter)
+- No per-request costs
+- Batch support (400 items vs Spotify's 50)
+
+**Tradeoffs:**
+- Static data (no real-time updates)
+- Database size (216GB)
+- No write operations
+
+**Example:**
+```python
+import requests
+
+# Spotify Web API (rate limited, requires OAuth); spotify_client is an authenticated client
+spotify_data = spotify_client.search(q=f"isrc:{isrc}", type="track")
+
+# Music Metadata API (no auth; only the service's own per-IP limiter applies)
+mm_data = requests.get(f"http://localhost:8080/lookup/isrc/{isrc}").json()
+```
+
+#### 4. DJ Software Metadata Provider
+
+**Scenario:** Enrich DJ library with popularity, genres, images
+
+**Why suitable:**
+- Batch processing for large libraries
+- Popularity scores for track selection
+- Genre tags for filtering
+- Album artwork for UI
+
+**Example:**
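+```python
+import requests
+
+def chunks(items, size):
+    # Yield successive slices of at most `size` items
+    for start in range(0, len(items), size):
+        yield items[start:start + size]
+
+# Enrich DJ library (load_dj_library and update_dj_library are placeholders)
+tracks = load_dj_library()  # 5,000 tracks
+isrcs = [t.isrc for t in tracks if t.isrc]  # skip tracks without an ISRC
+
+# Batch lookup
+for batch in chunks(isrcs, 400):
+    response = requests.post("http://localhost:8080/batch/lookup", json={"isrcs": batch})
+    response.raise_for_status()
+    update_dj_library(response.json())
+```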
+
+### Unsuitable Use Cases
+
+#### 1. Real-Time Music Discovery App
+
+**Why unsuitable:**
+- Static data (no new releases)
+- Outdated popularity scores
+- No personalization
+- No user-specific data
+
+**Alternative:** Spotify Web API, Apple Music API
+
+#### 2. Public-Facing API Service
+
+**Why unsuitable:**
+- No authentication (abuse risk)
+- No usage tracking
+- No quota enforcement
+- Rate limiter memory leak
+
+**Alternative:** Add authentication layer or use managed API service
+
+#### 3. Mission-Critical Production System
+
+**Why unsuitable:**
+- Zero test coverage
+- Naive health check
+- Memory leak
+- No metrics
+
+**Alternative:** Extensive testing + monitoring before production use
+
+#### 4. Applications Requiring Fresh Data
+
+**Why unsuitable:**
+- Static snapshot (no updates)
+- Stale popularity/follower counts
+- Missing new releases
+
+**Alternative:** Spotify Web API, MusicBrainz (community-updated)
+
+## Integration Evaluation
+
+### Complementary Services
+
+**Works well with:**
+- **MusicBrainz:** Different data models, ISRC as common key
+- **AcoustID:** Fingerprint to MusicBrainz recording ID to ISRC, then lookup metadata
+- **Local music libraries:** Enrich with metadata
+- **DJ software:** Popularity, genres, artwork
+
+**Conflicts with:**
+- **Spotify Web API:** Overlapping data, but Music Metadata API is static
+- **Real-time services:** Music Metadata API data is stale
+
+### Integration Complexity
+
+**Easy integrations:**
+- HTTP client (any language)
+- Batch processing pipelines
+- Local applications
+
+**Complex integrations:**
+- Browser-based apps (no CORS)
+- Authenticated services (no auth)
+- Real-time systems (static data)
+
+## Performance Evaluation
+
+### Throughput
+
+**Batch endpoint:**
+- 400 items per request
+- ~200-500ms per request
+- **800-2,000 items/second** (single instance)
+
+**Individual endpoints:**
+- ~50ms per request
+- Rate limited to 100 req/s
+- **100 items/second** (single instance)
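
These throughput figures can be reproduced from the latencies quoted in this document. The sketch below assumes batch requests are issued one at a time and uses the 10,000-track library from the enrichment example as a workload; all inputs are the evaluation's estimates.

```python
BATCH_SIZE = 400
BATCH_LATENCY_S = (0.2, 0.5)    # fastest, slowest batch request
INDIVIDUAL_LATENCY_S = 0.05     # ~50 ms per individual request
RATE_LIMIT_RPS = 100            # per-IP limit on individual requests

# Items per second when batches are sent back to back (slowest, fastest).
batch_items_per_s = (
    BATCH_SIZE / BATCH_LATENCY_S[1],
    BATCH_SIZE / BATCH_LATENCY_S[0],
)

# A single client waiting for each response tops out well below the rate limit.
sequential_rps = 1 / INDIVIDUAL_LATENCY_S
concurrency_to_reach_limit = RATE_LIMIT_RPS * INDIVIDUAL_LATENCY_S


def seconds_to_process(n_items, items_per_second):
    """Time needed to push n_items through at a steady rate."""
    return n_items / items_per_second


LIBRARY = 10_000
print(f"batch: {batch_items_per_s[0]:.0f}-{batch_items_per_s[1]:.0f} items/s")
print(f"one sequential client: {sequential_rps:.0f} req/s")
print(f"concurrent requests needed to reach the limit: {concurrency_to_reach_limit:.0f}")
print(f"10,000 tracks via batch: "
      f"{seconds_to_process(LIBRARY, batch_items_per_s[1]):.1f}-"
      f"{seconds_to_process(LIBRARY, batch_items_per_s[0]):.1f} s")
print(f"10,000 tracks individually at the limit: "
      f"{seconds_to_process(LIBRARY, RATE_LIMIT_RPS):.0f} s")
```

Note that the 100 items/second figure is a ceiling set by the rate limiter: at ~50ms per request, a client needs about five requests in flight to reach it.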
+
+**Scaling:**
+- Horizontal: Run multiple instances (read-only safe)
+- Vertical: More RAM (larger cache), faster disk (SSD)
+
+### Latency
+
+**Typical latencies:**
+- Track lookup: 10-50ms
+- Album lookup: 10-50ms
+- Artist lookup: 10-50ms
+- Batch lookup (400 items): 200-500ms
+- Search: 1-10 seconds (depends on query)
+
+**Bottlenecks:**
+- Search queries (LIKE %query%)
+- Disk I/O (use SSD)
+- Rate limiter (RWMutex contention)
+
+### Resource Usage
+
+**Disk:** 216GB (databases)  
+**RAM:** 2.5GB (SQLite cache + mmap) + 1.5GB (app/OS) = 4GB minimum  
+**CPU:** 1 core minimum, 2+ recommended (search queries CPU-intensive)
+
+**Scaling costs:**
+- 10 instances = 2.16TB storage (expensive)
+- Shared filesystem (NFS, EFS) reduces storage cost but increases latency
+
+## Security Evaluation
+
+### Vulnerabilities
+
+**High severity:**
+- **No authentication:** Anyone can query API
+- **No rate limiting per user:** IP-based only (easily bypassed)
+
+**Medium severity:**
+- **Memory leak:** Rate limiter grows unbounded
+- **No input validation beyond parameterized queries:** SQL injection is mitigated, but inputs are not otherwise checked
+
+**Low severity:**
+- **No HTTPS:** Deploy behind reverse proxy with TLS
+- **No CORS headers:** Browsers block cross-origin reads by default, which limits browser-based abuse
+
+### Mitigations
+
+**Authentication:**
+- Deploy behind reverse proxy with auth (nginx, Caddy)
+- Use API gateway (Kong, Tyk)
+
+**Rate limiting:**
+- Implement per-user rate limiting (requires auth)
+- Use distributed rate limiter (Redis)
+
+**Memory leak:**
+- Restart server periodically
+- Implement visitor cleanup
+
+**HTTPS:**
+- Terminate TLS at reverse proxy
+- Use Let's Encrypt for free certificates
+
+## Reliability Evaluation
+
+### Failure Modes
+
+**Database unavailable:**
+- Health check returns OK (false positive)
+- Queries fail with 500 errors
+- No automatic recovery
+
+**Memory exhaustion:**
+- Rate limiter leak accumulates
+- OOM kill by OS
+- Service restart required
+
+**Disk full:**
+- SQLite read-only (no writes)
+- No impact on service
+
+**Network partition:**
+- No external dependencies
+- Service continues (self-contained)
+
+### Recovery
+
+**Automatic recovery:**
+- Graceful shutdown on SIGINT/SIGTERM
+- Docker/Kubernetes restart on failure
+
+**Manual recovery:**
+- Restart service (clears rate limiter leak)
+- Restore database from backup
+- Check database integrity (PRAGMA integrity_check)
+
+### High Availability
+
+**Strategies:**
+- Run multiple instances (read-only safe)
+- Load balancer distributes traffic
+- Health checks route around failures (but naive health check is a problem)
+
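+**Limitations:**
+- No shared state (the rate limiter is per-instance, so limits are not enforced globally)
+- No session affinity required (a benefit: any instance can serve any request)
+- Database replication is manual (copy files to each instance)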
+
+## Cost Evaluation
+
+### Infrastructure Costs
+
+**Single instance:**
+- Compute: $20-50/month (2 CPU, 8GB RAM)
+- Storage: $20-40/month (250GB SSD)
+- Network: $5-10/month (1TB transfer)
+- **Total: $45-100/month**
+
+**10 instances (high availability):**
+- Compute: $200-500/month
+- Storage: $200-400/month (2.5TB SSD, or shared filesystem)
+- Network: $50-100/month
+- **Total: $450-1,000/month**
+
+**Comparison:**
+- Spotify Web API: No per-request fees; access is rate limited and governed by Spotify's developer terms
+- MusicBrainz: Free (donations encouraged)
+
+### Development Costs
+
+**Initial setup:**
+- Deploy service: 1-2 hours
+- Obtain databases: Unknown (not in repository)
+- Test integration: 2-4 hours
+- **Total: 3-6 hours** (plus the unknown time to obtain the databases)
+
+**Ongoing maintenance:**
+- Monitor service: 1-2 hours/month
+- Update databases: Unknown (no update mechanism)
+- Security patches: 1-2 hours/month
+- **Total: 2-4 hours/month**
+
+### Total Cost of Ownership
+
+**Year 1:**
+- Infrastructure: $540-1,200 (single instance)
+- Development: $400-800 (setup + 12 months maintenance)
+- **Total: $940-2,000**
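
The totals above can be reproduced from the per-item estimates in this section. A short sketch, treating each estimate as a (low, high) range in USD:

```python
# Monthly single-instance estimates from this section, as (low, high) in USD.
MONTHLY = {
    "compute": (20, 50),
    "storage": (20, 40),
    "network": (5, 10),
}
DEVELOPMENT_YEAR_1 = (400, 800)  # setup plus 12 months of maintenance


def add(a, b):
    """Add two (low, high) ranges element-wise."""
    return (a[0] + b[0], a[1] + b[1])


def scale(r, factor):
    """Multiply a (low, high) range by a constant."""
    return (r[0] * factor, r[1] * factor)


monthly_total = (0, 0)
for cost in MONTHLY.values():
    monthly_total = add(monthly_total, cost)

infrastructure_year_1 = scale(monthly_total, 12)
tco_year_1 = add(infrastructure_year_1, DEVELOPMENT_YEAR_1)
ten_instances_monthly = scale(monthly_total, 10)

print("monthly:", monthly_total)
print("infrastructure, year 1:", infrastructure_year_1)
print("total, year 1:", tco_year_1)
print("ten instances, monthly:", ten_instances_monthly)
```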
+
+**Comparison:**
+- Spotify Web API: $0 in API fees (rate limited, subject to Spotify's developer terms)
+- MusicBrainz: $0 (free, donations encouraged)
+
+## Recommendation Matrix
+
+| Use Case | Suitability | Reasoning |
+|----------|-------------|-----------|
+| Music library enrichment | ⭐⭐⭐⭐⭐ | Ideal: Batch API, ISRC lookup, no rate limits |
+| Metadata aggregator | ⭐⭐⭐⭐⭐ | Ideal: Complements MusicBrainz, fast lookups |
+| Self-hosted alternative | ⭐⭐⭐⭐ | Good: No auth complexity, but static data |
+| DJ software integration | ⭐⭐⭐⭐ | Good: Popularity, genres, artwork |
+| Real-time music app | ⭐⭐ | Poor: Static data, no updates |
+| Public API service | ⭐⭐ | Poor: No auth, no metrics, memory leak |
+| Mission-critical system | ⭐ | Very poor: No tests, naive health check |
+| Fresh data required | ⭐ | Very poor: Static snapshot, no updates |
+
+**Legend:**
+- ⭐⭐⭐⭐⭐ Ideal
+- ⭐⭐⭐⭐ Good
+- ⭐⭐⭐ Acceptable
+- ⭐⭐ Poor
+- ⭐ Very poor
+
+## Final Verdict
+
+### Overall Rating: 7/10
+
+**Breakdown:**
+- **Functionality:** 9/10 (comprehensive metadata, batch API)
+- **Performance:** 8/10 (fast batch, slow search)
+- **Reliability:** 5/10 (no tests, memory leak, naive health check)
+- **Security:** 4/10 (no auth, IP-only rate limiting)
+- **Maintainability:** 6/10 (simple code, but no tests)
+- **Documentation:** 8/10 (OpenAPI spec, but minimal code comments)
+
+### Strengths Summary
+
+1. Massive dataset (256M tracks)
+2. Simple architecture (no framework)
+3. High-performance batch API (400 items/request)
+4. Pure Go (no CGO)
+5. Read-only safety
+6. OpenAPI documentation
+7. MIT license
+8. Easy deployment
+
+### Weaknesses Summary
+
+1. Zero test coverage
+2. No authentication
+3. Naive health check
+4. Rate limiter memory leak
+5. No CORS
+6. No metrics
+7. Database provenance unclear
+8. No data freshness
+9. Slow search (LIKE %query%)
+10. Hardcoded configuration
+
+### Recommendation
+
+**Use Music Metadata API if:**
+- You need to enrich large music libraries (batch processing)
+- You want ISRC-based lookups without API rate limits
+- You can tolerate static data (no real-time updates)
+- You can deploy behind reverse proxy (for auth/CORS)
+- You can implement monitoring (metrics, proper health checks)
+- You can accept legal uncertainty (database provenance)
+
+**Don't use Music Metadata API if:**
+- You need real-time data (use Spotify Web API)
+- You need production-grade reliability (no tests)
+- You need authentication out-of-the-box
+- You need fresh data (new releases, current popularity)
+- You can't tolerate 216GB storage requirement
+
+### Improvement Priorities
+
+**Critical (before production):**
+1. Add test coverage (unit + integration tests)
+2. Fix rate limiter memory leak
+3. Implement proper health check (verify database)
+4. Add authentication (or deploy behind auth proxy)
+
+**High priority:**
+1. Add metrics/monitoring (Prometheus)
+2. Implement CORS support
+3. Extract hardcoded config (environment variables)
+4. Use LOG_LEVEL environment variable
+
+**Medium priority:**
+1. Improve search performance (FTS5)
+2. Add request logging
+3. Structured error responses
+4. Documentation (code comments)
+
+**Low priority:**
+1. Caching layer (Redis)
+2. Horizontal scaling improvements
+3. Database update mechanism
+4. Admin API (stats, cache control)