
# Music Metadata API - Evaluation
## Executive Summary
Music Metadata API is a **simple, focused, self-contained** service for querying metadata on 256 million music tracks. It excels at batch lookups and ISRC-based queries but lacks authentication, testing, and real-time data updates.
**Best for:** Self-hosted metadata enrichment, high-volume batch processing, ISRC resolution
**Not suitable for:** Real-time data, production systems requiring authentication, mission-critical applications without testing
## Strengths
### 1. Massive Dataset
**256 million tracks** across two SQLite databases (~216GB)
**Coverage:**
- Tracks with ISRC codes
- Albums with artwork, labels, release dates
- Artists with genres, follower counts, popularity
- Extended metadata (lyrics flags, languages, artist roles)
**Comparison:**
- Spotify Web API: Full catalog (real-time)
- MusicBrainz: ~40M recordings
- Discogs: ~15M releases
**Value:** Comprehensive coverage for metadata enrichment without API rate limits.
### 2. Extremely Simple Architecture
**No framework, no ORM, minimal dependencies:**
- Go stdlib for HTTP, JSON, database
- 2 external packages (sqlite driver, rate limiter)
- ~1,100 lines of code
- Single binary deployment
**Benefits:**
- Easy to understand and modify
- Fast compilation
- No framework lock-in
- Minimal attack surface
**Comparison:**
- Typical web service: 10+ dependencies, framework overhead
- Music Metadata API: 2 dependencies, stdlib only
### 3. High-Performance Batch API
**Batch endpoint:** Process up to 400 items per request
**Performance gain:**
- Individual requests: 400 × ~50ms = 20 seconds
- Batch request: ~200-500ms total
- **40-100x faster**
**Query optimization:**
- Without batching: 2,800+ queries for 400 tracks
- With batching: 7 queries for 400 tracks
- **400x fewer queries**
**Use case:** Enriching large music libraries efficiently.
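The figures above can be re-derived with a few lines of arithmetic. This is only a sanity check of the quoted numbers, assuming the 400 individual requests are issued one after another; the variable names are illustrative.

```python
# Back-of-the-envelope check of the batch figures quoted above.
ITEMS = 400
PER_REQUEST_MS = 50        # approximate latency of one individual lookup
BATCH_MS = (200, 500)      # approximate latency range of one batch request

sequential_ms = ITEMS * PER_REQUEST_MS        # 20,000 ms = 20 s
speedup_low = sequential_ms / BATCH_MS[1]     # slowest batch: 40x
speedup_high = sequential_ms / BATCH_MS[0]    # fastest batch: 100x

queries_unbatched = 2_800  # "2,800+" in the text, i.e. about 7 per track
queries_batched = 7
query_reduction = queries_unbatched / queries_batched  # 400x

print(f"sequential: {sequential_ms / 1000:.0f} s")
print(f"speedup: {speedup_low:.0f}x to {speedup_high:.0f}x")
print(f"query reduction: {query_reduction:.0f}x")
```

A client that issues individual requests concurrently would narrow the latency gap, but not the difference in query count.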
### 4. Pure Go (No CGO)
**CGO_ENABLED=0** - No C dependencies
**Benefits:**
- Cross-compilation trivial (GOOS/GOARCH)
- No C toolchain required
- Smaller attack surface
- Easier deployment (static binary)
**Tradeoff:** The pure-Go SQLite driver produces a larger binary than a CGO driver would (~2MB vs ~500KB)
### 5. Read-Only Safety
**Databases opened in read-only mode:**
- No accidental writes
- No data corruption risk
- Safe concurrent reads
- No write locks
**Connection parameters and PRAGMAs:**
```
mode=ro
_journal_mode=off
_query_only=true
```
**Benefit:** Multiple instances can share database files safely.
### 6. OpenAPI Documentation
**Comprehensive OpenAPI 3.1 spec:**
- All endpoints documented
- Request/response schemas
- Example payloads
- Interactive Swagger UI at `/docs`
**Value:** Self-documenting API, easy integration.
### 7. MIT License
**Permissive license:**
- Free for commercial use
- Attribution limited to retaining the copyright and license notice in copies
- Modify and redistribute freely
**Comparison:**
- Spotify Web API: Proprietary, rate limited
- MusicBrainz: CC0 (core data), CC BY-NC-SA (supplementary data), GPL (server)
### 8. Easy Deployment
**Multiple deployment options:**
- Standalone binary (single executable)
- Docker container (official image)
- Kubernetes (example manifests)
- Cloud platforms (ECS, Cloud Run, ACI)
**Minimal requirements:**
- 216GB disk (databases)
- 4GB RAM
- 1 CPU core
**No external dependencies:**
- No database server (SQLite embedded)
- No cache server (SQLite cache)
- No message queue
- No authentication service
## Weaknesses
### 1. Zero Test Coverage
**No test files, no test framework, no CI testing**
**Risks:**
- No regression protection
- Bugs discovered in production
- Difficult to refactor safely
- No documentation via tests
**Evidence:**
- `.gitignore` includes `coverage.out` (testing planned but not implemented)
- GitHub Actions workflow has no test step
**Impact:** High risk for production use without extensive manual testing.
### 2. No Authentication
**Public API with no access control:**
- No OAuth
- No API keys
- No rate limiting per user (only per IP)
- No usage tracking per user
**Risks:**
- Abuse (unlimited queries)
- No accountability
- No quota enforcement
- Data scraping
**Workarounds:**
- Deploy behind reverse proxy with auth (nginx, Caddy)
- Use API gateway (Kong, Tyk)
- Implement custom middleware
**Impact:** Not suitable for public internet deployment without additional security layer.
### 3. Naive Health Check
**Health endpoint always returns OK:**
```go
func handleHealth(w http.ResponseWriter, r *http.Request) {
	json.NewEncoder(w).Encode(map[string]string{"status": "ok"})
}
```
**Problem:** Doesn't verify database connectivity
**Scenario:**
- Database file deleted/corrupted
- Health check returns 200 OK
- Actual queries fail with 500 errors
- Monitoring systems don't detect failure
**Impact:** False positives in monitoring, delayed incident detection.
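A health check that actually touches the database closes this gap. The sketch below shows the idea in Python rather than in the service's Go handler; `check_health`, the 503 status for failures, and the demo paths are assumptions made for illustration.

```python
import os
import sqlite3
import tempfile
from typing import Tuple


def check_health(db_path: str) -> Tuple[int, dict]:
    """Return (HTTP status, JSON body) after probing the database."""
    try:
        # mode=ro fails if the file is missing instead of creating it.
        conn = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True, timeout=1.0)
        try:
            # Reading the schema table fails on a corrupted or non-SQLite file.
            conn.execute("SELECT 1 FROM sqlite_master LIMIT 1").fetchall()
        finally:
            conn.close()
    except sqlite3.Error as exc:
        return 503, {"status": "unavailable", "error": str(exc)}
    return 200, {"status": "ok"}


# Demonstration: one valid database, one path that does not exist.
demo_dir = tempfile.mkdtemp()
good_path = os.path.join(demo_dir, "tracks.db")
conn = sqlite3.connect(good_path)
conn.execute("CREATE TABLE tracks (isrc TEXT)")
conn.commit()
conn.close()

healthy = check_health(good_path)
missing = check_health(os.path.join(demo_dir, "deleted.db"))
print(healthy[0], missing[0])
```

With a probe like this, a deleted database file produces a non-200 response that a load balancer or monitor can act on.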
### 4. Rate Limiter Memory Leak
**Visitor map grows unbounded:**
```go
type RateLimiter struct {
	visitors map[string]*rate.Limiter // Never cleaned up
	mu       sync.RWMutex
}
```
**Impact:**
- Long-running servers accumulate IPs
- Memory usage grows over time
- 1M unique IPs = ~100MB leak
**Workaround:** Restart server periodically
**Fix required:** Implement visitor cleanup (remove inactive IPs after 24 hours)
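One common way to implement such a cleanup is to store a last-seen timestamp per IP and periodically evict idle entries. The Python sketch below illustrates only that bookkeeping; `VisitorRegistry` is a hypothetical name, the per-IP limiter itself is omitted, and a real implementation would hold the mutex while touching the map.

```python
import time


class VisitorRegistry:
    """Tracks per-IP activity and forgets IPs that have been idle too long."""

    def __init__(self, idle_ttl_seconds: float = 24 * 3600, clock=time.monotonic):
        self._idle_ttl = idle_ttl_seconds
        self._clock = clock
        self._last_seen = {}  # ip -> timestamp of the most recent request

    def touch(self, ip: str) -> None:
        """Record activity for an IP (call once per request)."""
        self._last_seen[ip] = self._clock()

    def cleanup(self) -> int:
        """Drop IPs idle for longer than the TTL; return how many were removed."""
        cutoff = self._clock() - self._idle_ttl
        stale = [ip for ip, seen in self._last_seen.items() if seen < cutoff]
        for ip in stale:
            del self._last_seen[ip]
        return len(stale)

    def __len__(self) -> int:
        return len(self._last_seen)


# Demonstration with a fake clock so no real waiting is needed.
now = [0.0]
registry = VisitorRegistry(idle_ttl_seconds=24 * 3600, clock=lambda: now[0])
registry.touch("203.0.113.1")
now[0] = 23 * 3600
registry.touch("203.0.113.2")
now[0] = 25 * 3600  # first IP idle for 25 h, second for 2 h
removed = registry.cleanup()
remaining = len(registry)
print(removed, remaining)
```

Running `cleanup()` from a background timer bounds the map to the set of IPs seen within the TTL window.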
### 5. No CORS Support
**No CORS headers:**
- Browser-based clients blocked
- Can't call from web apps directly
- OPTIONS preflight requests fail
**Workarounds:**
- Add CORS middleware (custom implementation)
- Use server-side proxy
- Deploy API on same origin as web app
**Impact:** Limited to server-side integrations.
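The custom-middleware workaround boils down to deciding which headers to attach to each response. A minimal, framework-independent sketch of that decision in Python follows; the function name, the allowed methods, and the example origins are all assumptions.

```python
from typing import Optional

ALLOWED_METHODS = "GET, POST, OPTIONS"
ALLOWED_HEADERS = "Content-Type"


def cors_headers(method: str, origin: Optional[str], allowed_origins: set) -> dict:
    """Return the CORS response headers for one request (empty if not allowed)."""
    if origin is None or origin not in allowed_origins:
        return {}
    headers = {
        "Access-Control-Allow-Origin": origin,
        "Vary": "Origin",  # caches must not reuse the response across origins
    }
    if method == "OPTIONS":  # preflight request
        headers["Access-Control-Allow-Methods"] = ALLOWED_METHODS
        headers["Access-Control-Allow-Headers"] = ALLOWED_HEADERS
        headers["Access-Control-Max-Age"] = "600"
    return headers


allowed = {"https://app.example.com"}
preflight = cors_headers("OPTIONS", "https://app.example.com", allowed)
simple = cors_headers("GET", "https://app.example.com", allowed)
rejected = cors_headers("GET", "https://other.example.net", allowed)
print(preflight, simple, rejected)
```

An explicit allow-list is used here instead of `*` because the API has no authentication of its own.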
### 6. No Metrics/Monitoring
**No instrumentation:**
- No Prometheus metrics
- No request counters
- No latency histograms
- No error rate tracking
**Visibility gaps:**
- Can't track usage patterns
- Can't identify slow endpoints
- Can't detect error spikes
- No performance baselines
**Workarounds:**
- Parse logs for metrics
- Use reverse proxy metrics (nginx)
- Implement custom metrics middleware
**Impact:** Blind operation, difficult to optimize.
### 7. Database Provenance Unclear
**Repository disclaimer:**
> "This project is not affiliated with Spotify."
**Concerns:**
- Data source unclear (likely scraped)
- Legal status uncertain
- No official Spotify endorsement
- Potential copyright issues
**Risks:**
- Takedown requests
- Legal liability
- Data quality unknown
- No support/updates
**Recommendation:** Verify legal compliance before production use.
### 8. No Data Freshness Mechanism
**Static snapshot:**
- No update mechanism
- Data frozen at time of database creation
- No real-time sync with Spotify
**Staleness:**
- New releases not included
- Popularity scores outdated
- Artist follower counts stale
- Deleted tracks still present
**Workarounds:**
- Periodically obtain updated database (if available)
- Complement with real-time APIs for fresh data
- Treat as historical snapshot
**Impact:** Not suitable for applications requiring current data.
### 9. Search Performance
**LIKE %query% on 256M rows:**
- Full table scan (the leading wildcard prevents index use)
- 10-second timeout (can be hit)
- CPU-intensive
**Slow searches:**
- Common words ("love", "the"): 5-10 seconds
- Rare queries: 10+ seconds (the scan must cover the whole table and can hit the timeout)
**Alternative:** SQLite FTS5 (Full-Text Search)
- Building the index requires write access, so it cannot be added to databases that are only ever opened read-only
- Would need a separate FTS5 database, built offline and then served read-only
**Impact:** Search functionality limited to specific queries.
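SQLite's query planner shows why the substring search is slow. The sketch below uses a three-row in-memory table with a made-up schema (the real schema is not documented here) and compares the plan for a `LIKE` with a leading wildcard against the plan for an exact match.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tracks (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE INDEX idx_tracks_name ON tracks(name)")
conn.executemany(
    "INSERT INTO tracks (name) VALUES (?)",
    [("Love Song",), ("Endless Love",), ("Blue Monday",)],
)


def plan(sql: str, params: tuple) -> str:
    """Return SQLite's query plan as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[3] for row in rows)  # column 3 holds the description


# A leading wildcard forces SQLite to visit every row (a SCAN).
plan_substring = plan("SELECT id FROM tracks WHERE name LIKE ?", ("%love%",))
# An exact match can seek straight into the index (a SEARCH).
plan_exact = plan("SELECT id FROM tracks WHERE name = ?", ("Love Song",))

matches = conn.execute(
    "SELECT name FROM tracks WHERE name LIKE ? ORDER BY name", ("%love%",)
).fetchall()

print(plan_substring)
print(plan_exact)
print(matches)
```

The exact wording of the plan varies between SQLite versions, but the substring query is reported as a scan and the exact match as an index search.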
### 10. Hardcoded Configuration
**All limits/timeouts hardcoded:**
- Rate limit: 100 req/s, 200 burst
- Search timeout: 10 seconds
- Batch limit: 400 items
- Connection pool: 8 connections
- SQLite cache: 64MB
**Problems:**
- No flexibility
- Requires recompilation to change
- No environment-specific config
**Workaround:** Fork and modify code
**Impact:** Limited adaptability to different workloads.
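Extracting these values into environment variables is a small change in any language. The Python sketch below shows the pattern with the defaults listed above; the `MMAPI_` prefix and the field names are hypothetical and do not exist in the project.

```python
import os
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Settings:
    rate_limit_rps: int = 100
    rate_limit_burst: int = 200
    search_timeout_s: int = 10
    batch_limit: int = 400
    db_pool_size: int = 8
    sqlite_cache_mb: int = 64


def load_settings(env=os.environ) -> Settings:
    """Override each default with MMAPI_<FIELD> if that variable is set."""
    overrides = {}
    for field in fields(Settings):
        raw = env.get("MMAPI_" + field.name.upper())
        if raw is not None:
            overrides[field.name] = int(raw)  # raises ValueError on bad input
    return Settings(**overrides)


defaults = load_settings(env={})
tuned = load_settings(env={"MMAPI_BATCH_LIMIT": "1000", "MMAPI_SEARCH_TIMEOUT_S": "30"})
print(defaults)
print(tuned)
```

Unset variables fall back to today's hardcoded values, so existing deployments would behave the same.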
## Use Case Evaluation
### Ideal Use Cases
#### 1. Music Library Enrichment
**Scenario:** Enrich local music library with metadata
**Flow:**
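1. Obtain ISRCs for the audio files (from file tags, or by fingerprinting with AcoustID and resolving the matched recording's ISRCs via MusicBrainz)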
2. Batch lookup ISRCs (400 at a time)
3. Store metadata in local database
4. Display in music player UI
**Why suitable:**
- Batch API optimized for bulk lookups
- ISRC-based lookup (industry standard)
- No API rate limits (self-hosted)
- Comprehensive metadata (genres, images, popularity)
**Example:**
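```python
import requests

def chunks(items, size):
    # Yield consecutive slices of at most `size` items
    for start in range(0, len(items), size):
        yield items[start:start + size]

# Enrich 10,000 tracks
isrcs = extract_isrcs_from_library()  # 10,000 ISRCs
# Batch lookup (25 requests for 10,000 tracks)
for batch in chunks(isrcs, 400):
    response = requests.post("http://localhost:8080/batch/lookup", json={"isrcs": batch})
    store_metadata(response.json())
```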
#### 2. Metadata Aggregator Pipeline
**Scenario:** Combine data from multiple sources (MusicBrainz + Music Metadata API)
**Flow:**
1. Query MusicBrainz for recording by MBID
2. Extract ISRC from MusicBrainz response
3. Lookup ISRC in Music Metadata API
4. Merge metadata (MusicBrainz credits + Spotify-style data)
**Why suitable:**
- Complements MusicBrainz (different data models)
- ISRC as common key
- Fast batch lookups
- No external API dependencies
**Example:**
```python
import requests

# Get MusicBrainz data (`musicbrainz` stands for whichever client wrapper is in use)
mb_data = musicbrainz.get_recording(mbid)
isrc = mb_data['isrcs'][0]  # assumes the recording has at least one ISRC
# Get Spotify-style data
mm_data = requests.get(f"http://localhost:8080/lookup/isrc/{isrc}").json()
# Merge
merged = {
    "mbid": mbid,
    "isrc": isrc,
    "title": mm_data['name'],
    "popularity": mm_data['popularity'],
    "credits": mb_data['artist-credit'],
    "genres": mm_data['artists'][0]['genres']
}
```
#### 3. Self-Hosted Alternative to Spotify API
**Scenario:** Replace Spotify Web API with local service
**Why suitable:**
- No OAuth complexity
- No API rate limits
- No per-request costs
- Batch support (400 items vs Spotify's 50)
**Tradeoffs:**
- Static data (no real-time updates)
- Database size (216GB)
- No write operations
**Example:**
```python
# Spotify Web API (rate limited, requires OAuth)
spotify_data = spotify_client.search(q=f"isrc:{isrc}", type="track")
# Music Metadata API (no auth, no rate limits)
mm_data = requests.get(f"http://localhost:8080/lookup/isrc/{isrc}").json()
```
#### 4. DJ Software Metadata Provider
**Scenario:** Enrich DJ library with popularity, genres, images
**Why suitable:**
- Batch processing for large libraries
- Popularity scores for track selection
- Genre tags for filtering
- Album artwork for UI
**Example:**
```python
import requests

# Enrich DJ library
tracks = load_dj_library()  # 5,000 tracks
isrcs = [t.isrc for t in tracks]
# Batch lookup; chunks() is a helper that yields slices of at most 400 ISRCs
for batch in chunks(isrcs, 400):
    response = requests.post("http://localhost:8080/batch/lookup", json={"isrcs": batch})
    update_dj_library(response.json())
```
### Unsuitable Use Cases
#### 1. Real-Time Music Discovery App
**Why unsuitable:**
- Static data (no new releases)
- Outdated popularity scores
- No personalization
- No user-specific data
**Alternative:** Spotify Web API, Apple Music API
#### 2. Public-Facing API Service
**Why unsuitable:**
- No authentication (abuse risk)
- No usage tracking
- No quota enforcement
- Rate limiter memory leak
**Alternative:** Add authentication layer or use managed API service
#### 3. Mission-Critical Production System
**Why unsuitable:**
- Zero test coverage
- Naive health check
- Memory leak
- No metrics
**Alternative:** Extensive testing + monitoring before production use
#### 4. Applications Requiring Fresh Data
**Why unsuitable:**
- Static snapshot (no updates)
- Stale popularity/follower counts
- Missing new releases
**Alternative:** Spotify Web API, MusicBrainz (community-updated)
## Integration Evaluation
### Complementary Services
**Works well with:**
- **MusicBrainz:** Different data models, ISRC as common key
- **AcoustID:** Fingerprint to recording, resolve the ISRC via MusicBrainz, then look up metadata
- **Local music libraries:** Enrich with metadata
- **DJ software:** Popularity, genres, artwork
**Conflicts with:**
- **Spotify Web API:** Overlapping data, but Music Metadata API is static
- **Real-time services:** Music Metadata API data is stale
### Integration Complexity
**Easy integrations:**
- HTTP client (any language)
- Batch processing pipelines
- Local applications
**Complex integrations:**
- Browser-based apps (no CORS)
- Authenticated services (no auth)
- Real-time systems (static data)
## Performance Evaluation
### Throughput
**Batch endpoint:**
- 400 items per request
- ~200-500ms per request
- **800-2,000 items/second** (single instance)
**Individual endpoints:**
- ~50ms per request
- Rate limited to 100 req/s
- **100 items/second** (single instance)
**Scaling:**
- Horizontal: Run multiple instances (read-only safe)
- Vertical: More RAM (larger cache), faster disk (SSD)
### Latency
**Typical latencies:**
- Track lookup: 10-50ms
- Album lookup: 10-50ms
- Artist lookup: 10-50ms
- Batch lookup (400 items): 200-500ms
- Search: 1-10 seconds (depends on query)
**Bottlenecks:**
- Search queries (LIKE %query%)
- Disk I/O (use SSD)
- Rate limiter (RWMutex contention)
### Resource Usage
**Disk:** 216GB (databases)
**RAM:** 2.5GB (SQLite cache + mmap) + 1.5GB (app/OS) = 4GB minimum
**CPU:** 1 core minimum, 2+ recommended (search queries CPU-intensive)
**Scaling costs:**
- 10 instances = 2.16TB storage (expensive)
- Shared filesystem (NFS, EFS) reduces storage cost but increases latency
## Security Evaluation
### Vulnerabilities
**High severity:**
- **No authentication:** Anyone can query API
- **No rate limiting per user:** IP-based only (easily bypassed)
**Medium severity:**
- **Memory leak:** Rate limiter grows unbounded
- **No input sanitization:** SQL injection risk (mitigated by parameterized queries)
**Low severity:**
- **No HTTPS:** Deploy behind reverse proxy with TLS
- **No CORS:** Browser-based attacks limited
### Mitigations
**Authentication:**
- Deploy behind reverse proxy with auth (nginx, Caddy)
- Use API gateway (Kong, Tyk)
**Rate limiting:**
- Implement per-user rate limiting (requires auth)
- Use distributed rate limiter (Redis)
**Memory leak:**
- Restart server periodically
- Implement visitor cleanup
**HTTPS:**
- Terminate TLS at reverse proxy
- Use Let's Encrypt for free certificates
## Reliability Evaluation
### Failure Modes
**Database unavailable:**
- Health check returns OK (false positive)
- Queries fail with 500 errors
- No automatic recovery
**Memory exhaustion:**
- Rate limiter leak accumulates
- OOM kill by OS
- Service restart required
**Disk full:**
- SQLite read-only (no writes)
- No impact on service
**Network partition:**
- No external dependencies
- Service continues (self-contained)
### Recovery
**Automatic recovery:**
- Graceful shutdown on SIGINT/SIGTERM
- Docker/Kubernetes restart on failure
**Manual recovery:**
- Restart service (clears rate limiter leak)
- Restore database from backup
- Check database integrity (PRAGMA integrity_check)
### High Availability
**Strategies:**
- Run multiple instances (read-only safe)
- Load balancer distributes traffic
- Health checks route around failures (but naive health check is a problem)
**Limitations:**
- No shared state (rate limiter per-instance)
- No session affinity required
- Database replication (copy files to each instance)
## Cost Evaluation
### Infrastructure Costs
**Single instance:**
- Compute: $20-50/month (2 CPU, 8GB RAM)
- Storage: $20-40/month (250GB SSD)
- Network: $5-10/month (1TB transfer)
- **Total: $45-100/month**
**10 instances (high availability):**
- Compute: $200-500/month
- Storage: $200-400/month (2.5TB SSD, or shared filesystem)
- Network: $50-100/month
- **Total: $450-1,000/month**
**Comparison:**
- Spotify Web API: No per-request fees, but requires OAuth and is rate limited
- MusicBrainz: Free (donations encouraged)
### Development Costs
**Initial setup:**
- Deploy service: 1-2 hours
- Obtain databases: Unknown (not in repository)
- Test integration: 2-4 hours
- **Total: 4-8 hours** (the two estimated steps alone sum to 3-6 hours; time to obtain the databases is unknown)
**Ongoing maintenance:**
- Monitor service: 1-2 hours/month
- Update databases: Unknown (no update mechanism)
- Security patches: 1-2 hours/month
- **Total: 2-4 hours/month**
### Total Cost of Ownership
**Year 1:**
- Infrastructure: $540-1,200 (single instance)
- Development: $400-800 (setup + 12 months maintenance)
- **Total: $940-2,000**
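The Year 1 totals follow directly from the monthly figures in the sections above. The short calculation below reproduces them and also derives the hourly rate implied by the development estimate, which the text does not state; that rate is an inference, not a quoted number.

```python
MONTHS = 12

# (low, high) monthly infrastructure cost for a single instance, in USD.
monthly = {"compute": (20, 50), "storage": (20, 40), "network": (5, 10)}
monthly_total = tuple(sum(cost[i] for cost in monthly.values()) for i in (0, 1))
infra_year = tuple(m * MONTHS for m in monthly_total)

# Effort in hours: one-off setup plus monthly maintenance.
setup_hours = (4, 8)
maintenance_hours = (2, 4)
hours_year = tuple(s + m * MONTHS for s, m in zip(setup_hours, maintenance_hours))

development_year = (400, 800)  # USD, as stated above
implied_rate = tuple(d / h for d, h in zip(development_year, hours_year))
total_year = tuple(i + d for i, d in zip(infra_year, development_year))

print(f"infrastructure: ${infra_year[0]}-{infra_year[1]}")
print(f"effort: {hours_year[0]}-{hours_year[1]} hours")
print(f"implied rate: ${implied_rate[0]:.2f}/hour")
print(f"total: ${total_year[0]}-{total_year[1]}")
```

The development line works out to roughly $14 per hour at both ends of the range, so the total is sensitive to whatever labor rate actually applies.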
**Comparison:**
- Spotify Web API: No direct usage fees; integration and rate-limit handling costs depend on usage
- MusicBrainz: $0 (free, donations encouraged)
## Recommendation Matrix
| Use Case | Suitability | Reasoning |
|----------|-------------|-----------|
| Music library enrichment | ⭐⭐⭐⭐⭐ | Ideal: Batch API, ISRC lookup, no rate limits |
| Metadata aggregator | ⭐⭐⭐⭐⭐ | Ideal: Complements MusicBrainz, fast lookups |
| Self-hosted alternative | ⭐⭐⭐⭐ | Good: No auth complexity, but static data |
| DJ software integration | ⭐⭐⭐⭐ | Good: Popularity, genres, artwork |
| Real-time music app | ⭐⭐ | Poor: Static data, no updates |
| Public API service | ⭐⭐ | Poor: No auth, no metrics, memory leak |
| Mission-critical system | ⭐ | Very poor: No tests, naive health check |
| Fresh data required | ⭐ | Very poor: Static snapshot, no updates |
**Legend:**
- ⭐⭐⭐⭐⭐ Ideal
- ⭐⭐⭐⭐ Good
- ⭐⭐⭐ Acceptable
- ⭐⭐ Poor
- ⭐ Very poor
## Final Verdict
### Overall Rating: 7/10
**Breakdown:**
- **Functionality:** 9/10 (comprehensive metadata, batch API)
- **Performance:** 8/10 (fast batch, slow search)
- **Reliability:** 5/10 (no tests, memory leak, naive health check)
- **Security:** 4/10 (no auth, no metrics)
- **Maintainability:** 6/10 (simple code, but no tests)
- **Documentation:** 8/10 (OpenAPI spec, but minimal code comments)
### Strengths Summary
1. Massive dataset (256M tracks)
2. Simple architecture (no framework)
3. High-performance batch API (400 items/request)
4. Pure Go (no CGO)
5. Read-only safety
6. OpenAPI documentation
7. MIT license
8. Easy deployment
### Weaknesses Summary
1. Zero test coverage
2. No authentication
3. Naive health check
4. Rate limiter memory leak
5. No CORS
6. No metrics
7. Database provenance unclear
8. No data freshness
9. Slow search (LIKE %query%)
10. Hardcoded configuration
### Recommendation
**Use Music Metadata API if:**
- You need to enrich large music libraries (batch processing)
- You want ISRC-based lookups without API rate limits
- You can tolerate static data (no real-time updates)
- You can deploy behind reverse proxy (for auth/CORS)
- You can implement monitoring (metrics, proper health checks)
- You can accept legal uncertainty (database provenance)
**Don't use Music Metadata API if:**
- You need real-time data (use Spotify Web API)
- You need production-grade reliability (no tests)
- You need authentication out-of-the-box
- You need fresh data (new releases, current popularity)
- You can't tolerate 216GB storage requirement
### Improvement Priorities
**Critical (before production):**
1. Add test coverage (unit + integration tests)
2. Fix rate limiter memory leak
3. Implement proper health check (verify database)
4. Add authentication (or deploy behind auth proxy)
**High priority:**
1. Add metrics/monitoring (Prometheus)
2. Implement CORS support
3. Extract hardcoded config (environment variables)
4. Use LOG_LEVEL environment variable
**Medium priority:**
1. Improve search performance (FTS5)
2. Add request logging
3. Structured error responses
4. Documentation (code comments)
**Low priority:**
1. Caching layer (Redis)
2. Horizontal scaling improvements
3. Database update mechanism
4. Admin API (stats, cache control)