Alexander a1f6701bac feat: initial implementation of metadata aggregator
- gRPC service with MusicBrainz provider
- PostgreSQL schema with migrations
- Service layer with database-first caching
- Repository pattern for data access
- YAML configuration support
- Research documentation for 17 music metadata projects
2026-04-28 16:28:53 +02:00

Music Metadata API - Evaluation

Executive Summary

Music Metadata API is a simple, focused, self-contained service for querying metadata on 256 million music tracks. It excels at batch lookups and ISRC-based queries but lacks authentication, testing, and real-time data updates.

Best for: Self-hosted metadata enrichment, high-volume batch processing, ISRC resolution
Not suitable for: Real-time data, production systems requiring authentication, mission-critical applications without testing

Strengths

1. Massive Dataset

256 million tracks across two SQLite databases (~216GB)

Coverage:

  • Tracks with ISRC codes
  • Albums with artwork, labels, release dates
  • Artists with genres, follower counts, popularity
  • Extended metadata (lyrics flags, languages, artist roles)

Comparison:

  • Spotify Web API: Full catalog (real-time)
  • MusicBrainz: ~40M recordings
  • Discogs: ~15M releases

Value: Comprehensive coverage for metadata enrichment without API rate limits.

2. Extremely Simple Architecture

No framework, no ORM, minimal dependencies:

  • Go stdlib for HTTP, JSON, database
  • 2 external packages (sqlite driver, rate limiter)
  • ~1,100 lines of code
  • Single binary deployment

Benefits:

  • Easy to understand and modify
  • Fast compilation
  • No framework lock-in
  • Minimal attack surface

Comparison:

  • Typical web service: 10+ dependencies, framework overhead
  • Music Metadata API: 2 dependencies, otherwise stdlib only

3. High-Performance Batch API

Batch endpoint: Process up to 400 items per request

Performance gain:

  • Individual requests: 400 × ~50ms = 20 seconds
  • Batch request: ~200-500ms total
  • 40-100x faster

Query optimization:

  • Without batching: 2,800+ queries for 400 tracks
  • With batching: 7 queries for 400 tracks
  • 400x fewer queries

Use case: Enriching large music libraries efficiently.
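
The speedup figures above follow from simple arithmetic. The sketch below only re-derives them from the numbers quoted in this section (approximate latencies, not measurements); the variable names are illustrative.

```python
# Figures taken from the text above; nothing here is measured.
ITEMS = 400
PER_REQUEST_MS = 50        # approximate latency of one individual lookup
BATCH_MS = (200, 500)      # approximate latency range of one batch request

# 400 sequential lookups at ~50 ms each.
sequential_ms = ITEMS * PER_REQUEST_MS

# Speedup against the slow and fast ends of the batch latency range.
speedup_range = tuple(sequential_ms // b for b in reversed(BATCH_MS))

QUERIES_UNBATCHED = 2_800
QUERIES_BATCHED = 7
query_reduction = QUERIES_UNBATCHED // QUERIES_BATCHED
queries_per_track_unbatched = QUERIES_UNBATCHED // ITEMS

print(f"sequential: {sequential_ms / 1000:.0f} s")
print(f"speedup: {speedup_range[0]}-{speedup_range[1]}x")
print(f"query reduction: {query_reduction}x")
```

Put differently, the unbatched path issues about 7 queries per track, while the batched path issues 7 queries for the whole request.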

4. Pure Go (No CGO)

CGO_ENABLED=0 - No C dependencies

Benefits:

  • Cross-compilation trivial (GOOS/GOARCH)
  • No C toolchain required
  • Smaller attack surface
  • Easier deployment (static binary)

Tradeoff: The pure-Go SQLite driver adds more to the binary than a CGO driver would (~2MB vs ~500KB)

5. Read-Only Safety

Databases opened in read-only mode:

  • No accidental writes
  • No data corruption risk
  • Safe concurrent reads
  • No write locks

Connection parameters and PRAGMAs:

mode=ro
_journal_mode=off
_query_only=true

Benefit: Multiple instances can share database files safely.

6. OpenAPI Documentation

Comprehensive OpenAPI 3.1 spec:

  • All endpoints documented
  • Request/response schemas
  • Example payloads
  • Interactive Swagger UI at /docs

Value: Self-documenting API, easy integration.

7. MIT License

Permissive license:

  • Free for commercial use
  • Copyright and license notice must be retained in copies; no further attribution required
  • Modify and redistribute freely

Comparison:

  • Spotify Web API: Proprietary, rate limited
  • MusicBrainz: CC0/Public Domain (data), GPL (server)

8. Easy Deployment

Multiple deployment options:

  • Standalone binary (single executable)
  • Docker container (official image)
  • Kubernetes (example manifests)
  • Cloud platforms (ECS, Cloud Run, ACI)

Minimal requirements:

  • 216GB disk (databases)
  • 4GB RAM
  • 1 CPU core

No external dependencies:

  • No database server (SQLite embedded)
  • No cache server (SQLite cache)
  • No message queue
  • No authentication service

Weaknesses

1. Zero Test Coverage

No test files, no test framework, no CI testing

Risks:

  • No regression protection
  • Bugs discovered in production
  • Difficult to refactor safely
  • No documentation via tests

Evidence:

  • .gitignore includes coverage.out (suggests testing was planned but never implemented)
  • GitHub Actions workflow has no test step

Impact: High risk for production use without extensive manual testing.

2. No Authentication

Public API with no access control:

  • No OAuth
  • No API keys
  • No rate limiting per user (only per IP)
  • No usage tracking per user

Risks:

  • Abuse (unlimited queries)
  • No accountability
  • No quota enforcement
  • Data scraping

Workarounds:

  • Deploy behind reverse proxy with auth (nginx, Caddy)
  • Use API gateway (Kong, Tyk)
  • Implement custom middleware

Impact: Not suitable for public internet deployment without additional security layer.

3. Naive Health Check

Health endpoint always returns OK:

func handleHealth(w http.ResponseWriter, r *http.Request) {
    json.NewEncoder(w).Encode(map[string]string{"status": "ok"})
}

Problem: Doesn't verify database connectivity

Scenario:

  • Database file deleted/corrupted
  • Health check returns 200 OK
  • Actual queries fail with 500 errors
  • Monitoring systems don't detect failure

Impact: Monitoring keeps reporting the service as healthy while queries fail, which delays incident detection.

4. Rate Limiter Memory Leak

Visitor map grows unbounded:

type RateLimiter struct {
    visitors map[string]*rate.Limiter  // Never cleaned up
    mu       sync.RWMutex
}

Impact:

  • Long-running servers accumulate IPs
  • Memory usage grows over time
  • 1M unique IPs ≈ 100MB of retained memory (rough estimate, about 100 bytes per entry)

Workaround: Restart server periodically

Fix required: Implement visitor cleanup (remove inactive IPs after 24 hours)
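
The cleanup idea is to store a last-seen timestamp next to each visitor and periodically drop idle entries. The following is a language-neutral sketch written in Python, not the project's Go code; `VisitorTable`, its methods, and the `limiter` placeholder are hypothetical names.

```python
import threading
import time


class VisitorTable:
    """Tracks per-IP state and forgets IPs idle for longer than `ttl` seconds."""

    def __init__(self, ttl=24 * 60 * 60):
        self.ttl = ttl
        self._visitors = {}  # ip -> {"limiter": ..., "last_seen": float}
        self._lock = threading.Lock()

    def touch(self, ip, now=None):
        """Record activity for `ip` and return its entry, creating it on first sight."""
        now = time.monotonic() if now is None else now
        with self._lock:
            # "limiter" is a placeholder for the per-IP rate limiter state.
            entry = self._visitors.setdefault(ip, {"limiter": None, "last_seen": now})
            entry["last_seen"] = now
            return entry

    def cleanup(self, now=None):
        """Drop entries idle for more than `ttl`; return how many were removed."""
        now = time.monotonic() if now is None else now
        with self._lock:
            stale = [ip for ip, e in self._visitors.items()
                     if now - e["last_seen"] > self.ttl]
            for ip in stale:
                del self._visitors[ip]
            return len(stale)

    def __len__(self):
        with self._lock:
            return len(self._visitors)


# Demo with explicit timestamps so the outcome does not depend on the clock.
table = VisitorTable(ttl=60)
table.touch("203.0.113.1", now=0)
table.touch("203.0.113.2", now=50)
removed = table.cleanup(now=100)  # first IP idle for 100 s, second for 50 s
print(removed, len(table))
```

In a long-running server, `cleanup` would run on a timer (a background goroutine in Go), so memory stays proportional to recently active IPs rather than to all IPs ever seen.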

5. No CORS Support

No CORS headers:

  • Browser-based clients blocked
  • Can't call from web apps directly
  • OPTIONS preflight requests fail

Workarounds:

  • Add CORS middleware (custom implementation)
  • Use server-side proxy
  • Deploy API on same origin as web app

Impact: Limited to server-side integrations.
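
The custom middleware workaround amounts to two things: answering OPTIONS preflights directly and appending CORS headers to every other response. The sketch below shows that shape as a Python WSGI wrapper purely for illustration; the service is Go, where the equivalent would wrap an http.Handler. The allowed methods and headers are assumptions, and a wildcard origin should be narrowed for real deployments.

```python
def cors_middleware(app, allowed_origin="*"):
    """Wrap a WSGI app so responses carry CORS headers and preflights succeed."""
    cors_headers = [
        ("Access-Control-Allow-Origin", allowed_origin),
        ("Access-Control-Allow-Methods", "GET, POST, OPTIONS"),
        ("Access-Control-Allow-Headers", "Content-Type"),
    ]

    def wrapped(environ, start_response):
        if environ.get("REQUEST_METHOD") == "OPTIONS":
            # Answer the preflight directly; the wrapped app never sees it.
            start_response("204 No Content", cors_headers + [("Content-Length", "0")])
            return [b""]

        def start_with_cors(status, headers, exc_info=None):
            # Append the CORS headers to whatever the app set.
            return start_response(status, list(headers) + cors_headers, exc_info)

        return app(environ, start_with_cors)

    return wrapped


def demo_app(environ, start_response):
    """Stand-in for the real API handler."""
    start_response("200 OK", [("Content-Type", "application/json")])
    return [b'{"status": "ok"}']


captured = {}


def capture(status, headers, exc_info=None):
    """Minimal start_response that records what the middleware sends."""
    captured["status"] = status
    captured["headers"] = dict(headers)


app = cors_middleware(demo_app)
get_body = app({"REQUEST_METHOD": "GET"}, capture)
get_result = dict(captured)
options_body = app({"REQUEST_METHOD": "OPTIONS"}, capture)
options_result = dict(captured)
print(get_result["status"], options_result["status"])
```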

6. No Metrics/Monitoring

No instrumentation:

  • No Prometheus metrics
  • No request counters
  • No latency histograms
  • No error rate tracking

Visibility gaps:

  • Can't track usage patterns
  • Can't identify slow endpoints
  • Can't detect error spikes
  • No performance baselines

Workarounds:

  • Parse logs for metrics
  • Use reverse proxy metrics (nginx)
  • Implement custom metrics middleware

Impact: Blind operation, difficult to optimize.

7. Database Provenance Unclear

Repository disclaimer:

"This project is not affiliated with Spotify."

Concerns:

  • Data source unclear (likely scraped)
  • Legal status uncertain
  • No official Spotify endorsement
  • Potential copyright issues

Risks:

  • Takedown requests
  • Legal liability
  • Data quality unknown
  • No support/updates

Recommendation: Verify legal compliance before production use.

8. No Data Freshness Mechanism

Static snapshot:

  • No update mechanism
  • Data frozen at time of database creation
  • No real-time sync with Spotify

Staleness:

  • New releases not included
  • Popularity scores outdated
  • Artist follower counts stale
  • Deleted tracks still present

Workarounds:

  • Periodically obtain updated database (if available)
  • Complement with real-time APIs for fresh data
  • Treat as historical snapshot

Impact: Not suitable for applications requiring current data.

9. Search Performance

LIKE %query% on 256M rows:

  • Full table scan (can't use indexes)
  • 10-second timeout (can be hit)
  • CPU-intensive

Slow searches:

  • Common words ("love", "the"): 5-10 seconds
  • Rare queries: 10+ seconds (full scan)

Alternative: SQLite FTS5 (Full-Text Search)

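  • Building the index requires a writable database, so it cannot be created through the service's read-only connections
  • The index would need to be built once offline, for example in a separate FTS5 database that is then opened read-only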

Impact: Substring search is only practical for occasional queries; it is too slow for high-volume or latency-sensitive use.
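
Why the leading wildcard forces a scan can be observed with EXPLAIN QUERY PLAN. This is a small sketch using Python's sqlite3 module on an in-memory table; the table and index names are invented for the demo and say nothing about the real schema. The exact plan wording varies between SQLite versions, so only the SCAN/SEARCH keyword is compared.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tracks (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE INDEX idx_tracks_name ON tracks(name)")
conn.executemany(
    "INSERT INTO tracks (name) VALUES (?)",
    [("Love Song",), ("Endless Love",), ("Blue",)],
)


def plan(sql, params=()):
    """Return the human-readable detail column of EXPLAIN QUERY PLAN."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[3] for row in rows)


# A substring match has to visit every row, even though an index on name exists.
substring_plan = plan("SELECT id FROM tracks WHERE name LIKE ?", ("%love%",))

# An exact match can seek directly into the index.
exact_plan = plan("SELECT id FROM tracks WHERE name = ?", ("Love Song",))

matches = conn.execute(
    "SELECT name FROM tracks WHERE name LIKE ? ORDER BY id", ("%love%",)
).fetchall()

print(substring_plan)
print(exact_plan)
print(matches)
```

On three rows the difference is invisible; on 256M rows it is the difference between a seek and reading the whole table or index.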

10. Hardcoded Configuration

All limits/timeouts hardcoded:

  • Rate limit: 100 req/s, 200 burst
  • Search timeout: 10 seconds
  • Batch limit: 400 items
  • Connection pool: 8 connections
  • SQLite cache: 64MB

Problems:

  • No flexibility
  • Requires recompilation to change
  • No environment-specific config

Workaround: Fork and modify code

Impact: Limited adaptability to different workloads.
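
Moving these values into environment variables is a small change. The sketch below illustrates the pattern in Python (the service would do the same in Go with os.Getenv); the `MMAPI_` prefix and the field names are hypothetical, while the defaults are the hardcoded values listed above.

```python
import os
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Settings:
    # Defaults mirror the hardcoded values listed in this section.
    rate_limit_per_sec: int = 100
    rate_limit_burst: int = 200
    search_timeout_sec: int = 10
    batch_limit: int = 400
    db_pool_size: int = 8
    sqlite_cache_mb: int = 64


def load_settings(env=None):
    """Build Settings from variables named MMAPI_<FIELD>, falling back to defaults."""
    env = os.environ if env is None else env
    overrides = {}
    for field in fields(Settings):
        raw = env.get("MMAPI_" + field.name.upper())
        if raw is not None:
            overrides[field.name] = int(raw)  # raises ValueError on non-numeric input
    return Settings(**overrides)


defaults = load_settings({})
tuned = load_settings({"MMAPI_BATCH_LIMIT": "200", "MMAPI_SQLITE_CACHE_MB": "256"})
print(defaults)
print(tuned)
```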

Use Case Evaluation

Ideal Use Cases

1. Music Library Enrichment

Scenario: Enrich local music library with metadata

Flow:

  1. Extract ISRCs from audio file tags, or resolve them via AcoustID (fingerprint to MusicBrainz recording, then ISRC)
  2. Batch lookup ISRCs (400 at a time)
  3. Store metadata in local database
  4. Display in music player UI

Why suitable:

  • Batch API optimized for bulk lookups
  • ISRC-based lookup (industry standard)
  • No third-party API quotas (self-hosted; only the built-in 100 req/s per-IP limit applies)
  • Comprehensive metadata (genres, images, popularity)

Example:

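import requests

def chunks(items, size):
    for i in range(0, len(items), size):
        yield items[i:i + size]

# Enrich 10,000 tracks
isrcs = extract_isrcs_from_library()  # 10,000 ISRCs (your own extraction helper)

# Batch lookup (25 requests for 10,000 tracks)
for batch in chunks(isrcs, 400):
    response = requests.post("http://localhost:8080/batch/lookup", json={"isrcs": batch})
    response.raise_for_status()  # stop on HTTP errors instead of storing an error body
    store_metadata(response.json())  # your own persistence helper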

2. Metadata Aggregator Pipeline

Scenario: Combine data from multiple sources (MusicBrainz + Music Metadata API)

Flow:

  1. Query MusicBrainz for recording by MBID
  2. Extract ISRC from MusicBrainz response
  3. Lookup ISRC in Music Metadata API
  4. Merge metadata (MusicBrainz credits + Spotify-style data)

Why suitable:

  • Complements MusicBrainz (different data models)
  • ISRC as common key
  • Fast batch lookups
  • No external API dependencies

Example:

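import requests

# Get MusicBrainz data (musicbrainz stands for whichever MusicBrainz client you use)
mb_data = musicbrainz.get_recording(mbid)
isrc = mb_data['isrcs'][0]  # assumes the recording has at least one ISRC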

# Get Spotify-style data
mm_data = requests.get(f"http://localhost:8080/lookup/isrc/{isrc}").json()

# Merge
merged = {
    "mbid": mbid,
    "isrc": isrc,
    "title": mm_data['name'],
    "popularity": mm_data['popularity'],
    "credits": mb_data['artist-credit'],
    "genres": mm_data['artists'][0]['genres']
}
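
In practice, some recordings have no ISRC and some ISRCs are missing from the local database, so the merge step benefits from tolerating gaps. The following is a minimal sketch under the same field-name assumptions as the example above; `merge_metadata` is a hypothetical helper, and the sample values are invented.

```python
def merge_metadata(mbid, mb_data, mm_data):
    """Merge a MusicBrainz recording with a Music Metadata API track, tolerating gaps."""
    mm = mm_data or {}  # None when the ISRC lookup found nothing
    isrcs = mb_data.get("isrcs") or []
    artists = mm.get("artists") or []
    return {
        "mbid": mbid,
        "isrc": isrcs[0] if isrcs else None,
        # Prefer the Music Metadata API name, fall back to the MusicBrainz title.
        "title": mm.get("name") or mb_data.get("title"),
        "popularity": mm.get("popularity"),
        "credits": mb_data.get("artist-credit", []),
        "genres": artists[0].get("genres", []) if artists else [],
    }


# Invented sample payloads for the demo.
mb_full = {
    "title": "Example",
    "isrcs": ["USRC17607839"],
    "artist-credit": [{"name": "Example Artist"}],
}
mm_full = {
    "name": "Example (Remastered)",
    "popularity": 42,
    "artists": [{"genres": ["pop"]}],
}

merged = merge_metadata("mbid-123", mb_full, mm_full)
sparse = merge_metadata("mbid-456", {"title": "No ISRC"}, None)
print(merged)
print(sparse)
```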

3. Self-Hosted Alternative to Spotify API

Scenario: Replace Spotify Web API with local service

Why suitable:

  • No OAuth complexity
  • No third-party rate limits (only the service's own per-IP limiter)
  • No per-request costs
  • Batch support (400 items vs Spotify's 50)

Tradeoffs:

  • Static data (no real-time updates)
  • Database size (216GB)
  • No write operations

Example:

import requests

# Spotify Web API (rate limited, requires OAuth; spotify_client is an authenticated client)
spotify_data = spotify_client.search(q=f"isrc:{isrc}", type="track")

# Music Metadata API (no auth; only the built-in per-IP rate limit applies)
mm_data = requests.get(f"http://localhost:8080/lookup/isrc/{isrc}").json()

4. DJ Software Metadata Provider

Scenario: Enrich DJ library with popularity, genres, images

Why suitable:

  • Batch processing for large libraries
  • Popularity scores for track selection
  • Genre tags for filtering
  • Album artwork for UI

Example:

# Enrich DJ library
tracks = load_dj_library()  # 5,000 tracks
isrcs = [t.isrc for t in tracks]

# Batch lookup
for batch in chunks(isrcs, 400):
    response = requests.post("http://localhost:8080/batch/lookup", json={"isrcs": batch})
    update_dj_library(response.json())

Unsuitable Use Cases

1. Real-Time Music Discovery App

Why unsuitable:

  • Static data (no new releases)
  • Outdated popularity scores
  • No personalization
  • No user-specific data

Alternative: Spotify Web API, Apple Music API

2. Public-Facing API Service

Why unsuitable:

  • No authentication (abuse risk)
  • No usage tracking
  • No quota enforcement
  • Rate limiter memory leak

Alternative: Add authentication layer or use managed API service

3. Mission-Critical Production System

Why unsuitable:

  • Zero test coverage
  • Naive health check
  • Memory leak
  • No metrics

Alternative: Extensive testing + monitoring before production use

4. Applications Requiring Fresh Data

Why unsuitable:

  • Static snapshot (no updates)
  • Stale popularity/follower counts
  • Missing new releases

Alternative: Spotify Web API, MusicBrainz (community-updated)

Integration Evaluation

Complementary Services

Works well with:

  • MusicBrainz: Different data models, ISRC as common key
  • AcoustID: Fingerprint to MusicBrainz recording ID, resolve the ISRC via MusicBrainz, then look up metadata
  • Local music libraries: Enrich with metadata
  • DJ software: Popularity, genres, artwork

Conflicts with:

  • Spotify Web API: Overlapping data, but Music Metadata API is static
  • Real-time services: Music Metadata API data is stale

Integration Complexity

Easy integrations:

  • HTTP client (any language)
  • Batch processing pipelines
  • Local applications

Complex integrations:

  • Browser-based apps (no CORS)
  • Authenticated services (no auth)
  • Real-time systems (static data)

Performance Evaluation

Throughput

Batch endpoint:

  • 400 items per request
  • ~200-500ms per request
  • 800-2,000 items/second (single instance)

Individual endpoints:

  • ~50ms per request
  • Rate limited to 100 req/s
  • 100 items/second (single instance)

Scaling:

  • Horizontal: Run multiple instances (read-only safe)
  • Vertical: More RAM (larger cache), faster disk (SSD)
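
The throughput numbers above can be re-derived, and extended to a whole-library estimate, with a few lines of arithmetic. The sketch assumes a single client sending requests one after another and uses only the approximate figures quoted in this section; the function names are illustrative.

```python
import math

BATCH_SIZE = 400
BATCH_LATENCY_MS = (200, 500)  # fast and slow ends of the range quoted above
RATE_LIMIT_RPS = 100           # per-IP limit on individual endpoints


def items_per_second(batch_size, latency_ms):
    """Sequential throughput of one client sending one request at a time."""
    return batch_size * 1000 / latency_ms


def library_run(tracks, latency_ms, batch_size=BATCH_SIZE):
    """Return (requests needed, total milliseconds) for a sequential batch run."""
    requests_needed = math.ceil(tracks / batch_size)
    return requests_needed, requests_needed * latency_ms


fastest = items_per_second(BATCH_SIZE, BATCH_LATENCY_MS[0])
slowest = items_per_second(BATCH_SIZE, BATCH_LATENCY_MS[1])
fast_run = library_run(10_000, BATCH_LATENCY_MS[0])
slow_run = library_run(10_000, BATCH_LATENCY_MS[1])

# Same library through individual endpoints, capped by the rate limit.
single_endpoint_seconds = 10_000 / RATE_LIMIT_RPS

print(f"batch throughput: {slowest:.0f}-{fastest:.0f} items/s")
print(f"10,000 tracks via batch: {fast_run[1] / 1000}-{slow_run[1] / 1000} s")
print(f"10,000 tracks via single lookups: at least {single_endpoint_seconds:.0f} s")
```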

Latency

Typical latencies:

  • Track lookup: 10-50ms
  • Album lookup: 10-50ms
  • Artist lookup: 10-50ms
  • Batch lookup (400 items): 200-500ms
  • Search: 1-10 seconds (depends on query)

Bottlenecks:

  • Search queries (LIKE %query%)
  • Disk I/O (use SSD)
  • Rate limiter (RWMutex contention)

Resource Usage

Disk: 216GB (databases)
RAM: 2.5GB (SQLite cache + mmap) + 1.5GB (app/OS) = 4GB minimum
CPU: 1 core minimum, 2+ recommended (search queries CPU-intensive)

Scaling costs:

  • 10 instances = 2.16TB storage (expensive)
  • Shared filesystem (NFS, EFS) reduces storage cost but increases latency

Security Evaluation

Vulnerabilities

High severity:

  • No authentication: Anyone can query API
  • No rate limiting per user: IP-based only (easily bypassed)

Medium severity:

  • Memory leak: Rate limiter grows unbounded
  • No input sanitization: SQL injection risk (mitigated by parameterized queries)

Low severity:

  • No HTTPS: Deploy behind reverse proxy with TLS
  • No CORS: Browser-based attacks limited

Mitigations

Authentication:

  • Deploy behind reverse proxy with auth (nginx, Caddy)
  • Use API gateway (Kong, Tyk)

Rate limiting:

  • Implement per-user rate limiting (requires auth)
  • Use distributed rate limiter (Redis)

Memory leak:

  • Restart server periodically
  • Implement visitor cleanup

HTTPS:

  • Terminate TLS at reverse proxy
  • Use Let's Encrypt for free certificates

Reliability Evaluation

Failure Modes

Database unavailable:

  • Health check returns OK (false positive)
  • Queries fail with 500 errors
  • No automatic recovery

Memory exhaustion:

  • Rate limiter leak accumulates
  • OOM kill by OS
  • Service restart required

Disk full:

  • SQLite read-only (no writes)
  • No impact on service

Network partition:

  • No external dependencies
  • Service continues (self-contained)

Recovery

Automatic recovery:

  • Graceful shutdown on SIGINT/SIGTERM
  • Docker/Kubernetes restart on failure

Manual recovery:

  • Restart service (clears rate limiter leak)
  • Restore database from backup
  • Check database integrity (PRAGMA integrity_check)

High Availability

Strategies:

  • Run multiple instances (read-only safe)
  • Load balancer distributes traffic
  • Health checks route around failures (but naive health check is a problem)

Limitations:

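  • No shared state: rate limits are enforced per instance, not across the fleet
  • Session affinity is not required, but rate limits then apply per instance rather than per client
  • Database replication is manual (copy files to each instance)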

Cost Evaluation

Infrastructure Costs

Single instance:

  • Compute: $20-50/month (2 CPU, 8GB RAM)
  • Storage: $20-40/month (250GB SSD)
  • Network: $5-10/month (1TB transfer)
  • Total: $45-100/month

10 instances (high availability):

  • Compute: $200-500/month
  • Storage: $200-400/month (2.5TB SSD, or shared filesystem)
  • Network: $50-100/month
  • Total: $450-1,000/month

Comparison:

  • Spotify Web API: No published per-request pricing; access is rate limited and governed by Spotify's developer terms
  • MusicBrainz: Free (donations encouraged)

Development Costs

Initial setup:

  • Deploy service: 1-2 hours
  • Obtain databases: Unknown (not in repository)
  • Test integration: 2-4 hours
  • Total: 4-8 hours (3-6 hours of known work plus an allowance for obtaining the databases)

Ongoing maintenance:

  • Monitor service: 1-2 hours/month
  • Update databases: Unknown (no update mechanism)
  • Security patches: 1-2 hours/month
  • Total: 2-4 hours/month

Total Cost of Ownership

Year 1:

  • Infrastructure: $540-1,200 (single instance)
  • Development: $400-800 (setup + 12 months maintenance, 28-56 hours in total)
  • Total: $940-2,000
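
The Year 1 figures can be reproduced from the monthly ranges and hour estimates given earlier. The sketch below only redoes that arithmetic; the implied hourly rate is derived from the document's own numbers and is not stated anywhere in the source.

```python
# (low, high) ranges in USD per month for a single instance, from the text above.
MONTHLY_USD = {"compute": (20, 50), "storage": (20, 40), "network": (5, 10)}
DEVELOPMENT_USD = (400, 800)          # Year 1 development estimate
SETUP_HOURS = (4, 8)                  # initial setup estimate
MAINTENANCE_HOURS_PER_MONTH = (2, 4)  # ongoing maintenance estimate

monthly = tuple(sum(r[i] for r in MONTHLY_USD.values()) for i in (0, 1))
infrastructure_year = tuple(12 * m for m in monthly)
total_year = tuple(i + d for i, d in zip(infrastructure_year, DEVELOPMENT_USD))

hours_year = tuple(
    s + 12 * m for s, m in zip(SETUP_HOURS, MAINTENANCE_HOURS_PER_MONTH)
)
# Hourly rate implied by dividing the development cost by the estimated hours.
implied_rate = tuple(round(d / h, 2) for d, h in zip(DEVELOPMENT_USD, hours_year))

print(f"monthly: ${monthly[0]}-{monthly[1]}")
print(f"infrastructure, year 1: ${infrastructure_year[0]}-{infrastructure_year[1]}")
print(f"total, year 1: ${total_year[0]}-{total_year[1]}")
print(f"hours, year 1: {hours_year[0]}-{hours_year[1]}")
print(f"implied rate: ${implied_rate[0]}/hour")
```

The implied rate is low for engineering time, so teams with higher labor costs should rescale the development line accordingly.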

Comparison:

  • Spotify Web API: No direct API fees known; cost is mainly integration effort (OAuth, rate-limit handling)
  • MusicBrainz: $0 (free, donations encouraged)

Recommendation Matrix

| Use Case | Suitability | Reasoning |
|---|---|---|
| Music library enrichment | Ideal | Batch API, ISRC lookup, no rate limits |
| Metadata aggregator | Ideal | Complements MusicBrainz, fast lookups |
| Self-hosted alternative | Good | No auth complexity, but static data |
| DJ software integration | Good | Popularity, genres, artwork |
| Real-time music app | Poor | Static data, no updates |
| Public API service | Poor | No auth, no metrics, memory leak |
| Mission-critical system | Very poor | No tests, naive health check |
| Fresh data required | Very poor | Static snapshot, no updates |

Rating scale (best to worst): Ideal, Good, Acceptable, Poor, Very poor

Final Verdict

Overall Rating: 7/10

Breakdown:

  • Functionality: 9/10 (comprehensive metadata, batch API)
  • Performance: 8/10 (fast batch, slow search)
  • Reliability: 5/10 (no tests, memory leak, naive health check)
  • Security: 4/10 (no auth, no metrics)
  • Maintainability: 6/10 (simple code, but no tests)
  • Documentation: 8/10 (OpenAPI spec, but minimal code comments)

Strengths Summary

  1. Massive dataset (256M tracks)
  2. Simple architecture (no framework)
  3. High-performance batch API (400 items/request)
  4. Pure Go (no CGO)
  5. Read-only safety
  6. OpenAPI documentation
  7. MIT license
  8. Easy deployment

Weaknesses Summary

  1. Zero test coverage
  2. No authentication
  3. Naive health check
  4. Rate limiter memory leak
  5. No CORS
  6. No metrics
  7. Database provenance unclear
  8. No data freshness
  9. Slow search (LIKE %query%)
  10. Hardcoded configuration

Recommendation

Use Music Metadata API if:

  • You need to enrich large music libraries (batch processing)
  • You want ISRC-based lookups without API rate limits
  • You can tolerate static data (no real-time updates)
  • You can deploy behind reverse proxy (for auth/CORS)
  • You can implement monitoring (metrics, proper health checks)
  • You can accept legal uncertainty (database provenance)

Don't use Music Metadata API if:

  • You need real-time data (use Spotify Web API)
  • You need production-grade reliability (no tests)
  • You need authentication out-of-the-box
  • You need fresh data (new releases, current popularity)
  • You can't tolerate 216GB storage requirement

Improvement Priorities

Critical (before production):

  1. Add test coverage (unit + integration tests)
  2. Fix rate limiter memory leak
  3. Implement proper health check (verify database)
  4. Add authentication (or deploy behind auth proxy)

High priority:

  1. Add metrics/monitoring (Prometheus)
  2. Implement CORS support
  3. Extract hardcoded config (environment variables)
  4. Use LOG_LEVEL environment variable

Medium priority:

  1. Improve search performance (FTS5)
  2. Add request logging
  3. Structured error responses
  4. Documentation (code comments)

Low priority:

  1. Caching layer (Redis)
  2. Horizontal scaling improvements
  3. Database update mechanism
  4. Admin API (stats, cache control)