Alexander a1f6701bac feat: initial implementation of metadata aggregator
- gRPC service with MusicBrainz provider
- PostgreSQL schema with migrations
- Service layer with database-first caching
- Repository pattern for data access
- YAML configuration support
- Research documentation for 17 music metadata projects
2026-04-28 16:28:53 +02:00

MusicBrainz Server Evaluation

Strengths

1. Canonical Music Metadata Source

Evidence: MusicBrainz is the de facto standard for music metadata. Used by:

  • Spotify (artist/release matching)
  • Last.fm (scrobbling normalization)
  • Roon (music library management)
  • Picard (music tagging)
  • Beets (music organization)
  • Hundreds of other music applications

Impact: Any music metadata aggregator must include MusicBrainz data to be comprehensive. It's the foundation that other services build upon.

Data Quality: Community-driven editing with a voting system keeps accuracy high. The community makes over 2 million edits per year, with auto-editors providing additional quality control.

2. Massive, Comprehensive Dataset

Scale (as of 2024):

  • 2.1+ million artists
  • 3.5+ million releases
  • 30+ million recordings
  • 1.5+ million works
  • 250,000+ labels
  • 100+ million relationships

Coverage: Extensive coverage across:

  • All genres (classical, jazz, rock, electronic, world music, etc.)
  • All eras (historical recordings to latest releases)
  • All regions (global coverage with strong international community)
  • All formats (vinyl, CD, digital, cassette, etc.)

Relationships: Rich relationship data connecting:

  • Artists to recordings (performer, conductor, engineer, etc.)
  • Recordings to works (performance of composition)
  • Artists to artists (member of, collaboration, etc.)
  • Releases to labels, areas, events, etc.

Identifiers: Comprehensive identifier coverage:

  • ISRCs (International Standard Recording Code)
  • ISWCs (International Standard Musical Work Code)
  • Barcodes (EAN, UPC)
  • Disc IDs (CD table of contents)
  • External links (Wikipedia, Discogs, AllMusic, etc.)

3. Mature, Battle-Tested Codebase

Age: 20+ years of continuous development (since 2001)

Stability: Proven reliability serving millions of requests daily with minimal downtime.

Evolution: Gradual modernization while maintaining backward compatibility:

  • Started with Template Toolkit (still used)
  • Added Knockout.js (being phased out)
  • Migrating to React (ongoing)
  • API has remained stable since v2 (2011)

Community: Large, active open-source community:

  • 500+ contributors on GitHub
  • Active development (commits daily)
  • Responsive to issues and pull requests
  • Strong documentation culture

4. Comprehensive, Well-Designed API

Maturity: API v2 stable since 2011, widely adopted

Formats: Multiple serialization formats:

  • JSON (modern, widely supported)
  • XML (legacy, still used by many clients)
  • JSON-LD (semantic web, Schema.org vocabulary)

Features:

  • Lookup by MBID (unique identifier)
  • Browse by relationships (all releases by artist, etc.)
  • Search with Lucene query syntax
  • Include parameters for fine-grained control
  • Pagination for large result sets
  • CORS enabled for browser clients
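
The lookup, browse, and search styles listed above differ mainly in URL shape and query parameters. The following is a minimal sketch, assuming the public web service root at https://musicbrainz.org/ws/2. The helper names (lookup_url, browse_url, search_url) are hypothetical and only build URLs; they do not send requests.

```python
from urllib.parse import urlencode

BASE_URL = "https://musicbrainz.org/ws/2"  # public web service root


def lookup_url(entity, mbid, inc=()):
    """Lookup: fetch one entity by MBID, optionally with include parameters."""
    params = {"fmt": "json"}
    if inc:
        params["inc"] = "+".join(inc)
    return f"{BASE_URL}/{entity}/{mbid}?{urlencode(params)}"


def browse_url(entity, linked_entity, mbid, limit=25, offset=0):
    """Browse: list entities linked to another entity, one page at a time."""
    params = {linked_entity: mbid, "fmt": "json", "limit": limit, "offset": offset}
    return f"{BASE_URL}/{entity}?{urlencode(params)}"


def search_url(entity, query, limit=25, offset=0):
    """Search: Lucene-style query string, e.g. 'artist:nirvana'."""
    params = {"query": query, "fmt": "json", "limit": limit, "offset": offset}
    return f"{BASE_URL}/{entity}?{urlencode(params)}"
```

Paging through a large browse result would then mean calling browse_url repeatedly with an increasing offset until fewer than limit items come back.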

Rate Limiting: Reasonable limits (about 1 request per second per client; excess requests are rejected with HTTP 503) with clear documentation
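
One way a client could stay within the 1 req/sec guideline is to space out its own calls. The sketch below is an illustration, not part of any MusicBrainz library: the Throttle class is a hypothetical name, and the clock and sleep parameters exist only so the behavior can be tested without real waiting.

```python
import time


class Throttle:
    """Blocks so that calls to wait() are at least min_interval seconds apart."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

A client would call throttle.wait() immediately before each API request, sharing one Throttle instance across all requests from the same process.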

Authentication: Modern OAuth2 with PKCE for user-specific operations

Documentation: Comprehensive API docs with examples at musicbrainz.org/doc/Development/XML_Web_Service/Version_2

5. Transparent Edit/Voting System

Command Pattern: All modifications are versioned edits, providing:

  • Full audit trail (who changed what, when, why)
  • Rollback capability (edits can be reverted)
  • Transparency (all edits publicly visible)
  • Accountability (editors build reputation)
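
The idea behind treating every modification as a versioned edit can be illustrated with a small command-pattern sketch. This is not MusicBrainz's actual (Perl) implementation; the Edit and Entity classes and their fields are hypothetical and only show how an audit trail and reverts fall out of the pattern.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class Edit:
    """One versioned change: who changed which field, to what, and why."""
    editor: str
    field_name: str
    new_value: Optional[str]
    note: str
    old_value: Optional[str] = None  # captured on apply so the edit can be undone
    applied_at: Optional[datetime] = None


@dataclass
class Entity:
    data: dict
    history: list = field(default_factory=list)  # audit trail of applied edits

    def apply(self, edit):
        edit.old_value = self.data.get(edit.field_name)
        edit.applied_at = datetime.now(timezone.utc)
        self.data[edit.field_name] = edit.new_value
        self.history.append(edit)

    def revert(self, edit, editor):
        """Undo an edit by applying its inverse, so the trail keeps both."""
        inverse = Edit(editor, edit.field_name, edit.old_value, f"revert: {edit.note}")
        self.apply(inverse)
        return inverse
```

Because a revert is itself an edit, the history never loses information: it records both the original change and who undid it.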

Community Quality Control:

  • 7-day voting period for most edits
  • Community votes yes/no/abstain
  • Auto-editors can approve immediately (earned privilege)
  • Failed edits can be resubmitted with improvements

Edit Types: 100+ edit types covering all operations:

  • Create/edit/delete entities
  • Add/edit/delete relationships
  • Merge duplicates
  • Add identifiers (ISRC, barcode, etc.)

Benefits:

  • High data quality through peer review
  • Prevents vandalism and spam
  • Encourages collaboration and discussion
  • Builds trust in the data

6. Replication Support for Mirrors

Architecture: Master-Mirror via dbmirror2 packet system

Use Cases:

  • Organizations needing local copy (reduced latency, offline access)
  • High-volume API users (avoid rate limits)
  • Research projects (full dataset access)
  • Backup/disaster recovery

Replication Packets:

  • Incremental updates (not full dumps)
  • Hourly packets available
  • Efficient bandwidth usage
  • Verifiable integrity
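
Incremental packets only work if a mirror applies them in order and never skips one. The sketch below shows that logic in isolation; the function name, the dict of available packets, and the apply_packet callback are all hypothetical stand-ins rather than the real replication tooling.

```python
def apply_pending(current_sequence, available, apply_packet):
    """Apply packets in strict sequence order, starting after current_sequence.

    available: dict mapping sequence number -> packet payload (hypothetical shape).
    apply_packet: callable that applies one payload to the local database.
    Returns the new current sequence; stops at the first gap.
    """
    next_seq = current_sequence + 1
    while next_seq in available:
        apply_packet(available[next_seq])
        current_sequence = next_seq
        next_seq += 1
    return current_sequence
```

Stopping at the first gap, instead of skipping ahead, keeps the mirror consistent: a missing packet delays updates but never produces a database that diverges from the master.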

Mirror Benefits:

  • Full read access to entire dataset
  • No rate limiting
  • Custom queries and analytics
  • Integration with internal systems

7. Rich Relationship Model

Advanced Relationships: Not just artist-to-release, but:

  • Artist-to-artist (member of, collaboration, married to, etc.)
  • Recording-to-work (performance of composition)
  • Release-to-event (recorded at festival, etc.)
  • Work-to-work (arrangement of, medley of, etc.)

Relationship Attributes:

  • Dates (begin/end)
  • Credits (custom artist credits)
  • Instruments (performer played guitar, etc.)
  • Roles (producer, engineer, etc.)
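
To make the relationship attributes concrete, the sketch below groups relationships by type from a dictionary shaped like a JSON lookup response that includes relationships. The sample data is made up, and the field names ("relations", "target-type", "attributes", "begin", "end") are assumptions that should be checked against a real API response before use.

```python
from collections import defaultdict

# Illustrative data only; not a real API response.
sample_artist = {
    "name": "Example Band",
    "relations": [
        {"type": "member of band", "target-type": "artist",
         "attributes": ["guitar"], "begin": "1990", "end": "1995",
         "artist": {"name": "Example Guitarist"}},
        {"type": "member of band", "target-type": "artist",
         "attributes": ["drums"], "begin": "1990", "end": None,
         "artist": {"name": "Example Drummer"}},
        {"type": "official homepage", "target-type": "url",
         "attributes": [], "begin": None, "end": None,
         "url": {"resource": "https://example.org/band"}},
    ],
}


def group_relations(entity):
    """Group an entity's relationships by type, keeping attributes and dates."""
    grouped = defaultdict(list)
    for rel in entity.get("relations", []):
        # The target object is stored under a key named after its type.
        target = rel.get(rel.get("target-type", ""), {})
        grouped[rel["type"]].append({
            "target": target.get("name") or target.get("title") or target.get("resource"),
            "attributes": rel.get("attributes", []),
            "begin": rel.get("begin"),
            "end": rel.get("end"),
        })
    return dict(grouped)
```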

Use Cases:

  • Music discovery (find similar artists)
  • Discography completeness (all releases by artist)
  • Session musician tracking (who played on what)
  • Classical music (composer, conductor, orchestra, etc.)

Weaknesses

1. Perl Language Ecosystem Decline

Evidence:

  • Perl ranked #19 in TIOBE index (down from top 5 in 2000s)
  • Declining CPAN module releases (peak 2014, declining since)
  • Fewer Perl developers entering workforce
  • Most new web projects use Python, JavaScript, Go, Rust

Impact:

  • Harder to recruit Perl developers
  • Smaller pool of contributors
  • Slower adoption of modern practices
  • Dependency on aging CPAN modules

Mitigation:

  • MusicBrainz has stable, experienced Perl team
  • Codebase is well-documented
  • Gradual migration to JavaScript on frontend
  • API allows language-agnostic integration

Reality Check: While Perl is declining, MusicBrainz's Perl codebase is mature and stable. The bigger risk is long-term maintainability (10+ years), not immediate functionality.

2. Heavy Infrastructure Requirements

Database Size: ~350GB for production dataset (with indexes)

Resource Requirements:

  • 8+ CPU cores
  • 16+ GB RAM
  • 500+ GB SSD storage
  • PostgreSQL 16+ (specific version requirement)
  • Redis (16 databases)
  • Apache Solr (13 cores)

Deployment Complexity:

  • Multiple services to coordinate
  • Complex build process (Perl + Node.js)
  • Long initial setup (schema load, index build)
  • Replication setup requires FTP server

Cost Implications:

  • Self-hosting requires dedicated server (~$200+/month)
  • Cloud hosting even more expensive
  • Bandwidth costs for replication
  • Operational overhead (backups, monitoring, updates)

Practical Impact: For most use cases, using the public API is far more practical than self-hosting. Only large organizations with specific needs (high volume, custom queries, offline access) should consider self-hosting.

3. No Modern Observability

Missing:

  • Prometheus metrics endpoint
  • Structured logging (JSON logs)
  • Distributed tracing (OpenTelemetry)
  • Health check endpoint
  • Readiness/liveness probes

Current State:

  • Plain text logs
  • No metrics export
  • Manual log parsing for monitoring
  • No standardized health checks

Impact:

  • Harder to integrate with modern monitoring stacks (Grafana, Datadog, etc.)
  • Limited visibility into performance bottlenecks
  • Difficult to debug production issues
  • No SLO/SLA tracking

Workarounds:

  • Parse logs with Logstash/Fluentd
  • Monitor HTTP responses
  • Database query monitoring
  • Custom metrics collection
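
As an example of the "custom metrics collection" workaround, an external prober could record HTTP response codes and latencies and expose them in the Prometheus text exposition format. This is a sketch under assumptions: the ProbeMetrics class and the mb_probe_* metric names are invented for illustration and are not provided by MusicBrainz.

```python
from collections import Counter


class ProbeMetrics:
    """Counts probe results by HTTP status and tracks total latency."""

    def __init__(self):
        self.status_counts = Counter()
        self.latency_sum = 0.0

    def record(self, status, seconds):
        self.status_counts[status] += 1
        self.latency_sum += seconds

    def render(self):
        """Render the counters in the Prometheus text exposition format."""
        lines = ["# TYPE mb_probe_responses_total counter"]
        for status, count in sorted(self.status_counts.items()):
            lines.append(f'mb_probe_responses_total{{status="{status}"}} {count}')
        lines.append("# TYPE mb_probe_latency_seconds_sum counter")
        lines.append(f"mb_probe_latency_seconds_sum {self.latency_sum}")
        return "\n".join(lines) + "\n"
```

The output of render() could be served from a small HTTP endpoint that an existing Prometheus instance scrapes, which gives basic availability and latency tracking without changes to the server itself.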

Future: Prometheus exporter is planned but not yet implemented.

4. Incomplete Frontend Modernization

Legacy Code:

  • Knockout.js still present in many views
  • jQuery used extensively
  • Inline JavaScript in templates
  • Mixed Template Toolkit + React

Evidence:

  • root/static/scripts/ contains both Knockout and React
  • Some pages fully React, others fully Knockout, some mixed
  • Inconsistent UI patterns across pages

Impact:

  • Larger JavaScript bundle size
  • Maintenance burden (two frameworks)
  • Inconsistent user experience
  • Harder for new contributors

Migration Status:

  • New features use React
  • Old features gradually migrated
  • No timeline for complete migration
  • Knockout removal is low priority

Reality Check: This is a cosmetic issue, not a functional one. The site works well despite the mixed frontend. For API users, this is irrelevant.

5. Custom ORM Instead of Standard

Architecture: Custom Moose-based data layer, not DBIx::Class

Characteristics:

  • 106 Data modules (26,000 lines)
  • Raw SQL via DBD::Pg
  • Custom query builder (Sql.pm)
  • Moose roles for common patterns

Drawbacks:

  • Steeper learning curve for new contributors
  • No ecosystem of plugins/extensions
  • Manual query construction
  • No automatic migrations

Benefits:

  • Better performance (no ORM overhead)
  • Full control over SQL
  • Simpler for complex queries
  • Fewer dependencies

Reality Check: The custom data layer is well-designed and battle-tested. Its cost shows up in onboarding and maintainability, not in functionality. For a project this mature, switching to a standard ORM would be a massive undertaking with little benefit.

6. Limited Real-Time Capabilities

Current State:

  • No WebSocket support
  • No Server-Sent Events
  • No real-time notifications
  • Polling required for updates

Impact:

  • Edit notifications delayed
  • Search results not live-updated
  • Collaborative editing limited
  • Higher server load from polling

Workarounds:

  • Redis pub/sub for internal events
  • Periodic polling from clients
  • Email notifications for edits

Future: Real-time features not prioritized (low demand).

Integration Considerations

Public API (Simple)

Best For:

  • Most use cases
  • Low to medium volume (<1M requests/month)
  • No custom query requirements
  • Budget-conscious projects

Approach:

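import requests

# Look up an artist by MBID, including its releases and recordings
response = requests.get(
    'https://musicbrainz.org/ws/2/artist/5b11f4ce-a62d-471e-81fc-a69a8278c7da',
    params={'fmt': 'json', 'inc': 'releases+recordings'},
    # Identify the application and give contact info, as MusicBrainz asks
    headers={'User-Agent': 'MyApp/1.0 (contact@example.com)'},
    timeout=10,
)
response.raise_for_status()  # surface HTTP errors (e.g. 503 when rate limited)
artist = response.json()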

Advantages:

  • No infrastructure to manage
  • Always up-to-date data
  • No storage costs
  • Simple integration

Limitations:

  • Rate limiting (1 req/sec recommended)
  • Network latency
  • No custom queries
  • Dependent on MusicBrainz uptime

Best Practices:

  • Cache responses aggressively
  • Respect rate limits
  • Include User-Agent with contact info
  • Handle errors gracefully
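
"Handle errors gracefully" mostly means retrying when the service answers 503 (busy or rate limited) and failing clearly otherwise. The sketch below uses only the standard library so it stays self-contained. The names http_fetch and get_json are hypothetical, and the linear backoff schedule is an arbitrary choice, not a documented recommendation.

```python
import json
import time
import urllib.error
import urllib.request

USER_AGENT = "MyApp/1.0 (contact@example.com)"  # replace with real contact info


def http_fetch(url):
    """Return (status, body_bytes) for a GET request without raising on HTTP errors."""
    request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            return response.status, response.read()
    except urllib.error.HTTPError as err:
        return err.code, err.read()


def get_json(url, retries=3, backoff=2.0, fetch=http_fetch, sleep=time.sleep):
    """Fetch JSON, retrying on 503 with a linearly growing delay."""
    for attempt in range(retries + 1):
        status, body = fetch(url)
        if status == 200:
            return json.loads(body)
        if status == 503 and attempt < retries:
            sleep(backoff * (attempt + 1))  # 2s, 4s, 6s with the defaults
            continue
        raise RuntimeError(f"request failed with HTTP {status}")
```

The fetch and sleep parameters are injectable so the retry logic can be unit-tested without touching the network.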

Replication/Mirror (Advanced)

Best For:

  • High volume (>10M requests/month)
  • Custom queries and analytics
  • Offline access required
  • Research projects

Approach:

  1. Set up PostgreSQL 16+ server (500GB+ storage)
  2. Download initial database dump
  3. Load schema and data
  4. Configure replication (RT_MIRROR mode)
  5. Download and apply hourly replication packets

Advantages:

  • No rate limiting
  • Full dataset access
  • Custom queries
  • Low latency

Disadvantages:

  • High infrastructure cost (~$200+/month)
  • Operational overhead
  • Replication lag (minutes to hours)
  • Storage requirements (350GB+)

Maintenance:

  • Apply replication packets hourly
  • Monitor replication lag
  • Rebuild indexes periodically
  • Backup database regularly

Hybrid Approach (Optimal)

Strategy:

  • Use API for lookups and searches
  • Cache frequently accessed data locally
  • Replicate subset of data for custom queries
  • Fall back to API for cache misses

Example:

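import requests

def get_artist(mbid, cache):
    """Return artist data; `cache` is any client with get() and set(..., ttl=)."""
    # Check local cache first
    artist = cache.get(f'artist:{mbid}')
    if artist is not None:
        return artist

    # Cache miss - fetch from API (fmt=json is needed; the default is XML)
    response = requests.get(
        f'https://musicbrainz.org/ws/2/artist/{mbid}',
        params={'fmt': 'json'},
        headers={'User-Agent': 'MyApp/1.0 (contact@example.com)'},
        timeout=10,
    )
    response.raise_for_status()
    artist = response.json()

    # Cache for 1 hour
    cache.set(f'artist:{mbid}', artist, ttl=3600)
    return artist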

Benefits:

  • Lower API usage (respect rate limits)
  • Faster response times
  • Reduced infrastructure costs
  • Graceful degradation
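
Graceful degradation can be sketched as a cache that prefers fresh data but falls back to an expired entry when the API call fails. The StaleTolerantCache class below is a hypothetical in-memory illustration; a production setup would more likely use Redis or the aggregator's own database as the backing store.

```python
import time


class StaleTolerantCache:
    """In-memory cache that can serve expired entries when a refresh fails."""

    def __init__(self, ttl=3600, clock=time.time):
        self.ttl = ttl
        self._clock = clock
        self._entries = {}  # key -> (stored_at, value)

    def get_or_fetch(self, key, fetch):
        entry = self._entries.get(key)
        now = self._clock()
        if entry and now - entry[0] < self.ttl:
            return entry[1]  # fresh hit, no API call needed
        try:
            value = fetch()
        except Exception:
            if entry:
                return entry[1]  # API unavailable: degrade to stale data
            raise  # nothing cached, so the error must surface
        self._entries[key] = (now, value)
        return value
```

Here fetch is any zero-argument callable, for example a lambda wrapping the API lookup for one MBID.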

Relevance to Metadata Aggregator Project

Primary Data Source

Role: MusicBrainz is the foundational music metadata source. Many other music metadata projects reference or build upon MusicBrainz:

  • Discogs: Cross-references MusicBrainz IDs
  • Last.fm: Uses MusicBrainz for artist/track normalization
  • AcousticBrainz: Audio analysis keyed by MusicBrainz recording ID
  • ListenBrainz: Listening history using MusicBrainz IDs
  • CritiqueBrainz: Reviews keyed by MusicBrainz release ID

Implication: A metadata aggregator without MusicBrainz is incomplete. MusicBrainz provides the canonical identifiers (MBIDs) that link data across services.

Integration Priority: Critical

Rationale:

  1. Canonical IDs: MBIDs are the standard for music entity identification
  2. Comprehensive Coverage: Largest open music metadata database
  3. Relationship Data: Rich connections between entities
  4. Community Trust: High data quality through peer review
  5. API Stability: Mature, stable API with long-term support

Recommended Integration:

  • Use MusicBrainz API as primary metadata source
  • Cache responses locally (1-hour TTL)
  • Use MBIDs as primary keys in aggregator database
  • Cross-reference with other sources (Discogs, Last.fm, etc.)
  • Contribute improvements back to MusicBrainz

Data Model Alignment

MusicBrainz Entities Map Well to Aggregator Needs:

| MusicBrainz Entity | Aggregator Use Case                      |
|--------------------|------------------------------------------|
| Artist             | Artist profiles, discographies           |
| Release            | Album/single metadata                    |
| Recording          | Track metadata, audio fingerprinting     |
| Work               | Composition metadata, cover detection    |
| Label              | Label discographies, release attribution |
| Relationship       | Music discovery, session musician tracking |

Identifiers:

  • MBID as primary key
  • ISRC for recording matching
  • Barcode for release matching
  • Disc ID for CD identification
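
The identifier roles above could translate into a schema along these lines. This is a sketch only: it uses SQLite from the Python standard library as a stand-in for the project's PostgreSQL database, and the table names, column names, and helper functions are hypothetical.

```python
import sqlite3
import uuid


def normalize_mbid(value):
    """Validate that value is a UUID and return its canonical lowercase form."""
    return str(uuid.UUID(value))


def create_schema(conn):
    """Create tables keyed by MBID, with ISRCs in a separate matching table."""
    conn.executescript("""
        CREATE TABLE recording (
            mbid  TEXT PRIMARY KEY,
            title TEXT NOT NULL
        );
        -- One ISRC can map to several recordings, hence the composite key.
        CREATE TABLE recording_isrc (
            isrc TEXT NOT NULL,
            mbid TEXT NOT NULL REFERENCES recording(mbid),
            PRIMARY KEY (isrc, mbid)
        );
    """)


def find_by_isrc(conn, isrc):
    """Return (mbid, title) rows for all recordings carrying the given ISRC."""
    rows = conn.execute(
        "SELECT r.mbid, r.title FROM recording r "
        "JOIN recording_isrc i ON i.mbid = r.mbid "
        "WHERE i.isrc = ? ORDER BY r.mbid",
        (isrc,),
    )
    return rows.fetchall()
```

In PostgreSQL the mbid columns would more naturally use the native UUID type, which makes the normalization step a matter of input validation only.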

Complementary Data Sources

MusicBrainz Strengths:

  • Canonical entity IDs
  • Relationship data
  • Release metadata
  • Identifier coverage

MusicBrainz Gaps (fill with other sources):

  • Album reviews → CritiqueBrainz, AllMusic
  • Listening statistics → Last.fm, Spotify
  • Audio features → AcousticBrainz, Spotify
  • Lyrics → Genius (LyricWiki shut down in 2020)
  • Album art → Cover Art Archive (integrated)
  • Popularity metrics → Last.fm, Spotify

Implementation Roadmap

Phase 1: Basic Integration

  1. Implement MusicBrainz API client
  2. Cache artist/release/recording lookups
  3. Store MBIDs as primary keys
  4. Handle rate limiting gracefully

Phase 2: Enhanced Integration

  1. Implement relationship traversal
  2. Add search functionality
  3. Integrate Cover Art Archive
  4. Add identifier lookups (ISRC, barcode)
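
For the Phase 2 items, identifier lookups and cover art mostly come down to building the right URLs. The sketch below assumes the ISRC lookup endpoint, the barcode search field, and the Cover Art Archive front-image path behave as their public documentation describes; the function names are hypothetical and no requests are sent.

```python
import re
from urllib.parse import urlencode

WS_ROOT = "https://musicbrainz.org/ws/2"
CAA_ROOT = "https://coverartarchive.org"

# ISRC layout: 2-letter country, 3-character registrant, 2-digit year, 5-digit designation.
ISRC_RE = re.compile(r"[A-Z]{2}[A-Z0-9]{3}[0-9]{7}")


def isrc_lookup_url(isrc):
    """URL listing the recordings that carry the given ISRC."""
    code = isrc.replace("-", "").upper()
    if not ISRC_RE.fullmatch(code):
        raise ValueError(f"not a valid ISRC: {isrc!r}")
    return f"{WS_ROOT}/isrc/{code}?fmt=json"


def barcode_search_url(barcode):
    """URL searching releases by barcode (EAN/UPC digits only)."""
    if not barcode.isdigit():
        raise ValueError(f"not a numeric barcode: {barcode!r}")
    params = {"query": f"barcode:{barcode}", "fmt": "json"}
    return f"{WS_ROOT}/release?{urlencode(params)}"


def front_cover_url(release_mbid):
    """URL of the front cover image for a release on the Cover Art Archive."""
    return f"{CAA_ROOT}/release/{release_mbid}/front"
```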

Phase 3: Advanced Integration

  1. Consider replication for high volume
  2. Contribute improvements to MusicBrainz
  3. Implement edit submission (if applicable)
  4. Add near-real-time update monitoring (by polling or via replication packets, since the server offers no push mechanism)

Phase 4: Ecosystem Integration

  1. Integrate complementary services (Last.fm, etc.)
  2. Cross-reference data across sources
  3. Resolve conflicts and duplicates
  4. Build unified metadata view
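
Conflict resolution in Phase 4 could start with a simple source-priority rule keyed by MBID. The sketch below is one possible approach, not a settled design: the merge_records function, the priority ordering, and the field names are all illustrative assumptions.

```python
SOURCE_PRIORITY = ["musicbrainz", "discogs", "lastfm"]  # illustrative ordering


def merge_records(records, priority=SOURCE_PRIORITY):
    """Merge per-source records for one entity into a unified view.

    records: dict mapping source name -> dict of field -> value.
    For each field, the highest-priority source with a non-empty value wins.
    Returns (merged, provenance), where provenance maps each field to the
    source it came from so conflicts stay traceable.
    """
    merged, provenance = {}, {}
    for source in priority:
        for field_name, value in records.get(source, {}).items():
            if value in (None, "", []):
                continue  # treat empty values as missing
            if field_name not in merged:
                merged[field_name] = value
                provenance[field_name] = source
    return merged, provenance
```

Keeping provenance alongside the merged values makes it possible to explain, and later revisit, why a particular source won for a given field.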

Conclusion

Overall Assessment: MusicBrainz is an essential, high-quality music metadata source with a mature codebase and comprehensive API. While it has some technical debt (Perl, legacy frontend, custom ORM), these are manageable and don't impact its value as a data source.

Recommendation for Metadata Aggregator:

  • Priority: Critical - integrate early
  • Approach: API-based with aggressive caching
  • Timeline: Phase 1 in first sprint
  • Resources: Low (API integration is straightforward)

Key Takeaway: MusicBrainz is the foundation of music metadata. Any serious music metadata aggregator must integrate MusicBrainz to be comprehensive and credible.