Alexander a1f6701bac feat: initial implementation of metadata aggregator

- gRPC service with MusicBrainz provider
- PostgreSQL schema with migrations
- Service layer with database-first caching
- Repository pattern for data access
- YAML configuration support
- Research documentation for 17 music metadata projects

2026-04-28 16:28:53 +02:00

MusicBrainz Server Evaluation

Strengths

1. Canonical Music Metadata Source

Evidence: MusicBrainz is the de facto standard for music metadata. Used by:

Spotify (artist/release matching)
Last.fm (scrobbling normalization)
Roon (music library management)
Picard (music tagging)
Beets (music organization)
Hundreds of other music applications

Impact: Any music metadata aggregator must include MusicBrainz data to be comprehensive. It's the foundation that other services build upon.

Data Quality: Community-driven editing with a voting system promotes high accuracy. The community submits over 2 million edits per year, with auto-editors providing additional quality control.

2. Massive, Comprehensive Dataset

Scale (as of 2024):

2.1+ million artists
3.5+ million releases
30+ million recordings
1.5+ million works
1.3+ million labels
100+ million relationships

Coverage: Extensive coverage across:

All genres (classical, jazz, rock, electronic, world music, etc.)
All eras (historical recordings to latest releases)
All regions (global coverage with strong international community)
All formats (vinyl, CD, digital, cassette, etc.)

Relationships: Rich relationship data connecting:

Artists to recordings (performer, conductor, engineer, etc.)
Recordings to works (performance of composition)
Artists to artists (member of, collaboration, etc.)
Releases to labels, areas, events, etc.

Identifiers: Comprehensive identifier coverage:

ISRCs (International Standard Recording Code)
ISWCs (International Standard Musical Work Code)
Barcodes (EAN, UPC)
Disc IDs (CD table of contents)
External links (Wikipedia, Discogs, AllMusic, etc.)

3. Mature, Battle-Tested Codebase

Age: 20+ years of continuous development (since 2001)

Stability: Proven reliability serving millions of requests daily with minimal downtime.

Evolution: Gradual modernization while maintaining backward compatibility:

Started with Template Toolkit (still used)
Added Knockout.js (being phased out)
Migrating to React (ongoing)
API has remained stable since v2 (2011)

Community: Large, active open-source community:

500+ contributors on GitHub
Active development (commits daily)
Responsive to issues and pull requests
Strong documentation culture

4. Comprehensive, Well-Designed API

Maturity: API v2 stable since 2011, widely adopted

Formats: Multiple serialization formats:

JSON (modern, widely supported)
XML (legacy, still used by many clients)
JSON-LD (semantic web, Schema.org vocabulary)

Features:

Lookup by MBID (unique identifier)
Browse by relationships (all releases by artist, etc.)
Search with Lucene query syntax
Include parameters for fine-grained control
Pagination for large result sets
CORS enabled for browser clients

Rate Limiting: Reasonable limits (about 1 request per second per client; excess requests are rejected with HTTP 503) with clear documentation

Authentication: Modern OAuth2 with PKCE for user-specific operations

Documentation: Comprehensive API docs with examples at musicbrainz.org/doc/Development/XML_Web_Service/Version_2
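
As a rough illustration of the browse, search, and pagination features listed above, the sketch below builds query parameters for the `/ws/2/` endpoints. The helper names (`escape_lucene`, `browse_params`, `search_params`, `page_offsets`) are made up for this example. The `query`, `limit`, `offset`, and `fmt` parameters and the 100-result page cap reflect the public API documentation as understood here, so verify them before relying on this.

```python
import re

API_ROOT = "https://musicbrainz.org/ws/2"
MAX_LIMIT = 100  # browse and search pages are capped at 100 results

# Characters and operators with special meaning in Lucene query syntax.
_LUCENE_SPECIAL = re.compile(r'([+\-!(){}\[\]^"~*?:\\/]|&&|\|\|)')


def escape_lucene(text):
    """Backslash-escape Lucene special characters in a user-supplied term."""
    return _LUCENE_SPECIAL.sub(r"\\\1", text)


def browse_params(linked_entity, mbid, limit=25, offset=0):
    """Parameters for a browse request, e.g. all releases linked to an artist."""
    return {linked_entity: mbid, "limit": min(limit, MAX_LIMIT),
            "offset": offset, "fmt": "json"}


def search_params(field, term, limit=25, offset=0):
    """Parameters for a fielded Lucene search on a quoted phrase."""
    query = f'{field}:"{escape_lucene(term)}"'
    return {"query": query, "limit": min(limit, MAX_LIMIT),
            "offset": offset, "fmt": "json"}


def page_offsets(total, limit):
    """Offsets needed to page through `total` results, `limit` at a time."""
    return list(range(0, total, limit))
```

A browse call would then look like `requests.get(f"{API_ROOT}/release", params=browse_params("artist", mbid, limit=100))`, repeated for each offset in `page_offsets(total, 100)`, where `total` is taken from the count field in the first response.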

5. Transparent Edit/Voting System

Command Pattern: All modifications are versioned edits, providing:

Full audit trail (who changed what, when, why)
Rollback capability (edits can be reverted)
Transparency (all edits publicly visible)
Accountability (editors build reputation)
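
To make the command pattern concrete, here is a minimal in-memory sketch of edits as versioned commands with an audit trail and rollback. It is an illustration only, not MusicBrainz's Perl implementation: the `Edit` and `EditLog` classes and all of their fields are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class Edit:
    """One versioned change: who changed which field, and why."""
    editor: str
    entity_id: str
    field_name: str
    new_value: Optional[str]
    note: str
    old_value: Optional[str] = None
    applied_at: Optional[datetime] = None


@dataclass
class EditLog:
    """Applies edits to an in-memory store and keeps every one for auditing."""
    store: dict = field(default_factory=dict)
    history: list = field(default_factory=list)

    def apply(self, edit):
        entity = self.store.setdefault(edit.entity_id, {})
        edit.old_value = entity.get(edit.field_name)  # remembered for rollback
        entity[edit.field_name] = edit.new_value
        edit.applied_at = datetime.now(timezone.utc)
        self.history.append(edit)

    def revert(self, edit, editor, note):
        """A rollback is itself a new edit, so the audit trail stays complete."""
        undo = Edit(editor, edit.entity_id, edit.field_name, edit.old_value, note)
        self.apply(undo)
        return undo
```

Because a revert is recorded as another edit rather than a deletion, the history always answers who changed what, when, and why.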

Community Quality Control:

7-day voting period for most edits
Community votes yes/no/abstain
Auto-editors can approve immediately (earned privilege)
Failed edits can be resubmitted with improvements

Edit Types: 100+ edit types covering all operations:

Create/edit/delete entities
Add/edit/delete relationships
Merge duplicates
Add identifiers (ISRC, barcode, etc.)

Benefits:

High data quality through peer review
Prevents vandalism and spam
Encourages collaboration and discussion
Builds trust in the data

6. Replication Support for Mirrors

Architecture: Master-Mirror via dbmirror2 packet system

Use Cases:

Organizations needing local copy (reduced latency, offline access)
High-volume API users (avoid rate limits)
Research projects (full dataset access)
Backup/disaster recovery

Replication Packets:

Incremental updates (not full dumps)
Hourly packets available
Efficient bandwidth usage
Verifiable integrity

Mirror Benefits:

Full read access to entire dataset
No rate limiting
Custom queries and analytics
Integration with internal systems

7. Rich Relationship Model

Advanced Relationships: Not just artist-to-release, but:

Artist-to-artist (member of, collaboration, married to, etc.)
Recording-to-work (performance of composition)
Release-to-event (recorded at festival, etc.)
Work-to-work (arrangement of, medley of, etc.)

Relationship Attributes:

Dates (begin/end)
Credits (custom artist credits)
Instruments (performer played guitar, etc.)
Roles (producer, engineer, etc.)
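
The sketch below shows one way an aggregator might read relationship data from an API response. The sample payload is hand-written and abbreviated, and the key names (`relations`, `type`, `direction`, `begin`, `end`, `attributes`, `artist`) follow the JSON format of a lookup with `inc=artist-rels` as understood here; treat them as assumptions to check against a real response.

```python
from collections import defaultdict

# Hand-written, abbreviated sample shaped like an artist lookup with inc=artist-rels.
SAMPLE = {
    "name": "Example Band",
    "relations": [
        {"type": "member of band", "direction": "backward", "begin": "1990",
         "end": None, "attributes": ["guitar"], "artist": {"name": "Player One"}},
        {"type": "member of band", "direction": "backward", "begin": "1990",
         "end": "1995", "attributes": ["drums"], "artist": {"name": "Player Two"}},
        {"type": "collaboration", "direction": "forward", "begin": None,
         "end": None, "attributes": [], "artist": {"name": "Side Project"}},
    ],
}


def relations_by_type(entity):
    """Group an entity's relations by relationship type."""
    grouped = defaultdict(list)
    for rel in entity.get("relations", []):
        grouped[rel["type"]].append(rel)
    return dict(grouped)


def current_members(band):
    """Names of members whose 'member of band' relation has no end date."""
    members = relations_by_type(band).get("member of band", [])
    return [rel["artist"]["name"] for rel in members if not rel.get("end")]


print(current_members(SAMPLE))  # ['Player One']
```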

Use Cases:

Music discovery (find similar artists)
Discography completeness (all releases by artist)
Session musician tracking (who played on what)
Classical music (composer, conductor, orchestra, etc.)

Weaknesses

1. Perl Language Ecosystem Decline

Evidence:

Perl ranked #19 in TIOBE index (down from top 5 in 2000s)
Declining CPAN module releases (peak 2014, declining since)
Fewer Perl developers entering workforce
Most new web projects use Python, JavaScript, Go, Rust

Impact:

Harder to recruit Perl developers
Smaller pool of contributors
Slower adoption of modern practices
Dependency on aging CPAN modules

Mitigation:

MusicBrainz has stable, experienced Perl team
Codebase is well-documented
Gradual migration to JavaScript on frontend
API allows language-agnostic integration

Reality Check: While Perl is declining, MusicBrainz's Perl codebase is mature and stable. The bigger risk is long-term maintainability (10+ years), not immediate functionality.

2. Heavy Infrastructure Requirements

Database Size: ~350GB for production dataset (with indexes)

Resource Requirements:

8+ CPU cores
16+ GB RAM
500+ GB SSD storage
PostgreSQL 16+ (specific version requirement)
Redis (16 databases)
Apache Solr (13 cores)

Deployment Complexity:

Multiple services to coordinate
Complex build process (Perl + Node.js)
Long initial setup (schema load, index build)
Replication setup requires FTP server

Cost Implications:

Self-hosting requires dedicated server (~$200+/month)
Cloud hosting even more expensive
Bandwidth costs for replication
Operational overhead (backups, monitoring, updates)

Practical Impact: For most use cases, using the public API is far more practical than self-hosting. Only large organizations with specific needs (high volume, custom queries, offline access) should consider self-hosting.

3. No Modern Observability

Missing:

Prometheus metrics endpoint
Structured logging (JSON logs)
Distributed tracing (OpenTelemetry)
Health check endpoint
Readiness/liveness probes

Current State:

Plain text logs
No metrics export
Manual log parsing for monitoring
No standardized health checks

Impact:

Harder to integrate with modern monitoring stacks (Grafana, Datadog, etc.)
Limited visibility into performance bottlenecks
Difficult to debug production issues
No SLO/SLA tracking

Workarounds:

Parse logs with Logstash/Fluentd
Monitor HTTP responses
Database query monitoring
Custom metrics collection
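
The "Monitor HTTP responses" workaround can be approximated with an external probe. The following is a minimal sketch, assuming a self-hosted mirror reachable over HTTP; the function names, the 2-second slowness threshold, and the `mirror.example` host are placeholders, not part of MusicBrainz.

```python
import time


def classify(status_code, elapsed_s, slow_after_s=2.0):
    """Map one HTTP probe result to a coarse health state."""
    if status_code is None or status_code >= 500:
        return "down"
    if elapsed_s > slow_after_s:
        return "degraded"
    return "ok"


def probe(url, fetch, clock=time.monotonic):
    """Time a single request; `fetch(url)` must return an HTTP status code."""
    started = clock()
    try:
        status = fetch(url)
    except Exception:
        status = None  # connection errors count as "down"
    elapsed = clock() - started
    return {"url": url, "status": status, "elapsed_s": elapsed,
            "health": classify(status, elapsed)}
```

With the `requests` library, a caller could pass `lambda u: requests.get(u, timeout=5).status_code` as `fetch` and export the resulting dictionary to whatever metrics system is in use.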

Future: Prometheus exporter is planned but not yet implemented.

4. Incomplete Frontend Modernization

Legacy Code:

Knockout.js still present in many views
jQuery used extensively
Inline JavaScript in templates
Mixed Template Toolkit + React

Evidence:

root/static/scripts/ contains both Knockout and React
Some pages fully React, others fully Knockout, some mixed
Inconsistent UI patterns across pages

Impact:

Larger JavaScript bundle size
Maintenance burden (two frameworks)
Inconsistent user experience
Harder for new contributors

Migration Status:

New features use React
Old features gradually migrated
No timeline for complete migration
Knockout removal is low priority

Reality Check: This is a cosmetic issue, not a functional one. The site works well despite the mixed frontend. For API users, this is irrelevant.

5. Custom Data Layer Instead of Standard ORM

Architecture: Custom Moose-based data layer, not DBIx::Class

Characteristics:

106 Data modules (26,000 lines)
Raw SQL via DBD::Pg
Custom query builder (Sql.pm)
Moose roles for common patterns

Drawbacks:

Steeper learning curve for new contributors
No ecosystem of plugins/extensions
Manual query construction
No automatic migrations

Benefits:

Better performance (no ORM overhead)
Full control over SQL
Simpler for complex queries
Fewer dependencies

Reality Check: The custom data layer is well-designed and battle-tested. Its cost shows up in onboarding and maintainability, not in functionality. For a project this mature, switching to a standard ORM would be a massive undertaking with little benefit.

6. Limited Real-Time Capabilities

Current State:

No WebSocket support
No Server-Sent Events
No real-time notifications
Polling required for updates

Impact:

Edit notifications delayed
Search results not live-updated
Collaborative editing limited
Higher server load from polling

Workarounds:

Redis pub/sub for internal events
Periodic polling from clients
Email notifications for edits

Future: Real-time features not prioritized (low demand).

Integration Considerations

API Integration (Recommended)

Best For:

Most use cases
Low to medium volume (<1M requests/month)
No custom query requirements
Budget-conscious projects
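
The volume guideline can be sanity-checked against the rate limit with simple arithmetic. The sketch below assumes a single client sustaining one request per second around the clock; the function names are illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def monthly_ceiling(requests_per_second=1.0, days=30):
    """Upper bound on requests per month at a sustained request rate."""
    return int(requests_per_second * SECONDS_PER_DAY * days)


def fits_public_api(requests_per_month, requests_per_second=1.0):
    """True if the monthly volume stays under the rate-limit ceiling."""
    return requests_per_month <= monthly_ceiling(requests_per_second)


print(monthly_ceiling())            # 2592000
print(fits_public_api(1_000_000))   # True
print(fits_public_api(10_000_000))  # False
```

At roughly 2.6 million requests per month as the theoretical maximum, a workload under 1M leaves headroom for bursts and retries, while anything near 10M cannot be served through the public API by one client.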

Approach:

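import requests

# Lookup artist by MBID
response = requests.get(
    'https://musicbrainz.org/ws/2/artist/5b11f4ce-a62d-471e-81fc-a69a8278c7da',
    # Space-separated includes; requests encodes the space as '+' in the URL
    params={'fmt': 'json', 'inc': 'releases recordings'},
    # Identify the application and give contact info in the User-Agent
    headers={'User-Agent': 'MyApp/1.0 (contact@example.com)'},
    timeout=10,
)
response.raise_for_status()  # surface 4xx/5xx (e.g. 503 when rate limited)
artist = response.json()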

Advantages:

No infrastructure to manage
Always up-to-date data
No storage costs
Simple integration

Limitations:

Rate limiting (1 req/sec recommended)
Network latency
No custom queries
Dependent on MusicBrainz uptime

Best Practices:

Cache responses aggressively
Respect rate limits
Include User-Agent with contact info
Handle errors gracefully
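
Two of these practices, respecting the rate limit and handling errors, can be sketched as a small throttle plus a retry schedule. This is a minimal single-threaded sketch; the `Throttle` class and `backoff_delays` function are hypothetical names, and the clock and sleep functions are injectable only so the logic can be tested without waiting.

```python
import time


class Throttle:
    """Enforces a minimum interval between calls (default: 1 second)."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None

    def wait(self):
        """Block until at least `min_interval` has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def backoff_delays(retries=4, base=1.0, cap=30.0):
    """Exponential delays (seconds) to wait between retries after a 503."""
    return [min(cap, base * 2 ** attempt) for attempt in range(retries)]
```

A client would call `throttle.wait()` before every request and, on an HTTP 503, sleep for each value from `backoff_delays()` in turn before retrying.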

Replication/Mirror (Advanced)

Best For:

High volume (>10M requests/month)
Custom queries and analytics
Offline access required
Research projects

Approach:

Set up PostgreSQL 16+ server (500GB+ storage)
Download initial database dump
Load schema and data
Configure replication (RT_MIRROR mode)
Download and apply hourly replication packets

Advantages:

No rate limiting
Full dataset access
Custom queries
Low latency

Disadvantages:

High infrastructure cost (~$200+/month)
Operational overhead
Replication lag (minutes to hours)
Storage requirements (350GB+)

Maintenance:

Apply replication packets hourly
Monitor replication lag
Rebuild indexes periodically
Backup database regularly
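
Monitoring replication lag can start from the packet sequence numbers. The sketch below assumes the mirror knows its last applied sequence and can learn the latest published one; the function names, the one-packet-per-hour cadence, and the 3-hour warning threshold are assumptions for illustration.

```python
def pending_sequences(local_seq, latest_seq):
    """Packet sequence numbers still to apply, in the order they must run."""
    return list(range(local_seq + 1, latest_seq + 1))


def estimated_lag_hours(local_seq, latest_seq, packets_per_hour=1):
    """Rough lag estimate, assuming packets are published at a steady rate."""
    return len(pending_sequences(local_seq, latest_seq)) / packets_per_hour


def lag_status(local_seq, latest_seq, warn_after_hours=3):
    """Coarse status for alerting; the threshold is an arbitrary example."""
    lag = estimated_lag_hours(local_seq, latest_seq)
    if lag == 0:
        return "up to date"
    return "lagging" if lag > warn_after_hours else "catching up"
```

Packets have to be applied strictly in sequence, which is why `pending_sequences` returns an ordered list instead of a count.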

Hybrid Approach (Optimal)

Strategy:

Use API for lookups and searches
Cache frequently accessed data locally
Replicate subset of data for custom queries
Fall back to API for cache misses

Example:

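import requests

def get_artist(mbid, cache):
    """Return artist data, preferring the local cache over the API."""
    # Check local cache first (cache is any object with get() and set(..., ttl=))
    artist = cache.get(f'artist:{mbid}')

    if not artist:
        # Cache miss - fetch from API; fmt=json is needed because XML is the default
        response = requests.get(
            f'https://musicbrainz.org/ws/2/artist/{mbid}',
            params={'fmt': 'json'},
            headers={'User-Agent': 'MyApp/1.0 (contact@example.com)'},
            timeout=10,
        )
        response.raise_for_status()
        artist = response.json()

        # Cache for 1 hour
        cache.set(f'artist:{mbid}', artist, ttl=3600)

    return artist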
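
The `cache` object in the example is left undefined. One possible stand-in is the in-memory sketch below, which offers the same `get(key)` and `set(key, value, ttl=...)` calls. The `TTLCache` class is hypothetical, and a production aggregator would more likely use Redis or the database for this.

```python
import time


class TTLCache:
    """Tiny in-memory cache with per-entry expiry; get() returns None on a miss."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._entries = {}  # key -> (expires_at, value)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # expired: drop it and report a miss
            return None
        return value

    def set(self, key, value, ttl):
        """Store `value` under `key` for `ttl` seconds."""
        self._entries[key] = (self._clock() + ttl, value)
```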

Benefits:

Lower API usage (respect rate limits)
Faster response times
Reduced infrastructure costs
Graceful degradation

Relevance to Metadata Aggregator Project

Primary Data Source

Role: MusicBrainz is the foundational music metadata source. Many other music metadata projects reference or build upon MusicBrainz:

Discogs: Cross-references MusicBrainz IDs
Last.fm: Uses MusicBrainz for artist/track normalization
AcousticBrainz: Audio analysis keyed by MusicBrainz recording ID (stopped accepting new submissions in 2022)
ListenBrainz: Listening history using MusicBrainz IDs
CritiqueBrainz: Reviews keyed by MusicBrainz release ID

Implication: A metadata aggregator without MusicBrainz is incomplete. MusicBrainz provides the canonical identifiers (MBIDs) that link data across services.

Integration Priority: Critical

Rationale:

Canonical IDs: MBIDs are the standard for music entity identification
Comprehensive Coverage: Largest open music metadata database
Relationship Data: Rich connections between entities
Community Trust: High data quality through peer review
API Stability: Mature, stable API with long-term support

Recommended Integration:

Use MusicBrainz API as primary metadata source
Cache responses locally (1-hour TTL)
Use MBIDs as primary keys in aggregator database
Cross-reference with other sources (Discogs, Last.fm, etc.)
Contribute improvements back to MusicBrainz

Data Model Alignment

MusicBrainz Entities Map Well to Aggregator Needs:

MusicBrainz Entity	Aggregator Use Case
Artist	Artist profiles, discographies
Release	Album/single metadata
Recording	Track metadata, audio fingerprinting
Work	Composition metadata, cover detection
Label	Label discographies, release attribution
Relationship	Music discovery, session musician tracking

Identifiers:

MBID as primary key
ISRC for recording matching
Barcode for release matching
Disc ID for CD identification
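
A minimal sketch of how these identifiers might be validated before they are stored follows. MBIDs are UUIDs, and an ISRC is 12 characters (2-letter country code, 3-character registrant, 2-digit year, 5-digit designation). The `RecordingKey` class and helper names are hypothetical and not part of the aggregator's actual schema.

```python
import re
import uuid
from dataclasses import dataclass
from typing import Optional

_ISRC = re.compile(r"[A-Z]{2}[A-Z0-9]{3}\d{7}")


def normalize_mbid(value):
    """Return the canonical lowercase UUID form, or raise ValueError."""
    return str(uuid.UUID(value))


def normalize_isrc(value):
    """Strip hyphens and spaces, upper-case, and raise ValueError if malformed."""
    compact = value.replace("-", "").replace(" ", "").upper()
    if not _ISRC.fullmatch(compact):
        raise ValueError(f"not a valid ISRC: {value!r}")
    return compact


@dataclass(frozen=True)
class RecordingKey:
    """Hypothetical aggregator key: MBID is primary, ISRC is a secondary match."""
    mbid: str
    isrc: Optional[str] = None

    @classmethod
    def create(cls, mbid, isrc=None):
        return cls(normalize_mbid(mbid), normalize_isrc(isrc) if isrc else None)
```

Normalizing at the boundary means that two sources spelling the same ISRC as `US-RC1-76-07839` and `usrc17607839` end up with the same key.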

Complementary Data Sources

MusicBrainz Strengths:

Canonical entity IDs
Relationship data
Release metadata
Identifier coverage

MusicBrainz Gaps (fill with other sources):

Album reviews → CritiqueBrainz, AllMusic
Listening statistics → Last.fm, Spotify
Audio features → AcousticBrainz (historical data only; submissions ended in 2022), Spotify
Lyrics → Genius (LyricWiki shut down in 2020)
Album art → Cover Art Archive (integrated)
Popularity metrics → Last.fm, Spotify

Implementation Roadmap

Phase 1: Basic Integration

Implement MusicBrainz API client
Cache artist/release/recording lookups
Store MBIDs as primary keys
Handle rate limiting gracefully

Phase 2: Enhanced Integration

Implement relationship traversal
Add search functionality
Integrate Cover Art Archive
Add identifier lookups (ISRC, barcode)
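
For the Phase 2 items, the request URLs can be sketched as below. The Cover Art Archive `front` path with its thumbnail suffixes, the `/ws/2/isrc/` lookup, and the `barcode` search field reflect the public documentation as understood here; the function names are illustrative, and the URLs should be verified before use.

```python
from urllib.parse import urlencode

MB_ROOT = "https://musicbrainz.org/ws/2"
CAA_ROOT = "https://coverartarchive.org"


def front_cover_url(release_mbid, size=None):
    """URL of a release's front cover; size 250, 500, or 1200 selects a thumbnail."""
    suffix = f"-{size}" if size else ""
    return f"{CAA_ROOT}/release/{release_mbid}/front{suffix}"


def isrc_lookup_url(isrc):
    """URL listing the recordings that carry a given ISRC."""
    return f"{MB_ROOT}/isrc/{isrc}?" + urlencode({"fmt": "json"})


def barcode_search_url(barcode):
    """URL for a release search restricted to the barcode field."""
    params = {"query": f"barcode:{barcode}", "fmt": "json"}
    return f"{MB_ROOT}/release?" + urlencode(params)
```

The cover art URL answers with a redirect to the image or a 404 when no front cover exists, so a client should follow redirects and treat 404 as "no art" instead of as an error.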

Phase 3: Advanced Integration

Consider replication for high volume
Contribute improvements to MusicBrainz
Implement edit submission (if applicable)
Add update monitoring (polling or replication packets, since the server offers no push mechanism)

Phase 4: Ecosystem Integration

Integrate complementary services (Last.fm, etc.)
Cross-reference data across sources
Resolve conflicts and duplicates
Build unified metadata view
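
Conflict resolution across sources could start from a simple priority merge. The sketch below treats MusicBrainz as the highest-priority source and records where each value came from. The source order, field names, and `merge_records` function are assumptions for illustration, not the aggregator's actual design.

```python
# Source order is an assumption for this example: earlier entries win conflicts.
SOURCE_PRIORITY = ["musicbrainz", "discogs", "lastfm"]


def merge_records(records_by_source, priority=SOURCE_PRIORITY):
    """Merge per-source field dicts into one view, tracking each value's origin."""
    merged, provenance, conflicts = {}, {}, {}
    for source in priority:
        for field_name, value in records_by_source.get(source, {}).items():
            if value in (None, ""):
                continue  # an empty value never overrides or conflicts
            if field_name not in merged:
                merged[field_name] = value
                provenance[field_name] = source
            elif merged[field_name] != value:
                # Lower-priority disagreement: keep the winner, record the loser.
                conflicts.setdefault(field_name, []).append((source, value))
    return merged, provenance, conflicts
```

Keeping the `conflicts` map, instead of silently discarding losing values, gives a review queue of disagreements that could later be contributed back to MusicBrainz as corrections.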

Conclusion

Overall Assessment: MusicBrainz is an essential, high-quality music metadata source with a mature codebase and comprehensive API. While it has some technical debt (Perl, legacy frontend, custom data layer), this debt is manageable and doesn't reduce its value as a data source.

Recommendation for Metadata Aggregator:

Priority: Critical - integrate early
Approach: API-based with aggressive caching
Timeline: Phase 1 in first sprint
Resources: Low (API integration is straightforward)

Key Takeaway: MusicBrainz is the foundation of music metadata. Any serious music metadata aggregator must integrate MusicBrainz to be comprehensive and credible.
