feat: initial implementation of metadata aggregator

- gRPC service with MusicBrainz provider
- PostgreSQL schema with migrations
- Service layer with database-first caching
- Repository pattern for data access
- YAML configuration support
- Research documentation for 17 music metadata projects
Alexander
2026-04-28 16:27:14 +02:00
commit a1f6701bac
163 changed files with 95884 additions and 0 deletions
@@ -0,0 +1,513 @@
# MusicBrainz Server Evaluation
## Strengths
### 1. Canonical Music Metadata Source
**Evidence:** MusicBrainz is the de facto standard for music metadata. Used by:
- Spotify (artist/release matching)
- Last.fm (scrobbling normalization)
- Roon (music library management)
- Picard (music tagging)
- Beets (music organization)
- Hundreds of other music applications
**Impact:** Any music metadata aggregator must include MusicBrainz data to be comprehensive. It's the foundation that other services build upon.
**Data Quality:** Community-driven editing with a voting system supports high accuracy. The community makes over 2 million edits per year, with auto-editors providing additional quality control.
### 2. Massive, Comprehensive Dataset
**Scale (as of 2024):**
- 2.1+ million artists
- 3.5+ million releases
- 30+ million recordings
- 1.5+ million works
- 250,000+ labels
- 100+ million relationships
**Coverage:** Extensive coverage across:
- All genres (classical, jazz, rock, electronic, world music, etc.)
- All eras (historical recordings to latest releases)
- All regions (global coverage with strong international community)
- All formats (vinyl, CD, digital, cassette, etc.)
**Relationships:** Rich relationship data connecting:
- Artists to recordings (performer, conductor, engineer, etc.)
- Recordings to works (performance of composition)
- Artists to artists (member of, collaboration, etc.)
- Releases to labels, areas, events, etc.
**Identifiers:** Comprehensive identifier coverage:
- ISRCs (International Standard Recording Code)
- ISWCs (International Standard Musical Work Code)
- Barcodes (EAN, UPC)
- Disc IDs (CD table of contents)
- External links (Wikipedia, Discogs, AllMusic, etc.)
### 3. Mature, Battle-Tested Codebase
**Age:** 20+ years of continuous development (since 2001)
**Stability:** Proven reliability serving millions of requests daily with minimal downtime.
**Evolution:** Gradual modernization while maintaining backward compatibility:
- Started with Template Toolkit (still used)
- Added Knockout.js (being phased out)
- Migrating to React (ongoing)
- API has remained stable since v2 (2011)
**Community:** Large, active open-source community:
- 500+ contributors on GitHub
- Active development (commits daily)
- Responsive to issues and pull requests
- Strong documentation culture
### 4. Comprehensive, Well-Designed API
**Maturity:** API v2 stable since 2011, widely adopted
**Formats:** Multiple serialization formats:
- JSON (modern, widely supported)
- XML (legacy, still used by many clients)
- JSON-LD (semantic web, Schema.org vocabulary)
**Features:**
- Lookup by MBID (unique identifier)
- Browse by relationships (all releases by artist, etc.)
- Search with Lucene query syntax
- Include parameters for fine-grained control
- Pagination for large result sets
- CORS enabled for browser clients
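
The lookup, browse, and search features listed above correspond to three URL shapes under `/ws/2`. The following is a minimal sketch of how a client might build them. The helper names (`lookup_url`, `browse_url`, `search_url`) are hypothetical and not part of any MusicBrainz library, and the default page size is an assumption.

```python
from urllib.parse import urlencode

BASE = "https://musicbrainz.org/ws/2"  # public API root


def lookup_url(entity, mbid, inc=()):
    """Lookup: fetch one entity by MBID, optionally with include parameters."""
    params = {"fmt": "json"}
    if inc:
        # urlencode turns the space into '+', the separator used for inc values
        params["inc"] = " ".join(inc)
    return f"{BASE}/{entity}/{mbid}?{urlencode(params)}"


def browse_url(entity, linked_entity, mbid, limit=25, offset=0):
    """Browse: list entities linked to another entity, one page at a time."""
    params = {linked_entity: mbid, "limit": limit, "offset": offset, "fmt": "json"}
    return f"{BASE}/{entity}?{urlencode(params)}"


def search_url(entity, query, limit=25, offset=0):
    """Search: Lucene query syntax, one page at a time."""
    params = {"query": query, "limit": limit, "offset": offset, "fmt": "json"}
    return f"{BASE}/{entity}?{urlencode(params)}"
```

For example, `browse_url("release", "artist", mbid, offset=25)` would address the second page of releases for an artist. Building URLs in pure functions keeps the request logic testable without network access.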
**Rate Limiting:** Reasonable limits (1 req/sec recommended) with clear documentation
**Authentication:** Modern OAuth2 with PKCE for user-specific operations
**Documentation:** Comprehensive API docs with examples at musicbrainz.org/doc/Development/XML_Web_Service/Version_2
### 5. Transparent Edit/Voting System
**Command Pattern:** All modifications are versioned edits, providing:
- Full audit trail (who changed what, when, why)
- Rollback capability (edits can be reverted)
- Transparency (all edits publicly visible)
- Accountability (editors build reputation)
**Community Quality Control:**
- 7-day voting period for most edits
- Community votes yes/no/abstain
- Auto-editors can approve immediately (earned privilege)
- Failed edits can be resubmitted with improvements
**Edit Types:** 100+ edit types covering all operations:
- Create/edit/delete entities
- Add/edit/delete relationships
- Merge duplicates
- Add identifiers (ISRC, barcode, etc.)
**Benefits:**
- High data quality through peer review
- Prevents vandalism and spam
- Encourages collaboration and discussion
- Builds trust in the data
### 6. Replication Support for Mirrors
**Architecture:** Master-Mirror via dbmirror2 packet system
**Use Cases:**
- Organizations needing local copy (reduced latency, offline access)
- High-volume API users (avoid rate limits)
- Research projects (full dataset access)
- Backup/disaster recovery
**Replication Packets:**
- Incremental updates (not full dumps)
- Hourly packets available
- Efficient bandwidth usage
- Verifiable integrity
**Mirror Benefits:**
- Full read access to entire dataset
- No rate limiting
- Custom queries and analytics
- Integration with internal systems
### 7. Rich Relationship Model
**Advanced Relationships:** Not just artist-to-release, but:
- Artist-to-artist (member of, collaboration, married to, etc.)
- Recording-to-work (performance of composition)
- Release-to-event (recorded at festival, etc.)
- Work-to-work (arrangement of, medley of, etc.)
**Relationship Attributes:**
- Dates (begin/end)
- Credits (custom artist credits)
- Instruments (performer played guitar, etc.)
- Roles (producer, engineer, etc.)
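
To make the relationship attributes above concrete, here is a rough sketch of flattening the `relations` array of a JSON API response into typed records. The `Relationship` class and `parse_relations` function are hypothetical names. The sample payload is hand-written for illustration rather than real API output, and the sketch assumes the linked entity is nested under a key named after its `target-type`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Relationship:
    rel_type: str              # e.g. "member of band"
    target_type: str           # e.g. "artist", "work"
    target_id: Optional[str]   # MBID of the linked entity, if present
    attributes: tuple          # e.g. instruments or roles
    begin: Optional[str]       # begin date, if known
    end: Optional[str]         # end date, if known


def parse_relations(entity):
    """Flatten an entity's `relations` array into Relationship records."""
    result = []
    for rel in entity.get("relations", []):
        target_type = rel["target-type"]
        # Assumption: the linked entity sits under a key named after its type
        target = rel.get(target_type) or {}
        result.append(Relationship(
            rel_type=rel["type"],
            target_type=target_type,
            target_id=target.get("id"),
            attributes=tuple(rel.get("attributes", [])),
            begin=rel.get("begin"),
            end=rel.get("end"),
        ))
    return result


# Hand-written sample with placeholder MBIDs (not real API output)
sample = {
    "name": "Example Artist",
    "relations": [
        {
            "type": "member of band",
            "target-type": "artist",
            "attributes": ["guitar"],
            "begin": "1990",
            "end": None,
            "artist": {"id": "00000000-0000-0000-0000-000000000001",
                       "name": "Example Band"},
        },
    ],
}
```

Filtering the parsed records by `rel_type` or `attributes` is then enough for use cases such as session musician tracking.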
**Use Cases:**
- Music discovery (find similar artists)
- Discography completeness (all releases by artist)
- Session musician tracking (who played on what)
- Classical music (composer, conductor, orchestra, etc.)
## Weaknesses
### 1. Perl Language Ecosystem Decline
**Evidence:**
- Perl ranked #19 in TIOBE index (down from top 5 in 2000s)
- Declining CPAN module releases (peak 2014, declining since)
- Fewer Perl developers entering workforce
- Most new web projects use Python, JavaScript, Go, Rust
**Impact:**
- Harder to recruit Perl developers
- Smaller pool of contributors
- Slower adoption of modern practices
- Dependency on aging CPAN modules
**Mitigation:**
- MusicBrainz has stable, experienced Perl team
- Codebase is well-documented
- Gradual migration to JavaScript on frontend
- API allows language-agnostic integration
**Reality Check:** While Perl is declining, MusicBrainz's Perl codebase is mature and stable. The bigger risk is long-term maintainability (10+ years), not immediate functionality.
### 2. Heavy Infrastructure Requirements
**Database Size:** ~350GB for production dataset (with indexes)
**Resource Requirements:**
- 8+ CPU cores
- 16+ GB RAM
- 500+ GB SSD storage
- PostgreSQL 16+ (specific version requirement)
- Redis (16 databases)
- Apache Solr (13 cores)
**Deployment Complexity:**
- Multiple services to coordinate
- Complex build process (Perl + Node.js)
- Long initial setup (schema load, index build)
- Replication setup requires FTP server
**Cost Implications:**
- Self-hosting requires dedicated server (~$200+/month)
- Cloud hosting even more expensive
- Bandwidth costs for replication
- Operational overhead (backups, monitoring, updates)
**Practical Impact:** For most use cases, using the public API is far more practical than self-hosting. Only large organizations with specific needs (high volume, custom queries, offline access) should consider self-hosting.
### 3. No Modern Observability
**Missing:**
- Prometheus metrics endpoint
- Structured logging (JSON logs)
- Distributed tracing (OpenTelemetry)
- Health check endpoint
- Readiness/liveness probes
**Current State:**
- Plain text logs
- No metrics export
- Manual log parsing for monitoring
- No standardized health checks
**Impact:**
- Harder to integrate with modern monitoring stacks (Grafana, Datadog, etc.)
- Limited visibility into performance bottlenecks
- Difficult to debug production issues
- No SLO/SLA tracking
**Workarounds:**
- Parse logs with Logstash/Fluentd
- Monitor HTTP responses
- Database query monitoring
- Custom metrics collection
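
As one possible shape for the "monitor HTTP responses" workaround, the sketch below times a single request and emits a structured JSON log line that a monitoring stack can ingest. The `probe` function and its `fetch` callable (which returns an HTTP status code) are hypothetical; nothing here is provided by MusicBrainz Server itself.

```python
import json
import time


def probe(fetch, url, clock=time.monotonic):
    """Time one request and return a JSON log line describing the outcome."""
    start = clock()
    try:
        status = fetch(url)   # fetch is any callable returning a status code
        error = None
    except Exception as exc:  # a failed request is still a data point
        status, error = None, str(exc)
    record = {
        "url": url,
        "status": status,
        "healthy": status is not None and 200 <= status < 400,
        "latency_ms": round((clock() - start) * 1000, 1),
        "error": error,
    }
    return json.dumps(record, sort_keys=True)
```

With the `requests` library, `fetch` could be `lambda u: requests.get(u, timeout=5).status_code`. Injecting `fetch` and `clock` keeps the probe testable offline.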
**Future:** Prometheus exporter is planned but not yet implemented.
### 4. Incomplete Frontend Modernization
**Legacy Code:**
- Knockout.js still present in many views
- jQuery used extensively
- Inline JavaScript in templates
- Mixed Template Toolkit + React
**Evidence:**
- `root/static/scripts/` contains both Knockout and React
- Some pages fully React, others fully Knockout, some mixed
- Inconsistent UI patterns across pages
**Impact:**
- Larger JavaScript bundle size
- Maintenance burden (two frameworks)
- Inconsistent user experience
- Harder for new contributors
**Migration Status:**
- New features use React
- Old features gradually migrated
- No timeline for complete migration
- Knockout removal is low priority
**Reality Check:** For end users this is mostly a cosmetic issue: the site works well despite the mixed frontend. The real cost is the maintenance burden on contributors. For API users, it is irrelevant.
### 5. Custom ORM Instead of Standard
**Architecture:** Custom Moose-based data layer, not DBIx::Class
**Characteristics:**
- 106 Data modules (26,000 lines)
- Raw SQL via DBD::Pg
- Custom query builder (Sql.pm)
- Moose roles for common patterns
**Drawbacks:**
- Steeper learning curve for new contributors
- No ecosystem of plugins/extensions
- Manual query construction
- No automatic migrations
**Benefits:**
- Better performance (no ORM overhead)
- Full control over SQL
- Simpler for complex queries
- Fewer dependencies
**Reality Check:** The custom data layer is well-designed and battle-tested. Its weakness lies in onboarding and maintainability, not in functionality. For a project this mature, switching to a standard ORM would be a massive undertaking with little benefit.
### 6. Limited Real-Time Capabilities
**Current State:**
- No WebSocket support
- No Server-Sent Events
- No real-time notifications
- Polling required for updates
**Impact:**
- Edit notifications delayed
- Search results not live-updated
- Collaborative editing limited
- Higher server load from polling
**Workarounds:**
- Redis pub/sub for internal events
- Periodic polling from clients
- Email notifications for edits
**Future:** Real-time features not prioritized (low demand).
## Integration Considerations
### API Integration (Recommended)
**Best For:**
- Most use cases
- Low to medium volume (<1M requests/month)
- No custom query requirements
- Budget-conscious projects
**Approach:**
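```python
import requests

# Lookup artist by MBID
response = requests.get(
    'https://musicbrainz.org/ws/2/artist/5b11f4ce-a62d-471e-81fc-a69a8278c7da',
    # requests encodes the space as '+', yielding inc=releases+recordings on the wire
    params={'fmt': 'json', 'inc': 'releases recordings'},
    headers={'User-Agent': 'MyApp/1.0 (contact@example.com)'},
    timeout=10,
)
response.raise_for_status()  # surface 4xx/5xx instead of parsing an error body
artist = response.json()
```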
**Advantages:**
- No infrastructure to manage
- Always up-to-date data
- No storage costs
- Simple integration
**Limitations:**
- Rate limiting (1 req/sec recommended)
- Network latency
- No custom queries
- Dependent on MusicBrainz uptime
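
A quick back-of-the-envelope calculation shows what the rate limit means for the volume thresholds in this section. It assumes a single client sustaining exactly 1 request per second and a 30-day month; both figures are simplifications.

```python
SECONDS_PER_DAY = 24 * 60 * 60
requests_per_second = 1   # guideline quoted above
days_per_month = 30       # assumption for a round figure

# Theoretical ceiling for one client that never pauses
monthly_ceiling = requests_per_second * SECONDS_PER_DAY * days_per_month
print(f"{monthly_ceiling:,} requests/month")  # 2,592,000 requests/month


def hours_to_fetch(n_requests, rate=requests_per_second):
    """Wall-clock hours needed to issue n_requests at the given rate."""
    return n_requests / rate / 3600


print(f"{hours_to_fetch(1_000_000):.1f} hours for 1M requests")  # 277.8 hours
```

Under these assumptions, 1M requests per month fits under the ceiling of roughly 2.6M, while a workload of 10M requests per month cannot be served from a single rate-limited client at all.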
**Best Practices:**
- Cache responses aggressively
- Respect rate limits
- Include User-Agent with contact info
- Handle errors gracefully
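
The rate-limit and error-handling practices above can be sketched as a small client-side throttle plus a retry loop. `Throttle` and `get_with_retry` are hypothetical names, and `fetch` is assumed to be a callable returning a `(status, body)` tuple. The sketch treats HTTP 503 as the signal to back off; the attempt count and backoff values are arbitrary.

```python
import time


class Throttle:
    """Enforce a minimum interval between calls (default: one per second)."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self.clock = clock
        self.sleep = sleep
        self._last = None  # time of the previous call, if any

    def wait(self):
        now = self.clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self._last = now


def get_with_retry(fetch, url, throttle, max_attempts=3, backoff=2.0):
    """Call fetch(url) under the throttle, retrying on 503 with linear backoff."""
    for attempt in range(1, max_attempts + 1):
        throttle.wait()
        status, body = fetch(url)
        if status != 503:
            return status, body
        if attempt < max_attempts:
            throttle.sleep(backoff * attempt)  # wait longer after each failure
    return status, body  # still 503 after the final attempt
```

Sharing one `Throttle` instance across all requests in a process keeps the combined request rate within the limit, which per-call sleeps alone would not guarantee.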
### Replication/Mirror (Advanced)
**Best For:**
- High volume (>10M requests/month)
- Custom queries and analytics
- Offline access required
- Research projects
**Approach:**
1. Set up PostgreSQL 16+ server (500GB+ storage)
2. Download initial database dump
3. Load schema and data
4. Configure replication (RT_MIRROR mode)
5. Download and apply hourly replication packets
**Advantages:**
- No rate limiting
- Full dataset access
- Custom queries
- Low latency
**Disadvantages:**
- High infrastructure cost (~$200+/month)
- Operational overhead
- Replication lag (minutes to hours)
- Storage requirements (350GB+)
**Maintenance:**
- Apply replication packets hourly
- Monitor replication lag
- Rebuild indexes periodically
- Backup database regularly
### Hybrid Approach (Optimal)
**Strategy:**
- Use API for lookups and searches
- Cache frequently accessed data locally
- Replicate subset of data for custom queries
- Fall back to API for cache misses
**Example:**
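```python
import requests

def get_artist(cache, mbid):
    # Check local cache first (cache: any object with get/set and TTL support)
    artist = cache.get(f'artist:{mbid}')
    if not artist:
        # Cache miss - fetch from API (fmt=json is required for a JSON response)
        response = requests.get(
            f'https://musicbrainz.org/ws/2/artist/{mbid}',
            params={'fmt': 'json'},
            headers={'User-Agent': 'MyApp/1.0 (contact@example.com)'},
        )
        response.raise_for_status()
        artist = response.json()
        # Cache for 1 hour
        cache.set(f'artist:{mbid}', artist, ttl=3600)
    return artist
```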
**Benefits:**
- Lower API usage (respect rate limits)
- Faster response times
- Reduced infrastructure costs
- Graceful degradation
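
The example above relies on a `cache` object with `get(key)` and `set(key, value, ttl=...)`. A minimal in-memory stand-in with that interface might look like the sketch below. `TTLCache` is a hypothetical class; a real deployment would more likely use Redis or the aggregator's own database.

```python
import time


class TTLCache:
    """In-memory cache whose entries expire after a per-item time-to-live."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._items = {}  # key -> (expires_at, value)

    def get(self, key):
        entry = self._items.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._items[key]  # drop expired entries lazily on read
            return None
        return value

    def set(self, key, value, ttl):
        self._items[key] = (self._clock() + ttl, value)
```

Expiry is checked lazily on read, so entries that are never read again stay in memory; a bounded or LRU-style cache would be a reasonable next step for long-running processes.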
## Relevance to Metadata Aggregator Project
### Primary Data Source
**Role:** MusicBrainz is the foundational music metadata source. Many other music metadata projects reference, build upon, or are linked from MusicBrainz:
- **Discogs:** Cross-referenced from MusicBrainz entities through external links
- **Last.fm:** Uses MusicBrainz for artist/track normalization
- **AcousticBrainz:** Audio analysis keyed by MusicBrainz recording ID (submissions ended in 2022; archived data only)
- **ListenBrainz:** Listening history using MusicBrainz IDs
- **CritiqueBrainz:** Reviews keyed by MusicBrainz release ID
**Implication:** A metadata aggregator without MusicBrainz is incomplete. MusicBrainz provides the canonical identifiers (MBIDs) that link data across services.
### Integration Priority: Critical
**Rationale:**
1. **Canonical IDs:** MBIDs are the standard for music entity identification
2. **Comprehensive Coverage:** Largest open music metadata database
3. **Relationship Data:** Rich connections between entities
4. **Community Trust:** High data quality through peer review
5. **API Stability:** Mature, stable API with long-term support
**Recommended Integration:**
- Use MusicBrainz API as primary metadata source
- Cache responses locally (1-hour TTL)
- Use MBIDs as primary keys in aggregator database
- Cross-reference with other sources (Discogs, Last.fm, etc.)
- Contribute improvements back to MusicBrainz
### Data Model Alignment
**MusicBrainz Entities Map Well to Aggregator Needs:**
| MusicBrainz Entity | Aggregator Use Case |
|-------------------|---------------------|
| Artist | Artist profiles, discographies |
| Release | Album/single metadata |
| Recording | Track metadata, audio fingerprinting |
| Work | Composition metadata, cover detection |
| Label | Label discographies, release attribution |
| Relationship | Music discovery, session musician tracking |
**Identifiers:**
- MBID as primary key
- ISRC for recording matching
- Barcode for release matching
- Disc ID for CD identification
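
Since MBIDs are UUIDs, an aggregator can validate and normalize them before using them as primary keys. The sketch below shows one way to do that. `EntityRecord` and `normalize_mbid` are hypothetical names, and the layout of the `identifiers` dictionary is an assumption, not the aggregator's actual schema.

```python
import uuid
from dataclasses import dataclass, field


def normalize_mbid(value):
    """Return the canonical lowercase form of an MBID, or raise ValueError."""
    return str(uuid.UUID(value))


@dataclass
class EntityRecord:
    mbid: str          # primary key
    entity_type: str   # "artist", "release", "recording", ...
    name: str
    # Secondary identifiers, e.g. {"isrc": [...], "barcode": "..."}
    identifiers: dict = field(default_factory=dict)

    def __post_init__(self):
        # Reject malformed keys early and store one canonical spelling
        self.mbid = normalize_mbid(self.mbid)
```

Normalizing at construction time means that two spellings of the same MBID (for example, different letter case) cannot produce duplicate rows.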
### Complementary Data Sources
**MusicBrainz Strengths:**
- Canonical entity IDs
- Relationship data
- Release metadata
- Identifier coverage
**MusicBrainz Gaps (fill with other sources):**
- Album reviews → CritiqueBrainz, AllMusic
- Listening statistics → Last.fm, Spotify
- Audio features → AcousticBrainz (archived data only; submissions ended in 2022), Spotify
- Lyrics → Genius (LyricWiki shut down in 2020)
- Album art → Cover Art Archive (integrated)
- Popularity metrics → Last.fm, Spotify
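
One simple way to combine MusicBrainz with the gap-filling sources above is a priority merge: the primary source wins, and secondary sources only supply fields that are still missing. The sketch below assumes each source has already been reduced to a flat dictionary keyed by field name. `merge_metadata` and the field names in the usage note are hypothetical.

```python
def merge_metadata(mbid, primary, *secondary):
    """Merge (source_name, data) pairs; earlier sources take precedence.

    Returns the merged record plus a map of field -> source it came from.
    """
    merged = {"mbid": mbid}
    provenance = {}
    for name, data in (primary, *secondary):
        for key, value in data.items():
            if value is None or key in merged:
                continue  # skip empty values and fields already filled
            merged[key] = value
            provenance[key] = name
    return merged, provenance
```

For instance, calling it with `("musicbrainz", {"name": "A", "lyrics": None})` followed by `("genius", {"lyrics": "..."})` keeps the MusicBrainz name and takes the lyrics from the secondary source. Recording provenance per field makes later conflict resolution easier to audit.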
### Implementation Roadmap
**Phase 1: Basic Integration**
1. Implement MusicBrainz API client
2. Cache artist/release/recording lookups
3. Store MBIDs as primary keys
4. Handle rate limiting gracefully
**Phase 2: Enhanced Integration**
1. Implement relationship traversal
2. Add search functionality
3. Integrate Cover Art Archive
4. Add identifier lookups (ISRC, barcode)
**Phase 3: Advanced Integration**
1. Consider replication for high volume
2. Contribute improvements to MusicBrainz
3. Implement edit submission (if applicable)
4. Add real-time update monitoring
**Phase 4: Ecosystem Integration**
1. Integrate complementary services (Last.fm, etc.)
2. Cross-reference data across sources
3. Resolve conflicts and duplicates
4. Build unified metadata view
## Conclusion
**Overall Assessment:** MusicBrainz is an essential, high-quality music metadata source with a mature codebase and comprehensive API. While it has some technical debt (Perl, legacy frontend, custom ORM), these are manageable and don't impact its value as a data source.
**Recommendation for Metadata Aggregator:**
- **Priority:** Critical - integrate early
- **Approach:** API-based with aggressive caching
- **Timeline:** Phase 1 in first sprint
- **Resources:** Low (API integration is straightforward)
**Key Takeaway:** MusicBrainz is the foundation of music metadata. Any serious music metadata aggregator must integrate MusicBrainz to be comprehensive and credible.