feat: initial implementation of metadata aggregator

- gRPC service with MusicBrainz provider
- PostgreSQL schema with migrations
- Service layer with database-first caching
- Repository pattern for data access
- YAML configuration support
- Research documentation for 17 music metadata projects
2026-04-28 16:27:14 +02:00
commit a1f6701bac
163 changed files with 95884 additions and 0 deletions
@@ -0,0 +1,513 @@
+# MusicBrainz Server Evaluation
+
+## Strengths
+
+### 1. Canonical Music Metadata Source
+
+**Evidence:** MusicBrainz is the de facto standard for music metadata. Used by:
+- Spotify (artist/release matching)
+- Last.fm (scrobbling normalization)
+- Roon (music library management)
+- Picard (music tagging)
+- Beets (music organization)
+- Hundreds of other music applications
+
+**Impact:** Any music metadata aggregator must include MusicBrainz data to be comprehensive. It's the foundation that other services build upon.
+
+**Data Quality:** Community-driven editing with voting system ensures high accuracy. Over 2 million edits per year, with auto-editors providing quality control.
+
+### 2. Massive, Comprehensive Dataset
+
+**Scale (as of 2024):**
+- 2.1+ million artists
+- 3.5+ million releases
+- 30+ million recordings
+- 1.5+ million works
+- 1.3+ million labels
+- 100+ million relationships
+
+**Coverage:** Extensive coverage across:
+- All genres (classical, jazz, rock, electronic, world music, etc.)
+- All eras (historical recordings to latest releases)
+- All regions (global coverage with strong international community)
+- All formats (vinyl, CD, digital, cassette, etc.)
+
+**Relationships:** Rich relationship data connecting:
+- Artists to recordings (performer, conductor, engineer, etc.)
+- Recordings to works (performance of composition)
+- Artists to artists (member of, collaboration, etc.)
+- Releases to labels, areas, events, etc.
+
+**Identifiers:** Comprehensive identifier coverage:
+- ISRCs (International Standard Recording Code)
+- ISWCs (International Standard Musical Work Code)
+- Barcodes (EAN, UPC)
+- Disc IDs (CD table of contents)
+- External links (Wikipedia, Discogs, AllMusic, etc.)
+
+### 3. Mature, Battle-Tested Codebase
+
+**Age:** 20+ years of continuous development (since 2001)
+
+**Stability:** Proven reliability serving millions of requests daily with minimal downtime.
+
+**Evolution:** Gradual modernization while maintaining backward compatibility:
+- Started with Template Toolkit (still used)
+- Added Knockout.js (being phased out)
+- Migrating to React (ongoing)
+- API has remained stable since v2 (2011)
+
+**Community:** Large, active open-source community:
+- 500+ contributors on GitHub
+- Active development (commits daily)
+- Responsive to issues and pull requests
+- Strong documentation culture
+
+### 4. Comprehensive, Well-Designed API
+
+**Maturity:** API v2 stable since 2011, widely adopted
+
+**Formats:** Multiple serialization formats:
+- JSON (modern, widely supported)
+- XML (legacy, still used by many clients)
+- JSON-LD (semantic web, Schema.org vocabulary)
+
+**Features:**
+- Lookup by MBID (unique identifier)
+- Browse by relationships (all releases by artist, etc.)
+- Search with Lucene query syntax
+- Include parameters for fine-grained control
+- Pagination for large result sets
+- CORS enabled for browser clients
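+
+A minimal sketch of browse pagination, assuming the browse response carries a `release-count` total and a `releases` array, and that `limit`/`offset` page through it. `fetch_release_page` and `browse_releases` are hypothetical helper names, and the User-Agent is a placeholder:
+
+```python
+import json
+import urllib.parse
+import urllib.request
+
+API_ROOT = 'https://musicbrainz.org/ws/2'
+USER_AGENT = 'MyApp/1.0 (contact@example.com)'  # placeholder contact info
+
+
+def fetch_release_page(artist_mbid, limit, offset):
+    """Fetch one page of the release browse endpoint as a dict."""
+    query = urllib.parse.urlencode(
+        {'artist': artist_mbid, 'limit': limit, 'offset': offset, 'fmt': 'json'}
+    )
+    request = urllib.request.Request(
+        f'{API_ROOT}/release?{query}', headers={'User-Agent': USER_AGENT}
+    )
+    with urllib.request.urlopen(request, timeout=10) as response:
+        return json.load(response)
+
+
+def browse_releases(artist_mbid, fetch_page=fetch_release_page, page_size=100):
+    """Yield every release for an artist, following offset-based pagination."""
+    offset = 0
+    while True:
+        page = fetch_page(artist_mbid, page_size, offset)
+        releases = page.get('releases', [])
+        yield from releases
+        offset += len(releases)
+        # Stop on an empty page or once the reported total has been reached.
+        if not releases or offset >= page.get('release-count', 0):
+            break
+```
+
+Passing `fetch_page` as a parameter keeps the paging loop testable without network access; a real client would also throttle between pages.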
+
+**Rate Limiting:** Reasonable limits (1 req/sec recommended) with clear documentation
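+
+At one request per second, a single client tops out at 86,400 requests per day, or roughly 2.6 million per 30-day month. A minimal client-side throttle could look like the sketch below; `RateLimiter` is a hypothetical helper, not part of any MusicBrainz library, and the injectable `clock` and `sleep` exist only to make it testable:
+
+```python
+import time
+
+
+class RateLimiter:
+    """Blocks so that calls are spaced at least `min_interval` seconds apart."""
+
+    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
+        self.min_interval = min_interval
+        self._clock = clock
+        self._sleep = sleep
+        self._last_call = None
+
+    def wait(self):
+        """Sleep just long enough to honour the interval, then record the call."""
+        now = self._clock()
+        if self._last_call is not None:
+            remaining = self.min_interval - (now - self._last_call)
+            if remaining > 0:
+                self._sleep(remaining)
+                now = self._clock()
+        self._last_call = now
+
+
+# Upper bound for one client at 1 req/sec over a 30-day month
+MAX_REQUESTS_PER_MONTH = 60 * 60 * 24 * 30  # 2,592,000
+```
+
+Calling `limiter.wait()` before each request keeps a single-threaded client within the recommended rate.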
+
+**Authentication:** Modern OAuth2 with PKCE for user-specific operations
+
+**Documentation:** Comprehensive API docs with examples at musicbrainz.org/doc/Development/XML_Web_Service/Version_2
+
+### 5. Transparent Edit/Voting System
+
+**Command Pattern:** All modifications are versioned edits, providing:
+- Full audit trail (who changed what, when, why)
+- Rollback capability (edits can be reverted)
+- Transparency (all edits publicly visible)
+- Accountability (editors build reputation)
+
+**Community Quality Control:**
+- 7-day voting period for most edits
+- Community votes yes/no/abstain
+- Auto-editors can approve immediately (earned privilege)
+- Failed edits can be resubmitted with improvements
+
+**Edit Types:** 100+ edit types covering all operations:
+- Create/edit/delete entities
+- Add/edit/delete relationships
+- Merge duplicates
+- Add identifiers (ISRC, barcode, etc.)
+
+**Benefits:**
+- High data quality through peer review
+- Prevents vandalism and spam
+- Encourages collaboration and discussion
+- Builds trust in the data
+
+### 6. Replication Support for Mirrors
+
+**Architecture:** Master-Mirror via dbmirror2 packet system
+
+**Use Cases:**
+- Organizations needing local copy (reduced latency, offline access)
+- High-volume API users (avoid rate limits)
+- Research projects (full dataset access)
+- Backup/disaster recovery
+
+**Replication Packets:**
+- Incremental updates (not full dumps)
+- Hourly packets available
+- Efficient bandwidth usage
+- Verifiable integrity
+
+**Mirror Benefits:**
+- Full read access to entire dataset
+- No rate limiting
+- Custom queries and analytics
+- Integration with internal systems
+
+### 7. Rich Relationship Model
+
+**Advanced Relationships:** Not just artist-to-release, but:
+- Artist-to-artist (member of, collaboration, married to, etc.)
+- Recording-to-work (performance of composition)
+- Release-to-event (recorded at festival, etc.)
+- Work-to-work (arrangement of, medley of, etc.)
+
+**Relationship Attributes:**
+- Dates (begin/end)
+- Credits (custom artist credits)
+- Instruments (performer played guitar, etc.)
+- Roles (producer, engineer, etc.)
+
+**Use Cases:**
+- Music discovery (find similar artists)
+- Discography completeness (all releases by artist)
+- Session musician tracking (who played on what)
+- Classical music (composer, conductor, orchestra, etc.)
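+
+To illustrate session musician tracking, the sketch below groups performers by instrument. It assumes the relation shape of the JSON API when `inc=artist-rels` is requested (`type`, `target-type`, `attributes`, and a nested `artist`); `performers_by_instrument` is a hypothetical helper:
+
+```python
+def performers_by_instrument(recording):
+    """Map each instrument to the artist names credited with playing it."""
+    credits = {}
+    for relation in recording.get('relations', []):
+        # Only artist relations of type "instrument" carry instrument credits.
+        if relation.get('target-type') != 'artist':
+            continue
+        if relation.get('type') != 'instrument':
+            continue
+        name = relation['artist']['name']
+        # Fall back to a marker when no instrument attribute is attached.
+        for instrument in relation.get('attributes') or ['unspecified']:
+            credits.setdefault(instrument, []).append(name)
+    return credits
+```
+
+The same loop extends to other relationship types (producer, engineer, conductor) by changing the `type` filter.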
+
+## Weaknesses
+
+### 1. Perl Language Ecosystem Decline
+
+**Evidence:**
+- Perl ranked #19 in TIOBE index (down from top 5 in 2000s)
+- Declining CPAN module releases (peak 2014, declining since)
+- Fewer Perl developers entering workforce
+- Most new web projects use Python, JavaScript, Go, Rust
+
+**Impact:**
+- Harder to recruit Perl developers
+- Smaller pool of contributors
+- Slower adoption of modern practices
+- Dependency on aging CPAN modules
+
+**Mitigation:**
+- MusicBrainz has stable, experienced Perl team
+- Codebase is well-documented
+- Gradual migration to JavaScript on frontend
+- API allows language-agnostic integration
+
+**Reality Check:** While Perl is declining, MusicBrainz's Perl codebase is mature and stable. The bigger risk is long-term maintainability (10+ years), not immediate functionality.
+
+### 2. Heavy Infrastructure Requirements
+
+**Database Size:** ~350GB for production dataset (with indexes)
+
+**Resource Requirements:**
+- 8+ CPU cores
+- 16+ GB RAM
+- 500+ GB SSD storage
+- PostgreSQL 16+ (specific version requirement)
+- Redis (16 databases)
+- Apache Solr (13 cores)
+
+**Deployment Complexity:**
+- Multiple services to coordinate
+- Complex build process (Perl + Node.js)
+- Long initial setup (schema load, index build)
+- Replication setup requires FTP server
+
+**Cost Implications:**
+- Self-hosting requires dedicated server (~$200+/month)
+- Cloud hosting even more expensive
+- Bandwidth costs for replication
+- Operational overhead (backups, monitoring, updates)
+
+**Practical Impact:** For most use cases, using the public API is far more practical than self-hosting. Only large organizations with specific needs (high volume, custom queries, offline access) should consider self-hosting.
+
+### 3. No Modern Observability
+
+**Missing:**
+- Prometheus metrics endpoint
+- Structured logging (JSON logs)
+- Distributed tracing (OpenTelemetry)
+- Health check endpoint
+- Readiness/liveness probes
+
+**Current State:**
+- Plain text logs
+- No metrics export
+- Manual log parsing for monitoring
+- No standardized health checks
+
+**Impact:**
+- Harder to integrate with modern monitoring stacks (Grafana, Datadog, etc.)
+- Limited visibility into performance bottlenecks
+- Difficult to debug production issues
+- No SLO/SLA tracking
+
+**Workarounds:**
+- Parse logs with Logstash/Fluentd
+- Monitor HTTP responses
+- Database query monitoring
+- Custom metrics collection
+
+**Future:** Prometheus exporter is planned but not yet implemented.
+
+### 4. Incomplete Frontend Modernization
+
+**Legacy Code:**
+- Knockout.js still present in many views
+- jQuery used extensively
+- Inline JavaScript in templates
+- Mixed Template Toolkit + React
+
+**Evidence:**
+- `root/static/scripts/` contains both Knockout and React
+- Some pages fully React, others fully Knockout, some mixed
+- Inconsistent UI patterns across pages
+
+**Impact:**
+- Larger JavaScript bundle size
+- Maintenance burden (two frameworks)
+- Inconsistent user experience
+- Harder for new contributors
+
+**Migration Status:**
+- New features use React
+- Old features gradually migrated
+- No timeline for complete migration
+- Knockout removal is low priority
+
+**Reality Check:** This is mainly a maintainability issue for contributors, not a functional one. The site works well despite the mixed frontend. For API users, this is irrelevant.
+
+### 5. Custom Data Layer Instead of a Standard ORM
+
+**Architecture:** Custom Moose-based data layer, not DBIx::Class
+
+**Characteristics:**
+- 106 Data modules (26,000 lines)
+- Raw SQL via DBD::Pg
+- Custom query builder (Sql.pm)
+- Moose roles for common patterns
+
+**Drawbacks:**
+- Steeper learning curve for new contributors
+- No ecosystem of plugins/extensions
+- Manual query construction
+- No automatic migrations
+
+**Benefits:**
+- Better performance (no ORM overhead)
+- Full control over SQL
+- Simpler for complex queries
+- Fewer dependencies
+
+**Reality Check:** The custom data layer is well-designed and battle-tested. The weakness lies in onboarding and maintainability, not in functionality. For a project this mature, switching to a standard ORM would be a massive undertaking with little benefit.
+
+### 6. Limited Real-Time Capabilities
+
+**Current State:**
+- No WebSocket support
+- No Server-Sent Events
+- No real-time notifications
+- Polling required for updates
+
+**Impact:**
+- Edit notifications delayed
+- Search results not live-updated
+- Collaborative editing limited
+- Higher server load from polling
+
+**Workarounds:**
+- Redis pub/sub for internal events
+- Periodic polling from clients
+- Email notifications for edits
+
+**Future:** Real-time features not prioritized (low demand).
+
+## Integration Considerations
+
+### API Integration (Recommended)
+
+**Best For:**
+- Most use cases
+- Low to medium volume (<1M requests/month)
+- No custom query requirements
+- Budget-conscious projects
+
+**Approach:**
+```python
+import requests
+
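+# Lookup artist by MBID
+response = requests.get(
+    'https://musicbrainz.org/ws/2/artist/5b11f4ce-a62d-471e-81fc-a69a8278c7da',
+    # Multiple inc values are space-separated; requests encodes the space as '+'
+    params={'fmt': 'json', 'inc': 'releases recordings'},
+    headers={'User-Agent': 'MyApp/1.0 (contact@example.com)'},
+    timeout=10,
+)
+response.raise_for_status()  # Fail on 4xx/5xx instead of parsing an error body
+artist = response.json()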
+```
+
+**Advantages:**
+- No infrastructure to manage
+- Always up-to-date data
+- No storage costs
+- Simple integration
+
+**Limitations:**
+- Rate limiting (1 req/sec recommended)
+- Network latency
+- No custom queries
+- Dependent on MusicBrainz uptime
+
+**Best Practices:**
+- Cache responses aggressively
+- Respect rate limits
+- Include User-Agent with contact info
+- Handle errors gracefully
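+
+One way to handle errors gracefully is to retry throttled requests with exponential backoff. The sketch below assumes that rate-limited requests are answered with HTTP 503 and that the caller supplies a zero-argument `fetch` callable built on `urllib`; `fetch_with_retry` is a hypothetical helper:
+
+```python
+import time
+import urllib.error
+
+
+def fetch_with_retry(fetch, max_attempts=4, base_delay=1.0, sleep=time.sleep):
+    """Call `fetch()` and retry on HTTP 503 with exponential backoff."""
+    for attempt in range(max_attempts):
+        try:
+            return fetch()
+        except urllib.error.HTTPError as error:
+            # Retry only 503s; re-raise other errors and the final failure.
+            if error.code != 503 or attempt == max_attempts - 1:
+                raise
+            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
+```
+
+Errors such as 404 are raised immediately, since retrying a lookup for a missing MBID cannot succeed.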
+
+### Replication/Mirror (Advanced)
+
+**Best For:**
+- High volume (>10M requests/month)
+- Custom queries and analytics
+- Offline access required
+- Research projects
+
+**Approach:**
+1. Set up PostgreSQL 16+ server (500GB+ storage)
+2. Download initial database dump
+3. Load schema and data
+4. Configure replication (RT_MIRROR mode)
+5. Download and apply hourly replication packets
+
+**Advantages:**
+- No rate limiting
+- Full dataset access
+- Custom queries
+- Low latency
+
+**Disadvantages:**
+- High infrastructure cost (~$200+/month)
+- Operational overhead
+- Replication lag (minutes to hours)
+- Storage requirements (350GB+)
+
+**Maintenance:**
+- Apply replication packets hourly
+- Monitor replication lag
+- Rebuild indexes periodically
+- Backup database regularly
+
+### Hybrid Approach (Optimal)
+
+**Strategy:**
+- Use API for lookups and searches
+- Cache frequently accessed data locally
+- Replicate subset of data for custom queries
+- Fall back to API for cache misses
+
+**Example:**
+```python
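+import requests
+
+def get_artist(mbid, cache):
+    """Return artist data, preferring the local cache over the API."""
+    # Check local cache first (cache: any object with get/set and a ttl option)
+    artist = cache.get(f'artist:{mbid}')
+
+    if not artist:
+        # Cache miss - fetch from API (XML is returned unless fmt=json is set)
+        response = requests.get(
+            f'https://musicbrainz.org/ws/2/artist/{mbid}',
+            params={'fmt': 'json'},
+            headers={'User-Agent': 'MyApp/1.0 (contact@example.com)'},
+            timeout=10,
+        )
+        response.raise_for_status()
+        artist = response.json()
+
+        # Cache for 1 hour
+        cache.set(f'artist:{mbid}', artist, ttl=3600)
+
+    return artist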
+```
+
+**Benefits:**
+- Lower API usage (respect rate limits)
+- Faster response times
+- Reduced infrastructure costs
+- Graceful degradation
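+
+Graceful degradation can be sketched as a cache that keeps expired entries and serves them when the upstream call fails. `StaleTolerantCache` is a hypothetical in-memory helper, and `fetch` stands for whatever callable performs the API request:
+
+```python
+import time
+
+
+class StaleTolerantCache:
+    """In-memory cache that keeps expired entries around as a fallback."""
+
+    def __init__(self, ttl=3600, clock=time.monotonic):
+        self.ttl = ttl
+        self._clock = clock
+        self._entries = {}  # key -> (stored_at, value)
+
+    def get_or_fetch(self, key, fetch):
+        """Return a fresh value, refetching on expiry and tolerating failures."""
+        entry = self._entries.get(key)
+        if entry is not None and self._clock() - entry[0] < self.ttl:
+            return entry[1]  # fresh hit, no upstream call
+        try:
+            value = fetch()
+        except Exception:
+            if entry is not None:
+                return entry[1]  # upstream failed: degrade to stale data
+            raise
+        self._entries[key] = (self._clock(), value)
+        return value
+```
+
+The trade-off is that callers may see data older than the TTL during an outage, which is usually acceptable for slowly changing metadata.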
+
+## Relevance to Metadata Aggregator Project
+
+### Primary Data Source
+
+**Role:** MusicBrainz is the foundational music metadata source. Many other music metadata projects reference or build upon MusicBrainz:
+
+- **Discogs:** Cross-references MusicBrainz IDs
+- **Last.fm:** Uses MusicBrainz for artist/track normalization
+- **AcousticBrainz:** Audio analysis keyed by MusicBrainz recording ID
+- **ListenBrainz:** Listening history using MusicBrainz IDs
+- **CritiqueBrainz:** Reviews keyed by MusicBrainz release ID
+
+**Implication:** A metadata aggregator without MusicBrainz is incomplete. MusicBrainz provides the canonical identifiers (MBIDs) that link data across services.
+
+### Integration Priority: Critical
+
+**Rationale:**
+1. **Canonical IDs:** MBIDs are the standard for music entity identification
+2. **Comprehensive Coverage:** Largest open music metadata database
+3. **Relationship Data:** Rich connections between entities
+4. **Community Trust:** High data quality through peer review
+5. **API Stability:** Mature, stable API with long-term support
+
+**Recommended Integration:**
+- Use MusicBrainz API as primary metadata source
+- Cache responses locally (1-hour TTL)
+- Use MBIDs as primary keys in aggregator database
+- Cross-reference with other sources (Discogs, Last.fm, etc.)
+- Contribute improvements back to MusicBrainz
+
+### Data Model Alignment
+
+**MusicBrainz Entities Map Well to Aggregator Needs:**
+
+| MusicBrainz Entity | Aggregator Use Case |
+|-------------------|---------------------|
+| Artist | Artist profiles, discographies |
+| Release | Album/single metadata |
+| Recording | Track metadata, audio fingerprinting |
+| Work | Composition metadata, cover detection |
+| Label | Label discographies, release attribution |
+| Relationship | Music discovery, session musician tracking |
+
+**Identifiers:**
+- MBID as primary key
+- ISRC for recording matching
+- Barcode for release matching
+- Disc ID for CD identification
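+
+Because MBIDs are UUIDs, they can be validated and normalized before being used as primary keys. The sketch below uses Python's standard `uuid` module; `normalize_mbid` and `RecordingRecord` are hypothetical names for illustration, not the aggregator's actual schema:
+
+```python
+import uuid
+from dataclasses import dataclass, field
+
+
+def normalize_mbid(value):
+    """Return the canonical lowercase form of an MBID, or raise ValueError."""
+    return str(uuid.UUID(value))
+
+
+@dataclass
+class RecordingRecord:
+    """Hypothetical aggregator row keyed by MBID."""
+    mbid: str
+    title: str
+    isrcs: list = field(default_factory=list)
+
+    def __post_init__(self):
+        # Reject malformed keys early and store one canonical spelling.
+        self.mbid = normalize_mbid(self.mbid)
+```
+
+Normalizing on write avoids duplicate rows that differ only in letter case.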
+
+### Complementary Data Sources
+
+**MusicBrainz Strengths:**
+- Canonical entity IDs
+- Relationship data
+- Release metadata
+- Identifier coverage
+
+**MusicBrainz Gaps (fill with other sources):**
+- Album reviews → CritiqueBrainz, AllMusic
+- Listening statistics → Last.fm, Spotify
+- Audio features → AcousticBrainz (no longer collecting new data), Spotify
+- Lyrics → Genius (LyricWiki has shut down)
+- Album art → Cover Art Archive (integrated)
+- Popularity metrics → Last.fm, Spotify
+
+### Implementation Roadmap
+
+**Phase 1: Basic Integration**
+1. Implement MusicBrainz API client
+2. Cache artist/release/recording lookups
+3. Store MBIDs as primary keys
+4. Handle rate limiting gracefully
+
+**Phase 2: Enhanced Integration**
+1. Implement relationship traversal
+2. Add search functionality
+3. Integrate Cover Art Archive
+4. Add identifier lookups (ISRC, barcode)
+
+**Phase 3: Advanced Integration**
+1. Consider replication for high volume
+2. Contribute improvements to MusicBrainz
+3. Implement edit submission (if applicable)
+4. Add real-time update monitoring
+
+**Phase 4: Ecosystem Integration**
+1. Integrate complementary services (Last.fm, etc.)
+2. Cross-reference data across sources
+3. Resolve conflicts and duplicates
+4. Build unified metadata view
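+
+Conflict resolution across sources can start as a simple precedence merge. The sketch below is one possible approach, not a settled design: `merge_records` is a hypothetical helper that takes each field from the highest-priority source with a non-empty value and records where it came from:
+
+```python
+def merge_records(records, priority):
+    """Merge per-source dicts field by field, preferring earlier sources."""
+    merged = {}
+    provenance = {}
+    for source in priority:
+        for key, value in records.get(source, {}).items():
+            # Keep the first non-empty value seen in priority order.
+            if key not in merged and value not in (None, '', []):
+                merged[key] = value
+                provenance[key] = source
+    return merged, provenance
+```
+
+Tracking provenance per field makes it possible to audit or re-rank sources later without refetching data.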
+
+## Conclusion
+
+**Overall Assessment:** MusicBrainz is an essential, high-quality music metadata source with a mature codebase and comprehensive API. While it has some technical debt (Perl, legacy frontend, custom ORM), these are manageable and don't impact its value as a data source.
+
+**Recommendation for Metadata Aggregator:**
+- **Priority:** Critical - integrate early
+- **Approach:** API-based with aggressive caching
+- **Timeline:** Phase 1 in first sprint
+- **Resources:** Low (API integration is straightforward)
+
+**Key Takeaway:** MusicBrainz is the foundation of music metadata. Any serious music metadata aggregator must integrate MusicBrainz to be comprehensive and credible.