feat: initial implementation of metadata aggregator

- gRPC service with MusicBrainz provider
- PostgreSQL schema with migrations
- Service layer with database-first caching
- Repository pattern for data access
- YAML configuration support
- Research documentation for 17 music metadata projects
This commit is contained in:
Alexander
2026-04-28 16:27:14 +02:00
commit a1f6701bac
163 changed files with 95884 additions and 0 deletions
@@ -0,0 +1,500 @@
# Aggregators Architecture Analysis & Proposed Solution
Deep analysis of 5 music metadata aggregators, identifying common flaws and proposing a ground-up redesign.
---
## Executive Summary
The 5 aggregators share **recurring architectural mistakes** that lead to data quality issues, performance problems, and poor extensibility. Not every project exhibits every pattern:
| Pattern | Projects Affected | Impact |
|---------|-------------------|--------|
| **No confidence scoring** | 5/5 | Can't distinguish good data from bad |
| **First/last-write-wins merging** | 4/5 | Data loss, no conflict resolution |
| **Silent failure cascades** | 4/5 | Debugging nightmare, data corruption |
| **Naive entity resolution** | 4/5 | Duplicates, mismatches |
| **Provider-specific error handling** | 3/5 | Inconsistent reliability |
| **URL-based cache keys** | 2/5 | Same entity cached multiple times |
| **Disabled batching** | 2/5 | Catastrophic performance |
---
## 1. Harmony - Architectural Flaws
### Critical Issues
#### 1.1 Naive Deduplication (`deduplicate.ts:4-25`)
```typescript
// FLAW: Exact string match only
if (mbid) {
  if (!mbids.has(mbid)) { result.push(entity); mbids.add(mbid); }
} else if (name) {
  if (!names.has(name)) { result.push(entity); names.add(name); }
}
```
**Problem**: "The Beatles" ≠ "Beatles" ≠ "BEATLES" - all treated as different entities.
**Fix**: Implement phonetic blocking (Metaphone) + Levenshtein similarity threshold.
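A minimal sketch of that fix, using only the standard library. `blocking_key` is a crude stand-in for Metaphone (a real implementation would likely use a phonetic library), and the 0.85 threshold and all function names are assumptions for illustration, not part of Harmony.

```python
import re
import unicodedata


def normalize(name: str) -> str:
    """Lowercase, strip accents and punctuation, drop a leading article."""
    decomposed = unicodedata.normalize("NFKD", name)
    ascii_only = "".join(c for c in decomposed if not unicodedata.combining(c))
    cleaned = re.sub(r"[^a-z0-9 ]", "", ascii_only.lower())
    tokens = cleaned.split()
    if tokens and tokens[0] in {"the", "a", "an"}:
        tokens = tokens[1:]
    return " ".join(tokens)


def blocking_key(name: str) -> str:
    """Stand-in for a phonetic key: first letter plus a short consonant skeleton."""
    norm = normalize(name).replace(" ", "")
    if not norm:
        return ""
    return norm[0] + re.sub(r"[aeiou]", "", norm[1:])[:5]


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Normalized similarity in [0, 1] computed on normalized names."""
    na, nb = normalize(a), normalize(b)
    longest = max(len(na), len(nb))
    return 1.0 if longest == 0 else 1.0 - levenshtein(na, nb) / longest


def is_probable_duplicate(a: str, b: str, threshold: float = 0.85) -> bool:
    # Only names that fall into the same block are compared in detail.
    return blocking_key(a) == blocking_key(b) and similarity(a, b) >= threshold
```

Blocking first keeps the expensive edit-distance comparison limited to names that already look alike.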
#### 1.2 Limited Compatibility Checks (`compatibility.ts:60-67`)
```typescript
const releaseCompatibilityChecks: CompatibilityCheck<HarmonyRelease>[] = [{
  property: (release) => release.gtin ? Number(release.gtin) : undefined,
  errorMessage: 'Providers have returned multiple different GTIN',
}, {
  property: trackCountSummary,
  errorMessage: 'Providers have returned incompatible track lists',
}];
```
**Problem**: Only checks GTIN and track count. No artist validation, title similarity, or duration checks.
**Fix**: Add artist credit comparison, title Levenshtein distance, duration tolerance (±3%).
#### 1.3 First-Wins Merge with No Confidence (`merge.ts:105-124`)
```typescript
missingReleaseProperties.forEach((property) => {
  const value = cloneInto(mergedRelease, sourceRelease, property);
  if (isFilled(value)) {
    mergedRelease.info.sourceMap[property] = providerName;
    missingReleaseProperties.delete(property); // First wins, done
  }
});
```
**Problem**: First provider to fill a field wins. No quality assessment.
**Fix**: Score each value by source trust × recency × consensus, pick highest.
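One possible reading of that scoring rule, as a hedged sketch. The trust values are taken from the Source Trust Hierarchy in this document; the exponential recency decay with a one-year half-life, the definition of consensus as the share of sources agreeing on a value, and the default trust of 0.5 for unknown providers are all assumptions.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timezone

SOURCE_TRUST = {"musicbrainz": 0.95, "spotify": 0.75, "deezer": 0.70}


@dataclass(frozen=True)
class Candidate:
    provider: str
    value: str
    fetched_at: datetime


def recency(fetched_at: datetime, now: datetime, half_life_days: float = 365.0) -> float:
    """Decay from 1.0 toward 0.0 as the observation ages (assumed half-life)."""
    age_days = max((now - fetched_at).total_seconds() / 86400.0, 0.0)
    return 0.5 ** (age_days / half_life_days)


def pick_best(candidates: list[Candidate], now: datetime) -> tuple[Candidate, float]:
    """Return the candidate with the highest trust x recency x consensus score."""
    if not candidates:
        raise ValueError("at least one candidate is required")
    votes = Counter(c.value for c in candidates)

    def score(c: Candidate) -> float:
        consensus = votes[c.value] / len(candidates)
        trust = SOURCE_TRUST.get(c.provider, 0.5)
        return trust * recency(c.fetched_at, now) * consensus

    best = max(candidates, key=score)
    return best, score(best)
```

Unlike the first-wins loop, the order in which providers respond no longer affects the result.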
#### 1.4 No Data Quality Metrics
**Missing**: Confidence scores, match quality, conflict counts, field completeness.
---
## 2. GraphBrainz - Architectural Flaws
### Critical Issues
#### 2.1 BATCHING COMPLETELY DISABLED (`loaders.js:38-42`)
```javascript
const lookup = new DataLoader(
  (keys) => { /* ... */ },
  { batch: false } // ← DEFEATS ENTIRE PURPOSE OF DATALOADER
);
```
**Impact**: A query for 20 entities issues 20 sequential HTTP requests. At a rate limit of 5 requests per 5.5 s, that is 4 windows, or about **22 seconds**.
**Fix**: Implement request coalescing even without batch API. Deduplicate concurrent identical requests.
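Request coalescing could look roughly like this in Python with `asyncio`. The class name `RequestCoalescer` and the `loader` callback are hypothetical; the sketch only shows that concurrent callers asking for the same key share one in-flight request.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, Hashable


class RequestCoalescer:
    """Share a single in-flight request among concurrent callers of the same key."""

    def __init__(self) -> None:
        self._in_flight: Dict[Hashable, "asyncio.Task[Any]"] = {}

    async def fetch(self, key: Hashable, loader: Callable[[], Awaitable[Any]]) -> Any:
        existing = self._in_flight.get(key)
        if existing is not None:
            # Someone is already fetching this key: wait for their result.
            return await existing
        task = asyncio.create_task(loader())
        self._in_flight[key] = task
        try:
            return await task
        finally:
            # Drop the entry so later calls trigger a fresh request.
            self._in_flight.pop(key, None)
```

This does not replace a cache; it only removes duplicate work among requests that overlap in time.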
#### 2.2 N+1 Queries by Design (`relationship.js:127-138`)
```javascript
relationships: {
  resolve: (entity, args, { loaders }, info) => {
    // If relations not included in initial fetch...
    promise = loaders.lookup.load([entityType, id, params]); // N+1 QUERY
    return promise.then((entity) => entity.relations);
  },
}
```
**Also in**: `recording.js:51-61` (ISRCs), `helpers.js:56-64` (fieldWithID pattern)
**Impact**: Query 100 artists with relationships = 1 + 100 requests.
**Fix**: Query planning phase - analyze full GraphQL query before any resolvers, compute optimal `inc` parameters.
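A small sketch of the planning step, assuming the set of requested field names has already been extracted from the parsed GraphQL query. The `FIELD_TO_INC` mapping and both function names are hypothetical; joining `inc` values with `+` follows the MusicBrainz web service convention.

```python
# Hypothetical mapping from GraphQL field names to MusicBrainz `inc` values.
FIELD_TO_INC = {
    "releases": "releases",
    "recordings": "recordings",
    "isrcs": "isrcs",
    "aliases": "aliases",
}


def plan_inc(requested_fields: set[str]) -> list[str]:
    """Return the sorted `inc` values needed to satisfy all requested fields."""
    return sorted({FIELD_TO_INC[f] for f in requested_fields if f in FIELD_TO_INC})


def build_lookup_params(requested_fields: set[str]) -> dict[str, str]:
    """Build query parameters for one lookup that covers every requested field."""
    inc = plan_inc(requested_fields)
    return {"inc": "+".join(inc)} if inc else {}
```

With the parameters computed up front, child resolvers can read from the already-fetched entity instead of issuing their own lookup.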
#### 2.3 Cache Fragmentation (`loaders.js:11-20`)
```javascript
// Same artist cached 3 times with different completeness:
loaders.lookup.load(['artist', 'abc', {}])
loaders.lookup.load(['artist', 'abc', { inc: ['releases'] }])
loaders.lookup.load(['artist', 'abc', { inc: ['recordings'] }])
```
**Problem**: URL-based cache keys mean same entity with different `inc` params = different cache entries.
**Fix**: Entity-based cache with incremental enrichment.
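An in-memory sketch of an entity-keyed cache with incremental enrichment. The class and method names are assumptions; a production version would sit on Redis or similar and handle expiry.

```python
from typing import Any, Dict, Iterable, Optional, Set, Tuple


class EntityCache:
    """Cache keyed by (entity_type, entity_id); tracks which `inc` parts are loaded."""

    def __init__(self) -> None:
        self._entries: Dict[Tuple[str, str], Dict[str, Any]] = {}

    def put(self, entity_type: str, entity_id: str,
            data: Dict[str, Any], inc: Iterable[str] = ()) -> None:
        entry = self._entries.setdefault(
            (entity_type, entity_id), {"data": {}, "inc": set()}
        )
        entry["data"].update(data)  # enrich the existing entity in place
        entry["inc"].update(inc)

    def missing_inc(self, entity_type: str, entity_id: str,
                    inc: Iterable[str] = ()) -> Set[str]:
        """Return the `inc` parts that still need to be fetched."""
        entry = self._entries.get((entity_type, entity_id))
        loaded = entry["inc"] if entry else set()
        return set(inc) - loaded

    def get(self, entity_type: str, entity_id: str,
            inc: Iterable[str] = ()) -> Optional[Dict[str, Any]]:
        """Return the entity only if every requested `inc` part is cached."""
        entry = self._entries.get((entity_type, entity_id))
        if entry is None or not set(inc) <= entry["inc"]:
            return None
        return entry["data"]
```

A caller that needs `recordings` for an artist already cached with `releases` fetches only the missing part and merges it into the same entry.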
#### 2.4 Extension System Limitations (`extensions/index.js`)
```javascript
// Only 18 lines. No lifecycle hooks, no dependency management.
export async function loadExtension(extensionModule) {
  return typeof extensionModule === 'string'
    ? await import(extensionModule)
    : extensionModule;
}
```
**Missing**: Lifecycle hooks, resolver interception, middleware support, error boundaries.
---
## 3. Bedrock-API - Architectural Flaws
### Critical Issues
#### 3.1 Missing Proto Fields (`bedrock_service.proto`)
| Missing Field | Impact |
|---------------|--------|
| `album_id` on Track | Can't link tracks to albums bidirectionally |
| `release_date` on Track | Temporal data lost |
| `explicit` flag | Content rating lost |
| `isrc` | International standard ID lost (critical for rights) |
| `verified` on Artist | Badge status lost |
| `label` on Album | Publisher info lost |
| `upc/ean` | Barcode identifiers lost |
#### 3.2 SoundCloud artist_id Bug (`soundcloud.go:457`)
```go
// BUG: Uses track ID instead of user ID
artist_id: fmt.Sprintf("soundcloud:%d", t.ID), // Should be t.User.ID
```
#### 3.3 Listening Stats Don't Persist (`main.go:984-1000`)
```go
func (s *BedrockServer) RecordPlay(ctx context.Context, req *pb.RecordPlayRequest) (*pb.RecordPlayResponse, error) {
	eventID := uuid.New().String()
	// TODO: persist event ← STUB!
	return &pb.RecordPlayResponse{EventId: eventID, Status: pb.ResponseStatus_STATUS_OK}, nil
}
```
**Impact**: `GetPopularTracks` and `GetListeningHistory` return empty - feature non-functional.
#### 3.4 Resolver Bridging Has No Validation (`resolver.go:152-159`)
```go
// Takes first search result without scoring
results, err := s.sc.SearchTracks(ctx, cleanedQuery, 1)
return results[0] // Wrong track if covers/remixes rank first
```
**Missing**: Duration comparison, artist name fuzzy matching, ISRC/UPC verification.
#### 3.5 Spotify Panic Risk (`spotify.go:76-78`)
```go
// No bounds check before indexing
ArtistIDs: wrapper.ArtistIDs[0], // PANIC if empty array
```
---
## 4. minim - Architectural Flaws
### Critical Issues
#### 4.1 Inconsistent Error Handling Per Provider
| Provider | Error Pattern |
|----------|---------------|
| Spotify | Retries on 401, raises `RuntimeError` |
| TIDAL | Parses JSON error, falls back to status |
| Qobuz | Raises with `error['code']` |
| iTunes | Tries `errorMessage`, uses JSONDecodeError fallback |
| Discogs | Parses nested `detail` field |
**Impact**: Consumers need provider-specific error handling.
#### 4.2 Missing Retry Logic (3/5 providers)
Only Spotify and Qobuz implement retry. TIDAL, iTunes, Discogs fail immediately on transient errors.
#### 4.3 No Rate Limit Handling
```python
# Missing everywhere:
# - 429 Too Many Requests detection
# - Retry-After header parsing
# - Exponential backoff
```
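A transport-agnostic sketch of the missing behavior. `send` is a hypothetical callable returning `(status, headers, body)`; the base delay, cap, and jitter range are assumptions, and only the delta-seconds form of `Retry-After` is parsed (the HTTP-date form is ignored here).

```python
import random
import time
from typing import Callable, Mapping, Tuple

Response = Tuple[int, Mapping[str, str], bytes]


def retry_delay(attempt: int, headers: Mapping[str, str],
                base: float = 0.5, cap: float = 30.0) -> float:
    """Honor Retry-After (seconds) if present, else exponential backoff with jitter."""
    retry_after = headers.get("Retry-After")
    if retry_after is not None and retry_after.strip().isdigit():
        return min(float(retry_after), cap)
    return min(cap, base * 2 ** attempt) * random.uniform(0.5, 1.0)


def call_with_retry(send: Callable[[], Response], max_attempts: int = 5,
                    sleep: Callable[[float], None] = time.sleep) -> Response:
    """Retry on 429 and 5xx; return the last response when attempts run out."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be >= 1")
    for attempt in range(max_attempts):
        status, headers, body = send()
        retryable = status == 429 or 500 <= status < 600
        if not retryable or attempt == max_attempts - 1:
            break
        sleep(retry_delay(attempt, headers))
    return status, headers, body
```

Putting this in one shared wrapper means every provider gets the same retry behavior, instead of two of five.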
#### 4.4 Response Structure Inconsistency
| Provider | Artist Field | Duration Field |
|----------|-------------|----------------|
| Spotify | `album.artists[0].name` | `duration_ms` |
| TIDAL | `data.attributes.name` | `duration` (seconds) |
| iTunes | `artistName` | `trackTimeMillis` |
| Discogs | `artists[0].name` | N/A |
**Impact**: No common data model. Every consumer writes provider-specific parsing.
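A common model could start as small as the sketch below. The field paths follow the table above; the exact payload shapes (for example, that TIDAL's `duration` sits at the top level of the payload) are assumptions, and `NormalizedTrack` is a hypothetical name.

```python
from dataclasses import dataclass
from typing import Any, Mapping, Optional


@dataclass(frozen=True)
class NormalizedTrack:
    provider: str
    artist: Optional[str]
    duration_ms: Optional[int]  # always milliseconds; None when unknown


def from_spotify(p: Mapping[str, Any]) -> NormalizedTrack:
    return NormalizedTrack("spotify", p["album"]["artists"][0]["name"], p["duration_ms"])


def from_tidal(p: Mapping[str, Any]) -> NormalizedTrack:
    # TIDAL reports seconds, so convert to milliseconds.
    artist = p["data"]["attributes"]["name"]
    return NormalizedTrack("tidal", artist, round(p["duration"] * 1000))


def from_itunes(p: Mapping[str, Any]) -> NormalizedTrack:
    return NormalizedTrack("itunes", p["artistName"], p["trackTimeMillis"])


def from_discogs(p: Mapping[str, Any]) -> NormalizedTrack:
    # Discogs has no duration field in this mapping.
    return NormalizedTrack("discogs", p["artists"][0]["name"], None)
```

Consumers then depend on one type with one unit, and unit conversion happens exactly once per provider.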
---
## 5. MusicMetaLinker - Architectural Flaws
### Critical Issues
#### 5.1 Naive Cascading Fallback (`linking.py:159-182`)
```python
def get_artist(self) -> str | None:
    if self.artist: return self.artist
    artist = self.mb_link.get_artist()
    if artist is None:
        artist = self.dz_link.get_artist_name()
    if artist is None:
        artist = self.mb_link.get_artist()  # Called twice!
    if artist is None:
        artist = self.yt_link.get_youtube_artist()
    return artist  # First non-None wins, no quality check
```
**Problems**:
- No confidence scoring
- No conflict detection ("Beyoncé" vs "Beyonce" vs "Beyoncé Knowles")
- Redundant MusicBrainz calls
- Order bias (Deezer always wins over YouTube)
#### 5.2 Silent Failures (`deezer_links.py:102-107`)
```python
try:
    return [res for res in results][:limit]
except Exception:  # Catches EVERYTHING
    return None  # Network error? Invalid input? Who knows!
```
**Impact**: Can't distinguish "no match" from "API failed" from "invalid input".
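One way to make those three cases distinguishable is an explicit result type. This is a sketch with hypothetical names (`Outcome`, `SearchResult`, `search`); `fetch` stands in for the provider call, and which exceptions count as provider failures is an assumption.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Callable, Iterable, Optional, Tuple


class Outcome(Enum):
    OK = "ok"
    NO_MATCH = "no_match"
    INVALID_INPUT = "invalid_input"
    PROVIDER_ERROR = "provider_error"


@dataclass(frozen=True)
class SearchResult:
    outcome: Outcome
    items: Tuple[Any, ...] = ()
    error: Optional[str] = None


def search(query: str, fetch: Callable[[str], Iterable[Any]], limit: int = 5) -> SearchResult:
    """Run a provider search and report *why* there are no items, if there are none."""
    if not query.strip():
        return SearchResult(Outcome.INVALID_INPUT, error="empty query")
    try:
        results = list(fetch(query))
    except (ConnectionError, TimeoutError) as exc:
        # Only transport failures are caught; programming errors still propagate.
        return SearchResult(Outcome.PROVIDER_ERROR, error=str(exc))
    if not results:
        return SearchResult(Outcome.NO_MATCH)
    return SearchResult(Outcome.OK, items=tuple(results[:limit]))
```

Callers can then retry on `PROVIDER_ERROR`, fall through to the next provider on `NO_MATCH`, and surface `INVALID_INPUT` to the user.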
#### 5.3 ISRC Handling Bug (`musicbrainz_links.py:77-85`)
```python
for isrc in self.isrc:
    try:
        isrc_result = mb.get_recordings_by_isrc(isrc, ...)
        return isrc_result  # Returns on first success
    except mb.ResponseError:
        return None  # BUG: Should be `continue`, not `return`!
```
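A corrected loop, sketched independently of the MusicBrainz client: `lookup` stands in for `mb.get_recordings_by_isrc` and `lookup_error` for `mb.ResponseError`. Both parameter names are hypothetical.

```python
from typing import Any, Callable, Iterable, Optional, Type


def first_recording_by_isrc(
    isrcs: Iterable[str],
    lookup: Callable[[str], Any],
    lookup_error: Type[Exception] = LookupError,
) -> Optional[Any]:
    """Try each ISRC in turn and return the first successful lookup."""
    for isrc in isrcs:
        try:
            return lookup(isrc)
        except lookup_error:
            continue  # a failed ISRC must not abort the remaining ones
    return None  # every ISRC failed, or the list was empty
```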
#### 5.4 Album Name Truncation (`deezer_links.py:63-78`)
```python
if self.album and " " in self.album:
    self.album = " ".join(self.album.split(" ")[:2])  # Only first 2 words!
```
"The Beatles (Remastered)" → "The Beatles" - loses critical specificity.
#### 5.5 Naive Duration Comparison
Fixed 3-second threshold regardless of track length:
- 3s is huge for 30-second track (10% error)
- 3s is tiny for 10-minute track (0.5% error)
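A relative comparison might look like the sketch below. It reads the "±3% or ±5s" rule used elsewhere in this document as "whichever is smaller", which is an assumption; the document does not say how the two bounds combine.

```python
def durations_match(a_ms: int, b_ms: int,
                    rel_tol: float = 0.03, abs_cap_ms: int = 5000) -> bool:
    """Compare durations with a tolerance that scales with track length.

    The allowed difference is 3% of the longer duration, capped at 5 seconds.
    """
    allowed = min(rel_tol * max(a_ms, b_ms), abs_cap_ms)
    return abs(a_ms - b_ms) <= allowed
```

With these defaults, a 3-second gap is rejected for a 30-second track but accepted for a 10-minute one.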
---
## Proposed Architecture
### Design Principles
1. **Observations are immutable** - No "last write wins"; always preserve raw data
2. **Field-level confidence** - Trust title from MusicBrainz while using duration from Spotify
3. **Three-stage entity resolution** - Blocking → Similarity → Decision
4. **Provenance by default** - Every value is explainable
### Architecture Diagram
```
┌─────────────────────────────────────────────────────────────────────────┐
│                             INGESTION LAYER                             │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐     │
│  │  Provider   │  │  Provider   │  │  Provider   │  │  Provider   │     │
│  │   Adapter   │  │   Adapter   │  │   Adapter   │  │   Adapter   │     │
│  └──────┬──────┘  └──────┬──────┘  └──────┬──────┘  └──────┬──────┘     │
│         └────────────────┴───────┬────────┴────────────────┘            │
│                    ┌─────────────▼──────────────┐                       │
│                    │  Unified Provider Gateway  │                       │
│                    │ • Per-provider rate limit  │                       │
│                    │ • Retry + exp. backoff     │                       │
│                    │ • Circuit breaker          │                       │
│                    │ • Request batching         │                       │
│                    └─────────────┬──────────────┘                       │
└──────────────────────────────────┼──────────────────────────────────────┘
                    ┌──────────────▼──────────────┐
                    │    RAW OBSERVATION STORE    │
                    │  (append-only, immutable)   │
                    └──────────────┬──────────────┘
┌──────────────────────────────────┼──────────────────────────────────────┐
│                         ENTITY RESOLUTION LAYER                         │
│         ┌────────────────────────▼────────────────────────┐             │
│         │                 BLOCKING STAGE                  │             │
│         │  • ISRC/UPC exact match (99.7% pair reduction)  │             │
│         │  • Phonetic blocking (Metaphone) for names      │             │
│         └────────────────────────┬────────────────────────┘             │
│         ┌────────────────────────▼────────────────────────┐             │
│         │                SIMILARITY STAGE                 │             │
│         │  • Title: Levenshtein + token Jaccard           │             │
│         │  • Artist: embedding cosine similarity          │             │
│         │  • Duration: relative threshold (±3% or ±5s)    │             │
│         └────────────────────────┬────────────────────────┘             │
│         ┌────────────────────────▼────────────────────────┐             │
│         │                 DECISION STAGE                  │             │
│         │  • ≥0.95 → auto-merge                           │             │
│         │  • 0.70-0.95 → human review queue               │             │
│         │  • <0.70 → distinct entities                    │             │
│         └────────────────────────┬────────────────────────┘             │
└──────────────────────────────────┼──────────────────────────────────────┘
┌──────────────────────────────────┼──────────────────────────────────────┐
│                       CONFLICT RESOLUTION ENGINE                        │
│         ┌────────────────────────▼────────────────────────┐             │
│         │             FIELD-LEVEL MERGE RULES             │             │
│         │ confidence = source_trust × recency × consensus │             │
│         │                                                 │             │
│         │  • Identifiers: ISRC > provider ID              │             │
│         │  • Duration: median within ±3% or ±5s           │             │
│         │  • Title: MusicBrainz > label > streaming       │             │
│         │  • Release date: earliest credible              │             │
│         │  • Explicit: OR across sources                  │             │
│         └────────────────────────┬────────────────────────┘             │
│         ┌────────────────────────▼────────────────────────┐             │
│         │             CANONICAL ENTITY STORE              │             │
│         │  • Materialized "best known" values             │             │
│         │  • Per-field confidence scores                  │             │
│         │  • Links to all source observations             │             │
│         └─────────────────────────────────────────────────┘             │
└─────────────────────────────────────────────────────────────────────────┘
```
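The decision stage thresholds from the diagram, plus the token Jaccard measure named in the similarity stage, can be sketched as follows. The function names and the returned labels are assumptions; how the individual similarity signals are combined into one score is left open here.

```python
def token_jaccard(a: str, b: str) -> float:
    """Jaccard similarity over lowercase whitespace-separated tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def decide(score: float) -> str:
    """Map a combined similarity score in [0, 1] to a resolution decision."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be within [0, 1]")
    if score >= 0.95:
        return "auto_merge"
    if score >= 0.70:
        return "human_review"
    return "distinct"
```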
---
### Core Data Model
```sql
-- Immutable observations from providers
CREATE TABLE observations (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    provider TEXT NOT NULL,
    provider_id TEXT NOT NULL,
    entity_type TEXT NOT NULL,
    payload JSONB NOT NULL,
    fetched_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    checksum BYTEA NOT NULL,
    UNIQUE(provider, provider_id, checksum)
);

-- Canonical entities with confidence
CREATE TABLE tracks (
    id UUID PRIMARY KEY,
    -- Identifiers
    isrc TEXT,
    iswc TEXT,
    mbid UUID,
    -- Fields with confidence
    title TEXT NOT NULL,
    title_confidence REAL NOT NULL DEFAULT 0.0,
    duration_ms INT,
    duration_confidence REAL NOT NULL DEFAULT 0.0,
    explicit BOOLEAN,
    explicit_confidence REAL NOT NULL DEFAULT 0.0,
    -- Denormalized
    artist_credit TEXT NOT NULL,
    album_title TEXT,
    -- Metadata
    created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    updated_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    merge_version INT NOT NULL DEFAULT 1
);

-- Field-level provenance
CREATE TABLE field_sources (
    entity_type TEXT NOT NULL,
    entity_id UUID NOT NULL,
    field_name TEXT NOT NULL,
    observation_id UUID NOT NULL REFERENCES observations(id),
    confidence REAL NOT NULL,
    selected BOOLEAN NOT NULL DEFAULT false,
    PRIMARY KEY (entity_type, entity_id, field_name, observation_id)
);

-- Cross-reference table
CREATE TABLE provider_links (
    entity_type TEXT NOT NULL,
    entity_id UUID NOT NULL,
    provider TEXT NOT NULL,
    provider_id TEXT NOT NULL,
    verified BOOLEAN NOT NULL DEFAULT false,
    PRIMARY KEY (entity_type, provider, provider_id)
);

-- Entity resolution audit trail
CREATE TABLE merge_decisions (
    id UUID PRIMARY KEY,
    entity_type TEXT NOT NULL,
    source_ids UUID[] NOT NULL,
    target_id UUID NOT NULL,
    similarity_score REAL NOT NULL,
    decision TEXT NOT NULL, -- 'auto', 'human_approved', 'human_rejected'
    decided_by TEXT,
    decided_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
```
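The `checksum` column and the `UNIQUE(provider, provider_id, checksum)` constraint only deduplicate reliably if the checksum is computed over a canonical form of the payload. A possible sketch, assuming SHA-256 over key-sorted JSON (the schema does not specify the hash or the canonicalization):

```python
import hashlib
import json
from typing import Any, Mapping


def observation_checksum(payload: Mapping[str, Any]) -> bytes:
    """Hash a canonical JSON encoding so key order does not change the result."""
    canonical = json.dumps(
        payload, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).digest()
```

Re-fetching an unchanged entity then yields the same checksum and the insert can be skipped, while any changed field produces a new immutable observation.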
---
### Source Trust Hierarchy
```python
SOURCE_TRUST = {
    'musicbrainz': 0.95,  # Community-curated, high accuracy
    'discogs': 0.85,      # Community + physical media focus
    'tidal': 0.80,        # Label direct relationships
    'spotify': 0.75,      # Large scale, some noise
    'deezer': 0.70,       # Good coverage, less curation
    'youtube': 0.60,      # User-generated, low accuracy
}
```
---
### Conflict Resolution Rules
| Field | Strategy | Implementation |
|-------|----------|----------------|
| **Title** | Highest trust + consensus | Score = trust + 0.1×(agreeing_sources - 1) |
| **Duration** | Median within tolerance | Filter to ±3% or ±5s, take median |
| **Explicit** | OR logic | If any source says explicit → explicit |
| **Release Date** | Earliest credible | Must be ≤ today and ≥ 1900 |
| **ISRC** | Highest-trust valid | Validate format, take highest-trust source |
| **Artist** | Embedding similarity | Cluster similar names, pick canonical |
---
### Technical Choices
| Component | Choice | Rationale |
|-----------|--------|-----------|
| **Core Language** | Python 3.11+ | Rapid iteration, rich ecosystem |
| **Hot Path** | Rust via PyO3 | Entity resolution blocking/embedding |
| **Database** | PostgreSQL 15+ | JSONB, trigram, pgvector |
| **Cache** | Redis | Entity-keyed, not URL-keyed |
| **Embeddings** | all-MiniLM-L6-v2 | 384-dim, fast, good quality |
| **API** | GraphQL + DataLoader | Explicit batching, no N+1 |
| **Queue** | PostgreSQL SKIP LOCKED | Human review, async processing |
| **Observability** | OpenTelemetry | Trace entity resolution decisions |
---
### Estimated Effort
| Component | Effort | Notes |
|-----------|--------|-------|
| Data model + migrations | 1-4 hours | PostgreSQL schema |
| Provider gateway | 1-2 days | Unified error handling, rate limiting |
| Entity resolution pipeline | 1-2 days | Blocking, similarity, decision |
| Conflict resolution engine | 1-4 hours | Field-level rules |
| Provenance system | 1-4 hours | Audit tables, explain API |
| Human review UI | 1-2 days | Queue management |
| **Total MVP** | **1-2 weeks** | |
---
## Key Takeaways
1. **Hybrid approaches win**: Audio + metadata outperforms either alone (Spotify research: 2-6% improvement)
2. **Provenance is non-negotiable**: Every field needs source tracking, confidence scores, snapshot URLs
3. **Identifier hierarchy matters**: ISWC (work) → ISRC (recording) → UPC (release) with MBIDs as glue
4. **Fuzzy matching requires stages**: Blocking (99.7% reduction) → Similarity → Threshold → Human review
5. **Conflict resolution needs policy**: Field-level precedence rules, not "last write wins"
6. **Cache entities, not requests**: Avoid GraphBrainz's URL-fragmentation trap
7. **Unified error handling**: Result types that force error handling, not silent exceptions