metadata-agregator/docs/research/AGGREGATORS_ANALYSIS.md
Alexander a1f6701bac feat: initial implementation of metadata aggregator
- gRPC service with MusicBrainz provider
- PostgreSQL schema with migrations
- Service layer with database-first caching
- Repository pattern for data access
- YAML configuration support
- Research documentation for 17 music metadata projects
2026-04-28 16:28:53 +02:00

# Aggregators Architecture Analysis & Proposed Solution
Deep analysis of 5 music metadata aggregators, identifying common flaws and proposing a ground-up redesign.
---
## Executive Summary
The 5 aggregators repeat the same **architectural mistakes**, which lead to data quality issues, performance problems, and poor extensibility. Only the first is present in every project; the rest each affect a subset:
| Pattern | Projects Affected | Impact |
|---------|-------------------|--------|
| **No confidence scoring** | 5/5 | Can't distinguish good data from bad |
| **First/last-write-wins merging** | 4/5 | Data loss, no conflict resolution |
| **Silent failure cascades** | 4/5 | Debugging nightmare, data corruption |
| **Naive entity resolution** | 4/5 | Duplicates, mismatches |
| **Provider-specific error handling** | 3/5 | Inconsistent reliability |
| **URL-based cache keys** | 2/5 | Same entity cached multiple times |
| **Disabled batching** | 2/5 | Catastrophic performance |
---
## 1. Harmony - Architectural Flaws
### Critical Issues
#### 1.1 Naive Deduplication (`deduplicate.ts:4-25`)
```typescript
// FLAW: Exact string match only
if (mbid) {
  if (!mbids.has(mbid)) { result.push(entity); mbids.add(mbid); }
} else if (name) {
  if (!names.has(name)) { result.push(entity); names.add(name); }
}
```
**Problem**: "The Beatles" ≠ "Beatles" ≠ "BEATLES" - all treated as different entities.
**Fix**: Implement phonetic blocking (Metaphone) + Levenshtein similarity threshold.
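
A minimal sketch of the proposed fix, assuming Python. Metaphone is not in the standard library, so this sketch swaps in a much cruder blocking key (the first character of the normalized name) as a stand-in for a phonetic code. The names `normalize`, `blocking_key`, `dedupe_names` and the 0.85 threshold are hypothetical, not taken from Harmony.

```python
import re
import unicodedata


def normalize(name: str) -> str:
    """Lowercase, strip accents and punctuation, drop a leading English article."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    cleaned = re.sub(r"[\W_]+", " ", no_accents.lower()).strip()
    return re.sub(r"^(the|a|an) ", "", cleaned)


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, keeping one row at a time."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Edit distance rescaled to 0..1, where 1.0 means identical."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def blocking_key(normalized: str) -> str:
    # Stand-in for a phonetic code such as Metaphone (assumption of this sketch).
    return normalized[:1]


def dedupe_names(names, threshold=0.85):
    """Keep the first spelling of each name; compare only within a block."""
    kept, blocks = [], {}
    for name in names:
        norm = normalize(name)
        block = blocks.setdefault(blocking_key(norm), [])
        if any(similarity(norm, seen) >= threshold for seen in block):
            continue
        block.append(norm)
        kept.append(name)
    return kept
```

Blocking keeps the pairwise comparison inside small groups instead of across the whole input; a real phonetic key would also catch spellings that differ in their first letter.
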
#### 1.2 Limited Compatibility Checks (`compatibility.ts:60-67`)
```typescript
const releaseCompatibilityChecks: CompatibilityCheck<HarmonyRelease>[] = [{
  property: (release) => release.gtin ? Number(release.gtin) : undefined,
  errorMessage: 'Providers have returned multiple different GTIN',
}, {
  property: trackCountSummary,
  errorMessage: 'Providers have returned incompatible track lists',
}];
```
**Problem**: Only checks GTIN and track count. No artist validation, title similarity, or duration checks.
**Fix**: Add artist credit comparison, title Levenshtein distance, duration tolerance (±3%).
#### 1.3 First-Wins Merge with No Confidence (`merge.ts:105-124`)
```typescript
missingReleaseProperties.forEach((property) => {
  const value = cloneInto(mergedRelease, sourceRelease, property);
  if (isFilled(value)) {
    mergedRelease.info.sourceMap[property] = providerName;
    missingReleaseProperties.delete(property); // First wins, done
  }
});
```
**Problem**: First provider to fill a field wins. No quality assessment.
**Fix**: Score each value by source trust × recency × consensus, pick highest.
#### 1.4 No Data Quality Metrics
**Missing**: Confidence scores, match quality, conflict counts, field completeness.
---
## 2. GraphBrainz - Architectural Flaws
### Critical Issues
#### 2.1 BATCHING COMPLETELY DISABLED (`loaders.js:38-42`)
```javascript
const lookup = new DataLoader(
  (keys) => { /* ... */ },
  { batch: false } // ← DEFEATS ENTIRE PURPOSE OF DATALOADER
);
```
**Impact**: A query for 20 entities issues 20 sequential HTTP requests. At a rate limit of 5 requests per 5.5 s (1.1 s per request), that takes roughly **22 seconds**.
**Fix**: Implement request coalescing even without batch API. Deduplicate concurrent identical requests.
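
Request coalescing can be sketched as below, assuming an asyncio-based client. `CoalescingLoader` and its `fetch` callable are hypothetical names for illustration; they are not GraphBrainz or DataLoader APIs.

```python
import asyncio


class CoalescingLoader:
    """Share one in-flight request among concurrent callers asking for the same key."""

    def __init__(self, fetch):
        self._fetch = fetch  # hypothetical async callable: key -> value
        self._in_flight = {}

    async def load(self, key):
        task = self._in_flight.get(key)
        if task is None:
            # First caller for this key starts the request; later callers reuse it.
            task = asyncio.create_task(self._fetch(key))
            self._in_flight[key] = task
            # Forget the task once it settles so later loads fetch fresh data.
            task.add_done_callback(lambda _task: self._in_flight.pop(key, None))
        return await task
```

Keys must be hashable. This only deduplicates requests that overlap in time; it is not a cache, so it composes with whatever caching layer sits behind `fetch`.
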
#### 2.2 N+1 Queries by Design (`relationship.js:127-138`)
```javascript
relationships: {
  resolve: (entity, args, { loaders }, info) => {
    // If relations not included in initial fetch...
    promise = loaders.lookup.load([entityType, id, params]); // N+1 QUERY
    return promise.then((entity) => entity.relations);
  },
}
```
**Also in**: `recording.js:51-61` (ISRCs), `helpers.js:56-64` (fieldWithID pattern)
**Impact**: Query 100 artists with relationships = 1 + 100 requests.
**Fix**: Query planning phase - analyze full GraphQL query before any resolvers, compute optimal `inc` parameters.
#### 2.3 Cache Fragmentation (`loaders.js:11-20`)
```javascript
// Same artist cached 3 times with different completeness:
loaders.lookup.load(['artist', 'abc', {}])
loaders.lookup.load(['artist', 'abc', { inc: ['releases'] }])
loaders.lookup.load(['artist', 'abc', { inc: ['recordings'] }])
```
**Problem**: URL-based cache keys mean same entity with different `inc` params = different cache entries.
**Fix**: Entity-based cache with incremental enrichment.
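
One way to picture an entity-keyed cache with incremental enrichment, as a rough in-memory sketch. `EntityCache` and its methods are hypothetical; a production version would sit on Redis or similar and handle expiry.

```python
class EntityCache:
    """One entry per (entity_type, id); `inc` sub-resources are merged in over time."""

    def __init__(self):
        self._entries = {}  # (entity_type, entity_id) -> {"data": dict, "inc": set}

    def missing(self, entity_type, entity_id, inc=()):
        """Return the requested `inc` values that have not been fetched yet."""
        entry = self._entries.get((entity_type, entity_id))
        return set(inc) - (entry["inc"] if entry else set())

    def get(self, entity_type, entity_id, inc=()):
        """Return cached data only if every requested `inc` is already present."""
        entry = self._entries.get((entity_type, entity_id))
        if entry is None or self.missing(entity_type, entity_id, inc):
            return None
        return entry["data"]

    def put(self, entity_type, entity_id, data, inc=()):
        """Merge a new response into the single entry for this entity."""
        key = (entity_type, entity_id)
        entry = self._entries.setdefault(key, {"data": {}, "inc": set()})
        entry["data"].update(data)
        entry["inc"].update(inc)
```

A caller would ask `missing(...)` first and request only those `inc` values upstream, so the three lookups in the snippet above end up in one entry rather than three.
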
#### 2.4 Extension System Limitations (`extensions/index.js`)
```javascript
// Only 18 lines. No lifecycle hooks, no dependency management.
export async function loadExtension(extensionModule) {
  return typeof extensionModule === 'string'
    ? await import(extensionModule)
    : extensionModule;
}
```
**Missing**: Lifecycle hooks, resolver interception, middleware support, error boundaries.
---
## 3. Bedrock-API - Architectural Flaws
### Critical Issues
#### 3.1 Missing Proto Fields (`bedrock_service.proto`)
| Missing Field | Impact |
|---------------|--------|
| `album_id` on Track | Can't link tracks to albums bidirectionally |
| `release_date` on Track | Temporal data lost |
| `explicit` flag | Content rating lost |
| `isrc` | International standard ID lost (critical for rights) |
| `verified` on Artist | Badge status lost |
| `label` on Album | Publisher info lost |
| `upc/ean` | Barcode identifiers lost |
#### 3.2 SoundCloud artist_id Bug (`soundcloud.go:457`)
```go
// BUG: Uses track ID instead of user ID
artist_id: fmt.Sprintf("soundcloud:%d", t.ID), // Should be t.User.ID
```
#### 3.3 Listening Stats Don't Persist (`main.go:984-1000`)
```go
func (s *BedrockServer) RecordPlay(ctx context.Context, req *pb.RecordPlayRequest) (*pb.RecordPlayResponse, error) {
	eventID := uuid.New().String()
	// TODO: persist event ← STUB!
	return &pb.RecordPlayResponse{EventId: eventID, Status: pb.ResponseStatus_STATUS_OK}, nil
}
```
**Impact**: `GetPopularTracks` and `GetListeningHistory` return empty - feature non-functional.
#### 3.4 Resolver Bridging Has No Validation (`resolver.go:152-159`)
```go
// Takes first search result without scoring
results, err := s.sc.SearchTracks(ctx, cleanedQuery, 1)
return results[0] // Wrong track if covers/remixes rank first
```
**Missing**: Duration comparison, artist name fuzzy matching, ISRC/UPC verification.
#### 3.5 Spotify Panic Risk (`spotify.go:76-78`)
```go
// No bounds check before indexing
ArtistIDs: wrapper.ArtistIDs[0], // PANIC if empty array
```
---
## 4. minim - Architectural Flaws
### Critical Issues
#### 4.1 Inconsistent Error Handling Per Provider
| Provider | Error Pattern |
|----------|---------------|
| Spotify | Retries on 401, raises `RuntimeError` |
| TIDAL | Parses JSON error, falls back to status |
| Qobuz | Raises with `error['code']` |
| iTunes | Tries `errorMessage`, uses JSONDecodeError fallback |
| Discogs | Parses nested `detail` field |
**Impact**: Consumers need provider-specific error handling.
#### 4.2 Missing Retry Logic (3/5 providers)
Only Spotify and Qobuz implement retry. TIDAL, iTunes, Discogs fail immediately on transient errors.
#### 4.3 No Rate Limit Handling
```python
# Missing everywhere:
# - 429 Too Many Requests detection
# - Retry-After header parsing
# - Exponential backoff
```
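
The three missing pieces could be combined roughly as follows. This is a sketch, not minim's code: `send` is a hypothetical callable returning `(status, headers, body)`, header lookup is assumed to be an exact-case dict access, and only the delay-in-seconds form of `Retry-After` is parsed (the HTTP-date form falls back to exponential backoff).

```python
import time


def retry_delay(attempt, retry_after=None, base=0.5, cap=30.0):
    """Seconds to wait before the next try; `attempt` is 0-based."""
    if retry_after is not None:
        try:
            return min(cap, max(0.0, float(retry_after)))
        except ValueError:
            pass  # HTTP-date form of Retry-After is not handled in this sketch
    return min(cap, base * 2 ** attempt)


def call_with_backoff(send, max_attempts=5, sleep=time.sleep):
    """Retry on 429 and 5xx; return (status, body) for anything else."""
    status = None
    for attempt in range(max_attempts):
        status, headers, body = send()
        if status != 429 and status < 500:
            return status, body
        if attempt < max_attempts - 1:
            sleep(retry_delay(attempt, headers.get("Retry-After")))
    raise RuntimeError(f"gave up after {max_attempts} attempts (last status {status})")
```

Adding random jitter to the computed delay is a common refinement when many workers share one rate limit.
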
#### 4.4 Response Structure Inconsistency
| Provider | Artist Field | Duration Field |
|----------|-------------|----------------|
| Spotify | `album.artists[0].name` | `duration_ms` |
| TIDAL | `data.attributes.name` | `duration` (seconds) |
| iTunes | `artistName` | `trackTimeMillis` |
| Discogs | `artists[0].name` | N/A |
**Impact**: No common data model. Every consumer writes provider-specific parsing.
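
A common model could start as small as this sketch. The field paths come only from the table above and are not verified against the providers' APIs; in particular, reading TIDAL's `duration` from the top level of the item is an assumption. `NormalizedTrack` and the `from_*` functions are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class NormalizedTrack:
    provider: str
    artist: Optional[str]
    duration_ms: Optional[int]  # always milliseconds, whatever the provider uses


def _first_artist(artists) -> Optional[str]:
    return artists[0].get("name") if artists else None


def from_spotify(item: dict) -> NormalizedTrack:
    artists = item.get("album", {}).get("artists")
    return NormalizedTrack("spotify", _first_artist(artists), item.get("duration_ms"))


def from_tidal(item: dict) -> NormalizedTrack:
    name = item.get("data", {}).get("attributes", {}).get("name")
    seconds = item.get("duration")  # assumption: top-level key, in seconds
    duration_ms = None if seconds is None else int(round(seconds * 1000))
    return NormalizedTrack("tidal", name, duration_ms)


def from_itunes(item: dict) -> NormalizedTrack:
    return NormalizedTrack("itunes", item.get("artistName"), item.get("trackTimeMillis"))


def from_discogs(item: dict) -> NormalizedTrack:
    # The table lists no duration field for Discogs.
    return NormalizedTrack("discogs", _first_artist(item.get("artists")), None)
```
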
---
## 5. MusicMetaLinker - Architectural Flaws
### Critical Issues
#### 5.1 Naive Cascading Fallback (`linking.py:159-182`)
```python
def get_artist(self) -> str | None:
    if self.artist: return self.artist
    artist = self.mb_link.get_artist()
    if artist is None:
        artist = self.dz_link.get_artist_name()
    if artist is None:
        artist = self.mb_link.get_artist()  # Called twice!
    if artist is None:
        artist = self.yt_link.get_youtube_artist()
    return artist  # First non-None wins, no quality check
```
**Problems**:
- No confidence scoring
- No conflict detection ("Beyoncé" vs "Beyonce" vs "Beyoncé Knowles")
- Redundant MusicBrainz calls
- Order bias (Deezer always wins over YouTube)
#### 5.2 Silent Failures (`deezer_links.py:102-107`)
```python
try:
    return [res for res in results][:limit]
except Exception:  # Catches EVERYTHING
    return None  # Network error? Invalid input? Who knows!
```
**Impact**: Can't distinguish "no match" from "API failed" from "invalid input".
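
A result type that keeps those three cases apart might look like this sketch. `Outcome`, `SearchResult` and `search` are hypothetical names, and `fetch` stands in for the provider call; which exceptions count as provider failures is an assumption here.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    MATCH = "match"
    NO_MATCH = "no_match"
    INVALID_INPUT = "invalid_input"
    PROVIDER_ERROR = "provider_error"


@dataclass(frozen=True)
class SearchResult:
    outcome: Outcome
    items: tuple = ()
    detail: str = ""


def search(query, fetch, limit=5):
    """Run `fetch(query)` and report what happened instead of returning None."""
    if not query or not query.strip():
        return SearchResult(Outcome.INVALID_INPUT, detail="empty query")
    try:
        results = list(fetch(query))
    except (ConnectionError, TimeoutError) as exc:
        # Only transport failures are caught; programming errors still propagate.
        return SearchResult(Outcome.PROVIDER_ERROR, detail=str(exc))
    if not results:
        return SearchResult(Outcome.NO_MATCH)
    return SearchResult(Outcome.MATCH, items=tuple(results[:limit]))
```
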
#### 5.3 ISRC Handling Bug (`musicbrainz_links.py:77-85`)
```python
for isrc in self.isrc:
    try:
        isrc_result = mb.get_recordings_by_isrc(isrc, ...)
        return isrc_result  # Returns on first success
    except mb.ResponseError:
        return None  # BUG: Should be `continue`, not `return`!
```
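
The corrected control flow, as a standalone sketch: `lookup` stands in for `mb.get_recordings_by_isrc` and `lookup_error` for `mb.ResponseError`, both passed in so the example runs without the MusicBrainz client. The function name is hypothetical.

```python
def first_recording_by_isrc(isrcs, lookup, lookup_error=LookupError):
    """Try each ISRC in turn; give up only after all of them have failed."""
    for isrc in isrcs:
        try:
            return lookup(isrc)  # first successful lookup wins
        except lookup_error:
            continue  # a failure for one ISRC must not hide the others
    return None
```
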
#### 5.4 Album Name Truncation (`deezer_links.py:63-78`)
```python
if self.album and " " in self.album:
    self.album = " ".join(self.album.split(" ")[:2])  # Only first 2 words!
```
"The Beatles (Remastered)" → "The Beatles" - loses critical specificity.
#### 5.5 Naive Duration Comparison
Fixed 3-second threshold regardless of track length:
- 3s is huge for 30-second track (10% error)
- 3s is tiny for 10-minute track (0.5% error)
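
A relative threshold avoids both problems. The sketch below borrows the ±3% figure suggested for Harmony in section 1.2; the function name and the choice to measure against the longer of the two durations are assumptions.

```python
def durations_match(a_ms: int, b_ms: int, rel_tol: float = 0.03) -> bool:
    """True if the two durations differ by at most `rel_tol` of the longer one."""
    allowed_ms = rel_tol * max(a_ms, b_ms)
    return abs(a_ms - b_ms) <= allowed_ms
```

With these defaults a 3-second gap is rejected for a 30-second track but accepted for a 10-minute one.
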
---
## Proposed Architecture
### Design Principles
1. **Observations are immutable** - No "last write wins"; always preserve raw data
2. **Field-level confidence** - Trust title from MusicBrainz while using duration from Spotify
3. **Three-stage entity resolution** - Blocking → Similarity → Decision
4. **Provenance by default** - Every value is explainable
### Architecture Diagram
```
┌─────────────────────────────────────────────────────────────────────────┐
│                             INGESTION LAYER                             │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐     │
│  │  Provider   │  │  Provider   │  │  Provider   │  │  Provider   │     │
│  │   Adapter   │  │   Adapter   │  │   Adapter   │  │   Adapter   │     │
│  └──────┬──────┘  └──────┬──────┘  └──────┬──────┘  └──────┬──────┘     │
│         └────────────────┴───────┬────────┴────────────────┘            │
│                    ┌─────────────▼──────────────┐                       │
│                    │  Unified Provider Gateway  │                       │
│                    │  • Per-provider rate limit │                       │
│                    │  • Retry + exp. backoff    │                       │
│                    │  • Circuit breaker         │                       │
│                    │  • Request batching        │                       │
│                    └─────────────┬──────────────┘                       │
└──────────────────────────────────┼──────────────────────────────────────┘
                    ┌──────────────▼──────────────┐
                    │    RAW OBSERVATION STORE    │
                    │  (append-only, immutable)   │
                    └──────────────┬──────────────┘
┌──────────────────────────────────┼──────────────────────────────────────┐
│                         ENTITY RESOLUTION LAYER                         │
│         ┌────────────────────────▼────────────────────────┐             │
│         │                 BLOCKING STAGE                  │             │
│         │ • ISRC/UPC exact match (99.7% pair reduction)   │             │
│         │ • Phonetic blocking (Metaphone) for names       │             │
│         └────────────────────────┬────────────────────────┘             │
│         ┌────────────────────────▼────────────────────────┐             │
│         │                SIMILARITY STAGE                 │             │
│         │ • Title: Levenshtein + token Jaccard            │             │
│         │ • Artist: embedding cosine similarity           │             │
│         │ • Duration: relative threshold (±3% or ±5s)     │             │
│         └────────────────────────┬────────────────────────┘             │
│         ┌────────────────────────▼────────────────────────┐             │
│         │                 DECISION STAGE                  │             │
│         │ • ≥0.95     → auto-merge                        │             │
│         │ • 0.70-0.95 → human review queue                │             │
│         │ • <0.70     → distinct entities                 │             │
│         └────────────────────────┬────────────────────────┘             │
└──────────────────────────────────┼──────────────────────────────────────┘
┌──────────────────────────────────┼──────────────────────────────────────┐
│                       CONFLICT RESOLUTION ENGINE                        │
│         ┌────────────────────────▼────────────────────────┐             │
│         │             FIELD-LEVEL MERGE RULES             │             │
│         │ confidence = source_trust × recency × consensus │             │
│         │                                                 │             │
│         │ • Identifiers: ISRC > provider ID               │             │
│         │ • Duration: median within 2s tolerance          │             │
│         │ • Title: MusicBrainz > label > streaming        │             │
│         │ • Release date: earliest credible               │             │
│         │ • Explicit: OR across sources                   │             │
│         └────────────────────────┬────────────────────────┘             │
│         ┌────────────────────────▼────────────────────────┐             │
│         │             CANONICAL ENTITY STORE              │             │
│         │ • Materialized "best known" values              │             │
│         │ • Per-field confidence scores                   │             │
│         │ • Links to all source observations              │             │
│         └─────────────────────────────────────────────────┘             │
└─────────────────────────────────────────────────────────────────────────┘
```
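
The decision stage in the diagram maps a similarity score onto three outcomes. A small sketch, using the thresholds from the diagram and a plain token Jaccard as one of the similarity inputs; the function names and the outcome labels are hypothetical.

```python
def token_jaccard(a: str, b: str) -> float:
    """Share of distinct lowercase tokens that both strings have in common."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def decide(score: float) -> str:
    """Map a combined similarity score in 0..1 to a merge decision."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score out of range: {score}")
    if score >= 0.95:
        return "auto_merge"
    if score >= 0.70:
        return "human_review"
    return "distinct"
```

How the per-field similarities are weighted into one score is left open here; the document does not specify it.
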
---
### Core Data Model
```sql
-- Immutable observations from providers
CREATE TABLE observations (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    provider TEXT NOT NULL,
    provider_id TEXT NOT NULL,
    entity_type TEXT NOT NULL,
    payload JSONB NOT NULL,
    fetched_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    checksum BYTEA NOT NULL,
    UNIQUE(provider, provider_id, checksum)
);

-- Canonical entities with confidence
CREATE TABLE tracks (
    id UUID PRIMARY KEY,
    -- Identifiers
    isrc TEXT,
    iswc TEXT,
    mbid UUID,
    -- Fields with confidence
    title TEXT NOT NULL,
    title_confidence REAL NOT NULL DEFAULT 0.0,
    duration_ms INT,
    duration_confidence REAL NOT NULL DEFAULT 0.0,
    explicit BOOLEAN,
    explicit_confidence REAL NOT NULL DEFAULT 0.0,
    -- Denormalized
    artist_credit TEXT NOT NULL,
    album_title TEXT,
    -- Metadata
    created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    updated_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    merge_version INT NOT NULL DEFAULT 1
);

-- Field-level provenance
CREATE TABLE field_sources (
    entity_type TEXT NOT NULL,
    entity_id UUID NOT NULL,
    field_name TEXT NOT NULL,
    observation_id UUID NOT NULL REFERENCES observations(id),
    confidence REAL NOT NULL,
    selected BOOLEAN NOT NULL DEFAULT false,
    PRIMARY KEY (entity_type, entity_id, field_name, observation_id)
);

-- Cross-reference table
CREATE TABLE provider_links (
    entity_type TEXT NOT NULL,
    entity_id UUID NOT NULL,
    provider TEXT NOT NULL,
    provider_id TEXT NOT NULL,
    verified BOOLEAN NOT NULL DEFAULT false,
    PRIMARY KEY (entity_type, provider, provider_id)
);

-- Entity resolution audit trail
CREATE TABLE merge_decisions (
    id UUID PRIMARY KEY,
    entity_type TEXT NOT NULL,
    source_ids UUID[] NOT NULL,
    target_id UUID NOT NULL,
    similarity_score REAL NOT NULL,
    decision TEXT NOT NULL, -- 'auto', 'human_approved', 'human_rejected'
    decided_by TEXT,
    decided_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
```
---
### Source Trust Hierarchy
```python
SOURCE_TRUST = {
'musicbrainz': 0.95, # Community-curated, high accuracy
'discogs': 0.85, # Community + physical media focus
'tidal': 0.80, # Label direct relationships
'spotify': 0.75, # Large scale, some noise
'deezer': 0.70, # Good coverage, less curation
'youtube': 0.60, # User-generated, low accuracy
}
```
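
To show how these weights could feed `confidence = source_trust × recency × consensus`, here is a sketch. The document does not define the recency or consensus terms, so both are assumptions: recency is an exponential decay with a one-year half-life, and consensus is the fraction of observations that agree on the value. Unknown providers default to 0.5, also an assumption.

```python
from datetime import datetime, timedelta, timezone

SOURCE_TRUST = {
    "musicbrainz": 0.95, "discogs": 0.85, "tidal": 0.80,
    "spotify": 0.75, "deezer": 0.70, "youtube": 0.60,
}


def recency(fetched_at: datetime, now: datetime, half_life_days: float = 365.0) -> float:
    """1.0 for a fresh observation, halving every `half_life_days` (assumed decay)."""
    age_days = max(0.0, (now - fetched_at).total_seconds() / 86400)
    return 0.5 ** (age_days / half_life_days)


def pick_value(observations, now: datetime):
    """observations: list of (provider, value, fetched_at). Returns (value, confidence)."""
    best = None
    for provider, value, fetched_at in observations:
        agreeing = sum(1 for _, other, _ in observations if other == value)
        consensus = agreeing / len(observations)
        score = SOURCE_TRUST.get(provider, 0.5) * recency(fetched_at, now) * consensus
        if best is None or score > best[1]:
            best = (value, score)
    return best
```

Every candidate keeps its score, so the losing values can still be written to `field_sources` with `selected = false`.
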
---
### Conflict Resolution Rules
| Field | Strategy | Implementation |
|-------|----------|----------------|
| **Title** | Highest trust + consensus | Score = trust + 0.1×(agreeing_sources - 1) |
| **Duration** | Median within tolerance | Filter to ±3% or ±5s, take median |
| **Explicit** | OR logic | If any source says explicit → explicit |
| **Release Date** | Earliest credible | Must be ≤ today and ≥ 1900 |
| **ISRC** | First valid | Validate format, take highest-trust source |
| **Artist** | Embedding similarity | Cluster similar names, pick canonical |
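
Four of the rules above are simple enough to sketch directly. Function names are hypothetical. For duration, the table does not say what the tolerance is measured against or how "±3% or ±5s" combines; this sketch measures against the overall median and keeps a value if it passes either bound.

```python
from datetime import date
from statistics import median


def title_score(trust: float, agreeing_sources: int) -> float:
    """Score = trust + 0.1 × (agreeing_sources - 1), as given in the table."""
    return trust + 0.1 * (agreeing_sources - 1)


def merge_duration(durations_ms, rel_tol=0.03, abs_tol_ms=5000):
    """Drop outliers relative to the median, then take the median of the rest."""
    if not durations_ms:
        return None
    center = median(durations_ms)
    tolerance = max(rel_tol * center, abs_tol_ms)  # assumption: either bound passes
    kept = [d for d in durations_ms if abs(d - center) <= tolerance]
    return int(median(kept))


def merge_explicit(flags):
    """Explicit if any source says so; None when no source reports the flag."""
    known = [flag for flag in flags if flag is not None]
    return any(known) if known else None


def merge_release_date(dates, today: date):
    """Earliest date that is not in the future and not before 1900."""
    credible = [d for d in dates if date(1900, 1, 1) <= d <= today]
    return min(credible) if credible else None
```
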
---
### Technical Choices
| Component | Choice | Rationale |
|-----------|--------|-----------|
| **Core Language** | Python 3.11+ | Rapid iteration, rich ecosystem |
| **Hot Path** | Rust via PyO3 | Entity resolution blocking/embedding |
| **Database** | PostgreSQL 15+ | JSONB, trigram, pgvector |
| **Cache** | Redis | Entity-keyed, not URL-keyed |
| **Embeddings** | all-MiniLM-L6-v2 | 384-dim, fast, good quality |
| **API** | GraphQL + DataLoader | Explicit batching, no N+1 |
| **Queue** | PostgreSQL SKIP LOCKED | Human review, async processing |
| **Observability** | OpenTelemetry | Trace entity resolution decisions |
---
### Estimated Effort
| Component | Effort | Notes |
|-----------|--------|-------|
| Data model + migrations | 1-4 hours | PostgreSQL schema |
| Provider gateway | 1-2 days | Unified error handling, rate limiting |
| Entity resolution pipeline | 1-2 days | Blocking, similarity, decision |
| Conflict resolution engine | 1-4 hours | Field-level rules |
| Provenance system | 1-4 hours | Audit tables, explain API |
| Human review UI | 1-2 days | Queue management |
| **Total MVP** | **1-2 weeks** | |
---
## Key Takeaways
1. **Hybrid approaches win**: Audio + metadata outperforms either alone (Spotify research: 2-6% improvement)
2. **Provenance is non-negotiable**: Every field needs source tracking, confidence scores, snapshot URLs
3. **Identifier hierarchy matters**: ISWC (work) → ISRC (recording) → UPC (release) with MBIDs as glue
4. **Fuzzy matching requires stages**: Blocking (99.7% reduction) → Similarity → Threshold → Human review
5. **Conflict resolution needs policy**: Field-level precedence rules, not "last write wins"
6. **Cache entities, not requests**: Avoid GraphBrainz's URL-fragmentation trap
7. **Unified error handling**: Result types that force error handling, not silent exceptions