# Aggregators Architecture Analysis & Proposed Solution

Deep analysis of 5 music metadata aggregators, identifying common flaws and proposing a ground-up redesign.

---

## Executive Summary

All 5 aggregators share **common architectural mistakes** that lead to data quality issues, performance problems, and poor extensibility:

| Pattern | Projects Affected | Impact |
|---------|-------------------|--------|
| **No confidence scoring** | 5/5 | Can't distinguish good data from bad |
| **First/last-write-wins merging** | 4/5 | Data loss, no conflict resolution |
| **Silent failure cascades** | 4/5 | Debugging nightmare, data corruption |
| **Naive entity resolution** | 4/5 | Duplicates, mismatches |
| **Provider-specific error handling** | 3/5 | Inconsistent reliability |
| **URL-based cache keys** | 2/5 | Same entity cached multiple times |
| **Disabled batching** | 2/5 | Catastrophic performance |

---

## 1. Harmony - Architectural Flaws

### Critical Issues

#### 1.1 Naive Deduplication (`deduplicate.ts:4-25`)

```typescript
// FLAW: Exact string match only
if (mbid) {
  if (!mbids.has(mbid)) {
    result.push(entity);
    mbids.add(mbid);
  }
} else if (name) {
  if (!names.has(name)) {
    result.push(entity);
    names.add(name);
  }
}
```

**Problem**: "The Beatles" ≠ "Beatles" ≠ "THE BEATLES" - name variants of the same artist are all treated as different entities.

**Fix**: Implement phonetic blocking (Metaphone) + Levenshtein similarity threshold.

#### 1.2 Limited Compatibility Checks (`compatibility.ts:60-67`)

```typescript
const releaseCompatibilityChecks: CompatibilityCheck[] = [{
  property: (release) => release.gtin ? Number(release.gtin) : undefined,
  errorMessage: 'Providers have returned multiple different GTIN',
}, {
  property: trackCountSummary,
  errorMessage: 'Providers have returned incompatible track lists',
}];
```

**Problem**: Only checks GTIN and track count. No artist validation, title similarity, or duration checks.

**Fix**: Add artist credit comparison, title Levenshtein distance, duration tolerance (±3%).

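The checks proposed in 1.2 can be sketched in a few lines. This is a minimal illustration in Python (the language proposed for the redesign), not Harmony's code: the function names and the `0.85` title threshold are assumptions chosen for the example, and only the ±3% duration tolerance comes from the text above.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute) via row-by-row dynamic programming."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def title_similarity(a: str, b: str) -> float:
    """Normalized similarity in [0, 1]; case and surrounding whitespace are ignored."""
    a, b = a.casefold().strip(), b.casefold().strip()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def durations_compatible(a_ms: int, b_ms: int, tolerance: float = 0.03) -> bool:
    """Relative tolerance: the difference must be within 3% of the longer duration."""
    return abs(a_ms - b_ms) <= tolerance * max(a_ms, b_ms)


def releases_compatible(title_a: str, title_b: str, a_ms: int, b_ms: int,
                        min_title_similarity: float = 0.85) -> bool:
    """Hypothetical combined check; the 0.85 threshold is an illustrative assumption."""
    return (title_similarity(title_a, title_b) >= min_title_similarity
            and durations_compatible(a_ms, b_ms))
```

A relative tolerance scales with track length, so a 5-second gap passes on a 3-minute track but a 3-second gap fails on a 30-second one.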
#### 1.3 First-Wins Merge with No Confidence (`merge.ts:105-124`)

```typescript
missingReleaseProperties.forEach((property) => {
  const value = cloneInto(mergedRelease, sourceRelease, property);
  if (isFilled(value)) {
    mergedRelease.info.sourceMap[property] = providerName;
    missingReleaseProperties.delete(property); // First wins, done
  }
});
```

**Problem**: First provider to fill a field wins. No quality assessment.

**Fix**: Score each value by source trust × recency × consensus, pick highest.

#### 1.4 No Data Quality Metrics

**Missing**: Confidence scores, match quality, conflict counts, field completeness.

---

## 2. GraphBrainz - Architectural Flaws

### Critical Issues

#### 2.1 BATCHING COMPLETELY DISABLED (`loaders.js:38-42`)

```javascript
const lookup = new DataLoader(
  (keys) => { /* ... */ },
  { batch: false } // ← DEFEATS ENTIRE PURPOSE OF DATALOADER
);
```

**Impact**: Query for 20 entities = 20 sequential HTTP requests. With rate limit of 5 req/5.5s = **22 seconds minimum**.

**Fix**: Implement request coalescing even without batch API. Deduplicate concurrent identical requests.

#### 2.2 N+1 Queries by Design (`relationship.js:127-138`)

```javascript
relationships: {
  resolve: (entity, args, { loaders }, info) => {
    // If relations not included in initial fetch...
    promise = loaders.lookup.load([entityType, id, params]); // N+1 QUERY
    return promise.then((entity) => entity.relations);
  },
}
```

**Also in**: `recording.js:51-61` (ISRCs), `helpers.js:56-64` (fieldWithID pattern)

**Impact**: Query 100 artists with relationships = 1 + 100 requests.

**Fix**: Query planning phase - analyze full GraphQL query before any resolvers, compute optimal `inc` parameters.

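The request coalescing proposed in 2.1 does not depend on a batch endpoint: concurrent callers asking for the same key can share one in-flight request. The sketch below shows the idea with `asyncio`; the `Coalescer` class and its `fetch` method are hypothetical names for illustration, not part of GraphBrainz or DataLoader.

```python
import asyncio
from typing import Any, Awaitable, Callable, Hashable


class Coalescer:
    """Share a single in-flight request among concurrent callers with the same key."""

    def __init__(self) -> None:
        self._in_flight: dict = {}  # key -> asyncio.Task currently running

    async def fetch(self, key: Hashable,
                    fetcher: Callable[[], Awaitable[Any]]) -> Any:
        existing = self._in_flight.get(key)
        if existing is not None:
            # Someone already started this request: wait for their result.
            return await existing
        task = asyncio.ensure_future(fetcher())
        self._in_flight[key] = task
        # Forget the task once it finishes so later calls fetch fresh data.
        task.add_done_callback(lambda _: self._in_flight.pop(key, None))
        return await task
```

Usage under these assumptions: 20 concurrent `fetch(("artist", "abc"), lookup)` calls trigger one call to `lookup`. Coalescing only removes duplicate concurrent requests; it complements, rather than replaces, the query planning described in 2.2.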
#### 2.3 Cache Fragmentation (`loaders.js:11-20`)

```javascript
// Same artist cached 3 times with different completeness:
loaders.lookup.load(['artist', 'abc', {}])
loaders.lookup.load(['artist', 'abc', { inc: ['releases'] }])
loaders.lookup.load(['artist', 'abc', { inc: ['recordings'] }])
```

**Problem**: URL-based cache keys mean same entity with different `inc` params = different cache entries.

**Fix**: Entity-based cache with incremental enrichment.

#### 2.4 Extension System Limitations (`extensions/index.js`)

```javascript
// Only 18 lines. No lifecycle hooks, no dependency management.
export async function loadExtension(extensionModule) {
  return typeof extensionModule === 'string'
    ? await import(extensionModule)
    : extensionModule;
}
```

**Missing**: Lifecycle hooks, resolver interception, middleware support, error boundaries.

---

## 3. Bedrock-API - Architectural Flaws

### Critical Issues

#### 3.1 Missing Proto Fields (`bedrock_service.proto`)

| Missing Field | Impact |
|---------------|--------|
| `album_id` on Track | Can't link tracks to albums bidirectionally |
| `release_date` on Track | Temporal data lost |
| `explicit` flag | Content rating lost |
| `isrc` | International standard ID lost (critical for rights) |
| `verified` on Artist | Badge status lost |
| `label` on Album | Publisher info lost |
| `upc/ean` | Barcode identifiers lost |

#### 3.2 SoundCloud artist_id Bug (`soundcloud.go:457`)

```go
// BUG: Uses track ID instead of user ID
artist_id: fmt.Sprintf("soundcloud:%d", t.ID), // Should be t.User.ID
```

#### 3.3 Listening Stats Don't Persist (`main.go:984-1000`)

```go
func (s *BedrockServer) RecordPlay(ctx context.Context, req *pb.RecordPlayRequest) (*pb.RecordPlayResponse, error) {
	eventID := uuid.New().String()
	// TODO: persist event ← STUB!
	return &pb.RecordPlayResponse{EventId: eventID, Status: pb.ResponseStatus_STATUS_OK}, nil
}
```

**Impact**: `GetPopularTracks` and `GetListeningHistory` return empty - feature non-functional.

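To show why persistence is the missing piece in 3.3, here is a minimal sketch of the write path and the read path that depends on it. It is written in Python with the standard-library `sqlite3` module rather than Go, and the `play_events` table, its columns, and the function names are assumptions for illustration; the fields of the real `RecordPlayRequest` are not shown in the source.

```python
import sqlite3
import uuid
from datetime import datetime, timezone


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Create the (hypothetical) play_events table if it does not exist yet."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS play_events (
               event_id  TEXT PRIMARY KEY,
               user_id   TEXT NOT NULL,
               track_id  TEXT NOT NULL,
               played_at TEXT NOT NULL
           )"""
    )
    return conn


def record_play(conn: sqlite3.Connection, user_id: str, track_id: str) -> str:
    """Persist one play event and return its generated ID."""
    event_id = str(uuid.uuid4())
    played_at = datetime.now(timezone.utc).isoformat()
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO play_events VALUES (?, ?, ?, ?)",
            (event_id, user_id, track_id, played_at),
        )
    return event_id


def popular_tracks(conn: sqlite3.Connection, limit: int = 10) -> list:
    """Return (track_id, play_count) pairs, most played first."""
    rows = conn.execute(
        """SELECT track_id, COUNT(*) AS plays
           FROM play_events
           GROUP BY track_id
           ORDER BY plays DESC, track_id
           LIMIT ?""",
        (limit,),
    )
    return rows.fetchall()
```

Once events are stored, a popularity query is a plain aggregation; without the insert, it has nothing to aggregate, which matches the empty results described above.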
#### 3.4 Resolver Bridging Has No Validation (`resolver.go:152-159`)

```go
// Takes first search result without scoring
results, err := s.sc.SearchTracks(ctx, cleanedQuery, 1)
return results[0] // Wrong track if covers/remixes rank first
```

**Missing**: Duration comparison, artist name fuzzy matching, ISRC/UPC verification.

#### 3.5 Spotify Panic Risk (`spotify.go:76-78`)

```go
// No bounds check before indexing
ArtistIDs: wrapper.ArtistIDs[0], // PANIC if empty array
```

---

## 4. minim - Architectural Flaws

### Critical Issues

#### 4.1 Inconsistent Error Handling Per Provider

| Provider | Error Pattern |
|----------|---------------|
| Spotify | Retries on 401, raises `RuntimeError` |
| TIDAL | Parses JSON error, falls back to status |
| Qobuz | Raises with `error['code']` |
| iTunes | Tries `errorMessage`, uses JSONDecodeError fallback |
| Discogs | Parses nested `detail` field |

**Impact**: Consumers need provider-specific error handling.

#### 4.2 Missing Retry Logic (3/5 providers)

Only Spotify and Qobuz implement retry. TIDAL, iTunes, Discogs fail immediately on transient errors.

#### 4.3 No Rate Limit Handling

```python
# Missing everywhere:
# - 429 Too Many Requests detection
# - Retry-After header parsing
# - Exponential backoff
```

#### 4.4 Response Structure Inconsistency

| Provider | Artist Field | Duration Field |
|----------|-------------|----------------|
| Spotify | `album.artists[0].name` | `duration_ms` |
| TIDAL | `data.attributes.name` | `duration` (seconds) |
| iTunes | `artistName` | `trackTimeMillis` |
| Discogs | `artists[0].name` | N/A |

**Impact**: No common data model. Every consumer writes provider-specific parsing.

---

## 5. MusicMetaLinker - Architectural Flaws

### Critical Issues

#### 5.1 Naive Cascading Fallback (`linking.py:159-182`)

```python
def get_artist(self) -> str | None:
    if self.artist:
        return self.artist
    artist = self.mb_link.get_artist()
    if artist is None:
        artist = self.dz_link.get_artist_name()
    if artist is None:
        artist = self.mb_link.get_artist()  # Called twice!
    if artist is None:
        artist = self.yt_link.get_youtube_artist()
    return artist  # First non-None wins, no quality check
```

**Problems**:
- No confidence scoring
- No conflict detection ("Beyoncé" vs "Beyonce" vs "Beyoncé Knowles")
- Redundant MusicBrainz calls
- Order bias (Deezer always wins over YouTube)

#### 5.2 Silent Failures (`deezer_links.py:102-107`)

```python
try:
    return [res for res in results][:limit]
except Exception:  # Catches EVERYTHING
    return None  # Network error? Invalid input? Who knows!
```

**Impact**: Can't distinguish "no match" from "API failed" from "invalid input".

#### 5.3 ISRC Handling Bug (`musicbrainz_links.py:77-85`)

```python
for isrc in self.isrc:
    try:
        isrc_result = mb.get_recordings_by_isrc(isrc, ...)
        return isrc_result  # Returns on first success
    except mb.ResponseError:
        return None  # BUG: Should be `continue`, not `return`!
```

#### 5.4 Album Name Truncation (`deezer_links.py:63-78`)

```python
if self.album and " " in self.album:
    self.album = " ".join(self.album.split(" ")[:2])  # Only first 2 words!
```

"The Beatles (Remastered)" → "The Beatles" - loses critical specificity.

#### 5.5 Naive Duration Comparison

Fixed 3-second threshold regardless of track length:
- 3s is huge for a 30-second track (10% error)
- 3s is tiny for a 10-minute track (0.5% error)

---

## Proposed Architecture

### Design Principles

1. **Observations are immutable** - No "last write wins"; always preserve raw data
2. **Field-level confidence** - Trust title from MusicBrainz while using duration from Spotify
3. **Three-stage entity resolution** - Blocking → Similarity → Decision
4. **Provenance by default** - Every value is explainable

### Architecture Diagram

```
┌─────────────────────────────────────────────────────────────────────────┐
│                             INGESTION LAYER                             │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐     │
│  │  Provider   │  │  Provider   │  │  Provider   │  │  Provider   │     │
│  │  Adapter    │  │  Adapter    │  │  Adapter    │  │  Adapter    │     │
│  └──────┬──────┘  └──────┬──────┘  └──────┬──────┘  └──────┬──────┘     │
│         └────────────────┴───────┬────────┴────────────────┘            │
│                    ┌─────────────▼──────────────┐                       │
│                    │  Unified Provider Gateway  │                       │
│                    │  • Per-provider rate limit │                       │
│                    │  • Retry + exp. backoff    │                       │
│                    │  • Circuit breaker         │                       │
│                    │  • Request batching        │                       │
│                    └─────────────┬──────────────┘                       │
└──────────────────────────────────┼──────────────────────────────────────┘
                                   │
                    ┌──────────────▼──────────────┐
                    │    RAW OBSERVATION STORE    │
                    │  (append-only, immutable)   │
                    └──────────────┬──────────────┘
                                   │
┌──────────────────────────────────┼──────────────────────────────────────┐
│                         ENTITY RESOLUTION LAYER                         │
│         ┌────────────────────────▼────────────────────────┐             │
│         │                 BLOCKING STAGE                  │             │
│         │  • ISRC/UPC exact match (99.7% pair reduction)  │             │
│         │  • Phonetic blocking (Metaphone) for names      │             │
│         └────────────────────────┬────────────────────────┘             │
│         ┌────────────────────────▼────────────────────────┐             │
│         │                SIMILARITY STAGE                 │             │
│         │  • Title: Levenshtein + token Jaccard           │             │
│         │  • Artist: embedding cosine similarity          │             │
│         │  • Duration: relative threshold (±3% or ±5s)    │             │
│         └────────────────────────┬────────────────────────┘             │
│         ┌────────────────────────▼────────────────────────┐             │
│         │                 DECISION STAGE                  │             │
│         │  • ≥0.95 → auto-merge                           │             │
│         │  • 0.70-0.95 → human review queue               │             │
│         │  • <0.70 → distinct entities                    │             │
│         └────────────────────────┬────────────────────────┘             │
└──────────────────────────────────┼──────────────────────────────────────┘
                                   │
┌──────────────────────────────────┼──────────────────────────────────────┐
│                       CONFLICT RESOLUTION ENGINE                        │
│         ┌────────────────────────▼────────────────────────┐             │
│         │             FIELD-LEVEL MERGE RULES             │             │
│         │ confidence = source_trust × recency × consensus │             │
│         │                                                 │             │
│         │  • Identifiers: ISRC > provider ID              │             │
│         │  • Duration: median within ±3% or ±5s           │             │
│         │  • Title: MusicBrainz > label > streaming       │             │
│         │  • Release date: earliest credible              │             │
│         │  • Explicit: OR across sources                  │             │
│         └────────────────────────┬────────────────────────┘             │
│         ┌────────────────────────▼────────────────────────┐             │
│         │             CANONICAL ENTITY STORE              │             │
│         │  • Materialized "best known" values             │             │
│         │  • Per-field confidence scores                  │             │
│         │  • Links to all source observations             │             │
│         └─────────────────────────────────────────────────┘             │
└─────────────────────────────────────────────────────────────────────────┘
```

---

### Core Data Model

```sql
-- Immutable observations from providers
CREATE TABLE observations (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    provider TEXT NOT NULL,
    provider_id TEXT NOT NULL,
    entity_type TEXT NOT NULL,
    payload JSONB NOT NULL,
    fetched_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    checksum BYTEA NOT NULL,
    UNIQUE(provider, provider_id, checksum)
);

-- Canonical entities with confidence
CREATE TABLE tracks (
    id UUID PRIMARY KEY,
    -- Identifiers
    isrc TEXT,
    iswc TEXT,
    mbid UUID,
    -- Fields with confidence
    title TEXT NOT NULL,
    title_confidence REAL NOT NULL DEFAULT 0.0,
    duration_ms INT,
    duration_confidence REAL NOT NULL DEFAULT 0.0,
    explicit BOOLEAN,
    explicit_confidence REAL NOT NULL DEFAULT 0.0,
    -- Denormalized
    artist_credit TEXT NOT NULL,
    album_title TEXT,
    -- Metadata
    created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    updated_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    merge_version INT NOT NULL DEFAULT 1
);

-- Field-level provenance
CREATE TABLE field_sources (
    entity_type TEXT NOT NULL,
    entity_id UUID NOT NULL,
    field_name TEXT NOT NULL,
    observation_id UUID NOT NULL REFERENCES observations(id),
    confidence REAL NOT NULL,
    selected BOOLEAN NOT NULL DEFAULT false,
    PRIMARY KEY (entity_type, entity_id, field_name, observation_id)
);

-- Cross-reference table
CREATE TABLE provider_links (
    entity_type TEXT NOT NULL,
    entity_id UUID NOT NULL,
    provider TEXT NOT NULL,
    provider_id TEXT NOT NULL,
    verified BOOLEAN NOT NULL DEFAULT false,
    PRIMARY KEY (entity_type, provider, provider_id)
);

-- Entity resolution audit trail
CREATE TABLE merge_decisions (
    id UUID PRIMARY KEY,
    entity_type TEXT NOT NULL,
    source_ids UUID[] NOT NULL,
    target_id UUID NOT NULL,
    similarity_score REAL NOT NULL,
    decision TEXT NOT NULL,  -- 'auto', 'human_approved', 'human_rejected'
    decided_by TEXT,
    decided_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
```

---

### Source Trust Hierarchy

```python
SOURCE_TRUST = {
    'musicbrainz': 0.95,  # Community-curated, high accuracy
    'discogs': 0.85,      # Community + physical media focus
    'tidal': 0.80,        # Label direct relationships
    'spotify': 0.75,      # Large scale, some noise
    'deezer': 0.70,       # Good coverage, less curation
    'youtube': 0.60,      # User-generated, low accuracy
}
```

---

### Conflict Resolution Rules

| Field | Strategy | Implementation |
|-------|----------|----------------|
| **Title** | Highest trust + consensus | Score = trust + 0.1×(agreeing_sources - 1) |
| **Duration** | Median within tolerance | Filter to ±3% or ±5s, take median |
| **Explicit** | OR logic | If any source says explicit → explicit |
| **Release Date** | Earliest credible | Must be ≤ today and ≥ 1900 |
| **ISRC** | First valid | Validate format, take highest-trust source |
| **Artist** | Embedding similarity | Cluster similar names, pick canonical |

---

### Technical Choices

| Component | Choice | Rationale |
|-----------|--------|-----------|
| **Core Language** | Python 3.11+ | Rapid iteration, rich ecosystem |
| **Hot Path** | Rust via PyO3 | Entity resolution blocking/embedding |
| **Database** | PostgreSQL 15+ | JSONB, trigram, pgvector |
| **Cache** | Redis | Entity-keyed, not URL-keyed |
| **Embeddings** | all-MiniLM-L6-v2 | 384-dim, fast, good quality |
| **API** | GraphQL + DataLoader | Explicit batching, no N+1 |
| **Queue** | PostgreSQL SKIP LOCKED | Human review, async processing |
| **Observability** | OpenTelemetry | Trace entity resolution decisions |

---

### Estimated Effort

| Component | Effort | Notes |
|-----------|--------|-------|
| Data model + migrations | 1-4 hours | PostgreSQL schema |
| Provider gateway | 1-2 days | Unified error handling, rate limiting |
| Entity resolution pipeline | 1-2 days | Blocking, similarity, decision |
| Conflict resolution engine | 1-4 hours | Field-level rules |
| Provenance system | 1-4 hours | Audit tables, explain API |
| Human review UI | 1-2 days | Queue management |
| **Total MVP** | **1-2 weeks** | |

---

## Key Takeaways

1. **Hybrid approaches win**: Audio + metadata outperforms either alone (Spotify research: 2-6% improvement)
2. **Provenance is non-negotiable**: Every field needs source tracking, confidence scores, snapshot URLs
3. **Identifier hierarchy matters**: ISWC (work) → ISRC (recording) → UPC (release) with MBIDs as glue
4. **Fuzzy matching requires stages**: Blocking (99.7% reduction) → Similarity → Threshold → Human review
5. **Conflict resolution needs policy**: Field-level precedence rules, not "last write wins"
6. **Cache entities, not requests**: Avoid GraphBrainz's URL-fragmentation trap
7. **Unified error handling**: Result types that force error handling, not silent exceptions
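
Takeaway 5 can be made concrete with three of the field-level rules from the Conflict Resolution Rules table. The sketch below is one possible reading, under stated assumptions: when several sources agree on a title, "trust" is taken as the highest trust among them; unknown providers default to 0.5; the duration filter is centered on the lower median and uses whichever of ±3% or ±5s is larger. The function names are hypothetical.

```python
from collections import defaultdict
from statistics import median, median_low
from typing import Optional

SOURCE_TRUST = {
    "musicbrainz": 0.95, "discogs": 0.85, "tidal": 0.80,
    "spotify": 0.75, "deezer": 0.70, "youtube": 0.60,
}
DEFAULT_TRUST = 0.5  # assumption: trust assigned to providers not listed above


def resolve_title(titles_by_provider: dict) -> tuple:
    """Score = trust + 0.1 * (agreeing_sources - 1); return (title, confidence)."""
    providers_by_title = defaultdict(list)
    for provider, title in titles_by_provider.items():
        providers_by_title[title].append(provider)

    def score(providers: list) -> float:
        best_trust = max(SOURCE_TRUST.get(p, DEFAULT_TRUST) for p in providers)
        return best_trust + 0.1 * (len(providers) - 1)

    title, providers = max(providers_by_title.items(), key=lambda kv: score(kv[1]))
    return title, min(score(providers), 1.0)  # cap confidence at 1.0


def resolve_duration(durations_ms: list) -> Optional[int]:
    """Drop values outside ±3% or ±5s of the lower median, then take the median."""
    if not durations_ms:
        return None
    center = median_low(durations_ms)  # always an actual observed value
    tolerance = max(0.03 * center, 5_000)
    kept = [d for d in durations_ms if abs(d - center) <= tolerance]
    return round(median(kept))


def resolve_explicit(flags: list) -> bool:
    """OR logic: explicit if any source says so."""
    return any(flags)
```

With these rules, a single MusicBrainz title (0.95) still beats a Spotify and Deezer pair that agree on a different title (0.75 + 0.1 = 0.85), while three agreeing mid-trust sources would outrank it. Whether that balance is right is a policy decision the 0.1 bonus makes explicit.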