Aggregators Architecture Analysis & Proposed Solution
Deep analysis of 5 music metadata aggregators, identifying common flaws and proposing a ground-up redesign.
Executive Summary
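The five aggregators share recurring architectural mistakes that lead to data quality issues, performance problems, and poor extensibility. Only the lack of confidence scoring affects all five; the other patterns affect two to four of them: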
| Pattern | Projects Affected | Impact |
|---|---|---|
| No confidence scoring | 5/5 | Can't distinguish good data from bad |
| First/last-write-wins merging | 4/5 | Data loss, no conflict resolution |
| Silent failure cascades | 4/5 | Debugging nightmare, data corruption |
| Naive entity resolution | 4/5 | Duplicates, mismatches |
| Provider-specific error handling | 3/5 | Inconsistent reliability |
| URL-based cache keys | 2/5 | Same entity cached multiple times |
| Disabled batching | 2/5 | Catastrophic performance |
1. Harmony - Architectural Flaws
Critical Issues
1.1 Naive Deduplication (deduplicate.ts:4-25)
// FLAW: Exact string match only
if (mbid) {
  if (!mbids.has(mbid)) { result.push(entity); mbids.add(mbid); }
} else if (name) {
  if (!names.has(name)) { result.push(entity); names.add(name); }
}
Problem: "The Beatles", "Beatles", and "BEATLES" never compare equal under exact string matching, so all three are treated as different entities.
Fix: Implement phonetic blocking (Metaphone) + Levenshtein similarity threshold.
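A minimal sketch of the similarity half of that fix, assuming name-only entities and an illustrative threshold of 0.85 (neither comes from Harmony). Python's standard library has no Metaphone implementation, so phonetic blocking is left out here and every name is compared against every kept name; a production version would compare only within blocks.

```python
import re
import unicodedata


def normalize(name: str) -> str:
    """Lowercase, strip accents and punctuation, drop a leading 'the'."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    tokens = re.sub(r"[^a-z0-9 ]+", " ", no_accents.lower()).split()
    if len(tokens) > 1 and tokens[0] == "the":
        tokens = tokens[1:]
    return " ".join(tokens)


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, one row at a time."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Edit distance rescaled to [0, 1]; 1.0 means identical."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def deduplicate(names: list[str], threshold: float = 0.85) -> list[str]:
    """Keep the first spelling of each cluster of near-identical names."""
    kept: list[tuple[str, str]] = []  # (original spelling, normalized key)
    for name in names:
        key = normalize(name)
        if not any(similarity(key, seen) >= threshold for _, seen in kept):
            kept.append((name, key))
    return [original for original, _ in kept]
```

With this sketch, `deduplicate(["The Beatles", "Beatles", "BEATLES", "Queen"])` keeps one Beatles entry and one Queen entry. The threshold needs tuning against real data, since short names tolerate fewer edits than long ones.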
1.2 Limited Compatibility Checks (compatibility.ts:60-67)
const releaseCompatibilityChecks: CompatibilityCheck<HarmonyRelease>[] = [{
  property: (release) => release.gtin ? Number(release.gtin) : undefined,
  errorMessage: 'Providers have returned multiple different GTIN',
}, {
  property: trackCountSummary,
  errorMessage: 'Providers have returned incompatible track lists',
}];
Problem: Only checks GTIN and track count. No artist validation, title similarity, or duration checks.
Fix: Add artist credit comparison, title Levenshtein distance, duration tolerance (±3%).
1.3 First-Wins Merge with No Confidence (merge.ts:105-124)
missingReleaseProperties.forEach((property) => {
  const value = cloneInto(mergedRelease, sourceRelease, property);
  if (isFilled(value)) {
    mergedRelease.info.sourceMap[property] = providerName;
    missingReleaseProperties.delete(property); // First wins, done
  }
});
Problem: First provider to fill a field wins. No quality assessment.
Fix: Score each value by source trust × recency × consensus, pick highest.
1.4 No Data Quality Metrics
Missing: Confidence scores, match quality, conflict counts, field completeness.
2. GraphBrainz - Architectural Flaws
Critical Issues
2.1 BATCHING COMPLETELY DISABLED (loaders.js:38-42)
const lookup = new DataLoader(
  (keys) => { /* ... */ },
  { batch: false } // ← DEFEATS ENTIRE PURPOSE OF DATALOADER
);
Impact: A query for 20 entities issues 20 sequential HTTP requests. At a rate limit of 5 requests per 5.5 s, that takes roughly 22 seconds (4 windows × 5.5 s).
Fix: Implement request coalescing even without batch API. Deduplicate concurrent identical requests.
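Request coalescing can be sketched with `asyncio` as below. The `Coalescer` class and the `fetch` callable are hypothetical names for illustration, not part of GraphBrainz; the idea is only that concurrent callers asking for the same key share one in-flight request.

```python
import asyncio
from collections.abc import Awaitable, Callable, Hashable
from typing import Any


class Coalescer:
    """Share one in-flight request among all callers asking for the same key."""

    def __init__(self, fetch: Callable[[Hashable], Awaitable[Any]]) -> None:
        self._fetch = fetch
        self._in_flight: dict[Hashable, asyncio.Task] = {}

    async def load(self, key: Hashable) -> Any:
        task = self._in_flight.get(key)
        if task is None:
            # First caller for this key starts the real request.
            task = asyncio.create_task(self._fetch(key))
            self._in_flight[key] = task
            # Forget the task once it settles so later calls fetch again.
            task.add_done_callback(lambda _: self._in_flight.pop(key, None))
        return await task
```

This only deduplicates requests that overlap in time. It does not replace a cache, and it does not raise the provider's rate limit; it just avoids spending that budget on identical requests.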
2.2 N+1 Queries by Design (relationship.js:127-138)
relationships: {
  resolve: (entity, args, { loaders }, info) => {
    // If relations not included in initial fetch...
    promise = loaders.lookup.load([entityType, id, params]); // N+1 QUERY
    return promise.then((entity) => entity.relations);
  },
}
Also in: recording.js:51-61 (ISRCs), helpers.js:56-64 (fieldWithID pattern)
Impact: Query 100 artists with relationships = 1 + 100 requests.
Fix: Query planning phase - analyze full GraphQL query before any resolvers, compute optimal inc parameters.
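A rough sketch of such a planning step, assuming the set of selected field names has already been extracted from the parsed GraphQL query. The `FIELD_TO_INC` mapping and both function names are illustrative, not GraphBrainz code.

```python
# Illustrative mapping from GraphQL field names to MusicBrainz `inc` values.
FIELD_TO_INC = {
    "relationships": "artist-rels",
    "releases": "releases",
    "recordings": "recordings",
    "isrcs": "isrcs",
}


def plan_inc(selected_fields: set[str]) -> str:
    """Collapse every sub-resource the query selects into one `inc` parameter."""
    incs = sorted({FIELD_TO_INC[name] for name in selected_fields if name in FIELD_TO_INC})
    return "+".join(incs)


def plan_requests(entity_type: str, ids: list[str], selected_fields: set[str]) -> list[tuple[str, str, str]]:
    """One planned request per distinct entity, with the full `inc` set attached up front."""
    inc = plan_inc(selected_fields)
    return [(entity_type, entity_id, inc) for entity_id in dict.fromkeys(ids)]
```

Because the `inc` set is known before any resolver runs, relations arrive with the first fetch instead of in a follow-up lookup per entity.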
2.3 Cache Fragmentation (loaders.js:11-20)
// Same artist cached 3 times with different completeness:
loaders.lookup.load(['artist', 'abc', {}])
loaders.lookup.load(['artist', 'abc', { inc: ['releases'] }])
loaders.lookup.load(['artist', 'abc', { inc: ['recordings'] }])
Problem: URL-based cache keys mean same entity with different inc params = different cache entries.
Fix: Entity-based cache with incremental enrichment.
2.4 Extension System Limitations (extensions/index.js)
// Only 18 lines. No lifecycle hooks, no dependency management.
export async function loadExtension(extensionModule) {
  return typeof extensionModule === 'string'
    ? await import(extensionModule)
    : extensionModule;
}
Missing: Lifecycle hooks, resolver interception, middleware support, error boundaries.
3. Bedrock-API - Architectural Flaws
Critical Issues
3.1 Missing Proto Fields (bedrock_service.proto)
| Missing Field | Impact |
|---|---|
| album_id on Track | Can't link tracks to albums bidirectionally |
| release_date on Track | Temporal data lost |
| explicit flag | Content rating lost |
| isrc | International standard ID lost (critical for rights) |
| verified on Artist | Badge status lost |
| label on Album | Publisher info lost |
| upc/ean | Barcode identifiers lost |
3.2 SoundCloud artist_id Bug (soundcloud.go:457)
// BUG: Uses track ID instead of user ID
artist_id: fmt.Sprintf("soundcloud:%d", t.ID), // Should be t.User.ID
3.3 Listening Stats Don't Persist (main.go:984-1000)
func (s *BedrockServer) RecordPlay(ctx context.Context, req *pb.RecordPlayRequest) (*pb.RecordPlayResponse, error) {
    eventID := uuid.New().String()
    // TODO: persist event ← STUB!
    return &pb.RecordPlayResponse{EventId: eventID, Status: pb.ResponseStatus_STATUS_OK}, nil
}
Impact: GetPopularTracks and GetListeningHistory return empty - feature non-functional.
3.4 Resolver Bridging Has No Validation (resolver.go:152-159)
// Takes first search result without scoring
results, err := s.sc.SearchTracks(ctx, cleanedQuery, 1)
return results[0] // Wrong track if covers/remixes rank first
Missing: Duration comparison, artist name fuzzy matching, ISRC/UPC verification.
3.5 Spotify Panic Risk (spotify.go:76-78)
// No bounds check before indexing
ArtistIDs: wrapper.ArtistIDs[0], // PANIC if empty array
4. minim - Architectural Flaws
Critical Issues
4.1 Inconsistent Error Handling Per Provider
| Provider | Error Pattern |
|---|---|
| Spotify | Retries on 401, raises RuntimeError |
| TIDAL | Parses JSON error, falls back to status |
| Qobuz | Raises with error['code'] |
| iTunes | Tries errorMessage, uses JSONDecodeError fallback |
| Discogs | Parses nested detail field |
Impact: Consumers need provider-specific error handling.
4.2 Missing Retry Logic (3/5 providers)
Only Spotify and Qobuz implement retry. TIDAL, iTunes, Discogs fail immediately on transient errors.
4.3 No Rate Limit Handling
# Missing everywhere:
# - 429 Too Many Requests detection
# - Retry-After header parsing
# - Exponential backoff
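The three missing pieces could be combined roughly as follows. This is a sketch, not minim's code: `Response`, `send`, and the default delays are hypothetical, and only the delta-seconds form of `Retry-After` is handled (the header may also carry an HTTP date).

```python
import random
import time
from collections.abc import Callable
from dataclasses import dataclass, field


@dataclass
class Response:  # hypothetical minimal response shape
    status: int
    headers: dict[str, str] = field(default_factory=dict)


RETRYABLE = {429, 500, 502, 503, 504}


def retry_delay(response: Response, attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Honor a numeric Retry-After header, else capped exponential backoff with jitter."""
    retry_after = response.headers.get("Retry-After", "")
    if retry_after.isdigit():
        return min(float(retry_after), cap)
    return min(cap, base * 2 ** attempt) * random.uniform(0.5, 1.0)


def call_with_retry(
    send: Callable[[], Response],
    max_attempts: int = 4,
    sleep: Callable[[float], None] = time.sleep,
) -> Response:
    """Call `send` until it returns a non-retryable status or attempts run out."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        response = send()
        if response.status not in RETRYABLE:
            break
        if attempt < max_attempts - 1:
            sleep(retry_delay(response, attempt))
    return response
```

Placing this in one gateway, rather than in each provider client, is what gives every provider the same transient-error behavior.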
4.4 Response Structure Inconsistency
| Provider | Artist Field | Duration Field |
|---|---|---|
| Spotify | album.artists[0].name | duration_ms |
| TIDAL | data.attributes.name | duration (seconds) |
| iTunes | artistName | trackTimeMillis |
| Discogs | artists[0].name | N/A |
Impact: No common data model. Every consumer writes provider-specific parsing.
5. MusicMetaLinker - Architectural Flaws
Critical Issues
5.1 Naive Cascading Fallback (linking.py:159-182)
def get_artist(self) -> str | None:
    if self.artist: return self.artist
    artist = self.mb_link.get_artist()
    if artist is None:
        artist = self.dz_link.get_artist_name()
    if artist is None:
        artist = self.mb_link.get_artist()  # Called twice!
    if artist is None:
        artist = self.yt_link.get_youtube_artist()
    return artist  # First non-None wins, no quality check
Problems:
- No confidence scoring
- No conflict detection ("Beyoncé" vs "Beyonce" vs "Beyoncé Knowles")
- Redundant MusicBrainz calls
- Order bias (Deezer always wins over YouTube)
5.2 Silent Failures (deezer_links.py:102-107)
try:
    return [res for res in results][:limit]
except Exception:  # Catches EVERYTHING
    return None  # Network error? Invalid input? Who knows!
Impact: Can't distinguish "no match" from "API failed" from "invalid input".
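A result type that keeps those three cases apart might look like this. The `Outcome` enum, `SearchResult`, and the `fetch` callable are hypothetical; the point is that only expected transport errors are caught, and the caller always learns which case occurred.

```python
from collections.abc import Callable, Iterable
from dataclasses import dataclass
from enum import Enum
from typing import Any


class Outcome(Enum):
    FOUND = "found"
    NO_MATCH = "no_match"
    INVALID_INPUT = "invalid_input"
    PROVIDER_ERROR = "provider_error"


@dataclass(frozen=True)
class SearchResult:
    outcome: Outcome
    items: tuple = ()
    detail: str = ""


def search(query: str, fetch: Callable[[str], Iterable[Any]], limit: int = 5) -> SearchResult:
    """Run `fetch` and report what happened instead of returning a bare None."""
    if not query.strip() or limit < 1:
        return SearchResult(Outcome.INVALID_INPUT, detail="empty query or non-positive limit")
    try:
        results = list(fetch(query))
    except (ConnectionError, TimeoutError) as exc:
        # Only transport failures are caught; programming errors still propagate.
        return SearchResult(Outcome.PROVIDER_ERROR, detail=repr(exc))
    if not results:
        return SearchResult(Outcome.NO_MATCH)
    return SearchResult(Outcome.FOUND, items=tuple(results[:limit]))
```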
5.3 ISRC Handling Bug (musicbrainz_links.py:77-85)
for isrc in self.isrc:
    try:
        isrc_result = mb.get_recordings_by_isrc(isrc, ...)
        return isrc_result  # Returns on first success
    except mb.ResponseError:
        return None  # BUG: Should be `continue`, not `return`!
5.4 Album Name Truncation (deezer_links.py:63-78)
if self.album and " " in self.album:
    self.album = " ".join(self.album.split(" ")[:2])  # Only first 2 words!
"The Beatles (Remastered)" → "The Beatles" - loses critical specificity.
5.5 Naive Duration Comparison
Fixed 3-second threshold regardless of track length:
- 3s is huge for 30-second track (10% error)
- 3s is tiny for 10-minute track (0.5% error)
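A relative check could look like the sketch below. The proposed architecture states its threshold as "±3% or ±5s" without saying how the two combine; this sketch assumes the tighter of the two applies, which is only one possible reading. The function name and defaults are illustrative.

```python
def durations_match(a_ms: int, b_ms: int, rel: float = 0.03, cap_ms: int = 5000) -> bool:
    """Compare two durations with a tolerance that scales with track length.

    Assumption: tolerance is `rel` of the longer duration, capped at `cap_ms`.
    """
    longer = max(a_ms, b_ms)
    tolerance = min(rel * longer, cap_ms)
    return abs(a_ms - b_ms) <= tolerance
```

Under this reading a 30-second track tolerates about 0.9 s of drift, while a 10-minute track tolerates the full 5 s.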
Proposed Architecture
Design Principles
- Observations are immutable - No "last write wins"; always preserve raw data
- Field-level confidence - Trust title from MusicBrainz while using duration from Spotify
- Three-stage entity resolution - Blocking → Similarity → Decision
- Provenance by default - Every value is explainable
Architecture Diagram
┌─────────────────────────────────────────────────────────────────────────┐
│                             INGESTION LAYER                             │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐     │
│  │  Provider   │  │  Provider   │  │  Provider   │  │  Provider   │     │
│  │   Adapter   │  │   Adapter   │  │   Adapter   │  │   Adapter   │     │
│  └──────┬──────┘  └──────┬──────┘  └──────┬──────┘  └──────┬──────┘     │
│         └────────────────┴───────┬────────┴────────────────┘            │
│                    ┌─────────────▼──────────────┐                       │
│                    │  Unified Provider Gateway  │                       │
│                    │  • Per-provider rate limit │                       │
│                    │  • Retry + exp. backoff    │                       │
│                    │  • Circuit breaker         │                       │
│                    │  • Request batching        │                       │
│                    └─────────────┬──────────────┘                       │
└──────────────────────────────────┼──────────────────────────────────────┘
                                   │
                    ┌──────────────▼──────────────┐
                    │    RAW OBSERVATION STORE    │
                    │  (append-only, immutable)   │
                    └──────────────┬──────────────┘
                                   │
┌──────────────────────────────────┼──────────────────────────────────────┐
│                         ENTITY RESOLUTION LAYER                         │
│         ┌────────────────────────▼────────────────────────┐             │
│         │                 BLOCKING STAGE                  │             │
│         │  • ISRC/UPC exact match (99.7% pair reduction)  │             │
│         │  • Phonetic blocking (Metaphone) for names      │             │
│         └────────────────────────┬────────────────────────┘             │
│         ┌────────────────────────▼────────────────────────┐             │
│         │                SIMILARITY STAGE                 │             │
│         │  • Title: Levenshtein + token Jaccard           │             │
│         │  • Artist: embedding cosine similarity          │             │
│         │  • Duration: relative threshold (±3% or ±5s)    │             │
│         └────────────────────────┬────────────────────────┘             │
│         ┌────────────────────────▼────────────────────────┐             │
│         │                 DECISION STAGE                  │             │
│         │  • ≥0.95     → auto-merge                       │             │
│         │  • 0.70-0.95 → human review queue               │             │
│         │  • <0.70     → distinct entities                │             │
│         └────────────────────────┬────────────────────────┘             │
└──────────────────────────────────┼──────────────────────────────────────┘
                                   │
┌──────────────────────────────────┼──────────────────────────────────────┐
│                       CONFLICT RESOLUTION ENGINE                        │
│         ┌────────────────────────▼────────────────────────┐             │
│         │             FIELD-LEVEL MERGE RULES             │             │
│         │ confidence = source_trust × recency × consensus │             │
│         │                                                 │             │
│         │  • Identifiers: ISRC > provider ID              │             │
│         │  • Duration: median within ±3% or ±5s           │             │
│         │  • Title: MusicBrainz > label > streaming       │             │
│         │  • Release date: earliest credible              │             │
│         │  • Explicit: OR across sources                  │             │
│         └────────────────────────┬────────────────────────┘             │
│         ┌────────────────────────▼────────────────────────┐             │
│         │             CANONICAL ENTITY STORE              │             │
│         │  • Materialized "best known" values             │             │
│         │  • Per-field confidence scores                  │             │
│         │  • Links to all source observations             │             │
│         └─────────────────────────────────────────────────┘             │
└─────────────────────────────────────────────────────────────────────────┘
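The decision stage in the diagram reduces to a small routing function. In the sketch below the thresholds come from the diagram, while the per-field weights and both function names are hypothetical placeholders that would need calibration.

```python
# Hypothetical weights for combining per-field similarities into one score.
WEIGHTS = {"title": 0.4, "artist": 0.4, "duration": 0.2}


def combined_score(similarities: dict[str, float]) -> float:
    """Weighted mean of per-field similarities, each expected in [0, 1]."""
    total_weight = sum(WEIGHTS.values())
    weighted = sum(WEIGHTS[name] * similarities.get(name, 0.0) for name in WEIGHTS)
    return weighted / total_weight


def decide(score: float) -> str:
    """Route a candidate pair using the thresholds from the diagram."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score out of range: {score}")
    if score >= 0.95:
        return "auto_merge"
    if score >= 0.70:
        return "human_review"
    return "distinct"
```

A missing field counts as zero similarity here, which pushes incomplete pairs toward review or rejection rather than toward an automatic merge.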
Core Data Model
-- Immutable observations from providers
CREATE TABLE observations (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    provider TEXT NOT NULL,
    provider_id TEXT NOT NULL,
    entity_type TEXT NOT NULL,
    payload JSONB NOT NULL,
    fetched_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    checksum BYTEA NOT NULL,
    UNIQUE(provider, provider_id, checksum)
);

-- Canonical entities with confidence
CREATE TABLE tracks (
    id UUID PRIMARY KEY,
    -- Identifiers
    isrc TEXT,
    iswc TEXT,
    mbid UUID,
    -- Fields with confidence
    title TEXT NOT NULL,
    title_confidence REAL NOT NULL DEFAULT 0.0,
    duration_ms INT,
    duration_confidence REAL NOT NULL DEFAULT 0.0,
    explicit BOOLEAN,
    explicit_confidence REAL NOT NULL DEFAULT 0.0,
    -- Denormalized
    artist_credit TEXT NOT NULL,
    album_title TEXT,
    -- Metadata
    created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    updated_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    merge_version INT NOT NULL DEFAULT 1
);

-- Field-level provenance
CREATE TABLE field_sources (
    entity_type TEXT NOT NULL,
    entity_id UUID NOT NULL,
    field_name TEXT NOT NULL,
    observation_id UUID NOT NULL REFERENCES observations(id),
    confidence REAL NOT NULL,
    selected BOOLEAN NOT NULL DEFAULT false,
    PRIMARY KEY (entity_type, entity_id, field_name, observation_id)
);

-- Cross-reference table
CREATE TABLE provider_links (
    entity_type TEXT NOT NULL,
    entity_id UUID NOT NULL,
    provider TEXT NOT NULL,
    provider_id TEXT NOT NULL,
    verified BOOLEAN NOT NULL DEFAULT false,
    PRIMARY KEY (entity_type, provider, provider_id)
);

-- Entity resolution audit trail
CREATE TABLE merge_decisions (
    id UUID PRIMARY KEY,
    entity_type TEXT NOT NULL,
    source_ids UUID[] NOT NULL,
    target_id UUID NOT NULL,
    similarity_score REAL NOT NULL,
    decision TEXT NOT NULL, -- 'auto', 'human_approved', 'human_rejected'
    decided_by TEXT,
    decided_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
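The `UNIQUE(provider, provider_id, checksum)` constraint on `observations` only deduplicates re-fetched payloads if the checksum is stable across key order and whitespace. The schema does not say how the checksum is computed; the sketch below assumes SHA-256 over a canonical JSON encoding, and `payload_checksum` is a hypothetical helper.

```python
import hashlib
import json
from typing import Any


def payload_checksum(payload: dict[str, Any]) -> bytes:
    """SHA-256 digest of the payload serialized with sorted keys and no extra whitespace."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).digest()
```

The 32-byte digest fits the `BYTEA` column, so an unchanged payload hits the unique constraint while any changed field produces a new observation row.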
Source Trust Hierarchy
SOURCE_TRUST = {
    'musicbrainz': 0.95,  # Community-curated, high accuracy
    'discogs': 0.85,      # Community + physical media focus
    'tidal': 0.80,        # Label direct relationships
    'spotify': 0.75,      # Large scale, some noise
    'deezer': 0.70,       # Good coverage, less curation
    'youtube': 0.60,      # User-generated, low accuracy
}
Conflict Resolution Rules
| Field | Strategy | Implementation |
|---|---|---|
| Title | Highest trust + consensus | Score = trust + 0.1×(agreeing_sources - 1) |
| Duration | Median within tolerance | Filter to ±3% or ±5s, take median |
| Explicit | OR logic | If any source says explicit → explicit |
| Release Date | Earliest credible | Must be ≤ today and ≥ 1900 |
| ISRC | Highest-trust valid | Validate format, take the value from the highest-trust source |
| Artist | Embedding similarity | Cluster similar names, pick canonical |
Technical Choices
| Component | Choice | Rationale |
|---|---|---|
| Core Language | Python 3.11+ | Rapid iteration, rich ecosystem |
| Hot Path | Rust via PyO3 | Entity resolution blocking/embedding |
| Database | PostgreSQL 15+ | JSONB, trigram, pgvector |
| Cache | Redis | Entity-keyed, not URL-keyed |
| Embeddings | all-MiniLM-L6-v2 | 384-dim, fast, good quality |
| API | GraphQL + DataLoader | Explicit batching, no N+1 |
| Queue | PostgreSQL SKIP LOCKED | Human review, async processing |
| Observability | OpenTelemetry | Trace entity resolution decisions |
Estimated Effort
| Component | Effort | Notes |
|---|---|---|
| Data model + migrations | 1-4 hours | PostgreSQL schema |
| Provider gateway | 1-2 days | Unified error handling, rate limiting |
| Entity resolution pipeline | 1-2 days | Blocking, similarity, decision |
| Conflict resolution engine | 1-4 hours | Field-level rules |
| Provenance system | 1-4 hours | Audit tables, explain API |
| Human review UI | 1-2 days | Queue management |
| Total MVP | 1-2 weeks | |
Key Takeaways
- Hybrid approaches win: Audio + metadata outperforms either alone (Spotify research: 2-6% improvement)
- Provenance is non-negotiable: Every field needs source tracking, confidence scores, snapshot URLs
- Identifier hierarchy matters: ISWC (work) → ISRC (recording) → UPC (release) with MBIDs as glue
- Fuzzy matching requires stages: Blocking (99.7% reduction) → Similarity → Threshold → Human review
- Conflict resolution needs policy: Field-level precedence rules, not "last write wins"
- Cache entities, not requests: Avoid GraphBrainz's URL-fragmentation trap
- Unified error handling: Result types that force error handling, not silent exceptions