metadata-agregator/docs/research/AGGREGATORS_ANALYSIS.md
Alexander a1f6701bac feat: initial implementation of metadata aggregator
- gRPC service with MusicBrainz provider
- PostgreSQL schema with migrations
- Service layer with database-first caching
- Repository pattern for data access
- YAML configuration support
- Research documentation for 17 music metadata projects
2026-04-28 16:28:53 +02:00


Aggregators Architecture Analysis & Proposed Solution

Deep analysis of 5 music metadata aggregators, identifying common flaws and proposing a ground-up redesign.


Executive Summary

All 5 aggregators share common architectural mistakes that lead to data quality issues, performance problems, and poor extensibility:

| Pattern | Projects Affected | Impact |
|---|---|---|
| No confidence scoring | 5/5 | Can't distinguish good data from bad |
| First/last-write-wins merging | 4/5 | Data loss, no conflict resolution |
| Silent failure cascades | 4/5 | Debugging nightmare, data corruption |
| Naive entity resolution | 4/5 | Duplicates, mismatches |
| Provider-specific error handling | 3/5 | Inconsistent reliability |
| URL-based cache keys | 2/5 | Same entity cached multiple times |
| Disabled batching | 2/5 | Catastrophic performance |

1. Harmony - Architectural Flaws

Critical Issues

1.1 Naive Deduplication (deduplicate.ts:4-25)

// FLAW: Exact string match only
if (mbid) {
  if (!mbids.has(mbid)) { result.push(entity); mbids.add(mbid); }
} else if (name) {
  if (!names.has(name)) { result.push(entity); names.add(name); }
}

Problem: "The Beatles" ≠ "Beatles" ≠ "BEATLES" under exact string matching, so all three are treated as different entities.

Fix: Implement phonetic blocking (Metaphone) + Levenshtein similarity threshold.
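
A minimal sketch of the suggested fix, written in Python rather than Harmony's TypeScript. It assumes entities are plain dicts with optional `mbid` and `name` keys, which is not Harmony's actual type. To stay dependency-free it swaps Metaphone for a simple normalized key (lowercased, accents and leading article stripped) and omits the blocking step, comparing each name against every kept name. The `0.85` threshold is an illustrative assumption.

```python
import re
import unicodedata


def normalize(name: str) -> str:
    """Lowercase, strip accents and punctuation, drop a leading article."""
    text = unicodedata.normalize("NFKD", name)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9 ]", "", text.lower())
    text = re.sub(r"^(the|a|an) ", "", text.strip())
    return re.sub(r"\s+", " ", text).strip()


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Edit distance scaled to 0..1, where 1.0 means identical."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def deduplicate(entities: list[dict], threshold: float = 0.85) -> list[dict]:
    """Keep the first entity per MBID, or per fuzzy-equal normalized name."""
    result: list[dict] = []
    seen_mbids: set[str] = set()
    kept_names: list[str] = []
    for entity in entities:
        mbid, name = entity.get("mbid"), entity.get("name")
        if mbid:
            if mbid in seen_mbids:
                continue
            seen_mbids.add(mbid)
        elif name:
            key = normalize(name)
            if any(similarity(key, kept) >= threshold for kept in kept_names):
                continue
            kept_names.append(key)
        # Entities with neither field are kept as-is in this sketch.
        result.append(entity)
    return result
```

With this, "The Beatles", "Beatles" and "BEATLES" collapse into one entry. The pairwise comparison is quadratic, which is why a real implementation would add the blocking step first.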

1.2 Limited Compatibility Checks (compatibility.ts:60-67)

const releaseCompatibilityChecks: CompatibilityCheck<HarmonyRelease>[] = [{
  property: (release) => release.gtin ? Number(release.gtin) : undefined,
  errorMessage: 'Providers have returned multiple different GTIN',
}, {
  property: trackCountSummary,
  errorMessage: 'Providers have returned incompatible track lists',
}];

Problem: Only checks GTIN and track count. No artist validation, title similarity, or duration checks.

Fix: Add artist credit comparison, title Levenshtein distance, duration tolerance (±3%).

1.3 First-Wins Merge with No Confidence (merge.ts:105-124)

missingReleaseProperties.forEach((property) => {
  const value = cloneInto(mergedRelease, sourceRelease, property);
  if (isFilled(value)) {
    mergedRelease.info.sourceMap[property] = providerName;
    missingReleaseProperties.delete(property);  // First wins, done
  }
});

Problem: First provider to fill a field wins. No quality assessment.

Fix: Score each value by source trust × recency × consensus, pick highest.

1.4 No Data Quality Metrics

Missing: Confidence scores, match quality, conflict counts, field completeness.


2. GraphBrainz - Architectural Flaws

Critical Issues

2.1 Batching Completely Disabled (loaders.js:38-42)

const lookup = new DataLoader(
  (keys) => { /* ... */ },
  { batch: false }  // ← DEFEATS ENTIRE PURPOSE OF DATALOADER
);

Impact: A query for 20 entities issues 20 separate HTTP requests. At a rate limit of 5 requests per 5.5 s, that takes at least 22 seconds (20 / 5 × 5.5 s).

Fix: Implement request coalescing even without batch API. Deduplicate concurrent identical requests.
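
The coalescing idea can be sketched as follows, in Python with asyncio rather than GraphBrainz's JavaScript. `RequestCoalescer` and the `fetch` callable are hypothetical names for illustration. Concurrent callers asking for the same key share one in-flight request.

```python
import asyncio
from collections.abc import Awaitable, Callable, Hashable
from typing import Any


class RequestCoalescer:
    """Share one in-flight request among concurrent callers of the same key."""

    def __init__(self, fetch: Callable[[Hashable], Awaitable[Any]]) -> None:
        self._fetch = fetch  # hypothetical coroutine doing the real HTTP call
        self._in_flight: dict[Hashable, asyncio.Task] = {}

    async def load(self, key: Hashable) -> Any:
        task = self._in_flight.get(key)
        if task is None:
            task = asyncio.create_task(self._fetch(key))
            self._in_flight[key] = task
            # Forget the task once it finishes so later calls fetch fresh data.
            task.add_done_callback(lambda _t: self._in_flight.pop(key, None))
        # shield() keeps one cancelled caller from cancelling the shared request.
        return await asyncio.shield(task)
```

This only removes duplicate concurrent requests. It does not reduce the number of distinct entities fetched, so the rate limit still bounds throughput.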

2.2 N+1 Queries by Design (relationship.js:127-138)

relationships: {
  resolve: (entity, args, { loaders }, info) => {
    // If relations not included in initial fetch...
    promise = loaders.lookup.load([entityType, id, params]);  // N+1 QUERY
    return promise.then((entity) => entity.relations);
  },
}

Also in: recording.js:51-61 (ISRCs), helpers.js:56-64 (fieldWithID pattern)

Impact: Query 100 artists with relationships = 1 + 100 requests.

Fix: Query planning phase - analyze full GraphQL query before any resolvers, compute optimal inc parameters.

2.3 Cache Fragmentation (loaders.js:11-20)

// Same artist cached 3 times with different completeness:
loaders.lookup.load(['artist', 'abc', {}])
loaders.lookup.load(['artist', 'abc', { inc: ['releases'] }])
loaders.lookup.load(['artist', 'abc', { inc: ['recordings'] }])

Problem: URL-based cache keys mean same entity with different inc params = different cache entries.

Fix: Entity-based cache with incremental enrichment.
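
A minimal sketch of what entity-based caching with incremental enrichment could look like. `EntityCache`, `CachedEntity` and the `fetch(entity_type, entity_id, inc)` signature are assumptions made for illustration, not GraphBrainz APIs. The cache is keyed by entity, remembers which `inc` sections were already loaded, and fetches only the missing ones.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Iterable


@dataclass
class CachedEntity:
    data: dict[str, Any] = field(default_factory=dict)
    loaded_incs: set[str] = field(default_factory=set)
    base_loaded: bool = False


class EntityCache:
    """One cache entry per (type, id); inc sections are merged in over time."""

    def __init__(self, fetch: Callable[[str, str, frozenset[str]], dict[str, Any]]) -> None:
        self._fetch = fetch  # hypothetical provider call
        self._entries: dict[tuple[str, str], CachedEntity] = {}

    def get(self, entity_type: str, entity_id: str, inc: Iterable[str] = ()) -> dict[str, Any]:
        entry = self._entries.setdefault((entity_type, entity_id), CachedEntity())
        missing = set(inc) - entry.loaded_incs
        if missing or not entry.base_loaded:
            # Ask the provider only for sections that are not cached yet.
            entry.data.update(self._fetch(entity_type, entity_id, frozenset(missing)))
            entry.loaded_incs |= missing
            entry.base_loaded = True
        return entry.data
```

Under this scheme the three lookups shown above end up in a single cache entry, and a later request for both `releases` and `recordings` is served without another fetch.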

2.4 Extension System Limitations (extensions/index.js)

// Only 18 lines. No lifecycle hooks, no dependency management.
export async function loadExtension(extensionModule) {
  return typeof extensionModule === 'string' 
    ? await import(extensionModule) 
    : extensionModule;
}

Missing: Lifecycle hooks, resolver interception, middleware support, error boundaries.


3. Bedrock-API - Architectural Flaws

Critical Issues

3.1 Missing Proto Fields (bedrock_service.proto)

| Missing Field | Impact |
|---|---|
| album_id on Track | Can't link tracks to albums bidirectionally |
| release_date on Track | Temporal data lost |
| explicit flag | Content rating lost |
| isrc | International standard ID lost (critical for rights) |
| verified on Artist | Badge status lost |
| label on Album | Publisher info lost |
| upc/ean | Barcode identifiers lost |

3.2 SoundCloud artist_id Bug (soundcloud.go:457)

// BUG: Uses track ID instead of user ID
artist_id: fmt.Sprintf("soundcloud:%d", t.ID),  // Should be t.User.ID

3.3 Listening Stats Don't Persist (main.go:984-1000)

func (s *BedrockServer) RecordPlay(ctx context.Context, req *pb.RecordPlayRequest) (*pb.RecordPlayResponse, error) {
    eventID := uuid.New().String()
    // TODO: persist event  ← STUB!
    return &pb.RecordPlayResponse{EventId: eventID, Status: pb.ResponseStatus_STATUS_OK}, nil
}

Impact: GetPopularTracks and GetListeningHistory return empty - feature non-functional.

3.4 Resolver Bridging Has No Validation (resolver.go:152-159)

// Takes first search result without scoring
results, err := s.sc.SearchTracks(ctx, cleanedQuery, 1)
return results[0]  // Wrong track if covers/remixes rank first

Missing: Duration comparison, artist name fuzzy matching, ISRC/UPC verification.

3.5 Spotify Panic Risk (spotify.go:76-78)

// No bounds check before indexing
ArtistIDs: wrapper.ArtistIDs[0],  // PANIC if empty array

4. minim - Architectural Flaws

Critical Issues

4.1 Inconsistent Error Handling Per Provider

| Provider | Error Pattern |
|---|---|
| Spotify | Retries on 401, raises RuntimeError |
| TIDAL | Parses JSON error, falls back to status |
| Qobuz | Raises with error['code'] |
| iTunes | Tries errorMessage, uses JSONDecodeError fallback |
| Discogs | Parses nested detail field |

Impact: Consumers need provider-specific error handling.

4.2 Missing Retry Logic (3/5 providers)

Only Spotify and Qobuz implement retry. TIDAL, iTunes, Discogs fail immediately on transient errors.

4.3 No Rate Limit Handling

# Missing everywhere:
# - 429 Too Many Requests detection
# - Retry-After header parsing
# - Exponential backoff

4.4 Response Structure Inconsistency

| Provider | Artist Field | Duration Field |
|---|---|---|
| Spotify | album.artists[0].name | duration_ms |
| TIDAL | data.attributes.name | duration (seconds) |
| iTunes | artistName | trackTimeMillis |
| Discogs | artists[0].name | N/A |

Impact: No common data model. Every consumer writes provider-specific parsing.


5. MusicMetaLinker - Architectural Flaws

Critical Issues

5.1 Naive Cascading Fallback (linking.py:159-182)

def get_artist(self) -> str | None:
    if self.artist: return self.artist
    artist = self.mb_link.get_artist()
    if artist is None:
        artist = self.dz_link.get_artist_name()
        if artist is None:
            artist = self.mb_link.get_artist()  # Called twice!
            if artist is None:
                artist = self.yt_link.get_youtube_artist()
    return artist  # First non-None wins, no quality check

Problems:

  • No confidence scoring
  • No conflict detection ("Beyoncé" vs "Beyonce" vs "Beyoncé Knowles")
  • Redundant MusicBrainz calls
  • Order bias (Deezer always wins over YouTube)

5.2 Silent Failures (deezer_links.py:102-107)

try:
    return [res for res in results][:limit]
except Exception:  # Catches EVERYTHING
    return None  # Network error? Invalid input? Who knows!

Impact: Can't distinguish "no match" from "API failed" from "invalid input".
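
A sketch of how explicit result types could separate those three cases. The type names and the `fetch` callable are hypothetical, and treating `ConnectionError` and `TimeoutError` as the provider failures is an assumption. Anything else is deliberately left to propagate instead of being swallowed.

```python
from dataclasses import dataclass
from typing import Callable, Generic, Iterable, TypeVar, Union

T = TypeVar("T")


@dataclass(frozen=True)
class Found(Generic[T]):
    value: T


@dataclass(frozen=True)
class NoMatch:
    pass


@dataclass(frozen=True)
class ProviderError:
    reason: str
    retryable: bool


@dataclass(frozen=True)
class InvalidInput:
    reason: str


SearchResult = Union[Found[list[str]], NoMatch, ProviderError, InvalidInput]


def search_tracks(query: str, limit: int,
                  fetch: Callable[[str], Iterable[str]]) -> SearchResult:
    """Run a search and report which of the four outcomes happened."""
    if not query.strip() or limit < 1:
        return InvalidInput("query must be non-empty and limit must be >= 1")
    try:
        results = list(fetch(query))
    except (ConnectionError, TimeoutError) as exc:
        # Only network-level failures are caught; bugs still raise.
        return ProviderError(str(exc), retryable=True)
    if not results:
        return NoMatch()
    return Found(results[:limit])
```

Callers can then branch with `isinstance` or `match` and retry only on `ProviderError`.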

5.3 ISRC Handling Bug (musicbrainz_links.py:77-85)

for isrc in self.isrc:
    try:
        isrc_result = mb.get_recordings_by_isrc(isrc, ...)
        return isrc_result  # Returns on first success
    except mb.ResponseError:
        return None  # BUG: Should be `continue`, not `return`!

5.4 Album Name Truncation (deezer_links.py:63-78)

if self.album and " " in self.album:
    self.album = " ".join(self.album.split(" ")[:2])  # Only first 2 words!

"The Beatles (Remastered)" → "The Beatles" - loses critical specificity.

5.5 Naive Duration Comparison

Fixed 3-second threshold regardless of track length:

  • 3s is huge for 30-second track (10% error)
  • 3s is tiny for 10-minute track (0.5% error)

Proposed Architecture

Design Principles

  1. Observations are immutable - No "last write wins"; always preserve raw data
  2. Field-level confidence - Trust title from MusicBrainz while using duration from Spotify
  3. Three-stage entity resolution - Blocking → Similarity → Decision
  4. Provenance by default - Every value is explainable

Architecture Diagram

┌─────────────────────────────────────────────────────────────────────────┐
│                           INGESTION LAYER                               │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐    │
│  │  Provider   │  │  Provider   │  │  Provider   │  │  Provider   │    │
│  │  Adapter    │  │  Adapter    │  │  Adapter    │  │  Adapter    │    │
│  └──────┬──────┘  └──────┬──────┘  └──────┬──────┘  └──────┬──────┘    │
│         └────────────────┴───────┬────────┴────────────────┘           │
│                    ┌─────────────▼──────────────┐                      │
│                    │  Unified Provider Gateway  │                      │
│                    │  • Per-provider rate limit │                      │
│                    │  • Retry + exp. backoff    │                      │
│                    │  • Circuit breaker         │                      │
│                    │  • Request batching        │                      │
│                    └─────────────┬──────────────┘                      │
└──────────────────────────────────┼──────────────────────────────────────┘
                                   │
                    ┌──────────────▼──────────────┐
                    │   RAW OBSERVATION STORE     │
                    │   (append-only, immutable)  │
                    └──────────────┬──────────────┘
                                   │
┌──────────────────────────────────┼──────────────────────────────────────┐
│                    ENTITY RESOLUTION LAYER                              │
│         ┌────────────────────────▼────────────────────────┐            │
│         │              BLOCKING STAGE                      │            │
│         │  • ISRC/UPC exact match (99.7% pair reduction)  │            │
│         │  • Phonetic blocking (Metaphone) for names      │            │
│         └────────────────────────┬────────────────────────┘            │
│         ┌────────────────────────▼────────────────────────┐            │
│         │            SIMILARITY STAGE                      │            │
│         │  • Title: Levenshtein + token Jaccard           │            │
│         │  • Artist: embedding cosine similarity          │            │
│         │  • Duration: relative threshold (±3% or ±5s)    │            │
│         └────────────────────────┬────────────────────────┘            │
│         ┌────────────────────────▼────────────────────────┐            │
│         │            DECISION STAGE                        │            │
│         │  • ≥0.95 → auto-merge                           │            │
│         │  • 0.70-0.95 → human review queue               │            │
│         │  • <0.70 → distinct entities                    │            │
│         └────────────────────────┬────────────────────────┘            │
└──────────────────────────────────┼──────────────────────────────────────┘
                                   │
┌──────────────────────────────────┼──────────────────────────────────────┐
│                    CONFLICT RESOLUTION ENGINE                           │
│         ┌────────────────────────▼────────────────────────┐            │
│         │         FIELD-LEVEL MERGE RULES                  │            │
│         │  confidence = source_trust × recency × consensus │            │
│         │                                                  │            │
│         │  • Identifiers: ISRC > provider ID              │            │
│         │  • Duration: median within ±3% or ±5s           │            │
│         │  • Title: MusicBrainz > label > streaming       │            │
│         │  • Release date: earliest credible              │            │
│         │  • Explicit: OR across sources                  │            │
│         └────────────────────────┬────────────────────────┘            │
│         ┌────────────────────────▼────────────────────────┐            │
│         │         CANONICAL ENTITY STORE                   │            │
│         │  • Materialized "best known" values             │            │
│         │  • Per-field confidence scores                  │            │
│         │  • Links to all source observations             │            │
│         └─────────────────────────────────────────────────┘            │
└─────────────────────────────────────────────────────────────────────────┘
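
The decision stage in the diagram reduces to a small threshold function. The sketch below uses the thresholds shown above. The `Decision` enum and `decide` are hypothetical names, and how the per-field similarities combine into one score is left out because the diagram does not specify it.

```python
from enum import Enum


class Decision(Enum):
    AUTO_MERGE = "auto"
    HUMAN_REVIEW = "review"
    DISTINCT = "distinct"


def decide(score: float, merge_at: float = 0.95, review_at: float = 0.70) -> Decision:
    """Map a combined similarity score in [0, 1] to a resolution decision."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be within [0, 1], got {score}")
    if score >= merge_at:
        return Decision.AUTO_MERGE
    if score >= review_at:
        return Decision.HUMAN_REVIEW
    return Decision.DISTINCT
```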

Core Data Model

-- Immutable observations from providers
CREATE TABLE observations (
    id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    provider        TEXT NOT NULL,
    provider_id     TEXT NOT NULL,
    entity_type     TEXT NOT NULL,
    payload         JSONB NOT NULL,
    fetched_at      TIMESTAMPTZ NOT NULL DEFAULT now(),
    checksum        BYTEA NOT NULL,
    UNIQUE(provider, provider_id, checksum)
);

-- Canonical entities with confidence
CREATE TABLE tracks (
    id                    UUID PRIMARY KEY,
    
    -- Identifiers
    isrc                  TEXT,
    iswc                  TEXT,
    mbid                  UUID,
    
    -- Fields with confidence
    title                 TEXT NOT NULL,
    title_confidence      REAL NOT NULL DEFAULT 0.0,
    
    duration_ms           INT,
    duration_confidence   REAL NOT NULL DEFAULT 0.0,
    
    explicit              BOOLEAN,
    explicit_confidence   REAL NOT NULL DEFAULT 0.0,
    
    -- Denormalized
    artist_credit         TEXT NOT NULL,
    album_title           TEXT,
    
    -- Metadata
    created_at            TIMESTAMPTZ NOT NULL DEFAULT now(),
    updated_at            TIMESTAMPTZ NOT NULL DEFAULT now(),
    merge_version         INT NOT NULL DEFAULT 1
);

-- Field-level provenance
CREATE TABLE field_sources (
    entity_type     TEXT NOT NULL,
    entity_id       UUID NOT NULL,
    field_name      TEXT NOT NULL,
    observation_id  UUID NOT NULL REFERENCES observations(id),
    confidence      REAL NOT NULL,
    selected        BOOLEAN NOT NULL DEFAULT false,
    PRIMARY KEY (entity_type, entity_id, field_name, observation_id)
);

-- Cross-reference table
CREATE TABLE provider_links (
    entity_type     TEXT NOT NULL,
    entity_id       UUID NOT NULL,
    provider        TEXT NOT NULL,
    provider_id     TEXT NOT NULL,
    verified        BOOLEAN NOT NULL DEFAULT false,
    PRIMARY KEY (entity_type, provider, provider_id)
);

-- Entity resolution audit trail
CREATE TABLE merge_decisions (
    id               UUID PRIMARY KEY,
    entity_type      TEXT NOT NULL,
    source_ids       UUID[] NOT NULL,
    target_id        UUID NOT NULL,
    similarity_score REAL NOT NULL,
    decision         TEXT NOT NULL,  -- 'auto', 'human_approved', 'human_rejected'
    decided_by       TEXT,
    decided_at       TIMESTAMPTZ NOT NULL DEFAULT now()
);

Source Trust Hierarchy

SOURCE_TRUST = {
    'musicbrainz': 0.95,  # Community-curated, high accuracy
    'discogs':     0.85,  # Community + physical media focus
    'tidal':       0.80,  # Label direct relationships
    'spotify':     0.75,  # Large scale, some noise
    'deezer':      0.70,  # Good coverage, less curation
    'youtube':     0.60,  # User-generated, low accuracy
}

Conflict Resolution Rules

| Field | Strategy | Implementation |
|---|---|---|
| Title | Highest trust + consensus | Score = trust + 0.1×(agreeing_sources - 1) |
| Duration | Median within tolerance | Filter to ±3% or ±5s, take median |
| Explicit | OR logic | If any source says explicit → explicit |
| Release Date | Earliest credible | Must be ≤ today and ≥ 1900 |
| ISRC | First valid | Validate format, take highest-trust source |
| Artist | Embedding similarity | Cluster similar names, pick canonical |

Technical Choices

| Component | Choice | Rationale |
|---|---|---|
| Core Language | Python 3.11+ | Rapid iteration, rich ecosystem |
| Hot Path | Rust via PyO3 | Entity resolution blocking/embedding |
| Database | PostgreSQL 15+ | JSONB, trigram, pgvector |
| Cache | Redis | Entity-keyed, not URL-keyed |
| Embeddings | all-MiniLM-L6-v2 | 384-dim, fast, good quality |
| API | GraphQL + DataLoader | Explicit batching, no N+1 |
| Queue | PostgreSQL SKIP LOCKED | Human review, async processing |
| Observability | OpenTelemetry | Trace entity resolution decisions |

Estimated Effort

| Component | Effort | Notes |
|---|---|---|
| Data model + migrations | 1-4 hours | PostgreSQL schema |
| Provider gateway | 1-2 days | Unified error handling, rate limiting |
| Entity resolution pipeline | 1-2 days | Blocking, similarity, decision |
| Conflict resolution engine | 1-4 hours | Field-level rules |
| Provenance system | 1-4 hours | Audit tables, explain API |
| Human review UI | 1-2 days | Queue management |
| Total MVP | 1-2 weeks | |

Key Takeaways

  1. Hybrid approaches win: Audio + metadata outperforms either alone (Spotify research: 2-6% improvement)

  2. Provenance is non-negotiable: Every field needs source tracking, confidence scores, snapshot URLs

  3. Identifier hierarchy matters: ISWC (work) → ISRC (recording) → UPC (release) with MBIDs as glue

  4. Fuzzy matching requires stages: Blocking (99.7% reduction) → Similarity → Threshold → Human review

  5. Conflict resolution needs policy: Field-level precedence rules, not "last write wins"

  6. Cache entities, not requests: Avoid GraphBrainz's URL-fragmentation trap

  7. Unified error handling: Result types that force error handling, not silent exceptions