Alexander a1f6701bac feat: initial implementation of metadata aggregator

- gRPC service with MusicBrainz provider
- PostgreSQL schema with migrations
- Service layer with database-first caching
- Repository pattern for data access
- YAML configuration support
- Research documentation for 17 music metadata projects

2026-04-28 16:28:53 +02:00

Aggregators Architecture Analysis & Proposed Solution

Deep analysis of 5 music metadata aggregators, identifying common flaws and proposing a ground-up redesign.

Executive Summary

The 5 aggregators repeat the same architectural mistakes, each affecting between two and all five projects. These lead to data quality issues, performance problems, and poor extensibility:

| Pattern | Projects Affected | Impact |
|---|---|---|
| No confidence scoring | 5/5 | Can't distinguish good data from bad |
| First/last-write-wins merging | 4/5 | Data loss, no conflict resolution |
| Silent failure cascades | 4/5 | Debugging nightmare, data corruption |
| Naive entity resolution | 4/5 | Duplicates, mismatches |
| Provider-specific error handling | 3/5 | Inconsistent reliability |
| URL-based cache keys | 2/5 | Same entity cached multiple times |
| Disabled batching | 2/5 | Catastrophic performance |

1. Harmony - Architectural Flaws

Critical Issues

1.1 Naive Deduplication (`deduplicate.ts:4-25`)

// FLAW: Exact string match only
if (mbid) {
  if (!mbids.has(mbid)) { result.push(entity); mbids.add(mbid); }
} else if (name) {
  if (!names.has(name)) { result.push(entity); names.add(name); }
}

Problem: "The Beatles" ≠ "Beatles" ≠ "BEATLES" - all treated as different entities because names are compared byte-for-byte, with no case folding or article stripping.

Fix: Implement phonetic blocking (Metaphone) + Levenshtein similarity threshold.
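A minimal sketch of the similarity half of this fix, assuming names are normalized before comparison (case folding, accent stripping, dropping a leading article). The function names and the 0.85 threshold are illustrative assumptions, not taken from Harmony. Phonetic blocking (Metaphone) is left out here because it needs a third-party library; this sketch compares every pair, which is only reasonable for small lists.

```python
import re
import unicodedata


def normalize_name(name: str) -> str:
    """Case-fold, strip accents and punctuation, and drop a leading article.

    Assumption: ASCII-centric cleanup; non-Latin scripts need a different rule.
    """
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    cleaned = re.sub(r"[^a-z0-9 ]+", " ", no_accents.casefold())
    tokens = cleaned.split()
    if len(tokens) > 1 and tokens[0] in {"the", "a", "an"}:
        tokens = tokens[1:]
    return " ".join(tokens)


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, keeping one row at a time."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Normalized similarity in [0, 1]; 1.0 means identical after normalization."""
    na, nb = normalize_name(a), normalize_name(b)
    longest = max(len(na), len(nb))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(na, nb) / longest


def deduplicate_names(names: list[str], threshold: float = 0.85) -> list[str]:
    """Keep the first spelling of each name; drop later near-duplicates."""
    kept: list[str] = []
    for name in names:
        if not any(similarity(name, seen) >= threshold for seen in kept):
            kept.append(name)
    return kept
```

With these assumptions, `deduplicate_names(["The Beatles", "Beatles", "BEATLES", "Radiohead"])` keeps only the first Beatles spelling plus "Radiohead". In a real pipeline a blocking key would limit which pairs are compared at all.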

1.2 Limited Compatibility Checks (`compatibility.ts:60-67`)

const releaseCompatibilityChecks: CompatibilityCheck<HarmonyRelease>[] = [{
  property: (release) => release.gtin ? Number(release.gtin) : undefined,
  errorMessage: 'Providers have returned multiple different GTIN',
}, {
  property: trackCountSummary,
  errorMessage: 'Providers have returned incompatible track lists',
}];

Problem: Only checks GTIN and track count. No artist validation, title similarity, or duration checks.

Fix: Add artist credit comparison, title Levenshtein distance, duration tolerance (±3%).

1.3 First-Wins Merge with No Confidence (`merge.ts:105-124`)

missingReleaseProperties.forEach((property) => {
  const value = cloneInto(mergedRelease, sourceRelease, property);
  if (isFilled(value)) {
    mergedRelease.info.sourceMap[property] = providerName;
    missingReleaseProperties.delete(property);  // First wins, done
  }
});

Problem: First provider to fill a field wins. No quality assessment.

Fix: Score each value by source trust × recency × consensus, pick highest.
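One way the scoring rule could look, as a sketch under stated assumptions: trust weights are a subset of the Source Trust Hierarchy later in this document, recency is modelled as exponential decay with a one-year half-life (an assumption, the document does not define it), and consensus is the fraction of candidates that agree on the exact value.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Candidate:
    value: str
    provider: str
    fetched_at: datetime


# Subset of the document's source trust hierarchy; unknown providers get 0.5 (assumption).
TRUST = {"musicbrainz": 0.95, "spotify": 0.75, "deezer": 0.70}


def recency(fetched_at: datetime, now: datetime, half_life_days: float = 365.0) -> float:
    """Exponential decay: 1.0 for fresh data, 0.5 after one half-life."""
    age_days = max((now - fetched_at).total_seconds() / 86400.0, 0.0)
    return 0.5 ** (age_days / half_life_days)


def consensus(candidate: Candidate, candidates: list[Candidate]) -> float:
    """Fraction of candidates that report exactly the same value."""
    agreeing = sum(1 for c in candidates if c.value == candidate.value)
    return agreeing / len(candidates)


def pick_best(candidates: list[Candidate], now: datetime) -> Candidate:
    """Return the candidate maximizing trust x recency x consensus (non-empty input)."""
    def score(c: Candidate) -> float:
        return TRUST.get(c.provider, 0.5) * recency(c.fetched_at, now) * consensus(c, candidates)
    return max(candidates, key=score)
```

Unlike first-wins merging, the result does not depend on provider order: two lower-trust sources that agree can outvote a single higher-trust source that disagrees.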

1.4 No Data Quality Metrics

Missing: Confidence scores, match quality, conflict counts, field completeness.

2. GraphBrainz - Architectural Flaws

Critical Issues

2.1 BATCHING COMPLETELY DISABLED (`loaders.js:38-42`)

const lookup = new DataLoader(
  (keys) => { /* ... */ },
  { batch: false }  // ← DEFEATS ENTIRE PURPOSE OF DATALOADER
);

Impact: A query for 20 entities issues 20 sequential HTTP requests. At a rate limit of 5 requests per 5.5 s, that is 4 windows × 5.5 s = 22 seconds minimum.

Fix: Implement request coalescing even without batch API. Deduplicate concurrent identical requests.
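Request coalescing can be sketched with `asyncio`: concurrent callers asking for the same key share one in-flight task instead of each issuing a request. The `Coalescer` class and `fake_fetch` are hypothetical names for illustration; this is not GraphBrainz code.

```python
import asyncio
from typing import Awaitable, Callable, Hashable


class Coalescer:
    """Share one in-flight fetch among concurrent callers asking for the same key."""

    def __init__(self, fetch: Callable[[Hashable], Awaitable[dict]]) -> None:
        self._fetch = fetch
        self._in_flight: dict = {}

    async def load(self, key: Hashable) -> dict:
        task = self._in_flight.get(key)
        if task is None:
            task = asyncio.ensure_future(self._fetch(key))
            self._in_flight[key] = task
            # Forget the task once it finishes so later calls fetch fresh data.
            task.add_done_callback(lambda _t, k=key: self._in_flight.pop(k, None))
        # shield() keeps one cancelled caller from cancelling the shared fetch.
        return await asyncio.shield(task)


async def demo() -> tuple:
    calls = 0

    async def fake_fetch(key: Hashable) -> dict:
        nonlocal calls
        calls += 1
        await asyncio.sleep(0.01)  # stands in for an HTTP round trip
        return {"key": key}

    loader = Coalescer(fake_fetch)
    results = await asyncio.gather(*(loader.load(("artist", "abc")) for _ in range(20)))
    return results, calls


if __name__ == "__main__":
    results, calls = asyncio.run(demo())
    print(len(results), "callers served by", calls, "request(s)")
```

This only removes duplicate concurrent requests. Distinct keys still cost one request each, so it complements a rate limiter rather than replacing one.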

2.2 N+1 Queries by Design (`relationship.js:127-138`)

relationships: {
  resolve: (entity, args, { loaders }, info) => {
    // If relations not included in initial fetch...
    promise = loaders.lookup.load([entityType, id, params]);  // N+1 QUERY
    return promise.then((entity) => entity.relations);
  },
}

Also in: recording.js:51-61 (ISRCs), helpers.js:56-64 (fieldWithID pattern)

Impact: Query 100 artists with relationships = 1 + 100 requests.

Fix: Query planning phase - analyze full GraphQL query before any resolvers, compute optimal inc parameters.

2.3 Cache Fragmentation (`loaders.js:11-20`)

// Same artist cached 3 times with different completeness:
loaders.lookup.load(['artist', 'abc', {}])
loaders.lookup.load(['artist', 'abc', { inc: ['releases'] }])
loaders.lookup.load(['artist', 'abc', { inc: ['recordings'] }])

Problem: URL-based cache keys mean same entity with different inc params = different cache entries.

Fix: Entity-based cache with incremental enrichment.
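A sketch of an entity-keyed cache with incremental enrichment, assuming a hypothetical `fetch(entity_type, entity_id, includes)` callable that returns a dict containing the requested sub-resources. The cache key is the entity, and the `inc` parameters only decide what still has to be fetched.

```python
from dataclasses import dataclass, field


@dataclass
class CachedEntity:
    data: dict = field(default_factory=dict)
    loaded_includes: set = field(default_factory=set)


class EntityCache:
    """One cache entry per (entity_type, entity_id), enriched as more is requested."""

    def __init__(self, fetch) -> None:
        self._fetch = fetch  # hypothetical: fetch(entity_type, entity_id, includes) -> dict
        self._entries: dict = {}

    def get(self, entity_type: str, entity_id: str, includes=()) -> dict:
        key = (entity_type, entity_id)
        entry = self._entries.get(key)
        wanted = set(includes)
        missing = wanted if entry is None else wanted - entry.loaded_includes
        if entry is None or missing:
            # Fetch only what the cached entry does not already contain.
            fresh = self._fetch(entity_type, entity_id, sorted(missing))
            if entry is None:
                entry = self._entries[key] = CachedEntity()
            entry.data.update(fresh)
            entry.loaded_includes |= missing
        return entry.data

    def __len__(self) -> int:
        return len(self._entries)
```

The three `load` calls shown above would produce a single entry here: the second and third calls each fetch only their missing `inc` and merge it into the same record.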

2.4 Extension System Limitations (`extensions/index.js`)

// Only 18 lines. No lifecycle hooks, no dependency management.
export async function loadExtension(extensionModule) {
  return typeof extensionModule === 'string' 
    ? await import(extensionModule) 
    : extensionModule;
}

Missing: Lifecycle hooks, resolver interception, middleware support, error boundaries.

3. Bedrock-API - Architectural Flaws

Critical Issues

3.1 Missing Proto Fields (`bedrock_service.proto`)

| Missing Field | Impact |
|---|---|
| `album_id` on Track | Can't link tracks to albums bidirectionally |
| `release_date` on Track | Temporal data lost |
| `explicit` flag | Content rating lost |
| `isrc` | International standard ID lost (critical for rights) |
| `verified` on Artist | Badge status lost |
| `label` on Album | Publisher info lost |
| `upc/ean` | Barcode identifiers lost |

3.2 SoundCloud artist_id Bug (`soundcloud.go:457`)

// BUG: Uses track ID instead of user ID
artist_id: fmt.Sprintf("soundcloud:%d", t.ID),  // Should be t.User.ID

3.3 Listening Stats Don't Persist (`main.go:984-1000`)

func (s *BedrockServer) RecordPlay(ctx context.Context, req *pb.RecordPlayRequest) (*pb.RecordPlayResponse, error) {
    eventID := uuid.New().String()
    // TODO: persist event  ← STUB!
    return &pb.RecordPlayResponse{EventId: eventID, Status: pb.ResponseStatus_STATUS_OK}, nil
}

Impact: GetPopularTracks and GetListeningHistory return empty - feature non-functional.

3.4 Resolver Bridging Has No Validation (`resolver.go:152-159`)

// Takes first search result without scoring
results, err := s.sc.SearchTracks(ctx, cleanedQuery, 1)
return results[0]  // Wrong track if covers/remixes rank first

Missing: Duration comparison, artist name fuzzy matching, ISRC/UPC verification.

3.5 Spotify Panic Risk (`spotify.go:76-78`)

// No bounds check before indexing
ArtistIDs: wrapper.ArtistIDs[0],  // PANIC if empty array

4. minim - Architectural Flaws

Critical Issues

4.1 Inconsistent Error Handling Per Provider

| Provider | Error Pattern |
|---|---|
| Spotify | Retries on 401, raises `RuntimeError` |
| TIDAL | Parses JSON error, falls back to status |
| Qobuz | Raises with `error['code']` |
| iTunes | Tries `errorMessage`, uses JSONDecodeError fallback |
| Discogs | Parses nested `detail` field |

Impact: Consumers need provider-specific error handling.

4.2 Missing Retry Logic (3/5 providers)

Only Spotify and Qobuz implement retry. TIDAL, iTunes, Discogs fail immediately on transient errors.

4.3 No Rate Limit Handling

# Missing everywhere:
# - 429 Too Many Requests detection
# - Retry-After header parsing
# - Exponential backoff
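The three missing pieces can be sketched as two small pure functions that any provider client could share. The names and defaults (`base`, `cap`, full jitter) are assumptions for illustration. Only the delta-seconds form of `Retry-After` is handled; the HTTP-date form falls through to backoff.

```python
from __future__ import annotations

import random
from typing import Callable


def should_retry(status: int) -> bool:
    """Retry on 429 Too Many Requests and on 5xx server errors."""
    return status == 429 or 500 <= status < 600


def retry_delay(
    attempt: int,
    retry_after: str | None = None,
    base: float = 0.5,
    cap: float = 30.0,
    rng: Callable[[], float] = random.random,
) -> float:
    """Seconds to wait before retry number `attempt` (0-based).

    A numeric Retry-After header wins; otherwise use capped exponential
    backoff with full jitter (a uniform draw from [0, ceiling)).
    """
    if retry_after is not None:
        try:
            return max(float(retry_after), 0.0)
        except ValueError:
            pass  # HTTP-date form is not parsed in this sketch
    ceiling = min(cap, base * 2 ** attempt)
    return ceiling * rng()
```

Keeping the delay computation free of I/O makes it easy to test and lets synchronous and asynchronous clients reuse the same policy.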

4.4 Response Structure Inconsistency

| Provider | Artist Field | Duration Field |
|---|---|---|
| Spotify | `album.artists[0].name` | `duration_ms` |
| TIDAL | `data.attributes.name` | `duration` (seconds) |
| iTunes | `artistName` | `trackTimeMillis` |
| Discogs | `artists[0].name` | N/A |

Impact: No common data model. Every consumer writes provider-specific parsing.
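A common model could be as small as one dataclass plus one adapter per provider. The sketch below uses only the field paths from the table above. The exact payload nesting, in particular where TIDAL's `duration` lives, is an assumption, and `NormalizedTrack` is a hypothetical name.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class NormalizedTrack:
    provider: str
    artist: str | None
    duration_ms: int | None  # always milliseconds, whatever the provider uses


def _first_artist_name(artists) -> str | None:
    return artists[0].get("name") if artists else None


def from_spotify(payload: dict) -> NormalizedTrack:
    artists = (payload.get("album") or {}).get("artists") or []
    return NormalizedTrack("spotify", _first_artist_name(artists), payload.get("duration_ms"))


def from_tidal(payload: dict) -> NormalizedTrack:
    attributes = (payload.get("data") or {}).get("attributes") or {}
    seconds = payload.get("duration")  # assumed top-level, reported in seconds
    duration_ms = None if seconds is None else int(round(seconds * 1000))
    return NormalizedTrack("tidal", attributes.get("name"), duration_ms)


def from_itunes(payload: dict) -> NormalizedTrack:
    return NormalizedTrack("itunes", payload.get("artistName"), payload.get("trackTimeMillis"))


def from_discogs(payload: dict) -> NormalizedTrack:
    # Discogs exposes no duration field in the table above, so it stays None.
    return NormalizedTrack("discogs", _first_artist_name(payload.get("artists") or []), None)
```

Consumers then compare `duration_ms` across providers without caring that TIDAL reports seconds.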

5. MusicMetaLinker - Architectural Flaws

Critical Issues

5.1 Naive Cascading Fallback (`linking.py:159-182`)

def get_artist(self) -> str | None:
    if self.artist: return self.artist
    artist = self.mb_link.get_artist()
    if artist is None:
        artist = self.dz_link.get_artist_name()
        if artist is None:
            artist = self.mb_link.get_artist()  # Called twice!
            if artist is None:
                artist = self.yt_link.get_youtube_artist()
    return artist  # First non-None wins, no quality check

Problems:

- No confidence scoring
- No conflict detection ("Beyoncé" vs "Beyonce" vs "Beyoncé Knowles")
- Redundant MusicBrainz calls
- Order bias (Deezer always wins over YouTube)

5.2 Silent Failures (`deezer_links.py:102-107`)

try:
    return [res for res in results][:limit]
except Exception:  # Catches EVERYTHING
    return None  # Network error? Invalid input? Who knows!

Impact: Can't distinguish "no match" from "API failed" from "invalid input".
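A sketch of a result type that keeps the three cases apart. `Outcome`, `SearchResult`, and the injected `fetch` callable are hypothetical names; only network-level exceptions are caught, so programming errors still surface.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Iterable


class Outcome(Enum):
    FOUND = "found"
    NO_MATCH = "no_match"
    PROVIDER_ERROR = "provider_error"
    INVALID_INPUT = "invalid_input"


@dataclass(frozen=True)
class SearchResult:
    outcome: Outcome
    items: tuple = ()
    detail: str = ""


def search(query: str, limit: int, fetch: Callable[[str], Iterable[dict]]) -> SearchResult:
    """Run a provider search and say explicitly why there are no items, if so."""
    if not query.strip() or limit <= 0:
        return SearchResult(Outcome.INVALID_INPUT, detail="empty query or non-positive limit")
    try:
        results = list(fetch(query))
    except (ConnectionError, TimeoutError) as exc:
        # Narrow catch: anything else is a bug and should propagate.
        return SearchResult(Outcome.PROVIDER_ERROR, detail=repr(exc))
    if not results:
        return SearchResult(Outcome.NO_MATCH)
    return SearchResult(Outcome.FOUND, items=tuple(results[:limit]))
```

Callers can then retry on `PROVIDER_ERROR`, fall back to another provider on `NO_MATCH`, and reject the request on `INVALID_INPUT`.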

5.3 ISRC Handling Bug (`musicbrainz_links.py:77-85`)

for isrc in self.isrc:
    try:
        isrc_result = mb.get_recordings_by_isrc(isrc, ...)
        return isrc_result  # Returns on first success
    except mb.ResponseError:
        return None  # BUG: Should be `continue`, not `return`!
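The corrected control flow, sketched with a hypothetical `lookup` callable and a hypothetical `ProviderResponseError` standing in for the client library's error type: a failed ISRC moves on to the next one, and `None` is returned only after every ISRC has been tried.

```python
from typing import Callable, Iterable, Optional


class ProviderResponseError(Exception):
    """Stand-in for the client library's response error (assumption)."""


def first_recording_by_isrc(
    isrcs: Iterable[str],
    lookup: Callable[[str], Optional[dict]],
) -> Optional[dict]:
    """Return the first successful, non-empty lookup result across all ISRCs."""
    for isrc in isrcs:
        try:
            result = lookup(isrc)
        except ProviderResponseError:
            continue  # this ISRC failed; try the next one instead of giving up
        if result:
            return result
    return None
```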

5.4 Album Name Truncation (`deezer_links.py:63-78`)

if self.album and " " in self.album:
    self.album = " ".join(self.album.split(" ")[:2])  # Only first 2 words!

"The Beatles (Remastered)" → "The Beatles" - loses critical specificity.

5.5 Naive Duration Comparison

Fixed 3-second threshold regardless of track length:

- 3s is huge for a 30-second track (10% error)
- 3s is tiny for a 10-minute track (0.5% error)
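A relative threshold addresses both cases. The sketch below reads the "±3% or ±5s" rule used later in this document as "3% of the longer duration, capped at 5 seconds". That reading is an assumption: the document does not say whether the two bounds combine as a minimum or a maximum.

```python
def durations_match(a_ms: int, b_ms: int, rel_tol: float = 0.03, abs_cap_ms: int = 5000) -> bool:
    """True if two durations agree within 3% of the longer one, capped at 5 s."""
    longest = max(a_ms, b_ms)
    tolerance_ms = min(rel_tol * longest, abs_cap_ms)
    return abs(a_ms - b_ms) <= tolerance_ms
```

Under this reading a 3-second gap fails for a 30-second track (tolerance about 1 s) and passes for a 10-minute track (tolerance 5 s).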

Proposed Architecture

Design Principles

- Observations are immutable - No "last write wins"; always preserve raw data
- Field-level confidence - Trust title from MusicBrainz while using duration from Spotify
- Three-stage entity resolution - Blocking → Similarity → Decision
- Provenance by default - Every value is explainable

Architecture Diagram

┌─────────────────────────────────────────────────────────────────────────┐
│                           INGESTION LAYER                               │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐    │
│  │  Provider   │  │  Provider   │  │  Provider   │  │  Provider   │    │
│  │  Adapter    │  │  Adapter    │  │  Adapter    │  │  Adapter    │    │
│  └──────┬──────┘  └──────┬──────┘  └──────┬──────┘  └──────┬──────┘    │
│         └────────────────┴───────┬────────┴────────────────┘           │
│                    ┌─────────────▼──────────────┐                      │
│                    │  Unified Provider Gateway  │                      │
│                    │  • Per-provider rate limit │                      │
│                    │  • Retry + exp. backoff    │                      │
│                    │  • Circuit breaker         │                      │
│                    │  • Request batching        │                      │
│                    └─────────────┬──────────────┘                      │
└──────────────────────────────────┼──────────────────────────────────────┘
                                   │
                    ┌──────────────▼──────────────┐
                    │   RAW OBSERVATION STORE     │
                    │   (append-only, immutable)  │
                    └──────────────┬──────────────┘
                                   │
┌──────────────────────────────────┼──────────────────────────────────────┐
│                    ENTITY RESOLUTION LAYER                              │
│         ┌────────────────────────▼────────────────────────┐            │
│         │              BLOCKING STAGE                      │            │
│         │  • ISRC/UPC exact match (99.7% pair reduction)  │            │
│         │  • Phonetic blocking (Metaphone) for names      │            │
│         └────────────────────────┬────────────────────────┘            │
│         ┌────────────────────────▼────────────────────────┐            │
│         │            SIMILARITY STAGE                      │            │
│         │  • Title: Levenshtein + token Jaccard           │            │
│         │  • Artist: embedding cosine similarity          │            │
│         │  • Duration: relative threshold (±3% or ±5s)    │            │
│         └────────────────────────┬────────────────────────┘            │
│         ┌────────────────────────▼────────────────────────┐            │
│         │            DECISION STAGE                        │            │
│         │  • ≥0.95 → auto-merge                           │            │
│         │  • 0.70-0.95 → human review queue               │            │
│         │  • <0.70 → distinct entities                    │            │
│         └────────────────────────┬────────────────────────┘            │
└──────────────────────────────────┼──────────────────────────────────────┘
                                   │
┌──────────────────────────────────┼──────────────────────────────────────┐
│                    CONFLICT RESOLUTION ENGINE                           │
│         ┌────────────────────────▼────────────────────────┐            │
│         │         FIELD-LEVEL MERGE RULES                  │            │
│         │  confidence = source_trust × recency × consensus │            │
│         │                                                  │            │
│         │  • Identifiers: ISRC > provider ID              │            │
│         │  • Duration: median within ±3% or ±5s           │            │
│         │  • Title: MusicBrainz > label > streaming       │            │
│         │  • Release date: earliest credible              │            │
│         │  • Explicit: OR across sources                  │            │
│         └────────────────────────┬────────────────────────┘            │
│         ┌────────────────────────▼────────────────────────┐            │
│         │         CANONICAL ENTITY STORE                   │            │
│         │  • Materialized "best known" values             │            │
│         │  • Per-field confidence scores                  │            │
│         │  • Links to all source observations             │            │
│         └─────────────────────────────────────────────────┘            │
└─────────────────────────────────────────────────────────────────────────┘
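The decision stage in the diagram reduces to a threshold function. A sketch using the thresholds shown above; the function name and the returned labels are illustrative, and the boundaries are treated as inclusive on the lower end.

```python
def decide(score: float, auto_merge: float = 0.95, review: float = 0.70) -> str:
    """Map a similarity score in [0, 1] to a merge decision."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be in [0, 1], got {score}")
    if score >= auto_merge:
        return "auto_merge"
    if score >= review:
        return "human_review"
    return "distinct"
```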

Core Data Model

-- Immutable observations from providers
CREATE TABLE observations (
    id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    provider        TEXT NOT NULL,
    provider_id     TEXT NOT NULL,
    entity_type     TEXT NOT NULL,
    payload         JSONB NOT NULL,
    fetched_at      TIMESTAMPTZ NOT NULL DEFAULT now(),
    checksum        BYTEA NOT NULL,
    UNIQUE(provider, provider_id, checksum)
);

-- Canonical entities with confidence
CREATE TABLE tracks (
    id                    UUID PRIMARY KEY,
    
    -- Identifiers
    isrc                  TEXT,
    iswc                  TEXT,
    mbid                  UUID,
    
    -- Fields with confidence
    title                 TEXT NOT NULL,
    title_confidence      REAL NOT NULL DEFAULT 0.0,
    
    duration_ms           INT,
    duration_confidence   REAL NOT NULL DEFAULT 0.0,
    
    explicit              BOOLEAN,
    explicit_confidence   REAL NOT NULL DEFAULT 0.0,
    
    -- Denormalized
    artist_credit         TEXT NOT NULL,
    album_title           TEXT,
    
    -- Metadata
    created_at            TIMESTAMPTZ NOT NULL DEFAULT now(),
    updated_at            TIMESTAMPTZ NOT NULL DEFAULT now(),
    merge_version         INT NOT NULL DEFAULT 1
);

-- Field-level provenance
CREATE TABLE field_sources (
    entity_type     TEXT NOT NULL,
    entity_id       UUID NOT NULL,
    field_name      TEXT NOT NULL,
    observation_id  UUID NOT NULL REFERENCES observations(id),
    confidence      REAL NOT NULL,
    selected        BOOLEAN NOT NULL DEFAULT false,
    PRIMARY KEY (entity_type, entity_id, field_name, observation_id)
);

-- Cross-reference table
CREATE TABLE provider_links (
    entity_type     TEXT NOT NULL,
    entity_id       UUID NOT NULL,
    provider        TEXT NOT NULL,
    provider_id     TEXT NOT NULL,
    verified        BOOLEAN NOT NULL DEFAULT false,
    PRIMARY KEY (entity_type, provider, provider_id)
);

-- Entity resolution audit trail
CREATE TABLE merge_decisions (
    id               UUID PRIMARY KEY,
    entity_type      TEXT NOT NULL,
    source_ids       UUID[] NOT NULL,
    target_id        UUID NOT NULL,
    similarity_score REAL NOT NULL,
    decision         TEXT NOT NULL,  -- 'auto', 'human_approved', 'human_rejected'
    decided_by       TEXT,
    decided_at       TIMESTAMPTZ NOT NULL DEFAULT now()
);

Source Trust Hierarchy

SOURCE_TRUST = {
    'musicbrainz': 0.95,  # Community-curated, high accuracy
    'discogs':     0.85,  # Community + physical media focus
    'tidal':       0.80,  # Label direct relationships
    'spotify':     0.75,  # Large scale, some noise
    'deezer':      0.70,  # Good coverage, less curation
    'youtube':     0.60,  # User-generated, low accuracy
}

Conflict Resolution Rules

| Field | Strategy | Implementation |
|---|---|---|
| Title | Highest trust + consensus | Score = trust + 0.1×(agreeing_sources - 1) |
| Duration | Median within tolerance | Filter to ±3% or ±5s, take median |
| Explicit | OR logic | If any source says explicit → explicit |
| Release Date | Earliest credible | Must be ≤ today and ≥ 1900 |
| ISRC | Highest-trust valid | Validate format, take highest-trust source |
| Artist | Embedding similarity | Cluster similar names, pick canonical |
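Four of these rules are simple enough to sketch directly. Function names are illustrative. For duration, the filter is applied around the median of all reported values and the two bounds are combined as "3%, capped at 5 s"; both choices are assumptions, since the table does not specify them.

```python
from __future__ import annotations

from datetime import date
from statistics import median


def title_score(trust: float, agreeing_sources: int) -> float:
    """Score = trust + 0.1 x (agreeing_sources - 1)."""
    return trust + 0.1 * (agreeing_sources - 1)


def merge_duration_ms(values: list[int]) -> int | None:
    """Drop outliers around the overall median, then take the median of the rest."""
    if not values:
        return None
    center = median(values)
    tolerance = min(0.03 * center, 5000)
    kept = [v for v in values if abs(v - center) <= tolerance]
    return int(median(kept)) if kept else None


def merge_explicit(flags: list[bool | None]) -> bool | None:
    """OR across sources; None only when no source reported a value."""
    known = [f for f in flags if f is not None]
    return any(known) if known else None


def merge_release_date(dates: list[date], today: date) -> date | None:
    """Earliest date that is not in the future and not before 1900."""
    credible = [d for d in dates if date(1900, 1, 1) <= d <= today]
    return min(credible) if credible else None
```

Each function takes all observations for one field and returns one canonical value, which matches the field-level merge the architecture calls for.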

Technical Choices

| Component | Choice | Rationale |
|---|---|---|
| Core Language | Python 3.11+ | Rapid iteration, rich ecosystem |
| Hot Path | Rust via PyO3 | Entity resolution blocking/embedding |
| Database | PostgreSQL 15+ | JSONB, trigram, pgvector |
| Cache | Redis | Entity-keyed, not URL-keyed |
| Embeddings | all-MiniLM-L6-v2 | 384-dim, fast, good quality |
| API | GraphQL + DataLoader | Explicit batching, no N+1 |
| Queue | PostgreSQL SKIP LOCKED | Human review, async processing |
| Observability | OpenTelemetry | Trace entity resolution decisions |

Estimated Effort

| Component | Effort | Notes |
|---|---|---|
| Data model + migrations | 1-4 hours | PostgreSQL schema |
| Provider gateway | 1-2 days | Unified error handling, rate limiting |
| Entity resolution pipeline | 1-2 days | Blocking, similarity, decision |
| Conflict resolution engine | 1-4 hours | Field-level rules |
| Provenance system | 1-4 hours | Audit tables, explain API |
| Human review UI | 1-2 days | Queue management |
| Total MVP | 1-2 weeks | |

Key Takeaways

- Hybrid approaches win: Audio + metadata outperforms either alone (Spotify research: 2-6% improvement)
- Provenance is non-negotiable: Every field needs source tracking, confidence scores, snapshot URLs
- Identifier hierarchy matters: ISWC (work) → ISRC (recording) → UPC (release) with MBIDs as glue
- Fuzzy matching requires stages: Blocking (99.7% reduction) → Similarity → Threshold → Human review
- Conflict resolution needs policy: Field-level precedence rules, not "last write wins"
- Cache entities, not requests: Avoid GraphBrainz's URL-fragmentation trap
- Unified error handling: Result types that force error handling, not silent exceptions
