metadata-agregator/docs/research/AGGREGATORS_ANALYSIS.md

# Aggregators Architecture Analysis & Proposed Solution

Deep analysis of 5 music metadata aggregators, identifying common flaws and proposing a ground-up redesign.

---

## Executive Summary

All 5 aggregators share **common architectural mistakes** that lead to data quality issues, performance problems, and poor extensibility:

| Pattern | Projects Affected | Impact |
|---------|-------------------|--------|
| **No confidence scoring** | 5/5 | Can't distinguish good data from bad |
| **First/last-write-wins merging** | 4/5 | Data loss, no conflict resolution |
| **Silent failure cascades** | 4/5 | Debugging nightmare, data corruption |
| **Naive entity resolution** | 4/5 | Duplicates, mismatches |
| **Provider-specific error handling** | 3/5 | Inconsistent reliability |
| **URL-based cache keys** | 2/5 | Same entity cached multiple times |
| **Disabled batching** | 2/5 | Catastrophic performance |

---

## 1. Harmony - Architectural Flaws

### Critical Issues

#### 1.1 Naive Deduplication (`deduplicate.ts:4-25`)
```typescript
// FLAW: Exact string match only
if (mbid) {
  if (!mbids.has(mbid)) { result.push(entity); mbids.add(mbid); }
} else if (name) {
  if (!names.has(name)) { result.push(entity); names.add(name); }
}
```
**Problem**: "The Beatles", "Beatles", and "BEATLES" are all treated as different entities.

**Fix**: Implement phonetic blocking (Metaphone) + Levenshtein similarity threshold.
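A minimal sketch of the similarity half of this fix, written in Python rather than Harmony's TypeScript. It assumes a simple normalization step (accent folding, case folding, dropping a leading article) and does pairwise comparison; phonetic blocking with Metaphone is omitted because it is not in the standard library. The names `normalize`, `levenshtein`, `similar`, `deduplicate`, and the 0.85 threshold are illustrative and not part of Harmony.

```python
import unicodedata


def normalize(name: str) -> str:
    """Fold accents and case, and drop a leading English article."""
    folded = unicodedata.normalize("NFKD", name)
    folded = "".join(ch for ch in folded if not unicodedata.combining(ch))
    words = folded.casefold().split()
    if len(words) > 1 and words[0] in {"the", "a", "an"}:
        words = words[1:]
    return " ".join(words)


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution
            ))
        previous = current
    return previous[-1]


def similar(a: str, b: str, threshold: float = 0.85) -> bool:
    """True when the normalized names reach the similarity threshold."""
    ka, kb = normalize(a), normalize(b)
    longest = max(len(ka), len(kb))
    if longest == 0:
        return True
    return 1 - levenshtein(ka, kb) / longest >= threshold


def deduplicate(names: list[str]) -> list[str]:
    """Keep the first name from each group of near-duplicates."""
    kept: list[str] = []
    for name in names:
        if not any(similar(name, existing) for existing in kept):
            kept.append(name)
    return kept
```

With these assumptions, `deduplicate(["The Beatles", "Beatles", "BEATLES", "Oasis"])` keeps only "The Beatles" and "Oasis". A production version would add a blocking key so that not every pair has to be compared.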

#### 1.2 Limited Compatibility Checks (`compatibility.ts:60-67`)
```typescript
const releaseCompatibilityChecks: CompatibilityCheck<HarmonyRelease>[] = [{
  property: (release) => release.gtin ? Number(release.gtin) : undefined,
  errorMessage: 'Providers have returned multiple different GTIN',
}, {
  property: trackCountSummary,
  errorMessage: 'Providers have returned incompatible track lists',
}];
```
**Problem**: Only checks GTIN and track count. No artist validation, title similarity, or duration checks.

**Fix**: Add artist credit comparison, title Levenshtein distance, duration tolerance (±3%).

#### 1.3 First-Wins Merge with No Confidence (`merge.ts:105-124`)
```typescript
missingReleaseProperties.forEach((property) => {
  const value = cloneInto(mergedRelease, sourceRelease, property);
  if (isFilled(value)) {
    mergedRelease.info.sourceMap[property] = providerName;
    missingReleaseProperties.delete(property);  // First wins, done
  }
});
```
**Problem**: First provider to fill a field wins. No quality assessment.

**Fix**: Score each value by source trust × recency × consensus, pick highest.
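One way to read "source trust × recency × consensus" is sketched below in Python. The `Candidate` type, the exponential half-life decay, the consensus definition (share of candidates agreeing on the same value), and the 0.5 default trust for unknown providers are all assumptions made for illustration; the trust weights echo the Source Trust Hierarchy later in this document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    provider: str
    value: str
    age_days: float  # time since the provider record was fetched


# Illustrative trust weights; real values would need calibration.
TRUST = {"musicbrainz": 0.95, "spotify": 0.75, "deezer": 0.70}


def recency(age_days: float, half_life_days: float = 365.0) -> float:
    """Exponential decay: a value loses half its weight every half-life."""
    return 0.5 ** (age_days / half_life_days)


def pick_value(candidates: list[Candidate]) -> tuple[str, float]:
    """Return the highest-scoring value together with its score."""
    if not candidates:
        raise ValueError("no candidates to merge")
    total = len(candidates)
    scored = []
    for c in candidates:
        agreeing = sum(1 for other in candidates if other.value == c.value)
        consensus = agreeing / total
        score = TRUST.get(c.provider, 0.5) * recency(c.age_days) * consensus
        scored.append((score, c.value))
    best_score, best_value = max(scored)
    return best_value, best_score
```

Unlike the first-wins loop, the provider order no longer matters: a value reported first by a low-trust source loses to one that two higher-trust sources agree on.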

#### 1.4 No Data Quality Metrics
**Missing**: Confidence scores, match quality, conflict counts, field completeness.

---

## 2. GraphBrainz - Architectural Flaws

### Critical Issues

#### 2.1 Batching Completely Disabled (`loaders.js:38-42`)
```javascript
const lookup = new DataLoader(
  (keys) => { /* ... */ },
  { batch: false }  // ← DEFEATS ENTIRE PURPOSE OF DATALOADER
);
```
**Impact**: A query for 20 entities issues 20 separate HTTP requests. At a rate limit of 5 requests per 5.5 s, that takes about **22 seconds** (4 windows × 5.5 s).

**Fix**: Implement request coalescing even without batch API. Deduplicate concurrent identical requests.
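Request coalescing can be sketched independently of any batch API. The Python example below (GraphBrainz itself is JavaScript) shares one in-flight request among all concurrent callers that ask for the same key. `Coalescer`, `load`, and `fake_fetch` are hypothetical names, and the fetch function is a stand-in for a real HTTP call.

```python
import asyncio
from typing import Awaitable, Callable, Hashable


class Coalescer:
    """Share one in-flight request among concurrent callers with the same key."""

    def __init__(self, fetch: Callable[[Hashable], Awaitable[object]]):
        self._fetch = fetch
        self._in_flight: dict[Hashable, asyncio.Future] = {}

    async def load(self, key: Hashable) -> object:
        task = self._in_flight.get(key)
        if task is None:
            task = asyncio.ensure_future(self._fetch(key))
            self._in_flight[key] = task
            # Forget the key once the request settles so later calls refetch.
            task.add_done_callback(lambda _: self._in_flight.pop(key, None))
        return await task


async def demo() -> tuple[list[object], int]:
    calls = 0

    async def fake_fetch(key: Hashable) -> str:
        nonlocal calls
        calls += 1
        await asyncio.sleep(0.01)  # simulated network latency
        return f"entity:{key}"

    coalescer = Coalescer(fake_fetch)
    results = await asyncio.gather(
        *(coalescer.load("artist/abc") for _ in range(20))
    )
    return results, calls
```

Running `asyncio.run(demo())` issues a single fetch for the 20 concurrent loads. This only removes duplicate work; requests for distinct entities still need the rate limiter.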

#### 2.2 N+1 Queries by Design (`relationship.js:127-138`)
```javascript
relationships: {
  resolve: (entity, args, { loaders }, info) => {
    // If relations not included in initial fetch...
    promise = loaders.lookup.load([entityType, id, params]);  // N+1 QUERY
    return promise.then((entity) => entity.relations);
  },
}
```
**Also in**: `recording.js:51-61` (ISRCs), `helpers.js:56-64` (fieldWithID pattern)

**Impact**: Query 100 artists with relationships = 1 + 100 requests.

**Fix**: Query planning phase - analyze full GraphQL query before any resolvers, compute optimal `inc` parameters.
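The planning step could look roughly like the Python sketch below: collect the fields a query selects, map them to `inc` values once, and issue a single lookup that satisfies every resolver. The `FIELD_TO_INC` mapping and `plan_inc` are hypothetical; a real planner would walk the GraphQL selection set and keep one mapping per entity type.

```python
# Hypothetical mapping from GraphQL field names to lookup `inc` values.
FIELD_TO_INC = {
    "relationships": "artist-rels",
    "releases": "releases",
    "recordings": "recordings",
}


def plan_inc(selected_fields: set[str]) -> list[str]:
    """Collect every `inc` value the query needs, in a stable order."""
    return sorted({FIELD_TO_INC[f] for f in selected_fields if f in FIELD_TO_INC})
```

For a query selecting `name`, `releases`, and `relationships`, `plan_inc` yields `["artist-rels", "releases"]`, so the relationships resolver never needs a second request.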

#### 2.3 Cache Fragmentation (`loaders.js:11-20`)
```javascript
// Same artist cached 3 times with different completeness:
loaders.lookup.load(['artist', 'abc', {}])
loaders.lookup.load(['artist', 'abc', { inc: ['releases'] }])
loaders.lookup.load(['artist', 'abc', { inc: ['recordings'] }])
```
**Problem**: URL-based cache keys mean same entity with different `inc` params = different cache entries.

**Fix**: Entity-based cache with incremental enrichment.
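A sketch of what entity-based caching with incremental enrichment might look like, in Python for illustration. Each `(entity_type, id)` pair has exactly one entry, which records the `inc` parts already loaded so that only the missing ones are fetched. `EntityCache`, `CachedEntity`, `missing_inc`, and `store` are assumed names, not GraphBrainz APIs.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class CachedEntity:
    data: dict = field(default_factory=dict)
    loaded_inc: set = field(default_factory=set)


class EntityCache:
    """One entry per (entity_type, id); `inc` parts are merged in as they arrive."""

    def __init__(self) -> None:
        self._entries: dict[tuple[str, str], CachedEntity] = {}

    def missing_inc(self, entity_type: str, entity_id: str, inc: set) -> Optional[set]:
        """Return the `inc` parts still to fetch, or None if nothing is cached."""
        entry = self._entries.get((entity_type, entity_id))
        if entry is None:
            return None
        return inc - entry.loaded_inc

    def store(self, entity_type: str, entity_id: str, payload: dict, inc: set) -> dict:
        """Merge a fetched payload into the single entry for this entity."""
        entry = self._entries.setdefault((entity_type, entity_id), CachedEntity())
        entry.data.update(payload)
        entry.loaded_inc |= inc
        return entry.data
```

After the base artist and its releases are stored, a request for `{"releases", "recordings"}` reports only `{"recordings"}` as missing, so the three loads in the example above collapse into one enriched entry.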

#### 2.4 Extension System Limitations (`extensions/index.js`)
```javascript
// Only 18 lines. No lifecycle hooks, no dependency management.
export async function loadExtension(extensionModule) {
  return typeof extensionModule === 'string'
    ? await import(extensionModule)
    : extensionModule;
}
```
**Missing**: Lifecycle hooks, resolver interception, middleware support, error boundaries.

---

## 3. Bedrock-API - Architectural Flaws

### Critical Issues

#### 3.1 Missing Proto Fields (`bedrock_service.proto`)

| Missing Field | Impact |
|---------------|--------|
| `album_id` on Track | Can't link tracks to albums bidirectionally |
| `release_date` on Track | Temporal data lost |
| `explicit` flag | Content rating lost |
| `isrc` | International standard ID lost (critical for rights) |
| `verified` on Artist | Badge status lost |
| `label` on Album | Publisher info lost |
| `upc/ean` | Barcode identifiers lost |

#### 3.2 SoundCloud artist_id Bug (`soundcloud.go:457`)
```go
// BUG: Uses track ID instead of user ID
artist_id: fmt.Sprintf("soundcloud:%d", t.ID),  // Should be t.User.ID
```

#### 3.3 Listening Stats Don't Persist (`main.go:984-1000`)
```go
func (s *BedrockServer) RecordPlay(ctx context.Context, req *pb.RecordPlayRequest) (*pb.RecordPlayResponse, error) {
    eventID := uuid.New().String()
    // TODO: persist event  ← STUB!
    return &pb.RecordPlayResponse{EventId: eventID, Status: pb.ResponseStatus_STATUS_OK}, nil
}
```
**Impact**: `GetPopularTracks` and `GetListeningHistory` return empty - feature non-functional.

#### 3.4 Resolver Bridging Has No Validation (`resolver.go:152-159`)
```go
// Takes first search result without scoring
results, err := s.sc.SearchTracks(ctx, cleanedQuery, 1)
return results[0]  // Wrong track if covers/remixes rank first
```
**Missing**: Duration comparison, artist name fuzzy matching, ISRC/UPC verification.

#### 3.5 Spotify Panic Risk (`spotify.go:76-78`)
```go
// No bounds check before indexing
ArtistIDs: wrapper.ArtistIDs[0],  // PANIC if empty array
```

---

## 4. minim - Architectural Flaws

### Critical Issues

#### 4.1 Inconsistent Error Handling Per Provider

| Provider | Error Pattern |
|----------|---------------|
| Spotify | Retries on 401, raises `RuntimeError` |
| TIDAL | Parses JSON error, falls back to status |
| Qobuz | Raises with `error['code']` |
| iTunes | Tries `errorMessage`, uses JSONDecodeError fallback |
| Discogs | Parses nested `detail` field |

**Impact**: Consumers need provider-specific error handling.

#### 4.2 Missing Retry Logic (3/5 providers)
Only Spotify and Qobuz implement retry. TIDAL, iTunes, and Discogs fail immediately on transient errors.

#### 4.3 No Rate Limit Handling
```python
# Missing everywhere:
# - 429 Too Many Requests detection
# - Retry-After header parsing
# - Exponential backoff
```
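The three missing pieces can be combined in one small decision function. The sketch below is an assumption about how a shared helper might look, not code from minim: `retry_delay` is a hypothetical name, the header lookup assumes an exact `Retry-After` key (real HTTP clients usually expose case-insensitive headers), and only the delay-in-seconds form of the header is handled.

```python
import random
from typing import Mapping, Optional


def retry_delay(
    status: int,
    headers: Mapping[str, str],
    attempt: int,
    base: float = 1.0,
    cap: float = 60.0,
) -> Optional[float]:
    """Seconds to wait before retrying, or None if the response is not retryable."""
    if status != 429 and not 500 <= status < 600:
        return None
    retry_after = headers.get("Retry-After", "").strip()
    if retry_after.isdigit():
        # Server-provided delay in seconds; the HTTP-date form is not handled here.
        return min(float(retry_after), cap)
    # Exponential backoff with full jitter: uniform in [0, min(cap, base * 2^attempt)].
    return random.uniform(0, min(cap, base * 2 ** attempt))
```

A caller would sleep for the returned delay and retry up to some maximum attempt count, treating `None` as a permanent failure.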

#### 4.4 Response Structure Inconsistency

| Provider | Artist Field | Duration Field |
|----------|-------------|----------------|
| Spotify | `album.artists[0].name` | `duration_ms` |
| TIDAL | `data.attributes.name` | `duration` (seconds) |
| iTunes | `artistName` | `trackTimeMillis` |
| Discogs | `artists[0].name` | N/A |

**Impact**: No common data model. Every consumer writes provider-specific parsing.
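A common model could be as small as the Python sketch below. It treats the field paths in the table above as literal key paths, which is an assumption about the payload shapes, and `NormalizedTrack` and the `from_*` adapters are hypothetical names. The one conversion that matters is TIDAL's duration, which is in seconds rather than milliseconds.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class NormalizedTrack:
    provider: str
    artist: str
    duration_ms: Optional[int]  # None when the provider has no duration


def from_spotify(raw: dict) -> NormalizedTrack:
    return NormalizedTrack(
        "spotify", raw["album"]["artists"][0]["name"], int(raw["duration_ms"])
    )


def from_tidal(raw: dict) -> NormalizedTrack:
    # TIDAL reports seconds, so convert to milliseconds.
    return NormalizedTrack(
        "tidal", raw["data"]["attributes"]["name"], round(raw["duration"] * 1000)
    )


def from_itunes(raw: dict) -> NormalizedTrack:
    return NormalizedTrack("itunes", raw["artistName"], int(raw["trackTimeMillis"]))


def from_discogs(raw: dict) -> NormalizedTrack:
    # Discogs has no duration field in the table above.
    return NormalizedTrack("discogs", raw["artists"][0]["name"], None)
```

Consumers then depend only on `NormalizedTrack`, and adding a provider means adding one adapter function.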

---

## 5. MusicMetaLinker - Architectural Flaws

### Critical Issues

#### 5.1 Naive Cascading Fallback (`linking.py:159-182`)
```python
def get_artist(self) -> str | None:
    if self.artist: return self.artist
    artist = self.mb_link.get_artist()
    if artist is None:
        artist = self.dz_link.get_artist_name()
        if artist is None:
            artist = self.mb_link.get_artist()  # Called twice!
            if artist is None:
                artist = self.yt_link.get_youtube_artist()
    return artist  # First non-None wins, no quality check
```
**Problems**:
- No confidence scoring
- No conflict detection ("Beyoncé" vs "Beyonce" vs "Beyoncé Knowles")
- Redundant MusicBrainz calls
- Order bias (Deezer always wins over YouTube)

#### 5.2 Silent Failures (`deezer_links.py:102-107`)
```python
try:
    return [res for res in results][:limit]
except Exception:  # Catches EVERYTHING
    return None  # Network error? Invalid input? Who knows!
```
**Impact**: Can't distinguish "no match" from "API failed" from "invalid input".
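A result type makes the three cases explicit. The sketch below is illustrative rather than MusicMetaLinker code: `Outcome`, `LookupResult`, and `search` are assumed names, and `fetch` stands in for the Deezer client call. Only transport errors are caught, so programming errors still surface.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class Outcome(Enum):
    FOUND = "found"
    NO_MATCH = "no_match"
    INVALID_INPUT = "invalid_input"
    PROVIDER_ERROR = "provider_error"


@dataclass(frozen=True)
class LookupResult:
    outcome: Outcome
    items: tuple = ()
    detail: Optional[str] = None


def search(query: str, fetch: Callable[[str], list], limit: int = 5) -> LookupResult:
    """Run `fetch` and report which outcome occurred."""
    if not query.strip() or limit < 1:
        return LookupResult(Outcome.INVALID_INPUT, detail="empty query or limit < 1")
    try:
        results = fetch(query)
    except (ConnectionError, TimeoutError) as exc:
        # Only transport failures are caught; other exceptions propagate.
        return LookupResult(Outcome.PROVIDER_ERROR, detail=str(exc))
    if not results:
        return LookupResult(Outcome.NO_MATCH)
    return LookupResult(Outcome.FOUND, items=tuple(results[:limit]))
```

Callers can now retry on `PROVIDER_ERROR`, fall through to the next provider on `NO_MATCH`, and report `INVALID_INPUT` to the user.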

#### 5.3 ISRC Handling Bug (`musicbrainz_links.py:77-85`)
```python
for isrc in self.isrc:
    try:
        isrc_result = mb.get_recordings_by_isrc(isrc, ...)
        return isrc_result  # Returns on first success
    except mb.ResponseError:
        return None  # BUG: Should be `continue`, not `return`!
```
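The corrected control flow, sketched with a stand-in `ResponseError` and an injected `lookup` callable so that the example runs without the MusicBrainz client (both are assumptions for illustration):

```python
from typing import Callable, Iterable, Optional


class ResponseError(Exception):
    """Stand-in for the client library's per-request error."""


def first_isrc_match(
    isrcs: Iterable[str],
    lookup: Callable[[str], dict],
) -> Optional[dict]:
    """Return the first successful lookup, trying every ISRC before giving up."""
    for isrc in isrcs:
        try:
            return lookup(isrc)
        except ResponseError:
            continue  # this ISRC failed; the next one may still resolve
    return None
```

With `continue`, a failing first ISRC no longer hides a valid second one; `None` is returned only after every ISRC has been tried.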

#### 5.4 Album Name Truncation (`deezer_links.py:63-78`)
```python
if self.album and " " in self.album:
    self.album = " ".join(self.album.split(" ")[:2])  # Only first 2 words!
```
**Impact**: "The Beatles (Remastered)" → "The Beatles", which loses critical specificity.

#### 5.5 Naive Duration Comparison
The comparison uses a fixed 3-second threshold regardless of track length:
- 3 s is huge for a 30-second track (10% error)
- 3 s is tiny for a 10-minute track (0.5% error)
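A length-aware comparison can be sketched in a few lines of Python. The 3% relative tolerance and the 500 ms floor (to absorb rounding on very short clips) are assumed values, and `durations_match` is a hypothetical name.

```python
def durations_match(
    a_ms: int,
    b_ms: int,
    relative: float = 0.03,
    floor_ms: int = 500,
) -> bool:
    """Compare durations with a tolerance that scales with track length."""
    tolerance = max(floor_ms, relative * max(a_ms, b_ms))
    return abs(a_ms - b_ms) <= tolerance
```

Under these assumptions a 3-second gap is rejected for a 30-second track (tolerance about 1 s) and accepted for a 10-minute track (tolerance about 18 s).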

---

## Proposed Architecture

### Design Principles

1. **Observations are immutable** - No "last write wins"; always preserve raw data
2. **Field-level confidence** - Trust title from MusicBrainz while using duration from Spotify
3. **Three-stage entity resolution** - Blocking → Similarity → Decision
4. **Provenance by default** - Every value is explainable

### Architecture Diagram

```
┌─────────────────────────────────────────────────────────────────────────┐
│                           INGESTION LAYER                               │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐    │
│  │  Provider   │  │  Provider   │  │  Provider   │  │  Provider   │    │
│  │  Adapter    │  │  Adapter    │  │  Adapter    │  │  Adapter    │    │
│  └──────┬──────┘  └──────┬──────┘  └──────┬──────┘  └──────┬──────┘    │
│         └────────────────┴───────┬────────┴────────────────┘           │
│                    ┌─────────────▼──────────────┐                      │
│                    │  Unified Provider Gateway  │                      │
│                    │  • Per-provider rate limit │                      │
│                    │  • Retry + exp. backoff    │                      │
│                    │  • Circuit breaker         │                      │
│                    │  • Request batching        │                      │
│                    └─────────────┬──────────────┘                      │
└──────────────────────────────────┼──────────────────────────────────────┘
                                   │
                    ┌──────────────▼──────────────┐
                    │   RAW OBSERVATION STORE     │
                    │   (append-only, immutable)  │
                    └──────────────┬──────────────┘
                                   │
┌──────────────────────────────────┼──────────────────────────────────────┐
│                    ENTITY RESOLUTION LAYER                              │
│         ┌────────────────────────▼────────────────────────┐            │
│         │              BLOCKING STAGE                      │            │
│         │  • ISRC/UPC exact match (99.7% pair reduction)  │            │
│         │  • Phonetic blocking (Metaphone) for names      │            │
│         └────────────────────────┬────────────────────────┘            │
│         ┌────────────────────────▼────────────────────────┐            │
│         │            SIMILARITY STAGE                      │            │
│         │  • Title: Levenshtein + token Jaccard           │            │
│         │  • Artist: embedding cosine similarity          │            │
│         │  • Duration: relative threshold (±3% or ±5s)    │            │
│         └────────────────────────┬────────────────────────┘            │
│         ┌────────────────────────▼────────────────────────┐            │
│         │            DECISION STAGE                        │            │
│         │  • ≥0.95 → auto-merge                           │            │
│         │  • 0.70-0.95 → human review queue               │            │
│         │  • <0.70 → distinct entities                    │            │
│         └────────────────────────┬────────────────────────┘            │
└──────────────────────────────────┼──────────────────────────────────────┘
                                   │
┌──────────────────────────────────┼──────────────────────────────────────┐
│                    CONFLICT RESOLUTION ENGINE                           │
│         ┌────────────────────────▼────────────────────────┐            │
│         │         FIELD-LEVEL MERGE RULES                  │            │
│         │  confidence = source_trust × recency × consensus │            │
│         │                                                  │            │
│         │  • Identifiers: ISRC > provider ID              │            │
│         │  • Duration: median within ±3% or ±5s           │            │
│         │  • Title: MusicBrainz > label > streaming       │            │
│         │  • Release date: earliest credible              │            │
│         │  • Explicit: OR across sources                  │            │
│         └────────────────────────┬────────────────────────┘            │
│         ┌────────────────────────▼────────────────────────┐            │
│         │         CANONICAL ENTITY STORE                   │            │
│         │  • Materialized "best known" values             │            │
│         │  • Per-field confidence scores                  │            │
│         │  • Links to all source observations             │            │
│         └─────────────────────────────────────────────────┘            │
└─────────────────────────────────────────────────────────────────────────┘
```
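Parts of the similarity and decision stages in the diagram can be sketched directly. The Python below shows token Jaccard for titles and the three-way threshold decision; `token_jaccard` and `decide` are illustrative names, and the Levenshtein and embedding components of the similarity stage are left out.

```python
import re


def token_jaccard(a: str, b: str) -> float:
    """Share of distinct word tokens that two titles have in common."""
    tokens_a = set(re.findall(r"\w+", a.casefold()))
    tokens_b = set(re.findall(r"\w+", b.casefold()))
    union = tokens_a | tokens_b
    if not union:
        return 0.0
    return len(tokens_a & tokens_b) / len(union)


def decide(score: float) -> str:
    """Map a combined similarity score to the decision-stage outcome."""
    if score >= 0.95:
        return "auto_merge"
    if score >= 0.70:
        return "human_review"
    return "distinct"
```

For example, "Hey Jude" and "Hey Jude - Remastered 2015" share two of four tokens, so title similarity alone (0.5) would not merge them; other signals such as ISRC or duration would have to raise the combined score.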

---

### Core Data Model

```sql
-- Immutable observations from providers
CREATE TABLE observations (
    id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    provider        TEXT NOT NULL,
    provider_id     TEXT NOT NULL,
    entity_type     TEXT NOT NULL,
    payload         JSONB NOT NULL,
    fetched_at      TIMESTAMPTZ NOT NULL DEFAULT now(),
    checksum        BYTEA NOT NULL,
    UNIQUE(provider, provider_id, checksum)
);

-- Canonical entities with confidence
CREATE TABLE tracks (
    id                    UUID PRIMARY KEY,

    -- Identifiers
    isrc                  TEXT,
    iswc                  TEXT,
    mbid                  UUID,

    -- Fields with confidence
    title                 TEXT NOT NULL,
    title_confidence      REAL NOT NULL DEFAULT 0.0,

    duration_ms           INT,
    duration_confidence   REAL NOT NULL DEFAULT 0.0,

    explicit              BOOLEAN,
    explicit_confidence   REAL NOT NULL DEFAULT 0.0,

    -- Denormalized
    artist_credit         TEXT NOT NULL,
    album_title           TEXT,

    -- Metadata
    created_at            TIMESTAMPTZ NOT NULL DEFAULT now(),
    updated_at            TIMESTAMPTZ NOT NULL DEFAULT now(),
    merge_version         INT NOT NULL DEFAULT 1
);

-- Field-level provenance
CREATE TABLE field_sources (
    entity_type     TEXT NOT NULL,
    entity_id       UUID NOT NULL,
    field_name      TEXT NOT NULL,
    observation_id  UUID NOT NULL REFERENCES observations(id),
    confidence      REAL NOT NULL,
    selected        BOOLEAN NOT NULL DEFAULT false,
    PRIMARY KEY (entity_type, entity_id, field_name, observation_id)
);

-- Cross-reference table
CREATE TABLE provider_links (
    entity_type     TEXT NOT NULL,
    entity_id       UUID NOT NULL,
    provider        TEXT NOT NULL,
    provider_id     TEXT NOT NULL,
    verified        BOOLEAN NOT NULL DEFAULT false,
    PRIMARY KEY (entity_type, provider, provider_id)
);

-- Entity resolution audit trail
CREATE TABLE merge_decisions (
    id               UUID PRIMARY KEY,
    entity_type      TEXT NOT NULL,
    source_ids       UUID[] NOT NULL,
    target_id        UUID NOT NULL,
    similarity_score REAL NOT NULL,
    decision         TEXT NOT NULL,  -- 'auto', 'human_approved', 'human_rejected'
    decided_by       TEXT,
    decided_at       TIMESTAMPTZ NOT NULL DEFAULT now()
);
```
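The schema does not say how `observations.checksum` is computed. One plausible choice, sketched here as an assumption, is a SHA-256 digest over a canonical JSON serialization, so that re-fetching an unchanged payload hits the `UNIQUE(provider, provider_id, checksum)` constraint instead of creating a duplicate row. `payload_checksum` is a hypothetical helper.

```python
import hashlib
import json


def payload_checksum(payload: dict) -> bytes:
    """SHA-256 over a canonical JSON form (sorted keys, no extra whitespace)."""
    canonical = json.dumps(
        payload, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).digest()
```

Sorting keys makes the digest independent of the key order in which a provider happens to serialize the same object.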

---

### Source Trust Hierarchy

```python
SOURCE_TRUST = {
    'musicbrainz': 0.95,  # Community-curated, high accuracy
    'discogs':     0.85,  # Community + physical media focus
    'tidal':       0.80,  # Label direct relationships
    'spotify':     0.75,  # Large scale, some noise
    'deezer':      0.70,  # Good coverage, less curation
    'youtube':     0.60,  # User-generated, low accuracy
}
```

---

### Conflict Resolution Rules

| Field | Strategy | Implementation |
|-------|----------|----------------|
| **Title** | Highest trust + consensus | Score = trust + 0.1×(agreeing_sources - 1) |
| **Duration** | Median within tolerance | Filter to ±3% or ±5s, take median |
| **Explicit** | OR logic | If any source says explicit → explicit |
| **Release Date** | Earliest credible | Must be ≤ today and ≥ 1900 |
| **ISRC** | Highest-trust valid | Validate format, take highest-trust source |
| **Artist** | Embedding similarity | Cluster similar names, pick canonical |
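
Several of these rules are simple enough to sketch as pure functions. The Python below makes a few assumptions that the table leaves open: "±3% or ±5s" is read as whichever tolerance is larger, measured from the median of all reported durations; the title score is clamped to 1.0; and sources with no opinion on `explicit` are passed as `None`. All function names are illustrative.

```python
from datetime import date
from statistics import median
from typing import Optional


def merge_duration(
    durations_ms: list[int], relative: float = 0.03, absolute_ms: int = 5000
) -> Optional[int]:
    """Median of the values that lie close to the overall median."""
    if not durations_ms:
        return None
    center = median(durations_ms)
    tolerance = max(absolute_ms, relative * center)
    kept = [d for d in durations_ms if abs(d - center) <= tolerance]
    if not kept:
        return None  # sources disagree too much to pick a value
    return round(median(kept))


def merge_explicit(flags: list[Optional[bool]]) -> Optional[bool]:
    """OR across sources that reported a value; None if none did."""
    known = [f for f in flags if f is not None]
    return any(known) if known else None


def title_score(trust: float, agreeing_sources: int) -> float:
    """trust + 0.1 per additional agreeing source, clamped to 1.0 (assumption)."""
    return min(1.0, trust + 0.1 * (agreeing_sources - 1))


def merge_release_date(dates: list[date], today: date) -> Optional[date]:
    """Earliest date that is not in the future and not before 1900."""
    credible = [d for d in dates if date(1900, 1, 1) <= d <= today]
    return min(credible, default=None)
```

For instance, `merge_duration([215000, 215400, 214800, 30000])` discards the 30-second outlier and returns 215000.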

---

### Technical Choices

| Component | Choice | Rationale |
|-----------|--------|-----------|
| **Core Language** | Python 3.11+ | Rapid iteration, rich ecosystem |
| **Hot Path** | Rust via PyO3 | Entity resolution blocking/embedding |
| **Database** | PostgreSQL 15+ | JSONB, trigram, pgvector |
| **Cache** | Redis | Entity-keyed, not URL-keyed |
| **Embeddings** | all-MiniLM-L6-v2 | 384-dim, fast, good quality |
| **API** | GraphQL + DataLoader | Explicit batching, no N+1 |
| **Queue** | PostgreSQL SKIP LOCKED | Human review, async processing |
| **Observability** | OpenTelemetry | Trace entity resolution decisions |

---

### Estimated Effort

| Component | Effort | Notes |
|-----------|--------|-------|
| Data model + migrations | 1-4 hours | PostgreSQL schema |
| Provider gateway | 1-2 days | Unified error handling, rate limiting |
| Entity resolution pipeline | 1-2 days | Blocking, similarity, decision |
| Conflict resolution engine | 1-4 hours | Field-level rules |
| Provenance system | 1-4 hours | Audit tables, explain API |
| Human review UI | 1-2 days | Queue management |
| **Total MVP** | **1-2 weeks** | |

---

## Key Takeaways

1. **Hybrid approaches win**: Audio + metadata outperforms either alone (Spotify research: 2-6% improvement)

2. **Provenance is non-negotiable**: Every field needs source tracking, confidence scores, snapshot URLs

3. **Identifier hierarchy matters**: ISWC (work) → ISRC (recording) → UPC (release) with MBIDs as glue

4. **Fuzzy matching requires stages**: Blocking (99.7% reduction) → Similarity → Threshold → Human review

5. **Conflict resolution needs policy**: Field-level precedence rules, not "last write wins"

6. **Cache entities, not requests**: Avoid GraphBrainz's URL-fragmentation trap

7. **Unified error handling**: Result types that force error handling, not silent exceptions