# Aggregators Architecture Analysis & Proposed Solution

Deep analysis of 5 music metadata aggregators, identifying common flaws and proposing a ground-up redesign.

---

## Executive Summary

All 5 aggregators share **common architectural mistakes** that lead to data quality issues, performance problems, and poor extensibility:

| Pattern | Projects Affected | Impact |
|---------|-------------------|--------|
| **No confidence scoring** | 5/5 | Can't distinguish good data from bad |
| **First/last-write-wins merging** | 4/5 | Data loss, no conflict resolution |
| **Silent failure cascades** | 4/5 | Debugging nightmare, data corruption |
| **Naive entity resolution** | 4/5 | Duplicates, mismatches |
| **Provider-specific error handling** | 3/5 | Inconsistent reliability |
| **URL-based cache keys** | 2/5 | Same entity cached multiple times |
| **Disabled batching** | 2/5 | Catastrophic performance |

---

## 1. Harmony - Architectural Flaws

### Critical Issues

#### 1.1 Naive Deduplication (`deduplicate.ts:4-25`)

```typescript
// FLAW: Exact string match only
if (mbid) {
  if (!mbids.has(mbid)) { result.push(entity); mbids.add(mbid); }
} else if (name) {
  if (!names.has(name)) { result.push(entity); names.add(name); }
}
```

**Problem**: "The Beatles" ≠ "Beatles" ≠ "BEATLES" - all three are treated as different entities.

**Fix**: Implement phonetic blocking (Metaphone) + Levenshtein similarity threshold.

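To make the proposed fix concrete, here is a minimal Python sketch of blocking followed by an edit-distance check. It is an illustration under stated assumptions, not Harmony's code: `blocking_key` is a crude consonant-skeleton stand-in for Metaphone (a real implementation would use a proper phonetic library), and the `0.85` threshold and all function names are hypothetical.

```python
import re
import unicodedata
from collections import defaultdict


def normalize(name: str) -> str:
    """Lowercase, strip accents and punctuation, drop a leading article."""
    text = unicodedata.normalize("NFKD", name)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9 ]+", " ", text.lower())
    words = text.split()
    if len(words) > 1 and words[0] in {"the", "a", "an"}:
        words = words[1:]
    return " ".join(words)


def blocking_key(name: str) -> str:
    """Crude phonetic key (NOT real Metaphone): first letter + consonant skeleton."""
    text = normalize(name).replace(" ", "")
    if not text:
        return ""
    skeleton = re.sub(r"[aeiouyhw]", "", text[1:])
    skeleton = re.sub(r"(.)\1+", r"\1", skeleton)  # collapse repeated letters
    return (text[0] + skeleton)[:6]


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, one row at a time."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Normalized similarity in [0, 1] computed on the normalized names."""
    a, b = normalize(a), normalize(b)
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def deduplicate(names: list[str], threshold: float = 0.85) -> list[str]:
    """Keep the first name of each near-duplicate cluster, comparing only within a block."""
    blocks: dict[str, list[str]] = defaultdict(list)
    kept: list[str] = []
    for name in names:
        bucket = blocks[blocking_key(name)]
        if not any(similarity(name, other) >= threshold for other in bucket):
            bucket.append(name)
            kept.append(name)
    return kept
```

Because candidates are compared only inside their block, the expensive Levenshtein step never runs across unrelated names; the trade-off is that a poor blocking key can hide true duplicates from each other.
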
#### 1.2 Limited Compatibility Checks (`compatibility.ts:60-67`)

```typescript
const releaseCompatibilityChecks: CompatibilityCheck<HarmonyRelease>[] = [{
  property: (release) => release.gtin ? Number(release.gtin) : undefined,
  errorMessage: 'Providers have returned multiple different GTIN',
}, {
  property: trackCountSummary,
  errorMessage: 'Providers have returned incompatible track lists',
}];
```

**Problem**: Only checks GTIN and track count. No artist validation, title similarity, or duration checks.

**Fix**: Add artist credit comparison, title Levenshtein distance, duration tolerance (±3%).

#### 1.3 First-Wins Merge with No Confidence (`merge.ts:105-124`)

```typescript
missingReleaseProperties.forEach((property) => {
  const value = cloneInto(mergedRelease, sourceRelease, property);
  if (isFilled(value)) {
    mergedRelease.info.sourceMap[property] = providerName;
    missingReleaseProperties.delete(property); // First wins, done
  }
});
```

**Problem**: First provider to fill a field wins. No quality assessment.

**Fix**: Score each value by source trust × recency × consensus, pick highest.

#### 1.4 No Data Quality Metrics

**Missing**: Confidence scores, match quality, conflict counts, field completeness.

---

## 2. GraphBrainz - Architectural Flaws

### Critical Issues

#### 2.1 Batching Completely Disabled (`loaders.js:38-42`)

```javascript
const lookup = new DataLoader(
  (keys) => { /* ... */ },
  { batch: false } // ← DEFEATS ENTIRE PURPOSE OF DATALOADER
);
```

**Impact**: A query for 20 entities issues 20 separate HTTP requests. At a rate limit of 5 requests per 5.5 s, that takes about **22 seconds** (20 ÷ 5 × 5.5 s).

**Fix**: Implement request coalescing even without a batch API. Deduplicate concurrent identical requests.

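Request coalescing without a batch endpoint can be sketched in a few lines of Python with `asyncio`. This is a hypothetical illustration (the `CoalescingLoader` class and `fake_lookup` function are not from GraphBrainz): concurrent callers asking for the same key share a single in-flight task instead of each triggering a request.

```python
import asyncio
from typing import Any, Callable, Coroutine, Hashable


class CoalescingLoader:
    """Share one in-flight fetch among all concurrent callers of the same key."""

    def __init__(self, fetch: Callable[[Hashable], Coroutine[Any, Any, Any]]) -> None:
        self._fetch = fetch
        self._in_flight: dict[Hashable, asyncio.Task] = {}

    async def load(self, key: Hashable) -> Any:
        task = self._in_flight.get(key)
        if task is None:
            # First caller for this key starts the request; later callers reuse it.
            task = asyncio.create_task(self._fetch(key))
            self._in_flight[key] = task
            # Forget the task once it finishes so later loads fetch fresh data.
            task.add_done_callback(lambda _t, k=key: self._in_flight.pop(k, None))
        return await task


async def demo() -> tuple[list[str], int]:
    calls = 0

    async def fake_lookup(key: Hashable) -> str:  # stand-in for one HTTP request
        nonlocal calls
        calls += 1
        await asyncio.sleep(0.01)
        return f"entity:{key}"

    loader = CoalescingLoader(fake_lookup)
    results = await asyncio.gather(*(loader.load("artist/abc") for _ in range(20)))
    return list(results), calls
```

Running `asyncio.run(demo())` issues one simulated request for the 20 concurrent loads. The sketch only removes duplicate work; it does not reduce the number of distinct keys that must be fetched under the rate limit.
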
#### 2.2 N+1 Queries by Design (`relationship.js:127-138`)

```javascript
relationships: {
  resolve: (entity, args, { loaders }, info) => {
    // If relations not included in initial fetch...
    promise = loaders.lookup.load([entityType, id, params]); // N+1 QUERY
    return promise.then((entity) => entity.relations);
  },
}
```

**Also in**: `recording.js:51-61` (ISRCs), `helpers.js:56-64` (fieldWithID pattern)

**Impact**: Query 100 artists with relationships = 1 + 100 requests.

**Fix**: Query planning phase - analyze full GraphQL query before any resolvers, compute optimal `inc` parameters.

#### 2.3 Cache Fragmentation (`loaders.js:11-20`)

```javascript
// Same artist cached 3 times with different completeness:
loaders.lookup.load(['artist', 'abc', {}])
loaders.lookup.load(['artist', 'abc', { inc: ['releases'] }])
loaders.lookup.load(['artist', 'abc', { inc: ['recordings'] }])
```

**Problem**: URL-based cache keys mean same entity with different `inc` params = different cache entries.

**Fix**: Entity-based cache with incremental enrichment.

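A rough Python sketch of what "entity-based cache with incremental enrichment" could look like follows. The class names and the `fetch(entity_type, entity_id, incs)` callable are assumptions made for illustration; the idea is that one cache entry exists per entity and only the `inc` sub-resources not yet loaded are requested.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class CachedEntity:
    data: dict[str, Any] = field(default_factory=dict)
    loaded_incs: set[str] = field(default_factory=set)
    base_loaded: bool = False


class EntityCache:
    """Cache keyed by (entity_type, id); `inc` sub-resources merge into one entry."""

    def __init__(self, fetch: Callable[[str, str, frozenset[str]], dict[str, Any]]) -> None:
        self._fetch = fetch  # hypothetical provider call: (type, id, incs) -> payload
        self._entries: dict[tuple[str, str], CachedEntity] = {}

    def get(self, entity_type: str, entity_id: str, inc: tuple[str, ...] = ()) -> dict[str, Any]:
        entry = self._entries.setdefault((entity_type, entity_id), CachedEntity())
        missing = frozenset(inc) - entry.loaded_incs
        if missing or not entry.base_loaded:
            # Fetch only what is not cached yet and merge it into the same entry.
            entry.data.update(self._fetch(entity_type, entity_id, missing))
            entry.loaded_incs |= missing
            entry.base_loaded = True
        return entry.data
```

With this shape, the three `load` calls shown above would produce one cache entry that grows from the bare artist to artist plus releases plus recordings, rather than three partially overlapping copies.
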
#### 2.4 Extension System Limitations (`extensions/index.js`)

```javascript
// Only 18 lines. No lifecycle hooks, no dependency management.
export async function loadExtension(extensionModule) {
  return typeof extensionModule === 'string'
    ? await import(extensionModule)
    : extensionModule;
}
```

**Missing**: Lifecycle hooks, resolver interception, middleware support, error boundaries.

---

## 3. Bedrock-API - Architectural Flaws

### Critical Issues

#### 3.1 Missing Proto Fields (`bedrock_service.proto`)

| Missing Field | Impact |
|---------------|--------|
| `album_id` on Track | Can't link tracks to albums bidirectionally |
| `release_date` on Track | Temporal data lost |
| `explicit` flag | Content rating lost |
| `isrc` | International standard ID lost (critical for rights) |
| `verified` on Artist | Badge status lost |
| `label` on Album | Publisher info lost |
| `upc/ean` | Barcode identifiers lost |

#### 3.2 SoundCloud artist_id Bug (`soundcloud.go:457`)

```go
// BUG: Uses track ID instead of user ID
artist_id: fmt.Sprintf("soundcloud:%d", t.ID), // Should be t.User.ID
```

#### 3.3 Listening Stats Don't Persist (`main.go:984-1000`)

```go
func (s *BedrockServer) RecordPlay(ctx context.Context, req *pb.RecordPlayRequest) (*pb.RecordPlayResponse, error) {
    eventID := uuid.New().String()
    // TODO: persist event ← STUB!
    return &pb.RecordPlayResponse{EventId: eventID, Status: pb.ResponseStatus_STATUS_OK}, nil
}
```

**Impact**: `GetPopularTracks` and `GetListeningHistory` return empty - feature non-functional.

#### 3.4 Resolver Bridging Has No Validation (`resolver.go:152-159`)

```go
// Takes first search result without scoring
results, err := s.sc.SearchTracks(ctx, cleanedQuery, 1)
return results[0] // Wrong track if covers/remixes rank first
```

**Missing**: Duration comparison, artist name fuzzy matching, ISRC/UPC verification.

#### 3.5 Spotify Panic Risk (`spotify.go:76-78`)

```go
// No bounds check before indexing
ArtistIDs: wrapper.ArtistIDs[0], // PANIC if empty array
```

---

## 4. minim - Architectural Flaws

### Critical Issues

#### 4.1 Inconsistent Error Handling Per Provider

| Provider | Error Pattern |
|----------|---------------|
| Spotify | Retries on 401, raises `RuntimeError` |
| TIDAL | Parses JSON error, falls back to status |
| Qobuz | Raises with `error['code']` |
| iTunes | Tries `errorMessage`, uses `JSONDecodeError` fallback |
| Discogs | Parses nested `detail` field |

**Impact**: Consumers need provider-specific error handling.

#### 4.2 Missing Retry Logic (3/5 providers)

Only Spotify and Qobuz implement retry. TIDAL, iTunes, Discogs fail immediately on transient errors.

#### 4.3 No Rate Limit Handling

```python
# Missing everywhere:
# - 429 Too Many Requests detection
# - Retry-After header parsing
# - Exponential backoff
```

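The three missing pieces can be combined in one small retry wrapper. The Python sketch below is illustrative only and is not minim's API: `request` is a hypothetical zero-argument callable returning `(status, headers, body)`, the header lookup assumes the exact key `Retry-After`, only the delta-seconds form of that header is parsed, and the base delay and cap are arbitrary defaults.

```python
import random
import time
from typing import Callable, Mapping, Optional


def parse_retry_after(headers: Mapping[str, str]) -> Optional[float]:
    """Return the Retry-After delay in seconds (delta-seconds form only)."""
    value = headers.get("Retry-After")
    if value is None:
        return None
    try:
        return max(0.0, float(value))
    except ValueError:
        return None  # HTTP-date form is not handled in this sketch


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Exponential backoff with full jitter: uniform in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))


def call_with_retry(
    request: Callable[[], tuple[int, Mapping[str, str], str]],
    max_attempts: int = 5,
    sleep: Callable[[float], None] = time.sleep,
) -> tuple[int, str]:
    """Retry on 429 and 5xx; honor Retry-After on 429, otherwise back off exponentially."""
    status = 0
    for attempt in range(max_attempts):
        status, headers, body = request()
        if status != 429 and status < 500:
            return status, body
        if attempt == max_attempts - 1:
            break
        delay = parse_retry_after(headers) if status == 429 else None
        sleep(delay if delay is not None else backoff_delay(attempt))
    raise RuntimeError(f"gave up after {max_attempts} attempts (last status {status})")
```

Injecting `sleep` keeps the wrapper testable without real delays, and placing it in a shared gateway means every provider gets the same behavior instead of each client reimplementing it.
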
#### 4.4 Response Structure Inconsistency

| Provider | Artist Field | Duration Field |
|----------|-------------|----------------|
| Spotify | `album.artists[0].name` | `duration_ms` |
| TIDAL | `data.attributes.name` | `duration` (seconds) |
| iTunes | `artistName` | `trackTimeMillis` |
| Discogs | `artists[0].name` | N/A |

**Impact**: No common data model. Every consumer writes provider-specific parsing.

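A common data model could be introduced with one small adapter per provider. The sketch below is an assumption-laden illustration: the field paths are taken from the table above and have not been checked against the live APIs, `NormalizedTrack` and `normalize_track` are hypothetical names, and TIDAL is left out because the table does not say where `duration` sits in its payload.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass(frozen=True)
class NormalizedTrack:
    provider: str
    artist: str
    duration_ms: Optional[int]  # None when the provider exposes no duration


def _from_spotify(payload: dict[str, Any]) -> NormalizedTrack:
    return NormalizedTrack("spotify", payload["album"]["artists"][0]["name"], payload["duration_ms"])


def _from_itunes(payload: dict[str, Any]) -> NormalizedTrack:
    return NormalizedTrack("itunes", payload["artistName"], payload["trackTimeMillis"])


def _from_discogs(payload: dict[str, Any]) -> NormalizedTrack:
    return NormalizedTrack("discogs", payload["artists"][0]["name"], None)


ADAPTERS: dict[str, Callable[[dict[str, Any]], NormalizedTrack]] = {
    "spotify": _from_spotify,
    "itunes": _from_itunes,
    "discogs": _from_discogs,
}


def normalize_track(provider: str, payload: dict[str, Any]) -> NormalizedTrack:
    """Convert a raw provider payload into the shared model, or fail loudly."""
    try:
        adapter = ADAPTERS[provider]
    except KeyError:
        raise ValueError(f"no adapter registered for provider {provider!r}") from None
    return adapter(payload)
```

Consumers then depend only on `NormalizedTrack`, and unit conversions (for example seconds to milliseconds) live in exactly one adapter each.
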

---

## 5. MusicMetaLinker - Architectural Flaws

### Critical Issues

#### 5.1 Naive Cascading Fallback (`linking.py:159-182`)

```python
def get_artist(self) -> str | None:
    if self.artist: return self.artist
    artist = self.mb_link.get_artist()
    if artist is None:
        artist = self.dz_link.get_artist_name()
    if artist is None:
        artist = self.mb_link.get_artist()  # Called twice!
    if artist is None:
        artist = self.yt_link.get_youtube_artist()
    return artist  # First non-None wins, no quality check
```

**Problems**:
- No confidence scoring
- No conflict detection ("Beyoncé" vs "Beyonce" vs "Beyoncé Knowles")
- Redundant MusicBrainz calls
- Order bias (Deezer always wins over YouTube)

#### 5.2 Silent Failures (`deezer_links.py:102-107`)

```python
try:
    return [res for res in results][:limit]
except Exception:  # Catches EVERYTHING
    return None  # Network error? Invalid input? Who knows!
```

**Impact**: Can't distinguish "no match" from "API failed" from "invalid input".

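One way to keep those three outcomes distinct is an explicit result type. The following Python sketch is hypothetical (the `Outcome` enum, `LookupResult`, and the `client` callable are illustrative names, not MusicMetaLinker code) and only treats connection and timeout errors as provider failures.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Callable, Generic, Iterable, Optional, TypeVar

T = TypeVar("T")


class Outcome(Enum):
    FOUND = "found"
    NO_MATCH = "no_match"
    INVALID_INPUT = "invalid_input"
    PROVIDER_ERROR = "provider_error"


@dataclass(frozen=True)
class LookupResult(Generic[T]):
    outcome: Outcome
    value: Optional[T] = None
    detail: str = ""


def search_tracks(
    query: str, limit: int, client: Callable[[str], Iterable[Any]]
) -> LookupResult[list]:
    """Run a search and report *why* there is no value when there is none."""
    if not query.strip() or limit < 1:
        return LookupResult(Outcome.INVALID_INPUT, detail="empty query or non-positive limit")
    try:
        results = list(client(query))[:limit]
    except (ConnectionError, TimeoutError) as exc:
        # Only transport failures are caught; programming errors still propagate.
        return LookupResult(Outcome.PROVIDER_ERROR, detail=str(exc))
    if not results:
        return LookupResult(Outcome.NO_MATCH)
    return LookupResult(Outcome.FOUND, value=results)
```

Callers must now branch on `outcome`, so a network outage can trigger a retry while a genuine "no match" can fall through to the next provider.
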
#### 5.3 ISRC Handling Bug (`musicbrainz_links.py:77-85`)

```python
for isrc in self.isrc:
    try:
        isrc_result = mb.get_recordings_by_isrc(isrc, ...)
        return isrc_result  # Returns on first success
    except mb.ResponseError:
        return None  # BUG: Should be `continue`, not `return`!
```

**Impact**: A failed lookup for the first ISRC aborts the loop, so the remaining ISRCs are never tried.

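A corrected version of the loop might look like the sketch below. It is a standalone illustration: `ResponseError` stands in for the client library's error type, and `lookup` is a hypothetical callable replacing the real MusicBrainz call so the control flow can be shown in isolation.

```python
from typing import Any, Callable, Iterable, Optional


class ResponseError(Exception):
    """Stand-in for the provider client's 'lookup failed' exception."""


def first_recording_by_isrc(
    isrcs: Iterable[str], lookup: Callable[[str], Any]
) -> Optional[Any]:
    """Return the first successful lookup, trying every ISRC before giving up."""
    for isrc in isrcs:
        try:
            return lookup(isrc)
        except ResponseError:
            continue  # this ISRC failed; move on to the next one
    return None  # every ISRC failed (or none were given)
```
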
#### 5.4 Album Name Truncation (`deezer_links.py:63-78`)

```python
if self.album and " " in self.album:
    self.album = " ".join(self.album.split(" ")[:2])  # Only first 2 words!
```

"The Beatles (Remastered)" → "The Beatles" - loses critical specificity.

#### 5.5 Naive Duration Comparison

Fixed 3-second threshold regardless of track length:
- 3s is huge for a 30-second track (10% error)
- 3s is tiny for a 10-minute track (0.5% error)

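A length-aware comparison could be sketched as follows. The combination rule is an assumption: the tolerance is taken as 3% of the longer duration, capped at 5 seconds, which is one reading of a "±3% or ±5s" threshold that stays strict for short tracks. The function name and defaults are hypothetical.

```python
def durations_match(
    a_ms: int, b_ms: int, rel_tol: float = 0.03, abs_cap_ms: int = 5000
) -> bool:
    """True when two durations differ by at most min(rel_tol * longer, abs_cap_ms)."""
    if a_ms < 0 or b_ms < 0:
        raise ValueError("durations must be non-negative")
    tolerance_ms = min(rel_tol * max(a_ms, b_ms), abs_cap_ms)
    return abs(a_ms - b_ms) <= tolerance_ms
```

Under these assumptions a 3-second gap is rejected for a 30-second track (tolerance just under 1 s) but accepted for a 10-minute track (tolerance capped at 5 s).
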

---

## Proposed Architecture

### Design Principles

1. **Observations are immutable** - No "last write wins"; always preserve raw data
2. **Field-level confidence** - Trust title from MusicBrainz while using duration from Spotify
3. **Three-stage entity resolution** - Blocking → Similarity → Decision
4. **Provenance by default** - Every value is explainable

### Architecture Diagram

```
┌─────────────────────────────────────────────────────────────────────────┐
│                             INGESTION LAYER                             │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐     │
│  │  Provider   │  │  Provider   │  │  Provider   │  │  Provider   │     │
│  │   Adapter   │  │   Adapter   │  │   Adapter   │  │   Adapter   │     │
│  └──────┬──────┘  └──────┬──────┘  └──────┬──────┘  └──────┬──────┘     │
│         └────────────────┴───────┬────────┴────────────────┘            │
│                    ┌─────────────▼──────────────┐                       │
│                    │  Unified Provider Gateway  │                       │
│                    │ • Per-provider rate limit  │                       │
│                    │ • Retry + exp. backoff     │                       │
│                    │ • Circuit breaker          │                       │
│                    │ • Request batching         │                       │
│                    └─────────────┬──────────────┘                       │
└──────────────────────────────────┼──────────────────────────────────────┘
                                   │
                    ┌──────────────▼──────────────┐
                    │    RAW OBSERVATION STORE    │
                    │   (append-only, immutable)  │
                    └──────────────┬──────────────┘
                                   │
┌──────────────────────────────────┼──────────────────────────────────────┐
│                         ENTITY RESOLUTION LAYER                         │
│         ┌────────────────────────▼────────────────────────┐             │
│         │                 BLOCKING STAGE                  │             │
│         │  • ISRC/UPC exact match (99.7% pair reduction)  │             │
│         │  • Phonetic blocking (Metaphone) for names      │             │
│         └────────────────────────┬────────────────────────┘             │
│         ┌────────────────────────▼────────────────────────┐             │
│         │                SIMILARITY STAGE                 │             │
│         │  • Title: Levenshtein + token Jaccard           │             │
│         │  • Artist: embedding cosine similarity          │             │
│         │  • Duration: relative threshold (±3% or ±5s)    │             │
│         └────────────────────────┬────────────────────────┘             │
│         ┌────────────────────────▼────────────────────────┐             │
│         │                 DECISION STAGE                  │             │
│         │  • ≥0.95     → auto-merge                       │             │
│         │  • 0.70-0.95 → human review queue               │             │
│         │  • <0.70     → distinct entities                │             │
│         └────────────────────────┬────────────────────────┘             │
└──────────────────────────────────┼──────────────────────────────────────┘
                                   │
┌──────────────────────────────────┼──────────────────────────────────────┐
│                       CONFLICT RESOLUTION ENGINE                        │
│         ┌────────────────────────▼────────────────────────┐             │
│         │             FIELD-LEVEL MERGE RULES             │             │
│         │ confidence = source_trust × recency × consensus │             │
│         │                                                 │             │
│         │  • Identifiers: ISRC > provider ID              │             │
│         │  • Duration: median within 2s tolerance         │             │
│         │  • Title: MusicBrainz > label > streaming       │             │
│         │  • Release date: earliest credible              │             │
│         │  • Explicit: OR across sources                  │             │
│         └────────────────────────┬────────────────────────┘             │
│         ┌────────────────────────▼────────────────────────┐             │
│         │             CANONICAL ENTITY STORE              │             │
│         │  • Materialized "best known" values             │             │
│         │  • Per-field confidence scores                  │             │
│         │  • Links to all source observations             │             │
│         └─────────────────────────────────────────────────┘             │
└─────────────────────────────────────────────────────────────────────────┘
```

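The similarity and decision stages in the diagram can be illustrated with a very small Python sketch. The function names are hypothetical, token Jaccard is shown as just one of the listed title signals, and the boundary case is resolved by treating exactly 0.95 as an auto-merge.

```python
def token_jaccard(a: str, b: str) -> float:
    """Jaccard overlap of lowercase word sets; 1.0 when both strings are empty."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def decide(score: float) -> str:
    """Map a similarity score in [0, 1] to one of the three decision outcomes."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score out of range: {score}")
    if score >= 0.95:
        return "auto_merge"
    if score >= 0.70:
        return "human_review"
    return "distinct"
```

For example, `decide(token_jaccard("Hey Jude", "Hey Jude Remastered"))` lands below 0.70 on this single signal, which is why the design combines several signals before the decision stage.
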

---

### Core Data Model

```sql
-- Immutable observations from providers
CREATE TABLE observations (
    id          UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    provider    TEXT NOT NULL,
    provider_id TEXT NOT NULL,
    entity_type TEXT NOT NULL,
    payload     JSONB NOT NULL,
    fetched_at  TIMESTAMPTZ NOT NULL DEFAULT now(),
    checksum    BYTEA NOT NULL,
    UNIQUE(provider, provider_id, checksum)
);

-- Canonical entities with confidence
CREATE TABLE tracks (
    id UUID PRIMARY KEY,

    -- Identifiers
    isrc TEXT,
    iswc TEXT,
    mbid UUID,

    -- Fields with confidence
    title               TEXT NOT NULL,
    title_confidence    REAL NOT NULL DEFAULT 0.0,

    duration_ms         INT,
    duration_confidence REAL NOT NULL DEFAULT 0.0,

    explicit            BOOLEAN,
    explicit_confidence REAL NOT NULL DEFAULT 0.0,

    -- Denormalized
    artist_credit TEXT NOT NULL,
    album_title   TEXT,

    -- Metadata
    created_at    TIMESTAMPTZ NOT NULL DEFAULT now(),
    updated_at    TIMESTAMPTZ NOT NULL DEFAULT now(),
    merge_version INT NOT NULL DEFAULT 1
);

-- Field-level provenance
CREATE TABLE field_sources (
    entity_type    TEXT NOT NULL,
    entity_id      UUID NOT NULL,
    field_name     TEXT NOT NULL,
    observation_id UUID NOT NULL REFERENCES observations(id),
    confidence     REAL NOT NULL,
    selected       BOOLEAN NOT NULL DEFAULT false,
    PRIMARY KEY (entity_type, entity_id, field_name, observation_id)
);

-- Cross-reference table
CREATE TABLE provider_links (
    entity_type TEXT NOT NULL,
    entity_id   UUID NOT NULL,
    provider    TEXT NOT NULL,
    provider_id TEXT NOT NULL,
    verified    BOOLEAN NOT NULL DEFAULT false,
    PRIMARY KEY (entity_type, provider, provider_id)
);

-- Entity resolution audit trail
CREATE TABLE merge_decisions (
    id               UUID PRIMARY KEY,
    entity_type      TEXT NOT NULL,
    source_ids       UUID[] NOT NULL,
    target_id        UUID NOT NULL,
    similarity_score REAL NOT NULL,
    decision         TEXT NOT NULL,  -- 'auto', 'human_approved', 'human_rejected'
    decided_by       TEXT,
    decided_at       TIMESTAMPTZ NOT NULL DEFAULT now()
);
```

---

### Source Trust Hierarchy

```python
SOURCE_TRUST = {
    'musicbrainz': 0.95,  # Community-curated, high accuracy
    'discogs': 0.85,      # Community + physical media focus
    'tidal': 0.80,        # Label direct relationships
    'spotify': 0.75,      # Large scale, some noise
    'deezer': 0.70,       # Good coverage, less curation
    'youtube': 0.60,      # User-generated, low accuracy
}
```

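To show how these trust values might feed the `confidence = source_trust × recency × consensus` formula, here is a minimal sketch. The document does not define recency or consensus precisely, so both are assumptions here: recency is an exponential decay with a one-year half-life, consensus is the share of observations agreeing on a value, and unknown providers default to a trust of 0.5. The `Observation` type and `pick_value` function are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass

SOURCE_TRUST = {
    "musicbrainz": 0.95,
    "discogs": 0.85,
    "tidal": 0.80,
    "spotify": 0.75,
    "deezer": 0.70,
    "youtube": 0.60,
}


@dataclass(frozen=True)
class Observation:
    provider: str
    value: str
    age_days: float


def recency(age_days: float, half_life_days: float = 365.0) -> float:
    """Assumed decay: the weight halves every `half_life_days`."""
    return 0.5 ** (max(age_days, 0.0) / half_life_days)


def pick_value(observations: list[Observation]) -> tuple[str, float]:
    """Return (best value, confidence) using trust × recency × consensus."""
    if not observations:
        raise ValueError("no observations to merge")
    groups: dict[str, list[Observation]] = defaultdict(list)
    for obs in observations:
        groups[obs.value].append(obs)
    best_value, best_score = "", -1.0
    for value, group in groups.items():
        consensus = len(group) / len(observations)
        # Strongest single supporter of this value, discounted by its age.
        support = max(SOURCE_TRUST.get(o.provider, 0.5) * recency(o.age_days) for o in group)
        score = support * consensus
        if score > best_score:
            best_value, best_score = value, score
    return best_value, best_score
```

The returned score is what would be stored in a per-field confidence column, while the losing observations remain available for provenance.
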

---

### Conflict Resolution Rules

| Field | Strategy | Implementation |
|-------|----------|----------------|
| **Title** | Highest trust + consensus | Score = trust + 0.1×(agreeing_sources - 1) |
| **Duration** | Median within tolerance | Filter to ±3% or ±5s, take median |
| **Explicit** | OR logic | If any source says explicit → explicit |
| **Release Date** | Earliest credible | Must be ≤ today and ≥ 1900 |
| **ISRC** | Highest-trust valid | Validate format, take the value from the highest-trust source |
| **Artist** | Embedding similarity | Cluster similar names, pick canonical |

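Several of these rules are simple enough to sketch directly. The Python below is an illustration with hypothetical function names; it assumes the duration tolerance means 3% of the median capped at 5 seconds, strips hyphens from ISRCs before validating the standard 12-character layout, and returns `None` whenever no credible value exists.

```python
import re
from datetime import date
from statistics import median
from typing import Optional

# ISRC: 2-letter country code, 3 alphanumeric registrant chars, 7 digits.
ISRC_PATTERN = re.compile(r"[A-Z]{2}[A-Z0-9]{3}[0-9]{7}")


def title_score(trust: float, agreeing_sources: int) -> float:
    """Score = trust + 0.1 × (agreeing_sources - 1), as in the table."""
    return trust + 0.1 * (agreeing_sources - 1)


def merge_explicit(flags: list[Optional[bool]]) -> Optional[bool]:
    """OR across sources that reported a value; None if nobody did."""
    known = [flag for flag in flags if flag is not None]
    return any(known) if known else None


def merge_duration_ms(values: list[int], rel_tol: float = 0.03, cap_ms: int = 5000) -> Optional[int]:
    """Drop outliers around the median, then take the median of what remains."""
    if not values:
        return None
    center = median(values)
    tolerance = min(rel_tol * center, cap_ms)
    kept = [v for v in values if abs(v - center) <= tolerance]
    return int(median(kept)) if kept else None


def merge_release_date(candidates: list[date], today: Optional[date] = None) -> Optional[date]:
    """Earliest date that is not in the future and not before 1900."""
    today = today or date.today()
    credible = [d for d in candidates if date(1900, 1, 1) <= d <= today]
    return min(credible) if credible else None


def pick_isrc(candidates: list[tuple[float, str]]) -> Optional[str]:
    """Given (trust, isrc) pairs, return the valid ISRC from the most trusted source."""
    cleaned = [(trust, code.replace("-", "").upper()) for trust, code in candidates]
    valid = [(trust, code) for trust, code in cleaned if ISRC_PATTERN.fullmatch(code)]
    return max(valid)[1] if valid else None
```

Returning `None` instead of guessing keeps unresolved conflicts visible, so they can be routed to review rather than silently materialized.
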

---

### Technical Choices

| Component | Choice | Rationale |
|-----------|--------|-----------|
| **Core Language** | Python 3.11+ | Rapid iteration, rich ecosystem |
| **Hot Path** | Rust via PyO3 | Entity resolution blocking/embedding |
| **Database** | PostgreSQL 15+ | JSONB, trigram, pgvector |
| **Cache** | Redis | Entity-keyed, not URL-keyed |
| **Embeddings** | all-MiniLM-L6-v2 | 384-dim, fast, good quality |
| **API** | GraphQL + DataLoader | Explicit batching, no N+1 |
| **Queue** | PostgreSQL SKIP LOCKED | Human review, async processing |
| **Observability** | OpenTelemetry | Trace entity resolution decisions |

---

### Estimated Effort

| Component | Effort | Notes |
|-----------|--------|-------|
| Data model + migrations | 1-4 hours | PostgreSQL schema |
| Provider gateway | 1-2 days | Unified error handling, rate limiting |
| Entity resolution pipeline | 1-2 days | Blocking, similarity, decision |
| Conflict resolution engine | 1-4 hours | Field-level rules |
| Provenance system | 1-4 hours | Audit tables, explain API |
| Human review UI | 1-2 days | Queue management |
| **Total MVP** | **1-2 weeks** | |

---

## Key Takeaways

1. **Hybrid approaches win**: Audio + metadata outperforms either alone (Spotify research: 2-6% improvement)

2. **Provenance is non-negotiable**: Every field needs source tracking, confidence scores, snapshot URLs

3. **Identifier hierarchy matters**: ISWC (work) → ISRC (recording) → UPC (release) with MBIDs as glue

4. **Fuzzy matching requires stages**: Blocking (99.7% reduction) → Similarity → Threshold → Human review

5. **Conflict resolution needs policy**: Field-level precedence rules, not "last write wins"

6. **Cache entities, not requests**: Avoid GraphBrainz's URL-fragmentation trap

7. **Unified error handling**: Result types that force error handling, not silent exceptions