feat: initial implementation of metadata aggregator

- gRPC service with MusicBrainz provider
- PostgreSQL schema with migrations
- Service layer with database-first caching
- Repository pattern for data access
- YAML configuration support
- Research documentation for 17 music metadata projects
Alexander
2026-04-28 16:27:14 +02:00
commit a1f6701bac
163 changed files with 95884 additions and 0 deletions
@@ -0,0 +1,795 @@
# Harmony - Architecture Analysis
## System Architecture Overview
Harmony implements a **4-stage pipeline architecture** for metadata aggregation and harmonization:
```
┌──────────┐     ┌────────────┐     ┌───────┐     ┌──────┐
│  LOOKUP  │ --> │ HARMONIZE  │ --> │ MERGE │ --> │ SEED │
└──────────┘     └────────────┘     └───────┘     └──────┘
     │                 │                │            │
  Parallel          Provider         3-phase     MusicBrainz
Multi-source       Conversion         Merge        Format
  Queries          to Harmony       Algorithm    Conversion
```
Each stage has distinct responsibilities and operates on well-defined data structures.
## Stage 1: LOOKUP
### CombinedReleaseLookup
The entry point for all metadata retrieval operations.
**Location**: `harmonizer/combined_lookup.ts`
**Responsibilities**:
- Accepts GTIN, URLs, or provider-specific IDs
- Determines which providers to query based on input
- Executes provider lookups in parallel
- Handles provider failures gracefully via `Promise.allSettled`
- Returns array of provider-specific release objects
**Input Types**:
```typescript
interface LookupInput {
  gtin?: string;                        // Global Trade Item Number (barcode)
  urls?: string[];                      // Provider URLs
  region?: string[];                    // Market regions (e.g., ['GB', 'US', 'JP'])
  category?: string;                    // Provider category filter
  providerIds?: Record<string, string>; // Provider-specific IDs
}
```
**Parallel Execution**:
```typescript
// Conceptual flow: Promise.allSettled records each rejection as a settled
// result, so no per-provider catch is needed.
const lookupPromises = providers.map(provider => provider.lookup(input));
const results = await Promise.allSettled(lookupPromises);
```
**Output**: Array of provider-native release objects (Spotify, Deezer, iTunes formats, etc.)
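The settled results still have to be split into usable releases and reportable failures. The following is a minimal sketch of that step; `LookupOutcome` and `partitionSettled` are hypothetical names, and the real implementation may report failures differently.

```typescript
// Hypothetical result shape; the source only states that failures are handled gracefully.
interface LookupOutcome<T> {
  releases: Array<{ provider: string; release: T }>;
  failures: Array<{ provider: string; reason: string }>;
}

// Splits the output of Promise.allSettled into successes and failures,
// pairing each result with the provider name at the same index.
function partitionSettled<T>(
  providerNames: string[],
  results: PromiseSettledResult<T>[],
): LookupOutcome<T> {
  const outcome: LookupOutcome<T> = { releases: [], failures: [] };
  results.forEach((result, index) => {
    const provider = providerNames[index];
    if (result.status === 'fulfilled') {
      outcome.releases.push({ provider, release: result.value });
    } else {
      // `reason` is whatever the provider threw; stringify it for reporting.
      outcome.failures.push({ provider, reason: String(result.reason) });
    }
  });
  return outcome;
}
```

Keeping failures alongside releases lets a caller show partial results together with a warning per failed provider.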
### Provider Selection Logic
1. **URL-based**: Extract provider from URL pattern matching
2. **GTIN-based**: Query all providers supporting GTIN lookup
3. **Category filtering**: Apply user preferences (all/default/preferred)
4. **Region filtering**: Pass region codes to region-aware providers
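The first three selection rules can be sketched as follows. `ProviderInfo`, `SelectionInput`, and `selectProviders` are hypothetical names, a plain `RegExp` stands in for the URL pattern matching, and region filtering is left out because it only affects the arguments passed to each lookup.

```typescript
// Hypothetical minimal provider descriptor; the real provider classes are richer.
interface ProviderInfo {
  name: string;
  urlPattern: RegExp;    // assumed stand-in for URL pattern matching
  supportsGtin: boolean;
  categories: string[];  // e.g. ['default', 'preferred']
}

interface SelectionInput {
  gtin?: string;
  urls?: string[];
  category?: string;
}

function selectProviders(input: SelectionInput, all: ProviderInfo[]): string[] {
  const selected = new Set<string>();
  // 1. URL-based: any provider whose pattern matches a given URL.
  for (const url of input.urls ?? []) {
    for (const provider of all) {
      if (provider.urlPattern.test(url)) selected.add(provider.name);
    }
  }
  // 2. GTIN-based: every GTIN-capable provider, 3. narrowed by category.
  if (input.gtin) {
    const category = input.category ?? 'all';
    for (const provider of all) {
      const inCategory = category === 'all' || provider.categories.includes(category);
      if (provider.supportsGtin && inCategory) selected.add(provider.name);
    }
  }
  return Array.from(selected);
}

// Illustrative provider list; the patterns are assumptions, not the real ones.
const exampleProviders: ProviderInfo[] = [
  { name: 'spotify', urlPattern: /^https:\/\/open\.spotify\.com\/album\//, supportsGtin: true, categories: ['default', 'preferred'] },
  { name: 'deezer', urlPattern: /^https:\/\/www\.deezer\.com\/album\//, supportsGtin: true, categories: ['default'] },
  { name: 'bandcamp', urlPattern: /^https:\/\/[^/]+\.bandcamp\.com\/album\//, supportsGtin: false, categories: [] },
];
```

In this sketch a URL always selects its provider regardless of category, while the category filter applies only to the GTIN fan-out.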
## Stage 2: HARMONIZE
### Provider Conversion
Each provider implements a `harmonize()` method that converts its native format to `HarmonyRelease`.
**Location**: Individual provider files in `providers/`
**Conversion Responsibilities**:
- Map provider-specific field names to Harmony schema
- Normalize data types (dates, durations, ISRCs)
- Extract nested structures (artists, labels, media)
- Detect language and script from metadata
- Resolve release types (album, single, EP, etc.)
- Extract external links and identifiers
**Example Provider Conversion** (conceptual):
```typescript
class SpotifyProvider extends MetadataApiProvider {
  harmonize(spotifyAlbum: SpotifyAlbum): HarmonyRelease {
    return {
      title: spotifyAlbum.name,
      artists: this.convertArtists(spotifyAlbum.artists),
      gtin: spotifyAlbum.external_ids?.upc,
      media: this.convertTracks(spotifyAlbum.tracks),
      releaseDate: this.parseDate(spotifyAlbum.release_date),
      images: this.convertImages(spotifyAlbum.images),
      externalLinks: [{
        url: spotifyAlbum.external_urls.spotify,
        types: ['streaming']
      }],
      // ... additional fields
    };
  }
}
```
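The `parseDate` helper used in the example above is not shown in the source. One way such a helper might normalize provider dates is sketched here, assuming that `PartialDate` carries optional `year`, `month`, and `day` fields and that providers deliver ISO-style strings such as `2014`, `2014-11`, or `2014-11-24`.

```typescript
// Assumed shape; the source references PartialDate without defining it.
interface PartialDate {
  year?: number;
  month?: number;
  day?: number;
}

// Parses 'YYYY', 'YYYY-MM', or 'YYYY-MM-DD' and keeps only the parts that are present.
function parsePartialDate(value: string): PartialDate | undefined {
  const match = /^(\d{4})(?:-(\d{2}))?(?:-(\d{2}))?$/.exec(value.trim());
  if (!match) return undefined;
  const date: PartialDate = { year: Number(match[1]) };
  if (match[2]) date.month = Number(match[2]);
  if (match[3]) date.day = Number(match[3]);
  return date;
}
```

Returning a partial structure rather than a `Date` preserves the precision the provider actually supplied, which matters later when year-only and full dates are compared.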
### HarmonyRelease Schema
**Location**: `harmonizer/types.ts` (273 lines)
**Core Structure**:
```typescript
interface HarmonyRelease {
  // Basic metadata
  title: string;
  artists: ArtistCreditName[];
  gtin?: string;

  // Media and tracks
  media: HarmonyMedium[];

  // Release details
  language?: string;
  script?: string;
  status?: ReleaseStatus;
  types: ReleaseType[];
  releaseDate?: PartialDate;

  // Commercial info
  labels: Label[];
  packaging?: PackagingType;
  copyright?: string;

  // Distribution
  availableIn?: string[];   // Country codes
  excludedFrom?: string[];  // Country codes

  // Visual assets
  images: Image[];

  // Links and identifiers
  externalLinks: ExternalLink[];

  // Metadata about metadata
  info: {
    providers: string[];    // Which providers contributed
    messages: Message[];    // Warnings, errors
    sourceMap?: SourceMap;  // Property -> provider mapping
    incompatibleData?: IncompatibilityInfo;
  };
}
```
**Key Sub-structures**:
#### ArtistCreditName
```typescript
interface ArtistCreditName {
  name: string;           // Display name
  creditedName?: string;  // Alternative credit
  joinPhrase?: string;    // Separator (e.g., " & ", " feat. ")
  mbid?: string;          // MusicBrainz ID
}
```
#### HarmonyMedium
```typescript
interface HarmonyMedium {
  title?: string;
  format?: MediumFormat;  // CD, Vinyl, Digital, etc.
  position: number;
  tracks: HarmonyTrack[];
}
```
#### HarmonyTrack
```typescript
interface HarmonyTrack {
  title: string;
  artists?: ArtistCreditName[];
  position: number;
  length?: number;  // Duration in milliseconds
  isrc?: string;    // International Standard Recording Code
}
```
#### Label
```typescript
interface Label {
  name: string;
  catalogNumber?: string;
  mbid?: string;
}
```
#### Image
```typescript
interface Image {
  url: string;
  types: ImageType[];  // 'front', 'back', 'medium', etc.
  width?: number;
  height?: number;
  comment?: string;
}
```
### Harmonizer Modules
**Location**: `harmonizer/` directory
| Module | Purpose | Lines |
|--------|---------|-------|
| `types.ts` | HarmonyRelease schema and type definitions | 273 |
| `merge.ts` | 3-phase merge algorithm | ~200 |
| `compatibility.ts` | Conflict detection and resolution | ~150 |
| `deduplicate.ts` | Remove duplicate entries | ~100 |
| `isrc.ts` | ISRC validation and normalization | ~50 |
| `language_script.ts` | Auto-detect language and script | ~100 |
| `release_label.ts` | Label normalization | ~80 |
| `release_types.ts` | Release type inference | ~120 |
| `tracklist_gap.ts` | Detect missing tracks | ~60 |
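As an illustration of what the ISRC module might do, here is a minimal sketch of validation and normalization. The function name is hypothetical; the pattern follows the standard ISRC layout of a two-letter country code, a three-character registrant code, a two-digit year, and a five-digit designation.

```typescript
// Strips separators, upper-cases, and validates the 12-character ISRC layout.
// Returns the compact form, or undefined when the input is not a valid ISRC.
function normalizeIsrc(raw: string): string | undefined {
  const compact = raw.replace(/[\s-]/g, '').toUpperCase();
  return /^[A-Z]{2}[A-Z0-9]{3}\d{7}$/.test(compact) ? compact : undefined;
}
```

Normalizing to the compact upper-case form before comparison means that `US-RC1-76-07839` and `usrc17607839` are treated as the same recording code.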
## Stage 3: MERGE
### 3-Phase Merge Algorithm
**Location**: `harmonizer/merge.ts`
The merge algorithm combines multiple `HarmonyRelease` objects into a single `MergedHarmonyRelease` using provider preferences and compatibility checking.
#### Phase 1: Property Collection
Collect all values for each property across all releases:
```typescript
// Conceptual
const propertyValues = {
  title: ['Album Title', 'Album Title (Deluxe)', 'Album Title'],
  gtin: ['0602537347377', '0602537347377'],
  releaseDate: ['2014-11-24', '2014-11-24', '2014-11-25'],
  // ... all properties
};
```
#### Phase 2: Compatibility Checking
For each property, check if values are compatible:
```typescript
interface CompatibilityCheck {
  compatible: boolean;
  canonicalValue?: any;
  conflicts?: ConflictInfo[];
}
```
**Compatibility Rules**:
- **Strings**: Case-insensitive comparison, whitespace normalization
- **Dates**: Partial date matching (year-only vs. full date)
- **Arrays**: Set comparison (order-independent)
- **Numbers**: Exact match or within tolerance
- **Objects**: Recursive field comparison
**Example Compatibility**:
```typescript
// Compatible pairs
['2014-11-24', '2014-11'];       // Partial date match
['Album Title', 'album title'];  // Case-insensitive
// Incompatible pairs
['2014-11-24', '2014-11-25'];    // Date conflict
['Album', 'EP'];                 // Type conflict
```
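The string, date, and array rules above can be sketched as small predicates. The function names are hypothetical, and dates are assumed to be ISO-style strings whose components are separated by hyphens.

```typescript
// Strings: compare after trimming, collapsing whitespace, and lower-casing.
function normalizeText(value: string): string {
  return value.trim().replace(/\s+/g, ' ').toLowerCase();
}

function stringsCompatible(a: string, b: string): boolean {
  return normalizeText(a) === normalizeText(b);
}

// Dates: compatible when every component present in both values agrees,
// so '2014-11' is compatible with '2014-11-24' but not with '2014-12-24'.
function datesCompatible(a: string, b: string): boolean {
  const partsA = a.split('-');
  const partsB = b.split('-');
  const shared = Math.min(partsA.length, partsB.length);
  for (let i = 0; i < shared; i++) {
    if (partsA[i] !== partsB[i]) return false;
  }
  return true;
}

// Arrays: order-independent comparison of the distinct elements.
function setsCompatible(a: string[], b: string[]): boolean {
  const setA = new Set(a);
  const setB = new Set(b);
  return setA.size === setB.size && a.every(item => setB.has(item));
}
```

Applied to the examples above, `datesCompatible('2014-11-24', '2014-11')` holds while `datesCompatible('2014-11-24', '2014-11-25')` does not.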
#### Phase 3: Value Selection
For each property, select the best value using provider preferences:
**Provider Preference Order** (configurable):
1. MusicBrainz (template/reference)
2. Spotify (high quality, comprehensive)
3. Tidal (high quality audio metadata)
4. Deezer (good coverage)
5. iTunes (region-specific)
6. Bandcamp (artist-verified)
7. Beatport (electronic music specialist)
8. Mora (Japan specialist)
9. Ototoy (Japan specialist)
**Selection Logic**:
```typescript
// Assumed shape: each candidate carries its provider and a compatibility flag.
type PropertyValues = Array<{ provider: string; data: unknown; isCompatible: boolean }>;

function selectBestValue(values: PropertyValues, preferences: string[]): unknown {
  // 1. Filter to compatible values only
  const compatible = values.filter(v => v.isCompatible);

  // 2. If no compatible values, mark as conflict
  if (compatible.length === 0) {
    return { conflict: true, values };
  }

  // 3. Select from highest-preference provider
  for (const provider of preferences) {
    const value = compatible.find(v => v.provider === provider);
    if (value) return value.data;
  }

  // 4. Fallback to first compatible value
  return compatible[0].data;
}
```
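To show how the three phases might fit together for a single property, the sketch below takes the collected values, picks the value from the highest-preference provider, and reports every other value that is incompatible with it. This is one possible reading of a lenient merge, not the algorithm in `merge.ts`; `ProviderValue`, `MergeResult`, and `mergeProperty` are hypothetical names.

```typescript
interface ProviderValue<T> {
  provider: string;
  data: T;
}

interface MergeResult<T> {
  value: T;                       // Selected value
  source: string;                 // Provider the value came from
  conflicts: ProviderValue<T>[];  // Values that disagree with the selection
}

function mergeProperty<T>(
  values: ProviderValue<T>[],
  preferences: string[],
  isCompatible: (a: T, b: T) => boolean,
): MergeResult<T> | undefined {
  if (values.length === 0) return undefined;
  // Providers missing from the preference list rank after all listed ones.
  const rank = (provider: string): number => {
    const index = preferences.indexOf(provider);
    return index === -1 ? preferences.length : index;
  };
  const ordered = [...values].sort((a, b) => rank(a.provider) - rank(b.provider));
  const [best, ...rest] = ordered;
  return {
    value: best.data,
    source: best.provider,
    conflicts: rest.filter(other => !isCompatible(best.data, other.data)),
  };
}
```

The `source` field corresponds to a source map entry for the property, and `conflicts` is the raw material for incompatibility warnings.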
### MergedHarmonyRelease
Extends `HarmonyRelease` with merge metadata:
```typescript
interface MergedHarmonyRelease extends HarmonyRelease {
  sourceMap: SourceMap;  // Property -> provider mapping
  incompatibleData?: IncompatibilityInfo;
}

interface SourceMap {
  [propertyPath: string]: string;  // e.g., "title" -> "spotify"
}

interface IncompatibilityInfo {
  conflicts: Conflict[];
  warnings: string[];
}

interface Conflict {
  property: string;
  values: Array<{
    provider: string;
    value: any;
  }>;
}
```
### Deduplication
**Location**: `harmonizer/deduplicate.ts`
Removes duplicate entries in arrays:
- **Artists**: Match by name (case-insensitive) or MBID
- **Labels**: Match by name and catalog number
- **Tracks**: Match by position and title
- **Images**: Match by URL or dimensions
- **External links**: Match by URL
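The artist rule (match by case-insensitive name or by MBID) could look roughly like this. The sketch uses a reduced `ArtistCredit` type and a hypothetical `deduplicateArtists` function; keeping the first occurrence and adopting a later MBID is an assumption.

```typescript
// Reduced artist shape for this sketch.
interface ArtistCredit {
  name: string;
  mbid?: string;
}

// Keeps the first occurrence of each artist, matching by MBID or by
// case-insensitive name, and preserves input order.
function deduplicateArtists(artists: ArtistCredit[]): ArtistCredit[] {
  const result: ArtistCredit[] = [];
  for (const artist of artists) {
    const duplicate = result.find(kept =>
      (artist.mbid !== undefined && artist.mbid === kept.mbid) ||
      artist.name.trim().toLowerCase() === kept.name.trim().toLowerCase()
    );
    if (!duplicate) {
      result.push({ ...artist });
    } else if (!duplicate.mbid && artist.mbid) {
      // Assumption: adopt the MBID when only the later entry has one.
      duplicate.mbid = artist.mbid;
    }
  }
  return result;
}
```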
### Compatibility Checking
**Location**: `harmonizer/compatibility.ts`
Detects and reports incompatible data:
**Incompatibility Types**:
1. **Value conflicts**: Different values for same property
2. **Type conflicts**: Different data types
3. **Structural conflicts**: Different array lengths, missing required fields
4. **Semantic conflicts**: Logically incompatible values (e.g., release date before artist birth)
**Handling**:
- **Strict mode**: Reject merge if any conflicts
- **Lenient mode**: Prefer highest-quality provider, log warnings
- **User override**: Allow manual conflict resolution
## Stage 4: SEED
### MusicBrainz Seeding
**Location**: `musicbrainz/seeding.ts`
Converts `MergedHarmonyRelease` to MusicBrainz import format.
**Conversion Steps**:
1. Map HarmonyRelease fields to MusicBrainz schema
2. Generate edit notes with provider URLs
3. Create permalink for reproducibility
4. Build annotation with extra data (copyright, availability)
5. Format for MusicBrainz seeder form
**MusicBrainz Mapping**:
| Harmony Field | MusicBrainz Field | Notes |
|---------------|-------------------|-------|
| `title` | Release name | Direct mapping |
| `artists` | Artist credit | Join with `joinPhrase` |
| `gtin` | Barcode | Validate format |
| `releaseDate` | Release events | Per-country events |
| `labels` | Release labels | With catalog numbers |
| `media` | Mediums | With format and tracks |
| `types` | Release group types | Primary + secondary |
| `language` | Language | ISO 639-3 code |
| `script` | Script | ISO 15924 code |
| `packaging` | Packaging | Jewel case, digipak, etc. |
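The "Validate format" note for barcodes can be made concrete with the standard GS1 check-digit calculation. The function name is hypothetical, and accepting lengths of 8, 12, 13, and 14 digits is an assumption about which GTIN variants are supported.

```typescript
// Validates a GTIN-8/12/13/14 using the GS1 mod-10 check digit.
function isValidGtin(gtin: string): boolean {
  if (!/^(\d{8}|\d{12,14})$/.test(gtin)) return false;
  const digits = gtin.split('').map(Number);
  const checkDigit = digits.pop() as number;
  // Weights alternate 3, 1, 3, ... starting from the rightmost data digit.
  const sum = digits
    .reverse()
    .reduce((total, digit, index) => total + digit * (index % 2 === 0 ? 3 : 1), 0);
  return (10 - (sum % 10)) % 10 === checkDigit;
}
```

The GTIN used in the examples of this document, `0602537347377`, passes this check: its weighted sum is 103, which gives a check digit of 7.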
**Edit Note Generation**:
```typescript
function generateEditNote(release: MergedHarmonyRelease, permalink: string): string {
  const sources = release.info.providers.join(', ');
  // Joining lines avoids template-literal indentation leaking into the note.
  return [
    `Imported from ${sources} via Harmony`,
    `Permalink: ${permalink}`,
    ...release.externalLinks.map(link => link.url),
  ].join('\n');
}
```
### MBID Resolution
**Location**: `musicbrainz/mbid_mapping.ts`
Resolves external URLs to MusicBrainz IDs (MBIDs).
**Batch Lookup**:
- Collects up to 100 URLs
- Single MusicBrainz API request: `GET /ws/2/url?resource={url1}&resource={url2}&...`
- Caches results in localStorage (dev) or sessionStorage (prod)
- Returns MBID mappings
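A sketch of how the batch requests might be assembled: the URL list is split into chunks of at most 100, and each chunk becomes one request with repeated `resource` parameters. The function name, the `fmt=json` parameter, and the `https://musicbrainz.org` base URL are choices made for this example; sending the requests and caching the results are omitted.

```typescript
const MAX_URLS_PER_REQUEST = 100;

// Builds one MusicBrainz URL-lookup request per chunk of up to 100 URLs.
function buildUrlLookupRequests(urls: string[]): string[] {
  const requests: string[] = [];
  for (let start = 0; start < urls.length; start += MAX_URLS_PER_REQUEST) {
    const params = new URLSearchParams();
    for (const url of urls.slice(start, start + MAX_URLS_PER_REQUEST)) {
      params.append('resource', url); // URLSearchParams percent-encodes the value
    }
    params.set('fmt', 'json');
    requests.push(`https://musicbrainz.org/ws/2/url?${params.toString()}`);
  }
  return requests;
}
```

Using `URLSearchParams` rather than manual string concatenation ensures that characters such as `?` and `&` inside the external URLs do not break the query string.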
**Duplicate Detection**:
- Checks if release already exists in MusicBrainz
- Warns user before creating duplicate
- Provides link to existing release
**Cache Strategy**:
```typescript
interface MBIDCache {
  [externalUrl: string]: {
    mbid: string;
    type: 'release' | 'release-group' | 'recording' | 'artist';
    cached: number;  // Timestamp
  };
}
```
### Annotation Builder
**Location**: `musicbrainz/annotation.ts`
Generates MusicBrainz annotation text for additional metadata:
**Included Data**:
- Copyright information
- Availability/exclusion regions
- Provider-specific notes
- Compatibility warnings
- Image URLs (if not added as cover art)
**Format**:
```
Copyright: © 2014 Record Label
Available in: US, GB, DE, JP
Excluded from: CN
Sources:
- Spotify: https://open.spotify.com/album/xyz
- Deezer: https://www.deezer.com/album/123
Notes:
- Release date conflict: Spotify (2014-11-24) vs iTunes (2014-11-25)
```
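The format above could be produced by a builder along these lines. `AnnotationInput` and `buildAnnotation` are hypothetical names; sections with no data are simply skipped.

```typescript
// Hypothetical input shape gathering the data listed under "Included Data".
interface AnnotationInput {
  copyright?: string;
  availableIn?: string[];
  excludedFrom?: string[];
  sources: Array<{ provider: string; url: string }>;
  notes: string[];
}

function buildAnnotation(input: AnnotationInput): string {
  const lines: string[] = [];
  if (input.copyright) lines.push(`Copyright: ${input.copyright}`);
  if (input.availableIn?.length) lines.push(`Available in: ${input.availableIn.join(', ')}`);
  if (input.excludedFrom?.length) lines.push(`Excluded from: ${input.excludedFrom.join(', ')}`);
  if (input.sources.length) {
    lines.push('Sources:', ...input.sources.map(s => `- ${s.provider}: ${s.url}`));
  }
  if (input.notes.length) {
    lines.push('Notes:', ...input.notes.map(note => `- ${note}`));
  }
  return lines.join('\n');
}
```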
## Provider Architecture
### Base Class Hierarchy
```
MetadataProvider (abstract)
├── MetadataApiProvider (OAuth2 support)
│   ├── SpotifyProvider
│   └── TidalProvider
├── ReleaseLookup (GTIN/URL/ID support)
│   ├── DeezerProvider
│   ├── iTunesProvider
│   ├── BandcampProvider
│   ├── BeatportProvider
│   ├── MoraProvider
│   └── OtotoyProvider
└── ReleaseApiLookup (multi-region support)
    ├── iTunesProvider
    └── DeezerProvider
```
### MetadataProvider (Abstract Base)
**Location**: `providers/base.ts`
**Core Responsibilities**:
- URL pattern matching via `URLPattern`
- Rate limiting with configurable delays
- HTTP response caching via `snap_storage`
- Error handling and retry logic
- Feature quality ratings
**Key Methods**:
```typescript
// Signatures only; method bodies are omitted.
abstract class MetadataProvider {
  // URL pattern matching
  abstract urlPattern: URLPattern;
  matchesUrl(url: string): boolean;

  // Lookup methods
  abstract lookupByUrl(url: string): Promise<Release>;
  abstract lookupByGtin(gtin: string, region?: string): Promise<Release>;

  // Harmonization
  abstract harmonize(release: Release): HarmonyRelease;

  // Rate limiting
  protected rateLimit: RateLimiter;
  protected throttle(): Promise<void>;

  // Caching
  protected cache: SnapStorage;
  protected getCached(key: string): Promise<Response | null>;
  protected setCached(key: string, response: Response): Promise<void>;

  // Feature quality
  abstract featureQuality: FeatureQualityMap;
}
```
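The source does not show how `RateLimiter` or `throttle()` work. One simple way to implement rate limiting with a configurable delay is a fixed minimum interval between requests, sketched below. `IntervalRateLimiter` and its injectable clock are assumptions made for this example.

```typescript
// Enforces a minimum delay between consecutive requests.
class IntervalRateLimiter {
  private nextAllowed = 0;
  private readonly minDelayMs: number;
  private readonly now: () => number;

  // The clock is injectable so the limiter can be exercised without real waiting.
  constructor(minDelayMs: number, now: () => number = () => Date.now()) {
    this.minDelayMs = minDelayMs;
    this.now = now;
  }

  // Reserves the next free slot and returns how long the caller must wait for it.
  reserve(): number {
    const current = this.now();
    const startAt = Math.max(current, this.nextAllowed);
    this.nextAllowed = startAt + this.minDelayMs;
    return startAt - current;
  }

  // Resolves once the reserved slot has been reached.
  async throttle(): Promise<void> {
    const waitMs = this.reserve();
    if (waitMs > 0) {
      await new Promise<void>(resolve => setTimeout(resolve, waitMs));
    }
  }
}
```

A provider would call `await limiter.throttle()` before each HTTP request. Because slots are reserved synchronously, concurrent callers queue up one interval apart instead of all firing at once.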
### MetadataApiProvider (OAuth2)
**Location**: `providers/api_base.ts`
**Additional Responsibilities**:
- OAuth2 token acquisition and refresh
- Token caching in localStorage
- Automatic token renewal
- API client configuration
**OAuth2 Flow**:
```typescript
abstract class MetadataApiProvider extends MetadataProvider {
  protected async getAccessToken(): Promise<string> {
    // 1. Check cache (localStorage holds strings, so parse the stored JSON)
    const cached = localStorage.getItem(`${this.name}_token`);
    if (cached) {
      const cachedToken: OAuth2Token = JSON.parse(cached);
      if (!this.isTokenExpired(cachedToken)) {
        return cachedToken.access_token;
      }
    }

    // 2. Request new token
    const token = await this.requestToken();

    // 3. Cache token
    localStorage.setItem(`${this.name}_token`, JSON.stringify(token));
    return token.access_token;
  }

  // `abstract` and `async` cannot be combined; the Promise return type suffices.
  protected abstract requestToken(): Promise<OAuth2Token>;
}
```
### ReleaseLookup
**Location**: `providers/release_lookup.ts`
**Lookup Methods**:
```typescript
interface ReleaseLookup {
  lookupByUrl(url: string): Promise<Release>;
  lookupByGtin(gtin: string): Promise<Release>;
  lookupById(id: string): Promise<Release>;
}
```
### ReleaseApiLookup (Multi-Region)
**Location**: `providers/release_api_lookup.ts`
**Region Handling**:
```typescript
abstract class ReleaseApiLookup extends ReleaseLookup {
  protected supportedRegions: string[];  // ['US', 'GB', 'JP', ...]

  async lookupByGtin(gtin: string, regions: string[]): Promise<Release[]> {
    const lookups = regions
      .filter(r => this.supportedRegions.includes(r))
      .map(r => this.lookupInRegion(gtin, r));
    const results = await Promise.allSettled(lookups);
    return results
      // The type guard narrows to fulfilled results so that `.value` is accessible.
      .filter((r): r is PromiseFulfilledResult<Release> => r.status === 'fulfilled')
      .map(r => r.value);
  }

  protected abstract lookupInRegion(gtin: string, region: string): Promise<Release>;
}
```
### Provider Registry
**Location**: `providers/registry.ts`
Manages provider instantiation and categorization.
**Registry Structure**:
```typescript
class ProviderRegistry {
  private providers: Map<string, MetadataProvider>;
  private categories: Map<string, string[]>;  // category -> provider names

  register(provider: MetadataProvider, category: string): void;
  get(name: string): MetadataProvider | undefined;
  getByCategory(category: string): MetadataProvider[];
  getByUrl(url: string): MetadataProvider | undefined;
  getByGtin(): MetadataProvider[];  // All GTIN-supporting providers
}
```
**Categories**:
- `default`: Commonly used providers (Spotify, Deezer, iTunes)
- `preferred`: High-quality providers (Spotify, Tidal, MusicBrainz)
- `all`: All registered providers
- `japan`: Japan-specific providers (Mora, Ototoy)
- `electronic`: Electronic music specialists (Beatport)
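The registry signatures above could be backed by two maps, as in the following sketch. `RegistryProvider` and `SimpleProviderRegistry` are hypothetical stand-ins for the real classes, and treating `all` as "every registered provider" rather than a stored category is an assumption.

```typescript
// Hypothetical minimal provider shape for this sketch.
interface RegistryProvider {
  name: string;
  supportsGtin: boolean;
  matchesUrl(url: string): boolean;
}

class SimpleProviderRegistry {
  private providers = new Map<string, RegistryProvider>();
  private categories = new Map<string, string[]>(); // category -> provider names

  // A provider may be registered several times under different categories.
  register(provider: RegistryProvider, category: string): void {
    this.providers.set(provider.name, provider);
    const names = this.categories.get(category) ?? [];
    if (!names.includes(provider.name)) names.push(provider.name);
    this.categories.set(category, names);
  }

  get(name: string): RegistryProvider | undefined {
    return this.providers.get(name);
  }

  getByCategory(category: string): RegistryProvider[] {
    if (category === 'all') return Array.from(this.providers.values());
    return (this.categories.get(category) ?? [])
      .map(name => this.providers.get(name))
      .filter((p): p is RegistryProvider => p !== undefined);
  }

  getByUrl(url: string): RegistryProvider | undefined {
    return Array.from(this.providers.values()).find(p => p.matchesUrl(url));
  }

  getByGtin(): RegistryProvider[] {
    return Array.from(this.providers.values()).filter(p => p.supportsGtin);
  }
}
```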
### Feature Quality Ratings
Each provider declares quality ratings for supported features:
```typescript
interface FeatureQualityMap {
  gtin: FeatureQuality;
  title: FeatureQuality;
  artists: FeatureQuality;
  releaseDate: FeatureQuality;
  labels: FeatureQuality;
  media: FeatureQuality;
  tracks: FeatureQuality;
  isrc: FeatureQuality;
  images: FeatureQuality | number;  // Number = max dimension
  copyright: FeatureQuality;
  availability: FeatureQuality;
}

enum FeatureQuality {
  MISSING = 0,
  BAD = 1,
  PRESENT = 2,
  GOOD = 3,
}
```
**Example** (Spotify):
```typescript
featureQuality = {
  gtin: FeatureQuality.GOOD,
  title: FeatureQuality.GOOD,
  artists: FeatureQuality.GOOD,
  releaseDate: FeatureQuality.GOOD,
  labels: FeatureQuality.PRESENT,
  media: FeatureQuality.GOOD,
  tracks: FeatureQuality.GOOD,
  isrc: FeatureQuality.GOOD,
  images: 2000,  // Max 2000px
  copyright: FeatureQuality.PRESENT,
  availability: FeatureQuality.GOOD,
};
```
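The source does not say how these ratings are consumed. One plausible use is to pick, per feature, the provider with the highest rating, as sketched below. The function name and the plain numeric rating maps are assumptions; numeric values follow the enum above, and a missing entry is treated as `MISSING`.

```typescript
// Numeric levels mirroring the FeatureQuality enum above.
const Quality = { MISSING: 0, BAD: 1, PRESENT: 2, GOOD: 3 } as const;

// provider name -> (feature name -> rating)
type ProviderRatings = Record<string, Record<string, number>>;

// Returns the provider with the highest rating for a feature, or undefined
// when no provider rates it above MISSING. Ties go to the first provider listed.
function bestProviderFor(feature: string, ratings: ProviderRatings): string | undefined {
  let best: string | undefined;
  let bestScore: number = Quality.MISSING;
  for (const [provider, features] of Object.entries(ratings)) {
    const score = features[feature] ?? Quality.MISSING;
    if (score > bestScore) {
      best = provider;
      bestScore = score;
    }
  }
  return best;
}
```

Because image ratings may be pixel dimensions rather than enum levels, a comparison like this is only meaningful within a single feature, never across features.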
## Server Architecture (Fresh Framework)
### Fresh Islands Architecture
Fresh uses a hybrid rendering model:
- **Server-side rendering (SSR)**: Default for all components
- **Islands**: Client-side interactive components
**Benefits**:
- Minimal JavaScript shipped to client
- Fast initial page load
- Progressive enhancement
- SEO-friendly
### Route Structure
**Location**: `routes/` directory
| Route File | URL | Purpose |
|------------|-----|---------|
| `index.tsx` | `/` | Landing page |
| `release.tsx` | `/release` | Main lookup interface |
| `release/actions.tsx` | `/release/actions` | ISRC/cover submission |
| `about.tsx` | `/about` | Provider documentation |
| `settings.tsx` | `/settings` | User preferences |
### Components
**Location**: `components/` directory
**22 Static Components** (server-rendered):
- Layout components (Header, Footer, Navigation)
- Display components (ReleaseInfo, TrackList, ArtistCredit)
- Comparison components (ProviderTable, FeatureMatrix)
- Form components (LookupForm, SeederForm)
**5 Interactive Islands** (client-side):
- `LookupForm.tsx`: Dynamic form with validation
- `ProviderSelector.tsx`: Provider category filtering
- `RegionSelector.tsx`: Multi-region selection
- `PermalinkGenerator.tsx`: Timestamp-based permalink creation
- `SeederForm.tsx`: MusicBrainz import form with copy-to-clipboard
### Request Flow
```
1. Browser Request
2. Fresh Router (routes/release.tsx)
3. CombinedReleaseLookup (parallel provider queries)
4. Provider Harmonization (convert to HarmonyRelease)
5. Merge Algorithm (combine releases)
6. Server-Side Rendering (generate HTML)
7. Response sent to browser
8. Island Hydration (activate interactive components in the browser)
```
## Data Flow Diagram
```
┌─────────────────────────────────────────────────────────────┐
│                         User Input                          │
│  GTIN: 0602537347377  URLs: [spotify, deezer]  Region: US   │
└──────────────────────────────┬──────────────────────────────┘
                               ▼
┌─────────────────────────────────────────────────────────────┐
│                    CombinedReleaseLookup                    │
│  - Parse input                                              │
│  - Select providers (Spotify, Deezer, iTunes)               │
│  - Execute parallel lookups                                 │
└──────────────────────────────┬──────────────────────────────┘
               ┌───────────────┼───────────────┐
               ▼               ▼               ▼
        ┌─────────────┐ ┌─────────────┐ ┌─────────────┐
        │   Spotify   │ │   Deezer    │ │   iTunes    │
        │  Provider   │ │  Provider   │ │  Provider   │
        │             │ │             │ │             │
        │ - API call  │ │ - API call  │ │ - API call  │
        │ - Cache     │ │ - Cache     │ │ - Cache     │
        │ - Parse     │ │ - Parse     │ │ - Parse     │
        └──────┬──────┘ └──────┬──────┘ └──────┬──────┘
               │               │               │
               ▼               ▼               ▼
        ┌─────────────┐ ┌─────────────┐ ┌─────────────┐
        │  Harmonize  │ │  Harmonize  │ │  Harmonize  │
        │  (Spotify)  │ │  (Deezer)   │ │  (iTunes)   │
        └──────┬──────┘ └──────┬──────┘ └──────┬──────┘
               │               │               │
               └───────────────┼───────────────┘
                               ▼
┌─────────────────────────────────────────────────────────────┐
│                       Merge Algorithm                       │
│  Phase 1: Collect property values from all releases         │
│  Phase 2: Check compatibility                               │
│  Phase 3: Select best value per property                    │
└──────────────────────────────┬──────────────────────────────┘
                               ▼
┌─────────────────────────────────────────────────────────────┐
│                    MergedHarmonyRelease                     │
│  - Unified metadata                                         │
│  - Source map (property -> provider)                        │
│  - Incompatibility warnings                                 │
└──────────────────────────────┬──────────────────────────────┘
               ┌───────────────┴───────────────┐
               ▼                               ▼
      ┌─────────────────┐             ┌─────────────────┐
      │ Web UI Display  │             │   MusicBrainz   │
      │  - Comparison   │             │     Seeding     │
      │  - Warnings     │             │  - Convert      │
      │  - Permalink    │             │  - Edit note    │
      └─────────────────┘             │  - Annotation   │
                                      └─────────────────┘
```
## Summary
Harmony's architecture demonstrates:
1. **Clear separation of concerns**: 4-stage pipeline with distinct responsibilities
2. **Provider abstraction**: Base classes handle common functionality (caching, rate limiting, OAuth2)
3. **Type safety**: 273-line HarmonyRelease schema ensures data consistency
4. **Intelligent merging**: 3-phase algorithm with compatibility checking and provider preferences
5. **Graceful degradation**: `Promise.allSettled` ensures partial results on provider failures
6. **MusicBrainz integration**: Seamless conversion to MB format with MBID resolution
7. **Modern web stack**: Fresh framework with SSR and islands for optimal performance
Together, these patterns make Harmony a useful reference for building metadata aggregation systems.