Harmony - Data Model and Storage Analysis

Storage Philosophy

Harmony employs a cache-first architecture with no application database (the only persistent store is the SQLite-backed HTTP cache described below):

  • No traditional database: No PostgreSQL, MySQL, MongoDB, etc.
  • No persistent user data: No accounts, no saved searches, no user-generated content
  • Cache as storage: HTTP response caching via snap_storage library
  • In-memory processing: All data transformations happen in memory
  • Stateless design: Each request is independent

This approach prioritizes:

  • Simplicity: No database migrations, no schema evolution
  • Reproducibility: Permalink system enables exact result replay
  • API compliance: Caching reduces provider API calls
  • Deployment ease: No database server required

Persistence Layer: snap_storage

Overview

snap_storage is a Deno library for HTTP response caching with a SQLite backend.

Repository: https://github.com/kellnerd/snap-storage (same author as Harmony)

Purpose: Store HTTP responses with timestamps for later retrieval

Storage Structure

SQLite Database: snaps.db

Location: ${HARMONY_DATA_DIR}/snaps.db (default: ./snaps.db)

Schema (conceptual):

CREATE TABLE snaps (
	id INTEGER PRIMARY KEY AUTOINCREMENT,
	key TEXT NOT NULL UNIQUE,
	url TEXT NOT NULL,
	timestamp INTEGER NOT NULL,
	status INTEGER NOT NULL,
	headers TEXT NOT NULL,
	body_path TEXT NOT NULL,
	created_at INTEGER NOT NULL
);

CREATE INDEX idx_snaps_key ON snaps(key);
CREATE INDEX idx_snaps_timestamp ON snaps(timestamp);
CREATE INDEX idx_snaps_url ON snaps(url);

Fields:

  • key: Cache key (hash of URL + parameters)
  • url: Original request URL
  • timestamp: Unix timestamp of request
  • status: HTTP status code
  • headers: JSON-encoded response headers
  • body_path: Path to response body file in snaps/ directory
  • created_at: Record creation timestamp

File Directory: snaps/

Location: ${HARMONY_DATA_DIR}/snaps/ (default: ./snaps/)

Structure:

snaps/
├── 0a/
│   ├── 0a1b2c3d4e5f....json
│   └── 0a9f8e7d6c5b....json
├── 1b/
│   └── 1b2c3d4e5f6a....json
└── ...

File naming: First 2 characters of hash as directory, full hash as filename

File content: Raw HTTP response body (JSON, HTML, XML, etc.)
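
The naming rule above can be sketched as a small helper. This is an illustration only: `snapBodyPath` and its `baseDir` parameter are hypothetical names, not part of snap_storage, and the `.json` extension simply follows the listing above.

```typescript
// Hypothetical helper: derives the body file path for a cache key.
// Assumes the key is a lowercase hex digest, as produced by a hex-encoded hash.
function snapBodyPath(key: string, baseDir = "snaps"): string {
	if (!/^[0-9a-f]{3,}$/.test(key)) {
		throw new Error(`Expected a lowercase hex hash, got "${key}"`);
	}
	// The first two characters select the shard directory
	const shard = key.slice(0, 2);
	return `${baseDir}/${shard}/${key}.json`;
}

console.log(snapBodyPath("0a1b2c3d")); // snaps/0a/0a1b2c3d.json
```

Sharding by the first two hex characters caps the number of top-level directories at 256, which keeps any single directory from growing too large.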

Cache Operations

Store Response

interface CacheEntry {
	url: string;
	timestamp: number;
	response: Response;
}

async function storeResponse(entry: CacheEntry): Promise<void> {
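	const key = await hashUrl(entry.url); // hashUrl is async (it uses crypto.subtle)
	const bodyDir = `snaps/${key.slice(0, 2)}`;
	const bodyPath = `${bodyDir}/${key}.json`;
	
	// Store body to file (create the shard directory on first use)
	await Deno.mkdir(bodyDir, { recursive: true });
	await Deno.writeTextFile(bodyPath, await entry.response.text());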
	
	// Store metadata to database
	await db.execute(`
		INSERT INTO snaps (key, url, timestamp, status, headers, body_path, created_at)
		VALUES (?, ?, ?, ?, ?, ?, ?)
	`, [
		key,
		entry.url,
		entry.timestamp,
		entry.response.status,
		JSON.stringify(Object.fromEntries(entry.response.headers)),
		bodyPath,
		Date.now()
	]);
}

Retrieve Response

async function getResponse(url: string, timestamp?: number): Promise<Response | null> {
	const key = await hashUrl(url); // hashUrl is async (it uses crypto.subtle)
	
	let query = `SELECT * FROM snaps WHERE key = ?`;
	const params: (string | number)[] = [key];
	
	if (timestamp !== undefined) {
		// Permalink mode: exact timestamp match
		query += ` AND timestamp = ?`;
		params.push(timestamp);
	} else {
		// Normal mode: most recent within cache duration
		const maxAge = 24 * 60 * 60 * 1000; // 24 hours
		query += ` AND created_at > ? ORDER BY created_at DESC LIMIT 1`;
		params.push(Date.now() - maxAge);
	}
	
	const row = await db.queryOne(query, params);
	if (!row) return null;
	
	// Read body from file
	const body = await Deno.readTextFile(row.body_path);
	
	// Reconstruct Response object
	return new Response(body, {
		status: row.status,
		headers: JSON.parse(row.headers)
	});
}

Cache Policy

Default Policy

  • Duration: 24 hours
  • Eviction: No automatic eviction (manual cleanup required)
  • Size limit: No enforced limit (grows indefinitely)

Permalink Policy

  • Duration: Indefinite (never evicted)
  • Purpose: Enable reproducible results
  • Lookup: Exact timestamp match
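
The two lookup modes can be illustrated over an in-memory list of rows. This is a minimal sketch, not snap_storage's code: `SnapRow` and `selectSnap` are hypothetical names, and the row shape is reduced to the fields the policy needs.

```typescript
// Hypothetical, reduced row shape (timestamps in milliseconds)
interface SnapRow {
	key: string;
	timestamp: number;
	createdAt: number;
}

function selectSnap(
	rows: SnapRow[],
	now: number,
	maxAgeMs: number,
	timestamp?: number,
): SnapRow | undefined {
	if (timestamp !== undefined) {
		// Permalink mode: exact timestamp match, regardless of age
		return rows.find((row) => row.timestamp === timestamp);
	}
	// Normal mode: newest row that is still inside the cache window
	const cutoff = now - maxAgeMs;
	return rows
		.filter((row) => row.createdAt > cutoff)
		.sort((a, b) => b.createdAt - a.createdAt)[0];
}

const DAY_MS = 24 * 60 * 60 * 1000;
const rows: SnapRow[] = [
	{ key: "k", timestamp: 1_000, createdAt: 1_000 },
	{ key: "k", timestamp: 5 * DAY_MS, createdAt: 5 * DAY_MS },
];
console.log(selectSnap(rows, 5 * DAY_MS + 1, DAY_MS)?.timestamp); // 432000000
console.log(selectSnap(rows, 5 * DAY_MS + 1, DAY_MS, 1_000)?.timestamp); // 1000
```

The second call shows why permalink entries must never be evicted: the old snapshot is far outside the 24-hour window but is still returned when its exact timestamp is requested.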

Cache Key Generation

async function hashUrl(url: string): Promise<string> {
	// Normalize URL
	const normalized = new URL(url);
	normalized.searchParams.sort(); // Consistent parameter order
	
	// Hash normalized URL
	const encoder = new TextEncoder();
	const data = encoder.encode(normalized.toString());
	const hashBuffer = await crypto.subtle.digest('SHA-256', data);
	const hashArray = Array.from(new Uint8Array(hashBuffer));
	return hashArray.map(b => b.toString(16).padStart(2, '0')).join('');
}
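
To see why the normalization step matters, the following sketch isolates it from the hashing. `normalizeUrl` is a hypothetical helper name; it relies only on the standard `URL` and `URLSearchParams.sort()` APIs.

```typescript
// Hypothetical helper: the normalization step of the key generation, without the hash
function normalizeUrl(url: string): string {
	const parsed = new URL(url);
	parsed.searchParams.sort(); // Consistent parameter order
	return parsed.toString();
}

const a = normalizeUrl("https://example.com/a?b=2&a=1");
const b = normalizeUrl("https://example.com/a?a=1&b=2");
console.log(a); // https://example.com/a?a=1&b=2
console.log(a === b); // true
```

Because both spellings normalize to the same string, they would hash to the same cache key and share one cache entry.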

Cache Management

Manual Cleanup

No automatic cleanup. Users must manually delete old cache entries. Note that deleting an entry also breaks any permalink that references its timestamp:

# Delete cache older than 30 days
sqlite3 snaps.db "DELETE FROM snaps WHERE created_at < $(date -d '30 days ago' +%s)000"

# Delete body files older than 30 days (approximates orphan cleanup; assumes file mtime tracks created_at)
find snaps/ -type f -mtime +30 -delete

Cache Statistics

# Total cache entries
sqlite3 snaps.db "SELECT COUNT(*) FROM snaps"

# Cache size
du -sh snaps/

# Entries per URL (the query groups by full URL, not by provider host)
sqlite3 snaps.db "SELECT url, COUNT(*) FROM snaps GROUP BY url"
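
Per-provider counts need the URLs grouped by host rather than by full URL. A minimal sketch, assuming the `url` column has been exported as a list of strings (`countByHost` is a hypothetical helper, and the example hosts are illustrative):

```typescript
// Hypothetical helper: tallies cached URLs by hostname
function countByHost(urls: string[]): Map<string, number> {
	const counts = new Map<string, number>();
	for (const url of urls) {
		const host = new URL(url).hostname;
		counts.set(host, (counts.get(host) ?? 0) + 1);
	}
	return counts;
}

const counts = countByHost([
	"https://api.spotify.com/v1/albums/1",
	"https://api.spotify.com/v1/albums/2",
	"https://api.deezer.com/album/3",
]);
console.log(counts.get("api.spotify.com")); // 2
console.log(counts.get("api.deezer.com")); // 1
```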

MBID Cache

Purpose

Cache MusicBrainz ID (MBID) mappings for external URLs to avoid repeated API calls.

Storage Location

  • Development: localStorage (persistent across sessions)
  • Production: sessionStorage (cleared on browser close)

Rationale: Development benefits from persistent cache, production prioritizes fresh data.

Cache Structure

interface MBIDCache {
	[externalUrl: string]: MBIDCacheEntry;
}

interface MBIDCacheEntry {
	mbid: string;
	type: 'release' | 'release-group' | 'recording' | 'artist' | 'label';
	cached: number; // Unix timestamp
}

Cache Operations

Store MBID Mapping

function cacheMBID(url: string, mbid: string, type: MBIDCacheEntry['type']): void {
	const cache = getMBIDCache();
	cache[url] = {
		mbid,
		type,
		cached: Date.now()
	};
	setMBIDCache(cache);
}

function getMBIDCache(): MBIDCache {
	const storage = DENO_DEPLOYMENT_ID ? sessionStorage : localStorage;
	const cached = storage.getItem('harmony_mbid_cache');
	return cached ? JSON.parse(cached) : {};
}

function setMBIDCache(cache: MBIDCache): void {
	const storage = DENO_DEPLOYMENT_ID ? sessionStorage : localStorage;
	storage.setItem('harmony_mbid_cache', JSON.stringify(cache));
}

Retrieve MBID Mapping

function getCachedMBID(url: string): MBIDCacheEntry | null {
	const cache = getMBIDCache();
	const entry = cache[url];
	
	if (!entry) return null;
	
	// Check if cache is stale (24 hours)
	const maxAge = 24 * 60 * 60 * 1000;
	if (Date.now() - entry.cached > maxAge) {
		delete cache[url];
		setMBIDCache(cache);
		return null;
	}
	
	return entry;
}

Batch MBID Lookup

The MusicBrainz API supports batch URL lookup (up to 100 URLs per request):

async function resolveMBIDs(urls: string[]): Promise<Map<string, MBIDCacheEntry>> {
	const results = new Map<string, MBIDCacheEntry>();
	
	// Check cache first
	const uncached: string[] = [];
	for (const url of urls) {
		const cached = getCachedMBID(url);
		if (cached) {
			results.set(url, cached);
		} else {
			uncached.push(url);
		}
	}
	
	// Batch lookup uncached URLs (100 at a time)
	for (let i = 0; i < uncached.length; i += 100) {
		const batch = uncached.slice(i, i + 100);
		const params = batch.map(url => `resource=${encodeURIComponent(url)}`).join('&');
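		// fmt=json requests JSON (the API returns XML by default);
		// inc=release-rels is assumed here so that release relationships are included
		const response = await fetch(`https://musicbrainz.org/ws/2/url?${params}&inc=release-rels&fmt=json`);
		const data = await response.json();
		
		// Parse response and cache results
		for (const urlData of data.urls ?? []) {
			const relation = urlData.relations?.[0];
			const mbid = relation?.release?.id;
			const type = relation?.['target-type']; // entity type, e.g. 'release'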
			if (mbid) {
				cacheMBID(urlData.resource, mbid, type);
				results.set(urlData.resource, { mbid, type, cached: Date.now() });
			}
		}
	}
	
	return results;
}
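
The batching step in the loop above can be factored into two pure helpers, which makes the 100-URL limit easy to test in isolation. `chunk`, `buildResourceQuery`, and `MAX_URLS_PER_REQUEST` are hypothetical names introduced for this sketch.

```typescript
// Limit stated above for the MusicBrainz URL lookup
const MAX_URLS_PER_REQUEST = 100;

// Splits a list into consecutive batches of at most `size` items
function chunk<T>(items: T[], size: number): T[][] {
	if (size < 1) throw new RangeError("size must be at least 1");
	const batches: T[][] = [];
	for (let i = 0; i < items.length; i += size) {
		batches.push(items.slice(i, i + size));
	}
	return batches;
}

// Builds the repeated `resource` query parameters for one batch
function buildResourceQuery(urls: string[]): string {
	return urls.map((url) => `resource=${encodeURIComponent(url)}`).join("&");
}

const urls = Array.from({ length: 250 }, (_, i) => `https://example.com/album/${i}`);
const batches = chunk(urls, MAX_URLS_PER_REQUEST);
console.log(batches.map((batch) => batch.length)); // [ 100, 100, 50 ]
```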

Core Data Model: HarmonyRelease

Schema Definition

Location: harmonizer/types.ts (273 lines)

Full Interface:

interface HarmonyRelease {
	// ===== Basic Metadata =====
	title: string;
	artists: ArtistCreditName[];
	gtin?: string; // Global Trade Item Number (barcode)
	
	// ===== Media and Tracks =====
	media: HarmonyMedium[];
	
	// ===== Release Details =====
	language?: string; // ISO 639-3 code
	script?: string; // ISO 15924 code
	status?: ReleaseStatus;
	types: ReleaseType[];
	releaseDate?: PartialDate;
	
	// ===== Commercial Information =====
	labels: Label[];
	packaging?: PackagingType;
	copyright?: string;
	
	// ===== Distribution =====
	availableIn?: string[]; // ISO 3166-1 alpha-2 country codes
	excludedFrom?: string[]; // ISO 3166-1 alpha-2 country codes
	
	// ===== Visual Assets =====
	images: Image[];
	
	// ===== External Links =====
	externalLinks: ExternalLink[];
	
	// ===== Metadata About Metadata =====
	info: ReleaseInfo;
}

Sub-Structures

ArtistCreditName

interface ArtistCreditName {
	name: string; // Artist name
	creditedName?: string; // Name as credited on the release, if it differs from name
	joinPhrase?: string; // Separator (e.g., " & ", " feat. ", " vs. ")
	mbid?: string; // MusicBrainz artist ID
}

Example:

[
	{ name: "Artist A", joinPhrase: " & " },
	{ name: "Artist B", joinPhrase: " feat. " },
	{ name: "Artist C", creditedName: "Artist C (DJ Set)" }
]

Rendering: "Artist A & Artist B feat. Artist C (DJ Set)"
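
The rendering rule implied by this example can be sketched as follows: each entry contributes its credited name (falling back to `name`) followed by its join phrase. `renderArtistCredit` is a hypothetical helper, and the interface is repeated without `mbid` so the block is self-contained.

```typescript
interface ArtistCreditName {
	name: string;
	creditedName?: string;
	joinPhrase?: string;
}

// Hypothetical helper: concatenates credited names and join phrases in order
function renderArtistCredit(credits: ArtistCreditName[]): string {
	return credits
		.map((credit) => (credit.creditedName ?? credit.name) + (credit.joinPhrase ?? ""))
		.join("");
}

console.log(renderArtistCredit([
	{ name: "Artist A", joinPhrase: " & " },
	{ name: "Artist B", joinPhrase: " feat. " },
	{ name: "Artist C", creditedName: "Artist C (DJ Set)" },
])); // Artist A & Artist B feat. Artist C (DJ Set)
```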

HarmonyMedium

interface HarmonyMedium {
	title?: string; // Medium title (e.g., "Disc 1: The Album")
	format?: MediumFormat;
	position: number; // 1-indexed
	tracks: HarmonyTrack[];
}

enum MediumFormat {
	CD = 'CD',
	Vinyl = 'Vinyl',
	Digital = 'Digital Media',
	Cassette = 'Cassette',
	DVD = 'DVD',
	BluRay = 'Blu-ray',
	Other = 'Other'
}

HarmonyTrack

interface HarmonyTrack {
	title: string;
	artists?: ArtistCreditName[]; // Track-specific artists (overrides release artists)
	position: number; // 1-indexed within medium
	length?: number; // Duration in milliseconds
	isrc?: string; // International Standard Recording Code
}

Example:

{
	title: "Track Title",
	artists: [{ name: "Track Artist" }],
	position: 1,
	length: 245000, // 4:05
	isrc: "USRC17607839"
}
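
Since `length` is stored in milliseconds, display code has to convert it. A minimal sketch of that conversion (`formatLength` is a hypothetical helper; it rounds to whole seconds and does not split out hours):

```typescript
// Hypothetical helper: formats a duration in milliseconds as m:ss
function formatLength(ms: number): string {
	const totalSeconds = Math.round(ms / 1000);
	const minutes = Math.floor(totalSeconds / 60);
	const seconds = totalSeconds % 60;
	return `${minutes}:${seconds.toString().padStart(2, "0")}`;
}

console.log(formatLength(245000)); // 4:05
```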

Label

interface Label {
	name: string;
	catalogNumber?: string;
	mbid?: string; // MusicBrainz label ID
}

Example:

[
	{ name: "Record Label", catalogNumber: "RL-12345" },
	{ name: "Distributor", catalogNumber: "DIST-67890" }
]

Image

interface Image {
	url: string;
	types: ImageType[];
	width?: number;
	height?: number;
	comment?: string;
}

enum ImageType {
	Front = 'front',
	Back = 'back',
	Medium = 'medium',
	Tray = 'tray',
	Booklet = 'booklet',
	Obi = 'obi',
	Spine = 'spine',
	Track = 'track',
	Liner = 'liner',
	Sticker = 'sticker',
	Poster = 'poster',
	Watermark = 'watermark',
	Raw = 'raw',
	Unedited = 'unedited'
}

Example:

[
	{
		url: "https://i.scdn.co/image/ab67616d0000b273...",
		types: [ImageType.Front],
		width: 2000,
		height: 2000
	},
	{
		url: "https://e-cdn-images.dzcdn.net/images/cover/...",
		types: [ImageType.Front],
		width: 1400,
		height: 1400,
		comment: "Deezer cover"
	}
]

ExternalLink

interface ExternalLink {
	url: string;
	types: LinkType[];
}

enum LinkType {
	Streaming = 'streaming',
	Purchase = 'purchase',
	Download = 'download',
	License = 'license',
	Crowdfunding = 'crowdfunding',
	Other = 'other'
}

Example:

[
	{
		url: "https://open.spotify.com/album/xyz",
		types: [LinkType.Streaming]
	},
	{
		url: "https://bandcamp.com/album/xyz",
		types: [LinkType.Streaming, LinkType.Purchase]
	}
]

ReleaseInfo

interface ReleaseInfo {
	providers: string[]; // Provider names that contributed data
	messages: Message[]; // Warnings, errors, info messages
	sourceMap?: SourceMap; // Property -> provider mapping (only in MergedHarmonyRelease)
	incompatibleData?: IncompatibilityInfo; // Conflicts (only in MergedHarmonyRelease)
}

interface Message {
	level: 'error' | 'warning' | 'info';
	text: string;
	provider?: string;
}

Example:

{
	providers: ["spotify", "deezer", "itunes"],
	messages: [
		{
			level: "warning",
			text: "Release date conflict: Spotify (2014-11-24) vs iTunes (2014-11-25)",
			provider: "itunes"
		},
		{
			level: "info",
			text: "Using Spotify value (higher preference)"
		}
	]
}

Enumerations

ReleaseStatus

enum ReleaseStatus {
	Official = 'official',
	Promotion = 'promotion',
	Bootleg = 'bootleg',
	PseudoRelease = 'pseudo-release'
}

ReleaseType

enum ReleaseType {
	// Primary types
	Album = 'album',
	Single = 'single',
	EP = 'ep',
	Broadcast = 'broadcast',
	Other = 'other',
	
	// Secondary types
	Compilation = 'compilation',
	Soundtrack = 'soundtrack',
	Spokenword = 'spokenword',
	Interview = 'interview',
	Audiobook = 'audiobook',
	AudioDrama = 'audio drama',
	Live = 'live',
	Remix = 'remix',
	DJMix = 'dj-mix',
	Mixtape = 'mixtape',
	Demo = 'demo',
	FieldRecording = 'field recording'
}

Usage: Array of types (primary + secondary)

types: [ReleaseType.Album, ReleaseType.Live] // Live album
types: [ReleaseType.EP, ReleaseType.Remix] // Remix EP
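
Splitting such an array back into one primary type and a list of secondary types could look like the sketch below. It works on the string values of the enum so that it is self-contained; `PRIMARY_TYPES` and `splitTypes` are hypothetical names, and the `'album'` fallback is an assumption for releases without a primary type.

```typescript
// String values of the primary types listed in the enum above
const PRIMARY_TYPES = new Set(["album", "single", "ep", "broadcast", "other"]);

function isPrimaryType(type: string): boolean {
	return PRIMARY_TYPES.has(type);
}

// Hypothetical helper: separates the first primary type from the secondary types
function splitTypes(types: string[]): { primary: string; secondary: string[] } {
	return {
		primary: types.find(isPrimaryType) ?? "album", // assumed fallback
		secondary: types.filter((type) => !isPrimaryType(type)),
	};
}

console.log(splitTypes(["album", "live"])); // { primary: 'album', secondary: [ 'live' ] }
console.log(splitTypes(["ep", "remix"])); // { primary: 'ep', secondary: [ 'remix' ] }
```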

PackagingType

enum PackagingType {
	JewelCase = 'jewel case',
	SlimJewelCase = 'slim jewel case',
	Digipak = 'digipak',
	Cardboard = 'cardboard/paper sleeve',
	KeepCase = 'keep case',
	None = 'none',
	Other = 'other'
}

PartialDate

interface PartialDate {
	year: number;
	month?: number; // 1-12
	day?: number; // 1-31
}

Examples:

{ year: 2014 } // Year only
{ year: 2014, month: 11 } // Year and month
{ year: 2014, month: 11, day: 24 } // Full date

Serialization:

function serializePartialDate(date: PartialDate): string {
	let result = date.year.toString();
	if (date.month) {
		result += `-${date.month.toString().padStart(2, '0')}`;
		if (date.day) {
			result += `-${date.day.toString().padStart(2, '0')}`;
		}
	}
	return result;
}

// Examples:
// { year: 2014 } -> "2014"
// { year: 2014, month: 11 } -> "2014-11"
// { year: 2014, month: 11, day: 24 } -> "2014-11-24"
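
The inverse direction, parsing a date string into a PartialDate, is needed by providers that deliver dates as text. A minimal sketch under the assumption that inputs use the same "YYYY", "YYYY-MM", or "YYYY-MM-DD" forms (`parsePartialDate` is a hypothetical helper):

```typescript
interface PartialDate {
	year: number;
	month?: number;
	day?: number;
}

// Hypothetical helper: parses "YYYY", "YYYY-MM" or "YYYY-MM-DD"
function parsePartialDate(text: string): PartialDate | undefined {
	const match = /^(\d{4})(?:-(\d{2})(?:-(\d{2}))?)?$/.exec(text);
	if (!match) return undefined;
	const [, year, month, day] = match;
	const date: PartialDate = { year: Number(year) };
	// Optional groups are undefined when the input omits them
	if (month !== undefined) date.month = Number(month);
	if (day !== undefined) date.day = Number(day);
	return date;
}

console.log(parsePartialDate("2014")); // { year: 2014 }
console.log(parsePartialDate("2014-11")); // { year: 2014, month: 11 }
console.log(parsePartialDate("2014-11-24")); // { year: 2014, month: 11, day: 24 }
```

The sketch only checks the shape of the string; range checks on month and day are a separate validation step.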

MergedHarmonyRelease

Extends HarmonyRelease with merge metadata.

interface MergedHarmonyRelease extends HarmonyRelease {
	info: ReleaseInfo & {
		sourceMap: SourceMap;
		incompatibleData?: IncompatibilityInfo;
	};
}

interface SourceMap {
	[propertyPath: string]: string; // Property path -> provider name
}

interface IncompatibilityInfo {
	conflicts: Conflict[];
	warnings: string[];
}

interface Conflict {
	property: string;
	values: ConflictValue[];
}

interface ConflictValue {
	provider: string;
	value: any;
}

Example:

{
	title: "Album Title",
	releaseDate: { year: 2014, month: 11, day: 24 },
	// ... other fields
	info: {
		providers: ["spotify", "deezer", "itunes"],
		sourceMap: {
			"title": "spotify",
			"releaseDate": "spotify",
			"gtin": "deezer",
			"media[0].tracks[0].isrc": "spotify"
		},
		incompatibleData: {
			conflicts: [
				{
					property: "releaseDate",
					values: [
						{ provider: "spotify", value: { year: 2014, month: 11, day: 24 } },
						{ provider: "itunes", value: { year: 2014, month: 11, day: 25 } }
					]
				}
			],
			warnings: [
				"Release date conflict resolved using Spotify value (higher preference)"
			]
		},
		messages: []
	}
}
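
One way a sourceMap like the one above could be produced is a preference-ordered merge, where the first provider in the preference list that has a value for a property wins. The sketch below is a simplified assumption for flat string properties, not Harmony's actual merge algorithm; `mergeByPreference` and `ProviderData` are hypothetical names.

```typescript
type ProviderData = Record<string, string | undefined>;

// Hypothetical helper: first provider in `preference` with a value wins
function mergeByPreference(
	results: Record<string, ProviderData>,
	preference: string[],
): { merged: Record<string, string>; sourceMap: Record<string, string> } {
	const merged: Record<string, string> = {};
	const sourceMap: Record<string, string> = {};
	for (const provider of preference) {
		const data = results[provider];
		if (!data) continue;
		for (const [property, value] of Object.entries(data)) {
			if (value !== undefined && !(property in merged)) {
				merged[property] = value;
				sourceMap[property] = provider; // Remember where the value came from
			}
		}
	}
	return { merged, sourceMap };
}

const { merged, sourceMap } = mergeByPreference(
	{
		spotify: { title: "Album Title", gtin: undefined },
		deezer: { title: "Album Title (Deluxe)", gtin: "4006381333931" },
	},
	["spotify", "deezer"],
);
console.log(merged); // { title: 'Album Title', gtin: '4006381333931' }
console.log(sourceMap); // { title: 'spotify', gtin: 'deezer' }
```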

Data Transformations

Provider-Specific to HarmonyRelease

Each provider implements a harmonize() method:

// Spotify example (conceptual)
class SpotifyProvider {
	harmonize(spotifyAlbum: SpotifyAlbum): HarmonyRelease {
		return {
			title: spotifyAlbum.name,
			artists: spotifyAlbum.artists.map(a => ({
				name: a.name,
				mbid: undefined // Spotify doesn't provide MBIDs
			})),
			gtin: spotifyAlbum.external_ids?.upc,
			media: [{
				format: MediumFormat.Digital,
				position: 1,
				tracks: spotifyAlbum.tracks.items.map((t, i) => ({
					title: t.name,
					position: i + 1,
					length: t.duration_ms,
					isrc: t.external_ids?.isrc
				}))
			}],
			releaseDate: this.parseDate(spotifyAlbum.release_date),
			types: this.inferTypes(spotifyAlbum.album_type),
			images: spotifyAlbum.images.map(img => ({
				url: img.url,
				types: [ImageType.Front],
				width: img.width,
				height: img.height
			})),
			externalLinks: [{
				url: spotifyAlbum.external_urls.spotify,
				types: [LinkType.Streaming]
			}],
			labels: spotifyAlbum.label ? [{ name: spotifyAlbum.label }] : [],
			copyright: spotifyAlbum.copyrights?.[0]?.text,
			availableIn: spotifyAlbum.available_markets,
			info: {
				providers: ["spotify"],
				messages: []
			}
		};
	}
}

HarmonyRelease to MusicBrainz Format

Location: musicbrainz/seeding.ts

interface MusicBrainzRelease {
	name: string;
	artist_credit: MBArtistCredit[];
	barcode?: string;
	release_events: MBReleaseEvent[];
	labels: MBLabel[];
	mediums: MBMedium[];
	release_group: {
		primary_type: string;
		secondary_types: string[];
	};
	language?: string;
	script?: string;
	packaging?: string;
	annotation?: string;
}

function convertToMusicBrainz(release: MergedHarmonyRelease): MusicBrainzRelease {
	return {
		name: release.title,
		artist_credit: release.artists.map(a => ({
			name: a.name,
			credited_name: a.creditedName,
			join_phrase: a.joinPhrase || '',
			mbid: a.mbid
		})),
		barcode: release.gtin,
		release_events: convertReleaseEvents(release.releaseDate, release.availableIn),
		labels: release.labels.map(l => ({
			name: l.name,
			catalog_number: l.catalogNumber,
			mbid: l.mbid
		})),
		mediums: release.media.map(m => ({
			format: m.format,
			position: m.position,
			title: m.title,
			tracks: m.tracks.map(t => ({
				title: t.title,
				position: t.position,
				length: t.length,
				isrc: t.isrc,
				artist_credit: t.artists?.map(a => ({
					name: a.name,
					join_phrase: a.joinPhrase || ''
				}))
			}))
		})),
		release_group: {
			primary_type: release.types.find(t => isPrimaryType(t)) || 'album',
			secondary_types: release.types.filter(t => !isPrimaryType(t))
		},
		language: release.language,
		script: release.script,
		packaging: release.packaging,
		annotation: buildAnnotation(release)
	};
}

Data Validation

GTIN Validation

function validateGTIN(gtin: string): boolean {
	// GTIN-13 (EAN-13) validation
	if (!/^\d{13}$/.test(gtin)) return false;
	
	// Check digit validation
	const digits = gtin.split('').map(Number);
	const checksum = digits.slice(0, 12).reduce((sum, digit, i) => {
		return sum + digit * (i % 2 === 0 ? 1 : 3);
	}, 0);
	const checkDigit = (10 - (checksum % 10)) % 10;
	
	return checkDigit === digits[12];
}
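
The function above accepts only 13-digit codes. Because the GTIN check digit is computed the same way for all lengths once the code is right-aligned, a variant that also accepts GTIN-8, GTIN-12 (UPC-A), and GTIN-14 can be sketched by left-padding to 14 digits. `isValidGtin` is a hypothetical name for this generalization.

```typescript
// Hypothetical generalization: validates GTIN-8, GTIN-12, GTIN-13 and GTIN-14
function isValidGtin(gtin: string): boolean {
	if (!/^(\d{8}|\d{12}|\d{13}|\d{14})$/.test(gtin)) return false;

	// Left-pad to 14 digits so the weights line up from the right for every length
	const digits = gtin.padStart(14, "0").split("").map(Number);

	// Data digits alternate weights 3, 1, 3, ... starting from the left of the padded code
	const sum = digits
		.slice(0, 13)
		.reduce((acc, digit, i) => acc + digit * (i % 2 === 0 ? 3 : 1), 0);
	const checkDigit = (10 - (sum % 10)) % 10;

	return checkDigit === digits[13];
}

console.log(isValidGtin("4006381333931")); // true (EAN-13)
console.log(isValidGtin("036000291452")); // true (UPC-A)
console.log(isValidGtin("4006381333932")); // false (wrong check digit)
```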

ISRC Validation

function validateISRC(isrc: string): boolean {
	// Format: CC-XXX-YY-NNNNN
	// CC: Country code (2 letters)
	// XXX: Registrant code (3 alphanumeric)
	// YY: Year (2 digits)
	// NNNNN: Designation code (5 digits)
	return /^[A-Z]{2}-?[A-Z0-9]{3}-?\d{2}-?\d{5}$/.test(isrc);
}

function normalizeISRC(isrc: string): string {
	// Remove hyphens
	return isrc.replace(/-/g, '');
}
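
For display, the compact form can be converted back to the hyphenated CC-XXX-YY-NNNNN layout described in the comments above. A minimal sketch (`formatIsrc` is a hypothetical helper; uppercasing the input is an added assumption):

```typescript
// Hypothetical helper: returns the hyphenated form, or undefined for invalid input
function formatIsrc(isrc: string): string | undefined {
	const compact = isrc.replace(/-/g, "").toUpperCase();
	if (!/^[A-Z]{2}[A-Z0-9]{3}\d{7}$/.test(compact)) return undefined;
	return [
		compact.slice(0, 2), // Country code
		compact.slice(2, 5), // Registrant code
		compact.slice(5, 7), // Year
		compact.slice(7), // Designation code
	].join("-");
}

console.log(formatIsrc("USRC17607839")); // US-RC1-76-07839
```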

Date Validation

function validatePartialDate(date: PartialDate): boolean {
	if (date.year < 1000 || date.year > 9999) return false;
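	// Compare against undefined so that an invalid 0 is not skipped as falsy
	if (date.month !== undefined && (date.month < 1 || date.month > 12)) return false;
	if (date.day !== undefined && (date.day < 1 || date.day > 31)) return false;
	
	// Validate day for specific month
	if (date.month !== undefined && date.day !== undefined) {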
		const daysInMonth = new Date(date.year, date.month, 0).getDate();
		if (date.day > daysInMonth) return false;
	}
	
	return true;
}

Data Size Estimates

Typical HarmonyRelease Size

Single-disc album (12 tracks):

  • JSON serialized: ~15-25 KB
  • With images: ~20-30 KB (image URLs only, not image data)

Multi-disc compilation (50 tracks):

  • JSON serialized: ~50-80 KB

Cache Size Estimates

Provider response sizes:

  • Spotify album: ~10-20 KB
  • Deezer album: ~15-25 KB
  • iTunes album: ~20-30 KB
  • Bandcamp page: ~50-100 KB (HTML)

Daily cache growth (100 lookups/day):

  • Database: ~50 KB (metadata only)
  • Files: ~2-5 MB (response bodies)

Annual cache size (36,500 lookups/year):

  • Database: ~18 MB
  • Files: ~730 MB - 1.8 GB
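
The annual figures follow directly from the daily ones. The sketch below reproduces the arithmetic, assuming 365 days and decimal units (1 MB = 1000 KB); `annualGrowth` and `GrowthEstimate` are hypothetical names.

```typescript
interface GrowthEstimate {
	databaseMb: number;
	filesMbLow: number;
	filesMbHigh: number;
}

// Hypothetical helper: scales the daily estimates above to a longer period
function annualGrowth(
	dailyDbKb: number,
	dailyFilesMb: [number, number],
	days = 365,
): GrowthEstimate {
	return {
		databaseMb: (dailyDbKb * days) / 1000,
		filesMbLow: dailyFilesMb[0] * days,
		filesMbHigh: dailyFilesMb[1] * days,
	};
}

// 100 lookups/day: ~50 KB of metadata and ~2-5 MB of response bodies
console.log(annualGrowth(50, [2, 5]));
// { databaseMb: 18.25, filesMbLow: 730, filesMbHigh: 1825 }
```

That is roughly 18 MB of metadata and 730 MB to 1.8 GB of response bodies per year, matching the figures listed above.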

No Migrations

Since Harmony has no traditional database, there are no schema migrations.

Schema evolution strategy:

  1. Add new optional fields to HarmonyRelease interface
  2. Update provider harmonize() methods to populate new fields
  3. Update merge algorithm to handle new fields
  4. No data migration required (old cached responses still valid)

Breaking changes:

  1. Rename or remove fields in HarmonyRelease
  2. Clear cache (delete snaps.db and snaps/)
  3. Rebuild cache on next lookup

Summary

Harmony's data architecture demonstrates:

  1. Cache-first design: snap_storage eliminates the need for a traditional database server
  2. Permalink system: Timestamp-based cache replay enables reproducibility
  3. Rich data model: harmonizer/types.ts (273 lines) defines the HarmonyRelease schema and its sub-structures
  4. Type safety: TypeScript types enforce a consistent data shape at compile time
  5. No migrations: Schema evolution without data migration complexity
  6. Stateless processing: All transformations in-memory, no persistent state beyond the cache
  7. MBID caching: Efficient batch lookup reduces MusicBrainz API calls

This architecture is well suited to read-heavy, stateless applications where reproducibility and staying within provider API limits are priorities.