Files
Alexander a1f6701bac feat: initial implementation of metadata aggregator
- gRPC service with MusicBrainz provider
- PostgreSQL schema with migrations
- Service layer with database-first caching
- Repository pattern for data access
- YAML configuration support
- Research documentation for 17 music metadata projects
2026-04-28 16:28:53 +02:00

25 KiB

Music Metadata API - Data Layer

Database Architecture

Music Metadata API uses a dual-database architecture with two separate SQLite files:

┌─────────────────────────────────────────────────────────────┐
│                    Application Layer                         │
└─────────────────────────────────────────────────────────────┘
                            │
                ┌───────────┴───────────┐
                ▼                       ▼
┌──────────────────────────┐  ┌──────────────────────────┐
│  main_database.sqlite3   │  │  track_files.sqlite3     │
│       (~117GB)           │  │       (~99GB)            │
│                          │  │                          │
│  - tracks                │  │  - track_files           │
│  - albums                │  │    (extended metadata)   │
│  - artists               │  │                          │
│  - track_artists         │  │                          │
│  - artist_albums         │  │                          │
│  - album_images          │  │                          │
│  - artist_images         │  │                          │
│  - artist_genres         │  │                          │
└──────────────────────────┘  └──────────────────────────┘

Total storage: ~216GB
Total tracks: 256 million
Connection mode: Read-only
Driver: modernc.org/sqlite v1.34.4 (pure Go, no CGO)

Connection Configuration

Connection Strings

Main database:

file:/path/to/main_database.sqlite3?mode=ro&_journal_mode=off&_cache_size=-64000&_mmap_size=1073741824&_query_only=true

Track files database:

file:/path/to/track_files.sqlite3?mode=ro&_journal_mode=off&_cache_size=-64000&_mmap_size=1073741824&_query_only=true

PRAGMA Settings

PRAGMA Value Purpose Impact
mode=ro Read-only Prevents writes No write locks, safe concurrent reads
_journal_mode=off Disabled No WAL/rollback journal Faster reads, safe for read-only
_cache_size=-64000 64MB Page cache size Reduces disk I/O for hot data
_mmap_size=1073741824 1GB Memory-mapped I/O Faster reads via mmap
_query_only=true Enabled Additional read-only enforcement Extra safety layer

Cache size calculation:

  • Negative value = kilobytes
  • -64000 = 64,000 KB = 64 MB
  • Default SQLite cache is ~2MB (32x increase)

Memory-mapped I/O:

  • Maps 1GB of database file into process memory
  • OS handles paging (faster than read() syscalls)
  • Effective for frequently accessed data

Connection Pool

db.SetMaxOpenConns(8)      // Conservative limit (8 concurrent queries)
db.SetMaxIdleConns(8)      // Keep all connections warm
db.SetConnMaxLifetime(0)   // No expiration (read-only safe)

Rationale:

  • Read-only workload (no write contention)
  • SQLite handles concurrent reads well
  • 8 connections balance throughput vs resource usage
  • No connection recycling needed (no state changes)

Main Database Schema

tracks Table

Purpose: Core track metadata

Column Type Description Nullable
rowid INTEGER SQLite internal row ID No
id TEXT Internal track ID No
name TEXT Track title No
isrc TEXT ISRC code Yes
duration_ms INTEGER Duration in milliseconds No
explicit INTEGER Explicit content flag (0/1) No
track_number INTEGER Track number on album No
disc_number INTEGER Disc number No
popularity INTEGER Popularity score (0-100) No
preview_url TEXT 30-second preview URL Yes
album_rowid INTEGER Foreign key to albums.rowid No

Indexes:

  • Primary key on id
  • Index on isrc (for ISRC lookups)
  • Index on album_rowid (for album track listings)

Sample row:

id: 4cOdK2wGLETKBW3PvgPWqT
name: Bohemian Rhapsody
isrc: GBUM71029604
duration_ms: 354320
explicit: 0
track_number: 11
disc_number: 1
popularity: 89
preview_url: https://p.scdn.co/mp3-preview/...
album_rowid: 12345

Estimated rows: 256 million

albums Table

Purpose: Album metadata

Column Type Description Nullable
rowid INTEGER SQLite internal row ID No
id TEXT Internal album ID No
name TEXT Album title No
album_type TEXT "album", "single", "compilation" No
label TEXT Record label Yes
release_date TEXT ISO 8601 date (YYYY-MM-DD) No
release_date_precision TEXT "year", "month", "day" No
external_id_upc TEXT UPC barcode Yes
total_tracks INTEGER Total tracks on album No
copyright_c TEXT Copyright notice Yes
copyright_p TEXT Phonographic copyright Yes

Indexes:

  • Primary key on id
  • Index on rowid (for track joins)

Sample row:

id: 2ODvWsOgouMbaA5xf0RkJe
name: A Night at the Opera
album_type: album
label: Hollywood Records
release_date: 1975-11-21
release_date_precision: day
external_id_upc: 050087246679
total_tracks: 12
copyright_c: 1975 Queen Productions Ltd
copyright_p: 1975 Queen Productions Ltd

Estimated rows: Tens of millions (fewer than tracks)

artists Table

Purpose: Artist metadata

Column Type Description Nullable
rowid INTEGER SQLite internal row ID No
id TEXT Internal artist ID No
name TEXT Artist name No
followers_total INTEGER Total followers Yes
popularity INTEGER Popularity score (0-100) Yes

Indexes:

  • Primary key on id
  • Index on name (for search)

Sample row:

id: 0TnOYISbd1XYRBk9myaseg
name: Queen
followers_total: 45000000
popularity: 92

Estimated rows: Millions (fewer than albums)

track_artists Table

Purpose: Many-to-many relationship between tracks and artists

Column Type Description Nullable
track_id TEXT Foreign key to tracks.id No
artist_id TEXT Foreign key to artists.id No

Indexes:

  • Composite index on (track_id, artist_id)
  • Index on artist_id (for artist track listings)

Sample rows:

track_id: 4cOdK2wGLETKBW3PvgPWqT, artist_id: 0TnOYISbd1XYRBk9myaseg
track_id: 4cOdK2wGLETKBW3PvgPWqT, artist_id: 1A2B3C4D5E6F7G8H9I0J

Estimated rows: Hundreds of millions (tracks can have multiple artists)

artist_albums Table

Purpose: Many-to-many relationship between artists and albums with ordering

Column Type Description Nullable
artist_id TEXT Foreign key to artists.id No
album_id TEXT Foreign key to albums.id No
index_in_album INTEGER Artist order on album No

Indexes:

  • Composite index on (album_id, index_in_album)
  • Index on artist_id (for artist discography)

Sample rows:

artist_id: 0TnOYISbd1XYRBk9myaseg, album_id: 2ODvWsOgouMbaA5xf0RkJe, index_in_album: 0
artist_id: 1A2B3C4D5E6F7G8H9I0J, album_id: 2ODvWsOgouMbaA5xf0RkJe, index_in_album: 1

Purpose of index_in_album: Preserves artist order for multi-artist albums (e.g., "Artist A & Artist B")

album_images Table

Purpose: Album artwork URLs

Column Type Description Nullable
album_id TEXT Foreign key to albums.id No
url TEXT Image URL No
width INTEGER Width in pixels No
height INTEGER Height in pixels No

Indexes:

  • Index on album_id

Sample rows:

album_id: 2ODvWsOgouMbaA5xf0RkJe, url: https://i.scdn.co/image/ab67616d0000b273..., width: 640, height: 640
album_id: 2ODvWsOgouMbaA5xf0RkJe, url: https://i.scdn.co/image/ab67616d00001e02..., width: 300, height: 300
album_id: 2ODvWsOgouMbaA5xf0RkJe, url: https://i.scdn.co/image/ab67616d00004851..., width: 64, height: 64

Typical sizes: 640x640, 300x300, 64x64

Image hosting: External CDN (i.scdn.co), not hosted by API

artist_images Table

Purpose: Artist images/photos

Column Type Description Nullable
artist_id TEXT Foreign key to artists.id No
url TEXT Image URL No
width INTEGER Width in pixels No
height INTEGER Height in pixels No

Indexes:

  • Index on artist_id

Sample rows:

artist_id: 0TnOYISbd1XYRBk9myaseg, url: https://i.scdn.co/image/af2b8e57f6d7b5d..., width: 640, height: 640
artist_id: 0TnOYISbd1XYRBk9myaseg, url: https://i.scdn.co/image/c06971e9ff81696..., width: 320, height: 320

artist_genres Table

Purpose: Artist genre tags

Column Type Description Nullable
artist_id TEXT Foreign key to artists.id No
genre TEXT Genre name No

Indexes:

  • Index on artist_id

Sample rows:

artist_id: 0TnOYISbd1XYRBk9myaseg, genre: rock
artist_id: 0TnOYISbd1XYRBk9myaseg, genre: classic rock
artist_id: 0TnOYISbd1XYRBk9myaseg, genre: glam rock

Genre characteristics:

  • Multiple genres per artist
  • Lowercase, hyphenated (e.g., "indie-rock")
  • Spotify-style genre taxonomy

Track Files Database Schema

track_files Table

Purpose: Extended track metadata not in main database

Column Type Description Nullable
track_id TEXT Foreign key to tracks.id No
has_lyrics INTEGER Lyrics availability flag (0/1) No
original_title TEXT Original title (if different) Yes
version_title TEXT Version descriptor (e.g., "Radio Edit") Yes
language_of_performance TEXT JSON array of language codes Yes
artist_roles TEXT JSON object mapping artist IDs to roles Yes

Indexes:

  • Primary key on track_id

Sample row:

track_id: 4cOdK2wGLETKBW3PvgPWqT
has_lyrics: 1
original_title: Bohemian Rhapsody
version_title: NULL
language_of_performance: ["en"]
artist_roles: {"0TnOYISbd1XYRBk9myaseg": ["performer", "composer"]}

JSON field parsing:

language_of_performance:

["en", "es"]  // ISO 639-1 language codes

artist_roles:

{
  "artist_id_1": ["performer", "composer"],
  "artist_id_2": ["producer"],
  "artist_id_3": ["lyricist"]
}

Common roles:

  • performer - Main performer
  • composer - Music composer
  • lyricist - Lyrics writer
  • producer - Producer
  • engineer - Recording engineer
  • mixer - Mix engineer

Estimated rows: 256 million (one per track)

Query Patterns

Individual Track Lookup

-- Step 1: Fetch track + album (single JOIN)
SELECT 
    t.id, t.name, t.isrc, t.duration_ms, t.explicit,
    t.track_number, t.disc_number, t.popularity, t.preview_url,
    a.id AS album_id, a.name AS album_name, a.album_type,
    a.label, a.release_date, a.release_date_precision,
    a.external_id_upc, a.total_tracks, a.copyright_c, a.copyright_p
FROM tracks t
JOIN albums a ON t.album_rowid = a.rowid
WHERE t.id = ?

-- Step 2: Fetch album images
SELECT url, width, height
FROM album_images
WHERE album_id = ?
ORDER BY width DESC

-- Step 3: Fetch album artists
SELECT a.id, a.name, a.followers_total, a.popularity
FROM artists a
JOIN artist_albums aa ON a.id = aa.artist_id
WHERE aa.album_id = ?
ORDER BY aa.index_in_album

-- Step 4: Fetch track artists
SELECT a.id, a.name, a.followers_total, a.popularity
FROM artists a
JOIN track_artists ta ON a.id = ta.artist_id
WHERE ta.track_id = ?

-- Step 5: Fetch artist genres (for each artist)
SELECT genre
FROM artist_genres
WHERE artist_id = ?

-- Step 6: Fetch artist images (for each artist)
SELECT url, width, height
FROM artist_images
WHERE artist_id = ?
ORDER BY width DESC

-- Step 7: Fetch track files (from track_files.sqlite3)
SELECT has_lyrics, original_title, version_title,
       language_of_performance, artist_roles
FROM track_files
WHERE track_id = ?

Total queries for single track: 7+ (depending on number of artists)

Batch ISRC Lookup

-- Step 1: Fetch all tracks by ISRC (single query with IN clause)
SELECT 
    t.id, t.name, t.isrc, t.duration_ms, t.explicit,
    t.track_number, t.disc_number, t.popularity, t.preview_url,
    a.id AS album_id, a.name AS album_name, a.album_type,
    a.label, a.release_date, a.release_date_precision,
    a.external_id_upc, a.total_tracks, a.copyright_c, a.copyright_p
FROM tracks t
JOIN albums a ON t.album_rowid = a.rowid
WHERE t.isrc IN (?, ?, ?, ...)  -- Up to 400 placeholders

-- Step 2: Batch fetch album images (all albums at once)
SELECT album_id, url, width, height
FROM album_images
WHERE album_id IN (?, ?, ?, ...)
ORDER BY album_id, width DESC

-- Step 3: Batch fetch album artists
SELECT aa.album_id, a.id, a.name, a.followers_total, a.popularity, aa.index_in_album
FROM artists a
JOIN artist_albums aa ON a.id = aa.artist_id
WHERE aa.album_id IN (?, ?, ?, ...)
ORDER BY aa.album_id, aa.index_in_album

-- Step 4: Batch fetch track artists
SELECT ta.track_id, a.id, a.name, a.followers_total, a.popularity
FROM artists a
JOIN track_artists ta ON a.id = ta.artist_id
WHERE ta.track_id IN (?, ?, ?, ...)

-- Step 5: Batch fetch artist genres
SELECT artist_id, genre
FROM artist_genres
WHERE artist_id IN (?, ?, ?, ...)

-- Step 6: Batch fetch artist images
SELECT artist_id, url, width, height
FROM artist_images
WHERE artist_id IN (?, ?, ?, ...)
ORDER BY artist_id, width DESC

-- Step 7: Batch fetch track files
SELECT track_id, has_lyrics, original_title, version_title,
       language_of_performance, artist_roles
FROM track_files
WHERE track_id IN (?, ?, ?, ...)

Total queries for 400 tracks: 7 (vs 2,800+ for individual lookups)

Performance gain: 400x fewer queries

Search Queries

Track search:

SELECT id, name, isrc, duration_ms, popularity, album_rowid
FROM tracks
WHERE name LIKE ? COLLATE NOCASE  -- ? = '%query%'
ORDER BY popularity DESC
LIMIT ?

Artist search:

SELECT id, name, followers_total, popularity
FROM artists
WHERE name LIKE ? COLLATE NOCASE  -- ? = '%query%'
ORDER BY followers_total DESC
LIMIT ?

Search characteristics:

  • LIKE %query% can't use indexes (full table scan)
  • COLLATE NOCASE for case-insensitive matching
  • Ordered by popularity/followers (most relevant first)
  • Limited to 50 results max
  • 10-second timeout via context deadline

Performance concern: Searching 256M tracks with LIKE %query% is slow. Full-text search (FTS5) would be faster but not implemented.

Album Tracks Lookup

-- Fetch all tracks for an album
SELECT t.id, t.name, t.isrc, t.duration_ms, t.explicit,
       t.track_number, t.disc_number, t.popularity, t.preview_url
FROM tracks t
WHERE t.album_rowid = (
    SELECT rowid FROM albums WHERE id = ?
)
ORDER BY t.disc_number, t.track_number

Ordering: Disc number first, then track number (preserves album order)

Data Enrichment Strategy

Enrichment Pipeline

1. Fetch base entity (track/album/artist)
   ↓
2. Collect related entity IDs
   ↓
3. Batch fetch related entities
   ↓
4. Assemble nested structures
   ↓
5. Return enriched object

Batch Optimization Functions

Implementation in db.go (907 lines):

// Batch fetch album images for multiple albums
func (d *Database) batchGetAlbumImages(albumIDs []string) map[string][]Image {
    // Build IN clause
    placeholders := strings.Repeat("?,", len(albumIDs)-1) + "?"
    query := fmt.Sprintf(`
        SELECT album_id, url, width, height
        FROM album_images
        WHERE album_id IN (%s)
        ORDER BY album_id, width DESC
    `, placeholders)
    
    // Execute query
    rows, _ := d.mainDB.Query(query, albumIDs...)
    
    // Group by album_id
    result := make(map[string][]Image)
    for rows.Next() {
        var albumID string
        var img Image
        rows.Scan(&albumID, &img.URL, &img.Width, &img.Height)
        result[albumID] = append(result[albumID], img)
    }
    
    return result
}

Similar functions:

  • batchGetAlbumArtists(albumIDs []string) map[string][]Artist
  • batchGetTrackArtists(trackIDs []string) map[string][]Artist
  • batchGetArtistGenres(artistIDs []string) map[string][]string
  • batchGetArtistImages(artistIDs []string) map[string][]Image
  • batchEnrichTrackFiles(trackIDs []string) map[string]*TrackFile

Pattern:

  1. Build IN clause with placeholders
  2. Execute single query for all IDs
  3. Group results by parent ID
  4. Return map for O(1) lookup

Why Batch Matters

Without batching (400 tracks):

  • 400 track queries
  • 400 album queries
  • 400 album image queries
  • 400 album artist queries
  • 400 track artist queries
  • ~800 artist genre queries (2 artists per track avg)
  • ~800 artist image queries
  • 400 track file queries
  • Total: ~3,600 queries

With batching (400 tracks):

  • 1 batch track query
  • 1 batch album image query
  • 1 batch album artist query
  • 1 batch track artist query
  • 1 batch artist genre query
  • 1 batch artist image query
  • 1 batch track file query
  • Total: 7 queries

Performance gain: 514x fewer queries

Data Provenance

Source

Disclaimer from repository:

"This project is not affiliated with Spotify."

Implications:

  • Data source unclear (likely scraped or obtained from third party)
  • Legal status uncertain
  • No official Spotify endorsement

Data Freshness

Static snapshot:

  • No update mechanism
  • Data frozen at time of database creation
  • No real-time sync with Spotify

Staleness concerns:

  • New releases not included
  • Popularity scores outdated
  • Artist follower counts stale
  • Deleted tracks still present

Mitigation:

  • Treat as historical snapshot
  • Complement with real-time APIs for fresh data
  • Periodically obtain updated database (if available)

Data Quality

Strengths:

  • 256M tracks (massive coverage)
  • Rich metadata (genres, images, roles)
  • ISRC codes for cross-referencing
  • Popularity/follower metrics

Weaknesses:

  • No data validation visible
  • Potential duplicates (not deduplicated)
  • Missing ISRCs for some tracks
  • Incomplete artist roles

Storage Requirements

Disk Space

Component Size Compressible
main_database.sqlite3 ~117GB Minimal (already compact)
track_files.sqlite3 ~99GB Minimal (JSON fields)
Total ~216GB -

Recommendations:

  • SSD strongly recommended (HDD too slow for 256M rows)
  • NVMe for best performance
  • RAID not necessary (read-only, can rebuild from backup)

Memory Usage

SQLite memory:

  • Page cache: 64MB per connection
  • 8 connections: 512MB cache total
  • Memory-mapped I/O: 1GB per database (2GB total)
  • Total: ~2.5GB minimum

Application memory:

  • Go runtime: ~50MB
  • Rate limiter map: Grows unbounded (leak)
  • Request buffers: ~10MB per concurrent request
  • Total: ~100MB + leak

Recommended RAM: 4GB+ (2.5GB for SQLite + 1.5GB for OS/app)

I/O Characteristics

Read patterns:

  • Random reads (track lookups by ID/ISRC)
  • Sequential scans (search queries)
  • Batch reads (IN clause queries)

Write patterns:

  • None (read-only)

Cache effectiveness:

  • Hot data (popular tracks): High hit rate
  • Cold data (obscure tracks): Low hit rate
  • Search queries: Low hit rate (full scans)

Database Maintenance

No Maintenance Required

Read-only benefits:

  • No VACUUM needed (no fragmentation from deletes)
  • No ANALYZE needed (statistics static)
  • No REINDEX needed (indexes don't degrade)
  • No WAL checkpoint (journal disabled)

Backup Strategy

Simple backup:

# Copy files (database must be idle)
cp main_database.sqlite3 backup/
cp track_files.sqlite3 backup/

Online backup (while running):

# SQLite backup API (requires custom tool)
sqlite3 main_database.sqlite3 ".backup backup/main_database.sqlite3"

Restore:

# Simply replace files
cp backup/main_database.sqlite3 .
cp backup/track_files.sqlite3 .

Integrity Checks

Verify database integrity:

sqlite3 main_database.sqlite3 "PRAGMA integrity_check;"
sqlite3 track_files.sqlite3 "PRAGMA integrity_check;"

Expected output: ok

Run periodically: Monthly or after hardware issues

Performance Tuning

Query Optimization

Indexes already present:

  • Primary keys on all ID columns
  • Foreign key indexes (album_rowid, artist_id, etc.)
  • Search indexes (tracks.name, artists.name)

Missing indexes (potential improvements):

  • Full-text search index (FTS5) on track/artist names
  • Composite index on (popularity, name) for sorted searches

Connection Pool Tuning

Current settings:

MaxOpenConns: 8
MaxIdleConns: 8
ConnMaxLifetime: 0

Tuning considerations:

  • Increase MaxOpenConns for higher concurrency (16-32)
  • Monitor CPU usage (SQLite is CPU-bound for searches)
  • No benefit beyond CPU core count

Cache Tuning

Current cache: 64MB per connection (512MB total)

Increase cache:

_cache_size=-128000  // 128MB per connection

Tradeoff: More memory usage vs fewer disk reads

Recommendation: Monitor cache hit rate, increase if low

Memory-Mapped I/O Tuning

Current mmap: 1GB per database

Increase mmap:

_mmap_size=2147483648  // 2GB

Tradeoff: More virtual memory vs faster reads

Recommendation: Set to database size if RAM allows (117GB not feasible)

Data Model Comparison

vs Spotify Web API

Feature Music Metadata API Spotify Web API
Track ID format Spotify-compatible Spotify IDs
ISRC support Yes Yes
Popularity Static snapshot Real-time
Followers Static snapshot Real-time
Images External URLs External URLs
Genres Artist-level Artist-level
Lyrics Flag only Not available
Artist roles Detailed Limited
Languages Supported Not available

vs MusicBrainz

Feature Music Metadata API MusicBrainz
Identifier Spotify IDs, ISRC MBIDs
Dataset size 256M tracks ~40M recordings
Popularity Yes No
Followers Yes No
Images Yes (external) Yes (Cover Art Archive)
Genres Yes Yes (tags)
Relationships Limited Extensive
Credits Artist roles Detailed credits
Updates Static Community-driven

Integration Considerations

Joining with Other Databases

ISRC as common key:

-- Join with local library
SELECT l.file_path, m.name, m.popularity
FROM local_library l
JOIN music_metadata_api.tracks m ON l.isrc = m.isrc

Spotify ID as common key:

-- Join with MusicBrainz
SELECT mb.mbid, mm.name, mm.popularity
FROM musicbrainz.recording mb
JOIN musicbrainz.isrc i ON mb.id = i.recording
JOIN music_metadata_api.tracks mm ON i.isrc = mm.isrc

Data Export

Export to JSON:

sqlite3 main_database.sqlite3 <<EOF
.mode json
.output tracks.json
SELECT * FROM tracks LIMIT 1000;
EOF

Export to CSV:

sqlite3 main_database.sqlite3 <<EOF
.mode csv
.output tracks.csv
SELECT id, name, isrc, popularity FROM tracks;
EOF

Data Import

Import from CSV:

sqlite3 new_database.sqlite3 <<EOF
.mode csv
.import tracks.csv tracks
EOF

Bulk insert from application:

tx, _ := db.Begin()
stmt, _ := tx.Prepare("INSERT INTO tracks VALUES (?, ?, ?, ...)")
for _, track := range tracks {
    stmt.Exec(track.ID, track.Name, track.ISRC, ...)
}
tx.Commit()

Limitations

No Write Operations

Implications:

  • Can't add new tracks
  • Can't update popularity scores
  • Can't delete duplicates
  • Can't fix data errors

Workarounds:

  • Create separate writable database for local additions
  • Use views to merge read-only + writable data
  • Periodically obtain updated database snapshot

Current search: LIKE %query% (slow)

FTS5 alternative:

-- Create FTS5 virtual table (requires writable database)
CREATE VIRTUAL TABLE tracks_fts USING fts5(name, content=tracks);
INSERT INTO tracks_fts SELECT name FROM tracks;

-- Fast search
SELECT * FROM tracks_fts WHERE name MATCH 'bohemian';

Limitation: Can't create FTS5 on read-only database

Workaround: Create separate FTS5 database, sync periodically

No Relationships Beyond Basics

Missing relationships:

  • Track-to-track (similar tracks, remixes)
  • Album-to-album (compilations, deluxe editions)
  • Artist-to-artist (collaborations, bands)

Workaround: Build relationship graph in separate database