# Music Metadata API - Data Layer ## Database Architecture Music Metadata API uses a dual-database architecture with two separate SQLite files: ``` ┌─────────────────────────────────────────────────────────────┐ │ Application Layer │ └─────────────────────────────────────────────────────────────┘ │ ┌───────────┴───────────┐ ▼ ▼ ┌──────────────────────────┐ ┌──────────────────────────┐ │ main_database.sqlite3 │ │ track_files.sqlite3 │ │ (~117GB) │ │ (~99GB) │ │ │ │ │ │ - tracks │ │ - track_files │ │ - albums │ │ (extended metadata) │ │ - artists │ │ │ │ - track_artists │ │ │ │ - artist_albums │ │ │ │ - album_images │ │ │ │ - artist_images │ │ │ │ - artist_genres │ │ │ └──────────────────────────┘ └──────────────────────────┘ ``` **Total storage:** ~216GB **Total tracks:** 256 million **Connection mode:** Read-only **Driver:** modernc.org/sqlite v1.34.4 (pure Go, no CGO) ## Connection Configuration ### Connection Strings **Main database:** ``` file:/path/to/main_database.sqlite3?mode=ro&_journal_mode=off&_cache_size=-64000&_mmap_size=1073741824&_query_only=true ``` **Track files database:** ``` file:/path/to/track_files.sqlite3?mode=ro&_journal_mode=off&_cache_size=-64000&_mmap_size=1073741824&_query_only=true ``` ### PRAGMA Settings | PRAGMA | Value | Purpose | Impact | |--------|-------|---------|--------| | `mode=ro` | Read-only | Prevents writes | No write locks, safe concurrent reads | | `_journal_mode=off` | Disabled | No WAL/rollback journal | Faster reads, safe for read-only | | `_cache_size=-64000` | 64MB | Page cache size | Reduces disk I/O for hot data | | `_mmap_size=1073741824` | 1GB | Memory-mapped I/O | Faster reads via mmap | | `_query_only=true` | Enabled | Additional read-only enforcement | Extra safety layer | **Cache size calculation:** - Negative value = kilobytes - `-64000` = 64,000 KB = 64 MB - Default SQLite cache is ~2MB (32x increase) **Memory-mapped I/O:** - Maps 1GB of database file into process memory - OS handles paging (faster than read() syscalls) - Effective for frequently accessed data ### Connection Pool ```go db.SetMaxOpenConns(8) // Conservative limit (8 concurrent queries) db.SetMaxIdleConns(8) // Keep all connections warm db.SetConnMaxLifetime(0) // No expiration (read-only safe) ``` **Rationale:** - Read-only workload (no write contention) - SQLite handles concurrent reads well - 8 connections balance throughput vs resource usage - No connection recycling needed (no state changes) ## Main Database Schema ### tracks Table **Purpose:** Core track metadata | Column | Type | Description | Nullable | |--------|------|-------------|----------| | `rowid` | INTEGER | SQLite internal row ID | No | | `id` | TEXT | Internal track ID | No | | `name` | TEXT | Track title | No | | `isrc` | TEXT | ISRC code | Yes | | `duration_ms` | INTEGER | Duration in milliseconds | No | | `explicit` | INTEGER | Explicit content flag (0/1) | No | | `track_number` | INTEGER | Track number on album | No | | `disc_number` | INTEGER | Disc number | No | | `popularity` | INTEGER | Popularity score (0-100) | No | | `preview_url` | TEXT | 30-second preview URL | Yes | | `album_rowid` | INTEGER | Foreign key to albums.rowid | No | **Indexes:** - Primary key on `id` - Index on `isrc` (for ISRC lookups) - Index on `album_rowid` (for album track listings) **Sample row:** ```sql id: 4cOdK2wGLETKBW3PvgPWqT name: Bohemian Rhapsody isrc: GBUM71029604 duration_ms: 354320 explicit: 0 track_number: 11 disc_number: 1 popularity: 89 preview_url: https://p.scdn.co/mp3-preview/... album_rowid: 12345 ``` **Estimated rows:** 256 million ### albums Table **Purpose:** Album metadata | Column | Type | Description | Nullable | |--------|------|-------------|----------| | `rowid` | INTEGER | SQLite internal row ID | No | | `id` | TEXT | Internal album ID | No | | `name` | TEXT | Album title | No | | `album_type` | TEXT | "album", "single", "compilation" | No | | `label` | TEXT | Record label | Yes | | `release_date` | TEXT | ISO 8601 date (YYYY-MM-DD) | No | | `release_date_precision` | TEXT | "year", "month", "day" | No | | `external_id_upc` | TEXT | UPC barcode | Yes | | `total_tracks` | INTEGER | Total tracks on album | No | | `copyright_c` | TEXT | Copyright notice | Yes | | `copyright_p` | TEXT | Phonographic copyright | Yes | **Indexes:** - Primary key on `id` - Index on `rowid` (for track joins) **Sample row:** ```sql id: 2ODvWsOgouMbaA5xf0RkJe name: A Night at the Opera album_type: album label: Hollywood Records release_date: 1975-11-21 release_date_precision: day external_id_upc: 050087246679 total_tracks: 12 copyright_c: 1975 Queen Productions Ltd copyright_p: 1975 Queen Productions Ltd ``` **Estimated rows:** Tens of millions (fewer than tracks) ### artists Table **Purpose:** Artist metadata | Column | Type | Description | Nullable | |--------|------|-------------|----------| | `rowid` | INTEGER | SQLite internal row ID | No | | `id` | TEXT | Internal artist ID | No | | `name` | TEXT | Artist name | No | | `followers_total` | INTEGER | Total followers | Yes | | `popularity` | INTEGER | Popularity score (0-100) | Yes | **Indexes:** - Primary key on `id` - Index on `name` (for search) **Sample row:** ```sql id: 0TnOYISbd1XYRBk9myaseg name: Queen followers_total: 45000000 popularity: 92 ``` **Estimated rows:** Millions (fewer than albums) ### track_artists Table **Purpose:** Many-to-many relationship between tracks and artists | Column | Type | Description | Nullable | |--------|------|-------------|----------| | `track_id` | TEXT | Foreign key to tracks.id | No | | `artist_id` | TEXT | Foreign key to artists.id | No | **Indexes:** - Composite index on `(track_id, artist_id)` - Index on `artist_id` (for artist track listings) **Sample rows:** ```sql track_id: 4cOdK2wGLETKBW3PvgPWqT, artist_id: 0TnOYISbd1XYRBk9myaseg track_id: 4cOdK2wGLETKBW3PvgPWqT, artist_id: 1A2B3C4D5E6F7G8H9I0J ``` **Estimated rows:** Hundreds of millions (tracks can have multiple artists) ### artist_albums Table **Purpose:** Many-to-many relationship between artists and albums with ordering | Column | Type | Description | Nullable | |--------|------|-------------|----------| | `artist_id` | TEXT | Foreign key to artists.id | No | | `album_id` | TEXT | Foreign key to albums.id | No | | `index_in_album` | INTEGER | Artist order on album | No | **Indexes:** - Composite index on `(album_id, index_in_album)` - Index on `artist_id` (for artist discography) **Sample rows:** ```sql artist_id: 0TnOYISbd1XYRBk9myaseg, album_id: 2ODvWsOgouMbaA5xf0RkJe, index_in_album: 0 artist_id: 1A2B3C4D5E6F7G8H9I0J, album_id: 2ODvWsOgouMbaA5xf0RkJe, index_in_album: 1 ``` **Purpose of index_in_album:** Preserves artist order for multi-artist albums (e.g., "Artist A & Artist B") ### album_images Table **Purpose:** Album artwork URLs | Column | Type | Description | Nullable | |--------|------|-------------|----------| | `album_id` | TEXT | Foreign key to albums.id | No | | `url` | TEXT | Image URL | No | | `width` | INTEGER | Width in pixels | No | | `height` | INTEGER | Height in pixels | No | **Indexes:** - Index on `album_id` **Sample rows:** ```sql album_id: 2ODvWsOgouMbaA5xf0RkJe, url: https://i.scdn.co/image/ab67616d0000b273..., width: 640, height: 640 album_id: 2ODvWsOgouMbaA5xf0RkJe, url: https://i.scdn.co/image/ab67616d00001e02..., width: 300, height: 300 album_id: 2ODvWsOgouMbaA5xf0RkJe, url: https://i.scdn.co/image/ab67616d00004851..., width: 64, height: 64 ``` **Typical sizes:** 640x640, 300x300, 64x64 **Image hosting:** External CDN (i.scdn.co), not hosted by API ### artist_images Table **Purpose:** Artist images/photos | Column | Type | Description | Nullable | |--------|------|-------------|----------| | `artist_id` | TEXT | Foreign key to artists.id | No | | `url` | TEXT | Image URL | No | | `width` | INTEGER | Width in pixels | No | | `height` | INTEGER | Height in pixels | No | **Indexes:** - Index on `artist_id` **Sample rows:** ```sql artist_id: 0TnOYISbd1XYRBk9myaseg, url: https://i.scdn.co/image/af2b8e57f6d7b5d..., width: 640, height: 640 artist_id: 0TnOYISbd1XYRBk9myaseg, url: https://i.scdn.co/image/c06971e9ff81696..., width: 320, height: 320 ``` ### artist_genres Table **Purpose:** Artist genre tags | Column | Type | Description | Nullable | |--------|------|-------------|----------| | `artist_id` | TEXT | Foreign key to artists.id | No | | `genre` | TEXT | Genre name | No | **Indexes:** - Index on `artist_id` **Sample rows:** ```sql artist_id: 0TnOYISbd1XYRBk9myaseg, genre: rock artist_id: 0TnOYISbd1XYRBk9myaseg, genre: classic rock artist_id: 0TnOYISbd1XYRBk9myaseg, genre: glam rock ``` **Genre characteristics:** - Multiple genres per artist - Lowercase, hyphenated (e.g., "indie-rock") - Spotify-style genre taxonomy ## Track Files Database Schema ### track_files Table **Purpose:** Extended track metadata not in main database | Column | Type | Description | Nullable | |--------|------|-------------|----------| | `track_id` | TEXT | Foreign key to tracks.id | No | | `has_lyrics` | INTEGER | Lyrics availability flag (0/1) | No | | `original_title` | TEXT | Original title (if different) | Yes | | `version_title` | TEXT | Version descriptor (e.g., "Radio Edit") | Yes | | `language_of_performance` | TEXT | JSON array of language codes | Yes | | `artist_roles` | TEXT | JSON object mapping artist IDs to roles | Yes | **Indexes:** - Primary key on `track_id` **Sample row:** ```sql track_id: 4cOdK2wGLETKBW3PvgPWqT has_lyrics: 1 original_title: Bohemian Rhapsody version_title: NULL language_of_performance: ["en"] artist_roles: {"0TnOYISbd1XYRBk9myaseg": ["performer", "composer"]} ``` **JSON field parsing:** **language_of_performance:** ```json ["en", "es"] // ISO 639-1 language codes ``` **artist_roles:** ```json { "artist_id_1": ["performer", "composer"], "artist_id_2": ["producer"], "artist_id_3": ["lyricist"] } ``` **Common roles:** - `performer` - Main performer - `composer` - Music composer - `lyricist` - Lyrics writer - `producer` - Producer - `engineer` - Recording engineer - `mixer` - Mix engineer **Estimated rows:** 256 million (one per track) ## Query Patterns ### Individual Track Lookup ```sql -- Step 1: Fetch track + album (single JOIN) SELECT t.id, t.name, t.isrc, t.duration_ms, t.explicit, t.track_number, t.disc_number, t.popularity, t.preview_url, a.id AS album_id, a.name AS album_name, a.album_type, a.label, a.release_date, a.release_date_precision, a.external_id_upc, a.total_tracks, a.copyright_c, a.copyright_p FROM tracks t JOIN albums a ON t.album_rowid = a.rowid WHERE t.id = ? -- Step 2: Fetch album images SELECT url, width, height FROM album_images WHERE album_id = ? ORDER BY width DESC -- Step 3: Fetch album artists SELECT a.id, a.name, a.followers_total, a.popularity FROM artists a JOIN artist_albums aa ON a.id = aa.artist_id WHERE aa.album_id = ? ORDER BY aa.index_in_album -- Step 4: Fetch track artists SELECT a.id, a.name, a.followers_total, a.popularity FROM artists a JOIN track_artists ta ON a.id = ta.artist_id WHERE ta.track_id = ? -- Step 5: Fetch artist genres (for each artist) SELECT genre FROM artist_genres WHERE artist_id = ? -- Step 6: Fetch artist images (for each artist) SELECT url, width, height FROM artist_images WHERE artist_id = ? ORDER BY width DESC -- Step 7: Fetch track files (from track_files.sqlite3) SELECT has_lyrics, original_title, version_title, language_of_performance, artist_roles FROM track_files WHERE track_id = ? ``` **Total queries for single track:** 7+ (depending on number of artists) ### Batch ISRC Lookup ```sql -- Step 1: Fetch all tracks by ISRC (single query with IN clause) SELECT t.id, t.name, t.isrc, t.duration_ms, t.explicit, t.track_number, t.disc_number, t.popularity, t.preview_url, a.id AS album_id, a.name AS album_name, a.album_type, a.label, a.release_date, a.release_date_precision, a.external_id_upc, a.total_tracks, a.copyright_c, a.copyright_p FROM tracks t JOIN albums a ON t.album_rowid = a.rowid WHERE t.isrc IN (?, ?, ?, ...) -- Up to 400 placeholders -- Step 2: Batch fetch album images (all albums at once) SELECT album_id, url, width, height FROM album_images WHERE album_id IN (?, ?, ?, ...) ORDER BY album_id, width DESC -- Step 3: Batch fetch album artists SELECT aa.album_id, a.id, a.name, a.followers_total, a.popularity, aa.index_in_album FROM artists a JOIN artist_albums aa ON a.id = aa.artist_id WHERE aa.album_id IN (?, ?, ?, ...) ORDER BY aa.album_id, aa.index_in_album -- Step 4: Batch fetch track artists SELECT ta.track_id, a.id, a.name, a.followers_total, a.popularity FROM artists a JOIN track_artists ta ON a.id = ta.artist_id WHERE ta.track_id IN (?, ?, ?, ...) -- Step 5: Batch fetch artist genres SELECT artist_id, genre FROM artist_genres WHERE artist_id IN (?, ?, ?, ...) -- Step 6: Batch fetch artist images SELECT artist_id, url, width, height FROM artist_images WHERE artist_id IN (?, ?, ?, ...) ORDER BY artist_id, width DESC -- Step 7: Batch fetch track files SELECT track_id, has_lyrics, original_title, version_title, language_of_performance, artist_roles FROM track_files WHERE track_id IN (?, ?, ?, ...) ``` **Total queries for 400 tracks:** 7 (vs 2,800+ for individual lookups) **Performance gain:** 400x fewer queries ### Search Queries **Track search:** ```sql SELECT id, name, isrc, duration_ms, popularity, album_rowid FROM tracks WHERE name LIKE ? COLLATE NOCASE -- ? = '%query%' ORDER BY popularity DESC LIMIT ? ``` **Artist search:** ```sql SELECT id, name, followers_total, popularity FROM artists WHERE name LIKE ? COLLATE NOCASE -- ? = '%query%' ORDER BY followers_total DESC LIMIT ? ``` **Search characteristics:** - `LIKE %query%` can't use indexes (full table scan) - `COLLATE NOCASE` for case-insensitive matching - Ordered by popularity/followers (most relevant first) - Limited to 50 results max - 10-second timeout via context deadline **Performance concern:** Searching 256M tracks with `LIKE %query%` is slow. Full-text search (FTS5) would be faster but not implemented. ### Album Tracks Lookup ```sql -- Fetch all tracks for an album SELECT t.id, t.name, t.isrc, t.duration_ms, t.explicit, t.track_number, t.disc_number, t.popularity, t.preview_url FROM tracks t WHERE t.album_rowid = ( SELECT rowid FROM albums WHERE id = ? ) ORDER BY t.disc_number, t.track_number ``` **Ordering:** Disc number first, then track number (preserves album order) ## Data Enrichment Strategy ### Enrichment Pipeline ``` 1. Fetch base entity (track/album/artist) ↓ 2. Collect related entity IDs ↓ 3. Batch fetch related entities ↓ 4. Assemble nested structures ↓ 5. Return enriched object ``` ### Batch Optimization Functions **Implementation in db.go (907 lines):** ```go // Batch fetch album images for multiple albums func (d *Database) batchGetAlbumImages(albumIDs []string) map[string][]Image { // Build IN clause placeholders := strings.Repeat("?,", len(albumIDs)-1) + "?" query := fmt.Sprintf(` SELECT album_id, url, width, height FROM album_images WHERE album_id IN (%s) ORDER BY album_id, width DESC `, placeholders) // Execute query rows, _ := d.mainDB.Query(query, albumIDs...) // Group by album_id result := make(map[string][]Image) for rows.Next() { var albumID string var img Image rows.Scan(&albumID, &img.URL, &img.Width, &img.Height) result[albumID] = append(result[albumID], img) } return result } ``` **Similar functions:** - `batchGetAlbumArtists(albumIDs []string) map[string][]Artist` - `batchGetTrackArtists(trackIDs []string) map[string][]Artist` - `batchGetArtistGenres(artistIDs []string) map[string][]string` - `batchGetArtistImages(artistIDs []string) map[string][]Image` - `batchEnrichTrackFiles(trackIDs []string) map[string]*TrackFile` **Pattern:** 1. Build IN clause with placeholders 2. Execute single query for all IDs 3. Group results by parent ID 4. Return map for O(1) lookup ### Why Batch Matters **Without batching (400 tracks):** - 400 track queries - 400 album queries - 400 album image queries - 400 album artist queries - 400 track artist queries - ~800 artist genre queries (2 artists per track avg) - ~800 artist image queries - 400 track file queries - **Total: ~3,600 queries** **With batching (400 tracks):** - 1 batch track query - 1 batch album image query - 1 batch album artist query - 1 batch track artist query - 1 batch artist genre query - 1 batch artist image query - 1 batch track file query - **Total: 7 queries** **Performance gain: 514x fewer queries** ## Data Provenance ### Source **Disclaimer from repository:** > "This project is not affiliated with Spotify." **Implications:** - Data source unclear (likely scraped or obtained from third party) - Legal status uncertain - No official Spotify endorsement ### Data Freshness **Static snapshot:** - No update mechanism - Data frozen at time of database creation - No real-time sync with Spotify **Staleness concerns:** - New releases not included - Popularity scores outdated - Artist follower counts stale - Deleted tracks still present **Mitigation:** - Treat as historical snapshot - Complement with real-time APIs for fresh data - Periodically obtain updated database (if available) ### Data Quality **Strengths:** - 256M tracks (massive coverage) - Rich metadata (genres, images, roles) - ISRC codes for cross-referencing - Popularity/follower metrics **Weaknesses:** - No data validation visible - Potential duplicates (not deduplicated) - Missing ISRCs for some tracks - Incomplete artist roles ## Storage Requirements ### Disk Space | Component | Size | Compressible | |-----------|------|--------------| | main_database.sqlite3 | ~117GB | Minimal (already compact) | | track_files.sqlite3 | ~99GB | Minimal (JSON fields) | | **Total** | **~216GB** | - | **Recommendations:** - SSD strongly recommended (HDD too slow for 256M rows) - NVMe for best performance - RAID not necessary (read-only, can rebuild from backup) ### Memory Usage **SQLite memory:** - Page cache: 64MB per connection - 8 connections: 512MB cache total - Memory-mapped I/O: 1GB per database (2GB total) - **Total: ~2.5GB minimum** **Application memory:** - Go runtime: ~50MB - Rate limiter map: Grows unbounded (leak) - Request buffers: ~10MB per concurrent request - **Total: ~100MB + leak** **Recommended RAM:** 4GB+ (2.5GB for SQLite + 1.5GB for OS/app) ### I/O Characteristics **Read patterns:** - Random reads (track lookups by ID/ISRC) - Sequential scans (search queries) - Batch reads (IN clause queries) **Write patterns:** - None (read-only) **Cache effectiveness:** - Hot data (popular tracks): High hit rate - Cold data (obscure tracks): Low hit rate - Search queries: Low hit rate (full scans) ## Database Maintenance ### No Maintenance Required **Read-only benefits:** - No VACUUM needed (no fragmentation from deletes) - No ANALYZE needed (statistics static) - No REINDEX needed (indexes don't degrade) - No WAL checkpoint (journal disabled) ### Backup Strategy **Simple backup:** ```bash # Copy files (database must be idle) cp main_database.sqlite3 backup/ cp track_files.sqlite3 backup/ ``` **Online backup (while running):** ```bash # SQLite backup API (requires custom tool) sqlite3 main_database.sqlite3 ".backup backup/main_database.sqlite3" ``` **Restore:** ```bash # Simply replace files cp backup/main_database.sqlite3 . cp backup/track_files.sqlite3 . ``` ### Integrity Checks **Verify database integrity:** ```bash sqlite3 main_database.sqlite3 "PRAGMA integrity_check;" sqlite3 track_files.sqlite3 "PRAGMA integrity_check;" ``` **Expected output:** `ok` **Run periodically:** Monthly or after hardware issues ## Performance Tuning ### Query Optimization **Indexes already present:** - Primary keys on all ID columns - Foreign key indexes (album_rowid, artist_id, etc.) - Search indexes (tracks.name, artists.name) **Missing indexes (potential improvements):** - Full-text search index (FTS5) on track/artist names - Composite index on (popularity, name) for sorted searches ### Connection Pool Tuning **Current settings:** ```go MaxOpenConns: 8 MaxIdleConns: 8 ConnMaxLifetime: 0 ``` **Tuning considerations:** - Increase MaxOpenConns for higher concurrency (16-32) - Monitor CPU usage (SQLite is CPU-bound for searches) - No benefit beyond CPU core count ### Cache Tuning **Current cache:** 64MB per connection (512MB total) **Increase cache:** ``` _cache_size=-128000 // 128MB per connection ``` **Tradeoff:** More memory usage vs fewer disk reads **Recommendation:** Monitor cache hit rate, increase if low ### Memory-Mapped I/O Tuning **Current mmap:** 1GB per database **Increase mmap:** ``` _mmap_size=2147483648 // 2GB ``` **Tradeoff:** More virtual memory vs faster reads **Recommendation:** Set to database size if RAM allows (117GB not feasible) ## Data Model Comparison ### vs Spotify Web API | Feature | Music Metadata API | Spotify Web API | |---------|-------------------|-----------------| | Track ID format | Spotify-compatible | Spotify IDs | | ISRC support | Yes | Yes | | Popularity | Static snapshot | Real-time | | Followers | Static snapshot | Real-time | | Images | External URLs | External URLs | | Genres | Artist-level | Artist-level | | Lyrics | Flag only | Not available | | Artist roles | Detailed | Limited | | Languages | Supported | Not available | ### vs MusicBrainz | Feature | Music Metadata API | MusicBrainz | |---------|-------------------|-------------| | Identifier | Spotify IDs, ISRC | MBIDs | | Dataset size | 256M tracks | ~40M recordings | | Popularity | Yes | No | | Followers | Yes | No | | Images | Yes (external) | Yes (Cover Art Archive) | | Genres | Yes | Yes (tags) | | Relationships | Limited | Extensive | | Credits | Artist roles | Detailed credits | | Updates | Static | Community-driven | ## Integration Considerations ### Joining with Other Databases **ISRC as common key:** ```sql -- Join with local library SELECT l.file_path, m.name, m.popularity FROM local_library l JOIN music_metadata_api.tracks m ON l.isrc = m.isrc ``` **Spotify ID as common key:** ```sql -- Join with MusicBrainz SELECT mb.mbid, mm.name, mm.popularity FROM musicbrainz.recording mb JOIN musicbrainz.isrc i ON mb.id = i.recording JOIN music_metadata_api.tracks mm ON i.isrc = mm.isrc ``` ### Data Export **Export to JSON:** ```bash sqlite3 main_database.sqlite3 <