feat: initial implementation of metadata aggregator

- gRPC service with MusicBrainz provider
- PostgreSQL schema with migrations
- Service layer with database-first caching
- Repository pattern for data access
- YAML configuration support
- Research documentation for 17 music metadata projects
2026-04-28 16:27:14 +02:00
commit a1f6701bac
163 changed files with 95884 additions and 0 deletions
@@ -0,0 +1,626 @@
+# Music Metadata API - Architecture
+
+## Architectural Overview
+
+Music Metadata API follows a three-layer architecture (API, database, models) on top of two read-only SQLite databases:
+
+```
+┌─────────────────────────────────────────────────────────────┐
+│                      HTTP Clients                            │
+└─────────────────────────────────────────────────────────────┘
+                            │
+                            ▼
+┌─────────────────────────────────────────────────────────────┐
+│                   API Layer (internal/api)                   │
+│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐      │
+│  │  Handlers    │  │ Rate Limiter │  │   OpenAPI    │      │
+│  │  (routing)   │  │ (middleware) │  │   (docs)     │      │
+│  └──────────────┘  └──────────────┘  └──────────────┘      │
+└─────────────────────────────────────────────────────────────┘
+                            │
+                            ▼
+┌─────────────────────────────────────────────────────────────┐
+│                Database Layer (internal/db)                  │
+│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐      │
+│  │   Queries    │  │ Enrichment   │  │    Batch     │      │
+│  │   (SQL)      │  │  (joins)     │  │ Optimization │      │
+│  └──────────────┘  └──────────────┘  └──────────────┘      │
+└─────────────────────────────────────────────────────────────┘
+                            │
+                            ▼
+┌─────────────────────────────────────────────────────────────┐
+│                 Models Layer (internal/models)               │
+│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐      │
+│  │    Track     │  │    Album     │  │   Artist     │      │
+│  │   (struct)   │  │   (struct)   │  │  (struct)    │      │
+│  └──────────────┘  └──────────────┘  └──────────────┘      │
+└─────────────────────────────────────────────────────────────┘
+                            │
+                            ▼
+┌─────────────────────────────────────────────────────────────┐
+│              SQLite Databases (read-only)                    │
+│  ┌──────────────────────────┐  ┌──────────────────────────┐ │
+│  │  main_database.sqlite3   │  │  track_files.sqlite3     │ │
+│  │       (~117GB)           │  │       (~99GB)            │ │
+│  └──────────────────────────┘  └──────────────────────────┘ │
+└─────────────────────────────────────────────────────────────┘
+```
+
+## Directory Structure
+
+```
+music-metadata-api/
+├── cmd/
+│   └── server/
+│       └── main.go                    # Entry point (62 lines)
+│
+├── internal/
+│   ├── api/
+│   │   ├── handlers.go                # HTTP route handlers
+│   │   ├── ratelimit.go               # Token bucket rate limiter
+│   │   └── openapi.go                 # OpenAPI spec + Swagger UI
+│   │
+│   ├── db/
+│   │   └── db.go                      # Database layer (907 lines)
+│   │
+│   └── models/
+│       └── models.go                  # Data structures (65 lines)
+│
+├── Dockerfile                         # Multi-stage build
+├── docker-compose.yml                 # Production deployment
+├── go.mod                             # Dependencies
+├── go.sum                             # Dependency checksums
+├── .gitignore                         # Excludes databases, binaries
+└── .github/
+    └── workflows/
+        └── docker-publish.yml         # CI/CD pipeline
+```
+
+## Layer Breakdown
+
+### Entry Point: cmd/server/main.go
+
+**Responsibilities:**
+- Parse CLI flags (`-db`, `-addr`)
+- Initialize database connections
+- Set up HTTP router
+- Configure graceful shutdown
+- Start HTTP server
+
+**Key code flow:**
+```go
+// 1. Parse flags
+dbPath := flag.String("db", "", "path to database")
+addr := flag.String("addr", ":8080", "server address")
+flag.Parse() // must run before *dbPath and *addr are read
+
+// 2. Initialize database
+database, err := db.NewDatabase(*dbPath)
+if err != nil {
+    slog.Error("open database", "error", err)
+    os.Exit(1)
+}
+
+// 3. Set up router with rate limiting
+mux := http.NewServeMux()
+rateLimiter := api.NewRateLimiter(100, 200)  // 100 req/s, 200 burst
+handler := rateLimiter.Limit(mux)
+
+// 4. Register routes
+api.RegisterRoutes(mux, database)
+
+// 5. Graceful shutdown on SIGINT/SIGTERM
+server := &http.Server{Addr: *addr, Handler: handler}
+// ... shutdown logic with 10s timeout
+```
+
+**File size:** 62 lines (minimal, focused)
+
+### API Layer: internal/api/
+
+#### handlers.go
+
+**Responsibilities:**
+- Route registration
+- Request parsing
+- Response serialization
+- Error handling
+- Query parameter validation
+
+**Route patterns (Go 1.22+ enhanced routing):**
+```go
+// Method + path patterns
+mux.HandleFunc("POST /batch/lookup", handleBatchLookup)
+mux.HandleFunc("GET /lookup/isrc/{isrc}", handleISRCLookup)
+mux.HandleFunc("GET /lookup/track/{id}", handleTrackLookup)
+mux.HandleFunc("GET /lookup/artist/{id}", handleArtistLookup)
+mux.HandleFunc("GET /lookup/album/{id}", handleAlbumLookup)
+mux.HandleFunc("GET /lookup/album/{id}/tracks", handleAlbumTracks)
+mux.HandleFunc("GET /search/track", handleTrackSearch)
+mux.HandleFunc("GET /search/artist", handleArtistSearch)
+mux.HandleFunc("GET /health", handleHealth)
+mux.HandleFunc("GET /docs", handleDocs)
+mux.HandleFunc("GET /openapi.yaml", handleOpenAPI)
+```
+
+**Handler pattern:**
+```go
+func handleTrackLookup(w http.ResponseWriter, r *http.Request) {
+    // 1. Extract path parameter
+    id := r.PathValue("id")
+    
+    // 2. Call database layer (database is the *db.Database passed to RegisterRoutes)
+    track, err := database.GetTrack(id)
+    if err != nil {
+        http.Error(w, "Track not found", http.StatusNotFound)
+        return
+    }
+    
+    // 3. Serialize response
+    w.Header().Set("Content-Type", "application/json")
+    json.NewEncoder(w).Encode(track)
+}
+```
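
The handler pattern above answers every database error with 404. A minimal sketch of a finer mapping follows; `errNotFound` and `statusFor` are hypothetical names that do not appear in the codebase, and it assumes the database layer can return a distinguishable "no row" error.

```go
package main

import (
	"context"
	"errors"
	"net/http"
)

// errNotFound is a hypothetical sentinel the database layer would return
// when no row matches (for example by translating sql.ErrNoRows).
var errNotFound = errors.New("not found")

// statusFor maps a lookup error to an HTTP status code so that only a
// missing row produces 404.
func statusFor(err error) int {
	switch {
	case err == nil:
		return http.StatusOK
	case errors.Is(err, errNotFound):
		return http.StatusNotFound
	case errors.Is(err, context.DeadlineExceeded):
		return http.StatusGatewayTimeout // query exceeded its deadline
	default:
		return http.StatusInternalServerError
	}
}
```

A handler could then call `http.Error(w, http.StatusText(code), code)` with the returned code instead of hard-coding `http.StatusNotFound`.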
+
+**Validation rules:**
+- Search queries: minimum 2 characters
+- Batch requests: maximum 400 items
+- Limit parameters: maximum 50 results
+- Timeouts: 10 seconds for search queries
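
The search limits above could be enforced with a small helper along these lines. This is a sketch only: `validateSearch` and the constant names are hypothetical, and both the default limit and the choice to clamp (rather than reject) an oversized limit are assumptions.

```go
package main

import (
	"errors"
	"strconv"
	"strings"
)

// Limits taken from the validation rules above; the names are illustrative.
const (
	minQueryLen = 2
	maxLimit    = 50
)

// validateSearch checks the raw query and limit parameters and returns
// the trimmed query plus the limit to use.
func validateSearch(q, limit string) (string, int, error) {
	q = strings.TrimSpace(q)
	if len([]rune(q)) < minQueryLen {
		return "", 0, errors.New("query must be at least 2 characters")
	}
	if limit == "" {
		return q, maxLimit, nil // assumed default when no limit is given
	}
	n, err := strconv.Atoi(limit)
	if err != nil || n < 1 {
		return "", 0, errors.New("limit must be a positive integer")
	}
	if n > maxLimit {
		n = maxLimit // clamp instead of rejecting (assumption)
	}
	return q, n, nil
}
```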
+
+#### ratelimit.go
+
+**Implementation:** Token bucket algorithm with per-IP tracking
+
+**Data structure:**
+```go
+type RateLimiter struct {
+    visitors map[string]*rate.Limiter  // IP -> limiter
+    mu       sync.RWMutex               // Protects visitors map
+    rate     rate.Limit                 // Tokens per second
+    burst    int                        // Burst capacity
+}
+```
+
+**Algorithm:**
+1. Extract client IP from `X-Forwarded-For` header (fallback to `RemoteAddr`)
+2. Look up or create limiter for IP
+3. Check if token available (`limiter.Allow()`)
+4. If allowed, pass to next handler
+5. If denied, return HTTP 429 with `Retry-After` header
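
Step 1 of the algorithm can be sketched as follows. `clientIP` is a hypothetical helper, and taking the first entry of a comma-separated `X-Forwarded-For` chain is an assumption about how the header is parsed.

```go
package main

import (
	"net"
	"strings"
)

// clientIP returns the address used as the rate-limit key.
// forwardedFor is the raw X-Forwarded-For header value and remoteAddr is
// http.Request.RemoteAddr ("host:port").
func clientIP(forwardedFor, remoteAddr string) string {
	// The header may hold a comma-separated chain; the first entry is
	// the original client as reported by the proxies.
	if forwardedFor != "" {
		first, _, _ := strings.Cut(forwardedFor, ",")
		if ip := strings.TrimSpace(first); ip != "" {
			return ip
		}
	}
	// Fall back to the TCP peer address, stripping the port if present.
	host, _, err := net.SplitHostPort(remoteAddr)
	if err != nil {
		return remoteAddr
	}
	return host
}
```

Since the header is set by the client unless a trusted proxy overwrites it, keying the limiter on it is only reliable behind such a proxy.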
+
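+**BUG:** The visitor map grows without bound. Entries for inactive IPs are never removed, so a long-running server keeps one limiter for every distinct client IP it has ever seen. Because the key is taken from `X-Forwarded-For`, a client can also inflate the map by varying that header unless a trusted proxy overwrites it.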
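
One possible fix is to record when each visitor was last seen and evict idle entries periodically. The sketch below is stdlib-only and uses a hand-written token bucket in place of the `rate.Limiter` values in the real struct; all names (`visitor`, `limiter`, `evictIdle`) are hypothetical.

```go
package main

import (
	"sync"
	"time"
)

// visitor is a minimal token bucket plus the time it was last used.
type visitor struct {
	tokens   float64
	lastSeen time.Time
}

// limiter is a simplified stand-in for the RateLimiter struct above.
type limiter struct {
	mu       sync.Mutex
	visitors map[string]*visitor
	rate     float64 // tokens added per second
	burst    float64 // bucket capacity
}

func newLimiter(rate, burst float64) *limiter {
	return &limiter{visitors: make(map[string]*visitor), rate: rate, burst: burst}
}

// allow refills the bucket for ip based on elapsed time, then takes one token.
func (l *limiter) allow(ip string) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	now := time.Now()
	v, ok := l.visitors[ip]
	if !ok {
		v = &visitor{tokens: l.burst, lastSeen: now}
		l.visitors[ip] = v
	}
	v.tokens += now.Sub(v.lastSeen).Seconds() * l.rate
	if v.tokens > l.burst {
		v.tokens = l.burst
	}
	v.lastSeen = now
	if v.tokens < 1 {
		return false
	}
	v.tokens--
	return true
}

// evictIdle removes visitors not seen for longer than maxIdle and
// reports how many entries were dropped.
func (l *limiter) evictIdle(maxIdle time.Duration) int {
	l.mu.Lock()
	defer l.mu.Unlock()
	now := time.Now()
	removed := 0
	for ip, v := range l.visitors {
		if now.Sub(v.lastSeen) > maxIdle {
			delete(l.visitors, ip)
			removed++
		}
	}
	return removed
}
```

A background goroutine driven by a `time.Ticker` could call `evictIdle` every minute or so; the interval and idle threshold are tuning choices, not values taken from the project.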
+
+**Configuration:**
+- Rate: 100 requests/second
+- Burst: 200 requests
+- Scope: Per-IP (not per-user, no authentication)
+
+#### openapi.go
+
+**Responsibilities:**
+- Serve OpenAPI 3.1 specification at `/openapi.yaml`
+- Serve Swagger UI at `/docs`
+- Embed OpenAPI spec in binary (no external files)
+
+**Swagger UI loading:**
+```html
+<!-- Loaded from unpkg.com CDN (browser-side) -->
+<script src="https://unpkg.com/swagger-ui-dist@5/swagger-ui-bundle.js"></script>
+<link rel="stylesheet" href="https://unpkg.com/swagger-ui-dist@5/swagger-ui.css" />
+```
+
+**OpenAPI spec highlights:**
+- Version: 3.1.0
+- All endpoints documented
+- Request/response schemas
+- Example payloads
+- Error responses
+
+### Database Layer: internal/db/db.go
+
+**File size:** 907 lines (largest file in codebase)
+
+**Responsibilities:**
+- SQLite connection management
+- Query execution
+- Data enrichment (joining related entities)
+- Batch optimization
+- Transaction handling (read-only)
+
+#### Connection Management
+
+**Dual database connections:**
+```go
+type Database struct {
+    mainDB       *sql.DB  // main_database.sqlite3
+    trackFilesDB *sql.DB  // track_files.sqlite3
+}
+```
+
+**Connection string PRAGMAs:**
+```
+file:/path/to/db.sqlite3?mode=ro&_journal_mode=off&_cache_size=-64000&_mmap_size=1073741824&_query_only=true
+```
+
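+**PRAGMA breakdown:**
+
+| Parameter | Value | Purpose |
+|-----------|-------|---------|
+| `mode=ro` | Read-only | SQLite URI parameter (not a PRAGMA); prevents accidental writes |
+| `_journal_mode=off` | Disabled | No rollback journal (safe because nothing is written) |
+| `_cache_size=-64000` | ~64MB | Page cache size (negative = size in KiB) |
+| `_mmap_size=1073741824` | 1GB | Memory-mapped I/O size in bytes |
+| `_query_only=true` | Enabled | Additional read-only enforcement |
+
+**Note:** The underscore-prefixed names are driver-specific DSN options, not SQLite URI parameters. `modernc.org/sqlite` documents the form `_pragma=name(value)` (for example `_pragma=query_only(1)`), so whether the settings above take effect with that driver should be verified.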
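
A connection string with these parameters can be assembled with `net/url` rather than by hand. This is a sketch: `buildDSN` is a hypothetical helper, and the parameter names are copied from the connection string above without checking that the driver honors them.

```go
package main

import "net/url"

// buildDSN assembles a read-only connection string for the database at
// path, using the parameters listed in the table above.
func buildDSN(path string) string {
	q := url.Values{}
	q.Set("mode", "ro")
	q.Set("_journal_mode", "off")
	q.Set("_cache_size", "-64000")     // negative value = size in KiB
	q.Set("_mmap_size", "1073741824")  // 1 GiB
	q.Set("_query_only", "true")
	// Encode sorts the keys; parameter order does not change the meaning.
	return "file:" + path + "?" + q.Encode()
}
```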
+
+**Connection pool:**
+```go
+db.SetMaxOpenConns(8)   // Conservative limit
+db.SetMaxIdleConns(8)   // Keep connections warm
+db.SetConnMaxLifetime(0) // No expiration
+```
+
+#### Query Patterns
+
+**Individual lookups:**
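+```go
+func (d *Database) GetTrack(id string) (*models.Track, error) {
+    // 1. Fetch base track + album
+    row := d.mainDB.QueryRow(`
+        SELECT t.id, t.name, t.isrc, t.duration_ms, t.explicit,
+               t.track_number, t.disc_number, t.popularity, t.preview_url,
+               a.id, a.name, a.album_type, a.label, a.release_date,
+               a.release_date_precision, a.external_id_upc, a.total_tracks
+        FROM tracks t
+        JOIN albums a ON t.album_rowid = a.rowid
+        WHERE t.id = ?
+    `, id)
+
+    // Scan the 17 selected columns (assumes none of them is NULL)
+    var track models.Track
+    if err := row.Scan(
+        &track.ID, &track.Name, &track.ISRC, &track.DurationMs, &track.Explicit,
+        &track.TrackNumber, &track.DiscNumber, &track.Popularity, &track.PreviewURL,
+        &track.Album.ID, &track.Album.Name, &track.Album.AlbumType, &track.Album.Label,
+        &track.Album.ReleaseDate, &track.Album.ReleaseDatePrecision,
+        &track.Album.ExternalIDUPC, &track.Album.TotalTracks,
+    ); err != nil {
+        return nil, err // sql.ErrNoRows when the ID is unknown
+    }
+
+    // 2. Enrich album (images, artists)
+    d.enrichAlbum(&track.Album)
+
+    // 3. Enrich track (artists, track_files)
+    d.enrichTrack(&track)
+
+    return &track, nil
+}
+```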
+
+**Batch lookups:**
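+```go
+func (d *Database) BatchGetByISRC(isrcs []string) (map[string]*models.Track, error) {
+    tracks := make(map[string]*models.Track, len(isrcs))
+    if len(isrcs) == 0 {
+        return tracks, nil // strings.Repeat panics on a negative count
+    }
+
+    // 1. Build IN clause; Query takes ...any, so convert the []string
+    placeholders := strings.Repeat("?,", len(isrcs)-1) + "?"
+    args := make([]any, len(isrcs))
+    for i, isrc := range isrcs {
+        args[i] = isrc
+    }
+    query := fmt.Sprintf(`
+        SELECT t.id, t.isrc, ...
+        FROM tracks t
+        JOIN albums a ON t.album_rowid = a.rowid
+        WHERE t.isrc IN (%s)
+    `, placeholders)
+
+    // 2. Execute batch query
+    rows, err := d.mainDB.Query(query, args...)
+    if err != nil {
+        return nil, err
+    }
+    defer rows.Close()
+
+    // 3. Scan rows into tracks (loop elided), collecting IDs for enrichment
+    trackIDs := make([]string, 0, len(isrcs))
+    albumIDs := make([]string, 0, len(isrcs))
+
+    // 4. Batch enrich all entities
+    d.batchEnrichAlbums(albumIDs, tracks)
+    d.batchEnrichTracks(trackIDs, tracks)
+
+    return tracks, nil
+}
+```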
+
+#### Data Enrichment Flow
+
+**Track enrichment pipeline:**
+```
+1. Fetch base track + album (single JOIN)
+   ↓
+2. Enrich album:
+   - Batch fetch album images (batchGetAlbumImages)
+   - Batch fetch album artists (batchGetAlbumArtists)
+   ↓
+3. Enrich track:
+   - Batch fetch track artists (batchGetTrackArtists)
+   - Batch fetch track files (batchEnrichTrackFiles)
+   ↓
+4. Enrich artists:
+   - Batch fetch artist genres (batchGetArtistGenres)
+   - Batch fetch artist images (batchGetArtistImages)
+   ↓
+5. Return fully enriched track
+```
+
+**Batch optimization functions:**
+
+| Function | Purpose | Query Pattern |
+|----------|---------|---------------|
+| `batchGetAlbumImages` | Fetch all images for albums | `WHERE album_id IN (...)` |
+| `batchGetAlbumArtists` | Fetch all artists for albums | `WHERE album_id IN (...)` |
+| `batchGetTrackArtists` | Fetch all artists for tracks | `WHERE track_id IN (...)` |
+| `batchGetArtistGenres` | Fetch all genres for artists | `WHERE artist_id IN (...)` |
+| `batchGetArtistImages` | Fetch all images for artists | `WHERE artist_id IN (...)` |
+| `batchEnrichTrackFiles` | Fetch extended track data | `WHERE track_id IN (...)` |
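
After one `IN (...)` query returns rows for many parents, the rows have to be distributed back to their owners. A minimal sketch of that grouping step for album images follows; `imageRow` and `groupImages` are hypothetical names, and the real functions may attach results differently.

```go
package main

// Image mirrors the fields of the Image model.
type Image struct {
	URL           string
	Width, Height int
}

// imageRow is one row of a hypothetical "WHERE album_id IN (...)" result.
type imageRow struct {
	AlbumID string
	Image   Image
}

// groupImages buckets rows by album ID so each album can be enriched
// with a map lookup instead of issuing its own query.
func groupImages(rows []imageRow) map[string][]Image {
	byAlbum := make(map[string][]Image)
	for _, r := range rows {
		byAlbum[r.AlbumID] = append(byAlbum[r.AlbumID], r.Image)
	}
	return byAlbum
}
```

Row order within each album is preserved, so an `ORDER BY` in the batch query carries through to the per-album slices.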
+
+**Why batch optimization matters:**
+- A single batch request with 400 tracks triggers ~6 batch enrichment queries
+- Without batching: 400 tracks × 6 enrichment queries = 2,400 database queries (plus 400 base lookups)
+- With batching: 1 main query + 6 batch queries = 7 database queries
+- **Reduction: 2,400 / 7 ≈ 343x fewer queries** (400x if the base lookups are counted on both sides: 2,800 / 7)
+
+#### Search Implementation
+
+**Track search:**
+```sql
+SELECT id, name, isrc, duration_ms, popularity, album_rowid
+FROM tracks
+WHERE name LIKE ? COLLATE NOCASE
+ORDER BY popularity DESC
+LIMIT ?
+```
+
+**Artist search:**
+```sql
+SELECT id, name, followers_total, popularity
+FROM artists
+WHERE name LIKE ? COLLATE NOCASE
+ORDER BY followers_total DESC
+LIMIT ?
+```
+
+**Search characteristics:**
+- Pattern: `%query%` (substring match)
+- Collation: `NOCASE` (case-insensitive; SQLite's `LIKE` already ignores ASCII case by default)
+- Timeout: 10 seconds (context deadline)
+- Min query length: 2 characters
+- Max results: 50
+
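+**Performance concern:** A pattern with a leading wildcard (`LIKE '%query%'`) cannot use a B-tree index on `name`, so searches fall back to scanning rows instead of seeking. On 256M tracks this will be slow, bounded only by the 10-second timeout. SQLite's FTS5 extension (Full-Text Search) would be faster but is not implemented; substring matching specifically would need its trigram tokenizer, and the index would have to be built into the databases offline because they are opened read-only.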
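
The `%query%` pattern is built from user input, so `%` and `_` inside the query would act as wildcards unless escaped. A sketch of a pattern builder follows; `likePattern` is a hypothetical helper, and whether the actual code escapes input is not stated in this document.

```go
package main

import "strings"

// likePattern wraps q in % wildcards for a substring match and escapes
// LIKE metacharacters so input such as "100%" matches literally.
// The SQL must then read: name LIKE ? ESCAPE '\'
func likePattern(q string) string {
	r := strings.NewReplacer(`\`, `\\`, `%`, `\%`, `_`, `\_`)
	return "%" + r.Replace(q) + "%"
}
```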
+
+### Models Layer: internal/models/models.go
+
+**File size:** 65 lines (smallest layer)
+
+**Responsibilities:**
+- Define data structures
+- JSON serialization tags
+- Nested relationships
+
+**Core models:**
+
+```go
+type Track struct {
+    ID            string   `json:"id"`
+    Name          string   `json:"name"`
+    ISRC          string   `json:"isrc,omitempty"`
+    DurationMs    int      `json:"duration_ms"`
+    Explicit      bool     `json:"explicit"`
+    TrackNumber   int      `json:"track_number"`
+    DiscNumber    int      `json:"disc_number"`
+    Popularity    int      `json:"popularity"`
+    PreviewURL    string   `json:"preview_url,omitempty"`
+    Album         Album    `json:"album"`
+    Artists       []Artist `json:"artists"`
+    
+    // Extended fields from track_files DB
+    OriginalTitle string                 `json:"original_title,omitempty"`
+    VersionTitle  string                 `json:"version_title,omitempty"`
+    HasLyrics     bool                   `json:"has_lyrics"`
+    Languages     []string               `json:"languages,omitempty"`
+    ArtistRoles   map[string][]string    `json:"artist_roles,omitempty"`
+}
+
+type Album struct {
+    ID                    string   `json:"id"`
+    Name                  string   `json:"name"`
+    AlbumType             string   `json:"album_type"`
+    Label                 string   `json:"label,omitempty"`
+    ReleaseDate           string   `json:"release_date"`
+    ReleaseDatePrecision  string   `json:"release_date_precision"`
+    ExternalIDUPC         string   `json:"external_id_upc,omitempty"`
+    TotalTracks           int      `json:"total_tracks"`
+    CopyrightC            string   `json:"copyright_c,omitempty"`
+    CopyrightP            string   `json:"copyright_p,omitempty"`
+    Images                []Image  `json:"images,omitempty"`
+    Artists               []Artist `json:"artists,omitempty"`
+}
+
+type Artist struct {
+    ID             string   `json:"id"`
+    Name           string   `json:"name"`
+    FollowersTotal int      `json:"followers_total,omitempty"`
+    Popularity     int      `json:"popularity,omitempty"`
+    Genres         []string `json:"genres,omitempty"`
+    Images         []Image  `json:"images,omitempty"`
+}
+
+type Image struct {
+    URL    string `json:"url"`
+    Width  int    `json:"width"`
+    Height int    `json:"height"`
+}
+```
+
+**Batch request/response models:**
+
+```go
+type BatchRequest struct {
+    Tracks  []string `json:"tracks,omitempty"`   // Track IDs
+    Artists []string `json:"artists,omitempty"`  // Artist IDs
+    Albums  []string `json:"albums,omitempty"`   // Album IDs
+    ISRCs   []string `json:"isrcs,omitempty"`    // ISRC codes
+}
+
+type BatchResponse struct {
+    Tracks  map[string]*Track  `json:"tracks,omitempty"`
+    Artists map[string]*Artist `json:"artists,omitempty"`
+    Albums  map[string]*Album  `json:"albums,omitempty"`
+    ISRCs   map[string]*Track  `json:"isrcs,omitempty"`
+}
+```
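
Decoding a batch request and applying the 400-item cap might look like the sketch below. It assumes the cap applies to the combined length of all four lists; `size`, `validateBatch`, and `decodeBatch` are hypothetical names.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// BatchRequest repeats the model above so the example is self-contained.
type BatchRequest struct {
	Tracks  []string `json:"tracks,omitempty"`
	Artists []string `json:"artists,omitempty"`
	Albums  []string `json:"albums,omitempty"`
	ISRCs   []string `json:"isrcs,omitempty"`
}

const maxBatchItems = 400

// size returns the number of items across all four lists.
func (r *BatchRequest) size() int {
	return len(r.Tracks) + len(r.Artists) + len(r.Albums) + len(r.ISRCs)
}

// validateBatch rejects requests whose combined size exceeds the cap.
func validateBatch(r *BatchRequest) error {
	if n := r.size(); n > maxBatchItems {
		return fmt.Errorf("batch has %d items, maximum is %d", n, maxBatchItems)
	}
	return nil
}

// decodeBatch parses a JSON body and validates it.
func decodeBatch(data []byte) (*BatchRequest, error) {
	var req BatchRequest
	if err := json.Unmarshal(data, &req); err != nil {
		return nil, err
	}
	if err := validateBatch(&req); err != nil {
		return nil, err
	}
	return &req, nil
}
```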
+
+## Request Flow
+
+### Example: GET /lookup/track/{id}
+
+```
+1. Client Request
+   GET /lookup/track/abc123
+   ↓
+2. Rate Limiter Middleware
+   - Extract IP from X-Forwarded-For
+   - Check token bucket for IP
+   - If allowed, continue; else return 429
+   ↓
+3. HTTP Handler (api/handlers.go)
+   - Extract "abc123" from path
+   - Call db.GetTrack("abc123")
+   ↓
+4. Database Layer (db/db.go)
+   - Query track + album (single JOIN)
+   - Enrich album (images, artists)
+   - Enrich track (artists, track_files)
+   - Enrich artists (genres, images)
+   ↓
+5. Models Layer (models/models.go)
+   - Populate Track struct
+   - Nest Album, Artists
+   ↓
+6. HTTP Handler
+   - Serialize Track to JSON
+   - Set Content-Type: application/json
+   - Write response
+   ↓
+7. Client Response
+   200 OK
+   {
+     "id": "abc123",
+     "name": "Song Title",
+     "album": {...},
+     "artists": [...]
+   }
+```
+
+### Example: POST /batch/lookup
+
+```
+1. Client Request
+   POST /batch/lookup
+   {
+     "isrcs": ["USRC12345678", "GBUM71234567", ...],  // Up to 400
+     "tracks": ["id1", "id2", ...]
+   }
+   ↓
+2. Rate Limiter Middleware
+   - Single request counts as 1 token (not 400)
+   ↓
+3. HTTP Handler
+   - Parse BatchRequest
+   - Validate: max 400 items total
+   - Call db.BatchGetByISRC(isrcs)
+   - Call db.BatchGetTracks(trackIDs)
+   ↓
+4. Database Layer
+   - Build IN clause for ISRCs
+   - Execute batch query (1 query for all ISRCs)
+   - Collect all track/album/artist IDs
+   - Batch enrich all entities (6 batch queries)
+   ↓
+5. HTTP Handler
+   - Build BatchResponse with maps
+   - Serialize to JSON
+   ↓
+6. Client Response
+   200 OK
+   {
+     "isrcs": {
+       "USRC12345678": {...},
+       "GBUM71234567": {...}
+     },
+     "tracks": {
+       "id1": {...},
+       "id2": {...}
+     }
+   }
+```
+
+## Graceful Shutdown
+
+**Signal handling:**
+```go
+// Listen for SIGINT (Ctrl+C) and SIGTERM (Docker stop)
+sigChan := make(chan os.Signal, 1)
+signal.Notify(sigChan, os.Interrupt, syscall.SIGTERM)
+
+// Block until signal received
+<-sigChan
+
+// Shutdown with 10-second timeout
+ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
+defer cancel()
+
+server.Shutdown(ctx)  // Stop accepting new requests, finish in-flight
+```
+
+**Shutdown sequence:**
+1. Receive SIGINT or SIGTERM
+2. Stop accepting new connections
+3. Wait for in-flight requests (max 10 seconds)
+4. Close database connections
+5. Exit process
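
The sequence above can be packaged into one function that also starts the server, which the excerpt does not show. This is a sketch under assumptions: `serveUntil` is a hypothetical helper, and a plain stop channel stands in for the signal channel.

```go
package main

import (
	"context"
	"errors"
	"net/http"
	"time"
)

// serveUntil starts an HTTP server on addr and blocks until stop is closed
// or the server fails. It then drains in-flight requests for up to timeout.
func serveUntil(stop <-chan struct{}, addr string, h http.Handler, timeout time.Duration) error {
	srv := &http.Server{Addr: addr, Handler: h}

	// ListenAndServe blocks, so it runs in its own goroutine.
	errCh := make(chan error, 1)
	go func() { errCh <- srv.ListenAndServe() }()

	select {
	case err := <-errCh:
		return err // startup failure, e.g. the port is already in use
	case <-stop:
	}

	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	if err := srv.Shutdown(ctx); err != nil {
		return err // timeout expired with requests still in flight
	}
	// ListenAndServe returns http.ErrServerClosed after a clean Shutdown.
	if err := <-errCh; !errors.Is(err, http.ErrServerClosed) {
		return err
	}
	return nil // the caller would close the database handles next
}
```

In `main`, the stop channel could come from `signal.NotifyContext(context.Background(), os.Interrupt, syscall.SIGTERM)` by passing `ctx.Done()`, with `10*time.Second` as the timeout.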
+
+## No Framework Philosophy
+
+Music Metadata API uses **zero web frameworks**. Apart from two external packages, everything is Go stdlib:
+
+**Routing:** Go 1.22+ enhanced `http.ServeMux`
+- Method-specific routes: `GET /path`, `POST /path`
+- Path parameters: `/lookup/track/{id}`
+- No regex constraints; only simple single-segment `{name}` wildcards are used
+
+**JSON:** `encoding/json` stdlib
+- `json.NewEncoder(w).Encode(data)` for responses
+- `json.NewDecoder(r.Body).Decode(&req)` for requests
+
+**HTTP Server:** `net/http` stdlib
+- `http.Server` with custom `Addr` and `Handler`
+- No middleware framework (custom rate limiter)
+
+**Database:** `database/sql` stdlib
+- `modernc.org/sqlite` driver (pure Go, no CGO)
+- Raw SQL queries (no ORM)
+
+**Logging:** `log/slog` stdlib
+- Structured logging for errors
+- Only error-level logging is used (no info or debug output)
+
+**Benefits:**
+- Minimal dependencies (2 external packages)
+- No framework lock-in
+- Easy to understand (no magic)
+- Fast compilation
+- Small binary size
+
+**Tradeoffs:**
+- More boilerplate (manual error handling)
+- No built-in middleware chain
+- Manual query building (no ORM)
+- No automatic validation
+
+## Performance Characteristics
+
+**Strengths:**
+- Read-only databases (no write locks)
+- Connection pooling (8 connections)
+- Memory-mapped I/O (1GB mmap)
+- Batch optimization (343x fewer queries)
+- Conservative cache (64MB)
+
+**Bottlenecks:**
+- Search queries (LIKE %query% on 256M rows)
+- Rate limiter memory leak (unbounded map)
+- No query result caching
+- No CDN for image URLs
+
+**Scalability:**
+- Horizontal: Run multiple instances (read-only safe)
+- Vertical: Limited by disk I/O (SQLite's single-writer limitation does not apply because the databases are read-only)
+- Database size: 216GB requires SSD for acceptable performance