feat: initial implementation of metadata aggregator

- gRPC service with MusicBrainz provider
- PostgreSQL schema with migrations
- Service layer with database-first caching
- Repository pattern for data access
- YAML configuration support
- Research documentation for 17 music metadata projects
Alexander
2026-04-28 16:27:14 +02:00
commit a1f6701bac
163 changed files with 95884 additions and 0 deletions
@@ -0,0 +1,626 @@
# Music Metadata API - Architecture
## Architectural Overview
Music Metadata API follows a clean 3-layer architecture with clear separation of concerns:
```
┌─────────────────────────────────────────────────────────────┐
│ HTTP Clients │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ API Layer (internal/api) │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Handlers │ │ Rate Limiter │ │ OpenAPI │ │
│ │ (routing) │ │ (middleware) │ │ (docs) │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ Database Layer (internal/db) │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Queries │ │ Enrichment │ │ Batch │ │
│ │ (SQL) │ │ (joins) │ │ Optimization │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ Models Layer (internal/models) │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Track │ │ Album │ │ Artist │ │
│ │ (struct) │ │ (struct) │ │ (struct) │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ SQLite Databases (read-only) │
│ ┌──────────────────────────┐ ┌──────────────────────────┐ │
│ │ main_database.sqlite3 │ │ track_files.sqlite3 │ │
│ │ (~117GB) │ │ (~99GB) │ │
│ └──────────────────────────┘ └──────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
```
## Directory Structure
```
music-metadata-api/
├── cmd/
│   └── server/
│       └── main.go              # Entry point (62 lines)
├── internal/
│   ├── api/
│   │   ├── handlers.go          # HTTP route handlers
│   │   ├── ratelimit.go         # Token bucket rate limiter
│   │   └── openapi.go           # OpenAPI spec + Swagger UI
│   │
│   ├── db/
│   │   └── db.go                # Database layer (907 lines)
│   │
│   └── models/
│       └── models.go            # Data structures (65 lines)
├── Dockerfile                   # Multi-stage build
├── docker-compose.yml           # Production deployment
├── go.mod                       # Dependencies
├── go.sum                       # Dependency checksums
├── .gitignore                   # Excludes databases, binaries
└── .github/
    └── workflows/
        └── docker-publish.yml   # CI/CD pipeline
```
## Layer Breakdown
### Entry Point: cmd/server/main.go
**Responsibilities:**
- Parse CLI flags (`-db`, `-addr`)
- Initialize database connections
- Set up HTTP router
- Configure graceful shutdown
- Start HTTP server
**Key code flow:**
```go
// 1. Parse flags
dbPath := flag.String("db", "", "path to database")
addr := flag.String("addr", ":8080", "server address")
flag.Parse() // without this call both flags keep their defaults
// 2. Initialize database
database, err := db.NewDatabase(*dbPath)
if err != nil {
	slog.Error("open database", "error", err)
	os.Exit(1)
}
// 3. Set up router with rate limiting
mux := http.NewServeMux()
rateLimiter := api.NewRateLimiter(100, 200) // 100 req/s, 200 burst
handler := rateLimiter.Limit(mux)
// 4. Register routes
api.RegisterRoutes(mux, database)
// 5. Graceful shutdown on SIGINT/SIGTERM
server := &http.Server{Addr: *addr, Handler: handler}
// ... shutdown logic with 10s timeout
```
**File size:** 62 lines (minimal, focused)
### API Layer: internal/api/
#### handlers.go
**Responsibilities:**
- Route registration
- Request parsing
- Response serialization
- Error handling
- Query parameter validation
**Route patterns (Go 1.22+ enhanced routing):**
```go
// Method + path patterns
mux.HandleFunc("POST /batch/lookup", handleBatchLookup)
mux.HandleFunc("GET /lookup/isrc/{isrc}", handleISRCLookup)
mux.HandleFunc("GET /lookup/track/{id}", handleTrackLookup)
mux.HandleFunc("GET /lookup/artist/{id}", handleArtistLookup)
mux.HandleFunc("GET /lookup/album/{id}", handleAlbumLookup)
mux.HandleFunc("GET /lookup/album/{id}/tracks", handleAlbumTracks)
mux.HandleFunc("GET /search/track", handleTrackSearch)
mux.HandleFunc("GET /search/artist", handleArtistSearch)
mux.HandleFunc("GET /health", handleHealth)
mux.HandleFunc("GET /docs", handleDocs)
mux.HandleFunc("GET /openapi.yaml", handleOpenAPI)
```
**Handler pattern:**
```go
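func handleTrackLookup(w http.ResponseWriter, r *http.Request) {
	// 1. Extract path parameter
	id := r.PathValue("id")
	// 2. Call database layer (db is presumably the *db.Database
	//    value handed to RegisterRoutes, not the package)
	track, err := db.GetTrack(id)
	if err != nil {
		// Note: every error, not only a missing row, is reported as 404
		http.Error(w, "Track not found", http.StatusNotFound)
		return
	}
	// 3. Serialize response
	w.Header().Set("Content-Type", "application/json")
	if err := json.NewEncoder(w).Encode(track); err != nil {
		slog.Error("encode track response", "error", err)
	}
}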
```
**Validation rules:**
- Search queries: minimum 2 characters
- Batch requests: maximum 400 items
- Limit parameters: maximum 50 results
- Timeouts: 10 seconds for search queries
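
The validation rules above could be expressed as small pure functions. This is a sketch under assumptions: the function names, the error values, counting characters as runes, and clamping (rather than rejecting) an oversized limit are all illustrative choices, not confirmed behavior of the project.

```go
package main

import (
	"errors"
	"strconv"
)

// Limits taken from the validation rules listed above.
const (
	minQueryLen   = 2
	maxBatchItems = 400
	maxLimit      = 50
)

var (
	errQueryTooShort = errors.New("search query must be at least 2 characters")
	errBatchTooLarge = errors.New("batch request exceeds 400 items")
)

// validateQuery counts characters as runes, so a multi-byte character
// such as "é" counts once (an assumption; the source may count bytes).
func validateQuery(q string) error {
	if len([]rune(q)) < minQueryLen {
		return errQueryTooShort
	}
	return nil
}

// validateBatch checks the combined size of all ID lists in one request.
func validateBatch(lists ...[]string) error {
	total := 0
	for _, l := range lists {
		total += len(l)
	}
	if total > maxBatchItems {
		return errBatchTooLarge
	}
	return nil
}

// parseLimit reads a "limit" query value. It falls back to def when the
// value is missing or invalid and clamps anything above maxLimit.
func parseLimit(raw string, def int) int {
	n, err := strconv.Atoi(raw)
	if err != nil || n < 1 {
		return def
	}
	if n > maxLimit {
		return maxLimit
	}
	return n
}
```

A handler would call `validateBatch(req.Tracks, req.Artists, req.Albums, req.ISRCs)` before touching the database and answer with a 400 status when it returns an error.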
#### ratelimit.go
**Implementation:** Token bucket algorithm with per-IP tracking
**Data structure:**
```go
type RateLimiter struct {
	visitors map[string]*rate.Limiter // IP -> limiter
	mu       sync.RWMutex             // Protects visitors map
	rate     rate.Limit               // Tokens per second
	burst    int                      // Burst capacity
}
```
**Algorithm:**
1. Extract client IP from `X-Forwarded-For` header (fallback to `RemoteAddr`); the header is client-supplied, so it is only trustworthy behind a proxy that overwrites it
2. Look up or create limiter for IP
3. Check if token available (`limiter.Allow()`)
4. If allowed, pass to next handler
5. If denied, return HTTP 429 with `Retry-After` header
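
Step 1 of the algorithm can be sketched as follows. The function name `clientIP` and the choice to take the first entry of a comma-separated `X-Forwarded-For` value are assumptions; the source does not say how multi-hop headers are handled.

```go
package main

import (
	"net"
	"strings"
)

// clientIP returns the first address listed in the X-Forwarded-For value
// when one is present, and otherwise the host part of remoteAddr.
func clientIP(forwardedFor, remoteAddr string) string {
	if forwardedFor != "" {
		// The header may hold a list: "client, proxy1, proxy2".
		first, _, _ := strings.Cut(forwardedFor, ",")
		if ip := strings.TrimSpace(first); ip != "" {
			return ip
		}
	}
	// RemoteAddr is normally "host:port"; strip the port.
	host, _, err := net.SplitHostPort(remoteAddr)
	if err != nil {
		return remoteAddr // no port present; use the value as-is
	}
	return host
}
```

In a handler this would be called as `clientIP(r.Header.Get("X-Forwarded-For"), r.RemoteAddr)`, and the result used as the key into the visitors map.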
**BUG:** The visitor map grows without bound. There is no cleanup mechanism for inactive IPs, so a long-running server keeps one limiter in memory for every client IP it has ever seen.
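
One common way to bound the map is to record when each visitor was last seen and evict idle entries periodically. The sketch below is not the project's code: it replaces `rate.Limiter` with a small hand-written token bucket so that it depends only on the standard library, and `idleTTL`, `Cleanup`, `StartCleanup`, and `Len` are hypothetical names.

```go
package main

import (
	"sync"
	"time"
)

// visitor is one per-IP token bucket plus the time it was last used.
type visitor struct {
	tokens   float64
	lastSeen time.Time
}

// RateLimiter is a stdlib-only stand-in for the limiter described above,
// extended with eviction of idle visitors.
type RateLimiter struct {
	mu       sync.Mutex
	visitors map[string]*visitor
	rate     float64       // tokens added per second
	burst    float64       // bucket capacity
	idleTTL  time.Duration // evict visitors unseen for this long
}

func NewRateLimiter(ratePerSec float64, burst int, idleTTL time.Duration) *RateLimiter {
	return &RateLimiter{
		visitors: make(map[string]*visitor),
		rate:     ratePerSec,
		burst:    float64(burst),
		idleTTL:  idleTTL,
	}
}

// Allow refills the bucket for the elapsed time, then spends one token.
func (rl *RateLimiter) Allow(ip string) bool {
	rl.mu.Lock()
	defer rl.mu.Unlock()
	now := time.Now()
	v, ok := rl.visitors[ip]
	if !ok {
		v = &visitor{tokens: rl.burst, lastSeen: now}
		rl.visitors[ip] = v
	}
	v.tokens += now.Sub(v.lastSeen).Seconds() * rl.rate
	if v.tokens > rl.burst {
		v.tokens = rl.burst
	}
	v.lastSeen = now
	if v.tokens < 1 {
		return false
	}
	v.tokens--
	return true
}

// Cleanup evicts visitors idle for longer than idleTTL and reports how
// many entries were removed.
func (rl *RateLimiter) Cleanup() int {
	rl.mu.Lock()
	defer rl.mu.Unlock()
	cutoff := time.Now().Add(-rl.idleTTL)
	removed := 0
	for ip, v := range rl.visitors {
		if v.lastSeen.Before(cutoff) {
			delete(rl.visitors, ip)
			removed++
		}
	}
	return removed
}

// Len reports the number of tracked visitors.
func (rl *RateLimiter) Len() int {
	rl.mu.Lock()
	defer rl.mu.Unlock()
	return len(rl.visitors)
}

// StartCleanup runs Cleanup every interval until stop is closed.
func (rl *RateLimiter) StartCleanup(interval time.Duration, stop <-chan struct{}) {
	go func() {
		ticker := time.NewTicker(interval)
		defer ticker.Stop()
		for {
			select {
			case <-ticker.C:
				rl.Cleanup()
			case <-stop:
				return
			}
		}
	}()
}
```

A server would call something like `rl.StartCleanup(time.Minute, stop)` once at startup. An evicted visitor simply starts again with a full bucket on its next request, which is harmless as long as `idleTTL` is longer than the time a bucket needs to refill.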
**Configuration:**
- Rate: 100 requests/second
- Burst: 200 requests
- Scope: Per-IP (not per-user, no authentication)
#### openapi.go
**Responsibilities:**
- Serve OpenAPI 3.1 specification at `/openapi.yaml`
- Serve Swagger UI at `/docs`
- Embed OpenAPI spec in binary (no external files)
**Swagger UI loading:**
```html
<!-- Loaded from unpkg.com CDN (browser-side) -->
<script src="https://unpkg.com/swagger-ui-dist@5/swagger-ui-bundle.js"></script>
<link rel="stylesheet" href="https://unpkg.com/swagger-ui-dist@5/swagger-ui.css" />
```
**OpenAPI spec highlights:**
- Version: 3.1.0
- All endpoints documented
- Request/response schemas
- Example payloads
- Error responses
### Database Layer: internal/db/db.go
**File size:** 907 lines (largest file in codebase)
**Responsibilities:**
- SQLite connection management
- Query execution
- Data enrichment (joining related entities)
- Batch optimization
- Transaction handling (read-only)
#### Connection Management
**Dual database connections:**
```go
type Database struct {
	mainDB       *sql.DB // main_database.sqlite3
	trackFilesDB *sql.DB // track_files.sqlite3
}
```
**Connection string PRAGMAs:**
```
file:/path/to/db.sqlite3?mode=ro&_journal_mode=off&_cache_size=-64000&_mmap_size=1073741824&_query_only=true
```
**PRAGMA breakdown:**
| PRAGMA | Value | Purpose |
|--------|-------|---------|
| `mode=ro` | Read-only | Prevents accidental writes |
| `_journal_mode=off` | Disabled | No rollback journal or write-ahead log is kept (safe here because nothing is written) |
| `_cache_size=-64000` | ~64MB | Page cache size (a negative value is a size in KiB: 64,000 KiB) |
| `_mmap_size=1073741824` | 1GB | Memory-mapped I/O size |
| `_query_only=true` | Enabled | Additional read-only enforcement |
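
As a small worked example, the connection string above can be assembled from its parts. This sketch only reproduces the string shown in this document; the function and parameter names are hypothetical, and which parameter spellings a given SQLite driver actually honors is driver-specific and not verified here.

```go
package main

import "strconv"

// buildDSN assembles the read-only connection string shown above.
// cacheKiB is the page cache size in KiB; it is written as a negative
// number because SQLite reads a negative cache_size as KiB, not pages.
// mmapBytes is the memory-mapped I/O window in bytes.
func buildDSN(path string, cacheKiB int, mmapBytes int64) string {
	return "file:" + path +
		"?mode=ro" +
		"&_journal_mode=off" +
		"&_cache_size=-" + strconv.Itoa(cacheKiB) +
		"&_mmap_size=" + strconv.FormatInt(mmapBytes, 10) +
		"&_query_only=true"
}
```

Calling `buildDSN("/path/to/db.sqlite3", 64000, 1<<30)` produces the exact string from the example, since `1<<30` bytes is 1,073,741,824 (1 GiB).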
**Connection pool:**
```go
db.SetMaxOpenConns(8) // Conservative limit
db.SetMaxIdleConns(8) // Keep connections warm
db.SetConnMaxLifetime(0) // No expiration
```
#### Query Patterns
**Individual lookups:**
```go
func (d *Database) GetTrack(id string) (*models.Track, error) {
	// 1. Fetch base track + album
	row := d.mainDB.QueryRow(`
		SELECT t.id, t.name, t.isrc, t.duration_ms, t.explicit,
		       t.track_number, t.disc_number, t.popularity, t.preview_url,
		       a.id, a.name, a.album_type, a.label, a.release_date,
		       a.release_date_precision, a.external_id_upc, a.total_tracks
		FROM tracks t
		JOIN albums a ON t.album_rowid = a.rowid
		WHERE t.id = ?
	`, id)
	var track models.Track
	// ... row.Scan(...) into track and track.Album in SELECT order;
	// Scan returns sql.ErrNoRows when no track has this ID
	// 2. Enrich album (images, artists)
	d.enrichAlbum(&track.Album)
	// 3. Enrich track (artists, track_files)
	d.enrichTrack(&track)
	return &track, nil
}
```
**Batch lookups:**
```go
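func (d *Database) BatchGetByISRC(isrcs []string) (map[string]*models.Track, error) {
	tracks := make(map[string]*models.Track, len(isrcs))
	if len(isrcs) == 0 {
		return tracks, nil // strings.Repeat panics on a negative count
	}
	// 1. Build IN clause
	placeholders := strings.Repeat("?,", len(isrcs)-1) + "?"
	query := fmt.Sprintf(`
		SELECT t.id, t.isrc, ...
		FROM tracks t
		JOIN albums a ON t.album_rowid = a.rowid
		WHERE t.isrc IN (%s)
	`, placeholders)
	// 2. Execute batch query (a []string must be copied into []any
	//    before it can be passed to the variadic Query)
	args := make([]any, len(isrcs))
	for i, isrc := range isrcs {
		args[i] = isrc
	}
	rows, err := d.mainDB.Query(query, args...)
	if err != nil {
		return nil, err
	}
	defer rows.Close()
	// 3. Collect track and album IDs for enrichment while scanning rows
	trackIDs := make([]string, 0, len(isrcs))
	albumIDs := make([]string, 0, len(isrcs))
	// ... scan each row into tracks, appending to trackIDs and albumIDs
	// 4. Batch enrich all entities
	d.batchEnrichAlbums(albumIDs, tracks)
	d.batchEnrichTracks(trackIDs, tracks)
	return tracks, rows.Err()
}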
```
#### Data Enrichment Flow
**Track enrichment pipeline:**
```
1. Fetch base track + album (single JOIN)
2. Enrich album:
- Batch fetch album images (batchGetAlbumImages)
- Batch fetch album artists (batchGetAlbumArtists)
3. Enrich track:
- Batch fetch track artists (batchGetTrackArtists)
- Batch fetch track files (batchEnrichTrackFiles)
4. Enrich artists:
- Batch fetch artist genres (batchGetArtistGenres)
- Batch fetch artist images (batchGetArtistImages)
5. Return fully enriched track
```
**Batch optimization functions:**
| Function | Purpose | Query Pattern |
|----------|---------|---------------|
| `batchGetAlbumImages` | Fetch all images for albums | `WHERE album_id IN (...)` |
| `batchGetAlbumArtists` | Fetch all artists for albums | `WHERE album_id IN (...)` |
| `batchGetTrackArtists` | Fetch all artists for tracks | `WHERE track_id IN (...)` |
| `batchGetArtistGenres` | Fetch all genres for artists | `WHERE artist_id IN (...)` |
| `batchGetArtistImages` | Fetch all images for artists | `WHERE artist_id IN (...)` |
| `batchEnrichTrackFiles` | Fetch extended track data | `WHERE track_id IN (...)` |
**Why batch optimization matters:**
- A single batch request with 400 tracks triggers about 6 batch enrichment queries
- Without batching: 400 tracks × 6 enrichment queries = 2,400 database queries
- With batching: 1 main query + 6 batch queries = 7 database queries
- **Performance gain: about 343x fewer queries** (2,400 / 7 ≈ 343)
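
The arithmetic above, and the reason one `IN (...)` query can serve many parent rows, can be sketched as follows. `imageRow`, `groupImagesByAlbum`, and `queryCounts` are hypothetical names chosen for illustration; the real batch functions are not shown in this document.

```go
package main

// Image mirrors the Image model described in this document.
type Image struct {
	URL    string
	Width  int
	Height int
}

// imageRow is one row of a hypothetical "WHERE album_id IN (...)" result.
type imageRow struct {
	AlbumID string
	Image   Image
}

// groupImagesByAlbum buckets the rows of a single batch query by album ID,
// so each album picks up its images with a map lookup instead of issuing
// its own query.
func groupImagesByAlbum(rows []imageRow) map[string][]Image {
	byAlbum := make(map[string][]Image)
	for _, r := range rows {
		byAlbum[r.AlbumID] = append(byAlbum[r.AlbumID], r.Image)
	}
	return byAlbum
}

// queryCounts reproduces the comparison above: per-track enrichment
// queries without batching versus one main query plus one batch query
// per enrichment step.
func queryCounts(tracks, enrichmentSteps int) (unbatched, batched int) {
	return tracks * enrichmentSteps, 1 + enrichmentSteps
}
```

For 400 tracks and 6 enrichment steps, `queryCounts(400, 6)` returns 2,400 and 7, which is the ratio of roughly 343 quoted above.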
#### Search Implementation
**Track search:**
```sql
SELECT id, name, isrc, duration_ms, popularity, album_rowid
FROM tracks
WHERE name LIKE ? COLLATE NOCASE
ORDER BY popularity DESC
LIMIT ?
```
**Artist search:**
```sql
SELECT id, name, followers_total, popularity
FROM artists
WHERE name LIKE ? COLLATE NOCASE
ORDER BY followers_total DESC
LIMIT ?
```
**Search characteristics:**
- Pattern: `%query%` (substring match)
- Collation: `NOCASE` (case-insensitive; SQLite's `LIKE` is already case-insensitive for ASCII characters by default)
- Timeout: 10 seconds (context deadline)
- Min query length: 2 characters
- Max results: 50
**Performance concern:** A pattern with a leading wildcard (`LIKE '%query%'`) cannot use an ordinary index, so every search is a full table scan. On 256M tracks that will be slow. SQLite's full-text search extension (FTS5) would be faster but is not implemented.
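
Because the search term is wrapped in `%...%`, any `%` or `_` typed by the user also acts as a wildcard unless it is escaped. Whether the project escapes them is not stated; the sketch below shows one way to do it, assuming the query is changed to `name LIKE ? ESCAPE '\'`. The function name `likeContains` is hypothetical.

```go
package main

import "strings"

// likeContains turns user input into a substring pattern for
// `name LIKE ? ESCAPE '\'`. The backslash itself is escaped as well,
// so literal backslashes, percent signs, and underscores all match
// themselves instead of acting as wildcards.
func likeContains(q string) string {
	escaper := strings.NewReplacer(`\`, `\\`, `%`, `\%`, `_`, `\_`)
	return "%" + escaper.Replace(q) + "%"
}
```

The result is passed as the bound parameter, for example `likeContains("100%")` gives `%100\%%`, which matches names containing the literal text `100%`.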
### Models Layer: internal/models/models.go
**File size:** 65 lines (smallest layer)
**Responsibilities:**
- Define data structures
- JSON serialization tags
- Nested relationships
**Core models:**
```go
type Track struct {
	ID          string   `json:"id"`
	Name        string   `json:"name"`
	ISRC        string   `json:"isrc,omitempty"`
	DurationMs  int      `json:"duration_ms"`
	Explicit    bool     `json:"explicit"`
	TrackNumber int      `json:"track_number"`
	DiscNumber  int      `json:"disc_number"`
	Popularity  int      `json:"popularity"`
	PreviewURL  string   `json:"preview_url,omitempty"`
	Album       Album    `json:"album"`
	Artists     []Artist `json:"artists"`
	// Extended fields from track_files DB
	OriginalTitle string              `json:"original_title,omitempty"`
	VersionTitle  string              `json:"version_title,omitempty"`
	HasLyrics     bool                `json:"has_lyrics"`
	Languages     []string            `json:"languages,omitempty"`
	ArtistRoles   map[string][]string `json:"artist_roles,omitempty"`
}

type Album struct {
	ID                   string   `json:"id"`
	Name                 string   `json:"name"`
	AlbumType            string   `json:"album_type"`
	Label                string   `json:"label,omitempty"`
	ReleaseDate          string   `json:"release_date"`
	ReleaseDatePrecision string   `json:"release_date_precision"`
	ExternalIDUPC        string   `json:"external_id_upc,omitempty"`
	TotalTracks          int      `json:"total_tracks"`
	CopyrightC           string   `json:"copyright_c,omitempty"`
	CopyrightP           string   `json:"copyright_p,omitempty"`
	Images               []Image  `json:"images,omitempty"`
	Artists              []Artist `json:"artists,omitempty"`
}

type Artist struct {
	ID             string   `json:"id"`
	Name           string   `json:"name"`
	FollowersTotal int      `json:"followers_total,omitempty"`
	Popularity     int      `json:"popularity,omitempty"`
	Genres         []string `json:"genres,omitempty"`
	Images         []Image  `json:"images,omitempty"`
}

type Image struct {
	URL    string `json:"url"`
	Width  int    `json:"width"`
	Height int    `json:"height"`
}
```
**Batch request/response models:**
```go
type BatchRequest struct {
	Tracks  []string `json:"tracks,omitempty"`  // Track IDs
	Artists []string `json:"artists,omitempty"` // Artist IDs
	Albums  []string `json:"albums,omitempty"`  // Album IDs
	ISRCs   []string `json:"isrcs,omitempty"`   // ISRC codes
}

type BatchResponse struct {
	Tracks  map[string]*Track  `json:"tracks,omitempty"`
	Artists map[string]*Artist `json:"artists,omitempty"`
	Albums  map[string]*Album  `json:"albums,omitempty"`
	ISRCs   map[string]*Track  `json:"isrcs,omitempty"`
}
```
## Request Flow
### Example: GET /lookup/track/{id}
```
1. Client Request
GET /lookup/track/abc123
2. Rate Limiter Middleware
- Extract IP from X-Forwarded-For
- Check token bucket for IP
- If allowed, continue; else return 429
3. HTTP Handler (api/handlers.go)
- Extract "abc123" from path
- Call db.GetTrack("abc123")
4. Database Layer (db/db.go)
- Query track + album (single JOIN)
- Enrich album (images, artists)
- Enrich track (artists, track_files)
- Enrich artists (genres, images)
5. Models Layer (models/models.go)
- Populate Track struct
- Nest Album, Artists
6. HTTP Handler
- Serialize Track to JSON
- Set Content-Type: application/json
- Write response
7. Client Response
200 OK
{
"id": "abc123",
"name": "Song Title",
"album": {...},
"artists": [...]
}
```
### Example: POST /batch/lookup
```
1. Client Request
POST /batch/lookup
{
"isrcs": ["USRC12345678", "GBUM71234567", ...], // Up to 400
"tracks": ["id1", "id2", ...]
}
2. Rate Limiter Middleware
- Single request counts as 1 token (not 400)
3. HTTP Handler
- Parse BatchRequest
- Validate: max 400 items total
- Call db.BatchGetByISRC(isrcs)
- Call db.BatchGetTracks(trackIDs)
4. Database Layer
- Build IN clause for ISRCs
- Execute batch query (1 query for all ISRCs)
- Collect all track/album/artist IDs
- Batch enrich all entities (6 batch queries)
5. HTTP Handler
- Build BatchResponse with maps
- Serialize to JSON
6. Client Response
200 OK
{
"isrcs": {
"USRC12345678": {...},
"GBUM71234567": {...}
},
"tracks": {
"id1": {...},
"id2": {...}
}
}
```
## Graceful Shutdown
**Signal handling:**
```go
// Listen for SIGINT (Ctrl+C) and SIGTERM (Docker stop)
sigChan := make(chan os.Signal, 1)
signal.Notify(sigChan, os.Interrupt, syscall.SIGTERM)
// Block until signal received
<-sigChan
// Shutdown with 10-second timeout
ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
defer cancel()
// Stop accepting new requests, finish in-flight ones
if err := server.Shutdown(ctx); err != nil {
	slog.Error("graceful shutdown failed", "error", err)
}
```
**Shutdown sequence:**
1. Receive SIGINT or SIGTERM
2. Stop accepting new connections
3. Wait for in-flight requests (max 10 seconds)
4. Close database connections
5. Exit process
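
The full sequence, including the parts the snippet above leaves out (running the server in a goroutine and closing the database afterwards), might look like the following sketch. `serveUntil`, `newServer`, and the `closeDB` hook are hypothetical names; the stop channel stands in for the signal channel so the function can be exercised without sending real signals.

```go
package main

import (
	"context"
	"errors"
	"net/http"
	"time"
)

// newServer builds an http.Server for the given address and handler.
func newServer(addr string, h http.Handler) *http.Server {
	return &http.Server{Addr: addr, Handler: h}
}

// serveUntil runs srv until stop is closed or the listener fails. It then
// follows the shutdown sequence above: drain in-flight requests for up to
// timeout, close the database via closeDB, and return any errors.
func serveUntil(srv *http.Server, stop <-chan struct{}, timeout time.Duration, closeDB func() error) error {
	serveErr := make(chan error, 1) // buffered so the goroutine never blocks
	go func() {
		// ListenAndServe returns http.ErrServerClosed once Shutdown is
		// called; that is the expected outcome, not a failure.
		err := srv.ListenAndServe()
		if errors.Is(err, http.ErrServerClosed) {
			err = nil
		}
		serveErr <- err
	}()

	var err error
	select {
	case err = <-serveErr:
		// The listener failed before shutdown was requested.
	case <-stop:
		ctx, cancel := context.WithTimeout(context.Background(), timeout)
		defer cancel()
		err = srv.Shutdown(ctx) // stop accepting, wait for in-flight requests
	}
	// Close the database only after request handling has stopped.
	return errors.Join(err, closeDB())
}
```

In `main`, a small goroutine would bridge the two: wait on `sigChan`, then `close(stop)`. Closing the database after `Shutdown` returns keeps in-flight handlers from querying a closed connection.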
## No Framework Philosophy
Music Metadata API uses **zero web frameworks**. Everything is Go stdlib:
**Routing:** Go 1.22+ enhanced `http.ServeMux`
- Method-specific routes: `GET /path`, `POST /path`
- Path parameters: `/lookup/track/{id}`
- No regex matching; only single-segment `{name}` wildcards are used (simple patterns only)
**JSON:** `encoding/json` stdlib
- `json.NewEncoder(w).Encode(data)` for responses
- `json.NewDecoder(r.Body).Decode(&req)` for requests
**HTTP Server:** `net/http` stdlib
- `http.Server` with custom `Addr` and `Handler`
- No middleware framework (custom rate limiter)
**Database:** `database/sql` stdlib
- `modernc.org/sqlite` driver (pure Go, no CGO)
- Raw SQL queries (no ORM)
**Logging:** `log/slog` stdlib
- Structured logging for errors
- Only the error level is used (all log output is errors)
**Benefits:**
- Minimal dependencies (2 external packages)
- No framework lock-in
- Easy to understand (no magic)
- Fast compilation
- Small binary size
**Tradeoffs:**
- More boilerplate (manual error handling)
- No built-in middleware chain
- Manual query building (no ORM)
- No automatic validation
## Performance Characteristics
**Strengths:**
- Read-only databases (no write locks)
- Connection pooling (8 connections)
- Memory-mapped I/O (1GB mmap)
- Batch optimization (343x fewer queries)
- Conservative cache (64MB)
**Bottlenecks:**
- Search queries (LIKE %query% on 256M rows)
- Rate limiter memory leak (unbounded map)
- No query result caching
- No CDN for image URLs
**Scalability:**
- Horizontal: Run multiple instances (read-only safe)
- Vertical: Limited by disk I/O (SQLite's single-writer model is not a factor because the databases are read-only)
- Database size: 216GB requires SSD for acceptable performance