Alexander a1f6701bac feat: initial implementation of metadata aggregator
- gRPC service with MusicBrainz provider
- PostgreSQL schema with migrations
- Service layer with database-first caching
- Repository pattern for data access
- YAML configuration support
- Research documentation for 17 music metadata projects
2026-04-28 16:28:53 +02:00

# Music Metadata API - Architecture
## Architectural Overview
Music Metadata API follows a clean 3-layer architecture with clear separation of concerns:
```
┌─────────────────────────────────────────────────────────────┐
│                        HTTP Clients                         │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│                  API Layer (internal/api)                   │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐       │
│  │   Handlers   │  │ Rate Limiter │  │   OpenAPI    │       │
│  │  (routing)   │  │ (middleware) │  │    (docs)    │       │
│  └──────────────┘  └──────────────┘  └──────────────┘       │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│                Database Layer (internal/db)                 │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐       │
│  │   Queries    │  │  Enrichment  │  │    Batch     │       │
│  │    (SQL)     │  │   (joins)    │  │ Optimization │       │
│  └──────────────┘  └──────────────┘  └──────────────┘       │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│               Models Layer (internal/models)                │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐       │
│  │    Track     │  │    Album     │  │    Artist    │       │
│  │   (struct)   │  │   (struct)   │  │   (struct)   │       │
│  └──────────────┘  └──────────────┘  └──────────────┘       │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│                SQLite Databases (read-only)                 │
│ ┌──────────────────────────┐  ┌──────────────────────────┐  │
│ │  main_database.sqlite3   │  │   track_files.sqlite3    │  │
│ │         (~117GB)         │  │         (~99GB)          │  │
│ └──────────────────────────┘  └──────────────────────────┘  │
└─────────────────────────────────────────────────────────────┘
```
## Directory Structure
```
music-metadata-api/
├── cmd/
│   └── server/
│       └── main.go              # Entry point (62 lines)
├── internal/
│   ├── api/
│   │   ├── handlers.go          # HTTP route handlers
│   │   ├── ratelimit.go         # Token bucket rate limiter
│   │   └── openapi.go           # OpenAPI spec + Swagger UI
│   │
│   ├── db/
│   │   └── db.go                # Database layer (907 lines)
│   │
│   └── models/
│       └── models.go            # Data structures (65 lines)
├── Dockerfile                   # Multi-stage build
├── docker-compose.yml           # Production deployment
├── go.mod                       # Dependencies
├── go.sum                       # Dependency checksums
├── .gitignore                   # Excludes databases, binaries
└── .github/
    └── workflows/
        └── docker-publish.yml   # CI/CD pipeline
```
## Layer Breakdown
### Entry Point: cmd/server/main.go
**Responsibilities:**
- Parse CLI flags (`-db`, `-addr`)
- Initialize database connections
- Set up HTTP router
- Configure graceful shutdown
- Start HTTP server
**Key code flow:**
```go
// 1. Parse flags
dbPath := flag.String("db", "", "path to database")
addr := flag.String("addr", ":8080", "server address")
flag.Parse() // must run before the flag values are dereferenced

// 2. Initialize database
database, err := db.NewDatabase(*dbPath)
if err != nil {
	// log the error and exit (elided)
}

// 3. Set up router with rate limiting
mux := http.NewServeMux()
rateLimiter := api.NewRateLimiter(100, 200) // 100 req/s, 200 burst
handler := rateLimiter.Limit(mux)

// 4. Register routes
api.RegisterRoutes(mux, database)

// 5. Graceful shutdown on SIGINT/SIGTERM
server := &http.Server{Addr: *addr, Handler: handler}
// ... shutdown logic with 10s timeout
```
**File size:** 62 lines (minimal, focused)
### API Layer: internal/api/
#### handlers.go
**Responsibilities:**
- Route registration
- Request parsing
- Response serialization
- Error handling
- Query parameter validation
**Route patterns (Go 1.22+ enhanced routing):**
```go
// Method + path patterns
mux.HandleFunc("POST /batch/lookup", handleBatchLookup)
mux.HandleFunc("GET /lookup/isrc/{isrc}", handleISRCLookup)
mux.HandleFunc("GET /lookup/track/{id}", handleTrackLookup)
mux.HandleFunc("GET /lookup/artist/{id}", handleArtistLookup)
mux.HandleFunc("GET /lookup/album/{id}", handleAlbumLookup)
mux.HandleFunc("GET /lookup/album/{id}/tracks", handleAlbumTracks)
mux.HandleFunc("GET /search/track", handleTrackSearch)
mux.HandleFunc("GET /search/artist", handleArtistSearch)
mux.HandleFunc("GET /health", handleHealth)
mux.HandleFunc("GET /docs", handleDocs)
mux.HandleFunc("GET /openapi.yaml", handleOpenAPI)
```
**Handler pattern:**
```go
func handleTrackLookup(w http.ResponseWriter, r *http.Request) {
	// 1. Extract path parameter
	id := r.PathValue("id")

	// 2. Call database layer
	track, err := db.GetTrack(id)
	if err != nil {
		http.Error(w, "Track not found", http.StatusNotFound)
		return
	}

	// 3. Serialize response
	w.Header().Set("Content-Type", "application/json")
	json.NewEncoder(w).Encode(track)
}
```
**Validation rules:**
- Search queries: minimum 2 characters
- Batch requests: maximum 400 items
- Limit parameters: maximum 50 results
- Timeouts: 10 seconds for search queries
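The search-related rules above could be enforced along these lines. This is a minimal sketch, not the project's code: the function name `parseSearchParams`, the default limit of 20, and the choice to clamp an oversized limit instead of rejecting it are all assumptions.

```go
package main

import (
	"errors"
	"strconv"
	"strings"
)

const (
	minQueryLen  = 2  // "Search queries: minimum 2 characters"
	maxLimit     = 50 // "Limit parameters: maximum 50 results"
	defaultLimit = 20 // assumption: the document does not state a default
)

var errQueryTooShort = errors.New("query must be at least 2 characters")

// parseSearchParams validates the raw query string and limit parameter.
// It returns the trimmed query and the effective limit.
func parseSearchParams(q, rawLimit string) (string, int, error) {
	q = strings.TrimSpace(q)
	if len([]rune(q)) < minQueryLen {
		return "", 0, errQueryTooShort
	}

	limit := defaultLimit
	if rawLimit != "" {
		n, err := strconv.Atoi(rawLimit)
		if err != nil || n < 1 {
			return "", 0, errors.New("limit must be a positive integer")
		}
		limit = n
	}
	if limit > maxLimit {
		limit = maxLimit // assumption: clamp instead of returning an error
	}
	return q, limit, nil
}
```

A handler would call it as `parseSearchParams(r.URL.Query().Get("q"), r.URL.Query().Get("limit"))`; the parameter names `q` and `limit` are likewise hypothetical.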
#### ratelimit.go
**Implementation:** Token bucket algorithm with per-IP tracking
**Data structure:**
```go
type RateLimiter struct {
	visitors map[string]*rate.Limiter // IP -> limiter
	mu       sync.RWMutex             // Protects visitors map
	rate     rate.Limit               // Tokens per second
	burst    int                      // Burst capacity
}
```
**Algorithm:**
1. Extract client IP from `X-Forwarded-For` header (fallback to `RemoteAddr`)
2. Look up or create limiter for IP
3. Check if token available (`limiter.Allow()`)
4. If allowed, pass to next handler
5. If denied, return HTTP 429 with `Retry-After` header
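Step 1 of the algorithm might look like the sketch below. The names `clientIP` and `pickClientIP` are hypothetical, and taking the first entry of `X-Forwarded-For` is an assumption about how the header is read. The header is supplied by the client unless a trusted proxy overwrites it, so a key derived from it is only as reliable as the proxy setup.

```go
package main

import (
	"net"
	"net/http"
	"strings"
)

// clientIP returns the key used to look up a visitor's limiter.
func clientIP(r *http.Request) string {
	return pickClientIP(r.Header.Get("X-Forwarded-For"), r.RemoteAddr)
}

// pickClientIP prefers the first address listed in X-Forwarded-For and
// falls back to the host part of RemoteAddr.
func pickClientIP(xff, remoteAddr string) string {
	if xff != "" {
		first, _, _ := strings.Cut(xff, ",")
		if ip := strings.TrimSpace(first); ip != "" {
			return ip
		}
	}
	host, _, err := net.SplitHostPort(remoteAddr)
	if err != nil {
		return remoteAddr // no port present; use the value as-is
	}
	return host
}
```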
**BUG:** The visitor map grows without bound. There is no cleanup for inactive IPs, so a long-running server keeps one limiter for every distinct IP it has ever seen and its memory use only increases.
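One common way to bound the map is to record when each visitor was last seen and evict idle entries periodically. The sketch below illustrates that idea and is not a patch for the project: `visitorStore`, `touch`, `evictIdle`, and `runJanitor` are hypothetical names, and the per-IP limiter itself is left out to keep the example free of external imports.

```go
package main

import (
	"sync"
	"time"
)

// visitor records when an IP was last seen. In the real limiter this struct
// would also hold the per-IP *rate.Limiter.
type visitor struct {
	lastSeen time.Time
}

type visitorStore struct {
	mu       sync.Mutex
	visitors map[string]*visitor
}

func newVisitorStore() *visitorStore {
	return &visitorStore{visitors: make(map[string]*visitor)}
}

// touch creates the entry for ip if needed and refreshes its timestamp.
func (s *visitorStore) touch(ip string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	v, ok := s.visitors[ip]
	if !ok {
		v = &visitor{}
		s.visitors[ip] = v
	}
	v.lastSeen = time.Now()
}

// evictIdle deletes entries idle for at least maxIdle and returns the count.
func (s *visitorStore) evictIdle(maxIdle time.Duration) int {
	s.mu.Lock()
	defer s.mu.Unlock()
	now := time.Now()
	removed := 0
	for ip, v := range s.visitors {
		if now.Sub(v.lastSeen) >= maxIdle {
			delete(s.visitors, ip)
			removed++
		}
	}
	return removed
}

// runJanitor calls evictIdle on every tick until stop is closed.
func (s *visitorStore) runJanitor(interval, maxIdle time.Duration, stop <-chan struct{}) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			s.evictIdle(maxIdle)
		case <-stop:
			return
		}
	}
}
```

A server would start it once with something like `go store.runJanitor(time.Minute, 10*time.Minute, stop)`; both durations are illustrative values, not settings taken from the project.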
**Configuration:**
- Rate: 100 requests/second
- Burst: 200 requests
- Scope: Per-IP (not per-user, no authentication)
#### openapi.go
**Responsibilities:**
- Serve OpenAPI 3.1 specification at `/openapi.yaml`
- Serve Swagger UI at `/docs`
- Embed OpenAPI spec in binary (no external files)
**Swagger UI loading:**
```html
<!-- Loaded from unpkg.com CDN (browser-side) -->
<script src="https://unpkg.com/swagger-ui-dist@5/swagger-ui-bundle.js"></script>
<link rel="stylesheet" href="https://unpkg.com/swagger-ui-dist@5/swagger-ui.css" />
```
**OpenAPI spec highlights:**
- Version: 3.1.0
- All endpoints documented
- Request/response schemas
- Example payloads
- Error responses
### Database Layer: internal/db/db.go
**File size:** 907 lines (largest file in codebase)
**Responsibilities:**
- SQLite connection management
- Query execution
- Data enrichment (joining related entities)
- Batch optimization
- Transaction handling (read-only)
#### Connection Management
**Dual database connections:**
```go
type Database struct {
	mainDB       *sql.DB // main_database.sqlite3
	trackFilesDB *sql.DB // track_files.sqlite3
}
```
**Connection string PRAGMAs:**
```
file:/path/to/db.sqlite3?mode=ro&_journal_mode=off&_cache_size=-64000&_mmap_size=1073741824&_query_only=true
```
**Parameter breakdown:**
| Parameter | Value | Purpose |
|-----------|-------|---------|
| `mode=ro` | Read-only | URI parameter (not a PRAGMA); opens the file read-only and prevents accidental writes |
| `_journal_mode=off` | Disabled | No rollback journal (safe because nothing is ever written) |
| `_cache_size=-64000` | ~64MB | Page cache size (a negative value is a size in KiB, not a page count) |
| `_mmap_size=1073741824` | 1GB | Memory-mapped I/O size |
| `_query_only=true` | Enabled | Additional read-only enforcement |
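For illustration, the connection string can be assembled from the parameters in the table. `readOnlyDSN` is a hypothetical helper name. How the underscore-prefixed parameters are interpreted depends on the SQLite driver in use, so the exact spelling is worth checking against that driver's documentation.

```go
package main

import "strings"

// readOnlyDSN builds the read-only connection string for one database file,
// using exactly the parameters listed in the table above.
func readOnlyDSN(path string) string {
	params := []string{
		"mode=ro",
		"_journal_mode=off",
		"_cache_size=-64000",    // negative value: size in KiB
		"_mmap_size=1073741824", // 1 GiB
		"_query_only=true",
	}
	return "file:" + path + "?" + strings.Join(params, "&")
}
```

The same helper would be called twice, once per database file, since both connections share the settings.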
**Connection pool:**
```go
db.SetMaxOpenConns(8) // Conservative limit
db.SetMaxIdleConns(8) // Keep connections warm
db.SetConnMaxLifetime(0) // No expiration
```
#### Query Patterns
**Individual lookups:**
```go
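func (d *Database) GetTrack(id string) (*models.Track, error) {
	// 1. Fetch base track + album
	row := d.mainDB.QueryRow(`
		SELECT t.id, t.name, t.isrc, t.duration_ms, t.explicit,
		       t.track_number, t.disc_number, t.popularity, t.preview_url,
		       a.id, a.name, a.album_type, a.label, a.release_date,
		       a.release_date_precision, a.external_id_upc, a.total_tracks
		FROM tracks t
		JOIN albums a ON t.album_rowid = a.rowid
		WHERE t.id = ?
	`, id)

	// Scan step is not shown in the source excerpt; reconstructed from the
	// column list above (NULL handling omitted).
	var track models.Track
	if err := row.Scan(
		&track.ID, &track.Name, &track.ISRC, &track.DurationMs, &track.Explicit,
		&track.TrackNumber, &track.DiscNumber, &track.Popularity, &track.PreviewURL,
		&track.Album.ID, &track.Album.Name, &track.Album.AlbumType, &track.Album.Label,
		&track.Album.ReleaseDate, &track.Album.ReleaseDatePrecision,
		&track.Album.ExternalIDUPC, &track.Album.TotalTracks,
	); err != nil {
		return nil, err // sql.ErrNoRows when the ID is unknown
	}

	// 2. Enrich album (images, artists)
	d.enrichAlbum(&track.Album)

	// 3. Enrich track (artists, track_files)
	d.enrichTrack(&track)

	return &track, nil
}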
```
**Batch lookups:**
```go
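func (d *Database) BatchGetByISRC(isrcs []string) (map[string]*models.Track, error) {
	tracks := make(map[string]*models.Track, len(isrcs))
	if len(isrcs) == 0 {
		return tracks, nil // strings.Repeat panics on a negative count
	}

	// 1. Build IN clause; Query takes ...any, so convert the []string first
	placeholders := strings.Repeat("?,", len(isrcs)-1) + "?"
	args := make([]any, len(isrcs))
	for i, isrc := range isrcs {
		args[i] = isrc
	}
	query := fmt.Sprintf(`
		SELECT t.id, t.isrc, ...
		FROM tracks t
		JOIN albums a ON t.album_rowid = a.rowid
		WHERE t.isrc IN (%s)
	`, placeholders)

	// 2. Execute batch query
	rows, err := d.mainDB.Query(query, args...)
	if err != nil {
		return nil, err
	}
	defer rows.Close()

	// 3. Scan rows into tracks (elided) and collect IDs for enrichment
	trackIDs := make([]string, 0, len(isrcs))
	albumIDs := make([]string, 0, len(isrcs))

	// 4. Batch enrich all entities
	d.batchEnrichAlbums(albumIDs, tracks)
	d.batchEnrichTracks(trackIDs, tracks)

	return tracks, nil
}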
```
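The placeholder construction is the part of the batch lookup most prone to small mistakes, so it is worth isolating. The sketch below uses hypothetical helper names (`inClause`, `toArgs`). It guards the empty case, because `strings.Repeat` panics on a negative count, and converts the IDs to `[]any`, because a `[]string` cannot be passed directly to a variadic `...any` parameter.

```go
package main

import "strings"

// inClause returns n comma-separated "?" placeholders, or "" when n <= 0.
func inClause(n int) string {
	if n <= 0 {
		return ""
	}
	return strings.Repeat("?,", n-1) + "?"
}

// toArgs converts IDs to []any so they can be passed as query arguments.
func toArgs(ids []string) []any {
	args := make([]any, len(ids))
	for i, id := range ids {
		args[i] = id
	}
	return args
}
```

Callers should skip the query entirely when `inClause` returns an empty string, since `IN ()` is not portable SQL.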
#### Data Enrichment Flow
**Track enrichment pipeline:**
```
1. Fetch base track + album (single JOIN)
2. Enrich album:
   - Batch fetch album images (batchGetAlbumImages)
   - Batch fetch album artists (batchGetAlbumArtists)
3. Enrich track:
   - Batch fetch track artists (batchGetTrackArtists)
   - Batch fetch track files (batchEnrichTrackFiles)
4. Enrich artists:
   - Batch fetch artist genres (batchGetArtistGenres)
   - Batch fetch artist images (batchGetArtistImages)
5. Return fully enriched track
```
**Batch optimization functions:**
| Function | Purpose | Query Pattern |
|----------|---------|---------------|
| `batchGetAlbumImages` | Fetch all images for albums | `WHERE album_id IN (...)` |
| `batchGetAlbumArtists` | Fetch all artists for albums | `WHERE album_id IN (...)` |
| `batchGetTrackArtists` | Fetch all artists for tracks | `WHERE track_id IN (...)` |
| `batchGetArtistGenres` | Fetch all genres for artists | `WHERE artist_id IN (...)` |
| `batchGetArtistImages` | Fetch all images for artists | `WHERE artist_id IN (...)` |
| `batchEnrichTrackFiles` | Fetch extended track data | `WHERE track_id IN (...)` |
**Why batch optimization matters:**
- A single batch request with 400 tracks triggers about 6 enrichment queries in total
- Without batching: 400 tracks × 6 enrichment queries = 2,400 database queries
- With batching: 1 main query + 6 batch queries = 7 database queries
- **Performance gain: about 343x fewer queries (2,400 / 7 ≈ 343)**
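The step that makes batching work is grouping the rows of one `IN (...)` query by their parent ID, so each entity is enriched with a map lookup instead of its own query. A minimal sketch for album images follows. The `imageRow` type and both function names are hypothetical, and `queryCounts` simply reproduces the arithmetic above.

```go
package main

// Image mirrors the Image model described in this document.
type Image struct {
	URL    string
	Width  int
	Height int
}

// imageRow is a hypothetical scan target for one row of a
// `WHERE album_id IN (...)` query.
type imageRow struct {
	AlbumID string
	Image   Image
}

// groupImages buckets rows by album ID, preserving row order within an album.
func groupImages(rows []imageRow) map[string][]Image {
	byAlbum := make(map[string][]Image)
	for _, r := range rows {
		byAlbum[r.AlbumID] = append(byAlbum[r.AlbumID], r.Image)
	}
	return byAlbum
}

// queryCounts returns the number of enrichment queries issued per track
// versus the total number of queries when batching (one main query plus one
// per enrichment type).
func queryCounts(tracks, enrichments int) (perTrack, batched int) {
	return tracks * enrichments, 1 + enrichments
}
```

With 400 tracks and 6 enrichment types, `queryCounts` yields 2,400 and 7, matching the figures in the list.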
#### Search Implementation
**Track search:**
```sql
SELECT id, name, isrc, duration_ms, popularity, album_rowid
FROM tracks
WHERE name LIKE ? COLLATE NOCASE
ORDER BY popularity DESC
LIMIT ?
```
**Artist search:**
```sql
SELECT id, name, followers_total, popularity
FROM artists
WHERE name LIKE ? COLLATE NOCASE
ORDER BY followers_total DESC
LIMIT ?
```
**Search characteristics:**
- Pattern: `%query%` (substring match)
- Collation: `NOCASE` (case-insensitive)
- Timeout: 10 seconds (context deadline)
- Min query length: 2 characters
- Max results: 50
**Performance concern:** A pattern with a leading wildcard (`LIKE '%query%'`) cannot use an ordinary index, so every search scans the whole table. On 256M tracks that will be slow. SQLite full-text search (FTS) would be faster but is not implemented.
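The bound parameter for these queries is the user's text wrapped in `%`. The sketch below also escapes `%`, `_` and `\` in the input so they match literally. That escaping is an assumption, not something the queries above do, and it only takes effect if the SQL is extended with `ESCAPE '\'`. The helper name `likePattern` is hypothetical.

```go
package main

import "strings"

// likePattern wraps q in % wildcards for a substring match and escapes the
// LIKE metacharacters in q with a backslash.
func likePattern(q string) string {
	escaper := strings.NewReplacer(`\`, `\\`, `%`, `\%`, `_`, `\_`)
	return "%" + escaper.Replace(q) + "%"
}
```

Without the escaping, a query such as `100%` would be treated as `100` followed by a wildcard, which is harmless for correctness of access but can surprise users.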
### Models Layer: internal/models/models.go
**File size:** 65 lines (smallest layer)
**Responsibilities:**
- Define data structures
- JSON serialization tags
- Nested relationships
**Core models:**
```go
type Track struct {
	ID          string   `json:"id"`
	Name        string   `json:"name"`
	ISRC        string   `json:"isrc,omitempty"`
	DurationMs  int      `json:"duration_ms"`
	Explicit    bool     `json:"explicit"`
	TrackNumber int      `json:"track_number"`
	DiscNumber  int      `json:"disc_number"`
	Popularity  int      `json:"popularity"`
	PreviewURL  string   `json:"preview_url,omitempty"`
	Album       Album    `json:"album"`
	Artists     []Artist `json:"artists"`

	// Extended fields from track_files DB
	OriginalTitle string              `json:"original_title,omitempty"`
	VersionTitle  string              `json:"version_title,omitempty"`
	HasLyrics     bool                `json:"has_lyrics"`
	Languages     []string            `json:"languages,omitempty"`
	ArtistRoles   map[string][]string `json:"artist_roles,omitempty"`
}

type Album struct {
	ID                   string   `json:"id"`
	Name                 string   `json:"name"`
	AlbumType            string   `json:"album_type"`
	Label                string   `json:"label,omitempty"`
	ReleaseDate          string   `json:"release_date"`
	ReleaseDatePrecision string   `json:"release_date_precision"`
	ExternalIDUPC        string   `json:"external_id_upc,omitempty"`
	TotalTracks          int      `json:"total_tracks"`
	CopyrightC           string   `json:"copyright_c,omitempty"`
	CopyrightP           string   `json:"copyright_p,omitempty"`
	Images               []Image  `json:"images,omitempty"`
	Artists              []Artist `json:"artists,omitempty"`
}

type Artist struct {
	ID             string   `json:"id"`
	Name           string   `json:"name"`
	FollowersTotal int      `json:"followers_total,omitempty"`
	Popularity     int      `json:"popularity,omitempty"`
	Genres         []string `json:"genres,omitempty"`
	Images         []Image  `json:"images,omitempty"`
}

type Image struct {
	URL    string `json:"url"`
	Width  int    `json:"width"`
	Height int    `json:"height"`
}
```
**Batch request/response models:**
```go
type BatchRequest struct {
	Tracks  []string `json:"tracks,omitempty"`  // Track IDs
	Artists []string `json:"artists,omitempty"` // Artist IDs
	Albums  []string `json:"albums,omitempty"`  // Album IDs
	ISRCs   []string `json:"isrcs,omitempty"`   // ISRC codes
}

type BatchResponse struct {
	Tracks  map[string]*Track  `json:"tracks,omitempty"`
	Artists map[string]*Artist `json:"artists,omitempty"`
	Albums  map[string]*Album  `json:"albums,omitempty"`
	ISRCs   map[string]*Track  `json:"isrcs,omitempty"`
}
```
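Decoding a batch request and enforcing the 400-item cap could look like the sketch below. It assumes the cap applies to the combined length of all four lists. `size`, `validate`, `decodeBatch`, and `errBatchTooLarge` are hypothetical names.

```go
package main

import (
	"encoding/json"
	"errors"
)

const maxBatchItems = 400 // "Batch requests: maximum 400 items"

var errBatchTooLarge = errors.New("batch request exceeds 400 items")

// BatchRequest repeats the request model shown above.
type BatchRequest struct {
	Tracks  []string `json:"tracks,omitempty"`
	Artists []string `json:"artists,omitempty"`
	Albums  []string `json:"albums,omitempty"`
	ISRCs   []string `json:"isrcs,omitempty"`
}

// size counts the items across all four lists.
func (b BatchRequest) size() int {
	return len(b.Tracks) + len(b.Artists) + len(b.Albums) + len(b.ISRCs)
}

// validate enforces the item cap.
func (b BatchRequest) validate() error {
	if b.size() > maxBatchItems {
		return errBatchTooLarge
	}
	return nil
}

// decodeBatch parses a JSON body and validates it.
func decodeBatch(body []byte) (BatchRequest, error) {
	var req BatchRequest
	if err := json.Unmarshal(body, &req); err != nil {
		return req, err
	}
	return req, req.validate()
}
```

In a handler, `json.NewDecoder(r.Body).Decode(&req)` followed by `req.validate()` would do the same work without buffering the body first.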
## Request Flow
### Example: GET /lookup/track/{id}
```
1. Client Request
   GET /lookup/track/abc123

2. Rate Limiter Middleware
   - Extract IP from X-Forwarded-For
   - Check token bucket for IP
   - If allowed, continue; else return 429

3. HTTP Handler (api/handlers.go)
   - Extract "abc123" from path
   - Call db.GetTrack("abc123")

4. Database Layer (db/db.go)
   - Query track + album (single JOIN)
   - Enrich album (images, artists)
   - Enrich track (artists, track_files)
   - Enrich artists (genres, images)

5. Models Layer (models/models.go)
   - Populate Track struct
   - Nest Album, Artists

6. HTTP Handler
   - Serialize Track to JSON
   - Set Content-Type: application/json
   - Write response

7. Client Response
   200 OK
   {
     "id": "abc123",
     "name": "Song Title",
     "album": {...},
     "artists": [...]
   }
```
### Example: POST /batch/lookup
```
1. Client Request
   POST /batch/lookup
   {
     "isrcs": ["USRC12345678", "GBUM71234567", ...],  // Up to 400
     "tracks": ["id1", "id2", ...]
   }

2. Rate Limiter Middleware
   - Single request counts as 1 token (not 400)

3. HTTP Handler
   - Parse BatchRequest
   - Validate: max 400 items total
   - Call db.BatchGetByISRC(isrcs)
   - Call db.BatchGetTracks(trackIDs)

4. Database Layer
   - Build IN clause for ISRCs
   - Execute batch query (1 query for all ISRCs)
   - Collect all track/album/artist IDs
   - Batch enrich all entities (6 batch queries)

5. HTTP Handler
   - Build BatchResponse with maps
   - Serialize to JSON

6. Client Response
   200 OK
   {
     "isrcs": {
       "USRC12345678": {...},
       "GBUM71234567": {...}
     },
     "tracks": {
       "id1": {...},
       "id2": {...}
     }
   }
```
## Graceful Shutdown
**Signal handling:**
```go
// Listen for SIGINT (Ctrl+C) and SIGTERM (Docker stop)
sigChan := make(chan os.Signal, 1)
signal.Notify(sigChan, os.Interrupt, syscall.SIGTERM)
// Block until signal received
<-sigChan
// Shutdown with 10-second timeout
ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
defer cancel()
server.Shutdown(ctx) // Stop accepting new requests, finish in-flight
```
**Shutdown sequence:**
1. Receive SIGINT or SIGTERM
2. Stop accepting new connections
3. Wait for in-flight requests (max 10 seconds)
4. Close database connections
5. Exit process
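Put together, the sequence above can be sketched as one function. This shows the general pattern under stated assumptions and is not the project's `main.go`: `runUntilStopped` and `stopOnSignal` are hypothetical names, and closing the database connections (step 4) is left to the caller once the function returns.

```go
package main

import (
	"context"
	"errors"
	"net"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"
)

// stopOnSignal returns a channel that is closed on SIGINT or SIGTERM.
func stopOnSignal() <-chan struct{} {
	sigCh := make(chan os.Signal, 1)
	signal.Notify(sigCh, os.Interrupt, syscall.SIGTERM)
	stop := make(chan struct{})
	go func() {
		<-sigCh
		close(stop)
	}()
	return stop
}

// runUntilStopped serves h on addr until stop is closed, then shuts down
// gracefully, waiting at most timeout for in-flight requests.
func runUntilStopped(addr string, h http.Handler, stop <-chan struct{}, timeout time.Duration) error {
	ln, err := net.Listen("tcp", addr)
	if err != nil {
		return err
	}
	srv := &http.Server{Handler: h}

	// Serve blocks, so it runs in its own goroutine.
	errCh := make(chan error, 1)
	go func() { errCh <- srv.Serve(ln) }()

	select {
	case err := <-errCh:
		return err // the server failed before any stop request
	case <-stop:
	}

	// Stop accepting new connections and wait for in-flight requests.
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	if err := srv.Shutdown(ctx); err != nil {
		return err // timeout expired with requests still running
	}
	if err := <-errCh; !errors.Is(err, http.ErrServerClosed) {
		return err
	}
	return nil
}
```

A `main` function would call `runUntilStopped(*addr, handler, stopOnSignal(), 10*time.Second)`, close both database handles afterwards, and then exit.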
## No Framework Philosophy
Music Metadata API uses **zero web frameworks**. Everything is Go stdlib:
**Routing:** Go 1.22+ enhanced `http.ServeMux`
- Method-specific routes: `GET /path`, `POST /path`
- Path parameters: `/lookup/track/{id}`
- No regex routes; only literal segments and `{name}` path parameters are used
**JSON:** `encoding/json` stdlib
- `json.NewEncoder(w).Encode(data)` for responses
- `json.NewDecoder(r.Body).Decode(&req)` for requests
**HTTP Server:** `net/http` stdlib
- `http.Server` with custom `Addr` and `Handler`
- No middleware framework (custom rate limiter)
**Database:** `database/sql` stdlib
- `modernc.org/sqlite` driver (pure Go, no CGO)
- Raw SQL queries (no ORM)
**Logging:** `log/slog` stdlib
- Structured logging for errors
- No use of log levels in practice (everything is logged as an error)
**Benefits:**
- Minimal dependencies (2 external packages)
- No framework lock-in
- Easy to understand (no magic)
- Fast compilation
- Small binary size
**Tradeoffs:**
- More boilerplate (manual error handling)
- No built-in middleware chain
- Manual query building (no ORM)
- No automatic validation
## Performance Characteristics
**Strengths:**
- Read-only databases (no write locks)
- Connection pooling (8 connections)
- Memory-mapped I/O (1GB mmap)
- Batch optimization (about 343x fewer queries for a 400-track batch)
- Conservative cache (64MB)
**Bottlenecks:**
- Search queries (LIKE %query% on 256M rows)
- Rate limiter memory leak (unbounded map)
- No query result caching
- No CDN for image URLs
**Scalability:**
- Horizontal: Run multiple instances (read-only safe)
- Vertical: Limited mainly by disk I/O; SQLite's single-writer limit does not apply because the databases are read-only
- Database size: 216GB requires SSD for acceptable performance