feat: initial implementation of metadata aggregator

- gRPC service with MusicBrainz provider
- PostgreSQL schema with migrations
- Service layer with database-first caching
- Repository pattern for data access
- YAML configuration support
- Research documentation for 17 music metadata projects
Alexander
2026-04-28 16:27:14 +02:00
commit a1f6701bac
163 changed files with 95884 additions and 0 deletions
@@ -0,0 +1,895 @@
# Music Metadata API - API Reference
## API Overview
Music Metadata API exposes a RESTful HTTP API with 11 endpoints for querying music metadata. The API is fully documented with OpenAPI 3.1 and includes an interactive Swagger UI.
**Base URL:** `http://localhost:8080` (configurable via `-addr` flag)
**Content-Type:** `application/json`
**Authentication:** None (public API)
**CORS:** Not supported
**Rate Limiting:** 100 requests/second, 200 burst (per-IP)
## Endpoints
### Batch Operations
#### POST /batch/lookup
Retrieve multiple tracks, albums, and artists in a single request.
**Request Body:**
```json
{
"tracks": ["track_id_1", "track_id_2"],
"artists": ["artist_id_1", "artist_id_2"],
"albums": ["album_id_1", "album_id_2"],
"isrcs": ["USRC12345678", "GBUM71234567"]
}
```
**Constraints:**
- Maximum 400 items total across all arrays
- Each field is optional on its own, but at least one must be provided
- Duplicate IDs allowed (deduplicated in response)
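The constraints above can be sketched as a small validation step. This is a minimal illustration, not the project's actual handler code: `validateBatch` and `maxBatchItems` are hypothetical names, and the sketch assumes the 400-item limit counts raw entries before deduplication.

```go
package main

import (
	"errors"
	"fmt"
)

// maxBatchItems mirrors the documented limit of 400 items per request.
const maxBatchItems = 400

// BatchRequest matches the documented request body.
type BatchRequest struct {
	Tracks  []string `json:"tracks,omitempty"`
	Artists []string `json:"artists,omitempty"`
	Albums  []string `json:"albums,omitempty"`
	ISRCs   []string `json:"isrcs,omitempty"`
}

// validateBatch is a hypothetical helper. It rejects empty requests and
// requests whose combined item count exceeds maxBatchItems.
// Assumption: duplicates are counted before deduplication.
func validateBatch(req BatchRequest) error {
	total := len(req.Tracks) + len(req.Artists) + len(req.Albums) + len(req.ISRCs)
	if total == 0 {
		return errors.New("at least one of tracks, artists, albums, isrcs is required")
	}
	if total > maxBatchItems {
		return fmt.Errorf("too many items: %d (maximum %d)", total, maxBatchItems)
	}
	return nil
}
```

A handler would typically map a non-nil error from such a check to `400 Bad Request`.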
**Response:**
```json
{
"tracks": {
"track_id_1": {
"id": "track_id_1",
"name": "Song Title",
"isrc": "USRC12345678",
"duration_ms": 240000,
"explicit": false,
"track_number": 1,
"disc_number": 1,
"popularity": 85,
"preview_url": "https://...",
"album": { /* Album object */ },
"artists": [ /* Artist objects */ ],
"original_title": "Original Title",
"version_title": "Radio Edit",
"has_lyrics": true,
"languages": ["en", "es"],
"artist_roles": {
"artist_id_1": ["performer", "composer"]
}
}
},
"artists": {
"artist_id_1": { /* Artist object */ }
},
"albums": {
"album_id_1": { /* Album object */ }
},
"isrcs": {
"USRC12345678": { /* Track object */ }
}
}
```
**Status Codes:**
- `200 OK` - Success (even if some items not found)
- `400 Bad Request` - Invalid request (exceeds 400 items, malformed JSON)
- `429 Too Many Requests` - Rate limit exceeded
**Performance:**
- Optimized with batch queries (7 queries for 400 items vs 2,400 individual queries)
- Typical response time: 100-500ms for 400 items
**Example:**
```bash
curl -X POST http://localhost:8080/batch/lookup \
-H "Content-Type: application/json" \
-d '{
"isrcs": ["USRC17607839", "GBUM71029604"],
"tracks": ["4cOdK2wGLETKBW3PvgPWqT"]
}'
```
### Track Lookups
#### GET /lookup/isrc/{isrc}
Retrieve track by ISRC (International Standard Recording Code).
**Path Parameters:**
- `isrc` - ISRC code (e.g., `USRC12345678`)
**Response:**
```json
{
"id": "track_id",
"name": "Song Title",
"isrc": "USRC12345678",
"duration_ms": 240000,
"explicit": false,
"track_number": 1,
"disc_number": 1,
"popularity": 85,
"preview_url": "https://p.scdn.co/mp3-preview/...",
"album": {
"id": "album_id",
"name": "Album Title",
"album_type": "album",
"label": "Record Label",
"release_date": "2023-01-15",
"release_date_precision": "day",
"external_id_upc": "123456789012",
"total_tracks": 12,
"copyright_c": "2023 Label",
"copyright_p": "2023 Label",
"images": [
{
"url": "https://i.scdn.co/image/...",
"width": 640,
"height": 640
}
],
"artists": [ /* Album artists */ ]
},
"artists": [
{
"id": "artist_id",
"name": "Artist Name",
"followers_total": 1000000,
"popularity": 90,
"genres": ["pop", "rock"],
"images": [ /* Artist images */ ]
}
],
"original_title": "Original Title",
"version_title": "Radio Edit",
"has_lyrics": true,
"languages": ["en"],
"artist_roles": {
"artist_id": ["performer", "composer"]
}
}
```
**Status Codes:**
- `200 OK` - Track found
- `404 Not Found` - ISRC not in database
- `429 Too Many Requests` - Rate limit exceeded
**Example:**
```bash
curl http://localhost:8080/lookup/isrc/USRC17607839
```
#### GET /lookup/track/{id}
Retrieve track by internal track ID.
**Path Parameters:**
- `id` - Track ID (internal identifier)
**Response:** Same as `/lookup/isrc/{isrc}`
**Status Codes:**
- `200 OK` - Track found
- `404 Not Found` - Track ID not in database
- `429 Too Many Requests` - Rate limit exceeded
**Example:**
```bash
curl http://localhost:8080/lookup/track/4cOdK2wGLETKBW3PvgPWqT
```
### Artist Lookups
#### GET /lookup/artist/{id}
Retrieve artist by ID.
**Path Parameters:**
- `id` - Artist ID
**Response:**
```json
{
"id": "artist_id",
"name": "Artist Name",
"followers_total": 1000000,
"popularity": 90,
"genres": ["pop", "rock", "indie"],
"images": [
{
"url": "https://i.scdn.co/image/...",
"width": 640,
"height": 640
},
{
"url": "https://i.scdn.co/image/...",
"width": 320,
"height": 320
}
]
}
```
**Status Codes:**
- `200 OK` - Artist found
- `404 Not Found` - Artist ID not in database
- `429 Too Many Requests` - Rate limit exceeded
**Example:**
```bash
curl http://localhost:8080/lookup/artist/0TnOYISbd1XYRBk9myaseg
```
### Album Lookups
#### GET /lookup/album/{id}
Retrieve album by ID.
**Path Parameters:**
- `id` - Album ID
**Response:**
```json
{
"id": "album_id",
"name": "Album Title",
"album_type": "album",
"label": "Record Label",
"release_date": "2023-01-15",
"release_date_precision": "day",
"external_id_upc": "123456789012",
"total_tracks": 12,
"copyright_c": "2023 Label",
"copyright_p": "2023 Label",
"images": [
{
"url": "https://i.scdn.co/image/...",
"width": 640,
"height": 640
}
],
"artists": [
{
"id": "artist_id",
"name": "Artist Name",
"followers_total": 1000000,
"popularity": 90,
"genres": ["pop"],
"images": [ /* Artist images */ ]
}
]
}
```
**Status Codes:**
- `200 OK` - Album found
- `404 Not Found` - Album ID not in database
- `429 Too Many Requests` - Rate limit exceeded
**Example:**
```bash
curl http://localhost:8080/lookup/album/2ODvWsOgouMbaA5xf0RkJe
```
#### GET /lookup/album/{id}/tracks
Retrieve all tracks for an album.
**Path Parameters:**
- `id` - Album ID
**Response:**
```json
{
"tracks": [
{
"id": "track_id_1",
"name": "Track 1",
"track_number": 1,
"disc_number": 1,
/* Full track object */
},
{
"id": "track_id_2",
"name": "Track 2",
"track_number": 2,
"disc_number": 1,
/* Full track object */
}
]
}
```
**Status Codes:**
- `200 OK` - Album found (even if no tracks)
- `404 Not Found` - Album ID not in database
- `429 Too Many Requests` - Rate limit exceeded
**Example:**
```bash
curl http://localhost:8080/lookup/album/2ODvWsOgouMbaA5xf0RkJe/tracks
```
### Search
#### GET /search/track
Search tracks by name.
**Query Parameters:**
- `q` - Search query (minimum 2 characters, required)
- `limit` - Maximum results (default 10, max 50)
**Search Behavior:**
- Case-insensitive substring match (`LIKE %query% COLLATE NOCASE`)
- Ordered by popularity (descending)
- 10-second timeout
- Returns partial matches
**Response:**
```json
{
"tracks": [
{
"id": "track_id",
"name": "Song Title",
/* Full track object */
}
],
"total": 1
}
```
**Status Codes:**
- `200 OK` - Search completed (even if no results)
- `400 Bad Request` - Query too short (< 2 chars) or limit too high (> 50)
- `429 Too Many Requests` - Rate limit exceeded
- `504 Gateway Timeout` - Search exceeded 10 seconds
**Example:**
```bash
curl "http://localhost:8080/search/track?q=bohemian&limit=5"
```
**Performance Note:** Search uses `LIKE %query%`. Because the pattern starts with a wildcard, SQLite cannot use an index on the name column, so searches scan the full table (256M tracks) and may be slow, up to the 10-second timeout.
#### GET /search/artist
Search artists by name.
**Query Parameters:**
- `q` - Search query (minimum 2 characters, required)
- `limit` - Maximum results (default 10, max 50)
**Search Behavior:**
- Case-insensitive substring match (`LIKE %query% COLLATE NOCASE`)
- Ordered by follower count (descending)
- 10-second timeout
- Returns partial matches
**Response:**
```json
{
"artists": [
{
"id": "artist_id",
"name": "Artist Name",
/* Full artist object */
}
],
"total": 1
}
```
**Status Codes:**
- `200 OK` - Search completed (even if no results)
- `400 Bad Request` - Query too short (< 2 chars) or limit too high (> 50)
- `429 Too Many Requests` - Rate limit exceeded
- `504 Gateway Timeout` - Search exceeded 10 seconds
**Example:**
```bash
curl "http://localhost:8080/search/artist?q=beatles&limit=5"
```
### Health & Documentation
#### GET /health
Health check endpoint for monitoring.
**Response:**
```json
{
"status": "ok"
}
```
**Status Codes:**
- `200 OK` - Always (even if database unreachable)
**Limitation:** This is a naive health check. It doesn't verify database connectivity. A database failure won't be detected until actual queries fail.
**Example:**
```bash
curl http://localhost:8080/health
```
#### GET /docs
Interactive Swagger UI for API documentation.
**Response:** HTML page with embedded Swagger UI
**Dependencies:**
- Loads Swagger UI from unpkg.com CDN (browser-side)
- Requires internet connection for first load
- Fetches OpenAPI spec from `/openapi.yaml`
**Example:**
```bash
# Open in browser
open http://localhost:8080/docs
```
#### GET /openapi.yaml
OpenAPI 3.1 specification in YAML format.
**Response:** YAML document with full API specification
**Content-Type:** `application/x-yaml`
**Example:**
```bash
curl http://localhost:8080/openapi.yaml
```
## Rate Limiting
### Algorithm
**Implementation:** Token bucket per IP address
**Configuration:**
- **Rate:** 100 requests/second
- **Burst:** 200 requests
- **Scope:** Per-IP (extracted from `X-Forwarded-For` or `RemoteAddr`)
### Behavior
**Token bucket mechanics:**
1. Each IP gets a bucket with 200 tokens (burst capacity)
2. Tokens refill at 100/second
3. Each request consumes 1 token
4. If bucket empty, request rejected with HTTP 429
**Example scenarios:**
| Scenario | Tokens Available | Result |
|----------|------------------|--------|
| First request | 200 | Allowed (199 remaining) |
| 200 requests in one instantaneous burst | 200 → 0 | All allowed |
| 201st request before any refill | 0 | Rejected (429) |
| Wait 1 second | 0 → 100 | 100 requests allowed |
| Steady 50 req/s | Always > 0 | Never rate limited |
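The scenarios in the table can be reproduced with a simplified token bucket. This is a sketch for reasoning about the numbers, not the limiter the server uses (the architecture notes show it relies on `rate.Limiter`). The `bucket` type and the explicit clock in seconds are assumptions made to keep the example deterministic.

```go
package main

// bucket is a simplified token bucket driven by an explicit clock
// (seconds as float64) so its behaviour is easy to follow.
type bucket struct {
	rate   float64 // tokens added per second
	burst  float64 // maximum number of tokens held
	tokens float64 // tokens currently available
	last   float64 // time of the previous refill, in seconds
}

// newBucket starts with a full bucket, as described in the mechanics above.
func newBucket(rate, burst float64) *bucket {
	return &bucket{rate: rate, burst: burst, tokens: burst}
}

// allowAt reports whether a request arriving at time t (seconds) is admitted.
func (b *bucket) allowAt(t float64) bool {
	if t > b.last {
		// Refill for the elapsed time, capped at the burst capacity.
		b.tokens += (t - b.last) * b.rate
		if b.tokens > b.burst {
			b.tokens = b.burst
		}
		b.last = t
	}
	if b.tokens < 1 {
		return false // would be answered with HTTP 429
	}
	b.tokens--
	return true
}

// countAllowed sends n back-to-back requests at time t and counts admissions.
func countAllowed(b *bucket, n int, t float64) int {
	allowed := 0
	for i := 0; i < n; i++ {
		if b.allowAt(t) {
			allowed++
		}
	}
	return allowed
}
```

With `newBucket(100, 200)`, a burst of 201 requests at `t = 0` admits 200 and rejects one. One second later the bucket has refilled by 100 tokens, so 100 further requests pass.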
### Response Headers
**When rate limited (HTTP 429):**
```
HTTP/1.1 429 Too Many Requests
Retry-After: 1
Content-Type: text/plain
Rate limit exceeded
```
**Retry-After:** Seconds to wait before retrying (always 1)
### IP Extraction
**Priority:**
1. `X-Forwarded-For` header (first IP if comma-separated)
2. `RemoteAddr` from connection
**Example:**
```
X-Forwarded-For: 203.0.113.1, 198.51.100.1
→ Rate limiter uses 203.0.113.1
```
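The priority order above could look roughly like this in Go. It is a sketch under assumptions: `clientIP` is a hypothetical helper that takes the raw `X-Forwarded-For` value and the connection's `RemoteAddr` as plain strings, and the real implementation may differ.

```go
package main

import (
	"net"
	"strings"
)

// clientIP is a hypothetical helper. forwardedFor is the raw
// X-Forwarded-For header value; remoteAddr is http.Request.RemoteAddr
// in "host:port" form.
func clientIP(forwardedFor, remoteAddr string) string {
	if forwardedFor != "" {
		// Use the first entry of a comma-separated list.
		first, _, _ := strings.Cut(forwardedFor, ",")
		if ip := strings.TrimSpace(first); ip != "" {
			return ip
		}
	}
	host, _, err := net.SplitHostPort(remoteAddr)
	if err != nil {
		return remoteAddr // no port present; use the value as-is
	}
	return host
}
```

`X-Forwarded-For` is supplied by the client unless a trusted proxy overwrites it, so a limiter keyed on it can be sidestepped by sending a different header value on each request.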
### Known Issues
**Memory leak:** Visitor map grows unbounded. No cleanup for inactive IPs. Long-running servers will accumulate memory over time.
**Workaround:** Restart server periodically or implement custom cleanup.
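One possible shape for the custom cleanup is to record when each IP was last seen and evict idle entries periodically. The sketch below is hypothetical: `visitorTable`, `touch`, and `evictIdle` do not exist in the project, and timestamps are passed in as Unix seconds to keep the example deterministic.

```go
package main

import "sync"

// visitorTable is a hypothetical stand-in for the limiter's visitor map.
// It records when each IP was last seen so idle entries can be evicted.
type visitorTable struct {
	mu       sync.Mutex
	lastSeen map[string]int64 // IP -> Unix seconds of the latest request
}

func newVisitorTable() *visitorTable {
	return &visitorTable{lastSeen: make(map[string]int64)}
}

// touch records a request from ip at time nowUnix.
func (v *visitorTable) touch(ip string, nowUnix int64) {
	v.mu.Lock()
	defer v.mu.Unlock()
	v.lastSeen[ip] = nowUnix
}

// evictIdle removes entries idle for longer than maxIdleSecs and
// returns how many were removed.
func (v *visitorTable) evictIdle(nowUnix, maxIdleSecs int64) int {
	v.mu.Lock()
	defer v.mu.Unlock()
	removed := 0
	for ip, seen := range v.lastSeen {
		if nowUnix-seen > maxIdleSecs {
			delete(v.lastSeen, ip) // deleting while ranging over a map is safe in Go
			removed++
		}
	}
	return removed
}
```

In a server, `evictIdle` would usually run from a background goroutine driven by a `time.Ticker`, with the per-IP limiter stored alongside the timestamp.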
## Data Models
### Track Object
```json
{
"id": "string", // Internal track ID
"name": "string", // Track title
"isrc": "string", // ISRC code (optional)
"duration_ms": 0, // Duration in milliseconds
"explicit": false, // Explicit content flag
"track_number": 0, // Track number on album
"disc_number": 0, // Disc number (multi-disc albums)
"popularity": 0, // Popularity score (0-100)
"preview_url": "string", // 30-second preview URL (optional)
"album": { /* Album object */ }, // Parent album
"artists": [ /* Artist objects */ ], // Track artists
"original_title": "string", // Original title (optional)
"version_title": "string", // Version (e.g., "Radio Edit") (optional)
"has_lyrics": false, // Lyrics availability flag
"languages": ["string"], // Languages of performance (optional)
"artist_roles": { // Artist roles map (optional)
"artist_id": ["role1", "role2"]
}
}
```
**Field notes:**
- `isrc`: Omitted from the response when the track has no ISRC (the Go field is tagged `omitempty`)
- `preview_url`: Omitted when no preview is available
- `popularity`: Higher = more popular (Spotify-style metric)
- `languages`: ISO 639-1 codes (e.g., "en", "es")
- `artist_roles`: Maps artist ID to roles (e.g., "performer", "composer", "producer")
### Album Object
```json
{
"id": "string", // Internal album ID
"name": "string", // Album title
"album_type": "string", // "album", "single", "compilation"
"label": "string", // Record label (optional)
"release_date": "string", // ISO 8601 date (YYYY-MM-DD)
"release_date_precision": "string", // "year", "month", "day"
"external_id_upc": "string", // UPC barcode (optional)
"total_tracks": 0, // Total tracks on album
"copyright_c": "string", // Copyright notice (optional)
"copyright_p": "string", // Phonographic copyright (optional)
"images": [ /* Image objects */ ], // Album artwork (optional)
"artists": [ /* Artist objects */ ] // Album artists (optional)
}
```
**Field notes:**
- `album_type`: Typically "album", "single", or "compilation"
- `release_date_precision`: Indicates granularity of release date
- `external_id_upc`: Universal Product Code (barcode)
- `images`: Sorted by size (largest first)
### Artist Object
```json
{
"id": "string", // Internal artist ID
"name": "string", // Artist name
"followers_total": 0, // Total followers (optional)
"popularity": 0, // Popularity score 0-100 (optional)
"genres": ["string"], // Genres (optional)
"images": [ /* Image objects */ ] // Artist images (optional)
}
```
**Field notes:**
- `followers_total`: Spotify-style follower count
- `popularity`: Higher = more popular
- `genres`: Multiple genres possible (e.g., ["pop", "rock"])
- `images`: Sorted by size (largest first)
### Image Object
```json
{
"url": "string", // Image URL (typically i.scdn.co)
"width": 0, // Width in pixels
"height": 0 // Height in pixels
}
```
**Field notes:**
- URLs reference external CDN (i.scdn.co)
- Multiple sizes available (640x640, 320x320, 64x64 typical)
- Images not hosted by API (external references)
## Error Responses
### Standard Error Format
```json
{
"error": "Error message"
}
```
**Content-Type:** `application/json` (for structured errors) or `text/plain` (for simple errors)
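Because errors arrive either as JSON or as plain text, a client needs to handle both. The following is a minimal client-side sketch; `apiErrorMessage` is a hypothetical helper, not part of the API or any SDK.

```go
package main

import (
	"encoding/json"
	"strings"
)

// apiErrorMessage extracts a readable message from an error response.
// It prefers the documented {"error": "..."} shape and falls back to the
// raw body, which covers the text/plain 429 response.
func apiErrorMessage(contentType string, body []byte) string {
	if strings.HasPrefix(contentType, "application/json") {
		var payload struct {
			Error string `json:"error"`
		}
		if err := json.Unmarshal(body, &payload); err == nil && payload.Error != "" {
			return payload.Error
		}
	}
	return strings.TrimSpace(string(body))
}
```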
### Common Error Codes
| Status | Meaning | Common Causes |
|--------|---------|---------------|
| 400 | Bad Request | Invalid query params, malformed JSON, validation failure |
| 404 | Not Found | ID/ISRC not in database |
| 429 | Too Many Requests | Rate limit exceeded |
| 500 | Internal Server Error | Database error, query timeout |
| 504 | Gateway Timeout | Search exceeded 10 seconds |
### Example Error Responses
**404 Not Found:**
```json
{
"error": "Track not found"
}
```
**400 Bad Request:**
```json
{
"error": "Query must be at least 2 characters"
}
```
**429 Too Many Requests:**
```
Rate limit exceeded
```
**500 Internal Server Error:**
```json
{
"error": "Database query failed"
}
```
## OpenAPI Specification
### Metadata
```yaml
openapi: 3.1.0
info:
title: Music Metadata API
version: 1.0.0
description: API for querying music metadata from 256M tracks
license:
name: MIT
servers:
- url: http://localhost:8080
description: Local development server
```
### Example Endpoint Definition
```yaml
/lookup/track/{id}:
get:
summary: Get track by ID
operationId: getTrack
parameters:
- name: id
in: path
required: true
schema:
type: string
description: Track ID
responses:
'200':
description: Track found
content:
application/json:
schema:
$ref: '#/components/schemas/Track'
'404':
description: Track not found
'429':
description: Rate limit exceeded
```
### Schema Definitions
All data models defined in `components/schemas`:
- `Track`
- `Album`
- `Artist`
- `Image`
- `BatchRequest`
- `BatchResponse`
**Access:** http://localhost:8080/openapi.yaml
## Usage Examples
### Batch Lookup (Python)
```python
import requests
url = "http://localhost:8080/batch/lookup"
payload = {
"isrcs": ["USRC17607839", "GBUM71029604"],
"tracks": ["4cOdK2wGLETKBW3PvgPWqT"]
}
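response = requests.post(url, json=payload, timeout=10)
response.raise_for_status()  # fail early on 400/429 instead of parsing an error body
data = response.json()
for isrc, track in data.get("isrcs", {}).items():
    artists = ", ".join(a["name"] for a in track.get("artists", [])) or "unknown artist"
    print(f"{isrc}: {track['name']} by {artists}")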
```
### Search with a Result Limit (JavaScript)
```javascript
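async function searchTracks(query, limit = 10) {
  const url = `http://localhost:8080/search/track?q=${encodeURIComponent(query)}&limit=${limit}`;
  const response = await fetch(url);
  if (!response.ok) {
    // 400 (bad query or limit), 429 (rate limited), or 504 (search timeout)
    throw new Error(`Search failed with HTTP ${response.status}`);
  }
  const data = await response.json();
  return data.tracks;
}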
const tracks = await searchTracks("bohemian rhapsody", 5);
tracks.forEach(track => {
console.log(`${track.name} - ${track.album.name}`);
});
```
### Rate Limit Handling (Go)
```go
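import (
	"fmt"
	"net/http"
	"strconv"
	"time"
)

// fetchWithRetry retries on HTTP 429 up to maxRetries times, honoring Retry-After.
func fetchWithRetry(url string, maxRetries int) (*http.Response, error) {
	for attempt := 0; ; attempt++ {
		resp, err := http.Get(url)
		if err != nil {
			return nil, err
		}
		if resp.StatusCode != http.StatusTooManyRequests {
			return resp, nil // caller closes resp.Body
		}
		resp.Body.Close() // release the connection before retrying
		if attempt >= maxRetries {
			return nil, fmt.Errorf("still rate limited after %d retries", maxRetries)
		}
		// Retry-After is in whole seconds; fall back to 1s if missing or invalid.
		secs, err := strconv.Atoi(resp.Header.Get("Retry-After"))
		if err != nil || secs < 1 {
			secs = 1
		}
		time.Sleep(time.Duration(secs) * time.Second)
	}
}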
```
## Performance Considerations
### Batch vs Individual Requests
**Individual requests (400 tracks):**
- 400 HTTP requests
- 400 × ~50ms = 20 seconds when sent sequentially
- Rate limited even when parallelized: 200-request burst, then 100 req/s (about 2 seconds minimum)
**Batch request (400 tracks):**
- 1 HTTP request
- ~200-500ms total
- **40-100x faster**
**Recommendation:** Always use batch endpoint for multiple items.
### Search Performance
**Fast searches:**
- Short, specific queries ("beatles")
- Queries matching popular items (returned first)
**Slow searches:**
- Common words ("love", "the")
- Long queries with many results
- Queries requiring full table scan
**Recommendation:** Implement client-side caching for common searches.
### Caching Strategy
**Cacheable:**
- Track/album/artist lookups (data rarely changes)
- Search results (cache for 1 hour)
**Not cacheable:**
- Health checks
- OpenAPI spec (changes with deployments)
**Recommendation:** Use HTTP caching headers (not currently implemented by API).
## Integration Patterns
### Enrichment Pipeline
```
1. Extract ISRCs from audio files (e.g., via AcoustID)
2. Batch lookup ISRCs (400 at a time)
3. Store track metadata in local database
4. Fetch missing artists/albums individually
5. Update local cache
```
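Step 2 of the pipeline requires splitting a long ISRC list into groups that respect the 400-item batch limit. A minimal sketch follows; `chunk` is a hypothetical helper name.

```go
package main

// chunk splits ids into consecutive slices of at most size elements.
// The returned slices share the backing array of ids.
func chunk(ids []string, size int) [][]string {
	if size <= 0 {
		return nil
	}
	batches := make([][]string, 0, (len(ids)+size-1)/size)
	for start := 0; start < len(ids); start += size {
		end := start + size
		if end > len(ids) {
			end = len(ids)
		}
		batches = append(batches, ids[start:end])
	}
	return batches
}
```

Each resulting batch can be sent as the `isrcs` array of one `POST /batch/lookup` request. For 950 ISRCs and a size of 400 this yields batches of 400, 400, and 150.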
### Complementing MusicBrainz
```
MusicBrainz (MBID-based)
Resolve ISRC from MusicBrainz
Lookup ISRC in Music Metadata API
Merge metadata (MusicBrainz + Spotify-style data)
```
### Real-time Lookup
```
User plays track
Extract ISRC from file
Check local cache
If miss: GET /lookup/isrc/{isrc}
Display metadata in UI
```
## Limitations
### No Authentication
**Impact:**
- Anyone can query API
- No usage tracking per user
- No quota enforcement per user
**Mitigation:**
- Deploy behind reverse proxy with auth
- Use firewall rules to restrict access
- Implement API gateway with authentication
### No CORS
**Impact:**
- Browser-based clients blocked
- Can't call from web apps directly
**Mitigation:**
- Add CORS middleware (custom implementation)
- Use server-side proxy
- Deploy API on same origin as web app
### No Metrics
**Impact:**
- No visibility into usage patterns
- Can't track error rates
- No performance monitoring
**Mitigation:**
- Add Prometheus metrics (custom implementation)
- Use reverse proxy with metrics (e.g., nginx)
- Parse logs for analytics
### Naive Health Check
**Impact:**
- Health endpoint returns OK even if database down
- Monitoring systems can't detect database failures
**Mitigation:**
- Implement custom health check with database ping
- Monitor actual query endpoints (e.g., /lookup/track/test_id)
@@ -0,0 +1,626 @@
# Music Metadata API - Architecture
## Architectural Overview
Music Metadata API follows a clean 3-layer architecture with clear separation of concerns:
```
┌─────────────────────────────────────────────────────────────┐
│ HTTP Clients │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ API Layer (internal/api) │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Handlers │ │ Rate Limiter │ │ OpenAPI │ │
│ │ (routing) │ │ (middleware) │ │ (docs) │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ Database Layer (internal/db) │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Queries │ │ Enrichment │ │ Batch │ │
│ │ (SQL) │ │ (joins) │ │ Optimization │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ Models Layer (internal/models) │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Track │ │ Album │ │ Artist │ │
│ │ (struct) │ │ (struct) │ │ (struct) │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ SQLite Databases (read-only) │
│ ┌──────────────────────────┐ ┌──────────────────────────┐ │
│ │ main_database.sqlite3 │ │ track_files.sqlite3 │ │
│ │ (~117GB) │ │ (~99GB) │ │
│ └──────────────────────────┘ └──────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
```
## Directory Structure
```
music-metadata-api/
├── cmd/
│ └── server/
│ └── main.go # Entry point (62 lines)
├── internal/
│ ├── api/
│ │ ├── handlers.go # HTTP route handlers
│ │ ├── ratelimit.go # Token bucket rate limiter
│ │ └── openapi.go # OpenAPI spec + Swagger UI
│ │
│ ├── db/
│ │ └── db.go # Database layer (907 lines)
│ │
│ └── models/
│ └── models.go # Data structures (65 lines)
├── Dockerfile # Multi-stage build
├── docker-compose.yml # Production deployment
├── go.mod # Dependencies
├── go.sum # Dependency checksums
├── .gitignore # Excludes databases, binaries
└── .github/
└── workflows/
└── docker-publish.yml # CI/CD pipeline
```
## Layer Breakdown
### Entry Point: cmd/server/main.go
**Responsibilities:**
- Parse CLI flags (`-db`, `-addr`)
- Initialize database connections
- Set up HTTP router
- Configure graceful shutdown
- Start HTTP server
**Key code flow:**
```go
// 1. Parse flags
dbPath := flag.String("db", "", "path to database")
addr := flag.String("addr", ":8080", "server address")
// 2. Initialize database
database, err := db.NewDatabase(*dbPath)
// 3. Set up router with rate limiting
mux := http.NewServeMux()
rateLimiter := api.NewRateLimiter(100, 200) // 100 req/s, 200 burst
handler := rateLimiter.Limit(mux)
// 4. Register routes
api.RegisterRoutes(mux, database)
// 5. Graceful shutdown on SIGINT/SIGTERM
server := &http.Server{Addr: *addr, Handler: handler}
// ... shutdown logic with 10s timeout
```
**File size:** 62 lines (minimal, focused)
### API Layer: internal/api/
#### handlers.go
**Responsibilities:**
- Route registration
- Request parsing
- Response serialization
- Error handling
- Query parameter validation
**Route patterns (Go 1.22+ enhanced routing):**
```go
// Method + path patterns
mux.HandleFunc("POST /batch/lookup", handleBatchLookup)
mux.HandleFunc("GET /lookup/isrc/{isrc}", handleISRCLookup)
mux.HandleFunc("GET /lookup/track/{id}", handleTrackLookup)
mux.HandleFunc("GET /lookup/artist/{id}", handleArtistLookup)
mux.HandleFunc("GET /lookup/album/{id}", handleAlbumLookup)
mux.HandleFunc("GET /lookup/album/{id}/tracks", handleAlbumTracks)
mux.HandleFunc("GET /search/track", handleTrackSearch)
mux.HandleFunc("GET /search/artist", handleArtistSearch)
mux.HandleFunc("GET /health", handleHealth)
mux.HandleFunc("GET /docs", handleDocs)
mux.HandleFunc("GET /openapi.yaml", handleOpenAPI)
```
**Handler pattern:**
```go
func handleTrackLookup(w http.ResponseWriter, r *http.Request) {
// 1. Extract path parameter
id := r.PathValue("id")
// 2. Call database layer
track, err := db.GetTrack(id)
if err != nil {
http.Error(w, "Track not found", http.StatusNotFound)
return
}
// 3. Serialize response
w.Header().Set("Content-Type", "application/json")
json.NewEncoder(w).Encode(track)
}
```
**Validation rules:**
- Search queries: minimum 2 characters
- Batch requests: maximum 400 items
- Limit parameters: maximum 50 results
- Timeouts: 10 seconds for search queries
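The search-related rules above can be expressed as a small parsing step. This is an illustrative sketch, not the code in `handlers.go`: `parseSearchParams` is a hypothetical helper, the default of 10 comes from the API reference, and counting the minimum length in characters rather than bytes is an assumption.

```go
package main

import (
	"errors"
	"strconv"
	"unicode/utf8"
)

const (
	minQueryLen  = 2  // minimum search query length
	defaultLimit = 10 // used when no limit is supplied
	maxLimit     = 50 // maximum number of results
)

// parseSearchParams validates the raw q and limit query-string values.
// A non-nil error would map to 400 Bad Request.
func parseSearchParams(q, rawLimit string) (string, int, error) {
	// Assumption: the minimum is measured in characters, not bytes.
	if utf8.RuneCountInString(q) < minQueryLen {
		return "", 0, errors.New("query must be at least 2 characters")
	}
	if rawLimit == "" {
		return q, defaultLimit, nil
	}
	limit, err := strconv.Atoi(rawLimit)
	if err != nil || limit < 1 {
		return "", 0, errors.New("limit must be a positive integer")
	}
	if limit > maxLimit {
		return "", 0, errors.New("limit must not exceed 50")
	}
	return q, limit, nil
}
```

In a handler, the two inputs would come from `r.URL.Query().Get("q")` and `r.URL.Query().Get("limit")`.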
#### ratelimit.go
**Implementation:** Token bucket algorithm with per-IP tracking
**Data structure:**
```go
type RateLimiter struct {
visitors map[string]*rate.Limiter // IP -> limiter
mu sync.RWMutex // Protects visitors map
rate rate.Limit // Tokens per second
burst int // Burst capacity
}
```
**Algorithm:**
1. Extract client IP from `X-Forwarded-For` header (fallback to `RemoteAddr`)
2. Look up or create limiter for IP
3. Check if token available (`limiter.Allow()`)
4. If allowed, pass to next handler
5. If denied, return HTTP 429 with `Retry-After` header
**BUG:** Visitor map grows unbounded. No cleanup mechanism for inactive IPs. Long-running servers will accumulate memory.
**Configuration:**
- Rate: 100 requests/second
- Burst: 200 requests
- Scope: Per-IP (not per-user, no authentication)
#### openapi.go
**Responsibilities:**
- Serve OpenAPI 3.1 specification at `/openapi.yaml`
- Serve Swagger UI at `/docs`
- Embed OpenAPI spec in binary (no external files)
**Swagger UI loading:**
```html
<!-- Loaded from unpkg.com CDN (browser-side) -->
<script src="https://unpkg.com/swagger-ui-dist@5/swagger-ui-bundle.js"></script>
<link rel="stylesheet" href="https://unpkg.com/swagger-ui-dist@5/swagger-ui.css" />
```
**OpenAPI spec highlights:**
- Version: 3.1.0
- All endpoints documented
- Request/response schemas
- Example payloads
- Error responses
### Database Layer: internal/db/db.go
**File size:** 907 lines (largest file in codebase)
**Responsibilities:**
- SQLite connection management
- Query execution
- Data enrichment (joining related entities)
- Batch optimization
- Transaction handling (read-only)
#### Connection Management
**Dual database connections:**
```go
type Database struct {
mainDB *sql.DB // main_database.sqlite3
trackFilesDB *sql.DB // track_files.sqlite3
}
```
**Connection string PRAGMAs:**
```
file:/path/to/db.sqlite3?mode=ro&_journal_mode=off&_cache_size=-64000&_mmap_size=1073741824&_query_only=true
```
**PRAGMA breakdown:**
| PRAGMA | Value | Purpose |
|--------|-------|---------|
| `mode=ro` | Read-only | Prevents accidental writes |
| `_journal_mode=off` | Disabled | No rollback journal (safe because the database is never written) |
| `_cache_size=-64000` | ~64MB | Page cache size (a negative value is a size in KiB) |
| `_mmap_size=1073741824` | 1GB | Memory-mapped I/O size |
| `_query_only=true` | Enabled | Additional read-only enforcement |
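For illustration, the connection string can be assembled from a file path as follows. `readOnlyDSN` is a hypothetical helper; the parameter names are copied from the table above and are driver-specific, and the sketch assumes the path contains no characters that need URI escaping.

```go
package main

import "strings"

// readOnlyDSN assembles the read-only connection string for a database file.
// Assumption: path needs no URI escaping (no '?', '#', or '%').
func readOnlyDSN(path string) string {
	params := []string{
		"mode=ro",               // open read-only
		"_journal_mode=off",     // no rollback journal
		"_cache_size=-64000",    // page cache of about 64MB
		"_mmap_size=1073741824", // 1GB of memory-mapped I/O
		"_query_only=true",      // extra read-only enforcement
	}
	return "file:" + path + "?" + strings.Join(params, "&")
}
```

The same helper can be applied to both `main_database.sqlite3` and `track_files.sqlite3`, so the two connections share identical settings.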
**Connection pool:**
```go
db.SetMaxOpenConns(8) // Conservative limit
db.SetMaxIdleConns(8) // Keep connections warm
db.SetConnMaxLifetime(0) // No expiration
```
#### Query Patterns
**Individual lookups:**
```go
func (d *Database) GetTrack(id string) (*models.Track, error) {
// 1. Fetch base track + album
row := d.mainDB.QueryRow(`
SELECT t.id, t.name, t.isrc, t.duration_ms, t.explicit,
t.track_number, t.disc_number, t.popularity, t.preview_url,
a.id, a.name, a.album_type, a.label, a.release_date,
a.release_date_precision, a.external_id_upc, a.total_tracks
FROM tracks t
JOIN albums a ON t.album_rowid = a.rowid
WHERE t.id = ?
`, id)
// 2. Enrich album (images, artists)
d.enrichAlbum(&track.Album)
// 3. Enrich track (artists, track_files)
d.enrichTrack(&track)
return &track, nil
}
```
**Batch lookups:**
```go
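func (d *Database) BatchGetByISRC(isrcs []string) (map[string]*models.Track, error) {
	tracks := make(map[string]*models.Track, len(isrcs))
	if len(isrcs) == 0 {
		return tracks, nil // strings.Repeat panics on a negative count
	}
	// 1. Build IN clause
	placeholders := strings.Repeat("?,", len(isrcs)-1) + "?"
	query := fmt.Sprintf(`
		SELECT t.id, t.isrc, ...
		FROM tracks t
		JOIN albums a ON t.album_rowid = a.rowid
		WHERE t.isrc IN (%s)
	`, placeholders)
	// 2. Execute batch query (a []string must be converted to []any for Query)
	args := make([]any, len(isrcs))
	for i, isrc := range isrcs {
		args[i] = isrc
	}
	rows, err := d.mainDB.Query(query, args...)
	if err != nil {
		return nil, err
	}
	defer rows.Close()
	// ... scan rows into tracks (elided) ...
	// 3. Collect track IDs for enrichment
	trackIDs := make([]string, 0, len(tracks))
	albumIDs := make([]string, 0, len(tracks))
	// 4. Batch enrich all entities
	d.batchEnrichAlbums(albumIDs, tracks)
	d.batchEnrichTracks(trackIDs, tracks)
	return tracks, nil
}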
```
#### Data Enrichment Flow
**Track enrichment pipeline:**
```
1. Fetch base track + album (single JOIN)
2. Enrich album:
- Batch fetch album images (batchGetAlbumImages)
- Batch fetch album artists (batchGetAlbumArtists)
3. Enrich track:
- Batch fetch track artists (batchGetTrackArtists)
- Batch fetch track files (batchEnrichTrackFiles)
4. Enrich artists:
- Batch fetch artist genres (batchGetArtistGenres)
- Batch fetch artist images (batchGetArtistImages)
5. Return fully enriched track
```
**Batch optimization functions:**
| Function | Purpose | Query Pattern |
|----------|---------|---------------|
| `batchGetAlbumImages` | Fetch all images for albums | `WHERE album_id IN (...)` |
| `batchGetAlbumArtists` | Fetch all artists for albums | `WHERE album_id IN (...)` |
| `batchGetTrackArtists` | Fetch all artists for tracks | `WHERE track_id IN (...)` |
| `batchGetArtistGenres` | Fetch all genres for artists | `WHERE artist_id IN (...)` |
| `batchGetArtistImages` | Fetch all images for artists | `WHERE artist_id IN (...)` |
| `batchEnrichTrackFiles` | Fetch extended track data | `WHERE track_id IN (...)` |
**Why batch optimization matters:**
- Single batch request with 400 tracks triggers ~6 batch queries
- Without batching: 400 tracks × 6 queries = 2,400 database queries
- With batching: 1 main query + 6 batch queries = 7 database queries
- **Performance gain: 343x fewer queries**
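The reason one `IN (...)` query can replace hundreds of per-item queries is that its rows are grouped back onto their parents in memory. A minimal sketch of that grouping step follows. `albumImageRow` and `groupByAlbum` are hypothetical names, and the row shape is an assumption about what a query such as `batchGetAlbumImages` returns.

```go
package main

// Image mirrors the documented image object.
type Image struct {
	URL    string
	Width  int
	Height int
}

// albumImageRow is a hypothetical row shape for a query such as
// SELECT album_id, url, width, height ... WHERE album_id IN (...).
type albumImageRow struct {
	AlbumID string
	Image   Image
}

// groupByAlbum distributes the rows of one batch query to their albums,
// preserving row order within each album.
func groupByAlbum(rows []albumImageRow) map[string][]Image {
	grouped := make(map[string][]Image)
	for _, row := range rows {
		grouped[row.AlbumID] = append(grouped[row.AlbumID], row.Image)
	}
	return grouped
}
```

Each album is then enriched with a map lookup instead of its own database round trip, which is where the reduction from 2,400 queries to 7 comes from.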
#### Search Implementation
**Track search:**
```sql
SELECT id, name, isrc, duration_ms, popularity, album_rowid
FROM tracks
WHERE name LIKE ? COLLATE NOCASE
ORDER BY popularity DESC
LIMIT ?
```
**Artist search:**
```sql
SELECT id, name, followers_total, popularity
FROM artists
WHERE name LIKE ? COLLATE NOCASE
ORDER BY followers_total DESC
LIMIT ?
```
**Search characteristics:**
- Pattern: `%query%` (substring match)
- Collation: `NOCASE` (case-insensitive)
- Timeout: 10 seconds (context deadline)
- Min query length: 2 characters
- Max results: 50
**Performance concern:** The `%query%` pattern begins with a wildcard, so SQLite cannot use a B-tree index on `name` and has to scan the whole table. On 256M tracks this is slow. A full-text index (FTS, for example SQLite's FTS5) would be faster but is not implemented.
### Models Layer: internal/models/models.go
**File size:** 65 lines (smallest layer)
**Responsibilities:**
- Define data structures
- JSON serialization tags
- Nested relationships
**Core models:**
```go
type Track struct {
ID string `json:"id"`
Name string `json:"name"`
ISRC string `json:"isrc,omitempty"`
DurationMs int `json:"duration_ms"`
Explicit bool `json:"explicit"`
TrackNumber int `json:"track_number"`
DiscNumber int `json:"disc_number"`
Popularity int `json:"popularity"`
PreviewURL string `json:"preview_url,omitempty"`
Album Album `json:"album"`
Artists []Artist `json:"artists"`
// Extended fields from track_files DB
OriginalTitle string `json:"original_title,omitempty"`
VersionTitle string `json:"version_title,omitempty"`
HasLyrics bool `json:"has_lyrics"`
Languages []string `json:"languages,omitempty"`
ArtistRoles map[string][]string `json:"artist_roles,omitempty"`
}
type Album struct {
ID string `json:"id"`
Name string `json:"name"`
AlbumType string `json:"album_type"`
Label string `json:"label,omitempty"`
ReleaseDate string `json:"release_date"`
ReleaseDatePrecision string `json:"release_date_precision"`
ExternalIDUPC string `json:"external_id_upc,omitempty"`
TotalTracks int `json:"total_tracks"`
CopyrightC string `json:"copyright_c,omitempty"`
CopyrightP string `json:"copyright_p,omitempty"`
Images []Image `json:"images,omitempty"`
Artists []Artist `json:"artists,omitempty"`
}
type Artist struct {
ID string `json:"id"`
Name string `json:"name"`
FollowersTotal int `json:"followers_total,omitempty"`
Popularity int `json:"popularity,omitempty"`
Genres []string `json:"genres,omitempty"`
Images []Image `json:"images,omitempty"`
}
type Image struct {
URL string `json:"url"`
Width int `json:"width"`
Height int `json:"height"`
}
```
**Batch request/response models:**
```go
type BatchRequest struct {
Tracks []string `json:"tracks,omitempty"` // Track IDs
Artists []string `json:"artists,omitempty"` // Artist IDs
Albums []string `json:"albums,omitempty"` // Album IDs
ISRCs []string `json:"isrcs,omitempty"` // ISRC codes
}
type BatchResponse struct {
Tracks map[string]*Track `json:"tracks,omitempty"`
Artists map[string]*Artist `json:"artists,omitempty"`
Albums map[string]*Album `json:"albums,omitempty"`
ISRCs map[string]*Track `json:"isrcs,omitempty"`
}
```
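The batch constraints listed for this endpoint (at most 400 items across all arrays, at least one item, duplicates collapsed) can be sketched as a small validation step. This is a minimal illustration rather than the project's actual code: `validateBatch` and `dedupe` are hypothetical helper names, and whether the 400-item limit is counted before or after deduplication is an assumption.

```go
package main

import (
	"errors"
	"fmt"
)

// BatchRequest mirrors the request model shown above.
type BatchRequest struct {
	Tracks  []string `json:"tracks,omitempty"`
	Artists []string `json:"artists,omitempty"`
	Albums  []string `json:"albums,omitempty"`
	ISRCs   []string `json:"isrcs,omitempty"`
}

const maxBatchItems = 400

// validateBatch (hypothetical) rejects empty requests and requests whose
// combined item count exceeds maxBatchItems. Counting happens before
// deduplication here, which is an assumption.
func validateBatch(req BatchRequest) error {
	total := len(req.Tracks) + len(req.Artists) + len(req.Albums) + len(req.ISRCs)
	if total == 0 {
		return errors.New("at least one ID or ISRC is required")
	}
	if total > maxBatchItems {
		return fmt.Errorf("too many items: %d (max %d)", total, maxBatchItems)
	}
	return nil
}

// dedupe (hypothetical) removes repeated IDs while keeping first-seen order.
func dedupe(ids []string) []string {
	seen := make(map[string]struct{}, len(ids))
	out := make([]string, 0, len(ids))
	for _, id := range ids {
		if _, ok := seen[id]; ok {
			continue
		}
		seen[id] = struct{}{}
		out = append(out, id)
	}
	return out
}

func main() {
	req := BatchRequest{Tracks: []string{"id1", "id2", "id1"}}
	fmt.Println(validateBatch(req), dedupe(req.Tracks))
}
```

Because the response types are maps keyed by ID, duplicates would collapse in the response anyway; deduplicating up front mainly keeps the `IN` clause short.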
## Request Flow
### Example: GET /lookup/track/{id}
```
1. Client Request
GET /lookup/track/abc123
2. Rate Limiter Middleware
- Extract IP from X-Forwarded-For (fallback: RemoteAddr)
- Check token bucket for IP
- If allowed, continue; else return 429
3. HTTP Handler (api/handlers.go)
- Extract "abc123" from path
- Call db.GetTrack("abc123")
4. Database Layer (db/db.go)
- Query track + album (single JOIN)
- Enrich album (images, artists)
- Enrich track (artists, track_files)
- Enrich artists (genres, images)
5. Models Layer (models/models.go)
- Populate Track struct
- Nest Album, Artists
6. HTTP Handler
- Serialize Track to JSON
- Set Content-Type: application/json
- Write response
7. Client Response
200 OK
{
"id": "abc123",
"name": "Song Title",
"album": {...},
"artists": [...]
}
```
### Example: POST /batch/lookup
```
1. Client Request
POST /batch/lookup
{
"isrcs": ["USRC12345678", "GBUM71234567", ...], // Up to 400
"tracks": ["id1", "id2", ...]
}
2. Rate Limiter Middleware
- Single request counts as 1 token (not 400)
3. HTTP Handler
- Parse BatchRequest
- Validate: max 400 items total
- Call db.BatchGetByISRC(isrcs)
- Call db.BatchGetTracks(trackIDs)
4. Database Layer
- Build IN clause for ISRCs
- Execute batch query (1 query for all ISRCs)
- Collect all track/album/artist IDs
- Batch enrich all entities (6 batch queries)
5. HTTP Handler
- Build BatchResponse with maps
- Serialize to JSON
6. Client Response
200 OK
{
"isrcs": {
"USRC12345678": {...},
"GBUM71234567": {...}
},
"tracks": {
"id1": {...},
"id2": {...}
}
}
```
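The "Build IN clause" step in the database layer can be sketched as follows. The function names and the selected columns are illustrative assumptions; the point is that only the number of `?` placeholders varies, while the values themselves are always bound as arguments.

```go
package main

import (
	"fmt"
	"strings"
)

// placeholders returns n comma-separated "?" markers, e.g. "?,?,?" for n = 3.
func placeholders(n int) string {
	if n <= 0 {
		return ""
	}
	return strings.Repeat("?,", n-1) + "?"
}

// batchByISRCQuery builds one query for a whole ISRC batch.
// Callers should skip the query entirely when n is 0.
func batchByISRCQuery(n int) string {
	return "SELECT id, name, isrc FROM tracks WHERE isrc IN (" + placeholders(n) + ")"
}

// toArgs converts IDs to the []any slice that database/sql expects.
func toArgs(ids []string) []any {
	args := make([]any, len(ids))
	for i, id := range ids {
		args[i] = id
	}
	return args
}

func main() {
	isrcs := []string{"USRC12345678", "GBUM71234567"}
	fmt.Println(batchByISRCQuery(len(isrcs)))
	fmt.Println(len(toArgs(isrcs)), "bound arguments")
}
```

With the 400-item cap, a single statement never needs more than 400 bound parameters, which stays under SQLite's default limit on bound variables.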
## Graceful Shutdown
**Signal handling:**
```go
// Listen for SIGINT (Ctrl+C) and SIGTERM (Docker stop)
sigChan := make(chan os.Signal, 1)
signal.Notify(sigChan, os.Interrupt, syscall.SIGTERM)
// Block until signal received
<-sigChan
// Shutdown with 10-second timeout
ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
defer cancel()
// Stop accepting new requests and wait for in-flight ones to finish
if err := server.Shutdown(ctx); err != nil {
	slog.Error("Graceful shutdown failed", "error", err)
}
```
**Shutdown sequence:**
1. Receive SIGINT or SIGTERM
2. Stop accepting new connections
3. Wait for in-flight requests (max 10 seconds)
4. Close database connections
5. Exit process
## No Framework Philosophy
Music Metadata API uses **zero web frameworks**. Apart from the SQLite driver and the `golang.org/x/time` rate limiter, everything is Go stdlib:
**Routing:** Go 1.22+ enhanced `http.ServeMux`
- Method-specific routes: `GET /path`, `POST /path`
- Path parameters: `/lookup/track/{id}`
- No regex constraints on path segments (simple `{name}` wildcards only)
**JSON:** `encoding/json` stdlib
- `json.NewEncoder(w).Encode(data)` for responses
- `json.NewDecoder(r.Body).Decode(&req)` for requests
**HTTP Server:** `net/http` stdlib
- `http.Server` with custom `Addr` and `Handler`
- No middleware framework (custom rate limiter)
**Database:** `database/sql` stdlib
- `modernc.org/sqlite` driver (pure Go, no CGO)
- Raw SQL queries (no ORM)
**Logging:** `log/slog` stdlib
- Structured logging for errors
- No log levels (all logs are errors)
**Benefits:**
- Minimal dependencies (2 external packages)
- No framework lock-in
- Easy to understand (no magic)
- Fast compilation
- Small binary size
**Tradeoffs:**
- More boilerplate (manual error handling)
- No built-in middleware chain
- Manual query building (no ORM)
- No automatic validation
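The missing middleware chain is one of the cheaper tradeoffs to offset, since a chain helper takes only a few lines of stdlib code. The sketch below is an assumption about how it could look, not something present in the codebase; `chain` and `tag` are hypothetical names.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// Middleware wraps a handler with extra behavior.
type Middleware func(http.Handler) http.Handler

// chain applies middlewares so that the first one listed runs outermost.
func chain(h http.Handler, mws ...Middleware) http.Handler {
	for i := len(mws) - 1; i >= 0; i-- {
		h = mws[i](h)
	}
	return h
}

// tag returns a middleware that records its name when a request passes through.
func tag(name string, trace *[]string) Middleware {
	return func(next http.Handler) http.Handler {
		return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			*trace = append(*trace, name)
			next.ServeHTTP(w, r)
		})
	}
}

// traceOrder sends one request through the chain and reports the call order.
func traceOrder() string {
	var trace []string
	final := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		trace = append(trace, "handler")
	})
	h := chain(final, tag("ratelimit", &trace), tag("logging", &trace))
	h.ServeHTTP(httptest.NewRecorder(), httptest.NewRequest("GET", "/health", nil))
	return strings.Join(trace, " -> ")
}

func main() {
	fmt.Println(traceOrder())
}
```

With a helper like this, the existing rate limiter's `Limit` method already has the right shape to be passed as one of the middlewares.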
## Performance Characteristics
**Strengths:**
- Read-only databases (no write locks)
- Connection pooling (8 connections)
- Memory-mapped I/O (1GB mmap)
- Batch optimization (343x fewer queries)
- Conservative cache (64MB)
**Bottlenecks:**
- Search queries (LIKE %query% on 256M rows)
- Rate limiter memory leak (unbounded map)
- No query result caching
- No CDN for image URLs
**Scalability:**
- Horizontal: Run multiple instances (read-only safe)
- Vertical: Limited by disk I/O (SQLite's single-writer model is not a factor for this read-only workload)
- Database size: 216GB requires SSD for acceptable performance
@@ -0,0 +1,945 @@
# Music Metadata API - Codebase Analysis
## Codebase Overview
Music Metadata API is a small, focused Go codebase with minimal complexity:
**Total lines of code:** ~1,100 lines (excluding tests, which don't exist)
**File breakdown:**
- `cmd/server/main.go` - 62 lines (entry point)
- `internal/db/db.go` - 907 lines (database layer, largest file)
- `internal/models/models.go` - 65 lines (data structures)
- `internal/api/handlers.go` - ~150 lines (HTTP handlers)
- `internal/api/ratelimit.go` - ~80 lines (rate limiting)
- `internal/api/openapi.go` - ~100 lines (OpenAPI spec)
**Characteristics:**
- No web framework (stdlib only)
- No ORM (raw SQL)
- No test files (zero test coverage)
- No configuration files (CLI flags only)
- Minimal dependencies (2 external packages)
## Configuration
### CLI Flags
**Defined in:** `cmd/server/main.go`
```go
var (
dbPath = flag.String("db", "", "path to database file (required)")
addr = flag.String("addr", ":8080", "HTTP server address")
)
```
**Usage:**
```bash
./metadata-api -db /data/main_database.sqlite3 -addr :8080
```
**Limitations:**
- Only 2 configurable parameters
- No environment variable support
- No configuration file support
- All timeouts hardcoded
- All limits hardcoded
### Hardcoded Configuration
**Timeouts:**
```go
// Graceful shutdown timeout
shutdownTimeout := 10 * time.Second
// Search query timeout
ctx, cancel := context.WithTimeout(r.Context(), 10*time.Second)
```
**Rate limiting:**
```go
// Hardcoded in api/ratelimit.go
rateLimiter := NewRateLimiter(100, 200) // 100 req/s, 200 burst
```
**Database connection pool:**
```go
// Hardcoded in db/db.go
db.SetMaxOpenConns(8)
db.SetMaxIdleConns(8)
db.SetConnMaxLifetime(0)
```
**Search limits:**
```go
// Hardcoded in api/handlers.go
const (
minQueryLength = 2
maxSearchLimit = 50
defaultLimit = 10
)
```
**Batch limits:**
```go
// Hardcoded in api/handlers.go
const maxBatchItems = 400
```
**SQLite PRAGMAs:**
```go
// Hardcoded in db/db.go
dsn := fmt.Sprintf("file:%s?mode=ro&_journal_mode=off&_cache_size=-64000&_mmap_size=1073741824&_query_only=true", dbPath)
```
**Recommendation:** Extract to configuration struct for flexibility.
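One possible shape for such a configuration struct is sketched below. The `Config` type, its field names, and `DefaultConfig` are hypothetical; the default values and the DSN format are the hardcoded ones listed above.

```go
package main

import (
	"fmt"
	"time"
)

// Config collects the values that are hardcoded today.
// The struct and its field names are hypothetical.
type Config struct {
	ShutdownTimeout time.Duration
	SearchTimeout   time.Duration
	RateLimit       float64 // requests per second
	RateBurst       int
	MaxOpenConns    int
	MaxBatchItems   int
	MaxSearchLimit  int
	CacheSizeKB     int
	MmapSizeBytes   int64
}

// DefaultConfig returns the current hardcoded values.
func DefaultConfig() Config {
	return Config{
		ShutdownTimeout: 10 * time.Second,
		SearchTimeout:   10 * time.Second,
		RateLimit:       100,
		RateBurst:       200,
		MaxOpenConns:    8,
		MaxBatchItems:   400,
		MaxSearchLimit:  50,
		CacheSizeKB:     64000,
		MmapSizeBytes:   1 << 30,
	}
}

// DSN rebuilds the connection string shown above from the configured sizes.
func (c Config) DSN(dbPath string) string {
	return fmt.Sprintf(
		"file:%s?mode=ro&_journal_mode=off&_cache_size=-%d&_mmap_size=%d&_query_only=true",
		dbPath, c.CacheSizeKB, c.MmapSizeBytes,
	)
}

func main() {
	cfg := DefaultConfig()
	cfg.CacheSizeKB = 128000 // example override, e.g. from a flag
	fmt.Println(cfg.DSN("/data/main_database.sqlite3"))
}
```

Starting from `DefaultConfig()` and overriding individual fields from flags or environment variables keeps today's behavior unchanged when nothing is set.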
### Environment Variables
**docker-compose.yml defines:**
```yaml
environment:
- LOG_LEVEL=info
```
**BUG:** `LOG_LEVEL` is not used in code. No log level control implemented.
**Expected behavior:** Filter logs by level (debug, info, warn, error)
**Actual behavior:** All logs output (no filtering)
**Fix required:**
```go
// Add to main.go
level := slog.LevelInfo // Default when LOG_LEVEL is unset or unrecognized
switch strings.ToLower(os.Getenv("LOG_LEVEL")) {
case "debug":
	level = slog.LevelDebug
case "warn":
	level = slog.LevelWarn
case "error":
	level = slog.LevelError
}
// Install the handler as the default so package-level slog calls honor the level
slog.SetDefault(slog.New(slog.NewJSONHandler(os.Stdout, &slog.HandlerOptions{Level: level})))
```
## Logging
### Implementation
**Package:** Go stdlib `log/slog` (structured logging)
**Usage pattern:**
```go
slog.Error("Database query failed", "error", err, "query", query)
```
**Output format:**
```json
{"time":"2024-01-15T10:30:00Z","level":"ERROR","msg":"Database query failed","error":"no such table","query":"SELECT * FROM tracks"}
```
### Logging Locations
**Error logging only:**
- Database query failures
- JSON decode errors
- HTTP handler errors
- Graceful shutdown errors
**No info/debug logging:**
- Request logging (no access logs)
- Query execution logging
- Performance metrics
- Startup messages
**Example from db.go:**
```go
rows, err := d.mainDB.Query(query, args...)
if err != nil {
slog.Error("Query failed", "error", err, "query", query)
return nil, err
}
```
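The missing access log could be added as a small stdlib middleware. The following is a sketch under assumptions: `accessLog`, `statusRecorder`, and the logged field names are hypothetical, and `loggedFields` exists only to demonstrate the output.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"log/slog"
	"net/http"
	"net/http/httptest"
	"time"
)

// statusRecorder captures the status code written by the wrapped handler.
type statusRecorder struct {
	http.ResponseWriter
	status int
}

func (s *statusRecorder) WriteHeader(code int) {
	s.status = code
	s.ResponseWriter.WriteHeader(code)
}

// accessLog emits one info-level line per request.
func accessLog(logger *slog.Logger, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}
		next.ServeHTTP(rec, r)
		logger.Info("request",
			"method", r.Method,
			"path", r.URL.Path,
			"status", rec.status,
			"duration_ms", time.Since(start).Milliseconds(),
		)
	})
}

// loggedFields runs one request through accessLog and returns the path and
// status found in the emitted JSON log line.
func loggedFields(path string, status int) (string, int) {
	var buf bytes.Buffer
	logger := slog.New(slog.NewJSONHandler(&buf, nil))
	h := accessLog(logger, http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(status)
	}))
	h.ServeHTTP(httptest.NewRecorder(), httptest.NewRequest("GET", path, nil))
	var entry struct {
		Path   string `json:"path"`
		Status int    `json:"status"`
	}
	if err := json.Unmarshal(buf.Bytes(), &entry); err != nil {
		return "", 0
	}
	return entry.Path, entry.Status
}

func main() {
	fmt.Println(loggedFields("/lookup/track/abc123", 404))
}
```

Because the line is logged at info level, it only becomes useful once log level filtering is actually wired up.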
### Log Level Control
**Current:** No log level filtering (all logs output)
**Missing:**
- Debug logs (query details, timing)
- Info logs (startup, shutdown, requests)
- Warn logs (rate limiting, slow queries)
**Recommendation:** Implement log level control via environment variable.
## Health Checks
### Naive Implementation
**Endpoint:** `GET /health`
**Code:**
```go
func handleHealth(w http.ResponseWriter, r *http.Request) {
w.Header().Set("Content-Type", "application/json")
json.NewEncoder(w).Encode(map[string]string{"status": "ok"})
}
```
**Response:**
```json
{"status":"ok"}
```
**Problem:** Always returns 200 OK, even if database is unreachable.
**Test:**
```bash
# Stop database (simulate failure)
mv /data/main_database.sqlite3 /data/main_database.sqlite3.bak
# Health check still returns OK
curl http://localhost:8080/health
# {"status":"ok"}
# But actual queries fail
curl http://localhost:8080/lookup/track/abc123
# 500 Internal Server Error
```
### Improved Health Check
**Recommendation:**
```go
func handleHealth(db *sql.DB) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		// Headers must be set before WriteHeader is called
		w.Header().Set("Content-Type", "application/json")

		ctx, cancel := context.WithTimeout(r.Context(), 2*time.Second)
		defer cancel()

		// Ping database
		if err := db.PingContext(ctx); err != nil {
			w.WriteHeader(http.StatusServiceUnavailable)
			json.NewEncoder(w).Encode(map[string]string{
				"status": "unhealthy",
				"error":  "database unavailable",
			})
			return
		}

		// Optional: test query. Fetch a single row; COUNT(*) would scan
		// the whole table and exceed the 2-second timeout.
		var one int
		err := db.QueryRowContext(ctx, "SELECT 1 FROM tracks LIMIT 1").Scan(&one)
		if err != nil {
			w.WriteHeader(http.StatusServiceUnavailable)
			json.NewEncoder(w).Encode(map[string]string{
				"status": "unhealthy",
				"error":  "database query failed",
			})
			return
		}

		json.NewEncoder(w).Encode(map[string]string{"status": "ok"})
	}
}
```
## Rate Limiting
### Implementation
**File:** `internal/api/ratelimit.go`
**Algorithm:** Token bucket per IP
**Data structure:**
```go
type RateLimiter struct {
visitors map[string]*rate.Limiter // IP -> limiter
mu sync.RWMutex // Protects visitors map
rate rate.Limit // Tokens per second
burst int // Burst capacity
}
```
**Middleware:**
```go
func (rl *RateLimiter) Limit(next http.Handler) http.Handler {
return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
// Extract IP
ip := getIP(r)
// Get or create limiter for IP
limiter := rl.getLimiter(ip)
// Check if allowed
if !limiter.Allow() {
w.Header().Set("Retry-After", "1")
http.Error(w, "Rate limit exceeded", http.StatusTooManyRequests)
return
}
next.ServeHTTP(w, r)
})
}
```
**IP extraction:**
```go
func getIP(r *http.Request) string {
// Check X-Forwarded-For header (proxy/load balancer)
forwarded := r.Header.Get("X-Forwarded-For")
if forwarded != "" {
// Take first IP if comma-separated
ips := strings.Split(forwarded, ",")
return strings.TrimSpace(ips[0])
}
// Fallback to RemoteAddr
ip, _, _ := net.SplitHostPort(r.RemoteAddr)
return ip
}
```
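Because `getIP` trusts `X-Forwarded-For` unconditionally, any client can pick its own rate-limit key by sending that header. One common mitigation is to honor the header only when the direct peer is a known proxy, and then to take the rightmost address that is not itself a trusted proxy. The sketch below assumes a hypothetical list of trusted proxy prefixes; `clientIP`, `isTrusted`, and `resolve` are illustrative names, not part of the codebase.

```go
package main

import (
	"fmt"
	"net"
	"net/http"
	"net/netip"
	"strings"
)

// isTrusted reports whether addr falls inside any trusted proxy prefix.
func isTrusted(addr netip.Addr, trusted []netip.Prefix) bool {
	for _, p := range trusted {
		if p.Contains(addr) {
			return true
		}
	}
	return false
}

// clientIP returns the address to rate-limit on. X-Forwarded-For is consulted
// only when the direct peer is a trusted proxy; the header is then scanned
// from the right, skipping trusted proxies.
func clientIP(r *http.Request, trusted []netip.Prefix) string {
	host, _, err := net.SplitHostPort(r.RemoteAddr)
	if err != nil {
		host = r.RemoteAddr
	}
	peer, err := netip.ParseAddr(host)
	if err != nil || !isTrusted(peer, trusted) {
		return host // direct client: ignore any forwarded header
	}
	parts := strings.Split(r.Header.Get("X-Forwarded-For"), ",")
	for i := len(parts) - 1; i >= 0; i-- {
		addr, err := netip.ParseAddr(strings.TrimSpace(parts[i]))
		if err != nil {
			break // malformed or empty entry: fall back to the peer address
		}
		if !isTrusted(addr, trusted) {
			return addr.String()
		}
	}
	return host
}

// resolve is a demo helper that builds a request and resolves its client IP.
func resolve(remoteAddr, forwardedFor, trustedCIDR string) string {
	r := &http.Request{RemoteAddr: remoteAddr, Header: http.Header{}}
	if forwardedFor != "" {
		r.Header.Set("X-Forwarded-For", forwardedFor)
	}
	return clientIP(r, []netip.Prefix{netip.MustParsePrefix(trustedCIDR)})
}

func main() {
	fmt.Println(resolve("203.0.113.9:5000", "1.2.3.4", "10.0.0.0/8"))
	fmt.Println(resolve("10.0.0.5:5000", "1.2.3.4, 198.51.100.7", "10.0.0.0/8"))
}
```

This also limits growth of the visitor map, since spoofed header values no longer create new entries.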
### Memory Leak
**Problem:** Visitor map grows unbounded. No cleanup for inactive IPs.
**Code:**
```go
func (rl *RateLimiter) getLimiter(ip string) *rate.Limiter {
rl.mu.Lock()
defer rl.mu.Unlock()
limiter, exists := rl.visitors[ip]
if !exists {
limiter = rate.NewLimiter(rl.rate, rl.burst)
rl.visitors[ip] = limiter // BUG: Never removed
}
return limiter
}
```
**Impact:**
- Long-running servers accumulate IPs
- Memory usage grows over time
- No expiration for inactive IPs
**Example:**
- 1 million unique IPs over 1 month
- ~100 bytes per limiter
- ~100MB memory leak
**Fix:**
```go
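type visitor struct {
	limiter  *rate.Limiter
	lastSeen time.Time
}

// RateLimiter.visitors must become map[string]*visitor for this fix.

func (rl *RateLimiter) getLimiter(ip string) *rate.Limiter {
	rl.mu.Lock()
	defer rl.mu.Unlock()
	v, exists := rl.visitors[ip]
	if !exists {
		v = &visitor{limiter: rate.NewLimiter(rl.rate, rl.burst)}
		rl.visitors[ip] = v
	}
	v.lastSeen = time.Now() // Refresh on every request so active IPs survive cleanup
	return v.limiter
}

func (rl *RateLimiter) cleanup() {
	ticker := time.NewTicker(1 * time.Hour)
	defer ticker.Stop()
	for range ticker.C {
		rl.mu.Lock()
		for ip, v := range rl.visitors {
			// Remove visitors inactive for 24 hours
			if time.Since(v.lastSeen) > 24*time.Hour {
				delete(rl.visitors, ip)
			}
		}
		rl.mu.Unlock()
	}
}

// Start the cleanup goroutine in NewRateLimiter:
//     go rl.cleanup()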
```
### Rate Limit Configuration
**Current:** Hardcoded (100 req/s, 200 burst)
**Recommendation:** Make configurable via CLI flags or environment variables.
```go
// CLI flags
var (
rateLimit = flag.Int("rate-limit", 100, "requests per second")
rateBurst = flag.Int("rate-burst", 200, "burst capacity")
)
// Usage
rateLimiter := api.NewRateLimiter(rate.Limit(*rateLimit), *rateBurst)
```
## Search Implementation
### Query Pattern
**Track search:**
```go
query := `
SELECT id, name, isrc, duration_ms, popularity, album_rowid
FROM tracks
WHERE name LIKE ? COLLATE NOCASE
ORDER BY popularity DESC
LIMIT ?
`
args := []interface{}{"%" + searchQuery + "%", limit}
```
**Artist search:**
```go
query := `
SELECT id, name, followers_total, popularity
FROM artists
WHERE name LIKE ? COLLATE NOCASE
ORDER BY followers_total DESC
LIMIT ?
`
args := []interface{}{"%" + searchQuery + "%", limit}
```
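Both queries place the raw search string between `%` wildcards, so a `%` or `_` typed by the user is itself treated as a wildcard. A minimal sketch of escaping the input is shown below; it assumes the SQL is changed to `name LIKE ? ESCAPE '\'`, and `containsPattern` is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"strings"
)

// likeEscaper escapes the escape character itself and both LIKE wildcards.
var likeEscaper = strings.NewReplacer(`\`, `\\`, `%`, `\%`, `_`, `\_`)

// containsPattern builds a "contains" pattern for use with:
//   WHERE name LIKE ? ESCAPE '\'
func containsPattern(userInput string) string {
	return "%" + likeEscaper.Replace(userInput) + "%"
}

func main() {
	fmt.Println(containsPattern("100%_live"))
}
```

Escaping does not make the search faster; it only ensures the query matches what the user literally typed.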
### Performance Characteristics
**LIKE %query% problems:**
- Can't use indexes (full table scan)
- Slow on 256M rows
- CPU-intensive (string matching)
**Benchmark (estimated):**
- Common query ("love"): 5-10 seconds
- Specific query ("bohemian rhapsody"): 1-2 seconds
- Rare query ("xyzabc"): 10+ seconds (full scan)
**10-second timeout:**
```go
ctx, cancel := context.WithTimeout(r.Context(), 10*time.Second)
defer cancel()
rows, err := db.QueryContext(ctx, query, args...)
if errors.Is(err, context.DeadlineExceeded) { // errors.Is also matches wrapped errors
http.Error(w, "Search timeout", http.StatusGatewayTimeout)
return
}
```
### Search Validation
**Minimum query length:**
```go
if len(searchQuery) < 2 {
http.Error(w, "Query must be at least 2 characters", http.StatusBadRequest)
return
}
```
**Maximum limit:**
```go
if limit > 50 {
http.Error(w, "Limit cannot exceed 50", http.StatusBadRequest)
return
}
```
**Default limit:**
```go
limit := 10
if limitParam := r.URL.Query().Get("limit"); limitParam != "" {
// A non-numeric value silently yields limit = 0 because the error is discarded
limit, _ = strconv.Atoi(limitParam)
}
```
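The three limit checks above could be folded into a single helper that also handles non-numeric input. This is a sketch, not the project's code: `parseLimit` is a hypothetical name, and rejecting values below 1 is an assumption, since the source only documents the upper bound.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
)

const (
	defaultLimit   = 10
	maxSearchLimit = 50
)

var errBadLimit = errors.New("limit must be an integer between 1 and 50")

// parseLimit applies the default for an empty parameter and rejects
// non-numeric or out-of-range values instead of silently using 0.
func parseLimit(raw string) (int, error) {
	if raw == "" {
		return defaultLimit, nil
	}
	n, err := strconv.Atoi(raw)
	if err != nil || n < 1 || n > maxSearchLimit {
		return 0, errBadLimit
	}
	return n, nil
}

func main() {
	for _, raw := range []string{"", "25", "51", "abc"} {
		n, err := parseLimit(raw)
		fmt.Printf("%q -> %d, %v\n", raw, n, err)
	}
}
```

A handler would call `parseLimit(r.URL.Query().Get("limit"))` and answer with 400 Bad Request on error.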
### Full-Text Search Alternative
**Not implemented:** SQLite FTS5 (Full-Text Search)
**FTS5 benefits:**
- Indexed search (much faster)
- Relevance ranking
- Phrase search
- Boolean operators
**Why not used:**
- Requires writable database (to create FTS5 table)
- Databases are read-only
- Would need separate FTS5 database
**Workaround:**
```sql
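-- Build a separate, writable FTS5 database (one-time setup).
-- The FTS5 table stores its own copy of id and name, because an
-- external-content table cannot reference a table in another database file.
ATTACH DATABASE '/data/main_database.sqlite3' AS src;
CREATE VIRTUAL TABLE tracks_fts USING fts5(id UNINDEXED, name);
INSERT INTO tracks_fts (id, name) SELECT id, name FROM src.tracks;
-- Fast search
SELECT id FROM tracks_fts WHERE name MATCH 'bohemian';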
```
**Implementation:**
- Create FTS5 database during database preparation
- Open second database connection in code
- Query FTS5 for search, then fetch full data from main DB
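If this route were taken, user input would need to be quoted before being passed to `MATCH`, because FTS5 query syntax gives meaning to characters such as `"`, `*`, and `:` and to words such as `AND` and `OR`. The sketch below shows one way to do that; `quoteFTS5` and the two SQL constants are hypothetical and assume the `tracks_fts` table from the workaround above.

```go
package main

import (
	"fmt"
	"strings"
)

// Step 1 (hypothetical): find matching track IDs in the FTS5 database,
// best matches first.
const ftsSearchSQL = `SELECT id FROM tracks_fts WHERE tracks_fts MATCH ? ORDER BY rank LIMIT ?`

// Step 2 (hypothetical): fetch full rows for one matching ID from the main database.
const trackByIDSQL = `SELECT id, name, isrc, duration_ms, popularity FROM tracks WHERE id = ?`

// quoteFTS5 turns raw user input into a single FTS5 string (a phrase query):
// the input is wrapped in double quotes and embedded quotes are doubled.
func quoteFTS5(input string) string {
	return `"` + strings.ReplaceAll(input, `"`, `""`) + `"`
}

func main() {
	fmt.Println(ftsSearchSQL)
	fmt.Println(trackByIDSQL)
	fmt.Println(quoteFTS5(`bohemian "live"`))
}
```

Quoting turns the whole input into a phrase, so "bohemian rhapsody" matches those two tokens in sequence rather than either word anywhere.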
## Testing
### Test Coverage
**Test files:** 0
**Test coverage:** 0%
**Test framework:** None
**Evidence:**
```bash
# No test files in repository
find . -name "*_test.go"
# (no output)
```
**.gitignore includes:**
```
coverage.out
```
**Implication:** The `coverage.out` entry suggests testing was anticipated, but no tests were ever added.
### CI/CD Testing
**GitHub Actions workflow:** `.github/workflows/docker-publish.yml`
**Steps:**
1. Checkout code
2. Build Docker image
3. Push to registry
**Missing:** No test step
**Expected workflow:**
```yaml
- name: Run tests
  run: go test -v ./...
- name: Check coverage
  run: go test -cover ./...
```
### Manual Testing
**Only testing:** Manual API calls
**Example:**
```bash
# Health check
curl http://localhost:8080/health
# Track lookup
curl http://localhost:8080/lookup/track/abc123
# Search
curl "http://localhost:8080/search/track?q=test"
```
**No automated testing:**
- No unit tests
- No integration tests
- No end-to-end tests
- No performance tests
- No load tests
### Testing Recommendations
**Unit tests needed:**
- Rate limiter logic
- IP extraction
- Query building
- Data enrichment
- JSON serialization
**Integration tests needed:**
- Database queries
- HTTP handlers
- Batch operations
- Search functionality
**Example unit test:**
```go
// internal/api/ratelimit_test.go
func TestRateLimiter(t *testing.T) {
rl := NewRateLimiter(10, 20) // 10 req/s, 20 burst
// Should allow burst
for i := 0; i < 20; i++ {
if !rl.getLimiter("127.0.0.1").Allow() {
t.Errorf("Request %d should be allowed", i)
}
}
// Should reject 21st request
if rl.getLimiter("127.0.0.1").Allow() {
t.Error("Request 21 should be rate limited")
}
}
```
**Example integration test:**
```go
// internal/db/db_test.go
func TestGetTrack(t *testing.T) {
db, err := NewDatabase("testdata/test.db")
if err != nil {
t.Fatal(err)
}
defer db.Close()
track, err := db.GetTrack("test_track_id")
if err != nil {
t.Fatal(err)
}
if track.Name != "Test Track" {
t.Errorf("Expected 'Test Track', got '%s'", track.Name)
}
}
```
## Error Handling
### Error Patterns
**Database errors:**
```go
rows, err := db.Query(query, args...)
if err != nil {
slog.Error("Query failed", "error", err)
http.Error(w, "Internal server error", http.StatusInternalServerError)
return
}
```
**JSON decode errors:**
```go
var req BatchRequest
if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
http.Error(w, "Invalid JSON", http.StatusBadRequest)
return
}
```
**Validation errors:**
```go
if len(query) < 2 {
http.Error(w, "Query too short", http.StatusBadRequest)
return
}
```
### Error Responses
**Generic errors:**
```go
http.Error(w, "Internal server error", http.StatusInternalServerError)
```
**Problem:** No error details returned to client (security vs usability tradeoff)
**Structured errors (not implemented):**
```go
type ErrorResponse struct {
Error string `json:"error"`
Code string `json:"code"`
Details string `json:"details,omitempty"`
}
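func writeError(w http.ResponseWriter, status int, code, message string) {
	// Set the header before WriteHeader; later changes are ignored
	w.Header().Set("Content-Type", "application/json")
	w.WriteHeader(status)
	json.NewEncoder(w).Encode(ErrorResponse{
		Error: message,
		Code:  code,
	})
}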
```
## Code Quality
### Strengths
**Simplicity:**
- Small codebase (~1,100 lines)
- Easy to understand
- Minimal dependencies
- No framework magic
**Readability:**
- Clear function names
- Logical file organization
- Consistent style
**Performance:**
- Batch optimization (343x fewer queries)
- Connection pooling
- Memory-mapped I/O
### Weaknesses
**No tests:**
- Zero test coverage
- No regression protection
- No documentation via tests
**Hardcoded config:**
- No flexibility
- Requires recompilation to change limits
- No environment-specific config
**Memory leak:**
- Rate limiter visitor map grows unbounded
- Requires periodic restarts
**Naive health check:**
- Doesn't verify database connectivity
- False positives in monitoring
**No metrics:**
- No visibility into performance
- No error rate tracking
- No usage analytics
**Unused config:**
- `LOG_LEVEL` environment variable ignored
- Misleading documentation
**No CORS:**
- Browser-based clients blocked
- Requires reverse proxy workaround
**No authentication:**
- Public API (security risk)
- No usage tracking per user
### Code Smells
**Magic numbers:**
```go
// What is 64000? Why 1073741824?
_cache_size=-64000&_mmap_size=1073741824
```
**Fix:** Use named constants
```go
const (
sqliteCacheSizeKB = 64000 // 64MB
sqliteMmapSizeBytes = 1 << 30 // 1GB
)
```
**Repeated code:**
```go
// Similar enrichment logic repeated for tracks, albums, artists
func enrichTrack(track *Track) { /* ... */ }
func enrichAlbum(album *Album) { /* ... */ }
func enrichArtist(artist *Artist) { /* ... */ }
```
**Fix:** Generic enrichment function
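As an illustration of what a generic helper could look like, the sketch below groups child rows (images, genres, artists) by their owner ID with one type-parameterized function. It assumes the enrichment code repeats this grouping step per entity type; `groupBy`, `imageRow`, and `groupImages` are hypothetical names.

```go
package main

import "fmt"

// Image mirrors the model shown in the API reference.
type Image struct {
	URL    string
	Width  int
	Height int
}

// imageRow is a hypothetical scan target: one row of album_images or
// artist_images together with the ID of the entity that owns it.
type imageRow struct {
	OwnerID string
	Image   Image
}

// groupBy collects values derived from rows into a map keyed by key(row),
// preserving row order within each group.
func groupBy[R any, K comparable, V any](rows []R, key func(R) K, val func(R) V) map[K][]V {
	out := make(map[K][]V)
	for _, r := range rows {
		k := key(r)
		out[k] = append(out[k], val(r))
	}
	return out
}

// groupImages shows the helper applied to image rows.
func groupImages(rows []imageRow) map[string][]Image {
	return groupBy(rows,
		func(r imageRow) string { return r.OwnerID },
		func(r imageRow) Image { return r.Image },
	)
}

func main() {
	rows := []imageRow{
		{"album_1", Image{URL: "https://example.invalid/a-640", Width: 640, Height: 640}},
		{"album_2", Image{URL: "https://example.invalid/b-640", Width: 640, Height: 640}},
		{"album_1", Image{URL: "https://example.invalid/a-300", Width: 300, Height: 300}},
	}
	byAlbum := groupImages(rows)
	fmt.Println(len(byAlbum["album_1"]), len(byAlbum["album_2"]))
}
```

The same `groupBy` call would work for genres (value type `string`) or artists without another hand-written loop.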
**Global state:**
```go
// Rate limiter as global variable (not shown in code, but implied)
var rateLimiter *RateLimiter
```
**Fix:** Dependency injection
## Dependencies
### External Packages
**modernc.org/sqlite v1.34.4:**
- Pure Go SQLite driver
- No CGO required
- 100% Go implementation
- Larger binary size vs CGO version
**golang.org/x/time v0.14.0:**
- Rate limiting (token bucket)
- Maintained by the Go team under `golang.org/x` (not part of the standard library)
- Minimal, focused package
**Total dependencies:** 2 direct + transitive dependencies
### Dependency Management
**go.mod:**
```go
module github.com/Aunali321/music-metadata-api
go 1.24
require (
modernc.org/sqlite v1.34.4
golang.org/x/time v0.14.0
)
```
**Dependency updates:**
```bash
# Check for updates
go list -u -m all
# Update dependencies
go get -u ./...
go mod tidy
```
**Security scanning:**
```bash
# Scan for vulnerabilities
go list -json -m all | nancy sleuth
```
## Code Organization
### Package Structure
```
music-metadata-api/
├── cmd/
│ └── server/ # Entry point
│ └── main.go # CLI, server setup, graceful shutdown
├── internal/ # Private packages
│ ├── api/ # HTTP layer
│ │ ├── handlers.go # Route handlers
│ │ ├── ratelimit.go # Rate limiting middleware
│ │ └── openapi.go # OpenAPI spec
│ │
│ ├── db/ # Database layer
│ │ └── db.go # Queries, enrichment, batch optimization
│ │
│ └── models/ # Data models
│ └── models.go # Structs, JSON tags
├── Dockerfile # Container build
├── docker-compose.yml # Local deployment
├── go.mod # Dependencies
└── .github/
└── workflows/
└── docker-publish.yml # CI/CD
```
### Separation of Concerns
**Good:**
- Clear layer boundaries (API → DB → Models)
- No circular dependencies
- Database logic isolated from HTTP
**Could improve:**
- Extract configuration to separate package
- Extract validation to separate package
- Extract error handling to separate package
## Performance Characteristics
### Bottlenecks
**Search queries:**
- `LIKE %query%` full table scan
- 10-second timeout (can be hit)
- CPU-bound (string matching)
**Rate limiter:**
- RWMutex contention under high load
- Map lookup on every request
**Database:**
- Single SQLite file (no sharding)
- 8 connection limit (conservative)
### Optimizations
**Batch queries:**
- 343x fewer queries (400 items: 7 queries vs 2,401)
- IN clause for bulk lookups
**Connection pooling:**
- Reuses connections (no per-request open cost)
- 8 warm connections
**Memory-mapped I/O:**
- 1GB mmap (faster than read() syscalls)
- OS handles paging
**Read-only mode:**
- No write locks
- Safe concurrent reads
## Maintainability
### Documentation
**Code comments:** Minimal
**README:** Basic (installation, usage)
**OpenAPI spec:** Comprehensive (all endpoints documented)
**No inline documentation:**
```go
// No function comments
func enrichTrack(track *Track) {
// No explanation of enrichment logic
}
```
**Recommendation:** Add godoc comments
```go
// enrichTrack populates track with related entities (artists, album, track files).
// It performs batch queries to minimize database round-trips.
func enrichTrack(track *Track) {
// ...
}
```
### Extensibility
**Easy to extend:**
- Add new endpoints (register route)
- Add new models (define struct)
- Add new queries (write SQL)
**Hard to extend:**
- Change rate limiting strategy (tightly coupled)
- Add authentication (no middleware chain)
- Add metrics (no instrumentation points)
### Technical Debt
**High priority:**
1. Fix rate limiter memory leak
2. Implement proper health check
3. Add test coverage
4. Use LOG_LEVEL environment variable
**Medium priority:**
1. Extract hardcoded config
2. Add metrics/monitoring
3. Implement CORS support
4. Add authentication
**Low priority:**
1. Improve search performance (FTS5)
2. Add caching layer
3. Structured error responses
4. Request logging
@@ -0,0 +1,911 @@
# Music Metadata API - Data Layer
## Database Architecture
Music Metadata API uses a dual-database architecture with two separate SQLite files:
```
┌────────────────────────────────────────────────────────┐
│                   Application Layer                    │
└────────────────────────────────────────────────────────┘
                ┌───────────┴───────────┐
                ▼                       ▼
┌──────────────────────────┐  ┌──────────────────────────┐
│ main_database.sqlite3    │  │ track_files.sqlite3      │
│ (~117GB)                 │  │ (~99GB)                  │
│                          │  │                          │
│ - tracks                 │  │ - track_files            │
│ - albums                 │  │   (extended metadata)    │
│ - artists                │  │                          │
│ - track_artists          │  │                          │
│ - artist_albums          │  │                          │
│ - album_images           │  │                          │
│ - artist_images          │  │                          │
│ - artist_genres          │  │                          │
└──────────────────────────┘  └──────────────────────────┘
```
**Total storage:** ~216GB
**Total tracks:** 256 million
**Connection mode:** Read-only
**Driver:** modernc.org/sqlite v1.34.4 (pure Go, no CGO)
## Connection Configuration
### Connection Strings
**Main database:**
```
file:/path/to/main_database.sqlite3?mode=ro&_journal_mode=off&_cache_size=-64000&_mmap_size=1073741824&_query_only=true
```
**Track files database:**
```
file:/path/to/track_files.sqlite3?mode=ro&_journal_mode=off&_cache_size=-64000&_mmap_size=1073741824&_query_only=true
```
### PRAGMA Settings
| Parameter | Value | Purpose | Impact |
|-----------|-------|---------|--------|
| `mode=ro` | Read-only | URI flag (not a PRAGMA) that prevents writes | No write locks, safe concurrent reads |
| `_journal_mode=off` | Disabled | No WAL/rollback journal | Faster reads, safe for read-only |
| `_cache_size=-64000` | 64MB | Page cache size | Reduces disk I/O for hot data |
| `_mmap_size=1073741824` | 1GB | Memory-mapped I/O | Faster reads via mmap |
| `_query_only=true` | Enabled | Additional read-only enforcement | Extra safety layer |
**Cache size calculation:**
- Negative value = kilobytes
- `-64000` = 64,000 KB = 64 MB
- Default SQLite cache is ~2MB (32x increase)
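The sign convention can be made concrete with a small conversion function. SQLite interprets a negative `cache_size` as a size in units of 1024 bytes and a positive one as a page count; `cacheBytes` is a hypothetical helper, and the 4096-byte page size passed below is an assumption that only matters for positive values.

```go
package main

import "fmt"

// cacheBytes converts a SQLite cache_size value to bytes.
// Negative values are a size in KiB; positive values are a page count.
func cacheBytes(cacheSize, pageSize int64) int64 {
	if cacheSize < 0 {
		return -cacheSize * 1024
	}
	return cacheSize * pageSize
}

func main() {
	configured := cacheBytes(-64000, 4096)
	dflt := cacheBytes(-2000, 4096)
	fmt.Println(configured, dflt, configured/dflt)
}
```

In exact terms the configured cache is 65,536,000 bytes, which the list above rounds to 64 MB.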
**Memory-mapped I/O:**
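- Maps up to the first 1GB of each database file into process memory; pages beyond that limit are read with ordinary read() calls
- OS handles paging for the mapped region (faster than read() syscalls)
- Most effective when frequently accessed pages fall inside the mapped region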
### Connection Pool
```go
db.SetMaxOpenConns(8) // Conservative limit (8 concurrent queries)
db.SetMaxIdleConns(8) // Keep all connections warm
db.SetConnMaxLifetime(0) // No expiration (read-only safe)
```
**Rationale:**
- Read-only workload (no write contention)
- SQLite handles concurrent reads well
- 8 connections balance throughput vs resource usage
- No connection recycling needed (no state changes)
## Main Database Schema
### tracks Table
**Purpose:** Core track metadata
| Column | Type | Description | Nullable |
|--------|------|-------------|----------|
| `rowid` | INTEGER | SQLite internal row ID | No |
| `id` | TEXT | Internal track ID | No |
| `name` | TEXT | Track title | No |
| `isrc` | TEXT | ISRC code | Yes |
| `duration_ms` | INTEGER | Duration in milliseconds | No |
| `explicit` | INTEGER | Explicit content flag (0/1) | No |
| `track_number` | INTEGER | Track number on album | No |
| `disc_number` | INTEGER | Disc number | No |
| `popularity` | INTEGER | Popularity score (0-100) | No |
| `preview_url` | TEXT | 30-second preview URL | Yes |
| `album_rowid` | INTEGER | Foreign key to albums.rowid | No |
**Indexes:**
- Primary key on `id`
- Index on `isrc` (for ISRC lookups)
- Index on `album_rowid` (for album track listings)
**Sample row:**
```sql
id: 4cOdK2wGLETKBW3PvgPWqT
name: Bohemian Rhapsody
isrc: GBUM71029604
duration_ms: 354320
explicit: 0
track_number: 11
disc_number: 1
popularity: 89
preview_url: https://p.scdn.co/mp3-preview/...
album_rowid: 12345
```
**Estimated rows:** 256 million
### albums Table
**Purpose:** Album metadata
| Column | Type | Description | Nullable |
|--------|------|-------------|----------|
| `rowid` | INTEGER | SQLite internal row ID | No |
| `id` | TEXT | Internal album ID | No |
| `name` | TEXT | Album title | No |
| `album_type` | TEXT | "album", "single", "compilation" | No |
| `label` | TEXT | Record label | Yes |
| `release_date` | TEXT | ISO 8601 date, truncated per `release_date_precision` (YYYY, YYYY-MM, or YYYY-MM-DD) | No |
| `release_date_precision` | TEXT | "year", "month", "day" | No |
| `external_id_upc` | TEXT | UPC barcode | Yes |
| `total_tracks` | INTEGER | Total tracks on album | No |
| `copyright_c` | TEXT | Copyright notice | Yes |
| `copyright_p` | TEXT | Phonographic copyright | Yes |
**Indexes:**
- Primary key on `id`
- `rowid` lookups (for track joins) use the table's own B-tree; no separate index is required
**Sample row:**
```sql
id: 2ODvWsOgouMbaA5xf0RkJe
name: A Night at the Opera
album_type: album
label: Hollywood Records
release_date: 1975-11-21
release_date_precision: day
external_id_upc: 050087246679
total_tracks: 12
copyright_c: 1975 Queen Productions Ltd
copyright_p: 1975 Queen Productions Ltd
```
**Estimated rows:** Tens of millions (fewer than tracks)
### artists Table
**Purpose:** Artist metadata
| Column | Type | Description | Nullable |
|--------|------|-------------|----------|
| `rowid` | INTEGER | SQLite internal row ID | No |
| `id` | TEXT | Internal artist ID | No |
| `name` | TEXT | Artist name | No |
| `followers_total` | INTEGER | Total followers | Yes |
| `popularity` | INTEGER | Popularity score (0-100) | Yes |
**Indexes:**
- Primary key on `id`
- Index on `name` (for search)
**Sample row:**
```sql
id: 0TnOYISbd1XYRBk9myaseg
name: Queen
followers_total: 45000000
popularity: 92
```
**Estimated rows:** Millions (fewer than albums)
### track_artists Table
**Purpose:** Many-to-many relationship between tracks and artists
| Column | Type | Description | Nullable |
|--------|------|-------------|----------|
| `track_id` | TEXT | Foreign key to tracks.id | No |
| `artist_id` | TEXT | Foreign key to artists.id | No |
**Indexes:**
- Composite index on `(track_id, artist_id)`
- Index on `artist_id` (for artist track listings)
**Sample rows:**
```sql
track_id: 4cOdK2wGLETKBW3PvgPWqT, artist_id: 0TnOYISbd1XYRBk9myaseg
track_id: 4cOdK2wGLETKBW3PvgPWqT, artist_id: 1A2B3C4D5E6F7G8H9I0J
```
**Estimated rows:** Hundreds of millions (tracks can have multiple artists)
### artist_albums Table
**Purpose:** Many-to-many relationship between artists and albums with ordering
| Column | Type | Description | Nullable |
|--------|------|-------------|----------|
| `artist_id` | TEXT | Foreign key to artists.id | No |
| `album_id` | TEXT | Foreign key to albums.id | No |
| `index_in_album` | INTEGER | Artist order on album | No |
**Indexes:**
- Composite index on `(album_id, index_in_album)`
- Index on `artist_id` (for artist discography)
**Sample rows:**
```sql
artist_id: 0TnOYISbd1XYRBk9myaseg, album_id: 2ODvWsOgouMbaA5xf0RkJe, index_in_album: 0
artist_id: 1A2B3C4D5E6F7G8H9I0J, album_id: 2ODvWsOgouMbaA5xf0RkJe, index_in_album: 1
```
**Purpose of index_in_album:** Preserves artist order for multi-artist albums (e.g., "Artist A & Artist B")
### album_images Table
**Purpose:** Album artwork URLs
| Column | Type | Description | Nullable |
|--------|------|-------------|----------|
| `album_id` | TEXT | Foreign key to albums.id | No |
| `url` | TEXT | Image URL | No |
| `width` | INTEGER | Width in pixels | No |
| `height` | INTEGER | Height in pixels | No |
**Indexes:**
- Index on `album_id`
**Sample rows:**
```sql
album_id: 2ODvWsOgouMbaA5xf0RkJe, url: https://i.scdn.co/image/ab67616d0000b273..., width: 640, height: 640
album_id: 2ODvWsOgouMbaA5xf0RkJe, url: https://i.scdn.co/image/ab67616d00001e02..., width: 300, height: 300
album_id: 2ODvWsOgouMbaA5xf0RkJe, url: https://i.scdn.co/image/ab67616d00004851..., width: 64, height: 64
```
**Typical sizes:** 640x640, 300x300, 64x64
**Image hosting:** External CDN (i.scdn.co), not hosted by API
### artist_images Table
**Purpose:** Artist images/photos
| Column | Type | Description | Nullable |
|--------|------|-------------|----------|
| `artist_id` | TEXT | Foreign key to artists.id | No |
| `url` | TEXT | Image URL | No |
| `width` | INTEGER | Width in pixels | No |
| `height` | INTEGER | Height in pixels | No |
**Indexes:**
- Index on `artist_id`
**Sample rows:**
```sql
artist_id: 0TnOYISbd1XYRBk9myaseg, url: https://i.scdn.co/image/af2b8e57f6d7b5d..., width: 640, height: 640
artist_id: 0TnOYISbd1XYRBk9myaseg, url: https://i.scdn.co/image/c06971e9ff81696..., width: 320, height: 320
```
### artist_genres Table
**Purpose:** Artist genre tags
| Column | Type | Description | Nullable |
|--------|------|-------------|----------|
| `artist_id` | TEXT | Foreign key to artists.id | No |
| `genre` | TEXT | Genre name | No |
**Indexes:**
- Index on `artist_id`
**Sample rows:**
```sql
artist_id: 0TnOYISbd1XYRBk9myaseg, genre: rock
artist_id: 0TnOYISbd1XYRBk9myaseg, genre: classic rock
artist_id: 0TnOYISbd1XYRBk9myaseg, genre: glam rock
```
**Genre characteristics:**
- Multiple genres per artist
- Lowercase, space-separated or hyphenated (e.g., "classic rock", "indie-rock")
- Spotify-style genre taxonomy
## Track Files Database Schema
### track_files Table
**Purpose:** Extended track metadata not in main database
| Column | Type | Description | Nullable |
|--------|------|-------------|----------|
| `track_id` | TEXT | Foreign key to tracks.id | No |
| `has_lyrics` | INTEGER | Lyrics availability flag (0/1) | No |
| `original_title` | TEXT | Original title (if different) | Yes |
| `version_title` | TEXT | Version descriptor (e.g., "Radio Edit") | Yes |
| `language_of_performance` | TEXT | JSON array of language codes | Yes |
| `artist_roles` | TEXT | JSON object mapping artist IDs to roles | Yes |
**Indexes:**
- Primary key on `track_id`
**Sample row:**
```sql
track_id: 4cOdK2wGLETKBW3PvgPWqT
has_lyrics: 1
original_title: Bohemian Rhapsody
version_title: NULL
language_of_performance: ["en"]
artist_roles: {"0TnOYISbd1XYRBk9myaseg": ["performer", "composer"]}
```
**JSON field parsing:**
**language_of_performance** (ISO 639-1 language codes):
```json
["en", "es"]
```
**artist_roles:**
```json
{
"artist_id_1": ["performer", "composer"],
"artist_id_2": ["producer"],
"artist_id_3": ["lyricist"]
}
```
**Common roles:**
- `performer` - Main performer
- `composer` - Music composer
- `lyricist` - Lyrics writer
- `producer` - Producer
- `engineer` - Recording engineer
- `mixer` - Mix engineer
**Estimated rows:** 256 million (one per track)
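Because `language_of_performance` and `artist_roles` are stored as JSON text in nullable columns, a consumer has to decode them and cope with `NULL`. The following is a minimal Python sketch of that decoding, assuming rows arrive as dictionaries keyed by column name; the `parse_track_file` helper is hypothetical and not part of the service.

```python
import json

def parse_track_file(row):
    """Decode one track_files row (a dict keyed by column name).

    Hypothetical helper: the JSON columns are nullable, so NULL (None)
    is mapped to an empty list or dict instead of raising.
    """
    raw_languages = row["language_of_performance"]
    raw_roles = row["artist_roles"]
    return {
        "track_id": row["track_id"],
        "has_lyrics": bool(row["has_lyrics"]),  # stored as INTEGER 0/1
        "original_title": row["original_title"],
        "version_title": row["version_title"],
        "languages": json.loads(raw_languages) if raw_languages else [],
        "artist_roles": json.loads(raw_roles) if raw_roles else {},
    }

# The sample row from this section, written as a Python dict
sample = {
    "track_id": "4cOdK2wGLETKBW3PvgPWqT",
    "has_lyrics": 1,
    "original_title": "Bohemian Rhapsody",
    "version_title": None,
    "language_of_performance": '["en"]',
    "artist_roles": '{"0TnOYISbd1XYRBk9myaseg": ["performer", "composer"]}',
}
parsed = parse_track_file(sample)
```

Mapping `NULL` to empty containers keeps downstream code free of `None` checks, at the cost of hiding the difference between "unknown" and "empty".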
## Query Patterns
### Individual Track Lookup
```sql
-- Step 1: Fetch track + album (single JOIN)
SELECT
t.id, t.name, t.isrc, t.duration_ms, t.explicit,
t.track_number, t.disc_number, t.popularity, t.preview_url,
a.id AS album_id, a.name AS album_name, a.album_type,
a.label, a.release_date, a.release_date_precision,
a.external_id_upc, a.total_tracks, a.copyright_c, a.copyright_p
FROM tracks t
JOIN albums a ON t.album_rowid = a.rowid
WHERE t.id = ?
-- Step 2: Fetch album images
SELECT url, width, height
FROM album_images
WHERE album_id = ?
ORDER BY width DESC
-- Step 3: Fetch album artists
SELECT a.id, a.name, a.followers_total, a.popularity
FROM artists a
JOIN artist_albums aa ON a.id = aa.artist_id
WHERE aa.album_id = ?
ORDER BY aa.index_in_album
-- Step 4: Fetch track artists
SELECT a.id, a.name, a.followers_total, a.popularity
FROM artists a
JOIN track_artists ta ON a.id = ta.artist_id
WHERE ta.track_id = ?
-- Step 5: Fetch artist genres (for each artist)
SELECT genre
FROM artist_genres
WHERE artist_id = ?
-- Step 6: Fetch artist images (for each artist)
SELECT url, width, height
FROM artist_images
WHERE artist_id = ?
ORDER BY width DESC
-- Step 7: Fetch track files (from track_files.sqlite3)
SELECT has_lyrics, original_title, version_title,
language_of_performance, artist_roles
FROM track_files
WHERE track_id = ?
```
**Total queries for single track:** 5 + 2 × (number of artists), because Steps 5 and 6 run once per artist; that is 7 for a single-artist track
### Batch ISRC Lookup
```sql
-- Step 1: Fetch all tracks by ISRC (single query with IN clause)
SELECT
t.id, t.name, t.isrc, t.duration_ms, t.explicit,
t.track_number, t.disc_number, t.popularity, t.preview_url,
a.id AS album_id, a.name AS album_name, a.album_type,
a.label, a.release_date, a.release_date_precision,
a.external_id_upc, a.total_tracks, a.copyright_c, a.copyright_p
FROM tracks t
JOIN albums a ON t.album_rowid = a.rowid
WHERE t.isrc IN (?, ?, ?, ...) -- Up to 400 placeholders
-- Step 2: Batch fetch album images (all albums at once)
SELECT album_id, url, width, height
FROM album_images
WHERE album_id IN (?, ?, ?, ...)
ORDER BY album_id, width DESC
-- Step 3: Batch fetch album artists
SELECT aa.album_id, a.id, a.name, a.followers_total, a.popularity, aa.index_in_album
FROM artists a
JOIN artist_albums aa ON a.id = aa.artist_id
WHERE aa.album_id IN (?, ?, ?, ...)
ORDER BY aa.album_id, aa.index_in_album
-- Step 4: Batch fetch track artists
SELECT ta.track_id, a.id, a.name, a.followers_total, a.popularity
FROM artists a
JOIN track_artists ta ON a.id = ta.artist_id
WHERE ta.track_id IN (?, ?, ?, ...)
-- Step 5: Batch fetch artist genres
SELECT artist_id, genre
FROM artist_genres
WHERE artist_id IN (?, ?, ?, ...)
-- Step 6: Batch fetch artist images
SELECT artist_id, url, width, height
FROM artist_images
WHERE artist_id IN (?, ?, ?, ...)
ORDER BY artist_id, width DESC
-- Step 7: Batch fetch track files
SELECT track_id, has_lyrics, original_title, version_title,
language_of_performance, artist_roles
FROM track_files
WHERE track_id IN (?, ?, ?, ...)
```
**Total queries for 400 tracks:** 7 (vs 2,800+ for individual lookups)
**Performance gain:** at least 400x fewer queries (2,800 ÷ 7, assuming one artist per track)
### Search Queries
**Track search:**
```sql
SELECT id, name, isrc, duration_ms, popularity, album_rowid
FROM tracks
WHERE name LIKE ? COLLATE NOCASE -- ? = '%query%'
ORDER BY popularity DESC
LIMIT ?
```
**Artist search:**
```sql
SELECT id, name, followers_total, popularity
FROM artists
WHERE name LIKE ? COLLATE NOCASE -- ? = '%query%'
ORDER BY followers_total DESC
LIMIT ?
```
**Search characteristics:**
- `LIKE %query%` can't use indexes (full table scan)
- `COLLATE NOCASE` for case-insensitive matching
- Ordered by popularity/followers (most relevant first)
- Limited to 50 results max
- 10-second timeout via context deadline
**Performance concern:** Searching 256M tracks with `LIKE '%query%'` forces a full table scan on every request. SQLite full-text search (FTS5) would be faster but is not implemented.
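The full-scan claim can be checked with `EXPLAIN QUERY PLAN`. The sketch below uses an empty in-memory table with a simplified, assumed schema (three columns plus an index on `name`), so it only illustrates the planner's choice, not real timings.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified stand-in for the real tracks table (assumed schema)
conn.execute("CREATE TABLE tracks (id TEXT PRIMARY KEY, name TEXT, popularity INTEGER)")
conn.execute("CREATE INDEX idx_tracks_name ON tracks(name)")

def plan(sql, params=()):
    """Return the 'detail' column of EXPLAIN QUERY PLAN as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[3] for row in rows)

like_plan = plan(
    "SELECT id, name, popularity FROM tracks WHERE name LIKE ? COLLATE NOCASE",
    ("%love%",),
)
id_plan = plan("SELECT id, name, popularity FROM tracks WHERE id = ?", ("abc",))

print(like_plan)  # reports a SCAN of tracks: every row is visited
print(id_plan)    # reports a SEARCH using the primary-key index
```

The exact wording of the plan text varies between SQLite versions, so only the `SCAN` versus `SEARCH` keyword should be relied on.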
### Album Tracks Lookup
```sql
-- Fetch all tracks for an album
SELECT t.id, t.name, t.isrc, t.duration_ms, t.explicit,
t.track_number, t.disc_number, t.popularity, t.preview_url
FROM tracks t
WHERE t.album_rowid = (
SELECT rowid FROM albums WHERE id = ?
)
ORDER BY t.disc_number, t.track_number
```
**Ordering:** Disc number first, then track number (preserves album order)
## Data Enrichment Strategy
### Enrichment Pipeline
```
1. Fetch base entity (track/album/artist)
2. Collect related entity IDs
3. Batch fetch related entities
4. Assemble nested structures
5. Return enriched object
```
### Batch Optimization Functions
**Implementation in db.go (907 lines):**
```go
// Batch fetch album images for multiple albums
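func (d *Database) batchGetAlbumImages(albumIDs []string) map[string][]Image {
	result := make(map[string][]Image)
	if len(albumIDs) == 0 {
		return result // strings.Repeat panics on a negative count
	}

	// Build IN clause: one "?" per ID
	placeholders := strings.Repeat("?,", len(albumIDs)-1) + "?"
	query := fmt.Sprintf(`
		SELECT album_id, url, width, height
		FROM album_images
		WHERE album_id IN (%s)
		ORDER BY album_id, width DESC
	`, placeholders)

	// database/sql takes ...any, so the []string must be converted first
	args := make([]any, len(albumIDs))
	for i, id := range albumIDs {
		args[i] = id
	}

	// Execute query (error handling simplified: failures yield an empty map)
	rows, err := d.mainDB.Query(query, args...)
	if err != nil {
		return result
	}
	defer rows.Close()

	// Group by album_id
	for rows.Next() {
		var albumID string
		var img Image
		if err := rows.Scan(&albumID, &img.URL, &img.Width, &img.Height); err != nil {
			continue
		}
		result[albumID] = append(result[albumID], img)
	}
	return result
}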
```
**Similar functions:**
- `batchGetAlbumArtists(albumIDs []string) map[string][]Artist`
- `batchGetTrackArtists(trackIDs []string) map[string][]Artist`
- `batchGetArtistGenres(artistIDs []string) map[string][]string`
- `batchGetArtistImages(artistIDs []string) map[string][]Image`
- `batchEnrichTrackFiles(trackIDs []string) map[string]*TrackFile`
**Pattern:**
1. Build IN clause with placeholders
2. Execute single query for all IDs
3. Group results by parent ID
4. Return map for O(1) lookup
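The four-step pattern above can be sketched outside the Go codebase as well. This Python version runs against a tiny in-memory stand-in for `album_images` with made-up rows; `batch_get_album_images` is a hypothetical name that mirrors the Go function, not the service's actual code.

```python
import sqlite3
from collections import defaultdict

def batch_get_album_images(conn, album_ids):
    """Fetch images for many albums in one query, grouped by album_id."""
    if not album_ids:
        return {}
    placeholders = ",".join("?" for _ in album_ids)  # step 1: one "?" per ID
    sql = (
        "SELECT album_id, url, width, height FROM album_images "
        f"WHERE album_id IN ({placeholders}) "
        "ORDER BY album_id, width DESC"
    )
    grouped = defaultdict(list)
    # step 2: a single query; step 3: group rows by parent ID
    for album_id, url, width, height in conn.execute(sql, list(album_ids)):
        grouped[album_id].append({"url": url, "width": width, "height": height})
    return dict(grouped)  # step 4: dict gives O(1) lookup per album

# In-memory stand-in for album_images (made-up rows)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE album_images (album_id TEXT, url TEXT, width INTEGER, height INTEGER)")
conn.executemany(
    "INSERT INTO album_images VALUES (?, ?, ?, ?)",
    [
        ("album_a", "https://example.invalid/a-300", 300, 300),
        ("album_a", "https://example.invalid/a-640", 640, 640),
        ("album_b", "https://example.invalid/b-640", 640, 640),
    ],
)
images = batch_get_album_images(conn, ["album_a", "album_b", "album_missing"])
```

IDs with no rows are simply absent from the result, so callers should use `images.get(album_id, [])`. Only the values are bound as parameters; the placeholder string itself is built from the ID count, never from user input.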
### Why Batch Matters
**Without batching (400 tracks):**
- 400 track + album queries (one JOIN per track)
- 400 album image queries
- 400 album artist queries
- 400 track artist queries
- ~800 artist genre queries (2 artists per track avg)
- ~800 artist image queries
- 400 track file queries
- **Total: ~3,600 queries**
**With batching (400 tracks):**
- 1 batch track query
- 1 batch album image query
- 1 batch album artist query
- 1 batch track artist query
- 1 batch artist genre query
- 1 batch artist image query
- 1 batch track file query
- **Total: 7 queries**
**Performance gain: 514x fewer queries**
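The query counts quoted in this document follow from one formula: five fixed queries per track plus two (genres and images) per artist. A small Python sketch of that arithmetic, where the two-artists-per-track average is the assumption stated above:

```python
def individual_queries(tracks, artists_per_track=2):
    """Queries needed when every track is looked up on its own (Steps 1-7)."""
    per_track_fixed = 5  # track+album, album images, album artists, track artists, track files
    per_artist = 2       # genres + images, repeated for each artist
    return tracks * (per_track_fixed + per_artist * artists_per_track)

BATCH_QUERIES = 7  # one query per step, independent of track count

without_batching = individual_queries(400)    # 400 * (5 + 2*2)
ratio = without_batching / BATCH_QUERIES
single_artist = individual_queries(400, 1)    # 400 * 7

print(without_batching, round(ratio), single_artist)  # 3600 514 2800
```

This also reconciles the two figures used elsewhere: 2,800 queries (400x) assumes one artist per track, while 3,600 queries (514x) assumes two.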
## Data Provenance
### Source
**Disclaimer from repository:**
> "This project is not affiliated with Spotify."
**Implications:**
- Data source unclear (likely scraped or obtained from third party)
- Legal status uncertain
- No official Spotify endorsement
### Data Freshness
**Static snapshot:**
- No update mechanism
- Data frozen at time of database creation
- No real-time sync with Spotify
**Staleness concerns:**
- New releases not included
- Popularity scores outdated
- Artist follower counts stale
- Deleted tracks still present
**Mitigation:**
- Treat as historical snapshot
- Complement with real-time APIs for fresh data
- Periodically obtain updated database (if available)
### Data Quality
**Strengths:**
- 256M tracks (massive coverage)
- Rich metadata (genres, images, roles)
- ISRC codes for cross-referencing
- Popularity/follower metrics
**Weaknesses:**
- No data validation visible
- Potential duplicates (not deduplicated)
- Missing ISRCs for some tracks
- Incomplete artist roles
## Storage Requirements
### Disk Space
| Component | Size | Compressible |
|-----------|------|--------------|
| main_database.sqlite3 | ~117GB | Minimal (already compact) |
| track_files.sqlite3 | ~99GB | Minimal (JSON fields) |
| **Total** | **~216GB** | - |
**Recommendations:**
- SSD strongly recommended (HDD too slow for 256M rows)
- NVMe for best performance
- RAID not necessary (read-only, can rebuild from backup)
### Memory Usage
**SQLite memory:**
- Page cache: 64MB per connection
- 8 connections: 512MB cache total
- Memory-mapped I/O: 1GB per database (2GB total)
- **Total: ~2.5GB minimum**
**Application memory:**
- Go runtime: ~50MB
- Rate limiter map: Grows unbounded (leak)
- Request buffers: ~10MB per concurrent request
- **Total: ~100MB + leak**
**Recommended RAM:** 4GB+ (2.5GB for SQLite + 1.5GB for OS/app)
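The SQLite figure above is a simple sum of page caches and memory-mapped regions. The sketch below reproduces it and shows how the estimate moves when the cache is doubled; treat it as an upper bound, since memory-mapped pages only occupy RAM once they are touched.

```python
def sqlite_memory_mb(cache_mb_per_conn=64, connections=8, mmap_mb_per_db=1024, databases=2):
    """Upper-bound estimate in MB: page caches plus memory-mapped regions."""
    return cache_mb_per_conn * connections + mmap_mb_per_db * databases

current = sqlite_memory_mb()                            # documented settings
larger_cache = sqlite_memory_mb(cache_mb_per_conn=128)  # cache doubled per connection

print(current / 1024, larger_cache / 1024)  # 2.5 3.0
```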
### I/O Characteristics
**Read patterns:**
- Random reads (track lookups by ID/ISRC)
- Sequential scans (search queries)
- Batch reads (IN clause queries)
**Write patterns:**
- None (read-only)
**Cache effectiveness:**
- Hot data (popular tracks): High hit rate
- Cold data (obscure tracks): Low hit rate
- Search queries: Low hit rate (full scans)
## Database Maintenance
### No Maintenance Required
**Read-only benefits:**
- No VACUUM needed (no fragmentation from deletes)
- No ANALYZE needed (statistics static)
- No REINDEX needed (indexes don't degrade)
- No WAL checkpoint (journal disabled)
### Backup Strategy
**Simple backup:**
```bash
# Copy files (database must be idle)
cp main_database.sqlite3 backup/
cp track_files.sqlite3 backup/
```
**Online backup (while running):**
```bash
# Uses the sqlite3 CLI's built-in .backup command (SQLite online backup API)
sqlite3 main_database.sqlite3 ".backup backup/main_database.sqlite3"
```
**Restore:**
```bash
# Simply replace files
cp backup/main_database.sqlite3 .
cp backup/track_files.sqlite3 .
```
### Integrity Checks
**Verify database integrity:**
```bash
sqlite3 main_database.sqlite3 "PRAGMA integrity_check;"
sqlite3 track_files.sqlite3 "PRAGMA integrity_check;"
```
**Expected output:** `ok`
**Run periodically:** Monthly or after hardware issues
## Performance Tuning
### Query Optimization
**Indexes already present:**
- Primary keys on all ID columns
- Foreign key indexes (album_rowid, artist_id, etc.)
- Indexes on `tracks.name` and `artists.name` (not usable by the leading-wildcard `LIKE '%query%'` searches)
**Missing indexes (potential improvements):**
- Full-text search index (FTS5) on track/artist names
- Composite index on (popularity, name) for sorted searches
### Connection Pool Tuning
**Current settings:**
```go
MaxOpenConns: 8
MaxIdleConns: 8
ConnMaxLifetime: 0
```
**Tuning considerations:**
- Raise MaxOpenConns (e.g., 16-32) only on machines with that many CPU cores; beyond the core count there is little benefit
- Monitor CPU usage (SQLite is CPU-bound for searches)
- Each extra connection adds its own 64MB page cache
### Cache Tuning
**Current cache:** 64MB per connection (512MB total)
**Increase cache:**
```
_cache_size=-128000 // 128MB per connection
```
**Tradeoff:** More memory usage vs fewer disk reads
**Recommendation:** Monitor cache hit rate, increase if low
### Memory-Mapped I/O Tuning
**Current mmap:** 1GB per database
**Increase mmap:**
```
_mmap_size=2147483648 // 2GB
```
**Tradeoff:** More virtual memory vs faster reads
**Recommendation:** Set to database size if RAM allows (117GB not feasible)
## Data Model Comparison
### vs Spotify Web API
| Feature | Music Metadata API | Spotify Web API |
|---------|-------------------|-----------------|
| Track ID format | Spotify-compatible | Spotify IDs |
| ISRC support | Yes | Yes |
| Popularity | Static snapshot | Real-time |
| Followers | Static snapshot | Real-time |
| Images | External URLs | External URLs |
| Genres | Artist-level | Artist-level |
| Lyrics | Flag only | Not available |
| Artist roles | Detailed | Limited |
| Languages | Supported | Not available |
### vs MusicBrainz
| Feature | Music Metadata API | MusicBrainz |
|---------|-------------------|-------------|
| Identifier | Spotify IDs, ISRC | MBIDs |
| Dataset size | 256M tracks | ~40M recordings |
| Popularity | Yes | No |
| Followers | Yes | No |
| Images | Yes (external) | Yes (Cover Art Archive) |
| Genres | Yes | Yes (tags) |
| Relationships | Limited | Extensive |
| Credits | Artist roles | Detailed credits |
| Updates | Static | Community-driven |
## Integration Considerations
### Joining with Other Databases
**ISRC as common key:**
```sql
-- Join with local library
SELECT l.file_path, m.name, m.popularity
FROM local_library l
JOIN music_metadata_api.tracks m ON l.isrc = m.isrc
```
**ISRC as bridge to MusicBrainz:**
```sql
-- Join with MusicBrainz (recording.gid holds the MBID)
SELECT mb.gid AS mbid, mm.name, mm.popularity
FROM musicbrainz.recording mb
JOIN musicbrainz.isrc i ON mb.id = i.recording
JOIN music_metadata_api.tracks mm ON i.isrc = mm.isrc
```
### Data Export
**Export to JSON:**
```bash
sqlite3 main_database.sqlite3 <<EOF
.mode json
.output tracks.json
SELECT * FROM tracks LIMIT 1000;
EOF
```
**Export to CSV:**
```bash
sqlite3 main_database.sqlite3 <<EOF
.mode csv
.output tracks.csv
SELECT id, name, isrc, popularity FROM tracks;
EOF
```
### Data Import
**Import from CSV:**
```bash
sqlite3 new_database.sqlite3 <<EOF
.mode csv
.import tracks.csv tracks
EOF
```
**Bulk insert from application:**
```go
tx, _ := db.Begin()
stmt, _ := tx.Prepare("INSERT INTO tracks VALUES (?, ?, ?, ...)")
for _, track := range tracks {
stmt.Exec(track.ID, track.Name, track.ISRC, ...)
}
tx.Commit()
```
## Limitations
### No Write Operations
**Implications:**
- Can't add new tracks
- Can't update popularity scores
- Can't delete duplicates
- Can't fix data errors
**Workarounds:**
- Create separate writable database for local additions
- Use views to merge read-only + writable data
- Periodically obtain updated database snapshot
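The "views to merge read-only + writable data" workaround can be sketched with `ATTACH` and a temporary view, since a `TEMP` view may reference tables in any attached database. Both databases here are in-memory stand-ins with a simplified, assumed `tracks` schema; in practice the snapshot would be the read-only file and `local` a separate writable file.

```python
import sqlite3

# Autocommit mode keeps ATTACH outside of any implicit transaction
conn = sqlite3.connect(":memory:", isolation_level=None)

# "main" plays the read-only snapshot (simplified schema, made-up row)
conn.execute("CREATE TABLE tracks (id TEXT PRIMARY KEY, name TEXT, isrc TEXT)")
conn.execute("INSERT INTO tracks VALUES ('snap_1', 'Snapshot Track', 'USRC12345678')")

# "local" plays the separate writable database for local additions
conn.execute("ATTACH DATABASE ':memory:' AS local")
conn.execute("CREATE TABLE local.tracks (id TEXT PRIMARY KEY, name TEXT, isrc TEXT)")
conn.execute("INSERT INTO local.tracks VALUES ('local_1', 'Local Addition', 'GBUM71234567')")

# The TEMP view lives only for this connection and spans both databases
conn.execute("""
    CREATE TEMP VIEW all_tracks AS
    SELECT id, name, isrc, 'snapshot' AS source FROM main.tracks
    UNION ALL
    SELECT id, name, isrc, 'local' AS source FROM local.tracks
""")

merged = conn.execute("SELECT id, source FROM all_tracks ORDER BY id").fetchall()
```

Because the view is temporary, it has to be recreated on every new connection; that is the price of never writing to the snapshot file.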
### No Full-Text Search
**Current search:** `LIKE %query%` (slow)
**FTS5 alternative:**
```sql
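-- Create FTS5 virtual table (requires writable database)
CREATE VIRTUAL TABLE tracks_fts USING fts5(name, content='tracks', content_rowid='rowid');
-- External-content tables need the source rowid alongside the indexed text
INSERT INTO tracks_fts(rowid, name) SELECT rowid, name FROM tracks;
-- Fast search (FTS5 matches whole tokens or prefixes, not arbitrary substrings)
SELECT rowid, name FROM tracks_fts WHERE tracks_fts MATCH 'bohemian';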
```
**Limitation:** Can't create FTS5 on read-only database
**Workaround:** Create separate FTS5 database, sync periodically
### No Relationships Beyond Basics
**Missing relationships:**
- Track-to-track (similar tracks, remixes)
- Album-to-album (compilations, deluxe editions)
- Artist-to-artist (collaborations, bands)
**Workaround:** Build relationship graph in separate database
@@ -0,0 +1,761 @@
# Music Metadata API - Evaluation
## Executive Summary
Music Metadata API is a **simple, focused, self-contained** service for querying metadata on 256 million music tracks. It excels at batch lookups and ISRC-based queries but lacks authentication, testing, and real-time data updates.
**Best for:** Self-hosted metadata enrichment, high-volume batch processing, ISRC resolution
**Not suitable for:** Real-time data, production systems requiring authentication, mission-critical applications without testing
## Strengths
### 1. Massive Dataset
**256 million tracks** across two SQLite databases (~216GB)
**Coverage:**
- Tracks with ISRC codes
- Albums with artwork, labels, release dates
- Artists with genres, follower counts, popularity
- Extended metadata (lyrics flags, languages, artist roles)
**Comparison:**
- Spotify Web API: Full catalog (real-time)
- MusicBrainz: ~40M recordings
- Discogs: ~15M releases
**Value:** Comprehensive coverage for metadata enrichment without API rate limits.
### 2. Extremely Simple Architecture
**No framework, no ORM, minimal dependencies:**
- Go stdlib for HTTP, JSON, database
- 2 external packages (sqlite driver, rate limiter)
- ~1,100 lines of code
- Single binary deployment
**Benefits:**
- Easy to understand and modify
- Fast compilation
- No framework lock-in
- Minimal attack surface
**Comparison:**
- Typical web service: 10+ dependencies, framework overhead
- Music Metadata API: 2 dependencies, standard library for everything else
### 3. High-Performance Batch API
**Batch endpoint:** Process up to 400 items per request
**Performance gain:**
- Individual requests: 400 × ~50ms = 20 seconds
- Batch request: ~200-500ms total
- **40-100x faster**
**Query optimization:**
- Without batching: 2,800+ queries for 400 tracks
- With batching: 7 queries for 400 tracks
- **400x fewer queries**
**Use case:** Enriching large music libraries efficiently.
### 4. Pure Go (No CGO)
**CGO_ENABLED=0** - No C dependencies
**Benefits:**
- Cross-compilation trivial (GOOS/GOARCH)
- No C toolchain required
- Smaller attack surface
- Easier deployment (static binary)
**Tradeoff:** Larger binary size vs CGO SQLite driver (~2MB vs ~500KB)
### 5. Read-Only Safety
**Databases opened in read-only mode:**
- No accidental writes
- No data corruption risk
- Safe concurrent reads
- No write locks
**Connection string parameters:**
```
mode=ro
_journal_mode=off
_query_only=true
```
**Benefit:** Multiple instances can share database files safely.
### 6. OpenAPI Documentation
**Comprehensive OpenAPI 3.1 spec:**
- All endpoints documented
- Request/response schemas
- Example payloads
- Interactive Swagger UI at `/docs`
**Value:** Self-documenting API, easy integration.
### 7. MIT License
**Permissive license:**
- Free for commercial use
- Copyright and license notice must be kept in copies (the license's main condition)
- Modify and redistribute freely
**Comparison:**
- Spotify Web API: Proprietary, rate limited
- MusicBrainz: CC0 (core data), CC BY-NC-SA (supplementary data), GPL (server)
### 8. Easy Deployment
**Multiple deployment options:**
- Standalone binary (single executable)
- Docker container (official image)
- Kubernetes (example manifests)
- Cloud platforms (ECS, Cloud Run, ACI)
**Minimal requirements:**
- 216GB disk (databases)
- 4GB RAM
- 1 CPU core
**No external dependencies:**
- No database server (SQLite embedded)
- No cache server (SQLite cache)
- No message queue
- No authentication service
## Weaknesses
### 1. Zero Test Coverage
**No test files, no test framework, no CI testing**
**Risks:**
- No regression protection
- Bugs discovered in production
- Difficult to refactor safely
- No documentation via tests
**Evidence:**
- `.gitignore` includes `coverage.out`, which suggests testing was planned but never implemented
- GitHub Actions workflow has no test step
**Impact:** High risk for production use without extensive manual testing.
### 2. No Authentication
**Public API with no access control:**
- No OAuth
- No API keys
- No rate limiting per user (only per IP)
- No usage tracking per user
**Risks:**
- Abuse (queries capped only by the per-IP rate limit)
- No accountability
- No quota enforcement
- Data scraping
**Workarounds:**
- Deploy behind reverse proxy with auth (nginx, Caddy)
- Use API gateway (Kong, Tyk)
- Implement custom middleware
**Impact:** Not suitable for public internet deployment without additional security layer.
### 3. Naive Health Check
**Health endpoint always returns OK:**
```go
func handleHealth(w http.ResponseWriter, r *http.Request) {
json.NewEncoder(w).Encode(map[string]string{"status": "ok"})
}
```
**Problem:** Doesn't verify database connectivity
**Scenario:**
- Database file deleted/corrupted
- Health check returns 200 OK
- Actual queries fail with 500 errors
- Monitoring systems don't detect failure
**Impact:** False positives in monitoring, delayed incident detection.
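A health check that actually touches the databases would close this gap. The service itself is Go; the Python sketch below only illustrates the probing logic (open read-only, read the schema table, report 503 on failure). The `check_databases` name and the file paths are assumptions based on the file names used in this document.

```python
import sqlite3

def check_databases(paths):
    """Probe each SQLite file; return (healthy, per-database status)."""
    details = {}
    for name, path in paths.items():
        try:
            # mode=ro makes a missing file an error instead of creating it
            conn = sqlite3.connect(f"file:{path}?mode=ro", uri=True, timeout=1.0)
            try:
                # Reading sqlite_master forces SQLite to read the file header
                conn.execute("SELECT name FROM sqlite_master LIMIT 1").fetchone()
            finally:
                conn.close()
            details[name] = "ok"
        except sqlite3.Error as exc:
            details[name] = f"error: {exc}"
    healthy = all(status == "ok" for status in details.values())
    return healthy, details

healthy, details = check_databases({
    "main": "main_database.sqlite3",
    "track_files": "track_files.sqlite3",
})
status_code = 200 if healthy else 503  # what a /health handler would return
```

A schema read is cheap and catches missing or unreadable files; it does not detect corruption deep inside the file, which is what `PRAGMA integrity_check` is for.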
### 4. Rate Limiter Memory Leak
**Visitor map grows unbounded:**
```go
type RateLimiter struct {
visitors map[string]*rate.Limiter // Never cleaned up
mu sync.RWMutex
}
```
**Impact:**
- Long-running servers accumulate IPs
- Memory usage grows over time
- 1M unique IPs = ~100MB leak
**Workaround:** Restart server periodically
**Fix required:** Implement visitor cleanup (remove inactive IPs after 24 hours)
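The cleanup amounts to remembering when each IP was last seen and periodically dropping entries older than a TTL. The following Python sketch shows that logic in isolation; `VisitorRegistry` is a hypothetical name, the limiter is a placeholder object, and a real Go implementation would run the sweep in a goroutine while holding the mutex.

```python
import time

class VisitorRegistry:
    """Tracks per-IP limiter state and evicts entries idle longer than ttl."""

    def __init__(self, ttl_seconds=24 * 60 * 60, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self.visitors = {}  # ip -> {"limiter": ..., "last_seen": float}

    def touch(self, ip):
        """Record activity for ip, creating its entry on first sight."""
        entry = self.visitors.setdefault(ip, {"limiter": object(), "last_seen": 0.0})
        entry["last_seen"] = self.clock()
        return entry["limiter"]

    def cleanup(self):
        """Drop visitors not seen within ttl; return how many were removed."""
        cutoff = self.clock() - self.ttl
        stale = [ip for ip, entry in self.visitors.items() if entry["last_seen"] < cutoff]
        for ip in stale:
            del self.visitors[ip]
        return len(stale)

# Demo with a fake clock so the example runs instantly
now = [0.0]
registry = VisitorRegistry(ttl_seconds=10, clock=lambda: now[0])
registry.touch("203.0.113.1")
now[0] = 8.0
registry.touch("203.0.113.2")
now[0] = 15.0
removed = registry.cleanup()  # the first IP has been idle for 15s, above the 10s TTL
```

With periodic sweeps the map size is bounded by the number of distinct IPs seen within one TTL window rather than over the whole process lifetime.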
### 5. No CORS Support
**No CORS headers:**
- Browser-based clients blocked
- Can't call from web apps directly
- OPTIONS preflight requests fail
**Workarounds:**
- Add CORS middleware (custom implementation)
- Use server-side proxy
- Deploy API on same origin as web app
**Impact:** Limited to server-side integrations.
### 6. No Metrics/Monitoring
**No instrumentation:**
- No Prometheus metrics
- No request counters
- No latency histograms
- No error rate tracking
**Visibility gaps:**
- Can't track usage patterns
- Can't identify slow endpoints
- Can't detect error spikes
- No performance baselines
**Workarounds:**
- Parse logs for metrics
- Use reverse proxy metrics (nginx)
- Implement custom metrics middleware
**Impact:** Blind operation, difficult to optimize.
### 7. Database Provenance Unclear
**Repository disclaimer:**
> "This project is not affiliated with Spotify."
**Concerns:**
- Data source unclear (likely scraped)
- Legal status uncertain
- No official Spotify endorsement
- Potential copyright issues
**Risks:**
- Takedown requests
- Legal liability
- Data quality unknown
- No support/updates
**Recommendation:** Verify legal compliance before production use.
### 8. No Data Freshness Mechanism
**Static snapshot:**
- No update mechanism
- Data frozen at time of database creation
- No real-time sync with Spotify
**Staleness:**
- New releases not included
- Popularity scores outdated
- Artist follower counts stale
- Deleted tracks still present
**Workarounds:**
- Periodically obtain updated database (if available)
- Complement with real-time APIs for fresh data
- Treat as historical snapshot
**Impact:** Not suitable for applications requiring current data.
### 9. Search Performance
**LIKE %query% on 256M rows:**
- Full table scan (can't use indexes)
- 10-second timeout (can be hit)
- CPU-intensive
**Slow searches:**
- Common words ("love", "the"): 5-10 seconds
- Rare queries: 10+ seconds (full scan)
**Alternative:** SQLite FTS5 (Full-Text Search)
- Requires writable database (not compatible with read-only mode)
- Would need separate FTS5 database
**Impact:** Search is practical only for queries that complete within the 10-second timeout.
### 10. Hardcoded Configuration
**All limits/timeouts hardcoded:**
- Rate limit: 100 req/s, 200 burst
- Search timeout: 10 seconds
- Batch limit: 400 items
- Connection pool: 8 connections
- SQLite cache: 64MB
**Problems:**
- No flexibility
- Requires recompilation to change
- No environment-specific config
**Workaround:** Fork and modify code
**Impact:** Limited adaptability to different workloads.
## Use Case Evaluation
### Ideal Use Cases
#### 1. Music Library Enrichment
**Scenario:** Enrich local music library with metadata
**Flow:**
1. Obtain ISRCs for audio files (from file tags, or via AcoustID fingerprint → MusicBrainz recording → ISRC)
2. Batch lookup ISRCs (400 at a time)
3. Store metadata in local database
4. Display in music player UI
**Why suitable:**
- Batch API optimized for bulk lookups
- ISRC-based lookup (industry standard)
- No third-party API quotas (self-hosted; only the built-in per-IP limit applies)
- Comprehensive metadata (genres, images, popularity)
**Example:**
```python
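import requests

# Enrich 10,000 tracks
# extract_isrcs_from_library, chunks and store_metadata are application-defined helpers
isrcs = extract_isrcs_from_library()  # 10,000 ISRCs
# Batch lookup (25 requests for 10,000 tracks)
for batch in chunks(isrcs, 400):
    response = requests.post("http://localhost:8080/batch/lookup", json={"isrcs": batch}, timeout=30)
    response.raise_for_status()  # stop on HTTP errors instead of storing an error body
    store_metadata(response.json())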
```
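The batching examples in this section call a `chunks` helper that the document never defines. One possible definition, as a sketch (the name and signature are taken from the calls, everything else is an assumption):

```python
def chunks(items, size):
    """Yield consecutive slices of items, each at most size long."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]

# Placeholder strings standing in for real ISRCs
isrcs = [f"ISRC{i:08d}" for i in range(10_000)]
batches = list(chunks(isrcs, 400))
print(len(batches))  # 25
```

Deduplicating first with `list(dict.fromkeys(isrcs))` keeps the original order and avoids spending batch slots on repeated ISRCs.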
#### 2. Metadata Aggregator Pipeline
**Scenario:** Combine data from multiple sources (MusicBrainz + Music Metadata API)
**Flow:**
1. Query MusicBrainz for recording by MBID
2. Extract ISRC from MusicBrainz response
3. Lookup ISRC in Music Metadata API
4. Merge metadata (MusicBrainz credits + Spotify-style data)
**Why suitable:**
- Complements MusicBrainz (different data models)
- ISRC as common key
- Fast batch lookups
- No external API dependencies
**Example:**
```python
# Get MusicBrainz data
mb_data = musicbrainz.get_recording(mbid)
isrc = mb_data['isrcs'][0]
# Get Spotify-style data
mm_data = requests.get(f"http://localhost:8080/lookup/isrc/{isrc}").json()
# Merge
merged = {
"mbid": mbid,
"isrc": isrc,
"title": mm_data['name'],
"popularity": mm_data['popularity'],
"credits": mb_data['artist-credit'],
"genres": mm_data['artists'][0]['genres']
}
```
#### 3. Self-Hosted Alternative to Spotify API
**Scenario:** Replace Spotify Web API with local service
**Why suitable:**
- No OAuth complexity
- No third-party rate limits (only the built-in per-IP limit of 100 requests/second)
- No per-request costs
- Batch support (400 items vs Spotify's 50)
**Tradeoffs:**
- Static data (no real-time updates)
- Database size (216GB)
- No write operations
**Example:**
```python
# Spotify Web API (rate limited, requires OAuth)
spotify_data = spotify_client.search(q=f"isrc:{isrc}", type="track")
# Music Metadata API (no auth; only the built-in per-IP rate limit)
mm_data = requests.get(f"http://localhost:8080/lookup/isrc/{isrc}").json()
```
#### 4. DJ Software Metadata Provider
**Scenario:** Enrich DJ library with popularity, genres, images
**Why suitable:**
- Batch processing for large libraries
- Popularity scores for track selection
- Genre tags for filtering
- Album artwork for UI
**Example:**
```python
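import requests

# Enrich DJ library (load_dj_library, chunks and update_dj_library are application-defined helpers)
tracks = load_dj_library()  # 5,000 tracks
isrcs = [t.isrc for t in tracks if t.isrc]  # skip tracks without an ISRC
# Batch lookup
for batch in chunks(isrcs, 400):
    response = requests.post("http://localhost:8080/batch/lookup", json={"isrcs": batch}, timeout=30)
    response.raise_for_status()
    update_dj_library(response.json())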
```
### Unsuitable Use Cases
#### 1. Real-Time Music Discovery App
**Why unsuitable:**
- Static data (no new releases)
- Outdated popularity scores
- No personalization
- No user-specific data
**Alternative:** Spotify Web API, Apple Music API
#### 2. Public-Facing API Service
**Why unsuitable:**
- No authentication (abuse risk)
- No usage tracking
- No quota enforcement
- Rate limiter memory leak
**Alternative:** Add authentication layer or use managed API service
#### 3. Mission-Critical Production System
**Why unsuitable:**
- Zero test coverage
- Naive health check
- Memory leak
- No metrics
**Alternative:** Extensive testing + monitoring before production use
#### 4. Applications Requiring Fresh Data
**Why unsuitable:**
- Static snapshot (no updates)
- Stale popularity/follower counts
- Missing new releases
**Alternative:** Spotify Web API, MusicBrainz (community-updated)
## Integration Evaluation
### Complementary Services
**Works well with:**
- **MusicBrainz:** Different data models, ISRC as common key
- **AcoustID:** Fingerprint to MusicBrainz recording, then ISRC, then metadata lookup
- **Local music libraries:** Enrich with metadata
- **DJ software:** Popularity, genres, artwork
**Conflicts with:**
- **Spotify Web API:** Overlapping data, but Music Metadata API is static
- **Real-time services:** Music Metadata API data is stale
### Integration Complexity
**Easy integrations:**
- HTTP client (any language)
- Batch processing pipelines
- Local applications
**Complex integrations:**
- Browser-based apps (no CORS)
- Authenticated services (no auth)
- Real-time systems (static data)
## Performance Evaluation
### Throughput
**Batch endpoint:**
- 400 items per request
- ~200-500ms per request
- **800-2,000 items/second** (single instance)
**Individual endpoints:**
- ~50ms per request
- Rate limited to 100 req/s
- **Up to 100 items/second** per client IP (single instance; needs about 5 requests in flight at ~50ms each)
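The throughput figures follow directly from the quoted latencies. A short Python sketch of the arithmetic, assuming one client that issues requests back to back:

```python
BATCH_SIZE = 400

def items_per_second(items_per_request, seconds_per_request):
    """Throughput of one client issuing requests back to back."""
    return items_per_request / seconds_per_request

batch_low = items_per_second(BATCH_SIZE, 0.5)    # slowest quoted batch latency
batch_high = items_per_second(BATCH_SIZE, 0.2)   # fastest quoted batch latency
sequential_single = items_per_second(1, 0.05)    # one lookup at a time at ~50ms
rate_limit_cap = 100                             # per-IP limit, requests/second
in_flight_needed = rate_limit_cap / sequential_single  # concurrency needed to reach the cap
```

A strictly sequential client therefore tops out near 20 single lookups per second; reaching the 100 requests/second cap takes roughly five concurrent requests, while the batch endpoint exceeds both without any concurrency.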
**Scaling:**
- Horizontal: Run multiple instances (read-only safe)
- Vertical: More RAM (larger cache), faster disk (SSD)
### Latency
**Typical latencies:**
- Track lookup: 10-50ms
- Album lookup: 10-50ms
- Artist lookup: 10-50ms
- Batch lookup (400 items): 200-500ms
- Search: 1-10 seconds (depends on query)
**Bottlenecks:**
- Search queries (LIKE %query%)
- Disk I/O (use SSD)
- Rate limiter (RWMutex contention)
### Resource Usage
**Disk:** 216GB (databases)
**RAM:** 2.5GB (SQLite cache + mmap) + 1.5GB (app/OS) = 4GB minimum
**CPU:** 1 core minimum, 2+ recommended (search queries CPU-intensive)
**Scaling costs:**
- 10 instances = 2.16TB storage (expensive)
- Shared filesystem (NFS, EFS) reduces storage cost but increases latency
## Security Evaluation
### Vulnerabilities
**High severity:**
- **No authentication:** Anyone can query API
- **No rate limiting per user:** IP-based only (easily bypassed)
**Medium severity:**
- **Memory leak:** Rate limiter grows unbounded
- **No input sanitization:** SQL injection risk (mitigated by parameterized queries)
**Low severity:**
- **No HTTPS:** Deploy behind reverse proxy with TLS
- **No CORS:** Browser-based attacks limited
### Mitigations
**Authentication:**
- Deploy behind reverse proxy with auth (nginx, Caddy)
- Use API gateway (Kong, Tyk)
**Rate limiting:**
- Implement per-user rate limiting (requires auth)
- Use distributed rate limiter (Redis)
**Memory leak:**
- Restart server periodically
- Implement visitor cleanup
**HTTPS:**
- Terminate TLS at reverse proxy
- Use Let's Encrypt for free certificates
## Reliability Evaluation
### Failure Modes
**Database unavailable:**
- Health check returns OK (false positive)
- Queries fail with 500 errors
- No automatic recovery
**Memory exhaustion:**
- Rate limiter leak accumulates
- OOM kill by OS
- Service restart required
**Disk full:**
- SQLite read-only (no writes)
- No impact on service
**Network partition:**
- No external dependencies
- Service continues (self-contained)
### Recovery
**Automatic recovery:**
- Graceful shutdown on SIGINT/SIGTERM
- Docker/Kubernetes restart on failure
**Manual recovery:**
- Restart service (clears rate limiter leak)
- Restore database from backup
- Check database integrity (PRAGMA integrity_check)
### High Availability
**Strategies:**
- Run multiple instances (read-only safe)
- Load balancer distributes traffic
- Health checks route around failures (but naive health check is a problem)
**Limitations:**
- No shared state: rate limits are enforced per instance, not globally (on the plus side, no session affinity is required)
- Each instance needs its own copy of the database files (or a shared filesystem)
## Cost Evaluation
### Infrastructure Costs
**Single instance:**
- Compute: $20-50/month (2 CPU, 8GB RAM)
- Storage: $20-40/month (250GB SSD)
- Network: $5-10/month (1TB transfer)
- **Total: $45-100/month**
**10 instances (high availability):**
- Compute: $200-500/month
- Storage: $200-400/month (2.5TB SSD, or shared filesystem)
- Network: $50-100/month
- **Total: $450-1,000/month**
**Comparison:**
- Spotify Web API: No published per-request pricing; usage is bounded by rate limits and the developer terms
- MusicBrainz: Free (donations encouraged)
### Development Costs
**Initial setup:**
- Deploy service: 1-2 hours
- Obtain databases: Unknown (not in repository)
- Test integration: 2-4 hours
- **Total: 4-8 hours**
**Ongoing maintenance:**
- Monitor service: 1-2 hours/month
- Update databases: Unknown (no update mechanism)
- Security patches: 1-2 hours/month
- **Total: 2-4 hours/month**
### Total Cost of Ownership
**Year 1:**
- Infrastructure: $540-1,200 (single instance)
- Development: $400-800 (setup + 12 months maintenance: 28-56 hours, which implies a rate of about $14/hour)
- **Total: $940-2,000**
**Comparison:**
- Spotify Web API: No direct API fees; the cost is integration effort and working within rate limits and terms of use
- MusicBrainz: $0 (free, donations encouraged)
## Recommendation Matrix
| Use Case | Suitability | Reasoning |
|----------|-------------|-----------|
| Music library enrichment | ⭐⭐⭐⭐⭐ | Ideal: Batch API, ISRC lookup, no rate limits |
| Metadata aggregator | ⭐⭐⭐⭐⭐ | Ideal: Complements MusicBrainz, fast lookups |
| Self-hosted alternative | ⭐⭐⭐⭐ | Good: No auth complexity, but static data |
| DJ software integration | ⭐⭐⭐⭐ | Good: Popularity, genres, artwork |
| Real-time music app | ⭐⭐ | Poor: Static data, no updates |
| Public API service | ⭐⭐ | Poor: No auth, no metrics, memory leak |
| Mission-critical system | ⭐ | Very poor: No tests, naive health check |
| Fresh data required | ⭐ | Very poor: Static snapshot, no updates |
**Legend:**
- ⭐⭐⭐⭐⭐ Ideal
- ⭐⭐⭐⭐ Good
- ⭐⭐⭐ Acceptable
- ⭐⭐ Poor
- ⭐ Very poor
## Final Verdict
### Overall Rating: 7/10
**Breakdown:**
- **Functionality:** 9/10 (comprehensive metadata, batch API)
- **Performance:** 8/10 (fast batch, slow search)
- **Reliability:** 5/10 (no tests, memory leak, naive health check)
- **Security:** 4/10 (no auth, IP-only rate limiting, no built-in TLS)
- **Maintainability:** 6/10 (simple code, but no tests)
- **Documentation:** 8/10 (OpenAPI spec, but minimal code comments)
### Strengths Summary
1. Massive dataset (256M tracks)
2. Simple architecture (no framework)
3. High-performance batch API (400 items/request)
4. Pure Go (no CGO)
5. Read-only safety
6. OpenAPI documentation
7. MIT license
8. Easy deployment
### Weaknesses Summary
1. Zero test coverage
2. No authentication
3. Naive health check
4. Rate limiter memory leak
5. No CORS
6. No metrics
7. Database provenance unclear
8. No data freshness
9. Slow search (LIKE %query%)
10. Hardcoded configuration
### Recommendation
**Use Music Metadata API if:**
- You need to enrich large music libraries (batch processing)
- You want ISRC-based lookups without API rate limits
- You can tolerate static data (no real-time updates)
- You can deploy behind reverse proxy (for auth/CORS)
- You can implement monitoring (metrics, proper health checks)
- You can accept legal uncertainty (database provenance)
**Don't use Music Metadata API if:**
- You need real-time data (use Spotify Web API)
- You need production-grade reliability (no tests)
- You need authentication out-of-the-box
- You need fresh data (new releases, current popularity)
- You can't tolerate 216GB storage requirement
### Improvement Priorities
**Critical (before production):**
1. Add test coverage (unit + integration tests)
2. Fix rate limiter memory leak
3. Implement proper health check (verify database)
4. Add authentication (or deploy behind auth proxy)
**High priority:**
1. Add metrics/monitoring (Prometheus)
2. Implement CORS support
3. Extract hardcoded config (environment variables)
4. Use LOG_LEVEL environment variable
**Medium priority:**
1. Improve search performance (FTS5)
2. Add request logging
3. Structured error responses
4. Documentation (code comments)
**Low priority:**
1. Caching layer (Redis)
2. Horizontal scaling improvements
3. Database update mechanism
4. Admin API (stats, cache control)
@@ -0,0 +1,899 @@
# Music Metadata API - Integrations
## Integration Overview
Music Metadata API is a **fully self-contained service** with zero external integrations at runtime. All data is served from pre-populated SQLite databases, with no external API calls, no authentication services, and no third-party dependencies beyond the Go standard library and two offline Go modules.
```
┌─────────────────────────────────────────────────────────────┐
│ Music Metadata API │
│ (Self-Contained Service) │
│ │
│ ┌────────────┐ ┌────────────┐ ┌────────────┐ │
│ │ HTTP │ │ Database │ │ Models │ │
│ │ Handlers │→ │ Layer │→ │ Layer │ │
│ └────────────┘ └────────────┘ └────────────┘ │
│ ↓ │
│ ┌─────────────┐ │
│ │ SQLite │ │
│ │ Databases │ │
│ │ (216GB) │ │
│ └─────────────┘ │
└─────────────────────────────────────────────────────────────┘
│ NO external calls
(All data local)
```
## Runtime Dependencies
### Go Standard Library
**Packages used:**
- `net/http` - HTTP server and routing
- `database/sql` - Database interface
- `encoding/json` - JSON serialization
- `log/slog` - Structured logging
- `context` - Request context and timeouts
- `sync` - Concurrency primitives (RWMutex)
- `flag` - CLI argument parsing
- `os/signal` - Graceful shutdown
**No external HTTP calls:** The server only accepts inbound requests and never issues outbound ones; all HTTP handling uses the stdlib.
### External Go Modules
**modernc.org/sqlite v1.34.4**
- Pure Go SQLite driver
- No CGO required
- No C dependencies
- No external network calls
**golang.org/x/time v0.14.0**
- Rate limiting (token bucket)
- No external network calls
- Pure algorithm implementation
**Total external dependencies:** 2 packages (both offline)
## Data Sources
### Pre-Populated Databases
**Source:** User must obtain databases separately (not included in repository)
**Database files:**
- `main_database.sqlite3` (~117GB)
- `track_files.sqlite3` (~99GB)
**Provenance:** Unclear (repository states "not affiliated with Spotify")
**Update mechanism:** None (static snapshot)
**Implications:**
- No real-time data sync
- No automatic updates
- User responsible for obtaining databases
- Legal status uncertain
### No External APIs
**What's NOT integrated:**
- Spotify Web API (no OAuth, no API calls)
- MusicBrainz API (no lookups)
- Last.fm API (no scrobbling)
- Discogs API (no catalog queries)
- AcoustID API (no fingerprinting)
- Cover Art Archive (no image fetching)
**All data served from local databases.**
## Browser-Side Dependencies
### Swagger UI (Documentation Only)
**Endpoint:** `/docs`
**External resources loaded by browser:**
```html
<!-- Loaded from unpkg.com CDN -->
<script src="https://unpkg.com/swagger-ui-dist@5/swagger-ui-bundle.js"></script>
<link rel="stylesheet" href="https://unpkg.com/swagger-ui-dist@5/swagger-ui.css" />
```
**Characteristics:**
- Loaded client-side (browser fetches)
- Server doesn't make requests to unpkg.com
- May work offline after the first load, as long as the browser still has the assets cached
- Only affects `/docs` endpoint (not API functionality)
**Implications:**
- Requires internet connection for first `/docs` visit
- Subsequent visits may work offline while the assets remain in the browser cache
- API endpoints work without internet
### Image URLs (External CDN)
**Image hosting:** Spotify CDN (i.scdn.co)
**Example URLs:**
```
https://i.scdn.co/image/ab67616d0000b273ce4f1737bc8a646c8c4bd25a
https://i.scdn.co/image/af2b8e57f6d7b5d1c9a5f3e8d4c2b1a0e9f8d7c6
```
**Characteristics:**
- API returns URLs (not image data)
- Client responsible for fetching images
- Server never fetches images
- Images hosted externally (not by API)
**Implications:**
- Image availability depends on Spotify CDN
- No image caching by API
- Clients need internet to display images
- Broken links possible if Spotify removes images
## No Authentication Integration
### No OAuth
**What's missing:**
- No OAuth 2.0 flow
- No token validation
- No user authentication
- No API keys
**Implications:**
- Public API (anyone can query)
- No usage tracking per user
- No quota enforcement per user
- No access control
**Workarounds:**
- Deploy behind reverse proxy with auth (nginx, Caddy)
- Use API gateway (Kong, Tyk)
- Implement custom middleware
### No Authorization
**What's missing:**
- No role-based access control (RBAC)
- No permission system
- No resource ownership
**Implications:**
- All data accessible to all clients
- No private/public data distinction
- No user-specific data
## No Monitoring Integration
### No Metrics Exporters
**What's missing:**
- No Prometheus metrics
- No StatsD integration
- No OpenTelemetry
- No custom metrics endpoint
**Implications:**
- No visibility into request rates
- No error rate tracking
- No latency percentiles
- No resource usage metrics
**Workarounds:**
- Parse logs for metrics
- Use reverse proxy metrics (nginx, Envoy)
- Implement custom metrics middleware
### No Distributed Tracing
**What's missing:**
- No Jaeger integration
- No Zipkin support
- No trace context propagation
**Implications:**
- Can't trace requests across services
- No performance profiling
- No bottleneck identification
**Workarounds:**
- Add custom tracing middleware
- Use APM tools (Datadog, New Relic)
### No Log Aggregation
**What's missing:**
- No Elasticsearch integration
- No Splunk forwarding
- No CloudWatch Logs
- No structured log shipping
**Logging:** Go stdlib `log/slog` to stdout
**Implications:**
- Logs only in container/process stdout
- No centralized log storage
- No log search/analysis
**Workarounds:**
- Docker log drivers (json-file, syslog, fluentd)
- Kubernetes log collectors (Fluentd, Filebeat)
- Redirect stdout to log aggregator
## No Message Queue Integration
**What's missing:**
- No RabbitMQ
- No Kafka
- No Redis Pub/Sub
- No AWS SQS
**Implications:**
- Synchronous request/response only
- No async job processing
- No event streaming
- No background tasks
**Use case:** All queries processed synchronously (acceptable for read-only API)
## No Cache Integration
### No External Cache
**What's missing:**
- No Redis
- No Memcached
- No Varnish
**Caching:** SQLite page cache only (64MB per connection)
**Implications:**
- No shared cache across instances
- No cache invalidation strategy
- No cache warming
- Cold start on each instance
**Workarounds:**
- Add Redis layer for hot data
- Use HTTP caching headers (not implemented)
- Deploy CDN in front of API
### No HTTP Caching
**What's missing:**
- No `Cache-Control` headers
- No `ETag` support
- No `Last-Modified` headers
**Implications:**
- Clients can't cache responses
- Repeated requests hit database
- No bandwidth savings
**Workarounds:**
- Add caching middleware
- Use reverse proxy with caching (Varnish, nginx)
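The "caching middleware" workaround can be sketched as a pure function that derives an `ETag` from the response body and answers `If-None-Match` with `304 Not Modified`. This is an illustration of the idea in Python, not code from the project (which is written in Go); the function names and the one-hour `max_age` default are assumptions, and weak validators (`W/`) and `*` are deliberately not handled.

```python
import hashlib
from typing import Optional


def etag_for(body: bytes) -> str:
    """Strong ETag derived from the response body."""
    return '"' + hashlib.sha256(body).hexdigest()[:16] + '"'


def conditional_response(body: bytes, if_none_match: Optional[str], max_age: int = 3600):
    """Return (status, headers, body), honoring a client's If-None-Match header."""
    tag = etag_for(body)
    headers = {"ETag": tag, "Cache-Control": f"public, max-age={max_age}"}
    if if_none_match is not None:
        candidates = [t.strip() for t in if_none_match.split(",")]
        if tag in candidates:
            # Client copy is current: send headers only.
            return 304, headers, b""
    return 200, headers, body


status, headers, payload = conditional_response(b'{"id":"t1"}', None)
revalidated = conditional_response(b'{"id":"t1"}', headers["ETag"])
print(status, revalidated[0])  # 200 304
```

Because the databases are a static snapshot, a long `max-age` is plausible here, but the right value depends on how often the database files are swapped out.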
## No Database Replication
**What's missing:**
- No master-slave replication
- No read replicas
- No database clustering
**Database:** Single SQLite file per instance
**Implications:**
- Each instance has full database copy (216GB)
- No shared database across instances
- Horizontal scaling requires full database per instance
**Workarounds:**
- Read-only databases safe to copy
- Use network filesystem (NFS, EFS) for shared access
- Replicate databases to multiple instances
## No Service Discovery
**What's missing:**
- No Consul integration
- No etcd
- No Kubernetes service discovery
- No DNS-based discovery
**Deployment:** Static configuration (IP:port)
**Implications:**
- Manual load balancer configuration
- No dynamic scaling
- No health-based routing
**Workarounds:**
- Use Kubernetes services (automatic discovery)
- Use cloud load balancers (AWS ALB, GCP LB)
- Use service mesh (Istio, Linkerd)
## No Configuration Management
### No External Config
**What's missing:**
- No Consul KV
- No etcd
- No AWS Parameter Store
- No HashiCorp Vault
**Configuration:** CLI flags only (`-db`, `-addr`)
**Implications:**
- All config at startup
- No dynamic reconfiguration
- No secrets management
- Hardcoded timeouts/limits
**Workarounds:**
- Use environment variables (requires code changes)
- Mount config files (requires code changes)
- Use init containers to generate config
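As a rough illustration of the environment-variable workaround, the sketch below reads settings from the environment and falls back to the documented defaults (`:8080`, 100 requests/second, burst of 200). The variable names (`METADATA_DB`, `METADATA_ADDR`, `METADATA_RATE`, `METADATA_BURST`) are hypothetical; the service itself is written in Go and currently reads only the `-db` and `-addr` flags, so an equivalent change would have to be made there.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    db_path: str
    addr: str = ":8080"
    rate_per_second: float = 100.0
    burst: int = 200


def load_config(env=os.environ) -> Config:
    """Build a Config from environment variables, using documented defaults."""
    db_path = env.get("METADATA_DB")  # hypothetical variable name
    if not db_path:
        raise ValueError("METADATA_DB is required")
    return Config(
        db_path=db_path,
        addr=env.get("METADATA_ADDR", ":8080"),
        rate_per_second=float(env.get("METADATA_RATE", "100")),
        burst=int(env.get("METADATA_BURST", "200")),
    )


cfg = load_config({"METADATA_DB": "/data/main_database.sqlite3"})
print(cfg.addr, cfg.burst)  # :8080 200
```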
### No Secrets Management
**What's missing:**
- No Vault integration
- No AWS Secrets Manager
- No Kubernetes secrets
- No encrypted config
**Secrets:** None required (no authentication)
**Implications:**
- No sensitive data to protect
- No credential rotation
- No encryption at rest
**Future consideration:** If adding authentication, integrate secrets manager
## Integration Patterns
### Reverse Proxy Integration
**Use case:** Add authentication, CORS, caching, SSL
**Example with nginx:**
```nginx
# proxy_cache_path is only valid in the http context, so it sits outside server {}
proxy_cache_path /var/cache/nginx levels=1:2 keys_zone=api_cache:10m;
upstream metadata_api {
    server localhost:8080;
}
server {
    listen 443 ssl;
    server_name api.example.com;
    ssl_certificate /etc/ssl/cert.pem;
    ssl_certificate_key /etc/ssl/key.pem;
    # CORS headers
    add_header Access-Control-Allow-Origin *;
    add_header Access-Control-Allow-Methods "GET, POST, OPTIONS";
    # Caching
    proxy_cache api_cache;
    proxy_cache_valid 200 1h;
    # Authentication
    auth_basic "Restricted";
    auth_basic_user_file /etc/nginx/.htpasswd;
    location / {
        proxy_pass http://metadata_api;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
}
```
### API Gateway Integration
**Use case:** Rate limiting, authentication, analytics
**Example with Kong:**
```yaml
_format_version: "3.0"  # required in declarative config; match your Kong version
services:
  - name: metadata-api
    url: http://localhost:8080
    routes:
      - name: metadata-routes
        paths:
          - /
plugins:
  - name: rate-limiting
    config:
      minute: 1000
      policy: local
  - name: key-auth
    config:
      key_names:
        - apikey
  - name: prometheus
    config:
      per_consumer: true
```
### Load Balancer Integration
**Use case:** Distribute traffic across multiple instances
**Example with HAProxy:**
```
frontend metadata_frontend
bind *:80
default_backend metadata_backend
backend metadata_backend
balance roundrobin
option httpchk GET /health
server api1 10.0.1.10:8080 check
server api2 10.0.1.11:8080 check
server api3 10.0.1.12:8080 check
```
### Kubernetes Integration
**Use case:** Container orchestration, auto-scaling
**Example deployment:**
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: metadata-api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: metadata-api
  template:
    metadata:
      labels:
        app: metadata-api
    spec:
      containers:
        - name: api
          image: ghcr.io/aunali321/music-metadata-api:latest
          args: ["-db", "/data/main_database.sqlite3"]
          ports:
            - containerPort: 8080
          volumeMounts:
            - name: database
              mountPath: /data
              readOnly: true
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 30
          resources:
            requests:
              memory: "4Gi"
              cpu: "1"
            limits:
              memory: "8Gi"
              cpu: "2"
      volumes:
        - name: database
          persistentVolumeClaim:
            claimName: metadata-db-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: metadata-api
spec:
  selector:
    app: metadata-api
  ports:
    - port: 80
      targetPort: 8080
  type: LoadBalancer
```
### Monitoring Integration
**Use case:** Metrics, logs, traces
**Example with Prometheus + Grafana:**
**1. Add metrics exporter (custom middleware):**
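```go
// Not implemented in current codebase
import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	requestsTotal = prometheus.NewCounterVec(
		prometheus.CounterOpts{Name: "api_requests_total", Help: "Total HTTP requests."},
		[]string{"method", "endpoint", "status"},
	)
	requestDuration = prometheus.NewHistogramVec(
		prometheus.HistogramOpts{Name: "api_request_duration_seconds", Help: "HTTP request latency."},
		[]string{"method", "endpoint"},
	)
)

// registerMetrics registers the collectors and exposes the /metrics endpoint
// that Prometheus scrapes; without both steps nothing is exported.
func registerMetrics(mux *http.ServeMux) {
	prometheus.MustRegister(requestsTotal, requestDuration)
	mux.Handle("/metrics", promhttp.Handler())
}
```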
**2. Scrape metrics with Prometheus:**
```yaml
scrape_configs:
  - job_name: 'metadata-api'
    static_configs:
      - targets: ['localhost:8080']
```
**3. Visualize in Grafana:**
- Request rate dashboard
- Error rate dashboard
- Latency percentiles (p50, p95, p99)
### Logging Integration
**Use case:** Centralized log aggregation
**Example with Fluentd:**
**1. Configure Docker logging driver:**
```yaml
services:
  metadata-api:
    image: ghcr.io/aunali321/music-metadata-api:latest
    logging:
      driver: fluentd
      options:
        fluentd-address: localhost:24224
        tag: metadata-api
```
**2. Fluentd configuration:**
```
<source>
@type forward
port 24224
</source>
<match metadata-api>
@type elasticsearch
host elasticsearch
port 9200
index_name metadata-api
type_name _doc
</match>
```
### Caching Integration
**Use case:** Reduce database load, improve latency
**Example with Redis:**
**1. Add Redis middleware (custom implementation):**
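```go
// Not implemented in current codebase. Sketch assumes github.com/redis/go-redis/v9
// plus net/http, net/http/httptest, and time from the stdlib.
var redisClient = redis.NewClient(&redis.Options{Addr: "localhost:6379"})

func cacheMiddleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// Only cache GETs: a POST body (e.g. /batch/lookup) is not part of the key
		if r.Method != http.MethodGet {
			next.ServeHTTP(w, r)
			return
		}
		ctx := r.Context()
		key := r.URL.RequestURI() // path plus query string
		// Check Redis cache
		if cached, err := redisClient.Get(ctx, key).Bytes(); err == nil {
			w.Header().Set("Content-Type", "application/json")
			w.Write(cached)
			return
		}
		// Cache miss: capture the handler's response (recorder used for brevity)
		rec := httptest.NewRecorder()
		next.ServeHTTP(rec, r)
		// Store only successful responses in Redis (1 hour TTL)
		if rec.Code == http.StatusOK {
			redisClient.Set(ctx, key, rec.Body.Bytes(), time.Hour)
		}
		for k, v := range rec.Header() {
			w.Header()[k] = v
		}
		w.WriteHeader(rec.Code)
		w.Write(rec.Body.Bytes())
	})
}
```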
**2. Deploy Redis:**
```yaml
services:
  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"
```
## Complementary Services
### MusicBrainz Integration
**Use case:** Resolve MBIDs to ISRCs, then lookup in Music Metadata API
**Flow:**
```
1. Query MusicBrainz for recording by MBID
2. Extract ISRC from MusicBrainz response
3. Lookup ISRC in Music Metadata API
4. Merge metadata (MusicBrainz credits + Spotify-style data)
```
**Example:**
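```python
import requests
# MusicBrainz asks clients to identify themselves with a User-Agent
headers = {"User-Agent": "metadata-example/0.1 (admin@example.com)"}
# Step 1: Get ISRC and artist credits from MusicBrainz ("abc-123" stands in for a real MBID)
mb_url = "https://musicbrainz.org/ws/2/recording/abc-123?fmt=json&inc=isrcs+artist-credits"
mb_response = requests.get(mb_url, headers=headers, timeout=10).json()
isrc = mb_response['isrcs'][0]  # raises IndexError if the recording has no ISRC
# Step 2: Lookup in Music Metadata API
mm_url = f"http://localhost:8080/lookup/isrc/{isrc}"
mm_response = requests.get(mm_url, timeout=10).json()
# Step 3: Merge metadata
merged = {
    "mbid": "abc-123",
    "isrc": isrc,
    "title": mm_response['name'],
    "popularity": mm_response['popularity'],
    "credits": mb_response['artist-credit']  # present only with inc=artist-credits
}
```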
### AcoustID Integration
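**Use case:** Fingerprint audio files, resolve to ISRCs via MusicBrainz recording IDs
**Flow:**
```
1. Generate audio fingerprint (chromaprint)
2. Query AcoustID API with fingerprint
3. Extract MusicBrainz recording ID (AcoustID does not return ISRCs)
4. Resolve recording ID to ISRC via MusicBrainz
5. Lookup ISRC in Music Metadata API
6. Tag audio file with metadata
```
**Example:**
```python
import acoustid
import mutagen
import requests
API_KEY = "your-acoustid-key"  # placeholder
HEADERS = {"User-Agent": "metadata-example/0.1 (admin@example.com)"}
# Step 1: Fingerprint audio file
duration, fingerprint = acoustid.fingerprint_file('song.mp3')
# Step 2: Query AcoustID
response = acoustid.lookup(API_KEY, fingerprint, duration, meta='recordings')
# Step 3: Extract the MusicBrainz recording ID of the best match
mbid = response['results'][0]['recordings'][0]['id']
# Step 4: Resolve the recording ID to an ISRC via MusicBrainz
mb_url = f"https://musicbrainz.org/ws/2/recording/{mbid}?fmt=json&inc=isrcs"
isrc = requests.get(mb_url, headers=HEADERS, timeout=10).json()['isrcs'][0]
# Step 5: Lookup in Music Metadata API
mm_url = f"http://localhost:8080/lookup/isrc/{isrc}"
metadata = requests.get(mm_url, timeout=10).json()
# Step 6: Tag file (easy=True exposes simple keys such as 'title' for ID3)
audio = mutagen.File('song.mp3', easy=True)
audio['title'] = metadata['name']
audio['artist'] = metadata['artists'][0]['name']
audio.save()
```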
### Spotify Web API Integration
**Use case:** Get real-time data, then fallback to Music Metadata API
**Flow:**
```
1. Try Spotify Web API (requires OAuth)
2. If rate limited or unavailable, fallback to Music Metadata API
3. Return cached/static data from Music Metadata API
```
**Example:**
```python
import requests
# spotify_client is assumed to be an already-authenticated client (e.g. spotipy.Spotify)
def get_track_metadata(isrc):
    try:
        # Try Spotify Web API (real-time)
        spotify_data = spotify_client.search(q=f"isrc:{isrc}", type="track")
        return spotify_data['tracks']['items'][0]
    except Exception:
        # Fallback to Music Metadata API (static)
        mm_url = f"http://localhost:8080/lookup/isrc/{isrc}"
        return requests.get(mm_url).json()
```
## Deployment Integrations
### Docker Compose
**Use case:** Local development, simple deployments
**Example:**
```yaml
version: '3.8'
services:
  metadata-api:
    image: ghcr.io/aunali321/music-metadata-api:latest
    ports:
      - "8080:8080"
    volumes:
      - ./data:/data:ro
    command: ["-db", "/data/main_database.sqlite3"]
    restart: unless-stopped
  nginx:
    image: nginx:alpine
    ports:
      - "80:80"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf:ro
    depends_on:
      - metadata-api
```
### Kubernetes
**Use case:** Production deployments, auto-scaling
**See Kubernetes Integration section above**
### Cloud Platforms
**AWS ECS:**
```json
{
  "family": "metadata-api",
  "containerDefinitions": [{
    "name": "api",
    "image": "ghcr.io/aunali321/music-metadata-api:latest",
    "memory": 4096,
    "cpu": 1024,
    "portMappings": [{"containerPort": 8080}],
    "command": ["-db", "/data/main_database.sqlite3"],
    "mountPoints": [{
      "sourceVolume": "database",
      "containerPath": "/data",
      "readOnly": true
    }]
  }],
  "volumes": [{
    "name": "database",
    "efsVolumeConfiguration": {
      "fileSystemId": "fs-12345678"
    }
  }]
}
```
**Google Cloud Run:**
```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: metadata-api
spec:
  template:
    spec:
      containers:
        - image: ghcr.io/aunali321/music-metadata-api:latest
          args: ["-db", "/data/main_database.sqlite3"]
          volumeMounts:
            - name: database
              mountPath: /data
              readOnly: true
      volumes:
        # Cloud Run cannot attach Compute Engine persistent disks; an NFS share
        # (for example Filestore) is one option. Server and path are placeholders.
        - name: database
          nfs:
            server: 10.0.0.2
            path: /metadata-db
```
## No Integration Advantages
### Simplicity
**Benefits:**
- No external service dependencies
- No network calls (faster, more reliable)
- No authentication complexity
- No API rate limits (external)
**Tradeoffs:**
- No real-time data
- No automatic updates
- No distributed features
### Reliability
**Benefits:**
- No cascading failures (no external dependencies)
- No network timeouts (all local)
- No third-party outages
- Predictable performance
**Tradeoffs:**
- Single point of failure (database file)
- No redundancy (unless replicated)
### Performance
**Benefits:**
- No network latency (local database)
- No API rate limits (self-imposed only)
- Batch queries optimized (7 queries vs 2,800)
**Tradeoffs:**
- Database size (216GB per instance)
- Memory usage (2.5GB minimum)
### Cost
**Benefits:**
- No API subscription fees
- No per-request charges
- No data transfer costs (local)
**Tradeoffs:**
- Storage costs (216GB)
- Compute costs (self-hosted)
## Future Integration Opportunities
### Potential Additions
**Authentication:**
- OAuth 2.0 provider (Keycloak, Auth0)
- API key management (custom or Kong)
**Monitoring:**
- Prometheus metrics exporter
- OpenTelemetry tracing
- Structured logging to Elasticsearch
**Caching:**
- Redis for hot data
- HTTP caching headers
- CDN for static responses
**Database:**
- PostgreSQL for writable data
- Read replicas for scaling
- Full-text search (Elasticsearch, Meilisearch)
**Message Queue:**
- Background job processing (Celery, Sidekiq)
- Event streaming (Kafka)
**Configuration:**
- Environment variables
- Config files (YAML, TOML)
- Secrets management (Vault)
### Integration Complexity
**Current:** Zero integrations (simplest possible)
**With additions:** Each integration adds:
- Configuration complexity
- Deployment dependencies
- Failure modes
- Maintenance burden
**Recommendation:** Only add integrations when necessary for specific use cases.
@@ -0,0 +1,321 @@
# Music Metadata API - Overview
## Project Identity
**Name:** Music Metadata API
**Repository:** https://github.com/Aunali321/music-metadata-api
**License:** MIT
**Language:** Go 1.24
**Maintainer:** Single maintainer (Aunali321)
**Status:** Active; needs hardening before production use (no tests, no authentication)
## Purpose
Music Metadata API provides a self-hosted HTTP service for querying metadata on 256 million music tracks. The service operates entirely from pre-populated SQLite databases, requiring no external API calls at runtime. It's designed as a high-performance alternative to commercial music metadata APIs like Spotify's Web API.
## Core Technology Stack
### Runtime Dependencies
| Component | Version | Purpose | Notes |
|-----------|---------|---------|-------|
| Go | 1.24 | Runtime & stdlib HTTP server | Uses Go 1.22+ enhanced routing |
| modernc.org/sqlite | v1.34.4 | Pure Go SQLite driver | No CGO required |
| golang.org/x/time | v0.14.0 | Rate limiting (token bucket) | Second of two external modules |
### Build Configuration
```bash
CGO_ENABLED=0 go build -ldflags="-s -w" ./cmd/server
```
**Flags explained:**
- `CGO_ENABLED=0`: Pure Go binary, no C dependencies
- `-s -w`: Strip debug symbols and DWARF tables (smaller binary)
## Data Scale
### Database Files
| Database | Size | Purpose | Records |
|----------|------|---------|---------|
| main_database.sqlite3 | ~117GB | Core metadata (tracks, albums, artists) | 256M tracks |
| track_files.sqlite3 | ~99GB | Extended track data (lyrics flags, languages, roles) | 256M track files |
| **Total** | **~216GB** | Combined storage requirement | - |
### Dataset Coverage
- **256 million tracks** across all databases
- Album metadata with images, labels, release dates
- Artist metadata with genres, follower counts, popularity scores
- ISRC codes for track identification
- Multi-language support (language_of_performance field)
- Artist role information (performer, composer, etc.)
## Entry Points
### Command Line
**Entry point source:** `cmd/server/main.go` (62 lines)
**Flags:**
```bash
-db string
    Path to main database file (REQUIRED)
-addr string
    HTTP server address (default ":8080")
```
**Example:**
```bash
./metadata-api -db /data/main_database.sqlite3 -addr :8080
```
### Docker
**Image:** `ghcr.io/aunali321/music-metadata-api:latest`
**Base:** Alpine Linux 3.21
**docker-compose.yml:**
```yaml
services:
  metadata-api:
    image: ghcr.io/aunali321/music-metadata-api:latest
    ports:
      - "8080:8080"
    volumes:
      - ./data:/data:ro
    environment:
      - LOG_LEVEL=info  # NOTE: Not actually used in code
    command: ["-db", "/data/main_database.sqlite3"]
    healthcheck:
      test: ["CMD", "wget", "--spider", "-q", "http://localhost:8080/health"]
      interval: 30s
      timeout: 10s
      retries: 3
    restart: unless-stopped
```
## Architecture Layers
### Directory Structure
```
music-metadata-api/
├── cmd/
│ └── server/
│ └── main.go # Entry point (62 lines)
├── internal/
│ ├── api/ # HTTP handlers, routing, middleware
│ │ ├── handlers.go
│ │ ├── ratelimit.go
│ │ └── openapi.go
│ ├── db/
│ │ └── db.go # Database layer (907 lines)
│ └── models/
│ └── models.go # Data structures (65 lines)
├── Dockerfile
├── docker-compose.yml
└── .github/
└── workflows/
└── docker-publish.yml
```
### Layer Responsibilities
**API Layer** (`internal/api/`)
- HTTP request handling
- Rate limiting (token bucket, per-IP)
- OpenAPI specification serving
- Swagger UI hosting
**Database Layer** (`internal/db/`)
- SQLite connection management
- Query execution
- Data enrichment (joining related entities)
- Batch optimization
**Models Layer** (`internal/models/`)
- Data structure definitions
- JSON serialization tags
- Response formatting
## Key Features
### Performance Optimizations
1. **Read-only databases** - No write locks, safe concurrent reads
2. **Conservative PRAGMAs** - Optimized for read-heavy workloads
3. **Batch endpoints** - Process up to 400 items per request
4. **Connection pooling** - MaxOpenConns=8 for controlled resource usage
5. **Memory-mapped I/O** - 1GB mmap for faster reads
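To stay under the 400-item batch limit, a client has to split large ID lists before calling `POST /batch/lookup`. The sketch below shows one way to do that for ISRCs; `batch_payloads` is a hypothetical helper, and the placeholder ISRC strings are synthetic. The `isrcs` field name follows the request body documented in the API reference.

```python
def batch_payloads(isrcs, max_items=400):
    """Split ISRCs into /batch/lookup request bodies of at most max_items each."""
    unique = list(dict.fromkeys(isrcs))  # drop duplicates, keep first-seen order
    return [
        {"isrcs": unique[i:i + max_items]}
        for i in range(0, len(unique), max_items)
    ]


# 1,000 synthetic placeholder codes, not real ISRCs.
payloads = batch_payloads([f"ISRC{n:08d}" for n in range(1000)])
sizes = [len(p["isrcs"]) for p in payloads]
print(sizes)  # [400, 400, 200]
```

Each payload could then be sent with something like `requests.post("http://localhost:8080/batch/lookup", json=payload)`. Deduplicating on the client is optional, since the API accepts duplicate IDs, but it keeps more distinct items within each 400-item request.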
### API Capabilities
- **Batch lookup** - Retrieve multiple tracks/albums/artists in single request
- **ISRC lookup** - Industry-standard track identification
- **Search** - Substring search on tracks and artists (SQL `LIKE`, no full-text index)
- **Relationship traversal** - Album tracks, artist albums, track artists
- **OpenAPI documentation** - Interactive Swagger UI at `/docs`
### Operational Features
- **Graceful shutdown** - 10-second timeout for in-flight requests
- **Health checks** - `/health` endpoint for monitoring
- **Rate limiting** - 100 req/s with 200 burst capacity
- **Structured logging** - Go stdlib `log/slog` for error tracking
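The rate limit above (100 requests/second, burst of 200) follows the token bucket model: the bucket starts full at the burst size, each request spends one token, and tokens refill at the sustained rate. The simulation below is a simplified illustration of that model with an injected clock, not the `golang.org/x/time/rate` implementation the service uses.

```python
class TokenBucket:
    def __init__(self, rate: float, burst: int, now: float = 0.0):
        self.rate = rate            # tokens added per second
        self.burst = burst          # bucket capacity
        self.tokens = float(burst)  # start full
        self.last = now

    def allow(self, now: float) -> bool:
        # Refill for the elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


bucket = TokenBucket(rate=100, burst=200)
burst_ok = sum(bucket.allow(now=0.0) for _ in range(250))
after_one_second = sum(bucket.allow(now=1.0) for _ in range(250))
print(burst_ok, after_one_second)  # 200 100
```

A client that fires 250 requests at once gets 200 through; one second later only 100 more tokens are available.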
## Deployment Models
### Standalone Binary
**Pros:**
- Single executable, no dependencies
- Minimal resource footprint
- Direct filesystem access to databases
**Cons:**
- Manual process management
- No automatic restarts
- Manual log rotation
### Docker Container
**Pros:**
- Consistent runtime environment
- Built-in health checks
- Automatic restarts
- Easy horizontal scaling
**Cons:**
- Requires Docker runtime
- Additional layer of abstraction
- Volume mount for large databases
## Use Cases
### Primary Use Cases
1. **Music library enrichment** - Add metadata to existing track collections
2. **ISRC-based lookup** - Resolve ISRCs to full track metadata
3. **Batch processing** - Enrich large catalogs efficiently
4. **Self-hosted alternative** - Replace commercial APIs with local service
### Integration Scenarios
- **Metadata aggregator pipelines** - Complement MusicBrainz with Spotify-style data
- **Music streaming services** - Populate track/album/artist information
- **DJ software** - Enrich track libraries with popularity, genres, images
- **Music analytics** - Analyze trends across 256M tracks
## Limitations
### Technical Constraints
- **Database size** - Requires 216GB disk space
- **No write operations** - Read-only, no data updates
- **No authentication** - Public API, no access control
- **No CORS** - Cross-origin browser clients blocked
- **Memory leak** - Rate limiter visitor map grows unbounded
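One common remedy for an unbounded per-IP map is to record when each visitor was last seen and periodically drop idle entries. The sketch below illustrates that idea only; the class, its method names, and the 180-second TTL are assumptions, and the project's actual Go rate limiter code is not shown in this document.

```python
class VisitorTable:
    """Per-IP state with idle eviction so the map cannot grow without bound."""

    def __init__(self, ttl_seconds: float = 180.0):
        self.ttl = ttl_seconds
        self.last_seen = {}  # ip -> timestamp of most recent request

    def touch(self, ip: str, now: float) -> None:
        self.last_seen[ip] = now

    def evict_idle(self, now: float) -> int:
        """Remove entries idle longer than ttl; return how many were dropped."""
        stale = [ip for ip, seen in self.last_seen.items() if now - seen > self.ttl]
        for ip in stale:
            del self.last_seen[ip]
        return len(stale)


table = VisitorTable(ttl_seconds=180.0)
table.touch("10.0.0.1", now=0.0)
table.touch("10.0.0.2", now=100.0)
dropped = table.evict_idle(now=200.0)
print(dropped, sorted(table.last_seen))  # 1 ['10.0.0.2']
```

In a Go service the equivalent would typically run in a background goroutine on a ticker while holding the map's mutex.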
### Data Constraints
- **Database provenance unclear** - "Not affiliated with Spotify"
- **No freshness mechanism** - Static snapshot, no updates
- **Search performance** - LIKE queries slow on large datasets (no FTS)
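The difference between `LIKE` substring search and a full-text index can be seen on a toy table. This sketch uses an in-memory database with a made-up `tracks` schema and assumes the local SQLite build includes the FTS5 extension (most Python distributions do); it does not reflect the project's real schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tracks (id TEXT PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO tracks VALUES (?, ?)",
    [("t1", "Blue Monday"), ("t2", "Monday Morning"), ("t3", "Friday Night")],
)

# Substring search: the leading wildcard means every row must be examined.
like_hits = conn.execute(
    "SELECT id FROM tracks WHERE name LIKE ? ORDER BY id", ("%monday%",)
).fetchall()

# FTS5: a tokenized index answers word queries without scanning all rows.
conn.execute("CREATE VIRTUAL TABLE tracks_fts USING fts5(id UNINDEXED, name)")
conn.execute("INSERT INTO tracks_fts (id, name) SELECT id, name FROM tracks")
fts_hits = conn.execute(
    "SELECT id FROM tracks_fts WHERE tracks_fts MATCH ? ORDER BY id", ("monday",)
).fetchall()

print(like_hits == fts_hits)  # True
```

Building such an index requires write access, so for the read-only databases here it would have to be created once, offline, before the files are deployed.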
### Operational Constraints
- **No metrics** - No Prometheus, no counters
- **Naive health check** - Doesn't verify database connectivity
- **Hardcoded config** - Timeouts, limits not configurable
- **No tests** - Zero test coverage
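A health check that verifies database connectivity can be as small as running a trivial query and reporting failure if it raises. The function below is a sketch of that idea using Python's `sqlite3` module; `check_health` and the response shape are illustrative and are not the service's actual `/health` handler.

```python
import sqlite3


def check_health(conn: sqlite3.Connection) -> dict:
    """Report healthy only if the database answers a trivial query."""
    try:
        conn.execute("SELECT 1").fetchone()
        return {"status": "ok"}
    except sqlite3.Error as exc:
        return {"status": "unavailable", "error": str(exc)}


conn = sqlite3.connect(":memory:")
healthy = check_health(conn)
conn.close()
broken = check_health(conn)  # closed connection stands in for a lost database
print(healthy["status"], broken["status"])  # ok unavailable
```

An HTTP handler built on this would return 200 for `ok` and 503 otherwise, so load balancers and orchestrators stop routing to an instance whose database is unreadable.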
## Project Maturity
### Strengths
- Clean, simple codebase
- Production-ready Docker setup
- Comprehensive OpenAPI spec
- Massive dataset (256M tracks)
- Pure Go (no CGO complexity)
### Weaknesses
- Single maintainer
- No test suite
- No CI test step
- Unused config (LOG_LEVEL)
- Memory leak in rate limiter
## Comparison to Alternatives
| Feature | Music Metadata API | Spotify Web API | MusicBrainz API |
|---------|-------------------|-----------------|-----------------|
| Self-hosted | Yes | No | Optional (mirror server) |
| Authentication | None | OAuth required | Optional |
| Dataset size | 256M tracks | Full catalog | ~40M recordings |
| Rate limits | 100 req/s | Varies by tier | 1 req/s |
| Batch support | 400 items | 50 items | Limited |
| Cost | Free (MIT) | Free tier limited | Free |
| Data freshness | Static | Real-time | Community-updated |
| Identifier | ISRC, internal IDs | Spotify IDs | MBIDs |
## Getting Started
### Minimum Requirements
1. Go 1.24+ (for building from source)
2. 216GB disk space for databases
3. Database files (not included in repository)
4. 2.5GB+ RAM (the service's minimum observed memory usage)
### Quick Start
```bash
# Clone repository
git clone https://github.com/Aunali321/music-metadata-api.git
cd music-metadata-api
# Build binary
CGO_ENABLED=0 go build -ldflags="-s -w" -o metadata-api ./cmd/server
# Run server (assumes databases in /data)
./metadata-api -db /data/main_database.sqlite3 -addr :8080
# Test health endpoint
curl http://localhost:8080/health
# View API documentation
open http://localhost:8080/docs  # macOS; use xdg-open on Linux
```
### Docker Quick Start
```bash
# Pull image
docker pull ghcr.io/aunali321/music-metadata-api:latest
# Run container
docker run -d \
-p 8080:8080 \
-v /path/to/databases:/data:ro \
ghcr.io/aunali321/music-metadata-api:latest \
-db /data/main_database.sqlite3
# Check health
curl http://localhost:8080/health
```
## Documentation Resources
- **OpenAPI Spec:** http://localhost:8080/openapi.yaml
- **Interactive Docs:** http://localhost:8080/docs
- **GitHub Repository:** https://github.com/Aunali321/music-metadata-api
- **Docker Image:** ghcr.io/aunali321/music-metadata-api
## License
MIT License - Free for commercial and personal use, provided the copyright and license notice are retained.