feat: initial implementation of metadata aggregator

- gRPC service with MusicBrainz provider
- PostgreSQL schema with migrations
- Service layer with database-first caching
- Repository pattern for data access
- YAML configuration support
- Research documentation for 17 music metadata projects
Alexander
2026-04-28 16:27:14 +02:00
commit a1f6701bac
163 changed files with 95884 additions and 0 deletions
@@ -0,0 +1,895 @@
# Music Metadata API - API Reference
## API Overview
Music Metadata API exposes a RESTful HTTP API with 11 endpoints for querying music metadata. The API is fully documented with OpenAPI 3.1 and includes an interactive Swagger UI.
**Base URL:** `http://localhost:8080` (configurable via `-addr` flag)
**Content-Type:** `application/json`
**Authentication:** None (public API)
**CORS:** Not supported
**Rate Limiting:** 100 requests/second, 200 burst (per-IP)
## Endpoints
### Batch Operations
#### POST /batch/lookup
Retrieve multiple tracks, albums, and artists in a single request.
**Request Body:**
```json
{
"tracks": ["track_id_1", "track_id_2"],
"artists": ["artist_id_1", "artist_id_2"],
"albums": ["album_id_1", "album_id_2"],
"isrcs": ["USRC12345678", "GBUM71234567"]
}
```
**Constraints:**
- Maximum 400 items total across all arrays
- Every field is optional, but the request must include at least one
- Duplicate IDs allowed (deduplicated in response)
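A client with more than 400 IDs has to split them across several requests. The following is a minimal sketch of one way to do that; the function name `build_batch_payloads` and the choice to deduplicate client-side are illustrative assumptions, not part of the API.

```python
MAX_BATCH_ITEMS = 400  # documented limit across all arrays combined


def build_batch_payloads(ids_by_kind, max_items=MAX_BATCH_ITEMS):
    """Split {kind: [ids]} into request bodies holding at most max_items IDs each."""
    # Deduplicate while preserving order. The API accepts duplicates,
    # but dropping them client-side keeps each payload as small as possible.
    pairs = []
    seen = set()
    for kind, ids in ids_by_kind.items():
        for item_id in ids:
            if (kind, item_id) not in seen:
                seen.add((kind, item_id))
                pairs.append((kind, item_id))

    # Cut the flat list into windows and regroup each window by kind.
    payloads = []
    for start in range(0, len(pairs), max_items):
        payload = {}
        for kind, item_id in pairs[start:start + max_items]:
            payload.setdefault(kind, []).append(item_id)
        payloads.append(payload)
    return payloads
```

Each returned dictionary can be sent as the JSON body of one `POST /batch/lookup` call.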
**Response:**
```json
{
"tracks": {
"track_id_1": {
"id": "track_id_1",
"name": "Song Title",
"isrc": "USRC12345678",
"duration_ms": 240000,
"explicit": false,
"track_number": 1,
"disc_number": 1,
"popularity": 85,
"preview_url": "https://...",
"album": { /* Album object */ },
"artists": [ /* Artist objects */ ],
"original_title": "Original Title",
"version_title": "Radio Edit",
"has_lyrics": true,
"languages": ["en", "es"],
"artist_roles": {
"artist_id_1": ["performer", "composer"]
}
}
},
"artists": {
"artist_id_1": { /* Artist object */ }
},
"albums": {
"album_id_1": { /* Album object */ }
},
"isrcs": {
"USRC12345678": { /* Track object */ }
}
}
```
**Status Codes:**
- `200 OK` - Success (even if some items not found)
- `400 Bad Request` - Invalid request (exceeds 400 items, malformed JSON)
- `429 Too Many Requests` - Rate limit exceeded
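Because a `200 OK` does not guarantee that every requested item was found, a client usually needs to compare the response against the request. The sketch below assumes that items which were not found are either absent from the corresponding response map or mapped to `null`; the source does not state which, and `find_missing` is a hypothetical helper name.

```python
def find_missing(request_payload, response_body):
    """Return {kind: [ids]} for requested IDs that have no usable entry in the response."""
    missing = {}
    for kind, requested in request_payload.items():
        found = response_body.get(kind) or {}
        # dict.fromkeys() removes duplicate IDs while keeping request order.
        absent = [item_id for item_id in dict.fromkeys(requested)
                  if found.get(item_id) is None]
        if absent:
            missing[kind] = absent
    return missing
```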
**Performance:**
- Optimized with batch queries (7 queries for 400 items vs 2,400 individual queries)
- Typical response time: 100-500ms for 400 items
**Example:**
```bash
curl -X POST http://localhost:8080/batch/lookup \
-H "Content-Type: application/json" \
-d '{
"isrcs": ["USRC17607839", "GBUM71029604"],
"tracks": ["4cOdK2wGLETKBW3PvgPWqT"]
}'
```
### Track Lookups
#### GET /lookup/isrc/{isrc}
Retrieve track by ISRC (International Standard Recording Code).
**Path Parameters:**
- `isrc` - ISRC code (e.g., `USRC12345678`)
**Response:**
```json
{
"id": "track_id",
"name": "Song Title",
"isrc": "USRC12345678",
"duration_ms": 240000,
"explicit": false,
"track_number": 1,
"disc_number": 1,
"popularity": 85,
"preview_url": "https://p.scdn.co/mp3-preview/...",
"album": {
"id": "album_id",
"name": "Album Title",
"album_type": "album",
"label": "Record Label",
"release_date": "2023-01-15",
"release_date_precision": "day",
"external_id_upc": "123456789012",
"total_tracks": 12,
"copyright_c": "2023 Label",
"copyright_p": "2023 Label",
"images": [
{
"url": "https://i.scdn.co/image/...",
"width": 640,
"height": 640
}
],
"artists": [ /* Album artists */ ]
},
"artists": [
{
"id": "artist_id",
"name": "Artist Name",
"followers_total": 1000000,
"popularity": 90,
"genres": ["pop", "rock"],
"images": [ /* Artist images */ ]
}
],
"original_title": "Original Title",
"version_title": "Radio Edit",
"has_lyrics": true,
"languages": ["en"],
"artist_roles": {
"artist_id": ["performer", "composer"]
}
}
```
**Status Codes:**
- `200 OK` - Track found
- `404 Not Found` - ISRC not in database
- `429 Too Many Requests` - Rate limit exceeded
**Example:**
```bash
curl http://localhost:8080/lookup/isrc/USRC17607839
```
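Validating an ISRC before calling the endpoint avoids spending a request on input that cannot match. The following sketch assumes the common 12-character layout (two-letter prefix, three alphanumeric registrant characters, two-digit year, five-digit designation) and tolerates the hyphenated display form; whether the API itself accepts hyphens or lower case is not documented, so the helper normalizes first.

```python
import re

# Assumed ISRC layout: 2 letters + 3 alphanumerics + 7 digits (year + designation).
_ISRC_RE = re.compile(r"[A-Z]{2}[A-Z0-9]{3}[0-9]{7}")


def normalize_isrc(raw):
    """Return the compact upper-case ISRC, or None if the input does not look valid."""
    candidate = raw.strip().replace("-", "").upper()
    return candidate if _ISRC_RE.fullmatch(candidate) else None
```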
#### GET /lookup/track/{id}
Retrieve track by internal track ID.
**Path Parameters:**
- `id` - Track ID (internal identifier)
**Response:** Same as `/lookup/isrc/{isrc}`
**Status Codes:**
- `200 OK` - Track found
- `404 Not Found` - Track ID not in database
- `429 Too Many Requests` - Rate limit exceeded
**Example:**
```bash
curl http://localhost:8080/lookup/track/4cOdK2wGLETKBW3PvgPWqT
```
### Artist Lookups
#### GET /lookup/artist/{id}
Retrieve artist by ID.
**Path Parameters:**
- `id` - Artist ID
**Response:**
```json
{
"id": "artist_id",
"name": "Artist Name",
"followers_total": 1000000,
"popularity": 90,
"genres": ["pop", "rock", "indie"],
"images": [
{
"url": "https://i.scdn.co/image/...",
"width": 640,
"height": 640
},
{
"url": "https://i.scdn.co/image/...",
"width": 320,
"height": 320
}
]
}
```
**Status Codes:**
- `200 OK` - Artist found
- `404 Not Found` - Artist ID not in database
- `429 Too Many Requests` - Rate limit exceeded
**Example:**
```bash
curl http://localhost:8080/lookup/artist/0TnOYISbd1XYRBk9myaseg
```
### Album Lookups
#### GET /lookup/album/{id}
Retrieve album by ID.
**Path Parameters:**
- `id` - Album ID
**Response:**
```json
{
"id": "album_id",
"name": "Album Title",
"album_type": "album",
"label": "Record Label",
"release_date": "2023-01-15",
"release_date_precision": "day",
"external_id_upc": "123456789012",
"total_tracks": 12,
"copyright_c": "2023 Label",
"copyright_p": "2023 Label",
"images": [
{
"url": "https://i.scdn.co/image/...",
"width": 640,
"height": 640
}
],
"artists": [
{
"id": "artist_id",
"name": "Artist Name",
"followers_total": 1000000,
"popularity": 90,
"genres": ["pop"],
"images": [ /* Artist images */ ]
}
]
}
```
**Status Codes:**
- `200 OK` - Album found
- `404 Not Found` - Album ID not in database
- `429 Too Many Requests` - Rate limit exceeded
**Example:**
```bash
curl http://localhost:8080/lookup/album/2ODvWsOgouMbaA5xf0RkJe
```
#### GET /lookup/album/{id}/tracks
Retrieve all tracks for an album.
**Path Parameters:**
- `id` - Album ID
**Response:**
```json
{
"tracks": [
{
"id": "track_id_1",
"name": "Track 1",
"track_number": 1,
"disc_number": 1,
/* Full track object */
},
{
"id": "track_id_2",
"name": "Track 2",
"track_number": 2,
"disc_number": 1,
/* Full track object */
}
]
}
```
**Status Codes:**
- `200 OK` - Album found (even if no tracks)
- `404 Not Found` - Album ID not in database
- `429 Too Many Requests` - Rate limit exceeded
**Example:**
```bash
curl http://localhost:8080/lookup/album/2ODvWsOgouMbaA5xf0RkJe/tracks
```
### Search
#### GET /search/track
Search tracks by name.
**Query Parameters:**
- `q` - Search query (minimum 2 characters, required)
- `limit` - Maximum results (default 10, max 50)
**Search Behavior:**
- Case-insensitive substring match (`LIKE %query% COLLATE NOCASE`)
- Ordered by popularity (descending)
- 10-second timeout
- Returns partial matches
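The search behavior listed above can be reproduced against an in-memory SQLite database. This is only a sketch: the `COLLATE NOCASE` clause suggests SQLite but the source does not confirm the engine, and the table and column names (`tracks`, `name`, `popularity`) are assumptions. SQLite's `LIKE` is already case-insensitive for ASCII by default, so the sketch omits the collation and instead escapes wildcard characters in user input.

```python
import sqlite3


def search_tracks(conn, query, limit=10):
    """Case-insensitive substring search ordered by popularity (descending)."""
    if len(query) < 2:
        raise ValueError("Query must be at least 2 characters")
    if not 1 <= limit <= 50:
        raise ValueError("limit must be between 1 and 50")
    # Escape LIKE wildcards so user input is matched literally.
    escaped = query.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")
    rows = conn.execute(
        "SELECT id, name FROM tracks "
        "WHERE name LIKE ? ESCAPE '\\' "
        "ORDER BY popularity DESC LIMIT ?",
        (f"%{escaped}%", limit),
    )
    return rows.fetchall()
```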
**Response:**
```json
{
"tracks": [
{
"id": "track_id",
"name": "Song Title",
/* Full track object */
}
],
"total": 1
}
```
**Status Codes:**
- `200 OK` - Search completed (even if no results)
- `400 Bad Request` - Query too short (< 2 chars) or limit too high (> 50)
- `429 Too Many Requests` - Rate limit exceeded
- `504 Gateway Timeout` - Search exceeded 10 seconds
**Example:**
```bash
curl "http://localhost:8080/search/track?q=bohemian&limit=5"
```
**Performance Note:** Search uses `LIKE %query%`. Because of the leading wildcard, a standard index on the name column can't be used, so searches on common terms may be slow (full table scan on 256M tracks).
#### GET /search/artist
Search artists by name.
**Query Parameters:**
- `q` - Search query (minimum 2 characters, required)
- `limit` - Maximum results (default 10, max 50)
**Search Behavior:**
- Case-insensitive substring match (`LIKE %query% COLLATE NOCASE`)
- Ordered by follower count (descending)
- 10-second timeout
- Returns partial matches
**Response:**
```json
{
"artists": [
{
"id": "artist_id",
"name": "Artist Name",
/* Full artist object */
}
],
"total": 1
}
```
**Status Codes:**
- `200 OK` - Search completed (even if no results)
- `400 Bad Request` - Query too short (< 2 chars) or limit too high (> 50)
- `429 Too Many Requests` - Rate limit exceeded
- `504 Gateway Timeout` - Search exceeded 10 seconds
**Example:**
```bash
curl "http://localhost:8080/search/artist?q=beatles&limit=5"
```
### Health & Documentation
#### GET /health
Health check endpoint for monitoring.
**Response:**
```json
{
"status": "ok"
}
```
**Status Codes:**
- `200 OK` - Always (even if database unreachable)
**Limitation:** This is a naive health check. It doesn't verify database connectivity. A database failure won't be detected until actual queries fail.
**Example:**
```bash
curl http://localhost:8080/health
```
#### GET /docs
Interactive Swagger UI for API documentation.
**Response:** HTML page with embedded Swagger UI
**Dependencies:**
- Loads Swagger UI from unpkg.com CDN (browser-side)
- Requires internet connection for first load
- Fetches OpenAPI spec from `/openapi.yaml`
**Example:**
```bash
# Open in browser (macOS; on Linux use xdg-open instead of open)
open http://localhost:8080/docs
```
#### GET /openapi.yaml
OpenAPI 3.1 specification in YAML format.
**Response:** YAML document with full API specification
**Content-Type:** `application/x-yaml`
**Example:**
```bash
curl http://localhost:8080/openapi.yaml
```
## Rate Limiting
### Algorithm
**Implementation:** Token bucket per IP address
**Configuration:**
- **Rate:** 100 requests/second
- **Burst:** 200 requests
- **Scope:** Per-IP (extracted from `X-Forwarded-For` or `RemoteAddr`)
### Behavior
**Token bucket mechanics:**
1. Each IP gets a bucket with 200 tokens (burst capacity)
2. Tokens refill at 100/second
3. Each request consumes 1 token
4. If bucket empty, request rejected with HTTP 429
**Example scenarios:**
| Scenario | Tokens Available | Result |
|----------|------------------|--------|
| First request | 200 | Allowed (199 remaining) |
| 200 requests in 1 second | 200 → 0 | All allowed |
| 201st request in same second | 0 | Rejected (429) |
| Wait 1 second | 0 → 100 | 100 requests allowed |
| Steady 50 req/s | Always > 0 | Never rate limited |
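The mechanics and scenarios above can be reproduced with a small simulation. This is a sketch of the general token-bucket technique with the documented parameters, not the service's actual code; the class name and the explicit `now` argument (used instead of a real clock so the behavior is deterministic) are assumptions.

```python
class TokenBucket:
    """Token bucket with an explicit clock argument so behavior is easy to test."""

    def __init__(self, rate=100.0, burst=200.0, now=0.0):
        self.rate = rate      # tokens added per second
        self.burst = burst    # bucket capacity
        self.tokens = burst   # a new visitor starts with a full bucket
        self.updated = now    # time of the last refill

    def allow(self, now):
        """Refill for the elapsed time, then try to spend one token."""
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # the caller would answer with HTTP 429
```

Calling `allow(0.0)` 201 times on a fresh bucket admits the first 200 calls and rejects the last one; after advancing the clock by one second, 100 further calls are admitted.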
### Response Headers
**When rate limited (HTTP 429):**
```
HTTP/1.1 429 Too Many Requests
Retry-After: 1
Content-Type: text/plain

Rate limit exceeded
```
**Retry-After:** Seconds to wait before retrying (always 1)
### IP Extraction
**Priority:**
1. `X-Forwarded-For` header (first IP if comma-separated)
2. `RemoteAddr` from connection
**Example:**
```
X-Forwarded-For: 203.0.113.1, 198.51.100.1
→ Rate limiter uses 203.0.113.1
```
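The priority order can be expressed as a small function. The sketch assumes the headers are available as a plain dictionary with the exact key `X-Forwarded-For` and that the peer address has the form `host:port` (or `[ipv6]:port`); `client_ip` is a hypothetical name.

```python
def client_ip(headers, remote_addr):
    """Pick the rate-limit key: first X-Forwarded-For entry, else the peer address."""
    forwarded = headers.get("X-Forwarded-For", "")
    first = forwarded.split(",")[0].strip()
    if first:
        return first
    # remote_addr is assumed to look like "host:port" or "[ipv6]:port".
    host, sep, _port = remote_addr.rpartition(":")
    return host.strip("[]") if sep else remote_addr
```

Since `X-Forwarded-For` is supplied by the client, this key is only reliable when a trusted proxy in front of the service sets or overwrites the header.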
### Known Issues
**Memory leak:** Visitor map grows unbounded. No cleanup for inactive IPs. Long-running servers will accumulate memory over time.
**Workaround:** Restart server periodically or implement custom cleanup.
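One possible shape for such a cleanup is to record when each IP was last seen and periodically drop entries that have been idle too long. The sketch below is an assumption about how this could be done, not the project's code; the class name, the 180-second default, and the `touch`/`evict_idle` methods are all hypothetical.

```python
class VisitorRegistry:
    """Tracks per-IP limiter state and forgets IPs that have gone quiet."""

    def __init__(self, idle_ttl=180.0):
        self.idle_ttl = idle_ttl   # seconds of inactivity before eviction (assumed value)
        self._visitors = {}        # ip -> (state, last_seen)

    def touch(self, ip, now, new_state=dict):
        """Return the state for ip, creating it on first sight, and record activity."""
        state, _ = self._visitors.get(ip, (None, None))
        if state is None:
            state = new_state()
        self._visitors[ip] = (state, now)
        return state

    def evict_idle(self, now):
        """Drop visitors idle for longer than idle_ttl; return how many were removed."""
        stale = [ip for ip, (_, seen) in self._visitors.items()
                 if now - seen > self.idle_ttl]
        for ip in stale:
            del self._visitors[ip]
        return len(stale)

    def __len__(self):
        return len(self._visitors)
```

Running `evict_idle` from a background timer bounds memory by the number of IPs active within the last `idle_ttl` seconds.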
## Data Models
### Track Object
```json
{
"id": "string", // Internal track ID
"name": "string", // Track title
"isrc": "string", // ISRC code (optional)
"duration_ms": 0, // Duration in milliseconds
"explicit": false, // Explicit content flag
"track_number": 0, // Track number on album
"disc_number": 0, // Disc number (multi-disc albums)
"popularity": 0, // Popularity score (0-100)
"preview_url": "string", // 30-second preview URL (optional)
"album": { /* Album object */ }, // Parent album
"artists": [ /* Artist objects */ ], // Track artists
"original_title": "string", // Original title (optional)
"version_title": "string", // Version (e.g., "Radio Edit") (optional)
"has_lyrics": false, // Lyrics availability flag
"languages": ["string"], // Languages of performance (optional)
"artist_roles": { // Artist roles map (optional)
"artist_id": ["role1", "role2"]
}
}
```
**Field notes:**
- `isrc`: May be null for some tracks
- `preview_url`: May be null if no preview available
- `popularity`: Higher = more popular (Spotify-style metric)
- `languages`: ISO 639-1 codes (e.g., "en", "es")
- `artist_roles`: Maps artist ID to roles (e.g., "performer", "composer", "producer")
### Album Object
```json
{
"id": "string", // Internal album ID
"name": "string", // Album title
"album_type": "string", // "album", "single", "compilation"
"label": "string", // Record label (optional)
"release_date": "string", // ISO 8601 date (YYYY-MM-DD)
"release_date_precision": "string", // "year", "month", "day"
"external_id_upc": "string", // UPC barcode (optional)
"total_tracks": 0, // Total tracks on album
"copyright_c": "string", // Copyright notice (optional)
"copyright_p": "string", // Phonographic copyright (optional)
"images": [ /* Image objects */ ], // Album artwork (optional)
"artists": [ /* Artist objects */ ] // Album artists (optional)
}
```
**Field notes:**
- `album_type`: Typically "album", "single", or "compilation"
- `release_date_precision`: Indicates granularity of release date
- `external_id_upc`: Universal Product Code (barcode)
- `images`: Sorted by size (largest first)
### Artist Object
```json
{
"id": "string", // Internal artist ID
"name": "string", // Artist name
"followers_total": 0, // Total followers (optional)
"popularity": 0, // Popularity score 0-100 (optional)
"genres": ["string"], // Genres (optional)
"images": [ /* Image objects */ ] // Artist images (optional)
}
```
**Field notes:**
- `followers_total`: Spotify-style follower count
- `popularity`: Higher = more popular
- `genres`: Multiple genres possible (e.g., ["pop", "rock"])
- `images`: Sorted by size (largest first)
### Image Object
```json
{
"url": "string", // Image URL (typically i.scdn.co)
"width": 0, // Width in pixels
"height": 0 // Height in pixels
}
```
**Field notes:**
- URLs reference external CDN (i.scdn.co)
- Multiple sizes available (640x640, 320x320, 64x64 typical)
- Images not hosted by API (external references)
## Error Responses
### Standard Error Format
```json
{
"error": "Error message"
}
```
**Content-Type:** `application/json` (for structured errors) or `text/plain` (for simple errors)
### Common Error Codes
| Status | Meaning | Common Causes |
|--------|---------|---------------|
| 400 | Bad Request | Invalid query params, malformed JSON, validation failure |
| 404 | Not Found | ID/ISRC not in database |
| 429 | Too Many Requests | Rate limit exceeded |
| 500 | Internal Server Error | Database error, query timeout |
| 504 | Gateway Timeout | Search exceeded 10 seconds |
### Example Error Responses
**404 Not Found:**
```json
{
"error": "Track not found"
}
```
**400 Bad Request:**
```json
{
"error": "Query must be at least 2 characters"
}
```
**429 Too Many Requests:**
```
Rate limit exceeded
```
**500 Internal Server Error:**
```json
{
"error": "Database query failed"
}
```
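Since error bodies may arrive as JSON or as plain text, a client needs to handle both. The following sketch shows one way to extract a message; the helper name `error_message` and the fallback string are assumptions.

```python
import json


def error_message(status, content_type, body):
    """Extract a human-readable message from a JSON or plain-text error body."""
    # Ignore parameters such as "; charset=utf-8" when checking the media type.
    if content_type.split(";")[0].strip().lower() == "application/json":
        try:
            parsed = json.loads(body)
        except ValueError:
            parsed = None
        if isinstance(parsed, dict) and isinstance(parsed.get("error"), str):
            return parsed["error"]
    # Plain-text errors (e.g. the 429 response) or unparseable JSON.
    text = body.strip()
    return text or f"HTTP {status}"
```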
## OpenAPI Specification
### Metadata
```yaml
openapi: 3.1.0
info:
title: Music Metadata API
version: 1.0.0
description: API for querying music metadata from 256M tracks
license:
name: MIT
servers:
- url: http://localhost:8080
description: Local development server
```
### Example Endpoint Definition
```yaml
/lookup/track/{id}:
get:
summary: Get track by ID
operationId: getTrack
parameters:
- name: id
in: path
required: true
schema:
type: string
description: Track ID
responses:
'200':
description: Track found
content:
application/json:
schema:
$ref: '#/components/schemas/Track'
'404':
description: Track not found
'429':
description: Rate limit exceeded
```
### Schema Definitions
All data models defined in `components/schemas`:
- `Track`
- `Album`
- `Artist`
- `Image`
- `BatchRequest`
- `BatchResponse`
**Access:** http://localhost:8080/openapi.yaml
## Usage Examples
### Batch Lookup (Python)
```python
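import requests

url = "http://localhost:8080/batch/lookup"
payload = {
    "isrcs": ["USRC17607839", "GBUM71029604"],
    "tracks": ["4cOdK2wGLETKBW3PvgPWqT"],
}

response = requests.post(url, json=payload, timeout=10)
response.raise_for_status()  # surface 400/429 instead of parsing an error body
data = response.json()

for isrc, track in data.get("isrcs", {}).items():
    # Guard against tracks that come back without any artists.
    artists = track.get("artists") or [{"name": "unknown artist"}]
    print(f"{isrc}: {track['name']} by {artists[0]['name']}")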
```
### Search with a Result Limit (JavaScript)
```javascript
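async function searchTracks(query, limit = 10) {
  const url = `http://localhost:8080/search/track?q=${encodeURIComponent(query)}&limit=${limit}`;
  const response = await fetch(url);
  if (!response.ok) {
    // 400 (bad query or limit), 429 (rate limited), or 504 (search timed out)
    throw new Error(`Search failed with HTTP ${response.status}`);
  }
  const data = await response.json();
  return data.tracks;
}

// Top-level await requires an ES module context.
const tracks = await searchTracks("bohemian rhapsody", 5);
tracks.forEach((track) => {
  console.log(`${track.name} - ${track.album.name}`);
});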
```
### Rate Limit Handling (Go)
```go
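import (
	"fmt"
	"net/http"
	"strconv"
	"time"
)

// fetchWithRetry retries on HTTP 429, honoring Retry-After, up to maxRetries times.
func fetchWithRetry(url string, maxRetries int) (*http.Response, error) {
	for attempt := 0; ; attempt++ {
		resp, err := http.Get(url)
		if err != nil {
			return nil, err
		}
		if resp.StatusCode != http.StatusTooManyRequests {
			return resp, nil // caller is responsible for closing resp.Body
		}
		resp.Body.Close() // release the connection before retrying
		if attempt >= maxRetries {
			return nil, fmt.Errorf("rate limited after %d retries", maxRetries)
		}
		// Retry-After is in seconds; fall back to 1s if missing or malformed.
		secs, err := strconv.Atoi(resp.Header.Get("Retry-After"))
		if err != nil || secs < 1 {
			secs = 1
		}
		time.Sleep(time.Duration(secs) * time.Second)
	}
}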
```
## Performance Considerations
### Batch vs Individual Requests
**Individual requests (400 tracks):**
- 400 HTTP requests
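- 400 × ~50ms = 20 seconds when issued sequentially
- Even when parallelized, the rate limit sets a floor: about 2 seconds starting from a full 200-token burst, 4 seconds once the burst is spent (100 req/s)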
**Batch request (400 tracks):**
- 1 HTTP request
- ~200-500ms total
- **40-100x faster**
**Recommendation:** Always use batch endpoint for multiple items.
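The arithmetic behind this comparison can be checked with a few lines. The sketch assumes the ~50 ms per-request latency quoted above and the token-bucket parameters from the Rate Limiting section (100 tokens/second, burst of 200); the function names are illustrative.

```python
def sequential_seconds(n_requests, latency_s=0.05):
    """Wall-clock time when requests are issued one after another."""
    return n_requests * latency_s


def rate_limit_floor_seconds(n_requests, rate=100.0, burst=200.0, tokens=None):
    """Earliest time by which the token bucket can have admitted n_requests."""
    # tokens=None means the bucket starts full (a client that has been idle).
    available = burst if tokens is None else tokens
    return max(0.0, (n_requests - available) / rate)


def speedup(individual_s, batch_s):
    """How many times faster the batch call is than the individual calls."""
    return individual_s / batch_s
```

Under these assumptions, 400 sequential requests take about 20 seconds, and the rate limiter alone admits 400 requests in roughly 2 seconds from a full bucket or 4 seconds from an empty one. Dividing 20 seconds by the 0.2-0.5 second batch time gives the 40-100x range.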
### Search Performance
**Fast searches:**
- Short, specific queries ("beatles")
- Queries matching popular items (returned first)
**Slow searches:**
- Common words ("love", "the")
- Long queries with many results
- Queries requiring full table scan
**Recommendation:** Implement client-side caching for common searches.
### Caching Strategy
**Cacheable:**
- Track/album/artist lookups (data rarely changes)
- Search results (cache for 1 hour)
**Not cacheable:**
- Health checks
- OpenAPI spec (changes with deployments)
**Recommendation:** Use HTTP caching headers (not currently implemented by API).
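Until the API sends caching headers, a client can keep its own time-limited cache. The following is a minimal in-memory sketch using the one-hour lifetime suggested above for search results; `TTLCache` and `get_or_fetch` are hypothetical names, and the `fetch` callable stands in for whatever function performs the HTTP request.

```python
import time


class TTLCache:
    """Tiny in-memory cache whose entries expire after ttl seconds."""

    def __init__(self, ttl=3600.0, clock=time.monotonic):
        self.ttl = ttl
        self._clock = clock   # injectable so expiry can be tested without sleeping
        self._entries = {}    # key -> (expires_at, value)

    def get_or_fetch(self, key, fetch):
        """Return the cached value for key, calling fetch(key) on a miss or expiry."""
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]
        value = fetch(key)
        self._entries[key] = (now + self.ttl, value)
        return value
```

This sketch never evicts expired entries on its own, so a long-running client would also want a size bound or periodic sweep.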
## Integration Patterns
### Enrichment Pipeline
```
1. Extract ISRCs from audio files (e.g., from embedded tags, or via AcoustID fingerprinting and the linked MusicBrainz recordings)
2. Batch lookup ISRCs (400 at a time)
3. Store track metadata in local database
4. Fetch missing artists/albums individually
5. Update local cache
```
### Complementing MusicBrainz
```
1. Start from MusicBrainz (MBID-based)
2. Resolve ISRC from MusicBrainz
3. Lookup ISRC in Music Metadata API
4. Merge metadata (MusicBrainz + Spotify-style data)
```
### Real-time Lookup
```
1. User plays track
2. Extract ISRC from file
3. Check local cache
4. If miss: GET /lookup/isrc/{isrc}
5. Display metadata in UI
```
## Limitations
### No Authentication
**Impact:**
- Anyone can query API
- No usage tracking per user
- No quota enforcement per user
**Mitigation:**
- Deploy behind reverse proxy with auth
- Use firewall rules to restrict access
- Implement API gateway with authentication
### No CORS
**Impact:**
- Browser-based clients blocked
- Can't call from web apps directly
**Mitigation:**
- Add CORS middleware (custom implementation)
- Use server-side proxy
- Deploy API on same origin as web app
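To illustrate what a CORS middleware has to do, here is a sketch written as a Python WSGI wrapper. It is language-neutral in intent: the service's own implementation language is not stated here, and the function name, header values, and the choice to answer preflight requests with `204` are all assumptions.

```python
def cors_middleware(app, allowed_origin="*"):
    """Wrap a WSGI app so that its responses carry basic CORS headers."""
    cors_headers = [
        ("Access-Control-Allow-Origin", allowed_origin),
        ("Access-Control-Allow-Methods", "GET, POST, OPTIONS"),
        ("Access-Control-Allow-Headers", "Content-Type"),
    ]

    def wrapped(environ, start_response):
        # Answer preflight requests directly without calling the wrapped app.
        if environ.get("REQUEST_METHOD") == "OPTIONS":
            start_response("204 No Content", list(cors_headers))
            return [b""]

        def start_with_cors(status, headers, exc_info=None):
            # Append the CORS headers to whatever the app already set.
            return start_response(status, list(headers) + cors_headers, exc_info)

        return app(environ, start_with_cors)

    return wrapped
```

Restricting `allowed_origin` to the web app's own origin is safer than `*` once the API sits behind authentication.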
### No Metrics
**Impact:**
- No visibility into usage patterns
- Can't track error rates
- No performance monitoring
**Mitigation:**
- Add Prometheus metrics (custom implementation)
- Use reverse proxy with metrics (e.g., nginx)
- Parse logs for analytics
### Naive Health Check
**Impact:**
- Health endpoint returns OK even if database down
- Monitoring systems can't detect database failures
**Mitigation:**
- Implement custom health check with database ping
- Monitor actual query endpoints (e.g., /lookup/track/test_id)
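A health check with a database ping can be as small as one trivial query. The sketch below uses Python's built-in `sqlite3` module purely for illustration; the database engine, the `503` status, and the response body for the failure case are assumptions rather than documented behavior.

```python
import sqlite3


def health_status(conn):
    """Return (http_status, body) after a trivial round-trip to the database."""
    try:
        conn.execute("SELECT 1").fetchone()
    except sqlite3.Error as exc:
        # Report the failure so monitoring can tell "process up" from "service usable".
        return 503, {"status": "unavailable", "error": str(exc)}
    return 200, {"status": "ok"}
```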