Alexander a1f6701bac feat: initial implementation of metadata aggregator
- gRPC service with MusicBrainz provider
- PostgreSQL schema with migrations
- Service layer with database-first caching
- Repository pattern for data access
- YAML configuration support
- Research documentation for 17 music metadata projects
2026-04-28 16:28:53 +02:00


# Music Metadata API - Integrations
## Integration Overview
Music Metadata API is a **fully self-contained service** with zero external integrations at runtime. All data is served from pre-populated SQLite databases: the server makes no external API calls, uses no authentication services, and depends on only two third-party Go modules beyond the standard library.
```
┌─────────────────────────────────────────────────────────────┐
│                     Music Metadata API                      │
│                  (Self-Contained Service)                   │
│                                                             │
│    ┌────────────┐    ┌────────────┐    ┌────────────┐       │
│    │    HTTP    │    │  Database  │    │   Models   │       │
│    │  Handlers  │ →  │   Layer    │ →  │   Layer    │       │
│    └────────────┘    └────────────┘    └────────────┘       │
│                            ↓                                │
│                     ┌─────────────┐                         │
│                     │   SQLite    │                         │
│                     │  Databases  │                         │
│                     │   (216GB)   │                         │
│                     └─────────────┘                         │
└─────────────────────────────────────────────────────────────┘
                       NO external calls
                       (All data local)
```
## Runtime Dependencies
### Go Standard Library
**Packages used:**
- `net/http` - HTTP server and routing
- `database/sql` - Database interface
- `encoding/json` - JSON serialization
- `log/slog` - Structured logging
- `context` - Request context and timeouts
- `sync` - Concurrency primitives (RWMutex)
- `flag` - CLI argument parsing
- `os/signal` - Graceful shutdown
**No external HTTP calls:** All functionality implemented with stdlib.
### External Go Modules
**modernc.org/sqlite v1.34.4**
- Pure Go SQLite driver
- No CGO required
- No C dependencies
- No external network calls
**golang.org/x/time v0.14.0**
- Rate limiting (token bucket)
- No external network calls
- Pure algorithm implementation
**Total external dependencies:** 2 packages (both offline)
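
The token bucket behind the rate limiter can be illustrated with a short sketch. The service itself relies on `golang.org/x/time/rate`; the Python class below is only a language-neutral illustration of the algorithm, and the names `TokenBucket`, `rate`, `burst`, and the injectable `clock` are assumptions rather than anything taken from the codebase.

```python
import time


class TokenBucket:
    """Illustrative token bucket: refills at `rate` tokens/second up to `burst`."""

    def __init__(self, rate, burst, clock=time.monotonic):
        self.rate = rate            # tokens added per second
        self.burst = burst          # maximum tokens the bucket can hold
        self.clock = clock          # injectable clock, which eases testing
        self.tokens = float(burst)  # start with a full bucket
        self.last = clock()

    def allow(self):
        """Return True and consume one token if a request may proceed."""
        now = self.clock()
        # Refill in proportion to elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

A limiter built this way permits short bursts up to `burst` requests while holding the long-run average at `rate` requests per second.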
## Data Sources
### Pre-Populated Databases
**Source:** User must obtain databases separately (not included in repository)
**Database files:**
- `main_database.sqlite3` (~117GB)
- `track_files.sqlite3` (~99GB)
**Provenance:** Unclear (repository states "not affiliated with Spotify")
**Update mechanism:** None (static snapshot)
**Implications:**
- No real-time data sync
- No automatic updates
- User responsible for obtaining databases
- Legal status uncertain
### No External APIs
**What's NOT integrated:**
- Spotify Web API (no OAuth, no API calls)
- MusicBrainz API (no lookups)
- Last.fm API (no scrobbling)
- Discogs API (no catalog queries)
- AcoustID API (no fingerprinting)
- Cover Art Archive (no image fetching)
**All data served from local databases.**
## Browser-Side Dependencies
### Swagger UI (Documentation Only)
**Endpoint:** `/docs`
**External resources loaded by browser:**
```html
<!-- Loaded from unpkg.com CDN -->
<script src="https://unpkg.com/swagger-ui-dist@5/swagger-ui-bundle.js"></script>
<link rel="stylesheet" href="https://unpkg.com/swagger-ui-dist@5/swagger-ui.css" />
```
**Characteristics:**
- Loaded client-side (browser fetches)
- Server doesn't make requests to unpkg.com
- Works offline after first load (browser cache)
- Only affects `/docs` endpoint (not API functionality)
**Implications:**
- Requires an internet connection on the first `/docs` visit
- Subsequent visits may work offline while the assets remain in the browser cache
- API endpoints work without internet
### Image URLs (External CDN)
**Image hosting:** Spotify CDN (i.scdn.co)
**Example URLs:**
```
https://i.scdn.co/image/ab67616d0000b273ce4f1737bc8a646c8c4bd25a
https://i.scdn.co/image/af2b8e57f6d7b5d1c9a5f3e8d4c2b1a0e9f8d7c6
```
**Characteristics:**
- API returns URLs (not image data)
- Client responsible for fetching images
- Server never fetches images
- Images hosted externally (not by API)
**Implications:**
- Image availability depends on Spotify CDN
- No image caching by API
- Clients need internet to display images
- Broken links possible if Spotify removes images
## No Authentication Integration
### No OAuth
**What's missing:**
- No OAuth 2.0 flow
- No token validation
- No user authentication
- No API keys
**Implications:**
- Public API (anyone can query)
- No usage tracking per user
- No quota enforcement per user
- No access control
**Workarounds:**
- Deploy behind reverse proxy with auth (nginx, Caddy)
- Use API gateway (Kong, Tyk)
- Implement custom middleware
### No Authorization
**What's missing:**
- No role-based access control (RBAC)
- No permission system
- No resource ownership
**Implications:**
- All data accessible to all clients
- No private/public data distinction
- No user-specific data
## No Monitoring Integration
### No Metrics Exporters
**What's missing:**
- No Prometheus metrics
- No StatsD integration
- No OpenTelemetry
- No custom metrics endpoint
**Implications:**
- No visibility into request rates
- No error rate tracking
- No latency percentiles
- No resource usage metrics
**Workarounds:**
- Parse logs for metrics
- Use reverse proxy metrics (nginx, Envoy)
- Implement custom metrics middleware
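
The log-parsing workaround can be sketched as a small script that reads the service's stdout. This is a minimal sketch under stated assumptions: it assumes `log/slog` is configured to emit one JSON object per line and that request logs carry `status` and `duration_ms` fields. Both field names are hypothetical and should be adjusted to whatever the service actually logs.

```python
import json
import math


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an ascending list (pct in 0..100)."""
    if not sorted_values:
        return None
    rank = max(1, math.ceil(len(sorted_values) * pct / 100))
    return sorted_values[rank - 1]


def summarize(log_lines):
    """Derive request count, error rate, and latency percentiles from log lines."""
    durations, total, errors = [], 0, 0
    for line in log_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON lines such as startup banners
        if not isinstance(record, dict) or "status" not in record:
            continue  # not a request log entry
        total += 1
        if int(record["status"]) >= 500:
            errors += 1
        if "duration_ms" in record:
            durations.append(float(record["duration_ms"]))
    durations.sort()
    return {
        "requests": total,
        "error_rate": errors / total if total else 0.0,
        "p50_ms": percentile(durations, 50),
        "p95_ms": percentile(durations, 95),
    }
```

Feeding it `sys.stdin` from a pipe (for example the output of `docker logs`) would give a rough view of request volume and latency without touching the service.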
### No Distributed Tracing
**What's missing:**
- No Jaeger integration
- No Zipkin support
- No trace context propagation
**Implications:**
- Can't trace requests across services
- No performance profiling
- No bottleneck identification
**Workarounds:**
- Add custom tracing middleware
- Use APM tools (Datadog, New Relic)
### No Log Aggregation
**What's missing:**
- No Elasticsearch integration
- No Splunk forwarding
- No CloudWatch Logs
- No structured log shipping
**Logging:** Go stdlib `log/slog` to stdout
**Implications:**
- Logs only in container/process stdout
- No centralized log storage
- No log search/analysis
**Workarounds:**
- Docker log drivers (json-file, syslog, fluentd)
- Kubernetes log collectors (Fluentd, Filebeat)
- Redirect stdout to log aggregator
## No Message Queue Integration
**What's missing:**
- No RabbitMQ
- No Kafka
- No Redis Pub/Sub
- No AWS SQS
**Implications:**
- Synchronous request/response only
- No async job processing
- No event streaming
- No background tasks
**Use case:** All queries processed synchronously (acceptable for read-only API)
## No Cache Integration
### No External Cache
**What's missing:**
- No Redis
- No Memcached
- No Varnish
**Caching:** SQLite page cache only (64MB per connection)
**Implications:**
- No shared cache across instances
- No cache invalidation strategy
- No cache warming
- Cold start on each instance
**Workarounds:**
- Add Redis layer for hot data
- Use HTTP caching headers (not implemented)
- Deploy CDN in front of API
### No HTTP Caching
**What's missing:**
- No `Cache-Control` headers
- No `ETag` support
- No `Last-Modified` headers
**Implications:**
- Clients can't cache responses
- Repeated requests hit database
- No bandwidth savings
**Workarounds:**
- Add caching middleware
- Use reverse proxy with caching (Varnish, nginx)
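
The decision a caching middleware would make can be sketched independently of the server. The function below is an illustration in Python, not code from the service: `conditional_response`, the 16-character hash prefix, and the one-hour `max_age` default are all assumptions. It derives an `ETag` from the response bytes and answers `304 Not Modified` when the client already holds that version.

```python
import hashlib


def conditional_response(body, if_none_match=None, max_age=3600):
    """Return (status, headers, body) for a cacheable response."""
    # Strong ETag derived from the response bytes.
    etag = '"' + hashlib.sha256(body).hexdigest()[:16] + '"'
    headers = {"ETag": etag, "Cache-Control": f"public, max-age={max_age}"}
    # If-None-Match may list several tags; weak comparison ignores the W/ prefix.
    offered = [tag.strip() for tag in (if_none_match or "").split(",")]
    offered = [tag[2:] if tag.startswith("W/") else tag for tag in offered]
    if "*" in offered or etag in offered:
        return 304, headers, b""
    return 200, headers, body
```

Because the databases are a static snapshot, a long `max_age` is plausible here; a deployment that swaps database files would want a shorter value.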
## No Database Replication
**What's missing:**
- No master-slave replication
- No read replicas
- No database clustering
**Database:** One local copy of the SQLite database files per instance
**Implications:**
- Each instance has full database copy (216GB)
- No shared database across instances
- Horizontal scaling requires full database per instance
**Workarounds:**
- Read-only databases safe to copy
- Use network filesystem (NFS, EFS) for shared access, mounted read-only (SQLite file locking is unreliable on network filesystems, so avoid any writer)
- Replicate databases to multiple instances
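
Replicating the read-only databases to another instance could be scripted along these lines. This is a sketch, not tooling from the repository: the helper names `sha256_of` and `replicate` are hypothetical, and hashing files of this size will take a noticeable amount of time.

```python
import hashlib
import shutil


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so large databases never load into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def replicate(source, destination):
    """Copy a read-only database file and verify the copy by checksum."""
    shutil.copyfile(source, destination)
    if sha256_of(source) != sha256_of(destination):
        raise IOError(f"checksum mismatch copying {source} to {destination}")
    return destination
```

The checksum step matters mostly for copies over a network mount, where a truncated transfer would otherwise surface only as query errors later.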
## No Service Discovery
**What's missing:**
- No Consul integration
- No etcd
- No Kubernetes service discovery
- No DNS-based discovery
**Deployment:** Static configuration (IP:port)
**Implications:**
- Manual load balancer configuration
- No dynamic scaling
- No health-based routing
**Workarounds:**
- Use Kubernetes services (automatic discovery)
- Use cloud load balancers (AWS ALB, GCP LB)
- Use service mesh (Istio, Linkerd)
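
Where none of these is available, a client can approximate health-based routing over a static instance list. The sketch below is an assumption-laden illustration: it relies on the `/health` endpoint that the HAProxy and Kubernetes examples in this document probe, and the function names and instance addresses are hypothetical.

```python
import urllib.request


def is_healthy(base_url, timeout=2.0):
    """Probe the /health endpoint; any network or HTTP error counts as unhealthy."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as response:
            return response.status == 200
    except OSError:
        return False


def pick_instance(instances, probe=is_healthy):
    """Return the first instance that passes the probe, or None if all fail."""
    for base_url in instances:
        if probe(base_url):
            return base_url
    return None
```

Calling `pick_instance(["http://10.0.1.10:8080", "http://10.0.1.11:8080"])` before each batch of lookups gives simple failover. It does not balance load, so a real load balancer remains the better option when one can be deployed.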
## No Configuration Management
### No External Config
**What's missing:**
- No Consul KV
- No etcd
- No AWS Parameter Store
- No HashiCorp Vault
**Configuration:** CLI flags only (`-db`, `-addr`)
**Implications:**
- All config at startup
- No dynamic reconfiguration
- No secrets management
- Hardcoded timeouts/limits
**Workarounds:**
- Use environment variables (requires code changes)
- Mount config files (requires code changes)
- Use init containers to generate config
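
A wrapper entrypoint is one way to get environment-variable configuration without changing the service, since it only translates variables into the existing `-db` and `-addr` flags. In this sketch the variable names `METADATA_DB` and `METADATA_ADDR` and the binary path are hypothetical.

```python
def build_command(env, binary="/app/music-metadata-api"):
    """Translate environment variables into the service's CLI flags."""
    command = [binary]
    # Hypothetical variable names mapped onto the documented flags.
    for variable, flag in (("METADATA_DB", "-db"), ("METADATA_ADDR", "-addr")):
        value = env.get(variable)
        if value:  # unset or empty variables leave the flag at its default
            command += [flag, value]
    return command
```

An entrypoint script could pass the result of `build_command(os.environ)` to `os.execv` so that the service replaces the wrapper process and receives signals directly.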
### No Secrets Management
**What's missing:**
- No Vault integration
- No AWS Secrets Manager
- No Kubernetes secrets
- No encrypted config
**Secrets:** None required (no authentication)
**Implications:**
- No sensitive data to protect
- No credential rotation
- No encryption at rest
**Future consideration:** If adding authentication, integrate secrets manager
## Integration Patterns
### Reverse Proxy Integration
**Use case:** Add authentication, CORS, caching, SSL
**Example with nginx:**
```nginx
# proxy_cache_path is only valid in the http context, so it sits outside server {}
proxy_cache_path /var/cache/nginx levels=1:2 keys_zone=api_cache:10m;

upstream metadata_api {
    server localhost:8080;
}

server {
    listen 443 ssl;
    server_name api.example.com;

    ssl_certificate     /etc/ssl/cert.pem;
    ssl_certificate_key /etc/ssl/key.pem;

    # CORS headers
    add_header Access-Control-Allow-Origin *;
    add_header Access-Control-Allow-Methods "GET, POST, OPTIONS";

    # Caching
    proxy_cache api_cache;
    proxy_cache_valid 200 1h;

    # Authentication
    auth_basic "Restricted";
    auth_basic_user_file /etc/nginx/.htpasswd;

    location / {
        proxy_pass http://metadata_api;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
}
```
### API Gateway Integration
**Use case:** Rate limiting, authentication, analytics
**Example with Kong:**
```yaml
services:
  - name: metadata-api
    url: http://localhost:8080
    routes:
      - name: metadata-routes
        paths:
          - /
    plugins:
      - name: rate-limiting
        config:
          minute: 1000
          policy: local
      - name: key-auth
        config:
          key_names:
            - apikey
      - name: prometheus
        config:
          per_consumer: true
```
### Load Balancer Integration
**Use case:** Distribute traffic across multiple instances
**Example with HAProxy:**
```
frontend metadata_frontend
    bind *:80
    default_backend metadata_backend

backend metadata_backend
    balance roundrobin
    option httpchk GET /health
    server api1 10.0.1.10:8080 check
    server api2 10.0.1.11:8080 check
    server api3 10.0.1.12:8080 check
```
### Kubernetes Integration
**Use case:** Container orchestration, auto-scaling
**Example deployment:**
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: metadata-api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: metadata-api
  template:
    metadata:
      labels:
        app: metadata-api
    spec:
      containers:
        - name: api
          image: ghcr.io/aunali321/music-metadata-api:latest
          args: ["-db", "/data/main_database.sqlite3"]
          ports:
            - containerPort: 8080
          volumeMounts:
            - name: database
              mountPath: /data
              readOnly: true
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 30
          resources:
            requests:
              memory: "4Gi"
              cpu: "1"
            limits:
              memory: "8Gi"
              cpu: "2"
      volumes:
        - name: database
          persistentVolumeClaim:
            claimName: metadata-db-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: metadata-api
spec:
  selector:
    app: metadata-api
  ports:
    - port: 80
      targetPort: 8080
  type: LoadBalancer
```
### Monitoring Integration
**Use case:** Metrics, logs, traces
**Example with Prometheus + Grafana:**
**1. Add metrics exporter (custom middleware):**
```go
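// Not implemented in current codebase
import "github.com/prometheus/client_golang/prometheus"

var (
	requestsTotal = prometheus.NewCounterVec(
		prometheus.CounterOpts{Name: "api_requests_total", Help: "Total HTTP requests."},
		[]string{"method", "endpoint", "status"},
	)
	requestDuration = prometheus.NewHistogramVec(
		prometheus.HistogramOpts{Name: "api_request_duration_seconds", Help: "HTTP request latency."},
		[]string{"method", "endpoint"},
	)
)

func init() {
	// Register both collectors with the default registry.
	// Expose them via promhttp.Handler() on /metrics so Prometheus can scrape them.
	prometheus.MustRegister(requestsTotal, requestDuration)
}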
```
**2. Scrape metrics with Prometheus:**
```yaml
scrape_configs:
  - job_name: 'metadata-api'
    static_configs:
      - targets: ['localhost:8080']
```
**3. Visualize in Grafana:**
- Request rate dashboard
- Error rate dashboard
- Latency percentiles (p50, p95, p99)
### Logging Integration
**Use case:** Centralized log aggregation
**Example with Fluentd:**
**1. Configure Docker logging driver:**
```yaml
services:
  metadata-api:
    image: ghcr.io/aunali321/music-metadata-api:latest
    logging:
      driver: fluentd
      options:
        fluentd-address: localhost:24224
        tag: metadata-api
```
**2. Fluentd configuration:**
```
<source>
  @type forward
  port 24224
</source>

<match metadata-api>
  @type elasticsearch
  host elasticsearch
  port 9200
  index_name metadata-api
  type_name _doc
</match>
```
### Caching Integration
**Use case:** Reduce database load, improve latency
**Example with Redis:**
**1. Add Redis middleware (custom implementation):**
```go
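// Not implemented in current codebase.
// Sketch assuming github.com/redis/go-redis/v9 and a package-level redisClient.
import (
	"net/http"
	"net/http/httptest"
	"time"
)

func cacheMiddleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// Key on path plus query string so distinct queries don't collide
		key := r.URL.RequestURI()
		// Check Redis cache (cached entries are assumed to be JSON API responses)
		if cached, err := redisClient.Get(r.Context(), key).Bytes(); err == nil {
			w.Header().Set("Content-Type", "application/json")
			w.Write(cached)
			return
		}
		// Cache miss, call handler and capture its response
		rec := httptest.NewRecorder()
		next.ServeHTTP(rec, r)
		// Store only successful responses in Redis (1 hour TTL)
		if rec.Code == http.StatusOK {
			redisClient.Set(r.Context(), key, rec.Body.Bytes(), time.Hour)
		}
		// Replay the recorded headers, status, and body to the client
		for k, v := range rec.Header() {
			w.Header()[k] = v
		}
		w.WriteHeader(rec.Code)
		w.Write(rec.Body.Bytes())
	})
}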
```
**2. Deploy Redis:**
```yaml
services:
  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"
```
## Complementary Services
### MusicBrainz Integration
**Use case:** Resolve MBIDs to ISRCs, then lookup in Music Metadata API
**Flow:**
```
1. Query MusicBrainz for recording by MBID
2. Extract ISRC from MusicBrainz response
3. Lookup ISRC in Music Metadata API
4. Merge metadata (MusicBrainz credits + Spotify-style data)
```
**Example:**
```python
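import requests

# MusicBrainz asks clients to send an identifying User-Agent
HEADERS = {"User-Agent": "metadata-example/0.1 ( contact@example.com )"}
MBID = "abc-123"  # placeholder recording MBID

# Step 1: Get ISRC (and artist credits) from MusicBrainz
mb_url = f"https://musicbrainz.org/ws/2/recording/{MBID}?fmt=json&inc=isrcs+artist-credits"
mb_response = requests.get(mb_url, headers=HEADERS, timeout=10).json()
isrc = mb_response['isrcs'][0]  # raises IndexError if the recording has no ISRC

# Step 2: Lookup in Music Metadata API
mm_url = f"http://localhost:8080/lookup/isrc/{isrc}"
mm_response = requests.get(mm_url, timeout=10).json()

# Step 3: Merge metadata
merged = {
    "mbid": MBID,
    "isrc": isrc,
    "title": mm_response['name'],
    "popularity": mm_response['popularity'],
    "credits": mb_response.get('artist-credit', []),
}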
```
### AcoustID Integration
**Use case:** Fingerprint audio files, resolve to ISRCs
**Flow:**
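```
1. Generate audio fingerprint (chromaprint)
2. Query AcoustID API with fingerprint
3. Extract the MusicBrainz recording ID from the AcoustID response
4. Resolve the recording ID to an ISRC via MusicBrainz
5. Lookup ISRC in Music Metadata API
6. Tag audio file with metadata
```
**Example:**
```python
import acoustid
import mutagen
import requests

API_KEY = "YOUR_ACOUSTID_KEY"  # placeholder: AcoustID application key

# Step 1: Fingerprint audio file (requires chromaprint)
duration, fingerprint = acoustid.fingerprint_file('song.mp3')
# Step 2: Query AcoustID
results = acoustid.lookup(API_KEY, fingerprint, duration, meta='recordings')
# Step 3: AcoustID returns MusicBrainz recording IDs, not ISRCs
mbid = results['results'][0]['recordings'][0]['id']
# Step 4: Resolve the recording ID to an ISRC via MusicBrainz
mb_url = f"https://musicbrainz.org/ws/2/recording/{mbid}?fmt=json&inc=isrcs"
mb = requests.get(mb_url, headers={"User-Agent": "example-tagger/0.1"}, timeout=10).json()
isrc = mb['isrcs'][0]
# Step 5: Lookup in Music Metadata API
mm_url = f"http://localhost:8080/lookup/isrc/{isrc}"
metadata = requests.get(mm_url, timeout=10).json()
# Step 6: Tag file (easy=True exposes plain 'title'/'artist' keys)
audio = mutagen.File('song.mp3', easy=True)
audio['title'] = metadata['name']
audio['artist'] = metadata['artists'][0]['name']
audio.save()
```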
### Spotify Web API Integration
**Use case:** Get real-time data, then fallback to Music Metadata API
**Flow:**
```
1. Try Spotify Web API (requires OAuth)
2. If rate limited or unavailable, fallback to Music Metadata API
3. Return cached/static data from Music Metadata API
```
**Example:**
```python
def get_track_metadata(isrc):
    try:
        # Try Spotify Web API (real-time)
        spotify_data = spotify_client.search(q=f"isrc:{isrc}", type="track")
        return spotify_data['tracks']['items'][0]
    except Exception:
        # Fallback to Music Metadata API (static)
        mm_url = f"http://localhost:8080/lookup/isrc/{isrc}"
        return requests.get(mm_url).json()
```
## Deployment Integrations
### Docker Compose
**Use case:** Local development, simple deployments
**Example:**
```yaml
version: '3.8'
services:
  metadata-api:
    image: ghcr.io/aunali321/music-metadata-api:latest
    ports:
      - "8080:8080"
    volumes:
      - ./data:/data:ro
    command: ["-db", "/data/main_database.sqlite3"]
    restart: unless-stopped
  nginx:
    image: nginx:alpine
    ports:
      - "80:80"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf:ro
    depends_on:
      - metadata-api
```
### Kubernetes
**Use case:** Production deployments, auto-scaling
**See Kubernetes Integration section above**
### Cloud Platforms
**AWS ECS:**
```json
{
  "family": "metadata-api",
  "containerDefinitions": [{
    "name": "api",
    "image": "ghcr.io/aunali321/music-metadata-api:latest",
    "memory": 4096,
    "cpu": 1024,
    "portMappings": [{"containerPort": 8080}],
    "command": ["-db", "/data/main_database.sqlite3"],
    "mountPoints": [{
      "sourceVolume": "database",
      "containerPath": "/data",
      "readOnly": true
    }]
  }],
  "volumes": [{
    "name": "database",
    "efsVolumeConfiguration": {
      "fileSystemId": "fs-12345678"
    }
  }]
}
```
**Google Cloud Run:**
```yaml
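apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: metadata-api
spec:
  template:
    spec:
      containers:
        - image: ghcr.io/aunali321/music-metadata-api:latest
          args: ["-db", "/data/main_database.sqlite3"]
          volumeMounts:
            - name: database
              mountPath: /data
              readOnly: true
      volumes:
        - name: database
          # Verify volume support before relying on this stanza: Cloud Run
          # documents NFS and Cloud Storage volume mounts, and a Compute Engine
          # persistent disk may not be mountable from a Cloud Run service.
          gcePersistentDisk:
            pdName: metadata-db
            readOnly: true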
```
## Advantages of the No-Integration Design
### Simplicity
**Benefits:**
- No external service dependencies
- No network calls (faster, more reliable)
- No authentication complexity
- No API rate limits (external)
**Tradeoffs:**
- No real-time data
- No automatic updates
- No distributed features
### Reliability
**Benefits:**
- No cascading failures (no external dependencies)
- No network timeouts (all local)
- No third-party outages
- Predictable performance
**Tradeoffs:**
- Single point of failure (database file)
- No redundancy (unless replicated)
### Performance
**Benefits:**
- No network latency (local database)
- No API rate limits (self-imposed only)
- Batch queries optimized (7 queries vs 2,800)
**Tradeoffs:**
- Database size (216GB per instance)
- Memory usage (2.5GB minimum)
### Cost
**Benefits:**
- No API subscription fees
- No per-request charges
- No data transfer costs (local)
**Tradeoffs:**
- Storage costs (216GB)
- Compute costs (self-hosted)
## Future Integration Opportunities
### Potential Additions
**Authentication:**
- OAuth 2.0 provider (Keycloak, Auth0)
- API key management (custom or Kong)
**Monitoring:**
- Prometheus metrics exporter
- OpenTelemetry tracing
- Structured logging to Elasticsearch
**Caching:**
- Redis for hot data
- HTTP caching headers
- CDN for static responses
**Database:**
- PostgreSQL for writable data
- Read replicas for scaling
- Full-text search (Elasticsearch, Meilisearch)
**Message Queue:**
- Background job processing (a Go-native queue such as Asynq; Celery and Sidekiq target Python and Ruby services)
- Event streaming (Kafka)
**Configuration:**
- Environment variables
- Config files (YAML, TOML)
- Secrets management (Vault)
### Integration Complexity
**Current:** Zero integrations (simplest possible)
**With additions:** Each integration adds:
- Configuration complexity
- Deployment dependencies
- Failure modes
- Maintenance burden
**Recommendation:** Only add integrations when necessary for specific use cases.