Files
metadata-agregator/docs/research/music-metadata-api/analysis/CODEBASE.md
T
Alexander a1f6701bac feat: initial implementation of metadata aggregator
- gRPC service with MusicBrainz provider
- PostgreSQL schema with migrations
- Service layer with database-first caching
- Repository pattern for data access
- YAML configuration support
- Research documentation for 17 music metadata projects
2026-04-28 16:28:53 +02:00

946 lines
20 KiB
Markdown

# Music Metadata API - Codebase Analysis
## Codebase Overview
Music Metadata API is a small, focused Go codebase with minimal complexity:
**Total lines of code:** ~1,100 lines (excluding tests, which don't exist)
**File breakdown:**
- `cmd/server/main.go` - 62 lines (entry point)
- `internal/db/db.go` - 907 lines (database layer, largest file)
- `internal/models/models.go` - 65 lines (data structures)
- `internal/api/handlers.go` - ~150 lines (HTTP handlers)
- `internal/api/ratelimit.go` - ~80 lines (rate limiting)
- `internal/api/openapi.go` - ~100 lines (OpenAPI spec)
**Characteristics:**
- No web framework (stdlib only)
- No ORM (raw SQL)
- No test files (zero test coverage)
- No configuration files (CLI flags only)
- Minimal dependencies (2 external packages)
## Configuration
### CLI Flags
**Defined in:** `cmd/server/main.go`
```go
var (
dbPath = flag.String("db", "", "path to database file (required)")
addr = flag.String("addr", ":8080", "HTTP server address")
)
```
**Usage:**
```bash
./metadata-api -db /data/main_database.sqlite3 -addr :8080
```
**Limitations:**
- Only 2 configurable parameters
- No environment variable support
- No configuration file support
- All timeouts hardcoded
- All limits hardcoded
### Hardcoded Configuration
**Timeouts:**
```go
// Graceful shutdown timeout
shutdownTimeout := 10 * time.Second
// Search query timeout
ctx, cancel := context.WithTimeout(r.Context(), 10*time.Second)
```
**Rate limiting:**
```go
// Hardcoded in api/ratelimit.go
rateLimiter := NewRateLimiter(100, 200) // 100 req/s, 200 burst
```
**Database connection pool:**
```go
// Hardcoded in db/db.go
db.SetMaxOpenConns(8)
db.SetMaxIdleConns(8)
db.SetConnMaxLifetime(0)
```
**Search limits:**
```go
// Hardcoded in api/handlers.go
const (
minQueryLength = 2
maxSearchLimit = 50
defaultLimit = 10
)
```
**Batch limits:**
```go
// Hardcoded in api/handlers.go
const maxBatchItems = 400
```
**SQLite PRAGMAs:**
```go
// Hardcoded in db/db.go
dsn := fmt.Sprintf("file:%s?mode=ro&_journal_mode=off&_cache_size=-64000&_mmap_size=1073741824&_query_only=true", dbPath)
```
**Recommendation:** Extract to configuration struct for flexibility.
### Environment Variables
**docker-compose.yml defines:**
```yaml
environment:
- LOG_LEVEL=info
```
**BUG:** `LOG_LEVEL` is not used in code. No log level control implemented.
**Expected behavior:** Filter logs by level (debug, info, warn, error)
**Actual behavior:** All logs output (no filtering)
**Fix required:**
```go
// Add to main.go
logLevel := os.Getenv("LOG_LEVEL")
if logLevel == "" {
logLevel = "info"
}
var level slog.Level
switch logLevel {
case "debug":
level = slog.LevelDebug
case "info":
level = slog.LevelInfo
case "warn":
level = slog.LevelWarn
case "error":
level = slog.LevelError
}
logger := slog.New(slog.NewJSONHandler(os.Stdout, &slog.HandlerOptions{Level: level}))
```
## Logging
### Implementation
**Package:** Go stdlib `log/slog` (structured logging)
**Usage pattern:**
```go
slog.Error("Database query failed", "error", err, "query", query)
```
**Output format:**
```json
{"time":"2024-01-15T10:30:00Z","level":"ERROR","msg":"Database query failed","error":"no such table","query":"SELECT * FROM tracks"}
```
### Logging Locations
**Error logging only:**
- Database query failures
- JSON decode errors
- HTTP handler errors
- Graceful shutdown errors
**No info/debug logging:**
- Request logging (no access logs)
- Query execution logging
- Performance metrics
- Startup messages
**Example from db.go:**
```go
rows, err := d.mainDB.Query(query, args...)
if err != nil {
slog.Error("Query failed", "error", err, "query", query)
return nil, err
}
```
### Log Level Control
**Current:** No log level filtering (all logs output)
**Missing:**
- Debug logs (query details, timing)
- Info logs (startup, shutdown, requests)
- Warn logs (rate limiting, slow queries)
**Recommendation:** Implement log level control via environment variable.
## Health Checks
### Naive Implementation
**Endpoint:** `GET /health`
**Code:**
```go
func handleHealth(w http.ResponseWriter, r *http.Request) {
w.Header().Set("Content-Type", "application/json")
json.NewEncoder(w).Encode(map[string]string{"status": "ok"})
}
```
**Response:**
```json
{"status":"ok"}
```
**Problem:** Always returns 200 OK, even if database is unreachable.
**Test:**
```bash
# Stop database (simulate failure)
mv /data/main_database.sqlite3 /data/main_database.sqlite3.bak
# Health check still returns OK
curl http://localhost:8080/health
# {"status":"ok"}
# But actual queries fail
curl http://localhost:8080/lookup/track/abc123
# 500 Internal Server Error
```
### Improved Health Check
**Recommendation:**
```go
func handleHealth(db *sql.DB) http.HandlerFunc {
return func(w http.ResponseWriter, r *http.Request) {
// Ping database
ctx, cancel := context.WithTimeout(r.Context(), 2*time.Second)
defer cancel()
if err := db.PingContext(ctx); err != nil {
w.WriteHeader(http.StatusServiceUnavailable)
json.NewEncoder(w).Encode(map[string]string{
"status": "unhealthy",
"error": "database unavailable",
})
return
}
// Optional: Test query
var count int
err := db.QueryRowContext(ctx, "SELECT COUNT(*) FROM tracks LIMIT 1").Scan(&count)
if err != nil {
w.WriteHeader(http.StatusServiceUnavailable)
json.NewEncoder(w).Encode(map[string]string{
"status": "unhealthy",
"error": "database query failed",
})
return
}
json.NewEncoder(w).Encode(map[string]string{"status": "ok"})
}
}
```
## Rate Limiting
### Implementation
**File:** `internal/api/ratelimit.go`
**Algorithm:** Token bucket per IP
**Data structure:**
```go
type RateLimiter struct {
visitors map[string]*rate.Limiter // IP -> limiter
mu sync.RWMutex // Protects visitors map
rate rate.Limit // Tokens per second
burst int // Burst capacity
}
```
**Middleware:**
```go
func (rl *RateLimiter) Limit(next http.Handler) http.Handler {
return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
// Extract IP
ip := getIP(r)
// Get or create limiter for IP
limiter := rl.getLimiter(ip)
// Check if allowed
if !limiter.Allow() {
w.Header().Set("Retry-After", "1")
http.Error(w, "Rate limit exceeded", http.StatusTooManyRequests)
return
}
next.ServeHTTP(w, r)
})
}
```
**IP extraction:**
```go
func getIP(r *http.Request) string {
// Check X-Forwarded-For header (proxy/load balancer)
forwarded := r.Header.Get("X-Forwarded-For")
if forwarded != "" {
// Take first IP if comma-separated
ips := strings.Split(forwarded, ",")
return strings.TrimSpace(ips[0])
}
// Fallback to RemoteAddr
ip, _, _ := net.SplitHostPort(r.RemoteAddr)
return ip
}
```
### Memory Leak
**Problem:** Visitor map grows unbounded. No cleanup for inactive IPs.
**Code:**
```go
func (rl *RateLimiter) getLimiter(ip string) *rate.Limiter {
rl.mu.Lock()
defer rl.mu.Unlock()
limiter, exists := rl.visitors[ip]
if !exists {
limiter = rate.NewLimiter(rl.rate, rl.burst)
rl.visitors[ip] = limiter // BUG: Never removed
}
return limiter
}
```
**Impact:**
- Long-running servers accumulate IPs
- Memory usage grows over time
- No expiration for inactive IPs
**Example:**
- 1 million unique IPs over 1 month
- ~100 bytes per limiter
- ~100MB memory leak
**Fix:**
```go
type visitor struct {
limiter *rate.Limiter
lastSeen time.Time
}
func (rl *RateLimiter) cleanup() {
ticker := time.NewTicker(1 * time.Hour)
defer ticker.Stop()
for range ticker.C {
rl.mu.Lock()
for ip, v := range rl.visitors {
// Remove visitors inactive for 24 hours
if time.Since(v.lastSeen) > 24*time.Hour {
delete(rl.visitors, ip)
}
}
rl.mu.Unlock()
}
}
// Start cleanup goroutine in NewRateLimiter
go rl.cleanup()
```
### Rate Limit Configuration
**Current:** Hardcoded (100 req/s, 200 burst)
**Recommendation:** Make configurable via CLI flags or environment variables.
```go
// CLI flags
var (
rateLimit = flag.Int("rate-limit", 100, "requests per second")
rateBurst = flag.Int("rate-burst", 200, "burst capacity")
)
// Usage
rateLimiter := api.NewRateLimiter(rate.Limit(*rateLimit), *rateBurst)
```
## Search Implementation
### Query Pattern
**Track search:**
```go
query := `
SELECT id, name, isrc, duration_ms, popularity, album_rowid
FROM tracks
WHERE name LIKE ? COLLATE NOCASE
ORDER BY popularity DESC
LIMIT ?
`
args := []interface{}{"%" + searchQuery + "%", limit}
```
**Artist search:**
```go
query := `
SELECT id, name, followers_total, popularity
FROM artists
WHERE name LIKE ? COLLATE NOCASE
ORDER BY followers_total DESC
LIMIT ?
`
args := []interface{}{"%" + searchQuery + "%", limit}
```
### Performance Characteristics
**LIKE %query% problems:**
- Can't use indexes (full table scan)
- Slow on 256M rows
- CPU-intensive (string matching)
**Benchmark (estimated):**
- Common query ("love"): 5-10 seconds
- Specific query ("bohemian rhapsody"): 1-2 seconds
- Rare query ("xyzabc"): 10+ seconds (full scan)
**10-second timeout:**
```go
ctx, cancel := context.WithTimeout(r.Context(), 10*time.Second)
defer cancel()
rows, err := db.QueryContext(ctx, query, args...)
if err == context.DeadlineExceeded {
http.Error(w, "Search timeout", http.StatusGatewayTimeout)
return
}
```
### Search Validation
**Minimum query length:**
```go
if len(searchQuery) < 2 {
http.Error(w, "Query must be at least 2 characters", http.StatusBadRequest)
return
}
```
**Maximum limit:**
```go
if limit > 50 {
http.Error(w, "Limit cannot exceed 50", http.StatusBadRequest)
return
}
```
**Default limit:**
```go
limit := 10
if limitParam := r.URL.Query().Get("limit"); limitParam != "" {
limit, _ = strconv.Atoi(limitParam)
}
```
### Full-Text Search Alternative
**Not implemented:** SQLite FTS5 (Full-Text Search)
**FTS5 benefits:**
- Indexed search (much faster)
- Relevance ranking
- Phrase search
- Boolean operators
**Why not used:**
- Requires writable database (to create FTS5 table)
- Databases are read-only
- Would need separate FTS5 database
**Workaround:**
```sql
-- Create separate FTS5 database (one-time setup)
CREATE VIRTUAL TABLE tracks_fts USING fts5(id, name, content=tracks);
INSERT INTO tracks_fts SELECT id, name FROM tracks;
-- Fast search
SELECT * FROM tracks_fts WHERE name MATCH 'bohemian';
```
**Implementation:**
- Create FTS5 database during database preparation
- Open second database connection in code
- Query FTS5 for search, then fetch full data from main DB
## Testing
### Test Coverage
**Test files:** 0
**Test coverage:** 0%
**Test framework:** None
**Evidence:**
```bash
# No test files in repository
find . -name "*_test.go"
# (no output)
```
**.gitignore includes:**
```
coverage.out
```
**Implication:** Testing was planned but never implemented.
### CI/CD Testing
**GitHub Actions workflow:** `.github/workflows/docker-publish.yml`
**Steps:**
1. Checkout code
2. Build Docker image
3. Push to registry
**Missing:** No test step
**Expected workflow:**
```yaml
- name: Run tests
run: go test -v ./...
- name: Check coverage
run: go test -cover ./...
```
### Manual Testing
**Only testing:** Manual API calls
**Example:**
```bash
# Health check
curl http://localhost:8080/health
# Track lookup
curl http://localhost:8080/lookup/track/abc123
# Search
curl "http://localhost:8080/search/track?q=test"
```
**No automated testing:**
- No unit tests
- No integration tests
- No end-to-end tests
- No performance tests
- No load tests
### Testing Recommendations
**Unit tests needed:**
- Rate limiter logic
- IP extraction
- Query building
- Data enrichment
- JSON serialization
**Integration tests needed:**
- Database queries
- HTTP handlers
- Batch operations
- Search functionality
**Example unit test:**
```go
// internal/api/ratelimit_test.go
func TestRateLimiter(t *testing.T) {
rl := NewRateLimiter(10, 20) // 10 req/s, 20 burst
// Should allow burst
for i := 0; i < 20; i++ {
if !rl.getLimiter("127.0.0.1").Allow() {
t.Errorf("Request %d should be allowed", i)
}
}
// Should reject 21st request
if rl.getLimiter("127.0.0.1").Allow() {
t.Error("Request 21 should be rate limited")
}
}
```
**Example integration test:**
```go
// internal/db/db_test.go
func TestGetTrack(t *testing.T) {
db, err := NewDatabase("testdata/test.db")
if err != nil {
t.Fatal(err)
}
defer db.Close()
track, err := db.GetTrack("test_track_id")
if err != nil {
t.Fatal(err)
}
if track.Name != "Test Track" {
t.Errorf("Expected 'Test Track', got '%s'", track.Name)
}
}
```
## Error Handling
### Error Patterns
**Database errors:**
```go
rows, err := db.Query(query, args...)
if err != nil {
slog.Error("Query failed", "error", err)
http.Error(w, "Internal server error", http.StatusInternalServerError)
return
}
```
**JSON decode errors:**
```go
var req BatchRequest
if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
http.Error(w, "Invalid JSON", http.StatusBadRequest)
return
}
```
**Validation errors:**
```go
if len(query) < 2 {
http.Error(w, "Query too short", http.StatusBadRequest)
return
}
```
### Error Responses
**Generic errors:**
```go
http.Error(w, "Internal server error", http.StatusInternalServerError)
```
**Problem:** No error details returned to client (security vs usability tradeoff)
**Structured errors (not implemented):**
```go
type ErrorResponse struct {
Error string `json:"error"`
Code string `json:"code"`
Details string `json:"details,omitempty"`
}
func writeError(w http.ResponseWriter, status int, code, message string) {
w.WriteHeader(status)
json.NewEncoder(w).Encode(ErrorResponse{
Error: message,
Code: code,
})
}
```
## Code Quality
### Strengths
**Simplicity:**
- Small codebase (~1,100 lines)
- Easy to understand
- Minimal dependencies
- No framework magic
**Readability:**
- Clear function names
- Logical file organization
- Consistent style
**Performance:**
- Batch optimization (343x fewer queries)
- Connection pooling
- Memory-mapped I/O
### Weaknesses
**No tests:**
- Zero test coverage
- No regression protection
- No documentation via tests
**Hardcoded config:**
- No flexibility
- Requires recompilation to change limits
- No environment-specific config
**Memory leak:**
- Rate limiter visitor map grows unbounded
- Requires periodic restarts
**Naive health check:**
- Doesn't verify database connectivity
- False positives in monitoring
**No metrics:**
- No visibility into performance
- No error rate tracking
- No usage analytics
**Unused config:**
- `LOG_LEVEL` environment variable ignored
- Misleading documentation
**No CORS:**
- Browser-based clients blocked
- Requires reverse proxy workaround
**No authentication:**
- Public API (security risk)
- No usage tracking per user
### Code Smells
**Magic numbers:**
```go
// What is 64000? Why 1073741824?
_cache_size=-64000&_mmap_size=1073741824
```
**Fix:** Use named constants
```go
const (
sqliteCacheSizeKB = 64000 // 64MB
sqliteMmapSizeBytes = 1 << 30 // 1GB
)
```
**Repeated code:**
```go
// Similar enrichment logic repeated for tracks, albums, artists
func enrichTrack(track *Track) { /* ... */ }
func enrichAlbum(album *Album) { /* ... */ }
func enrichArtist(artist *Artist) { /* ... */ }
```
**Fix:** Generic enrichment function
**Global state:**
```go
// Rate limiter as global variable (not shown in code, but implied)
var rateLimiter *RateLimiter
```
**Fix:** Dependency injection
## Dependencies
### External Packages
**modernc.org/sqlite v1.34.4:**
- Pure Go SQLite driver
- No CGO required
- 100% Go implementation
- Larger binary size vs CGO version
**golang.org/x/time v0.14.0:**
- Rate limiting (token bucket)
- Part of Go extended stdlib
- Minimal, focused package
**Total dependencies:** 2 direct + transitive dependencies
### Dependency Management
**go.mod:**
```go
module github.com/Aunali321/music-metadata-api
go 1.24
require (
modernc.org/sqlite v1.34.4
golang.org/x/time v0.14.0
)
```
**Dependency updates:**
```bash
# Check for updates
go list -u -m all
# Update dependencies
go get -u ./...
go mod tidy
```
**Security scanning:**
```bash
# Scan for vulnerabilities
go list -json -m all | nancy sleuth
```
## Code Organization
### Package Structure
```
music-metadata-api/
├── cmd/
│ └── server/ # Entry point
│ └── main.go # CLI, server setup, graceful shutdown
├── internal/ # Private packages
│ ├── api/ # HTTP layer
│ │ ├── handlers.go # Route handlers
│ │ ├── ratelimit.go # Rate limiting middleware
│ │ └── openapi.go # OpenAPI spec
│ │
│ ├── db/ # Database layer
│ │ └── db.go # Queries, enrichment, batch optimization
│ │
│ └── models/ # Data models
│ └── models.go # Structs, JSON tags
├── Dockerfile # Container build
├── docker-compose.yml # Local deployment
├── go.mod # Dependencies
└── .github/
└── workflows/
└── docker-publish.yml # CI/CD
```
### Separation of Concerns
**Good:**
- Clear layer boundaries (API → DB → Models)
- No circular dependencies
- Database logic isolated from HTTP
**Could improve:**
- Extract configuration to separate package
- Extract validation to separate package
- Extract error handling to separate package
## Performance Characteristics
### Bottlenecks
**Search queries:**
- `LIKE %query%` full table scan
- 10-second timeout (can be hit)
- CPU-bound (string matching)
**Rate limiter:**
- RWMutex contention under high load
- Map lookup on every request
**Database:**
- Single SQLite file (no sharding)
- 8 connection limit (conservative)
### Optimizations
**Batch queries:**
- 343x fewer queries (400 items: 7 queries vs 2,800)
- IN clause for bulk lookups
**Connection pooling:**
- Reuse connections (no overhead)
- 8 warm connections
**Memory-mapped I/O:**
- 1GB mmap (faster than read() syscalls)
- OS handles paging
**Read-only mode:**
- No write locks
- Safe concurrent reads
## Maintainability
### Documentation
**Code comments:** Minimal
**README:** Basic (installation, usage)
**OpenAPI spec:** Comprehensive (all endpoints documented)
**No inline documentation:**
```go
// No function comments
func enrichTrack(track *Track) {
// No explanation of enrichment logic
}
```
**Recommendation:** Add godoc comments
```go
// enrichTrack populates track with related entities (artists, album, track files).
// It performs batch queries to minimize database round-trips.
func enrichTrack(track *Track) {
// ...
}
```
### Extensibility
**Easy to extend:**
- Add new endpoints (register route)
- Add new models (define struct)
- Add new queries (write SQL)
**Hard to extend:**
- Change rate limiting strategy (tightly coupled)
- Add authentication (no middleware chain)
- Add metrics (no instrumentation points)
### Technical Debt
**High priority:**
1. Fix rate limiter memory leak
2. Implement proper health check
3. Add test coverage
4. Use LOG_LEVEL environment variable
**Medium priority:**
1. Extract hardcoded config
2. Add metrics/monitoring
3. Implement CORS support
4. Add authentication
**Low priority:**
1. Improve search performance (FTS5)
2. Add caching layer
3. Structured error responses
4. Request logging