feat: initial implementation of metadata aggregator

- gRPC service with MusicBrainz provider - PostgreSQL schema with migrations - Service layer with database-first caching - Repository pattern for data access - YAML configuration support - Research documentation for 17 music metadata projects
2026-04-28 16:27:14 +02:00
commit a1f6701bac
163 changed files with 95884 additions and 0 deletions
@@ -0,0 +1,611 @@
+# AcoustID Architecture
+
+## System Architecture Overview
+
+AcoustID employs a **monolithic multi-process architecture** with microservice-like separation of concerns. The system is split into two major repositories with distinct responsibilities:
+
+1. **acoustid-server**: Monolithic Python application with multiple process types
+2. **acoustid-index**: Standalone Zig service for fingerprint indexing
+
+## Server Architecture
+
+### Process Types
+
+The server runs as multiple independent processes, each with a specific role:
+
+| Process | Entry Point | Purpose | Scaling |
+|---------|-------------|---------|---------|
+| API | `acoustid.server:make_application()` | Handle API requests | Horizontal |
+| Web | `acoustid.server:make_application()` | Serve web UI | Horizontal |
+| Worker | `acoustid.worker:run()` | Process background jobs | Horizontal |
+| Cron | `acoustid.cron:run()` | Execute scheduled tasks | Single instance |
+| Import | `acoustid.scripts.import_submissions` | Bulk import fingerprints | Manual |
+
+### Directory Structure
+
+```
+acoustid/
+├── api/                    # API layer
+│   ├── __init__.py        # API application factory
+│   ├── errors.py          # Error handling
+│   ├── ratelimit.py       # Rate limiting logic
+│   └── v2/                # API v2 endpoints
+│       ├── __init__.py
+│       ├── lookup.py      # Fingerprint lookup
+│       ├── submit.py      # Fingerprint submission
+│       ├── misc.py        # Utility endpoints
+│       └── internal.py    # Internal admin endpoints
+├── data/                   # Business logic layer
+│   ├── account.py         # User account operations
+│   ├── application.py     # API application management
+│   ├── fingerprint.py     # Fingerprint operations
+│   ├── foreignid.py       # Foreign ID management
+│   ├── meta.py            # Metadata operations
+│   ├── musicbrainz.py     # MusicBrainz queries
+│   ├── stats.py           # Statistics tracking
+│   ├── submission.py      # Submission processing
+│   └── track.py           # Track operations
+├── future/                 # Starlette migration
+│   ├── app.py             # ASGI application
+│   ├── lookup.py          # Async lookup handler
+│   └── submit.py          # Async submit handler
+├── web/                    # Web UI layer
+│   ├── __init__.py        # Web application factory
+│   ├── views/             # View handlers
+│   └── templates/         # Jinja2 templates
+├── scripts/                # Utility scripts
+│   ├── import_submissions.py
+│   ├── backfill_fingerprint_index.py
+│   └── update_lookup_stats.py
+├── cli.py                  # CLI command definitions
+├── server.py               # WSGI/ASGI application
+├── worker.py               # Background worker
+├── cron.py                 # Cron job scheduler
+├── fingerprint.py          # Fingerprint utilities
+├── indexclient.py          # Legacy TCP index client
+├── fpstore.py              # Modern HTTP index client
+├── db.py                   # Database connection management
+├── config.py               # Configuration loading
+└── tables.py               # SQLAlchemy ORM models
+```
+
+### Layered Architecture
+
+The server follows a traditional layered architecture:
+
+```
+┌─────────────────────────────────────────┐
+│     Presentation Layer                  │
+│  (api/, web/, future/)                  │
+│  - HTTP request/response handling       │
+│  - Input validation                     │
+│  - Response formatting                  │
+└─────────────────────────────────────────┘
+              ↓
+┌─────────────────────────────────────────┐
+│     Business Logic Layer                │
+│  (data/)                                │
+│  - Domain operations                    │
+│  - Business rules                       │
+│  - Orchestration                        │
+└─────────────────────────────────────────┘
+              ↓
+┌─────────────────────────────────────────┐
+│     Data Access Layer                   │
+│  (db.py, tables.py)                     │
+│  - Database queries                     │
+│  - ORM models                           │
+│  - Transaction management               │
+└─────────────────────────────────────────┘
+              ↓
+┌─────────────────────────────────────────┐
+│     External Services Layer             │
+│  (indexclient.py, fpstore.py)           │
+│  - Index communication                  │
+│  - MusicBrainz queries                  │
+│  - Redis operations                     │
+└─────────────────────────────────────────┘
+```
+
+### Framework Transition
+
+The server is actively transitioning from Flask to Starlette:
+
+**Current (Flask/Werkzeug)**:
+- Location: `acoustid/api/`, `acoustid/web/`
+- WSGI-based synchronous request handling
+- Gunicorn as application server
+- Blocking database operations with psycopg2
+
+**Future (Starlette)**:
+- Location: `acoustid/future/`
+- ASGI-based asynchronous request handling
+- Uvicorn as application server
+- Async database operations with asyncpg
+
+**Migration Status**:
+- Core lookup and submit endpoints have async implementations
+- Legacy endpoints still use Flask
+- Both frameworks run simultaneously during transition
+- Configuration flag controls which implementation is used
+
+## Index Architecture
+
+### LSM-Tree Design
+
+The index uses a **Log-Structured Merge-tree (LSM-tree)** for efficient fingerprint storage and retrieval.
+
+**Core Concept**:
+- Writes go to in-memory segment (fast)
+- Memory segment periodically flushed to disk
+- Background process merges disk segments
+- Reads check memory segment first, then disk segments
+
+**Components**:
+
+```
+┌─────────────────────────────────────────┐
+│         MultiIndex                      │
+│  - Manages multiple named indexes       │
+│  - Routes requests to correct index     │
+└─────────────────────────────────────────┘
+              ↓
+┌─────────────────────────────────────────┐
+│         Index                           │
+│  - Single fingerprint index             │
+│  - Coordinates segments and merging     │
+└─────────────────────────────────────────┘
+              ↓
+┌──────────────────┬──────────────────────┐
+│  MemorySegment   │   FileSegment(s)     │
+│  - In-memory     │   - On-disk          │
+│  - Fast writes   │   - Immutable        │
+│  - Volatile      │   - Persistent       │
+└──────────────────┴──────────────────────┘
+              ↓
+┌─────────────────────────────────────────┐
+│         Oplog (Write-Ahead Log)         │
+│  - Durability for memory segment        │
+│  - Replay on crash recovery             │
+└─────────────────────────────────────────┘
+```
+
+### Segment Management
+
+**MemorySegment** (`src/MemorySegment.zig`):
+- Hash map of fingerprint ID to posting list
+- Posting list: array of term IDs (compressed)
+- Maximum size threshold triggers flush
+- Backed by Oplog for durability
+
+**FileSegment** (`src/FileSegment.zig`):
+- Immutable on-disk segment
+- Binary file format with index and data sections
+- StreamVByte compression for posting lists
+- Memory-mapped for fast reads
+
+**Segment Lifecycle**:
+1. Writes accumulate in MemorySegment
+2. MemorySegment reaches size threshold
+3. Flush to new FileSegment
+4. Clear MemorySegment and Oplog
+5. Background merger selects segments to merge
+6. Merge creates new larger FileSegment
+7. Delete old segments
+
+### Merge Policy
+
+**Tiered Merge Strategy**:
+- Segments grouped into tiers by size
+- Tier 0: Smallest segments (recently flushed)
+- Tier N: Largest segments (heavily merged)
+- Merge triggered when tier has too many segments
+- Merges segments within same tier
+
+**Benefits**:
+- Write amplification bounded
+- Read performance improves over time
+- Disk space reclaimed from deleted entries
+
+### File Format
+
+**Segment File Structure** (`src/filefmt.zig`):
+
+```
+┌─────────────────────────────────────────┐
+│  Header                                 │
+│  - Magic number                         │
+│  - Version                              │
+│  - Metadata                             │
+├─────────────────────────────────────────┤
+│  Index Section                          │
+│  - Fingerprint ID → Offset mapping      │
+│  - Binary search tree or hash table     │
+├─────────────────────────────────────────┤
+│  Data Section                           │
+│  - Compressed posting lists             │
+│  - StreamVByte encoded                  │
+└─────────────────────────────────────────┘
+```
+
+**Block Compression** (`src/block.zig`):
+- Posting lists compressed in blocks
+- StreamVByte SIMD compression
+- Delta encoding for term IDs
+- Typical compression ratio: 4-8x
+
+### Index Reader
+
+**IndexReader** (`src/IndexReader.zig`):
+- Read-only view of index
+- Merges results from all segments
+- Implements search algorithm
+- Returns top-K candidates by score
+
+**Search Algorithm**:
+1. Extract query terms from fingerprint
+2. For each term, fetch posting lists from all segments
+3. Merge posting lists (union)
+4. Score each candidate by term overlap
+5. Return top-K candidates sorted by score
+
+## Data Flow
+
+### Submission Flow (Detailed)
+
+```
+┌─────────┐
+│ Client  │
+└────┬────┘
+     │ POST /v2/submit
+     ↓
+┌─────────────────────────────────────────┐
+│  SubmitHandler (api/v2/submit.py)      │
+│  1. Validate API keys (client + user)  │
+│  2. Check rate limits (Redis)          │
+│  3. Decode fingerprints                │
+│  4. Insert into submission table       │
+│  5. Publish to NATS queue              │
+└─────────────────────────────────────────┘
+     │
+     ↓ NATS message
+┌─────────────────────────────────────────┐
+│  Worker (worker.py)                     │
+│  1. Consume message from NATS          │
+│  2. Load submission from database      │
+└─────────────────────────────────────────┘
+     │
+     ↓
+┌─────────────────────────────────────────┐
+│  FingerprintSearcher (data/fingerprint) │
+│  1. Extract query from fingerprint     │
+│  2. Search index for matches           │
+└─────────────────────────────────────────┘
+     │
+     ↓ HTTP POST /:index/_search
+┌─────────────────────────────────────────┐
+│  Index (fpindex)                        │
+│  1. Decode MessagePack request         │
+│  2. Search segments                    │
+│  3. Score candidates                   │
+│  4. Return top matches                 │
+└─────────────────────────────────────────┘
+     │
+     ↓ Candidate fingerprint IDs
+┌─────────────────────────────────────────┐
+│  Worker (continued)                     │
+│  1. Fetch candidate metadata from DB   │
+│  2. Decide: create new track or link   │
+│  3. Insert/update track tables         │
+│  4. Update index with new fingerprint  │
+│  5. Store result in submission_result  │
+└─────────────────────────────────────────┘
+     │
+     ↓ HTTP PUT /:index/:fpid
+┌─────────────────────────────────────────┐
+│  Index (fpindex)                        │
+│  1. Add fingerprint to MemorySegment   │
+│  2. Append to Oplog                    │
+│  3. Trigger flush if needed            │
+└─────────────────────────────────────────┘
+```
+
+### Lookup Flow (Detailed)
+
+```
+┌─────────┐
+│ Client  │
+└────┬────┘
+     │ GET/POST /v2/lookup
+     ↓
+┌─────────────────────────────────────────┐
+│  LookupHandler (api/v2/lookup.py)      │
+│  1. Validate API key (client)          │
+│  2. Check rate limits (Redis)          │
+│  3. Parse parameters                   │
+└─────────────────────────────────────────┘
+     │
+     ↓
+┌─────────────────────────────────────────┐
+│  decode_fingerprint (fingerprint.py)    │
+│  1. Decode base64 or compressed format │
+│  2. Decompress if needed               │
+│  3. Parse Chromaprint data             │
+└─────────────────────────────────────────┘
+     │
+     ↓
+┌─────────────────────────────────────────┐
+│  extract_query (fingerprint.py)         │
+│  1. Extract hash terms from fingerprint│
+│  2. Build query structure              │
+└─────────────────────────────────────────┘
+     │
+     ↓
+┌─────────────────────────────────────────┐
+│  fpstore.search (fpstore.py)            │
+│  1. Encode query as MessagePack        │
+│  2. HTTP POST to index                 │
+└─────────────────────────────────────────┘
+     │
+     ↓ HTTP POST /:index/_search
+┌─────────────────────────────────────────┐
+│  Index (fpindex)                        │
+│  1. Parse MessagePack query            │
+│  2. Search all segments                │
+│  3. Merge and score results            │
+│  4. Return top-K candidates            │
+└─────────────────────────────────────────┘
+     │
+     ↓ Candidate fingerprint IDs + scores
+┌─────────────────────────────────────────┐
+│  LookupHandler (continued)              │
+│  1. Fetch fingerprint metadata from DB │
+│  2. Fetch track metadata from DB       │
+│  3. Fetch MusicBrainz data if requested│
+│  4. Build result structure             │
+│  5. Format as JSON/XML                 │
+└─────────────────────────────────────────┘
+     │
+     ↓ JSON response
+┌─────────┐
+│ Client  │
+└─────────┘
+```
+
+### Background Processing
+
+**Cron Jobs** (`acoustid/cron.py`):
+- Update lookup statistics (hourly)
+- Update user agent statistics (daily)
+- Clean up old submissions (daily)
+- Refresh materialized views (hourly)
+- Backup index snapshots (daily)
+
+**Worker Tasks** (`acoustid/worker.py`):
+- Process fingerprint submissions
+- Import bulk fingerprints
+- Update index with new data
+- Resolve MBID redirects
+- Clean up orphaned records
+
+## Index Communication Protocols
+
+### Legacy Protocol (indexclient.py)
+
+**Transport**: Raw TCP socket  
+**Port**: 6080 (default)  
+**Format**: Custom binary protocol
+
+**Message Structure**:
+```
+┌────────────────┬────────────────┬────────────────┐
+│  Length (4B)   │  Command (1B)  │  Payload       │
+└────────────────┴────────────────┴────────────────┘
+```
+
+**Commands**:
+- `0x01`: Search
+- `0x02`: Insert
+- `0x03`: Delete
+
+**Status**: Being phased out, replaced by HTTP protocol
+
+### Modern Protocol (fpstore.py)
+
+**Transport**: HTTP/1.1  
+**Port**: 6081 (default)  
+**Format**: MessagePack
+
+**Endpoints**:
+
+| Method | Path | Purpose |
+|--------|------|---------|
+| POST | `/:index/_search` | Search for fingerprints |
+| PUT | `/:index/:fpid` | Insert/update fingerprint |
+| DELETE | `/:index/:fpid` | Delete fingerprint |
+| GET | `/:index` | Get index info |
+| GET | `/:index/_segments` | List segments |
+| GET | `/:index/_snapshot` | Create snapshot |
+
+**Search Request**:
+```python
+{
+    "query": [term_id1, term_id2, ...],  # Query terms
+    "limit": 10,                          # Max results
+    "min_score": 0.5                      # Score threshold
+}
+```
+
+**Search Response**:
+```python
+{
+    "results": [
+        {"id": fpid1, "score": 0.95},
+        {"id": fpid2, "score": 0.87},
+        ...
+    ]
+}
+```
+
+## Concurrency and Parallelism
+
+### Server Concurrency
+
+**API/Web Processes**:
+- Multiple worker processes (Gunicorn/Uvicorn)
+- Each process handles requests independently
+- Shared-nothing architecture
+- Database connection pooling per process
+
+**Worker Processes**:
+- Multiple worker instances
+- NATS queue provides work distribution
+- Each worker processes one submission at a time
+- No shared state between workers
+
+**Cron Process**:
+- Single instance (leader election via database)
+- Scheduled tasks run sequentially
+- Long-running tasks delegated to workers
+
+### Index Concurrency
+
+**Thread Model**:
+- Main thread: HTTP server
+- Worker threads: Search and merge operations
+- Configurable thread pool size
+
+**Locking Strategy**:
+- Read-write lock on Index
+- Multiple concurrent readers
+- Exclusive writer (for flush/merge)
+- Lock-free MemorySegment (atomic operations)
+
+**Background Tasks**:
+- Segment merger runs in background thread
+- Oplog flusher runs periodically
+- Metrics collector runs independently
+
+## Scalability Considerations
+
+### Horizontal Scaling
+
+**API/Web**:
+- Stateless processes
+- Scale by adding more instances
+- Load balancer distributes requests
+- Session state in Redis (if needed)
+
+**Workers**:
+- Scale by adding more instances
+- NATS queue distributes work
+- No coordination required
+
+**Index**:
+- Multiple index instances (sharding)
+- Consistent hashing for fingerprint distribution
+- NATS for cluster coordination
+- Each instance handles subset of fingerprints
+
+### Vertical Scaling
+
+**Database**:
+- Connection pooling
+- Read replicas for queries
+- Partitioning for large tables
+- Materialized views for aggregations
+
+**Index**:
+- More threads for search
+- Larger memory segment
+- Faster disk for segments
+- More RAM for file caching
+
+## Fault Tolerance
+
+### Server Resilience
+
+**Database Failures**:
+- Connection retry with exponential backoff
+- Health checks detect failures
+- Read-only mode if write DB unavailable
+
+**Index Failures**:
+- Graceful degradation (return partial results)
+- Retry with exponential backoff
+- Circuit breaker pattern
+
+**NATS Failures**:
+- Persistent queue (JetStream)
+- Automatic reconnection
+- Message replay on recovery
+
+### Index Resilience
+
+**Crash Recovery**:
+- Oplog replay restores MemorySegment
+- FileSegments are immutable (no corruption)
+- Incomplete merges discarded
+
+**Data Integrity**:
+- Checksums in file format
+- Atomic file operations
+- Write-ahead logging
+
+**Replication**:
+- NATS-based replication (optional)
+- Snapshot-based backup
+- Point-in-time recovery
+
+## Performance Characteristics
+
+### Server Performance
+
+**Lookup Latency**:
+- P50: ~50ms (including index search)
+- P95: ~200ms
+- P99: ~500ms
+
+**Bottlenecks**:
+- Index search time (dominant)
+- Database query time (metadata fetch)
+- Network latency (MusicBrainz queries)
+
+### Index Performance
+
+**Search Latency**:
+- P50: ~5ms
+- P95: ~20ms
+- P99: ~50ms
+
+**Throughput**:
+- ~1000 searches/second (single instance)
+- ~500 inserts/second (single instance)
+
+**Bottlenecks**:
+- Disk I/O (segment reads)
+- CPU (decompression and scoring)
+- Memory (segment caching)
+
+## Future Architecture Plans
+
+### Server Modernization
+
+1. Complete migration to Starlette/ASGI
+2. Remove Flask dependencies
+3. Async database operations everywhere
+4. GraphQL API alongside REST
+
+### Index Enhancements
+
+1. Distributed index with automatic sharding
+2. Replication for high availability
+3. Incremental snapshots
+4. Query result caching
+
+### Infrastructure
+
+1. Kubernetes deployment
+2. Service mesh (Istio/Linkerd)
+3. Distributed tracing (OpenTelemetry)
+4. Advanced monitoring (Prometheus + Grafana)