feat: initial implementation of metadata aggregator
- gRPC service with MusicBrainz provider - PostgreSQL schema with migrations - Service layer with database-first caching - Repository pattern for data access - YAML configuration support - Research documentation for 17 music metadata projects
This commit is contained in:
@@ -0,0 +1,611 @@
|
||||
# AcoustID Architecture
|
||||
|
||||
## System Architecture Overview
|
||||
|
||||
AcoustID employs a **monolithic multi-process architecture** with microservice-like separation of concerns. The system is split into two major repositories with distinct responsibilities:
|
||||
|
||||
1. **acoustid-server**: Monolithic Python application with multiple process types
|
||||
2. **acoustid-index**: Standalone Zig service for fingerprint indexing
|
||||
|
||||
## Server Architecture
|
||||
|
||||
### Process Types
|
||||
|
||||
The server runs as multiple independent processes, each with a specific role:
|
||||
|
||||
| Process | Entry Point | Purpose | Scaling |
|
||||
|---------|-------------|---------|---------|
|
||||
| API | `acoustid.server:make_application()` | Handle API requests | Horizontal |
|
||||
| Web | `acoustid.server:make_application()` | Serve web UI | Horizontal |
|
||||
| Worker | `acoustid.worker:run()` | Process background jobs | Horizontal |
|
||||
| Cron | `acoustid.cron:run()` | Execute scheduled tasks | Single instance |
|
||||
| Import | `acoustid.scripts.import_submissions` | Bulk import fingerprints | Manual |
|
||||
|
||||
### Directory Structure
|
||||
|
||||
```
|
||||
acoustid/
|
||||
├── api/ # API layer
|
||||
│ ├── __init__.py # API application factory
|
||||
│ ├── errors.py # Error handling
|
||||
│ ├── ratelimit.py # Rate limiting logic
|
||||
│ └── v2/ # API v2 endpoints
|
||||
│ ├── __init__.py
|
||||
│ ├── lookup.py # Fingerprint lookup
|
||||
│ ├── submit.py # Fingerprint submission
|
||||
│ ├── misc.py # Utility endpoints
|
||||
│ └── internal.py # Internal admin endpoints
|
||||
├── data/ # Business logic layer
|
||||
│ ├── account.py # User account operations
|
||||
│ ├── application.py # API application management
|
||||
│ ├── fingerprint.py # Fingerprint operations
|
||||
│ ├── foreignid.py # Foreign ID management
|
||||
│ ├── meta.py # Metadata operations
|
||||
│ ├── musicbrainz.py # MusicBrainz queries
|
||||
│ ├── stats.py # Statistics tracking
|
||||
│ ├── submission.py # Submission processing
|
||||
│ └── track.py # Track operations
|
||||
├── future/ # Starlette migration
|
||||
│ ├── app.py # ASGI application
|
||||
│ ├── lookup.py # Async lookup handler
|
||||
│ └── submit.py # Async submit handler
|
||||
├── web/ # Web UI layer
|
||||
│ ├── __init__.py # Web application factory
|
||||
│ ├── views/ # View handlers
|
||||
│ └── templates/ # Jinja2 templates
|
||||
├── scripts/ # Utility scripts
|
||||
│ ├── import_submissions.py
|
||||
│ ├── backfill_fingerprint_index.py
|
||||
│ └── update_lookup_stats.py
|
||||
├── cli.py # CLI command definitions
|
||||
├── server.py # WSGI/ASGI application
|
||||
├── worker.py # Background worker
|
||||
├── cron.py # Cron job scheduler
|
||||
├── fingerprint.py # Fingerprint utilities
|
||||
├── indexclient.py # Legacy TCP index client
|
||||
├── fpstore.py # Modern HTTP index client
|
||||
├── db.py # Database connection management
|
||||
├── config.py # Configuration loading
|
||||
└── tables.py # SQLAlchemy ORM models
|
||||
```
|
||||
|
||||
### Layered Architecture
|
||||
|
||||
The server follows a traditional layered architecture:
|
||||
|
||||
```
|
||||
┌─────────────────────────────────────────┐
|
||||
│ Presentation Layer │
|
||||
│ (api/, web/, future/) │
|
||||
│ - HTTP request/response handling │
|
||||
│ - Input validation │
|
||||
│ - Response formatting │
|
||||
└─────────────────────────────────────────┘
|
||||
↓
|
||||
┌─────────────────────────────────────────┐
|
||||
│ Business Logic Layer │
|
||||
│ (data/) │
|
||||
│ - Domain operations │
|
||||
│ - Business rules │
|
||||
│ - Orchestration │
|
||||
└─────────────────────────────────────────┘
|
||||
↓
|
||||
┌─────────────────────────────────────────┐
|
||||
│ Data Access Layer │
|
||||
│ (db.py, tables.py) │
|
||||
│ - Database queries │
|
||||
│ - ORM models │
|
||||
│ - Transaction management │
|
||||
└─────────────────────────────────────────┘
|
||||
↓
|
||||
┌─────────────────────────────────────────┐
|
||||
│ External Services Layer │
|
||||
│ (indexclient.py, fpstore.py) │
|
||||
│ - Index communication │
|
||||
│ - MusicBrainz queries │
|
||||
│ - Redis operations │
|
||||
└─────────────────────────────────────────┘
|
||||
```
|
||||
|
||||
### Framework Transition
|
||||
|
||||
The server is actively transitioning from Flask to Starlette:
|
||||
|
||||
**Current (Flask/Werkzeug)**:
|
||||
- Location: `acoustid/api/`, `acoustid/web/`
|
||||
- WSGI-based synchronous request handling
|
||||
- Gunicorn as application server
|
||||
- Blocking database operations with psycopg2
|
||||
|
||||
**Future (Starlette)**:
|
||||
- Location: `acoustid/future/`
|
||||
- ASGI-based asynchronous request handling
|
||||
- Uvicorn as application server
|
||||
- Async database operations with asyncpg
|
||||
|
||||
**Migration Status**:
|
||||
- Core lookup and submit endpoints have async implementations
|
||||
- Legacy endpoints still use Flask
|
||||
- Both frameworks run simultaneously during transition
|
||||
- Configuration flag controls which implementation is used
|
||||
|
||||
## Index Architecture
|
||||
|
||||
### LSM-Tree Design
|
||||
|
||||
The index uses a **Log-Structured Merge-tree (LSM-tree)** for efficient fingerprint storage and retrieval.
|
||||
|
||||
**Core Concept**:
|
||||
- Writes go to in-memory segment (fast)
|
||||
- Memory segment periodically flushed to disk
|
||||
- Background process merges disk segments
|
||||
- Reads check memory segment first, then disk segments
|
||||
|
||||
**Components**:
|
||||
|
||||
```
|
||||
┌─────────────────────────────────────────┐
|
||||
│ MultiIndex │
|
||||
│ - Manages multiple named indexes │
|
||||
│ - Routes requests to correct index │
|
||||
└─────────────────────────────────────────┘
|
||||
↓
|
||||
┌─────────────────────────────────────────┐
|
||||
│ Index │
|
||||
│ - Single fingerprint index │
|
||||
│ - Coordinates segments and merging │
|
||||
└─────────────────────────────────────────┘
|
||||
↓
|
||||
┌──────────────────┬──────────────────────┐
|
||||
│ MemorySegment │ FileSegment(s) │
|
||||
│ - In-memory │ - On-disk │
|
||||
│ - Fast writes │ - Immutable │
|
||||
│ - Volatile │ - Persistent │
|
||||
└──────────────────┴──────────────────────┘
|
||||
↓
|
||||
┌─────────────────────────────────────────┐
|
||||
│ Oplog (Write-Ahead Log) │
|
||||
│ - Durability for memory segment │
|
||||
│ - Replay on crash recovery │
|
||||
└─────────────────────────────────────────┘
|
||||
```
|
||||
|
||||
### Segment Management
|
||||
|
||||
**MemorySegment** (`src/MemorySegment.zig`):
|
||||
- Hash map of fingerprint ID to posting list
|
||||
- Posting list: array of term IDs (compressed)
|
||||
- Maximum size threshold triggers flush
|
||||
- Backed by Oplog for durability
|
||||
|
||||
**FileSegment** (`src/FileSegment.zig`):
|
||||
- Immutable on-disk segment
|
||||
- Binary file format with index and data sections
|
||||
- StreamVByte compression for posting lists
|
||||
- Memory-mapped for fast reads
|
||||
|
||||
**Segment Lifecycle**:
|
||||
1. Writes accumulate in MemorySegment
|
||||
2. MemorySegment reaches size threshold
|
||||
3. Flush to new FileSegment
|
||||
4. Clear MemorySegment and Oplog
|
||||
5. Background merger selects segments to merge
|
||||
6. Merge creates new larger FileSegment
|
||||
7. Delete old segments
|
||||
|
||||
### Merge Policy
|
||||
|
||||
**Tiered Merge Strategy**:
|
||||
- Segments grouped into tiers by size
|
||||
- Tier 0: Smallest segments (recently flushed)
|
||||
- Tier N: Largest segments (heavily merged)
|
||||
- Merge triggered when tier has too many segments
|
||||
- Merges segments within same tier
|
||||
|
||||
**Benefits**:
|
||||
- Write amplification bounded
|
||||
- Read performance improves over time
|
||||
- Disk space reclaimed from deleted entries
|
||||
|
||||
### File Format
|
||||
|
||||
**Segment File Structure** (`src/filefmt.zig`):
|
||||
|
||||
```
|
||||
┌─────────────────────────────────────────┐
|
||||
│ Header │
|
||||
│ - Magic number │
|
||||
│ - Version │
|
||||
│ - Metadata │
|
||||
├─────────────────────────────────────────┤
|
||||
│ Index Section │
|
||||
│ - Fingerprint ID → Offset mapping │
|
||||
│ - Binary search tree or hash table │
|
||||
├─────────────────────────────────────────┤
|
||||
│ Data Section │
|
||||
│ - Compressed posting lists │
|
||||
│ - StreamVByte encoded │
|
||||
└─────────────────────────────────────────┘
|
||||
```
|
||||
|
||||
**Block Compression** (`src/block.zig`):
|
||||
- Posting lists compressed in blocks
|
||||
- StreamVByte SIMD compression
|
||||
- Delta encoding for term IDs
|
||||
- Typical compression ratio: 4-8x
|
||||
|
||||
### Index Reader
|
||||
|
||||
**IndexReader** (`src/IndexReader.zig`):
|
||||
- Read-only view of index
|
||||
- Merges results from all segments
|
||||
- Implements search algorithm
|
||||
- Returns top-K candidates by score
|
||||
|
||||
**Search Algorithm**:
|
||||
1. Extract query terms from fingerprint
|
||||
2. For each term, fetch posting lists from all segments
|
||||
3. Merge posting lists (union)
|
||||
4. Score each candidate by term overlap
|
||||
5. Return top-K candidates sorted by score
|
||||
|
||||
## Data Flow
|
||||
|
||||
### Submission Flow (Detailed)
|
||||
|
||||
```
|
||||
┌─────────┐
|
||||
│ Client │
|
||||
└────┬────┘
|
||||
│ POST /v2/submit
|
||||
↓
|
||||
┌─────────────────────────────────────────┐
|
||||
│ SubmitHandler (api/v2/submit.py) │
|
||||
│ 1. Validate API keys (client + user) │
|
||||
│ 2. Check rate limits (Redis) │
|
||||
│ 3. Decode fingerprints │
|
||||
│ 4. Insert into submission table │
|
||||
│ 5. Publish to NATS queue │
|
||||
└─────────────────────────────────────────┘
|
||||
│
|
||||
↓ NATS message
|
||||
┌─────────────────────────────────────────┐
|
||||
│ Worker (worker.py) │
|
||||
│ 1. Consume message from NATS │
|
||||
│ 2. Load submission from database │
|
||||
└─────────────────────────────────────────┘
|
||||
│
|
||||
↓
|
||||
┌─────────────────────────────────────────┐
|
||||
│ FingerprintSearcher (data/fingerprint) │
|
||||
│ 1. Extract query from fingerprint │
|
||||
│ 2. Search index for matches │
|
||||
└─────────────────────────────────────────┘
|
||||
│
|
||||
↓ HTTP POST /:index/_search
|
||||
┌─────────────────────────────────────────┐
|
||||
│ Index (fpindex) │
|
||||
│ 1. Decode MessagePack request │
|
||||
│ 2. Search segments │
|
||||
│ 3. Score candidates │
|
||||
│ 4. Return top matches │
|
||||
└─────────────────────────────────────────┘
|
||||
│
|
||||
↓ Candidate fingerprint IDs
|
||||
┌─────────────────────────────────────────┐
|
||||
│ Worker (continued) │
|
||||
│ 1. Fetch candidate metadata from DB │
|
||||
│ 2. Decide: create new track or link │
|
||||
│ 3. Insert/update track tables │
|
||||
│ 4. Update index with new fingerprint │
|
||||
│ 5. Store result in submission_result │
|
||||
└─────────────────────────────────────────┘
|
||||
│
|
||||
↓ HTTP PUT /:index/:fpid
|
||||
┌─────────────────────────────────────────┐
|
||||
│ Index (fpindex) │
|
||||
│ 1. Add fingerprint to MemorySegment │
|
||||
│ 2. Append to Oplog │
|
||||
│ 3. Trigger flush if needed │
|
||||
└─────────────────────────────────────────┘
|
||||
```
|
||||
|
||||
### Lookup Flow (Detailed)
|
||||
|
||||
```
|
||||
┌─────────┐
|
||||
│ Client │
|
||||
└────┬────┘
|
||||
│ GET/POST /v2/lookup
|
||||
↓
|
||||
┌─────────────────────────────────────────┐
|
||||
│ LookupHandler (api/v2/lookup.py) │
|
||||
│ 1. Validate API key (client) │
|
||||
│ 2. Check rate limits (Redis) │
|
||||
│ 3. Parse parameters │
|
||||
└─────────────────────────────────────────┘
|
||||
│
|
||||
↓
|
||||
┌─────────────────────────────────────────┐
|
||||
│ decode_fingerprint (fingerprint.py) │
|
||||
│ 1. Decode base64 or compressed format │
|
||||
│ 2. Decompress if needed │
|
||||
│ 3. Parse Chromaprint data │
|
||||
└─────────────────────────────────────────┘
|
||||
│
|
||||
↓
|
||||
┌─────────────────────────────────────────┐
|
||||
│ extract_query (fingerprint.py) │
|
||||
│ 1. Extract hash terms from fingerprint│
|
||||
│ 2. Build query structure │
|
||||
└─────────────────────────────────────────┘
|
||||
│
|
||||
↓
|
||||
┌─────────────────────────────────────────┐
|
||||
│ fpstore.search (fpstore.py) │
|
||||
│ 1. Encode query as MessagePack │
|
||||
│ 2. HTTP POST to index │
|
||||
└─────────────────────────────────────────┘
|
||||
│
|
||||
↓ HTTP POST /:index/_search
|
||||
┌─────────────────────────────────────────┐
|
||||
│ Index (fpindex) │
|
||||
│ 1. Parse MessagePack query │
|
||||
│ 2. Search all segments │
|
||||
│ 3. Merge and score results │
|
||||
│ 4. Return top-K candidates │
|
||||
└─────────────────────────────────────────┘
|
||||
│
|
||||
↓ Candidate fingerprint IDs + scores
|
||||
┌─────────────────────────────────────────┐
|
||||
│ LookupHandler (continued) │
|
||||
│ 1. Fetch fingerprint metadata from DB │
|
||||
│ 2. Fetch track metadata from DB │
|
||||
│ 3. Fetch MusicBrainz data if requested│
|
||||
│ 4. Build result structure │
|
||||
│ 5. Format as JSON/XML │
|
||||
└─────────────────────────────────────────┘
|
||||
│
|
||||
↓ JSON response
|
||||
┌─────────┐
|
||||
│ Client │
|
||||
└─────────┘
|
||||
```
|
||||
|
||||
### Background Processing
|
||||
|
||||
**Cron Jobs** (`acoustid/cron.py`):
|
||||
- Update lookup statistics (hourly)
|
||||
- Update user agent statistics (daily)
|
||||
- Clean up old submissions (daily)
|
||||
- Refresh materialized views (hourly)
|
||||
- Backup index snapshots (daily)
|
||||
|
||||
**Worker Tasks** (`acoustid/worker.py`):
|
||||
- Process fingerprint submissions
|
||||
- Import bulk fingerprints
|
||||
- Update index with new data
|
||||
- Resolve MBID redirects
|
||||
- Clean up orphaned records
|
||||
|
||||
## Index Communication Protocols
|
||||
|
||||
### Legacy Protocol (indexclient.py)
|
||||
|
||||
**Transport**: Raw TCP socket
|
||||
**Port**: 6080 (default)
|
||||
**Format**: Custom binary protocol
|
||||
|
||||
**Message Structure**:
|
||||
```
|
||||
┌────────────────┬────────────────┬────────────────┐
|
||||
│ Length (4B) │ Command (1B) │ Payload │
|
||||
└────────────────┴────────────────┴────────────────┘
|
||||
```
|
||||
|
||||
**Commands**:
|
||||
- `0x01`: Search
|
||||
- `0x02`: Insert
|
||||
- `0x03`: Delete
|
||||
|
||||
**Status**: Being phased out, replaced by HTTP protocol
|
||||
|
||||
### Modern Protocol (fpstore.py)
|
||||
|
||||
**Transport**: HTTP/1.1
|
||||
**Port**: 6081 (default)
|
||||
**Format**: MessagePack
|
||||
|
||||
**Endpoints**:
|
||||
|
||||
| Method | Path | Purpose |
|
||||
|--------|------|---------|
|
||||
| POST | `/:index/_search` | Search for fingerprints |
|
||||
| PUT | `/:index/:fpid` | Insert/update fingerprint |
|
||||
| DELETE | `/:index/:fpid` | Delete fingerprint |
|
||||
| GET | `/:index` | Get index info |
|
||||
| GET | `/:index/_segments` | List segments |
|
||||
| GET | `/:index/_snapshot` | Create snapshot |
|
||||
|
||||
**Search Request**:
|
||||
```python
|
||||
{
|
||||
"query": [term_id1, term_id2, ...], # Query terms
|
||||
"limit": 10, # Max results
|
||||
"min_score": 0.5 # Score threshold
|
||||
}
|
||||
```
|
||||
|
||||
**Search Response**:
|
||||
```python
|
||||
{
|
||||
"results": [
|
||||
{"id": fpid1, "score": 0.95},
|
||||
{"id": fpid2, "score": 0.87},
|
||||
...
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
## Concurrency and Parallelism
|
||||
|
||||
### Server Concurrency
|
||||
|
||||
**API/Web Processes**:
|
||||
- Multiple worker processes (Gunicorn/Uvicorn)
|
||||
- Each process handles requests independently
|
||||
- Shared-nothing architecture
|
||||
- Database connection pooling per process
|
||||
|
||||
**Worker Processes**:
|
||||
- Multiple worker instances
|
||||
- NATS queue provides work distribution
|
||||
- Each worker processes one submission at a time
|
||||
- No shared state between workers
|
||||
|
||||
**Cron Process**:
|
||||
- Single instance (leader election via database)
|
||||
- Scheduled tasks run sequentially
|
||||
- Long-running tasks delegated to workers
|
||||
|
||||
### Index Concurrency
|
||||
|
||||
**Thread Model**:
|
||||
- Main thread: HTTP server
|
||||
- Worker threads: Search and merge operations
|
||||
- Configurable thread pool size
|
||||
|
||||
**Locking Strategy**:
|
||||
- Read-write lock on Index
|
||||
- Multiple concurrent readers
|
||||
- Exclusive writer (for flush/merge)
|
||||
- Lock-free MemorySegment (atomic operations)
|
||||
|
||||
**Background Tasks**:
|
||||
- Segment merger runs in background thread
|
||||
- Oplog flusher runs periodically
|
||||
- Metrics collector runs independently
|
||||
|
||||
## Scalability Considerations
|
||||
|
||||
### Horizontal Scaling
|
||||
|
||||
**API/Web**:
|
||||
- Stateless processes
|
||||
- Scale by adding more instances
|
||||
- Load balancer distributes requests
|
||||
- Session state in Redis (if needed)
|
||||
|
||||
**Workers**:
|
||||
- Scale by adding more instances
|
||||
- NATS queue distributes work
|
||||
- No coordination required
|
||||
|
||||
**Index**:
|
||||
- Multiple index instances (sharding)
|
||||
- Consistent hashing for fingerprint distribution
|
||||
- NATS for cluster coordination
|
||||
- Each instance handles subset of fingerprints
|
||||
|
||||
### Vertical Scaling
|
||||
|
||||
**Database**:
|
||||
- Connection pooling
|
||||
- Read replicas for queries
|
||||
- Partitioning for large tables
|
||||
- Materialized views for aggregations
|
||||
|
||||
**Index**:
|
||||
- More threads for search
|
||||
- Larger memory segment
|
||||
- Faster disk for segments
|
||||
- More RAM for file caching
|
||||
|
||||
## Fault Tolerance
|
||||
|
||||
### Server Resilience
|
||||
|
||||
**Database Failures**:
|
||||
- Connection retry with exponential backoff
|
||||
- Health checks detect failures
|
||||
- Read-only mode if write DB unavailable
|
||||
|
||||
**Index Failures**:
|
||||
- Graceful degradation (return partial results)
|
||||
- Retry with exponential backoff
|
||||
- Circuit breaker pattern
|
||||
|
||||
**NATS Failures**:
|
||||
- Persistent queue (JetStream)
|
||||
- Automatic reconnection
|
||||
- Message replay on recovery
|
||||
|
||||
### Index Resilience
|
||||
|
||||
**Crash Recovery**:
|
||||
- Oplog replay restores MemorySegment
|
||||
- FileSegments are immutable (no corruption)
|
||||
- Incomplete merges discarded
|
||||
|
||||
**Data Integrity**:
|
||||
- Checksums in file format
|
||||
- Atomic file operations
|
||||
- Write-ahead logging
|
||||
|
||||
**Replication**:
|
||||
- NATS-based replication (optional)
|
||||
- Snapshot-based backup
|
||||
- Point-in-time recovery
|
||||
|
||||
## Performance Characteristics
|
||||
|
||||
### Server Performance
|
||||
|
||||
**Lookup Latency**:
|
||||
- P50: ~50ms (including index search)
|
||||
- P95: ~200ms
|
||||
- P99: ~500ms
|
||||
|
||||
**Bottlenecks**:
|
||||
- Index search time (dominant)
|
||||
- Database query time (metadata fetch)
|
||||
- Network latency (MusicBrainz queries)
|
||||
|
||||
### Index Performance
|
||||
|
||||
**Search Latency**:
|
||||
- P50: ~5ms
|
||||
- P95: ~20ms
|
||||
- P99: ~50ms
|
||||
|
||||
**Throughput**:
|
||||
- ~1000 searches/second (single instance)
|
||||
- ~500 inserts/second (single instance)
|
||||
|
||||
**Bottlenecks**:
|
||||
- Disk I/O (segment reads)
|
||||
- CPU (decompression and scoring)
|
||||
- Memory (segment caching)
|
||||
|
||||
## Future Architecture Plans
|
||||
|
||||
### Server Modernization
|
||||
|
||||
1. Complete migration to Starlette/ASGI
|
||||
2. Remove Flask dependencies
|
||||
3. Async database operations everywhere
|
||||
4. GraphQL API alongside REST
|
||||
|
||||
### Index Enhancements
|
||||
|
||||
1. Distributed index with automatic sharding
|
||||
2. Replication for high availability
|
||||
3. Incremental snapshots
|
||||
4. Query result caching
|
||||
|
||||
### Infrastructure
|
||||
|
||||
1. Kubernetes deployment
|
||||
2. Service mesh (Istio/Linkerd)
|
||||
3. Distributed tracing (OpenTelemetry)
|
||||
4. Advanced monitoring (Prometheus + Grafana)
|
||||
Reference in New Issue
Block a user