a1f6701bac
- gRPC service with MusicBrainz provider - PostgreSQL schema with migrations - Service layer with database-first caching - Repository pattern for data access - YAML configuration support - Research documentation for 17 music metadata projects
612 lines
22 KiB
Markdown
612 lines
22 KiB
Markdown
# AcoustID Architecture
|
|
|
|
## System Architecture Overview
|
|
|
|
AcoustID employs a **monolithic multi-process architecture** with microservice-like separation of concerns. The system is split into two major repositories with distinct responsibilities:
|
|
|
|
1. **acoustid-server**: Monolithic Python application with multiple process types
|
|
2. **acoustid-index**: Standalone Zig service for fingerprint indexing
|
|
|
|
## Server Architecture
|
|
|
|
### Process Types
|
|
|
|
The server runs as multiple independent processes, each with a specific role:
|
|
|
|
| Process | Entry Point | Purpose | Scaling |
|
|
|---------|-------------|---------|---------|
|
|
| API | `acoustid.server:make_application()` | Handle API requests | Horizontal |
|
|
| Web | `acoustid.server:make_application()` | Serve web UI | Horizontal |
|
|
| Worker | `acoustid.worker:run()` | Process background jobs | Horizontal |
|
|
| Cron | `acoustid.cron:run()` | Execute scheduled tasks | Single instance |
|
|
| Import | `acoustid.scripts.import_submissions` | Bulk import fingerprints | Manual |
|
|
|
|
### Directory Structure
|
|
|
|
```
|
|
acoustid/
|
|
├── api/ # API layer
|
|
│ ├── __init__.py # API application factory
|
|
│ ├── errors.py # Error handling
|
|
│ ├── ratelimit.py # Rate limiting logic
|
|
│ └── v2/ # API v2 endpoints
|
|
│ ├── __init__.py
|
|
│ ├── lookup.py # Fingerprint lookup
|
|
│ ├── submit.py # Fingerprint submission
|
|
│ ├── misc.py # Utility endpoints
|
|
│ └── internal.py # Internal admin endpoints
|
|
├── data/ # Business logic layer
|
|
│ ├── account.py # User account operations
|
|
│ ├── application.py # API application management
|
|
│ ├── fingerprint.py # Fingerprint operations
|
|
│ ├── foreignid.py # Foreign ID management
|
|
│ ├── meta.py # Metadata operations
|
|
│ ├── musicbrainz.py # MusicBrainz queries
|
|
│ ├── stats.py # Statistics tracking
|
|
│ ├── submission.py # Submission processing
|
|
│ └── track.py # Track operations
|
|
├── future/ # Starlette migration
|
|
│ ├── app.py # ASGI application
|
|
│ ├── lookup.py # Async lookup handler
|
|
│ └── submit.py # Async submit handler
|
|
├── web/ # Web UI layer
|
|
│ ├── __init__.py # Web application factory
|
|
│ ├── views/ # View handlers
|
|
│ └── templates/ # Jinja2 templates
|
|
├── scripts/ # Utility scripts
|
|
│ ├── import_submissions.py
|
|
│ ├── backfill_fingerprint_index.py
|
|
│ └── update_lookup_stats.py
|
|
├── cli.py # CLI command definitions
|
|
├── server.py # WSGI/ASGI application
|
|
├── worker.py # Background worker
|
|
├── cron.py # Cron job scheduler
|
|
├── fingerprint.py # Fingerprint utilities
|
|
├── indexclient.py # Legacy TCP index client
|
|
├── fpstore.py # Modern HTTP index client
|
|
├── db.py # Database connection management
|
|
├── config.py # Configuration loading
|
|
└── tables.py # SQLAlchemy ORM models
|
|
```
|
|
|
|
### Layered Architecture
|
|
|
|
The server follows a traditional layered architecture:
|
|
|
|
```
|
|
┌─────────────────────────────────────────┐
|
|
│ Presentation Layer │
|
|
│ (api/, web/, future/) │
|
|
│ - HTTP request/response handling │
|
|
│ - Input validation │
|
|
│ - Response formatting │
|
|
└─────────────────────────────────────────┘
|
|
↓
|
|
┌─────────────────────────────────────────┐
|
|
│ Business Logic Layer │
|
|
│ (data/) │
|
|
│ - Domain operations │
|
|
│ - Business rules │
|
|
│ - Orchestration │
|
|
└─────────────────────────────────────────┘
|
|
↓
|
|
┌─────────────────────────────────────────┐
|
|
│ Data Access Layer │
|
|
│ (db.py, tables.py) │
|
|
│ - Database queries │
|
|
│ - ORM models │
|
|
│ - Transaction management │
|
|
└─────────────────────────────────────────┘
|
|
↓
|
|
┌─────────────────────────────────────────┐
|
|
│ External Services Layer │
|
|
│ (indexclient.py, fpstore.py) │
|
|
│ - Index communication │
|
|
│ - MusicBrainz queries │
|
|
│ - Redis operations │
|
|
└─────────────────────────────────────────┘
|
|
```
|
|
|
|
### Framework Transition
|
|
|
|
The server is actively transitioning from Flask to Starlette:
|
|
|
|
**Current (Flask/Werkzeug)**:
|
|
- Location: `acoustid/api/`, `acoustid/web/`
|
|
- WSGI-based synchronous request handling
|
|
- Gunicorn as application server
|
|
- Blocking database operations with psycopg2
|
|
|
|
**Future (Starlette)**:
|
|
- Location: `acoustid/future/`
|
|
- ASGI-based asynchronous request handling
|
|
- Uvicorn as application server
|
|
- Async database operations with asyncpg
|
|
|
|
**Migration Status**:
|
|
- Core lookup and submit endpoints have async implementations
|
|
- Legacy endpoints still use Flask
|
|
- Both frameworks run simultaneously during transition
|
|
- Configuration flag controls which implementation is used
|
|
|
|
## Index Architecture
|
|
|
|
### LSM-Tree Design
|
|
|
|
The index uses a **Log-Structured Merge-tree (LSM-tree)** for efficient fingerprint storage and retrieval.
|
|
|
|
**Core Concept**:
|
|
- Writes go to in-memory segment (fast)
|
|
- Memory segment periodically flushed to disk
|
|
- Background process merges disk segments
|
|
- Reads check memory segment first, then disk segments
|
|
|
|
**Components**:
|
|
|
|
```
|
|
┌─────────────────────────────────────────┐
|
|
│ MultiIndex │
|
|
│ - Manages multiple named indexes │
|
|
│ - Routes requests to correct index │
|
|
└─────────────────────────────────────────┘
|
|
↓
|
|
┌─────────────────────────────────────────┐
|
|
│ Index │
|
|
│ - Single fingerprint index │
|
|
│ - Coordinates segments and merging │
|
|
└─────────────────────────────────────────┘
|
|
↓
|
|
┌──────────────────┬──────────────────────┐
|
|
│ MemorySegment │ FileSegment(s) │
|
|
│ - In-memory │ - On-disk │
|
|
│ - Fast writes │ - Immutable │
|
|
│ - Volatile │ - Persistent │
|
|
└──────────────────┴──────────────────────┘
|
|
↓
|
|
┌─────────────────────────────────────────┐
|
|
│ Oplog (Write-Ahead Log) │
|
|
│ - Durability for memory segment │
|
|
│ - Replay on crash recovery │
|
|
└─────────────────────────────────────────┘
|
|
```
|
|
|
|
### Segment Management
|
|
|
|
**MemorySegment** (`src/MemorySegment.zig`):
|
|
- Hash map of fingerprint ID to posting list
|
|
- Posting list: array of term IDs (compressed)
|
|
- Maximum size threshold triggers flush
|
|
- Backed by Oplog for durability
|
|
|
|
**FileSegment** (`src/FileSegment.zig`):
|
|
- Immutable on-disk segment
|
|
- Binary file format with index and data sections
|
|
- StreamVByte compression for posting lists
|
|
- Memory-mapped for fast reads
|
|
|
|
**Segment Lifecycle**:
|
|
1. Writes accumulate in MemorySegment
|
|
2. MemorySegment reaches size threshold
|
|
3. Flush to new FileSegment
|
|
4. Clear MemorySegment and Oplog
|
|
5. Background merger selects segments to merge
|
|
6. Merge creates new larger FileSegment
|
|
7. Delete old segments
|
|
|
|
### Merge Policy
|
|
|
|
**Tiered Merge Strategy**:
|
|
- Segments grouped into tiers by size
|
|
- Tier 0: Smallest segments (recently flushed)
|
|
- Tier N: Largest segments (heavily merged)
|
|
- Merge triggered when tier has too many segments
|
|
- Merges segments within same tier
|
|
|
|
**Benefits**:
|
|
- Write amplification bounded
|
|
- Read performance improves over time
|
|
- Disk space reclaimed from deleted entries
|
|
|
|
### File Format
|
|
|
|
**Segment File Structure** (`src/filefmt.zig`):
|
|
|
|
```
|
|
┌─────────────────────────────────────────┐
|
|
│ Header │
|
|
│ - Magic number │
|
|
│ - Version │
|
|
│ - Metadata │
|
|
├─────────────────────────────────────────┤
|
|
│ Index Section │
|
|
│ - Fingerprint ID → Offset mapping │
|
|
│ - Binary search tree or hash table │
|
|
├─────────────────────────────────────────┤
|
|
│ Data Section │
|
|
│ - Compressed posting lists │
|
|
│ - StreamVByte encoded │
|
|
└─────────────────────────────────────────┘
|
|
```
|
|
|
|
**Block Compression** (`src/block.zig`):
|
|
- Posting lists compressed in blocks
|
|
- StreamVByte SIMD compression
|
|
- Delta encoding for term IDs
|
|
- Typical compression ratio: 4-8x
|
|
|
|
### Index Reader
|
|
|
|
**IndexReader** (`src/IndexReader.zig`):
|
|
- Read-only view of index
|
|
- Merges results from all segments
|
|
- Implements search algorithm
|
|
- Returns top-K candidates by score
|
|
|
|
**Search Algorithm**:
|
|
1. Extract query terms from fingerprint
|
|
2. For each term, fetch posting lists from all segments
|
|
3. Merge posting lists (union)
|
|
4. Score each candidate by term overlap
|
|
5. Return top-K candidates sorted by score
|
|
|
|
## Data Flow
|
|
|
|
### Submission Flow (Detailed)
|
|
|
|
```
|
|
┌─────────┐
|
|
│ Client │
|
|
└────┬────┘
|
|
│ POST /v2/submit
|
|
↓
|
|
┌─────────────────────────────────────────┐
|
|
│ SubmitHandler (api/v2/submit.py) │
|
|
│ 1. Validate API keys (client + user) │
|
|
│ 2. Check rate limits (Redis) │
|
|
│ 3. Decode fingerprints │
|
|
│ 4. Insert into submission table │
|
|
│ 5. Publish to NATS queue │
|
|
└─────────────────────────────────────────┘
|
|
│
|
|
↓ NATS message
|
|
┌─────────────────────────────────────────┐
|
|
│ Worker (worker.py) │
|
|
│ 1. Consume message from NATS │
|
|
│ 2. Load submission from database │
|
|
└─────────────────────────────────────────┘
|
|
│
|
|
↓
|
|
┌─────────────────────────────────────────┐
|
|
│ FingerprintSearcher (data/fingerprint) │
|
|
│ 1. Extract query from fingerprint │
|
|
│ 2. Search index for matches │
|
|
└─────────────────────────────────────────┘
|
|
│
|
|
↓ HTTP POST /:index/_search
|
|
┌─────────────────────────────────────────┐
|
|
│ Index (fpindex) │
|
|
│ 1. Decode MessagePack request │
|
|
│ 2. Search segments │
|
|
│ 3. Score candidates │
|
|
│ 4. Return top matches │
|
|
└─────────────────────────────────────────┘
|
|
│
|
|
↓ Candidate fingerprint IDs
|
|
┌─────────────────────────────────────────┐
|
|
│ Worker (continued) │
|
|
│ 1. Fetch candidate metadata from DB │
|
|
│ 2. Decide: create new track or link │
|
|
│ 3. Insert/update track tables │
|
|
│ 4. Update index with new fingerprint │
|
|
│ 5. Store result in submission_result │
|
|
└─────────────────────────────────────────┘
|
|
│
|
|
↓ HTTP PUT /:index/:fpid
|
|
┌─────────────────────────────────────────┐
|
|
│ Index (fpindex) │
|
|
│ 1. Add fingerprint to MemorySegment │
|
|
│ 2. Append to Oplog │
|
|
│ 3. Trigger flush if needed │
|
|
└─────────────────────────────────────────┘
|
|
```
|
|
|
|
### Lookup Flow (Detailed)
|
|
|
|
```
|
|
┌─────────┐
|
|
│ Client │
|
|
└────┬────┘
|
|
│ GET/POST /v2/lookup
|
|
↓
|
|
┌─────────────────────────────────────────┐
|
|
│ LookupHandler (api/v2/lookup.py) │
|
|
│ 1. Validate API key (client) │
|
|
│ 2. Check rate limits (Redis) │
|
|
│ 3. Parse parameters │
|
|
└─────────────────────────────────────────┘
|
|
│
|
|
↓
|
|
┌─────────────────────────────────────────┐
|
|
│ decode_fingerprint (fingerprint.py) │
|
|
│ 1. Decode base64 or compressed format │
|
|
│ 2. Decompress if needed │
|
|
│ 3. Parse Chromaprint data │
|
|
└─────────────────────────────────────────┘
|
|
│
|
|
↓
|
|
┌─────────────────────────────────────────┐
|
|
│ extract_query (fingerprint.py) │
|
|
│ 1. Extract hash terms from fingerprint│
|
|
│ 2. Build query structure │
|
|
└─────────────────────────────────────────┘
|
|
│
|
|
↓
|
|
┌─────────────────────────────────────────┐
|
|
│ fpstore.search (fpstore.py) │
|
|
│ 1. Encode query as MessagePack │
|
|
│ 2. HTTP POST to index │
|
|
└─────────────────────────────────────────┘
|
|
│
|
|
↓ HTTP POST /:index/_search
|
|
┌─────────────────────────────────────────┐
|
|
│ Index (fpindex) │
|
|
│ 1. Parse MessagePack query │
|
|
│ 2. Search all segments │
|
|
│ 3. Merge and score results │
|
|
│ 4. Return top-K candidates │
|
|
└─────────────────────────────────────────┘
|
|
│
|
|
↓ Candidate fingerprint IDs + scores
|
|
┌─────────────────────────────────────────┐
|
|
│ LookupHandler (continued) │
|
|
│ 1. Fetch fingerprint metadata from DB │
|
|
│ 2. Fetch track metadata from DB │
|
|
│ 3. Fetch MusicBrainz data if requested│
|
|
│ 4. Build result structure │
|
|
│ 5. Format as JSON/XML │
|
|
└─────────────────────────────────────────┘
|
|
│
|
|
↓ JSON response
|
|
┌─────────┐
|
|
│ Client │
|
|
└─────────┘
|
|
```
|
|
|
|
### Background Processing
|
|
|
|
**Cron Jobs** (`acoustid/cron.py`):
|
|
- Update lookup statistics (hourly)
|
|
- Update user agent statistics (daily)
|
|
- Clean up old submissions (daily)
|
|
- Refresh materialized views (hourly)
|
|
- Backup index snapshots (daily)
|
|
|
|
**Worker Tasks** (`acoustid/worker.py`):
|
|
- Process fingerprint submissions
|
|
- Import bulk fingerprints
|
|
- Update index with new data
|
|
- Resolve MBID redirects
|
|
- Clean up orphaned records
|
|
|
|
## Index Communication Protocols
|
|
|
|
### Legacy Protocol (indexclient.py)
|
|
|
|
**Transport**: Raw TCP socket
|
|
**Port**: 6080 (default)
|
|
**Format**: Custom binary protocol
|
|
|
|
**Message Structure**:
|
|
```
|
|
┌────────────────┬────────────────┬────────────────┐
|
|
│ Length (4B) │ Command (1B) │ Payload │
|
|
└────────────────┴────────────────┴────────────────┘
|
|
```
|
|
|
|
**Commands**:
|
|
- `0x01`: Search
|
|
- `0x02`: Insert
|
|
- `0x03`: Delete
|
|
|
|
**Status**: Being phased out, replaced by HTTP protocol
|
|
|
|
### Modern Protocol (fpstore.py)
|
|
|
|
**Transport**: HTTP/1.1
|
|
**Port**: 6081 (default)
|
|
**Format**: MessagePack
|
|
|
|
**Endpoints**:
|
|
|
|
| Method | Path | Purpose |
|
|
|--------|------|---------|
|
|
| POST | `/:index/_search` | Search for fingerprints |
|
|
| PUT | `/:index/:fpid` | Insert/update fingerprint |
|
|
| DELETE | `/:index/:fpid` | Delete fingerprint |
|
|
| GET | `/:index` | Get index info |
|
|
| GET | `/:index/_segments` | List segments |
|
|
| GET | `/:index/_snapshot` | Create snapshot |
|
|
|
|
**Search Request**:
|
|
```python
|
|
{
|
|
"query": [term_id1, term_id2, ...], # Query terms
|
|
"limit": 10, # Max results
|
|
"min_score": 0.5 # Score threshold
|
|
}
|
|
```
|
|
|
|
**Search Response**:
|
|
```python
|
|
{
|
|
"results": [
|
|
{"id": fpid1, "score": 0.95},
|
|
{"id": fpid2, "score": 0.87},
|
|
...
|
|
]
|
|
}
|
|
```
|
|
|
|
## Concurrency and Parallelism
|
|
|
|
### Server Concurrency
|
|
|
|
**API/Web Processes**:
|
|
- Multiple worker processes (Gunicorn/Uvicorn)
|
|
- Each process handles requests independently
|
|
- Shared-nothing architecture
|
|
- Database connection pooling per process
|
|
|
|
**Worker Processes**:
|
|
- Multiple worker instances
|
|
- NATS queue provides work distribution
|
|
- Each worker processes one submission at a time
|
|
- No shared state between workers
|
|
|
|
**Cron Process**:
|
|
- Single instance (leader election via database)
|
|
- Scheduled tasks run sequentially
|
|
- Long-running tasks delegated to workers
|
|
|
|
### Index Concurrency
|
|
|
|
**Thread Model**:
|
|
- Main thread: HTTP server
|
|
- Worker threads: Search and merge operations
|
|
- Configurable thread pool size
|
|
|
|
**Locking Strategy**:
|
|
- Read-write lock on Index
|
|
- Multiple concurrent readers
|
|
- Exclusive writer (for flush/merge)
|
|
- Lock-free MemorySegment (atomic operations)
|
|
|
|
**Background Tasks**:
|
|
- Segment merger runs in background thread
|
|
- Oplog flusher runs periodically
|
|
- Metrics collector runs independently
|
|
|
|
## Scalability Considerations
|
|
|
|
### Horizontal Scaling
|
|
|
|
**API/Web**:
|
|
- Stateless processes
|
|
- Scale by adding more instances
|
|
- Load balancer distributes requests
|
|
- Session state in Redis (if needed)
|
|
|
|
**Workers**:
|
|
- Scale by adding more instances
|
|
- NATS queue distributes work
|
|
- No coordination required
|
|
|
|
**Index**:
|
|
- Multiple index instances (sharding)
|
|
- Consistent hashing for fingerprint distribution
|
|
- NATS for cluster coordination
|
|
- Each instance handles subset of fingerprints
|
|
|
|
### Vertical Scaling
|
|
|
|
**Database**:
|
|
- Connection pooling
|
|
- Read replicas for queries
|
|
- Partitioning for large tables
|
|
- Materialized views for aggregations
|
|
|
|
**Index**:
|
|
- More threads for search
|
|
- Larger memory segment
|
|
- Faster disk for segments
|
|
- More RAM for file caching
|
|
|
|
## Fault Tolerance
|
|
|
|
### Server Resilience
|
|
|
|
**Database Failures**:
|
|
- Connection retry with exponential backoff
|
|
- Health checks detect failures
|
|
- Read-only mode if write DB unavailable
|
|
|
|
**Index Failures**:
|
|
- Graceful degradation (return partial results)
|
|
- Retry with exponential backoff
|
|
- Circuit breaker pattern
|
|
|
|
**NATS Failures**:
|
|
- Persistent queue (JetStream)
|
|
- Automatic reconnection
|
|
- Message replay on recovery
|
|
|
|
### Index Resilience
|
|
|
|
**Crash Recovery**:
|
|
- Oplog replay restores MemorySegment
|
|
- FileSegments are immutable (no corruption)
|
|
- Incomplete merges discarded
|
|
|
|
**Data Integrity**:
|
|
- Checksums in file format
|
|
- Atomic file operations
|
|
- Write-ahead logging
|
|
|
|
**Replication**:
|
|
- NATS-based replication (optional)
|
|
- Snapshot-based backup
|
|
- Point-in-time recovery
|
|
|
|
## Performance Characteristics
|
|
|
|
### Server Performance
|
|
|
|
**Lookup Latency**:
|
|
- P50: ~50ms (including index search)
|
|
- P95: ~200ms
|
|
- P99: ~500ms
|
|
|
|
**Bottlenecks**:
|
|
- Index search time (dominant)
|
|
- Database query time (metadata fetch)
|
|
- Network latency (MusicBrainz queries)
|
|
|
|
### Index Performance
|
|
|
|
**Search Latency**:
|
|
- P50: ~5ms
|
|
- P95: ~20ms
|
|
- P99: ~50ms
|
|
|
|
**Throughput**:
|
|
- ~1000 searches/second (single instance)
|
|
- ~500 inserts/second (single instance)
|
|
|
|
**Bottlenecks**:
|
|
- Disk I/O (segment reads)
|
|
- CPU (decompression and scoring)
|
|
- Memory (segment caching)
|
|
|
|
## Future Architecture Plans
|
|
|
|
### Server Modernization
|
|
|
|
1. Complete migration to Starlette/ASGI
|
|
2. Remove Flask dependencies
|
|
3. Async database operations everywhere
|
|
4. GraphQL API alongside REST
|
|
|
|
### Index Enhancements
|
|
|
|
1. Distributed index with automatic sharding
|
|
2. Replication for high availability
|
|
3. Incremental snapshots
|
|
4. Query result caching
|
|
|
|
### Infrastructure
|
|
|
|
1. Kubernetes deployment
|
|
2. Service mesh (Istio/Linkerd)
|
|
3. Distributed tracing (OpenTelemetry)
|
|
4. Advanced monitoring (Prometheus + Grafana)
|