feat: initial implementation of metadata aggregator
- gRPC service with MusicBrainz provider
- PostgreSQL schema with migrations
- Service layer with database-first caching
- Repository pattern for data access
- YAML configuration support
- Research documentation for 17 music metadata projects
@@ -0,0 +1,391 @@
# AcoustID System Overview

## Introduction

AcoustID is an open-source audio fingerprinting service that identifies music recordings by analyzing their acoustic characteristics. The system consists of two primary components working in tandem: a Python-based web service (acoustid-server) and a high-performance Zig-based fingerprint index (acoustid-index). Together, they provide a production-grade solution for matching audio fingerprints to MusicBrainz metadata.

## System Components

### acoustid-server (Python)

The server component handles all user-facing operations, database management, and business logic.

**Repository**: acoustid/acoustid-server
**License**: MIT
**Language**: Python 3.12+
**Current Version**: 26.3.1

**Core Technologies**:
- **Web Framework**: Werkzeug/Flask (current), with a planned migration to Starlette (async)
- **ORM**: SQLAlchemy 2.x with multi-database support
- **Database**: PostgreSQL 17.4 (4 separate databases)
- **Cache/Queue**: Redis for rate limiting and task queues
- **Message Queue**: NATS with JetStream for async submission processing
- **Application Servers**: Uvicorn (ASGI) for async endpoints, Gunicorn (WSGI) for legacy endpoints

**Key Dependencies**:
```
acoustid-ext (C extension for Chromaprint)
Flask (current web framework)
Starlette (future async framework)
aiohttp (async HTTP client)
SQLAlchemy 2.x (ORM)
alembic (database migrations)
asyncpg (async PostgreSQL driver)
psycopg2 (sync PostgreSQL driver)
nats-py (NATS client)
mbdata (MusicBrainz data models)
msgspec (fast JSON/MessagePack)
zstd (compression)
gunicorn (WSGI server)
uvicorn (ASGI server)
```

**Entry Point**:
```bash
# Main CLI entry: manage.py dispatches to acoustid.cli:main()

# Available commands
python manage.py run web      # Web UI server
python manage.py run api      # API server
python manage.py run cron     # Scheduled tasks
python manage.py run worker   # Background worker
python manage.py run import   # Import fingerprints
```

**File Locations**:
- Entry script: `manage.py`
- CLI implementation: `acoustid/cli.py`
- Server logic: `acoustid/server.py`
- Worker logic: `acoustid/worker.py`
- Cron jobs: `acoustid/cron.py`
- Configuration: `acoustid/config.py`

### acoustid-index (Zig)

The index component provides fast fingerprint search using an LSM-tree inverted index and SIMD-accelerated (StreamVByte) compression of posting lists.

**Repository**: acoustid/acoustid-index
**License**: GPL-3.0
**Language**: Zig
**Build System**: Zig build system

**Core Technologies**:
- **HTTP Server**: httpz (Zig HTTP library)
- **Data Structure**: LSM-tree (Log-Structured Merge-tree) inverted index
- **Compression**: StreamVByte SIMD compression for posting lists
- **Serialization**: MessagePack for wire protocol
- **Metrics**: Prometheus-compatible metrics endpoint

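To make the StreamVByte entry above more concrete, here is a minimal scalar sketch of the byte layout the format is known for: a control stream holding a 2-bit length code per integer, and a data stream holding 1 to 4 little-endian bytes per integer. It is written in Python for readability and is not the code in `src/streamvbyte.zig`; the function names `svb_encode` and `svb_decode` are hypothetical, the real implementation uses SIMD shuffles, and whether the index delta-encodes posting lists before compression is not stated here.

```python
def svb_encode(values):
    """Encode unsigned 32-bit integers into (control, data) byte streams."""
    control = bytearray((len(values) + 3) // 4)  # four 2-bit codes per control byte
    data = bytearray()
    for i, v in enumerate(values):
        if not 0 <= v < 2**32:
            raise ValueError("value out of 32-bit range")
        nbytes = max(1, (v.bit_length() + 7) // 8)        # 1..4 data bytes
        control[i // 4] |= (nbytes - 1) << (2 * (i % 4))  # code = length - 1
        data += v.to_bytes(nbytes, "little")
    return bytes(control), bytes(data)


def svb_decode(control, data, count):
    """Decode `count` integers produced by svb_encode."""
    values, pos = [], 0
    for i in range(count):
        nbytes = ((control[i // 4] >> (2 * (i % 4))) & 0b11) + 1
        values.append(int.from_bytes(data[pos:pos + nbytes], "little"))
        pos += nbytes
    return values
```

Keeping the length codes separate from the data bytes is what allows a SIMD decoder to read one control byte and unpack four integers with a single table-driven shuffle.
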
**Key Dependencies**:
```
httpz (HTTP server framework)
metrics (Prometheus metrics)
zul (Zig utility library)
msgpack (MessagePack serialization)
nats (NATS client)
```

**Entry Point**:
```bash
# Build and run
zig build run -- --dir /tmp --port 8080

# Binary name: fpindex

# CLI flags
#   --dir <path>          Data directory for index storage
#   --port <number>       HTTP server port (default: 6081)
#   --threads <number>    Worker thread count
#   --log-level <level>   Logging verbosity
#   --cluster <name>      Cluster name for distributed setup
#   --nats-url <url>      NATS server URL for clustering
```

**File Locations**:
- Main entry: `src/main.zig`
- HTTP server: `src/server.zig`
- API handlers: `src/api.zig`
- Multi-index manager: `src/MultiIndex.zig`
- Core index: `src/Index.zig`
- Index reader: `src/IndexReader.zig`
- Segment management: `src/segment.zig`
- Memory segment: `src/MemorySegment.zig`
- File segment: `src/FileSegment.zig`
- Write-ahead log: `src/Oplog.zig`
- File format: `src/filefmt.zig`
- Block compression: `src/block.zig`
- SIMD compression: `src/streamvbyte.zig`
- Metrics: `src/metrics.zig`

## Build and Run

### Server Build

```bash
# Install dependencies with uv
uv sync

# Build Chromaprint extension
# (handled automatically in Docker build)

# Run with docker-compose
docker compose up
```

**Docker Compose Services**:
- `nats`: Message queue
- `redis`: Cache and rate limiting
- `postgres`: Database (custom pg17.4 image)
- `index`: Fingerprint index service
- `api`: API server
- `web`: Web UI server
- `cron`: Scheduled tasks
- `worker`: Background job processor

### Index Build

```bash
# Build binary
zig build

# Run with options
zig build run -- --dir /var/lib/acoustid-index --port 6081 --threads 4
```

## Architecture Relationship

The two components work together in a client-server model, with the server acting as a client of the index:

1. **Server** receives fingerprint submissions and lookup requests via HTTP API
2. **Server** stores metadata in PostgreSQL
3. **Server** sends fingerprint data to **Index** via HTTP/MessagePack protocol
4. **Index** performs a similarity search against its LSM-tree inverted index
5. **Index** returns candidate fingerprint IDs to **Server**
6. **Server** enriches results with metadata from PostgreSQL and MusicBrainz
7. **Server** returns final results to client

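The division of labour in the seven steps above can be sketched with in-memory stand-ins. This is only an illustration of who does what: `FakeIndex`, `FakeServer`, and the shared-hash scoring are hypothetical, and a plain dict stands in for both PostgreSQL and the real index.

```python
from dataclasses import dataclass, field


@dataclass
class FakeIndex:
    """Stand-in for the index: maps fingerprint ID -> set of hash terms."""
    postings: dict = field(default_factory=dict)

    def insert(self, fp_id, hashes):
        self.postings[fp_id] = set(hashes)

    def search(self, hashes, limit=5):
        # Steps 4-5: score stored fingerprints by the number of shared hashes.
        query = set(hashes)
        scored = [(fp_id, len(query & terms)) for fp_id, terms in self.postings.items()]
        scored = [(fp_id, score) for fp_id, score in scored if score > 0]
        return sorted(scored, key=lambda item: -item[1])[:limit]


@dataclass
class FakeServer:
    """Stand-in for the server: owns metadata, delegates search to the index."""
    index: FakeIndex
    metadata: dict = field(default_factory=dict)  # fingerprint ID -> metadata row

    def submit(self, fp_id, hashes, meta):
        self.metadata[fp_id] = meta       # step 2: metadata stays with the server
        self.index.insert(fp_id, hashes)  # step 3: only hashes go to the index

    def lookup(self, hashes):
        candidates = self.index.search(hashes)
        # Steps 6-7: enrich candidate IDs with metadata before responding.
        return [{"id": fp_id, "score": score, **self.metadata[fp_id]}
                for fp_id, score in candidates]
```

The point of the split is that the index only ever sees IDs and hash terms, so it can stay a small, specialised service while all relational data lives on the server side.
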
## Communication Protocols

### Server to Index

**Modern Protocol** (fpstore.py):
- HTTP POST to `http://index:6081/:index/_search`, where `:index` is the name of the target index
- Request body: MessagePack-encoded fingerprint query
- Response: MessagePack-encoded list of candidate IDs with scores

**Legacy Protocol** (indexclient.py):
- Raw TCP socket connection
- Binary protocol with custom framing
- Being phased out in favor of HTTP

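As a rough sketch of the modern protocol's request shape, the helper below builds (but does not send) the POST described above using only the standard library. The helper name `build_search_request`, the `Content-Type` value, and the index name `main` in the usage note are assumptions; the body is expected to be already MessagePack-encoded by a library of your choice, since the query schema is not specified here.

```python
import urllib.request


def build_search_request(index_name, body, host="index", port=6081):
    """Build a POST request for /<index_name>/_search without sending it."""
    url = f"http://{host}:{port}/{index_name}/_search"
    return urllib.request.Request(
        url,
        data=body,  # MessagePack-encoded query bytes
        method="POST",
        headers={"Content-Type": "application/msgpack"},  # assumed media type
    )
```

For example, `build_search_request("main", packed_query)` yields a request for `http://index:6081/main/_search`, which could then be passed to `urllib.request.urlopen`.
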
### Client to Server

**Public API**:
- HTTP GET/POST to `https://api.acoustid.org/v2/*`
- JSON/XML/JSONP responses
- Rate-limited by API key and IP

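A client-side lookup URL could be assembled as in the sketch below. The query parameter names (`client`, `duration`, `fingerprint`, `meta`, `format`) follow the public AcoustID web service as commonly documented rather than anything stated in this overview, so treat them as assumptions and check them against the API reference; `build_lookup_url` is a hypothetical helper.

```python
from urllib.parse import urlencode

API_ROOT = "https://api.acoustid.org/v2"


def build_lookup_url(client_key, duration, fingerprint, meta=("recordings",)):
    """Return a GET URL for the lookup endpoint (parameter names assumed)."""
    params = {
        "client": client_key,        # application API key
        "duration": int(duration),   # audio length in whole seconds
        "fingerprint": fingerprint,  # compressed Chromaprint fingerprint string
        "meta": " ".join(meta),      # space-separated list of requested metadata
        "format": "json",
    }
    return f"{API_ROOT}/lookup?{urlencode(params)}"
```

Long fingerprints can exceed practical URL lengths, which is one reason the API also accepts POST.
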
## Version Information

**Server Version**: 26.3.1
- Semantic versioning
- Tagged releases in Git
- Version defined in `acoustid/__init__.py`

**Index Version**: No formal versioning yet
- Tracked by Git commit hash
- Breaking changes communicated via commit messages

## Deployment Models

### Production (acoustid.org)

- Multi-server deployment
- Separate API, web, worker, and cron processes
- Dedicated PostgreSQL cluster (4 databases)
- Redis cluster for caching
- NATS cluster for message queue
- Multiple index instances for load balancing

### Self-Hosted (Docker Compose)

- Single-host deployment
- All services in containers
- Shared PostgreSQL instance
- Single Redis instance
- Single NATS instance
- Single index instance

### Development (Local)

- Python virtual environment with uv
- Local PostgreSQL (or Docker)
- Local Redis (or Docker)
- Local NATS (or Docker)
- Index built and run locally with Zig

## Key Features

### Server Features

- **Fingerprint Submission**: Accept audio fingerprints with optional metadata
- **Fingerprint Lookup**: Match fingerprints to known recordings
- **MusicBrainz Integration**: Link fingerprints to MBIDs
- **User Management**: API key generation and management
- **Rate Limiting**: Multi-tier rate limiting (global, app, IP)
- **Batch Operations**: Submit/lookup up to 20 fingerprints per request
- **Async Processing**: Background workers for heavy operations
- **Health Checks**: Multiple health endpoints for monitoring
- **Metrics**: StatsD metrics for observability

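The multi-tier rate limiting listed above (global, app, IP) can be illustrated with a small fixed-window limiter. This is a sketch under stated assumptions, not the server's implementation: the class name `TieredRateLimiter`, the fixed-window algorithm, and the example limits are all hypothetical, and an in-process dict stands in for the Redis counters the server is described as using.

```python
import time
from collections import defaultdict


class TieredRateLimiter:
    """Fixed-window limiter that checks several tiers per request."""

    def __init__(self, limits, window=1.0, clock=time.monotonic):
        self.limits = limits    # e.g. {"global": 100, "app": 10, "ip": 3}
        self.window = window    # window length in seconds
        self.clock = clock      # injectable for testing
        self.counters = defaultdict(int)

    def allow(self, app_key, ip):
        bucket = int(self.clock() // self.window)
        keys = {
            "global": ("global", bucket),
            "app": ("app", app_key, bucket),
            "ip": ("ip", ip, bucket),
        }
        # Reject if any tier is already at its limit.
        if any(self.counters[keys[tier]] >= limit for tier, limit in self.limits.items()):
            return False
        # Count only accepted requests against every tier.
        for tier in self.limits:
            self.counters[keys[tier]] += 1
        return True
```

Old window buckets are never evicted in this sketch; with Redis, the same effect usually comes from setting an expiry on each counter key.
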
### Index Features

- **Fast Search**: Sub-millisecond fingerprint matching
- **SIMD Optimization**: StreamVByte compression for posting lists
- **LSM-Tree Storage**: Efficient write and read performance
- **Background Merging**: Automatic segment compaction
- **Snapshot Support**: Point-in-time index snapshots
- **Cluster Support**: Distributed index via NATS
- **Prometheus Metrics**: Built-in metrics endpoint
- **HTTP API**: RESTful API for all operations

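The LSM-tree storage and background merging listed above follow a general pattern that a toy inverted index can show: writes land in a mutable memory segment, full memory segments are frozen into immutable ones, and a merge compacts immutable segments. The sketch below is a generic illustration in Python, not a port of `Index.zig`; `TinyLsmIndex`, its flush threshold, and the shared-hash scoring are hypothetical, and it ignores deletes, updates, and the write-ahead log.

```python
from collections import Counter, defaultdict


class TinyLsmIndex:
    """Toy LSM-style inverted index: hash term -> fingerprint IDs."""

    def __init__(self, max_memory_docs=2):
        self.max_memory_docs = max_memory_docs
        self.memory = defaultdict(set)  # mutable in-memory segment
        self.memory_docs = 0
        self.segments = []              # immutable segments, oldest first

    def insert(self, fp_id, hashes):
        for h in hashes:
            self.memory[h].add(fp_id)
        self.memory_docs += 1
        if self.memory_docs >= self.max_memory_docs:
            self.flush()

    def flush(self):
        # Freeze the memory segment into a sorted, read-only segment.
        frozen = {h: tuple(sorted(ids)) for h, ids in self.memory.items()}
        self.segments.append(frozen)
        self.memory = defaultdict(set)
        self.memory_docs = 0

    def merge(self):
        # Compact all immutable segments into a single segment.
        merged = defaultdict(set)
        for segment in self.segments:
            for h, ids in segment.items():
                merged[h].update(ids)
        self.segments = [{h: tuple(sorted(ids)) for h, ids in merged.items()}]

    def search(self, hashes):
        # Count, per fingerprint ID, how many distinct query hashes it shares.
        hits = Counter()
        for h in set(hashes):
            ids = set(self.memory.get(h, ()))
            for segment in self.segments:
                ids.update(segment.get(h, ()))
            hits.update(ids)
        return hits.most_common()
```

Searches must consult every segment, which is why compaction matters: fewer segments means fewer posting lists to union per query term.
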
## Configuration

### Server Configuration

**Config File**: `acoustid.conf` (INI format)
**Environment Variables**: `ACOUSTID_*` prefix
**Secret Files**: `*_file` suffix for file-based secrets

Example:
```ini
[database]
name = acoustid_app
user = acoustid
password_file = /run/secrets/db_password

[redis]
host = redis
port = 6379

[fingerprint_index]
host = index
port = 6081
```

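One way the three configuration sources above could combine is sketched below with the standard `configparser` module. The precedence order and the `ACOUSTID_<SECTION>_<OPTION>` variable naming are assumptions made for illustration (only the `ACOUSTID_*` prefix and the `*_file` suffix come from the description above), and `load_section` is a hypothetical helper, not a function from `acoustid/config.py`.

```python
import configparser
import os
from pathlib import Path


def load_section(path, section, environ=os.environ):
    """Load one INI section, resolving *_file secrets and env overrides."""
    parser = configparser.ConfigParser()
    parser.read(path)
    values = dict(parser[section]) if parser.has_section(section) else {}

    # Resolve "<option>_file" entries by reading the secret from disk.
    for key in [k for k in values if k.endswith("_file")]:
        secret_path = values.pop(key)
        values[key[: -len("_file")]] = Path(secret_path).read_text().strip()

    # Assumed override scheme: ACOUSTID_<SECTION>_<OPTION> wins over the file.
    prefix = f"ACOUSTID_{section.upper()}_"
    for name, value in environ.items():
        if name.startswith(prefix):
            values[name[len(prefix):].lower()] = value
    return values
```

Reading secrets from files keeps passwords out of both the config file and the process environment, which suits Docker or Kubernetes secret mounts such as `/run/secrets/db_password`.
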
### Index Configuration

**CLI Flags Only**: No config file support
**Environment Variables**: Limited support

Example:
```bash
fpindex \
  --dir /var/lib/acoustid-index \
  --port 6081 \
  --threads 4 \
  --log-level info \
  --nats-url nats://nats:4222
```

## Data Flow Summary

### Submission Flow

1. Client submits fingerprint via `/v2/submit`
2. Server validates API keys and rate limits
3. Server stores submission in `submission` table
4. Server publishes message to NATS queue
5. Worker picks up message from NATS
6. Worker searches index for matches
7. Worker creates or links track in PostgreSQL
8. Worker updates index with new fingerprint
9. Client polls `/v2/submission_status` for result

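The asynchronous hand-off in steps 3 to 8 above can be sketched with standard-library stand-ins: a `queue.Queue` in place of NATS and plain dicts in place of the `submission` table, the track tables, and the index. The function names, the `pending`/`imported` status values, and the exact-match lookup are hypothetical simplifications; the real worker performs a similarity search.

```python
import queue


def handle_submit(submission, submissions, jobs):
    """Steps 3-4: persist the submission, then enqueue its ID."""
    submission_id = len(submissions) + 1
    submissions[submission_id] = {**submission, "status": "pending"}
    jobs.put(submission_id)
    return submission_id


def run_worker_once(submissions, jobs, index, tracks):
    """Steps 5-8: take one job, match it, then link or create a track."""
    submission_id = jobs.get_nowait()
    row = submissions[submission_id]
    fingerprint = row["fingerprint"]
    track_id = index.get(fingerprint)  # exact match as a stand-in for search
    if track_id is None:
        track_id = len(tracks) + 1
        tracks[track_id] = {"fingerprints": []}
        index[fingerprint] = track_id  # step 8: make the fingerprint searchable
    tracks[track_id]["fingerprints"].append(fingerprint)
    row.update(status="imported", track_id=track_id)
    return submission_id
```

Because the HTTP handler only writes a row and enqueues an ID, the submit endpoint stays fast, and the client learns the outcome later by polling the stored status.
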
### Lookup Flow

1. Client requests lookup via `/v2/lookup`
2. Server validates API key and rate limits
3. Server decodes fingerprint from request
4. Server extracts query features from fingerprint
5. Server sends search request to index
6. Index returns candidate fingerprint IDs
7. Server fetches metadata from PostgreSQL
8. Server fetches MusicBrainz data if requested
9. Server returns enriched results as JSON

## Technology Stack Summary

| Aspect | Server | Index |
|--------|--------|-------|
| Language | Python 3.12+ | Zig |
| Web Framework | Flask/Starlette | httpz |
| Database | PostgreSQL 17.4 | N/A (file-based) |
| ORM | SQLAlchemy 2.x | N/A |
| Cache | Redis | N/A |
| Queue | NATS+JetStream | NATS (optional) |
| Serialization | JSON/MessagePack | MessagePack |
| Compression | zstd | StreamVByte |
| Metrics | StatsD | Prometheus |
| Testing | pytest | Zig test |
| Build | uv | zig build |
| Container | Docker | Docker |

## Repository Structure

### acoustid-server

```
acoustid/
├── api/              # API handlers
│   └── v2/           # API v2 endpoints
├── data/             # Business logic layer
├── future/           # Starlette migration code
├── web/              # Web UI handlers
├── scripts/          # Utility scripts
├── cli.py            # CLI commands
├── server.py         # Server entry point
├── worker.py         # Background worker
├── cron.py           # Scheduled tasks
├── fingerprint.py    # Fingerprint utilities
├── indexclient.py    # Legacy index client
├── fpstore.py        # Modern index client
├── db.py             # Database connection
├── config.py         # Configuration
└── tables.py         # SQLAlchemy models
```

### acoustid-index

```
src/
├── main.zig           # Entry point
├── server.zig         # HTTP server
├── api.zig            # API handlers
├── MultiIndex.zig     # Multi-index manager
├── Index.zig          # Core index
├── IndexReader.zig    # Read-only index view
├── segment.zig        # Segment interface
├── MemorySegment.zig  # In-memory segment
├── FileSegment.zig    # On-disk segment
├── Oplog.zig          # Write-ahead log
├── filefmt.zig        # File format
├── block.zig          # Block compression
├── streamvbyte.zig    # SIMD compression
└── metrics.zig        # Prometheus metrics
```

## Next Steps

For detailed information on specific aspects of the AcoustID system, refer to:

- **ARCHITECTURE.md**: Detailed architecture and data flow
- **API.md**: Complete API reference
- **DATA.md**: Database schema and data models
- **INTEGRATIONS.md**: External service integrations
- **DEPLOYMENT.md**: Deployment and infrastructure
- **CODEBASE.md**: Code organization and patterns
- **EVALUATION.md**: System evaluation and recommendations