feat: initial implementation of metadata aggregator

- gRPC service with MusicBrainz provider
- PostgreSQL schema with migrations
- Service layer with database-first caching
- Repository pattern for data access
- YAML configuration support
- Research documentation for 17 music metadata projects
Alexander
2026-04-28 16:27:14 +02:00
commit a1f6701bac
163 changed files with 95884 additions and 0 deletions
@@ -0,0 +1,55 @@
# AcoustID
## Overview
AcoustID is an open-source audio fingerprinting service. It identifies music tracks by their acoustic fingerprint and links them to MusicBrainz recordings.
## Key Features
- **Purpose**: Audio identification via acoustic fingerprinting
- **Technology**: Chromaprint fingerprint generation
- **Database**: Crowdsourced fingerprints linked to MusicBrainz
- **License**: MIT (code), CC BY-SA 3.0 (data)
## Source
| Resource | URL |
|----------|-----|
| **Server Repository** | https://github.com/acoustid/acoustid-server |
| **Index Repository** | https://github.com/acoustid/acoustid-index |
| **Chromaprint Library** | https://github.com/acoustid/chromaprint |
| **API Documentation** | https://acoustid.org/webservice |
| **Website** | https://acoustid.org |
## API Examples
```bash
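# Lookup by fingerprint (fill in the fingerprint and duration reported by fpcalc)
curl "https://api.acoustid.org/v2/lookup?client=YOUR_API_KEY&meta=recordings&fingerprint=${FINGERPRINT}&duration=${DURATION}"
# Submit new fingerprint (form fields: client, user, duration.0, fingerprint.0)
curl -X POST "https://api.acoustid.org/v2/submit" \
  -d "client=YOUR_API_KEY" -d "user=YOUR_USER_KEY" \
  -d "duration.0=${DURATION}" -d "fingerprint.0=${FINGERPRINT}"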
```
## Chromaprint CLI
```bash
# Generate fingerprint from audio file
fpcalc song.mp3
# Prints DURATION=... and FINGERPRINT=... on separate lines
```
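To feed `fpcalc` output into a lookup, the `KEY=VALUE` lines have to be split into a duration and a fingerprint. The following is a minimal sketch; the helper names `parse_fpcalc_output` and `fingerprint_file` are illustrative and not part of Chromaprint, and the sketch assumes `fpcalc` is on `PATH`.

```python
import subprocess


def parse_fpcalc_output(text: str) -> dict:
    """Turn fpcalc's KEY=VALUE lines into {'fingerprint': str, 'duration': int}."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # ignore lines that are not KEY=VALUE
            fields[key.strip().lower()] = value.strip()
    return {
        "fingerprint": fields["fingerprint"],
        # int(float(...)) tolerates both "240" and "240.5"
        "duration": int(float(fields["duration"])),
    }


def fingerprint_file(path: str) -> dict:
    """Run fpcalc on a file and parse what it prints (hypothetical helper)."""
    completed = subprocess.run(
        ["fpcalc", path], capture_output=True, text=True, check=True
    )
    return parse_fpcalc_output(completed.stdout)
```

The parser does not depend on the order of the two lines, so it keeps working if the output order differs between versions.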
## Self-Hosting
The acoustid-index v2 is written in Zig for performance. To build it from source:
```bash
git clone https://github.com/acoustid/acoustid-index.git
# Follow build instructions in README
```
## Notes
- Used by: Beets, Picard, Kid3, MusicBrainz ecosystem
- Free API for audio fingerprint matching
- Identify unknown files → get MusicBrainz metadata
@@ -0,0 +1,807 @@
# AcoustID API Reference
## API Overview
The AcoustID API provides fingerprint-based music identification services. The API is RESTful, supports multiple response formats (JSON, XML, JSONP), and requires API key authentication for most operations.
**Base URL**: `https://api.acoustid.org`
**Protocol**: HTTPS only
**Authentication**: API key (application key + user key for submissions)
**Rate Limiting**: Multi-tier (global, application, IP-based)
## Public API Endpoints
### Fingerprint Lookup
Identify recordings by audio fingerprint.
#### `/v2/lookup`
**Methods**: GET, POST
**Authentication**: Required (client key)
**Rate Limit**: 3 requests/second (IP), 10 requests/second (application)
**Required Parameters**:
| Parameter | Type | Description |
|-----------|------|-------------|
| `client` | string | Application API key |
| `fingerprint` | string | Chromaprint fingerprint (base64 or compressed); required together with `duration` unless `trackid` is given |
| `duration` | integer | Track duration in seconds (required when `fingerprint` is used) |
| `trackid` | string | AcoustID track ID (alternative to `fingerprint` + `duration`) |
**Optional Parameters**:
| Parameter | Type | Description | Default |
|-----------|------|-------------|---------|
| `format` | string | Response format: `json`, `xml`, `jsonp` | `json` |
| `jsoncallback` | string | JSONP callback function name | - |
| `meta` | string | Metadata to include (see below) | - |
**Metadata Options** (comma-separated):
- `recordings`: Include MusicBrainz recording metadata
- `recordingids`: Include only recording MBIDs (faster)
- `releases`: Include release metadata
- `releaseids`: Include only release MBIDs
- `releasegroups`: Include release group metadata
- `releasegroupids`: Include only release group MBIDs
- `tracks`: Include track metadata
- `compress`: Compress response with gzip
- `usermeta`: Include user-submitted metadata
- `sources`: Include submission source information
**Batch Lookup**:
Submit multiple fingerprints in a single request using indexed parameters:
```
duration.0=240&fingerprint.0=AQADtN...
duration.1=180&fingerprint.1=AQABtK...
```
**Limits**:
- Maximum 20 fingerprints per batch request
- Maximum 100 track IDs per request
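The indexed parameter scheme and the 20-fingerprint limit can be sketched as a small request builder. This is an illustration only: `build_batch_bodies` is a hypothetical helper, and it assumes the indexes restart at `.0` in every request.

```python
from urllib.parse import urlencode

MAX_BATCH = 20  # per-request fingerprint limit stated above


def build_batch_bodies(client, tracks, meta="recordings"):
    """Return one form-encoded body per chunk of up to MAX_BATCH tracks.

    `tracks` is a list of (duration_seconds, fingerprint) pairs.
    """
    bodies = []
    for start in range(0, len(tracks), MAX_BATCH):
        params = [("client", client), ("meta", meta)]
        chunk = tracks[start:start + MAX_BATCH]
        for i, (duration, fingerprint) in enumerate(chunk):
            params.append((f"duration.{i}", str(duration)))
            params.append((f"fingerprint.{i}", fingerprint))
        bodies.append(urlencode(params))
    return bodies
```

Each returned string can be sent as the body of a `POST /v2/lookup` request with `Content-Type: application/x-www-form-urlencoded`.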
**Example Request** (GET):
```
GET /v2/lookup?client=8XaBELgH&duration=240&fingerprint=AQADtNGiJE...&meta=recordings
```
**Example Request** (POST):
```
POST /v2/lookup
Content-Type: application/x-www-form-urlencoded
client=8XaBELgH&duration=240&fingerprint=AQADtNGiJE...&meta=recordings
```
**Example Response** (JSON):
```json
{
"status": "ok",
"results": [
{
"id": "7e8b1234-5678-90ab-cdef-1234567890ab",
"score": 0.95,
"recordings": [
{
"id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
"title": "Example Song",
"duration": 240,
"artists": [
{
"id": "12345678-90ab-cdef-1234-567890abcdef",
"name": "Example Artist"
}
],
"releases": [
{
"id": "abcdef12-3456-7890-abcd-ef1234567890",
"title": "Example Album",
"country": "US",
"date": {
"year": 2020,
"month": 5,
"day": 15
},
"track_count": 12,
"medium_count": 1
}
]
}
]
}
]
}
```
**Response Fields**:
| Field | Type | Description |
|-------|------|-------------|
| `status` | string | `ok` or `error` |
| `results` | array | Array of match results |
| `results[].id` | string | AcoustID track ID |
| `results[].score` | float | Match confidence (0.0-1.0) |
| `results[].recordings` | array | MusicBrainz recordings (if requested) |
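A client usually wants the single most likely recording from this structure. The sketch below picks the highest-scoring result and its first recording; the function name `best_recording` and the `min_score` cutoff of 0.5 are assumptions made for illustration, not values defined by the API.

```python
def best_recording(response: dict, min_score: float = 0.5):
    """Return a flat summary of the best match, or None if nothing qualifies."""
    if response.get("status") != "ok":
        return None
    results = [
        r for r in response.get("results", [])
        if r.get("score", 0.0) >= min_score
    ]
    if not results:
        return None
    top = max(results, key=lambda r: r["score"])
    recordings = top.get("recordings") or []
    if not recordings:
        return None  # matched an AcoustID track with no linked recording
    rec = recordings[0]
    return {
        "acoustid": top["id"],
        "score": top["score"],
        "mbid": rec["id"],
        "title": rec.get("title"),
        "artists": ", ".join(a["name"] for a in rec.get("artists", [])),
    }
```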
### Fingerprint Submission
Submit audio fingerprints with optional metadata.
#### `/v2/submit`
**Method**: POST
**Authentication**: Required (client key + user key)
**Rate Limit**: 3 requests/second (IP), 10 requests/second (application)
**Required Parameters**:
| Parameter | Type | Description |
|-----------|------|-------------|
| `client` | string | Application API key |
| `user` | string | User API key |
| `duration.#` | integer | Track duration in seconds |
| `fingerprint.#` | string | Chromaprint fingerprint |
**Optional Parameters**:
| Parameter | Type | Description |
|-----------|------|-------------|
| `clientversion` | string | Client application version |
| `bitrate.#` | integer | Audio bitrate in kbps |
| `fileformat.#` | string | Audio file format (mp3, flac, etc.) |
| `mbid.#` | string | MusicBrainz recording MBID |
| `track.#` | string | Track title |
| `artist.#` | string | Artist name |
| `album.#` | string | Album title |
| `albumartist.#` | string | Album artist name |
| `year.#` | integer | Release year |
| `trackno.#` | integer | Track number |
| `discno.#` | integer | Disc number |
**Batch Submission**:
Use indexed parameters (`.0`, `.1`, `.2`, etc.) to submit multiple fingerprints:
```
duration.0=240&fingerprint.0=AQADtN...&mbid.0=a1b2c3d4...
duration.1=180&fingerprint.1=AQABtK...&mbid.1=e5f67890...
```
**Example Request**:
```
POST /v2/submit
Content-Type: application/x-www-form-urlencoded
client=8XaBELgH&user=AbCdEfGh&duration.0=240&fingerprint.0=AQADtNGiJE...&mbid.0=a1b2c3d4-e5f6-7890-abcd-ef1234567890
```
**Example Response**:
```json
{
"status": "ok",
"submissions": [
{
"id": 12345678,
"status": "pending"
}
]
}
```
**Response Fields**:
| Field | Type | Description |
|-------|------|-------------|
| `status` | string | `ok` or `error` |
| `submissions` | array | Array of submission results |
| `submissions[].id` | integer | Submission ID |
| `submissions[].status` | string | `pending`, `imported`, or `error` |
### Submission Status
Check the processing status of submitted fingerprints.
#### `/v2/submission_status`
**Method**: GET
**Authentication**: Required (client key)
**Parameters**:
| Parameter | Type | Description |
|-----------|------|-------------|
| `client` | string | Application API key |
| `id` | integer | Submission ID (from submit response) |
| `format` | string | Response format: `json`, `xml`, `jsonp` |
**Example Request**:
```
GET /v2/submission_status?client=8XaBELgH&id=12345678
```
**Example Response**:
```json
{
"status": "ok",
"submission": {
"id": 12345678,
"status": "imported",
"result": {
"id": "7e8b1234-5678-90ab-cdef-1234567890ab"
}
}
}
```
**Status Values**:
- `pending`: Queued for processing
- `imported`: Successfully processed
- `error`: Processing failed
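Because submissions are processed asynchronously, a client has to poll until the status leaves `pending`. The following sketch shows one way to do that; `wait_for_submission` and the injected `fetch_status` callable (which would perform the actual `GET /v2/submission_status` request and return the decoded JSON) are hypothetical, and the attempt count and delay are arbitrary defaults.

```python
import time


def wait_for_submission(fetch_status, submission_id, attempts=5, delay=2.0,
                        sleep=time.sleep):
    """Poll until the submission is 'imported' or 'error'.

    Returns the final `submission` object, or None if it is still pending
    after `attempts` polls.
    """
    for attempt in range(attempts):
        payload = fetch_status(submission_id)
        submission = payload["submission"]
        if submission["status"] in ("imported", "error"):
            return submission
        if attempt < attempts - 1:
            sleep(delay)  # wait before asking again
    return None
```

Passing `fetch_status` and `sleep` as arguments keeps the polling logic independent of any HTTP library and easy to test.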
### Fingerprint Retrieval
Retrieve stored fingerprint data.
#### `/v2/fingerprint`
**Method**: GET
**Authentication**: Required (client key)
**Parameters**:
| Parameter | Type | Description |
|-----------|------|-------------|
| `client` | string | Application API key |
| `id` | string | AcoustID track ID |
| `format` | string | Response format: `json`, `xml`, `jsonp` |
**Example Request**:
```
GET /v2/fingerprint?client=8XaBELgH&id=7e8b1234-5678-90ab-cdef-1234567890ab
```
**Example Response**:
```json
{
"status": "ok",
"fingerprints": [
{
"id": 987654321,
"fingerprint": "AQADtNGiJE...",
"duration": 240,
"submission_count": 5
}
]
}
```
### Track Listing by MBID
List AcoustID tracks linked to a MusicBrainz recording.
#### `/v2/track/list_by_mbid`
**Method**: GET
**Authentication**: Required (client key)
**Parameters**:
| Parameter | Type | Description |
|-----------|------|-------------|
| `client` | string | Application API key |
| `mbid` | string | MusicBrainz recording MBID |
| `format` | string | Response format: `json`, `xml`, `jsonp` |
**Example Request**:
```
GET /v2/track/list_by_mbid?client=8XaBELgH&mbid=a1b2c3d4-e5f6-7890-abcd-ef1234567890
```
**Example Response**:
```json
{
"status": "ok",
"tracks": [
{
"id": "7e8b1234-5678-90ab-cdef-1234567890ab",
"disabled": false
}
]
}
```
### Track Listing by PUID
List AcoustID tracks linked to a MusicIP PUID (legacy).
#### `/v2/track/list_by_puid`
**Method**: GET
**Authentication**: Required (client key)
**Parameters**:
| Parameter | Type | Description |
|-----------|------|-------------|
| `client` | string | Application API key |
| `puid` | string | MusicIP PUID |
| `format` | string | Response format: `json`, `xml`, `jsonp` |
### User Management
#### `/v2/user/lookup`
Look up a user API key by MusicBrainz account.
**Method**: POST
**Authentication**: Required (client key)
**Parameters**:
| Parameter | Type | Description |
|-----------|------|-------------|
| `client` | string | Application API key |
| `musicbrainz_id` | string | MusicBrainz username |
#### `/v2/user/create_anonymous`
Create anonymous user API key.
**Method**: POST
**Authentication**: Required (client key)
**Parameters**:
| Parameter | Type | Description |
|-----------|------|-------------|
| `client` | string | Application API key |
**Example Response**:
```json
{
"status": "ok",
"user": {
"apikey": "AbCdEfGh"
}
}
```
#### `/v2/user/create_musicbrainz`
Create user API key linked to MusicBrainz account.
**Method**: POST
**Authentication**: Required (client key)
**Parameters**:
| Parameter | Type | Description |
|-----------|------|-------------|
| `client` | string | Application API key |
| `access_token` | string | MusicBrainz OAuth access token |
## Legacy API Endpoints
### `/lookup`
Legacy lookup endpoint (API v1).
**Status**: Deprecated, use `/v2/lookup` instead
**Differences**: Limited metadata options, different response format
### `/submit`
Legacy submit endpoint (API v1).
**Status**: Deprecated, use `/v2/submit` instead
**Differences**: Synchronous processing, no batch support
## Health Check Endpoints
### `/_health`
Full health check with database write test.
**Method**: GET
**Authentication**: None
**Response**:
```json
{
"status": "ok"
}
```
**Status Codes**:
- `200`: All systems operational
- `503`: Service unavailable
### `/_health_ro`
Read-only health check (database read test only).
**Method**: GET
**Authentication**: None
### `/_health_docker`
Docker-specific health check (minimal checks).
**Method**: GET
**Authentication**: None
## Internal API Endpoints
These endpoints are for administrative use only and require special authentication.
### `/v2/internal/update_lookup_stats`
Trigger lookup statistics update.
**Method**: POST
**Authentication**: Internal only
### `/v2/internal/update_user_agent_stats`
Trigger user agent statistics update.
**Method**: POST
**Authentication**: Internal only
### `/v2/internal/lookup_stats`
Retrieve lookup statistics.
**Method**: GET
**Authentication**: Internal only
### `/v2/internal/create_account`
Create new user account.
**Method**: POST
**Authentication**: Internal only
### `/v2/internal/create_application`
Create new API application.
**Method**: POST
**Authentication**: Internal only
### `/v2/internal/update_application_status`
Update application status (active/inactive).
**Method**: POST
**Authentication**: Internal only
### `/v2/internal/check_application`
Check application validity.
**Method**: GET
**Authentication**: Internal only
## Index API Endpoints
The fingerprint index service exposes its own HTTP API (separate from the main API).
**Base URL**: `http://index:6081` (internal)
**Protocol**: HTTP
**Format**: MessagePack
### `PUT /:index`
Create new index.
**Parameters**:
- `:index`: Index name
### `GET /:index`
Get index information.
**Response**:
```json
{
"name": "fingerprints",
"doc_count": 1234567,
"segment_count": 42,
"memory_segment_size": 1048576
}
```
### `DELETE /:index`
Delete index.
### `POST /:index/_search`
Search for fingerprints.
**Request Body** (MessagePack):
```python
{
"query": [term1, term2, term3, ...],
"limit": 10,
"min_score": 0.5
}
```
**Response** (MessagePack):
```python
{
"results": [
{"id": fpid1, "score": 0.95},
{"id": fpid2, "score": 0.87}
]
}
```
### `POST /:index/_update`
Batch update fingerprints.
**Request Body** (MessagePack):
```python
{
"updates": [
{"id": fpid1, "terms": [term1, term2, ...]},
{"id": fpid2, "terms": [term3, term4, ...]}
]
}
```
### `GET /:index/_segments`
List index segments.
**Response**:
```json
{
"segments": [
{
"id": 0,
"type": "memory",
"doc_count": 1024,
"size_bytes": 1048576
},
{
"id": 1,
"type": "file",
"doc_count": 100000,
"size_bytes": 52428800
}
]
}
```
### `GET /:index/_snapshot`
Create index snapshot.
**Response**:
```json
{
"snapshot_id": "snapshot_20250428_120000",
"path": "/var/lib/acoustid-index/snapshots/snapshot_20250428_120000"
}
```
### `PUT /:index/:fpid`
Insert or update fingerprint.
**Parameters**:
- `:index`: Index name
- `:fpid`: Fingerprint ID
**Request Body** (MessagePack):
```python
{
"terms": [term1, term2, term3, ...]
}
```
### `GET /:index/:fpid`
Retrieve fingerprint.
**Response** (MessagePack):
```python
{
"id": fpid,
"terms": [term1, term2, term3, ...]
}
```
### `DELETE /:index/:fpid`
Delete fingerprint.
### `GET /_health`
Index health check.
**Response**:
```json
{
"status": "ok"
}
```
### `GET /_metrics`
Prometheus metrics.
**Response** (Prometheus text format):
```
# HELP fpindex_search_duration_seconds Search duration
# TYPE fpindex_search_duration_seconds histogram
fpindex_search_duration_seconds_bucket{le="0.005"} 1234
fpindex_search_duration_seconds_bucket{le="0.01"} 5678
...
```
## Rate Limiting
### Rate Limit Tiers
AcoustID implements a three-tier rate limiting system:
| Tier | Scope | Default Limit | Override |
|------|-------|---------------|----------|
| Global | All requests | 3 req/s | Config: `cluster.rate_limiter.global_limit` |
| Application | Per API key | 10 req/s | Database: `application.rate_limit` |
| IP Address | Per client IP | 3 req/s | Config: `cluster.rate_limiter.ip_limit` |
### Rate Limit Algorithm
**Implementation**: Redis-based sliding window
**Window Configuration**:
- Window duration: 20 seconds
- Window steps: 4 (5-second buckets)
- Cleanup: Automatic expiration (25-second TTL)
**Redis Keys**:
```
rl:bucket:global:{timestamp}
rl:bucket:app:{api_key}:{timestamp}
rl:bucket:ip:{ip_address}:{timestamp}
```
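The bucketed sliding window can be illustrated with an in-memory stand-in for Redis. This is a sketch, not the server's implementation: the class name is hypothetical, a plain dict replaces Redis, bucket expiry is omitted, and it assumes a per-second limit is enforced as `limit * 20` requests across the whole window.

```python
WINDOW_SECONDS = 20
STEPS = 4
BUCKET_SECONDS = WINDOW_SECONDS // STEPS  # 5-second buckets


class SlidingWindowLimiter:
    def __init__(self, limit_per_second):
        # Assumption: the per-second limit is averaged over the window.
        self.max_in_window = limit_per_second * WINDOW_SECONDS
        self.buckets = {}  # key -> request count (Redis would hold these)

    @staticmethod
    def _key(scope, bucket_start):
        # Mirrors the key layout above, e.g. "rl:bucket:ip:1.2.3.4:100"
        return f"rl:bucket:{scope}:{bucket_start}"

    def allow(self, scope, now):
        current = int(now) // BUCKET_SECONDS * BUCKET_SECONDS
        starts = [current - i * BUCKET_SECONDS for i in range(STEPS)]
        used = sum(self.buckets.get(self._key(scope, s), 0) for s in starts)
        if used >= self.max_in_window:
            return False
        key = self._key(scope, current)
        self.buckets[key] = self.buckets.get(key, 0) + 1
        return True
```

A request is counted in the bucket for its own 5-second slot, and the decision sums the current bucket plus the three before it. Buckets older than that simply stop being read, which is the role the TTL plays in Redis.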
### Rate Limit Headers
Responses include rate limit information:
```
X-RateLimit-Limit: 10
X-RateLimit-Remaining: 7
X-RateLimit-Reset: 1714305600
```
### Rate Limit Exceeded Response
**Status Code**: 429 Too Many Requests
**Response**:
```json
{
"status": "error",
"error": {
"code": 5,
"message": "Rate limit exceeded"
}
}
```
## Error Handling
### Error Response Format
All errors return a consistent structure:
```json
{
"status": "error",
"error": {
"code": 1,
"message": "Invalid API key"
}
}
```
### Error Codes
| Code | Message | Description |
|------|---------|-------------|
| 1 | Invalid API key | Client or user key is invalid |
| 2 | Missing required parameter | Required parameter not provided |
| 3 | Invalid fingerprint | Fingerprint format is invalid |
| 4 | Internal error | Server-side error occurred |
| 5 | Rate limit exceeded | Too many requests |
| 6 | Invalid format | Unsupported response format |
| 7 | Fingerprint not found | Requested fingerprint doesn't exist |
| 8 | Too many requests | Batch size exceeds limit |
### HTTP Status Codes
| Code | Meaning | Usage |
|------|---------|-------|
| 200 | OK | Successful request |
| 400 | Bad Request | Invalid parameters |
| 401 | Unauthorized | Missing or invalid API key |
| 403 | Forbidden | API key lacks permission |
| 404 | Not Found | Resource not found |
| 429 | Too Many Requests | Rate limit exceeded |
| 500 | Internal Server Error | Server error |
| 503 | Service Unavailable | Service down or degraded |
## Authentication
### API Key Types
1. **Application Key** (`client` parameter):
- Identifies the client application
- Required for all API calls
- Obtain from https://acoustid.org/new-application
2. **User Key** (`user` parameter):
- Identifies the end user
- Required for submissions
- Created via `/v2/user/create_*` endpoints
3. **Demo Key**:
- Limited functionality
- For testing only
- Key: `8XaBELgH`
### Key Management
**Application Keys**:
- Created via web UI or internal API
- Can be active or inactive
- Rate limits configurable per key
- Usage statistics tracked
**User Keys**:
- Anonymous or MusicBrainz-linked
- Created programmatically
- Tied to application key
- Submission history tracked
## Best Practices
### Lookup Optimization
1. **Use batch lookups** for multiple files (up to 20 per request)
2. **Request only needed metadata** (use specific `meta` flags)
3. **Cache results** to avoid redundant lookups
4. **Handle rate limits** with exponential backoff
### Submission Guidelines
1. **Include MBIDs** when known (improves accuracy)
2. **Provide metadata** (artist, album, track) for better matching
3. **Use batch submissions** for efficiency
4. **Poll submission status** asynchronously
### Error Handling
1. **Retry on 5xx errors** with exponential backoff
2. **Respect rate limits** (check headers)
3. **Validate fingerprints** before submission
4. **Log errors** for debugging
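The retry advice above can be sketched as a small wrapper. It assumes a hypothetical `send` callable that performs one request and returns `(http_status, body)`; the attempt count, base delay, and jitter are illustrative defaults rather than values recommended by AcoustID.

```python
import random
import time


def is_retryable(status: int) -> bool:
    """Retry on rate limiting (429) and on any 5xx server error."""
    return status == 429 or 500 <= status < 600


def call_with_backoff(send, max_attempts=5, base_delay=1.0,
                      sleep=time.sleep, jitter=random.random):
    """Call `send` until it succeeds or attempts run out.

    The delay doubles after every failed attempt: 1s, 2s, 4s, ...
    plus a random jitter below one second.
    """
    status, body = None, None
    for attempt in range(max_attempts):
        status, body = send()
        if not is_retryable(status):
            return status, body
        if attempt < max_attempts - 1:
            sleep(base_delay * (2 ** attempt) + jitter())
    return status, body  # last response, still an error
```

Client errors such as 400 or 401 are returned immediately, since repeating the same request would not change the outcome.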
### Performance
1. **Use POST** for large requests (avoid URL length limits)
2. **Enable compression** (`meta=compress`)
3. **Reuse connections** (HTTP keep-alive)
4. **Implement timeouts** (30-60 seconds recommended)
@@ -0,0 +1,611 @@
# AcoustID Architecture
## System Architecture Overview
AcoustID employs a **monolithic multi-process architecture** with microservice-like separation of concerns. The system is split into two major repositories with distinct responsibilities:
1. **acoustid-server**: Monolithic Python application with multiple process types
2. **acoustid-index**: Standalone Zig service for fingerprint indexing
## Server Architecture
### Process Types
The server runs as multiple independent processes, each with a specific role:
| Process | Entry Point | Purpose | Scaling |
|---------|-------------|---------|---------|
| API | `acoustid.server:make_application()` | Handle API requests | Horizontal |
| Web | `acoustid.server:make_application()` | Serve web UI | Horizontal |
| Worker | `acoustid.worker:run()` | Process background jobs | Horizontal |
| Cron | `acoustid.cron:run()` | Execute scheduled tasks | Single instance |
| Import | `acoustid.scripts.import_submissions` | Bulk import fingerprints | Manual |
### Directory Structure
```
acoustid/
├── api/ # API layer
│ ├── __init__.py # API application factory
│ ├── errors.py # Error handling
│ ├── ratelimit.py # Rate limiting logic
│ └── v2/ # API v2 endpoints
│ ├── __init__.py
│ ├── lookup.py # Fingerprint lookup
│ ├── submit.py # Fingerprint submission
│ ├── misc.py # Utility endpoints
│ └── internal.py # Internal admin endpoints
├── data/ # Business logic layer
│ ├── account.py # User account operations
│ ├── application.py # API application management
│ ├── fingerprint.py # Fingerprint operations
│ ├── foreignid.py # Foreign ID management
│ ├── meta.py # Metadata operations
│ ├── musicbrainz.py # MusicBrainz queries
│ ├── stats.py # Statistics tracking
│ ├── submission.py # Submission processing
│ └── track.py # Track operations
├── future/ # Starlette migration
│ ├── app.py # ASGI application
│ ├── lookup.py # Async lookup handler
│ └── submit.py # Async submit handler
├── web/ # Web UI layer
│ ├── __init__.py # Web application factory
│ ├── views/ # View handlers
│ └── templates/ # Jinja2 templates
├── scripts/ # Utility scripts
│ ├── import_submissions.py
│ ├── backfill_fingerprint_index.py
│ └── update_lookup_stats.py
├── cli.py # CLI command definitions
├── server.py # WSGI/ASGI application
├── worker.py # Background worker
├── cron.py # Cron job scheduler
├── fingerprint.py # Fingerprint utilities
├── indexclient.py # Legacy TCP index client
├── fpstore.py # Modern HTTP index client
├── db.py # Database connection management
├── config.py # Configuration loading
└── tables.py # SQLAlchemy ORM models
```
### Layered Architecture
The server follows a traditional layered architecture:
```
┌─────────────────────────────────────────┐
│ Presentation Layer │
│ (api/, web/, future/) │
│ - HTTP request/response handling │
│ - Input validation │
│ - Response formatting │
└─────────────────────────────────────────┘
┌─────────────────────────────────────────┐
│ Business Logic Layer │
│ (data/) │
│ - Domain operations │
│ - Business rules │
│ - Orchestration │
└─────────────────────────────────────────┘
┌─────────────────────────────────────────┐
│ Data Access Layer │
│ (db.py, tables.py) │
│ - Database queries │
│ - ORM models │
│ - Transaction management │
└─────────────────────────────────────────┘
┌─────────────────────────────────────────┐
│ External Services Layer │
│ (indexclient.py, fpstore.py) │
│ - Index communication │
│ - MusicBrainz queries │
│ - Redis operations │
└─────────────────────────────────────────┘
```
### Framework Transition
The server is actively transitioning from Flask to Starlette:
**Current (Flask/Werkzeug)**:
- Location: `acoustid/api/`, `acoustid/web/`
- WSGI-based synchronous request handling
- Gunicorn as application server
- Blocking database operations with psycopg2
**Future (Starlette)**:
- Location: `acoustid/future/`
- ASGI-based asynchronous request handling
- Uvicorn as application server
- Async database operations with asyncpg
**Migration Status**:
- Core lookup and submit endpoints have async implementations
- Legacy endpoints still use Flask
- Both frameworks run simultaneously during transition
- Configuration flag controls which implementation is used
## Index Architecture
### LSM-Tree Design
The index uses a **Log-Structured Merge-tree (LSM-tree)** for efficient fingerprint storage and retrieval.
**Core Concept**:
- Writes go to in-memory segment (fast)
- Memory segment periodically flushed to disk
- Background process merges disk segments
- Reads check memory segment first, then disk segments
**Components**:
```
┌─────────────────────────────────────────┐
│ MultiIndex │
│ - Manages multiple named indexes │
│ - Routes requests to correct index │
└─────────────────────────────────────────┘
┌─────────────────────────────────────────┐
│ Index │
│ - Single fingerprint index │
│ - Coordinates segments and merging │
└─────────────────────────────────────────┘
┌──────────────────┬──────────────────────┐
│ MemorySegment │ FileSegment(s) │
│ - In-memory │ - On-disk │
│ - Fast writes │ - Immutable │
│ - Volatile │ - Persistent │
└──────────────────┴──────────────────────┘
┌─────────────────────────────────────────┐
│ Oplog (Write-Ahead Log) │
│ - Durability for memory segment │
│ - Replay on crash recovery │
└─────────────────────────────────────────┘
```
### Segment Management
**MemorySegment** (`src/MemorySegment.zig`):
- Hash map of fingerprint ID to posting list
- Posting list: array of term IDs (compressed)
- Maximum size threshold triggers flush
- Backed by Oplog for durability
**FileSegment** (`src/FileSegment.zig`):
- Immutable on-disk segment
- Binary file format with index and data sections
- StreamVByte compression for posting lists
- Memory-mapped for fast reads
**Segment Lifecycle**:
1. Writes accumulate in MemorySegment
2. MemorySegment reaches size threshold
3. Flush to new FileSegment
4. Clear MemorySegment and Oplog
5. Background merger selects segments to merge
6. Merge creates new larger FileSegment
7. Delete old segments
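The lifecycle above can be sketched with a toy model. This is not the Zig implementation: `ToyIndex` is a hypothetical class that keeps everything in Python dicts, uses an entry count as the size threshold, and merges all segments at once instead of selecting candidates.

```python
class ToyIndex:
    def __init__(self, flush_threshold=3):
        self.flush_threshold = flush_threshold
        self.memory = {}    # MemorySegment stand-in: fpid -> terms
        self.oplog = []     # write-ahead log for the memory segment
        self.segments = []  # immutable "file" segments, oldest first

    def insert(self, fpid, terms):
        self.oplog.append((fpid, tuple(terms)))  # log before applying
        self.memory[fpid] = tuple(terms)         # step 1: accumulate
        if len(self.memory) >= self.flush_threshold:  # step 2: threshold
            self.flush()

    def flush(self):
        self.segments.append(dict(self.memory))  # step 3: new segment
        self.memory.clear()                      # step 4: clear memory
        self.oplog.clear()                       #         and oplog

    def merge(self):
        merged = {}
        for segment in self.segments:  # oldest first, newer entries win
            merged.update(segment)
        self.segments = [merged]       # steps 6-7: replace old segments

    def get(self, fpid):
        if fpid in self.memory:        # memory segment is checked first
            return self.memory[fpid]
        for segment in reversed(self.segments):  # then newest to oldest
            if fpid in segment:
                return segment[fpid]
        return None
```

Reading newest to oldest is what lets an updated fingerprint shadow its older copy until a merge physically removes it.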
### Merge Policy
**Tiered Merge Strategy**:
- Segments grouped into tiers by size
- Tier 0: Smallest segments (recently flushed)
- Tier N: Largest segments (heavily merged)
- Merge triggered when tier has too many segments
- Merges segments within same tier
**Benefits**:
- Write amplification bounded
- Read performance improves over time
- Disk space reclaimed from deleted entries
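A tiered policy boils down to two decisions: which tier a segment belongs to, and when a tier is full enough to merge. The sketch below shows one possible shape; the tier boundaries (`base`, `factor`) and the `max_per_tier` limit are invented for illustration and are not taken from acoustid-index.

```python
def tier_of(doc_count, base=1000, factor=10):
    """Tier 0 holds segments below `base` docs; each next tier is `factor`x larger."""
    tier, limit = 0, base
    while doc_count >= limit:
        tier += 1
        limit *= factor
    return tier


def pick_merge(segments, max_per_tier=4):
    """Return the segment IDs of the lowest over-full tier, or [] if none.

    `segments` is a list of (segment_id, doc_count) pairs.
    """
    tiers = {}
    for segment_id, doc_count in segments:
        tiers.setdefault(tier_of(doc_count), []).append(segment_id)
    for tier in sorted(tiers):
        if len(tiers[tier]) > max_per_tier:
            return tiers[tier]
    return []
```

Merging only within a tier means each entry is rewritten roughly once per tier it passes through, which is where the bound on write amplification comes from.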
### File Format
**Segment File Structure** (`src/filefmt.zig`):
```
┌─────────────────────────────────────────┐
│ Header │
│ - Magic number │
│ - Version │
│ - Metadata │
├─────────────────────────────────────────┤
│ Index Section │
│ - Fingerprint ID → Offset mapping │
│ - Binary search tree or hash table │
├─────────────────────────────────────────┤
│ Data Section │
│ - Compressed posting lists │
│ - StreamVByte encoded │
└─────────────────────────────────────────┘
```
**Block Compression** (`src/block.zig`):
- Posting lists compressed in blocks
- StreamVByte SIMD compression
- Delta encoding for term IDs
- Typical compression ratio: 4-8x
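Delta encoding works because sorted IDs differ by small amounts, and small numbers need fewer bytes. The sketch below demonstrates the idea with a plain variable-byte encoding (7 payload bits per byte). It is deliberately not StreamVByte, which uses a different, SIMD-friendly layout, and the function names are illustrative.

```python
def encode_postings(ids):
    """Delta-encode sorted non-negative integers as variable-length bytes."""
    out = bytearray()
    prev = 0
    for value in sorted(ids):
        delta = value - prev
        prev = value
        while delta >= 0x80:
            out.append((delta & 0x7F) | 0x80)  # low 7 bits + "more" flag
            delta >>= 7
        out.append(delta)                      # final byte, flag cleared
    return bytes(out)


def decode_postings(data):
    """Inverse of encode_postings."""
    ids, prev, shift, delta = [], 0, 0, 0
    for byte in data:
        delta |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7          # more bytes belong to this delta
        else:
            prev += delta       # undo the delta step
            ids.append(prev)
            shift, delta = 0, 0
    return ids
```

For the IDs `[1000, 1003, 1010, 5000]` the deltas are `1000, 3, 7, 3990`, which fit in 6 bytes in this encoding, compared with 16 bytes as four 32-bit integers.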
### Index Reader
**IndexReader** (`src/IndexReader.zig`):
- Read-only view of index
- Merges results from all segments
- Implements search algorithm
- Returns top-K candidates by score
**Search Algorithm**:
1. Extract query terms from fingerprint
2. For each term, fetch posting lists from all segments
3. Merge posting lists (union)
4. Score each candidate by term overlap
5. Return top-K candidates sorted by score
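The five steps can be sketched over in-memory segments. This is an illustration under stated assumptions: each segment is modelled as a dict from term to a set of fingerprint IDs, and the score is the fraction of query terms a candidate shares, which may differ from the scoring the real index uses.

```python
from collections import Counter


def search(segments, query_terms, limit=10, min_score=0.5):
    """Return up to `limit` (fingerprint_id, score) pairs, best first."""
    terms = set(query_terms)                  # step 1: query terms
    hits = Counter()
    for term in terms:
        candidates = set()
        for segment in segments:              # step 2: every segment
            candidates |= segment.get(term, set())  # step 3: union
        for fpid in candidates:
            hits[fpid] += 1                   # step 4: count overlap
    scored = [(fpid, count / len(terms)) for fpid, count in hits.items()]
    scored = [pair for pair in scored if pair[1] >= min_score]
    scored.sort(key=lambda pair: (-pair[1], pair[0]))  # step 5: best first
    return scored[:limit]
```

Taking the union per term before counting ensures a fingerprint that appears in several segments is counted only once for that term.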
## Data Flow
### Submission Flow (Detailed)
```
┌─────────┐
│ Client │
└────┬────┘
│ POST /v2/submit
┌─────────────────────────────────────────┐
│ SubmitHandler (api/v2/submit.py) │
│ 1. Validate API keys (client + user) │
│ 2. Check rate limits (Redis) │
│ 3. Decode fingerprints │
│ 4. Insert into submission table │
│ 5. Publish to NATS queue │
└─────────────────────────────────────────┘
↓ NATS message
┌─────────────────────────────────────────┐
│ Worker (worker.py) │
│ 1. Consume message from NATS │
│ 2. Load submission from database │
└─────────────────────────────────────────┘
┌─────────────────────────────────────────┐
│ FingerprintSearcher (data/fingerprint) │
│ 1. Extract query from fingerprint │
│ 2. Search index for matches │
└─────────────────────────────────────────┘
↓ HTTP POST /:index/_search
┌─────────────────────────────────────────┐
│ Index (fpindex) │
│ 1. Decode MessagePack request │
│ 2. Search segments │
│ 3. Score candidates │
│ 4. Return top matches │
└─────────────────────────────────────────┘
↓ Candidate fingerprint IDs
┌─────────────────────────────────────────┐
│ Worker (continued) │
│ 1. Fetch candidate metadata from DB │
│ 2. Decide: create new track or link │
│ 3. Insert/update track tables │
│ 4. Update index with new fingerprint │
│ 5. Store result in submission_result │
└─────────────────────────────────────────┘
↓ HTTP PUT /:index/:fpid
┌─────────────────────────────────────────┐
│ Index (fpindex) │
│ 1. Add fingerprint to MemorySegment │
│ 2. Append to Oplog │
│ 3. Trigger flush if needed │
└─────────────────────────────────────────┘
```
### Lookup Flow (Detailed)
```
┌─────────┐
│ Client │
└────┬────┘
│ GET/POST /v2/lookup
┌─────────────────────────────────────────┐
│ LookupHandler (api/v2/lookup.py) │
│ 1. Validate API key (client) │
│ 2. Check rate limits (Redis) │
│ 3. Parse parameters │
└─────────────────────────────────────────┘
┌─────────────────────────────────────────┐
│ decode_fingerprint (fingerprint.py) │
│ 1. Decode base64 or compressed format │
│ 2. Decompress if needed │
│ 3. Parse Chromaprint data │
└─────────────────────────────────────────┘
┌─────────────────────────────────────────┐
│ extract_query (fingerprint.py) │
│ 1. Extract hash terms from fingerprint│
│ 2. Build query structure │
└─────────────────────────────────────────┘
┌─────────────────────────────────────────┐
│ fpstore.search (fpstore.py) │
│ 1. Encode query as MessagePack │
│ 2. HTTP POST to index │
└─────────────────────────────────────────┘
↓ HTTP POST /:index/_search
┌─────────────────────────────────────────┐
│ Index (fpindex) │
│ 1. Parse MessagePack query │
│ 2. Search all segments │
│ 3. Merge and score results │
│ 4. Return top-K candidates │
└─────────────────────────────────────────┘
↓ Candidate fingerprint IDs + scores
┌─────────────────────────────────────────┐
│ LookupHandler (continued) │
│ 1. Fetch fingerprint metadata from DB │
│ 2. Fetch track metadata from DB │
│ 3. Fetch MusicBrainz data if requested│
│ 4. Build result structure │
│ 5. Format as JSON/XML │
└─────────────────────────────────────────┘
↓ JSON response
┌─────────┐
│ Client │
└─────────┘
```
### Background Processing
**Cron Jobs** (`acoustid/cron.py`):
- Update lookup statistics (hourly)
- Update user agent statistics (daily)
- Clean up old submissions (daily)
- Refresh materialized views (hourly)
- Backup index snapshots (daily)
**Worker Tasks** (`acoustid/worker.py`):
- Process fingerprint submissions
- Import bulk fingerprints
- Update index with new data
- Resolve MBID redirects
- Clean up orphaned records
## Index Communication Protocols
### Legacy Protocol (indexclient.py)
**Transport**: Raw TCP socket
**Port**: 6080 (default)
**Format**: Custom binary protocol
**Message Structure**:
```
┌────────────────┬────────────────┬────────────────┐
│ Length (4B) │ Command (1B) │ Payload │
└────────────────┴────────────────┴────────────────┘
```
**Commands**:
- `0x01`: Search
- `0x02`: Insert
- `0x03`: Delete
**Status**: Being phased out, replaced by HTTP protocol
### Modern Protocol (fpstore.py)
**Transport**: HTTP/1.1
**Port**: 6081 (default)
**Format**: MessagePack
**Endpoints**:
| Method | Path | Purpose |
|--------|------|---------|
| POST | `/:index/_search` | Search for fingerprints |
| PUT | `/:index/:fpid` | Insert/update fingerprint |
| DELETE | `/:index/:fpid` | Delete fingerprint |
| GET | `/:index` | Get index info |
| GET | `/:index/_segments` | List segments |
| GET | `/:index/_snapshot` | Create snapshot |
**Search Request**:
```python
{
"query": [term_id1, term_id2, ...], # Query terms
"limit": 10, # Max results
"min_score": 0.5 # Score threshold
}
```
**Search Response**:
```python
{
"results": [
{"id": fpid1, "score": 0.95},
{"id": fpid2, "score": 0.87},
...
]
}
```
## Concurrency and Parallelism
### Server Concurrency
**API/Web Processes**:
- Multiple worker processes (Gunicorn/Uvicorn)
- Each process handles requests independently
- Shared-nothing architecture
- Database connection pooling per process
**Worker Processes**:
- Multiple worker instances
- NATS queue provides work distribution
- Each worker processes one submission at a time
- No shared state between workers
**Cron Process**:
- Single instance (leader election via database)
- Scheduled tasks run sequentially
- Long-running tasks delegated to workers
### Index Concurrency
**Thread Model**:
- Main thread: HTTP server
- Worker threads: Search and merge operations
- Configurable thread pool size
**Locking Strategy**:
- Read-write lock on Index
- Multiple concurrent readers
- Exclusive writer (for flush/merge)
- Lock-free MemorySegment (atomic operations)
**Background Tasks**:
- Segment merger runs in background thread
- Oplog flusher runs periodically
- Metrics collector runs independently
## Scalability Considerations
### Horizontal Scaling
**API/Web**:
- Stateless processes
- Scale by adding more instances
- Load balancer distributes requests
- Session state in Redis (if needed)
**Workers**:
- Scale by adding more instances
- NATS queue distributes work
- No coordination required
**Index**:
- Multiple index instances (sharding)
- Consistent hashing for fingerprint distribution
- NATS for cluster coordination
- Each instance handles subset of fingerprints
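
Consistent hashing, as mentioned for fingerprint distribution, can be sketched with a hash ring. The node names, the SHA-1 hash, and the virtual-node count below are assumptions chosen for illustration; the source does not say how AcoustID assigns fingerprints to shards.

```python
import bisect
import hashlib


class HashRing:
    """Consistent-hash ring mapping fingerprint IDs to index instances."""

    def __init__(self, nodes, vnodes=64):
        # Each node gets `vnodes` points on the ring to even out the load.
        self._ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    @staticmethod
    def _hash(key):
        return int.from_bytes(hashlib.sha1(key.encode()).digest()[:8], "big")

    def node_for(self, fingerprint_id):
        # First ring point clockwise from the key's hash, wrapping around.
        idx = bisect.bisect(self._points, self._hash(str(fingerprint_id)))
        return self._ring[idx % len(self._ring)][1]


ring = HashRing(["index-0", "index-1", "index-2"])
owner = ring.node_for(123456)
```

The useful property is that adding an instance only moves keys onto the new instance; keys never shuffle between the existing ones.
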
### Vertical Scaling
**Database**:
- Connection pooling
- Read replicas for queries
- Partitioning for large tables
- Materialized views for aggregations
**Index**:
- More threads for search
- Larger memory segment
- Faster disk for segments
- More RAM for file caching
## Fault Tolerance
### Server Resilience
**Database Failures**:
- Connection retry with exponential backoff
- Health checks detect failures
- Read-only mode if write DB unavailable
**Index Failures**:
- Graceful degradation (return partial results)
- Retry with exponential backoff
- Circuit breaker pattern
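
Retry with exponential backoff, listed for both database and index failures, might look like the sketch below. The function name, the delay values, and the use of `ConnectionError` as the retryable error are assumptions for illustration, not settings taken from acoustid-server.

```python
import random
import time


def retry_with_backoff(call, attempts=5, base_delay=0.1, max_delay=5.0, sleep=time.sleep):
    """Call `call()` until it succeeds, doubling the wait cap after each failure."""
    for attempt in range(attempts):
        try:
            return call()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            delay = min(max_delay, base_delay * (2 ** attempt))
            # Full jitter spreads out retries coming from many clients.
            sleep(random.uniform(0, delay))


calls = {"n": 0}


def flaky():
    """Hypothetical index call that fails twice, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("index unavailable")
    return "ok"


result = retry_with_backoff(flaky, sleep=lambda seconds: None)
```
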
**NATS Failures**:
- Persistent queue (JetStream)
- Automatic reconnection
- Message replay on recovery
### Index Resilience
**Crash Recovery**:
- Oplog replay restores MemorySegment
- FileSegments are immutable (no corruption)
- Incomplete merges discarded
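
Oplog replay can be pictured as re-applying logged operations, in order, to an empty in-memory segment. The tuple layout `(op, fingerprint_id, hashes)` is a hypothetical format chosen for this sketch; the actual oplog encoding is not described in the source.

```python
def replay_oplog(entries):
    """Rebuild an in-memory segment (fingerprint id -> hashes) from logged operations."""
    segment = {}
    for op, fpid, hashes in entries:
        if op == "insert":
            segment[fpid] = hashes      # later inserts overwrite earlier ones
        elif op == "delete":
            segment.pop(fpid, None)     # deleting an unknown id is a no-op
        else:
            raise ValueError(f"unknown oplog operation: {op!r}")
    return segment


oplog = [
    ("insert", 1, [101, 102]),
    ("insert", 2, [201]),
    ("delete", 1, None),
    ("insert", 2, [202, 203]),
]
memory_segment = replay_oplog(oplog)
```
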
**Data Integrity**:
- Checksums in file format
- Atomic file operations
- Write-ahead logging
**Replication**:
- NATS-based replication (optional)
- Snapshot-based backup
- Point-in-time recovery
## Performance Characteristics
### Server Performance
**Lookup Latency**:
- P50: ~50ms (including index search)
- P95: ~200ms
- P99: ~500ms
**Bottlenecks**:
- Index search time (dominant)
- Database query time (metadata fetch)
- Network latency (MusicBrainz queries)
### Index Performance
**Search Latency**:
- P50: ~5ms
- P95: ~20ms
- P99: ~50ms
**Throughput**:
- ~1000 searches/second (single instance)
- ~500 inserts/second (single instance)
**Bottlenecks**:
- Disk I/O (segment reads)
- CPU (decompression and scoring)
- Memory (segment caching)
## Future Architecture Plans
### Server Modernization
1. Complete migration to Starlette/ASGI
2. Remove Flask dependencies
3. Async database operations everywhere
4. GraphQL API alongside REST
### Index Enhancements
1. Distributed index with automatic sharding
2. Replication for high availability
3. Incremental snapshots
4. Query result caching
### Infrastructure
1. Kubernetes deployment
2. Service mesh (Istio/Linkerd)
3. Distributed tracing (OpenTelemetry)
4. Advanced monitoring (Prometheus + Grafana)
+871
@@ -0,0 +1,871 @@
# AcoustID Data Model
## Database Architecture
AcoustID uses a multi-database PostgreSQL architecture with separate databases for different concerns.
### Database Instances
| Database | Purpose | Tables | Extensions |
|----------|---------|--------|------------|
| `acoustid_app` | Application data (accounts, apps, stats) | 8 | pgcrypto |
| `acoustid_fingerprint` | Fingerprint and track data | 19 | intarray, acoustid, cube |
| `acoustid_ingest` | Submission processing | 3 | - |
| `musicbrainz` | MusicBrainz mirror (read-only) | Many | - |
### PostgreSQL Extensions
**intarray**: Integer array operations
- Used for fingerprint array queries
- Provides `&&` (overlap) and `@>` (contains) operators
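
The two intarray operators have simple set semantics. The Python functions below mirror what `&&` and `@>` compute; they are illustrations of the semantics, not how PostgreSQL evaluates them.

```python
def overlaps(a, b):
    """Mirror of intarray `a && b`: true if the arrays share any element."""
    return not set(a).isdisjoint(b)


def contains(a, b):
    """Mirror of intarray `a @> b`: true if every element of b is in a."""
    return set(b).issubset(a)


stored = [11, 42, 97, 128]
shares_a_hash = overlaps(stored, [5, 42])
has_all_hashes = contains(stored, [42, 7])
```
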
**pgcrypto**: Cryptographic functions
- UUID generation (`gen_random_uuid()`)
- API key hashing
**acoustid** (custom): Fingerprint similarity functions
- `acoustid_compare(int[], int[])`: Compare two fingerprints
- `acoustid_extract_query(int[])`: Extract query terms
- Source: `acoustid-ext` C extension
**cube**: Multi-dimensional cube data type
- Used for simhash-based fingerprint indexing
- Enables fast approximate nearest neighbor search
## Core Tables
### Account Management (acoustid_app)
#### `account`
User accounts for API access.
| Column | Type | Constraints | Description |
|--------|------|-------------|-------------|
| `id` | SERIAL | PRIMARY KEY | Account ID |
| `name` | VARCHAR(255) | NOT NULL | Display name |
| `apikey` | VARCHAR(40) | UNIQUE, NOT NULL | API key (user key) |
| `mbuser` | VARCHAR(64) | UNIQUE | MusicBrainz username |
| `created` | TIMESTAMP | NOT NULL | Creation timestamp |
| `lastlogin` | TIMESTAMP | | Last login timestamp |
| `submission_count` | INTEGER | DEFAULT 0 | Total submissions |
| `application_id` | INTEGER | FOREIGN KEY | Default application |
| `application_version` | VARCHAR(255) | | Application version |
| `created_from` | INET | | Registration IP |
| `is_admin` | BOOLEAN | DEFAULT FALSE | Admin flag |
**Indexes**:
- `account_pkey` (PRIMARY KEY on `id`)
- `account_apikey_key` (UNIQUE on `apikey`)
- `account_mbuser_key` (UNIQUE on `mbuser`)
#### `application`
API client applications.
| Column | Type | Constraints | Description |
|--------|------|-------------|-------------|
| `id` | SERIAL | PRIMARY KEY | Application ID |
| `name` | VARCHAR(255) | NOT NULL | Application name |
| `version` | VARCHAR(255) | | Version string |
| `apikey` | VARCHAR(40) | UNIQUE, NOT NULL | API key (client key) |
| `created` | TIMESTAMP | NOT NULL | Creation timestamp |
| `active` | BOOLEAN | DEFAULT TRUE | Active status |
| `account_id` | INTEGER | FOREIGN KEY | Owner account |
| `email` | VARCHAR(255) | | Contact email |
| `website` | VARCHAR(1000) | | Website URL |
| `rate_limit` | INTEGER | | Custom rate limit (req/s) |
**Indexes**:
- `application_pkey` (PRIMARY KEY on `id`)
- `application_apikey_key` (UNIQUE on `apikey`)
#### `account_openid`
OpenID authentication links.
| Column | Type | Constraints | Description |
|--------|------|-------------|-------------|
| `openid` | VARCHAR(255) | PRIMARY KEY | OpenID identifier |
| `account_id` | INTEGER | FOREIGN KEY | Linked account |
#### `account_google`
Google OAuth authentication links.
| Column | Type | Constraints | Description |
|--------|------|-------------|-------------|
| `google_user_id` | VARCHAR(255) | PRIMARY KEY | Google user ID |
| `account_id` | INTEGER | FOREIGN KEY | Linked account |
### Fingerprint Data (acoustid_fingerprint)
#### `track`
Unique audio tracks identified by fingerprints.
| Column | Type | Constraints | Description |
|--------|------|-------------|-------------|
| `id` | SERIAL | PRIMARY KEY | Track ID |
| `gid` | UUID | UNIQUE, NOT NULL | Public track UUID |
| `created` | TIMESTAMP | NOT NULL | Creation timestamp |
| `new_id` | INTEGER | FOREIGN KEY | Merge target (if merged) |
| `disabled` | BOOLEAN | DEFAULT FALSE | Disabled flag |
**Indexes**:
- `track_pkey` (PRIMARY KEY on `id`)
- `track_gid_key` (UNIQUE on `gid`)
- `track_new_id_idx` (on `new_id`)
**Notes**:
- `gid` is the public-facing AcoustID track ID
- `new_id` points to merged track (for deduplication)
- Disabled tracks excluded from search results
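
Because `new_id` points at the merge target, resolving a track means following that pointer until a track with no `new_id` is reached. A minimal sketch, assuming the mapping has been loaded into a dictionary (the function name and the cycle check are additions for illustration):

```python
def resolve_track(track_id, new_ids):
    """Follow `new_id` pointers until reaching a track that was not merged.

    `new_ids` maps track id -> new_id (None when the track is not merged).
    """
    seen = set()
    while new_ids.get(track_id) is not None:
        if track_id in seen:
            raise ValueError(f"merge cycle detected at track {track_id}")
        seen.add(track_id)
        track_id = new_ids[track_id]
    return track_id


merges = {1: 2, 2: 5, 5: None, 7: None}
canonical = resolve_track(1, merges)
```
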
#### `fingerprint`
Audio fingerprints linked to tracks.
| Column | Type | Constraints | Description |
|--------|------|-------------|-------------|
| `id` | SERIAL | PRIMARY KEY | Fingerprint ID |
| `track_id` | INTEGER | FOREIGN KEY | Linked track |
| `fingerprint` | INTEGER[] | NOT NULL | Chromaprint hash array |
| `length` | SMALLINT | NOT NULL | Duration in seconds |
| `bitrate` | SMALLINT | | Audio bitrate (kbps) |
| `format_id` | INTEGER | FOREIGN KEY | Audio format |
| `created` | TIMESTAMP | NOT NULL | Creation timestamp |
| `submission_count` | INTEGER | DEFAULT 1 | Number of submissions |
**Indexes**:
- `fingerprint_pkey` (PRIMARY KEY on `id`)
- `fingerprint_track_id_idx` (on `track_id`)
- `fingerprint_length_idx` (on `length`)
- `fingerprint_fingerprint_idx` (GIN on `fingerprint` using `intarray`)
**Notes**:
- `fingerprint` is an array of 32-bit integers (Chromaprint hashes)
- GIN index speeds up overlap (`&&`) queries, which select candidates for similarity scoring
- `submission_count` tracks popularity
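
Since a fingerprint is a sequence of 32-bit hashes, one common way to compare two of them is the fraction of differing bits. The sketch below assumes the two sequences are already aligned at offset zero; it is not the algorithm used by `acoustid_compare`, which the source does not describe.

```python
def bit_error_rate(fp_a, fp_b):
    """Fraction of differing bits between two aligned 32-bit hash sequences."""
    n = min(len(fp_a), len(fp_b))
    if n == 0:
        raise ValueError("both fingerprints must be non-empty")
    differing = sum(
        bin((a ^ b) & 0xFFFFFFFF).count("1")   # popcount of the XOR
        for a, b in zip(fp_a, fp_b)
    )
    return differing / (32 * n)


identical = bit_error_rate([12345, 67890], [12345, 67890])
one_bit_off = bit_error_rate([0b1010], [0b1000])
```

The mask keeps the comparison correct for negative values, which appear because PostgreSQL `INTEGER` is signed.
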
#### `fingerprint_data`
Extended fingerprint data with simhash.
| Column | Type | Constraints | Description |
|--------|------|-------------|-------------|
| `fingerprint_id` | INTEGER | PRIMARY KEY, FOREIGN KEY | Fingerprint ID |
| `fingerprint` | BYTEA | NOT NULL | Raw fingerprint data |
| `simhash` | CUBE | | Locality-sensitive hash |
**Indexes**:
- `fingerprint_data_pkey` (PRIMARY KEY on `fingerprint_id`)
- `fingerprint_data_simhash_idx` (GIST on `simhash`)
**Notes**:
- `fingerprint` stores compressed Chromaprint data
- `simhash` enables approximate nearest neighbor search
- GIST index for fast similarity queries
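
A simhash condenses many hashes into one value whose bits reflect the majority vote of the inputs, so similar inputs yield nearby values. The sketch below shows the classic bitwise-majority construction over 32-bit hashes. How AcoustID actually derives its simhash and maps it onto the `cube` type is not stated in the source, so treat this as a generic illustration.

```python
def simhash32(hashes):
    """Collapse a sequence of 32-bit hashes into one 32-bit locality-sensitive hash."""
    votes = [0] * 32
    for h in hashes:
        for bit in range(32):
            votes[bit] += 1 if (h >> bit) & 1 else -1
    # A bit is set in the result when more inputs had it set than cleared.
    return sum(1 << bit for bit in range(32) if votes[bit] > 0)


digest = simhash32([0b1011, 0b1001, 0b0001])
```
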
#### `track_mbid`
Links tracks to MusicBrainz recordings.
| Column | Type | Constraints | Description |
|--------|------|-------------|-------------|
| `id` | SERIAL | PRIMARY KEY | Link ID |
| `track_id` | INTEGER | FOREIGN KEY | AcoustID track |
| `mbid` | UUID | NOT NULL | MusicBrainz recording MBID |
| `created` | TIMESTAMP | NOT NULL | Creation timestamp |
| `submission_count` | INTEGER | DEFAULT 1 | Number of submissions |
| `disabled` | BOOLEAN | DEFAULT FALSE | Disabled flag |
**Indexes**:
- `track_mbid_pkey` (PRIMARY KEY on `id`)
- `track_mbid_track_id_mbid_key` (UNIQUE on `track_id, mbid`)
- `track_mbid_mbid_idx` (on `mbid`)
**Notes**:
- Multiple MBIDs per track possible (different recordings)
- `submission_count` indicates confidence
- Disabled links excluded from results
#### `meta`
User-submitted metadata.
| Column | Type | Constraints | Description |
|--------|------|-------------|-------------|
| `id` | SERIAL | PRIMARY KEY | Metadata ID |
| `track` | VARCHAR(255) | | Track title |
| `artist` | VARCHAR(255) | | Artist name |
| `album` | VARCHAR(255) | | Album title |
| `album_artist` | VARCHAR(255) | | Album artist |
| `track_no` | INTEGER | | Track number |
| `disc_no` | INTEGER | | Disc number |
| `year` | INTEGER | | Release year |
**Indexes**:
- `meta_pkey` (PRIMARY KEY on `id`)
#### `track_meta`
Links tracks to user metadata.
| Column | Type | Constraints | Description |
|--------|------|-------------|-------------|
| `id` | SERIAL | PRIMARY KEY | Link ID |
| `track_id` | INTEGER | FOREIGN KEY | AcoustID track |
| `meta_id` | INTEGER | FOREIGN KEY | Metadata record |
| `created` | TIMESTAMP | NOT NULL | Creation timestamp |
| `submission_count` | INTEGER | DEFAULT 1 | Number of submissions |
**Indexes**:
- `track_meta_pkey` (PRIMARY KEY on `id`)
- `track_meta_track_id_meta_id_key` (UNIQUE on `track_id, meta_id`)
#### `format`
Audio file formats.
| Column | Type | Constraints | Description |
|--------|------|-------------|-------------|
| `id` | SERIAL | PRIMARY KEY | Format ID |
| `name` | VARCHAR(20) | UNIQUE, NOT NULL | Format name (mp3, flac, etc.) |
**Indexes**:
- `format_pkey` (PRIMARY KEY on `id`)
- `format_name_key` (UNIQUE on `name`)
**Common Values**:
- `mp3`, `flac`, `ogg`, `m4a`, `wma`, `ape`, `wav`
#### `source`
Submission sources (applications).
| Column | Type | Constraints | Description |
|--------|------|-------------|-------------|
| `id` | SERIAL | PRIMARY KEY | Source ID |
| `application_id` | INTEGER | FOREIGN KEY | Application |
| `account_id` | INTEGER | FOREIGN KEY | User account |
| `version` | VARCHAR(255) | | Application version |
**Indexes**:
- `source_pkey` (PRIMARY KEY on `id`)
- `source_application_id_account_id_version_key` (UNIQUE on `application_id, account_id, version`)
### Foreign IDs (acoustid_fingerprint)
#### `foreignid_vendor`
External ID providers.
| Column | Type | Constraints | Description |
|--------|------|-------------|-------------|
| `id` | SERIAL | PRIMARY KEY | Vendor ID |
| `name` | VARCHAR(255) | UNIQUE, NOT NULL | Vendor name |
**Indexes**:
- `foreignid_vendor_pkey` (PRIMARY KEY on `id`)
- `foreignid_vendor_name_key` (UNIQUE on `name`)
**Common Values**:
- `musicbrainz`, `musicip`, `discogs`, `spotify`
#### `foreignid`
External identifiers.
| Column | Type | Constraints | Description |
|--------|------|-------------|-------------|
| `id` | SERIAL | PRIMARY KEY | Foreign ID |
| `vendor_id` | INTEGER | FOREIGN KEY | Vendor |
| `name` | VARCHAR(255) | NOT NULL | External ID value |
**Indexes**:
- `foreignid_pkey` (PRIMARY KEY on `id`)
- `foreignid_vendor_id_name_key` (UNIQUE on `vendor_id, name`)
#### `track_foreignid`
Links tracks to external IDs.
| Column | Type | Constraints | Description |
|--------|------|-------------|-------------|
| `id` | SERIAL | PRIMARY KEY | Link ID |
| `track_id` | INTEGER | FOREIGN KEY | AcoustID track |
| `foreignid_id` | INTEGER | FOREIGN KEY | External ID |
| `created` | TIMESTAMP | NOT NULL | Creation timestamp |
| `submission_count` | INTEGER | DEFAULT 1 | Number of submissions |
**Indexes**:
- `track_foreignid_pkey` (PRIMARY KEY on `id`)
- `track_foreignid_track_id_foreignid_id_key` (UNIQUE on `track_id, foreignid_id`)
#### `track_puid`
Legacy MusicIP PUID links.
| Column | Type | Constraints | Description |
|--------|------|-------------|-------------|
| `id` | SERIAL | PRIMARY KEY | Link ID |
| `track_id` | INTEGER | FOREIGN KEY | AcoustID track |
| `puid` | UUID | NOT NULL | MusicIP PUID |
| `created` | TIMESTAMP | NOT NULL | Creation timestamp |
| `submission_count` | INTEGER | DEFAULT 1 | Number of submissions |
**Indexes**:
- `track_puid_pkey` (PRIMARY KEY on `id`)
- `track_puid_track_id_puid_key` (UNIQUE on `track_id, puid`)
- `track_puid_puid_idx` (on `puid`)
### Statistics (acoustid_app)
#### `stats`
General statistics.
| Column | Type | Constraints | Description |
|--------|------|-------------|-------------|
| `id` | SERIAL | PRIMARY KEY | Stat ID |
| `name` | VARCHAR(255) | NOT NULL | Stat name (unique per `date`) |
| `value` | INTEGER | NOT NULL | Stat value |
| `date` | DATE | NOT NULL | Stat date |
**Indexes**:
- `stats_pkey` (PRIMARY KEY on `id`)
- `stats_name_date_key` (UNIQUE on `name, date`)
**Common Stats**:
- `lookup.count`, `submission.count`, `track.count`, `fingerprint.count`
#### `stats_lookups`
Lookup statistics by hour.
| Column | Type | Constraints | Description |
|--------|------|-------------|-------------|
| `id` | SERIAL | PRIMARY KEY | Stat ID |
| `hour` | TIMESTAMP | NOT NULL | Hour timestamp |
| `application_id` | INTEGER | FOREIGN KEY | Application |
| `count_hits` | INTEGER | DEFAULT 0 | Successful lookups |
| `count_misses` | INTEGER | DEFAULT 0 | Failed lookups |
**Indexes**:
- `stats_lookups_pkey` (PRIMARY KEY on `id`)
- `stats_lookups_hour_application_id_key` (UNIQUE on `hour, application_id`)
#### `stats_user_agents`
User agent statistics.
| Column | Type | Constraints | Description |
|--------|------|-------------|-------------|
| `id` | SERIAL | PRIMARY KEY | Stat ID |
| `date` | DATE | NOT NULL | Date |
| `application_id` | INTEGER | FOREIGN KEY | Application |
| `user_agent` | VARCHAR(1000) | NOT NULL | User agent string |
| `ip` | INET | NOT NULL | IP address |
| `count` | INTEGER | DEFAULT 0 | Request count |
**Indexes**:
- `stats_user_agents_pkey` (PRIMARY KEY on `id`)
- `stats_user_agents_date_application_id_user_agent_ip_key` (UNIQUE on `date, application_id, user_agent, ip`)
#### `stats_top_accounts`
Top submitter accounts.
| Column | Type | Constraints | Description |
|--------|------|-------------|-------------|
| `id` | SERIAL | PRIMARY KEY | Stat ID |
| `account_id` | INTEGER | FOREIGN KEY | Account |
| `count` | INTEGER | NOT NULL | Submission count |
**Indexes**:
- `stats_top_accounts_pkey` (PRIMARY KEY on `id`)
- `stats_top_accounts_account_id_key` (UNIQUE on `account_id`)
### Submission Processing (acoustid_ingest)
#### `submission`
Pending fingerprint submissions.
| Column | Type | Constraints | Description |
|--------|------|-------------|-------------|
| `id` | SERIAL | PRIMARY KEY | Submission ID |
| `fingerprint` | INTEGER[] | NOT NULL | Chromaprint hash array |
| `length` | SMALLINT | NOT NULL | Duration in seconds |
| `bitrate` | SMALLINT | | Audio bitrate |
| `format_id` | INTEGER | | Audio format |
| `created` | TIMESTAMP | NOT NULL | Submission timestamp |
| `source_id` | INTEGER | FOREIGN KEY | Submission source |
| `mbid` | UUID | | MusicBrainz MBID (if provided) |
| `handled` | BOOLEAN | DEFAULT FALSE | Processing status |
| `meta_id` | INTEGER | FOREIGN KEY | User metadata |
**Indexes**:
- `submission_pkey` (PRIMARY KEY on `id`)
- `submission_handled_idx` (on `handled` WHERE `handled = FALSE`)
**Notes**:
- Worker processes unhandled submissions
- `handled = TRUE` after processing
#### `submission_result`
Processing results for submissions.
| Column | Type | Constraints | Description |
|--------|------|-------------|-------------|
| `id` | SERIAL | PRIMARY KEY | Result ID |
| `submission_id` | INTEGER | FOREIGN KEY | Submission |
| `track_id` | INTEGER | FOREIGN KEY | Matched/created track |
| `created` | TIMESTAMP | NOT NULL | Processing timestamp |
**Indexes**:
- `submission_result_pkey` (PRIMARY KEY on `id`)
- `submission_result_submission_id_key` (UNIQUE on `submission_id`)
#### `pending_submission`
Queue for async submission processing.
| Column | Type | Constraints | Description |
|--------|------|-------------|-------------|
| `id` | SERIAL | PRIMARY KEY | Queue ID |
| `submission_id` | INTEGER | FOREIGN KEY | Submission |
| `created` | TIMESTAMP | NOT NULL | Queue timestamp |
**Indexes**:
- `pending_submission_pkey` (PRIMARY KEY on `id`)
- `pending_submission_submission_id_key` (UNIQUE on `submission_id`)
**Notes**:
- Replaced by NATS queue in newer deployments
- Legacy table, may be deprecated
### Provenance Tables (acoustid_fingerprint)
Track data lineage and changes.
#### `fingerprint_source`
Links fingerprints to submission sources.
| Column | Type | Constraints | Description |
|--------|------|-------------|-------------|
| `id` | SERIAL | PRIMARY KEY | Link ID |
| `fingerprint_id` | INTEGER | FOREIGN KEY | Fingerprint |
| `source_id` | INTEGER | FOREIGN KEY | Source |
| `created` | TIMESTAMP | NOT NULL | Creation timestamp |
#### `track_mbid_source`
Links track-MBID associations to sources.
| Column | Type | Constraints | Description |
|--------|------|-------------|-------------|
| `id` | SERIAL | PRIMARY KEY | Link ID |
| `track_mbid_id` | INTEGER | FOREIGN KEY | Track-MBID link |
| `source_id` | INTEGER | FOREIGN KEY | Source |
| `created` | TIMESTAMP | NOT NULL | Creation timestamp |
#### `track_mbid_change`
Audit log for track-MBID changes.
| Column | Type | Constraints | Description |
|--------|------|-------------|-------------|
| `id` | SERIAL | PRIMARY KEY | Change ID |
| `track_mbid_id` | INTEGER | FOREIGN KEY | Track-MBID link |
| `account_id` | INTEGER | FOREIGN KEY | Account that made change |
| `disabled` | BOOLEAN | NOT NULL | New disabled status |
| `created` | TIMESTAMP | NOT NULL | Change timestamp |
| `note` | TEXT | | Change reason |
## ORM Layer (SQLAlchemy)
### Multi-Database Configuration
**File**: `acoustid/db.py`
```python
# Database bind keys
BIND_KEYS = {
    'app': 'acoustid_app',
    'fingerprint': 'acoustid_fingerprint',
    'ingest': 'acoustid_ingest',
    'musicbrainz': 'musicbrainz',
}
```
**Model Binding**:
```python
class Account(Base):
    __bind_key__ = 'app'
    __tablename__ = 'account'
    # ...


class Track(Base):
    __bind_key__ = 'fingerprint'
    __tablename__ = 'track'
    # ...
```
### Connection Pooling
**Configuration** (`acoustid.conf`):
```ini
[database]
name = acoustid_app
user = acoustid
password_file = /run/secrets/db_password
host = postgres
port = 5432
pool_size = 20
pool_recycle = 3600
```
**Pool Settings**:
- `pool_size`: Number of connections kept open in the pool per process (SQLAlchemy may open extra connections up to `max_overflow`)
- `pool_recycle`: Recycle connections after N seconds
- `pool_pre_ping`: Test connections before use
### Query Patterns
**Fingerprint Search** (legacy, pre-index):
```python
from sqlalchemy import func

# Find similar fingerprints using intarray overlap
query = db.session.query(Fingerprint).filter(
    Fingerprint.fingerprint.op('&&')(query_fingerprint),
    Fingerprint.length.between(duration - 5, duration + 5)
).order_by(
    func.acoustid_compare(Fingerprint.fingerprint, query_fingerprint).desc()
).limit(10)
```
**Track Lookup with MBIDs**:
```python
from sqlalchemy.orm import joinedload

# Fetch track with all linked MBIDs
track = db.session.query(Track).options(
    joinedload(Track.mbids)
).filter(Track.gid == track_gid).first()
```
**Submission Processing**:
```python
# Find unhandled submissions, oldest first
submissions = db.session.query(Submission).filter(
    Submission.handled.is_(False)
).order_by(Submission.created).limit(100).all()
```
## Database Migrations
### Alembic Configuration
**File**: `alembic.ini`
**Migration Directories**:
- `alembic/versions/app/`: acoustid_app migrations
- `alembic/versions/fingerprint/`: acoustid_fingerprint migrations
- `alembic/versions/ingest/`: acoustid_ingest migrations
**Multi-Database Support**:
```python
# alembic/env.py
from alembic import context


def run_migrations_online():
    # get_engine() and get_metadata() are assumed to be project helpers keyed by bind name
    for bind_key in ['app', 'fingerprint', 'ingest']:
        engine = get_engine(bind_key)
        with engine.connect() as connection:
            context.configure(
                connection=connection,
                target_metadata=get_metadata(bind_key)
            )
            with context.begin_transaction():
                context.run_migrations()
```
### Migration Commands
```bash
# Create new migration
alembic revision --autogenerate -m "Add new column"
# Apply migrations
alembic upgrade head
# Rollback migration
alembic downgrade -1
# Show current version
alembic current
# Show migration history
alembic history
```
## Redis Data Structures
### Rate Limiting
**Key Pattern**: `rl:bucket:{scope}:{identifier}:{timestamp}`
**Example Keys**:
```
rl:bucket:global:1714305600
rl:bucket:app:8XaBELgH:1714305600
rl:bucket:ip:192.168.1.1:1714305600
```
**Value**: Integer (request count)
**TTL**: 25 seconds (window duration + buffer)
**Algorithm**:
```python
# Increment bucket for current window
bucket_key = f"rl:bucket:{scope}:{identifier}:{current_window}"
count = redis.incr(bucket_key)
redis.expire(bucket_key, 25)
# Sum counts across all windows in sliding window
# (GET returns None for missing or expired buckets, so default to 0)
total = sum(
    int(redis.get(f"rl:bucket:{scope}:{identifier}:{w}") or 0)
    for w in windows
)
```
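
To see the bucketed sliding window end to end without a Redis server, the same idea can be run against a dictionary. The bucket width (1 second) and window length (20 buckets) are assumptions; the source only states the 25-second TTL. The class name is hypothetical.

```python
from collections import defaultdict

BUCKET_SECONDS = 1     # assumed bucket width
WINDOW_BUCKETS = 20    # assumed number of buckets in the sliding window


class SlidingWindowLimiter:
    def __init__(self, limit):
        self.limit = limit
        self.buckets = defaultdict(int)   # stands in for the Redis keys

    def allow(self, scope, identifier, now):
        current = int(now) // BUCKET_SECONDS
        self.buckets[(scope, identifier, current)] += 1   # INCR equivalent
        # Sum the current bucket and the preceding ones in the window.
        total = sum(
            self.buckets.get((scope, identifier, current - i), 0)
            for i in range(WINDOW_BUCKETS)
        )
        return total <= self.limit


limiter = SlidingWindowLimiter(limit=3)
decisions = [
    limiter.allow("ip", "192.168.1.1", now=1714305600 + t)
    for t in (0, 1, 2, 3)
]
```

The fourth request inside the window is rejected; once the earlier buckets fall out of the window, requests are allowed again.
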
### Task Queue (Legacy)
**Key Pattern**: `queue:{queue_name}`
**Operations**:
```python
# Push task
redis.rpush('queue:submissions', json.dumps(task_data))
# Pop task
task_data = redis.lpop('queue:submissions')
```
**Note**: Being replaced by NATS in newer deployments
### API Key Cache
**Implementation**: In-memory TTLCache (not Redis)
```python
from cachetools import TTLCache
api_key_cache = TTLCache(maxsize=1000, ttl=60)
```
**Purpose**: Reduce database queries for API key validation
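
The effect of a 60-second TTL on database load can be shown with a tiny stand-in cache that uses an injectable clock. This is not `cachetools.TTLCache` itself, only a sketch of the same behavior; `load_from_db` and the returned record are hypothetical.

```python
import time


class SimpleTTLCache:
    """Tiny stand-in for a TTL cache, for illustration only."""

    def __init__(self, ttl, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock
        self._items = {}   # key -> (expires_at, value)

    def get(self, key, loader):
        entry = self._items.get(key)
        if entry is not None and entry[0] > self.clock():
            return entry[1]                      # fresh hit: skip the loader
        value = loader(key)                      # miss or expired: reload
        self._items[key] = (self.clock() + self.ttl, value)
        return value


fake_now = [0.0]
db_hits = []


def load_from_db(api_key):
    db_hits.append(api_key)        # hypothetical database lookup
    return {"application_id": 1}


cache = SimpleTTLCache(ttl=60, clock=lambda: fake_now[0])
cache.get("8XaBELgH", load_from_db)
cache.get("8XaBELgH", load_from_db)   # served from cache
fake_now[0] = 61.0
cache.get("8XaBELgH", load_from_db)   # expired, reloaded
```
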
### Backfill State
**Key Pattern**: `backfill:{index_name}:{state_key}`
**Example Keys**:
```
backfill:fingerprints:last_id
backfill:fingerprints:batch_size
backfill:fingerprints:completed
```
**Purpose**: Track progress of index backfill operations
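
Keys such as `last_id` and `completed` make a backfill resumable: each batch advances a checkpoint, so a restart continues where the previous run stopped. A minimal sketch with a dictionary in place of Redis, assuming IDs are processed in ascending order (the function and its parameters are illustrative):

```python
def run_backfill(rows, state, batch_size=2, max_batches=None):
    """Index rows with id > state['last_id'] in batches, checkpointing after each."""
    indexed = []
    batches = 0
    while max_batches is None or batches < max_batches:
        batch = [r for r in rows if r > state["last_id"]][:batch_size]
        if not batch:
            state["completed"] = True
            break
        indexed.extend(batch)            # stand-in for sending to the index
        state["last_id"] = batch[-1]     # checkpoint so a restart resumes here
        batches += 1
    return indexed


state = {"last_id": 0, "completed": False}   # mirrors backfill:fingerprints:*
ids = [1, 2, 3, 4, 5]
first_run = run_backfill(ids, state, max_batches=1)   # simulate an interruption
second_run = run_backfill(ids, state)
```
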
### Unknown MBID Cache
**Key Pattern**: `unknown_mbid:{mbid}`
**Value**: Boolean (1 if MBID not found in MusicBrainz)
**TTL**: 3600 seconds (1 hour)
**Purpose**: Avoid repeated MusicBrainz queries for non-existent MBIDs
## Data Integrity
### Constraints
**Foreign Keys**:
- All foreign keys have `ON DELETE CASCADE` or `ON DELETE SET NULL`
- Orphaned records cleaned up automatically
**Unique Constraints**:
- Prevent duplicate fingerprints per track
- Prevent duplicate MBID links per track
- Ensure API key uniqueness
**Check Constraints**:
- Duration must be positive
- Bitrate must be positive
- Submission count must be non-negative
### Triggers
**Update Submission Count**:
```sql
CREATE TRIGGER update_fingerprint_submission_count
AFTER INSERT ON fingerprint_source
FOR EACH ROW
EXECUTE FUNCTION increment_submission_count();
```
**Track Merge Propagation**:
```sql
CREATE TRIGGER propagate_track_merge
AFTER UPDATE OF new_id ON track
FOR EACH ROW
EXECUTE FUNCTION update_merged_track_references();
```
### Indexes for Performance
**Covering Indexes**:
```sql
-- Lookup by fingerprint and duration
CREATE INDEX fingerprint_lookup_idx
ON fingerprint (length, track_id)
INCLUDE (fingerprint);
```
**Partial Indexes**:
```sql
-- Only index unhandled submissions
CREATE INDEX submission_unhandled_idx
ON submission (created)
WHERE handled = FALSE;
```
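
A partial index can be tried out locally with SQLite, which also supports a `WHERE` clause on `CREATE INDEX`. SQLite is used here only as a stand-in so the example runs without a PostgreSQL server; it stores booleans as 0/1, and the table is reduced to the columns that matter.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE submission (
        id INTEGER PRIMARY KEY,
        created TEXT NOT NULL,
        handled INTEGER NOT NULL DEFAULT 0
    );
    CREATE INDEX submission_unhandled_idx
        ON submission (created)
        WHERE handled = 0;
""")
conn.executemany(
    "INSERT INTO submission (created, handled) VALUES (?, ?)",
    [("2025-04-01", 1), ("2025-04-02", 0), ("2025-04-03", 0)],
)
# Only the two unhandled rows are present in the partial index.
pending = conn.execute(
    "SELECT id FROM submission WHERE handled = 0 ORDER BY created"
).fetchall()
```
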
**GIN Indexes**:
```sql
-- Fast fingerprint array queries
CREATE INDEX fingerprint_fingerprint_idx
ON fingerprint USING GIN (fingerprint gin__int_ops);
```
## Data Lifecycle
### Fingerprint Submission
1. Insert into `submission` table (acoustid_ingest)
2. Publish to NATS queue
3. Worker processes submission
4. Insert into `fingerprint` table (acoustid_fingerprint)
5. Link to `track` (create or match)
6. Insert into `fingerprint_source` (provenance)
7. Update index via HTTP API
8. Insert into `submission_result`
9. Mark `submission.handled = TRUE`
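
The worker-side steps (3 to 9) can be sketched with dictionaries standing in for the tables. Everything here is illustrative: the function name, the `find_track` and `new_track_id` callbacks, and the record layout are assumptions, and the index update in step 7 is left out.

```python
def process_submission(submission, fingerprints, results, find_track, new_track_id):
    """Apply the worker steps to one submission, using dicts in place of tables."""
    track_id = find_track(submission["fingerprint"])
    if track_id is None:
        track_id = new_track_id()                 # step 5: create a track
    fingerprint_id = len(fingerprints) + 1
    fingerprints[fingerprint_id] = {              # step 4: store fingerprint
        "track_id": track_id,
        "fingerprint": submission["fingerprint"],
        "source_id": submission["source_id"],     # step 6: provenance
    }
    results[submission["id"]] = track_id          # step 8: submission_result
    submission["handled"] = True                  # step 9
    return fingerprint_id


fingerprints, results = {}, {}
sub = {"id": 10, "fingerprint": [1, 2, 3], "source_id": 4, "handled": False}
fp_id = process_submission(
    sub, fingerprints, results,
    find_track=lambda fp: None,      # no existing match
    new_track_id=lambda: 100,
)
```
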
### Track Merging
1. Identify duplicate tracks (manual or automated)
2. Set `track.new_id` to target track
3. Trigger updates all references
4. Merge fingerprints, MBIDs, metadata
5. Disable old track (`track.disabled = TRUE`)
### Data Cleanup
**Cron Jobs**:
- Delete old handled submissions (>30 days)
- Clean up orphaned metadata records
- Remove disabled tracks with no references
- Archive old statistics
## Performance Optimization
### Query Optimization
**Materialized Views**:
```sql
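CREATE MATERIALIZED VIEW track_stats AS
SELECT
    f.track_id,
    COUNT(DISTINCT f.id) AS fingerprint_count,
    COUNT(DISTINCT tm.mbid) AS mbid_count,
    -- Scalar subquery avoids double-counting caused by the join fan-out
    (SELECT SUM(f2.submission_count)
       FROM fingerprint f2
      WHERE f2.track_id = f.track_id) AS total_submissions
FROM fingerprint f
LEFT JOIN track_mbid tm ON tm.track_id = f.track_id
GROUP BY f.track_id;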
```
**Partitioning** (future):
```sql
-- Partition submissions by month
-- (requires submission to be created with PARTITION BY RANGE (created))
CREATE TABLE submission_2025_04 PARTITION OF submission
    FOR VALUES FROM ('2025-04-01') TO ('2025-05-01');
```
### Caching Strategy
**Application-Level**:
- API key validation (TTLCache, 60s)
- Format ID lookup (permanent cache)
- MusicBrainz MBID existence (Redis, 1h)
**Database-Level**:
- Shared buffers (PostgreSQL config)
- Connection pooling (SQLAlchemy)
- Query statistics for tuning (pg_stat_statements; it records query performance and does not cache results)
### Bulk Operations
**Batch Inserts**:
```python
# Insert multiple fingerprints efficiently
db.session.bulk_insert_mappings(Fingerprint, fingerprint_dicts)
db.session.commit()
```
**Bulk Updates**:
```python
from sqlalchemy import update

# Update submission counts in batch
db.session.execute(
    update(Fingerprint).where(
        Fingerprint.id.in_(fingerprint_ids)
    ).values(
        submission_count=Fingerprint.submission_count + 1
    )
)
db.session.commit()
```
## Backup and Recovery
### Backup Strategy
**PostgreSQL**:
- Daily full backups (pg_dump)
- Continuous WAL archiving
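- Point-in-time recovery enabled (replays archived WAL on top of a physical base backup such as one from `pg_basebackup`; `pg_dump` output cannot be used for this)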
**Index**:
- Daily snapshots via `/:index/_snapshot`
- Incremental backups of Oplog
- Segment files backed up separately
### Disaster Recovery
**Database Restore**:
```bash
# Restore from dump
pg_restore -d acoustid_app acoustid_app_backup.dump
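# Point-in-time recovery is not a pg_restore feature: restore a physical base
# backup, set recovery_target_time = '2025-04-28 12:00:00' in postgresql.conf,
# create recovery.signal in the data directory, then start PostgreSQL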
```
**Index Rebuild**:
```bash
# Rebuild from database
python manage.py run import --rebuild-index
```
@@ -0,0 +1,946 @@
# AcoustID Deployment
## Deployment Overview
AcoustID supports multiple deployment models: production multi-server, Docker Compose for self-hosting, and local development. The system requires coordination between multiple services: PostgreSQL, Redis, NATS, the Python server, and the Zig index.
## Docker Deployment
### Server Docker Image
**Dockerfile**: `docker/Dockerfile`
#### Multi-Stage Build
**Stage 1: Chromaprint Build**
```dockerfile
FROM ubuntu:24.04 AS chromaprint-build

RUN apt-get update && apt-get install -y \
    git \
    cmake \
    build-essential \
    libfftw3-dev

WORKDIR /build

RUN git clone https://github.com/acoustid/chromaprint.git && \
    cd chromaprint && \
    git checkout 41a3e8fb && \
    cmake -DCMAKE_BUILD_TYPE=Release \
          -DBUILD_TOOLS=OFF \
          -DBUILD_TESTS=OFF . && \
    make -j$(nproc) && \
    make install
```
**Stage 2: Base Image**
```dockerfile
FROM ubuntu:24.04 AS base

RUN apt-get update && apt-get install -y \
    python3.12 \
    python3-pip \
    libfftw3-3 \
    libpq5 \
    && rm -rf /var/lib/apt/lists/*

COPY --from=chromaprint-build /usr/local/lib/libchromaprint.so* /usr/local/lib/
COPY --from=chromaprint-build /usr/local/include/chromaprint.h /usr/local/include/
RUN ldconfig
```
**Stage 3: Builder**
```dockerfile
FROM base AS builder

RUN apt-get update && apt-get install -y \
    build-essential \
    python3-dev \
    libpq-dev \
    curl \
    && rm -rf /var/lib/apt/lists/*

# Install uv (recent installers place it in ~/.local/bin, older ones in ~/.cargo/bin)
RUN curl -LsSf https://astral.sh/uv/install.sh | sh
ENV PATH="/root/.local/bin:/root/.cargo/bin:$PATH"

WORKDIR /app
COPY pyproject.toml uv.lock ./
RUN uv sync --frozen --no-dev
COPY . .
RUN uv build
```
**Stage 4: Final Image**
```dockerfile
FROM base AS final
# Create non-root user
RUN useradd -m -u 1000 acoustid
WORKDIR /app
# Copy built wheel and dependencies
COPY --from=builder /app/.venv /app/.venv
COPY --from=builder /app/dist/*.whl /tmp/
# Install application
RUN /app/.venv/bin/pip install /tmp/*.whl && rm /tmp/*.whl
# Copy configuration template
COPY acoustid.conf.dist /etc/acoustid/acoustid.conf.dist
USER acoustid
ENV PATH="/app/.venv/bin:$PATH"
ENV PYTHONUNBUFFERED=1
ENTRYPOINT ["python", "manage.py"]
CMD ["run", "api"]
```
**Image Size**: ~400MB (compressed)
**Base OS**: Ubuntu 24.04
**Python Version**: 3.12
### Index Docker Image
**Dockerfile**: `docker/Dockerfile.index`
```dockerfile
FROM ubuntu:24.04 AS builder

RUN apt-get update && apt-get install -y \
    curl \
    xz-utils \
    && rm -rf /var/lib/apt/lists/*

# Install Zig
RUN curl -L https://ziglang.org/download/0.11.0/zig-linux-x86_64-0.11.0.tar.xz | \
    tar -xJ -C /usr/local && \
    ln -s /usr/local/zig-linux-x86_64-0.11.0/zig /usr/local/bin/zig

WORKDIR /build
COPY . .
RUN zig build -Doptimize=ReleaseFast

FROM ubuntu:24.04

RUN useradd -m -u 1000 acoustid

WORKDIR /app
COPY --from=builder /build/zig-out/bin/fpindex /app/fpindex

RUN mkdir -p /var/lib/acoustid-index && \
    chown acoustid:acoustid /var/lib/acoustid-index

USER acoustid
EXPOSE 6081

ENTRYPOINT ["/app/fpindex"]
CMD ["--dir", "/var/lib/acoustid-index", "--port", "6081"]
```
**Image Size**: ~50MB (compressed)
**Base OS**: Ubuntu 24.04
**Binary**: Single statically-linked executable
### Docker Compose Configuration
**File**: `docker-compose.yml`
```yaml
version: '3.8'

services:
  postgres:
    image: ghcr.io/acoustid/postgresql:17.4
    environment:
      POSTGRES_USER: acoustid
      POSTGRES_PASSWORD_FILE: /run/secrets/db_password
      POSTGRES_MULTIPLE_DATABASES: acoustid_app,acoustid_fingerprint,acoustid_ingest
    volumes:
      - postgres_data:/var/lib/postgresql/data
      - ./docker/init-db.sh:/docker-entrypoint-initdb.d/init-db.sh
    secrets:
      - db_password
    ports:
      - "5432:5432"
    healthcheck:
      # Valid test prefixes are CMD and CMD-SHELL; a single command string needs CMD-SHELL
      test: ["CMD-SHELL", "pg_isready -U acoustid"]
      interval: 10s
      timeout: 5s
      retries: 5

  redis:
    image: redis:7-alpine
    # redis-server has no --requirepass-file option, so read the secret in a shell
    command: ["sh", "-c", "exec redis-server --requirepass \"$$(cat /run/secrets/redis_password)\""]
    volumes:
      - redis_data:/data
    secrets:
      - redis_password
    ports:
      - "6379:6379"
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 10s
      timeout: 5s
      retries: 5

  nats:
    image: nats:2-alpine
    # -m 8222 enables the monitoring port that the healthcheck queries
    command: -js -sd /data -m 8222
    volumes:
      - nats_data:/data
    ports:
      - "4222:4222"
      - "8222:8222"
    healthcheck:
      test: ["CMD", "wget", "-q", "-O-", "http://localhost:8222/healthz"]
      interval: 10s
      timeout: 5s
      retries: 5

  index:
    image: ghcr.io/acoustid/acoustid-index:latest
    command: >
      --dir /var/lib/acoustid-index
      --port 6081
      --threads 4
      --log-level info
    volumes:
      - index_data:/var/lib/acoustid-index
    ports:
      - "6081:6081"
    healthcheck:
      test: ["CMD", "wget", "-q", "-O-", "http://localhost:6081/_health"]
      interval: 10s
      timeout: 5s
      retries: 5
    profiles:
      - backend

  api:
    image: ghcr.io/acoustid/acoustid-server:latest
    command: run api
    environment:
      ACOUSTID_CONFIG: /etc/acoustid/acoustid.conf
    volumes:
      - ./acoustid.conf:/etc/acoustid/acoustid.conf:ro
    secrets:
      - db_password
      - redis_password
    ports:
      - "5000:5000"
    depends_on:
      postgres:
        condition: service_healthy
      redis:
        condition: service_healthy
      nats:
        condition: service_healthy
      index:
        condition: service_healthy
    healthcheck:
      test: ["CMD", "wget", "-q", "-O-", "http://localhost:5000/_health"]
      interval: 30s
      timeout: 10s
      retries: 3
    profiles:
      - frontend

  web:
    image: ghcr.io/acoustid/acoustid-server:latest
    command: run web
    environment:
      ACOUSTID_CONFIG: /etc/acoustid/acoustid.conf
    volumes:
      - ./acoustid.conf:/etc/acoustid/acoustid.conf:ro
    secrets:
      - db_password
      - redis_password
    ports:
      - "5001:5001"
    depends_on:
      postgres:
        condition: service_healthy
      redis:
        condition: service_healthy
    healthcheck:
      test: ["CMD", "wget", "-q", "-O-", "http://localhost:5001/_health"]
      interval: 30s
      timeout: 10s
      retries: 3
    profiles:
      - frontend

  worker:
    image: ghcr.io/acoustid/acoustid-server:latest
    command: run worker
    environment:
      ACOUSTID_CONFIG: /etc/acoustid/acoustid.conf
    volumes:
      - ./acoustid.conf:/etc/acoustid/acoustid.conf:ro
    secrets:
      - db_password
      - redis_password
    depends_on:
      postgres:
        condition: service_healthy
      redis:
        condition: service_healthy
      nats:
        condition: service_healthy
      index:
        condition: service_healthy
    deploy:
      replicas: 2
    profiles:
      - backend

  cron:
    image: ghcr.io/acoustid/acoustid-server:latest
    command: run cron
    environment:
      ACOUSTID_CONFIG: /etc/acoustid/acoustid.conf
    volumes:
      - ./acoustid.conf:/etc/acoustid/acoustid.conf:ro
    secrets:
      - db_password
      - redis_password
    depends_on:
      postgres:
        condition: service_healthy
      redis:
        condition: service_healthy
    profiles:
      - backend

volumes:
  postgres_data:
  redis_data:
  nats_data:
  index_data:

secrets:
  db_password:
    file: ./secrets/db_password.txt
  redis_password:
    file: ./secrets/redis_password.txt
```
### Docker Compose Profiles
**Frontend Profile** (public-facing services):
```bash
docker compose --profile frontend up
```
Services: api, web
**Backend Profile** (background services):
```bash
docker compose --profile backend up
```
Services: index, worker, cron
**Full Stack**:
```bash
docker compose --profile frontend --profile backend up
```
**Tools Profile** (one-off commands):
```bash
docker compose run --rm tools python manage.py <command>
```
## PostgreSQL Setup
### Custom PostgreSQL Image
**Image**: `ghcr.io/acoustid/postgresql:17.4`
**Base**: `postgres:17.4`
**Dockerfile**: `docker/Dockerfile.postgres`
```dockerfile
FROM postgres:17.4
# intarray, pgcrypto, and cube are contrib modules that already ship with the
# official postgres image, so only the build toolchain has to be installed.
RUN apt-get update && apt-get install -y --no-install-recommends \
    build-essential \
    postgresql-server-dev-17 \
    && rm -rf /var/lib/apt/lists/*
# Build acoustid extension
COPY extensions/acoustid /build/acoustid
WORKDIR /build/acoustid
RUN make && make install
# Copy initialization scripts
COPY docker/init-db.sh /docker-entrypoint-initdb.d/
```
### Database Initialization
**Script**: `docker/init-db.sh`
```bash
#!/bin/bash
set -e
# Create multiple databases
for db in acoustid_app acoustid_fingerprint acoustid_ingest; do
psql -v ON_ERROR_STOP=1 --username "$POSTGRES_USER" <<-EOSQL
CREATE DATABASE $db;
\c $db
CREATE EXTENSION IF NOT EXISTS pgcrypto;
EOSQL
done
# Install extensions for fingerprint database
psql -v ON_ERROR_STOP=1 --username "$POSTGRES_USER" -d acoustid_fingerprint <<-EOSQL
CREATE EXTENSION IF NOT EXISTS intarray;
CREATE EXTENSION IF NOT EXISTS cube;
CREATE EXTENSION IF NOT EXISTS acoustid;
EOSQL
# Run migrations
cd /app
python manage.py db upgrade
```
### Database Configuration
**postgresql.conf** (custom settings):
```ini
# Connection settings
max_connections = 200
shared_buffers = 4GB
effective_cache_size = 12GB
# Write-ahead log
wal_level = replica
max_wal_size = 2GB
min_wal_size = 1GB
# Query planner
random_page_cost = 1.1 # SSD
effective_io_concurrency = 200
# Parallel query
max_parallel_workers_per_gather = 4
max_parallel_workers = 8
# Logging
log_min_duration_statement = 1000 # Log slow queries (>1s)
log_line_prefix = '%t [%p]: [%l-1] user=%u,db=%d,app=%a,client=%h '
# Autovacuum
autovacuum_max_workers = 4
autovacuum_naptime = 10s
```
## CI/CD Pipeline
### GitHub Actions Workflows
**File**: `.github/workflows/ci.yml`
```yaml
name: CI
on:
push:
branches: [main, develop]
pull_request:
branches: [main]
jobs:
lint:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: '3.12'
- name: Install uv
run: curl -LsSf https://astral.sh/uv/install.sh | sh
- name: Install dependencies
run: uv sync
- name: Run isort
run: uv run isort --check-only acoustid/
- name: Run black
run: uv run black --check acoustid/
- name: Run flake8
run: uv run flake8 acoustid/
- name: Run mypy
run: uv run mypy acoustid/
test:
runs-on: ubuntu-latest
services:
postgres:
image: ghcr.io/acoustid/postgresql:17.4
env:
POSTGRES_USER: acoustid
POSTGRES_PASSWORD: acoustid
POSTGRES_DB: acoustid_test
options: >-
--health-cmd pg_isready
--health-interval 10s
--health-timeout 5s
--health-retries 5
ports:
- 5432:5432
redis:
image: redis:7-alpine
options: >-
--health-cmd "redis-cli ping"
--health-interval 10s
--health-timeout 5s
--health-retries 5
ports:
- 6379:6379
nats:
image: nats:2-alpine
options: >-
--health-cmd "wget -q -O- http://localhost:8222/healthz"
--health-interval 10s
--health-timeout 5s
--health-retries 5
ports:
- 4222:4222
steps:
- uses: actions/checkout@v4
- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: '3.12'
- name: Install uv
run: curl -LsSf https://astral.sh/uv/install.sh | sh
- name: Install dependencies
run: uv sync
- name: Run migrations
run: uv run python manage.py db upgrade
env:
ACOUSTID_DATABASE_NAME: acoustid_test
ACOUSTID_DATABASE_USER: acoustid
ACOUSTID_DATABASE_PASSWORD: acoustid
ACOUSTID_DATABASE_HOST: localhost
- name: Run tests
run: uv run pytest -v --cov=acoustid --cov-report=xml
env:
ACOUSTID_DATABASE_NAME: acoustid_test
ACOUSTID_DATABASE_USER: acoustid
ACOUSTID_DATABASE_PASSWORD: acoustid
ACOUSTID_DATABASE_HOST: localhost
ACOUSTID_REDIS_HOST: localhost
ACOUSTID_NATS_SERVERS: nats://localhost:4222
- name: Upload coverage
uses: codecov/codecov-action@v4
with:
file: ./coverage.xml
build:
runs-on: ubuntu-latest
needs: [lint, test]
if: github.event_name == 'push'
steps:
- uses: actions/checkout@v4
- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v3
- name: Login to GitHub Container Registry
uses: docker/login-action@v3
with:
registry: ghcr.io
username: ${{ github.actor }}
password: ${{ secrets.GITHUB_TOKEN }}
- name: Build and push server image
uses: docker/build-push-action@v5
with:
context: .
file: docker/Dockerfile
push: true
tags: |
ghcr.io/acoustid/acoustid-server:latest
ghcr.io/acoustid/acoustid-server:${{ github.sha }}
cache-from: type=gha
cache-to: type=gha,mode=max
- name: Build and push index image
uses: docker/build-push-action@v5
with:
context: .
file: docker/Dockerfile.index
push: true
tags: |
ghcr.io/acoustid/acoustid-index:latest
ghcr.io/acoustid/acoustid-index:${{ github.sha }}
cache-from: type=gha
cache-to: type=gha,mode=max
```
### Linting Tools
**isort** (import sorting):
```toml
# pyproject.toml
[tool.isort]
profile = "black"
line_length = 100
```
**black** (code formatting):
```toml
# pyproject.toml
[tool.black]
line-length = 100
target-version = ['py312']
```
**flake8** (style checking):
```ini
# .flake8
[flake8]
max-line-length = 100
extend-ignore = E203, W503
exclude = .git,__pycache__,build,dist,.venv
```
**mypy** (type checking):
```toml
# pyproject.toml
[tool.mypy]
python_version = "3.12"
warn_return_any = true
warn_unused_configs = true
disallow_untyped_defs = true
```
### Testing
**pytest** configuration:
```toml
# pyproject.toml
[tool.pytest.ini_options]
testpaths = ["tests"]
python_files = ["test_*.py"]
python_classes = ["Test*"]
python_functions = ["test_*"]
addopts = "-v --strict-markers --tb=short"
markers = [
"slow: marks tests as slow (deselect with '-m \"not slow\"')",
"integration: marks tests as integration tests",
]
```
**Test Files** (24 total):
```
tests/
├── test_api_lookup.py
├── test_api_submit.py
├── test_fingerprint.py
├── test_indexclient.py
├── test_fpstore.py
├── test_data_account.py
├── test_data_fingerprint.py
├── test_data_track.py
├── test_data_musicbrainz.py
├── test_worker.py
├── test_cron.py
├── test_ratelimit.py
├── test_db.py
├── test_config.py
└── ...
```
**Test Fixtures**:
```python
# tests/conftest.py
import pytest

from acoustid.db import create_engine, create_session


@pytest.fixture
def with_database():
    """Provide test database session."""
    engine = create_engine('acoustid_test')
    session = create_session(engine)
    yield session
    session.rollback()
    session.close()


@pytest.fixture
def with_script():
    """Provide script context with database."""
    from acoustid.script import Script
    script = Script('test')
    script.setup()
    yield script
    script.teardown()


@pytest.fixture
def fingerprint_fixture():
    """Predefined test fingerprint."""
    return [123456789, 987654321, 456789123, ...]
```
## Infrastructure Requirements
### Minimum Requirements (Self-Hosted)
| Component | CPU | RAM | Disk | Notes |
|-----------|-----|-----|------|-------|
| PostgreSQL | 2 cores | 4 GB | 100 GB SSD | For small dataset |
| Redis | 1 core | 1 GB | 10 GB | Mostly in-memory |
| NATS | 1 core | 512 MB | 10 GB | JetStream storage |
| Index | 2 cores | 2 GB | 50 GB SSD | Depends on dataset size |
| API | 2 cores | 2 GB | 10 GB | Per instance |
| Worker | 2 cores | 2 GB | 10 GB | Per instance |
| **Total** | **10 cores** | **11.5 GB** | **190 GB** | Single-host deployment |
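
The totals row can be reproduced from the per-component figures. The following is a small arithmetic check, with the values copied from the table above; the dictionary and function names are illustrative only.

```python
# Per-component minimums copied from the table: (CPU cores, RAM in GB, disk in GB).
MINIMUMS = {
    "PostgreSQL": (2, 4.0, 100),
    "Redis": (1, 1.0, 10),
    "NATS": (1, 0.5, 10),
    "Index": (2, 2.0, 50),
    "API": (2, 2.0, 10),
    "Worker": (2, 2.0, 10),
}


def total_requirements(components):
    """Sum CPU, RAM, and disk across all components."""
    cpu = sum(spec[0] for spec in components.values())
    ram = sum(spec[1] for spec in components.values())
    disk = sum(spec[2] for spec in components.values())
    return cpu, ram, disk


cpu, ram, disk = total_requirements(MINIMUMS)
print(f"{cpu} cores, {ram} GB RAM, {disk} GB disk")
```

Note that the table budgets one API and one Worker instance and does not list the Web or Cron processes, so a deployment that runs those as well needs correspondingly more.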
### Production Requirements (acoustid.org scale)
| Component | CPU | RAM | Disk | Instances | Notes |
|-----------|-----|-----|------|-----------|-------|
| PostgreSQL | 16 cores | 64 GB | 2 TB NVMe | 1 primary + 2 replicas | High IOPS required |
| Redis | 4 cores | 16 GB | 100 GB SSD | 3 (cluster) | Persistence enabled |
| NATS | 4 cores | 8 GB | 500 GB SSD | 3 (cluster) | JetStream storage |
| Index | 8 cores | 16 GB | 1 TB NVMe | 4+ | Sharded by fingerprint ID |
| API | 4 cores | 8 GB | 50 GB | 4+ | Behind load balancer |
| Web | 2 cores | 4 GB | 50 GB | 2+ | Behind load balancer |
| Worker | 4 cores | 8 GB | 50 GB | 8+ | Auto-scaling |
| Cron | 2 cores | 4 GB | 50 GB | 1 | Leader election |
### Network Requirements
**Bandwidth**:
- API: 100 Mbps per instance (burst to 1 Gbps)
- Index: 1 Gbps (internal network)
- Database: 1 Gbps (internal network)
**Latency**:
- API to Index: <5ms
- API to Database: <5ms
- API to Redis: <1ms
## Monitoring and Observability
### Health Checks
**Endpoints**:
- `/_health`: Full health check (database write test)
- `/_health_ro`: Read-only health check
- `/_health_docker`: Minimal health check for Docker
**Kubernetes Probes**:
```yaml
livenessProbe:
  httpGet:
    path: /_health_docker
    port: 5000
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /_health_ro
    port: 5000
  initialDelaySeconds: 10
  periodSeconds: 5
  timeoutSeconds: 3
  failureThreshold: 2
```
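
Outside Kubernetes, the same endpoints can be polled from a script. Below is a minimal sketch using only the standard library; it assumes each endpoint answers HTTP 200 when healthy (the source does not state the response codes), and the base URL in the usage note is a placeholder.

```python
import urllib.error
import urllib.request

# Paths taken from the endpoint list above.
HEALTH_PATHS = ("/_health", "/_health_ro", "/_health_docker")


def check_endpoint(url, timeout=3.0, opener=urllib.request.urlopen):
    """Return True if the endpoint answers with HTTP 200 within the timeout."""
    try:
        with opener(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status raised as HTTPError.
        return False


def check_all(base_url, opener=urllib.request.urlopen):
    """Map each health path to a boolean result."""
    return {
        path: check_endpoint(base_url + path, opener=opener)
        for path in HEALTH_PATHS
    }
```

Calling `check_all("http://localhost:5000")` returns a dictionary such as `{"/_health": True, ...}`. The `opener` parameter exists only so the function can be exercised without a running server.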
### Metrics
**StatsD Metrics** (server):
- `api.requests_total{endpoint,method,status}`
- `api.request_duration_seconds{endpoint,method}`
- `api.handled_errors_total{error_code}`
- `api.unhandled_errors_total`
- `api.lookup.searches.total`
- `api.lookup.matches.total`
- `new_submissions`
**Prometheus Metrics** (index):
- `fpindex_search_duration_seconds`
- `fpindex_insert_duration_seconds`
- `fpindex_segment_count`
- `fpindex_memory_segment_size_bytes`
- `fpindex_file_segment_size_bytes`
- `fpindex_merge_duration_seconds`
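
For reference, this is roughly what emitting one of the StatsD metrics looks like on the wire. It is a sketch of the plain StatsD text format (`name:value|type`), not the server's actual instrumentation code. Plain StatsD has no label syntax, so how the `{endpoint,method,status}` dimensions are encoded is not shown here, and the host and port are assumptions.

```python
import socket


def statsd_line(name, value, metric_type, sample_rate=None):
    """Format one metric in the plain StatsD wire format: name:value|type[|@rate]."""
    line = f"{name}:{value}|{metric_type}"
    if sample_rate is not None:
        line += f"|@{sample_rate}"
    return line


def send_metric(line, host="127.0.0.1", port=8125):
    """Fire-and-forget a metric over UDP (8125 is the conventional StatsD port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))


# Counter increment for a lookup search; "c" marks a counter, "ms" a timer.
print(statsd_line("api.lookup.searches.total", 1, "c"))
```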
### Logging
**Log Levels**:
- `DEBUG`: Detailed diagnostic information
- `INFO`: General informational messages
- `WARNING`: Warning messages
- `ERROR`: Error messages
- `CRITICAL`: Critical errors
**Log Format**:
```
%(asctime)s [%(process)d] [%(levelname)s] %(name)s: %(message)s
```
**Environment Variables**:
```bash
ACOUSTID_LOGGING_LEVEL=INFO
ACOUSTID_LOGGING_LEVEL_ACOUSTID=DEBUG
ACOUSTID_LOGGING_LEVEL_SQLALCHEMY=WARNING
```
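
The format string and the environment variables above map naturally onto Python's standard `logging` module. The sketch below shows one way to apply them. Lower-casing the variable suffix to obtain the logger name (`ACOUSTID_LOGGING_LEVEL_SQLALCHEMY` to `sqlalchemy`) is an assumption for illustration, not taken from the server source.

```python
import logging

LOG_FORMAT = "%(asctime)s [%(process)d] [%(levelname)s] %(name)s: %(message)s"
PREFIX = "ACOUSTID_LOGGING_LEVEL"


def levels_from_env(environ):
    """Translate ACOUSTID_LOGGING_LEVEL[_<LOGGER>] variables into {logger: level}.

    The empty key stands for the root logger.
    """
    levels = {}
    for key, value in environ.items():
        if key == PREFIX:
            levels[""] = value.upper()
        elif key.startswith(PREFIX + "_"):
            levels[key[len(PREFIX) + 1:].lower()] = value.upper()
    return levels


def configure_logging(environ):
    """Install the documented format and apply the per-logger levels."""
    logging.basicConfig(format=LOG_FORMAT)
    for name, level in levels_from_env(environ).items():
        logging.getLogger(name).setLevel(level)
```

In an application this would be called once at startup as `configure_logging(os.environ)`.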
### Error Tracking
**Sentry Integration**:
```ini
# acoustid.conf
[sentry]
dsn = https://...@sentry.io/...
environment = production
traces_sample_rate = 0.1
```
**Configuration**:
```python
import sentry_sdk
from sentry_sdk.integrations.flask import FlaskIntegration

sentry_sdk.init(
    dsn=config.sentry.dsn,
    environment=config.sentry.environment,
    traces_sample_rate=config.sentry.traces_sample_rate,
    integrations=[FlaskIntegration()],
)
```
## Scaling Strategies
### Horizontal Scaling
**API/Web**:
- Add more instances behind load balancer
- No shared state (stateless)
- Session data in Redis if needed
**Workers**:
- Add more instances
- NATS distributes work automatically
- No coordination required
**Index**:
- Shard by fingerprint ID
- Consistent hashing for distribution
- NATS for cluster coordination
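
To illustrate what sharding by fingerprint ID with consistent hashing means in practice, here is a generic hash-ring sketch. It is not the algorithm used by acoustid-index (the source does not show that); the node names, the SHA-1 hash, and the number of virtual nodes are all assumptions.

```python
import bisect
import hashlib


class HashRing:
    """Consistent-hash ring with virtual nodes."""

    def __init__(self, nodes, replicas=64):
        self._ring = []  # sorted list of (hash, node) points
        for node in nodes:
            for i in range(replicas):
                self._ring.append((self._hash(f"{node}#{i}"), node))
        self._ring.sort()
        self._keys = [point for point, _ in self._ring]

    @staticmethod
    def _hash(value):
        digest = hashlib.sha1(value.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def node_for(self, fingerprint_id):
        """Return the node that owns this fingerprint ID."""
        idx = bisect.bisect_right(self._keys, self._hash(str(fingerprint_id)))
        # Wrap around to the first point when the key hashes past the last one.
        return self._ring[idx % len(self._ring)][1]


ring = HashRing(["index-0", "index-1", "index-2", "index-3"])
print(ring.node_for(123456))
```

The property that makes this attractive for an index cluster is that removing or adding a node only moves the IDs that belonged to that node, so the remaining shards keep their data.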
### Vertical Scaling
**Database**:
- Increase shared_buffers (25% of RAM)
- Increase effective_cache_size (50-75% of RAM)
- Add more CPU for parallel queries
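
The percentages above translate into concrete settings as follows. This is only the rule-of-thumb arithmetic, not a tuning tool; the function name is illustrative.

```python
def suggest_memory_settings(total_ram_gb):
    """Apply the rules of thumb: shared_buffers ~25% of RAM,
    effective_cache_size between 50% and 75% of RAM."""
    return {
        "shared_buffers_gb": total_ram_gb * 0.25,
        "effective_cache_size_gb": (total_ram_gb * 0.50, total_ram_gb * 0.75),
    }


# A 16 GB host gives 4 GB shared_buffers and 8-12 GB effective_cache_size.
print(suggest_memory_settings(16))
```

The sample `postgresql.conf` earlier in this document (`shared_buffers = 4GB`, `effective_cache_size = 12GB`) is consistent with a 16 GB host at the upper end of that range.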
**Index**:
- Increase thread count
- Larger memory segment
- Faster disk (NVMe)
### Caching
**Application-Level**:
- API key cache (in-memory, 60s TTL)
- Format lookup cache (permanent)
- MBID existence cache (Redis, 1h TTL)
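
An in-memory cache with a fixed TTL, like the 60-second API key cache, can be sketched in a few lines. This is a generic illustration, not the server's implementation; the class name and the `loader` callback are hypothetical.

```python
import time


class TTLCache:
    """Tiny in-memory cache whose entries expire after a fixed TTL."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # key -> (expires_at, value)

    def get(self, key, loader):
        """Return the cached value, calling loader(key) on a miss or after expiry."""
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]
        value = loader(key)
        self._entries[key] = (now + self._ttl, value)
        return value


# Usage: api_keys = TTLCache(60); api_keys.get("abc123", load_key_from_database)
```

The injectable `clock` makes expiry testable without sleeping. A production cache would also need a size bound and, in a multi-threaded server, a lock.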
**Database-Level**:
- Connection pooling
- Query result caching
- Materialized views
## Backup and Disaster Recovery
### Backup Strategy
**PostgreSQL**:
```bash
# Daily full backup
pg_dump -Fc acoustid_app > "acoustid_app_$(date +%Y%m%d).dump"
# Continuous WAL archiving is configured in postgresql.conf, not run from the shell:
#   archive_mode = on
#   archive_command = 'cp %p /backup/wal/%f'
```
**Index**:
```bash
# Daily snapshot
curl -X GET http://index:6081/fingerprints/_snapshot
# Backup segment files
rsync -av /var/lib/acoustid-index/ /backup/index/
```
**Redis**:
```conf
# redis.conf: RDB snapshots (automatic)
save 900 1
save 300 10
save 60 10000
# AOF (append-only file)
appendonly yes
appendfsync everysec
```
### Disaster Recovery
**Recovery Time Objective (RTO)**: 1 hour
**Recovery Point Objective (RPO)**: 5 minutes
**Recovery Steps**:
1. Restore PostgreSQL from latest backup
2. Replay WAL to point-in-time
3. Restore Redis from RDB/AOF
4. Restore index from snapshot
5. Rebuild index from database if needed
6. Restart all services
7. Verify health checks
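
A 5-minute RPO only holds if WAL segments actually reach the archive that often, so it is worth checking the age of the newest archived file. The sketch below assumes the archive is a plain directory such as the `/backup/wal` path used above; the function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from pathlib import Path

RPO = timedelta(minutes=5)  # recovery point objective stated above


def newest_mtime(directory):
    """Modification time (UTC) of the newest file in directory, or None if empty."""
    mtimes = [p.stat().st_mtime for p in Path(directory).iterdir() if p.is_file()]
    if not mtimes:
        return None
    return datetime.fromtimestamp(max(mtimes), tz=timezone.utc)


def rpo_violated(last_archived_at, now, rpo=RPO):
    """True if restoring right now would lose more than the RPO allows."""
    return now - last_archived_at > rpo


# Usage: rpo_violated(newest_mtime("/backup/wal"), datetime.now(timezone.utc))
```

Wiring the result into the alerting system turns the stated objective into something that is monitored continuously.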
@@ -0,0 +1,617 @@
# AcoustID System Evaluation
## Executive Summary
AcoustID is a mature, production-proven audio fingerprinting system that combines a Python-based web service with a cutting-edge Zig-based search index. The system has been running in production for over a decade, processing millions of fingerprint submissions and lookups. This evaluation assesses its strengths, weaknesses, integration potential, and relevance for metadata aggregation projects.
## Strengths
### 1. Open Source and Well-Licensed
**Advantage**: Complete transparency and flexibility
- **Server License**: MIT (permissive, commercial-friendly)
- **Index License**: GPL-3.0 (copyleft, but separate service)
- **Chromaprint**: MIT for its own code, though builds that bundle the FFmpeg resampler are effectively LGPL 2.1 (can be used independently)
- **No Vendor Lock-in**: Full control over deployment and modifications
**Impact**: Can be self-hosted, modified, or used as a reference implementation without licensing concerns. The GPL license on the index is acceptable since it runs as a separate service.
### 2. Production-Proven at Scale
**Advantage**: Battle-tested reliability
- **Years in Production**: 10+ years serving acoustid.org
- **Database Size**: Millions of fingerprints and tracks
- **Request Volume**: Handles high traffic with proven architecture
- **Real-World Data**: Extensive test coverage from actual usage
**Impact**: Low risk of fundamental design flaws. Known performance characteristics and scaling patterns.
### 3. Advanced Index Technology
**Advantage**: State-of-the-art search performance
- **LSM-Tree Architecture**: Efficient for write-heavy workloads
- **SIMD Compression**: StreamVByte for 4-8x compression with minimal CPU overhead
- **Low-Latency Search**: P50 latency around 5ms
- **Modern Language**: Zig offers manual memory management with optional runtime safety checks and no garbage collection overhead
**Impact**: The index is one of the most sophisticated open-source fingerprint search implementations available. Significantly faster than naive database-based approaches.
### 4. MusicBrainz Integration
**Advantage**: Direct access to comprehensive music metadata
- **Direct Database Access**: No API rate limits or latency
- **Rich Metadata**: Artist credits, releases, release groups, tracks
- **MBID Mapping**: Links audio fingerprints to canonical music identifiers
- **Redirect Resolution**: Handles merged entities automatically
**Impact**: Provides a complete solution for audio identification with metadata enrichment. Eliminates need for separate metadata lookup infrastructure.
### 5. Comprehensive API
**Advantage**: Well-designed public API
- **Multiple Endpoints**: Lookup, submit, status, user management
- **Batch Operations**: Up to 20 fingerprints per request
- **Flexible Metadata**: Configurable response detail levels
- **Multiple Formats**: JSON, XML, JSONP support
- **Rate Limiting**: Built-in protection against abuse
**Impact**: Easy to integrate as a client. Can also serve as a reference for building similar APIs.
### 6. Well-Structured Codebase
**Advantage**: Maintainable and extensible
- **Layered Architecture**: Clear separation of concerns
- **Service Pattern**: Business logic isolated from presentation
- **Type Hints**: Modern Python with type annotations
- **Comprehensive Tests**: 24 test files with good coverage
- **Documentation**: Inline comments and docstrings
**Impact**: Easy to understand, modify, and extend. Low barrier to contribution or customization.
### 7. Modern Infrastructure
**Advantage**: Uses current best practices
- **Docker Support**: Full containerization with multi-stage builds
- **Docker Compose**: Complete local development environment
- **CI/CD**: GitHub Actions for automated testing and deployment
- **Async Support**: Migration to Starlette for async operations
- **Message Queue**: NATS with JetStream for reliable async processing
**Impact**: Easy to deploy and operate. Follows industry standards for cloud-native applications.
## Weaknesses
### 1. Complex Deployment Requirements
**Disadvantage**: High operational overhead
**Required Services**:
- PostgreSQL 17.4 (4 separate databases)
- Custom PostgreSQL extension (acoustid)
- Redis (caching and rate limiting)
- NATS with JetStream (message queue)
- Zig-based index service
- Multiple Python processes (API, web, worker, cron)
**Minimum Resources**:
- 10+ CPU cores
- 11.5 GB RAM
- 190 GB disk space
**Impact**: Self-hosting requires significant infrastructure investment. Not suitable for small-scale deployments or embedded use cases. The custom PostgreSQL extension adds deployment complexity.
### 2. Custom PostgreSQL Extension Required
**Disadvantage**: Non-standard database setup
- **C Extension**: acoustid extension must be compiled and installed
- **Platform-Specific**: Requires PostgreSQL development headers
- **Maintenance Burden**: Must be updated for new PostgreSQL versions
- **Deployment Complexity**: Cannot use standard PostgreSQL images without modification
**Impact**: Increases deployment complexity and maintenance burden. Limits hosting options (managed PostgreSQL services won't work).
### 3. Transitioning Codebase
**Disadvantage**: Mixed old and new code
**Transition Areas**:
- Flask to Starlette (both frameworks present)
- Legacy TCP index protocol to HTTP (both protocols supported)
- Synchronous to asynchronous operations (mixed patterns)
**Impact**: Code complexity from supporting both old and new approaches. Potential for bugs at transition boundaries. Documentation may be inconsistent.
### 4. Legacy Code Paths
**Disadvantage**: Technical debt
**Legacy Components**:
- Old API v1 endpoints (deprecated but still present)
- TCP-based index client (being phased out)
- Synchronous database operations (alongside async)
- PUID support (MusicIP legacy)
**Impact**: Increased codebase size and complexity. Potential security or performance issues in unmaintained code paths.
### 5. Zig Index Maturity
**Disadvantage**: Relatively new implementation
- **Language Maturity**: Zig is pre-1.0 (currently 0.11.0)
- **Ecosystem**: Limited third-party libraries
- **Community**: Smaller than established languages
- **Breaking Changes**: Zig language still evolving
- **Debugging Tools**: Less mature than C/C++/Rust
**Impact**: Potential for language-level breaking changes. Smaller pool of developers familiar with Zig. May require more effort to debug or extend.
### 6. Limited Documentation
**Disadvantage**: Steep learning curve
**Documentation Gaps**:
- No comprehensive architecture documentation (until this analysis)
- Limited API examples beyond basic usage
- Index protocol not formally documented
- Deployment guide assumes Docker knowledge
- No performance tuning guide
**Impact**: Difficult for newcomers to understand system internals. Trial and error required for optimization and troubleshooting.
### 7. Tight MusicBrainz Coupling
**Disadvantage**: Assumes MusicBrainz availability
- **Direct Database Dependency**: Requires MusicBrainz database replica
- **Schema Coupling**: Queries specific MusicBrainz table structures
- **No Abstraction**: MusicBrainz logic embedded throughout codebase
- **Alternative Sources**: Difficult to use other metadata providers
**Impact**: Cannot easily substitute alternative metadata sources. Requires maintaining MusicBrainz database replica for full functionality.
## Integration Considerations
### As a Public API Client
**Recommendation**: Best approach for most use cases
**Advantages**:
- No infrastructure to maintain
- Proven reliability (acoustid.org uptime)
- Free for reasonable usage
- Immediate availability
**Disadvantages**:
- Rate limits (3 req/s default, 10 req/s with API key)
- Network latency
- Dependency on external service
- No control over data or features
**Best For**:
- Small to medium scale applications
- Prototyping and development
- Applications with intermittent fingerprinting needs
- Projects without infrastructure budget
**Implementation**:
```python
import requests


def lookup_fingerprint(fingerprint, duration):
    response = requests.post(
        'https://api.acoustid.org/v2/lookup',
        data={
            'client': 'YOUR_API_KEY',
            'duration': duration,
            'fingerprint': fingerprint,
            # Space-separated: a literal '+' in form data is sent as %2B, not as a space
            'meta': 'recordings releases',
        },
        timeout=10,
    )
    response.raise_for_status()
    return response.json()
```
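
A client has to stay under the rate limits listed above. One simple way is to space requests out on the client side. The following is a minimal sketch; the `Throttle` class is hypothetical and not part of any AcoustID client library.

```python
import time


class Throttle:
    """Space calls at least 1/rate seconds apart."""

    def __init__(self, rate_per_second, clock=time.monotonic, sleep=time.sleep):
        self._interval = 1.0 / rate_per_second
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = float("-inf")  # the first call never waits

    def wait(self):
        """Block until the next call is allowed, then reserve the following slot."""
        now = self._clock()
        if now < self._next_allowed:
            self._sleep(self._next_allowed - now)
            now = self._next_allowed
        self._next_allowed = now + self._interval


# Usage: create throttle = Throttle(3) once, then call throttle.wait()
# immediately before every request to the public API.
```

This smooths the request rate rather than allowing bursts, which is the conservative choice when the server-side limiter's window is unknown.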
### Self-Hosted Deployment
**Recommendation**: Only for large-scale or specialized needs
**Advantages**:
- Full control over data and features
- No rate limits
- Low latency (local network)
- Customization possible
- Data privacy
**Disadvantages**:
- High infrastructure cost
- Operational complexity
- Maintenance burden
- Requires expertise
**Best For**:
- Large-scale commercial applications
- Privacy-sensitive use cases
- Custom fingerprinting algorithms
- Research and development
**Minimum Viable Deployment**:
```yaml
# docker-compose.yml (simplified)
services:
  postgres:
    image: ghcr.io/acoustid/postgresql:17.4
    volumes:
      - postgres_data:/var/lib/postgresql/data
  redis:
    image: redis:7-alpine
  nats:
    image: nats:2-alpine
    command: -js
  index:
    image: ghcr.io/acoustid/acoustid-index:latest
    volumes:
      - index_data:/var/lib/acoustid-index
  api:
    image: ghcr.io/acoustid/acoustid-server:latest
    command: run api
    depends_on: [postgres, redis, nats, index]
# Named volumes must be declared at the top level
volumes:
  postgres_data:
  index_data:
```
### Chromaprint Library Only
**Recommendation**: For custom fingerprinting without AcoustID infrastructure
**Advantages**:
- Minimal dependencies (just Chromaprint library)
- Full control over fingerprint storage and matching
- No network dependency
- Lightweight
**Disadvantages**:
- Must implement own matching algorithm
- No MusicBrainz integration
- No existing fingerprint database
- Higher development effort
**Best For**:
- Custom audio analysis applications
- Offline fingerprinting
- Embedded systems
- Research projects
**Implementation**:
```python
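import chromaprint  # ctypes bindings distributed with the pyacoustid package

# sample_rate, num_channels, and audio_data (raw 16-bit PCM) come from your own decoder
# Generate fingerprint
fingerprinter = chromaprint.Fingerprinter()
fingerprinter.start(sample_rate, num_channels)
fingerprinter.feed(audio_data)
fingerprint = fingerprinter.finish()  # finish() returns the encoded fingerprint
# Store and match fingerprints yourself
# (requires custom implementation)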
```
### Hybrid Approach
**Recommendation**: Best of both worlds for growing applications
**Strategy**:
1. Start with public API for lookups
2. Use Chromaprint library for fingerprint generation
3. Store fingerprints locally for future use
4. Migrate to self-hosted when scale justifies cost
**Advantages**:
- Low initial cost
- Gradual migration path
- Flexibility to optimize later
- Reduced vendor lock-in
**Implementation**:
```python
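import acoustid  # pyacoustid, used here only for local fingerprint generation


class HybridFingerprintService:
    def __init__(self):
        # LocalFingerprintDB and AcoustIDClient are hypothetical placeholders
        self.local_db = LocalFingerprintDB()
        self.acoustid_client = AcoustIDClient()

    def identify(self, audio_file):
        # Generate fingerprint locally (returns duration and fingerprint)
        duration, fingerprint = acoustid.fingerprint_file(audio_file)
        # Check local database first
        match = self.local_db.search(fingerprint)
        if match:
            return match
        # Fall back to AcoustID API (the lookup endpoint needs the duration too)
        result = self.acoustid_client.lookup(fingerprint, duration)
        # Cache result locally
        if result:
            self.local_db.store(fingerprint, result)
        return result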
```
## Relevance for Metadata Aggregation
### High Relevance Scenarios
**1. Audio File Identification**
AcoustID excels at identifying audio files without metadata:
- **Use Case**: User uploads audio file with missing tags
- **Solution**: Generate fingerprint, lookup via AcoustID, retrieve MBIDs
- **Benefit**: Accurate identification even with transcoding or quality differences
**2. Duplicate Detection**
Fingerprints enable perceptual duplicate detection:
- **Use Case**: Detect duplicate tracks in large music library
- **Solution**: Fingerprint all tracks, compare for similarity
- **Benefit**: Finds duplicates even with different encodings or slight edits
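
Comparing fingerprints for similarity usually means counting differing bits between raw fingerprints (the integer arrays printed by `fpcalc -raw`). The sketch below is a deliberately simplified version: it compares the two sequences position by position and does not search for the best time offset, which a real matcher would need. Any duplicate threshold is an assumption to be tuned on your own data.

```python
def bit_similarity(fp_a, fp_b):
    """Fraction of matching bits between two raw fingerprints (lists of 32-bit ints).

    Only the overlapping prefix is compared; no offset alignment is attempted.
    """
    length = min(len(fp_a), len(fp_b))
    if length == 0:
        return 0.0
    # XOR leaves a 1 wherever the two values differ; the mask handles signed inputs.
    differing = sum(
        bin((a ^ b) & 0xFFFFFFFF).count("1") for a, b in zip(fp_a, fp_b)
    )
    return 1.0 - differing / (32 * length)


# Two files could be flagged as likely duplicates above some tuned threshold,
# for example bit_similarity(fp1, fp2) > 0.9 (the 0.9 is a placeholder).
```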
**3. MBID Enrichment**
Links audio files to canonical MusicBrainz identifiers:
- **Use Case**: Enrich audio metadata with MusicBrainz data
- **Solution**: Fingerprint -> AcoustID -> MBID -> MusicBrainz metadata
- **Benefit**: Access to comprehensive, community-maintained metadata
**4. Quality Verification**
Verify metadata accuracy:
- **Use Case**: Check if file metadata matches actual audio content
- **Solution**: Compare fingerprint-based identification with existing tags
- **Benefit**: Detect mislabeled or corrupted files
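
Comparing the identification result with existing tags needs some normalization, because tags rarely match MusicBrainz names character for character. Below is a minimal sketch, assuming both sides are available as dictionaries with `title` and `artist` keys (a hypothetical shape chosen for the example).

```python
import unicodedata


def normalize(text):
    """Case-fold, strip accents and punctuation, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", text.casefold())
    kept = "".join(
        ch if ch.isalnum() else " "
        for ch in decomposed
        if not unicodedata.combining(ch)
    )
    return " ".join(kept.split())


def tags_match(tags, identified, fields=("title", "artist")):
    """True if every compared field agrees after normalization."""
    return all(
        normalize(tags.get(field, "")) == normalize(identified.get(field, ""))
        for field in fields
    )
```

Exact comparison after normalization is strict. A fuzzy string distance would tolerate things like "feat." suffixes, at the cost of more false agreement.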
### Medium Relevance Scenarios
**5. Playlist Generation**
Acoustic similarity for recommendations:
- **Use Case**: Generate playlists of similar-sounding tracks
- **Solution**: Compare fingerprints for acoustic similarity
- **Benefit**: Recommendations based on actual audio, not just metadata
**6. Copyright Detection**
Identify copyrighted content:
- **Use Case**: Detect copyrighted music in user uploads
- **Solution**: Fingerprint uploads, match against known copyrighted works
- **Benefit**: Automated content moderation
### Low Relevance Scenarios
**7. Real-Time Audio Recognition**
AcoustID is not optimized for real-time use:
- **Limitation**: Requires full audio file or significant portion
- **Alternative**: Shazam-style services designed for short audio snippets
- **Workaround**: Use Chromaprint with custom matching for real-time needs
**8. Music Recommendation**
Limited to acoustic similarity:
- **Limitation**: No semantic understanding of music (genre, mood, etc.)
- **Alternative**: Dedicated recommendation engines (Spotify API, Last.fm)
- **Workaround**: Combine with metadata-based recommendation
## Comparison with Alternatives
### vs. Shazam/ACRCloud (Commercial)
| Feature | AcoustID | Shazam/ACRCloud |
|---------|----------|-----------------|
| License | Open source (MIT/GPL) | Proprietary |
| Cost | Free (self-host or API) | Paid API |
| Database Size | Community-driven | Commercial catalog |
| Real-Time | No | Yes |
| Accuracy | High | Very high |
| Customization | Full | Limited |
**Verdict**: AcoustID better for self-hosted, customizable solutions. Shazam better for real-time recognition and commercial catalog coverage.
### vs. Echoprint (Open Source)
| Feature | AcoustID | Echoprint |
|---------|----------|-----------|
| Maintenance | Active | Abandoned (2014) |
| Index Technology | Modern (LSM-tree, SIMD) | Legacy |
| Language | Python + Zig | Python + C++ |
| MusicBrainz | Integrated | No |
| Community | Active | Dead |
**Verdict**: AcoustID is the clear winner. Echoprint is no longer maintained.
### vs. Chromaprint Alone
| Feature | AcoustID | Chromaprint Only |
|---------|----------|------------------|
| Fingerprint Generation | Yes | Yes |
| Fingerprint Matching | Yes | No (DIY) |
| Metadata | MusicBrainz | No |
| Infrastructure | Required | Minimal |
| Development Effort | Low | High |
**Verdict**: AcoustID provides complete solution. Chromaprint alone requires significant custom development.
## Recommendations
### For Small Projects (< 10k lookups/month)
**Recommendation**: Use public AcoustID API
**Rationale**:
- Free tier sufficient
- No infrastructure cost
- Immediate availability
- Proven reliability
**Implementation**:
```python
# Simple integration
import acoustid

results = acoustid.match(api_key, audio_file)
for score, recording_id, title, artist in results:
    print(f"{title} by {artist} (score: {score})")
```
### For Medium Projects (10k-1M lookups/month)
**Recommendation**: Hybrid approach
**Rationale**:
- Public API for initial lookups
- Local caching for repeated queries
- Gradual migration path to self-hosted
- Cost-effective scaling
**Implementation**:
- Use public API with caching layer
- Store fingerprints locally
- Monitor usage and costs
- Migrate to self-hosted when justified
### For Large Projects (> 1M lookups/month)
**Recommendation**: Self-hosted deployment
**Rationale**:
- Cost savings at scale
- Full control and customization
- Low latency
- No rate limits
**Implementation**:
- Deploy full stack (PostgreSQL, Redis, NATS, Index, API)
- Import existing fingerprint database
- Implement monitoring and alerting
- Plan for high availability
### For Research Projects
**Recommendation**: Chromaprint library + custom matching
**Rationale**:
- Full control over algorithms
- No external dependencies
- Flexibility for experimentation
- Academic freedom
**Implementation**:
- Use Chromaprint for fingerprint generation
- Implement custom similarity metrics
- Experiment with index structures
- Publish findings
### For Privacy-Sensitive Applications
**Recommendation**: Self-hosted deployment
**Rationale**:
- No data sent to third parties
- Full control over data retention
- Compliance with privacy regulations
- Audit trail
**Implementation**:
- Deploy on-premises or private cloud
- Implement access controls
- Enable audit logging
- Regular security updates
## Future Considerations
### Potential Improvements
**1. Simplified Deployment**
- Single-binary deployment option
- Embedded database (SQLite) for small-scale use
- Optional components (make MusicBrainz integration optional)
**2. Better Documentation**
- Architecture guide (this document is a start)
- Performance tuning guide
- Troubleshooting guide
- Video tutorials
**3. Alternative Metadata Sources**
- Plugin system for metadata providers
- Support for Discogs, Spotify, etc.
- Configurable metadata priority
**4. Enhanced API**
- GraphQL endpoint
- WebSocket for real-time updates
- Bulk operations API
- Admin API for self-hosted instances
**5. Index Improvements**
- Distributed index with automatic sharding
- Replication for high availability
- Incremental backups
- Query result caching
### Technology Evolution
**Zig Maturity**:
- Monitor Zig 1.0 release
- Evaluate stability and ecosystem growth
- Consider Rust alternative if Zig adoption stalls
**Async Migration**:
- Complete Flask to Starlette transition
- Remove legacy synchronous code paths
- Optimize for async/await patterns
**Cloud-Native**:
- Kubernetes deployment manifests
- Helm charts
- Operator for automated management
- Service mesh integration
## Conclusion
AcoustID is a **highly capable, production-ready audio fingerprinting system** with significant strengths in accuracy, performance, and MusicBrainz integration. The open-source license and mature codebase make it an excellent choice for projects requiring audio identification.
**Key Takeaways**:
1. **Use the public API** for most small to medium projects
2. **Self-host only when scale justifies** the operational complexity
3. **Chromaprint library alone** is viable for custom implementations
4. **MusicBrainz integration** is a major value-add for metadata enrichment
5. **Deployment complexity** is the main barrier to adoption
**Overall Assessment**: **Highly Recommended** for metadata aggregation projects that need audio fingerprinting, with the caveat that self-hosting requires significant infrastructure investment.
**Rating**: 8.5/10
**Strengths**: Production-proven, open source, excellent MusicBrainz integration, modern index technology
**Weaknesses**: Complex deployment, custom PostgreSQL extension, transitioning codebase
**Best Use Case**: Audio file identification and MBID enrichment via public API or self-hosted deployment at scale
@@ -0,0 +1,768 @@
# AcoustID Integrations
## Overview
AcoustID integrates with multiple external services and libraries to provide comprehensive audio fingerprinting and metadata enrichment. The system's architecture separates concerns between fingerprint generation (Chromaprint), fingerprint indexing (acoustid-index), metadata enrichment (MusicBrainz), and supporting infrastructure (Redis, NATS).
## MusicBrainz Integration
### Connection Method
**Type**: Direct PostgreSQL database connection (NOT REST API)
**Database**: `musicbrainz` (read-only replica)
**Access**: Separate database connection pool
**Configuration** (`acoustid.conf`):
```ini
[musicbrainz]
host = musicbrainz-db.example.com
port = 5432
name = musicbrainz_db
user = acoustid_readonly
password_file = /run/secrets/mb_password
```
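The `password_file` key above points at a file instead of holding the secret inline. A minimal sketch of how such a key could be resolved at load time follows; the helper names `resolve_secrets` and `load_section` are hypothetical, and the rule "strip the `_file` suffix and read the file" is an assumption, not the server's confirmed behavior.

```python
import configparser
import tempfile
from pathlib import Path


def resolve_secrets(section: dict) -> dict:
    """Replace every `<key>_file` entry with `<key>` read from that file."""
    resolved = {}
    for key, value in section.items():
        if key.endswith("_file"):
            # Assumption: the file contains only the secret, maybe with a trailing newline
            resolved[key[: -len("_file")]] = Path(value).read_text().strip()
        else:
            resolved[key] = value
    return resolved


def load_section(conf_text: str, name: str) -> dict:
    """Parse INI text and return one section with its secrets resolved."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(conf_text)
    return resolve_secrets(dict(parser[name]))


# Usage: write a throwaway secret file and point the config at it
with tempfile.TemporaryDirectory() as tmp:
    secret_path = Path(tmp) / "mb_password"
    secret_path.write_text("s3cret\n")
    conf_text = (
        "[musicbrainz]\n"
        "host = musicbrainz-db.example.com\n"
        "port = 5432\n"
        "user = acoustid_readonly\n"
        f"password_file = {secret_path}\n"
    )
    settings = load_section(conf_text, "musicbrainz")
```

Keeping the secret out of the main config file means the config can be committed or logged without leaking credentials.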
**File**: `acoustid/data/musicbrainz.py`
### Queried Tables
The integration queries the following MusicBrainz tables directly:
| Table | Purpose | Columns Used |
|-------|---------|--------------|
| `artist_credit` | Artist information | `id`, `name`, `artist_count` |
| `artist_credit_name` | Artist credit details | `artist_credit`, `position`, `artist`, `name`, `join_phrase` |
| `artist` | Artist entities | `id`, `gid`, `name`, `sort_name` |
| `recording` | Recording metadata | `id`, `gid`, `name`, `length`, `artist_credit`, `comment` |
| `release` | Release information | `id`, `gid`, `name`, `artist_credit`, `release_group`, `status`, `packaging`, `barcode` |
| `release_group` | Release group data | `id`, `gid`, `name`, `artist_credit`, `type`, `comment` |
| `track` | Track listings | `id`, `gid`, `recording`, `medium`, `position`, `number`, `name`, `length`, `artist_credit` |
| `medium` | Medium information | `id`, `release`, `position`, `format`, `track_count` |
| `release_country` | Release countries | `release`, `country`, `date_year`, `date_month`, `date_day` |
### Query Patterns
**Fetch Recording by MBID**:
```python
def get_recording_by_mbid(db, mbid):
    """Fetch recording with artist credits and releases."""
    query = """
        SELECT
            r.gid AS recording_mbid,
            r.name AS recording_title,
            r.length AS duration,
            ac.name AS artist_credit_name,
            array_agg(DISTINCT rel.gid) AS release_mbids
        FROM recording r
        JOIN artist_credit ac ON r.artist_credit = ac.id
        LEFT JOIN track t ON t.recording = r.id
        LEFT JOIN medium m ON t.medium = m.id
        LEFT JOIN release rel ON m.release = rel.id
        WHERE r.gid = :mbid
        GROUP BY r.gid, r.name, r.length, ac.name
    """
    return db.execute(query, {'mbid': mbid}).fetchone()
```
**Fetch Release with Tracks**:
```python
def get_release_with_tracks(db, release_mbid):
    """Fetch complete release with all tracks."""
    query = """
        SELECT
            rel.gid AS release_mbid,
            rel.name AS release_title,
            rel.barcode,
            rc.country,
            rc.date_year,
            rc.date_month,
            rc.date_day,
            m.position AS medium_position,
            m.format AS medium_format,
            t.position AS track_position,
            t.number AS track_number,
            t.name AS track_title,
            rec.gid AS recording_mbid,
            ac.name AS artist_credit
        FROM release rel
        LEFT JOIN release_country rc ON rel.id = rc.release
        LEFT JOIN medium m ON rel.id = m.release
        LEFT JOIN track t ON m.id = t.medium
        LEFT JOIN recording rec ON t.recording = rec.id
        LEFT JOIN artist_credit ac ON rec.artist_credit = ac.id
        WHERE rel.gid = :mbid
        ORDER BY m.position, t.position
    """
    return db.execute(query, {'mbid': release_mbid}).fetchall()
```
**Fetch Artist Credits**:
```python
def get_artist_credit(db, artist_credit_id):
    """Fetch artist credit with all artists."""
    query = """
        SELECT
            acn.position,
            a.gid AS artist_mbid,
            a.name AS artist_name,
            a.sort_name AS artist_sort_name,
            acn.name AS credited_name,
            acn.join_phrase
        FROM artist_credit_name acn
        JOIN artist a ON acn.artist = a.id
        WHERE acn.artist_credit = :ac_id
        ORDER BY acn.position
    """
    return db.execute(query, {'ac_id': artist_credit_id}).fetchall()
```
### MBID Redirect Resolution
MusicBrainz uses MBID redirects when entities are merged. AcoustID resolves these automatically.
**File**: `acoustid/data/musicbrainz.py`
```python
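def resolve_recording_mbid(db, mbid):
    """Resolve a recording MBID through the redirect table.

    In the MusicBrainz schema, `new_id` is the internal integer ID of the
    surviving recording, not an MBID, so it is joined back to `recording`
    to obtain the current MBID.
    """
    query = """
        SELECT r.gid
        FROM recording_gid_redirect rr
        JOIN recording r ON r.id = rr.new_id
        WHERE rr.gid = :mbid
    """
    result = db.execute(query, {'mbid': mbid}).fetchone()
    # No redirect row means the MBID is already current (or unknown)
    return result[0] if result else mbid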
```
**Redirect Tables Used**:
- `recording_gid_redirect`
- `release_gid_redirect`
- `release_group_gid_redirect`
- `artist_gid_redirect`
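The four redirect tables share one shape, so a single query builder can cover them. The sketch below is illustrative only: `build_redirect_query` is a hypothetical helper, and it assumes each redirect table has a `gid` column (the old MBID) and a `new_id` column referencing the entity's integer `id`, as in the public MusicBrainz schema.

```python
REDIRECT_TABLES = {
    "recording": "recording_gid_redirect",
    "release": "release_gid_redirect",
    "release_group": "release_group_gid_redirect",
    "artist": "artist_gid_redirect",
}


def build_redirect_query(entity: str) -> str:
    """Return SQL that maps an old MBID to the current MBID for `entity`."""
    try:
        redirect_table = REDIRECT_TABLES[entity]
    except KeyError:
        raise ValueError(f"no redirect table for entity type {entity!r}") from None
    # Table names come from the fixed mapping above, never from user input,
    # so interpolating them is safe; the MBID itself stays a bound parameter.
    return (
        f"SELECT e.gid FROM {redirect_table} r "
        f"JOIN {entity} e ON e.id = r.new_id "
        "WHERE r.gid = :mbid"
    )


release_group_sql = build_redirect_query("release_group")
```

Restricting the entity name to a fixed mapping keeps the string interpolation from becoming an injection vector.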
### Metadata Enrichment
When a lookup request includes metadata flags, AcoustID fetches additional data from MusicBrainz:
**Metadata Levels**:
| Flag | Data Fetched | Query Complexity |
|------|--------------|------------------|
| `recordingids` | Recording MBIDs only | Low (join only) |
| `recordings` | Full recording metadata | Medium (artist credits) |
| `releaseids` | Release MBIDs only | Low (join only) |
| `releases` | Full release metadata | High (tracks, mediums, countries) |
| `releasegroupids` | Release group MBIDs only | Low (join only) |
| `releasegroups` | Full release group metadata | Medium (artist credits) |
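One way to read the table: each entity has an "IDs only" flag and a "full" flag, and the full flag wins when both are present. The following sketch turns a `meta` string into an enrichment plan. It assumes the flags arrive as a space-separated string, and `plan_enrichment` is a hypothetical name rather than a function from the server.

```python
# Maps each "full metadata" flag to its cheaper "IDs only" counterpart
FULL_TO_IDS = {
    "recordings": "recordingids",
    "releases": "releaseids",
    "releasegroups": "releasegroupids",
}
KNOWN_FLAGS = set(FULL_TO_IDS) | set(FULL_TO_IDS.values())


def plan_enrichment(meta: str) -> dict:
    """Return {entity: "full" | "ids"} for the entities the client asked for."""
    flags = {flag for flag in meta.split() if flag in KNOWN_FLAGS}
    plan = {}
    for full, ids_only in FULL_TO_IDS.items():
        if full in flags:
            plan[full] = "full"  # needs the heavier joins
        elif ids_only in flags:
            plan[full] = "ids"   # a plain join on the ID column is enough
    return plan


example_plan = plan_enrichment("recordings releaseids")
```

Entities missing from the plan need no MusicBrainz query at all, which is the cheapest path through the lookup.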
**Example Enriched Response**:
```json
{
"recordings": [
{
"id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
"title": "Example Song",
"duration": 240000,
"artists": [
{
"id": "12345678-90ab-cdef-1234-567890abcdef",
"name": "Example Artist",
"joinphrase": " & "
}
],
"releases": [
{
"id": "abcdef12-3456-7890-abcd-ef1234567890",
"title": "Example Album",
"country": "US",
"date": {
"year": 2020,
"month": 5,
"day": 15
},
"track_count": 12,
"medium_count": 1,
"releasegroup": {
"id": "fedcba98-7654-3210-fedc-ba9876543210",
"type": "Album"
}
}
]
}
]
}
```
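A client usually flattens this structure into display strings. The sketch below assumes the response shape shown above and that each artist's `joinphrase` is appended directly after that artist's name; `artist_credit` and `summarize` are hypothetical helper names, and the sample payload is invented for illustration.

```python
import json


def artist_credit(recording: dict) -> str:
    """Join artist names using each artist's own join phrase."""
    parts = []
    for artist in recording.get("artists", []):
        parts.append(artist["name"])
        parts.append(artist.get("joinphrase", ""))
    return "".join(parts)


def summarize(response: dict) -> list:
    """Return (title, artist credit, release titles) for every recording."""
    rows = []
    for recording in response.get("recordings", []):
        releases = [rel["title"] for rel in recording.get("releases", [])]
        rows.append((recording["title"], artist_credit(recording), releases))
    return rows


sample = json.loads("""
{
  "recordings": [
    {
      "title": "Example Song",
      "artists": [
        {"name": "Example Artist", "joinphrase": " & "},
        {"name": "Other Artist"}
      ],
      "releases": [{"title": "Example Album"}]
    }
  ]
}
""")
summary = summarize(sample)
```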
### Performance Considerations
**Connection Pooling**:
- Separate pool for MusicBrainz database
- Pool size: 10 connections (configurable)
- Pool recycle: 3600 seconds
**Query Optimization**:
- Indexes on `gid` columns (MusicBrainz maintains these)
- Batch queries when possible
- Limit joins to requested metadata only
**Caching**:
- Unknown MBID cache (Redis, 1 hour TTL)
- Avoids repeated queries for non-existent MBIDs
**Fallback**:
- If MusicBrainz database unavailable, return AcoustID data only
- Graceful degradation (no metadata enrichment)
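The fallback behavior can be sketched as a wrapper that keeps the AcoustID fields and skips enrichment when the metadata source fails. Everything here is hypothetical: `MetadataUnavailable`, `enrich_results`, and the `fetch_recordings` callback are stand-ins for whatever the server actually uses.

```python
import logging

logger = logging.getLogger("acoustid.enrichment")


class MetadataUnavailable(Exception):
    """Hypothetical error raised when the MusicBrainz database cannot be reached."""


def enrich_results(results, fetch_recordings):
    """Attach recordings to each result, degrading to AcoustID-only data on failure."""
    enriched = []
    for result in results:
        item = dict(result)  # always keep the AcoustID-only fields
        try:
            item["recordings"] = fetch_recordings(result["id"])
        except MetadataUnavailable as exc:
            logger.warning("MusicBrainz unavailable, skipping enrichment: %s", exc)
        enriched.append(item)
    return enriched


def broken_fetch(track_id):
    """Simulates an unreachable MusicBrainz database."""
    raise MetadataUnavailable("connection refused")


degraded = enrich_results([{"id": "track-1", "score": 0.97}], broken_fetch)
```

Only the specific availability error is caught, so programming errors in the enrichment code still surface instead of being silently swallowed.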
## Chromaprint Integration
### Library Information
**Name**: Chromaprint
**Version**: Built from source (commit `41a3e8fb`)
**License**: MIT
**Language**: C++
**Wrapper**: acoustid-ext (C extension for Python)
**Repository**: https://github.com/acoustid/chromaprint
### Build Process
**Dockerfile** (`docker/Dockerfile`):
```dockerfile
# Stage 1: Build Chromaprint
FROM ubuntu:24.04 AS chromaprint-build
RUN apt-get update && apt-get install -y \
    git cmake build-essential libfftw3-dev
WORKDIR /build
RUN git clone https://github.com/acoustid/chromaprint.git && \
    cd chromaprint && \
    git checkout 41a3e8fb && \
    cmake -DCMAKE_BUILD_TYPE=Release . && \
    make && \
    make install
# Stage 2: Build acoustid-ext
FROM ubuntu:24.04 AS builder
COPY --from=chromaprint-build /usr/local/lib/libchromaprint.so* /usr/local/lib/
COPY --from=chromaprint-build /usr/local/include/chromaprint.h /usr/local/include/
RUN pip install acoustid-ext
```
### Python Extension (acoustid-ext)
**Package**: `acoustid-ext`
**File**: `acoustid/fingerprint.py`
**Functions Exposed**:
```python
from acoustid_ext import (
decode_fingerprint,
encode_fingerprint,
compress_fingerprint,
decompress_fingerprint,
fingerprint_compare
)
```
**Function Signatures**:
| Function | Input | Output | Purpose |
|----------|-------|--------|---------|
| `decode_fingerprint(data)` | bytes/str | list[int] | Decode base64/compressed fingerprint |
| `encode_fingerprint(hashes)` | list[int] | str | Encode fingerprint to base64 |
| `compress_fingerprint(hashes)` | list[int] | bytes | Compress fingerprint (zstd) |
| `decompress_fingerprint(data)` | bytes | list[int] | Decompress fingerprint |
| `fingerprint_compare(fp1, fp2)` | list[int], list[int] | float | Compare similarity (0.0-1.0) |
### Fingerprint Format
**Raw Format** (Chromaprint output):
- Array of 32-bit unsigned integers
- Each integer represents a hash of audio features
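- Typical length: roughly 8 hashes per second of analyzed audio, so a fingerprint of the first 120 seconds (the `fpcalc` default) contains on the order of 900-1000 hashes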
**Compressed Format** (for transmission):
- Base64-encoded compressed data
- Compression: zstd or custom Chromaprint compression
- Typical size: a few kilobytes of text for a 120-second fingerprint
**Example**:
```python
# Raw fingerprint
fingerprint = [123456789, 987654321, 456789123, ...]
# Encoded (base64)
encoded = "AQADtNGiJEqUHUemR..."
# Compressed (bytes)
compressed = b'\x28\xb5\x2f\xfd...'
```
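To make the raw format concrete, the sketch below packs a list of 32-bit hashes into bytes and wraps them in URL-safe base64. This is deliberately NOT Chromaprint's compressed encoding, so the resulting strings are not interchangeable with `fpcalc` output; the little-endian byte order and all function names are assumptions chosen for illustration.

```python
import base64
import struct


def pack_raw(hashes):
    """Pack unsigned 32-bit hashes into little-endian bytes (4 bytes each)."""
    return struct.pack(f"<{len(hashes)}I", *hashes)


def unpack_raw(data):
    """Inverse of pack_raw."""
    if len(data) % 4:
        raise ValueError("raw fingerprint length must be a multiple of 4 bytes")
    return list(struct.unpack(f"<{len(data) // 4}I", data))


def to_text(hashes):
    """URL-safe base64 text form of the packed hashes."""
    return base64.urlsafe_b64encode(pack_raw(hashes)).decode("ascii")


def from_text(text):
    """Inverse of to_text."""
    return unpack_raw(base64.urlsafe_b64decode(text))


hashes = [123456789, 987654321, 456789123, 0, 2**32 - 1]
round_tripped = from_text(to_text(hashes))
```

At 4 bytes per hash before any compression, the uncompressed size grows linearly with the analyzed duration, which is why a compressed transmission format is worthwhile.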
### Query Extraction
**File**: `acoustid/fingerprint.py`
```python
def extract_query(fingerprint, max_terms=100):
    """Extract query terms from fingerprint for index search.

    Args:
        fingerprint: List of 32-bit hash integers
        max_terms: Maximum number of terms to extract

    Returns:
        List of term IDs (subset of fingerprint hashes)
    """
    # Select most discriminative terms
    # (implementation uses simhash or random sampling)
    terms = select_discriminative_terms(fingerprint, max_terms)
    return terms
```
**Query Strategy**:
- Extract subset of hashes (typically 50-100 terms)
- Prioritize discriminative hashes (high entropy)
- Balance between precision and recall
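As a rough illustration of term selection, the sketch below deduplicates the hashes and then takes an evenly spaced sample. This is a simple stand-in, not the server's selection logic: it does not measure entropy or use simhash, and `sample_query_terms` is a hypothetical name.

```python
def sample_query_terms(fingerprint, max_terms=100):
    """Pick at most `max_terms` distinct hashes, spread evenly over the fingerprint."""
    if max_terms <= 0:
        return []
    # dict.fromkeys removes duplicates while keeping first-occurrence order
    unique = list(dict.fromkeys(fingerprint))
    if len(unique) <= max_terms:
        return unique
    step = len(unique) / max_terms  # > 1, so the sampled indexes are distinct
    return [unique[int(i * step)] for i in range(max_terms)]


terms = sample_query_terms(list(range(1000)), max_terms=100)
```

Fewer terms means fewer posting lists to scan (faster, lower recall); more terms improves recall at the cost of index work.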
### Fingerprint Comparison
**PostgreSQL Function** (custom extension):
```sql
CREATE FUNCTION acoustid_compare(fp1 INTEGER[], fp2 INTEGER[])
RETURNS FLOAT AS $$
-- Calculate Jaccard similarity
SELECT COUNT(*)::FLOAT /
(array_length(fp1, 1) + array_length(fp2, 1) - COUNT(*))
FROM unnest(fp1) AS h1
JOIN unnest(fp2) AS h2 ON h1 = h2
$$ LANGUAGE SQL IMMUTABLE;
```
**Python Implementation**:
```python
def compare_fingerprints(fp1, fp2):
    """Calculate similarity between two fingerprints.

    Returns:
        Float between 0.0 (no match) and 1.0 (identical)
    """
    set1 = set(fp1)
    set2 = set(fp2)
    intersection = len(set1 & set2)
    union = len(set1 | set2)
    return intersection / union if union > 0 else 0.0
```
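Set-based Jaccard similarity ignores where in the track a hash occurs. A position-sensitive alternative counts differing bits between hashes at the same index. The sketch below is that different technique, not the function above: it assumes both fingerprints are already aligned at the same offset (a real matcher would also search over offsets), and `bit_similarity` is a hypothetical name.

```python
def bit_similarity(fp1, fp2):
    """Fraction of matching bits over the overlapping part of two aligned fingerprints."""
    overlap = min(len(fp1), len(fp2))
    if overlap == 0:
        return 0.0
    # XOR leaves a 1 wherever the two 32-bit hashes disagree; zip stops at the shorter list
    differing = sum(
        bin((a ^ b) & 0xFFFFFFFF).count("1") for a, b in zip(fp1, fp2)
    )
    return 1.0 - differing / (32 * overlap)


identical = bit_similarity([0xDEADBEEF, 42], [0xDEADBEEF, 42])
one_bit_off = bit_similarity([0b1], [0b0])
```

Because near-identical audio tends to flip only a few bits per hash, this measure degrades gradually where exact-hash set matching would drop to zero.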
## AcoustID Index Integration
### Client Implementations
AcoustID server has two index client implementations:
#### Legacy TCP Client (indexclient.py)
**Status**: Deprecated, being phased out
**Protocol**: Custom binary over TCP
**Port**: 6080 (default)
**File**: `acoustid/indexclient.py`
```python
from queue import Queue


class IndexClientPool:
    """Connection pool for legacy TCP index."""

    def __init__(self, host, port, pool_size=10):
        self.host = host
        self.port = port
        # Clients must be created and put() into the pool before search() is called
        self.pool = Queue(maxsize=pool_size)

    def search(self, fingerprint, limit=10):
        """Search index for similar fingerprints."""
        client = self.pool.get()  # blocks until a client is available
        try:
            # Send search command
            client.send_command(CMD_SEARCH, {
                'fingerprint': fingerprint,
                'limit': limit
            })
            # Receive results
            results = client.receive_response()
            return results
        finally:
            # Always return the client to the pool, even on error
            self.pool.put(client)
```
**Message Format**:
```
┌────────────┬─────────┬──────────────────┐
│ Length (4B)│ Cmd (1B)│ Payload (msgpack)│
└────────────┴─────────┴──────────────────┘
```
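The framing in the diagram can be sketched with `struct`. Several details are assumptions because the diagram does not state them: big-endian byte order, a length field that counts the command byte plus the payload, and the command value `CMD_SEARCH = 1`. The payload is treated as opaque bytes here; in the real protocol it would be MessagePack.

```python
import struct

HEADER = struct.Struct(">IB")  # 4-byte length + 1-byte command, big-endian (assumed)
CMD_SEARCH = 1                 # hypothetical command code


def encode_frame(cmd: int, payload: bytes) -> bytes:
    """Prefix the payload with its length and command byte."""
    return HEADER.pack(1 + len(payload), cmd) + payload


def decode_frame(buffer: bytes):
    """Return (cmd, payload, remaining_bytes), or None if the frame is incomplete."""
    if len(buffer) < HEADER.size:
        return None
    length, cmd = HEADER.unpack_from(buffer)
    if length < 1:
        raise ValueError("frame length must include the command byte")
    end = 4 + length  # the length field itself is not counted
    if len(buffer) < end:
        return None
    return cmd, buffer[HEADER.size:end], buffer[end:]


frame = encode_frame(CMD_SEARCH, b"abc")
decoded = decode_frame(frame + b"xx")
```

Returning the leftover bytes lets a reader loop call `decode_frame` repeatedly on a TCP receive buffer that may hold partial or multiple frames.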
#### Modern HTTP Client (fpstore.py)
**Status**: Current, recommended
**Protocol**: HTTP/1.1 with MessagePack
**Port**: 6081 (default)
**File**: `acoustid/fpstore.py`
```python
import aiohttp
import msgspec


class FingerprintIndexClient:
    """Async HTTP client for fingerprint index."""

    def __init__(self, base_url, index_name='fingerprints'):
        self.base_url = base_url
        self.index_name = index_name
        # Create the client from within a running event loop
        self.session = aiohttp.ClientSession()

    async def search(self, query_terms, limit=10, min_score=0.5):
        """Search index for matching fingerprints.

        Args:
            query_terms: List of hash integers
            limit: Maximum results to return
            min_score: Minimum similarity score

        Returns:
            List of (fingerprint_id, score) tuples
        """
        url = f"{self.base_url}/{self.index_name}/_search"
        payload = msgspec.msgpack.encode({
            'query': query_terms,
            'limit': limit,
            'min_score': min_score
        })
        async with self.session.post(url, data=payload) as resp:
            data = await resp.read()
            result = msgspec.msgpack.decode(data)
            return [(r['id'], r['score']) for r in result['results']]

    async def insert(self, fingerprint_id, terms):
        """Insert or update fingerprint in index."""
        url = f"{self.base_url}/{self.index_name}/{fingerprint_id}"
        payload = msgspec.msgpack.encode({'terms': terms})
        async with self.session.put(url, data=payload) as resp:
            return resp.status == 200

    async def delete(self, fingerprint_id):
        """Delete fingerprint from index."""
        url = f"{self.base_url}/{self.index_name}/{fingerprint_id}"
        async with self.session.delete(url) as resp:
            return resp.status == 200
```
### Index Operations
**Search Flow**:
1. Extract query terms from fingerprint (50-100 hashes)
2. Encode query as MessagePack
3. POST to `/:index/_search`
4. Decode MessagePack response
5. Return list of (fingerprint_id, score) tuples
**Insert Flow**:
1. Extract all terms from fingerprint
2. Encode as MessagePack
3. PUT to `/:index/:fingerprint_id`
4. Index adds to MemorySegment
5. Appends to Oplog for durability
**Batch Update Flow**:
1. Collect multiple fingerprint updates
2. Encode batch as MessagePack
3. POST to `/:index/_update`
4. Index processes all updates atomically
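The insert and search flows above can be modeled with a toy in-memory inverted index. This is a teaching sketch, not the Zig implementation: there are no segments, no oplog, and no compression, and the score (fraction of query terms found in a fingerprint) is an assumed formula. `ToyIndex` is a hypothetical name.

```python
from collections import Counter, defaultdict


class ToyIndex:
    """Maps each term to the set of fingerprint IDs that contain it."""

    def __init__(self):
        self.postings = defaultdict(set)
        self.terms_by_id = {}

    def insert(self, fingerprint_id, terms):
        """Insert or replace a fingerprint."""
        self.delete(fingerprint_id)
        unique = set(terms)
        self.terms_by_id[fingerprint_id] = unique
        for term in unique:
            self.postings[term].add(fingerprint_id)

    def delete(self, fingerprint_id):
        """Remove a fingerprint; unknown IDs are ignored."""
        for term in self.terms_by_id.pop(fingerprint_id, ()):
            self.postings[term].discard(fingerprint_id)

    def search(self, query_terms, limit=10, min_score=0.5):
        """Return (fingerprint_id, score) pairs, best score first."""
        query = set(query_terms)
        if not query:
            return []
        hits = Counter()
        for term in query:
            for fingerprint_id in self.postings.get(term, ()):
                hits[fingerprint_id] += 1
        scored = [(fid, n / len(query)) for fid, n in hits.items()]
        scored = [pair for pair in scored if pair[1] >= min_score]
        scored.sort(key=lambda pair: (-pair[1], pair[0]))
        return scored[:limit]


index = ToyIndex()
index.insert(1, [10, 20, 30, 40])
index.insert(2, [10, 20, 98, 99])
matches = index.search([10, 20, 30, 40])
```

Only the posting lists for the query terms are touched, which is why extracting a small, discriminative query keeps searches cheap.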
### Error Handling
**Retry Strategy**:
```python
import asyncio

import aiohttp


async def search_with_retry(client, query, max_retries=3):
    """Search with exponential backoff retry."""
    for attempt in range(max_retries):
        try:
            return await client.search(query)
        except aiohttp.ClientError:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the last error
            wait_time = 2 ** attempt  # 1s, 2s, 4s, ...
            await asyncio.sleep(wait_time)
```
**Circuit Breaker**:
```python
import time


class CircuitBreakerOpen(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    """Prevent cascading failures to index."""

    def __init__(self, failure_threshold=5, timeout=60):
        self.failure_count = 0
        self.failure_threshold = failure_threshold
        self.timeout = timeout
        self.last_failure_time = None
        self.state = 'closed'  # closed, open, half-open

    async def call(self, func, *args, **kwargs):
        if self.state == 'open':
            # After the timeout, let one trial call through
            if time.time() - self.last_failure_time > self.timeout:
                self.state = 'half-open'
            else:
                raise CircuitBreakerOpen()
        try:
            result = await func(*args, **kwargs)
        except Exception:
            self.failure_count += 1
            self.last_failure_time = time.time()
            if self.failure_count >= self.failure_threshold:
                self.state = 'open'
            raise
        # Any success closes the circuit and resets the count,
        # so only consecutive failures can trip the breaker
        self.state = 'closed'
        self.failure_count = 0
        return result
```
## Fingerprint Store (fpstore)
### Optional Service
**Purpose**: Separate storage for raw fingerprint data
**Status**: Optional (can use PostgreSQL instead)
**Protocol**: HTTP with MessagePack
**Configuration**:
```ini
[fingerprint_store]
enabled = true
base_url = http://fpstore:8080
```
**Operations**:
```python
class FingerprintStore:
    """Client for fingerprint storage service."""

    async def store(self, fingerprint_id, fingerprint_data):
        """Store raw fingerprint data."""
        url = f"{self.base_url}/fingerprints/{fingerprint_id}"
        payload = msgspec.msgpack.encode({
            'data': fingerprint_data
        })
        async with self.session.put(url, data=payload) as resp:
            return resp.status == 200

    async def retrieve(self, fingerprint_id):
        """Retrieve raw fingerprint data."""
        url = f"{self.base_url}/fingerprints/{fingerprint_id}"
        async with self.session.get(url) as resp:
            data = await resp.read()
            result = msgspec.msgpack.decode(data)
            return result['data']
```
## NATS Integration
### Message Queue
**Purpose**: Async submission processing
**Technology**: NATS with JetStream (persistent queue)
**Library**: `nats-py`
**Configuration**:
```ini
[nats]
servers = nats://nats:4222
stream = acoustid_submissions
consumer = acoustid_worker
```
**File**: `acoustid/worker.py`
### Publisher (API Server)
```python
import time

import msgspec
import nats
from nats.js import JetStreamContext


async def publish_submission(submission_id):
    """Publish submission to NATS queue."""
    nc = await nats.connect(servers=["nats://nats:4222"])
    js: JetStreamContext = nc.jetstream()
    # Ensure stream exists
    await js.add_stream(
        name="acoustid_submissions",
        subjects=["submissions.*"],
        retention="workqueue"
    )
    # Publish message
    await js.publish(
        subject="submissions.new",
        payload=msgspec.json.encode({
            'submission_id': submission_id,
            'timestamp': time.time()
        })
    )
    await nc.close()
```
### Consumer (Worker)
```python
import logging

import msgspec
import nats
from nats.errors import TimeoutError as NatsTimeoutError
from nats.js.api import ConsumerConfig

logger = logging.getLogger(__name__)


async def consume_submissions():
    """Consume submissions from NATS queue."""
    nc = await nats.connect(servers=["nats://nats:4222"])
    js = nc.jetstream()
    # Create consumer
    consumer = await js.pull_subscribe(
        subject="submissions.*",
        durable="acoustid_worker",
        config=ConsumerConfig(
            ack_policy="explicit",
            max_deliver=3,
            ack_wait=300  # 5 minutes
        )
    )
    while True:
        # Fetch batch of messages; fetch() raises a timeout error
        # when nothing arrives in time, so keep polling in that case
        try:
            messages = await consumer.fetch(batch=10, timeout=5)
        except NatsTimeoutError:
            continue
        for msg in messages:
            try:
                data = msgspec.json.decode(msg.data)
                await process_submission(data['submission_id'])
                await msg.ack()
            except Exception as e:
                logger.error(f"Failed to process submission: {e}")
                await msg.nak(delay=60)  # Retry after 1 minute
```
### JetStream Configuration
**Stream Settings**:
- Retention: WorkQueue (messages deleted after ack)
- Max age: 7 days (unprocessed messages)
- Max messages: 1,000,000
- Storage: File (persistent)
**Consumer Settings**:
- Ack policy: Explicit (manual acknowledgment)
- Max deliver: 3 (at most 3 delivery attempts in total: the first delivery plus up to 2 redeliveries)
- Ack wait: 300 seconds (5 minutes timeout)
- Max ack pending: 100 (max unacked messages)
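The interaction between explicit acks and the delivery limit can be modeled in a few lines. This is a simplified in-memory model, not `nats-py` code: `deliver` is a hypothetical function, and a raised exception stands in for either a `nak` or an expired ack wait.

```python
def deliver(handler, max_deliver=3):
    """Call `handler(attempt)` until it succeeds or the delivery budget is spent.

    Returns (acked, attempts_used).
    """
    for attempt in range(1, max_deliver + 1):
        try:
            handler(attempt)
        except Exception:
            continue  # treated like a nak: the message is redelivered
        return True, attempt  # treated like an explicit ack
    return False, max_deliver  # budget exhausted: the message is not delivered again


def flaky(attempt):
    """Fails on the first two deliveries, succeeds on the third."""
    if attempt < 3:
        raise RuntimeError("transient failure")


def always_failing(attempt):
    raise RuntimeError("permanent failure")


recovered = deliver(flaky)
gave_up = deliver(always_failing)
```

A submission that keeps failing therefore stops being retried after the third attempt, so persistent failures need separate monitoring or a dead-letter strategy.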
## Redis Integration
### Use Cases
1. **Rate Limiting**: Sliding window counters
2. **Task Queue** (legacy): RPUSH/LPOP queue
3. **Caching**: API key validation, MBID existence
4. **State Management**: Backfill progress, worker state
**Configuration**:
```ini
[redis]
host = redis
port = 6379
db = 0
password_file = /run/secrets/redis_password
```
**File**: `acoustid/redis.py`
### Connection Pool
```python
import redis
redis_pool = redis.ConnectionPool(
    host='redis',
    port=6379,
    db=0,
    max_connections=50,
    socket_timeout=5,
    socket_connect_timeout=5
)
redis_client = redis.Redis(connection_pool=redis_pool)
```
### Rate Limiting Implementation
See DATA.md for detailed rate limiting data structures.
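For orientation, here is a minimal in-memory sliding-window limiter. It is a sketch of the general idea only: it uses the "sliding-window log" variant (one timestamp per request) rather than counters, it keeps state in process memory instead of Redis, and `SlidingWindowLimiter` is a hypothetical name. The actual data structures are the ones described in DATA.md.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `limit` events per key within the last `window_seconds`."""

    def __init__(self, limit, window_seconds, clock=time.monotonic):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock  # injectable so tests can control time
        self.events = defaultdict(deque)

    def allow(self, key):
        now = self.clock()
        events = self.events[key]
        cutoff = now - self.window
        # Drop timestamps that have slid out of the window
        while events and events[0] <= cutoff:
            events.popleft()
        if len(events) >= self.limit:
            return False
        events.append(now)
        return True


# Usage with a fake clock: 2 requests per 10 seconds for one API key
fake_now = [0.0]
limiter = SlidingWindowLimiter(limit=2, window_seconds=10, clock=lambda: fake_now[0])
decisions = []
for t in (0.0, 1.0, 2.0, 10.5):
    fake_now[0] = t
    decisions.append(limiter.allow("api-key-1"))
```

A Redis-backed version would keep the same logic but store the per-key state server-side so that all API processes share one view of each client's usage.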
### Caching Patterns
**API Key Cache** (in-process `TTLCache` with a 60-second TTL; this example does not use Redis):
```python
from cachetools import TTLCache

api_key_cache = TTLCache(maxsize=1000, ttl=60)


def get_application_by_key(api_key):
    if api_key in api_key_cache:
        return api_key_cache[api_key]
    app = db.query(Application).filter_by(apikey=api_key).first()
    if app:
        api_key_cache[api_key] = app
    return app
```
**Unknown MBID Cache**:
```python
def is_mbid_known(mbid):
    """Check if MBID exists in MusicBrainz."""
    cache_key = f"unknown_mbid:{mbid}"
    # Check cache
    if redis_client.exists(cache_key):
        return False
    # Query MusicBrainz
    exists = mb_db.query(Recording).filter_by(gid=mbid).count() > 0
    # Cache negative result
    if not exists:
        redis_client.setex(cache_key, 3600, '1')
    return exists
```
## Integration Summary
| Service | Protocol | Purpose | Criticality |
|---------|----------|---------|-------------|
| MusicBrainz | PostgreSQL | Metadata enrichment | High |
| Chromaprint | C library | Fingerprint generation | Critical |
| Index (HTTP) | HTTP/MessagePack | Fingerprint search | Critical |
| Index (TCP) | TCP binary | Legacy fingerprint search | Low (deprecated) |
| Fingerprint Store | HTTP/MessagePack | Raw fingerprint storage | Low (optional) |
| NATS | NATS protocol | Async job queue | High |
| Redis | Redis protocol | Caching, rate limiting | High |
+391
View File
@@ -0,0 +1,391 @@
# AcoustID System Overview
## Introduction
AcoustID is an open-source audio fingerprinting service that identifies music recordings by analyzing their acoustic characteristics. The system consists of two primary components working in tandem: a Python-based web service (acoustid-server) and a high-performance Zig-based fingerprint index (acoustid-index). Together, they provide a production-grade solution for matching audio fingerprints to MusicBrainz metadata.
## System Components
### acoustid-server (Python)
The server component handles all user-facing operations, database management, and business logic.
**Repository**: acoustid/acoustid-server
**License**: MIT
**Language**: Python 3.12+
**Current Version**: 26.3.1
**Core Technologies**:
- **Web Framework**: Werkzeug/Flask (current) with migration to Starlette (future async)
- **ORM**: SQLAlchemy 2.x with multi-database support
- **Database**: PostgreSQL 17.4 (4 separate databases)
- **Cache/Queue**: Redis for rate limiting and task queues
- **Message Queue**: NATS with JetStream for async submission processing
- **Application Servers**: Uvicorn (ASGI) for async endpoints, Gunicorn (WSGI) for legacy endpoints
**Key Dependencies**:
```
acoustid-ext (C extension for Chromaprint)
Flask (current web framework)
Starlette (future async framework)
aiohttp (async HTTP client)
SQLAlchemy 2.x (ORM)
alembic (database migrations)
asyncpg (async PostgreSQL driver)
psycopg2 (sync PostgreSQL driver)
nats-py (NATS client)
mbdata (MusicBrainz data models)
msgspec (fast JSON/MessagePack)
zstd (compression)
gunicorn (WSGI server)
uvicorn (ASGI server)
```
**Entry Point**:
```bash
# Main CLI entry: manage.py dispatches to acoustid.cli:main()
# Available commands
python manage.py run web # Web UI server
python manage.py run api # API server
python manage.py run cron # Scheduled tasks
python manage.py run worker # Background worker
python manage.py run import # Import fingerprints
```
**File Locations**:
- Entry script: `manage.py`
- CLI implementation: `acoustid/cli.py`
- Server logic: `acoustid/server.py`
- Worker logic: `acoustid/worker.py`
- Cron jobs: `acoustid/cron.py`
- Configuration: `acoustid/config.py`
### acoustid-index (Zig)
The index component provides low-latency fingerprint search, built on an LSM-tree inverted index with SIMD-accelerated posting-list compression.
**Repository**: acoustid/acoustid-index
**License**: GPL-3.0
**Language**: Zig
**Build System**: Zig build system
**Core Technologies**:
- **HTTP Server**: httpz (Zig HTTP library)
- **Data Structure**: LSM-tree (Log-Structured Merge-tree) inverted index
- **Compression**: StreamVByte SIMD compression for posting lists
- **Serialization**: MessagePack for wire protocol
- **Metrics**: Prometheus-compatible metrics endpoint
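To show what StreamVByte-style compression does to a posting list, here is a scalar Python sketch. It follows the general StreamVByte idea (a 2-bit length code per integer kept in a separate control stream, data bytes stored little-endian) but uses no SIMD, and it assumes sorted IDs are delta-encoded first. The on-disk block layout used by acoustid-index may differ; all function names are hypothetical.

```python
def to_deltas(sorted_ids):
    """Replace each ID with its gap from the previous one (small numbers compress well)."""
    previous, deltas = 0, []
    for value in sorted_ids:
        deltas.append(value - previous)
        previous = value
    return deltas


def from_deltas(deltas):
    """Inverse of to_deltas: running sum."""
    total, values = 0, []
    for delta in deltas:
        total += delta
        values.append(total)
    return values


def encode(values):
    """Return (control_bytes, data_bytes) for a list of unsigned 32-bit integers."""
    control = bytearray((len(values) + 3) // 4)  # four 2-bit codes per control byte
    data = bytearray()
    for i, value in enumerate(values):
        if not 0 <= value < 2**32:
            raise ValueError("values must fit in 32 bits")
        nbytes = max(1, (value.bit_length() + 7) // 8)
        control[i // 4] |= (nbytes - 1) << (2 * (i % 4))
        data += value.to_bytes(nbytes, "little")
    return bytes(control), bytes(data)


def decode(control, data, count):
    """Inverse of encode; `count` is the number of integers stored."""
    values, position = [], 0
    for i in range(count):
        nbytes = ((control[i // 4] >> (2 * (i % 4))) & 0b11) + 1
        values.append(int.from_bytes(data[position:position + nbytes], "little"))
        position += nbytes
    return values


ids = [3, 260, 70000, 70001, 2**31]
control, data = encode(to_deltas(ids))
restored = from_deltas(decode(control, data, len(ids)))
```

Keeping the length codes in their own stream is what lets a SIMD implementation decode four integers per control byte with a table lookup and a shuffle, instead of branching per byte.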
**Key Dependencies**:
```
httpz (HTTP server framework)
metrics (Prometheus metrics)
zul (Zig utility library)
msgpack (MessagePack serialization)
nats (NATS client)
```
**Entry Point**:
```bash
# Build and run
zig build run -- --dir /tmp --port 8080
# Binary name
fpindex
# CLI flags
--dir <path> # Data directory for index storage
--port <number> # HTTP server port (default: 6081)
--threads <number> # Worker thread count
--log-level <level> # Logging verbosity
--cluster <name> # Cluster name for distributed setup
--nats-url <url> # NATS server URL for clustering
```
**File Locations**:
- Main entry: `src/main.zig`
- HTTP server: `src/server.zig`
- API handlers: `src/api.zig`
- Multi-index manager: `src/MultiIndex.zig`
- Core index: `src/Index.zig`
- Index reader: `src/IndexReader.zig`
- Segment management: `src/segment.zig`
- Memory segment: `src/MemorySegment.zig`
- File segment: `src/FileSegment.zig`
- Write-ahead log: `src/Oplog.zig`
- File format: `src/filefmt.zig`
- Block compression: `src/block.zig`
- SIMD compression: `src/streamvbyte.zig`
- Metrics: `src/metrics.zig`
## Build and Run
### Server Build
```bash
# Install dependencies with uv
uv sync
# Build Chromaprint extension
# (handled automatically in Docker build)
# Run with docker-compose
docker compose up
```
**Docker Compose Services**:
- `nats`: Message queue
- `redis`: Cache and rate limiting
- `postgres`: Database (custom pg17.4 image)
- `index`: Fingerprint index service
- `api`: API server
- `web`: Web UI server
- `cron`: Scheduled tasks
- `worker`: Background job processor
### Index Build
```bash
# Build binary
zig build
# Run with options
zig build run -- --dir /var/lib/acoustid-index --port 6081 --threads 4
```
## Architecture Relationship
The two components work together in a client-server model:
1. **Server** receives fingerprint submissions and lookup requests via HTTP API
2. **Server** stores metadata in PostgreSQL
3. **Server** sends fingerprint data to **Index** via HTTP/MessagePack protocol
4. **Index** runs a similarity search against its LSM-tree inverted index
5. **Index** returns candidate fingerprint IDs to **Server**
6. **Server** enriches results with metadata from PostgreSQL and MusicBrainz
7. **Server** returns final results to client
## Communication Protocols
### Server to Index
**Modern Protocol** (fpstore.py):
- HTTP POST to `http://index:6081/:index/_search`
- Request body: MessagePack-encoded fingerprint query
- Response: MessagePack-encoded list of candidate IDs with scores
**Legacy Protocol** (indexclient.py):
- Raw TCP socket connection
- Binary protocol with custom framing
- Being phased out in favor of HTTP
### Client to Server
**Public API**:
- HTTP GET/POST to `https://api.acoustid.org/v2/*`
- JSON/XML/JSONP responses
- Rate-limited by API key and IP
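A lookup request URL can be assembled with the standard library. The sketch uses only the query parameters that appear in this documentation's lookup example (`client`, `meta`, `fingerprint`, `duration`); `build_lookup_url` is a hypothetical helper, the space-separated `meta` value is an assumption, and `YOUR_API_KEY` is a placeholder.

```python
from urllib.parse import urlencode

API_ROOT = "https://api.acoustid.org/v2"


def build_lookup_url(client_key, fingerprint, duration, meta=("recordings",)):
    """Build a GET URL for the lookup endpoint."""
    params = {
        "client": client_key,
        "meta": " ".join(meta),        # assumption: flags are space-separated
        "duration": int(duration),     # whole seconds
        "fingerprint": fingerprint,
    }
    # urlencode percent-encodes values and turns spaces into '+'
    return f"{API_ROOT}/lookup?{urlencode(params)}"


lookup_url = build_lookup_url(
    "YOUR_API_KEY",
    "AQADtNGiJEqUHUemR",
    240.7,
    meta=("recordings", "releasegroups"),
)
```

Long fingerprints can make GET URLs very large, which is one reason the API also accepts POST requests.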
## Version Information
**Server Version**: 26.3.1
- Semantic versioning
- Tagged releases in Git
- Version defined in `acoustid/__init__.py`
**Index Version**: No formal versioning yet
- Tracked by Git commit hash
- Breaking changes communicated via commit messages
## Deployment Models
### Production (acoustid.org)
- Multi-server deployment
- Separate API, web, worker, and cron processes
- Dedicated PostgreSQL cluster (4 databases)
- Redis cluster for caching
- NATS cluster for message queue
- Multiple index instances for load balancing
### Self-Hosted (Docker Compose)
- Single-host deployment
- All services in containers
- Shared PostgreSQL instance
- Single Redis instance
- Single NATS instance
- Single index instance
### Development (Local)
- Python virtual environment with uv
- Local PostgreSQL (or Docker)
- Local Redis (or Docker)
- Local NATS (or Docker)
- Index built and run locally with Zig
## Key Features
### Server Features
- **Fingerprint Submission**: Accept audio fingerprints with optional metadata
- **Fingerprint Lookup**: Match fingerprints to known recordings
- **MusicBrainz Integration**: Link fingerprints to MBIDs
- **User Management**: API key generation and management
- **Rate Limiting**: Multi-tier rate limiting (global, app, IP)
- **Batch Operations**: Submit/lookup up to 20 fingerprints per request
- **Async Processing**: Background workers for heavy operations
- **Health Checks**: Multiple health endpoints for monitoring
- **Metrics**: StatsD metrics for observability
### Index Features
- **Fast Search**: Sub-millisecond fingerprint matching
- **SIMD Optimization**: StreamVByte compression for posting lists
- **LSM-Tree Storage**: Efficient write and read performance
- **Background Merging**: Automatic segment compaction
- **Snapshot Support**: Point-in-time index snapshots
- **Cluster Support**: Distributed index via NATS
- **Prometheus Metrics**: Built-in metrics endpoint
- **HTTP API**: RESTful API for all operations
## Configuration
### Server Configuration
**Config File**: `acoustid.conf` (INI format)
**Environment Variables**: `ACOUSTID_*` prefix
**Secret Files**: `*_file` suffix for file-based secrets
Example:
```ini
[database]
name = acoustid_app
user = acoustid
password_file = /run/secrets/db_password
[redis]
host = redis
port = 6379
[fingerprint_index]
host = index
port = 6081
```
### Index Configuration
**CLI Flags Only**: No config file support
**Environment Variables**: Limited support
Example:
```bash
fpindex \
--dir /var/lib/acoustid-index \
--port 6081 \
--threads 4 \
--log-level info \
--nats-url nats://nats:4222
```
## Data Flow Summary
### Submission Flow
1. Client submits fingerprint via `/v2/submit`
2. Server validates API keys and rate limits
3. Server stores submission in `submission` table
4. Server publishes message to NATS queue
5. Worker picks up message from NATS
6. Worker searches index for matches
7. Worker creates or links track in PostgreSQL
8. Worker updates index with new fingerprint
9. Client polls `/v2/submission_status` for result
### Lookup Flow
1. Client requests lookup via `/v2/lookup`
2. Server validates API key and rate limits
3. Server decodes fingerprint from request
4. Server extracts query features from fingerprint
5. Server sends search request to index
6. Index returns candidate fingerprint IDs
7. Server fetches metadata from PostgreSQL
8. Server fetches MusicBrainz data if requested
9. Server returns enriched results as JSON
## Technology Stack Summary
| Component | Server | Index |
|-----------|--------|-------|
| Language | Python 3.12+ | Zig |
| Web Framework | Flask/Starlette | httpz |
| Database | PostgreSQL 17.4 | N/A (file-based) |
| ORM | SQLAlchemy 2.x | N/A |
| Cache | Redis | N/A |
| Queue | NATS+JetStream | NATS (optional) |
| Serialization | JSON/MessagePack | MessagePack |
| Compression | zstd | StreamVByte |
| Metrics | StatsD | Prometheus |
| Testing | pytest | Zig test |
| Build | uv | zig build |
| Container | Docker | Docker |
## Repository Structure
### acoustid-server
```
acoustid/
├── api/ # API handlers
│ └── v2/ # API v2 endpoints
├── data/ # Business logic layer
├── future/ # Starlette migration code
├── web/ # Web UI handlers
├── scripts/ # Utility scripts
├── cli.py # CLI commands
├── server.py # Server entry point
├── worker.py # Background worker
├── cron.py # Scheduled tasks
├── fingerprint.py # Fingerprint utilities
├── indexclient.py # Legacy index client
├── fpstore.py # Modern index client
├── db.py # Database connection
├── config.py # Configuration
└── tables.py # SQLAlchemy models
```
### acoustid-index
```
src/
├── main.zig # Entry point
├── server.zig # HTTP server
├── api.zig # API handlers
├── MultiIndex.zig # Multi-index manager
├── Index.zig # Core index
├── IndexReader.zig # Read-only index view
├── segment.zig # Segment interface
├── MemorySegment.zig # In-memory segment
├── FileSegment.zig # On-disk segment
├── Oplog.zig # Write-ahead log
├── filefmt.zig # File format
├── block.zig # Block compression
├── streamvbyte.zig # SIMD compression
└── metrics.zig # Prometheus metrics
```
## Next Steps
For detailed information on specific aspects of the AcoustID system, refer to:
- **ARCHITECTURE.md**: Detailed architecture and data flow
- **API.md**: Complete API reference
- **DATA.md**: Database schema and data models
- **INTEGRATIONS.md**: External service integrations
- **DEPLOYMENT.md**: Deployment and infrastructure
- **CODEBASE.md**: Code organization and patterns
- **EVALUATION.md**: System evaluation and recommendations