feat: initial implementation of metadata aggregator
- gRPC service with MusicBrainz provider
- PostgreSQL schema with migrations
- Service layer with database-first caching
- Repository pattern for data access
- YAML configuration support
- Research documentation for 17 music metadata projects
@@ -0,0 +1,55 @@
|
||||
# AcoustID
|
||||
|
||||
## Overview
|
||||
|
||||
AcoustID is an open-source audio fingerprinting service. It identifies music tracks by their acoustic fingerprint and links them to MusicBrainz recordings.
|
||||
|
||||
## Key Features
|
||||
|
||||
- **Purpose**: Audio identification via acoustic fingerprinting
|
||||
- **Technology**: Chromaprint fingerprint generation
|
||||
- **Database**: Crowdsourced fingerprints linked to MusicBrainz
|
||||
- **License**: MIT (code), CC BY-SA 3.0 (data)
|
||||
|
||||
## Source
|
||||
|
||||
| Resource | URL |
|----------|-----|
| **Server Repository** | https://github.com/acoustid/acoustid-server |
| **Index Repository** | https://github.com/acoustid/acoustid-index |
| **Chromaprint Library** | https://github.com/acoustid/chromaprint |
| **API Documentation** | https://acoustid.org/webservice |
| **Website** | https://acoustid.org |
|
||||
|
||||
## API Examples
|
||||
|
||||
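```bash
# Lookup by fingerprint (FP and DUR hold the values printed by fpcalc)
curl "https://api.acoustid.org/v2/lookup?client=YOUR_API_KEY&meta=recordings&fingerprint=${FP}&duration=${DUR}"

# Submit a new fingerprint (POST; also needs a user key)
curl "https://api.acoustid.org/v2/submit" \
  -d "client=YOUR_API_KEY" -d "user=YOUR_USER_KEY" \
  -d "duration.0=${DUR}" -d "fingerprint.0=${FP}"
```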
|
||||
|
||||
## Chromaprint CLI
|
||||
|
||||
```bash
|
||||
# Generate fingerprint from audio file
|
||||
fpcalc song.mp3
|
||||
# Prints DURATION=... and FINGERPRINT=... on separate lines
|
||||
```
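
The `KEY=VALUE` output of `fpcalc` can be turned into lookup parameters with a few lines of Python. This is a minimal sketch: the helper name `parse_fpcalc_output` is hypothetical, and it assumes only the `DURATION` and `FINGERPRINT` keys shown above.

```python
def parse_fpcalc_output(text: str) -> dict:
    """Parse KEY=VALUE lines (as printed by fpcalc) into a dict with lowercase keys."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # ignore lines that contain no '='
            fields[key.strip().lower()] = value.strip()
    if "duration" in fields:
        # The API tables in this document describe duration as integer seconds.
        fields["duration"] = int(float(fields["duration"]))
    return fields


# Sample text standing in for real fpcalc output.
sample = "DURATION=240\nFINGERPRINT=AQADtNGiJE\n"
print(parse_fpcalc_output(sample))
```

In practice the text would come from running `fpcalc` through `subprocess.run(..., capture_output=True, text=True)`.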
|
||||
|
||||
## Self-Hosting
|
||||
|
||||
The acoustid-index v2 is written in Zig for performance:
|
||||
|
||||
```bash
|
||||
git clone https://github.com/acoustid/acoustid-index.git
|
||||
# Follow build instructions in README
|
||||
```
|
||||
|
||||
## Notes
|
||||
|
||||
- Used by: Beets, Picard, Kid3, MusicBrainz ecosystem
|
||||
- Free API for audio fingerprint matching
|
||||
- Identify unknown files → get MusicBrainz metadata
|
||||
@@ -0,0 +1,807 @@
|
||||
# AcoustID API Reference
|
||||
|
||||
## API Overview
|
||||
|
||||
The AcoustID API provides fingerprint-based music identification services. The API is RESTful, supports multiple response formats (JSON, XML, JSONP), and requires API key authentication for most operations.
|
||||
|
||||
**Base URL**: `https://api.acoustid.org`
|
||||
**Protocol**: HTTPS only
|
||||
**Authentication**: API key (application key + user key for submissions)
|
||||
**Rate Limiting**: Multi-tier (global, application, IP-based)
|
||||
|
||||
## Public API Endpoints
|
||||
|
||||
### Fingerprint Lookup
|
||||
|
||||
Identify recordings by audio fingerprint.
|
||||
|
||||
#### `/v2/lookup`
|
||||
|
||||
**Methods**: GET, POST
|
||||
**Authentication**: Required (client key)
|
||||
**Rate Limit**: 3 requests/second (IP), 10 requests/second (application)
|
||||
|
||||
**Required Parameters**:

| Parameter | Type | Description |
|-----------|------|-------------|
| `client` | string | Application API key |
| `fingerprint` | string | Chromaprint fingerprint as printed by `fpcalc` (required unless `trackid` is given) |
| `duration` | integer | Track duration in seconds (required together with `fingerprint`) |
| `trackid` | string | AcoustID track ID (alternative to `fingerprint` and `duration`) |

**Optional Parameters**:

| Parameter | Type | Description | Default |
|-----------|------|-------------|---------|
| `format` | string | Response format: `json`, `xml`, `jsonp` | `json` |
| `jsoncallback` | string | JSONP callback function name | - |
| `meta` | string | Metadata to include (see below) | - |
||||
|
||||
**Metadata Options** (comma-separated):
|
||||
|
||||
- `recordings`: Include MusicBrainz recording metadata
|
||||
- `recordingids`: Include only recording MBIDs (faster)
|
||||
- `releases`: Include release metadata
|
||||
- `releaseids`: Include only release MBIDs
|
||||
- `releasegroups`: Include release group metadata
|
||||
- `releasegroupids`: Include only release group MBIDs
|
||||
- `tracks`: Include track metadata
|
||||
- `compress`: Compress response with gzip
|
||||
- `usermeta`: Include user-submitted metadata
|
||||
- `sources`: Include submission source information
|
||||
|
||||
**Batch Lookup**:
|
||||
|
||||
Submit multiple fingerprints in a single request using indexed parameters:
|
||||
|
||||
```
|
||||
duration.0=240&fingerprint.0=AQADtN...
|
||||
duration.1=180&fingerprint.1=AQABtK...
|
||||
```
|
||||
|
||||
**Limits**:
|
||||
- Maximum 20 fingerprints per batch request
|
||||
- Maximum 100 track IDs per request
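
The indexed-parameter scheme and the 20-fingerprint limit can be sketched as a small batching helper. The function name `batch_lookup_params` and the `(duration, fingerprint)` tuple layout are assumptions made for illustration; only the `duration.N` / `fingerprint.N` naming and the batch limit come from the text above.

```python
def batch_lookup_params(client, tracks, max_batch=20):
    """Yield one form-parameter dict per request, with at most max_batch tracks each.

    tracks is a list of (duration_seconds, fingerprint) tuples.
    """
    for start in range(0, len(tracks), max_batch):
        params = {"client": client}
        chunk = tracks[start:start + max_batch]
        for i, (duration, fingerprint) in enumerate(chunk):
            # The index restarts at 0 inside every request.
            params[f"duration.{i}"] = int(duration)
            params[f"fingerprint.{i}"] = fingerprint
        yield params


# 45 placeholder tracks are split into requests of 20, 20 and 5.
tracks = [(240, f"FP{i}") for i in range(45)]
batches = list(batch_lookup_params("YOUR_API_KEY", tracks))
print([len(b) for b in batches])
```

Each dict can be form-encoded with `urllib.parse.urlencode` and sent as the body of a POST to `/v2/lookup`.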
|
||||
|
||||
**Example Request** (GET):
|
||||
```
|
||||
GET /v2/lookup?client=8XaBELgH&duration=240&fingerprint=AQADtNGiJE...&meta=recordings
|
||||
```
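
A GET request like the one above can be assembled with the standard library. This is a sketch under stated assumptions: `build_lookup_url` is a hypothetical helper, the base URL is the one given in the API overview, and `meta` is joined with commas as described in this document.

```python
from urllib.parse import urlencode

API_BASE = "https://api.acoustid.org"  # base URL from the API overview


def build_lookup_url(client, fingerprint, duration, meta=("recordings",)):
    """Build a /v2/lookup URL; urlencode takes care of escaping."""
    params = {
        "client": client,
        "duration": int(duration),
        "fingerprint": fingerprint,
        "meta": ",".join(meta),
        "format": "json",
    }
    return f"{API_BASE}/v2/lookup?{urlencode(params)}"


print(build_lookup_url("8XaBELgH", "AQADtNGiJE", 240))
```

For long fingerprints, the same parameters are better sent as a POST body, since URLs have practical length limits.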
|
||||
|
||||
**Example Request** (POST):
|
||||
```
|
||||
POST /v2/lookup
|
||||
Content-Type: application/x-www-form-urlencoded
|
||||
|
||||
client=8XaBELgH&duration=240&fingerprint=AQADtNGiJE...&meta=recordings
|
||||
```
|
||||
|
||||
**Example Response** (JSON):
|
||||
```json
{
  "status": "ok",
  "results": [
    {
      "id": "7e8b1234-5678-90ab-cdef-1234567890ab",
      "score": 0.95,
      "recordings": [
        {
          "id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
          "title": "Example Song",
          "duration": 240,
          "artists": [
            {
              "id": "12345678-90ab-cdef-1234-567890abcdef",
              "name": "Example Artist"
            }
          ],
          "releases": [
            {
              "id": "abcdef12-3456-7890-abcd-ef1234567890",
              "title": "Example Album",
              "country": "US",
              "date": {
                "year": 2020,
                "month": 5,
                "day": 15
              },
              "track_count": 12,
              "medium_count": 1
            }
          ]
        }
      ]
    }
  ]
}
```
|
||||
|
||||
**Response Fields**:
|
||||
|
||||
| Field | Type | Description |
|
||||
|-------|------|-------------|
|
||||
| `status` | string | `ok` or `error` |
|
||||
| `results` | array | Array of match results |
|
||||
| `results[].id` | string | AcoustID track ID |
|
||||
| `results[].score` | float | Match confidence (0.0-1.0) |
|
||||
| `results[].recordings` | array | MusicBrainz recordings (if requested) |
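
A client typically keeps only the highest-scoring result. The sketch below works on a response that has already been decoded from JSON. The helper name `best_match` and the `min_score` cutoff of 0.5 are illustrative assumptions, not values defined by the API.

```python
def best_match(response, min_score=0.5):
    """Return (track_id, score, recording_ids) for the top result, or None."""
    if response.get("status") != "ok":
        return None
    candidates = [
        r for r in response.get("results", [])
        if r.get("score", 0.0) >= min_score
    ]
    if not candidates:
        return None
    top = max(candidates, key=lambda r: r.get("score", 0.0))
    # "recordings" is only present when it was requested via meta.
    recording_ids = [rec["id"] for rec in top.get("recordings", [])]
    return top["id"], top.get("score", 0.0), recording_ids


response = {
    "status": "ok",
    "results": [
        {"id": "track-a", "score": 0.62},
        {"id": "track-b", "score": 0.95, "recordings": [{"id": "rec-1"}]},
    ],
}
print(best_match(response))
```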
|
||||
|
||||
### Fingerprint Submission
|
||||
|
||||
Submit audio fingerprints with optional metadata.
|
||||
|
||||
#### `/v2/submit`
|
||||
|
||||
**Method**: POST
|
||||
**Authentication**: Required (client key + user key)
|
||||
**Rate Limit**: 3 requests/second (IP), 10 requests/second (application)
|
||||
|
||||
**Required Parameters**:
|
||||
|
||||
| Parameter | Type | Description |
|
||||
|-----------|------|-------------|
|
||||
| `client` | string | Application API key |
|
||||
| `user` | string | User API key |
|
||||
| `duration.#` | integer | Track duration in seconds |
|
||||
| `fingerprint.#` | string | Chromaprint fingerprint |
|
||||
|
||||
**Optional Parameters**:
|
||||
|
||||
| Parameter | Type | Description |
|
||||
|-----------|------|-------------|
|
||||
| `clientversion` | string | Client application version |
|
||||
| `bitrate.#` | integer | Audio bitrate in kbps |
|
||||
| `fileformat.#` | string | Audio file format (mp3, flac, etc.) |
|
||||
| `mbid.#` | string | MusicBrainz recording MBID |
|
||||
| `track.#` | string | Track title |
|
||||
| `artist.#` | string | Artist name |
|
||||
| `album.#` | string | Album title |
|
||||
| `albumartist.#` | string | Album artist name |
|
||||
| `year.#` | integer | Release year |
|
||||
| `trackno.#` | integer | Track number |
|
||||
| `discno.#` | integer | Disc number |
|
||||
|
||||
**Batch Submission**:
|
||||
|
||||
Use indexed parameters (`.0`, `.1`, `.2`, etc.) to submit multiple fingerprints:
|
||||
|
||||
```
|
||||
duration.0=240&fingerprint.0=AQADtN...&mbid.0=a1b2c3d4...
|
||||
duration.1=180&fingerprint.1=AQABtK...&mbid.1=e5f67890...
|
||||
```
|
||||
|
||||
**Example Request**:
|
||||
```
|
||||
POST /v2/submit
|
||||
Content-Type: application/x-www-form-urlencoded
|
||||
|
||||
client=8XaBELgH&user=AbCdEfGh&duration.0=240&fingerprint.0=AQADtNGiJE...&mbid.0=a1b2c3d4-e5f6-7890-abcd-ef1234567890
|
||||
```
|
||||
|
||||
**Example Response**:
|
||||
```json
|
||||
{
|
||||
"status": "ok",
|
||||
"submissions": [
|
||||
{
|
||||
"id": 12345678,
|
||||
"status": "pending"
|
||||
}
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
**Response Fields**:
|
||||
|
||||
| Field | Type | Description |
|
||||
|-------|------|-------------|
|
||||
| `status` | string | `ok` or `error` |
|
||||
| `submissions` | array | Array of submission results |
|
||||
| `submissions[].id` | integer | Submission ID |
|
||||
| `submissions[].status` | string | `pending`, `imported`, or `error` |
|
||||
|
||||
### Submission Status
|
||||
|
||||
Check the processing status of submitted fingerprints.
|
||||
|
||||
#### `/v2/submission_status`
|
||||
|
||||
**Method**: GET
|
||||
**Authentication**: Required (client key)
|
||||
|
||||
**Parameters**:
|
||||
|
||||
| Parameter | Type | Description |
|
||||
|-----------|------|-------------|
|
||||
| `client` | string | Application API key |
|
||||
| `id` | integer | Submission ID (from submit response) |
|
||||
| `format` | string | Response format: `json`, `xml`, `jsonp` |
|
||||
|
||||
**Example Request**:
|
||||
```
|
||||
GET /v2/submission_status?client=8XaBELgH&id=12345678
|
||||
```
|
||||
|
||||
**Example Response**:
|
||||
```json
|
||||
{
|
||||
"status": "ok",
|
||||
"submission": {
|
||||
"id": 12345678,
|
||||
"status": "imported",
|
||||
"result": {
|
||||
"id": "7e8b1234-5678-90ab-cdef-1234567890ab"
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
**Status Values**:
|
||||
- `pending`: Queued for processing
|
||||
- `imported`: Successfully processed
|
||||
- `error`: Processing failed
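
Because submissions are processed asynchronously, a client has to poll until the status leaves `pending`. This is a minimal sketch: `fetch_status` is a hypothetical callable that queries `/v2/submission_status` and returns the status string, and the 5-second interval and 12-poll cap are arbitrary choices.

```python
import time


def wait_for_import(fetch_status, submission_id, interval=5.0, max_polls=12,
                    sleep=time.sleep):
    """Poll until the submission is 'imported' or 'error', or give up as 'pending'."""
    for attempt in range(max_polls):
        status = fetch_status(submission_id)
        if status in ("imported", "error"):
            return status
        if attempt < max_polls - 1:
            sleep(interval)  # wait before asking again
    return "pending"


# Usage with a stand-in for the real HTTP call.
fake_statuses = iter(["pending", "pending", "imported"])
print(wait_for_import(lambda _id: next(fake_statuses), 12345678, sleep=lambda s: None))
```

Passing `sleep` in as a parameter keeps the loop easy to test without real delays.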
|
||||
|
||||
### Fingerprint Retrieval
|
||||
|
||||
Retrieve stored fingerprint data.
|
||||
|
||||
#### `/v2/fingerprint`
|
||||
|
||||
**Method**: GET
|
||||
**Authentication**: Required (client key)
|
||||
|
||||
**Parameters**:
|
||||
|
||||
| Parameter | Type | Description |
|
||||
|-----------|------|-------------|
|
||||
| `client` | string | Application API key |
|
||||
| `id` | string | AcoustID track ID |
|
||||
| `format` | string | Response format: `json`, `xml`, `jsonp` |
|
||||
|
||||
**Example Request**:
|
||||
```
|
||||
GET /v2/fingerprint?client=8XaBELgH&id=7e8b1234-5678-90ab-cdef-1234567890ab
|
||||
```
|
||||
|
||||
**Example Response**:
|
||||
```json
|
||||
{
|
||||
"status": "ok",
|
||||
"fingerprints": [
|
||||
{
|
||||
"id": 987654321,
|
||||
"fingerprint": "AQADtNGiJE...",
|
||||
"duration": 240,
|
||||
"submission_count": 5
|
||||
}
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
### Track Listing by MBID
|
||||
|
||||
List AcoustID tracks linked to a MusicBrainz recording.
|
||||
|
||||
#### `/v2/track/list_by_mbid`
|
||||
|
||||
**Method**: GET
|
||||
**Authentication**: Required (client key)
|
||||
|
||||
**Parameters**:
|
||||
|
||||
| Parameter | Type | Description |
|
||||
|-----------|------|-------------|
|
||||
| `client` | string | Application API key |
|
||||
| `mbid` | string | MusicBrainz recording MBID |
|
||||
| `format` | string | Response format: `json`, `xml`, `jsonp` |
|
||||
|
||||
**Example Request**:
|
||||
```
|
||||
GET /v2/track/list_by_mbid?client=8XaBELgH&mbid=a1b2c3d4-e5f6-7890-abcd-ef1234567890
|
||||
```
|
||||
|
||||
**Example Response**:
|
||||
```json
|
||||
{
|
||||
"status": "ok",
|
||||
"tracks": [
|
||||
{
|
||||
"id": "7e8b1234-5678-90ab-cdef-1234567890ab",
|
||||
"disabled": false
|
||||
}
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
### Track Listing by PUID
|
||||
|
||||
List AcoustID tracks linked to a MusicIP PUID (legacy).
|
||||
|
||||
#### `/v2/track/list_by_puid`
|
||||
|
||||
**Method**: GET
|
||||
**Authentication**: Required (client key)
|
||||
|
||||
**Parameters**:
|
||||
|
||||
| Parameter | Type | Description |
|
||||
|-----------|------|-------------|
|
||||
| `client` | string | Application API key |
|
||||
| `puid` | string | MusicIP PUID |
|
||||
| `format` | string | Response format: `json`, `xml`, `jsonp` |
|
||||
|
||||
### User Management
|
||||
|
||||
#### `/v2/user/lookup`
|
||||
|
||||
Look up a user API key by MusicBrainz account.
|
||||
|
||||
**Method**: POST
|
||||
**Authentication**: Required (client key)
|
||||
|
||||
**Parameters**:
|
||||
|
||||
| Parameter | Type | Description |
|
||||
|-----------|------|-------------|
|
||||
| `client` | string | Application API key |
|
||||
| `musicbrainz_id` | string | MusicBrainz username |
|
||||
|
||||
#### `/v2/user/create_anonymous`
|
||||
|
||||
Create anonymous user API key.
|
||||
|
||||
**Method**: POST
|
||||
**Authentication**: Required (client key)
|
||||
|
||||
**Parameters**:
|
||||
|
||||
| Parameter | Type | Description |
|
||||
|-----------|------|-------------|
|
||||
| `client` | string | Application API key |
|
||||
|
||||
**Example Response**:
|
||||
```json
|
||||
{
|
||||
"status": "ok",
|
||||
"user": {
|
||||
"apikey": "AbCdEfGh"
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
#### `/v2/user/create_musicbrainz`
|
||||
|
||||
Create user API key linked to MusicBrainz account.
|
||||
|
||||
**Method**: POST
|
||||
**Authentication**: Required (client key)
|
||||
|
||||
**Parameters**:
|
||||
|
||||
| Parameter | Type | Description |
|
||||
|-----------|------|-------------|
|
||||
| `client` | string | Application API key |
|
||||
| `access_token` | string | MusicBrainz OAuth access token |
|
||||
|
||||
## Legacy API Endpoints
|
||||
|
||||
### `/lookup`
|
||||
|
||||
Legacy lookup endpoint (API v1).
|
||||
|
||||
**Status**: Deprecated, use `/v2/lookup` instead
|
||||
**Differences**: Limited metadata options, different response format
|
||||
|
||||
### `/submit`
|
||||
|
||||
Legacy submit endpoint (API v1).
|
||||
|
||||
**Status**: Deprecated, use `/v2/submit` instead
|
||||
**Differences**: Synchronous processing, no batch support
|
||||
|
||||
## Health Check Endpoints
|
||||
|
||||
### `/_health`
|
||||
|
||||
Full health check with database write test.
|
||||
|
||||
**Method**: GET
|
||||
**Authentication**: None
|
||||
|
||||
**Response**:
|
||||
```json
|
||||
{
|
||||
"status": "ok"
|
||||
}
|
||||
```
|
||||
|
||||
**Status Codes**:
|
||||
- `200`: All systems operational
|
||||
- `503`: Service unavailable
|
||||
|
||||
### `/_health_ro`
|
||||
|
||||
Read-only health check (database read test only).
|
||||
|
||||
**Method**: GET
|
||||
**Authentication**: None
|
||||
|
||||
### `/_health_docker`
|
||||
|
||||
Docker-specific health check (minimal checks).
|
||||
|
||||
**Method**: GET
|
||||
**Authentication**: None
|
||||
|
||||
## Internal API Endpoints
|
||||
|
||||
These endpoints are for administrative use only and require special authentication.
|
||||
|
||||
### `/v2/internal/update_lookup_stats`
|
||||
|
||||
Trigger lookup statistics update.
|
||||
|
||||
**Method**: POST
|
||||
**Authentication**: Internal only
|
||||
|
||||
### `/v2/internal/update_user_agent_stats`
|
||||
|
||||
Trigger user agent statistics update.
|
||||
|
||||
**Method**: POST
|
||||
**Authentication**: Internal only
|
||||
|
||||
### `/v2/internal/lookup_stats`
|
||||
|
||||
Retrieve lookup statistics.
|
||||
|
||||
**Method**: GET
|
||||
**Authentication**: Internal only
|
||||
|
||||
### `/v2/internal/create_account`
|
||||
|
||||
Create new user account.
|
||||
|
||||
**Method**: POST
|
||||
**Authentication**: Internal only
|
||||
|
||||
### `/v2/internal/create_application`
|
||||
|
||||
Create new API application.
|
||||
|
||||
**Method**: POST
|
||||
**Authentication**: Internal only
|
||||
|
||||
### `/v2/internal/update_application_status`
|
||||
|
||||
Update application status (active/inactive).
|
||||
|
||||
**Method**: POST
|
||||
**Authentication**: Internal only
|
||||
|
||||
### `/v2/internal/check_application`
|
||||
|
||||
Check application validity.
|
||||
|
||||
**Method**: GET
|
||||
**Authentication**: Internal only
|
||||
|
||||
## Index API Endpoints
|
||||
|
||||
The fingerprint index service exposes its own HTTP API (separate from the main API).
|
||||
|
||||
**Base URL**: `http://index:6081` (internal)
|
||||
**Protocol**: HTTP
|
||||
**Format**: MessagePack
|
||||
|
||||
### `PUT /:index`
|
||||
|
||||
Create new index.
|
||||
|
||||
**Parameters**:
|
||||
- `:index`: Index name
|
||||
|
||||
### `GET /:index`
|
||||
|
||||
Get index information.
|
||||
|
||||
**Response**:
|
||||
```json
|
||||
{
|
||||
"name": "fingerprints",
|
||||
"doc_count": 1234567,
|
||||
"segment_count": 42,
|
||||
"memory_segment_size": 1048576
|
||||
}
|
||||
```
|
||||
|
||||
### `DELETE /:index`
|
||||
|
||||
Delete index.
|
||||
|
||||
### `POST /:index/_search`
|
||||
|
||||
Search for fingerprints.
|
||||
|
||||
**Request Body** (MessagePack):
|
||||
```python
|
||||
{
|
||||
"query": [term1, term2, term3, ...],
|
||||
"limit": 10,
|
||||
"min_score": 0.5
|
||||
}
|
||||
```
|
||||
|
||||
**Response** (MessagePack):
|
||||
```python
|
||||
{
|
||||
"results": [
|
||||
{"id": fpid1, "score": 0.95},
|
||||
{"id": fpid2, "score": 0.87}
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
### `POST /:index/_update`
|
||||
|
||||
Batch update fingerprints.
|
||||
|
||||
**Request Body** (MessagePack):
|
||||
```python
|
||||
{
|
||||
"updates": [
|
||||
{"id": fpid1, "terms": [term1, term2, ...]},
|
||||
{"id": fpid2, "terms": [term3, term4, ...]}
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
### `GET /:index/_segments`
|
||||
|
||||
List index segments.
|
||||
|
||||
**Response**:
|
||||
```json
|
||||
{
|
||||
"segments": [
|
||||
{
|
||||
"id": 0,
|
||||
"type": "memory",
|
||||
"doc_count": 1024,
|
||||
"size_bytes": 1048576
|
||||
},
|
||||
{
|
||||
"id": 1,
|
||||
"type": "file",
|
||||
"doc_count": 100000,
|
||||
"size_bytes": 52428800
|
||||
}
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
### `GET /:index/_snapshot`
|
||||
|
||||
Create index snapshot.
|
||||
|
||||
**Response**:
|
||||
```json
|
||||
{
|
||||
"snapshot_id": "snapshot_20250428_120000",
|
||||
"path": "/var/lib/acoustid-index/snapshots/snapshot_20250428_120000"
|
||||
}
|
||||
```
|
||||
|
||||
### `PUT /:index/:fpid`
|
||||
|
||||
Insert or update fingerprint.
|
||||
|
||||
**Parameters**:
|
||||
- `:index`: Index name
|
||||
- `:fpid`: Fingerprint ID
|
||||
|
||||
**Request Body** (MessagePack):
|
||||
```python
|
||||
{
|
||||
"terms": [term1, term2, term3, ...]
|
||||
}
|
||||
```
|
||||
|
||||
### `GET /:index/:fpid`
|
||||
|
||||
Retrieve fingerprint.
|
||||
|
||||
**Response** (MessagePack):
|
||||
```python
|
||||
{
|
||||
"id": fpid,
|
||||
"terms": [term1, term2, term3, ...]
|
||||
}
|
||||
```
|
||||
|
||||
### `DELETE /:index/:fpid`
|
||||
|
||||
Delete fingerprint.
|
||||
|
||||
### `GET /_health`
|
||||
|
||||
Index health check.
|
||||
|
||||
**Response**:
|
||||
```json
|
||||
{
|
||||
"status": "ok"
|
||||
}
|
||||
```
|
||||
|
||||
### `GET /_metrics`
|
||||
|
||||
Prometheus metrics.
|
||||
|
||||
**Response** (Prometheus text format):
|
||||
```
|
||||
# HELP fpindex_search_duration_seconds Search duration
|
||||
# TYPE fpindex_search_duration_seconds histogram
|
||||
fpindex_search_duration_seconds_bucket{le="0.005"} 1234
|
||||
fpindex_search_duration_seconds_bucket{le="0.01"} 5678
|
||||
...
|
||||
```
|
||||
|
||||
## Rate Limiting
|
||||
|
||||
### Rate Limit Tiers
|
||||
|
||||
AcoustID implements a three-tier rate limiting system:
|
||||
|
||||
| Tier | Scope | Default Limit | Override |
|------|-------|---------------|----------|
| Global | All requests | 3 req/s | Config: `cluster.rate_limiter.global_limit` |
| Application | Per API key | 10 req/s | Database: `application.rate_limit` |
| IP Address | Per client IP | 3 req/s | Config: `cluster.rate_limiter.ip_limit` |
|
||||
|
||||
### Rate Limit Algorithm
|
||||
|
||||
**Implementation**: Redis-based sliding window
|
||||
|
||||
**Window Configuration**:
|
||||
- Window duration: 20 seconds
|
||||
- Window steps: 4 (5-second buckets)
|
||||
- Cleanup: Automatic expiration (25-second TTL)
|
||||
|
||||
**Redis Keys**:
|
||||
```
|
||||
rl:bucket:global:{timestamp}
|
||||
rl:bucket:app:{api_key}:{timestamp}
|
||||
rl:bucket:ip:{ip_address}:{timestamp}
|
||||
```
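
The bucketed window described above can be illustrated without Redis. The following in-memory sketch is not the server's implementation: the class name, the `(key, bucket_start)` layout, and the reading of the limit as "requests per second times window length" are all assumptions.

```python
import time
from collections import defaultdict


class BucketedWindowLimiter:
    """Sliding window made of fixed buckets (default: 20 s window, 4 x 5 s buckets)."""

    def __init__(self, limit_per_second, window=20, steps=4, clock=time.time):
        self.max_requests = limit_per_second * window  # assumed interpretation
        self.window = window
        self.bucket_size = window // steps
        self.clock = clock
        self.buckets = defaultdict(int)  # (key, bucket_start) -> request count

    def allow(self, key):
        now = int(self.clock())
        current = now - now % self.bucket_size
        oldest = current - self.window + self.bucket_size
        # Drop expired buckets; Redis would do this through key TTLs.
        for stale in [b for b in self.buckets if b[1] < oldest]:
            del self.buckets[stale]
        used = sum(n for (name, _), n in self.buckets.items() if name == key)
        if used >= self.max_requests:
            return False
        self.buckets[(key, current)] += 1
        return True


limiter = BucketedWindowLimiter(limit_per_second=3)
print(limiter.allow("ip:203.0.113.7"))
```

A Redis-backed version would replace the dictionary with one counter key per bucket, incremented atomically and given an expiry slightly longer than the window.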
|
||||
|
||||
### Rate Limit Headers
|
||||
|
||||
Responses include rate limit information:
|
||||
|
||||
```
|
||||
X-RateLimit-Limit: 10
|
||||
X-RateLimit-Remaining: 7
|
||||
X-RateLimit-Reset: 1714305600
|
||||
```
|
||||
|
||||
### Rate Limit Exceeded Response
|
||||
|
||||
**Status Code**: 429 Too Many Requests
|
||||
|
||||
**Response**:
|
||||
```json
|
||||
{
|
||||
"status": "error",
|
||||
"error": {
|
||||
"code": 5,
|
||||
"message": "Rate limit exceeded"
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
## Error Handling
|
||||
|
||||
### Error Response Format
|
||||
|
||||
All errors return a consistent structure:
|
||||
|
||||
```json
|
||||
{
|
||||
"status": "error",
|
||||
"error": {
|
||||
"code": 1,
|
||||
"message": "Invalid API key"
|
||||
}
|
||||
}
|
||||
```
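
Since every error shares this structure, a client can convert it into an exception in one place. In this sketch, `AcoustIDError` and `raise_for_status` are hypothetical names; only the `status` and `error.code` / `error.message` fields come from the format shown above.

```python
class AcoustIDError(Exception):
    """Carries the numeric code and message from an error response."""

    def __init__(self, code, message):
        super().__init__(f"AcoustID error {code}: {message}")
        self.code = code
        self.message = message


def raise_for_status(payload):
    """Return the decoded payload unchanged if status is 'ok', otherwise raise."""
    if payload.get("status") == "ok":
        return payload
    error = payload.get("error", {})
    raise AcoustIDError(error.get("code"), error.get("message", "unknown error"))


try:
    raise_for_status({"status": "error", "error": {"code": 1, "message": "Invalid API key"}})
except AcoustIDError as exc:
    print(exc.code, exc.message)
```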
|
||||
|
||||
### Error Codes
|
||||
|
||||
| Code | Message | Description |
|------|---------|-------------|
| 1 | Invalid API key | Client or user key is invalid |
| 2 | Missing required parameter | Required parameter not provided |
| 3 | Invalid fingerprint | Fingerprint format is invalid |
| 4 | Internal error | Server-side error occurred |
| 5 | Rate limit exceeded | Too many requests |
| 6 | Invalid format | Unsupported response format |
| 7 | Fingerprint not found | Requested fingerprint doesn't exist |
| 8 | Too many requests | Batch size exceeds limit |
|
||||
|
||||
### HTTP Status Codes
|
||||
|
||||
| Code | Meaning | Usage |
|------|---------|-------|
| 200 | OK | Successful request |
| 400 | Bad Request | Invalid parameters |
| 401 | Unauthorized | Missing or invalid API key |
| 403 | Forbidden | API key lacks permission |
| 404 | Not Found | Resource not found |
| 429 | Too Many Requests | Rate limit exceeded |
| 500 | Internal Server Error | Server error |
| 503 | Service Unavailable | Service down or degraded |
|
||||
|
||||
## Authentication
|
||||
|
||||
### API Key Types
|
||||
|
||||
1. **Application Key** (`client` parameter):
|
||||
- Identifies the client application
|
||||
- Required for all API calls
|
||||
- Obtain from https://acoustid.org/new-application
|
||||
|
||||
2. **User Key** (`user` parameter):
|
||||
- Identifies the end user
|
||||
- Required for submissions
|
||||
- Created via `/v2/user/create_*` endpoints
|
||||
|
||||
3. **Demo Key**:
|
||||
- Limited functionality
|
||||
- For testing only
|
||||
- Key: `8XaBELgH`
|
||||
|
||||
### Key Management
|
||||
|
||||
**Application Keys**:
|
||||
- Created via web UI or internal API
|
||||
- Can be active or inactive
|
||||
- Rate limits configurable per key
|
||||
- Usage statistics tracked
|
||||
|
||||
**User Keys**:
|
||||
- Anonymous or MusicBrainz-linked
|
||||
- Created programmatically
|
||||
- Tied to application key
|
||||
- Submission history tracked
|
||||
|
||||
## Best Practices
|
||||
|
||||
### Lookup Optimization
|
||||
|
||||
1. **Use batch lookups** for multiple files (up to 20 per request)
|
||||
2. **Request only needed metadata** (use specific `meta` flags)
|
||||
3. **Cache results** to avoid redundant lookups
|
||||
4. **Handle rate limits** with exponential backoff
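
Exponential backoff for rate-limited or failing requests might look like the sketch below. `RetryableError` is a hypothetical marker exception that a client would raise on HTTP 429 or 5xx responses. The 1-second base delay, 30-second cap, and 10% jitter are illustrative choices.

```python
import random
import time


class RetryableError(Exception):
    """Hypothetical marker for HTTP 429 and 5xx responses."""


def call_with_backoff(request, max_attempts=5, base_delay=1.0, max_delay=30.0,
                      sleep=time.sleep):
    """Call request() and retry on RetryableError, doubling the delay each time."""
    for attempt in range(max_attempts):
        try:
            return request()
        except RetryableError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller see the failure
            delay = min(max_delay, base_delay * 2 ** attempt)
            # A little random jitter keeps many clients from retrying in lockstep.
            sleep(delay + random.uniform(0, delay / 10))
```

Injecting `sleep` makes the retry schedule testable without waiting in real time.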
|
||||
|
||||
### Submission Guidelines
|
||||
|
||||
1. **Include MBIDs** when known (improves accuracy)
|
||||
2. **Provide metadata** (artist, album, track) for better matching
|
||||
3. **Use batch submissions** for efficiency
|
||||
4. **Poll submission status** asynchronously
|
||||
|
||||
### Error Handling
|
||||
|
||||
1. **Retry on 5xx errors** with exponential backoff
|
||||
2. **Respect rate limits** (check headers)
|
||||
3. **Validate fingerprints** before submission
|
||||
4. **Log errors** for debugging
|
||||
|
||||
### Performance
|
||||
|
||||
1. **Use POST** for large requests (avoid URL length limits)
|
||||
2. **Enable compression** (`meta=compress`)
|
||||
3. **Reuse connections** (HTTP keep-alive)
|
||||
4. **Implement timeouts** (30-60 seconds recommended)
|
||||
@@ -0,0 +1,611 @@
|
||||
# AcoustID Architecture
|
||||
|
||||
## System Architecture Overview
|
||||
|
||||
AcoustID employs a **monolithic multi-process architecture** with microservice-like separation of concerns. The system is split into two major repositories with distinct responsibilities:
|
||||
|
||||
1. **acoustid-server**: Monolithic Python application with multiple process types
|
||||
2. **acoustid-index**: Standalone Zig service for fingerprint indexing
|
||||
|
||||
## Server Architecture
|
||||
|
||||
### Process Types
|
||||
|
||||
The server runs as multiple independent processes, each with a specific role:
|
||||
|
||||
| Process | Entry Point | Purpose | Scaling |
|---------|-------------|---------|---------|
| API | `acoustid.server:make_application()` | Handle API requests | Horizontal |
| Web | `acoustid.server:make_application()` | Serve web UI | Horizontal |
| Worker | `acoustid.worker:run()` | Process background jobs | Horizontal |
| Cron | `acoustid.cron:run()` | Execute scheduled tasks | Single instance |
| Import | `acoustid.scripts.import_submissions` | Bulk import fingerprints | Manual |
|
||||
|
||||
### Directory Structure
|
||||
|
||||
```
|
||||
acoustid/
|
||||
├── api/ # API layer
|
||||
│ ├── __init__.py # API application factory
|
||||
│ ├── errors.py # Error handling
|
||||
│ ├── ratelimit.py # Rate limiting logic
|
||||
│ └── v2/ # API v2 endpoints
|
||||
│ ├── __init__.py
|
||||
│ ├── lookup.py # Fingerprint lookup
|
||||
│ ├── submit.py # Fingerprint submission
|
||||
│ ├── misc.py # Utility endpoints
|
||||
│ └── internal.py # Internal admin endpoints
|
||||
├── data/ # Business logic layer
|
||||
│ ├── account.py # User account operations
|
||||
│ ├── application.py # API application management
|
||||
│ ├── fingerprint.py # Fingerprint operations
|
||||
│ ├── foreignid.py # Foreign ID management
|
||||
│ ├── meta.py # Metadata operations
|
||||
│ ├── musicbrainz.py # MusicBrainz queries
|
||||
│ ├── stats.py # Statistics tracking
|
||||
│ ├── submission.py # Submission processing
|
||||
│ └── track.py # Track operations
|
||||
├── future/ # Starlette migration
|
||||
│ ├── app.py # ASGI application
|
||||
│ ├── lookup.py # Async lookup handler
|
||||
│ └── submit.py # Async submit handler
|
||||
├── web/ # Web UI layer
|
||||
│ ├── __init__.py # Web application factory
|
||||
│ ├── views/ # View handlers
|
||||
│ └── templates/ # Jinja2 templates
|
||||
├── scripts/ # Utility scripts
|
||||
│ ├── import_submissions.py
|
||||
│ ├── backfill_fingerprint_index.py
|
||||
│ └── update_lookup_stats.py
|
||||
├── cli.py # CLI command definitions
|
||||
├── server.py # WSGI/ASGI application
|
||||
├── worker.py # Background worker
|
||||
├── cron.py # Cron job scheduler
|
||||
├── fingerprint.py # Fingerprint utilities
|
||||
├── indexclient.py # Legacy TCP index client
|
||||
├── fpstore.py # Modern HTTP index client
|
||||
├── db.py # Database connection management
|
||||
├── config.py # Configuration loading
|
||||
└── tables.py # SQLAlchemy ORM models
|
||||
```
|
||||
|
||||
### Layered Architecture
|
||||
|
||||
The server follows a traditional layered architecture:
|
||||
|
||||
```
|
||||
┌─────────────────────────────────────────┐
|
||||
│ Presentation Layer │
|
||||
│ (api/, web/, future/) │
|
||||
│ - HTTP request/response handling │
|
||||
│ - Input validation │
|
||||
│ - Response formatting │
|
||||
└─────────────────────────────────────────┘
|
||||
↓
|
||||
┌─────────────────────────────────────────┐
|
||||
│ Business Logic Layer │
|
||||
│ (data/) │
|
||||
│ - Domain operations │
|
||||
│ - Business rules │
|
||||
│ - Orchestration │
|
||||
└─────────────────────────────────────────┘
|
||||
↓
|
||||
┌─────────────────────────────────────────┐
|
||||
│ Data Access Layer │
|
||||
│ (db.py, tables.py) │
|
||||
│ - Database queries │
|
||||
│ - ORM models │
|
||||
│ - Transaction management │
|
||||
└─────────────────────────────────────────┘
|
||||
↓
|
||||
┌─────────────────────────────────────────┐
|
||||
│ External Services Layer │
|
||||
│ (indexclient.py, fpstore.py) │
|
||||
│ - Index communication │
|
||||
│ - MusicBrainz queries │
|
||||
│ - Redis operations │
|
||||
└─────────────────────────────────────────┘
|
||||
```
|
||||
|
||||
### Framework Transition
|
||||
|
||||
The server is actively transitioning from Flask to Starlette:
|
||||
|
||||
**Current (Flask/Werkzeug)**:
|
||||
- Location: `acoustid/api/`, `acoustid/web/`
|
||||
- WSGI-based synchronous request handling
|
||||
- Gunicorn as application server
|
||||
- Blocking database operations with psycopg2
|
||||
|
||||
**Future (Starlette)**:
|
||||
- Location: `acoustid/future/`
|
||||
- ASGI-based asynchronous request handling
|
||||
- Uvicorn as application server
|
||||
- Async database operations with asyncpg
|
||||
|
||||
**Migration Status**:
|
||||
- Core lookup and submit endpoints have async implementations
|
||||
- Legacy endpoints still use Flask
|
||||
- Both frameworks run simultaneously during transition
|
||||
- Configuration flag controls which implementation is used
|
||||
|
||||
## Index Architecture
|
||||
|
||||
### LSM-Tree Design
|
||||
|
||||
The index uses a **Log-Structured Merge-tree (LSM-tree)** for efficient fingerprint storage and retrieval.
|
||||
|
||||
**Core Concept**:
|
||||
- Writes go to in-memory segment (fast)
|
||||
- Memory segment periodically flushed to disk
|
||||
- Background process merges disk segments
|
||||
- Reads check memory segment first, then disk segments
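
The write and read paths listed above can be illustrated with a toy key-value store. This is a conceptual sketch only, not the Zig implementation: the class name `TinyLSM`, the item-count flush threshold, and the use of plain dicts in place of on-disk files are all assumptions.

```python
class TinyLSM:
    """Toy LSM store: one mutable memory segment plus immutable flushed segments."""

    def __init__(self, flush_threshold=2):
        self.flush_threshold = flush_threshold
        self.memory = {}     # in-memory segment, receives all writes
        self.segments = []   # flushed segments, oldest first

    def put(self, key, value):
        self.memory[key] = value
        if len(self.memory) >= self.flush_threshold:
            self.flush()

    def flush(self):
        if self.memory:
            # A sorted dict stands in for an immutable on-disk segment file.
            self.segments.append(dict(sorted(self.memory.items())))
            self.memory = {}

    def get(self, key):
        if key in self.memory:                    # newest data first
            return self.memory[key]
        for segment in reversed(self.segments):   # then newest flushed segment
            if key in segment:
                return segment[key]
        return None

    def merge(self):
        """Background merge: fold all flushed segments into one."""
        merged = {}
        for segment in self.segments:  # oldest first, so newer values win
            merged.update(segment)
        self.segments = [dict(sorted(merged.items()))] if merged else []
```

Reads get slower as segments accumulate, which is why a background merge that reduces the segment count matters.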
|
||||
|
||||
**Components**:
|
||||
|
||||
```
┌─────────────────────────────────────────┐
│  MultiIndex                             │
│  - Manages multiple named indexes       │
│  - Routes requests to correct index     │
└─────────────────────────────────────────┘
                     ↓
┌─────────────────────────────────────────┐
│  Index                                  │
│  - Single fingerprint index             │
│  - Coordinates segments and merging     │
└─────────────────────────────────────────┘
                     ↓
┌──────────────────┬──────────────────────┐
│  MemorySegment   │  FileSegment(s)      │
│  - In-memory     │  - On-disk           │
│  - Fast writes   │  - Immutable         │
│  - Volatile      │  - Persistent        │
└──────────────────┴──────────────────────┘
                     ↓
┌─────────────────────────────────────────┐
│  Oplog (Write-Ahead Log)                │
│  - Durability for memory segment        │
│  - Replay on crash recovery             │
└─────────────────────────────────────────┘
```
|
||||
|
||||
### Segment Management
|
||||
|
||||
**MemorySegment** (`src/MemorySegment.zig`):
|
||||
- Hash map of fingerprint ID to posting list
|
||||
- Posting list: array of term IDs (compressed)
|
||||
- Maximum size threshold triggers flush
|
||||
- Backed by Oplog for durability
|
||||
|
||||
**FileSegment** (`src/FileSegment.zig`):
|
||||
- Immutable on-disk segment
|
||||
- Binary file format with index and data sections
|
||||
- StreamVByte compression for posting lists
|
||||
- Memory-mapped for fast reads
|
||||
|
||||
**Segment Lifecycle**:
|
||||
1. Writes accumulate in MemorySegment
|
||||
2. MemorySegment reaches size threshold
|
||||
3. Flush to new FileSegment
|
||||
4. Clear MemorySegment and Oplog
|
||||
5. Background merger selects segments to merge
|
||||
6. Merge creates new larger FileSegment
|
||||
7. Delete old segments
|
||||
|
||||
### Merge Policy

**Tiered Merge Strategy**:
- Segments grouped into tiers by size
- Tier 0: Smallest segments (recently flushed)
- Tier N: Largest segments (heavily merged)
- Merge triggered when tier has too many segments
- Merges segments within same tier

**Benefits**:
- Write amplification bounded
- Read performance improves over time
- Disk space reclaimed from deleted entries

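
A tiered policy of this kind can be sketched as follows. The tier boundaries (powers of `tier_base`) and the `max_per_tier` limit are hypothetical parameters chosen for the example; the actual thresholds used by the index are not stated here.

```python
from collections import defaultdict


def tier_of(size: int, tier_base: int = 10) -> int:
    """Tier 0 holds sizes below tier_base, tier 1 below tier_base**2, and so on."""
    tier = 0
    while size >= tier_base:
        size //= tier_base
        tier += 1
    return tier


def pick_merge(segment_sizes: dict, max_per_tier: int = 4, tier_base: int = 10) -> list:
    """Return the names of the segments to merge, or [] if no tier is overfull."""
    tiers = defaultdict(list)
    for name, size in segment_sizes.items():
        tiers[tier_of(size, tier_base)].append(name)
    for tier in sorted(tiers):  # check the smallest tier first
        if len(tiers[tier]) > max_per_tier:
            # Merge every segment of the overfull tier, smallest first.
            return sorted(tiers[tier], key=segment_sizes.get)
    return []
```

Merging only within a tier means each entry is rewritten roughly once per tier it passes through, which is what keeps write amplification bounded.
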
### File Format

**Segment File Structure** (`src/filefmt.zig`):

```
┌─────────────────────────────────────────┐
│  Header                                 │
│  - Magic number                         │
│  - Version                              │
│  - Metadata                             │
├─────────────────────────────────────────┤
│  Index Section                          │
│  - Fingerprint ID → Offset mapping      │
│  - Binary search tree or hash table     │
├─────────────────────────────────────────┤
│  Data Section                           │
│  - Compressed posting lists             │
│  - StreamVByte encoded                  │
└─────────────────────────────────────────┘
```

**Block Compression** (`src/block.zig`):
- Posting lists compressed in blocks
- StreamVByte SIMD compression
- Delta encoding for term IDs
- Typical compression ratio: 4-8x

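
The effect of delta encoding on a sorted posting list can be illustrated with a short Python sketch. It uses a plain LEB128-style variable-length integer as a simpler stand-in for StreamVByte (which stores control bytes separately and decodes with SIMD); the function names are hypothetical and the byte layout is not the one used by `src/block.zig`.

```python
def encode_block(sorted_ids: list) -> bytes:
    """Delta-encode ascending IDs, then write each gap as a LEB128-style varint."""
    out = bytearray()
    previous = 0
    for value in sorted_ids:
        gap = value - previous
        if gap < 0:
            raise ValueError("IDs must be in ascending order")
        previous = value
        while gap >= 0x80:
            out.append((gap & 0x7F) | 0x80)  # low 7 bits, continuation flag set
            gap >>= 7
        out.append(gap)
    return bytes(out)


def decode_block(data: bytes) -> list:
    """Inverse of encode_block: rebuild the absolute IDs from the gaps."""
    ids, current, gap, shift = [], 0, 0, 0
    for byte in data:
        gap |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
            continue
        current += gap
        ids.append(current)
        gap, shift = 0, 0
    return ids
```

Small gaps between neighbouring IDs fit in one byte instead of four, which is where most of the size reduction comes from.
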
### Index Reader

**IndexReader** (`src/IndexReader.zig`):
- Read-only view of index
- Merges results from all segments
- Implements search algorithm
- Returns top-K candidates by score

**Search Algorithm**:
1. Extract query terms from fingerprint
2. For each term, fetch posting lists from all segments
3. Merge posting lists (union)
4. Score each candidate by term overlap
5. Return top-K candidates sorted by score

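
Steps 2 to 5 of the search algorithm can be sketched in Python as below. The sketch assumes each segment is a mapping from term to a list of fingerprint IDs and scores a candidate as the fraction of distinct query terms it matches; both the layout and the scoring formula are illustrative assumptions, not the index's actual scoring.

```python
import heapq
from collections import Counter


def search(segments: list, query_terms: list, k: int = 10) -> list:
    """Return up to k (fingerprint_id, score) pairs, best score first."""
    hits = Counter()
    distinct_terms = set(query_terms)
    for term in distinct_terms:
        matching = set()
        for segment in segments:
            matching.update(segment.get(term, ()))  # union across segments
        hits.update(matching)                       # one hit per matching term
    total = len(distinct_terms) or 1
    scored = ((count / total, fp_id) for fp_id, count in hits.items())
    # Highest score first; ties are broken by the lower fingerprint ID.
    best = heapq.nsmallest(k, scored, key=lambda item: (-item[0], item[1]))
    return [(fp_id, score) for score, fp_id in best]
```
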
## Data Flow

### Submission Flow (Detailed)

```
┌─────────┐
│ Client  │
└────┬────┘
     │ POST /v2/submit
     ↓
┌─────────────────────────────────────────┐
│  SubmitHandler (api/v2/submit.py)       │
│  1. Validate API keys (client + user)   │
│  2. Check rate limits (Redis)           │
│  3. Decode fingerprints                 │
│  4. Insert into submission table        │
│  5. Publish to NATS queue               │
└─────────────────────────────────────────┘
     │
     ↓ NATS message
┌─────────────────────────────────────────┐
│  Worker (worker.py)                     │
│  1. Consume message from NATS           │
│  2. Load submission from database       │
└─────────────────────────────────────────┘
     │
     ↓
┌─────────────────────────────────────────┐
│  FingerprintSearcher (data/fingerprint) │
│  1. Extract query from fingerprint      │
│  2. Search index for matches            │
└─────────────────────────────────────────┘
     │
     ↓ HTTP POST /:index/_search
┌─────────────────────────────────────────┐
│  Index (fpindex)                        │
│  1. Decode MessagePack request          │
│  2. Search segments                     │
│  3. Score candidates                    │
│  4. Return top matches                  │
└─────────────────────────────────────────┘
     │
     ↓ Candidate fingerprint IDs
┌─────────────────────────────────────────┐
│  Worker (continued)                     │
│  1. Fetch candidate metadata from DB    │
│  2. Decide: create new track or link    │
│  3. Insert/update track tables          │
│  4. Update index with new fingerprint   │
│  5. Store result in submission_result   │
└─────────────────────────────────────────┘
     │
     ↓ HTTP PUT /:index/:fpid
┌─────────────────────────────────────────┐
│  Index (fpindex)                        │
│  1. Add fingerprint to MemorySegment    │
│  2. Append to Oplog                     │
│  3. Trigger flush if needed             │
└─────────────────────────────────────────┘
```

### Lookup Flow (Detailed)

```
┌─────────┐
│ Client  │
└────┬────┘
     │ GET/POST /v2/lookup
     ↓
┌─────────────────────────────────────────┐
│  LookupHandler (api/v2/lookup.py)       │
│  1. Validate API key (client)           │
│  2. Check rate limits (Redis)           │
│  3. Parse parameters                    │
└─────────────────────────────────────────┘
     │
     ↓
┌─────────────────────────────────────────┐
│  decode_fingerprint (fingerprint.py)    │
│  1. Decode base64 or compressed format  │
│  2. Decompress if needed                │
│  3. Parse Chromaprint data              │
└─────────────────────────────────────────┘
     │
     ↓
┌─────────────────────────────────────────┐
│  extract_query (fingerprint.py)         │
│  1. Extract hash terms from fingerprint │
│  2. Build query structure               │
└─────────────────────────────────────────┘
     │
     ↓
┌─────────────────────────────────────────┐
│  fpstore.search (fpstore.py)            │
│  1. Encode query as MessagePack         │
│  2. HTTP POST to index                  │
└─────────────────────────────────────────┘
     │
     ↓ HTTP POST /:index/_search
┌─────────────────────────────────────────┐
│  Index (fpindex)                        │
│  1. Parse MessagePack query             │
│  2. Search all segments                 │
│  3. Merge and score results             │
│  4. Return top-K candidates             │
└─────────────────────────────────────────┘
     │
     ↓ Candidate fingerprint IDs + scores
┌─────────────────────────────────────────┐
│  LookupHandler (continued)              │
│  1. Fetch fingerprint metadata from DB  │
│  2. Fetch track metadata from DB        │
│  3. Fetch MusicBrainz data if requested │
│  4. Build result structure              │
│  5. Format as JSON/XML                  │
└─────────────────────────────────────────┘
     │
     ↓ JSON response
┌─────────┐
│ Client  │
└─────────┘
```

### Background Processing

**Cron Jobs** (`acoustid/cron.py`):
- Update lookup statistics (hourly)
- Update user agent statistics (daily)
- Clean up old submissions (daily)
- Refresh materialized views (hourly)
- Backup index snapshots (daily)

**Worker Tasks** (`acoustid/worker.py`):
- Process fingerprint submissions
- Import bulk fingerprints
- Update index with new data
- Resolve MBID redirects
- Clean up orphaned records

## Index Communication Protocols

### Legacy Protocol (indexclient.py)

**Transport**: Raw TCP socket
**Port**: 6080 (default)
**Format**: Custom binary protocol

**Message Structure**:
```
┌────────────────┬────────────────┬────────────────┐
│  Length (4B)   │  Command (1B)  │  Payload       │
└────────────────┴────────────────┴────────────────┘
```

**Commands**:
- `0x01`: Search
- `0x02`: Insert
- `0x03`: Delete

**Status**: Being phased out, replaced by HTTP protocol

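
Framing a message in the length/command/payload layout of the legacy protocol could look like the sketch below. Two details are not specified above and are assumed here: the length field is big-endian, and it counts the command byte plus the payload. The function names are hypothetical.

```python
import struct

COMMANDS = {"search": 0x01, "insert": 0x02, "delete": 0x03}


def pack_message(command: str, payload: bytes) -> bytes:
    # Assumption: Length is big-endian and covers the command byte plus payload.
    return struct.pack(">IB", 1 + len(payload), COMMANDS[command]) + payload


def unpack_message(frame: bytes) -> tuple:
    """Split one frame into (command byte, payload), rejecting truncated input."""
    length, command = struct.unpack_from(">IB", frame)
    payload = frame[5 : 4 + length]
    if len(payload) != length - 1:
        raise ValueError("truncated frame")
    return command, payload
```
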
### Modern Protocol (fpstore.py)

**Transport**: HTTP/1.1
**Port**: 6081 (default)
**Format**: MessagePack

**Endpoints**:

| Method | Path | Purpose |
|--------|------|---------|
| POST | `/:index/_search` | Search for fingerprints |
| PUT | `/:index/:fpid` | Insert/update fingerprint |
| DELETE | `/:index/:fpid` | Delete fingerprint |
| GET | `/:index` | Get index info |
| GET | `/:index/_segments` | List segments |
| GET | `/:index/_snapshot` | Create snapshot |

**Search Request**:
```python
{
    "query": [term_id1, term_id2, ...],  # Query terms
    "limit": 10,                         # Max results
    "min_score": 0.5                     # Score threshold
}
```

**Search Response**:
```python
{
    "results": [
        {"id": fpid1, "score": 0.95},
        {"id": fpid2, "score": 0.87},
        ...
    ]
}
```

## Concurrency and Parallelism

### Server Concurrency

**API/Web Processes**:
- Multiple worker processes (Gunicorn/Uvicorn)
- Each process handles requests independently
- Shared-nothing architecture
- Database connection pooling per process

**Worker Processes**:
- Multiple worker instances
- NATS queue provides work distribution
- Each worker processes one submission at a time
- No shared state between workers

**Cron Process**:
- Single instance (leader election via database)
- Scheduled tasks run sequentially
- Long-running tasks delegated to workers

### Index Concurrency

**Thread Model**:
- Main thread: HTTP server
- Worker threads: Search and merge operations
- Configurable thread pool size

**Locking Strategy**:
- Read-write lock on Index
- Multiple concurrent readers
- Exclusive writer (for flush/merge)
- Lock-free MemorySegment (atomic operations)

**Background Tasks**:
- Segment merger runs in background thread
- Oplog flusher runs periodically
- Metrics collector runs independently

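
The read-write lock described under Locking Strategy (many concurrent readers, one exclusive writer) can be sketched in Python on top of `threading.Condition`. The class `ReadWriteLock` is a hypothetical illustration with no writer preference; it does not reflect how the Zig index implements its lock.

```python
import threading
from contextlib import contextmanager


class ReadWriteLock:
    """Many concurrent readers or one exclusive writer (no writer preference)."""

    def __init__(self) -> None:
        self._cond = threading.Condition()
        self._readers = 0
        self._writing = False

    @contextmanager
    def read(self):
        with self._cond:
            while self._writing:          # readers wait only for an active writer
                self._cond.wait()
            self._readers += 1
        try:
            yield
        finally:
            with self._cond:
                self._readers -= 1
                self._cond.notify_all()

    @contextmanager
    def write(self):
        with self._cond:
            while self._writing or self._readers:   # writer needs exclusive access
                self._cond.wait()
            self._writing = True
        try:
            yield
        finally:
            with self._cond:
                self._writing = False
                self._cond.notify_all()
```

Searches would wrap their work in `with lock.read():`, while a flush or merge that swaps the segment list would use `with lock.write():`.
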
## Scalability Considerations

### Horizontal Scaling

**API/Web**:
- Stateless processes
- Scale by adding more instances
- Load balancer distributes requests
- Session state in Redis (if needed)

**Workers**:
- Scale by adding more instances
- NATS queue distributes work
- No coordination required

**Index**:
- Multiple index instances (sharding)
- Consistent hashing for fingerprint distribution
- NATS for cluster coordination
- Each instance handles subset of fingerprints

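
Consistent hashing for distributing fingerprints across index instances can be sketched as a hash ring. The node names, the number of virtual nodes, and the use of SHA-1 for ring positions are assumptions made for this example.

```python
import bisect
import hashlib


class HashRing:
    """Consistent-hash ring with virtual nodes."""

    def __init__(self, nodes: list, vnodes: int = 64) -> None:
        self._ring = sorted(
            (self._hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._keys = [position for position, _ in self._ring]

    @staticmethod
    def _hash(value: str) -> int:
        return int.from_bytes(hashlib.sha1(value.encode()).digest()[:8], "big")

    def node_for(self, fingerprint_id: int) -> str:
        # First ring position clockwise from the key's hash, wrapping at the end.
        pos = bisect.bisect_right(self._keys, self._hash(str(fingerprint_id)))
        return self._ring[pos % len(self._ring)][1]
```

Adding an instance only moves the fingerprints that now fall on the new instance's ring positions; every other fingerprint keeps its owner.
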
### Vertical Scaling

**Database**:
- Connection pooling
- Read replicas for queries
- Partitioning for large tables
- Materialized views for aggregations

**Index**:
- More threads for search
- Larger memory segment
- Faster disk for segments
- More RAM for file caching

## Fault Tolerance

### Server Resilience

**Database Failures**:
- Connection retry with exponential backoff
- Health checks detect failures
- Read-only mode if write DB unavailable

**Index Failures**:
- Graceful degradation (return partial results)
- Retry with exponential backoff
- Circuit breaker pattern

**NATS Failures**:
- Persistent queue (JetStream)
- Automatic reconnection
- Message replay on recovery

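
The retry-with-exponential-backoff and circuit-breaker patterns listed for database and index failures can be sketched generically. All names and defaults here (`CircuitBreaker`, `retry`, `max_failures`, `reset_after`, `base_delay`) are hypothetical; the server's real thresholds are not given above.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised while the breaker refuses calls to a failing dependency."""


class CircuitBreaker:
    def __init__(self, max_failures: int = 3, reset_after: float = 30.0,
                 clock=time.monotonic) -> None:
        self.max_failures, self.reset_after, self.clock = max_failures, reset_after, clock
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("dependency marked unavailable")
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0
        return result


def retry(func, attempts: int = 4, base_delay: float = 0.1, sleep=time.sleep):
    """Call func, sleeping base_delay * 2**n between failed attempts."""
    for attempt in range(attempts):
        try:
            return func()
        except CircuitOpenError:
            raise  # do not hammer an open circuit
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

The `clock` and `sleep` parameters are injected so the behaviour can be exercised without real waiting.
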
### Index Resilience

**Crash Recovery**:
- Oplog replay restores MemorySegment
- FileSegments are immutable (no corruption)
- Incomplete merges discarded

**Data Integrity**:
- Checksums in file format
- Atomic file operations
- Write-ahead logging

**Replication**:
- NATS-based replication (optional)
- Snapshot-based backup
- Point-in-time recovery

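
The checksum and atomic-file-operation points under Data Integrity can be combined in a short Python sketch: write to a temporary file, append a CRC32 trailer, then rename into place. The trailer layout and the function names are assumptions for illustration and do not describe the actual segment file format.

```python
import os
import struct
import tempfile
import zlib


def write_segment_atomically(path: str, payload: bytes) -> None:
    """Write payload plus a CRC32 trailer to a temp file, then rename into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(payload)
            tmp.write(struct.pack("<I", zlib.crc32(payload)))
            tmp.flush()
            os.fsync(tmp.fileno())
        os.replace(tmp_path, path)  # atomic rename within one filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


def read_segment(path: str) -> bytes:
    """Return the payload, raising ValueError if the trailer does not match."""
    with open(path, "rb") as f:
        data = f.read()
    if len(data) < 4:
        raise ValueError(f"segment too short: {path}")
    payload, (stored,) = data[:-4], struct.unpack("<I", data[-4:])
    if zlib.crc32(payload) != stored:
        raise ValueError(f"checksum mismatch in {path}")
    return payload
```

A reader therefore sees either the old file or the complete new one, never a partially written segment, and an interrupted write leaves only a temporary file that can be discarded.
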
## Performance Characteristics

### Server Performance

**Lookup Latency**:
- P50: ~50ms (including index search)
- P95: ~200ms
- P99: ~500ms

**Bottlenecks**:
- Index search time (dominant)
- Database query time (metadata fetch)
- Network latency (MusicBrainz queries)

### Index Performance

**Search Latency**:
- P50: ~5ms
- P95: ~20ms
- P99: ~50ms

**Throughput**:
- ~1000 searches/second (single instance)
- ~500 inserts/second (single instance)

**Bottlenecks**:
- Disk I/O (segment reads)
- CPU (decompression and scoring)
- Memory (segment caching)

## Future Architecture Plans

### Server Modernization

1. Complete migration to Starlette/ASGI
2. Remove Flask dependencies
3. Async database operations everywhere
4. GraphQL API alongside REST

### Index Enhancements

1. Distributed index with automatic sharding
2. Replication for high availability
3. Incremental snapshots
4. Query result caching

### Infrastructure

1. Kubernetes deployment
2. Service mesh (Istio/Linkerd)
3. Distributed tracing (OpenTelemetry)
4. Advanced monitoring (Prometheus + Grafana)

@@ -0,0 +1,871 @@
# AcoustID Data Model

## Database Architecture

AcoustID uses a multi-database PostgreSQL architecture with separate databases for different concerns.

### Database Instances

| Database | Purpose | Tables | Extensions |
|----------|---------|--------|------------|
| `acoustid_app` | Application data (accounts, apps, stats) | 8 | pgcrypto |
| `acoustid_fingerprint` | Fingerprint and track data | 19 | intarray, acoustid, cube |
| `acoustid_ingest` | Submission processing | 3 | - |
| `musicbrainz` | MusicBrainz mirror (read-only) | Many | - |

### PostgreSQL Extensions

**intarray**: Integer array operations
- Used for fingerprint array queries
- Provides `&&` (overlap) and `@>` (contains) operators

**pgcrypto**: Cryptographic functions
- UUID generation (`gen_random_uuid()`)
- API key hashing

**acoustid** (custom): Fingerprint similarity functions
- `acoustid_compare(int[], int[])`: Compare two fingerprints
- `acoustid_extract_query(int[])`: Extract query terms
- Source: `acoustid-ext` C extension

**cube**: Multi-dimensional cube data type
- Used for simhash-based fingerprint indexing
- Enables fast approximate nearest neighbor search

## Core Tables

### Account Management (acoustid_app)

#### `account`

User accounts for API access.

| Column | Type | Constraints | Description |
|--------|------|-------------|-------------|
| `id` | SERIAL | PRIMARY KEY | Account ID |
| `name` | VARCHAR(255) | NOT NULL | Display name |
| `apikey` | VARCHAR(40) | UNIQUE, NOT NULL | API key (user key) |
| `mbuser` | VARCHAR(64) | UNIQUE | MusicBrainz username |
| `created` | TIMESTAMP | NOT NULL | Creation timestamp |
| `lastlogin` | TIMESTAMP | | Last login timestamp |
| `submission_count` | INTEGER | DEFAULT 0 | Total submissions |
| `application_id` | INTEGER | FOREIGN KEY | Default application |
| `application_version` | VARCHAR(255) | | Application version |
| `created_from` | INET | | Registration IP |
| `is_admin` | BOOLEAN | DEFAULT FALSE | Admin flag |

**Indexes**:
- `account_pkey` (PRIMARY KEY on `id`)
- `account_apikey_key` (UNIQUE on `apikey`)
- `account_mbuser_key` (UNIQUE on `mbuser`)

#### `application`

API client applications.

| Column | Type | Constraints | Description |
|--------|------|-------------|-------------|
| `id` | SERIAL | PRIMARY KEY | Application ID |
| `name` | VARCHAR(255) | NOT NULL | Application name |
| `version` | VARCHAR(255) | | Version string |
| `apikey` | VARCHAR(40) | UNIQUE, NOT NULL | API key (client key) |
| `created` | TIMESTAMP | NOT NULL | Creation timestamp |
| `active` | BOOLEAN | DEFAULT TRUE | Active status |
| `account_id` | INTEGER | FOREIGN KEY | Owner account |
| `email` | VARCHAR(255) | | Contact email |
| `website` | VARCHAR(1000) | | Website URL |
| `rate_limit` | INTEGER | | Custom rate limit (req/s) |

**Indexes**:
- `application_pkey` (PRIMARY KEY on `id`)
- `application_apikey_key` (UNIQUE on `apikey`)

#### `account_openid`

OpenID authentication links.

| Column | Type | Constraints | Description |
|--------|------|-------------|-------------|
| `openid` | VARCHAR(255) | PRIMARY KEY | OpenID identifier |
| `account_id` | INTEGER | FOREIGN KEY | Linked account |

#### `account_google`

Google OAuth authentication links.

| Column | Type | Constraints | Description |
|--------|------|-------------|-------------|
| `google_user_id` | VARCHAR(255) | PRIMARY KEY | Google user ID |
| `account_id` | INTEGER | FOREIGN KEY | Linked account |

### Fingerprint Data (acoustid_fingerprint)

#### `track`

Unique audio tracks identified by fingerprints.

| Column | Type | Constraints | Description |
|--------|------|-------------|-------------|
| `id` | SERIAL | PRIMARY KEY | Track ID |
| `gid` | UUID | UNIQUE, NOT NULL | Public track UUID |
| `created` | TIMESTAMP | NOT NULL | Creation timestamp |
| `new_id` | INTEGER | FOREIGN KEY | Merge target (if merged) |
| `disabled` | BOOLEAN | DEFAULT FALSE | Disabled flag |

**Indexes**:
- `track_pkey` (PRIMARY KEY on `id`)
- `track_gid_key` (UNIQUE on `gid`)
- `track_new_id_idx` (on `new_id`)

**Notes**:
- `gid` is the public-facing AcoustID track ID
- `new_id` points to merged track (for deduplication)
- Disabled tracks excluded from search results

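
Resolving a merged track means following `new_id` until a row without a merge target is reached. The sketch below works on a plain mapping of `id` to `new_id`; the function name is hypothetical, and it assumes merge chains may span more than one hop, which is not confirmed above.

```python
from typing import Dict, Optional


def resolve_track(track_id: int, new_ids: Dict[int, Optional[int]]) -> int:
    """Follow new_id pointers until reaching a track that has not been merged."""
    seen = set()
    while new_ids.get(track_id) is not None:
        if track_id in seen:
            raise ValueError(f"merge cycle detected at track {track_id}")
        seen.add(track_id)
        track_id = new_ids[track_id]
    return track_id
```
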
#### `fingerprint`

Audio fingerprints linked to tracks.

| Column | Type | Constraints | Description |
|--------|------|-------------|-------------|
| `id` | SERIAL | PRIMARY KEY | Fingerprint ID |
| `track_id` | INTEGER | FOREIGN KEY | Linked track |
| `fingerprint` | INTEGER[] | NOT NULL | Chromaprint hash array |
| `length` | SMALLINT | NOT NULL | Duration in seconds |
| `bitrate` | SMALLINT | | Audio bitrate (kbps) |
| `format_id` | INTEGER | FOREIGN KEY | Audio format |
| `created` | TIMESTAMP | NOT NULL | Creation timestamp |
| `submission_count` | INTEGER | DEFAULT 1 | Number of submissions |

**Indexes**:
- `fingerprint_pkey` (PRIMARY KEY on `id`)
- `fingerprint_track_id_idx` (on `track_id`)
- `fingerprint_length_idx` (on `length`)
- `fingerprint_fingerprint_idx` (GIN on `fingerprint` using `intarray`)

**Notes**:
- `fingerprint` is an array of 32-bit integers (Chromaprint hashes)
- GIN index enables fast similarity search
- `submission_count` tracks popularity

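
Since a fingerprint is an array of 32-bit hashes, one simple way to compare two of them is to count differing bits position by position. The sketch below does only that, at a fixed alignment; it is an illustration of the idea and not a reimplementation of `acoustid_compare`, whose exact algorithm is not described here.

```python
def bit_similarity(a: list, b: list) -> float:
    """Fraction of matching bits over the overlapping prefix of two fingerprints."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    # Mask to 32 bits so signed values from an INTEGER[] column compare correctly.
    differing = sum(bin((x ^ y) & 0xFFFFFFFF).count("1") for x, y in zip(a, b))
    return 1.0 - differing / (32 * n)
```
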
#### `fingerprint_data`

Extended fingerprint data with simhash.

| Column | Type | Constraints | Description |
|--------|------|-------------|-------------|
| `fingerprint_id` | INTEGER | PRIMARY KEY, FOREIGN KEY | Fingerprint ID |
| `fingerprint` | BYTEA | NOT NULL | Raw fingerprint data |
| `simhash` | CUBE | | Locality-sensitive hash |

**Indexes**:
- `fingerprint_data_pkey` (PRIMARY KEY on `fingerprint_id`)
- `fingerprint_data_simhash_idx` (GIST on `simhash`)

**Notes**:
- `fingerprint` stores compressed Chromaprint data
- `simhash` enables approximate nearest neighbor search
- GIST index for fast similarity queries

#### `track_mbid`

Links tracks to MusicBrainz recordings.

| Column | Type | Constraints | Description |
|--------|------|-------------|-------------|
| `id` | SERIAL | PRIMARY KEY | Link ID |
| `track_id` | INTEGER | FOREIGN KEY | AcoustID track |
| `mbid` | UUID | NOT NULL | MusicBrainz recording MBID |
| `created` | TIMESTAMP | NOT NULL | Creation timestamp |
| `submission_count` | INTEGER | DEFAULT 1 | Number of submissions |
| `disabled` | BOOLEAN | DEFAULT FALSE | Disabled flag |

**Indexes**:
- `track_mbid_pkey` (PRIMARY KEY on `id`)
- `track_mbid_track_id_mbid_key` (UNIQUE on `track_id, mbid`)
- `track_mbid_mbid_idx` (on `mbid`)

**Notes**:
- Multiple MBIDs per track possible (different recordings)
- `submission_count` indicates confidence
- Disabled links excluded from results

#### `meta`

User-submitted metadata.

| Column | Type | Constraints | Description |
|--------|------|-------------|-------------|
| `id` | SERIAL | PRIMARY KEY | Metadata ID |
| `track` | VARCHAR(255) | | Track title |
| `artist` | VARCHAR(255) | | Artist name |
| `album` | VARCHAR(255) | | Album title |
| `album_artist` | VARCHAR(255) | | Album artist |
| `track_no` | INTEGER | | Track number |
| `disc_no` | INTEGER | | Disc number |
| `year` | INTEGER | | Release year |

**Indexes**:
- `meta_pkey` (PRIMARY KEY on `id`)

#### `track_meta`

Links tracks to user metadata.

| Column | Type | Constraints | Description |
|--------|------|-------------|-------------|
| `id` | SERIAL | PRIMARY KEY | Link ID |
| `track_id` | INTEGER | FOREIGN KEY | AcoustID track |
| `meta_id` | INTEGER | FOREIGN KEY | Metadata record |
| `created` | TIMESTAMP | NOT NULL | Creation timestamp |
| `submission_count` | INTEGER | DEFAULT 1 | Number of submissions |

**Indexes**:
- `track_meta_pkey` (PRIMARY KEY on `id`)
- `track_meta_track_id_meta_id_key` (UNIQUE on `track_id, meta_id`)

#### `format`

Audio file formats.

| Column | Type | Constraints | Description |
|--------|------|-------------|-------------|
| `id` | SERIAL | PRIMARY KEY | Format ID |
| `name` | VARCHAR(20) | UNIQUE, NOT NULL | Format name (mp3, flac, etc.) |

**Indexes**:
- `format_pkey` (PRIMARY KEY on `id`)
- `format_name_key` (UNIQUE on `name`)

**Common Values**:
- `mp3`, `flac`, `ogg`, `m4a`, `wma`, `ape`, `wav`

#### `source`

Submission sources (applications).

| Column | Type | Constraints | Description |
|--------|------|-------------|-------------|
| `id` | SERIAL | PRIMARY KEY | Source ID |
| `application_id` | INTEGER | FOREIGN KEY | Application |
| `account_id` | INTEGER | FOREIGN KEY | User account |
| `version` | VARCHAR(255) | | Application version |

**Indexes**:
- `source_pkey` (PRIMARY KEY on `id`)
- `source_application_id_account_id_version_key` (UNIQUE on `application_id, account_id, version`)

### Foreign IDs (acoustid_fingerprint)

#### `foreignid_vendor`

External ID providers.

| Column | Type | Constraints | Description |
|--------|------|-------------|-------------|
| `id` | SERIAL | PRIMARY KEY | Vendor ID |
| `name` | VARCHAR(255) | UNIQUE, NOT NULL | Vendor name |

**Indexes**:
- `foreignid_vendor_pkey` (PRIMARY KEY on `id`)
- `foreignid_vendor_name_key` (UNIQUE on `name`)

**Common Values**:
- `musicbrainz`, `musicip`, `discogs`, `spotify`

#### `foreignid`

External identifiers.

| Column | Type | Constraints | Description |
|--------|------|-------------|-------------|
| `id` | SERIAL | PRIMARY KEY | Foreign ID |
| `vendor_id` | INTEGER | FOREIGN KEY | Vendor |
| `name` | VARCHAR(255) | NOT NULL | External ID value |

**Indexes**:
- `foreignid_pkey` (PRIMARY KEY on `id`)
- `foreignid_vendor_id_name_key` (UNIQUE on `vendor_id, name`)

#### `track_foreignid`

Links tracks to external IDs.

| Column | Type | Constraints | Description |
|--------|------|-------------|-------------|
| `id` | SERIAL | PRIMARY KEY | Link ID |
| `track_id` | INTEGER | FOREIGN KEY | AcoustID track |
| `foreignid_id` | INTEGER | FOREIGN KEY | External ID |
| `created` | TIMESTAMP | NOT NULL | Creation timestamp |
| `submission_count` | INTEGER | DEFAULT 1 | Number of submissions |

**Indexes**:
- `track_foreignid_pkey` (PRIMARY KEY on `id`)
- `track_foreignid_track_id_foreignid_id_key` (UNIQUE on `track_id, foreignid_id`)

#### `track_puid`

Legacy MusicIP PUID links.

| Column | Type | Constraints | Description |
|--------|------|-------------|-------------|
| `id` | SERIAL | PRIMARY KEY | Link ID |
| `track_id` | INTEGER | FOREIGN KEY | AcoustID track |
| `puid` | UUID | NOT NULL | MusicIP PUID |
| `created` | TIMESTAMP | NOT NULL | Creation timestamp |
| `submission_count` | INTEGER | DEFAULT 1 | Number of submissions |

**Indexes**:
- `track_puid_pkey` (PRIMARY KEY on `id`)
- `track_puid_track_id_puid_key` (UNIQUE on `track_id, puid`)
- `track_puid_puid_idx` (on `puid`)

### Statistics (acoustid_app)

#### `stats`

General statistics.

| Column | Type | Constraints | Description |
|--------|------|-------------|-------------|
| `id` | SERIAL | PRIMARY KEY | Stat ID |
| `name` | VARCHAR(255) | UNIQUE, NOT NULL | Stat name |
| `value` | INTEGER | NOT NULL | Stat value |
| `date` | DATE | NOT NULL | Stat date |

**Indexes**:
- `stats_pkey` (PRIMARY KEY on `id`)
- `stats_name_date_key` (UNIQUE on `name, date`)

**Common Stats**:
- `lookup.count`, `submission.count`, `track.count`, `fingerprint.count`

#### `stats_lookups`

Lookup statistics by hour.

| Column | Type | Constraints | Description |
|--------|------|-------------|-------------|
| `id` | SERIAL | PRIMARY KEY | Stat ID |
| `hour` | TIMESTAMP | NOT NULL | Hour timestamp |
| `application_id` | INTEGER | FOREIGN KEY | Application |
| `count_hits` | INTEGER | DEFAULT 0 | Successful lookups |
| `count_misses` | INTEGER | DEFAULT 0 | Failed lookups |

**Indexes**:
- `stats_lookups_pkey` (PRIMARY KEY on `id`)
- `stats_lookups_hour_application_id_key` (UNIQUE on `hour, application_id`)

#### `stats_user_agents`

User agent statistics.

| Column | Type | Constraints | Description |
|--------|------|-------------|-------------|
| `id` | SERIAL | PRIMARY KEY | Stat ID |
| `date` | DATE | NOT NULL | Date |
| `application_id` | INTEGER | FOREIGN KEY | Application |
| `user_agent` | VARCHAR(1000) | NOT NULL | User agent string |
| `ip` | INET | NOT NULL | IP address |
| `count` | INTEGER | DEFAULT 0 | Request count |

**Indexes**:
- `stats_user_agents_pkey` (PRIMARY KEY on `id`)
- `stats_user_agents_date_application_id_user_agent_ip_key` (UNIQUE on `date, application_id, user_agent, ip`)

#### `stats_top_accounts`

Top submitter accounts.

| Column | Type | Constraints | Description |
|--------|------|-------------|-------------|
| `id` | SERIAL | PRIMARY KEY | Stat ID |
| `account_id` | INTEGER | FOREIGN KEY | Account |
| `count` | INTEGER | NOT NULL | Submission count |

**Indexes**:
- `stats_top_accounts_pkey` (PRIMARY KEY on `id`)
- `stats_top_accounts_account_id_key` (UNIQUE on `account_id`)

### Submission Processing (acoustid_ingest)

#### `submission`

Pending fingerprint submissions.

| Column | Type | Constraints | Description |
|--------|------|-------------|-------------|
| `id` | SERIAL | PRIMARY KEY | Submission ID |
| `fingerprint` | INTEGER[] | NOT NULL | Chromaprint hash array |
| `length` | SMALLINT | NOT NULL | Duration in seconds |
| `bitrate` | SMALLINT | | Audio bitrate |
| `format_id` | INTEGER | | Audio format |
| `created` | TIMESTAMP | NOT NULL | Submission timestamp |
| `source_id` | INTEGER | FOREIGN KEY | Submission source |
| `mbid` | UUID | | MusicBrainz MBID (if provided) |
| `handled` | BOOLEAN | DEFAULT FALSE | Processing status |
| `meta_id` | INTEGER | FOREIGN KEY | User metadata |

**Indexes**:
- `submission_pkey` (PRIMARY KEY on `id`)
- `submission_handled_idx` (on `handled` WHERE `handled = FALSE`)

**Notes**:
- Worker processes unhandled submissions
- `handled = TRUE` after processing

#### `submission_result`

Processing results for submissions.

| Column | Type | Constraints | Description |
|--------|------|-------------|-------------|
| `id` | SERIAL | PRIMARY KEY | Result ID |
| `submission_id` | INTEGER | FOREIGN KEY | Submission |
| `track_id` | INTEGER | FOREIGN KEY | Matched/created track |
| `created` | TIMESTAMP | NOT NULL | Processing timestamp |

**Indexes**:
- `submission_result_pkey` (PRIMARY KEY on `id`)
- `submission_result_submission_id_key` (UNIQUE on `submission_id`)

#### `pending_submission`

Queue for async submission processing.

| Column | Type | Constraints | Description |
|--------|------|-------------|-------------|
| `id` | SERIAL | PRIMARY KEY | Queue ID |
| `submission_id` | INTEGER | FOREIGN KEY | Submission |
| `created` | TIMESTAMP | NOT NULL | Queue timestamp |

**Indexes**:
- `pending_submission_pkey` (PRIMARY KEY on `id`)
- `pending_submission_submission_id_key` (UNIQUE on `submission_id`)

**Notes**:
- Replaced by NATS queue in newer deployments
- Legacy table, may be deprecated

### Provenance Tables (acoustid_fingerprint)

Track data lineage and changes.

#### `fingerprint_source`

Links fingerprints to submission sources.

| Column | Type | Constraints | Description |
|--------|------|-------------|-------------|
| `id` | SERIAL | PRIMARY KEY | Link ID |
| `fingerprint_id` | INTEGER | FOREIGN KEY | Fingerprint |
| `source_id` | INTEGER | FOREIGN KEY | Source |
| `created` | TIMESTAMP | NOT NULL | Creation timestamp |

#### `track_mbid_source`

Links track-MBID associations to sources.

| Column | Type | Constraints | Description |
|--------|------|-------------|-------------|
| `id` | SERIAL | PRIMARY KEY | Link ID |
| `track_mbid_id` | INTEGER | FOREIGN KEY | Track-MBID link |
| `source_id` | INTEGER | FOREIGN KEY | Source |
| `created` | TIMESTAMP | NOT NULL | Creation timestamp |

#### `track_mbid_change`

Audit log for track-MBID changes.

| Column | Type | Constraints | Description |
|--------|------|-------------|-------------|
| `id` | SERIAL | PRIMARY KEY | Change ID |
| `track_mbid_id` | INTEGER | FOREIGN KEY | Track-MBID link |
| `account_id` | INTEGER | FOREIGN KEY | Account that made change |
| `disabled` | BOOLEAN | NOT NULL | New disabled status |
| `created` | TIMESTAMP | NOT NULL | Change timestamp |
| `note` | TEXT | | Change reason |

## ORM Layer (SQLAlchemy)

### Multi-Database Configuration

**File**: `acoustid/db.py`

```python
# Database bind keys
BIND_KEYS = {
    'app': 'acoustid_app',
    'fingerprint': 'acoustid_fingerprint',
    'ingest': 'acoustid_ingest',
    'musicbrainz': 'musicbrainz'
}
```

**Model Binding**:

```python
class Account(Base):
    __bind_key__ = 'app'
    __tablename__ = 'account'
    # ...

class Track(Base):
    __bind_key__ = 'fingerprint'
    __tablename__ = 'track'
    # ...
```

### Connection Pooling

**Configuration** (`acoustid.conf`):

```ini
[database]
name = acoustid_app
user = acoustid
password_file = /run/secrets/db_password
host = postgres
port = 5432
pool_size = 20
pool_recycle = 3600
```

**Pool Settings**:
- `pool_size`: Connections kept open in the pool per process (SQLAlchemy can open up to `max_overflow` more on demand)
- `pool_recycle`: Recycle connections after N seconds
- `pool_pre_ping`: Test connections before use

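As a rough illustration of how the `[database]` section above could be turned into connection settings, the sketch below parses it with the standard-library `configparser`. The function name `load_database_settings`, the URL layout, and the fallback values are assumptions made for this example, not code from `acoustid/db.py`; the password is passed in directly instead of being read from `password_file`.

```python
import configparser
from urllib.parse import quote_plus

# Sample text mirroring the [database] section shown above.
SAMPLE = """
[database]
name = acoustid_app
user = acoustid
host = postgres
port = 5432
pool_size = 20
pool_recycle = 3600
"""


def load_database_settings(text, password=""):
    """Return (url, engine_options) built from an INI-style config string."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    db = parser["database"]
    url = "postgresql://{user}:{password}@{host}:{port}/{name}".format(
        user=quote_plus(db["user"]),
        password=quote_plus(password),
        host=db["host"],
        port=db.getint("port", fallback=5432),
        name=db["name"],
    )
    # Fallbacks are illustrative defaults, used only when a key is absent.
    engine_options = {
        "pool_size": db.getint("pool_size", fallback=5),
        "pool_recycle": db.getint("pool_recycle", fallback=-1),
        "pool_pre_ping": db.getboolean("pool_pre_ping", fallback=False),
    }
    return url, engine_options


url, options = load_database_settings(SAMPLE, password="s3cret")
```

The dictionary keys match keyword arguments accepted by `sqlalchemy.create_engine`, so the result could be passed on as `create_engine(url, **options)`.
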
### Query Patterns

**Fingerprint Search** (legacy, pre-index):

```python
from sqlalchemy import func

# Find similar fingerprints using intarray overlap
query = db.session.query(Fingerprint).filter(
    Fingerprint.fingerprint.op('&&')(query_fingerprint),
    Fingerprint.length.between(duration - 5, duration + 5)
).order_by(
    func.acoustid_compare(Fingerprint.fingerprint, query_fingerprint).desc()
).limit(10)
```

**Track Lookup with MBIDs**:

```python
from sqlalchemy.orm import joinedload

# Fetch track with all linked MBIDs
track = db.session.query(Track).options(
    joinedload(Track.mbids)
).filter(Track.gid == track_gid).first()
```

**Submission Processing**:

```python
# Find unhandled submissions, oldest first
submissions = db.session.query(Submission).filter(
    Submission.handled.is_(False)
).order_by(Submission.created).limit(100).all()
```

## Database Migrations

### Alembic Configuration

**File**: `alembic.ini`

**Migration Directories**:
- `alembic/versions/app/`: acoustid_app migrations
- `alembic/versions/fingerprint/`: acoustid_fingerprint migrations
- `alembic/versions/ingest/`: acoustid_ingest migrations

**Multi-Database Support**:

```python
# alembic/env.py
def run_migrations_online():
    for bind_key in ['app', 'fingerprint', 'ingest']:
        engine = get_engine(bind_key)
        with engine.connect() as connection:
            context.configure(
                connection=connection,
                target_metadata=get_metadata(bind_key)
            )
            with context.begin_transaction():
                context.run_migrations()
```

### Migration Commands

```bash
# Create new migration
alembic revision --autogenerate -m "Add new column"

# Apply migrations
alembic upgrade head

# Rollback migration
alembic downgrade -1

# Show current version
alembic current

# Show migration history
alembic history
```

## Redis Data Structures

### Rate Limiting

**Key Pattern**: `rl:bucket:{scope}:{identifier}:{timestamp}` (the `global` scope omits `{identifier}`)

**Example Keys**:
```
rl:bucket:global:1714305600
rl:bucket:app:8XaBELgH:1714305600
rl:bucket:ip:192.168.1.1:1714305600
```

**Value**: Integer (request count)
**TTL**: 25 seconds (window duration + buffer)

**Algorithm**:
```python
# Increment bucket for current window
bucket_key = f"rl:bucket:{scope}:{identifier}:{current_window}"
count = redis.incr(bucket_key)
redis.expire(bucket_key, 25)

# Sum counts across all windows in sliding window
# (GET returns None for missing or expired buckets, so default to 0)
total = sum(
    int(redis.get(f"rl:bucket:{scope}:{identifier}:{w}") or 0)
    for w in windows
)
```

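The bucket scheme above can be sketched end to end as follows. This is a minimal illustration, not the server's implementation: the 5-second bucket width, the four buckets per sliding window, the `FakeRedis` stand-in, and the `allow_request` name are all assumptions chosen so that the 25-second TTL covers the window plus a buffer.

```python
import time

WINDOW_SECONDS = 5    # assumed bucket width
WINDOWS_TRACKED = 4   # assumed number of buckets in the sliding window
BUCKET_TTL = 25       # TTL quoted in the text above


class FakeRedis:
    """Tiny in-memory stand-in exposing only incr/expire/get."""

    def __init__(self):
        self.data = {}

    def incr(self, key):
        self.data[key] = self.data.get(key, 0) + 1
        return self.data[key]

    def expire(self, key, seconds):
        pass  # expiry is not simulated in this sketch

    def get(self, key):
        return self.data.get(key)


def allow_request(redis, scope, identifier, limit, now=None):
    """Count the request, then report whether the window total is within limit."""
    now = time.time() if now is None else now
    current = int(now // WINDOW_SECONDS) * WINDOW_SECONDS
    key = f"rl:bucket:{scope}:{identifier}:{current}"
    redis.incr(key)
    redis.expire(key, BUCKET_TTL)
    # Bucket start times covered by the sliding window, newest first.
    windows = [current - i * WINDOW_SECONDS for i in range(WINDOWS_TRACKED)]
    total = sum(
        int(redis.get(f"rl:bucket:{scope}:{identifier}:{w}") or 0)
        for w in windows
    )
    return total <= limit
```

Passing `now` explicitly keeps the function deterministic, which makes the window arithmetic easy to check without waiting on a real clock.
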
### Task Queue (Legacy)

**Key Pattern**: `queue:{queue_name}`

**Operations**:
```python
# Push task
redis.rpush('queue:submissions', json.dumps(task_data))

# Pop task
task_data = redis.lpop('queue:submissions')
```

**Note**: Being replaced by NATS in newer deployments

### API Key Cache

**Implementation**: In-memory TTLCache (not Redis)

```python
from cachetools import TTLCache

api_key_cache = TTLCache(maxsize=1000, ttl=60)
```

**Purpose**: Reduce database queries for API key validation

### Backfill State

**Key Pattern**: `backfill:{index_name}:{state_key}`

**Example Keys**:
```
backfill:fingerprints:last_id
backfill:fingerprints:batch_size
backfill:fingerprints:completed
```

**Purpose**: Track progress of index backfill operations

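One way such keys could support a resumable backfill is sketched below. A plain dictionary stands in for Redis, `rows` is an ascending list of fingerprint IDs, and the function name `run_backfill` and its `max_batches` parameter are hypothetical; the actual backfill logic is not described in the text above.

```python
def run_backfill(state, rows, index_name="fingerprints", batch_size=2, max_batches=None):
    """Process rows in ID order, persisting progress after every batch."""
    prefix = f"backfill:{index_name}"
    last_id = int(state.get(f"{prefix}:last_id", 0))
    batches = 0
    processed = []
    while max_batches is None or batches < max_batches:
        # Next batch: rows strictly after the last checkpointed ID.
        batch = [r for r in rows if r > last_id][:batch_size]
        if not batch:
            state[f"{prefix}:completed"] = 1
            break
        processed.extend(batch)
        last_id = batch[-1]
        state[f"{prefix}:last_id"] = last_id  # checkpoint for resume
        batches += 1
    return processed
```

Because the checkpoint is written after each batch, an interrupted run picks up from `last_id` instead of starting over.
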
### Unknown MBID Cache

**Key Pattern**: `unknown_mbid:{mbid}`

**Value**: Boolean (1 if MBID not found in MusicBrainz)
**TTL**: 3600 seconds (1 hour)

**Purpose**: Avoid repeated MusicBrainz queries for non-existent MBIDs

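The negative-caching idea can be illustrated with a small in-memory stand-in for the Redis keys. The class `UnknownMbidCache`, the helper `mbid_exists`, and the injected `query_musicbrainz` callable are hypothetical names for this sketch; only the key pattern and the one-hour TTL come from the text above.

```python
import time

UNKNOWN_MBID_TTL = 3600  # seconds, as stated above


class UnknownMbidCache:
    """In-memory stand-in for the unknown_mbid:{mbid} Redis keys."""

    def __init__(self, clock=time.monotonic):
        self._expires = {}
        self._clock = clock

    def mark_unknown(self, mbid):
        self._expires[f"unknown_mbid:{mbid}"] = self._clock() + UNKNOWN_MBID_TTL

    def is_known_missing(self, mbid):
        key = f"unknown_mbid:{mbid}"
        expires_at = self._expires.get(key)
        if expires_at is None:
            return False
        if self._clock() >= expires_at:
            del self._expires[key]  # entry expired, allow a fresh lookup
            return False
        return True


def mbid_exists(mbid, cache, query_musicbrainz):
    """Check an MBID, skipping the remote query while a miss is cached."""
    if cache.is_known_missing(mbid):
        return False
    found = query_musicbrainz(mbid)
    if not found:
        cache.mark_unknown(mbid)
    return found
```

Only misses are cached here, so an MBID that appears in MusicBrainz later is picked up once the TTL lapses.
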
## Data Integrity

### Constraints

**Foreign Keys**:
- All foreign keys have `ON DELETE CASCADE` or `ON DELETE SET NULL`
- Dependent rows are deleted (`CASCADE`) or detached (`SET NULL`) automatically when the parent row is removed

**Unique Constraints**:
- Prevent duplicate fingerprints per track
- Prevent duplicate MBID links per track
- Ensure API key uniqueness

**Check Constraints**:
- Duration must be positive
- Bitrate must be positive
- Submission count must be non-negative

### Triggers

**Update Submission Count**:
```sql
CREATE TRIGGER update_fingerprint_submission_count
AFTER INSERT ON fingerprint_source
FOR EACH ROW
EXECUTE FUNCTION increment_submission_count();
```

**Track Merge Propagation**:
```sql
CREATE TRIGGER propagate_track_merge
AFTER UPDATE OF new_id ON track
FOR EACH ROW
EXECUTE FUNCTION update_merged_track_references();
```

### Indexes for Performance

**Covering Indexes**:
```sql
-- Lookup by fingerprint and duration
CREATE INDEX fingerprint_lookup_idx
ON fingerprint (length, track_id)
INCLUDE (fingerprint);
```

**Partial Indexes**:
```sql
-- Only index unhandled submissions
CREATE INDEX submission_unhandled_idx
ON submission (created)
WHERE handled = FALSE;
```

**GIN Indexes**:
```sql
-- Fast fingerprint array queries
CREATE INDEX fingerprint_fingerprint_idx
ON fingerprint USING GIN (fingerprint gin__int_ops);
```

## Data Lifecycle

### Fingerprint Submission

1. Insert into `submission` table (acoustid_ingest)
2. Publish to NATS queue
3. Worker processes submission
4. Insert into `fingerprint` table (acoustid_fingerprint)
5. Link to `track` (create or match)
6. Insert into `fingerprint_source` (provenance)
7. Update index via HTTP API
8. Insert into `submission_result`
9. Mark `submission.handled = TRUE`

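The worker-side steps (3 to 9) can be sketched with in-memory stand-ins to show the ordering and why marking `handled` last makes redelivery harmless. Everything here is hypothetical scaffolding: the `Store` dataclass replaces the three databases and the index, and track matching is reduced to a placeholder.

```python
from dataclasses import dataclass, field


@dataclass
class Store:
    """In-memory stand-in for the tables and index touched by a worker."""
    submissions: dict = field(default_factory=dict)
    fingerprints: dict = field(default_factory=dict)
    fingerprint_sources: list = field(default_factory=list)
    results: list = field(default_factory=list)
    indexed: list = field(default_factory=list)


def process_submission(store, submission_id):
    sub = store.submissions[submission_id]
    if sub["handled"]:
        return None  # already processed, so a redelivered message is a no-op
    fingerprint_id = len(store.fingerprints) + 1
    # Placeholder for "create or match": reuse a given track or create one.
    track_id = sub.get("track_id") or fingerprint_id
    store.fingerprints[fingerprint_id] = {"track_id": track_id, "data": sub["fingerprint"]}
    store.fingerprint_sources.append((fingerprint_id, sub["source_id"]))  # provenance
    store.indexed.append(fingerprint_id)  # stands in for the index HTTP call
    store.results.append((submission_id, fingerprint_id, track_id))
    sub["handled"] = True  # last step, mirrors step 9
    return fingerprint_id
```

In a real deployment the steps span several databases, so a crash between them would still need reconciliation; the sketch only shows the happy path and the duplicate-delivery guard.
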
### Track Merging

1. Identify duplicate tracks (manual or automated)
2. Set `track.new_id` to target track
3. Trigger updates all references
4. Merge fingerprints, MBIDs, metadata
5. Disable old track (`track.disabled = TRUE`)

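Because a merged track points at its replacement through `new_id`, and that replacement may itself be merged later, readers of old track IDs need to follow the chain. A minimal sketch, assuming the `new_id` column is available as a dictionary mapping track ID to target ID (or `None`); the function name is hypothetical:

```python
def resolve_track_id(track_id, new_ids):
    """Follow new_id pointers until reaching a track that was not merged."""
    seen = set()
    while new_ids.get(track_id) is not None:
        if track_id in seen:
            raise ValueError(f"merge cycle detected at track {track_id}")
        seen.add(track_id)
        track_id = new_ids[track_id]
    return track_id


# Track 1 was merged into 2, which was later merged into 5.
new_ids = {1: 2, 2: 5, 3: None, 5: None}
final_id = resolve_track_id(1, new_ids)
```

The cycle check is a defensive addition for this example; whether the real schema can contain cycles is not stated in the source.
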
### Data Cleanup

**Cron Jobs**:
- Delete old handled submissions (>30 days)
- Clean up orphaned metadata records
- Remove disabled tracks with no references
- Archive old statistics

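The first cleanup job can be illustrated with SQLite from the standard library, which keeps the example runnable without a PostgreSQL server. The table layout is reduced to the three columns the job needs, and `delete_old_handled` is a hypothetical name; the real cron task is not shown in the source.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

RETENTION_DAYS = 30  # from the list above


def delete_old_handled(conn, now):
    """Delete handled submissions older than the retention period."""
    cutoff = (now - timedelta(days=RETENTION_DAYS)).isoformat()
    cur = conn.execute(
        "DELETE FROM submission WHERE handled = 1 AND created < ?", (cutoff,)
    )
    conn.commit()
    return cur.rowcount


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE submission ("
    "id INTEGER PRIMARY KEY, created TEXT NOT NULL, handled INTEGER NOT NULL)"
)
now = datetime(2025, 4, 28, tzinfo=timezone.utc)
rows = [
    (1, (now - timedelta(days=45)).isoformat(), 1),  # old and handled: deleted
    (2, (now - timedelta(days=45)).isoformat(), 0),  # old but unhandled: kept
    (3, (now - timedelta(days=5)).isoformat(), 1),   # recent: kept
]
conn.executemany("INSERT INTO submission VALUES (?, ?, ?)", rows)
deleted = delete_old_handled(conn, now)
remaining = [row[0] for row in conn.execute("SELECT id FROM submission ORDER BY id")]
```

Timestamps are stored as ISO 8601 strings in one time zone, so string comparison orders them correctly; with PostgreSQL `TIMESTAMP` columns the comparison would be native.
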
## Performance Optimization

### Query Optimization

**Materialized Views**:
```sql
-- Caveat: the join repeats each fingerprint row once per linked MBID,
-- so total_submissions over-counts for tracks with several MBIDs
CREATE MATERIALIZED VIEW track_stats AS
SELECT
    track_id,
    COUNT(DISTINCT fingerprint_id) AS fingerprint_count,
    COUNT(DISTINCT mbid) AS mbid_count,
    SUM(submission_count) AS total_submissions
FROM fingerprint
LEFT JOIN track_mbid USING (track_id)
GROUP BY track_id;
```

**Partitioning** (future):
```sql
-- Partition submissions by month
CREATE TABLE submission_2025_04 PARTITION OF submission
FOR VALUES FROM ('2025-04-01') TO ('2025-05-01');
```

### Caching Strategy

**Application-Level**:
- API key validation (TTLCache, 60s)
- Format ID lookup (permanent cache)
- MusicBrainz MBID existence (Redis, 1h)

**Database-Level**:
- Shared buffers (PostgreSQL config)
- Connection pooling (SQLAlchemy)
- Query statistics (pg_stat_statements) for finding expensive queries; PostgreSQL itself does not cache query results

### Bulk Operations

**Batch Inserts**:
```python
# Insert multiple fingerprints efficiently
db.session.bulk_insert_mappings(Fingerprint, fingerprint_dicts)
db.session.commit()
```

**Bulk Updates**:
```python
from sqlalchemy import update

# Update submission counts in batch
db.session.execute(
    update(Fingerprint).where(
        Fingerprint.id.in_(fingerprint_ids)
    ).values(
        submission_count=Fingerprint.submission_count + 1
    )
)
```

## Backup and Recovery

### Backup Strategy

**PostgreSQL**:
- Daily full backups (pg_dump)
- Continuous WAL archiving
- Point-in-time recovery enabled (requires a physical base backup such as `pg_basebackup` output; a `pg_dump` archive cannot be rolled forward with WAL)

**Index**:
- Daily snapshots via `/:index/_snapshot`
- Incremental backups of Oplog
- Segment files backed up separately

### Disaster Recovery

**Database Restore**:
```bash
# Restore from dump
pg_restore -d acoustid_app acoustid_app_backup.dump

# Point-in-time recovery is not a pg_restore option. Restore a physical
# base backup, then in postgresql.conf set restore_command to fetch
# archived WAL and
#   recovery_target_time = '2025-04-28 12:00:00'
# and create an empty recovery.signal file in the data directory
# before starting the server.
```

**Index Rebuild**:
```bash
# Rebuild from database
python manage.py run import --rebuild-index
```

@@ -0,0 +1,946 @@
|
||||
# AcoustID Deployment

## Deployment Overview

AcoustID supports multiple deployment models: production multi-server, Docker Compose for self-hosting, and local development. The system requires coordination between multiple services: PostgreSQL, Redis, NATS, the Python server, and the Zig index.

## Docker Deployment

### Server Docker Image

**Dockerfile**: `docker/Dockerfile`

#### Multi-Stage Build

**Stage 1: Chromaprint Build**

```dockerfile
FROM ubuntu:24.04 AS chromaprint-build

RUN apt-get update && apt-get install -y \
    git \
    cmake \
    build-essential \
    libfftw3-dev

WORKDIR /build
RUN git clone https://github.com/acoustid/chromaprint.git && \
    cd chromaprint && \
    git checkout 41a3e8fb && \
    cmake -DCMAKE_BUILD_TYPE=Release \
          -DBUILD_TOOLS=OFF \
          -DBUILD_TESTS=OFF . && \
    make -j$(nproc) && \
    make install
```

**Stage 2: Base Image**

```dockerfile
FROM ubuntu:24.04 AS base

RUN apt-get update && apt-get install -y \
    python3.12 \
    python3-pip \
    libfftw3-3 \
    libpq5 \
    && rm -rf /var/lib/apt/lists/*

COPY --from=chromaprint-build /usr/local/lib/libchromaprint.so* /usr/local/lib/
COPY --from=chromaprint-build /usr/local/include/chromaprint.h /usr/local/include/

RUN ldconfig
```

**Stage 3: Builder**

```dockerfile
FROM base AS builder

RUN apt-get update && apt-get install -y \
    build-essential \
    python3-dev \
    libpq-dev \
    curl \
    && rm -rf /var/lib/apt/lists/*

# Install uv
RUN curl -LsSf https://astral.sh/uv/install.sh | sh
# The install directory depends on the uv version
# (~/.local/bin in current releases, ~/.cargo/bin in older ones)
ENV PATH="/root/.local/bin:/root/.cargo/bin:$PATH"

WORKDIR /app
COPY pyproject.toml uv.lock ./
RUN uv sync --frozen --no-dev

COPY . .
RUN uv build
```

**Stage 4: Final Image**

```dockerfile
FROM base AS final

# Create non-root user
RUN useradd -m -u 1000 acoustid

WORKDIR /app

# Copy built wheel and dependencies
COPY --from=builder /app/.venv /app/.venv
COPY --from=builder /app/dist/*.whl /tmp/

# Install application
RUN /app/.venv/bin/pip install /tmp/*.whl && rm /tmp/*.whl

# Copy configuration template
COPY acoustid.conf.dist /etc/acoustid/acoustid.conf.dist

USER acoustid

ENV PATH="/app/.venv/bin:$PATH"
ENV PYTHONUNBUFFERED=1

ENTRYPOINT ["python", "manage.py"]
CMD ["run", "api"]
```

**Image Size**: ~400MB (compressed)
**Base OS**: Ubuntu 24.04
**Python Version**: 3.12

### Index Docker Image

**Dockerfile**: `docker/Dockerfile.index`

```dockerfile
FROM ubuntu:24.04 AS builder

RUN apt-get update && apt-get install -y \
    curl \
    xz-utils \
    && rm -rf /var/lib/apt/lists/*

# Install Zig
RUN curl -L https://ziglang.org/download/0.11.0/zig-linux-x86_64-0.11.0.tar.xz | \
    tar -xJ -C /usr/local && \
    ln -s /usr/local/zig-linux-x86_64-0.11.0/zig /usr/local/bin/zig

WORKDIR /build
COPY . .

RUN zig build -Doptimize=ReleaseFast

FROM ubuntu:24.04

RUN useradd -m -u 1000 acoustid

WORKDIR /app

COPY --from=builder /build/zig-out/bin/fpindex /app/fpindex

RUN mkdir -p /var/lib/acoustid-index && \
    chown acoustid:acoustid /var/lib/acoustid-index

USER acoustid

EXPOSE 6081

ENTRYPOINT ["/app/fpindex"]
CMD ["--dir", "/var/lib/acoustid-index", "--port", "6081"]
```

**Image Size**: ~50MB (compressed)
**Base OS**: Ubuntu 24.04
**Binary**: Single statically-linked executable

### Docker Compose Configuration

**File**: `docker-compose.yml`

```yaml
version: '3.8'

services:
  postgres:
    image: ghcr.io/acoustid/postgresql:17.4
    environment:
      POSTGRES_USER: acoustid
      POSTGRES_PASSWORD_FILE: /run/secrets/db_password
      POSTGRES_MULTIPLE_DATABASES: acoustid_app,acoustid_fingerprint,acoustid_ingest
    volumes:
      - postgres_data:/var/lib/postgresql/data
      - ./docker/init-db.sh:/docker-entrypoint-initdb.d/init-db.sh
    secrets:
      - db_password
    ports:
      - "5432:5432"
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U acoustid"]
      interval: 10s
      timeout: 5s
      retries: 5

  redis:
    image: redis:7-alpine
    # redis-server has no --requirepass-file flag; read the secret in a shell
    command: sh -c 'exec redis-server --requirepass "$$(cat /run/secrets/redis_password)"'
    volumes:
      - redis_data:/data
    secrets:
      - redis_password
    ports:
      - "6379:6379"
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 10s
      timeout: 5s
      retries: 5

  nats:
    image: nats:2-alpine
    # -m 8222 enables the HTTP monitoring port used by the health check
    command: -js -sd /data -m 8222
    volumes:
      - nats_data:/data
    ports:
      - "4222:4222"
      - "8222:8222"
    healthcheck:
      test: ["CMD", "wget", "-q", "-O-", "http://localhost:8222/healthz"]
      interval: 10s
      timeout: 5s
      retries: 5

  index:
    image: ghcr.io/acoustid/acoustid-index:latest
    command: >
      --dir /var/lib/acoustid-index
      --port 6081
      --threads 4
      --log-level info
    volumes:
      - index_data:/var/lib/acoustid-index
    ports:
      - "6081:6081"
    healthcheck:
      test: ["CMD", "wget", "-q", "-O-", "http://localhost:6081/_health"]
      interval: 10s
      timeout: 5s
      retries: 5
    profiles:
      - backend

  api:
    image: ghcr.io/acoustid/acoustid-server:latest
    command: run api
    environment:
      ACOUSTID_CONFIG: /etc/acoustid/acoustid.conf
    volumes:
      - ./acoustid.conf:/etc/acoustid/acoustid.conf:ro
    secrets:
      - db_password
      - redis_password
    ports:
      - "5000:5000"
    depends_on:
      postgres:
        condition: service_healthy
      redis:
        condition: service_healthy
      nats:
        condition: service_healthy
      index:
        condition: service_healthy
    healthcheck:
      test: ["CMD", "wget", "-q", "-O-", "http://localhost:5000/_health"]
      interval: 30s
      timeout: 10s
      retries: 3
    profiles:
      - frontend

  web:
    image: ghcr.io/acoustid/acoustid-server:latest
    command: run web
    environment:
      ACOUSTID_CONFIG: /etc/acoustid/acoustid.conf
    volumes:
      - ./acoustid.conf:/etc/acoustid/acoustid.conf:ro
    secrets:
      - db_password
      - redis_password
    ports:
      - "5001:5001"
    depends_on:
      postgres:
        condition: service_healthy
      redis:
        condition: service_healthy
    healthcheck:
      test: ["CMD", "wget", "-q", "-O-", "http://localhost:5001/_health"]
      interval: 30s
      timeout: 10s
      retries: 3
    profiles:
      - frontend

  worker:
    image: ghcr.io/acoustid/acoustid-server:latest
    command: run worker
    environment:
      ACOUSTID_CONFIG: /etc/acoustid/acoustid.conf
    volumes:
      - ./acoustid.conf:/etc/acoustid/acoustid.conf:ro
    secrets:
      - db_password
      - redis_password
    depends_on:
      postgres:
        condition: service_healthy
      redis:
        condition: service_healthy
      nats:
        condition: service_healthy
      index:
        condition: service_healthy
    deploy:
      replicas: 2
    profiles:
      - backend

  cron:
    image: ghcr.io/acoustid/acoustid-server:latest
    command: run cron
    environment:
      ACOUSTID_CONFIG: /etc/acoustid/acoustid.conf
    volumes:
      - ./acoustid.conf:/etc/acoustid/acoustid.conf:ro
    secrets:
      - db_password
      - redis_password
    depends_on:
      postgres:
        condition: service_healthy
      redis:
        condition: service_healthy
    profiles:
      - backend

volumes:
  postgres_data:
  redis_data:
  nats_data:
  index_data:

secrets:
  db_password:
    file: ./secrets/db_password.txt
  redis_password:
    file: ./secrets/redis_password.txt
```

### Docker Compose Profiles

**Frontend Profile** (public-facing services):
```bash
docker compose --profile frontend up
```
Services: api, web

**Backend Profile** (background services):
```bash
docker compose --profile backend up
```
Services: index, worker, cron

**Full Stack**:
```bash
docker compose --profile frontend --profile backend up
```

**Tools Profile** (one-off commands; assumes a `tools` service, which the file above does not define):
```bash
docker compose run --rm tools python manage.py <command>
```

## PostgreSQL Setup

### Custom PostgreSQL Image

**Image**: `ghcr.io/acoustid/postgresql:17.4`
**Base**: `postgres:17.4`

**Dockerfile**: `docker/Dockerfile.postgres`

```dockerfile
FROM postgres:17.4

# intarray, pgcrypto, and cube are contrib modules bundled with the official
# image; only the build toolchain for the custom extension needs installing
RUN apt-get update && apt-get install -y \
    build-essential \
    postgresql-server-dev-17 \
    && rm -rf /var/lib/apt/lists/*

# Build acoustid extension
COPY extensions/acoustid /build/acoustid
WORKDIR /build/acoustid
RUN make && make install

# Copy initialization scripts
COPY docker/init-db.sh /docker-entrypoint-initdb.d/
```

### Database Initialization

**Script**: `docker/init-db.sh`

```bash
#!/bin/bash
set -e

# Create multiple databases
for db in acoustid_app acoustid_fingerprint acoustid_ingest; do
    psql -v ON_ERROR_STOP=1 --username "$POSTGRES_USER" <<-EOSQL
CREATE DATABASE $db;
\c $db
CREATE EXTENSION IF NOT EXISTS pgcrypto;
EOSQL
done

# Install extensions for fingerprint database
psql -v ON_ERROR_STOP=1 --username "$POSTGRES_USER" -d acoustid_fingerprint <<-EOSQL
CREATE EXTENSION IF NOT EXISTS intarray;
CREATE EXTENSION IF NOT EXISTS cube;
CREATE EXTENSION IF NOT EXISTS acoustid;
EOSQL

# Run migrations
cd /app
python manage.py db upgrade
```

### Database Configuration

**postgresql.conf** (custom settings):

```ini
# Connection settings
max_connections = 200
shared_buffers = 4GB
effective_cache_size = 12GB

# Write-ahead log
wal_level = replica
max_wal_size = 2GB
min_wal_size = 1GB

# Query planner
random_page_cost = 1.1  # SSD
effective_io_concurrency = 200

# Parallel query
max_parallel_workers_per_gather = 4
max_parallel_workers = 8

# Logging
log_min_duration_statement = 1000  # Log slow queries (>1s)
log_line_prefix = '%t [%p]: [%l-1] user=%u,db=%d,app=%a,client=%h '

# Autovacuum
autovacuum_max_workers = 4
autovacuum_naptime = 10s
```

## CI/CD Pipeline

### GitHub Actions Workflows

**File**: `.github/workflows/ci.yml`

```yaml
name: CI

on:
  push:
    branches: [main, develop]
  pull_request:
    branches: [main]

jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.12'

      - name: Install uv
        run: curl -LsSf https://astral.sh/uv/install.sh | sh

      - name: Install dependencies
        run: uv sync

      - name: Run isort
        run: uv run isort --check-only acoustid/

      - name: Run black
        run: uv run black --check acoustid/

      - name: Run flake8
        run: uv run flake8 acoustid/

      - name: Run mypy
        run: uv run mypy acoustid/

  test:
    runs-on: ubuntu-latest
    services:
      postgres:
        image: ghcr.io/acoustid/postgresql:17.4
        env:
          POSTGRES_USER: acoustid
          POSTGRES_PASSWORD: acoustid
          POSTGRES_DB: acoustid_test
        options: >-
          --health-cmd pg_isready
          --health-interval 10s
          --health-timeout 5s
          --health-retries 5
        ports:
          - 5432:5432

      redis:
        image: redis:7-alpine
        options: >-
          --health-cmd "redis-cli ping"
          --health-interval 10s
          --health-timeout 5s
          --health-retries 5
        ports:
          - 6379:6379

      nats:
        image: nats:2-alpine
        options: >-
          --health-cmd "wget -q -O- http://localhost:8222/healthz"
          --health-interval 10s
          --health-timeout 5s
          --health-retries 5
        ports:
          - 4222:4222

    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.12'

      - name: Install uv
        run: curl -LsSf https://astral.sh/uv/install.sh | sh

      - name: Install dependencies
        run: uv sync

      - name: Run migrations
        run: uv run python manage.py db upgrade
        env:
          ACOUSTID_DATABASE_NAME: acoustid_test
          ACOUSTID_DATABASE_USER: acoustid
          ACOUSTID_DATABASE_PASSWORD: acoustid
          ACOUSTID_DATABASE_HOST: localhost

      - name: Run tests
        run: uv run pytest -v --cov=acoustid --cov-report=xml
        env:
          ACOUSTID_DATABASE_NAME: acoustid_test
          ACOUSTID_DATABASE_USER: acoustid
          ACOUSTID_DATABASE_PASSWORD: acoustid
          ACOUSTID_DATABASE_HOST: localhost
          ACOUSTID_REDIS_HOST: localhost
          ACOUSTID_NATS_SERVERS: nats://localhost:4222

      - name: Upload coverage
        uses: codecov/codecov-action@v4
        with:
          file: ./coverage.xml

  build:
    runs-on: ubuntu-latest
    needs: [lint, test]
    if: github.event_name == 'push'
    steps:
      - uses: actions/checkout@v4

      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3

      - name: Login to GitHub Container Registry
        uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}

      - name: Build and push server image
        uses: docker/build-push-action@v5
        with:
          context: .
          file: docker/Dockerfile
          push: true
          tags: |
            ghcr.io/acoustid/acoustid-server:latest
            ghcr.io/acoustid/acoustid-server:${{ github.sha }}
          cache-from: type=gha
          cache-to: type=gha,mode=max

      - name: Build and push index image
        uses: docker/build-push-action@v5
        with:
          context: .
          file: docker/Dockerfile.index
          push: true
          tags: |
            ghcr.io/acoustid/acoustid-index:latest
            ghcr.io/acoustid/acoustid-index:${{ github.sha }}
          cache-from: type=gha
          cache-to: type=gha,mode=max
```

### Linting Tools

**isort** (import sorting):
```toml
# pyproject.toml
[tool.isort]
profile = "black"
line_length = 100
```

**black** (code formatting):
```toml
# pyproject.toml
[tool.black]
line-length = 100
target-version = ['py312']
```

**flake8** (style checking):
```ini
# .flake8
[flake8]
max-line-length = 100
extend-ignore = E203, W503
exclude = .git,__pycache__,build,dist,.venv
```

**mypy** (type checking):
```toml
# pyproject.toml
[tool.mypy]
python_version = "3.12"
warn_return_any = true
warn_unused_configs = true
disallow_untyped_defs = true
```

### Testing

**pytest** configuration:

```toml
# pyproject.toml
[tool.pytest.ini_options]
testpaths = ["tests"]
python_files = ["test_*.py"]
python_classes = ["Test*"]
python_functions = ["test_*"]
addopts = "-v --strict-markers --tb=short"
markers = [
    "slow: marks tests as slow (deselect with '-m \"not slow\"')",
    "integration: marks tests as integration tests",
]
```

**Test Files** (24 total):
```
tests/
├── test_api_lookup.py
├── test_api_submit.py
├── test_fingerprint.py
├── test_indexclient.py
├── test_fpstore.py
├── test_data_account.py
├── test_data_fingerprint.py
├── test_data_track.py
├── test_data_musicbrainz.py
├── test_worker.py
├── test_cron.py
├── test_ratelimit.py
├── test_db.py
├── test_config.py
└── ...
```

**Test Fixtures**:

```python
# tests/conftest.py
import pytest
from acoustid.db import create_engine, create_session

@pytest.fixture
def with_database():
    """Provide test database session."""
    engine = create_engine('acoustid_test')
    session = create_session(engine)
    yield session
    session.rollback()
    session.close()

@pytest.fixture
def with_script():
    """Provide script context with database."""
    from acoustid.script import Script
    script = Script('test')
    script.setup()
    yield script
    script.teardown()

@pytest.fixture
def fingerprint_fixture():
    """Predefined test fingerprint."""
    return [123456789, 987654321, 456789123, ...]
```

## Infrastructure Requirements

### Minimum Requirements (Self-Hosted)

| Component | CPU | RAM | Disk | Notes |
|-----------|-----|-----|------|-------|
| PostgreSQL | 2 cores | 4 GB | 100 GB SSD | For small dataset |
| Redis | 1 core | 1 GB | 10 GB | Mostly in-memory |
| NATS | 1 core | 512 MB | 10 GB | JetStream storage |
| Index | 2 cores | 2 GB | 50 GB SSD | Depends on dataset size |
| API | 2 cores | 2 GB | 10 GB | Per instance |
| Worker | 2 cores | 2 GB | 10 GB | Per instance |
| **Total** | **10 cores** | **11.5 GB** | **190 GB** | Single-host deployment |

### Production Requirements (acoustid.org scale)

| Component | CPU | RAM | Disk | Instances | Notes |
|-----------|-----|-----|------|-----------|-------|
| PostgreSQL | 16 cores | 64 GB | 2 TB NVMe | 1 primary + 2 replicas | High IOPS required |
| Redis | 4 cores | 16 GB | 100 GB SSD | 3 (cluster) | Persistence enabled |
| NATS | 4 cores | 8 GB | 500 GB SSD | 3 (cluster) | JetStream storage |
| Index | 8 cores | 16 GB | 1 TB NVMe | 4+ | Sharded by fingerprint ID |
| API | 4 cores | 8 GB | 50 GB | 4+ | Behind load balancer |
| Web | 2 cores | 4 GB | 50 GB | 2+ | Behind load balancer |
| Worker | 4 cores | 8 GB | 50 GB | 8+ | Auto-scaling |
| Cron | 2 cores | 4 GB | 50 GB | 1 | Leader election |

### Network Requirements

**Bandwidth**:
- API: 100 Mbps per instance (burst to 1 Gbps)
- Index: 1 Gbps (internal network)
- Database: 1 Gbps (internal network)

**Latency**:
- API to Index: <5ms
- API to Database: <5ms
- API to Redis: <1ms

## Monitoring and Observability

### Health Checks

**Endpoints**:
- `/_health`: Full health check (database write test)
- `/_health_ro`: Read-only health check
- `/_health_docker`: Minimal health check for Docker

**Kubernetes Probes**:

```yaml
livenessProbe:
  httpGet:
    path: /_health_docker
    port: 5000
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3

readinessProbe:
  httpGet:
    path: /_health_ro
    port: 5000
  initialDelaySeconds: 10
  periodSeconds: 5
  timeoutSeconds: 3
  failureThreshold: 2
```

### Metrics

**StatsD Metrics** (server):
- `api.requests_total{endpoint,method,status}`
- `api.request_duration_seconds{endpoint,method}`
- `api.handled_errors_total{error_code}`
- `api.unhandled_errors_total`
- `api.lookup.searches.total`
- `api.lookup.matches.total`
- `new_submissions`

**Prometheus Metrics** (index):
- `fpindex_search_duration_seconds`
- `fpindex_insert_duration_seconds`
- `fpindex_segment_count`
- `fpindex_memory_segment_size_bytes`
- `fpindex_file_segment_size_bytes`
- `fpindex_merge_duration_seconds`

### Logging
|
||||
|
||||
**Log Levels**:
|
||||
- `DEBUG`: Detailed diagnostic information
|
||||
- `INFO`: General informational messages
|
||||
- `WARNING`: Warning messages
|
||||
- `ERROR`: Error messages
|
||||
- `CRITICAL`: Critical errors
|
||||
|
||||
**Log Format**:
|
||||
```
%(asctime)s [%(process)d] [%(levelname)s] %(name)s: %(message)s
```
|
||||
|
||||
**Environment Variables**:
|
||||
```bash
ACOUSTID_LOGGING_LEVEL=INFO
ACOUSTID_LOGGING_LEVEL_ACOUSTID=DEBUG
ACOUSTID_LOGGING_LEVEL_SQLALCHEMY=WARNING
```
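
As an illustration of how the log format and the per-logger environment variables fit together, the sketch below applies them with Python's standard `logging` module. The mapping rule (suffix after `ACOUSTID_LOGGING_LEVEL_` becomes a lowercase logger name) is an assumption inferred from the variable names, and `logger_levels_from_env` and `configure_logging` are hypothetical helpers, not the server's actual code.

```python
import logging

LOG_FORMAT = "%(asctime)s [%(process)d] [%(levelname)s] %(name)s: %(message)s"
PREFIX = "ACOUSTID_LOGGING_LEVEL"


def logger_levels_from_env(env):
    """Return {logger_name: level_name}; "" stands for the root logger."""
    levels = {}
    for key, value in env.items():
        if key == PREFIX:
            levels[""] = value.upper()
        elif key.startswith(PREFIX + "_"):
            # Assumed convention: ACOUSTID_LOGGING_LEVEL_SQLALCHEMY -> "sqlalchemy"
            levels[key[len(PREFIX) + 1:].lower()] = value.upper()
    return levels


def configure_logging(env):
    """Install the log format and apply per-logger levels from env."""
    logging.basicConfig(format=LOG_FORMAT)
    for name, level in logger_levels_from_env(env).items():
        logging.getLogger(name or None).setLevel(level)
```

In a real process this would be called once at startup as `configure_logging(os.environ)`.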
|
||||
|
||||
### Error Tracking
|
||||
|
||||
**Sentry Integration**:
|
||||
|
||||
```ini
# acoustid.conf
[sentry]
dsn = https://...@sentry.io/...
environment = production
traces_sample_rate = 0.1
```
|
||||
|
||||
**Configuration**:
|
||||
```python
import sentry_sdk
from sentry_sdk.integrations.flask import FlaskIntegration

sentry_sdk.init(
    dsn=config.sentry.dsn,
    environment=config.sentry.environment,
    traces_sample_rate=config.sentry.traces_sample_rate,
    integrations=[FlaskIntegration()]
)
```
|
||||
|
||||
## Scaling Strategies
|
||||
|
||||
### Horizontal Scaling
|
||||
|
||||
**API/Web**:
|
||||
- Add more instances behind load balancer
|
||||
- No shared state (stateless)
|
||||
- Session data in Redis if needed
|
||||
|
||||
**Workers**:
|
||||
- Add more instances
|
||||
- NATS distributes work automatically
|
||||
- No coordination required
|
||||
|
||||
**Index**:
|
||||
- Shard by fingerprint ID
|
||||
- Consistent hashing for distribution
|
||||
- NATS for cluster coordination
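
To make "consistent hashing for distribution" concrete, here is a minimal hash-ring sketch that maps a fingerprint ID to an index shard. It illustrates the general technique only; the shard names, the virtual-node count and the `HashRing` class are assumptions, and the source does not describe how acoustid-index actually assigns shards.

```python
import bisect
import hashlib
from typing import List, Tuple


class HashRing:
    """Consistent-hash ring with virtual nodes (illustrative only)."""

    def __init__(self, shards: List[str], vnodes: int = 64):
        self._ring: List[Tuple[int, str]] = []
        for shard in shards:
            # Several points per shard smooth out the key distribution.
            for i in range(vnodes):
                self._ring.append((self._hash(f"{shard}#{i}"), shard))
        self._ring.sort()
        self._points = [point for point, _ in self._ring]

    @staticmethod
    def _hash(key: str) -> int:
        return int.from_bytes(hashlib.sha1(key.encode()).digest()[:8], "big")

    def shard_for(self, fingerprint_id: int) -> str:
        """Return the shard owning the first ring point after the key's hash."""
        idx = bisect.bisect_right(self._points, self._hash(str(fingerprint_id)))
        return self._ring[idx % len(self._ring)][1]  # wrap around the ring
```

The useful property is that adding a shard only moves keys onto the new shard; keys never shuffle between existing shards, which keeps rebalancing traffic proportional to the capacity added.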
|
||||
|
||||
### Vertical Scaling
|
||||
|
||||
**Database**:
|
||||
- Increase shared_buffers (25% of RAM)
|
||||
- Increase effective_cache_size (50-75% of RAM)
|
||||
- Add more CPU for parallel queries
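
The percentages above translate into concrete values as follows. This is a small worked calculation rather than a tuning tool; the function name is hypothetical and the results are only the rule-of-thumb starting points stated in this section.

```python
def suggest_pg_memory_settings(total_ram_gb: int) -> dict:
    """Apply the 25% / 50-75% rules of thumb to a host's total RAM."""
    return {
        "shared_buffers": f"{total_ram_gb // 4}GB",                # 25% of RAM
        "effective_cache_size_min": f"{total_ram_gb // 2}GB",      # 50% of RAM
        "effective_cache_size_max": f"{total_ram_gb * 3 // 4}GB",  # 75% of RAM
    }
```

For the 64 GB PostgreSQL host in the sizing table, this gives `shared_buffers = 16GB` and an `effective_cache_size` between 32 GB and 48 GB.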
|
||||
|
||||
**Index**:
|
||||
- Increase thread count
|
||||
- Larger memory segment
|
||||
- Faster disk (NVMe)
|
||||
|
||||
### Caching
|
||||
|
||||
**Application-Level**:
|
||||
- API key cache (in-memory, 60s TTL)
|
||||
- Format lookup cache (permanent)
|
||||
- MBID existence cache (Redis, 1h TTL)
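
An in-memory cache with a fixed TTL, like the 60-second API key cache listed above, can be sketched in a few lines. The class and function names below are hypothetical and this is not the server's implementation; the injectable `clock` exists only to make expiry testable.

```python
import time
from typing import Any, Callable, Dict, Hashable, Optional, Tuple


class TTLCache:
    """Tiny in-memory cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[Hashable, Tuple[float, Any]] = {}

    def get(self, key: Hashable) -> Optional[Any]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # expired: drop and report a miss
            return None
        return value

    def put(self, key: Hashable, value: Any) -> None:
        self._entries[key] = (self._clock() + self._ttl, value)


def load_with_cache(cache: TTLCache, key: Hashable, loader: Callable[[Hashable], Any]) -> Any:
    """Return the cached value, calling loader (e.g. a DB query) only on a miss."""
    cached = cache.get(key)
    if cached is not None:
        return cached
    value = loader(key)
    cache.put(key, value)
    return value
```

A short TTL bounds how long a revoked API key keeps working, at the cost of one database read per key per minute.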
|
||||
|
||||
**Database-Level**:
|
||||
- Connection pooling
|
||||
- Query result caching
|
||||
- Materialized views
|
||||
|
||||
## Backup and Disaster Recovery
|
||||
|
||||
### Backup Strategy
|
||||
|
||||
**PostgreSQL**:
|
||||
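```bash
# Daily logical dump (custom format; restore with pg_restore)
pg_dump -Fc acoustid_app > acoustid_app_$(date +%Y%m%d).dump

# Continuous WAL archiving: set in postgresql.conf (not a shell command).
# Note: WAL replay needs a physical base backup (e.g. pg_basebackup), not a pg_dump file.
# archive_command = 'cp %p /backup/wal/%f'
```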
|
||||
|
||||
**Index**:
|
||||
```bash
# Daily snapshot
curl -X GET http://index:6081/fingerprints/_snapshot

# Backup segment files
rsync -av /var/lib/acoustid-index/ /backup/index/
```
|
||||
|
||||
**Redis**:
|
||||
```bash
# redis.conf settings
# RDB snapshot (automatic)
save 900 1
save 300 10
save 60 10000

# AOF (append-only file)
appendonly yes
appendfsync everysec
```
|
||||
|
||||
### Disaster Recovery
|
||||
|
||||
**Recovery Time Objective (RTO)**: 1 hour
|
||||
**Recovery Point Objective (RPO)**: 5 minutes
|
||||
|
||||
**Recovery Steps**:
|
||||
1. Restore PostgreSQL from latest backup
|
||||
2. Replay WAL to point-in-time
|
||||
3. Restore Redis from RDB/AOF
|
||||
4. Restore index from snapshot
|
||||
5. Rebuild index from database if needed
|
||||
6. Restart all services
|
||||
7. Verify health checks
|
||||
@@ -0,0 +1,617 @@
|
||||
# AcoustID System Evaluation
|
||||
|
||||
## Executive Summary
|
||||
|
||||
AcoustID is a mature, production-proven audio fingerprinting system that combines a Python-based web service with a cutting-edge Zig-based search index. The system has been running in production for over a decade, processing millions of fingerprint submissions and lookups. This evaluation assesses its strengths, weaknesses, integration potential, and relevance for metadata aggregation projects.
|
||||
|
||||
## Strengths
|
||||
|
||||
### 1. Open Source and Well-Licensed
|
||||
|
||||
**Advantage**: Complete transparency and flexibility
|
||||
|
||||
- **Server License**: MIT (permissive, commercial-friendly)
|
||||
- **Index License**: GPL-3.0 (copyleft, but separate service)
|
||||
- **Chromaprint**: MIT (can be used independently)
|
||||
- **No Vendor Lock-in**: Full control over deployment and modifications
|
||||
|
||||
**Impact**: Can be self-hosted, modified, or used as a reference implementation without licensing concerns. The GPL license on the index is acceptable since it runs as a separate service.
|
||||
|
||||
### 2. Production-Proven at Scale
|
||||
|
||||
**Advantage**: Battle-tested reliability
|
||||
|
||||
- **Years in Production**: 10+ years serving acoustid.org
|
||||
- **Database Size**: Millions of fingerprints and tracks
|
||||
- **Request Volume**: Handles high traffic with proven architecture
|
||||
- **Real-World Data**: Extensive test coverage from actual usage
|
||||
|
||||
**Impact**: Low risk of fundamental design flaws. Known performance characteristics and scaling patterns.
|
||||
|
||||
### 3. Advanced Index Technology
|
||||
|
||||
**Advantage**: State-of-the-art search performance
|
||||
|
||||
- **LSM-Tree Architecture**: Efficient for write-heavy workloads
|
||||
- **SIMD Compression**: StreamVByte for 4-8x compression with minimal CPU overhead
|
||||
- **Low-Latency Search**: P50 latency around 5ms
|
||||
- **Modern Language**: Zig offers manual memory management with optional runtime safety checks and no garbage collection overhead
|
||||
|
||||
**Impact**: The index is one of the most sophisticated open-source fingerprint search implementations available. Significantly faster than naive database-based approaches.
|
||||
|
||||
### 4. MusicBrainz Integration
|
||||
|
||||
**Advantage**: Direct access to comprehensive music metadata
|
||||
|
||||
- **Direct Database Access**: No API rate limits or latency
|
||||
- **Rich Metadata**: Artist credits, releases, release groups, tracks
|
||||
- **MBID Mapping**: Links audio fingerprints to canonical music identifiers
|
||||
- **Redirect Resolution**: Handles merged entities automatically
|
||||
|
||||
**Impact**: Provides a complete solution for audio identification with metadata enrichment. Eliminates need for separate metadata lookup infrastructure.
|
||||
|
||||
### 5. Comprehensive API
|
||||
|
||||
**Advantage**: Well-designed public API
|
||||
|
||||
- **Multiple Endpoints**: Lookup, submit, status, user management
|
||||
- **Batch Operations**: Up to 20 fingerprints per request
|
||||
- **Flexible Metadata**: Configurable response detail levels
|
||||
- **Multiple Formats**: JSON, XML, JSONP support
|
||||
- **Rate Limiting**: Built-in protection against abuse
|
||||
|
||||
**Impact**: Easy to integrate as a client. Can also serve as a reference for building similar APIs.
|
||||
|
||||
### 6. Well-Structured Codebase
|
||||
|
||||
**Advantage**: Maintainable and extensible
|
||||
|
||||
- **Layered Architecture**: Clear separation of concerns
|
||||
- **Service Pattern**: Business logic isolated from presentation
|
||||
- **Type Hints**: Modern Python with type annotations
|
||||
- **Comprehensive Tests**: 24 test files with good coverage
|
||||
- **Documentation**: Inline comments and docstrings
|
||||
|
||||
**Impact**: Easy to understand, modify, and extend. Low barrier to contribution or customization.
|
||||
|
||||
### 7. Modern Infrastructure
|
||||
|
||||
**Advantage**: Uses current best practices
|
||||
|
||||
- **Docker Support**: Full containerization with multi-stage builds
|
||||
- **Docker Compose**: Complete local development environment
|
||||
- **CI/CD**: GitHub Actions for automated testing and deployment
|
||||
- **Async Support**: Migration to Starlette for async operations
|
||||
- **Message Queue**: NATS with JetStream for reliable async processing
|
||||
|
||||
**Impact**: Easy to deploy and operate. Follows industry standards for cloud-native applications.
|
||||
|
||||
## Weaknesses
|
||||
|
||||
### 1. Complex Deployment Requirements
|
||||
|
||||
**Disadvantage**: High operational overhead
|
||||
|
||||
**Required Services**:
|
||||
- PostgreSQL 17.4 (4 separate databases)
|
||||
- Custom PostgreSQL extension (acoustid)
|
||||
- Redis (caching and rate limiting)
|
||||
- NATS with JetStream (message queue)
|
||||
- Zig-based index service
|
||||
- Multiple Python processes (API, web, worker, cron)
|
||||
|
||||
**Minimum Resources**:
|
||||
- 10+ CPU cores
|
||||
- 11.5 GB RAM
|
||||
- 190 GB disk space
|
||||
|
||||
**Impact**: Self-hosting requires significant infrastructure investment. Not suitable for small-scale deployments or embedded use cases. The custom PostgreSQL extension adds deployment complexity.
|
||||
|
||||
### 2. Custom PostgreSQL Extension Required
|
||||
|
||||
**Disadvantage**: Non-standard database setup
|
||||
|
||||
- **C Extension**: acoustid extension must be compiled and installed
|
||||
- **Platform-Specific**: Requires PostgreSQL development headers
|
||||
- **Maintenance Burden**: Must be updated for new PostgreSQL versions
|
||||
- **Deployment Complexity**: Cannot use standard PostgreSQL images without modification
|
||||
|
||||
**Impact**: Increases deployment complexity and maintenance burden. Limits hosting options (managed PostgreSQL services won't work).
|
||||
|
||||
### 3. Transitioning Codebase
|
||||
|
||||
**Disadvantage**: Mixed old and new code
|
||||
|
||||
**Transition Areas**:
|
||||
- Flask to Starlette (both frameworks present)
|
||||
- Legacy TCP index protocol to HTTP (both protocols supported)
|
||||
- Synchronous to asynchronous operations (mixed patterns)
|
||||
|
||||
**Impact**: Code complexity from supporting both old and new approaches. Potential for bugs at transition boundaries. Documentation may be inconsistent.
|
||||
|
||||
### 4. Legacy Code Paths
|
||||
|
||||
**Disadvantage**: Technical debt
|
||||
|
||||
**Legacy Components**:
|
||||
- Old API v1 endpoints (deprecated but still present)
|
||||
- TCP-based index client (being phased out)
|
||||
- Synchronous database operations (alongside async)
|
||||
- PUID support (MusicIP legacy)
|
||||
|
||||
**Impact**: Increased codebase size and complexity. Potential security or performance issues in unmaintained code paths.
|
||||
|
||||
### 5. Zig Index Maturity
|
||||
|
||||
**Disadvantage**: Relatively new implementation
|
||||
|
||||
- **Language Maturity**: Zig is pre-1.0 (0.11.0 at the time of this analysis)
|
||||
- **Ecosystem**: Limited third-party libraries
|
||||
- **Community**: Smaller than established languages
|
||||
- **Breaking Changes**: Zig language still evolving
|
||||
- **Debugging Tools**: Less mature than C/C++/Rust
|
||||
|
||||
**Impact**: Potential for language-level breaking changes. Smaller pool of developers familiar with Zig. May require more effort to debug or extend.
|
||||
|
||||
### 6. Limited Documentation
|
||||
|
||||
**Disadvantage**: Steep learning curve
|
||||
|
||||
**Documentation Gaps**:
|
||||
- No comprehensive architecture documentation (until this analysis)
|
||||
- Limited API examples beyond basic usage
|
||||
- Index protocol not formally documented
|
||||
- Deployment guide assumes Docker knowledge
|
||||
- No performance tuning guide
|
||||
|
||||
**Impact**: Difficult for newcomers to understand system internals. Trial and error required for optimization and troubleshooting.
|
||||
|
||||
### 7. Tight MusicBrainz Coupling
|
||||
|
||||
**Disadvantage**: Assumes MusicBrainz availability
|
||||
|
||||
- **Direct Database Dependency**: Requires MusicBrainz database replica
|
||||
- **Schema Coupling**: Queries specific MusicBrainz table structures
|
||||
- **No Abstraction**: MusicBrainz logic embedded throughout codebase
|
||||
- **Alternative Sources**: Difficult to use other metadata providers
|
||||
|
||||
**Impact**: Cannot easily substitute alternative metadata sources. Requires maintaining MusicBrainz database replica for full functionality.
|
||||
|
||||
## Integration Considerations
|
||||
|
||||
### As a Public API Client
|
||||
|
||||
**Recommendation**: Best approach for most use cases
|
||||
|
||||
**Advantages**:
|
||||
- No infrastructure to maintain
|
||||
- Proven reliability (acoustid.org uptime)
|
||||
- Free for reasonable usage
|
||||
- Immediate availability
|
||||
|
||||
**Disadvantages**:
|
||||
- Rate limits (3 req/s default, 10 req/s with API key)
|
||||
- Network latency
|
||||
- Dependency on external service
|
||||
- No control over data or features
|
||||
|
||||
**Best For**:
|
||||
- Small to medium scale applications
|
||||
- Prototyping and development
|
||||
- Applications with intermittent fingerprinting needs
|
||||
- Projects without infrastructure budget
|
||||
|
||||
**Implementation**:
|
||||
```python
import requests


def lookup_fingerprint(fingerprint, duration):
    response = requests.post('https://api.acoustid.org/v2/lookup', data={
        'client': 'YOUR_API_KEY',
        'duration': int(duration),
        'fingerprint': fingerprint,
        # Form-encoded body: separate meta flags with a space ('+' would be sent as %2B)
        'meta': 'recordings releases',
    }, timeout=10)
    response.raise_for_status()  # surface HTTP errors instead of parsing an error body
    return response.json()
```
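
Because the public API is rate limited, a client usually needs to space out its requests. The sketch below is one simple way to do that on the client side; the `Throttle` class is hypothetical, and the rate passed to it should be whatever limit applies to your API key.

```python
import time


class Throttle:
    """Client-side limiter that spaces calls at least 1/max_per_second apart."""

    def __init__(self, max_per_second, clock=time.monotonic, sleep=time.sleep):
        self._interval = 1.0 / max_per_second
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = 0.0

    def wait(self):
        """Block until the next request is allowed; return the seconds slept."""
        now = self._clock()
        delay = max(0.0, self._next_allowed - now)
        if delay:
            self._sleep(delay)
        # The next slot opens one interval after this request's slot.
        self._next_allowed = max(now, self._next_allowed) + self._interval
        return delay
```

Usage is a single call before each request, for example `throttle = Throttle(3)` once at startup and `throttle.wait()` immediately before every lookup.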
|
||||
|
||||
### Self-Hosted Deployment
|
||||
|
||||
**Recommendation**: Only for large-scale or specialized needs
|
||||
|
||||
**Advantages**:
|
||||
- Full control over data and features
|
||||
- No rate limits
|
||||
- Low latency (local network)
|
||||
- Customization possible
|
||||
- Data privacy
|
||||
|
||||
**Disadvantages**:
|
||||
- High infrastructure cost
|
||||
- Operational complexity
|
||||
- Maintenance burden
|
||||
- Requires expertise
|
||||
|
||||
**Best For**:
|
||||
- Large-scale commercial applications
|
||||
- Privacy-sensitive use cases
|
||||
- Custom fingerprinting algorithms
|
||||
- Research and development
|
||||
|
||||
**Minimum Viable Deployment**:
|
||||
```yaml
# docker-compose.yml (simplified)
services:
  postgres:
    image: ghcr.io/acoustid/postgresql:17.4
    volumes:
      - postgres_data:/var/lib/postgresql/data

  redis:
    image: redis:7-alpine

  nats:
    image: nats:2-alpine
    command: -js

  index:
    image: ghcr.io/acoustid/acoustid-index:latest
    volumes:
      - index_data:/var/lib/acoustid-index

  api:
    image: ghcr.io/acoustid/acoustid-server:latest
    command: run api
    depends_on: [postgres, redis, nats, index]

# Named volumes referenced above must be declared at the top level
volumes:
  postgres_data:
  index_data:
```
|
||||
|
||||
### Chromaprint Library Only
|
||||
|
||||
**Recommendation**: For custom fingerprinting without AcoustID infrastructure
|
||||
|
||||
**Advantages**:
|
||||
- Minimal dependencies (just Chromaprint library)
|
||||
- Full control over fingerprint storage and matching
|
||||
- No network dependency
|
||||
- Lightweight
|
||||
|
||||
**Disadvantages**:
|
||||
- Must implement own matching algorithm
|
||||
- No MusicBrainz integration
|
||||
- No existing fingerprint database
|
||||
- Higher development effort
|
||||
|
||||
**Best For**:
|
||||
- Custom audio analysis applications
|
||||
- Offline fingerprinting
|
||||
- Embedded systems
|
||||
- Research projects
|
||||
|
||||
**Implementation**:
|
||||
```python
import chromaprint  # assumes the ctypes wrapper module shipped with pyacoustid

# Generate fingerprint from raw PCM audio
fingerprinter = chromaprint.Fingerprinter()
fingerprinter.start(sample_rate, num_channels)
fingerprinter.feed(audio_data)
fingerprint = fingerprinter.finish()  # finish() returns the fingerprint

# Store and match fingerprints yourself
# (requires custom implementation)
```
|
||||
|
||||
### Hybrid Approach
|
||||
|
||||
**Recommendation**: Best of both worlds for growing applications
|
||||
|
||||
**Strategy**:
|
||||
1. Start with public API for lookups
|
||||
2. Use Chromaprint library for fingerprint generation
|
||||
3. Store fingerprints locally for future use
|
||||
4. Migrate to self-hosted when scale justifies cost
|
||||
|
||||
**Advantages**:
|
||||
- Low initial cost
|
||||
- Gradual migration path
|
||||
- Flexibility to optimize later
|
||||
- Reduced vendor lock-in
|
||||
|
||||
**Implementation**:
|
||||
```python
import acoustid  # pyacoustid


class HybridFingerprintService:
    def __init__(self):
        # LocalFingerprintDB and AcoustIDClient are placeholders for your own classes
        self.local_db = LocalFingerprintDB()
        self.acoustid_client = AcoustIDClient()

    def identify(self, audio_file):
        # Generate fingerprint locally
        duration, fingerprint = acoustid.fingerprint_file(audio_file)

        # Check local database first
        match = self.local_db.search(fingerprint)
        if match:
            return match

        # Fall back to AcoustID API
        result = self.acoustid_client.lookup(fingerprint, duration)

        # Cache result locally
        if result:
            self.local_db.store(fingerprint, result)

        return result
```
|
||||
|
||||
## Relevance for Metadata Aggregation
|
||||
|
||||
### High Relevance Scenarios
|
||||
|
||||
**1. Audio File Identification**
|
||||
|
||||
AcoustID excels at identifying audio files without metadata:
|
||||
|
||||
- **Use Case**: User uploads audio file with missing tags
|
||||
- **Solution**: Generate fingerprint, lookup via AcoustID, retrieve MBIDs
|
||||
- **Benefit**: Accurate identification even with transcoding or quality differences
|
||||
|
||||
**2. Duplicate Detection**
|
||||
|
||||
Fingerprints enable perceptual duplicate detection:
|
||||
|
||||
- **Use Case**: Detect duplicate tracks in large music library
|
||||
- **Solution**: Fingerprint all tracks, compare for similarity
|
||||
- **Benefit**: Finds duplicates even with different encodings or slight edits
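
"Compare for similarity" can be illustrated with a bit-error comparison of two raw Chromaprint fingerprints, which are sequences of 32-bit integers (for example as printed by `fpcalc -raw`). This is a deliberately simplified sketch: it compares the two sequences position by position and does not search for the best time offset, which a real matcher would need to do. The function name is hypothetical and any duplicate threshold is an assumption to be tuned on your own data.

```python
from typing import Sequence


def fingerprint_similarity(a: Sequence[int], b: Sequence[int]) -> float:
    """Fraction of matching bits over the overlapping prefix of two raw fingerprints."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    # XOR leaves a 1 wherever the two 32-bit values disagree.
    differing_bits = sum(bin((x ^ y) & 0xFFFFFFFF).count("1") for x, y in zip(a, b))
    return 1.0 - differing_bits / (32 * n)
```

Identical fingerprints score 1.0, while unrelated audio tends toward 0.5 because about half of the bits agree by chance.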
|
||||
|
||||
**3. MBID Enrichment**
|
||||
|
||||
Links audio files to canonical MusicBrainz identifiers:
|
||||
|
||||
- **Use Case**: Enrich audio metadata with MusicBrainz data
|
||||
- **Solution**: Fingerprint -> AcoustID -> MBID -> MusicBrainz metadata
|
||||
- **Benefit**: Access to comprehensive, community-maintained metadata
|
||||
|
||||
**4. Quality Verification**
|
||||
|
||||
Verify metadata accuracy:
|
||||
|
||||
- **Use Case**: Check if file metadata matches actual audio content
|
||||
- **Solution**: Compare fingerprint-based identification with existing tags
|
||||
- **Benefit**: Detect mislabeled or corrupted files
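
The comparison step can be sketched as a normalized match between a file's existing tags and the metadata returned by a fingerprint lookup. The field names (`title`, `artist`) and both helper functions are assumptions for illustration; a production check would likely use fuzzy matching rather than strict equality.

```python
import unicodedata


def normalize(text: str) -> str:
    """Case-fold, strip accents and punctuation so cosmetic differences are ignored."""
    decomposed = unicodedata.normalize("NFKD", text).casefold()
    return "".join(ch for ch in decomposed if ch.isalnum())


def tags_match(tags: dict, identified: dict, fields=("title", "artist")) -> bool:
    """True when every compared field agrees after normalization."""
    return all(
        normalize(tags.get(field, "")) == normalize(identified.get(field, ""))
        for field in fields
    )
```

Files for which `tags_match` returns `False` are candidates for manual review or automatic retagging.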
|
||||
|
||||
### Medium Relevance Scenarios
|
||||
|
||||
**5. Playlist Generation**
|
||||
|
||||
Acoustic similarity for recommendations:
|
||||
|
||||
- **Use Case**: Generate playlists of similar-sounding tracks
|
||||
- **Solution**: Compare fingerprints for acoustic similarity
|
||||
- **Benefit**: Recommendations based on actual audio, not just metadata
|
||||
|
||||
**6. Copyright Detection**
|
||||
|
||||
Identify copyrighted content:
|
||||
|
||||
- **Use Case**: Detect copyrighted music in user uploads
|
||||
- **Solution**: Fingerprint uploads, match against known copyrighted works
|
||||
- **Benefit**: Automated content moderation
|
||||
|
||||
### Low Relevance Scenarios
|
||||
|
||||
**7. Real-Time Audio Recognition**
|
||||
|
||||
AcoustID is not optimized for real-time use:
|
||||
|
||||
- **Limitation**: Requires full audio file or significant portion
|
||||
- **Alternative**: Shazam-style services designed for short audio snippets
|
||||
- **Workaround**: Use Chromaprint with custom matching for real-time needs
|
||||
|
||||
**8. Music Recommendation**
|
||||
|
||||
Limited to acoustic similarity:
|
||||
|
||||
- **Limitation**: No semantic understanding of music (genre, mood, etc.)
|
||||
- **Alternative**: Dedicated recommendation engines (Spotify API, Last.fm)
|
||||
- **Workaround**: Combine with metadata-based recommendation
|
||||
|
||||
## Comparison with Alternatives
|
||||
|
||||
### vs. Shazam/ACRCloud (Commercial)
|
||||
|
||||
| Feature | AcoustID | Shazam/ACRCloud |
|---------|----------|-----------------|
| License | Open source (MIT/GPL) | Proprietary |
| Cost | Free (self-host or API) | Paid API |
| Database Size | Community-driven | Commercial catalog |
| Real-Time | No | Yes |
| Accuracy | High | Very high |
| Customization | Full | Limited |
|
||||
|
||||
**Verdict**: AcoustID better for self-hosted, customizable solutions. Shazam better for real-time recognition and commercial catalog coverage.
|
||||
|
||||
### vs. Echoprint (Open Source)
|
||||
|
||||
| Feature | AcoustID | Echoprint |
|---------|----------|-----------|
| Maintenance | Active | Abandoned (2014) |
| Index Technology | Modern (LSM-tree, SIMD) | Legacy |
| Language | Python + Zig | Python + C++ |
| MusicBrainz | Integrated | No |
| Community | Active | Dead |
|
||||
|
||||
**Verdict**: AcoustID is the clear winner. Echoprint is no longer maintained.
|
||||
|
||||
### vs. Chromaprint Alone
|
||||
|
||||
| Feature | AcoustID | Chromaprint Only |
|---------|----------|------------------|
| Fingerprint Generation | Yes | Yes |
| Fingerprint Matching | Yes | No (DIY) |
| Metadata | MusicBrainz | No |
| Infrastructure | Required | Minimal |
| Development Effort | Low | High |
|
||||
|
||||
**Verdict**: AcoustID provides complete solution. Chromaprint alone requires significant custom development.
|
||||
|
||||
## Recommendations
|
||||
|
||||
### For Small Projects (< 10k lookups/month)
|
||||
|
||||
**Recommendation**: Use public AcoustID API
|
||||
|
||||
**Rationale**:
|
||||
- Free tier sufficient
|
||||
- No infrastructure cost
|
||||
- Immediate availability
|
||||
- Proven reliability
|
||||
|
||||
**Implementation**:
|
||||
```python
# Simple integration
import acoustid

results = acoustid.match(api_key, audio_file)
for score, recording_id, title, artist in results:
    print(f"{title} by {artist} (score: {score})")
```
|
||||
|
||||
### For Medium Projects (10k-1M lookups/month)
|
||||
|
||||
**Recommendation**: Hybrid approach
|
||||
|
||||
**Rationale**:
|
||||
- Public API for initial lookups
|
||||
- Local caching for repeated queries
|
||||
- Gradual migration path to self-hosted
|
||||
- Cost-effective scaling
|
||||
|
||||
**Implementation**:
|
||||
- Use public API with caching layer
|
||||
- Store fingerprints locally
|
||||
- Monitor usage and costs
|
||||
- Migrate to self-hosted when justified
|
||||
|
||||
### For Large Projects (> 1M lookups/month)
|
||||
|
||||
**Recommendation**: Self-hosted deployment
|
||||
|
||||
**Rationale**:
|
||||
- Cost savings at scale
|
||||
- Full control and customization
|
||||
- Low latency
|
||||
- No rate limits
|
||||
|
||||
**Implementation**:
|
||||
- Deploy full stack (PostgreSQL, Redis, NATS, Index, API)
|
||||
- Import existing fingerprint database
|
||||
- Implement monitoring and alerting
|
||||
- Plan for high availability
|
||||
|
||||
### For Research Projects
|
||||
|
||||
**Recommendation**: Chromaprint library + custom matching
|
||||
|
||||
**Rationale**:
|
||||
- Full control over algorithms
|
||||
- No external dependencies
|
||||
- Flexibility for experimentation
|
||||
- Academic freedom
|
||||
|
||||
**Implementation**:
|
||||
- Use Chromaprint for fingerprint generation
|
||||
- Implement custom similarity metrics
|
||||
- Experiment with index structures
|
||||
- Publish findings
|
||||
|
||||
### For Privacy-Sensitive Applications
|
||||
|
||||
**Recommendation**: Self-hosted deployment
|
||||
|
||||
**Rationale**:
|
||||
- No data sent to third parties
|
||||
- Full control over data retention
|
||||
- Compliance with privacy regulations
|
||||
- Audit trail
|
||||
|
||||
**Implementation**:
|
||||
- Deploy on-premises or private cloud
|
||||
- Implement access controls
|
||||
- Enable audit logging
|
||||
- Regular security updates
|
||||
|
||||
## Future Considerations
|
||||
|
||||
### Potential Improvements
|
||||
|
||||
**1. Simplified Deployment**
|
||||
|
||||
- Single-binary deployment option
|
||||
- Embedded database (SQLite) for small-scale use
|
||||
- Optional components (make MusicBrainz integration optional)
|
||||
|
||||
**2. Better Documentation**
|
||||
|
||||
- Architecture guide (this document is a start)
|
||||
- Performance tuning guide
|
||||
- Troubleshooting guide
|
||||
- Video tutorials
|
||||
|
||||
**3. Alternative Metadata Sources**
|
||||
|
||||
- Plugin system for metadata providers
|
||||
- Support for Discogs, Spotify, etc.
|
||||
- Configurable metadata priority
|
||||
|
||||
**4. Enhanced API**
|
||||
|
||||
- GraphQL endpoint
|
||||
- WebSocket for real-time updates
|
||||
- Bulk operations API
|
||||
- Admin API for self-hosted instances
|
||||
|
||||
**5. Index Improvements**
|
||||
|
||||
- Distributed index with automatic sharding
|
||||
- Replication for high availability
|
||||
- Incremental backups
|
||||
- Query result caching
|
||||
|
||||
### Technology Evolution
|
||||
|
||||
**Zig Maturity**:
|
||||
- Monitor Zig 1.0 release
|
||||
- Evaluate stability and ecosystem growth
|
||||
- Consider Rust alternative if Zig adoption stalls
|
||||
|
||||
**Async Migration**:
|
||||
- Complete Flask to Starlette transition
|
||||
- Remove legacy synchronous code paths
|
||||
- Optimize for async/await patterns
|
||||
|
||||
**Cloud-Native**:
|
||||
- Kubernetes deployment manifests
|
||||
- Helm charts
|
||||
- Operator for automated management
|
||||
- Service mesh integration
|
||||
|
||||
## Conclusion
|
||||
|
||||
AcoustID is a **highly capable, production-ready audio fingerprinting system** with significant strengths in accuracy, performance, and MusicBrainz integration. The open-source license and mature codebase make it an excellent choice for projects requiring audio identification.
|
||||
|
||||
**Key Takeaways**:
|
||||
|
||||
1. **Use the public API** for most small to medium projects
|
||||
2. **Self-host only when scale justifies** the operational complexity
|
||||
3. **Chromaprint library alone** is viable for custom implementations
|
||||
4. **MusicBrainz integration** is a major value-add for metadata enrichment
|
||||
5. **Deployment complexity** is the main barrier to adoption
|
||||
|
||||
**Overall Assessment**: **Highly Recommended** for metadata aggregation projects that need audio fingerprinting, with the caveat that self-hosting requires significant infrastructure investment.
|
||||
|
||||
**Rating**: 8.5/10
|
||||
|
||||
**Strengths**: Production-proven, open source, excellent MusicBrainz integration, modern index technology
|
||||
**Weaknesses**: Complex deployment, custom PostgreSQL extension, transitioning codebase
|
||||
**Best Use Case**: Audio file identification and MBID enrichment via public API or self-hosted deployment at scale
|
||||
@@ -0,0 +1,768 @@
|
||||
# AcoustID Integrations
|
||||
|
||||
## Overview
|
||||
|
||||
AcoustID integrates with multiple external services and libraries to provide comprehensive audio fingerprinting and metadata enrichment. The system's architecture separates concerns between fingerprint generation (Chromaprint), fingerprint indexing (acoustid-index), metadata enrichment (MusicBrainz), and supporting infrastructure (Redis, NATS).
|
||||
|
||||
## MusicBrainz Integration
|
||||
|
||||
### Connection Method
|
||||
|
||||
**Type**: Direct PostgreSQL database connection (NOT REST API)
|
||||
**Database**: `musicbrainz` (read-only replica)
|
||||
**Access**: Separate database connection pool
|
||||
|
||||
**Configuration** (`acoustid.conf`):
|
||||
```ini
[musicbrainz]
host = musicbrainz-db.example.com
port = 5432
name = musicbrainz_db
user = acoustid_readonly
password_file = /run/secrets/mb_password
```
|
||||
|
||||
**File**: `acoustid/data/musicbrainz.py`
|
||||
|
||||
### Queried Tables
|
||||
|
||||
The integration queries the following MusicBrainz tables directly:
|
||||
|
||||
| Table | Purpose | Columns Used |
|-------|---------|--------------|
| `artist_credit` | Artist information | `id`, `name`, `artist_count` |
| `artist_credit_name` | Artist credit details | `artist_credit`, `position`, `artist`, `name`, `join_phrase` |
| `artist` | Artist entities | `id`, `gid`, `name`, `sort_name` |
| `recording` | Recording metadata | `id`, `gid`, `name`, `length`, `artist_credit`, `comment` |
| `release` | Release information | `id`, `gid`, `name`, `artist_credit`, `release_group`, `status`, `packaging`, `barcode` |
| `release_group` | Release group data | `id`, `gid`, `name`, `artist_credit`, `type`, `comment` |
| `track` | Track listings | `id`, `gid`, `recording`, `position`, `number`, `name`, `length`, `artist_credit` |
| `medium` | Medium information | `id`, `release`, `position`, `format`, `track_count` |
| `release_country` | Release countries | `release`, `country`, `date_year`, `date_month`, `date_day` |
|
||||
|
||||
### Query Patterns
|
||||
|
||||
**Fetch Recording by MBID**:
|
||||
|
||||
```python
def get_recording_by_mbid(db, mbid):
    """Fetch recording with artist credits and releases."""
    query = """
        SELECT
            r.gid AS recording_mbid,
            r.name AS recording_title,
            r.length AS duration,
            ac.name AS artist_credit_name,
            array_agg(DISTINCT rel.gid) AS release_mbids
        FROM recording r
        JOIN artist_credit ac ON r.artist_credit = ac.id
        LEFT JOIN track t ON t.recording = r.id
        LEFT JOIN medium m ON t.medium = m.id
        LEFT JOIN release rel ON m.release = rel.id
        WHERE r.gid = :mbid
        GROUP BY r.gid, r.name, r.length, ac.name
    """
    return db.execute(query, {'mbid': mbid}).fetchone()
```
|
||||
|
||||
**Fetch Release with Tracks**:
|
||||
|
||||
```python
def get_release_with_tracks(db, release_mbid):
    """Fetch complete release with all tracks."""
    query = """
        SELECT
            rel.gid AS release_mbid,
            rel.name AS release_title,
            rel.barcode,
            rc.country,
            rc.date_year,
            rc.date_month,
            rc.date_day,
            m.position AS medium_position,
            m.format AS medium_format,
            t.position AS track_position,
            t.number AS track_number,
            t.name AS track_title,
            rec.gid AS recording_mbid,
            ac.name AS artist_credit
        FROM release rel
        LEFT JOIN release_country rc ON rel.id = rc.release
        LEFT JOIN medium m ON rel.id = m.release
        LEFT JOIN track t ON m.id = t.medium
        LEFT JOIN recording rec ON t.recording = rec.id
        LEFT JOIN artist_credit ac ON rec.artist_credit = ac.id
        WHERE rel.gid = :mbid
        ORDER BY m.position, t.position
    """
    return db.execute(query, {'mbid': release_mbid}).fetchall()
```
|
||||
|
||||
**Fetch Artist Credits**:
|
||||
|
||||
```python
def get_artist_credit(db, artist_credit_id):
    """Fetch artist credit with all artists."""
    query = """
        SELECT
            acn.position,
            a.gid AS artist_mbid,
            a.name AS artist_name,
            a.sort_name AS artist_sort_name,
            acn.name AS credited_name,
            acn.join_phrase
        FROM artist_credit_name acn
        JOIN artist a ON acn.artist = a.id
        WHERE acn.artist_credit = :ac_id
        ORDER BY acn.position
    """
    return db.execute(query, {'ac_id': artist_credit_id}).fetchall()
```
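
The rows returned by the artist credit query can be combined into a display string by concatenating each credited name with its join phrase, in position order. The helper below is a hypothetical sketch that assumes rows behave like mappings keyed by the column aliases used in the query (`position`, `credited_name`, `join_phrase`).

```python
def format_artist_credit(rows):
    """Join credited names and join phrases, e.g. 'Queen & David Bowie'."""
    ordered = sorted(rows, key=lambda row: row["position"])
    # join_phrase may be NULL/empty for the last artist in the credit.
    return "".join(f"{row['credited_name']}{row['join_phrase'] or ''}" for row in ordered)
```

For example, two rows with names `Queen` (join phrase `" & "`) and `David Bowie` (empty join phrase) produce `Queen & David Bowie`.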
|
||||
|
||||
### MBID Redirect Resolution
|
||||
|
||||
MusicBrainz uses MBID redirects when entities are merged. AcoustID resolves these automatically.
|
||||
|
||||
**File**: `acoustid/data/musicbrainz.py`
|
||||
|
||||
```python
def resolve_recording_mbid(db, mbid):
    """Resolve a recording MBID redirect to the current MBID."""
    # In the MusicBrainz schema, new_id is the integer recording.id of the
    # merge target (not an MBID), so join back to recording to read its gid.
    # The target row always exists, so no recursive lookup is needed.
    query = """
        SELECT r.gid
        FROM recording_gid_redirect rgr
        JOIN recording r ON r.id = rgr.new_id
        WHERE rgr.gid = :mbid
    """
    result = db.execute(query, {'mbid': mbid}).fetchone()
    if result:
        return result[0]
    return mbid
```
|
||||
|
||||
**Redirect Tables Used**:
|
||||
- `recording_gid_redirect`
|
||||
- `release_gid_redirect`
|
||||
- `release_group_gid_redirect`
|
||||
- `artist_gid_redirect`
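
Since the four redirect tables share one shape, the lookup can be generalized per entity type. The following is a minimal sketch, not code from `acoustid/data/musicbrainz.py`: `REDIRECT_TABLES` and `build_redirect_query` are hypothetical names, and it assumes each `*_gid_redirect` table has a `gid` column and a `new_id` column referencing the entity table's `id`.

```python
# Hypothetical whitelist: entity type -> (entity table, redirect table).
REDIRECT_TABLES = {
    "recording": ("recording", "recording_gid_redirect"),
    "release": ("release", "release_gid_redirect"),
    "release_group": ("release_group", "release_group_gid_redirect"),
    "artist": ("artist", "artist_gid_redirect"),
}


def build_redirect_query(entity_type: str) -> str:
    """Return SQL that maps a redirected MBID (:mbid) to the current gid."""
    try:
        entity_table, redirect_table = REDIRECT_TABLES[entity_type]
    except KeyError:
        raise ValueError(f"no redirect table for entity type {entity_type!r}")
    # Table names come from the fixed whitelist above, never from user input,
    # so interpolating them is safe; the MBID itself stays a bound parameter.
    return (
        f"SELECT e.gid FROM {redirect_table} r "
        f"JOIN {entity_table} e ON e.id = r.new_id "
        f"WHERE r.gid = :mbid"
    )
```

Keeping the table names in a whitelist means only the MBID is ever supplied by the caller, and it is always passed as a bound parameter.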
|
||||
|
||||
### Metadata Enrichment
|
||||
|
||||
When a lookup request includes metadata flags, AcoustID fetches additional data from MusicBrainz:
|
||||
|
||||
**Metadata Levels**:
|
||||
|
||||
| Flag | Data Fetched | Query Complexity |
|
||||
|------|--------------|------------------|
|
||||
| `recordingids` | Recording MBIDs only | Low (join only) |
|
||||
| `recordings` | Full recording metadata | Medium (artist credits) |
|
||||
| `releaseids` | Release MBIDs only | Low (join only) |
|
||||
| `releases` | Full release metadata | High (tracks, mediums, countries) |
|
||||
| `releasegroupids` | Release group MBIDs only | Low (join only) |
|
||||
| `releasegroups` | Full release group metadata | Medium (artist credits) |
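
One way to act on this table is to split the requested flags into ID-only and full-metadata groups before deciding which queries to run. The sketch below is illustrative only: `plan_enrichment` and the returned keys are hypothetical, and it assumes flags arrive in a single `meta` string separated by spaces, `+`, or commas.

```python
# Flag names are taken from the table above; the grouping logic is an assumption.
ID_ONLY_FLAGS = {"recordingids", "releaseids", "releasegroupids"}
FULL_FLAGS = {"recordings", "releases", "releasegroups"}


def plan_enrichment(meta: str) -> dict:
    """Split a `meta` parameter into ID-only and full-metadata requests."""
    flags = set(meta.replace("+", " ").replace(",", " ").split())
    return {
        "ids_only": sorted(flags & ID_ONLY_FLAGS),
        "full": sorted(flags & FULL_FLAGS),
        "ignored": sorted(flags - ID_ONLY_FLAGS - FULL_FLAGS),
        # Only the full-metadata flags need the medium/high complexity queries.
        "needs_heavy_queries": bool(flags & FULL_FLAGS),
    }
```

A request with `meta=recordingids` would then skip the artist-credit and release queries entirely.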
|
||||
|
||||
**Example Enriched Response**:
|
||||
|
||||
```json
|
||||
{
|
||||
"recordings": [
|
||||
{
|
||||
"id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
|
||||
"title": "Example Song",
|
||||
"duration": 240000,
|
||||
"artists": [
|
||||
{
|
||||
"id": "12345678-90ab-cdef-1234-567890abcdef",
|
||||
"name": "Example Artist",
|
||||
"joinphrase": " & "
|
||||
}
|
||||
],
|
||||
"releases": [
|
||||
{
|
||||
"id": "abcdef12-3456-7890-abcd-ef1234567890",
|
||||
"title": "Example Album",
|
||||
"country": "US",
|
||||
"date": {
|
||||
"year": 2020,
|
||||
"month": 5,
|
||||
"day": 15
|
||||
},
|
||||
"track_count": 12,
|
||||
"medium_count": 1,
|
||||
"releasegroup": {
|
||||
"id": "fedcba98-7654-3210-fedc-ba9876543210",
|
||||
"type": "Album"
|
||||
}
|
||||
}
|
||||
]
|
||||
}
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
### Performance Considerations
|
||||
|
||||
**Connection Pooling**:
|
||||
- Separate pool for MusicBrainz database
|
||||
- Pool size: 10 connections (configurable)
|
||||
- Pool recycle: 3600 seconds
|
||||
|
||||
**Query Optimization**:
|
||||
- Indexes on `gid` columns (MusicBrainz maintains these)
|
||||
- Batch queries when possible
|
||||
- Limit joins to requested metadata only
|
||||
|
||||
**Caching**:
|
||||
- Unknown MBID cache (Redis, 1 hour TTL)
|
||||
- Avoids repeated queries for non-existent MBIDs
|
||||
|
||||
**Fallback**:
|
||||
- If MusicBrainz database unavailable, return AcoustID data only
|
||||
- Graceful degradation (no metadata enrichment)
|
||||
|
||||
## Chromaprint Integration
|
||||
|
||||
### Library Information
|
||||
|
||||
**Name**: Chromaprint
|
||||
**Version**: Built from source (commit `41a3e8fb`)
|
||||
**License**: MIT
|
||||
**Language**: C++
|
||||
**Wrapper**: acoustid-ext (C extension for Python)
|
||||
|
||||
**Repository**: https://github.com/acoustid/chromaprint
|
||||
|
||||
### Build Process
|
||||
|
||||
**Dockerfile** (`docker/Dockerfile`):
|
||||
|
||||
```dockerfile
|
||||
# Stage 1: Build Chromaprint
|
||||
FROM ubuntu:24.04 AS chromaprint-build
|
||||
|
||||
RUN apt-get update && apt-get install -y \
|
||||
git cmake build-essential libfftw3-dev
|
||||
|
||||
WORKDIR /build
|
||||
RUN git clone https://github.com/acoustid/chromaprint.git && \
|
||||
cd chromaprint && \
|
||||
git checkout 41a3e8fb && \
|
||||
cmake -DCMAKE_BUILD_TYPE=Release . && \
|
||||
make && \
|
||||
make install
|
||||
|
||||
# Stage 2: Build acoustid-ext
|
||||
FROM ubuntu:24.04 AS builder
|
||||
|
||||
COPY --from=chromaprint-build /usr/local/lib/libchromaprint.so* /usr/local/lib/
|
||||
COPY --from=chromaprint-build /usr/local/include/chromaprint.h /usr/local/include/
|
||||
|
||||
RUN pip install acoustid-ext
|
||||
```
|
||||
|
||||
### Python Extension (acoustid-ext)
|
||||
|
||||
**Package**: `acoustid-ext`
|
||||
**File**: `acoustid/fingerprint.py`
|
||||
|
||||
**Functions Exposed**:
|
||||
|
||||
```python
|
||||
from acoustid_ext import (
|
||||
decode_fingerprint,
|
||||
encode_fingerprint,
|
||||
compress_fingerprint,
|
||||
decompress_fingerprint,
|
||||
fingerprint_compare
|
||||
)
|
||||
```
|
||||
|
||||
**Function Signatures**:
|
||||
|
||||
| Function | Input | Output | Purpose |
|
||||
|----------|-------|--------|---------|
|
||||
| `decode_fingerprint(data)` | bytes/str | list[int] | Decode base64/compressed fingerprint |
|
||||
| `encode_fingerprint(hashes)` | list[int] | str | Encode fingerprint to base64 |
|
||||
| `compress_fingerprint(hashes)` | list[int] | bytes | Compress fingerprint (zstd) |
|
||||
| `decompress_fingerprint(data)` | bytes | list[int] | Decompress fingerprint |
|
||||
| `fingerprint_compare(fp1, fp2)` | list[int], list[int] | float | Compare similarity (0.0-1.0) |
|
||||
|
||||
### Fingerprint Format
|
||||
|
||||
**Raw Format** (Chromaprint output):
|
||||
- Array of 32-bit unsigned integers
|
||||
- Each integer represents a hash of audio features
|
||||
- Typical length: roughly 8 hashes per second of analyzed audio (about 950 hashes for the 120 seconds that `fpcalc` processes by default)
|
||||
|
||||
**Compressed Format** (for transmission):
|
||||
- Base64-encoded compressed data
|
||||
- Compression: zstd or custom Chromaprint compression
|
||||
- Typical size: a few kilobytes for a full-length fingerprint (it grows with the number of hashes)
|
||||
|
||||
**Example**:
|
||||
```python
|
||||
# Raw fingerprint
|
||||
fingerprint = [123456789, 987654321, 456789123, ...]
|
||||
|
||||
# Encoded (base64)
|
||||
encoded = "AQADtNGiJEqUHUemR..."
|
||||
|
||||
# Compressed (bytes)
|
||||
compressed = b'\x28\xb5\x2f\xfd...'
|
||||
```
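
To make the relationship between the raw and encoded forms concrete, here is a simplified stand-in that packs hashes as little-endian 32-bit integers and base64-encodes them. It is an assumption-laden sketch: `pack_hashes` and `unpack_hashes` are hypothetical names, and this is neither Chromaprint's actual compression scheme nor the `acoustid_ext` API.

```python
import base64
import struct


def pack_hashes(hashes: list[int]) -> str:
    """Pack unsigned 32-bit hashes into URL-safe base64 (no compression)."""
    raw = struct.pack(f"<{len(hashes)}I", *hashes)
    return base64.urlsafe_b64encode(raw).decode("ascii")


def unpack_hashes(encoded: str) -> list[int]:
    """Inverse of pack_hashes."""
    raw = base64.urlsafe_b64decode(encoded.encode("ascii"))
    if len(raw) % 4:
        raise ValueError("payload is not a whole number of 32-bit hashes")
    return list(struct.unpack(f"<{len(raw) // 4}I", raw))


fingerprint = [123456789, 987654321, 456789123]
assert unpack_hashes(pack_hashes(fingerprint)) == fingerprint
```

Without compression each hash costs 4 bytes before base64 overhead, which is why a real encoder compresses the hash stream first.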
|
||||
|
||||
### Query Extraction
|
||||
|
||||
**File**: `acoustid/fingerprint.py`
|
||||
|
||||
```python
def extract_query(fingerprint, max_terms=100):
    """Extract query terms from fingerprint for index search.

    Args:
        fingerprint: List of 32-bit hash integers
        max_terms: Maximum number of terms to extract

    Returns:
        List of term IDs (subset of fingerprint hashes)
    """
    # Select most discriminative terms
    # (implementation uses simhash or random sampling)
    terms = select_discriminative_terms(fingerprint, max_terms)
    return terms
```
|
||||
|
||||
**Query Strategy**:
|
||||
- Extract subset of hashes (typically 50-100 terms)
|
||||
- Prioritize discriminative hashes (high entropy)
|
||||
- Balance between precision and recall
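
As a rough illustration of this strategy, the sketch below de-duplicates the hashes and prefers those whose bits are evenly split between 0 and 1, using bit balance as a cheap proxy for entropy. This heuristic and the name `select_terms` are assumptions made for illustration; the source only says the real implementation uses simhash or random sampling.

```python
def select_terms(fingerprint: list[int], max_terms: int = 100) -> list[int]:
    """Pick up to max_terms distinct hashes, preferring balanced bit patterns."""
    unique = list(dict.fromkeys(fingerprint))  # de-duplicate, keep first-seen order

    def imbalance(h: int) -> int:
        # Distance of the popcount from 16: hashes that are mostly 0s or
        # mostly 1s score high and are therefore selected last.
        return abs(bin(h & 0xFFFFFFFF).count("1") - 16)

    # sorted() is stable, so equally balanced hashes keep their original order.
    return sorted(unique, key=imbalance)[:max_terms]
```

Lowering `max_terms` trades recall for a cheaper index query, since fewer posting lists have to be read.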
|
||||
|
||||
### Fingerprint Comparison
|
||||
|
||||
**PostgreSQL Function** (custom extension):
|
||||
|
||||
```sql
|
||||
CREATE FUNCTION acoustid_compare(fp1 INTEGER[], fp2 INTEGER[])
|
||||
RETURNS FLOAT AS $$
|
||||
-- Calculate Jaccard similarity
|
||||
SELECT COUNT(*)::FLOAT /
|
||||
(array_length(fp1, 1) + array_length(fp2, 1) - COUNT(*))
|
||||
FROM unnest(fp1) AS h1
|
||||
JOIN unnest(fp2) AS h2 ON h1 = h2
|
||||
$$ LANGUAGE SQL IMMUTABLE;
|
||||
```
|
||||
|
||||
**Python Implementation**:
|
||||
|
||||
```python
def compare_fingerprints(fp1, fp2):
    """Calculate similarity between two fingerprints.

    Returns:
        Float between 0.0 (no match) and 1.0 (identical)
    """
    set1 = set(fp1)
    set2 = set(fp2)
    intersection = len(set1 & set2)
    union = len(set1 | set2)
    return intersection / union if union > 0 else 0.0
```
|
||||
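
A small worked example of the set-based (Jaccard) score may help. The function is restated under the hypothetical name `jaccard` so the snippet runs on its own; the input values are made up.

```python
def jaccard(fp1, fp2):
    """Jaccard similarity of two hash lists, treated as sets."""
    set1, set2 = set(fp1), set(fp2)
    union = len(set1 | set2)
    return len(set1 & set2) / union if union else 0.0


a = [1, 2, 3, 4]
b = [3, 4, 5, 6]

# 2 shared hashes out of 6 distinct hashes overall.
assert abs(jaccard(a, b) - 2 / 6) < 1e-9
# Identical fingerprints score 1.0; two empty fingerprints score 0.0.
assert jaccard(a, a) == 1.0
assert jaccard([], []) == 0.0
```

Because the inputs are reduced to sets, this score ignores hash order and counts two hashes that differ by a single bit as completely different.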
|
||||
## AcoustID Index Integration
|
||||
|
||||
### Client Implementations
|
||||
|
||||
AcoustID server has two index client implementations:
|
||||
|
||||
#### Legacy TCP Client (indexclient.py)
|
||||
|
||||
**Status**: Deprecated, being phased out
|
||||
**Protocol**: Custom binary over TCP
|
||||
**Port**: 6080 (default)
|
||||
|
||||
**File**: `acoustid/indexclient.py`
|
||||
|
||||
```python
from queue import Queue


class IndexClientPool:
    """Connection pool for legacy TCP index."""

    def __init__(self, host, port, pool_size=10):
        self.host = host
        self.port = port
        self.pool = Queue(maxsize=pool_size)
        # Pre-fill the pool so search() does not block on an empty queue.
        # IndexClient is assumed here to be the single-connection client class.
        for _ in range(pool_size):
            self.pool.put(IndexClient(host, port))

    def search(self, fingerprint, limit=10):
        """Search index for similar fingerprints."""
        client = self.pool.get()
        try:
            # Send search command
            client.send_command(CMD_SEARCH, {
                'fingerprint': fingerprint,
                'limit': limit
            })
            # Receive results
            return client.receive_response()
        finally:
            # Always return the connection to the pool
            self.pool.put(client)
```
|
||||
|
||||
**Message Format**:
|
||||
```
|
||||
┌────────────┬─────────┬──────────────────┐
|
||||
│ Length (4B)│ Cmd (1B)│ Payload (msgpack)│
|
||||
└────────────┴─────────┴──────────────────┘
|
||||
```
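
The framing in this diagram can be sketched with `struct`. The layout (4-byte length, 1-byte command, payload) comes from the diagram; everything else is an assumption for illustration: big-endian byte order, a length field that counts only the payload, and the function names. The payload is treated as opaque bytes, so no MessagePack library is needed here.

```python
import struct

# 4-byte unsigned length + 1-byte command; big-endian is assumed.
HEADER = struct.Struct(">IB")


def encode_frame(cmd: int, payload: bytes) -> bytes:
    """Prefix payload with its length and a command byte."""
    return HEADER.pack(len(payload), cmd) + payload


def decode_frame(buf: bytes) -> tuple[int, bytes, bytes]:
    """Return (cmd, payload, remaining bytes); raise if the frame is incomplete."""
    if len(buf) < HEADER.size:
        raise ValueError("incomplete header")
    length, cmd = HEADER.unpack_from(buf)
    end = HEADER.size + length
    if len(buf) < end:
        raise ValueError("incomplete payload")
    return cmd, buf[HEADER.size:end], buf[end:]
```

Returning the remaining bytes lets a reader loop call `decode_frame` repeatedly on a TCP receive buffer that may hold more than one frame.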
|
||||
|
||||
#### Modern HTTP Client (fpstore.py)
|
||||
|
||||
**Status**: Current, recommended
|
||||
**Protocol**: HTTP/1.1 with MessagePack
|
||||
**Port**: 6081 (default)
|
||||
|
||||
**File**: `acoustid/fpstore.py`
|
||||
|
||||
```python
class FingerprintIndexClient:
    """Async HTTP client for fingerprint index."""

    def __init__(self, base_url, index_name='fingerprints'):
        self.base_url = base_url
        self.index_name = index_name
        self.session = aiohttp.ClientSession()

    async def search(self, query_terms, limit=10, min_score=0.5):
        """Search index for matching fingerprints.

        Args:
            query_terms: List of hash integers
            limit: Maximum results to return
            min_score: Minimum similarity score

        Returns:
            List of (fingerprint_id, score) tuples
        """
        url = f"{self.base_url}/{self.index_name}/_search"
        payload = msgspec.msgpack.encode({
            'query': query_terms,
            'limit': limit,
            'min_score': min_score
        })

        async with self.session.post(url, data=payload) as resp:
            data = await resp.read()
            result = msgspec.msgpack.decode(data)
            return [(r['id'], r['score']) for r in result['results']]

    async def insert(self, fingerprint_id, terms):
        """Insert or update fingerprint in index."""
        url = f"{self.base_url}/{self.index_name}/{fingerprint_id}"
        payload = msgspec.msgpack.encode({'terms': terms})

        async with self.session.put(url, data=payload) as resp:
            return resp.status == 200

    async def delete(self, fingerprint_id):
        """Delete fingerprint from index."""
        url = f"{self.base_url}/{self.index_name}/{fingerprint_id}"
        async with self.session.delete(url) as resp:
            return resp.status == 200
```
|
||||
|
||||
### Index Operations
|
||||
|
||||
**Search Flow**:
|
||||
1. Extract query terms from fingerprint (50-100 hashes)
|
||||
2. Encode query as MessagePack
|
||||
3. POST to `/:index/_search`
|
||||
4. Decode MessagePack response
|
||||
5. Return list of (fingerprint_id, score) tuples
|
||||
|
||||
**Insert Flow**:
|
||||
1. Extract all terms from fingerprint
|
||||
2. Encode as MessagePack
|
||||
3. PUT to `/:index/:fingerprint_id`
|
||||
4. Index adds to MemorySegment
|
||||
5. Appends to Oplog for durability
|
||||
|
||||
**Batch Update Flow**:
|
||||
1. Collect multiple fingerprint updates
|
||||
2. Encode batch as MessagePack
|
||||
3. POST to `/:index/_update`
|
||||
4. Index processes all updates atomically
|
||||
|
||||
### Error Handling
|
||||
|
||||
**Retry Strategy**:
|
||||
|
||||
```python
import asyncio

import aiohttp


async def search_with_retry(client, query, max_retries=3):
    """Search with exponential backoff retry."""
    for attempt in range(max_retries):
        try:
            return await client.search(query)
        except aiohttp.ClientError:
            # Out of attempts: propagate the last error to the caller
            if attempt == max_retries - 1:
                raise
            wait_time = 2 ** attempt  # 1s, 2s, 4s, ...
            await asyncio.sleep(wait_time)
```
|
||||
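
To see what the retry loop actually waits, the delays can be computed separately. With `max_retries=3` there are only two sleeps (1 s, then 2 s), because the final failure is re-raised instead of sleeping. The helper below is a hypothetical sketch; the `cap` parameter is an addition not present in the original loop.

```python
def backoff_delays(max_retries: int = 3, base: float = 1.0, cap: float = 30.0) -> list[float]:
    """Delays slept between attempts; the last attempt raises instead of sleeping."""
    return [min(cap, base * 2 ** attempt) for attempt in range(max_retries - 1)]


assert backoff_delays(3) == [1.0, 2.0]
assert backoff_delays(7) == [1.0, 2.0, 4.0, 8.0, 16.0, 30.0]
```

Capping the delay keeps a large `max_retries` from producing waits of several minutes.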
|
||||
**Circuit Breaker**:
|
||||
|
||||
```python
import time


class CircuitBreakerOpen(Exception):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    """Prevent cascading failures to index."""

    def __init__(self, failure_threshold=5, timeout=60):
        self.failure_count = 0
        self.failure_threshold = failure_threshold
        self.timeout = timeout
        self.last_failure_time = None
        self.state = 'closed'  # closed, open, half-open

    async def call(self, func, *args, **kwargs):
        if self.state == 'open':
            # After the timeout, let one trial call through (half-open)
            if time.time() - self.last_failure_time > self.timeout:
                self.state = 'half-open'
            else:
                raise CircuitBreakerOpen()

        try:
            result = await func(*args, **kwargs)
            if self.state == 'half-open':
                # Trial call succeeded: close the breaker again
                self.state = 'closed'
                self.failure_count = 0
            return result
        except Exception:
            self.failure_count += 1
            self.last_failure_time = time.time()
            if self.failure_count >= self.failure_threshold:
                self.state = 'open'
            raise
```
|
||||
|
||||
## Fingerprint Store (fpstore)
|
||||
|
||||
### Optional Service
|
||||
|
||||
**Purpose**: Separate storage for raw fingerprint data
|
||||
**Status**: Optional (can use PostgreSQL instead)
|
||||
**Protocol**: HTTP with MessagePack
|
||||
|
||||
**Configuration**:
|
||||
```ini
|
||||
[fingerprint_store]
|
||||
enabled = true
|
||||
base_url = http://fpstore:8080
|
||||
```
|
||||
|
||||
**Operations**:
|
||||
|
||||
```python
class FingerprintStore:
    """Client for fingerprint storage service."""

    def __init__(self, base_url):
        # Assumed constructor, mirroring FingerprintIndexClient
        self.base_url = base_url
        self.session = aiohttp.ClientSession()

    async def store(self, fingerprint_id, fingerprint_data):
        """Store raw fingerprint data."""
        url = f"{self.base_url}/fingerprints/{fingerprint_id}"
        payload = msgspec.msgpack.encode({
            'data': fingerprint_data
        })
        async with self.session.put(url, data=payload) as resp:
            return resp.status == 200

    async def retrieve(self, fingerprint_id):
        """Retrieve raw fingerprint data."""
        url = f"{self.base_url}/fingerprints/{fingerprint_id}"
        async with self.session.get(url) as resp:
            data = await resp.read()
            result = msgspec.msgpack.decode(data)
            return result['data']
```
|
||||
|
||||
## NATS Integration
|
||||
|
||||
### Message Queue
|
||||
|
||||
**Purpose**: Async submission processing
|
||||
**Technology**: NATS with JetStream (persistent queue)
|
||||
**Library**: `nats-py`
|
||||
|
||||
**Configuration**:
|
||||
```ini
|
||||
[nats]
|
||||
servers = nats://nats:4222
|
||||
stream = acoustid_submissions
|
||||
consumer = acoustid_worker
|
||||
```
|
||||
|
||||
**File**: `acoustid/worker.py`
|
||||
|
||||
### Publisher (API Server)
|
||||
|
||||
```python
import time

import msgspec
import nats
from nats.js import JetStreamContext


async def publish_submission(submission_id):
    """Publish submission to NATS queue."""
    nc = await nats.connect(servers=["nats://nats:4222"])
    js: JetStreamContext = nc.jetstream()

    # Ensure stream exists
    await js.add_stream(
        name="acoustid_submissions",
        subjects=["submissions.*"],
        retention="workqueue"
    )

    # Publish message
    await js.publish(
        subject="submissions.new",
        payload=msgspec.json.encode({
            'submission_id': submission_id,
            'timestamp': time.time()
        })
    )

    await nc.close()
```
|
||||
|
||||
### Consumer (Worker)
|
||||
|
||||
```python
async def consume_submissions():
    """Consume submissions from NATS queue."""
    nc = await nats.connect(servers=["nats://nats:4222"])
    js: JetStreamContext = nc.jetstream()

    # Create consumer
    consumer = await js.pull_subscribe(
        subject="submissions.*",
        durable="acoustid_worker",
        config=nats.js.api.ConsumerConfig(
            ack_policy="explicit",
            max_deliver=3,
            ack_wait=300  # 5 minutes
        )
    )

    while True:
        # Fetch batch of messages. fetch() raises TimeoutError when nothing
        # arrives within the timeout, which is normal for an idle queue.
        try:
            messages = await consumer.fetch(batch=10, timeout=5)
        except nats.errors.TimeoutError:
            continue

        for msg in messages:
            try:
                data = msgspec.json.decode(msg.data)
                await process_submission(data['submission_id'])
                await msg.ack()
            except Exception as e:
                logger.error(f"Failed to process submission: {e}")
                await msg.nak(delay=60)  # Retry after 1 minute
```
|
||||
|
||||
### JetStream Configuration
|
||||
|
||||
**Stream Settings**:
|
||||
- Retention: WorkQueue (messages deleted after ack)
|
||||
- Max age: 7 days (unprocessed messages)
|
||||
- Max messages: 1,000,000
|
||||
- Storage: File (persistent)
|
||||
|
||||
**Consumer Settings**:
|
||||
- Ack policy: Explicit (manual acknowledgment)
|
||||
- Max deliver: 3 (retry up to 3 times)
|
||||
- Ack wait: 300 seconds (5 minutes timeout)
|
||||
- Max ack pending: 100 (max unacked messages)
|
||||
|
||||
## Redis Integration
|
||||
|
||||
### Use Cases
|
||||
|
||||
1. **Rate Limiting**: Sliding window counters
|
||||
2. **Task Queue** (legacy): RPUSH/LPOP queue
|
||||
3. **Caching**: API key validation, MBID existence
|
||||
4. **State Management**: Backfill progress, worker state
|
||||
|
||||
**Configuration**:
|
||||
```ini
|
||||
[redis]
|
||||
host = redis
|
||||
port = 6379
|
||||
db = 0
|
||||
password_file = /run/secrets/redis_password
|
||||
```
|
||||
|
||||
**File**: `acoustid/redis.py`
|
||||
|
||||
### Connection Pool
|
||||
|
||||
```python
|
||||
import redis
|
||||
|
||||
redis_pool = redis.ConnectionPool(
|
||||
host='redis',
|
||||
port=6379,
|
||||
db=0,
|
||||
max_connections=50,
|
||||
socket_timeout=5,
|
||||
socket_connect_timeout=5
|
||||
)
|
||||
|
||||
redis_client = redis.Redis(connection_pool=redis_pool)
|
||||
```
|
||||
|
||||
### Rate Limiting Implementation
|
||||
|
||||
See DATA.md for detailed rate limiting data structures.
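
DATA.md is not reproduced here. As a rough illustration of the sliding-window idea listed under Use Cases, the sketch below keeps timestamps in process memory rather than in Redis. The class name, its parameters, and the `deque`-based storage are all assumptions for illustration, not the server's actual data structures.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Allow at most `limit` events per `window` seconds (in-memory sketch)."""

    def __init__(self, limit: int, window: float, clock=time.monotonic):
        self.limit = limit
        self.window = window
        self.clock = clock       # injectable so tests can control time
        self.events = deque()    # timestamps of accepted events, oldest first

    def allow(self) -> bool:
        now = self.clock()
        # Drop events that have slid out of the window.
        while self.events and now - self.events[0] >= self.window:
            self.events.popleft()
        if len(self.events) >= self.limit:
            return False
        self.events.append(now)
        return True
```

A shared Redis-backed version would need the same three steps (expire old entries, count, record) to run atomically across processes.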
|
||||
|
||||
### Caching Patterns
|
||||
|
||||
**API Key Cache**:
|
||||
```python
from cachetools import TTLCache

api_key_cache = TTLCache(maxsize=1000, ttl=60)


def get_application_by_key(api_key):
    if api_key in api_key_cache:
        return api_key_cache[api_key]

    app = db.query(Application).filter_by(apikey=api_key).first()
    if app:
        api_key_cache[api_key] = app
    return app
```
|
||||
|
||||
**Unknown MBID Cache**:
|
||||
```python
def is_mbid_known(mbid):
    """Check if MBID exists in MusicBrainz."""
    cache_key = f"unknown_mbid:{mbid}"

    # Check cache
    if redis_client.exists(cache_key):
        return False

    # Query MusicBrainz
    exists = mb_db.query(Recording).filter_by(gid=mbid).count() > 0

    # Cache negative result
    if not exists:
        redis_client.setex(cache_key, 3600, '1')

    return exists
```
|
||||
|
||||
## Integration Summary
|
||||
|
||||
| Service | Protocol | Purpose | Criticality |
|
||||
|---------|----------|---------|-------------|
|
||||
| MusicBrainz | PostgreSQL | Metadata enrichment | High |
|
||||
| Chromaprint | C library | Fingerprint generation | Critical |
|
||||
| Index (HTTP) | HTTP/MessagePack | Fingerprint search | Critical |
|
||||
| Index (TCP) | TCP binary | Legacy fingerprint search | Low (deprecated) |
|
||||
| Fingerprint Store | HTTP/MessagePack | Raw fingerprint storage | Low (optional) |
|
||||
| NATS | NATS protocol | Async job queue | High |
|
||||
| Redis | Redis protocol | Caching, rate limiting | High |
|
||||
@@ -0,0 +1,391 @@
|
||||
# AcoustID System Overview
|
||||
|
||||
## Introduction
|
||||
|
||||
AcoustID is an open-source audio fingerprinting service that identifies music recordings by analyzing their acoustic characteristics. The system consists of two primary components working in tandem: a Python-based web service (acoustid-server) and a high-performance Zig-based fingerprint index (acoustid-index). Together, they provide a production-grade solution for matching audio fingerprints to MusicBrainz metadata.
|
||||
|
||||
## System Components
|
||||
|
||||
### acoustid-server (Python)
|
||||
|
||||
The server component handles all user-facing operations, database management, and business logic.
|
||||
|
||||
**Repository**: acoustid/acoustid-server
|
||||
**License**: MIT
|
||||
**Language**: Python 3.12+
|
||||
**Current Version**: 26.3.1
|
||||
|
||||
**Core Technologies**:
|
||||
- **Web Framework**: Werkzeug/Flask (current) with migration to Starlette (future async)
|
||||
- **ORM**: SQLAlchemy 2.x with multi-database support
|
||||
- **Database**: PostgreSQL 17.4 (4 separate databases)
|
||||
- **Cache/Queue**: Redis for rate limiting and task queues
|
||||
- **Message Queue**: NATS with JetStream for async submission processing
|
||||
- **App Servers**: Uvicorn (ASGI) for async endpoints, Gunicorn (WSGI) for legacy
|
||||
|
||||
**Key Dependencies**:
|
||||
```
|
||||
acoustid-ext (C extension for Chromaprint)
|
||||
Flask (current web framework)
|
||||
Starlette (future async framework)
|
||||
aiohttp (async HTTP client)
|
||||
SQLAlchemy 2.x (ORM)
|
||||
alembic (database migrations)
|
||||
asyncpg (async PostgreSQL driver)
|
||||
psycopg2 (sync PostgreSQL driver)
|
||||
nats-py (NATS client)
|
||||
mbdata (MusicBrainz data models)
|
||||
msgspec (fast JSON/MessagePack)
|
||||
zstd (compression)
|
||||
gunicorn (WSGI server)
|
||||
uvicorn (ASGI server)
|
||||
```
|
||||
|
||||
**Entry Point**:
|
||||
```bash
|
||||
# Main CLI entry
|
||||
python manage.py              # dispatches to acoustid.cli:main()
|
||||
|
||||
# Available commands
|
||||
python manage.py run web # Web UI server
|
||||
python manage.py run api # API server
|
||||
python manage.py run cron # Scheduled tasks
|
||||
python manage.py run worker # Background worker
|
||||
python manage.py run import # Import fingerprints
|
||||
```
|
||||
|
||||
**File Locations**:
|
||||
- Entry script: `manage.py`
|
||||
- CLI implementation: `acoustid/cli.py`
|
||||
- Server logic: `acoustid/server.py`
|
||||
- Worker logic: `acoustid/worker.py`
|
||||
- Cron jobs: `acoustid/cron.py`
|
||||
- Configuration: `acoustid/config.py`
|
||||
|
||||
### acoustid-index (Zig)
|
||||
|
||||
The index component provides ultra-fast fingerprint search using advanced data structures and SIMD optimizations.
|
||||
|
||||
**Repository**: acoustid/acoustid-index
|
||||
**License**: GPL-3.0
|
||||
**Language**: Zig
|
||||
**Build System**: Zig build system
|
||||
|
||||
**Core Technologies**:
|
||||
- **HTTP Server**: httpz (Zig HTTP library)
|
||||
- **Data Structure**: LSM-tree (Log-Structured Merge-tree) inverted index
|
||||
- **Compression**: StreamVByte SIMD compression for posting lists
|
||||
- **Serialization**: MessagePack for wire protocol
|
||||
- **Metrics**: Prometheus-compatible metrics endpoint
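
To give a feel for the StreamVByte format named above, here is a scalar sketch in Python. It follows the published format (a 2-bit length code per integer, four codes per control byte, data bytes little-endian) but says nothing about how `src/streamvbyte.zig` is actually written; the real implementation uses SIMD, and whether posting lists are delta-encoded first is not stated here. The function names are hypothetical.

```python
def svb_encode(values: list[int]) -> tuple[bytes, bytes]:
    """Encode uint32 values into (control bytes, data bytes)."""
    control = bytearray((len(values) + 3) // 4)
    data = bytearray()
    for i, v in enumerate(values):
        if not 0 <= v <= 0xFFFFFFFF:
            raise ValueError("value out of uint32 range")
        nbytes = max(1, (v.bit_length() + 7) // 8)        # 1..4 bytes
        control[i // 4] |= (nbytes - 1) << (2 * (i % 4))  # 2-bit length code
        data += v.to_bytes(nbytes, "little")
    return bytes(control), bytes(data)


def svb_decode(control: bytes, data: bytes, count: int) -> list[int]:
    """Decode `count` values produced by svb_encode."""
    out, pos = [], 0
    for i in range(count):
        nbytes = ((control[i // 4] >> (2 * (i % 4))) & 0b11) + 1
        out.append(int.from_bytes(data[pos:pos + nbytes], "little"))
        pos += nbytes
    return out
```

Keeping the length codes in a separate control stream is what lets a SIMD decoder look up a shuffle mask per control byte and expand four integers at once.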
|
||||
|
||||
**Key Dependencies**:
|
||||
```
|
||||
httpz (HTTP server framework)
|
||||
metrics (Prometheus metrics)
|
||||
zul (Zig utility library)
|
||||
msgpack (MessagePack serialization)
|
||||
nats (NATS client)
|
||||
```
|
||||
|
||||
**Entry Point**:
|
||||
```bash
|
||||
# Build and run
|
||||
zig build run -- --dir /tmp --port 8080
|
||||
|
||||
# Binary name
|
||||
fpindex
|
||||
|
||||
# CLI flags
|
||||
--dir <path> # Data directory for index storage
|
||||
--port <number> # HTTP server port (default: 6081)
|
||||
--threads <number> # Worker thread count
|
||||
--log-level <level> # Logging verbosity
|
||||
--cluster <name> # Cluster name for distributed setup
|
||||
--nats-url <url> # NATS server URL for clustering
|
||||
```
|
||||
|
||||
**File Locations**:
|
||||
- Main entry: `src/main.zig`
|
||||
- HTTP server: `src/server.zig`
|
||||
- API handlers: `src/api.zig`
|
||||
- Multi-index manager: `src/MultiIndex.zig`
|
||||
- Core index: `src/Index.zig`
|
||||
- Index reader: `src/IndexReader.zig`
|
||||
- Segment management: `src/segment.zig`
|
||||
- Memory segment: `src/MemorySegment.zig`
|
||||
- File segment: `src/FileSegment.zig`
|
||||
- Write-ahead log: `src/Oplog.zig`
|
||||
- File format: `src/filefmt.zig`
|
||||
- Block compression: `src/block.zig`
|
||||
- SIMD compression: `src/streamvbyte.zig`
|
||||
- Metrics: `src/metrics.zig`
|
||||
|
||||
## Build and Run
|
||||
|
||||
### Server Build
|
||||
|
||||
```bash
|
||||
# Install dependencies with uv
|
||||
uv sync
|
||||
|
||||
# Build Chromaprint extension
|
||||
# (handled automatically in Docker build)
|
||||
|
||||
# Run with docker-compose
|
||||
docker compose up
|
||||
```
|
||||
|
||||
**Docker Compose Services**:
|
||||
- `nats`: Message queue
|
||||
- `redis`: Cache and rate limiting
|
||||
- `postgres`: Database (custom pg17.4 image)
|
||||
- `index`: Fingerprint index service
|
||||
- `api`: API server
|
||||
- `web`: Web UI server
|
||||
- `cron`: Scheduled tasks
|
||||
- `worker`: Background job processor
|
||||
|
||||
### Index Build
|
||||
|
||||
```bash
|
||||
# Build binary
|
||||
zig build
|
||||
|
||||
# Run with options
|
||||
zig build run -- --dir /var/lib/acoustid-index --port 6081 --threads 4
|
||||
```
|
||||
|
||||
## Architecture Relationship
|
||||
|
||||
The two components work together in a client-server model:
|
||||
|
||||
1. **Server** receives fingerprint submissions and lookup requests via HTTP API
|
||||
2. **Server** stores metadata in PostgreSQL
|
||||
3. **Server** sends fingerprint data to **Index** via HTTP/MessagePack protocol
|
||||
4. **Index** performs ultra-fast similarity search using LSM-tree
|
||||
5. **Index** returns candidate fingerprint IDs to **Server**
|
||||
6. **Server** enriches results with metadata from PostgreSQL and MusicBrainz
|
||||
7. **Server** returns final results to client
|
||||
|
||||
## Communication Protocols
|
||||
|
||||
### Server to Index
|
||||
|
||||
**Modern Protocol** (fpstore.py):
|
||||
- HTTP POST to `http://index:6081/:index/_search`
|
||||
- Request body: MessagePack-encoded fingerprint query
|
||||
- Response: MessagePack-encoded list of candidate IDs with scores
|
||||
|
||||
**Legacy Protocol** (indexclient.py):
|
||||
- Raw TCP socket connection
|
||||
- Binary protocol with custom framing
|
||||
- Being phased out in favor of HTTP
|
||||
|
||||
### Client to Server
|
||||
|
||||
**Public API**:
|
||||
- HTTP GET/POST to `https://api.acoustid.org/v2/*`
|
||||
- JSON/XML/JSONP responses
|
||||
- Rate-limited by API key and IP
|
||||
|
||||
## Version Information
|
||||
|
||||
**Server Version**: 26.3.1
|
||||
- Semantic versioning
|
||||
- Tagged releases in Git
|
||||
- Version defined in `acoustid/__init__.py`
|
||||
|
||||
**Index Version**: No formal versioning yet
|
||||
- Tracked by Git commit hash
|
||||
- Breaking changes communicated via commit messages
|
||||
|
||||
## Deployment Models
|
||||
|
||||
### Production (acoustid.org)
|
||||
|
||||
- Multi-server deployment
|
||||
- Separate API, web, worker, and cron processes
|
||||
- Dedicated PostgreSQL cluster (4 databases)
|
||||
- Redis cluster for caching
|
||||
- NATS cluster for message queue
|
||||
- Multiple index instances for load balancing
|
||||
|
||||
### Self-Hosted (Docker Compose)
|
||||
|
||||
- Single-host deployment
|
||||
- All services in containers
|
||||
- Shared PostgreSQL instance
|
||||
- Single Redis instance
|
||||
- Single NATS instance
|
||||
- Single index instance
|
||||
|
||||
### Development (Local)
|
||||
|
||||
- Python virtual environment with uv
|
||||
- Local PostgreSQL (or Docker)
|
||||
- Local Redis (or Docker)
|
||||
- Local NATS (or Docker)
|
||||
- Index built and run locally with Zig
|
||||
|
||||
## Key Features
|
||||
|
||||
### Server Features
|
||||
|
||||
- **Fingerprint Submission**: Accept audio fingerprints with optional metadata
|
||||
- **Fingerprint Lookup**: Match fingerprints to known recordings
|
||||
- **MusicBrainz Integration**: Link fingerprints to MBIDs
|
||||
- **User Management**: API key generation and management
|
||||
- **Rate Limiting**: Multi-tier rate limiting (global, app, IP)
|
||||
- **Batch Operations**: Submit/lookup up to 20 fingerprints per request
|
||||
- **Async Processing**: Background workers for heavy operations
|
||||
- **Health Checks**: Multiple health endpoints for monitoring
|
||||
- **Metrics**: StatsD metrics for observability
|
||||
|
||||
### Index Features
|
||||
|
||||
- **Fast Search**: Sub-millisecond fingerprint matching
|
||||
- **SIMD Optimization**: StreamVByte compression for posting lists
|
||||
- **LSM-Tree Storage**: Efficient write and read performance
|
||||
- **Background Merging**: Automatic segment compaction
|
||||
- **Snapshot Support**: Point-in-time index snapshots
|
||||
- **Cluster Support**: Distributed index via NATS
|
||||
- **Prometheus Metrics**: Built-in metrics endpoint
|
||||
- **HTTP API**: RESTful API for all operations
|
||||
|
||||
## Configuration
|
||||
|
||||
### Server Configuration
|
||||
|
||||
**Config File**: `acoustid.conf` (INI format)
|
||||
**Environment Variables**: `ACOUSTID_*` prefix
|
||||
**Secret Files**: `*_file` suffix for file-based secrets
|
||||
|
||||
Example:
|
||||
```ini
|
||||
[database]
|
||||
name = acoustid_app
|
||||
user = acoustid
|
||||
password_file = /run/secrets/db_password
|
||||
|
||||
[redis]
|
||||
host = redis
|
||||
port = 6379
|
||||
|
||||
[fingerprint_index]
|
||||
host = index
|
||||
port = 6081
|
||||
```
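
The `*_file` convention can be sketched with the standard `configparser` module. This is a minimal illustration, not the logic in `acoustid/config.py`: `get_option` is a hypothetical helper, and the lookup order (plain key first, then `<key>_file`) is an assumption.

```python
import configparser


def get_option(cfg: configparser.ConfigParser, section: str, key: str, default=None):
    """Return cfg[section][key], or the contents of the file named by <key>_file."""
    if cfg.has_option(section, key):
        return cfg.get(section, key)
    file_key = f"{key}_file"
    if cfg.has_option(section, file_key):
        with open(cfg.get(section, file_key), encoding="utf-8") as fh:
            # Secret files often end with a newline; strip it.
            return fh.read().strip()
    return default
```

With the example config above, `get_option(cfg, "database", "password")` would read `/run/secrets/db_password` and return its contents, so the secret itself never appears in `acoustid.conf`.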
|
||||
|
||||
### Index Configuration
|
||||
|
||||
**CLI Flags Only**: No config file support
|
||||
**Environment Variables**: Limited support
|
||||
|
||||
Example:
|
||||
```bash
|
||||
fpindex \
|
||||
--dir /var/lib/acoustid-index \
|
||||
--port 6081 \
|
||||
--threads 4 \
|
||||
--log-level info \
|
||||
--nats-url nats://nats:4222
|
||||
```
|
||||
|
||||
## Data Flow Summary
|
||||
|
||||
### Submission Flow
|
||||
|
||||
1. Client submits fingerprint via `/v2/submit`
|
||||
2. Server validates API keys and rate limits
|
||||
3. Server stores submission in `submission` table
|
||||
4. Server publishes message to NATS queue
|
||||
5. Worker picks up message from NATS
|
||||
6. Worker searches index for matches
|
||||
7. Worker creates or links track in PostgreSQL
|
||||
8. Worker updates index with new fingerprint
|
||||
9. Client polls `/v2/submission_status` for result
|
||||
|
||||
### Lookup Flow
|
||||
|
||||
1. Client requests lookup via `/v2/lookup`
|
||||
2. Server validates API key and rate limits
|
||||
3. Server decodes fingerprint from request
|
||||
4. Server extracts query features from fingerprint
|
||||
5. Server sends search request to index
|
||||
6. Index returns candidate fingerprint IDs
|
||||
7. Server fetches metadata from PostgreSQL
|
||||
8. Server fetches MusicBrainz data if requested
|
||||
9. Server returns enriched results as JSON
|
||||
|
||||
## Technology Stack Summary
|
||||
|
||||
| Component | Server | Index |
|
||||
|-----------|--------|-------|
|
||||
| Language | Python 3.12+ | Zig |
|
||||
| Web Framework | Flask/Starlette | httpz |
|
||||
| Database | PostgreSQL 17.4 | N/A (file-based) |
|
||||
| ORM | SQLAlchemy 2.x | N/A |
|
||||
| Cache | Redis | N/A |
|
||||
| Queue | NATS+JetStream | NATS (optional) |
|
||||
| Serialization | JSON/MessagePack | MessagePack |
|
||||
| Compression | zstd | StreamVByte |
|
||||
| Metrics | StatsD | Prometheus |
|
||||
| Testing | pytest | Zig test |
|
||||
| Build | uv | zig build |
|
||||
| Container | Docker | Docker |
|
||||
|
||||
## Repository Structure
|
||||
|
||||
### acoustid-server
|
||||
|
||||
```
|
||||
acoustid/
|
||||
├── api/ # API handlers
|
||||
│ └── v2/ # API v2 endpoints
|
||||
├── data/ # Business logic layer
|
||||
├── future/ # Starlette migration code
|
||||
├── web/ # Web UI handlers
|
||||
├── scripts/ # Utility scripts
|
||||
├── cli.py # CLI commands
|
||||
├── server.py # Server entry point
|
||||
├── worker.py # Background worker
|
||||
├── cron.py # Scheduled tasks
|
||||
├── fingerprint.py # Fingerprint utilities
|
||||
├── indexclient.py # Legacy index client
|
||||
├── fpstore.py # Modern index client
|
||||
├── db.py # Database connection
|
||||
├── config.py # Configuration
|
||||
└── tables.py # SQLAlchemy models
|
||||
```
|
||||
|
||||
### acoustid-index
|
||||
|
||||
```
|
||||
src/
|
||||
├── main.zig # Entry point
|
||||
├── server.zig # HTTP server
|
||||
├── api.zig # API handlers
|
||||
├── MultiIndex.zig # Multi-index manager
|
||||
├── Index.zig # Core index
|
||||
├── IndexReader.zig # Read-only index view
|
||||
├── segment.zig # Segment interface
|
||||
├── MemorySegment.zig # In-memory segment
|
||||
├── FileSegment.zig # On-disk segment
|
||||
├── Oplog.zig # Write-ahead log
|
||||
├── filefmt.zig # File format
|
||||
├── block.zig # Block compression
|
||||
├── streamvbyte.zig # SIMD compression
|
||||
└── metrics.zig # Prometheus metrics
|
||||
```
|
||||
|
||||
## Next Steps
|
||||
|
||||
For detailed information on specific aspects of the AcoustID system, refer to:
|
||||
|
||||
- **ARCHITECTURE.md**: Detailed architecture and data flow
|
||||
- **API.md**: Complete API reference
|
||||
- **DATA.md**: Database schema and data models
|
||||
- **INTEGRATIONS.md**: External service integrations
|
||||
- **DEPLOYMENT.md**: Deployment and infrastructure
|
||||
- **CODEBASE.md**: Code organization and patterns
|
||||
- **EVALUATION.md**: System evaluation and recommendations
|
||||