# AcoustID Integrations

## Overview

AcoustID integrates with several external services and libraries to provide audio fingerprinting and metadata enrichment. The architecture separates fingerprint generation (Chromaprint), fingerprint indexing (acoustid-index), metadata enrichment (MusicBrainz), and supporting infrastructure (Redis, NATS).

## MusicBrainz Integration

### Connection Method

**Type**: Direct PostgreSQL database connection (NOT REST API)
**Database**: `musicbrainz` (read-only replica)
**Access**: Separate database connection pool

**Configuration** (`acoustid.conf`):

```ini
[musicbrainz]
host = musicbrainz-db.example.com
port = 5432
name = musicbrainz_db
user = acoustid_readonly
password_file = /run/secrets/mb_password
```

**File**: `acoustid/data/musicbrainz.py`

### Queried Tables

The integration queries the following MusicBrainz tables directly:

| Table | Purpose | Columns Used |
|-------|---------|--------------|
| `artist_credit` | Artist information | `id`, `name`, `artist_count` |
| `artist_credit_name` | Artist credit details | `artist_credit`, `position`, `artist`, `name`, `join_phrase` |
| `artist` | Artist entities | `id`, `gid`, `name`, `sort_name` |
| `recording` | Recording metadata | `id`, `gid`, `name`, `length`, `artist_credit`, `comment` |
| `release` | Release information | `id`, `gid`, `name`, `artist_credit`, `release_group`, `status`, `packaging`, `barcode` |
| `release_group` | Release group data | `id`, `gid`, `name`, `artist_credit`, `type`, `comment` |
| `track` | Track listings | `id`, `gid`, `recording`, `medium`, `position`, `number`, `name`, `length`, `artist_credit` |
| `medium` | Medium information | `id`, `release`, `position`, `format`, `track_count` |
| `release_country` | Release countries | `release`, `country`, `date_year`, `date_month`, `date_day` |

### Query Patterns

**Fetch Recording by MBID**:

```python
from sqlalchemy import text

def get_recording_by_mbid(db, mbid):
    """Fetch recording with artist credits and releases."""
    # Assumes `db` is a SQLAlchemy connection or session; text() is needed
    # for the :mbid bind parameter.
    query = text("""
        SELECT
            r.gid AS recording_mbid,
            r.name AS recording_title,
            r.length AS duration,
            ac.name AS artist_credit_name,
            -- FILTER drops the NULL that the LEFT JOINs produce when the
            -- recording has no releases (release_mbids is then NULL)
            array_agg(DISTINCT rel.gid) FILTER (WHERE rel.gid IS NOT NULL) AS release_mbids
        FROM recording r
        JOIN artist_credit ac ON r.artist_credit = ac.id
        LEFT JOIN track t ON t.recording = r.id
        LEFT JOIN medium m ON t.medium = m.id
        LEFT JOIN release rel ON m.release = rel.id
        WHERE r.gid = :mbid
        GROUP BY r.gid, r.name, r.length, ac.name
    """)
    return db.execute(query, {'mbid': mbid}).fetchone()
```

**Fetch Release with Tracks**:

```python
from sqlalchemy import text

def get_release_with_tracks(db, release_mbid):
    """Fetch complete release with all tracks."""
    # One row per (release country, medium, track) combination
    query = text("""
        SELECT
            rel.gid AS release_mbid,
            rel.name AS release_title,
            rel.barcode,
            rc.country,
            rc.date_year,
            rc.date_month,
            rc.date_day,
            m.position AS medium_position,
            m.format AS medium_format,
            t.position AS track_position,
            t.number AS track_number,
            t.name AS track_title,
            rec.gid AS recording_mbid,
            ac.name AS artist_credit
        FROM release rel
        LEFT JOIN release_country rc ON rel.id = rc.release
        LEFT JOIN medium m ON rel.id = m.release
        LEFT JOIN track t ON m.id = t.medium
        LEFT JOIN recording rec ON t.recording = rec.id
        LEFT JOIN artist_credit ac ON rec.artist_credit = ac.id
        WHERE rel.gid = :mbid
        ORDER BY m.position, t.position
    """)
    return db.execute(query, {'mbid': release_mbid}).fetchall()
```

**Fetch Artist Credits**:

```python
from sqlalchemy import text

def get_artist_credit(db, artist_credit_id):
    """Fetch artist credit with all artists."""
    # Rows come back in credit order; join_phrase is the text that follows
    # each credited name (for example " & ").
    query = text("""
        SELECT
            acn.position,
            a.gid AS artist_mbid,
            a.name AS artist_name,
            a.sort_name AS artist_sort_name,
            acn.name AS credited_name,
            acn.join_phrase
        FROM artist_credit_name acn
        JOIN artist a ON acn.artist = a.id
        WHERE acn.artist_credit = :ac_id
        ORDER BY acn.position
    """)
    return db.execute(query, {'ac_id': artist_credit_id}).fetchall()
```

### MBID Redirect Resolution

MusicBrainz uses MBID redirects when entities are merged. AcoustID resolves these automatically.
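To make the merge behaviour concrete, the following is a minimal, self-contained sketch that uses an in-memory SQLite database as a stand-in for the MusicBrainz replica. It assumes, as in the MusicBrainz schema, that `recording_gid_redirect.new_id` holds the integer row id of the surviving recording. The table contents, the MBIDs, and the helper names `build_demo_db` and `resolve_redirect_demo` are hypothetical and exist only for illustration.

```python
import sqlite3

# Hypothetical MBIDs: OLD_MBID was merged into CURRENT_MBID.
CURRENT_MBID = "aaaaaaaa-0000-0000-0000-000000000001"
OLD_MBID = "bbbbbbbb-0000-0000-0000-000000000002"


def build_demo_db():
    """Create a tiny stand-in for the two MusicBrainz tables involved."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE recording (
            id INTEGER PRIMARY KEY,
            gid TEXT UNIQUE NOT NULL,
            name TEXT NOT NULL
        );
        CREATE TABLE recording_gid_redirect (
            gid TEXT PRIMARY KEY,
            new_id INTEGER NOT NULL REFERENCES recording(id)
        );
        """
    )
    conn.execute(
        "INSERT INTO recording (id, gid, name) VALUES (1, ?, 'Example Song')",
        (CURRENT_MBID,),
    )
    # The merged-away MBID points at the row id of the surviving recording.
    conn.execute(
        "INSERT INTO recording_gid_redirect (gid, new_id) VALUES (?, 1)",
        (OLD_MBID,),
    )
    return conn


def resolve_redirect_demo(conn, mbid):
    """Return the surviving MBID if `mbid` was merged, else `mbid` unchanged."""
    row = conn.execute(
        """
        SELECT r.gid
        FROM recording_gid_redirect AS rgr
        JOIN recording AS r ON r.id = rgr.new_id
        WHERE rgr.gid = ?
        """,
        (mbid,),
    ).fetchone()
    return row[0] if row else mbid


if __name__ == "__main__":
    conn = build_demo_db()
    print(resolve_redirect_demo(conn, OLD_MBID))      # merged MBID
    print(resolve_redirect_demo(conn, CURRENT_MBID))  # already current
```

Both calls print the surviving MBID: the first follows the redirect row, the second finds no redirect and returns its input unchanged.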
**File**: `acoustid/data/musicbrainz.py`

```python
from sqlalchemy import text

def resolve_recording_mbid(db, mbid):
    """Resolve a recording MBID redirect to the current MBID."""
    # new_id is the integer row id of the surviving recording, not an MBID,
    # so join back to recording to read its gid. Because new_id always
    # references a live row, a single lookup is enough (no recursion).
    query = text("""
        SELECT r.gid
        FROM recording_gid_redirect rgr
        JOIN recording r ON r.id = rgr.new_id
        WHERE rgr.gid = :mbid
    """)
    current = db.execute(query, {'mbid': mbid}).scalar()
    # No redirect row: the MBID was not merged away
    return str(current) if current is not None else mbid
```

**Redirect Tables Used**:
- `recording_gid_redirect`
- `release_gid_redirect`
- `release_group_gid_redirect`
- `artist_gid_redirect`

### Metadata Enrichment

When a lookup request includes metadata flags, AcoustID fetches additional data from MusicBrainz:

**Metadata Levels**:

| Flag | Data Fetched | Query Complexity |
|------|--------------|------------------|
| `recordingids` | Recording MBIDs only | Low (join only) |
| `recordings` | Full recording metadata | Medium (artist credits) |
| `releaseids` | Release MBIDs only | Low (join only) |
| `releases` | Full release metadata | High (tracks, mediums, countries) |
| `releasegroupids` | Release group MBIDs only | Low (join only) |
| `releasegroups` | Full release group metadata | Medium (artist credits) |

**Example Enriched Response**:

```json
{
  "recordings": [
    {
      "id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
      "title": "Example Song",
      "duration": 240000,
      "artists": [
        {
          "id": "12345678-90ab-cdef-1234-567890abcdef",
          "name": "Example Artist",
          "joinphrase": " & "
        }
      ],
      "releases": [
        {
          "id": "abcdef12-3456-7890-abcd-ef1234567890",
          "title": "Example Album",
          "country": "US",
          "date": {
            "year": 2020,
            "month": 5,
            "day": 15
          },
          "track_count": 12,
          "medium_count": 1,
          "releasegroup": {
            "id": "fedcba98-7654-3210-fedc-ba9876543210",
            "type": "Album"
          }
        }
      ]
    }
  ]
}
```

### Performance Considerations

**Connection Pooling**:
- Separate pool for MusicBrainz database
- Pool size: 10 connections (configurable)
- Pool recycle: 3600 seconds

**Query Optimization**:
- Indexes on `gid` columns (MusicBrainz maintains these)
- Batch queries when possible
- Limit joins to requested metadata only

**Caching**:
- Unknown MBID cache (Redis, 1 hour TTL)
- Avoids repeated queries for non-existent MBIDs

**Fallback**:
- If the MusicBrainz database is unavailable, return AcoustID data only
- Graceful degradation (no metadata enrichment)

## Chromaprint Integration

### Library Information

**Name**: Chromaprint
**Version**: Built from source (commit `41a3e8fb`)
**License**: MIT
**Language**: C++
**Wrapper**: acoustid-ext (C extension for Python)
**Repository**: https://github.com/acoustid/chromaprint

### Build Process

**Dockerfile** (`docker/Dockerfile`):

```dockerfile
# Stage 1: Build Chromaprint
FROM ubuntu:24.04 AS chromaprint-build
RUN apt-get update && apt-get install -y \
    git cmake build-essential libfftw3-dev
WORKDIR /build
RUN git clone https://github.com/acoustid/chromaprint.git && \
    cd chromaprint && \
    git checkout 41a3e8fb && \
    cmake -DCMAKE_BUILD_TYPE=Release . && \
    make && \
    make install

# Stage 2: Build acoustid-ext
FROM ubuntu:24.04 AS builder
# The base image ships without Python or pip, so install them first
RUN apt-get update && apt-get install -y \
    python3 python3-dev python3-pip python3-venv build-essential
COPY --from=chromaprint-build /usr/local/lib/libchromaprint.so* /usr/local/lib/
COPY --from=chromaprint-build /usr/local/include/chromaprint.h /usr/local/include/
# Refresh the linker cache, then install into a virtualenv because
# Ubuntu 24.04 blocks system-wide pip installs (PEP 668).
# The /opt/venv path is an assumption; adjust to the project's layout.
RUN ldconfig && \
    python3 -m venv /opt/venv && \
    /opt/venv/bin/pip install acoustid-ext
```

### Python Extension (acoustid-ext)

**Package**: `acoustid-ext`
**File**: `acoustid/fingerprint.py`

**Functions Exposed**:

```python
from acoustid_ext import (
    decode_fingerprint,
    encode_fingerprint,
    compress_fingerprint,
    decompress_fingerprint,
    fingerprint_compare
)
```

**Function Signatures**:

| Function | Input | Output | Purpose |
|----------|-------|--------|---------|
| `decode_fingerprint(data)` | bytes/str | list[int] | Decode base64/compressed fingerprint |
| `encode_fingerprint(hashes)` | list[int] | str | Encode fingerprint to base64 |
| `compress_fingerprint(hashes)` | list[int] | bytes | Compress fingerprint (zstd) |
| `decompress_fingerprint(data)` | bytes | list[int] | Decompress fingerprint |
| `fingerprint_compare(fp1, fp2)` | list[int], list[int] | float | Compare similarity (0.0-1.0) |

### Fingerprint Format

**Raw Format** (Chromaprint output):
- Array of 32-bit unsigned integers
- Each integer represents a hash of audio features
- Typical length: roughly 8 hashes per second of audio, so about 950 hashes for the 120 seconds that `fpcalc` analyzes by default

**Compressed Format** (for transmission):
- Base64-encoded compressed data
- Compression: zstd or custom Chromaprint compression
- Typical size: a few kilobytes for a 120-second fingerprint

**Example**:

```python
# Raw fingerprint
fingerprint = [123456789, 987654321, 456789123, ...]

# Encoded (base64)
encoded = "AQADtNGiJEqUHUemR..."

# Compressed (bytes)
compressed = b'\x28\xb5\x2f\xfd...'
```

### Query Extraction

**File**: `acoustid/fingerprint.py`

```python
def extract_query(fingerprint, max_terms=100):
    """Extract query terms from fingerprint for index search.

    Args:
        fingerprint: List of 32-bit hash integers
        max_terms: Maximum number of terms to extract

    Returns:
        List of term IDs (subset of fingerprint hashes)
    """
    # Select most discriminative terms
    # (implementation uses simhash or random sampling)
    # select_discriminative_terms is a helper that is not shown here
    terms = select_discriminative_terms(fingerprint, max_terms)
    return terms
```

**Query Strategy**:
- Extract subset of hashes (typically 50-100 terms)
- Prioritize discriminative hashes (high entropy)
- Balance between precision and recall

### Fingerprint Comparison

**PostgreSQL Function** (custom extension):

```sql
CREATE FUNCTION acoustid_compare(fp1 INTEGER[], fp2 INTEGER[])
RETURNS FLOAT AS $$
    -- Calculate Jaccard similarity over the sets of distinct hashes.
    -- INTERSECT and UNION deduplicate, so repeated hashes are counted once;
    -- NULLIF/COALESCE return 0.0 instead of dividing by zero for empty input.
    SELECT COALESCE(
        (SELECT COUNT(*) FROM (SELECT unnest(fp1) INTERSECT SELECT unnest(fp2)) AS i)::FLOAT
        / NULLIF((SELECT COUNT(*) FROM (SELECT unnest(fp1) UNION SELECT unnest(fp2)) AS u), 0),
        0.0
    )
$$ LANGUAGE SQL IMMUTABLE;
```

**Python Implementation**:

```python
def compare_fingerprints(fp1, fp2):
    """Calculate similarity between two fingerprints.

    Returns:
        Float between 0.0 (no match) and 1.0 (identical)
    """
    set1 = set(fp1)
    set2 = set(fp2)
    intersection = len(set1 & set2)
    union = len(set1 | set2)
    return intersection / union if union > 0 else 0.0
```

## AcoustID Index Integration

### Client Implementations

The AcoustID server has two index client implementations:

#### Legacy TCP Client (indexclient.py)

**Status**: Deprecated, being phased out
**Protocol**: Custom binary over TCP
**Port**: 6080 (default)
**File**: `acoustid/indexclient.py`

```python
from queue import Queue

class IndexClientPool:
    """Connection pool for legacy TCP index."""

    def __init__(self, host, port, pool_size=10):
        self.host = host
        self.port = port
        # Connected clients must be put into the pool before search() is
        # called (connection setup is not shown here)
        self.pool = Queue(maxsize=pool_size)

    def search(self, fingerprint, limit=10):
        """Search index for similar fingerprints."""
        client = self.pool.get()  # blocks until a client is available
        try:
            # Send search command (CMD_SEARCH is defined elsewhere)
            client.send_command(CMD_SEARCH, {
                'fingerprint': fingerprint,
                'limit': limit
            })
            # Receive results
            results = client.receive_response()
            return results
        finally:
            # Always return the client to the pool
            self.pool.put(client)
```

**Message Format**:

```
┌────────────┬─────────┬──────────────────┐
│ Length (4B)│ Cmd (1B)│ Payload (msgpack)│
└────────────┴─────────┴──────────────────┘
```

#### Modern HTTP Client (fpstore.py)

**Status**: Current, recommended
**Protocol**: HTTP/1.1 with MessagePack
**Port**: 6081 (default)
**File**: `acoustid/fpstore.py`

```python
import aiohttp
import msgspec

class FingerprintIndexClient:
    """Async HTTP client for fingerprint index."""

    def __init__(self, base_url, index_name='fingerprints'):
        self.base_url = base_url
        self.index_name = index_name
        # Construct this object inside a running event loop; aiohttp
        # sessions should not be created outside one
        self.session = aiohttp.ClientSession()

    async def close(self):
        """Release the underlying HTTP connections."""
        await self.session.close()

    async def search(self, query_terms, limit=10, min_score=0.5):
        """Search index for matching fingerprints.

        Args:
            query_terms: List of hash integers
            limit: Maximum results to return
            min_score: Minimum similarity score

        Returns:
            List of (fingerprint_id, score) tuples
        """
        url = f"{self.base_url}/{self.index_name}/_search"
        payload = msgspec.msgpack.encode({
            'query': query_terms,
            'limit': limit,
            'min_score': min_score
        })
        async with self.session.post(url, data=payload) as resp:
            # Surface HTTP errors instead of decoding an error body
            resp.raise_for_status()
            data = await resp.read()
            result = msgspec.msgpack.decode(data)
            return [(r['id'], r['score']) for r in result['results']]

    async def insert(self, fingerprint_id, terms):
        """Insert or update fingerprint in index."""
        url = f"{self.base_url}/{self.index_name}/{fingerprint_id}"
        payload = msgspec.msgpack.encode({'terms': terms})
        async with self.session.put(url, data=payload) as resp:
            return resp.status == 200

    async def delete(self, fingerprint_id):
        """Delete fingerprint from index."""
        url = f"{self.base_url}/{self.index_name}/{fingerprint_id}"
        async with self.session.delete(url) as resp:
            return resp.status == 200
```

### Index Operations

**Search Flow**:
1. Extract query terms from fingerprint (50-100 hashes)
2. Encode query as MessagePack
3. POST to `/:index/_search`
4. Decode MessagePack response
5. Return list of (fingerprint_id, score) tuples

**Insert Flow**:
1. Extract all terms from fingerprint
2. Encode as MessagePack
3. PUT to `/:index/:fingerprint_id`
4. Index adds to MemorySegment
5. Appends to Oplog for durability

**Batch Update Flow**:
1. Collect multiple fingerprint updates
2. Encode batch as MessagePack
3. POST to `/:index/_update`
4. Index processes all updates atomically

### Error Handling

**Retry Strategy**:

```python
import asyncio
import aiohttp

async def search_with_retry(client, query, max_retries=3):
    """Search with exponential backoff retry."""
    for attempt in range(max_retries):
        try:
            return await client.search(query)
        except aiohttp.ClientError:
            # Out of attempts: propagate the last error to the caller
            if attempt == max_retries - 1:
                raise
            # Wait 1s, 2s, 4s, ... between attempts
            wait_time = 2 ** attempt
            await asyncio.sleep(wait_time)
```

**Circuit Breaker**:

```python
import time

class CircuitBreakerOpen(Exception):
    """Raised when a call is rejected because the circuit is open."""

class CircuitBreaker:
    """Prevent cascading failures to index."""

    def __init__(self, failure_threshold=5, timeout=60):
        self.failure_count = 0
        self.failure_threshold = failure_threshold
        self.timeout = timeout
        self.last_failure_time = None
        self.state = 'closed'  # closed, open, half-open

    async def call(self, func, *args, **kwargs):
        if self.state == 'open':
            # After the timeout, let one trial call through
            if time.time() - self.last_failure_time > self.timeout:
                self.state = 'half-open'
            else:
                raise CircuitBreakerOpen()

        try:
            result = await func(*args, **kwargs)
        except Exception:
            self.failure_count += 1
            self.last_failure_time = time.time()
            # A failed trial call reopens the circuit immediately
            if self.state == 'half-open' or self.failure_count >= self.failure_threshold:
                self.state = 'open'
            raise
        # Any success closes the circuit and clears the failure streak
        self.state = 'closed'
        self.failure_count = 0
        return result
```

## Fingerprint Store (fpstore)

### Optional Service

**Purpose**: Separate storage for raw fingerprint data
**Status**: Optional (can use PostgreSQL instead)
**Protocol**: HTTP with MessagePack

**Configuration**:

```ini
[fingerprint_store]
enabled = true
base_url = http://fpstore:8080
```

**Operations**:

```python
import aiohttp
import msgspec

class FingerprintStore:
    """Client for fingerprint storage service."""

    def __init__(self, base_url, session):
        # session is an aiohttp.ClientSession owned by the caller
        self.base_url = base_url
        self.session = session

    async def store(self, fingerprint_id, fingerprint_data):
        """Store raw fingerprint data."""
        url = f"{self.base_url}/fingerprints/{fingerprint_id}"
        payload = msgspec.msgpack.encode({
            'data': fingerprint_data
        })
        async with self.session.put(url, data=payload) as resp:
            return resp.status == 200

    async def retrieve(self, fingerprint_id):
        """Retrieve raw fingerprint data."""
        url = f"{self.base_url}/fingerprints/{fingerprint_id}"
        async with self.session.get(url) as resp:
            resp.raise_for_status()
            data = await resp.read()
            result = msgspec.msgpack.decode(data)
            return result['data']
```

## NATS Integration

### Message Queue

**Purpose**: Async submission processing
**Technology**: NATS with JetStream (persistent queue)
**Library**: `nats-py`

**Configuration**:

```ini
[nats]
servers = nats://nats:4222
stream = acoustid_submissions
consumer = acoustid_worker
```

**File**: `acoustid/worker.py`

### Publisher (API Server)

```python
import time

import msgspec
import nats
from nats.js import JetStreamContext

async def publish_submission(submission_id):
    """Publish submission to NATS queue."""
    nc = await nats.connect(servers=["nats://nats:4222"])
    try:
        js: JetStreamContext = nc.jetstream()

        # Ensure stream exists
        await js.add_stream(
            name="acoustid_submissions",
            subjects=["submissions.*"],
            retention="workqueue"
        )

        # Publish message
        await js.publish(
            subject="submissions.new",
            payload=msgspec.json.encode({
                'submission_id': submission_id,
                'timestamp': time.time()
            })
        )
    finally:
        # Close the connection even if publishing fails
        await nc.close()
```

### Consumer (Worker)

```python
import logging

import msgspec
import nats
import nats.errors
import nats.js.api
from nats.js import JetStreamContext

logger = logging.getLogger(__name__)

async def consume_submissions():
    """Consume submissions from NATS queue."""
    nc = await nats.connect(servers=["nats://nats:4222"])
    js: JetStreamContext = nc.jetstream()

    # Create consumer
    consumer = await js.pull_subscribe(
        subject="submissions.*",
        durable="acoustid_worker",
        config=nats.js.api.ConsumerConfig(
            ack_policy="explicit",
            max_deliver=3,
            ack_wait=300  # 5 minutes
        )
    )

    while True:
        # Fetch batch of messages; fetch() raises TimeoutError when the
        # queue is empty, which is not an error for a polling worker
        try:
            messages = await consumer.fetch(batch=10, timeout=5)
        except nats.errors.TimeoutError:
            continue

        for msg in messages:
            try:
                data = msgspec.json.decode(msg.data)
                # process_submission is defined elsewhere in the worker
                await process_submission(data['submission_id'])
                await msg.ack()
            except Exception as e:
                logger.error(f"Failed to process submission: {e}")
                await msg.nak(delay=60)  # Retry after 1 minute
```

### JetStream Configuration

**Stream Settings**:
- Retention: WorkQueue (messages deleted after ack)
- Max age: 7 days (unprocessed messages)
- Max messages: 1,000,000
- Storage: File (persistent)

**Consumer Settings**:
- Ack policy: Explicit (manual acknowledgment)
- Max deliver: 3 (at most 3 delivery attempts, i.e. the first delivery plus 2 redeliveries)
- Ack wait: 300 seconds (5-minute timeout)
- Max ack pending: 100 (max unacked messages)

## Redis Integration

### Use Cases

1. **Rate Limiting**: Sliding window counters
2. **Task Queue** (legacy): RPUSH/LPOP queue
3. **Caching**: API key validation, MBID existence
4. **State Management**: Backfill progress, worker state

**Configuration**:

```ini
[redis]
host = redis
port = 6379
db = 0
password_file = /run/secrets/redis_password
```

**File**: `acoustid/redis.py`

### Connection Pool

```python
import redis

redis_pool = redis.ConnectionPool(
    host='redis',
    port=6379,
    db=0,
    max_connections=50,
    socket_timeout=5,
    socket_connect_timeout=5
)

redis_client = redis.Redis(connection_pool=redis_pool)
```

### Rate Limiting Implementation

See DATA.md for detailed rate limiting data structures.

### Caching Patterns

**API Key Cache** (in-process cache via `cachetools`, not stored in Redis):

```python
from cachetools import TTLCache

# Up to 1000 entries, each kept for 60 seconds
api_key_cache = TTLCache(maxsize=1000, ttl=60)

def get_application_by_key(api_key):
    if api_key in api_key_cache:
        return api_key_cache[api_key]

    app = db.query(Application).filter_by(apikey=api_key).first()
    # Only successful lookups are cached; unknown keys hit the database
    if app:
        api_key_cache[api_key] = app
    return app
```

**Unknown MBID Cache**:

```python
def is_mbid_known(mbid):
    """Check if MBID exists in MusicBrainz."""
    cache_key = f"unknown_mbid:{mbid}"

    # Check cache
    if redis_client.exists(cache_key):
        return False

    # Query MusicBrainz
    exists = mb_db.query(Recording).filter_by(gid=mbid).count() > 0

    # Cache negative result for 1 hour
    if not exists:
        redis_client.setex(cache_key, 3600, '1')

    return exists
```

## Integration Summary

| Service | Protocol | Purpose | Criticality |
|---------|----------|---------|-------------|
| MusicBrainz | PostgreSQL | Metadata enrichment | High |
| Chromaprint | C library | Fingerprint generation | Critical |
| Index (HTTP) | HTTP/MessagePack | Fingerprint search | Critical |
| Index (TCP) | TCP binary | Legacy fingerprint search | Low (deprecated) |
| Fingerprint Store | HTTP/MessagePack | Raw fingerprint storage | Low (optional) |
| NATS | NATS protocol | Async job queue | High |
| Redis | Redis protocol | Caching, rate limiting | High |
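
One way to act on the criticality column is to derive an overall service status from per-integration health checks. The sketch below is an illustration only: the dictionary keys, the `service_status` function, and the mapping from criticality to `"down"`, `"degraded"`, and `"ok"` are assumptions, not part of the AcoustID codebase. The only inputs taken from this document are the criticality labels in the table and the stated fallback of serving AcoustID data without enrichment when MusicBrainz is unavailable.

```python
from enum import Enum


class Criticality(Enum):
    CRITICAL = "critical"
    HIGH = "high"
    LOW = "low"


# Mirrors the Criticality column of the summary table (keys are hypothetical).
INTEGRATIONS = {
    "musicbrainz": Criticality.HIGH,
    "chromaprint": Criticality.CRITICAL,
    "index_http": Criticality.CRITICAL,
    "index_tcp": Criticality.LOW,
    "fingerprint_store": Criticality.LOW,
    "nats": Criticality.HIGH,
    "redis": Criticality.HIGH,
}


def service_status(health):
    """Map per-integration health to an overall status.

    health: dict of integration name -> bool (True means healthy).
    Integrations missing from `health` are treated as unhealthy.
    """
    failed = [name for name in INTEGRATIONS if not health.get(name, False)]
    failed_levels = {INTEGRATIONS[name] for name in failed}

    # Assumed policy: lookups cannot be served without a critical dependency.
    if Criticality.CRITICAL in failed_levels:
        return "down"
    # Assumed policy: losing a high-criticality dependency degrades the
    # service (for example, responses without metadata enrichment).
    if Criticality.HIGH in failed_levels:
        return "degraded"
    # Low-criticality failures (deprecated or optional services) are tolerated.
    return "ok"


if __name__ == "__main__":
    all_up = {name: True for name in INTEGRATIONS}
    print(service_status(all_up))
    print(service_status({**all_up, "musicbrainz": False}))
    print(service_status({**all_up, "index_http": False}))
```

Keeping the table as data means a health endpoint and this document can be checked against the same source, at the cost of one more structure to keep in sync.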