metadata-agregator/docs/research/acoustid/analysis/INTEGRATIONS.md

# AcoustID Integrations

## Overview

AcoustID integrates with multiple external services and libraries to provide comprehensive audio fingerprinting and metadata enrichment. The system's architecture separates concerns between fingerprint generation (Chromaprint), fingerprint indexing (acoustid-index), metadata enrichment (MusicBrainz), and supporting infrastructure (Redis, NATS).

## MusicBrainz Integration

### Connection Method

**Type**: Direct PostgreSQL database connection (NOT REST API)
**Database**: `musicbrainz_db` (read-only replica)
**Access**: Separate database connection pool

**Configuration** (`acoustid.conf`):
```ini
[musicbrainz]
host = musicbrainz-db.example.com
port = 5432
name = musicbrainz_db
user = acoustid_readonly
password_file = /run/secrets/mb_password
```

**File**: `acoustid/data/musicbrainz.py`
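As a minimal sketch of how the `[musicbrainz]` section above could be consumed, the snippet below reads the file with the standard-library `configparser` and resolves `password_file` to its contents. The function name `load_musicbrainz_settings` and the returned dictionary shape are assumptions for illustration, not the project's actual loader.

```python
import configparser
from pathlib import Path


def load_musicbrainz_settings(conf_path):
    """Read the [musicbrainz] section and resolve the password file.

    Hypothetical helper: the real loader in acoustid may differ.
    """
    # Disable interpolation so '%' characters in values are taken literally.
    parser = configparser.ConfigParser(interpolation=None)
    with open(conf_path, encoding="utf-8") as fh:
        parser.read_file(fh)

    section = parser["musicbrainz"]
    settings = {
        "host": section["host"],
        "port": section.getint("port", fallback=5432),
        "name": section["name"],
        "user": section["user"],
        "password": None,
    }
    # password_file points at a secret on disk; strip the trailing newline.
    password_file = section.get("password_file")
    if password_file:
        settings["password"] = Path(password_file).read_text(encoding="utf-8").strip()
    return settings
```

Keeping the password in a separate file means the main configuration can be committed or logged without exposing the credential.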

### Queried Tables

The integration queries the following MusicBrainz tables directly:

| Table | Purpose | Columns Used |
|-------|---------|--------------|
| `artist_credit` | Artist information | `id`, `name`, `artist_count` |
| `artist_credit_name` | Artist credit details | `artist_credit`, `position`, `artist`, `name`, `join_phrase` |
| `artist` | Artist entities | `id`, `gid`, `name`, `sort_name` |
| `recording` | Recording metadata | `id`, `gid`, `name`, `length`, `artist_credit`, `comment` |
| `release` | Release information | `id`, `gid`, `name`, `artist_credit`, `release_group`, `status`, `packaging`, `barcode` |
| `release_group` | Release group data | `id`, `gid`, `name`, `artist_credit`, `type`, `comment` |
| `track` | Track listings | `id`, `gid`, `recording`, `medium`, `position`, `number`, `name`, `length`, `artist_credit` |
| `medium` | Medium information | `id`, `release`, `position`, `format`, `track_count` |
| `release_country` | Release countries | `release`, `country`, `date_year`, `date_month`, `date_day` |

### Query Patterns

**Fetch Recording by MBID**:

```python
def get_recording_by_mbid(db, mbid):
    """Fetch recording with artist credits and releases."""
    query = """
        SELECT
            r.gid AS recording_mbid,
            r.name AS recording_title,
            r.length AS duration,
            ac.name AS artist_credit_name,
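            -- FILTER avoids a {NULL} array when the recording has no releases
            array_agg(DISTINCT rel.gid) FILTER (WHERE rel.gid IS NOT NULL) AS release_mbids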
        FROM recording r
        JOIN artist_credit ac ON r.artist_credit = ac.id
        LEFT JOIN track t ON t.recording = r.id
        LEFT JOIN medium m ON t.medium = m.id
        LEFT JOIN release rel ON m.release = rel.id
        WHERE r.gid = :mbid
        GROUP BY r.gid, r.name, r.length, ac.name
    """
    return db.execute(query, {'mbid': mbid}).fetchone()
```

**Fetch Release with Tracks**:

```python
def get_release_with_tracks(db, release_mbid):
    """Fetch complete release with all tracks."""
    query = """
        SELECT
            rel.gid AS release_mbid,
            rel.name AS release_title,
            rel.barcode,
            rc.country,
            rc.date_year,
            rc.date_month,
            rc.date_day,
            m.position AS medium_position,
            m.format AS medium_format,
            t.position AS track_position,
            t.number AS track_number,
            t.name AS track_title,
            rec.gid AS recording_mbid,
            ac.name AS artist_credit
        FROM release rel
        LEFT JOIN release_country rc ON rel.id = rc.release
        LEFT JOIN medium m ON rel.id = m.release
        LEFT JOIN track t ON m.id = t.medium
        LEFT JOIN recording rec ON t.recording = rec.id
        LEFT JOIN artist_credit ac ON rec.artist_credit = ac.id
        WHERE rel.gid = :mbid
        ORDER BY m.position, t.position
    """
    return db.execute(query, {'mbid': release_mbid}).fetchall()
```

**Fetch Artist Credits**:

```python
def get_artist_credit(db, artist_credit_id):
    """Fetch artist credit with all artists."""
    query = """
        SELECT
            acn.position,
            a.gid AS artist_mbid,
            a.name AS artist_name,
            a.sort_name AS artist_sort_name,
            acn.name AS credited_name,
            acn.join_phrase
        FROM artist_credit_name acn
        JOIN artist a ON acn.artist = a.id
        WHERE acn.artist_credit = :ac_id
        ORDER BY acn.position
    """
    return db.execute(query, {'ac_id': artist_credit_id}).fetchall()
```
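The rows returned by the artist credit query carry a `credited_name` and a `join_phrase` per position. A minimal sketch of turning them into a display string follows; `format_artist_credit` is a hypothetical helper name, and rows are assumed to be mapping-like with the column aliases used in the query above.

```python
def format_artist_credit(rows):
    """Concatenate credited names and join phrases in position order.

    Hypothetical helper; `rows` are mappings with the keys
    'position', 'credited_name' and 'join_phrase'.
    """
    ordered = sorted(rows, key=lambda row: row["position"])
    # join_phrase may be NULL/empty for the last artist in the credit.
    return "".join(
        row["credited_name"] + (row["join_phrase"] or "")
        for row in ordered
    )
```

For example, two rows credited as "Artist A" (join phrase `" & "`) and "Artist B" (empty join phrase) yield `"Artist A & Artist B"`.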

### MBID Redirect Resolution

MusicBrainz uses MBID redirects when entities are merged. AcoustID resolves these automatically.

**File**: `acoustid/data/musicbrainz.py`

```python
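def resolve_recording_mbid(db, mbid):
    """Resolve a recording MBID redirect to the current MBID."""
    # new_id references recording.id (an integer row ID, not an MBID),
    # so join back to recording to obtain the current gid.
    query = """
        SELECT r.gid
        FROM recording_gid_redirect rgr
        JOIN recording r ON r.id = rgr.new_id
        WHERE rgr.gid = :mbid
    """
    result = db.execute(query, {'mbid': mbid}).fetchone()
    if result:
        return result['gid']
    # No redirect row: the MBID is already current (or unknown)
    return mbid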
```

**Redirect Tables Used**:
- `recording_gid_redirect`
- `release_gid_redirect`
- `release_group_gid_redirect`
- `artist_gid_redirect`
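The four redirect tables share the same shape, so one lookup can serve all entity types. The sketch below assumes the standard MusicBrainz schema, where each `*_gid_redirect` table maps an old `gid` to `new_id`, the integer ID of the surviving row. The names `REDIRECT_TABLES` and `build_redirect_query` are hypothetical.

```python
# Hypothetical whitelist: entity table -> redirect table.
REDIRECT_TABLES = {
    "recording": "recording_gid_redirect",
    "release": "release_gid_redirect",
    "release_group": "release_group_gid_redirect",
    "artist": "artist_gid_redirect",
}


def build_redirect_query(entity_type):
    """Return SQL that maps a redirected MBID to the current MBID.

    Table names cannot be bound as query parameters, so they are taken
    from a fixed whitelist rather than from user input.
    """
    try:
        redirect_table = REDIRECT_TABLES[entity_type]
    except KeyError:
        raise ValueError(f"unsupported entity type: {entity_type!r}") from None
    return (
        f"SELECT e.gid FROM {redirect_table} r "
        f"JOIN {entity_type} e ON e.id = r.new_id "
        "WHERE r.gid = :mbid"
    )
```

The resulting string still takes `:mbid` as a bound parameter, in the same style as the queries above.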

### Metadata Enrichment

When a lookup request includes metadata flags, AcoustID fetches additional data from MusicBrainz:

**Metadata Levels**:

| Flag | Data Fetched | Query Complexity |
|------|--------------|------------------|
| `recordingids` | Recording MBIDs only | Low (join only) |
| `recordings` | Full recording metadata | Medium (artist credits) |
| `releaseids` | Release MBIDs only | Low (join only) |
| `releases` | Full release metadata | High (tracks, mediums, countries) |
| `releasegroupids` | Release group MBIDs only | Low (join only) |
| `releasegroups` | Full release group metadata | Medium (artist credits) |
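To illustrate how the flags in the table could drive query selection, here is a small sketch that parses a flag string and reports the heaviest query class it implies. The helper name `plan_enrichment`, the accepted separators (spaces or commas), and the returned structure are assumptions; only the flag names and complexity levels come from the table above.

```python
# Complexity per flag, copied from the table above.
QUERY_COST = {
    "recordingids": "low",
    "recordings": "medium",
    "releaseids": "low",
    "releases": "high",
    "releasegroupids": "low",
    "releasegroups": "medium",
}
_COST_RANK = {"low": 0, "medium": 1, "high": 2}


def plan_enrichment(meta):
    """Parse a meta flag string and report the heaviest query it needs.

    Hypothetical helper: accepts space- or comma-separated flags.
    """
    flags = meta.replace(",", " ").split()
    unknown = [flag for flag in flags if flag not in QUERY_COST]
    if unknown:
        raise ValueError(f"unknown metadata flags: {unknown}")
    heaviest = max(
        (QUERY_COST[flag] for flag in flags),
        key=_COST_RANK.__getitem__,
        default=None,  # no flags requested: no MusicBrainz query at all
    )
    return {"flags": sorted(set(flags)), "heaviest": heaviest}
```

A request carrying only ID flags stays in the "low" class, which matches the "limit joins to requested metadata only" guideline.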

**Example Enriched Response**:

```json
{
  "recordings": [
    {
      "id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
      "title": "Example Song",
      "duration": 240000,
      "artists": [
        {
          "id": "12345678-90ab-cdef-1234-567890abcdef",
          "name": "Example Artist",
          "joinphrase": " & "
        }
      ],
      "releases": [
        {
          "id": "abcdef12-3456-7890-abcd-ef1234567890",
          "title": "Example Album",
          "country": "US",
          "date": {
            "year": 2020,
            "month": 5,
            "day": 15
          },
          "track_count": 12,
          "medium_count": 1,
          "releasegroup": {
            "id": "fedcba98-7654-3210-fedc-ba9876543210",
            "type": "Album"
          }
        }
      ]
    }
  ]
}
```

### Performance Considerations

**Connection Pooling**:
- Separate pool for MusicBrainz database
- Pool size: 10 connections (configurable)
- Pool recycle: 3600 seconds

**Query Optimization**:
- Indexes on `gid` columns (MusicBrainz maintains these)
- Batch queries when possible
- Limit joins to requested metadata only

**Caching**:
- Unknown MBID cache (Redis, 1 hour TTL)
- Avoids repeated queries for non-existent MBIDs

**Fallback**:
- If MusicBrainz database unavailable, return AcoustID data only
- Graceful degradation (no metadata enrichment)
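The fallback behaviour can be sketched as a thin wrapper around the enrichment step. The function name `enrich_or_passthrough` and the `enrich` callable are hypothetical. Catching `Exception` is a simplification for the sketch, and a real implementation would more likely catch only database connectivity errors.

```python
import logging

logger = logging.getLogger(__name__)


def enrich_or_passthrough(results, enrich):
    """Return enriched results, or the unmodified ones if enrichment fails.

    `enrich` is any callable that takes the AcoustID results and returns
    them with MusicBrainz metadata attached (hypothetical interface).
    """
    try:
        return enrich(results)
    except Exception:
        # Degrade gracefully: the lookup still succeeds without metadata.
        logger.warning(
            "MusicBrainz enrichment unavailable; returning AcoustID data only",
            exc_info=True,
        )
        return results
```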

## Chromaprint Integration

### Library Information

**Name**: Chromaprint
**Version**: Built from source (commit `41a3e8fb`)
**License**: MIT
**Language**: C++
**Wrapper**: acoustid-ext (C extension for Python)

**Repository**: https://github.com/acoustid/chromaprint

### Build Process

**Dockerfile** (`docker/Dockerfile`):

```dockerfile
# Stage 1: Build Chromaprint
FROM ubuntu:24.04 AS chromaprint-build

RUN apt-get update && apt-get install -y \
    git cmake build-essential libfftw3-dev

WORKDIR /build
RUN git clone https://github.com/acoustid/chromaprint.git && \
    cd chromaprint && \
    git checkout 41a3e8fb && \
    cmake -DCMAKE_BUILD_TYPE=Release . && \
    make && \
    make install

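# Stage 2: Build acoustid-ext
FROM ubuntu:24.04 AS builder

COPY --from=chromaprint-build /usr/local/lib/libchromaprint.so* /usr/local/lib/
COPY --from=chromaprint-build /usr/local/include/chromaprint.h /usr/local/include/

# The base image ships without pip or a compiler; install them and refresh
# the linker cache so the copied libchromaprint is found.
RUN apt-get update && apt-get install -y \
    python3-pip python3-dev build-essential && \
    ldconfig

# Ubuntu 24.04 marks the system Python as externally managed (PEP 668)
RUN pip3 install --break-system-packages acoustid-ext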
```

### Python Extension (acoustid-ext)

**Package**: `acoustid-ext`
**File**: `acoustid/fingerprint.py`

**Functions Exposed**:

```python
from acoustid_ext import (
    decode_fingerprint,
    encode_fingerprint,
    compress_fingerprint,
    decompress_fingerprint,
    fingerprint_compare
)
```

**Function Signatures**:

| Function | Input | Output | Purpose |
|----------|-------|--------|---------|
| `decode_fingerprint(data)` | bytes/str | list[int] | Decode base64/compressed fingerprint |
| `encode_fingerprint(hashes)` | list[int] | str | Encode fingerprint to base64 |
| `compress_fingerprint(hashes)` | list[int] | bytes | Compress fingerprint (zstd) |
| `decompress_fingerprint(data)` | bytes | list[int] | Decompress fingerprint |
| `fingerprint_compare(fp1, fp2)` | list[int], list[int] | float | Compare similarity (0.0-1.0) |

### Fingerprint Format

**Raw Format** (Chromaprint output):
- Array of 32-bit unsigned integers
- Each integer represents a hash of audio features
- Typical length: roughly 8 hashes per second of analyzed audio (on the order of 1,000 for a 120-second sample)

**Compressed Format** (for transmission):
- Base64-encoded compressed data
- Compression: zstd or custom Chromaprint compression
- Typical size: 200-500 bytes

**Example**:
```python
# Raw fingerprint
fingerprint = [123456789, 987654321, 456789123, ...]

# Encoded (base64)
encoded = "AQADtNGiJEqUHUemR..."

# Compressed (bytes)
compressed = b'\x28\xb5\x2f\xfd...'
```
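To make the raw layout concrete, the sketch below packs a list of 32-bit unsigned hashes into bytes (4 bytes per hash) and wraps them in URL-safe base64. This only illustrates the uncompressed representation; it is not Chromaprint's compression scheme or the `acoustid_ext` encoder, and the little-endian byte order and function names are assumptions.

```python
import base64
import struct


def pack_raw_fingerprint(hashes):
    """Pack 32-bit unsigned hashes into little-endian bytes (4 bytes each)."""
    return struct.pack(f"<{len(hashes)}I", *hashes)


def unpack_raw_fingerprint(data):
    """Inverse of pack_raw_fingerprint."""
    if len(data) % 4:
        raise ValueError("raw fingerprint length must be a multiple of 4 bytes")
    return list(struct.unpack(f"<{len(data) // 4}I", data))


def raw_fingerprint_to_text(hashes):
    """Uncompressed, URL-safe base64 form of the packed hashes."""
    return base64.urlsafe_b64encode(pack_raw_fingerprint(hashes)).decode("ascii")
```

Because every hash costs exactly four bytes here, the difference between this size and the compressed size shows what the compression step saves.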

### Query Extraction

**File**: `acoustid/fingerprint.py`

```python
def extract_query(fingerprint, max_terms=100):
    """Extract query terms from fingerprint for index search.

    Args:
        fingerprint: List of 32-bit hash integers
        max_terms: Maximum number of terms to extract

    Returns:
        List of term IDs (subset of fingerprint hashes)
    """
    # Select most discriminative terms
    # (implementation uses simhash or random sampling)
    terms = select_discriminative_terms(fingerprint, max_terms)
    return terms
```

**Query Strategy**:
- Extract subset of hashes (typically 50-100 terms)
- Prioritize discriminative hashes (high entropy)
- Balance between precision and recall
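The `select_discriminative_terms` helper referenced above is not shown in the source. One possible sketch follows, under the assumption that "high entropy" can be approximated by bit balance: hashes whose set-bit count is close to 16 of 32 are preferred over near-constant ones. This heuristic is an illustration only, not the selection rule the server is known to use.

```python
def select_discriminative_terms(fingerprint, max_terms=100):
    """Pick up to max_terms unique hashes, preferring balanced bit patterns.

    Assumed heuristic: a hash with about half of its 32 bits set is treated
    as more discriminative than one that is almost all zeros or all ones.
    """
    # Deduplicate while preserving first-seen order; mask to 32 bits.
    unique = list(dict.fromkeys(h & 0xFFFFFFFF for h in fingerprint))
    # sorted() is stable, so equally balanced hashes keep their order.
    ranked = sorted(unique, key=lambda h: abs(bin(h).count("1") - 16))
    return ranked[:max_terms]
```

Raising `max_terms` improves recall at the cost of larger index queries; lowering it does the opposite.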

### Fingerprint Comparison

**PostgreSQL Function** (custom extension):

```sql
CREATE FUNCTION acoustid_compare(fp1 INTEGER[], fp2 INTEGER[])
RETURNS FLOAT AS $$
    -- Calculate Jaccard similarity
    SELECT COUNT(*)::FLOAT /
           (array_length(fp1, 1) + array_length(fp2, 1) - COUNT(*))
    FROM unnest(fp1) AS h1
    JOIN unnest(fp2) AS h2 ON h1 = h2
$$ LANGUAGE SQL IMMUTABLE;
```

**Python Implementation**:

```python
def compare_fingerprints(fp1, fp2):
    """Calculate similarity between two fingerprints.

    Returns:
        Float between 0.0 (no match) and 1.0 (identical)
    """
    set1 = set(fp1)
    set2 = set(fp2)
    intersection = len(set1 & set2)
    union = len(set1 | set2)
    return intersection / union if union > 0 else 0.0
```

## AcoustID Index Integration

### Client Implementations

AcoustID server has two index client implementations:

#### Legacy TCP Client (indexclient.py)

**Status**: Deprecated, being phased out
**Protocol**: Custom binary over TCP
**Port**: 6080 (default)

**File**: `acoustid/indexclient.py`

```python
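from queue import Queue


class IndexClientPool:
    """Connection pool for legacy TCP index."""

    def __init__(self, host, port, pool_size=10):
        self.host = host
        self.port = port
        # Bounded queue of idle connections. It must be filled with
        # connected clients before search() is called; otherwise
        # pool.get() blocks indefinitely.
        self.pool = Queue(maxsize=pool_size)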

    def search(self, fingerprint, limit=10):
        """Search index for similar fingerprints."""
        client = self.pool.get()
        try:
            # Send search command
            client.send_command(CMD_SEARCH, {
                'fingerprint': fingerprint,
                'limit': limit
            })
            # Receive results
            results = client.receive_response()
            return results
        finally:
            self.pool.put(client)
```

**Message Format**:
```
┌────────────┬─────────┬──────────────────┐
│ Length (4B)│ Cmd (1B)│ Payload (msgpack)│
└────────────┴─────────┴──────────────────┘
```
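The frame layout above can be sketched with the standard-library `struct` module. The diagram does not state the byte order or what the length field covers, so this sketch assumes big-endian and a length that counts only the payload bytes. The payload is treated as already-encoded MessagePack bytes, and the command value in the usage note is a placeholder.

```python
import struct

# Assumed header layout: 4-byte big-endian payload length, 1-byte command.
HEADER = struct.Struct(">IB")


def encode_frame(command, payload):
    """Prefix an already-encoded payload with length and command bytes."""
    return HEADER.pack(len(payload), command) + payload


def decode_frame(frame):
    """Split a complete frame into (command, payload)."""
    if len(frame) < HEADER.size:
        raise ValueError("frame shorter than header")
    length, command = HEADER.unpack_from(frame)
    payload = frame[HEADER.size:HEADER.size + length]
    if len(payload) != length:
        raise ValueError("truncated payload")
    return command, payload
```

For example, `encode_frame(1, b"abc")` produces five header bytes followed by the three payload bytes, and `decode_frame` reverses it.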

#### Modern HTTP Client (fpstore.py)

**Status**: Current, recommended
**Protocol**: HTTP/1.1 with MessagePack
**Port**: 6081 (default)

**File**: `acoustid/fpstore.py`

```python
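import aiohttp
import msgspec


class FingerprintIndexClient:
    """Async HTTP client for fingerprint index."""

    def __init__(self, base_url, index_name='fingerprints'):
        self.base_url = base_url
        self.index_name = index_name
        # Construct the client inside a running event loop; aiohttp
        # expects ClientSession to be created from a coroutine
        self.session = aiohttp.ClientSession()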

    async def search(self, query_terms, limit=10, min_score=0.5):
        """Search index for matching fingerprints.

        Args:
            query_terms: List of hash integers
            limit: Maximum results to return
            min_score: Minimum similarity score

        Returns:
            List of (fingerprint_id, score) tuples
        """
        url = f"{self.base_url}/{self.index_name}/_search"
        payload = msgspec.msgpack.encode({
            'query': query_terms,
            'limit': limit,
            'min_score': min_score
        })

        async with self.session.post(url, data=payload) as resp:
            # Surface HTTP errors instead of decoding an error body
            resp.raise_for_status()
            data = await resp.read()
            result = msgspec.msgpack.decode(data)
            return [(r['id'], r['score']) for r in result['results']]

    async def insert(self, fingerprint_id, terms):
        """Insert or update fingerprint in index."""
        url = f"{self.base_url}/{self.index_name}/{fingerprint_id}"
        payload = msgspec.msgpack.encode({'terms': terms})

        async with self.session.put(url, data=payload) as resp:
            return resp.status == 200

    async def delete(self, fingerprint_id):
        """Delete fingerprint from index."""
        url = f"{self.base_url}/{self.index_name}/{fingerprint_id}"
        async with self.session.delete(url) as resp:
            return resp.status == 200
```

### Index Operations

**Search Flow**:
1. Extract query terms from fingerprint (50-100 hashes)
2. Encode query as MessagePack
3. POST to `/:index/_search`
4. Decode MessagePack response
5. Return list of (fingerprint_id, score) tuples

**Insert Flow**:
1. Extract all terms from fingerprint
2. Encode as MessagePack
3. PUT to `/:index/:fingerprint_id`
4. Index adds to MemorySegment
5. Appends to Oplog for durability

**Batch Update Flow**:
1. Collect multiple fingerprint updates
2. Encode batch as MessagePack
3. POST to `/:index/_update`
4. Index processes all updates atomically
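The first step of the batch update flow, collecting updates into bounded batches, can be sketched as a generator. The `{"changes": [...]}` envelope, the default batch size, and the name `batched_updates` are assumptions for illustration; the actual `_update` payload schema is not given in this document. Each yielded dictionary would then be MessagePack-encoded and POSTed.

```python
from itertools import islice


def batched_updates(updates, batch_size=500):
    """Yield payload dicts, each holding at most batch_size updates.

    Hypothetical envelope: {"changes": [...]}.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(updates)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield {"changes": batch}
```

Bounding the batch size keeps each request small enough to retry cheaply if the index rejects it.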

### Error Handling

**Retry Strategy**:

```python
import asyncio

import aiohttp


async def search_with_retry(client, query, max_retries=3):
    """Search with exponential backoff retry."""
    for attempt in range(max_retries):
        try:
            return await client.search(query)
        except aiohttp.ClientError:
            # Out of attempts: propagate the last error to the caller
            if attempt == max_retries - 1:
                raise
            # Back off 1s, 2s, 4s, ... before the next attempt
            await asyncio.sleep(2 ** attempt)

**Circuit Breaker**:

```python
import time


class CircuitBreakerOpen(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    """Prevent cascading failures to index."""

    def __init__(self, failure_threshold=5, timeout=60):
        self.failure_count = 0
        self.failure_threshold = failure_threshold
        self.timeout = timeout
        self.last_failure_time = None
        self.state = 'closed'  # closed, open, half-open

    async def call(self, func, *args, **kwargs):
        if self.state == 'open':
            if time.monotonic() - self.last_failure_time > self.timeout:
                # Cool-down elapsed: let one trial call through
                self.state = 'half-open'
            else:
                raise CircuitBreakerOpen()

        try:
            result = await func(*args, **kwargs)
        except Exception:
            self.failure_count += 1
            self.last_failure_time = time.monotonic()
            # A failed trial call re-opens the circuit immediately
            if (self.state == 'half-open'
                    or self.failure_count >= self.failure_threshold):
                self.state = 'open'
            raise

        # Any success closes the circuit and clears the failure streak
        self.state = 'closed'
        self.failure_count = 0
        return result

## Fingerprint Store (fpstore)

### Optional Service

**Purpose**: Separate storage for raw fingerprint data
**Status**: Optional (can use PostgreSQL instead)
**Protocol**: HTTP with MessagePack

**Configuration**:
```ini
[fingerprint_store]
enabled = true
base_url = http://fpstore:8080
```

**Operations**:

```python
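import aiohttp
import msgspec


class FingerprintStore:
    """Client for fingerprint storage service."""

    def __init__(self, base_url, session):
        self.base_url = base_url
        self.session = session  # shared aiohttp.ClientSession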

    async def store(self, fingerprint_id, fingerprint_data):
        """Store raw fingerprint data."""
        url = f"{self.base_url}/fingerprints/{fingerprint_id}"
        payload = msgspec.msgpack.encode({
            'data': fingerprint_data
        })
        async with self.session.put(url, data=payload) as resp:
            return resp.status == 200

    async def retrieve(self, fingerprint_id):
        """Retrieve raw fingerprint data."""
        url = f"{self.base_url}/fingerprints/{fingerprint_id}"
        async with self.session.get(url) as resp:
            # A missing fingerprint (404) raises here instead of failing in decode
            resp.raise_for_status()
            data = await resp.read()
            result = msgspec.msgpack.decode(data)
            return result['data']
```

## NATS Integration

### Message Queue

**Purpose**: Async submission processing
**Technology**: NATS with JetStream (persistent queue)
**Library**: `nats-py`

**Configuration**:
```ini
[nats]
servers = nats://nats:4222
stream = acoustid_submissions
consumer = acoustid_worker
```

**File**: `acoustid/worker.py`

### Publisher (API Server)

```python
import time

import msgspec
import nats
from nats.js import JetStreamContext

async def publish_submission(submission_id):
    """Publish submission to NATS queue."""
    nc = await nats.connect(servers=["nats://nats:4222"])
    js: JetStreamContext = nc.jetstream()

    # Ensure stream exists
    await js.add_stream(
        name="acoustid_submissions",
        subjects=["submissions.*"],
        retention="workqueue"
    )

    # Publish message
    await js.publish(
        subject="submissions.new",
        payload=msgspec.json.encode({
            'submission_id': submission_id,
            'timestamp': time.time()
        })
    )

    await nc.close()
```

### Consumer (Worker)

```python
async def consume_submissions():
    """Consume submissions from NATS queue."""
    nc = await nats.connect(servers=["nats://nats:4222"])
    js: JetStreamContext = nc.jetstream()

    # Create consumer
    consumer = await js.pull_subscribe(
        subject="submissions.*",
        durable="acoustid_worker",
        config=nats.js.api.ConsumerConfig(
            ack_policy="explicit",
            max_deliver=3,
            ack_wait=300  # 5 minutes
        )
    )

    while True:
        # Fetch batch of messages. fetch() raises TimeoutError when nothing
        # arrives in time, which is normal for an idle worker.
        try:
            messages = await consumer.fetch(batch=10, timeout=5)
        except nats.errors.TimeoutError:
            continue

        for msg in messages:
            try:
                data = msgspec.json.decode(msg.data)
                await process_submission(data['submission_id'])
                await msg.ack()
            except Exception as e:
                logger.error(f"Failed to process submission: {e}")
                await msg.nak(delay=60)  # Retry after 1 minute
```

### JetStream Configuration

**Stream Settings**:
- Retention: WorkQueue (messages deleted after ack)
- Max age: 7 days (unprocessed messages)
- Max messages: 1,000,000
- Storage: File (persistent)

**Consumer Settings**:
- Ack policy: Explicit (manual acknowledgment)
- Max deliver: 3 (retry up to 3 times)
- Ack wait: 300 seconds (5 minutes timeout)
- Max ack pending: 100 (max unacked messages)

## Redis Integration

### Use Cases

1. **Rate Limiting**: Sliding window counters
2. **Task Queue** (legacy): RPUSH/LPOP queue
3. **Caching**: API key validation, MBID existence
4. **State Management**: Backfill progress, worker state

**Configuration**:
```ini
[redis]
host = redis
port = 6379
db = 0
password_file = /run/secrets/redis_password
```

**File**: `acoustid/redis.py`

### Connection Pool

```python
import redis

redis_pool = redis.ConnectionPool(
    host='redis',
    port=6379,
    db=0,
    max_connections=50,
    socket_timeout=5,
    socket_connect_timeout=5
)

redis_client = redis.Redis(connection_pool=redis_pool)
```

### Rate Limiting Implementation

See DATA.md for detailed rate limiting data structures.
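For orientation only, the sliding-window idea can be shown with an in-memory limiter. This is not the Redis-backed structure described in DATA.md: it keeps a per-key log of request timestamps in process memory, and the class name `SlidingWindowLimiter` and its parameters are hypothetical.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `limit` requests per key within `window_seconds`."""

    def __init__(self, limit, window_seconds, clock=time.monotonic):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock  # injectable for tests
        self._hits = defaultdict(deque)

    def allow(self, key):
        now = self.clock()
        hits = self._hits[key]
        # Drop timestamps that have slid out of the window.
        while hits and hits[0] <= now - self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True
```

A shared store such as Redis is needed once several API server processes must enforce the same limit, which an in-process dictionary cannot do.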

### Caching Patterns

**API Key Cache** (in-process `TTLCache` with a 60-second TTL, as shown here; not stored in Redis):
```python
from cachetools import TTLCache

api_key_cache = TTLCache(maxsize=1000, ttl=60)

def get_application_by_key(api_key):
    if api_key in api_key_cache:
        return api_key_cache[api_key]

    app = db.query(Application).filter_by(apikey=api_key).first()
    if app:
        api_key_cache[api_key] = app
    return app
```

**Unknown MBID Cache**:
```python
def is_mbid_known(mbid):
    """Check if MBID exists in MusicBrainz."""
    cache_key = f"unknown_mbid:{mbid}"

    # Check cache
    if redis_client.exists(cache_key):
        return False

    # Query MusicBrainz
    exists = mb_db.query(Recording).filter_by(gid=mbid).count() > 0

    # Cache negative result
    if not exists:
        redis_client.setex(cache_key, 3600, '1')

    return exists
```

## Integration Summary

| Service | Protocol | Purpose | Criticality |
|---------|----------|---------|-------------|
| MusicBrainz | PostgreSQL | Metadata enrichment | High |
| Chromaprint | C library | Fingerprint generation | Critical |
| Index (HTTP) | HTTP/MessagePack | Fingerprint search | Critical |
| Index (TCP) | TCP binary | Legacy fingerprint search | Low (deprecated) |
| Fingerprint Store | HTTP/MessagePack | Raw fingerprint storage | Low (optional) |
| NATS | NATS protocol | Async job queue | High |
| Redis | Redis protocol | Caching, rate limiting | High |