Alexander a1f6701bac feat: initial implementation of metadata aggregator

- gRPC service with MusicBrainz provider
- PostgreSQL schema with migrations
- Service layer with database-first caching
- Repository pattern for data access
- YAML configuration support
- Research documentation for 17 music metadata projects

2026-04-28 16:28:53 +02:00

AcoustID Integrations

Overview

AcoustID integrates with multiple external services and libraries to provide comprehensive audio fingerprinting and metadata enrichment. The system's architecture separates concerns between fingerprint generation (Chromaprint), fingerprint indexing (acoustid-index), metadata enrichment (MusicBrainz), and supporting infrastructure (Redis, NATS).

MusicBrainz Integration

Connection Method

Type: Direct PostgreSQL database connection (NOT REST API)
Database: musicbrainz_db (read-only replica)
Access: Separate database connection pool

Configuration (acoustid.conf):

[musicbrainz]
host = musicbrainz-db.example.com
port = 5432
name = musicbrainz_db
user = acoustid_readonly
password_file = /run/secrets/mb_password

File: acoustid/data/musicbrainz.py
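
As a rough illustration of how these settings could be turned into a connection URL, the sketch below parses the section with the standard library. The helper name `build_musicbrainz_url`, the `read_secret` parameter, and the URL layout are assumptions made for this example, not code taken from AcoustID.

```python
import configparser
from pathlib import Path
from urllib.parse import quote


def build_musicbrainz_url(config_text, read_secret=lambda p: Path(p).read_text().strip()):
    """Build a PostgreSQL URL from the [musicbrainz] section (hypothetical helper)."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    section = parser["musicbrainz"]
    # The password lives in a separate file so it stays out of the config
    password = read_secret(section["password_file"])
    return "postgresql://{user}:{password}@{host}:{port}/{name}".format(
        user=quote(section["user"], safe=""),
        password=quote(password, safe=""),
        host=section["host"],
        port=section.getint("port"),
        name=section["name"],
    )


EXAMPLE_CONFIG = """
[musicbrainz]
host = musicbrainz-db.example.com
port = 5432
name = musicbrainz_db
user = acoustid_readonly
password_file = /run/secrets/mb_password
"""

# A fake secret reader stands in for the real secrets file in this example
url = build_musicbrainz_url(EXAMPLE_CONFIG, read_secret=lambda path: "s3cret")
print(url)
```

Passing the secret reader as a parameter keeps the example runnable without a real secrets file.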

Queried Tables

The integration queries the following MusicBrainz tables directly:

Table	Purpose	Columns Used
`artist_credit`	Artist information	`id`, `name`, `artist_count`
`artist_credit_name`	Artist credit details	`artist_credit`, `position`, `artist`, `name`, `join_phrase`
`artist`	Artist entities	`id`, `gid`, `name`, `sort_name`
`recording`	Recording metadata	`id`, `gid`, `name`, `length`, `artist_credit`, `comment`
`release`	Release information	`id`, `gid`, `name`, `artist_credit`, `release_group`, `status`, `packaging`, `barcode`
`release_group`	Release group data	`id`, `gid`, `name`, `artist_credit`, `type`, `comment`
`track`	Track listings	`id`, `gid`, `recording`, `medium`, `position`, `number`, `name`, `length`, `artist_credit`
`medium`	Medium information	`id`, `release`, `position`, `format`, `track_count`
`release_country`	Release countries	`release`, `country`, `date_year`, `date_month`, `date_day`

Query Patterns

Fetch Recording by MBID:

def get_recording_by_mbid(db, mbid):
    """Fetch recording with artist credits and releases."""
    query = """
        SELECT 
            r.gid AS recording_mbid,
            r.name AS recording_title,
            r.length AS duration,
            ac.name AS artist_credit_name,
            array_agg(DISTINCT rel.gid) AS release_mbids
        FROM recording r
        JOIN artist_credit ac ON r.artist_credit = ac.id
        LEFT JOIN track t ON t.recording = r.id
        LEFT JOIN medium m ON t.medium = m.id
        LEFT JOIN release rel ON m.release = rel.id
        WHERE r.gid = :mbid
        GROUP BY r.gid, r.name, r.length, ac.name
    """
    return db.execute(query, {'mbid': mbid}).fetchone()

Fetch Release with Tracks:

def get_release_with_tracks(db, release_mbid):
    """Fetch complete release with all tracks."""
    query = """
        SELECT 
            rel.gid AS release_mbid,
            rel.name AS release_title,
            rel.barcode,
            rc.country,
            rc.date_year,
            rc.date_month,
            rc.date_day,
            m.position AS medium_position,
            m.format AS medium_format,
            t.position AS track_position,
            t.number AS track_number,
            t.name AS track_title,
            rec.gid AS recording_mbid,
            ac.name AS artist_credit
        FROM release rel
        LEFT JOIN release_country rc ON rel.id = rc.release
        LEFT JOIN medium m ON rel.id = m.release
        LEFT JOIN track t ON m.id = t.medium
        LEFT JOIN recording rec ON t.recording = rec.id
        LEFT JOIN artist_credit ac ON rec.artist_credit = ac.id
        WHERE rel.gid = :mbid
        ORDER BY m.position, t.position
    """
    return db.execute(query, {'mbid': release_mbid}).fetchall()

Fetch Artist Credits:

def get_artist_credit(db, artist_credit_id):
    """Fetch artist credit with all artists."""
    query = """
        SELECT 
            acn.position,
            a.gid AS artist_mbid,
            a.name AS artist_name,
            a.sort_name AS artist_sort_name,
            acn.name AS credited_name,
            acn.join_phrase
        FROM artist_credit_name acn
        JOIN artist a ON acn.artist = a.id
        WHERE acn.artist_credit = :ac_id
        ORDER BY acn.position
    """
    return db.execute(query, {'ac_id': artist_credit_id}).fetchall()
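
The rows returned by this query are usually combined into a single display string by concatenating each credited name with its join phrase. A minimal sketch, assuming the rows behave like dictionaries keyed by the column aliases above (`format_artist_credit` is a hypothetical helper):

```python
def format_artist_credit(rows):
    """Join credited names and join phrases in position order."""
    ordered = sorted(rows, key=lambda row: row["position"])
    # join_phrase may be NULL in the database, so treat None as empty
    return "".join(
        row["credited_name"] + (row["join_phrase"] or "") for row in ordered
    )


rows = [
    {"position": 1, "credited_name": "Artist B", "join_phrase": None},
    {"position": 0, "credited_name": "Artist A", "join_phrase": " & "},
]
print(format_artist_credit(rows))  # Artist A & Artist B
```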

MBID Redirect Resolution

MusicBrainz uses MBID redirects when entities are merged. AcoustID resolves these automatically.

File: acoustid/data/musicbrainz.py

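def resolve_recording_mbid(db, mbid):
    """Resolve a recording MBID redirect to the current MBID."""
    # new_id holds the integer recording.id of the merge target, not an MBID,
    # so join back to recording to read the target's gid
    query = """
        SELECT r.gid AS current_mbid
        FROM recording_gid_redirect rgr
        JOIN recording r ON r.id = rgr.new_id
        WHERE rgr.gid = :mbid
    """
    result = db.execute(query, {'mbid': mbid}).fetchone()
    # No redirect row means the MBID is already current (or unknown)
    return result['current_mbid'] if result else mbid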

Redirect Tables Used:

recording_gid_redirect
release_gid_redirect
release_group_gid_redirect
artist_gid_redirect
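
The four redirect tables share the same shape, so one query template can cover all of them. The sketch below is an illustration only. It assumes that `new_id` references the integer `id` of the merge target, as in the MusicBrainz schema, and that `build_redirect_query` is a hypothetical helper. Table names come from a fixed whitelist and are never taken from user input.

```python
# Entity table name -> redirect table name (from the list above)
REDIRECT_TABLES = {
    "recording": "recording_gid_redirect",
    "release": "release_gid_redirect",
    "release_group": "release_group_gid_redirect",
    "artist": "artist_gid_redirect",
}


def build_redirect_query(entity_type):
    """Return SQL that maps an old MBID (:mbid) to the current one."""
    try:
        redirect_table = REDIRECT_TABLES[entity_type]
    except KeyError:
        raise ValueError(f"unsupported entity type: {entity_type!r}") from None
    return (
        f"SELECT e.gid AS current_mbid "
        f"FROM {redirect_table} r "
        f"JOIN {entity_type} e ON e.id = r.new_id "
        f"WHERE r.gid = :mbid"
    )


print(build_redirect_query("release_group"))
```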

Metadata Enrichment

When a lookup request includes metadata flags, AcoustID fetches additional data from MusicBrainz:

Metadata Levels:

Flag	Data Fetched	Query Complexity
`recordingids`	Recording MBIDs only	Low (join only)
`recordings`	Full recording metadata	Medium (artist credits)
`releaseids`	Release MBIDs only	Low (join only)
`releases`	Full release metadata	High (tracks, mediums, countries)
`releasegroupids`	Release group MBIDs only	Low (join only)
`releasegroups`	Full release group metadata	Medium (artist credits)
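
One way to decide which queries a request needs is to parse the flags and look up their cost. This is a sketch under stated assumptions: the separators (`+`, spaces, commas) and both function names are illustrative, and the complexity labels are copied from the table above.

```python
import re

# Flag -> query complexity, as listed in the table above
QUERY_COMPLEXITY = {
    "recordingids": "low",
    "recordings": "medium",
    "releaseids": "low",
    "releases": "high",
    "releasegroupids": "low",
    "releasegroups": "medium",
}

_ORDER = {"low": 0, "medium": 1, "high": 2}


def parse_meta_flags(meta):
    """Split a meta string into known flags, ignoring unknown tokens."""
    tokens = [t for t in re.split(r"[\s+,]+", meta.strip()) if t]
    return [t for t in tokens if t in QUERY_COMPLEXITY]


def heaviest_query(flags):
    """Return the highest complexity among the flags, or None if empty."""
    if not flags:
        return None
    return max((QUERY_COMPLEXITY[f] for f in flags), key=_ORDER.__getitem__)


flags = parse_meta_flags("recordings+releaseids bogus")
print(flags, heaviest_query(flags))
```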

Example Enriched Response:

{
  "recordings": [
    {
      "id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
      "title": "Example Song",
      "duration": 240000,
      "artists": [
        {
          "id": "12345678-90ab-cdef-1234-567890abcdef",
          "name": "Example Artist",
          "joinphrase": " & "
        }
      ],
      "releases": [
        {
          "id": "abcdef12-3456-7890-abcd-ef1234567890",
          "title": "Example Album",
          "country": "US",
          "date": {
            "year": 2020,
            "month": 5,
            "day": 15
          },
          "track_count": 12,
          "medium_count": 1,
          "releasegroup": {
            "id": "fedcba98-7654-3210-fedc-ba9876543210",
            "type": "Album"
          }
        }
      ]
    }
  ]
}
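
A client consuming this response might flatten it into rows. The sketch below assumes the response has been parsed into a dictionary shaped like the example above. `format_release_date` and `list_releases` are hypothetical helpers, and rendering partial dates as `YYYY` or `YYYY-MM` is a choice made for this example.

```python
def format_release_date(date):
    """Render a partial date dict as YYYY, YYYY-MM or YYYY-MM-DD."""
    if not date or date.get("year") is None:
        return None
    parts = [f"{date['year']:04d}"]
    for key in ("month", "day"):
        value = date.get(key)
        if value is None:
            break  # stop at the first missing component
        parts.append(f"{value:02d}")
    return "-".join(parts)


def list_releases(response):
    """Return (recording title, release title, release date) tuples."""
    rows = []
    for recording in response.get("recordings", []):
        for release in recording.get("releases", []):
            rows.append((
                recording["title"],
                release["title"],
                format_release_date(release.get("date")),
            ))
    return rows


response = {
    "recordings": [{
        "title": "Example Song",
        "releases": [{
            "title": "Example Album",
            "date": {"year": 2020, "month": 5, "day": 15},
        }],
    }]
}
print(list_releases(response))
```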

Performance Considerations

Connection Pooling:

Separate pool for MusicBrainz database
Pool size: 10 connections (configurable)
Pool recycle: 3600 seconds

Query Optimization:

Indexes on gid columns (MusicBrainz maintains these)
Batch queries when possible
Limit joins to requested metadata only

Caching:

Unknown MBID cache (Redis, 1 hour TTL)
Avoids repeated queries for non-existent MBIDs

Fallback:

If MusicBrainz database unavailable, return AcoustID data only
Graceful degradation (no metadata enrichment)
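
The fallback behaviour can be sketched as a wrapper that attaches metadata when the lookup succeeds and otherwise returns the AcoustID results unchanged. This is an illustration, not AcoustID's code: `enrich_results` and the `fetch_metadata` callback are hypothetical, and the built-in `ConnectionError` stands in for whatever exception the database driver raises.

```python
import logging

logger = logging.getLogger("acoustid.enrichment")


def enrich_results(results, fetch_metadata):
    """Attach metadata to each result, or return plain results on failure."""
    enriched = []
    available = True
    for result in results:
        item = dict(result)
        if available:
            try:
                item["metadata"] = fetch_metadata(result["recording_mbid"])
            except ConnectionError as exc:
                # Database unreachable: stop asking and keep AcoustID data only
                logger.warning("MusicBrainz unavailable, skipping enrichment: %s", exc)
                available = False
        enriched.append(item)
    return enriched


def unavailable(mbid):
    raise ConnectionError("database down")


base = [{"id": "x", "recording_mbid": "m1"}, {"id": "y", "recording_mbid": "m2"}]
print(enrich_results(base, unavailable))
```

After the first failure the wrapper stops querying, so an outage costs one failed call per request rather than one per result.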

Chromaprint Integration

Library Information

Name: Chromaprint
Version: Built from source (commit 41a3e8fb)
License: MIT
Language: C++
Wrapper: acoustid-ext (C extension for Python)

Repository: https://github.com/acoustid/chromaprint

Build Process

Dockerfile (docker/Dockerfile):

# Stage 1: Build Chromaprint
FROM ubuntu:24.04 AS chromaprint-build

RUN apt-get update && apt-get install -y \
    git cmake build-essential libfftw3-dev

WORKDIR /build
RUN git clone https://github.com/acoustid/chromaprint.git && \
    cd chromaprint && \
    git checkout 41a3e8fb && \
    cmake -DCMAKE_BUILD_TYPE=Release . && \
    make && \
    make install

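# Stage 2: Build acoustid-ext
FROM ubuntu:24.04 AS builder

COPY --from=chromaprint-build /usr/local/lib/libchromaprint.so* /usr/local/lib/
COPY --from=chromaprint-build /usr/local/include/chromaprint.h /usr/local/include/

# The base image ships without Python, so install it and refresh the linker cache
RUN apt-get update && apt-get install -y \
    python3 python3-pip python3-venv python3-dev build-essential && \
    ldconfig

# Ubuntu 24.04 blocks system-wide pip installs, so use a virtual environment
RUN python3 -m venv /opt/venv && /opt/venv/bin/pip install acoustid-ext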

Python Extension (acoustid-ext)

Package: acoustid-ext
File: acoustid/fingerprint.py

Functions Exposed:

from acoustid_ext import (
    decode_fingerprint,
    encode_fingerprint,
    compress_fingerprint,
    decompress_fingerprint,
    fingerprint_compare
)

Function Signatures:

Function	Input	Output	Purpose
`decode_fingerprint(data)`	bytes/str	list[int]	Decode base64/compressed fingerprint
`encode_fingerprint(hashes)`	list[int]	str	Encode fingerprint to base64
`compress_fingerprint(hashes)`	list[int]	bytes	Compress fingerprint (zstd)
`decompress_fingerprint(data)`	bytes	list[int]	Decompress fingerprint
`fingerprint_compare(fp1, fp2)`	list[int], list[int]	float	Compare similarity (0.0-1.0)

Fingerprint Format

Raw Format (Chromaprint output):

Array of 32-bit unsigned integers
Each integer represents a hash of audio features
Typical length: about 8 hashes per second of analyzed audio (roughly 950 hashes for a 120-second sample)

Compressed Format (for transmission):

Base64-encoded compressed data
Compression: zstd or custom Chromaprint compression
Typical size: 200-500 bytes

Example:

# Raw fingerprint
fingerprint = [123456789, 987654321, 456789123, ...]

# Encoded (base64)
encoded = "AQADtNGiJEqUHUemR..."

# Compressed (bytes)
compressed = b'\x28\xb5\x2f\xfd...'
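
To make the relationship between the three forms concrete, the sketch below packs a list of 32-bit hashes into bytes, compresses them, and base64-encodes the result. It is a stand-in only: it uses `zlib` from the standard library instead of zstd, and it does not reproduce Chromaprint's own compression or the `acoustid_ext` functions. `pack_fingerprint` and `unpack_fingerprint` are hypothetical names.

```python
import base64
import struct
import zlib


def pack_fingerprint(hashes):
    """Pack unsigned 32-bit hashes, compress, and base64-encode them."""
    raw = struct.pack(f"<{len(hashes)}I", *hashes)  # little-endian uint32
    return base64.urlsafe_b64encode(zlib.compress(raw)).decode("ascii")


def unpack_fingerprint(encoded):
    """Reverse pack_fingerprint and return the list of hashes."""
    raw = zlib.decompress(base64.urlsafe_b64decode(encoded))
    return list(struct.unpack(f"<{len(raw) // 4}I", raw))


fingerprint = [123456789, 987654321, 456789123]
encoded = pack_fingerprint(fingerprint)
print(encoded)
print(unpack_fingerprint(encoded) == fingerprint)  # True
```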

Query Extraction

File: acoustid/fingerprint.py

def extract_query(fingerprint, max_terms=100):
    """Extract query terms from fingerprint for index search.
    
    Args:
        fingerprint: List of 32-bit hash integers
        max_terms: Maximum number of terms to extract
        
    Returns:
        List of term IDs (subset of fingerprint hashes)
    """
    # Select most discriminative terms
    # (implementation uses simhash or random sampling)
    terms = select_discriminative_terms(fingerprint, max_terms)
    return terms

Query Strategy:

Extract subset of hashes (typically 50-100 terms)
Prioritize discriminative hashes (high entropy)
Balance between precision and recall
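
The strategy above leaves the selection heuristic open. One possible heuristic, shown purely as an assumption, is to drop repeated hashes and prefer those whose bits are most evenly split between 0 and 1. `select_query_terms` is a hypothetical function and is not the `select_discriminative_terms` helper referenced in `extract_query`.

```python
def select_query_terms(fingerprint, max_terms=100):
    """Pick up to max_terms distinct hashes, most bit-balanced first."""
    # Drop repeats while keeping first-seen order
    unique = list(dict.fromkeys(fingerprint))

    def bit_imbalance(value):
        # 0 means exactly 16 of the 32 bits are set
        return abs(bin(value & 0xFFFFFFFF).count("1") - 16)

    # sorted() is stable, so ties keep their original order
    return sorted(unique, key=bit_imbalance)[:max_terms]


sample = [0x00000000, 0xFFFF0000, 0xFFFFFFFF, 0xFFFF0000, 0x0F0F0F0F]
print([hex(term) for term in select_query_terms(sample, max_terms=2)])
```

All-zero and all-one hashes rank last because they carry little information about the audio.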

Fingerprint Comparison

PostgreSQL Function (custom extension):

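CREATE FUNCTION acoustid_compare(fp1 INTEGER[], fp2 INTEGER[])
RETURNS FLOAT AS $$
    -- Jaccard similarity over distinct hashes: |A ∩ B| / |A ∪ B|
    WITH a AS (SELECT DISTINCT unnest(fp1) AS h),
         b AS (SELECT DISTINCT unnest(fp2) AS h),
         u AS (SELECT h FROM a UNION SELECT h FROM b),
         i AS (SELECT h FROM a INTERSECT SELECT h FROM b)
    -- NULLIF avoids division by zero when both arrays are empty
    SELECT COALESCE(
        (SELECT COUNT(*) FROM i)::FLOAT / NULLIF((SELECT COUNT(*) FROM u), 0),
        0.0)
$$ LANGUAGE SQL IMMUTABLE;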

Python Implementation:

def compare_fingerprints(fp1, fp2):
    """Calculate similarity between two fingerprints.
    
    Returns:
        Float between 0.0 (no match) and 1.0 (identical)
    """
    set1 = set(fp1)
    set2 = set(fp2)
    intersection = len(set1 & set2)
    union = len(set1 | set2)
    return intersection / union if union > 0 else 0.0

AcoustID Index Integration

Client Implementations

The AcoustID server has two index client implementations:

Legacy TCP Client (indexclient.py)

Status: Deprecated, being phased out
Protocol: Custom binary over TCP
Port: 6080 (default)

File: acoustid/indexclient.py

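from queue import Queue

class IndexClientPool:
    """Connection pool for legacy TCP index."""
    
    def __init__(self, host, port, pool_size=10):
        self.host = host
        self.port = port
        # Bounded queue of idle clients. Connected clients must be added
        # with self.pool.put(...) before search() can borrow one.
        self.pool = Queue(maxsize=pool_size)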
        
    def search(self, fingerprint, limit=10):
        """Search index for similar fingerprints."""
        client = self.pool.get()
        try:
            # Send search command
            client.send_command(CMD_SEARCH, {
                'fingerprint': fingerprint,
                'limit': limit
            })
            # Receive results
            results = client.receive_response()
            return results
        finally:
            self.pool.put(client)

Message Format:

┌────────────┬─────────┬──────────────────┐
│ Length (4B)│ Cmd (1B)│ Payload (msgpack)│
└────────────┴─────────┴──────────────────┘
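
The framing in the diagram can be sketched with the `struct` module. Several details are assumptions because the diagram does not specify them: big-endian byte order, a length field that counts only the payload, and the value `0x01` for `CMD_SEARCH`. The payload is treated as opaque bytes here; in the real protocol it would be msgpack-encoded.

```python
import struct

HEADER = struct.Struct(">IB")  # 4-byte length, 1-byte command (assumed big-endian)
CMD_SEARCH = 0x01  # hypothetical command code


def encode_frame(cmd, payload):
    """Prefix the payload with its length and the command byte."""
    return HEADER.pack(len(payload), cmd) + payload


def decode_frame(frame):
    """Split a frame into (cmd, payload), checking the declared length."""
    length, cmd = HEADER.unpack_from(frame)
    payload = frame[HEADER.size:HEADER.size + length]
    if len(payload) != length:
        raise ValueError("truncated frame")
    return cmd, payload


frame = encode_frame(CMD_SEARCH, b"abc")
print(frame)                # b'\x00\x00\x00\x03\x01abc'
print(decode_frame(frame))  # (1, b'abc')
```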

Modern HTTP Client (fpstore.py)

Status: Current, recommended
Protocol: HTTP/1.1 with MessagePack
Port: 6081 (default)

File: acoustid/fpstore.py

import aiohttp
import msgspec

class FingerprintIndexClient:
    """Async HTTP client for fingerprint index."""
    
    def __init__(self, base_url, index_name='fingerprints'):
        self.base_url = base_url
        self.index_name = index_name
        self.session = aiohttp.ClientSession()
        
    async def search(self, query_terms, limit=10, min_score=0.5):
        """Search index for matching fingerprints.
        
        Args:
            query_terms: List of hash integers
            limit: Maximum results to return
            min_score: Minimum similarity score
            
        Returns:
            List of (fingerprint_id, score) tuples
        """
        url = f"{self.base_url}/{self.index_name}/_search"
        payload = msgspec.msgpack.encode({
            'query': query_terms,
            'limit': limit,
            'min_score': min_score
        })
        
        async with self.session.post(url, data=payload) as resp:
            data = await resp.read()
            result = msgspec.msgpack.decode(data)
            return [(r['id'], r['score']) for r in result['results']]
    
    async def insert(self, fingerprint_id, terms):
        """Insert or update fingerprint in index."""
        url = f"{self.base_url}/{self.index_name}/{fingerprint_id}"
        payload = msgspec.msgpack.encode({'terms': terms})
        
        async with self.session.put(url, data=payload) as resp:
            return resp.status == 200
    
    async def delete(self, fingerprint_id):
        """Delete fingerprint from index."""
        url = f"{self.base_url}/{self.index_name}/{fingerprint_id}"
        async with self.session.delete(url) as resp:
            return resp.status == 200

Index Operations

Search Flow:

Extract query terms from fingerprint (50-100 hashes)
Encode query as MessagePack
POST to /:index/_search
Decode MessagePack response
Return list of (fingerprint_id, score) tuples

Insert Flow:

Extract all terms from fingerprint
Encode as MessagePack
PUT to /:index/:fingerprint_id
Index adds to MemorySegment
Appends to Oplog for durability

Batch Update Flow:

Collect multiple fingerprint updates
Encode batch as MessagePack
POST to /:index/_update
Index processes all updates atomically

Error Handling

Retry Strategy:

import asyncio

import aiohttp

async def search_with_retry(client, query, max_retries=3):
    """Search with exponential backoff retry."""
    for attempt in range(max_retries):
        try:
            return await client.search(query)
        except aiohttp.ClientError:
            # Out of attempts: let the caller see the last error
            if attempt == max_retries - 1:
                raise
            # Wait 1s, 2s, 4s, ... between attempts
            wait_time = 2 ** attempt
            await asyncio.sleep(wait_time)

Circuit Breaker:

import time

class CircuitBreakerOpen(Exception):
    """Raised when a call is rejected because the circuit is open."""

class CircuitBreaker:
    """Prevent cascading failures to index."""
    
    def __init__(self, failure_threshold=5, timeout=60):
        self.failure_count = 0
        self.failure_threshold = failure_threshold
        self.timeout = timeout  # seconds to wait before a trial call
        self.last_failure_time = None
        self.state = 'closed'  # closed, open, half-open
        
    async def call(self, func, *args, **kwargs):
        if self.state == 'open':
            if time.time() - self.last_failure_time > self.timeout:
                # Timeout elapsed: let one trial call through
                self.state = 'half-open'
            else:
                raise CircuitBreakerOpen()
        
        try:
            result = await func(*args, **kwargs)
        except Exception:
            self.failure_count += 1
            self.last_failure_time = time.time()
            # A failed trial call reopens the circuit immediately
            if self.state == 'half-open' or self.failure_count >= self.failure_threshold:
                self.state = 'open'
            raise
        # Any success closes the circuit and clears the failure count,
        # so only consecutive failures can trip the breaker
        self.state = 'closed'
        self.failure_count = 0
        return result

Fingerprint Store (fpstore)

Optional Service

Purpose: Separate storage for raw fingerprint data
Status: Optional (can use PostgreSQL instead)
Protocol: HTTP with MessagePack

Configuration:

[fingerprint_store]
enabled = true
base_url = http://fpstore:8080

Operations:

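import aiohttp
import msgspec

class FingerprintStore:
    """Client for fingerprint storage service."""
    
    def __init__(self, base_url, session):
        self.base_url = base_url
        self.session = session  # shared aiohttp.ClientSession
    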
    async def store(self, fingerprint_id, fingerprint_data):
        """Store raw fingerprint data."""
        url = f"{self.base_url}/fingerprints/{fingerprint_id}"
        payload = msgspec.msgpack.encode({
            'data': fingerprint_data
        })
        async with self.session.put(url, data=payload) as resp:
            return resp.status == 200
    
    async def retrieve(self, fingerprint_id):
        """Retrieve raw fingerprint data."""
        url = f"{self.base_url}/fingerprints/{fingerprint_id}"
        async with self.session.get(url) as resp:
            data = await resp.read()
            result = msgspec.msgpack.decode(data)
            return result['data']

NATS Integration

Message Queue

Purpose: Async submission processing
Technology: NATS with JetStream (persistent queue)
Library: nats-py

Configuration:

[nats]
servers = nats://nats:4222
stream = acoustid_submissions
consumer = acoustid_worker

File: acoustid/worker.py

Publisher (API Server)

import time

import msgspec
import nats
from nats.js import JetStreamContext

async def publish_submission(submission_id):
    """Publish submission to NATS queue."""
    nc = await nats.connect(servers=["nats://nats:4222"])
    js: JetStreamContext = nc.jetstream()
    
    # Ensure stream exists
    await js.add_stream(
        name="acoustid_submissions",
        subjects=["submissions.*"],
        retention="workqueue"
    )
    
    # Publish message
    await js.publish(
        subject="submissions.new",
        payload=msgspec.json.encode({
            'submission_id': submission_id,
            'timestamp': time.time()
        })
    )
    
    await nc.close()

Consumer (Worker)

import logging

import msgspec
import nats
import nats.errors
import nats.js.api
from nats.js import JetStreamContext

logger = logging.getLogger(__name__)

async def consume_submissions():
    """Consume submissions from NATS queue."""
    nc = await nats.connect(servers=["nats://nats:4222"])
    js: JetStreamContext = nc.jetstream()
    
    # Create consumer
    consumer = await js.pull_subscribe(
        subject="submissions.*",
        durable="acoustid_worker",
        config=nats.js.api.ConsumerConfig(
            ack_policy="explicit",
            max_deliver=3,
            ack_wait=300  # 5 minutes
        )
    )
    
    while True:
        # Fetch batch of messages
        try:
            messages = await consumer.fetch(batch=10, timeout=5)
        except nats.errors.TimeoutError:
            # fetch() raises when the queue is empty; poll again
            continue
        
        for msg in messages:
            try:
                data = msgspec.json.decode(msg.data)
                await process_submission(data['submission_id'])
                await msg.ack()
            except Exception as e:
                logger.error(f"Failed to process submission: {e}")
                await msg.nak(delay=60)  # Retry after 1 minute

JetStream Configuration

Stream Settings:

Retention: WorkQueue (messages deleted after ack)
Max age: 7 days (unprocessed messages)
Max messages: 1,000,000
Storage: File (persistent)

Consumer Settings:

Ack policy: Explicit (manual acknowledgment)
Max deliver: 3 (at most 3 delivery attempts in total, including the first)
Ack wait: 300 seconds (5 minutes timeout)
Max ack pending: 100 (max unacked messages)

Redis Integration

Use Cases

Rate Limiting: Sliding window counters
Task Queue (legacy): RPUSH/LPOP queue
Caching: API key validation, MBID existence
State Management: Backfill progress, worker state

Configuration:

[redis]
host = redis
port = 6379
db = 0
password_file = /run/secrets/redis_password

File: acoustid/redis.py

Connection Pool

import redis

redis_pool = redis.ConnectionPool(
    host='redis',
    port=6379,
    db=0,
    max_connections=50,
    socket_timeout=5,
    socket_connect_timeout=5
)

redis_client = redis.Redis(connection_pool=redis_pool)

Rate Limiting Implementation

See DATA.md for detailed rate limiting data structures.
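
For orientation, the sliding-window idea can be sketched in memory. This is not the Redis-backed implementation described in DATA.md: it keeps a per-key log of request timestamps in a `deque`, and the class name, parameters, and injectable clock are assumptions made for the example.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `limit` requests per key within `window_seconds`."""

    def __init__(self, limit, window_seconds, clock=time.monotonic):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock  # injectable so tests can control time
        self._hits = defaultdict(deque)

    def allow(self, key):
        now = self.clock()
        hits = self._hits[key]
        # Drop timestamps that have slid out of the window
        while hits and hits[0] <= now - self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True


# Drive the limiter with a fake clock: 2 requests per 10 seconds
fake_now = [0.0]
limiter = SlidingWindowLimiter(limit=2, window_seconds=10, clock=lambda: fake_now[0])
decisions = []
for t in (0.0, 1.0, 2.0, 10.5):
    fake_now[0] = t
    decisions.append(limiter.allow("api-key-1"))
print(decisions)  # [True, True, False, True]
```

The third request is rejected because two requests already fall inside the window. By 10.5 seconds the first one has expired, so the fourth is allowed.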

Caching Patterns

API Key Cache:

from cachetools import TTLCache

api_key_cache = TTLCache(maxsize=1000, ttl=60)

def get_application_by_key(api_key):
    if api_key in api_key_cache:
        return api_key_cache[api_key]
    
    app = db.query(Application).filter_by(apikey=api_key).first()
    if app:
        api_key_cache[api_key] = app
    return app

Unknown MBID Cache:

def is_mbid_known(mbid):
    """Check if MBID exists in MusicBrainz."""
    cache_key = f"unknown_mbid:{mbid}"
    
    # Check cache
    if redis_client.exists(cache_key):
        return False
    
    # Query MusicBrainz
    exists = mb_db.query(Recording).filter_by(gid=mbid).count() > 0
    
    # Cache negative result
    if not exists:
        redis_client.setex(cache_key, 3600, '1')
    
    return exists

Integration Summary

Service	Protocol	Purpose	Criticality
MusicBrainz	PostgreSQL	Metadata enrichment	High
Chromaprint	C library	Fingerprint generation	Critical
Index (HTTP)	HTTP/MessagePack	Fingerprint search	Critical
Index (TCP)	TCP binary	Legacy fingerprint search	Low (deprecated)
Fingerprint Store	HTTP/MessagePack	Raw fingerprint storage	Low (optional)
NATS	NATS protocol	Async job queue	High
Redis	Redis protocol	Caching, rate limiting	High
