Files

T

Alexander a1f6701bac feat: initial implementation of metadata aggregator

- gRPC service with MusicBrainz provider
- PostgreSQL schema with migrations
- Service layer with database-first caching
- Repository pattern for data access
- YAML configuration support
- Research documentation for 17 music metadata projects

2026-04-28 16:28:53 +02:00

22 KiB

Raw Blame History

AcoustID Architecture

System Architecture Overview

AcoustID employs a monolithic multi-process architecture with microservice-like separation of concerns. The system is split into two major repositories with distinct responsibilities:

acoustid-server: Monolithic Python application with multiple process types
acoustid-index: Standalone Zig service for fingerprint indexing

Server Architecture

Process Types

The server runs as multiple independent processes, each with a specific role:

Process	Entry Point	Purpose	Scaling
API	`acoustid.server:make_application()`	Handle API requests	Horizontal
Web	`acoustid.server:make_application()`	Serve web UI	Horizontal
Worker	`acoustid.worker:run()`	Process background jobs	Horizontal
Cron	`acoustid.cron:run()`	Execute scheduled tasks	Single instance
Import	`acoustid.scripts.import_submissions`	Bulk import fingerprints	Manual

Directory Structure

acoustid/
├── api/                    # API layer
│   ├── __init__.py        # API application factory
│   ├── errors.py          # Error handling
│   ├── ratelimit.py       # Rate limiting logic
│   └── v2/                # API v2 endpoints
│       ├── __init__.py
│       ├── lookup.py      # Fingerprint lookup
│       ├── submit.py      # Fingerprint submission
│       ├── misc.py        # Utility endpoints
│       └── internal.py    # Internal admin endpoints
├── data/                   # Business logic layer
│   ├── account.py         # User account operations
│   ├── application.py     # API application management
│   ├── fingerprint.py     # Fingerprint operations
│   ├── foreignid.py       # Foreign ID management
│   ├── meta.py            # Metadata operations
│   ├── musicbrainz.py     # MusicBrainz queries
│   ├── stats.py           # Statistics tracking
│   ├── submission.py      # Submission processing
│   └── track.py           # Track operations
├── future/                 # Starlette migration
│   ├── app.py             # ASGI application
│   ├── lookup.py          # Async lookup handler
│   └── submit.py          # Async submit handler
├── web/                    # Web UI layer
│   ├── __init__.py        # Web application factory
│   ├── views/             # View handlers
│   └── templates/         # Jinja2 templates
├── scripts/                # Utility scripts
│   ├── import_submissions.py
│   ├── backfill_fingerprint_index.py
│   └── update_lookup_stats.py
├── cli.py                  # CLI command definitions
├── server.py               # WSGI/ASGI application
├── worker.py               # Background worker
├── cron.py                 # Cron job scheduler
├── fingerprint.py          # Fingerprint utilities
├── indexclient.py          # Legacy TCP index client
├── fpstore.py              # Modern HTTP index client
├── db.py                   # Database connection management
├── config.py               # Configuration loading
└── tables.py               # SQLAlchemy ORM models

Layered Architecture

The server follows a traditional layered architecture:

┌─────────────────────────────────────────┐
│     Presentation Layer                  │
│  (api/, web/, future/)                  │
│  - HTTP request/response handling       │
│  - Input validation                     │
│  - Response formatting                  │
└─────────────────────────────────────────┘
              ↓
┌─────────────────────────────────────────┐
│     Business Logic Layer                │
│  (data/)                                │
│  - Domain operations                    │
│  - Business rules                       │
│  - Orchestration                        │
└─────────────────────────────────────────┘
              ↓
┌─────────────────────────────────────────┐
│     Data Access Layer                   │
│  (db.py, tables.py)                     │
│  - Database queries                     │
│  - ORM models                           │
│  - Transaction management               │
└─────────────────────────────────────────┘
              ↓
┌─────────────────────────────────────────┐
│     External Services Layer             │
│  (indexclient.py, fpstore.py)           │
│  - Index communication                  │
│  - MusicBrainz queries                  │
│  - Redis operations                     │
└─────────────────────────────────────────┘

Framework Transition

The server is actively transitioning from Flask to Starlette:

Current (Flask/Werkzeug):

Location: acoustid/api/, acoustid/web/
WSGI-based synchronous request handling
Gunicorn as application server
Blocking database operations with psycopg2

Future (Starlette):

Location: acoustid/future/
ASGI-based asynchronous request handling
Uvicorn as application server
Async database operations with asyncpg

Migration Status:

Core lookup and submit endpoints have async implementations
Legacy endpoints still use Flask
Both frameworks run simultaneously during transition
Configuration flag controls which implementation is used

Index Architecture

LSM-Tree Design

The index uses a Log-Structured Merge-tree (LSM-tree) for efficient fingerprint storage and retrieval.

Core Concept:

Writes go to in-memory segment (fast)
Memory segment periodically flushed to disk
Background process merges disk segments
Reads check memory segment first, then disk segments

Components:

┌─────────────────────────────────────────┐
│         MultiIndex                      │
│  - Manages multiple named indexes       │
│  - Routes requests to correct index     │
└─────────────────────────────────────────┘
              ↓
┌─────────────────────────────────────────┐
│         Index                           │
│  - Single fingerprint index             │
│  - Coordinates segments and merging     │
└─────────────────────────────────────────┘
              ↓
┌──────────────────┬──────────────────────┐
│  MemorySegment   │   FileSegment(s)     │
│  - In-memory     │   - On-disk          │
│  - Fast writes   │   - Immutable        │
│  - Volatile      │   - Persistent       │
└──────────────────┴──────────────────────┘
              ↓
┌─────────────────────────────────────────┐
│         Oplog (Write-Ahead Log)         │
│  - Durability for memory segment        │
│  - Replay on crash recovery             │
└─────────────────────────────────────────┘

Segment Management

MemorySegment (src/MemorySegment.zig):

Hash map of fingerprint ID to posting list
Posting list: array of term IDs (compressed)
Maximum size threshold triggers flush
Backed by Oplog for durability

FileSegment (src/FileSegment.zig):

Immutable on-disk segment
Binary file format with index and data sections
StreamVByte compression for posting lists
Memory-mapped for fast reads

Segment Lifecycle:

Writes accumulate in MemorySegment
MemorySegment reaches size threshold
Flush to new FileSegment
Clear MemorySegment and Oplog
Background merger selects segments to merge
Merge creates new larger FileSegment
Delete old segments

Merge Policy

Tiered Merge Strategy:

Segments grouped into tiers by size
Tier 0: Smallest segments (recently flushed)
Tier N: Largest segments (heavily merged)
Merge triggered when tier has too many segments
Merges segments within same tier

Benefits:

Write amplification bounded
Read performance improves over time
Disk space reclaimed from deleted entries

File Format

Segment File Structure (src/filefmt.zig):

┌─────────────────────────────────────────┐
│  Header                                 │
│  - Magic number                         │
│  - Version                              │
│  - Metadata                             │
├─────────────────────────────────────────┤
│  Index Section                          │
│  - Fingerprint ID → Offset mapping      │
│  - Binary search tree or hash table     │
├─────────────────────────────────────────┤
│  Data Section                           │
│  - Compressed posting lists             │
│  - StreamVByte encoded                  │
└─────────────────────────────────────────┘

Block Compression (src/block.zig):

Posting lists compressed in blocks
StreamVByte SIMD compression
Delta encoding for term IDs
Typical compression ratio: 4-8x

Index Reader

IndexReader (src/IndexReader.zig):

Read-only view of index
Merges results from all segments
Implements search algorithm
Returns top-K candidates by score

Search Algorithm:

Extract query terms from fingerprint
For each term, fetch posting lists from all segments
Merge posting lists (union)
Score each candidate by term overlap
Return top-K candidates sorted by score

Data Flow

Submission Flow (Detailed)

┌─────────┐
│ Client  │
└────┬────┘
     │ POST /v2/submit
     ↓
┌─────────────────────────────────────────┐
│  SubmitHandler (api/v2/submit.py)      │
│  1. Validate API keys (client + user)  │
│  2. Check rate limits (Redis)          │
│  3. Decode fingerprints                │
│  4. Insert into submission table       │
│  5. Publish to NATS queue              │
└─────────────────────────────────────────┘
     │
     ↓ NATS message
┌─────────────────────────────────────────┐
│  Worker (worker.py)                     │
│  1. Consume message from NATS          │
│  2. Load submission from database      │
└─────────────────────────────────────────┘
     │
     ↓
┌─────────────────────────────────────────┐
│  FingerprintSearcher (data/fingerprint) │
│  1. Extract query from fingerprint     │
│  2. Search index for matches           │
└─────────────────────────────────────────┘
     │
     ↓ HTTP POST /:index/_search
┌─────────────────────────────────────────┐
│  Index (fpindex)                        │
│  1. Decode MessagePack request         │
│  2. Search segments                    │
│  3. Score candidates                   │
│  4. Return top matches                 │
└─────────────────────────────────────────┘
     │
     ↓ Candidate fingerprint IDs
┌─────────────────────────────────────────┐
│  Worker (continued)                     │
│  1. Fetch candidate metadata from DB   │
│  2. Decide: create new track or link   │
│  3. Insert/update track tables         │
│  4. Update index with new fingerprint  │
│  5. Store result in submission_result  │
└─────────────────────────────────────────┘
     │
     ↓ HTTP PUT /:index/:fpid
┌─────────────────────────────────────────┐
│  Index (fpindex)                        │
│  1. Add fingerprint to MemorySegment   │
│  2. Append to Oplog                    │
│  3. Trigger flush if needed            │
└─────────────────────────────────────────┘

Lookup Flow (Detailed)

┌─────────┐
│ Client  │
└────┬────┘
     │ GET/POST /v2/lookup
     ↓
┌─────────────────────────────────────────┐
│  LookupHandler (api/v2/lookup.py)      │
│  1. Validate API key (client)          │
│  2. Check rate limits (Redis)          │
│  3. Parse parameters                   │
└─────────────────────────────────────────┘
     │
     ↓
┌─────────────────────────────────────────┐
│  decode_fingerprint (fingerprint.py)    │
│  1. Decode base64 or compressed format │
│  2. Decompress if needed               │
│  3. Parse Chromaprint data             │
└─────────────────────────────────────────┘
     │
     ↓
┌─────────────────────────────────────────┐
│  extract_query (fingerprint.py)         │
│  1. Extract hash terms from fingerprint│
│  2. Build query structure              │
└─────────────────────────────────────────┘
     │
     ↓
┌─────────────────────────────────────────┐
│  fpstore.search (fpstore.py)            │
│  1. Encode query as MessagePack        │
│  2. HTTP POST to index                 │
└─────────────────────────────────────────┘
     │
     ↓ HTTP POST /:index/_search
┌─────────────────────────────────────────┐
│  Index (fpindex)                        │
│  1. Parse MessagePack query            │
│  2. Search all segments                │
│  3. Merge and score results            │
│  4. Return top-K candidates            │
└─────────────────────────────────────────┘
     │
     ↓ Candidate fingerprint IDs + scores
┌─────────────────────────────────────────┐
│  LookupHandler (continued)              │
│  1. Fetch fingerprint metadata from DB │
│  2. Fetch track metadata from DB       │
│  3. Fetch MusicBrainz data if requested│
│  4. Build result structure             │
│  5. Format as JSON/XML                 │
└─────────────────────────────────────────┘
     │
     ↓ JSON response
┌─────────┐
│ Client  │
└─────────┘

Background Processing

Cron Jobs (acoustid/cron.py):

Update lookup statistics (hourly)
Update user agent statistics (daily)
Clean up old submissions (daily)
Refresh materialized views (hourly)
Backup index snapshots (daily)

Worker Tasks (acoustid/worker.py):

Process fingerprint submissions
Import bulk fingerprints
Update index with new data
Resolve MBID redirects
Clean up orphaned records

Index Communication Protocols

Legacy Protocol (indexclient.py)

Transport: Raw TCP socket
Port: 6080 (default)
Format: Custom binary protocol

Message Structure:

┌────────────────┬────────────────┬────────────────┐
│  Length (4B)   │  Command (1B)  │  Payload       │
└────────────────┴────────────────┴────────────────┘

Commands:

0x01: Search
0x02: Insert
0x03: Delete

Status: Being phased out, replaced by HTTP protocol

Modern Protocol (fpstore.py)

Transport: HTTP/1.1
Port: 6081 (default)
Format: MessagePack

Endpoints:

Method	Path	Purpose
POST	`/:index/_search`	Search for fingerprints
PUT	`/:index/:fpid`	Insert/update fingerprint
DELETE	`/:index/:fpid`	Delete fingerprint
GET	`/:index`	Get index info
GET	`/:index/_segments`	List segments
GET	`/:index/_snapshot`	Create snapshot

Search Request:

{
    "query": [term_id1, term_id2, ...],  # Query terms
    "limit": 10,                          # Max results
    "min_score": 0.5                      # Score threshold
}

Search Response:

{
    "results": [
        {"id": fpid1, "score": 0.95},
        {"id": fpid2, "score": 0.87},
        ...
    ]
}

Concurrency and Parallelism

Server Concurrency

API/Web Processes:

Multiple worker processes (Gunicorn/Uvicorn)
Each process handles requests independently
Shared-nothing architecture
Database connection pooling per process

Worker Processes:

Multiple worker instances
NATS queue provides work distribution
Each worker processes one submission at a time
No shared state between workers

Cron Process:

Single instance (leader election via database)
Scheduled tasks run sequentially
Long-running tasks delegated to workers

Index Concurrency

Thread Model:

Main thread: HTTP server
Worker threads: Search and merge operations
Configurable thread pool size

Locking Strategy:

Read-write lock on Index
Multiple concurrent readers
Exclusive writer (for flush/merge)
Lock-free MemorySegment (atomic operations)

Background Tasks:

Segment merger runs in background thread
Oplog flusher runs periodically
Metrics collector runs independently

Scalability Considerations

Horizontal Scaling

API/Web:

Stateless processes
Scale by adding more instances
Load balancer distributes requests
Session state in Redis (if needed)

Workers:

Scale by adding more instances
NATS queue distributes work
No coordination required

Index:

Multiple index instances (sharding)
Consistent hashing for fingerprint distribution
NATS for cluster coordination
Each instance handles subset of fingerprints

Vertical Scaling

Database:

Connection pooling
Read replicas for queries
Partitioning for large tables
Materialized views for aggregations

Index:

More threads for search
Larger memory segment
Faster disk for segments
More RAM for file caching

Fault Tolerance

Server Resilience

Database Failures:

Connection retry with exponential backoff
Health checks detect failures
Read-only mode if write DB unavailable

Index Failures:

Graceful degradation (return partial results)
Retry with exponential backoff
Circuit breaker pattern

NATS Failures:

Persistent queue (JetStream)
Automatic reconnection
Message replay on recovery

Index Resilience

Crash Recovery:

Oplog replay restores MemorySegment
FileSegments are immutable (no corruption)
Incomplete merges discarded

Data Integrity:

Checksums in file format
Atomic file operations
Write-ahead logging

Replication:

NATS-based replication (optional)
Snapshot-based backup
Point-in-time recovery

Performance Characteristics

Server Performance

Lookup Latency:

P50: ~50ms (including index search)
P95: ~200ms
P99: ~500ms

Bottlenecks:

Index search time (dominant)
Database query time (metadata fetch)
Network latency (MusicBrainz queries)

Index Performance

Search Latency:

P50: ~5ms
P95: ~20ms
P99: ~50ms

Throughput:

~1000 searches/second (single instance)
~500 inserts/second (single instance)