
AcoustID Architecture

System Architecture Overview

AcoustID employs a monolithic multi-process architecture with microservice-like separation of concerns. The system is split into two major repositories with distinct responsibilities:

  1. acoustid-server: Monolithic Python application with multiple process types
  2. acoustid-index: Standalone Zig service for fingerprint indexing

Server Architecture

Process Types

The server runs as multiple independent processes, each with a specific role:

Process   Entry Point                           Purpose                    Scaling
API       acoustid.server:make_application()    Handle API requests        Horizontal
Web       acoustid.server:make_application()    Serve web UI               Horizontal
Worker    acoustid.worker:run()                 Process background jobs    Horizontal
Cron      acoustid.cron:run()                   Execute scheduled tasks    Single instance
Import    acoustid.scripts.import_submissions   Bulk import fingerprints   Manual

Directory Structure

acoustid/
├── api/                    # API layer
│   ├── __init__.py        # API application factory
│   ├── errors.py          # Error handling
│   ├── ratelimit.py       # Rate limiting logic
│   └── v2/                # API v2 endpoints
│       ├── __init__.py
│       ├── lookup.py      # Fingerprint lookup
│       ├── submit.py      # Fingerprint submission
│       ├── misc.py        # Utility endpoints
│       └── internal.py    # Internal admin endpoints
├── data/                   # Business logic layer
│   ├── account.py         # User account operations
│   ├── application.py     # API application management
│   ├── fingerprint.py     # Fingerprint operations
│   ├── foreignid.py       # Foreign ID management
│   ├── meta.py            # Metadata operations
│   ├── musicbrainz.py     # MusicBrainz queries
│   ├── stats.py           # Statistics tracking
│   ├── submission.py      # Submission processing
│   └── track.py           # Track operations
├── future/                 # Starlette migration
│   ├── app.py             # ASGI application
│   ├── lookup.py          # Async lookup handler
│   └── submit.py          # Async submit handler
├── web/                    # Web UI layer
│   ├── __init__.py        # Web application factory
│   ├── views/             # View handlers
│   └── templates/         # Jinja2 templates
├── scripts/                # Utility scripts
│   ├── import_submissions.py
│   ├── backfill_fingerprint_index.py
│   └── update_lookup_stats.py
├── cli.py                  # CLI command definitions
├── server.py               # WSGI/ASGI application
├── worker.py               # Background worker
├── cron.py                 # Cron job scheduler
├── fingerprint.py          # Fingerprint utilities
├── indexclient.py          # Legacy TCP index client
├── fpstore.py              # Modern HTTP index client
├── db.py                   # Database connection management
├── config.py               # Configuration loading
└── tables.py               # SQLAlchemy ORM models

Layered Architecture

The server follows a traditional layered architecture:

┌─────────────────────────────────────────┐
│     Presentation Layer                  │
│  (api/, web/, future/)                  │
│  - HTTP request/response handling       │
│  - Input validation                     │
│  - Response formatting                  │
└─────────────────────────────────────────┘
              ↓
┌─────────────────────────────────────────┐
│     Business Logic Layer                │
│  (data/)                                │
│  - Domain operations                    │
│  - Business rules                       │
│  - Orchestration                        │
└─────────────────────────────────────────┘
              ↓
┌─────────────────────────────────────────┐
│     Data Access Layer                   │
│  (db.py, tables.py)                     │
│  - Database queries                     │
│  - ORM models                           │
│  - Transaction management               │
└─────────────────────────────────────────┘
              ↓
┌─────────────────────────────────────────┐
│     External Services Layer             │
│  (indexclient.py, fpstore.py)           │
│  - Index communication                  │
│  - MusicBrainz queries                  │
│  - Redis operations                     │
└─────────────────────────────────────────┘

Framework Transition

The server is actively transitioning from Flask to Starlette:

Current (Flask/Werkzeug):

  • Location: acoustid/api/, acoustid/web/
  • WSGI-based synchronous request handling
  • Gunicorn as application server
  • Blocking database operations with psycopg2

Future (Starlette):

  • Location: acoustid/future/
  • ASGI-based asynchronous request handling
  • Uvicorn as application server
  • Async database operations with asyncpg

Migration Status:

  • Core lookup and submit endpoints have async implementations
  • Legacy endpoints still use Flask
  • Both frameworks run simultaneously during transition
  • Configuration flag controls which implementation is used

Index Architecture

LSM-Tree Design

The index uses a Log-Structured Merge-tree (LSM-tree) for efficient fingerprint storage and retrieval.

Core Concept:

  • Writes go to in-memory segment (fast)
  • Memory segment periodically flushed to disk
  • Background process merges disk segments
  • Reads check memory segment first, then disk segments
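
The write and read paths above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the real index is written in Zig, the names `TinyLsm`, `add`, `flush`, and `lookup` are hypothetical, plain dicts stand in for on-disk files, and segments are keyed by term because the search steps in this document fetch posting lists per term.

```python
from collections import defaultdict


class TinyLsm:
    """Toy LSM index: one mutable in-memory segment plus immutable flushed segments."""

    def __init__(self, flush_threshold=4):
        self.flush_threshold = flush_threshold  # hypothetical knob, not an fpindex setting
        self.memory = defaultdict(set)          # term -> fingerprint IDs
        self.memory_size = 0                    # fingerprints added since last flush
        self.disk_segments = []                 # oldest first; dicts stand in for files

    def add(self, fingerprint_id, terms):
        # Writes only touch the in-memory segment.
        for term in terms:
            self.memory[term].add(fingerprint_id)
        self.memory_size += 1
        if self.memory_size >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Freeze the memory segment into an immutable "disk" segment.
        if not self.memory:
            return
        frozen = {term: frozenset(ids) for term, ids in self.memory.items()}
        self.disk_segments.append(frozen)
        self.memory = defaultdict(set)
        self.memory_size = 0

    def lookup(self, term):
        # Reads consult the memory segment first, then disk segments (newest first).
        found = set(self.memory.get(term, ()))
        for segment in reversed(self.disk_segments):
            found |= segment.get(term, frozenset())
        return found
```

Background merging is left out here; it would replace several entries of `disk_segments` with one combined entry without changing what `lookup` returns.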

Components:

┌─────────────────────────────────────────┐
│         MultiIndex                      │
│  - Manages multiple named indexes       │
│  - Routes requests to correct index     │
└─────────────────────────────────────────┘
              ↓
┌─────────────────────────────────────────┐
│         Index                           │
│  - Single fingerprint index             │
│  - Coordinates segments and merging     │
└─────────────────────────────────────────┘
              ↓
┌──────────────────┬──────────────────────┐
│  MemorySegment   │   FileSegment(s)     │
│  - In-memory     │   - On-disk          │
│  - Fast writes   │   - Immutable        │
│  - Volatile      │   - Persistent       │
└──────────────────┴──────────────────────┘
              ↓
┌─────────────────────────────────────────┐
│         Oplog (Write-Ahead Log)         │
│  - Durability for memory segment        │
│  - Replay on crash recovery             │
└─────────────────────────────────────────┘

Segment Management

MemorySegment (src/MemorySegment.zig):

  • Map from term (fingerprint hash) to posting list
  • Posting list: array of fingerprint IDs (compressed)
  • Maximum size threshold triggers flush
  • Backed by Oplog for durability

FileSegment (src/FileSegment.zig):

  • Immutable on-disk segment
  • Binary file format with index and data sections
  • StreamVByte compression for posting lists
  • Memory-mapped for fast reads

Segment Lifecycle:

  1. Writes accumulate in MemorySegment
  2. MemorySegment reaches size threshold
  3. Flush to new FileSegment
  4. Clear MemorySegment and Oplog
  5. Background merger selects segments to merge
  6. Merge creates new larger FileSegment
  7. Delete old segments

Merge Policy

Tiered Merge Strategy:

  • Segments grouped into tiers by size
  • Tier 0: Smallest segments (recently flushed)
  • Tier N: Largest segments (heavily merged)
  • Merge triggered when tier has too many segments
  • Merges segments within same tier
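
One way to picture the tier selection is the sketch below. The tier boundaries (`base_size`, `growth`) and the per-tier limit (`max_per_tier`) are hypothetical values chosen for illustration, and `tier_of` and `pick_merge` are made-up names; the actual policy in the index may differ.

```python
def tier_of(size, base_size=1000, growth=10):
    """Tier 0 holds segments up to base_size; each later tier is `growth` times larger."""
    tier, limit = 0, base_size
    while size > limit:
        limit *= growth
        tier += 1
    return tier


def pick_merge(segment_sizes, max_per_tier=4, base_size=1000, growth=10):
    """Return the sizes of the segments to merge, or [] if no tier is over its limit."""
    tiers = {}
    for size in segment_sizes:
        tiers.setdefault(tier_of(size, base_size, growth), []).append(size)
    for tier in sorted(tiers):                     # check the smallest tier first
        members = tiers[tier]
        if len(members) > max_per_tier:
            return sorted(members)[:max_per_tier]  # merge the smallest members together
    return []
```

Merging the smallest tier first keeps each merge cheap, and the merged output is larger, so it usually moves up a tier.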

Benefits:

  • Write amplification bounded
  • Read performance improves as merges reduce the number of segments each query must check
  • Disk space reclaimed from deleted entries

File Format

Segment File Structure (src/filefmt.zig):

┌─────────────────────────────────────────┐
│  Header                                 │
│  - Magic number                         │
│  - Version                              │
│  - Metadata                             │
├─────────────────────────────────────────┤
│  Index Section                          │
│  - Term → Offset mapping                │
│  - Binary search tree or hash table     │
├─────────────────────────────────────────┤
│  Data Section                           │
│  - Compressed posting lists             │
│  - StreamVByte encoded                  │
└─────────────────────────────────────────┘

Block Compression (src/block.zig):

  • Posting lists compressed in blocks
  • StreamVByte SIMD compression
  • Delta encoding for term IDs
  • Typical compression ratio: 4-8x
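
To show why delta encoding helps, here is a small sketch that stores the gaps between sorted integer IDs as variable-length bytes. It deliberately uses plain LEB128 varints rather than StreamVByte, which is a SIMD-oriented format with a different byte layout; the function names are hypothetical.

```python
def encode_posting_list(sorted_ids):
    """Delta-encode ascending integers, then write each gap as a LEB128 varint."""
    out = bytearray()
    previous = 0
    for value in sorted_ids:
        gap = value - previous
        if gap < 0:
            raise ValueError("input must be sorted in ascending order")
        previous = value
        while gap >= 0x80:
            out.append((gap & 0x7F) | 0x80)  # low 7 bits with the continuation bit set
            gap >>= 7
        out.append(gap)                      # final byte has the continuation bit clear
    return bytes(out)


def decode_posting_list(data):
    """Inverse of encode_posting_list."""
    values, current, shift, gap = [], 0, 0, 0
    for byte in data:
        gap |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7                       # more bytes belong to this gap
        else:
            current += gap                   # undo the delta step
            values.append(current)
            gap, shift = 0, 0
    return values
```

Small gaps fit in one byte each, so a dense sorted list takes far less space than fixed 4-byte integers; the ratio achieved in practice depends on the data.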

Index Reader

IndexReader (src/IndexReader.zig):

  • Read-only view of index
  • Merges results from all segments
  • Implements search algorithm
  • Returns top-K candidates by score

Search Algorithm:

  1. Extract query terms from fingerprint
  2. For each term, fetch posting lists from all segments
  3. Merge posting lists (union)
  4. Score each candidate by term overlap
  5. Return top-K candidates sorted by score
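
The five steps can be sketched as follows. The scoring formula (fraction of distinct query terms matched) and the tie-break on lower ID are assumptions made for this example, as is the `search` function itself; segments are modelled as dicts from term to fingerprint IDs.

```python
import heapq
from collections import Counter


def search(query_terms, segments, limit=10):
    """segments: a list of dicts mapping term -> iterable of fingerprint IDs."""
    terms = set(query_terms)
    if not terms:
        return []
    hits = Counter()
    for term in terms:
        matched = set()
        for segment in segments:
            matched.update(segment.get(term, ()))  # union of posting lists across segments
        for fingerprint_id in matched:
            hits[fingerprint_id] += 1              # one point per matching query term
    scored = [(count / len(terms), fid) for fid, count in hits.items()]
    # Highest score first; ties go to the lower fingerprint ID.
    best = heapq.nlargest(limit, scored, key=lambda item: (item[0], -item[1]))
    return [{"id": fid, "score": score} for score, fid in best]
```

`heapq.nlargest` avoids sorting every candidate when only the top few are needed.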

Data Flow

Submission Flow (Detailed)

┌─────────┐
│ Client  │
└────┬────┘
     │ POST /v2/submit
     ↓
┌─────────────────────────────────────────┐
│  SubmitHandler (api/v2/submit.py)      │
│  1. Validate API keys (client + user)  │
│  2. Check rate limits (Redis)          │
│  3. Decode fingerprints                │
│  4. Insert into submission table       │
│  5. Publish to NATS queue              │
└─────────────────────────────────────────┘
     │
     ↓ NATS message
┌─────────────────────────────────────────┐
│  Worker (worker.py)                     │
│  1. Consume message from NATS          │
│  2. Load submission from database      │
└─────────────────────────────────────────┘
     │
     ↓
┌─────────────────────────────────────────┐
│  FingerprintSearcher (data/fingerprint) │
│  1. Extract query from fingerprint     │
│  2. Search index for matches           │
└─────────────────────────────────────────┘
     │
     ↓ HTTP POST /:index/_search
┌─────────────────────────────────────────┐
│  Index (fpindex)                        │
│  1. Decode MessagePack request         │
│  2. Search segments                    │
│  3. Score candidates                   │
│  4. Return top matches                 │
└─────────────────────────────────────────┘
     │
     ↓ Candidate fingerprint IDs
┌─────────────────────────────────────────┐
│  Worker (continued)                     │
│  1. Fetch candidate metadata from DB   │
│  2. Decide: create new track or link   │
│  3. Insert/update track tables         │
│  4. Update index with new fingerprint  │
│  5. Store result in submission_result  │
└─────────────────────────────────────────┘
     │
     ↓ HTTP PUT /:index/:fpid
┌─────────────────────────────────────────┐
│  Index (fpindex)                        │
│  1. Add fingerprint to MemorySegment   │
│  2. Append to Oplog                    │
│  3. Trigger flush if needed            │
└─────────────────────────────────────────┘

Lookup Flow (Detailed)

┌─────────┐
│ Client  │
└────┬────┘
     │ GET/POST /v2/lookup
     ↓
┌─────────────────────────────────────────┐
│  LookupHandler (api/v2/lookup.py)      │
│  1. Validate API key (client)          │
│  2. Check rate limits (Redis)          │
│  3. Parse parameters                   │
└─────────────────────────────────────────┘
     │
     ↓
┌─────────────────────────────────────────┐
│  decode_fingerprint (fingerprint.py)    │
│  1. Decode base64 or compressed format │
│  2. Decompress if needed               │
│  3. Parse Chromaprint data             │
└─────────────────────────────────────────┘
     │
     ↓
┌─────────────────────────────────────────┐
│  extract_query (fingerprint.py)         │
│  1. Extract hash terms from fingerprint│
│  2. Build query structure              │
└─────────────────────────────────────────┘
     │
     ↓
┌─────────────────────────────────────────┐
│  fpstore.search (fpstore.py)            │
│  1. Encode query as MessagePack        │
│  2. HTTP POST to index                 │
└─────────────────────────────────────────┘
     │
     ↓ HTTP POST /:index/_search
┌─────────────────────────────────────────┐
│  Index (fpindex)                        │
│  1. Parse MessagePack query            │
│  2. Search all segments                │
│  3. Merge and score results            │
│  4. Return top-K candidates            │
└─────────────────────────────────────────┘
     │
     ↓ Candidate fingerprint IDs + scores
┌─────────────────────────────────────────┐
│  LookupHandler (continued)              │
│  1. Fetch fingerprint metadata from DB │
│  2. Fetch track metadata from DB       │
│  3. Fetch MusicBrainz data if requested│
│  4. Build result structure             │
│  5. Format as JSON/XML                 │
└─────────────────────────────────────────┘
     │
     ↓ JSON response
┌─────────┐
│ Client  │
└─────────┘

Background Processing

Cron Jobs (acoustid/cron.py):

  • Update lookup statistics (hourly)
  • Update user agent statistics (daily)
  • Clean up old submissions (daily)
  • Refresh materialized views (hourly)
  • Backup index snapshots (daily)

Worker Tasks (acoustid/worker.py):

  • Process fingerprint submissions
  • Import bulk fingerprints
  • Update index with new data
  • Resolve MBID redirects
  • Clean up orphaned records

Index Communication Protocols

Legacy Protocol (indexclient.py)

Transport: Raw TCP socket
Port: 6080 (default)
Format: Custom binary protocol

Message Structure:

┌────────────────┬────────────────┬────────────────┐
│  Length (4B)   │  Command (1B)  │  Payload       │
└────────────────┴────────────────┴────────────────┘

Commands:

  • 0x01: Search
  • 0x02: Insert
  • 0x03: Delete

Status: Being phased out, replaced by HTTP protocol
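
A length-prefixed frame like the one drawn above could be packed and parsed as shown below. Several details are assumptions, since this document does not specify them: the length is treated as big-endian and as covering the command byte plus the payload, and `pack_frame` and `unpack_frame` are hypothetical helper names rather than functions from `indexclient.py`.

```python
import struct

CMD_SEARCH, CMD_INSERT, CMD_DELETE = 0x01, 0x02, 0x03


def pack_frame(command, payload=b""):
    # Assumed layout: 4-byte big-endian length (command + payload), then 1 command byte.
    return struct.pack(">IB", 1 + len(payload), command) + payload


def unpack_frame(data):
    """Return (command, payload) for one complete frame at the start of data."""
    length, command = struct.unpack_from(">IB", data)
    payload = data[5:4 + length]
    if len(payload) != length - 1:
        raise ValueError("truncated frame")
    return command, payload
```

A length prefix lets the receiver know how many bytes to read from the TCP stream before it starts parsing the payload.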

Modern Protocol (fpstore.py)

Transport: HTTP/1.1
Port: 6081 (default)
Format: MessagePack

Endpoints:

Method   Path                Purpose
POST     /:index/_search     Search for fingerprints
PUT      /:index/:fpid       Insert/update fingerprint
DELETE   /:index/:fpid       Delete fingerprint
GET      /:index             Get index info
GET      /:index/_segments   List segments
GET      /:index/_snapshot   Create snapshot

Search Request:

{
    "query": [term_id1, term_id2, ...],  # Query terms
    "limit": 10,                          # Max results
    "min_score": 0.5                      # Score threshold
}

Search Response:

{
    "results": [
        {"id": fpid1, "score": 0.95},
        {"id": fpid2, "score": 0.87},
        ...
    ]
}

Concurrency and Parallelism

Server Concurrency

API/Web Processes:

  • Multiple worker processes (Gunicorn/Uvicorn)
  • Each process handles requests independently
  • Shared-nothing architecture
  • Database connection pooling per process

Worker Processes:

  • Multiple worker instances
  • NATS queue provides work distribution
  • Each worker processes one submission at a time
  • No shared state between workers

Cron Process:

  • Single instance (leader election via database)
  • Scheduled tasks run sequentially
  • Long-running tasks delegated to workers

Index Concurrency

Thread Model:

  • Main thread: HTTP server
  • Worker threads: Search and merge operations
  • Configurable thread pool size

Locking Strategy:

  • Read-write lock on Index
  • Multiple concurrent readers
  • Exclusive writer (for flush/merge)
  • Lock-free MemorySegment (atomic operations)
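
The reader/writer rule above (many concurrent readers, one exclusive writer) can be illustrated with a small condition-variable lock. This is a generic sketch, not the index's Zig implementation; `ReadWriteLock` is a hypothetical class and it makes no attempt to prevent writer starvation.

```python
import threading


class ReadWriteLock:
    """Many concurrent readers or one exclusive writer (no writer priority)."""

    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0
        self._writer = False

    def acquire_read(self):
        with self._cond:
            while self._writer:                    # readers wait only for a writer
                self._cond.wait()
            self._readers += 1

    def release_read(self):
        with self._cond:
            self._readers -= 1
            if self._readers == 0:
                self._cond.notify_all()            # a waiting writer may proceed

    def acquire_write(self):
        with self._cond:
            while self._writer or self._readers:   # writer needs exclusive access
                self._cond.wait()
            self._writer = True

    def release_write(self):
        with self._cond:
            self._writer = False
            self._cond.notify_all()
```

In this model a search would hold the read side, while a flush or merge would take the write side only for the brief moment it swaps the segment list.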

Background Tasks:

  • Segment merger runs in background thread
  • Oplog flusher runs periodically
  • Metrics collector runs independently

Scalability Considerations

Horizontal Scaling

API/Web:

  • Stateless processes
  • Scale by adding more instances
  • Load balancer distributes requests
  • Session state in Redis (if needed)

Workers:

  • Scale by adding more instances
  • NATS queue distributes work
  • No coordination required

Index:

  • Multiple index instances (sharding)
  • Consistent hashing for fingerprint distribution
  • NATS for cluster coordination
  • Each instance handles subset of fingerprints
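
Consistent hashing for fingerprint distribution could look like the sketch below. The `HashRing` class, the use of SHA-1, and the 64 virtual points per node are all assumptions for illustration; this document does not say how the index assigns fingerprints to instances.

```python
import bisect
import hashlib


class HashRing:
    """Consistent-hash ring with virtual points for smoother distribution."""

    def __init__(self, nodes, replicas=64):
        self._ring = sorted(
            (self._hash(f"{node}#{i}"), node) for node in nodes for i in range(replicas)
        )
        self._keys = [point for point, _ in self._ring]

    @staticmethod
    def _hash(value):
        # First 8 bytes of SHA-1 as an integer position on the ring.
        return int.from_bytes(hashlib.sha1(value.encode()).digest()[:8], "big")

    def node_for(self, fingerprint_id):
        if not self._ring:
            raise ValueError("ring has no nodes")
        # Walk clockwise to the first point after the key, wrapping at the end.
        position = bisect.bisect_right(self._keys, self._hash(str(fingerprint_id)))
        return self._ring[position % len(self._ring)][1]
```

The useful property is that removing an instance only reassigns the fingerprints that lived on it; keys owned by the remaining instances stay where they were.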

Vertical Scaling

Database:

  • Connection pooling
  • Read replicas for queries
  • Partitioning for large tables
  • Materialized views for aggregations

Index:

  • More threads for search
  • Larger memory segment
  • Faster disk for segments
  • More RAM for file caching

Fault Tolerance

Server Resilience

Database Failures:

  • Connection retry with exponential backoff
  • Health checks detect failures
  • Read-only mode if write DB unavailable

Index Failures:

  • Graceful degradation (return partial results)
  • Retry with exponential backoff
  • Circuit breaker pattern
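
Retry with exponential backoff and a circuit breaker can be combined roughly as follows. This is a generic sketch, not code from acoustid-server: `CircuitBreaker`, `CircuitOpenError`, `retry_with_backoff`, and all thresholds and delays are hypothetical.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when calls are rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at = None

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after:
                raise CircuitOpenError("service marked unavailable")
            self._opened_at = None               # half-open: let one trial call through
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open (or re-open) the circuit
            raise
        self._failures = 0                       # success closes the circuit
        return result


def retry_with_backoff(func, attempts=4, base_delay=0.1, sleep=time.sleep):
    for attempt in range(attempts):
        try:
            return func()
        except CircuitOpenError:
            raise                                # do not retry against an open circuit
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)     # delay doubles after each failure
```

A caller would typically wrap the index request as `retry_with_backoff(lambda: breaker.call(do_search))`, so that repeated failures stop generating traffic until the reset interval has passed.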

NATS Failures:

  • Persistent queue (JetStream)
  • Automatic reconnection
  • Message replay on recovery

Index Resilience

Crash Recovery:

  • Oplog replay restores MemorySegment
  • FileSegments are immutable, so a crash cannot leave one partially overwritten
  • Incomplete merges discarded
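
Oplog replay can be pictured with the sketch below. The log format here (one JSON object per line with `type`, `id`, and `terms` fields) is invented for the example and is not the on-disk format used by the index; `append_op` and `replay` are hypothetical names.

```python
import json
import os
from collections import defaultdict


def append_op(oplog_path, op):
    # One JSON object per line; fsync before the write is acknowledged.
    with open(oplog_path, "a", encoding="utf-8") as log:
        log.write(json.dumps(op) + "\n")
        log.flush()
        os.fsync(log.fileno())


def replay(oplog_path):
    """Rebuild the in-memory segment (term -> fingerprint IDs) from the log."""
    memory = defaultdict(set)
    if not os.path.exists(oplog_path):
        return {}
    with open(oplog_path, encoding="utf-8") as log:
        for line in log:
            try:
                op = json.loads(line)
            except json.JSONDecodeError:
                break                          # torn final record from a crash: stop here
            if op["type"] == "insert":
                for term in op["terms"]:
                    memory[term].add(op["id"])
            elif op["type"] == "delete":
                for ids in memory.values():
                    ids.discard(op["id"])
    return {term: ids for term, ids in memory.items() if ids}
```

Because operations are applied in the order they were logged, replaying the file after a restart reproduces the state the memory segment had before the crash, minus any record that was only partially written.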

Data Integrity:

  • Checksums in file format
  • Atomic file operations
  • Write-ahead logging
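
Checksums and atomic file operations are commonly combined in a write-to-temp-then-rename pattern, sketched here. The CRC32 trailer and the function names are assumptions for illustration and do not describe the actual segment file format.

```python
import os
import struct
import tempfile
import zlib


def write_segment_atomically(path, payload):
    """Write payload plus a CRC32 trailer to a temp file, fsync, then rename into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-segment-")
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(payload)
            handle.write(struct.pack("<I", zlib.crc32(payload)))
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_path, path)   # atomic on POSIX when both paths share a filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)      # never leave a half-written temp file behind
        raise


def read_segment(path):
    """Return the payload, raising ValueError if the checksum does not match."""
    with open(path, "rb") as handle:
        data = handle.read()
    if len(data) < 4:
        raise ValueError("segment too short")
    payload, (stored,) = data[:-4], struct.unpack("<I", data[-4:])
    if zlib.crc32(payload) != stored:
        raise ValueError("checksum mismatch")
    return payload
```

Readers either see the old file or the complete new one, never a partial write, and the checksum catches corruption that happens after the file is on disk.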

Replication:

  • NATS-based replication (optional)
  • Snapshot-based backup
  • Point-in-time recovery

Performance Characteristics

Server Performance

Lookup Latency:

  • P50: ~50ms (including index search)
  • P95: ~200ms
  • P99: ~500ms

Bottlenecks:

  • Index search time (dominant)
  • Database query time (metadata fetch)
  • Network latency (MusicBrainz queries)

Index Performance

Search Latency:

  • P50: ~5ms
  • P95: ~20ms
  • P99: ~50ms

Throughput:

  • ~1000 searches/second (single instance)
  • ~500 inserts/second (single instance)

Bottlenecks:

  • Disk I/O (segment reads)
  • CPU (decompression and scoring)
  • Memory (segment caching)

Future Architecture Plans

Server Modernization

  1. Complete migration to Starlette/ASGI
  2. Remove Flask dependencies
  3. Async database operations everywhere
  4. GraphQL API alongside REST

Index Enhancements

  1. Distributed index with automatic sharding
  2. Replication for high availability
  3. Incremental snapshots
  4. Query result caching

Infrastructure

  1. Kubernetes deployment
  2. Service mesh (Istio/Linkerd)
  3. Distributed tracing (OpenTelemetry)
  4. Advanced monitoring (Prometheus + Grafana)