metadata-agregator/docs/research/acoustid/analysis/DATA.md
Alexander a1f6701bac feat: initial implementation of metadata aggregator
- gRPC service with MusicBrainz provider
- PostgreSQL schema with migrations
- Service layer with database-first caching
- Repository pattern for data access
- YAML configuration support
- Research documentation for 17 music metadata projects
2026-04-28 16:28:53 +02:00


AcoustID Data Model

Database Architecture

AcoustID uses a multi-database PostgreSQL architecture with separate databases for different concerns.

Database Instances

| Database | Purpose | Tables | Extensions |
| --- | --- | --- | --- |
| acoustid_app | Application data (accounts, apps, stats) | 8 | pgcrypto |
| acoustid_fingerprint | Fingerprint and track data | 19 | intarray, acoustid, cube |
| acoustid_ingest | Submission processing | 3 | - |
| musicbrainz | MusicBrainz mirror (read-only) | Many | - |

PostgreSQL Extensions

intarray: Integer array operations

  • Used for fingerprint array queries
  • Provides && (overlap) and @> (contains) operators

pgcrypto: Cryptographic functions

  • UUID generation (gen_random_uuid())
  • API key hashing

acoustid (custom): Fingerprint similarity functions

  • acoustid_compare(int[], int[]): Compare two fingerprints
  • acoustid_extract_query(int[]): Extract query terms
  • Source: acoustid-ext C extension

cube: Multi-dimensional cube data type

  • Used for simhash-based fingerprint indexing
  • Enables fast approximate nearest neighbor search
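The two intarray operators listed above have simple set semantics. The following sketch mimics them in plain Python purely for illustration; the function names `overlaps` and `contains` are hypothetical and are not part of intarray or of the AcoustID code base.

```python
def overlaps(a, b):
    """Mimic intarray's `a && b`: true when the arrays share at least one element."""
    return not set(a).isdisjoint(b)


def contains(a, b):
    """Mimic intarray's `a @> b`: true when every element of b also appears in a."""
    return set(b).issubset(a)


if __name__ == "__main__":
    stored = [101, 205, 309]
    print(overlaps(stored, [309, 999]))  # shares 309
    print(contains(stored, [101, 309]))  # both query terms are present
```

In the database the same checks run against the GIN index, so the server does not have to scan every stored array.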

Core Tables

Account Management (acoustid_app)

account

User accounts for API access.

| Column | Type | Constraints | Description |
| --- | --- | --- | --- |
| id | SERIAL | PRIMARY KEY | Account ID |
| name | VARCHAR(255) | NOT NULL | Display name |
| apikey | VARCHAR(40) | UNIQUE, NOT NULL | API key (user key) |
| mbuser | VARCHAR(64) | UNIQUE | MusicBrainz username |
| created | TIMESTAMP | NOT NULL | Creation timestamp |
| lastlogin | TIMESTAMP | | Last login timestamp |
| submission_count | INTEGER | DEFAULT 0 | Total submissions |
| application_id | INTEGER | FOREIGN KEY | Default application |
| application_version | VARCHAR(255) | | Application version |
| created_from | INET | | Registration IP |
| is_admin | BOOLEAN | DEFAULT FALSE | Admin flag |

Indexes:

  • account_pkey (PRIMARY KEY on id)
  • account_apikey_key (UNIQUE on apikey)
  • account_mbuser_key (UNIQUE on mbuser)

application

API client applications.

| Column | Type | Constraints | Description |
| --- | --- | --- | --- |
| id | SERIAL | PRIMARY KEY | Application ID |
| name | VARCHAR(255) | NOT NULL | Application name |
| version | VARCHAR(255) | | Version string |
| apikey | VARCHAR(40) | UNIQUE, NOT NULL | API key (client key) |
| created | TIMESTAMP | NOT NULL | Creation timestamp |
| active | BOOLEAN | DEFAULT TRUE | Active status |
| account_id | INTEGER | FOREIGN KEY | Owner account |
| email | VARCHAR(255) | | Contact email |
| website | VARCHAR(1000) | | Website URL |
| rate_limit | INTEGER | | Custom rate limit (req/s) |

Indexes:

  • application_pkey (PRIMARY KEY on id)
  • application_apikey_key (UNIQUE on apikey)

account_openid

OpenID authentication links.

| Column | Type | Constraints | Description |
| --- | --- | --- | --- |
| openid | VARCHAR(255) | PRIMARY KEY | OpenID identifier |
| account_id | INTEGER | FOREIGN KEY | Linked account |

account_google

Google OAuth authentication links.

| Column | Type | Constraints | Description |
| --- | --- | --- | --- |
| google_user_id | VARCHAR(255) | PRIMARY KEY | Google user ID |
| account_id | INTEGER | FOREIGN KEY | Linked account |

Fingerprint Data (acoustid_fingerprint)

track

Unique audio tracks identified by fingerprints.

| Column | Type | Constraints | Description |
| --- | --- | --- | --- |
| id | SERIAL | PRIMARY KEY | Track ID |
| gid | UUID | UNIQUE, NOT NULL | Public track UUID |
| created | TIMESTAMP | NOT NULL | Creation timestamp |
| new_id | INTEGER | FOREIGN KEY | Merge target (if merged) |
| disabled | BOOLEAN | DEFAULT FALSE | Disabled flag |

Indexes:

  • track_pkey (PRIMARY KEY on id)
  • track_gid_key (UNIQUE on gid)
  • track_new_id_idx (on new_id)

Notes:

  • gid is the public-facing AcoustID track ID
  • new_id points to merged track (for deduplication)
  • Disabled tracks excluded from search results
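Because new_id points from a merged track to its target, a lookup that lands on an old track has to follow that pointer, possibly more than once. A minimal sketch of that resolution, assuming the `track` rows have been loaded into a plain dict (the function name and the dict shape are hypothetical):

```python
from typing import Dict, Optional


def resolve_track(track_id: int, new_id_by_track: Dict[int, Optional[int]]) -> int:
    """Follow new_id pointers until reaching a track that was never merged away."""
    seen = set()
    current = track_id
    while new_id_by_track.get(current) is not None:
        if current in seen:
            # A cycle should not exist, but guard against looping forever.
            raise ValueError(f"merge cycle detected at track {current}")
        seen.add(current)
        current = new_id_by_track[current]
    return current


if __name__ == "__main__":
    # Track 1 was merged into 2, which was later merged into 3.
    merges = {1: 2, 2: 3, 3: None}
    print(resolve_track(1, merges))
```

The cycle guard is a defensive choice for the sketch; the source does not say whether the schema prevents merge cycles.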

fingerprint

Audio fingerprints linked to tracks.

| Column | Type | Constraints | Description |
| --- | --- | --- | --- |
| id | SERIAL | PRIMARY KEY | Fingerprint ID |
| track_id | INTEGER | FOREIGN KEY | Linked track |
| fingerprint | INTEGER[] | NOT NULL | Chromaprint hash array |
| length | SMALLINT | NOT NULL | Duration in seconds |
| bitrate | SMALLINT | | Audio bitrate (kbps) |
| format_id | INTEGER | FOREIGN KEY | Audio format |
| created | TIMESTAMP | NOT NULL | Creation timestamp |
| submission_count | INTEGER | DEFAULT 1 | Number of submissions |

Indexes:

  • fingerprint_pkey (PRIMARY KEY on id)
  • fingerprint_track_id_idx (on track_id)
  • fingerprint_length_idx (on length)
  • fingerprint_fingerprint_idx (GIN on fingerprint using intarray)

Notes:

  • fingerprint is an array of 32-bit integers (Chromaprint hashes)
  • GIN index enables fast similarity search
  • submission_count tracks popularity
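Since a fingerprint is an array of 32-bit hashes, one common way to compare two of them is the fraction of differing bits between aligned hashes. The sketch below shows that idea only; it is an assumption for illustration and not the algorithm implemented by acoustid_compare, which the source does not describe.

```python
def bit_error_rate(fp_a, fp_b):
    """Fraction of differing bits across the overlapping prefix of two fingerprints."""
    n = min(len(fp_a), len(fp_b))
    if n == 0:
        raise ValueError("both fingerprints must contain at least one hash")
    # Mask to 32 bits so signed values read from an INTEGER[] column compare correctly.
    differing = sum(
        bin((a ^ b) & 0xFFFFFFFF).count("1") for a, b in zip(fp_a, fp_b)
    )
    return differing / (32 * n)


def similarity(fp_a, fp_b):
    """1.0 means identical over the compared prefix, 0.0 means every bit differs."""
    return 1.0 - bit_error_rate(fp_a, fp_b)


if __name__ == "__main__":
    print(similarity([0b1111, 7, 9], [0b0000, 7, 9]))
```

A real matcher would also try different alignments between the two arrays; this sketch compares position by position only.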

fingerprint_data

Extended fingerprint data with simhash.

| Column | Type | Constraints | Description |
| --- | --- | --- | --- |
| fingerprint_id | INTEGER | PRIMARY KEY, FOREIGN KEY | Fingerprint ID |
| fingerprint | BYTEA | NOT NULL | Raw fingerprint data |
| simhash | CUBE | | Locality-sensitive hash |

Indexes:

  • fingerprint_data_pkey (PRIMARY KEY on fingerprint_id)
  • fingerprint_data_simhash_idx (GIST on simhash)

Notes:

  • fingerprint stores compressed Chromaprint data
  • simhash enables approximate nearest neighbor search
  • GIST index for fast similarity queries
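To make the idea of a locality-sensitive hash concrete, here is the textbook simhash construction applied to a list of 32-bit hashes. This is a sketch of the general technique under the assumption of a 32-bit output; the source does not say how AcoustID derives the value it stores in the simhash column.

```python
def simhash32(hashes):
    """Textbook simhash: each output bit is the majority vote of that bit across inputs."""
    votes = [0] * 32
    for h in hashes:
        h &= 0xFFFFFFFF  # treat signed INTEGER values as unsigned 32-bit
        for bit in range(32):
            votes[bit] += 1 if (h >> bit) & 1 else -1
    result = 0
    for bit, vote in enumerate(votes):
        if vote > 0:
            result |= 1 << bit
    return result


def hamming_distance(a, b):
    """Number of bit positions in which two simhash values differ."""
    return bin((a ^ b) & 0xFFFFFFFF).count("1")


if __name__ == "__main__":
    a = simhash32([0b1010, 0b1010, 0b0110])
    b = simhash32([0b1010, 0b1110, 0b0110])
    print(bin(a), bin(b), hamming_distance(a, b))
```

Similar inputs tend to produce simhash values with a small Hamming distance, which is what makes an approximate nearest-neighbor index over them useful.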

track_mbid

Links tracks to MusicBrainz recordings.

| Column | Type | Constraints | Description |
| --- | --- | --- | --- |
| id | SERIAL | PRIMARY KEY | Link ID |
| track_id | INTEGER | FOREIGN KEY | AcoustID track |
| mbid | UUID | NOT NULL | MusicBrainz recording MBID |
| created | TIMESTAMP | NOT NULL | Creation timestamp |
| submission_count | INTEGER | DEFAULT 1 | Number of submissions |
| disabled | BOOLEAN | DEFAULT FALSE | Disabled flag |

Indexes:

  • track_mbid_pkey (PRIMARY KEY on id)
  • track_mbid_track_id_mbid_key (UNIQUE on track_id, mbid)
  • track_mbid_mbid_idx (on mbid)

Notes:

  • Multiple MBIDs per track possible (different recordings)
  • submission_count indicates confidence
  • Disabled links excluded from results
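The notes above imply a simple selection rule for lookup results: drop disabled links and prefer MBIDs with more submissions. A minimal sketch, assuming the rows are available as small records (the `TrackMbid` dataclass and `rank_mbids` are hypothetical names):

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TrackMbid:
    mbid: str
    submission_count: int
    disabled: bool = False


def rank_mbids(links: List[TrackMbid]) -> List[str]:
    """Return active MBIDs ordered from most to least submitted."""
    active = [link for link in links if not link.disabled]
    active.sort(key=lambda link: link.submission_count, reverse=True)
    return [link.mbid for link in active]


if __name__ == "__main__":
    links = [
        TrackMbid("mbid-a", 2),
        TrackMbid("mbid-b", 15),
        TrackMbid("mbid-c", 40, disabled=True),
    ]
    print(rank_mbids(links))
```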

meta

User-submitted metadata.

| Column | Type | Constraints | Description |
| --- | --- | --- | --- |
| id | SERIAL | PRIMARY KEY | Metadata ID |
| track | VARCHAR(255) | | Track title |
| artist | VARCHAR(255) | | Artist name |
| album | VARCHAR(255) | | Album title |
| album_artist | VARCHAR(255) | | Album artist |
| track_no | INTEGER | | Track number |
| disc_no | INTEGER | | Disc number |
| year | INTEGER | | Release year |

Indexes:

  • meta_pkey (PRIMARY KEY on id)

track_meta

Links tracks to user metadata.

| Column | Type | Constraints | Description |
| --- | --- | --- | --- |
| id | SERIAL | PRIMARY KEY | Link ID |
| track_id | INTEGER | FOREIGN KEY | AcoustID track |
| meta_id | INTEGER | FOREIGN KEY | Metadata record |
| created | TIMESTAMP | NOT NULL | Creation timestamp |
| submission_count | INTEGER | DEFAULT 1 | Number of submissions |

Indexes:

  • track_meta_pkey (PRIMARY KEY on id)
  • track_meta_track_id_meta_id_key (UNIQUE on track_id, meta_id)

format

Audio file formats.

| Column | Type | Constraints | Description |
| --- | --- | --- | --- |
| id | SERIAL | PRIMARY KEY | Format ID |
| name | VARCHAR(20) | UNIQUE, NOT NULL | Format name (mp3, flac, etc.) |

Indexes:

  • format_pkey (PRIMARY KEY on id)
  • format_name_key (UNIQUE on name)

Common Values:

  • mp3, flac, ogg, m4a, wma, ape, wav

source

Submission sources (applications).

| Column | Type | Constraints | Description |
| --- | --- | --- | --- |
| id | SERIAL | PRIMARY KEY | Source ID |
| application_id | INTEGER | FOREIGN KEY | Application |
| account_id | INTEGER | FOREIGN KEY | User account |
| version | VARCHAR(255) | | Application version |

Indexes:

  • source_pkey (PRIMARY KEY on id)
  • source_application_id_account_id_version_key (UNIQUE on application_id, account_id, version)

Foreign IDs (acoustid_fingerprint)

foreignid_vendor

External ID providers.

| Column | Type | Constraints | Description |
| --- | --- | --- | --- |
| id | SERIAL | PRIMARY KEY | Vendor ID |
| name | VARCHAR(255) | UNIQUE, NOT NULL | Vendor name |

Indexes:

  • foreignid_vendor_pkey (PRIMARY KEY on id)
  • foreignid_vendor_name_key (UNIQUE on name)

Common Values:

  • musicbrainz, musicip, discogs, spotify

foreignid

External identifiers.

| Column | Type | Constraints | Description |
| --- | --- | --- | --- |
| id | SERIAL | PRIMARY KEY | Foreign ID |
| vendor_id | INTEGER | FOREIGN KEY | Vendor |
| name | VARCHAR(255) | NOT NULL | External ID value |

Indexes:

  • foreignid_pkey (PRIMARY KEY on id)
  • foreignid_vendor_id_name_key (UNIQUE on vendor_id, name)

track_foreignid

Links tracks to external IDs.

| Column | Type | Constraints | Description |
| --- | --- | --- | --- |
| id | SERIAL | PRIMARY KEY | Link ID |
| track_id | INTEGER | FOREIGN KEY | AcoustID track |
| foreignid_id | INTEGER | FOREIGN KEY | External ID |
| created | TIMESTAMP | NOT NULL | Creation timestamp |
| submission_count | INTEGER | DEFAULT 1 | Number of submissions |

Indexes:

  • track_foreignid_pkey (PRIMARY KEY on id)
  • track_foreignid_track_id_foreignid_id_key (UNIQUE on track_id, foreignid_id)

track_puid

Legacy MusicIP PUID links.

| Column | Type | Constraints | Description |
| --- | --- | --- | --- |
| id | SERIAL | PRIMARY KEY | Link ID |
| track_id | INTEGER | FOREIGN KEY | AcoustID track |
| puid | UUID | NOT NULL | MusicIP PUID |
| created | TIMESTAMP | NOT NULL | Creation timestamp |
| submission_count | INTEGER | DEFAULT 1 | Number of submissions |

Indexes:

  • track_puid_pkey (PRIMARY KEY on id)
  • track_puid_track_id_puid_key (UNIQUE on track_id, puid)
  • track_puid_puid_idx (on puid)

Statistics (acoustid_app)

stats

General statistics.

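| Column | Type | Constraints | Description |
| --- | --- | --- | --- |
| id | SERIAL | PRIMARY KEY | Stat ID |
| name | VARCHAR(255) | NOT NULL | Stat name (unique per date, see stats_name_date_key) |
| value | INTEGER | NOT NULL | Stat value |
| date | DATE | NOT NULL | Stat date |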

Indexes:

  • stats_pkey (PRIMARY KEY on id)
  • stats_name_date_key (UNIQUE on name, date)

Common Stats:

  • lookup.count, submission.count, track.count, fingerprint.count

stats_lookups

Lookup statistics by hour.

| Column | Type | Constraints | Description |
| --- | --- | --- | --- |
| id | SERIAL | PRIMARY KEY | Stat ID |
| hour | TIMESTAMP | NOT NULL | Hour timestamp |
| application_id | INTEGER | FOREIGN KEY | Application |
| count_hits | INTEGER | DEFAULT 0 | Successful lookups |
| count_misses | INTEGER | DEFAULT 0 | Failed lookups |

Indexes:

  • stats_lookups_pkey (PRIMARY KEY on id)
  • stats_lookups_hour_application_id_key (UNIQUE on hour, application_id)

stats_user_agents

User agent statistics.

| Column | Type | Constraints | Description |
| --- | --- | --- | --- |
| id | SERIAL | PRIMARY KEY | Stat ID |
| date | DATE | NOT NULL | Date |
| application_id | INTEGER | FOREIGN KEY | Application |
| user_agent | VARCHAR(1000) | NOT NULL | User agent string |
| ip | INET | NOT NULL | IP address |
| count | INTEGER | DEFAULT 0 | Request count |

Indexes:

  • stats_user_agents_pkey (PRIMARY KEY on id)
  • stats_user_agents_date_application_id_user_agent_ip_key (UNIQUE on date, application_id, user_agent, ip)

stats_top_accounts

Top submitter accounts.

| Column | Type | Constraints | Description |
| --- | --- | --- | --- |
| id | SERIAL | PRIMARY KEY | Stat ID |
| account_id | INTEGER | FOREIGN KEY | Account |
| count | INTEGER | NOT NULL | Submission count |

Indexes:

  • stats_top_accounts_pkey (PRIMARY KEY on id)
  • stats_top_accounts_account_id_key (UNIQUE on account_id)

Submission Processing (acoustid_ingest)

submission

Pending fingerprint submissions.

| Column | Type | Constraints | Description |
| --- | --- | --- | --- |
| id | SERIAL | PRIMARY KEY | Submission ID |
| fingerprint | INTEGER[] | NOT NULL | Chromaprint hash array |
| length | SMALLINT | NOT NULL | Duration in seconds |
| bitrate | SMALLINT | | Audio bitrate |
| format_id | INTEGER | | Audio format |
| created | TIMESTAMP | NOT NULL | Submission timestamp |
| source_id | INTEGER | FOREIGN KEY | Submission source |
| mbid | UUID | | MusicBrainz MBID (if provided) |
| handled | BOOLEAN | DEFAULT FALSE | Processing status |
| meta_id | INTEGER | FOREIGN KEY | User metadata |

Indexes:

  • submission_pkey (PRIMARY KEY on id)
  • submission_handled_idx (on handled WHERE handled = FALSE)

Notes:

  • Worker processes unhandled submissions
  • handled = TRUE after processing

submission_result

Processing results for submissions.

| Column | Type | Constraints | Description |
| --- | --- | --- | --- |
| id | SERIAL | PRIMARY KEY | Result ID |
| submission_id | INTEGER | FOREIGN KEY | Submission |
| track_id | INTEGER | FOREIGN KEY | Matched/created track |
| created | TIMESTAMP | NOT NULL | Processing timestamp |

Indexes:

  • submission_result_pkey (PRIMARY KEY on id)
  • submission_result_submission_id_key (UNIQUE on submission_id)

pending_submission

Queue for async submission processing.

| Column | Type | Constraints | Description |
| --- | --- | --- | --- |
| id | SERIAL | PRIMARY KEY | Queue ID |
| submission_id | INTEGER | FOREIGN KEY | Submission |
| created | TIMESTAMP | NOT NULL | Queue timestamp |

Indexes:

  • pending_submission_pkey (PRIMARY KEY on id)
  • pending_submission_submission_id_key (UNIQUE on submission_id)

Notes:

  • Replaced by NATS queue in newer deployments
  • Legacy table, may be deprecated

Provenance Tables (acoustid_fingerprint)

Track data lineage and changes.

fingerprint_source

Links fingerprints to submission sources.

| Column | Type | Constraints | Description |
| --- | --- | --- | --- |
| id | SERIAL | PRIMARY KEY | Link ID |
| fingerprint_id | INTEGER | FOREIGN KEY | Fingerprint |
| source_id | INTEGER | FOREIGN KEY | Source |
| created | TIMESTAMP | NOT NULL | Creation timestamp |

track_mbid_source

Links track-MBID associations to sources.

| Column | Type | Constraints | Description |
| --- | --- | --- | --- |
| id | SERIAL | PRIMARY KEY | Link ID |
| track_mbid_id | INTEGER | FOREIGN KEY | Track-MBID link |
| source_id | INTEGER | FOREIGN KEY | Source |
| created | TIMESTAMP | NOT NULL | Creation timestamp |

track_mbid_change

Audit log for track-MBID changes.

| Column | Type | Constraints | Description |
| --- | --- | --- | --- |
| id | SERIAL | PRIMARY KEY | Change ID |
| track_mbid_id | INTEGER | FOREIGN KEY | Track-MBID link |
| account_id | INTEGER | FOREIGN KEY | Account that made change |
| disabled | BOOLEAN | NOT NULL | New disabled status |
| created | TIMESTAMP | NOT NULL | Change timestamp |
| note | TEXT | | Change reason |

ORM Layer (SQLAlchemy)

Multi-Database Configuration

File: acoustid/db.py

# Database bind keys
BIND_KEYS = {
    'app': 'acoustid_app',
    'fingerprint': 'acoustid_fingerprint',
    'ingest': 'acoustid_ingest',
    'musicbrainz': 'musicbrainz'
}

Model Binding:

class Account(Base):
    __bind_key__ = 'app'
    __tablename__ = 'account'
    # ...

class Track(Base):
    __bind_key__ = 'fingerprint'
    __tablename__ = 'track'
    # ...

Connection Pooling

Configuration (acoustid.conf):

[database]
name = acoustid_app
user = acoustid
password_file = /run/secrets/db_password
host = postgres
port = 5432
pool_size = 20
pool_recycle = 3600

Pool Settings:

  • pool_size: Number of connections kept open in the pool per process (SQLAlchemy can open extra connections beyond this, up to max_overflow)
  • pool_recycle: Recycle connections after N seconds
  • pool_pre_ping: Test connections before use
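As an illustration of how the [database] section above could be turned into engine settings, the sketch below parses it with the standard library only. The function name `load_pool_settings` is hypothetical, the password is deliberately left out of the URL (the config refers to a password_file), and the fallback values mirror SQLAlchemy's documented defaults rather than anything stated in the source.

```python
import configparser

SAMPLE_CONF = """
[database]
name = acoustid_app
user = acoustid
host = postgres
port = 5432
pool_size = 20
pool_recycle = 3600
"""


def load_pool_settings(text):
    """Return (url, engine_kwargs) derived from an acoustid.conf-style INI string."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    section = parser["database"]
    url = "postgresql://{user}@{host}:{port}/{name}".format(
        user=section["user"],
        host=section["host"],
        port=section.getint("port"),
        name=section["name"],
    )
    engine_kwargs = {
        "pool_size": section.getint("pool_size", fallback=5),
        "pool_recycle": section.getint("pool_recycle", fallback=-1),
        "pool_pre_ping": section.getboolean("pool_pre_ping", fallback=False),
    }
    return url, engine_kwargs


if __name__ == "__main__":
    url, kwargs = load_pool_settings(SAMPLE_CONF)
    print(url, kwargs)
```

The returned pair could then be passed along as `sqlalchemy.create_engine(url, **engine_kwargs)`, since pool_size, pool_recycle and pool_pre_ping are all keyword arguments of that function.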

Query Patterns

Fingerprint Search (legacy, pre-index):

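from sqlalchemy import func

# Find candidate fingerprints that share at least one hash with the query
# (intarray overlap) and whose duration is within 5 seconds of it,
# then rank the candidates by acoustid_compare score
query = db.session.query(Fingerprint).filter(
    Fingerprint.fingerprint.op('&&')(query_fingerprint),
    Fingerprint.length.between(duration - 5, duration + 5)
).order_by(
    func.acoustid_compare(Fingerprint.fingerprint, query_fingerprint).desc()
).limit(10)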

Track Lookup with MBIDs:

from sqlalchemy.orm import joinedload

# Fetch a track by its public UUID and eagerly load all linked MBIDs
track = db.session.query(Track).options(
    joinedload(Track.mbids)
).filter(Track.gid == track_gid).first()

Submission Processing:

# Find the 100 oldest unhandled submissions
submissions = db.session.query(Submission).filter(
    Submission.handled.is_(False)
).order_by(Submission.created).limit(100).all()

Database Migrations

Alembic Configuration

File: alembic.ini

Migration Directories:

  • alembic/versions/app/: acoustid_app migrations
  • alembic/versions/fingerprint/: acoustid_fingerprint migrations
  • alembic/versions/ingest/: acoustid_ingest migrations

Multi-Database Support:

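# alembic/env.py
from alembic import context

# get_engine() and get_metadata() are assumed to be project helpers that
# return the SQLAlchemy engine and MetaData for a given bind key.
def run_migrations_online():
    # Run the migrations for each database in turn
    for bind_key in ['app', 'fingerprint', 'ingest']:
        engine = get_engine(bind_key)
        with engine.connect() as connection:
            context.configure(
                connection=connection,
                target_metadata=get_metadata(bind_key)
            )
            with context.begin_transaction():
                context.run_migrations()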

Migration Commands

# Create new migration
alembic revision --autogenerate -m "Add new column"

# Apply migrations
alembic upgrade head

# Rollback migration
alembic downgrade -1

# Show current version
alembic current

# Show migration history
alembic history

Redis Data Structures

Rate Limiting

Key Pattern: rl:bucket:{scope}:{identifier}:{timestamp} (the global scope has no identifier, so its keys are rl:bucket:global:{timestamp})

Example Keys:

rl:bucket:global:1714305600
rl:bucket:app:8XaBELgH:1714305600
rl:bucket:ip:192.168.1.1:1714305600

Value: Integer (request count)
TTL: 25 seconds (window duration + buffer)

Algorithm:

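# Increment the bucket for the current window and refresh its TTL
bucket_key = f"rl:bucket:{scope}:{identifier}:{current_window}"
count = redis.incr(bucket_key)
redis.expire(bucket_key, 25)

# Sum counts across all buckets in the sliding window.
# Missing or expired buckets come back as None and count as 0;
# present values are strings/bytes and must be converted to int.
keys = [f"rl:bucket:{scope}:{identifier}:{w}" for w in windows]
total = sum(int(v) for v in redis.mget(keys) if v is not None)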
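The bucketed sliding-window scheme can be sketched without a Redis server by using a dict in place of the keys. The class name, the 10-second window and the 1-second bucket size are assumptions chosen for the example; the source only states the key pattern and the 25-second TTL.

```python
import time
from collections import defaultdict


class SlidingWindowLimiter:
    """In-memory stand-in for the Redis buckets (no TTL expiry is simulated)."""

    def __init__(self, limit, window_seconds=10, bucket_seconds=1):
        self.limit = limit
        self.window_seconds = window_seconds
        self.bucket_seconds = bucket_seconds
        self.buckets = defaultdict(int)

    def _key(self, scope, identifier, bucket_start):
        return f"rl:bucket:{scope}:{identifier}:{bucket_start}"

    def allow(self, scope, identifier, now=None):
        """Count this request, then report whether the window total is within the limit."""
        now = time.time() if now is None else now
        current = int(now) - int(now) % self.bucket_seconds
        self.buckets[self._key(scope, identifier, current)] += 1  # like INCR
        first = current - self.window_seconds + self.bucket_seconds
        starts = range(first, current + 1, self.bucket_seconds)
        total = sum(
            self.buckets.get(self._key(scope, identifier, s), 0) for s in starts
        )
        return total <= self.limit


if __name__ == "__main__":
    limiter = SlidingWindowLimiter(limit=3)
    print([limiter.allow("app", "8XaBELgH", now=100) for _ in range(4)])
```

Counting the request before checking the limit matches the INCR-then-sum order of the original snippet, which means rejected requests still consume budget.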

Task Queue (Legacy)

Key Pattern: queue:{queue_name}

Operations:

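import json

# Push task (producer side)
redis.rpush('queue:submissions', json.dumps(task_data))

# Pop task (consumer side); LPOP returns None when the queue is empty
raw = redis.lpop('queue:submissions')
task_data = json.loads(raw) if raw is not None else None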

Note: Being replaced by NATS in newer deployments

API Key Cache

Implementation: In-memory TTLCache (not Redis)

from cachetools import TTLCache

api_key_cache = TTLCache(maxsize=1000, ttl=60)

Purpose: Reduce database queries for API key validation
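The get-or-load pattern behind such a cache can be sketched with the standard library alone. `TinyTTLCache` is a hypothetical stand-in written for this example; unlike cachetools' TTLCache it has no maxsize eviction, and the loader stands in for the database query.

```python
import time


class TinyTTLCache:
    """Minimal TTL cache: entries are reloaded once they are older than ttl seconds."""

    def __init__(self, ttl, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock
        self._items = {}  # key -> (expires_at, value)

    def get_or_load(self, key, loader):
        now = self.clock()
        hit = self._items.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]  # still fresh, skip the loader
        value = loader(key)
        self._items[key] = (now + self.ttl, value)
        return value


if __name__ == "__main__":
    def lookup_in_database(api_key):
        # Placeholder for the real query against the application table.
        print("database hit for", api_key)
        return {"apikey": api_key, "active": True}

    cache = TinyTTLCache(ttl=60)
    cache.get_or_load("8XaBELgH", lookup_in_database)
    cache.get_or_load("8XaBELgH", lookup_in_database)  # served from the cache
```

With a 60-second TTL, a revoked key can keep working for up to a minute, which is the trade-off for fewer database queries.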

Backfill State

Key Pattern: backfill:{index_name}:{state_key}

Example Keys:

backfill:fingerprints:last_id
backfill:fingerprints:batch_size
backfill:fingerprints:completed

Purpose: Track progress of index backfill operations
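A resumable backfill loop built on these keys might look like the sketch below. A plain dict stands in for Redis, `rows` stands in for fingerprint IDs read from the database, and the function name and the `max_batches` parameter are hypothetical additions used to simulate an interruption.

```python
def run_backfill(state, rows, index_name="fingerprints", batch_size=2, max_batches=None):
    """Process IDs in batches, checkpointing the last processed ID after each batch."""
    last_id_key = f"backfill:{index_name}:last_id"
    completed_key = f"backfill:{index_name}:completed"
    ordered = sorted(rows)
    processed = []
    batches = 0
    while max_batches is None or batches < max_batches:
        last_id = int(state.get(last_id_key, 0))
        batch = [row_id for row_id in ordered if row_id > last_id][:batch_size]
        if not batch:
            state[completed_key] = "1"  # nothing left to index
            break
        processed.extend(batch)  # a real worker would send these to the index
        state[last_id_key] = str(batch[-1])  # checkpoint so a restart resumes here
        batches += 1
    return processed


if __name__ == "__main__":
    state = {}
    print(run_backfill(state, [1, 2, 3, 4, 5], max_batches=1))  # interrupted run
    print(run_backfill(state, [1, 2, 3, 4, 5]))  # resumes after the checkpoint
    print(state)
```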

Unknown MBID Cache

Key Pattern: unknown_mbid:{mbid}

Value: 1 (flag meaning the MBID was not found in MusicBrainz)
TTL: 3600 seconds (1 hour)

Purpose: Avoid repeated MusicBrainz queries for non-existent MBIDs

Data Integrity

Constraints

Foreign Keys:

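  • Foreign keys within a database use ON DELETE CASCADE or ON DELETE SET NULL, so orphaned records there are cleaned up automatically
  • PostgreSQL cannot enforce foreign keys across databases, so cross-database references (for example source.application_id, which points into acoustid_app) are not covered by these rules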

Unique Constraints:

  • Prevent duplicate fingerprints per track
  • Prevent duplicate MBID links per track
  • Ensure API key uniqueness

Check Constraints:

  • Duration must be positive
  • Bitrate must be positive
  • Submission count must be non-negative

Triggers

Update Submission Count:

CREATE TRIGGER update_fingerprint_submission_count
AFTER INSERT ON fingerprint_source
FOR EACH ROW
EXECUTE FUNCTION increment_submission_count();

Track Merge Propagation:

CREATE TRIGGER propagate_track_merge
AFTER UPDATE OF new_id ON track
FOR EACH ROW
EXECUTE FUNCTION update_merged_track_references();

Indexes for Performance

Covering Indexes:

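-- Lookup by fingerprint and duration.
-- Caveat: INCLUDE (fingerprint) stores the whole array in each B-tree entry,
-- and PostgreSQL rejects index rows larger than about a third of a page,
-- so long fingerprints may not fit.
CREATE INDEX fingerprint_lookup_idx 
ON fingerprint (length, track_id) 
INCLUDE (fingerprint);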

Partial Indexes:

-- Only index unhandled submissions
CREATE INDEX submission_unhandled_idx 
ON submission (created) 
WHERE handled = FALSE;

GIN Indexes:

-- Fast fingerprint array queries
CREATE INDEX fingerprint_fingerprint_idx 
ON fingerprint USING GIN (fingerprint gin__int_ops);

Data Lifecycle

Fingerprint Submission

  1. Insert into submission table (acoustid_ingest)
  2. Publish to NATS queue
  3. Worker processes submission
  4. Insert into fingerprint table (acoustid_fingerprint)
  5. Link to track (create or match)
  6. Insert into fingerprint_source (provenance)
  7. Update index via HTTP API
  8. Insert into submission_result
  9. Mark submission.handled = TRUE
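The worker side of the steps above (3 through 9) can be sketched with in-memory structures. Every name here is hypothetical, the matching function is injected because the source does not describe how a track is matched, and the index update of step 7 is only marked with a comment.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Submission:
    id: int
    fingerprint: List[int]
    length: int
    source_id: int
    handled: bool = False


@dataclass
class Store:
    """Stand-in for the fingerprint and ingest databases."""
    fingerprints: Dict[int, dict] = field(default_factory=dict)
    fingerprint_sources: List[Tuple[int, int]] = field(default_factory=list)
    submission_results: Dict[int, int] = field(default_factory=dict)
    next_fingerprint_id: int = 1
    next_track_id: int = 1


def process_submission(
    sub: Submission,
    store: Store,
    match_track: Callable[[Submission], Optional[int]],
) -> int:
    """Process one submission and return the track it ended up on."""
    if sub.handled:
        return store.submission_results[sub.id]  # already processed, do nothing
    track_id = match_track(sub)  # step 5: match an existing track...
    if track_id is None:
        track_id = store.next_track_id  # ...or create a new one
        store.next_track_id += 1
    fingerprint_id = store.next_fingerprint_id  # step 4: insert fingerprint
    store.next_fingerprint_id += 1
    store.fingerprints[fingerprint_id] = {
        "track_id": track_id,
        "fingerprint": sub.fingerprint,
        "length": sub.length,
    }
    store.fingerprint_sources.append((fingerprint_id, sub.source_id))  # step 6
    # step 7 (index update over HTTP) is omitted in this sketch
    store.submission_results[sub.id] = track_id  # step 8
    sub.handled = True  # step 9
    return track_id


if __name__ == "__main__":
    store = Store()
    sub = Submission(id=1, fingerprint=[1, 2, 3], length=180, source_id=7)
    print(process_submission(sub, store, match_track=lambda s: None))
```

Checking `handled` first makes the function safe to call again if a queue message is delivered twice.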

Track Merging

  1. Identify duplicate tracks (manual or automated)
  2. Set track.new_id to target track
  3. Trigger updates all references
  4. Merge fingerprints, MBIDs, metadata
  5. Disable old track (track.disabled = TRUE)
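The merge steps can be illustrated on plain dicts. This is a sketch under assumptions: the data shapes and the function name are hypothetical, and summing submission_count when both tracks already link the same MBID is one plausible way to respect the unique (track_id, mbid) constraint, not something the source specifies.

```python
def merge_tracks(tracks, fingerprint_track, mbid_counts, source_id, target_id):
    """Merge source track into target track, updating all three structures in place.

    tracks:            {track_id: {"new_id": int | None, "disabled": bool}}
    fingerprint_track: {fingerprint_id: track_id}
    mbid_counts:       {(track_id, mbid): submission_count}
    """
    if source_id == target_id:
        raise ValueError("cannot merge a track into itself")
    tracks[source_id]["new_id"] = target_id  # step 2
    for fingerprint_id, track_id in fingerprint_track.items():  # step 4: fingerprints
        if track_id == source_id:
            fingerprint_track[fingerprint_id] = target_id
    for track_id, mbid in list(mbid_counts):  # step 4: MBID links
        if track_id == source_id:
            count = mbid_counts.pop((track_id, mbid))
            key = (target_id, mbid)
            mbid_counts[key] = mbid_counts.get(key, 0) + count
    tracks[source_id]["disabled"] = True  # step 5


if __name__ == "__main__":
    tracks = {1: {"new_id": None, "disabled": False},
              2: {"new_id": None, "disabled": False}}
    fingerprint_track = {10: 1, 11: 2}
    mbid_counts = {(1, "a"): 2, (2, "a"): 3, (1, "b"): 1}
    merge_tracks(tracks, fingerprint_track, mbid_counts, source_id=1, target_id=2)
    print(tracks, fingerprint_track, mbid_counts)
```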

Data Cleanup

Cron Jobs:

  • Delete old handled submissions (>30 days)
  • Clean up orphaned metadata records
  • Remove disabled tracks with no references
  • Archive old statistics

Performance Optimization

Query Optimization

Materialized Views:

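CREATE MATERIALIZED VIEW track_stats AS
SELECT
    f.track_id,
    f.fingerprint_count,
    COALESCE(m.mbid_count, 0) AS mbid_count,
    f.total_submissions
FROM (
    -- Aggregate fingerprints first so the join cannot inflate the sums
    SELECT track_id,
           COUNT(*) AS fingerprint_count,
           SUM(submission_count) AS total_submissions
    FROM fingerprint
    GROUP BY track_id
) AS f
LEFT JOIN (
    SELECT track_id, COUNT(DISTINCT mbid) AS mbid_count
    FROM track_mbid
    GROUP BY track_id
) AS m USING (track_id);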

Partitioning (future):

-- Partition submissions by month (assumes submission is declared
-- with PARTITION BY RANGE (created))
CREATE TABLE submission_2025_04 PARTITION OF submission
FOR VALUES FROM ('2025-04-01') TO ('2025-05-01');
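Monthly partitions have to be created ahead of time, so the statement above is usually generated rather than written by hand. A small sketch of such a generator follows; the function name is hypothetical, and it only builds the DDL string without executing it.

```python
from datetime import date


def month_partition_ddl(parent, year, month):
    """Build the CREATE TABLE ... PARTITION OF statement for one calendar month."""
    start = date(year, month, 1)
    # The upper bound is exclusive: the first day of the following month.
    end = date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)
    name = f"{parent}_{start:%Y_%m}"
    return (
        f"CREATE TABLE {name} PARTITION OF {parent}\n"
        f"FOR VALUES FROM ('{start.isoformat()}') TO ('{end.isoformat()}');"
    )


if __name__ == "__main__":
    print(month_partition_ddl("submission", 2025, 4))
```

The December case rolls the upper bound over to January of the next year, which is the usual off-by-one trap in hand-written partition scripts.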

Caching Strategy

Application-Level:

  • API key validation (TTLCache, 60s)
  • Format ID lookup (permanent cache)
  • MusicBrainz MBID existence (Redis, 1h)

Database-Level:

  • Shared buffers (PostgreSQL config)
  • Connection pooling (SQLAlchemy)
  • Query statistics for tuning (pg_stat_statements; it records per-statement timing and call counts and does not cache query results)

Bulk Operations

Batch Inserts:

# Insert multiple fingerprints efficiently
db.session.bulk_insert_mappings(Fingerprint, fingerprint_dicts)
db.session.commit()

Bulk Updates:

from sqlalchemy import update

# Increment submission counts for a batch of fingerprints in one statement
db.session.execute(
    update(Fingerprint).where(
        Fingerprint.id.in_(fingerprint_ids)
    ).values(
        submission_count=Fingerprint.submission_count + 1
    )
)
db.session.commit()

Backup and Recovery

Backup Strategy

PostgreSQL:

  • Daily full backups (pg_dump)
  • Continuous WAL archiving
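  • Point-in-time recovery enabled (replays archived WAL on top of a physical base backup, for example one taken with pg_basebackup; a pg_dump file cannot serve as that base)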

Index:

  • Daily snapshots via /:index/_snapshot
  • Incremental backups of Oplog
  • Segment files backed up separately

Disaster Recovery

Database Restore:

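# Restore from dump
pg_restore -d acoustid_app acoustid_app_backup.dump

# Point-in-time recovery is not a pg_restore feature. Restore a physical
# base backup into the data directory, then set in postgresql.conf
# (the archive path below is a placeholder):
#   restore_command = 'cp /path/to/wal_archive/%f %p'
#   recovery_target_time = '2025-04-28 12:00:00'
# Finally create recovery.signal in the data directory and start PostgreSQL.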

Index Rebuild:

# Rebuild from database
python manage.py run import --rebuild-index