feat: initial implementation of metadata aggregator

- gRPC service with MusicBrainz provider
- PostgreSQL schema with migrations
- Service layer with database-first caching
- Repository pattern for data access
- YAML configuration support
- Research documentation for 17 music metadata projects
This commit is contained in:
Alexander
2026-04-28 16:27:14 +02:00
commit a1f6701bac
163 changed files with 95884 additions and 0 deletions
@@ -0,0 +1,871 @@
# AcoustID Data Model
## Database Architecture
AcoustID uses a multi-database PostgreSQL architecture with separate databases for different concerns.
### Database Instances
| Database | Purpose | Tables | Extensions |
|----------|---------|--------|------------|
| `acoustid_app` | Application data (accounts, apps, stats) | 8 | pgcrypto |
| `acoustid_fingerprint` | Fingerprint and track data | 19 | intarray, acoustid, cube |
| `acoustid_ingest` | Submission processing | 3 | - |
| `musicbrainz` | MusicBrainz mirror (read-only) | Many | - |
### PostgreSQL Extensions
**intarray**: Integer array operations
- Used for fingerprint array queries
- Provides `&&` (overlap) and `@>` (contains) operators
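To make the two operators concrete, the sketch below mirrors their semantics on plain Python lists. The function names `overlaps` and `contains` are illustrative and not part of intarray; the sketch models only the set-style behaviour, ignoring duplicates and element order.

```python
def overlaps(a: list[int], b: list[int]) -> bool:
    """Model of `a && b`: true if the arrays share at least one element."""
    return not set(a).isdisjoint(b)


def contains(a: list[int], b: list[int]) -> bool:
    """Model of `a @> b`: true if every element of b also appears in a."""
    return set(b).issubset(a)


stored = [101, 205, 309, 412]   # hypothetical stored fingerprint terms
query = [205, 999]              # hypothetical query terms

print(overlaps(stored, query))        # one shared element is enough
print(contains(stored, query))        # 999 is missing, so not contained
print(contains(stored, [101, 412]))   # both elements are present
```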
**pgcrypto**: Cryptographic functions
- UUID generation (`gen_random_uuid()`)
- API key hashing
**acoustid** (custom): Fingerprint similarity functions
- `acoustid_compare(int[], int[])`: Compare two fingerprints
- `acoustid_extract_query(int[])`: Extract query terms
- Source: `acoustid-ext` C extension
**cube**: Multi-dimensional cube data type
- Used for simhash-based fingerprint indexing
- Enables fast approximate nearest neighbor search
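The exact simhash computation used by AcoustID is not described here. As a rough illustration of the general technique, the sketch below builds a 32-bit simhash by per-bit majority vote over the fingerprint's hash values and compares two simhashes by Hamming distance. The function names are hypothetical, and how the result is mapped onto a `cube` value is outside the scope of this sketch.

```python
def simhash32(hashes: list[int]) -> int:
    """Generic simhash: each output bit is the majority vote of that bit position."""
    counts = [0] * 32
    for h in hashes:
        h &= 0xFFFFFFFF  # PostgreSQL INTEGER is signed; treat values as unsigned
        for bit in range(32):
            counts[bit] += 1 if (h >> bit) & 1 else -1
    result = 0
    for bit, vote in enumerate(counts):
        if vote > 0:
            result |= 1 << bit
    return result


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two simhashes differ."""
    return bin(a ^ b).count("1")


# Bits 0, 2 and 3 are set in a majority of the inputs; bit 1 is not.
print(bin(simhash32([0b1111, 0b1101, 0b0101])))
```

Similar fingerprints tend to produce simhashes with a small Hamming distance, which is what makes an approximate nearest neighbor index over them useful as a pre-filter.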
## Core Tables
### Account Management (acoustid_app)
#### `account`
User accounts for API access.
| Column | Type | Constraints | Description |
|--------|------|-------------|-------------|
| `id` | SERIAL | PRIMARY KEY | Account ID |
| `name` | VARCHAR(255) | NOT NULL | Display name |
| `apikey` | VARCHAR(40) | UNIQUE, NOT NULL | API key (user key) |
| `mbuser` | VARCHAR(64) | UNIQUE | MusicBrainz username |
| `created` | TIMESTAMP | NOT NULL | Creation timestamp |
| `lastlogin` | TIMESTAMP | | Last login timestamp |
| `submission_count` | INTEGER | DEFAULT 0 | Total submissions |
| `application_id` | INTEGER | FOREIGN KEY | Default application |
| `application_version` | VARCHAR(255) | | Application version |
| `created_from` | INET | | Registration IP |
| `is_admin` | BOOLEAN | DEFAULT FALSE | Admin flag |
**Indexes**:
- `account_pkey` (PRIMARY KEY on `id`)
- `account_apikey_key` (UNIQUE on `apikey`)
- `account_mbuser_key` (UNIQUE on `mbuser`)
#### `application`
API client applications.
| Column | Type | Constraints | Description |
|--------|------|-------------|-------------|
| `id` | SERIAL | PRIMARY KEY | Application ID |
| `name` | VARCHAR(255) | NOT NULL | Application name |
| `version` | VARCHAR(255) | | Version string |
| `apikey` | VARCHAR(40) | UNIQUE, NOT NULL | API key (client key) |
| `created` | TIMESTAMP | NOT NULL | Creation timestamp |
| `active` | BOOLEAN | DEFAULT TRUE | Active status |
| `account_id` | INTEGER | FOREIGN KEY | Owner account |
| `email` | VARCHAR(255) | | Contact email |
| `website` | VARCHAR(1000) | | Website URL |
| `rate_limit` | INTEGER | | Custom rate limit (req/s) |
**Indexes**:
- `application_pkey` (PRIMARY KEY on `id`)
- `application_apikey_key` (UNIQUE on `apikey`)
#### `account_openid`
OpenID authentication links.
| Column | Type | Constraints | Description |
|--------|------|-------------|-------------|
| `openid` | VARCHAR(255) | PRIMARY KEY | OpenID identifier |
| `account_id` | INTEGER | FOREIGN KEY | Linked account |
#### `account_google`
Google OAuth authentication links.
| Column | Type | Constraints | Description |
|--------|------|-------------|-------------|
| `google_user_id` | VARCHAR(255) | PRIMARY KEY | Google user ID |
| `account_id` | INTEGER | FOREIGN KEY | Linked account |
### Fingerprint Data (acoustid_fingerprint)
#### `track`
Unique audio tracks identified by fingerprints.
| Column | Type | Constraints | Description |
|--------|------|-------------|-------------|
| `id` | SERIAL | PRIMARY KEY | Track ID |
| `gid` | UUID | UNIQUE, NOT NULL | Public track UUID |
| `created` | TIMESTAMP | NOT NULL | Creation timestamp |
| `new_id` | INTEGER | FOREIGN KEY | Merge target (if merged) |
| `disabled` | BOOLEAN | DEFAULT FALSE | Disabled flag |
**Indexes**:
- `track_pkey` (PRIMARY KEY on `id`)
- `track_gid_key` (UNIQUE on `gid`)
- `track_new_id_idx` (on `new_id`)
**Notes**:
- `gid` is the public-facing AcoustID track ID
- `new_id` points to merged track (for deduplication)
- Disabled tracks excluded from search results
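A lookup that lands on a merged track has to follow `new_id` to the surviving track. The sketch below assumes merges can chain (A merged into B, B later merged into C) and uses a plain dict in place of the `track` table; `resolve_track_id` is a hypothetical helper, not a function from the AcoustID codebase.

```python
from typing import Dict, Optional


def resolve_track_id(new_id_by_track: Dict[int, Optional[int]], track_id: int) -> int:
    """Follow new_id pointers until reaching a track that has not been merged."""
    seen = set()
    current = track_id
    while new_id_by_track.get(current) is not None:
        if current in seen:
            # Defensive guard: a cycle would otherwise loop forever.
            raise ValueError(f"merge cycle detected at track {current}")
        seen.add(current)
        current = new_id_by_track[current]
    return current


# track 1 was merged into 2, which was later merged into 3
merges = {1: 2, 2: 3, 3: None}
print(resolve_track_id(merges, 1))
```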
#### `fingerprint`
Audio fingerprints linked to tracks.
| Column | Type | Constraints | Description |
|--------|------|-------------|-------------|
| `id` | SERIAL | PRIMARY KEY | Fingerprint ID |
| `track_id` | INTEGER | FOREIGN KEY | Linked track |
| `fingerprint` | INTEGER[] | NOT NULL | Chromaprint hash array |
| `length` | SMALLINT | NOT NULL | Duration in seconds |
| `bitrate` | SMALLINT | | Audio bitrate (kbps) |
| `format_id` | INTEGER | FOREIGN KEY | Audio format |
| `created` | TIMESTAMP | NOT NULL | Creation timestamp |
| `submission_count` | INTEGER | DEFAULT 1 | Number of submissions |
**Indexes**:
- `fingerprint_pkey` (PRIMARY KEY on `id`)
- `fingerprint_track_id_idx` (on `track_id`)
- `fingerprint_length_idx` (on `length`)
- `fingerprint_fingerprint_idx` (GIN on `fingerprint` using `intarray`)
**Notes**:
- `fingerprint` is an array of 32-bit integers (Chromaprint hashes)
- GIN index enables fast similarity search
- `submission_count` tracks popularity
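The internals of `acoustid_compare` are not covered here. As a simplified stand-in, the sketch below scores two hash arrays by bit error rate over their overlapping prefix. It assumes the arrays are already time-aligned, which a real matcher cannot take for granted, and `bit_error_rate` is a hypothetical name.

```python
def bit_error_rate(fp_a: list[int], fp_b: list[int]) -> float:
    """Fraction of differing bits over the overlapping prefix of two fingerprints."""
    n = min(len(fp_a), len(fp_b))
    if n == 0:
        raise ValueError("both fingerprints must be non-empty")
    # Mask to 32 bits so signed PostgreSQL integers compare as unsigned hashes.
    diff_bits = sum(bin((a ^ b) & 0xFFFFFFFF).count("1") for a, b in zip(fp_a, fp_b))
    return diff_bits / (32 * n)


print(bit_error_rate([0b1111, 0], [0b0000, 0]))  # 4 differing bits out of 64
```

A rate near 0 means near-identical audio; unrelated audio tends toward 0.5, since each bit then matches only by chance.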
#### `fingerprint_data`
Extended fingerprint data with simhash.
| Column | Type | Constraints | Description |
|--------|------|-------------|-------------|
| `fingerprint_id` | INTEGER | PRIMARY KEY, FOREIGN KEY | Fingerprint ID |
| `fingerprint` | BYTEA | NOT NULL | Raw fingerprint data |
| `simhash` | CUBE | | Locality-sensitive hash |
**Indexes**:
- `fingerprint_data_pkey` (PRIMARY KEY on `fingerprint_id`)
- `fingerprint_data_simhash_idx` (GIST on `simhash`)
**Notes**:
- `fingerprint` stores compressed Chromaprint data
- `simhash` enables approximate nearest neighbor search
- GIST index for fast similarity queries
#### `track_mbid`
Links tracks to MusicBrainz recordings.
| Column | Type | Constraints | Description |
|--------|------|-------------|-------------|
| `id` | SERIAL | PRIMARY KEY | Link ID |
| `track_id` | INTEGER | FOREIGN KEY | AcoustID track |
| `mbid` | UUID | NOT NULL | MusicBrainz recording MBID |
| `created` | TIMESTAMP | NOT NULL | Creation timestamp |
| `submission_count` | INTEGER | DEFAULT 1 | Number of submissions |
| `disabled` | BOOLEAN | DEFAULT FALSE | Disabled flag |
**Indexes**:
- `track_mbid_pkey` (PRIMARY KEY on `id`)
- `track_mbid_track_id_mbid_key` (UNIQUE on `track_id, mbid`)
- `track_mbid_mbid_idx` (on `mbid`)
**Notes**:
- Multiple MBIDs per track possible (different recordings)
- `submission_count` indicates confidence
- Disabled links excluded from results
#### `meta`
User-submitted metadata.
| Column | Type | Constraints | Description |
|--------|------|-------------|-------------|
| `id` | SERIAL | PRIMARY KEY | Metadata ID |
| `track` | VARCHAR(255) | | Track title |
| `artist` | VARCHAR(255) | | Artist name |
| `album` | VARCHAR(255) | | Album title |
| `album_artist` | VARCHAR(255) | | Album artist |
| `track_no` | INTEGER | | Track number |
| `disc_no` | INTEGER | | Disc number |
| `year` | INTEGER | | Release year |
**Indexes**:
- `meta_pkey` (PRIMARY KEY on `id`)
#### `track_meta`
Links tracks to user metadata.
| Column | Type | Constraints | Description |
|--------|------|-------------|-------------|
| `id` | SERIAL | PRIMARY KEY | Link ID |
| `track_id` | INTEGER | FOREIGN KEY | AcoustID track |
| `meta_id` | INTEGER | FOREIGN KEY | Metadata record |
| `created` | TIMESTAMP | NOT NULL | Creation timestamp |
| `submission_count` | INTEGER | DEFAULT 1 | Number of submissions |
**Indexes**:
- `track_meta_pkey` (PRIMARY KEY on `id`)
- `track_meta_track_id_meta_id_key` (UNIQUE on `track_id, meta_id`)
#### `format`
Audio file formats.
| Column | Type | Constraints | Description |
|--------|------|-------------|-------------|
| `id` | SERIAL | PRIMARY KEY | Format ID |
| `name` | VARCHAR(20) | UNIQUE, NOT NULL | Format name (mp3, flac, etc.) |
**Indexes**:
- `format_pkey` (PRIMARY KEY on `id`)
- `format_name_key` (UNIQUE on `name`)
**Common Values**:
- `mp3`, `flac`, `ogg`, `m4a`, `wma`, `ape`, `wav`
#### `source`
Submission sources (applications).
| Column | Type | Constraints | Description |
|--------|------|-------------|-------------|
| `id` | SERIAL | PRIMARY KEY | Source ID |
| `application_id` | INTEGER | FOREIGN KEY | Application |
| `account_id` | INTEGER | FOREIGN KEY | User account |
| `version` | VARCHAR(255) | | Application version |
**Indexes**:
- `source_pkey` (PRIMARY KEY on `id`)
- `source_application_id_account_id_version_key` (UNIQUE on `application_id, account_id, version`)
### Foreign IDs (acoustid_fingerprint)
#### `foreignid_vendor`
External ID providers.
| Column | Type | Constraints | Description |
|--------|------|-------------|-------------|
| `id` | SERIAL | PRIMARY KEY | Vendor ID |
| `name` | VARCHAR(255) | UNIQUE, NOT NULL | Vendor name |
**Indexes**:
- `foreignid_vendor_pkey` (PRIMARY KEY on `id`)
- `foreignid_vendor_name_key` (UNIQUE on `name`)
**Common Values**:
- `musicbrainz`, `musicip`, `discogs`, `spotify`
#### `foreignid`
External identifiers.
| Column | Type | Constraints | Description |
|--------|------|-------------|-------------|
| `id` | SERIAL | PRIMARY KEY | Foreign ID |
| `vendor_id` | INTEGER | FOREIGN KEY | Vendor |
| `name` | VARCHAR(255) | NOT NULL | External ID value |
**Indexes**:
- `foreignid_pkey` (PRIMARY KEY on `id`)
- `foreignid_vendor_id_name_key` (UNIQUE on `vendor_id, name`)
#### `track_foreignid`
Links tracks to external IDs.
| Column | Type | Constraints | Description |
|--------|------|-------------|-------------|
| `id` | SERIAL | PRIMARY KEY | Link ID |
| `track_id` | INTEGER | FOREIGN KEY | AcoustID track |
| `foreignid_id` | INTEGER | FOREIGN KEY | External ID |
| `created` | TIMESTAMP | NOT NULL | Creation timestamp |
| `submission_count` | INTEGER | DEFAULT 1 | Number of submissions |
**Indexes**:
- `track_foreignid_pkey` (PRIMARY KEY on `id`)
- `track_foreignid_track_id_foreignid_id_key` (UNIQUE on `track_id, foreignid_id`)
#### `track_puid`
Legacy MusicIP PUID links.
| Column | Type | Constraints | Description |
|--------|------|-------------|-------------|
| `id` | SERIAL | PRIMARY KEY | Link ID |
| `track_id` | INTEGER | FOREIGN KEY | AcoustID track |
| `puid` | UUID | NOT NULL | MusicIP PUID |
| `created` | TIMESTAMP | NOT NULL | Creation timestamp |
| `submission_count` | INTEGER | DEFAULT 1 | Number of submissions |
**Indexes**:
- `track_puid_pkey` (PRIMARY KEY on `id`)
- `track_puid_track_id_puid_key` (UNIQUE on `track_id, puid`)
- `track_puid_puid_idx` (on `puid`)
### Statistics (acoustid_app)
#### `stats`
General statistics.
| Column | Type | Constraints | Description |
|--------|------|-------------|-------------|
| `id` | SERIAL | PRIMARY KEY | Stat ID |
| `name` | VARCHAR(255) | NOT NULL | Stat name (unique per `date`) |
| `value` | INTEGER | NOT NULL | Stat value |
| `date` | DATE | NOT NULL | Stat date |
**Indexes**:
- `stats_pkey` (PRIMARY KEY on `id`)
- `stats_name_date_key` (UNIQUE on `name, date`)
**Common Stats**:
- `lookup.count`, `submission.count`, `track.count`, `fingerprint.count`
#### `stats_lookups`
Lookup statistics by hour.
| Column | Type | Constraints | Description |
|--------|------|-------------|-------------|
| `id` | SERIAL | PRIMARY KEY | Stat ID |
| `hour` | TIMESTAMP | NOT NULL | Hour timestamp |
| `application_id` | INTEGER | FOREIGN KEY | Application |
| `count_hits` | INTEGER | DEFAULT 0 | Successful lookups |
| `count_misses` | INTEGER | DEFAULT 0 | Failed lookups |
**Indexes**:
- `stats_lookups_pkey` (PRIMARY KEY on `id`)
- `stats_lookups_hour_application_id_key` (UNIQUE on `hour, application_id`)
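The unique key on `(hour, application_id)` implies that every lookup is folded into one row per application per hour. A minimal in-memory sketch of that aggregation follows; `LookupStats` and `hour_bucket` are hypothetical names, and a real implementation would upsert into the table rather than keep counters in memory.

```python
from collections import Counter
from datetime import datetime


def hour_bucket(ts: datetime) -> datetime:
    """Truncate a timestamp to the start of its hour."""
    return ts.replace(minute=0, second=0, microsecond=0)


class LookupStats:
    """Accumulates hit/miss counts keyed by (hour, application_id)."""

    def __init__(self) -> None:
        self.hits: Counter = Counter()
        self.misses: Counter = Counter()

    def record(self, ts: datetime, application_id: int, hit: bool) -> None:
        key = (hour_bucket(ts), application_id)
        (self.hits if hit else self.misses)[key] += 1

    def rows(self) -> list:
        """Return (hour, application_id, count_hits, count_misses) tuples."""
        keys = sorted(set(self.hits) | set(self.misses))
        return [(hour, app, self.hits[(hour, app)], self.misses[(hour, app)])
                for hour, app in keys]
```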
#### `stats_user_agents`
User agent statistics.
| Column | Type | Constraints | Description |
|--------|------|-------------|-------------|
| `id` | SERIAL | PRIMARY KEY | Stat ID |
| `date` | DATE | NOT NULL | Date |
| `application_id` | INTEGER | FOREIGN KEY | Application |
| `user_agent` | VARCHAR(1000) | NOT NULL | User agent string |
| `ip` | INET | NOT NULL | IP address |
| `count` | INTEGER | DEFAULT 0 | Request count |
**Indexes**:
- `stats_user_agents_pkey` (PRIMARY KEY on `id`)
- `stats_user_agents_date_application_id_user_agent_ip_key` (UNIQUE on `date, application_id, user_agent, ip`)
#### `stats_top_accounts`
Top submitter accounts.
| Column | Type | Constraints | Description |
|--------|------|-------------|-------------|
| `id` | SERIAL | PRIMARY KEY | Stat ID |
| `account_id` | INTEGER | FOREIGN KEY | Account |
| `count` | INTEGER | NOT NULL | Submission count |
**Indexes**:
- `stats_top_accounts_pkey` (PRIMARY KEY on `id`)
- `stats_top_accounts_account_id_key` (UNIQUE on `account_id`)
### Submission Processing (acoustid_ingest)
#### `submission`
Pending fingerprint submissions.
| Column | Type | Constraints | Description |
|--------|------|-------------|-------------|
| `id` | SERIAL | PRIMARY KEY | Submission ID |
| `fingerprint` | INTEGER[] | NOT NULL | Chromaprint hash array |
| `length` | SMALLINT | NOT NULL | Duration in seconds |
| `bitrate` | SMALLINT | | Audio bitrate |
| `format_id` | INTEGER | | Audio format |
| `created` | TIMESTAMP | NOT NULL | Submission timestamp |
| `source_id` | INTEGER | FOREIGN KEY | Submission source |
| `mbid` | UUID | | MusicBrainz MBID (if provided) |
| `handled` | BOOLEAN | DEFAULT FALSE | Processing status |
| `meta_id` | INTEGER | FOREIGN KEY | User metadata |
**Indexes**:
- `submission_pkey` (PRIMARY KEY on `id`)
- `submission_handled_idx` (on `handled` WHERE `handled = FALSE`)
**Notes**:
- Worker processes unhandled submissions
- `handled = TRUE` after processing
#### `submission_result`
Processing results for submissions.
| Column | Type | Constraints | Description |
|--------|------|-------------|-------------|
| `id` | SERIAL | PRIMARY KEY | Result ID |
| `submission_id` | INTEGER | FOREIGN KEY | Submission |
| `track_id` | INTEGER | FOREIGN KEY | Matched/created track |
| `created` | TIMESTAMP | NOT NULL | Processing timestamp |
**Indexes**:
- `submission_result_pkey` (PRIMARY KEY on `id`)
- `submission_result_submission_id_key` (UNIQUE on `submission_id`)
#### `pending_submission`
Queue for async submission processing.
| Column | Type | Constraints | Description |
|--------|------|-------------|-------------|
| `id` | SERIAL | PRIMARY KEY | Queue ID |
| `submission_id` | INTEGER | FOREIGN KEY | Submission |
| `created` | TIMESTAMP | NOT NULL | Queue timestamp |
**Indexes**:
- `pending_submission_pkey` (PRIMARY KEY on `id`)
- `pending_submission_submission_id_key` (UNIQUE on `submission_id`)
**Notes**:
- Replaced by NATS queue in newer deployments
- Legacy table, may be deprecated
### Provenance Tables (acoustid_fingerprint)
Track data lineage and changes.
#### `fingerprint_source`
Links fingerprints to submission sources.
| Column | Type | Constraints | Description |
|--------|------|-------------|-------------|
| `id` | SERIAL | PRIMARY KEY | Link ID |
| `fingerprint_id` | INTEGER | FOREIGN KEY | Fingerprint |
| `source_id` | INTEGER | FOREIGN KEY | Source |
| `created` | TIMESTAMP | NOT NULL | Creation timestamp |
#### `track_mbid_source`
Links track-MBID associations to sources.
| Column | Type | Constraints | Description |
|--------|------|-------------|-------------|
| `id` | SERIAL | PRIMARY KEY | Link ID |
| `track_mbid_id` | INTEGER | FOREIGN KEY | Track-MBID link |
| `source_id` | INTEGER | FOREIGN KEY | Source |
| `created` | TIMESTAMP | NOT NULL | Creation timestamp |
#### `track_mbid_change`
Audit log for track-MBID changes.
| Column | Type | Constraints | Description |
|--------|------|-------------|-------------|
| `id` | SERIAL | PRIMARY KEY | Change ID |
| `track_mbid_id` | INTEGER | FOREIGN KEY | Track-MBID link |
| `account_id` | INTEGER | FOREIGN KEY | Account that made change |
| `disabled` | BOOLEAN | NOT NULL | New disabled status |
| `created` | TIMESTAMP | NOT NULL | Change timestamp |
| `note` | TEXT | | Change reason |
## ORM Layer (SQLAlchemy)
### Multi-Database Configuration
**File**: `acoustid/db.py`
```python
# Database bind keys
BIND_KEYS = {
    'app': 'acoustid_app',
    'fingerprint': 'acoustid_fingerprint',
    'ingest': 'acoustid_ingest',
    'musicbrainz': 'musicbrainz'
}
```
**Model Binding**:
```python
class Account(Base):
    __bind_key__ = 'app'
    __tablename__ = 'account'
    # ...

class Track(Base):
    __bind_key__ = 'fingerprint'
    __tablename__ = 'track'
    # ...
```
### Connection Pooling
**Configuration** (`acoustid.conf`):
```ini
[database]
name = acoustid_app
user = acoustid
password_file = /run/secrets/db_password
host = postgres
port = 5432
pool_size = 20
pool_recycle = 3600
```
**Pool Settings**:
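- `pool_size`: Number of connections kept open in the pool per process; SQLAlchemy may open up to `max_overflow` additional connections under load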
- `pool_recycle`: Recycle connections after N seconds
- `pool_pre_ping`: Test connections before use
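How AcoustID parses this file is not shown here. As one possible approach, the sketch below reads the `[database]` section with the standard-library `configparser` and produces a connection URL plus keyword arguments whose names match SQLAlchemy's `create_engine` pool parameters. `load_database_settings` is a hypothetical helper, and defaulting `pool_pre_ping` to off when the key is absent is an assumption.

```python
import configparser
from urllib.parse import quote

EXAMPLE_CONF = """
[database]
name = acoustid_app
user = acoustid
password_file = /run/secrets/db_password
host = postgres
port = 5432
pool_size = 20
pool_recycle = 3600
"""


def load_database_settings(conf_text: str):
    """Return (url_without_password, password_file, pool_kwargs) from INI text."""
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    db = parser["database"]
    url = "postgresql://{user}@{host}:{port}/{name}".format(
        user=quote(db["user"]),
        host=db["host"],
        port=db.getint("port"),
        name=quote(db["name"]),
    )
    pool_kwargs = {
        "pool_size": db.getint("pool_size"),
        "pool_recycle": db.getint("pool_recycle"),
        # Assumption: disabled unless the config sets it explicitly.
        "pool_pre_ping": db.getboolean("pool_pre_ping", fallback=False),
    }
    return url, db["password_file"], pool_kwargs


url, password_file, pool_kwargs = load_database_settings(EXAMPLE_CONF)
print(url, pool_kwargs)
```

The password is deliberately left out of the URL; the caller would read `password_file` at startup and pass the secret to the engine separately.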
### Query Patterns
**Fingerprint Search** (legacy, pre-index):
```python
# Find similar fingerprints using intarray overlap
from sqlalchemy import func

query = db.session.query(Fingerprint).filter(
    Fingerprint.fingerprint.op('&&')(query_fingerprint),
    Fingerprint.length.between(duration - 5, duration + 5)
).order_by(
    func.acoustid_compare(Fingerprint.fingerprint, query_fingerprint).desc()
).limit(10)
```
**Track Lookup with MBIDs**:
```python
# Fetch track with all linked MBIDs
from sqlalchemy.orm import joinedload

track = db.session.query(Track).options(
    joinedload(Track.mbids)
).filter(Track.gid == track_gid).first()
```
**Submission Processing**:
```python
# Find unhandled submissions
submissions = db.session.query(Submission).filter(
    Submission.handled.is_(False)
).order_by(Submission.created).limit(100).all()
```
## Database Migrations
### Alembic Configuration
**File**: `alembic.ini`
**Migration Directories**:
- `alembic/versions/app/`: acoustid_app migrations
- `alembic/versions/fingerprint/`: acoustid_fingerprint migrations
- `alembic/versions/ingest/`: acoustid_ingest migrations
**Multi-Database Support**:
```python
# alembic/env.py
from alembic import context

# get_engine() and get_metadata() are project helpers (not shown here)
def run_migrations_online():
    for bind_key in ['app', 'fingerprint', 'ingest']:
        engine = get_engine(bind_key)
        with engine.connect() as connection:
            context.configure(
                connection=connection,
                target_metadata=get_metadata(bind_key)
            )
            with context.begin_transaction():
                context.run_migrations()
```
### Migration Commands
```bash
# Create new migration
alembic revision --autogenerate -m "Add new column"
# Apply migrations
alembic upgrade head
# Rollback migration
alembic downgrade -1
# Show current version
alembic current
# Show migration history
alembic history
```
## Redis Data Structures
### Rate Limiting
**Key Pattern**: `rl:bucket:{scope}:{identifier}:{timestamp}` (the `global` scope has no `{identifier}` segment)
**Example Keys**:
```
rl:bucket:global:1714305600
rl:bucket:app:8XaBELgH:1714305600
rl:bucket:ip:192.168.1.1:1714305600
```
**Value**: Integer (request count)
**TTL**: 25 seconds (window duration + buffer)
**Algorithm**:
```python
# Increment bucket for current window
bucket_key = f"rl:bucket:{scope}:{identifier}:{current_window}"
count = redis.incr(bucket_key)
redis.expire(bucket_key, 25)
# Sum counts across all windows in the sliding window.
# GET returns None for a missing or expired bucket, so count it as 0.
total = sum(int(redis.get(f"rl:bucket:{scope}:{identifier}:{w}") or 0)
            for w in windows)
```
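To make the window arithmetic concrete, here is a self-contained sketch of the same bucketed sliding-window scheme with an in-memory dict standing in for Redis. The 20-second window and 5-second bucket size are assumptions (the document only states the 25-second TTL), and `BucketedRateLimiter` is a hypothetical class. As in the snippet above, the counter is incremented before the check, so rejected requests still count toward the limit.

```python
import time
from typing import Callable, Dict, Tuple


class BucketedRateLimiter:
    """Sliding-window limiter built from fixed-size time buckets."""

    def __init__(self, limit: int, window_seconds: int = 20, bucket_seconds: int = 5,
                 ttl_seconds: int = 25, clock: Callable[[], float] = time.time) -> None:
        self.limit = limit
        self.window_seconds = window_seconds
        self.bucket_seconds = bucket_seconds
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        # key -> (count, expires_at); stands in for Redis INCR + EXPIRE
        self._buckets: Dict[str, Tuple[int, float]] = {}

    def allow(self, scope: str, identifier: str) -> bool:
        now = self.clock()
        # Start of the bucket that contains "now"
        current = int(now) // self.bucket_seconds * self.bucket_seconds
        prefix = f"rl:bucket:{scope}:{identifier}:"
        count, _ = self._buckets.get(prefix + str(current), (0, 0.0))
        self._buckets[prefix + str(current)] = (count + 1, now + self.ttl_seconds)
        # Sum every bucket that falls inside the sliding window
        oldest = current - self.window_seconds + self.bucket_seconds
        total = 0
        for start in range(oldest, current + 1, self.bucket_seconds):
            entry = self._buckets.get(prefix + str(start))
            if entry is not None and entry[1] > now:
                total += entry[0]
        return total <= self.limit


limiter = BucketedRateLimiter(limit=3)
print([limiter.allow("app", "8XaBELgH") for _ in range(4)])
```

The sketch never deletes old entries; in Redis the key TTL takes care of that.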
### Task Queue (Legacy)
**Key Pattern**: `queue:{queue_name}`
**Operations**:
```python
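import json

# Push task
redis.rpush('queue:submissions', json.dumps(task_data))
# Pop task (LPOP returns None when the queue is empty)
raw = redis.lpop('queue:submissions')
task_data = json.loads(raw) if raw is not None else None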
```
**Note**: Being replaced by NATS in newer deployments
### API Key Cache
**Implementation**: In-memory TTLCache (not Redis)
```python
from cachetools import TTLCache
api_key_cache = TTLCache(maxsize=1000, ttl=60)
```
**Purpose**: Reduce database queries for API key validation
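A read-through lookup around such a cache might look like the sketch below. `lookup_application_id` and `load_from_db` are hypothetical names. The function accepts any mutable mapping, so a `TTLCache` can be passed in production and a plain dict in tests. Caching only successful lookups is a design assumption here, so that a newly created key becomes usable immediately.

```python
from typing import Callable, MutableMapping, Optional


def lookup_application_id(
    api_key: str,
    cache: MutableMapping[str, int],
    load_from_db: Callable[[str], Optional[int]],
) -> Optional[int]:
    """Return the application ID for an API key, consulting the cache first."""
    try:
        return cache[api_key]  # a TTLCache raises KeyError for missing or expired entries
    except KeyError:
        pass
    app_id = load_from_db(api_key)
    if app_id is not None:
        cache[api_key] = app_id  # unknown keys are not cached
    return app_id
```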
### Backfill State
**Key Pattern**: `backfill:{index_name}:{state_key}`
**Example Keys**:
```
backfill:fingerprints:last_id
backfill:fingerprints:batch_size
backfill:fingerprints:completed
```
**Purpose**: Track progress of index backfill operations
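A resumable backfill loop built on these keys could be sketched as follows. The meaning of each key is inferred from its name, values are kept as strings as Redis would store them, and `run_backfill`, `fetch_batch` and `index_batch` are hypothetical names; a dict stands in for Redis.

```python
from typing import Callable, MutableMapping, Sequence


def run_backfill(
    state: MutableMapping[str, str],
    fetch_batch: Callable[[int, int], Sequence[tuple]],
    index_batch: Callable[[Sequence[tuple]], None],
    index_name: str = "fingerprints",
) -> int:
    """Index rows in ID order, checkpointing after each batch. Returns rows indexed."""
    prefix = f"backfill:{index_name}:"
    if state.get(prefix + "completed") == "1":
        return 0
    last_id = int(state.get(prefix + "last_id", "0"))
    batch_size = int(state.get(prefix + "batch_size", "1000"))
    total = 0
    while True:
        # fetch_batch must return rows with id > last_id, ordered by id
        rows = fetch_batch(last_id, batch_size)
        if not rows:
            state[prefix + "completed"] = "1"
            return total
        index_batch(rows)
        last_id = rows[-1][0]
        state[prefix + "last_id"] = str(last_id)  # checkpoint: a restart resumes here
        total += len(rows)
```

Because the checkpoint is written only after a batch has been indexed, a crash repeats at most one batch, so `index_batch` should be idempotent.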
### Unknown MBID Cache
**Key Pattern**: `unknown_mbid:{mbid}`
**Value**: Boolean (1 if MBID not found in MusicBrainz)
**TTL**: 3600 seconds (1 hour)
**Purpose**: Avoid repeated MusicBrainz queries for non-existent MBIDs
## Data Integrity
### Constraints
**Foreign Keys**:
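- Foreign keys within a database use `ON DELETE CASCADE` or `ON DELETE SET NULL`, so dependent rows are removed or detached automatically
- PostgreSQL cannot enforce foreign keys across databases, so references that cross database boundaries (for example `source.application_id` in `acoustid_fingerprint` pointing at `application` in `acoustid_app`) are logical references that the application must keep consistent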
**Unique Constraints**:
- Prevent duplicate fingerprints per track
- Prevent duplicate MBID links per track
- Ensure API key uniqueness
**Check Constraints**:
- Duration must be positive
- Bitrate must be positive
- Submission count must be non-negative
### Triggers
**Update Submission Count**:
```sql
CREATE TRIGGER update_fingerprint_submission_count
AFTER INSERT ON fingerprint_source
FOR EACH ROW
EXECUTE FUNCTION increment_submission_count();
```
**Track Merge Propagation**:
```sql
CREATE TRIGGER propagate_track_merge
AFTER UPDATE OF new_id ON track
FOR EACH ROW
EXECUTE FUNCTION update_merged_track_references();
```
### Indexes for Performance
**Covering Indexes**:
```sql
-- Filter by duration and track; the fingerprint is carried as a non-key INCLUDE column
CREATE INDEX fingerprint_lookup_idx
ON fingerprint (length, track_id)
INCLUDE (fingerprint);
```
**Partial Indexes**:
```sql
-- Only index unhandled submissions
CREATE INDEX submission_unhandled_idx
ON submission (created)
WHERE handled = FALSE;
```
**GIN Indexes**:
```sql
-- Fast fingerprint array queries
CREATE INDEX fingerprint_fingerprint_idx
ON fingerprint USING GIN (fingerprint gin__int_ops);
```
## Data Lifecycle
### Fingerprint Submission
1. Insert into `submission` table (acoustid_ingest)
2. Publish to NATS queue
3. Worker processes submission
4. Insert into `fingerprint` table (acoustid_fingerprint)
5. Link to `track` (create or match)
6. Insert into `fingerprint_source` (provenance)
7. Update index via HTTP API
8. Insert into `submission_result`
9. Mark `submission.handled = TRUE`
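The worker side of these steps (3 through 9) can be sketched with in-memory stand-ins for the databases. Everything here is hypothetical: `Store`, `process_submission`, and the `match_track` and `update_index` callbacks are illustrative names, and NATS delivery (step 2) is left out. The track is resolved before the fingerprint row is written because `fingerprint.track_id` needs a value at insert time.

```python
import uuid
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Store:
    """In-memory stand-in for the ingest and fingerprint databases."""
    submissions: dict = field(default_factory=dict)         # id -> submission row
    submission_results: dict = field(default_factory=dict)  # submission_id -> track_id
    tracks: dict = field(default_factory=dict)              # track_id -> gid
    fingerprints: dict = field(default_factory=dict)        # fingerprint_id -> row
    fingerprint_sources: list = field(default_factory=list)


def process_submission(
    store: Store,
    submission_id: int,
    match_track: Callable[[List[int], int], Optional[int]],
    update_index: Callable[[int, List[int]], None],
) -> int:
    sub = store.submissions[submission_id]
    if sub["handled"]:
        # Already processed: return the recorded result instead of duplicating rows.
        return store.submission_results[submission_id]
    # Step 5: match an existing track or create a new one
    track_id = match_track(sub["fingerprint"], sub["length"])
    if track_id is None:
        track_id = len(store.tracks) + 1
        store.tracks[track_id] = str(uuid.uuid4())
    # Step 4: insert the fingerprint, linked to the track
    fingerprint_id = len(store.fingerprints) + 1
    store.fingerprints[fingerprint_id] = {
        "track_id": track_id,
        "fingerprint": sub["fingerprint"],
        "length": sub["length"],
    }
    # Step 6: record provenance
    store.fingerprint_sources.append((fingerprint_id, sub["source_id"]))
    # Step 7: push the fingerprint to the search index
    update_index(fingerprint_id, sub["fingerprint"])
    # Steps 8 and 9: store the result and mark the submission handled
    store.submission_results[submission_id] = track_id
    sub["handled"] = True
    return track_id
```

Since the real tables live in separate databases, these writes cannot share one transaction; marking the submission handled last means a crash mid-way leaves it eligible for a retry.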
### Track Merging
1. Identify duplicate tracks (manual or automated)
2. Set `track.new_id` to target track
3. Trigger updates all references
4. Merge fingerprints, MBIDs, metadata
5. Disable old track (`track.disabled = TRUE`)
### Data Cleanup
**Cron Jobs**:
- Delete old handled submissions (>30 days)
- Clean up orphaned metadata records
- Remove disabled tracks with no references
- Archive old statistics
## Performance Optimization
### Query Optimization
**Materialized Views**:
```sql
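-- Aggregate each table separately so the join does not inflate the sums
CREATE MATERIALIZED VIEW track_stats AS
SELECT
    track_id,
    f.fingerprint_count,
    COALESCE(m.mbid_count, 0) AS mbid_count,
    f.total_submissions  -- sum of fingerprint.submission_count
FROM (
    SELECT track_id, COUNT(*) AS fingerprint_count, SUM(submission_count) AS total_submissions
    FROM fingerprint
    GROUP BY track_id
) AS f
LEFT JOIN (
    SELECT track_id, COUNT(DISTINCT mbid) AS mbid_count
    FROM track_mbid
    GROUP BY track_id
) AS m USING (track_id);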
```
**Partitioning** (future):
```sql
-- Partition submissions by month
-- Requires the parent table to be declared partitioned, e.g. PARTITION BY RANGE (created)
CREATE TABLE submission_2025_04 PARTITION OF submission
    FOR VALUES FROM ('2025-04-01') TO ('2025-05-01');
```
### Caching Strategy
**Application-Level**:
- API key validation (TTLCache, 60s)
- Format ID lookup (permanent cache)
- MusicBrainz MBID existence (Redis, 1h)
**Database-Level**:
- Shared buffers (PostgreSQL config)
- Connection pooling (SQLAlchemy)
- Query statistics for tuning (`pg_stat_statements`, which records per-statement execution statistics and does not cache results)
### Bulk Operations
**Batch Inserts**:
```python
# Insert multiple fingerprints efficiently
db.session.bulk_insert_mappings(Fingerprint, fingerprint_dicts)
db.session.commit()
```
**Bulk Updates**:
```python
# Update submission counts in batch
from sqlalchemy import update

db.session.execute(
    update(Fingerprint)
    .where(Fingerprint.id.in_(fingerprint_ids))
    .values(submission_count=Fingerprint.submission_count + 1)
)
db.session.commit()
```
## Backup and Recovery
### Backup Strategy
**PostgreSQL**:
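- Daily logical backups (`pg_dump`)
- Physical base backups (e.g. `pg_basebackup`) with continuous WAL archiving
- Point-in-time recovery from a base backup plus archived WAL (a `pg_dump` dump cannot be rolled forward)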
**Index**:
- Daily snapshots via `/:index/_snapshot`
- Incremental backups of Oplog
- Segment files backed up separately
### Disaster Recovery
**Database Restore**:
```bash
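# Restore from a custom-format dump (created with pg_dump -Fc)
pg_restore -d acoustid_app acoustid_app_backup.dump
# Point-in-time recovery (pg_restore has no target-time option):
# restore a physical base backup into $PGDATA, then set in postgresql.conf
#   restore_command = 'cp <wal_archive_dir>/%f %p'
#   recovery_target_time = '2025-04-28 12:00:00'
# and create recovery.signal before starting the server (PostgreSQL 12+)
touch "$PGDATA/recovery.signal"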
```
**Index Rebuild**:
```bash
# Rebuild from database
python manage.py run import --rebuild-index
```