Alexander a1f6701bac feat: initial implementation of metadata aggregator
- gRPC service with MusicBrainz provider
- PostgreSQL schema with migrations
- Service layer with database-first caching
- Repository pattern for data access
- YAML configuration support
- Research documentation for 17 music metadata projects
2026-04-28 16:28:53 +02:00

MiniMediaMetadataAPI - Data Layer Analysis

Database Technology

RDBMS: PostgreSQL
Driver: Npgsql 10.0.2
ORM: Dapper 2.1.72 (micro-ORM)
Extensions: pg_trgm (trigram similarity search)

Schema Ownership

Critical Constraint: This API does NOT own the database schema.

Schema Owner: MiniMediaScanner (separate project)
API Role: Read-only consumer
Migration Strategy: None (schema managed externally)

Implications

Pros:

  • Clear separation of concerns
  • API doesn't need provider API credentials
  • Simpler deployment (no migration coordination)
  • Sync complexity isolated in MiniMediaScanner

Cons:

  • No control over schema evolution
  • Breaking changes in MiniMediaScanner break API
  • Can't optimize schema for query patterns
  • Data freshness depends on external sync schedule

Coupling Points:

  • Table names hardcoded in SQL queries
  • Column names hardcoded in Dapper mappings
  • Foreign key relationships assumed in joins
  • Data types must match C# model properties

Connection Configuration

Connection String Format:

Host=localhost;
Database=minimediametadata;
Username=postgres;
Password=password;
MinPoolSize=5;
MaxPoolSize=100;
Timeout=30;
CommandTimeout=30;

Pooling Settings:

  • MinPoolSize: 5 connections kept alive
  • MaxPoolSize: 100 concurrent connections
  • Timeout: 30 seconds to acquire connection
  • CommandTimeout: 30 seconds for query execution

Connection Lifecycle:

  • Connection opened per repository method call (leased from the pool, not newly established each time)
  • Returned to pool after query completion
  • No long-lived connections
  • No transaction management (read-only)
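
As a minimal sketch of how these settings could be sanity-checked at startup, the snippet below parses the connection string with the BCL's generic `DbConnectionStringBuilder`. The `PoolSettings` class and its `IsValid` rule are hypothetical and not part of MiniMediaMetadataAPI; in a real Npgsql application the typed `NpgsqlConnectionStringBuilder` would be the more natural choice.

```csharp
using System;
using System.Data.Common;
using System.Globalization;

public static class PoolSettings
{
    // Connection string from the section above (placeholder credentials).
    public const string Example =
        "Host=localhost;Database=minimediametadata;Username=postgres;Password=password;" +
        "MinPoolSize=5;MaxPoolSize=100;Timeout=30;CommandTimeout=30;";

    // Reads one integer setting; keywords are matched case-insensitively.
    public static int GetInt(string connectionString, string key)
    {
        var builder = new DbConnectionStringBuilder { ConnectionString = connectionString };
        if (!builder.TryGetValue(key, out var raw))
            throw new ArgumentException($"Missing setting: {key}");
        return Convert.ToInt32(raw, CultureInfo.InvariantCulture);
    }

    // Hypothetical rule: the pool bounds must be ordered and the maximum positive.
    public static bool IsValid(string connectionString)
    {
        var min = GetInt(connectionString, "MinPoolSize");
        var max = GetInt(connectionString, "MaxPoolSize");
        return min >= 0 && max > 0 && min <= max;
    }
}
```

Calling `PoolSettings.IsValid(PoolSettings.Example)` on the string shown above should succeed, since 5 is not greater than 100.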

Fuzzy Search Implementation

pg_trgm Extension

Purpose: Trigram-based similarity search for fuzzy text matching

Configuration:

SET LOCAL pg_trgm.similarity_threshold = 0.5;

Threshold: 0.5 (50% similarity required)

Operators:

  • % - Similarity operator (returns true if similarity >= threshold)
  • similarity(text, text) - Returns similarity score (0.0 to 1.0)

Search Query Pattern

Example (Artist Search):

SET LOCAL pg_trgm.similarity_threshold = 0.5;

SELECT 
    id,
    name,
    popularity,
    external_url,
    followers,
    genres,
    last_sync_time
FROM spotify_artist
WHERE lower(name) % lower(@searchTerm)
ORDER BY similarity(lower(name), lower(@searchTerm)) DESC
LIMIT 20 OFFSET @offset;

Key Features:

  • Case-insensitive matching (lower())
  • Similarity-based ordering (best matches first)
  • Pagination support (LIMIT/OFFSET)
  • Threshold filtering (only >= 50% similarity)

Performance:

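  • Requires a GIN or GiST trigram index on the searched expression, here lower(name)
  • Index creation: CREATE INDEX idx_artist_name_trgm ON spotify_artist USING gin(lower(name) gin_trgm_ops);
  • Query time: the index narrows the search to rows that share trigrams with the term; without it every row is scanned, O(n)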

Similarity Scoring

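Algorithm: Trigram overlap. pg_trgm lowercases the text, splits it into words, pads each word with two leading spaces and one trailing space, and scores shared trigrams divided by distinct trigrams across both strings.

Example:

"Beatles" vs "Beetles"
Trigrams: ["  b", " be", "bea", "eat", "atl", "tle", "les", "es "] vs ["  b", " be", "bee", "eet", "etl", "tle", "les", "es "]
Overlap: ["  b", " be", "tle", "les", "es "] = 5/11 = 0.45 (below threshold)

"Beatles" vs "The Beatles"
Trigrams: ["  b", " be", "bea", "eat", "atl", "tle", "les", "es "] vs ["  t", " th", "the", "he ", "  b", " be", "bea", "eat", "atl", "tle", "les", "es "]
Overlap: ["  b", " be", "bea", "eat", "atl", "tle", "les", "es "] = 8/12 = 0.67 (above threshold)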

Tuning:

  • Lower threshold (0.3) = more results, more false positives
  • Higher threshold (0.7) = fewer results, more precision
  • Current 0.5 = balanced approach
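
To make the scoring concrete, here is a small sketch that approximates pg_trgm's similarity in C#. It assumes the documented behaviour (lowercase, split on non-alphanumeric characters, pad each word with two leading spaces and one trailing space, then divide shared trigrams by the distinct trigrams of both strings). The `Trigram` class is hypothetical, and PostgreSQL's handling of non-ASCII text may differ from `char.IsLetterOrDigit`.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

public static class Trigram
{
    // Builds the set of distinct trigrams for a string, pg_trgm style.
    public static HashSet<string> Extract(string text)
    {
        var set = new HashSet<string>(StringComparer.Ordinal);
        var word = new StringBuilder();
        // The appended space flushes the final word.
        foreach (var ch in text.ToLowerInvariant() + " ")
        {
            if (char.IsLetterOrDigit(ch)) { word.Append(ch); continue; }
            if (word.Length == 0) continue;
            var padded = "  " + word.ToString() + " ";
            for (var i = 0; i + 3 <= padded.Length; i++)
                set.Add(padded.Substring(i, 3));
            word.Clear();
        }
        return set;
    }

    // shared / (|A| + |B| - shared), i.e. shared over the union of both sets.
    public static double Similarity(string a, string b)
    {
        var ta = Extract(a);
        var tb = Extract(b);
        if (ta.Count == 0 || tb.Count == 0) return 0.0;
        var shared = ta.Count(t => tb.Contains(t));
        return (double)shared / (ta.Count + tb.Count - shared);
    }
}
```

With this sketch, `Trigram.Similarity("Beatles", "Beetles")` evaluates to 5/11 and `Trigram.Similarity("Beatles", "The Beatles")` to 8/12, so only the second pair clears a 0.5 threshold.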

Database Schema

Provider-Specific Tables

Each provider has isolated table structure. No cross-provider foreign keys.

Spotify Schema

Tables:

  1. spotify_artist - Artist metadata
  2. spotify_artist_image - Artist images (1:N)
  3. spotify_album - Album metadata
  4. spotify_album_artist - Album-artist relationships (M:N)
  5. spotify_album_image - Album artwork (1:N)
  6. spotify_album_externalid - External identifiers (UPC, EAN) (1:N)
  7. spotify_track - Track metadata
  8. spotify_track_artist - Track-artist relationships (M:N)
  9. spotify_track_externalid - External identifiers (ISRC) (1:N)

spotify_artist:

CREATE TABLE spotify_artist (
    id VARCHAR(255) PRIMARY KEY,
    name VARCHAR(500) NOT NULL,
    popularity INTEGER,
    external_url VARCHAR(500),
    followers INTEGER,
    genres TEXT[], -- PostgreSQL array
    last_sync_time TIMESTAMP WITH TIME ZONE
);

CREATE INDEX idx_spotify_artist_name_trgm ON spotify_artist USING gin(lower(name) gin_trgm_ops);

spotify_artist_image:

CREATE TABLE spotify_artist_image (
    id SERIAL PRIMARY KEY,
    artist_id VARCHAR(255) REFERENCES spotify_artist(id),
    url VARCHAR(1000) NOT NULL,
    height INTEGER,
    width INTEGER
);

CREATE INDEX idx_spotify_artist_image_artist ON spotify_artist_image(artist_id);

spotify_album:

CREATE TABLE spotify_album (
    id VARCHAR(255) PRIMARY KEY,
    name VARCHAR(500) NOT NULL,
    popularity INTEGER,
    external_url VARCHAR(500),
    label VARCHAR(500),
    release_date VARCHAR(50), -- Stored as string (YYYY, YYYY-MM, or YYYY-MM-DD)
    total_tracks INTEGER,
    album_type VARCHAR(50), -- album, single, compilation
    copyright TEXT,
    last_sync_time TIMESTAMP WITH TIME ZONE
);

CREATE INDEX idx_spotify_album_name_trgm ON spotify_album USING gin(lower(name) gin_trgm_ops);

spotify_album_artist (junction table):

CREATE TABLE spotify_album_artist (
    id SERIAL PRIMARY KEY,
    album_id VARCHAR(255) REFERENCES spotify_album(id),
    artist_id VARCHAR(255) REFERENCES spotify_artist(id)
);

CREATE INDEX idx_spotify_album_artist_album ON spotify_album_artist(album_id);
CREATE INDEX idx_spotify_album_artist_artist ON spotify_album_artist(artist_id);

spotify_track:

CREATE TABLE spotify_track (
    id VARCHAR(255) PRIMARY KEY,
    name VARCHAR(500) NOT NULL,
    album_id VARCHAR(255) REFERENCES spotify_album(id),
    popularity INTEGER,
    external_url VARCHAR(500),
    duration_ms INTEGER,
    explicit BOOLEAN,
    disc_number INTEGER,
    track_number INTEGER,
    label VARCHAR(500),
    last_sync_time TIMESTAMP WITH TIME ZONE
);

CREATE INDEX idx_spotify_track_name_trgm ON spotify_track USING gin(lower(name) gin_trgm_ops);
CREATE INDEX idx_spotify_track_album ON spotify_track(album_id);

spotify_album_externalid:

CREATE TABLE spotify_album_externalid (
    id SERIAL PRIMARY KEY,
    album_id VARCHAR(255) REFERENCES spotify_album(id),
    type VARCHAR(50), -- upc, ean
    value VARCHAR(255)
);

CREATE INDEX idx_spotify_album_externalid_album ON spotify_album_externalid(album_id);
CREATE INDEX idx_spotify_album_externalid_value ON spotify_album_externalid(value);

spotify_track_externalid:

CREATE TABLE spotify_track_externalid (
    id SERIAL PRIMARY KEY,
    track_id VARCHAR(255) REFERENCES spotify_track(id),
    type VARCHAR(50), -- isrc
    value VARCHAR(255)
);

CREATE INDEX idx_spotify_track_externalid_track ON spotify_track_externalid(track_id);
CREATE INDEX idx_spotify_track_externalid_value ON spotify_track_externalid(value);

Tidal Schema

Tables:

  1. tidal_artist - Artist metadata
  2. tidal_artist_image_link - Artist image URLs (1:N)
  3. tidal_album - Album metadata
  4. tidal_album_external_link - External URLs (1:N)
  5. tidal_album_image - Album artwork (1:N)
  6. tidal_track - Track metadata
  7. tidal_track_artist - Track-artist relationships (M:N)
  8. tidal_track_external_link - External URLs (1:N)

Key Differences from Spotify:

  • ID type: INTEGER instead of VARCHAR
  • No popularity field
  • No genres field
  • External links instead of external IDs
  • Image links stored as separate table

tidal_artist:

CREATE TABLE tidal_artist (
    id INTEGER PRIMARY KEY,
    name VARCHAR(500) NOT NULL,
    url VARCHAR(500),
    last_sync_time TIMESTAMP WITH TIME ZONE
);

CREATE INDEX idx_tidal_artist_name_trgm ON tidal_artist USING gin(lower(name) gin_trgm_ops);

tidal_album:

CREATE TABLE tidal_album (
    id INTEGER PRIMARY KEY,
    name VARCHAR(500) NOT NULL,
    artist_id INTEGER REFERENCES tidal_artist(id),
    url VARCHAR(500),
    release_date VARCHAR(50),
    total_tracks INTEGER,
    duration INTEGER, -- Total duration in seconds
    explicit BOOLEAN,
    upc VARCHAR(255),
    copyright TEXT,
    last_sync_time TIMESTAMP WITH TIME ZONE
);

CREATE INDEX idx_tidal_album_name_trgm ON tidal_album USING gin(lower(name) gin_trgm_ops);
CREATE INDEX idx_tidal_album_artist ON tidal_album(artist_id);

MusicBrainz Schema

Tables:

  1. musicbrainz_artist - Artist metadata
  2. musicbrainz_release - Release (album) metadata
  3. musicbrainz_release_label - Release-label relationships (M:N)
  4. musicbrainz_label - Label metadata
  5. musicbrainz_release_track - Track metadata
  6. musicbrainz_release_track_artist - Track-artist relationships (M:N)

Key Differences:

  • ID type: UUID (Guid)
  • "Release" instead of "Album"
  • Sort name field for artists
  • Label as separate entity
  • No popularity or follower counts
  • No images (stored externally via Cover Art Archive)

musicbrainz_artist:

CREATE TABLE musicbrainz_artist (
    id UUID PRIMARY KEY,
    name VARCHAR(500) NOT NULL,
    sort_name VARCHAR(500), -- For alphabetical sorting (e.g., "Beatles, The")
    type VARCHAR(100), -- Person, Group, Orchestra, etc.
    country VARCHAR(2), -- ISO country code
    last_sync_time TIMESTAMP WITH TIME ZONE
);

CREATE INDEX idx_musicbrainz_artist_name_trgm ON musicbrainz_artist USING gin(lower(name) gin_trgm_ops);

musicbrainz_release:

CREATE TABLE musicbrainz_release (
    id UUID PRIMARY KEY,
    name VARCHAR(500) NOT NULL,
    artist_id UUID REFERENCES musicbrainz_artist(id),
    release_date VARCHAR(50),
    country VARCHAR(2),
    barcode VARCHAR(255), -- Similar to UPC
    status VARCHAR(100), -- Official, Promotion, Bootleg, etc.
    packaging VARCHAR(100), -- Jewel Case, Digipak, etc.
    last_sync_time TIMESTAMP WITH TIME ZONE
);

CREATE INDEX idx_musicbrainz_release_name_trgm ON musicbrainz_release USING gin(lower(name) gin_trgm_ops);
CREATE INDEX idx_musicbrainz_release_artist ON musicbrainz_release(artist_id);

musicbrainz_label:

CREATE TABLE musicbrainz_label (
    id UUID PRIMARY KEY,
    name VARCHAR(500) NOT NULL,
    type VARCHAR(100), -- Original Production, Bootleg Production, etc.
    country VARCHAR(2),
    last_sync_time TIMESTAMP WITH TIME ZONE
);

musicbrainz_release_label (junction table):

CREATE TABLE musicbrainz_release_label (
    id SERIAL PRIMARY KEY,
    release_id UUID REFERENCES musicbrainz_release(id),
    label_id UUID REFERENCES musicbrainz_label(id),
    catalog_number VARCHAR(255)
);

CREATE INDEX idx_musicbrainz_release_label_release ON musicbrainz_release_label(release_id);
CREATE INDEX idx_musicbrainz_release_label_label ON musicbrainz_release_label(label_id);

Deezer Schema

Tables:

  1. deezer_artist - Artist metadata
  2. deezer_artist_image_link - Artist image URLs (1:N)
  3. deezer_album - Album metadata
  4. deezer_album_image_link - Album artwork URLs (1:N)
  5. deezer_album_artist - Album-artist relationships (M:N)
  6. deezer_track - Track metadata
  7. deezer_track_artist - Track-artist relationships (M:N)

Key Differences:

  • ID type: BIGINT
  • Has popularity (called "fans")
  • Has genres
  • No UPC/ISRC fields
  • No label information

deezer_artist:

CREATE TABLE deezer_artist (
    id BIGINT PRIMARY KEY,
    name VARCHAR(500) NOT NULL,
    url VARCHAR(500),
    fans INTEGER, -- Similar to followers
    last_sync_time TIMESTAMP WITH TIME ZONE
);

CREATE INDEX idx_deezer_artist_name_trgm ON deezer_artist USING gin(lower(name) gin_trgm_ops);

deezer_album:

CREATE TABLE deezer_album (
    id BIGINT PRIMARY KEY,
    name VARCHAR(500) NOT NULL,
    url VARCHAR(500),
    release_date VARCHAR(50),
    total_tracks INTEGER,
    duration INTEGER, -- Total duration in seconds
    explicit BOOLEAN,
    fans INTEGER,
    genres TEXT[], -- PostgreSQL array
    last_sync_time TIMESTAMP WITH TIME ZONE
);

CREATE INDEX idx_deezer_album_name_trgm ON deezer_album USING gin(lower(name) gin_trgm_ops);

Discogs Schema

Tables:

  1. discogs_artist - Artist metadata
  2. discogs_artist_alias - Artist aliases (1:N)
  3. discogs_artist_url - Artist URLs (1:N)
  4. discogs_release - Release metadata
  5. discogs_release_artist - Release-artist relationships (M:N)
  6. discogs_release_identifier - Barcodes/identifiers (1:N)
  7. discogs_release_track - Track metadata
  8. discogs_label - Label metadata
  9. discogs_label_sublabel - Label hierarchy (1:N)
  10. discogs_label_url - Label URLs (1:N)

Key Differences:

  • ID type: INTEGER
  • Most comprehensive label data
  • Artist aliases tracked
  • Multiple identifiers per release (Barcode, Matrix, etc.)
  • No popularity metrics
  • No image URLs (stored externally)

discogs_artist:

CREATE TABLE discogs_artist (
    id INTEGER PRIMARY KEY,
    name VARCHAR(500) NOT NULL,
    real_name VARCHAR(500), -- For pseudonyms
    profile TEXT, -- Biography
    last_sync_time TIMESTAMP WITH TIME ZONE
);

CREATE INDEX idx_discogs_artist_name_trgm ON discogs_artist USING gin(lower(name) gin_trgm_ops);

discogs_artist_alias:

CREATE TABLE discogs_artist_alias (
    id SERIAL PRIMARY KEY,
    artist_id INTEGER REFERENCES discogs_artist(id),
    alias_name VARCHAR(500)
);

CREATE INDEX idx_discogs_artist_alias_artist ON discogs_artist_alias(artist_id);
CREATE INDEX idx_discogs_artist_alias_name_trgm ON discogs_artist_alias USING gin(lower(alias_name) gin_trgm_ops);

discogs_release:

CREATE TABLE discogs_release (
    id INTEGER PRIMARY KEY,
    name VARCHAR(500) NOT NULL,
    released VARCHAR(50),
    country VARCHAR(100),
    notes TEXT,
    genres TEXT[],
    styles TEXT[], -- More specific than genres
    last_sync_time TIMESTAMP WITH TIME ZONE
);

CREATE INDEX idx_discogs_release_name_trgm ON discogs_release USING gin(lower(name) gin_trgm_ops);

discogs_release_identifier:

CREATE TABLE discogs_release_identifier (
    id SERIAL PRIMARY KEY,
    release_id INTEGER REFERENCES discogs_release(id),
    type VARCHAR(100), -- Barcode, Matrix/Runout, Label Code, etc.
    value VARCHAR(500),
    description TEXT
);

CREATE INDEX idx_discogs_release_identifier_release ON discogs_release_identifier(release_id);
CREATE INDEX idx_discogs_release_identifier_value ON discogs_release_identifier(value);

discogs_label:

CREATE TABLE discogs_label (
    id INTEGER PRIMARY KEY,
    name VARCHAR(500) NOT NULL,
    contact_info TEXT,
    profile TEXT,
    parent_label_id INTEGER REFERENCES discogs_label(id),
    last_sync_time TIMESTAMP WITH TIME ZONE
);

CREATE INDEX idx_discogs_label_name_trgm ON discogs_label USING gin(lower(name) gin_trgm_ops);

SoundCloud Schema

Tables:

  1. soundcloud_user - User/artist metadata
  2. soundcloud_playlist - Playlist metadata
  3. soundcloud_track - Track metadata
  4. soundcloud_track_artist - Track-artist relationships (M:N)

Key Differences:

  • "User" instead of "Artist" (user-generated content platform)
  • Playlist as first-class entity
  • No album concept
  • Minimal metadata (no UPC, ISRC, labels)
  • ID type: BIGINT

soundcloud_user:

CREATE TABLE soundcloud_user (
    id BIGINT PRIMARY KEY,
    username VARCHAR(500) NOT NULL,
    full_name VARCHAR(500),
    url VARCHAR(500),
    avatar_url VARCHAR(1000),
    followers_count INTEGER,
    last_sync_time TIMESTAMP WITH TIME ZONE
);

CREATE INDEX idx_soundcloud_user_username_trgm ON soundcloud_user USING gin(lower(username) gin_trgm_ops);

soundcloud_playlist:

CREATE TABLE soundcloud_playlist (
    id BIGINT PRIMARY KEY,
    title VARCHAR(500) NOT NULL,
    user_id BIGINT REFERENCES soundcloud_user(id),
    url VARCHAR(500),
    artwork_url VARCHAR(1000),
    duration INTEGER, -- Total duration in milliseconds
    track_count INTEGER,
    last_sync_time TIMESTAMP WITH TIME ZONE
);

CREATE INDEX idx_soundcloud_playlist_title_trgm ON soundcloud_playlist USING gin(lower(title) gin_trgm_ops);
CREATE INDEX idx_soundcloud_playlist_user ON soundcloud_playlist(user_id);

soundcloud_track:

CREATE TABLE soundcloud_track (
    id BIGINT PRIMARY KEY,
    title VARCHAR(500) NOT NULL,
    user_id BIGINT REFERENCES soundcloud_user(id),
    url VARCHAR(500),
    artwork_url VARCHAR(1000),
    duration INTEGER, -- Duration in milliseconds
    genre VARCHAR(255),
    playback_count INTEGER,
    last_sync_time TIMESTAMP WITH TIME ZONE
);

CREATE INDEX idx_soundcloud_track_title_trgm ON soundcloud_track USING gin(lower(title) gin_trgm_ops);
CREATE INDEX idx_soundcloud_track_user ON soundcloud_track(user_id);

ID Type Comparison

| Provider | Artist ID | Album ID | Track ID | Notes |
|---|---|---|---|---|
| Spotify | VARCHAR(255) | VARCHAR(255) | VARCHAR(255) | Base62 encoded (22 chars) |
| Tidal | INTEGER | INTEGER | INTEGER | Sequential integers |
| MusicBrainz | UUID | UUID | UUID | RFC 4122 UUIDs |
| Deezer | BIGINT | BIGINT | BIGINT | Large integers |
| Discogs | INTEGER | INTEGER | INTEGER | Sequential integers |
| SoundCloud | BIGINT | N/A | BIGINT | No album concept |

Implications:

  • Cross-provider ID lookups impossible
  • ID parameter must match provider type
  • C# models use provider-specific types
  • No universal identifier system
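
Because the ID parameter must match the provider's type, an API would typically validate the raw string before querying. The following is a sketch under that assumption; the `Provider` enum and `ProviderId.Parse` helper are hypothetical names, and the 22-character check for Spotify simply follows the note in the table above.

```csharp
using System;
using System.Globalization;
using System.Linq;

public enum Provider { Spotify, Tidal, MusicBrainz, Deezer, Discogs, SoundCloud }

public static class ProviderId
{
    // Returns the ID boxed in its provider-specific CLR type,
    // or null when the text is not a valid ID for that provider.
    public static object? Parse(Provider provider, string raw)
    {
        switch (provider)
        {
            case Provider.Spotify: // VARCHAR key, 22 ASCII letters or digits
                return raw.Length == 22 && raw.All(c => c < 128 && char.IsLetterOrDigit(c))
                    ? raw : null;
            case Provider.Tidal:
            case Provider.Discogs: // INTEGER keys
                return int.TryParse(raw, NumberStyles.None, CultureInfo.InvariantCulture, out var i32)
                    ? (object)i32 : null;
            case Provider.Deezer:
            case Provider.SoundCloud: // BIGINT keys
                return long.TryParse(raw, NumberStyles.None, CultureInfo.InvariantCulture, out var i64)
                    ? (object)i64 : null;
            case Provider.MusicBrainz: // UUID keys
                return Guid.TryParse(raw, out var guid) ? (object)guid : null;
            default:
                throw new ArgumentOutOfRangeException(nameof(provider));
        }
    }
}
```

A null result can be turned into a 400-style error before any SQL runs, which keeps type mismatches out of the database layer.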

Data Type Patterns

Arrays (PostgreSQL Native)

Usage: Genres, styles (external IDs live in separate tables, not arrays)

Example:

genres TEXT[] -- ["rock", "pop", "alternative"]

Dapper Mapping:

public class SpotifyArtist
{
    public string[] Genres { get; set; } // Npgsql reads TEXT[] as string[]; Dapper assigns it to the property
}

Timestamps

Type: TIMESTAMP WITH TIME ZONE
Purpose: Track last sync time from provider

Example:

last_sync_time TIMESTAMP WITH TIME ZONE DEFAULT NOW()

C# Mapping:

public DateTime? LastSyncTime { get; set; }

Variable-Length Dates

Type: VARCHAR(50)
Formats: YYYY, YYYY-MM, YYYY-MM-DD

Rationale: Providers return different precision levels

Examples:

  • "1969" - Year only
  • "1969-09" - Year and month
  • "1969-09-26" - Full date

C# Mapping:

public string ReleaseDate { get; set; } // Stored as string, parsed in application
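
Since the date is parsed in the application, one possible approach is to try each of the three formats in turn and record the precision that matched. This is a sketch only; `ReleaseDateParser` and `DatePrecision` are hypothetical names, and missing month or day components fall back to 1.

```csharp
using System;
using System.Globalization;

public enum DatePrecision { Year, Month, Day }

public static class ReleaseDateParser
{
    private static readonly (string Format, DatePrecision Precision)[] Formats =
    {
        ("yyyy", DatePrecision.Year),
        ("yyyy-MM", DatePrecision.Month),
        ("yyyy-MM-dd", DatePrecision.Day),
    };

    // Returns false for null, empty, or any value outside the three known formats.
    public static bool TryParse(string? value, out DateTime date, out DatePrecision precision)
    {
        foreach (var (format, p) in Formats)
        {
            if (DateTime.TryParseExact(value, format, CultureInfo.InvariantCulture,
                                       DateTimeStyles.None, out date))
            {
                precision = p;
                return true;
            }
        }
        date = default;
        precision = default;
        return false;
    }
}
```

Keeping the precision alongside the parsed value lets callers avoid presenting "1969" as if it were 1 January 1969.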

Query Patterns

Artist Search with Images

SET LOCAL pg_trgm.similarity_threshold = 0.5;

SELECT 
    a.id,
    a.name,
    a.popularity,
    a.external_url,
    a.followers,
    a.genres,
    a.last_sync_time,
    i.url AS image_url,
    i.height AS image_height,
    i.width AS image_width
FROM spotify_artist a
LEFT JOIN spotify_artist_image i ON a.id = i.artist_id
WHERE lower(a.name) % lower(@searchTerm)
ORDER BY similarity(lower(a.name), lower(@searchTerm)) DESC
LIMIT 20 OFFSET @offset;

Dapper Mapping:

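// Joined rows arrive one per (artist, image) pair; group them by artist ID.
var artistDict = new Dictionary<string, SpotifyArtist>();

// The map callback fills artistDict as a side effect, so the returned sequence is not used directly.
var results = await connection.QueryAsync<SpotifyArtist, SpotifyArtistImage, SpotifyArtist>(
    sql,
    (artist, image) =>
    {
        // First row for this artist: register it with an empty image list.
        if (!artistDict.TryGetValue(artist.Id, out var existingArtist))
        {
            existingArtist = artist;
            existingArtist.Images = new List<SpotifyArtistImage>();
            artistDict.Add(artist.Id, existingArtist);
        }

        // The LEFT JOIN yields a null image for artists without any.
        if (image != null)
        {
            existingArtist.Images.Add(image);
        }

        return existingArtist;
    },
    new { searchTerm, offset },
    splitOn: "image_url" // first column belonging to SpotifyArtistImage
);

// LIMIT applies to joined rows, so a page can hold fewer than 20 distinct artists.
return artistDict.Values.ToList();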

Album with Artists

SELECT 
    a.id,
    a.name,
    a.popularity,
    a.external_url,
    a.label,
    a.release_date,
    a.total_tracks,
    a.album_type,
    a.copyright,
    a.last_sync_time,
    ar.id AS artist_id,
    ar.name AS artist_name
FROM spotify_album a
LEFT JOIN spotify_album_artist aa ON a.id = aa.album_id
LEFT JOIN spotify_artist ar ON aa.artist_id = ar.id
WHERE a.id = @albumId;

Multi-Mapping: Album with nested artist list.

Track with Album and Artists

SELECT 
    t.id,
    t.name,
    t.popularity,
    t.external_url,
    t.duration_ms,
    t.explicit,
    t.disc_number,
    t.track_number,
    t.label,
    t.last_sync_time,
    a.id AS album_id,
    a.name AS album_name,
    a.release_date AS album_release_date,
    ar.id AS artist_id,
    ar.name AS artist_name
FROM spotify_track t
LEFT JOIN spotify_album a ON t.album_id = a.id
LEFT JOIN spotify_track_artist ta ON t.id = ta.track_id
LEFT JOIN spotify_artist ar ON ta.artist_id = ar.id
WHERE t.id = @trackId;

Multi-Mapping: Track with nested album and artist list.

External ID Lookup

SELECT 
    a.id,
    a.name,
    a.popularity,
    a.external_url,
    a.label,
    a.release_date,
    a.total_tracks,
    a.album_type,
    a.last_sync_time
FROM spotify_album a
INNER JOIN spotify_album_externalid e ON a.id = e.album_id
WHERE e.type = 'upc' AND e.value = @upc;

Use Case: Find album by UPC barcode.

Index Strategy

Required Indexes

Fuzzy Search (GIN trigram):

CREATE INDEX idx_spotify_artist_name_trgm ON spotify_artist USING gin(lower(name) gin_trgm_ops);
CREATE INDEX idx_spotify_album_name_trgm ON spotify_album USING gin(lower(name) gin_trgm_ops);
CREATE INDEX idx_spotify_track_name_trgm ON spotify_track USING gin(lower(name) gin_trgm_ops);

Foreign Keys:

CREATE INDEX idx_spotify_album_artist_album ON spotify_album_artist(album_id);
CREATE INDEX idx_spotify_album_artist_artist ON spotify_album_artist(artist_id);
CREATE INDEX idx_spotify_track_album ON spotify_track(album_id);
CREATE INDEX idx_spotify_track_artist_track ON spotify_track_artist(track_id);
CREATE INDEX idx_spotify_track_artist_artist ON spotify_track_artist(artist_id);

External IDs:

CREATE INDEX idx_spotify_album_externalid_value ON spotify_album_externalid(value);
CREATE INDEX idx_spotify_track_externalid_value ON spotify_track_externalid(value);

Index Maintenance

Owned by: MiniMediaScanner (schema owner)
API Responsibility: None (read-only consumer)

Performance Impact:

  • GIN indexes: Large (2-3x table size), slow writes, fast reads
  • B-tree indexes: Moderate size, fast writes, fast reads
  • No index = full table scan (unacceptable for fuzzy search)

Data Freshness

Sync Mechanism: MiniMediaScanner polls provider APIs

Sync Frequency: Unknown (configured in MiniMediaScanner)

Staleness Indicator: last_sync_time column

API Behavior:

  • Returns whatever data exists in database
  • No real-time provider API calls
  • No cache invalidation
  • No sync triggering

Client Considerations:

  • Check lastSyncTime in response
  • Stale data possible (hours to days old)
  • No guarantee of completeness
  • Provider outages affect sync, not queries
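
A client that checks lastSyncTime might apply a rule like the one sketched below. The `Freshness` helper and the choice of maximum age are hypothetical, and the sketch assumes the timestamp is delivered in UTC.

```csharp
using System;

public static class Freshness
{
    // A record counts as stale when it has never been synced
    // or when its last sync is older than maxAge.
    public static bool IsStale(DateTime? lastSyncTimeUtc, TimeSpan maxAge, DateTime nowUtc)
    {
        if (lastSyncTimeUtc is null) return true;
        return nowUtc - lastSyncTimeUtc.Value > maxAge;
    }
}
```

Passing the current time in as a parameter, rather than reading `DateTime.UtcNow` inside the method, keeps the rule easy to test.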

Provider Feature Matrix

Feature Spotify Tidal MusicBrainz Deezer Discogs SoundCloud
Artist Data
Popularity ✓ (fans)
Followers ✓ (fans)
Genres
Images ✓ (avatar)
Sort Name
Aliases
Album Data
Popularity ✓ (fans) N/A
Images N/A
Label N/A
UPC N/A
Copyright N/A
Album Type N/A
Track Data
Popularity ✓ (playback_count)
Duration
Explicit
ISRC
Disc/Track #

Database Size Estimates

Assumptions:

  • 1 million artists
  • 10 million albums
  • 100 million tracks

Spotify Tables:

  • spotify_artist: ~500 MB
  • spotify_artist_image: ~200 MB
  • spotify_album: ~5 GB
  • spotify_album_artist: ~1 GB
  • spotify_album_image: ~2 GB
  • spotify_track: ~50 GB
  • spotify_track_artist: ~10 GB
  • Total: ~70 GB per provider

All Providers: ~420 GB (6 providers)

Indexes: ~200 GB (GIN indexes are large)

Total Database: ~620 GB for comprehensive catalog

Implications:

  • Requires substantial storage
  • Backup/restore time significant
  • Index rebuilds time-consuming
  • Connection pooling critical
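
The arithmetic behind these estimates can be reproduced with a few lines. This sketch only re-adds the figures listed above: the seven Spotify tables sum to 68.7 GB, which the estimate rounds to about 70 GB, and the `SizeEstimate` class is a hypothetical name.

```csharp
using System;
using System.Linq;

public static class SizeEstimate
{
    // Per-table estimates for Spotify, in GB, as listed above.
    public static readonly (string Table, decimal Gb)[] SpotifyTables =
    {
        ("spotify_artist", 0.5m),
        ("spotify_artist_image", 0.2m),
        ("spotify_album", 5m),
        ("spotify_album_artist", 1m),
        ("spotify_album_image", 2m),
        ("spotify_track", 50m),
        ("spotify_track_artist", 10m),
    };

    public static decimal SpotifyTotalGb => SpotifyTables.Sum(t => t.Gb);

    // Table data for all providers plus a separate index allowance.
    public static decimal TotalGb(int providers, decimal perProviderGb, decimal indexGb) =>
        providers * perProviderGb + indexGb;
}
```

`SizeEstimate.TotalGb(6, 70m, 200m)` gives 620, matching the total above. The figure assumes every provider is as large as Spotify, which is a simplification.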

Performance Considerations

Query Performance

Fuzzy Search:

  • With GIN index: 10-50ms for 20 results
  • Without index: 5-30 seconds (full table scan)
  • Threshold tuning affects result count and speed

ID Lookup:

  • With primary key: <1ms
  • With foreign key index: 1-5ms

Join Queries:

  • Album with artists: 5-20ms
  • Track with album and artists: 10-30ms
  • Depends on relationship cardinality

Optimization Strategies

Implemented:

  • GIN indexes for fuzzy search
  • B-tree indexes for foreign keys
  • Connection pooling
  • Parameterized queries (SQL injection prevention)

Missing:

  • Query result caching (Redis/Memcached)
  • Materialized views for complex joins
  • Partitioning for large tables
  • Read replicas for horizontal scaling

Bottlenecks

  1. GIN Index Size: Large memory footprint
  2. Fuzzy Search: CPU-intensive similarity calculations
  3. Multi-Provider Queries: 6 parallel database queries
  4. No Caching: Every request hits database
  5. Connection Pool Limit: 100 max connections per instance
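
The interaction between bottlenecks 3 and 5 can be illustrated with a fan-out sketch. Here `searchOne` is a hypothetical delegate standing in for a per-provider repository call, and `MultiProviderSearch` is not a class from the analysed project.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public static class MultiProviderSearch
{
    public static readonly string[] Providers =
        { "spotify", "tidal", "musicbrainz", "deezer", "discogs", "soundcloud" };

    // Starts one query per provider and awaits them together.
    // Each in-flight query holds its own pooled connection.
    public static async Task<Dictionary<string, IReadOnlyList<string>>> SearchAllAsync(
        string term,
        Func<string, string, Task<IReadOnlyList<string>>> searchOne)
    {
        var tasks = Providers
            .Select(async p => (Provider: p, Hits: await searchOne(p, term)))
            .ToArray();
        var results = await Task.WhenAll(tasks);
        return results.ToDictionary(r => r.Provider, r => r.Hits);
    }
}
```

If every request fans out like this, a pool of 100 connections serves roughly 16 concurrent requests (100 / 6) before callers start waiting on the pool.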

Data Integrity

Constraints:

  • Primary keys on all entity tables
  • Foreign keys for relationships
  • NOT NULL on critical fields (id, name)

No Constraints:

  • No unique constraints on names (duplicates allowed)
  • No check constraints on data ranges
  • No triggers for data validation

Orphan Prevention:

  • Foreign keys with CASCADE delete (assumed; the DDL above declares no ON DELETE action, and PostgreSQL defaults to NO ACTION)
  • Junction tables maintain referential integrity

Data Quality:

  • Depends entirely on MiniMediaScanner sync quality
  • No validation in this API
  • Garbage in, garbage out

Backup and Recovery

Responsibility: Database administrator (not API)

Recommended Strategy:

  • Daily full backups
  • Continuous WAL archiving
  • Point-in-time recovery capability
  • Backup retention: 30 days

Recovery Time:

  • Full restore: Hours (620 GB database)
  • Index rebuild: Hours (GIN indexes)
  • Sync from providers: Days to weeks

Schema Evolution

Change Process:

  1. MiniMediaScanner updates schema
  2. MiniMediaScanner deploys migration
  3. MiniMediaMetadataAPI updates models
  4. MiniMediaMetadataAPI redeploys

Risk: Breaking changes require coordinated deployment.

Mitigation:

  • Additive changes only (new columns, tables)
  • Deprecation period for removals
  • Version compatibility checks

No Automated Migration: API has no migration framework.
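
One way a version compatibility check could look, given that table and column names are hardcoded in the API's SQL, is a startup comparison between the columns the API expects and those actually present. In this sketch the `SchemaCheck` class and the expected-column manifest are hypothetical; the actual pairs would come from a query against `information_schema.columns`.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SchemaCheck
{
    // expected: table name -> columns referenced by the API's SQL (hypothetical manifest).
    // actual: (table, column) pairs, e.g. read from information_schema.columns.
    public static List<string> FindMissing(
        IReadOnlyDictionary<string, string[]> expected,
        IEnumerable<(string Table, string Column)> actual)
    {
        var present = new HashSet<(string, string)>(actual);
        return expected
            .SelectMany(kv => kv.Value.Select(col => (Table: kv.Key, Column: col)))
            .Where(tc => !present.Contains(tc))
            .Select(tc => $"{tc.Table}.{tc.Column}")
            .OrderBy(s => s, StringComparer.Ordinal)
            .ToList();
    }
}
```

An empty result means every referenced column exists. A non-empty one can fail the health check with the exact names that MiniMediaScanner changed.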