Alexander a1f6701bac feat: initial implementation of metadata aggregator
- gRPC service with MusicBrainz provider
- PostgreSQL schema with migrations
- Service layer with database-first caching
- Repository pattern for data access
- YAML configuration support
- Research documentation for 17 music metadata projects
2026-04-28 16:28:53 +02:00

# MiniMediaMetadataAPI - Data Layer Analysis
## Database Technology
**RDBMS:** PostgreSQL
**Driver:** Npgsql 10.0.2
**ORM:** Dapper 2.1.72 (micro-ORM)
**Extensions:** pg_trgm (trigram similarity search)
## Schema Ownership
**Critical Constraint:** This API does NOT own the database schema.
**Schema Owner:** MiniMediaScanner (separate project)
**API Role:** Read-only consumer
**Migration Strategy:** None (schema managed externally)
### Implications
**Pros:**
- Clear separation of concerns
- API doesn't need provider API credentials
- Simpler deployment (no migration coordination)
- Sync complexity isolated in MiniMediaScanner
**Cons:**
- No control over schema evolution
- Breaking changes in MiniMediaScanner break API
- Can't optimize schema for query patterns
- Data freshness depends on external sync schedule
**Coupling Points:**
- Table names hardcoded in SQL queries
- Column names hardcoded in Dapper mappings
- Foreign key relationships assumed in joins
- Data types must match C# model properties
## Connection Configuration
**Connection String Format:**
```
Host=localhost;
Database=minimediametadata;
Username=postgres;
Password=password;
MinPoolSize=5;
MaxPoolSize=100;
Timeout=30;
CommandTimeout=30;
```
**Pooling Settings:**
- **MinPoolSize:** 5 connections kept alive
- **MaxPoolSize:** 100 concurrent connections
- **Timeout:** 30 seconds to acquire connection
- **CommandTimeout:** 30 seconds for query execution
**Connection Lifecycle:**
- A connection is opened (borrowed from the pool) per repository method call
- Returned to pool after query completion
- No long-lived connections
- No transaction management (read-only)
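The pooling values can be read back from the connection string for a startup sanity check. The following is a minimal sketch that uses the BCL's generic `DbConnectionStringBuilder` rather than Npgsql's own builder; the class name `PoolSettings` and the consistency rule are illustrative assumptions, not part of the analyzed project.

```csharp
using System;
using System.Data.Common;
using System.Globalization;

public static class PoolSettings
{
    // Reads an integer-valued keyword from a key=value connection string.
    // Returns null when the keyword is absent or its value is not an integer.
    public static int? ReadInt(string connectionString, string keyword)
    {
        var builder = new DbConnectionStringBuilder { ConnectionString = connectionString };
        if (!builder.TryGetValue(keyword, out var raw))
            return null;

        string text = Convert.ToString(raw, CultureInfo.InvariantCulture) ?? "";
        if (int.TryParse(text, NumberStyles.Integer, CultureInfo.InvariantCulture, out int value))
            return value;
        return null;
    }

    // Hypothetical rule: both limits must be present and min must not exceed max.
    public static bool IsConsistent(string connectionString)
    {
        int? min = ReadInt(connectionString, "MinPoolSize");
        int? max = ReadInt(connectionString, "MaxPoolSize");
        return min is not null && max is not null && min >= 0 && max >= 1 && min <= max;
    }
}
```

Keyword lookup in `DbConnectionStringBuilder` is case-insensitive, so `ReadInt(cs, "maxpoolsize")` and `ReadInt(cs, "MaxPoolSize")` read the same setting.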
## Fuzzy Search Implementation
### pg_trgm Extension
**Purpose:** Trigram-based similarity search for fuzzy text matching
**Configuration:**
```sql
SET LOCAL pg_trgm.similarity_threshold = 0.5;
```
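**Threshold:** 0.5 (a similarity score of at least 0.5 is required; pg_trgm's default is 0.3). `SET LOCAL` lasts only until the end of the current transaction, so it only affects a search query that runs in the same transaction; issued on its own in autocommit mode, it has no effect.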
**Operators:**
- `%` - Similarity operator (returns true if similarity >= threshold)
- `similarity(text, text)` - Returns similarity score (0.0 to 1.0)
### Search Query Pattern
**Example (Artist Search):**
```sql
SET LOCAL pg_trgm.similarity_threshold = 0.5;
SELECT
id,
name,
popularity,
external_url,
followers,
genres,
last_sync_time
FROM spotify_artist
WHERE lower(name) % lower(@searchTerm)
ORDER BY similarity(lower(name), lower(@searchTerm)) DESC
LIMIT 20 OFFSET @offset;
```
**Key Features:**
- Case-insensitive matching (`lower()`)
- Similarity-based ordering (best matches first)
- Pagination support (LIMIT/OFFSET)
- Threshold filtering (only >= 50% similarity)
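The `@offset` parameter has to be derived from whatever paging input the API accepts. A small sketch, assuming 1-based page numbers and the fixed page size of 20 from the query; the `Paging` class and the 1-based convention are assumptions, since the source does not say how paging is exposed.

```csharp
using System;

public static class Paging
{
    // Matches the LIMIT 20 in the search query.
    public const int PageSize = 20;

    // Converts a 1-based page number into the value bound to @offset.
    public static int OffsetFor(int page)
    {
        if (page < 1)
            throw new ArgumentOutOfRangeException(nameof(page), "Pages are 1-based.");
        // checked() turns an overflow on absurd page numbers into an exception.
        return checked((page - 1) * PageSize);
    }
}
```

With OFFSET paging PostgreSQL still has to produce and discard the skipped rows, so deep pages cost more than early ones.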
**Performance:**
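- Requires a GIN or GiST trigram index on the searched expression, here `lower(name)`
- Index creation: `CREATE INDEX idx_artist_name_trgm ON spotify_artist USING gin(lower(name) gin_trgm_ops);`
- Query time: with the index, the `%` filter narrows the scan to candidate rows that are then rechecked; without it, every row is scanned (O(n)). The `ORDER BY similarity(...)` sort still runs over all matching rows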
### Similarity Scoring
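**Algorithm:** Trigram overlap (shared trigrams divided by the size of the union of both trigram sets)
**Example:**
```
Each word is lowercased and padded with two leading spaces and one trailing space.
"Beatles" vs "Beetles"
Trigrams: ["  b", " be", "bea", "eat", "atl", "tle", "les", "es "] vs ["  b", " be", "bee", "eet", "etl", "tle", "les", "es "]
Shared: ["  b", " be", "tle", "les", "es "] = 5; union = 8 + 8 - 5 = 11; similarity = 5/11 = 0.45 (below threshold)
"Beatles" vs "The Beatles"
Trigrams: ["  b", " be", "bea", "eat", "atl", "tle", "les", "es "] vs ["  t", " th", "the", "he ", "  b", " be", "bea", "eat", "atl", "tle", "les", "es "]
Shared: all 8 trigrams of "beatles"; union = 12; similarity = 8/12 = 0.67 (above threshold)
```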
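The scoring rule can be reproduced outside the database to reason about which inputs clear the threshold. This is a sketch that approximates pg_trgm's documented rules (lowercase, split on non-alphanumeric characters, pad each word with two leading spaces and one trailing space); it is not the extension's actual code, and the name `TrigramSketch` is hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

public static class TrigramSketch
{
    // Collects the distinct trigrams of every padded word in the text.
    public static HashSet<string> Trigrams(string text)
    {
        var result = new HashSet<string>(StringComparer.Ordinal);
        var word = new StringBuilder();

        // Emits the trigrams of the word collected so far, then resets it.
        void Flush()
        {
            if (word.Length == 0) return;
            string padded = "  " + word.ToString() + " ";
            for (int i = 0; i + 3 <= padded.Length; i++)
                result.Add(padded.Substring(i, 3));
            word.Clear();
        }

        foreach (char c in text.ToLowerInvariant())
        {
            if (char.IsLetterOrDigit(c)) word.Append(c);
            else Flush(); // any other character ends the current word
        }
        Flush();
        return result;
    }

    // Shared trigrams divided by the size of the union of both sets.
    public static double Similarity(string a, string b)
    {
        HashSet<string> ta = Trigrams(a);
        HashSet<string> tb = Trigrams(b);
        int shared = ta.Count(t => tb.Contains(t));
        int union = ta.Count + tb.Count - shared;
        return union == 0 ? 0.0 : (double)shared / union;
    }
}
```

Comparing `TrigramSketch.Similarity(name, searchTerm)` against 0.5 gives a rough preview of whether a row would pass the `%` filter; locale and multibyte handling in PostgreSQL may differ from this sketch.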
**Tuning:**
- Lower threshold (0.3) = more results, more false positives
- Higher threshold (0.7) = fewer results, more precision
- Current 0.5 = balanced approach
## Database Schema
### Provider-Specific Tables
Each provider has its own isolated set of tables, with no cross-provider foreign keys.
### Spotify Schema
**Tables:**
1. `spotify_artist` - Artist metadata
2. `spotify_artist_image` - Artist images (1:N)
3. `spotify_album` - Album metadata
4. `spotify_album_artist` - Album-artist relationships (M:N)
5. `spotify_album_image` - Album artwork (1:N)
6. `spotify_album_externalid` - External identifiers (UPC, EAN) (1:N)
7. `spotify_track` - Track metadata
8. `spotify_track_artist` - Track-artist relationships (M:N)
9. `spotify_track_externalid` - External identifiers (ISRC) (1:N)
**spotify_artist:**
```sql
CREATE TABLE spotify_artist (
id VARCHAR(255) PRIMARY KEY,
name VARCHAR(500) NOT NULL,
popularity INTEGER,
external_url VARCHAR(500),
followers INTEGER,
genres TEXT[], -- PostgreSQL array
last_sync_time TIMESTAMP WITH TIME ZONE
);
CREATE INDEX idx_spotify_artist_name_trgm ON spotify_artist USING gin(lower(name) gin_trgm_ops);
```
**spotify_artist_image:**
```sql
CREATE TABLE spotify_artist_image (
id SERIAL PRIMARY KEY,
artist_id VARCHAR(255) REFERENCES spotify_artist(id),
url VARCHAR(1000) NOT NULL,
height INTEGER,
width INTEGER
);
CREATE INDEX idx_spotify_artist_image_artist ON spotify_artist_image(artist_id);
```
**spotify_album:**
```sql
CREATE TABLE spotify_album (
id VARCHAR(255) PRIMARY KEY,
name VARCHAR(500) NOT NULL,
popularity INTEGER,
external_url VARCHAR(500),
label VARCHAR(500),
release_date VARCHAR(50), -- Stored as string (YYYY, YYYY-MM, or YYYY-MM-DD)
total_tracks INTEGER,
album_type VARCHAR(50), -- album, single, compilation
copyright TEXT,
last_sync_time TIMESTAMP WITH TIME ZONE
);
CREATE INDEX idx_spotify_album_name_trgm ON spotify_album USING gin(lower(name) gin_trgm_ops);
```
**spotify_album_artist (junction table):**
```sql
CREATE TABLE spotify_album_artist (
id SERIAL PRIMARY KEY,
album_id VARCHAR(255) REFERENCES spotify_album(id),
artist_id VARCHAR(255) REFERENCES spotify_artist(id)
);
CREATE INDEX idx_spotify_album_artist_album ON spotify_album_artist(album_id);
CREATE INDEX idx_spotify_album_artist_artist ON spotify_album_artist(artist_id);
```
**spotify_track:**
```sql
CREATE TABLE spotify_track (
id VARCHAR(255) PRIMARY KEY,
name VARCHAR(500) NOT NULL,
album_id VARCHAR(255) REFERENCES spotify_album(id),
popularity INTEGER,
external_url VARCHAR(500),
duration_ms INTEGER,
explicit BOOLEAN,
disc_number INTEGER,
track_number INTEGER,
label VARCHAR(500),
last_sync_time TIMESTAMP WITH TIME ZONE
);
CREATE INDEX idx_spotify_track_name_trgm ON spotify_track USING gin(lower(name) gin_trgm_ops);
CREATE INDEX idx_spotify_track_album ON spotify_track(album_id);
```
**spotify_album_externalid:**
```sql
CREATE TABLE spotify_album_externalid (
id SERIAL PRIMARY KEY,
album_id VARCHAR(255) REFERENCES spotify_album(id),
type VARCHAR(50), -- upc, ean
value VARCHAR(255)
);
CREATE INDEX idx_spotify_album_externalid_album ON spotify_album_externalid(album_id);
CREATE INDEX idx_spotify_album_externalid_value ON spotify_album_externalid(value);
```
**spotify_track_externalid:**
```sql
CREATE TABLE spotify_track_externalid (
id SERIAL PRIMARY KEY,
track_id VARCHAR(255) REFERENCES spotify_track(id),
type VARCHAR(50), -- isrc
value VARCHAR(255)
);
CREATE INDEX idx_spotify_track_externalid_track ON spotify_track_externalid(track_id);
CREATE INDEX idx_spotify_track_externalid_value ON spotify_track_externalid(value);
```
### Tidal Schema
**Tables:**
1. `tidal_artist` - Artist metadata
2. `tidal_artist_image_link` - Artist image URLs (1:N)
3. `tidal_album` - Album metadata
4. `tidal_album_external_link` - External URLs (1:N)
5. `tidal_album_image` - Album artwork (1:N)
6. `tidal_track` - Track metadata
7. `tidal_track_artist` - Track-artist relationships (M:N)
8. `tidal_track_external_link` - External URLs (1:N)
**Key Differences from Spotify:**
- ID type: INTEGER instead of VARCHAR
- No popularity field
- No genres field
- External links instead of external IDs
- Image links stored as separate table
**tidal_artist:**
```sql
CREATE TABLE tidal_artist (
id INTEGER PRIMARY KEY,
name VARCHAR(500) NOT NULL,
url VARCHAR(500),
last_sync_time TIMESTAMP WITH TIME ZONE
);
CREATE INDEX idx_tidal_artist_name_trgm ON tidal_artist USING gin(lower(name) gin_trgm_ops);
```
**tidal_album:**
```sql
CREATE TABLE tidal_album (
id INTEGER PRIMARY KEY,
name VARCHAR(500) NOT NULL,
artist_id INTEGER REFERENCES tidal_artist(id),
url VARCHAR(500),
release_date VARCHAR(50),
total_tracks INTEGER,
duration INTEGER, -- Total duration in seconds
explicit BOOLEAN,
upc VARCHAR(255),
copyright TEXT,
last_sync_time TIMESTAMP WITH TIME ZONE
);
CREATE INDEX idx_tidal_album_name_trgm ON tidal_album USING gin(lower(name) gin_trgm_ops);
CREATE INDEX idx_tidal_album_artist ON tidal_album(artist_id);
```
### MusicBrainz Schema
**Tables:**
1. `musicbrainz_artist` - Artist metadata
2. `musicbrainz_release` - Release (album) metadata
3. `musicbrainz_release_label` - Release-label relationships (M:N)
4. `musicbrainz_label` - Label metadata
5. `musicbrainz_release_track` - Track metadata
6. `musicbrainz_release_track_artist` - Track-artist relationships (M:N)
**Key Differences:**
- ID type: UUID (Guid)
- "Release" instead of "Album"
- Sort name field for artists
- Label as separate entity
- No popularity or follower counts
- No images (stored externally via Cover Art Archive)
**musicbrainz_artist:**
```sql
CREATE TABLE musicbrainz_artist (
id UUID PRIMARY KEY,
name VARCHAR(500) NOT NULL,
sort_name VARCHAR(500), -- For alphabetical sorting (e.g., "Beatles, The")
type VARCHAR(100), -- Person, Group, Orchestra, etc.
country VARCHAR(2), -- ISO country code
last_sync_time TIMESTAMP WITH TIME ZONE
);
CREATE INDEX idx_musicbrainz_artist_name_trgm ON musicbrainz_artist USING gin(lower(name) gin_trgm_ops);
```
**musicbrainz_release:**
```sql
CREATE TABLE musicbrainz_release (
id UUID PRIMARY KEY,
name VARCHAR(500) NOT NULL,
artist_id UUID REFERENCES musicbrainz_artist(id),
release_date VARCHAR(50),
country VARCHAR(2),
barcode VARCHAR(255), -- Similar to UPC
status VARCHAR(100), -- Official, Promotion, Bootleg, etc.
packaging VARCHAR(100), -- Jewel Case, Digipak, etc.
last_sync_time TIMESTAMP WITH TIME ZONE
);
CREATE INDEX idx_musicbrainz_release_name_trgm ON musicbrainz_release USING gin(lower(name) gin_trgm_ops);
CREATE INDEX idx_musicbrainz_release_artist ON musicbrainz_release(artist_id);
```
**musicbrainz_label:**
```sql
CREATE TABLE musicbrainz_label (
id UUID PRIMARY KEY,
name VARCHAR(500) NOT NULL,
type VARCHAR(100), -- Original Production, Bootleg Production, etc.
country VARCHAR(2),
last_sync_time TIMESTAMP WITH TIME ZONE
);
```
**musicbrainz_release_label (junction table):**
```sql
CREATE TABLE musicbrainz_release_label (
id SERIAL PRIMARY KEY,
release_id UUID REFERENCES musicbrainz_release(id),
label_id UUID REFERENCES musicbrainz_label(id),
catalog_number VARCHAR(255)
);
CREATE INDEX idx_musicbrainz_release_label_release ON musicbrainz_release_label(release_id);
CREATE INDEX idx_musicbrainz_release_label_label ON musicbrainz_release_label(label_id);
```
### Deezer Schema
**Tables:**
1. `deezer_artist` - Artist metadata
2. `deezer_artist_image_link` - Artist image URLs (1:N)
3. `deezer_album` - Album metadata
4. `deezer_album_image_link` - Album artwork URLs (1:N)
5. `deezer_album_artist` - Album-artist relationships (M:N)
6. `deezer_track` - Track metadata
7. `deezer_track_artist` - Track-artist relationships (M:N)
**Key Differences:**
- ID type: BIGINT
- Has popularity (called "fans")
- Has genres
- No UPC/ISRC fields
- No label information
**deezer_artist:**
```sql
CREATE TABLE deezer_artist (
id BIGINT PRIMARY KEY,
name VARCHAR(500) NOT NULL,
url VARCHAR(500),
fans INTEGER, -- Similar to followers
last_sync_time TIMESTAMP WITH TIME ZONE
);
CREATE INDEX idx_deezer_artist_name_trgm ON deezer_artist USING gin(lower(name) gin_trgm_ops);
```
**deezer_album:**
```sql
CREATE TABLE deezer_album (
id BIGINT PRIMARY KEY,
name VARCHAR(500) NOT NULL,
url VARCHAR(500),
release_date VARCHAR(50),
total_tracks INTEGER,
duration INTEGER, -- Total duration in seconds
explicit BOOLEAN,
fans INTEGER,
genres TEXT[], -- PostgreSQL array
last_sync_time TIMESTAMP WITH TIME ZONE
);
CREATE INDEX idx_deezer_album_name_trgm ON deezer_album USING gin(lower(name) gin_trgm_ops);
```
### Discogs Schema
**Tables:**
1. `discogs_artist` - Artist metadata
2. `discogs_artist_alias` - Artist aliases (1:N)
3. `discogs_artist_url` - Artist URLs (1:N)
4. `discogs_release` - Release metadata
5. `discogs_release_artist` - Release-artist relationships (M:N)
6. `discogs_release_identifier` - Barcodes/identifiers (1:N)
7. `discogs_release_track` - Track metadata
8. `discogs_label` - Label metadata
9. `discogs_label_sublabel` - Label hierarchy (1:N)
10. `discogs_label_url` - Label URLs (1:N)
**Key Differences:**
- ID type: INTEGER
- Most comprehensive label data
- Artist aliases tracked
- Multiple identifiers per release (Barcode, Matrix, etc.)
- No popularity metrics
- No image URLs (stored externally)
**discogs_artist:**
```sql
CREATE TABLE discogs_artist (
id INTEGER PRIMARY KEY,
name VARCHAR(500) NOT NULL,
real_name VARCHAR(500), -- For pseudonyms
profile TEXT, -- Biography
last_sync_time TIMESTAMP WITH TIME ZONE
);
CREATE INDEX idx_discogs_artist_name_trgm ON discogs_artist USING gin(lower(name) gin_trgm_ops);
```
**discogs_artist_alias:**
```sql
CREATE TABLE discogs_artist_alias (
id SERIAL PRIMARY KEY,
artist_id INTEGER REFERENCES discogs_artist(id),
alias_name VARCHAR(500)
);
CREATE INDEX idx_discogs_artist_alias_artist ON discogs_artist_alias(artist_id);
CREATE INDEX idx_discogs_artist_alias_name_trgm ON discogs_artist_alias USING gin(lower(alias_name) gin_trgm_ops);
```
**discogs_release:**
```sql
CREATE TABLE discogs_release (
id INTEGER PRIMARY KEY,
name VARCHAR(500) NOT NULL,
released VARCHAR(50),
country VARCHAR(100),
notes TEXT,
genres TEXT[],
styles TEXT[], -- More specific than genres
last_sync_time TIMESTAMP WITH TIME ZONE
);
CREATE INDEX idx_discogs_release_name_trgm ON discogs_release USING gin(lower(name) gin_trgm_ops);
```
**discogs_release_identifier:**
```sql
CREATE TABLE discogs_release_identifier (
id SERIAL PRIMARY KEY,
release_id INTEGER REFERENCES discogs_release(id),
type VARCHAR(100), -- Barcode, Matrix/Runout, Label Code, etc.
value VARCHAR(500),
description TEXT
);
CREATE INDEX idx_discogs_release_identifier_release ON discogs_release_identifier(release_id);
CREATE INDEX idx_discogs_release_identifier_value ON discogs_release_identifier(value);
```
**discogs_label:**
```sql
CREATE TABLE discogs_label (
id INTEGER PRIMARY KEY,
name VARCHAR(500) NOT NULL,
contact_info TEXT,
profile TEXT,
parent_label_id INTEGER REFERENCES discogs_label(id),
last_sync_time TIMESTAMP WITH TIME ZONE
);
CREATE INDEX idx_discogs_label_name_trgm ON discogs_label USING gin(lower(name) gin_trgm_ops);
```
### SoundCloud Schema
**Tables:**
1. `soundcloud_user` - User/artist metadata
2. `soundcloud_playlist` - Playlist metadata
3. `soundcloud_track` - Track metadata
4. `soundcloud_track_artist` - Track-artist relationships (M:N)
**Key Differences:**
- "User" instead of "Artist" (user-generated content platform)
- Playlist as first-class entity
- No album concept
- Minimal metadata (no UPC, ISRC, labels)
- ID type: BIGINT
**soundcloud_user:**
```sql
CREATE TABLE soundcloud_user (
id BIGINT PRIMARY KEY,
username VARCHAR(500) NOT NULL,
full_name VARCHAR(500),
url VARCHAR(500),
avatar_url VARCHAR(1000),
followers_count INTEGER,
last_sync_time TIMESTAMP WITH TIME ZONE
);
CREATE INDEX idx_soundcloud_user_username_trgm ON soundcloud_user USING gin(lower(username) gin_trgm_ops);
```
**soundcloud_playlist:**
```sql
CREATE TABLE soundcloud_playlist (
id BIGINT PRIMARY KEY,
title VARCHAR(500) NOT NULL,
user_id BIGINT REFERENCES soundcloud_user(id),
url VARCHAR(500),
artwork_url VARCHAR(1000),
duration INTEGER, -- Total duration in milliseconds
track_count INTEGER,
last_sync_time TIMESTAMP WITH TIME ZONE
);
CREATE INDEX idx_soundcloud_playlist_title_trgm ON soundcloud_playlist USING gin(lower(title) gin_trgm_ops);
CREATE INDEX idx_soundcloud_playlist_user ON soundcloud_playlist(user_id);
```
**soundcloud_track:**
```sql
CREATE TABLE soundcloud_track (
id BIGINT PRIMARY KEY,
title VARCHAR(500) NOT NULL,
user_id BIGINT REFERENCES soundcloud_user(id),
url VARCHAR(500),
artwork_url VARCHAR(1000),
duration INTEGER, -- Duration in milliseconds
genre VARCHAR(255),
playback_count INTEGER,
last_sync_time TIMESTAMP WITH TIME ZONE
);
CREATE INDEX idx_soundcloud_track_title_trgm ON soundcloud_track USING gin(lower(title) gin_trgm_ops);
CREATE INDEX idx_soundcloud_track_user ON soundcloud_track(user_id);
```
## ID Type Comparison
| Provider | Artist ID | Album ID | Track ID | Notes |
|----------|-----------|----------|----------|-------|
| Spotify | VARCHAR(255) | VARCHAR(255) | VARCHAR(255) | Base62 encoded (22 chars) |
| Tidal | INTEGER | INTEGER | INTEGER | Sequential integers |
| MusicBrainz | UUID | UUID | UUID | RFC 4122 UUIDs |
| Deezer | BIGINT | BIGINT | BIGINT | Large integers |
| Discogs | INTEGER | INTEGER | INTEGER | Sequential integers |
| SoundCloud | BIGINT | N/A | BIGINT | No album concept |
**Implications:**
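- Provider-native IDs cannot be used across providers; cross-provider matching has to go through shared identifiers such as UPC or ISRC where a provider stores them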
- ID parameter must match provider type
- C# models use provider-specific types
- No universal identifier system
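Because each provider's ID column has a different type, an incoming ID string has to be validated against the provider before it is bound as a query parameter. A sketch of that dispatch, following the column types in the table above; the `Provider` enum and `ProviderIds` class are hypothetical names.

```csharp
using System;
using System.Globalization;
using System.Linq;

public enum Provider { Spotify, Tidal, MusicBrainz, Deezer, Discogs, SoundCloud }

public static class ProviderIds
{
    // Parses a raw ID into the CLR type matching the provider's column type.
    // Returns false when the text cannot be an ID for that provider.
    public static bool TryParse(Provider provider, string raw, out object? id)
    {
        id = null;
        switch (provider)
        {
            case Provider.Spotify: // VARCHAR: 22 base62 characters
                if (raw.Length == 22 && raw.All(IsBase62)) id = raw;
                break;
            case Provider.Tidal:
            case Provider.Discogs: // INTEGER
                if (int.TryParse(raw, NumberStyles.None, CultureInfo.InvariantCulture, out int i)) id = i;
                break;
            case Provider.Deezer:
            case Provider.SoundCloud: // BIGINT
                if (long.TryParse(raw, NumberStyles.None, CultureInfo.InvariantCulture, out long l)) id = l;
                break;
            case Provider.MusicBrainz: // UUID in the hyphenated 8-4-4-4-12 form
                if (Guid.TryParseExact(raw, "D", out Guid g)) id = g;
                break;
        }
        return id is not null;
    }

    private static bool IsBase62(char c) =>
        (c >= '0' && c <= '9') || (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
}
```

Rejecting a malformed ID before the query runs avoids a round trip and a PostgreSQL type error for inputs such as a UUID sent to an INTEGER-keyed provider.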
## Data Type Patterns
### Arrays (PostgreSQL Native)
**Usage:** Genres, styles (external IDs live in separate `*_externalid` tables, not arrays)
**Example:**
```sql
genres TEXT[] -- e.g. '{rock,pop,alternative}'
```
**Dapper Mapping:**
```csharp
public class SpotifyArtist
{
public string[] Genres { get; set; } // Npgsql returns TEXT[] as string[]; Dapper assigns it as-is
}
```
### Timestamps
**Type:** `TIMESTAMP WITH TIME ZONE`
**Purpose:** Track last sync time from provider
**Example:**
```sql
last_sync_time TIMESTAMP WITH TIME ZONE DEFAULT NOW()
```
**C# Mapping:**
```csharp
public DateTime? LastSyncTime { get; set; }
```
### Variable-Length Dates
**Type:** VARCHAR(50)
**Formats:** YYYY, YYYY-MM, YYYY-MM-DD
**Rationale:** Providers return different precision levels
**Examples:**
- `"1969"` - Year only
- `"1969-09"` - Year and month
- `"1969-09-26"` - Full date
**C# Mapping:**
```csharp
public string ReleaseDate { get; set; } // Stored as string, parsed in application
```
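Parsing in the application means handling all three precisions without inventing a month or day that the provider never supplied. A sketch of one way to do that; `PartialDate` and `PartialDates` are hypothetical types, not taken from the analyzed project.

```csharp
using System;
using System.Globalization;

// Year is always present; Month and Day are null when the provider omitted them.
public readonly record struct PartialDate(int Year, int? Month, int? Day);

public static class PartialDates
{
    // Accepts "YYYY", "YYYY-MM" or "YYYY-MM-DD"; rejects everything else.
    public static bool TryParse(string? value, out PartialDate result)
    {
        result = default;
        if (string.IsNullOrWhiteSpace(value)) return false;

        string[] parts = value.Trim().Split('-');
        if (parts.Length > 3) return false;
        if (!TryNumber(parts[0], 4, out int year) || year < 1) return false;

        int? month = null, day = null;
        if (parts.Length >= 2)
        {
            if (!TryNumber(parts[1], 2, out int m) || m < 1 || m > 12) return false;
            month = m;
        }
        if (parts.Length == 3)
        {
            // DaysInMonth also rejects dates such as February 30.
            if (!TryNumber(parts[2], 2, out int d) || d < 1 || d > DateTime.DaysInMonth(year, month!.Value))
                return false;
            day = d;
        }
        result = new PartialDate(year, month, day);
        return true;
    }

    // Accepts exactly `digits` decimal digits and nothing else.
    private static bool TryNumber(string text, int digits, out int number)
    {
        number = 0;
        return text.Length == digits
            && int.TryParse(text, NumberStyles.None, CultureInfo.InvariantCulture, out number);
    }
}
```

Keeping the precision explicit lets a consumer distinguish "1969" from "1969-01-01", which a plain `DateTime` conversion would conflate.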
## Query Patterns
### Artist Search
```sql
SET LOCAL pg_trgm.similarity_threshold = 0.5;
SELECT
a.id,
a.name,
a.popularity,
a.external_url,
a.followers,
a.genres,
a.last_sync_time,
i.url AS image_url,
i.height AS image_height,
i.width AS image_width
FROM spotify_artist a
LEFT JOIN spotify_artist_image i ON a.id = i.artist_id
WHERE lower(a.name) % lower(@searchTerm)
ORDER BY similarity(lower(a.name), lower(@searchTerm)) DESC
LIMIT 20 OFFSET @offset;
```
**Dapper Mapping:**
```csharp
var artistDict = new Dictionary<string, SpotifyArtist>();
var results = await connection.QueryAsync<SpotifyArtist, SpotifyArtistImage, SpotifyArtist>(
sql,
(artist, image) =>
{
if (!artistDict.TryGetValue(artist.Id, out var existingArtist))
{
existingArtist = artist;
existingArtist.Images = new List<SpotifyArtistImage>();
artistDict.Add(artist.Id, existingArtist);
}
if (image != null)
{
existingArtist.Images.Add(image);
}
return existingArtist;
},
new { searchTerm, offset },
splitOn: "image_url"
);
return artistDict.Values.ToList();
```
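The folding step can be shown without Dapper or a database. The sketch below collapses one-row-per-image join output into one object per artist and keeps a separate list for ordering, because `Dictionary<TKey, TValue>` does not guarantee enumeration order; all type names here are hypothetical stand-ins for the project's models.

```csharp
using System.Collections.Generic;

// One row of the LEFT JOIN result; ImageUrl is null for artists without images.
public sealed record ArtistImageRow(string ArtistId, string ArtistName, string? ImageUrl);

public sealed class ArtistWithImages
{
    public ArtistWithImages(string id, string name) { Id = id; Name = name; }
    public string Id { get; }
    public string Name { get; }
    public List<string> ImageUrls { get; } = new();
}

public static class JoinFolding
{
    // Groups rows by artist, keeping artists in first-seen order
    // (the similarity order produced by the query's ORDER BY).
    public static List<ArtistWithImages> Fold(IEnumerable<ArtistImageRow> rows)
    {
        var ordered = new List<ArtistWithImages>();
        var byId = new Dictionary<string, ArtistWithImages>();
        foreach (var row in rows)
        {
            if (!byId.TryGetValue(row.ArtistId, out var artist))
            {
                artist = new ArtistWithImages(row.ArtistId, row.ArtistName);
                byId.Add(row.ArtistId, artist);
                ordered.Add(artist);
            }
            if (row.ImageUrl is not null)
                artist.ImageUrls.Add(row.ImageUrl);
        }
        return ordered;
    }
}
```

Returning the ordered list rather than the dictionary's values makes the best-match-first ordering explicit instead of relying on implementation behavior.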
### Album with Artists
```sql
SELECT
a.id,
a.name,
a.popularity,
a.external_url,
a.label,
a.release_date,
a.total_tracks,
a.album_type,
a.copyright,
a.last_sync_time,
ar.id AS artist_id,
ar.name AS artist_name
FROM spotify_album a
LEFT JOIN spotify_album_artist aa ON a.id = aa.album_id
LEFT JOIN spotify_artist ar ON aa.artist_id = ar.id
WHERE a.id = @albumId;
```
**Multi-Mapping:** The join returns one row per album artist; the rows are folded into a single album with a nested artist list.
### Track with Album and Artists
```sql
SELECT
t.id,
t.name,
t.popularity,
t.external_url,
t.duration_ms,
t.explicit,
t.disc_number,
t.track_number,
t.label,
t.last_sync_time,
a.id AS album_id,
a.name AS album_name,
a.release_date AS album_release_date,
ar.id AS artist_id,
ar.name AS artist_name
FROM spotify_track t
LEFT JOIN spotify_album a ON t.album_id = a.id
LEFT JOIN spotify_track_artist ta ON t.id = ta.track_id
LEFT JOIN spotify_artist ar ON ta.artist_id = ar.id
WHERE t.id = @trackId;
```
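**Multi-Mapping:** The join returns one row per track artist, each repeating the album columns; the rows are folded into a single track with one nested album and an artist list.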
### External ID Lookup
```sql
SELECT
a.id,
a.name,
a.popularity,
a.external_url,
a.label,
a.release_date,
a.total_tracks,
a.album_type,
a.last_sync_time
FROM spotify_album a
INNER JOIN spotify_album_externalid e ON a.id = e.album_id
WHERE e.type = 'upc' AND e.value = @upc;
```
**Use Case:** Find album by UPC barcode.
## Index Strategy
### Required Indexes
**Fuzzy Search (GIN trigram):**
```sql
CREATE INDEX idx_spotify_artist_name_trgm ON spotify_artist USING gin(lower(name) gin_trgm_ops);
CREATE INDEX idx_spotify_album_name_trgm ON spotify_album USING gin(lower(name) gin_trgm_ops);
CREATE INDEX idx_spotify_track_name_trgm ON spotify_track USING gin(lower(name) gin_trgm_ops);
```
**Foreign Keys:**
```sql
CREATE INDEX idx_spotify_album_artist_album ON spotify_album_artist(album_id);
CREATE INDEX idx_spotify_album_artist_artist ON spotify_album_artist(artist_id);
CREATE INDEX idx_spotify_track_album ON spotify_track(album_id);
CREATE INDEX idx_spotify_track_artist_track ON spotify_track_artist(track_id);
CREATE INDEX idx_spotify_track_artist_artist ON spotify_track_artist(artist_id);
```
**External IDs:**
```sql
CREATE INDEX idx_spotify_album_externalid_value ON spotify_album_externalid(value);
CREATE INDEX idx_spotify_track_externalid_value ON spotify_track_externalid(value);
```
### Index Maintenance
**Owned by:** MiniMediaScanner (schema owner)
**API Responsibility:** None (read-only consumer)
**Performance Impact:**
- GIN indexes: Large (2-3x table size), slow writes, fast reads
- B-tree indexes: Moderate size, fast writes, fast reads
- No index = full table scan (unacceptable for fuzzy search)
## Data Freshness
**Sync Mechanism:** MiniMediaScanner polls provider APIs
**Sync Frequency:** Unknown (configured in MiniMediaScanner)
**Staleness Indicator:** `last_sync_time` column
**API Behavior:**
- Returns whatever data exists in database
- No real-time provider API calls
- No cache invalidation
- No sync triggering
**Client Considerations:**
- Check `lastSyncTime` in response
- Stale data possible (hours to days old)
- No guarantee of completeness
- Provider outages affect sync, not queries
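A client that cares about freshness can apply its own tolerance to the returned sync timestamp. A minimal sketch, assuming the client treats a missing timestamp as stale; the `Freshness` class and the choice of tolerance are illustrative.

```csharp
using System;

public static class Freshness
{
    // A record is stale when it was never synced, or when its last sync is
    // older than maxAge relative to the supplied clock reading.
    public static bool IsStale(DateTimeOffset? lastSyncTime, TimeSpan maxAge, DateTimeOffset now)
        => lastSyncTime is null || now - lastSyncTime.Value > maxAge;
}
```

Passing `now` as a parameter instead of reading the system clock keeps the check deterministic in tests.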
## Provider Feature Matrix
| Feature | Spotify | Tidal | MusicBrainz | Deezer | Discogs | SoundCloud |
|---------|---------|-------|-------------|--------|---------|------------|
| **Artist Data** |
| Popularity | ✓ | ✗ | ✗ | ✓ (fans) | ✗ | ✗ |
| Followers | ✓ | ✗ | ✗ | ✓ (fans) | ✗ | ✓ |
| Genres | ✓ | ✗ | ✓ | ✓ | ✗ | ✗ |
| Images | ✓ | ✓ | ✗ | ✓ | ✗ | ✓ (avatar) |
| Sort Name | ✗ | ✗ | ✓ | ✗ | ✗ | ✗ |
| Aliases | ✗ | ✗ | ✗ | ✗ | ✓ | ✗ |
| **Album Data** |
| Popularity | ✓ | ✗ | ✗ | ✓ (fans) | ✗ | N/A |
| Images | ✓ | ✓ | ✗ | ✓ | ✗ | N/A |
| Label | ✓ | ✗ | ✓ | ✗ | ✓ | N/A |
| UPC | ✓ | ✓ | ✗ | ✗ | ✓ | N/A |
| Copyright | ✓ | ✓ | ✗ | ✗ | ✗ | N/A |
| Album Type | ✓ | ✗ | ✓ | ✗ | ✗ | N/A |
| **Track Data** |
| Popularity | ✓ | ✗ | ✗ | ✓ | ✗ | ✓ (playback_count) |
| Duration | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Explicit | ✓ | ✓ | ✗ | ✓ | ✗ | ✗ |
| ISRC | ✓ | ✓ | ✓ | ✗ | ✗ | ✗ |
| Disc/Track # | ✓ | ✓ | ✓ | ✓ | ✓ | ✗ |
## Database Size Estimates
**Assumptions:**
- 1 million artists
- 10 million albums
- 100 million tracks
**Spotify Tables:**
- `spotify_artist`: ~500 MB
- `spotify_artist_image`: ~200 MB
- `spotify_album`: ~5 GB
- `spotify_album_artist`: ~1 GB
- `spotify_album_image`: ~2 GB
- `spotify_track`: ~50 GB
- `spotify_track_artist`: ~10 GB
- **Total:** ~70 GB per provider
**All Providers:** ~420 GB (6 providers)
**Indexes:** ~200 GB (GIN indexes are large)
**Total Database:** ~620 GB for comprehensive catalog
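The totals follow from simple arithmetic on the per-table figures: the Spotify tables sum to 68.7 GB, which the estimate rounds to about 70 GB per provider. The sketch below only restates that calculation; the `SizeEstimate` class is hypothetical, and the figures are the document's own estimates, not measurements.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SizeEstimate
{
    // Per-table estimates for the Spotify schema, in GB, copied from the list above.
    public static readonly IReadOnlyDictionary<string, double> SpotifyTablesGb =
        new Dictionary<string, double>
        {
            ["spotify_artist"] = 0.5,
            ["spotify_artist_image"] = 0.2,
            ["spotify_album"] = 5,
            ["spotify_album_artist"] = 1,
            ["spotify_album_image"] = 2,
            ["spotify_track"] = 50,
            ["spotify_track_artist"] = 10,
        };

    // Sum of the per-table estimates (68.7 GB before rounding).
    public static double SpotifyTotalGb() => SpotifyTablesGb.Values.Sum();

    // Table data for all providers plus index overhead.
    public static double DatabaseTotalGb(int providers, double perProviderGb, double indexesGb)
        => providers * perProviderGb + indexesGb;
}
```

`SizeEstimate.DatabaseTotalGb(6, 70, 200)` reproduces the 620 GB figure; it assumes every provider is as large as Spotify, which the schemas above suggest is an upper bound for providers with fewer tables.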
**Implications:**
- Requires substantial storage
- Backup/restore time significant
- Index rebuilds time-consuming
- Connection pooling critical
## Performance Considerations
### Query Performance
**Fuzzy Search:**
- With GIN index: 10-50ms for 20 results
- Without index: 5-30 seconds (full table scan)
- Threshold tuning affects result count and speed
**ID Lookup:**
- With primary key: <1ms
- With foreign key index: 1-5ms
**Join Queries:**
- Album with artists: 5-20ms
- Track with album and artists: 10-30ms
- Depends on relationship cardinality
### Optimization Strategies
**Implemented:**
- GIN indexes for fuzzy search
- B-tree indexes for foreign keys
- Connection pooling
- Parameterized queries (SQL injection prevention)
**Missing:**
- Query result caching (Redis/Memcached)
- Materialized views for complex joins
- Partitioning for large tables
- Read replicas for horizontal scaling
### Bottlenecks
1. **GIN Index Size:** Large memory footprint
2. **Fuzzy Search:** CPU-intensive similarity calculations
3. **Multi-Provider Queries:** 6 parallel database queries
4. **No Caching:** Every request hits database
5. **Connection Pool Limit:** 100 max connections per instance
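The multi-provider fan-out in item 3 can be sketched as concurrent tasks awaited together. Each delegate below stands in for one provider's repository call; the `FanOut` class and the delegate shape are assumptions for illustration, not the project's actual API.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public static class FanOut
{
    // Starts every provider search at once and returns results keyed by provider name.
    // If any search fails, Task.WhenAll rethrows and no partial result is returned.
    public static async Task<Dictionary<string, IReadOnlyList<string>>> SearchAllAsync(
        IReadOnlyDictionary<string, Func<string, Task<IReadOnlyList<string>>>> providers,
        string term)
    {
        // Invoking each delegate starts its task; nothing is awaited yet.
        var pending = providers.ToDictionary(p => p.Key, p => p.Value(term));
        await Task.WhenAll(pending.Values);
        // Every task has completed successfully here, so Result does not block.
        return pending.ToDictionary(p => p.Key, p => p.Value.Result);
    }
}
```

If each of the six queries holds its own pooled connection, a pool capped at 100 serves roughly 16 such requests at once, which is how items 3 and 5 interact.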
## Data Integrity
**Constraints:**
- Primary keys on all entity tables
- Foreign keys for relationships
- NOT NULL on critical fields (id, name)
**No Constraints:**
- No unique constraints on names (duplicates allowed)
- No check constraints on data ranges
- No triggers for data validation
**Orphan Prevention:**
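- CASCADE delete on foreign keys is assumed but not confirmed; the DDL shown above declares no `ON DELETE` action, so those definitions would get PostgreSQL's default (`NO ACTION`)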
- Junction tables maintain referential integrity
**Data Quality:**
- Depends entirely on MiniMediaScanner sync quality
- No validation in this API
- Garbage in, garbage out
## Backup and Recovery
**Responsibility:** Database administrator (not API)
**Recommended Strategy:**
- Daily full backups
- Continuous WAL archiving
- Point-in-time recovery capability
- Backup retention: 30 days
**Recovery Time:**
- Full restore: Hours (620 GB database)
- Index rebuild: Hours (GIN indexes)
- Sync from providers: Days to weeks
## Schema Evolution
**Change Process:**
1. MiniMediaScanner updates schema
2. MiniMediaScanner deploys migration
3. MiniMediaMetadataAPI updates models
4. MiniMediaMetadataAPI redeploys
**Risk:** Breaking changes require coordinated deployment.
**Mitigation:**
- Additive changes only (new columns, tables)
- Deprecation period for removals
- Version compatibility checks
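A version compatibility check can be as simple as comparing the columns the API's queries depend on with the columns that actually exist. The sketch below shows only the comparison; it assumes the actual column names were fetched elsewhere (for example from PostgreSQL's `information_schema.columns`), and `SchemaCheck` is a hypothetical name.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SchemaCheck
{
    // Returns the expected columns that are absent from the actual column list.
    // Extra columns are ignored, which is why additive changes stay compatible.
    public static IReadOnlyList<string> MissingColumns(
        IEnumerable<string> expected, IEnumerable<string> actual)
    {
        var present = new HashSet<string>(actual, StringComparer.OrdinalIgnoreCase);
        return expected.Where(column => !present.Contains(column)).ToList();
    }
}
```

Running such a check at startup would turn a renamed or dropped column into a clear failure message rather than a runtime mapping error on the first query.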
**No Automated Migration:** API has no migration framework.