# MiniMediaMetadataAPI - Data Layer Analysis

## Database Technology

**RDBMS:** PostgreSQL
**Driver:** Npgsql 10.0.2
**ORM:** Dapper 2.1.72 (micro-ORM)
**Extensions:** pg_trgm (trigram similarity search)

## Schema Ownership

**Critical Constraint:** This API does NOT own the database schema.

**Schema Owner:** MiniMediaScanner (separate project)
**API Role:** Read-only consumer
**Migration Strategy:** None (schema managed externally)

### Implications

**Pros:**
- Clear separation of concerns
- API doesn't need provider API credentials
- Simpler deployment (no migration coordination)
- Sync complexity isolated in MiniMediaScanner

**Cons:**
- No control over schema evolution
- Breaking changes in MiniMediaScanner break API
- Can't optimize schema for query patterns
- Data freshness depends on external sync schedule

**Coupling Points:**
- Table names hardcoded in SQL queries
- Column names hardcoded in Dapper mappings
- Foreign key relationships assumed in joins
- Data types must match C# model properties

## Connection Configuration

**Connection String Format:**
```
Host=localhost;
Database=minimediametadata;
Username=postgres;
Password=password;
MinPoolSize=5;
MaxPoolSize=100;
Timeout=30;
CommandTimeout=30;
```

**Pooling Settings:**
- **MinPoolSize:** 5 connections kept alive
- **MaxPoolSize:** 100 concurrent connections
- **Timeout:** 30 seconds to establish a connection, including the wait for a free pooled connection when the pool is exhausted
- **CommandTimeout:** 30 seconds for query execution

**Connection Lifecycle:**
- Connections created per repository method call
- Returned to pool after query completion
- No long-lived connections
- No transaction management (read-only)

## Fuzzy Search Implementation

### pg_trgm Extension

**Purpose:** Trigram-based similarity search for fuzzy text matching

**Configuration:**
```sql
SET LOCAL pg_trgm.similarity_threshold = 0.5;
```

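**Threshold:** 0.5 (50% similarity required; the pg_trgm default is 0.3). `SET LOCAL` lasts only until the end of the current transaction, so it must be issued in the same transaction as the search query.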

**Operators:**
- `%` - Similarity operator (returns true if similarity >= threshold)
- `similarity(text, text)` - Returns similarity score (0.0 to 1.0)

### Search Query Pattern

**Example (Artist Search):**
```sql
SET LOCAL pg_trgm.similarity_threshold = 0.5;

SELECT
    id,
    name,
    popularity,
    external_url,
    followers,
    genres,
    last_sync_time
FROM spotify_artist
WHERE lower(name) % lower(@searchTerm)
ORDER BY similarity(lower(name), lower(@searchTerm)) DESC
LIMIT 20 OFFSET @offset;
```

**Key Features:**
- Case-insensitive matching (`lower()`)
- Similarity-based ordering (best matches first)
- Pagination support (LIMIT/OFFSET)
- Threshold filtering (only >= 50% similarity)

**Performance:**
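- Requires a GIN or GiST trigram index on the `lower(name)` expression; an index on plain `name` does not match this query
- Index creation: `CREATE INDEX idx_artist_name_trgm ON spotify_artist USING gin(lower(name) gin_trgm_ops);`
- Query time: with the index, cost scales with the number of candidate rows that share trigrams with the search term rather than with table size; without it, every row is scanned and scored (O(n))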

### Similarity Scoring

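**Algorithm:** Trigram set overlap. pg_trgm lowercases the text, splits it into alphanumeric words, pads each word with two leading spaces and one trailing space, and scores two strings as the number of shared trigrams divided by the number of distinct trigrams across both.

**Example:**
```
"Beatles" vs "Beetles"
Trigrams: ["  b", " be", "bea", "eat", "atl", "tle", "les", "es "] vs ["  b", " be", "bee", "eet", "etl", "tle", "les", "es "]
Shared: ["  b", " be", "tle", "les", "es "] = 5 of 11 distinct = 0.45 (below threshold)

"Beatles" vs "The Beatles"
Trigrams: ["  b", " be", "bea", "eat", "atl", "tle", "les", "es "] vs ["  t", " th", "the", "he ", "  b", " be", "bea", "eat", "atl", "tle", "les", "es "]
Shared: all 8 trigrams of "beatles" = 8 of 12 distinct = 0.67 (above threshold)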
```
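
To make the scoring rule concrete, here is a minimal C# sketch that approximates how pg_trgm extracts and compares trigrams. It is an illustration under stated assumptions, not the extension's implementation: it handles ASCII alphanumeric words only, and the `Trigrams` and `Similarity` helpers are hypothetical names that do not exist in the API.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;
using System.Text.RegularExpressions;

// Approximates pg_trgm: lowercase the text, split it into alphanumeric words,
// pad each word with two leading spaces and one trailing space, then collect
// the distinct three-character windows. ASCII-only is an assumption of this sketch.
static HashSet<string> Trigrams(string text)
{
    var result = new HashSet<string>();
    foreach (Match word in Regex.Matches(text.ToLowerInvariant(), "[a-z0-9]+"))
    {
        string padded = "  " + word.Value + " ";
        for (int i = 0; i + 3 <= padded.Length; i++)
        {
            result.Add(padded.Substring(i, 3));
        }
    }
    return result;
}

// Similarity = shared trigrams / distinct trigrams across both strings.
static double Similarity(string a, string b)
{
    var ta = Trigrams(a);
    var tb = Trigrams(b);
    if (ta.Count == 0 || tb.Count == 0) return 0.0;
    int shared = ta.Intersect(tb).Count();
    return (double)shared / (ta.Count + tb.Count - shared);
}

// 5 shared of 11 distinct trigrams
Console.WriteLine(Similarity("Beatles", "Beetles").ToString("F2", CultureInfo.InvariantCulture));     // 0.45
// 8 shared of 12 distinct trigrams
Console.WriteLine(Similarity("Beatles", "The Beatles").ToString("F2", CultureInfo.InvariantCulture)); // 0.67
```

The per-word padding is why these counts differ from a count over unpadded three-letter windows: the padded trigrams at the start and end of each word also take part in the comparison.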

**Tuning:**
- Lower threshold (0.3) = more results, more false positives
- Higher threshold (0.7) = fewer results, more precision
- Current 0.5 = balanced approach

## Database Schema

### Provider-Specific Tables

Each provider has isolated table structure. No cross-provider foreign keys.

### Spotify Schema

**Tables:**
1. `spotify_artist` - Artist metadata
2. `spotify_artist_image` - Artist images (1:N)
3. `spotify_album` - Album metadata
4. `spotify_album_artist` - Album-artist relationships (M:N)
5. `spotify_album_image` - Album artwork (1:N)
6. `spotify_album_externalid` - External identifiers (UPC, EAN) (1:N)
7. `spotify_track` - Track metadata
8. `spotify_track_artist` - Track-artist relationships (M:N)
9. `spotify_track_externalid` - External identifiers (ISRC) (1:N)

**spotify_artist:**
```sql
CREATE TABLE spotify_artist (
    id VARCHAR(255) PRIMARY KEY,
    name VARCHAR(500) NOT NULL,
    popularity INTEGER,
    external_url VARCHAR(500),
    followers INTEGER,
    genres TEXT[], -- PostgreSQL array
    last_sync_time TIMESTAMP WITH TIME ZONE
);

CREATE INDEX idx_spotify_artist_name_trgm ON spotify_artist USING gin(lower(name) gin_trgm_ops);
```

**spotify_artist_image:**
```sql
CREATE TABLE spotify_artist_image (
    id SERIAL PRIMARY KEY,
    artist_id VARCHAR(255) REFERENCES spotify_artist(id),
    url VARCHAR(1000) NOT NULL,
    height INTEGER,
    width INTEGER
);

CREATE INDEX idx_spotify_artist_image_artist ON spotify_artist_image(artist_id);
```

**spotify_album:**
```sql
CREATE TABLE spotify_album (
    id VARCHAR(255) PRIMARY KEY,
    name VARCHAR(500) NOT NULL,
    popularity INTEGER,
    external_url VARCHAR(500),
    label VARCHAR(500),
    release_date VARCHAR(50), -- Stored as string (YYYY, YYYY-MM, or YYYY-MM-DD)
    total_tracks INTEGER,
    album_type VARCHAR(50), -- album, single, compilation
    copyright TEXT,
    last_sync_time TIMESTAMP WITH TIME ZONE
);

CREATE INDEX idx_spotify_album_name_trgm ON spotify_album USING gin(lower(name) gin_trgm_ops);
```

**spotify_album_artist (junction table):**
```sql
CREATE TABLE spotify_album_artist (
    id SERIAL PRIMARY KEY,
    album_id VARCHAR(255) REFERENCES spotify_album(id),
    artist_id VARCHAR(255) REFERENCES spotify_artist(id)
);

CREATE INDEX idx_spotify_album_artist_album ON spotify_album_artist(album_id);
CREATE INDEX idx_spotify_album_artist_artist ON spotify_album_artist(artist_id);
```

**spotify_track:**
```sql
CREATE TABLE spotify_track (
    id VARCHAR(255) PRIMARY KEY,
    name VARCHAR(500) NOT NULL,
    album_id VARCHAR(255) REFERENCES spotify_album(id),
    popularity INTEGER,
    external_url VARCHAR(500),
    duration_ms INTEGER,
    explicit BOOLEAN,
    disc_number INTEGER,
    track_number INTEGER,
    label VARCHAR(500),
    last_sync_time TIMESTAMP WITH TIME ZONE
);

CREATE INDEX idx_spotify_track_name_trgm ON spotify_track USING gin(lower(name) gin_trgm_ops);
CREATE INDEX idx_spotify_track_album ON spotify_track(album_id);
```

**spotify_album_externalid:**
```sql
CREATE TABLE spotify_album_externalid (
    id SERIAL PRIMARY KEY,
    album_id VARCHAR(255) REFERENCES spotify_album(id),
    type VARCHAR(50), -- upc, ean
    value VARCHAR(255)
);

CREATE INDEX idx_spotify_album_externalid_album ON spotify_album_externalid(album_id);
CREATE INDEX idx_spotify_album_externalid_value ON spotify_album_externalid(value);
```

**spotify_track_externalid:**
```sql
CREATE TABLE spotify_track_externalid (
    id SERIAL PRIMARY KEY,
    track_id VARCHAR(255) REFERENCES spotify_track(id),
    type VARCHAR(50), -- isrc
    value VARCHAR(255)
);

CREATE INDEX idx_spotify_track_externalid_track ON spotify_track_externalid(track_id);
CREATE INDEX idx_spotify_track_externalid_value ON spotify_track_externalid(value);
```

### Tidal Schema

**Tables:**
1. `tidal_artist` - Artist metadata
2. `tidal_artist_image_link` - Artist image URLs (1:N)
3. `tidal_album` - Album metadata
4. `tidal_album_external_link` - External URLs (1:N)
5. `tidal_album_image` - Album artwork (1:N)
6. `tidal_track` - Track metadata
7. `tidal_track_artist` - Track-artist relationships (M:N)
8. `tidal_track_external_link` - External URLs (1:N)

**Key Differences from Spotify:**
- ID type: INTEGER instead of VARCHAR
- No popularity field
- No genres field
- External links instead of external IDs
- Image links stored as separate table

**tidal_artist:**
```sql
CREATE TABLE tidal_artist (
    id INTEGER PRIMARY KEY,
    name VARCHAR(500) NOT NULL,
    url VARCHAR(500),
    last_sync_time TIMESTAMP WITH TIME ZONE
);

CREATE INDEX idx_tidal_artist_name_trgm ON tidal_artist USING gin(lower(name) gin_trgm_ops);
```

**tidal_album:**
```sql
CREATE TABLE tidal_album (
    id INTEGER PRIMARY KEY,
    name VARCHAR(500) NOT NULL,
    artist_id INTEGER REFERENCES tidal_artist(id),
    url VARCHAR(500),
    release_date VARCHAR(50),
    total_tracks INTEGER,
    duration INTEGER, -- Total duration in seconds
    explicit BOOLEAN,
    upc VARCHAR(255),
    copyright TEXT,
    last_sync_time TIMESTAMP WITH TIME ZONE
);

CREATE INDEX idx_tidal_album_name_trgm ON tidal_album USING gin(lower(name) gin_trgm_ops);
CREATE INDEX idx_tidal_album_artist ON tidal_album(artist_id);
```

### MusicBrainz Schema

**Tables:**
1. `musicbrainz_artist` - Artist metadata
2. `musicbrainz_release` - Release (album) metadata
3. `musicbrainz_release_label` - Release-label relationships (M:N)
4. `musicbrainz_label` - Label metadata
5. `musicbrainz_release_track` - Track metadata
6. `musicbrainz_release_track_artist` - Track-artist relationships (M:N)

**Key Differences:**
- ID type: UUID (Guid)
- "Release" instead of "Album"
- Sort name field for artists
- Label as separate entity
- No popularity or follower counts
- No images (stored externally via Cover Art Archive)

**musicbrainz_artist:**
```sql
CREATE TABLE musicbrainz_artist (
    id UUID PRIMARY KEY,
    name VARCHAR(500) NOT NULL,
    sort_name VARCHAR(500), -- For alphabetical sorting (e.g., "Beatles, The")
    type VARCHAR(100), -- Person, Group, Orchestra, etc.
    country VARCHAR(2), -- ISO country code
    last_sync_time TIMESTAMP WITH TIME ZONE
);

CREATE INDEX idx_musicbrainz_artist_name_trgm ON musicbrainz_artist USING gin(lower(name) gin_trgm_ops);
```

**musicbrainz_release:**
```sql
CREATE TABLE musicbrainz_release (
    id UUID PRIMARY KEY,
    name VARCHAR(500) NOT NULL,
    artist_id UUID REFERENCES musicbrainz_artist(id),
    release_date VARCHAR(50),
    country VARCHAR(2),
    barcode VARCHAR(255), -- Similar to UPC
    status VARCHAR(100), -- Official, Promotion, Bootleg, etc.
    packaging VARCHAR(100), -- Jewel Case, Digipak, etc.
    last_sync_time TIMESTAMP WITH TIME ZONE
);

CREATE INDEX idx_musicbrainz_release_name_trgm ON musicbrainz_release USING gin(lower(name) gin_trgm_ops);
CREATE INDEX idx_musicbrainz_release_artist ON musicbrainz_release(artist_id);
```

**musicbrainz_label:**
```sql
CREATE TABLE musicbrainz_label (
    id UUID PRIMARY KEY,
    name VARCHAR(500) NOT NULL,
    type VARCHAR(100), -- Original Production, Bootleg Production, etc.
    country VARCHAR(2),
    last_sync_time TIMESTAMP WITH TIME ZONE
);
```

**musicbrainz_release_label (junction table):**
```sql
CREATE TABLE musicbrainz_release_label (
    id SERIAL PRIMARY KEY,
    release_id UUID REFERENCES musicbrainz_release(id),
    label_id UUID REFERENCES musicbrainz_label(id),
    catalog_number VARCHAR(255)
);

CREATE INDEX idx_musicbrainz_release_label_release ON musicbrainz_release_label(release_id);
CREATE INDEX idx_musicbrainz_release_label_label ON musicbrainz_release_label(label_id);
```

### Deezer Schema

**Tables:**
1. `deezer_artist` - Artist metadata
2. `deezer_artist_image_link` - Artist image URLs (1:N)
3. `deezer_album` - Album metadata
4. `deezer_album_image_link` - Album artwork URLs (1:N)
5. `deezer_album_artist` - Album-artist relationships (M:N)
6. `deezer_track` - Track metadata
7. `deezer_track_artist` - Track-artist relationships (M:N)

**Key Differences:**
- ID type: BIGINT
- Has popularity (called "fans")
- Has genres
- No UPC/ISRC fields
- No label information

**deezer_artist:**
```sql
CREATE TABLE deezer_artist (
    id BIGINT PRIMARY KEY,
    name VARCHAR(500) NOT NULL,
    url VARCHAR(500),
    fans INTEGER, -- Similar to followers
    last_sync_time TIMESTAMP WITH TIME ZONE
);

CREATE INDEX idx_deezer_artist_name_trgm ON deezer_artist USING gin(lower(name) gin_trgm_ops);
```

**deezer_album:**
```sql
CREATE TABLE deezer_album (
    id BIGINT PRIMARY KEY,
    name VARCHAR(500) NOT NULL,
    url VARCHAR(500),
    release_date VARCHAR(50),
    total_tracks INTEGER,
    duration INTEGER, -- Total duration in seconds
    explicit BOOLEAN,
    fans INTEGER,
    genres TEXT[], -- PostgreSQL array
    last_sync_time TIMESTAMP WITH TIME ZONE
);

CREATE INDEX idx_deezer_album_name_trgm ON deezer_album USING gin(lower(name) gin_trgm_ops);
```

### Discogs Schema

**Tables:**
1. `discogs_artist` - Artist metadata
2. `discogs_artist_alias` - Artist aliases (1:N)
3. `discogs_artist_url` - Artist URLs (1:N)
4. `discogs_release` - Release metadata
5. `discogs_release_artist` - Release-artist relationships (M:N)
6. `discogs_release_identifier` - Barcodes/identifiers (1:N)
7. `discogs_release_track` - Track metadata
8. `discogs_label` - Label metadata
9. `discogs_label_sublabel` - Label hierarchy (1:N)
10. `discogs_label_url` - Label URLs (1:N)

**Key Differences:**
- ID type: INTEGER
- Most comprehensive label data
- Artist aliases tracked
- Multiple identifiers per release (Barcode, Matrix, etc.)
- No popularity metrics
- No image URLs (stored externally)

**discogs_artist:**
```sql
CREATE TABLE discogs_artist (
    id INTEGER PRIMARY KEY,
    name VARCHAR(500) NOT NULL,
    real_name VARCHAR(500), -- For pseudonyms
    profile TEXT, -- Biography
    last_sync_time TIMESTAMP WITH TIME ZONE
);

CREATE INDEX idx_discogs_artist_name_trgm ON discogs_artist USING gin(lower(name) gin_trgm_ops);
```

**discogs_artist_alias:**
```sql
CREATE TABLE discogs_artist_alias (
    id SERIAL PRIMARY KEY,
    artist_id INTEGER REFERENCES discogs_artist(id),
    alias_name VARCHAR(500)
);

CREATE INDEX idx_discogs_artist_alias_artist ON discogs_artist_alias(artist_id);
CREATE INDEX idx_discogs_artist_alias_name_trgm ON discogs_artist_alias USING gin(lower(alias_name) gin_trgm_ops);
```

**discogs_release:**
```sql
CREATE TABLE discogs_release (
    id INTEGER PRIMARY KEY,
    name VARCHAR(500) NOT NULL,
    released VARCHAR(50),
    country VARCHAR(100),
    notes TEXT,
    genres TEXT[],
    styles TEXT[], -- More specific than genres
    last_sync_time TIMESTAMP WITH TIME ZONE
);

CREATE INDEX idx_discogs_release_name_trgm ON discogs_release USING gin(lower(name) gin_trgm_ops);
```

**discogs_release_identifier:**
```sql
CREATE TABLE discogs_release_identifier (
    id SERIAL PRIMARY KEY,
    release_id INTEGER REFERENCES discogs_release(id),
    type VARCHAR(100), -- Barcode, Matrix/Runout, Label Code, etc.
    value VARCHAR(500),
    description TEXT
);

CREATE INDEX idx_discogs_release_identifier_release ON discogs_release_identifier(release_id);
CREATE INDEX idx_discogs_release_identifier_value ON discogs_release_identifier(value);
```

**discogs_label:**
```sql
CREATE TABLE discogs_label (
    id INTEGER PRIMARY KEY,
    name VARCHAR(500) NOT NULL,
    contact_info TEXT,
    profile TEXT,
    parent_label_id INTEGER REFERENCES discogs_label(id),
    last_sync_time TIMESTAMP WITH TIME ZONE
);

CREATE INDEX idx_discogs_label_name_trgm ON discogs_label USING gin(lower(name) gin_trgm_ops);
```

### SoundCloud Schema

**Tables:**
1. `soundcloud_user` - User/artist metadata
2. `soundcloud_playlist` - Playlist metadata
3. `soundcloud_track` - Track metadata
4. `soundcloud_track_artist` - Track-artist relationships (M:N)

**Key Differences:**
- "User" instead of "Artist" (user-generated content platform)
- Playlist as first-class entity
- No album concept
- Minimal metadata (no UPC, ISRC, labels)
- ID type: BIGINT

**soundcloud_user:**
```sql
CREATE TABLE soundcloud_user (
    id BIGINT PRIMARY KEY,
    username VARCHAR(500) NOT NULL,
    full_name VARCHAR(500),
    url VARCHAR(500),
    avatar_url VARCHAR(1000),
    followers_count INTEGER,
    last_sync_time TIMESTAMP WITH TIME ZONE
);

CREATE INDEX idx_soundcloud_user_username_trgm ON soundcloud_user USING gin(lower(username) gin_trgm_ops);
```

**soundcloud_playlist:**
```sql
CREATE TABLE soundcloud_playlist (
    id BIGINT PRIMARY KEY,
    title VARCHAR(500) NOT NULL,
    user_id BIGINT REFERENCES soundcloud_user(id),
    url VARCHAR(500),
    artwork_url VARCHAR(1000),
    duration INTEGER, -- Total duration in milliseconds
    track_count INTEGER,
    last_sync_time TIMESTAMP WITH TIME ZONE
);

CREATE INDEX idx_soundcloud_playlist_title_trgm ON soundcloud_playlist USING gin(lower(title) gin_trgm_ops);
CREATE INDEX idx_soundcloud_playlist_user ON soundcloud_playlist(user_id);
```

**soundcloud_track:**
```sql
CREATE TABLE soundcloud_track (
    id BIGINT PRIMARY KEY,
    title VARCHAR(500) NOT NULL,
    user_id BIGINT REFERENCES soundcloud_user(id),
    url VARCHAR(500),
    artwork_url VARCHAR(1000),
    duration INTEGER, -- Duration in milliseconds
    genre VARCHAR(255),
    playback_count INTEGER,
    last_sync_time TIMESTAMP WITH TIME ZONE
);

CREATE INDEX idx_soundcloud_track_title_trgm ON soundcloud_track USING gin(lower(title) gin_trgm_ops);
CREATE INDEX idx_soundcloud_track_user ON soundcloud_track(user_id);
```

## ID Type Comparison

| Provider | Artist ID | Album ID | Track ID | Notes |
|----------|-----------|----------|----------|-------|
| Spotify | VARCHAR(255) | VARCHAR(255) | VARCHAR(255) | Base62 encoded (22 chars) |
| Tidal | INTEGER | INTEGER | INTEGER | Sequential integers |
| MusicBrainz | UUID | UUID | UUID | RFC 4122 UUIDs |
| Deezer | BIGINT | BIGINT | BIGINT | Large integers |
| Discogs | INTEGER | INTEGER | INTEGER | Sequential integers |
| SoundCloud | BIGINT | N/A | BIGINT | No album concept |

**Implications:**
- Cross-provider ID lookups are impossible: an ID from one provider has no meaning in another provider's tables
- ID parameter must match provider type
- C# models use provider-specific types
- No universal identifier system
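
As a sketch of what "ID parameter must match provider type" means in practice, the following hypothetical helper parses a raw ID string into the CLR type matching each provider's key column. The provider names, the `TryParseProviderId` function, and the validation rules are assumptions for illustration; the API's actual routing and validation are not shown in this analysis.

```csharp
using System;
using System.Globalization;

// Hypothetical validation step: convert a raw ID into the CLR type that
// matches the provider's key column. On failure, id is left as the raw string.
static bool TryParseProviderId(string provider, string raw, out object id)
{
    id = raw;
    switch (provider.ToLowerInvariant())
    {
        case "spotify": // VARCHAR(255) key, kept as a string
            return raw.Length > 0;
        case "tidal":
        case "discogs": // INTEGER key -> int
            if (!int.TryParse(raw, NumberStyles.None, CultureInfo.InvariantCulture, out int intId)) return false;
            id = intId;
            return true;
        case "deezer":
        case "soundcloud": // BIGINT key -> long
            if (!long.TryParse(raw, NumberStyles.None, CultureInfo.InvariantCulture, out long longId)) return false;
            id = longId;
            return true;
        case "musicbrainz": // UUID key -> Guid
            if (!Guid.TryParse(raw, out Guid guidId)) return false;
            id = guidId;
            return true;
        default:
            return false;
    }
}

Console.WriteLine(TryParseProviderId("tidal", "12345", out _));        // True
Console.WriteLine(TryParseProviderId("tidal", "not-a-number", out _)); // False
Console.WriteLine(TryParseProviderId("musicbrainz", "12345", out _));  // False
```

Rejecting a malformed ID before it reaches the database avoids a type-mismatch error from PostgreSQL and keeps the failure at the API boundary.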

## Data Type Patterns

### Arrays (PostgreSQL Native)

**Usage:** Genres and styles (external IDs are stored in separate child tables, not arrays)

**Example:**
```sql
genres TEXT[] -- e.g. '{rock,pop,alternative}'
```

**Dapper Mapping:**
```csharp
public class SpotifyArtist
{
    public string[] Genres { get; set; } // Npgsql reads TEXT[] as string[]; Dapper assigns it as-is
}
```

### Timestamps

**Type:** `TIMESTAMP WITH TIME ZONE`
**Purpose:** Track last sync time from provider

**Example:**
```sql
last_sync_time TIMESTAMP WITH TIME ZONE DEFAULT NOW()
```

**C# Mapping:**
```csharp
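// Maps from last_sync_time only if the column is aliased or DefaultTypeMap.MatchNamesWithUnderscores is enabled
public DateTime? LastSyncTime { get; set; }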
```

### Variable-Length Dates

**Type:** VARCHAR(50)
**Formats:** YYYY, YYYY-MM, YYYY-MM-DD

**Rationale:** Providers return different precision levels

**Examples:**
- `"1969"` - Year only
- `"1969-09"` - Year and month
- `"1969-09-26"` - Full date

**C# Mapping:**
```csharp
public string ReleaseDate { get; set; } // Stored as string, parsed in application
```
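
Since the date is parsed in the application, a parser has to accept all three precisions. The following is a minimal sketch of one way to do that; `TryParsePartialDate` is a hypothetical helper, and defaulting the missing month or day to 1 is an assumption of this example, not documented API behavior.

```csharp
using System;
using System.Globalization;

// Hypothetical helper: parses "YYYY", "YYYY-MM", or "YYYY-MM-DD" and reports
// which precision was present. Missing month/day parts default to 1.
static bool TryParsePartialDate(string raw, out DateTime date, out string precision)
{
    date = default;
    precision = "unknown";
    if (string.IsNullOrWhiteSpace(raw)) return false;

    // Most specific format first, so the reported precision is exact.
    (string Format, string Precision)[] candidates =
    {
        ("yyyy-MM-dd", "day"),
        ("yyyy-MM", "month"),
        ("yyyy", "year"),
    };

    foreach (var (format, name) in candidates)
    {
        if (DateTime.TryParseExact(raw.Trim(), format, CultureInfo.InvariantCulture,
                                   DateTimeStyles.None, out date))
        {
            precision = name;
            return true;
        }
    }
    return false;
}

foreach (var input in new[] { "1969", "1969-09", "1969-09-26", "n/a" })
{
    if (TryParsePartialDate(input, out var parsed, out var kind))
    {
        string iso = parsed.ToString("yyyy-MM-dd", CultureInfo.InvariantCulture);
        Console.WriteLine($"{input} -> {iso} ({kind})");
    }
    else
    {
        Console.WriteLine($"{input} -> unparsed");
    }
}
```

Returning the precision alongside the date lets callers distinguish a release known only by year from one actually released on 1 January.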

## Query Patterns

### Artist Search

```sql
SET LOCAL pg_trgm.similarity_threshold = 0.5;

SELECT
    a.id,
    a.name,
    a.popularity,
    a.external_url,
    a.followers,
    a.genres,
    a.last_sync_time,
    i.url AS image_url,
    i.height AS image_height,
    i.width AS image_width
FROM spotify_artist a
LEFT JOIN spotify_artist_image i ON a.id = i.artist_id
WHERE lower(a.name) % lower(@searchTerm)
ORDER BY similarity(lower(a.name), lower(@searchTerm)) DESC
LIMIT 20 OFFSET @offset;
```

**Dapper Mapping:**
```csharp
var artistDict = new Dictionary<string, SpotifyArtist>();

var results = await connection.QueryAsync<SpotifyArtist, SpotifyArtistImage, SpotifyArtist>(
    sql,
    (artist, image) =>
    {
        if (!artistDict.TryGetValue(artist.Id, out var existingArtist))
        {
            existingArtist = artist;
            existingArtist.Images = new List<SpotifyArtistImage>();
            artistDict.Add(artist.Id, existingArtist);
        }

        if (image != null)
        {
            existingArtist.Images.Add(image);
        }

        return existingArtist;
    },
    new { searchTerm, offset },
    splitOn: "image_url"
);

return artistDict.Values.ToList();
```

**Caveat:** `LIMIT 20 OFFSET @offset` is applied to the joined rows, not to artists. An artist with three images occupies three of the 20 rows, so a page can hold fewer than 20 artists and one artist's images can be split across two pages. `Dictionary<TKey, TValue>` also does not guarantee enumeration order, so the similarity ordering is not guaranteed to survive `artistDict.Values.ToList()`.

### Album with Artists

```sql
SELECT
    a.id,
    a.name,
    a.popularity,
    a.external_url,
    a.label,
    a.release_date,
    a.total_tracks,
    a.album_type,
    a.copyright,
    a.last_sync_time,
    ar.id AS artist_id,
    ar.name AS artist_name
FROM spotify_album a
LEFT JOIN spotify_album_artist aa ON a.id = aa.album_id
LEFT JOIN spotify_artist ar ON aa.artist_id = ar.id
WHERE a.id = @albumId;
```

**Multi-Mapping:** Album with nested artist list.
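
The folding step behind that multi-mapping can be sketched without a database. The example below takes flat rows shaped like the join result and groups them into one album with a nested artist list; the `AlbumRow`, `Artist`, and `Album` records and the sample values are hypothetical, and the API's real models may differ.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Flat rows as the join would return them: one row per (album, artist) pair.
var rows = new List<AlbumRow>
{
    new AlbumRow("album-1", "Example Album", "artist-1", "First Artist"),
    new AlbumRow("album-1", "Example Album", "artist-2", "Second Artist"),
};

// Fold the rows into a single album with a nested artist list.
// Rows with a null artist (LEFT JOIN with no match) contribute no artist.
var album = rows
    .GroupBy(r => (r.Id, r.Name))
    .Select(g => new Album(
        g.Key.Id,
        g.Key.Name,
        g.Where(r => r.ArtistId != null)
         .Select(r => new Artist(r.ArtistId, r.ArtistName))
         .ToList()))
    .Single();

// Prints: Example Album: First Artist, Second Artist
Console.WriteLine($"{album.Name}: {string.Join(", ", album.Artists.Select(a => a.Name))}");

// Illustrative shapes only; not the API's actual model classes.
record AlbumRow(string Id, string Name, string ArtistId, string ArtistName);
record Artist(string Id, string Name);
record Album(string Id, string Name, List<Artist> Artists);
```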

### Track with Album and Artists

```sql
SELECT
    t.id,
    t.name,
    t.popularity,
    t.external_url,
    t.duration_ms,
    t.explicit,
    t.disc_number,
    t.track_number,
    t.label,
    t.last_sync_time,
    a.id AS album_id,
    a.name AS album_name,
    a.release_date AS album_release_date,
    ar.id AS artist_id,
    ar.name AS artist_name
FROM spotify_track t
LEFT JOIN spotify_album a ON t.album_id = a.id
LEFT JOIN spotify_track_artist ta ON t.id = ta.track_id
LEFT JOIN spotify_artist ar ON ta.artist_id = ar.id
WHERE t.id = @trackId;
```

**Multi-Mapping:** Track with nested album and artist list.

### External ID Lookup

```sql
SELECT
    a.id,
    a.name,
    a.popularity,
    a.external_url,
    a.label,
    a.release_date,
    a.total_tracks,
    a.album_type,
    a.last_sync_time
FROM spotify_album a
INNER JOIN spotify_album_externalid e ON a.id = e.album_id
WHERE e.type = 'upc' AND e.value = @upc;
```

**Use Case:** Find album by UPC barcode.

## Index Strategy

### Required Indexes

**Fuzzy Search (GIN trigram):**
```sql
CREATE INDEX idx_spotify_artist_name_trgm ON spotify_artist USING gin(lower(name) gin_trgm_ops);
CREATE INDEX idx_spotify_album_name_trgm ON spotify_album USING gin(lower(name) gin_trgm_ops);
CREATE INDEX idx_spotify_track_name_trgm ON spotify_track USING gin(lower(name) gin_trgm_ops);
```

**Foreign Keys:**
```sql
CREATE INDEX idx_spotify_album_artist_album ON spotify_album_artist(album_id);
CREATE INDEX idx_spotify_album_artist_artist ON spotify_album_artist(artist_id);
CREATE INDEX idx_spotify_track_album ON spotify_track(album_id);
CREATE INDEX idx_spotify_track_artist_track ON spotify_track_artist(track_id);
CREATE INDEX idx_spotify_track_artist_artist ON spotify_track_artist(artist_id);
```

**External IDs:**
```sql
CREATE INDEX idx_spotify_album_externalid_value ON spotify_album_externalid(value);
CREATE INDEX idx_spotify_track_externalid_value ON spotify_track_externalid(value);
```

### Index Maintenance

**Owned by:** MiniMediaScanner (schema owner)
**API Responsibility:** None (read-only consumer)

**Performance Impact:**
- GIN indexes: Large (2-3x table size), slow writes, fast reads
- B-tree indexes: Moderate size, fast writes, fast reads
- No index = full table scan (unacceptable for fuzzy search)

## Data Freshness

**Sync Mechanism:** MiniMediaScanner polls provider APIs

**Sync Frequency:** Unknown (configured in MiniMediaScanner)

**Staleness Indicator:** `last_sync_time` column

**API Behavior:**
- Returns whatever data exists in database
- No real-time provider API calls
- No cache invalidation
- No sync triggering

**Client Considerations:**
- Check `lastSyncTime` in response
- Stale data possible (hours to days old)
- No guarantee of completeness
- Provider outages affect sync, not queries
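
A client-side staleness check based on `lastSyncTime` could look like the sketch below. The `IsStale` helper and the seven-day tolerance are assumptions chosen for illustration; the API itself defines no freshness threshold.

```csharp
using System;

// Hypothetical client-side check: a record is stale when it has never been
// synced or when its last sync is older than the allowed age.
static bool IsStale(DateTimeOffset? lastSyncTime, TimeSpan maxAge, DateTimeOffset asOf) =>
    lastSyncTime is null || asOf - lastSyncTime.Value > maxAge;

// A fixed reference time keeps the example deterministic.
var referenceTime = new DateTimeOffset(2024, 1, 10, 0, 0, 0, TimeSpan.Zero);
var tolerance = TimeSpan.FromDays(7); // illustrative tolerance, not a value from the API

Console.WriteLine(IsStale(referenceTime.AddDays(-2), tolerance, referenceTime));  // False
Console.WriteLine(IsStale(referenceTime.AddDays(-30), tolerance, referenceTime)); // True
Console.WriteLine(IsStale(null, tolerance, referenceTime));                       // True
```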

## Provider Feature Matrix

| Feature | Spotify | Tidal | MusicBrainz | Deezer | Discogs | SoundCloud |
|---------|---------|-------|-------------|--------|---------|------------|
| **Artist Data** |
| Popularity | ✓ | ✗ | ✗ | ✓ (fans) | ✗ | ✗ |
| Followers | ✓ | ✗ | ✗ | ✓ (fans) | ✗ | ✓ |
| Genres | ✓ | ✗ | ✓ | ✓ | ✗ | ✗ |
| Images | ✓ | ✓ | ✗ | ✓ | ✗ | ✓ (avatar) |
| Sort Name | ✗ | ✗ | ✓ | ✗ | ✗ | ✗ |
| Aliases | ✗ | ✗ | ✗ | ✗ | ✓ | ✗ |
| **Album Data** |
| Popularity | ✓ | ✗ | ✗ | ✓ (fans) | ✗ | N/A |
| Images | ✓ | ✓ | ✗ | ✓ | ✗ | N/A |
| Label | ✓ | ✗ | ✓ | ✗ | ✓ | N/A |
| UPC | ✓ | ✓ | ✓ (barcode) | ✗ | ✓ | N/A |
| Copyright | ✓ | ✓ | ✗ | ✗ | ✗ | N/A |
| Album Type | ✓ | ✗ | ✓ | ✗ | ✗ | N/A |
| **Track Data** |
| Popularity | ✓ | ✗ | ✗ | ✓ | ✗ | ✓ (playback_count) |
| Duration | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Explicit | ✓ | ✓ | ✗ | ✓ | ✗ | ✗ |
| ISRC | ✓ | ✓ | ✓ | ✗ | ✗ | ✗ |
| Disc/Track # | ✓ | ✓ | ✓ | ✓ | ✓ | ✗ |

## Database Size Estimates

**Assumptions:**
- 1 million artists
- 10 million albums
- 100 million tracks

**Spotify Tables:**
- `spotify_artist`: ~500 MB
- `spotify_artist_image`: ~200 MB
- `spotify_album`: ~5 GB
- `spotify_album_artist`: ~1 GB
- `spotify_album_image`: ~2 GB
- `spotify_track`: ~50 GB
- `spotify_track_artist`: ~10 GB
- **Total:** ~70 GB (used as the per-provider figure)

**All Providers:** ~420 GB (6 providers × ~70 GB, assuming each provider is comparable in size to Spotify)

**Indexes:** ~200 GB (GIN indexes are large)

**Total Database:** ~620 GB for comprehensive catalog
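
The totals above are simple sums. The sketch below reproduces the arithmetic so the estimate can be re-run with different per-table figures; the variable names are illustrative, and the numbers are the estimates listed in this section, not measurements.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

// Estimated sizes in GB for the Spotify tables, copied from the list in this section.
var spotifyTablesGb = new Dictionary<string, double>
{
    ["spotify_artist"] = 0.5,
    ["spotify_artist_image"] = 0.2,
    ["spotify_album"] = 5,
    ["spotify_album_artist"] = 1,
    ["spotify_album_image"] = 2,
    ["spotify_track"] = 50,
    ["spotify_track_artist"] = 10,
};

double spotifyGb = spotifyTablesGb.Values.Sum();        // about 68.7
double perProviderGb = Math.Round(spotifyGb / 10) * 10; // rounded to 70
const int providerCount = 6;
const double indexGb = 200;                             // index estimate from this section

double tablesGb = perProviderGb * providerCount;        // 420
double totalGb = tablesGb + indexGb;                    // 620

Console.WriteLine(string.Format(CultureInfo.InvariantCulture,
    "Spotify: {0:F1} GB, all providers: {1:F0} GB, total: {2:F0} GB",
    spotifyGb, tablesGb, totalGb));
```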

**Implications:**
- Requires substantial storage
- Backup/restore time significant
- Index rebuilds time-consuming
- Connection pooling critical

## Performance Considerations

### Query Performance

**Fuzzy Search:**
- With GIN index: 10-50ms for 20 results
- Without index: 5-30 seconds (full table scan)
- Threshold tuning affects result count and speed

**ID Lookup:**
- With primary key: <1ms
- With foreign key index: 1-5ms

**Join Queries:**
- Album with artists: 5-20ms
- Track with album and artists: 10-30ms
- Depends on relationship cardinality

### Optimization Strategies

**Implemented:**
- GIN indexes for fuzzy search
- B-tree indexes for foreign keys
- Connection pooling
- Parameterized queries (SQL injection prevention)

**Missing:**
- Query result caching (Redis/Memcached)
- Materialized views for complex joins
- Partitioning for large tables
- Read replicas for horizontal scaling

### Bottlenecks

1. **GIN Index Size:** Large memory footprint
2. **Fuzzy Search:** CPU-intensive similarity calculations
3. **Multi-Provider Queries:** 6 parallel database queries
4. **No Caching:** Every request hits database
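5. **Connection Pool Limit:** 100 max connections per instance, which already equals PostgreSQL's default `max_connections` of 100, so multiple API instances can exhaust the server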
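
The interaction between the multi-provider fan-out and the pool limit can be sketched as follows. `SearchAsync` is a stand-in that only simulates latency. The example assumes each provider query holds one pooled connection for its duration and that all providers share a single pool; neither assumption is confirmed by the source.

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;

string[] providers = { "spotify", "tidal", "musicbrainz", "deezer", "discogs", "soundcloud" };

// Stand-in for a repository call; a real one would open a pooled connection
// and run the provider-specific search query.
static async Task<(string Provider, int Hits)> SearchAsync(string provider, string term)
{
    await Task.Delay(10); // simulated query latency
    return (provider, 0);
}

// Fan out one query per provider and wait for all of them.
var results = await Task.WhenAll(providers.Select(p => SearchAsync(p, "beatles")));
Console.WriteLine($"{results.Length} provider queries completed"); // 6 provider queries completed

// With one connection per provider query, each aggregated request can hold
// up to six connections at once.
const int maxPoolSize = 100;
Console.WriteLine($"Concurrent fan-out requests before pool waits: {maxPoolSize / providers.Length}"); // 16
```

Under those assumptions, roughly 16 simultaneous aggregated searches saturate a 100-connection pool, after which further requests wait for a free connection.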

## Data Integrity

**Constraints:**
- Primary keys on all entity tables
- Foreign keys for relationships
- NOT NULL on critical fields (id, name)

**No Constraints:**
- No unique constraints on names (duplicates allowed)
- No check constraints on data ranges
- No triggers for data validation

**Orphan Prevention:**
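- Foreign keys with CASCADE delete are assumed but not confirmed: the DDL above declares no `ON DELETE` action, and PostgreSQL defaults to `NO ACTION`, which rejects deleting a parent row that still has child rows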
- Junction tables maintain referential integrity

**Data Quality:**
- Depends entirely on MiniMediaScanner sync quality
- No validation in this API
- Garbage in, garbage out

## Backup and Recovery

**Responsibility:** Database administrator (not API)

**Recommended Strategy:**
- Daily full backups
- Continuous WAL archiving
- Point-in-time recovery capability
- Backup retention: 30 days

**Recovery Time:**
- Full restore: Hours (620 GB database)
- Index rebuild: Hours (GIN indexes)
- Sync from providers: Days to weeks

## Schema Evolution

**Change Process:**
1. MiniMediaScanner updates schema
2. MiniMediaScanner deploys migration
3. MiniMediaMetadataAPI updates models
4. MiniMediaMetadataAPI redeploys

**Risk:** Breaking changes require coordinated deployment.

**Mitigation:**
- Additive changes only (new columns, tables)
- Deprecation period for removals
- Version compatibility checks
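
One possible form of a version compatibility check is to compare the columns the API's SQL expects against the columns the live schema exposes. The sketch below shows only the comparison; `MissingColumns` is a hypothetical helper, and the live column list is hard-coded here where a real check would read it from `information_schema.columns`.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical startup check: report columns the API's SQL expects but the
// live schema no longer provides.
static List<string> MissingColumns(IEnumerable<string> expected, IEnumerable<string> actual)
{
    var present = new HashSet<string>(actual, StringComparer.OrdinalIgnoreCase);
    return expected.Where(column => !present.Contains(column)).ToList();
}

// Columns the artist search query selects from spotify_artist.
string[] expectedColumns =
{
    "id", "name", "popularity", "external_url", "followers", "genres", "last_sync_time",
};

// Stand-in for the live schema; in this made-up scenario "followers" was renamed upstream.
string[] liveColumns =
{
    "id", "name", "popularity", "external_url", "follower_count", "genres", "last_sync_time",
};

var missing = MissingColumns(expectedColumns, liveColumns);
Console.WriteLine(missing.Count == 0
    ? "schema compatible"
    : "missing columns: " + string.Join(", ", missing)); // missing columns: followers
```

Running such a check at startup would surface a breaking schema change as a clear error instead of a failed query at request time.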

**No Automated Migration:** API has no migration framework.