feat: initial implementation of metadata aggregator

- gRPC service with MusicBrainz provider
- PostgreSQL schema with migrations
- Service layer with database-first caching
- Repository pattern for data access
- YAML configuration support
- Research documentation for 17 music metadata projects
2026-04-28 16:27:14 +02:00
commit a1f6701bac
163 changed files with 95884 additions and 0 deletions
@@ -0,0 +1,501 @@
+# MusicMetaLinker Data Architecture
+
+## Data Storage Model
+
+MusicMetaLinker maintains no persistent data store. All working data lives in memory during execution.
+
+**No database:** No SQL, no NoSQL, no embedded databases.
+
+**No file-based persistence:** No local cache files, no serialized objects. The only files written are the JAMS and CSV outputs of batch processing.
+
+**Stateless operation:** Each Align instance is independent. No shared state across instances.
+
+## Input Data Formats
+
+### Python Objects
+
+Primary input method: Constructor parameters to Align class.
+
+**Supported data types:**
+
+```python
+{
+    "mbid_track": str,        # UUID format
+    "mbid_release": str,      # UUID format
+    "artist": str,            # Free text
+    "album": str,             # Free text
+    "track": str,             # Free text
+    "track_number": int,      # Positive integer
+    "duration": int | float,  # Seconds
+    "isrc": str,              # ISRC format (no validation)
+    "strict": bool            # Matching mode
+}
+```
+
+**No validation:** Input is accepted as-is. Invalid data fails silently: the affected getters return None instead of raising an error.
+
+**No normalization:** Artist names, track titles used exactly as provided. No case normalization, no whitespace trimming, no Unicode normalization.
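+
+A minimal sketch of the normalization that is absent here, assuming callers pre-process their own inputs before constructing Align. The helper name `normalize_text` is hypothetical and not part of MusicMetaLinker; it uses only the standard library.
+
```python
import unicodedata


def normalize_text(value):
    """Return a whitespace-trimmed, NFC-normalized, case-folded copy of value.

    None passes through unchanged so that missing fields stay missing.
    """
    if value is None:
        return None
    # Compose combining sequences (e + U+0301 becomes a single precomposed é).
    composed = unicodedata.normalize("NFC", value)
    # Collapse internal runs of whitespace and trim both ends.
    collapsed = " ".join(composed.split())
    return collapsed.casefold()
```
+
+Case folding is lossy, so a caller would probably keep the original string for display and use the normalized form only for matching.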
+
+### JAMS Files
+
+JAMS (JSON Annotated Music Specification) is the standard input format for batch processing.
+
+**JAMS structure:**
+
+```json
+{
+  "file_metadata": {
+    "title": "Track Name",
+    "artist": "Artist Name",
+    "release": "Album Name",
+    "duration": 123.45,
+    "identifiers": {
+      "musicbrainz": "mbid-uuid-here",
+      "isrc": "GBAYE9200070"
+    }
+  },
+  "sandbox": {
+    "type": "music_type",
+    "genre": "rock",
+    "track_number": 1,
+    "release_year": 2020
+  },
+  "annotations": []
+}
+```
+
+**Key sections:**
+
+**file_metadata:** Core track metadata. Required fields: title, artist. Optional: release, duration, identifiers.
+
+**sandbox:** Additional metadata. Free-form structure. Common fields: type, genre, track_number, release_year.
+
+**annotations:** Music information retrieval annotations (not used by MusicMetaLinker).
+
+**Parsing logic:**
+
+JAMSProcessor extracts:
+- title → track
+- artist → artist
+- release → album
+- duration → duration
+- identifiers.musicbrainz → mbid_track
+- identifiers.isrc → isrc
+- sandbox.track_number → track_number
+
+**Missing fields:** Treated as None. No errors raised.
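+
+The mapping above can be sketched on a plain parsed-JSON dict. This is an illustration only: the function name `extract_align_params` is hypothetical, and the real JAMSProcessor reads files through the jams library rather than raw dicts.
+
```python
import json


def extract_align_params(jams_doc):
    """Map a parsed JAMS document (a dict) to Align-style keyword arguments."""
    meta = jams_doc.get("file_metadata") or {}
    ids = meta.get("identifiers") or {}
    sandbox = jams_doc.get("sandbox") or {}
    # Every lookup uses .get(), so absent fields come back as None.
    return {
        "track": meta.get("title"),
        "artist": meta.get("artist"),
        "album": meta.get("release"),
        "duration": meta.get("duration"),
        "mbid_track": ids.get("musicbrainz"),
        "isrc": ids.get("isrc"),
        "track_number": sandbox.get("track_number"),
    }


example = json.loads(
    '{"file_metadata": {"title": "Track Name", "artist": "Artist Name",'
    ' "identifiers": {"isrc": "GBAYE9200070"}}, "sandbox": {"track_number": 1}}'
)
params = extract_align_params(example)
```
+
+Here `params["album"]` is None because the example document has no release field, which mirrors the no-error behavior described above.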
+
+### CSV Input
+
+No direct CSV input support. Batch processing outputs CSV but doesn't read it.
+
+For CSV input, users must:
+1. Parse CSV manually
+2. Create Align instances per row
+3. Collect results
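+
+The manual steps above might look like the following sketch. It assumes the CSV column names match the Align parameter names, which is a convention chosen for this example; `rows_to_params` is a hypothetical helper.
+
```python
import csv
import io


def rows_to_params(csv_text):
    """Yield one dict of Align-style keyword arguments per CSV row."""
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Empty cells become None, matching how missing fields are treated.
        params = {
            key: ((value or "").strip() or None)
            for key, value in row.items()
            if key is not None
        }
        if params.get("track_number") is not None:
            params["track_number"] = int(params["track_number"])
        if params.get("duration") is not None:
            params["duration"] = float(params["duration"])
        yield params


sample = (
    "artist,track,track_number,duration\n"
    "Artist Name,Track Name,1,123.45\n"
    "Other Artist,Other Track,,\n"
)
rows = list(rows_to_params(sample))
```
+
+Each dict in `rows` could then be passed to the constructor as `Align(**params)`, and the getter results collected per row.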
+
+## Output Data Formats
+
+### Python Objects
+
+Align instance acts as data container. Getters return individual fields.
+
+**No structured output method:** No to_dict(), no to_json(), no serialize().
+
+**Manual aggregation required:**
+
+```python
+linker = Align(...)
+result = {
+    "artist": linker.get_artist(),
+    "track": linker.get_track(),
+    "mbid": linker.get_mbid(),
+    "isrc": linker.get_isrc(),
+    "deezer_id": linker.get_deezer_id(),
+    # ... etc
+}
+```
+
+### JAMS Files
+
+Enriched JAMS files with added identifiers.
+
+**Enrichment process:**
+
+1. Read original JAMS file
+2. Extract metadata
+3. Create Align instance
+4. Query all services
+5. Add identifiers to file_metadata.identifiers section
+6. Write enriched JAMS file
+
+**Added identifiers:**
+
+```json
+{
+  "file_metadata": {
+    "identifiers": {
+      "musicbrainz": "mbid-from-query",
+      "isrc": "isrc-from-query",
+      "deezer": "deezer-id-from-query",
+      "youtube": "youtube-url-from-query",
+      "acousticbrainz": null
+    }
+  }
+}
+```
+
+**Preservation:** Original JAMS structure preserved. Only identifiers section modified.
+
+**Overwrite behavior:** Controlled by --overwrite flag. Without flag, existing identifiers preserved.
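+
+One way to read the overwrite rule is sketched below. The function `merge_identifiers` is hypothetical, and the choice that a None result never replaces an existing value is an assumption of this sketch, not documented behavior.
+
```python
def merge_identifiers(existing, found, overwrite=False):
    """Combine identifiers already in a JAMS file with newly queried ones."""
    merged = dict(existing)
    for name, value in found.items():
        if merged.get(name) is None:
            # Fill gaps; a failed lookup is still recorded as None (JSON null).
            merged[name] = value
        elif overwrite and value is not None:
            # Assumption: only a real value may replace an existing one.
            merged[name] = value
    return merged


existing = {"isrc": "GBAYE9200070", "musicbrainz": None}
found = {"isrc": "isrc-from-query", "musicbrainz": "mbid-from-query", "acousticbrainz": None}
kept = merge_identifiers(existing, found)
replaced = merge_identifiers(existing, found, overwrite=True)
```
+
+Without the flag, `kept` retains the original ISRC and only gains the missing MusicBrainz entry; with it, `replaced` takes the queried ISRC.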
+
+### CSV Output
+
+Batch processing generates CSV with all metadata and identifiers.
+
+**CSV schema:**
+
+| Column | Type | Description |
+|--------|------|-------------|
+| jams_file | str | Original JAMS filename |
+| track_name | str | Track title |
+| artist_name | str | Artist name |
+| album_name | str | Album/release name |
+| track_number | int | Track position |
+| duration | float | Duration in seconds |
+| release_year | int | Release year |
+| musicbrainz | str | MBID (UUID) |
+| isrc | str | ISRC code |
+| deezer_id | int | Deezer track ID |
+| deezer_url | str | Full Deezer URL |
+| youtube_url | str | Full YouTube URL |
+| acousticbrainz | str | AcousticBrainz URL (always null) |
+| spotify_id | str | Spotify ID (if available) |
+
+**Missing values:** Empty cells or "None" string (inconsistent).
+
+**Encoding:** UTF-8. No BOM.
+
+**Delimiter:** Comma. No escaping issues documented.
+
+**Headers:** First row contains column names.
+
+**Output location:** Same directory as the input JAMS files. The CSV filename is derived from the directory name.
+
+## Data Transformation Pipeline
+
+### Input Transformation
+
+1. **JAMS parsing:** JSON deserialization via jams library
+2. **Field extraction:** Map JAMS fields to Align parameters
+3. **Type conversion:** String to int for track_number, string to float for duration
+4. **Null handling:** Missing fields become None
+
+### Query Transformation
+
+1. **Metadata normalization:** None (passed as-is to services)
+2. **Duration conversion:** MusicBrainz milliseconds → seconds
+3. **ID extraction:** Parse service-specific response formats
+4. **URL construction:** Build full URLs from IDs
+
+### Output Transformation
+
+1. **Result aggregation:** Collect all getter results
+2. **CSV serialization:** pandas DataFrame to CSV
+3. **JAMS enrichment:** Inject identifiers into JSON structure
+4. **File writing:** JSON serialization with indentation
+
+## Data Quality Issues
+
+### Input Data Quality
+
+**No validation:**
+- Invalid MBIDs accepted (wrong format, non-existent)
+- Invalid ISRCs accepted (wrong format, non-existent)
+- Negative durations accepted
+- Empty strings accepted
+
+**No sanitization:**
+- Special characters in metadata not escaped
+- SQL injection risk if metadata used in queries (not applicable here)
+- Command injection risk if metadata used in shell commands (not applicable here)
+
+**No normalization:**
+- "The Beatles" vs "Beatles" treated as different
+- "feat." vs "featuring" vs "ft." not normalized
+- Unicode variants not normalized (e.g., é vs e + combining accent)
+
+### Output Data Quality
+
+**Inconsistent null representation:**
+- Python: None
+- CSV: Empty string or "None" string
+- JAMS: null or missing key
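+
+A consumer of these outputs could collapse the different null spellings before comparing values. The helper below is a hypothetical sketch, not something MusicMetaLinker provides.
+
```python
import math


def to_none(value):
    """Collapse the null spellings listed above into Python None."""
    if value is None:
        return None
    if isinstance(value, float) and math.isnan(value):
        # pandas.read_csv turns empty cells into NaN by default.
        return None
    if isinstance(value, str) and value.strip() in ("", "None", "null"):
        return None
    return value
```
+
+Values such as 0 or a non-empty string pass through untouched, so the helper is safe to apply to every column.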
+
+**No data validation:**
+- Retrieved MBIDs not validated as UUIDs
+- Retrieved ISRCs not validated as ISRC format
+- Retrieved URLs not validated as valid URLs
+
+**No conflict resolution:**
+- If MusicBrainz and Deezer return different artists, no reconciliation
+- First successful query wins, no cross-validation
+
+### Data Accuracy Issues
+
+**YouTube matching:** Weak matching logic. The first search result is assumed correct, so false positives are likely.
+
+**Duration filtering:** ±3 seconds threshold may be too loose for short tracks, too strict for live recordings.
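+
+A sketch of a duration filter with an adjustable window follows. The function name, the millisecond input, and the rule that a missing value passes the filter are assumptions of this example rather than MusicMetaLinker's actual code.
+
```python
def duration_matches(expected_s, candidate_ms, tolerance_s=3.0):
    """Return True when a candidate length in milliseconds is within tolerance.

    A missing value on either side means the filter cannot decide, so it passes.
    """
    if expected_s is None or candidate_ms is None:
        return True
    # Convert the candidate to seconds before comparing.
    return abs(candidate_ms / 1000.0 - expected_s) <= tolerance_s
```
+
+Exposing `tolerance_s` as a parameter would let callers tighten the window for short tracks or widen it for live recordings.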
+
+**Fuzzy matching:** No documented algorithm. Likely simple string similarity. Doesn't handle:
+- Transliterations (e.g., Japanese to romaji)
+- Abbreviations (e.g., "feat." vs "featuring")
+- Reorderings (e.g., "Artist feat. Guest" vs "Guest & Artist")
+
+**AcousticBrainz:** Always returns null (service shut down). Dead data field.
+
+## Data Flow Diagrams
+
+### Single Track Flow
+
+```
+Input (Python dict or JAMS)
+    ↓
+Align constructor
+    ↓
+[Lazy evaluation - no queries yet]
+    ↓
+User calls getter (e.g., get_mbid())
+    ↓
+Check cache
+    ↓
+If not cached:
+    ↓
+Determine service to query
+    ↓
+Execute service query
+    ↓
+Parse response
+    ↓
+Cache result
+    ↓
+Return to user
+```
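+
+The lazy, per-instance caching in this flow can be illustrated with a toy class. `LazyRecord` and `fake_lookup` are hypothetical stand-ins; whether the real Align caches failed (None) lookups is not documented, so that part is an assumption.
+
```python
class LazyRecord:
    """Toy stand-in for the getter pattern: resolve on first call, then reuse."""

    _MISSING = object()

    def __init__(self, resolver):
        self._resolver = resolver  # callable that performs the expensive lookup
        self._cache = {}           # field name -> resolved value, per instance

    def get(self, field):
        value = self._cache.get(field, self._MISSING)
        if value is self._MISSING:
            value = self._resolver(field)
            # Assumption: None results are cached too, so failures are not retried.
            self._cache[field] = value
        return value


calls = []


def fake_lookup(field):
    calls.append(field)
    return f"{field}-value"


record = LazyRecord(fake_lookup)
first = record.get("mbid")
second = record.get("mbid")
```
+
+The second call is served from the instance cache, so `fake_lookup` runs once. A new `LazyRecord` starts with an empty cache and would query again.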
+
+### Batch Processing Flow
+
+```
+Directory of JAMS files
+    ↓
+For each JAMS file:
+    ↓
+JAMSProcessor.extract_metadata()
+    ↓
+Create Align instance
+    ↓
+Call all getters
+    ↓
+Collect results in list
+    ↓
+End loop
+    ↓
+Convert list to pandas DataFrame
+    ↓
+Write CSV
+    ↓
+Optionally write enriched JAMS files
+```
+
+### Service Query Flow
+
+```
+Align.get_mbid()
+    ↓
+If mbid_track provided:
+    Return mbid_track
+    ↓
+Else if isrc provided:
+    Query MusicBrainz by ISRC
+    ↓
+Else:
+    Query MusicBrainz by metadata
+    ↓
+Parse MusicBrainz response
+    ↓
+Extract MBID
+    ↓
+Cache and return
+```
+
+## Data Caching Strategy
+
+### In-Memory Cache
+
+**Scope:** Single Align instance only.
+
+**Cache key:** Implicit (field name). No explicit key generation.
+
+**Cache invalidation:** None. Values cached for instance lifetime.
+
+**Cache size:** Small (one value per field, ~15 fields max).
+
+**Cache hit rate:** High for repeated getter calls on same instance. Zero across instances.
+
+### No Persistent Cache
+
+**Implications:**
+- Repeated queries for same track across runs
+- No offline operation
+- Network dependency for every query
+
+**Batch processing impact:**
+- Processing 1000 tracks = 1000+ API calls
+- No deduplication across tracks
+- High network usage
+
+### Cache Recommendations
+
+For production use:
+
+1. **Add persistent cache:** Redis or SQLite for cross-run caching
+2. **Cache key:** Hash of (artist, track, album, duration)
+3. **TTL:** 30 days (metadata rarely changes)
+4. **Invalidation:** Manual or TTL-based
+5. **Deduplication:** Cache identical queries across tracks
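+
+The recommendations above could be prototyped with SQLite from the standard library. Everything in this sketch (the `ResultCache` class, the table layout, JSON-encoded values) is an assumption made for illustration.
+
```python
import hashlib
import json
import sqlite3
import time

TTL_SECONDS = 30 * 24 * 3600  # 30 days, as suggested above


def cache_key(artist, track, album, duration):
    """Stable hash of the query fields; None values hash as JSON null."""
    payload = json.dumps([artist, track, album, duration], ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class ResultCache:
    def __init__(self, path=":memory:"):
        self._db = sqlite3.connect(path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS results "
            "(key TEXT PRIMARY KEY, value TEXT NOT NULL, stored_at REAL NOT NULL)"
        )

    def get(self, key, now=None):
        now = time.time() if now is None else now
        row = self._db.execute(
            "SELECT value, stored_at FROM results WHERE key = ?", (key,)
        ).fetchone()
        if row is None or now - row[1] > TTL_SECONDS:
            return None  # miss or expired entry
        return json.loads(row[0])

    def put(self, key, value, now=None):
        now = time.time() if now is None else now
        self._db.execute(
            "INSERT OR REPLACE INTO results (key, value, stored_at) VALUES (?, ?, ?)",
            (key, json.dumps(value), now),
        )
        self._db.commit()
```
+
+Because identical (artist, track, album, duration) tuples hash to the same key, this also covers deduplication across tracks. Hashing normalized rather than raw strings would raise the hit rate further.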
+
+## Data Privacy and Security
+
+### Personal Data
+
+**No personal data collected:** Only public music metadata.
+
+**No user tracking:** No analytics, no telemetry.
+
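+**No data sharing:** Results are not forwarded to third parties. Query metadata is, by necessity, sent to the services being queried (MusicBrainz, Deezer, YouTube Music, Spotify).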
+
+### API Credentials
+
+**Spotify credentials:** Stored in external mml_secrets.py file. Not encrypted. Not in version control.
+
+**Other services:** No credentials required.
+
+### Data Retention
+
+**No retention:** All data discarded when Align instance destroyed.
+
+**Batch output:** CSV and JAMS files written to disk. User responsible for retention and deletion.
+
+## Data Consistency
+
+### Cross-Service Consistency
+
+**No consistency checks:** If MusicBrainz returns artist "The Beatles" and Deezer returns "Beatles", no reconciliation.
+
+**First-wins strategy:** First successful query result used. No validation against other services.
+
+**Conflict scenarios:**
+- Different artists across services
+- Different track names across services
+- Different durations across services
+
+**No conflict resolution:** User receives inconsistent data.
+
+### Temporal Consistency
+
+**No versioning:** Metadata retrieved at query time. No timestamp recorded.
+
+**Staleness:** If MusicBrainz updates metadata after query, Align instance has stale data.
+
+**No refresh:** No way to refresh cached data without creating new instance.
+
+## Data Completeness
+
+### Missing Data Handling
+
+**Graceful degradation:** Missing fields return None. No errors.
+
+**Partial results:** If MusicBrainz succeeds but Deezer fails, MusicBrainz data returned.
+
+**No completeness metrics:** No indication of how many fields successfully retrieved.
+
+### Required vs Optional Fields
+
+**No required fields:** All constructor parameters optional.
+
+**Minimum viable input:** At least one of (mbid_track, isrc, artist+track) recommended.
+
+**Degenerate cases:**
+- Empty Align() constructor: All getters return None
+- Only duration provided: All getters return None (no searchable metadata)
+
+## Data Format Standards
+
+### Identifier Formats
+
+**MBID:** UUID format (e.g., "6b9e7b9e-8f9e-4f9e-9f9e-9f9e9f9e9f9e"). No validation.
+
+**ISRC:** 12-character alphanumeric (e.g., "GBAYE9200070"). No validation.
+
+**Deezer ID:** Integer. No range validation.
+
+**YouTube ID:** 11-character string of letters, digits, `-`, and `_` (e.g., "dQw4w9WgXcQ"). No validation.
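+
+Format checks for the first two identifier types could be added with the standard library alone. The function names are hypothetical; the checks confirm format only and say nothing about whether an identifier exists in MusicBrainz or an ISRC registry.
+
```python
import re
import uuid

# Two letters, three alphanumerics, two-digit year, five-digit designation code.
ISRC_RE = re.compile(r"[A-Z]{2}[A-Z0-9]{3}[0-9]{7}")


def is_valid_mbid(value):
    """True when value is a canonical hyphenated UUID string."""
    try:
        return str(uuid.UUID(value)) == value.lower()
    except (ValueError, AttributeError, TypeError):
        return False


def is_valid_isrc(value):
    """True for a 12-character ISRC, with or without hyphens."""
    if not isinstance(value, str):
        return False
    return ISRC_RE.fullmatch(value.replace("-", "").upper()) is not None
```
+
+Both functions return False for None, so they can be applied directly to getter results.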
+
+### Metadata Formats
+
+**Artist, track, album:** Free text. No format constraints.
+
+**Duration:** Seconds (int or float). MusicBrainz milliseconds converted to seconds.
+
+**Track number:** Integer. No validation (negative numbers accepted).
+
+**Release date:** ISO format (YYYY-MM-DD) or year only (YYYY). Inconsistent across services.
+
+**BPM:** Integer or float. No range validation.
+
+## Data Interoperability
+
+### JAMS Compatibility
+
+JAMS is a standard format in music information retrieval research. MusicMetaLinker's JAMS support enables interoperability with:
+- mir_eval (evaluation framework)
+- librosa (audio analysis)
+- madmom (music analysis)
+- Other MIR tools
+
+### Service Compatibility
+
+**MusicBrainz:** Uses the community-maintained musicbrainzngs library, which wraps the MusicBrainz web service. Compatibility with API changes depends on library updates.
+
+**Deezer:** Uses the third-party deezer-python library (not published by Deezer). Compatibility with the Deezer API depends on library updates.
+
+**YouTube Music:** Uses unofficial ytmusicapi. Fragile to YouTube changes. No API stability guarantees.
+
+**Spotify:** Uses the third-party spotipy library (not published by Spotify), which wraps the official Spotify Web API.
+
+## Data Limitations
+
+1. **No bulk operations:** Each track processed individually
+2. **No streaming:** All data loaded into memory
+3. **No compression:** JAMS files written uncompressed
+4. **No encryption:** All data stored in plaintext
+5. **No checksums:** No data integrity verification
+6. **No versioning:** No metadata version tracking
+7. **No provenance:** No record of which service provided which field
+8. **No confidence scores:** No indication of match quality
+
+## Data Recommendations
+
+For production use:
+
+1. **Add validation:** Validate all input and output formats
+2. **Add normalization:** Normalize artist names, track titles
+3. **Add conflict resolution:** Cross-validate results across services
+4. **Add provenance tracking:** Record which service provided each field
+5. **Add confidence scores:** Indicate match quality
+6. **Add persistent cache:** Reduce API calls
+7. **Add data versioning:** Track when metadata retrieved
+8. **Add bulk operations:** Process multiple tracks efficiently
+9. **Remove dead fields:** Delete AcousticBrainz from output
+10. **Add structured output:** to_dict(), to_json() methods
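+
+Items 4, 5, and 10 could share one small container. The sketch below is hypothetical throughout: the class names, the `record` method, and the 0.0 to 1.0 confidence scale are assumptions, not an existing or planned MusicMetaLinker API.
+
```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class FieldValue:
    value: Optional[str]
    source: Optional[str] = None        # which service supplied the value
    confidence: Optional[float] = None  # 0.0 to 1.0; the scale is an assumption


@dataclass
class LinkResult:
    fields: dict = field(default_factory=dict)

    def record(self, name, value, source=None, confidence=None):
        """Store a value together with its provenance."""
        self.fields[name] = FieldValue(value, source, confidence)

    def to_dict(self):
        return {name: asdict(fv) for name, fv in self.fields.items()}

    def to_json(self):
        return json.dumps(self.to_dict(), sort_keys=True)


result = LinkResult()
result.record("isrc", "GBAYE9200070", source="musicbrainz", confidence=1.0)
```
+
+Keeping the source next to each value would also make cross-service conflicts visible, since two services reporting different artists would show up as distinct records.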
+
+The data model is simple and functional for research use. Production use requires significant enhancements.