# MusicMetaLinker Data Architecture

## Data Storage Model

MusicMetaLinker has no persistent data storage. All data is in-memory during execution.

**No database:** No SQL, no NoSQL, no embedded databases.

**No file-based persistence:** No local cache files, no serialized objects (except JAMS output).

**Stateless operation:** Each Align instance is independent. No shared state across instances.

## Input Data Formats

### Python Objects

Primary input method: Constructor parameters to Align class.

**Supported data types:**

```python
{
    "mbid_track": str,         # UUID format
    "mbid_release": str,       # UUID format
    "artist": str,             # Free text
    "album": str,              # Free text
    "track": str,              # Free text
    "track_number": int,       # Positive integer
    "duration": int | float,   # Seconds
    "isrc": str,               # ISRC format (no validation)
    "strict": bool             # Matching mode
}
```

**No validation:** Input accepted as-is. Invalid data causes silent failures (returns None).

**No normalization:** Artist names, track titles used exactly as provided. No case normalization, no whitespace trimming, no Unicode normalization.

### JAMS Files

JAMS (JSON Annotated Music Specification) is the standard input format for batch processing.

**JAMS structure:**

```json
{
  "file_metadata": {
    "title": "Track Name",
    "artist": "Artist Name",
    "release": "Album Name",
    "duration": 123.45,
    "identifiers": {
      "musicbrainz": "mbid-uuid-here",
      "isrc": "GBAYE9200070"
    }
  },
  "sandbox": {
    "type": "music_type",
    "genre": "rock",
    "track_number": 1,
    "release_year": 2020
  },
  "annotations": []
}
```

**Key sections:**

**file_metadata:** Core track metadata. Required fields: title, artist. Optional: release, duration, identifiers.

**sandbox:** Additional metadata. Free-form structure. Common fields: type, genre, track_number, release_year.

**annotations:** Music information retrieval annotations (not used by MusicMetaLinker).

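The JAMS layout above can be read with nothing more than the standard `json` module. The sketch below is illustrative only: `load_track_fields` is a hypothetical helper, not part of MusicMetaLinker, and the mapping from JAMS keys to the constructor parameter names listed under Python Objects is an assumption made for the example. Real JAMS files are normally loaded through the jams library; plain `json` is used here to keep the example dependency-free.

```python
import json
from typing import Any, Dict


def load_track_fields(jams_text: str) -> Dict[str, Any]:
    """Map a JAMS-style JSON document to Align-style parameter names.

    Hypothetical helper for illustration. Absent sections or keys
    yield None instead of raising.
    """
    doc = json.loads(jams_text)
    meta = doc.get("file_metadata") or {}
    identifiers = meta.get("identifiers") or {}
    sandbox = doc.get("sandbox") or {}
    return {
        "track": meta.get("title"),
        "artist": meta.get("artist"),
        "album": meta.get("release"),
        "duration": meta.get("duration"),
        "mbid_track": identifiers.get("musicbrainz"),
        "isrc": identifiers.get("isrc"),
        "track_number": sandbox.get("track_number"),
    }


# A reduced document: no release and no MusicBrainz identifier.
sample = json.dumps({
    "file_metadata": {
        "title": "Track Name",
        "artist": "Artist Name",
        "duration": 123.45,
        "identifiers": {"isrc": "GBAYE9200070"},
    },
    "sandbox": {"track_number": 1},
    "annotations": [],
})
fields = load_track_fields(sample)
```

Using `dict.get` with an empty-dict fallback means a file without a `sandbox` or `identifiers` section produces None values rather than a `KeyError`.
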
**Parsing logic:** JAMSProcessor extracts:

- title → track
- artist → artist
- release → album
- duration → duration
- identifiers.musicbrainz → mbid_track
- identifiers.isrc → isrc
- sandbox.track_number → track_number

**Missing fields:** Treated as None. No errors raised.

### CSV Input

No direct CSV input support. Batch processing outputs CSV but doesn't read it.

For CSV input, users must:

1. Parse CSV manually
2. Create Align instances per row
3. Collect results

## Output Data Formats

### Python Objects

Align instance acts as data container. Getters return individual fields.

**No structured output method:** No to_dict(), no to_json(), no serialize().

**Manual aggregation required:**

```python
linker = Align(...)
result = {
    "artist": linker.get_artist(),
    "track": linker.get_track(),
    "mbid": linker.get_mbid(),
    "isrc": linker.get_isrc(),
    "deezer_id": linker.get_deezer_id(),
    # ... etc
}
```

### JAMS Files

Enriched JAMS files with added identifiers.

**Enrichment process:**

1. Read original JAMS file
2. Extract metadata
3. Create Align instance
4. Query all services
5. Add identifiers to file_metadata.identifiers section
6. Write enriched JAMS file

**Added identifiers:**

```json
{
  "file_metadata": {
    "identifiers": {
      "musicbrainz": "mbid-from-query",
      "isrc": "isrc-from-query",
      "deezer": "deezer-id-from-query",
      "youtube": "youtube-url-from-query",
      "acousticbrainz": null
    }
  }
}
```

**Preservation:** Original JAMS structure preserved. Only identifiers section modified.

**Overwrite behavior:** Controlled by --overwrite flag. Without flag, existing identifiers preserved.

### CSV Output

Batch processing generates CSV with all metadata and identifiers.

**CSV schema:**

| Column | Type | Description |
|--------|------|-------------|
| jams_file | str | Original JAMS filename |
| track_name | str | Track title |
| artist_name | str | Artist name |
| album_name | str | Album/release name |
| track_number | int | Track position |
| duration | float | Duration in seconds |
| release_year | int | Release year |
| musicbrainz | str | MBID (UUID) |
| isrc | str | ISRC code |
| deezer_id | int | Deezer track ID |
| deezer_url | str | Full Deezer URL |
| youtube_url | str | Full YouTube URL |
| acousticbrainz | str | AcousticBrainz URL (always null) |
| spotify_id | str | Spotify ID (if available) |

**Missing values:** Empty cells or "None" string (inconsistent).

**Encoding:** UTF-8. No BOM.

**Delimiter:** Comma. No escaping issues documented.

**Headers:** First row contains column names.

**Output location:** Same directory as input JAMS files, named based on directory name.

## Data Transformation Pipeline

### Input Transformation

1. **JAMS parsing:** JSON deserialization via jams library
2. **Field extraction:** Map JAMS fields to Align parameters
3. **Type conversion:** String to int for track_number, string to float for duration
4. **Null handling:** Missing fields become None

### Query Transformation

1. **Metadata normalization:** None (passed as-is to services)
2. **Duration conversion:** MusicBrainz milliseconds → seconds
3. **ID extraction:** Parse service-specific response formats
4. **URL construction:** Build full URLs from IDs

### Output Transformation

1. **Result aggregation:** Collect all getter results
2. **CSV serialization:** pandas DataFrame to CSV
3. **JAMS enrichment:** Inject identifiers into JSON structure
4. **File writing:** JSON serialization with indentation

## Data Quality Issues

### Input Data Quality

**No validation:**

- Invalid MBIDs accepted (wrong format, non-existent)
- Invalid ISRCs accepted (wrong format, non-existent)
- Negative durations accepted
- Empty strings accepted

**No sanitization:**

- Special characters in metadata not escaped
- SQL injection risk if metadata used in queries (not applicable here)
- Command injection risk if metadata used in shell commands (not applicable here)

**No normalization:**

- "The Beatles" vs "Beatles" treated as different
- "feat." vs "featuring" vs "ft." not normalized
- Unicode variants not normalized (e.g., é vs e + combining accent)

### Output Data Quality

**Inconsistent null representation:**

- Python: None
- CSV: Empty string or "None" string
- JAMS: null or missing key

**No data validation:**

- Retrieved MBIDs not validated as UUIDs
- Retrieved ISRCs not validated as ISRC format
- Retrieved URLs not validated as valid URLs

**No conflict resolution:**

- If MusicBrainz and Deezer return different artists, no reconciliation
- First successful query wins, no cross-validation

### Data Accuracy Issues

**YouTube matching:** Weak matching logic. First result assumed correct. High false positive rate.

**Duration filtering:** ±3 seconds threshold may be too loose for short tracks, too strict for live recordings.

**Fuzzy matching:** No documented algorithm. Likely simple string similarity. Doesn't handle:

- Transliterations (e.g., Japanese to romaji)
- Abbreviations (e.g., "feat." vs "featuring")
- Reorderings (e.g., "Artist feat. Guest" vs "Guest & Artist")

**AcousticBrainz:** Always returns null (service shut down). Dead data field.

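The validation and normalization gaps listed in this section could be narrowed with a few standard-library checks. The following is a minimal sketch under stated assumptions: `is_valid_mbid`, `is_valid_isrc`, and `normalize_text` are hypothetical helpers, not part of MusicMetaLinker, and the ISRC pattern assumes the usual layout of a two-letter country code, a three-character registrant code, and seven digits. The checks cover format only; they cannot tell whether an identifier actually exists.

```python
import re
import unicodedata
import uuid

# Assumed ISRC layout: country (2 letters), registrant (3 alphanumerics),
# year + designation (7 digits). Hyphens are stripped before matching.
ISRC_PATTERN = re.compile(r"[A-Z]{2}[A-Z0-9]{3}[0-9]{7}")


def is_valid_mbid(value) -> bool:
    """True if value is a canonical hyphenated UUID string."""
    try:
        return str(uuid.UUID(value)) == value.lower()
    except (ValueError, AttributeError, TypeError):
        return False


def is_valid_isrc(value) -> bool:
    """True if value matches the assumed 12-character ISRC layout."""
    if not isinstance(value, str):
        return False
    compact = value.replace("-", "").upper()
    return ISRC_PATTERN.fullmatch(compact) is not None


def normalize_text(value: str) -> str:
    """Compose Unicode (NFC), collapse whitespace, and casefold."""
    composed = unicodedata.normalize("NFC", value)
    return " ".join(composed.split()).casefold()


# "e" + combining acute accent and the precomposed "é" compare equal
# after normalization.
same_artist = normalize_text("  Beyonce\u0301 ") == normalize_text("Beyonc\u00e9")
```

This handles the Unicode and whitespace variants only. Differences such as "The Beatles" vs "Beatles" or "feat." vs "featuring" would need additional, domain-specific rules.
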
## Data Flow Diagrams

### Single Track Flow

```
Input (Python dict or JAMS)
    ↓
Align constructor
    ↓
[Lazy evaluation - no queries yet]
    ↓
User calls getter (e.g., get_mbid())
    ↓
Check cache
    ↓
If not cached:
    ↓
    Determine service to query
    ↓
    Execute service query
    ↓
    Parse response
    ↓
    Cache result
    ↓
Return to user
```

### Batch Processing Flow

```
Directory of JAMS files
    ↓
For each JAMS file:
    ↓
    JAMSProcessor.extract_metadata()
    ↓
    Create Align instance
    ↓
    Call all getters
    ↓
    Collect results in list
    ↓
End loop
    ↓
Convert list to pandas DataFrame
    ↓
Write CSV
    ↓
Optionally write enriched JAMS files
```

### Service Query Flow

```
Align.get_mbid()
    ↓
If mbid_track provided:
    Return mbid_track
    ↓
Else if isrc provided:
    Query MusicBrainz by ISRC
    ↓
Else:
    Query MusicBrainz by metadata
    ↓
Parse MusicBrainz response
    ↓
Extract MBID
    ↓
Cache and return
```

## Data Caching Strategy

### In-Memory Cache

**Scope:** Single Align instance only.

**Cache key:** Implicit (field name). No explicit key generation.

**Cache invalidation:** None. Values cached for instance lifetime.

**Cache size:** Small (one value per field, ~15 fields max).

**Cache hit rate:** High for repeated getter calls on same instance. Zero across instances.

### No Persistent Cache

**Implications:**

- Repeated queries for same track across runs
- No offline operation
- Network dependency for every query

**Batch processing impact:**

- Processing 1000 tracks = 1000+ API calls
- No deduplication across tracks
- High network usage

### Cache Recommendations

For production use:

1. **Add persistent cache:** Redis or SQLite for cross-run caching
2. **Cache key:** Hash of (artist, track, album, duration)
3. **TTL:** 30 days (metadata rarely changes)
4. **Invalidation:** Manual or TTL-based
5. **Deduplication:** Cache identical queries across tracks

## Data Privacy and Security

### Personal Data

**No personal data collected:** Only public music metadata.

**No user tracking:** No analytics, no telemetry.

**No data sharing:** Results not sent to third parties.

### API Credentials

**Spotify credentials:** Stored in external mml_secrets.py file. Not encrypted. Not in version control.

**Other services:** No credentials required.

### Data Retention

**No retention:** All data discarded when Align instance destroyed.

**Batch output:** CSV and JAMS files written to disk. User responsible for retention and deletion.

## Data Consistency

### Cross-Service Consistency

**No consistency checks:** If MusicBrainz returns artist "The Beatles" and Deezer returns "Beatles", no reconciliation.

**First-wins strategy:** First successful query result used. No validation against other services.

**Conflict scenarios:**

- Different artists across services
- Different track names across services
- Different durations across services

**No conflict resolution:** User receives inconsistent data.

### Temporal Consistency

**No versioning:** Metadata retrieved at query time. No timestamp recorded.

**Staleness:** If MusicBrainz updates metadata after query, Align instance has stale data.

**No refresh:** No way to refresh cached data without creating new instance.

## Data Completeness

### Missing Data Handling

**Graceful degradation:** Missing fields return None. No errors.

**Partial results:** If MusicBrainz succeeds but Deezer fails, MusicBrainz data returned.

**No completeness metrics:** No indication of how many fields successfully retrieved.

### Required vs Optional Fields

**No required fields:** All constructor parameters optional.

**Minimum viable input:** At least one of (mbid_track, isrc, artist+track) recommended.

**Degenerate cases:**

- Empty Align() constructor: All getters return None
- Only duration provided: All getters return None (no searchable metadata)

## Data Format Standards

### Identifier Formats

**MBID:** UUID format (e.g., "6b9e7b9e-8f9e-4f9e-9f9e-9f9e9f9e9f9e"). No validation.

**ISRC:** 12-character alphanumeric (e.g., "GBAYE9200070"). No validation.

**Deezer ID:** Integer. No range validation.

**YouTube ID:** Alphanumeric string (e.g., "dQw4w9WgXcQ"). No validation.

### Metadata Formats

**Artist, track, album:** Free text. No format constraints.

**Duration:** Seconds (int or float). MusicBrainz milliseconds converted to seconds.

**Track number:** Integer. No validation (negative numbers accepted).

**Release date:** ISO format (YYYY-MM-DD) or year only (YYYY). Inconsistent across services.

**BPM:** Integer or float. No range validation.

## Data Interoperability

### JAMS Compatibility

JAMS is a standard format in music information retrieval research. MusicMetaLinker's JAMS support enables interoperability with:

- mir_eval (evaluation framework)
- librosa (audio analysis)
- madmom (music analysis)
- Other MIR tools

### Service Compatibility

**MusicBrainz:** Uses the musicbrainzngs library (community-maintained Python bindings for the MusicBrainz web service). Compatible with MusicBrainz API changes (library handles versioning).

**Deezer:** Uses the deezer-python library (a community-maintained wrapper, not published by Deezer). Compatible with Deezer API.

**YouTube Music:** Uses unofficial ytmusicapi. Fragile to YouTube changes. No API stability guarantees.

**Spotify:** Uses the spotipy library (community-maintained, not an official Spotify SDK). Compatible with Spotify API.

## Data Limitations

1. **No bulk operations:** Each track processed individually
2. **No streaming:** All data loaded into memory
3. **No compression:** JAMS files written uncompressed
4. **No encryption:** All data stored in plaintext
5. **No checksums:** No data integrity verification
6. **No versioning:** No metadata version tracking
7. **No provenance:** No record of which service provided which field
8. **No confidence scores:** No indication of match quality

## Data Recommendations

For production use:

1. **Add validation:** Validate all input and output formats
2. **Add normalization:** Normalize artist names, track titles
3. **Add conflict resolution:** Cross-validate results across services
4. **Add provenance tracking:** Record which service provided each field
5. **Add confidence scores:** Indicate match quality
6. **Add persistent cache:** Reduce API calls
7. **Add data versioning:** Track when metadata retrieved
8. **Add bulk operations:** Process multiple tracks efficiently
9. **Remove dead fields:** Delete AcousticBrainz from output
10. **Add structured output:** to_dict(), to_json() methods

The data model is simple and functional for research use. Production use requires significant enhancements.
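
As one illustration of the persistent-cache recommendation (SQLite storage, a key hashed from artist, track, album, and duration, and a 30-day TTL), the sketch below uses only the standard library. `LookupCache` and `cache_key` are hypothetical names, not part of MusicMetaLinker, and storing results as JSON text is an assumption made to keep the example short.

```python
import hashlib
import json
import sqlite3
import time

TTL_SECONDS = 30 * 24 * 60 * 60  # 30 days


def cache_key(artist, track, album, duration) -> str:
    """Hash the query fields into a stable hex key."""
    payload = json.dumps([artist, track, album, duration], ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class LookupCache:
    """Tiny SQLite-backed cache with TTL-based expiry (illustrative)."""

    def __init__(self, path=":memory:", ttl=TTL_SECONDS):
        self.ttl = ttl
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS lookups ("
            "key TEXT PRIMARY KEY, value TEXT NOT NULL, stored_at REAL NOT NULL)"
        )

    def put(self, key, value, now=None):
        """Store a JSON-serializable result, replacing any older entry."""
        now = time.time() if now is None else now
        self.db.execute(
            "INSERT OR REPLACE INTO lookups (key, value, stored_at) VALUES (?, ?, ?)",
            (key, json.dumps(value), now),
        )
        self.db.commit()

    def get(self, key, now=None):
        """Return the cached result, or None if absent or older than the TTL."""
        now = time.time() if now is None else now
        row = self.db.execute(
            "SELECT value, stored_at FROM lookups WHERE key = ?", (key,)
        ).fetchone()
        if row is None or now - row[1] > self.ttl:
            return None
        return json.loads(row[0])


# The `now` argument is passed explicitly here only to show expiry
# without waiting; real callers would omit it.
cache = LookupCache()
key = cache_key("Artist Name", "Track Name", "Album Name", 123.45)
cache.put(key, {"mbid": None, "isrc": "GBAYE9200070"}, now=0)
fresh = cache.get(key, now=10)
expired = cache.get(key, now=TTL_SECONDS + 1)
```

Passing a file path instead of `":memory:"` would keep entries across runs, which is the point of the recommendation. Because the key is derived from the query fields, identical queries from different tracks in a batch would also share one entry.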