MusicMetaLinker Data Architecture
Data Storage Model
MusicMetaLinker has no persistent data storage. All data is in-memory during execution.
No database: No SQL, no NoSQL, no embedded databases.
No file-based persistence: No local cache files, no serialized objects. The only files written are the JAMS and CSV outputs of batch processing.
Stateless operation: Each Align instance is independent. No shared state across instances.
Input Data Formats
Python Objects
Primary input method: Constructor parameters to Align class.
Supported data types:
{
"mbid_track": str, # UUID format
"mbid_release": str, # UUID format
"artist": str, # Free text
"album": str, # Free text
"track": str, # Free text
"track_number": int, # Positive integer
"duration": int | float, # Seconds
"isrc": str, # ISRC format (no validation)
"strict": bool # Matching mode
}
No validation: Input accepted as-is. Invalid data causes silent failures (returns None).
No normalization: Artist names, track titles used exactly as provided. No case normalization, no whitespace trimming, no Unicode normalization.
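Since inputs are accepted as-is, a caller can run a pre-check before constructing an Align instance. The following is a minimal sketch, not part of MusicMetaLinker: the helper names (`is_valid_mbid`, `is_valid_isrc`, `validate_inputs`) are hypothetical, and the rules are assumptions based on the formats listed above (UUID for MBIDs, 12-character ISRC, positive duration).

```python
import re
import uuid

# ISRC layout: 2-letter country code, 3 alphanumeric registrant characters,
# 2-digit year, 5-digit designation code (12 characters in total).
ISRC_PATTERN = re.compile(r"^[A-Z]{2}[A-Z0-9]{3}\d{2}\d{5}$")


def is_valid_mbid(value):
    """Return True if value is a canonical hyphenated UUID string."""
    if not isinstance(value, str):
        return False
    try:
        return str(uuid.UUID(value)) == value.lower()
    except ValueError:
        return False


def is_valid_isrc(value):
    """Return True if value matches the ISRC layout (hyphens are tolerated)."""
    if not isinstance(value, str):
        return False
    return ISRC_PATTERN.match(value.replace("-", "").upper()) is not None


def validate_inputs(**fields):
    """Hypothetical pre-check: list the problems found in Align-style keyword arguments."""
    problems = []
    mbid = fields.get("mbid_track")
    if mbid is not None and not is_valid_mbid(mbid):
        problems.append("mbid_track is not a UUID")
    isrc = fields.get("isrc")
    if isrc is not None and not is_valid_isrc(isrc):
        problems.append("isrc is not a valid ISRC")
    duration = fields.get("duration")
    if duration is not None and duration <= 0:
        problems.append("duration must be positive")
    return problems
```

An empty list means the arguments passed these format checks. It does not mean the identifiers exist in any service.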
JAMS Files
JAMS (JSON Annotated Music Specification) is the standard input format for batch processing.
JAMS structure:
{
"file_metadata": {
"title": "Track Name",
"artist": "Artist Name",
"release": "Album Name",
"duration": 123.45,
"identifiers": {
"musicbrainz": "mbid-uuid-here",
"isrc": "GBAYE9200070"
}
},
"sandbox": {
"type": "music_type",
"genre": "rock",
"track_number": 1,
"release_year": 2020
},
"annotations": []
}
Key sections:
file_metadata: Core track metadata. Expected fields: title, artist (not enforced; a missing field becomes None). Optional: release, duration, identifiers.
sandbox: Additional metadata. Free-form structure. Common fields: type, genre, track_number, release_year.
annotations: Music information retrieval annotations (not used by MusicMetaLinker).
Parsing logic:
JAMSProcessor extracts:
- title → track
- artist → artist
- release → album
- duration → duration
- identifiers.musicbrainz → mbid_track
- identifiers.isrc → isrc
- sandbox.track_number → track_number
Missing fields: Treated as None. No errors raised.
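The field mapping above can be sketched with the standard json module. This is an illustration only: the function name `extract_align_kwargs` is hypothetical, and the real JAMSProcessor (which reads files through the jams library) may differ in detail.

```python
import json


def extract_align_kwargs(jams_text):
    """Map a JAMS JSON document to Align-style keyword arguments.

    Absent sections or fields become None instead of raising.
    """
    doc = json.loads(jams_text)
    meta = doc.get("file_metadata") or {}
    identifiers = meta.get("identifiers") or {}
    sandbox = doc.get("sandbox") or {}
    return {
        "track": meta.get("title"),
        "artist": meta.get("artist"),
        "album": meta.get("release"),
        "duration": meta.get("duration"),
        "mbid_track": identifiers.get("musicbrainz"),
        "isrc": identifiers.get("isrc"),
        "track_number": sandbox.get("track_number"),
    }
```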
CSV Input
No direct CSV input support. Batch processing outputs CSV but doesn't read it.
For CSV input, users must:
- Parse CSV manually
- Create Align instances per row
- Collect results
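The manual steps above might look like the sketch below. It assumes a CSV whose column names match the Align constructor parameters, which is a convention invented here, not something the tool defines. The function name `read_align_kwargs` is hypothetical.

```python
import csv

TEXT_FIELDS = ("mbid_track", "mbid_release", "artist", "album", "track", "isrc")


def _cell(row, name):
    """Return the stripped cell value, or None for a missing or empty cell."""
    value = (row.get(name) or "").strip()
    return value or None


def read_align_kwargs(handle):
    """Yield one dict of Align-style keyword arguments per CSV row."""
    for row in csv.DictReader(handle):
        kwargs = {name: _cell(row, name) for name in TEXT_FIELDS}
        number = _cell(row, "track_number")
        duration = _cell(row, "duration")
        kwargs["track_number"] = int(number) if number else None
        kwargs["duration"] = float(duration) if duration else None
        yield kwargs


# Usage (assumes Align accepts the parameters listed under "Python Objects"):
#     with open("tracks.csv", newline="", encoding="utf-8") as f:
#         linkers = [Align(**kwargs) for kwargs in read_align_kwargs(f)]
```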
Output Data Formats
Python Objects
Align instance acts as data container. Getters return individual fields.
No structured output method: No to_dict(), no to_json(), no serialize().
Manual aggregation required:
linker = Align(...)
result = {
"artist": linker.get_artist(),
"track": linker.get_track(),
"mbid": linker.get_mbid(),
"isrc": linker.get_isrc(),
"deezer_id": linker.get_deezer_id(),
# ... etc
}
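The aggregation can be wrapped in a small helper so that every caller collects the same fields. This is a hypothetical sketch: `to_dict` and the `GETTERS` table are not part of MusicMetaLinker, and the table lists only the getters named in the snippet above.

```python
# Output field name -> getter method name (only the getters shown above).
GETTERS = {
    "artist": "get_artist",
    "track": "get_track",
    "mbid": "get_mbid",
    "isrc": "get_isrc",
    "deezer_id": "get_deezer_id",
}


def to_dict(linker, getters=GETTERS):
    """Call each named getter on linker; a getter the object lacks yields None."""
    result = {}
    for field, method_name in getters.items():
        method = getattr(linker, method_name, None)
        result[field] = method() if callable(method) else None
    return result
```

Because the helper looks methods up by name, it works with any object that exposes the same getters, which also makes it easy to test with a stub.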
JAMS Files
Enriched JAMS files with added identifiers.
Enrichment process:
- Read original JAMS file
- Extract metadata
- Create Align instance
- Query all services
- Add identifiers to file_metadata.identifiers section
- Write enriched JAMS file
Added identifiers:
{
"file_metadata": {
"identifiers": {
"musicbrainz": "mbid-from-query",
"isrc": "isrc-from-query",
"deezer": "deezer-id-from-query",
"youtube": "youtube-url-from-query",
"acousticbrainz": null
}
}
}
Preservation: Original JAMS structure preserved. Only identifiers section modified.
Overwrite behavior: Controlled by --overwrite flag. Without flag, existing identifiers preserved.
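The overwrite rule can be expressed as a merge over the identifiers dictionary. The sketch below is an assumption about the behavior, not the tool's code: `merge_identifiers` is a hypothetical name, and the choice never to replace an existing value with null, even when overwriting, is made here for safety.

```python
def merge_identifiers(existing, found, overwrite=False):
    """Merge newly found identifiers into the existing JAMS identifiers.

    Keys that are absent or null are always filled in. Existing values are
    replaced only when overwrite is True and the new value is not None.
    """
    merged = dict(existing)
    for name, value in found.items():
        if merged.get(name) is None:
            merged[name] = value
        elif overwrite and value is not None:
            merged[name] = value
    return merged
```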
CSV Output
Batch processing generates CSV with all metadata and identifiers.
CSV schema:
| Column | Type | Description |
|---|---|---|
| jams_file | str | Original JAMS filename |
| track_name | str | Track title |
| artist_name | str | Artist name |
| album_name | str | Album/release name |
| track_number | int | Track position |
| duration | float | Duration in seconds |
| release_year | int | Release year |
| musicbrainz | str | MBID (UUID) |
| isrc | str | ISRC code |
| deezer_id | int | Deezer track ID |
| deezer_url | str | Full Deezer URL |
| youtube_url | str | Full YouTube URL |
| acousticbrainz | str | AcousticBrainz URL (always null) |
| spotify_id | str | Spotify ID (if available) |
Missing values: Empty cells or "None" string (inconsistent).
Encoding: UTF-8. No BOM.
Delimiter: Comma. No escaping issues documented.
Headers: First row contains column names.
Output location: Same directory as input JAMS files, named based on directory name.
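The mixed empty-cell and "None" representation can be avoided by normalizing missing values at write time. The sketch below uses the standard csv module, not the pandas-based writer the tool uses, and `write_rows` is a hypothetical name.

```python
import csv


def write_rows(rows, columns, handle):
    """Write dict rows as CSV, emitting an empty cell for None or the string "None"."""
    writer = csv.DictWriter(handle, fieldnames=columns, extrasaction="ignore")
    writer.writeheader()
    for row in rows:
        cleaned = {}
        for column in columns:
            value = row.get(column)
            cleaned[column] = "" if value in (None, "None") else value
        writer.writerow(cleaned)
```

With pandas, a comparable result should be obtainable by replacing "None" strings with missing values before calling `DataFrame.to_csv`, which writes missing values as empty cells by default.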
Data Transformation Pipeline
Input Transformation
- JAMS parsing: JSON deserialization via jams library
- Field extraction: Map JAMS fields to Align parameters
- Type conversion: String to int for track_number, string to float for duration
- Null handling: Missing fields become None
Query Transformation
- Metadata normalization: None (passed as-is to services)
- Duration conversion: MusicBrainz milliseconds → seconds
- ID extraction: Parse service-specific response formats
- URL construction: Build full URLs from IDs
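Two of these steps are simple enough to sketch directly. The function names are hypothetical, and the URL templates are the public Deezer and YouTube link formats. Whether MusicMetaLinker builds its URLs in exactly this form is an assumption.

```python
def ms_to_seconds(length_ms):
    """Convert a length in milliseconds (int or numeric string) to seconds."""
    if length_ms is None:
        return None
    return int(length_ms) / 1000


def deezer_track_url(track_id):
    """Build a Deezer track URL from a numeric track ID."""
    return f"https://www.deezer.com/track/{int(track_id)}"


def youtube_watch_url(video_id):
    """Build a YouTube watch URL from a video ID."""
    return f"https://www.youtube.com/watch?v={video_id}"
```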
Output Transformation
- Result aggregation: Collect all getter results
- CSV serialization: pandas DataFrame to CSV
- JAMS enrichment: Inject identifiers into JSON structure
- File writing: JSON serialization with indentation
Data Quality Issues
Input Data Quality
No validation:
- Invalid MBIDs accepted (wrong format, non-existent)
- Invalid ISRCs accepted (wrong format, non-existent)
- Negative durations accepted
- Empty strings accepted
No sanitization:
- Special characters in metadata not escaped
- SQL injection would be a risk if metadata were interpolated into database queries (not applicable here, since there is no database)
- Command injection would be a risk if metadata were passed to shell commands (not applicable here)
No normalization:
- "The Beatles" vs "Beatles" treated as different
- "feat." vs "featuring" vs "ft." not normalized
- Unicode variants not normalized (e.g., é vs e + combining accent)
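A normalization step applied before comparison would address the Unicode and "feat." cases. The sketch below shows one possible set of rules. `normalize_text` is a hypothetical helper, and it deliberately leaves leading articles such as "The" alone because stripping them is a policy decision.

```python
import re
import unicodedata

# "featuring", "feat", "feat.", "ft", "ft." as whole words.
FEAT_PATTERN = re.compile(r"\b(featuring|feat\.?|ft\.?)(?=\s|$)", re.IGNORECASE)


def normalize_text(value):
    """Return a comparison key for free-text metadata, or None for None."""
    if value is None:
        return None
    # NFKC composes sequences such as "e" + combining acute accent into one code point.
    text = unicodedata.normalize("NFKC", value)
    text = FEAT_PATTERN.sub("feat.", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text.casefold()
```

The result is meant for matching only. The original strings should still be kept for display and output.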
Output Data Quality
Inconsistent null representation:
- Python: None
- CSV: Empty string or "None" string
- JAMS: null or missing key
No data validation:
- Retrieved MBIDs not validated as UUIDs
- Retrieved ISRCs not validated as ISRC format
- Retrieved URLs not validated as valid URLs
No conflict resolution:
- If MusicBrainz and Deezer return different artists, no reconciliation
- First successful query wins, no cross-validation
Data Accuracy Issues
YouTube matching: Weak matching logic. First result assumed correct. High false positive rate.
Duration filtering: ±3 seconds threshold may be too loose for short tracks, too strict for live recordings.
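To make the trade-off concrete, the sketch below compares a fixed ±3 second window with a window that scales with track length. The scaled variant and its parameter values (2%, clamped between 1 and 10 seconds) are hypothetical choices for illustration, not values taken from the tool.

```python
def fixed_window_match(expected, candidate, tolerance=3.0):
    """Accept candidates within a fixed number of seconds of the expected duration."""
    return abs(expected - candidate) <= tolerance


def scaled_window_match(expected, candidate, rel_tol=0.02, min_tol=1.0, max_tol=10.0):
    """Accept candidates within a tolerance proportional to the expected duration."""
    allowed = min(max(expected * rel_tol, min_tol), max_tol)
    return abs(expected - candidate) <= allowed
```

For a 30-second track the scaled window narrows to 1 second, and for a 10-minute recording it widens to 10 seconds.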
Fuzzy matching: No documented algorithm. Likely simple string similarity. Doesn't handle:
- Transliterations (e.g., Japanese to romaji)
- Abbreviations (e.g., "feat." vs "featuring")
- Reorderings (e.g., "Artist feat. Guest" vs "Guest & Artist")
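Because the matching algorithm is undocumented, the following only illustrates the reordering weakness. It uses the standard library's difflib as a stand-in for "simple string similarity" and adds an order-insensitive token comparison. Neither function is taken from MusicMetaLinker.

```python
from difflib import SequenceMatcher


def sequence_similarity(a, b):
    """Character-level similarity in [0, 1]; sensitive to word order."""
    return SequenceMatcher(None, a.casefold(), b.casefold()).ratio()


def token_set_similarity(a, b):
    """Jaccard similarity over whitespace-separated tokens; ignores word order."""
    tokens_a = set(a.casefold().split())
    tokens_b = set(b.casefold().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)
```

For "Artist feat. Guest" and "Guest feat. Artist" the token comparison gives a perfect score while the character-level score drops below 1. Transliterations and abbreviations would still need dedicated handling.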
AcousticBrainz: Always returns null (service shut down). Dead data field.
Data Flow Diagrams
Single Track Flow
Input (Python dict or JAMS)
↓
Align constructor
↓
[Lazy evaluation - no queries yet]
↓
User calls getter (e.g., get_mbid())
↓
Check cache
↓
If not cached:
↓
Determine service to query
↓
Execute service query
↓
Parse response
↓
Cache result
↓
Return to user
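The lazy-evaluation and caching behavior in this flow can be sketched with a stand-in class. `LazyLinker` is hypothetical and much simpler than Align: the service query is an injected callable, so the example runs without network access.

```python
class LazyLinker:
    """Stand-in for an Align-like object: a query runs on the first getter call only."""

    def __init__(self, lookup_mbid):
        self._lookup_mbid = lookup_mbid  # injected callable standing in for a service query
        self._cache = {}  # field name -> value, kept for the instance lifetime

    def get_mbid(self):
        # The field name is the implicit cache key. A None result is cached too,
        # so a failed lookup is not retried on the same instance.
        if "mbid" not in self._cache:
            self._cache["mbid"] = self._lookup_mbid()
        return self._cache["mbid"]
```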
Batch Processing Flow
Directory of JAMS files
↓
For each JAMS file:
↓
JAMSProcessor.extract_metadata()
↓
Create Align instance
↓
Call all getters
↓
Collect results in list
↓
End loop
↓
Convert list to pandas DataFrame
↓
Write CSV
↓
Optionally write enriched JAMS files
Service Query Flow
Align.get_mbid()
↓
If mbid_track provided:
Return mbid_track
↓
Else if isrc provided:
Query MusicBrainz by ISRC
↓
Else:
Query MusicBrainz by metadata
↓
Parse MusicBrainz response
↓
Extract MBID
↓
Cache and return
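The resolution order in this flow, written as a plain function. This is a sketch under assumptions: `resolve_mbid` is a hypothetical name, the two query callables stand in for the MusicBrainz lookups, and requiring both artist and track for the metadata search is an assumption. As in the flow above, a failed ISRC lookup does not fall back to the metadata search.

```python
def resolve_mbid(mbid_track=None, isrc=None, artist=None, track=None,
                 by_isrc=None, by_metadata=None):
    """Return an MBID using the first applicable strategy, or None."""
    if mbid_track:
        return mbid_track  # provided by the caller, no query needed
    if isrc and by_isrc:
        return by_isrc(isrc)  # query by ISRC
    if artist and track and by_metadata:
        return by_metadata(artist, track)  # query by free-text metadata
    return None
```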
Data Caching Strategy
In-Memory Cache
Scope: Single Align instance only.
Cache key: Implicit (field name). No explicit key generation.
Cache invalidation: None. Values cached for instance lifetime.
Cache size: Small (one value per field, ~15 fields max).
Cache hit rate: High for repeated getter calls on same instance. Zero across instances.
No Persistent Cache
Implications:
- Repeated queries for same track across runs
- No offline operation
- Network dependency for every query
Batch processing impact:
- Processing 1000 tracks = 1000+ API calls
- No deduplication across tracks
- High network usage
Cache Recommendations
For production use:
- Add persistent cache: Redis or SQLite for cross-run caching
- Cache key: Hash of (artist, track, album, duration)
- TTL: 30 days (metadata rarely changes)
- Invalidation: Manual or TTL-based
- Deduplication: Cache identical queries across tracks
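A minimal sketch of these recommendations using SQLite from the standard library. The class name, table layout, and key scheme are hypothetical. The 30-day TTL and the (artist, track, album, duration) key follow the list above.

```python
import hashlib
import json
import sqlite3
import time

TTL_SECONDS = 30 * 24 * 60 * 60  # 30 days


def cache_key(artist, track, album, duration):
    """Hash the query fields into a stable hexadecimal key."""
    payload = json.dumps([artist, track, album, duration], ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class MetadataCache:
    """Tiny persistent cache with TTL-based expiry."""

    def __init__(self, path=":memory:", ttl=TTL_SECONDS):
        self._ttl = ttl
        self._db = sqlite3.connect(path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS cache "
            "(key TEXT PRIMARY KEY, value TEXT NOT NULL, stored_at REAL NOT NULL)"
        )

    def get(self, key, now=None):
        """Return the cached value, or None if absent or older than the TTL."""
        now = time.time() if now is None else now
        row = self._db.execute(
            "SELECT value, stored_at FROM cache WHERE key = ?", (key,)
        ).fetchone()
        if row is None or now - row[1] > self._ttl:
            return None
        return json.loads(row[0])

    def put(self, key, value, now=None):
        """Store a JSON-serializable value under key."""
        now = time.time() if now is None else now
        self._db.execute(
            "INSERT OR REPLACE INTO cache (key, value, stored_at) VALUES (?, ?, ?)",
            (key, json.dumps(value), now),
        )
        self._db.commit()
```

Passing a file path instead of ":memory:" makes the cache survive across runs. Identical queries from different tracks then share one entry, which also gives the deduplication described above.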
Data Privacy and Security
Personal Data
No personal data collected: Only public music metadata.
No user tracking: No analytics, no telemetry.
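No data sharing: Results not sent to third parties. Query metadata is, however, sent to the external services being queried (MusicBrainz, Deezer, YouTube Music, Spotify).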
API Credentials
Spotify credentials: Stored in external mml_secrets.py file. Not encrypted. Not in version control.
Other services: No credentials required.
Data Retention
No retention: All data discarded when Align instance destroyed.
Batch output: CSV and JAMS files written to disk. User responsible for retention and deletion.
Data Consistency
Cross-Service Consistency
No consistency checks: If MusicBrainz returns artist "The Beatles" and Deezer returns "Beatles", no reconciliation.
First-wins strategy: First successful query result used. No validation against other services.
Conflict scenarios:
- Different artists across services
- Different track names across services
- Different durations across services
No conflict resolution: User receives inconsistent data.
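Detecting such conflicts is straightforward once per-service results are kept side by side. The sketch below assumes the results are available as a dictionary keyed by service name, which is a structure MusicMetaLinker does not currently expose. `find_conflicts` is a hypothetical helper.

```python
def find_conflicts(results):
    """Return {field: {service: value}} for fields whose values disagree.

    results maps a service name to a dict of field values. Values are
    compared case-insensitively after trimming, and None values are ignored.
    """
    by_field = {}
    for service, fields in results.items():
        for field, value in fields.items():
            if value is not None:
                by_field.setdefault(field, {})[service] = value
    conflicts = {}
    for field, values in by_field.items():
        distinct = {str(value).casefold().strip() for value in values.values()}
        if len(distinct) > 1:
            conflicts[field] = values
    return conflicts
```

The function only reports disagreements. Choosing which service to trust for a given field remains a separate policy decision.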
Temporal Consistency
No versioning: Metadata retrieved at query time. No timestamp recorded.
Staleness: If MusicBrainz updates metadata after query, Align instance has stale data.
No refresh: No way to refresh cached data without creating new instance.
Data Completeness
Missing Data Handling
Graceful degradation: Missing fields return None. No errors.
Partial results: If MusicBrainz succeeds but Deezer fails, MusicBrainz data returned.
No completeness metrics: No indication of how many fields successfully retrieved.
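A completeness metric could be as simple as the fraction of populated fields in an aggregated result. The helper below is a hypothetical sketch. It treats None, empty strings, and the literal string "None" as missing, matching the null representations described earlier.

```python
MISSING = (None, "", "None")


def completeness(result):
    """Return the fraction of fields in result that hold a usable value."""
    if not result:
        return 0.0
    filled = sum(1 for value in result.values() if value not in MISSING)
    return filled / len(result)
```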
Required vs Optional Fields
No required fields: All constructor parameters optional.
Minimum viable input: At least one of (mbid_track, isrc, artist+track) recommended.
Degenerate cases:
- Empty Align() constructor: All getters return None
- Only duration provided: All getters return None (no searchable metadata)
Data Format Standards
Identifier Formats
MBID: UUID format (e.g., "6b9e7b9e-8f9e-4f9e-9f9e-9f9e9f9e9f9e"). No validation.
ISRC: 12-character alphanumeric (e.g., "GBAYE9200070"). No validation.
Deezer ID: Integer. No range validation.
YouTube ID: Alphanumeric string (e.g., "dQw4w9WgXcQ"). No validation.
Metadata Formats
Artist, track, album: Free text. No format constraints.
Duration: Seconds (int or float). MusicBrainz milliseconds converted to seconds.
Track number: Integer. No validation (negative numbers accepted).
Release date: ISO format (YYYY-MM-DD) or year only (YYYY). Inconsistent across services.
BPM: Integer or float. No range validation.
Data Interoperability
JAMS Compatibility
JAMS is a standard format in music information retrieval research. MusicMetaLinker's JAMS support enables interoperability with:
- mir_eval (evaluation framework)
- librosa (audio analysis)
- madmom (music analysis)
- Other MIR tools
Service Compatibility
MusicBrainz: Uses the musicbrainzngs client library. Compatibility with MusicBrainz API changes depends on the library, which handles API versioning.
Deezer: Uses the deezer-python library, a community-maintained wrapper rather than an official Deezer SDK. Compatible with the Deezer API.
YouTube Music: Uses unofficial ytmusicapi. Fragile to YouTube changes. No API stability guarantees.
Spotify: Uses the spotipy library, a community-maintained client rather than an official Spotify SDK. Compatible with the Spotify Web API.
Data Limitations
- No bulk operations: Each track processed individually
- No streaming: All data loaded into memory
- No compression: JAMS files written uncompressed
- No encryption: All data stored in plaintext
- No checksums: No data integrity verification
- No versioning: No metadata version tracking
- No provenance: No record of which service provided which field
- No confidence scores: No indication of match quality
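The last three gaps (versioning, provenance, confidence) could be closed by wrapping each retrieved value in a small record. This is a hypothetical sketch of such a record, not an existing MusicMetaLinker type. How a confidence value would be computed is left open.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Any, Optional


@dataclass(frozen=True)
class SourcedValue:
    """A retrieved value together with where and when it was obtained."""

    value: Any
    source: str  # name of the service that provided the value
    retrieved_at: datetime  # retrieval time in UTC
    confidence: Optional[float] = None  # match quality in [0, 1], if known


def sourced(value, source, confidence=None):
    """Wrap a value with its source and the current UTC timestamp."""
    return SourcedValue(value, source, datetime.now(timezone.utc), confidence)
```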
Data Recommendations
For production use:
- Add validation: Validate all input and output formats
- Add normalization: Normalize artist names, track titles
- Add conflict resolution: Cross-validate results across services
- Add provenance tracking: Record which service provided each field
- Add confidence scores: Indicate match quality
- Add persistent cache: Reduce API calls
- Add data versioning: Track when metadata retrieved
- Add bulk operations: Process multiple tracks efficiently
- Remove dead fields: Delete AcousticBrainz from output
- Add structured output: to_dict(), to_json() methods
The data model is simple and functional for research use. Production use requires significant enhancements.