# MusicMetaLinker Data Architecture

## Data Storage Model

MusicMetaLinker has no persistent data storage. All data is in-memory during execution.

**No database:** No SQL, no NoSQL, no embedded databases.

**No file-based persistence:** No local cache files, no serialized objects (except JAMS output).

**Stateless operation:** Each Align instance is independent. No shared state across instances.

## Input Data Formats

### Python Objects

Primary input method: Constructor parameters to Align class.

**Supported data types:**

```python
{
    "mbid_track": str,        # UUID format
    "mbid_release": str,      # UUID format
    "artist": str,            # Free text
    "album": str,             # Free text
    "track": str,             # Free text
    "track_number": int,      # Positive integer
    "duration": int | float,  # Seconds
    "isrc": str,              # ISRC format (no validation)
    "strict": bool            # Matching mode
}
```

**No validation:** Input accepted as-is. Invalid data causes silent failures (returns None).

**No normalization:** Artist names, track titles used exactly as provided. No case normalization, no whitespace trimming, no Unicode normalization.

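Because the constructor neither validates nor normalizes its arguments, callers who need stricter input can clean values before passing them in. The following is a minimal sketch of that idea; the helper names `clean_text`, `is_valid_mbid`, and `is_valid_isrc` are hypothetical and not part of MusicMetaLinker.

```python
import re
import unicodedata
import uuid

# ISRC layout: 2-letter country code, 3 alphanumeric registrant characters,
# 2-digit year, 5-digit designation (12 characters once hyphens are removed).
ISRC_RE = re.compile(r"[A-Z]{2}[A-Z0-9]{3}[0-9]{7}")


def clean_text(value):
    """Apply Unicode NFC normalization and collapse whitespace; None passes through."""
    if value is None:
        return None
    value = unicodedata.normalize("NFC", value)
    return " ".join(value.split())


def is_valid_mbid(value):
    """Return True if value is a canonical, lower-case UUID string."""
    try:
        return str(uuid.UUID(value)) == value
    except (ValueError, AttributeError, TypeError):
        return False


def is_valid_isrc(value):
    """Return True if value matches the ISRC layout once hyphens are removed."""
    if not isinstance(value, str):
        return False
    return ISRC_RE.fullmatch(value.replace("-", "").upper()) is not None
```

These checks only test the shape of an identifier. They cannot tell whether the MBID or ISRC actually exists in any service.
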
### JAMS Files

JAMS (JSON Annotated Music Specification) is the standard input format for batch processing.

**JAMS structure:**

```json
{
  "file_metadata": {
    "title": "Track Name",
    "artist": "Artist Name",
    "release": "Album Name",
    "duration": 123.45,
    "identifiers": {
      "musicbrainz": "mbid-uuid-here",
      "isrc": "GBAYE9200070"
    }
  },
  "sandbox": {
    "type": "music_type",
    "genre": "rock",
    "track_number": 1,
    "release_year": 2020
  },
  "annotations": []
}
```

**Key sections:**

**file_metadata:** Core track metadata. Required fields: title, artist. Optional: release, duration, identifiers.

**sandbox:** Additional metadata. Free-form structure. Common fields: type, genre, track_number, release_year.

**annotations:** Music information retrieval annotations (not used by MusicMetaLinker).

**Parsing logic:**

JAMSProcessor extracts:
- title → track
- artist → artist
- release → album
- duration → duration
- identifiers.musicbrainz → mbid_track
- identifiers.isrc → isrc
- sandbox.track_number → track_number

**Missing fields:** Treated as None. No errors raised.

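The field mapping above can be sketched over the parsed JSON directly. This is an illustration, not the actual JAMSProcessor code: the function name `extract_align_kwargs` is hypothetical, and it works on a plain dict rather than on objects from the jams library.

```python
import json


def extract_align_kwargs(jams_doc):
    """Map a parsed JAMS document (a dict) to Align-style keyword arguments."""
    meta = jams_doc.get("file_metadata") or {}
    identifiers = meta.get("identifiers") or {}
    sandbox = jams_doc.get("sandbox") or {}
    # dict.get returns None for absent keys, matching "missing fields become None".
    return {
        "track": meta.get("title"),
        "artist": meta.get("artist"),
        "album": meta.get("release"),
        "duration": meta.get("duration"),
        "mbid_track": identifiers.get("musicbrainz"),
        "isrc": identifiers.get("isrc"),
        "track_number": sandbox.get("track_number"),
    }


example = json.loads("""
{
  "file_metadata": {"title": "Track Name", "artist": "Artist Name",
                    "identifiers": {"isrc": "GBAYE9200070"}},
  "sandbox": {"track_number": 1},
  "annotations": []
}
""")
kwargs = extract_align_kwargs(example)
```

Here `kwargs["album"]`, `kwargs["duration"]`, and `kwargs["mbid_track"]` come back as None because the example document omits them.
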
### CSV Input

No direct CSV input support. Batch processing outputs CSV but doesn't read it.

For CSV input, users must:
1. Parse CSV manually
2. Create Align instances per row
3. Collect results

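The parsing step can be sketched with the standard library. The sketch assumes the CSV column names match the constructor parameter names listed earlier; `rows_to_kwargs` is a hypothetical helper, not something MusicMetaLinker provides.

```python
import csv
import io

INT_FIELDS = {"track_number"}
FLOAT_FIELDS = {"duration"}


def rows_to_kwargs(csv_file):
    """Yield one keyword-argument dict per CSV row, with empty cells as None."""
    for row in csv.DictReader(csv_file):
        kwargs = {}
        for key, value in row.items():
            if key is None:  # surplus cells beyond the header are ignored
                continue
            value = (value or "").strip()
            if not value:
                kwargs[key] = None
            elif key in INT_FIELDS:
                kwargs[key] = int(value)
            elif key in FLOAT_FIELDS:
                kwargs[key] = float(value)
            else:
                kwargs[key] = value
        yield kwargs


sample = io.StringIO(
    "artist,track,duration,track_number\n"
    "Artist Name,Track Name,123.45,\n"
)
rows = list(rows_to_kwargs(sample))
```

Each dict could then be passed on as `Align(**kwargs)`, assuming the constructor accepts those names as keywords, with the getter results collected per row.
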
## Output Data Formats

### Python Objects

Align instance acts as data container. Getters return individual fields.

**No structured output method:** No to_dict(), no to_json(), no serialize().

**Manual aggregation required:**

```python
linker = Align(...)
result = {
    "artist": linker.get_artist(),
    "track": linker.get_track(),
    "mbid": linker.get_mbid(),
    "isrc": linker.get_isrc(),
    "deezer_id": linker.get_deezer_id(),
    # ... etc
}
```

### JAMS Files

Enriched JAMS files with added identifiers.

**Enrichment process:**

1. Read original JAMS file
2. Extract metadata
3. Create Align instance
4. Query all services
5. Add identifiers to file_metadata.identifiers section
6. Write enriched JAMS file

**Added identifiers:**

```json
{
  "file_metadata": {
    "identifiers": {
      "musicbrainz": "mbid-from-query",
      "isrc": "isrc-from-query",
      "deezer": "deezer-id-from-query",
      "youtube": "youtube-url-from-query",
      "acousticbrainz": null
    }
  }
}
```

**Preservation:** Original JAMS structure preserved. Only identifiers section modified.

**Overwrite behavior:** Controlled by --overwrite flag. Without flag, existing identifiers preserved.

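The merge step and the overwrite behavior can be illustrated as follows. This is a sketch under the assumption that "preserved" means an existing non-null identifier is never replaced unless overwriting is requested; `enrich_identifiers` is a hypothetical name and operates on a plain dict.

```python
import copy


def enrich_identifiers(jams_doc, found, overwrite=False):
    """Return a copy of jams_doc with the found identifiers merged in.

    Only file_metadata.identifiers is modified; everything else is copied as-is.
    """
    enriched = copy.deepcopy(jams_doc)
    meta = enriched.setdefault("file_metadata", {})
    identifiers = meta.get("identifiers") or {}
    meta["identifiers"] = identifiers
    for name, value in found.items():
        # Without overwrite, only fill identifiers that are absent or null.
        if overwrite or identifiers.get(name) is None:
            identifiers[name] = value
    return enriched


doc = {"file_metadata": {"identifiers": {"isrc": "GBAYE9200070"}}, "annotations": []}
found = {"isrc": "isrc-from-query", "deezer": "deezer-id-from-query"}
kept = enrich_identifiers(doc, found)
replaced = enrich_identifiers(doc, found, overwrite=True)
```

In `kept` the original ISRC survives and only the Deezer identifier is added; in `replaced` the queried ISRC takes its place. The input document is left untouched in both cases.
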
### CSV Output

Batch processing generates CSV with all metadata and identifiers.

**CSV schema:**

| Column | Type | Description |
|--------|------|-------------|
| jams_file | str | Original JAMS filename |
| track_name | str | Track title |
| artist_name | str | Artist name |
| album_name | str | Album/release name |
| track_number | int | Track position |
| duration | float | Duration in seconds |
| release_year | int | Release year |
| musicbrainz | str | MBID (UUID) |
| isrc | str | ISRC code |
| deezer_id | int | Deezer track ID |
| deezer_url | str | Full Deezer URL |
| youtube_url | str | Full YouTube URL |
| acousticbrainz | str | AcousticBrainz URL (always null) |
| spotify_id | str | Spotify ID (if available) |

**Missing values:** Empty cells or "None" string (inconsistent).

**Encoding:** UTF-8. No BOM.

**Delimiter:** Comma. No escaping issues documented.

**Headers:** First row contains column names.

**Output location:** Same directory as input JAMS files, named based on directory name.

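One way to remove the inconsistent missing-value representation is to map both None and the literal string "None" to an empty cell before writing. The sketch below uses the standard-library `csv` module rather than pandas to stay self-contained; `normalize_missing`, `write_rows`, and the shortened column list are illustrative assumptions.

```python
import csv
import io

# A subset of the schema above, kept short for the example.
COLUMNS = ["jams_file", "track_name", "artist_name", "musicbrainz", "isrc"]


def normalize_missing(value):
    """Map None and the literal string "None" to an empty cell."""
    if value is None or value == "None":
        return ""
    return value


def write_rows(rows, out_file, columns=COLUMNS):
    """Write a header plus one line per row, with missing values left empty."""
    writer = csv.DictWriter(out_file, fieldnames=columns)
    writer.writeheader()
    for row in rows:
        writer.writerow({col: normalize_missing(row.get(col)) for col in columns})


buffer = io.StringIO()
write_rows(
    [{"jams_file": "a.jams", "track_name": "Track Name", "isrc": "None"}],
    buffer,
)
```

A track literally titled "None" would also be blanked by this rule, so restricting it to the identifier columns may be the safer choice.
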
## Data Transformation Pipeline

### Input Transformation

1. **JAMS parsing:** JSON deserialization via jams library
2. **Field extraction:** Map JAMS fields to Align parameters
3. **Type conversion:** String to int for track_number, string to float for duration
4. **Null handling:** Missing fields become None

### Query Transformation

1. **Metadata normalization:** None (passed as-is to services)
2. **Duration conversion:** MusicBrainz milliseconds → seconds
3. **ID extraction:** Parse service-specific response formats
4. **URL construction:** Build full URLs from IDs

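The duration conversion and URL construction steps amount to small pure functions. The sketch below is illustrative only: the function names are hypothetical, the URL patterns are the public Deezer and YouTube link shapes and are assumed rather than confirmed to be what the tool emits, and the millisecond length is allowed to arrive as either an int or a numeric string.

```python
def ms_to_seconds(length_ms):
    """Convert a millisecond length (int or numeric string) to seconds."""
    if length_ms is None:
        return None
    return int(length_ms) / 1000.0


def deezer_track_url(deezer_id):
    """Build a Deezer track URL from a numeric track ID (assumed URL shape)."""
    if deezer_id is None:
        return None
    return f"https://www.deezer.com/track/{deezer_id}"


def youtube_watch_url(video_id):
    """Build a YouTube watch URL from a video ID (assumed URL shape)."""
    if video_id is None:
        return None
    return f"https://www.youtube.com/watch?v={video_id}"
```

Passing None through unchanged keeps these helpers consistent with the "missing fields become None" convention used elsewhere in the pipeline.
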
### Output Transformation

1. **Result aggregation:** Collect all getter results
2. **CSV serialization:** pandas DataFrame to CSV
3. **JAMS enrichment:** Inject identifiers into JSON structure
4. **File writing:** JSON serialization with indentation

## Data Quality Issues

### Input Data Quality

**No validation:**
- Invalid MBIDs accepted (wrong format, non-existent)
- Invalid ISRCs accepted (wrong format, non-existent)
- Negative durations accepted
- Empty strings accepted

**No sanitization:**
- Special characters in metadata not escaped
- SQL injection risk if metadata used in queries (not applicable here)
- Command injection risk if metadata used in shell commands (not applicable here)

**No normalization:**
- "The Beatles" vs "Beatles" treated as different
- "feat." vs "featuring" vs "ft." not normalized
- Unicode variants not normalized (e.g., é vs e + combining accent)

### Output Data Quality

**Inconsistent null representation:**
- Python: None
- CSV: Empty string or "None" string
- JAMS: null or missing key

**No data validation:**
- Retrieved MBIDs not validated as UUIDs
- Retrieved ISRCs not validated as ISRC format
- Retrieved URLs not validated as valid URLs

**No conflict resolution:**
- If MusicBrainz and Deezer return different artists, no reconciliation
- First successful query wins, no cross-validation

### Data Accuracy Issues

**YouTube matching:** Weak matching logic. First result assumed correct. High false positive rate.

**Duration filtering:** ±3 seconds threshold may be too loose for short tracks, too strict for live recordings.

**Fuzzy matching:** No documented algorithm. Likely simple string similarity. Doesn't handle:
- Transliterations (e.g., Japanese to romaji)
- Abbreviations (e.g., "feat." vs "featuring")
- Reorderings (e.g., "Artist feat. Guest" vs "Guest & Artist")

**AcousticBrainz:** Always returns null (service shut down). Dead data field.

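As an alternative to the fixed ±3 second window, a tolerance that scales with track length and is clamped at both ends would be tighter for short tracks and looser for long ones. The sketch below is a suggestion, not the tool's current behavior; the function names and the default values (2%, 1 s floor, 10 s ceiling) are illustrative assumptions that would need tuning against real data.

```python
def duration_tolerance(expected, rel_tol=0.02, min_tol=1.0, max_tol=10.0):
    """Tolerance in seconds: a fraction of the track length, clamped to [min_tol, max_tol]."""
    return min(max(rel_tol * expected, min_tol), max_tol)


def durations_match(expected, candidate, **tolerance_options):
    """Return True if candidate is within the scaled tolerance of expected."""
    if expected is None or candidate is None:
        return False
    return abs(expected - candidate) <= duration_tolerance(expected, **tolerance_options)
```

With these defaults a 30-second track tolerates a 1-second difference, a 200-second track about 4 seconds, and a 10-minute live recording 10 seconds.
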
## Data Flow Diagrams

### Single Track Flow

```
Input (Python dict or JAMS)
↓
Align constructor
↓
[Lazy evaluation - no queries yet]
↓
User calls getter (e.g., get_mbid())
↓
Check cache
↓
If not cached:
↓
Determine service to query
↓
Execute service query
↓
Parse response
↓
Cache result
↓
Return to user
```

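The lazy, cache-on-first-access behavior in this flow can be modelled with a small stand-in class. `LazyFields` and `fake_mbid_lookup` are hypothetical names used only for illustration; they are not the Align implementation, and the lookup function replaces a real service query.

```python
class LazyFields:
    """Minimal stand-in showing lazy, per-instance caching of getter results."""

    def __init__(self, fetchers):
        self._fetchers = fetchers  # field name -> zero-argument callable
        self._cache = {}           # field name -> cached value

    def get(self, field):
        # Nothing is queried until the first request for a field.
        if field not in self._cache:
            self._cache[field] = self._fetchers[field]()
        return self._cache[field]


calls = []


def fake_mbid_lookup():
    """Pretend service query that records each invocation."""
    calls.append("mbid")
    return "6b9e7b9e-8f9e-4f9e-9f9e-9f9e9f9e9f9e"


fields = LazyFields({"mbid": fake_mbid_lookup})
first = fields.get("mbid")
second = fields.get("mbid")
```

The lookup runs once even though the field is requested twice. Because the cache tests for key presence, a None result is cached as well and is not retried.
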
### Batch Processing Flow

```
Directory of JAMS files
↓
For each JAMS file:
↓
JAMSProcessor.extract_metadata()
↓
Create Align instance
↓
Call all getters
↓
Collect results in list
↓
End loop
↓
Convert list to pandas DataFrame
↓
Write CSV
↓
Optionally write enriched JAMS files
```

### Service Query Flow

```
Align.get_mbid()
↓
If mbid_track provided:
  Return mbid_track
Else if isrc provided:
  Query MusicBrainz by ISRC
Else:
  Query MusicBrainz by metadata
↓
Parse MusicBrainz response
↓
Extract MBID
↓
Cache and return
```

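The branch order in this flow can be written as a short function. This sketch mirrors the diagram only; `resolve_mbid`, `by_isrc`, and `by_metadata` are hypothetical names, and the two callables stand in for the MusicBrainz queries.

```python
def resolve_mbid(mbid_track=None, isrc=None, metadata=None,
                 by_isrc=None, by_metadata=None):
    """Follow the order: supplied MBID, then ISRC lookup, then metadata search."""
    if mbid_track:
        return mbid_track  # no query needed
    if isrc and by_isrc is not None:
        return by_isrc(isrc)
    if metadata and by_metadata is not None:
        return by_metadata(metadata)
    return None  # nothing searchable was provided


from_given = resolve_mbid(mbid_track="abc", isrc="GBAYE9200070",
                          by_isrc=lambda isrc: "from-isrc")
from_isrc = resolve_mbid(isrc="GBAYE9200070",
                         by_isrc=lambda isrc: "from-isrc",
                         by_metadata=lambda meta: "from-metadata")
from_metadata = resolve_mbid(metadata={"artist": "Artist Name", "track": "Track Name"},
                             by_metadata=lambda meta: "from-metadata")
```

As in the diagram, a failed ISRC lookup does not fall back to a metadata search; whether the real implementation does so is not stated here.
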
## Data Caching Strategy

### In-Memory Cache

**Scope:** Single Align instance only.

**Cache key:** Implicit (field name). No explicit key generation.

**Cache invalidation:** None. Values cached for instance lifetime.

**Cache size:** Small (one value per field, ~15 fields max).

**Cache hit rate:** High for repeated getter calls on same instance. Zero across instances.

### No Persistent Cache

**Implications:**
- Repeated queries for same track across runs
- No offline operation
- Network dependency for every query

**Batch processing impact:**
- Processing 1000 tracks = 1000+ API calls
- No deduplication across tracks
- High network usage

### Cache Recommendations

For production use:

1. **Add persistent cache:** Redis or SQLite for cross-run caching
2. **Cache key:** Hash of (artist, track, album, duration)
3. **TTL:** 30 days (metadata rarely changes)
4. **Invalidation:** Manual or TTL-based
5. **Deduplication:** Cache identical queries across tracks

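A minimal sketch of these recommendations using SQLite from the standard library follows. The names `cache_key` and `ResultCache`, the table layout, and the choice of SHA-256 are assumptions made for illustration; nothing like this exists in MusicMetaLinker today.

```python
import hashlib
import json
import sqlite3
import time

TTL_SECONDS = 30 * 24 * 3600  # 30 days


def cache_key(artist, track, album=None, duration=None):
    """Hash the query fields into a stable key."""
    payload = json.dumps([artist, track, album, duration], ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class ResultCache:
    """Persistent key/value cache with TTL-based expiry."""

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS results "
            "(key TEXT PRIMARY KEY, value TEXT NOT NULL, stored_at REAL NOT NULL)"
        )

    def get(self, key, now=None):
        """Return the cached value, or None if absent or older than the TTL."""
        now = time.time() if now is None else now
        row = self.db.execute(
            "SELECT value, stored_at FROM results WHERE key = ?", (key,)
        ).fetchone()
        if row is None or now - row[1] > TTL_SECONDS:
            return None
        return json.loads(row[0])

    def put(self, key, value, now=None):
        """Store a JSON-serializable value under key, replacing any older entry."""
        now = time.time() if now is None else now
        self.db.execute(
            "INSERT OR REPLACE INTO results (key, value, stored_at) VALUES (?, ?, ?)",
            (key, json.dumps(value), now),
        )
        self.db.commit()
```

Identical queries hash to the same key, which gives deduplication across tracks for free. Passing a file path instead of `":memory:"` makes the cache survive between runs.
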
## Data Privacy and Security

### Personal Data

**No personal data collected:** Only public music metadata.

**No user tracking:** No analytics, no telemetry.

**No data sharing:** Results not sent to third parties.

### API Credentials

**Spotify credentials:** Stored in external mml_secrets.py file. Not encrypted. Not in version control.

**Other services:** No credentials required.

### Data Retention

**No retention:** All data discarded when Align instance destroyed.

**Batch output:** CSV and JAMS files written to disk. User responsible for retention and deletion.

## Data Consistency

### Cross-Service Consistency

**No consistency checks:** If MusicBrainz returns artist "The Beatles" and Deezer returns "Beatles", no reconciliation.

**First-wins strategy:** First successful query result used. No validation against other services.

**Conflict scenarios:**
- Different artists across services
- Different track names across services
- Different durations across services

**No conflict resolution:** User receives inconsistent data.

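A first-wins merge can still report where services disagree and record which service supplied each field. The sketch below shows one possible shape for that; `merge_with_provenance` and its return structure are hypothetical, and the service names are plain labels rather than real client objects.

```python
def merge_with_provenance(results):
    """Merge per-service dicts; the first non-None value wins, conflicts are reported.

    results: list of (service_name, {field: value}) pairs in query order.
    Returns (merged, provenance, conflicts).
    """
    merged, provenance, conflicts = {}, {}, {}
    for service, fields in results:
        for field, value in fields.items():
            if value is None:
                continue
            if field not in merged:
                merged[field] = value
                provenance[field] = service
            elif merged[field] != value:
                # Keep the first value but remember the disagreement.
                conflicts.setdefault(field, []).append((service, value))
    return merged, provenance, conflicts


merged, provenance, conflicts = merge_with_provenance([
    ("musicbrainz", {"artist": "The Beatles", "isrc": None}),
    ("deezer", {"artist": "Beatles", "isrc": "GBAYE9200070"}),
])
```

In this example the MusicBrainz artist is kept, the ISRC is attributed to Deezer, and `conflicts` records that Deezer returned a different artist string.
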
### Temporal Consistency

**No versioning:** Metadata retrieved at query time. No timestamp recorded.

**Staleness:** If MusicBrainz updates metadata after query, Align instance has stale data.

**No refresh:** No way to refresh cached data without creating new instance.

## Data Completeness

### Missing Data Handling

**Graceful degradation:** Missing fields return None. No errors.

**Partial results:** If MusicBrainz succeeds but Deezer fails, MusicBrainz data returned.

**No completeness metrics:** No indication of how many fields successfully retrieved.

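A completeness figure is easy to derive once getter results have been aggregated into a dict, as in the manual aggregation example. The helper below is a hypothetical sketch that simply counts non-None values.

```python
def completeness(result):
    """Return (filled, total, ratio) for a dict of getter results."""
    total = len(result)
    filled = sum(1 for value in result.values() if value is not None)
    ratio = filled / total if total else 0.0
    return filled, total, ratio


summary = completeness({
    "mbid": "6b9e7b9e-8f9e-4f9e-9f9e-9f9e9f9e9f9e",
    "isrc": None,
    "deezer_id": 3135556,
    "youtube_url": None,
})
```

The ratio says how many fields were filled, not whether the filled values are correct, so it complements rather than replaces a match-quality score.
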
### Required vs Optional Fields

**No required fields:** All constructor parameters optional.

**Minimum viable input:** At least one of (mbid_track, isrc, artist+track) recommended.

**Degenerate cases:**
- Empty Align() constructor: All getters return None
- Only duration provided: All getters return None (no searchable metadata)

## Data Format Standards

### Identifier Formats

**MBID:** UUID format (e.g., "6b9e7b9e-8f9e-4f9e-9f9e-9f9e9f9e9f9e"). No validation.

**ISRC:** 12-character alphanumeric (e.g., "GBAYE9200070"). No validation.

**Deezer ID:** Integer. No range validation.

**YouTube ID:** 11-character string of letters, digits, "-" and "_" (e.g., "dQw4w9WgXcQ"). No validation.

### Metadata Formats

**Artist, track, album:** Free text. No format constraints.

**Duration:** Seconds (int or float). MusicBrainz milliseconds converted to seconds.

**Track number:** Integer. No validation (negative numbers accepted).

**Release date:** ISO format (YYYY-MM-DD) or year only (YYYY). Inconsistent across services.

**BPM:** Integer or float. No range validation.

## Data Interoperability

### JAMS Compatibility

JAMS is a standard format in music information retrieval research. MusicMetaLinker's JAMS support enables interoperability with:
- mir_eval (evaluation framework)
- librosa (audio analysis)
- madmom (music analysis)
- Other MIR tools

### Service Compatibility

**MusicBrainz:** Uses the musicbrainzngs library (community-maintained Python bindings, not published by MusicBrainz itself). The library tracks the MusicBrainz web service version.

**Deezer:** Uses the deezer-python library (community-maintained wrapper, not published by Deezer). Compatible with the Deezer API.

**YouTube Music:** Uses unofficial ytmusicapi. Fragile to YouTube changes. No API stability guarantees.

**Spotify:** Uses the spotipy library (community-maintained wrapper, not published by Spotify). Compatible with the Spotify Web API.

## Data Limitations

1. **No bulk operations:** Each track processed individually
2. **No streaming:** All data loaded into memory
3. **No compression:** JAMS files written uncompressed
4. **No encryption:** All data stored in plaintext
5. **No checksums:** No data integrity verification
6. **No versioning:** No metadata version tracking
7. **No provenance:** No record of which service provided which field
8. **No confidence scores:** No indication of match quality

## Data Recommendations

For production use:

1. **Add validation:** Validate all input and output formats
2. **Add normalization:** Normalize artist names, track titles
3. **Add conflict resolution:** Cross-validate results across services
4. **Add provenance tracking:** Record which service provided each field
5. **Add confidence scores:** Indicate match quality
6. **Add persistent cache:** Reduce API calls
7. **Add data versioning:** Track when metadata retrieved
8. **Add bulk operations:** Process multiple tracks efficiently
9. **Remove dead fields:** Delete AcousticBrainz from output
10. **Add structured output:** to_dict(), to_json() methods

The data model is simple and functional for research use. Production use requires significant enhancements.