feat: initial implementation of metadata aggregator

- gRPC service with MusicBrainz provider
- PostgreSQL schema with migrations
- Service layer with database-first caching
- Repository pattern for data access
- YAML configuration support
- Research documentation for 17 music metadata projects
Alexander
2026-04-28 16:27:14 +02:00
commit a1f6701bac
163 changed files with 95884 additions and 0 deletions
@@ -0,0 +1,501 @@
# MusicMetaLinker Data Architecture
## Data Storage Model
MusicMetaLinker keeps no persistent data store. All working data lives in memory during execution; the only files it writes are the batch outputs (CSV and enriched JAMS).
**No database:** No SQL, no NoSQL, no embedded databases.
**No file-based persistence:** No local cache files, no serialized objects (except the JAMS and CSV batch output).
**Stateless operation:** Each Align instance is independent. No shared state across instances.
## Input Data Formats
### Python Objects
Primary input method: Constructor parameters to Align class.
**Supported data types:**
```python
{
"mbid_track": str, # UUID format
"mbid_release": str, # UUID format
"artist": str, # Free text
"album": str, # Free text
"track": str, # Free text
"track_number": int, # Positive integer
"duration": int | float, # Seconds
"isrc": str, # ISRC format (no validation)
"strict": bool # Matching mode
}
```
**No validation:** Input accepted as-is. Invalid data fails silently: the affected getters return None.
**No normalization:** Artist names, track titles used exactly as provided. No case normalization, no whitespace trimming, no Unicode normalization.
### JAMS Files
JAMS (JSON Annotated Music Specification) is the standard input format for batch processing.
**JAMS structure:**
```json
{
"file_metadata": {
"title": "Track Name",
"artist": "Artist Name",
"release": "Album Name",
"duration": 123.45,
"identifiers": {
"musicbrainz": "mbid-uuid-here",
"isrc": "GBAYE9200070"
}
},
"sandbox": {
"type": "music_type",
"genre": "rock",
"track_number": 1,
"release_year": 2020
},
"annotations": []
}
```
**Key sections:**
**file_metadata:** Core track metadata. Expected fields: title, artist (not enforced; missing fields become None). Optional: release, duration, identifiers.
**sandbox:** Additional metadata. Free-form structure. Common fields: type, genre, track_number, release_year.
**annotations:** Music information retrieval annotations (not used by MusicMetaLinker).
**Parsing logic:**
JAMSProcessor extracts:
- title → track
- artist → artist
- release → album
- duration → duration
- identifiers.musicbrainz → mbid_track
- identifiers.isrc → isrc
- sandbox.track_number → track_number
**Missing fields:** Treated as None. No errors raised.
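The mapping above can be sketched as a plain function over the parsed JSON. This is a minimal sketch, not the actual JAMSProcessor code: the function name `extract_align_params` is hypothetical, and it works on a dict from `json.load` rather than on a `jams` object.

```python
from typing import Any, Optional


def extract_align_params(jams_doc: dict[str, Any]) -> dict[str, Optional[Any]]:
    """Map a parsed JAMS document to Align-style keyword arguments.

    Hypothetical helper: the real JAMSProcessor may differ in detail.
    Missing sections or keys yield None instead of raising.
    """
    meta = jams_doc.get("file_metadata") or {}
    identifiers = meta.get("identifiers") or {}
    sandbox = jams_doc.get("sandbox") or {}
    return {
        "track": meta.get("title"),
        "artist": meta.get("artist"),
        "album": meta.get("release"),
        "duration": meta.get("duration"),
        "mbid_track": identifiers.get("musicbrainz"),
        "isrc": identifiers.get("isrc"),
        "track_number": sandbox.get("track_number"),
    }
```

Using `.get()` with an `or {}` fallback is what gives the "missing fields become None, no errors" behaviour, including when a whole section is absent or null.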
### CSV Input
No direct CSV input support. Batch processing outputs CSV but doesn't read it.
For CSV input, users must:
1. Parse CSV manually
2. Create Align instances per row
3. Collect results
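A minimal sketch of the manual route, using only the standard library. The column names (`artist`, `track`, `album`, `duration`, `isrc`) and the helper names are assumptions for illustration; MusicMetaLinker defines no CSV input format.

```python
import csv
import io
from typing import Any, Iterator, Optional


def _to_number(value: Optional[str], cast) -> Optional[Any]:
    """Convert a CSV cell to a number, mapping blanks and bad values to None."""
    try:
        return cast(value)
    except (TypeError, ValueError):
        return None


def rows_to_align_params(csv_text: str) -> Iterator[dict[str, Any]]:
    """Yield one keyword-argument dict per CSV row.

    Column names are assumptions for this sketch, not a documented format.
    Empty cells become None so they match the "missing field" convention.
    """
    for row in csv.DictReader(io.StringIO(csv_text)):
        yield {
            "artist": row.get("artist") or None,
            "track": row.get("track") or None,
            "album": row.get("album") or None,
            "duration": _to_number(row.get("duration"), float),
            "isrc": row.get("isrc") or None,
        }
```

Each yielded dict could then be passed on as `Align(**params)`, assuming the constructor accepts the keyword names listed under Python Objects.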
## Output Data Formats
### Python Objects
Align instance acts as data container. Getters return individual fields.
**No structured output method:** No to_dict(), no to_json(), no serialize().
**Manual aggregation required:**
```python
linker = Align(...)
result = {
"artist": linker.get_artist(),
"track": linker.get_track(),
"mbid": linker.get_mbid(),
"isrc": linker.get_isrc(),
"deezer_id": linker.get_deezer_id(),
# ... etc
}
```
### JAMS Files
Enriched JAMS files with added identifiers.
**Enrichment process:**
1. Read original JAMS file
2. Extract metadata
3. Create Align instance
4. Query all services
5. Add identifiers to file_metadata.identifiers section
6. Write enriched JAMS file
**Added identifiers:**
```json
{
"file_metadata": {
"identifiers": {
"musicbrainz": "mbid-from-query",
"isrc": "isrc-from-query",
"deezer": "deezer-id-from-query",
"youtube": "youtube-url-from-query",
"acousticbrainz": null
}
}
}
```
**Preservation:** Original JAMS structure preserved. Only identifiers section modified.
**Overwrite behavior:** Controlled by --overwrite flag. Without flag, existing identifiers preserved.
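The preserve-or-overwrite rule can be sketched as a small merge function. This is an illustration under assumptions: `merge_identifiers` is a hypothetical name, and the choice never to replace an existing value with None (even when overwriting) is a guess, not documented behaviour.

```python
from typing import Optional


def merge_identifiers(
    existing: dict[str, Optional[str]],
    retrieved: dict[str, Optional[str]],
    overwrite: bool = False,
) -> dict[str, Optional[str]]:
    """Merge retrieved identifiers into the existing identifiers section.

    Without overwrite, only missing or null entries are filled.
    With overwrite, existing values are replaced, but never by None
    (assumption: a failed lookup should not erase a known identifier).
    """
    merged = dict(existing)  # leave the caller's dict untouched
    for key, value in retrieved.items():
        if merged.get(key) is None:
            merged[key] = value  # fill a gap (the value may still be None)
        elif overwrite and value is not None:
            merged[key] = value  # replace only with a real value
    return merged
```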
### CSV Output
Batch processing generates CSV with all metadata and identifiers.
**CSV schema:**
| Column | Type | Description |
|--------|------|-------------|
| jams_file | str | Original JAMS filename |
| track_name | str | Track title |
| artist_name | str | Artist name |
| album_name | str | Album/release name |
| track_number | int | Track position |
| duration | float | Duration in seconds |
| release_year | int | Release year |
| musicbrainz | str | MBID (UUID) |
| isrc | str | ISRC code |
| deezer_id | int | Deezer track ID |
| deezer_url | str | Full Deezer URL |
| youtube_url | str | Full YouTube URL |
| acousticbrainz | str | AcousticBrainz URL (always null) |
| spotify_id | str | Spotify ID (if available) |
**Missing values:** Empty cells or "None" string (inconsistent).
**Encoding:** UTF-8. No BOM.
**Delimiter:** Comma. No escaping issues documented.
**Headers:** First row contains column names.
**Output location:** Same directory as input JAMS files, named based on directory name.
## Data Transformation Pipeline
### Input Transformation
1. **JAMS parsing:** JSON deserialization via jams library
2. **Field extraction:** Map JAMS fields to Align parameters
3. **Type conversion:** String to int for track_number, string to float for duration
4. **Null handling:** Missing fields become None
### Query Transformation
1. **Metadata normalization:** None (passed as-is to services)
2. **Duration conversion:** MusicBrainz milliseconds → seconds
3. **ID extraction:** Parse service-specific response formats
4. **URL construction:** Build full URLs from IDs
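Steps 2 and 4 are simple enough to sketch directly. The helper names are hypothetical, and the URL patterns are the public Deezer and YouTube link forms; whether MusicMetaLinker builds exactly these strings is an assumption.

```python
from typing import Optional, Union


def ms_to_seconds(length_ms: Union[int, str, None]) -> Optional[float]:
    """Convert a MusicBrainz recording length in milliseconds to seconds.

    The length may arrive as a string, so it is cast to int first.
    """
    if length_ms is None:
        return None
    return int(length_ms) / 1000.0


def deezer_track_url(track_id: int) -> str:
    """Build a Deezer track URL from a numeric track ID."""
    return f"https://www.deezer.com/track/{track_id}"


def youtube_watch_url(video_id: str) -> str:
    """Build a YouTube watch URL from a video ID."""
    return f"https://www.youtube.com/watch?v={video_id}"
```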
### Output Transformation
1. **Result aggregation:** Collect all getter results
2. **CSV serialization:** pandas DataFrame to CSV
3. **JAMS enrichment:** Inject identifiers into JSON structure
4. **File writing:** JSON serialization with indentation
## Data Quality Issues
### Input Data Quality
**No validation:**
- Invalid MBIDs accepted (wrong format, non-existent)
- Invalid ISRCs accepted (wrong format, non-existent)
- Negative durations accepted
- Empty strings accepted
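For illustration, format checks for these cases could look like the sketch below. The function names are hypothetical and nothing like this exists in MusicMetaLinker; the checks cover format only, not whether an identifier actually exists.

```python
import re
import uuid

# ISRC: 2-letter country code, 3 alphanumeric registrant characters,
# 2-digit year, 5-digit designation code (12 characters in total).
_ISRC_RE = re.compile(r"[A-Z]{2}[A-Z0-9]{3}[0-9]{7}")


def is_valid_mbid(value: object) -> bool:
    """True if value is a UUID string in canonical hyphenated form."""
    if not isinstance(value, str):
        return False
    try:
        return str(uuid.UUID(value)) == value.lower()
    except ValueError:
        return False


def is_valid_isrc(value: object) -> bool:
    """True if value has the ISRC shape; optional hyphens are ignored."""
    if not isinstance(value, str):
        return False
    return _ISRC_RE.fullmatch(value.replace("-", "").upper()) is not None


def is_valid_duration(value: object) -> bool:
    """True for a positive int or float (bool is rejected explicitly)."""
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        return False
    return value > 0
```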
**No sanitization:**
- Special characters in metadata not escaped
- Would become an injection risk only if metadata were interpolated into SQL queries or shell commands; MusicMetaLinker does neither, so this is not applicable here
**No normalization:**
- "The Beatles" vs "Beatles" treated as different
- "feat." vs "featuring" vs "ft." not normalized
- Unicode variants not normalized (e.g., é vs e + combining accent)
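A minimal sketch of what a normalization step could cover, using only the standard library. `normalize_text` is a hypothetical helper, and the choices made here (NFC composition, collapsing "featuring"/"ft." to "feat.", case folding) are assumptions. It deliberately leaves leading articles alone, so "The Beatles" and "Beatles" still differ.

```python
import re
import unicodedata

# Matches "featuring", "feat" or "ft" as whole words, with an optional dot.
# Known limitation of this sketch: a standalone "ft" (e.g. a unit) is rewritten too.
_FEAT_RE = re.compile(r"\b(featuring|feat|ft)\b\.?", re.IGNORECASE)


def normalize_text(value: str) -> str:
    """Return a comparison key for an artist name or track title."""
    text = unicodedata.normalize("NFC", value)  # e + combining accent -> é
    text = _FEAT_RE.sub("feat.", text)          # unify featuring variants
    text = re.sub(r"\s+", " ", text).strip()    # collapse and trim whitespace
    return text.casefold()                      # case-insensitive comparison
```

The result is meant as a comparison key only; the original string should still be what is sent to services and written to output.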
### Output Data Quality
**Inconsistent null representation:**
- Python: None
- CSV: Empty string or "None" string
- JAMS: null or missing key
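One way to make the CSV side consistent is to route every cell through a single conversion function. The sketch below uses the standard `csv` module rather than pandas to stay dependency-free; `to_csv_cell` and `rows_to_csv` are hypothetical names, and mapping both None and the literal string "None" to an empty cell is an assumption about the desired convention.

```python
import csv
import io
from typing import Any, Iterable


def to_csv_cell(value: Any) -> Any:
    """Map None and the stray string "None" to an empty cell."""
    if value is None or value == "None":
        return ""
    return value


def rows_to_csv(rows: Iterable[dict[str, Any]], fieldnames: list[str]) -> str:
    """Serialize rows to CSV text with one uniform null representation."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        # Keys absent from a row are treated the same as None values.
        writer.writerow({name: to_csv_cell(row.get(name)) for name in fieldnames})
    return buffer.getvalue()
```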
**No data validation:**
- Retrieved MBIDs not validated as UUIDs
- Retrieved ISRCs not validated as ISRC format
- Retrieved URLs not validated as valid URLs
**No conflict resolution:**
- If MusicBrainz and Deezer return different artists, no reconciliation
- First successful query wins, no cross-validation
### Data Accuracy Issues
**YouTube matching:** Weak matching logic. First result assumed correct. High false positive rate.
**Duration filtering:** ±3 seconds threshold may be too loose for short tracks, too strict for live recordings.
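To make the concern concrete, the sketch below contrasts a fixed ±3 second check with a relative tolerance. Both functions are hypothetical; the fixed variant mirrors the threshold described above, while the relative variant (2% with a 1 second floor) is only one possible alternative, not something MusicMetaLinker implements.

```python
from typing import Optional


def duration_matches(expected_s: Optional[float], candidate_s: Optional[float],
                     tolerance_s: float = 3.0) -> bool:
    """Fixed tolerance: accept candidates within tolerance_s seconds."""
    if expected_s is None or candidate_s is None:
        return True  # assumption: nothing to compare, so do not reject
    return abs(expected_s - candidate_s) <= tolerance_s


def duration_matches_relative(expected_s: Optional[float], candidate_s: Optional[float],
                              rel: float = 0.02, floor_s: float = 1.0) -> bool:
    """Relative tolerance: the allowed difference scales with track length."""
    if expected_s is None or candidate_s is None:
        return True
    allowed = max(floor_s, rel * expected_s)
    return abs(expected_s - candidate_s) <= allowed
```

With these defaults, a 30 second track matched against a 33 second candidate passes the fixed check (a 10% difference) but fails the relative one, while a 600 second live recording against a 608 second candidate fails the fixed check and passes the relative one.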
**Fuzzy matching:** No documented algorithm. Likely simple string similarity. Doesn't handle:
- Transliterations (e.g., Japanese to romaji)
- Abbreviations (e.g., "feat." vs "featuring")
- Reorderings (e.g., "Artist feat. Guest" vs "Guest & Artist")
**AcousticBrainz:** Always returns null (service shut down). Dead data field.
## Data Flow Diagrams
### Single Track Flow
```
Input (Python dict or JAMS)
Align constructor
[Lazy evaluation - no queries yet]
User calls getter (e.g., get_mbid())
Check cache
If not cached:
    Determine service to query
    Execute service query
    Parse response
    Cache result
Return to user
```
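The lazy-getter pattern in this flow can be sketched as follows. `LazyFields` is a hypothetical stand-in for Align, with service queries replaced by injected callables; caching None results as well (so a failed lookup is not retried) is an assumption.

```python
from typing import Any, Callable, Optional


class LazyFields:
    """Per-instance lazy cache keyed by field name (stand-in for Align)."""

    def __init__(self, fetchers: dict[str, Callable[[], Optional[Any]]]):
        self._fetchers = fetchers                # field name -> query function
        self._cache: dict[str, Optional[Any]] = {}

    def get(self, field: str) -> Optional[Any]:
        """Return the cached value, querying the service on first access."""
        if field in self._cache:
            return self._cache[field]
        fetcher = self._fetchers.get(field)
        value = fetcher() if fetcher is not None else None
        self._cache[field] = value               # None is cached too
        return value
```

Nothing is queried in the constructor, and the cache lives and dies with the instance, which matches the single-instance scope described for the in-memory cache.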
### Batch Processing Flow
```
Directory of JAMS files
For each JAMS file:
    JAMSProcessor.extract_metadata()
    Create Align instance
    Call all getters
    Collect results in list
End loop
Convert list to pandas DataFrame
Write CSV
Optionally write enriched JAMS files
```
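A sketch of the batch loop under stated assumptions: files are read with `json` instead of the `jams` library, the `*.jams` glob is assumed, and `enrich` is a hypothetical callable standing in for "create an Align instance and call all getters".

```python
import json
from pathlib import Path
from typing import Any, Callable


def process_directory(
    directory: str,
    enrich: Callable[[dict[str, Any]], dict[str, Any]],
) -> list[dict[str, Any]]:
    """Run enrich() on every JAMS file in a directory and collect the rows."""
    results = []
    for path in sorted(Path(directory).glob("*.jams")):
        with path.open(encoding="utf-8") as handle:
            doc = json.load(handle)
        row = {"jams_file": path.name}  # first column of the CSV schema
        row.update(enrich(doc))         # one lookup round per file
        results.append(row)
    return results
```

The returned list of dicts is the shape that `pandas.DataFrame(results).to_csv(path, index=False)` would accept for the CSV step.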
### Service Query Flow
```
Align.get_mbid()
If mbid_track provided:
    Return mbid_track
Else if isrc provided:
    Query MusicBrainz by ISRC
Else:
    Query MusicBrainz by metadata
Parse MusicBrainz response
Extract MBID
Cache and return
```
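The priority order in this flow can be sketched with the MusicBrainz calls injected as plain callables. `resolve_mbid`, `lookup_by_isrc` and `search_by_metadata` are hypothetical names. As in the flow, a failed ISRC lookup does not fall back to a metadata search; requiring both artist and track for that search is an assumption.

```python
from typing import Any, Callable, Optional


def resolve_mbid(
    mbid_track: Optional[str],
    isrc: Optional[str],
    metadata: dict[str, Any],
    lookup_by_isrc: Callable[[str], Optional[str]],
    search_by_metadata: Callable[[dict[str, Any]], Optional[str]],
) -> Optional[str]:
    """Resolve an MBID: given MBID first, then ISRC, then metadata search."""
    if mbid_track:
        return mbid_track  # nothing to query
    if isrc:
        return lookup_by_isrc(isrc)
    if metadata.get("artist") and metadata.get("track"):
        return search_by_metadata(metadata)
    return None  # no searchable input
```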
## Data Caching Strategy
### In-Memory Cache
**Scope:** Single Align instance only.
**Cache key:** Implicit (field name). No explicit key generation.
**Cache invalidation:** None. Values cached for instance lifetime.
**Cache size:** Small (one value per field, ~15 fields max).
**Cache hit rate:** High for repeated getter calls on same instance. Zero across instances.
### No Persistent Cache
**Implications:**
- Repeated queries for same track across runs
- No offline operation
- Network dependency for every query
**Batch processing impact:**
- Processing 1000 tracks = 1000+ API calls
- No deduplication across tracks
- High network usage
### Cache Recommendations
For production use:
1. **Add persistent cache:** Redis or SQLite for cross-run caching
2. **Cache key:** Hash of (artist, track, album, duration)
3. **TTL:** 30 days (metadata rarely changes)
4. **Invalidation:** Manual or TTL-based
5. **Deduplication:** Cache identical queries across tracks
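A minimal sketch of these recommendations using SQLite from the standard library. The class, table and function names are hypothetical, and the 30 day TTL is taken from the list above; none of this exists in MusicMetaLinker.

```python
import hashlib
import json
import sqlite3
import time
from typing import Any, Optional

TTL_SECONDS = 30 * 24 * 60 * 60  # 30 days


def cache_key(artist: Optional[str], track: Optional[str],
              album: Optional[str], duration: Optional[float]) -> str:
    """Stable hash of the query tuple; identical queries share one key."""
    payload = json.dumps([artist, track, album, duration], ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class MetadataCache:
    """Tiny persistent cache with TTL-based expiry."""

    def __init__(self, path: str = ":memory:"):
        self._db = sqlite3.connect(path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS cache ("
            "key TEXT PRIMARY KEY, value TEXT NOT NULL, stored_at REAL NOT NULL)"
        )

    def get(self, key: str, now: Optional[float] = None) -> Optional[Any]:
        """Return the cached value, or None if absent or older than the TTL."""
        now = time.time() if now is None else now
        row = self._db.execute(
            "SELECT value, stored_at FROM cache WHERE key = ?", (key,)
        ).fetchone()
        if row is None or now - row[1] > TTL_SECONDS:
            return None
        return json.loads(row[0])

    def put(self, key: str, value: Any, now: Optional[float] = None) -> None:
        """Store a JSON-serializable value, replacing any previous entry."""
        now = time.time() if now is None else now
        self._db.execute(
            "INSERT OR REPLACE INTO cache (key, value, stored_at) VALUES (?, ?, ?)",
            (key, json.dumps(value), now),
        )
        self._db.commit()
```

Passing a file path instead of `":memory:"` makes the cache survive across runs, and keying on the query tuple gives deduplication across tracks for free.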
## Data Privacy and Security
### Personal Data
**No personal data collected:** Only public music metadata.
**No user tracking:** No analytics, no telemetry.
**No data sharing:** Results not sent to third parties. Query metadata (artist, track, identifiers) is necessarily sent to the services being queried.
### API Credentials
**Spotify credentials:** Stored in external mml_secrets.py file. Not encrypted. Not in version control.
**Other services:** No credentials required.
### Data Retention
**No retention:** All data discarded when Align instance destroyed.
**Batch output:** CSV and JAMS files written to disk. User responsible for retention and deletion.
## Data Consistency
### Cross-Service Consistency
**No consistency checks:** If MusicBrainz returns artist "The Beatles" and Deezer returns "Beatles", no reconciliation.
**First-wins strategy:** First successful query result used. No validation against other services.
**Conflict scenarios:**
- Different artists across services
- Different track names across services
- Different durations across services
**No conflict resolution:** User receives inconsistent data.
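Detecting such conflicts, as a first step before any resolution, could be sketched like this. `find_conflicts` is a hypothetical helper that assumes per-service results are available as dicts of hashable values; MusicMetaLinker does not expose them in this form.

```python
from typing import Any


def find_conflicts(
    results_by_service: dict[str, dict[str, Any]],
) -> dict[str, dict[str, Any]]:
    """Return {field: {service: value}} for fields where services disagree.

    None values are ignored: a service that returned nothing for a field
    does not count as disagreeing.
    """
    by_field: dict[str, dict[str, Any]] = {}
    for service, fields in results_by_service.items():
        for name, value in fields.items():
            if value is not None:
                by_field.setdefault(name, {})[service] = value
    return {
        name: values
        for name, values in by_field.items()
        if len(set(values.values())) > 1
    }
```

Comparison here is exact, so "The Beatles" and "Beatles" are reported as a conflict; comparing normalized forms instead would be a separate design decision.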
### Temporal Consistency
**No versioning:** Metadata retrieved at query time. No timestamp recorded.
**Staleness:** If MusicBrainz updates metadata after query, Align instance has stale data.
**No refresh:** No way to refresh cached data without creating new instance.
## Data Completeness
### Missing Data Handling
**Graceful degradation:** Missing fields return None. No errors.
**Partial results:** If MusicBrainz succeeds but Deezer fails, MusicBrainz data returned.
**No completeness metrics:** No indication of how many fields successfully retrieved.
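A completeness metric could be as small as the sketch below. `completeness` is a hypothetical helper operating on a manually aggregated result dict; treating empty strings and the literal string "None" as missing is an assumption based on the inconsistent null handling noted earlier.

```python
from typing import Any


def completeness(result: dict[str, Any]) -> float:
    """Fraction of fields in a result dict that hold a real value."""
    if not result:
        return 0.0
    filled = sum(1 for value in result.values() if value not in (None, "", "None"))
    return filled / len(result)
```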
### Required vs Optional Fields
**No required fields:** All constructor parameters optional.
**Minimum viable input:** At least one of (mbid_track, isrc, artist+track) recommended.
**Degenerate cases:**
- Empty Align() constructor: All getters return None
- Only duration provided: All getters return None (no searchable metadata)
## Data Format Standards
### Identifier Formats
**MBID:** UUID format (e.g., "6b9e7b9e-8f9e-4f9e-9f9e-9f9e9f9e9f9e"). No validation.
**ISRC:** 12-character alphanumeric (e.g., "GBAYE9200070"). No validation.
**Deezer ID:** Integer. No range validation.
**YouTube ID:** 11-character string of letters, digits, "-" and "_" (e.g., "dQw4w9WgXcQ"). No validation.
### Metadata Formats
**Artist, track, album:** Free text. No format constraints.
**Duration:** Seconds (int or float). MusicBrainz milliseconds converted to seconds.
**Track number:** Integer. No validation (negative numbers accepted).
**Release date:** ISO format (YYYY-MM-DD) or year only (YYYY). Inconsistent across services.
**BPM:** Integer or float. No range validation.
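Because release dates arrive in mixed granularity, a consumer typically has to reduce them to a common form. The sketch below extracts the year from YYYY, YYYY-MM or YYYY-MM-DD values; `parse_release_year` is a hypothetical helper, and accepting the YYYY-MM form is an addition beyond the two formats listed above.

```python
import re
from typing import Optional, Union

_DATE_RE = re.compile(r"(\d{4})(?:-\d{2}(?:-\d{2})?)?")


def parse_release_year(value: Union[str, int, None]) -> Optional[int]:
    """Return the year from a full or partial ISO date, else None."""
    if value is None:
        return None
    match = _DATE_RE.fullmatch(str(value).strip())
    return int(match.group(1)) if match else None
```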
## Data Interoperability
### JAMS Compatibility
JAMS is a standard format in music information retrieval research. MusicMetaLinker's JAMS support enables interoperability with:
- mir_eval (evaluation framework)
- librosa (audio analysis)
- madmom (music analysis)
- Other MIR tools
### Service Compatibility
**MusicBrainz:** Uses the musicbrainzngs library, a community-maintained Python client for the MusicBrainz web service. Compatibility with API changes depends on the library being kept up to date.
**Deezer:** Uses the deezer-python library, a third-party wrapper around the Deezer API (not published by Deezer).
**YouTube Music:** Uses unofficial ytmusicapi. Fragile to YouTube changes. No API stability guarantees.
**Spotify:** Uses the spotipy library, a community-maintained client for the official Spotify Web API.
## Data Limitations
1. **No bulk operations:** Each track processed individually
2. **No streaming:** All data loaded into memory
3. **No compression:** JAMS files written uncompressed
4. **No encryption:** All data stored in plaintext
5. **No checksums:** No data integrity verification
6. **No versioning:** No metadata version tracking
7. **No provenance:** No record of which service provided which field
8. **No confidence scores:** No indication of match quality
## Data Recommendations
For production use:
1. **Add validation:** Validate all input and output formats
2. **Add normalization:** Normalize artist names, track titles
3. **Add conflict resolution:** Cross-validate results across services
4. **Add provenance tracking:** Record which service provided each field
5. **Add confidence scores:** Indicate match quality
6. **Add persistent cache:** Reduce API calls
7. **Add data versioning:** Track when metadata retrieved
8. **Add bulk operations:** Process multiple tracks efficiently
9. **Remove dead fields:** Delete AcousticBrainz from output
10. **Add structured output:** to_dict(), to_json() methods
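Recommendations 4, 5, 7 and 10 could share one small data structure. The sketch below is one possible shape, not a proposal taken from the project: `FieldValue`, `TrackRecord` and all their attribute names are hypothetical.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Optional


@dataclass
class FieldValue:
    """One metadata field plus where it came from."""
    value: Any
    source: Optional[str] = None        # e.g. "musicbrainz", "deezer"
    confidence: Optional[float] = None  # 0.0-1.0 match quality, if known
    retrieved_at: Optional[str] = None  # ISO 8601 timestamp, if recorded


@dataclass
class TrackRecord:
    """Structured result with provenance, serializable to dict and JSON."""
    fields: dict[str, FieldValue] = field(default_factory=dict)

    def add(self, name: str, value: Any, source: Optional[str] = None,
            confidence: Optional[float] = None,
            retrieved_at: Optional[str] = None) -> None:
        self.fields[name] = FieldValue(value, source, confidence, retrieved_at)

    def to_dict(self) -> dict[str, dict[str, Any]]:
        return {name: asdict(item) for name, item in self.fields.items()}

    def to_json(self) -> str:
        return json.dumps(self.to_dict(), ensure_ascii=False, sort_keys=True)
```

A caller would record each getter result with `record.add("isrc", value, source="musicbrainz")` and serialize once at the end, instead of aggregating fields by hand.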
The data model is simple and functional for research use. Production use requires significant enhancements.