Alexander a1f6701bac feat: initial implementation of metadata aggregator
- gRPC service with MusicBrainz provider
- PostgreSQL schema with migrations
- Service layer with database-first caching
- Repository pattern for data access
- YAML configuration support
- Research documentation for 17 music metadata projects
2026-04-28 16:28:53 +02:00

MusicMetaLinker Data Architecture

Data Storage Model

MusicMetaLinker has no persistent data storage. All data is in-memory during execution.

No database: No SQL, no NoSQL, no embedded databases.

No file-based persistence: No local cache files, no serialized objects. The only files written are the JAMS and CSV outputs of batch processing.

Stateless operation: Each Align instance is independent. No shared state across instances.

Input Data Formats

Python Objects

Primary input method: Constructor parameters to Align class.

Supported data types:

{
    "mbid_track": str,        # UUID format
    "mbid_release": str,      # UUID format
    "artist": str,            # Free text
    "album": str,             # Free text
    "track": str,             # Free text
    "track_number": int,      # Positive integer
    "duration": int | float,  # Seconds
    "isrc": str,              # ISRC format (no validation)
    "strict": bool            # Matching mode
}

No validation: Input accepted as-is. Invalid data fails silently: the affected getters return None.

No normalization: Artist names, track titles used exactly as provided. No case normalization, no whitespace trimming, no Unicode normalization.

JAMS Files

JAMS (JSON Annotated Music Specification) is the standard input format for batch processing.

JAMS structure:

{
  "file_metadata": {
    "title": "Track Name",
    "artist": "Artist Name",
    "release": "Album Name",
    "duration": 123.45,
    "identifiers": {
      "musicbrainz": "mbid-uuid-here",
      "isrc": "GBAYE9200070"
    }
  },
  "sandbox": {
    "type": "music_type",
    "genre": "rock",
    "track_number": 1,
    "release_year": 2020
  },
  "annotations": []
}

Key sections:

file_metadata: Core track metadata. Required fields: title, artist. Optional: release, duration, identifiers.

sandbox: Additional metadata. Free-form structure. Common fields: type, genre, track_number, release_year.

annotations: Music information retrieval annotations (not used by MusicMetaLinker).

Parsing logic:

JAMSProcessor extracts:

  • title → track
  • artist → artist
  • release → album
  • duration → duration
  • identifiers.musicbrainz → mbid_track
  • identifiers.isrc → isrc
  • sandbox.track_number → track_number

Missing fields: Treated as None. No errors raised.
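
The field mapping above can be sketched with plain dictionaries. This is a minimal illustration, not JAMSProcessor's actual code: the function name extract_align_kwargs is hypothetical, and it reads already-parsed JSON instead of going through the jams library.

```python
from typing import Any, Optional


def extract_align_kwargs(jams: dict[str, Any]) -> dict[str, Optional[Any]]:
    """Map a parsed JAMS document to Align-style keyword arguments.

    Hypothetical helper: missing sections or keys become None, never an error.
    """
    meta = jams.get("file_metadata") or {}
    ids = meta.get("identifiers") or {}
    sandbox = jams.get("sandbox") or {}
    return {
        "track": meta.get("title"),
        "artist": meta.get("artist"),
        "album": meta.get("release"),
        "duration": meta.get("duration"),
        "mbid_track": ids.get("musicbrainz"),
        "isrc": ids.get("isrc"),
        "track_number": sandbox.get("track_number"),
    }


example = {
    "file_metadata": {"title": "Track Name", "artist": "Artist Name"},
    "sandbox": {"track_number": 1},
}
kwargs = extract_align_kwargs(example)
```

Using `or {}` instead of a default argument also covers sections that are present but explicitly null.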

CSV Input

No direct CSV input support. Batch processing outputs CSV but doesn't read it.

For CSV input, users must:

  1. Parse CSV manually
  2. Create Align instances per row
  3. Collect results
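
A minimal sketch of those three steps, assuming CSV column names that match the constructor parameters. The helper rows_to_results and its make_linker and collect parameters are hypothetical; in real use make_linker would be the Align class and collect the manual getter aggregation. Stand-ins are used here so the sketch runs without the library or network access.

```python
import csv
import io
from typing import Any, Callable, Iterable


def rows_to_results(
    lines: Iterable[str],
    make_linker: Callable[..., Any],
    collect: Callable[[Any], dict],
) -> list[dict]:
    """Parse CSV text, build one linker per row, and collect one result per row."""
    results = []
    for row in csv.DictReader(lines):
        # csv yields strings only; empty cells become None to match the
        # "missing field" convention. Numeric columns still need converting.
        kwargs = {key: (value or None) for key, value in row.items()}
        results.append(collect(make_linker(**kwargs)))
    return results


# Stand-ins: the "linker" is just its keyword arguments, returned unchanged.
csv_text = "artist,track,isrc\nArtist Name,Track Name,\n"
results = rows_to_results(
    io.StringIO(csv_text),
    make_linker=lambda **kwargs: kwargs,
    collect=lambda linker: linker,
)
```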

Output Data Formats

Python Objects

Align instance acts as data container. Getters return individual fields.

No structured output method: No to_dict(), no to_json(), no serialize().

Manual aggregation required:

linker = Align(...)
result = {
    "artist": linker.get_artist(),
    "track": linker.get_track(),
    "mbid": linker.get_mbid(),
    "isrc": linker.get_isrc(),
    "deezer_id": linker.get_deezer_id(),
    # ... etc
}
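
The aggregation can also be written generically. The sketch below is an assumption-laden convenience, not part of the library: linker_to_dict is a hypothetical name, and it assumes every get_* method takes no arguments. On a real instance, calling every getter triggers every service query.

```python
from typing import Any


def linker_to_dict(linker: Any) -> dict[str, Any]:
    """Call every zero-argument ``get_*`` method and key the result by field name."""
    result = {}
    for name in dir(linker):
        if name.startswith("get_"):
            method = getattr(linker, name)
            if callable(method):
                result[name[len("get_"):]] = method()
    return result


class FakeLinker:
    """Stand-in with two getters; a real instance would query services here."""

    def get_artist(self):
        return "Artist Name"

    def get_isrc(self):
        return None


record = linker_to_dict(FakeLinker())
```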

JAMS Files

Enriched JAMS files with added identifiers.

Enrichment process:

  1. Read original JAMS file
  2. Extract metadata
  3. Create Align instance
  4. Query all services
  5. Add identifiers to file_metadata.identifiers section
  6. Write enriched JAMS file

Added identifiers:

{
  "file_metadata": {
    "identifiers": {
      "musicbrainz": "mbid-from-query",
      "isrc": "isrc-from-query",
      "deezer": "deezer-id-from-query",
      "youtube": "youtube-url-from-query",
      "acousticbrainz": null
    }
  }
}

Preservation: Original JAMS structure preserved. Only identifiers section modified.

Overwrite behavior: Controlled by --overwrite flag. Without flag, existing identifiers preserved.
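
One plausible reading of the overwrite behavior, as a sketch. The function merge_identifiers is hypothetical, and the exact semantics of --overwrite are an assumption: here an existing identifier is replaced only when overwriting is enabled and the query returned a value.

```python
from typing import Any, Optional


def merge_identifiers(
    existing: dict[str, Any],
    found: dict[str, Optional[Any]],
    overwrite: bool = False,
) -> dict[str, Any]:
    """Return a copy of ``existing`` enriched with ``found``."""
    merged = dict(existing)
    for key, value in found.items():
        current = merged.get(key)
        # Fill gaps always; replace a stored value only on request, and never
        # with a missing result.
        if current is None or (overwrite and value is not None):
            merged[key] = value
    return merged


original = {"isrc": "GBAYE9200070"}
found = {"isrc": "isrc-from-query", "deezer": "deezer-id-from-query", "acousticbrainz": None}
kept = merge_identifiers(original, found)
replaced = merge_identifiers(original, found, overwrite=True)
```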

CSV Output

Batch processing generates CSV with all metadata and identifiers.

CSV schema:

Column         | Type  | Description
---------------|-------|---------------------------------
jams_file      | str   | Original JAMS filename
track_name     | str   | Track title
artist_name    | str   | Artist name
album_name     | str   | Album/release name
track_number   | int   | Track position
duration       | float | Duration in seconds
release_year   | int   | Release year
musicbrainz    | str   | MBID (UUID)
isrc           | str   | ISRC code
deezer_id      | int   | Deezer track ID
deezer_url     | str   | Full Deezer URL
youtube_url    | str   | Full YouTube URL
acousticbrainz | str   | AcousticBrainz URL (always null)
spotify_id     | str   | Spotify ID (if available)

Missing values: Empty cells or "None" string (inconsistent).

Encoding: UTF-8. No BOM.

Delimiter: Comma. No escaping issues documented.

Headers: First row contains column names.

Output location: Same directory as input JAMS files, named based on directory name.

Data Transformation Pipeline

Input Transformation

  1. JAMS parsing: JSON deserialization via jams library
  2. Field extraction: Map JAMS fields to Align parameters
  3. Type conversion: String to int for track_number, string to float for duration
  4. Null handling: Missing fields become None

Query Transformation

  1. Metadata normalization: None (passed as-is to services)
  2. Duration conversion: MusicBrainz milliseconds → seconds
  3. ID extraction: Parse service-specific response formats
  4. URL construction: Build full URLs from IDs
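
Steps 2 and 4 can be sketched as small pure functions. The function names are hypothetical, and the URL patterns are the public Deezer and YouTube link formats, assumed here to be what the tool emits.

```python
from typing import Optional, Union


def ms_to_seconds(length_ms: Union[int, str, None]) -> Optional[float]:
    """Convert a MusicBrainz-style length in milliseconds to seconds."""
    if length_ms is None:
        return None
    return int(length_ms) / 1000.0


def deezer_track_url(track_id: int) -> str:
    """Build a Deezer track page URL from a numeric track ID."""
    return f"https://www.deezer.com/track/{track_id}"


def youtube_watch_url(video_id: str) -> str:
    """Build a YouTube watch URL from a video ID."""
    return f"https://www.youtube.com/watch?v={video_id}"


seconds = ms_to_seconds("123450")
watch_url = youtube_watch_url("dQw4w9WgXcQ")
```

Accepting strings as well as integers is a defensive choice for length values that arrive as text in a parsed response.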

Output Transformation

  1. Result aggregation: Collect all getter results
  2. CSV serialization: pandas DataFrame to CSV
  3. JAMS enrichment: Inject identifiers into JSON structure
  4. File writing: JSON serialization with indentation

Data Quality Issues

Input Data Quality

No validation:

  • Invalid MBIDs accepted (wrong format, non-existent)
  • Invalid ISRCs accepted (wrong format, non-existent)
  • Negative durations accepted
  • Empty strings accepted

No sanitization:

  • Special characters in metadata not escaped
  • SQL injection risk if metadata used in queries (not applicable here)
  • Command injection risk if metadata used in shell commands (not applicable here)

No normalization:

  • "The Beatles" vs "Beatles" treated as different
  • "feat." vs "featuring" vs "ft." not normalized
  • Unicode variants not normalized (e.g., é vs e + combining accent)
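
A minimal normalization sketch covering the second and third points, using only the standard library. The function normalize_text is hypothetical and is meant for comparison keys, not for display. It does not address leading articles such as "The", and it will also rewrite the ordinary word "feat".

```python
import re
import unicodedata

# Matches "featuring", "feat" or "ft", with or without a trailing dot.
_FEAT = re.compile(r"\b(?:featuring|feat|ft)\b\.?", re.IGNORECASE)


def normalize_text(value: str) -> str:
    """Return a comparison key: NFC form, unified 'feat.', single spaces, casefolded."""
    text = unicodedata.normalize("NFC", value)
    text = _FEAT.sub("feat.", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text.casefold()


same_accent = normalize_text("Cafe\u0301") == normalize_text("Caf\u00e9")
key = normalize_text("  Artist   FT. Guest ")
```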

Output Data Quality

Inconsistent null representation:

  • Python: None
  • CSV: Empty string or "None" string
  • JAMS: null or missing key

No data validation:

  • Retrieved MBIDs not validated as UUIDs
  • Retrieved ISRCs not validated as ISRC format
  • Retrieved URLs not validated as valid URLs

No conflict resolution:

  • If MusicBrainz and Deezer return different artists, no reconciliation
  • First successful query wins, no cross-validation

Data Accuracy Issues

YouTube matching: Weak matching logic. First result assumed correct. High false positive rate.

Duration filtering: ±3 seconds threshold may be too loose for short tracks, too strict for live recordings.
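
A possible alternative to a fixed threshold is a tolerance that scales with track length and is clamped at both ends. This is a sketch of that idea, not the tool's behavior; the 2% rate and the 1 and 10 second bounds are illustrative values only.

```python
def duration_tolerance(
    expected_s: float,
    rel: float = 0.02,
    floor_s: float = 1.0,
    ceil_s: float = 10.0,
) -> float:
    """Allowed deviation in seconds: a share of the duration, clamped to [floor_s, ceil_s]."""
    return min(max(expected_s * rel, floor_s), ceil_s)


def duration_matches(expected_s: float, candidate_s: float) -> bool:
    """True when the candidate is within the scaled tolerance of the expected duration."""
    return abs(expected_s - candidate_s) <= duration_tolerance(expected_s)


short_ok = duration_matches(30.0, 32.5)    # 2.5 s off a 30 s track
live_ok = duration_matches(600.0, 608.0)   # 8 s off a 10 min recording
```

With these values the short track is rejected and the long recording is accepted, the opposite of what a flat ±3 seconds would do in both cases.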

Fuzzy matching: No documented algorithm. Likely simple string similarity. Doesn't handle:

  • Transliterations (e.g., Japanese to romaji)
  • Abbreviations (e.g., "feat." vs "featuring")
  • Reorderings (e.g., "Artist feat. Guest" vs "Guest & Artist")
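
To illustrate why plain string similarity struggles with these cases, here is a sketch using difflib from the standard library. It is an assumption about the general kind of matching involved, not the tool's algorithm, and the 0.85 threshold is arbitrary.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio in [0, 1]."""
    return SequenceMatcher(None, a.casefold(), b.casefold()).ratio()


def is_match(a: str, b: str, threshold: float = 0.85) -> bool:
    """True when the similarity ratio reaches the threshold."""
    return similarity(a, b) >= threshold


exact = is_match("Hey Jude", "hey jude")
reordered = is_match("Artist feat. Guest", "Guest & Artist")
```

The reordered pair names the same two performers but fails the threshold, because the ratio rewards shared character runs in the same order.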

AcousticBrainz: Always returns null (service shut down). Dead data field.

Data Flow Diagrams

Single Track Flow

Input (Python dict or JAMS)
    ↓
Align constructor
    ↓
[Lazy evaluation - no queries yet]
    ↓
User calls getter (e.g., get_mbid())
    ↓
Check cache
    ↓
If not cached:
    ↓
Determine service to query
    ↓
Execute service query
    ↓
Parse response
    ↓
Cache result
    ↓
Return to user
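
The lazy, per-instance caching in this flow can be sketched as follows. LazyFields is a hypothetical stand-in, not the Align class; it assumes a failed lookup (None) is cached too, so it is not retried.

```python
from typing import Any, Callable, Optional


class LazyFields:
    """Resolve each field on first access and remember the result for the instance lifetime."""

    def __init__(self, resolvers: dict[str, Callable[[], Optional[Any]]]):
        self._resolvers = resolvers
        self._cache: dict[str, Optional[Any]] = {}

    def get(self, field: str) -> Optional[Any]:
        if field not in self._cache:
            # First access: run the (possibly slow, networked) resolver once.
            self._cache[field] = self._resolvers[field]()
        return self._cache[field]


calls = []


def fake_mbid_lookup() -> str:
    """Stand-in for a service query; records each invocation."""
    calls.append("mbid")
    return "mbid-from-query"


fields = LazyFields({"mbid": fake_mbid_lookup})
first = fields.get("mbid")
second = fields.get("mbid")
```

Constructing the object runs no resolver, and the second call is served from the cache.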

Batch Processing Flow

Directory of JAMS files
    ↓
For each JAMS file:
    ↓
JAMSProcessor.extract_metadata()
    ↓
Create Align instance
    ↓
Call all getters
    ↓
Collect results in list
    ↓
End loop
    ↓
Convert list to pandas DataFrame
    ↓
Write CSV
    ↓
Optionally write enriched JAMS files

Service Query Flow

Align.get_mbid()
    ↓
If mbid_track provided:
    Return mbid_track (no query)
Else if isrc provided:
    Query MusicBrainz by ISRC
Else:
    Query MusicBrainz by metadata
    ↓
Parse MusicBrainz response
    ↓
Extract MBID
    ↓
Cache and return
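
The branch order can be sketched as a pure function with the two MusicBrainz lookups injected. All names here are hypothetical. Whether a failed ISRC lookup falls back to a metadata search is not stated in the flow, so this sketch does not fall back; it also skips the metadata query when artist or track is missing.

```python
from typing import Callable, Optional


def resolve_mbid(
    mbid_track: Optional[str],
    isrc: Optional[str],
    artist: Optional[str],
    track: Optional[str],
    by_isrc: Callable[[str], Optional[str]],
    by_metadata: Callable[[str, str], Optional[str]],
) -> Optional[str]:
    """Pick the cheapest available route to an MBID."""
    if mbid_track:
        return mbid_track              # supplied by the caller, no query
    if isrc:
        return by_isrc(isrc)           # exact identifier lookup
    if artist and track:
        return by_metadata(artist, track)
    return None                        # nothing searchable was provided


given = resolve_mbid("given-mbid", "GBAYE9200070", None, None,
                     by_isrc=lambda i: "via-isrc", by_metadata=lambda a, t: "via-meta")
via_isrc = resolve_mbid(None, "GBAYE9200070", "Artist", "Track",
                        by_isrc=lambda i: "via-isrc", by_metadata=lambda a, t: "via-meta")
```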

Data Caching Strategy

In-Memory Cache

Scope: Single Align instance only.

Cache key: Implicit (field name). No explicit key generation.

Cache invalidation: None. Values cached for instance lifetime.

Cache size: Small (one value per field, ~15 fields max).

Cache hit rate: High for repeated getter calls on same instance. Zero across instances.

No Persistent Cache

Implications:

  • Repeated queries for same track across runs
  • No offline operation
  • Network dependency for every query

Batch processing impact:

  • Processing 1000 tracks = 1000+ API calls
  • No deduplication across tracks
  • High network usage

Cache Recommendations

For production use:

  1. Add persistent cache: Redis or SQLite for cross-run caching
  2. Cache key: Hash of (artist, track, album, duration)
  3. TTL: 30 days (metadata rarely changes)
  4. Invalidation: Manual or TTL-based
  5. Deduplication: Cache identical queries across tracks
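
A minimal sketch of recommendations 1 to 4 using SQLite from the standard library. The class MetadataCache and function cache_key are hypothetical, the in-memory database is only for demonstration (a file path would make it persistent), and the now parameter exists so expiry can be shown without waiting.

```python
import hashlib
import json
import sqlite3
import time
from typing import Any, Optional

THIRTY_DAYS = 30 * 24 * 3600


def cache_key(artist: str, track: str, album: str, duration: float) -> str:
    """Stable hash of the query fields."""
    payload = json.dumps([artist, track, album, duration], ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class MetadataCache:
    """Key/value cache with TTL-based expiry, stored in SQLite."""

    def __init__(self, path: str = ":memory:", ttl_seconds: float = THIRTY_DAYS):
        self._db = sqlite3.connect(path)
        self._ttl = ttl_seconds
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS cache ("
            "key TEXT PRIMARY KEY, value TEXT NOT NULL, stored_at REAL NOT NULL)"
        )

    def put(self, key: str, value: Any, now: Optional[float] = None) -> None:
        stored_at = time.time() if now is None else now
        self._db.execute(
            "INSERT OR REPLACE INTO cache (key, value, stored_at) VALUES (?, ?, ?)",
            (key, json.dumps(value), stored_at),
        )
        self._db.commit()

    def get(self, key: str, now: Optional[float] = None) -> Optional[Any]:
        current = time.time() if now is None else now
        row = self._db.execute(
            "SELECT value, stored_at FROM cache WHERE key = ?", (key,)
        ).fetchone()
        if row is None or current - row[1] > self._ttl:
            return None  # never stored, or older than the TTL
        return json.loads(row[0])


cache = MetadataCache()
key = cache_key("Artist Name", "Track Name", "Album Name", 123.45)
cache.put(key, {"mbid": "mbid-from-query"}, now=0.0)
fresh = cache.get(key, now=10.0)
expired = cache.get(key, now=31 * 24 * 3600.0)
```

Because the key depends only on the query fields, identical tracks in one batch share an entry, which also covers the deduplication point.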

Data Privacy and Security

Personal Data

No personal data collected: Only public music metadata.

No user tracking: No analytics, no telemetry.

No data sharing: Results not sent to third parties.

API Credentials

Spotify credentials: Stored in external mml_secrets.py file. Not encrypted. Not in version control.

Other services: No credentials required.

Data Retention

No retention: All data discarded when Align instance destroyed.

Batch output: CSV and JAMS files written to disk. User responsible for retention and deletion.

Data Consistency

Cross-Service Consistency

No consistency checks: If MusicBrainz returns artist "The Beatles" and Deezer returns "Beatles", no reconciliation.

First-wins strategy: First successful query result used. No validation against other services.

Conflict scenarios:

  • Different artists across services
  • Different track names across services
  • Different durations across services

No conflict resolution: User receives inconsistent data.
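
Detecting such conflicts is straightforward once per-service results are kept side by side. The sketch below assumes a hypothetical structure mapping service name to field values; the tool does not retain this information today.

```python
from typing import Any


def find_conflicts(results: dict[str, dict[str, Any]]) -> dict[str, dict[str, Any]]:
    """Return each field whose non-null values differ across services.

    ``results`` maps service name -> field name -> value.
    """
    by_field: dict[str, dict[str, Any]] = {}
    for service, fields in results.items():
        for field, value in fields.items():
            if value is not None:
                by_field.setdefault(field, {})[service] = value
    return {
        field: values
        for field, values in by_field.items()
        if len(set(values.values())) > 1
    }


conflicts = find_conflicts({
    "musicbrainz": {"artist": "The Beatles", "track": "Hey Jude"},
    "deezer": {"artist": "Beatles", "track": "Hey Jude", "duration": None},
})
```

Comparison here is exact; normalizing values first would reduce false conflicts such as differing case.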

Temporal Consistency

No versioning: Metadata retrieved at query time. No timestamp recorded.

Staleness: If MusicBrainz updates metadata after query, Align instance has stale data.

No refresh: No way to refresh cached data without creating new instance.

Data Completeness

Missing Data Handling

Graceful degradation: Missing fields return None. No errors.

Partial results: If MusicBrainz succeeds but Deezer fails, MusicBrainz data returned.

No completeness metrics: No indication of how many fields successfully retrieved.

Required vs Optional Fields

No required fields: All constructor parameters optional.

Minimum viable input: At least one of (mbid_track, isrc, artist+track) recommended.

Degenerate cases:

  • Empty Align() constructor: All getters return None
  • Only duration provided: All getters return None (no searchable metadata)

Data Format Standards

Identifier Formats

MBID: UUID format (e.g., "6b9e7b9e-8f9e-4f9e-9f9e-9f9e9f9e9f9e"). No validation.

ISRC: 12-character alphanumeric (e.g., "GBAYE9200070"). No validation.

Deezer ID: Integer. No range validation.

YouTube ID: 11-character string of letters, digits, "-" and "_" (e.g., "dQw4w9WgXcQ"). No validation.
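
Format checks for these identifiers are small enough to sketch with the standard library. The function names are hypothetical. The checks confirm shape only, not that the identifier exists at the service, and the 11-character YouTube pattern reflects current video IDs without being a documented guarantee.

```python
import re
import uuid
from typing import Any

# ISRC: 2-letter country code, 3-character registrant, 2-digit year, 5-digit designation.
_ISRC = re.compile(r"^[A-Z]{2}[A-Z0-9]{3}[0-9]{7}$")
_YOUTUBE_ID = re.compile(r"^[A-Za-z0-9_-]{11}$")


def is_valid_mbid(value: Any) -> bool:
    """True for a UUID in canonical hyphenated form."""
    try:
        return str(uuid.UUID(value)) == value.lower()
    except (ValueError, AttributeError, TypeError):
        return False


def is_valid_isrc(value: Any) -> bool:
    """True for a 12-character ISRC; hyphens and lowercase are tolerated."""
    if not isinstance(value, str):
        return False
    return bool(_ISRC.match(value.replace("-", "").upper()))


def is_valid_youtube_id(value: Any) -> bool:
    """True for an 11-character video ID."""
    return isinstance(value, str) and bool(_YOUTUBE_ID.match(value))


mbid_ok = is_valid_mbid("6b9e7b9e-8f9e-4f9e-9f9e-9f9e9f9e9f9e")
isrc_ok = is_valid_isrc("GBAYE9200070")
```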

Metadata Formats

Artist, track, album: Free text. No format constraints.

Duration: Seconds (int or float). MusicBrainz milliseconds converted to seconds.

Track number: Integer. No validation (negative numbers accepted).

Release date: ISO format (YYYY-MM-DD) or year only (YYYY). Inconsistent across services.

BPM: Integer or float. No range validation.

Data Interoperability

JAMS Compatibility

JAMS is a standard format in music information retrieval research. MusicMetaLinker's JAMS support enables interoperability with:

  • mir_eval (evaluation framework)
  • librosa (audio analysis)
  • madmom (music analysis)
  • Other MIR tools

Service Compatibility

MusicBrainz: Uses the musicbrainzngs library (community-maintained bindings for the MusicBrainz web service). Compatible with MusicBrainz API changes (library handles versioning).

Deezer: Uses the deezer-python library (community-maintained wrapper). Compatible with Deezer API.

YouTube Music: Uses unofficial ytmusicapi. Fragile to YouTube changes. No API stability guarantees.

Spotify: Uses the spotipy library (community-maintained client). Compatible with Spotify API.

Data Limitations

  1. No bulk operations: Each track processed individually
  2. No streaming: All data loaded into memory
  3. No compression: JAMS files written uncompressed
  4. No encryption: All data stored in plaintext
  5. No checksums: No data integrity verification
  6. No versioning: No metadata version tracking
  7. No provenance: No record of which service provided which field
  8. No confidence scores: No indication of match quality
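
Items 6 to 8 could be addressed by wrapping each value in a small record. This is a sketch of one possible shape; FieldValue and best are hypothetical names, and how a confidence score would be computed is left open.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any


@dataclass(frozen=True)
class FieldValue:
    """A metadata value plus where it came from, how sure we are, and when."""

    value: Any
    source: str          # e.g. "musicbrainz", "deezer"
    confidence: float    # 0.0 to 1.0, scoring method left open
    retrieved_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


def best(candidates: list[FieldValue]) -> FieldValue:
    """Pick the highest-confidence candidate (requires a non-empty list)."""
    return max(candidates, key=lambda candidate: candidate.confidence)


artist = best([
    FieldValue("Beatles", source="deezer", confidence=0.7),
    FieldValue("The Beatles", source="musicbrainz", confidence=0.9),
])
```

Keeping every candidate, and not only the winner, would also give conflict resolution something to work with.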

Data Recommendations

For production use:

  1. Add validation: Validate all input and output formats
  2. Add normalization: Normalize artist names, track titles
  3. Add conflict resolution: Cross-validate results across services
  4. Add provenance tracking: Record which service provided each field
  5. Add confidence scores: Indicate match quality
  6. Add persistent cache: Reduce API calls
  7. Add data versioning: Track when metadata retrieved
  8. Add bulk operations: Process multiple tracks efficiently
  9. Remove dead fields: Delete AcousticBrainz from output
  10. Add structured output: to_dict(), to_json() methods

The data model is simple and functional for research use. Production use requires significant enhancements.