Alexander a1f6701bac feat: initial implementation of metadata aggregator

- gRPC service with MusicBrainz provider
- PostgreSQL schema with migrations
- Service layer with database-first caching
- Repository pattern for data access
- YAML configuration support
- Research documentation for 17 music metadata projects

2026-04-28 16:28:53 +02:00

MusicMetaLinker Overview

Project Identity

Name: MusicMetaLinker
Version: 0.0.1 (pre-release)
Language: Python 3.8+
License: MIT
Type: Library
Repository: https://github.com/andreamust/MusicMetaLinker
Author: Andrea Poltronieri
Installation: pip install git+https://github.com/andreamust/MusicMetaLinker.git

MusicMetaLinker is not available on PyPI. Installation requires direct GitHub access.

Purpose and Scope

MusicMetaLinker performs entity linking for music tracks. It connects local music metadata to external databases, enriching incomplete or inconsistent information with authoritative data from multiple sources.

The library addresses a common problem in music information retrieval: fragmented metadata across different platforms. A track might have an MBID in one system, an ISRC in another, and only artist/title strings in a third. MusicMetaLinker bridges these gaps by querying multiple services and consolidating results.

Primary use case: academic music research and dataset preparation. The library supports JAMS (JSON Annotated Music Specification), a format common in music information retrieval research.

Core Functionality

MusicMetaLinker implements a three-step workflow:

Service Selection: Based on available input identifiers (MBID, ISRC, or metadata strings), the library determines which external services to query and in what order.
Information Retrieval: Sequential queries to MusicBrainz, Deezer, YouTube Music, and AcousticBrainz (requests are not parallelized). Each service has specialized search logic.
Filtering and Matching: Results are filtered by duration, track number, and fuzzy string matching to identify the best match across services.
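The filtering and matching step can be illustrated with a minimal sketch. It is not the library's code: the `Candidate` shape, the `best_match` helper, and both thresholds are assumptions made for illustration, and `difflib` stands in for whatever fuzzy matcher the library actually uses.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import List, Optional


@dataclass
class Candidate:
    """One search result from a service (hypothetical shape)."""
    title: str
    artist: str
    duration: float  # seconds


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio between 0 and 1."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def best_match(
    title: str,
    artist: str,
    duration: float,
    candidates: List[Candidate],
    max_duration_diff: float = 5.0,  # assumed tolerance in seconds
    min_similarity: float = 0.8,     # assumed acceptance threshold
) -> Optional[Candidate]:
    """Return the closest candidate, or None if nothing passes the filters."""
    best, best_score = None, 0.0
    for cand in candidates:
        # Duration filter: discard results whose length differs too much.
        if abs(cand.duration - duration) > max_duration_diff:
            continue
        # Fuzzy string match: average the title and artist similarity.
        score = (similarity(title, cand.title) + similarity(artist, cand.artist)) / 2
        if score >= min_similarity and score > best_score:
            best, best_score = cand, score
    return best
```

Filtering on duration before string similarity is a design choice in this sketch: it cheaply removes live versions and edits that share a title with the target track.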

The library returns enriched metadata including:

Standardized identifiers (MBID, ISRC, Deezer ID)
Corrected metadata (artist, album, track name)
Additional attributes (BPM, release date)
Direct links to external services

Dependencies

Core dependencies:

musicbrainzngs: MusicBrainz API client
deezer-python: Deezer API wrapper
ytmusicapi: YouTube Music unofficial API
spotipy: Spotify API client (limited use)
requests: HTTP client for AcousticBrainz
tqdm: Progress bars for batch processing
jams: JAMS format support
pandas: CSV output for batch processing
cryptography: Required by spotipy

All dependencies are third-party packages published on PyPI; none is part of the Python standard library. No exotic or unmaintained libraries.

Architecture Pattern

MusicMetaLinker uses a cascading fallback pattern:

If MBID is provided, query MusicBrainz first (authoritative source)
If ISRC is available, try Deezer (commercial database with ISRCs)
Fall back to metadata string search across all services
Aggregate results, preferring more authoritative sources

This pattern ensures maximum coverage while respecting data quality hierarchies. MusicBrainz is treated as ground truth when available.
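A minimal sketch of the cascading fallback idea follows. The `Step` and `cascade` names are hypothetical and do not come from MusicMetaLinker; real service lookups are replaced by plain callables so that only the control flow is shown.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

Metadata = Dict[str, str]


@dataclass
class Step:
    """One service in the cascade (hypothetical structure)."""
    name: str
    requires: Tuple[str, ...]  # query fields that must be present
    lookup: Callable[[Metadata], Optional[Metadata]]


def cascade(query: Metadata, steps: List[Step]) -> Metadata:
    """Run steps in priority order and merge their results.

    Steps listed earlier are treated as more authoritative: a field they
    set is never overwritten by a later step.
    """
    merged: Metadata = {}
    for step in steps:
        # Skip services whose required identifiers are missing.
        if not all(query.get(field) for field in step.requires):
            continue
        result = step.lookup(query)
        if not result:
            continue  # service returned nothing; fall through to the next one
        for key, value in result.items():
            merged.setdefault(key, value)
    return merged
```

With steps ordered as MusicBrainz (requires `mbid`), Deezer (requires `isrc`), then a metadata search (requires `artist` and `track`), a query that carries only an ISRC and strings skips the first step and merges the remaining two.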

Key Components

Align class (linking.py): Main entry point. Orchestrates all service queries and exposes unified getter methods.

Service-specific aligners:

MusicBrainzAlign: Queries MusicBrainz by MBID, ISRC, or metadata
DeezerAlign: Searches Deezer with duration-based filtering
YouTubeAlign: Searches YouTube Music by metadata strings

Batch processing:

link_partitions.py: Process directories of JAMS files
JAMSProcessor: Read/write JAMS format with metadata enrichment

Utilities:

MBDownload: Bulk download from MusicBrainz
prepare_dataset.py: Dataset preparation scripts

Workflow Example

Typical usage:

from musicmetalinker.linking import Align

# Initialize with available metadata
linker = Align(
    artist="The Beatles",
    track="Hey Jude",
    album="Hey Jude",
    duration=431
)

# Retrieve enriched metadata
mbid = linker.get_mbid()
isrc = linker.get_isrc()
deezer_id = linker.get_deezer_id()
youtube_link = linker.get_youtube_link()

The Align class handles all service queries internally. Users don't interact with individual service classes directly.

Batch Processing

For dataset-scale operations:

python link_partitions.py /path/to/jams/files --save --limit audio --overwrite

Processes all JAMS files in a directory, enriches metadata, and outputs CSV with consolidated identifiers. Useful for preparing research datasets.
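The batch flow can be approximated with the standard library alone. The sketch below is not `link_partitions.py`: it reads each JAMS file as plain JSON (JAMS stores track information under `file_metadata`), skips the enrichment step entirely, and writes a subset of the CSV columns. The function names are hypothetical.

```python
import csv
import json
from pathlib import Path
from typing import Dict

# Subset of the batch CSV columns, used here for illustration.
COLUMNS = ["jams_file", "track_name", "artist_name", "album_name", "duration"]


def read_jams_metadata(path: Path) -> Dict[str, object]:
    """Extract basic track metadata from one JAMS file."""
    with path.open(encoding="utf-8") as handle:
        meta = json.load(handle).get("file_metadata", {})
    return {
        "jams_file": path.name,
        "track_name": meta.get("title", ""),
        "artist_name": meta.get("artist", ""),
        "album_name": meta.get("release", ""),
        "duration": meta.get("duration", ""),
    }


def jams_dir_to_csv(jams_dir: Path, csv_path: Path) -> int:
    """Write one CSV row per *.jams file and return the number of rows."""
    rows = [read_jams_metadata(p) for p in sorted(jams_dir.glob("*.jams"))]
    with csv_path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=COLUMNS)
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

In a fuller version, the linking call would sit between reading the metadata and writing the row, adding the identifier columns.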

Target Audience

Primary users:

Music information retrieval researchers
Dataset curators
Academic projects requiring standardized music metadata

Not designed for:

Production music applications (pre-release quality)
Real-time streaming services (no rate limiting)
End-user applications (library-only, no GUI)

Development Status

Version 0.0.1 indicates early development. The codebase contains:

Debug print statements in production code
Hardcoded configuration values
Commented-out code sections
No automated tests
No CI/CD pipeline

This is research-quality code, not production-ready software. Suitable for academic exploration and prototyping, but requires significant hardening for production use.

Integration with External Services

MusicBrainz: Open music encyclopedia. No authentication required. Rate limiting recommended but not implemented.

Deezer: Commercial streaming service with public API. No authentication for basic search. More permissive than Spotify for metadata access.

YouTube Music: Unofficial API via ytmusicapi. No authentication. Fragile to YouTube changes.

AcousticBrainz: Audio feature database. Note: AcousticBrainz shut down in 2022. This integration is non-functional.

Spotify: Limited use for ISRC extraction in Billboard dataset cleaning. Requires OAuth2 credentials via external mml_secrets.py file (not in repository).
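Since rate limiting is recommended for MusicBrainz but not implemented, a caller could add a simple client-side throttle. The sketch below is an assumption about how that might be done, not part of the library. The `Throttle` class is hypothetical, and the one-second default reflects MusicBrainz's general guideline of roughly one request per second.

```python
import time
from typing import Callable, Optional


class Throttle:
    """Enforce a minimum interval between consecutive calls (hypothetical helper)."""

    def __init__(
        self,
        min_interval: float = 1.0,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self._min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last: Optional[float] = None

    def wait(self) -> None:
        """Block until at least min_interval has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self._min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling `throttle.wait()` immediately before each outgoing request is enough for sequential code. The injectable `clock` and `sleep` parameters exist only to make the helper testable without real delays.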

Licensing and Reuse

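The MIT license permits use, modification, and distribution, including in commercial projects. No copyleft restrictions.

The library can be freely integrated into commercial or academic projects. The only condition is that the copyright notice (Andrea Poltronieri) and the license text are retained in copies or substantial portions of the software.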

Installation Requirements

Python 3.8 or higher required. No platform-specific dependencies except optional ffmpeg for audio conversion in batch processing.

Installation from GitHub requires git and pip. No binary distributions available.

Configuration

All configuration is hardcoded in source files:

User-Agent: "elka/0.1" (appears to be from a parent project)
API endpoints: Hardcoded URLs
Matching thresholds: Hardcoded in service classes
Spotify credentials: External mml_secrets.py module

No configuration files, environment variables, or runtime configuration options.
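A reimplementation could move these values into runtime configuration. The following sketch reads overrides from environment variables. The variable names (`MML_USER_AGENT` and so on), the `LinkerConfig` class, and the threshold defaults are all invented for illustration. Only the `elka/0.1` User-Agent value comes from the source.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class LinkerConfig:
    """Runtime settings (hypothetical; thresholds are illustrative defaults)."""
    user_agent: str = "elka/0.1"
    max_duration_diff: float = 5.0  # seconds
    min_similarity: float = 0.8


def load_config(env: Mapping[str, str] = os.environ) -> LinkerConfig:
    """Build a config from environment variables, falling back to defaults."""
    defaults = LinkerConfig()
    return LinkerConfig(
        user_agent=env.get("MML_USER_AGENT", defaults.user_agent),
        max_duration_diff=float(
            env.get("MML_MAX_DURATION_DIFF", defaults.max_duration_diff)
        ),
        min_similarity=float(
            env.get("MML_MIN_SIMILARITY", defaults.min_similarity)
        ),
    )
```

Passing the environment as a mapping keeps the loader easy to test and leaves room for a YAML or TOML source later.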

Output Formats

Library mode: Python objects with getter methods

Batch mode: CSV with columns:

jams_file
track_name, artist_name, album_name
track_number, duration, release_year
musicbrainz, isrc
deezer_id, deezer_url
youtube_url
acousticbrainz
spotify_id

JAMS files can be enriched in place with new identifiers added to the identifiers section.
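In-place enrichment can be sketched as follows. This is not the library's `JAMSProcessor`: it treats the JAMS file as plain JSON instead of using the `jams` package, and the `add_identifiers` function and its `overwrite` flag are hypothetical. It relies only on the JAMS convention of storing identifiers under `file_metadata.identifiers`.

```python
import json
from pathlib import Path
from typing import Dict, Optional


def add_identifiers(
    jams_path: Path,
    new_ids: Dict[str, Optional[str]],
    overwrite: bool = False,
) -> Dict[str, str]:
    """Merge identifiers into a JAMS file and rewrite it in place."""
    with jams_path.open(encoding="utf-8") as handle:
        doc = json.load(handle)

    # Create the nested sections if the file does not have them yet.
    identifiers = doc.setdefault("file_metadata", {}).setdefault("identifiers", {})
    for key, value in new_ids.items():
        if value is None:
            continue  # a failed lookup contributes nothing
        if overwrite or key not in identifiers:
            identifiers[key] = value

    with jams_path.open("w", encoding="utf-8") as handle:
        json.dump(doc, handle, indent=2)
    return identifiers
```

By default existing identifiers are preserved, so a curated value is never replaced by a less reliable lookup unless `overwrite=True` is passed.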

Performance Characteristics

No performance benchmarks provided. Expected bottlenecks:

Network latency for API calls
Sequential service queries (no parallelization)
No caching of results

Batch processing includes progress bars via tqdm but no performance optimization.
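The missing result cache is the easiest of these bottlenecks to address. Below is a minimal in-memory sketch. The `CachedLookup` wrapper is hypothetical, and it assumes a lookup callable that takes an artist and a track string.

```python
from typing import Callable, Dict, Optional, Tuple


class CachedLookup:
    """Memoize a (artist, track) lookup for the lifetime of the process."""

    def __init__(self, lookup: Callable[[str, str], Optional[str]]) -> None:
        self._lookup = lookup
        self._cache: Dict[Tuple[str, str], Optional[str]] = {}
        self.misses = 0  # number of calls that actually hit the service

    def __call__(self, artist: str, track: str) -> Optional[str]:
        # Normalize the key so trivially different spellings share an entry.
        key = (artist.casefold().strip(), track.casefold().strip())
        if key not in self._cache:
            self.misses += 1
            self._cache[key] = self._lookup(artist, track)
        return self._cache[key]
```

Negative results (`None`) are cached as well, which avoids repeating queries that are known to fail within a batch run. A persistent store would be needed to keep results across runs.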

Error Handling

Errors are silently suppressed. Failed queries return None. No exceptions propagate to callers.

This makes the library robust to individual service failures but provides no visibility into what went wrong: a caller cannot distinguish "no match found" from "request failed". Debugging requires reading the debug print output or adding further print statements.
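A reimplementation could keep the "return None on failure" behavior while still recording what happened. The sketch below uses the standard `logging` module. The `safe_query` helper and the logger name are hypothetical.

```python
import logging
from typing import Callable, Optional, TypeVar

T = TypeVar("T")
logger = logging.getLogger("linker_sketch")  # hypothetical logger name


def safe_query(service: str, query: Callable[[], T]) -> Optional[T]:
    """Run a service query; log any failure and return None instead of raising."""
    try:
        return query()
    except Exception as exc:  # broad on purpose: one bad service must not stop the run
        logger.warning("%s lookup failed: %s", service, exc)
        return None
```

Callers see the same interface as before, but failures now appear in whatever log handler the application configures.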

Maintenance Status

Commit activity and maintenance frequency could not be determined from the information available. The repository is public, but its development status is unclear.

AcousticBrainz integration is broken (service discontinued). No indication this has been addressed.

Relevance Assessment

Conceptual value: High. The cascading fallback pattern and multi-service aggregation approach are sound architectural patterns for entity linking.

Implementation value: Low. Pre-release quality, broken integrations, no tests, hardcoded configuration.

Reuse recommendation: Study the pattern, don't adopt the code. Reimplement the concept with proper error handling, configuration management, and test coverage.
