# MusicMetaLinker Overview

## Project Identity

- **Name:** MusicMetaLinker
- **Version:** 0.0.1 (pre-release)
- **Language:** Python 3.8+
- **License:** MIT
- **Type:** Library
- **Repository:** https://github.com/andreamust/MusicMetaLinker
- **Author:** Andrea Poltronieri
- **Installation:** `pip install git+https://github.com/andreamust/MusicMetaLinker.git`

MusicMetaLinker is not available on PyPI. Installation requires direct GitHub access.

## Purpose and Scope

MusicMetaLinker performs entity linking for music tracks. It connects local music metadata to external databases, enriching incomplete or inconsistent information with authoritative data from multiple sources.

The library addresses a common problem in music information retrieval: fragmented metadata across different platforms. A track might have an MBID (MusicBrainz Identifier) in one system, an ISRC (International Standard Recording Code) in another, and only artist/title strings in a third. MusicMetaLinker bridges these gaps by querying multiple services and consolidating results.

Primary use case: academic music research and dataset preparation. The library supports JAMS (JSON Annotated Music Specification), a format common in music information retrieval research.

## Core Functionality

MusicMetaLinker implements a three-step workflow:

1. **Service Selection:** Based on available input identifiers (MBID, ISRC, or metadata strings), the library determines which external services to query and in what order.

2. **Information Retrieval:** Queries are sent to MusicBrainz, Deezer, YouTube Music, and AcousticBrainz one service at a time (there is no parallelization). Each service has specialized search logic.

3. **Filtering and Matching:** Results are filtered by duration, track number, and fuzzy string matching to identify the best match across services.

The library returns enriched metadata including:
- Standardized identifiers (MBID, ISRC, Deezer ID)
- Corrected metadata (artist, album, track name)
- Additional attributes (BPM, release date)
- Direct links to external services

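
The filtering and matching step can be illustrated with a minimal sketch. It assumes candidates arrive as simple records with a title, artist, and duration. The `Candidate` class, the `best_match` function, the 3-second tolerance, and the 0.8 similarity threshold are all hypothetical and not taken from MusicMetaLinker, and the standard library's `difflib` stands in for whatever fuzzy matcher the library actually uses. Track-number filtering is left out for brevity.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Iterable, Optional


@dataclass
class Candidate:
    """Hypothetical search result returned by one service."""
    title: str
    artist: str
    duration: float  # seconds


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio in the range [0, 1]."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def best_match(
    candidates: Iterable[Candidate],
    title: str,
    artist: str,
    duration: Optional[float] = None,
    tolerance: float = 3.0,  # assumed duration tolerance, in seconds
    min_score: float = 0.8,  # assumed similarity threshold
) -> Optional[Candidate]:
    """Return the highest-scoring candidate, or None if nothing qualifies."""
    best: Optional[Candidate] = None
    best_score = 0.0
    for cand in candidates:
        # Duration filter: skip results whose length is too far off.
        if duration is not None and abs(cand.duration - duration) > tolerance:
            continue
        # Fuzzy filter: average the title and artist similarity.
        score = (similarity(cand.title, title) + similarity(cand.artist, artist)) / 2
        if score >= min_score and (best is None or score > best_score):
            best, best_score = cand, score
    return best
```

Filtering on duration before scoring strings keeps live versions and edits with near-identical titles out of the comparison.
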
## Dependencies

Core dependencies:

- **musicbrainzngs:** MusicBrainz API client
- **deezer-python:** Deezer API wrapper
- **ytmusicapi:** YouTube Music unofficial API
- **spotipy:** Spotify API client (limited use)
- **requests:** HTTP client for AcousticBrainz
- **tqdm:** Progress bars for batch processing
- **jams:** JAMS format support
- **pandas:** CSV output for batch processing
- **cryptography:** Required by spotipy

All dependencies are third-party packages installable with pip. No exotic or unmaintained libraries.

## Architecture Pattern

MusicMetaLinker uses a cascading fallback pattern:

1. If MBID is provided, query MusicBrainz first (authoritative source)
2. If ISRC is available, try Deezer (commercial database with ISRCs)
3. Fall back to metadata string search across all services
4. Aggregate results, preferring more authoritative sources

This pattern ensures maximum coverage while respecting data quality hierarchies. MusicBrainz is treated as ground truth when available.

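
The cascade above can be sketched in a few lines. This is an illustration of the pattern under stated assumptions, not MusicMetaLinker's code: the `cascade` function, the `(source, required_key, lookup)` step format, and the dictionary-based records are all hypothetical.

```python
from typing import Callable, Dict, List, Optional, Tuple

Record = Dict[str, str]
# A lookup takes the query and returns a record, or None when nothing is found.
Lookup = Callable[[Record], Optional[Record]]
# (source name, query key the step needs or None, lookup function)
Step = Tuple[str, Optional[str], Lookup]


def cascade(query: Record, steps: List[Step]) -> Record:
    """Run lookups in order of authority and merge their results.

    Steps must be ordered from most to least authoritative. A field set by
    an earlier step is never overwritten by a later one.
    """
    merged: Record = {}
    for _source, required_key, lookup in steps:
        # Skip steps whose required identifier is missing from the query.
        if required_key is not None and not query.get(required_key):
            continue
        result = lookup(query)
        if not result:
            continue
        for field, value in result.items():
            merged.setdefault(field, value)
    return merged
```

Because `setdefault` keeps the first value seen, placing the MusicBrainz step first is what makes it act as ground truth in this sketch.
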
## Key Components

**Align class (linking.py):** Main entry point. Orchestrates all service queries and exposes unified getter methods.

**Service-specific aligners:**
- MusicBrainzAlign: Queries MusicBrainz by MBID, ISRC, or metadata
- DeezerAlign: Searches Deezer with duration-based filtering
- YouTubeAlign: Searches YouTube Music by metadata strings

**Batch processing:**
- link_partitions.py: Process directories of JAMS files
- JAMSProcessor: Read/write JAMS format with metadata enrichment

**Utilities:**
- MBDownload: Bulk download from MusicBrainz
- prepare_dataset.py: Dataset preparation scripts

## Workflow Example

Typical usage:

```python
from musicmetalinker.linking import Align

# Initialize with available metadata
linker = Align(
    artist="The Beatles",
    track="Hey Jude",
    album="Hey Jude",
    duration=431
)

# Retrieve enriched metadata
mbid = linker.get_mbid()
isrc = linker.get_isrc()
deezer_id = linker.get_deezer_id()
youtube_link = linker.get_youtube_link()
```

The Align class handles all service queries internally. Users don't interact with individual service classes directly.

## Batch Processing

For dataset-scale operations:

```bash
python link_partitions.py /path/to/jams/files --save --limit audio --overwrite
```

Processes all JAMS files in a directory, enriches metadata, and outputs CSV with consolidated identifiers. Useful for preparing research datasets.

## Target Audience

Primary users:
- Music information retrieval researchers
- Dataset curators
- Academic projects requiring standardized music metadata

Not designed for:
- Production music applications (pre-release quality)
- Real-time streaming services (no rate limiting)
- End-user applications (library-only, no GUI)

## Development Status

Version 0.0.1 indicates early development. The codebase contains:
- Debug print statements in production code
- Hardcoded configuration values
- Commented-out code sections
- No automated tests
- No CI/CD pipeline

This is research-quality code, not production-ready software. Suitable for academic exploration and prototyping, but requires significant hardening for production use.

## Integration with External Services

**MusicBrainz:** Open music encyclopedia. No authentication required. Rate limiting recommended but not implemented.

**Deezer:** Commercial streaming service with public API. No authentication for basic search. More permissive than Spotify for metadata access.

**YouTube Music:** Unofficial API via ytmusicapi. No authentication. Fragile to YouTube changes.

**AcousticBrainz:** Audio feature database. Note: AcousticBrainz shut down in 2022. This integration is non-functional.

**Spotify:** Limited use for ISRC extraction in Billboard dataset cleaning. Requires OAuth2 credentials via external mml_secrets.py file (not in repository).

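
Since rate limiting is recommended for MusicBrainz but not implemented, a reimplementation might place a small throttle in front of each request. The sketch below is one possible shape, not part of MusicMetaLinker: the `Throttle` class is hypothetical, and the one-second default interval is an assumption that should be checked against the service's current guidelines.

```python
import time
from typing import Callable, Optional


class Throttle:
    """Enforce a minimum interval between consecutive calls (hypothetical helper)."""

    def __init__(
        self,
        min_interval: float = 1.0,  # assumed interval, in seconds
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last: Optional[float] = None

    def wait(self) -> None:
        """Block until at least min_interval has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling `throttle.wait()` immediately before each outgoing request is enough for sequential code. The clock and sleep functions are injectable so the behavior can be tested without real delays.
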
## Licensing and Reuse

The MIT license permits use, modification, and distribution, including in proprietary software. No copyleft restrictions.

The library can be freely integrated into commercial or academic projects. The license requires that the copyright notice and permission notice be retained in copies, which preserves attribution to Andrea Poltronieri.

## Installation Requirements

Python 3.8 or higher required. No platform-specific dependencies except optional ffmpeg for audio conversion in batch processing.

Installation from GitHub requires git and pip. No binary distributions available.

## Configuration

All configuration is hardcoded in source files:
- User-Agent: "elka/0.1" (appears to be from a parent project)
- API endpoints: Hardcoded URLs
- Matching thresholds: Hardcoded in service classes
- Spotify credentials: External mml_secrets.py module

No configuration files, environment variables, or runtime configuration options.

## Output Formats

**Library mode:** Python objects with getter methods

**Batch mode:** CSV with columns:
- jams_file
- track_name, artist_name, album_name
- track_number, duration, release_year
- musicbrainz, isrc
- deezer_id, deezer_url
- youtube_url
- acousticbrainz
- spotify_id

JAMS files can be enriched in place with new identifiers added to the identifiers section.

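
As a rough illustration of the batch CSV layout, the sketch below writes rows with the columns listed above. It uses the standard library's `csv` module rather than pandas to stay self-contained. The column order and the `write_rows` helper are assumptions, not the library's actual output code.

```python
import csv
from typing import Dict, Iterable, Optional, TextIO

# Column names from the list above; the order is an assumption.
COLUMNS = [
    "jams_file",
    "track_name", "artist_name", "album_name",
    "track_number", "duration", "release_year",
    "musicbrainz", "isrc",
    "deezer_id", "deezer_url",
    "youtube_url",
    "acousticbrainz",
    "spotify_id",
]


def write_rows(rows: Iterable[Dict[str, Optional[str]]], handle: TextIO) -> None:
    """Write one CSV line per track; missing or None fields become empty cells."""
    writer = csv.DictWriter(handle, fieldnames=COLUMNS, restval="")
    writer.writeheader()
    for row in rows:
        writer.writerow(row)
```

Leaving unresolved identifiers as empty cells keeps every row the same width, which makes the file easy to load and filter later.
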
## Performance Characteristics

No performance benchmarks provided. Expected bottlenecks:
- Network latency for API calls
- Sequential service queries (no parallelization)
- No caching of results

Batch processing includes progress bars via tqdm but no performance optimization.

## Error Handling

Errors are silently suppressed. Failed queries return None. No exceptions propagate to callers.

This makes the library robust to individual service failures but provides no visibility into what went wrong. Debugging requires examining log files or adding print statements.

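
Because failed queries return None rather than raising, calling code has to check every getter result itself. The sketch below shows one way to do that. `StubLinker` is a stand-in used only so the example runs without network access. It mirrors the getter names from the workflow example, but its return values are placeholders, and `collect_identifiers` is a hypothetical helper.

```python
from typing import Dict, List, Tuple


class StubLinker:
    """Stand-in for Align; values are placeholders, not real identifiers."""

    def get_mbid(self):
        return "placeholder-mbid"

    def get_isrc(self):
        return None  # simulates a failed or empty lookup

    def get_deezer_id(self):
        return "placeholder-deezer-id"

    def get_youtube_link(self):
        return None  # simulates a failed or empty lookup


# Output field name -> getter method name (getter names from the workflow example).
GETTERS = {
    "mbid": "get_mbid",
    "isrc": "get_isrc",
    "deezer_id": "get_deezer_id",
    "youtube_link": "get_youtube_link",
}


def collect_identifiers(linker) -> Tuple[Dict[str, str], List[str]]:
    """Call each getter and separate resolved fields from missing ones."""
    found: Dict[str, str] = {}
    missing: List[str] = []
    for field, getter_name in GETTERS.items():
        value = getattr(linker, getter_name)()
        if value is None:
            missing.append(field)
        else:
            found[field] = value
    return found, missing
```

Tracking the missing fields explicitly restores some of the visibility that silent suppression removes, although it still cannot say why a lookup failed.
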
## Maintenance Status

Last commit activity and maintenance frequency are unknown from the provided information. The repository is public, but its development status is unclear.

AcousticBrainz integration is broken (service discontinued). No indication this has been addressed.

## Relevance Assessment

**Conceptual value:** High. The cascading fallback pattern and multi-service aggregation approach are sound architectural patterns for entity linking.

**Implementation value:** Low. Pre-release quality, broken integrations, no tests, hardcoded configuration.

**Reuse recommendation:** Study the pattern, don't adopt the code. Reimplement the concept with proper error handling, configuration management, and test coverage.