feat: initial implementation of metadata aggregator
- gRPC service with MusicBrainz provider
- PostgreSQL schema with migrations
- Service layer with database-first caching
- Repository pattern for data access
- YAML configuration support
- Research documentation for 17 music metadata projects
@@ -0,0 +1,218 @@
# MusicMetaLinker Overview

## Project Identity

**Name:** MusicMetaLinker
**Version:** 0.0.1 (pre-release)
**Language:** Python 3.8+
**License:** MIT
**Type:** Library
**Repository:** https://github.com/andreamust/MusicMetaLinker
**Author:** Andrea Poltronieri
**Installation:** `pip install git+https://github.com/andreamust/MusicMetaLinker.git`

MusicMetaLinker is not available on PyPI. Installation requires direct GitHub access.

## Purpose and Scope

MusicMetaLinker performs entity linking for music tracks. It connects local music metadata to external databases, enriching incomplete or inconsistent information with authoritative data from multiple sources.

The library addresses a common problem in music information retrieval: metadata fragmented across platforms. A track might have an MBID (MusicBrainz Identifier) in one system, an ISRC (International Standard Recording Code) in another, and only artist/title strings in a third. MusicMetaLinker bridges these gaps by querying multiple services and consolidating the results.

Primary use case: academic music research and dataset preparation. The library supports JAMS (JSON Annotated Music Specification), a format common in music information retrieval research.

## Core Functionality

MusicMetaLinker implements a three-step workflow:

1. **Service Selection:** Based on the available input identifiers (MBID, ISRC, or metadata strings), the library determines which external services to query and in what order.
2. **Information Retrieval:** The library queries MusicBrainz, Deezer, YouTube Music, and AcousticBrainz one after another. Each service has its own specialized search logic.
3. **Filtering and Matching:** Results are filtered by duration, track number, and fuzzy string matching to identify the best match across services.

The library returns enriched metadata including:
- Standardized identifiers (MBID, ISRC, Deezer ID)
- Corrected metadata (artist, album, track name)
- Additional attributes (BPM, release date)
- Direct links to external services

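The filtering and matching step can be pictured with a small sketch. This is not MusicMetaLinker's code: the `Candidate` type, the `best_match` function, the 5-second duration tolerance, and the 0.8 similarity threshold are assumptions made for illustration, and the standard library's `difflib` stands in for whatever fuzzy matcher the library actually uses.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import List, Optional


@dataclass
class Candidate:
    """Hypothetical search result returned by one external service."""
    title: str
    artist: str
    duration: float  # seconds


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def best_match(
    title: str,
    artist: str,
    duration: float,
    candidates: List[Candidate],
    max_duration_diff: float = 5.0,  # assumed tolerance, in seconds
    min_score: float = 0.8,          # assumed similarity threshold
) -> Optional[Candidate]:
    """Drop candidates whose duration is off, then keep the closest strings."""
    best: Optional[Candidate] = None
    best_score = min_score
    for cand in candidates:
        if abs(cand.duration - duration) > max_duration_diff:
            continue  # duration filter
        score = (similarity(title, cand.title) + similarity(artist, cand.artist)) / 2
        if score >= best_score:
            best, best_score = cand, score
    return best


candidates = [
    Candidate("Hey Jude (Live)", "Paul McCartney", 480.0),
    Candidate("Hey Jude", "The Beatles", 429.0),
]
match = best_match("hey jude", "the beatles", 431.0, candidates)
```

Filtering on duration before comparing strings keeps live versions and remixes, which often share a title with the studio recording, from winning on string similarity alone.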
## Dependencies

Core dependencies:

- **musicbrainzngs:** MusicBrainz API client
- **deezer-python:** Deezer API wrapper
- **ytmusicapi:** YouTube Music unofficial API
- **spotipy:** Spotify API client (limited use)
- **requests:** HTTP client for AcousticBrainz
- **tqdm:** Progress bars for batch processing
- **jams:** JAMS format support
- **pandas:** CSV output for batch processing
- **cryptography:** Required by spotipy

All dependencies are widely used third-party Python packages. None are exotic or unmaintained.

## Architecture Pattern

MusicMetaLinker uses a cascading fallback pattern:

1. If MBID is provided, query MusicBrainz first (authoritative source)
2. If ISRC is available, try Deezer (commercial database with ISRCs)
3. Fall back to metadata string search across all services
4. Aggregate results, preferring more authoritative sources

This pattern aims for broad coverage while respecting the quality hierarchy of the data sources. MusicBrainz is treated as ground truth when available.

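The cascade can be sketched in a few lines. The code below illustrates the pattern only, not the library's implementation: `link_track`, the `Lookup` type, the stub lookups, and every identifier value are hypothetical. One simplifying assumption is that each step runs whenever its input field is known, and a field set by an earlier step is never overwritten by a later one.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Each lookup takes the known fields and returns a result dict or None.
Lookup = Callable[[Dict[str, str]], Optional[Dict[str, str]]]


def link_track(
    known: Dict[str, str],
    steps: List[Tuple[str, Lookup]],
) -> Dict[str, str]:
    """Run lookups from most to least authoritative and merge the results.

    `steps` pairs the input field a lookup needs with the lookup itself.
    A field set by an earlier (more authoritative) step is never overwritten.
    """
    merged = dict(known)
    for required_field, lookup in steps:
        if required_field not in merged:
            continue  # identifier not available, fall through to the next step
        result = lookup(merged)
        if result:
            for key, value in result.items():
                merged.setdefault(key, value)
    return merged


# Stub lookups standing in for real service clients (placeholder data).
def by_mbid(fields: Dict[str, str]) -> Optional[Dict[str, str]]:
    return {"title": "Hey Jude", "isrc": "isrc-placeholder"}


def by_isrc(fields: Dict[str, str]) -> Optional[Dict[str, str]]:
    return {"deezer_id": "12345", "title": "Hey Jude - Remastered"}


def by_strings(fields: Dict[str, str]) -> Optional[Dict[str, str]]:
    return {"youtube_url": "https://example.com/watch"}


steps = [("mbid", by_mbid), ("isrc", by_isrc), ("artist", by_strings)]
result = link_track({"artist": "The Beatles", "mbid": "mbid-placeholder"}, steps)
```

Because the MBID step supplies an ISRC, the ISRC step can run even though the caller never provided one, and the title from the first step survives the conflicting title returned by the second.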
## Key Components

**Align class (linking.py):** Main entry point. Orchestrates all service queries and exposes unified getter methods.

**Service-specific aligners:**
- MusicBrainzAlign: Queries MusicBrainz by MBID, ISRC, or metadata
- DeezerAlign: Searches Deezer with duration-based filtering
- YouTubeAlign: Searches YouTube Music by metadata strings

**Batch processing:**
- link_partitions.py: Process directories of JAMS files
- JAMSProcessor: Read/write JAMS format with metadata enrichment

**Utilities:**
- MBDownload: Bulk download from MusicBrainz
- prepare_dataset.py: Dataset preparation scripts

## Workflow Example

Typical usage:

```python
from musicmetalinker.linking import Align

# Initialize with available metadata
linker = Align(
    artist="The Beatles",
    track="Hey Jude",
    album="Hey Jude",
    duration=431
)

# Retrieve enriched metadata
mbid = linker.get_mbid()
isrc = linker.get_isrc()
deezer_id = linker.get_deezer_id()
youtube_link = linker.get_youtube_link()
```

The Align class handles all service queries internally. Users don't interact with individual service classes directly.

## Batch Processing

For dataset-scale operations:

```bash
python link_partitions.py /path/to/jams/files --save --limit audio --overwrite
```

This command processes all JAMS files in the directory, enriches their metadata, and writes a CSV of consolidated identifiers. It is useful for preparing research datasets.

## Target Audience

Primary users:
- Music information retrieval researchers
- Dataset curators
- Academic projects requiring standardized music metadata

Not designed for:
- Production music applications (pre-release quality)
- Real-time streaming services (no rate limiting)
- End-user applications (library-only, no GUI)

## Development Status

Version 0.0.1 indicates early development. The codebase contains:
- Debug print statements in production code
- Hardcoded configuration values
- Commented-out code sections
- No automated tests
- No CI/CD pipeline

This is research-quality code, not production-ready software. It is suitable for academic exploration and prototyping but requires significant hardening before production use.

## Integration with External Services

**MusicBrainz:** Open music encyclopedia. No authentication required. Rate limiting recommended but not implemented.

**Deezer:** Commercial streaming service with public API. No authentication for basic search. More permissive than Spotify for metadata access.

**YouTube Music:** Unofficial API via ytmusicapi. No authentication. Fragile to YouTube changes.

**AcousticBrainz:** Audio feature database. Note: the AcousticBrainz project was discontinued in 2022 and no longer accepts submissions. This integration is non-functional.

**Spotify:** Limited use for ISRC extraction in Billboard dataset cleaning. Requires OAuth2 credentials via external mml_secrets.py file (not in repository).

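Since rate limiting is noted as missing, here is a minimal sketch of a client-side throttle that could wrap calls to a service such as MusicBrainz, whose API guidelines ask clients to stay at roughly one request per second. The `Throttle` class is hypothetical and not part of MusicMetaLinker. The clock and sleep functions are injectable only so the behaviour can be checked without real waiting.

```python
import time
from typing import Callable, Optional


class Throttle:
    """Block so that calls are spaced at least `min_interval` seconds apart."""

    def __init__(
        self,
        min_interval: float = 1.0,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last_call: Optional[float] = None  # time of the previous call

    def wait(self) -> float:
        """Sleep if needed and return the number of seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last_call is not None:
            delay = max(0.0, self.min_interval - (now - self._last_call))
        if delay > 0:
            self._sleep(delay)
        self._last_call = now + delay
        return delay


throttle = Throttle(min_interval=1.0)
throttle.wait()  # first call returns immediately
# ... issue the request here, then call throttle.wait() before the next one ...
```

Calling `wait()` before every outgoing request keeps a batch run within the interval without changing any of the query logic.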
## Licensing and Reuse

The MIT license permits use, modification, and distribution. It imposes no copyleft restrictions.

The library can be freely integrated into commercial or academic projects, provided the copyright notice and license text crediting Andrea Poltronieri are retained.

## Installation Requirements

Python 3.8 or higher is required. There are no platform-specific dependencies apart from optional ffmpeg for audio conversion in batch processing.

Installation from GitHub requires git and pip. No binary distributions are available.

## Configuration

All configuration is hardcoded in source files:
- User-Agent: "elka/0.1" (appears to be from a parent project)
- API endpoints: Hardcoded URLs
- Matching thresholds: Hardcoded in service classes
- Spotify credentials: External mml_secrets.py module

No configuration files, environment variables, or runtime configuration options.

## Output Formats

**Library mode:** Python objects with getter methods

**Batch mode:** CSV with columns:
- jams_file
- track_name, artist_name, album_name
- track_number, duration, release_year
- musicbrainz, isrc
- deezer_id, deezer_url
- youtube_url
- acousticbrainz
- spotify_id

JAMS files can be enriched in place with new identifiers added to the identifiers section.

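As a rough illustration of working with the batch CSV, the sketch below reports which identifier columns came back empty for each file. The column names are taken from the list above, but the `missing_ids` helper and the sample rows are placeholders, and the real file may quote or order its columns differently.

```python
import csv
import io
from typing import Dict, Iterable, List

ID_COLUMNS = ["musicbrainz", "isrc", "deezer_id", "youtube_url", "spotify_id"]


def missing_ids(rows: Iterable[Dict[str, str]]) -> Dict[str, List[str]]:
    """Map each JAMS file to the identifier columns that came back empty."""
    report: Dict[str, List[str]] = {}
    for row in rows:
        empty = [col for col in ID_COLUMNS if not (row.get(col) or "").strip()]
        if empty:
            report[row["jams_file"]] = empty
    return report


# Placeholder rows; a real run would open the CSV written by the batch script.
sample = io.StringIO(
    "jams_file,track_name,musicbrainz,isrc,deezer_id,youtube_url,spotify_id\n"
    "a.jams,Track A,mbid-a,isrc-a,1,https://example.com/a,sp-a\n"
    "b.jams,Track B,,isrc-b,,https://example.com/b,\n"
)
report = missing_ids(csv.DictReader(sample))
```

A report like this gives a quick measure of linking coverage per service before the enriched dataset is used downstream.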
## Performance Characteristics

No performance benchmarks provided. Expected bottlenecks:
- Network latency for API calls
- Sequential service queries (no parallelization)
- No caching of results

Batch processing includes progress bars via tqdm but no performance optimization.

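One inexpensive way to address the missing cache is in-process memoization. The sketch below uses `functools.lru_cache` from the standard library. `fetch_isrc_from_service` is a hypothetical stand-in for a network call, and the call counter exists only to make the effect visible.

```python
from functools import lru_cache
from typing import Optional

calls = 0  # counts how often the (simulated) network is hit


def fetch_isrc_from_service(mbid: str) -> Optional[str]:
    """Hypothetical stand-in for a slow network request."""
    global calls
    calls += 1
    return f"isrc-for-{mbid}"


@lru_cache(maxsize=4096)
def cached_isrc(mbid: str) -> Optional[str]:
    """Memoize lookups so repeated identifiers cost one request each."""
    return fetch_isrc_from_service(mbid)


# Five lookups, but only two distinct identifiers.
for mbid in ["a", "b", "a", "a", "b"]:
    cached_isrc(mbid)
```

This cache lives only for the lifetime of the process. A reimplementation that needs results to survive restarts would have to persist them elsewhere.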
## Error Handling

Errors are silently suppressed. Failed queries return None. No exceptions propagate to callers.

This makes the library robust to individual service failures but provides no visibility into what went wrong. Debugging requires examining log files or adding print statements.

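A reimplementation could keep the "return None on failure" contract while still recording the cause. The following is a minimal sketch under that assumption. `safe_query` and the logger name are hypothetical and do not appear in MusicMetaLinker.

```python
import logging
from typing import Callable, Optional, TypeVar

T = TypeVar("T")
logger = logging.getLogger("linker")


def safe_query(service: str, query: Callable[[], T]) -> Optional[T]:
    """Run one service query; log any failure and return None instead."""
    try:
        return query()
    except Exception:
        # Same outcome for the caller (None), but the cause is recorded
        # with a full traceback instead of being silently dropped.
        logger.exception("query to %s failed", service)
        return None


def failing_query() -> str:
    raise TimeoutError("simulated timeout")


ok = safe_query("deezer", lambda: "12345")
failed = safe_query("musicbrainz", failing_query)
```

Callers still see `None` for a failed service, so the fallback cascade is unaffected, but the log now shows which service failed and why.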
## Maintenance Status

Last commit activity and maintenance frequency could not be determined from the available information. The repository is public, but its development status is unclear.

AcousticBrainz integration is broken (service discontinued). There is no indication that this has been addressed.

## Relevance Assessment

**Conceptual value:** High. The cascading fallback pattern and multi-service aggregation approach are sound architectural patterns for entity linking.

**Implementation value:** Low. Pre-release quality, broken integrations, no tests, hardcoded configuration.

**Reuse recommendation:** Study the pattern, don't adopt the code. Reimplement the concept with proper error handling, configuration management, and test coverage.
