feat: initial implementation of metadata aggregator

- gRPC service with MusicBrainz provider
- PostgreSQL schema with migrations
- Service layer with database-first caching
- Repository pattern for data access
- YAML configuration support
- Research documentation for 17 music metadata projects
2026-04-28 16:27:14 +02:00
commit a1f6701bac
163 changed files with 95884 additions and 0 deletions
@@ -0,0 +1,218 @@
+# MusicMetaLinker Overview
+
+## Project Identity
+
+**Name:** MusicMetaLinker  
+**Version:** 0.0.1 (pre-release)  
+**Language:** Python 3.8+  
+**License:** MIT  
+**Type:** Library  
+**Repository:** https://github.com/andreamust/MusicMetaLinker  
+**Author:** Andrea Poltronieri  
+**Installation:** `pip install git+https://github.com/andreamust/MusicMetaLinker.git`
+
+MusicMetaLinker is not available on PyPI. Installation requires direct GitHub access.
+
+## Purpose and Scope
+
+MusicMetaLinker performs entity linking for music tracks. It connects local music metadata to external databases, enriching incomplete or inconsistent information with authoritative data from multiple sources.
+
+The library addresses a common problem in music information retrieval: fragmented metadata across different platforms. A track might have an MBID in one system, an ISRC in another, and only artist/title strings in a third. MusicMetaLinker bridges these gaps by querying multiple services and consolidating results.
+
+Primary use case: academic music research and dataset preparation. The library supports JAMS (JSON Annotated Music Specification), a format common in music information retrieval research.
+
+## Core Functionality
+
+MusicMetaLinker implements a three-step workflow:
+
+1. **Service Selection:** Based on available input identifiers (MBID, ISRC, or metadata strings), the library determines which external services to query and in what order.
+
+2. **Information Retrieval:** Sequential queries to MusicBrainz, Deezer, and YouTube Music, plus AcousticBrainz (now discontinued). Each service has specialized search logic.
+
+3. **Filtering and Matching:** Results are filtered by duration, track number, and fuzzy string matching to identify the best match across services.
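+
+The filtering step can be sketched with the standard library alone. This is a minimal illustration, not MusicMetaLinker's actual code: the dict shape (`artist`, `title`, `duration` in seconds), the function names, and both thresholds are assumptions chosen for the example.
+
+```python
+from difflib import SequenceMatcher
+
+
+def similarity(a: str, b: str) -> float:
+    """Case-insensitive similarity ratio between 0.0 and 1.0."""
+    return SequenceMatcher(None, a.casefold(), b.casefold()).ratio()
+
+
+def best_match(query, candidates, max_duration_diff=5.0, min_score=0.8):
+    """Return the best candidate dict, or None if nothing passes the filters.
+
+    The thresholds are hypothetical defaults, not values from the library.
+    """
+    best, best_score = None, min_score
+    for cand in candidates:
+        # Filter 1: drop candidates whose duration is too far from the query.
+        if abs(cand["duration"] - query["duration"]) > max_duration_diff:
+            continue
+        # Filter 2: average the fuzzy scores of artist and title.
+        score = (
+            similarity(cand["artist"], query["artist"])
+            + similarity(cand["title"], query["title"])
+        ) / 2
+        if score >= best_score:
+            best, best_score = cand, score
+    return best
+```
+
+Checking duration first is cheap and removes live versions and edits before any string comparison runs.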
+
+The library returns enriched metadata including:
+- Standardized identifiers (MBID, ISRC, Deezer ID)
+- Corrected metadata (artist, album, track name)
+- Additional attributes (BPM, release date)
+- Direct links to external services
+
+## Dependencies
+
+Core dependencies:
+
+- **musicbrainzngs:** MusicBrainz API client
+- **deezer-python:** Deezer API wrapper
+- **ytmusicapi:** YouTube Music unofficial API
+- **spotipy:** Spotify API client (limited use)
+- **requests:** HTTP client for AcousticBrainz
+- **tqdm:** Progress bars for batch processing
+- **jams:** JAMS format support
+- **pandas:** CSV output for batch processing
+- **cryptography:** Required by spotipy
+
+All dependencies are widely used third-party packages available on PyPI. No exotic or unmaintained libraries.
+
+## Architecture Pattern
+
+MusicMetaLinker uses a cascading fallback pattern:
+
+1. If MBID is provided, query MusicBrainz first (authoritative source)
+2. If ISRC is available, try Deezer (commercial database with ISRCs)
+3. Fall back to metadata string search across all services
+4. Aggregate results, preferring more authoritative sources
+
+This pattern ensures maximum coverage while respecting data quality hierarchies. MusicBrainz is treated as ground truth when available.
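+
+A rough sketch of the cascade described above, assuming each service is wrapped in a plain function that takes the known fields and returns a dict (or `None`). The `link_track` name, the `steps` structure, and the forwarding of newly found identifiers to later steps are all assumptions for illustration, not the library's API.
+
+```python
+from typing import Callable, Dict, List, Optional, Tuple
+
+Fields = Dict[str, str]
+Lookup = Callable[[Fields], Optional[Fields]]
+
+
+def link_track(track: Fields, steps: List[Tuple[Optional[str], Lookup]]) -> Fields:
+    """Run lookups from most to least authoritative and merge the results.
+
+    Each step is (required_key, lookup). A step runs only when required_key
+    is already known; None marks a fallback step that always runs.
+    """
+    enriched: Fields = {}
+    for required_key, lookup in steps:
+        # Identifiers found by earlier steps are visible to later ones.
+        known = {**track, **enriched}
+        if required_key is not None and not known.get(required_key):
+            continue
+        for field, value in (lookup(known) or {}).items():
+            if value:
+                # Earlier (more authoritative) sources win on conflicts.
+                enriched.setdefault(field, value)
+    return enriched
+```
+
+With steps ordered as `[("mbid", ...), ("isrc", ...), (None, ...)]`, a field returned by the MBID lookup is never overwritten by a later, less authoritative source.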
+
+## Key Components
+
+**Align class (linking.py):** Main entry point. Orchestrates all service queries and exposes unified getter methods.
+
+**Service-specific aligners:**
+- MusicBrainzAlign: Queries MusicBrainz by MBID, ISRC, or metadata
+- DeezerAlign: Searches Deezer with duration-based filtering
+- YouTubeAlign: Searches YouTube Music by metadata strings
+
+**Batch processing:**
+- link_partitions.py: Process directories of JAMS files
+- JAMSProcessor: Read/write JAMS format with metadata enrichment
+
+**Utilities:**
+- MBDownload: Bulk download from MusicBrainz
+- prepare_dataset.py: Dataset preparation scripts
+
+## Workflow Example
+
+Typical usage:
+
+```python
+from musicmetalinker.linking import Align
+
+# Initialize with available metadata
+linker = Align(
+    artist="The Beatles",
+    track="Hey Jude",
+    album="Hey Jude",
+    duration=431  # seconds (7:11)
+)
+
+# Retrieve enriched metadata
+mbid = linker.get_mbid()
+isrc = linker.get_isrc()
+deezer_id = linker.get_deezer_id()
+youtube_link = linker.get_youtube_link()
+```
+
+The Align class handles all service queries internally. Users don't interact with individual service classes directly.
+
+## Batch Processing
+
+For dataset-scale operations:
+
+```bash
+python link_partitions.py /path/to/jams/files --save --limit audio --overwrite
+```
+
+Processes all JAMS files in a directory, enriches metadata, and outputs CSV with consolidated identifiers. Useful for preparing research datasets.
+
+## Target Audience
+
+Primary users:
+- Music information retrieval researchers
+- Dataset curators
+- Academic projects requiring standardized music metadata
+
+Not designed for:
+- Production music applications (pre-release quality)
+- Real-time streaming services (no rate limiting)
+- End-user applications (library-only, no GUI)
+
+## Development Status
+
+Version 0.0.1 indicates early development. The codebase contains:
+- Debug print statements in production code
+- Hardcoded configuration values
+- Commented-out code sections
+- No automated tests
+- No CI/CD pipeline
+
+This is research-quality code, not production-ready software. Suitable for academic exploration and prototyping, but requires significant hardening for production use.
+
+## Integration with External Services
+
+**MusicBrainz:** Open music encyclopedia. No authentication required. Rate limiting recommended but not implemented.
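+
+MusicBrainz asks API clients to average roughly one request per second. A reimplementation could enforce that with a small throttle like the one below. This is a generic sketch using only the standard library; the `Throttle` class is hypothetical and not part of MusicMetaLinker or any MusicBrainz client.
+
+```python
+import time
+
+
+class Throttle:
+    """Enforce a minimum interval between successive calls to wait()."""
+
+    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
+        self.min_interval = min_interval
+        self._clock = clock  # injectable so the class can be tested without real delays
+        self._sleep = sleep
+        self._last = None  # time of the previous call, None before the first one
+
+    def wait(self):
+        """Block until at least min_interval has passed since the last call."""
+        now = self._clock()
+        if self._last is not None:
+            remaining = self.min_interval - (now - self._last)
+            if remaining > 0:
+                self._sleep(remaining)
+                now = self._clock()
+        self._last = now
+```
+
+Calling `throttle.wait()` immediately before each outgoing request keeps a batch job within the limit without changing the lookup code itself.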
+
+**Deezer:** Commercial streaming service with public API. No authentication for basic search. More permissive than Spotify for metadata access.
+
+**YouTube Music:** Unofficial API via ytmusicapi. No authentication. Fragile to YouTube changes.
+
+**AcousticBrainz:** Audio feature database. Note: AcousticBrainz shut down in 2022. This integration is non-functional.
+
+**Spotify:** Limited use for ISRC extraction in Billboard dataset cleaning. Requires OAuth2 credentials via external mml_secrets.py file (not in repository).
+
+## Licensing and Reuse
+
+MIT license permits use, modification, and distribution, including commercial use. No copyleft restrictions.
+
+The library can be freely integrated into commercial or academic projects. The only condition is that copies retain the copyright and license notice, which credits Andrea Poltronieri.
+
+## Installation Requirements
+
+Python 3.8 or higher required. No platform-specific dependencies except optional ffmpeg for audio conversion in batch processing.
+
+Installation from GitHub requires git and pip. No binary distributions available.
+
+## Configuration
+
+All configuration is hardcoded in source files:
+- User-Agent: "elka/0.1" (appears to be from a parent project)
+- API endpoints: Hardcoded URLs
+- Matching thresholds: Hardcoded in service classes
+- Spotify credentials: External mml_secrets.py module
+
+No configuration files, environment variables, or runtime configuration options.
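+
+For comparison, a reimplementation could move these values into a small settings object that reads environment variables. The sketch below is an assumption about how that might look: the `Settings` fields, the `MML_*` variable names, and the default values are all invented for illustration.
+
+```python
+import os
+from dataclasses import dataclass
+
+
+@dataclass(frozen=True)
+class Settings:
+    # Hypothetical defaults; none of these values come from MusicMetaLinker.
+    user_agent: str = "example-app/0.1"
+    request_interval: float = 1.0  # seconds between requests
+    min_match_score: float = 0.8   # fuzzy-match acceptance threshold
+
+
+def load_settings(env=None) -> Settings:
+    """Build Settings from a mapping (os.environ by default), using defaults for missing keys."""
+    env = os.environ if env is None else env
+    defaults = Settings()
+    return Settings(
+        user_agent=env.get("MML_USER_AGENT", defaults.user_agent),
+        request_interval=float(env.get("MML_REQUEST_INTERVAL", defaults.request_interval)),
+        min_match_score=float(env.get("MML_MIN_MATCH_SCORE", defaults.min_match_score)),
+    )
+```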
+
+## Output Formats
+
+**Library mode:** Python objects with getter methods
+
+**Batch mode:** CSV with columns:
+- jams_file
+- track_name, artist_name, album_name
+- track_number, duration, release_year
+- musicbrainz, isrc
+- deezer_id, deezer_url
+- youtube_url
+- acousticbrainz
+- spotify_id
+
+JAMS files can be enriched in place with new identifiers added to the identifiers section.
+
+## Performance Characteristics
+
+No performance benchmarks provided. Expected bottlenecks:
+- Network latency for API calls
+- Sequential service queries (no parallelization)
+- No caching of results
+
+Batch processing includes progress bars via tqdm but no performance optimization.
+
+## Error Handling
+
+Errors are silently suppressed. Failed queries return None. No exceptions propagate to callers.
+
+This makes the library tolerant of individual service failures but provides no visibility into what went wrong: a caller cannot tell "no match found" from "the request failed". Debugging requires examining log files or adding print statements.
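+
+One way a reimplementation could keep the same tolerance to failures while exposing what happened is to return a small result object instead of a bare `None`. The `LookupResult` type and `safe_lookup` helper below are hypothetical names for this sketch, not part of the library.
+
+```python
+from dataclasses import dataclass
+from typing import Any, Callable, Optional
+
+
+@dataclass(frozen=True)
+class LookupResult:
+    value: Optional[str] = None
+    error: Optional[str] = None
+
+    @property
+    def status(self) -> str:
+        """'error' if the call raised, 'found' if it returned a value, else 'not_found'."""
+        if self.error is not None:
+            return "error"
+        return "found" if self.value is not None else "not_found"
+
+
+def safe_lookup(fn: Callable[..., Optional[str]], *args: Any) -> LookupResult:
+    """Call fn(*args) and capture any exception instead of raising or hiding it."""
+    try:
+        return LookupResult(value=fn(*args))
+    except Exception as exc:
+        return LookupResult(error=f"{type(exc).__name__}: {exc}")
+```
+
+Callers that only want the old behaviour can still read `result.value`, which stays `None` for both misses and failures.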
+
+## Maintenance Status
+
+Last commit date and maintenance frequency could not be determined from the material reviewed. The repository is public, but its development status is unclear.
+
+AcousticBrainz integration is broken (service discontinued). No indication this has been addressed.
+
+## Relevance Assessment
+
+**Conceptual value:** High. The cascading fallback pattern and multi-service aggregation approach are sound architectural patterns for entity linking.
+
+**Implementation value:** Low. Pre-release quality, broken integrations, no tests, hardcoded configuration.
+
+**Reuse recommendation:** Study the pattern, don't adopt the code. Reimplement the concept with proper error handling, configuration management, and test coverage.