
# MusicMetaLinker Overview

## Project Identity

**Name:** MusicMetaLinker
**Version:** 0.0.1 (pre-release)
**Language:** Python 3.8+
**License:** MIT
**Type:** Library
**Repository:** https://github.com/andreamust/MusicMetaLinker
**Author:** Andrea Poltronieri
**Installation:** `pip install git+https://github.com/andreamust/MusicMetaLinker.git`

MusicMetaLinker is not available on PyPI. Installation requires direct GitHub access.

## Purpose and Scope

MusicMetaLinker performs entity linking for music tracks. It connects local music metadata to external databases, enriching incomplete or inconsistent information with authoritative data from multiple sources.

The library addresses a common problem in music information retrieval: fragmented metadata across different platforms. A track might have an MBID in one system, an ISRC in another, and only artist/title strings in a third. MusicMetaLinker bridges these gaps by querying multiple services and consolidating results.

Primary use case: academic music research and dataset preparation. The library supports JAMS (JSON Annotated Music Specification), a format common in music information retrieval research.

## Core Functionality

MusicMetaLinker implements a three-step workflow:

1. **Service Selection:** Based on available input identifiers (MBID, ISRC, or metadata strings), the library determines which external services to query and in what order.

2. **Information Retrieval:** The library queries MusicBrainz, Deezer, YouTube Music, and AcousticBrainz one after another (no parallelization). Each service has its own search logic.

3. **Filtering and Matching:** Results are filtered by duration, track number, and fuzzy string matching to identify the best match across services.

The library returns enriched metadata including:
- Standardized identifiers (MBID, ISRC, Deezer ID)
- Corrected metadata (artist, album, track name)
- Additional attributes (BPM, release date)
- Direct links to external services
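
The filtering and matching step can be illustrated with a minimal standard-library sketch. The `Candidate` type, the `best_match` function, the 3-second duration tolerance, and the 0.8 similarity threshold are assumptions chosen for illustration. They do not reflect MusicMetaLinker's actual internals or thresholds, and `difflib` stands in for whatever fuzzy matcher the library uses.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Iterable, Optional


@dataclass
class Candidate:
    """Hypothetical search result returned by an external service."""
    title: str
    artist: str
    duration: float  # seconds


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def best_match(
    candidates: Iterable[Candidate],
    title: str,
    artist: str,
    duration: Optional[float] = None,
    max_duration_gap: float = 3.0,  # assumed tolerance, in seconds
    min_similarity: float = 0.8,    # assumed acceptance threshold
) -> Optional[Candidate]:
    """Return the candidate with the highest combined similarity, or None."""
    best, best_score = None, -1.0
    for cand in candidates:
        # Duration filter: skip candidates whose length differs too much.
        if duration is not None and abs(cand.duration - duration) > max_duration_gap:
            continue
        # Score is the mean of title and artist similarity.
        score = (similarity(cand.title, title) + similarity(cand.artist, artist)) / 2
        if score >= min_similarity and score > best_score:
            best, best_score = cand, score
    return best
```

Filtering on duration before scoring strings keeps live versions and remixes with matching titles out of the comparison.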

## Dependencies

Core dependencies:

- **musicbrainzngs:** MusicBrainz API client
- **deezer-python:** Deezer API wrapper
- **ytmusicapi:** YouTube Music unofficial API
- **spotipy:** Spotify API client (limited use)
- **requests:** HTTP client for AcousticBrainz
- **tqdm:** Progress bars for batch processing
- **jams:** JAMS format support
- **pandas:** CSV output for batch processing
- **cryptography:** Required by spotipy

All dependencies are standard Python packages. No exotic or unmaintained libraries.

## Architecture Pattern

MusicMetaLinker uses a cascading fallback pattern:

1. If MBID is provided, query MusicBrainz first (authoritative source)
2. If ISRC is available, try Deezer (commercial database with ISRCs)
3. Fall back to metadata string search across all services
4. Aggregate results, preferring more authoritative sources

This pattern maximizes coverage while respecting a data-quality hierarchy: when a MusicBrainz result is available, it is treated as ground truth.
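
The cascade can be sketched in a few lines. This illustrates the pattern only and is not the library's code: the `resolve` function, the `Resolver` type, and the three stub resolvers with their return values are all hypothetical.

```python
from typing import Callable, Dict, List, Optional, Tuple

# A resolver takes the fields known so far and returns newly found fields.
Resolver = Callable[[Dict[str, str]], Optional[Dict[str, str]]]


def resolve(known: Dict[str, str], cascade: List[Tuple[str, Resolver]]) -> Dict[str, str]:
    """Run resolvers in priority order, keeping the first value found per field.

    Each cascade entry pairs a required input key with a resolver. A resolver
    runs only when its required key is already known. Earlier entries are
    treated as more authoritative, so later results never overwrite them.
    """
    result = dict(known)
    for required_key, resolver in cascade:
        if required_key not in result:
            continue  # this tier does not apply; fall through to the next
        found = resolver(result) or {}
        for field, value in found.items():
            result.setdefault(field, value)
    return result


# Stub resolvers standing in for real service clients (made-up data).
def by_mbid(known: Dict[str, str]) -> Dict[str, str]:
    return {"isrc": "ISRC-FROM-MB", "title": "Hey Jude"}


def by_isrc(known: Dict[str, str]) -> Dict[str, str]:
    return {"deezer_id": "12345", "title": "Hey Jude - Remastered"}


def by_strings(known: Dict[str, str]) -> Dict[str, str]:
    return {"mbid": "mbid-from-search"}


CASCADE = [("mbid", by_mbid), ("isrc", by_isrc), ("title", by_strings)]
```

Calling `resolve({"mbid": "abc"}, CASCADE)` keeps the title from the first tier even though the second tier returns a different one, which is the "prefer more authoritative sources" rule expressed through `setdefault`.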

## Key Components

**Align class (linking.py):** Main entry point. Orchestrates all service queries and exposes unified getter methods.

**Service-specific aligners:**
- MusicBrainzAlign: Queries MusicBrainz by MBID, ISRC, or metadata
- DeezerAlign: Searches Deezer with duration-based filtering
- YouTubeAlign: Searches YouTube Music by metadata strings

**Batch processing:**
- link_partitions.py: Process directories of JAMS files
- JAMSProcessor: Read/write JAMS format with metadata enrichment

**Utilities:**
- MBDownload: Bulk download from MusicBrainz
- prepare_dataset.py: Dataset preparation scripts

## Workflow Example

Typical usage:

```python
from musicmetalinker.linking import Align

# Initialize with available metadata
linker = Align(
    artist="The Beatles",
    track="Hey Jude",
    album="Hey Jude",
    duration=431,  # seconds (7:11)
)

# Retrieve enriched metadata
mbid = linker.get_mbid()
isrc = linker.get_isrc()
deezer_id = linker.get_deezer_id()
youtube_link = linker.get_youtube_link()
```

The Align class handles all service queries internally. Users don't interact with individual service classes directly.

## Batch Processing

For dataset-scale operations:

```bash
python link_partitions.py /path/to/jams/files --save --limit audio --overwrite
```

Processes all JAMS files in a directory, enriches metadata, and outputs CSV with consolidated identifiers. Useful for preparing research datasets.
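
As a rough sketch of what such a batch pass involves (this is not the actual `link_partitions.py`), the loop below reads each JAMS file as plain JSON and writes one CSV row per file. The `link_directory` function, the `linker` callback, and the reduced column set are assumptions. The `file_metadata` keys (`title`, `artist`, `release`, `duration`) follow the JAMS schema.

```python
import csv
import json
from pathlib import Path
from typing import Callable, Dict, Optional

# Hypothetical callback: receives JAMS file_metadata, returns found identifiers.
Linker = Callable[[dict], Dict[str, Optional[str]]]

COLUMNS = ["jams_file", "track_name", "artist_name", "album_name",
           "duration", "musicbrainz", "isrc"]


def link_directory(jams_dir: Path, out_csv: Path, linker: Linker) -> int:
    """Write one CSV row per *.jams file in jams_dir; return the row count."""
    rows = 0
    with out_csv.open("w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=COLUMNS)
        writer.writeheader()
        for path in sorted(jams_dir.glob("*.jams")):
            # JAMS files are JSON; track-level fields live in file_metadata.
            doc = json.loads(path.read_text(encoding="utf-8"))
            meta = doc.get("file_metadata", {})
            ids = linker(meta)
            writer.writerow({
                "jams_file": path.name,
                "track_name": meta.get("title"),
                "artist_name": meta.get("artist"),
                "album_name": meta.get("release"),
                "duration": meta.get("duration"),
                "musicbrainz": ids.get("musicbrainz"),
                "isrc": ids.get("isrc"),
            })
            rows += 1
    return rows
```

Missing values are written as empty cells, so a failed lookup still produces a row and the output stays aligned with the input directory.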

## Target Audience

Primary users:
- Music information retrieval researchers
- Dataset curators
- Academic projects requiring standardized music metadata

Not designed for:
- Production music applications (pre-release quality)
- Real-time streaming services (no rate limiting)
- End-user applications (library-only, no GUI)

## Development Status

Version 0.0.1 indicates early development. The codebase contains:
- Debug print statements in production code
- Hardcoded configuration values
- Commented-out code sections
- No automated tests
- No CI/CD pipeline

This is research-quality code, not production-ready software. Suitable for academic exploration and prototyping, but requires significant hardening for production use.

## Integration with External Services

**MusicBrainz:** Open music encyclopedia. No authentication required. Rate limiting recommended but not implemented.

**Deezer:** Commercial streaming service with public API. No authentication for basic search. More permissive than Spotify for metadata access.

**YouTube Music:** Unofficial API via ytmusicapi. No authentication. Fragile to YouTube changes.

**AcousticBrainz:** Audio feature database. Note: AcousticBrainz shut down in 2022. This integration is non-functional.

**Spotify:** Limited use for ISRC extraction in Billboard dataset cleaning. Requires OAuth2 credentials via external mml_secrets.py file (not in repository).
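
A reimplementation could add client-side rate limiting with a small throttle such as the sketch below. The `Throttle` class is hypothetical and not part of MusicMetaLinker. The one-second default is an assumption based on MusicBrainz's guidance of roughly one request per second per client; other services would need their own intervals.

```python
import time
from typing import Callable, Optional


class Throttle:
    """Enforce a minimum interval between calls (client-side rate limiting)."""

    def __init__(
        self,
        min_interval: float = 1.0,  # assumed default, in seconds
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last_call: Optional[float] = None  # time of the previous call

    def wait(self) -> float:
        """Block until the next call is allowed; return the seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last_call is not None:
            delay = max(0.0, self.min_interval - (now - self._last_call))
        if delay > 0:
            self._sleep(delay)
        self._last_call = now + delay
        return delay
```

Calling `throttle.wait()` immediately before each request to a given service is enough for sequential code. The injectable `clock` and `sleep` exist only so the class can be tested without real delays.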

## Licensing and Reuse

The MIT license permits use, modification, and distribution, including in commercial projects. No copyleft restrictions.

The only condition is that the copyright notice and license text are retained in copies or substantial portions of the software, which preserves attribution to Andrea Poltronieri.

## Installation Requirements

Python 3.8 or higher required. No platform-specific dependencies except optional ffmpeg for audio conversion in batch processing.

Installation from GitHub requires git and pip. No binary distributions available.

## Configuration

All configuration is hardcoded in source files:
- User-Agent: "elka/0.1" (appears to be from a parent project)
- API endpoints: Hardcoded URLs
- Matching thresholds: Hardcoded in service classes
- Spotify credentials: External mml_secrets.py module

No configuration files, environment variables, or runtime configuration options.
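
One way a reimplementation might externalize these values is an environment-backed settings object, sketched below. `LinkerConfig`, its fields, its defaults, and the `LINKER_*` variable names are all hypothetical. `SPOTIPY_CLIENT_ID` and `SPOTIPY_CLIENT_SECRET` are the variable names spotipy itself reads.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class LinkerConfig:
    """Hypothetical runtime settings; none of these exist in MusicMetaLinker."""
    user_agent: str = "my-linker/0.1"
    min_similarity: float = 0.8
    spotify_client_id: Optional[str] = None
    spotify_client_secret: Optional[str] = None

    @classmethod
    def from_env(cls, env: Optional[Mapping[str, str]] = None) -> "LinkerConfig":
        """Build a config from environment variables, falling back to defaults."""
        env = os.environ if env is None else env
        defaults = cls()
        return cls(
            user_agent=env.get("LINKER_USER_AGENT", defaults.user_agent),
            min_similarity=float(
                env.get("LINKER_MIN_SIMILARITY", defaults.min_similarity)
            ),
            spotify_client_id=env.get("SPOTIPY_CLIENT_ID"),
            spotify_client_secret=env.get("SPOTIPY_CLIENT_SECRET"),
        )
```

Passing a mapping to `from_env` instead of relying on `os.environ` keeps the settings testable and removes the need for a secrets module on the import path.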

## Output Formats

**Library mode:** Python objects with getter methods

**Batch mode:** CSV with columns:
- jams_file
- track_name, artist_name, album_name
- track_number, duration, release_year
- musicbrainz, isrc
- deezer_id, deezer_url
- youtube_url
- acousticbrainz
- spotify_id

JAMS files can be enriched in place with new identifiers added to the identifiers section.
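
In-place enrichment might look like the sketch below, which treats the JAMS file as plain JSON rather than going through the `jams` package. The `add_identifiers` function is hypothetical, and the sketch assumes the identifiers live under `file_metadata.identifiers`, as in the JAMS schema.

```python
import json
from pathlib import Path
from typing import Dict, Optional


def add_identifiers(jams_path: Path, new_ids: Dict[str, Optional[str]]) -> Dict[str, str]:
    """Merge new_ids into file_metadata.identifiers and rewrite the file."""
    doc = json.loads(jams_path.read_text(encoding="utf-8"))
    identifiers = doc.setdefault("file_metadata", {}).setdefault("identifiers", {})
    for key, value in new_ids.items():
        if value:  # skip lookups that returned nothing
            identifiers.setdefault(key, value)  # never overwrite existing IDs
    jams_path.write_text(json.dumps(doc, indent=2), encoding="utf-8")
    return identifiers
```

Keeping existing identifiers is a deliberate choice here: a value already present in the file is assumed to be curated, while a freshly linked one may be a false match.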

## Performance Characteristics

No performance benchmarks provided. Expected bottlenecks:
- Network latency for API calls
- Sequential service queries (no parallelization)
- No caching of results

Batch processing includes progress bars via tqdm but no performance optimization.
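
Because the service lookups are network-bound and largely independent, a reimplementation could issue them concurrently. The sketch below uses a thread pool from the standard library. `query_all` and the idea of passing one zero-argument callable per service are assumptions, not something MusicMetaLinker provides.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Optional

Query = Callable[[], Optional[dict]]


def query_all(queries: Dict[str, Query]) -> Dict[str, Optional[dict]]:
    """Run independent service queries concurrently; collect results by name."""
    if not queries:
        return {}  # ThreadPoolExecutor rejects max_workers=0
    with ThreadPoolExecutor(max_workers=len(queries)) as pool:
        futures = {name: pool.submit(fn) for name, fn in queries.items()}
        # result() re-raises any exception from the worker thread.
        return {name: future.result() for name, future in futures.items()}
```

This parallelizes across services for a single track. Per-service rate limits still apply, so parallelizing across tracks against the same service would need additional throttling.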

## Error Handling

Errors are silently suppressed. Failed queries return None. No exceptions propagate to callers.

This makes the library robust to individual service failures but provides no visibility into what went wrong. Debugging requires examining log files or adding print statements.
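
A reimplementation can keep the "failed query returns None" behavior while making failures visible. The sketch below is one minimal way to do that with the standard `logging` module. `query_or_none` and the `"linker"` logger name are hypothetical.

```python
import logging
from typing import Callable, Optional, TypeVar

T = TypeVar("T")
logger = logging.getLogger("linker")  # assumed logger name


def query_or_none(service: str, call: Callable[[], T]) -> Optional[T]:
    """Run one service query; on failure, log the cause and return None."""
    try:
        return call()
    except Exception:
        # logger.exception records the traceback, so the failure stays
        # visible even though the caller only ever sees None.
        logger.exception("query to %s failed", service)
        return None
```

Callers still get the same tolerant interface, but each suppressed error now leaves a log record naming the service and carrying the traceback.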

## Maintenance Status

Commit activity and maintenance frequency could not be determined from the material reviewed. The repository is public, but its development status is unclear.

AcousticBrainz integration is broken (service discontinued). No indication this has been addressed.

## Relevance Assessment

**Conceptual value:** High. The cascading fallback pattern and multi-service aggregation approach are sound architectural patterns for entity linking.

**Implementation value:** Low. Pre-release quality, broken integrations, no tests, hardcoded configuration.

**Reuse recommendation:** Study the pattern, don't adopt the code. Reimplement the concept with proper error handling, configuration management, and test coverage.