MusicMetaLinker Integrations

Integration Architecture

MusicMetaLinker integrates with five external services:

  1. MusicBrainz (open music encyclopedia)
  2. Deezer (commercial streaming service)
  3. YouTube Music (commercial streaming service)
  4. AcousticBrainz (audio analysis database - defunct)
  5. Spotify (commercial streaming service - limited use)

Each integration uses a different library and authentication approach.

MusicBrainz Integration

Library and Authentication

Library: musicbrainzngs (community-maintained Python bindings for the MusicBrainz web service)

Authentication: None required for read-only queries.

User-Agent: Required by MusicBrainz API. Hardcoded as "elka/0.1" (appears to be from parent project, not MusicMetaLinker-specific).

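Rate limiting: MusicBrainz asks clients to stay at or below roughly 1 request/second. MusicMetaLinker adds no throttling of its own; by default, musicbrainzngs limits itself to one request per second unless that is turned off with set_rate_limit.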

API Endpoints

All queries go through musicbrainzngs library, which handles endpoint construction.

Base URL: https://musicbrainz.org/ws/2/

Endpoints used:

  • Recording search: /recording?query=...
  • Recording lookup: /recording/{mbid}
  • ISRC search: /isrc/{isrc}

Query Patterns

By MBID (most reliable):

import musicbrainzngs as mb

mb.set_useragent("elka", "0.1")
result = mb.get_recording_by_id(
    mbid,
    includes=["artists", "releases", "isrcs"]
)

includes parameter: Fetches related entities in single request. Reduces API calls.

By ISRC:

result = mb.get_recordings_by_isrc(
    isrc,
    includes=["artists", "releases", "isrcs"]
)

Returns list of recordings with that ISRC. Multiple recordings possible (different releases, remasters).

By metadata:

query = f'artist:"{artist}" AND recording:"{track}"'
if album:
    query += f' AND release:"{album}"'

result = mb.search_recordings(
    query=query,
    limit=10
)

Lucene query syntax. Quoted strings for exact matching. Returns ranked results.
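
Because the values are interpolated straight into a Lucene query, a title that contains a double quote or a backslash will break the quoted string. A minimal sketch of escaping the values first is shown below; escape_lucene and build_recording_query are hypothetical helper names, not part of MusicMetaLinker or musicbrainzngs.

```python
import re
from typing import Optional

# Characters and operators with special meaning in Lucene query syntax.
_LUCENE_SPECIAL = re.compile(r'([+\-!(){}\[\]^"~*?:\\/]|&&|\|\|)')


def escape_lucene(value: str) -> str:
    """Backslash-escape Lucene special characters in a field value."""
    return _LUCENE_SPECIAL.sub(r"\\\1", value)


def build_recording_query(artist: str, track: str, album: Optional[str] = None) -> str:
    """Build the artist/recording/release query shown above, with escaped values."""
    query = f'artist:"{escape_lucene(artist)}" AND recording:"{escape_lucene(track)}"'
    if album:
        query += f' AND release:"{escape_lucene(album)}"'
    return query
```

The resulting string can be passed as the query argument in the search call above.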

Response Parsing

Recording structure:

{
    "recording": {
        "id": "mbid-uuid",
        "title": "Track Name",
        "length": 123456,  # milliseconds
        "artist-credit": [
            {"artist": {"name": "Artist Name"}}
        ],
        "release-list": [
            {
                "title": "Album Name",
                "date": "2020-01-15",
                "track-list": [
                    {"number": "1"}
                ]
            }
        ],
        "isrc-list": ["GBAYE9200070"]
    }
}

Extraction logic:

  • MBID: recording.id
  • Track: recording.title
  • Artist: recording.artist-credit[0].artist.name (first artist only)
  • Duration: recording.length / 1000 (convert milliseconds to seconds)
  • Album: recording.release-list[0].title (first release only)
  • Release date: recording.release-list[0].date
  • Track number: recording.release-list[0].track-list[0].number
  • ISRC: recording.isrc-list[0] (first ISRC only)

Multiple values: MusicBrainz returns lists for artists, releases, ISRCs. MusicMetaLinker takes first value only. No aggregation or selection logic.
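
As an illustration of the extraction rules above, the following sketch reads the same fields but keeps every artist and ISRC instead of only the first, and tolerates missing keys. The function name and the output keys are assumptions made for this example. It also assumes that "length" may arrive as a string and that artist credits can mix artist entries with plain join phrases, as musicbrainzngs responses commonly do.

```python
from typing import Any, Dict


def extract_recording_fields(result: Dict[str, Any]) -> Dict[str, Any]:
    """Read the fields listed above from a lookup result, tolerating gaps."""
    rec = result.get("recording", {})
    releases = rec.get("release-list") or [{}]
    first_release = releases[0]
    tracks = first_release.get("track-list") or [{}]
    # Keep only real artist entries; skip join phrases such as " feat. ".
    artists = [
        credit["artist"]["name"]
        for credit in rec.get("artist-credit", [])
        if isinstance(credit, dict) and "artist" in credit
    ]
    length = rec.get("length")
    return {
        "mbid": rec.get("id"),
        "track": rec.get("title"),
        "artists": artists,
        # Milliseconds to seconds; None when the recording has no length.
        "duration": int(length) / 1000 if length is not None else None,
        "album": first_release.get("title"),
        "release_date": first_release.get("date"),
        "track_number": tracks[0].get("number"),
        "isrcs": list(rec.get("isrc-list", [])),
    }
```

Album, release date, and track number still come from the first release only; choosing among releases would need a selection rule that the source does not describe.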

Filtering and Matching

Duration filtering:

if duration:
    # "length" may arrive as a string of milliseconds and can be absent
    matches = [r for r in results
               if r.get('length') and abs(int(r['length']) / 1000 - duration) < 5]

±5 second threshold for metadata searches. Hardcoded.

Fuzzy string matching:

Uses difflib.SequenceMatcher for artist/track/album similarity.

from difflib import SequenceMatcher

def similarity(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Match if similarity > 0.8 (80%)

Threshold hardcoded at 0.8. No configuration option.
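
One way to make both thresholds configurable is to pass them in as a small settings object. This is a sketch under assumptions: MatchConfig and is_match are hypothetical names, and the defaults simply mirror the hardcoded values described above.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Optional


@dataclass(frozen=True)
class MatchConfig:
    duration_tolerance: float = 5.0  # seconds
    min_similarity: float = 0.8      # SequenceMatcher ratio


def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def is_match(
    candidate_title: str,
    candidate_seconds: Optional[float],
    wanted_title: str,
    wanted_seconds: Optional[float] = None,
    config: MatchConfig = MatchConfig(),
) -> bool:
    """Apply the duration check (when both durations are known), then the title check."""
    if wanted_seconds is not None and candidate_seconds is not None:
        if abs(candidate_seconds - wanted_seconds) >= config.duration_tolerance:
            return False
    return similarity(candidate_title, wanted_title) > config.min_similarity
```

A caller that needs looser matching could pass MatchConfig(duration_tolerance=10) without touching the matching code.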

Error Handling

Network errors: Caught and suppressed. Returns None.

Invalid MBID: Returns None.

No results: Returns None.

Rate limiting: No handling. If rate limited, returns None.

Integration Strengths

  1. Established library: musicbrainzngs is a widely used, community-maintained client
  2. Rich metadata: Comprehensive music information
  3. No authentication: Easy to use
  4. Includes parameter: Efficient data fetching
  5. Authoritative source: MusicBrainz is ground truth for music metadata

Integration Weaknesses

  1. Hardcoded User-Agent: "elka/0.1" not specific to MusicMetaLinker
  2. No explicit rate limiting: Relies on the musicbrainzngs default throttle; nothing protects against blocking if that is disabled
  3. First-value-only: Ignores multiple artists, releases, ISRCs
  4. Hardcoded thresholds: Duration (5s), similarity (0.8) not configurable
  5. No error visibility: Silent failures

Deezer Integration

Library and Authentication

Library: deezer-python (community library, not official)

Authentication: None required for search API.

Rate limiting: Unknown. Not documented. Not enforced by MusicMetaLinker.

API Endpoints

deezer-python library handles endpoint construction.

Base URL: https://api.deezer.com/

Endpoints used:

  • Track search: /search/track?q=...
  • ISRC search: /track/isrc:{isrc}

Query Patterns

By ISRC (preferred):

import deezer

client = deezer.Client()
result = client.search(f'isrc:{isrc}', relation='track')

Returns list of tracks with that ISRC. Usually single result.

By metadata:

query = f'{artist} {track}'
if album:
    query += f' {album}'

result = client.search(query, relation='track')

Simple concatenation. No advanced query syntax.

Response Parsing

Track structure:

{
    "id": 123456789,
    "title": "Track Name",
    "duration": 123,  # seconds
    "artist": {
        "name": "Artist Name"
    },
    "album": {
        "title": "Album Name"
    },
    "release_date": "2020-01-15",
    "bpm": 120,
    "isrc": "GBAYE9200070",
    "rank": 500000
}

Extraction logic:

  • Deezer ID: track.id
  • Track: track.title
  • Artist: track.artist.name
  • Album: track.album.title
  • Duration: track.duration (already in seconds)
  • Release date: track.release_date
  • BPM: track.bpm
  • ISRC: track.isrc
  • Rank: track.rank (popularity metric)

Filtering and Matching

Duration filtering (critical for Deezer):

duration_threshold = 3  # seconds

matches = [
    t for t in results 
    if abs(t.duration - duration) <= duration_threshold
]

±3 second threshold. Configurable via parameter but defaults to 3.

Why critical: Deezer returns many versions of same track (radio edit, album version, remaster, live). Duration filtering essential for correct match.

Fuzzy matching:

Same difflib.SequenceMatcher approach as MusicBrainz. 0.8 similarity threshold.

Ranking:

If multiple matches after filtering, selects highest rank (most popular version).

best_match = max(matches, key=lambda t: t.rank)
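
Note that max() raises ValueError on an empty list, so the selection needs a guard when no track survives the duration filter. The sketch below combines the filter and the ranking step; Track is a hypothetical stand-in for the objects returned by the client library, and pick_best_match is an illustrative name.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Track:
    """Hypothetical stand-in for a Deezer track result."""
    title: str
    duration: int  # seconds
    rank: int      # popularity metric


def pick_best_match(
    results: Iterable[Track], duration: float, duration_threshold: float = 3
) -> Optional[Track]:
    """Keep tracks within the duration threshold, then return the highest-ranked one."""
    matches = [t for t in results if abs(t.duration - duration) <= duration_threshold]
    # default=None avoids a ValueError when nothing passed the filter.
    return max(matches, key=lambda t: t.rank, default=None)
```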

Error Handling

Network errors: Caught and suppressed. Returns None.

Invalid ISRC: Returns empty list, treated as no match.

No results: Returns None.

Integration Strengths

  1. Strong ISRC support: Deezer has comprehensive ISRC coverage
  2. Duration filtering: Effective for disambiguating versions
  3. Popularity ranking: Helps select canonical version
  4. BPM data: Only source of BPM in MusicMetaLinker
  5. Fast API: Generally faster than MusicBrainz

Integration Weaknesses

  1. Unofficial library: deezer-python not maintained by Deezer
  2. No authentication: Limited to public API (no user-specific data)
  3. Simple search: No advanced query syntax
  4. Fixed default threshold: The 3-second duration default may not suit all use cases unless callers override it
  5. Commercial bias: Deezer catalog may not include obscure or independent releases

YouTube Music Integration

Library and Authentication

Library: ytmusicapi (unofficial, reverse-engineered API)

Authentication: None required for search.

Rate limiting: Unknown. YouTube may block aggressive usage.

API Endpoints

ytmusicapi reverse-engineers YouTube Music web interface. No official API.

Endpoints: Internal to ytmusicapi. Not exposed to MusicMetaLinker.

Query Patterns

By metadata only:

from ytmusicapi import YTMusic

ytmusic = YTMusic()
query = f'{artist} {track} {album}'
results = ytmusic.search(query, filter='songs')

filter='songs': Excludes videos, albums, playlists. Returns only song results.

No ISRC support: YouTube Music API doesn't support ISRC search.

No MBID support: YouTube Music doesn't use MBIDs.

Response Parsing

Song structure:

{
    "videoId": "dQw4w9WgXcQ",
    "title": "Track Name",
    "artists": [
        {"name": "Artist Name"}
    ],
    "album": {
        "name": "Album Name"
    },
    "duration": "7:11",  # string format
    "duration_seconds": 431
}

Extraction logic:
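
The source does not list the extraction rules for YouTube Music. As a sketch only, assuming they mirror the other services and read the fields shown in the song structure above, the extraction might look like this. The function name, the output keys, and the watch-URL format are assumptions, not confirmed MusicMetaLinker behavior.

```python
from typing import Any, Dict


def extract_song_fields(song: Dict[str, Any]) -> Dict[str, Any]:
    """Read the fields from a song result, tolerating missing keys."""
    artists = song.get("artists") or []
    album = song.get("album") or {}
    video_id = song.get("videoId")
    return {
        "video_id": video_id,
        # Assumed link format for a YouTube Music track.
        "url": f"https://music.youtube.com/watch?v={video_id}" if video_id else None,
        "track": song.get("title"),
        "artist": artists[0]["name"] if artists else None,
        "album": album.get("name"),
        "duration": song.get("duration_seconds"),
    }
```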

Filtering and Matching

No duration filtering: Duration filtering code commented out in MusicMetaLinker.

# if duration:
#     matches = [r for r in results if abs(r['duration_seconds'] - duration) < 10]

Why commented out: Unknown. Possibly unreliable duration data from YouTube.

No fuzzy matching: First result assumed correct.

best_match = results[0] if results else None

Critical weakness: High false positive rate. No validation that first result is correct match.
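
A low-cost mitigation is to restore the duration check and fall back to parsing the "duration" string when "duration_seconds" is missing. The sketch below is an illustration, not the project's code: the helper names are hypothetical, and the 10-second tolerance is taken from the commented-out line above.

```python
from typing import Any, Dict, List, Optional


def duration_to_seconds(text: str) -> int:
    """Convert "M:SS" or "H:MM:SS" into seconds."""
    seconds = 0
    for part in text.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


def song_seconds(song: Dict[str, Any]) -> Optional[int]:
    """Prefer the numeric field; fall back to the formatted string."""
    if song.get("duration_seconds") is not None:
        return song["duration_seconds"]
    if song.get("duration"):
        return duration_to_seconds(song["duration"])
    return None


def first_plausible(
    results: List[Dict[str, Any]], duration: Optional[float] = None, tolerance: float = 10
) -> Optional[Dict[str, Any]]:
    """Return the first result whose duration is within tolerance of the expected one."""
    if duration is None:
        # Without an expected duration this degrades to first-result-wins.
        return results[0] if results else None
    for song in results:
        secs = song_seconds(song)
        if secs is not None and abs(secs - duration) < tolerance:
            return song
    return None
```

Results with no usable duration are skipped when an expected duration is given, which trades some recall for fewer false positives.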

Error Handling

Network errors: Caught and suppressed. Returns None.

No results: Returns None.

API changes: ytmusicapi may break if YouTube changes web interface. No error handling for this.

Integration Strengths

  1. Broad coverage: YouTube Music has extensive catalog
  2. No authentication: Easy to use
  3. Filter parameter: Excludes non-song results

Integration Weaknesses

  1. Unofficial API: Reverse-engineered, fragile to changes
  2. No duration filtering: Commented out, high false positive rate
  3. First-result-only: No ranking or validation
  4. No ISRC support: Can't use authoritative identifiers
  5. Legal risk: Reverse-engineered APIs may violate ToS
  6. No error handling: API breakage causes silent failures

AcousticBrainz Integration

Library and Authentication

Library: requests (direct HTTP calls)

Authentication: None.

API Endpoints

Base URL: https://acousticbrainz.org/

Endpoint: /{mbid}

Query Pattern

import requests

def acousticbrainz_link(mbid):
    url = f"https://acousticbrainz.org/{mbid}"
    response = requests.get(url)
    return url if response.status_code == 200 else None

Simple HTTP GET. Returns URL if MBID exists, None otherwise.

Critical Issue: Service Shutdown

AcousticBrainz shut down in 2022. All queries return 404.

Impact: This integration is completely non-functional. Dead code.

Why still in codebase: Unknown. Possibly not updated since shutdown.

Recommendation: Remove this integration entirely.

Integration Strengths

None. Service is defunct.

Integration Weaknesses

  1. Service shut down: Non-functional
  2. Dead code: Wastes execution time
  3. Misleading output: CSV includes acousticbrainz column (always null)
  4. No deprecation notice: Code doesn't warn users

Spotify Integration

Library and Authentication

Library: spotipy (community-maintained Python client for the Spotify Web API; not an official Spotify product)

Authentication: OAuth2 client credentials flow.

Credentials: Stored in external mml_secrets.py file (not in repository).

mml_secrets.py structure:

SPOTIFY_CLIENT_ID = "your-client-id"
SPOTIFY_CLIENT_SECRET = "your-client-secret"

Usage Scope

Limited use: Spotify integration only used in Billboard dataset cleaning script (prepare_dataset.py).

Not used in main Align workflow. Spotify not queried by Align class.

Query Pattern

import spotipy
from spotipy.oauth2 import SpotifyClientCredentials
from mml_secrets import SPOTIFY_CLIENT_ID, SPOTIFY_CLIENT_SECRET

auth_manager = SpotifyClientCredentials(
    client_id=SPOTIFY_CLIENT_ID,
    client_secret=SPOTIFY_CLIENT_SECRET
)
sp = spotipy.Spotify(auth_manager=auth_manager)

result = sp.search(q=f'track:{track} artist:{artist}', type='track', limit=1)

Use Case

Billboard dataset cleaning: Extract ISRCs from Spotify for Billboard chart tracks.

Workflow:

  1. Billboard dataset has artist/track names but no ISRCs
  2. Query Spotify by artist/track
  3. Extract ISRC from Spotify result
  4. Use ISRC for subsequent MusicBrainz/Deezer queries
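
Step 3 of the workflow above can be sketched as follows. In Spotify Web API search responses, tracks sit under tracks.items and the ISRC under each track's external_ids. The function name is hypothetical, and the sketch works on the response dictionary only, so it makes no network call.

```python
from typing import Any, Dict, Optional


def isrc_from_search(result: Dict[str, Any]) -> Optional[str]:
    """Return the ISRC of the first track in a search response, if any."""
    items = result.get("tracks", {}).get("items", [])
    if not items:
        return None
    return items[0].get("external_ids", {}).get("isrc")
```

With limit=1 in the search call, the first item is the only candidate, so a wrong top hit yields a wrong ISRC for every later lookup.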

Integration Strengths

  1. Mature library: spotipy is a widely used, community-maintained client

  2. OAuth2: Secure authentication
  3. Rich metadata: Comprehensive track information
  4. ISRC support: Spotify provides ISRCs

Integration Weaknesses

  1. Requires credentials: Users must register Spotify app
  2. External secrets file: mml_secrets.py not in repository, must be created manually
  3. Limited use: Only for dataset preparation, not main workflow
  4. No documentation: No instructions for obtaining credentials

Integration Comparison

| Service | Library | Auth | ISRC Support | Duration Filtering | Matching Quality | Status |
| --- | --- | --- | --- | --- | --- | --- |
| MusicBrainz | musicbrainzngs | None | Yes | ±5s | Fuzzy (0.8) | Active |
| Deezer | deezer-python | None | Yes | ±3s | Fuzzy (0.8) + Rank | Active |
| YouTube Music | ytmusicapi | None | No | Commented out | First result | Active (fragile) |
| AcousticBrainz | requests | None | N/A | N/A | N/A | Defunct |
| Spotify | spotipy | OAuth2 | Yes | N/A | N/A | Active (limited use) |

Integration Orchestration

Service Selection Logic

Priority order:

  1. MusicBrainz if MBID provided (authoritative)
  2. Deezer if ISRC provided (fast, reliable)
  3. MusicBrainz if metadata provided (fallback)
  4. Deezer if metadata provided (fallback)
  5. YouTube Music if metadata provided (last resort)
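
Read as an ordered plan, the priority list above could be expressed like this. The function name and the (service, lookup kind) tuples are assumptions for illustration. Whether later steps are skipped after an earlier one succeeds is not stated in the source, so the sketch only produces the order.

```python
from typing import List, Optional, Tuple


def plan_queries(
    mbid: Optional[str] = None, isrc: Optional[str] = None, has_metadata: bool = False
) -> List[Tuple[str, str]]:
    """Return (service, lookup kind) pairs in the documented priority order."""
    plan: List[Tuple[str, str]] = []
    if mbid:
        plan.append(("musicbrainz", "mbid"))
    if isrc:
        plan.append(("deezer", "isrc"))
    if has_metadata:
        plan.extend([
            ("musicbrainz", "metadata"),
            ("deezer", "metadata"),
            ("youtube_music", "metadata"),
        ])
    return plan
```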

Parallel vs Sequential

Sequential execution: Services queried one at a time. No parallelization.

Implications:

  • Total latency = sum of all service latencies
  • Slow for batch processing
  • Simple error handling (no race conditions)

Result Aggregation

No cross-validation: Results from different services not compared.

First-wins strategy: First successful query for each field used.

Example:

  • MBID from MusicBrainz
  • ISRC from Deezer (if not in MusicBrainz)
  • BPM from Deezer (only source)
  • YouTube link from YouTube Music

No conflict resolution: If MusicBrainz and Deezer return different artists, no reconciliation.
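
The first-wins behavior, and the kind of disagreement it hides, can be illustrated with a short sketch. Both function names and the per-service field dictionaries are hypothetical. The second function shows one possible way to surface conflicts and is not something MusicMetaLinker does.

```python
from typing import Any, Dict, List, Tuple

ServiceResult = Tuple[str, Dict[str, Any]]  # (service name, extracted fields)


def merge_first_wins(results: List[ServiceResult]) -> Dict[str, Tuple[Any, str]]:
    """For each field, keep the first non-None value and remember its source."""
    merged: Dict[str, Tuple[Any, str]] = {}
    for service, fields in results:
        for name, value in fields.items():
            if value is not None and name not in merged:
                merged[name] = (value, service)
    return merged


def find_conflicts(results: List[ServiceResult]) -> Dict[str, Dict[str, Any]]:
    """List fields where services returned different non-None values."""
    seen: Dict[str, Dict[str, Any]] = {}
    for service, fields in results:
        for name, value in fields.items():
            if value is not None:
                seen.setdefault(name, {})[service] = value
    return {
        name: by_service
        for name, by_service in seen.items()
        if len(set(by_service.values())) > 1
    }
```

Logging the output of find_conflicts would at least make disagreements visible, even without a reconciliation rule.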

Integration Error Handling

Network Errors

All network errors caught and suppressed. Returns None.

No retry logic: Single attempt per service.

No exponential backoff: Immediate failure on error.

No circuit breaker: Repeated failures don't disable service.
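
A minimal sketch of what retry with exponential backoff and logging could look like is given below. call_with_retry is a hypothetical helper. It catches Exception broadly for brevity, whereas real code would catch only the network errors raised by each client library.

```python
import logging
import time
from typing import Any, Callable

logger = logging.getLogger("integrations")


def call_with_retry(
    func: Callable[..., Any],
    *args: Any,
    attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
    **kwargs: Any,
) -> Any:
    """Call func, retrying on failure with delays of base_delay * 2**n seconds."""
    for attempt in range(1, attempts + 1):
        try:
            return func(*args, **kwargs)
        except Exception as exc:  # narrow this to network errors in real code
            logger.warning("attempt %d/%d failed: %s", attempt, attempts, exc)
            if attempt == attempts:
                return None  # keeps the existing "return None" contract
            sleep(base_delay * 2 ** (attempt - 1))
```

The sleep parameter exists so tests can substitute a recorder for time.sleep; callers would normally leave it at its default.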

Rate Limiting

No rate limiting implementation.

Risks:

  • MusicBrainz: Recommends 1 req/s, may block aggressive usage
  • Deezer: Unknown limits, may block
  • YouTube Music: Unknown limits, may block or break API

Batch processing: High risk of rate limiting (no delays between requests).
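
For batch runs, a simple per-service throttle that enforces a minimum gap between requests would address most of this risk. The class below is a sketch with hypothetical names. The 1-second default reflects the MusicBrainz guidance above, and suitable intervals for the other services are unknown.

```python
import time
from typing import Callable, Optional


class Throttle:
    """Block until at least min_interval seconds have passed since the last call."""

    def __init__(
        self,
        min_interval: float = 1.0,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self._min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last: Optional[float] = None

    def wait(self) -> None:
        now = self._clock()
        if self._last is not None:
            remaining = self._min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling wait() on a service's throttle immediately before each request to that service keeps a batch job at or below one request per interval.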

Service Unavailability

No health checks: Services assumed available.

No fallback: If MusicBrainz down, no alternative for MBID lookup.

No status monitoring: No logging of service failures.

Integration Security

API Keys

MusicBrainz, Deezer, YouTube Music: No API keys required.

Spotify: Client credentials in external file (not encrypted).

Data Privacy

No personal data sent: Only public music metadata queried.

No user tracking: No analytics sent to services.

HTTPS

All services use HTTPS. No plaintext HTTP.

Input Sanitization

No sanitization: Metadata strings passed directly to APIs.

Potential risks:

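  • Query injection (only if a service passed input unescaped into SQL/NoSQL)
  • Command injection (only if a service executed shell commands built from input)
  • Query syntax breakage: An unescaped double quote or other Lucene operator in an artist, track, or album name can break or alter the MusicBrainz search query

Actual risk: Low. All requests go through HTTP client libraries that URL-encode parameters, and every call is a read-only lookup. The realistic failure is a failed or wrong search, not a compromise of the service.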

Integration Recommendations

Immediate Fixes

  1. Remove AcousticBrainz: Delete defunct integration
  2. Fix User-Agent: Change "elka/0.1" to "MusicMetaLinker/0.0.1"
  3. Add rate limiting: Implement delays between requests
  4. Document Spotify setup: Instructions for obtaining credentials

Short-Term Improvements

  1. Add retry logic: Exponential backoff for network errors
  2. Add timeout configuration: Configurable request timeouts
  3. Enable YouTube duration filtering: Uncomment and test
  4. Add error logging: Log service failures
  5. Add health checks: Verify service availability before queries

Long-Term Enhancements

  1. Parallel queries: Run service calls concurrently (a thread pool suits the blocking client libraries; asyncio would need async HTTP clients)
  2. Cross-validation: Compare results across services
  3. Confidence scores: Indicate match quality
  4. Service abstraction: Common interface for all services
  5. Plugin architecture: Allow adding new services without code changes
  6. Caching layer: Reduce redundant API calls
  7. Circuit breaker: Disable failing services temporarily
  8. Metrics collection: Track success rates, latencies per service
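
Two of the enhancements above, a common service interface and concurrent queries, fit together naturally. The sketch below assumes a hypothetical MetadataService protocol with a single lookup method. It uses a thread pool rather than asyncio because the client libraries discussed here make blocking calls.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, Optional, Protocol, Sequence


class MetadataService(Protocol):
    """Hypothetical common interface that each integration would implement."""
    name: str

    def lookup(self, artist: str, track: str) -> Optional[dict]: ...


def query_all(
    services: Sequence[MetadataService], artist: str, track: str
) -> Dict[str, Optional[dict]]:
    """Query every service concurrently; a failing service yields None."""

    def safe_lookup(service: MetadataService) -> Optional[dict]:
        try:
            return service.lookup(artist, track)
        except Exception:  # real code should log and narrow this
            return None

    with ThreadPoolExecutor(max_workers=max(1, len(services))) as pool:
        results = list(pool.map(safe_lookup, services))
    return {service.name: result for service, result in zip(services, results)}
```

Total latency then approaches that of the slowest service instead of the sum, although per-service rate limits still apply.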

Integration Value Assessment

High value:

  • MusicBrainz: Authoritative, comprehensive, reliable
  • Deezer: Fast, good ISRC coverage, BPM data

Medium value:

  • Spotify: Useful for dataset preparation, requires setup

Low value:

  • YouTube Music: Weak matching, fragile API, high false positives
  • AcousticBrainz: Defunct, zero value

Recommendation: Keep MusicBrainz and Deezer. Remove AcousticBrainz. Improve YouTube Music matching or remove. Keep Spotify for dataset preparation.