MusicMetaLinker Integrations
Integration Architecture
MusicMetaLinker integrates with five external services:
- MusicBrainz (open music encyclopedia)
- Deezer (commercial streaming service)
- YouTube Music (commercial streaming service)
- AcousticBrainz (audio analysis database - defunct)
- Spotify (commercial streaming service - limited use)
Each integration uses a different library and authentication approach.
MusicBrainz Integration
Library and Authentication
Library: musicbrainzngs (community-maintained Python bindings for the MusicBrainz web service)
Authentication: None required for read-only queries.
User-Agent: Required by MusicBrainz API. Hardcoded as "elka/0.1" (appears to be from parent project, not MusicMetaLinker-specific).
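Rate limiting: MusicBrainz asks clients to average no more than 1 request/second. MusicMetaLinker adds no throttling of its own; musicbrainzngs applies a default limit of one request per second unless it is disabled with set_rate_limit.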
API Endpoints
All queries go through musicbrainzngs library, which handles endpoint construction.
Base URL: https://musicbrainz.org/ws/2/
Endpoints used:
- Recording search: /recording?query=...
- Recording lookup: /recording/{mbid}
- ISRC search: /isrc/{isrc}
Query Patterns
By MBID (most reliable):
import musicbrainzngs as mb
mb.set_useragent("elka", "0.1")
result = mb.get_recording_by_id(
    mbid,
    includes=["artists", "releases", "isrcs"]
)
includes parameter: Fetches related entities in single request. Reduces API calls.
By ISRC:
result = mb.get_recordings_by_isrc(
    isrc,
    includes=["artists", "releases", "isrcs"]
)
Returns list of recordings with that ISRC. Multiple recordings possible (different releases, remasters).
By metadata:
query = f'artist:"{artist}" AND recording:"{track}"'
if album:
    query += f' AND release:"{album}"'
result = mb.search_recordings(
    query=query,
    limit=10
)
Lucene query syntax. Quoted strings for exact matching. Returns ranked results.
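Because the query is assembled with f-strings, a double quote or Lucene operator inside a title can change how the search is parsed. A minimal sketch of escaping values before interpolation is shown here; `escape_lucene` and `build_recording_query` are hypothetical helpers introduced for illustration, not part of MusicMetaLinker or musicbrainzngs.

```python
import re
from typing import Optional

# Characters and operators with special meaning in Lucene query syntax.
_LUCENE_SPECIAL = re.compile(r'([+\-!(){}\[\]^"~*?:\\/]|&&|\|\|)')


def escape_lucene(value: str) -> str:
    """Backslash-escape Lucene special characters in a user-supplied value."""
    return _LUCENE_SPECIAL.sub(r"\\\1", value)


def build_recording_query(artist: str, track: str, album: Optional[str] = None) -> str:
    """Build the same artist/recording/release query, with escaped values."""
    parts = [
        f'artist:"{escape_lucene(artist)}"',
        f'recording:"{escape_lucene(track)}"',
    ]
    if album:
        parts.append(f'release:"{escape_lucene(album)}"')
    return " AND ".join(parts)
```

The resulting string could be passed as the `query` argument in the search call above; whether escaping improves match quality for a given catalog would need testing.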
Response Parsing
Recording structure:
{
    "recording": {
        "id": "mbid-uuid",
        "title": "Track Name",
        "length": 123456,  # milliseconds
        "artist-credit": [
            {"artist": {"name": "Artist Name"}}
        ],
        "release-list": [
            {
                "title": "Album Name",
                "date": "2020-01-15",
                "track-list": [
                    {"number": "1"}
                ]
            }
        ],
        "isrc-list": ["GBAYE9200070"]
    }
}
Extraction logic:
- MBID: recording.id
- Track: recording.title
- Artist: recording.artist-credit[0].artist.name (first artist only)
- Duration: recording.length / 1000 (convert milliseconds to seconds)
- Album: recording.release-list[0].title (first release only)
- Release date: recording.release-list[0].date
- Track number: recording.release-list[0].track-list[0].number
- ISRC: recording.isrc-list[0] (first ISRC only)
Multiple values: MusicBrainz returns lists for artists, releases, ISRCs. MusicMetaLinker takes first value only. No aggregation or selection logic.
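The first-value-only extraction described above can be sketched as a single function that tolerates missing keys. This is an illustration built on the response shape shown in this section, not MusicMetaLinker's actual code; `first_dict` and `parse_recording` are hypothetical names.

```python
from typing import Any, Dict, List, Optional


def first_dict(items: Optional[List[Any]]) -> Dict[str, Any]:
    """Return the first dict in a list, or an empty dict if there is none."""
    for item in items or []:
        if isinstance(item, dict):
            return item
    return {}


def parse_recording(response: Dict[str, Any]) -> Dict[str, Any]:
    """Flatten a recording response, taking only the first artist, release, and ISRC."""
    rec = response.get("recording", {})
    release = first_dict(rec.get("release-list"))
    length = rec.get("length")
    isrcs = rec.get("isrc-list") or []
    return {
        "mbid": rec.get("id"),
        "track": rec.get("title"),
        "artist": first_dict(rec.get("artist-credit")).get("artist", {}).get("name"),
        # int() accepts both numeric and string lengths; result is in seconds.
        "duration": int(length) / 1000 if length is not None else None,
        "album": release.get("title"),
        "release_date": release.get("date"),
        "track_number": first_dict(release.get("track-list")).get("number"),
        "isrc": isrcs[0] if isrcs else None,
    }
```

Every field falls back to None when the corresponding list is empty or absent, so a sparse recording does not raise.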
Filtering and Matching
Duration filtering:
if duration:
    # musicbrainzngs returns length as a string and omits it when unknown
    matches = [r for r in results
               if 'length' in r and abs(int(r['length']) / 1000 - duration) < 5]
±5 second threshold for metadata searches. Hardcoded.
Fuzzy string matching:
Uses difflib.SequenceMatcher for artist/track/album similarity.
from difflib import SequenceMatcher
def similarity(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()
# Match if similarity > 0.8 (80%)
Threshold hardcoded at 0.8. No configuration option.
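One way the two hardcoded thresholds could be made configurable is to gather them in a small settings object. The sketch below is hypothetical: `MatchConfig` and `is_match` do not exist in MusicMetaLinker, and the defaults simply mirror the values documented above (0.8 similarity, 5 seconds).

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Any, Dict


@dataclass(frozen=True)
class MatchConfig:
    similarity_threshold: float = 0.8  # minimum SequenceMatcher ratio (exclusive)
    duration_tolerance: float = 5.0    # maximum duration difference in seconds


def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def is_match(candidate: Dict[str, Any], query: Dict[str, Any],
             config: MatchConfig = MatchConfig()) -> bool:
    """Return True if artist and track are similar enough and durations agree."""
    for field in ("artist", "track"):
        if similarity(candidate[field], query[field]) <= config.similarity_threshold:
            return False
    if query.get("duration") is not None and candidate.get("duration") is not None:
        return abs(candidate["duration"] - query["duration"]) < config.duration_tolerance
    return True  # no duration available to compare
```

A caller could pass `MatchConfig(duration_tolerance=2)` for stricter matching without touching the matching code.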
Error Handling
Network errors: Caught and suppressed. Returns None.
Invalid MBID: Returns None.
No results: Returns None.
Rate limiting: No handling. If rate limited, returns None.
Integration Strengths
- Established library: musicbrainzngs is maintained by the MusicBrainz community
- Rich metadata: Comprehensive music information
- No authentication: Easy to use
- Includes parameter: Efficient data fetching
- Authoritative source: MusicBrainz is ground truth for music metadata
Integration Weaknesses
- Hardcoded User-Agent: "elka/0.1" not specific to MusicMetaLinker
- No application-level rate limiting: relies on the musicbrainzngs default throttle; risk of being blocked if that is disabled
- First-value-only: Ignores multiple artists, releases, ISRCs
- Hardcoded thresholds: Duration (5s), similarity (0.8) not configurable
- No error visibility: Silent failures
Deezer Integration
Library and Authentication
Library: deezer-python (community library, not official)
Authentication: None required for search API.
Rate limiting: Unknown. Not documented. Not enforced by MusicMetaLinker.
API Endpoints
deezer-python library handles endpoint construction.
Base URL: https://api.deezer.com/
Endpoints used:
- Track search: /search/track?q=...
- ISRC search: /track/isrc:{isrc}
Query Patterns
By ISRC (preferred):
import deezer
client = deezer.Client()
result = client.search(f'isrc:{isrc}', relation='track')
Returns list of tracks with that ISRC. Usually single result.
By metadata:
query = f'{artist} {track}'
if album:
    query += f' {album}'
result = client.search(query, relation='track')
Simple concatenation. No advanced query syntax.
Response Parsing
Track structure:
{
    "id": 123456789,
    "title": "Track Name",
    "duration": 123,  # seconds
    "artist": {
        "name": "Artist Name"
    },
    "album": {
        "title": "Album Name"
    },
    "release_date": "2020-01-15",
    "bpm": 120,
    "isrc": "GBAYE9200070",
    "rank": 500000
}
Extraction logic:
- Deezer ID: track.id
- Track: track.title
- Artist: track.artist.name
- Album: track.album.title
- Duration: track.duration (already in seconds)
- Release date: track.release_date
- BPM: track.bpm
- ISRC: track.isrc
- Rank: track.rank (popularity metric)
Filtering and Matching
Duration filtering (critical for Deezer):
duration_threshold = 3 # seconds
matches = [
    t for t in results
    if abs(t.duration - duration) <= duration_threshold
]
±3 second threshold. Configurable via parameter but defaults to 3.
Why critical: Deezer returns many versions of same track (radio edit, album version, remaster, live). Duration filtering essential for correct match.
Fuzzy matching:
Same difflib.SequenceMatcher approach as MusicBrainz. 0.8 similarity threshold.
Ranking:
If multiple matches after filtering, selects highest rank (most popular version).
best_match = max(matches, key=lambda t: t.rank)
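The filter-then-rank selection can be combined into one function. The sketch below uses a stand-in `Track` dataclass rather than deezer-python objects, and `select_best` is a hypothetical name; it also passes `default=None` to `max`, since `max` on an empty list raises ValueError.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Track:
    # Stand-in for a Deezer track; only the fields used for selection.
    title: str
    duration: int  # seconds
    rank: int      # Deezer popularity metric


def select_best(tracks: Iterable[Track], duration: int,
                duration_threshold: int = 3) -> Optional[Track]:
    """Keep tracks within the duration threshold, then pick the highest rank."""
    matches = [t for t in tracks if abs(t.duration - duration) <= duration_threshold]
    return max(matches, key=lambda t: t.rank, default=None)
```

For example, given a radio edit (210 s), an album version (245 s), and a remaster (246 s), a target duration of 245 keeps the last two and returns whichever has the higher rank.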
Error Handling
Network errors: Caught and suppressed. Returns None.
Invalid ISRC: Returns empty list, treated as no match.
No results: Returns None.
Integration Strengths
- Strong ISRC support: Deezer has comprehensive ISRC coverage
- Duration filtering: Effective for disambiguating versions
- Popularity ranking: Helps select canonical version
- BPM data: Only source of BPM in MusicMetaLinker
- Fast API: Generally faster than MusicBrainz
Integration Weaknesses
- Unofficial library: deezer-python not maintained by Deezer
- No authentication: Limited to public API (no user-specific data)
- Simple search: No advanced query syntax
- Default threshold: the 3-second duration default may not suit all use cases unless overridden per call
- Commercial bias: Deezer catalog may not include obscure or independent releases
YouTube Music Integration
Library and Authentication
Library: ytmusicapi (unofficial, reverse-engineered API)
Authentication: None required for search.
Rate limiting: Unknown. YouTube may block aggressive usage.
API Endpoints
ytmusicapi reverse-engineers YouTube Music web interface. No official API.
Endpoints: Internal to ytmusicapi. Not exposed to MusicMetaLinker.
Query Patterns
By metadata only:
from ytmusicapi import YTMusic
ytmusic = YTMusic()
query = ' '.join(part for part in (artist, track, album) if part)  # skip a missing album instead of searching for "None"
results = ytmusic.search(query, filter='songs')
filter='songs': Excludes videos, albums, playlists. Returns only song results.
No ISRC support: YouTube Music API doesn't support ISRC search.
No MBID support: YouTube Music doesn't use MBIDs.
Response Parsing
Song structure:
{
    "videoId": "dQw4w9WgXcQ",
    "title": "Track Name",
    "artists": [
        {"name": "Artist Name"}
    ],
    "album": {
        "name": "Album Name"
    },
    "duration": "7:11",  # string format
    "duration_seconds": 431
}
Extraction logic:
- YouTube ID: result.videoId
- YouTube URL: f"https://www.youtube.com/watch?v={videoId}"
- Track: result.title
- Artist: result.artists[0].name (first artist only)
- Album: result.album.name
Filtering and Matching
No duration filtering: Duration filtering code commented out in MusicMetaLinker.
# if duration:
#     matches = [r for r in results if abs(r['duration_seconds'] - duration) < 10]
Why commented out: Unknown. Possibly unreliable duration data from YouTube.
No fuzzy matching: First result assumed correct.
best_match = results[0] if results else None
Critical weakness: High false positive rate. No validation that first result is correct match.
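A validation step along the following lines could reduce false positives. It is a sketch only: `pick_validated` is a hypothetical function, the 0.8 similarity threshold is borrowed from the other integrations, and the 10-second tolerance comes from the commented-out filter. It operates on plain dicts shaped like the song structure above.

```python
from difflib import SequenceMatcher
from typing import Any, Dict, List, Optional


def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def pick_validated(results: List[Dict[str, Any]], artist: str, track: str,
                   duration: Optional[float] = None,
                   min_similarity: float = 0.8,
                   tolerance: float = 10) -> Optional[Dict[str, Any]]:
    """Return the first result whose title, artist, and duration look plausible."""
    for result in results:
        artists = result.get("artists") or [{}]
        if similarity(result.get("title", ""), track) <= min_similarity:
            continue
        if similarity(artists[0].get("name", ""), artist) <= min_similarity:
            continue
        seconds = result.get("duration_seconds")
        if duration is not None and seconds is not None \
                and abs(seconds - duration) >= tolerance:
            continue
        return result
    return None
```

Results are still taken in the order YouTube Music returns them, so this keeps the first-result preference while rejecting obvious mismatches.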
Error Handling
Network errors: Caught and suppressed. Returns None.
No results: Returns None.
API changes: ytmusicapi may break if YouTube changes web interface. No error handling for this.
Integration Strengths
- Broad coverage: YouTube Music has extensive catalog
- No authentication: Easy to use
- Filter parameter: Excludes non-song results
Integration Weaknesses
- Unofficial API: Reverse-engineered, fragile to changes
- No duration filtering: Commented out, high false positive rate
- First-result-only: No ranking or validation
- No ISRC support: Can't use authoritative identifiers
- Legal risk: Reverse-engineered APIs may violate ToS
- No error handling: API breakage causes silent failures
AcousticBrainz Integration
Library and Authentication
Library: requests (direct HTTP calls)
Authentication: None.
API Endpoints
Base URL: https://acousticbrainz.org/
Endpoint: /{mbid}
Query Pattern
import requests
def acousticbrainz_link(mbid):
    url = f"https://acousticbrainz.org/{mbid}"
    # A timeout keeps a dead or slow host from blocking the whole run.
    response = requests.get(url, timeout=10)
    return url if response.status_code == 200 else None
Simple HTTP GET. Returns URL if MBID exists, None otherwise.
Critical Issue: Service Shutdown
AcousticBrainz shut down in 2022. All queries return 404.
Impact: This integration is completely non-functional. Dead code.
Why still in codebase: Unknown. Possibly not updated since shutdown.
Recommendation: Remove this integration entirely.
Integration Strengths
None. Service is defunct.
Integration Weaknesses
- Service shut down: Non-functional
- Dead code: Wastes execution time
- Misleading output: CSV includes acousticbrainz column (always null)
- No deprecation notice: Code doesn't warn users
Spotify Integration
Library and Authentication
Library: spotipy (community-maintained Python client for the Spotify Web API, not an official Spotify product)
Authentication: OAuth2 client credentials flow.
Credentials: Stored in external mml_secrets.py file (not in repository).
mml_secrets.py structure:
SPOTIFY_CLIENT_ID = "your-client-id"
SPOTIFY_CLIENT_SECRET = "your-client-secret"
Usage Scope
Limited use: Spotify integration only used in Billboard dataset cleaning script (prepare_dataset.py).
Not used in main Align workflow. Spotify not queried by Align class.
Query Pattern
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials
from mml_secrets import SPOTIFY_CLIENT_ID, SPOTIFY_CLIENT_SECRET
auth_manager = SpotifyClientCredentials(
    client_id=SPOTIFY_CLIENT_ID,
    client_secret=SPOTIFY_CLIENT_SECRET
)
sp = spotipy.Spotify(auth_manager=auth_manager)
result = sp.search(q=f'track:{track} artist:{artist}', type='track', limit=1)
Use Case
Billboard dataset cleaning: Extract ISRCs from Spotify for Billboard chart tracks.
Workflow:
- Billboard dataset has artist/track names but no ISRCs
- Query Spotify by artist/track
- Extract ISRC from Spotify result
- Use ISRC for subsequent MusicBrainz/Deezer queries
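The ISRC extraction step in the workflow above can be sketched as follows. Spotify track objects carry the ISRC under `external_ids`; the function operates on the dict returned by a track search, and `extract_isrc` is a hypothetical name rather than code taken from prepare_dataset.py.

```python
from typing import Any, Dict, Optional


def extract_isrc(search_result: Dict[str, Any]) -> Optional[str]:
    """Return the ISRC of the first track in a Spotify search response, if any."""
    items = search_result.get("tracks", {}).get("items", [])
    if not items:
        return None
    # Not every track has an ISRC, so both lookups fall back gracefully.
    return items[0].get("external_ids", {}).get("isrc")
```

With the query from the previous section, usage would look like `isrc = extract_isrc(result)`, followed by the ISRC-based MusicBrainz or Deezer lookup when the value is not None.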
Integration Strengths
- Mature library: spotipy is a widely used, community-maintained client for the Spotify Web API
- OAuth2: Secure authentication
- Rich metadata: Comprehensive track information
- ISRC support: Spotify provides ISRCs
Integration Weaknesses
- Requires credentials: Users must register Spotify app
- External secrets file: mml_secrets.py not in repository, must be created manually
- Limited use: Only for dataset preparation, not main workflow
- No documentation: No instructions for obtaining credentials
Integration Comparison
| Service | Library | Auth | ISRC Support | Duration Filtering | Matching Quality | Status |
|---|---|---|---|---|---|---|
| MusicBrainz | musicbrainzngs | None | Yes | ±5s | Fuzzy (0.8) | Active |
| Deezer | deezer-python | None | Yes | ±3s | Fuzzy (0.8) + Rank | Active |
| YouTube Music | ytmusicapi | None | No | Commented out | First result | Active (fragile) |
| AcousticBrainz | requests | None | N/A | N/A | N/A | Defunct |
| Spotify | spotipy | OAuth2 | Yes | N/A | N/A | Active (limited use) |
Integration Orchestration
Service Selection Logic
Priority order:
- MusicBrainz if MBID provided (authoritative)
- Deezer if ISRC provided (fast, reliable)
- MusicBrainz if metadata provided (fallback)
- Deezer if metadata provided (fallback)
- YouTube Music if metadata provided (last resort)
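The priority order above can be expressed as a small planning function. This sketch only orders the lookups; whether later steps are skipped once an earlier one succeeds is not stated here, so that decision is left to the caller. `plan_lookups` and the service labels are hypothetical.

```python
from typing import List, Optional, Tuple


def plan_lookups(mbid: Optional[str] = None, isrc: Optional[str] = None,
                 has_metadata: bool = False) -> List[Tuple[str, str]]:
    """Return (service, lookup key) pairs in the documented priority order."""
    steps: List[Tuple[str, str]] = []
    if mbid:
        steps.append(("musicbrainz", "mbid"))      # authoritative
    if isrc:
        steps.append(("deezer", "isrc"))           # fast, reliable
    if has_metadata:
        steps.append(("musicbrainz", "metadata"))  # fallback
        steps.append(("deezer", "metadata"))       # fallback
        steps.append(("youtube_music", "metadata"))  # last resort
    return steps
```

Keeping the order in one place makes it easier to test and to change than conditionals spread across the service calls.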
Parallel vs Sequential
Sequential execution: Services queried one at a time. No parallelization.
Implications:
- Total latency = sum of all service latencies
- Slow for batch processing
- Simple error handling (no race conditions)
Result Aggregation
No cross-validation: Results from different services not compared.
First-wins strategy: First successful query for each field used.
Example:
- MBID from MusicBrainz
- ISRC from Deezer (if not in MusicBrainz)
- BPM from Deezer (only source)
- YouTube link from YouTube Music
No conflict resolution: If MusicBrainz and Deezer return different artists, no reconciliation.
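The first-wins strategy, and the conflict detection it currently lacks, can both be sketched in a few lines. The functions below are hypothetical illustrations that take per-service result dicts in priority order; they are not taken from the MusicMetaLinker codebase.

```python
from typing import Any, Dict


def merge_first_wins(results: Dict[str, Dict[str, Any]]) -> Dict[str, Any]:
    """For each field, keep the first non-None value in service order."""
    merged: Dict[str, Any] = {}
    for fields in results.values():
        for field, value in fields.items():
            if value is not None and field not in merged:
                merged[field] = value
    return merged


def find_conflicts(results: Dict[str, Dict[str, Any]]) -> Dict[str, Dict[str, Any]]:
    """Report fields where services returned different non-None values."""
    seen: Dict[str, Dict[str, Any]] = {}
    for service, fields in results.items():
        for field, value in fields.items():
            if value is not None:
                seen.setdefault(field, {})[service] = value
    return {f: by_service for f, by_service in seen.items()
            if len(set(by_service.values())) > 1}
```

Reporting conflicts alongside the merged record would be a first step toward cross-validation without changing which value wins.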
Integration Error Handling
Network Errors
All network errors caught and suppressed. Returns None.
No retry logic: Single attempt per service.
No exponential backoff: Immediate failure on error.
No circuit breaker: Repeated failures don't disable service.
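Retry with exponential backoff, which the current code lacks, might look like the sketch below. `with_retries` is a hypothetical helper; the attempt count and base delay are illustrative defaults, and `sleep` is injectable so the behaviour can be tested without waiting.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def with_retries(func: Callable[[], T], attempts: int = 3, base_delay: float = 0.5,
                 sleep: Callable[[float], None] = time.sleep,
                 exceptions: Tuple[Type[BaseException], ...] = (Exception,)) -> T:
    """Call func, retrying on failure with delays of base_delay * 2**attempt."""
    for attempt in range(attempts):
        try:
            return func()
        except exceptions:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error to the caller
            sleep(base_delay * 2 ** attempt)
    raise ValueError("attempts must be at least 1")
```

Unlike the current behaviour, the final failure is re-raised rather than converted to None, so the caller can decide whether to log it or move on to the next service.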
Rate Limiting
No rate limiting implementation.
Risks:
- MusicBrainz: Recommends 1 req/s, may block aggressive usage
- Deezer: Unknown limits, may block
- YouTube Music: Unknown limits, may block or break API
Batch processing: High risk of rate limiting (no delays between requests).
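A minimal throttle that enforces a fixed gap between consecutive requests is sketched below. `Throttle` is a hypothetical class, not existing code; the one-second default follows the MusicBrainz guidance, and suitable intervals for Deezer and YouTube Music are unknown.

```python
import time
from typing import Callable, Optional


class Throttle:
    """Block until at least min_interval seconds have passed since the last call."""

    def __init__(self, min_interval: float = 1.0,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self._min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last: Optional[float] = None

    def wait(self) -> None:
        now = self._clock()
        if self._last is not None:
            remaining = self._min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

In a batch loop, one `Throttle` instance per service could be created up front and `wait()` called immediately before each request to that service.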
Service Unavailability
No health checks: Services assumed available.
No fallback: If MusicBrainz down, no alternative for MBID lookup.
No status monitoring: No logging of service failures.
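Failures could be made visible without changing the return-None contract by logging inside a thin wrapper. The sketch below is hypothetical: `safe_call` and the logger name are illustrative and do not exist in MusicMetaLinker.

```python
import logging
from typing import Any, Callable, Optional

# Hypothetical logger name; any name used consistently across integrations works.
logger = logging.getLogger("musicmetalinker.integrations")


def safe_call(service: str, func: Callable[..., Any], *args: Any, **kwargs: Any) -> Optional[Any]:
    """Run a service lookup; on any error, log it with a traceback and return None."""
    try:
        return func(*args, **kwargs)
    except Exception:
        logger.warning("%s lookup failed", service, exc_info=True)
        return None
```

Callers still see None for a failed lookup, but the log now records which service failed and why, which separates "no match" from "service error".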
Integration Security
API Keys
MusicBrainz, Deezer, YouTube Music: No API keys required.
Spotify: Client credentials in external file (not encrypted).
Data Privacy
No personal data sent: Only public music metadata queried.
No user tracking: No analytics sent to services.
HTTPS
All services use HTTPS. No plaintext HTTP.
Input Sanitization
No sanitization: Metadata strings passed directly to APIs.
Potential risks:
- Query injection (if services use SQL/NoSQL)
- Command injection (if services execute shell commands)
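Actual risk: Low for the services themselves, since all are reached through HTTP APIs. The main practical issue is correctness: unescaped quotes or Lucene operators in metadata can break or alter MusicBrainz search queries.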
Integration Recommendations
Immediate Fixes
- Remove AcousticBrainz: Delete defunct integration
- Fix User-Agent: Change "elka/0.1" to "MusicMetaLinker/0.0.1"
- Add rate limiting: Implement delays between requests
- Document Spotify setup: Instructions for obtaining credentials
Short-Term Improvements
- Add retry logic: Exponential backoff for network errors
- Add timeout configuration: Configurable request timeouts
- Enable YouTube duration filtering: Uncomment and test
- Add error logging: Log service failures
- Add health checks: Verify service availability before queries
Long-Term Enhancements
- Parallel queries: Use asyncio for concurrent API calls
- Cross-validation: Compare results across services
- Confidence scores: Indicate match quality
- Service abstraction: Common interface for all services
- Plugin architecture: Allow adding new services without code changes
- Caching layer: Reduce redundant API calls
- Circuit breaker: Disable failing services temporarily
- Metrics collection: Track success rates, latencies per service
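As an illustration of the parallel-queries item, the blocking client libraries could be run concurrently in worker threads via asyncio. This is a sketch under stated assumptions: `query_all` is a hypothetical function, each lookup is a zero-argument callable wrapping one service call, and failures are mapped to None to match the current behaviour. It requires Python 3.9 or later for `asyncio.to_thread`.

```python
import asyncio
from typing import Any, Callable, Dict, Optional, Tuple


async def query_all(lookups: Dict[str, Callable[[], Any]]) -> Dict[str, Optional[Any]]:
    """Run blocking lookups concurrently; a failed lookup yields None."""

    async def run(name: str, func: Callable[[], Any]) -> Tuple[str, Optional[Any]]:
        try:
            # to_thread keeps blocking HTTP clients off the event loop.
            return name, await asyncio.to_thread(func)
        except Exception:
            return name, None

    pairs = await asyncio.gather(*(run(name, func) for name, func in lookups.items()))
    return dict(pairs)
```

Total latency then approaches that of the slowest service rather than the sum of all of them, though per-service rate limits still apply inside each lookup.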
Integration Value Assessment
High value:
- MusicBrainz: Authoritative, comprehensive, reliable
- Deezer: Fast, good ISRC coverage, BPM data
Medium value:
- Spotify: Useful for dataset preparation, requires setup
Low value:
- YouTube Music: Weak matching, fragile API, high false positives
- AcousticBrainz: Defunct, zero value
Recommendation: Keep MusicBrainz and Deezer. Remove AcousticBrainz. Improve YouTube Music matching or remove. Keep Spotify for dataset preparation.