# MusicMetaLinker Integrations

## Integration Architecture

MusicMetaLinker integrates with five external services:

1. MusicBrainz (open music encyclopedia)
2. Deezer (commercial streaming service)
3. YouTube Music (commercial streaming service)
4. AcousticBrainz (audio analysis database - defunct)
5. Spotify (commercial streaming service - limited use)

Each integration uses a different library and authentication approach.

## MusicBrainz Integration

### Library and Authentication

**Library:** musicbrainzngs (community-maintained Python client for the MusicBrainz web service)

**Authentication:** None required for read-only queries.

**User-Agent:** Required by the MusicBrainz API. Hardcoded as "elka/0.1" (appears to come from a parent project, not MusicMetaLinker-specific).

**Rate limiting:** MusicBrainz recommends 1 request/second. Not enforced by MusicMetaLinker.

### API Endpoints

All queries go through the musicbrainzngs library, which handles endpoint construction.

**Base URL:** https://musicbrainz.org/ws/2/

**Endpoints used:**

- Recording search: /recording?query=...
- Recording lookup: /recording/{mbid}
- ISRC search: /isrc/{isrc}

### Query Patterns

**By MBID (most reliable):**

```python
import musicbrainzngs as mb

# MusicBrainz rejects requests that do not identify the client
mb.set_useragent("elka", "0.1")

# mbid: MusicBrainz recording ID (UUID string)
result = mb.get_recording_by_id(
    mbid,
    includes=["artists", "releases", "isrcs"]
)
```

**includes parameter:** Fetches related entities in a single request. Reduces API calls.

**By ISRC:**

```python
result = mb.get_recordings_by_isrc(
    isrc,
    includes=["artists", "releases", "isrcs"]
)
```

Returns the list of recordings with that ISRC. Multiple recordings are possible (different releases, remasters).

**By metadata:**

```python
# Lucene query: each quoted phrase must match its field exactly
query = f'artist:"{artist}" AND recording:"{track}"'
if album:
    query += f' AND release:"{album}"'

result = mb.search_recordings(
    query=query,
    limit=10
)
```

Lucene query syntax. Quoted strings for exact matching. Returns ranked results.
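Because the metadata query interpolates raw strings into Lucene syntax, a title containing a double quote or backslash can produce a malformed query. The following is a minimal sketch of an escaping query builder; `escape_lucene` and `build_recording_query` are hypothetical helpers, not part of MusicMetaLinker or musicbrainzngs, and the set of escaped characters follows the general Lucene query-parser convention rather than anything MusicBrainz-specific.

```python
import re
from typing import Optional

# Characters (and the && / || operators) that are special in Lucene queries.
_LUCENE_SPECIAL = re.compile(r'([+\-!(){}\[\]^"~*?:\\/]|&&|\|\|)')


def escape_lucene(value: str) -> str:
    """Backslash-escape Lucene special characters in a field value."""
    return _LUCENE_SPECIAL.sub(r"\\\1", value)


def build_recording_query(artist: str, track: str,
                          album: Optional[str] = None) -> str:
    """Build the artist/recording/release query with escaped field values."""
    parts = [
        f'artist:"{escape_lucene(artist)}"',
        f'recording:"{escape_lucene(track)}"',
    ]
    if album:
        parts.append(f'release:"{escape_lucene(album)}"')
    return " AND ".join(parts)


# Example: the slash and quotes no longer break the query structure.
print(build_recording_query("AC/DC", 'Say "Hello"'))
```

The resulting string could be passed as the `query` argument of `mb.search_recordings` in place of the hand-built f-string.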
### Response Parsing

**Recording structure:**

```python
{
    "recording": {
        "id": "mbid-uuid",
        "title": "Track Name",
        "length": 123456,  # milliseconds
        "artist-credit": [
            {"artist": {"name": "Artist Name"}}
        ],
        "release-list": [
            {
                "title": "Album Name",
                "date": "2020-01-15",
                "track-list": [
                    {"number": "1"}
                ]
            }
        ],
        "isrc-list": ["GBAYE9200070"]
    }
}
```

**Extraction logic:**

- **MBID:** recording.id
- **Track:** recording.title
- **Artist:** recording.artist-credit[0].artist.name (first artist only)
- **Duration:** recording.length / 1000 (convert milliseconds to seconds)
- **Album:** recording.release-list[0].title (first release only)
- **Release date:** recording.release-list[0].date
- **Track number:** recording.release-list[0].track-list[0].number
- **ISRC:** recording.isrc-list[0] (first ISRC only)

**Multiple values:** MusicBrainz returns lists for artists, releases, and ISRCs. MusicMetaLinker takes the first value only. No aggregation or selection logic.

### Filtering and Matching

**Duration filtering:**

```python
if duration:
    # 'length' is in milliseconds; int() tolerates string values and
    # recordings without a length are skipped
    matches = [
        r for r in results
        if r.get('length') and abs(int(r['length']) / 1000 - duration) < 5
    ]
```

±5 second threshold for metadata searches. Hardcoded.

**Fuzzy string matching:** Uses difflib.SequenceMatcher for artist/track/album similarity.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    # Case-insensitive ratio in [0, 1]
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Match if similarity > 0.8 (80%)
```

Threshold hardcoded at 0.8. No configuration option.

### Error Handling

**Network errors:** Caught and suppressed. Returns None.

**Invalid MBID:** Returns None.

**No results:** Returns None.

**Rate limiting:** No handling. If rate limited, returns None.

Because every failure mode returns None, callers cannot distinguish "no match" from "request failed".

### Integration Strengths

1. **Established library:** musicbrainzngs is maintained by the MusicBrainz community
2. **Rich metadata:** Comprehensive music information
3. **No authentication:** Easy to use
4. **Includes parameter:** Efficient data fetching
5. **Authoritative source:** MusicBrainz is treated as ground truth for music metadata

### Integration Weaknesses

1. **Hardcoded User-Agent:** "elka/0.1" not specific to MusicMetaLinker
2. **No rate limiting:** Risk of being blocked
3. **First-value-only:** Ignores multiple artists, releases, ISRCs
4. **Hardcoded thresholds:** Duration (5s), similarity (0.8) not configurable
5. **No error visibility:** Silent failures

## Deezer Integration

### Library and Authentication

**Library:** deezer-python (community library, not official)

**Authentication:** None required for search API.

**Rate limiting:** Unknown. Not documented. Not enforced by MusicMetaLinker.

### API Endpoints

The deezer-python library handles endpoint construction.

**Base URL:** https://api.deezer.com/

**Endpoints used:**

- Track search: /search/track?q=...
- ISRC search: /track/isrc:{isrc}

### Query Patterns

**By ISRC (preferred):**

```python
import deezer

client = deezer.Client()
result = client.search(f'isrc:{isrc}', relation='track')
```

Returns the list of tracks with that ISRC. Usually a single result.

**By metadata:**

```python
# Plain keyword query: fields are concatenated with spaces
query = f'{artist} {track}'
if album:
    query += f' {album}'

result = client.search(query, relation='track')
```

Simple concatenation. No advanced query syntax.
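The plain concatenation above lets any keyword match any field. Deezer's search endpoint also accepts field-qualified terms such as `artist:"..."`, `track:"..."`, `album:"..."`, `dur_min:` and `dur_max:`. The following is a minimal sketch of building such a query string; `build_deezer_query` and its `tolerance` parameter are hypothetical and are not part of MusicMetaLinker or deezer-python, and whether the result improves matching for a given catalog would need testing.

```python
from typing import Optional


def build_deezer_query(artist: str, track: str,
                       album: Optional[str] = None,
                       duration: Optional[int] = None,
                       tolerance: int = 3) -> str:
    """Build a field-qualified Deezer search string (illustrative only)."""

    def clean(value: str) -> str:
        # Drop double quotes so a value cannot close its quoted field early.
        return value.replace('"', "").strip()

    parts = [f'artist:"{clean(artist)}"', f'track:"{clean(track)}"']
    if album:
        parts.append(f'album:"{clean(album)}"')
    if duration is not None:
        # Duration bounds in seconds, mirroring the ±3 s filter default.
        parts.append(f"dur_min:{max(0, duration - tolerance)}")
        parts.append(f"dur_max:{duration + tolerance}")
    return " ".join(parts)


print(build_deezer_query("Aloe Blacc", "I Need a Dollar", duration=244))
```

The returned string would be passed as the query argument of the existing `client.search(...)` call; pushing the duration bounds into the query is an assumption about what is useful here, and the client-side duration filter can stay in place as a second check.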
### Response Parsing

**Track structure:**

```python
{
    "id": 123456789,
    "title": "Track Name",
    "duration": 123,  # seconds
    "artist": {
        "name": "Artist Name"
    },
    "album": {
        "title": "Album Name"
    },
    "release_date": "2020-01-15",
    "bpm": 120,
    "isrc": "GBAYE9200070",
    "rank": 500000
}
```

**Extraction logic:**

- **Deezer ID:** track.id
- **Track:** track.title
- **Artist:** track.artist.name
- **Album:** track.album.title
- **Duration:** track.duration (already in seconds)
- **Release date:** track.release_date
- **BPM:** track.bpm
- **ISRC:** track.isrc
- **Rank:** track.rank (popularity metric)

### Filtering and Matching

**Duration filtering (critical for Deezer):**

```python
duration_threshold = 3  # seconds

# Keep only tracks whose duration is within the threshold of the target
matches = [
    t for t in results
    if abs(t.duration - duration) <= duration_threshold
]
```

±3 second threshold. Configurable via parameter but defaults to 3.

**Why critical:** Deezer returns many versions of the same track (radio edit, album version, remaster, live). Duration filtering is essential for a correct match.

**Fuzzy matching:** Same difflib.SequenceMatcher approach as MusicBrainz. 0.8 similarity threshold.

**Ranking:** If multiple matches remain after filtering, selects the highest rank (most popular version).

```python
best_match = max(matches, key=lambda t: t.rank)
```

### Error Handling

**Network errors:** Caught and suppressed. Returns None.

**Invalid ISRC:** Returns empty list, treated as no match.

**No results:** Returns None.

### Integration Strengths

1. **Strong ISRC support:** Deezer has comprehensive ISRC coverage
2. **Duration filtering:** Effective for disambiguating versions
3. **Popularity ranking:** Helps select canonical version
4. **BPM data:** Only source of BPM in MusicMetaLinker
5. **Fast API:** Generally faster than MusicBrainz

### Integration Weaknesses

1. **Unofficial library:** deezer-python not maintained by Deezer
2. **No authentication:** Limited to public API (no user-specific data)
3. **Simple search:** No advanced query syntax
4. **Tight default threshold:** The 3-second duration default may not suit all use cases (the similarity threshold is hardcoded)
5. **Commercial bias:** Deezer catalog may not include obscure or independent releases

## YouTube Music Integration

### Library and Authentication

**Library:** ytmusicapi (unofficial, reverse-engineered API)

**Authentication:** None required for search.

**Rate limiting:** Unknown. YouTube may block aggressive usage.

### API Endpoints

ytmusicapi reverse-engineers the YouTube Music web interface. No official API.

**Endpoints:** Internal to ytmusicapi. Not exposed to MusicMetaLinker.

### Query Patterns

**By metadata only:**

```python
from ytmusicapi import YTMusic

ytmusic = YTMusic()

# Skip empty fields so a missing album does not add "None" to the query
query = ' '.join(part for part in (artist, track, album) if part)
results = ytmusic.search(query, filter='songs')
```

**filter='songs':** Excludes videos, albums, playlists. Returns only song results.

**No ISRC support:** YouTube Music API doesn't support ISRC search.

**No MBID support:** YouTube Music doesn't use MBIDs.

### Response Parsing

**Song structure:**

```python
{
    "videoId": "dQw4w9WgXcQ",
    "title": "Track Name",
    "artists": [
        {"name": "Artist Name"}
    ],
    "album": {
        "name": "Album Name"
    },
    "duration": "7:11",  # string format
    "duration_seconds": 431
}
```

**Extraction logic:**

- **YouTube ID:** result.videoId
- **YouTube URL:** f"https://www.youtube.com/watch?v={videoId}"
- **Track:** result.title
- **Artist:** result.artists[0].name (first artist only)
- **Album:** result.album.name

### Filtering and Matching

**No duration filtering:** Duration filtering code is commented out in MusicMetaLinker.

```python
# if duration:
#     matches = [r for r in results if abs(r['duration_seconds'] - duration) < 10]
```

**Why commented out:** Unknown. Possibly unreliable duration data from YouTube.

**No fuzzy matching:** First result assumed correct.

```python
best_match = results[0] if results else None
```

**Critical weakness:** High false positive rate. Nothing validates that the first result is the correct match.
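One way to reduce the false positives described above is to validate candidates before accepting one. The following is a minimal sketch that reuses the similarity measure and 0.8 threshold from the other integrations plus the 10-second window from the commented-out filter; `pick_best_song` and its parameters are hypothetical and do not exist in MusicMetaLinker, and the result dictionaries are assumed to have the shape shown in the song structure.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive similarity ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def pick_best_song(results, artist, track, duration=None,
                   min_similarity=0.8, max_duration_diff=10):
    """Return the best validated search result, or None if none qualifies."""
    best, best_score = None, 0.0
    for r in results:
        artists = r.get("artists") or []
        r_artist = artists[0]["name"] if artists else ""
        # A candidate is only as good as its weaker field.
        score = min(similarity(track, r.get("title", "")),
                    similarity(artist, r_artist))
        if score < min_similarity:
            continue
        secs = r.get("duration_seconds")
        # Only filter on duration when both sides actually have one.
        if (duration is not None and secs is not None
                and abs(secs - duration) > max_duration_diff):
            continue
        if score > best_score:
            best, best_score = r, score
    return best


candidates = [
    {"videoId": "a", "title": "Hello (Live)",
     "artists": [{"name": "Adele"}], "duration_seconds": 400},
    {"videoId": "b", "title": "Hello",
     "artists": [{"name": "Adele"}], "duration_seconds": 295},
]
match = pick_best_song(candidates, "Adele", "Hello", duration=296)
print(match["videoId"] if match else None)
```

Returning None when nothing passes the thresholds trades recall for precision, which seems preferable to silently linking the wrong video; the thresholds themselves are assumptions carried over from the other integrations.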
### Error Handling

**Network errors:** Caught and suppressed. Returns None.

**No results:** Returns None.

**API changes:** ytmusicapi may break if YouTube changes its web interface. No error handling for this.

### Integration Strengths

1. **Broad coverage:** YouTube Music has an extensive catalog
2. **No authentication:** Easy to use
3. **Filter parameter:** Excludes non-song results

### Integration Weaknesses

1. **Unofficial API:** Reverse-engineered, fragile to changes
2. **No duration filtering:** Commented out, high false positive rate
3. **First-result-only:** No ranking or validation
4. **No ISRC support:** Can't use authoritative identifiers
5. **Legal risk:** Reverse-engineered APIs may violate ToS
6. **No breakage detection:** API breakage causes silent failures

## AcousticBrainz Integration

### Library and Authentication

**Library:** requests (direct HTTP calls)

**Authentication:** None.

### API Endpoints

**Base URL:** https://acousticbrainz.org/

**Endpoint:** /{mbid}

### Query Pattern

```python
import requests

def acousticbrainz_link(mbid):
    """Return the AcousticBrainz page URL for an MBID if it resolves, else None."""
    url = f"https://acousticbrainz.org/{mbid}"
    response = requests.get(url)
    return url if response.status_code == 200 else None
```

Simple HTTP GET. Returns the URL if the MBID exists, None otherwise.

### Critical Issue: Service Shutdown

**AcousticBrainz shut down in 2022.** All queries return 404.

**Impact:** This integration is completely non-functional. Dead code.

**Why still in codebase:** Unknown. Possibly not updated since shutdown.

**Recommendation:** Remove this integration entirely.

### Integration Strengths

None. Service is defunct.

### Integration Weaknesses

1. **Service shut down:** Non-functional
2. **Dead code:** Wastes execution time
3. **Misleading output:** CSV includes acousticbrainz column (always null)
4. **No deprecation notice:** Code doesn't warn users

## Spotify Integration

### Library and Authentication

**Library:** spotipy (community-maintained Python client, not an official Spotify library)

**Authentication:** OAuth2 client credentials flow.
**Credentials:** Stored in external mml_secrets.py file (not in repository).

**mml_secrets.py structure:**

```python
SPOTIFY_CLIENT_ID = "your-client-id"
SPOTIFY_CLIENT_SECRET = "your-client-secret"
```

### Usage Scope

**Limited use:** Spotify integration is only used in the Billboard dataset cleaning script (prepare_dataset.py).

**Not used in main Align workflow.** Spotify is not queried by the Align class.

### Query Pattern

```python
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials
from mml_secrets import SPOTIFY_CLIENT_ID, SPOTIFY_CLIENT_SECRET

# Client credentials flow: app-level access, no user login
auth_manager = SpotifyClientCredentials(
    client_id=SPOTIFY_CLIENT_ID,
    client_secret=SPOTIFY_CLIENT_SECRET
)
sp = spotipy.Spotify(auth_manager=auth_manager)

result = sp.search(q=f'track:{track} artist:{artist}', type='track', limit=1)
```

### Use Case

**Billboard dataset cleaning:** Extract ISRCs from Spotify for Billboard chart tracks.

**Workflow:**

1. Billboard dataset has artist/track names but no ISRCs
2. Query Spotify by artist/track
3. Extract ISRC from Spotify result
4. Use ISRC for subsequent MusicBrainz/Deezer queries

### Integration Strengths

1. **Mature library:** spotipy is a widely used, community-maintained client
2. **OAuth2:** Secure authentication
3. **Rich metadata:** Comprehensive track information
4. **ISRC support:** Spotify provides ISRCs

### Integration Weaknesses

1. **Requires credentials:** Users must register a Spotify app
2. **External secrets file:** mml_secrets.py not in repository, must be created manually
3. **Limited use:** Only for dataset preparation, not main workflow
4. **No documentation:** No instructions for obtaining credentials

## Integration Comparison

| Service | Library | Auth | ISRC Support | Duration Filtering | Matching Quality | Status |
|---------|---------|------|--------------|-------------------|------------------|--------|
| MusicBrainz | musicbrainzngs | None | Yes | ±5s | Fuzzy (0.8) | Active |
| Deezer | deezer-python | None | Yes | ±3s | Fuzzy (0.8) + Rank | Active |
| YouTube Music | ytmusicapi | None | No | Commented out | First result | Active (fragile) |
| AcousticBrainz | requests | None | N/A | N/A | N/A | Defunct |
| Spotify | spotipy | OAuth2 | Yes | N/A | N/A | Active (limited use) |

## Integration Orchestration

### Service Selection Logic

**Priority order:**

1. **MusicBrainz** if MBID provided (authoritative)
2. **Deezer** if ISRC provided (fast, reliable)
3. **MusicBrainz** if metadata provided (fallback)
4. **Deezer** if metadata provided (fallback)
5. **YouTube Music** if metadata provided (last resort)

### Parallel vs Sequential

**Sequential execution:** Services queried one at a time. No parallelization.

**Implications:**

- Total latency = sum of all service latencies
- Slow for batch processing
- Simple error handling (no race conditions)

### Result Aggregation

**No cross-validation:** Results from different services are not compared.

**First-wins strategy:** The first successful query for each field is used.

**Example:**

- MBID from MusicBrainz
- ISRC from Deezer (if not in MusicBrainz)
- BPM from Deezer (only source)
- YouTube link from YouTube Music

**No conflict resolution:** If MusicBrainz and Deezer return different artists, no reconciliation.

## Integration Error Handling

### Network Errors

All network errors are caught and suppressed. Returns None.

**No retry logic:** Single attempt per service.

**No exponential backoff:** Immediate failure on error.

**No circuit breaker:** Repeated failures don't disable service.
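The missing retry and backoff behaviour could be added with a small wrapper around each service call. The following is a minimal sketch under stated assumptions: `call_with_retry` is a hypothetical helper, the attempt count and delays are illustrative, and catching `Exception` stands in for whatever network error types each client library actually raises.

```python
import time


def call_with_retry(func, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func(); on failure wait base_delay * 2**attempt and try again.

    Returns func()'s result, or None once all attempts have failed,
    which matches the existing "return None on error" convention.
    """
    for attempt in range(attempts):
        try:
            return func()
        except Exception:  # assumption: narrow this to network errors in real code
            if attempt == attempts - 1:
                return None
            sleep(base_delay * 2 ** attempt)
    return None


# Usage with a stand-in for a flaky service call; delays are recorded
# instead of slept so the example runs instantly.
state = {"calls": 0}


def flaky_lookup():
    state["calls"] += 1
    if state["calls"] < 3:
        raise ConnectionError("temporary failure")
    return {"mbid": "mbid-uuid"}


recorded_delays = []
print(call_with_retry(flaky_lookup, sleep=recorded_delays.append))
print(recorded_delays)
```

Injecting `sleep` keeps the helper testable without real waiting; a fuller version might add jitter and a per-service failure counter to approximate a circuit breaker.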
### Rate Limiting

**No rate limiting implementation.**

**Risks:**

- MusicBrainz: Recommends 1 req/s, may block aggressive usage
- Deezer: Unknown limits, may block
- YouTube Music: Unknown limits, may block or break API

**Batch processing:** High risk of rate limiting (no delays between requests).

### Service Unavailability

**No health checks:** Services assumed available.

**No fallback:** If MusicBrainz is down, there is no alternative for MBID lookup.

**No status monitoring:** No logging of service failures.

## Integration Security

### API Keys

**MusicBrainz, Deezer, YouTube Music:** No API keys required.

**Spotify:** Client credentials in external file (not encrypted).

### Data Privacy

**No personal data sent:** Only public music metadata queried.

**No user tracking:** No analytics sent to services.

### HTTPS

All services use HTTPS. No plaintext HTTP.

### Input Sanitization

**No sanitization:** Metadata strings passed directly to APIs.

**Potential risks:**

- Query injection (if services use SQL/NoSQL)
- Command injection (if services execute shell commands)

**Actual risk:** Low. Requests go through HTTP client libraries that encode query parameters; the most likely failure is a malformed search query (for example, an unescaped double quote in a Lucene query), not code execution.

## Integration Recommendations

### Immediate Fixes

1. **Remove AcousticBrainz:** Delete defunct integration
2. **Fix User-Agent:** Change "elka/0.1" to "MusicMetaLinker/0.0.1"
3. **Add rate limiting:** Implement delays between requests
4. **Document Spotify setup:** Instructions for obtaining credentials

### Short-Term Improvements

1. **Add retry logic:** Exponential backoff for network errors
2. **Add timeout configuration:** Configurable request timeouts
3. **Enable YouTube duration filtering:** Uncomment and test
4. **Add error logging:** Log service failures
5. **Add health checks:** Verify service availability before queries

### Long-Term Enhancements

1. **Parallel queries:** Use asyncio for concurrent API calls
2. **Cross-validation:** Compare results across services
3. **Confidence scores:** Indicate match quality
4. **Service abstraction:** Common interface for all services
5. **Plugin architecture:** Allow adding new services without code changes
6. **Caching layer:** Reduce redundant API calls
7. **Circuit breaker:** Disable failing services temporarily
8. **Metrics collection:** Track success rates, latencies per service

## Integration Value Assessment

**High value:**

- MusicBrainz: Authoritative, comprehensive, reliable
- Deezer: Fast, good ISRC coverage, BPM data

**Medium value:**

- Spotify: Useful for dataset preparation, requires setup

**Low value:**

- YouTube Music: Weak matching, fragile API, high false positives
- AcousticBrainz: Defunct, zero value

**Recommendation:** Keep MusicBrainz and Deezer. Remove AcousticBrainz. Improve YouTube Music matching or remove it. Keep Spotify for dataset preparation.
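As an illustration of the service abstraction and cross-validation items under Long-Term Enhancements, the sketch below treats every service as a callable with the same signature, keeps the first-wins merge described under Result Aggregation, and additionally records conflicting values instead of discarding them. `link_track`, the `Lookup` type and the stub services are all hypothetical; none of them exist in MusicMetaLinker.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical common interface: a service takes the query fields and
# returns a dict of metadata fields, or None when nothing matched.
Lookup = Callable[[dict], Optional[dict]]


def link_track(query: dict, services: List[Tuple[str, Lookup]]):
    """Merge results in priority order: first non-null value per field wins."""
    merged: Dict[str, object] = {}
    sources: Dict[str, str] = {}
    conflicts: List[Tuple[str, str, object]] = []
    for name, lookup in services:
        try:
            result = lookup(query) or {}
        except Exception:  # a failing service must not abort the others
            result = {}
        for field, value in result.items():
            if value is None:
                continue
            if field not in merged:
                merged[field] = value
                sources[field] = name
            elif merged[field] != value:
                # Later services disagree: keep the first value, log the rest.
                conflicts.append((field, name, value))
    return merged, sources, conflicts


# Stub services standing in for the real integrations.
def musicbrainz_stub(query):
    return {"mbid": "mbid-uuid", "artist": "Artist Name", "isrc": None}


def deezer_stub(query):
    return {"isrc": "GBAYE9200070", "artist": "ARTIST NAME", "bpm": 120}


merged, sources, conflicts = link_track(
    {"artist": "Artist Name", "track": "Track Name"},
    [("musicbrainz", musicbrainz_stub), ("deezer", deezer_stub)],
)
print(merged)
print(conflicts)
```

The order of the `services` list encodes the priority order, so adding or removing a service becomes a change to that list rather than to the merge logic; how conflicts should be resolved (for example by fuzzy comparison) is left open here.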