# MusicMetaLinker Integrations

## Integration Architecture

MusicMetaLinker integrates with five external services:
1. MusicBrainz (open music encyclopedia)
2. Deezer (commercial streaming service)
3. YouTube Music (commercial streaming service)
4. AcousticBrainz (audio analysis database - defunct)
5. Spotify (commercial streaming service - limited use)

Each integration uses a different library and authentication approach.

## MusicBrainz Integration

### Library and Authentication

**Library:** musicbrainzngs (community-maintained Python bindings for the MusicBrainz web service)

**Authentication:** None required for read-only queries.

**User-Agent:** Required by the MusicBrainz API. Hardcoded as "elka/0.1" (appears to come from a parent project; not MusicMetaLinker-specific).

**Rate limiting:** MusicBrainz asks clients to stay at or below 1 request/second. MusicMetaLinker adds no throttling of its own; musicbrainzngs applies a default limit of about one request per second unless that is disabled with `set_rate_limit`.

### API Endpoints

All queries go through the musicbrainzngs library, which handles endpoint construction.

**Base URL:** https://musicbrainz.org/ws/2/

**Endpoints used:**
- Recording search: /recording?query=...
- Recording lookup: /recording/{mbid}
- ISRC search: /isrc/{isrc}

### Query Patterns

**By MBID (most reliable):**

```python
import musicbrainzngs as mb

mb.set_useragent("elka", "0.1")
result = mb.get_recording_by_id(
    mbid,
    includes=["artists", "releases", "isrcs"]
)
```

**includes parameter:** Fetches related entities in a single request. Reduces API calls.

**By ISRC:**

```python
result = mb.get_recordings_by_isrc(
    isrc,
    includes=["artists", "releases", "isrcs"]
)
```

Returns a list of recordings with that ISRC. Multiple recordings possible (different releases, remasters).

**By metadata:**

```python
query = f'artist:"{artist}" AND recording:"{track}"'
if album:
    query += f' AND release:"{album}"'

result = mb.search_recordings(
    query=query,
    limit=10
)
```

Lucene query syntax. Quoted strings for exact matching. Returns ranked results.

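
One caveat with the f-string query above: a double quote or backslash inside `artist`, `track`, or `album` ends or corrupts the quoted Lucene phrase. A minimal sketch of quoting the values first follows. The helper names `quote_phrase` and `build_recording_query` are hypothetical and are not part of MusicMetaLinker or musicbrainzngs.

```python
from typing import Optional


def quote_phrase(value: str) -> str:
    """Wrap a value in double quotes, escaping the two characters that
    would otherwise end or corrupt a quoted Lucene phrase."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def build_recording_query(artist: str, track: str,
                          album: Optional[str] = None) -> str:
    """Build the artist/recording/release query with quoting applied."""
    parts = [
        f"artist:{quote_phrase(artist)}",
        f"recording:{quote_phrase(track)}",
    ]
    if album:
        parts.append(f"release:{quote_phrase(album)}")
    return " AND ".join(parts)


print(build_recording_query("AC/DC", 'Back in "Black"'))
```

The resulting string can be passed as the `query` argument in place of the hand-built one.
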
### Response Parsing

**Recording structure:**

```python
{
    "recording": {
        "id": "mbid-uuid",
        "title": "Track Name",
        "length": "123456",  # milliseconds, returned as a string
        "artist-credit": [
            {"artist": {"name": "Artist Name"}}
        ],
        "release-list": [
            {
                "title": "Album Name",
                "date": "2020-01-15",
                "track-list": [
                    {"number": "1"}
                ]
            }
        ],
        "isrc-list": ["GBAYE9200070"]
    }
}
```

**Extraction logic:**

- **MBID:** recording.id
- **Track:** recording.title
- **Artist:** recording.artist-credit[0].artist.name (first artist only)
- **Duration:** int(recording.length) / 1000 (musicbrainzngs returns the length as a string of milliseconds; convert to seconds)
- **Album:** recording.release-list[0].title (first release only)
- **Release date:** recording.release-list[0].date
- **Track number:** recording.release-list[0].track-list[0].number
- **ISRC:** recording.isrc-list[0] (first ISRC only)

**Multiple values:** MusicBrainz returns lists for artists, releases, ISRCs. MusicMetaLinker takes the first value only. No aggregation or selection logic.

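
The first-value-only extraction can be sketched as a small function over the dictionary shape shown above. This is an illustrative sketch, not MusicMetaLinker's code: `first` and `extract_recording_fields` are hypothetical names, the sample data is made up, and the function assumes that `length` may arrive as a string and that `artist-credit` may contain plain-string join phrases between artist entries.

```python
from typing import Any, Optional


def first(items: Optional[list]) -> Any:
    """Return the first element of a list, or None if missing or empty."""
    return items[0] if items else None


def extract_recording_fields(recording: dict) -> dict:
    """Flatten a recording dict to one value per field (first value wins)."""
    # Keep only dict entries; join phrases such as " feat. " are strings.
    credits = [c for c in recording.get("artist-credit", [])
               if isinstance(c, dict)]
    artist = first(credits)
    release = first(recording.get("release-list")) or {}
    track = first(release.get("track-list")) or {}
    length = recording.get("length")
    return {
        "mbid": recording.get("id"),
        "track": recording.get("title"),
        "artist": artist["artist"]["name"] if artist else None,
        "duration": int(length) / 1000 if length is not None else None,
        "album": release.get("title"),
        "release_date": release.get("date"),
        "track_number": track.get("number"),
        "isrc": first(recording.get("isrc-list")),
    }


sample = {
    "id": "mbid-uuid",
    "title": "Track Name",
    "length": "123456",
    "artist-credit": [{"artist": {"name": "Artist Name"}}, " feat. ",
                      {"artist": {"name": "Guest"}}],
    "release-list": [{"title": "Album Name", "date": "2020-01-15",
                      "track-list": [{"number": "1"}]}],
    "isrc-list": ["GBAYE9200070"],
}
fields = extract_recording_fields(sample)
```

Every lookup falls back to `None` when a list is empty or a key is absent, so a sparse recording does not raise.
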
### Filtering and Matching

**Duration filtering:**

```python
if duration:
    # "length" is a string of milliseconds and may be missing
    matches = [r for r in results
               if "length" in r and abs(int(r["length"]) / 1000 - duration) < 5]
```

Candidates must be within 5 seconds of the expected duration. Applied to metadata searches. Hardcoded.

**Fuzzy string matching:**

Uses difflib.SequenceMatcher for artist/track/album similarity.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Match if similarity > 0.8 (80%)
```

Threshold hardcoded at 0.8. No configuration option.

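
Both thresholds could be lifted into parameters instead of literals. The sketch below is an assumption about how that might look, not the project's code. `MatchConfig` and `is_match` are hypothetical names, and the defaults simply mirror the hardcoded values described above.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Optional


@dataclass(frozen=True)
class MatchConfig:
    max_duration_diff: float = 5.0  # seconds
    min_similarity: float = 0.8     # SequenceMatcher ratio in [0, 1]


def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def is_match(wanted_title: str, wanted_duration: Optional[float],
             found_title: str, found_length_ms: Optional[int],
             config: MatchConfig = MatchConfig()) -> bool:
    """Accept a candidate only if the title is similar enough and, when
    both durations are known, they differ by less than the tolerance."""
    if similarity(wanted_title, found_title) <= config.min_similarity:
        return False
    if wanted_duration is not None and found_length_ms is not None:
        diff = abs(found_length_ms / 1000 - wanted_duration)
        return diff < config.max_duration_diff
    return True
```

A caller that needs looser matching could pass `MatchConfig(max_duration_diff=20)` without touching the matching code.
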
### Error Handling

**Network errors:** Caught and suppressed. Returns None.

**Invalid MBID:** Returns None.

**No results:** Returns None.

**Rate limiting:** No handling. If rate limited, returns None.

### Integration Strengths

1. **Established library:** musicbrainzngs is maintained by the MusicBrainz community
2. **Rich metadata:** Comprehensive music information
3. **No authentication:** Easy to use
4. **Includes parameter:** Efficient data fetching
5. **Authoritative source:** MusicBrainz is widely treated as the reference source for music metadata

### Integration Weaknesses

1. **Hardcoded User-Agent:** "elka/0.1" not specific to MusicMetaLinker
2. **No rate limiting of its own:** Relies entirely on the library's default throttle; risk of being blocked if that is bypassed
3. **First-value-only:** Ignores multiple artists, releases, ISRCs
4. **Hardcoded thresholds:** Duration (5s), similarity (0.8) not configurable
5. **No error visibility:** Silent failures

## Deezer Integration

### Library and Authentication

**Library:** deezer-python (community library, not official)

**Authentication:** None required for search API.

**Rate limiting:** Unknown. Not documented. Not enforced by MusicMetaLinker.

### API Endpoints

The deezer-python library handles endpoint construction.

**Base URL:** https://api.deezer.com/

**Endpoints used:**
- Track search: /search/track?q=...
- ISRC search: /track/isrc:{isrc}

### Query Patterns

**By ISRC (preferred):**

```python
import deezer

client = deezer.Client()
result = client.search(f'isrc:{isrc}', relation='track')
```

Returns a list of tracks with that ISRC. Usually a single result.

**By metadata:**

```python
query = f'{artist} {track}'
if album:
    query += f' {album}'

result = client.search(query, relation='track')
```

Simple concatenation. Deezer's advanced search qualifiers (such as `artist:"..."` and `track:"..."`) are not used.

### Response Parsing

**Track structure:**

```python
{
    "id": 123456789,
    "title": "Track Name",
    "duration": 123,  # seconds
    "artist": {
        "name": "Artist Name"
    },
    "album": {
        "title": "Album Name"
    },
    "release_date": "2020-01-15",
    "bpm": 120,
    "isrc": "GBAYE9200070",
    "rank": 500000
}
```

**Extraction logic:**

- **Deezer ID:** track.id
- **Track:** track.title
- **Artist:** track.artist.name
- **Album:** track.album.title
- **Duration:** track.duration (already in seconds)
- **Release date:** track.release_date
- **BPM:** track.bpm
- **ISRC:** track.isrc
- **Rank:** track.rank (popularity metric)

### Filtering and Matching

**Duration filtering (critical for Deezer):**

```python
duration_threshold = 3  # seconds

matches = [
    t for t in results
    if abs(t.duration - duration) <= duration_threshold
]
```

±3 second threshold. Configurable via parameter but defaults to 3.

**Why critical:** Deezer returns many versions of the same track (radio edit, album version, remaster, live). Duration filtering is essential for a correct match.

**Fuzzy matching:**

Same difflib.SequenceMatcher approach as MusicBrainz. 0.8 similarity threshold.

**Ranking:**

If multiple matches remain after filtering, selects the highest rank (most popular version).

```python
best_match = max(matches, key=lambda t: t.rank)
```

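
Taken together, the Deezer steps are: filter by duration, then pick the most popular survivor. A self-contained sketch with stand-in data follows. `Track` is a hypothetical stand-in for the objects returned by deezer-python, `select_best` is not a function from MusicMetaLinker, and the sample tracks are invented.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Track:
    title: str
    duration: int  # seconds
    rank: int      # Deezer popularity metric


def select_best(results: Iterable[Track], duration: float,
                duration_threshold: float = 3) -> Optional[Track]:
    """Keep tracks within the duration tolerance, then take the top rank."""
    matches = [t for t in results
               if abs(t.duration - duration) <= duration_threshold]
    # A bare max() raises ValueError on an empty list, so give a default.
    return max(matches, key=lambda t: t.rank, default=None)


candidates = [
    Track("Song (Radio Edit)", 201, 900_000),
    Track("Song (Album Version)", 243, 650_000),
    Track("Song (Remastered)", 244, 700_000),
]
best = select_best(candidates, duration=243)
```

Here the radio edit has the highest rank but is excluded by duration, so the more popular of the two remaining versions is chosen. The `default=None` argument matters because a plain `max(matches, ...)` fails when the filter removes every candidate.
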
### Error Handling

**Network errors:** Caught and suppressed. Returns None.

**Invalid ISRC:** Returns empty list, treated as no match.

**No results:** Returns None.

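
Because an invalid ISRC only surfaces as an empty result, it can help to check the format before querying. An ISRC has 12 characters: a two-letter country code, a three-character alphanumeric registrant code, a two-digit year, and a five-digit designation code. The helper below is a hypothetical sketch and not part of MusicMetaLinker.

```python
import re
from typing import Optional

# CC + XXX + YY + NNNNN, written without separators
ISRC_PATTERN = re.compile(r"[A-Z]{2}[A-Z0-9]{3}[0-9]{2}[0-9]{5}")


def normalize_isrc(raw: str) -> Optional[str]:
    """Return the ISRC in compact upper-case form, or None if malformed."""
    compact = raw.replace("-", "").replace(" ", "").upper()
    return compact if ISRC_PATTERN.fullmatch(compact) else None
```

A caller could skip the Deezer request (and log the reason) whenever `normalize_isrc` returns `None`, which separates "bad input" from "no match".
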
### Integration Strengths

1. **Strong ISRC support:** Deezer has comprehensive ISRC coverage
2. **Duration filtering:** Effective for disambiguating versions
3. **Popularity ranking:** Helps select canonical version
4. **BPM data:** Only source of BPM in MusicMetaLinker
5. **Fast API:** Generally faster than MusicBrainz

### Integration Weaknesses

1. **Unofficial library:** deezer-python not maintained by Deezer
2. **No authentication:** Limited to public API (no user-specific data)
3. **Simple search:** Advanced query syntax not used
4. **Tight default threshold:** The 3-second duration default may not suit all use cases, although it can be overridden
5. **Commercial bias:** Deezer catalog may not include obscure or independent releases

## YouTube Music Integration

### Library and Authentication

**Library:** ytmusicapi (unofficial, reverse-engineered API)

**Authentication:** None required for search.

**Rate limiting:** Unknown. YouTube may block aggressive usage.

### API Endpoints

ytmusicapi reverse-engineers the YouTube Music web interface. No official API.

**Endpoints:** Internal to ytmusicapi. Not exposed to MusicMetaLinker.

### Query Patterns

**By metadata only:**

```python
from ytmusicapi import YTMusic

ytmusic = YTMusic()
query = f'{artist} {track} {album}'
results = ytmusic.search(query, filter='songs')
```

**filter='songs':** Excludes videos, albums, playlists. Returns only song results.

**No ISRC support:** YouTube Music API doesn't support ISRC search.

**No MBID support:** YouTube Music doesn't use MBIDs.

### Response Parsing

**Song structure:**

```python
{
    "videoId": "dQw4w9WgXcQ",
    "title": "Track Name",
    "artists": [
        {"name": "Artist Name"}
    ],
    "album": {
        "name": "Album Name"
    },
    "duration": "7:11",  # string format
    "duration_seconds": 431
}
```

**Extraction logic:**

- **YouTube ID:** result.videoId
- **YouTube URL:** f"https://www.youtube.com/watch?v={videoId}"
- **Track:** result.title
- **Artist:** result.artists[0].name (first artist only)
- **Album:** result.album.name

### Filtering and Matching

**No duration filtering:** Duration filtering code is commented out in MusicMetaLinker.

```python
# if duration:
#     matches = [r for r in results if abs(r['duration_seconds'] - duration) < 10]
```

**Why commented out:** Unknown. Possibly unreliable duration data from YouTube.

**No fuzzy matching:** First result assumed correct.

```python
best_match = results[0] if results else None
```

**Critical weakness:** High false positive rate. No validation that the first result is a correct match.

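
A lightweight validation step would reduce false positives. The sketch below parses the display string in `duration` and rejects candidates whose title or duration is too far off. `parse_duration` and `pick_validated` are hypothetical helpers, and the 10-second and 0.8 thresholds are borrowed from the commented-out code and the other integrations rather than from any tested configuration.

```python
from difflib import SequenceMatcher
from typing import Optional


def parse_duration(text: str) -> int:
    """Convert "M:SS" or "H:MM:SS" to a number of seconds."""
    seconds = 0
    for part in text.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


def pick_validated(results: list, track: str,
                   duration: Optional[float] = None,
                   max_diff: float = 10,
                   min_similarity: float = 0.8) -> Optional[dict]:
    """Return the first result that passes the title and duration checks."""
    for r in results:
        ratio = SequenceMatcher(None, track.lower(), r["title"].lower()).ratio()
        if ratio <= min_similarity:
            continue
        if duration is not None:
            if abs(parse_duration(r["duration"]) - duration) >= max_diff:
                continue
        return r
    return None
```

Unlike `results[0]`, this returns `None` when nothing plausible is found, so a wrong link is not written into the output.
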
### Error Handling

**Network errors:** Caught and suppressed. Returns None.

**No results:** Returns None.

**API changes:** ytmusicapi may break if YouTube changes its web interface. No error handling for this.

### Integration Strengths

1. **Broad coverage:** YouTube Music has an extensive catalog
2. **No authentication:** Easy to use
3. **Filter parameter:** Excludes non-song results

### Integration Weaknesses

1. **Unofficial API:** Reverse-engineered, fragile to changes
2. **No duration filtering:** Commented out, high false positive rate
3. **First-result-only:** No ranking or validation
4. **No ISRC support:** Can't use authoritative identifiers
5. **Legal risk:** Reverse-engineered APIs may violate ToS
6. **No error handling:** API breakage causes silent failures

## AcousticBrainz Integration

### Library and Authentication

**Library:** requests (direct HTTP calls)

**Authentication:** None.

### API Endpoints

**Base URL:** https://acousticbrainz.org/

**Endpoint:** /{mbid}

### Query Pattern

```python
import requests

def acousticbrainz_link(mbid):
    url = f"https://acousticbrainz.org/{mbid}"
    response = requests.get(url)
    return url if response.status_code == 200 else None
```

Simple HTTP GET. Returns the URL if the MBID exists, None otherwise.

### Critical Issue: Service Shutdown

**AcousticBrainz shut down in 2022.** All queries return 404.

**Impact:** This integration is completely non-functional. Dead code.

**Why still in codebase:** Unknown. Possibly not updated since shutdown.

**Recommendation:** Remove this integration entirely.

### Integration Strengths

None. Service is defunct.

### Integration Weaknesses

1. **Service shut down:** Non-functional
2. **Dead code:** Wastes execution time
3. **Misleading output:** CSV includes acousticbrainz column (always null)
4. **No deprecation notice:** Code doesn't warn users

## Spotify Integration

### Library and Authentication

**Library:** spotipy (community-maintained Python client for the Spotify Web API; not an official Spotify library)

**Authentication:** OAuth2 client credentials flow.

**Credentials:** Stored in external mml_secrets.py file (not in repository).

**mml_secrets.py structure:**

```python
SPOTIFY_CLIENT_ID = "your-client-id"
SPOTIFY_CLIENT_SECRET = "your-client-secret"
```

### Usage Scope

**Limited use:** Spotify integration is only used in the Billboard dataset cleaning script (prepare_dataset.py).

**Not used in main Align workflow.** Spotify is not queried by the Align class.

### Query Pattern

```python
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials
from mml_secrets import SPOTIFY_CLIENT_ID, SPOTIFY_CLIENT_SECRET

auth_manager = SpotifyClientCredentials(
    client_id=SPOTIFY_CLIENT_ID,
    client_secret=SPOTIFY_CLIENT_SECRET
)
sp = spotipy.Spotify(auth_manager=auth_manager)

result = sp.search(q=f'track:{track} artist:{artist}', type='track', limit=1)
```

### Use Case

**Billboard dataset cleaning:** Extract ISRCs from Spotify for Billboard chart tracks.

**Workflow:**
1. Billboard dataset has artist/track names but no ISRCs
2. Query Spotify by artist/track
3. Extract ISRC from Spotify result
4. Use ISRC for subsequent MusicBrainz/Deezer queries

### Integration Strengths

1. **Mature library:** spotipy is a widely used, community-maintained client
2. **OAuth2:** Secure authentication
3. **Rich metadata:** Comprehensive track information
4. **ISRC support:** Spotify provides ISRCs

### Integration Weaknesses

1. **Requires credentials:** Users must register a Spotify app
2. **External secrets file:** mml_secrets.py not in repository, must be created manually
3. **Limited use:** Only for dataset preparation, not main workflow
4. **No documentation:** No instructions for obtaining credentials

## Integration Comparison

| Service | Library | Auth | ISRC Support | Duration Filtering | Matching Quality | Status |
|---------|---------|------|--------------|-------------------|------------------|--------|
| MusicBrainz | musicbrainzngs | None | Yes | ±5s | Fuzzy (0.8) | Active |
| Deezer | deezer-python | None | Yes | ±3s | Fuzzy (0.8) + Rank | Active |
| YouTube Music | ytmusicapi | None | No | Commented out | First result | Active (fragile) |
| AcousticBrainz | requests | None | N/A | N/A | N/A | Defunct |
| Spotify | spotipy | OAuth2 | Yes | N/A | N/A | Active (limited use) |

## Integration Orchestration

### Service Selection Logic

**Priority order:**

1. **MusicBrainz** if MBID provided (authoritative)
2. **Deezer** if ISRC provided (fast, reliable)
3. **MusicBrainz** if metadata provided (fallback)
4. **Deezer** if metadata provided (fallback)
5. **YouTube Music** if metadata provided (last resort)

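
The priority order above can be expressed as a small planning function. This is a sketch of the described ordering only: `plan_queries` and the service labels are hypothetical, and it assumes every applicable step is attempted in sequence, which is an interpretation of the list rather than confirmed behavior.

```python
from typing import List, Optional, Tuple


def plan_queries(mbid: Optional[str] = None, isrc: Optional[str] = None,
                 has_metadata: bool = False) -> List[Tuple[str, str]]:
    """Return (service, lookup key) pairs in the listed priority order."""
    plan: List[Tuple[str, str]] = []
    if mbid:
        plan.append(("musicbrainz", "mbid"))
    if isrc:
        plan.append(("deezer", "isrc"))
    if has_metadata:
        plan.extend([
            ("musicbrainz", "metadata"),
            ("deezer", "metadata"),
            ("youtube_music", "metadata"),
        ])
    return plan
```

Making the plan an explicit list also makes it easy to log which lookups were attempted for a given input row.
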
### Parallel vs Sequential

**Sequential execution:** Services are queried one at a time. No parallelization.

**Implications:**
- Total latency = sum of all service latencies
- Slow for batch processing
- Simple error handling (no race conditions)

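
Since the lookups are independent network calls, they could run concurrently so that total latency approaches that of the slowest service rather than the sum. The sketch below uses the standard-library thread pool. `query_all` and `fake_service` are hypothetical, and the fake services only sleep to stand in for network latency.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Optional


def query_all(services: Dict[str, Callable[[], Optional[dict]]]
              ) -> Dict[str, Optional[dict]]:
    """Run every service callable in its own thread and collect results.
    A service that raises is recorded as None."""
    if not services:
        return {}  # ThreadPoolExecutor rejects max_workers=0
    results: Dict[str, Optional[dict]] = {}
    with ThreadPoolExecutor(max_workers=len(services)) as pool:
        futures = {name: pool.submit(fn) for name, fn in services.items()}
        for name, future in futures.items():
            try:
                results[name] = future.result()
            except Exception:
                results[name] = None
    return results


def fake_service(name: str, delay: float) -> Callable[[], dict]:
    def call() -> dict:
        time.sleep(delay)  # stands in for network latency
        return {"source": name}
    return call


results = query_all({
    "musicbrainz": fake_service("musicbrainz", 0.05),
    "deezer": fake_service("deezer", 0.05),
})
```

Concurrency across services does not relax each service's own rate limit, so per-service throttling would still be needed for batch runs.
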
### Result Aggregation

**No cross-validation:** Results from different services are not compared.

**First-wins strategy:** The first successful query for each field is used.

**Example:**
- MBID from MusicBrainz
- ISRC from Deezer (if not in MusicBrainz)
- BPM from Deezer (only source)
- YouTube link from YouTube Music

**No conflict resolution:** If MusicBrainz and Deezer return different artists, there is no reconciliation.

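
A first-wins merge is short to write, and recording disagreements costs only a few extra lines. The sketch below is hypothetical: `merge_first_wins` is not a MusicMetaLinker function, and the conflict list is an addition that the behavior described above does not have.

```python
from typing import Dict, List, Optional, Tuple

ServiceResult = Tuple[str, Dict[str, Optional[str]]]


def merge_first_wins(results: List[ServiceResult]):
    """Merge per-service results given in priority order.

    Returns (merged, conflicts). merged keeps the first non-empty value
    per field; conflicts lists (field, winning service, losing service)
    wherever a later service returned a different value.
    """
    merged: Dict[str, str] = {}
    sources: Dict[str, str] = {}
    conflicts: List[Tuple[str, str, str]] = []
    for service, fields in results:
        for key, value in fields.items():
            if value is None:
                continue
            if key not in merged:
                merged[key] = value
                sources[key] = service
            elif merged[key] != value:
                conflicts.append((key, sources[key], service))
    return merged, conflicts


merged, conflicts = merge_first_wins([
    ("musicbrainz", {"mbid": "abc", "artist": "Beatles", "isrc": None}),
    ("deezer", {"artist": "The Beatles", "isrc": "GBAYE9200070", "bpm": "120"}),
])
```

The merged record still follows first-wins, but the conflict list gives a place to surface cases such as the differing artist names.
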
## Integration Error Handling

### Network Errors

All network errors are caught and suppressed. Returns None.

**No retry logic:** Single attempt per service.

**No exponential backoff:** Immediate failure on error.

**No circuit breaker:** Repeated failures don't disable a service.

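
Retry with exponential backoff can be added as a thin wrapper around any of the service calls. The following is a generic sketch, not code from the project: `with_retry` is a hypothetical helper, the delays (1s, 2s, 4s, ...) are illustrative, and the `sleep` parameter exists only so the function can be exercised without real waiting.

```python
import logging
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")
log = logging.getLogger("integrations")


def with_retry(call: Callable[[], T], attempts: int = 3,
               base_delay: float = 1.0,
               sleep: Callable[[float], None] = time.sleep) -> Optional[T]:
    """Call `call`, retrying on any exception with doubling delays.
    Returns None after the final failure, but logs each failure first."""
    for attempt in range(attempts):
        try:
            return call()
        except Exception as exc:
            log.warning("attempt %d/%d failed: %s", attempt + 1, attempts, exc)
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)
    return None
```

The return contract stays the same as today (a value or `None`), so callers need no changes, while the log line removes the silent-failure problem. In real use it would be better to catch only network-related exception types rather than `Exception`.
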
### Rate Limiting

**No rate limiting implementation in MusicMetaLinker itself.**

**Risks:**
- MusicBrainz: Asks for at most 1 req/s and may block aggressive usage; only the musicbrainzngs default throttle stands in the way
- Deezer: Unknown limits, may block
- YouTube Music: Unknown limits, may block or break API

**Batch processing:** High risk of rate limiting (MusicMetaLinker adds no delays between requests).

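
A per-service minimum-interval limiter is one simple way to add delays between requests. The class below is a hypothetical sketch, not part of MusicMetaLinker. The `clock` and `sleep` parameters default to the standard-library functions and are injectable so the logic can be checked without waiting.

```python
import time
from typing import Callable, Optional


class MinIntervalLimiter:
    """Blocks so that consecutive wait() calls are at least
    `interval` seconds apart."""

    def __init__(self, interval: float,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self.interval = interval
        self._clock = clock
        self._sleep = sleep
        self._last: Optional[float] = None

    def wait(self) -> None:
        now = self._clock()
        if self._last is not None:
            remaining = self.interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Usage would be one limiter per service, for example `deezer_limiter = MinIntervalLimiter(0.5)`, with `deezer_limiter.wait()` called immediately before each request. The 0.5-second figure is a placeholder, since the appropriate interval for each service is not established here.
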
### Service Unavailability

**No health checks:** Services are assumed available.

**No fallback:** If MusicBrainz is down, there is no alternative for MBID lookup.

**No status monitoring:** No logging of service failures.

## Integration Security

### API Keys

**MusicBrainz, Deezer, YouTube Music:** No API keys required.

**Spotify:** Client credentials in external file (not encrypted).

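
One common alternative to a secrets module is reading the credentials from environment variables. The sketch below is an assumption about how that could look, not the project's approach. `load_spotify_credentials` is hypothetical, and the variable names simply mirror the constants in mml_secrets.py.

```python
import os
from typing import Mapping, Tuple


def load_spotify_credentials(env: Mapping[str, str] = os.environ
                             ) -> Tuple[str, str]:
    """Read the Spotify client ID and secret from environment variables,
    failing with a clear message if either one is missing."""
    try:
        return env["SPOTIFY_CLIENT_ID"], env["SPOTIFY_CLIENT_SECRET"]
    except KeyError as missing:
        raise RuntimeError(
            f"Missing environment variable: {missing.args[0]}"
        ) from None
```

The returned pair can be passed to `SpotifyClientCredentials(client_id=..., client_secret=...)`. Failing loudly on a missing variable also gives users a concrete hint about what setup is required.
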
### Data Privacy

**No personal data sent:** Only public music metadata is queried.

**No user tracking:** No analytics sent to services.

### HTTPS

All services use HTTPS. No plaintext HTTP.

### Input Sanitization

**No sanitization:** Metadata strings are passed directly to the APIs.

**Potential risks:**
- Query injection (if services use SQL/NoSQL)
- Command injection (if services execute shell commands)

**Actual risk:** Low. All services are reached through HTTP APIs, and the client libraries encode query parameters. The more practical concern is query correctness: an unescaped double quote in a title can break the quoted Lucene phrase sent to MusicBrainz.

## Integration Recommendations

### Immediate Fixes

1. **Remove AcousticBrainz:** Delete defunct integration
2. **Fix User-Agent:** Change "elka/0.1" to "MusicMetaLinker/0.0.1"
3. **Add rate limiting:** Implement delays between requests
4. **Document Spotify setup:** Instructions for obtaining credentials

### Short-Term Improvements

1. **Add retry logic:** Exponential backoff for network errors
2. **Add timeout configuration:** Configurable request timeouts
3. **Enable YouTube duration filtering:** Uncomment and test
4. **Add error logging:** Log service failures
5. **Add health checks:** Verify service availability before queries

### Long-Term Enhancements

1. **Parallel queries:** Use asyncio for concurrent API calls
2. **Cross-validation:** Compare results across services
3. **Confidence scores:** Indicate match quality
4. **Service abstraction:** Common interface for all services
5. **Plugin architecture:** Allow adding new services without code changes
6. **Caching layer:** Reduce redundant API calls
7. **Circuit breaker:** Disable failing services temporarily
8. **Metrics collection:** Track success rates, latencies per service

## Integration Value Assessment

**High value:**
- MusicBrainz: Authoritative, comprehensive, reliable
- Deezer: Fast, good ISRC coverage, BPM data

**Medium value:**
- Spotify: Useful for dataset preparation, requires setup

**Low value:**
- YouTube Music: Weak matching, fragile API, high false positives
- AcousticBrainz: Defunct, zero value

**Recommendation:** Keep MusicBrainz and Deezer. Remove AcousticBrainz. Improve YouTube Music matching or remove it. Keep Spotify for dataset preparation.