feat: initial implementation of metadata aggregator

- gRPC service with MusicBrainz provider
- PostgreSQL schema with migrations
- Service layer with database-first caching
- Repository pattern for data access
- YAML configuration support
- Research documentation for 17 music metadata projects
Alexander
2026-04-28 16:27:14 +02:00
commit a1f6701bac
163 changed files with 95884 additions and 0 deletions
@@ -0,0 +1,662 @@
# MusicMetaLinker Integrations
## Integration Architecture
MusicMetaLinker integrates with five external services:
1. MusicBrainz (open music encyclopedia)
2. Deezer (commercial streaming service)
3. YouTube Music (commercial streaming service)
4. AcousticBrainz (audio analysis database - defunct)
5. Spotify (commercial streaming service - limited use)
Each integration uses a different library and authentication approach.
## MusicBrainz Integration
### Library and Authentication
**Library:** musicbrainzngs (official Python client)
**Authentication:** None required for read-only queries.
**User-Agent:** Required by MusicBrainz API. Hardcoded as "elka/0.1" (appears to be from parent project, not MusicMetaLinker-specific).
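**Rate limiting:** MusicBrainz recommends 1 request/second. MusicMetaLinker adds no throttling of its own; it relies on musicbrainzngs, which applies a default limit of one request per second unless disabled via set_rate_limit.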
### API Endpoints
All queries go through musicbrainzngs library, which handles endpoint construction.
**Base URL:** https://musicbrainz.org/ws/2/
**Endpoints used:**
- Recording search: /recording?query=...
- Recording lookup: /recording/{mbid}
- ISRC search: /isrc/{isrc}
### Query Patterns
**By MBID (most reliable):**
```python
import musicbrainzngs as mb

# MusicBrainz requires an identifying User-Agent on every request
mb.set_useragent("elka", "0.1")
result = mb.get_recording_by_id(
    mbid,
    includes=["artists", "releases", "isrcs"]
)
```
**includes parameter:** Fetches related entities in single request. Reduces API calls.
**By ISRC:**
```python
result = mb.get_recordings_by_isrc(
    isrc,
    includes=["artists", "releases", "isrcs"]
)
```
Returns list of recordings with that ISRC. Multiple recordings possible (different releases, remasters).
**By metadata:**
```python
query = f'artist:"{artist}" AND recording:"{track}"'
if album:
    query += f' AND release:"{album}"'
result = mb.search_recordings(
    query=query,
    limit=10
)
```
Lucene query syntax. Quoted strings for exact matching. Returns ranked results.
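Because the field values are interpolated into quoted Lucene phrases, a double quote or other reserved character in an artist or track name can break the query. The following is a minimal sketch of escaping those characters before building the query. The helper names `escape_lucene` and `build_recording_query` are hypothetical and not part of MusicMetaLinker or musicbrainzngs, and the character list follows general Lucene query syntax rather than anything confirmed in the source.

```python
import re
from typing import Optional

# Characters and operators that carry special meaning in Lucene query syntax.
_LUCENE_SPECIAL = re.compile(r'([+\-!(){}\[\]^"~*?:\\/]|&&|\|\|)')


def escape_lucene(value: str) -> str:
    """Backslash-escape Lucene special characters in a field value."""
    return _LUCENE_SPECIAL.sub(r"\\\1", value)


def build_recording_query(artist: str, track: str, album: Optional[str] = None) -> str:
    """Build the quoted-phrase query described above from escaped values."""
    query = f'artist:"{escape_lucene(artist)}" AND recording:"{escape_lucene(track)}"'
    if album:
        query += f' AND release:"{escape_lucene(album)}"'
    return query
```

The returned string could then be passed as the `query` argument of `mb.search_recordings`.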
### Response Parsing
**Recording structure:**
```python
{
    "recording": {
        "id": "mbid-uuid",
        "title": "Track Name",
        "length": 123456,  # milliseconds
        "artist-credit": [
            {"artist": {"name": "Artist Name"}}
        ],
        "release-list": [
            {
                "title": "Album Name",
                "date": "2020-01-15",
                "track-list": [
                    {"number": "1"}
                ]
            }
        ],
        "isrc-list": ["GBAYE9200070"]
    }
}
```
**Extraction logic:**
- **MBID:** recording.id
- **Track:** recording.title
- **Artist:** recording.artist-credit[0].artist.name (first artist only)
- **Duration:** recording.length / 1000 (convert milliseconds to seconds)
- **Album:** recording.release-list[0].title (first release only)
- **Release date:** recording.release-list[0].date
- **Track number:** recording.release-list[0].track-list[0].number
- **ISRC:** recording.isrc-list[0] (first ISRC only)
**Multiple values:** MusicBrainz returns lists for artists, releases, ISRCs. MusicMetaLinker takes first value only. No aggregation or selection logic.
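The first-value-only extraction can be sketched as a single function over the response structure shown above. This is an illustration under assumptions, not MusicMetaLinker's actual code: `first` and `extract_recording_fields` are hypothetical names, and the guards for missing keys are additions that the source does not describe.

```python
from typing import Any, Optional


def first(items: Optional[list], default: Any = None) -> Any:
    """Return the first element of a list, or a default if it is missing or empty."""
    return items[0] if items else default


def extract_recording_fields(response: dict) -> dict:
    """Flatten a recording response into the fields listed above (first value only)."""
    rec = response.get("recording", {})
    credit = first(rec.get("artist-credit"), {})
    release = first(rec.get("release-list"), {})
    track = first(release.get("track-list"), {})
    length = rec.get("length")
    return {
        "mbid": rec.get("id"),
        "track": rec.get("title"),
        # Credit entries may also be plain join strings, so check the type.
        "artist": credit.get("artist", {}).get("name") if isinstance(credit, dict) else None,
        # Length is in milliseconds; int() also accepts a numeric string.
        "duration": int(length) / 1000 if length is not None else None,
        "album": release.get("title"),
        "release_date": release.get("date"),
        "track_number": track.get("number"),
        "isrc": first(rec.get("isrc-list")),
    }
```

Returning `None` for absent fields keeps a recording without releases or ISRCs from raising an `IndexError` or `KeyError`.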
### Filtering and Matching
**Duration filtering:**
```python
if duration:
    matches = [r for r in results if abs(r['length']/1000 - duration) < 5]
```
±5 second threshold for metadata searches. Hardcoded.
**Fuzzy string matching:**
Uses difflib.SequenceMatcher for artist/track/album similarity.
```python
from difflib import SequenceMatcher

def similarity(a, b):
    # Case-insensitive similarity ratio in the range 0.0 to 1.0
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Match if similarity > 0.8 (80%)
```
Threshold hardcoded at 0.8. No configuration option.
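One way the two hardcoded values could be made configurable is to pass them in through a small settings object. The sketch below assumes a hypothetical `MatchConfig` dataclass and `is_match` function; neither exists in MusicMetaLinker, and the defaults simply mirror the values documented above.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Optional


@dataclass(frozen=True)
class MatchConfig:
    """Hypothetical settings; defaults mirror the hardcoded values."""
    duration_tolerance_s: float = 5.0
    min_similarity: float = 0.8


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio in the range 0.0 to 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def is_match(candidate_title: str, candidate_length_ms: Optional[int],
             wanted_title: str, wanted_duration_s: Optional[float],
             config: MatchConfig = MatchConfig()) -> bool:
    """Apply the duration filter (when both durations are known), then the title check."""
    if wanted_duration_s is not None and candidate_length_ms is not None:
        if abs(candidate_length_ms / 1000 - wanted_duration_s) >= config.duration_tolerance_s:
            return False
    return similarity(candidate_title, wanted_title) > config.min_similarity
```

A caller that needs a looser duration filter could pass `MatchConfig(duration_tolerance_s=10)` without touching the matching code.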
### Error Handling
**Network errors:** Caught and suppressed. Returns None.
**Invalid MBID:** Returns None.
**No results:** Returns None.
**Rate limiting:** No handling. If rate limited, returns None.
### Integration Strengths
1. **Official library:** musicbrainzngs is maintained by MusicBrainz community
2. **Rich metadata:** Comprehensive music information
3. **No authentication:** Easy to use
4. **Includes parameter:** Efficient data fetching
5. **Authoritative source:** MusicBrainz is ground truth for music metadata
### Integration Weaknesses
1. **Hardcoded User-Agent:** "elka/0.1" not specific to MusicMetaLinker
2. **No rate limiting of its own:** Depends on the default throttle in musicbrainzngs; risk of being blocked if that is disabled
3. **First-value-only:** Ignores multiple artists, releases, ISRCs
4. **Hardcoded thresholds:** Duration (5s), similarity (0.8) not configurable
5. **No error visibility:** Silent failures
## Deezer Integration
### Library and Authentication
**Library:** deezer-python (community library, not official)
**Authentication:** None required for search API.
**Rate limiting:** Unknown. Not documented. Not enforced by MusicMetaLinker.
### API Endpoints
deezer-python library handles endpoint construction.
**Base URL:** https://api.deezer.com/
**Endpoints used:**
- Track search: /search/track?q=...
- ISRC search: /track/isrc:{isrc}
### Query Patterns
**By ISRC (preferred):**
```python
import deezer
client = deezer.Client()
result = client.search(f'isrc:{isrc}', relation='track')
```
Returns list of tracks with that ISRC. Usually single result.
**By metadata:**
```python
query = f'{artist} {track}'
if album:
    query += f' {album}'
result = client.search(query, relation='track')
```
Simple concatenation. No advanced query syntax.
### Response Parsing
**Track structure:**
```python
{
    "id": 123456789,
    "title": "Track Name",
    "duration": 123,  # seconds
    "artist": {
        "name": "Artist Name"
    },
    "album": {
        "title": "Album Name"
    },
    "release_date": "2020-01-15",
    "bpm": 120,
    "isrc": "GBAYE9200070",
    "rank": 500000
}
```
**Extraction logic:**
- **Deezer ID:** track.id
- **Track:** track.title
- **Artist:** track.artist.name
- **Album:** track.album.title
- **Duration:** track.duration (already in seconds)
- **Release date:** track.release_date
- **BPM:** track.bpm
- **ISRC:** track.isrc
- **Rank:** track.rank (popularity metric)
### Filtering and Matching
**Duration filtering (critical for Deezer):**
```python
duration_threshold = 3  # seconds
matches = [
    t for t in results
    if abs(t.duration - duration) <= duration_threshold
]
```
±3 second threshold. Configurable via parameter but defaults to 3.
**Why critical:** Deezer returns many versions of same track (radio edit, album version, remaster, live). Duration filtering essential for correct match.
**Fuzzy matching:**
Same difflib.SequenceMatcher approach as MusicBrainz. 0.8 similarity threshold.
**Ranking:**
If multiple matches after filtering, selects highest rank (most popular version).
```python
best_match = max(matches, key=lambda t: t.rank)
```
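The duration filter and rank selection can be combined into one step. The following sketch uses a hypothetical `Track` dataclass as a stand-in for the objects deezer-python returns, and a hypothetical `pick_best` function. Note that `max()` raises `ValueError` on an empty sequence, so the sketch passes `default=None` to cover the case where filtering removes every candidate.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Track:
    """Stand-in for a Deezer track object; only the fields used here."""
    title: str
    duration: int  # seconds
    rank: int      # popularity metric


def pick_best(results: Iterable[Track], duration: Optional[float],
              duration_threshold: float = 3) -> Optional[Track]:
    """Keep tracks within the duration threshold, then return the highest-ranked one."""
    matches = [
        t for t in results
        if duration is None or abs(t.duration - duration) <= duration_threshold
    ]
    # default=None avoids a ValueError when no candidate survives the filter.
    return max(matches, key=lambda t: t.rank, default=None)
```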
### Error Handling
**Network errors:** Caught and suppressed. Returns None.
**Invalid ISRC:** Returns empty list, treated as no match.
**No results:** Returns None.
### Integration Strengths
1. **Strong ISRC support:** Deezer has comprehensive ISRC coverage
2. **Duration filtering:** Effective for disambiguating versions
3. **Popularity ranking:** Helps select canonical version
4. **BPM data:** Only source of BPM in MusicMetaLinker
5. **Fast API:** Generally faster than MusicBrainz
### Integration Weaknesses
1. **Unofficial library:** deezer-python not maintained by Deezer
2. **No authentication:** Limited to public API (no user-specific data)
3. **Simple search:** No advanced query syntax
4. **Narrow default threshold:** The 3-second duration default may not suit all use cases, although it can be overridden via parameter
5. **Commercial bias:** Deezer catalog may not include obscure or independent releases
## YouTube Music Integration
### Library and Authentication
**Library:** ytmusicapi (unofficial, reverse-engineered API)
**Authentication:** None required for search.
**Rate limiting:** Unknown. YouTube may block aggressive usage.
### API Endpoints
ytmusicapi reverse-engineers YouTube Music web interface. No official API.
**Endpoints:** Internal to ytmusicapi. Not exposed to MusicMetaLinker.
### Query Patterns
**By metadata only:**
```python
from ytmusicapi import YTMusic
ytmusic = YTMusic()
query = f'{artist} {track} {album}'
results = ytmusic.search(query, filter='songs')
```
**filter='songs':** Excludes videos, albums, playlists. Returns only song results.
**No ISRC support:** YouTube Music API doesn't support ISRC search.
**No MBID support:** YouTube Music doesn't use MBIDs.
### Response Parsing
**Song structure:**
```python
{
"videoId": "dQw4w9WgXcQ",
"title": "Track Name",
"artists": [
{"name": "Artist Name"}
],
"album": {
"name": "Album Name"
},
"duration": "7:11", # string format
"duration_seconds": 431
}
```
**Extraction logic:**
- **YouTube ID:** result.videoId
- **YouTube URL:** f"https://www.youtube.com/watch?v={videoId}"
- **Track:** result.title
- **Artist:** result.artists[0].name (first artist only)
- **Album:** result.album.name
### Filtering and Matching
**No duration filtering:** Duration filtering code commented out in MusicMetaLinker.
```python
# if duration:
# matches = [r for r in results if abs(r['duration_seconds'] - duration) < 10]
```
**Why commented out:** Unknown. Possibly unreliable duration data from YouTube.
**No fuzzy matching:** First result assumed correct.
```python
best_match = results[0] if results else None
```
**Critical weakness:** High false positive rate. No validation that first result is correct match.
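A validation step along the lines used for the other services would reduce false positives. The sketch below is an assumption about how that could look, not code from MusicMetaLinker: `pick_validated` is a hypothetical function, it reuses the 0.8 similarity threshold and the 10-second tolerance from the commented-out filter, and it operates on result dictionaries shaped like the song structure above.

```python
from difflib import SequenceMatcher
from typing import List, Optional


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio in the range 0.0 to 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def pick_validated(results: List[dict], artist: str, track: str,
                   duration: Optional[float] = None,
                   min_similarity: float = 0.8,
                   duration_tolerance_s: float = 10) -> Optional[dict]:
    """Return the first result whose title, artist and duration all pass the checks."""
    for r in results:
        artists = r.get("artists") or [{}]
        if similarity(r.get("title", ""), track) <= min_similarity:
            continue
        if similarity(artists[0].get("name", ""), artist) <= min_similarity:
            continue
        seconds = r.get("duration_seconds")
        # Skip the duration check when either side is unknown.
        if duration is not None and seconds is not None:
            if abs(seconds - duration) >= duration_tolerance_s:
                continue
        return r
    return None
```

Search order is preserved, so the top-ranked result still wins whenever it passes validation.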
### Error Handling
**Network errors:** Caught and suppressed. Returns None.
**No results:** Returns None.
**API changes:** ytmusicapi may break if YouTube changes web interface. No error handling for this.
### Integration Strengths
1. **Broad coverage:** YouTube Music has extensive catalog
2. **No authentication:** Easy to use
3. **Filter parameter:** Excludes non-song results
### Integration Weaknesses
1. **Unofficial API:** Reverse-engineered, fragile to changes
2. **No duration filtering:** Commented out, high false positive rate
3. **First-result-only:** No ranking or validation
4. **No ISRC support:** Can't use authoritative identifiers
5. **Legal risk:** Reverse-engineered APIs may violate ToS
6. **No error handling:** API breakage causes silent failures
## AcousticBrainz Integration
### Library and Authentication
**Library:** requests (direct HTTP calls)
**Authentication:** None.
### API Endpoints
**Base URL:** https://acousticbrainz.org/
**Endpoint:** /{mbid}
### Query Pattern
```python
import requests

def acousticbrainz_link(mbid):
    url = f"https://acousticbrainz.org/{mbid}"
    response = requests.get(url)
    # Return the page URL only if the MBID resolves
    return url if response.status_code == 200 else None
```
Simple HTTP GET. Returns URL if MBID exists, None otherwise.
### Critical Issue: Service Shutdown
**AcousticBrainz shut down in 2022.** All queries return 404.
**Impact:** This integration is completely non-functional. Dead code.
**Why still in codebase:** Unknown. Possibly not updated since shutdown.
**Recommendation:** Remove this integration entirely.
### Integration Strengths
None. Service is defunct.
### Integration Weaknesses
1. **Service shut down:** Non-functional
2. **Dead code:** Wastes execution time
3. **Misleading output:** CSV includes acousticbrainz column (always null)
4. **No deprecation notice:** Code doesn't warn users
## Spotify Integration
### Library and Authentication
**Library:** spotipy (community-maintained Python client for the Spotify Web API, not an official Spotify library)
**Authentication:** OAuth2 client credentials flow.
**Credentials:** Stored in external mml_secrets.py file (not in repository).
**mml_secrets.py structure:**
```python
SPOTIFY_CLIENT_ID = "your-client-id"
SPOTIFY_CLIENT_SECRET = "your-client-secret"
```
### Usage Scope
**Limited use:** Spotify integration only used in Billboard dataset cleaning script (prepare_dataset.py).
**Not used in main Align workflow.** Spotify not queried by Align class.
### Query Pattern
```python
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials
from mml_secrets import SPOTIFY_CLIENT_ID, SPOTIFY_CLIENT_SECRET

auth_manager = SpotifyClientCredentials(
    client_id=SPOTIFY_CLIENT_ID,
    client_secret=SPOTIFY_CLIENT_SECRET
)
sp = spotipy.Spotify(auth_manager=auth_manager)
result = sp.search(q=f'track:{track} artist:{artist}', type='track', limit=1)
```
### Use Case
**Billboard dataset cleaning:** Extract ISRCs from Spotify for Billboard chart tracks.
**Workflow:**
1. Billboard dataset has artist/track names but no ISRCs
2. Query Spotify by artist/track
3. Extract ISRC from Spotify result
4. Use ISRC for subsequent MusicBrainz/Deezer queries
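Step 3 of the workflow can be sketched as follows. The Spotify Web API returns track search results under `tracks.items`, and each track object carries its ISRC under `external_ids.isrc`. The function name `extract_isrc` is hypothetical; how prepare_dataset.py actually reads the response is not shown in the source.

```python
from typing import Optional


def extract_isrc(search_result: dict) -> Optional[str]:
    """Return the ISRC of the first track in a Spotify search response, if any."""
    items = search_result.get("tracks", {}).get("items", [])
    if not items:
        return None
    # Some tracks have no ISRC registered, so both lookups are guarded.
    return items[0].get("external_ids", {}).get("isrc")
```

With `limit=1` as in the query above, the first item is the only candidate, so a wrong top hit yields a wrong ISRC for every later lookup.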
### Integration Strengths
1. **Established library:** spotipy is a widely used, community-maintained client (not maintained by Spotify)
2. **OAuth2:** Secure authentication
3. **Rich metadata:** Comprehensive track information
4. **ISRC support:** Spotify provides ISRCs
### Integration Weaknesses
1. **Requires credentials:** Users must register Spotify app
2. **External secrets file:** mml_secrets.py not in repository, must be created manually
3. **Limited use:** Only for dataset preparation, not main workflow
4. **No documentation:** No instructions for obtaining credentials
## Integration Comparison
| Service | Library | Auth | ISRC Support | Duration Filtering | Matching Method | Status |
|---------|---------|------|--------------|-------------------|------------------|--------|
| MusicBrainz | musicbrainzngs | None | Yes | ±5s | Fuzzy (0.8) | Active |
| Deezer | deezer-python | None | Yes | ±3s | Fuzzy (0.8) + Rank | Active |
| YouTube Music | ytmusicapi | None | No | Commented out | First result | Active (fragile) |
| AcousticBrainz | requests | None | N/A | N/A | N/A | Defunct |
| Spotify | spotipy | OAuth2 | Yes | N/A | N/A | Active (limited use) |
## Integration Orchestration
### Service Selection Logic
**Priority order:**
1. **MusicBrainz** if MBID provided (authoritative)
2. **Deezer** if ISRC provided (fast, reliable)
3. **MusicBrainz** if metadata provided (fallback)
4. **Deezer** if metadata provided (fallback)
5. **YouTube Music** if metadata provided (last resort)
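The priority order above can be expressed as a small planning function that returns the services to query, in order, for a given set of inputs. This is a sketch of the documented ordering only: `plan_queries` and the service labels are hypothetical and do not correspond to identifiers in the Align class.

```python
from typing import List, Optional, Tuple


def plan_queries(mbid: Optional[str] = None, isrc: Optional[str] = None,
                 has_metadata: bool = False) -> List[Tuple[str, str]]:
    """Return (service, lookup_key) pairs in the documented priority order."""
    plan: List[Tuple[str, str]] = []
    if mbid:
        plan.append(("musicbrainz", "mbid"))
    if isrc:
        plan.append(("deezer", "isrc"))
    if has_metadata:
        # Metadata searches are fallbacks, with YouTube Music as the last resort.
        plan.extend([
            ("musicbrainz", "metadata"),
            ("deezer", "metadata"),
            ("youtube_music", "metadata"),
        ])
    return plan
```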
### Parallel vs Sequential
**Sequential execution:** Services queried one at a time. No parallelization.
**Implications:**
- Total latency = sum of all service latencies
- Slow for batch processing
- Simple error handling (no race conditions)
### Result Aggregation
**No cross-validation:** Results from different services not compared.
**First-wins strategy:** First successful query for each field used.
**Example:**
- MBID from MusicBrainz
- ISRC from Deezer (if not in MusicBrainz)
- BPM from Deezer (only source)
- YouTube link from YouTube Music
**No conflict resolution:** If MusicBrainz and Deezer return different artists, no reconciliation.
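The first-wins strategy amounts to a field-by-field merge in which a later service only fills fields that earlier services left empty. The sketch below illustrates that behavior with a hypothetical `merge_first_wins` function; the per-service result dictionaries are an assumed representation, not MusicMetaLinker's internal one.

```python
from typing import Dict, Iterable, Optional


def merge_first_wins(results: Iterable[Optional[dict]]) -> Dict[str, object]:
    """Merge per-service results in priority order; the first non-None value wins."""
    merged: Dict[str, object] = {}
    for result in results:
        # A failed service contributes None, which is treated as an empty result.
        for field, value in (result or {}).items():
            if value is not None and merged.get(field) is None:
                merged[field] = value
    return merged
```

In this scheme a conflicting artist name from a later service is silently dropped, which is the lack of reconciliation noted above.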
## Integration Error Handling
### Network Errors
All network errors caught and suppressed. Returns None.
**No retry logic:** Single attempt per service.
**No exponential backoff:** Immediate failure on error.
**No circuit breaker:** Repeated failures don't disable service.
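As an illustration of what retry logic with exponential backoff could look like here, the sketch below wraps an arbitrary service call. Everything in it is an assumption: `with_retries` is a hypothetical helper, the attempt count and base delay are placeholder values, and catching `Exception` is a simplification where real code would catch only the network errors raised by each client library.

```python
import logging
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")
log = logging.getLogger(__name__)


def with_retries(call: Callable[[], T], attempts: int = 3, base_delay: float = 1.0,
                 sleep: Callable[[float], None] = time.sleep) -> Optional[T]:
    """Run call(); on failure wait base_delay * 2**attempt and retry, then give up with None."""
    for attempt in range(attempts):
        try:
            return call()
        except Exception as exc:  # assumption: narrow this to network errors in practice
            log.warning("attempt %d/%d failed: %s", attempt + 1, attempts, exc)
            if attempt == attempts - 1:
                return None
            sleep(base_delay * 2 ** attempt)
    return None
```

The function still returns `None` after the final failure, matching the current behavior, but each failure is logged rather than suppressed.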
### Rate Limiting
**No rate limiting implementation.**
**Risks:**
- MusicBrainz: Recommends 1 req/s, may block aggressive usage
- Deezer: Unknown limits, may block
- YouTube Music: Unknown limits, may block or break API
**Batch processing:** High risk of rate limiting (no delays between requests).
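A minimal per-service throttle for batch processing could enforce a minimum interval between consecutive requests. The `Throttle` class below is a hypothetical sketch; the 1-second interval matches the MusicBrainz recommendation, while suitable intervals for Deezer and YouTube Music are unknown and would have to be chosen conservatively.

```python
import time
from typing import Callable, Optional


class Throttle:
    """Block until at least min_interval seconds have passed since the previous call."""

    def __init__(self, min_interval: float = 1.0,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last: Optional[float] = None

    def wait(self) -> None:
        """Call immediately before each request to the throttled service."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

One instance per service, with `wait()` called before each request, would spread a batch out without changing the query code itself.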
### Service Unavailability
**No health checks:** Services assumed available.
**No fallback:** If MusicBrainz down, no alternative for MBID lookup.
**No status monitoring:** No logging of service failures.
## Integration Security
### API Keys
**MusicBrainz, Deezer, YouTube Music:** No API keys required.
**Spotify:** Client credentials in external file (not encrypted).
### Data Privacy
**No personal data sent:** Only public music metadata queried.
**No user tracking:** No analytics sent to services.
### HTTPS
All services use HTTPS. No plaintext HTTP.
### Input Sanitization
**No sanitization:** Metadata strings passed directly to APIs.
**Potential risks:**
- Query injection (if services use SQL/NoSQL)
- Command injection (if services execute shell commands)
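**Actual risk:** Low for injection in the security sense, since the client libraries send values as URL-encoded query parameters. The practical risk is malformed queries: an unescaped double quote in an artist or track name breaks the quoted Lucene phrase sent to MusicBrainz.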
## Integration Recommendations
### Immediate Fixes
1. **Remove AcousticBrainz:** Delete defunct integration
2. **Fix User-Agent:** Change "elka/0.1" to "MusicMetaLinker/0.0.1"
3. **Add rate limiting:** Implement delays between requests
4. **Document Spotify setup:** Instructions for obtaining credentials
### Short-Term Improvements
1. **Add retry logic:** Exponential backoff for network errors
2. **Add timeout configuration:** Configurable request timeouts
3. **Enable YouTube duration filtering:** Uncomment and test
4. **Add error logging:** Log service failures
5. **Add health checks:** Verify service availability before queries
### Long-Term Enhancements
1. **Parallel queries:** Use asyncio for concurrent API calls
2. **Cross-validation:** Compare results across services
3. **Confidence scores:** Indicate match quality
4. **Service abstraction:** Common interface for all services
5. **Plugin architecture:** Allow adding new services without code changes
6. **Caching layer:** Reduce redundant API calls
7. **Circuit breaker:** Disable failing services temporarily
8. **Metrics collection:** Track success rates, latencies per service
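The parallel-query idea could be prototyped without rewriting the blocking client libraries by running each call in a worker thread. The sketch below assumes Python 3.9 or later for `asyncio.to_thread`; `query_all` is a hypothetical function, and each service is represented by a zero-argument callable rather than by MusicMetaLinker's real classes.

```python
import asyncio
from typing import Callable, Dict, Optional, Tuple


async def query_all(services: Dict[str, Callable[[], Optional[dict]]]) -> Dict[str, Optional[dict]]:
    """Run blocking service calls concurrently; a failing service yields None."""

    async def run(name: str, call: Callable[[], Optional[dict]]) -> Tuple[str, Optional[dict]]:
        try:
            # to_thread keeps the event loop free while the blocking call runs.
            return name, await asyncio.to_thread(call)
        except Exception:  # assumption: narrow this to expected errors in practice
            return name, None

    pairs = await asyncio.gather(*(run(name, call) for name, call in services.items()))
    return dict(pairs)
```

With this approach total latency approaches that of the slowest service rather than the sum of all services, though per-service rate limits still apply to each call.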
## Integration Value Assessment
**High value:**
- MusicBrainz: Authoritative, comprehensive, reliable
- Deezer: Fast, good ISRC coverage, BPM data
**Medium value:**
- Spotify: Useful for dataset preparation, requires setup
**Low value:**
- YouTube Music: Weak matching, fragile API, high false positives
- AcousticBrainz: Defunct, zero value
**Recommendation:** Keep MusicBrainz and Deezer. Remove AcousticBrainz. Improve YouTube Music matching or remove. Keep Spotify for dataset preparation.