metadata-agregator/docs/research/meelo/analysis/INTEGRATIONS.md
Alexander a1f6701bac feat: initial implementation of metadata aggregator
- gRPC service with MusicBrainz provider
- PostgreSQL schema with migrations
- Service layer with database-first caching
- Repository pattern for data access
- YAML configuration support
- Research documentation for 17 music metadata projects
2026-04-28 16:28:53 +02:00


# Meelo Integrations
## Integration Overview
Meelo integrates with 8 metadata providers and 2 scrobbling services. The Matcher service handles provider queries, while the Server handles scrobbling. All integrations are configurable via `settings.json` and `.env`.
## Metadata Providers
### MusicBrainz
**Type**: Primary music database
**Library**: musicbrainzngs (Python)
**Authentication**: None (public API)
**Rate Limit**: 1 request/second
**Priority**: Highest (primary source)
#### Capabilities
- Artist metadata (name, sort name, areas, relationships)
- Album metadata (title, type, release date, labels)
- Track metadata (title, duration, ISRC)
- Recording relationships (covers, remixes, versions)
- Release groups and releases
- Area data (countries, cities with ISO 3166 codes)
#### Matching Strategy
1. Query by AcoustID fingerprint (most accurate)
2. If no fingerprint, search by artist + album + track title
3. Extract MBID (MusicBrainz ID) for future queries
4. Store MBID in LocalIdentifiers table
#### Data Extraction
**Artist**:
```python
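import musicbrainzngs as mb

mb.set_useragent("Meelo", "1.0")  # placeholder values; musicbrainzngs requires a user agent
# 'areas' is not a valid include for artists; area data is returned by default
artist = mb.get_artist_by_id(mbid, includes=['aliases'])['artist']
artist_info = {
    'name': artist['name'],
    'sortName': artist['sort-name'],
    # 'area' and 'begin-area' are single optional objects, not a list
    'areas': [artist[key]['name'] for key in ('area', 'begin-area') if key in artist],
}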
```
**Album**:
```python
# 'labels' is a valid include for releases only, not for release groups
release_group = mb.get_release_group_by_id(mbid, includes=['releases'])['release-group']
album_info = {
    'name': release_group['title'],
    'type': release_group.get('type'),
    'releaseDate': release_group.get('first-release-date'),
    'releases': [...]
}
```
**Track**:
```python
recording = mb.get_recording_by_id(mbid, includes=['isrcs', 'releases'])['recording']
track_info = {
    'title': recording['title'],
    'duration': recording.get('length'),  # milliseconds, as a string; may be absent
    # first ISRC if any; `or [None]` also covers an empty list
    'isrc': (recording.get('isrc-list') or [None])[0],
}
```
#### Rate Limiting
The musicbrainzngs library enforces 1 request/second by default, so no additional limiter is needed.
#### Error Handling
- **404 Not Found**: No match, skip provider
- **503 Service Unavailable**: Retry with exponential backoff (max 3 attempts)
- **Rate Limit Exceeded**: Wait and retry
### Genius
**Type**: Lyrics and song descriptions
**Library**: lyricsgenius (Python)
**Authentication**: API token (GENIUS_ACCESS_TOKEN)
**Rate Limit**: 10 requests/second
**Priority**: High (for lyrics)
#### Capabilities
- Song lyrics (plain text)
- Song descriptions and annotations
- Artist biographies
- Album descriptions
#### Matching Strategy
1. Search by artist + song title
2. Extract song ID from search results
3. Fetch full song data including lyrics
4. Store lyrics in Lyrics table
#### Data Extraction
**Lyrics**:
```python
genius = lyricsgenius.Genius(token)
song = genius.search_song(title, artist)
{
'plain': song.lyrics,
'description': song.description
}
```
**Artist Bio**:
```python
artist = genius.search_artist(name)
{
'description': artist.description
}
```
#### Rate Limiting
Implemented using aiolimiter:
```python
from aiolimiter import AsyncLimiter

limiter = AsyncLimiter(10, 1)  # 10 requests per second
async with limiter:  # must run inside a coroutine
    result = await fetch_genius(...)
```
#### Error Handling
- **404 Not Found**: No lyrics available, skip
- **401 Unauthorized**: Invalid token, log error
- **Rate Limit**: Wait and retry
### Wikipedia
**Type**: Artist and album context
**Library**: wikipedia (Python)
**Authentication**: None
**Rate Limit**: 5 requests/second (self-imposed)
**Priority**: Medium (for descriptions)
#### Capabilities
- Artist biographies
- Album background and reception
- Contextual information (formation, breakup, influences)
#### Matching Strategy
1. Search Wikipedia by artist/album name
2. Extract first paragraph as description
3. Store full URL as source
#### Data Extraction
**Artist Bio**:
```python
import wikipedia
page = wikipedia.page(artist_name)
{
'description': page.summary,
'url': page.url
}
```
**Album Context**:
```python
page = wikipedia.page(f"{album_name} ({artist_name} album)")
{
'description': page.summary,
'url': page.url
}
```
#### Disambiguation
Wikipedia often returns disambiguation pages. Handle by:
1. Detect disambiguation page (check for "may refer to")
2. Search for most likely option (e.g., add "band" or "musician")
3. If still ambiguous, skip
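The fallback steps above can be sketched as a small pure function. It assumes the candidate titles come from the `options` attribute of the `wikipedia` library's `DisambiguationError`. The function name `pick_disambiguation_option` and the hint list are illustrative, not Meelo's actual code.

```python
from typing import Optional, Sequence

# Hypothetical hint words; the text above only mentions "band" and "musician".
DEFAULT_HINTS = ("band", "musician")


def pick_disambiguation_option(
    options: Sequence[str], hints: Sequence[str] = DEFAULT_HINTS
) -> Optional[str]:
    """Return the single option whose title mentions a hint word, else None."""
    for hint in hints:
        matches = [o for o in options if hint in o.lower()]
        if len(matches) == 1:
            return matches[0]
    # Zero or several candidates for every hint: still ambiguous, so skip.
    return None
```

Returning `None` corresponds to step 3: the caller skips Wikipedia for this entity rather than guessing.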
#### Rate Limiting
```python
limiter = AsyncLimiter(5, 1) # 5 requests per second
```
#### Error Handling
- **PageError**: No Wikipedia page, skip
- **DisambiguationError**: Try disambiguation, or skip
- **HTTPError**: Retry with backoff
### Wikidata
**Type**: Structured data
**Library**: SPARQLWrapper (Python)
**Authentication**: None
**Rate Limit**: None applied client-side (the public SPARQL endpoint enforces its own usage limits)
**Priority**: Medium (for structured data)
#### Capabilities
- Artist relationships (members, collaborators)
- Area data (countries, cities, ISO codes)
- Dates (birth, death, formation, dissolution)
- External IDs (MusicBrainz, Discogs, AllMusic)
#### Matching Strategy
1. Query by MusicBrainz ID (if available)
2. Extract Wikidata entity ID
3. Query for additional properties
4. Store structured data
#### Data Extraction
**Artist Data**:
```sparql
SELECT ?property ?value WHERE {
?artist wdt:P434 "MBID" . # MusicBrainz artist ID
?artist ?property ?value .
}
```
**Area Hierarchy**:
```sparql
SELECT ?area ?parent ?iso WHERE {
?area wdt:P31 wd:Q515 . # instance of city
?area wdt:P131 ?parent . # located in
?area wdt:P300 ?iso . # ISO 3166-2 code
}
```
#### Rate Limiting
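No client-side rate limiter is used. The public SPARQL endpoint is fast, but it enforces its own usage limits and can respond with HTTP 429 when they are exceeded, so such responses should be retried after a delay.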
#### Error Handling
- **No Results**: Entity not in Wikidata, skip
- **Timeout**: Retry with simpler query
- **SPARQL Error**: Log and skip
### Discogs
**Type**: Release information
**Library**: discogs_client (Python)
**Authentication**: API token (DISCOGS_ACCESS_TOKEN)
**Rate Limit**: 60 requests/minute
**Priority**: Low (optional)
#### Capabilities
- Release details (catalog number, barcode, format)
- Label information
- Release variations (country, format)
- Marketplace data (not used)
#### Matching Strategy
1. Search by artist + album title
2. Filter by format (CD, Vinyl, etc.)
3. Extract release details
4. Store in Release.extensions JSON
#### Data Extraction
**Release**:
```python
import discogs_client
d = discogs_client.Client('Meelo/1.0', user_token=token)
results = d.search(artist=artist, release_title=album, type='release')
release = results[0]
{
'catalogNumber': release.data['catno'],
'barcode': release.data.get('barcode'),
'format': release.formats[0]['name'],
'country': release.country,
'label': release.labels[0].name
}
```
#### Rate Limiting
```python
limiter = AsyncLimiter(60, 60) # 60 requests per minute
```
#### Error Handling
- **404 Not Found**: No Discogs entry, skip
- **401 Unauthorized**: Invalid token, log error
- **Rate Limit**: Wait 60 seconds and retry
### AllMusic
**Type**: Editorial reviews and ratings
**Library**: BeautifulSoup (web scraping)
**Authentication**: None
**Rate Limit**: 1 request/second (self-imposed, no official API)
**Priority**: Low (optional)
#### Capabilities
- Album reviews
- Album ratings (1-5 stars)
- Artist biographies
- Genre classifications
#### Matching Strategy
1. Search AllMusic by artist + album
2. Scrape search results page
3. Extract review and rating
4. Store rating normalized to 0-100 scale
#### Data Extraction
**Album Review**:
```python
from bs4 import BeautifulSoup
import httpx
url = f"https://www.allmusic.com/search/albums/{artist}+{album}"
response = httpx.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
rating_elem = soup.select_one('.allmusic-rating')
rating = len(rating_elem.select('.star-rating.full')) # Count full stars
review_elem = soup.select_one('.review-text')
review = review_elem.text.strip()
{
'rating': rating * 20,  # Scale 1-5 stars onto the 0-100 scale (20-100)
'description': review
}
```
#### Rate Limiting
```python
limiter = AsyncLimiter(1, 1) # 1 request per second
```
#### Error Handling
- **404 Not Found**: No AllMusic page, skip
- **Parsing Error**: HTML structure changed, log and skip
- **Timeout**: Retry with backoff
#### Scraping Risks
AllMusic has no official API. Scraping may break if HTML structure changes. Disabled by default in settings.json.
### Metacritic
**Type**: Aggregated critic scores
**Library**: BeautifulSoup (web scraping)
**Authentication**: None
**Rate Limit**: 1 request/second (self-imposed)
**Priority**: Low (optional)
#### Capabilities
- Album critic scores (0-100)
- User scores (not used)
- Critic reviews (not extracted)
#### Matching Strategy
1. Search Metacritic by artist + album
2. Scrape album page
3. Extract Metascore
4. Store as rating (already 0-100 scale)
#### Data Extraction
**Album Score**:
```python
url = f"https://www.metacritic.com/music/{album_slug}/{artist_slug}"
response = httpx.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
score_elem = soup.select_one('.metascore_w')
score = int(score_elem.text.strip())
{
'rating': score
}
```
#### Rate Limiting
```python
limiter = AsyncLimiter(1, 1) # 1 request per second
```
#### Error Handling
- **404 Not Found**: Album not on Metacritic, skip
- **Parsing Error**: HTML structure changed, log and skip
- **Timeout**: Retry with backoff
#### Scraping Risks
Same as AllMusic. Disabled by default.
### LrcLib
**Type**: Synced lyrics
**Library**: httpx (direct API calls)
**Authentication**: None
**Rate Limit**: 10 requests/second (self-imposed)
**Priority**: High (for synced lyrics)
#### Capabilities
- Synced lyrics in .lrc format
- Plain lyrics (fallback)
- Lyrics by duration matching (improves accuracy)
#### Matching Strategy
1. Search by artist + title + duration
2. Parse .lrc format to JSON
3. Store in Lyrics.synced field
#### Data Extraction
**Synced Lyrics**:
```python
import re
import httpx

url = "https://lrclib.net/api/get"
params = {
    'artist_name': artist,
    'track_name': title,
    'duration': duration,  # seconds
}
response = httpx.get(url, params=params)
response.raise_for_status()  # a 404 means no match for this track
data = response.json()
lrc_text = data.get('syncedLyrics') or ''  # null when only plain lyrics exist

# Parse .lrc lines of the form "[mm:ss.xx] text"
lines = []
for line in lrc_text.splitlines():
    match = re.match(r'\[(\d+):(\d+(?:\.\d+)?)\](.*)', line)
    if match:
        minutes, seconds, text = match.groups()
        time_ms = (int(minutes) * 60 + float(seconds)) * 1000
        # round() avoids float truncation such as 12339.999... -> 12339
        lines.append({'time': round(time_ms), 'text': text.strip()})

lyrics = {
    'synced': lines,
    'plain': data.get('plainLyrics'),
}
```
#### Rate Limiting
```python
limiter = AsyncLimiter(10, 1) # 10 requests per second
```
#### Error Handling
- **404 Not Found**: No LrcLib match for the track, fall back to plain lyrics (e.g., from Genius)
- **Parsing Error**: Invalid .lrc format, skip
- **Timeout**: Retry with backoff
## Scrobbling Services
### Last.fm
**Type**: Scrobbling service
**Library**: pylast (Python)
**Authentication**: OAuth (LASTFM_API_KEY, LASTFM_API_SECRET)
**Rate Limit**: None specified
**Integration**: Server (NestJS)
#### Capabilities
- Scrobble track plays
- Update "now playing" status
- Retrieve user listening history (not implemented)
#### OAuth Flow
1. User clicks "Connect Last.fm" in settings
2. Server redirects to Last.fm OAuth page
3. User authorizes Meelo
4. Last.fm redirects to callback with token
5. Server exchanges token for session key
6. Session key stored in UserScrobbler.data JSON
#### Scrobbling
**Now Playing**:
```typescript
await lastfm.updateNowPlaying({
artist: track.song.artist.name,
track: track.song.name,
album: track.release.album.name,
duration: track.duration
});
```
**Scrobble**:
```typescript
await lastfm.scrobble({
artist: track.song.artist.name,
track: track.song.name,
album: track.release.album.name,
timestamp: Math.floor(Date.now() / 1000)
});
```
#### Scrobble Rules
- Track must play for at least 30 seconds or 50% of duration (whichever is shorter)
- Scrobble sent when track ends or user skips past 50%
- "Now playing" sent immediately on play
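The threshold in the first rule can be expressed as a small check. This is a minimal sketch of the rule as stated here, with durations in seconds; `scrobble_threshold` and `should_scrobble` are hypothetical names, not functions from the Meelo codebase.

```python
def scrobble_threshold(duration_s: float) -> float:
    """Seconds of playback required: 30 s or half the track, whichever is shorter."""
    return min(30.0, duration_s / 2)


def should_scrobble(played_s: float, duration_s: float) -> bool:
    """True once playback has reached the threshold for this track."""
    return duration_s > 0 and played_s >= scrobble_threshold(duration_s)
```

For a 4-minute track the threshold is 30 seconds; for a 40-second track it is 20 seconds.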
#### Error Handling
- **Invalid Session**: Re-authenticate user
- **Network Error**: Queue scrobble for retry
- **Rate Limit**: Wait and retry
### ListenBrainz
**Type**: Open-source scrobbling service
**Library**: pylistenbrainz (Python)
**Authentication**: User token
**Rate Limit**: None specified
**Integration**: Server (NestJS)
#### Capabilities
- Submit listens (scrobbles)
- Retrieve listening history (not implemented)
- Statistics and recommendations (not implemented)
#### Authentication
1. User obtains token from ListenBrainz settings
2. User enters token in Meelo settings
3. Token stored in UserScrobbler.data JSON
4. No OAuth flow needed
#### Submitting Listens
**Single Listen**:
```typescript
await listenbrainz.submitListen({
listened_at: Math.floor(Date.now() / 1000),
track_metadata: {
artist_name: track.song.artist.name,
track_name: track.song.name,
release_name: track.release.album.name,
additional_info: {
duration_ms: track.duration * 1000,
tracknumber: track.trackIndex
}
}
});
```
#### Listen Types
- **Single**: Submit one listen (used for scrobbling)
- **Playing Now**: Update current track (not implemented)
- **Import**: Bulk import (not used)
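The three listen types differ mainly in the request body. The sketch below builds that body in the shape the ListenBrainz submit-listens endpoint expects (`listen_type` plus a `payload` list); a `playing_now` listen carries no `listened_at` timestamp. The helper name `build_listen_body` is an assumption made for illustration, and sending the request (with the user token in the `Authorization` header) is left out.

```python
import time
from typing import Any, Dict, Optional


def build_listen_body(
    listen_type: str,
    artist: str,
    title: str,
    release: Optional[str] = None,
    listened_at: Optional[int] = None,
) -> Dict[str, Any]:
    """Build the JSON body for one listen; `playing_now` carries no timestamp."""
    if listen_type not in ("single", "playing_now", "import"):
        raise ValueError(f"unknown listen type: {listen_type}")
    metadata: Dict[str, Any] = {"artist_name": artist, "track_name": title}
    if release:
        metadata["release_name"] = release
    listen: Dict[str, Any] = {"track_metadata": metadata}
    if listen_type != "playing_now":
        # Unix timestamp in seconds; defaults to "now" when not supplied.
        listen["listened_at"] = int(time.time()) if listened_at is None else listened_at
    return {"listen_type": listen_type, "payload": [listen]}
```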
#### Error Handling
- **Invalid Token**: Notify user to re-enter token
- **Network Error**: Queue listen for retry
- **Rate Limit**: Wait and retry
## Provider Configuration
### settings.json
```json
{
  "providers": {
    "musicbrainz": {
      "enabled": true
    },
    "genius": {
      "enabled": true
    },
    "wikipedia": {
      "enabled": true
    },
    "wikidata": {
      "enabled": true
    },
    "discogs": {
      "enabled": false
    },
    "allmusic": {
      "enabled": false
    },
    "metacritic": {
      "enabled": false
    },
    "lrclib": {
      "enabled": true
    }
  },
  "metadata": {
    "source": "providers",
    "order": ["musicbrainz", "genius", "wikipedia", "lrclib", "wikidata"]
  }
}
```
**Fields**:
- `providers.<name>.enabled`: Enable/disable provider
- `metadata.source`: Prefer "embedded" tags or "providers"
- `metadata.order`: Provider priority for conflicting data
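To illustrate how these fields might combine, the following sketch derives the list of providers to query from a settings document of the shape shown above. Placing enabled providers that are missing from `metadata.order` at the end is an assumption, as is the helper name `enabled_providers_in_order`.

```python
import json
from typing import List

# Trimmed-down example in the shape shown above.
SETTINGS_JSON = """
{
  "providers": {
    "musicbrainz": {"enabled": true},
    "genius": {"enabled": true},
    "discogs": {"enabled": false},
    "lrclib": {"enabled": true}
  },
  "metadata": {"source": "providers", "order": ["musicbrainz", "genius", "lrclib"]}
}
"""


def enabled_providers_in_order(settings: dict) -> List[str]:
    """Enabled providers: those listed in `metadata.order` first, then the rest."""
    providers = settings.get("providers", {})
    enabled = [name for name, cfg in providers.items() if cfg.get("enabled")]
    order = settings.get("metadata", {}).get("order", [])
    ranked = [name for name in order if name in enabled]
    unranked = [name for name in enabled if name not in order]  # assumption: rank last
    return ranked + unranked


print(enabled_providers_in_order(json.loads(SETTINGS_JSON)))
```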
### .env
```bash
# Genius
GENIUS_ACCESS_TOKEN=your_genius_token
# Discogs
DISCOGS_ACCESS_TOKEN=your_discogs_token
# Last.fm
LASTFM_API_KEY=your_lastfm_key
LASTFM_API_SECRET=your_lastfm_secret
# Public URL for OAuth callbacks
PUBLIC_URL=https://meelo.example.com
```
## Provider Priority
When multiple providers return conflicting data, Matcher uses priority from `metadata.order`:
1. **MusicBrainz**: Highest priority (most accurate)
2. **Genius**: High priority for lyrics
3. **Wikipedia**: Medium priority for descriptions
4. **LrcLib**: High priority for synced lyrics
5. **Wikidata**: Medium priority for structured data
6. **Discogs**: Low priority (optional)
7. **AllMusic**: Low priority (optional)
8. **Metacritic**: Low priority (optional)
## Data Aggregation
### Descriptions
Concatenate descriptions from multiple providers:
```
MusicBrainz: "The Beatles were an English rock band..."
Wikipedia: "Formed in Liverpool in 1960..."
Genius: "Known for their innovative songwriting..."
Result: "The Beatles were an English rock band... Formed in Liverpool in 1960... Known for their innovative songwriting..."
```
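A minimal sketch of this concatenation, assuming descriptions arrive as a provider-to-text mapping and are joined in `metadata.order`. Skipping empty and repeated texts is an added assumption, and `merge_descriptions` is a hypothetical name.

```python
from typing import Dict, List


def merge_descriptions(descriptions: Dict[str, str], order: List[str]) -> str:
    """Join non-empty descriptions in provider-priority order, skipping repeats."""
    parts: List[str] = []
    for provider in order:
        text = (descriptions.get(provider) or "").strip()
        if text and text not in parts:
            parts.append(text)
    return " ".join(parts)
```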
### Ratings
Average ratings from multiple providers:
```
AllMusic: 90/100
Metacritic: 85/100
Result: (90 + 85) / 2 = 87.5 → 88/100
```
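The averaging step can be sketched as follows. The example above rounds 87.5 up to 88, so the sketch rounds half up explicitly, because Python's built-in `round()` rounds halves to the nearest even integer. `average_rating` is a hypothetical helper name.

```python
import math
from typing import Iterable, Optional


def average_rating(ratings: Iterable[Optional[float]]) -> Optional[int]:
    """Mean of the available 0-100 ratings, rounded half up; None if there are none."""
    values = [r for r in ratings if r is not None]
    if not values:
        return None
    mean = sum(values) / len(values)
    # round(86.5) == 86 under banker's rounding, so round half up instead.
    return math.floor(mean + 0.5)
```

Providers that returned no rating are passed as `None` and ignored, so a single available rating is returned unchanged.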
### Lyrics
Prefer synced lyrics over plain:
```
LrcLib: Synced lyrics available → Use synced
Genius: Plain lyrics available → Use as fallback
```
If both are available, store both in the Lyrics table.
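As a sketch of this preference, the function below keeps both forms and records which one to display. The returned field names mirror the `synced` and `plain` keys used earlier; the `preferred` key and the name `merge_lyrics` are illustrative assumptions.

```python
from typing import Dict, List, Optional


def merge_lyrics(
    synced: Optional[List[Dict[str, object]]], plain: Optional[str]
) -> Dict[str, object]:
    """Keep both forms when present and report which one to display."""
    preferred = "synced" if synced else ("plain" if plain else None)
    return {"synced": synced or None, "plain": plain or None, "preferred": preferred}
```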
## Matching Workflow
1. **Scanner** registers file with Server
2. **Scanner** publishes `file.added` event to RabbitMQ
3. **Matcher** consumes event
4. **Matcher** fetches file metadata from Server
5. **Matcher** queries enabled providers in parallel:
- MusicBrainz by AcoustID fingerprint
- Genius by artist + title
- Wikipedia by artist name
- LrcLib by artist + title + duration
- Wikidata by MusicBrainz ID (if found)
- Discogs by artist + album (if enabled)
- AllMusic by artist + album (if enabled)
- Metacritic by artist + album (if enabled)
6. **Matcher** aggregates results based on priority
7. **Matcher** pushes enriched metadata to Server
8. **Server** updates database and search index
## Error Recovery
### Provider Failures
If a provider fails:
1. Log error with provider name and reason
2. Continue with other providers
3. Push partial metadata to Server
4. Mark track as "partially matched"
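The four steps above might look like the following when providers are queried with `asyncio.gather`. This is a sketch under stated assumptions: providers are passed as a name-to-coroutine-function mapping, and the returned `status` strings and the name `query_providers` are hypothetical.

```python
import asyncio
import logging
from typing import Any, Awaitable, Callable, Dict, List

logger = logging.getLogger("matcher")


async def query_providers(
    providers: Dict[str, Callable[[], Awaitable[Any]]],
) -> Dict[str, Any]:
    """Run all provider coroutines; log failures and keep whatever succeeded."""
    names = list(providers)
    results = await asyncio.gather(
        *(providers[name]() for name in names), return_exceptions=True
    )
    metadata: Dict[str, Any] = {}
    failed: List[str] = []
    for name, result in zip(names, results):
        if isinstance(result, BaseException):
            logger.warning("provider %s failed: %r", name, result)  # step 1
            failed.append(name)
        else:
            metadata[name] = result  # steps 2-3: keep partial results
    status = "partially matched" if failed else "matched"  # step 4
    return {"metadata": metadata, "failed": failed, "status": status}
```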
### Retry Logic
For transient errors (network, rate limit):
1. Retry with exponential backoff
2. Max 3 attempts per provider
3. If all attempts fail, skip provider
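A minimal sketch of this retry policy with `asyncio`. The base delay, the set of exception types treated as transient, and the name `with_retries` are assumptions; the source only fixes exponential backoff and a maximum of 3 attempts.

```python
import asyncio
from typing import Awaitable, Callable, Optional, Tuple, Type, TypeVar

T = TypeVar("T")


async def with_retries(
    call: Callable[[], Awaitable[T]],
    attempts: int = 3,
    base_delay: float = 1.0,
    transient: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
) -> Optional[T]:
    """Retry `call` on transient errors, doubling the delay; None means 'skip'."""
    for attempt in range(attempts):
        try:
            return await call()
        except transient:
            if attempt == attempts - 1:
                return None  # all attempts failed: skip this provider
            await asyncio.sleep(base_delay * 2**attempt)  # 1 s, 2 s, ...
    return None
```

Errors outside `transient` (for example an invalid token) propagate immediately, since retrying them would not help.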
### Manual Refresh
Users can trigger metadata refresh via Scanner API:
```bash
POST /scanner/refresh
```
This re-queries all providers for existing tracks.
## Performance Optimization
### Parallel Queries
Matcher queries all providers in parallel using asyncio:
```python
import asyncio

async def enrich_metadata(file_id):
    tasks = [
        fetch_musicbrainz(file_id),
        fetch_genius(file_id),
        fetch_wikipedia(file_id),
        fetch_lrclib(file_id),
        fetch_wikidata(file_id),
    ]
    # With return_exceptions=True, failures come back as exception objects
    # in `results` instead of being raised.
    results = await asyncio.gather(*tasks, return_exceptions=True)
    return aggregate_results(results)
```
### Caching
Provider responses are cached in memory for 1 hour:
- Reduces duplicate queries during batch scans
- Invalidated on manual refresh
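One way to get this behavior is a small time-to-live cache. The sketch below is illustrative only: the class name `TTLCache`, the key format, and the injectable `clock` (used to make expiry testable) are assumptions, not Meelo's implementation.

```python
import time
from typing import Any, Callable, Dict, Hashable, Optional, Tuple


class TTLCache:
    """In-memory cache whose entries expire after `ttl` seconds."""

    def __init__(self, ttl: float = 3600.0, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl
        self._clock = clock
        self._entries: Dict[Hashable, Tuple[float, Any]] = {}

    def get(self, key: Hashable) -> Optional[Any]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if self._clock() - stored_at >= self._ttl:
            del self._entries[key]  # expired
            return None
        return value

    def put(self, key: Hashable, value: Any) -> None:
        self._entries[key] = (self._clock(), value)

    def clear(self) -> None:
        """Drop everything, e.g. when a manual refresh is triggered."""
        self._entries.clear()
```

A key such as `("genius", artist, title)` lets repeated lookups during a batch scan reuse one response.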
### Rate Limit Coordination
Rate limiters are shared across all workers:
- Prevents exceeding provider limits
- Uses token bucket algorithm
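For reference, a token bucket can be sketched in a few lines. This version is in-process only; sharing it across worker processes would require external state, which the source does not describe. The class name `TokenBucket` and the injectable `clock` are assumptions.

```python
import time
from typing import Callable


class TokenBucket:
    """Token bucket: holds up to `capacity` tokens, refilled at `rate` per second."""

    def __init__(
        self, rate: float, capacity: float, clock: Callable[[], float] = time.monotonic
    ):
        self._rate = rate
        self._capacity = capacity
        self._clock = clock
        self._tokens = capacity  # start full, allowing an initial burst
        self._updated = clock()

    def try_acquire(self, tokens: float = 1.0) -> bool:
        """Take `tokens` if available; return False when the caller must wait."""
        now = self._clock()
        elapsed = now - self._updated
        self._updated = now
        # Refill for the elapsed time, never beyond capacity.
        self._tokens = min(self._capacity, self._tokens + elapsed * self._rate)
        if self._tokens >= tokens:
            self._tokens -= tokens
            return True
        return False
```

For Discogs' 60 requests/minute, `TokenBucket(rate=1.0, capacity=60.0)` would allow a burst of 60 and then one request per second.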
## Privacy Considerations
### Data Sent to Providers
- **MusicBrainz**: AcoustID fingerprint, artist/album/track names
- **Genius**: Artist and track names
- **Wikipedia**: Artist and album names
- **Wikidata**: MusicBrainz IDs
- **Discogs**: Artist and album names
- **AllMusic**: Artist and album names
- **Metacritic**: Artist and album names
- **LrcLib**: Artist, track name, duration
No file paths or user data are sent.
### Scrobbling Privacy
- **Last.fm**: Track plays sent with timestamp
- **ListenBrainz**: Track plays sent with timestamp
Users control scrobbling via settings. Disabled by default.
## Future Enhancements
### Additional Providers
Potential providers to add:
- **Spotify**: Metadata and popularity scores
- **Apple Music**: Editorial content
- **Bandcamp**: Independent artist data
- **RateYourMusic**: User ratings and reviews
### Provider Plugins
Allow users to add custom providers via plugin system.
### Offline Mode
Cache provider responses for offline access.
### Provider Statistics
Track provider accuracy and response times. Display in admin panel.
## Summary
Meelo's integration architecture separates concerns: Matcher handles provider queries, Server handles scrobbling. The provider pattern enables easy addition of new sources. Parallel queries and rate limiting optimize performance. Priority-based aggregation ensures data quality. OAuth flows and token management handle authentication. The system is flexible (enable/disable providers), resilient (retry logic, partial results), and privacy-conscious (no file paths sent).