metadata-agregator/docs/research/meelo/analysis/INTEGRATIONS.md
Alexander a1f6701bac feat: initial implementation of metadata aggregator
- gRPC service with MusicBrainz provider
- PostgreSQL schema with migrations
- Service layer with database-first caching
- Repository pattern for data access
- YAML configuration support
- Research documentation for 17 music metadata projects
2026-04-28 16:28:53 +02:00


Meelo Integrations

Integration Overview

Meelo integrates with 8 metadata providers and 2 scrobbling services. The Matcher service handles provider queries, while the Server handles scrobbling. All integrations are configurable via settings.json and .env.

Metadata Providers

MusicBrainz

Type: Primary music database
Library: musicbrainzngs (Python)
Authentication: None (public API)
Rate Limit: 1 request/second
Priority: Highest (primary source)

Capabilities

  • Artist metadata (name, sort name, areas, relationships)
  • Album metadata (title, type, release date, labels)
  • Track metadata (title, duration, ISRC)
  • Recording relationships (covers, remixes, versions)
  • Release groups and releases
  • Area data (countries, cities with ISO 3166 codes)

Matching Strategy

  1. Query by AcoustID fingerprint (most accurate)
  2. If no fingerprint, search by artist + album + track title
  3. Extract MBID (MusicBrainz ID) for future queries
  4. Store MBID in LocalIdentifiers table

Data Extraction

Artist:

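import musicbrainzngs as mb  # call mb.set_useragent(...) once before any request

# 'areas' is not a valid include for artists; area data is returned by default
artist_data = mb.get_artist_by_id(mbid, includes=['aliases'])
artist = artist_data['artist']
{
  'name': artist['name'],
  'sortName': artist['sort-name'],
  # musicbrainzngs exposes a single 'area' dict plus an optional 'begin-area'
  'areas': [artist[key]['name'] for key in ('area', 'begin-area') if key in artist]
}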

Album:

# 'labels' is a valid include for releases, not for release groups
release_group = mb.get_release_group_by_id(mbid, includes=['releases'])
group = release_group['release-group']
{
  'name': group['title'],
  'type': group.get('type'),
  'releaseDate': group.get('first-release-date'),
  'releases': [...]
}

Track:

recording = mb.get_recording_by_id(mbid, includes=['isrcs', 'releases'])
{
  'title': recording['recording']['title'],
  'duration': recording['recording']['length'],
  'isrc': recording['recording'].get('isrc-list', [None])[0]
}

Rate Limiting

The musicbrainzngs library enforces 1 request/second by default, so no additional limiting is needed.

Error Handling

  • 404 Not Found: No match, skip provider
  • 503 Service Unavailable: Retry with exponential backoff (max 3 attempts)
  • Rate Limit Exceeded: Wait and retry

Genius

Type: Lyrics and song descriptions
Library: lyricsgenius (Python)
Authentication: API token (GENIUS_ACCESS_TOKEN)
Rate Limit: 10 requests/second
Priority: High (for lyrics)

Capabilities

  • Song lyrics (plain text)
  • Song descriptions and annotations
  • Artist biographies
  • Album descriptions

Matching Strategy

  1. Search by artist + song title
  2. Extract song ID from search results
  3. Fetch full song data including lyrics
  4. Store lyrics in Lyrics table

Data Extraction

Lyrics:

genius = lyricsgenius.Genius(token)
song = genius.search_song(title, artist)  # returns None when nothing matches
{
  'plain': song.lyrics,
  'description': song.description
}

Artist Bio:

artist = genius.search_artist(name)
{
  'description': artist.description
}

Rate Limiting

Implemented using aiolimiter:

from aiolimiter import AsyncLimiter

limiter = AsyncLimiter(10, 1)  # 10 requests per second
async with limiter:
    result = await fetch_genius(...)

Error Handling

  • 404 Not Found: No lyrics available, skip
  • 401 Unauthorized: Invalid token, log error
  • Rate Limit: Wait and retry

Wikipedia

Type: Artist and album context
Library: wikipedia (Python)
Authentication: None
Rate Limit: 5 requests/second (self-imposed)
Priority: Medium (for descriptions)

Capabilities

  • Artist biographies
  • Album background and reception
  • Contextual information (formation, breakup, influences)

Matching Strategy

  1. Search Wikipedia by artist/album name
  2. Extract first paragraph as description
  3. Store full URL as source

Data Extraction

Artist Bio:

import wikipedia
page = wikipedia.page(artist_name)
{
  'description': page.summary,
  'url': page.url
}

Album Context:

page = wikipedia.page(f"{album_name} ({artist_name} album)")
{
  'description': page.summary,
  'url': page.url
}

Disambiguation

Wikipedia often returns disambiguation pages. Handle by:

  1. Detect disambiguation page (check for "may refer to")
  2. Search for most likely option (e.g., add "band" or "musician")
  3. If still ambiguous, skip
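
The disambiguation steps above can be sketched as a small pure helper. In the wikipedia package the candidate titles are available on DisambiguationError.options; the function name, the hint words beyond "band" and "musician", and the rule of skipping when more than one candidate matches are assumptions for illustration, not Meelo's actual code.

```python
from typing import Optional, Sequence

# Hint words taken from the example in the text above.
MUSIC_HINTS = ("band", "musician")


def pick_disambiguation_option(
    options: Sequence[str], hints: Sequence[str] = MUSIC_HINTS
) -> Optional[str]:
    """Return the single option whose title contains a music hint, else None."""
    matches = [
        title
        for title in options
        if any(hint in title.lower() for hint in hints)
    ]
    # Exactly one music-related candidate: use it. Zero or several candidates
    # means the lookup is still ambiguous, so the caller should skip.
    return matches[0] if len(matches) == 1 else None
```

A caller would catch DisambiguationError, pass its options to this helper, and request the returned title, or skip Wikipedia when the helper returns None.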

Rate Limiting

limiter = AsyncLimiter(5, 1)  # 5 requests per second

Error Handling

  • PageError: No Wikipedia page, skip
  • DisambiguationError: Try disambiguation, or skip
  • HTTPError: Retry with backoff

Wikidata

Type: Structured data
Library: SPARQLWrapper (Python)
Authentication: None
Rate Limit: None enforced client-side (the public SPARQL endpoint applies its own limits)
Priority: Medium (for structured data)

Capabilities

  • Artist relationships (members, collaborators)
  • Area data (countries, cities, ISO codes)
  • Dates (birth, death, formation, dissolution)
  • External IDs (MusicBrainz, Discogs, AllMusic)

Matching Strategy

  1. Query by MusicBrainz ID (if available)
  2. Extract Wikidata entity ID
  3. Query for additional properties
  4. Store structured data

Data Extraction

Artist Data:

SELECT ?property ?value WHERE {
  ?artist wdt:P434 "MBID" .  # MusicBrainz artist ID
  ?artist ?property ?value .
}

Area Hierarchy:

SELECT ?area ?parent ?iso WHERE {
  ?area wdt:P31 wd:Q515 .  # instance of city
  ?area wdt:P131 ?parent .  # located in
  ?area wdt:P300 ?iso .  # ISO 3166-2 code
}
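
As a hedged sketch of how the "MBID" placeholder in the artist query might be filled in, the helper below validates the identifier before substituting it. The function and constant names are hypothetical; the only facts relied on are that MBIDs are UUIDs and that P434 is the MusicBrainz artist ID property.

```python
import uuid

# Same shape as the artist query above, with the MBID as a placeholder.
ARTIST_QUERY_TEMPLATE = """
SELECT ?property ?value WHERE {{
  ?artist wdt:P434 "{mbid}" .
  ?artist ?property ?value .
}}
"""


def build_artist_query(mbid: str) -> str:
    """Return the artist query for one MBID, rejecting malformed IDs."""
    # MBIDs are UUIDs. Parsing first raises ValueError on anything else,
    # which keeps arbitrary text out of the SPARQL string.
    canonical = str(uuid.UUID(mbid))
    return ARTIST_QUERY_TEMPLATE.format(mbid=canonical).strip()
```

The resulting string can then be handed to SPARQLWrapper via setQuery.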

Rate Limiting

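No client-side rate limit is applied. The public SPARQL endpoint is usually fast, but it enforces its own per-client limits and a query timeout, so requests should send a descriptive User-Agent and back off on HTTP 429 responses.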

Error Handling

  • No Results: Entity not in Wikidata, skip
  • Timeout: Retry with simpler query
  • SPARQL Error: Log and skip

Discogs

Type: Release information
Library: discogs_client (Python)
Authentication: API token (DISCOGS_ACCESS_TOKEN)
Rate Limit: 60 requests/minute
Priority: Low (optional)

Capabilities

  • Release details (catalog number, barcode, format)
  • Label information
  • Release variations (country, format)
  • Marketplace data (not used)

Matching Strategy

  1. Search by artist + album title
  2. Filter by format (CD, Vinyl, etc.)
  3. Extract release details
  4. Store in Release.extensions JSON

Data Extraction

Release:

import discogs_client
d = discogs_client.Client('Meelo/1.0', user_token=token)
results = d.search(artist=artist, release_title=album, type='release')
release = results[0]
{
  'catalogNumber': release.data['catno'],
  'barcode': release.data.get('barcode'),
  'format': release.formats[0]['name'],
  'country': release.country,
  'label': release.labels[0].name
}

Rate Limiting

limiter = AsyncLimiter(60, 60)  # 60 requests per minute

Error Handling

  • 404 Not Found: No Discogs entry, skip
  • 401 Unauthorized: Invalid token, log error
  • Rate Limit: Wait 60 seconds and retry

AllMusic

Type: Editorial reviews and ratings
Library: BeautifulSoup (web scraping)
Authentication: None
Rate Limit: 1 request/second (self-imposed, no official API)
Priority: Low (optional)

Capabilities

  • Album reviews
  • Album ratings (1-5 stars)
  • Artist biographies
  • Genre classifications

Matching Strategy

  1. Search AllMusic by artist + album
  2. Scrape search results page
  3. Extract review and rating
  4. Store rating normalized to 0-100 scale

Data Extraction

Album Review:

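from urllib.parse import quote_plus

from bs4 import BeautifulSoup
import httpx

# URL-encode the search terms so spaces and punctuation survive
url = f"https://www.allmusic.com/search/albums/{quote_plus(f'{artist} {album}')}"
response = httpx.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# Selectors are site-specific; either element may be missing
rating_elem = soup.select_one('.allmusic-rating')
rating = len(rating_elem.select('.star-rating.full')) if rating_elem else None  # Count full stars

review_elem = soup.select_one('.review-text')
review = review_elem.text.strip() if review_elem else None

{
  'rating': rating * 20 if rating is not None else None,  # 1-5 stars map to 20-100
  'description': review
}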

Rate Limiting

limiter = AsyncLimiter(1, 1)  # 1 request per second

Error Handling

  • 404 Not Found: No AllMusic page, skip
  • Parsing Error: HTML structure changed, log and skip
  • Timeout: Retry with backoff

Scraping Risks

AllMusic has no official API. Scraping may break if HTML structure changes. Disabled by default in settings.json.

Metacritic

Type: Aggregated critic scores
Library: BeautifulSoup (web scraping)
Authentication: None
Rate Limit: 1 request/second (self-imposed)
Priority: Low (optional)

Capabilities

  • Album critic scores (0-100)
  • User scores (not used)
  • Critic reviews (not extracted)

Matching Strategy

  1. Search Metacritic by artist + album
  2. Scrape album page
  3. Extract Metascore
  4. Store as rating (already 0-100 scale)

Data Extraction

Album Score:

url = f"https://www.metacritic.com/music/{album_slug}/{artist_slug}"
response = httpx.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

score_elem = soup.select_one('.metascore_w')
score = int(score_elem.text.strip())

{
  'rating': score
}

Rate Limiting

limiter = AsyncLimiter(1, 1)  # 1 request per second

Error Handling

  • 404 Not Found: Album not on Metacritic, skip
  • Parsing Error: HTML structure changed, log and skip
  • Timeout: Retry with backoff

Scraping Risks

Same as AllMusic. Disabled by default.

LrcLib

Type: Synced lyrics
Library: httpx (direct API calls)
Authentication: None
Rate Limit: 10 requests/second (self-imposed)
Priority: High (for synced lyrics)

Capabilities

  • Synced lyrics in .lrc format
  • Plain lyrics (fallback)
  • Lyrics by duration matching (improves accuracy)

Matching Strategy

  1. Search by artist + title + duration
  2. Parse .lrc format to JSON
  3. Store in Lyrics.synced field

Data Extraction

Synced Lyrics:

import re

import httpx

url = "https://lrclib.net/api/get"
params = {
  'artist_name': artist,
  'track_name': title,
  'duration': duration
}
response = httpx.get(url, params=params)
data = response.json()

lrc_text = data.get('syncedLyrics') or ''  # null when no synced lyrics exist
# Parse .lrc format
lines = []
for line in lrc_text.split('\n'):
    match = re.match(r'\[(\d+):(\d+\.\d+)\](.*)', line)
    if match:
        minutes, seconds, text = match.groups()
        time_ms = (int(minutes) * 60 + float(seconds)) * 1000
        lines.append({'time': int(time_ms), 'text': text.strip()})

{
  'synced': lines,
  'plain': data.get('plainLyrics')
}

Rate Limiting

limiter = AsyncLimiter(10, 1)  # 10 requests per second

Error Handling

  • 404 Not Found: No synced lyrics, try plain lyrics
  • Parsing Error: Invalid .lrc format, skip
  • Timeout: Retry with backoff

Scrobbling Services

Last.fm

Type: Scrobbling service
Library: pylast (Python)
Authentication: OAuth (LASTFM_API_KEY, LASTFM_API_SECRET)
Rate Limit: None specified
Integration: Server (NestJS)

Capabilities

  • Scrobble track plays
  • Update "now playing" status
  • Retrieve user listening history (not implemented)

OAuth Flow

  1. User clicks "Connect Last.fm" in settings
  2. Server redirects to Last.fm OAuth page
  3. User authorizes Meelo
  4. Last.fm redirects to callback with token
  5. Server exchanges token for session key
  6. Session key stored in UserScrobbler.data JSON

Scrobbling

Now Playing:

await lastfm.updateNowPlaying({
  artist: track.song.artist.name,
  track: track.song.name,
  album: track.release.album.name,
  duration: track.duration
});

Scrobble:

await lastfm.scrobble({
  artist: track.song.artist.name,
  track: track.song.name,
  album: track.release.album.name,
  timestamp: Math.floor(Date.now() / 1000)
});

Scrobble Rules

  • Track must play for at least 30 seconds or 50% of duration (whichever is shorter)
  • Scrobble sent when track ends or user skips past 50%
  • "Now playing" sent immediately on play

Error Handling

  • Invalid Session: Re-authenticate user
  • Network Error: Queue scrobble for retry
  • Rate Limit: Wait and retry

ListenBrainz

Type: Open-source scrobbling service
Library: pylistenbrainz (Python)
Authentication: User token
Rate Limit: None specified
Integration: Server (NestJS)

Capabilities

  • Submit listens (scrobbles)
  • Retrieve listening history (not implemented)
  • Statistics and recommendations (not implemented)

Authentication

  1. User obtains token from ListenBrainz settings
  2. User enters token in Meelo settings
  3. Token stored in UserScrobbler.data JSON
  4. No OAuth flow needed

Submitting Listens

Single Listen:

await listenbrainz.submitListen({
  listened_at: Math.floor(Date.now() / 1000),
  track_metadata: {
    artist_name: track.song.artist.name,
    track_name: track.song.name,
    release_name: track.release.album.name,
    additional_info: {
      duration_ms: track.duration * 1000,
      tracknumber: track.trackIndex
    }
  }
});

Listen Types

  • Single: Submit one listen (used for scrobbling)
  • Playing Now: Update current track (not implemented)
  • Import: Bulk import (not used)
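
To illustrate how the listen types differ on the wire, here is a hedged Python sketch of the request body for the ListenBrainz submit-listens endpoint. The function name is hypothetical and this is not Meelo's server code (which is NestJS); it only shows that every submission wraps the listen in a listen_type/payload envelope and that playing_now listens carry no listened_at timestamp.

```python
from typing import Any, Dict, Optional

VALID_LISTEN_TYPES = ("single", "playing_now", "import")


def build_listen_payload(
    listen_type: str,
    artist_name: str,
    track_name: str,
    release_name: Optional[str] = None,
    listened_at: Optional[int] = None,
) -> Dict[str, Any]:
    """Build the JSON body for one listen of the given type."""
    if listen_type not in VALID_LISTEN_TYPES:
        raise ValueError(f"unknown listen_type: {listen_type}")

    track_metadata: Dict[str, Any] = {
        "artist_name": artist_name,
        "track_name": track_name,
    }
    if release_name:
        track_metadata["release_name"] = release_name

    listen: Dict[str, Any] = {"track_metadata": track_metadata}
    # playing_now notifications omit the timestamp; other types require it.
    if listen_type != "playing_now":
        if listened_at is None:
            raise ValueError("listened_at is required for this listen type")
        listen["listened_at"] = listened_at

    return {"listen_type": listen_type, "payload": [listen]}
```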

Error Handling

  • Invalid Token: Notify user to re-enter token
  • Network Error: Queue listen for retry
  • Rate Limit: Wait and retry

Provider Configuration

settings.json

{
  "providers": {
    "musicbrainz": {
      "enabled": true
    },
    "genius": {
      "enabled": true
    },
    "wikipedia": {
      "enabled": true
    },
    "wikidata": {
      "enabled": true
    },
    "discogs": {
      "enabled": false
    },
    "allmusic": {
      "enabled": false
    },
    "metacritic": {
      "enabled": false
    },
    "lrclib": {
      "enabled": true
    }
  },
  "metadata": {
    "source": "providers",
    "order": ["musicbrainz", "genius", "wikipedia", "lrclib", "wikidata"]
  }
}

Fields:

  • providers.<name>.enabled: Enable/disable provider
  • metadata.source: Prefer "embedded" tags or "providers"
  • metadata.order: Provider priority for conflicting data
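
A minimal sketch of how these fields could be combined into a ranked list of active providers. The function name and the choice to rank enabled providers that are missing from metadata.order last are assumptions; the source does not say how unlisted providers are treated.

```python
import json
from typing import Any, Dict, List


def enabled_providers_in_order(settings: Dict[str, Any]) -> List[str]:
    """List enabled providers, ranked by metadata.order."""
    providers = settings.get("providers", {})
    enabled = [name for name, cfg in providers.items() if cfg.get("enabled", False)]
    order = settings.get("metadata", {}).get("order", [])
    ranked = [name for name in order if name in enabled]
    # Assumption: enabled providers missing from metadata.order rank last.
    unranked = [name for name in enabled if name not in ranked]
    return ranked + unranked


# A reduced settings.json used only for this example.
EXAMPLE = json.loads("""
{
  "providers": {
    "musicbrainz": {"enabled": true},
    "genius": {"enabled": false},
    "discogs": {"enabled": true},
    "lrclib": {"enabled": true}
  },
  "metadata": {"source": "providers", "order": ["musicbrainz", "genius", "lrclib"]}
}
""")
print(enabled_providers_in_order(EXAMPLE))
```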

.env

# Genius
GENIUS_ACCESS_TOKEN=your_genius_token

# Discogs
DISCOGS_ACCESS_TOKEN=your_discogs_token

# Last.fm
LASTFM_API_KEY=your_lastfm_key
LASTFM_API_SECRET=your_lastfm_secret

# Public URL for OAuth callbacks
PUBLIC_URL=https://meelo.example.com

Provider Priority

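When multiple providers return conflicting data, Matcher applies the priority given in metadata.order. The example configuration lists only the first five providers; the optional ones rank below them: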

  1. MusicBrainz: Highest priority (most accurate)
  2. Genius: High priority for lyrics
  3. Wikipedia: Medium priority for descriptions
  4. LrcLib: High priority for synced lyrics
  5. Wikidata: Medium priority for structured data
  6. Discogs: Low priority (optional)
  7. AllMusic: Low priority (optional)
  8. Metacritic: Low priority (optional)
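
Priority-based conflict resolution for a single field can be sketched as "first non-empty value in priority order". The function name and the treatment of empty values as missing are assumptions for illustration.

```python
from typing import Dict, Optional, Sequence


def resolve_field(
    candidates: Dict[str, Optional[str]], order: Sequence[str]
) -> Optional[str]:
    """Return the value from the highest-priority provider that has one."""
    for provider in order:
        value = candidates.get(provider)
        if value:  # skip providers that returned nothing for this field
            return value
    return None
```

With order ["musicbrainz", "wikidata"], a value from MusicBrainz wins over a conflicting value from Wikidata, and Wikidata is used only when MusicBrainz has none.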

Data Aggregation

Descriptions

Concatenate descriptions from multiple providers:

MusicBrainz: "The Beatles were an English rock band..."
Wikipedia: "Formed in Liverpool in 1960..."
Genius: "Known for their innovative songwriting..."

Result: "The Beatles were an English rock band... Formed in Liverpool in 1960... Known for their innovative songwriting..."
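
The concatenation shown above could be implemented roughly as follows. The function name, the single-space separator, and the removal of exact duplicates are assumptions, not details confirmed by the source.

```python
from typing import Dict, Optional, Sequence


def merge_descriptions(
    descriptions: Dict[str, Optional[str]], order: Sequence[str]
) -> str:
    """Join provider descriptions in priority order, dropping exact duplicates."""
    parts = []
    for provider in order:
        text = (descriptions.get(provider) or "").strip()
        if text and text not in parts:
            parts.append(text)
    return " ".join(parts)
```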

Ratings

Average ratings from multiple providers:

AllMusic: 90/100
Metacritic: 85/100

Result: (90 + 85) / 2 = 87.5 → 88/100
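
The example rounds 87.5 up to 88. Python's built-in round() uses banker's rounding and would not always round halves up, so a sketch that reproduces the example consistently can use decimal with ROUND_HALF_UP. The function name is hypothetical, and half-up rounding is an assumption inferred from the example.

```python
from decimal import Decimal, ROUND_HALF_UP
from typing import Optional, Sequence


def average_rating(scores: Sequence[int]) -> Optional[int]:
    """Average 0-100 scores and round halves up; None when there are no scores."""
    if not scores:
        return None
    mean = Decimal(sum(scores)) / Decimal(len(scores))
    return int(mean.quantize(Decimal("1"), rounding=ROUND_HALF_UP))
```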

Lyrics

Prefer synced lyrics over plain:

LrcLib: Synced lyrics available → Use synced
Genius: Plain lyrics available → Use as fallback

If both are available, store both in the Lyrics table.

Matching Workflow

  1. Scanner registers file with Server
  2. Scanner publishes file.added event to RabbitMQ
  3. Matcher consumes event
  4. Matcher fetches file metadata from Server
  5. Matcher queries enabled providers in parallel:
    • MusicBrainz by AcoustID fingerprint
    • Genius by artist + title
    • Wikipedia by artist name
    • LrcLib by artist + title + duration
    • Wikidata by MusicBrainz ID (after the MusicBrainz lookup returns one)
    • Discogs by artist + album (if enabled)
    • AllMusic by artist + album (if enabled)
    • Metacritic by artist + album (if enabled)
  6. Matcher aggregates results based on priority
  7. Matcher pushes enriched metadata to Server
  8. Server updates database and search index

Error Recovery

Provider Failures

If a provider fails:

  1. Log error with provider name and reason
  2. Continue with other providers
  3. Push partial metadata to Server
  4. Mark track as "partially matched"
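
The four steps above can be sketched with asyncio.gather and return_exceptions=True, which returns exceptions as results instead of cancelling the other providers. The function name, the result shape, and the "matched" status for the all-success case are assumptions.

```python
import asyncio
import logging
from typing import Any, Awaitable, Callable, Dict, List

logger = logging.getLogger("matcher")


async def query_providers(
    providers: Dict[str, Callable[[], Awaitable[Any]]]
) -> Dict[str, Any]:
    """Query all providers and keep whatever succeeded."""
    names = list(providers)
    results = await asyncio.gather(
        *(providers[name]() for name in names), return_exceptions=True
    )

    metadata: Dict[str, Any] = {}
    failed: List[str] = []
    for name, result in zip(names, results):
        if isinstance(result, BaseException):
            # Step 1: log the provider name and the reason.
            logger.error("provider %s failed: %r", name, result)
            failed.append(name)
        else:
            # Step 2: the remaining providers are unaffected.
            metadata[name] = result

    # Steps 3 and 4: return partial metadata and flag the track accordingly.
    status = "partially matched" if failed else "matched"
    return {"status": status, "metadata": metadata, "failed": failed}
```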

Retry Logic

For transient errors (network, rate limit):

  1. Retry with exponential backoff
  2. Max 3 attempts per provider
  3. If all attempts fail, skip provider
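
A minimal sketch of this retry policy, assuming a 1-second base delay that doubles on each attempt. TransientError is a hypothetical marker exception, and the sleep parameter exists only so the delays can be replaced in tests; neither comes from the source.

```python
import asyncio
from typing import Awaitable, Callable, Optional, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for retryable failures (network, rate limit)."""


async def call_with_retry(
    fetch: Callable[[], Awaitable[T]],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> Optional[T]:
    """Call fetch up to max_attempts times; return None if every attempt fails."""
    for attempt in range(max_attempts):
        try:
            return await fetch()
        except TransientError:
            if attempt == max_attempts - 1:
                return None  # all attempts failed: skip this provider
            # Exponential backoff: base_delay, then 2x, then 4x, ...
            await sleep(base_delay * 2 ** attempt)
    return None
```

Non-transient errors are deliberately not caught, so they propagate to the caller.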

Manual Refresh

Users can trigger metadata refresh via Scanner API:

POST /scanner/refresh

This re-queries all providers for existing tracks.

Performance Optimization

Parallel Queries

Matcher queries all providers in parallel using asyncio:

import asyncio

async def enrich_metadata(file_id):
    tasks = [
        fetch_musicbrainz(file_id),
        fetch_genius(file_id),
        fetch_wikipedia(file_id),
        fetch_lrclib(file_id),
        fetch_wikidata(file_id)
    ]
    results = await asyncio.gather(*tasks, return_exceptions=True)
    return aggregate_results(results)

Caching

Provider responses are cached in memory for 1 hour:

  • Reduces duplicate queries during batch scans
  • Invalidated on manual refresh
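
An in-memory cache with a 1-hour lifetime could look like the sketch below. The class name and its interface are assumptions, and the injectable clock is there only to make expiry testable.

```python
import time
from typing import Any, Callable, Dict, Hashable, Optional, Tuple


class TTLCache:
    """Minimal in-memory cache whose entries expire after ttl_seconds."""

    def __init__(
        self, ttl_seconds: float = 3600.0, clock: Callable[[], float] = time.monotonic
    ) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[Hashable, Tuple[float, Any]] = {}

    def get(self, key: Hashable) -> Optional[Any]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # expired: drop it and report a miss
            return None
        return value

    def set(self, key: Hashable, value: Any) -> None:
        self._entries[key] = (self._clock() + self._ttl, value)

    def clear(self) -> None:
        """Invalidate everything, e.g. on a manual refresh."""
        self._entries.clear()
```

A key such as (provider, artist, title) would let repeated lookups during a batch scan reuse the first response.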

Rate Limit Coordination

Rate limiters are shared across all workers:

  • Prevents exceeding provider limits
  • Uses token bucket algorithm
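
For reference, the token bucket algorithm can be sketched in a few lines. This is an illustrative in-process version with hypothetical names; sharing one bucket across separate worker processes would need an external store, which the source does not describe.

```python
import time
from typing import Callable


class TokenBucket:
    """Token bucket: refills at `rate` tokens per second up to `capacity`."""

    def __init__(
        self, rate: float, capacity: float, clock: Callable[[], float] = time.monotonic
    ) -> None:
        self._rate = rate
        self._capacity = capacity
        self._clock = clock
        self._tokens = capacity  # start full so the first request passes
        self._updated = clock()

    def try_acquire(self, tokens: float = 1.0) -> bool:
        """Take tokens if available; return False when the caller must wait."""
        now = self._clock()
        elapsed = now - self._updated
        # Refill for the elapsed time, never above capacity.
        self._tokens = min(self._capacity, self._tokens + elapsed * self._rate)
        self._updated = now
        if self._tokens >= tokens:
            self._tokens -= tokens
            return True
        return False
```

A bucket with rate=1 and capacity=1 corresponds to the 1 request/second limit listed for MusicBrainz.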

Privacy Considerations

Data Sent to Providers

  • MusicBrainz: AcoustID fingerprint, artist/album/track names
  • Genius: Artist and track names
  • Wikipedia: Artist and album names
  • Wikidata: MusicBrainz IDs
  • Discogs: Artist and album names
  • AllMusic: Artist and album names
  • Metacritic: Artist and album names
  • LrcLib: Artist, track name, duration

No file paths or user data sent.

Scrobbling Privacy

  • Last.fm: Track plays sent with timestamp
  • ListenBrainz: Track plays sent with timestamp

Users control scrobbling via settings. Disabled by default.

Future Enhancements

Additional Providers

Potential providers to add:

  • Spotify: Metadata and popularity scores
  • Apple Music: Editorial content
  • Bandcamp: Independent artist data
  • RateYourMusic: User ratings and reviews

Provider Plugins

Allow users to add custom providers via plugin system.

Offline Mode

Cache provider responses for offline access.

Provider Statistics

Track provider accuracy and response times. Display in admin panel.

Summary

Meelo's integration architecture separates concerns: Matcher handles provider queries, Server handles scrobbling. The provider pattern enables easy addition of new sources. Parallel queries and rate limiting optimize performance. Priority-based aggregation ensures data quality. OAuth flows and token management handle authentication. The system is flexible (enable/disable providers), resilient (retry logic, partial results), and privacy-conscious (no file paths sent).