MusicMetaLinker Architecture
System Overview
MusicMetaLinker implements a service-oriented architecture for music metadata entity linking. The system coordinates queries across multiple external APIs, aggregates results, and presents a unified interface through a single orchestrator class.
Architecture pattern: Facade with cascading fallback strategy.
Core Components
Align Class (linking.py)
The Align class is the primary orchestrator and sole public interface. It encapsulates all service interactions and presents a clean getter-based API.
Constructor signature:
Align(
    mbid_track=None,
    mbid_release=None,
    artist=None,
    album=None,
    track=None,
    track_number=None,
    duration=None,
    isrc=None,
    strict=False
)
Responsibilities:
- Initialize service-specific aligners based on available input
- Coordinate query execution across services
- Aggregate and normalize results
- Expose unified getter methods for all metadata fields
Internal state:
- Stores all input parameters
- Maintains references to service aligner instances
- Caches retrieved metadata to avoid redundant queries
The Align class doesn't implement service-specific logic. It delegates to specialized classes and functions.
MusicBrainzAlign Class
Handles all MusicBrainz interactions. MusicBrainz is treated as the authoritative source when MBIDs are available.
Key methods:
get_recording(mbid): Retrieves full recording data by MBID. Returns artist, album, track name, duration, ISRCs, and related identifiers.
get_best_match(artist, track, album, duration): Searches MusicBrainz by metadata strings. Filters results by duration and fuzzy string matching. Returns the highest-scoring match.
get_iswc(): Retrieves International Standard Musical Work Code if available.
Search strategy:
- If MBID provided, direct lookup (most reliable)
- If ISRC provided, search by ISRC
- Fall back to metadata string search with filtering
MusicBrainz queries include related entities (artists, releases, ISRCs) in a single request to minimize API calls.
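The fuzzy-matching step of the string search can be sketched as follows. This is a minimal illustration using the standard library's difflib, not the project's actual scoring code; the candidate shape (dicts with "artist" and "title" keys), the `pick_best_match` name, and the 0.6 cutoff are assumptions made for the example.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def pick_best_match(candidates, artist, track, min_score=0.6):
    """Return the candidate with the highest mean artist/title similarity.

    `candidates` is a list of dicts with hypothetical "artist" and "title"
    keys. Returns None when no candidate reaches `min_score`.
    """
    best, best_score = None, min_score
    for cand in candidates:
        score = (similarity(cand["artist"], artist)
                 + similarity(cand["title"], track)) / 2
        if score >= best_score:
            best, best_score = cand, score
    return best
```

A cutoff like `min_score` keeps a weak best-of-a-bad-lot candidate from being returned as a match.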
DeezerAlign Class
Interfaces with Deezer's public API. Deezer provides commercial metadata with strong ISRC coverage.
Key methods:
best_match(artist, track, album, duration, duration_threshold=3): Searches Deezer and filters by duration. The duration_threshold parameter allows ±3 seconds variance by default.
get_rank(): Returns Deezer's internal popularity rank for the track.
Search strategy:
- If ISRC available, search by ISRC (most accurate)
- Fall back to metadata string search
- Filter results by duration (±3 seconds)
- Apply fuzzy string matching to artist/track/album
Duration filtering is critical for Deezer because metadata searches often return multiple versions (radio edit, album version, remaster).
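The duration filter can be sketched as below. It assumes each search result is a dict with a "duration" field in seconds; the function name is illustrative and the real DeezerAlign code may be structured differently.

```python
def filter_by_duration(results, duration, duration_threshold=3):
    """Keep results whose duration is within ±duration_threshold seconds.

    `results` is a list of dicts with a "duration" key in seconds (an
    assumption about the result shape). When the reference duration is
    unknown, nothing is filtered out.
    """
    if duration is None:
        return list(results)
    return [
        r for r in results
        if abs(r["duration"] - duration) <= duration_threshold
    ]
```

With a reference duration of 243 seconds, a 185-second radio edit is dropped while 242-second and 244-second versions are kept. The fuzzy string match then has to choose between the remaining versions.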
YouTubeAlign Class
Queries YouTube Music via the unofficial ytmusicapi library.
Key methods:
get_best_match(artist, track, album): Searches YouTube Music with filter="songs". Returns the first result (no sophisticated ranking).
get_youtube_id(): Extracts YouTube video ID from search results.
Search strategy:
- Constructs query string: "{artist} {track} {album}"
- Filters to songs only (excludes videos, albums)
- Returns first result
YouTube matching is the weakest link. No duration filtering (commented out in code). No fuzzy matching. First result is assumed correct.
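The first-result strategy amounts to something like the sketch below. The search client is passed in as a callable so the example stays self-contained. `build_query` and `first_song_id` are hypothetical names, and skipping missing fields when building the query is an assumption rather than confirmed behavior of YouTubeAlign.

```python
def build_query(artist=None, track=None, album=None):
    """Join the available fields with spaces, skipping missing ones."""
    return " ".join(part.strip() for part in (artist, track, album) if part)


def first_song_id(search, artist=None, track=None, album=None):
    """Return the "videoId" of the first hit, or None when there are no hits.

    `search` is any callable taking (query, filter=...) and returning a list
    of dicts, which is how this sketch assumes the search client behaves.
    """
    results = search(build_query(artist, track, album), filter="songs")
    return results[0].get("videoId") if results else None
```

Because nothing checks the returned title or duration, a cover or live version that ranks first would be accepted as the match.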
acousticbrainz_link Function
Standalone function (not a class) that checks if an MBID exists in AcousticBrainz.
Implementation:
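import requests

def acousticbrainz_link(mbid):
    # Probe the AcousticBrainz page for this MBID; anything but HTTP 200 means "no link".
    url = f"https://acousticbrainz.org/{mbid}"
    response = requests.get(url)
    return url if response.status_code == 200 else None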
Simple HTTP check. Returns URL if MBID exists, None otherwise.
Critical issue: AcousticBrainz shut down in 2022. This function always returns None. Dead code.
Data Flow
Initialization Flow
- User creates Align instance with available metadata
- Align constructor stores all input parameters
- Service aligners are instantiated on-demand (lazy initialization)
- No queries execute during construction
Query Flow
- User calls getter method (e.g., get_mbid())
- Align checks if value already cached
- If not cached, determines which service to query based on available input
- Executes service-specific query
- Caches result
- Returns value to user
Queries are lazy and cached. Calling get_mbid() twice only queries MusicBrainz once.
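The lazy-and-cached getter behavior can be illustrated with a small stand-in class. `LazyLookup` and its `resolvers` mapping are invented for this sketch and do not mirror Align's real attributes.

```python
class LazyLookup:
    """Caches getter results so each field is resolved at most once."""

    _MISSING = object()  # sentinel: tells "never queried" apart from a cached None

    def __init__(self, resolvers):
        # `resolvers` maps a field name to a zero-argument callable (hypothetical).
        self._resolvers = resolvers
        self._cache = {}

    def get(self, field):
        value = self._cache.get(field, self._MISSING)
        if value is self._MISSING:
            value = self._resolvers[field]()  # the only place a query runs
            self._cache[field] = value
        return value
```

The sentinel matters because a failed lookup that returned None is cached too. Without it, every call for a missing identifier would hit the service again.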
Cascading Fallback Strategy
Priority order for identifier resolution:
For MBID:
- Use provided mbid_track if available
- Query MusicBrainz by ISRC
- Query MusicBrainz by metadata strings
- Return None if all fail
For ISRC:
- Use provided ISRC if available
- Extract from MusicBrainz recording (if MBID available)
- Query Deezer and extract ISRC from result
- Return None if all fail
For Deezer ID:
- Query Deezer by ISRC
- Query Deezer by metadata strings
- Return None if all fail
For YouTube link:
- Query YouTube Music by metadata strings
- Return None if no results
Each service is queried independently. No cross-service validation or conflict resolution.
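Each of the priority lists above follows the same pattern: try strategies in order and stop at the first one that produces a value. A minimal sketch of that pattern, with hypothetical function names and each strategy modeled as a zero-argument callable:

```python
def first_resolved(*strategies):
    """Call each zero-argument strategy in order; return the first non-None result."""
    for strategy in strategies:
        value = strategy()
        if value is not None:
            return value
    return None


def resolve_isrc(provided_isrc, from_musicbrainz, from_deezer):
    """ISRC cascade: provided value, then MusicBrainz, then Deezer."""
    return first_resolved(lambda: provided_isrc, from_musicbrainz, from_deezer)
```

Later strategies are never called once an earlier one succeeds, so a provided identifier costs no network requests.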
Supporting Components
JAMSProcessor (preprocessor.py)
Handles reading and writing JAMS (JSON Annotated Music Specification) files.
Responsibilities:
- Parse JAMS JSON structure
- Extract metadata from file_metadata and sandbox sections
- Enrich JAMS files with new identifiers
- Write updated JAMS files
JAMS structure:
{
  "file_metadata": {
    "title": "track name",
    "artist": "artist name",
    "release": "album name",
    "duration": 123.45,
    "identifiers": {
      "musicbrainz": "mbid-here"
    }
  },
  "sandbox": {
    "type": "genre",
    "genre": "rock",
    "track_number": 1,
    "release_year": 2020
  }
}
JAMSProcessor reads these fields, passes them to Align, and writes enriched identifiers back to the identifiers section.
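The read-enrich-write cycle can be sketched with the standard json module, following the structure shown above. This is not the JAMSProcessor implementation; the function names and the choice to skip None identifiers are assumptions.

```python
import json


def read_track_metadata(jams_path):
    """Return the fields that would be handed to the linker."""
    with open(jams_path, encoding="utf-8") as fh:
        doc = json.load(fh)
    meta = doc.get("file_metadata", {})
    return {
        "artist": meta.get("artist"),
        "track": meta.get("title"),
        "album": meta.get("release"),
        "duration": meta.get("duration"),
        "mbid_track": meta.get("identifiers", {}).get("musicbrainz"),
        "track_number": doc.get("sandbox", {}).get("track_number"),
    }


def write_identifiers(jams_path, new_identifiers):
    """Merge non-empty identifiers into file_metadata.identifiers and save."""
    with open(jams_path, encoding="utf-8") as fh:
        doc = json.load(fh)
    identifiers = doc.setdefault("file_metadata", {}).setdefault("identifiers", {})
    identifiers.update({k: v for k, v in new_identifiers.items() if v is not None})
    with open(jams_path, "w", encoding="utf-8") as fh:
        json.dump(doc, fh, indent=2)
```

The keys returned by `read_track_metadata` are named after the Align constructor parameters so the dict could be unpacked straight into it.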
MBDownload (musicbrainz_dump.py)
Utility for bulk downloading MusicBrainz data.
Purpose: Pre-populate local datasets with MusicBrainz metadata to reduce API calls during batch processing.
Implementation details: Not fully specified in provided information. Likely queries MusicBrainz in batches and caches results locally.
link_partitions.py
Batch processing script for directories of JAMS files.
Workflow:
- Scan directory for JAMS files
- For each file, extract metadata via JAMSProcessor
- Create Align instance and query all services
- Collect results in pandas DataFrame
- Output CSV with all identifiers
Command-line options:
--save: Write enriched JAMS files back to disk
--limit audio: Only process audio files (skip non-audio JAMS)
--overwrite: Overwrite existing enriched files
Includes progress bars via tqdm and logging to link_partitions.log.
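For orientation, the three options could be declared with argparse roughly as follows. Only the flag names come from the description above. The positional `jams_dir` argument, the help strings, and the restriction of `--limit` to the single value "audio" are assumptions.

```python
import argparse


def build_parser():
    """Build a parser exposing the three documented options."""
    parser = argparse.ArgumentParser(description="Link a directory of JAMS files.")
    # Hypothetical positional argument: the directory to scan.
    parser.add_argument("jams_dir", help="directory containing JAMS files")
    parser.add_argument("--save", action="store_true",
                        help="write enriched JAMS files back to disk")
    parser.add_argument("--limit", choices=["audio"], default=None,
                        help="restrict processing, e.g. to audio files")
    parser.add_argument("--overwrite", action="store_true",
                        help="overwrite existing enriched files")
    return parser
```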
prepare_dataset.py
Dataset preparation utilities. Specific functionality not detailed in provided information. Likely includes:
- Data cleaning
- Format conversion
- Batch metadata enrichment
Configuration Architecture
No configuration system. All settings hardcoded in source files.
Hardcoded values:
- MusicBrainz User-Agent: "elka/0.1"
- Deezer duration threshold: 3 seconds
- API endpoints: Direct URLs in code
- Spotify credentials: Imported from external mml_secrets.py
Implications:
- No runtime configuration
- No environment-specific settings
- Changing thresholds requires code modification
- No A/B testing of matching strategies
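One lightweight way to lift these values out of the source is a settings object with environment overrides. The sketch below keeps the documented defaults. The `LinkerConfig` class and the `MML_*` variable names are invented for illustration.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class LinkerConfig:
    """Settings that are currently hardcoded, gathered in one place."""

    user_agent: str = "elka/0.1"
    deezer_duration_threshold: float = 3.0

    @classmethod
    def from_env(cls, env=None):
        # The MML_* variable names are hypothetical, chosen for this sketch.
        env = os.environ if env is None else env
        return cls(
            user_agent=env.get("MML_USER_AGENT", cls.user_agent),
            deezer_duration_threshold=float(
                env.get("MML_DEEZER_DURATION_THRESHOLD",
                        cls.deezer_duration_threshold)
            ),
        )
```

Passing a config object into the aligners would also make it possible to compare matching thresholds without editing code.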
Error Handling Architecture
Error handling is minimal and inconsistent.
Pattern:
try:
    result = service.query()
    return result
except:
    return None
All exceptions are caught and suppressed. Failed queries return None. No error logging, no exception propagation, no retry logic.
Consequences:
- Silent failures
- No visibility into what went wrong
- Difficult debugging
- No distinction between "not found" and "service error"
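A sketch of how "not found" and "service error" could be kept apart: None keeps its meaning of "the service answered, no match", while failures are logged and raised as a dedicated exception. `ServiceError` and `safe_query` are hypothetical names, not part of the current code.

```python
import logging

logger = logging.getLogger(__name__)


class ServiceError(Exception):
    """The service could not be reached or returned an unusable response."""


def safe_query(query, *args, **kwargs):
    """Run a service query, keeping 'not found' and 'service error' apart.

    Returns the query result, where None means the service answered but had
    no match. Any failure is logged and re-raised as ServiceError.
    """
    try:
        return query(*args, **kwargs)
    except Exception as exc:  # narrow to the client's own exceptions where possible
        logger.warning("query %s failed: %s", getattr(query, "__name__", query), exc)
        raise ServiceError(str(exc)) from exc
```

Callers that want the current lenient behavior can still catch ServiceError and return None, but the failure is now logged.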
Logging Architecture
Uses Python's standard logging module.
Batch processing: File-based logging to link_partitions.log. Includes timestamps, log levels, and progress information.
Library usage: Console logging. Minimal output.
Debug output: Multiple print() statements scattered throughout code. Not controlled by logging configuration.
Issues:
- Debug prints in production code
- No structured logging
- No log levels for debug prints
- No correlation IDs for tracking requests across services
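The scattered print() calls could be replaced by debug-level log records that carry a per-request correlation ID. A minimal sketch using the standard library's LoggerAdapter; the class and function names and the ID format are assumptions.

```python
import logging
import uuid


class CorrelationAdapter(logging.LoggerAdapter):
    """Prefixes each message with the correlation ID stored in `extra`."""

    def process(self, msg, kwargs):
        return f"[{self.extra['correlation_id']}] {msg}", kwargs


def request_logger(name="musicmetalinker", correlation_id=None):
    """Return a logger whose records can be traced across service calls."""
    correlation_id = correlation_id or uuid.uuid4().hex[:8]
    return CorrelationAdapter(logging.getLogger(name),
                              {"correlation_id": correlation_id})
```

One adapter would be created per track and passed to each service call, so that `log.debug("querying %s", "deezer")` and the matching MusicBrainz line share the same prefix. Debug output then follows the configured log level instead of always printing.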
Concurrency Model
Single-threaded, synchronous execution. No parallelization.
Query execution:
- Services queried sequentially
- No concurrent API calls
- No async/await
- No thread pools
Implications:
- Slow batch processing (network latency multiplied by number of tracks)
- Underutilized network bandwidth
- Simple debugging (no race conditions)
Batch processing could benefit significantly from parallel execution.
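Since the work is network-bound, a thread pool is one low-effort way to overlap requests. In the sketch below, `link_one` stands in for "create an Align instance for one track and call its getters"; both names and the worker count are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


def link_all(tracks, link_one, max_workers=4):
    """Apply `link_one` to every track concurrently, preserving input order.

    Each call should build its own linker instance, because instances are
    not thread-safe and must not be shared between workers.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(link_one, tracks))
```

The worker count should stay small and be paired with rate limiting, since more parallel requests against the same API make throttling more likely.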
Dependency Injection
No dependency injection. Service classes instantiated directly in Align constructor.
Current pattern:
self.mb_align = MusicBrainzAlign(...)
self.deezer_align = DeezerAlign(...)
Implications:
- Difficult to mock services for testing
- Tight coupling between Align and service implementations
- No interface-based programming
- Hard to swap service implementations
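For contrast, constructor injection could look like the sketch below. The aligners are created by the caller and handed in, so a test can substitute fakes. `Linker`, `FakeAligner`, and the `find_isrc` method are hypothetical and do not exist in the current code.

```python
class Linker:
    """Facade that receives its service aligners instead of creating them."""

    def __init__(self, mb_align, deezer_align):
        self.mb_align = mb_align
        self.deezer_align = deezer_align

    def get_isrc(self, artist, track):
        # Try MusicBrainz first, then Deezer; both are injected collaborators.
        for service in (self.mb_align, self.deezer_align):
            isrc = service.find_isrc(artist, track)
            if isrc is not None:
                return isrc
        return None


class FakeAligner:
    """Test double returning a canned answer."""

    def __init__(self, isrc):
        self.isrc = isrc

    def find_isrc(self, artist, track):
        return self.isrc
```

Usage in a test: `Linker(FakeAligner(None), FakeAligner("USRC17607839")).get_isrc("artist", "track")` exercises the fallback path without any network access.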
State Management
State is managed in Align instance variables.
Cached values:
- All input parameters (artist, track, album, etc.)
- Retrieved metadata (MBID, ISRC, Deezer ID, etc.)
- Service aligner instances
Cache invalidation: None. Values cached for lifetime of Align instance.
Thread safety: Not thread-safe. No locks, no synchronization.
Extension Points
Limited extensibility.
Adding new services:
- Create new service aligner class
- Instantiate in Align constructor
- Add getter methods to Align
- Update cascading fallback logic
No plugin system, no service registry, no abstract base classes.
Modifying matching logic: Requires editing service aligner classes directly. No strategy pattern, no configurable matchers.
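The missing extension mechanism could be as small as an abstract base class plus a registry. Everything in this sketch (`BaseAligner`, `register`, `ALIGNERS`, the toy `EchoAligner`) is illustrative and not present in the project.

```python
from abc import ABC, abstractmethod

ALIGNERS = {}  # service name -> aligner class


def register(name):
    """Class decorator that adds an aligner to the registry under `name`."""
    def decorator(cls):
        ALIGNERS[name] = cls
        return cls
    return decorator


class BaseAligner(ABC):
    """Common interface every service aligner would implement."""

    @abstractmethod
    def best_match(self, artist, track, album=None, duration=None):
        """Return a dict describing the best match, or None."""


@register("echo")
class EchoAligner(BaseAligner):
    """Toy aligner used only to show the registration mechanics."""

    def best_match(self, artist, track, album=None, duration=None):
        return {"artist": artist, "title": track}
```

With a registry in place, adding a service would mean writing and registering one class. The orchestrator could iterate over `ALIGNERS` instead of being edited for every new source.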
Testing Architecture
No test suite. No test directory. No test configuration.
Testing approach:
- Manual testing via Jupyter notebooks (deezer_test.ipynb, queries.ipynb)
- if __name__ == "__main__" blocks in some modules
- No unit tests, no integration tests, no mocks
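As an indication of what a first unit test might look like, the sketch below tests the Deezer ID cascade against a mocked client. `resolve_deezer_id`, `search_isrc`, and `search_text` are hypothetical stand-ins written for the example; the real method names differ.

```python
import unittest
from unittest.mock import Mock


def resolve_deezer_id(deezer, isrc=None, artist=None, track=None):
    """Deezer ID cascade: ISRC search first, then metadata search."""
    if isrc is not None:
        hit = deezer.search_isrc(isrc)
        if hit is not None:
            return hit["id"]
    hit = deezer.search_text(artist, track)
    return hit["id"] if hit is not None else None


class ResolveDeezerIdTest(unittest.TestCase):
    def test_falls_back_to_text_search(self):
        deezer = Mock()
        deezer.search_isrc.return_value = None
        deezer.search_text.return_value = {"id": 42}
        self.assertEqual(resolve_deezer_id(deezer, "ISRC1", "A", "T"), 42)
        deezer.search_text.assert_called_once_with("A", "T")

    def test_isrc_hit_skips_text_search(self):
        deezer = Mock()
        deezer.search_isrc.return_value = {"id": 7}
        self.assertEqual(resolve_deezer_id(deezer, "ISRC1", "A", "T"), 7)
        deezer.search_text.assert_not_called()
```

Tests of this kind run offline and pin down the fallback order, which is the behavior most likely to break when a service is added.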
Build and Packaging
Uses hatchling (PEP 517 build backend).
pyproject.toml structure:
- Project metadata (name, version, authors)
- Dependencies
- Build system configuration
No setup.py. Modern Python packaging.
Distribution: GitHub only. Not published to PyPI.
Deployment Architecture
Library deployment: pip install from GitHub.
Batch processing deployment: Clone repository, install dependencies, run Python scripts directly.
No Docker containers, no systemd services, no process managers.
Performance Considerations
No performance optimization.
Bottlenecks:
- Network latency (sequential API calls)
- No caching across Align instances
- No request batching
- No connection pooling
Memory usage:
- Minimal (only current track metadata in memory)
- No large data structures
- Pandas DataFrame for batch output (could be large for big datasets)
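Caching across instances can be approximated in-process with functools.lru_cache, as sketched below. `shared_cache` is a hypothetical helper. It assumes lookups take a single hashable key such as an ISRC or MBID.

```python
from functools import lru_cache


def shared_cache(fetch, maxsize=4096):
    """Wrap a one-argument lookup in a process-wide LRU cache.

    Results (including None) are kept until evicted, so repeated lookups of
    the same key in a batch cost one request. Exceptions are not cached.
    """
    return lru_cache(maxsize=maxsize)(fetch)
```

This helps when a batch contains many tracks from the same release or repeated identifiers. The cache is lost when the process exits, so separate runs still need a persistent store.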
Security Architecture
Minimal security considerations.
API credentials:
- MusicBrainz: No authentication
- Deezer: No authentication
- YouTube Music: No authentication
- Spotify: OAuth2 client credentials in external file
Secrets management:
- Spotify credentials in mml_secrets.py (not in repository)
- No encryption
- No environment variables
- No secrets vault
Input validation:
- No validation of user input
- No sanitization of metadata strings
- Potential injection vulnerabilities if metadata used in shell commands
Architectural Strengths
- Simple facade: Single Align class hides complexity
- Cascading fallback: Graceful degradation when services fail
- Lazy evaluation: Only query services when needed
- Service isolation: Each service in separate class
Architectural Weaknesses
- No abstraction: Service classes have different interfaces
- Tight coupling: Align directly instantiates service classes
- Weak error handling: Exceptions are swallowed and surface only as None
- No concurrency: Sequential execution only
- Hardcoded configuration: No runtime flexibility
- No testing: No automated tests, and the tight coupling makes services hard to mock
- Dead code: AcousticBrainz integration non-functional
- Inconsistent patterns: Function for AcousticBrainz, classes for others
Architectural Recommendations
For production use, consider:
- Define service interface: Abstract base class for all aligners
- Dependency injection: Pass service instances to Align constructor
- Configuration system: External config for thresholds, endpoints, credentials
- Error handling: Explicit error types, logging, retry logic
- Async execution: Use asyncio for concurrent API calls
- Caching layer: Redis or in-memory cache for repeated queries
- Remove dead code: Delete AcousticBrainz integration
- Add tests: Unit tests with mocked services
- Structured logging: JSON logs with correlation IDs
- Rate limiting: Respect API rate limits with backoff
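The retry and backoff items can be combined in a small helper. The sketch below uses exponential backoff with an injectable sleep function; the name `with_backoff`, the default delays, and the set of retryable exceptions are assumptions to adapt to each client library.

```python
import time


def with_backoff(call, retries=3, base_delay=1.0, sleep=time.sleep,
                 retry_on=(ConnectionError, TimeoutError)):
    """Call `call()`; on a retryable error wait base_delay * 2**attempt and retry.

    Re-raises the last error once `retries` retries have been used up.
    """
    for attempt in range(retries + 1):
        try:
            return call()
        except retry_on:
            if attempt == retries:
                raise
            sleep(base_delay * 2 ** attempt)
```

Passing `sleep` as a parameter keeps the helper testable without real waiting.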
The core pattern (cascading fallback across services) is sound. The implementation needs significant hardening.