feat: initial implementation of metadata aggregator

- gRPC service with MusicBrainz provider
- PostgreSQL schema with migrations
- Service layer with database-first caching
- Repository pattern for data access
- YAML configuration support
- Research documentation for 17 music metadata projects
@@ -0,0 +1,441 @@
# MusicMetaLinker Architecture

## System Overview

MusicMetaLinker implements a service-oriented architecture for music metadata entity linking. The system coordinates queries across multiple external APIs, aggregates results, and presents a unified interface through a single orchestrator class.

Architecture pattern: Facade with cascading fallback strategy.

## Core Components

### Align Class (linking.py)

The Align class is the primary orchestrator and sole public interface. It encapsulates all service interactions and presents a clean getter-based API.

**Constructor signature:**
```python
Align(
    mbid_track=None,
    mbid_release=None,
    artist=None,
    album=None,
    track=None,
    track_number=None,
    duration=None,
    isrc=None,
    strict=False
)
```

**Responsibilities:**
- Initialize service-specific aligners based on available input
- Coordinate query execution across services
- Aggregate and normalize results
- Expose unified getter methods for all metadata fields

**Internal state:**
- Stores all input parameters
- Maintains references to service aligner instances
- Caches retrieved metadata to avoid redundant queries

The Align class doesn't implement service-specific logic. It delegates to specialized classes and functions.

### MusicBrainzAlign Class

Handles all MusicBrainz interactions. MusicBrainz is treated as the authoritative source when MBIDs are available.

**Key methods:**

**get_recording(mbid):** Retrieves full recording data by MBID. Returns artist, album, track name, duration, ISRCs, and related identifiers.

**get_best_match(artist, track, album, duration):** Searches MusicBrainz by metadata strings. Filters results by duration and fuzzy string matching. Returns the highest-scoring match.

**get_iswc():** Retrieves International Standard Musical Work Code if available.

**Search strategy:**
1. If MBID provided, direct lookup (most reliable)
2. If ISRC provided, search by ISRC
3. Fall back to metadata string search with filtering

MusicBrainz queries include related entities (artists, releases, ISRCs) in a single request to minimize API calls.

### DeezerAlign Class

Interfaces with Deezer's public API. Deezer provides commercial metadata with strong ISRC coverage.

**Key methods:**

**best_match(artist, track, album, duration, duration_threshold=3):** Searches Deezer and filters by duration. The duration_threshold parameter allows ±3 seconds variance by default.

**get_rank():** Returns Deezer's internal popularity rank for the track.

**Search strategy:**
1. If ISRC available, search by ISRC (most accurate)
2. Fall back to metadata string search
3. Filter results by duration (±3 seconds)
4. Apply fuzzy string matching to artist/track/album

Duration filtering is critical for Deezer because metadata searches often return multiple versions (radio edit, album version, remaster).

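The duration filter and fuzzy ranking described above can be sketched roughly as follows. This is an illustration, not the project's code: `pick_best`, the candidate dictionary keys, the sample data, and the use of `difflib.SequenceMatcher` as the fuzzy matcher are all assumptions.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive similarity ratio between two strings, in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def pick_best(candidates, artist, track, duration, duration_threshold=3):
    """Drop candidates outside the duration window, then rank by string similarity."""
    in_window = [
        c for c in candidates
        if abs(c["duration"] - duration) <= duration_threshold
    ]
    if not in_window:
        return None
    return max(
        in_window,
        key=lambda c: similarity(c["artist"], artist) + similarity(c["title"], track),
    )


# Made-up search results: three versions of the same song.
candidates = [
    {"id": 1, "artist": "Example Artist", "title": "Example Song", "duration": 240},
    {"id": 2, "artist": "Example Artist", "title": "Example Song (Radio Edit)", "duration": 200},
    {"id": 3, "artist": "Example Artist", "title": "Example Song - Remastered", "duration": 241},
]
best = pick_best(candidates, "Example Artist", "Example Song", duration=240)
```

The radio edit falls outside the ±3 second window, and the exact title match outranks the remaster, so `best` is the candidate with `id` 1.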
### YouTubeAlign Class

Queries YouTube Music via the unofficial ytmusicapi library.

**Key methods:**

**get_best_match(artist, track, album):** Searches YouTube Music with filter="songs". Returns the first result (no sophisticated ranking).

**get_youtube_id():** Extracts YouTube video ID from search results.

**Search strategy:**
- Constructs query string: "{artist} {track} {album}"
- Filters to songs only (excludes videos, albums)
- Returns first result

YouTube matching is the weakest link. No duration filtering (commented out in code). No fuzzy matching. First result is assumed correct.

### acousticbrainz_link Function

Standalone function (not a class) that checks if an MBID exists in AcousticBrainz.

**Implementation:**
```python
import requests

def acousticbrainz_link(mbid):
    # Build the AcousticBrainz page URL for this recording MBID.
    url = f"https://acousticbrainz.org/{mbid}"
    response = requests.get(url)
    # A 200 response is taken to mean the MBID exists.
    return url if response.status_code == 200 else None
```

Simple HTTP check. Returns URL if MBID exists, None otherwise.

**Critical issue:** AcousticBrainz shut down in 2022. This function always returns None. Dead code.

## Data Flow

### Initialization Flow

1. User creates Align instance with available metadata
2. Align constructor stores all input parameters
3. Service aligners are instantiated on-demand (lazy initialization)
4. No queries execute during construction

### Query Flow

1. User calls getter method (e.g., get_mbid())
2. Align checks if value already cached
3. If not cached, determines which service to query based on available input
4. Executes service-specific query
5. Caches result
6. Returns value to user

Queries are lazy and cached. Calling get_mbid() twice only queries MusicBrainz once.

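A minimal sketch of this lazy, cached getter flow is shown below. `LazyLookup` and `fake_service` are hypothetical names used only for illustration; the real Align class may organize its cache differently.

```python
class LazyLookup:
    """Queries a service on first access and serves later calls from a cache."""

    def __init__(self, query_service):
        # query_service is a hypothetical callable: field name -> value
        self._query_service = query_service
        self._cache = {}

    def get(self, field):
        # Only hit the service if this field has never been resolved.
        if field not in self._cache:
            self._cache[field] = self._query_service(field)
        return self._cache[field]


calls = []


def fake_service(field):
    """Stand-in for a network query; records each invocation."""
    calls.append(field)
    return f"{field}-value"


lookup = LazyLookup(fake_service)
first = lookup.get("mbid")
second = lookup.get("mbid")  # served from the cache, no second query
```

Because the cache is keyed on membership rather than truthiness, a `None` result is also cached in this sketch, so a failed lookup is not retried for the lifetime of the object.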
### Cascading Fallback Strategy

Priority order for identifier resolution:

**For MBID:**
1. Use provided mbid_track if available
2. Query MusicBrainz by ISRC
3. Query MusicBrainz by metadata strings
4. Return None if all fail

**For ISRC:**
1. Use provided ISRC if available
2. Extract from MusicBrainz recording (if MBID available)
3. Query Deezer and extract ISRC from result
4. Return None if all fail

**For Deezer ID:**
1. Query Deezer by ISRC
2. Query Deezer by metadata strings
3. Return None if all fail

**For YouTube link:**
1. Query YouTube Music by metadata strings
2. Return None if no results

Each service is queried independently. No cross-service validation or conflict resolution.

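The priority lists above all follow the same shape: try each strategy in order and stop at the first one that yields a value. A rough sketch for the MBID case follows; `resolve`, `resolve_mbid`, and the injected `by_isrc` / `by_metadata` callables are hypothetical stand-ins for the MusicBrainz queries.

```python
def resolve(strategies):
    """Return the first non-None result from an ordered list of zero-argument callables."""
    for strategy in strategies:
        value = strategy()
        if value is not None:
            return value
    return None


def resolve_mbid(mbid_track=None, isrc=None, artist=None, track=None,
                 by_isrc=lambda isrc: None,
                 by_metadata=lambda artist, track: None):
    """Cascade: provided MBID, then ISRC search, then metadata search."""
    return resolve([
        lambda: mbid_track,
        lambda: by_isrc(isrc) if isrc else None,
        lambda: by_metadata(artist, track) if artist and track else None,
    ])


# The ISRC step is skipped when a track MBID is supplied directly.
direct = resolve_mbid(mbid_track="given-mbid", isrc="some-isrc",
                      by_isrc=lambda isrc: "mbid-from-isrc")

# Without an MBID, the ISRC search is tried before the metadata search.
via_isrc = resolve_mbid(isrc="some-isrc", artist="a", track="t",
                        by_isrc=lambda isrc: "mbid-from-isrc",
                        by_metadata=lambda artist, track: "mbid-from-metadata")
```

Wrapping each step in a lambda keeps later (more expensive) queries from running once an earlier step succeeds.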
## Supporting Components

### JAMSProcessor (preprocessor.py)

Handles reading and writing JAMS (JSON Annotated Music Specification) files.

**Responsibilities:**
- Parse JAMS JSON structure
- Extract metadata from file_metadata and sandbox sections
- Enrich JAMS files with new identifiers
- Write updated JAMS files

JAMS structure:
```json
{
  "file_metadata": {
    "title": "track name",
    "artist": "artist name",
    "release": "album name",
    "duration": 123.45,
    "identifiers": {
      "musicbrainz": "mbid-here"
    }
  },
  "sandbox": {
    "type": "genre",
    "genre": "rock",
    "track_number": 1,
    "release_year": 2020
  }
}
```

JAMSProcessor reads these fields, passes them to Align, and writes enriched identifiers back to the identifiers section.

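The read-and-enrich cycle can be approximated with the standard `json` module. The sketch below is not JAMSProcessor itself: `read_metadata` and `add_identifiers` are hypothetical helpers, the mapping of the `musicbrainz` identifier to `mbid_track` is an assumption, and the `isrc` and `deezer` identifier keys are invented for the example.

```python
import json


def read_metadata(jams):
    """Map JAMS fields onto the keyword arguments that Align accepts."""
    meta = jams.get("file_metadata", {})
    sandbox = jams.get("sandbox", {})
    return {
        "artist": meta.get("artist"),
        "track": meta.get("title"),
        "album": meta.get("release"),
        "duration": meta.get("duration"),
        # Assumption: the stored MusicBrainz identifier is a recording MBID.
        "mbid_track": meta.get("identifiers", {}).get("musicbrainz"),
        "track_number": sandbox.get("track_number"),
    }


def add_identifiers(jams, new_ids):
    """Write resolved identifiers back, skipping None and keeping existing values."""
    identifiers = jams.setdefault("file_metadata", {}).setdefault("identifiers", {})
    for name, value in new_ids.items():
        if value is not None:
            identifiers.setdefault(name, value)
    return jams


raw = """
{
  "file_metadata": {
    "title": "track name",
    "artist": "artist name",
    "release": "album name",
    "duration": 123.45,
    "identifiers": {"musicbrainz": "mbid-here"}
  },
  "sandbox": {"track_number": 1}
}
"""
jams = json.loads(raw)
fields = read_metadata(jams)
add_identifiers(jams, {"isrc": "isrc-here", "deezer": None, "musicbrainz": "other"})
enriched = json.dumps(jams, indent=2)  # ready to write back to disk
```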
### MBDownload (musicbrainz_dump.py)

Utility for bulk downloading MusicBrainz data.

**Purpose:** Pre-populate local datasets with MusicBrainz metadata to reduce API calls during batch processing.

**Implementation details:** Not fully specified in provided information. Likely queries MusicBrainz in batches and caches results locally.

### link_partitions.py

Batch processing script for directories of JAMS files.

**Workflow:**
1. Scan directory for JAMS files
2. For each file, extract metadata via JAMSProcessor
3. Create Align instance and query all services
4. Collect results in pandas DataFrame
5. Output CSV with all identifiers

**Command-line options:**
- `--save`: Write enriched JAMS files back to disk
- `--limit audio`: Only process audio files (skip non-audio JAMS)
- `--overwrite`: Overwrite existing enriched files

Includes progress bars via tqdm and logging to link_partitions.log.

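For reference, the three options listed above could be declared with `argparse` along these lines. This is a guess at the interface, not the script's actual parser: the positional `jams_dir` argument and the help strings are assumptions.

```python
import argparse


def build_parser():
    """Command-line parser mirroring the documented link_partitions.py options."""
    parser = argparse.ArgumentParser(prog="link_partitions.py")
    # Hypothetical positional argument; the real script may take its input differently.
    parser.add_argument("jams_dir", help="directory containing JAMS files")
    parser.add_argument("--save", action="store_true",
                        help="write enriched JAMS files back to disk")
    parser.add_argument("--limit", default=None,
                        help="restrict processing, e.g. 'audio' for audio files only")
    parser.add_argument("--overwrite", action="store_true",
                        help="overwrite existing enriched files")
    return parser


args = build_parser().parse_args(["./partitions", "--save", "--limit", "audio"])
```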
### prepare_dataset.py

Dataset preparation utilities. Specific functionality not detailed in provided information. Likely includes:
- Data cleaning
- Format conversion
- Batch metadata enrichment

## Configuration Architecture

No configuration system. All settings hardcoded in source files.

**Hardcoded values:**
- MusicBrainz User-Agent: "elka/0.1"
- Deezer duration threshold: 3 seconds
- API endpoints: Direct URLs in code
- Spotify credentials: Imported from external mml_secrets.py

**Implications:**
- No runtime configuration
- No environment-specific settings
- Changing thresholds requires code modification
- No A/B testing of matching strategies

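One lightweight way to lift these values out of the source is a small settings object that falls back to the current hardcoded defaults. The sketch below is only a suggestion: `LinkerConfig` and the `MML_*` environment variable names are invented for the example.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class LinkerConfig:
    """Settings that are currently hardcoded, with the same default values."""
    user_agent: str = "elka/0.1"
    deezer_duration_threshold: float = 3.0

    @classmethod
    def from_env(cls, env=None):
        """Build a config from environment variables (hypothetical names)."""
        env = os.environ if env is None else env
        return cls(
            user_agent=env.get("MML_USER_AGENT", cls.user_agent),
            deezer_duration_threshold=float(
                env.get("MML_DEEZER_DURATION_THRESHOLD", cls.deezer_duration_threshold)
            ),
        )


default_config = LinkerConfig.from_env({})
tuned_config = LinkerConfig.from_env({"MML_DEEZER_DURATION_THRESHOLD": "5"})
```

Passing such an object into the aligners would make it possible to change thresholds per environment without editing code.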
## Error Handling Architecture

Error handling is minimal and inconsistent.

**Pattern:**
```python
try:
    result = service.query()
    return result
except:  # bare except: also swallows KeyboardInterrupt and SystemExit
    return None
```

All exceptions are caught and suppressed. Failed queries return None. No error logging, no exception propagation, no retry logic.

**Consequences:**
- Silent failures
- No visibility into what went wrong
- Difficult debugging
- No distinction between "not found" and "service error"

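A possible alternative that keeps "not found" and "service error" apart is sketched below. The exception classes, `safe_query`, and the logger name are hypothetical; the set of exceptions treated as service failures would depend on the HTTP client actually in use.

```python
import logging

logger = logging.getLogger("musicmetalinker")


class NotFound(Exception):
    """The service answered, but had no matching entity."""


class ServiceError(Exception):
    """The service could not be reached or failed unexpectedly."""


def safe_query(query, *args):
    """Return the result, map NotFound to None, and surface real failures."""
    try:
        return query(*args)
    except NotFound:
        # A legitimate empty answer: callers can fall back to the next service.
        return None
    except (ConnectionError, TimeoutError) as exc:
        # A real failure: log it and let the caller decide whether to retry.
        logger.warning("service query failed: %s", exc)
        raise ServiceError(str(exc)) from exc
```

With this split, `None` always means that the entity does not exist in the queried service, while outages raise `ServiceError` and leave a log entry.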
## Logging Architecture

Uses Python's standard logging module.

**Batch processing:** File-based logging to link_partitions.log. Includes timestamps, log levels, and progress information.

**Library usage:** Console logging. Minimal output.

**Debug output:** Multiple print() statements scattered throughout code. Not controlled by logging configuration.

**Issues:**
- Debug prints in production code
- No structured logging
- No log levels for debug prints
- No correlation IDs for tracking requests across services

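The debug prints and missing correlation IDs could both be addressed with the standard library alone. The following sketch routes debug output through a logger and tags every message for a track with an ID via `logging.LoggerAdapter`. The logger name, the format string, and the helper names are assumptions.

```python
import io
import logging
import uuid


def make_logger(stream, level=logging.INFO):
    """Configure a dedicated logger whose records carry a correlation_id field."""
    logger = logging.getLogger("musicmetalinker.example")
    logger.setLevel(level)
    logger.propagate = False  # keep the custom format away from the root handlers
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(levelname)s [%(correlation_id)s] %(message)s")
    )
    logger.handlers = [handler]
    return logger


def logger_for_track(logger, correlation_id=None):
    """Bind one correlation ID to every message logged while linking a track."""
    cid = correlation_id or uuid.uuid4().hex[:8]
    return logging.LoggerAdapter(logger, {"correlation_id": cid})


stream = io.StringIO()  # stands in for sys.stderr or a log file
log = logger_for_track(make_logger(stream), correlation_id="track-0001")
log.debug("raw Deezer response omitted")  # replaces a print(); hidden at INFO level
log.info("querying MusicBrainz")
output = stream.getvalue()
```

Raising the level to `logging.DEBUG` would bring the former print output back without touching the call sites.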
## Concurrency Model

Single-threaded, synchronous execution. No parallelization.

**Query execution:**
- Services queried sequentially
- No concurrent API calls
- No async/await
- No thread pools

**Implications:**
- Slow batch processing (network latency multiplied by number of tracks)
- Underutilized network bandwidth
- Simple debugging (no race conditions)

Batch processing could benefit significantly from parallel execution.

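Since the work is network-bound, a thread pool is one plausible way to parallelize batch linking. The sketch below uses `concurrent.futures`; `link_track` is a placeholder for creating an Align instance and calling its getters, and the worker count is arbitrary.

```python
from concurrent.futures import ThreadPoolExecutor


def link_track(track):
    """Placeholder for per-track linking (one Align instance per call)."""
    return {"track": track, "mbid": f"mbid-for-{track}"}


def link_all(tracks, max_workers=4):
    """Link tracks concurrently; pool.map returns results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(link_track, tracks))


results = link_all(["track-a", "track-b", "track-c"])
```

Each task should build its own Align object, because instances are not thread-safe, and the worker count would need to stay within the rate limits of the external APIs.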
## Dependency Injection

No dependency injection. Service classes instantiated directly in Align constructor.

**Current pattern:**
```python
self.mb_align = MusicBrainzAlign(...)
self.deezer_align = DeezerAlign(...)
```

**Implications:**
- Difficult to mock services for testing
- Tight coupling between Align and service implementations
- No interface-based programming
- Hard to swap service implementations

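For contrast, constructor injection might look like the sketch below. `RecordingLookup`, `InjectedAlign`, and `FakeMusicBrainz` are hypothetical names, and the single `get_mbid` method is a simplification of what a real aligner interface would need.

```python
from typing import Optional, Protocol


class RecordingLookup(Protocol):
    """Interface any MusicBrainz-like aligner would have to satisfy."""

    def get_mbid(self, artist: str, track: str) -> Optional[str]: ...


class InjectedAlign:
    """Facade that receives its service aligner instead of constructing it."""

    def __init__(self, mb_align: RecordingLookup, artist: str, track: str):
        self._mb_align = mb_align
        self.artist = artist
        self.track = track

    def get_mbid(self) -> Optional[str]:
        return self._mb_align.get_mbid(self.artist, self.track)


class FakeMusicBrainz:
    """Test double: no network access, fixed answer."""

    def get_mbid(self, artist, track):
        return "fake-mbid"


align = InjectedAlign(FakeMusicBrainz(), artist="artist name", track="track name")
mbid = align.get_mbid()
```

Because the facade depends only on the protocol, a different provider or a test double can be swapped in without changing `InjectedAlign`.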
## State Management

State is managed in Align instance variables.

**Cached values:**
- All input parameters (artist, track, album, etc.)
- Retrieved metadata (MBID, ISRC, Deezer ID, etc.)
- Service aligner instances

**Cache invalidation:** None. Values cached for lifetime of Align instance.

**Thread safety:** Not thread-safe. No locks, no synchronization.

## Extension Points

Limited extensibility.

**Adding new services:**
1. Create new service aligner class
2. Instantiate in Align constructor
3. Add getter methods to Align
4. Update cascading fallback logic

No plugin system, no service registry, no abstract base classes.

**Modifying matching logic:**
Requires editing service aligner classes directly. No strategy pattern, no configurable matchers.

## Testing Architecture

No test suite. No test directory. No test configuration.

**Testing approach:**
- Manual testing via Jupyter notebooks (deezer_test.ipynb, queries.ipynb)
- if __name__ == "__main__" blocks in some modules
- No unit tests, no integration tests, no mocks

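To give an idea of what a first unit test could look like, the sketch below exercises an ISRC fallback with `unittest.mock`. `resolve_isrc` and the client method names are hypothetical; the real fallback lives inside Align and would need its services to be injectable before it could be tested this way.

```python
import unittest
from unittest.mock import Mock


def resolve_isrc(mb_client, deezer_client, mbid=None):
    """Hypothetical fallback: MusicBrainz first (if an MBID is known), then Deezer."""
    isrc = mb_client.get_isrc(mbid) if mbid else None
    return isrc if isrc is not None else deezer_client.get_isrc()


class ResolveIsrcTest(unittest.TestCase):
    def test_falls_back_to_deezer(self):
        mb = Mock()
        mb.get_isrc.return_value = None  # MusicBrainz has no ISRC
        deezer = Mock()
        deezer.get_isrc.return_value = "isrc-from-deezer"

        self.assertEqual(resolve_isrc(mb, deezer, mbid="some-mbid"), "isrc-from-deezer")
        mb.get_isrc.assert_called_once_with("some-mbid")

    def test_musicbrainz_wins_when_available(self):
        mb = Mock()
        mb.get_isrc.return_value = "isrc-from-mb"
        deezer = Mock()

        self.assertEqual(resolve_isrc(mb, deezer, mbid="some-mbid"), "isrc-from-mb")
        deezer.get_isrc.assert_not_called()
```

Tests of this kind run without network access, which makes them suitable for continuous integration.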
## Build and Packaging

Uses hatchling (PEP 517 build backend).

**pyproject.toml structure:**
- Project metadata (name, version, authors)
- Dependencies
- Build system configuration

No setup.py. Modern Python packaging.

**Distribution:** GitHub only. Not published to PyPI.

## Deployment Architecture

Library deployment: pip install from GitHub.

Batch processing deployment: Clone repository, install dependencies, run Python scripts directly.

No Docker containers, no systemd services, no process managers.

## Performance Considerations

No performance optimization.

**Bottlenecks:**
- Network latency (sequential API calls)
- No caching across Align instances
- No request batching
- No connection pooling

**Memory usage:**
- Minimal (only current track metadata in memory)
- No large data structures
- Pandas DataFrame for batch output (could be large for big datasets)

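The lack of caching across Align instances could be mitigated with a process-wide memoized lookup. A minimal sketch using `functools.lru_cache` follows; `fetch_recording` is a stand-in for a real network request and the cache size is arbitrary.

```python
from functools import lru_cache

call_count = 0


def fetch_recording(mbid):
    """Stand-in for a network request; counts how often it is really executed."""
    global call_count
    call_count += 1
    return {"mbid": mbid}


@lru_cache(maxsize=1024)
def cached_recording(mbid):
    """Module-level cache shared by every caller in the process."""
    return fetch_recording(mbid)


first = cached_recording("mbid-here")
second = cached_recording("mbid-here")  # cache hit, no second request
```

Two caveats apply to this approach: cached entries never expire on their own, and callers receive the same dictionary object, so they should not mutate it.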
## Security Architecture

Minimal security considerations.

**API credentials:**
- MusicBrainz: No authentication
- Deezer: No authentication
- YouTube Music: No authentication
- Spotify: OAuth2 client credentials in external file

**Secrets management:**
- Spotify credentials in mml_secrets.py (not in repository)
- No encryption
- No environment variables
- No secrets vault

**Input validation:**
- No validation of user input
- No sanitization of metadata strings
- Potential injection vulnerabilities if metadata used in shell commands

## Architectural Strengths

1. **Simple facade:** Single Align class hides complexity
2. **Cascading fallback:** Graceful degradation when services fail
3. **Lazy evaluation:** Only query services when needed
4. **Service isolation:** Each service in separate class

## Architectural Weaknesses

1. **No abstraction:** Service classes have different interfaces
2. **Tight coupling:** Align directly instantiates service classes
3. **Weak error handling:** Exceptions are swallowed, so failures are silent
4. **No concurrency:** Sequential execution only
5. **Hardcoded configuration:** No runtime flexibility
6. **No testing:** Design is hard to test (tight coupling, no mocks)
7. **Dead code:** AcousticBrainz integration non-functional
8. **Inconsistent patterns:** Function for AcousticBrainz, classes for others

## Architectural Recommendations

For production use, consider:

1. **Define service interface:** Abstract base class for all aligners
2. **Dependency injection:** Pass service instances to Align constructor
3. **Configuration system:** External config for thresholds, endpoints, credentials
4. **Error handling:** Explicit error types, logging, retry logic
5. **Async execution:** Use asyncio for concurrent API calls
6. **Caching layer:** Redis or in-memory cache for repeated queries
7. **Remove dead code:** Delete AcousticBrainz integration
8. **Add tests:** Unit tests with mocked services
9. **Structured logging:** JSON logs with correlation IDs
10. **Rate limiting:** Respect API rate limits with backoff

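As an example of the retry and backoff items, a small helper with exponential delays might look like this. `call_with_backoff` and its defaults are assumptions, and `ConnectionError` stands in for whatever transient error the HTTP client raises.

```python
import time


def call_with_backoff(func, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call func(); on ConnectionError wait base_delay * 2**attempt, then retry."""
    for attempt in range(retries + 1):
        try:
            return func()
        except ConnectionError:
            if attempt == retries:
                raise  # out of retries: let the caller see the failure
            sleep(base_delay * 2 ** attempt)


# Demonstration with a function that fails twice before succeeding.
delays = []
attempts = {"n": 0}


def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


outcome = call_with_backoff(flaky, sleep=delays.append)
```

The `sleep` parameter is injectable so that the delays can be recorded instead of actually waited for, as in the demonstration above.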
The core pattern (cascading fallback across services) is sound. The implementation needs significant hardening.