feat: initial implementation of metadata aggregator

- gRPC service with MusicBrainz provider
- PostgreSQL schema with migrations
- Service layer with database-first caching
- Repository pattern for data access
- YAML configuration support
- Research documentation for 17 music metadata projects

@@ -0,0 +1,521 @@
# MusicMetaLinker API Reference

## API Type

MusicMetaLinker is a Python library API. No REST API, no GraphQL, no command-line interface for library functionality.

Batch processing has a CLI (link_partitions.py) but the core library is Python-only.

## Primary Interface: Align Class

### Constructor

```python
from musicmetalinker.linking import Align

linker = Align(
    mbid_track=None,
    mbid_release=None,
    artist=None,
    album=None,
    track=None,
    track_number=None,
    duration=None,
    isrc=None,
    strict=False
)
```

**Parameters:**

**mbid_track** (str, optional): MusicBrainz recording ID. If provided, MusicBrainz is queried first and treated as authoritative.

**mbid_release** (str, optional): MusicBrainz release ID. Used for album-level metadata.

**artist** (str, optional): Artist name. Used for metadata-based search when identifiers unavailable.

**album** (str, optional): Album name. Used for filtering and matching.

**track** (str, optional): Track name. Primary search term for metadata-based queries.

**track_number** (int, optional): Track position on album. Used for filtering multiple matches.

**duration** (int or float, optional): Track duration in seconds. Critical for filtering. Deezer uses ±3 second threshold.

**isrc** (str, optional): International Standard Recording Code. If provided, used for direct lookup on Deezer and MusicBrainz.

**strict** (bool, optional): Strict matching mode. Behavior not fully documented. Likely affects fuzzy matching thresholds.

**Returns:** Align instance. No exceptions raised during construction. Queries execute lazily when getters called.

**Usage patterns:**

Minimal input (metadata only):

```python
linker = Align(artist="Radiohead", track="Creep")
```

With identifiers (preferred):

```python
linker = Align(
    mbid_track="6b9e7b9e-8f9e-4f9e-9f9e-9f9e9f9e9f9e",
    isrc="GBAYE9200070"
)
```

Full metadata for best matching:

```python
linker = Align(
    artist="The Beatles",
    track="Hey Jude",
    album="Hey Jude",
    duration=431,
    track_number=1
)
```

### Metadata Getter Methods

All getters return None if data unavailable. No exceptions raised.

#### get_artist()

```python
artist = linker.get_artist()
```

**Returns:** str or None. Artist name from best available source (MusicBrainz > Deezer > YouTube > input).

**Behavior:**

- If MBID available, returns MusicBrainz artist
- Falls back to Deezer artist if found
- Falls back to YouTube artist if found
- Returns input artist if no services matched
- Returns None if no artist information available

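
The fallback order above amounts to a first-non-None selection. The following is a minimal sketch of that idea only, assuming each source's value has already been fetched; `first_available` is a hypothetical helper, not a MusicMetaLinker function, and the library's internal implementation may differ.

```python
from typing import Optional


def first_available(*candidates: Optional[str]) -> Optional[str]:
    """Return the first candidate that is not None, or None if all are None."""
    for value in candidates:
        if value is not None:
            return value
    return None


# Arguments follow the priority order listed above:
# MusicBrainz, Deezer, YouTube, then the caller's input.
artist = first_available(None, "Radiohead", None, "radiohead")
print(artist)  # Radiohead
```

Passing the sources in priority order keeps the precedence rule in one place, so get_album() and get_track() style cascades could reuse the same helper.
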
#### get_album()

```python
album = linker.get_album()
```

**Returns:** str or None. Album/release name.

**Behavior:** Same cascading fallback as get_artist().

#### get_track()

```python
track = linker.get_track()
```

**Returns:** str or None. Track/recording name.

**Behavior:** Same cascading fallback as get_artist().

#### get_track_number()

```python
track_number = linker.get_track_number()
```

**Returns:** int or None. Track position on album.

**Behavior:**

- Returns MusicBrainz track number if available
- Falls back to input track_number
- Returns None if unavailable

#### get_duration()

```python
duration = linker.get_duration()
```

**Returns:** int, float, or None. Track duration in seconds.

**Behavior:**

- Returns MusicBrainz duration if available (milliseconds converted to seconds)
- Falls back to Deezer duration
- Falls back to input duration
- Returns None if unavailable

**Note:** MusicBrainz stores duration in milliseconds. The library converts to seconds for consistency.

#### get_release_date()

```python
release_date = linker.get_release_date()
```

**Returns:** str or None. Release date in ISO format (YYYY-MM-DD) or year only (YYYY).

**Behavior:**

- Returns MusicBrainz release date if available
- Falls back to Deezer release date
- Returns None if unavailable

**Format inconsistency:** MusicBrainz may return full date, Deezer typically returns year only.

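
Because the returned string may be a full date or a bare year, callers that compare dates across sources may want to reduce both forms to a year first. A minimal sketch, assuming the value is either None or one of `YYYY`, `YYYY-MM`, `YYYY-MM-DD`; `release_year` is a hypothetical helper, not part of the library.

```python
import re
from typing import Optional


def release_year(release_date: Optional[str]) -> Optional[int]:
    """Extract the year from 'YYYY', 'YYYY-MM' or 'YYYY-MM-DD'.

    Returns None for None or for any string in another shape.
    """
    if release_date is None:
        return None
    match = re.fullmatch(r"(\d{4})(-\d{2}){0,2}", release_date.strip())
    return int(match.group(1)) if match else None


print(release_year("1992-09-21"))  # 1992
print(release_year("1992"))        # 1992
```
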
#### get_isrc()

```python
isrc = linker.get_isrc()
```

**Returns:** str or None. International Standard Recording Code.

**Behavior:**

- Returns input ISRC if provided
- Extracts from MusicBrainz recording if available
- Extracts from Deezer result if available
- Returns None if unavailable

**Format:** Standard ISRC format (e.g., "GBAYE9200070"). No validation performed.

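
Since the library performs no validation, a caller can check the shape of an ISRC before passing it in. The sketch below assumes the usual 12-character layout (two-letter country code, three alphanumeric registrant characters, two-digit year, five-digit designation) and tolerates the hyphenated display form; `is_valid_isrc` is a hypothetical helper. It checks syntax only and cannot tell whether the code was actually issued.

```python
import re

# CC (2 letters) + registrant (3 alphanumerics) + year (2 digits) + designation (5 digits)
_ISRC_PATTERN = re.compile(r"[A-Z]{2}[A-Z0-9]{3}\d{2}\d{5}")


def is_valid_isrc(isrc: str) -> bool:
    """Return True if the string has the syntactic shape of an ISRC."""
    normalized = isrc.replace("-", "").strip().upper()
    return _ISRC_PATTERN.fullmatch(normalized) is not None


print(is_valid_isrc("GBAYE9200070"))     # True
print(is_valid_isrc("GB-AYE-92-00070"))  # True
print(is_valid_isrc("GBAYE92000"))       # False
```
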
#### get_bpm()

```python
bpm = linker.get_bpm()
```

**Returns:** int, float, or None. Tempo in beats per minute.

**Behavior:**

- Returns Deezer BPM if available
- Returns None if unavailable

**Note:** MusicBrainz doesn't provide BPM in standard queries. Only Deezer source.

### Identifier Getter Methods

#### get_mbid()

```python
mbid = linker.get_mbid()
```

**Returns:** str or None. MusicBrainz recording ID (UUID format).

**Behavior:**

- Returns input mbid_track if provided
- Queries MusicBrainz by ISRC if available
- Queries MusicBrainz by metadata if ISRC unavailable
- Returns None if no match found

**Format:** UUID string (e.g., "6b9e7b9e-8f9e-4f9e-9f9e-9f9e9f9e9f9e").

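
An MBID can be checked for UUID shape with the standard library before it is used as `mbid_track`. A minimal sketch; `is_valid_mbid` is a hypothetical helper, and requiring the canonical hyphenated form is a choice made here, not something the library is known to enforce.

```python
import uuid


def is_valid_mbid(value: str) -> bool:
    """Return True if value is a UUID in canonical hyphenated form."""
    try:
        parsed = uuid.UUID(value)
    except (ValueError, AttributeError, TypeError):
        return False
    # uuid.UUID also accepts braces, a urn: prefix, and unhyphenated hex,
    # so compare against the canonical rendering to reject those forms.
    return str(parsed) == value.lower()


print(is_valid_mbid("6b9e7b9e-8f9e-4f9e-9f9e-9f9e9f9e9f9e"))  # True
print(is_valid_mbid("not-a-uuid"))                            # False
```
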
#### get_deezer_id()

```python
deezer_id = linker.get_deezer_id()
```

**Returns:** int or None. Deezer track ID.

**Behavior:**

- Queries Deezer by ISRC if available
- Queries Deezer by metadata if ISRC unavailable
- Filters by duration (±3 seconds)
- Returns None if no match found

**Format:** Integer (e.g., 123456789).

#### get_deezer_link()

```python
deezer_link = linker.get_deezer_link()
```

**Returns:** str or None. Full Deezer track URL.

**Behavior:**

- Calls get_deezer_id() internally
- Constructs URL: f"https://www.deezer.com/track/{deezer_id}"
- Returns None if no Deezer ID available

**Format:** Full URL (e.g., "https://www.deezer.com/track/123456789").

#### get_youtube_link()

```python
youtube_link = linker.get_youtube_link()
```

**Returns:** str or None. YouTube Music track URL.

**Behavior:**

- Queries YouTube Music by metadata (artist, track, album)
- Returns first result (no sophisticated ranking)
- Returns None if no results

**Format:** Full YouTube URL (e.g., "https://www.youtube.com/watch?v=dQw4w9WgXcQ").

**Warning:** YouTube matching is weak. First result assumed correct. No duration filtering.

#### get_acousticbrainz_link()

```python
acousticbrainz_link = linker.get_acousticbrainz_link()
```

**Returns:** str or None. AcousticBrainz URL.

**Behavior:**

- Requires MBID (calls get_mbid() internally)
- Checks if https://acousticbrainz.org/{mbid} returns HTTP 200
- Returns URL if exists, None otherwise

**Critical issue:** AcousticBrainz shut down in 2022. This method always returns None. Dead code.

### Internal Service Methods

Not part of public API but exposed in service classes.

#### MusicBrainzAlign Methods

**get_recording(mbid):** Direct MusicBrainz recording lookup by MBID.

**get_best_match(artist, track, album, duration):** Search MusicBrainz by metadata with filtering.

**get_iswc():** Retrieve International Standard Musical Work Code.

**Implementation details:**

```python
from musicmetalinker.linking import MusicBrainzAlign

mbid = "..."  # MusicBrainz recording ID
mb = MusicBrainzAlign(mbid=mbid)
recording = mb.get_recording(mbid)
# Returns dict with artist, album, track, duration, isrcs, etc.
```

Not intended for direct use. Align class wraps these methods.

#### DeezerAlign Methods

**best_match(artist, track, album, duration, duration_threshold=3):** Search Deezer with duration filtering.

**get_rank():** Retrieve Deezer popularity rank.

**Implementation details:**

```python
from musicmetalinker.linking import DeezerAlign

# Placeholder search terms; duration is in seconds
artist, track, album, duration = "...", "...", "...", 123
deezer = DeezerAlign(artist=artist, track=track, album=album, duration=duration)
match = deezer.best_match(artist, track, album, duration)
# Returns Deezer track object or None
```

Duration threshold defaults to 3 seconds. Adjustable for stricter/looser matching.

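
The effect of the threshold can be illustrated without calling Deezer. The sketch below filters hypothetical `(title, duration)` pairs by absolute difference from a target duration; `filter_by_duration` is not a library function, and treating the boundary as inclusive is an assumption.

```python
from typing import Iterable, List, Tuple

Candidate = Tuple[str, float]  # (title, duration in seconds)


def filter_by_duration(
    candidates: Iterable[Candidate],
    target: float,
    threshold: float = 3.0,
) -> List[Candidate]:
    """Keep candidates whose duration is within +/- threshold seconds of target."""
    return [c for c in candidates if abs(c[1] - target) <= threshold]


# Made-up search results for illustration only.
results = [("Creep", 238), ("Creep (Live)", 255), ("Creep (Acoustic)", 241)]

print(filter_by_duration(results, target=239))
# [('Creep', 238), ('Creep (Acoustic)', 241)]
print(filter_by_duration(results, target=239, threshold=1))
# [('Creep', 238)]
```

A tighter threshold rejects alternate versions with slightly different lengths, at the cost of missing releases whose reported duration is off by a few seconds.
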
#### YouTubeAlign Methods

**get_best_match(artist, track, album):** Search YouTube Music.

**get_youtube_id():** Extract video ID from search results.

**Implementation details:**

```python
from musicmetalinker.linking import YouTubeAlign

# Placeholder search terms
artist, track, album = "...", "...", "..."
yt = YouTubeAlign(artist=artist, track=track, album=album)
match = yt.get_best_match(artist, track, album)
# Returns YouTube Music result dict or None
```

No duration parameter. No filtering. First result returned.

### Batch Processing API

#### link_partitions.py CLI

```bash
python link_partitions.py <directory> [options]
```

**Arguments:**

**directory** (positional): Path to directory containing JAMS files.

**Options:**

**--save:** Write enriched JAMS files back to disk. Without this flag, only CSV output generated.

**--limit audio:** Only process JAMS files with audio content. Skip annotation-only files.

**--overwrite:** Overwrite existing enriched JAMS files. Without this flag, existing files skipped.

**Output:**

CSV file with columns:

- jams_file: Original JAMS filename
- track_name, artist_name, album_name: Metadata
- track_number, duration, release_year: Attributes
- musicbrainz: MBID
- isrc: ISRC
- deezer_id, deezer_url: Deezer identifiers
- youtube_url: YouTube Music link
- acousticbrainz: AcousticBrainz link (always None)
- spotify_id: Spotify ID (if available)

Log file: link_partitions.log in current directory.

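
The CSV output can be post-processed with the standard library, for example to list files that did not receive an identifier. A minimal sketch using the column names listed above; `files_missing` is a hypothetical helper, and how the batch script encodes a missing value (empty cell, `None`, or `nan`) is an assumption.

```python
import csv
from typing import Iterable, List

# Assumed encodings of "no value" in the CSV; adjust to the real output.
MISSING = {"", "None", "nan"}


def files_missing(column: str, lines: Iterable[str]) -> List[str]:
    """Return the jams_file of every row whose given column is empty or missing."""
    reader = csv.DictReader(lines)
    return [
        row["jams_file"]
        for row in reader
        if (row.get(column) or "").strip() in MISSING
    ]


# Made-up rows for illustration; a real run would pass an open file object.
sample = [
    "jams_file,musicbrainz,isrc",
    "a.jams,some-mbid,GBAYE9200070",
    "b.jams,,",
    "c.jams,None,",
]
print(files_missing("musicbrainz", sample))  # ['b.jams', 'c.jams']
```
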
#### JAMSProcessor API

```python
from musicmetalinker.preprocessor import JAMSProcessor

processor = JAMSProcessor(jams_file_path)
metadata = processor.extract_metadata()
# Returns dict with artist, track, album, duration, etc.

processor.enrich_jams(align_instance)
processor.write_jams(output_path)
```

**extract_metadata():** Parses JAMS file and returns metadata dict.

**enrich_jams(align):** Takes Align instance and adds identifiers to JAMS structure.

**write_jams(path):** Writes enriched JAMS to file.

### Error Handling

No exceptions raised by public API. All errors silently suppressed.

**Pattern:**

- Service query fails: Returns None
- Network error: Returns None
- Invalid input: Returns None
- No match found: Returns None

**Implications:**

- No distinction between error types
- No error messages
- No logging of failures (except in batch mode)
- Caller cannot determine why None returned

**Debugging:**

- Enable logging to see internal errors
- Check link_partitions.log for batch processing errors
- Add print statements to source code

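
Since every failure collapses to None, the most a caller can do without patching the library is record which fields came back empty. A minimal sketch; `collect_fields` is a hypothetical helper, and `StubLinker` is a stand-in for an Align instance so the example runs offline.

```python
from typing import Any, Dict, List, Sequence, Tuple

# Getter names taken from this reference; the selection is arbitrary.
DEFAULT_GETTERS = ("get_artist", "get_track", "get_mbid", "get_isrc", "get_deezer_id")


def collect_fields(
    linker: Any, getters: Sequence[str] = DEFAULT_GETTERS
) -> Tuple[Dict[str, Any], List[str]]:
    """Call each getter once; return (values found, names that returned None)."""
    values: Dict[str, Any] = {}
    missing: List[str] = []
    for name in getters:
        value = getattr(linker, name)()
        if value is None:
            missing.append(name)
        else:
            values[name] = value
    return values, missing


class StubLinker:
    """Offline stand-in exposing the same getter names as Align."""

    def get_artist(self): return "Radiohead"
    def get_track(self): return "Creep"
    def get_mbid(self): return None
    def get_isrc(self): return "GBAYE9200070"
    def get_deezer_id(self): return None


values, missing = collect_fields(StubLinker())
print(missing)  # ['get_mbid', 'get_deezer_id']
```

This still cannot say why a getter returned None, only that it did; distinguishing a network error from a genuine non-match would require changes inside the library.
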
### Rate Limiting

No rate limiting implemented.

**Risks:**

- MusicBrainz: 1 request/second recommended; the library does not enforce it
- Deezer: limits unknown; the library enforces nothing
- YouTube Music: limits unknown; the library enforces nothing

**Batch processing:** No delays between requests. High risk of rate limiting or IP bans.

**Recommendation:** Add manual delays in batch processing loops.

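
One way to add such delays is a small throttle object that enforces a minimum interval between calls. This is a sketch under stated assumptions, not part of the library: `Throttle` is a hypothetical class, and the 1-second default simply mirrors the MusicBrainz recommendation above. The clock and sleep functions are injectable so the behavior can be tested without waiting.

```python
import time
from typing import Callable, Optional


class Throttle:
    """Block so that consecutive wait() calls are at least min_interval seconds apart."""

    def __init__(
        self,
        min_interval: float = 1.0,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last: Optional[float] = None  # time of the previous wait() return

    def wait(self) -> None:
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


throttle = Throttle(min_interval=1.0)
for name in ["track one", "track two"]:
    throttle.wait()  # first call returns immediately, later calls may sleep
    print("processing", name)
```

In a batch loop, calling `throttle.wait()` before constructing each Align instance would space out the first query per track; getters that trigger several service calls would still issue them back to back.
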
### Caching

Results cached within Align instance lifetime. No cross-instance caching.

**Behavior:**

- First call to get_mbid() queries MusicBrainz
- Second call to get_mbid() returns cached value
- Creating new Align instance queries again

**No persistent cache:** No disk cache, no Redis, no memcached.

**Batch processing:** Each track creates new Align instance. No cache reuse across tracks.

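
A caller who processes the same track more than once can add a cross-instance cache outside the library. The sketch below memoizes results in a dict keyed by the input fields; `LookupCache` and the `resolve` callable are hypothetical, with `resolve` standing in for "create an Align instance and call its getters".

```python
from typing import Callable, Dict, Optional, Tuple

Key = Tuple[Optional[str], Optional[str], Optional[str]]  # (artist, track, isrc)


class LookupCache:
    """In-memory cache shared across lookups; lives as long as the process."""

    def __init__(self, resolve: Callable[[Key], dict]) -> None:
        self._resolve = resolve
        self._store: Dict[Key, dict] = {}

    def get(
        self,
        artist: Optional[str] = None,
        track: Optional[str] = None,
        isrc: Optional[str] = None,
    ) -> dict:
        key = (artist, track, isrc)
        if key not in self._store:
            # Only the first request for a key reaches the resolver.
            self._store[key] = self._resolve(key)
        return self._store[key]


calls = []

def fake_resolve(key: Key) -> dict:
    """Stand-in resolver that records how often it is invoked."""
    calls.append(key)
    return {"artist": key[0], "track": key[1]}


cache = LookupCache(fake_resolve)
cache.get(artist="Radiohead", track="Creep")
cache.get(artist="Radiohead", track="Creep")
print(len(calls))  # 1
```

Note that this also caches failed lookups; a production version would probably want to skip storing results where every field is None.
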
### Thread Safety

Not thread-safe. No synchronization primitives.

**Unsafe operations:**

- Concurrent calls to same Align instance
- Concurrent batch processing of same directory

**Safe operations:**

- Multiple Align instances in separate threads (each queries independently)

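
The safe pattern above, one instance per thread of work, can be sketched with a thread pool. `link_all` and `StubLinker` are hypothetical; in real use `make_linker` would be the Align class, on the assumption stated above that separate instances do not share state.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Dict, List, Optional


def link_all(
    tracks: List[Dict[str, Any]],
    make_linker: Callable[..., Any],
    max_workers: int = 4,
) -> List[Optional[str]]:
    """Resolve an MBID per track, creating a fresh linker inside each task."""

    def link_one(track: Dict[str, Any]) -> Optional[str]:
        linker = make_linker(**track)  # never shared between threads
        return linker.get_mbid()

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in input order.
        return list(pool.map(link_one, tracks))


class StubLinker:
    """Offline stand-in with the same constructor keywords and getter name."""

    def __init__(self, artist=None, track=None):
        self.artist, self.track = artist, track

    def get_mbid(self):
        return f"mbid-for-{self.track}" if self.track else None


tracks = [{"artist": "Radiohead", "track": "Creep"}, {"artist": "Radiohead"}]
print(link_all(tracks, StubLinker))  # ['mbid-for-Creep', None]
```

Running lookups concurrently multiplies the request rate, so this pattern makes the absence of rate limiting more of a concern, not less.
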
### Authentication

**MusicBrainz:** No authentication. User-Agent header required ("elka/0.1" hardcoded).

**Deezer:** No authentication for search API.

**YouTube Music:** No authentication. Uses unofficial API.

**Spotify:** OAuth2 client credentials required. Configured in external mml_secrets.py file.

**Spotify usage:** Limited to ISRC extraction in Billboard dataset cleaning. Not used in main Align workflow.

### API Versioning

No API versioning. Library version 0.0.1 indicates pre-release.

**Breaking changes:** Possible in any release. No stability guarantees.

**Compatibility:** No backward compatibility promises.

### Dependencies for API Usage

Minimum dependencies for using Align class:

- musicbrainzngs
- deezer-python
- ytmusicapi
- requests

Optional dependencies:

- jams (for JAMS file support)
- pandas (for batch CSV output)
- spotipy (for Spotify integration)

### Performance Characteristics

**Query latency:**

- MusicBrainz: 100-500ms per query
- Deezer: 50-200ms per query
- YouTube Music: 100-300ms per query

**Total latency:** Sum of all service queries (sequential execution). Expect 250-1000ms per track.

**Batch processing:** Linear scaling. 1000 tracks = 1000x single track latency.

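
The linear-scaling claim turns into a quick estimate. The arithmetic below uses only the per-track range quoted above (250-1000ms); `batch_seconds` is a hypothetical helper, and the result ignores retries and any delays added for rate limiting.

```python
def batch_seconds(n_tracks: int, per_track_ms: float) -> float:
    """Total sequential time in seconds for n_tracks at per_track_ms each."""
    return n_tracks * per_track_ms / 1000.0


low = batch_seconds(1000, 250)    # best case from the range above
high = batch_seconds(1000, 1000)  # worst case from the range above
print(f"{low:.0f}s to {high:.0f}s")                  # 250s to 1000s
print(f"{low / 60:.1f} to {high / 60:.1f} minutes")  # 4.2 to 16.7 minutes
```

If requests are spaced to honor the 1 request/second MusicBrainz recommendation, 1000 tracks take at least 1000 seconds regardless of query latency.
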
### API Limitations

1. **No bulk queries:** Each track requires separate Align instance
2. **No async support:** Synchronous only
3. **No streaming results:** All-or-nothing queries
4. **No partial updates:** Can't update single field
5. **No validation:** No input validation, no output validation
6. **No error details:** Only None on failure
7. **Dead integrations:** AcousticBrainz non-functional
8. **Weak YouTube matching:** First result assumed correct

### API Strengths

1. **Simple interface:** Single class, clear getters
2. **Flexible input:** Works with identifiers or metadata
3. **Cascading fallback:** Graceful degradation
4. **Lazy evaluation:** Only query when needed
5. **JAMS support:** Academic standard format

### API Design Recommendations

For production use:

1. **Add exceptions:** Raise specific errors instead of returning None
2. **Add validation:** Validate input parameters
3. **Add async API:** Async versions of all getters
4. **Add bulk API:** Process multiple tracks in single call
5. **Add configuration:** Runtime configuration for thresholds
6. **Add logging:** Structured logging with correlation IDs
7. **Add rate limiting:** Respect API limits
8. **Remove dead code:** Delete AcousticBrainz methods
9. **Add documentation:** Docstrings for all public methods
10. **Add type hints:** Full type annotations

The API surface is clean and simple. The implementation needs hardening.