feat: initial implementation of metadata aggregator
- gRPC service with MusicBrainz provider
- PostgreSQL schema with migrations
- Service layer with database-first caching
- Repository pattern for data access
- YAML configuration support
- Research documentation for 17 music metadata projects
@@ -0,0 +1,521 @@
# MusicMetaLinker API Reference

## API Type

MusicMetaLinker is a Python library. It has no REST API, no GraphQL endpoint, and no command-line interface to the core library functionality.

Batch processing has a CLI (link_partitions.py), but the core library is Python-only.

## Primary Interface: Align Class

### Constructor

```python
from musicmetalinker.linking import Align

linker = Align(
    mbid_track=None,
    mbid_release=None,
    artist=None,
    album=None,
    track=None,
    track_number=None,
    duration=None,
    isrc=None,
    strict=False
)
```

**Parameters:**

**mbid_track** (str, optional): MusicBrainz recording ID. If provided, MusicBrainz is queried first and treated as authoritative.

**mbid_release** (str, optional): MusicBrainz release ID. Used for album-level metadata.

**artist** (str, optional): Artist name. Used for metadata-based search when identifiers are unavailable.

**album** (str, optional): Album name. Used for filtering and matching.

**track** (str, optional): Track name. Primary search term for metadata-based queries.

**track_number** (int, optional): Track position on album. Used to disambiguate multiple matches.

**duration** (int or float, optional): Track duration in seconds. Critical for filtering. Deezer matching uses a ±3 second threshold.

**isrc** (str, optional): International Standard Recording Code. If provided, used for direct lookup on Deezer and MusicBrainz.

**strict** (bool, optional): Strict matching mode. Behavior not fully documented. Likely affects fuzzy matching thresholds.

**Returns:** Align instance. No exceptions are raised during construction. Queries execute lazily when getters are called.

**Usage patterns:**

Minimal input (metadata only):
```python
linker = Align(artist="Radiohead", track="Creep")
```

With identifiers (preferred):
```python
linker = Align(
    mbid_track="6b9e7b9e-8f9e-4f9e-9f9e-9f9e9f9e9f9e",
    isrc="GBAYE9200070"
)
```

Full metadata for best matching:
```python
linker = Align(
    artist="The Beatles",
    track="Hey Jude",
    album="Hey Jude",
    duration=431,
    track_number=1
)
```

### Metadata Getter Methods

All getters return None if data unavailable. No exceptions raised.
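
Because every getter signals a miss with None, callers usually end up filtering results themselves. A minimal sketch of that pattern follows; `collect_metadata` and the `GETTERS` tuple are hypothetical helpers, not part of the library, and the only thing assumed about the `linker` argument is that it exposes the getter names documented in this section.

```python
from typing import Any, Dict

# Getter names as documented in this reference; the tuple itself is a
# hypothetical convenience, not something the library provides.
GETTERS = (
    "get_artist", "get_album", "get_track", "get_track_number",
    "get_duration", "get_release_date", "get_isrc", "get_bpm",
)


def collect_metadata(linker: Any) -> Dict[str, Any]:
    """Call each getter once and keep only the fields that resolved."""
    found: Dict[str, Any] = {}
    for name in GETTERS:
        value = getattr(linker, name)()
        if value is not None:  # None means "unavailable", for any reason
            found[name[len("get_"):]] = value
    return found
```

The helper compares against None explicitly so that legitimate falsy values (for example a track number of 0) are not dropped.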

#### get_artist()

```python
artist = linker.get_artist()
```

**Returns:** str or None. Artist name from best available source (MusicBrainz > Deezer > YouTube > input).

**Behavior:**
- If MBID available, returns MusicBrainz artist
- Falls back to Deezer artist if found
- Falls back to YouTube artist if found
- Returns input artist if no services matched
- Returns None if no artist information available

#### get_album()

```python
album = linker.get_album()
```

**Returns:** str or None. Album/release name.

**Behavior:** Same cascading fallback as get_artist().

#### get_track()

```python
track = linker.get_track()
```

**Returns:** str or None. Track/recording name.

**Behavior:** Same cascading fallback as get_artist().

#### get_track_number()

```python
track_number = linker.get_track_number()
```

**Returns:** int or None. Track position on album.

**Behavior:**
- Returns MusicBrainz track number if available
- Falls back to input track_number
- Returns None if unavailable

#### get_duration()

```python
duration = linker.get_duration()
```

**Returns:** int, float, or None. Track duration in seconds.

**Behavior:**
- Returns MusicBrainz duration if available (milliseconds converted to seconds)
- Falls back to Deezer duration
- Falls back to input duration
- Returns None if unavailable

**Note:** MusicBrainz stores duration in milliseconds. The library converts to seconds for consistency.

#### get_release_date()

```python
release_date = linker.get_release_date()
```

**Returns:** str or None. Release date in ISO format (YYYY-MM-DD) or year only (YYYY).

**Behavior:**
- Returns MusicBrainz release date if available
- Falls back to Deezer release date
- Returns None if unavailable

**Format inconsistency:** MusicBrainz may return full date, Deezer typically returns year only.
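
Since the returned string may be a full date or just a year, downstream code that compares releases may want to reduce both to a year. The following is a sketch only; `release_year` is a hypothetical helper, and accepting the intermediate `YYYY-MM` form is an assumption beyond the two formats listed above.

```python
import re
from typing import Optional

# Accepts YYYY, YYYY-MM, or YYYY-MM-DD; the YYYY-MM case is an assumption.
_DATE_RE = re.compile(r"(\d{4})(?:-\d{2}){0,2}")


def release_year(release_date: Optional[str]) -> Optional[int]:
    """Reduce a release date string to its year, or None if unparseable."""
    if not release_date:
        return None
    match = _DATE_RE.fullmatch(release_date.strip())
    return int(match.group(1)) if match else None
```

Returning None for unparseable input mirrors the library's own convention for missing data.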

#### get_isrc()

```python
isrc = linker.get_isrc()
```

**Returns:** str or None. International Standard Recording Code.

**Behavior:**
- Returns input ISRC if provided
- Extracts from MusicBrainz recording if available
- Extracts from Deezer result if available
- Returns None if unavailable

**Format:** Standard ISRC format (e.g., "GBAYE9200070"). No validation performed.
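
Because the library does not validate ISRCs, a caller may want to check the shape before passing one in. The sketch below assumes the commonly documented 12-character layout (two letters, three alphanumerics, seven digits) and tolerates hyphens or spaces in the input; `normalize_isrc` is a hypothetical helper, not a library function.

```python
import re
from typing import Optional

# Assumed layout: 2-letter prefix, 3-character registrant code,
# 2-digit year, 5-digit designation code.
_ISRC_RE = re.compile(r"[A-Z]{2}[A-Z0-9]{3}[0-9]{7}")


def normalize_isrc(raw: Optional[str]) -> Optional[str]:
    """Return the compact upper-case ISRC, or None if the shape is wrong."""
    if raw is None:
        return None
    candidate = raw.replace("-", "").replace(" ", "").upper()
    return candidate if _ISRC_RE.fullmatch(candidate) else None
```

This only checks the format; it cannot tell whether the code was ever assigned to a recording.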

#### get_bpm()

```python
bpm = linker.get_bpm()
```

**Returns:** int, float, or None. Tempo in beats per minute.

**Behavior:**
- Returns Deezer BPM if available
- Returns None if unavailable

**Note:** MusicBrainz doesn't provide BPM in standard queries. Only Deezer source.

### Identifier Getter Methods

#### get_mbid()

```python
mbid = linker.get_mbid()
```

**Returns:** str or None. MusicBrainz recording ID (UUID format).

**Behavior:**
- Returns input mbid_track if provided
- Queries MusicBrainz by ISRC if available
- Queries MusicBrainz by metadata if ISRC unavailable
- Returns None if no match found

**Format:** UUID string (e.g., "6b9e7b9e-8f9e-4f9e-9f9e-9f9e9f9e9f9e").

#### get_deezer_id()

```python
deezer_id = linker.get_deezer_id()
```

**Returns:** int or None. Deezer track ID.

**Behavior:**
- Queries Deezer by ISRC if available
- Queries Deezer by metadata if ISRC unavailable
- Filters by duration (±3 seconds)
- Returns None if no match found

**Format:** Integer (e.g., 123456789).

#### get_deezer_link()

```python
deezer_link = linker.get_deezer_link()
```

**Returns:** str or None. Full Deezer track URL.

**Behavior:**
- Calls get_deezer_id() internally
- Constructs URL: f"https://www.deezer.com/track/{deezer_id}"
- Returns None if no Deezer ID available

**Format:** Full URL (e.g., "https://www.deezer.com/track/123456789").

#### get_youtube_link()

```python
youtube_link = linker.get_youtube_link()
```

**Returns:** str or None. YouTube Music track URL.

**Behavior:**
- Queries YouTube Music by metadata (artist, track, album)
- Returns first result (no sophisticated ranking)
- Returns None if no results

**Format:** Full YouTube URL (e.g., "https://www.youtube.com/watch?v=dQw4w9WgXcQ").

**Warning:** YouTube matching is weak. First result assumed correct. No duration filtering.

#### get_acousticbrainz_link()

```python
acousticbrainz_link = linker.get_acousticbrainz_link()
```

**Returns:** str or None. AcousticBrainz URL.

**Behavior:**
- Requires MBID (calls get_mbid() internally)
- Checks if https://acousticbrainz.org/{mbid} returns HTTP 200
- Returns URL if exists, None otherwise

**Critical issue:** AcousticBrainz shut down in 2022. This method always returns None. Dead code.

### Internal Service Methods

Not part of public API but exposed in service classes.

#### MusicBrainzAlign Methods

**get_recording(mbid):** Direct MusicBrainz recording lookup by MBID.

**get_best_match(artist, track, album, duration):** Search MusicBrainz by metadata with filtering.

**get_iswc():** Retrieve International Standard Musical Work Code.

**Implementation details:**

```python
from musicmetalinker.linking import MusicBrainzAlign

mbid = "..."  # MusicBrainz recording ID
mb = MusicBrainzAlign(mbid=mbid)
recording = mb.get_recording(mbid)
# Returns dict with artist, album, track, duration, isrcs, etc.
```

Not intended for direct use. Align class wraps these methods.

#### DeezerAlign Methods

**best_match(artist, track, album, duration, duration_threshold=3):** Search Deezer with duration filtering.

**get_rank():** Retrieve Deezer popularity rank.

**Implementation details:**

```python
from musicmetalinker.linking import DeezerAlign

artist, track, album, duration = "...", "...", "...", 123
deezer = DeezerAlign(artist=artist, track=track, album=album, duration=duration)
match = deezer.best_match(artist, track, album, duration)
# Returns Deezer track object or None
```

Duration threshold defaults to 3 seconds. Adjustable for stricter/looser matching.

#### YouTubeAlign Methods

**get_best_match(artist, track, album):** Search YouTube Music.

**get_youtube_id():** Extract video ID from search results.

**Implementation details:**

```python
from musicmetalinker.linking import YouTubeAlign

artist, track, album = "...", "...", "..."
yt = YouTubeAlign(artist=artist, track=track, album=album)
match = yt.get_best_match(artist, track, album)
# Returns YouTube Music result dict or None
```

No duration parameter. No filtering. First result returned.

### Batch Processing API

#### link_partitions.py CLI

```bash
python link_partitions.py <directory> [options]
```

**Arguments:**

**directory** (positional): Path to directory containing JAMS files.

**Options:**

**--save:** Write enriched JAMS files back to disk. Without this flag, only CSV output generated.

**--limit audio:** Only process JAMS files with audio content. Skip annotation-only files.

**--overwrite:** Overwrite existing enriched JAMS files. Without this flag, existing files skipped.

**Output:**

CSV file with columns:
- jams_file: Original JAMS filename
- track_name, artist_name, album_name: Metadata
- track_number, duration, release_year: Attributes
- musicbrainz: MBID
- isrc: ISRC
- deezer_id, deezer_url: Deezer identifiers
- youtube_url: YouTube Music link
- acousticbrainz: AcousticBrainz link (always None)
- spotify_id: Spotify ID (if available)

Log file: link_partitions.log in current directory.

#### JAMSProcessor API

```python
from musicmetalinker.linking import Align
from musicmetalinker.preprocessor import JAMSProcessor

jams_file_path = "..."  # input JAMS file
output_path = "..."     # destination for the enriched file

processor = JAMSProcessor(jams_file_path)
metadata = processor.extract_metadata()
# Returns dict with artist, track, album, duration, etc.

# Key names below are assumed from the dict described above
align_instance = Align(artist=metadata["artist"], track=metadata["track"])
processor.enrich_jams(align_instance)
processor.write_jams(output_path)
```

**extract_metadata():** Parses JAMS file and returns metadata dict.

**enrich_jams(align):** Takes Align instance and adds identifiers to JAMS structure.

**write_jams(path):** Writes enriched JAMS to file.

### Error Handling

No exceptions raised by public API. All errors silently suppressed.

**Pattern:**
- Service query fails: Returns None
- Network error: Returns None
- Invalid input: Returns None
- No match found: Returns None

**Implications:**
- No distinction between error types
- No error messages
- No logging of failures (except in batch mode)
- Caller cannot determine why None returned

**Debugging:**
- Enable logging to see what little the library emits (suppressed exceptions are not logged outside batch mode)
- Check link_partitions.log for batch processing errors
- Add print statements to source code

### Rate Limiting

No rate limiting implemented.

**Risks:**
- MusicBrainz rate limits: 1 request/second recommended, not enforced
- Deezer rate limits: Unknown, not enforced
- YouTube Music rate limits: Unknown, not enforced

**Batch processing:** No delays between requests. High risk of rate limiting or IP bans.

**Recommendation:** Add manual delays in batch processing loops.
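
One way to add such delays is a small throttle object called before each lookup. This is a sketch under stated assumptions: `Throttle` is a hypothetical helper, the 1-second default simply mirrors the MusicBrainz recommendation quoted above, and the clock and sleep functions are injectable only so the behavior can be checked without real waiting.

```python
import time
from typing import Callable, Optional


class Throttle:
    """Keep successive calls at least `min_interval` seconds apart."""

    def __init__(
        self,
        min_interval: float = 1.0,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last: Optional[float] = None  # time of the previous call

    def wait(self) -> float:
        """Sleep if the previous call was too recent; return seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self.min_interval - (now - self._last))
            if delay > 0:
                self._sleep(delay)
        self._last = now + delay
        return delay
```

In a batch loop, calling `throttle.wait()` before creating each Align instance (or before each getter that triggers a query) would space requests out; a single shared interval is the simplest choice, though per-service throttles may fit better.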

### Caching

Results cached within Align instance lifetime. No cross-instance caching.

**Behavior:**
- First call to get_mbid() queries MusicBrainz
- Second call to get_mbid() returns cached value
- Creating new Align instance queries again

**No persistent cache:** No disk cache, no Redis, no memcached.

**Batch processing:** Each track creates new Align instance. No cache reuse across tracks.
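
A caller can work around the missing cross-instance cache by memoizing on the input fields. The sketch below is an assumption-laden illustration: `LinkCache` is hypothetical, the lookup is any callable taking keyword arguments, and all field values are assumed to be hashable.

```python
from typing import Any, Callable, Dict, Hashable, Tuple


class LinkCache:
    """Memoize lookup results across tracks, keyed by the input fields."""

    def __init__(self, lookup: Callable[..., Any]) -> None:
        self._lookup = lookup
        self._store: Dict[Tuple[Tuple[str, Hashable], ...], Any] = {}

    def get(self, **fields: Hashable) -> Any:
        # Sort by field name so keyword order does not change the key.
        key = tuple(sorted(fields.items()))
        if key not in self._store:
            self._store[key] = self._lookup(**fields)
        return self._store[key]
```

For example, `LinkCache(lambda **f: Align(**f).get_mbid())` would query once per distinct input. Note that None results are cached too, so a transient service failure would be remembered for the lifetime of the cache.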

### Thread Safety

Not thread-safe. No synchronization primitives.

**Unsafe operations:**
- Concurrent calls to same Align instance
- Concurrent batch processing of same directory

**Safe operations:**
- Multiple Align instances in separate threads (each queries independently)
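
The safe pattern can be sketched with a thread pool in which each task builds and uses its own linker. `link_all` and the `resolve` callable are hypothetical; the assumption is that `resolve` creates a fresh Align instance per track and never shares it between threads.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Dict, List


def link_all(
    tracks: List[Dict[str, Any]],
    resolve: Callable[[Dict[str, Any]], Any],
    max_workers: int = 4,
) -> List[Any]:
    """Resolve each track in its own task; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(resolve, tracks))
```

A `resolve` such as `lambda t: Align(**t).get_mbid()` would fit this shape. Running lookups in parallel raises the request rate, so the rate-limit risks described earlier apply even more strongly.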

### Authentication

**MusicBrainz:** No authentication. User-Agent header required ("elka/0.1" hardcoded).

**Deezer:** No authentication for search API.

**YouTube Music:** No authentication. Uses unofficial API.

**Spotify:** OAuth2 client credentials required. Configured in external mml_secrets.py file.

**Spotify usage:** Limited to ISRC extraction in Billboard dataset cleaning. Not used in main Align workflow.

### API Versioning

No API versioning. Library version 0.0.1 indicates pre-release.

**Breaking changes:** Possible in any release. No stability guarantees.

**Compatibility:** No backward compatibility promises.

### Dependencies for API Usage

Minimum dependencies for using Align class:
- musicbrainzngs
- deezer-python
- ytmusicapi
- requests

Optional dependencies:
- jams (for JAMS file support)
- pandas (for batch CSV output)
- spotipy (for Spotify integration)

### Performance Characteristics

**Query latency:**
- MusicBrainz: 100-500ms per query
- Deezer: 50-200ms per query
- YouTube Music: 100-300ms per query

**Total latency:** Sum of all service queries (sequential execution). Expect 250-1000ms per track.

**Batch processing:** Linear scaling. 1000 tracks = 1000x single track latency.
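
The arithmetic behind these estimates can be made explicit. The sketch below just sums the per-service ranges listed above and scales by the number of tracks; `batch_seconds` is a hypothetical helper, and it assumes exactly one query per service per track with no added delays.

```python
from typing import Tuple

# Per-query latency ranges in milliseconds, copied from the list above.
LATENCY_MS = {
    "musicbrainz": (100, 500),
    "deezer": (50, 200),
    "youtube_music": (100, 300),
}


def batch_seconds(n_tracks: int) -> Tuple[float, float]:
    """Return (best case, worst case) wall time in seconds for a batch."""
    low = sum(lo for lo, _ in LATENCY_MS.values())    # 250 ms per track
    high = sum(hi for _, hi in LATENCY_MS.values())   # 1000 ms per track
    return n_tracks * low / 1000, n_tracks * high / 1000
```

For 1000 tracks this gives 250 to 1000 seconds, roughly 4 to 17 minutes, before any rate-limit delays are added.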

### API Limitations

1. **No bulk queries:** Each track requires separate Align instance
2. **No async support:** Synchronous only
3. **No streaming results:** All-or-nothing queries
4. **No partial updates:** Can't update single field
5. **No validation:** No input validation, no output validation
6. **No error details:** Only None on failure
7. **Dead integrations:** AcousticBrainz non-functional
8. **Weak YouTube matching:** First result assumed correct

### API Strengths

1. **Simple interface:** Single class, clear getters
2. **Flexible input:** Works with identifiers or metadata
3. **Cascading fallback:** Graceful degradation
4. **Lazy evaluation:** Only query when needed
5. **JAMS support:** Academic standard format

### API Design Recommendations

For production use:

1. **Add exceptions:** Raise specific errors instead of returning None
2. **Add validation:** Validate input parameters
3. **Add async API:** Async versions of all getters
4. **Add bulk API:** Process multiple tracks in single call
5. **Add configuration:** Runtime configuration for thresholds
6. **Add logging:** Structured logging with correlation IDs
7. **Add rate limiting:** Respect API limits
8. **Remove dead code:** Delete AcousticBrainz methods
9. **Add documentation:** Docstrings for all public methods
10. **Add type hints:** Full type annotations

The API surface is clean and simple. The implementation needs hardening.
@@ -0,0 +1,441 @@
# MusicMetaLinker Architecture

## System Overview

MusicMetaLinker implements a service-oriented architecture for music metadata entity linking. The system coordinates queries across multiple external APIs, aggregates results, and presents a unified interface through a single orchestrator class.

Architecture pattern: Facade with cascading fallback strategy.

## Core Components

### Align Class (linking.py)

The Align class is the primary orchestrator and sole public interface. It encapsulates all service interactions and presents a clean getter-based API.

**Constructor signature:**
```python
Align(
    mbid_track=None,
    mbid_release=None,
    artist=None,
    album=None,
    track=None,
    track_number=None,
    duration=None,
    isrc=None,
    strict=False
)
```

**Responsibilities:**
- Initialize service-specific aligners based on available input
- Coordinate query execution across services
- Aggregate and normalize results
- Expose unified getter methods for all metadata fields

**Internal state:**
- Stores all input parameters
- Maintains references to service aligner instances
- Caches retrieved metadata to avoid redundant queries

The Align class doesn't implement service-specific logic. It delegates to specialized classes and functions.

### MusicBrainzAlign Class

Handles all MusicBrainz interactions. MusicBrainz is treated as the authoritative source when MBIDs are available.

**Key methods:**

**get_recording(mbid):** Retrieves full recording data by MBID. Returns artist, album, track name, duration, ISRCs, and related identifiers.

**get_best_match(artist, track, album, duration):** Searches MusicBrainz by metadata strings. Filters results by duration and fuzzy string matching. Returns the highest-scoring match.

**get_iswc():** Retrieves International Standard Musical Work Code if available.

**Search strategy:**
1. If MBID provided, direct lookup (most reliable)
2. If ISRC provided, search by ISRC
3. Fall back to metadata string search with filtering

MusicBrainz queries include related entities (artists, releases, ISRCs) in a single request to minimize API calls.

### DeezerAlign Class

Interfaces with Deezer's public API. Deezer provides commercial metadata with strong ISRC coverage.

**Key methods:**

**best_match(artist, track, album, duration, duration_threshold=3):** Searches Deezer and filters by duration. The duration_threshold parameter allows ±3 seconds variance by default.

**get_rank():** Returns Deezer's internal popularity rank for the track.

**Search strategy:**
1. If ISRC available, search by ISRC (most accurate)
2. Fall back to metadata string search
3. Filter results by duration (±3 seconds)
4. Apply fuzzy string matching to artist/track/album

Duration filtering is critical for Deezer because metadata searches often return multiple versions (radio edit, album version, remaster).
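
The duration filter described above can be sketched in a few lines. This is not the library's code: `filter_by_duration` is hypothetical, candidates are assumed to be dicts with a `duration` key in seconds, the bound is assumed to be inclusive, and skipping the filter when no target duration is known is also an assumption.

```python
from typing import Any, Dict, List, Optional


def filter_by_duration(
    candidates: List[Dict[str, Any]],
    target: Optional[float],
    threshold: float = 3.0,
) -> List[Dict[str, Any]]:
    """Keep candidates whose duration is within `threshold` seconds of target."""
    if target is None:
        # Assumption: without a reference duration nothing can be filtered.
        return list(candidates)
    return [
        c for c in candidates
        if c.get("duration") is not None
        and abs(c["duration"] - target) <= threshold
    ]
```

With a 431-second target, a 428-second album version survives while a 245-second radio edit is dropped, which is the disambiguation the paragraph above relies on.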

### YouTubeAlign Class

Queries YouTube Music via the unofficial ytmusicapi library.

**Key methods:**

**get_best_match(artist, track, album):** Searches YouTube Music with filter="songs". Returns the first result (no sophisticated ranking).

**get_youtube_id():** Extracts YouTube video ID from search results.

**Search strategy:**
- Constructs query string: "{artist} {track} {album}"
- Filters to songs only (excludes videos, albums)
- Returns first result

YouTube matching is the weakest link. No duration filtering (commented out in code). No fuzzy matching. First result is assumed correct.

### acousticbrainz_link Function

Standalone function (not a class) that checks if an MBID exists in AcousticBrainz.

**Implementation:**
```python
import requests


def acousticbrainz_link(mbid):
    url = f"https://acousticbrainz.org/{mbid}"
    response = requests.get(url)
    # Any non-200 status is treated as "not found"
    return url if response.status_code == 200 else None
```

Simple HTTP check. Returns URL if MBID exists, None otherwise.

**Critical issue:** AcousticBrainz shut down in 2022. This function always returns None. Dead code.

## Data Flow

### Initialization Flow

1. User creates Align instance with available metadata
2. Align constructor stores all input parameters
3. Service aligners are instantiated on-demand (lazy initialization)
4. No queries execute during construction

### Query Flow

1. User calls getter method (e.g., get_mbid())
2. Align checks if value already cached
3. If not cached, determines which service to query based on available input
4. Executes service-specific query
5. Caches result
6. Returns value to user

Queries are lazy and cached. Calling get_mbid() twice only queries MusicBrainz once.
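
The lazy-and-cached behavior of the query flow can be illustrated with a stripped-down stand-in. `LazyLinker` below is hypothetical and far simpler than Align; the query is passed in as a callable purely so the sketch stays self-contained.

```python
from typing import Callable, Dict, Optional


class LazyLinker:
    """Toy stand-in showing lazy, per-instance caching of one field."""

    def __init__(self, query_mbid: Callable[[], Optional[str]]) -> None:
        self._query_mbid = query_mbid  # stored only; nothing is queried here
        self._cache: Dict[str, Optional[str]] = {}

    def get_mbid(self) -> Optional[str]:
        # Membership test (not a None check) so a None result is cached too.
        if "mbid" not in self._cache:
            self._cache["mbid"] = self._query_mbid()
        return self._cache["mbid"]
```

Whether the real library caches a None result or retries on the next call is not stated in this document; the sketch caches it.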

### Cascading Fallback Strategy

Priority order for identifier resolution:

**For MBID:**
1. Use provided mbid_track if available
2. Query MusicBrainz by ISRC
3. Query MusicBrainz by metadata strings
4. Return None if all fail

**For ISRC:**
1. Use provided ISRC if available
2. Extract from MusicBrainz recording (if MBID available)
3. Query Deezer and extract ISRC from result
4. Return None if all fail

**For Deezer ID:**
1. Query Deezer by ISRC
2. Query Deezer by metadata strings
3. Return None if all fail

**For YouTube link:**
1. Query YouTube Music by metadata strings
2. Return None if no results

Each service is queried independently. No cross-service validation or conflict resolution.
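
Each of the priority lists above follows the same shape: try sources in order and stop at the first one that yields a value. A generic sketch of that shape, with the hypothetical helper `first_available` taking zero-argument callables so later sources are never invoked once an earlier one succeeds:

```python
from typing import Callable, Iterable, Optional, TypeVar

T = TypeVar("T")


def first_available(sources: Iterable[Callable[[], Optional[T]]]) -> Optional[T]:
    """Return the first non-None result, calling sources in priority order."""
    for source in sources:
        value = source()
        if value is not None:
            return value
    return None  # every source failed or had no data
```

For the MBID case the sources would be, in order, the provided value, the ISRC lookup, and the metadata search.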

## Supporting Components

### JAMSProcessor (preprocessor.py)

Handles reading and writing JAMS (JSON Annotated Music Specification) files.

**Responsibilities:**
- Parse JAMS JSON structure
- Extract metadata from file_metadata and sandbox sections
- Enrich JAMS files with new identifiers
- Write updated JAMS files

JAMS structure:
```json
{
  "file_metadata": {
    "title": "track name",
    "artist": "artist name",
    "release": "album name",
    "duration": 123.45,
    "identifiers": {
      "musicbrainz": "mbid-here"
    }
  },
  "sandbox": {
    "type": "genre",
    "genre": "rock",
    "track_number": 1,
    "release_year": 2020
  }
}
```

JAMSProcessor reads these fields, passes them to Align, and writes enriched identifiers back to the identifiers section.
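
The field mapping this implies can be sketched with the standard json module alone. This is not JAMSProcessor's code: `jams_to_align_kwargs` is hypothetical, and treating the `musicbrainz` identifier as a recording ID (hence `mbid_track`) is an assumption.

```python
import json
from typing import Any, Dict


def jams_to_align_kwargs(jams_text: str) -> Dict[str, Any]:
    """Map the JAMS fields shown above onto Align keyword arguments."""
    doc = json.loads(jams_text)
    meta = doc.get("file_metadata", {})
    sandbox = doc.get("sandbox", {})
    kwargs = {
        "artist": meta.get("artist"),
        "track": meta.get("title"),
        "album": meta.get("release"),
        "duration": meta.get("duration"),
        # Assumption: this identifier is a recording MBID.
        "mbid_track": meta.get("identifiers", {}).get("musicbrainz"),
        "track_number": sandbox.get("track_number"),
    }
    # Drop missing fields so Align falls back to its own defaults.
    return {k: v for k, v in kwargs.items() if v is not None}
```

The resulting dict could then be passed as `Align(**kwargs)`.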

### MBDownload (musicbrainz_dump.py)

Utility for bulk downloading MusicBrainz data.

**Purpose:** Pre-populate local datasets with MusicBrainz metadata to reduce API calls during batch processing.

**Implementation details:** Not fully specified in provided information. Likely queries MusicBrainz in batches and caches results locally.

### link_partitions.py

Batch processing script for directories of JAMS files.

**Workflow:**
1. Scan directory for JAMS files
2. For each file, extract metadata via JAMSProcessor
3. Create Align instance and query all services
4. Collect results in pandas DataFrame
5. Output CSV with all identifiers

**Command-line options:**
- `--save`: Write enriched JAMS files back to disk
- `--limit audio`: Only process audio files (skip non-audio JAMS)
- `--overwrite`: Overwrite existing enriched files

Includes progress bars via tqdm and logging to link_partitions.log.

### prepare_dataset.py

Dataset preparation utilities. Specific functionality not detailed in provided information. Likely includes:
- Data cleaning
- Format conversion
- Batch metadata enrichment

## Configuration Architecture

No configuration system. All settings hardcoded in source files.

**Hardcoded values:**
- MusicBrainz User-Agent: "elka/0.1"
- Deezer duration threshold: 3 seconds
- API endpoints: Direct URLs in code
- Spotify credentials: Imported from external mml_secrets.py

**Implications:**
- No runtime configuration
- No environment-specific settings
- Changing thresholds requires code modification
- No A/B testing of matching strategies

## Error Handling Architecture

Error handling is minimal and inconsistent.

**Pattern:**
```python
def get_value(service):  # schematic wrapper, not an actual library function
    try:
        result = service.query()
        return result
    except:  # bare except: every error is swallowed
        return None
```

All exceptions are caught and suppressed. Failed queries return None. No error logging, no exception propagation, no retry logic.

**Consequences:**
- Silent failures
- No visibility into what went wrong
- Difficult debugging
- No distinction between "not found" and "service error"
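
One possible alternative, shown here only as a sketch, is to return a small result object that keeps the exception instead of discarding it. `LookupResult` and `safe_query` are hypothetical names; nothing like them exists in the library as described.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class LookupResult:
    value: Optional[Any] = None
    error: Optional[Exception] = None

    @property
    def status(self) -> str:
        """'found', 'not_found', or 'service_error'."""
        if self.error is not None:
            return "service_error"
        return "found" if self.value is not None else "not_found"


def safe_query(query: Callable[[], Any]) -> LookupResult:
    """Run a query and record, rather than swallow, any failure."""
    try:
        return LookupResult(value=query())
    except Exception as exc:  # unlike a bare except, lets KeyboardInterrupt through
        return LookupResult(error=exc)
```

A caller can then retry on `service_error` while accepting `not_found` as final.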

## Logging Architecture

Uses Python's standard logging module.

**Batch processing:** File-based logging to link_partitions.log. Includes timestamps, log levels, and progress information.

**Library usage:** Console logging. Minimal output.

**Debug output:** Multiple print() statements scattered throughout code. Not controlled by logging configuration.

**Issues:**
- Debug prints in production code
- No structured logging
- No log levels for debug prints
- No correlation IDs for tracking requests across services

## Concurrency Model

Single-threaded, synchronous execution. No parallelization.

**Query execution:**
- Services queried sequentially
- No concurrent API calls
- No async/await
- No thread pools

**Implications:**
- Slow batch processing (network latency multiplied by number of tracks)
- Underutilized network bandwidth
- Simple debugging (no race conditions)

Batch processing could benefit significantly from parallel execution.

## Dependency Injection

No dependency injection. Service classes instantiated directly in Align constructor.

**Current pattern:**
```python
self.mb_align = MusicBrainzAlign(...)
self.deezer_align = DeezerAlign(...)
```

**Implications:**
- Difficult to mock services for testing
- Tight coupling between Align and service implementations
- No interface-based programming
- Hard to swap service implementations
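
For contrast, a constructor-injection variant might look like the sketch below. Every name in it (`RecordingSource`, `InjectedLinker`, `FakeSource`, `find_mbid`) is hypothetical and chosen for illustration; the point is only that a test can pass in a fake source without touching the network.

```python
from typing import Dict, Optional, Protocol, Tuple


class RecordingSource(Protocol):
    """Interface any recording lookup service would satisfy."""

    def find_mbid(self, artist: str, track: str) -> Optional[str]: ...


class InjectedLinker:
    """Receives its service instead of constructing it internally."""

    def __init__(self, source: RecordingSource, artist: str, track: str) -> None:
        self._source = source
        self._artist = artist
        self._track = track

    def get_mbid(self) -> Optional[str]:
        return self._source.find_mbid(self._artist, self._track)


class FakeSource:
    """In-memory stand-in for a real service, for use in tests."""

    def __init__(self, table: Dict[Tuple[str, str], str]) -> None:
        self._table = table

    def find_mbid(self, artist: str, track: str) -> Optional[str]:
        return self._table.get((artist, track))
```

Swapping `FakeSource` for a real MusicBrainz-backed class would require no change to `InjectedLinker`.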
|
||||
|
||||
## State Management
|
||||
|
||||
State is managed in Align instance variables.
|
||||
|
||||
**Cached values:**
|
||||
- All input parameters (artist, track, album, etc.)
|
||||
- Retrieved metadata (MBID, ISRC, Deezer ID, etc.)
|
||||
- Service aligner instances
|
||||
|
||||
**Cache invalidation:** None. Values cached for lifetime of Align instance.
|
||||
|
||||
**Thread safety:** Not thread-safe. No locks, no synchronization.
|
||||
|
||||
## Extension Points
|
||||
|
||||
Limited extensibility.
|
||||
|
||||
**Adding new services:**
|
||||
1. Create new service aligner class
|
||||
2. Instantiate in Align constructor
|
||||
3. Add getter methods to Align
|
||||
4. Update cascading fallback logic
|
||||
|
||||
No plugin system, no service registry, no abstract base classes.
|
||||
|
||||
**Modifying matching logic:**
|
||||
Requires editing service aligner classes directly. No strategy pattern, no configurable matchers.
|
||||
|
||||
## Testing Architecture
|
||||
|
||||
No test suite. No test directory. No test configuration.
|
||||
|
||||
**Testing approach:**
|
||||
- Manual testing via Jupyter notebooks (deezer_test.ipynb, queries.ipynb)
|
||||
- if __name__ == "__main__" blocks in some modules
|
||||
- No unit tests, no integration tests, no mocks
|
||||
|
||||
## Build and Packaging
|
||||
|
||||
Uses hatchling (PEP 517 build backend).
|
||||
|
||||
**pyproject.toml structure:**
|
||||
- Project metadata (name, version, authors)
|
||||
- Dependencies
|
||||
- Build system configuration
|
||||
|
||||
No setup.py. Modern Python packaging.
|
||||
|
||||
**Distribution:** GitHub only. Not published to PyPI.
|
||||
|
||||
## Deployment Architecture
|
||||
|
||||
Library deployment: pip install from GitHub.
|
||||
|
||||
Batch processing deployment: Clone repository, install dependencies, run Python scripts directly.
|
||||
|
||||
No Docker containers, no systemd services, no process managers.
|
||||
|
||||
## Performance Considerations
|
||||
|
||||
No performance optimization.
|
||||
|
||||
**Bottlenecks:**
|
||||
- Network latency (sequential API calls)
|
||||
- No caching across Align instances
|
||||
- No request batching
|
||||
- No connection pooling
|
||||
|
||||
**Memory usage:**
|
||||
- Minimal (only current track metadata in memory)
|
||||
- No large data structures
|
||||
- Pandas DataFrame for batch output (could be large for big datasets)
|
||||
|
||||
## Security Architecture
|
||||
|
||||
Minimal security considerations.
|
||||
|
||||
**API credentials:**
|
||||
- MusicBrainz: No authentication
|
||||
- Deezer: No authentication
|
||||
- YouTube Music: No authentication
|
||||
- Spotify: OAuth2 client credentials in external file
|
||||
|
||||
**Secrets management:**
|
||||
- Spotify credentials in mml_secrets.py (not in repository)
|
||||
- No encryption
|
||||
- No environment variables
|
||||
- No secrets vault
|
||||
|
||||
**Input validation:**
|
||||
- No validation of user input
|
||||
- No sanitization of metadata strings
|
||||
- Injection risk only if metadata is ever passed to shell commands or queries, which the current code does not appear to do
|
||||
|
||||
## Architectural Strengths
|
||||
|
||||
1. **Simple facade:** Single Align class hides complexity
|
||||
2. **Cascading fallback:** Graceful degradation when services fail
|
||||
3. **Lazy evaluation:** Only query services when needed
|
||||
4. **Service isolation:** Each service in separate class
|
||||
|
||||
## Architectural Weaknesses
|
||||
|
||||
1. **No abstraction:** Service classes have different interfaces
|
||||
2. **Tight coupling:** Align directly instantiates service classes
|
||||
3. **Swallowed errors:** Bare except clauses turn every failure into a silent None
|
||||
4. **No concurrency:** Sequential execution only
|
||||
5. **Hardcoded configuration:** No runtime flexibility
|
||||
6. **No tests:** Hard-to-test design (tight coupling forces tests to patch module internals)
|
||||
7. **Dead code:** AcousticBrainz integration non-functional
|
||||
8. **Inconsistent patterns:** Function for AcousticBrainz, classes for others
|
||||
|
||||
## Architectural Recommendations
|
||||
|
||||
For production use, consider:
|
||||
|
||||
1. **Define service interface:** Abstract base class for all aligners
|
||||
2. **Dependency injection:** Pass service instances to Align constructor
|
||||
3. **Configuration system:** External config for thresholds, endpoints, credentials
|
||||
4. **Error handling:** Explicit error types, logging, retry logic
|
||||
5. **Async execution:** Use asyncio for concurrent API calls
|
||||
6. **Caching layer:** Redis or in-memory cache for repeated queries
|
||||
7. **Remove dead code:** Delete AcousticBrainz integration
|
||||
8. **Add tests:** Unit tests with mocked services
|
||||
9. **Structured logging:** JSON logs with correlation IDs
|
||||
10. **Rate limiting:** Respect API rate limits with backoff
|
||||
|
||||
The core pattern (cascading fallback across services) is sound. The implementation needs significant hardening.
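As a rough illustration of recommendations 4 and 10 combined with the cascading fallback, the sketch below retries each service with exponential backoff before moving to the next one. The names `call_with_backoff` and `first_match` are hypothetical and are not part of MusicMetaLinker; each service is represented by a plain callable that returns an identifier or None.

```python
import time
from typing import Callable, Optional, Sequence


def call_with_backoff(
    query: Callable[[], Optional[str]],
    retries: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Call query(), retrying on ConnectionError with exponential backoff."""
    for attempt in range(retries):
        try:
            return query()
        except ConnectionError:
            if attempt == retries - 1:
                return None  # give up; the caller moves on to the next service
            sleep(base_delay * 2 ** attempt)  # 0.5 s, 1 s, 2 s, ...
    return None


def first_match(
    queries: Sequence[Callable[[], Optional[str]]],
    **backoff_options,
) -> Optional[str]:
    """Cascading fallback: return the first non-None result, in order."""
    for query in queries:
        result = call_with_backoff(query, **backoff_options)
        if result is not None:
            return result
    return None
```

Passing `sleep` as a parameter keeps the retry logic testable without real delays.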
|
||||
@@ -0,0 +1,807 @@
|
||||
# MusicMetaLinker Codebase Analysis
|
||||
|
||||
## Repository Structure
|
||||
|
||||
```
MusicMetaLinker/
├── musicmetalinker/
│   ├── __init__.py
│   ├── linking.py           # Core Align class and service aligners
│   ├── preprocessor.py      # JAMSProcessor for JAMS file handling
│   ├── musicbrainz_dump.py  # MusicBrainz bulk download utilities
│   └── utils.py             # Utility functions (likely)
├── link_partitions.py       # Batch processing CLI
├── prepare_dataset.py       # Dataset preparation scripts
├── deezer_test.ipynb        # Deezer integration testing notebook
├── queries.ipynb            # Query testing notebook
├── pyproject.toml           # Build configuration
├── README.md                # Project documentation
└── LICENSE                  # MIT license
```
|
||||
|
||||
**No tests directory.** No test files.
|
||||
|
||||
**No docs directory.** Documentation in README only.
|
||||
|
||||
**No examples directory.** Examples in notebooks only.
|
||||
|
||||
## Code Organization
|
||||
|
||||
### linking.py
|
||||
|
||||
**Primary module.** Contains all core functionality.
|
||||
|
||||
**Classes:**
|
||||
- **Align:** Main orchestrator class
|
||||
- **MusicBrainzAlign:** MusicBrainz service integration
|
||||
- **DeezerAlign:** Deezer service integration
|
||||
- **YouTubeAlign:** YouTube Music service integration
|
||||
|
||||
**Functions:**
|
||||
- **acousticbrainz_link(mbid):** AcousticBrainz URL checker (defunct)
|
||||
|
||||
**Estimated size:** 500-800 lines (based on typical structure).
|
||||
|
||||
**Responsibilities:**
|
||||
- Service coordination
|
||||
- Query execution
|
||||
- Result aggregation
|
||||
- Metadata normalization
|
||||
|
||||
**Code quality issues:**
|
||||
- Debug print() statements in production code
|
||||
- Commented-out code sections
|
||||
- Hardcoded configuration values
|
||||
- No docstrings (likely)
|
||||
- Inconsistent naming conventions
|
||||
|
||||
### preprocessor.py
|
||||
|
||||
**JAMS file handling.**
|
||||
|
||||
**Classes:**
|
||||
- **JAMSProcessor:** Read/write JAMS files, extract metadata, enrich with identifiers
|
||||
|
||||
**Responsibilities:**
|
||||
- Parse JAMS JSON structure
|
||||
- Extract file_metadata and sandbox fields
|
||||
- Inject new identifiers
|
||||
- Write enriched JAMS files
|
||||
|
||||
**Dependencies:**
|
||||
- jams library for JAMS format support
|
||||
- json for JSON parsing
|
||||
|
||||
### musicbrainz_dump.py
|
||||
|
||||
**Bulk MusicBrainz download utilities.**
|
||||
|
||||
**Classes:**
|
||||
- **MBDownload:** Batch download from MusicBrainz
|
||||
|
||||
**Purpose:** Pre-populate datasets with MusicBrainz metadata to reduce API calls.
|
||||
|
||||
**Implementation details:** Not fully specified. Likely includes:
|
||||
- Batch query logic
|
||||
- Rate limiting (hopefully)
|
||||
- Local caching
|
||||
- CSV or JSON output
|
||||
|
||||
### link_partitions.py
|
||||
|
||||
**Batch processing CLI script.**
|
||||
|
||||
**Functionality:**
|
||||
- Scan directory for JAMS files
|
||||
- Process each file with Align
|
||||
- Collect results in pandas DataFrame
|
||||
- Output CSV with all identifiers
|
||||
- Optionally write enriched JAMS files
|
||||
|
||||
**Command-line arguments:**
|
||||
- Positional: directory path
|
||||
- --save: Write enriched JAMS files
|
||||
- --limit audio: Only process audio files
|
||||
- --overwrite: Overwrite existing files
|
||||
|
||||
**Logging:** File-based to link_partitions.log.
|
||||
|
||||
**Progress tracking:** tqdm progress bars.
|
||||
|
||||
### prepare_dataset.py
|
||||
|
||||
**Dataset preparation utilities.**
|
||||
|
||||
**Functionality:** Not fully specified. Likely includes:
|
||||
- Data cleaning
|
||||
- Format conversion
|
||||
- Metadata normalization
|
||||
- Spotify ISRC extraction for Billboard dataset
|
||||
|
||||
**Spotify integration:** Uses spotipy with credentials from mml_secrets.py.
|
||||
|
||||
### Notebooks
|
||||
|
||||
**deezer_test.ipynb:** Interactive testing of Deezer integration.
|
||||
|
||||
**queries.ipynb:** Interactive testing of various query patterns.
|
||||
|
||||
**Purpose:** Manual testing and exploration. Not automated tests.
|
||||
|
||||
## Configuration Management
|
||||
|
||||
### Hardcoded Configuration
|
||||
|
||||
All configuration values hardcoded in source files.
|
||||
|
||||
**linking.py:**
|
||||
|
||||
```python
|
||||
# MusicBrainz User-Agent
|
||||
musicbrainzngs.set_useragent("elka", "0.1")
|
||||
|
||||
# Duration thresholds
|
||||
MUSICBRAINZ_DURATION_THRESHOLD = 5 # seconds
|
||||
DEEZER_DURATION_THRESHOLD = 3 # seconds
|
||||
|
||||
# Similarity threshold
|
||||
SIMILARITY_THRESHOLD = 0.8
|
||||
```
|
||||
|
||||
**Issues:**
|
||||
- No runtime configuration
|
||||
- Changing thresholds requires code modification
|
||||
- No environment-specific settings
|
||||
- "elka/0.1" User-Agent suggests code copied from another project
|
||||
|
||||
### External Configuration
|
||||
|
||||
**Only external config:** mml_secrets.py for Spotify credentials.
|
||||
|
||||
**Not in repository.** Users must create manually.
|
||||
|
||||
**Structure:**
|
||||
|
||||
```python
|
||||
SPOTIFY_CLIENT_ID = "..."
|
||||
SPOTIFY_CLIENT_SECRET = "..."
|
||||
```
|
||||
|
||||
**Import pattern:**
|
||||
|
||||
```python
try:
    from mml_secrets import SPOTIFY_CLIENT_ID, SPOTIFY_CLIENT_SECRET
except ImportError:
    SPOTIFY_CLIENT_ID = None
    SPOTIFY_CLIENT_SECRET = None
```
|
||||
|
||||
**Graceful degradation:** If mml_secrets.py missing, Spotify features disabled.
|
||||
|
||||
### Configuration Recommendations
|
||||
|
||||
1. **Use environment variables:**
|
||||
|
||||
```python
|
||||
import os
|
||||
|
||||
SPOTIFY_CLIENT_ID = os.getenv("SPOTIFY_CLIENT_ID")
|
||||
MUSICBRAINZ_USER_AGENT = os.getenv("MUSICBRAINZ_USER_AGENT", "MusicMetaLinker/0.0.1")
|
||||
DEEZER_DURATION_THRESHOLD = int(os.getenv("DEEZER_DURATION_THRESHOLD", "3"))
|
||||
```
|
||||
|
||||
2. **Add config file support:**
|
||||
|
||||
```python
|
||||
import configparser
|
||||
|
||||
config = configparser.ConfigParser()
|
||||
config.read("musicmetalinker.ini")
|
||||
|
||||
DEEZER_DURATION_THRESHOLD = config.getint("matching", "deezer_duration_threshold", fallback=3)
|
||||
```
|
||||
|
||||
3. **Add runtime configuration:**
|
||||
|
||||
```python
|
||||
linker = Align(
|
||||
artist="...",
|
||||
track="...",
|
||||
config={
|
||||
"deezer_duration_threshold": 5,
|
||||
"similarity_threshold": 0.9
|
||||
}
|
||||
)
|
||||
```
|
||||
|
||||
## Logging Architecture
|
||||
|
||||
### Logging Implementation
|
||||
|
||||
**Library:** Python standard logging module.
|
||||
|
||||
**Configuration:**
|
||||
|
||||
```python
|
||||
import logging
|
||||
|
||||
logging.basicConfig(
|
||||
level=logging.INFO,
|
||||
format='%(asctime)s - %(levelname)s - %(message)s'
|
||||
)
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
```
|
||||
|
||||
**Log levels used:**
|
||||
- INFO: Normal operation (file processing, successful queries)
|
||||
- ERROR: Failed queries, network errors
|
||||
|
||||
**Not used:**
|
||||
- DEBUG: No debug-level logging
|
||||
- WARNING: No warnings
|
||||
- CRITICAL: No critical errors
|
||||
|
||||
### Logging Locations
|
||||
|
||||
**Batch processing:** File-based logging to link_partitions.log.
|
||||
|
||||
```python
|
||||
file_handler = logging.FileHandler('link_partitions.log')
|
||||
logger.addHandler(file_handler)
|
||||
```
|
||||
|
||||
**Library usage:** Console logging.
|
||||
|
||||
```python
|
||||
console_handler = logging.StreamHandler()
|
||||
logger.addHandler(console_handler)
|
||||
```
|
||||
|
||||
### Debug Output Issues
|
||||
|
||||
**Multiple print() statements in production code:**
|
||||
|
||||
```python
|
||||
print(f"Querying MusicBrainz for {artist} - {track}")
|
||||
print(f"Found MBID: {mbid}")
|
||||
print(f"Deezer search returned {len(results)} results")
|
||||
```
|
||||
|
||||
**Problems:**
|
||||
- Not controlled by logging configuration
|
||||
- Can't disable without code changes
|
||||
- No log levels
|
||||
- No timestamps
|
||||
- Mixes with actual output
|
||||
|
||||
**Recommendation:** Replace all print() with logger.debug().
|
||||
|
||||
### Logging Recommendations
|
||||
|
||||
1. **Remove print() statements:**
|
||||
|
||||
```python
|
||||
# Before
|
||||
print(f"Querying MusicBrainz for {artist} - {track}")
|
||||
|
||||
# After
|
||||
logger.debug(f"Querying MusicBrainz for {artist} - {track}")
|
||||
```
|
||||
|
||||
2. **Add structured logging:**
|
||||
|
||||
```python
|
||||
import structlog
|
||||
|
||||
logger = structlog.get_logger()
|
||||
logger.info("musicbrainz_query", artist=artist, track=track, mbid=mbid)
|
||||
```
|
||||
|
||||
3. **Add correlation IDs:**
|
||||
|
||||
```python
|
||||
import uuid
|
||||
|
||||
correlation_id = str(uuid.uuid4())
|
||||
logger.info("query_started", correlation_id=correlation_id, artist=artist)
|
||||
# ... queries ...
|
||||
logger.info("query_completed", correlation_id=correlation_id, mbid=mbid)
|
||||
```
|
||||
|
||||
4. **Add log levels:**
|
||||
|
||||
```python
|
||||
logger.debug("Attempting MusicBrainz query")
|
||||
logger.info("Successfully retrieved MBID")
|
||||
logger.warning("Deezer query returned no results, falling back to YouTube")
|
||||
logger.error("All services failed", exc_info=True)
|
||||
```
|
||||
|
||||
## Code Quality
|
||||
|
||||
### Code Smells
|
||||
|
||||
**Debug prints in production:**
|
||||
|
||||
```python
|
||||
print("DEBUG: entering get_mbid()")
|
||||
print(f"DEBUG: mbid_track = {self.mbid_track}")
|
||||
```
|
||||
|
||||
**Commented-out code:**
|
||||
|
||||
```python
|
||||
# if duration:
|
||||
# matches = [r for r in results if abs(r['duration_seconds'] - duration) < 10]
|
||||
```
|
||||
|
||||
**Hardcoded values:**
|
||||
|
||||
```python
|
||||
musicbrainzngs.set_useragent("elka", "0.1") # Should be "MusicMetaLinker/0.0.1"
|
||||
```
|
||||
|
||||
**Inconsistent naming:**
|
||||
|
||||
```python
|
||||
mbid_track # snake_case
|
||||
mbidTrack # camelCase (in some places)
|
||||
MBID # UPPER_CASE
|
||||
```
|
||||
|
||||
**No docstrings:**
|
||||
|
||||
```python
def get_mbid(self):
    # No docstring explaining what this returns or when it returns None
    ...
```
|
||||
|
||||
**Broad exception catching:**
|
||||
|
||||
```python
try:
    result = service.query()
except:  # Catches everything, including KeyboardInterrupt
    return None
```
|
||||
|
||||
### Code Quality Metrics
|
||||
|
||||
**Estimated metrics (without actual analysis):**
|
||||
|
||||
- **Lines of code:** ~1500-2000
|
||||
- **Cyclomatic complexity:** Moderate (nested conditionals in matching logic)
|
||||
- **Code duplication:** Moderate (similar patterns across service aligners)
|
||||
- **Test coverage:** 0% (no tests)
|
||||
- **Documentation coverage:** Low (minimal docstrings)
|
||||
|
||||
### Linting Issues
|
||||
|
||||
**No linting configuration.** Running pylint or flake8 would likely find:
|
||||
|
||||
- Unused imports
|
||||
- Unused variables
|
||||
- Line too long (>79 characters)
|
||||
- Missing docstrings
|
||||
- Bare except clauses
|
||||
- Inconsistent naming
|
||||
- Wildcard imports (if any)
|
||||
|
||||
### Type Hints
|
||||
|
||||
**Minimal type hints.** Likely no type annotations on most functions.
|
||||
|
||||
**Example of missing type hints:**
|
||||
|
||||
```python
from typing import Optional

# Current (no type hints)
def get_mbid(self):
    ...

# With type hints
def get_mbid(self) -> Optional[str]:
    ...
```
|
||||
|
||||
**Benefits of adding type hints:**
|
||||
- Static type checking with mypy
|
||||
- Better IDE autocomplete
|
||||
- Self-documenting code
|
||||
- Catch type errors before runtime
|
||||
|
||||
## Testing
|
||||
|
||||
### Test Coverage
|
||||
|
||||
**No automated tests.** No test directory, no test files.
|
||||
|
||||
**Testing approach:**
|
||||
- Manual testing via Jupyter notebooks
|
||||
- if __name__ == "__main__" blocks in some modules
|
||||
|
||||
**Example if __name__ == "__main__" block:**
|
||||
|
||||
```python
if __name__ == "__main__":
    linker = Align(artist="The Beatles", track="Hey Jude")
    print(linker.get_mbid())
    print(linker.get_isrc())
```
|
||||
|
||||
**Not real tests:** No assertions, no test framework, no automation.
|
||||
|
||||
### Testing Recommendations
|
||||
|
||||
**Unit tests with mocked services:**
|
||||
|
||||
```python
from unittest.mock import patch

from musicmetalinker.linking import Align


def test_get_mbid_with_provided_mbid():
    linker = Align(mbid_track="test-mbid")
    assert linker.get_mbid() == "test-mbid"


@patch('musicmetalinker.linking.musicbrainzngs')
def test_get_mbid_queries_musicbrainz(mock_mb):
    # Stub the search call so no network request is made
    mock_mb.search_recordings.return_value = {
        'recording-list': [{'id': 'found-mbid'}]
    }

    linker = Align(artist="Test Artist", track="Test Track")
    mbid = linker.get_mbid()

    assert mbid == "found-mbid"
    mock_mb.search_recordings.assert_called_once()
```
|
||||
|
||||
**Integration tests:**
|
||||
|
||||
```python
import pytest

from musicmetalinker.linking import Align


@pytest.mark.integration
def test_real_musicbrainz_query():
    linker = Align(artist="The Beatles", track="Hey Jude")
    mbid = linker.get_mbid()

    assert mbid is not None
    assert len(mbid) == 36  # UUID length
```
|
||||
|
||||
**Test coverage goals:**
|
||||
- Unit tests: 80%+ coverage
|
||||
- Integration tests: Critical paths
|
||||
- Mock all external API calls in unit tests
|
||||
- Real API calls only in integration tests (marked with @pytest.mark.integration)
|
||||
|
||||
## Error Handling
|
||||
|
||||
### Current Error Handling
|
||||
|
||||
**Pattern throughout codebase:**
|
||||
|
||||
```python
try:
    result = service.query()
    return result
except:
    return None
```
|
||||
|
||||
**Issues:**
|
||||
- Catches all exceptions (including KeyboardInterrupt, SystemExit)
|
||||
- No error logging
|
||||
- No distinction between error types
|
||||
- Silent failures
|
||||
|
||||
### Error Handling Recommendations
|
||||
|
||||
**Specific exception handling:**
|
||||
|
||||
```python
import logging

import requests

logger = logging.getLogger(__name__)

try:
    result = service.query()
    return result
except requests.exceptions.Timeout:
    # Standard logging takes %-style arguments, not arbitrary keyword fields
    logger.warning("Service timeout: service=%s", "musicbrainz")
    return None
except requests.exceptions.ConnectionError:
    logger.error("Service unavailable: service=%s", "musicbrainz")
    return None
except Exception:
    # logger.exception logs at ERROR level and appends the traceback
    logger.exception("Unexpected error: service=%s", "musicbrainz")
    return None
```
|
||||
|
||||
**Custom exceptions:**
|
||||
|
||||
```python
class MusicMetaLinkerError(Exception):
    pass


class ServiceUnavailableError(MusicMetaLinkerError):
    pass


class InvalidInputError(MusicMetaLinkerError):
    pass


class NoMatchFoundError(MusicMetaLinkerError):
    pass
```
|
||||
|
||||
**Explicit error returns:**
|
||||
|
||||
```python
from typing import Union


def get_mbid(self) -> Union[str, None, MusicMetaLinkerError]:
    try:
        ...
    except ServiceUnavailableError as e:
        return e  # Return error instead of None
```
|
||||
|
||||
## Performance Considerations
|
||||
|
||||
### Performance Bottlenecks
|
||||
|
||||
**Network latency:** Sequential API calls. Total latency = sum of all service latencies.
|
||||
|
||||
**No caching:** Repeated queries for same track.
|
||||
|
||||
**No connection pooling:** New connection for each request.
|
||||
|
||||
**No request batching:** One request per track.
|
||||
|
||||
### Performance Optimization Opportunities
|
||||
|
||||
**1. Async/await for concurrent queries:**
|
||||
|
||||
```python
import asyncio

import aiohttp


async def get_all_metadata(self):
    # The *_async methods are hypothetical coroutine versions of the getters
    tasks = [
        self.get_mbid_async(),
        self.get_deezer_id_async(),
        self.get_youtube_link_async()
    ]
    results = await asyncio.gather(*tasks)
    return results
```
|
||||
|
||||
**2. Persistent cache:**
|
||||
|
||||
```python
import redis

cache = redis.Redis()


def get_mbid(self):
    cache_key = f"mbid:{self.artist}:{self.track}"
    cached = cache.get(cache_key)
    if cached:
        return cached.decode()

    mbid = self._query_mbid()
    if mbid is not None:  # Redis cannot store None, so misses are not cached
        cache.setex(cache_key, 86400, mbid)  # 24 hour TTL
    return mbid
```
|
||||
|
||||
**3. Connection pooling:**
|
||||
|
||||
```python
|
||||
import requests
|
||||
from requests.adapters import HTTPAdapter
|
||||
from urllib3.util.retry import Retry
|
||||
|
||||
session = requests.Session()
|
||||
retry = Retry(total=3, backoff_factor=0.3)
|
||||
adapter = HTTPAdapter(max_retries=retry, pool_connections=10, pool_maxsize=20)
|
||||
session.mount('http://', adapter)
|
||||
session.mount('https://', adapter)
|
||||
```
|
||||
|
||||
**4. Batch processing parallelization:**
|
||||
|
||||
```python
from multiprocessing import Pool


def process_track(jams_file):
    processor = JAMSProcessor(jams_file)
    metadata = processor.extract_metadata()
    linker = Align(**metadata)
    return linker.get_all_metadata()


# The guard is required on platforms that spawn worker processes (Windows, macOS)
if __name__ == "__main__":
    with Pool(processes=4) as pool:
        results = pool.map(process_track, jams_files)
```
|
||||
|
||||
## Code Maintainability
|
||||
|
||||
### Maintainability Issues
|
||||
|
||||
**Tight coupling:** Align class directly instantiates service classes. Hard to mock for testing.
|
||||
|
||||
**No abstraction:** Service classes have different interfaces. No common base class.
|
||||
|
||||
**Hardcoded configuration:** Changing thresholds requires code modification.
|
||||
|
||||
**No documentation:** Minimal docstrings, no API documentation.
|
||||
|
||||
**Dead code:** AcousticBrainz integration non-functional.
|
||||
|
||||
**Inconsistent patterns:** Function for AcousticBrainz, classes for other services.
|
||||
|
||||
### Maintainability Recommendations
|
||||
|
||||
**1. Define service interface:**
|
||||
|
||||
```python
from abc import ABC, abstractmethod
from typing import Optional


class ServiceAligner(ABC):
    @abstractmethod
    def search_by_isrc(self, isrc: str) -> Optional[dict]:
        pass

    @abstractmethod
    def search_by_metadata(self, artist: str, track: str, album: str) -> Optional[dict]:
        pass
```
|
||||
|
||||
**2. Dependency injection:**
|
||||
|
||||
```python
from typing import List


class Align:
    def __init__(self, services: List[ServiceAligner], **metadata):
        self.services = services
        self.metadata = metadata
```
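To show why injection helps with the testability problem noted above, here is a minimal self-contained sketch. `FakeAligner` and `InjectedAlign` are hypothetical names, the interface is reduced to a single method, and the response shape (a dict with an `"mbid"` key) is an assumption made for illustration only.

```python
from abc import ABC, abstractmethod
from typing import List, Optional, Tuple


class ServiceAligner(ABC):
    """Reduced, illustrative version of the service interface."""

    @abstractmethod
    def search_by_metadata(self, artist: str, track: str) -> Optional[dict]:
        ...


class FakeAligner(ServiceAligner):
    """Test double: returns a canned response and records every call."""

    def __init__(self, response: Optional[dict]):
        self.response = response
        self.calls: List[Tuple[str, str]] = []

    def search_by_metadata(self, artist: str, track: str) -> Optional[dict]:
        self.calls.append((artist, track))
        return self.response


class InjectedAlign:
    """Hypothetical Align variant that receives its services from the caller."""

    def __init__(self, services: List[ServiceAligner], artist: str, track: str):
        self.services = services
        self.artist = artist
        self.track = track

    def get_mbid(self) -> Optional[str]:
        # Cascade through the injected services in order
        for service in self.services:
            hit = service.search_by_metadata(self.artist, self.track)
            if hit and hit.get("mbid"):
                return hit["mbid"]
        return None
```

With this shape a unit test needs no patching: it builds `InjectedAlign([FakeAligner(None), FakeAligner({"mbid": "x"})], ...)` and asserts on the result and on the recorded calls.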
|
||||
|
||||
**3. Add docstrings:**
|
||||
|
||||
```python
def get_mbid(self) -> Optional[str]:
    """
    Retrieve MusicBrainz recording ID.

    Queries MusicBrainz by MBID (if provided), ISRC, or metadata.
    Returns None if no match found or service unavailable.

    Returns:
        MusicBrainz recording ID (UUID format) or None
    """
    ...
```
|
||||
|
||||
**4. Remove dead code:**
|
||||
|
||||
Delete acousticbrainz_link() function and all references.
|
||||
|
||||
**5. Add configuration class:**
|
||||
|
||||
```python
from dataclasses import dataclass


@dataclass
class MatchingConfig:
    deezer_duration_threshold: int = 3
    musicbrainz_duration_threshold: int = 5
    similarity_threshold: float = 0.8
    user_agent: str = "MusicMetaLinker/0.0.1"
```
|
||||
|
||||
## Security Considerations
|
||||
|
||||
### Security Issues
|
||||
|
||||
**Plaintext credentials:** Spotify credentials in mml_secrets.py (not encrypted).
|
||||
|
||||
**No input validation:** Metadata strings not sanitized.
|
||||
|
||||
**Broad exception catching:** May hide security-relevant errors.
|
||||
|
||||
**No dependency scanning:** Vulnerable dependencies unknown.
|
||||
|
||||
### Security Recommendations
|
||||
|
||||
**1. Encrypt credentials:**
|
||||
|
||||
```python
import os

from cryptography.fernet import Fernet

# Must hold a valid Fernet key, for example one produced by Fernet.generate_key()
key = os.getenv("ENCRYPTION_KEY")
cipher = Fernet(key)

encrypted_secret = cipher.encrypt(SPOTIFY_CLIENT_SECRET.encode())
```
|
||||
|
||||
**2. Input validation:**
|
||||
|
||||
```python
import re


def validate_mbid(mbid: str) -> bool:
    uuid_pattern = r'[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}'
    # fullmatch rejects a trailing newline, which match() with '$' would accept
    return bool(re.fullmatch(uuid_pattern, mbid, re.IGNORECASE))


def validate_isrc(isrc: str) -> bool:
    # Country code (2 letters), registrant (3 alphanumerics), year + designation (7 digits)
    isrc_pattern = r'[A-Z]{2}[A-Z0-9]{3}[0-9]{7}'
    return bool(re.fullmatch(isrc_pattern, isrc))
```
|
||||
|
||||
**3. Dependency scanning:**
|
||||
|
||||
```bash
|
||||
pip install pip-audit
|
||||
pip-audit
|
||||
```
|
||||
|
||||
**4. Identifying request headers for API calls:**
|
||||
|
||||
```python
import uuid

import requests

headers = {
    'User-Agent': 'MusicMetaLinker/0.0.1',
    'X-Request-ID': str(uuid.uuid4())
}
# An explicit timeout keeps a stalled service from blocking the caller indefinitely
response = requests.get(url, headers=headers, timeout=10)
```
|
||||
|
||||
## Code Recommendations Summary
|
||||
|
||||
### Immediate Fixes
|
||||
|
||||
1. Remove all print() statements, replace with logger.debug()
|
||||
2. Remove commented-out code
|
||||
3. Fix User-Agent: "elka/0.1" → "MusicMetaLinker/0.0.1"
|
||||
4. Remove AcousticBrainz integration
|
||||
5. Add docstrings to all public methods
|
||||
|
||||
### Short-Term Improvements
|
||||
|
||||
1. Add type hints throughout codebase
|
||||
2. Add unit tests with mocked services
|
||||
3. Add linting (pylint, flake8)
|
||||
4. Add formatting (black, isort)
|
||||
5. Add specific exception handling
|
||||
6. Add input validation
|
||||
7. Add configuration system
|
||||
|
||||
### Long-Term Enhancements
|
||||
|
||||
1. Refactor to use service interface abstraction
|
||||
2. Add dependency injection
|
||||
3. Add async/await for concurrent queries
|
||||
4. Add persistent caching
|
||||
5. Add connection pooling
|
||||
6. Add structured logging
|
||||
7. Add monitoring and metrics
|
||||
8. Add comprehensive documentation
|
||||
9. Add integration tests
|
||||
10. Add CI/CD pipeline
|
||||
|
||||
## Codebase Maturity Assessment
|
||||
|
||||
**Current state:** Research prototype. Pre-release quality.
|
||||
|
||||
**Maturity level:** 2/5
|
||||
|
||||
**Strengths:**
|
||||
- Clear separation of concerns (service classes)
|
||||
- Simple, understandable structure
|
||||
- Functional for research use
|
||||
|
||||
**Weaknesses:**
|
||||
- No tests
|
||||
- Debug code in production
|
||||
- Hardcoded configuration
|
||||
- Dead code
|
||||
- No documentation
|
||||
- No error handling
|
||||
- No input validation
|
||||
|
||||
**Recommendation:** Suitable for academic exploration. Requires significant refactoring for production use.
|
||||
@@ -0,0 +1,501 @@
|
||||
# MusicMetaLinker Data Architecture
|
||||
|
||||
## Data Storage Model
|
||||
|
||||
MusicMetaLinker has no persistent data storage. All data is in-memory during execution.
|
||||
|
||||
**No database:** No SQL, no NoSQL, no embedded databases.
|
||||
|
||||
**No file-based persistence:** No local cache files, no serialized objects (except JAMS output).
|
||||
|
||||
**Stateless operation:** Each Align instance is independent. No shared state across instances.
|
||||
|
||||
## Input Data Formats
|
||||
|
||||
### Python Objects
|
||||
|
||||
Primary input method: Constructor parameters to Align class.
|
||||
|
||||
**Supported data types:**
|
||||
|
||||
```python
{
    "mbid_track": str,        # UUID format
    "mbid_release": str,      # UUID format
    "artist": str,            # Free text
    "album": str,             # Free text
    "track": str,             # Free text
    "track_number": int,      # Positive integer
    "duration": int | float,  # Seconds
    "isrc": str,              # ISRC format (no validation)
    "strict": bool            # Matching mode
}
```
|
||||
|
||||
**No validation:** Input accepted as-is. Invalid data causes silent failures (returns None).
|
||||
|
||||
**No normalization:** Artist names, track titles used exactly as provided. No case normalization, no whitespace trimming, no Unicode normalization.
|
||||
|
||||
### JAMS Files
|
||||
|
||||
JAMS (JSON Annotated Music Specification) is the standard input format for batch processing.
|
||||
|
||||
**JAMS structure:**
|
||||
|
||||
```json
{
  "file_metadata": {
    "title": "Track Name",
    "artist": "Artist Name",
    "release": "Album Name",
    "duration": 123.45,
    "identifiers": {
      "musicbrainz": "mbid-uuid-here",
      "isrc": "GBAYE9200070"
    }
  },
  "sandbox": {
    "type": "music_type",
    "genre": "rock",
    "track_number": 1,
    "release_year": 2020
  },
  "annotations": []
}
```
|
||||
|
||||
**Key sections:**
|
||||
|
||||
**file_metadata:** Core track metadata. Required fields: title, artist. Optional: release, duration, identifiers.
|
||||
|
||||
**sandbox:** Additional metadata. Free-form structure. Common fields: type, genre, track_number, release_year.
|
||||
|
||||
**annotations:** Music information retrieval annotations (not used by MusicMetaLinker).
|
||||
|
||||
**Parsing logic:**
|
||||
|
||||
JAMSProcessor extracts:
|
||||
- title → track
|
||||
- artist → artist
|
||||
- release → album
|
||||
- duration → duration
|
||||
- identifiers.musicbrainz → mbid_track
|
||||
- identifiers.isrc → isrc
|
||||
- sandbox.track_number → track_number
|
||||
|
||||
**Missing fields:** Treated as None. No errors raised.
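The field mapping above can be sketched as a small function. This is an illustration only: it reads the JSON with the standard `json` module rather than the `jams` library, and `jams_to_align_kwargs` is a hypothetical name, not the actual JAMSProcessor API.

```python
import json
from typing import Any, Dict


def jams_to_align_kwargs(jams_text: str) -> Dict[str, Any]:
    """Map JAMS fields to Align constructor parameters; missing fields become None."""
    doc = json.loads(jams_text)
    meta = doc.get("file_metadata") or {}
    identifiers = meta.get("identifiers") or {}
    sandbox = doc.get("sandbox") or {}
    return {
        "track": meta.get("title"),
        "artist": meta.get("artist"),
        "album": meta.get("release"),
        "duration": meta.get("duration"),
        "mbid_track": identifiers.get("musicbrainz"),
        "isrc": identifiers.get("isrc"),
        "track_number": sandbox.get("track_number"),
    }
```

The resulting dict could then be passed on as `Align(**kwargs)`, since the keys match the constructor parameters documented in the API reference.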
|
||||
|
||||
### CSV Input
|
||||
|
||||
No direct CSV input support. Batch processing outputs CSV but doesn't read it.
|
||||
|
||||
For CSV input, users must:
|
||||
1. Parse CSV manually
|
||||
2. Create Align instances per row
|
||||
3. Collect results
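The three manual steps could look roughly like the sketch below. The column names (`artist`, `track`, `duration`) and the `link_rows` helper are assumptions for illustration; `lookup` stands in for something like `lambda **kw: Align(**kw).get_mbid()`, so that the sketch runs without network access.

```python
import csv
import io
from typing import Callable, Dict, Iterable, List, Optional


def link_rows(
    rows: Iterable[Dict[str, str]],
    lookup: Callable[..., Optional[str]],
) -> List[Dict[str, Optional[str]]]:
    """Run one lookup per CSV row and attach the result as a 'musicbrainz' column."""
    results = []
    for row in rows:
        duration = row.get("duration")
        mbid = lookup(
            artist=row.get("artist") or None,  # empty cells become None
            track=row.get("track") or None,
            duration=float(duration) if duration else None,
        )
        results.append({**row, "musicbrainz": mbid})
    return results


# Usage with an in-memory CSV and a stand-in lookup
sample = "artist,track,duration\nThe Beatles,Hey Jude,431\n"
linked = link_rows(csv.DictReader(io.StringIO(sample)), lambda **kw: None)
```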
|
||||
|
||||
## Output Data Formats
|
||||
|
||||
### Python Objects
|
||||
|
||||
Align instance acts as data container. Getters return individual fields.
|
||||
|
||||
**No structured output method:** No to_dict(), no to_json(), no serialize().
|
||||
|
||||
**Manual aggregation required:**
|
||||
|
||||
```python
linker = Align(...)
result = {
    "artist": linker.get_artist(),
    "track": linker.get_track(),
    "mbid": linker.get_mbid(),
    "isrc": linker.get_isrc(),
    "deezer_id": linker.get_deezer_id(),
    # ... etc
}
```
|
||||
|
||||
### JAMS Files
|
||||
|
||||
Enriched JAMS files with added identifiers.
|
||||
|
||||
**Enrichment process:**
|
||||
|
||||
1. Read original JAMS file
|
||||
2. Extract metadata
|
||||
3. Create Align instance
|
||||
4. Query all services
|
||||
5. Add identifiers to file_metadata.identifiers section
|
||||
6. Write enriched JAMS file
|
||||
|
||||
**Added identifiers:**
|
||||
|
||||
```json
{
  "file_metadata": {
    "identifiers": {
      "musicbrainz": "mbid-from-query",
      "isrc": "isrc-from-query",
      "deezer": "deezer-id-from-query",
      "youtube": "youtube-url-from-query",
      "acousticbrainz": null
    }
  }
}
```
|
||||
|
||||
**Preservation:** Original JAMS structure preserved. Only identifiers section modified.
|
||||
|
||||
**Overwrite behavior:** Controlled by --overwrite flag. Without flag, existing identifiers preserved.
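The overwrite rule can be pictured as a merge of two identifier dicts. The sketch below is an assumption about the intended behavior, not the script's actual code: `merge_identifiers` is a hypothetical helper, and the choice that a failed lookup (None) never replaces a known value is mine.

```python
from typing import Dict, Optional

Identifiers = Dict[str, Optional[str]]


def merge_identifiers(existing: Identifiers, found: Identifiers, overwrite: bool = False) -> Identifiers:
    """Merge newly found identifiers into the existing ones without mutating either."""
    merged = dict(existing)
    for name, value in found.items():
        if value is None:
            # Record the key as null, but never erase a known identifier
            merged.setdefault(name, None)
            continue
        if overwrite or merged.get(name) is None:
            merged[name] = value
    return merged
```

Without `overwrite`, an existing MBID survives even when the query returns a different one; with it, the queried value wins.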
|
||||
|
||||
### CSV Output
|
||||
|
||||
Batch processing generates CSV with all metadata and identifiers.
|
||||
|
||||
**CSV schema:**
|
||||
|
||||
| Column | Type | Description |
|--------|------|-------------|
| jams_file | str | Original JAMS filename |
| track_name | str | Track title |
| artist_name | str | Artist name |
| album_name | str | Album/release name |
| track_number | int | Track position |
| duration | float | Duration in seconds |
| release_year | int | Release year |
| musicbrainz | str | MBID (UUID) |
| isrc | str | ISRC code |
| deezer_id | int | Deezer track ID |
| deezer_url | str | Full Deezer URL |
| youtube_url | str | Full YouTube URL |
| acousticbrainz | str | AcousticBrainz URL (always null) |
| spotify_id | str | Spotify ID (if available) |
|
||||
|
||||
**Missing values:** Empty cells or "None" string (inconsistent).
|
||||
|
||||
**Encoding:** UTF-8. No BOM.
|
||||
|
||||
**Delimiter:** Comma. No escaping issues documented.
|
||||
|
||||
**Headers:** First row contains column names.
|
||||
|
||||
**Output location:** Same directory as input JAMS files, named based on directory name.
|
||||
|
||||
## Data Transformation Pipeline
|
||||
|
||||
### Input Transformation
|
||||
|
||||
1. **JAMS parsing:** JSON deserialization via jams library
|
||||
2. **Field extraction:** Map JAMS fields to Align parameters
|
||||
3. **Type conversion:** String to int for track_number, string to float for duration
|
||||
4. **Null handling:** Missing fields become None
|
||||
|
||||
### Query Transformation
|
||||
|
||||
1. **Metadata normalization:** None (passed as-is to services)
|
||||
2. **Duration conversion:** MusicBrainz milliseconds → seconds
|
||||
3. **ID extraction:** Parse service-specific response formats
|
||||
4. **URL construction:** Build full URLs from IDs
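Steps 2 and 4 amount to small pure functions. The sketch below is illustrative: the helper names are hypothetical, the assumption that the MusicBrainz length arrives as a string or integer of milliseconds should be checked against the actual responses, and the URL shapes are the commonly seen public page formats rather than anything confirmed from the source code.

```python
from typing import Optional, Union


def mb_length_to_seconds(length_ms: Union[str, int, None]) -> Optional[float]:
    """Convert a MusicBrainz recording length in milliseconds to seconds."""
    if length_ms is None:
        return None
    return int(length_ms) / 1000.0


def deezer_track_url(track_id: int) -> str:
    # Assumed URL shape for a Deezer track page
    return f"https://www.deezer.com/track/{track_id}"


def youtube_watch_url(video_id: str) -> str:
    # Assumed URL shape for a YouTube watch page
    return f"https://www.youtube.com/watch?v={video_id}"
```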
|
||||
|
||||
### Output Transformation
|
||||
|
||||
1. **Result aggregation:** Collect all getter results
|
||||
2. **CSV serialization:** pandas DataFrame to CSV
|
||||
3. **JAMS enrichment:** Inject identifiers into JSON structure
|
||||
4. **File writing:** JSON serialization with indentation
|
||||
|
||||
## Data Quality Issues
|
||||
|
||||
### Input Data Quality
|
||||
|
||||
**No validation:**
|
||||
- Invalid MBIDs accepted (wrong format, non-existent)
|
||||
- Invalid ISRCs accepted (wrong format, non-existent)
|
||||
- Negative durations accepted
|
||||
- Empty strings accepted
|
||||
|
||||
**No sanitization:**
|
||||
- Special characters in metadata not escaped
|
||||
- SQL injection risk if metadata used in queries (not applicable here)
|
||||
- Command injection risk if metadata used in shell commands (not applicable here)
|
||||
|
||||
**No normalization:**
|
||||
- "The Beatles" vs "Beatles" treated as different
|
||||
- "feat." vs "featuring" vs "ft." not normalized
|
||||
- Unicode variants not normalized (e.g., é vs e + combining accent)
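
A minimal normalization sketch covering the Unicode and "feat." cases above. It is not part of MusicMetaLinker; the function name `normalize_name` is an assumption. It deliberately leaves a leading "The" alone, since stripping articles can merge distinct artists.

```python
import re
import unicodedata

# Matches "featuring", "feat" or "ft" as whole words, with an optional dot.
_FEAT = re.compile(r"\b(?:featuring|feat|ft)\b\.?", re.IGNORECASE)


def normalize_name(text: str) -> str:
    """Return a comparison key: NFC-composed, 'feat.' unified, case-folded."""
    text = unicodedata.normalize("NFC", text)   # e + combining accent -> é
    text = _FEAT.sub("feat.", text)             # unify featuring variants
    text = re.sub(r"\s+", " ", text).strip()    # collapse whitespace
    return text.casefold()
```

The result is intended as a key for comparing strings, not as a display value.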
|
||||
|
||||
### Output Data Quality
|
||||
|
||||
**Inconsistent null representation:**
|
||||
- Python: None
|
||||
- CSV: Empty string or "None" string
|
||||
- JAMS: null or missing key
|
||||
|
||||
**No data validation:**
|
||||
- Retrieved MBIDs not validated as UUIDs
|
||||
- Retrieved ISRCs not validated as ISRC format
|
||||
- Retrieved URLs not validated as valid URLs
|
||||
|
||||
**No conflict resolution:**
|
||||
- If MusicBrainz and Deezer return different artists, no reconciliation
|
||||
- First successful query wins, no cross-validation
|
||||
|
||||
### Data Accuracy Issues
|
||||
|
||||
**YouTube matching:** Weak matching logic. First result assumed correct. High false positive rate.
|
||||
|
||||
**Duration filtering:** ±3 seconds threshold may be too loose for short tracks, too strict for live recordings.
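
One way to address this is a tolerance that scales with track length. The sketch below shows the fixed check alongside a scaled alternative; the 2% ratio and the 1 s / 10 s bounds are illustrative assumptions, not tuned or sourced values.

```python
from typing import Optional


def duration_matches(expected_s: Optional[float],
                     candidate_s: Optional[float],
                     tolerance_s: float = 3.0) -> bool:
    """Fixed-window check: True if the durations differ by at most tolerance_s."""
    if expected_s is None or candidate_s is None:
        return False
    return abs(expected_s - candidate_s) <= tolerance_s


def scaled_tolerance(expected_s: float, ratio: float = 0.02,
                     floor_s: float = 1.0, ceiling_s: float = 10.0) -> float:
    """Tolerance proportional to track length, clamped to [floor_s, ceiling_s]."""
    return min(max(ratio * expected_s, floor_s), ceiling_s)
```

Usage: `duration_matches(expected, candidate, scaled_tolerance(expected))` tightens the window for short tracks and widens it for long ones.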
|
||||
|
||||
**Fuzzy matching:** No documented algorithm. Likely simple string similarity. Doesn't handle:
|
||||
- Transliterations (e.g., Japanese to romaji)
|
||||
- Abbreviations (e.g., "feat." vs "featuring")
|
||||
- Reorderings (e.g., "Artist feat. Guest" vs "Guest & Artist")
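
To illustrate why plain string similarity struggles with these cases, here is a sketch using `difflib.SequenceMatcher` from the standard library. This is an assumed stand-in, since the actual algorithm is undocumented; the 0.8 default mirrors the hardcoded similarity threshold noted in the deployment section.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio in the range [0.0, 1.0]."""
    return SequenceMatcher(None, a.casefold(), b.casefold()).ratio()


def is_match(a: str, b: str, threshold: float = 0.8) -> bool:
    """True if the two strings are at least `threshold` similar."""
    return similarity(a, b) >= threshold
```

With this approach, `is_match("Artist feat. Guest", "Guest & Artist")` is False even though both strings name the same collaboration, because the matcher only credits substrings that appear in the same order.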
|
||||
|
||||
**AcousticBrainz:** Always returns null (service shut down). Dead data field.
|
||||
|
||||
## Data Flow Diagrams
|
||||
|
||||
### Single Track Flow
|
||||
|
||||
```
|
||||
Input (Python dict or JAMS)
|
||||
↓
|
||||
Align constructor
|
||||
↓
|
||||
[Lazy evaluation - no queries yet]
|
||||
↓
|
||||
User calls getter (e.g., get_mbid())
|
||||
↓
|
||||
Check cache
|
||||
↓
|
||||
If not cached:
|
||||
↓
|
||||
Determine service to query
|
||||
↓
|
||||
Execute service query
|
||||
↓
|
||||
Parse response
|
||||
↓
|
||||
Cache result
|
||||
↓
|
||||
Return to user
|
||||
```
|
||||
|
||||
### Batch Processing Flow
|
||||
|
||||
```
|
||||
Directory of JAMS files
|
||||
↓
|
||||
For each JAMS file:
|
||||
↓
|
||||
JAMSProcessor.extract_metadata()
|
||||
↓
|
||||
Create Align instance
|
||||
↓
|
||||
Call all getters
|
||||
↓
|
||||
Collect results in list
|
||||
↓
|
||||
End loop
|
||||
↓
|
||||
Convert list to pandas DataFrame
|
||||
↓
|
||||
Write CSV
|
||||
↓
|
||||
Optionally write enriched JAMS files
|
||||
```
|
||||
|
||||
### Service Query Flow
|
||||
|
||||
```
Align.get_mbid()
↓
If mbid_track provided:
    Return mbid_track
Else if isrc provided:
    Query MusicBrainz by ISRC
Else:
    Query MusicBrainz by metadata
↓
Parse MusicBrainz response
↓
Extract MBID
↓
Cache and return
```
|
||||
|
||||
## Data Caching Strategy
|
||||
|
||||
### In-Memory Cache
|
||||
|
||||
**Scope:** Single Align instance only.
|
||||
|
||||
**Cache key:** Implicit (field name). No explicit key generation.
|
||||
|
||||
**Cache invalidation:** None. Values cached for instance lifetime.
|
||||
|
||||
**Cache size:** Small (one value per field, ~15 fields max).
|
||||
|
||||
**Cache hit rate:** High for repeated getter calls on same instance. Zero across instances.
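
The per-instance behaviour described above amounts to memoizing each field on first access. A minimal sketch of that pattern; `LazyFieldCache` and `fake_lookup` are hypothetical names used only for illustration, not the Align internals:

```python
from typing import Any, Callable, Dict


class LazyFieldCache:
    """Stores one value per field name for the lifetime of the instance."""

    def __init__(self) -> None:
        self._values: Dict[str, Any] = {}

    def get(self, field: str, fetch: Callable[[], Any]) -> Any:
        """Return the cached value, calling fetch() only on the first access."""
        if field not in self._values:
            # None results are cached too, so failed lookups are not retried.
            self._values[field] = fetch()
        return self._values[field]


calls = []


def fake_lookup() -> str:
    """Stand-in for a network query; records each invocation."""
    calls.append("query")
    return "example-mbid"


cache = LazyFieldCache()
first = cache.get("mbid", fake_lookup)
second = cache.get("mbid", fake_lookup)  # served from the cache
```

A second `LazyFieldCache` instance starts empty, which is why the hit rate across instances is zero.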
|
||||
|
||||
### No Persistent Cache
|
||||
|
||||
**Implications:**
|
||||
- Repeated queries for same track across runs
|
||||
- No offline operation
|
||||
- Network dependency for every query
|
||||
|
||||
**Batch processing impact:**
|
||||
- Processing 1000 tracks = 1000+ API calls
|
||||
- No deduplication across tracks
|
||||
- High network usage
|
||||
|
||||
### Cache Recommendations
|
||||
|
||||
For production use:
|
||||
|
||||
1. **Add persistent cache:** Redis or SQLite for cross-run caching
|
||||
2. **Cache key:** Hash of (artist, track, album, duration)
|
||||
3. **TTL:** 30 days (metadata rarely changes)
|
||||
4. **Invalidation:** Manual or TTL-based
|
||||
5. **Deduplication:** Cache identical queries across tracks
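
A sketch of recommendations 1 to 4 using SQLite from the standard library. The class and function names are assumptions for illustration, not an existing MusicMetaLinker API.

```python
import hashlib
import json
import sqlite3
import time

THIRTY_DAYS = 30 * 24 * 60 * 60  # TTL in seconds


def cache_key(artist, track, album=None, duration=None):
    """Stable hash of the query fields, used as the cache key."""
    payload = json.dumps([artist, track, album, duration], ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class SqliteCache:
    """Tiny key/value cache with TTL-based expiry."""

    def __init__(self, path=":memory:", ttl_s=THIRTY_DAYS):
        self._db = sqlite3.connect(path)
        self._ttl_s = ttl_s
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS cache "
            "(key TEXT PRIMARY KEY, value TEXT NOT NULL, stored_at REAL NOT NULL)"
        )

    def get(self, key, now=None):
        """Return the cached value, or None if absent or older than the TTL."""
        now = time.time() if now is None else now
        row = self._db.execute(
            "SELECT value, stored_at FROM cache WHERE key = ?", (key,)
        ).fetchone()
        if row is None or now - row[1] > self._ttl_s:
            return None
        return json.loads(row[0])

    def put(self, key, value, now=None):
        """Store a JSON-serializable value under key, replacing any old entry."""
        now = time.time() if now is None else now
        self._db.execute(
            "INSERT OR REPLACE INTO cache (key, value, stored_at) VALUES (?, ?, ?)",
            (key, json.dumps(value), now),
        )
        self._db.commit()
```

Two caveats apply to this sketch. A cached null is indistinguishable from a miss. Float durations that differ slightly produce different keys, so rounding the duration before hashing may be needed for deduplication to work.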
|
||||
|
||||
## Data Privacy and Security
|
||||
|
||||
### Personal Data
|
||||
|
||||
**No personal data collected:** Only public music metadata.
|
||||
|
||||
**No user tracking:** No analytics, no telemetry.
|
||||
|
||||
**No data sharing:** Results not sent to third parties.
|
||||
|
||||
### API Credentials
|
||||
|
||||
**Spotify credentials:** Stored in external mml_secrets.py file. Not encrypted. Not in version control.
|
||||
|
||||
**Other services:** No credentials required.
|
||||
|
||||
### Data Retention
|
||||
|
||||
**No retention:** All data discarded when Align instance destroyed.
|
||||
|
||||
**Batch output:** CSV and JAMS files written to disk. User responsible for retention and deletion.
|
||||
|
||||
## Data Consistency
|
||||
|
||||
### Cross-Service Consistency
|
||||
|
||||
**No consistency checks:** If MusicBrainz returns artist "The Beatles" and Deezer returns "Beatles", no reconciliation.
|
||||
|
||||
**First-wins strategy:** First successful query result used. No validation against other services.
|
||||
|
||||
**Conflict scenarios:**
|
||||
- Different artists across services
|
||||
- Different track names across services
|
||||
- Different durations across services
|
||||
|
||||
**No conflict resolution:** User receives inconsistent data.
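
Detecting such conflicts is straightforward once per-service results are kept separately. A minimal sketch, assuming results are collected as a `{service: {field: value}}` mapping (a hypothetical structure, since the library does not record provenance):

```python
from typing import Dict, Optional

ServiceResults = Dict[str, Dict[str, Optional[str]]]


def find_conflicts(results: ServiceResults) -> Dict[str, Dict[str, str]]:
    """Return {field: {service: value}} for fields where services disagree."""
    by_field: Dict[str, Dict[str, str]] = {}
    for service, fields in results.items():
        for field, value in fields.items():
            if value is not None:  # missing values are not disagreements
                by_field.setdefault(field, {})[service] = value
    return {
        field: values
        for field, values in by_field.items()
        # Compare case-insensitively so "Hey Jude" and "hey jude" agree.
        if len({str(v).casefold() for v in values.values()}) > 1
    }
```

This only flags disagreements. Choosing a winner still requires a policy, such as preferring MusicBrainz when an MBID was supplied.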
|
||||
|
||||
### Temporal Consistency
|
||||
|
||||
**No versioning:** Metadata retrieved at query time. No timestamp recorded.
|
||||
|
||||
**Staleness:** If MusicBrainz updates metadata after query, Align instance has stale data.
|
||||
|
||||
**No refresh:** No way to refresh cached data without creating new instance.
|
||||
|
||||
## Data Completeness
|
||||
|
||||
### Missing Data Handling
|
||||
|
||||
**Graceful degradation:** Missing fields return None. No errors.
|
||||
|
||||
**Partial results:** If MusicBrainz succeeds but Deezer fails, MusicBrainz data returned.
|
||||
|
||||
**No completeness metrics:** No indication of how many fields successfully retrieved.
|
||||
|
||||
### Required vs Optional Fields
|
||||
|
||||
**No required fields:** All constructor parameters optional.
|
||||
|
||||
**Minimum viable input:** At least one of (mbid_track, isrc, artist+track) recommended.
|
||||
|
||||
**Degenerate cases:**
|
||||
- Empty Align() constructor: All getters return None
|
||||
- Only duration provided: All getters return None (no searchable metadata)
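
Callers can guard against the degenerate cases before constructing an instance. A minimal sketch of the "minimum viable input" rule stated above; the function name is a hypothetical helper, not part of the library:

```python
def has_searchable_input(mbid_track=None, isrc=None, artist=None, track=None) -> bool:
    """True if at least one of mbid_track, isrc, or artist+track is provided."""
    return bool(mbid_track or isrc or (artist and track))
```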
|
||||
|
||||
## Data Format Standards
|
||||
|
||||
### Identifier Formats
|
||||
|
||||
**MBID:** UUID format (e.g., "6b9e7b9e-8f9e-4f9e-9f9e-9f9e9f9e9f9e"). No validation.
|
||||
|
||||
**ISRC:** 12-character alphanumeric (e.g., "GBAYE9200070"). No validation.
|
||||
|
||||
**Deezer ID:** Integer. No range validation.
|
||||
|
||||
**YouTube ID:** Alphanumeric string (e.g., "dQw4w9WgXcQ"). No validation.
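
Since none of these identifiers are validated, a caller may want to check them first. A minimal sketch for the two most structured formats; the function names are assumptions. The ISRC pattern follows the standard layout: two-letter country code, three-character registrant code, two-digit year, five-digit designation.

```python
import re
import uuid

# Country (2 letters) + registrant (3 alphanumerics) + year and designation (7 digits).
_ISRC = re.compile(r"[A-Z]{2}[A-Z0-9]{3}[0-9]{7}")


def is_valid_mbid(value) -> bool:
    """True if value is a UUID in canonical hyphenated form."""
    try:
        return str(uuid.UUID(value)) == value.lower()
    except (AttributeError, TypeError, ValueError):
        return False


def is_valid_isrc(value) -> bool:
    """True if value is a well-formed ISRC, with or without hyphens."""
    if not isinstance(value, str):
        return False
    return _ISRC.fullmatch(value.replace("-", "").upper()) is not None
```

Both checks are purely syntactic. A well-formed identifier can still refer to a recording that does not exist.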
|
||||
|
||||
### Metadata Formats
|
||||
|
||||
**Artist, track, album:** Free text. No format constraints.
|
||||
|
||||
**Duration:** Seconds (int or float). MusicBrainz milliseconds converted to seconds.
|
||||
|
||||
**Track number:** Integer. No validation (negative numbers accepted).
|
||||
|
||||
**Release date:** ISO format (YYYY-MM-DD) or year only (YYYY). Inconsistent across services.
|
||||
|
||||
**BPM:** Integer or float. No range validation.
|
||||
|
||||
## Data Interoperability
|
||||
|
||||
### JAMS Compatibility
|
||||
|
||||
JAMS is a standard format in music information retrieval research. MusicMetaLinker's JAMS support enables interoperability with:
|
||||
- mir_eval (evaluation framework)
|
||||
- librosa (audio analysis)
|
||||
- madmom (music analysis)
|
||||
- Other MIR tools
|
||||
|
||||
### Service Compatibility
|
||||
|
||||
**MusicBrainz:** Uses the community-maintained musicbrainzngs library, a client for the MusicBrainz web service. Compatibility with API changes depends on the library being kept up to date.
|
||||
|
||||
**Deezer:** Uses the third-party deezer-python library, a wrapper around the public Deezer API.
|
||||
|
||||
**YouTube Music:** Uses unofficial ytmusicapi. Fragile to YouTube changes. No API stability guarantees.
|
||||
|
||||
**Spotify:** Uses the third-party spotipy library, a client for the Spotify Web API.
|
||||
|
||||
## Data Limitations
|
||||
|
||||
1. **No bulk operations:** Each track processed individually
|
||||
2. **No streaming:** All data loaded into memory
|
||||
3. **No compression:** JAMS files written uncompressed
|
||||
4. **No encryption:** All data stored in plaintext
|
||||
5. **No checksums:** No data integrity verification
|
||||
6. **No versioning:** No metadata version tracking
|
||||
7. **No provenance:** No record of which service provided which field
|
||||
8. **No confidence scores:** No indication of match quality
|
||||
|
||||
## Data Recommendations
|
||||
|
||||
For production use:
|
||||
|
||||
1. **Add validation:** Validate all input and output formats
|
||||
2. **Add normalization:** Normalize artist names, track titles
|
||||
3. **Add conflict resolution:** Cross-validate results across services
|
||||
4. **Add provenance tracking:** Record which service provided each field
|
||||
5. **Add confidence scores:** Indicate match quality
|
||||
6. **Add persistent cache:** Reduce API calls
|
||||
7. **Add data versioning:** Track when metadata retrieved
|
||||
8. **Add bulk operations:** Process multiple tracks efficiently
|
||||
9. **Remove dead fields:** Delete AcousticBrainz from output
|
||||
10. **Add structured output:** to_dict(), to_json() methods
|
||||
|
||||
The data model is simple and functional for research use. Production use requires significant enhancements.
|
||||
@@ -0,0 +1,611 @@
|
||||
# MusicMetaLinker Deployment
|
||||
|
||||
## Distribution Model
|
||||
|
||||
MusicMetaLinker is distributed as source code only. No binary distributions, no PyPI package, no conda package.
|
||||
|
||||
**Installation method:** Direct from GitHub via pip.
|
||||
|
||||
```bash
|
||||
pip install git+https://github.com/andreamust/MusicMetaLinker.git
|
||||
```
|
||||
|
||||
**Implications:**
|
||||
- Requires git installed
|
||||
- Requires network access to GitHub
|
||||
- No version pinning by default (installs the latest commit on the default branch unless a tag or commit is appended to the URL as `@<ref>`)
|
||||
- No offline installation
|
||||
|
||||
## Build System
|
||||
|
||||
### Build Backend
|
||||
|
||||
**PEP 517 compliant:** Uses pyproject.toml for build configuration.
|
||||
|
||||
**Build backend:** hatchling (modern Python build tool).
|
||||
|
||||
**pyproject.toml structure:**
|
||||
|
||||
```toml
|
||||
[build-system]
|
||||
requires = ["hatchling"]
|
||||
build-backend = "hatchling.build"
|
||||
|
||||
[project]
|
||||
name = "musicmetalinker"
|
||||
version = "0.0.1"
|
||||
dependencies = [
|
||||
"musicbrainzngs",
|
||||
"deezer-python",
|
||||
"ytmusicapi",
|
||||
"spotipy",
|
||||
"requests",
|
||||
"tqdm",
|
||||
"jams",
|
||||
"pandas",
|
||||
"cryptography"
|
||||
]
|
||||
```
|
||||
|
||||
**No setup.py:** Modern packaging only.
|
||||
|
||||
**No setup.cfg:** All configuration in pyproject.toml.
|
||||
|
||||
### Build Process
|
||||
|
||||
**Local build:**
|
||||
|
||||
```bash
|
||||
git clone https://github.com/andreamust/MusicMetaLinker.git
|
||||
cd MusicMetaLinker
|
||||
pip install -e .
|
||||
```
|
||||
|
||||
**-e flag:** Editable install. Changes to source code immediately reflected.
|
||||
|
||||
**Build artifacts:** None. Pure Python package, no compilation.
|
||||
|
||||
### Dependencies
|
||||
|
||||
**Runtime dependencies:**
|
||||
|
||||
- musicbrainzngs: MusicBrainz API client
|
||||
- deezer-python: Deezer API wrapper
|
||||
- ytmusicapi: YouTube Music API client
|
||||
- spotipy: Spotify API client
|
||||
- requests: HTTP library
|
||||
- tqdm: Progress bars
|
||||
- jams: JAMS format support
|
||||
- pandas: CSV output
|
||||
- cryptography: Required by spotipy
|
||||
|
||||
**No optional dependencies:** All dependencies required.
|
||||
|
||||
**No development dependencies:** No test framework, no linting tools, no type checkers.
|
||||
|
||||
**Dependency versions:** No version constraints. Always installs latest compatible versions.
|
||||
|
||||
**Risk:** Breaking changes in dependencies may break MusicMetaLinker.
|
||||
|
||||
## Deployment Environments
|
||||
|
||||
### Library Deployment
|
||||
|
||||
**Target environment:** Python 3.8+ on any platform (Linux, macOS, Windows).
|
||||
|
||||
**Installation:**
|
||||
|
||||
```bash
|
||||
pip install git+https://github.com/andreamust/MusicMetaLinker.git
|
||||
```
|
||||
|
||||
**Usage:**
|
||||
|
||||
```python
|
||||
from musicmetalinker.linking import Align
|
||||
|
||||
linker = Align(artist="...", track="...")
|
||||
mbid = linker.get_mbid()
|
||||
```
|
||||
|
||||
**No configuration required** (except Spotify credentials for dataset preparation).
|
||||
|
||||
### Batch Processing Deployment
|
||||
|
||||
**Target environment:** Python 3.8+ with file system access.
|
||||
|
||||
**Installation:** Same as library deployment.
|
||||
|
||||
**Usage:**
|
||||
|
||||
```bash
|
||||
cd /path/to/MusicMetaLinker
|
||||
python link_partitions.py /path/to/jams/files --save --limit audio --overwrite
|
||||
```
|
||||
|
||||
**Requirements:**
|
||||
- JAMS files in target directory
|
||||
- Write permissions for output CSV and enriched JAMS files
|
||||
- Network access for API queries
|
||||
|
||||
**Optional:** ffmpeg for audio conversion (if processing audio files directly).
|
||||
|
||||
### Research Environment Deployment
|
||||
|
||||
**Typical setup:** Jupyter notebook or Python script in research project.
|
||||
|
||||
**Installation:**
|
||||
|
||||
```bash
|
||||
pip install git+https://github.com/andreamust/MusicMetaLinker.git
|
||||
```
|
||||
|
||||
**Interactive testing:**
|
||||
|
||||
Notebooks included in repository:
|
||||
- deezer_test.ipynb: Test Deezer integration
|
||||
- queries.ipynb: Test various query patterns
|
||||
|
||||
**Usage:**
|
||||
|
||||
```python
|
||||
# In Jupyter notebook
|
||||
from musicmetalinker.linking import Align
|
||||
|
||||
linker = Align(...)
|
||||
# Interactive exploration of results
|
||||
```
|
||||
|
||||
## Configuration Management
|
||||
|
||||
### No Configuration Files
|
||||
|
||||
All configuration hardcoded in source files.
|
||||
|
||||
**Hardcoded values:**
|
||||
- User-Agent: "elka/0.1" (in linking.py)
|
||||
- Duration thresholds: 3s (Deezer), 5s (MusicBrainz)
|
||||
- Similarity threshold: 0.8
|
||||
- API endpoints: In library code
|
||||
|
||||
**No config.ini, no config.yaml, no .env files.**
|
||||
|
||||
### Spotify Credentials
|
||||
|
||||
**Only external configuration:** mml_secrets.py for Spotify credentials.
|
||||
|
||||
**Location:** Must be in Python path (typically same directory as scripts).
|
||||
|
||||
**Structure:**
|
||||
|
||||
```python
|
||||
# mml_secrets.py
|
||||
SPOTIFY_CLIENT_ID = "your-client-id-here"
|
||||
SPOTIFY_CLIENT_SECRET = "your-client-secret-here"
|
||||
```
|
||||
|
||||
**Not in repository:** Users must create this file manually.
|
||||
|
||||
**No documentation:** No instructions for obtaining Spotify credentials.
|
||||
|
||||
**Obtaining credentials:**
|
||||
1. Register app at https://developer.spotify.com/dashboard
|
||||
2. Copy client ID and secret
|
||||
3. Create mml_secrets.py with credentials
|
||||
|
||||
### Environment Variables
|
||||
|
||||
**Not used:** No environment variable configuration.
|
||||
|
||||
**Recommendation:** Use environment variables for credentials instead of mml_secrets.py.
|
||||
|
||||
```python
import os

# os.getenv returns None when a variable is unset; check before use.
SPOTIFY_CLIENT_ID = os.getenv("SPOTIFY_CLIENT_ID")
SPOTIFY_CLIENT_SECRET = os.getenv("SPOTIFY_CLIENT_SECRET")
```
|
||||
|
||||
## Runtime Requirements
|
||||
|
||||
### Python Version
|
||||
|
||||
**Minimum:** Python 3.8
|
||||
|
||||
**Tested on:** Unknown (no CI/CD, no test matrix).
|
||||
|
||||
**Likely compatible:** Python 3.8, 3.9, 3.10, 3.11, 3.12
|
||||
|
||||
**Type hints:** Not used extensively. No runtime type checking.
|
||||
|
||||
### System Dependencies
|
||||
|
||||
**Required:**
|
||||
- Python 3.8+
|
||||
- pip
|
||||
- git (for installation)
|
||||
- Network access (for API queries)
|
||||
|
||||
**Optional:**
|
||||
- ffmpeg (for audio conversion in batch processing)
|
||||
|
||||
**No database:** No PostgreSQL, MySQL, MongoDB, etc.
|
||||
|
||||
**No message queue:** No RabbitMQ, Redis, Kafka, etc.
|
||||
|
||||
**No web server:** No nginx, Apache, etc.
|
||||
|
||||
### Platform Support
|
||||
|
||||
**Linux:** Fully supported. Primary development platform (likely).
|
||||
|
||||
**macOS:** Fully supported. All dependencies available.
|
||||
|
||||
**Windows:** Likely supported. All dependencies have Windows wheels. Potential issues:
|
||||
- Path separators (/ vs \)
|
||||
- Line endings (LF vs CRLF)
|
||||
- Case-insensitive file system (file names that differ only by case collide)
|
||||
|
||||
**No platform-specific code:** Pure Python, no C extensions (except in dependencies).
|
||||
|
||||
## Containerization
|
||||
|
||||
### Docker
|
||||
|
||||
**No Dockerfile provided.**
|
||||
|
||||
**Sample Dockerfile:**
|
||||
|
||||
```dockerfile
|
||||
FROM python:3.11-slim
|
||||
|
||||
WORKDIR /app
|
||||
|
||||
RUN apt-get update && apt-get install -y git && rm -rf /var/lib/apt/lists/*
|
||||
|
||||
RUN pip install git+https://github.com/andreamust/MusicMetaLinker.git
|
||||
|
||||
COPY mml_secrets.py /app/
|
||||
|
||||
CMD ["python"]
|
||||
```
|
||||
|
||||
**For batch processing:**
|
||||
|
||||
```dockerfile
|
||||
FROM python:3.11-slim
|
||||
|
||||
WORKDIR /app
|
||||
|
||||
RUN apt-get update && apt-get install -y git ffmpeg && rm -rf /var/lib/apt/lists/*
|
||||
|
||||
RUN pip install git+https://github.com/andreamust/MusicMetaLinker.git
|
||||
|
||||
RUN git clone https://github.com/andreamust/MusicMetaLinker.git /app/MusicMetaLinker
|
||||
|
||||
WORKDIR /app/MusicMetaLinker
|
||||
|
||||
ENTRYPOINT ["python", "link_partitions.py"]
|
||||
```
|
||||
|
||||
**Usage:**
|
||||
|
||||
```bash
|
||||
docker build -t musicmetalinker .
|
||||
docker run -v /path/to/jams:/data musicmetalinker /data --save
|
||||
```
|
||||
|
||||
### Docker Compose
|
||||
|
||||
**Not provided.**
|
||||
|
||||
**Sample docker-compose.yml:**
|
||||
|
||||
```yaml
version: '3.8'

services:
  musicmetalinker:
    build: .
    volumes:
      - ./data:/data
      - ./output:/output
    environment:
      - SPOTIFY_CLIENT_ID=${SPOTIFY_CLIENT_ID}
      - SPOTIFY_CLIENT_SECRET=${SPOTIFY_CLIENT_SECRET}
```
|
||||
|
||||
### Kubernetes
|
||||
|
||||
**Not applicable:** MusicMetaLinker is a library/batch tool, not a long-running service.
|
||||
|
||||
**Possible use case:** Kubernetes Job for batch processing.
|
||||
|
||||
```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: musicmetalinker-batch
spec:
  template:
    spec:
      containers:
        - name: musicmetalinker
          image: musicmetalinker:latest
          args: ["/data", "--save"]
          volumeMounts:
            - name: data
              mountPath: /data
      restartPolicy: Never
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: jams-data
```
|
||||
|
||||
## Continuous Integration/Continuous Deployment
|
||||
|
||||
### CI/CD Status
|
||||
|
||||
**No CI/CD pipeline.**
|
||||
|
||||
**No GitHub Actions, no Travis CI, no CircleCI, no Jenkins.**
|
||||
|
||||
**Implications:**
|
||||
- No automated testing on commits
|
||||
- No automated builds
|
||||
- No automated releases
|
||||
- No quality gates
|
||||
|
||||
### Testing
|
||||
|
||||
**No test suite.**
|
||||
|
||||
**No pytest, no unittest, no nose.**
|
||||
|
||||
**Testing approach:**
|
||||
- Manual testing via Jupyter notebooks
|
||||
- `if __name__ == "__main__"` blocks in some modules
|
||||
|
||||
**No test coverage metrics.**
|
||||
|
||||
### Linting and Formatting
|
||||
|
||||
**No linting configuration.**
|
||||
|
||||
**No pylint, no flake8, no black, no isort.**
|
||||
|
||||
**Code quality:** Inconsistent. Debug prints, commented-out code, inconsistent naming.
|
||||
|
||||
### Type Checking
|
||||
|
||||
**No type checking.**
|
||||
|
||||
**No mypy, no pyright, no pyre.**
|
||||
|
||||
**Type hints:** Minimal. Not enforced.
|
||||
|
||||
## Monitoring and Logging
|
||||
|
||||
### Logging
|
||||
|
||||
**Library usage:** Minimal console logging.
|
||||
|
||||
**Batch processing:** File-based logging to link_partitions.log.
|
||||
|
||||
**Log format:**
|
||||
|
||||
```
|
||||
2024-01-15 10:30:45 - INFO - Processing file: track001.jams
|
||||
2024-01-15 10:30:46 - INFO - Found MBID: 6b9e7b9e-8f9e-4f9e-9f9e-9f9e9f9e9f9e
|
||||
2024-01-15 10:30:47 - ERROR - Failed to query Deezer
|
||||
```
|
||||
|
||||
**Log levels:** INFO, ERROR. No DEBUG, WARNING.
|
||||
|
||||
**Debug output:** Multiple print() statements in code (not controlled by logging).
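
Replacing the print() calls with the standard `logging` module would make debug output controllable by level. A minimal sketch that reproduces the log format shown above; the logger name and the `make_logger` helper are assumptions, not existing code:

```python
import io
import logging

LOG_FORMAT = "%(asctime)s - %(levelname)s - %(message)s"
DATE_FORMAT = "%Y-%m-%d %H:%M:%S"


def make_logger(name="musicmetalinker.sketch", level=logging.INFO, stream=None):
    """Return a logger that writes formatted records to stream (stderr if None)."""
    logger = logging.getLogger(name)
    logger.setLevel(level)
    if not logger.handlers:  # avoid stacking handlers on repeated calls
        handler = logging.StreamHandler(stream)
        handler.setFormatter(logging.Formatter(LOG_FORMAT, datefmt=DATE_FORMAT))
        logger.addHandler(handler)
    return logger


# Demonstration: DEBUG records are dropped at INFO level, INFO records pass.
buffer = io.StringIO()
log = make_logger(stream=buffer)
log.debug("Querying MusicBrainz")
log.info("Processing file: %s", "track001.jams")
```

Passing `level=logging.DEBUG` re-enables the diagnostic messages without code changes elsewhere.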
|
||||
|
||||
### Monitoring
|
||||
|
||||
**No monitoring.**
|
||||
|
||||
**No metrics collection, no Prometheus, no Grafana, no Datadog.**
|
||||
|
||||
**No health checks, no status endpoints.**
|
||||
|
||||
### Error Tracking
|
||||
|
||||
**No error tracking.**
|
||||
|
||||
**No Sentry, no Rollbar, no Bugsnag.**
|
||||
|
||||
**Errors silently suppressed.** Returns None on failure.
|
||||
|
||||
## Scaling Considerations
|
||||
|
||||
### Horizontal Scaling
|
||||
|
||||
**Not applicable:** Library runs in single process.
|
||||
|
||||
**Batch processing:** Can be parallelized manually.
|
||||
|
||||
**Manual parallelization:**
|
||||
|
||||
```bash
|
||||
# Split JAMS files into partitions
|
||||
# Run multiple instances in parallel
|
||||
python link_partitions.py /data/partition1 --save &
|
||||
python link_partitions.py /data/partition2 --save &
|
||||
python link_partitions.py /data/partition3 --save &
|
||||
wait
|
||||
```
|
||||
|
||||
**No built-in parallelization.**
|
||||
|
||||
### Vertical Scaling
|
||||
|
||||
**CPU:** Single-threaded. More CPU cores don't help.
|
||||
|
||||
**Memory:** Minimal usage. Each Align instance uses ~1KB. Batch processing uses more for pandas DataFrame.
|
||||
|
||||
**Network:** Bottleneck. Sequential API calls. More bandwidth doesn't help (latency-bound).
|
||||
|
||||
### Performance Optimization
|
||||
|
||||
**No performance optimization.**
|
||||
|
||||
**Bottlenecks:**
|
||||
- Network latency (sequential API calls)
|
||||
- No caching across instances
|
||||
- No connection pooling
|
||||
- No request batching
|
||||
|
||||
**Potential optimizations:**
|
||||
- Async/await for concurrent API calls
|
||||
- Persistent cache (Redis)
|
||||
- Connection pooling
|
||||
- Batch API requests (if services support)
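
Because the workload is latency-bound, running lookups concurrently is the most direct gain. The sketch below uses a thread pool rather than async/await, since the underlying client libraries are synchronous. `fake_lookup` is a stand-in for a real service call, and the worker count is an assumption that must stay within each service's rate limits.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_lookup(track_id):
    """Stand-in for a latency-bound API call; returns (track_id, result)."""
    time.sleep(0.05)  # simulated network latency
    return track_id, f"result-for-{track_id}"


def lookup_all(track_ids, max_workers=4):
    """Run lookups concurrently and return {track_id: result}."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order and re-raises worker exceptions.
        return dict(pool.map(fake_lookup, track_ids))
```

Public APIs such as MusicBrainz enforce request rate limits, so unbounded concurrency is likely to trigger throttling rather than speed things up.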
|
||||
|
||||
## Security Considerations
|
||||
|
||||
### Secrets Management
|
||||
|
||||
**Current approach:** Hardcoded in mml_secrets.py.
|
||||
|
||||
**Issues:**
|
||||
- Plaintext credentials
|
||||
- No encryption
|
||||
- Risk of committing to version control
|
||||
|
||||
**Recommendations:**
|
||||
- Environment variables
|
||||
- Secrets vault (HashiCorp Vault, AWS Secrets Manager)
|
||||
- Encrypted configuration files
|
||||
|
||||
### Network Security
|
||||
|
||||
**HTTPS:** All API calls use HTTPS.
|
||||
|
||||
**Certificate validation:** Handled by requests library (validates by default).
|
||||
|
||||
**No proxy support:** No configuration for HTTP proxies.
|
||||
|
||||
### Input Validation
|
||||
|
||||
**No input validation.**
|
||||
|
||||
**Risks:**
|
||||
- Invalid MBIDs accepted
|
||||
- Negative durations accepted
|
||||
- Malformed ISRCs accepted
|
||||
|
||||
**Actual risk:** Low. Invalid input causes query failures (returns None).
|
||||
|
||||
### Dependency Security
|
||||
|
||||
**No dependency scanning.**
|
||||
|
||||
**No Dependabot, no Snyk, no safety.**
|
||||
|
||||
**Vulnerable dependencies:** Unknown. No automated checks.
|
||||
|
||||
**Recommendation:** Run `pip-audit` or `safety check` regularly.
|
||||
|
||||
## Backup and Recovery
|
||||
|
||||
### Data Backup
|
||||
|
||||
**No persistent data:** Nothing to back up (library is stateless).
|
||||
|
||||
**Batch output:** CSV and JAMS files. User responsible for backup.
|
||||
|
||||
### Disaster Recovery
|
||||
|
||||
**Not applicable:** Library has no state to recover.
|
||||
|
||||
**Batch processing:** Rerun if output lost. No checkpointing, no resume capability.
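
A resume capability could be approximated by skipping files that already appear in a previous output CSV. A minimal sketch, assuming the CSV keeps the `jams_file` column listed in the CSV format table; the helper names are hypothetical:

```python
import csv
import io


def already_processed(csv_handle):
    """Return the set of JAMS filenames present in an existing output CSV."""
    return {
        row["jams_file"]
        for row in csv.DictReader(csv_handle)
        if row.get("jams_file")
    }


def pending_files(all_files, done):
    """Return the files not yet processed, preserving input order."""
    return [name for name in all_files if name not in done]


# Illustrative data: one of three files was handled before the interruption.
previous = io.StringIO("jams_file,track_name\ntrack001.jams,Hey Jude\n")
todo = pending_files(
    ["track001.jams", "track002.jams", "track003.jams"],
    already_processed(previous),
)
```

This only works if rows are written incrementally. With the current approach of writing the CSV once at the end, nothing is on disk to resume from.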
|
||||
|
||||
## Deployment Checklist
|
||||
|
||||
### Library Deployment
|
||||
|
||||
- [ ] Python 3.8+ installed
|
||||
- [ ] pip installed
|
||||
- [ ] git installed
|
||||
- [ ] Network access to GitHub
|
||||
- [ ] Network access to MusicBrainz, Deezer, YouTube Music
|
||||
- [ ] (Optional) Spotify credentials in mml_secrets.py
|
||||
|
||||
### Batch Processing Deployment
|
||||
|
||||
- [ ] All library deployment requirements
|
||||
- [ ] JAMS files prepared
|
||||
- [ ] Write permissions for output directory
|
||||
- [ ] (Optional) ffmpeg installed for audio conversion
|
||||
- [ ] Sufficient disk space for output CSV and enriched JAMS files
|
||||
|
||||
### Production Deployment (Recommendations)
|
||||
|
||||
- [ ] Pin dependency versions in pyproject.toml
|
||||
- [ ] Add automated tests
|
||||
- [ ] Add CI/CD pipeline
|
||||
- [ ] Add error tracking (Sentry)
|
||||
- [ ] Add logging (structured JSON logs)
|
||||
- [ ] Add monitoring (Prometheus metrics)
|
||||
- [ ] Add rate limiting
|
||||
- [ ] Add retry logic with exponential backoff
|
||||
- [ ] Add health checks
|
||||
- [ ] Use environment variables for configuration
|
||||
- [ ] Add input validation
|
||||
- [ ] Add dependency scanning
|
||||
- [ ] Remove AcousticBrainz integration
|
||||
- [ ] Fix User-Agent header
|
||||
- [ ] Add documentation for Spotify setup
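
The retry item in the checklist above can be sketched as a small wrapper. This is an illustration under assumptions, not existing code: the function name, default delays, and jitter range are all hypothetical choices.

```python
import random
import time


def retry_with_backoff(func, attempts=4, base_delay_s=0.5, max_delay_s=8.0,
                       retry_on=(Exception,), sleep=time.sleep):
    """Call func(), retrying on failure with exponentially growing delays."""
    for attempt in range(attempts):
        try:
            return func()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            # Delay doubles each attempt (0.5s, 1s, 2s, ...) up to max_delay_s.
            delay = min(max_delay_s, base_delay_s * (2 ** attempt))
            # Add up to 50% random jitter so parallel workers do not retry in lockstep.
            sleep(delay + random.uniform(0, delay / 2))
```

In practice `retry_on` should be narrowed to transient errors such as timeouts and connection failures, so that permanent failures are not retried. The `sleep` parameter exists so tests can substitute a no-op.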
|
||||
|
||||
## Deployment Recommendations
|
||||
|
||||
### Immediate Actions
|
||||
|
||||
1. **Publish to PyPI:** Enable `pip install musicmetalinker` without git.
|
||||
2. **Pin dependencies:** Add version constraints to prevent breaking changes.
|
||||
3. **Document Spotify setup:** Instructions for obtaining credentials.
|
||||
4. **Remove AcousticBrainz:** Delete defunct integration.
|
||||
|
||||
### Short-Term Improvements
|
||||
|
||||
1. **Add CI/CD:** GitHub Actions for automated testing and releases.
|
||||
2. **Add tests:** pytest suite with mocked API calls.
|
||||
3. **Add Docker support:** Official Dockerfile and Docker Compose.
|
||||
4. **Add configuration:** Support environment variables and config files.
|
||||
5. **Add logging:** Structured logging with configurable levels.
|
||||
|
||||
### Long-Term Enhancements
|
||||
|
||||
1. **Add monitoring:** Prometheus metrics for API latency, success rates.
|
||||
2. **Add caching:** Redis for cross-instance caching.
|
||||
3. **Add async support:** Concurrent API calls for better performance.
|
||||
4. **Add health checks:** Service availability monitoring.
|
||||
5. **Add error tracking:** Sentry integration for production debugging.
|
||||
6. **Add documentation:** Comprehensive deployment guide.
|
||||
7. **Add versioning:** Semantic versioning with changelog.
|
||||
8. **Add security scanning:** Automated dependency vulnerability checks.
|
||||
|
||||
## Deployment Maturity Assessment
|
||||
|
||||
**Current state:** Research prototype. Suitable for academic exploration, not production.
|
||||
|
||||
**Maturity level:** 1/5
|
||||
|
||||
**Production readiness:** Low
|
||||
|
||||
**Gaps:**
|
||||
- No PyPI distribution
|
||||
- No CI/CD
|
||||
- No tests
|
||||
- No monitoring
|
||||
- No error tracking
|
||||
- Hardcoded configuration
|
||||
- Dead code (AcousticBrainz)
|
||||
- No documentation for deployment
|
||||
|
||||
**Recommendation:** Use for research and prototyping only. Significant work required for production deployment.
|
||||
@@ -0,0 +1,632 @@
|
||||
# MusicMetaLinker Evaluation
|
||||
|
||||
## Executive Summary
|
||||
|
||||
MusicMetaLinker is a research-quality Python library for music metadata entity linking. It connects tracks to external databases (MusicBrainz, Deezer, YouTube Music) to enrich incomplete metadata. The core concept is sound, but implementation is pre-release quality with significant gaps in testing, error handling, and production readiness.
|
||||
|
||||
**Version:** 0.0.1 (pre-release)
|
||||
**Maturity:** Research prototype
|
||||
**Production readiness:** Low
|
||||
**Academic value:** Moderate
|
||||
**Integration potential:** Low (concept valuable, implementation needs work)
|
||||
|
||||
## Strengths
|
||||
|
||||
### 1. Simple, Clean API
|
||||
|
||||
Single Align class provides unified interface to multiple services. Users don't need to understand service-specific APIs.
|
||||
|
||||
```python
|
||||
linker = Align(artist="The Beatles", track="Hey Jude")
|
||||
mbid = linker.get_mbid()
|
||||
isrc = linker.get_isrc()
|
||||
```
|
||||
|
||||
**Value:** Low barrier to entry. Easy to integrate into research workflows.
|
||||
|
||||
### 2. Cascading Fallback Pattern
|
||||
|
||||
Graceful degradation across services. If MusicBrainz fails, tries Deezer. If Deezer fails, tries YouTube Music.
|
||||
|
||||
**Value:** Maximizes coverage. Handles service unavailability gracefully.
|
||||
|
||||
**Applicability:** This pattern is worth adopting in other metadata aggregation systems.
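
For reference, the pattern can be expressed in a few lines. This is a generic sketch of cascading fallback, not the Align implementation; the provider names and stub functions are assumptions for illustration.

```python
from typing import Callable, Iterable, Optional, Tuple

Provider = Tuple[str, Callable[[], Optional[str]]]


def first_successful(providers: Iterable[Provider]) -> Tuple[Optional[str], Optional[str]]:
    """Try each provider in order; return (provider_name, value) for the first hit."""
    for name, lookup in providers:
        try:
            value = lookup()
        except Exception:
            # Treat any provider error as a miss and fall through to the next.
            # A production version should log the error and catch narrower types.
            continue
        if value is not None:
            return name, value
    return None, None


def _unavailable() -> Optional[str]:
    """Stub provider that simulates a service outage."""
    raise ConnectionError("service unavailable")


# Illustrative chain: the first provider fails, the second has no match.
source, isrc = first_successful([
    ("musicbrainz", _unavailable),
    ("deezer", lambda: None),
    ("youtube_music", lambda: "GBAYE9200070"),
])
```

Returning the provider name alongside the value also gives basic provenance, which the current library does not record.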
|
||||
|
||||
### 3. JAMS Format Support
|
||||
|
||||
Supports JAMS (JSON Annotated Music Specification), a standard format in music information retrieval research.
|
||||
|
||||
**Value:** Interoperability with academic MIR tools (mir_eval, librosa, madmom).
|
||||
|
||||
**Use case:** Dataset preparation for music research projects.
|
||||
|
||||
### 4. Batch Processing
|
||||
|
||||
link_partitions.py enables processing entire directories of JAMS files with progress tracking and CSV output.
|
||||
|
||||
**Value:** Scales to dataset-level operations. Useful for preparing research datasets.
|
||||
|
||||
### 5. MIT License
|
||||
|
||||
Permissive license allows unrestricted use, modification, and distribution.
|
||||
|
||||
**Value:** Can be freely integrated into commercial or academic projects.
|
||||
|
||||
### 6. Minimal Dependencies
|
||||
|
||||
Only essential dependencies. No exotic or unmaintained libraries.
|
||||
|
||||
**Value:** Easy to install and maintain. Low dependency risk.
|
||||
|
||||
### 7. Multi-Service Coverage
|
||||
|
||||
Integrates with multiple authoritative sources (MusicBrainz, Deezer, YouTube Music).
|
||||
|
||||
**Value:** Comprehensive metadata coverage. Cross-validation potential (not currently implemented).
|
||||
|
||||
## Weaknesses
|
||||
|
||||
### 1. Pre-Release Quality (v0.0.1)
|
||||
|
||||
Version number indicates early development. Codebase confirms this.
|
||||
|
||||
**Evidence:**
|
||||
- Debug print() statements in production code
|
||||
- Commented-out code sections
|
||||
- Hardcoded configuration values
|
||||
- No automated tests
|
||||
- No CI/CD pipeline
|
||||
|
||||
**Impact:** Not suitable for production use without significant hardening.
|
||||
|
||||
### 2. No Automated Tests
|
||||
|
||||
Zero test coverage. No unit tests, no integration tests, no test framework.
|
||||
|
||||
**Testing approach:** Manual testing via Jupyter notebooks.
|
||||
|
||||
**Impact:**
|
||||
- No regression detection
|
||||
- Difficult to refactor safely
|
||||
- No confidence in correctness
|
||||
- Breaking changes undetected
|
||||
|
||||
**Risk:** High. Changes may introduce bugs undetected until runtime.
|
||||
|
||||
### 3. No CI/CD
|
||||
|
||||
No GitHub Actions, no Travis CI, no automated builds or releases.
|
||||
|
||||
**Impact:**
|
||||
- No automated quality gates
|
||||
- No automated testing on commits
|
||||
- Manual release process
|
||||
- No deployment automation
|
||||
|
||||
### 4. Debug Prints in Production Code
|
||||
|
||||
Multiple print() statements throughout codebase.
|
||||
|
||||
```python
print(f"DEBUG: Querying MusicBrainz for {artist} - {track}")
print(f"Found MBID: {mbid}")
```
|
||||
|
||||
**Impact:**
|
||||
- Pollutes output
|
||||
- Can't be disabled without code changes
|
||||
- No log levels or timestamps
|
||||
- Unprofessional appearance
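
A minimal sketch of how the same messages could go through the standard `logging` module instead, so callers control level and format. The logger name and the `log_lookup` helper are assumptions for illustration, not part of MusicMetaLinker.

```python
import logging

# Hypothetical logger name; MusicMetaLinker defines no logger of its own.
logger = logging.getLogger("musicmetalinker.sketch")


def log_lookup(artist, track, mbid=None):
    """Report a lookup via logging rather than print()."""
    # DEBUG replaces the "DEBUG: ..." print; hidden unless the caller enables it.
    logger.debug("Querying MusicBrainz for %s - %s", artist, track)
    if mbid is not None:
        # INFO replaces the unconditional "Found MBID" print.
        logger.info("Found MBID: %s", mbid)
    return mbid
```

A caller who wants timestamps and levels can opt in with `logging.basicConfig(level=logging.DEBUG, format="%(asctime)s %(levelname)s %(message)s")`. Without that call, DEBUG and INFO messages stay silent.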
|
||||
|
||||
### 5. Hardcoded Configuration
|
||||
|
||||
All configuration values hardcoded in source files.
|
||||
|
||||
**Examples:**
|
||||
- User-Agent: "elka/0.1" (appears to be from parent project)
|
||||
- Duration thresholds: 3s (Deezer), 5s (MusicBrainz)
|
||||
- Similarity threshold: 0.8
|
||||
- API endpoints
|
||||
|
||||
**Impact:**
|
||||
- No runtime configuration
|
||||
- Changing thresholds requires code modification
|
||||
- No environment-specific settings
|
||||
- Can't A/B test matching strategies
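
One possible shape for runtime configuration, sketched under assumptions: the `MatchConfig` class, the `config_from_env` helper, and the `MML_*` variable names are all hypothetical. The defaults mirror the hardcoded values listed above.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class MatchConfig:
    """Hypothetical settings object; defaults copy the hardcoded values."""
    user_agent: str = "elka/0.1"
    deezer_duration_threshold: float = 3.0       # seconds
    musicbrainz_duration_threshold: float = 5.0  # seconds
    similarity_threshold: float = 0.8


def config_from_env(env=None):
    """Build a MatchConfig, letting MML_* variables override the defaults."""
    env = os.environ if env is None else env
    base = MatchConfig()
    return MatchConfig(
        user_agent=env.get("MML_USER_AGENT", base.user_agent),
        deezer_duration_threshold=float(
            env.get("MML_DEEZER_DURATION", base.deezer_duration_threshold)),
        musicbrainz_duration_threshold=float(
            env.get("MML_MUSICBRAINZ_DURATION", base.musicbrainz_duration_threshold)),
        similarity_threshold=float(
            env.get("MML_SIMILARITY", base.similarity_threshold)),
    )
```

Passing a plain dict as `env` makes the function easy to test and allows two threshold sets to be compared side by side.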
|
||||
|
||||
### 6. Not on PyPI
|
||||
|
||||
Only installable from GitHub. Not published to PyPI.
|
||||
|
||||
```bash
pip install git+https://github.com/andreamust/MusicMetaLinker.git
```
|
||||
|
||||
**Impact:**
|
||||
- Requires git installed
|
||||
- No release-based version pinning (only a commit hash or tag can be pinned)
|
||||
- No offline installation
|
||||
- Less discoverable
|
||||
|
||||
### 7. Missing mml_secrets.py
|
||||
|
||||
Spotify credentials required in external file not in repository.
|
||||
|
||||
**Impact:**
|
||||
- Users must create file manually
|
||||
- No documentation for obtaining credentials
|
||||
- Confusing error if file missing
|
||||
- Poor user experience
|
||||
|
||||
### 8. AcousticBrainz Integration Broken
|
||||
|
||||
AcousticBrainz shut down in 2022. Integration always returns None.
|
||||
|
||||
**Impact:**
|
||||
- Dead code in codebase
|
||||
- Wasted execution time
|
||||
- Misleading CSV output (acousticbrainz column always null)
|
||||
- Maintenance burden
|
||||
|
||||
**Recommendation:** Remove entirely.
|
||||
|
||||
### 9. No Rate Limiting
|
||||
|
||||
No rate limiting for API calls. Risk of being blocked by services.
|
||||
|
||||
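**MusicBrainz:** Recommends 1 request/second. MusicMetaLinker adds no limiter of its own; any throttling comes from the musicbrainzngs client's built-in default, which applies per process and can be disabled by the caller.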
|
||||
|
||||
**Deezer, YouTube Music:** Unknown limits. Not enforced.
|
||||
|
||||
**Impact:**
|
||||
- Risk of IP bans
|
||||
- Risk of service degradation
|
||||
- Batch processing may fail partway through
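
A minimal sketch of a per-service limiter that enforces a minimum interval between calls. `MinIntervalLimiter` is a hypothetical helper; the 1-second default follows the MusicBrainz recommendation above, and suitable values for Deezer and YouTube Music remain unknown.

```python
import time


class MinIntervalLimiter:
    """Block until at least `min_interval` seconds have passed since the last call."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock    # injectable for testing
        self._sleep = sleep    # injectable for testing
        self._last = None      # time of the previous permitted call

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling `limiter.wait()` immediately before each request keeps a batch run under the chosen rate. This version is not thread-safe; a lock would be needed if requests are issued from several threads.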
|
||||
|
||||
### 10. Silent Error Handling
|
||||
|
||||
All errors suppressed. Failed queries return None.
|
||||
|
||||
```python
try:
    result = service.query()
except:
    return None
```
|
||||
|
||||
**Impact:**
|
||||
- No distinction between "not found" and "service error"
|
||||
- No error messages
|
||||
- Difficult debugging
|
||||
- No visibility into failures
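
A sketch of one way to separate "not found" from "service error". `ServiceError` and `lookup` are hypothetical names, and the set of exceptions caught is an assumption; real clients raise their own exception types.

```python
class ServiceError(Exception):
    """Hypothetical exception: the remote service failed or was unreachable."""


def lookup(query_fn, *args):
    """Return the result, or None when nothing matched.

    Transport failures are re-raised as ServiceError instead of being
    collapsed into None, so callers can tell the two cases apart.
    """
    try:
        result = query_fn(*args)
    except (ConnectionError, TimeoutError) as exc:
        raise ServiceError(f"lookup failed: {exc}") from exc
    # Empty results (e.g. an empty list) count as "not found".
    return result or None
```

With this split, a batch job can retry or log `ServiceError` cases and record only genuine misses as null.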
|
||||
|
||||
### 11. YouTube Matching Weakness
|
||||
|
||||
YouTube Music matching is weak. First result assumed correct. No duration filtering (commented out).
|
||||
|
||||
**Impact:**
|
||||
- High false positive rate
|
||||
- Incorrect YouTube links
|
||||
- Low confidence in YouTube results
|
||||
|
||||
**Recommendation:** Improve matching logic or remove YouTube integration.
|
||||
|
||||
### 12. No Input Validation
|
||||
|
||||
No validation of input parameters.
|
||||
|
||||
**Accepted without validation:**
|
||||
- Invalid MBIDs (wrong format, non-existent)
|
||||
- Invalid ISRCs (wrong format, non-existent)
|
||||
- Negative durations
|
||||
- Empty strings
|
||||
|
||||
**Impact:**
|
||||
- Silent failures
|
||||
- Wasted API calls
|
||||
- Confusing behavior
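
The checks below are a sketch of format-level validation only; they cannot tell whether an identifier exists in any service. The function names are hypothetical. The ISRC pattern assumes the usual 12-character layout (2 letters, 3 alphanumerics, 7 digits) with optional hyphens.

```python
import re
import uuid

# Country code (2 letters), registrant (3 alphanumerics), year + designation (7 digits).
ISRC_RE = re.compile(r"[A-Z]{2}[A-Z0-9]{3}[0-9]{7}")


def is_valid_mbid(value):
    """True if value is a canonical hyphenated UUID string."""
    if not isinstance(value, str):
        return False
    try:
        return str(uuid.UUID(value)) == value.lower()
    except ValueError:
        return False


def is_valid_isrc(value):
    """True if value looks like an ISRC, with or without hyphens."""
    if not isinstance(value, str):
        return False
    return ISRC_RE.fullmatch(value.replace("-", "").upper()) is not None


def is_valid_duration(value):
    """True for positive int/float durations (bool is rejected)."""
    return (isinstance(value, (int, float))
            and not isinstance(value, bool)
            and value > 0)
```

Rejecting malformed input before any request is made avoids the wasted API calls noted above.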
|
||||
|
||||
### 13. No Cross-Service Validation
|
||||
|
||||
Results from different services not compared or validated.
|
||||
|
||||
**Example:** If MusicBrainz returns artist "The Beatles" and Deezer returns "Beatles", no reconciliation.
|
||||
|
||||
**Impact:**
|
||||
- Inconsistent results
|
||||
- No confidence scoring
|
||||
- No conflict resolution
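
A rough sketch of how values from different services could be reconciled into an agreement score. `normalize_artist` and `agreement` are hypothetical helpers, and the normalization rule (dropping a leading "the") is only an illustrative assumption.

```python
from difflib import SequenceMatcher


def normalize_artist(name):
    """Lowercase, collapse whitespace, and drop a leading 'the '."""
    cleaned = " ".join(name.lower().split())
    if cleaned.startswith("the "):
        cleaned = cleaned[4:]
    return cleaned


def agreement(values):
    """Mean pairwise similarity of the normalized values, in [0, 1].

    Returns None when fewer than two non-empty values are available.
    """
    normalized = [normalize_artist(v) for v in values if v]
    if len(normalized) < 2:
        return None
    pairs = [(a, b)
             for i, a in enumerate(normalized)
             for b in normalized[i + 1:]]
    total = sum(SequenceMatcher(None, a, b).ratio() for a, b in pairs)
    return total / len(pairs)
```

For the example above, "The Beatles" and "Beatles" normalize to the same string, so the score is 1.0. A low score could flag a record for manual review.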
|
||||
|
||||
### 14. No Persistent Caching
|
||||
|
||||
No caching across Align instances. Repeated queries for same track.
|
||||
|
||||
**Impact:**
|
||||
- Wasted API calls
|
||||
- Slow batch processing
|
||||
- High network usage
|
||||
- Risk of rate limiting
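
A minimal sketch of a cache that survives across `Align` instances and across runs, using only the standard library. `LookupCache` is hypothetical; keys such as `"isrc:GBAYE9200070"` are an assumed convention, and cached values must be JSON-serializable.

```python
import json
import sqlite3


class LookupCache:
    """Tiny key/value cache backed by SQLite."""

    def __init__(self, path=":memory:"):
        # Pass a file path instead of ":memory:" to persist between runs.
        self._db = sqlite3.connect(path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS cache "
            "(key TEXT PRIMARY KEY, value TEXT NOT NULL)"
        )

    def get_or_fetch(self, key, fetch):
        """Return the cached value for key, calling fetch() only on a miss."""
        row = self._db.execute(
            "SELECT value FROM cache WHERE key = ?", (key,)
        ).fetchone()
        if row is not None:
            return json.loads(row[0])
        value = fetch()
        self._db.execute(
            "INSERT OR REPLACE INTO cache (key, value) VALUES (?, ?)",
            (key, json.dumps(value)),
        )
        self._db.commit()
        return value
```

Note that a `None` result is cached too, so repeated misses are not re-queried. Entries never expire in this sketch; a timestamp column would be needed for that.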
|
||||
|
||||
### 15. Single-Threaded Execution
|
||||
|
||||
Sequential API calls. No parallelization.
|
||||
|
||||
**Impact:**
|
||||
- Slow batch processing (latency multiplied by number of tracks)
|
||||
- Underutilized network bandwidth
|
||||
- Poor performance at scale
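
Because the work is I/O-bound, a thread pool is one simple way to overlap requests. This is a sketch: `link_all` and the `link_one` callable are hypothetical, and the worker count would have to respect each service's rate limits.

```python
from concurrent.futures import ThreadPoolExecutor


def link_all(tracks, link_one, max_workers=4):
    """Apply link_one to every track concurrently, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as `tracks`.
        return list(pool.map(link_one, tracks))
```

An exception raised inside `link_one` surfaces when its result is consumed, so failures are not silently dropped.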
|
||||
|
||||
## Use Case Evaluation
|
||||
|
||||
### Academic Research
|
||||
|
||||
**Suitability:** Moderate
|
||||
|
||||
**Strengths:**
|
||||
- JAMS format support
|
||||
- Batch processing
|
||||
- Multi-service coverage
|
||||
- MIT license
|
||||
|
||||
**Weaknesses:**
|
||||
- No tests (can't verify correctness)
|
||||
- Broken integrations (AcousticBrainz)
|
||||
- Weak YouTube matching
|
||||
- No documentation
|
||||
|
||||
**Recommendation:** Usable for exploratory research. Not suitable for published results without validation.
|
||||
|
||||
### Dataset Preparation
|
||||
|
||||
**Suitability:** Moderate
|
||||
|
||||
**Strengths:**
|
||||
- Batch processing with progress tracking
|
||||
- CSV output
|
||||
- JAMS enrichment
|
||||
- Cascading fallback
|
||||
|
||||
**Weaknesses:**
|
||||
- No rate limiting (risk of being blocked)
|
||||
- No caching (slow for large datasets)
|
||||
- No parallelization (slow)
|
||||
- Silent failures (incomplete datasets)
|
||||
|
||||
**Recommendation:** Usable for small to medium datasets (hundreds to thousands of tracks). Not suitable for large-scale datasets (millions of tracks) without optimization.
|
||||
|
||||
### Production Music Applications
|
||||
|
||||
**Suitability:** Low
|
||||
|
||||
**Strengths:**
|
||||
- Simple API
|
||||
- Multi-service coverage
|
||||
|
||||
**Weaknesses:**
|
||||
- No tests
|
||||
- No error handling
|
||||
- No monitoring
|
||||
- No rate limiting
|
||||
- Pre-release quality
|
||||
- Hardcoded configuration
|
||||
- Dead code
|
||||
|
||||
**Recommendation:** Not suitable for production without significant refactoring. Consider as reference implementation only.
|
||||
|
||||
### Metadata Enrichment Service
|
||||
|
||||
**Suitability:** Low
|
||||
|
||||
**Strengths:**
|
||||
- Cascading fallback pattern
|
||||
- Multi-service integration
|
||||
|
||||
**Weaknesses:**
|
||||
- No async support
|
||||
- No caching
|
||||
- No rate limiting
|
||||
- No error handling
|
||||
- No monitoring
|
||||
- Single-threaded
|
||||
|
||||
**Recommendation:** Core concept applicable. Implementation needs complete rewrite for production service.
|
||||
|
||||
## Integration Assessment
|
||||
|
||||
### Integration into Metadata Aggregator
|
||||
|
||||
**Conceptual value:** High. Cascading fallback pattern and multi-service aggregation are sound architectural patterns.
|
||||
|
||||
**Implementation value:** Low. Pre-release quality, broken integrations, no tests.
|
||||
|
||||
**Reuse strategy:**
|
||||
|
||||
**Don't adopt the code directly.** Instead:
|
||||
|
||||
1. **Study the pattern:** Understand cascading fallback and service orchestration
|
||||
2. **Identify valuable integrations:** MusicBrainz and Deezer integrations worth studying
|
||||
3. **Reimplement the concept:** Build new implementation with proper error handling, testing, configuration
|
||||
4. **Borrow matching logic:** Duration filtering and fuzzy matching algorithms applicable
|
||||
|
||||
**Specific learnings:**
|
||||
|
||||
**Cascading fallback pattern:**
|
||||
```python
def get_identifier(self):
    # Try authoritative source first
    if self.has_mbid():
        return self.query_musicbrainz()

    # Try commercial source with ISRC
    if self.has_isrc():
        return self.query_deezer()

    # Fall back to metadata search
    return self.query_by_metadata()
```
|
||||
|
||||
**Duration filtering:**
|
||||
```python
def filter_by_duration(results, target_duration, threshold=3):
    return [r for r in results if abs(r.duration - target_duration) <= threshold]
```
|
||||
|
||||
**Fuzzy matching:**
|
||||
```python
from difflib import SequenceMatcher

def similarity(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def fuzzy_match(results, target, threshold=0.8):
    return [r for r in results if similarity(r.name, target) >= threshold]
```
|
||||
|
||||
### Integration Recommendations
|
||||
|
||||
**What to adopt:**
|
||||
- Cascading fallback pattern
|
||||
- Duration filtering approach
|
||||
- Fuzzy string matching
|
||||
- JAMS format support (if working with academic datasets)
|
||||
|
||||
**What to avoid:**
|
||||
- Direct code reuse
|
||||
- YouTube Music integration (weak matching)
|
||||
- AcousticBrainz integration (defunct)
|
||||
- Hardcoded configuration approach
|
||||
- Silent error handling pattern
|
||||
|
||||
**What to improve:**
|
||||
- Add comprehensive error handling
|
||||
- Add input validation
|
||||
- Add persistent caching
|
||||
- Add async/await for concurrency
|
||||
- Add rate limiting
|
||||
- Add cross-service validation
|
||||
- Add confidence scoring
|
||||
- Add monitoring and metrics
|
||||
|
||||
## Competitive Analysis
|
||||
|
||||
### Comparison with Alternatives
|
||||
|
||||
**MusicBrainz Picard:**
|
||||
- Desktop application for music tagging
|
||||
- More mature (v2.x)
|
||||
- GUI-based
|
||||
- Comprehensive MusicBrainz integration
|
||||
- Not a library (can't integrate programmatically)
|
||||
|
||||
**beets:**
|
||||
- Music library management tool
|
||||
- Plugin architecture
|
||||
- CLI and library API
|
||||
- Mature (v1.x)
|
||||
- More comprehensive than MusicMetaLinker
|
||||
- Heavier weight (full music library management)
|
||||
|
||||
**musicbrainzngs:**
|
||||
- Widely used, community-maintained MusicBrainz Python client
|
||||
- Focused on single service
|
||||
- Well-maintained
|
||||
- No multi-service aggregation
|
||||
- Lower-level API
|
||||
|
||||
**MusicMetaLinker positioning:**
|
||||
- Lighter than beets (focused on entity linking only)
|
||||
- Multi-service (unlike musicbrainzngs)
|
||||
- Library API (unlike Picard)
|
||||
- Less mature than all alternatives
|
||||
- Academic focus (JAMS support)
|
||||
|
||||
**Unique value proposition:** Multi-service entity linking with JAMS support for academic research.
|
||||
|
||||
**Competitive disadvantage:** Pre-release quality, no tests, limited documentation.
|
||||
|
||||
## Technical Debt Assessment
|
||||
|
||||
### High-Priority Debt
|
||||
|
||||
1. **No tests:** Blocks safe refactoring and feature development
|
||||
2. **Dead code:** AcousticBrainz integration non-functional
|
||||
3. **Debug prints:** Unprofessional, pollutes output
|
||||
4. **Hardcoded config:** Inflexible, difficult to customize
|
||||
5. **Silent errors:** Difficult debugging, poor user experience
|
||||
|
||||
**Estimated effort to address:** 2-3 weeks full-time development
|
||||
|
||||
### Medium-Priority Debt
|
||||
|
||||
1. **No rate limiting:** Risk of service blocks
|
||||
2. **No caching:** Performance and efficiency issues
|
||||
3. **No input validation:** Silent failures, wasted API calls
|
||||
4. **Single-threaded:** Performance bottleneck
|
||||
5. **No CI/CD:** Manual testing and releases
|
||||
|
||||
**Estimated effort to address:** 2-3 weeks full-time development
|
||||
|
||||
### Low-Priority Debt
|
||||
|
||||
1. **Not on PyPI:** Distribution inconvenience
|
||||
2. **No documentation:** Learning curve for new users
|
||||
3. **No type hints:** IDE support, static analysis
|
||||
4. **Inconsistent naming:** Code readability
|
||||
5. **No monitoring:** Production visibility
|
||||
|
||||
**Estimated effort to address:** 1-2 weeks full-time development
|
||||
|
||||
**Total technical debt:** 5-8 weeks full-time development to production-ready state.
|
||||
|
||||
## Risk Assessment
|
||||
|
||||
### Technical Risks
|
||||
|
||||
**High:**
|
||||
- No tests: Changes may introduce bugs
|
||||
- Broken integrations: AcousticBrainz always fails
|
||||
- No rate limiting: Risk of IP bans
|
||||
- Silent errors: Difficult debugging
|
||||
|
||||
**Medium:**
|
||||
- YouTube Music: Unofficial API may break
|
||||
- No caching: Performance issues at scale
|
||||
- Hardcoded config: Inflexible for different use cases
|
||||
|
||||
**Low:**
|
||||
- Dependency vulnerabilities: No scanning
|
||||
- Security: Plaintext credentials
|
||||
|
||||
### Operational Risks
|
||||
|
||||
**High:**
|
||||
- No monitoring: No visibility into production issues
|
||||
- No error tracking: Can't diagnose failures
|
||||
- No health checks: Can't detect service outages
|
||||
|
||||
**Medium:**
|
||||
- No CI/CD: Manual releases error-prone
|
||||
- No documentation: Difficult onboarding
|
||||
- No versioning strategy: Breaking changes unpredictable
|
||||
|
||||
**Low:**
|
||||
- No backup/recovery: Stateless, nothing to back up
|
||||
- No scaling strategy: Single-threaded, limited throughput
|
||||
|
||||
### Legal Risks
|
||||
|
||||
**Medium:**
|
||||
- YouTube Music: Reverse-engineered API may violate ToS
|
||||
- No license headers: Unclear licensing for individual files
|
||||
|
||||
**Low:**
|
||||
- MIT license: Permissive, low legal risk
|
||||
- No personal data: No GDPR concerns
|
||||
|
||||
## Recommendations
|
||||
|
||||
### For Academic Use
|
||||
|
||||
**Acceptable with caveats:**
|
||||
|
||||
1. **Validate results:** Cross-check critical metadata manually
|
||||
2. **Document limitations:** Note AcousticBrainz non-functional, YouTube matching weak
|
||||
3. **Small to medium datasets:** Hundreds to thousands of tracks, not millions
|
||||
4. **Exploratory research:** Not for published results without validation
|
||||
|
||||
**Improvements for academic use:**
|
||||
|
||||
1. Add logging to track which services provided which data
|
||||
2. Add confidence scores to indicate match quality
|
||||
3. Remove AcousticBrainz integration
|
||||
4. Document known limitations
|
||||
|
||||
### For Production Use
|
||||
|
||||
**Not recommended without significant refactoring.**
|
||||
|
||||
**Minimum requirements for production:**
|
||||
|
||||
1. **Add comprehensive test suite** (unit and integration tests)
|
||||
2. **Add error handling** (specific exceptions, logging, retry logic)
|
||||
3. **Add rate limiting** (respect service limits)
|
||||
4. **Add caching** (persistent cache for repeated queries)
|
||||
5. **Add monitoring** (metrics, health checks, error tracking)
|
||||
6. **Add configuration system** (environment variables, config files)
|
||||
7. **Remove dead code** (AcousticBrainz)
|
||||
8. **Add input validation** (validate MBIDs, ISRCs, etc.)
|
||||
9. **Add CI/CD** (automated testing and releases)
|
||||
10. **Publish to PyPI** (standard distribution)
|
||||
|
||||
**Estimated effort:** 5-8 weeks full-time development.
|
||||
|
||||
### For Integration into Metadata Aggregator
|
||||
|
||||
**Recommendation: Study the pattern, reimplement the concept.**
|
||||
|
||||
**What to learn from MusicMetaLinker:**
|
||||
|
||||
1. **Cascading fallback pattern:** Query authoritative sources first, fall back to less reliable sources
|
||||
2. **Duration filtering:** Use duration to disambiguate multiple matches
|
||||
3. **Fuzzy matching:** Use string similarity for metadata-based search
|
||||
4. **Multi-service aggregation:** Combine results from multiple sources
|
||||
5. **JAMS format:** If working with academic datasets
|
||||
|
||||
**What to implement differently:**
|
||||
|
||||
1. **Service abstraction:** Define common interface for all services
|
||||
2. **Dependency injection:** Pass service instances to orchestrator
|
||||
3. **Async/await:** Concurrent API calls for better performance
|
||||
4. **Persistent caching:** Redis or similar for cross-instance caching
|
||||
5. **Error handling:** Explicit error types, logging, retry logic
|
||||
6. **Configuration:** Runtime configuration for thresholds and endpoints
|
||||
7. **Validation:** Input validation and cross-service validation
|
||||
8. **Monitoring:** Metrics, health checks, error tracking
|
||||
9. **Testing:** Comprehensive test suite with mocked services
|
||||
10. **Documentation:** API documentation, usage examples, deployment guide
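
Items 1 and 2 (service abstraction and dependency injection) can be sketched as follows. `MetadataService`, `Orchestrator`, and the `lookup` method are hypothetical names chosen for illustration; nothing like them exists in MusicMetaLinker.

```python
from typing import Optional, Protocol, Sequence


class MetadataService(Protocol):
    """Common interface every provider is assumed to implement."""
    name: str

    def lookup(self, query: dict) -> Optional[dict]:
        """Return metadata for the query, or None if nothing matched."""
        ...


class Orchestrator:
    """Cascading fallback over injected services, in priority order."""

    def __init__(self, services: Sequence[MetadataService]):
        self._services = list(services)

    def resolve(self, query: dict) -> Optional[dict]:
        for service in self._services:
            result = service.lookup(query)
            if result is not None:
                # Record which source answered, for later auditing.
                return {"source": service.name, **result}
        return None
```

Because services are passed in rather than constructed internally, tests can supply stubs and the fallback order becomes a configuration choice.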
|
||||
|
||||
## Overall Assessment
|
||||
|
||||
### Strengths Summary
|
||||
|
||||
- Simple, clean API
|
||||
- Sound architectural pattern (cascading fallback)
|
||||
- JAMS format support for academic use
|
||||
- Batch processing capabilities
|
||||
- MIT license
|
||||
- Minimal dependencies
|
||||
|
||||
### Weaknesses Summary
|
||||
|
||||
- Pre-release quality (v0.0.1)
|
||||
- No automated tests
|
||||
- No CI/CD
|
||||
- Debug code in production
|
||||
- Hardcoded configuration
|
||||
- Broken integrations (AcousticBrainz)
|
||||
- Weak YouTube matching
|
||||
- No rate limiting
|
||||
- Silent error handling
|
||||
- Not on PyPI
|
||||
|
||||
### Final Verdict
|
||||
|
||||
**Academic value:** Moderate. Useful for exploratory research and dataset preparation. Not suitable for published results without validation.
|
||||
|
||||
**Production value:** Low. Requires 5-8 weeks of development to reach production readiness.
|
||||
|
||||
**Integration value:** Moderate. Core concept (cascading fallback, multi-service aggregation) is valuable. Implementation should be studied but not directly adopted.
|
||||
|
||||
**Recommendation:** Use MusicMetaLinker as a reference implementation to understand entity linking patterns. Reimplement the concept with proper error handling, testing, and production hardening for serious use.
|
||||
|
||||
**Best use case:** Academic research projects with small to medium datasets where perfect accuracy is not critical and manual validation is feasible.
|
||||
|
||||
**Avoid for:** Production music applications, large-scale dataset processing, published research results, commercial products.
|
||||
|
||||
### Relevance Score
|
||||
|
||||
**Conceptual relevance:** 8/10. Cascading fallback and multi-service aggregation are highly relevant patterns.
|
||||
|
||||
**Implementation relevance:** 3/10. Pre-release quality, broken integrations, no tests make direct adoption inadvisable.
|
||||
|
||||
**Overall relevance:** 5/10. Study the pattern, don't adopt the code.
|
||||
@@ -0,0 +1,662 @@
|
||||
# MusicMetaLinker Integrations
|
||||
|
||||
## Integration Architecture
|
||||
|
||||
MusicMetaLinker integrates with five external services:
|
||||
1. MusicBrainz (open music encyclopedia)
|
||||
2. Deezer (commercial streaming service)
|
||||
3. YouTube Music (commercial streaming service)
|
||||
4. AcousticBrainz (audio analysis database - defunct)
|
||||
5. Spotify (commercial streaming service - limited use)
|
||||
|
||||
Each integration uses a different library and authentication approach.
|
||||
|
||||
## MusicBrainz Integration
|
||||
|
||||
### Library and Authentication
|
||||
|
||||
**Library:** musicbrainzngs (widely used, community-maintained Python client)
|
||||
|
||||
**Authentication:** None required for read-only queries.
|
||||
|
||||
**User-Agent:** Required by MusicBrainz API. Hardcoded as "elka/0.1" (appears to be from parent project, not MusicMetaLinker-specific).
|
||||
|
||||
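**Rate limiting:** MusicBrainz recommends 1 request/second. MusicMetaLinker adds no limiter of its own; musicbrainzngs applies a built-in per-process rate limit by default unless the caller disables it.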
|
||||
|
||||
### API Endpoints
|
||||
|
||||
All queries go through musicbrainzngs library, which handles endpoint construction.
|
||||
|
||||
**Base URL:** https://musicbrainz.org/ws/2/
|
||||
|
||||
**Endpoints used:**
|
||||
- Recording search: /recording?query=...
|
||||
- Recording lookup: /recording/{mbid}
|
||||
- ISRC search: /isrc/{isrc}
|
||||
|
||||
### Query Patterns
|
||||
|
||||
**By MBID (most reliable):**
|
||||
|
||||
```python
import musicbrainzngs as mb

mb.set_useragent("elka", "0.1")
result = mb.get_recording_by_id(
    mbid,
    includes=["artists", "releases", "isrcs"]
)
```
|
||||
|
||||
**includes parameter:** Fetches related entities in single request. Reduces API calls.
|
||||
|
||||
**By ISRC:**
|
||||
|
||||
```python
result = mb.get_recordings_by_isrc(
    isrc,
    includes=["artists", "releases", "isrcs"]
)
```
|
||||
|
||||
Returns list of recordings with that ISRC. Multiple recordings possible (different releases, remasters).
|
||||
|
||||
**By metadata:**
|
||||
|
||||
```python
query = f'artist:"{artist}" AND recording:"{track}"'
if album:
    query += f' AND release:"{album}"'

result = mb.search_recordings(
    query=query,
    limit=10
)
```
|
||||
|
||||
Lucene query syntax. Quoted strings for exact matching. Returns ranked results.
|
||||
|
||||
### Response Parsing
|
||||
|
||||
**Recording structure:**
|
||||
|
||||
```python
{
    "recording": {
        "id": "mbid-uuid",
        "title": "Track Name",
        "length": 123456,  # milliseconds
        "artist-credit": [
            {"artist": {"name": "Artist Name"}}
        ],
        "release-list": [
            {
                "title": "Album Name",
                "date": "2020-01-15",
                "track-list": [
                    {"number": "1"}
                ]
            }
        ],
        "isrc-list": ["GBAYE9200070"]
    }
}
```
|
||||
|
||||
**Extraction logic:**
|
||||
|
||||
- **MBID:** recording.id
|
||||
- **Track:** recording.title
|
||||
- **Artist:** recording.artist-credit[0].artist.name (first artist only)
|
||||
- **Duration:** recording.length / 1000 (convert milliseconds to seconds)
|
||||
- **Album:** recording.release-list[0].title (first release only)
|
||||
- **Release date:** recording.release-list[0].date
|
||||
- **Track number:** recording.release-list[0].track-list[0].number
|
||||
- **ISRC:** recording.isrc-list[0] (first ISRC only)
|
||||
|
||||
**Multiple values:** MusicBrainz returns lists for artists, releases, ISRCs. MusicMetaLinker takes first value only. No aggregation or selection logic.
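
The extraction rules above can be sketched as a single defensive function. `extract_recording` is a hypothetical helper written against the structure shown in this section; it is not MusicMetaLinker's code, and it keeps the same first-value-only behaviour.

```python
def extract_recording(response):
    """Flatten a recording response into the fields listed above.

    Missing keys yield None instead of raising KeyError or IndexError.
    """
    rec = response.get("recording") or {}
    credits = rec.get("artist-credit") or []
    releases = rec.get("release-list") or []
    isrcs = rec.get("isrc-list") or []
    release = releases[0] if releases else {}
    tracks = release.get("track-list") or []
    length = rec.get("length")

    # Skip non-dict credit entries (assumed to be join phrases such as " & ").
    artist = next(
        (c["artist"]["name"] for c in credits if isinstance(c, dict)), None
    )
    return {
        "mbid": rec.get("id"),
        "track": rec.get("title"),
        "artist": artist,
        # int() also covers clients that return the length as a string.
        "duration": int(length) / 1000 if length is not None else None,
        "album": release.get("title"),
        "release_date": release.get("date"),
        "track_number": tracks[0].get("number") if tracks else None,
        "isrc": isrcs[0] if isrcs else None,
    }
```

Returning the full lists instead of `[0]` would be the natural next step toward the aggregation logic that is currently missing.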
|
||||
|
||||
### Filtering and Matching
|
||||
|
||||
**Duration filtering:**
|
||||
|
||||
```python
if duration:
    matches = [r for r in results if abs(r['length']/1000 - duration) < 5]
```
|
||||
|
||||
±5 second threshold for metadata searches. Hardcoded.
|
||||
|
||||
**Fuzzy string matching:**
|
||||
|
||||
Uses difflib.SequenceMatcher for artist/track/album similarity.
|
||||
|
||||
```python
from difflib import SequenceMatcher

def similarity(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Match if similarity > 0.8 (80%)
```
|
||||
|
||||
Threshold hardcoded at 0.8. No configuration option.
|
||||
|
||||
### Error Handling
|
||||
|
||||
**Network errors:** Caught and suppressed. Returns None.
|
||||
|
||||
**Invalid MBID:** Returns None.
|
||||
|
||||
**No results:** Returns None.
|
||||
|
||||
**Rate limiting:** No handling. If rate limited, returns None.
|
||||
|
||||
### Integration Strengths
|
||||
|
||||
1. **Established library:** musicbrainzngs is a widely used, community-maintained client
|
||||
2. **Rich metadata:** Comprehensive music information
|
||||
3. **No authentication:** Easy to use
|
||||
4. **Includes parameter:** Efficient data fetching
|
||||
5. **Authoritative source:** MusicBrainz is treated as the reference source for music metadata in this workflow
|
||||
|
||||
### Integration Weaknesses
|
||||
|
||||
1. **Hardcoded User-Agent:** "elka/0.1" not specific to MusicMetaLinker
|
||||
2. **No rate limiting:** Risk of being blocked
|
||||
3. **First-value-only:** Ignores multiple artists, releases, ISRCs
|
||||
4. **Hardcoded thresholds:** Duration (5s), similarity (0.8) not configurable
|
||||
5. **No error visibility:** Silent failures
|
||||
|
||||
## Deezer Integration
|
||||
|
||||
### Library and Authentication
|
||||
|
||||
**Library:** deezer-python (community library, not official)
|
||||
|
||||
**Authentication:** None required for search API.
|
||||
|
||||
**Rate limiting:** Unknown. Not documented. Not enforced by MusicMetaLinker.
|
||||
|
||||
### API Endpoints
|
||||
|
||||
deezer-python library handles endpoint construction.
|
||||
|
||||
**Base URL:** https://api.deezer.com/
|
||||
|
||||
**Endpoints used:**
|
||||
- Track search: /search/track?q=...
|
||||
- ISRC search: /track/isrc:{isrc}
|
||||
|
||||
### Query Patterns
|
||||
|
||||
**By ISRC (preferred):**
|
||||
|
||||
```python
import deezer

client = deezer.Client()
result = client.search(f'isrc:{isrc}', relation='track')
```
|
||||
|
||||
Returns list of tracks with that ISRC. Usually single result.
|
||||
|
||||
**By metadata:**
|
||||
|
||||
```python
query = f'{artist} {track}'
if album:
    query += f' {album}'

result = client.search(query, relation='track')
```
|
||||
|
||||
Simple concatenation. No advanced query syntax.
|
||||
|
||||
### Response Parsing
|
||||
|
||||
**Track structure:**
|
||||
|
||||
```python
{
    "id": 123456789,
    "title": "Track Name",
    "duration": 123,  # seconds
    "artist": {
        "name": "Artist Name"
    },
    "album": {
        "title": "Album Name"
    },
    "release_date": "2020-01-15",
    "bpm": 120,
    "isrc": "GBAYE9200070",
    "rank": 500000
}
```
|
||||
|
||||
**Extraction logic:**
|
||||
|
||||
- **Deezer ID:** track.id
|
||||
- **Track:** track.title
|
||||
- **Artist:** track.artist.name
|
||||
- **Album:** track.album.title
|
||||
- **Duration:** track.duration (already in seconds)
|
||||
- **Release date:** track.release_date
|
||||
- **BPM:** track.bpm
|
||||
- **ISRC:** track.isrc
|
||||
- **Rank:** track.rank (popularity metric)
|
||||
|
||||
### Filtering and Matching
|
||||
|
||||
**Duration filtering (critical for Deezer):**
|
||||
|
||||
```python
duration_threshold = 3  # seconds

matches = [
    t for t in results
    if abs(t.duration - duration) <= duration_threshold
]
```
|
||||
|
||||
±3 second threshold. Configurable via parameter but defaults to 3.
|
||||
|
||||
**Why critical:** Deezer returns many versions of same track (radio edit, album version, remaster, live). Duration filtering essential for correct match.
|
||||
|
||||
**Fuzzy matching:**
|
||||
|
||||
Same difflib.SequenceMatcher approach as MusicBrainz. 0.8 similarity threshold.
|
||||
|
||||
**Ranking:**
|
||||
|
||||
If multiple matches after filtering, selects highest rank (most popular version).
|
||||
|
||||
```python
best_match = max(matches, key=lambda t: t.rank)
```
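
The three steps described in this section (duration filter, fuzzy title match, rank selection) can be combined as in the sketch below. `Track` is a hypothetical stand-in for the objects returned by the Deezer client, and `pick_best` is not a MusicMetaLinker function.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class Track:
    """Hypothetical stand-in for a Deezer track result."""
    title: str
    duration: int  # seconds
    rank: int      # popularity metric


def pick_best(results, title, duration,
              duration_threshold=3, similarity_threshold=0.8):
    """Filter by duration and title similarity, then take the highest rank."""
    candidates = [
        t for t in results
        if abs(t.duration - duration) <= duration_threshold
        and SequenceMatcher(None, t.title.lower(), title.lower()).ratio()
        >= similarity_threshold
    ]
    # default=None avoids the ValueError max() raises on an empty list.
    return max(candidates, key=lambda t: t.rank, default=None)
```

Unlike a bare `max(matches, ...)`, this version returns `None` when no candidate survives filtering.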
|
||||
|
||||
### Error Handling
|
||||
|
||||
**Network errors:** Caught and suppressed. Returns None.
|
||||
|
||||
**Invalid ISRC:** Returns empty list, treated as no match.
|
||||
|
||||
**No results:** Returns None.
|
||||
|
||||
### Integration Strengths
|
||||
|
||||
1. **Strong ISRC support:** Deezer has comprehensive ISRC coverage
|
||||
2. **Duration filtering:** Effective for disambiguating versions
|
||||
3. **Popularity ranking:** Helps select canonical version
|
||||
4. **BPM data:** Only source of BPM in MusicMetaLinker
|
||||
5. **Fast API:** Generally faster than MusicBrainz
|
||||
|
||||
### Integration Weaknesses
|
||||
|
||||
1. **Unofficial library:** deezer-python not maintained by Deezer
|
||||
2. **No authentication:** Limited to public API (no user-specific data)
|
||||
3. **Simple search:** No advanced query syntax
|
||||
4. **Fixed default threshold:** the 3-second duration default may not suit all use cases
|
||||
5. **Commercial bias:** Deezer catalog may not include obscure or independent releases
|
||||
|
||||
## YouTube Music Integration
|
||||
|
||||
### Library and Authentication
|
||||
|
||||
**Library:** ytmusicapi (unofficial, reverse-engineered API)
|
||||
|
||||
**Authentication:** None required for search.
|
||||
|
||||
**Rate limiting:** Unknown. YouTube may block aggressive usage.
|
||||
|
||||
### API Endpoints
|
||||
|
||||
ytmusicapi reverse-engineers YouTube Music web interface. No official API.
|
||||
|
||||
**Endpoints:** Internal to ytmusicapi. Not exposed to MusicMetaLinker.
|
||||
|
||||
### Query Patterns
|
||||
|
||||
**By metadata only:**
|
||||
|
||||
```python
from ytmusicapi import YTMusic

ytmusic = YTMusic()
query = f'{artist} {track} {album}'
results = ytmusic.search(query, filter='songs')
```
|
||||
|
||||
**filter='songs':** Excludes videos, albums, playlists. Returns only song results.
|
||||
|
||||
**No ISRC support:** YouTube Music API doesn't support ISRC search.
|
||||
|
||||
**No MBID support:** YouTube Music doesn't use MBIDs.
|
||||
|
||||
### Response Parsing
|
||||
|
||||
**Song structure:**
|
||||
|
||||
```python
{
    "videoId": "dQw4w9WgXcQ",
    "title": "Track Name",
    "artists": [
        {"name": "Artist Name"}
    ],
    "album": {
        "name": "Album Name"
    },
    "duration": "7:11",  # string format
    "duration_seconds": 431
}
```
|
||||
|
||||
**Extraction logic:**
|
||||
|
||||
- **YouTube ID:** result.videoId
|
||||
- **YouTube URL:** f"https://www.youtube.com/watch?v={videoId}"
|
||||
- **Track:** result.title
|
||||
- **Artist:** result.artists[0].name (first artist only)
|
||||
- **Album:** result.album.name
|
||||
|
||||
### Filtering and Matching
|
||||
|
||||
**No duration filtering:** Duration filtering code commented out in MusicMetaLinker.
|
||||
|
||||
```python
# if duration:
#     matches = [r for r in results if abs(r['duration_seconds'] - duration) < 10]
```
|
||||
|
||||
**Why commented out:** Unknown. Possibly unreliable duration data from YouTube.
|
||||
|
||||
**No fuzzy matching:** First result assumed correct.
|
||||
|
||||
```python
best_match = results[0] if results else None
```
|
||||
|
||||
**Critical weakness:** High false positive rate. No validation that first result is correct match.
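
A sketch of a light validation step that could replace the unconditional `results[0]`. Both helpers are hypothetical. The 10-second tolerance mirrors the commented-out code, and the fallback to the `"duration"` string assumes the result shape shown above.

```python
def parse_duration(text):
    """Convert 'M:SS' or 'H:MM:SS' into seconds, e.g. '7:11' -> 431."""
    seconds = 0
    for part in text.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


def first_plausible(results, duration=None, tolerance=10):
    """Return the first result whose duration is within tolerance.

    With no target duration, this degrades to the first-result behaviour.
    """
    for r in results:
        if duration is None:
            return r
        found = r.get("duration_seconds")
        if found is None and r.get("duration"):
            found = parse_duration(r["duration"])
        if found is not None and abs(found - duration) < tolerance:
            return r
    return None
```

A title similarity check could be added in the same loop to reduce false positives further.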
|
||||
|
||||
### Error Handling
|
||||
|
||||
**Network errors:** Caught and suppressed. Returns None.
|
||||
|
||||
**No results:** Returns None.
|
||||
|
||||
**API changes:** ytmusicapi may break if YouTube changes web interface. No error handling for this.
|
||||
|
||||
### Integration Strengths
|
||||
|
||||
1. **Broad coverage:** YouTube Music has extensive catalog
|
||||
2. **No authentication:** Easy to use
|
||||
3. **Filter parameter:** Excludes non-song results
|
||||
|
||||
### Integration Weaknesses
|
||||
|
||||
1. **Unofficial API:** Reverse-engineered, fragile to changes
|
||||
2. **No duration filtering:** Commented out, high false positive rate
|
||||
3. **First-result-only:** No ranking or validation
|
||||
4. **No ISRC support:** Can't use authoritative identifiers
|
||||
5. **Legal risk:** Reverse-engineered APIs may violate ToS
|
||||
6. **No error handling:** API breakage causes silent failures
|
||||
|
||||
## AcousticBrainz Integration
|
||||
|
||||
### Library and Authentication
|
||||
|
||||
**Library:** requests (direct HTTP calls)
|
||||
|
||||
**Authentication:** None.
|
||||
|
||||
### API Endpoints
|
||||
|
||||
**Base URL:** https://acousticbrainz.org/
|
||||
|
||||
**Endpoint:** /{mbid}
|
||||
|
||||
### Query Pattern
|
||||
|
||||
```python
import requests

def acousticbrainz_link(mbid):
    url = f"https://acousticbrainz.org/{mbid}"
    response = requests.get(url)
    return url if response.status_code == 200 else None
```
|
||||
|
||||
Simple HTTP GET. Returns URL if MBID exists, None otherwise.
|
||||
|
||||
### Critical Issue: Service Shutdown
|
||||
|
||||
**AcousticBrainz shut down in 2022.** All queries return 404.
|
||||
|
||||
**Impact:** This integration is completely non-functional. Dead code.
|
||||
|
||||
**Why still in codebase:** Unknown. Possibly not updated since shutdown.
|
||||
|
||||
**Recommendation:** Remove this integration entirely.
|
||||
|
||||
### Integration Strengths
|
||||
|
||||
None. Service is defunct.
|
||||
|
||||
### Integration Weaknesses
|
||||
|
||||
1. **Service shut down:** Non-functional
|
||||
2. **Dead code:** Wastes execution time
|
||||
3. **Misleading output:** CSV includes acousticbrainz column (always null)
|
||||
4. **No deprecation notice:** Code doesn't warn users
|
||||
|
||||
## Spotify Integration
|
||||
|
||||
### Library and Authentication
|
||||
|
||||
**Library:** spotipy (community-maintained Python client for the Spotify Web API; not an official Spotify product)
|
||||
|
||||
**Authentication:** OAuth2 client credentials flow.
|
||||
|
||||
**Credentials:** Stored in external mml_secrets.py file (not in repository).
|
||||
|
||||
**mml_secrets.py structure:**
|
||||
|
||||
```python
|
||||
SPOTIFY_CLIENT_ID = "your-client-id"
|
||||
SPOTIFY_CLIENT_SECRET = "your-client-secret"
|
||||
```
|
||||
|
||||
### Usage Scope
|
||||
|
||||
**Limited use:** Spotify integration only used in Billboard dataset cleaning script (prepare_dataset.py).
|
||||
|
||||
**Not used in main Align workflow.** Spotify not queried by Align class.
|
||||
|
||||
### Query Pattern
|
||||
|
||||
```python
|
||||
import spotipy
|
||||
from spotipy.oauth2 import SpotifyClientCredentials
|
||||
from mml_secrets import SPOTIFY_CLIENT_ID, SPOTIFY_CLIENT_SECRET
|
||||
|
||||
auth_manager = SpotifyClientCredentials(
|
||||
    client_id=SPOTIFY_CLIENT_ID,
|
||||
    client_secret=SPOTIFY_CLIENT_SECRET
|
||||
)
|
||||
sp = spotipy.Spotify(auth_manager=auth_manager)
|
||||
|
||||
result = sp.search(q=f'track:{track} artist:{artist}', type='track', limit=1)
|
||||
```
|
||||
|
||||
### Use Case
|
||||
|
||||
**Billboard dataset cleaning:** Extract ISRCs from Spotify for Billboard chart tracks.
|
||||
|
||||
**Workflow:**
|
||||
1. Billboard dataset has artist/track names but no ISRCs
|
||||
2. Query Spotify by artist/track
|
||||
3. Extract ISRC from Spotify result
|
||||
4. Use ISRC for subsequent MusicBrainz/Deezer queries
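
Step 3 of the workflow above can be sketched as follows. Spotify track objects carry the ISRC under `external_ids`. The helper name `extract_isrc` is hypothetical, and this is an illustration of the lookup, not the code in prepare_dataset.py.

```python
def extract_isrc(search_result):
    """Pull the ISRC of the first track out of a Spotify search response.

    Returns None when the response is empty or the track has no ISRC.
    """
    items = (search_result or {}).get("tracks", {}).get("items", [])
    if not items:
        return None
    # Track objects expose external identifiers such as the ISRC here.
    return items[0].get("external_ids", {}).get("isrc")
```

With the query pattern shown earlier this would be called as `extract_isrc(result)`. Because the search uses `limit=1`, the ISRC belongs to whatever Spotify ranks first, so it inherits the same first-result risk as any unvalidated match.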
|
||||
|
||||
### Integration Strengths
|
||||
|
||||
1. **Mature library:** spotipy is a widely used, community-maintained client
|
||||
2. **OAuth2:** Secure authentication
|
||||
3. **Rich metadata:** Comprehensive track information
|
||||
4. **ISRC support:** Spotify provides ISRCs
|
||||
|
||||
### Integration Weaknesses
|
||||
|
||||
1. **Requires credentials:** Users must register Spotify app
|
||||
2. **External secrets file:** mml_secrets.py not in repository, must be created manually
|
||||
3. **Limited use:** Only for dataset preparation, not main workflow
|
||||
4. **No documentation:** No instructions for obtaining credentials
|
||||
|
||||
## Integration Comparison
|
||||
|
||||
| Service | Library | Auth | ISRC Support | Duration Filtering | Matching Quality | Status |
|
||||
|---------|---------|------|--------------|-------------------|------------------|--------|
|
||||
| MusicBrainz | musicbrainzngs | None | Yes | ±5s | Fuzzy (0.8) | Active |
|
||||
| Deezer | deezer-python | None | Yes | ±3s | Fuzzy (0.8) + Rank | Active |
|
||||
| YouTube Music | ytmusicapi | None | No | Commented out | First result | Active (fragile) |
|
||||
| AcousticBrainz | requests | None | N/A | N/A | N/A | Defunct |
|
||||
| Spotify | spotipy | OAuth2 | Yes | N/A | N/A | Active (limited use) |
|
||||
|
||||
## Integration Orchestration
|
||||
|
||||
### Service Selection Logic
|
||||
|
||||
**Priority order:**
|
||||
|
||||
1. **MusicBrainz** if MBID provided (authoritative)
|
||||
2. **Deezer** if ISRC provided (fast, reliable)
|
||||
3. **MusicBrainz** if metadata provided (fallback)
|
||||
4. **Deezer** if metadata provided (fallback)
|
||||
5. **YouTube Music** if metadata provided (last resort)
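
The priority order above can be expressed as a small planning function. This is a sketch of the described behavior, not the library's code; `plan_queries` and the service labels are hypothetical names.

```python
def plan_queries(mbid=None, isrc=None, has_metadata=False):
    """Return the ordered list of (service, lookup_key) steps to attempt."""
    steps = []
    if mbid:
        steps.append(("musicbrainz", "mbid"))  # authoritative identifier
    if isrc:
        steps.append(("deezer", "isrc"))  # direct identifier lookup
    if has_metadata:
        # String-based searches are fallbacks, weakest matcher last.
        steps.extend([
            ("musicbrainz", "metadata"),
            ("deezer", "metadata"),
            ("youtube_music", "metadata"),
        ])
    return steps
```

Keeping the plan as data makes the order easy to test and to change without touching the query code.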
|
||||
|
||||
### Parallel vs Sequential
|
||||
|
||||
**Sequential execution:** Services queried one at a time. No parallelization.
|
||||
|
||||
**Implications:**
|
||||
- Total latency = sum of all service latencies
|
||||
- Slow for batch processing
|
||||
- Simple error handling (no race conditions)
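
For comparison, a concurrent variant could look like the sketch below, where total latency is roughly that of the slowest service instead of the sum. It uses the standard library's `ThreadPoolExecutor`. The `query_all` function and the idea of passing lookups as a name-to-callable mapping are assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def query_all(services, **metadata):
    """Run every lookup concurrently and return {service_name: result}.

    `services` maps a name to a callable that accepts the metadata as
    keyword arguments. A lookup that raises yields None for that service.
    """
    def safe_call(lookup):
        try:
            return lookup(**metadata)
        except Exception:
            return None

    # max_workers must be at least 1, even when no services are given.
    with ThreadPoolExecutor(max_workers=max(len(services), 1)) as pool:
        futures = {name: pool.submit(safe_call, fn) for name, fn in services.items()}
        return {name: future.result() for name, future in futures.items()}
```

Concurrency does not remove per-service rate limits, so a batch job would still need to throttle requests to each service.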
|
||||
|
||||
### Result Aggregation
|
||||
|
||||
**No cross-validation:** Results from different services not compared.
|
||||
|
||||
**First-wins strategy:** First successful query for each field used.
|
||||
|
||||
**Example:**
|
||||
- MBID from MusicBrainz
|
||||
- ISRC from Deezer (if not in MusicBrainz)
|
||||
- BPM from Deezer (only source)
|
||||
- YouTube link from YouTube Music
|
||||
|
||||
**No conflict resolution:** If MusicBrainz and Deezer return different artists, no reconciliation.
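
A first-wins merge that at least records disagreements could be sketched like this. The function name, the field names, and the per-service dict shape are hypothetical. The sketch only reports conflicts and does not decide which value is right.

```python
def merge_first_wins(results_by_service, priority):
    """Merge per-service field dicts, keeping the first non-None value.

    Returns (merged, conflicts). `conflicts` maps a field to the
    (service, value) pairs that disagreed with the value that was kept.
    """
    kept, conflicts = {}, {}
    for service in priority:
        fields = results_by_service.get(service) or {}
        for field, value in fields.items():
            if value is None:
                continue
            if field not in kept:
                kept[field] = value  # first successful source wins
            elif kept[field] != value:
                conflicts.setdefault(field, []).append((service, value))
    return kept, conflicts
```

Surfacing the conflicts lets a caller log them or apply its own reconciliation rule, for example fuzzy comparison of artist names.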
|
||||
|
||||
## Integration Error Handling
|
||||
|
||||
### Network Errors
|
||||
|
||||
All network errors caught and suppressed. Returns None.
|
||||
|
||||
**No retry logic:** Single attempt per service.
|
||||
|
||||
**No exponential backoff:** Immediate failure on error.
|
||||
|
||||
**No circuit breaker:** Repeated failures don't disable service.
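
As an illustration of the missing retry logic, the sketch below wraps a call with exponential backoff while keeping the existing "return None on failure" behavior. `call_with_retry` and its parameters are hypothetical. The `sleep` argument is injectable so the delays can be tested without waiting.

```python
import time


def call_with_retry(func, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func(), retrying on any exception with exponential backoff.

    Waits base_delay, then 2 * base_delay, and so on between attempts.
    Returns None if every attempt fails.
    """
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            if attempt == attempts - 1:
                return None  # give up after the final attempt
            sleep(base_delay * 2 ** attempt)
    return None
```

A production version would likely retry only on transient errors (timeouts, HTTP 429 or 503) and not on every exception.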
|
||||
|
||||
### Rate Limiting
|
||||
|
||||
**No rate limiting implementation.**
|
||||
|
||||
**Risks:**
|
||||
- MusicBrainz: Asks clients to stay at or below roughly 1 req/s and may throttle or block aggressive usage
|
||||
- Deezer: Unknown limits, may block
|
||||
- YouTube Music: Unknown limits, may block or break API
|
||||
|
||||
**Batch processing:** High risk of rate limiting (no delays between requests).
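
A simple way to add delays between requests is a minimum-interval limiter, sketched below. The class name and its injectable `clock` and `sleep` parameters are assumptions for illustration. The one-second default follows the MusicBrainz figure mentioned above.

```python
import time


class MinIntervalLimiter:
    """Block so that consecutive wait() calls are at least min_interval apart."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

In a batch loop, one limiter per service would be created once and `wait()` called immediately before each request to that service.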
|
||||
|
||||
### Service Unavailability
|
||||
|
||||
**No health checks:** Services assumed available.
|
||||
|
||||
**No fallback:** If MusicBrainz down, no alternative for MBID lookup.
|
||||
|
||||
**No status monitoring:** No logging of service failures.
|
||||
|
||||
## Integration Security
|
||||
|
||||
### API Keys
|
||||
|
||||
**MusicBrainz, Deezer, YouTube Music:** No API keys required.
|
||||
|
||||
**Spotify:** Client credentials in external file (not encrypted).
|
||||
|
||||
### Data Privacy
|
||||
|
||||
**No personal data sent:** Only public music metadata queried.
|
||||
|
||||
**No user tracking:** No analytics sent to services.
|
||||
|
||||
### HTTPS
|
||||
|
||||
All services use HTTPS. No plaintext HTTP.
|
||||
|
||||
### Input Sanitization
|
||||
|
||||
**No sanitization:** Metadata strings passed directly to APIs.
|
||||
|
||||
**Potential risks:**
|
||||
- Query injection (if services use SQL/NoSQL)
|
||||
- Command injection (if services execute shell commands)
|
||||
|
||||
**Actual risk:** Low. The strings are sent as request parameters to remote HTTP APIs through client libraries, and nothing is executed locally.
|
||||
|
||||
## Integration Recommendations
|
||||
|
||||
### Immediate Fixes
|
||||
|
||||
1. **Remove AcousticBrainz:** Delete defunct integration
|
||||
2. **Fix User-Agent:** Change "elka/0.1" to "MusicMetaLinker/0.0.1"
|
||||
3. **Add rate limiting:** Implement delays between requests
|
||||
4. **Document Spotify setup:** Instructions for obtaining credentials
|
||||
|
||||
### Short-Term Improvements
|
||||
|
||||
1. **Add retry logic:** Exponential backoff for network errors
|
||||
2. **Add timeout configuration:** Configurable request timeouts
|
||||
3. **Enable YouTube duration filtering:** Uncomment and test
|
||||
4. **Add error logging:** Log service failures
|
||||
5. **Add health checks:** Verify service availability before queries
|
||||
|
||||
### Long-Term Enhancements
|
||||
|
||||
1. **Parallel queries:** Use asyncio for concurrent API calls
|
||||
2. **Cross-validation:** Compare results across services
|
||||
3. **Confidence scores:** Indicate match quality
|
||||
4. **Service abstraction:** Common interface for all services
|
||||
5. **Plugin architecture:** Allow adding new services without code changes
|
||||
6. **Caching layer:** Reduce redundant API calls
|
||||
7. **Circuit breaker:** Disable failing services temporarily
|
||||
8. **Metrics collection:** Track success rates, latencies per service
|
||||
|
||||
## Integration Value Assessment
|
||||
|
||||
**High value:**
|
||||
- MusicBrainz: Authoritative, comprehensive, reliable
|
||||
- Deezer: Fast, good ISRC coverage, BPM data
|
||||
|
||||
**Medium value:**
|
||||
- Spotify: Useful for dataset preparation, requires setup
|
||||
|
||||
**Low value:**
|
||||
- YouTube Music: Weak matching, fragile API, high false positives
|
||||
- AcousticBrainz: Defunct, zero value
|
||||
|
||||
**Recommendation:** Keep MusicBrainz and Deezer. Remove AcousticBrainz. Improve YouTube Music matching or remove. Keep Spotify for dataset preparation.
|
||||
@@ -0,0 +1,218 @@
|
||||
# MusicMetaLinker Overview
|
||||
|
||||
## Project Identity
|
||||
|
||||
**Name:** MusicMetaLinker
|
||||
**Version:** 0.0.1 (pre-release)
|
||||
**Language:** Python 3.8+
|
||||
**License:** MIT
|
||||
**Type:** Library
|
||||
**Repository:** https://github.com/andreamust/MusicMetaLinker
|
||||
**Author:** Andrea Poltronieri
|
||||
**Installation:** `pip install git+https://github.com/andreamust/MusicMetaLinker.git`
|
||||
|
||||
MusicMetaLinker is not available on PyPI. Installation requires direct GitHub access.
|
||||
|
||||
## Purpose and Scope
|
||||
|
||||
MusicMetaLinker performs entity linking for music tracks. It connects local music metadata to external databases, enriching incomplete or inconsistent information with authoritative data from multiple sources.
|
||||
|
||||
The library addresses a common problem in music information retrieval: fragmented metadata across different platforms. A track might have an MBID in one system, an ISRC in another, and only artist/title strings in a third. MusicMetaLinker bridges these gaps by querying multiple services and consolidating results.
|
||||
|
||||
Primary use case: academic music research and dataset preparation. The library supports JAMS (JSON Annotated Music Specification), a format common in music information retrieval research.
|
||||
|
||||
## Core Functionality
|
||||
|
||||
MusicMetaLinker implements a three-step workflow:
|
||||
|
||||
1. **Service Selection:** Based on available input identifiers (MBID, ISRC, or metadata strings), the library determines which external services to query and in what order.
|
||||
|
||||
2. **Information Retrieval:** Sequential queries to MusicBrainz, Deezer, YouTube Music, and AcousticBrainz. Each service has specialized search logic.
|
||||
|
||||
3. **Filtering and Matching:** Results are filtered by duration, track number, and fuzzy string matching to identify the best match across services.
|
||||
|
||||
The library returns enriched metadata including:
|
||||
- Standardized identifiers (MBID, ISRC, Deezer ID)
|
||||
- Corrected metadata (artist, album, track name)
|
||||
- Additional attributes (BPM, release date)
|
||||
- Direct links to external services
|
||||
|
||||
## Dependencies
|
||||
|
||||
Core dependencies:
|
||||
|
||||
- **musicbrainzngs:** MusicBrainz API client
|
||||
- **deezer-python:** Deezer API wrapper
|
||||
- **ytmusicapi:** YouTube Music unofficial API
|
||||
- **spotipy:** Spotify API client (limited use)
|
||||
- **requests:** HTTP client for AcousticBrainz
|
||||
- **tqdm:** Progress bars for batch processing
|
||||
- **jams:** JAMS format support
|
||||
- **pandas:** CSV output for batch processing
|
||||
- **cryptography:** Required by spotipy
|
||||
|
||||
All dependencies are widely used third-party packages, not part of the Python standard library. No exotic or unmaintained libraries.
|
||||
|
||||
## Architecture Pattern
|
||||
|
||||
MusicMetaLinker uses a cascading fallback pattern:
|
||||
|
||||
1. If MBID is provided, query MusicBrainz first (authoritative source)
|
||||
2. If ISRC is available, try Deezer (commercial database with ISRCs)
|
||||
3. Fall back to metadata string search across all services
|
||||
4. Aggregate results, preferring more authoritative sources
|
||||
|
||||
This pattern ensures maximum coverage while respecting data quality hierarchies. MusicBrainz is treated as ground truth when available.
|
||||
|
||||
## Key Components
|
||||
|
||||
**Align class (linking.py):** Main entry point. Orchestrates all service queries and exposes unified getter methods.
|
||||
|
||||
**Service-specific aligners:**
|
||||
- MusicBrainzAlign: Queries MusicBrainz by MBID, ISRC, or metadata
|
||||
- DeezerAlign: Searches Deezer with duration-based filtering
|
||||
- YouTubeAlign: Searches YouTube Music by metadata strings
|
||||
|
||||
**Batch processing:**
|
||||
- link_partitions.py: Process directories of JAMS files
|
||||
- JAMSProcessor: Read/write JAMS format with metadata enrichment
|
||||
|
||||
**Utilities:**
|
||||
- MBDownload: Bulk download from MusicBrainz
|
||||
- prepare_dataset.py: Dataset preparation scripts
|
||||
|
||||
## Workflow Example
|
||||
|
||||
Typical usage:
|
||||
|
||||
```python
|
||||
from musicmetalinker.linking import Align
|
||||
|
||||
# Initialize with available metadata
|
||||
linker = Align(
|
||||
    artist="The Beatles",
|
||||
    track="Hey Jude",
|
||||
    album="Hey Jude",
|
||||
    duration=431
|
||||
)
|
||||
|
||||
# Retrieve enriched metadata
|
||||
mbid = linker.get_mbid()
|
||||
isrc = linker.get_isrc()
|
||||
deezer_id = linker.get_deezer_id()
|
||||
youtube_link = linker.get_youtube_link()
|
||||
```
|
||||
|
||||
The Align class handles all service queries internally. Users don't interact with individual service classes directly.
|
||||
|
||||
## Batch Processing
|
||||
|
||||
For dataset-scale operations:
|
||||
|
||||
```bash
|
||||
python link_partitions.py /path/to/jams/files --save --limit audio --overwrite
|
||||
```
|
||||
|
||||
Processes all JAMS files in a directory, enriches metadata, and outputs CSV with consolidated identifiers. Useful for preparing research datasets.
|
||||
|
||||
## Target Audience
|
||||
|
||||
Primary users:
|
||||
- Music information retrieval researchers
|
||||
- Dataset curators
|
||||
- Academic projects requiring standardized music metadata
|
||||
|
||||
Not designed for:
|
||||
- Production music applications (pre-release quality)
|
||||
- Real-time streaming services (no rate limiting)
|
||||
- End-user applications (library-only, no GUI)
|
||||
|
||||
## Development Status
|
||||
|
||||
Version 0.0.1 indicates early development. The codebase contains:
|
||||
- Debug print statements in production code
|
||||
- Hardcoded configuration values
|
||||
- Commented-out code sections
|
||||
- No automated tests
|
||||
- No CI/CD pipeline
|
||||
|
||||
This is research-quality code, not production-ready software. Suitable for academic exploration and prototyping, but requires significant hardening for production use.
|
||||
|
||||
## Integration with External Services
|
||||
|
||||
**MusicBrainz:** Open music encyclopedia. No authentication required. Rate limiting recommended but not implemented.
|
||||
|
||||
**Deezer:** Commercial streaming service with public API. No authentication for basic search. More permissive than Spotify for metadata access.
|
||||
|
||||
**YouTube Music:** Unofficial API via ytmusicapi. No authentication. Fragile to YouTube changes.
|
||||
|
||||
**AcousticBrainz:** Audio feature database. Note: AcousticBrainz shut down in 2022. This integration is non-functional.
|
||||
|
||||
**Spotify:** Limited use for ISRC extraction in Billboard dataset cleaning. Requires OAuth2 credentials via external mml_secrets.py file (not in repository).
|
||||
|
||||
## Licensing and Reuse
|
||||
|
||||
The MIT license permits use, modification, and distribution, including in commercial projects. No copyleft restrictions.
|
||||
|
||||
The library can be freely integrated into commercial or academic projects. The one condition is that copies retain the copyright and permission notice, which is how attribution to Andrea Poltronieri is preserved.
|
||||
|
||||
## Installation Requirements
|
||||
|
||||
Python 3.8 or higher required. No platform-specific dependencies except optional ffmpeg for audio conversion in batch processing.
|
||||
|
||||
Installation from GitHub requires git and pip. No binary distributions available.
|
||||
|
||||
## Configuration
|
||||
|
||||
All configuration is hardcoded in source files:
|
||||
- User-Agent: "elka/0.1" (appears to be from a parent project)
|
||||
- API endpoints: Hardcoded URLs
|
||||
- Matching thresholds: Hardcoded in service classes
|
||||
- Spotify credentials: External mml_secrets.py module
|
||||
|
||||
No configuration files, environment variables, or runtime configuration options.
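
One way the hardcoded values could be made configurable is through environment variables with defaults, sketched below. Everything here is hypothetical: the `LinkerConfig` class, the `MML_*` variable names, and the timeout default are illustrative and not part of MusicMetaLinker. The 0.8 threshold mirrors the fuzzy-matching value reported for the existing integrations.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class LinkerConfig:
    user_agent: str = "MusicMetaLinker/0.0.1"
    fuzzy_threshold: float = 0.8
    request_timeout: float = 10.0  # seconds; illustrative default


def load_config(env=os.environ):
    """Build a config from environment variables, falling back to defaults."""
    defaults = LinkerConfig()
    return LinkerConfig(
        user_agent=env.get("MML_USER_AGENT", defaults.user_agent),
        fuzzy_threshold=float(env.get("MML_FUZZY_THRESHOLD", defaults.fuzzy_threshold)),
        request_timeout=float(env.get("MML_REQUEST_TIMEOUT", defaults.request_timeout)),
    )
```

Passing the mapping in as `env` keeps the function testable with a plain dict.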
|
||||
|
||||
## Output Formats
|
||||
|
||||
**Library mode:** Python objects with getter methods
|
||||
|
||||
**Batch mode:** CSV with columns:
|
||||
- jams_file
|
||||
- track_name, artist_name, album_name
|
||||
- track_number, duration, release_year
|
||||
- musicbrainz, isrc
|
||||
- deezer_id, deezer_url
|
||||
- youtube_url
|
||||
- acousticbrainz
|
||||
- spotify_id
|
||||
|
||||
JAMS files can be enriched in place with new identifiers added to the identifiers section.
|
||||
|
||||
## Performance Characteristics
|
||||
|
||||
No performance benchmarks provided. Expected bottlenecks:
|
||||
- Network latency for API calls
|
||||
- Sequential service queries (no parallelization)
|
||||
- No caching of results
|
||||
|
||||
Batch processing includes progress bars via tqdm but no performance optimization.
|
||||
|
||||
## Error Handling
|
||||
|
||||
Errors are silently suppressed. Failed queries return None. No exceptions propagate to callers.
|
||||
|
||||
This makes the library robust to individual service failures but provides no visibility into what went wrong. Because failures are not logged, debugging requires adding print statements or logging by hand.
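
A small decorator could keep the "return None" contract while recording why a lookup failed. This is a sketch only. The decorator name `log_failures` and the logger name are assumptions.

```python
import functools
import logging

logger = logging.getLogger("musicmetalinker")


def log_failures(func):
    """Return None on any exception, but log the traceback first."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception:
            # logger.exception records the message plus the active traceback.
            logger.exception("%s failed", func.__name__)
            return None
    return wrapper
```

Callers see the same results as before, and enabling the logger exposes which service failed and with which exception.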
|
||||
|
||||
## Maintenance Status
|
||||
|
||||
Commit activity and maintenance frequency could not be determined from the available information. The repository is public, but its development status is unclear.
|
||||
|
||||
AcousticBrainz integration is broken (service discontinued). No indication this has been addressed.
|
||||
|
||||
## Relevance Assessment
|
||||
|
||||
**Conceptual value:** High. The cascading fallback pattern and multi-service aggregation approach are sound architectural patterns for entity linking.
|
||||
|
||||
**Implementation value:** Low. Pre-release quality, broken integrations, no tests, hardcoded configuration.
|
||||
|
||||
**Reuse recommendation:** Study the pattern, don't adopt the code. Reimplement the concept with proper error handling, configuration management, and test coverage.
|
||||