feat: initial implementation of metadata aggregator

- gRPC service with MusicBrainz provider
- PostgreSQL schema with migrations
- Service layer with database-first caching
- Repository pattern for data access
- YAML configuration support
- Research documentation for 17 music metadata projects
Alexander
2026-04-28 16:27:14 +02:00
commit a1f6701bac
163 changed files with 95884 additions and 0 deletions
@@ -0,0 +1,68 @@
# MusicMetaLinker
## Overview
Python library for entity linking and knowledge augmentation for music metadata. Links music tracks to external sources (MusicBrainz, AcousticBrainz, YouTube Music, Deezer) for metadata enrichment.
## Key Features
- **Purpose**: Entity linking across music databases
- **Sources**: MusicBrainz, AcousticBrainz, YouTube Music, Deezer
- **Matching**: Intelligent service selection based on available metadata
- **License**: MIT
## Source
| Resource | URL |
|----------|-----|
| **Repository** | https://github.com/andreamust/MusicMetaLinker |
| **PyPI** | https://pypi.org/project/MusicMetaLinker |
## How It Works
1. **Service Selection**: Evaluates available metadata, selects best external service
2. **Information Retrieval**: Connects to service API, searches for best match
3. **Filtering and Return**: Filters results, returns enriched metadata
## Usage Example
```python
from musicmetalinker.linking import Align

# Initialize with known metadata
linker = Align(
    track="Bohemian Rhapsody",
    artist="Queen",
    album="A Night at the Opera",
)

# Get linked metadata
track_name = linker.get_track()
artist_name = linker.get_artist()
album_name = linker.get_album()
duration = linker.get_duration()
isrc = linker.get_isrc()

# Get external IDs
links = {
    "mbid": linker.get_mbid(),
    "isrc": isrc,
    "deezer_id": linker.get_deezer_id(),
}

# Or query by a single identifier
linker_by_mbid = Align(mbid_track="b10bbbfc-cf9e-42e0-be17-e2c3e1d2600d")
linker_by_isrc = Align(isrc="GBUM71029604")
```
## Installation
```bash
pip install MusicMetaLinker
```
## Notes
- Best for enriching existing metadata with external links
- Automatic service selection based on input
- Can query by MBID or ISRC directly; the Deezer ID is returned as an output rather than accepted as an input
@@ -0,0 +1,521 @@
# MusicMetaLinker API Reference
## API Type
MusicMetaLinker is exposed as a Python library only. There is no REST API and no GraphQL endpoint.
The only command-line entry point is the batch script link_partitions.py; the core library itself has no CLI.
## Primary Interface: Align Class
### Constructor
```python
from musicmetalinker.linking import Align
linker = Align(
    mbid_track=None,
    mbid_release=None,
    artist=None,
    album=None,
    track=None,
    track_number=None,
    duration=None,
    isrc=None,
    strict=False,
)
```
**Parameters:**
**mbid_track** (str, optional): MusicBrainz recording ID. If provided, MusicBrainz is queried first and treated as authoritative.
**mbid_release** (str, optional): MusicBrainz release ID. Used for album-level metadata.
**artist** (str, optional): Artist name. Used for metadata-based search when identifiers unavailable.
**album** (str, optional): Album name. Used for filtering and matching.
**track** (str, optional): Track name. Primary search term for metadata-based queries.
**track_number** (int, optional): Track position on album. Used for filtering multiple matches.
**duration** (int or float, optional): Track duration in seconds. Critical for filtering. Deezer uses ±3 second threshold.
**isrc** (str, optional): International Standard Recording Code. If provided, used for direct lookup on Deezer and MusicBrainz.
**strict** (bool, optional): Strict matching mode. Behavior not fully documented. Likely affects fuzzy matching thresholds.
**Returns:** Align instance. No exceptions raised during construction. Queries execute lazily when getters called.
**Usage patterns:**
Minimal input (metadata only):
```python
linker = Align(artist="Radiohead", track="Creep")
```
With identifiers (preferred):
```python
linker = Align(
    mbid_track="6b9e7b9e-8f9e-4f9e-9f9e-9f9e9f9e9f9e",
    isrc="GBAYE9200070",
)
```
Full metadata for best matching:
```python
linker = Align(
    artist="The Beatles",
    track="Hey Jude",
    album="Hey Jude",
    duration=431,
    track_number=1,
)
```
### Metadata Getter Methods
All getters return None if data unavailable. No exceptions raised.
#### get_artist()
```python
artist = linker.get_artist()
```
**Returns:** str or None. Artist name from best available source (MusicBrainz > Deezer > YouTube > input).
**Behavior:**
- If MBID available, returns MusicBrainz artist
- Falls back to Deezer artist if found
- Falls back to YouTube artist if found
- Returns input artist if no services matched
- Returns None if no artist information available
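The fallback order above can be sketched as a small helper. This is a minimal illustration and not the library's own code: `first_available` and the example values are hypothetical, and the candidates are assumed to be already fetched and listed in priority order.

```python
from typing import Optional, Sequence


def first_available(candidates: Sequence[Optional[str]]) -> Optional[str]:
    """Return the first non-empty candidate, or None if all are missing."""
    for value in candidates:
        if value:  # skips None and empty strings
            return value
    return None


# Candidates in priority order: MusicBrainz, Deezer, YouTube, caller input.
artist = first_available([None, "Queen", None, "queen"])
```

Here MusicBrainz produced nothing, so the Deezer value wins over the caller's own spelling.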
#### get_album()
```python
album = linker.get_album()
```
**Returns:** str or None. Album/release name.
**Behavior:** Same cascading fallback as get_artist().
#### get_track()
```python
track = linker.get_track()
```
**Returns:** str or None. Track/recording name.
**Behavior:** Same cascading fallback as get_artist().
#### get_track_number()
```python
track_number = linker.get_track_number()
```
**Returns:** int or None. Track position on album.
**Behavior:**
- Returns MusicBrainz track number if available
- Falls back to input track_number
- Returns None if unavailable
#### get_duration()
```python
duration = linker.get_duration()
```
**Returns:** int, float, or None. Track duration in seconds.
**Behavior:**
- Returns MusicBrainz duration if available (milliseconds converted to seconds)
- Falls back to Deezer duration
- Falls back to input duration
- Returns None if unavailable
**Note:** MusicBrainz stores duration in milliseconds. The library converts to seconds for consistency.
#### get_release_date()
```python
release_date = linker.get_release_date()
```
**Returns:** str or None. Release date in ISO format (YYYY-MM-DD) or year only (YYYY).
**Behavior:**
- Returns MusicBrainz release date if available
- Falls back to Deezer release date
- Returns None if unavailable
**Format inconsistency:** MusicBrainz may return full date, Deezer typically returns year only.
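Because callers may receive either a full date or a bare year, one way to cope is to reduce both to a year before comparing values. The sketch below is an assumption about how a caller might do this; `release_year` is a hypothetical helper, not a MusicMetaLinker function.

```python
import re
from typing import Optional


def release_year(release_date: Optional[str]) -> Optional[int]:
    """Extract the year from 'YYYY', 'YYYY-MM' or 'YYYY-MM-DD'; None otherwise."""
    if not release_date:
        return None
    match = re.fullmatch(r"(\d{4})(?:-\d{2}){0,2}", release_date.strip())
    return int(match.group(1)) if match else None
```

Comparing years loses precision, but it avoids treating "1975" and "1975-10-31" as conflicting values.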
#### get_isrc()
```python
isrc = linker.get_isrc()
```
**Returns:** str or None. International Standard Recording Code.
**Behavior:**
- Returns input ISRC if provided
- Extracts from MusicBrainz recording if available
- Extracts from Deezer result if available
- Returns None if unavailable
**Format:** Standard ISRC format (e.g., "GBAYE9200070"). No validation performed.
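Since the library is described as performing no validation, a caller could check the shape of an ISRC before passing it in. The following is a minimal sketch: `normalize_isrc` is a hypothetical helper, and it checks only the 12-character layout, not whether the code is actually registered.

```python
import re
from typing import Optional

# ISRC layout: 2-letter country code, 3 alphanumeric registrant characters,
# 2-digit year, 5-digit designation code (12 characters in total).
_ISRC_RE = re.compile(r"[A-Z]{2}[A-Z0-9]{3}\d{7}")


def normalize_isrc(value: Optional[str]) -> Optional[str]:
    """Return the ISRC in compact upper-case form, or None if it is malformed."""
    if not value:
        return None
    compact = value.replace("-", "").strip().upper()
    return compact if _ISRC_RE.fullmatch(compact) else None
```

Hyphenated display forms such as "GB-AYE-92-00070" are accepted and returned in compact form.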
#### get_bpm()
```python
bpm = linker.get_bpm()
```
**Returns:** int, float, or None. Tempo in beats per minute.
**Behavior:**
- Returns Deezer BPM if available
- Returns None if unavailable
**Note:** MusicBrainz doesn't provide BPM in standard queries. Only Deezer source.
### Identifier Getter Methods
#### get_mbid()
```python
mbid = linker.get_mbid()
```
**Returns:** str or None. MusicBrainz recording ID (UUID format).
**Behavior:**
- Returns input mbid_track if provided
- Queries MusicBrainz by ISRC if available
- Queries MusicBrainz by metadata if ISRC unavailable
- Returns None if no match found
**Format:** UUID string (e.g., "6b9e7b9e-8f9e-4f9e-9f9e-9f9e9f9e9f9e").
#### get_deezer_id()
```python
deezer_id = linker.get_deezer_id()
```
**Returns:** int or None. Deezer track ID.
**Behavior:**
- Queries Deezer by ISRC if available
- Queries Deezer by metadata if ISRC unavailable
- Filters by duration (±3 seconds)
- Returns None if no match found
**Format:** Integer (e.g., 123456789).
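The ±3 second filter can be illustrated on plain dictionaries. This is a sketch under stated assumptions: `filter_by_duration` is hypothetical, the candidate dicts stand in for Deezer results, and the real client objects may expose duration differently.

```python
from typing import Iterable, List, Optional


def filter_by_duration(
    candidates: Iterable[dict],
    duration: Optional[float],
    threshold: float = 3.0,
) -> List[dict]:
    """Keep candidates whose 'duration' is within ±threshold seconds."""
    candidates = list(candidates)
    if duration is None:
        return candidates  # nothing to compare against
    return [
        c for c in candidates
        if c.get("duration") is not None
        and abs(c["duration"] - duration) <= threshold
    ]


results = [
    {"id": 1, "duration": 354},  # album version
    {"id": 2, "duration": 215},  # radio edit
]
kept = filter_by_duration(results, duration=355)
```

Only the first candidate survives, because the radio edit differs from the requested duration by far more than three seconds.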
#### get_deezer_link()
```python
deezer_link = linker.get_deezer_link()
```
**Returns:** str or None. Full Deezer track URL.
**Behavior:**
- Calls get_deezer_id() internally
- Constructs URL: f"https://www.deezer.com/track/{deezer_id}"
- Returns None if no Deezer ID available
**Format:** Full URL (e.g., "https://www.deezer.com/track/123456789").
#### get_youtube_link()
```python
youtube_link = linker.get_youtube_link()
```
**Returns:** str or None. YouTube Music track URL.
**Behavior:**
- Queries YouTube Music by metadata (artist, track, album)
- Returns first result (no sophisticated ranking)
- Returns None if no results
**Format:** Full YouTube URL (e.g., "https://www.youtube.com/watch?v=dQw4w9WgXcQ").
**Warning:** YouTube matching is weak. First result assumed correct. No duration filtering.
#### get_acousticbrainz_link()
```python
acousticbrainz_link = linker.get_acousticbrainz_link()
```
**Returns:** str or None. AcousticBrainz URL.
**Behavior:**
- Requires MBID (calls get_mbid() internally)
- Checks if https://acousticbrainz.org/{mbid} returns HTTP 200
- Returns URL if exists, None otherwise
**Critical issue:** AcousticBrainz shut down in 2022. This method always returns None. Dead code.
### Internal Service Methods
Not part of public API but exposed in service classes.
#### MusicBrainzAlign Methods
**get_recording(mbid):** Direct MusicBrainz recording lookup by MBID.
**get_best_match(artist, track, album, duration):** Search MusicBrainz by metadata with filtering.
**get_iswc():** Retrieve International Standard Musical Work Code.
**Implementation details:**
```python
from musicmetalinker.linking import MusicBrainzAlign
mb = MusicBrainzAlign(mbid="...")
recording = mb.get_recording(mbid)
# Returns dict with artist, album, track, duration, isrcs, etc.
```
Not intended for direct use. Align class wraps these methods.
#### DeezerAlign Methods
**best_match(artist, track, album, duration, duration_threshold=3):** Search Deezer with duration filtering.
**get_rank():** Retrieve Deezer popularity rank.
**Implementation details:**
```python
from musicmetalinker.linking import DeezerAlign
deezer = DeezerAlign(artist="...", track="...", album="...", duration=123)
match = deezer.best_match(artist, track, album, duration)
# Returns Deezer track object or None
```
Duration threshold defaults to 3 seconds. Adjustable for stricter/looser matching.
#### YouTubeAlign Methods
**get_best_match(artist, track, album):** Search YouTube Music.
**get_youtube_id():** Extract video ID from search results.
**Implementation details:**
```python
from musicmetalinker.linking import YouTubeAlign
yt = YouTubeAlign(artist="...", track="...", album="...")
match = yt.get_best_match(artist, track, album)
# Returns YouTube Music result dict or None
```
No duration parameter. No filtering. First result returned.
### Batch Processing API
#### link_partitions.py CLI
```bash
python link_partitions.py <directory> [options]
```
**Arguments:**
**directory** (positional): Path to directory containing JAMS files.
**Options:**
**--save:** Write enriched JAMS files back to disk. Without this flag, only CSV output generated.
**--limit audio:** Only process JAMS files with audio content. Skip annotation-only files.
**--overwrite:** Overwrite existing enriched JAMS files. Without this flag, existing files skipped.
**Output:**
CSV file with columns:
- jams_file: Original JAMS filename
- track_name, artist_name, album_name: Metadata
- track_number, duration, release_year: Attributes
- musicbrainz: MBID
- isrc: ISRC
- deezer_id, deezer_url: Deezer identifiers
- youtube_url: YouTube Music link
- acousticbrainz: AcousticBrainz link (always None)
- spotify_id: Spotify ID (if available)
Log file: link_partitions.log in current directory.
#### JAMSProcessor API
```python
from musicmetalinker.preprocessor import JAMSProcessor
processor = JAMSProcessor(jams_file_path)
metadata = processor.extract_metadata()
# Returns dict with artist, track, album, duration, etc.
processor.enrich_jams(align_instance)
processor.write_jams(output_path)
```
**extract_metadata():** Parses JAMS file and returns metadata dict.
**enrich_jams(align):** Takes Align instance and adds identifiers to JAMS structure.
**write_jams(path):** Writes enriched JAMS to file.
### Error Handling
No exceptions raised by public API. All errors silently suppressed.
**Pattern:**
- Service query fails: Returns None
- Network error: Returns None
- Invalid input: Returns None
- No match found: Returns None
**Implications:**
- No distinction between error types
- No error messages
- No logging of failures (except in batch mode)
- Caller cannot determine why None returned
**Debugging:**
- Enable logging to see internal errors
- Check link_partitions.log for batch processing errors
- Add print statements to source code
### Rate Limiting
No rate limiting implemented.
**Risks:**
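- MusicBrainz rate limits: roughly 1 request/second per client is the documented guideline; MusicMetaLinker adds no throttling of its own (musicbrainzngs applies a default limit unless it is disabled)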
- Deezer rate limits: Unknown, not enforced
- YouTube Music rate limits: Unknown, not enforced
**Batch processing:** No delays between requests. High risk of rate limiting or IP bans.
**Recommendation:** Add manual delays in batch processing loops.
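A manual delay can be sketched as a small throttle object. `Throttle` is a hypothetical helper, not part of MusicMetaLinker; the one-second default mirrors the MusicBrainz guidance above, and the injectable `clock` and `sleep` exist only to make the class testable.

```python
import time
from typing import Callable, Optional


class Throttle:
    """Block until at least min_interval seconds separate consecutive calls."""

    def __init__(
        self,
        min_interval: float = 1.0,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last: Optional[float] = None  # time of the previous call, if any

    def wait(self) -> None:
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

In a batch loop, calling `throttle.wait()` before each per-track lookup would space the requests out without changing the library.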
### Caching
Results cached within Align instance lifetime. No cross-instance caching.
**Behavior:**
- First call to get_mbid() queries MusicBrainz
- Second call to get_mbid() returns cached value
- Creating new Align instance queries again
**No persistent cache:** No disk cache, no Redis, no memcached.
**Batch processing:** Each track creates new Align instance. No cache reuse across tracks.
### Thread Safety
Not thread-safe. No synchronization primitives.
**Unsafe operations:**
- Concurrent calls to same Align instance
- Concurrent batch processing of same directory
**Safe operations:**
- Multiple Align instances in separate threads (each queries independently)
### Authentication
**MusicBrainz:** No authentication. User-Agent header required ("elka/0.1" hardcoded).
**Deezer:** No authentication for search API.
**YouTube Music:** No authentication. Uses unofficial API.
**Spotify:** OAuth2 client credentials required. Configured in external mml_secrets.py file.
**Spotify usage:** Limited to ISRC extraction in Billboard dataset cleaning. Not used in main Align workflow.
### API Versioning
No API versioning. Library version 0.0.1 indicates pre-release.
**Breaking changes:** Possible in any release. No stability guarantees.
**Compatibility:** No backward compatibility promises.
### Dependencies for API Usage
Minimum dependencies for using Align class:
- musicbrainzngs
- deezer-python
- ytmusicapi
- requests
Optional dependencies:
- jams (for JAMS file support)
- pandas (for batch CSV output)
- spotipy (for Spotify integration)
### Performance Characteristics
**Query latency:**
- MusicBrainz: 100-500ms per query
- Deezer: 50-200ms per query
- YouTube Music: 100-300ms per query
**Total latency:** Sum of all service queries (sequential execution). Expect 250-1000ms per track.
**Batch processing:** Linear scaling. 1000 tracks = 1000x single track latency.
### API Limitations
1. **No bulk queries:** Each track requires separate Align instance
2. **No async support:** Synchronous only
3. **No streaming results:** All-or-nothing queries
4. **No partial updates:** Can't update single field
5. **No validation:** No input validation, no output validation
6. **No error details:** Only None on failure
7. **Dead integrations:** AcousticBrainz non-functional
8. **Weak YouTube matching:** First result assumed correct
### API Strengths
1. **Simple interface:** Single class, clear getters
2. **Flexible input:** Works with identifiers or metadata
3. **Cascading fallback:** Graceful degradation
4. **Lazy evaluation:** Only query when needed
5. **JAMS support:** Academic standard format
### API Design Recommendations
For production use:
1. **Add exceptions:** Raise specific errors instead of returning None
2. **Add validation:** Validate input parameters
3. **Add async API:** Async versions of all getters
4. **Add bulk API:** Process multiple tracks in single call
5. **Add configuration:** Runtime configuration for thresholds
6. **Add logging:** Structured logging with correlation IDs
7. **Add rate limiting:** Respect API limits
8. **Remove dead code:** Delete AcousticBrainz methods
9. **Add documentation:** Docstrings for all public methods
10. **Add type hints:** Full type annotations
The API surface is clean and simple. The implementation needs hardening.
@@ -0,0 +1,441 @@
# MusicMetaLinker Architecture
## System Overview
MusicMetaLinker implements a service-oriented architecture for music metadata entity linking. The system coordinates queries across multiple external APIs, aggregates results, and presents a unified interface through a single orchestrator class.
Architecture pattern: Facade with cascading fallback strategy.
## Core Components
### Align Class (linking.py)
The Align class is the primary orchestrator and sole public interface. It encapsulates all service interactions and presents a clean getter-based API.
**Constructor signature:**
```python
Align(
    mbid_track=None,
    mbid_release=None,
    artist=None,
    album=None,
    track=None,
    track_number=None,
    duration=None,
    isrc=None,
    strict=False,
)
```
**Responsibilities:**
- Initialize service-specific aligners based on available input
- Coordinate query execution across services
- Aggregate and normalize results
- Expose unified getter methods for all metadata fields
**Internal state:**
- Stores all input parameters
- Maintains references to service aligner instances
- Caches retrieved metadata to avoid redundant queries
The Align class doesn't implement service-specific logic. It delegates to specialized classes and functions.
### MusicBrainzAlign Class
Handles all MusicBrainz interactions. MusicBrainz is treated as the authoritative source when MBIDs are available.
**Key methods:**
**get_recording(mbid):** Retrieves full recording data by MBID. Returns artist, album, track name, duration, ISRCs, and related identifiers.
**get_best_match(artist, track, album, duration):** Searches MusicBrainz by metadata strings. Filters results by duration and fuzzy string matching. Returns the highest-scoring match.
**get_iswc():** Retrieves International Standard Musical Work Code if available.
**Search strategy:**
1. If MBID provided, direct lookup (most reliable)
2. If ISRC provided, search by ISRC
3. Fall back to metadata string search with filtering
MusicBrainz queries include related entities (artists, releases, ISRCs) in a single request to minimize API calls.
### DeezerAlign Class
Interfaces with Deezer's public API. Deezer provides commercial metadata with strong ISRC coverage.
**Key methods:**
**best_match(artist, track, album, duration, duration_threshold=3):** Searches Deezer and filters by duration. The duration_threshold parameter allows ±3 seconds variance by default.
**get_rank():** Returns Deezer's internal popularity rank for the track.
**Search strategy:**
1. If ISRC available, search by ISRC (most accurate)
2. Fall back to metadata string search
3. Filter results by duration (±3 seconds)
4. Apply fuzzy string matching to artist/track/album
Duration filtering is critical for Deezer because metadata searches often return multiple versions (radio edit, album version, remaster).
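The fuzzy string matching step can be sketched with the standard library. This is an illustration only: the source does not say which similarity measure the library uses, so `difflib.SequenceMatcher` and the 0.8 threshold are assumptions, and `similarity` and `is_match` are hypothetical names.

```python
from difflib import SequenceMatcher

SIMILARITY_THRESHOLD = 0.8  # assumed value, for illustration only


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio between 0.0 and 1.0."""
    return SequenceMatcher(None, a.casefold().strip(), b.casefold().strip()).ratio()


def is_match(a: str, b: str, threshold: float = SIMILARITY_THRESHOLD) -> bool:
    """True when the two strings are at least `threshold` similar."""
    return similarity(a, b) >= threshold
```

With this measure a long suffix such as " - Remastered 2011" is enough to push a title below the threshold, which is one reason duration filtering matters as a second signal.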
### YouTubeAlign Class
Queries YouTube Music via the unofficial ytmusicapi library.
**Key methods:**
**get_best_match(artist, track, album):** Searches YouTube Music with filter="songs". Returns the first result (no sophisticated ranking).
**get_youtube_id():** Extracts YouTube video ID from search results.
**Search strategy:**
- Constructs query string: "{artist} {track} {album}"
- Filters to songs only (excludes videos, albums)
- Returns first result
YouTube matching is the weakest link. No duration filtering (commented out in code). No fuzzy matching. First result is assumed correct.
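The query construction can be sketched as follows. `build_query` is a hypothetical helper; whether the library skips missing fields or interpolates them as-is is not stated in the source, so skipping them here is an assumption.

```python
from typing import Optional


def build_query(artist: Optional[str], track: Optional[str], album: Optional[str]) -> str:
    """Join the available fields with single spaces, skipping missing ones."""
    parts = [artist, track, album]
    return " ".join(p.strip() for p in parts if p and p.strip())
```

Skipping empty fields keeps a literal "None" out of the search string when only artist and track are known.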
### acousticbrainz_link Function
Standalone function (not a class) that checks if an MBID exists in AcousticBrainz.
**Implementation:**
```python
def acousticbrainz_link(mbid):
    url = f"https://acousticbrainz.org/{mbid}"
    response = requests.get(url)
    return url if response.status_code == 200 else None
```
Simple HTTP check. Returns URL if MBID exists, None otherwise.
**Critical issue:** AcousticBrainz shut down in 2022. This function always returns None. Dead code.
## Data Flow
### Initialization Flow
1. User creates Align instance with available metadata
2. Align constructor stores all input parameters
3. Service aligners are instantiated on-demand (lazy initialization)
4. No queries execute during construction
### Query Flow
1. User calls getter method (e.g., get_mbid())
2. Align checks if value already cached
3. If not cached, determines which service to query based on available input
4. Executes service-specific query
5. Caches result
6. Returns value to user
Queries are lazy and cached. Calling get_mbid() twice only queries MusicBrainz once.
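The lazy, cached getter flow can be sketched with a stand-in class. `LazyLinker` and `fake_lookup` are hypothetical and replace the real service call with a counter, so the snippet runs without network access; it is not the Align implementation.

```python
from typing import Callable, Dict, List, Optional


class LazyLinker:
    """Hypothetical stand-in for Align: resolves values on first access only."""

    def __init__(self, lookup_mbid: Callable[[], Optional[str]]):
        self._lookup_mbid = lookup_mbid
        self._cache: Dict[str, Optional[str]] = {}

    def get_mbid(self) -> Optional[str]:
        if "mbid" not in self._cache:  # a cached None is not retried
            self._cache["mbid"] = self._lookup_mbid()
        return self._cache["mbid"]


calls: List[int] = []


def fake_lookup() -> Optional[str]:
    calls.append(1)  # record each simulated service query
    return "b10bbbfc-cf9e-42e0-be17-e2c3e1d2600d"


linker = LazyLinker(fake_lookup)
first = linker.get_mbid()
second = linker.get_mbid()
```

Nothing is queried at construction time, and the second call is served from the cache, so `calls` holds a single entry.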
### Cascading Fallback Strategy
Priority order for identifier resolution:
**For MBID:**
1. Use provided mbid_track if available
2. Query MusicBrainz by ISRC
3. Query MusicBrainz by metadata strings
4. Return None if all fail
**For ISRC:**
1. Use provided ISRC if available
2. Extract from MusicBrainz recording (if MBID available)
3. Query Deezer and extract ISRC from result
4. Return None if all fail
**For Deezer ID:**
1. Query Deezer by ISRC
2. Query Deezer by metadata strings
3. Return None if all fail
**For YouTube link:**
1. Query YouTube Music by metadata strings
2. Return None if no results
Each service is queried independently. No cross-service validation or conflict resolution.
## Supporting Components
### JAMSProcessor (preprocessor.py)
Handles reading and writing JAMS (JSON Annotated Music Specification) files.
**Responsibilities:**
- Parse JAMS JSON structure
- Extract metadata from file_metadata and sandbox sections
- Enrich JAMS files with new identifiers
- Write updated JAMS files
JAMS structure:
```json
{
  "file_metadata": {
    "title": "track name",
    "artist": "artist name",
    "release": "album name",
    "duration": 123.45,
    "identifiers": {
      "musicbrainz": "mbid-here"
    }
  },
  "sandbox": {
    "type": "genre",
    "genre": "rock",
    "track_number": 1,
    "release_year": 2020
  }
}
```
JAMSProcessor reads these fields, passes them to Align, and writes enriched identifiers back to the identifiers section.
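The enrichment step can be sketched with the standard `json` module on the simplified structure shown above. `add_identifiers` is a hypothetical helper; the real JAMSProcessor presumably works through the `jams` library, and the choice not to overwrite existing identifiers is an assumption.

```python
import json
from typing import Dict


def add_identifiers(jams: dict, new_ids: Dict[str, str]) -> dict:
    """Merge non-empty identifiers into file_metadata.identifiers without overwriting."""
    meta = jams.setdefault("file_metadata", {})
    identifiers = meta.setdefault("identifiers", {})
    for key, value in new_ids.items():
        if value and key not in identifiers:
            identifiers[key] = value
    return jams


doc = json.loads(
    '{"file_metadata": {"title": "Creep", "identifiers": {"musicbrainz": "mbid-here"}}}'
)
add_identifiers(doc, {"isrc": "GBAYE9200070", "musicbrainz": "other", "deezer_id": ""})
```

The existing MusicBrainz ID is kept, the ISRC is added, and the empty Deezer ID is ignored.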
### MBDownload (musicbrainz_dump.py)
Utility for bulk downloading MusicBrainz data.
**Purpose:** Pre-populate local datasets with MusicBrainz metadata to reduce API calls during batch processing.
**Implementation details:** Not documented in the material reviewed. Likely queries MusicBrainz in batches and caches results locally.
### link_partitions.py
Batch processing script for directories of JAMS files.
**Workflow:**
1. Scan directory for JAMS files
2. For each file, extract metadata via JAMSProcessor
3. Create Align instance and query all services
4. Collect results in pandas DataFrame
5. Output CSV with all identifiers
**Command-line options:**
- `--save`: Write enriched JAMS files back to disk
- `--limit audio`: Only process audio files (skip non-audio JAMS)
- `--overwrite`: Overwrite existing enriched files
Includes progress bars via tqdm and logging to link_partitions.log.
### prepare_dataset.py
Dataset preparation utilities. Specific functionality is not documented in the material reviewed. It likely includes:
- Data cleaning
- Format conversion
- Batch metadata enrichment
## Configuration Architecture
No configuration system. All settings hardcoded in source files.
**Hardcoded values:**
- MusicBrainz User-Agent: "elka/0.1"
- Deezer duration threshold: 3 seconds
- API endpoints: Direct URLs in code
- Spotify credentials: Imported from external mml_secrets.py
**Implications:**
- No runtime configuration
- No environment-specific settings
- Changing thresholds requires code modification
- No A/B testing of matching strategies
## Error Handling Architecture
Error handling is minimal and inconsistent.
**Pattern:**
```python
try:
    result = service.query()
    return result
except:
    return None
```
All exceptions are caught and suppressed. Failed queries return None. No error logging, no exception propagation, no retry logic.
**Consequences:**
- Silent failures
- No visibility into what went wrong
- Difficult debugging
- No distinction between "not found" and "service error"
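One way to separate "not found" from "service error" is to let None mean only the former and raise for the latter. The sketch below is an assumption about how a hardened wrapper might look; `ServiceError` and `safe_query` are hypothetical names and do not exist in the library.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger("linker_sketch")


class ServiceError(Exception):
    """Raised when a lookup fails for a reason other than 'no match'."""


def safe_query(name: str, query: Callable[[], Optional[dict]]) -> Optional[dict]:
    """Return the query result (None means 'not found'); wrap failures in ServiceError."""
    try:
        return query()
    except Exception as exc:  # narrow this to the client's real exception types
        logger.warning("%s query failed: %s", name, exc)
        raise ServiceError(f"{name} query failed") from exc
```

The caller can then decide per service whether to retry, fall back, or abort, and the log records which service failed and why.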
## Logging Architecture
Uses Python's standard logging module.
**Batch processing:** File-based logging to link_partitions.log. Includes timestamps, log levels, and progress information.
**Library usage:** Console logging. Minimal output.
**Debug output:** Multiple print() statements scattered throughout code. Not controlled by logging configuration.
**Issues:**
- Debug prints in production code
- No structured logging
- No log levels for debug prints
- No correlation IDs for tracking requests across services
## Concurrency Model
Single-threaded, synchronous execution. No parallelization.
**Query execution:**
- Services queried sequentially
- No concurrent API calls
- No async/await
- No thread pools
**Implications:**
- Slow batch processing (network latency multiplied by number of tracks)
- Underutilized network bandwidth
- Simple debugging (no race conditions)
Batch processing could benefit significantly from parallel execution.
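Parallel execution could look roughly like the following, using one independent lookup per track. `link_all` and `fake_link` are hypothetical; `fake_link` stands in for "create one linker per track and call its getters" so the snippet runs offline.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Optional


def link_all(
    tracks: List[Dict[str, str]],
    link_one: Callable[[Dict[str, str]], Optional[str]],
    max_workers: int = 4,
) -> List[Optional[str]]:
    """Run link_one for each track in a thread pool, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(link_one, tracks))


def fake_link(track: Dict[str, str]) -> Optional[str]:
    # Placeholder for a real per-track lookup against the external services.
    return f"{track['artist']} - {track['track']}"


tracks = [
    {"artist": "Radiohead", "track": "Creep"},
    {"artist": "Queen", "track": "Bohemian Rhapsody"},
]
links = link_all(tracks, fake_link)
```

Each worker should use its own linker instance, and `max_workers` should stay small, since more parallelism also means more requests per second against the external services.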
## Dependency Injection
No dependency injection. Service classes instantiated directly in Align constructor.
**Current pattern:**
```python
self.mb_align = MusicBrainzAlign(...)
self.deezer_align = DeezerAlign(...)
```
**Implications:**
- Difficult to mock services for testing
- Tight coupling between Align and service implementations
- No interface-based programming
- Hard to swap service implementations
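Constructor injection would address these points. The sketch below is not the library's design: `TrackLookup`, `Linker`, and `FakeService` are hypothetical names showing how services could be passed in and replaced by test doubles.

```python
from typing import Optional, Protocol


class TrackLookup(Protocol):
    """Minimal interface every service adapter would implement."""

    def find_id(self, artist: str, track: str) -> Optional[str]: ...


class Linker:
    """Hypothetical orchestrator that receives its services instead of creating them."""

    def __init__(self, primary: TrackLookup, fallback: TrackLookup):
        self._services = [primary, fallback]

    def get_id(self, artist: str, track: str) -> Optional[str]:
        for service in self._services:
            found = service.find_id(artist, track)
            if found is not None:
                return found
        return None


class FakeService:
    """Test double returning a fixed answer without any network access."""

    def __init__(self, answer: Optional[str]):
        self.answer = answer

    def find_id(self, artist: str, track: str) -> Optional[str]:
        return self.answer


linker = Linker(primary=FakeService(None), fallback=FakeService("deezer:42"))
found_id = linker.get_id("Queen", "Bohemian Rhapsody")
```

Because the orchestrator depends only on the `find_id` interface, the fallback path can be unit-tested with fakes like these.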
## State Management
State is managed in Align instance variables.
**Cached values:**
- All input parameters (artist, track, album, etc.)
- Retrieved metadata (MBID, ISRC, Deezer ID, etc.)
- Service aligner instances
**Cache invalidation:** None. Values cached for lifetime of Align instance.
**Thread safety:** Not thread-safe. No locks, no synchronization.
## Extension Points
Limited extensibility.
**Adding new services:**
1. Create new service aligner class
2. Instantiate in Align constructor
3. Add getter methods to Align
4. Update cascading fallback logic
No plugin system, no service registry, no abstract base classes.
**Modifying matching logic:**
Requires editing service aligner classes directly. No strategy pattern, no configurable matchers.
## Testing Architecture
No test suite. No test directory. No test configuration.
**Testing approach:**
- Manual testing via Jupyter notebooks (deezer_test.ipynb, queries.ipynb)
- if __name__ == "__main__" blocks in some modules
- No unit tests, no integration tests, no mocks
## Build and Packaging
Uses hatchling (PEP 517 build backend).
**pyproject.toml structure:**
- Project metadata (name, version, authors)
- Dependencies
- Build system configuration
No setup.py. Modern Python packaging.
**Distribution:** GitHub only. Not published to PyPI.
## Deployment Architecture
Library deployment: pip install from GitHub.
Batch processing deployment: Clone repository, install dependencies, run Python scripts directly.
No Docker containers, no systemd services, no process managers.
## Performance Considerations
No performance optimization.
**Bottlenecks:**
- Network latency (sequential API calls)
- No caching across Align instances
- No request batching
- No connection pooling
**Memory usage:**
- Minimal (only current track metadata in memory)
- No large data structures
- Pandas DataFrame for batch output (could be large for big datasets)
## Security Architecture
Minimal security considerations.
**API credentials:**
- MusicBrainz: No authentication
- Deezer: No authentication
- YouTube Music: No authentication
- Spotify: OAuth2 client credentials in external file
**Secrets management:**
- Spotify credentials in mml_secrets.py (not in repository)
- No encryption
- No environment variables
- No secrets vault
**Input validation:**
- No validation of user input
- No sanitization of metadata strings
- Potential injection vulnerabilities if metadata used in shell commands
## Architectural Strengths
1. **Simple facade:** Single Align class hides complexity
2. **Cascading fallback:** Graceful degradation when services fail
3. **Lazy evaluation:** Only query services when needed
4. **Service isolation:** Each service in separate class
## Architectural Weaknesses
1. **No abstraction:** Service classes have different interfaces
2. **Tight coupling:** Align directly instantiates service classes
3. **No error handling:** Silent failures everywhere
4. **No concurrency:** Sequential execution only
5. **Hardcoded configuration:** No runtime flexibility
6. **No testing:** Hard-to-test design (tight coupling, no injection points for mocks)
7. **Dead code:** AcousticBrainz integration non-functional
8. **Inconsistent patterns:** Function for AcousticBrainz, classes for others
## Architectural Recommendations
For production use, consider:
1. **Define service interface:** Abstract base class for all aligners
2. **Dependency injection:** Pass service instances to Align constructor
3. **Configuration system:** External config for thresholds, endpoints, credentials
4. **Error handling:** Explicit error types, logging, retry logic
5. **Async execution:** Use asyncio for concurrent API calls
6. **Caching layer:** Redis or in-memory cache for repeated queries
7. **Remove dead code:** Delete AcousticBrainz integration
8. **Add tests:** Unit tests with mocked services
9. **Structured logging:** JSON logs with correlation IDs
10. **Rate limiting:** Respect API rate limits with backoff
The core pattern (cascading fallback across services) is sound. The implementation needs significant hardening.
@@ -0,0 +1,807 @@
# MusicMetaLinker Codebase Analysis
## Repository Structure
```
MusicMetaLinker/
├── musicmetalinker/
│   ├── __init__.py
│   ├── linking.py           # Core Align class and service aligners
│   ├── preprocessor.py      # JAMSProcessor for JAMS file handling
│   ├── musicbrainz_dump.py  # MusicBrainz bulk download utilities
│   └── utils.py             # Utility functions (likely)
├── link_partitions.py       # Batch processing CLI
├── prepare_dataset.py       # Dataset preparation scripts
├── deezer_test.ipynb        # Deezer integration testing notebook
├── queries.ipynb            # Query testing notebook
├── pyproject.toml           # Build configuration
├── README.md                # Project documentation
└── LICENSE                  # MIT license
```
**No tests directory.** No test files.
**No docs directory.** Documentation in README only.
**No examples directory.** Examples in notebooks only.
## Code Organization
### linking.py
**Primary module.** Contains all core functionality.
**Classes:**
- **Align:** Main orchestrator class
- **MusicBrainzAlign:** MusicBrainz service integration
- **DeezerAlign:** Deezer service integration
- **YouTubeAlign:** YouTube Music service integration
**Functions:**
- **acousticbrainz_link(mbid):** AcousticBrainz URL checker (defunct)
**Estimated size:** 500-800 lines (based on typical structure).
**Responsibilities:**
- Service coordination
- Query execution
- Result aggregation
- Metadata normalization
**Code quality issues:**
- Debug print() statements in production code
- Commented-out code sections
- Hardcoded configuration values
- No docstrings (likely)
- Inconsistent naming conventions
### preprocessor.py
**JAMS file handling.**
**Classes:**
- **JAMSProcessor:** Read/write JAMS files, extract metadata, enrich with identifiers
**Responsibilities:**
- Parse JAMS JSON structure
- Extract file_metadata and sandbox fields
- Inject new identifiers
- Write enriched JAMS files
**Dependencies:**
- jams library for JAMS format support
- json for JSON parsing
### musicbrainz_dump.py
**Bulk MusicBrainz download utilities.**
**Classes:**
- **MBDownload:** Batch download from MusicBrainz
**Purpose:** Pre-populate datasets with MusicBrainz metadata to reduce API calls.
**Implementation details:** Not fully specified. Likely includes:
- Batch query logic
- Rate limiting (hopefully)
- Local caching
- CSV or JSON output
### link_partitions.py
**Batch processing CLI script.**
**Functionality:**
- Scan directory for JAMS files
- Process each file with Align
- Collect results in pandas DataFrame
- Output CSV with all identifiers
- Optionally write enriched JAMS files
**Command-line arguments:**
- Positional: directory path
- --save: Write enriched JAMS files
- --limit audio: Only process audio files
- --overwrite: Overwrite existing files
**Logging:** File-based to link_partitions.log.
**Progress tracking:** tqdm progress bars.
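The arguments listed above could be declared with argparse roughly as follows. This is a minimal sketch, not the script's actual code: the function name `build_parser` is hypothetical, and only the value "audio" is documented for --limit, so no other choices are assumed.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented arguments."""
    parser = argparse.ArgumentParser(
        prog="link_partitions.py",
        description="Link a directory of JAMS files to external identifiers.",
    )
    # Positional argument: directory that is scanned for JAMS files
    parser.add_argument("directory", help="path to the directory of JAMS files")
    # --save: write enriched JAMS files in addition to the CSV
    parser.add_argument("--save", action="store_true", help="write enriched JAMS files")
    # --limit: restrict processing to one kind of file (only "audio" is documented)
    parser.add_argument("--limit", default=None, help='restrict processing, e.g. "audio"')
    # --overwrite: replace files that already exist
    parser.add_argument("--overwrite", action="store_true", help="overwrite existing files")
    return parser


args = build_parser().parse_args(["/path/to/jams", "--save", "--limit", "audio"])
print(args.directory, args.save, args.limit, args.overwrite)
```

Using `store_true` flags keeps --save and --overwrite off by default, which matches the opt-in behavior described for both options.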
### prepare_dataset.py
**Dataset preparation utilities.**
**Functionality:** Not fully specified. Likely includes:
- Data cleaning
- Format conversion
- Metadata normalization
- Spotify ISRC extraction for Billboard dataset
**Spotify integration:** Uses spotipy with credentials from mml_secrets.py.
### Notebooks
**deezer_test.ipynb:** Interactive testing of Deezer integration.
**queries.ipynb:** Interactive testing of various query patterns.
**Purpose:** Manual testing and exploration. Not automated tests.
## Configuration Management
### Hardcoded Configuration
All configuration values hardcoded in source files.
**linking.py:**
```python
# MusicBrainz User-Agent
musicbrainzngs.set_useragent("elka", "0.1")
# Duration thresholds
MUSICBRAINZ_DURATION_THRESHOLD = 5 # seconds
DEEZER_DURATION_THRESHOLD = 3 # seconds
# Similarity threshold
SIMILARITY_THRESHOLD = 0.8
```
**Issues:**
- No runtime configuration
- Changing thresholds requires code modification
- No environment-specific settings
- "elka/0.1" User-Agent suggests code copied from another project
### External Configuration
**Only external config:** mml_secrets.py for Spotify credentials.
**Not in repository.** Users must create manually.
**Structure:**
```python
SPOTIFY_CLIENT_ID = "..."
SPOTIFY_CLIENT_SECRET = "..."
```
**Import pattern:**
```python
try:
    from mml_secrets import SPOTIFY_CLIENT_ID, SPOTIFY_CLIENT_SECRET
except ImportError:
    SPOTIFY_CLIENT_ID = None
    SPOTIFY_CLIENT_SECRET = None
**Graceful degradation:** If mml_secrets.py missing, Spotify features disabled.
### Configuration Recommendations
1. **Use environment variables:**
```python
import os
SPOTIFY_CLIENT_ID = os.getenv("SPOTIFY_CLIENT_ID")
MUSICBRAINZ_USER_AGENT = os.getenv("MUSICBRAINZ_USER_AGENT", "MusicMetaLinker/0.0.1")
DEEZER_DURATION_THRESHOLD = int(os.getenv("DEEZER_DURATION_THRESHOLD", "3"))
```
2. **Add config file support:**
```python
import configparser
config = configparser.ConfigParser()
config.read("musicmetalinker.ini")
DEEZER_DURATION_THRESHOLD = config.getint("matching", "deezer_duration_threshold", fallback=3)
```
3. **Add runtime configuration:**
```python
linker = Align(
    artist="...",
    track="...",
    config={
        "deezer_duration_threshold": 5,
        "similarity_threshold": 0.9
    }
)
```
## Logging Architecture
### Logging Implementation
**Library:** Python standard logging module.
**Configuration:**
```python
import logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)
```
**Log levels used:**
- INFO: Normal operation (file processing, successful queries)
- ERROR: Failed queries, network errors
**Not used:**
- DEBUG: No debug-level logging
- WARNING: No warnings
- CRITICAL: No critical errors
### Logging Locations
**Batch processing:** File-based logging to link_partitions.log.
```python
file_handler = logging.FileHandler('link_partitions.log')
logger.addHandler(file_handler)
```
**Library usage:** Console logging.
```python
console_handler = logging.StreamHandler()
logger.addHandler(console_handler)
```
### Debug Output Issues
**Multiple print() statements in production code:**
```python
print(f"Querying MusicBrainz for {artist} - {track}")
print(f"Found MBID: {mbid}")
print(f"Deezer search returned {len(results)} results")
```
**Problems:**
- Not controlled by logging configuration
- Can't disable without code changes
- No log levels
- No timestamps
- Mixes with actual output
**Recommendation:** Replace all print() with logger.debug().
### Logging Recommendations
1. **Remove print() statements:**
```python
# Before
print(f"Querying MusicBrainz for {artist} - {track}")
# After
logger.debug("Querying MusicBrainz for %s - %s", artist, track)
```
2. **Add structured logging:**
```python
import structlog
logger = structlog.get_logger()
logger.info("musicbrainz_query", artist=artist, track=track, mbid=mbid)
```
3. **Add correlation IDs:**
```python
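import uuid
import structlog

# Keyword arguments on log calls require structlog; stdlib logging would raise TypeError
logger = structlog.get_logger()
correlation_id = str(uuid.uuid4())
logger.info("query_started", correlation_id=correlation_id, artist=artist)
# ... queries ...
logger.info("query_completed", correlation_id=correlation_id, mbid=mbid)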
```
4. **Add log levels:**
```python
logger.debug("Attempting MusicBrainz query")
logger.info("Successfully retrieved MBID")
logger.warning("Deezer query returned no results, falling back to YouTube")
logger.error("All services failed", exc_info=True)
```
## Code Quality
### Code Smells
**Debug prints in production:**
```python
print("DEBUG: entering get_mbid()")
print(f"DEBUG: mbid_track = {self.mbid_track}")
```
**Commented-out code:**
```python
# if duration:
# matches = [r for r in results if abs(r['duration_seconds'] - duration) < 10]
```
**Hardcoded values:**
```python
musicbrainzngs.set_useragent("elka", "0.1") # Should be "MusicMetaLinker/0.0.1"
```
**Inconsistent naming:**
```python
mbid_track # snake_case
mbidTrack # camelCase (in some places)
MBID # UPPER_CASE
```
**No docstrings:**
```python
def get_mbid(self):
    # No docstring explaining what this returns or when it returns None
    ...
```
**Broad exception catching:**
```python
try:
    result = service.query()
except:  # Catches everything, including KeyboardInterrupt
    return None
```
### Code Quality Metrics
**Estimated metrics (without actual analysis):**
- **Lines of code:** ~1500-2000
- **Cyclomatic complexity:** Moderate (nested conditionals in matching logic)
- **Code duplication:** Moderate (similar patterns across service aligners)
- **Test coverage:** 0% (no tests)
- **Documentation coverage:** Low (minimal docstrings)
### Linting Issues
**No linting configuration.** Running pylint or flake8 would likely find:
- Unused imports
- Unused variables
- Line too long (>79 characters)
- Missing docstrings
- Bare except clauses
- Inconsistent naming
- Wildcard imports (if any)
### Type Hints
**Minimal type hints.** Likely no type annotations on most functions.
**Example of missing type hints:**
```python
from typing import Optional

# Current (no type hints)
def get_mbid(self):
    ...

# With type hints
def get_mbid(self) -> Optional[str]:
    ...
```
**Benefits of adding type hints:**
- Static type checking with mypy
- Better IDE autocomplete
- Self-documenting code
- Catch type errors before runtime
## Testing
### Test Coverage
**No automated tests.** No test directory, no test files.
**Testing approach:**
- Manual testing via Jupyter notebooks
- if __name__ == "__main__" blocks in some modules
**Example if __name__ == "__main__" block:**
```python
if __name__ == "__main__":
    linker = Align(artist="The Beatles", track="Hey Jude")
    print(linker.get_mbid())
    print(linker.get_isrc())
```
**Not real tests:** No assertions, no test framework, no automation.
### Testing Recommendations
**Unit tests with mocked services:**
```python
import pytest
from unittest.mock import Mock, patch

from musicmetalinker.linking import Align


def test_get_mbid_with_provided_mbid():
    # A supplied MBID should be returned without any service query
    linker = Align(mbid_track="test-mbid")
    assert linker.get_mbid() == "test-mbid"


@patch('musicmetalinker.linking.musicbrainzngs')
def test_get_mbid_queries_musicbrainz(mock_mb):
    # Replace the MusicBrainz client with a mock returning one recording
    mock_mb.search_recordings.return_value = {
        'recording-list': [{'id': 'found-mbid'}]
    }
    linker = Align(artist="Test Artist", track="Test Track")
    mbid = linker.get_mbid()
    assert mbid == "found-mbid"
    mock_mb.search_recordings.assert_called_once()
```
**Integration tests:**
```python
import pytest

from musicmetalinker.linking import Align


@pytest.mark.integration
def test_real_musicbrainz_query():
    linker = Align(artist="The Beatles", track="Hey Jude")
    mbid = linker.get_mbid()
    assert mbid is not None
    assert len(mbid) == 36  # UUID length
```
**Test coverage goals:**
- Unit tests: 80%+ coverage
- Integration tests: Critical paths
- Mock all external API calls in unit tests
- Real API calls only in integration tests (marked with @pytest.mark.integration)
## Error Handling
### Current Error Handling
**Pattern throughout codebase:**
```python
try:
    result = service.query()
    return result
except:
    return None
```
**Issues:**
- Catches all exceptions (including KeyboardInterrupt, SystemExit)
- No error logging
- No distinction between error types
- Silent failures
### Error Handling Recommendations
**Specific exception handling:**
```python
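import logging

import requests

logger = logging.getLogger(__name__)

try:
    result = service.query()
    return result
except requests.exceptions.Timeout:
    # stdlib logging takes positional format arguments, not arbitrary keywords
    logger.warning("Service timeout (service=%s)", "musicbrainz")
    return None
except requests.exceptions.ConnectionError:
    logger.error("Service unavailable (service=%s)", "musicbrainz")
    return None
except Exception:
    # exc_info=True attaches the traceback to the log record
    logger.error("Unexpected error (service=%s)", "musicbrainz", exc_info=True)
    return None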
```
**Custom exceptions:**
```python
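class MusicMetaLinkerError(Exception):
    """Base class for all MusicMetaLinker errors."""

class ServiceUnavailableError(MusicMetaLinkerError):
    """An external service could not be reached."""

class InvalidInputError(MusicMetaLinkerError):
    """The supplied metadata or identifier is malformed."""

class NoMatchFoundError(MusicMetaLinkerError):
    """No service returned a matching track."""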
```
**Explicit error returns:**
```python
from typing import Optional, Union

def get_mbid(self) -> Union[str, None, MusicMetaLinkerError]:
    try:
        ...
    except ServiceUnavailableError as e:
        return e  # Return error instead of None
```
## Performance Considerations
### Performance Bottlenecks
**Network latency:** Sequential API calls. Total latency = sum of all service latencies.
**No caching:** Repeated queries for same track.
**No connection pooling:** New connection for each request.
**No request batching:** One request per track.
### Performance Optimization Opportunities
**1. Async/await for concurrent queries:**
```python
import asyncio
import aiohttp

async def get_all_metadata(self):
    # Run the three service lookups concurrently instead of sequentially
    tasks = [
        self.get_mbid_async(),
        self.get_deezer_id_async(),
        self.get_youtube_link_async()
    ]
    results = await asyncio.gather(*tasks)
    return results
```
**2. Persistent cache:**
```python
import redis

cache = redis.Redis()

def get_mbid(self):
    cache_key = f"mbid:{self.artist}:{self.track}"
    cached = cache.get(cache_key)
    if cached:
        return cached.decode()
    mbid = self._query_mbid()
    if mbid is not None:  # Redis cannot store None, so misses are not cached
        cache.setex(cache_key, 86400, mbid)  # 24 hour TTL
    return mbid
```
**3. Connection pooling:**
```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
session = requests.Session()
retry = Retry(total=3, backoff_factor=0.3)
adapter = HTTPAdapter(max_retries=retry, pool_connections=10, pool_maxsize=20)
session.mount('http://', adapter)
session.mount('https://', adapter)
```
**4. Batch processing parallelization:**
```python
from multiprocessing import Pool

def process_track(jams_file):
    processor = JAMSProcessor(jams_file)
    metadata = processor.extract_metadata()
    linker = Align(**metadata)
    return linker.get_all_metadata()

# The guard is required on platforms that spawn worker processes (Windows, macOS)
if __name__ == "__main__":
    with Pool(processes=4) as pool:
        results = pool.map(process_track, jams_files)
```
## Code Maintainability
### Maintainability Issues
**Tight coupling:** Align class directly instantiates service classes. Hard to mock for testing.
**No abstraction:** Service classes have different interfaces. No common base class.
**Hardcoded configuration:** Changing thresholds requires code modification.
**No documentation:** Minimal docstrings, no API documentation.
**Dead code:** AcousticBrainz integration non-functional.
**Inconsistent patterns:** Function for AcousticBrainz, classes for other services.
### Maintainability Recommendations
**1. Define service interface:**
```python
from abc import ABC, abstractmethod
from typing import Optional

class ServiceAligner(ABC):
    @abstractmethod
    def search_by_isrc(self, isrc: str) -> Optional[dict]:
        pass

    @abstractmethod
    def search_by_metadata(self, artist: str, track: str, album: str) -> Optional[dict]:
        pass
```
**2. Dependency injection:**
```python
from typing import List

class Align:
    def __init__(self, services: List[ServiceAligner], **metadata):
        self.services = services
        self.metadata = metadata
```
**3. Add docstrings:**
```python
def get_mbid(self) -> Optional[str]:
    """
    Retrieve MusicBrainz recording ID.

    Queries MusicBrainz by MBID (if provided), ISRC, or metadata.
    Returns None if no match found or service unavailable.

    Returns:
        MusicBrainz recording ID (UUID format) or None
    """
    ...
```
**4. Remove dead code:**
Delete acousticbrainz_link() function and all references.
**5. Add configuration class:**
```python
from dataclasses import dataclass

@dataclass
class MatchingConfig:
    deezer_duration_threshold: int = 3
    musicbrainz_duration_threshold: int = 5
    similarity_threshold: float = 0.8
    user_agent: str = "MusicMetaLinker/0.0.1"
```
## Security Considerations
### Security Issues
**Plaintext credentials:** Spotify credentials in mml_secrets.py (not encrypted).
**No input validation:** Metadata strings not sanitized.
**Broad exception catching:** May hide security-relevant errors.
**No dependency scanning:** Vulnerable dependencies unknown.
### Security Recommendations
**1. Encrypt credentials:**
```python
import os

from cryptography.fernet import Fernet

key = os.getenv("ENCRYPTION_KEY")  # Must hold a key produced by Fernet.generate_key()
cipher = Fernet(key)
encrypted_secret = cipher.encrypt(SPOTIFY_CLIENT_SECRET.encode())
```
**2. Input validation:**
```python
import re

def validate_mbid(mbid: str) -> bool:
    # MBIDs are UUIDs: 8-4-4-4-12 hexadecimal digits
    uuid_pattern = r'^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$'
    return bool(re.match(uuid_pattern, mbid, re.IGNORECASE))

def validate_isrc(isrc: str) -> bool:
    # ISRC: 2-letter country code, 3-character registrant, 7 digits
    isrc_pattern = r'^[A-Z]{2}[A-Z0-9]{3}[0-9]{7}$'
    return bool(re.match(isrc_pattern, isrc))
```
**3. Dependency scanning:**
```bash
pip install pip-audit
pip-audit
```
**4. Security headers for API calls:**
```python
import uuid

import requests

headers = {
    'User-Agent': 'MusicMetaLinker/0.0.1',
    'X-Request-ID': str(uuid.uuid4())
}
response = requests.get(url, headers=headers)
```
## Code Recommendations Summary
### Immediate Fixes
1. Remove all print() statements, replace with logger.debug()
2. Remove commented-out code
3. Fix User-Agent: "elka/0.1" → "MusicMetaLinker/0.0.1"
4. Remove AcousticBrainz integration
5. Add docstrings to all public methods
### Short-Term Improvements
1. Add type hints throughout codebase
2. Add unit tests with mocked services
3. Add linting (pylint, flake8)
4. Add formatting (black, isort)
5. Add specific exception handling
6. Add input validation
7. Add configuration system
### Long-Term Enhancements
1. Refactor to use service interface abstraction
2. Add dependency injection
3. Add async/await for concurrent queries
4. Add persistent caching
5. Add connection pooling
6. Add structured logging
7. Add monitoring and metrics
8. Add comprehensive documentation
9. Add integration tests
10. Add CI/CD pipeline
## Codebase Maturity Assessment
**Current state:** Research prototype. Pre-release quality.
**Maturity level:** 2/5
**Strengths:**
- Clear separation of concerns (service classes)
- Simple, understandable structure
- Functional for research use
**Weaknesses:**
- No tests
- Debug code in production
- Hardcoded configuration
- Dead code
- No documentation
- No meaningful error handling (bare except clauses, silent failures)
- No input validation
**Recommendation:** Suitable for academic exploration. Requires significant refactoring for production use.
@@ -0,0 +1,501 @@
# MusicMetaLinker Data Architecture
## Data Storage Model
MusicMetaLinker has no persistent data storage. All data is in-memory during execution.
**No database:** No SQL, no NoSQL, no embedded databases.
**No file-based persistence:** No local cache files, no serialized objects (except JAMS output).
**Stateless operation:** Each Align instance is independent. No shared state across instances.
## Input Data Formats
### Python Objects
Primary input method: Constructor parameters to Align class.
**Supported data types:**
```python
{
    "mbid_track": str,        # UUID format
    "mbid_release": str,      # UUID format
    "artist": str,            # Free text
    "album": str,             # Free text
    "track": str,             # Free text
    "track_number": int,      # Positive integer
    "duration": int | float,  # Seconds
    "isrc": str,              # ISRC format (no validation)
    "strict": bool            # Matching mode
}
```
**No validation:** Input accepted as-is. Invalid data causes silent failures (returns None).
**No normalization:** Artist names, track titles used exactly as provided. No case normalization, no whitespace trimming, no Unicode normalization.
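The missing normalization steps can be sketched with the standard library alone. The helper name `normalize_text` is hypothetical and nothing like it exists in MusicMetaLinker; the sketch only illustrates what case, whitespace, and Unicode normalization would involve.

```python
import unicodedata


def normalize_text(value):
    """Return a comparison-friendly form of a metadata string, or None."""
    if value is None:
        return None
    # NFC composes "e" + combining accent into a single "é" code point
    text = unicodedata.normalize("NFC", value)
    # Collapse runs of whitespace and trim both ends
    text = " ".join(text.split())
    # casefold() is a stronger lower() intended for caseless matching
    return text.casefold() or None


print(normalize_text("  The   BEATLES "))
print(normalize_text("Beyonce\u0301") == normalize_text("Beyonc\u00e9"))
```

Normalized strings are suited to comparison and cache keys; the original spelling should still be what gets sent to services and written to output.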
### JAMS Files
JAMS (JSON Annotated Music Specification) is the standard input format for batch processing.
**JAMS structure:**
```json
{
  "file_metadata": {
    "title": "Track Name",
    "artist": "Artist Name",
    "release": "Album Name",
    "duration": 123.45,
    "identifiers": {
      "musicbrainz": "mbid-uuid-here",
      "isrc": "GBAYE9200070"
    }
  },
  "sandbox": {
    "type": "music_type",
    "genre": "rock",
    "track_number": 1,
    "release_year": 2020
  },
  "annotations": []
}
```
**Key sections:**
**file_metadata:** Core track metadata. Required fields: title, artist. Optional: release, duration, identifiers.
**sandbox:** Additional metadata. Free-form structure. Common fields: type, genre, track_number, release_year.
**annotations:** Music information retrieval annotations (not used by MusicMetaLinker).
**Parsing logic:**
JAMSProcessor extracts:
- title → track
- artist → artist
- release → album
- duration → duration
- identifiers.musicbrainz → mbid_track
- identifiers.isrc → isrc
- sandbox.track_number → track_number
**Missing fields:** Treated as None. No errors raised.
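The field mapping above can be illustrated on a plain parsed JSON document. This sketch assumes the JAMS layout shown earlier and uses the json module rather than the jams library; `extract_align_kwargs` is a hypothetical name, not the JAMSProcessor API.

```python
import json


def extract_align_kwargs(jams_doc: dict) -> dict:
    """Map a parsed JAMS document to the Align keyword names listed above."""
    meta = jams_doc.get("file_metadata") or {}
    identifiers = meta.get("identifiers") or {}
    sandbox = jams_doc.get("sandbox") or {}
    # dict.get returns None for absent keys, so missing fields never raise
    return {
        "track": meta.get("title"),
        "artist": meta.get("artist"),
        "album": meta.get("release"),
        "duration": meta.get("duration"),
        "mbid_track": identifiers.get("musicbrainz"),
        "isrc": identifiers.get("isrc"),
        "track_number": sandbox.get("track_number"),
    }


raw = '{"file_metadata": {"title": "Track Name", "artist": "Artist Name"}, "sandbox": {"track_number": 1}}'
kwargs = extract_align_kwargs(json.loads(raw))
print(kwargs)
```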
### CSV Input
No direct CSV input support. Batch processing outputs CSV but doesn't read it.
For CSV input, users must:
1. Parse CSV manually
2. Create Align instances per row
3. Collect results
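The three manual steps might look like the sketch below. `link_rows` and `FakeLinker` are hypothetical; `FakeLinker` stands in for Align so the example runs without network access, and the column names are assumed to match the constructor parameters.

```python
import csv
import io


def link_rows(csv_file, make_linker):
    """Build one linker per CSV row and collect a getter result for each."""
    results = []
    for row in csv.DictReader(csv_file):
        # Empty cells become None so they are treated as missing fields
        kwargs = {key: (value or None) for key, value in row.items()}
        linker = make_linker(**kwargs)
        results.append({**kwargs, "mbid": linker.get_mbid()})
    return results


class FakeLinker:
    """Stand-in for Align; a real linker would query MusicBrainz."""

    def __init__(self, **kwargs):
        self.kwargs = kwargs

    def get_mbid(self):
        return None


sample = io.StringIO("artist,track,album\nQueen,Bohemian Rhapsody,\n")
rows = link_rows(sample, FakeLinker)
print(rows)
```

Passing the linker factory as an argument keeps the loop testable; in real use `make_linker` would be the Align class itself.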
## Output Data Formats
### Python Objects
Align instance acts as data container. Getters return individual fields.
**No structured output method:** No to_dict(), no to_json(), no serialize().
**Manual aggregation required:**
```python
linker = Align(...)
result = {
    "artist": linker.get_artist(),
    "track": linker.get_track(),
    "mbid": linker.get_mbid(),
    "isrc": linker.get_isrc(),
    "deezer_id": linker.get_deezer_id(),
    # ... etc
}
```
### JAMS Files
Enriched JAMS files with added identifiers.
**Enrichment process:**
1. Read original JAMS file
2. Extract metadata
3. Create Align instance
4. Query all services
5. Add identifiers to file_metadata.identifiers section
6. Write enriched JAMS file
**Added identifiers:**
```json
{
  "file_metadata": {
    "identifiers": {
      "musicbrainz": "mbid-from-query",
      "isrc": "isrc-from-query",
      "deezer": "deezer-id-from-query",
      "youtube": "youtube-url-from-query",
      "acousticbrainz": null
    }
  }
}
```
**Preservation:** Original JAMS structure preserved. Only identifiers section modified.
**Overwrite behavior:** Controlled by --overwrite flag. Without flag, existing identifiers preserved.
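The preserve-or-overwrite rule can be expressed as a small merge function. This is a sketch of the described behavior, not the project's code: `merge_identifiers` is a hypothetical name, and the choice that a failed lookup (None) never erases a stored identifier is an assumption.

```python
def merge_identifiers(existing: dict, found: dict, overwrite: bool = False) -> dict:
    """Combine stored identifiers with newly retrieved ones."""
    merged = dict(existing)
    for name, value in found.items():
        if value is None:
            continue  # Assumption: a failed lookup never erases a stored value
        if overwrite or merged.get(name) is None:
            merged[name] = value
    return merged


stored = {"musicbrainz": "old-mbid", "isrc": None}
retrieved = {"musicbrainz": "new-mbid", "isrc": "GBAYE9200070", "deezer": None}
print(merge_identifiers(stored, retrieved))
print(merge_identifiers(stored, retrieved, overwrite=True))
```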
### CSV Output
Batch processing generates CSV with all metadata and identifiers.
**CSV schema:**
| Column | Type | Description |
|--------|------|-------------|
| jams_file | str | Original JAMS filename |
| track_name | str | Track title |
| artist_name | str | Artist name |
| album_name | str | Album/release name |
| track_number | int | Track position |
| duration | float | Duration in seconds |
| release_year | int | Release year |
| musicbrainz | str | MBID (UUID) |
| isrc | str | ISRC code |
| deezer_id | int | Deezer track ID |
| deezer_url | str | Full Deezer URL |
| youtube_url | str | Full YouTube URL |
| acousticbrainz | str | AcousticBrainz URL (always null) |
| spotify_id | str | Spotify ID (if available) |
**Missing values:** Empty cells or "None" string (inconsistent).
**Encoding:** UTF-8. No BOM.
**Delimiter:** Comma. No escaping issues documented.
**Headers:** First row contains column names.
**Output location:** Same directory as input JAMS files, named based on directory name.
## Data Transformation Pipeline
### Input Transformation
1. **JAMS parsing:** JSON deserialization via jams library
2. **Field extraction:** Map JAMS fields to Align parameters
3. **Type conversion:** String to int for track_number, string to float for duration
4. **Null handling:** Missing fields become None
### Query Transformation
1. **Metadata normalization:** None (passed as-is to services)
2. **Duration conversion:** MusicBrainz milliseconds → seconds
3. **ID extraction:** Parse service-specific response formats
4. **URL construction:** Build full URLs from IDs
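Steps 2 and 4 amount to small pure functions. The helper names are hypothetical, and the URL patterns are assumptions for illustration rather than values confirmed from the library's source.

```python
def ms_to_seconds(length_ms):
    """Convert a length in milliseconds (int or numeric string) to seconds."""
    return None if length_ms is None else int(length_ms) / 1000.0


def deezer_track_url(track_id):
    # Assumed URL pattern for a Deezer track page
    return None if track_id is None else f"https://www.deezer.com/track/{track_id}"


def youtube_watch_url(video_id):
    # Assumed URL pattern for a YouTube video
    return None if video_id is None else f"https://www.youtube.com/watch?v={video_id}"


print(ms_to_seconds("354000"))
print(deezer_track_url(3135556))
print(youtube_watch_url("dQw4w9WgXcQ"))
```

Returning None for a missing input keeps these helpers consistent with the null handling described for the input stage.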
### Output Transformation
1. **Result aggregation:** Collect all getter results
2. **CSV serialization:** pandas DataFrame to CSV
3. **JAMS enrichment:** Inject identifiers into JSON structure
4. **File writing:** JSON serialization with indentation
## Data Quality Issues
### Input Data Quality
**No validation:**
- Invalid MBIDs accepted (wrong format, non-existent)
- Invalid ISRCs accepted (wrong format, non-existent)
- Negative durations accepted
- Empty strings accepted
**No sanitization:**
- Special characters in metadata not escaped
- SQL injection risk if metadata used in queries (not applicable here)
- Command injection risk if metadata used in shell commands (not applicable here)
**No normalization:**
- "The Beatles" vs "Beatles" treated as different
- "feat." vs "featuring" vs "ft." not normalized
- Unicode variants not normalized (e.g., é vs e + combining accent)
### Output Data Quality
**Inconsistent null representation:**
- Python: None
- CSV: Empty string or "None" string
- JAMS: null or missing key
**No data validation:**
- Retrieved MBIDs not validated as UUIDs
- Retrieved ISRCs not validated as ISRC format
- Retrieved URLs not validated as valid URLs
**No conflict resolution:**
- If MusicBrainz and Deezer return different artists, no reconciliation
- First successful query wins, no cross-validation
### Data Accuracy Issues
**YouTube matching:** Weak matching logic. First result assumed correct. High false positive rate.
**Duration filtering:** ±3 seconds threshold may be too loose for short tracks, too strict for live recordings.
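A fixed tolerance and one possible alternative can be compared directly. Both functions are hypothetical sketches; the relative tolerance (2% of track length with a 1 second floor) is an illustrative choice, not something the library implements.

```python
def within_duration(candidate_seconds, reference_seconds, threshold=3.0):
    """True when two durations differ by at most `threshold` seconds."""
    if candidate_seconds is None or reference_seconds is None:
        return True  # Nothing to compare against, so do not reject
    return abs(candidate_seconds - reference_seconds) <= threshold


def relative_threshold(reference_seconds, fraction=0.02, floor=1.0):
    """Hypothetical alternative: scale the tolerance with track length."""
    return max(floor, reference_seconds * fraction)


# A 30 s track matched against a 33 s candidate passes the fixed 3 s rule
print(within_duration(33.0, 30.0))
# The same pair fails once the tolerance scales with track length
print(within_duration(33.0, 30.0, relative_threshold(30.0)))
# A 10 minute live recording tolerates a 6 s difference under the scaled rule
print(within_duration(606.0, 600.0, relative_threshold(600.0)))
```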
**Fuzzy matching:** No documented algorithm. Likely simple string similarity. Doesn't handle:
- Transliterations (e.g., Japanese to romaji)
- Abbreviations (e.g., "feat." vs "featuring")
- Reorderings (e.g., "Artist feat. Guest" vs "Guest & Artist")
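Because the matching algorithm is undocumented, the sketch below substitutes difflib's SequenceMatcher to show how a plain character-level similarity with a 0.8 threshold behaves. It is an assumption about the general approach, not the library's implementation.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Ratio in [0, 1] based on matching character blocks."""
    return SequenceMatcher(None, a.casefold(), b.casefold()).ratio()


def is_match(a: str, b: str, threshold: float = 0.8) -> bool:
    return similarity(a, b) >= threshold


# Case differences are absorbed by casefold()
print(is_match("Hey Jude", "hey jude"))
# Reordered credits share few contiguous blocks and fall below the threshold
print(is_match("Artist feat. Guest", "Guest & Artist"))
```

The second call illustrates the reordering weakness: both strings name the same two performers, yet a sequence-based ratio treats them as a poor match.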
**AcousticBrainz:** Always returns null (service shut down). Dead data field.
## Data Flow Diagrams
### Single Track Flow
```
Input (Python dict or JAMS)
Align constructor
[Lazy evaluation - no queries yet]
User calls getter (e.g., get_mbid())
Check cache
If not cached:
    Determine service to query
    Execute service query
    Parse response
    Cache result
Return to user
```
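The lazy, per-instance caching in this flow can be sketched as follows. `LazyLinker` and `fake_query` are hypothetical; the injected callable stands in for a real service query so the behavior can be observed offline.

```python
class LazyLinker:
    """Minimal illustration of per-instance lazy evaluation."""

    def __init__(self, query_mbid, **metadata):
        self.metadata = metadata
        self._query_mbid = query_mbid  # Callable standing in for a service query
        self._cache = {}               # Field name -> cached value

    def get_mbid(self):
        # The field name itself serves as the cache key
        if "mbid" not in self._cache:
            self._cache["mbid"] = self._query_mbid(self.metadata)
        return self._cache["mbid"]


calls = []


def fake_query(metadata):
    calls.append(metadata)
    return "found-mbid"


linker = LazyLinker(fake_query, artist="Queen", track="Bohemian Rhapsody")
queries_after_init = len(calls)  # Constructing the object triggers no query
linker.get_mbid()
linker.get_mbid()
print(queries_after_init, len(calls))  # 0 1
```

Checking for key presence rather than truthiness means a None result is cached too, so a failed lookup is not retried on every call.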
### Batch Processing Flow
```
Directory of JAMS files
For each JAMS file:
    JAMSProcessor.extract_metadata()
    Create Align instance
    Call all getters
    Collect results in list
End loop
Convert list to pandas DataFrame
Write CSV
Optionally write enriched JAMS files
```
### Service Query Flow
```
Align.get_mbid()
If mbid_track provided:
    Return mbid_track
Else if isrc provided:
    Query MusicBrainz by ISRC
Else:
    Query MusicBrainz by metadata
Parse MusicBrainz response
Extract MBID
Cache and return
```
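The precedence in this flow (provided MBID, then ISRC, then metadata) can be written as one function. `resolve_mbid` and its callable parameters are hypothetical; the lambdas in the usage lines replace real MusicBrainz queries.

```python
def resolve_mbid(mbid_track=None, isrc=None, artist=None, track=None,
                 by_isrc=None, by_metadata=None):
    """Return (mbid, route) using the cheapest available route."""
    if mbid_track:
        return mbid_track, "provided"   # No query needed
    if isrc and by_isrc:
        return by_isrc(isrc), "isrc"    # Exact identifier lookup
    if artist and track and by_metadata:
        return by_metadata(artist, track), "metadata"  # Text search
    return None, "none"


lookup_isrc = lambda isrc: "mbid-via-isrc"
lookup_meta = lambda artist, track: "mbid-via-metadata"

print(resolve_mbid(mbid_track="given-mbid", isrc="GBAYE9200070", by_isrc=lookup_isrc))
print(resolve_mbid(isrc="GBAYE9200070", by_isrc=lookup_isrc, by_metadata=lookup_meta))
print(resolve_mbid(artist="Queen", track="Bohemian Rhapsody", by_metadata=lookup_meta))
```

Returning the route alongside the value is an addition for illustration; it makes it visible which branch produced the identifier.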
## Data Caching Strategy
### In-Memory Cache
**Scope:** Single Align instance only.
**Cache key:** Implicit (field name). No explicit key generation.
**Cache invalidation:** None. Values cached for instance lifetime.
**Cache size:** Small (one value per field, ~15 fields max).
**Cache hit rate:** High for repeated getter calls on same instance. Zero across instances.
### No Persistent Cache
**Implications:**
- Repeated queries for same track across runs
- No offline operation
- Network dependency for every query
**Batch processing impact:**
- Processing 1000 tracks = 1000+ API calls
- No deduplication across tracks
- High network usage
### Cache Recommendations
For production use:
1. **Add persistent cache:** Redis or SQLite for cross-run caching
2. **Cache key:** Hash of (artist, track, album, duration)
3. **TTL:** 30 days (metadata rarely changes)
4. **Invalidation:** Manual or TTL-based
5. **Deduplication:** Cache identical queries across tracks
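The recommendations above can be combined in a small SQLite-backed cache. Everything here is a hypothetical sketch: the class name `SqliteCache`, the table layout, and the JSON-encoded values are illustrative choices, and the `now` parameter exists only to make expiry observable without waiting.

```python
import hashlib
import json
import sqlite3
import time

TTL_SECONDS = 30 * 24 * 3600  # 30 days, as suggested above


def cache_key(artist, track, album=None, duration=None):
    """Stable hash of the query fields; identical queries share one key."""
    payload = json.dumps([artist, track, album, duration], ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class SqliteCache:
    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS cache "
            "(key TEXT PRIMARY KEY, value TEXT NOT NULL, stored_at REAL NOT NULL)"
        )

    def get(self, key, now=None):
        now = time.time() if now is None else now
        row = self.db.execute(
            "SELECT value, stored_at FROM cache WHERE key = ?", (key,)
        ).fetchone()
        if row is None or now - row[1] > TTL_SECONDS:
            return None  # Missing or expired
        return json.loads(row[0])

    def set(self, key, value, now=None):
        now = time.time() if now is None else now
        self.db.execute(
            "INSERT OR REPLACE INTO cache (key, value, stored_at) VALUES (?, ?, ?)",
            (key, json.dumps(value), now),
        )
        self.db.commit()


cache = SqliteCache()
key = cache_key("Queen", "Bohemian Rhapsody", "A Night at the Opera", 354)
cache.set(key, {"mbid": "example-mbid"}, now=0)
print(cache.get(key, now=60))
print(cache.get(key, now=TTL_SECONDS + 1))
```

Passing a file path instead of ":memory:" would persist entries across runs, which is the point of the recommendation.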
## Data Privacy and Security
### Personal Data
**No personal data collected:** Only public music metadata.
**No user tracking:** No analytics, no telemetry.
**No data sharing:** Results not sent to third parties.
### API Credentials
**Spotify credentials:** Stored in external mml_secrets.py file. Not encrypted. Not in version control.
**Other services:** No credentials required.
### Data Retention
**No retention:** All data discarded when Align instance destroyed.
**Batch output:** CSV and JAMS files written to disk. User responsible for retention and deletion.
## Data Consistency
### Cross-Service Consistency
**No consistency checks:** If MusicBrainz returns artist "The Beatles" and Deezer returns "Beatles", no reconciliation.
**First-wins strategy:** First successful query result used. No validation against other services.
**Conflict scenarios:**
- Different artists across services
- Different track names across services
- Different durations across services
**No conflict resolution:** User receives inconsistent data.
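Detecting such conflicts is straightforward once per-service results are kept side by side. The function `find_conflicts` and the input layout (service name mapped to a dict of fields) are assumptions for illustration; MusicMetaLinker does not expose results in this form.

```python
def find_conflicts(results_by_service: dict) -> dict:
    """Return fields whose non-null values differ between services."""
    seen = {}
    for service, fields in results_by_service.items():
        for field, value in fields.items():
            if value is not None:
                seen.setdefault(field, {})[service] = value
    # Keep only fields where at least two distinct values were reported
    return {
        field: values
        for field, values in seen.items()
        if len(set(values.values())) > 1
    }


conflicts = find_conflicts({
    "musicbrainz": {"artist": "The Beatles", "track": "Hey Jude"},
    "deezer": {"artist": "Beatles", "track": "Hey Jude"},
})
print(conflicts)
```

The comparison is exact, so it reports "The Beatles" versus "Beatles" as a conflict; a reconciliation step would need normalization or a similarity measure on top.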
### Temporal Consistency
**No versioning:** Metadata retrieved at query time. No timestamp recorded.
**Staleness:** If MusicBrainz updates metadata after query, Align instance has stale data.
**No refresh:** No way to refresh cached data without creating new instance.
## Data Completeness
### Missing Data Handling
**Graceful degradation:** Missing fields return None. No errors.
**Partial results:** If MusicBrainz succeeds but Deezer fails, MusicBrainz data returned.
**No completeness metrics:** No indication of how many fields successfully retrieved.
### Required vs Optional Fields
**No required fields:** All constructor parameters optional.
**Minimum viable input:** At least one of (mbid_track, isrc, artist+track) recommended.
**Degenerate cases:**
- Empty Align() constructor: All getters return None
- Only duration provided: All getters return None (no searchable metadata)
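The minimum-input rule can be checked before any query is attempted. `has_searchable_input` is a hypothetical helper; the library itself performs no such check.

```python
def has_searchable_input(mbid_track=None, isrc=None, artist=None, track=None, **_ignored):
    """True when at least one of (mbid_track, isrc, artist+track) is present."""
    return bool(mbid_track or isrc or (artist and track))


print(has_searchable_input())                              # Empty constructor case
print(has_searchable_input(duration=200))                  # Duration alone is not searchable
print(has_searchable_input(isrc="GBAYE9200070"))
print(has_searchable_input(artist="Queen", track="Bohemian Rhapsody"))
```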
## Data Format Standards
### Identifier Formats
**MBID:** UUID format (e.g., "6b9e7b9e-8f9e-4f9e-9f9e-9f9e9f9e9f9e"). No validation.
**ISRC:** 12-character alphanumeric (e.g., "GBAYE9200070"). No validation.
**Deezer ID:** Integer. No range validation.
**YouTube ID:** Alphanumeric string (e.g., "dQw4w9WgXcQ"). No validation.
### Metadata Formats
**Artist, track, album:** Free text. No format constraints.
**Duration:** Seconds (int or float). MusicBrainz milliseconds converted to seconds.
**Track number:** Integer. No validation (negative numbers accepted).
**Release date:** ISO format (YYYY-MM-DD) or year only (YYYY). Inconsistent across services.
**BPM:** Integer or float. No range validation.
## Data Interoperability
### JAMS Compatibility
JAMS is a standard format in music information retrieval research. MusicMetaLinker's JAMS support enables interoperability with:
- mir_eval (evaluation framework)
- librosa (audio analysis)
- madmom (music analysis)
- Other MIR tools
### Service Compatibility
**MusicBrainz:** Uses the musicbrainzngs library, a community-maintained client for the MusicBrainz web service. The library absorbs most API versioning details.
**Deezer:** Uses the deezer-python library, a community-maintained wrapper around the Deezer API.
**YouTube Music:** Uses unofficial ytmusicapi. Fragile to YouTube changes. No API stability guarantees.
**Spotify:** Uses the spotipy library, a community-maintained client for the Spotify Web API.
## Data Limitations
1. **No bulk operations:** Each track processed individually
2. **No streaming:** All data loaded into memory
3. **No compression:** JAMS files written uncompressed
4. **No encryption:** All data stored in plaintext
5. **No checksums:** No data integrity verification
6. **No versioning:** No metadata version tracking
7. **No provenance:** No record of which service provided which field
8. **No confidence scores:** No indication of match quality
## Data Recommendations
For production use:
1. **Add validation:** Validate all input and output formats
2. **Add normalization:** Normalize artist names, track titles
3. **Add conflict resolution:** Cross-validate results across services
4. **Add provenance tracking:** Record which service provided each field
5. **Add confidence scores:** Indicate match quality
6. **Add persistent cache:** Reduce API calls
7. **Add data versioning:** Track when metadata retrieved
8. **Add bulk operations:** Process multiple tracks efficiently
9. **Remove dead fields:** Delete AcousticBrainz from output
10. **Add structured output:** to_dict(), to_json() methods
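Recommendations 4, 5, and 10 fit together naturally: each field can carry its source and a confidence value, and the container can serialize itself. The dataclasses below are a hypothetical sketch, and the confidence value in the usage lines is a made-up example, not a measured score.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class LinkedField:
    value: Optional[str]
    source: Optional[str] = None        # Service that supplied the value
    confidence: Optional[float] = None  # Match quality in [0, 1], if known


@dataclass
class LinkResult:
    mbid: LinkedField
    isrc: LinkedField

    def to_dict(self) -> dict:
        # asdict() converts nested dataclasses recursively
        return asdict(self)

    def to_json(self) -> str:
        return json.dumps(self.to_dict(), sort_keys=True)


result = LinkResult(
    mbid=LinkedField("example-mbid", source="musicbrainz", confidence=0.93),
    isrc=LinkedField(None),
)
print(result.to_json())
```

Serializing through `to_dict()` also gives a single null representation (JSON null), which addresses the inconsistent missing-value handling noted earlier.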
The data model is simple and functional for research use. Production use requires significant enhancements.
@@ -0,0 +1,611 @@
# MusicMetaLinker Deployment
## Distribution Model
MusicMetaLinker is distributed as source code only. No binary distributions, no PyPI package, no conda package.
**Installation method:** Direct from GitHub via pip.
```bash
pip install git+https://github.com/andreamust/MusicMetaLinker.git
```
**Implications:**
- Requires git installed
- Requires network access to GitHub
- No version pinning by default (the command installs the latest commit; appending @ and a tag or commit hash to the URL pins a revision)
- No offline installation
## Build System
### Build Backend
**PEP 517 compliant:** Uses pyproject.toml for build configuration.
**Build backend:** hatchling (modern Python build tool).
**pyproject.toml structure:**
```toml
[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"

[project]
name = "musicmetalinker"
version = "0.0.1"
dependencies = [
    "musicbrainzngs",
    "deezer-python",
    "ytmusicapi",
    "spotipy",
    "requests",
    "tqdm",
    "jams",
    "pandas",
    "cryptography"
]
```
**No setup.py:** Modern packaging only.
**No setup.cfg:** All configuration in pyproject.toml.
### Build Process
**Local build:**
```bash
git clone https://github.com/andreamust/MusicMetaLinker.git
cd MusicMetaLinker
pip install -e .
```
**-e flag:** Editable install. Changes to source code immediately reflected.
**Build artifacts:** None. Pure Python package, no compilation.
### Dependencies
**Runtime dependencies:**
- musicbrainzngs: MusicBrainz API client
- deezer-python: Deezer API wrapper
- ytmusicapi: YouTube Music API client
- spotipy: Spotify API client
- requests: HTTP library
- tqdm: Progress bars
- jams: JAMS format support
- pandas: CSV output
- cryptography: Required by spotipy
**No optional dependencies:** All dependencies required.
**No development dependencies:** No test framework, no linting tools, no type checkers.
**Dependency versions:** No version constraints. Always installs latest compatible versions.
**Risk:** Breaking changes in dependencies may break MusicMetaLinker.
## Deployment Environments
### Library Deployment
**Target environment:** Python 3.8+ on any platform (Linux, macOS, Windows).
**Installation:**
```bash
pip install git+https://github.com/andreamust/MusicMetaLinker.git
```
**Usage:**
```python
from musicmetalinker.linking import Align
linker = Align(artist="...", track="...")
mbid = linker.get_mbid()
```
**No configuration required** (except Spotify credentials for dataset preparation).
### Batch Processing Deployment
**Target environment:** Python 3.8+ with file system access.
**Installation:** Same as library deployment.
**Usage:**
```bash
cd /path/to/MusicMetaLinker
python link_partitions.py /path/to/jams/files --save --limit audio --overwrite
```
**Requirements:**
- JAMS files in target directory
- Write permissions for output CSV and enriched JAMS files
- Network access for API queries
**Optional:** ffmpeg for audio conversion (if processing audio files directly).
### Research Environment Deployment
**Typical setup:** Jupyter notebook or Python script in research project.
**Installation:**
```bash
pip install git+https://github.com/andreamust/MusicMetaLinker.git
```
**Interactive testing:**
Notebooks included in repository:
- deezer_test.ipynb: Test Deezer integration
- queries.ipynb: Test various query patterns
**Usage:**
```python
# In Jupyter notebook
from musicmetalinker.linking import Align
linker = Align(...)
# Interactive exploration of results
```
## Configuration Management
### No Configuration Files
All configuration hardcoded in source files.
**Hardcoded values:**
- User-Agent: "elka/0.1" (in linking.py)
- Duration thresholds: 3s (Deezer), 5s (MusicBrainz)
- Similarity threshold: 0.8
- API endpoints: In library code
**No config.ini, no config.yaml, no .env files.**
### Spotify Credentials
**Only external configuration:** mml_secrets.py for Spotify credentials.
**Location:** Must be in Python path (typically same directory as scripts).
**Structure:**
```python
# mml_secrets.py
SPOTIFY_CLIENT_ID = "your-client-id-here"
SPOTIFY_CLIENT_SECRET = "your-client-secret-here"
```
**Not in repository:** Users must create this file manually.
**No documentation:** No instructions for obtaining Spotify credentials.
**Obtaining credentials:**
1. Register app at https://developer.spotify.com/dashboard
2. Copy client ID and secret
3. Create mml_secrets.py with credentials
### Environment Variables
**Not used:** No environment variable configuration.
**Recommendation:** Use environment variables for credentials instead of mml_secrets.py.
```python
import os
SPOTIFY_CLIENT_ID = os.getenv("SPOTIFY_CLIENT_ID")
SPOTIFY_CLIENT_SECRET = os.getenv("SPOTIFY_CLIENT_SECRET")
```
## Runtime Requirements
### Python Version
**Minimum:** Python 3.8
**Tested on:** Unknown (no CI/CD, no test matrix).
**Likely compatible:** Python 3.8, 3.9, 3.10, 3.11, 3.12
**Type hints:** Not used extensively. No runtime type checking.
### System Dependencies
**Required:**
- Python 3.8+
- pip
- git (for installation)
- Network access (for API queries)
**Optional:**
- ffmpeg (for audio conversion in batch processing)
**No database:** No PostgreSQL, MySQL, MongoDB, etc.
**No message queue:** No RabbitMQ, Redis, Kafka, etc.
**No web server:** No nginx, Apache, etc.
### Platform Support
**Linux:** Fully supported. Primary development platform (likely).
**macOS:** Fully supported. All dependencies available.
**Windows:** Likely supported. All dependencies have Windows wheels. Potential issues:
- Path separators (/ vs \)
- Line endings (LF vs CRLF)
- Case-insensitive file system (file names differing only by case collide)
**No platform-specific code:** Pure Python, no C extensions (except in dependencies).
## Containerization
### Docker
**No Dockerfile provided.**
**Sample Dockerfile:**
```dockerfile
FROM python:3.11-slim
WORKDIR /app
RUN apt-get update && apt-get install -y git && rm -rf /var/lib/apt/lists/*
RUN pip install git+https://github.com/andreamust/MusicMetaLinker.git
COPY mml_secrets.py /app/
CMD ["python"]
```
**For batch processing:**
```dockerfile
FROM python:3.11-slim
WORKDIR /app
RUN apt-get update && apt-get install -y git ffmpeg && rm -rf /var/lib/apt/lists/*
RUN pip install git+https://github.com/andreamust/MusicMetaLinker.git
RUN git clone https://github.com/andreamust/MusicMetaLinker.git /app/MusicMetaLinker
WORKDIR /app/MusicMetaLinker
ENTRYPOINT ["python", "link_partitions.py"]
```
**Usage:**
```bash
docker build -t musicmetalinker .
docker run -v /path/to/jams:/data musicmetalinker /data --save
```
### Docker Compose
**Not provided.**
**Sample docker-compose.yml:**
```yaml
version: '3.8'
services:
  musicmetalinker:
    build: .
    volumes:
      - ./data:/data
      - ./output:/output
    environment:
      - SPOTIFY_CLIENT_ID=${SPOTIFY_CLIENT_ID}
      - SPOTIFY_CLIENT_SECRET=${SPOTIFY_CLIENT_SECRET}
```
### Kubernetes
**Not applicable:** MusicMetaLinker is a library/batch tool, not a long-running service.
**Possible use case:** Kubernetes Job for batch processing.
```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: musicmetalinker-batch
spec:
  template:
    spec:
      containers:
        - name: musicmetalinker
          image: musicmetalinker:latest
          args: ["/data", "--save"]
          volumeMounts:
            - name: data
              mountPath: /data
      restartPolicy: Never
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: jams-data
```
## Continuous Integration/Continuous Deployment
### CI/CD Status
**No CI/CD pipeline.**
**No GitHub Actions, no Travis CI, no CircleCI, no Jenkins.**
**Implications:**
- No automated testing on commits
- No automated builds
- No automated releases
- No quality gates
### Testing
**No test suite.**
**No pytest, no unittest, no nose.**
**Testing approach:**
- Manual testing via Jupyter notebooks
- `if __name__ == "__main__"` blocks in some modules
**No test coverage metrics.**
### Linting and Formatting
**No linting configuration.**
**No pylint, no flake8, no black, no isort.**
**Code quality:** Inconsistent. Debug prints, commented-out code, inconsistent naming.
### Type Checking
**No type checking.**
**No mypy, no pyright, no pyre.**
**Type hints:** Minimal. Not enforced.
## Monitoring and Logging
### Logging
**Library usage:** Minimal console logging.
**Batch processing:** File-based logging to link_partitions.log.
**Log format:**
```
2024-01-15 10:30:45 - INFO - Processing file: track001.jams
2024-01-15 10:30:46 - INFO - Found MBID: 6b9e7b9e-8f9e-4f9e-9f9e-9f9e9f9e9f9e
2024-01-15 10:30:47 - ERROR - Failed to query Deezer
```
**Log levels:** INFO, ERROR. No DEBUG, WARNING.
**Debug output:** Multiple print() statements in code (not controlled by logging).
### Monitoring
**No monitoring.**
**No metrics collection, no Prometheus, no Grafana, no Datadog.**
**No health checks, no status endpoints.**
### Error Tracking
**No error tracking.**
**No Sentry, no Rollbar, no Bugsnag.**
**Errors silently suppressed.** Returns None on failure.
## Scaling Considerations
### Horizontal Scaling
**Not applicable:** Library runs in single process.
**Batch processing:** Can be parallelized manually.
**Manual parallelization:**
```bash
# Split JAMS files into partitions
# Run multiple instances in parallel
python link_partitions.py /data/partition1 --save &
python link_partitions.py /data/partition2 --save &
python link_partitions.py /data/partition3 --save &
wait
```
**No built-in parallelization.**
### Vertical Scaling
**CPU:** Single-threaded. More CPU cores don't help.
**Memory:** Minimal usage. Each Align instance uses ~1KB. Batch processing uses more for pandas DataFrame.
**Network:** Bottleneck. Sequential API calls. More bandwidth doesn't help (latency-bound).
### Performance Optimization
**No performance optimization.**
**Bottlenecks:**
- Network latency (sequential API calls)
- No caching across instances
- No connection pooling
- No request batching
**Potential optimizations:**
- Async/await for concurrent API calls
- Persistent cache (Redis)
- Connection pooling
- Batch API requests (if services support)
## Security Considerations
### Secrets Management
**Current approach:** Hardcoded in mml_secrets.py.
**Issues:**
- Plaintext credentials
- No encryption
- Risk of committing to version control
**Recommendations:**
- Environment variables
- Secrets vault (HashiCorp Vault, AWS Secrets Manager)
- Encrypted configuration files
### Network Security
**HTTPS:** All API calls use HTTPS.
**Certificate validation:** Handled by requests library (validates by default).
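**No explicit proxy configuration:** No library-level option for HTTP proxies. The underlying HTTP clients may still honour the standard `HTTP_PROXY`/`HTTPS_PROXY` environment variables.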
### Input Validation
**No input validation.**
**Risks:**
- Invalid MBIDs accepted
- Negative durations accepted
- Malformed ISRCs accepted
**Actual risk:** Low. Invalid input causes query failures (returns None).
### Dependency Security
**No dependency scanning.**
**No Dependabot, no Snyk, no safety.**
**Vulnerable dependencies:** Unknown. No automated checks.
**Recommendation:** Run `pip-audit` or `safety check` regularly.
## Backup and Recovery
### Data Backup
**No persistent data:** Nothing to back up (library is stateless).
**Batch output:** CSV and JAMS files. User responsible for backup.
### Disaster Recovery
**Not applicable:** Library has no state to recover.
**Batch processing:** Rerun if output lost. No checkpointing, no resume capability.
## Deployment Checklist
### Library Deployment
- [ ] Python 3.8+ installed
- [ ] pip installed
- [ ] git installed
- [ ] Network access to GitHub
- [ ] Network access to MusicBrainz, Deezer, YouTube Music
- [ ] (Optional) Spotify credentials in mml_secrets.py
### Batch Processing Deployment
- [ ] All library deployment requirements
- [ ] JAMS files prepared
- [ ] Write permissions for output directory
- [ ] (Optional) ffmpeg installed for audio conversion
- [ ] Sufficient disk space for output CSV and enriched JAMS files
### Production Deployment (Recommendations)
- [ ] Pin dependency versions in pyproject.toml
- [ ] Add automated tests
- [ ] Add CI/CD pipeline
- [ ] Add error tracking (Sentry)
- [ ] Add logging (structured JSON logs)
- [ ] Add monitoring (Prometheus metrics)
- [ ] Add rate limiting
- [ ] Add retry logic with exponential backoff
- [ ] Add health checks
- [ ] Use environment variables for configuration
- [ ] Add input validation
- [ ] Add dependency scanning
- [ ] Remove AcousticBrainz integration
- [ ] Fix User-Agent header
- [ ] Add documentation for Spotify setup
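The retry item in the checklist above could be approached along these lines. This is a minimal sketch, not code from MusicMetaLinker: `retry_with_backoff` and all of its parameters are hypothetical names, and it assumes transient failures surface as `ConnectionError` or `TimeoutError`. The sleep function is injectable so the behaviour can be checked without real delays.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    func: Callable[[], T],
    retries: int = 3,
    base_delay: float = 1.0,
    exceptions: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, retrying on the given exceptions with exponential backoff.

    Hypothetical helper: the delay doubles after each failed attempt
    (base_delay, 2 * base_delay, 4 * base_delay, ...) plus up to 10% jitter.
    """
    if retries < 0:
        raise ValueError("retries must be >= 0")
    for attempt in range(retries + 1):
        try:
            return func()
        except exceptions:
            if attempt == retries:
                raise  # out of attempts: surface the last error
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, 0.1 * delay))
```

Wrapping a single API call, for example `retry_with_backoff(lambda: linker.get_mbid())`, would only help once the library stops swallowing exceptions, since a suppressed error never reaches the retry loop.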
## Deployment Recommendations
### Immediate Actions
1. **Publish to PyPI:** Enable `pip install musicmetalinker` without git.
2. **Pin dependencies:** Add version constraints to prevent breaking changes.
3. **Document Spotify setup:** Instructions for obtaining credentials.
4. **Remove AcousticBrainz:** Delete defunct integration.
### Short-Term Improvements
1. **Add CI/CD:** GitHub Actions for automated testing and releases.
2. **Add tests:** pytest suite with mocked API calls.
3. **Add Docker support:** Official Dockerfile and Docker Compose.
4. **Add configuration:** Support environment variables and config files.
5. **Add logging:** Structured logging with configurable levels.
### Long-Term Enhancements
1. **Add monitoring:** Prometheus metrics for API latency, success rates.
2. **Add caching:** Redis for cross-instance caching.
3. **Add async support:** Concurrent API calls for better performance.
4. **Add health checks:** Service availability monitoring.
5. **Add error tracking:** Sentry integration for production debugging.
6. **Add documentation:** Comprehensive deployment guide.
7. **Add versioning:** Semantic versioning with changelog.
8. **Add security scanning:** Automated dependency vulnerability checks.
## Deployment Maturity Assessment
**Current state:** Research prototype. Suitable for academic exploration, not production.
**Maturity level:** 1/5
**Production readiness:** Low
**Gaps:**
- No PyPI distribution
- No CI/CD
- No tests
- No monitoring
- No error tracking
- Hardcoded configuration
- Dead code (AcousticBrainz)
- No documentation for deployment
**Recommendation:** Use for research and prototyping only. Significant work required for production deployment.
@@ -0,0 +1,632 @@
# MusicMetaLinker Evaluation
## Executive Summary
MusicMetaLinker is a research-quality Python library for music metadata entity linking. It connects tracks to external databases (MusicBrainz, Deezer, YouTube Music) to enrich incomplete metadata. The core concept is sound, but implementation is pre-release quality with significant gaps in testing, error handling, and production readiness.
**Version:** 0.0.1 (pre-release)
**Maturity:** Research prototype
**Production readiness:** Low
**Academic value:** Moderate
**Integration potential:** Low (concept valuable, implementation needs work)
## Strengths
### 1. Simple, Clean API
Single Align class provides unified interface to multiple services. Users don't need to understand service-specific APIs.
```python
linker = Align(artist="The Beatles", track="Hey Jude")
mbid = linker.get_mbid()
isrc = linker.get_isrc()
```
**Value:** Low barrier to entry. Easy to integrate into research workflows.
### 2. Cascading Fallback Pattern
Graceful degradation across services. If MusicBrainz fails, tries Deezer. If Deezer fails, tries YouTube Music.
**Value:** Maximizes coverage. Handles service unavailability gracefully.
**Applicability:** This pattern is worth adopting in other metadata aggregation systems.
### 3. JAMS Format Support
Supports JAMS (JSON Annotated Music Specification), a standard format in music information retrieval research.
**Value:** Interoperability with academic MIR tools (mir_eval, librosa, madmom).
**Use case:** Dataset preparation for music research projects.
### 4. Batch Processing
link_partitions.py enables processing entire directories of JAMS files with progress tracking and CSV output.
**Value:** Scales to dataset-level operations. Useful for preparing research datasets.
### 5. MIT License
Permissive license allows unrestricted use, modification, and distribution.
**Value:** Can be freely integrated into commercial or academic projects.
### 6. Minimal Dependencies
Only essential dependencies. No exotic or unmaintained libraries.
**Value:** Easy to install and maintain. Low dependency risk.
### 7. Multi-Service Coverage
Integrates with multiple authoritative sources (MusicBrainz, Deezer, YouTube Music).
**Value:** Comprehensive metadata coverage. Cross-validation potential (not currently implemented).
## Weaknesses
### 1. Pre-Release Quality (v0.0.1)
Version number indicates early development. Codebase confirms this.
**Evidence:**
- Debug print() statements in production code
- Commented-out code sections
- Hardcoded configuration values
- No automated tests
- No CI/CD pipeline
**Impact:** Not suitable for production use without significant hardening.
### 2. No Automated Tests
Zero test coverage. No unit tests, no integration tests, no test framework.
**Testing approach:** Manual testing via Jupyter notebooks.
**Impact:**
- No regression detection
- Difficult to refactor safely
- No confidence in correctness
- Breaking changes undetected
**Risk:** High. Changes may introduce bugs undetected until runtime.
### 3. No CI/CD
No GitHub Actions, no Travis CI, no automated builds or releases.
**Impact:**
- No automated quality gates
- No automated testing on commits
- Manual release process
- No deployment automation
### 4. Debug Prints in Production Code
Multiple print() statements throughout codebase.
```python
print(f"DEBUG: Querying MusicBrainz for {artist} - {track}")
print(f"Found MBID: {mbid}")
```
**Impact:**
- Pollutes output
- Can't be disabled without code changes
- No log levels or timestamps
- Unprofessional appearance
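For comparison, the same two messages could go through the standard `logging` module, which gives levels, timestamps, and a way to silence output without editing code. This is a sketch under assumptions: the logger name `musicmetalinker` and the functions `configure_logging` and `report_lookup` are hypothetical and do not exist in the library.

```python
import logging

# Hypothetical logger name; the library does not currently define one.
logger = logging.getLogger("musicmetalinker")


def configure_logging(level: int = logging.WARNING) -> None:
    """Attach a timestamped console handler once, then set the level."""
    if not logger.handlers:
        handler = logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter("%(asctime)s - %(levelname)s - %(message)s")
        )
        logger.addHandler(handler)
    logger.setLevel(level)


def report_lookup(artist: str, track: str, mbid: str) -> None:
    """Emit the two messages from the print() example via logging."""
    logger.debug("Querying MusicBrainz for %s - %s", artist, track)
    logger.info("Found MBID: %s", mbid)
```

With the default `WARNING` level both messages are suppressed; calling `configure_logging(logging.DEBUG)` turns them back on for a debugging session.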
### 5. Hardcoded Configuration
All configuration values hardcoded in source files.
**Examples:**
- User-Agent: "elka/0.1" (appears to be from parent project)
- Duration thresholds: 3s (Deezer), 5s (MusicBrainz)
- Similarity threshold: 0.8
- API endpoints
**Impact:**
- No runtime configuration
- Changing thresholds requires code modification
- No environment-specific settings
- Can't A/B test matching strategies
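One way to make these values adjustable at runtime is a small settings object whose defaults mirror the hardcoded values listed above. The sketch below is illustrative only: `LinkerConfig` and the `MML_*` environment variable names are assumptions, not part of MusicMetaLinker.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class LinkerConfig:
    """Hypothetical settings object; defaults mirror the hardcoded values."""

    user_agent: str = "elka/0.1"
    deezer_duration_threshold: float = 3.0       # seconds
    musicbrainz_duration_threshold: float = 5.0  # seconds
    similarity_threshold: float = 0.8

    @classmethod
    def from_env(cls, env: Mapping[str, str] = os.environ) -> "LinkerConfig":
        """Build a config, overriding defaults from MML_* variables (assumed names)."""

        def number(key: str, default: float) -> float:
            return float(env.get(key, default))

        return cls(
            user_agent=env.get("MML_USER_AGENT", cls.user_agent),
            deezer_duration_threshold=number(
                "MML_DEEZER_DURATION_THRESHOLD", cls.deezer_duration_threshold
            ),
            musicbrainz_duration_threshold=number(
                "MML_MUSICBRAINZ_DURATION_THRESHOLD", cls.musicbrainz_duration_threshold
            ),
            similarity_threshold=number(
                "MML_SIMILARITY_THRESHOLD", cls.similarity_threshold
            ),
        )
```

Passing such an object into the linker would let two matching strategies run side by side with different thresholds, which the current hardcoded approach rules out.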
### 6. Not on PyPI
Only installable from GitHub. Not published to PyPI.
```bash
pip install git+https://github.com/andreamust/MusicMetaLinker.git
```
**Impact:**
- Requires git installed
- Version pinning only by git ref (`...MusicMetaLinker.git@<commit-or-tag>`), not by release number
- No offline installation
- Less discoverable
### 7. Missing mml_secrets.py
Spotify credentials required in external file not in repository.
**Impact:**
- Users must create file manually
- No documentation for obtaining credentials
- Confusing error if file missing
- Poor user experience
### 8. AcousticBrainz Integration Broken
AcousticBrainz shut down in 2022. Integration always returns None.
**Impact:**
- Dead code in codebase
- Wasted execution time
- Misleading CSV output (acousticbrainz column always null)
- Maintenance burden
**Recommendation:** Remove entirely.
### 9. No Rate Limiting
No rate limiting for API calls. Risk of being blocked by services.
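**MusicBrainz:** Limit of roughly 1 request/second. MusicMetaLinker does not enforce it itself, although musicbrainzngs throttles to one request per second by default.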
**Deezer, YouTube Music:** Unknown limits. Not enforced.
**Impact:**
- Risk of IP bans
- Risk of service degradation
- Batch processing may fail partway through
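A minimal sketch of client-side throttling, assuming a fixed minimum interval between requests is sufficient. `RateLimiter` is a hypothetical helper, not part of MusicMetaLinker; the clock and sleep functions are injectable so the behaviour can be verified without waiting.

```python
import time
from typing import Callable, Optional


class RateLimiter:
    """Hypothetical helper: enforce a minimum interval between calls."""

    def __init__(
        self,
        min_interval: float = 1.0,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last_call: Optional[float] = None  # time of the previous call

    def wait(self) -> float:
        """Block until min_interval has passed since the previous call.

        Returns the number of seconds slept (0.0 if no wait was needed).
        """
        slept = 0.0
        if self._last_call is not None:
            remaining = self.min_interval - (self._clock() - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
                slept = remaining
        self._last_call = self._clock()
        return slept
```

Calling `limiter.wait()` before each Deezer or YouTube Music request would keep a batch run under a chosen rate. The appropriate interval for those services is not known here and would have to be determined empirically.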
### 10. Silent Error Handling
All errors suppressed. Failed queries return None.
```python
try:
    result = service.query()
except:
    return None
```
**Impact:**
- No distinction between "not found" and "service error"
- No error messages
- Difficult debugging
- No visibility into failures
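The distinction between "not found" and "service error" can be kept with very little code. The sketch below is an assumption about how a reimplementation might do it: `ServiceError` and `lookup` are hypothetical names, and it assumes network failures surface as `OSError` subclasses (which covers the built-in `ConnectionError` and `TimeoutError`).

```python
from typing import Callable, Optional


class ServiceError(Exception):
    """Hypothetical exception: the remote service could not be queried."""


def lookup(query: Callable[[], Optional[dict]], service_name: str) -> Optional[dict]:
    """Run a query, keeping 'not found' distinct from 'service error'.

    Returns the result dict, or None when the service answered but had no
    match. Network-level failures are re-raised as ServiceError with the
    original exception attached as the cause.
    """
    try:
        return query()
    except OSError as exc:
        raise ServiceError(f"{service_name} query failed: {exc}") from exc
```

A caller can then treat `None` as a genuine miss and decide separately whether a `ServiceError` should trigger a retry, a fallback to the next service, or an entry in the log.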
### 11. YouTube Matching Weakness
YouTube Music matching is weak. First result assumed correct. No duration filtering (commented out).
**Impact:**
- High false positive rate
- Incorrect YouTube links
- Low confidence in YouTube results
**Recommendation:** Improve matching logic or remove YouTube integration.
### 12. No Input Validation
No validation of input parameters.
**Accepted without validation:**
- Invalid MBIDs (wrong format, non-existent)
- Invalid ISRCs (wrong format, non-existent)
- Negative durations
- Empty strings
**Impact:**
- Silent failures
- Wasted API calls
- Confusing behavior
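Format checks for these inputs are cheap and avoid a network round trip. The following sketch uses hypothetical helper names; it validates format only (an MBID is a UUID, an ISRC is two letters, three alphanumeric characters, and seven digits) and cannot tell whether an identifier actually exists in any database.

```python
import re
import uuid

# ISRC layout: 2-letter country code, 3 alphanumeric registrant characters,
# 2-digit year, 5-digit designation code (12 characters, hyphens optional).
_ISRC_RE = re.compile(r"^[A-Z]{2}[A-Z0-9]{3}[0-9]{7}$")


def is_valid_mbid(value: str) -> bool:
    """True if value is a canonical, hyphenated UUID string."""
    try:
        return str(uuid.UUID(value)) == value.lower()
    except (ValueError, AttributeError, TypeError):
        return False


def is_valid_isrc(value: str) -> bool:
    """True if value matches the ISRC layout, ignoring hyphens and case."""
    return bool(_ISRC_RE.match(value.replace("-", "").upper()))


def is_valid_duration(seconds: float) -> bool:
    """True for a strictly positive duration in seconds."""
    return seconds > 0
```

Rejecting malformed input up front turns a silent `None` into an immediate, explainable error and saves the wasted API call.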
### 13. No Cross-Service Validation
Results from different services not compared or validated.
**Example:** If MusicBrainz returns artist "The Beatles" and Deezer returns "Beatles", no reconciliation.
**Impact:**
- Inconsistent results
- No confidence scoring
- No conflict resolution
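A reconciliation step for the "The Beatles" versus "Beatles" case might look like the sketch below. The function names are hypothetical and the normalization rule (lowercase, strip a leading "the") is a deliberately simple assumption; real artist names need more care.

```python
import re
from difflib import SequenceMatcher
from typing import Dict


def normalize_artist(name: str) -> str:
    """Lowercase, collapse whitespace, and drop a leading 'the'."""
    name = re.sub(r"\s+", " ", name.strip().lower())
    return re.sub(r"^the ", "", name)


def agreement(values: Dict[str, str]) -> float:
    """Lowest pairwise similarity (0.0 to 1.0) between normalized values.

    values maps a service name to the artist name it returned.
    A single value counts as full agreement.
    """
    names = [normalize_artist(v) for v in values.values()]
    scores = [
        SequenceMatcher(None, a, b).ratio()
        for i, a in enumerate(names)
        for b in names[i + 1:]
    ]
    return min(scores, default=1.0)
```

The returned score could double as a rough confidence value: results below a chosen cut-off would be flagged for manual review instead of being written to the output unchecked.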
### 14. No Persistent Caching
No caching across Align instances. Repeated queries for same track.
**Impact:**
- Wasted API calls
- Slow batch processing
- High network usage
- Risk of rate limiting
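A persistent cache does not require extra infrastructure. The sketch below uses SQLite from the standard library as a stand-in for a heavier store such as Redis; `LookupCache`, its table layout, and the default file name are hypothetical, and it assumes results are JSON-serializable dicts.

```python
import json
import sqlite3
from typing import Callable, Optional


class LookupCache:
    """Hypothetical SQLite-backed cache keyed by an arbitrary string."""

    def __init__(self, path: str = "mml_cache.sqlite") -> None:
        self._db = sqlite3.connect(path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS cache (key TEXT PRIMARY KEY, value TEXT)"
        )

    def get_or_fetch(
        self, key: str, fetch: Callable[[], Optional[dict]]
    ) -> Optional[dict]:
        """Return the cached value for key, calling fetch() only on a miss."""
        row = self._db.execute(
            "SELECT value FROM cache WHERE key = ?", (key,)
        ).fetchone()
        if row is not None:
            return json.loads(row[0])
        value = fetch()
        if value is not None:  # misses are not cached, so they can be retried
            self._db.execute(
                "INSERT OR REPLACE INTO cache (key, value) VALUES (?, ?)",
                (key, json.dumps(value)),
            )
            self._db.commit()
        return value
```

Keying on something like `"artist|track"` means a rerun of a batch job over the same files would hit the cache instead of the network.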
### 15. Single-Threaded Execution
Sequential API calls. No parallelization.
**Impact:**
- Slow batch processing (latency multiplied by number of tracks)
- Underutilized network bandwidth
- Poor performance at scale
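Because the time goes into waiting for network responses, a small thread pool is often enough to overlap requests. This sketch is an assumption about how a caller could do that today without changing the library; `link_concurrently` and `link_one` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, Optional, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def link_concurrently(
    items: Iterable[T],
    link_one: Callable[[T], Optional[R]],
    max_workers: int = 4,
) -> List[Optional[R]]:
    """Apply link_one to every item using a small thread pool.

    Results come back in input order. Threads suit this workload because
    the time is spent waiting on the network, not on computation.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(link_one, items))
```

Concurrency multiplies the request rate, so the worker count has to stay within whatever limits the remote services tolerate.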
## Use Case Evaluation
### Academic Research
**Suitability:** Moderate
**Strengths:**
- JAMS format support
- Batch processing
- Multi-service coverage
- MIT license
**Weaknesses:**
- No tests (can't verify correctness)
- Broken integrations (AcousticBrainz)
- Weak YouTube matching
- No documentation
**Recommendation:** Usable for exploratory research. Not suitable for published results without validation.
### Dataset Preparation
**Suitability:** Moderate
**Strengths:**
- Batch processing with progress tracking
- CSV output
- JAMS enrichment
- Cascading fallback
**Weaknesses:**
- No rate limiting (risk of being blocked)
- No caching (slow for large datasets)
- No parallelization (slow)
- Silent failures (incomplete datasets)
**Recommendation:** Usable for small to medium datasets (hundreds to thousands of tracks). Not suitable for large-scale datasets (millions of tracks) without optimization.
### Production Music Applications
**Suitability:** Low
**Strengths:**
- Simple API
- Multi-service coverage
**Weaknesses:**
- No tests
- No error handling
- No monitoring
- No rate limiting
- Pre-release quality
- Hardcoded configuration
- Dead code
**Recommendation:** Not suitable for production without significant refactoring. Consider as reference implementation only.
### Metadata Enrichment Service
**Suitability:** Low
**Strengths:**
- Cascading fallback pattern
- Multi-service integration
**Weaknesses:**
- No async support
- No caching
- No rate limiting
- No error handling
- No monitoring
- Single-threaded
**Recommendation:** Core concept applicable. Implementation needs complete rewrite for production service.
## Integration Assessment
### Integration into Metadata Aggregator
**Conceptual value:** High. Cascading fallback pattern and multi-service aggregation are sound architectural patterns.
**Implementation value:** Low. Pre-release quality, broken integrations, no tests.
**Reuse strategy:**
**Don't adopt the code directly.** Instead:
1. **Study the pattern:** Understand cascading fallback and service orchestration
2. **Identify valuable integrations:** MusicBrainz and Deezer integrations worth studying
3. **Reimplement the concept:** Build new implementation with proper error handling, testing, configuration
4. **Borrow matching logic:** Duration filtering and fuzzy matching algorithms applicable
**Specific learnings:**
**Cascading fallback pattern:**
```python
def get_identifier(self):
    # Try authoritative source first
    if self.has_mbid():
        result = self.query_musicbrainz()
        if result is not None:
            return result
    # Try commercial source with ISRC
    if self.has_isrc():
        result = self.query_deezer()
        if result is not None:
            return result
    # Fall back to metadata search
    return self.query_by_metadata()
```
**Duration filtering:**
```python
def filter_by_duration(results, target_duration, threshold=3):
    return [r for r in results if abs(r.duration - target_duration) <= threshold]
```
**Fuzzy matching:**
```python
from difflib import SequenceMatcher

def similarity(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def fuzzy_match(results, target, threshold=0.8):
    return [r for r in results if similarity(r.name, target) >= threshold]
```
### Integration Recommendations
**What to adopt:**
- Cascading fallback pattern
- Duration filtering approach
- Fuzzy string matching
- JAMS format support (if working with academic datasets)
**What to avoid:**
- Direct code reuse
- YouTube Music integration (weak matching)
- AcousticBrainz integration (defunct)
- Hardcoded configuration approach
- Silent error handling pattern
**What to improve:**
- Add comprehensive error handling
- Add input validation
- Add persistent caching
- Add async/await for concurrency
- Add rate limiting
- Add cross-service validation
- Add confidence scoring
- Add monitoring and metrics
## Competitive Analysis
### Comparison with Alternatives
**MusicBrainz Picard:**
- Desktop application for music tagging
- More mature (v2.x)
- GUI-based
- Comprehensive MusicBrainz integration
- Not a library (can't integrate programmatically)
**beets:**
- Music library management tool
- Plugin architecture
- CLI and library API
- Mature (v1.x)
- More comprehensive than MusicMetaLinker
- Heavier weight (full music library management)
**musicbrainzngs:**
- Widely used, community-maintained Python bindings for the MusicBrainz web service
- Focused on single service
- Well-maintained
- No multi-service aggregation
- Lower-level API
**MusicMetaLinker positioning:**
- Lighter than beets (focused on entity linking only)
- Multi-service (unlike musicbrainzngs)
- Library API (unlike Picard)
- Less mature than all alternatives
- Academic focus (JAMS support)
**Unique value proposition:** Multi-service entity linking with JAMS support for academic research.
**Competitive disadvantage:** Pre-release quality, no tests, limited documentation.
## Technical Debt Assessment
### High-Priority Debt
1. **No tests:** Blocks safe refactoring and feature development
2. **Dead code:** AcousticBrainz integration non-functional
3. **Debug prints:** Unprofessional, pollutes output
4. **Hardcoded config:** Inflexible, difficult to customize
5. **Silent errors:** Difficult debugging, poor user experience
**Estimated effort to address:** 2-3 weeks full-time development
### Medium-Priority Debt
1. **No rate limiting:** Risk of service blocks
2. **No caching:** Performance and efficiency issues
3. **No input validation:** Silent failures, wasted API calls
4. **Single-threaded:** Performance bottleneck
5. **No CI/CD:** Manual testing and releases
**Estimated effort to address:** 2-3 weeks full-time development
### Low-Priority Debt
1. **Not on PyPI:** Distribution inconvenience
2. **No documentation:** Learning curve for new users
3. **No type hints:** IDE support, static analysis
4. **Inconsistent naming:** Code readability
5. **No monitoring:** Production visibility
**Estimated effort to address:** 1-2 weeks full-time development
**Total technical debt:** 5-8 weeks full-time development to production-ready state.
## Risk Assessment
### Technical Risks
**High:**
- No tests: Changes may introduce bugs
- Broken integrations: AcousticBrainz always fails
- No rate limiting: Risk of IP bans
- Silent errors: Difficult debugging
**Medium:**
- YouTube Music: Unofficial API may break
- No caching: Performance issues at scale
- Hardcoded config: Inflexible for different use cases
**Low:**
- Dependency vulnerabilities: No scanning
- Security: Plaintext credentials
### Operational Risks
**High:**
- No monitoring: No visibility into production issues
- No error tracking: Can't diagnose failures
- No health checks: Can't detect service outages
**Medium:**
- No CI/CD: Manual releases error-prone
- No documentation: Difficult onboarding
- No versioning strategy: Breaking changes unpredictable
**Low:**
- No backup/recovery: Stateless, nothing to back up
- No scaling strategy: Single-threaded, limited throughput
### Legal Risks
**Medium:**
- YouTube Music: Reverse-engineered API may violate ToS
- No license headers: Unclear licensing for individual files
**Low:**
- MIT license: Permissive, low legal risk
- No personal data: No GDPR concerns
## Recommendations
### For Academic Use
**Acceptable with caveats:**
1. **Validate results:** Cross-check critical metadata manually
2. **Document limitations:** Note AcousticBrainz non-functional, YouTube matching weak
3. **Small to medium datasets:** Hundreds to thousands of tracks, not millions
4. **Exploratory research:** Not for published results without validation
**Improvements for academic use:**
1. Add logging to track which services provided which data
2. Add confidence scores to indicate match quality
3. Remove AcousticBrainz integration
4. Document known limitations
### For Production Use
**Not recommended without significant refactoring.**
**Minimum requirements for production:**
1. **Add comprehensive test suite** (unit and integration tests)
2. **Add error handling** (specific exceptions, logging, retry logic)
3. **Add rate limiting** (respect service limits)
4. **Add caching** (persistent cache for repeated queries)
5. **Add monitoring** (metrics, health checks, error tracking)
6. **Add configuration system** (environment variables, config files)
7. **Remove dead code** (AcousticBrainz)
8. **Add input validation** (validate MBIDs, ISRCs, etc.)
9. **Add CI/CD** (automated testing and releases)
10. **Publish to PyPI** (standard distribution)
**Estimated effort:** 5-8 weeks full-time development.
### For Integration into Metadata Aggregator
**Recommendation: Study the pattern, reimplement the concept.**
**What to learn from MusicMetaLinker:**
1. **Cascading fallback pattern:** Query authoritative sources first, fall back to less reliable sources
2. **Duration filtering:** Use duration to disambiguate multiple matches
3. **Fuzzy matching:** Use string similarity for metadata-based search
4. **Multi-service aggregation:** Combine results from multiple sources
5. **JAMS format:** If working with academic datasets
**What to implement differently:**
1. **Service abstraction:** Define common interface for all services
2. **Dependency injection:** Pass service instances to orchestrator
3. **Async/await:** Concurrent API calls for better performance
4. **Persistent caching:** Redis or similar for cross-instance caching
5. **Error handling:** Explicit error types, logging, retry logic
6. **Configuration:** Runtime configuration for thresholds and endpoints
7. **Validation:** Input validation and cross-service validation
8. **Monitoring:** Metrics, health checks, error tracking
9. **Testing:** Comprehensive test suite with mocked services
10. **Documentation:** API documentation, usage examples, deployment guide
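The first two items (service abstraction and dependency injection) can be sketched together. Everything here is hypothetical: `MetadataService`, `Orchestrator`, and `StaticService` are illustrative names for a reimplementation, not classes from MusicMetaLinker.

```python
from typing import Dict, Optional, Protocol, Sequence, Tuple


class MetadataService(Protocol):
    """Common interface every provider adapter would implement."""

    name: str

    def lookup(self, artist: str, track: str) -> Optional[Dict[str, str]]:
        ...


class Orchestrator:
    """Queries injected services in priority order until one returns a match."""

    def __init__(self, services: Sequence[MetadataService]) -> None:
        self._services = list(services)

    def link(self, artist: str, track: str) -> Optional[Dict[str, str]]:
        for service in self._services:
            result = service.lookup(artist, track)
            if result is not None:
                # Record provenance so callers know which service answered.
                return {**result, "source": service.name}
        return None


class StaticService:
    """In-memory stand-in for a real adapter, useful in tests."""

    def __init__(self, name: str, data: Dict[Tuple[str, str], Dict[str, str]]) -> None:
        self.name = name
        self._data = data

    def lookup(self, artist: str, track: str) -> Optional[Dict[str, str]]:
        return self._data.get((artist, track))
```

Because the orchestrator only sees the interface, tests can inject `StaticService` instances, and adding or dropping a provider becomes a change to the list passed in.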
## Overall Assessment
### Strengths Summary
- Simple, clean API
- Sound architectural pattern (cascading fallback)
- JAMS format support for academic use
- Batch processing capabilities
- MIT license
- Minimal dependencies
### Weaknesses Summary
- Pre-release quality (v0.0.1)
- No automated tests
- No CI/CD
- Debug code in production
- Hardcoded configuration
- Broken integrations (AcousticBrainz)
- Weak YouTube matching
- No rate limiting
- Silent error handling
- Not on PyPI
### Final Verdict
**Academic value:** Moderate. Useful for exploratory research and dataset preparation. Not suitable for published results without validation.
**Production value:** Low. Requires 5-8 weeks of development to reach production readiness.
**Integration value:** Moderate. Core concept (cascading fallback, multi-service aggregation) is valuable. Implementation should be studied but not directly adopted.
**Recommendation:** Use MusicMetaLinker as a reference implementation to understand entity linking patterns. Reimplement the concept with proper error handling, testing, and production hardening for serious use.
**Best use case:** Academic research projects with small to medium datasets where perfect accuracy is not critical and manual validation is feasible.
**Avoid for:** Production music applications, large-scale dataset processing, published research results, commercial products.
### Relevance Score
**Conceptual relevance:** 8/10. Cascading fallback and multi-service aggregation are highly relevant patterns.
**Implementation relevance:** 3/10. Pre-release quality, broken integrations, no tests make direct adoption inadvisable.
**Overall relevance:** 5/10. Study the pattern, don't adopt the code.
@@ -0,0 +1,662 @@
# MusicMetaLinker Integrations
## Integration Architecture
MusicMetaLinker integrates with five external services:
1. MusicBrainz (open music encyclopedia)
2. Deezer (commercial streaming service)
3. YouTube Music (commercial streaming service)
4. AcousticBrainz (audio analysis database - defunct)
5. Spotify (commercial streaming service - limited use)
Each integration uses a different library and authentication approach.
## MusicBrainz Integration
### Library and Authentication
**Library:** musicbrainzngs (community-maintained Python bindings for the MusicBrainz web service)
**Authentication:** None required for read-only queries.
**User-Agent:** Required by MusicBrainz API. Hardcoded as "elka/0.1" (appears to be from parent project, not MusicMetaLinker-specific).
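**Rate limiting:** MusicBrainz asks clients to stay at or below roughly 1 request/second. MusicMetaLinker adds no throttling of its own; musicbrainzngs applies a default limit of one request per second unless it is disabled with `set_rate_limit(False)`.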
### API Endpoints
All queries go through musicbrainzngs library, which handles endpoint construction.
**Base URL:** https://musicbrainz.org/ws/2/
**Endpoints used:**
- Recording search: /recording?query=...
- Recording lookup: /recording/{mbid}
- ISRC search: /isrc/{isrc}
### Query Patterns
**By MBID (most reliable):**
```python
import musicbrainzngs as mb

mb.set_useragent("elka", "0.1")
result = mb.get_recording_by_id(
    mbid,
    includes=["artists", "releases", "isrcs"]
)
```
**includes parameter:** Fetches related entities in single request. Reduces API calls.
**By ISRC:**
```python
result = mb.get_recordings_by_isrc(
    isrc,
    includes=["artists", "releases", "isrcs"]
)
```
Returns list of recordings with that ISRC. Multiple recordings possible (different releases, remasters).
**By metadata:**
```python
query = f'artist:"{artist}" AND recording:"{track}"'
if album:
    query += f' AND release:"{album}"'
result = mb.search_recordings(
    query=query,
    limit=10
)
```
Lucene query syntax. Quoted strings for exact matching. Returns ranked results.
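The f-string interpolation above breaks when a title itself contains a double quote (for example a 12" mix). A hedged sketch of a safer builder follows; `lucene_phrase` and `build_recording_query` are hypothetical helpers, and they assume that inside a quoted Lucene phrase only backslashes and double quotes need escaping.

```python
from typing import Optional


def lucene_phrase(value: str) -> str:
    """Quote value as a Lucene phrase, escaping backslashes and quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def build_recording_query(
    artist: str, track: str, album: Optional[str] = None
) -> str:
    """Build the artist/recording/release query used for metadata search."""
    parts = [
        f"artist:{lucene_phrase(artist)}",
        f"recording:{lucene_phrase(track)}",
    ]
    if album:
        parts.append(f"release:{lucene_phrase(album)}")
    return " AND ".join(parts)
```

The resulting string can be passed as the `query` argument of `mb.search_recordings` exactly like the hand-built one.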
### Response Parsing
**Recording structure:**
```python
{
    "recording": {
        "id": "mbid-uuid",
        "title": "Track Name",
        "length": 123456,  # milliseconds
        "artist-credit": [
            {"artist": {"name": "Artist Name"}}
        ],
        "release-list": [
            {
                "title": "Album Name",
                "date": "2020-01-15",
                "track-list": [
                    {"number": "1"}
                ]
            }
        ],
        "isrc-list": ["GBAYE9200070"]
    }
}
```
**Extraction logic:**
- **MBID:** recording.id
- **Track:** recording.title
- **Artist:** recording.artist-credit[0].artist.name (first artist only)
- **Duration:** recording.length / 1000 (convert milliseconds to seconds)
- **Album:** recording.release-list[0].title (first release only)
- **Release date:** recording.release-list[0].date
- **Track number:** recording.release-list[0].track-list[0].number
- **ISRC:** recording.isrc-list[0] (first ISRC only)
**Multiple values:** MusicBrainz returns lists for artists, releases, ISRCs. MusicMetaLinker takes first value only. No aggregation or selection logic.
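The first-value-only rule can be written as a small, defensive extraction function. This is a sketch against the response shape shown above, not MusicMetaLinker's actual code: `extract_recording_fields` is a hypothetical name, and the function assumes missing keys should yield `None` instead of raising.

```python
from typing import Any, Dict, Optional


def _first(items: Optional[list]) -> Any:
    """Return the first list element, or None for a missing or empty list."""
    return items[0] if items else None


def extract_recording_fields(response: Dict[str, Any]) -> Dict[str, Any]:
    """Flatten a recording response using the first-value-only rule."""
    rec = response.get("recording", {})
    credit = _first(rec.get("artist-credit"))
    release = _first(rec.get("release-list")) or {}
    length = rec.get("length")
    artist = None
    if isinstance(credit, dict) and "artist" in credit:
        artist = credit["artist"].get("name")
    return {
        "mbid": rec.get("id"),
        "track": rec.get("title"),
        "artist": artist,
        # length is in milliseconds; int() also accepts a numeric string
        "duration": int(length) / 1000 if length is not None else None,
        "album": release.get("title"),
        "release_date": release.get("date"),
        "isrc": _first(rec.get("isrc-list")),
    }
```

Keeping the extraction in one function also gives a single place to replace "take the first" with real selection logic later, for example preferring the earliest release.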
### Filtering and Matching
**Duration filtering:**
```python
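if duration:
    # int() copes with 'length' arriving as a string of milliseconds; skip recordings without one
    matches = [r for r in results
               if r.get('length') and abs(int(r['length']) / 1000 - duration) < 5]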
```
±5 second threshold for metadata searches. Hardcoded.
**Fuzzy string matching:**
Uses difflib.SequenceMatcher for artist/track/album similarity.
```python
from difflib import SequenceMatcher

def similarity(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Match if similarity > 0.8 (80%)
```
Threshold hardcoded at 0.8. No configuration option.
### Error Handling
**Network errors:** Caught and suppressed. Returns None.
**Invalid MBID:** Returns None.
**No results:** Returns None.
**Rate limiting:** No handling. If rate limited, returns None.
### Integration Strengths
1. **Established library:** musicbrainzngs is a widely used, community-maintained client
2. **Rich metadata:** Comprehensive music information
3. **No authentication:** Easy to use
4. **Includes parameter:** Efficient data fetching
5. **Authoritative source:** MusicBrainz is a widely used reference source for music metadata, though it is community-edited and not error-free
### Integration Weaknesses
1. **Hardcoded User-Agent:** "elka/0.1" not specific to MusicMetaLinker
2. **No rate limiting:** Risk of being blocked
3. **First-value-only:** Ignores multiple artists, releases, ISRCs
4. **Hardcoded thresholds:** Duration (5s), similarity (0.8) not configurable
5. **No error visibility:** Silent failures
## Deezer Integration
### Library and Authentication
**Library:** deezer-python (community library, not official)
**Authentication:** None required for search API.
**Rate limiting:** Unknown. Not documented. Not enforced by MusicMetaLinker.
### API Endpoints
deezer-python library handles endpoint construction.
**Base URL:** https://api.deezer.com/
**Endpoints used:**
- Track search: /search/track?q=...
- ISRC search: /track/isrc:{isrc}
### Query Patterns
**By ISRC (preferred):**
```python
import deezer
client = deezer.Client()
result = client.search(f'isrc:{isrc}', relation='track')
```
Returns list of tracks with that ISRC. Usually single result.
**By metadata:**
```python
query = f'{artist} {track}'
if album:
    query += f' {album}'
result = client.search(query, relation='track')
```
Simple concatenation. No advanced query syntax.
### Response Parsing
**Track structure:**
```python
{
    "id": 123456789,
    "title": "Track Name",
    "duration": 123,  # seconds
    "artist": {
        "name": "Artist Name"
    },
    "album": {
        "title": "Album Name"
    },
    "release_date": "2020-01-15",
    "bpm": 120,
    "isrc": "GBAYE9200070",
    "rank": 500000
}
```
**Extraction logic:**
- **Deezer ID:** track.id
- **Track:** track.title
- **Artist:** track.artist.name
- **Album:** track.album.title
- **Duration:** track.duration (already in seconds)
- **Release date:** track.release_date
- **BPM:** track.bpm
- **ISRC:** track.isrc
- **Rank:** track.rank (popularity metric)
### Filtering and Matching
**Duration filtering (critical for Deezer):**
```python
duration_threshold = 3 # seconds
matches = [
    t for t in results
    if abs(t.duration - duration) <= duration_threshold
]
```
±3 second threshold. Configurable via parameter but defaults to 3.
**Why critical:** Deezer returns many versions of same track (radio edit, album version, remaster, live). Duration filtering essential for correct match.
**Fuzzy matching:**
Same difflib.SequenceMatcher approach as MusicBrainz. 0.8 similarity threshold.
**Ranking:**
If multiple matches after filtering, selects highest rank (most popular version).
```python
best_match = max(matches, key=lambda t: t.rank)
```
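The filter-then-rank selection can be sketched end to end as follows. The `Track` dataclass and `pick_best` function are hypothetical stand-ins for the Deezer track objects and are not part of MusicMetaLinker or deezer-python. Note that a bare `max()` raises `ValueError` on an empty list, so the sketch uses `default=None`.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Track:
    """Minimal stand-in for a Deezer track (hypothetical type)."""
    title: str
    duration: int  # seconds
    rank: int      # popularity metric


def pick_best(tracks: Iterable[Track], duration: int, threshold: int = 3) -> Optional[Track]:
    """Keep tracks within the duration threshold, then return the highest-ranked one."""
    matches = [t for t in tracks if abs(t.duration - duration) <= threshold]
    # default=None avoids a ValueError when nothing survives the filter.
    return max(matches, key=lambda t: t.rank, default=None)
```

With a 431-second target, a popular 300-second radio edit is discarded before ranking, so popularity only decides between versions of plausible length.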
### Error Handling
**Network errors:** Caught and suppressed. Returns None.
**Invalid ISRC:** Returns empty list, treated as no match.
**No results:** Returns None.
### Integration Strengths
1. **Strong ISRC support:** Deezer has comprehensive ISRC coverage
2. **Duration filtering:** Effective for disambiguating versions
3. **Popularity ranking:** Helps select canonical version
4. **BPM data:** Only source of BPM in MusicMetaLinker
5. **Fast API:** Generally faster than MusicBrainz
### Integration Weaknesses
1. **Unofficial library:** deezer-python not maintained by Deezer
2. **No authentication:** Limited to public API (no user-specific data)
3. **Simple search:** No advanced query syntax
4. **Narrow default threshold:** The 3-second duration default may not suit all use cases
5. **Commercial bias:** Deezer catalog may not include obscure or independent releases
## YouTube Music Integration
### Library and Authentication
**Library:** ytmusicapi (unofficial, reverse-engineered API)
**Authentication:** None required for search.
**Rate limiting:** Unknown. YouTube may block aggressive usage.
### API Endpoints
ytmusicapi reverse-engineers YouTube Music web interface. No official API.
**Endpoints:** Internal to ytmusicapi. Not exposed to MusicMetaLinker.
### Query Patterns
**By metadata only:**
```python
from ytmusicapi import YTMusic
ytmusic = YTMusic()
query = f'{artist} {track} {album}'
results = ytmusic.search(query, filter='songs')
```
**filter='songs':** Excludes videos, albums, playlists. Returns only song results.
**No ISRC support:** YouTube Music API doesn't support ISRC search.
**No MBID support:** YouTube Music doesn't use MBIDs.
### Response Parsing
**Song structure:**
```python
{
    "videoId": "dQw4w9WgXcQ",
    "title": "Track Name",
    "artists": [
        {"name": "Artist Name"}
    ],
    "album": {
        "name": "Album Name"
    },
    "duration": "7:11",  # string format
    "duration_seconds": 431
}
```
**Extraction logic:**
- **YouTube ID:** result.videoId
- **YouTube URL:** f"https://www.youtube.com/watch?v={videoId}"
- **Track:** result.title
- **Artist:** result.artists[0].name (first artist only)
- **Album:** result.album.name
### Filtering and Matching
**No duration filtering:** Duration filtering code commented out in MusicMetaLinker.
```python
# if duration:
# matches = [r for r in results if abs(r['duration_seconds'] - duration) < 10]
```
**Why commented out:** Unknown. Possibly unreliable duration data from YouTube.
**No fuzzy matching:** First result assumed correct.
```python
best_match = results[0] if results else None
```
**Critical weakness:** High false positive rate. No validation that first result is correct match.
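One way to reduce false positives would be to validate each search result before accepting it. The sketch below is an assumption about how that could look, not existing MusicMetaLinker behavior: `pick_youtube_match` is a hypothetical function, it reads the `title`, `artists`, and `duration_seconds` fields shown in the song structure above, and it reuses the 0.8 similarity and 10-second duration values that appear elsewhere in these notes.

```python
from difflib import SequenceMatcher
from typing import Any, Dict, List, Optional


def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def pick_youtube_match(
    results: List[Dict[str, Any]],
    artist: str,
    track: str,
    duration: Optional[float] = None,
    min_similarity: float = 0.8,
    max_duration_diff: float = 10.0,
) -> Optional[Dict[str, Any]]:
    """Return the first result whose title, artist, and duration all agree."""
    for result in results:
        artists = result.get("artists") or []
        first_artist = artists[0].get("name", "") if artists else ""
        if similarity(result.get("title", ""), track) <= min_similarity:
            continue
        if similarity(first_artist, artist) <= min_similarity:
            continue
        seconds = result.get("duration_seconds")
        # Skip the duration check when either side is unknown.
        if duration is not None and seconds is not None:
            if abs(seconds - duration) >= max_duration_diff:
                continue
        return result
    return None
```

Search order is preserved, so this still prefers YouTube's top hit when it passes validation and only falls through to later results otherwise.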
### Error Handling
**Network errors:** Caught and suppressed. Returns None.
**No results:** Returns None.
**API changes:** ytmusicapi may break if YouTube changes web interface. No error handling for this.
### Integration Strengths
1. **Broad coverage:** YouTube Music has extensive catalog
2. **No authentication:** Easy to use
3. **Filter parameter:** Excludes non-song results
### Integration Weaknesses
1. **Unofficial API:** Reverse-engineered, fragile to changes
2. **No duration filtering:** Commented out, high false positive rate
3. **First-result-only:** No ranking or validation
4. **No ISRC support:** Can't use authoritative identifiers
5. **Legal risk:** Reverse-engineered APIs may violate ToS
6. **No error handling:** API breakage causes silent failures
## AcousticBrainz Integration
### Library and Authentication
**Library:** requests (direct HTTP calls)
**Authentication:** None.
### API Endpoints
**Base URL:** https://acousticbrainz.org/
**Endpoint:** /{mbid}
### Query Pattern
```python
import requests
def acousticbrainz_link(mbid):
    url = f"https://acousticbrainz.org/{mbid}"
    response = requests.get(url)
    return url if response.status_code == 200 else None
```
Simple HTTP GET. Returns URL if MBID exists, None otherwise.
### Critical Issue: Service Shutdown
**AcousticBrainz shut down in 2022.** All queries return 404.
**Impact:** This integration is completely non-functional. Dead code.
**Why still in codebase:** Unknown. Possibly not updated since shutdown.
**Recommendation:** Remove this integration entirely.
### Integration Strengths
None. Service is defunct.
### Integration Weaknesses
1. **Service shut down:** Non-functional
2. **Dead code:** Wastes execution time
3. **Misleading output:** CSV includes acousticbrainz column (always null)
4. **No deprecation notice:** Code doesn't warn users
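If the integration is kept for backward compatibility rather than removed, a deprecation warning would at least make the situation visible to callers. The following is a hypothetical replacement for the function shown earlier, not code from the repository. It skips the HTTP request entirely and relies on the standard `warnings` module.

```python
import warnings
from typing import Optional


def acousticbrainz_link(mbid: str) -> Optional[str]:
    """Deprecated stub: warn the caller and return None without any network call."""
    warnings.warn(
        "AcousticBrainz has been discontinued; acousticbrainz_link() always returns None.",
        DeprecationWarning,
        stacklevel=2,  # attribute the warning to the caller's line
    )
    return None
```

This keeps the CSV column stable for existing consumers while removing the wasted request.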
## Spotify Integration
### Library and Authentication
**Library:** spotipy (community-maintained client for the Spotify Web API, not an official Spotify product)
**Authentication:** OAuth2 client credentials flow.
**Credentials:** Stored in external mml_secrets.py file (not in repository).
**mml_secrets.py structure:**
```python
SPOTIFY_CLIENT_ID = "your-client-id"
SPOTIFY_CLIENT_SECRET = "your-client-secret"
```
### Usage Scope
**Limited use:** Spotify integration only used in Billboard dataset cleaning script (prepare_dataset.py).
**Not used in main Align workflow.** Spotify not queried by Align class.
### Query Pattern
```python
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials
from mml_secrets import SPOTIFY_CLIENT_ID, SPOTIFY_CLIENT_SECRET
auth_manager = SpotifyClientCredentials(
    client_id=SPOTIFY_CLIENT_ID,
    client_secret=SPOTIFY_CLIENT_SECRET
)
sp = spotipy.Spotify(auth_manager=auth_manager)
result = sp.search(q=f'track:{track} artist:{artist}', type='track', limit=1)
```
### Use Case
**Billboard dataset cleaning:** Extract ISRCs from Spotify for Billboard chart tracks.
**Workflow:**
1. Billboard dataset has artist/track names but no ISRCs
2. Query Spotify by artist/track
3. Extract ISRC from Spotify result
4. Use ISRC for subsequent MusicBrainz/Deezer queries
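Step 3 of the workflow can be sketched as a pure function over the search response. The helper name `extract_isrc` is hypothetical. The sketch assumes the usual Spotify Web API search shape, in which tracks are listed under `tracks.items` and each track carries its ISRC under `external_ids.isrc`.

```python
from typing import Any, Dict, Optional


def extract_isrc(search_result: Dict[str, Any]) -> Optional[str]:
    """Return the ISRC of the first track in a Spotify search response, if any."""
    items = (search_result.get("tracks") or {}).get("items") or []
    if not items:
        return None
    # Not every track has an ISRC, so fall back to None.
    return (items[0].get("external_ids") or {}).get("isrc")
```

Applied to the `result` of the `sp.search(...)` call above, this yields the ISRC used for the later MusicBrainz and Deezer queries, or `None` when Spotify found nothing.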
### Integration Strengths
1. **Mature library:** spotipy is a widely used community client for the Spotify Web API
2. **OAuth2:** Secure authentication
3. **Rich metadata:** Comprehensive track information
4. **ISRC support:** Spotify provides ISRCs
### Integration Weaknesses
1. **Requires credentials:** Users must register Spotify app
2. **External secrets file:** mml_secrets.py not in repository, must be created manually
3. **Limited use:** Only for dataset preparation, not main workflow
4. **No documentation:** No instructions for obtaining credentials
## Integration Comparison
| Service | Library | Auth | ISRC Support | Duration Filtering | Matching Quality | Status |
|---------|---------|------|--------------|-------------------|------------------|--------|
| MusicBrainz | musicbrainzngs | None | Yes | ±5s | Fuzzy (0.8) | Active |
| Deezer | deezer-python | None | Yes | ±3s | Fuzzy (0.8) + Rank | Active |
| YouTube Music | ytmusicapi | None | No | Commented out | First result | Active (fragile) |
| AcousticBrainz | requests | None | N/A | N/A | N/A | Defunct |
| Spotify | spotipy | OAuth2 | Yes | N/A | N/A | Active (limited use) |
## Integration Orchestration
### Service Selection Logic
**Priority order:**
1. **MusicBrainz** if MBID provided (authoritative)
2. **Deezer** if ISRC provided (fast, reliable)
3. **MusicBrainz** if metadata provided (fallback)
4. **Deezer** if metadata provided (fallback)
5. **YouTube Music** if metadata provided (last resort)
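The priority order above can be expressed as a function that turns the available inputs into an ordered query plan. This is an illustrative sketch: `plan_queries` and the service labels are hypothetical names, not identifiers from MusicMetaLinker.

```python
from typing import List, Optional, Tuple


def plan_queries(
    mbid: Optional[str] = None,
    isrc: Optional[str] = None,
    has_metadata: bool = False,
) -> List[Tuple[str, str]]:
    """Return (service, query type) pairs in the priority order listed above."""
    plan: List[Tuple[str, str]] = []
    if mbid:
        plan.append(("musicbrainz", "mbid"))
    if isrc:
        plan.append(("deezer", "isrc"))
    if has_metadata:
        # Metadata searches are fallbacks, tried in decreasing order of trust.
        plan.extend([
            ("musicbrainz", "metadata"),
            ("deezer", "metadata"),
            ("youtube_music", "metadata"),
        ])
    return plan
```

Separating the plan from its execution makes the priority order easy to test and to change without touching any network code.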
### Parallel vs Sequential
**Sequential execution:** Services queried one at a time. No parallelization.
**Implications:**
- Total latency = sum of all service latencies
- Slow for batch processing
- Simple error handling (no race conditions)
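For comparison, a concurrent variant could run independent lookups in a thread pool, so that total latency approaches the slowest service rather than the sum. This is a sketch of an alternative design, not how MusicMetaLinker works today. `query_all` is a hypothetical name, and the lookups are passed in as plain callables.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Optional


def _safe(fn: Callable[[], Optional[dict]]) -> Optional[dict]:
    """Mirror the library's behavior: any failure becomes None."""
    try:
        return fn()
    except Exception:
        return None


def query_all(lookups: Dict[str, Callable[[], Optional[dict]]]) -> Dict[str, Optional[dict]]:
    """Run all lookups concurrently and collect results by service name."""
    with ThreadPoolExecutor(max_workers=max(1, len(lookups))) as pool:
        futures = {name: pool.submit(_safe, fn) for name, fn in lookups.items()}
        return {name: future.result() for name, future in futures.items()}
```

Threads suit this workload because the time is spent waiting on network I/O. The trade-off is that lookups which depend on an earlier result, such as using an ISRC found by one service to query another, still have to run in sequence.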
### Result Aggregation
**No cross-validation:** Results from different services not compared.
**First-wins strategy:** First successful query for each field used.
**Example:**
- MBID from MusicBrainz
- ISRC from Deezer (if not in MusicBrainz)
- BPM from Deezer (only source)
- YouTube link from YouTube Music
**No conflict resolution:** If MusicBrainz and Deezer return different artists, no reconciliation.
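A first-wins merge that also records disagreements could look like the sketch below. `merge_first_wins` is a hypothetical helper. It keeps the first non-null value per field, as described above, and additionally collects conflicting values instead of silently dropping them.

```python
from typing import Any, Dict, List, Tuple


def merge_first_wins(
    results: List[Tuple[str, Dict[str, Any]]],
) -> Tuple[Dict[str, Any], Dict[str, List[Any]]]:
    """Merge per-service fields in priority order and report conflicting values."""
    merged: Dict[str, Any] = {}
    conflicts: Dict[str, List[Any]] = {}
    for _service, fields in results:
        for key, value in fields.items():
            if value is None:
                continue  # a missing value never wins or conflicts
            if key not in merged:
                merged[key] = value
            elif merged[key] != value:
                # Record the winning value first, then each disagreeing one.
                conflicts.setdefault(key, [merged[key]]).append(value)
    return merged, conflicts
```

The `conflicts` mapping gives a caller enough information to flag a record for review, which is the minimum needed before any real reconciliation logic can be added.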
## Integration Error Handling
### Network Errors
All network errors caught and suppressed. Returns None.
**No retry logic:** Single attempt per service.
**No exponential backoff:** Immediate failure on error.
**No circuit breaker:** Repeated failures don't disable service.
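A minimal retry helper with exponential backoff might look like this. It is a sketch under stated assumptions: `with_retry` is a hypothetical name, and the built-in `ConnectionError` and `TimeoutError` stand in for the library-specific exception types (for example those raised by requests) that real code would pass via `retry_on`.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def with_retry(
    fn: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
) -> T:
    """Call fn, retrying on the given exceptions with delays of base_delay * 2**n."""
    for attempt in range(attempts):
        try:
            return fn()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)
    raise ValueError("attempts must be at least 1")
```

The `sleep` parameter is injectable so the backoff schedule can be tested without waiting.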
### Rate Limiting
**No rate limiting implementation.**
**Risks:**
- MusicBrainz: Limits clients to about 1 request per second on average; requests above that rate are rejected
- Deezer: Unknown limits, may block
- YouTube Music: Unknown limits, may block or break API
**Batch processing:** High risk of rate limiting (no delays between requests).
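A simple client-side throttle for batch runs could enforce a minimum interval between requests. The `RateLimiter` class below is a hypothetical sketch, not part of MusicMetaLinker. The one-second default is an assumption matching the MusicBrainz guidance, and suitable values for Deezer and YouTube Music are unknown.

```python
import time
from typing import Callable, Optional


class RateLimiter:
    """Block so that consecutive wait() calls are at least min_interval apart."""

    def __init__(
        self,
        min_interval: float = 1.0,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last: Optional[float] = None

    def wait(self) -> None:
        """Sleep just long enough to respect the interval, then record the time."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling `limiter.wait()` immediately before each API request in a batch loop spaces the requests out. Using one limiter instance per service keeps a slow service from throttling the others.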
### Service Unavailability
**No health checks:** Services assumed available.
**No fallback:** If MusicBrainz down, no alternative for MBID lookup.
**No status monitoring:** No logging of service failures.
## Integration Security
### API Keys
**MusicBrainz, Deezer, YouTube Music:** No API keys required.
**Spotify:** Client credentials in external file (not encrypted).
### Data Privacy
**No personal data sent:** Only public music metadata queried.
**No user tracking:** No analytics sent to services.
### HTTPS
All services use HTTPS. No plaintext HTTP.
### Input Sanitization
**No sanitization:** Metadata strings passed directly to APIs.
**Potential risks:**
- Query injection (if services use SQL/NoSQL)
- Command injection (if services execute shell commands)
**Actual risk:** Low. Inputs only travel as HTTP query parameters, which the client libraries encode; MusicMetaLinker builds no SQL or shell commands from them.
## Integration Recommendations
### Immediate Fixes
1. **Remove AcousticBrainz:** Delete defunct integration
2. **Fix User-Agent:** Change "elka/0.1" to "MusicMetaLinker/0.0.1"
3. **Add rate limiting:** Implement delays between requests
4. **Document Spotify setup:** Instructions for obtaining credentials
### Short-Term Improvements
1. **Add retry logic:** Exponential backoff for network errors
2. **Add timeout configuration:** Configurable request timeouts
3. **Enable YouTube duration filtering:** Uncomment and test
4. **Add error logging:** Log service failures
5. **Add health checks:** Verify service availability before queries
### Long-Term Enhancements
1. **Parallel queries:** Use asyncio for concurrent API calls
2. **Cross-validation:** Compare results across services
3. **Confidence scores:** Indicate match quality
4. **Service abstraction:** Common interface for all services
5. **Plugin architecture:** Allow adding new services without code changes
6. **Caching layer:** Reduce redundant API calls
7. **Circuit breaker:** Disable failing services temporarily
8. **Metrics collection:** Track success rates, latencies per service
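Two of the enhancements above, a common service interface and a caching layer, can be sketched together. Everything here is hypothetical: `MetadataProvider`, `CachedProvider`, and the `lookup(artist, track)` signature are illustrative names chosen for the sketch, not an existing API.

```python
from typing import Dict, Optional, Protocol, Tuple


class MetadataProvider(Protocol):
    """Common interface every service wrapper would implement (hypothetical)."""

    name: str

    def lookup(self, artist: str, track: str) -> Optional[dict]:
        ...


class CachedProvider:
    """Wrap any provider with an in-memory cache keyed on normalized input."""

    def __init__(self, inner: MetadataProvider) -> None:
        self.inner = inner
        self.name = inner.name
        self._cache: Dict[Tuple[str, str], Optional[dict]] = {}

    def lookup(self, artist: str, track: str) -> Optional[dict]:
        key = (artist.lower(), track.lower())
        if key not in self._cache:
            # Cache misses too (None), so failed lookups are not repeated.
            self._cache[key] = self.inner.lookup(artist, track)
        return self._cache[key]
```

Because `CachedProvider` satisfies the same protocol it wraps, further layers such as retries or metrics can be stacked in the same way without the orchestrator knowing about them.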
## Integration Value Assessment
**High value:**
- MusicBrainz: Authoritative, comprehensive, reliable
- Deezer: Fast, good ISRC coverage, BPM data
**Medium value:**
- Spotify: Useful for dataset preparation, requires setup
**Low value:**
- YouTube Music: Weak matching, fragile API, high false positives
- AcousticBrainz: Defunct, zero value
**Recommendation:** Keep MusicBrainz and Deezer. Remove AcousticBrainz. Improve YouTube Music matching or remove. Keep Spotify for dataset preparation.
@@ -0,0 +1,218 @@
# MusicMetaLinker Overview
## Project Identity
**Name:** MusicMetaLinker
**Version:** 0.0.1 (pre-release)
**Language:** Python 3.8+
**License:** MIT
**Type:** Library
**Repository:** https://github.com/andreamust/MusicMetaLinker
**Author:** Andrea Poltronieri
**Installation:** `pip install git+https://github.com/andreamust/MusicMetaLinker.git`
MusicMetaLinker is not available on PyPI. Installation requires direct GitHub access.
## Purpose and Scope
MusicMetaLinker performs entity linking for music tracks. It connects local music metadata to external databases, enriching incomplete or inconsistent information with authoritative data from multiple sources.
The library addresses a common problem in music information retrieval: fragmented metadata across different platforms. A track might have an MBID in one system, an ISRC in another, and only artist/title strings in a third. MusicMetaLinker bridges these gaps by querying multiple services and consolidating results.
Primary use case: academic music research and dataset preparation. The library supports JAMS (JSON Annotated Music Specification), a format common in music information retrieval research.
## Core Functionality
MusicMetaLinker implements a three-step workflow:
1. **Service Selection:** Based on available input identifiers (MBID, ISRC, or metadata strings), the library determines which external services to query and in what order.
2. **Information Retrieval:** Sequential queries to MusicBrainz, Deezer, YouTube Music, and AcousticBrainz. Each service has specialized search logic.
3. **Filtering and Matching:** Results are filtered by duration, track number, and fuzzy string matching to identify the best match across services.
The library returns enriched metadata including:
- Standardized identifiers (MBID, ISRC, Deezer ID)
- Corrected metadata (artist, album, track name)
- Additional attributes (BPM, release date)
- Direct links to external services
## Dependencies
Core dependencies:
- **musicbrainzngs:** MusicBrainz API client
- **deezer-python:** Deezer API wrapper
- **ytmusicapi:** YouTube Music unofficial API
- **spotipy:** Spotify API client (limited use)
- **requests:** HTTP client for AcousticBrainz
- **tqdm:** Progress bars for batch processing
- **jams:** JAMS format support
- **pandas:** CSV output for batch processing
- **cryptography:** Required by spotipy
All dependencies are widely used third-party packages from PyPI (none are part of the Python standard library). No exotic or unmaintained libraries.
## Architecture Pattern
MusicMetaLinker uses a cascading fallback pattern:
1. If MBID is provided, query MusicBrainz first (authoritative source)
2. If ISRC is available, try Deezer (commercial database with ISRCs)
3. Fall back to metadata string search across all services
4. Aggregate results, preferring more authoritative sources
This pattern aims to maximize coverage while respecting data quality hierarchies. MusicBrainz is treated as ground truth when available.
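The cascading fallback can be reduced to a few lines: try each source in order and stop at the first one that returns a result. The helper `first_success` is a hypothetical name, and the lookups are modeled as zero-argument callables rather than real service clients.

```python
from typing import Callable, Optional, Sequence, Tuple

Lookup = Callable[[], Optional[dict]]


def first_success(steps: Sequence[Tuple[str, Lookup]]) -> Tuple[Optional[str], Optional[dict]]:
    """Run lookups in priority order; return the first source that yields a result."""
    for source, lookup in steps:
        result = lookup()
        if result is not None:
            return source, result
    # Every source came back empty.
    return None, None
```

Later steps are never called once an earlier one succeeds, which keeps the number of external requests as low as the available identifiers allow.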
## Key Components
**Align class (linking.py):** Main entry point. Orchestrates all service queries and exposes unified getter methods.
**Service-specific aligners:**
- MusicBrainzAlign: Queries MusicBrainz by MBID, ISRC, or metadata
- DeezerAlign: Searches Deezer with duration-based filtering
- YouTubeAlign: Searches YouTube Music by metadata strings
**Batch processing:**
- link_partitions.py: Process directories of JAMS files
- JAMSProcessor: Read/write JAMS format with metadata enrichment
**Utilities:**
- MBDownload: Bulk download from MusicBrainz
- prepare_dataset.py: Dataset preparation scripts
## Workflow Example
Typical usage:
```python
from musicmetalinker.linking import Align
# Initialize with available metadata
linker = Align(
    artist="The Beatles",
    track="Hey Jude",
    album="Hey Jude",
    duration=431
)
# Retrieve enriched metadata
mbid = linker.get_mbid()
isrc = linker.get_isrc()
deezer_id = linker.get_deezer_id()
youtube_link = linker.get_youtube_link()
```
The Align class handles all service queries internally. Users don't interact with individual service classes directly.
## Batch Processing
For dataset-scale operations:
```bash
python link_partitions.py /path/to/jams/files --save --limit audio --overwrite
```
Processes all JAMS files in a directory, enriches metadata, and outputs CSV with consolidated identifiers. Useful for preparing research datasets.
## Target Audience
Primary users:
- Music information retrieval researchers
- Dataset curators
- Academic projects requiring standardized music metadata
Not designed for:
- Production music applications (pre-release quality)
- Real-time streaming services (no rate limiting)
- End-user applications (library-only, no GUI)
## Development Status
Version 0.0.1 indicates early development. The codebase contains:
- Debug print statements in production code
- Hardcoded configuration values
- Commented-out code sections
- No automated tests
- No CI/CD pipeline
This is research-quality code, not production-ready software. Suitable for academic exploration and prototyping, but requires significant hardening for production use.
## Integration with External Services
**MusicBrainz:** Open music encyclopedia. No authentication required. The service rate-limits clients (about one request per second); MusicMetaLinker adds no throttling of its own.
**Deezer:** Commercial streaming service with public API. No authentication for basic search. More permissive than Spotify for metadata access.
**YouTube Music:** Unofficial API via ytmusicapi. No authentication. Fragile to YouTube changes.
**AcousticBrainz:** Audio feature database. Note: AcousticBrainz shut down in 2022. This integration is non-functional.
**Spotify:** Limited use for ISRC extraction in Billboard dataset cleaning. Requires OAuth2 credentials via external mml_secrets.py file (not in repository).
## Licensing and Reuse
MIT license permits use, modification, and distribution, including in commercial projects. No copyleft restrictions.
The library can be freely integrated into commercial or academic projects. The only condition is that the copyright notice and license text are retained in copies or substantial portions of the code.
## Installation Requirements
Python 3.8 or higher required. No platform-specific dependencies except optional ffmpeg for audio conversion in batch processing.
Installation from GitHub requires git and pip. No binary distributions available.
## Configuration
All configuration is hardcoded in source files:
- User-Agent: "elka/0.1" (appears to be from a parent project)
- API endpoints: Hardcoded URLs
- Matching thresholds: Hardcoded in service classes
- Spotify credentials: External mml_secrets.py module
No configuration files, environment variables, or runtime configuration options.
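For contrast, a small configuration object with environment overrides could replace the hardcoded values. This is a sketch of one possible design: `LinkerConfig` and the `MML_*` variable names are hypothetical, and the defaults simply restate values mentioned in these notes.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class LinkerConfig:
    """Settings that are currently hardcoded, gathered in one place."""

    user_agent: str = "MusicMetaLinker/0.0.1"
    duration_tolerance_s: float = 5.0
    min_similarity: float = 0.8

    @classmethod
    def from_env(cls, env: Optional[Mapping[str, str]] = None) -> "LinkerConfig":
        """Build a config from environment variables, falling back to defaults."""
        env = os.environ if env is None else env
        defaults = cls()
        return cls(
            user_agent=env.get("MML_USER_AGENT", defaults.user_agent),
            duration_tolerance_s=float(
                env.get("MML_DURATION_TOLERANCE_S", defaults.duration_tolerance_s)
            ),
            min_similarity=float(env.get("MML_MIN_SIMILARITY", defaults.min_similarity)),
        )
```

A frozen dataclass keeps the settings immutable once loaded, so every service class can share one instance safely.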
## Output Formats
**Library mode:** Python objects with getter methods
**Batch mode:** CSV with columns:
- jams_file
- track_name, artist_name, album_name
- track_number, duration, release_year
- musicbrainz, isrc
- deezer_id, deezer_url
- youtube_url
- acousticbrainz
- spotify_id
JAMS files can be enriched in place with new identifiers added to the identifiers section.
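The batch CSV layout can be illustrated with the standard library alone. The library itself is described as using pandas, so this sketch swaps in `csv.DictWriter` to stay self-contained. `COLUMNS` restates the column list above, and `rows_to_csv` is a hypothetical helper.

```python
import csv
import io
from typing import Any, Dict, Iterable

COLUMNS = [
    "jams_file", "track_name", "artist_name", "album_name",
    "track_number", "duration", "release_year",
    "musicbrainz", "isrc", "deezer_id", "deezer_url",
    "youtube_url", "acousticbrainz", "spotify_id",
]


def rows_to_csv(rows: Iterable[Dict[str, Any]]) -> str:
    """Serialize rows to CSV text; missing fields become empty cells."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=COLUMNS,
        restval="",             # fill columns the row does not provide
        extrasaction="ignore",  # drop keys that are not known columns
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Fixing the column order in one constant keeps the output stable even when a service returns no data for a track.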
## Performance Characteristics
No performance benchmarks provided. Expected bottlenecks:
- Network latency for API calls
- Sequential service queries (no parallelization)
- No caching of results
Batch processing includes progress bars via tqdm but no performance optimization.
## Error Handling
Errors are silently suppressed. Failed queries return None. No exceptions propagate to callers.
This makes the library robust to individual service failures but provides no visibility into what went wrong. Because failures are not logged, debugging requires adding print statements or stepping through the code.
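A small wrapper could keep the return-`None` contract while making failures visible through the standard `logging` module. `safe_lookup` and the logger name are hypothetical and not part of the library.

```python
import logging
from typing import Callable, Optional, TypeVar

T = TypeVar("T")
logger = logging.getLogger("musicmetalinker.sketch")  # hypothetical logger name


def safe_lookup(service: str, fn: Callable[[], T]) -> Optional[T]:
    """Run a lookup; on failure, log the traceback and return None."""
    try:
        return fn()
    except Exception:
        # logger.exception records the message at ERROR level with the traceback.
        logger.exception("lookup failed for service %s", service)
        return None
```

Callers see the same behavior as before, but a batch run now leaves a record of which service failed and why.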
## Maintenance Status
Commit activity and maintenance frequency could not be determined from the material reviewed. The repository is public, but its development status is unclear.
AcousticBrainz integration is broken (service discontinued). No indication this has been addressed.
## Relevance Assessment
**Conceptual value:** High. The cascading fallback pattern and multi-service aggregation approach are sound architectural patterns for entity linking.
**Implementation value:** Low. Pre-release quality, broken integrations, no tests, hardcoded configuration.
**Reuse recommendation:** Study the pattern, don't adopt the code. Reimplement the concept with proper error handling, configuration management, and test coverage.