# MusicMetaLinker Architecture

## System Overview

MusicMetaLinker implements a service-oriented architecture for music metadata entity linking. The system coordinates queries across multiple external APIs, aggregates results, and presents a unified interface through a single orchestrator class.

Architecture pattern: Facade with cascading fallback strategy.

## Core Components

### Align Class (linking.py)

The Align class is the primary orchestrator and sole public interface. It encapsulates all service interactions and presents a clean getter-based API.

**Constructor signature:**

```python
Align(
    mbid_track=None,
    mbid_release=None,
    artist=None,
    album=None,
    track=None,
    track_number=None,
    duration=None,
    isrc=None,
    strict=False
)
```

**Responsibilities:**

- Initialize service-specific aligners based on available input
- Coordinate query execution across services
- Aggregate and normalize results
- Expose unified getter methods for all metadata fields

**Internal state:**

- Stores all input parameters
- Maintains references to service aligner instances
- Caches retrieved metadata to avoid redundant queries

The Align class doesn't implement service-specific logic. It delegates to specialized classes and functions.

### MusicBrainzAlign Class

Handles all MusicBrainz interactions. MusicBrainz is treated as the authoritative source when MBIDs are available.

**Key methods:**

**get_recording(mbid):** Retrieves full recording data by MBID. Returns artist, album, track name, duration, ISRCs, and related identifiers.

**get_best_match(artist, track, album, duration):** Searches MusicBrainz by metadata strings. Filters results by duration and fuzzy string matching. Returns the highest-scoring match.

**get_iswc():** Retrieves the International Standard Musical Work Code if available.

**Search strategy:**

1. If MBID provided, direct lookup (most reliable)
2. If ISRC provided, search by ISRC
3. Fall back to metadata string search with filtering

MusicBrainz queries include related entities (artists, releases, ISRCs) in a single request to minimize API calls.

### DeezerAlign Class

Interfaces with Deezer's public API. Deezer provides commercial metadata with strong ISRC coverage.

**Key methods:**

**best_match(artist, track, album, duration, duration_threshold=3):** Searches Deezer and filters by duration. The duration_threshold parameter allows ±3 seconds variance by default.

**get_rank():** Returns Deezer's internal popularity rank for the track.

**Search strategy:**

1. If ISRC available, search by ISRC (most accurate)
2. Fall back to metadata string search
3. Filter results by duration (±3 seconds)
4. Apply fuzzy string matching to artist/track/album

Duration filtering is critical for Deezer because metadata searches often return multiple versions of the same track (radio edit, album version, remaster).

### YouTubeAlign Class

Queries YouTube Music via the unofficial ytmusicapi library.

**Key methods:**

**get_best_match(artist, track, album):** Searches YouTube Music with filter="songs". Returns the first result (no sophisticated ranking).

**get_youtube_id():** Extracts the YouTube video ID from search results.

**Search strategy:**

- Constructs query string: "{artist} {track} {album}"
- Filters to songs only (excludes videos, albums)
- Returns first result

YouTube matching is the weakest link. No duration filtering (commented out in code). No fuzzy matching. First result is assumed correct.

### acousticbrainz_link Function

Standalone function (not a class) that checks if an MBID exists in AcousticBrainz.

**Implementation:**

```python
import requests

def acousticbrainz_link(mbid):
    # Probe the AcousticBrainz page for this MBID; HTTP 200 means it exists.
    url = f"https://acousticbrainz.org/{mbid}"
    response = requests.get(url)
    return url if response.status_code == 200 else None
```

Simple HTTP check. Returns the URL if the MBID exists, None otherwise.

**Critical issue:** The AcousticBrainz project was discontinued in 2022, so this check no longer produces useful links. Treat it as dead code.

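The duration-plus-fuzzy-match filtering described for the MusicBrainz and Deezer aligners above can be sketched as follows. This is a minimal illustration, not the project's code: `similarity`, `pick_best_match`, and the candidate dict shape (`artist`, `title`, `duration`) are hypothetical, the sample tracks are invented, and the standard library's `difflib` stands in for whatever fuzzy matcher the aligners actually use.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio in [0, 1] (stdlib stand-in for a fuzzy matcher)."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def pick_best_match(candidates, artist, track, duration=None, duration_threshold=3):
    """Return the best-scoring candidate dict, or None if nothing survives filtering.

    `candidates` is a hypothetical list of dicts with "artist", "title" and
    "duration" (seconds) keys; real API responses are shaped differently.
    """
    best, best_score = None, 0.0
    for cand in candidates:
        # Step 1: drop candidates whose duration is outside the tolerance.
        if duration is not None and abs(cand["duration"] - duration) > duration_threshold:
            continue
        # Step 2: score the survivors by averaged artist/title similarity.
        score = (similarity(cand["artist"], artist) + similarity(cand["title"], track)) / 2
        if score > best_score:
            best, best_score = cand, score
    return best


# Invented sample data: three versions of the same song.
candidates = [
    {"artist": "Some Artist", "title": "Some Song (Live)", "duration": 312},
    {"artist": "Some Artist", "title": "Some Song", "duration": 354},
    {"artist": "Some Artist", "title": "Some Song - Remastered", "duration": 355},
]
match = pick_best_match(candidates, "Some Artist", "Some Song", duration=355)
print(match["title"])  # Some Song
```

The live version is rejected by the duration filter, and the exact title outscores the remaster among the two survivors. Filtering before scoring keeps a well-named but wrong-length version from winning.
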
## Data Flow

### Initialization Flow

1. User creates Align instance with available metadata
2. Align constructor stores all input parameters
3. Service aligners are instantiated on-demand (lazy initialization)
4. No queries execute during construction

### Query Flow

1. User calls getter method (e.g., get_mbid())
2. Align checks if value already cached
3. If not cached, determines which service to query based on available input
4. Executes service-specific query
5. Caches result
6. Returns value to user

Queries are lazy and cached. Calling get_mbid() twice only queries MusicBrainz once.

### Cascading Fallback Strategy

Priority order for identifier resolution:

**For MBID:**

1. Use provided mbid_track if available
2. Query MusicBrainz by ISRC
3. Query MusicBrainz by metadata strings
4. Return None if all fail

**For ISRC:**

1. Use provided ISRC if available
2. Extract from MusicBrainz recording (if MBID available)
3. Query Deezer and extract ISRC from result
4. Return None if all fail

**For Deezer ID:**

1. Query Deezer by ISRC
2. Query Deezer by metadata strings
3. Return None if all fail

**For YouTube link:**

1. Query YouTube Music by metadata strings
2. Return None if no results

Each service is queried independently. No cross-service validation or conflict resolution.

## Supporting Components

### JAMSProcessor (preprocessor.py)

Handles reading and writing JAMS (JSON Annotated Music Specification) files.

**Responsibilities:**

- Parse JAMS JSON structure
- Extract metadata from file_metadata and sandbox sections
- Enrich JAMS files with new identifiers
- Write updated JAMS files

JAMS structure:

```json
{
  "file_metadata": {
    "title": "track name",
    "artist": "artist name",
    "release": "album name",
    "duration": 123.45,
    "identifiers": {
      "musicbrainz": "mbid-here"
    }
  },
  "sandbox": {
    "type": "genre",
    "genre": "rock",
    "track_number": 1,
    "release_year": 2020
  }
}
```

JAMSProcessor reads these fields, passes them to Align, and writes enriched identifiers back to the identifiers section.

### MBDownload (musicbrainz_dump.py)

Utility for bulk downloading MusicBrainz data.

**Purpose:** Pre-populate local datasets with MusicBrainz metadata to reduce API calls during batch processing.

**Implementation details:** Not fully specified in provided information. Likely queries MusicBrainz in batches and caches results locally.

### link_partitions.py

Batch processing script for directories of JAMS files.

**Workflow:**

1. Scan directory for JAMS files
2. For each file, extract metadata via JAMSProcessor
3. Create Align instance and query all services
4. Collect results in pandas DataFrame
5. Output CSV with all identifiers

**Command-line options:**

- `--save`: Write enriched JAMS files back to disk
- `--limit audio`: Only process audio files (skip non-audio JAMS)
- `--overwrite`: Overwrite existing enriched files

Includes progress bars via tqdm and logging to link_partitions.log.

### prepare_dataset.py

Dataset preparation utilities. Specific functionality not detailed in provided information. Likely includes:

- Data cleaning
- Format conversion
- Batch metadata enrichment

## Configuration Architecture

No configuration system. All settings hardcoded in source files.

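For contrast, a minimal sketch of what an external settings object could look like. Everything here is an assumption for illustration: the `LinkerSettings` class, its field names, and the `MML_*` environment variables do not exist in the project. Only the two default values mirror the hardcoded ones documented in this section.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class LinkerSettings:
    """Hypothetical settings object; class, fields and env var names are illustrative."""

    user_agent: str = "elka/0.1"
    deezer_duration_threshold: float = 3.0

    @classmethod
    def from_env(cls, env=None):
        """Build settings from environment variables, falling back to the defaults."""
        env = os.environ if env is None else env
        return cls(
            user_agent=env.get("MML_USER_AGENT", cls.user_agent),
            deezer_duration_threshold=float(
                env.get("MML_DEEZER_DURATION_THRESHOLD", cls.deezer_duration_threshold)
            ),
        )


# A plain dict stands in for os.environ so the example is deterministic.
settings = LinkerSettings.from_env({"MML_DEEZER_DURATION_THRESHOLD": "5"})
print(settings.deezer_duration_threshold)  # 5.0
print(settings.user_agent)                 # elka/0.1
```

A frozen dataclass keeps the settings immutable once loaded, so a threshold cannot drift during a batch run.
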
**Hardcoded values:**

- MusicBrainz User-Agent: "elka/0.1"
- Deezer duration threshold: 3 seconds
- API endpoints: Direct URLs in code
- Spotify credentials: Imported from external mml_secrets.py

**Implications:**

- No runtime configuration
- No environment-specific settings
- Changing thresholds requires code modification
- No A/B testing of matching strategies

## Error Handling Architecture

Error handling is minimal and inconsistent.

**Pattern:**

```python
try:
    result = service.query()
    return result
except:  # bare except: swallows every exception
    return None
```

All exceptions are caught and suppressed. Because the `except` clause is bare, this includes KeyboardInterrupt and SystemExit. Failed queries return None. No error logging, no exception propagation, no retry logic.

**Consequences:**

- Silent failures
- No visibility into what went wrong
- Difficult debugging
- No distinction between "not found" and "service error"

## Logging Architecture

Uses Python's standard logging module.

**Batch processing:** File-based logging to link_partitions.log. Includes timestamps, log levels, and progress information.

**Library usage:** Console logging. Minimal output.

**Debug output:** Multiple print() statements scattered throughout code. Not controlled by logging configuration.

**Issues:**

- Debug prints in production code
- No structured logging
- No log levels for debug prints
- No correlation IDs for tracking requests across services

## Concurrency Model

Single-threaded, synchronous execution. No parallelization.

**Query execution:**

- Services queried sequentially
- No concurrent API calls
- No async/await
- No thread pools

**Implications:**

- Slow batch processing (total time grows with per-request latency times the number of requests)
- Underutilized network bandwidth
- Simple debugging (no race conditions)

Batch processing could benefit significantly from parallel execution.

## Dependency Injection

No dependency injection. Service classes instantiated directly in Align constructor.

**Current pattern:**

```python
self.mb_align = MusicBrainzAlign(...)
self.deezer_align = DeezerAlign(...)
```

**Implications:**

- Difficult to mock services for testing
- Tight coupling between Align and service implementations
- No interface-based programming
- Hard to swap service implementations

## State Management

State is managed in Align instance variables.

**Cached values:**

- All input parameters (artist, track, album, etc.)
- Retrieved metadata (MBID, ISRC, Deezer ID, etc.)
- Service aligner instances

**Cache invalidation:** None. Values cached for lifetime of Align instance.

**Thread safety:** Not thread-safe. No locks, no synchronization.

## Extension Points

Limited extensibility.

**Adding new services:**

1. Create new service aligner class
2. Instantiate in Align constructor
3. Add getter methods to Align
4. Update cascading fallback logic

No plugin system, no service registry, no abstract base classes.

**Modifying matching logic:** Requires editing service aligner classes directly. No strategy pattern, no configurable matchers.

## Testing Architecture

No test suite. No test directory. No test configuration.

**Testing approach:**

- Manual testing via Jupyter notebooks (deezer_test.ipynb, queries.ipynb)
- `if __name__ == "__main__"` blocks in some modules
- No unit tests, no integration tests, no mocks

## Build and Packaging

Uses hatchling (PEP 517 build backend).

**pyproject.toml structure:**

- Project metadata (name, version, authors)
- Dependencies
- Build system configuration

No setup.py. Modern Python packaging.

**Distribution:** GitHub only. Not published to PyPI.

## Deployment Architecture

Library deployment: pip install from GitHub.

Batch processing deployment: Clone repository, install dependencies, run Python scripts directly.

No Docker containers, no systemd services, no process managers.

## Performance Considerations

No performance optimization.

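Because the work is dominated by network waits, the sequential model described under Concurrency Model is the main cost. A minimal sketch of parallel batch linking with a thread pool follows. `link_track` is a hypothetical stand-in that sleeps instead of calling any API; in real use each call would build its own Align instance, since Align is not thread-safe, and the pool size would have to respect each service's rate limits.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def link_track(metadata):
    """Hypothetical stand-in for creating an Align instance and calling its getters."""
    time.sleep(0.05)  # simulates network latency; no real API is contacted
    return {"track": metadata["track"], "mbid": None}


def link_all(tracks, max_workers=8):
    """Link tracks concurrently; pool.map returns results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(link_track, tracks))


tracks = [{"track": f"track-{i}"} for i in range(16)]
start = time.perf_counter()
results = link_all(tracks)
elapsed = time.perf_counter() - start
print(f"{len(results)} tracks in {elapsed:.2f}s")
```

Threads suit this workload because the time goes to waiting on I/O rather than computing. Keeping one Align instance per task avoids sharing the unsynchronized per-instance cache across threads.
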
**Bottlenecks:**

- Network latency (sequential API calls)
- No caching across Align instances
- No request batching
- No connection pooling

**Memory usage:**

- Minimal (only current track metadata in memory)
- No large data structures
- Pandas DataFrame for batch output (could be large for big datasets)

## Security Architecture

Minimal security considerations.

**API credentials:**

- MusicBrainz: No authentication
- Deezer: No authentication
- YouTube Music: No authentication
- Spotify: OAuth2 client credentials in external file

**Secrets management:**

- Spotify credentials in mml_secrets.py (not in repository)
- No encryption
- No environment variables
- No secrets vault

**Input validation:**

- No validation of user input
- No sanitization of metadata strings
- Potential injection vulnerabilities if metadata is ever passed to shell commands

## Architectural Strengths

1. **Simple facade:** Single Align class hides complexity
2. **Cascading fallback:** Graceful degradation when services fail
3. **Lazy evaluation:** Only query services when needed
4. **Service isolation:** Each service in separate class

## Architectural Weaknesses

1. **No abstraction:** Service classes have different interfaces
2. **Tight coupling:** Align directly instantiates service classes
3. **Weak error handling:** Exceptions suppressed, failures silent
4. **No concurrency:** Sequential execution only
5. **Hardcoded configuration:** No runtime flexibility
6. **No tests:** Design is hard to test (tight coupling, no injection points for mocks)
7. **Dead code:** AcousticBrainz integration non-functional
8. **Inconsistent patterns:** Function for AcousticBrainz, classes for others

## Architectural Recommendations

For production use, consider:

1. **Define service interface:** Abstract base class for all aligners
2. **Dependency injection:** Pass service instances to Align constructor
3. **Configuration system:** External config for thresholds, endpoints, credentials
4. **Error handling:** Explicit error types, logging, retry logic
5. **Async execution:** Use asyncio for concurrent API calls
6. **Caching layer:** Redis or in-memory cache for repeated queries
7. **Remove dead code:** Delete AcousticBrainz integration
8. **Add tests:** Unit tests with mocked services
9. **Structured logging:** JSON logs with correlation IDs
10. **Rate limiting:** Respect API rate limits with backoff

The core pattern (cascading fallback across services) is sound. The implementation needs significant hardening.

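As a sketch of what a hardened cascading fallback could look like, the example below tries resolvers in priority order and keeps "not found" (a `None` result) separate from "service error" (an exception that is logged, then skipped). All names are hypothetical: `ServiceError`, `resolve`, and the resolver functions are illustrations and not part of MusicMetaLinker.

```python
import logging

logger = logging.getLogger("linker_sketch")


class ServiceError(Exception):
    """Hypothetical error: the service could not be queried (distinct from 'not found')."""


def resolve(resolvers):
    """Try (name, callable) resolvers in priority order.

    Returns (value, source_name) for the first non-None result, or
    (None, None) if every resolver came back empty or failed.
    """
    for name, resolver in resolvers:
        try:
            value = resolver()
        except ServiceError as exc:
            # A failure is logged and skipped, not silently turned into "not found".
            logger.warning("%s failed: %s", name, exc)
            continue
        if value is not None:
            return value, name
    return None, None


def by_isrc():
    raise ServiceError("timeout")  # simulated outage


def by_metadata():
    return "hypothetical-mbid"  # simulated hit


mbid, source = resolve([
    ("provided", lambda: None),           # caller supplied no MBID
    ("musicbrainz-isrc", by_isrc),        # fails, so the cascade continues
    ("musicbrainz-metadata", by_metadata),
])
print(mbid, source)  # hypothetical-mbid musicbrainz-metadata
```

Returning the source name alongside the value records which step of the cascade produced each identifier, which makes batch results easier to audit. Catching only `ServiceError` lets programming errors propagate instead of being swallowed.
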