# MusicMetaLinker Codebase Analysis

## Repository Structure

```
MusicMetaLinker/
├── musicmetalinker/
│   ├── __init__.py
│   ├── linking.py            # Core Align class and service aligners
│   ├── preprocessor.py       # JAMSProcessor for JAMS file handling
│   ├── musicbrainz_dump.py   # MusicBrainz bulk download utilities
│   └── utils.py              # Utility functions (likely)
├── link_partitions.py        # Batch processing CLI
├── prepare_dataset.py        # Dataset preparation scripts
├── deezer_test.ipynb         # Deezer integration testing notebook
├── queries.ipynb             # Query testing notebook
├── pyproject.toml            # Build configuration
├── README.md                 # Project documentation
└── LICENSE                   # MIT license
```

**No tests directory.** No test files.

**No docs directory.** Documentation in README only.

**No examples directory.** Examples in notebooks only.

## Code Organization

### linking.py

**Primary module.** Contains all core functionality.

**Classes:**

- **Align:** Main orchestrator class
- **MusicBrainzAlign:** MusicBrainz service integration
- **DeezerAlign:** Deezer service integration
- **YouTubeAlign:** YouTube Music service integration

**Functions:**

- **acousticbrainz_link(mbid):** AcousticBrainz URL checker (defunct)

**Estimated size:** 500-800 lines (based on typical structure).

**Responsibilities:**

- Service coordination
- Query execution
- Result aggregation
- Metadata normalization

**Code quality issues:**

- Debug print() statements in production code
- Commented-out code sections
- Hardcoded configuration values
- No docstrings (likely)
- Inconsistent naming conventions

### preprocessor.py

**JAMS file handling.**

**Classes:**

- **JAMSProcessor:** Read/write JAMS files, extract metadata, enrich with identifiers

**Responsibilities:**

- Parse JAMS JSON structure
- Extract file_metadata and sandbox fields
- Inject new identifiers
- Write enriched JAMS files

**Dependencies:**

- jams library for JAMS format support
- json for JSON parsing

### musicbrainz_dump.py

**Bulk MusicBrainz download utilities.**

**Classes:**

- **MBDownload:** Batch download from MusicBrainz

**Purpose:** Pre-populate datasets with MusicBrainz metadata to reduce API calls.

**Implementation details:** Not fully specified. Likely includes:

- Batch query logic
- Rate limiting (hopefully)
- Local caching
- CSV or JSON output

### link_partitions.py

**Batch processing CLI script.**

**Functionality:**

- Scan directory for JAMS files
- Process each file with Align
- Collect results in pandas DataFrame
- Output CSV with all identifiers
- Optionally write enriched JAMS files

**Command-line arguments:**

- Positional: directory path
- --save: Write enriched JAMS files
- --limit audio: Only process audio files
- --overwrite: Overwrite existing files

**Logging:** File-based to link_partitions.log.

**Progress tracking:** tqdm progress bars.

### prepare_dataset.py

**Dataset preparation utilities.**

**Functionality:** Not fully specified. Likely includes:

- Data cleaning
- Format conversion
- Metadata normalization
- Spotify ISRC extraction for Billboard dataset

**Spotify integration:** Uses spotipy with credentials from mml_secrets.py.

### Notebooks

**deezer_test.ipynb:** Interactive testing of Deezer integration.

**queries.ipynb:** Interactive testing of various query patterns.

**Purpose:** Manual testing and exploration. Not automated tests.

## Configuration Management

### Hardcoded Configuration

All configuration values hardcoded in source files.

**linking.py:**

```python
# MusicBrainz User-Agent
musicbrainzngs.set_useragent("elka", "0.1")

# Duration thresholds
MUSICBRAINZ_DURATION_THRESHOLD = 5  # seconds
DEEZER_DURATION_THRESHOLD = 3  # seconds

# Similarity threshold
SIMILARITY_THRESHOLD = 0.8
```

**Issues:**

- No runtime configuration
- Changing thresholds requires code modification
- No environment-specific settings
- "elka/0.1" User-Agent suggests code copied from another project

### External Configuration

**Only external config:** mml_secrets.py for Spotify credentials.

**Not in repository.** Users must create manually.

**Structure:**

```python
SPOTIFY_CLIENT_ID = "..."
SPOTIFY_CLIENT_SECRET = "..."
```

**Import pattern:**

```python
try:
    from mml_secrets import SPOTIFY_CLIENT_ID, SPOTIFY_CLIENT_SECRET
except ImportError:
    SPOTIFY_CLIENT_ID = None
    SPOTIFY_CLIENT_SECRET = None
```

**Graceful degradation:** If mml_secrets.py missing, Spotify features disabled.

### Configuration Recommendations

1. **Use environment variables:**

   ```python
   import os

   SPOTIFY_CLIENT_ID = os.getenv("SPOTIFY_CLIENT_ID")
   MUSICBRAINZ_USER_AGENT = os.getenv("MUSICBRAINZ_USER_AGENT", "MusicMetaLinker/0.0.1")
   DEEZER_DURATION_THRESHOLD = int(os.getenv("DEEZER_DURATION_THRESHOLD", "3"))
   ```

2. **Add config file support:**

   ```python
   import configparser

   config = configparser.ConfigParser()
   config.read("musicmetalinker.ini")
   DEEZER_DURATION_THRESHOLD = config.getint("matching", "deezer_duration_threshold", fallback=3)
   ```

3. **Add runtime configuration:**

   ```python
   linker = Align(
       artist="...",
       track="...",
       config={
           "deezer_duration_threshold": 5,
           "similarity_threshold": 0.9,
       },
   )
   ```

## Logging Architecture

### Logging Implementation

**Library:** Python standard logging module.

**Configuration:**

```python
import logging

logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)
```

**Log levels used:**

- INFO: Normal operation (file processing, successful queries)
- ERROR: Failed queries, network errors

**Not used:**

- DEBUG: No debug-level logging
- WARNING: No warnings
- CRITICAL: No critical errors

### Logging Locations

**Batch processing:** File-based logging to link_partitions.log.

```python
file_handler = logging.FileHandler('link_partitions.log')
logger.addHandler(file_handler)
```

**Library usage:** Console logging.

```python
console_handler = logging.StreamHandler()
logger.addHandler(console_handler)
```

### Debug Output Issues

**Multiple print() statements in production code:**

```python
print(f"Querying MusicBrainz for {artist} - {track}")
print(f"Found MBID: {mbid}")
print(f"Deezer search returned {len(results)} results")
```

**Problems:**

- Not controlled by logging configuration
- Can't disable without code changes
- No log levels
- No timestamps
- Mixes with actual output

**Recommendation:** Replace all print() with logger.debug().

### Logging Recommendations

1. **Remove print() statements:**

   ```python
   # Before
   print(f"Querying MusicBrainz for {artist} - {track}")

   # After
   logger.debug(f"Querying MusicBrainz for {artist} - {track}")
   ```

2. **Add structured logging:**

   ```python
   import structlog

   logger = structlog.get_logger()
   logger.info("musicbrainz_query", artist=artist, track=track, mbid=mbid)
   ```

3. **Add correlation IDs:**

   ```python
   import uuid

   # Keyword context assumes the structlog logger from the previous example
   correlation_id = str(uuid.uuid4())
   logger.info("query_started", correlation_id=correlation_id, artist=artist)
   # ... queries ...
   logger.info("query_completed", correlation_id=correlation_id, mbid=mbid)
   ```

4. **Add log levels:**

   ```python
   logger.debug("Attempting MusicBrainz query")
   logger.info("Successfully retrieved MBID")
   logger.warning("Deezer query returned no results, falling back to YouTube")
   logger.error("All services failed", exc_info=True)
   ```

## Code Quality

### Code Smells

**Debug prints in production:**

```python
print("DEBUG: entering get_mbid()")
print(f"DEBUG: mbid_track = {self.mbid_track}")
```

**Commented-out code:**

```python
# if duration:
#     matches = [r for r in results if abs(r['duration_seconds'] - duration) < 10]
```

**Hardcoded values:**

```python
musicbrainzngs.set_useragent("elka", "0.1")  # Should be "MusicMetaLinker/0.0.1"
```

**Inconsistent naming:**

```python
mbid_track  # snake_case
mbidTrack   # camelCase (in some places)
MBID        # UPPER_CASE
```

**No docstrings:**

```python
def get_mbid(self):
    # No docstring explaining what this returns or when it returns None
    ...
```

**Broad exception catching:**

```python
try:
    result = service.query()
except:  # Catches everything, including KeyboardInterrupt
    return None
```

### Code Quality Metrics

**Estimated metrics (without actual analysis):**

- **Lines of code:** ~1500-2000
- **Cyclomatic complexity:** Moderate (nested conditionals in matching logic)
- **Code duplication:** Moderate (similar patterns across service aligners)
- **Test coverage:** 0% (no tests)
- **Documentation coverage:** Low (minimal docstrings)

### Linting Issues

**No linting configuration.** Running pylint or flake8 would likely find:

- Unused imports
- Unused variables
- Line too long (>79 characters)
- Missing docstrings
- Bare except clauses
- Inconsistent naming
- Wildcard imports (if any)

### Type Hints

**Minimal type hints.** Likely no type annotations on most functions.

**Example of missing type hints:**

```python
from typing import Optional

# Current (no type hints)
def get_mbid(self):
    ...

# With type hints
def get_mbid(self) -> Optional[str]:
    ...
```

**Benefits of adding type hints:**

- Static type checking with mypy
- Better IDE autocomplete
- Self-documenting code
- Catch type errors before runtime

## Testing

### Test Coverage

**No automated tests.** No test directory, no test files.

**Testing approach:**

- Manual testing via Jupyter notebooks
- if __name__ == "__main__" blocks in some modules

**Example if __name__ == "__main__" block:**

```python
if __name__ == "__main__":
    linker = Align(artist="The Beatles", track="Hey Jude")
    print(linker.get_mbid())
    print(linker.get_isrc())
```

**Not real tests:** No assertions, no test framework, no automation.

### Testing Recommendations

**Unit tests with mocked services:**

```python
import pytest
from unittest.mock import patch

from musicmetalinker.linking import Align


def test_get_mbid_with_provided_mbid():
    linker = Align(mbid_track="test-mbid")
    assert linker.get_mbid() == "test-mbid"


@patch('musicmetalinker.linking.musicbrainzngs')
def test_get_mbid_queries_musicbrainz(mock_mb):
    mock_mb.search_recordings.return_value = {
        'recording-list': [{'id': 'found-mbid'}]
    }
    linker = Align(artist="Test Artist", track="Test Track")
    mbid = linker.get_mbid()
    assert mbid == "found-mbid"
    mock_mb.search_recordings.assert_called_once()
```

**Integration tests:**

```python
@pytest.mark.integration
def test_real_musicbrainz_query():
    linker = Align(artist="The Beatles", track="Hey Jude")
    mbid = linker.get_mbid()
    assert mbid is not None
    assert len(mbid) == 36  # UUID length
```

**Test coverage goals:**

- Unit tests: 80%+ coverage
- Integration tests: Critical paths
- Mock all external API calls in unit tests
- Real API calls only in integration tests (marked with @pytest.mark.integration)

## Error Handling

### Current Error Handling

**Pattern throughout codebase:**

```python
try:
    result = service.query()
    return result
except:
    return None
```

**Issues:**

- Catches all exceptions (including KeyboardInterrupt, SystemExit)
- No error logging
- No distinction between error types
- Silent failures

### Error Handling Recommendations

**Specific exception handling:**

```python
import requests

# Keyword context (service=..., error=...) assumes a structlog logger;
# the standard library logger rejects arbitrary keyword arguments.
# These handlers apply to services called through requests; musicbrainzngs
# raises its own WebServiceError subclasses instead.
try:
    result = service.query()
    return result
except requests.exceptions.Timeout:
    logger.warning("Service timeout", service="musicbrainz")
    return None
except requests.exceptions.ConnectionError:
    logger.error("Service unavailable", service="musicbrainz")
    return None
except Exception as e:
    logger.error("Unexpected error", service="musicbrainz", error=str(e), exc_info=True)
    return None
```

**Custom exceptions:**

```python
class MusicMetaLinkerError(Exception):
    pass


class ServiceUnavailableError(MusicMetaLinkerError):
    pass


class InvalidInputError(MusicMetaLinkerError):
    pass


class NoMatchFoundError(MusicMetaLinkerError):
    pass
```

**Explicit error returns:**

```python
from typing import Union


def get_mbid(self) -> Union[str, None, MusicMetaLinkerError]:
    try:
        ...
    except ServiceUnavailableError as e:
        return e  # Return error instead of None
```

## Performance Considerations

### Performance Bottlenecks

**Network latency:** Sequential API calls. Total latency = sum of all service latencies.

**No caching:** Repeated queries for same track.

**No connection pooling:** New connection for each request.

**No request batching:** One request per track.

### Performance Optimization Opportunities

**1. Async/await for concurrent queries:**

```python
import asyncio
import aiohttp


async def get_all_metadata(self):
    tasks = [
        self.get_mbid_async(),
        self.get_deezer_id_async(),
        self.get_youtube_link_async()
    ]
    results = await asyncio.gather(*tasks)
    return results
```

**2. Persistent cache:**

```python
import redis

cache = redis.Redis()


def get_mbid(self):
    cache_key = f"mbid:{self.artist}:{self.track}"
    cached = cache.get(cache_key)
    if cached is not None:
        return cached.decode()

    mbid = self._query_mbid()
    if mbid is not None:  # Redis cannot store None, so cache only real hits
        cache.setex(cache_key, 86400, mbid)  # 24 hour TTL
    return mbid
```

**3. Connection pooling:**

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
retry = Retry(total=3, backoff_factor=0.3)
adapter = HTTPAdapter(max_retries=retry, pool_connections=10, pool_maxsize=20)
session.mount('http://', adapter)
session.mount('https://', adapter)
```

**4. Batch processing parallelization:**

```python
from multiprocessing import Pool


def process_track(jams_file):
    processor = JAMSProcessor(jams_file)
    metadata = processor.extract_metadata()
    linker = Align(**metadata)
    return linker.get_all_metadata()


# The guard is required on platforms that spawn worker processes
if __name__ == "__main__":
    with Pool(processes=4) as pool:
        results = pool.map(process_track, jams_files)
```

## Code Maintainability

### Maintainability Issues

**Tight coupling:** Align class directly instantiates service classes. Hard to mock for testing.

**No abstraction:** Service classes have different interfaces. No common base class.

**Hardcoded configuration:** Changing thresholds requires code modification.

**No documentation:** Minimal docstrings, no API documentation.

**Dead code:** AcousticBrainz integration non-functional.

**Inconsistent patterns:** Function for AcousticBrainz, classes for other services.

### Maintainability Recommendations

**1. Define service interface:**

```python
from abc import ABC, abstractmethod
from typing import Optional


class ServiceAligner(ABC):
    @abstractmethod
    def search_by_isrc(self, isrc: str) -> Optional[dict]:
        pass

    @abstractmethod
    def search_by_metadata(self, artist: str, track: str, album: str) -> Optional[dict]:
        pass
```

**2. Dependency injection:**

```python
from typing import List


class Align:
    def __init__(self, services: List[ServiceAligner], **metadata):
        self.services = services
        self.metadata = metadata
```

**3. Add docstrings:**

```python
def get_mbid(self) -> Optional[str]:
    """
    Retrieve MusicBrainz recording ID.

    Queries MusicBrainz by MBID (if provided), ISRC, or metadata.
    Returns None if no match found or service unavailable.

    Returns:
        MusicBrainz recording ID (UUID format) or None
    """
    ...
```

**4. Remove dead code:** Delete acousticbrainz_link() function and all references.

**5. Add configuration class:**

```python
from dataclasses import dataclass


@dataclass
class MatchingConfig:
    deezer_duration_threshold: int = 3
    musicbrainz_duration_threshold: int = 5
    similarity_threshold: float = 0.8
    user_agent: str = "MusicMetaLinker/0.0.1"
```

## Security Considerations

### Security Issues

**Plaintext credentials:** Spotify credentials in mml_secrets.py (not encrypted).

**No input validation:** Metadata strings not sanitized.

**Broad exception catching:** May hide security-relevant errors.

**No dependency scanning:** Vulnerable dependencies unknown.

### Security Recommendations

**1. Encrypt credentials:**

```python
import os

from cryptography.fernet import Fernet

key = os.getenv("ENCRYPTION_KEY")
cipher = Fernet(key)
encrypted_secret = cipher.encrypt(SPOTIFY_CLIENT_SECRET.encode())
```

**2. Input validation:**

```python
import re


def validate_mbid(mbid: str) -> bool:
    uuid_pattern = r'^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$'
    return bool(re.match(uuid_pattern, mbid, re.IGNORECASE))


def validate_isrc(isrc: str) -> bool:
    isrc_pattern = r'^[A-Z]{2}[A-Z0-9]{3}[0-9]{7}$'
    return bool(re.match(isrc_pattern, isrc))
```

**3. Dependency scanning:**

```bash
pip install pip-audit
pip-audit
```

**4. Security headers for API calls:**

```python
import uuid

import requests

headers = {
    'User-Agent': 'MusicMetaLinker/0.0.1',
    'X-Request-ID': str(uuid.uuid4())
}
response = requests.get(url, headers=headers)
```

## Code Recommendations Summary

### Immediate Fixes

1. Remove all print() statements, replace with logger.debug()
2. Remove commented-out code
3. Fix User-Agent: "elka/0.1" → "MusicMetaLinker/0.0.1"
4. Remove AcousticBrainz integration
5. Add docstrings to all public methods

### Short-Term Improvements

1. Add type hints throughout codebase
2. Add unit tests with mocked services
3. Add linting (pylint, flake8)
4. Add formatting (black, isort)
5. Add specific exception handling
6. Add input validation
7. Add configuration system

### Long-Term Enhancements

1. Refactor to use service interface abstraction
2. Add dependency injection
3. Add async/await for concurrent queries
4. Add persistent caching
5. Add connection pooling
6. Add structured logging
7. Add monitoring and metrics
8. Add comprehensive documentation
9. Add integration tests
10. Add CI/CD pipeline

## Codebase Maturity Assessment

**Current state:** Research prototype. Pre-release quality.

**Maturity level:** 2/5

**Strengths:**

- Clear separation of concerns (service classes)
- Simple, understandable structure
- Functional for research use

**Weaknesses:**

- No tests
- Debug code in production
- Hardcoded configuration
- Dead code
- No documentation
- No meaningful error handling (bare except, silent failures)
- No input validation

**Recommendation:** Suitable for academic exploration. Requires significant refactoring for production use.