MusicMetaLinker Codebase Analysis
Repository Structure
MusicMetaLinker/
├── musicmetalinker/
│ ├── __init__.py
│ ├── linking.py # Core Align class and service aligners
│ ├── preprocessor.py # JAMSProcessor for JAMS file handling
│ ├── musicbrainz_dump.py # MusicBrainz bulk download utilities
│ └── utils.py # Utility functions (likely)
├── link_partitions.py # Batch processing CLI
├── prepare_dataset.py # Dataset preparation scripts
├── deezer_test.ipynb # Deezer integration testing notebook
├── queries.ipynb # Query testing notebook
├── pyproject.toml # Build configuration
├── README.md # Project documentation
└── LICENSE # MIT license
No tests directory. No test files.
No docs directory. Documentation in README only.
No examples directory. Examples in notebooks only.
Code Organization
linking.py
Primary module. Contains all core functionality.
Classes:
- Align: Main orchestrator class
- MusicBrainzAlign: MusicBrainz service integration
- DeezerAlign: Deezer service integration
- YouTubeAlign: YouTube Music service integration
Functions:
- acousticbrainz_link(mbid): AcousticBrainz URL checker (defunct)
Estimated size: 500-800 lines (based on typical structure).
Responsibilities:
- Service coordination
- Query execution
- Result aggregation
- Metadata normalization
Code quality issues:
- Debug print() statements in production code
- Commented-out code sections
- Hardcoded configuration values
- No docstrings (likely)
- Inconsistent naming conventions
preprocessor.py
JAMS file handling.
Classes:
- JAMSProcessor: Read/write JAMS files, extract metadata, enrich with identifiers
Responsibilities:
- Parse JAMS JSON structure
- Extract file_metadata and sandbox fields
- Inject new identifiers
- Write enriched JAMS files
Dependencies:
- jams library for JAMS format support
- json for JSON parsing
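The responsibilities above can be sketched with the standard library alone. This is a minimal illustration, assuming a JAMS document is a JSON object with top-level file_metadata (holding an identifiers mapping) and sandbox keys. The function names extract_metadata and inject_identifiers are hypothetical and are not taken from JAMSProcessor.

```python
import json
from typing import Any, Dict


def extract_metadata(jams_doc: Dict[str, Any]) -> Dict[str, Any]:
    """Collect the fields a linker would need from a parsed JAMS document."""
    file_meta = jams_doc.get("file_metadata", {})
    return {
        "artist": file_meta.get("artist"),
        "title": file_meta.get("title"),
        "duration": file_meta.get("duration"),
        "identifiers": dict(file_meta.get("identifiers", {})),
        "sandbox": dict(jams_doc.get("sandbox", {})),
    }


def inject_identifiers(jams_doc: Dict[str, Any], new_ids: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the document with new identifiers merged in."""
    enriched = json.loads(json.dumps(jams_doc))  # deep copy via JSON round trip
    meta = enriched.setdefault("file_metadata", {})
    identifiers = meta.setdefault("identifiers", {})
    for key, value in new_ids.items():
        if value is not None:
            identifiers.setdefault(key, value)  # never overwrite an existing identifier
    return enriched


# Hypothetical input; the identifier value is a placeholder, not a real MBID.
raw = (
    '{"file_metadata": {"artist": "The Beatles", "title": "Hey Jude", '
    '"duration": 431.0, "identifiers": {}}, "sandbox": {}, "annotations": []}'
)
doc = json.loads(raw)
enriched = inject_identifiers(doc, {"musicbrainz": "hypothetical-mbid", "isrc": None})
```

Returning a copy rather than mutating the input keeps the original file contents available for comparison before anything is written back to disk.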
musicbrainz_dump.py
Bulk MusicBrainz download utilities.
Classes:
- MBDownload: Batch download from MusicBrainz
Purpose: Pre-populate datasets with MusicBrainz metadata to reduce API calls.
Implementation details: Not fully specified. Likely includes:
- Batch query logic
- Rate limiting (not confirmed)
- Local caching
- CSV or JSON output
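Since it is not confirmed whether MBDownload throttles its requests, the following is a minimal sketch of what a client-side rate limiter could look like. The Throttle class is hypothetical, and the one-second default is an assumption based on the roughly one request per second that MusicBrainz asks of API clients.

```python
import time
from typing import Callable, Optional


class Throttle:
    """Enforce a minimum interval between successive calls."""

    def __init__(
        self,
        min_interval: float = 1.0,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = min_interval
        self._clock = clock  # injectable so tests do not have to wait
        self._sleep = sleep
        self._last_call: Optional[float] = None

    def wait(self) -> float:
        """Block until the next call is allowed and return the time slept."""
        delay = 0.0
        if self._last_call is not None:
            elapsed = self._clock() - self._last_call
            delay = max(0.0, self.min_interval - elapsed)
        if delay > 0:
            self._sleep(delay)
        self._last_call = self._clock()
        return delay
```

A batch downloader would call wait() immediately before each request, so that bursts are spread out without slowing down calls that are already far enough apart.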
link_partitions.py
Batch processing CLI script.
Functionality:
- Scan directory for JAMS files
- Process each file with Align
- Collect results in pandas DataFrame
- Output CSV with all identifiers
- Optionally write enriched JAMS files
Command-line arguments:
- Positional: directory path
- --save: Write enriched JAMS files
- --limit audio: Only process audio files
- --overwrite: Overwrite existing files
Logging: File-based to link_partitions.log.
Progress tracking: tqdm progress bars.
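The command-line interface described above could be declared with argparse roughly as follows. This is a sketch only: the actual parser in link_partitions.py is not shown in this analysis, so the help strings and the restriction of --limit to the single value audio are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented arguments of the batch script."""
    parser = argparse.ArgumentParser(
        description="Link JAMS files to external identifiers."
    )
    parser.add_argument("directory", help="Directory to scan for JAMS files")
    parser.add_argument("--save", action="store_true",
                        help="Write enriched JAMS files")
    parser.add_argument("--limit", choices=["audio"], default=None,
                        help="Restrict processing, e.g. to audio files")
    parser.add_argument("--overwrite", action="store_true",
                        help="Overwrite existing files")
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(["./partitions", "--save", "--limit", "audio"])
```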
prepare_dataset.py
Dataset preparation utilities.
Functionality: Not fully specified. Likely includes:
- Data cleaning
- Format conversion
- Metadata normalization
- Spotify ISRC extraction for Billboard dataset
Spotify integration: Uses spotipy with credentials from mml_secrets.py.
Notebooks
deezer_test.ipynb: Interactive testing of Deezer integration.
queries.ipynb: Interactive testing of various query patterns.
Purpose: Manual testing and exploration. Not automated tests.
Configuration Management
Hardcoded Configuration
All configuration values hardcoded in source files.
linking.py:
# MusicBrainz User-Agent
musicbrainzngs.set_useragent("elka", "0.1")
# Duration thresholds
MUSICBRAINZ_DURATION_THRESHOLD = 5 # seconds
DEEZER_DURATION_THRESHOLD = 3 # seconds
# Similarity threshold
SIMILARITY_THRESHOLD = 0.8
Issues:
- No runtime configuration
- Changing thresholds requires code modification
- No environment-specific settings
- "elka/0.1" User-Agent suggests code copied from another project
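To show why these thresholds are worth making configurable, here is a minimal sketch of how a duration threshold and a similarity threshold might gate a candidate match. How linking.py actually computes similarity is not documented here, so difflib.SequenceMatcher is a stand-in, and title_similarity and is_match are hypothetical names.

```python
from difflib import SequenceMatcher
from typing import Optional

SIMILARITY_THRESHOLD = 0.8
DEEZER_DURATION_THRESHOLD = 3  # seconds


def title_similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio in the range [0, 1]."""
    return SequenceMatcher(None, a.casefold().strip(), b.casefold().strip()).ratio()


def is_match(
    query_title: str,
    candidate_title: str,
    query_duration: Optional[float],
    candidate_duration: Optional[float],
    duration_threshold: float = DEEZER_DURATION_THRESHOLD,
    similarity_threshold: float = SIMILARITY_THRESHOLD,
) -> bool:
    """Accept a candidate only if the title and (when known) the duration agree."""
    if title_similarity(query_title, candidate_title) < similarity_threshold:
        return False
    if query_duration is None or candidate_duration is None:
        return True  # nothing to compare, so fall back to the title alone
    return abs(query_duration - candidate_duration) <= duration_threshold
```

Because the thresholds are parameters with defaults, a caller can tighten or relax them per dataset without editing the module.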
External Configuration
Only external config: mml_secrets.py for Spotify credentials.
Not in repository. Users must create manually.
Structure:
SPOTIFY_CLIENT_ID = "..."
SPOTIFY_CLIENT_SECRET = "..."
Import pattern:
try:
    from mml_secrets import SPOTIFY_CLIENT_ID, SPOTIFY_CLIENT_SECRET
except ImportError:
    SPOTIFY_CLIENT_ID = None
    SPOTIFY_CLIENT_SECRET = None
Graceful degradation: If mml_secrets.py missing, Spotify features disabled.
Configuration Recommendations
- Use environment variables:
import os
SPOTIFY_CLIENT_ID = os.getenv("SPOTIFY_CLIENT_ID")
MUSICBRAINZ_USER_AGENT = os.getenv("MUSICBRAINZ_USER_AGENT", "MusicMetaLinker/0.0.1")
DEEZER_DURATION_THRESHOLD = int(os.getenv("DEEZER_DURATION_THRESHOLD", "3"))
- Add config file support:
import configparser
config = configparser.ConfigParser()
config.read("musicmetalinker.ini")
DEEZER_DURATION_THRESHOLD = config.getint("matching", "deezer_duration_threshold", fallback=3)
- Add runtime configuration:
linker = Align(
    artist="...",
    track="...",
    config={
        "deezer_duration_threshold": 5,
        "similarity_threshold": 0.9,
    },
)
Logging Architecture
Logging Implementation
Library: Python standard logging module.
Configuration:
import logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)
Log levels used:
- INFO: Normal operation (file processing, successful queries)
- ERROR: Failed queries, network errors
Not used:
- DEBUG: No debug-level logging
- WARNING: No warnings
- CRITICAL: No critical errors
Logging Locations
Batch processing: File-based logging to link_partitions.log.
file_handler = logging.FileHandler('link_partitions.log')
logger.addHandler(file_handler)
Library usage: Console logging.
console_handler = logging.StreamHandler()
logger.addHandler(console_handler)
Debug Output Issues
Multiple print() statements in production code:
print(f"Querying MusicBrainz for {artist} - {track}")
print(f"Found MBID: {mbid}")
print(f"Deezer search returned {len(results)} results")
Problems:
- Not controlled by logging configuration
- Can't disable without code changes
- No log levels
- No timestamps
- Mixes with actual output
Recommendation: Replace all print() with logger.debug().
Logging Recommendations
- Remove print() statements:
# Before
print(f"Querying MusicBrainz for {artist} - {track}")
# After (lazy %-formatting: the message is only built if DEBUG is enabled)
logger.debug("Querying MusicBrainz for %s - %s", artist, track)
- Add structured logging:
import structlog
logger = structlog.get_logger()
logger.info("musicbrainz_query", artist=artist, track=track, mbid=mbid)
- Add correlation IDs:
import uuid
correlation_id = str(uuid.uuid4())
logger.info("query_started", correlation_id=correlation_id, artist=artist)
# ... queries ...
logger.info("query_completed", correlation_id=correlation_id, mbid=mbid)
- Add log levels:
logger.debug("Attempting MusicBrainz query")
logger.info("Successfully retrieved MBID")
logger.warning("Deezer query returned no results, falling back to YouTube")
logger.error("All services failed", exc_info=True)
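The correlation-ID and log-level recommendations can also be met without a third-party dependency. The following is a minimal sketch using only the standard logging module; the logger name mml_sketch, the CorrelationIdFilter class, and the log format are illustrative assumptions.

```python
import contextvars
import io
import logging
import uuid

# Holds the ID of the query currently being processed.
_correlation_id = contextvars.ContextVar("correlation_id", default="-")


class CorrelationIdFilter(logging.Filter):
    """Attach the current correlation ID to every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.correlation_id = _correlation_id.get()
        return True  # never drop the record, only annotate it


def build_logger(stream) -> logging.Logger:
    """Create a logger whose output lines carry the correlation ID."""
    logger = logging.getLogger("mml_sketch")
    logger.setLevel(logging.DEBUG)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(levelname)s [%(correlation_id)s] %(message)s")
    )
    handler.addFilter(CorrelationIdFilter())
    logger.addHandler(handler)
    return logger


buffer = io.StringIO()
log = build_logger(buffer)
request_id = str(uuid.uuid4())
_correlation_id.set(request_id)
log.debug("Querying MusicBrainz for %s - %s", "The Beatles", "Hey Jude")
log.warning("Deezer query returned no results")
```

Every line written while the context variable is set carries the same ID, which makes it possible to group the log output of one track even when several tracks are processed concurrently.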
Code Quality
Code Smells
Debug prints in production:
print("DEBUG: entering get_mbid()")
print(f"DEBUG: mbid_track = {self.mbid_track}")
Commented-out code:
# if duration:
# matches = [r for r in results if abs(r['duration_seconds'] - duration) < 10]
Hardcoded values:
musicbrainzngs.set_useragent("elka", "0.1") # Should be "MusicMetaLinker/0.0.1"
Inconsistent naming:
mbid_track # snake_case
mbidTrack # camelCase (in some places)
MBID # UPPER_CASE
No docstrings:
def get_mbid(self):
    # No docstring explaining what this returns or when it returns None
    ...
Broad exception catching:
try:
    result = service.query()
except:  # Catches everything, including KeyboardInterrupt
    return None
Code Quality Metrics
Estimated metrics (without actual analysis):
- Lines of code: ~1500-2000
- Cyclomatic complexity: Moderate (nested conditionals in matching logic)
- Code duplication: Moderate (similar patterns across service aligners)
- Test coverage: 0% (no tests)
- Documentation coverage: Low (minimal docstrings)
Linting Issues
No linting configuration. Running pylint or flake8 would likely find:
- Unused imports
- Unused variables
- Line too long (>79 characters)
- Missing docstrings
- Bare except clauses
- Inconsistent naming
- Wildcard imports (if any)
Type Hints
Minimal type hints. Likely no type annotations on most functions.
Example of missing type hints:
from typing import Optional

# Current (no type hints)
def get_mbid(self):
    ...

# With type hints
def get_mbid(self) -> Optional[str]:
    ...
Benefits of adding type hints:
- Static type checking with mypy
- Better IDE autocomplete
- Self-documenting code
- Catch type errors before runtime
Testing
Test Coverage
No automated tests. No test directory, no test files.
Testing approach:
- Manual testing via Jupyter notebooks
- if __name__ == "__main__" blocks in some modules
Example if __name__ == "__main__" block:
if __name__ == "__main__":
    linker = Align(artist="The Beatles", track="Hey Jude")
    print(linker.get_mbid())
    print(linker.get_isrc())
Not real tests: No assertions, no test framework, no automation.
Testing Recommendations
Unit tests with mocked services:
import pytest
from unittest.mock import patch

from musicmetalinker.linking import Align

def test_get_mbid_with_provided_mbid():
    # A provided MBID should be returned without any network call
    linker = Align(mbid_track="test-mbid")
    assert linker.get_mbid() == "test-mbid"

@patch('musicmetalinker.linking.musicbrainzngs')
def test_get_mbid_queries_musicbrainz(mock_mb):
    # Replace the MusicBrainz client with a mock that returns one recording
    mock_mb.search_recordings.return_value = {
        'recording-list': [{'id': 'found-mbid'}]
    }
    linker = Align(artist="Test Artist", track="Test Track")
    mbid = linker.get_mbid()
    assert mbid == "found-mbid"
    mock_mb.search_recordings.assert_called_once()
Integration tests:
@pytest.mark.integration
def test_real_musicbrainz_query():
    linker = Align(artist="The Beatles", track="Hey Jude")
    mbid = linker.get_mbid()
    assert mbid is not None
    assert len(mbid) == 36  # UUID length
Test coverage goals:
- Unit tests: 80%+ coverage
- Integration tests: Critical paths
- Mock all external API calls in unit tests
- Real API calls only in integration tests (marked with @pytest.mark.integration)
Error Handling
Current Error Handling
Pattern throughout codebase:
try:
    result = service.query()
    return result
except:
    return None
Issues:
- Catches all exceptions (including KeyboardInterrupt, SystemExit)
- No error logging
- No distinction between error types
- Silent failures
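The first issue is easy to demonstrate. The sketch below uses hypothetical helper functions (none of them exist in the codebase) to contrast a bare except with except Exception: only the bare form swallows KeyboardInterrupt.

```python
def bare_except_query(query):
    """Mimics the pattern above: every exception becomes None."""
    try:
        return query()
    except:  # noqa: E722 (deliberately bare, to show the problem)
        return None


def narrow_except_query(query):
    """Catches ordinary errors but lets interpreter-level signals through."""
    try:
        return query()
    except Exception:
        return None


def interrupted():
    # Simulates the user pressing Ctrl+C during a query.
    raise KeyboardInterrupt


def failing():
    # Simulates an ordinary failure such as a malformed response.
    raise ValueError("bad response")


# The bare except hides the interrupt, so a batch run could not be stopped.
swallowed = bare_except_query(interrupted)
```

KeyboardInterrupt and SystemExit derive from BaseException rather than Exception, which is why the narrower clause leaves them alone.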
Error Handling Recommendations
Specific exception handling:
try:
    result = service.query()
    return result
except requests.exceptions.Timeout:
    # Standard-library loggers take %-style arguments, not keyword fields
    logger.warning("Service timeout: %s", "musicbrainz")
    return None
except requests.exceptions.ConnectionError:
    logger.error("Service unavailable: %s", "musicbrainz")
    return None
except Exception:
    # logger.exception logs at ERROR level and appends the traceback
    logger.exception("Unexpected error from %s", "musicbrainz")
    return None
Custom exceptions:
class MusicMetaLinkerError(Exception):
    pass

class ServiceUnavailableError(MusicMetaLinkerError):
    pass

class InvalidInputError(MusicMetaLinkerError):
    pass

class NoMatchFoundError(MusicMetaLinkerError):
    pass
Explicit error returns:
from typing import Optional, Union

def get_mbid(self) -> Union[str, None, MusicMetaLinkerError]:
    try:
        ...
    except ServiceUnavailableError as e:
        return e  # Return error instead of None
Performance Considerations
Performance Bottlenecks
Network latency: Sequential API calls. Total latency = sum of all service latencies.
No caching: Repeated queries for same track.
No connection pooling: New connection for each request.
No request batching: One request per track.
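The sequential-latency point can be made concrete with simulated services. In this sketch the service names and latencies are made up, and asyncio.sleep stands in for network I/O: run one after another, the calls take roughly the sum of the latencies, while run concurrently they take roughly the longest single latency.

```python
import asyncio
import time


async def fake_service(name: str, latency: float) -> str:
    """Stand-in for a network call that takes `latency` seconds."""
    await asyncio.sleep(latency)
    return name


async def sequential(latencies):
    # Each call starts only after the previous one has finished.
    return [await fake_service(name, delay) for name, delay in latencies.items()]


async def concurrent(latencies):
    # All calls are started together; gather preserves argument order.
    return await asyncio.gather(
        *(fake_service(name, delay) for name, delay in latencies.items())
    )


def timed(coro):
    """Run a coroutine to completion and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = asyncio.run(coro)
    return result, time.perf_counter() - start


LATENCIES = {"musicbrainz": 0.15, "deezer": 0.10, "youtube": 0.05}
seq_result, seq_elapsed = timed(sequential(LATENCIES))
con_result, con_elapsed = timed(concurrent(LATENCIES))
```

The gain only applies to lookups that are independent of each other. A Deezer query that needs an ISRC obtained from MusicBrainz would still have to wait for that result.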
Performance Optimization Opportunities
1. Async/await for concurrent queries:
import asyncio
import aiohttp

async def get_all_metadata(self):
    tasks = [
        self.get_mbid_async(),
        self.get_deezer_id_async(),
        self.get_youtube_link_async()
    ]
    results = await asyncio.gather(*tasks)
    return results
2. Persistent cache:
import redis

cache = redis.Redis()

def get_mbid(self):
    cache_key = f"mbid:{self.artist}:{self.track}"
    cached = cache.get(cache_key)
    if cached is not None:
        return cached.decode()
    mbid = self._query_mbid()
    if mbid is not None:  # Redis cannot store None, so misses are not cached
        cache.setex(cache_key, 86400, mbid)  # 24-hour TTL
    return mbid
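For a research tool, a cache that needs no separate server may be easier to adopt. The following is a minimal sketch of the same idea built on the standard library's sqlite3 module instead of Redis. SqliteCache and cached_lookup are hypothetical names, and the 24-hour default TTL simply mirrors the value used above.

```python
import sqlite3
import time
from typing import Callable, Optional


class SqliteCache:
    """Key-value cache with per-entry expiry, stored in a SQLite database."""

    def __init__(self, path: str = ":memory:", ttl_seconds: float = 86400,
                 clock: Callable[[], float] = time.time) -> None:
        self._conn = sqlite3.connect(path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS cache ("
            "key TEXT PRIMARY KEY, value TEXT NOT NULL, expires_at REAL NOT NULL)"
        )
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without waiting

    def get(self, key: str) -> Optional[str]:
        row = self._conn.execute(
            "SELECT value, expires_at FROM cache WHERE key = ?", (key,)
        ).fetchone()
        if row is None or row[1] <= self._clock():
            return None  # missing or expired
        return row[0]

    def set(self, key: str, value: str) -> None:
        self._conn.execute(
            "INSERT OR REPLACE INTO cache (key, value, expires_at) VALUES (?, ?, ?)",
            (key, value, self._clock() + self._ttl),
        )
        self._conn.commit()


def cached_lookup(cache: SqliteCache, key: str,
                  query: Callable[[], Optional[str]]) -> Optional[str]:
    """Return the cached value, or run the query and cache a successful result."""
    hit = cache.get(key)
    if hit is not None:
        return hit
    value = query()
    if value is not None:  # misses are not cached, so they are retried next time
        cache.set(key, value)
    return value
```

Passing a file path instead of ":memory:" makes the cache survive between batch runs, which is where repeated queries for the same track are most likely to occur.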
3. Connection pooling:
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
session = requests.Session()
retry = Retry(total=3, backoff_factor=0.3)
adapter = HTTPAdapter(max_retries=retry, pool_connections=10, pool_maxsize=20)
session.mount('http://', adapter)
session.mount('https://', adapter)
4. Batch processing parallelization:
from multiprocessing import Pool

def process_track(jams_file):
    processor = JAMSProcessor(jams_file)
    metadata = processor.extract_metadata()
    linker = Align(**metadata)
    return linker.get_all_metadata()

if __name__ == "__main__":  # Required on platforms that spawn worker processes
    with Pool(processes=4) as pool:
        results = pool.map(process_track, jams_files)
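Because the per-track work is dominated by waiting on network responses rather than by computation, a thread pool is a lighter alternative to separate processes. This is a sketch under that assumption; link_track is a hypothetical stand-in that sleeps instead of calling any real service.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Optional


def link_track(path: str) -> Dict[str, Optional[str]]:
    """Stand-in for linking one JAMS file; the sleep simulates network latency."""
    time.sleep(0.05)
    return {"file": path, "mbid": None}


def link_all(paths: List[str], max_workers: int = 4) -> List[Dict[str, Optional[str]]]:
    """Process files concurrently; Executor.map returns results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(link_track, paths))


paths = [f"track_{i}.jams" for i in range(8)]
results = link_all(paths)
```

Whichever pool is used, the worker count has to stay compatible with the rate limits of the services being queried, otherwise parallelism only produces more rejected requests.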
Code Maintainability
Maintainability Issues
Tight coupling: Align class directly instantiates service classes. Hard to mock for testing.
No abstraction: Service classes have different interfaces. No common base class.
Hardcoded configuration: Changing thresholds requires code modification.
No documentation: Minimal docstrings, no API documentation.
Dead code: AcousticBrainz integration non-functional.
Inconsistent patterns: Function for AcousticBrainz, classes for other services.
Maintainability Recommendations
1. Define service interface:
from abc import ABC, abstractmethod
from typing import Optional

class ServiceAligner(ABC):
    @abstractmethod
    def search_by_isrc(self, isrc: str) -> Optional[dict]:
        pass

    @abstractmethod
    def search_by_metadata(self, artist: str, track: str, album: str) -> Optional[dict]:
        pass
2. Dependency injection:
from typing import List

class Align:
    def __init__(self, services: List[ServiceAligner], **metadata):
        self.services = services
        self.metadata = metadata
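The testing benefit of injecting services can be shown with a small self-contained sketch. The Linker and FakeAligner classes and the first_match method are hypothetical, and the interface is reduced to a single method to keep the example short.

```python
from abc import ABC, abstractmethod
from typing import List, Optional, Tuple


class ServiceAligner(ABC):
    """Reduced, hypothetical version of the interface sketched above."""

    @abstractmethod
    def search_by_metadata(self, artist: str, track: str) -> Optional[dict]:
        ...


class FakeAligner(ServiceAligner):
    """Test double that records its calls and returns a canned result."""

    def __init__(self, result: Optional[dict]) -> None:
        self.result = result
        self.calls: List[Tuple[str, str]] = []

    def search_by_metadata(self, artist: str, track: str) -> Optional[dict]:
        self.calls.append((artist, track))
        return self.result


class Linker:
    """Receives its services from the caller instead of instantiating them."""

    def __init__(self, services: List[ServiceAligner]) -> None:
        self.services = services

    def first_match(self, artist: str, track: str) -> Optional[dict]:
        # Try each service in order and stop at the first one that answers.
        for service in self.services:
            result = service.search_by_metadata(artist, track)
            if result is not None:
                return result
        return None


empty = FakeAligner(None)
hit = FakeAligner({"id": "hypothetical-id"})
linker = Linker([empty, hit])
match = linker.first_match("Test Artist", "Test Track")
```

No patching is needed here: the fallback order is verified purely by choosing which fakes to pass in.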
3. Add docstrings:
def get_mbid(self) -> Optional[str]:
    """
    Retrieve MusicBrainz recording ID.

    Queries MusicBrainz by MBID (if provided), ISRC, or metadata.
    Returns None if no match found or service unavailable.

    Returns:
        MusicBrainz recording ID (UUID format) or None
    """
    ...
4. Remove dead code:
Delete acousticbrainz_link() function and all references.
5. Add configuration class:
from dataclasses import dataclass

@dataclass
class MatchingConfig:
    deezer_duration_threshold: int = 3
    musicbrainz_duration_threshold: int = 5
    similarity_threshold: float = 0.8
    user_agent: str = "MusicMetaLinker/0.0.1"
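A configuration class like this combines naturally with the environment-variable approach recommended earlier. The following is a minimal sketch; the MML_ prefix and the config_from_env function are assumptions made for illustration.

```python
import os
from dataclasses import dataclass, fields, replace
from typing import Mapping


@dataclass(frozen=True)
class MatchingConfig:
    deezer_duration_threshold: int = 3
    musicbrainz_duration_threshold: int = 5
    similarity_threshold: float = 0.8
    user_agent: str = "MusicMetaLinker/0.0.1"


def config_from_env(environ: Mapping[str, str] = os.environ,
                    prefix: str = "MML_") -> MatchingConfig:
    """Override defaults with variables such as MML_SIMILARITY_THRESHOLD."""
    overrides = {}
    for field in fields(MatchingConfig):
        raw = environ.get(prefix + field.name.upper())
        if raw is not None:
            caster = type(field.default)  # int, float or str, taken from the default
            overrides[field.name] = caster(raw)
    return replace(MatchingConfig(), **overrides)


# A plain dict stands in for os.environ here.
cfg = config_from_env({
    "MML_DEEZER_DURATION_THRESHOLD": "5",
    "MML_SIMILARITY_THRESHOLD": "0.9",
})
```

Deriving the conversion from each field's default keeps the parsing rules next to the defaults, so adding a new setting only requires adding one field.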
Security Considerations
Security Issues
Plaintext credentials: Spotify credentials in mml_secrets.py (not encrypted).
No input validation: Metadata strings not sanitized.
Broad exception catching: May hide security-relevant errors.
No dependency scanning: Vulnerable dependencies unknown.
Security Recommendations
1. Encrypt credentials:
import os
from cryptography.fernet import Fernet

# The key must be a URL-safe base64-encoded 32-byte key, e.g. from Fernet.generate_key()
key = os.getenv("ENCRYPTION_KEY")
cipher = Fernet(key)
encrypted_secret = cipher.encrypt(SPOTIFY_CLIENT_SECRET.encode())
2. Input validation:
import re

def validate_mbid(mbid: str) -> bool:
    uuid_pattern = r'[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}'
    # fullmatch rejects a trailing newline, which match() with "$" would accept
    return bool(re.fullmatch(uuid_pattern, mbid, re.IGNORECASE))

def validate_isrc(isrc: str) -> bool:
    isrc_pattern = r'[A-Z]{2}[A-Z0-9]{3}[0-9]{7}'
    return bool(re.fullmatch(isrc_pattern, isrc))
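ISRCs are often written with hyphens and in mixed case, so strict validation usually needs a normalization step first. A minimal sketch, assuming that stripping hyphens and whitespace and upper-casing is acceptable for the datasets involved; normalize_isrc is a hypothetical helper.

```python
import re
from typing import Optional

# Compact ISRC: 2-letter country code, 3 alphanumerics, 7 digits.
_ISRC_RE = re.compile(r"[A-Z]{2}[A-Z0-9]{3}[0-9]{7}")


def normalize_isrc(raw: str) -> Optional[str]:
    """Return the compact 12-character form, or None if the input is not an ISRC."""
    compact = re.sub(r"[\s-]", "", raw).upper()
    return compact if _ISRC_RE.fullmatch(compact) else None


normalized = normalize_isrc("us-rc1-76-07839")
```

Returning None for anything that does not fit the pattern lets the caller skip the ISRC lookup and fall back to a metadata search instead of sending a malformed query.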
3. Dependency scanning:
pip install pip-audit
pip-audit
4. Security headers for API calls:
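import uuid
import requests

headers = {
    'User-Agent': 'MusicMetaLinker/0.0.1',
    'X-Request-ID': str(uuid.uuid4())
}
# An explicit timeout prevents a stalled service from blocking the run
response = requests.get(url, headers=headers, timeout=10)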
Code Recommendations Summary
Immediate Fixes
- Remove all print() statements, replace with logger.debug()
- Remove commented-out code
- Fix User-Agent: "elka/0.1" → "MusicMetaLinker/0.0.1"
- Remove AcousticBrainz integration
- Add docstrings to all public methods
Short-Term Improvements
- Add type hints throughout codebase
- Add unit tests with mocked services
- Add linting (pylint, flake8)
- Add formatting (black, isort)
- Add specific exception handling
- Add input validation
- Add configuration system
Long-Term Enhancements
- Refactor to use service interface abstraction
- Add dependency injection
- Add async/await for concurrent queries
- Add persistent caching
- Add connection pooling
- Add structured logging
- Add monitoring and metrics
- Add comprehensive documentation
- Add integration tests
- Add CI/CD pipeline
Codebase Maturity Assessment
Current state: Research prototype. Pre-release quality.
Maturity level: 2/5
Strengths:
- Clear separation of concerns (service classes)
- Simple, understandable structure
- Functional for research use
Weaknesses:
- No tests
- Debug code in production
- Hardcoded configuration
- Dead code
- Minimal documentation (README only, minimal docstrings)
- Error handling limited to bare except clauses that fail silently
- No input validation
Recommendation: Suitable for academic exploration. Requires significant refactoring for production use.