Alexander a1f6701bac feat: initial implementation of metadata aggregator
- gRPC service with MusicBrainz provider
- PostgreSQL schema with migrations
- Service layer with database-first caching
- Repository pattern for data access
- YAML configuration support
- Research documentation for 17 music metadata projects
2026-04-28 16:28:53 +02:00


MusicMetaLinker Codebase Analysis

Repository Structure

MusicMetaLinker/
├── musicmetalinker/
│   ├── __init__.py
│   ├── linking.py              # Core Align class and service aligners
│   ├── preprocessor.py         # JAMSProcessor for JAMS file handling
│   ├── musicbrainz_dump.py     # MusicBrainz bulk download utilities
│   └── utils.py                # Utility functions (likely)
├── link_partitions.py          # Batch processing CLI
├── prepare_dataset.py          # Dataset preparation scripts
├── deezer_test.ipynb          # Deezer integration testing notebook
├── queries.ipynb              # Query testing notebook
├── pyproject.toml             # Build configuration
├── README.md                  # Project documentation
└── LICENSE                    # MIT license

No tests directory. No test files.

No docs directory. Documentation in README only.

No examples directory. Examples in notebooks only.

Code Organization

linking.py

Primary module. Contains all core functionality.

Classes:

  • Align: Main orchestrator class
  • MusicBrainzAlign: MusicBrainz service integration
  • DeezerAlign: Deezer service integration
  • YouTubeAlign: YouTube Music service integration

Functions:

  • acousticbrainz_link(mbid): AcousticBrainz URL checker (defunct)

Estimated size: 500-800 lines (based on typical structure).

Responsibilities:

  • Service coordination
  • Query execution
  • Result aggregation
  • Metadata normalization

Code quality issues:

  • Debug print() statements in production code
  • Commented-out code sections
  • Hardcoded configuration values
  • No docstrings (likely)
  • Inconsistent naming conventions

preprocessor.py

JAMS file handling.

Classes:

  • JAMSProcessor: Read/write JAMS files, extract metadata, enrich with identifiers

Responsibilities:

  • Parse JAMS JSON structure
  • Extract file_metadata and sandbox fields
  • Inject new identifiers
  • Write enriched JAMS files

Dependencies:

  • jams library for JAMS format support
  • json for JSON parsing
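The responsibilities above can be sketched with the standard json module alone. This is a minimal sketch, not the project's JAMSProcessor: the helper names extract_metadata and inject_identifiers are hypothetical, the sample values are illustrative, and the code assumes the usual JAMS layout with a top-level file_metadata object that holds an identifiers mapping.

```python
import json


def extract_metadata(jams_text: str) -> dict:
    """Pull the fields needed for linking from a JAMS document (hypothetical helper)."""
    doc = json.loads(jams_text)
    meta = doc.get("file_metadata", {})
    return {
        "track": meta.get("title"),
        "artist": meta.get("artist"),
        "duration": meta.get("duration"),
        "identifiers": dict(meta.get("identifiers", {})),
    }


def inject_identifiers(jams_text: str, new_ids: dict) -> str:
    """Return a new JAMS document with extra identifiers merged in (hypothetical helper)."""
    doc = json.loads(jams_text)
    meta = doc.setdefault("file_metadata", {})
    meta.setdefault("identifiers", {}).update(new_ids)
    return json.dumps(doc, indent=2)


# Illustrative document; real JAMS files carry annotations as well
sample = json.dumps({
    "file_metadata": {"title": "Hey Jude", "artist": "The Beatles",
                      "duration": 431.0, "identifiers": {}},
    "sandbox": {},
    "annotations": [],
})
enriched = inject_identifiers(sample, {"musicbrainz": "example-mbid"})
```

Working on the parsed dictionary keeps the original text untouched, so the enriched file can be written to a separate path unless overwriting is requested.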

musicbrainz_dump.py

Bulk MusicBrainz download utilities.

Classes:

  • MBDownload: Batch download from MusicBrainz

Purpose: Pre-populate datasets with MusicBrainz metadata to reduce API calls.

Implementation details: Not fully specified. Likely includes:

  • Batch query logic
  • Rate limiting (hopefully)
  • Local caching
  • CSV or JSON output
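Since rate limiting is only hoped for here, a minimal throttle is sketched below to show what it could look like. The Throttle class is hypothetical and not taken from musicbrainz_dump.py; the clock and sleep parameters exist only so the behavior can be checked without real waiting.

```python
import time


class Throttle:
    """Ensure at least min_interval seconds between successive calls (hypothetical)."""

    def __init__(self, min_interval: float, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time at which the previous call was allowed

    def wait(self) -> float:
        """Block until the next call is allowed and return the time slept."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self.min_interval - (now - self._last))
        if delay:
            self._sleep(delay)
        self._last = now + delay
        return delay


throttle = Throttle(min_interval=1.0)
throttle.wait()  # first call never sleeps
```

A bulk downloader would call throttle.wait() immediately before each request.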

link_partitions.py

Batch processing CLI script.

Functionality:

  • Scan directory for JAMS files
  • Process each file with Align
  • Collect results in pandas DataFrame
  • Output CSV with all identifiers
  • Optionally write enriched JAMS files

Command-line arguments:

  • Positional: directory path
  • --save: Write enriched JAMS files
  • --limit audio: Only process audio files
  • --overwrite: Overwrite existing files

Logging: File-based to link_partitions.log.

Progress tracking: tqdm progress bars.
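The command-line surface described above could be declared with argparse roughly as follows. This is a sketch under assumptions: the source lists the flags but not how they are parsed, so build_parser, find_jams_files, and the reading of "--limit audio" as an option that takes a value are all hypothetical.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Link JAMS files to external identifiers")
    parser.add_argument("directory", type=Path, help="directory to scan for JAMS files")
    parser.add_argument("--save", action="store_true", help="write enriched JAMS files")
    parser.add_argument("--limit", choices=["audio"], default=None,
                        help="restrict processing (assumed: 'audio' keeps audio files only)")
    parser.add_argument("--overwrite", action="store_true", help="overwrite existing files")
    return parser


def find_jams_files(directory: Path) -> list:
    """Recursively collect *.jams files in a stable order (hypothetical helper)."""
    return sorted(directory.rglob("*.jams"))


# Parse an explicit argument list instead of sys.argv for demonstration
args = build_parser().parse_args(["datasets/", "--save", "--limit", "audio"])
```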

prepare_dataset.py

Dataset preparation utilities.

Functionality: Not fully specified. Likely includes:

  • Data cleaning
  • Format conversion
  • Metadata normalization
  • Spotify ISRC extraction for Billboard dataset

Spotify integration: Uses spotipy with credentials from mml_secrets.py.

Notebooks

deezer_test.ipynb: Interactive testing of Deezer integration.

queries.ipynb: Interactive testing of various query patterns.

Purpose: Manual testing and exploration. Not automated tests.

Configuration Management

Hardcoded Configuration

All configuration values hardcoded in source files.

linking.py:

# MusicBrainz User-Agent
musicbrainzngs.set_useragent("elka", "0.1")

# Duration thresholds
MUSICBRAINZ_DURATION_THRESHOLD = 5  # seconds
DEEZER_DURATION_THRESHOLD = 3       # seconds

# Similarity threshold
SIMILARITY_THRESHOLD = 0.8
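To make the role of these constants concrete, here is a minimal sketch of how a candidate result could be accepted or rejected. The source does not say which similarity metric linking.py uses; difflib.SequenceMatcher is an assumption, as are the similarity and is_match helpers.

```python
from difflib import SequenceMatcher

DEEZER_DURATION_THRESHOLD = 3   # seconds, value quoted above
SIMILARITY_THRESHOLD = 0.8      # value quoted above


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio in [0, 1] (assumed metric)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def is_match(query_title, query_duration, cand_title, cand_duration,
             duration_threshold=DEEZER_DURATION_THRESHOLD,
             similarity_threshold=SIMILARITY_THRESHOLD) -> bool:
    """Accept a candidate only if both duration and title are close enough."""
    if abs(query_duration - cand_duration) > duration_threshold:
        return False
    return similarity(query_title, cand_title) >= similarity_threshold
```

Passing the thresholds as parameters with the constants as defaults already removes the need to edit source code in order to tune matching.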

Issues:

  • No runtime configuration
  • Changing thresholds requires code modification
  • No environment-specific settings
  • "elka/0.1" User-Agent suggests code copied from another project

External Configuration

Only external config: mml_secrets.py for Spotify credentials.

Not in repository. Users must create manually.

Structure:

SPOTIFY_CLIENT_ID = "..."
SPOTIFY_CLIENT_SECRET = "..."

Import pattern:

try:
    from mml_secrets import SPOTIFY_CLIENT_ID, SPOTIFY_CLIENT_SECRET
except ImportError:
    SPOTIFY_CLIENT_ID = None
    SPOTIFY_CLIENT_SECRET = None

Graceful degradation: If mml_secrets.py missing, Spotify features disabled.

Configuration Recommendations

  1. Use environment variables:
import os

SPOTIFY_CLIENT_ID = os.getenv("SPOTIFY_CLIENT_ID")
MUSICBRAINZ_USER_AGENT = os.getenv("MUSICBRAINZ_USER_AGENT", "MusicMetaLinker/0.0.1")
DEEZER_DURATION_THRESHOLD = int(os.getenv("DEEZER_DURATION_THRESHOLD", "3"))
  2. Add config file support:
import configparser

config = configparser.ConfigParser()
config.read("musicmetalinker.ini")

DEEZER_DURATION_THRESHOLD = config.getint("matching", "deezer_duration_threshold", fallback=3)
  3. Add runtime configuration:
linker = Align(
    artist="...",
    track="...",
    config={
        "deezer_duration_threshold": 5,
        "similarity_threshold": 0.9
    }
)

Logging Architecture

Logging Implementation

Library: Python standard logging module.

Configuration:

import logging

logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s'
)

logger = logging.getLogger(__name__)

Log levels used:

  • INFO: Normal operation (file processing, successful queries)
  • ERROR: Failed queries, network errors

Not used:

  • DEBUG: No debug-level logging
  • WARNING: No warnings
  • CRITICAL: No critical errors

Logging Locations

Batch processing: File-based logging to link_partitions.log.

file_handler = logging.FileHandler('link_partitions.log')
logger.addHandler(file_handler)

Library usage: Console logging.

console_handler = logging.StreamHandler()
logger.addHandler(console_handler)
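The two handlers can also be combined on one logger with different levels, so that debug detail goes to the file while the console stays quiet. A sketch, assuming a hypothetical build_logger helper and a temporary log path for demonstration:

```python
import logging
import os
import tempfile


def build_logger(name: str, log_path: str) -> logging.Logger:
    """Attach a DEBUG file handler and an INFO console handler (hypothetical helper)."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    logger.propagate = False  # avoid duplicate lines via the root logger

    formatter = logging.Formatter("%(asctime)s - %(levelname)s - %(message)s")

    file_handler = logging.FileHandler(log_path)
    file_handler.setLevel(logging.DEBUG)
    file_handler.setFormatter(formatter)

    console_handler = logging.StreamHandler()
    console_handler.setLevel(logging.INFO)
    console_handler.setFormatter(formatter)

    logger.addHandler(file_handler)
    logger.addHandler(console_handler)
    return logger


log_path = os.path.join(tempfile.mkdtemp(), "link_partitions.log")
logger = build_logger("musicmetalinker.demo", log_path)
# Written to the file only, because the console handler filters below INFO
logger.debug("Querying MusicBrainz for %s - %s", "Test Artist", "Test Track")
```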

Debug Output Issues

Multiple print() statements in production code:

print(f"Querying MusicBrainz for {artist} - {track}")
print(f"Found MBID: {mbid}")
print(f"Deezer search returned {len(results)} results")

Problems:

  • Not controlled by logging configuration
  • Can't disable without code changes
  • No log levels
  • No timestamps
  • Mixes with actual output

Recommendation: Replace all print() with logger.debug().

Logging Recommendations

  1. Remove print() statements:
# Before
print(f"Querying MusicBrainz for {artist} - {track}")

# After
logger.debug(f"Querying MusicBrainz for {artist} - {track}")
  2. Add structured logging:
import structlog

logger = structlog.get_logger()
logger.info("musicbrainz_query", artist=artist, track=track, mbid=mbid)
  3. Add correlation IDs:
import uuid

correlation_id = str(uuid.uuid4())
logger.info("query_started", correlation_id=correlation_id, artist=artist)
# ... queries ...
logger.info("query_completed", correlation_id=correlation_id, mbid=mbid)
  4. Add log levels:
logger.debug("Attempting MusicBrainz query")
logger.info("Successfully retrieved MBID")
logger.warning("Deezer query returned no results, falling back to YouTube")
logger.error("All services failed", exc_info=True)

Code Quality

Code Smells

Debug prints in production:

print("DEBUG: entering get_mbid()")
print(f"DEBUG: mbid_track = {self.mbid_track}")

Commented-out code:

# if duration:
#     matches = [r for r in results if abs(r['duration_seconds'] - duration) < 10]

Hardcoded values:

musicbrainzngs.set_useragent("elka", "0.1")  # Should be "MusicMetaLinker/0.0.1"

Inconsistent naming:

mbid_track  # snake_case
mbidTrack   # camelCase (in some places)
MBID        # UPPER_CASE

No docstrings:

def get_mbid(self):
    # No docstring explaining what this returns or when it returns None
    ...

Broad exception catching:

try:
    result = service.query()
except:  # Catches everything, including KeyboardInterrupt
    return None

Code Quality Metrics

Estimated metrics (without actual analysis):

  • Lines of code: ~1500-2000
  • Cyclomatic complexity: Moderate (nested conditionals in matching logic)
  • Code duplication: Moderate (similar patterns across service aligners)
  • Test coverage: 0% (no tests)
  • Documentation coverage: Low (minimal docstrings)

Linting Issues

No linting configuration. Running pylint or flake8 would likely find:

  • Unused imports
  • Unused variables
  • Line too long (>79 characters)
  • Missing docstrings
  • Bare except clauses
  • Inconsistent naming
  • Wildcard imports (if any)

Type Hints

Minimal type hints. Likely no type annotations on most functions.

Example of missing type hints:

# Current (no type hints)
def get_mbid(self):
    ...

# With type hints
def get_mbid(self) -> Optional[str]:
    ...

Benefits of adding type hints:

  • Static type checking with mypy
  • Better IDE autocomplete
  • Self-documenting code
  • Catch type errors before runtime

Testing

Test Coverage

No automated tests. No test directory, no test files.

Testing approach:

  • Manual testing via Jupyter notebooks
  • if __name__ == "__main__" blocks in some modules

Example if __name__ == "__main__" block:

if __name__ == "__main__":
    linker = Align(artist="The Beatles", track="Hey Jude")
    print(linker.get_mbid())
    print(linker.get_isrc())

Not real tests: No assertions, no test framework, no automation.

Testing Recommendations

Unit tests with mocked services:

import pytest
from unittest.mock import patch

from musicmetalinker.linking import Align

def test_get_mbid_with_provided_mbid():
    linker = Align(mbid_track="test-mbid")
    assert linker.get_mbid() == "test-mbid"

@patch('musicmetalinker.linking.musicbrainzngs')
def test_get_mbid_queries_musicbrainz(mock_mb):
    mock_mb.search_recordings.return_value = {
        'recording-list': [{'id': 'found-mbid'}]
    }
    
    linker = Align(artist="Test Artist", track="Test Track")
    mbid = linker.get_mbid()
    
    assert mbid == "found-mbid"
    mock_mb.search_recordings.assert_called_once()

Integration tests:

@pytest.mark.integration
def test_real_musicbrainz_query():
    linker = Align(artist="The Beatles", track="Hey Jude")
    mbid = linker.get_mbid()
    
    assert mbid is not None
    assert len(mbid) == 36  # UUID length

Test coverage goals:

  • Unit tests: 80%+ coverage
  • Integration tests: Critical paths
  • Mock all external API calls in unit tests
  • Real API calls only in integration tests (marked with @pytest.mark.integration)

Error Handling

Current Error Handling

Pattern throughout codebase:

try:
    result = service.query()
    return result
except:
    return None

Issues:

  • Catches all exceptions (including KeyboardInterrupt, SystemExit)
  • No error logging
  • No distinction between error types
  • Silent failures

Error Handling Recommendations

Specific exception handling:

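import requests
import structlog

# Keyword fields such as service=... require a structlog logger;
# the standard logging module raises TypeError on unknown keyword arguments.
logger = structlog.get_logger()

# Note: musicbrainzngs raises its own WebServiceError subclasses
# (for example NetworkError) rather than requests exceptions.
try:
    result = service.query()
    return result
except requests.exceptions.Timeout:
    logger.warning("Service timeout", service="musicbrainz")
    return None
except requests.exceptions.ConnectionError:
    logger.error("Service unavailable", service="musicbrainz")
    return None
except Exception as e:
    logger.error("Unexpected error", service="musicbrainz", error=str(e), exc_info=True)
    return None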

Custom exceptions:

class MusicMetaLinkerError(Exception):
    pass

class ServiceUnavailableError(MusicMetaLinkerError):
    pass

class InvalidInputError(MusicMetaLinkerError):
    pass

class NoMatchFoundError(MusicMetaLinkerError):
    pass
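One way such a hierarchy might be used is to translate low-level failures at the service boundary and let callers decide how to react. A sketch, assuming a hypothetical lookup function and using the built-in TimeoutError to stand in for a real network failure:

```python
class MusicMetaLinkerError(Exception):
    """Base class for all library errors."""


class ServiceUnavailableError(MusicMetaLinkerError):
    pass


class NoMatchFoundError(MusicMetaLinkerError):
    pass


def lookup(query, fetch):
    """Call fetch(query) and map failures onto the hierarchy (hypothetical helper)."""
    try:
        results = fetch(query)
    except TimeoutError as exc:
        # Chain the original exception so the root cause stays in the traceback
        raise ServiceUnavailableError("service timed out") from exc
    if not results:
        raise NoMatchFoundError(f"no match for {query!r}")
    return results[0]


def failing_fetch(query):
    raise TimeoutError


try:
    lookup("Hey Jude", failing_fetch)
except MusicMetaLinkerError as err:
    # A single except clause covers every library-specific failure
    outcome = type(err).__name__
```

Callers that need finer control can catch ServiceUnavailableError and NoMatchFoundError separately, for example to retry only the former.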

Explicit error returns:

from typing import Optional, Union

def get_mbid(self) -> Union[str, None, MusicMetaLinkerError]:
    try:
        ...
    except ServiceUnavailableError as e:
        return e  # Return error instead of None

Performance Considerations

Performance Bottlenecks

Network latency: Sequential API calls. Total latency = sum of all service latencies.

No caching: Repeated queries for same track.

No connection pooling: New connection for each request.

No request batching: One request per track.

Performance Optimization Opportunities

1. Async/await for concurrent queries:

import asyncio
import aiohttp

async def get_all_metadata(self):
    tasks = [
        self.get_mbid_async(),
        self.get_deezer_id_async(),
        self.get_youtube_link_async()
    ]
    results = await asyncio.gather(*tasks)
    return results
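The effect on total latency can be checked with simulated services. This sketch uses asyncio.sleep in place of real API calls; the service names and delays are made up for illustration.

```python
import asyncio
import time

# Illustrative per-service latencies in seconds
DELAYS = {"musicbrainz": 0.10, "deezer": 0.05, "youtube": 0.15}


async def fake_service(name: str, delay: float) -> str:
    await asyncio.sleep(delay)  # stands in for a network round trip
    return name


async def sequential() -> float:
    start = time.perf_counter()
    for name, delay in DELAYS.items():
        await fake_service(name, delay)
    return time.perf_counter() - start


async def concurrent() -> float:
    start = time.perf_counter()
    await asyncio.gather(*(fake_service(n, d) for n, d in DELAYS.items()))
    return time.perf_counter() - start


sequential_time = asyncio.run(sequential())  # roughly the sum of the delays
concurrent_time = asyncio.run(concurrent())  # roughly the longest single delay
```

With sequential calls the waits add up, while asyncio.gather overlaps them, so the slowest service dominates.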

2. Persistent cache:

import redis

cache = redis.Redis()

def get_mbid(self):
    cache_key = f"mbid:{self.artist}:{self.track}"
    cached = cache.get(cache_key)
    if cached:
        return cached.decode()
    
    mbid = self._query_mbid()
    if mbid is not None:  # Redis cannot store None, so misses are not cached
        cache.setex(cache_key, 86400, mbid)  # 24 hour TTL
    return mbid

3. Connection pooling:

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
retry = Retry(total=3, backoff_factor=0.3)
adapter = HTTPAdapter(max_retries=retry, pool_connections=10, pool_maxsize=20)
session.mount('http://', adapter)
session.mount('https://', adapter)

4. Batch processing parallelization:

from multiprocessing import Pool

def process_track(jams_file):
    processor = JAMSProcessor(jams_file)
    metadata = processor.extract_metadata()
    linker = Align(**metadata)
    return linker.get_all_metadata()

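# The guard is required on platforms that start workers by spawning (Windows, macOS)
if __name__ == "__main__":
    with Pool(processes=4) as pool:
        results = pool.map(process_track, jams_files)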

Code Maintainability

Maintainability Issues

Tight coupling: Align class directly instantiates service classes. Hard to mock for testing.

No abstraction: Service classes have different interfaces. No common base class.

Hardcoded configuration: Changing thresholds requires code modification.

No documentation: Minimal docstrings, no API documentation.

Dead code: AcousticBrainz integration non-functional.

Inconsistent patterns: Function for AcousticBrainz, classes for other services.

Maintainability Recommendations

1. Define service interface:

from abc import ABC, abstractmethod
from typing import Optional

class ServiceAligner(ABC):
    @abstractmethod
    def search_by_isrc(self, isrc: str) -> Optional[dict]:
        pass
    
    @abstractmethod
    def search_by_metadata(self, artist: str, track: str, album: str) -> Optional[dict]:
        pass

2. Dependency injection:

from typing import List

class Align:
    def __init__(self, services: List[ServiceAligner], **metadata):
        self.services = services
        self.metadata = metadata
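The testing benefit of injection can be illustrated with a fake service. This is a sketch, not the project's code: FakeAligner and InjectedAlign are hypothetical, and the interface is reduced to a single method to keep the example short.

```python
from abc import ABC, abstractmethod
from typing import List, Optional


class ServiceAligner(ABC):
    @abstractmethod
    def search_by_metadata(self, artist: str, track: str) -> Optional[dict]:
        ...


class FakeAligner(ServiceAligner):
    """Test double that returns a canned result and records its calls (hypothetical)."""

    def __init__(self, result: Optional[dict]):
        self.result = result
        self.calls = []

    def search_by_metadata(self, artist, track):
        self.calls.append((artist, track))
        return self.result


class InjectedAlign:
    """Stand-in for Align that receives its services instead of creating them."""

    def __init__(self, services: List[ServiceAligner], artist: str, track: str):
        self.services = services
        self.artist = artist
        self.track = track

    def first_match(self) -> Optional[dict]:
        # Try each service in order and stop at the first hit
        for service in self.services:
            result = service.search_by_metadata(self.artist, self.track)
            if result is not None:
                return result
        return None


empty = FakeAligner(None)
hit = FakeAligner({"id": "found-mbid"})
linker = InjectedAlign([empty, hit], artist="Test Artist", track="Test Track")
match = linker.first_match()
```

No patching of module globals is needed here, because the test controls exactly which services the linker sees.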

3. Add docstrings:

def get_mbid(self) -> Optional[str]:
    """
    Retrieve MusicBrainz recording ID.
    
    Queries MusicBrainz by MBID (if provided), ISRC, or metadata.
    Returns None if no match found or service unavailable.
    
    Returns:
        MusicBrainz recording ID (UUID format) or None
    """
    ...

4. Remove dead code:

Delete acousticbrainz_link() function and all references.

5. Add configuration class:

from dataclasses import dataclass

@dataclass
class MatchingConfig:
    deezer_duration_threshold: int = 3
    musicbrainz_duration_threshold: int = 5
    similarity_threshold: float = 0.8
    user_agent: str = "MusicMetaLinker/0.0.1"
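Such a class could also absorb the environment-variable approach recommended earlier. A sketch, assuming a hypothetical from_env constructor and an MML_ prefix for variable names, neither of which exists in the project:

```python
import os
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class MatchingConfig:
    deezer_duration_threshold: int = 3
    musicbrainz_duration_threshold: int = 5
    similarity_threshold: float = 0.8
    user_agent: str = "MusicMetaLinker/0.0.1"

    @classmethod
    def from_env(cls, environ=None, prefix="MML_"):
        """Override defaults from PREFIX + FIELD_NAME variables (assumed naming scheme)."""
        environ = os.environ if environ is None else environ
        overrides = {}
        for field in fields(cls):
            raw = environ.get(prefix + field.name.upper())
            if raw is not None:
                # Convert using the type of the default value (int, float or str)
                overrides[field.name] = type(field.default)(raw)
        return cls(**overrides)


# A plain dict stands in for os.environ here
config = MatchingConfig.from_env({"MML_SIMILARITY_THRESHOLD": "0.9"})
```

Accepting the environment as a parameter keeps the loader testable without modifying the real process environment.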

Security Considerations

Security Issues

Plaintext credentials: Spotify credentials in mml_secrets.py (not encrypted).

No input validation: Metadata strings not sanitized.

Broad exception catching: May hide security-relevant errors.

No dependency scanning: Vulnerable dependencies unknown.

Security Recommendations

1. Encrypt credentials:

import os

from cryptography.fernet import Fernet

# ENCRYPTION_KEY must hold a key produced by Fernet.generate_key()
key = os.environ["ENCRYPTION_KEY"]
cipher = Fernet(key)

encrypted_secret = cipher.encrypt(SPOTIFY_CLIENT_SECRET.encode())

2. Input validation:

import re

def validate_mbid(mbid: str) -> bool:
    # fullmatch rejects a trailing newline, which re.match with "$" would accept
    uuid_pattern = r'[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}'
    return bool(re.fullmatch(uuid_pattern, mbid, re.IGNORECASE))

def validate_isrc(isrc: str) -> bool:
    isrc_pattern = r'[A-Z]{2}[A-Z0-9]{3}[0-9]{7}'
    return bool(re.fullmatch(isrc_pattern, isrc))
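ISRCs are often written with hyphens or in lower case, so validation may need a normalization step first. A sketch, assuming a hypothetical normalize_isrc helper; the sample codes in the usage note are made up:

```python
import re
from typing import Optional

# Compact form: 2-letter country code, 3-character registrant, 7 digits
ISRC_PATTERN = re.compile(r"[A-Z]{2}[A-Z0-9]{3}[0-9]{7}")


def normalize_isrc(raw: str) -> Optional[str]:
    """Return the compact 12-character ISRC, or None if the input is invalid."""
    compact = raw.strip().replace("-", "").upper()
    return compact if ISRC_PATTERN.fullmatch(compact) else None
```

For example, normalize_isrc("aa-bbb-24-00001") yields "AABBB2400001", while normalize_isrc("not-an-isrc") yields None.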

3. Dependency scanning:

pip install pip-audit
pip-audit

4. Security headers for API calls:

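import uuid

import requests

headers = {
    'User-Agent': 'MusicMetaLinker/0.0.1',
    'X-Request-ID': str(uuid.uuid4())
}
# An explicit timeout prevents a stalled service from hanging the caller
response = requests.get(url, headers=headers, timeout=10)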

Code Recommendations Summary

Immediate Fixes

  1. Remove all print() statements, replace with logger.debug()
  2. Remove commented-out code
  3. Fix User-Agent: "elka/0.1" → "MusicMetaLinker/0.0.1"
  4. Remove AcousticBrainz integration
  5. Add docstrings to all public methods

Short-Term Improvements

  1. Add type hints throughout codebase
  2. Add unit tests with mocked services
  3. Add linting (pylint, flake8)
  4. Add formatting (black, isort)
  5. Add specific exception handling
  6. Add input validation
  7. Add configuration system

Long-Term Enhancements

  1. Refactor to use service interface abstraction
  2. Add dependency injection
  3. Add async/await for concurrent queries
  4. Add persistent caching
  5. Add connection pooling
  6. Add structured logging
  7. Add monitoring and metrics
  8. Add comprehensive documentation
  9. Add integration tests
  10. Add CI/CD pipeline

Codebase Maturity Assessment

Current state: Research prototype. Pre-release quality.

Maturity level: 2/5

Strengths:

  • Clear separation of concerns (service classes)
  • Simple, understandable structure
  • Functional for research use

Weaknesses:

  • No tests
  • Debug code in production
  • Hardcoded configuration
  • Dead code
  • Minimal documentation (README only, few docstrings)
  • Error handling limited to bare except clauses that silently return None
  • No input validation

Recommendation: Suitable for academic exploration. Requires significant refactoring for production use.