

minim: Codebase Analysis

Repository Structure

minim/
├── .github/
│   └── workflows/
│       └── ci.yml                 # GitHub Actions CI/CD
├── docs/
│   ├── conf.py                    # Sphinx configuration
│   ├── index.rst                  # Documentation index
│   └── ...                        # Additional documentation
├── minim/
│   ├── __init__.py                # Package initialization (65 lines)
│   ├── audio.py                   # Audio file handling (1,860 lines)
│   ├── discogs.py                 # Discogs API client (5,501 lines)
│   ├── itunes.py                  # iTunes API client (575 lines)
│   ├── qobuz.py                   # Qobuz API client (5,579 lines)
│   ├── spotify.py                 # Spotify API client (9,862 lines)
│   ├── tidal.py                   # TIDAL API client (12,338 lines)
│   └── utility.py                 # Shared utilities (136 lines)
├── tests/
│   ├── test_audio.py              # Audio module tests
│   ├── test_discogs.py            # Discogs tests
│   ├── test_itunes.py             # iTunes tests
│   ├── test_qobuz.py              # Qobuz tests
│   ├── test_spotify.py            # Spotify tests
│   └── test_tidal.py              # TIDAL tests
├── .coveragerc                    # Coverage configuration
├── .gitignore                     # Git ignore patterns
├── environment.yml                # Conda environment
├── LICENSE                        # GPL-3.0 license
├── README.md                      # Project README
└── setup.py                       # Package setup

Total Source Lines: 35,916 (excluding tests, docs, config)

Module Distribution:

  • tidal.py: 34.4% (12,338 lines)
  • spotify.py: 27.5% (9,862 lines)
  • qobuz.py: 15.5% (5,579 lines)
  • discogs.py: 15.3% (5,501 lines)
  • audio.py: 5.2% (1,860 lines)
  • itunes.py: 1.6% (575 lines)
  • utility.py: 0.4% (136 lines)
  • __init__.py: 0.2% (65 lines)

Observation: tidal.py is disproportionately large. This suggests either comprehensive API coverage or a need for refactoring into submodules.
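
The percentages above follow directly from the per-file line counts. A minimal sketch of that arithmetic, using the counts from the tree listing; the names LINE_COUNTS and module_share are illustrative and not part of minim:

```python
# Line counts copied from the repository tree above.
LINE_COUNTS = {
    "tidal.py": 12_338,
    "spotify.py": 9_862,
    "qobuz.py": 5_579,
    "discogs.py": 5_501,
    "audio.py": 1_860,
    "itunes.py": 575,
    "utility.py": 136,
    "__init__.py": 65,
}


def module_share(counts: dict[str, int]) -> dict[str, float]:
    """Return each module's share of the total, in percent (one decimal)."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


shares = module_share(LINE_COUNTS)
# Print modules from largest to smallest share.
for name, pct in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{name:<12} {pct:>5.1f}%")
```

Rerunning a script like this after each release would show whether tidal.py keeps growing relative to the rest of the package.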

Code Organization

Package Initialization (__init__.py)

Purpose: Package metadata and version info

Contents:

"""
minim: Comprehensive music metadata library
"""

__version__ = "1.1.0"
__author__ = "Benjamin Ye"
__email__ = "bbye98@gmail.com"
__license__ = "GPL-3.0"
__url__ = "https://github.com/bbye98/minim"

# No automatic imports (users import specific modules)

Design Choice: No automatic imports. Users explicitly import modules:

from minim import spotify  # Explicit import required: "import minim" alone does not load minim.spotify

Utility Module (utility.py)

Purpose: Shared utilities across all modules

Functions:

Config File Handling:

import os
from configparser import ConfigParser

def get_config_path() -> str:
    """Get path to minim config file."""
    return os.path.expanduser("~/minim.cfg")

def load_config() -> ConfigParser:
    """Load config file."""
    config = ConfigParser()
    config.read(get_config_path())
    return config

def save_config(config: ConfigParser) -> None:
    """Save config file."""
    with open(get_config_path(), "w") as f:
        config.write(f)

String Formatting:

def format_duration(seconds: int) -> str:
    """Format duration in seconds to MM:SS or HH:MM:SS."""
    hours, remainder = divmod(seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    
    if hours > 0:
        return f"{hours}:{minutes:02d}:{seconds:02d}"
    else:
        return f"{minutes}:{seconds:02d}"

def sanitize_filename(filename: str) -> str:
    """Remove invalid characters from filename."""
    invalid_chars = '<>:"/\\|?*'
    for char in invalid_chars:
        filename = filename.replace(char, "_")
    return filename

URL Handling:

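from typing import Optional
from urllib.parse import urlencode

def build_url(base: str, path: str, params: Optional[dict] = None) -> str:
    """Build URL with path and percent-encoded query parameters."""
    url = base.rstrip("/") + "/" + path.lstrip("/")

    # Drop unset parameters, then let urlencode escape keys and values
    query = {k: v for k, v in (params or {}).items() if v is not None}
    if query:
        url += "?" + urlencode(query)

    return url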

Minimal Utilities: Only 136 lines. Most logic is self-contained within each module.

Configuration Management

Config File Format

Location: ~/minim.cfg

Parser: Python's ConfigParser (INI format)

Structure:

[section_name]
key = value
key2 = value2

Reading:

from configparser import ConfigParser
import os

config = ConfigParser()
config.read(os.path.expanduser("~/minim.cfg"))

value = config.get("section", "key", fallback=None)
int_value = config.getint("section", "key", fallback=0)
bool_value = config.getboolean("section", "key", fallback=False)

Writing:

if not config.has_section("section"):
    config.add_section("section")

config.set("section", "key", "value")

with open(os.path.expanduser("~/minim.cfg"), "w") as f:
    config.write(f)
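
The read and write fragments above can be combined into one round trip. The sketch below writes to a temporary file rather than ~/minim.cfg; the section name and keys are made up for illustration and are not taken from minim:

```python
import tempfile
from configparser import ConfigParser
from pathlib import Path


def roundtrip(path: Path) -> ConfigParser:
    """Write a small config to `path`, then read it back."""
    config = ConfigParser()
    if not config.has_section("spotify"):
        config.add_section("spotify")
    # ConfigParser stores every value as a string.
    config.set("spotify", "client_id", "example-id")
    config.set("spotify", "port", "8888")
    config.set("spotify", "save", "yes")
    with open(path, "w") as f:
        config.write(f)

    loaded = ConfigParser()
    loaded.read(path)
    return loaded


with tempfile.TemporaryDirectory() as tmp:
    cfg = roundtrip(Path(tmp) / "minim.cfg")

port = cfg.getint("spotify", "port", fallback=0)              # typed accessor
save = cfg.getboolean("spotify", "save", fallback=False)      # "yes" parses as True
missing = cfg.get("spotify", "client_secret", fallback=None)  # absent key
```

The typed accessors convert on read, and fallback covers keys that were never written.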

Environment Variables

Pattern: {SERVICE}_{FIELD} in uppercase

Examples:

  • SPOTIFY_CLIENT_ID
  • TIDAL_ACCESS_TOKEN
  • QOBUZ_EMAIL

Reading:

import os

client_id = os.getenv("SPOTIFY_CLIENT_ID")
client_secret = os.getenv("SPOTIFY_CLIENT_SECRET")

Precedence in Code:

def __init__(self, client_id=None, client_secret=None):
    # 1. Explicit parameter
    self.client_id = client_id
    
    # 2. Environment variable
    if not self.client_id:
        self.client_id = os.getenv("SPOTIFY_CLIENT_ID")
    
    # 3. Config file
    if not self.client_id:
        config = load_config()
        if config.has_section("spotify"):
            self.client_id = config.get("spotify", "client_id", fallback=None)
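
The same three-step precedence can be factored into a single helper. This is a sketch only: resolve_credential and its parameters are hypothetical, and the environment is passed in as a mapping so the behaviour can be checked without touching os.environ:

```python
import os
from configparser import ConfigParser
from typing import Mapping, Optional


def resolve_credential(
    explicit: Optional[str],
    env_var: str,
    config: ConfigParser,
    section: str,
    key: str,
    environ: Mapping[str, str] = os.environ,
) -> Optional[str]:
    """Return the first value found: argument, then environment, then config."""
    if explicit:
        return explicit
    from_env = environ.get(env_var)
    if from_env:
        return from_env
    # fallback=None avoids NoSectionError/NoOptionError for missing entries.
    return config.get(section, key, fallback=None)


config = ConfigParser()
config.read_string("[spotify]\nclient_id = from-config\n")
env = {"SPOTIFY_CLIENT_ID": "from-env"}

a = resolve_credential("from-arg", "SPOTIFY_CLIENT_ID", config, "spotify", "client_id", env)
b = resolve_credential(None, "SPOTIFY_CLIENT_ID", config, "spotify", "client_id", env)
c = resolve_credential(None, "SPOTIFY_CLIENT_ID", config, "spotify", "client_id", {})
```

Here a, b and c come from the argument, the environment and the config file respectively.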

Logging and Error Handling

Logging

No Structured Logging: minim does not use Python's logging module.

Warnings:

import warnings

warnings.warn("Token will expire soon", UserWarning)

Use Cases:

  • Non-critical issues (token expiration warnings)
  • Deprecated features
  • Fallback behavior

No Debug Logging: No verbose output for debugging. Users must add their own logging.

Error Handling

Strategy: Fail-fast with exceptions

Exception Types:

  • RuntimeError: API errors, HTTP failures
  • ValueError: Invalid input, unsupported formats
  • FileNotFoundError: Missing audio files
  • KeyError: Missing required fields in API responses

No Custom Exceptions: All errors use built-in exception types.

Example:

def _request(self, method, url, **kwargs):
    response = requests.request(method, url, **kwargs)
    
    if not response.ok:
        raise RuntimeError(
            f"{method} {url} failed: {response.status_code} {response.text}"
        )
    
    return response.json()

Error Messages:

  • Include HTTP method and URL
  • Include status code and response body
  • No error codes or structured error objects

Caller Responsibility:

try:
    track = api.get_track(12345)
except RuntimeError as e:
    # Parse error message to determine cause
    if "404" in str(e):
        print("Track not found")
    elif "401" in str(e):
        print("Authentication failed")
    else:
        print(f"Unknown error: {e}")
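
Substring checks such as "404" in str(e) can misfire when the URL itself contains those digits. A slightly more robust sketch, assuming the message format shown in the _request example above ("... failed: <status> <body>"); status_from_error is a hypothetical helper, not part of minim:

```python
import re
from typing import Optional

# Matches the status code that follows " failed: " in the error message.
_STATUS_RE = re.compile(r" failed: (\d{3})\b")


def status_from_error(exc: Exception) -> Optional[int]:
    """Extract the HTTP status code from a RuntimeError message, if present."""
    match = _STATUS_RE.search(str(exc))
    return int(match.group(1)) if match else None


# The track ID here happens to be 404, but the real status is 401.
err = RuntimeError(
    "GET https://api.example.com/v1/tracks/404 failed: 401 Unauthorized"
)
status = status_from_error(err)
```

This still relies on the message text, so it breaks if the wording changes; a structured exception would remove that coupling.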

Testing Infrastructure

Test Framework

Tool: pytest

Test Files:

  • tests/test_audio.py: Audio file handling tests
  • tests/test_discogs.py: Discogs API tests
  • tests/test_itunes.py: iTunes API tests
  • tests/test_qobuz.py: Qobuz API tests
  • tests/test_spotify.py: Spotify API tests
  • tests/test_tidal.py: TIDAL API tests

Test Structure:

import os

import pytest
from minim import spotify

class TestSpotifyWebAPI:
    @classmethod
    def setup_class(cls):
        """Set up API client for all tests."""
        cls.api = spotify.WebAPI(
            client_id=os.getenv("SPOTIFY_CLIENT_ID"),
            client_secret=os.getenv("SPOTIFY_CLIENT_SECRET")
        )
        cls.api.set_flow("client_credentials")
        cls.api.set_access_token()
    
    def test_search(self):
        """Test search functionality."""
        results = self.api.search("Radiohead", types=["artist"], limit=1)
        
        assert "artists" in results
        assert len(results["artists"]["items"]) > 0
        assert results["artists"]["items"][0]["name"] == "Radiohead"
    
    def test_get_artist(self):
        """Test get artist by ID."""
        artist = self.api.get_artist("4Z8W4fKeB5YxbusRsdQVPb")
        
        assert artist["name"] == "Radiohead"
        assert artist["type"] == "artist"
    
    def test_invalid_id(self):
        """Test error handling for invalid ID."""
        with pytest.raises(RuntimeError):
            self.api.get_artist("invalid_id")

Class-Based Tests:

  • setup_class(): Run once before all tests in class
  • teardown_class(): Run once after all tests in class
  • Shared API client across tests (reduces authentication overhead)

Real API Calls:

  • Tests make actual HTTP requests to services
  • Requires valid credentials in environment variables
  • May fail if services are down or rate limits exceeded

No Mocking: Tests do not use unittest.mock or responses library. All API calls are real.

Pros:

  • Tests verify actual API behavior
  • Catches API changes immediately

Cons:

  • Slow (network latency)
  • Flaky (depends on service availability)
  • Rate limiting issues
  • Requires credentials
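
For contrast, here is a sketch of what a mocked test could look like. minim's tests do not do this. TinyClient is a hypothetical stand-in with an injectable transport, and fake_http is an illustrative helper built on unittest.mock:

```python
from unittest.mock import Mock


class TinyClient:
    """Hypothetical minim-style client with an injectable transport."""

    def __init__(self, http):
        self.http = http  # any object exposing request(method, url, **kwargs)

    def get_artist(self, artist_id: str) -> dict:
        response = self.http.request("GET", f"/artists/{artist_id}")
        if not response.ok:
            raise RuntimeError(
                f"GET /artists/{artist_id} failed: {response.status_code}"
            )
        return response.json()


def fake_http(status_code: int, payload: dict) -> Mock:
    """Build a Mock transport whose request() returns a canned response."""
    response = Mock(ok=status_code < 400, status_code=status_code)
    response.json.return_value = payload
    return Mock(request=Mock(return_value=response))


client = TinyClient(fake_http(200, {"name": "Radiohead", "type": "artist"}))
artist = client.get_artist("4Z8W4fKeB5YxbusRsdQVPb")
```

Tests written this way need no network or credentials, but they no longer detect changes in the real API, which is the trade-off listed above.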

Coverage Configuration

File: .coveragerc

[run]
source = minim
omit =
    */tests/*
    */__init__.py
    */site-packages/*

[report]
exclude_lines =
    pragma: no cover
    def __repr__
    raise AssertionError
    raise NotImplementedError
    if __name__ == .__main__.:
    if TYPE_CHECKING:

precision = 2
show_missing = True

Coverage Execution:

coverage run -m pytest tests/
coverage report
coverage html

Coverage Metrics: No coverage figures are published in the repository. The 60-80% range is a rough guess based on test file count and module size, not a measured result.

Continuous Integration

Platform: GitHub Actions

Workflow: .github/workflows/ci.yml

Triggers:

  • Push to main or dev branches
  • Pull requests to main

Jobs:

Linting:

- name: Lint with ruff
  run: ruff check .

Testing:

- name: Run tests
  env:
    SPOTIFY_CLIENT_ID: ${{ secrets.SPOTIFY_CLIENT_ID }}
    SPOTIFY_CLIENT_SECRET: ${{ secrets.SPOTIFY_CLIENT_SECRET }}
    TIDAL_CLIENT_ID: ${{ secrets.TIDAL_CLIENT_ID }}
    TIDAL_CLIENT_SECRET: ${{ secrets.TIDAL_CLIENT_SECRET }}
  run: pytest tests/

Environment:

  • OS: Ubuntu 22.04
  • Python: 3.9
  • FFmpeg: Installed via apt

Secrets: API credentials stored in GitHub Secrets, injected as environment variables.

Code Style

Linting

Tool: ruff (modern, fast Python linter)

Replaces: flake8, pylint, isort, pyupgrade

Configuration: pyproject.toml or ruff.toml (the table names below are the pyproject.toml form; ruff.toml drops the tool.ruff prefix)

[tool.ruff]
line-length = 88
target-version = "py39"

[tool.ruff.lint]
select = [
    "E",   # pycodestyle errors
    "W",   # pycodestyle warnings
    "F",   # pyflakes
    "I",   # isort
    "N",   # pep8-naming
    "UP",  # pyupgrade
]
ignore = [
    "E501",  # line too long (line length is not enforced)
]

Execution:

ruff check .
ruff check --fix .  # Auto-fix issues

Formatting

No Formatter: minim does not use black, autopep8, or similar formatters.

Style: Follows PEP 8 with manual formatting.

Line Length: Approximately 88 characters (black default), but not enforced.

Type Hints

Partial Coverage: Type hints used inconsistently.

Examples:

With Type Hints:

def search(self, query: str, types: list[str] = ["track"], limit: int = 20) -> dict:
    """Search Spotify catalog."""
    ...

Without Type Hints:

def _request(self, method, url, **kwargs):
    """Make HTTP request."""
    ...

No Type Checking: Does not use mypy or pyright for static type checking.

Recommendation for v2: Add comprehensive type hints and integrate mypy into CI.

Docstrings

Format: Google-style docstrings

Example:

def get_track(self, track_id: str, market: str = None) -> dict:
    """
    Get track details.
    
    Args:
        track_id: Spotify track ID
        market: ISO 3166-1 alpha-2 country code
    
    Returns:
        Track object with metadata
    
    Raises:
        RuntimeError: If API request fails
    
    Example:
        >>> api = WebAPI(client_id="...", client_secret="...")
        >>> track = api.get_track("3n3Ppam7vgaVa1iaRUc9Lp")
        >>> print(track["name"])
        Creep
    """
    params = {}
    if market:
        params["market"] = market
    
    return self._request("GET", f"/tracks/{track_id}", params=params)

Coverage: Most public methods have docstrings. Private methods (_request, _get_headers) often lack documentation.

Sphinx Integration: Docstrings parsed by Sphinx for ReadTheDocs documentation.

Code Patterns

API Client Pattern

Common Structure:

class API:
    def __init__(self, client_id=None, client_secret=None, access_token=None):
        # Load credentials from parameters, env vars, or config file
        self.client_id = client_id or os.getenv("SERVICE_CLIENT_ID")
        self.client_secret = client_secret or os.getenv("SERVICE_CLIENT_SECRET")
        self.access_token = access_token
        
        # Load from config file if not provided
        config = load_config()
        if config.has_section("service"):
            self.access_token = self.access_token or config.get("service", "access_token", fallback=None)
        
        # API base URL
        self.base_url = "https://api.service.com/v1"
    
    def set_flow(self, flow_type="authorization_code", **kwargs):
        """Configure OAuth flow."""
        self.flow_type = flow_type
        # Store flow-specific parameters
    
    def set_access_token(self, method="http.server"):
        """Obtain access token via OAuth flow."""
        # Implement OAuth flow
        # Save token to config file
    
    def _get_headers(self) -> dict:
        """Get HTTP headers with authentication."""
        return {"Authorization": f"Bearer {self.access_token}"}
    
    def _request(self, method: str, url: str, **kwargs) -> dict:
        """Make authenticated HTTP request."""
        if not url.startswith("http"):
            url = self.base_url + url
        
        headers = kwargs.pop("headers", {})
        headers.update(self._get_headers())
        
        response = requests.request(method, url, headers=headers, **kwargs)
        
        if not response.ok:
            raise RuntimeError(f"{method} {url} failed: {response.status_code}")
        
        return response.json()
    
    # Public API methods
    def search(self, query: str, **kwargs) -> dict:
        """Search catalog."""
        return self._request("GET", "/search", params={"q": query, **kwargs})
    
    def get_track(self, track_id: str) -> dict:
        """Get track details."""
        return self._request("GET", f"/tracks/{track_id}")

Consistency: All API clients (discogs.py, spotify.py, tidal.py, qobuz.py) follow this pattern with minor variations.

Audio File Pattern

Base Class with Subclasses:

import mutagen
import mutagen.flac
import mutagen.mp3

class Audio:
    def __init__(self, filepath: str):
        self.filepath = filepath
        self._file = mutagen.File(filepath)
        
        # Auto-detect format and change class
        if isinstance(self._file, mutagen.flac.FLAC):
            self.__class__ = FLAC
        elif isinstance(self._file, mutagen.mp3.MP3):
            self.__class__ = MP3
        # ... etc
        
        self.read_metadata()
    
    def read_metadata(self):
        """Read metadata from file. Implemented by subclasses."""
        raise NotImplementedError
    
    def write_metadata(self):
        """Write metadata to file. Implemented by subclasses."""
        raise NotImplementedError

class FLAC(Audio):
    def read_metadata(self):
        self.title = self._file.get("TITLE", [None])[0]
        self.artist = self._file.get("ARTIST", [None])[0]
        # ... etc
    
    def write_metadata(self):
        self._file["TITLE"] = self.title
        self._file["ARTIST"] = self.artist
        # ... etc
        self._file.save()

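Dynamic Class Change: self.__class__ = FLAC rebinds the instance's class inside __init__, so the self.read_metadata() call that follows dispatches to the subclass. The pattern is unusual, but it lets Audio(filepath) return a format-specific object.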
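
A toy version of the class rebinding makes the dispatch visible. In this sketch, file-extension lookup stands in for mutagen's type detection, and the codec method and _FORMATS table are illustrative only:

```python
from pathlib import Path


class Audio:
    """Toy base class: picks a subclass from the file extension."""

    def __init__(self, filepath: str):
        self.filepath = filepath
        # Assumption: extension lookup stands in for mutagen's detection.
        subclass = _FORMATS.get(Path(filepath).suffix.lower())
        if subclass is None:
            raise ValueError(f"Unsupported format: {filepath}")
        # Rebinding __class__ makes later method lookups use the subclass.
        self.__class__ = subclass

    def codec(self) -> str:
        raise NotImplementedError


class FLAC(Audio):
    def codec(self) -> str:
        return "flac"


class MP3(Audio):
    def codec(self) -> str:
        return "mp3"


# Defined after the subclasses; only read when Audio() is called.
_FORMATS = {".flac": FLAC, ".mp3": MP3}

track = Audio("song.flac")
```

After construction, type(track) is FLAC even though the caller wrote Audio(...), which is the behaviour the pattern relies on.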

OAuth Callback Pattern

Three Implementations:

1. http.server:

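from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlparse

def _listen_http_server(self):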
    class CallbackHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            query = parse_qs(urlparse(self.path).query)
            self.server.authorization_code = query.get("code", [None])[0]
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"Authorization successful. You may close this window.")
    
    server = HTTPServer(("localhost", 8888), CallbackHandler)
    server.handle_request()
    return server.authorization_code

2. Flask:

def _listen_flask(self):
    app = Flask(__name__)
    authorization_code = None
    
    @app.route("/callback")
    def callback():
        nonlocal authorization_code
        authorization_code = request.args.get("code")
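        # werkzeug.server.shutdown exists only in Werkzeug < 2.1; on newer
        # versions this lookup returns None and app.run() keeps blocking
        shutdown = request.environ.get("werkzeug.server.shutdown")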
        if shutdown:
            shutdown()
        return "Authorization successful. You may close this window."
    
    app.run(port=8888)
    return authorization_code

3. Playwright:

def _automate_browser(self):
    from playwright.sync_api import sync_playwright
    
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        
        page.goto(self.auth_url)
        page.fill("#username", self.email)
        page.fill("#password", self.password)
        page.click("button[type=submit]")
        
        page.wait_for_url(f"{self.redirect_uri}*")
        code = parse_qs(urlparse(page.url).query)["code"][0]
        
        browser.close()
        return code

Flexibility: Users choose callback method based on environment (headless server, desktop, etc.).
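
All three variants end by pulling the code parameter out of the redirect URL. A minimal sketch of that shared step, with basic handling of the error parameter that OAuth 2.0 providers return on failure; extract_code is a hypothetical helper, not minim's implementation:

```python
from urllib.parse import parse_qs, urlparse


def extract_code(redirect_url: str) -> str:
    """Return the authorization code from an OAuth redirect URL."""
    query = parse_qs(urlparse(redirect_url).query)
    # Providers report a denied or failed authorization via "error".
    if "error" in query:
        raise RuntimeError(f"Authorization failed: {query['error'][0]}")
    codes = query.get("code")
    if not codes:
        raise RuntimeError("Callback did not include an authorization code")
    return codes[0]


code = extract_code("http://localhost:8888/callback?code=abc123&state=xyz")
```

Raising on a missing code avoids passing None on to the token exchange, which the http.server and Flask variants above would otherwise do.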

Code Quality Issues

Large Monolithic Files

Problem: tidal.py is 12,338 lines (34% of codebase).

Impact:

  • Difficult to navigate
  • Slow to load in editors
  • Hard to maintain
  • Merge conflicts more likely

Recommendation: Split into submodules:

minim/tidal/
├── __init__.py
├── auth.py          # Authentication
├── catalog.py       # Catalog endpoints
├── streaming.py     # Streaming URLs
├── lyrics.py        # Lyrics endpoints
├── user.py          # User library
└── models.py        # Data models

Generic Error Handling

Problem: All errors are RuntimeError with string messages.

Impact:

  • Caller must parse error messages to determine cause
  • No structured error handling
  • Difficult to distinguish error types

Recommendation: Define custom exceptions:

class MinimError(Exception):
    """Base exception for minim."""

class APIError(MinimError):
    """API request failed."""
    def __init__(self, status_code: int, message: str):
        self.status_code = status_code
        self.message = message
        super().__init__(f"API error {status_code}: {message}")

class AuthenticationError(MinimError):
    """Authentication failed."""

class RateLimitError(APIError):
    """Rate limit exceeded."""
    def __init__(self, retry_after: int):
        self.retry_after = retry_after
        super().__init__(429, f"Rate limit exceeded. Retry after {retry_after}s")
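
To show how such a hierarchy might be wired into request handling, the sketch below repeats the proposed classes in compact form and adds a hypothetical raise_for_status helper. The status-to-exception mapping and the Retry-After handling are assumptions, not minim behaviour:

```python
from typing import Optional


class MinimError(Exception):
    """Base exception (mirrors the hierarchy proposed above)."""


class APIError(MinimError):
    def __init__(self, status_code: int, message: str):
        self.status_code = status_code
        self.message = message
        super().__init__(f"API error {status_code}: {message}")


class AuthenticationError(MinimError):
    pass


class RateLimitError(APIError):
    def __init__(self, retry_after: int):
        self.retry_after = retry_after
        super().__init__(429, f"Rate limit exceeded. Retry after {retry_after}s")


def raise_for_status(
    status_code: int, body: str, retry_after: Optional[str] = None
) -> None:
    """Translate an HTTP status into the matching exception; no-op below 400."""
    if status_code < 400:
        return
    if status_code == 401:
        raise AuthenticationError(body)
    if status_code == 429:
        # Assumption: Retry-After holds whole seconds; otherwise default to 1.
        seconds = int(retry_after) if retry_after and retry_after.isdigit() else 1
        raise RateLimitError(seconds)
    raise APIError(status_code, body)
```

Callers can then write except RateLimitError as e and read e.retry_after instead of parsing message strings.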

No Rate Limiting

Problem: No built-in rate limiting. Caller responsible for tracking.

Impact:

  • Easy to exceed service rate limits
  • No automatic backoff
  • Tests may fail due to rate limiting

Recommendation: Implement rate limiter:

from time import time, sleep

class RateLimiter:
    def __init__(self, requests_per_minute: int):
        self.requests_per_minute = requests_per_minute
        self.requests = []
    
    def wait_if_needed(self):
        now = time()
        # Remove requests older than 1 minute
        self.requests = [t for t in self.requests if now - t < 60]
        
        if len(self.requests) >= self.requests_per_minute:
            sleep_time = 60 - (now - self.requests[0])
            if sleep_time > 0:
                sleep(sleep_time)
        
        self.requests.append(time())

# Usage in API client
class API:
    def __init__(self):
        self.rate_limiter = RateLimiter(60)  # 60 requests per minute
    
    def _request(self, method, url, **kwargs):
        self.rate_limiter.wait_if_needed()
        # Make request

Plain Text Token Storage

Problem: Tokens stored unencrypted in ~/minim.cfg.

Impact:

  • Security risk on shared systems
  • Tokens readable by any process
  • Passwords stored in plain text (Qobuz)

Recommendation: Use OS keychain:

import keyring

# Store token
keyring.set_password("minim", "spotify_access_token", access_token)

# Retrieve token
access_token = keyring.get_password("minim", "spotify_access_token")

Inconsistent Type Hints

Problem: Some functions have type hints, others don't.

Impact:

  • Reduced IDE autocomplete support
  • No static type checking
  • Harder to understand function signatures

Recommendation: Add comprehensive type hints and enable mypy:

from typing import Any, Dict, List, Optional

def search(
    self,
    query: str,
    types: Optional[List[str]] = None,  # None instead of a shared mutable default
    limit: int = 20,
    offset: int = 0
) -> Dict[str, Any]:
    """Search catalog."""
    if types is None:
        types = ["track"]
    ...

Code Metrics

Complexity

Cyclomatic Complexity: Not measured. Likely moderate to high in large modules (tidal.py, spotify.py).

Recommendation: Use radon to measure complexity:

pip install radon
radon cc minim/ -a  # Average complexity
radon cc minim/ -n D  # Show only blocks ranked D or worse (high complexity)

Duplication

Code Duplication: Likely present across API clients (authentication, request handling).

Recommendation: Extract common patterns to base class:

class BaseAPI:
    def __init__(self, service_name: str):
        self.service_name = service_name
        self.load_credentials()
    
    def load_credentials(self):
        # Common credential loading logic
        ...
    
    def _request(self, method, url, **kwargs):
        # Common request handling
        ...

class SpotifyAPI(BaseAPI):
    def __init__(self):
        super().__init__("spotify")
        self.base_url = "https://api.spotify.com/v1"

Dependencies

Direct Dependencies: 3 (cryptography, mutagen, requests)

Optional Dependencies: 6 (ffmpeg, flask, levenshtein, numpy, pillow, playwright)

Dependency Graph: Flat. The service modules do not appear to depend on one another; utility.py is the only code shared between them.

Recommendation: Keep dependencies minimal. Current approach is good.

Summary

minim's codebase is well-structured for a personal project but shows signs of organic growth:

Strengths:

  • Consistent API client pattern across modules
  • A test module for audio handling and for each API client, run against the real services
  • Good documentation (docstrings, ReadTheDocs)
  • Minimal dependencies
  • CI/CD with GitHub Actions

Weaknesses:

  • Large monolithic files (tidal.py at 12K lines)
  • Generic error handling (all RuntimeError)
  • No rate limiting
  • Plain text token storage
  • Inconsistent type hints
  • No static type checking

Recommendations for v2:

  • Split large modules into subpackages
  • Define custom exception hierarchy
  • Implement rate limiting and backoff
  • Use OS keychain for token storage
  • Add comprehensive type hints
  • Integrate mypy for static type checking
  • Extract common patterns to base classes
  • Add code complexity and duplication metrics to CI

The codebase is production-ready for personal use but requires hardening for commercial or large-scale deployment. The v2 rewrite on the dev branch addresses many of these issues.