Files
metadata-agregator/docs/research/minim/analysis/CODEBASE.md
T
Alexander a1f6701bac feat: initial implementation of metadata aggregator
- gRPC service with MusicBrainz provider
- PostgreSQL schema with migrations
- Service layer with database-first caching
- Repository pattern for data access
- YAML configuration support
- Research documentation for 17 music metadata projects
2026-04-28 16:28:53 +02:00

905 lines
24 KiB
Markdown

# minim: Codebase Analysis
## Repository Structure
```
minim/
├── .github/
│ └── workflows/
│ └── ci.yml # GitHub Actions CI/CD
├── docs/
│ ├── conf.py # Sphinx configuration
│ ├── index.rst # Documentation index
│ └── ... # Additional documentation
├── minim/
│ ├── __init__.py # Package initialization (65 lines)
│ ├── audio.py # Audio file handling (1,860 lines)
│ ├── discogs.py # Discogs API client (5,501 lines)
│ ├── itunes.py # iTunes API client (575 lines)
│ ├── qobuz.py # Qobuz API client (5,579 lines)
│ ├── spotify.py # Spotify API client (9,862 lines)
│ ├── tidal.py # TIDAL API client (12,338 lines)
│ └── utility.py # Shared utilities (136 lines)
├── tests/
│ ├── test_audio.py # Audio module tests
│ ├── test_discogs.py # Discogs tests
│ ├── test_itunes.py # iTunes tests
│ ├── test_qobuz.py # Qobuz tests
│ ├── test_spotify.py # Spotify tests
│ └── test_tidal.py # TIDAL tests
├── .coveragerc # Coverage configuration
├── .gitignore # Git ignore patterns
├── environment.yml # Conda environment
├── LICENSE # GPL-3.0 license
├── README.md # Project README
└── setup.py # Package setup
```
**Total Source Lines:** 35,916 (excluding tests, docs, config)
**Module Distribution:**
- `tidal.py`: 34.4% (12,338 lines)
- `spotify.py`: 27.5% (9,862 lines)
- `discogs.py`: 15.3% (5,501 lines)
- `qobuz.py`: 15.5% (5,579 lines)
- `audio.py`: 5.2% (1,860 lines)
- `itunes.py`: 1.6% (575 lines)
- `utility.py`: 0.4% (136 lines)
- `__init__.py`: 0.2% (65 lines)
**Observation:** `tidal.py` is disproportionately large. This suggests either comprehensive API coverage or a need for refactoring into submodules.
## Code Organization
### Package Initialization (`__init__.py`)
**Purpose:** Package metadata and version info
**Contents:**
```python
"""
minim: Comprehensive music metadata library
"""
__version__ = "1.1.0"
__author__ = "Benjamin Ye"
__email__ = "bbye98@gmail.com"
__license__ = "GPL-3.0"
__url__ = "https://github.com/bbye98/minim"
# No automatic imports (users import specific modules)
```
**Design Choice:** No automatic imports. Users explicitly import modules:
```python
from minim import spotify # Not: from minim.spotify import WebAPI
```
### Utility Module (`utility.py`)
**Purpose:** Shared utilities across all modules
**Functions:**
**Config File Handling:**
```python
def get_config_path() -> str:
"""Get path to minim config file."""
return os.path.expanduser("~/minim.cfg")
def load_config() -> ConfigParser:
"""Load config file."""
config = ConfigParser()
config.read(get_config_path())
return config
def save_config(config: ConfigParser) -> None:
"""Save config file."""
with open(get_config_path(), "w") as f:
config.write(f)
```
**String Formatting:**
```python
def format_duration(seconds: int) -> str:
"""Format duration in seconds to MM:SS or HH:MM:SS."""
hours, remainder = divmod(seconds, 3600)
minutes, seconds = divmod(remainder, 60)
if hours > 0:
return f"{hours}:{minutes:02d}:{seconds:02d}"
else:
return f"{minutes}:{seconds:02d}"
def sanitize_filename(filename: str) -> str:
"""Remove invalid characters from filename."""
invalid_chars = '<>:"/\\|?*'
for char in invalid_chars:
filename = filename.replace(char, "_")
return filename
```
**URL Handling:**
```python
def build_url(base: str, path: str, params: dict = None) -> str:
"""Build URL with path and query parameters."""
url = base.rstrip("/") + "/" + path.lstrip("/")
if params:
query = "&".join(f"{k}={v}" for k, v in params.items() if v is not None)
url += "?" + query
return url
```
**Minimal Utilities:** Only 136 lines. Most logic is self-contained within each module.
## Configuration Management
### Config File Format
**Location:** `~/minim.cfg`
**Parser:** Python's `ConfigParser` (INI format)
**Structure:**
```ini
[section_name]
key = value
key2 = value2
```
**Reading:**
```python
from configparser import ConfigParser
import os
config = ConfigParser()
config.read(os.path.expanduser("~/minim.cfg"))
value = config.get("section", "key", fallback=None)
int_value = config.getint("section", "key", fallback=0)
bool_value = config.getboolean("section", "key", fallback=False)
```
**Writing:**
```python
if not config.has_section("section"):
config.add_section("section")
config.set("section", "key", "value")
with open(os.path.expanduser("~/minim.cfg"), "w") as f:
config.write(f)
```
### Environment Variables
**Pattern:** `{SERVICE}_{FIELD}` in uppercase
**Examples:**
- `SPOTIFY_CLIENT_ID`
- `TIDAL_ACCESS_TOKEN`
- `QOBUZ_EMAIL`
**Reading:**
```python
import os
client_id = os.getenv("SPOTIFY_CLIENT_ID")
client_secret = os.getenv("SPOTIFY_CLIENT_SECRET")
```
**Precedence in Code:**
```python
def __init__(self, client_id=None, client_secret=None):
# 1. Explicit parameter
self.client_id = client_id
# 2. Environment variable
if not self.client_id:
self.client_id = os.getenv("SPOTIFY_CLIENT_ID")
# 3. Config file
if not self.client_id:
config = load_config()
if config.has_section("spotify"):
self.client_id = config.get("spotify", "client_id", fallback=None)
```
## Logging and Error Handling
### Logging
**No Structured Logging:** minim does not use Python's `logging` module.
**Warnings:**
```python
import warnings
warnings.warn("Token will expire soon", UserWarning)
```
**Use Cases:**
- Non-critical issues (token expiration warnings)
- Deprecated features
- Fallback behavior
**No Debug Logging:** No verbose output for debugging. Users must add their own logging.
### Error Handling
**Strategy:** Fail-fast with exceptions
**Exception Types:**
- `RuntimeError`: API errors, HTTP failures
- `ValueError`: Invalid input, unsupported formats
- `FileNotFoundError`: Missing audio files
- `KeyError`: Missing required fields in API responses
**No Custom Exceptions:** All errors use built-in exception types.
**Example:**
```python
def _request(self, method, url, **kwargs):
response = requests.request(method, url, **kwargs)
if not response.ok:
raise RuntimeError(
f"{method} {url} failed: {response.status_code} {response.text}"
)
return response.json()
```
**Error Messages:**
- Include HTTP method and URL
- Include status code and response body
- No error codes or structured error objects
**Caller Responsibility:**
```python
try:
track = api.get_track(12345)
except RuntimeError as e:
# Parse error message to determine cause
if "404" in str(e):
print("Track not found")
elif "401" in str(e):
print("Authentication failed")
else:
print(f"Unknown error: {e}")
```
## Testing Infrastructure
### Test Framework
**Tool:** pytest
**Test Files:**
- `tests/test_audio.py`: Audio file handling tests
- `tests/test_discogs.py`: Discogs API tests
- `tests/test_itunes.py`: iTunes API tests
- `tests/test_qobuz.py`: Qobuz API tests
- `tests/test_spotify.py`: Spotify API tests
- `tests/test_tidal.py`: TIDAL API tests
**Test Structure:**
```python
import pytest
from minim import spotify
class TestSpotifyWebAPI:
@classmethod
def setup_class(cls):
"""Set up API client for all tests."""
cls.api = spotify.WebAPI(
client_id=os.getenv("SPOTIFY_CLIENT_ID"),
client_secret=os.getenv("SPOTIFY_CLIENT_SECRET")
)
cls.api.set_flow("client_credentials")
cls.api.set_access_token()
def test_search(self):
"""Test search functionality."""
results = self.api.search("Radiohead", types=["artist"], limit=1)
assert "artists" in results
assert len(results["artists"]["items"]) > 0
assert results["artists"]["items"][0]["name"] == "Radiohead"
def test_get_artist(self):
"""Test get artist by ID."""
artist = self.api.get_artist("4Z8W4fKeB5YxbusRsdQVPb")
assert artist["name"] == "Radiohead"
assert artist["type"] == "artist"
def test_invalid_id(self):
"""Test error handling for invalid ID."""
with pytest.raises(RuntimeError):
self.api.get_artist("invalid_id")
```
**Class-Based Tests:**
- `setup_class()`: Run once before all tests in class
- `teardown_class()`: Run once after all tests in class
- Shared API client across tests (reduces authentication overhead)
**Real API Calls:**
- Tests make actual HTTP requests to services
- Requires valid credentials in environment variables
- May fail if services are down or rate limits exceeded
**No Mocking:** Tests do not use `unittest.mock` or `responses` library. All API calls are real.
**Pros:**
- Tests verify actual API behavior
- Catches API changes immediately
**Cons:**
- Slow (network latency)
- Flaky (depends on service availability)
- Rate limiting issues
- Requires credentials
### Coverage Configuration
**File:** `.coveragerc`
```ini
[run]
source = minim
omit =
*/tests/*
*/__init__.py
*/site-packages/*
[report]
exclude_lines =
pragma: no cover
def __repr__
raise AssertionError
raise NotImplementedError
if __name__ == .__main__.:
if TYPE_CHECKING:
precision = 2
show_missing = True
```
**Coverage Execution:**
```bash
coverage run -m pytest tests/
coverage report
coverage html
```
**Coverage Metrics:** Not documented in repository. Estimated 60-80% based on test file count and module complexity.
### Continuous Integration
**Platform:** GitHub Actions
**Workflow:** `.github/workflows/ci.yml`
**Triggers:**
- Push to `main` or `dev` branches
- Pull requests to `main`
**Jobs:**
**Linting:**
```yaml
- name: Lint with ruff
run: ruff check .
```
**Testing:**
```yaml
- name: Run tests
env:
SPOTIFY_CLIENT_ID: ${{ secrets.SPOTIFY_CLIENT_ID }}
SPOTIFY_CLIENT_SECRET: ${{ secrets.SPOTIFY_CLIENT_SECRET }}
TIDAL_CLIENT_ID: ${{ secrets.TIDAL_CLIENT_ID }}
TIDAL_CLIENT_SECRET: ${{ secrets.TIDAL_CLIENT_SECRET }}
run: pytest tests/
```
**Environment:**
- OS: Ubuntu 22.04
- Python: 3.9
- FFmpeg: Installed via apt
**Secrets:** API credentials stored in GitHub Secrets, injected as environment variables.
## Code Style
### Linting
**Tool:** ruff (modern, fast Python linter)
**Replaces:** flake8, pylint, isort, pyupgrade
**Configuration:** `pyproject.toml` or `ruff.toml`
```toml
[tool.ruff]
line-length = 88
target-version = "py39"
[tool.ruff.lint]
select = [
"E", # pycodestyle errors
"W", # pycodestyle warnings
"F", # pyflakes
"I", # isort
"N", # pep8-naming
"UP", # pyupgrade
]
ignore = [
"E501", # line too long (handled by formatter)
]
```
**Execution:**
```bash
ruff check .
ruff check --fix . # Auto-fix issues
```
### Formatting
**No Formatter:** minim does not use `black`, `autopep8`, or similar formatters.
**Style:** Follows PEP 8 with manual formatting.
**Line Length:** Approximately 88 characters (black default), but not enforced.
### Type Hints
**Partial Coverage:** Type hints used inconsistently.
**Examples:**
**With Type Hints:**
```python
def search(self, query: str, types: list[str] = ["track"], limit: int = 20) -> dict:
"""Search Spotify catalog."""
...
```
**Without Type Hints:**
```python
def _request(self, method, url, **kwargs):
"""Make HTTP request."""
...
```
**No Type Checking:** Does not use `mypy` or `pyright` for static type checking.
**Recommendation for v2:** Add comprehensive type hints and integrate `mypy` into CI.
### Docstrings
**Format:** Google-style docstrings
**Example:**
```python
def get_track(self, track_id: str, market: str = None) -> dict:
"""
Get track details.
Args:
track_id: Spotify track ID
market: ISO 3166-1 alpha-2 country code
Returns:
Track object with metadata
Raises:
RuntimeError: If API request fails
Example:
>>> api = WebAPI(client_id="...", client_secret="...")
>>> track = api.get_track("3n3Ppam7vgaVa1iaRUc9Lp")
>>> print(track["name"])
Creep
"""
params = {}
if market:
params["market"] = market
return self._request("GET", f"/tracks/{track_id}", params=params)
```
**Coverage:** Most public methods have docstrings. Private methods (`_request`, `_get_headers`) often lack documentation.
**Sphinx Integration:** Docstrings parsed by Sphinx for ReadTheDocs documentation.
## Code Patterns
### API Client Pattern
**Common Structure:**
```python
class API:
def __init__(self, client_id=None, client_secret=None, access_token=None):
# Load credentials from parameters, env vars, or config file
self.client_id = client_id or os.getenv("SERVICE_CLIENT_ID")
self.client_secret = client_secret or os.getenv("SERVICE_CLIENT_SECRET")
self.access_token = access_token
# Load from config file if not provided
config = load_config()
if config.has_section("service"):
self.access_token = self.access_token or config.get("service", "access_token")
# API base URL
self.base_url = "https://api.service.com/v1"
def set_flow(self, flow_type="authorization_code", **kwargs):
"""Configure OAuth flow."""
self.flow_type = flow_type
# Store flow-specific parameters
def set_access_token(self, method="http.server"):
"""Obtain access token via OAuth flow."""
# Implement OAuth flow
# Save token to config file
def _get_headers(self) -> dict:
"""Get HTTP headers with authentication."""
return {"Authorization": f"Bearer {self.access_token}"}
def _request(self, method: str, url: str, **kwargs) -> dict:
"""Make authenticated HTTP request."""
if not url.startswith("http"):
url = self.base_url + url
headers = kwargs.pop("headers", {})
headers.update(self._get_headers())
response = requests.request(method, url, headers=headers, **kwargs)
if not response.ok:
raise RuntimeError(f"{method} {url} failed: {response.status_code}")
return response.json()
# Public API methods
def search(self, query: str, **kwargs) -> dict:
"""Search catalog."""
return self._request("GET", "/search", params={"q": query, **kwargs})
def get_track(self, track_id: str) -> dict:
"""Get track details."""
return self._request("GET", f"/tracks/{track_id}")
```
**Consistency:** All API clients (`discogs.py`, `spotify.py`, `tidal.py`, `qobuz.py`) follow this pattern with minor variations.
### Audio File Pattern
**Base Class with Subclasses:**
```python
class Audio:
def __init__(self, filepath: str):
self.filepath = filepath
self._file = mutagen.File(filepath)
# Auto-detect format and change class
if isinstance(self._file, mutagen.flac.FLAC):
self.__class__ = FLAC
elif isinstance(self._file, mutagen.mp3.MP3):
self.__class__ = MP3
# ... etc
self.read_metadata()
def read_metadata(self):
"""Read metadata from file. Implemented by subclasses."""
raise NotImplementedError
def write_metadata(self):
"""Write metadata to file. Implemented by subclasses."""
raise NotImplementedError
class FLAC(Audio):
def read_metadata(self):
self.title = self._file.get("TITLE", [None])[0]
self.artist = self._file.get("ARTIST", [None])[0]
# ... etc
def write_metadata(self):
self._file["TITLE"] = self.title
self._file["ARTIST"] = self.artist
# ... etc
self._file.save()
```
**Dynamic Class Change:** `self.__class__ = FLAC` changes instance class after initialization. Unusual pattern but works for format auto-detection.
### OAuth Callback Pattern
**Three Implementations:**
**1. http.server:**
```python
def _listen_http_server(self):
class CallbackHandler(BaseHTTPRequestHandler):
def do_GET(self):
query = parse_qs(urlparse(self.path).query)
self.server.authorization_code = query.get("code", [None])[0]
self.send_response(200)
self.end_headers()
self.wfile.write(b"Authorization successful. You may close this window.")
server = HTTPServer(("localhost", 8888), CallbackHandler)
server.handle_request()
return server.authorization_code
```
**2. Flask:**
```python
def _listen_flask(self):
app = Flask(__name__)
authorization_code = None
@app.route("/callback")
def callback():
nonlocal authorization_code
authorization_code = request.args.get("code")
shutdown = request.environ.get("werkzeug.server.shutdown")
if shutdown:
shutdown()
return "Authorization successful. You may close this window."
app.run(port=8888)
return authorization_code
```
**3. Playwright:**
```python
def _automate_browser(self):
from playwright.sync_api import sync_playwright
with sync_playwright() as p:
browser = p.chromium.launch(headless=True)
page = browser.new_page()
page.goto(self.auth_url)
page.fill("#username", self.email)
page.fill("#password", self.password)
page.click("button[type=submit]")
page.wait_for_url(f"{self.redirect_uri}*")
code = parse_qs(urlparse(page.url).query)["code"][0]
browser.close()
return code
```
**Flexibility:** Users choose callback method based on environment (headless server, desktop, etc.).
## Code Quality Issues
### Large Monolithic Files
**Problem:** `tidal.py` is 12,338 lines (34% of codebase).
**Impact:**
- Difficult to navigate
- Slow to load in editors
- Hard to maintain
- Merge conflicts more likely
**Recommendation:** Split into submodules:
```
minim/tidal/
├── __init__.py
├── auth.py # Authentication
├── catalog.py # Catalog endpoints
├── streaming.py # Streaming URLs
├── lyrics.py # Lyrics endpoints
├── user.py # User library
└── models.py # Data models
```
### Generic Error Handling
**Problem:** All errors are `RuntimeError` with string messages.
**Impact:**
- Caller must parse error messages to determine cause
- No structured error handling
- Difficult to distinguish error types
**Recommendation:** Define custom exceptions:
```python
class MinimError(Exception):
"""Base exception for minim."""
class APIError(MinimError):
"""API request failed."""
def __init__(self, status_code: int, message: str):
self.status_code = status_code
self.message = message
super().__init__(f"API error {status_code}: {message}")
class AuthenticationError(MinimError):
"""Authentication failed."""
class RateLimitError(APIError):
"""Rate limit exceeded."""
def __init__(self, retry_after: int):
self.retry_after = retry_after
super().__init__(429, f"Rate limit exceeded. Retry after {retry_after}s")
```
### No Rate Limiting
**Problem:** No built-in rate limiting. Caller responsible for tracking.
**Impact:**
- Easy to exceed service rate limits
- No automatic backoff
- Tests may fail due to rate limiting
**Recommendation:** Implement rate limiter:
```python
from time import time, sleep
class RateLimiter:
def __init__(self, requests_per_minute: int):
self.requests_per_minute = requests_per_minute
self.requests = []
def wait_if_needed(self):
now = time()
# Remove requests older than 1 minute
self.requests = [t for t in self.requests if now - t < 60]
if len(self.requests) >= self.requests_per_minute:
sleep_time = 60 - (now - self.requests[0])
if sleep_time > 0:
sleep(sleep_time)
self.requests.append(time())
# Usage in API client
class API:
def __init__(self):
self.rate_limiter = RateLimiter(60) # 60 requests per minute
def _request(self, method, url, **kwargs):
self.rate_limiter.wait_if_needed()
# Make request
```
### Plain Text Token Storage
**Problem:** Tokens stored unencrypted in `~/minim.cfg`.
**Impact:**
- Security risk on shared systems
- Tokens readable by any process
- Passwords stored in plain text (Qobuz)
**Recommendation:** Use OS keychain:
```python
import keyring
# Store token
keyring.set_password("minim", "spotify_access_token", access_token)
# Retrieve token
access_token = keyring.get_password("minim", "spotify_access_token")
```
### Inconsistent Type Hints
**Problem:** Some functions have type hints, others don't.
**Impact:**
- Reduced IDE autocomplete support
- No static type checking
- Harder to understand function signatures
**Recommendation:** Add comprehensive type hints and enable `mypy`:
```python
from typing import Optional, Dict, List, Any
def search(
self,
query: str,
types: List[str] = ["track"],
limit: int = 20,
offset: int = 0
) -> Dict[str, Any]:
"""Search catalog."""
...
```
## Code Metrics
### Complexity
**Cyclomatic Complexity:** Not measured. Likely moderate to high in large modules (`tidal.py`, `spotify.py`).
**Recommendation:** Use `radon` to measure complexity:
```bash
pip install radon
radon cc minim/ -a # Average complexity
radon cc minim/ -n D # Show functions with complexity > D (high)
```
### Duplication
**Code Duplication:** Likely present across API clients (authentication, request handling).
**Recommendation:** Extract common patterns to base class:
```python
class BaseAPI:
def __init__(self, service_name: str):
self.service_name = service_name
self.load_credentials()
def load_credentials(self):
# Common credential loading logic
...
def _request(self, method, url, **kwargs):
# Common request handling
...
class SpotifyAPI(BaseAPI):
def __init__(self):
super().__init__("spotify")
self.base_url = "https://api.spotify.com/v1"
```
### Dependencies
**Direct Dependencies:** 3 (cryptography, mutagen, requests)
**Optional Dependencies:** 6 (ffmpeg, flask, levenshtein, numpy, pillow, playwright)
**Dependency Graph:** Flat (no transitive dependencies within minim modules).
**Recommendation:** Keep dependencies minimal. Current approach is good.
## Summary
minim's codebase is well-structured for a personal project but shows signs of organic growth:
**Strengths:**
- Consistent API client pattern across modules
- Comprehensive test coverage with real API calls
- Good documentation (docstrings, ReadTheDocs)
- Minimal dependencies
- CI/CD with GitHub Actions
**Weaknesses:**
- Large monolithic files (`tidal.py` at 12K lines)
- Generic error handling (all `RuntimeError`)
- No rate limiting
- Plain text token storage
- Inconsistent type hints
- No static type checking
**Recommendations for v2:**
- Split large modules into subpackages
- Define custom exception hierarchy
- Implement rate limiting and backoff
- Use OS keychain for token storage
- Add comprehensive type hints
- Integrate `mypy` for static type checking
- Extract common patterns to base classes
- Add code complexity and duplication metrics to CI
The codebase is production-ready for personal use but requires hardening for commercial or large-scale deployment. The v2 rewrite on the `dev` branch addresses many of these issues.