feat: initial implementation of metadata aggregator

- gRPC service with MusicBrainz provider
- PostgreSQL schema with migrations
- Service layer with database-first caching
- Repository pattern for data access
- YAML configuration support
- Research documentation for 17 music metadata projects
2026-04-28 16:27:14 +02:00
commit a1f6701bac
163 changed files with 95884 additions and 0 deletions
@@ -0,0 +1,441 @@
+# MusicMetaLinker Architecture
+
+## System Overview
+
+MusicMetaLinker implements a service-oriented architecture for music metadata entity linking. The system coordinates queries across multiple external APIs, aggregates results, and presents a unified interface through a single orchestrator class.
+
+Architecture pattern: Facade with cascading fallback strategy.
+
+## Core Components
+
+### Align Class (linking.py)
+
+The Align class is the primary orchestrator and sole public interface. It encapsulates all service interactions and presents a clean getter-based API.
+
+**Constructor signature:**
+```python
+Align(
+    mbid_track=None,
+    mbid_release=None,
+    artist=None,
+    album=None,
+    track=None,
+    track_number=None,
+    duration=None,
+    isrc=None,
+    strict=False
+)
+```
+
+**Responsibilities:**
+- Initialize service-specific aligners based on available input
+- Coordinate query execution across services
+- Aggregate and normalize results
+- Expose unified getter methods for all metadata fields
+
+**Internal state:**
+- Stores all input parameters
+- Maintains references to service aligner instances
+- Caches retrieved metadata to avoid redundant queries
+
+The Align class doesn't implement service-specific logic. It delegates to specialized classes and functions.
+
+### MusicBrainzAlign Class
+
+Handles all MusicBrainz interactions. MusicBrainz is treated as the authoritative source when MBIDs are available.
+
+**Key methods:**
+
+**get_recording(mbid):** Retrieves full recording data by MBID. Returns artist, album, track name, duration, ISRCs, and related identifiers.
+
+**get_best_match(artist, track, album, duration):** Searches MusicBrainz by metadata strings. Filters results by duration and fuzzy string matching. Returns the highest-scoring match.
+
+**get_iswc():** Retrieves International Standard Musical Work Code if available.
+
+**Search strategy:**
+1. If MBID provided, direct lookup (most reliable)
+2. If ISRC provided, search by ISRC
+3. Fall back to metadata string search with filtering
+
+MusicBrainz queries include related entities (artists, releases, ISRCs) in a single request to minimize API calls.
+
+### DeezerAlign Class
+
+Interfaces with Deezer's public API. Deezer provides commercial metadata with strong ISRC coverage.
+
+**Key methods:**
+
+**best_match(artist, track, album, duration, duration_threshold=3):** Searches Deezer and filters by duration. The duration_threshold parameter allows ±3 seconds variance by default.
+
+**get_rank():** Returns Deezer's internal popularity rank for the track.
+
+**Search strategy:**
+1. If ISRC available, search by ISRC (most accurate)
+2. Fall back to metadata string search
+3. Filter results by duration (±3 seconds)
+4. Apply fuzzy string matching to artist/track/album
+
+Duration filtering is critical for Deezer because metadata searches often return multiple versions (radio edit, album version, remaster).
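+
+The filter-then-rank step described above can be sketched as follows. This is a minimal illustration, not the project's code: the function name `pick_best`, the candidate dict layout, and the use of `difflib.SequenceMatcher` as the similarity measure are all assumptions, and the sample data is made up.
+
+```python
+from difflib import SequenceMatcher
+
+
+def similarity(a, b):
+    """Case-insensitive similarity ratio between two strings (0.0 to 1.0)."""
+    return SequenceMatcher(None, a.lower(), b.lower()).ratio()
+
+
+def pick_best(candidates, artist, track, duration, duration_threshold=3):
+    """Drop candidates outside the duration window, then rank by similarity."""
+    in_window = [
+        c for c in candidates
+        if abs(c["duration"] - duration) <= duration_threshold
+    ]
+    if not in_window:
+        return None
+    return max(
+        in_window,
+        key=lambda c: similarity(c["artist"], artist) + similarity(c["title"], track),
+    )
+
+
+# Hypothetical search results: a radio edit, the album version, and a cover.
+candidates = [
+    {"title": "Example Song (Radio Edit)", "artist": "Example Artist", "duration": 212},
+    {"title": "Example Song", "artist": "Example Artist", "duration": 371},
+    {"title": "Example Song", "artist": "Some Cover Band", "duration": 370},
+]
+best = pick_best(candidates, "Example Artist", "Example Song", duration=370)
+print(best)
+```
+
+The duration window removes the radio edit before any string comparison happens, which is why the threshold matters more than the exact similarity measure.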
+
+### YouTubeAlign Class
+
+Queries YouTube Music via the unofficial ytmusicapi library.
+
+**Key methods:**
+
+**get_best_match(artist, track, album):** Searches YouTube Music with filter="songs". Returns the first result (no sophisticated ranking).
+
+**get_youtube_id():** Extracts YouTube video ID from search results.
+
+**Search strategy:**
+- Constructs query string: "{artist} {track} {album}"
+- Filters to songs only (excludes videos, albums)
+- Returns first result
+
+YouTube matching is the weakest link. No duration filtering (commented out in code). No fuzzy matching. First result is assumed correct.
+
+### acousticbrainz_link Function
+
+Standalone function (not a class) that checks if an MBID exists in AcousticBrainz.
+
+**Implementation:**
+```python
+import requests
+
+def acousticbrainz_link(mbid):
+    # Probe the recording page; a 200 response is taken to mean the MBID exists.
+    url = f"https://acousticbrainz.org/{mbid}"
+    response = requests.get(url)
+    return url if response.status_code == 200 else None
+```
+
+Simple HTTP check. Returns URL if MBID exists, None otherwise.
+
+**Critical issue:** AcousticBrainz shut down in 2022. This function always returns None. Dead code.
+
+## Data Flow
+
+### Initialization Flow
+
+1. User creates Align instance with available metadata
+2. Align constructor stores all input parameters
+3. Service aligners are instantiated on-demand (lazy initialization)
+4. No queries execute during construction
+
+### Query Flow
+
+1. User calls getter method (e.g., get_mbid())
+2. Align checks if value already cached
+3. If not cached, determines which service to query based on available input
+4. Executes service-specific query
+5. Caches result
+6. Returns value to user
+
+Queries are lazy and cached. Calling get_mbid() twice only queries MusicBrainz once.
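+
+The lazy, cached getter behaviour can be illustrated with a small stand-in class. `LazyLookup` and `fake_query` are hypothetical names used only for this sketch; the real Align class may structure its cache differently.
+
+```python
+class LazyLookup:
+    """Hypothetical stand-in for Align's getter caching; not the project's code."""
+
+    _MISSING = object()  # sentinel so that a cached None is not queried again
+
+    def __init__(self, query_fn):
+        self._query_fn = query_fn
+        self._cache = {}
+        self.calls = 0  # counts how often the underlying service is hit
+
+    def get(self, field):
+        value = self._cache.get(field, self._MISSING)
+        if value is self._MISSING:
+            self.calls += 1
+            value = self._query_fn(field)
+            self._cache[field] = value
+        return value
+
+
+def fake_query(field):
+    """Pretend service call that returns a placeholder value."""
+    return f"{field}-value"
+
+
+lookup = LazyLookup(fake_query)
+lookup.get("mbid")
+lookup.get("mbid")
+print(lookup.calls)  # the second call is served from the cache
+```
+
+Using a sentinel rather than `None` as the "not cached" marker means a failed lookup is also remembered, so a track with no match does not trigger a new request on every call.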
+
+### Cascading Fallback Strategy
+
+Priority order for identifier resolution:
+
+**For MBID:**
+1. Use provided mbid_track if available
+2. Query MusicBrainz by ISRC
+3. Query MusicBrainz by metadata strings
+4. Return None if all fail
+
+**For ISRC:**
+1. Use provided ISRC if available
+2. Extract from MusicBrainz recording (if MBID available)
+3. Query Deezer and extract ISRC from result
+4. Return None if all fail
+
+**For Deezer ID:**
+1. Query Deezer by ISRC
+2. Query Deezer by metadata strings
+3. Return None if all fail
+
+**For YouTube link:**
+1. Query YouTube Music by metadata strings
+2. Return None if no results
+
+Each service is queried independently. No cross-service validation or conflict resolution.
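+
+The priority lists above all follow the same shape: try each source in order and stop at the first non-None answer. A minimal sketch of that shape for the MBID case, assuming hypothetical `by_isrc` and `by_metadata` callables in place of the real MusicBrainz queries:
+
+```python
+def first_result(resolvers):
+    """Call each resolver in order and return the first non-None result."""
+    for resolve in resolvers:
+        result = resolve()
+        if result is not None:
+            return result
+    return None
+
+
+def resolve_mbid(mbid_track=None, isrc=None, artist=None, track=None,
+                 by_isrc=lambda isrc: None,
+                 by_metadata=lambda artist, track: None):
+    """Mirror the MBID priority order: provided value, ISRC search, metadata search."""
+    return first_result([
+        lambda: mbid_track,
+        lambda: by_isrc(isrc) if isrc else None,
+        lambda: by_metadata(artist, track) if artist and track else None,
+    ])
+
+
+# Stub lookups standing in for real service calls (illustrative values only).
+def isrc_lookup(isrc):
+    return {"isrc-1": "mbid-from-isrc"}.get(isrc)
+
+
+def metadata_lookup(artist, track):
+    return "mbid-from-metadata"
+
+
+print(resolve_mbid(mbid_track="mbid-given", isrc="isrc-1", by_isrc=isrc_lookup))
+print(resolve_mbid(isrc="isrc-1", by_isrc=isrc_lookup))
+print(resolve_mbid(isrc="unknown", artist="a", track="t",
+                   by_isrc=isrc_lookup, by_metadata=metadata_lookup))
+```
+
+Because each step is wrapped in a lambda, later (and slower) lookups run only when the earlier ones return None.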
+
+## Supporting Components
+
+### JAMSProcessor (preprocessor.py)
+
+Handles reading and writing JAMS (JSON Annotated Music Specification) files.
+
+**Responsibilities:**
+- Parse JAMS JSON structure
+- Extract metadata from file_metadata and sandbox sections
+- Enrich JAMS files with new identifiers
+- Write updated JAMS files
+
+JAMS structure:
+```json
+{
+  "file_metadata": {
+    "title": "track name",
+    "artist": "artist name",
+    "release": "album name",
+    "duration": 123.45,
+    "identifiers": {
+      "musicbrainz": "mbid-here"
+    }
+  },
+  "sandbox": {
+    "type": "genre",
+    "genre": "rock",
+    "track_number": 1,
+    "release_year": 2020
+  }
+}
+```
+
+JAMSProcessor reads these fields, passes them to Align, and writes enriched identifiers back to the identifiers section.
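+
+The read, link, write-back cycle can be sketched with the standard `json` module alone. The helper names `align_kwargs_from_jams` and `add_identifiers` are hypothetical, and the mapping of the `musicbrainz` identifier onto `mbid_track` is an assumption; the actual JAMSProcessor may work differently.
+
+```python
+import json
+
+
+def align_kwargs_from_jams(jams):
+    """Map JAMS fields onto the Align constructor's keyword arguments."""
+    meta = jams.get("file_metadata", {})
+    sandbox = jams.get("sandbox", {})
+    return {
+        # Assumption: the "musicbrainz" identifier holds a recording MBID.
+        "mbid_track": meta.get("identifiers", {}).get("musicbrainz"),
+        "artist": meta.get("artist"),
+        "album": meta.get("release"),
+        "track": meta.get("title"),
+        "duration": meta.get("duration"),
+        "track_number": sandbox.get("track_number"),
+    }
+
+
+def add_identifiers(jams, new_ids):
+    """Write non-empty identifiers back into file_metadata.identifiers."""
+    identifiers = jams.setdefault("file_metadata", {}).setdefault("identifiers", {})
+    identifiers.update({k: v for k, v in new_ids.items() if v is not None})
+    return jams
+
+
+raw = (
+    '{"file_metadata": {"title": "track name", "artist": "artist name", '
+    '"release": "album name", "duration": 123.45, '
+    '"identifiers": {"musicbrainz": "mbid-here"}}, '
+    '"sandbox": {"track_number": 1}}'
+)
+jams = json.loads(raw)
+kwargs = align_kwargs_from_jams(jams)
+add_identifiers(jams, {"isrc": "isrc-here", "deezer": None})
+print(json.dumps(jams["file_metadata"]["identifiers"]))
+```
+
+Skipping None values on write-back keeps failed lookups from overwriting or cluttering the identifiers section.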
+
+### MBDownload (musicbrainz_dump.py)
+
+Utility for bulk downloading MusicBrainz data.
+
+**Purpose:** Pre-populate local datasets with MusicBrainz metadata to reduce API calls during batch processing.
+
+**Implementation details:** Not covered by the material reviewed. It likely queries MusicBrainz in batches and caches results locally.
+
+### link_partitions.py
+
+Batch processing script for directories of JAMS files.
+
+**Workflow:**
+1. Scan directory for JAMS files
+2. For each file, extract metadata via JAMSProcessor
+3. Create Align instance and query all services
+4. Collect results in pandas DataFrame
+5. Output CSV with all identifiers
+
+**Command-line options:**
+- `--save`: Write enriched JAMS files back to disk
+- `--limit audio`: Only process audio files (skip non-audio JAMS)
+- `--overwrite`: Overwrite existing enriched files
+
+Includes progress bars via tqdm and logging to link_partitions.log.
+
+### prepare_dataset.py
+
+Dataset preparation utilities. The material reviewed does not detail its functionality. It likely includes:
+- Data cleaning
+- Format conversion
+- Batch metadata enrichment
+
+## Configuration Architecture
+
+No configuration system. All settings hardcoded in source files.
+
+**Hardcoded values:**
+- MusicBrainz User-Agent: "elka/0.1"
+- Deezer duration threshold: 3 seconds
+- API endpoints: Direct URLs in code
+- Spotify credentials: Imported from external mml_secrets.py
+
+**Implications:**
+- No runtime configuration
+- No environment-specific settings
+- Changing thresholds requires code modification
+- No A/B testing of matching strategies
+
+## Error Handling Architecture
+
+Error handling is minimal and inconsistent.
+
+**Pattern:**
+```python
+try:
+    result = service.query()
+    return result
+except:
+    return None
+```
+
+All exceptions are caught and suppressed. Failed queries return None. No error logging, no exception propagation, no retry logic.
+
+**Consequences:**
+- Silent failures
+- No visibility into what went wrong
+- Difficult debugging
+- No distinction between "not found" and "service error"
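+
+One way to separate "not found" from "service error" is to let None mean only "no match" and to raise a dedicated exception for failures. The sketch below is an illustration of that idea, not the project's code: `ServiceError` and `safe_query` are hypothetical names. It catches `OSError`, the base class of the standard library's `ConnectionError` and `TimeoutError`.
+
+```python
+import logging
+
+logger = logging.getLogger("linker_sketch")
+
+
+class ServiceError(Exception):
+    """Raised when a backend call fails, as opposed to finding nothing."""
+
+
+def safe_query(service_name, query_fn):
+    """Run a query; return its result (None means no match) or raise ServiceError."""
+    try:
+        return query_fn()
+    except OSError as exc:
+        # Log the failure instead of silently swallowing it.
+        logger.warning("%s query failed: %s", service_name, exc)
+        raise ServiceError(service_name) from exc
+
+
+def no_match():
+    return None
+
+
+def broken():
+    raise TimeoutError("simulated timeout")
+
+
+print(safe_query("demo", no_match))  # a clean "not found"
+try:
+    safe_query("demo", broken)
+except ServiceError as err:
+    print("service error:", err)
+```
+
+Callers can then decide per service whether a failure should abort the run, trigger a retry, or fall through to the next source.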
+
+## Logging Architecture
+
+Uses Python's standard logging module.
+
+**Batch processing:** File-based logging to link_partitions.log. Includes timestamps, log levels, and progress information.
+
+**Library usage:** Console logging. Minimal output.
+
+**Debug output:** Multiple print() statements scattered throughout code. Not controlled by logging configuration.
+
+**Issues:**
+- Debug prints in production code
+- No structured logging
+- No log levels for debug prints
+- No correlation IDs for tracking requests across services
+
+## Concurrency Model
+
+Single-threaded, synchronous execution. No parallelization.
+
+**Query execution:**
+- Services queried sequentially
+- No concurrent API calls
+- No async/await
+- No thread pools
+
+**Implications:**
+- Slow batch processing (network latency multiplied by number of tracks)
+- Underutilized network bandwidth
+- Simple debugging (no race conditions)
+
+Batch processing could benefit significantly from parallel execution.
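+
+As a rough sketch of what parallel batch execution could look like, the example below fans tracks out over a thread pool. `link_track` is a hypothetical per-track function whose sleep stands in for network latency; it is not part of the project.
+
+```python
+import time
+from concurrent.futures import ThreadPoolExecutor
+
+
+def link_track(track):
+    """Hypothetical per-track lookup; the sleep stands in for network latency."""
+    time.sleep(0.05)
+    return {"track": track, "mbid": f"mbid-for-{track}"}
+
+
+def link_all(tracks, max_workers=4):
+    """Process tracks concurrently; pool.map preserves the input order."""
+    with ThreadPoolExecutor(max_workers=max_workers) as pool:
+        return list(pool.map(link_track, tracks))
+
+
+results = link_all(["a", "b", "c", "d"])
+print([r["track"] for r in results])
+```
+
+Since Align instances are not thread-safe, each worker would need its own instance per track, and the worker count would have to stay within the rate limits of the external APIs.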
+
+## Dependency Injection
+
+No dependency injection. Align creates its service aligners itself instead of receiving them from the caller.
+
+**Current pattern:**
+```python
+self.mb_align = MusicBrainzAlign(...)
+self.deezer_align = DeezerAlign(...)
+```
+
+**Implications:**
+- Difficult to mock services for testing
+- Tight coupling between Align and service implementations
+- No interface-based programming
+- Hard to swap service implementations
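+
+For contrast, a constructor-injection variant might look like the sketch below. `AlignSketch`, `FakeAligner`, and the `get_isrc` method on the aligners are hypothetical and exist only to show how injected fakes make the fallback logic testable without network access.
+
+```python
+class AlignSketch:
+    """Hypothetical variant of Align that receives its aligners from the caller."""
+
+    def __init__(self, mb_align, deezer_align):
+        self.mb_align = mb_align
+        self.deezer_align = deezer_align
+
+    def get_isrc(self):
+        # Try MusicBrainz first, then fall back to Deezer.
+        isrc = self.mb_align.get_isrc()
+        if isrc is None:
+            isrc = self.deezer_align.get_isrc()
+        return isrc
+
+
+class FakeAligner:
+    """Test double that returns a fixed ISRC instead of calling a service."""
+
+    def __init__(self, isrc):
+        self._isrc = isrc
+
+    def get_isrc(self):
+        return self._isrc
+
+
+align = AlignSketch(FakeAligner(None), FakeAligner("isrc-from-deezer"))
+print(align.get_isrc())
+```
+
+With this shape, a unit test can cover every branch of the fallback by swapping in fakes that return or withhold a value.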
+
+## State Management
+
+State is managed in Align instance variables.
+
+**Cached values:**
+- All input parameters (artist, track, album, etc.)
+- Retrieved metadata (MBID, ISRC, Deezer ID, etc.)
+- Service aligner instances
+
+**Cache invalidation:** None. Values cached for lifetime of Align instance.
+
+**Thread safety:** Not thread-safe. No locks, no synchronization.
+
+## Extension Points
+
+Limited extensibility.
+
+**Adding new services:**
+1. Create new service aligner class
+2. Instantiate in Align constructor
+3. Add getter methods to Align
+4. Update cascading fallback logic
+
+No plugin system, no service registry, no abstract base classes.
+
+**Modifying matching logic:**
+Requires editing service aligner classes directly. No strategy pattern, no configurable matchers.
+
+## Testing Architecture
+
+No test suite. No test directory. No test configuration.
+
+**Testing approach:**
+- Manual testing via Jupyter notebooks (deezer_test.ipynb, queries.ipynb)
+- if __name__ == "__main__" blocks in some modules
+- No unit tests, no integration tests, no mocks
+
+## Build and Packaging
+
+Uses hatchling (PEP 517 build backend).
+
+**pyproject.toml structure:**
+- Project metadata (name, version, authors)
+- Dependencies
+- Build system configuration
+
+No setup.py. Modern Python packaging.
+
+**Distribution:** GitHub only. Not published to PyPI.
+
+## Deployment Architecture
+
+Library deployment: pip install from GitHub.
+
+Batch processing deployment: Clone repository, install dependencies, run Python scripts directly.
+
+No Docker containers, no systemd services, no process managers.
+
+## Performance Considerations
+
+No performance optimization beyond per-instance caching of retrieved values.
+
+**Bottlenecks:**
+- Network latency (sequential API calls)
+- No caching across Align instances
+- No request batching
+- No connection pooling
+
+**Memory usage:**
+- Minimal (only current track metadata in memory)
+- No large data structures
+- Pandas DataFrame for batch output (could be large for big datasets)
+
+## Security Architecture
+
+Minimal security considerations.
+
+**API credentials:**
+- MusicBrainz: No authentication
+- Deezer: No authentication
+- YouTube Music: No authentication
+- Spotify: OAuth2 client credentials in external file
+
+**Secrets management:**
+- Spotify credentials in mml_secrets.py (not in repository)
+- No encryption
+- No environment variables
+- No secrets vault
+
+**Input validation:**
+- No validation of user input
+- No sanitization of metadata strings
+- Potential injection vulnerabilities if metadata used in shell commands
+
+## Architectural Strengths
+
+1. **Simple facade:** Single Align class hides complexity
+2. **Cascading fallback:** Graceful degradation when services fail
+3. **Lazy evaluation:** Only query services when needed
+4. **Service isolation:** Each service in separate class
+
+## Architectural Weaknesses
+
+1. **No abstraction:** Service classes have different interfaces
+2. **Tight coupling:** Align directly instantiates service classes
+3. **Weak error handling:** Exceptions are swallowed, so failures are silent
+4. **No concurrency:** Sequential execution only
+5. **Hardcoded configuration:** No runtime flexibility
+6. **No testing:** No test suite, and the design is hard to test (tight coupling, no injection points for mocks)
+7. **Dead code:** AcousticBrainz integration non-functional
+8. **Inconsistent patterns:** Function for AcousticBrainz, classes for others
+
+## Architectural Recommendations
+
+For production use, consider:
+
+1. **Define service interface:** Abstract base class for all aligners
+2. **Dependency injection:** Pass service instances to Align constructor
+3. **Configuration system:** External config for thresholds, endpoints, credentials
+4. **Error handling:** Explicit error types, logging, retry logic
+5. **Async execution:** Use asyncio for concurrent API calls
+6. **Caching layer:** Redis or in-memory cache for repeated queries
+7. **Remove dead code:** Delete AcousticBrainz integration
+8. **Add tests:** Unit tests with mocked services
+9. **Structured logging:** JSON logs with correlation IDs
+10. **Rate limiting:** Respect API rate limits with backoff
+
+The core pattern (cascading fallback across services) is sound. The implementation needs significant hardening.