Alexander a1f6701bac feat: initial implementation of metadata aggregator

- gRPC service with MusicBrainz provider
- PostgreSQL schema with migrations
- Service layer with database-first caching
- Repository pattern for data access
- YAML configuration support
- Research documentation for 17 music metadata projects

2026-04-28 16:28:53 +02:00


MusicMetaLinker Architecture

System Overview

MusicMetaLinker implements a service-oriented architecture for music metadata entity linking. The system coordinates queries across multiple external APIs, aggregates results, and presents a unified interface through a single orchestrator class.

Architecture pattern: Facade with cascading fallback strategy.

Core Components

Align Class (linking.py)

The Align class is the primary orchestrator and sole public interface. It encapsulates all service interactions and presents a clean getter-based API.

Constructor signature:

Align(
    mbid_track=None,
    mbid_release=None, 
    artist=None,
    album=None,
    track=None,
    track_number=None,
    duration=None,
    isrc=None,
    strict=False
)

Responsibilities:

Initialize service-specific aligners based on available input
Coordinate query execution across services
Aggregate and normalize results
Expose unified getter methods for all metadata fields

Internal state:

Stores all input parameters
Maintains references to service aligner instances
Caches retrieved metadata to avoid redundant queries

The Align class doesn't implement service-specific logic. It delegates to specialized classes and functions.

MusicBrainzAlign Class

Handles all MusicBrainz interactions. MusicBrainz is treated as the authoritative source when MBIDs are available.

Key methods:

get_recording(mbid): Retrieves full recording data by MBID. Returns artist, album, track name, duration, ISRCs, and related identifiers.

get_best_match(artist, track, album, duration): Searches MusicBrainz by metadata strings. Filters results by duration and fuzzy string matching. Returns the highest-scoring match.

get_iswc(): Retrieves International Standard Musical Work Code if available.

Search strategy:

If MBID provided, direct lookup (most reliable)
If ISRC provided, search by ISRC
Fall back to metadata string search with filtering

MusicBrainz queries include related entities (artists, releases, ISRCs) in a single request to minimize API calls.

DeezerAlign Class

Interfaces with Deezer's public API. Deezer provides commercial metadata with strong ISRC coverage.

Key methods:

best_match(artist, track, album, duration, duration_threshold=3): Searches Deezer and filters by duration. The duration_threshold parameter allows ±3 seconds variance by default.

get_rank(): Returns Deezer's internal popularity rank for the track.

Search strategy:

If ISRC available, search by ISRC (most accurate)
Fall back to metadata string search
Filter results by duration (±3 seconds)
Apply fuzzy string matching to artist/track/album

Duration filtering is critical for Deezer because metadata searches often return multiple versions (radio edit, album version, remaster).
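
The filter-then-rank step described above can be sketched as follows. This is an illustration, not DeezerAlign's actual code: the candidate dictionaries use simplified, assumed keys (artist, title, duration) rather than Deezer's real response schema, and difflib stands in for whatever fuzzy matcher the project uses.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def pick_best(candidates, artist, track, duration, duration_threshold=3):
    """Drop candidates outside the duration window, then rank by string similarity."""
    in_window = [
        c for c in candidates
        if abs(c["duration"] - duration) <= duration_threshold
    ]
    if not in_window:
        return None
    return max(
        in_window,
        key=lambda c: similarity(c["artist"], artist) + similarity(c["title"], track),
    )


# Hypothetical search results: three versions of the same song.
results = [
    {"id": 1, "artist": "Artist A", "title": "Song (Radio Edit)", "duration": 215},
    {"id": 2, "artist": "Artist A", "title": "Song", "duration": 254},
    {"id": 3, "artist": "Artist A", "title": "Song (Remastered)", "duration": 255},
]
best = pick_best(results, "Artist A", "Song", duration=254)
```

The radio edit is removed by the duration window; the remaster survives it and is only separated from the album version by the string comparison, which is why both steps are needed.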

YouTubeAlign Class

Queries YouTube Music via the unofficial ytmusicapi library.

Key methods:

get_best_match(artist, track, album): Searches YouTube Music with filter="songs". Returns the first result (no sophisticated ranking).

get_youtube_id(): Extracts YouTube video ID from search results.

Search strategy:

Constructs query string: "{artist} {track} {album}"
Filters to songs only (excludes videos, albums)
Returns first result

YouTube matching is the weakest link. No duration filtering (commented out in code). No fuzzy matching. First result is assumed correct.
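
A minimal sketch of the first-result strategy, under assumptions: the search backend is passed in as a callable so the example runs without network access, skipping empty fields when building the query is a choice made here, and the helper names are hypothetical. The search method of a ytmusicapi.YTMusic instance has a compatible shape, and its song results carry a videoId key.

```python
def build_query(artist, track, album=None):
    """Join the available fields into the "{artist} {track} {album}" query string."""
    return " ".join(part for part in (artist, track, album) if part)


def first_song_id(search, artist, track, album=None):
    """Return the videoId of the first song result, or None if nothing came back."""
    results = search(build_query(artist, track, album), filter="songs")
    return results[0].get("videoId") if results else None


def fake_search(query, filter=None):
    """Stand-in for a real search backend; returns one canned result."""
    return [{"videoId": "abc123", "title": query}] if "Song" in query else []


video_id = first_song_id(fake_search, "Artist", "Song", "Album")
```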

acousticbrainz_link Function

Standalone function (not a class) that checks if an MBID exists in AcousticBrainz.

Implementation:

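import requests

def acousticbrainz_link(mbid):
    # Probe the AcousticBrainz page for this recording MBID.
    url = f"https://acousticbrainz.org/{mbid}"
    response = requests.get(url)
    # A 200 response is taken to mean the MBID exists.
    return url if response.status_code == 200 else None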

Simple HTTP check. Returns URL if MBID exists, None otherwise.

Critical issue: AcousticBrainz was discontinued in 2022 and no longer accepts new data, so this check can't be relied on for current recordings. Treat it as dead code.

Data Flow

Initialization Flow

User creates Align instance with available metadata
Align constructor stores all input parameters
Service aligners are instantiated on-demand (lazy initialization)
No queries execute during construction

Query Flow

User calls getter method (e.g., get_mbid())
Align checks if value already cached
If not cached, determines which service to query based on available input
Executes service-specific query
Caches result
Returns value to user

Queries are lazy and cached. Calling get_mbid() twice only queries MusicBrainz once.
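
The lazy, cached getter behaviour can be illustrated with a small stand-in class. LazyLookup and its resolver argument are hypothetical, not part of the library, and caching a None result is an assumption of this sketch.

```python
class LazyLookup:
    """Facade-style holder that resolves a value on first access and caches it."""

    def __init__(self, mbid_track=None, resolver=None):
        self._mbid = mbid_track              # user-supplied value, if any
        self._resolver = resolver            # hypothetical callable that queries a service
        self._resolved = mbid_track is not None

    def get_mbid(self):
        if not self._resolved:               # first call: hit the service once
            self._mbid = self._resolver()
            self._resolved = True            # remember the answer, even if it is None
        return self._mbid


calls = []


def fake_musicbrainz_query():
    """Stub that records how often it was called."""
    calls.append("query")
    return "hypothetical-mbid"


lookup = LazyLookup(resolver=fake_musicbrainz_query)
first = lookup.get_mbid()
second = lookup.get_mbid()   # served from the cache, no second query
```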

Cascading Fallback Strategy

Priority order for identifier resolution:

For MBID:

Use provided mbid_track if available
Query MusicBrainz by ISRC
Query MusicBrainz by metadata strings
Return None if all fail

For ISRC:

Use provided ISRC if available
Extract from MusicBrainz recording (if MBID available)
Query Deezer and extract ISRC from result
Return None if all fail

For Deezer ID:

Query Deezer by ISRC
Query Deezer by metadata strings
Return None if all fail

For YouTube link:

Query YouTube Music by metadata strings
Return None if no results

Each service is queried independently. No cross-service validation or conflict resolution.
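
The MBID cascade above could be expressed roughly like this. The function names and the by_isrc and by_metadata callables are placeholders for the real service calls, which are not shown in this document.

```python
def first_hit(*steps):
    """Run zero-argument lookups in priority order; return the first non-None result."""
    for step in steps:
        value = step()
        if value is not None:
            return value
    return None


def resolve_mbid(mbid_track, isrc, by_isrc, by_metadata):
    """MBID cascade: provided value, then ISRC search, then metadata search."""
    return first_hit(
        lambda: mbid_track,
        lambda: by_isrc(isrc) if isrc else None,
        by_metadata,
    )


# Stubbed services: the ISRC search finds nothing, the metadata search succeeds.
found = resolve_mbid(
    None,
    "XX0000000000",
    by_isrc=lambda isrc: None,
    by_metadata=lambda: "mbid-from-metadata",
)
```

Because each step is only called when the previous one returned None, later services are never queried once an identifier has been found.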

Supporting Components

JAMSProcessor (preprocessor.py)

Handles reading and writing JAMS (JSON Annotated Music Specification) files.

Responsibilities:

Parse JAMS JSON structure
Extract metadata from file_metadata and sandbox sections
Enrich JAMS files with new identifiers
Write updated JAMS files

JAMS structure:

{
  "file_metadata": {
    "title": "track name",
    "artist": "artist name",
    "release": "album name",
    "duration": 123.45,
    "identifiers": {
      "musicbrainz": "mbid-here"
    }
  },
  "sandbox": {
    "type": "genre",
    "genre": "rock",
    "track_number": 1,
    "release_year": 2020
  }
}

JAMSProcessor reads these fields, passes them to Align, and writes enriched identifiers back to the identifiers section.
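
A rough sketch of that read, enrich, write-back cycle using only the standard json module. The real JAMSProcessor may rely on a dedicated JAMS library; the function names and the mapping from JAMS keys to Align parameters are assumptions.

```python
import json

raw = """
{
  "file_metadata": {
    "title": "track name",
    "artist": "artist name",
    "release": "album name",
    "duration": 123.45,
    "identifiers": {"musicbrainz": "mbid-here"}
  },
  "sandbox": {"track_number": 1}
}
"""


def extract_fields(jams: dict) -> dict:
    """Collect the fields an Align-style lookup needs from a parsed JAMS document."""
    meta = jams.get("file_metadata", {})
    sandbox = jams.get("sandbox", {})
    return {
        "artist": meta.get("artist"),
        "album": meta.get("release"),
        "track": meta.get("title"),
        "duration": meta.get("duration"),
        "track_number": sandbox.get("track_number"),
        "mbid_track": meta.get("identifiers", {}).get("musicbrainz"),
    }


def add_identifier(jams: dict, name: str, value: str) -> dict:
    """Write a new identifier into file_metadata.identifiers, creating it if missing."""
    identifiers = jams.setdefault("file_metadata", {}).setdefault("identifiers", {})
    identifiers[name] = value
    return jams


jams = json.loads(raw)
fields = extract_fields(jams)
enriched = add_identifier(jams, "isrc", "isrc-here")
```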

MBDownload (musicbrainz_dump.py)

Utility for bulk downloading MusicBrainz data.

Purpose: Pre-populate local datasets with MusicBrainz metadata to reduce API calls during batch processing.

Implementation details: Not fully specified in provided information. Likely queries MusicBrainz in batches and caches results locally.

link_partitions.py

Batch processing script for directories of JAMS files.

Workflow:

Scan directory for JAMS files
For each file, extract metadata via JAMSProcessor
Create Align instance and query all services
Collect results in pandas DataFrame
Output CSV with all identifiers

Command-line options:

--save: Write enriched JAMS files back to disk
--limit audio: Only process audio files (skip non-audio JAMS)
--overwrite: Overwrite existing enriched files

Includes progress bars via tqdm and logging to link_partitions.log.
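
The three options could be declared with argparse along these lines. This is a guess at the shape of the interface, not the script's actual parser: the positional input_dir argument and the help texts are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Link a directory of JAMS files.")
    # Positional input directory: an assumption, the source does not name it.
    parser.add_argument("input_dir", help="directory containing JAMS files")
    parser.add_argument("--save", action="store_true",
                        help="write enriched JAMS files back to disk")
    parser.add_argument("--limit", choices=["audio"], default=None,
                        help="restrict processing, e.g. to audio files only")
    parser.add_argument("--overwrite", action="store_true",
                        help="overwrite existing enriched files")
    return parser


# Parse an explicit argument list so the example does not depend on sys.argv.
args = build_parser().parse_args(["./partitions", "--save", "--limit", "audio"])
```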

prepare_dataset.py

Dataset preparation utilities. Specific functionality not detailed in provided information. Likely includes:

Data cleaning
Format conversion
Batch metadata enrichment

Configuration Architecture

No configuration system. Apart from the Spotify credentials, which are imported from an external module, all settings are hardcoded in source files.

Hardcoded values:

MusicBrainz User-Agent: "elka/0.1"
Deezer duration threshold: 3 seconds
API endpoints: Direct URLs in code
Spotify credentials: Imported from external mml_secrets.py

Implications:

No runtime configuration
No environment-specific settings
Changing thresholds requires code modification
No A/B testing of matching strategies
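
One lightweight way to lift these values out of the code is a frozen dataclass with environment overrides. LinkerConfig and the MML_* variable names are hypothetical; only the default values come from the list above.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class LinkerConfig:
    user_agent: str = "elka/0.1"              # value reported above as hardcoded
    deezer_duration_threshold: float = 3.0    # seconds

    @classmethod
    def from_env(cls, env=None):
        """Override defaults from MML_* variables (hypothetical names)."""
        env = os.environ if env is None else env
        return cls(
            user_agent=env.get("MML_USER_AGENT", cls.user_agent),
            deezer_duration_threshold=float(
                env.get("MML_DEEZER_DURATION_THRESHOLD", cls.deezer_duration_threshold)
            ),
        )


# Passing a plain dict keeps the example independent of the real environment.
config = LinkerConfig.from_env({"MML_DEEZER_DURATION_THRESHOLD": "5"})
```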

Error Handling Architecture

Error handling is minimal and inconsistent.

Pattern:

try:
    result = service.query()
    return result
except:
    return None

All exceptions are caught and suppressed. Failed queries return None. No error logging, no exception propagation, no retry logic.

Consequences:

Silent failures
No visibility into what went wrong
Difficult debugging
No distinction between "not found" and "service error"
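
A sketch of how "not found" and "service error" could be told apart. ServiceError and safe_query are hypothetical names, and the convention that a query signals "no match" by raising LookupError is an assumption made for the example.

```python
import logging

logger = logging.getLogger("linker")


class ServiceError(Exception):
    """Raised when a service could not be reached or returned an invalid reply."""


def safe_query(query, service_name):
    """Return the query result, None for "not found", or raise ServiceError."""
    try:
        return query()
    except LookupError:
        # Assumption: the query callable signals "no match" with LookupError.
        logger.info("%s: no match", service_name)
        return None
    except Exception as exc:
        # Anything else is a real failure: log it and let the caller decide.
        logger.warning("%s: query failed: %s", service_name, exc)
        raise ServiceError(service_name) from exc


def not_found():
    raise KeyError("no such recording")      # KeyError is a LookupError


def broken():
    raise TimeoutError("service unavailable")


missing = safe_query(not_found, "deezer")
try:
    safe_query(broken, "deezer")
    failed = False
except ServiceError:
    failed = True
```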

Logging Architecture

Uses Python's standard logging module.

Batch processing: File-based logging to link_partitions.log. Includes timestamps, log levels, and progress information.

Library usage: Console logging. Minimal output.

Debug output: Multiple print() statements scattered throughout code. Not controlled by logging configuration.

Issues:

Debug prints in production code
No structured logging
No log levels for debug prints
No correlation IDs for tracking requests across services

Concurrency Model

Single-threaded, synchronous execution. No parallelization.

Query execution:

Services queried sequentially
No concurrent API calls
No async/await
No thread pools

Implications:

Slow batch processing (total time grows with network latency × number of tracks × services queried per track)
Underutilized network bandwidth
Simple debugging (no race conditions)

Batch processing could benefit significantly from parallel execution.
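
As an illustration of what parallel execution might look like, the sketch below fans per-track work out over a thread pool. link_track is a stand-in for the real per-track lookup, and the worker count is arbitrary; services that rate-limit clients would need a small pool or explicit throttling.

```python
from concurrent.futures import ThreadPoolExecutor


def link_track(track):
    """Stand-in for the per-track work; a real version would query the services."""
    return {"track": track, "mbid": f"mbid-for-{track}"}


def link_all(tracks, max_workers=4):
    """Process tracks concurrently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(link_track, tracks))


rows = link_all(["a", "b", "c"])
```

Threads suit this workload because the time is spent waiting on network responses rather than computing.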

Dependency Injection

No dependency injection. Service classes are instantiated directly inside Align rather than passed in by the caller.

Current pattern:

self.mb_align = MusicBrainzAlign(...)
self.deezer_align = DeezerAlign(...)

Implications:

Difficult to mock services for testing
Tight coupling between Align and service implementations
No interface-based programming
Hard to swap service implementations
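
For contrast, a constructor-injection variant might look like the following. InjectedAlign, the Aligner protocol, and FakeDeezer are all hypothetical; the point is only that a service passed in from outside can be replaced by a test double.

```python
from typing import Optional, Protocol


class Aligner(Protocol):
    """Minimal interface a service aligner would need to satisfy in this sketch."""

    def best_match(self, artist: str, track: str) -> Optional[dict]: ...


class InjectedAlign:
    """Hypothetical variant of Align that receives its services instead of building them."""

    def __init__(self, artist: str, track: str, deezer: Aligner):
        self.artist = artist
        self.track = track
        self._deezer = deezer

    def get_deezer_id(self) -> Optional[int]:
        match = self._deezer.best_match(self.artist, self.track)
        return match["id"] if match else None


class FakeDeezer:
    """Test double: no network, fixed answer."""

    def best_match(self, artist, track):
        return {"id": 42} if track == "known" else None


hit = InjectedAlign("artist", "known", FakeDeezer()).get_deezer_id()
miss = InjectedAlign("artist", "unknown", FakeDeezer()).get_deezer_id()
```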

State Management

State is managed in Align instance variables.

Cached values:

All input parameters (artist, track, album, etc.)
Retrieved metadata (MBID, ISRC, Deezer ID, etc.)
Service aligner instances

Cache invalidation: None. Values cached for lifetime of Align instance.

Thread safety: Not thread-safe. No locks, no synchronization.

Extension Points

Limited extensibility.

Adding new services:

Create new service aligner class
Instantiate in Align constructor
Add getter methods to Align
Update cascading fallback logic

No plugin system, no service registry, no abstract base classes.

Modifying matching logic: Requires editing service aligner classes directly. No strategy pattern, no configurable matchers.

Testing Architecture

No test suite. No test directory. No test configuration.

Testing approach:

Manual testing via Jupyter notebooks (deezer_test.ipynb, queries.ipynb)
if __name__ == "__main__" blocks in some modules
No unit tests, no integration tests, no mocks

Build and Packaging

Uses hatchling (PEP 517 build backend).

pyproject.toml structure:

Project metadata (name, version, authors)
Dependencies
Build system configuration

No setup.py. Modern Python packaging.

Distribution: GitHub only. Not published to PyPI.

Deployment Architecture

Library deployment: pip install from GitHub.

Batch processing deployment: Clone repository, install dependencies, run Python scripts directly.

No Docker containers, no systemd services, no process managers.

Performance Considerations

No performance optimization beyond per-instance caching of retrieved values.

Bottlenecks:

Network latency (sequential API calls)
No caching across Align instances
No request batching
No connection pooling

Memory usage:

Minimal (only current track metadata in memory)
No large data structures
Pandas DataFrame for batch output (could be large for big datasets)

Security Architecture

Minimal security considerations.

API credentials:

MusicBrainz: No authentication
Deezer: No authentication
YouTube Music: No authentication
Spotify: OAuth2 client credentials in external file

Secrets management:

Spotify credentials in mml_secrets.py (not in repository)
No encryption
No environment variables
No secrets vault

Input validation:

No validation of user input
No sanitization of metadata strings
Potential injection vulnerabilities if metadata used in shell commands
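
Identifier inputs are the easiest to validate because their formats are fixed: an ISRC is twelve characters (two letters, three alphanumerics, seven digits) and an MBID is a UUID. A minimal sketch, with hypothetical function names:

```python
import re
import uuid
from typing import Optional

# ISRC layout: 2-letter prefix, 3 alphanumerics, 2-digit year, 5-digit designation.
ISRC_PATTERN = re.compile(r"^[A-Z]{2}[A-Z0-9]{3}[0-9]{7}$")


def normalize_isrc(value: str) -> Optional[str]:
    """Return the ISRC in compact upper-case form, or None if malformed."""
    compact = value.replace("-", "").strip().upper()
    return compact if ISRC_PATTERN.match(compact) else None


def is_valid_mbid(value: str) -> bool:
    """MBIDs are UUIDs; accept only the canonical 36-character form."""
    try:
        return str(uuid.UUID(value)) == value.lower()
    except (ValueError, AttributeError, TypeError):
        return False


clean = normalize_isrc("us-abc-12-34567")
```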

Architectural Strengths

Simple facade: Single Align class hides complexity
Cascading fallback: Graceful degradation when services fail
Lazy evaluation: Only query services when needed
Service isolation: Each service in separate class

Architectural Weaknesses

No abstraction: Service classes have different interfaces
Tight coupling: Align directly instantiates service classes
Weak error handling: Exceptions are swallowed, so failures are silent
No concurrency: Sequential execution only
Hardcoded configuration: No runtime flexibility
No testing: No test suite, and the design is hard to test (tight coupling, no injection points for mocks)
Dead code: AcousticBrainz integration non-functional
Inconsistent patterns: Function for AcousticBrainz, classes for others

Architectural Recommendations

For production use, consider:

Define service interface: Abstract base class for all aligners
Dependency injection: Pass service instances to Align constructor
Configuration system: External config for thresholds, endpoints, credentials
Error handling: Explicit error types, logging, retry logic
Async execution: Use asyncio for concurrent API calls
Caching layer: Redis or in-memory cache for repeated queries
Remove dead code: Delete AcousticBrainz integration
Add tests: Unit tests with mocked services
Structured logging: JSON logs with correlation IDs
Rate limiting: Respect API rate limits with backoff

The core pattern (cascading fallback across services) is sound. The implementation needs significant hardening.
