Alexander a1f6701bac feat: initial implementation of metadata aggregator
- gRPC service with MusicBrainz provider
- PostgreSQL schema with migrations
- Service layer with database-first caching
- Repository pattern for data access
- YAML configuration support
- Research documentation for 17 music metadata projects
2026-04-28 16:28:53 +02:00

MusicMetaLinker Architecture

System Overview

MusicMetaLinker implements a service-oriented architecture for music metadata entity linking. The system coordinates queries across multiple external APIs, aggregates results, and presents a unified interface through a single orchestrator class.

Architecture pattern: Facade with cascading fallback strategy.

Core Components

Align Class (linking.py)

The Align class is the primary orchestrator and sole public interface. It encapsulates all service interactions and presents a clean getter-based API.

Constructor signature:

Align(
    mbid_track=None,
    mbid_release=None, 
    artist=None,
    album=None,
    track=None,
    track_number=None,
    duration=None,
    isrc=None,
    strict=False
)

Responsibilities:

  • Initialize service-specific aligners based on available input
  • Coordinate query execution across services
  • Aggregate and normalize results
  • Expose unified getter methods for all metadata fields

Internal state:

  • Stores all input parameters
  • Maintains references to service aligner instances
  • Caches retrieved metadata to avoid redundant queries

The Align class doesn't implement service-specific logic. It delegates to specialized classes and functions.

MusicBrainzAlign Class

Handles all MusicBrainz interactions. MusicBrainz is treated as the authoritative source when MBIDs are available.

Key methods:

get_recording(mbid): Retrieves full recording data by MBID. Returns artist, album, track name, duration, ISRCs, and related identifiers.

get_best_match(artist, track, album, duration): Searches MusicBrainz by metadata strings. Filters results by duration and fuzzy string matching. Returns the highest-scoring match.

get_iswc(): Retrieves International Standard Musical Work Code if available.

Search strategy:

  1. If MBID provided, direct lookup (most reliable)
  2. If ISRC provided, search by ISRC
  3. Fall back to metadata string search with filtering

MusicBrainz queries include related entities (artists, releases, ISRCs) in a single request to minimize API calls.

DeezerAlign Class

Interfaces with Deezer's public API. Deezer provides commercial metadata with strong ISRC coverage.

Key methods:

best_match(artist, track, album, duration, duration_threshold=3): Searches Deezer and filters by duration. The duration_threshold parameter allows ±3 seconds variance by default.

get_rank(): Returns Deezer's internal popularity rank for the track.

Search strategy:

  1. If ISRC available, search by ISRC (most accurate)
  2. Fall back to metadata string search
  3. Filter results by duration (±3 seconds)
  4. Apply fuzzy string matching to artist/track/album

Duration filtering is critical for Deezer because metadata searches often return multiple versions (radio edit, album version, remaster).
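
The duration filter and fuzzy ranking described above can be sketched roughly as follows. This is a minimal illustration, not the project's code: the candidate dictionaries, their keys, and the use of difflib for string similarity are all assumptions.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio in the range [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def pick_best(candidates, artist, track, duration, duration_threshold=3):
    """Drop candidates outside the duration window, then rank by fuzzy score.

    `candidates` is a hypothetical list of dicts with "artist", "title" and
    "duration" (seconds) keys; the real Deezer payload may look different.
    """
    in_window = [
        c for c in candidates
        if abs(c["duration"] - duration) <= duration_threshold
    ]
    if not in_window:
        return None
    return max(
        in_window,
        key=lambda c: similarity(c["artist"], artist) + similarity(c["title"], track),
    )


# Two versions of the same (made-up) song; only one fits the duration window.
candidates = [
    {"artist": "Example Artist", "title": "Example Song (Radio Edit)", "duration": 225},
    {"artist": "Example Artist", "title": "Example Song", "duration": 320},
]
best = pick_best(candidates, "Example Artist", "Example Song", duration=321)
```

Filtering on duration first keeps the radio edit out of the ranking entirely, so a near-identical title cannot outvote a large length mismatch.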

YouTubeAlign Class

Queries YouTube Music via the unofficial ytmusicapi library.

Key methods:

get_best_match(artist, track, album): Searches YouTube Music with filter="songs". Returns the first result (no sophisticated ranking).

get_youtube_id(): Extracts YouTube video ID from search results.

Search strategy:

  • Constructs query string: "{artist} {track} {album}"
  • Filters to songs only (excludes videos, albums)
  • Returns first result

YouTube matching is the weakest link. No duration filtering (commented out in code). No fuzzy matching. First result is assumed correct.

acousticbrainz_link Function

Standalone function (not a class) that checks whether an MBID exists in AcousticBrainz.

Implementation:

import requests

def acousticbrainz_link(mbid):
    url = f"https://acousticbrainz.org/{mbid}"
    response = requests.get(url)
    # HTTP 200 is taken to mean the MBID exists
    return url if response.status_code == 200 else None

Simple HTTP check. Returns URL if MBID exists, None otherwise.

Critical issue: AcousticBrainz shut down in 2022. This function always returns None. Dead code.

Data Flow

Initialization Flow

  1. User creates Align instance with available metadata
  2. Align constructor stores all input parameters
  3. Service aligners are instantiated on-demand (lazy initialization)
  4. No queries execute during construction

Query Flow

  1. User calls getter method (e.g., get_mbid())
  2. Align checks if value already cached
  3. If not cached, determines which service to query based on available input
  4. Executes service-specific query
  5. Caches result
  6. Returns value to user

Queries are lazy and cached. Calling get_mbid() twice only queries MusicBrainz once.
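
A lazy, cached getter of this kind could look like the sketch below. The LazyLookup class and the fake query function are hypothetical stand-ins used only to show the pattern; they are not taken from the codebase.

```python
class LazyLookup:
    """Runs an expensive lookup on first access and caches the result."""

    _MISSING = object()  # sentinel, so a cached None is not queried again

    def __init__(self, fetch):
        self._fetch = fetch          # hypothetical callable that hits a service
        self._value = self._MISSING

    def get(self):
        if self._value is self._MISSING:
            self._value = self._fetch()
        return self._value


calls = []


def fake_musicbrainz_query():
    calls.append(1)                  # record each simulated network call
    return "mbid-here"


mbid = LazyLookup(fake_musicbrainz_query)
first = mbid.get()
second = mbid.get()                  # served from the cache, no second call
```

The sentinel matters: if None were used to mean "not fetched yet", every failed lookup would be repeated on each call.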

Cascading Fallback Strategy

Priority order for identifier resolution:

For MBID:

  1. Use provided mbid_track if available
  2. Query MusicBrainz by ISRC
  3. Query MusicBrainz by metadata strings
  4. Return None if all fail

For ISRC:

  1. Use provided ISRC if available
  2. Extract from MusicBrainz recording (if MBID available)
  3. Query Deezer and extract ISRC from result
  4. Return None if all fail

For Deezer ID:

  1. Query Deezer by ISRC
  2. Query Deezer by metadata strings
  3. Return None if all fail

For YouTube link:

  1. Query YouTube Music by metadata strings
  2. Return None if no results

Each service is queried independently. No cross-service validation or conflict resolution.
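
The priority lists above share one shape: try each source in order and stop at the first non-None answer. A minimal sketch of that shape, using the ISRC order as the example, is shown here. The helper names and the callables passed in are hypothetical.

```python
def first_non_none(*resolvers):
    """Call each resolver in priority order; return the first non-None result."""
    for resolve in resolvers:
        value = resolve()
        if value is not None:
            return value
    return None


def resolve_isrc(provided_isrc, from_musicbrainz, from_deezer):
    """Mirror the ISRC priority order; both callables are hypothetical stand-ins."""
    return first_non_none(
        lambda: provided_isrc,   # 1. value supplied by the caller
        from_musicbrainz,        # 2. extracted from the MusicBrainz recording
        from_deezer,             # 3. extracted from the Deezer match
    )


# MusicBrainz has nothing, so the placeholder Deezer value is used.
isrc = resolve_isrc(None, lambda: None, lambda: "isrc-from-deezer")
```

Because the resolvers are callables, later services are never queried once an earlier one has answered.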

Supporting Components

JAMSProcessor (preprocessor.py)

Handles reading and writing JAMS (JSON Annotated Music Specification) files.

Responsibilities:

  • Parse JAMS JSON structure
  • Extract metadata from file_metadata and sandbox sections
  • Enrich JAMS files with new identifiers
  • Write updated JAMS files

JAMS structure:

{
  "file_metadata": {
    "title": "track name",
    "artist": "artist name",
    "release": "album name",
    "duration": 123.45,
    "identifiers": {
      "musicbrainz": "mbid-here"
    }
  },
  "sandbox": {
    "type": "genre",
    "genre": "rock",
    "track_number": 1,
    "release_year": 2020
  }
}

JAMSProcessor reads these fields, passes them to Align, and writes enriched identifiers back to the identifiers section.
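
As an illustration of that read-and-enrich round trip, the sketch below works on a plain dictionary with the structure shown above. The function names and the mapping from JAMS fields to Align parameters are assumptions, not the JAMSProcessor API.

```python
import json


def extract_align_inputs(jams: dict) -> dict:
    """Pull the fields shown above out of a parsed JAMS document."""
    meta = jams.get("file_metadata", {})
    sandbox = jams.get("sandbox", {})
    return {
        "artist": meta.get("artist"),
        "track": meta.get("title"),
        "album": meta.get("release"),
        "duration": meta.get("duration"),
        # Assumed mapping: the MusicBrainz identifier feeds mbid_track.
        "mbid_track": meta.get("identifiers", {}).get("musicbrainz"),
        "track_number": sandbox.get("track_number"),
    }


def add_identifiers(jams: dict, new_ids: dict) -> dict:
    """Merge newly resolved identifiers in place, skipping None values."""
    identifiers = jams.setdefault("file_metadata", {}).setdefault("identifiers", {})
    identifiers.update({k: v for k, v in new_ids.items() if v is not None})
    return jams


jams = {
    "file_metadata": {
        "title": "track name",
        "artist": "artist name",
        "release": "album name",
        "duration": 123.45,
        "identifiers": {"musicbrainz": "mbid-here"},
    },
    "sandbox": {"track_number": 1},
}
inputs = extract_align_inputs(jams)
add_identifiers(jams, {"deezer": "deezer-id-here", "youtube": None})
serialized = json.dumps(jams, indent=2)  # ready to be written back to disk
```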

MBDownload (musicbrainz_dump.py)

Utility for bulk downloading MusicBrainz data.

Purpose: Pre-populate local datasets with MusicBrainz metadata to reduce API calls during batch processing.

Implementation details: Not documented in the available material. It likely queries MusicBrainz in batches and caches results locally.

Batch Processing Script

Batch processing script for directories of JAMS files.

Workflow:

  1. Scan directory for JAMS files
  2. For each file, extract metadata via JAMSProcessor
  3. Create Align instance and query all services
  4. Collect results in pandas DataFrame
  5. Output CSV with all identifiers

Command-line options:

  • --save: Write enriched JAMS files back to disk
  • --limit audio: Only process audio files (skip non-audio JAMS)
  • --overwrite: Overwrite existing enriched files

Includes progress bars via tqdm and logging to link_partitions.log.

prepare_dataset.py

Dataset preparation utilities. Their exact functionality is not documented in the available material. They likely include:

  • Data cleaning
  • Format conversion
  • Batch metadata enrichment

Configuration Architecture

No configuration system. All settings hardcoded in source files.

Hardcoded values:

  • MusicBrainz User-Agent: "elka/0.1"
  • Deezer duration threshold: 3 seconds
  • API endpoints: Direct URLs in code
  • Spotify credentials: Imported from external mml_secrets.py

Implications:

  • No runtime configuration
  • No environment-specific settings
  • Changing thresholds requires code modification
  • No A/B testing of matching strategies
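
One lightweight way to lift these values out of the source is a small settings object with environment overrides. The sketch below is only an illustration: the LinkerConfig class and the MML_* variable names are hypothetical, while the defaults are the hardcoded values listed above.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class LinkerConfig:
    user_agent: str = "elka/0.1"
    deezer_duration_threshold: float = 3.0

    @classmethod
    def from_env(cls, env=None):
        """Build a config, letting environment variables override the defaults.

        The MML_* variable names are hypothetical.
        """
        env = os.environ if env is None else env
        return cls(
            user_agent=env.get("MML_USER_AGENT", cls.user_agent),
            deezer_duration_threshold=float(
                env.get("MML_DEEZER_DURATION_THRESHOLD", cls.deezer_duration_threshold)
            ),
        )


default_config = LinkerConfig.from_env({})
relaxed_config = LinkerConfig.from_env({"MML_DEEZER_DURATION_THRESHOLD": "5"})
```

Passing the mapping explicitly, rather than always reading os.environ, also makes it easy to compare two threshold settings side by side.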

Error Handling Architecture

Error handling is minimal and inconsistent.

Pattern:

try:
    result = service.query()
    return result
except:
    return None

All exceptions are caught and suppressed. Failed queries return None. No error logging, no exception propagation, no retry logic.

Consequences:

  • Silent failures
  • No visibility into what went wrong
  • Difficult debugging
  • No distinction between "not found" and "service error"
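
A possible alternative that keeps "not found" and "service error" apart is sketched here. The ServiceError type, the safe_query helper, and the choice of which exceptions count as service failures are assumptions made for illustration.

```python
import logging

logger = logging.getLogger("linker")  # logger name is an assumption


class ServiceError(Exception):
    """The service could not be reached or returned an unusable response."""


def safe_query(service_name, query):
    """Run `query`; None now means only "not found".

    `query` is a hypothetical callable that returns None when nothing matches
    and raises on transport or parsing problems.
    """
    try:
        return query()
    except (ConnectionError, TimeoutError, ValueError) as exc:
        logger.warning("%s query failed: %s", service_name, exc)
        raise ServiceError(f"{service_name} query failed") from exc


not_found = safe_query("deezer", lambda: None)  # a legitimate empty result
```

Callers that want the old behavior can still catch ServiceError and return None, but they now make that choice explicitly and the failure is logged.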

Logging Architecture

Uses Python's standard logging module.

Batch processing: File-based logging to link_partitions.log. Includes timestamps, log levels, and progress information.

Library usage: Console logging. Minimal output.

Debug output: Multiple print() statements scattered throughout code. Not controlled by logging configuration.

Issues:

  • Debug prints in production code
  • No structured logging
  • No log levels for debug prints
  • No correlation IDs for tracking requests across services

Concurrency Model

Single-threaded, synchronous execution. No parallelization.

Query execution:

  • Services queried sequentially
  • No concurrent API calls
  • No async/await
  • No thread pools

Implications:

  • Slow batch processing (every track pays the latency of each service call in sequence)
  • Underutilized network bandwidth
  • Simple debugging (no race conditions)

Batch processing could benefit significantly from parallel execution.
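
Since the work is network-bound, a thread pool is one simple way to get that parallelism. The sketch below assumes a hypothetical link_one function that builds its own Align instance per track, because Align instances are not thread-safe.

```python
from concurrent.futures import ThreadPoolExecutor


def link_all(tracks, link_one, max_workers=4):
    """Apply `link_one` to every track concurrently, preserving input order.

    `link_one` is a hypothetical per-track function; it should create its own
    Align instance so that no state is shared between threads.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(link_one, tracks))


# Stand-in for a per-track lookup, so the example runs without network access.
results = link_all(["a", "b", "c"], lambda name: name.upper())
```

The worker count would need to stay low enough to respect each service's rate limits.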

Dependency Injection

No dependency injection. Service classes are instantiated directly inside Align rather than passed in by the caller.

Current pattern:

self.mb_align = MusicBrainzAlign(...)
self.deezer_align = DeezerAlign(...)

Implications:

  • Difficult to mock services for testing
  • Tight coupling between Align and service implementations
  • No interface-based programming
  • Hard to swap service implementations
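
For contrast, a constructor-injected variant might look like the sketch below. InjectedAlign, TrackMatcher, and FakeDeezer are hypothetical names, and the method signatures are simplified for illustration.

```python
from typing import Optional, Protocol


class TrackMatcher(Protocol):
    """Minimal interface a service aligner would need to satisfy."""

    def best_match(self, artist: str, track: str) -> Optional[dict]: ...


class InjectedAlign:
    """Hypothetical variant of Align that receives its services from the caller."""

    def __init__(self, artist: str, track: str, deezer: TrackMatcher):
        self.artist = artist
        self.track = track
        self._deezer = deezer

    def get_deezer_id(self):
        match = self._deezer.best_match(self.artist, self.track)
        return match["id"] if match else None


class FakeDeezer:
    """Test double that answers without any network access."""

    def best_match(self, artist, track):
        return {"id": 42}


deezer_id = InjectedAlign("artist name", "track name", FakeDeezer()).get_deezer_id()
```

With this shape, a test can pass in a fake service and exercise the orchestration logic on its own.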

State Management

State is managed in Align instance variables.

Cached values:

  • All input parameters (artist, track, album, etc.)
  • Retrieved metadata (MBID, ISRC, Deezer ID, etc.)
  • Service aligner instances

Cache invalidation: None. Values cached for lifetime of Align instance.

Thread safety: Not thread-safe. No locks, no synchronization.

Extension Points

Limited extensibility.

Adding new services:

  1. Create new service aligner class
  2. Instantiate in Align constructor
  3. Add getter methods to Align
  4. Update cascading fallback logic

No plugin system, no service registry, no abstract base classes.

Modifying matching logic: Requires editing service aligner classes directly. No strategy pattern, no configurable matchers.
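
A small registry built on an abstract base class is one way such an extension point could be added. Everything in this sketch (BaseAligner, register, ALIGNERS, ExampleAligner) is hypothetical and not part of the codebase.

```python
from abc import ABC, abstractmethod

ALIGNERS = {}  # maps a service name to its aligner class


def register(name):
    """Class decorator that adds an aligner to the registry under `name`."""
    def decorator(cls):
        ALIGNERS[name] = cls
        return cls
    return decorator


class BaseAligner(ABC):
    @abstractmethod
    def best_match(self, artist, track):
        """Return a match dict, or None when nothing matches."""


@register("example")
class ExampleAligner(BaseAligner):
    def best_match(self, artist, track):
        return {"artist": artist, "track": track}


match = ALIGNERS["example"]().best_match("artist name", "track name")
```

An orchestrator could then iterate over ALIGNERS instead of naming each service class, so adding a service would no longer require editing Align itself.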

Testing Architecture

No test suite. No test directory. No test configuration.

Testing approach:

  • Manual testing via Jupyter notebooks (deezer_test.ipynb, queries.ipynb)
  • if __name__ == "__main__" blocks in some modules
  • No unit tests, no integration tests, no mocks

Build and Packaging

Uses hatchling (PEP 517 build backend).

pyproject.toml structure:

  • Project metadata (name, version, authors)
  • Dependencies
  • Build system configuration

No setup.py. Modern Python packaging.

Distribution: GitHub only. Not published to PyPI.

Deployment Architecture

Library deployment: pip install from GitHub.

Batch processing deployment: Clone repository, install dependencies, run Python scripts directly.

No Docker containers, no systemd services, no process managers.

Performance Considerations

No performance optimization.

Bottlenecks:

  • Network latency (sequential API calls)
  • No caching across Align instances
  • No request batching
  • No connection pooling

Memory usage:

  • Minimal (only current track metadata in memory)
  • No large data structures
  • Pandas DataFrame for batch output (could be large for big datasets)

Security Architecture

Minimal security considerations.

API credentials:

  • MusicBrainz: No authentication
  • Deezer: No authentication
  • YouTube Music: No authentication
  • Spotify: OAuth2 client credentials in external file

Secrets management:

  • Spotify credentials in mml_secrets.py (not in repository)
  • No encryption
  • No environment variables
  • No secrets vault

Input validation:

  • No validation of user input
  • No sanitization of metadata strings
  • Potential injection vulnerabilities if metadata is ever used in shell commands

Architectural Strengths

  1. Simple facade: Single Align class hides complexity
  2. Cascading fallback: Graceful degradation when services fail
  3. Lazy evaluation: Only query services when needed
  4. Service isolation: Each service in separate class

Architectural Weaknesses

  1. No abstraction: Service classes have different interfaces
  2. Tight coupling: Align directly instantiates service classes
  3. No error handling: Exceptions are swallowed, so every failure is silent
  4. No concurrency: Sequential execution only
  5. Hardcoded configuration: No runtime flexibility
  6. No testing: Design is hard to test (tight coupling, no injection points for mocks)
  7. Dead code: AcousticBrainz integration non-functional
  8. Inconsistent patterns: Function for AcousticBrainz, classes for others

Architectural Recommendations

For production use, consider:

  1. Define service interface: Abstract base class for all aligners
  2. Dependency injection: Pass service instances to Align constructor
  3. Configuration system: External config for thresholds, endpoints, credentials
  4. Error handling: Explicit error types, logging, retry logic
  5. Async execution: Use asyncio for concurrent API calls
  6. Caching layer: Redis or in-memory cache for repeated queries
  7. Remove dead code: Delete AcousticBrainz integration
  8. Add tests: Unit tests with mocked services
  9. Structured logging: JSON logs with correlation IDs
  10. Rate limiting: Respect API rate limits with backoff
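
The retry and backoff parts of recommendations 4 and 10 could be sketched as follows. The with_backoff helper, its defaults, and the choice to retry only on ConnectionError are assumptions; a real integration would also honor each API's documented rate limits.

```python
import time


def with_backoff(call, retries=3, base_delay=1.0, sleep=time.sleep):
    """Retry `call` on ConnectionError, doubling the delay after each failure.

    The last failure is re-raised so the caller still sees the error.
    `sleep` is injectable so tests do not have to wait.
    """
    for attempt in range(retries + 1):
        try:
            return call()
        except ConnectionError:
            if attempt == retries:
                raise
            sleep(base_delay * 2 ** attempt)


attempts = []
delays = []


def flaky_service():
    """Simulated service that fails twice before answering."""
    attempts.append(1)
    if len(attempts) < 3:
        raise ConnectionError("temporarily unavailable")
    return "ok"


result = with_backoff(flaky_service, sleep=delays.append)
```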

The core pattern (cascading fallback across services) is sound. The implementation needs significant hardening.