Source: metadata-agregator/docs/research/minim/analysis/DATA.md (commit a1f6701bac by Alexander, 2026-04-28 16:28:53 +02:00)

minim: Data Management

Data Storage Architecture

minim does not use a database. All data falls into one of three categories:

  1. Ephemeral: API responses held in memory during execution
  2. Token Storage: OAuth tokens persisted to ~/minim.cfg
  3. Audio Metadata: Written to audio file tags via mutagen

There is no SQL database, NoSQL store, or caching layer, and nothing persists beyond the configuration file and the audio files themselves.

Token Storage

File Location

Path: ~/minim.cfg (expands to user's home directory)

Format: INI-style configuration file via Python's ConfigParser

Permissions: Default file permissions (typically 0644 on Unix: readable by every local user, writable only by the owner)

Security: Plain text storage. No encryption, no obfuscation, no OS keychain integration.

File Structure

[discogs]
consumer_key = Abcd1234Efgh5678
consumer_secret = IjklMnopQrstUvwx
access_token = YzabCdefGhijKlmn
access_token_secret = OpqrStuvWxyzAbcd

[qobuz]
app_id = 123456789
app_secret = abcdefghijklmnopqrstuvwxyz
email = user@example.com
password = MySecurePassword123
access_token = eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9...
expires_at = 1672531200

[spotify]
client_id = 1234567890abcdef1234567890abcdef
client_secret = fedcba0987654321fedcba0987654321
redirect_uri = http://localhost:8888
access_token = BQDxK7...truncated...
refresh_token = AQBz3...truncated...
expires_at = 1672527600
scopes = user-library-read,playlist-read-private,user-read-playback-state

[tidal]
client_id = abcdefgh-1234-5678-90ab-cdefghijklmn
client_secret = ijklmnop-qrst-uvwx-yzab-cdefghijklmn
access_token = eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9...
refresh_token = eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9...
user_id = 12345678
country_code = US
expires_at = 1672534800

Data Fields

Common Fields (OAuth 2.0):

  • client_id: Application identifier
  • client_secret: Application secret
  • access_token: Bearer token for API requests
  • refresh_token: Token for obtaining new access tokens
  • expires_at: Unix timestamp when access token expires
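
Because expires_at is a plain Unix timestamp, deciding whether a stored token is still usable is a single comparison. The following is a minimal sketch, not minim's own code: the function name is_token_expired and the 60-second leeway are assumptions made for illustration.

```python
import time
from typing import Optional


def is_token_expired(expires_at: int, leeway: int = 60,
                     now: Optional[float] = None) -> bool:
    """Return True if the token has expired or expires within `leeway` seconds.

    `now` can be injected for testing; by default the current time is used.
    """
    current = time.time() if now is None else now
    return current >= expires_at - leeway
```

The leeway treats a token as expired slightly early, so that a request started just before the deadline does not fail mid-flight.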

Service-Specific Fields:

Discogs (OAuth 1.0a):

  • consumer_key: OAuth consumer key
  • consumer_secret: OAuth consumer secret
  • access_token: OAuth access token
  • access_token_secret: OAuth access token secret
  • personal_access_token: Alternative to OAuth (from Discogs settings)

Qobuz:

  • app_id: Qobuz application ID (extracted from web player)
  • app_secret: Qobuz application secret (extracted from web player)
  • email: User email for password grant
  • password: User password (stored in plain text)

Spotify:

  • redirect_uri: OAuth redirect URI
  • scopes: Comma-separated list of permission scopes

TIDAL:

  • user_id: TIDAL user ID (numeric)
  • country_code: Two-letter country code for content availability

Read/Write Operations

Reading:

from configparser import ConfigParser
import os

config = ConfigParser()
config.read(os.path.expanduser("~/minim.cfg"))

if config.has_section("spotify"):
    access_token = config.get("spotify", "access_token", fallback=None)
    refresh_token = config.get("spotify", "refresh_token", fallback=None)
    expires_at = config.getint("spotify", "expires_at", fallback=0)

Writing:

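import os
import time
from configparser import ConfigParser

# new_access_token and new_refresh_token come from the service's token response
config = ConfigParser()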
config.read(os.path.expanduser("~/minim.cfg"))

if not config.has_section("spotify"):
    config.add_section("spotify")

config.set("spotify", "access_token", new_access_token)
config.set("spotify", "refresh_token", new_refresh_token)
config.set("spotify", "expires_at", str(int(time.time()) + 3600))

with open(os.path.expanduser("~/minim.cfg"), "w") as f:
    config.write(f)

Concurrency: Not thread-safe. Concurrent writes from multiple processes can corrupt the file. No file locking, no atomic writes.
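
One way to avoid a half-written file is to write to a temporary file in the same directory and then rename it over the target. The sketch below uses only the standard library and is an assumption about how this could be done, not something minim implements; write_config_atomically is a hypothetical name. It prevents torn writes but does not serialize concurrent writers: the last rename still wins.

```python
import os
import tempfile
from configparser import ConfigParser


def write_config_atomically(config: ConfigParser, path: str) -> None:
    """Write `config` to `path` so readers never observe a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    # mkstemp creates the file readable and writable only by the current user
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".minim-", suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            config.write(f)
            f.flush()
            os.fsync(f.fileno())  # push the data to disk before renaming
        os.replace(tmp_path, path)  # atomic when both paths share a filesystem
    except BaseException:
        # Clean up the temporary file if anything went wrong before the rename
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

A side effect of this approach on POSIX systems is that the resulting file carries the temporary file's 0600 mode rather than the default 0644.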

Security Implications

Risks:

  1. Plain Text Passwords: Qobuz passwords stored unencrypted
  2. Token Exposure: Access tokens readable by any process running as the user
  3. No Expiration Cleanup: Expired tokens remain in file indefinitely
  4. File Permissions: Default permissions may allow group/other read access

Mitigations (Not Implemented):

  • Store sensitive fields in the OS credential store instead of the config file (GNOME Keyring, macOS Keychain, Windows Credential Manager)
  • Set restrictive file permissions (0600, user-only read/write)
  • Use environment variables for sensitive credentials
  • Implement token rotation and cleanup
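
As an illustration of the environment-variable mitigation, the sketch below looks up a credential in the environment first and falls back to the config file. The MINIM_<SERVICE>_<KEY> naming scheme and the get_credential helper are hypothetical and are not part of minim.

```python
import os
from configparser import ConfigParser
from typing import Mapping, Optional


def get_credential(service: str, key: str, config: ConfigParser,
                   environ: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Prefer an environment variable, then fall back to the config file."""
    env = os.environ if environ is None else environ
    env_name = f"MINIM_{service}_{key}".upper()  # hypothetical naming scheme
    if env_name in env:
        return env[env_name]
    # fallback=None also covers the case where the section does not exist
    return config.get(service, key, fallback=None)
```

With this lookup order, a secret such as the Qobuz password can be supplied at runtime and left out of ~/minim.cfg entirely.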

Recommendation: For production use, replace file-based storage with secure credential management (AWS Secrets Manager, HashiCorp Vault, OS keychain).

Audio Metadata Storage

Tag Formats

minim writes metadata to audio files using format-specific tag systems:

Format       Tag System              Implementation
FLAC         Vorbis Comments         mutagen.flac.FLAC
MP3          ID3v2.4                 mutagen.id3.ID3
MP4/M4A      MP4 Atoms               mutagen.mp4.MP4
Ogg Vorbis   Vorbis Comments         mutagen.oggvorbis.OggVorbis
WAVE         ID3v2 (non-standard)    mutagen.wave.WAVE

Field Mapping

FLAC (Vorbis Comments):

TITLE = Track title
ARTIST = Primary artist(s)
ALBUMARTIST = Album artist
ALBUM = Album title
DATE = Release date (YYYY-MM-DD or YYYY)
GENRE = Genre
TRACKNUMBER = Track number
DISCNUMBER = Disc number
ISRC = International Standard Recording Code
BARCODE = UPC/EAN barcode
LYRICS = Song lyrics
COMMENT = Freeform comment
COPYRIGHT = Copyright notice
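PICTURE block = Embedded artwork (FLAC stores it in a native PICTURE metadata block via add_picture; Ogg Vorbis uses a base64-encoded METADATA_BLOCK_PICTURE comment instead)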

MP3 (ID3v2.4):

TIT2 = Track title
TPE1 = Primary artist(s)
TPE2 = Album artist
TALB = Album title
TDRC = Release date
TCON = Genre
TRCK = Track number (format: "3" or "3/12")
TPOS = Disc number (format: "1" or "1/2")
TSRC = ISRC
TXXX:BARCODE = UPC/EAN barcode (custom frame)
USLT = Unsynchronized lyrics
COMM = Comment
TCOP = Copyright
APIC = Attached picture (artwork)
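
The "n" or "n/total" form used by TRCK and TPOS has to be split before it can be compared with numeric values from an API. A minimal sketch, with parse_position as a hypothetical helper name:

```python
from typing import Optional, Tuple


def parse_position(value: str) -> Tuple[int, Optional[int]]:
    """Split an ID3 position string such as "3" or "3/12" into (number, total)."""
    number, _, total = value.partition("/")
    return int(number), (int(total) if total else None)
```

The total is returned as None when the frame holds only the position, which keeps "unknown total" distinct from a total of zero.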

MP4 (Atoms):

©nam = Track title
©ART = Primary artist(s)
aART = Album artist
©alb = Album title
©day = Release date
©gen = Genre
trkn = Track number (tuple: (track, total))
disk = Disc number (tuple: (disc, total))
----:com.apple.iTunes:ISRC = ISRC (custom atom)
----:com.apple.iTunes:BARCODE = UPC/EAN barcode
©lyr = Lyrics
©cmt = Comment
cprt = Copyright
covr = Cover art

Ogg Vorbis (Vorbis Comments): Same as FLAC (both use Vorbis Comments).

WAVE (ID3v2): Same as MP3 (WAVE files can contain ID3v2 tags, though non-standard).
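
The per-format mappings above can be collapsed into a single lookup table, which is roughly what a unified interface needs internally. This is a sketch built from the keys listed in this section; TAG_KEYS and tag_key are illustrative names and do not reflect minim's actual internals.

```python
# Tag keys per container, copied from the field mapping lists above.
_VORBIS = {"title": "TITLE", "artist": "ARTIST", "album": "ALBUM",
           "date": "DATE", "isrc": "ISRC"}
_ID3 = {"title": "TIT2", "artist": "TPE1", "album": "TALB",
        "date": "TDRC", "isrc": "TSRC"}
_MP4 = {"title": "©nam", "artist": "©ART", "album": "©alb",
        "date": "©day", "isrc": "----:com.apple.iTunes:ISRC"}

# Ogg Vorbis shares FLAC's keys; WAVE shares MP3's keys.
TAG_KEYS = {"flac": _VORBIS, "ogg": _VORBIS, "mp3": _ID3, "wav": _ID3, "mp4": _MP4}


def tag_key(container: str, field: str) -> str:
    """Return the format-specific tag key for a logical field name."""
    try:
        return TAG_KEYS[container][field]
    except KeyError:
        raise KeyError(f"no tag key for {field!r} in {container!r}") from None
```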

Write Operations

FLAC Example:

import mutagen.flac

audio = mutagen.flac.FLAC("track.flac")

# Text fields
audio["TITLE"] = "Creep"
audio["ARTIST"] = "Radiohead"
audio["ALBUM"] = "Pablo Honey"
audio["DATE"] = "1993"
audio["TRACKNUMBER"] = "2"
audio["DISCNUMBER"] = "1"
audio["ISRC"] = "GBAYE9200070"

# Artwork
picture = mutagen.flac.Picture()
picture.type = 3  # Front cover
picture.mime = "image/jpeg"
picture.desc = "Cover"
with open("cover.jpg", "rb") as f:  # close the file handle deterministically
    picture.data = f.read()
audio.add_picture(picture)

audio.save()

MP3 Example:

from mutagen.id3 import ID3, TIT2, TPE1, TALB, TDRC, TRCK, APIC

audio = ID3("track.mp3")

audio["TIT2"] = TIT2(encoding=3, text="Creep")
audio["TPE1"] = TPE1(encoding=3, text="Radiohead")
audio["TALB"] = TALB(encoding=3, text="Pablo Honey")
audio["TDRC"] = TDRC(encoding=3, text="1993")
audio["TRCK"] = TRCK(encoding=3, text="2/12")

with open("cover.jpg", "rb") as f:  # close the file handle deterministically
    cover_data = f.read()

audio["APIC"] = APIC(
    encoding=3,
    mime="image/jpeg",
    type=3,
    desc="Cover",
    data=cover_data
)

audio.save()

MP4 Example:

import mutagen.mp4

audio = mutagen.mp4.MP4("track.m4a")

audio["©nam"] = "Creep"
audio["©ART"] = "Radiohead"
audio["©alb"] = "Pablo Honey"
audio["©day"] = "1993"
audio["trkn"] = [(2, 12)]  # Track 2 of 12
audio["disk"] = [(1, 1)]   # Disc 1 of 1

with open("cover.jpg", "rb") as f:  # close the file handle deterministically
    cover_data = f.read()

audio["covr"] = [
    mutagen.mp4.MP4Cover(
        cover_data,
        imageformat=mutagen.mp4.MP4Cover.FORMAT_JPEG
    )
]

audio.save()

Read Operations

Auto-Detection:

import mutagen

audio = mutagen.File("track.flac")

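# Tag keys and value types are format-specific; use the line matching the detected format
title = audio.get("TITLE", [None])[0]  # FLAC/Ogg: list of strings
title = audio.get("TIT2", None)        # MP3: a TIT2 frame object (text in .text), or None
title = audio.get("©nam", [None])[0]   # MP4: list of strings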

minim Abstraction:

from minim.audio import Audio

audio = Audio("track.flac")  # Auto-detects format

# Unified interface
print(audio.title)
print(audio.artist)
print(audio.album)
print(audio.track_number)

Artwork Handling

Fetching from API:

import requests

# Spotify example
track = spotify_api.get_track("3n3Ppam7vgaVa1iaRUc9Lp")
artwork_url = track["album"]["images"][0]["url"]  # Largest image
artwork_data = requests.get(artwork_url).content

# TIDAL example
track = tidal_api.get_track(12345678)
cover_id = track["album"]["cover"].replace("-", "/")
artwork_url = f"https://resources.tidal.com/images/{cover_id}/1280x1280.jpg"
artwork_data = requests.get(artwork_url).content

Embedding in File:

audio = Audio("track.flac")
audio.artwork = artwork_data  # bytes
audio.write_metadata()

Image Formats: JPEG and PNG supported by all tag formats. JPEG preferred for smaller file size.

Size Considerations: Large artwork (>1MB) significantly increases file size. Recommendation: 600x600 to 1200x1200 pixels, JPEG quality 85-90%.
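
Since both JPEG and PNG are accepted, the MIME type written alongside the artwork should match the actual image data. The sketch below identifies the two formats by their leading signature bytes and flags artwork above the 1MB threshold mentioned above; detect_image_mime and is_oversized are hypothetical helper names.

```python
from typing import Optional

JPEG_MAGIC = b"\xff\xd8\xff"        # start-of-image marker plus next marker byte
PNG_MAGIC = b"\x89PNG\r\n\x1a\n"    # fixed 8-byte PNG signature


def detect_image_mime(data: bytes) -> Optional[str]:
    """Return the MIME type for JPEG or PNG data, or None if unrecognized."""
    if data.startswith(JPEG_MAGIC):
        return "image/jpeg"
    if data.startswith(PNG_MAGIC):
        return "image/png"
    return None


def is_oversized(data: bytes, limit: int = 1024 * 1024) -> bool:
    """Return True if the artwork exceeds `limit` bytes (1 MiB by default)."""
    return len(data) > limit
```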

Data Flow

API Response to Audio File

Complete Workflow:

from minim import spotify
from minim.audio import Audio

# 1. Authenticate
api = spotify.WebAPI(client_id="...", client_secret="...")
api.set_flow("client_credentials")
api.set_access_token()

# 2. Search for track
results = api.search("Radiohead Creep", types=["track"], limit=1)
track = results["tracks"]["items"][0]

# 3. Load audio file
audio = Audio("track.flac")

# 4. Map API response to metadata
audio.set_metadata_using_spotify(track)

# 5. Write to file
audio.write_metadata()

Data Transformations:

Step 4 (Mapping):

def set_metadata_using_spotify(self, track_data: dict):
    # Direct mappings
    self.title = track_data["name"]
    self.album = track_data["album"]["name"]
    self.date = track_data["album"]["release_date"]
    self.track_number = track_data["track_number"]
    self.disc_number = track_data["disc_number"]
    
    # Array to string
    self.artist = ", ".join(a["name"] for a in track_data["artists"])
    
    # Nested object
    self.isrc = track_data.get("external_ids", {}).get("isrc")
    
    # Fetch external resource
    if track_data["album"]["images"]:
        artwork_url = track_data["album"]["images"][0]["url"]
        self.artwork = requests.get(artwork_url).content

Step 5 (Writing):

# FLAC implementation
def write_metadata(self):
    self._file["TITLE"] = self.title
    self._file["ARTIST"] = self.artist
    self._file["ALBUM"] = self.album
    self._file["DATE"] = self.date
    self._file["TRACKNUMBER"] = str(self.track_number)
    self._file["DISCNUMBER"] = str(self.disc_number)
    
    if self.isrc:
        self._file["ISRC"] = self.isrc
    
    if self.artwork:
        picture = mutagen.flac.Picture()
        picture.data = self.artwork
        picture.type = 3
        picture.mime = "image/jpeg"
        self._file.add_picture(picture)
    
    self._file.save()

Service-Specific Normalization

Artist Handling:

Spotify (array of objects):

{
  "artists": [
    {"name": "Radiohead", "id": "4Z8W4fKeB5YxbusRsdQVPb"},
    {"name": "Thom Yorke", "id": "3WrFJ7ztbogyGnTHbHJFl2"}
  ]
}

Normalization: ", ".join(a["name"] for a in artists) → "Radiohead, Thom Yorke"

TIDAL (array of objects):

{
  "artists": [
    {"name": "Radiohead", "id": 4050}
  ]
}

Normalization: Same as Spotify.

iTunes (string):

{
  "artistName": "Radiohead"
}

Normalization: Direct assignment.

Qobuz (object):

{
  "performer": {"name": "Radiohead", "id": 12345}
}

Normalization: performer["name"]
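
The four artist shapes above can be handled by one dispatch function. This is a sketch based only on the payload shapes shown in this section; normalize_artist and the lowercase service identifiers are illustrative and not minim's API.

```python
def normalize_artist(service: str, payload: dict) -> str:
    """Reduce a service-specific artist payload to a single display string."""
    if service in ("spotify", "tidal"):
        # Array of artist objects: join the names
        return ", ".join(a["name"] for a in payload["artists"])
    if service == "itunes":
        # Already a plain string
        return payload["artistName"]
    if service == "qobuz":
        # Single performer object
        return payload["performer"]["name"]
    raise ValueError(f"unknown service: {service}")
```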

Date Handling:

Spotify:

  • Full date: "2023-01-15" → "2023-01-15"
  • Year only: "2023" → "2023"
  • Month precision: "2023-01" → "2023-01"

TIDAL:

  • ISO 8601 with time: "2023-01-15T00:00:00.000Z" → "2023-01-15" (strip time)

iTunes:

  • ISO 8601: "2023-01-15T00:00:00Z" → "2023-01-15"

Qobuz:

  • Unix timestamp: 1673740800 → datetime.fromtimestamp(1673740800).strftime("%Y-%m-%d")
  • ISO 8601: "2023-01-15" → "2023-01-15"
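
All of the date shapes listed above reduce to two cases: a numeric timestamp, or a string whose time portion (if any) follows a "T". The sketch below is one possible normalizer, with normalize_date as a hypothetical name. It converts timestamps in UTC, which is an assumption on my part: datetime.fromtimestamp without a timezone uses the local zone and can shift the result by one day.

```python
from datetime import datetime, timezone
from typing import Union


def normalize_date(value: Union[int, float, str]) -> str:
    """Normalize a service date to YYYY-MM-DD, YYYY-MM, or YYYY."""
    if isinstance(value, (int, float)):
        # Unix timestamp (Qobuz): interpret in UTC to avoid local-time drift
        return datetime.fromtimestamp(value, tz=timezone.utc).strftime("%Y-%m-%d")
    # ISO 8601 strings: drop the time portion, keep whatever precision remains
    return value.split("T", 1)[0]
```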

Track/Disc Number Handling:

Spotify:

{
  "track_number": 3,
  "disc_number": 1
}

Normalization: Direct assignment.

TIDAL:

{
  "trackNumber": 3,
  "volumeNumber": 1
}

Normalization: track_number = trackNumber, disc_number = volumeNumber

iTunes:

{
  "trackNumber": 3,
  "trackCount": 12
}

Normalization: track_number = trackNumber (ignore trackCount)

Qobuz:

{
  "track_number": 3,
  "media_number": 1
}

Normalization: track_number by direct assignment, disc_number = media_number
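
The track and disc mappings above differ only in field names, so a small lookup table covers all four services. NUMBER_FIELDS and normalize_numbers are hypothetical names; the iTunes entry has no disc field because none appears in the payload shown above.

```python
from typing import Optional, Tuple

# (track field, disc field) per service, taken from the payloads above
NUMBER_FIELDS = {
    "spotify": ("track_number", "disc_number"),
    "tidal": ("trackNumber", "volumeNumber"),
    "itunes": ("trackNumber", None),
    "qobuz": ("track_number", "media_number"),
}


def normalize_numbers(service: str, payload: dict) -> Tuple[int, Optional[int]]:
    """Return (track_number, disc_number) for a service payload."""
    track_key, disc_key = NUMBER_FIELDS[service]
    track = payload[track_key]
    disc = payload.get(disc_key) if disc_key else None
    return track, disc
```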

Format Conversion

FFmpeg Integration

Conversion Workflow:

audio = Audio("track.flac")

# Convert to MP3
mp3_audio = audio.convert("track.mp3", "mp3", bitrate="320k")

# Convert to AAC
m4a_audio = audio.convert("track.m4a", "m4a", bitrate="256k")

# Convert to Ogg Vorbis
ogg_audio = audio.convert("track.ogg", "ogg", quality=10)

FFmpeg Command Construction:

import subprocess

def convert(self, output_path: str, format: str, **options):
    cmd = ["ffmpeg", "-i", self.filepath]
    
    # Codec selection
    codec_map = {
        "flac": "flac",
        "mp3": "libmp3lame",
        "m4a": "aac",
        "ogg": "libvorbis",
        "wav": "pcm_s16le"
    }
    cmd.extend(["-c:a", codec_map[format]])
    
    # Options
    if "bitrate" in options:
        cmd.extend(["-b:a", options["bitrate"]])
    if "quality" in options:
        cmd.extend(["-q:a", str(options["quality"])])
    if "sample_rate" in options:
        cmd.extend(["-ar", str(options["sample_rate"])])
    
    cmd.append(output_path)
    
    subprocess.run(cmd, check=True)

Metadata Preservation:

# After conversion, copy metadata
converted = Audio(output_path)
converted.title = self.title
converted.artist = self.artist
converted.album = self.album
# ... copy all fields
converted.artwork = self.artwork
converted.write_metadata()

Lossy to Lossless: Converting lossy formats (MP3, AAC) to lossless (FLAC) does not improve quality. The FLAC encoding step itself adds no further loss, but the artifacts from the original lossy encode remain and the file becomes larger.

Lossless to Lossy: Converting FLAC to MP3/AAC reduces file size but discards audio information, and the loss is irreversible.

Data Validation

No Validation: minim does not validate metadata before writing to files.

Potential Issues:

  • Invalid dates (e.g., "2023-13-45") written as-is
  • Track numbers exceeding album track count
  • Non-numeric values in numeric fields
  • Oversized artwork (multi-megabyte images)

Recommendation: Implement validation layer:

import warnings
from datetime import datetime

from minim.audio import Audio

def validate_metadata(audio: Audio):
    # Date validation: accept full date, year-month, or year-only
    if audio.date:
        for fmt in ("%Y-%m-%d", "%Y-%m", "%Y"):
            try:
                datetime.strptime(audio.date, fmt)
                break
            except ValueError:
                continue
        else:
            raise ValueError(f"Invalid date format: {audio.date}")

    # Track number validation (explicit None check so that 0 is rejected)
    if audio.track_number is not None and audio.track_number < 1:
        raise ValueError(f"Invalid track number: {audio.track_number}")

    # Artwork size validation
    if audio.artwork and len(audio.artwork) > 2 * 1024 * 1024:  # 2MB
        warnings.warn(f"Large artwork: {len(audio.artwork)} bytes")

Data Retention

Token Expiration: Access tokens expire (typically after 1 hour for OAuth 2.0). Refresh tokens are used to obtain new access tokens without re-authentication.

Token Cleanup: Expired tokens remain in ~/minim.cfg indefinitely. No automatic cleanup.
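
A cleanup pass could be as simple as dropping the access token from every section whose expires_at has passed. The following is a sketch of that idea, not existing minim behavior; prune_expired_tokens is a hypothetical name. It leaves refresh tokens in place so a new access token can still be requested, and the caller is responsible for writing the config back to disk.

```python
import time
from configparser import ConfigParser
from typing import List, Optional


def prune_expired_tokens(config: ConfigParser,
                         now: Optional[float] = None) -> List[str]:
    """Remove expired access tokens in place and return the affected sections."""
    current = time.time() if now is None else now
    pruned = []
    for section in config.sections():
        expires_at = config.getint(section, "expires_at", fallback=None)
        # Sections without expires_at (e.g. OAuth 1.0a) are left untouched
        if expires_at is not None and expires_at <= current:
            config.remove_option(section, "access_token")
            config.remove_option(section, "expires_at")
            pruned.append(section)
    return pruned
```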

Audio Metadata: Persists in files until overwritten or file deleted.

API Response Caching: Not implemented. Every request hits the API.

Data Privacy

Sensitive Data in Config File:

  • User passwords (Qobuz)
  • Access tokens (all services)
  • Refresh tokens (OAuth 2.0 services)
  • User IDs and email addresses

Exposure Risks:

  • Backup systems may copy ~/minim.cfg to cloud storage
  • Version control systems may accidentally commit config file
  • Malware can read tokens and impersonate user

Recommendations:

  1. Add minim.cfg to .gitignore (or to a global gitignore) if the home directory or dotfiles are under version control
  2. Exclude from cloud backup or encrypt backups
  3. Use environment variables for CI/CD
  4. Rotate tokens regularly
  5. Revoke tokens when no longer needed

Summary

minim's data management is minimal and file-based:

  • No database: All data is ephemeral or file-based
  • Token storage: Plain text INI file at ~/minim.cfg
  • Audio metadata: Written to file tags via mutagen
  • No caching: API responses not persisted
  • No validation: Metadata written as-is without checks

This approach is simple and suitable for personal use but lacks security and robustness for production systems. The v2 rewrite addresses security concerns with OS keychain integration and adds validation layers.

For a metadata aggregator project, consider:

  • Secure credential storage (OS keychain, secrets manager)
  • Database for caching API responses (reduce API calls)
  • Metadata validation before writing to files
  • Audit logging for data access and modifications