feat: initial implementation of metadata aggregator

- gRPC service with MusicBrainz provider
- PostgreSQL schema with migrations
- Service layer with database-first caching
- Repository pattern for data access
- YAML configuration support
- Research documentation for 17 music metadata projects
Alexander
2026-04-28 16:27:14 +02:00
commit a1f6701bac
163 changed files with 95884 additions and 0 deletions
# minim: Data Management
## Data Storage Architecture
minim does **not use a database**. All data is either:
1. **Ephemeral:** API responses held in memory during execution
2. **Token Storage:** OAuth tokens persisted to `~/minim.cfg`
3. **Audio Metadata:** Written to audio file tags via mutagen
There is no SQL database, no NoSQL store, no caching layer, no persistent data beyond configuration and audio files.
## Token Storage
### File Location
**Path:** `~/minim.cfg` (expands to user's home directory)
**Format:** INI-style configuration file via Python's `ConfigParser`
**Permissions:** Default file permissions (typically 0644 on Unix: readable by owner, group, and others; writable by owner only)
**Security:** Plain text storage. No encryption, no obfuscation, no OS keychain integration.
### File Structure
```ini
[discogs]
consumer_key = Abcd1234Efgh5678
consumer_secret = IjklMnopQrstUvwx
access_token = YzabCdefGhijKlmn
access_token_secret = OpqrStuvWxyzAbcd
[qobuz]
app_id = 123456789
app_secret = abcdefghijklmnopqrstuvwxyz
email = user@example.com
password = MySecurePassword123
access_token = eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9...
expires_at = 1672531200
[spotify]
client_id = 1234567890abcdef1234567890abcdef
client_secret = fedcba0987654321fedcba0987654321
redirect_uri = http://localhost:8888
access_token = BQDxK7...truncated...
refresh_token = AQBz3...truncated...
expires_at = 1672527600
scopes = user-library-read,playlist-read-private,user-read-playback-state
[tidal]
client_id = abcdefgh-1234-5678-90ab-cdefghijklmn
client_secret = ijklmnop-qrst-uvwx-yzab-cdefghijklmn
access_token = eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9...
refresh_token = eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9...
user_id = 12345678
country_code = US
expires_at = 1672534800
```
### Data Fields
**Common Fields (OAuth 2.0):**
- `client_id`: Application identifier
- `client_secret`: Application secret
- `access_token`: Bearer token for API requests
- `refresh_token`: Token for obtaining new access tokens
- `expires_at`: Unix timestamp when access token expires
**Service-Specific Fields:**
**Discogs (OAuth 1.0a):**
- `consumer_key`: OAuth consumer key
- `consumer_secret`: OAuth consumer secret
- `access_token`: OAuth access token
- `access_token_secret`: OAuth access token secret
- `personal_access_token`: Alternative to OAuth (from Discogs settings)
**Qobuz:**
- `app_id`: Qobuz application ID (extracted from web player)
- `app_secret`: Qobuz application secret (extracted from web player)
- `email`: User email for password grant
- `password`: User password (stored in plain text)
**Spotify:**
- `redirect_uri`: OAuth redirect URI
- `scopes`: Comma-separated list of permission scopes
**TIDAL:**
- `user_id`: TIDAL user ID (numeric)
- `country_code`: Two-letter country code for content availability
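The `expires_at` field lends itself to a freshness check before each request. The following is a minimal sketch, not minim's own code: the function name `token_is_expired` and the `skew_seconds` safety margin are assumptions introduced for illustration.

```python
import time
from typing import Optional


def token_is_expired(expires_at: int,
                     skew_seconds: int = 60,
                     now: Optional[float] = None) -> bool:
    """Return True when the access token should be refreshed.

    expires_at is the Unix timestamp stored in the config file.
    skew_seconds (hypothetical parameter) treats the token as expired
    slightly early so a request does not fail mid-flight.
    """
    current = time.time() if now is None else now
    return current >= expires_at - skew_seconds
```

A caller would read `expires_at` with `config.getint(...)` and request a new access token whenever this returns `True`.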
### Read/Write Operations
**Reading:**
```python
from configparser import ConfigParser
import os
config = ConfigParser()
config.read(os.path.expanduser("~/minim.cfg"))
if config.has_section("spotify"):
    access_token = config.get("spotify", "access_token", fallback=None)
    refresh_token = config.get("spotify", "refresh_token", fallback=None)
    expires_at = config.getint("spotify", "expires_at", fallback=0)
```
**Writing:**
```python
from configparser import ConfigParser
import os
import time
config = ConfigParser()
config.read(os.path.expanduser("~/minim.cfg"))
if not config.has_section("spotify"):
    config.add_section("spotify")
# new_access_token and new_refresh_token come from the token endpoint response
config.set("spotify", "access_token", new_access_token)
config.set("spotify", "refresh_token", new_refresh_token)
config.set("spotify", "expires_at", str(int(time.time()) + 3600))
with open(os.path.expanduser("~/minim.cfg"), "w") as f:
    config.write(f)
```
**Concurrency:** Not thread-safe. Concurrent writes from multiple processes can corrupt the file. No file locking, no atomic writes.
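One common way to avoid a half-written file is to write to a temporary file in the same directory and then rename it over the target. The sketch below shows that technique under stated assumptions: `write_config_atomically` is a hypothetical helper, not part of minim, and it only prevents torn writes. Two processes that read, modify, and write at the same time can still overwrite each other's changes, which would need a lock on top.

```python
import os
import tempfile
from configparser import ConfigParser


def write_config_atomically(config: ConfigParser, path: str) -> None:
    """Write config to path via a temp file and an atomic rename."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live on the same filesystem for os.replace to be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".minim-", suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            config.write(handle)
            handle.flush()
            os.fsync(handle.fileno())  # Push the data to disk before renaming
        os.chmod(tmp_path, 0o600)      # Owner-only read/write
        os.replace(tmp_path, path)     # Readers see the old or the new file, never a mix
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

As a side effect, the resulting file carries 0600 permissions on Unix instead of the process default.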
### Security Implications
**Risks:**
1. **Plain Text Passwords:** Qobuz passwords stored unencrypted
2. **Token Exposure:** Access tokens readable by any process running as the user
3. **No Expiration Cleanup:** Expired tokens remain in file indefinitely
4. **File Permissions:** Default permissions may allow group/other read access
**Mitigations (Not Implemented):**
- Encrypt sensitive fields using OS keychain (Keyring, Keychain Access, Windows Credential Manager)
- Set restrictive file permissions (0600, user-only read/write)
- Use environment variables for sensitive credentials
- Implement token rotation and cleanup
**Recommendation:** For production use, replace file-based storage with secure credential management (AWS Secrets Manager, HashiCorp Vault, OS keychain).
## Audio Metadata Storage
### Tag Formats
minim writes metadata to audio files using format-specific tag systems:
| Format | Tag System | Implementation |
|--------|------------|----------------|
| FLAC | Vorbis Comments | `mutagen.flac.FLAC` |
| MP3 | ID3v2.4 | `mutagen.id3.ID3` |
| MP4/M4A | MP4 Atoms | `mutagen.mp4.MP4` |
| Ogg Vorbis | Vorbis Comments | `mutagen.oggvorbis.OggVorbis` |
| WAVE | ID3v2 (non-standard) | `mutagen.wave.WAVE` |
### Field Mapping
**FLAC (Vorbis Comments):**
```
TITLE = Track title
ARTIST = Primary artist(s)
ALBUMARTIST = Album artist
ALBUM = Album title
DATE = Release date (YYYY-MM-DD or YYYY)
GENRE = Genre
TRACKNUMBER = Track number
DISCNUMBER = Disc number
ISRC = International Standard Recording Code
BARCODE = UPC/EAN barcode
LYRICS = Song lyrics
COMMENT = Freeform comment
COPYRIGHT = Copyright notice
PICTURE block = Embedded artwork (native FLAC metadata block, stored outside the Vorbis comments)
```
**MP3 (ID3v2.4):**
```
TIT2 = Track title
TPE1 = Primary artist(s)
TPE2 = Album artist
TALB = Album title
TDRC = Release date
TCON = Genre
TRCK = Track number (format: "3" or "3/12")
TPOS = Disc number (format: "1" or "1/2")
TSRC = ISRC
TXXX:BARCODE = UPC/EAN barcode (custom frame)
USLT = Unsynchronized lyrics
COMM = Comment
TCOP = Copyright
APIC = Attached picture (artwork)
```
**MP4 (Atoms):**
```
©nam = Track title
©ART = Primary artist(s)
aART = Album artist
©alb = Album title
©day = Release date
©gen = Genre
trkn = Track number (tuple: (track, total))
disk = Disc number (tuple: (disc, total))
----:com.apple.iTunes:ISRC = ISRC (custom atom)
----:com.apple.iTunes:BARCODE = UPC/EAN barcode
©lyr = Lyrics
©cmt = Comment
cprt = Copyright
covr = Cover art
```
**Ogg Vorbis (Vorbis Comments):**
Text fields are the same as FLAC (both use Vorbis Comments). Artwork is stored as a base64-encoded `METADATA_BLOCK_PICTURE` comment.
**WAVE (ID3v2):**
Same as MP3 (WAVE files can contain ID3v2 tags, though non-standard).
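The three mappings above can be collapsed into a single lookup table keyed by a format-neutral field name. This is a sketch of one possible layout, not minim's internal structure: `FIELD_MAP`, `tag_key`, and the neutral field names are assumptions, while the tag keys themselves are copied from the lists above.

```python
# Neutral field name -> tag key per tag system (keys taken from the lists above).
FIELD_MAP = {
    "title":        {"vorbis": "TITLE",       "id3": "TIT2", "mp4": "©nam"},
    "artist":       {"vorbis": "ARTIST",      "id3": "TPE1", "mp4": "©ART"},
    "album_artist": {"vorbis": "ALBUMARTIST", "id3": "TPE2", "mp4": "aART"},
    "album":        {"vorbis": "ALBUM",       "id3": "TALB", "mp4": "©alb"},
    "date":         {"vorbis": "DATE",        "id3": "TDRC", "mp4": "©day"},
    "genre":        {"vorbis": "GENRE",       "id3": "TCON", "mp4": "©gen"},
    "track_number": {"vorbis": "TRACKNUMBER", "id3": "TRCK", "mp4": "trkn"},
    "disc_number":  {"vorbis": "DISCNUMBER",  "id3": "TPOS", "mp4": "disk"},
    "isrc":         {"vorbis": "ISRC",        "id3": "TSRC",
                     "mp4": "----:com.apple.iTunes:ISRC"},
    "barcode":      {"vorbis": "BARCODE",     "id3": "TXXX:BARCODE",
                     "mp4": "----:com.apple.iTunes:BARCODE"},
    "lyrics":       {"vorbis": "LYRICS",      "id3": "USLT", "mp4": "©lyr"},
    "comment":      {"vorbis": "COMMENT",     "id3": "COMM", "mp4": "©cmt"},
    "copyright":    {"vorbis": "COPYRIGHT",   "id3": "TCOP", "mp4": "cprt"},
}


def tag_key(field: str, tag_system: str) -> str:
    """Look up the tag key for a neutral field name.

    tag_system is "vorbis" (FLAC, Ogg Vorbis), "id3" (MP3, WAVE), or "mp4".
    Raises KeyError for unknown fields or tag systems.
    """
    return FIELD_MAP[field][tag_system]
```

Only the key names are unified here. The value shapes still differ per format: plain strings for Vorbis Comments, frame objects for ID3, and tuples for `trkn` and `disk`.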
### Write Operations
**FLAC Example:**
```python
import mutagen.flac
audio = mutagen.flac.FLAC("track.flac")
# Text fields
audio["TITLE"] = "Creep"
audio["ARTIST"] = "Radiohead"
audio["ALBUM"] = "Pablo Honey"
audio["DATE"] = "1993"
audio["TRACKNUMBER"] = "2"
audio["DISCNUMBER"] = "1"
audio["ISRC"] = "GBAYE9200070"
# Artwork
picture = mutagen.flac.Picture()
picture.type = 3 # Front cover
picture.mime = "image/jpeg"
picture.desc = "Cover"
with open("cover.jpg", "rb") as f:  # Close the file handle after reading
    picture.data = f.read()
audio.add_picture(picture)
audio.save()
```
**MP3 Example:**
```python
from mutagen.id3 import ID3, TIT2, TPE1, TALB, TDRC, TRCK, APIC
audio = ID3("track.mp3")  # Raises ID3NoHeaderError if the file has no ID3 tag
# encoding=3 selects UTF-8
audio["TIT2"] = TIT2(encoding=3, text="Creep")
audio["TPE1"] = TPE1(encoding=3, text="Radiohead")
audio["TALB"] = TALB(encoding=3, text="Pablo Honey")
audio["TDRC"] = TDRC(encoding=3, text="1993")
audio["TRCK"] = TRCK(encoding=3, text="2/12")
with open("cover.jpg", "rb") as f:
    cover = f.read()
# add() stores the frame under its own key ("APIC:Cover")
audio.add(APIC(
    encoding=3,
    mime="image/jpeg",
    type=3,  # Front cover
    desc="Cover",
    data=cover,
))
audio.save()
```
**MP4 Example:**
```python
import mutagen.mp4
audio = mutagen.mp4.MP4("track.m4a")
audio["©nam"] = "Creep"
audio["©ART"] = "Radiohead"
audio["©alb"] = "Pablo Honey"
audio["©day"] = "1993"
audio["trkn"] = [(2, 12)] # Track 2 of 12
audio["disk"] = [(1, 1)] # Disc 1 of 1
with open("cover.jpg", "rb") as f:  # Close the file handle after reading
    cover = f.read()
audio["covr"] = [
    mutagen.mp4.MP4Cover(cover, imageformat=mutagen.mp4.MP4Cover.FORMAT_JPEG)
]
audio.save()
```
### Read Operations
**Auto-Detection:**
```python
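import mutagen
audio = mutagen.File("track.flac")  # Returns None if the format is not recognized
# Tag keys still depend on the detected format; use the line that matches the file
title = audio.get("TITLE", [None])[0] # FLAC/Ogg
# title = audio.get("TIT2", None) # MP3 (returns a frame object, not a str)
# title = audio.get("©nam", [None])[0] # MP4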
```
**minim Abstraction:**
```python
from minim.audio import Audio
audio = Audio("track.flac") # Auto-detects format
# Unified interface
print(audio.title)
print(audio.artist)
print(audio.album)
print(audio.track_number)
```
### Artwork Handling
**Fetching from API:**
```python
import requests
# Spotify example
track = spotify_api.get_track("3n3Ppam7vgaVa1iaRUc9Lp")
artwork_url = track["album"]["images"][0]["url"] # Largest image
artwork_data = requests.get(artwork_url).content
# TIDAL example
track = tidal_api.get_track(12345678)
cover_id = track["album"]["cover"].replace("-", "/")
artwork_url = f"https://resources.tidal.com/images/{cover_id}/1280x1280.jpg"
artwork_data = requests.get(artwork_url).content
```
**Embedding in File:**
```python
audio = Audio("track.flac")
audio.artwork = artwork_data # bytes
audio.write_metadata()
```
**Image Formats:** JPEG and PNG supported by all tag formats. JPEG preferred for smaller file size.
**Size Considerations:** Large artwork (>1MB) significantly increases file size. Recommendation: 600x600 to 1200x1200 pixels, JPEG quality 85-90%.
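Because artwork arrives as raw bytes from an HTTP response, the image type and size can be checked before embedding rather than assuming JPEG. The sketch below detects the format from the file signature and flags oversized images. `sniff_image_mime`, `describe_artwork`, and the 1 MB default threshold are hypothetical and mirror the guidance above; none of them come from minim.

```python
from typing import Optional, Tuple

JPEG_MAGIC = b"\xff\xd8\xff"        # Start-of-image marker
PNG_MAGIC = b"\x89PNG\r\n\x1a\n"    # 8-byte PNG signature


def sniff_image_mime(data: bytes) -> Optional[str]:
    """Return the MIME type implied by the leading bytes, or None."""
    if data.startswith(JPEG_MAGIC):
        return "image/jpeg"
    if data.startswith(PNG_MAGIC):
        return "image/png"
    return None


def describe_artwork(data: bytes, max_bytes: int = 1024 * 1024) -> Tuple[str, bool]:
    """Return (mime_type, is_oversized) for artwork bytes.

    Raises ValueError when the data is neither JPEG nor PNG.
    """
    mime = sniff_image_mime(data)
    if mime is None:
        raise ValueError("Artwork is neither JPEG nor PNG")
    return mime, len(data) > max_bytes
```

The returned MIME type can be passed to `Picture.mime` or `APIC(mime=...)` instead of a hard-coded `"image/jpeg"`.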
## Data Flow
### API Response to Audio File
**Complete Workflow:**
```python
from minim import spotify
from minim.audio import Audio
# 1. Authenticate
api = spotify.WebAPI(client_id="...", client_secret="...")
api.set_flow("client_credentials")
api.set_access_token()
# 2. Search for track
results = api.search("Radiohead Creep", types=["track"], limit=1)
track = results["tracks"]["items"][0]
# 3. Load audio file
audio = Audio("track.flac")
# 4. Map API response to metadata
audio.set_metadata_using_spotify(track)
# 5. Write to file
audio.write_metadata()
```
**Data Transformations:**
**Step 4 (Mapping):**
```python
import requests

def set_metadata_using_spotify(self, track_data: dict):
    # Direct mappings
    self.title = track_data["name"]
    self.album = track_data["album"]["name"]
    self.date = track_data["album"]["release_date"]
    self.track_number = track_data["track_number"]
    self.disc_number = track_data["disc_number"]
    # Array to string
    self.artist = ", ".join(a["name"] for a in track_data["artists"])
    # Nested object (may be absent)
    self.isrc = track_data.get("external_ids", {}).get("isrc")
    # Fetch external resource (first image is the largest)
    if track_data["album"]["images"]:
        artwork_url = track_data["album"]["images"][0]["url"]
        self.artwork = requests.get(artwork_url).content
```
**Step 5 (Writing):**
```python
# FLAC implementation
def write_metadata(self):
    self._file["TITLE"] = self.title
    self._file["ARTIST"] = self.artist
    self._file["ALBUM"] = self.album
    self._file["DATE"] = self.date
    self._file["TRACKNUMBER"] = str(self.track_number)
    self._file["DISCNUMBER"] = str(self.disc_number)
    if self.isrc:
        self._file["ISRC"] = self.isrc
    if self.artwork:
        picture = mutagen.flac.Picture()
        picture.data = self.artwork
        picture.type = 3
        picture.mime = "image/jpeg"
        self._file.add_picture(picture)
    self._file.save()
```
### Service-Specific Normalization
**Artist Handling:**
**Spotify (array of objects):**
```json
{
"artists": [
{"name": "Radiohead", "id": "4Z8W4fKeB5YxbusRsdQVPb"},
{"name": "Thom Yorke", "id": "3WrFJ7ztbogyGnTHbHJFl2"}
]
}
```
**Normalization:** `", ".join(a["name"] for a in artists)` → `"Radiohead, Thom Yorke"`
**TIDAL (array of objects):**
```json
{
"artists": [
{"name": "Radiohead", "id": 4050}
]
}
```
**Normalization:** Same as Spotify.
**iTunes (string):**
```json
{
"artistName": "Radiohead"
}
```
**Normalization:** Direct assignment.
**Qobuz (object):**
```json
{
"performer": {"name": "Radiohead", "id": 12345}
}
```
**Normalization:** `performer["name"]`
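The four artist shapes above can be handled by one dispatch function. This is a sketch built only from the payload shapes shown in this section; `normalize_artist` and the lowercase service names are hypothetical and are not minim's API.

```python
def normalize_artist(service: str, payload: dict) -> str:
    """Return a single display string for the artist(s) in a track payload."""
    if service in ("spotify", "tidal"):
        # Array of objects: join the names in order
        return ", ".join(artist["name"] for artist in payload["artists"])
    if service == "itunes":
        # Already a plain string
        return payload["artistName"]
    if service == "qobuz":
        # Single nested object
        return payload["performer"]["name"]
    raise ValueError(f"Unsupported service: {service}")
```

Joining with `", "` is lossy for artists whose names contain a comma, so a caller that needs the individual names should keep the original list alongside the joined string.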
**Date Handling:**
**Spotify:**
- Full date: `"2023-01-15"` → `"2023-01-15"`
- Year only: `"2023"` → `"2023"`
- Month precision: `"2023-01"` → `"2023-01"`
**TIDAL:**
- ISO 8601 with time: `"2023-01-15T00:00:00.000Z"` → `"2023-01-15"` (strip time)
**iTunes:**
- ISO 8601: `"2023-01-15T00:00:00Z"` → `"2023-01-15"`
**Qobuz:**
- Unix timestamp: `1673740800` → `datetime.fromtimestamp(1673740800, tz=timezone.utc).strftime("%Y-%m-%d")` (convert in UTC; a local-time conversion can shift the date by one day)
- ISO 8601: `"2023-01-15"` → `"2023-01-15"`
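The date rules above fit in one function that accepts either a string or a Unix timestamp. This is a minimal sketch assuming only the input shapes listed in this section; `normalize_date` is a hypothetical name, and timestamps are interpreted in UTC so the result does not depend on the machine's time zone.

```python
from datetime import datetime, timezone
from typing import Union


def normalize_date(value: Union[str, int, float]) -> str:
    """Normalize a service date to YYYY-MM-DD, YYYY-MM, or YYYY."""
    if isinstance(value, (int, float)):
        # Unix timestamp (Qobuz): convert in UTC
        return datetime.fromtimestamp(value, tz=timezone.utc).strftime("%Y-%m-%d")
    text = value.strip()
    if "T" in text:
        # ISO 8601 with a time part (TIDAL, iTunes): keep the date only
        text = text.split("T", 1)[0]
    # Keep the precision the service supplied, but reject impossible dates
    for fmt in ("%Y-%m-%d", "%Y-%m", "%Y"):
        try:
            return datetime.strptime(text, fmt).strftime(fmt)
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date: {value!r}")
```

Parsing through `strptime` also catches values such as `"2023-13-45"` before they reach a tag.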
**Track/Disc Number Handling:**
**Spotify:**
```json
{
"track_number": 3,
"disc_number": 1
}
```
**Normalization:** Direct assignment.
**TIDAL:**
```json
{
"trackNumber": 3,
"volumeNumber": 1
}
```
**Normalization:** `track_number = trackNumber`, `disc_number = volumeNumber`
**iTunes:**
```json
{
"trackNumber": 3,
"trackCount": 12
}
```
**Normalization:** `track_number = trackNumber` (ignore `trackCount`)
**Qobuz:**
```json
{
"track_number": 3,
"media_number": 1
}
```
**Normalization:** Direct assignment.
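Since the services differ only in key names here, the track and disc mapping can be expressed as data. The sketch below uses just the keys shown in the payloads above; `POSITION_KEYS` and `normalize_position` are hypothetical names. The iTunes payload shown above carries no disc field, so the sketch returns `None` for its disc number.

```python
from typing import Optional, Tuple

# service -> (track key, disc key); None means the payload above has no such field
POSITION_KEYS = {
    "spotify": ("track_number", "disc_number"),
    "tidal": ("trackNumber", "volumeNumber"),
    "itunes": ("trackNumber", None),
    "qobuz": ("track_number", "media_number"),
}


def normalize_position(service: str, payload: dict) -> Tuple[int, Optional[int]]:
    """Return (track_number, disc_number) for a track payload."""
    try:
        track_key, disc_key = POSITION_KEYS[service]
    except KeyError:
        raise ValueError(f"Unsupported service: {service}") from None
    track_number = int(payload[track_key])
    disc_number = int(payload[disc_key]) if disc_key is not None else None
    return track_number, disc_number
```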
## Format Conversion
### FFmpeg Integration
**Conversion Workflow:**
```python
audio = Audio("track.flac")
# Convert to MP3
mp3_audio = audio.convert("track.mp3", "mp3", bitrate="320k")
# Convert to AAC
m4a_audio = audio.convert("track.m4a", "m4a", bitrate="256k")
# Convert to Ogg Vorbis
ogg_audio = audio.convert("track.ogg", "ogg", quality=10)
```
**FFmpeg Command Construction:**
```python
import subprocess

def convert(self, output_path: str, format: str, **options):
    cmd = ["ffmpeg", "-i", self.filepath]
    # Codec selection
    codec_map = {
        "flac": "flac",
        "mp3": "libmp3lame",
        "m4a": "aac",
        "ogg": "libvorbis",
        "wav": "pcm_s16le"
    }
    cmd.extend(["-c:a", codec_map[format]])
    # Options
    if "bitrate" in options:
        cmd.extend(["-b:a", options["bitrate"]])
    if "quality" in options:
        cmd.extend(["-q:a", str(options["quality"])])
    if "sample_rate" in options:
        cmd.extend(["-ar", str(options["sample_rate"])])
    cmd.append(output_path)
    # check=True raises CalledProcessError if ffmpeg exits with an error
    subprocess.run(cmd, check=True)
```
**Metadata Preservation:**
```python
# After conversion, copy metadata
converted = Audio(output_path)
converted.title = self.title
converted.artist = self.artist
converted.album = self.album
# ... copy all fields
converted.artwork = self.artwork
converted.write_metadata()
```
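**Lossy to Lossless:** Converting lossy formats (MP3, AAC) to lossless (FLAC) does not improve quality. The FLAC encoding step adds no further loss, but it cannot restore information the lossy encoder already discarded, so the result is a larger file with the same audio quality.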
**Lossless to Lossy:** Converting FLAC to MP3/AAC reduces file size but loses audio information. Irreversible.
## Data Validation
**No Validation:** minim does not validate metadata before writing to files.
**Potential Issues:**
- Invalid dates (e.g., `"2023-13-45"`) written as-is
- Track numbers exceeding album track count
- Non-numeric values in numeric fields
- Oversized artwork (multi-megabyte images)
**Recommendation:** Implement validation layer:
```python
import warnings
from datetime import datetime

def validate_metadata(audio: Audio):
    # Date validation: accept full date, year-month, or year only
    if audio.date:
        for fmt in ("%Y-%m-%d", "%Y-%m", "%Y"):
            try:
                datetime.strptime(audio.date, fmt)
                break
            except ValueError:
                continue
        else:
            raise ValueError(f"Invalid date format: {audio.date}")
    # Track number validation (compare against None so that 0 is rejected)
    if audio.track_number is not None and audio.track_number < 1:
        raise ValueError(f"Invalid track number: {audio.track_number}")
    # Artwork size validation
    if audio.artwork and len(audio.artwork) > 2 * 1024 * 1024: # 2MB
        warnings.warn(f"Large artwork: {len(audio.artwork)} bytes")
```
## Data Retention
**Token Expiration:** Access tokens expire (typically after 1 hour for OAuth 2.0). Refresh tokens are used to obtain new access tokens without re-authentication.
**Token Cleanup:** Expired tokens remain in `~/minim.cfg` indefinitely. No automatic cleanup.
**Audio Metadata:** Persists in files until overwritten or file deleted.
**API Response Caching:** Not implemented. Every request hits the API.
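Cleanup of expired access tokens could be added as a small pass over the config sections. The sketch below is not part of minim: `prune_expired_tokens` is a hypothetical helper, and it assumes every section that expires stores an integer `expires_at`, as in the file structure shown earlier. Refresh tokens and client credentials are left in place so a new access token can still be obtained.

```python
import time
from configparser import ConfigParser
from typing import List, Optional


def prune_expired_tokens(config: ConfigParser, now: Optional[float] = None) -> List[str]:
    """Remove expired access tokens in place; return the affected section names."""
    current = time.time() if now is None else now
    pruned = []
    for section in config.sections():
        # Sections without expires_at (e.g. OAuth 1.0a) are left untouched
        expires_at = config.getint(section, "expires_at", fallback=None)
        if expires_at is None or expires_at > current:
            continue
        config.remove_option(section, "access_token")
        config.remove_option(section, "expires_at")
        pruned.append(section)
    return pruned
```

The caller still has to write the config back to disk for the cleanup to persist.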
## Data Privacy
**Sensitive Data in Config File:**
- User passwords (Qobuz)
- Access tokens (all services)
- Refresh tokens (OAuth 2.0 services)
- User IDs and email addresses
**Exposure Risks:**
- Backup systems may copy `~/minim.cfg` to cloud storage
- Version control systems may accidentally commit config file
- Malware can read tokens and impersonate user
**Recommendations:**
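1. Keep `minim.cfg` out of version control: add it to `.gitignore` (or a global gitignore) if the home directory or any copy of the file lives in a repository; `.gitignore` does not expand `~`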
2. Exclude from cloud backup or encrypt backups
3. Use environment variables for CI/CD
4. Rotate tokens regularly
5. Revoke tokens when no longer needed
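The environment-variable recommendation can be layered over the existing file so CI jobs never need a `minim.cfg` on disk. The sketch below is an assumption-laden illustration: `get_credential` and the `MINIM_<SECTION>_<OPTION>` naming scheme are hypothetical and are not something minim defines.

```python
import os
from configparser import ConfigParser
from typing import Mapping, Optional


def get_credential(config: ConfigParser,
                   section: str,
                   option: str,
                   environ: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return a credential, preferring the environment over the config file.

    The variable name is MINIM_<SECTION>_<OPTION> in upper case
    (hypothetical convention), e.g. MINIM_SPOTIFY_CLIENT_SECRET.
    """
    env = os.environ if environ is None else environ
    env_name = f"MINIM_{section}_{option}".upper()
    if env_name in env:
        return env[env_name]
    # Fall back to the file; None when the section or option is missing
    return config.get(section, option, fallback=None)
```

The `environ` parameter exists only so the lookup can be exercised without touching the real process environment.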
## Summary
minim's data management is minimal and file-based:
- **No database:** All data is ephemeral or file-based
- **Token storage:** Plain text INI file at `~/minim.cfg`
- **Audio metadata:** Written to file tags via mutagen
- **No caching:** API responses not persisted
- **No validation:** Metadata written as-is without checks
This approach is simple and suitable for personal use but lacks security and robustness for production systems. The v2 rewrite addresses security concerns with OS keychain integration and adds validation layers.
For a metadata aggregator project, consider:
- Secure credential storage (OS keychain, secrets manager)
- Database for caching API responses (reduce API calls)
- Metadata validation before writing to files
- Audit logging for data access and modifications