feat: initial implementation of metadata aggregator

- gRPC service with MusicBrainz provider
- PostgreSQL schema with migrations
- Service layer with database-first caching
- Repository pattern for data access
- YAML configuration support
- Research documentation for 17 music metadata projects
Alexander
2026-04-28 16:27:14 +02:00
commit a1f6701bac
163 changed files with 95884 additions and 0 deletions
@@ -0,0 +1,629 @@
# GraphBrainz Data Layer
## Data Source Architecture
GraphBrainz is a **stateless proxy** with no persistent database; the only state it keeps is in-memory caches. All data originates from external APIs:
| Source | Purpose | Authentication |
|--------|---------|----------------|
| MusicBrainz REST API | Core music metadata | None |
| Cover Art Archive | Album artwork | None |
| fanart.tv | Artist images | API key required |
| MediaWiki | Wiki images | None |
| TheAudioDB | Artist biographies | API key required |
## MusicBrainz Backend
### Base URL Configuration
| Environment Variable | Default | Purpose |
|---------------------|---------|---------|
| MUSICBRAINZ_BASE_URL | http://musicbrainz.org/ws/2/ | API endpoint |
**Local Mirror Support**:
```bash
MUSICBRAINZ_BASE_URL=http://localhost:5000/ws/2/
```
Pointing GraphBrainz at a local MusicBrainz mirror avoids the public server's rate limits and reduces latency.
### API Operations
GraphBrainz uses three MusicBrainz API operations:
#### 1. Lookup
Retrieve a single entity by MBID.
**URL Pattern**:
```
GET /ws/2/{entity}/{mbid}?inc={relationships}
```
**Example**:
```
GET /ws/2/artist/5b11f4ce-a62d-471e-81fc-a69a8278c7da?inc=releases+recordings
```
**Supported Entities**: area, artist, collection, event, instrument, label, place, recording, release, release-group, series, url, work
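As a rough illustration of the lookup pattern, the sketch below assembles a lookup URL from a base URL, an entity type, an MBID, and a list of `inc` values. `buildLookupUrl` is a hypothetical helper written for this example and is not taken from GraphBrainz.

```javascript
// Hypothetical helper for illustration; not part of GraphBrainz.
function buildLookupUrl(baseUrl, entity, mbid, inc = []) {
  // baseUrl is expected to end with "/" (e.g. http://musicbrainz.org/ws/2/).
  const url = new URL(`${entity}/${encodeURIComponent(mbid)}`, baseUrl);
  if (inc.length > 0) {
    // inc values are joined with "+", so set the raw query string instead of
    // using searchParams (which would encode "+" as %2B).
    url.search = `inc=${inc.map(encodeURIComponent).join('+')}`;
  }
  return url.toString();
}

const lookupUrl = buildLookupUrl(
  'http://musicbrainz.org/ws/2/',
  'artist',
  '5b11f4ce-a62d-471e-81fc-a69a8278c7da',
  ['releases', 'recordings']
);
console.log(lookupUrl);
```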
#### 2. Browse
Retrieve entities linked to a parent entity.
**URL Pattern**:
```
GET /ws/2/{entity}?{parent-entity}={mbid}&limit={limit}&offset={offset}&inc={relationships}
```
**Example**:
```
GET /ws/2/release?artist=5b11f4ce-a62d-471e-81fc-a69a8278c7da&limit=25&offset=0
```
**Supported Relationships**: See API.md for full matrix
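The browse pattern can be sketched in the same way. `buildBrowseUrl` and `pageOffsets` below are hypothetical helpers for this example; the page size of 25 simply mirrors the example request above.

```javascript
// Hypothetical helpers for illustration; not part of GraphBrainz.
function buildBrowseUrl(baseUrl, entity, parentEntity, mbid, { limit = 25, offset = 0 } = {}) {
  const url = new URL(entity, baseUrl);
  url.searchParams.set(parentEntity, mbid); // e.g. artist=<mbid>
  url.searchParams.set('limit', String(limit));
  url.searchParams.set('offset', String(offset));
  return url.toString();
}

// Offsets needed to page through `total` linked entities.
function pageOffsets(total, limit = 25) {
  const offsets = [];
  for (let offset = 0; offset < total; offset += limit) {
    offsets.push(offset);
  }
  return offsets;
}

const browseUrl = buildBrowseUrl(
  'http://musicbrainz.org/ws/2/',
  'release',
  'artist',
  '5b11f4ce-a62d-471e-81fc-a69a8278c7da'
);
const browseOffsets = pageOffsets(60, 25);
console.log(browseUrl, browseOffsets);
```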
#### 3. Search
Lucene-based full-text search.
**URL Pattern**:
```
GET /ws/2/{entity}?query={lucene-query}&limit={limit}&offset={offset}
```
**Example**:
```
GET /ws/2/artist?query=artist:Radiohead%20AND%20country:GB&limit=25
```
**Supported Entities**: area, artist, event, instrument, label, place, recording, release, release-group, work
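A search URL needs a Lucene query string in addition to paging parameters. The sketch below quotes every value as a Lucene phrase so that user input containing spaces or quotes stays inside its field; the helper names are assumptions for this example, and GraphBrainz may build its queries differently.

```javascript
// Hypothetical helpers for illustration; not part of GraphBrainz.
// Quote a value as a Lucene phrase, escaping backslashes and double quotes.
function lucenePhrase(value) {
  return `"${String(value).replace(/[\\"]/g, '\\$&')}"`;
}

// Combine field/value pairs into: field:"value" AND field:"value"
function buildLuceneQuery(fields) {
  return Object.entries(fields)
    .map(([field, value]) => `${field}:${lucenePhrase(value)}`)
    .join(' AND ');
}

function buildSearchUrl(baseUrl, entity, fields, { limit = 25, offset = 0 } = {}) {
  const url = new URL(entity, baseUrl);
  url.searchParams.set('query', buildLuceneQuery(fields)); // percent-encoded for us
  url.searchParams.set('limit', String(limit));
  url.searchParams.set('offset', String(offset));
  return url.toString();
}

const searchUrl = buildSearchUrl('http://musicbrainz.org/ws/2/', 'artist', {
  artist: 'Radiohead',
  country: 'GB'
});
console.log(searchUrl);
```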
### Include Parameters
GraphBrainz resolvers inspect the GraphQL AST to determine which `inc` parameters are needed:
| Parameter | Description | Entities |
|-----------|-------------|----------|
| aliases | Alternative names | All |
| annotation | Editorial notes | All |
| tags | User-generated tags | All |
| ratings | User ratings | All |
| genres | Genre classifications | All |
| artist-credits | Artist credit details | Recording, Release, ReleaseGroup, Track |
| artists | Related artists | Recording, Release, ReleaseGroup, Work |
| collections | Collections containing entity | All |
| labels | Record labels | Release |
| recordings | Recordings | Artist, Release, Work |
| releases | Releases | Artist, Label, Recording, ReleaseGroup |
| release-groups | Release groups | Artist, Release |
| works | Musical works | Artist, Recording |
| discids | Disc IDs | Release |
| media | Media/tracks | Release |
| isrcs | ISRC codes | Recording |
| url-rels | URL relationships | All |
| artist-rels | Artist relationships | All |
| label-rels | Label relationships | All |
| recording-rels | Recording relationships | All |
| release-rels | Release relationships | All |
| release-group-rels | Release group relationships | All |
| work-rels | Work relationships | All |
| area-rels | Area relationships | All |
| place-rels | Place relationships | All |
| event-rels | Event relationships | All |
| series-rels | Series relationships | All |
| instrument-rels | Instrument relationships | All |
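To make the AST inspection step more concrete, here is a minimal sketch that derives `inc` values from the sub-fields selected under a field node. The node shape follows the graphql-js AST (`selectionSet.selections[].name.value`); the `FIELD_TO_INC` mapping and the function name are assumptions for this example rather than GraphBrainz's actual tables.

```javascript
// Assumed mapping from GraphQL field names to MusicBrainz inc values.
const FIELD_TO_INC = new Map([
  ['aliases', 'aliases'],
  ['tags', 'tags'],
  ['releases', 'releases'],
  ['recordings', 'recordings'],
  ['releaseGroups', 'release-groups'],
  ['works', 'works']
]);

// Collect the inc values implied by the fields selected under fieldNode.
function incForSelection(fieldNode, mapping = FIELD_TO_INC) {
  const selections = fieldNode.selectionSet ? fieldNode.selectionSet.selections : [];
  const inc = new Set();
  for (const selection of selections) {
    if (selection.kind !== 'Field') continue; // fragments are ignored in this sketch
    const value = mapping.get(selection.name.value);
    if (value) inc.add(value);
  }
  return [...inc];
}

// Hand-built AST for: artist(mbid: "...") { name releases { title } releaseGroups { title } }
const artistFieldNode = {
  kind: 'Field',
  name: { value: 'artist' },
  selectionSet: {
    selections: [
      { kind: 'Field', name: { value: 'name' } },
      { kind: 'Field', name: { value: 'releases' } },
      { kind: 'Field', name: { value: 'releaseGroups' } }
    ]
  }
};
const artistInc = incForSelection(artistFieldNode);
console.log(artistInc);
```

Scalar fields such as `name` map to no `inc` value, so a query that selects only scalars results in a plain lookup.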
### Response Format
MusicBrainz returns JSON with entity-specific structure:
```json
{
  "id": "5b11f4ce-a62d-471e-81fc-a69a8278c7da",
  "name": "Radiohead",
  "sort-name": "Radiohead",
  "type": "Group",
  "country": "GB",
  "life-span": {
    "begin": "1985"
  },
  "releases": [
    {
      "id": "...",
      "title": "OK Computer",
      "date": "1997-05-21"
    }
  ]
}
```
GraphBrainz transforms this into a GraphQL-friendly format (camelCase field names, nested objects preserved).
## Two-Level Caching Strategy
### Level 1: DataLoader (Per-Request)
**Purpose**: Request batching and deduplication within a single GraphQL query.
**Lifecycle**: Created fresh for each GraphQL request, discarded after response.
**Implementation**:
```javascript
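import DataLoader from 'dataloader';

// fetchArtist(mbid, inc) is assumed to be defined elsewhere and to return a promise.
const artistLoader = new DataLoader(
  async (keys) =>
    Promise.all(
      // Map failures to Error values so one failed key does not reject the whole batch.
      keys.map((key) => fetchArtist(key.mbid, key.inc).catch((error) => error))
    ),
  // Keys are objects, so derive a string cache key; by default DataLoader
  // compares keys by identity and would never deduplicate { mbid, inc } objects.
  { cacheKeyFn: (key) => `${key.mbid}:${key.inc}` }
);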
```
**Benefits**:
- Collects loads for the same entity type into a single batch
- Deduplicates identical requests within a query
- Prevents N+1 query problems when the same entity appears at several places in a query
**Example**:
```graphql
{
  lookup {
    release(mbid: "...") {
      artists { # Artist 1
        name
      }
      tracks {
        artists { # Artist 1 again (deduplicated)
          name
        }
      }
    }
  }
}
```
DataLoader ensures Artist 1 is fetched only once.
### Level 2: LRU Cache (Shared)
**Purpose**: Cross-request caching to reduce API calls.
**Lifecycle**: Shared across all requests, persists for configured TTL.
**Configuration**:
| Parameter | Environment Variable | Default |
|-----------|---------------------|---------|
| Size | GRAPHBRAINZ_CACHE_SIZE | 8192 items |
| TTL | GRAPHBRAINZ_CACHE_TTL | 86400000 ms (1 day) |
**Implementation**:
```javascript
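import LRU from 'lru-cache';

const cache = new LRU({
  max: 8192, // maximum number of cached responses
  ttl: 86400000, // 1 day, in milliseconds
  updateAgeOnGet: true, // each get() resets the entry's age
  updateAgeOnHas: true // each has() resets the entry's age
});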
```
**Cache Key Strategy**:
Keys combine entity type, MBID, and `inc` parameters to prevent collisions:
```
artist:5b11f4ce-a62d-471e-81fc-a69a8278c7da:releases,recordings
release:f0c8b1e5-...:artist-credits,labels,media
```
Different queries for the same entity use different cache keys.
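A key in this format could be produced as follows. `cacheKey` is a hypothetical helper for this example; the join order simply mirrors the sample keys above.

```javascript
// Hypothetical helper for illustration; not part of GraphBrainz.
function cacheKey(entity, mbid, inc = []) {
  // entity:mbid:inc1,inc2 (inc values are kept in the order given)
  return [entity, mbid, inc.join(',')].join(':');
}

const keyWithReleases = cacheKey('artist', '5b11f4ce-a62d-471e-81fc-a69a8278c7da', [
  'releases',
  'recordings'
]);
const keyWithAliases = cacheKey('artist', '5b11f4ce-a62d-471e-81fc-a69a8278c7da', ['aliases']);
console.log(keyWithReleases, keyWithAliases);
```

Sorting `inc` before joining would let equivalent include sets share one entry; whether GraphBrainz normalizes the order is not stated here.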
**Cache Invalidation**:
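- **Time-based**: Items expire after TTL (default 1 day); with `updateAgeOnGet: true`, as configured above, each read resets an item's age, so frequently read items can outlive the TTL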
- **Size-based**: LRU eviction when cache exceeds max size
- **No manual invalidation**: GraphBrainz assumes MusicBrainz data is relatively stable
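The two eviction paths can be illustrated without the `lru-cache` package. The class below is a simplified stand-in written for this example: it uses a fixed TTL measured from the time of the write (it does not reset an entry's age on reads) and accepts an injectable clock so the behaviour is easy to verify.

```javascript
// Simplified stand-in for illustration; not the lru-cache implementation.
class TtlLruCache {
  constructor({ max, ttl, now = Date.now }) {
    this.max = max;
    this.ttl = ttl;
    this.now = now;
    this.entries = new Map(); // Map iteration order doubles as recency order
  }

  get(key) {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (this.now() - entry.storedAt >= this.ttl) {
      this.entries.delete(key); // time-based expiry
      return undefined;
    }
    // Re-insert so this key becomes the most recently used.
    this.entries.delete(key);
    this.entries.set(key, entry);
    return entry.value;
  }

  set(key, value) {
    this.entries.delete(key);
    this.entries.set(key, { value, storedAt: this.now() });
    if (this.entries.size > this.max) {
      // Size-based eviction: the first key is the least recently used.
      this.entries.delete(this.entries.keys().next().value);
    }
  }
}

let fakeTime = 0;
const demoCache = new TtlLruCache({ max: 2, ttl: 1000, now: () => fakeTime });
demoCache.set('a', 1);
demoCache.set('b', 2);
demoCache.get('a'); // touch "a" so "b" becomes least recently used
demoCache.set('c', 3); // evicts "b"
```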
**Cache Hit Ratio**:
Approximate hit ratios to expect under production workloads (actual values depend on traffic):
- Lookup queries: 60-80% (popular artists cached)
- Browse queries: 40-60% (pagination reduces hits)
- Search queries: 10-30% (diverse queries)
## Extension Caching
Each extension maintains its own LRU cache with separate configuration.
### Cover Art Archive
| Parameter | Environment Variable | Default |
|-----------|---------------------|---------|
| Size | COVERART_CACHE_SIZE | 8192 |
| TTL | COVERART_CACHE_TTL | 86400000 ms |
**Cache Key**: `coverart:{release-mbid}`
### fanart.tv
| Parameter | Environment Variable | Default |
|-----------|---------------------|---------|
| Size | FANART_CACHE_SIZE | 8192 |
| TTL | FANART_CACHE_TTL | 86400000 ms |
**Cache Key**: `fanart:{artist-mbid}`
### TheAudioDB
| Parameter | Environment Variable | Default |
|-----------|---------------------|---------|
| Size | THEAUDIODB_CACHE_SIZE | 8192 |
| TTL | THEAUDIODB_CACHE_TTL | 86400000 ms |
**Cache Key**: `theaudiodb:{artist-mbid}`
### MediaWiki
| Parameter | Environment Variable | Default |
|-----------|---------------------|---------|
| Size | MEDIAWIKI_CACHE_SIZE | 8192 |
| TTL | MEDIAWIKI_CACHE_TTL | 86400000 ms |
**Cache Key**: `mediawiki:{artist-name}`
## Data Flow
Complete request flow from GraphQL query to response:
```
1. GraphQL Query Received
2. Resolver Inspects AST
↓ (determines required inc parameters)
3. DataLoader.load({ mbid, inc })
4. Check DataLoader Cache (per-request)
↓ (miss)
5. Check LRU Cache (shared)
↓ (miss)
6. Rate Limiter Queue
↓ (acquire token)
7. HTTP Request via got
8. MusicBrainz API Response
9. Store in LRU Cache
10. Return to DataLoader
11. Return to Resolver
12. GraphQL Response
```
**DataLoader Cache Hit Path**:
```
1. GraphQL Query Received
2. Resolver Inspects AST
3. DataLoader.load({ mbid, inc })
4. Check DataLoader Cache (per-request)
↓ (hit - return immediately)
5. GraphQL Response
```
**Shared Cache Hit Path**:
```
1. GraphQL Query Received
2. Resolver Inspects AST
3. DataLoader.load({ mbid, inc })
4. Check DataLoader Cache (per-request)
↓ (miss)
5. Check LRU Cache (shared)
↓ (hit - return immediately)
6. Store in DataLoader Cache
7. GraphQL Response
```
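The three paths above can be condensed into one function. This is a sketch under stated assumptions: plain `Map` objects stand in for DataLoader and the LRU cache, `loadEntity` and `fetchFromApi` are hypothetical names, and the rate limiter step is left out.

```javascript
// Hypothetical sketch of the two cache levels; the rate limiter step is omitted.
function loadEntity(key, { requestCache, sharedCache, fetchFromApi }) {
  // Level 1: the per-request cache holds promises, so repeated loads share one fetch.
  if (requestCache.has(key)) return requestCache.get(key);

  // Level 2: the shared cache holds resolved API responses.
  const cached = sharedCache.get(key);
  const promise =
    cached !== undefined
      ? Promise.resolve(cached)
      : fetchFromApi(key).then((value) => {
          sharedCache.set(key, value); // store the response for later requests
          return value;
        });

  requestCache.set(key, promise); // remember for the rest of this request
  return promise;
}

// Usage with plain Maps standing in for DataLoader and the LRU cache.
const sharedCache = new Map([['artist:a1:', { name: 'Radiohead' }]]);
const requestCache = new Map();
let apiCalls = 0;
const fetchFromApi = (key) => {
  apiCalls += 1;
  return Promise.resolve({ key });
};
const deps = { requestCache, sharedCache, fetchFromApi };

loadEntity('artist:a1:', deps); // shared cache hit, no API call
const firstLoad = loadEntity('artist:a2:', deps); // miss on both levels
const secondLoad = loadEntity('artist:a2:', deps); // per-request hit
```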
## Rate Limiting
GraphBrainz implements custom rate limiting to comply with API policies.
### MusicBrainz Rate Limits
**Policy**: 5 requests per 5.5 seconds (approximately 0.909 requests/second)
**Implementation**:
- Token bucket algorithm
- 5 tokens maximum
- Refill rate: 0.909 tokens/second
- Sequential requests (concurrency: 1)
**Configuration**:
```javascript
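// RateLimiter is GraphBrainz's own limiter class, not a third-party package.
const musicbrainzLimiter = new RateLimiter({
  limit: 5, // at most 5 requests...
  interval: 5500, // ...per 5500 milliseconds
  concurrency: 1 // one request in flight at a time
});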
```
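For readers unfamiliar with the token bucket algorithm, the sketch below shows how the numbers above interact: a bucket of 5 tokens refilled at 5 tokens per 5500 ms. `TokenBucket` is a hypothetical class for this example with an injectable clock; it is not GraphBrainz's `RateLimiter` and does not model queuing or concurrency.

```javascript
// Hypothetical token bucket for illustration; not GraphBrainz's RateLimiter.
class TokenBucket {
  constructor({ limit, interval, now = Date.now }) {
    this.capacity = limit; // maximum number of stored tokens
    this.limit = limit;
    this.interval = interval; // milliseconds needed to refill `limit` tokens
    this.now = now;
    this.tokens = limit; // start with a full bucket
    this.lastRefill = now();
  }

  refill() {
    const current = this.now();
    const elapsed = current - this.lastRefill;
    const earned = (elapsed * this.limit) / this.interval;
    this.tokens = Math.min(this.capacity, this.tokens + earned);
    this.lastRefill = current;
  }

  // Take one token if available; returns false when the caller must wait.
  tryAcquire() {
    this.refill();
    if (this.tokens < 1) return false;
    this.tokens -= 1;
    return true;
  }
}

let bucketTime = 0;
const bucket = new TokenBucket({ limit: 5, interval: 5500, now: () => bucketTime });
const burst = [1, 2, 3, 4, 5, 6].map(() => bucket.tryAcquire());
console.log(burst); // five immediate requests pass, the sixth must wait
```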
### Extension Rate Limits
**Default Policy**: 10 requests per second
**Implementation**:
- Token bucket algorithm
- 10 tokens maximum
- Refill rate: 10 tokens/second
- Parallel requests (concurrency: 5)
**Per-Extension Configuration**:
| Extension | Rate Limit | Concurrency |
|-----------|------------|-------------|
| Cover Art Archive | 10 req/s | 5 |
| fanart.tv | 10 req/s | 5 |
| MediaWiki | 10 req/s | 5 |
| TheAudioDB | 10 req/s | 5 |
### Priority Queue
Requests are queued with priority levels when the rate limit is reached:
| Priority | Query Type | Rationale |
|----------|------------|-----------|
| High | Lookup | Direct MBID access, user-initiated |
| Medium | Browse | Relationship traversal, pagination |
| Low | Search | Full-text search, exploratory |
Higher priority requests are processed first when tokens become available.
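One simple way to get this ordering is a bucket per priority level, drained from highest to lowest. The class and the priority numbers below are assumptions for this example, not GraphBrainz's actual queue.

```javascript
// Hypothetical queue for illustration; lower number means higher priority.
const PRIORITY = { lookup: 0, browse: 1, search: 2 };

class PriorityRequestQueue {
  constructor() {
    this.buckets = [[], [], []]; // one FIFO list per priority level
  }

  enqueue(type, request) {
    this.buckets[PRIORITY[type]].push(request);
  }

  // Return the oldest request from the highest non-empty priority level.
  dequeue() {
    for (const bucket of this.buckets) {
      if (bucket.length > 0) return bucket.shift();
    }
    return undefined;
  }
}

const queue = new PriorityRequestQueue();
queue.enqueue('search', 's1');
queue.enqueue('lookup', 'l1');
queue.enqueue('browse', 'b1');
queue.enqueue('lookup', 'l2');
const drained = [queue.dequeue(), queue.dequeue(), queue.dequeue(), queue.dequeue()];
console.log(drained);
```

Requests of the same type keep their arrival order, so a burst of lookups cannot reorder itself.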
### Rate Limit Errors
When the rate limit is exceeded and the queue is full, requests fail with:
**HTTP Response**:
```
HTTP/1.1 429 Too Many Requests
Retry-After: 5
```
**GraphQL Error**:
```json
{
  "errors": [
    {
      "message": "Rate limit exceeded",
      "extensions": {
        "code": "RATE_LIMIT",
        "retryAfter": 5
      }
    }
  ]
}
```
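A client could use the `retryAfter` extension to decide how long to wait before retrying. The helper below is a hypothetical client-side sketch that assumes the error shape shown above.

```javascript
// Hypothetical client-side helper; assumes the error shape shown above.
function retryAfterMs(responseBody) {
  const errors = Array.isArray(responseBody.errors) ? responseBody.errors : [];
  const rateLimited = errors.find(
    (error) => error.extensions && error.extensions.code === 'RATE_LIMIT'
  );
  if (!rateLimited) return null; // not a rate limit error
  const seconds = Number(rateLimited.extensions.retryAfter);
  return Number.isFinite(seconds) ? seconds * 1000 : null;
}

const waitMs = retryAfterMs({
  errors: [
    { message: 'Rate limit exceeded', extensions: { code: 'RATE_LIMIT', retryAfter: 5 } }
  ]
});
console.log(waitMs);
```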
## HTTP Client
GraphBrainz uses `got` v11.8.2 for HTTP requests.
### Client Configuration
```javascript
import createDebug from 'debug';
import got from 'got';

// Create the debug logger once rather than on every request.
const debug = createDebug('graphbrainz:api/client');

const client = got.extend({
  prefixUrl: process.env.MUSICBRAINZ_BASE_URL,
  headers: {
    'User-Agent': 'GraphBrainz/9.0.0 (https://github.com/exogen/graphbrainz)'
  },
  timeout: {
    request: 30000 // 30 seconds for the whole request
  },
  retry: {
    limit: 3,
    methods: ['GET'],
    statusCodes: [408, 413, 429, 500, 502, 503, 504]
  },
  hooks: {
    beforeRequest: [
      (options) => {
        debug(`${options.method} ${options.url}`);
      }
    ]
  }
});
```
### Request Headers
| Header | Value | Purpose |
|--------|-------|---------|
| User-Agent | GraphBrainz/9.0.0 (...) | API identification |
| Accept | application/json | Response format |
### Timeout Handling
- **Request timeout**: 30 seconds
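- **Connection and read timeouts**: not set separately; `got` applies no per-phase timeout unless `timeout.connect` or `timeout.socket` is configured, so the 30-second request timeout bounds both phases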
Timeout errors are propagated as GraphQL errors.
### Retry Logic
Automatic retry for transient failures:
- **Max retries**: 3
- **Retry methods**: GET only
- **Retry status codes**: 408, 413, 429, 500, 502, 503, 504
- **Backoff**: Exponential (1s, 2s, 4s)
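The retry rules above reduce to two small functions. This is a sketch with hypothetical names that reproduces the listed schedule; it ignores any random jitter or `Retry-After` handling the HTTP client may add on top.

```javascript
// Hypothetical helpers mirroring the retry settings listed above.
const RETRY_METHODS = ['GET'];
const RETRY_STATUS_CODES = [408, 413, 429, 500, 502, 503, 504];

// Delay before each retry: base, 2x base, 4x base, ...
function backoffDelays(maxRetries, baseMs = 1000) {
  return Array.from({ length: maxRetries }, (_, attempt) => baseMs * 2 ** attempt);
}

function isRetryable(method, statusCode) {
  return RETRY_METHODS.includes(method) && RETRY_STATUS_CODES.includes(statusCode);
}

const delays = backoffDelays(3);
console.log(delays);
```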
## Data Transformation
MusicBrainz API responses are transformed to GraphQL-friendly format:
### Field Name Conversion
| MusicBrainz | GraphQL |
|-------------|---------|
| sort-name | sortName |
| life-span | lifeSpan |
| artist-credit | artistCredit |
| release-group | releaseGroup |
| iso-3166-1-codes | iso31661Codes |
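The conversions in this table follow one rule: drop each hyphen and upper-case the character after it. A recursive version might look like the sketch below; `camelizeKey` and `camelizeKeys` are hypothetical names, and GraphBrainz may implement the conversion differently.

```javascript
// Hypothetical helpers for illustration; not GraphBrainz's implementation.
function camelizeKey(key) {
  // Remove each hyphen and upper-case the character that follows it.
  return key.replace(/-([a-z0-9])/g, (match, char) => char.toUpperCase());
}

// Apply camelizeKey to every object key, recursing into arrays and objects.
function camelizeKeys(value) {
  if (Array.isArray(value)) {
    return value.map((item) => camelizeKeys(item));
  }
  if (value !== null && typeof value === 'object') {
    return Object.fromEntries(
      Object.entries(value).map(([key, nested]) => [camelizeKey(key), camelizeKeys(nested)])
    );
  }
  return value; // strings, numbers, booleans, and null pass through unchanged
}

const converted = camelizeKeys({
  'sort-name': 'Radiohead',
  'life-span': { begin: '1985', end: null }
});
console.log(converted);
```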
### Nested Object Conversion
Nested objects keep their structure; only their keys are converted.
**MusicBrainz**:
```json
{
  "life-span": {
    "begin": "1985",
    "end": null
  }
}
```
**GraphQL**:
```json
{
  "lifeSpan": {
    "begin": "1985",
    "end": null
  }
}
```
### Array Normalization
**MusicBrainz**:
```json
{
  "releases": [
    { "id": "...", "title": "..." }
  ]
}
```
**GraphQL** (Relay connection):
```json
{
  "releases": {
    "edges": [
      {
        "node": { "id": "...", "title": "..." },
        "cursor": "..."
      }
    ],
    "pageInfo": { ... },
    "totalCount": 1
  }
}
```
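Wrapping a plain array in a connection can be sketched as follows. The cursor format (a base64-encoded offset) and the `pageInfo` fields are assumptions based on the Relay connection convention, not a description of GraphBrainz's actual cursors.

```javascript
// Hypothetical helpers; the cursor encoding is an assumption for this example.
function encodeCursor(index) {
  return Buffer.from(`offset:${index}`).toString('base64');
}

// Wrap one page of items in a Relay-style connection object.
function toConnection(items, { offset = 0, totalCount = items.length } = {}) {
  const edges = items.map((node, i) => ({ node, cursor: encodeCursor(offset + i) }));
  return {
    edges,
    pageInfo: {
      hasPreviousPage: offset > 0,
      hasNextPage: offset + items.length < totalCount,
      startCursor: edges.length > 0 ? edges[0].cursor : null,
      endCursor: edges.length > 0 ? edges[edges.length - 1].cursor : null
    },
    totalCount
  };
}

const connection = toConnection([{ id: 'r1', title: 'OK Computer' }], {
  offset: 0,
  totalCount: 3
});
console.log(JSON.stringify(connection, null, 2));
```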
### Relationship Expansion
MusicBrainz relationships are exposed as a connection whose nodes carry the relationship type and a typed target:
**MusicBrainz**:
```json
{
  "relations": [
    {
      "type": "member of band",
      "target": "5b11f4ce-...",
      "artist": { "name": "Radiohead" }
    }
  ]
}
```
**GraphQL** (query):
```graphql
{
  relationships {
    edges {
      node {
        type
        target {
          ... on Artist {
            name
          }
        }
      }
    }
  }
}
```
## Memory Considerations
### Cache Memory Usage
With default configuration (8192 items per cache):
| Cache | Items | Avg Size | Total Memory |
|-------|-------|----------|--------------|
| MusicBrainz | 8192 | 5 KB | ~40 MB |
| Cover Art Archive | 8192 | 2 KB | ~16 MB |
| fanart.tv | 8192 | 3 KB | ~24 MB |
| MediaWiki | 8192 | 4 KB | ~32 MB |
| TheAudioDB | 8192 | 2 KB | ~16 MB |
| **Total** | **40960** | - | **~128 MB** |
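The totals in this table are items multiplied by average entry size. The short calculation below reproduces them so the estimate can be rerun with different cache sizes; the average sizes are the table's own estimates, not measurements.

```javascript
// Figures copied from the table above; average sizes are estimates.
const cacheEstimates = [
  { name: 'MusicBrainz', items: 8192, avgKb: 5 },
  { name: 'Cover Art Archive', items: 8192, avgKb: 2 },
  { name: 'fanart.tv', items: 8192, avgKb: 3 },
  { name: 'MediaWiki', items: 8192, avgKb: 4 },
  { name: 'TheAudioDB', items: 8192, avgKb: 2 }
];

// Memory in MB for one cache: items x average KB per item / 1024.
function estimateMb({ items, avgKb }) {
  return (items * avgKb) / 1024;
}

const totalItems = cacheEstimates.reduce((sum, cache) => sum + cache.items, 0);
const totalMb = cacheEstimates.reduce((sum, cache) => sum + estimateMb(cache), 0);
console.log(totalItems, totalMb);
```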
### DataLoader Memory Usage
DataLoader instances are created per-request and garbage collected after response:
- **Per-request overhead**: ~1-5 MB (depends on query complexity)
- **Concurrent requests**: 100 requests × 5 MB = 500 MB peak
### Recommended Memory Allocation
| Deployment | Heap Size | Rationale |
|------------|-----------|-----------|
| Development | 512 MB | Single user, low traffic |
| Production (low) | 1 GB | 10-50 req/s, shared cache |
| Production (high) | 2 GB | 100+ req/s, full cache |
**Node.js Configuration**:
```bash
node --max-old-space-size=2048 cli.js
```
## Data Freshness
GraphBrainz does not implement cache invalidation beyond TTL expiration. Data freshness therefore depends on how often each data type changes upstream relative to the cache TTL:
| Data Type | Typical Update Frequency | Cache TTL | Staleness Risk |
|-----------|-------------------------|-----------|----------------|
| Artist metadata | Weeks to months | 1 day | Low |
| Release metadata | Days to weeks | 1 day | Low |
| Relationships | Weeks to months | 1 day | Low |
| Cover art | Months to years | 1 day | Very low |
| Artist images | Months to years | 1 day | Very low |
| Biographies | Months to years | 1 day | Very low |
If fresher data is required, reduce the cache TTL:
```bash
GRAPHBRAINZ_CACHE_TTL=3600000 # 1 hour
```
Or disable caching entirely:
```bash
GRAPHBRAINZ_CACHE_SIZE=0
```