Alexander a1f6701bac feat: initial implementation of metadata aggregator
- gRPC service with MusicBrainz provider
- PostgreSQL schema with migrations
- Service layer with database-first caching
- Repository pattern for data access
- YAML configuration support
- Research documentation for 17 music metadata projects
2026-04-28 16:28:53 +02:00

# GraphBrainz Data Layer
## Data Source Architecture
GraphBrainz is a **stateless proxy** with no persistent database. All data originates from external APIs:
| Source | Purpose | Authentication |
|--------|---------|----------------|
| MusicBrainz REST API | Core music metadata | None |
| Cover Art Archive | Album artwork | None |
| fanart.tv | Artist images | API key required |
| MediaWiki | Wiki images | None |
| TheAudioDB | Artist biographies | API key required |
## MusicBrainz Backend
### Base URL Configuration
| Environment Variable | Default | Purpose |
|---------------------|---------|---------|
| MUSICBRAINZ_BASE_URL | http://musicbrainz.org/ws/2/ | API endpoint |
**Local Mirror Support**:
```bash
MUSICBRAINZ_BASE_URL=http://localhost:5000/ws/2/
```
Using a local MusicBrainz mirror avoids the public API's rate limits and reduces latency.
### API Operations
GraphBrainz uses three MusicBrainz API operations:
#### 1. Lookup
Retrieve a single entity by MBID.
**URL Pattern**:
```
GET /ws/2/{entity}/{mbid}?inc={relationships}
```
**Example**:
```
GET /ws/2/artist/5b11f4ce-a62d-471e-81fc-a69a8278c7da?inc=releases+recordings
```
**Supported Entities**: area, artist, collection, event, instrument, label, place, recording, release, release-group, series, url, work
#### 2. Browse
Retrieve entities linked to a parent entity.
**URL Pattern**:
```
GET /ws/2/{entity}?{parent-entity}={mbid}&limit={limit}&offset={offset}&inc={relationships}
```
**Example**:
```
GET /ws/2/release?artist=5b11f4ce-a62d-471e-81fc-a69a8278c7da&limit=25&offset=0
```
**Supported Relationships**: See API.md for full matrix
#### 3. Search
Lucene-based full-text search.
**URL Pattern**:
```
GET /ws/2/{entity}?query={lucene-query}&limit={limit}&offset={offset}
```
**Example**:
```
GET /ws/2/artist?query=artist:Radiohead%20AND%20country:GB&limit=25
```
**Supported Entities**: area, artist, event, instrument, label, place, recording, release, release-group, work
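The three URL patterns above can be sketched as small builder functions. This is an illustration only, not GraphBrainz's own code: the helper names `buildUrl`, `lookupUrl`, `browseUrl`, and `searchUrl` are hypothetical, and the base URL is the documented default. Multiple `inc` values are joined with a space, which form encoding serializes as `+`.

```javascript
// Hypothetical helpers for illustration; not part of GraphBrainz.
const BASE_URL = 'http://musicbrainz.org/ws/2/'; // documented default

function buildUrl(path, params) {
  const url = new URL(path, BASE_URL);
  for (const [name, value] of Object.entries(params)) {
    // Skip unset parameters so they do not appear in the query string.
    if (value !== undefined && value !== '') {
      url.searchParams.set(name, String(value));
    }
  }
  return url.toString();
}

// Lookup: one entity by MBID.
function lookupUrl(entity, mbid, inc = []) {
  return buildUrl(`${entity}/${mbid}`, { inc: inc.join(' ') });
}

// Browse: entities linked to a parent entity.
function browseUrl(entity, parentEntity, mbid, { limit, offset, inc = [] } = {}) {
  return buildUrl(entity, { [parentEntity]: mbid, limit, offset, inc: inc.join(' ') });
}

// Search: Lucene query; URLSearchParams takes care of the escaping.
function searchUrl(entity, query, { limit, offset } = {}) {
  return buildUrl(entity, { query, limit, offset });
}

const mbid = '5b11f4ce-a62d-471e-81fc-a69a8278c7da';
console.log(lookupUrl('artist', mbid, ['releases', 'recordings']));
console.log(browseUrl('release', 'artist', mbid, { limit: 25, offset: 0 }));
console.log(searchUrl('artist', 'artist:Radiohead AND country:GB', { limit: 25 }));
```

The search URL comes out with `:` percent-encoded and spaces written as `+`, which decodes to the same Lucene query as the `%20` form shown in the example.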
### Include Parameters
GraphBrainz resolvers inspect the GraphQL AST to determine which `inc` parameters are needed:
| Parameter | Description | Entities |
|-----------|-------------|----------|
| aliases | Alternative names | All |
| annotation | Editorial notes | All |
| tags | User-generated tags | All |
| ratings | User ratings | All |
| genres | Genre classifications | All |
| artist-credits | Artist credit details | Recording, Release, ReleaseGroup, Track |
| artists | Related artists | Recording, Release, ReleaseGroup, Work |
| collections | Collections containing entity | All |
| labels | Record labels | Release |
| recordings | Recordings | Artist, Release, Work |
| releases | Releases | Artist, Label, Recording, ReleaseGroup |
| release-groups | Release groups | Artist, Release |
| works | Musical works | Artist, Recording |
| discids | Disc IDs | Release |
| media | Media/tracks | Release |
| isrcs | ISRC codes | Recording |
| url-rels | URL relationships | All |
| artist-rels | Artist relationships | All |
| label-rels | Label relationships | All |
| recording-rels | Recording relationships | All |
| release-rels | Release relationships | All |
| release-group-rels | Release group relationships | All |
| work-rels | Work relationships | All |
| area-rels | Area relationships | All |
| place-rels | Place relationships | All |
| event-rels | Event relationships | All |
| series-rels | Series relationships | All |
| instrument-rels | Instrument relationships | All |
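A simplified sketch of how requested fields could be turned into `inc` values. Real resolvers walk the GraphQL AST; here the selection set is reduced to a plain array of field names. The mapping tables and the function name `incForSelection` are assumptions for illustration, and they cover only a few rows of the table above.

```javascript
// Hypothetical mapping from GraphQL field names to MusicBrainz `inc` values.
const FIELD_TO_INC = new Map([
  ['aliases', 'aliases'],
  ['tags', 'tags'],
  ['artistCredits', 'artist-credits'],
  ['releases', 'releases'],
  ['releaseGroups', 'release-groups'],
  ['recordings', 'recordings'],
  ['media', 'media'],
  ['isrcs', 'isrcs']
]);

// Entities that accept each entity-specific `inc` value (from the table above).
// Values missing from this map are treated as valid for all entities.
const INC_ENTITIES = new Map([
  ['artist-credits', ['recording', 'release', 'release-group']],
  ['releases', ['artist', 'label', 'recording', 'release-group']],
  ['release-groups', ['artist', 'release']],
  ['recordings', ['artist', 'release', 'work']],
  ['media', ['release']],
  ['isrcs', ['recording']]
]);

function incForSelection(entity, fieldNames) {
  const inc = new Set();
  for (const field of fieldNames) {
    const value = FIELD_TO_INC.get(field);
    if (value === undefined) continue; // plain field such as `name`
    const allowed = INC_ENTITIES.get(value);
    if (allowed === undefined || allowed.includes(entity)) inc.add(value);
  }
  return [...inc].sort();
}

console.log(incForSelection('artist', ['name', 'releases', 'aliases', 'isrcs']));
// [ 'aliases', 'releases' ]
```

Requesting only the `inc` values a query needs keeps MusicBrainz responses small and makes the cache key reflect exactly what was fetched.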
### Response Format
MusicBrainz returns JSON with entity-specific structure:
```json
{
  "id": "a74b1b7f-71a5-4011-9441-d0b5e4122711",
  "name": "Radiohead",
  "sort-name": "Radiohead",
  "type": "Group",
  "country": "GB",
  "life-span": {
    "begin": "1985"
  },
  "releases": [
    {
      "id": "...",
      "title": "OK Computer",
      "date": "1997-05-21"
    }
  ]
}
```
GraphBrainz transforms this into a GraphQL-friendly format (camelCase keys, nested objects).
## Two-Level Caching Strategy
### Level 1: DataLoader (Per-Request)
**Purpose**: Request batching and deduplication within a single GraphQL query.
**Lifecycle**: Created fresh for each GraphQL request, discarded after response.
**Implementation**:
```javascript
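import DataLoader from 'dataloader';

// fetchArtist(mbid, inc) is the MusicBrainz lookup call, defined elsewhere.
const artistLoader = new DataLoader(
  // Results must come back in the same order as `keys`.
  (keys) => Promise.all(keys.map((key) => fetchArtist(key.mbid, key.inc))),
  {
    // Keys are objects, which DataLoader compares by identity; a string
    // cache key lets equal { mbid, inc } pairs deduplicate.
    cacheKeyFn: (key) => `${key.mbid}:${key.inc}`
  }
);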
```
**Benefits**:
- Batches multiple requests for same entity type
- Deduplicates identical requests within a query
- Prevents N+1 query problems
**Example**:
```graphql
{
  lookup {
    release(mbid: "...") {
      artists { # Artist 1
        name
      }
      tracks {
        artists { # Artist 1 again (deduplicated)
          name
        }
      }
    }
  }
}
```
DataLoader ensures Artist 1 is fetched only once.
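The deduplication half of this behavior can be illustrated without the library. The sketch below is not DataLoader and does no batching; it only caches the promise for each key so that repeated loads within one request share a single fetch. `createRequestLoader` and `fakeFetchArtist` are hypothetical names.

```javascript
// Minimal per-request loader: repeated loads of an equal key share one promise.
function createRequestLoader(fetchFn) {
  const promises = new Map();
  return function load({ mbid, inc = [] }) {
    // Use a string key, because two equal objects are still distinct Map keys.
    const cacheKey = `${mbid}:${inc.join(',')}`;
    if (!promises.has(cacheKey)) {
      promises.set(cacheKey, fetchFn(mbid, inc));
    }
    return promises.get(cacheKey);
  };
}

// Stand-in for the real MusicBrainz call; counts how often it runs.
let fetchCount = 0;
async function fakeFetchArtist(mbid, inc) {
  fetchCount += 1;
  return { id: mbid, inc };
}

const load = createRequestLoader(fakeFetchArtist);
const first = load({ mbid: 'artist-1', inc: ['releases'] });
const second = load({ mbid: 'artist-1', inc: ['releases'] });
console.log(first === second, fetchCount); // true 1
```

Creating a new loader for every GraphQL request keeps this cache from leaking data between requests.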
### Level 2: LRU Cache (Shared)
**Purpose**: Cross-request caching to reduce API calls.
**Lifecycle**: Shared across all requests, persists for configured TTL.
**Configuration**:
| Parameter | Environment Variable | Default |
|-----------|---------------------|---------|
| Size | GRAPHBRAINZ_CACHE_SIZE | 8192 items |
| TTL | GRAPHBRAINZ_CACHE_TTL | 86400000 ms (1 day) |
**Implementation**:
```javascript
import LRU from 'lru-cache';
const cache = new LRU({
  max: 8192,
  ttl: 86400000, // 1 day
  updateAgeOnGet: true,
  updateAgeOnHas: true
});
```
**Cache Key Strategy**:
Keys combine entity type, MBID, and `inc` parameters to prevent collisions:
```
artist:5b11f4ce-a62d-471e-81fc-a69a8278c7da:releases,recordings
release:f0c8b1e5-...:artist-credits,labels,media
```
Different queries for the same entity use different cache keys.
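A key builder along these lines might look as follows. The function name `cacheKey` is hypothetical. Sorting and deduplicating the `inc` list is an addition made here so that the same set of parameters in a different order maps to one entry; the source does not confirm that GraphBrainz normalizes keys this way, and the example keys above are shown unsorted.

```javascript
// Hypothetical cache key builder: entity type, MBID, and normalized inc list.
function cacheKey(entity, mbid, inc = []) {
  // Sort and deduplicate so ['b', 'a'] and ['a', 'b'] produce the same key.
  const normalized = [...new Set(inc)].sort();
  return `${entity}:${mbid}:${normalized.join(',')}`;
}

const artistMbid = '5b11f4ce-a62d-471e-81fc-a69a8278c7da';
console.log(cacheKey('artist', artistMbid, ['releases', 'recordings']));
// artist:5b11f4ce-a62d-471e-81fc-a69a8278c7da:recordings,releases
```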
**Cache Invalidation**:
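- **Time-based**: Items expire after the TTL (default 1 day). With `updateAgeOnGet: true`, as configured above, each read resets an item's age, so frequently read items can outlive a single TTL window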
- **Size-based**: LRU eviction when cache exceeds max size
- **No manual invalidation**: GraphBrainz assumes MusicBrainz data is relatively stable
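The two eviction rules can be shown with a small stand-in built on a `Map`. This is a sketch of the technique, not the `lru-cache` library: the class name `TtlLruCache` and the injectable `now` clock are assumptions, and unlike the configuration above it does not refresh an item's age on read.

```javascript
// Minimal LRU cache with a TTL; a Map's insertion order doubles as recency order.
class TtlLruCache {
  constructor({ max, ttl, now = Date.now }) {
    this.max = max;
    this.ttl = ttl;
    this.now = now;
    this.entries = new Map();
  }

  get(key) {
    const entry = this.entries.get(key);
    if (entry === undefined) return undefined;
    if (this.now() - entry.storedAt >= this.ttl) {
      this.entries.delete(key); // time-based expiry
      return undefined;
    }
    // Re-insert to mark the entry as most recently used.
    this.entries.delete(key);
    this.entries.set(key, entry);
    return entry.value;
  }

  set(key, value) {
    this.entries.delete(key);
    this.entries.set(key, { value, storedAt: this.now() });
    if (this.entries.size > this.max) {
      // Size-based eviction: the first key is the least recently used.
      const oldest = this.entries.keys().next().value;
      this.entries.delete(oldest);
    }
  }
}

let clock = 0; // fake clock in milliseconds, so the example is deterministic
const cache = new TtlLruCache({ max: 2, ttl: 1000, now: () => clock });
cache.set('a', 1);
cache.set('b', 2);
cache.get('a');    // 'a' becomes most recently used
cache.set('c', 3); // cache is over capacity, so 'b' is evicted
console.log(cache.get('b'), cache.get('a')); // undefined 1
```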
**Cache Hit Ratio**:
Typical hit ratios for production workloads:
- Lookup queries: 60-80% (popular artists cached)
- Browse queries: 40-60% (pagination reduces hits)
- Search queries: 10-30% (diverse queries)
## Extension Caching
Each extension maintains its own LRU cache with separate configuration.
### Cover Art Archive
| Parameter | Environment Variable | Default |
|-----------|---------------------|---------|
| Size | COVERART_CACHE_SIZE | 8192 |
| TTL | COVERART_CACHE_TTL | 86400000 ms |
**Cache Key**: `coverart:{release-mbid}`
### fanart.tv
| Parameter | Environment Variable | Default |
|-----------|---------------------|---------|
| Size | FANART_CACHE_SIZE | 8192 |
| TTL | FANART_CACHE_TTL | 86400000 ms |
**Cache Key**: `fanart:{artist-mbid}`
### TheAudioDB
| Parameter | Environment Variable | Default |
|-----------|---------------------|---------|
| Size | THEAUDIODB_CACHE_SIZE | 8192 |
| TTL | THEAUDIODB_CACHE_TTL | 86400000 ms |
**Cache Key**: `theaudiodb:{artist-mbid}`
### MediaWiki
| Parameter | Environment Variable | Default |
|-----------|---------------------|---------|
| Size | MEDIAWIKI_CACHE_SIZE | 8192 |
| TTL | MEDIAWIKI_CACHE_TTL | 86400000 ms |
**Cache Key**: `mediawiki:{artist-name}`
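Because every extension follows the same `{PREFIX}_CACHE_SIZE` / `{PREFIX}_CACHE_TTL` naming, the options can be read with one helper. The sketch below is an assumption about how such a helper could look; `cacheOptionsFromEnv` is a hypothetical name, and the fallbacks are the defaults from the tables above.

```javascript
const DEFAULT_CACHE_SIZE = 8192;
const DEFAULT_CACHE_TTL = 86400000; // 1 day in milliseconds

// Read `${prefix}_CACHE_SIZE` and `${prefix}_CACHE_TTL`, falling back to defaults.
function cacheOptionsFromEnv(prefix, env = process.env) {
  const read = (name, fallback) => {
    const parsed = Number.parseInt(env[`${prefix}_${name}`], 10);
    return Number.isNaN(parsed) ? fallback : parsed;
  };
  return {
    max: read('CACHE_SIZE', DEFAULT_CACHE_SIZE),
    ttl: read('CACHE_TTL', DEFAULT_CACHE_TTL)
  };
}

console.log(cacheOptionsFromEnv('FANART', { FANART_CACHE_TTL: '3600000' }));
// { max: 8192, ttl: 3600000 }
```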
## Data Flow
Complete request flow from GraphQL query to response:
```
1. GraphQL Query Received
2. Resolver Inspects AST
↓ (determines required inc parameters)
3. DataLoader.load({ mbid, inc })
4. Check DataLoader Cache (per-request)
↓ (miss)
5. Check LRU Cache (shared)
↓ (miss)
6. Rate Limiter Queue
↓ (acquire token)
7. HTTP Request via got
8. MusicBrainz API Response
9. Store in LRU Cache
10. Return to DataLoader
11. Return to Resolver
12. GraphQL Response
```
**Cache Hit Path**:
```
1. GraphQL Query Received
2. Resolver Inspects AST
3. DataLoader.load({ mbid, inc })
4. Check DataLoader Cache (per-request)
↓ (hit - return immediately)
5. GraphQL Response
```
**Shared Cache Hit Path**:
```
1. GraphQL Query Received
2. Resolver Inspects AST
3. DataLoader.load({ mbid, inc })
4. Check DataLoader Cache (per-request)
↓ (miss)
5. Check LRU Cache (shared)
↓ (hit - return immediately)
6. Store in DataLoader Cache
7. GraphQL Response
```
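The three paths above reduce to one tiered lookup. The sketch below uses plain `Map` objects for both cache levels and a stand-in fetch function; `tieredLoad`, `demo`, and the `trace` array are hypothetical names, and rate limiting and HTTP are collapsed into `fetchFn`.

```javascript
// Hypothetical tiered lookup: per-request cache, then shared cache, then fetch.
async function tieredLoad(key, { requestCache, sharedCache, fetchFn, trace = [] }) {
  if (requestCache.has(key)) {
    trace.push('request-cache hit');
    return requestCache.get(key);
  }
  if (sharedCache.has(key)) {
    trace.push('shared-cache hit');
    const cached = sharedCache.get(key);
    requestCache.set(key, cached);
    return cached;
  }
  trace.push('fetch');
  const value = await fetchFn(key); // rate limiter and HTTP request go here
  sharedCache.set(key, value);
  requestCache.set(key, value);
  return value;
}

async function demo() {
  const sharedCache = new Map();
  const trace = [];
  const fetchFn = async (key) => ({ key });
  const requestA = new Map();
  await tieredLoad('artist:1', { requestCache: requestA, sharedCache, fetchFn, trace });
  await tieredLoad('artist:1', { requestCache: requestA, sharedCache, fetchFn, trace });
  const requestB = new Map(); // a later request starts with an empty per-request cache
  await tieredLoad('artist:1', { requestCache: requestB, sharedCache, fetchFn, trace });
  return trace;
}

demo().then((trace) => console.log(trace));
// [ 'fetch', 'request-cache hit', 'shared-cache hit' ]
```

Unlike DataLoader, this sketch stores resolved values rather than promises, so two concurrent loads of the same key would both reach `fetchFn`.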
## Rate Limiting
GraphBrainz implements custom rate limiting to comply with API policies.
### MusicBrainz Rate Limits
**Policy**: 5 requests per 5.5 seconds (approximately 0.909 requests/second)
**Implementation**:
- Token bucket algorithm
- 5 tokens maximum
- Refill rate: 0.909 tokens/second
- Sequential requests (concurrency: 1)
**Configuration**:
```javascript
const musicbrainzLimiter = new RateLimiter({
  limit: 5,
  interval: 5500, // milliseconds
  concurrency: 1
});
```
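A token bucket with these parameters can be sketched as follows. This is not GraphBrainz's `RateLimiter`: the class name `TokenBucket`, the `tryRemoveToken` method, and the injectable clock are assumptions made so the example is deterministic, and queueing and concurrency control are left out.

```javascript
// Minimal token bucket: `limit` tokens refill evenly over `interval` milliseconds.
class TokenBucket {
  constructor({ limit, interval, now = Date.now }) {
    this.capacity = limit;    // maximum tokens, which is also the burst size
    this.interval = interval; // time in ms needed to refill `limit` tokens
    this.now = now;
    this.tokens = limit;
    this.lastRefill = now();
  }

  refill() {
    const current = this.now();
    const elapsed = current - this.lastRefill;
    const added = (elapsed * this.capacity) / this.interval;
    this.tokens = Math.min(this.capacity, this.tokens + added);
    this.lastRefill = current;
  }

  // Take one token if available; a caller would queue the request otherwise.
  tryRemoveToken() {
    this.refill();
    if (this.tokens < 1) return false;
    this.tokens -= 1;
    return true;
  }
}

let fakeNow = 0; // fake clock in milliseconds
const bucket = new TokenBucket({ limit: 5, interval: 5500, now: () => fakeNow });
const burst = [1, 2, 3, 4, 5, 6].map(() => bucket.tryRemoveToken());
console.log(burst); // five true values, then false
fakeNow = 1100; // 1100 ms * 5 / 5500 ms = one new token
console.log(bucket.tryRemoveToken()); // true
```

The bucket allows a burst of five requests and then settles at one request every 1.1 seconds, which matches the 0.909 requests/second figure above.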
### Extension Rate Limits
**Default Policy**: 10 requests per second
**Implementation**:
- Token bucket algorithm
- 10 tokens maximum
- Refill rate: 10 tokens/second
- Parallel requests (concurrency: 5)
**Per-Extension Configuration**:
| Extension | Rate Limit | Concurrency |
|-----------|------------|-------------|
| Cover Art Archive | 10 req/s | 5 |
| fanart.tv | 10 req/s | 5 |
| MediaWiki | 10 req/s | 5 |
| TheAudioDB | 10 req/s | 5 |
### Priority Queue
Requests are queued with priority levels when the rate limit is reached:
| Priority | Query Type | Rationale |
|----------|------------|-----------|
| High | Lookup | Direct MBID access, user-initiated |
| Medium | Browse | Relationship traversal, pagination |
| Low | Search | Full-text search, exploratory |
Higher priority requests are processed first when tokens become available.
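The ordering rule can be sketched with a small queue. The class name `PriorityRequestQueue` and the numeric priority values are assumptions; the tie-break by arrival order within one priority level is also an assumption, since the source does not state it.

```javascript
// Lower number means higher priority, following the table above.
const PRIORITY = { lookup: 0, browse: 1, search: 2 };

class PriorityRequestQueue {
  constructor() {
    this.items = [];
    this.sequence = 0; // keeps first-in, first-out order within one priority level
  }

  enqueue(type, request) {
    this.items.push({ priority: PRIORITY[type], order: this.sequence++, request });
  }

  // Return the highest-priority request, oldest first among equals.
  dequeue() {
    if (this.items.length === 0) return undefined;
    this.items.sort((a, b) => a.priority - b.priority || a.order - b.order);
    return this.items.shift().request;
  }
}

const queue = new PriorityRequestQueue();
queue.enqueue('search', 'search-1');
queue.enqueue('browse', 'browse-1');
queue.enqueue('lookup', 'lookup-1');
queue.enqueue('lookup', 'lookup-2');
const processed = [queue.dequeue(), queue.dequeue(), queue.dequeue(), queue.dequeue()];
console.log(processed); // [ 'lookup-1', 'lookup-2', 'browse-1', 'search-1' ]
```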
### Rate Limit Errors
When the rate limit is exceeded and the queue is full:
**HTTP Response**:
```
HTTP/1.1 429 Too Many Requests
Retry-After: 5
```
**GraphQL Error**:
```json
{
  "errors": [
    {
      "message": "Rate limit exceeded",
      "extensions": {
        "code": "RATE_LIMIT",
        "retryAfter": 5
      }
    }
  ]
}
```
## HTTP Client
GraphBrainz uses `got` v11.8.2 for HTTP requests.
### Client Configuration
```javascript
import createDebug from 'debug';
import got from 'got';

const debug = createDebug('graphbrainz:api/client');

const client = got.extend({
  prefixUrl: process.env.MUSICBRAINZ_BASE_URL,
  headers: {
    'User-Agent': 'GraphBrainz/9.0.0 (https://github.com/exogen/graphbrainz)'
  },
  timeout: {
    request: 30000 // 30 seconds for the whole request
  },
  retry: {
    limit: 3,
    methods: ['GET'],
    statusCodes: [408, 413, 429, 500, 502, 503, 504]
  },
  hooks: {
    beforeRequest: [
      (options) => {
        // Log each outgoing request when DEBUG=graphbrainz:api/client is set.
        debug(`${options.method} ${options.url}`);
      }
    ]
  }
});
```
### Request Headers
| Header | Value | Purpose |
|--------|-------|---------|
| User-Agent | GraphBrainz/9.0.0 (...) | API identification |
| Accept | application/json | Response format |
### Timeout Handling
- **Request timeout**: 30 seconds
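- **Connection timeout**: not set by the configuration above; `got` applies no timeout unless one is configured explicitly
- **Read timeout**: not set separately; the 30-second request timeout bounds the whole request, including reading the response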
Timeout errors are propagated as GraphQL errors.
### Retry Logic
Automatic retry for transient failures:
- **Max retries**: 3
- **Retry methods**: GET only
- **Retry status codes**: 408, 413, 429, 500, 502, 503, 504
- **Backoff**: Exponential (1s, 2s, 4s)
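The retry rules listed above can be written out as two small functions. The names `retryDelayMs` and `shouldRetry` are hypothetical and only mirror the listed settings; `got`'s default delay also adds a small random component on top of the exponential term, which is omitted here.

```javascript
const RETRY_LIMIT = 3;
const RETRY_METHODS = ['GET'];
const RETRY_STATUS_CODES = [408, 413, 429, 500, 502, 503, 504];

// `attempt` is 1 for the first retry, 2 for the second, and so on.
function retryDelayMs(attempt) {
  return 1000 * 2 ** (attempt - 1);
}

function shouldRetry(method, statusCode, attempt) {
  return (
    attempt <= RETRY_LIMIT &&
    RETRY_METHODS.includes(method) &&
    RETRY_STATUS_CODES.includes(statusCode)
  );
}

console.log([1, 2, 3].map(retryDelayMs)); // [ 1000, 2000, 4000 ]
console.log(shouldRetry('GET', 503, 1), shouldRetry('GET', 404, 1)); // true false
```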
## Data Transformation
MusicBrainz API responses are transformed into a GraphQL-friendly format:
### Field Name Conversion
| MusicBrainz | GraphQL |
|-------------|---------|
| sort-name | sortName |
| life-span | lifeSpan |
| artist-credit | artistCredit |
| release-group | releaseGroup |
| iso-3166-1-codes | iso31661Codes |
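The conversion in this table amounts to dropping each hyphen and upper-casing the character that follows it. The sketch below applies that rule recursively to a response object; `toCamelCase` and `camelizeKeys` are hypothetical names, not GraphBrainz functions.

```javascript
// Drop each hyphen and upper-case the character after it (digits are unchanged).
function toCamelCase(name) {
  return name.replace(/-([a-z0-9])/g, (match, char) => char.toUpperCase());
}

// Apply the key conversion to nested objects and arrays.
function camelizeKeys(value) {
  if (Array.isArray(value)) return value.map(camelizeKeys);
  if (value !== null && typeof value === 'object') {
    return Object.fromEntries(
      Object.entries(value).map(([key, inner]) => [toCamelCase(key), camelizeKeys(inner)])
    );
  }
  return value;
}

console.log(toCamelCase('iso-3166-1-codes')); // iso31661Codes
console.log(camelizeKeys({ 'sort-name': 'Radiohead', 'life-span': { begin: '1985' } }));
// { sortName: 'Radiohead', lifeSpan: { begin: '1985' } }
```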
### Nested Object Keys
Nested objects keep their structure; only their keys are converted to camelCase.
**MusicBrainz**:
```json
{
  "life-span": {
    "begin": "1985",
    "end": null
  }
}
```
**GraphQL**:
```json
{
  "lifeSpan": {
    "begin": "1985",
    "end": null
  }
}
```
### Array Normalization
**MusicBrainz**:
```json
{
  "releases": [
    { "id": "...", "title": "..." }
  ]
}
```
**GraphQL** (Relay connection):
```json
{
  "releases": {
    "edges": [
      {
        "node": { "id": "...", "title": "..." },
        "cursor": "..."
      }
    ],
    "pageInfo": { ... },
    "totalCount": 1
  }
}
```
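Wrapping a plain array in a connection can be sketched as below. The function name `toConnection` is hypothetical, and the cursor format (the item's absolute position as a string) is an assumption; the source does not say how GraphBrainz encodes cursors.

```javascript
// Wrap one page of items in a Relay-style connection object.
function toConnection(items, { offset = 0, totalCount = items.length } = {}) {
  const edges = items.map((node, index) => ({
    node,
    cursor: String(offset + index) // assumed cursor format: absolute position
  }));
  return {
    edges,
    pageInfo: {
      hasPreviousPage: offset > 0,
      hasNextPage: offset + items.length < totalCount,
      startCursor: edges.length > 0 ? edges[0].cursor : null,
      endCursor: edges.length > 0 ? edges[edges.length - 1].cursor : null
    },
    totalCount
  };
}

const connection = toConnection([{ id: 'r1', title: 'First' }], { offset: 25, totalCount: 100 });
console.log(connection.edges[0].cursor, connection.pageInfo.hasNextPage); // 25 true
```

The `offset` and `totalCount` inputs correspond to the `offset` parameter and the total count that a browse or search response reports, which is what lets `pageInfo` be derived without another request.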
### Relationship Expansion
MusicBrainz relationships are flattened into GraphQL fields:
**MusicBrainz**:
```json
{
  "relations": [
    {
      "type": "member of band",
      "target": "a74b1b7f-...",
      "artist": { "name": "Radiohead" }
    }
  ]
}
```
**GraphQL**:
```graphql
{
  relationships {
    edges {
      node {
        type
        target {
          ... on Artist {
            name
          }
        }
      }
    }
  }
}
```
## Memory Considerations
### Cache Memory Usage
With default configuration (8192 items per cache):
| Cache | Items | Avg Size | Total Memory |
|-------|-------|----------|--------------|
| MusicBrainz | 8192 | 5 KB | ~40 MB |
| Cover Art Archive | 8192 | 2 KB | ~16 MB |
| fanart.tv | 8192 | 3 KB | ~24 MB |
| MediaWiki | 8192 | 4 KB | ~32 MB |
| TheAudioDB | 8192 | 2 KB | ~16 MB |
| **Total** | **40960** | - | **~128 MB** |
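The totals in this table are the item count multiplied by the average item size, with 1 MB taken as 1024 KB. A short sketch of that arithmetic, using the average sizes from the table; the names `AVG_ITEM_KB` and `cacheMemoryMB` are hypothetical.

```javascript
const ITEMS_PER_CACHE = 8192;

// Average item sizes in KB, taken from the table above.
const AVG_ITEM_KB = {
  'MusicBrainz': 5,
  'Cover Art Archive': 2,
  'fanart.tv': 3,
  'MediaWiki': 4,
  'TheAudioDB': 2
};

// Estimated memory for one full cache, in MB (1 MB = 1024 KB).
function cacheMemoryMB(avgItemKB, items = ITEMS_PER_CACHE) {
  return (items * avgItemKB) / 1024;
}

const totalMB = Object.values(AVG_ITEM_KB)
  .reduce((sum, avgItemKB) => sum + cacheMemoryMB(avgItemKB), 0);
console.log(cacheMemoryMB(AVG_ITEM_KB['MusicBrainz']), totalMB); // 40 128
```

These figures cover payload size only; per-entry bookkeeping in the cache and the JavaScript object overhead are not included.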
### DataLoader Memory Usage
DataLoader instances are created per-request and garbage collected after response:
- **Per-request overhead**: ~1-5 MB (depends on query complexity)
- **Concurrent requests**: 100 requests × 5 MB = 500 MB peak
### Recommended Memory Allocation
| Deployment | Heap Size | Rationale |
|------------|-----------|-----------|
| Development | 512 MB | Single user, low traffic |
| Production (low) | 1 GB | 10-50 req/s, shared cache |
| Production (high) | 2 GB | 100+ req/s, full cache |
**Node.js Configuration**:
```bash
node --max-old-space-size=2048 cli.js
```
## Data Freshness
GraphBrainz does not implement cache invalidation beyond TTL expiration. Staleness risk therefore depends on how often each data type changes upstream relative to the cache TTL:
| Data Type | Typical Update Frequency | Cache TTL | Staleness Risk |
|-----------|-------------------------|-----------|----------------|
| Artist metadata | Weeks to months | 1 day | Low |
| Release metadata | Days to weeks | 1 day | Low |
| Relationships | Weeks to months | 1 day | Low |
| Cover art | Months to years | 1 day | Very low |
| Artist images | Months to years | 1 day | Very low |
| Biographies | Months to years | 1 day | Very low |
For real-time data requirements, reduce cache TTL:
```bash
GRAPHBRAINZ_CACHE_TTL=3600000 # 1 hour
```
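Or disable caching entirely (confirm this against the `lru-cache` version in use, since some versions treat a `max` of 0 as unbounded rather than empty):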
```bash
GRAPHBRAINZ_CACHE_SIZE=0
```