# GraphBrainz Data Layer

## Data Source Architecture

GraphBrainz is a **stateless proxy** with no persistent database. All data originates from external APIs:

| Source | Purpose | Authentication |
|--------|---------|----------------|
| MusicBrainz REST API | Core music metadata | None |
| Cover Art Archive | Album artwork | None |
| fanart.tv | Artist images | API key required |
| MediaWiki | Wiki images | None |
| TheAudioDB | Artist biographies | API key required |

## MusicBrainz Backend

### Base URL Configuration

| Environment Variable | Default | Purpose |
|---------------------|---------|---------|
| MUSICBRAINZ_BASE_URL | http://musicbrainz.org/ws/2/ | API endpoint |

**Local Mirror Support**:
```bash
MUSICBRAINZ_BASE_URL=http://localhost:5000/ws/2/
```

Pointing GraphBrainz at a local MusicBrainz mirror avoids the public API's rate limits and reduces latency.

### API Operations

GraphBrainz uses three MusicBrainz API operations:

#### 1. Lookup

Retrieve a single entity by MBID.

**URL Pattern**:
```
GET /ws/2/{entity}/{mbid}?inc={relationships}
```

**Example**:
```
GET /ws/2/artist/5b11f4ce-a62d-471e-81fc-a69a8278c7da?inc=releases+recordings
```

**Supported Entities**: area, artist, collection, event, instrument, label, place, recording, release, release-group, series, url, work

#### 2. Browse

Retrieve the entities linked to a parent entity.

**URL Pattern**:
```
GET /ws/2/{entity}?{parent-entity}={mbid}&limit={limit}&offset={offset}&inc={relationships}
```

**Example**:
```
GET /ws/2/release?artist=5b11f4ce-a62d-471e-81fc-a69a8278c7da&limit=25&offset=0
```

**Supported Relationships**: See API.md for the full matrix.

#### 3. Search

Lucene-based full-text search.

**URL Pattern**:
```
GET /ws/2/{entity}?query={lucene-query}&limit={limit}&offset={offset}
```

**Example**:
```
GET /ws/2/artist?query=artist:Radiohead%20AND%20country:GB&limit=25
```

**Supported Entities**: area, artist, event, instrument, label, place, recording, release, release-group, work

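The three URL patterns above can be sketched as small builder functions. This is a minimal illustration, not GraphBrainz's actual client code: the function names (`lookupUrl`, `browseUrl`, `searchUrl`) and the hard-coded `BASE_URL` are assumptions made for the example. Joining `inc` values with a space lets `URLSearchParams` serialize them as `releases+recordings`.

```javascript
// Hypothetical helpers for illustration; not part of GraphBrainz's API.
const BASE_URL = 'http://musicbrainz.org/ws/2/';

// Build an absolute URL, skipping parameters that are undefined.
function buildUrl(path, params) {
  const url = new URL(path, BASE_URL);
  for (const [name, value] of Object.entries(params)) {
    if (value !== undefined) url.searchParams.set(name, String(value));
  }
  return url.toString();
}

// A space-separated list is serialized as "a+b" in the query string.
function incParam(inc) {
  return inc.length > 0 ? inc.join(' ') : undefined;
}

// Lookup: GET /ws/2/{entity}/{mbid}?inc=...
function lookupUrl(entity, mbid, inc = []) {
  return buildUrl(`${entity}/${mbid}`, { inc: incParam(inc) });
}

// Browse: GET /ws/2/{entity}?{parent-entity}={mbid}&limit=...&offset=...
function browseUrl(entity, parentEntity, mbid, { limit, offset, inc = [] } = {}) {
  return buildUrl(entity, { [parentEntity]: mbid, limit, offset, inc: incParam(inc) });
}

// Search: GET /ws/2/{entity}?query=...&limit=...&offset=...
function searchUrl(entity, query, { limit, offset } = {}) {
  return buildUrl(entity, { query, limit, offset });
}
```

Note that `URLSearchParams` percent-encodes the colons in a Lucene query (`artist%3ARadiohead`), which is equivalent to the unencoded form shown in the search example.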
### Include Parameters

GraphBrainz resolvers inspect the GraphQL AST to determine which `inc` parameters are needed:

| Parameter | Description | Entities |
|-----------|-------------|----------|
| aliases | Alternative names | All |
| annotation | Editorial notes | All |
| tags | User-generated tags | All |
| ratings | User ratings | All |
| genres | Genre classifications | All |
| artist-credits | Artist credit details | Recording, Release, ReleaseGroup, Track |
| artists | Related artists | Recording, Release, ReleaseGroup, Work |
| collections | Collections containing entity | All |
| labels | Record labels | Release |
| recordings | Recordings | Artist, Release, Work |
| releases | Releases | Artist, Label, Recording, ReleaseGroup |
| release-groups | Release groups | Artist, Release |
| works | Musical works | Artist, Recording |
| discids | Disc IDs | Release |
| media | Media/tracks | Release |
| isrcs | ISRC codes | Recording |
| url-rels | URL relationships | All |
| artist-rels | Artist relationships | All |
| label-rels | Label relationships | All |
| recording-rels | Recording relationships | All |
| release-rels | Release relationships | All |
| release-group-rels | Release group relationships | All |
| work-rels | Work relationships | All |
| area-rels | Area relationships | All |
| place-rels | Place relationships | All |
| event-rels | Event relationships | All |
| series-rels | Series relationships | All |
| instrument-rels | Instrument relationships | All |

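As a rough sketch of the AST inspection described above, the function below walks the selections of a field node and collects the matching `inc` values. The `FIELD_TO_INC` mapping and the `incFromSelection` name are hypothetical and cover only a few parameters; the node shape (`kind`, `name.value`, `selectionSet.selections`) follows the AST that graphql-js passes to resolvers. Fragments are ignored here for brevity.

```javascript
// Hypothetical mapping from GraphQL field names to MusicBrainz `inc` values.
const FIELD_TO_INC = new Map([
  ['aliases', 'aliases'],
  ['tags', 'tags'],
  ['releases', 'releases'],
  ['recordings', 'recordings'],
  ['releaseGroups', 'release-groups'],
  ['artistCredits', 'artist-credits']
]);

// Collect the `inc` values required by the sub-fields selected on `fieldNode`.
function incFromSelection(fieldNode) {
  const selections = fieldNode.selectionSet ? fieldNode.selectionSet.selections : [];
  const inc = new Set(); // a Set drops duplicates when a field is selected twice
  for (const selection of selections) {
    if (selection.kind !== 'Field') continue; // fragments are skipped in this sketch
    const value = FIELD_TO_INC.get(selection.name.value);
    if (value !== undefined) inc.add(value);
  }
  return [...inc];
}
```

Fields without a mapping (such as `name`) need no `inc` parameter because they are part of the base entity response.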
### Response Format

MusicBrainz returns JSON with an entity-specific structure:

```json
{
  "id": "5b11f4ce-a62d-471e-81fc-a69a8278c7da",
  "name": "Radiohead",
  "sort-name": "Radiohead",
  "type": "Group",
  "country": "GB",
  "life-span": {
    "begin": "1985"
  },
  "releases": [
    {
      "id": "...",
      "title": "OK Computer",
      "date": "1997-05-21"
    }
  ]
}
```

GraphBrainz transforms this into a GraphQL-friendly format (camelCase keys, nested objects).

## Two-Level Caching Strategy

### Level 1: DataLoader (Per-Request)

**Purpose**: Request batching and deduplication within a single GraphQL query.

**Lifecycle**: Created fresh for each GraphQL request, discarded after the response is sent.

**Implementation**:
```javascript
import DataLoader from 'dataloader';

const artistLoader = new DataLoader(
  async (keys) => {
    // Resolve one result per key, in the same order as `keys`.
    return Promise.all(keys.map(key => fetchArtist(key.mbid, key.inc)));
  },
  {
    // Keys are objects, so derive a string cache key; otherwise two
    // equal { mbid, inc } objects would not be deduplicated.
    cacheKeyFn: key => `${key.mbid}:${key.inc}`
  }
);
```

**Benefits**:
- Batches multiple requests for the same entity type
- Deduplicates identical requests within a query
- Prevents N+1 query problems

**Example**:
```graphql
{
  lookup {
    release(mbid: "...") {
      artists { # Artist 1
        name
      }
      tracks {
        artists { # Artist 1 again (deduplicated)
          name
        }
      }
    }
  }
}
```

DataLoader ensures Artist 1 is fetched only once.

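To make the deduplication behavior concrete, here is a simplified stand-in for the per-request loader. It is not DataLoader itself (it does no batching) and `createRequestLoader` is a hypothetical name; it only shows how caching the promise per key means a repeated `load` within one request triggers a single fetch. It assumes `inc` is an array of strings.

```javascript
// Simplified per-request loader: memoizes the fetch promise for each key.
// Create one instance per GraphQL request and discard it afterwards.
function createRequestLoader(fetchFn) {
  const cache = new Map();
  return {
    load(key) {
      // Object keys are compared by identity, so derive a string key.
      const cacheKey = `${key.mbid}:${key.inc.join(',')}`;
      if (!cache.has(cacheKey)) {
        cache.set(cacheKey, fetchFn(key)); // store the promise, not the result
      }
      return cache.get(cacheKey);
    }
  };
}
```

Storing the promise rather than the resolved value means a second `load` issued while the first fetch is still in flight also shares the same request.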
### Level 2: LRU Cache (Shared)

**Purpose**: Cross-request caching to reduce API calls.

**Lifecycle**: Shared across all requests; entries persist for the configured TTL.

**Configuration**:

| Parameter | Environment Variable | Default |
|-----------|---------------------|---------|
| Size | GRAPHBRAINZ_CACHE_SIZE | 8192 items |
| TTL | GRAPHBRAINZ_CACHE_TTL | 86400000 ms (1 day) |

**Implementation**:
```javascript
import LRU from 'lru-cache';

const cache = new LRU({
  max: 8192,
  ttl: 86400000, // 1 day
  updateAgeOnGet: true,
  updateAgeOnHas: true
});
```

**Cache Key Strategy**:

Keys combine entity type, MBID, and `inc` parameters to prevent collisions:

```
artist:5b11f4ce-a62d-471e-81fc-a69a8278c7da:releases,recordings
release:f0c8b1e5-...:artist-credits,labels,media
```

Requests for the same entity with different `inc` parameters use different cache keys.

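The key format shown above could be produced by a helper along these lines. The `cacheKey` function is a hypothetical illustration, not GraphBrainz's own code.

```javascript
// Build a cache key of the form "{entity}:{mbid}:{inc,inc,...}".
function cacheKey(entity, mbid, inc = []) {
  return [entity, mbid, inc.join(',')].join(':');
}
```

This sketch keeps the `inc` values in the order given, so `['releases', 'recordings']` and `['recordings', 'releases']` yield different keys. Sorting the list before joining would let both orderings share one cache entry, at the cost of keys that no longer mirror the request.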
**Cache Invalidation**:

- **Time-based**: Items expire after the TTL (default 1 day). Because `updateAgeOnGet` is enabled, each read resets an item's age, so the TTL counts from the last access.
- **Size-based**: LRU eviction when the cache exceeds its maximum size
- **No manual invalidation**: GraphBrainz assumes MusicBrainz data is relatively stable

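The two automatic eviction rules can be illustrated with a small cache built on `Map`. This is a teaching sketch rather than the `lru-cache` implementation: `TtlLruCache` is a hypothetical class, the injectable `now` clock exists only to make the behavior easy to test, and for simplicity it does not refresh an entry's age on read (unlike the `updateAgeOnGet: true` setting in the configuration shown earlier).

```javascript
// Minimal TTL + LRU cache. A Map iterates in insertion order, so the first
// key is always the least recently used one.
class TtlLruCache {
  constructor({ max, ttl, now = Date.now }) {
    this.max = max;
    this.ttl = ttl;
    this.now = now;
    this.entries = new Map();
  }

  get(key) {
    const entry = this.entries.get(key);
    if (entry === undefined) return undefined;
    if (this.now() - entry.storedAt >= this.ttl) {
      this.entries.delete(key); // time-based expiry
      return undefined;
    }
    // Re-insert so the key becomes the most recently used.
    this.entries.delete(key);
    this.entries.set(key, entry);
    return entry.value;
  }

  set(key, value) {
    this.entries.delete(key);
    this.entries.set(key, { value, storedAt: this.now() });
    if (this.entries.size > this.max) {
      // Size-based eviction: drop the least recently used key.
      const oldest = this.entries.keys().next().value;
      this.entries.delete(oldest);
    }
  }
}
```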
**Cache Hit Ratio**:

Typical hit ratios for production workloads:

- Lookup queries: 60-80% (popular artists cached)
- Browse queries: 40-60% (pagination reduces hits)
- Search queries: 10-30% (diverse queries)

## Extension Caching

Each extension maintains its own LRU cache with separate configuration.

### Cover Art Archive

| Parameter | Environment Variable | Default |
|-----------|---------------------|---------|
| Size | COVERART_CACHE_SIZE | 8192 |
| TTL | COVERART_CACHE_TTL | 86400000 ms |

**Cache Key**: `coverart:{release-mbid}`

### fanart.tv

| Parameter | Environment Variable | Default |
|-----------|---------------------|---------|
| Size | FANART_CACHE_SIZE | 8192 |
| TTL | FANART_CACHE_TTL | 86400000 ms |

**Cache Key**: `fanart:{artist-mbid}`

### TheAudioDB

| Parameter | Environment Variable | Default |
|-----------|---------------------|---------|
| Size | THEAUDIODB_CACHE_SIZE | 8192 |
| TTL | THEAUDIODB_CACHE_TTL | 86400000 ms |

**Cache Key**: `theaudiodb:{artist-mbid}`

### MediaWiki

| Parameter | Environment Variable | Default |
|-----------|---------------------|---------|
| Size | MEDIAWIKI_CACHE_SIZE | 8192 |
| TTL | MEDIAWIKI_CACHE_TTL | 86400000 ms |

**Cache Key**: `mediawiki:{artist-name}`

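Since every extension follows the same `{PREFIX}_CACHE_SIZE` / `{PREFIX}_CACHE_TTL` naming, the settings could be read with one shared helper. The sketch below is an assumption about how such a helper might look (`readCacheConfig` is a hypothetical name); it falls back to the documented defaults when a variable is unset or not a number.

```javascript
const DEFAULT_CACHE_SIZE = 8192;
const DEFAULT_CACHE_TTL = 86400000; // 1 day in milliseconds

// Read cache settings for a given prefix, e.g. "FANART" or "COVERART".
// `env` defaults to process.env but can be replaced for testing.
function readCacheConfig(prefix, env = process.env) {
  const parse = (name, fallback) => {
    const parsed = Number.parseInt(env[name], 10);
    return Number.isNaN(parsed) ? fallback : parsed;
  };
  return {
    max: parse(`${prefix}_CACHE_SIZE`, DEFAULT_CACHE_SIZE),
    ttl: parse(`${prefix}_CACHE_TTL`, DEFAULT_CACHE_TTL)
  };
}
```

For example, `readCacheConfig('FANART')` returns `{ max, ttl }` values that can be passed to the cache constructor.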
## Data Flow

Complete request flow from GraphQL query to response when both caches miss:

```
1. GraphQL Query Received
   ↓
2. Resolver Inspects AST
   ↓ (determines required inc parameters)
3. DataLoader.load({ mbid, inc })
   ↓
4. Check DataLoader Cache (per-request)
   ↓ (miss)
5. Check LRU Cache (shared)
   ↓ (miss)
6. Rate Limiter Queue
   ↓ (acquire token)
7. HTTP Request via got
   ↓
8. MusicBrainz API Response
   ↓
9. Store in LRU Cache
   ↓
10. Return to DataLoader
   ↓
11. Return to Resolver
   ↓
12. GraphQL Response
```

**DataLoader Cache Hit Path**:
```
1. GraphQL Query Received
   ↓
2. Resolver Inspects AST
   ↓
3. DataLoader.load({ mbid, inc })
   ↓
4. Check DataLoader Cache (per-request)
   ↓ (hit - return immediately)
5. GraphQL Response
```

**Shared Cache Hit Path**:
```
1. GraphQL Query Received
   ↓
2. Resolver Inspects AST
   ↓
3. DataLoader.load({ mbid, inc })
   ↓
4. Check DataLoader Cache (per-request)
   ↓ (miss)
5. Check LRU Cache (shared)
   ↓ (hit - return immediately)
6. Store in DataLoader Cache
   ↓
7. GraphQL Response
```

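The three paths above reduce to one lookup routine that tries each layer in turn. The following is a simplified sketch, assuming both caches expose `has`/`get`/`set` (plain `Map` objects work), that keys are strings, and that `fetchFn` stands in for the rate-limited HTTP request. The name `loadEntity` is hypothetical.

```javascript
// Resolve `key` through the per-request cache, then the shared cache,
// then the network. Always returns a promise.
function loadEntity(key, requestCache, sharedCache, fetchFn) {
  // Per-request cache hit: reuse the promise stored earlier in this request.
  if (requestCache.has(key)) return requestCache.get(key);

  let promise;
  if (sharedCache.has(key)) {
    // Shared cache hit: no API call needed.
    promise = Promise.resolve(sharedCache.get(key));
  } else {
    // Miss on both levels: fetch, then populate the shared cache.
    promise = fetchFn(key).then((value) => {
      sharedCache.set(key, value);
      return value;
    });
  }
  requestCache.set(key, promise); // later loads in this request reuse it
  return promise;
}
```

The per-request cache holds promises while the shared cache holds resolved values, so only successful responses are shared across requests.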
## Rate Limiting

GraphBrainz implements custom rate limiting to comply with API policies.

### MusicBrainz Rate Limits

**Policy**: 5 requests per 5.5 seconds (approximately 0.909 requests/second)

**Implementation**:
- Token bucket algorithm
- 5 tokens maximum
- Refill rate: 0.909 tokens/second
- Sequential requests (concurrency: 1)

**Configuration**:
```javascript
const musicbrainzLimiter = new RateLimiter({
  limit: 5,
  interval: 5500, // milliseconds
  concurrency: 1
});
```

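For readers unfamiliar with the token bucket algorithm, here is a minimal sketch using the same `limit` and `interval` values. It is not the `RateLimiter` class used above: `TokenBucket` is a hypothetical name, it omits queueing and concurrency control, and the injectable `now` clock is only there so the refill arithmetic can be checked deterministically.

```javascript
// Token bucket: holds up to `limit` tokens and refills continuously at a
// rate of `limit` tokens per `interval` milliseconds.
class TokenBucket {
  constructor({ limit, interval, now = Date.now }) {
    this.limit = limit;
    this.interval = interval;
    this.now = now;
    this.tokens = limit; // start with a full bucket
    this.lastRefill = now();
  }

  refill() {
    const current = this.now();
    const elapsed = current - this.lastRefill;
    const added = (elapsed * this.limit) / this.interval;
    this.tokens = Math.min(this.limit, this.tokens + added);
    this.lastRefill = current;
  }

  // Take one token if available; returns false when the caller must wait.
  tryAcquire() {
    this.refill();
    if (this.tokens < 1) return false;
    this.tokens -= 1;
    return true;
  }
}
```

With `limit: 5` and `interval: 5500`, a burst of five requests passes immediately, and one further token becomes available every 1100 ms (5500 / 5), which matches the 0.909 tokens/second refill rate.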
### Extension Rate Limits

**Default Policy**: 10 requests per second

**Implementation**:
- Token bucket algorithm
- 10 tokens maximum
- Refill rate: 10 tokens/second
- Parallel requests (concurrency: 5)

**Per-Extension Configuration**:

| Extension | Rate Limit | Concurrency |
|-----------|------------|-------------|
| Cover Art Archive | 10 req/s | 5 |
| fanart.tv | 10 req/s | 5 |
| MediaWiki | 10 req/s | 5 |
| TheAudioDB | 10 req/s | 5 |

### Priority Queue

When the rate limit is reached, requests are queued with a priority level:

| Priority | Query Type | Rationale |
|----------|------------|-----------|
| High | Lookup | Direct MBID access, user-initiated |
| Medium | Browse | Relationship traversal, pagination |
| Low | Search | Full-text search, exploratory |

Higher-priority requests are processed first when tokens become available.

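The ordering rule in the table can be sketched as a small queue that serves the highest priority first and keeps arrival order within each priority. The class name, the numeric priority values, and the linear scan in `dequeue` are all assumptions chosen for brevity; a production queue would more likely use a heap.

```javascript
// Lower number = served first.
const PRIORITY = { lookup: 0, browse: 1, search: 2 };

class PriorityRequestQueue {
  constructor() {
    this.items = [];
    this.sequence = 0; // arrival counter, used to keep FIFO order within a priority
  }

  enqueue(type, request) {
    if (!(type in PRIORITY)) throw new Error(`Unknown query type: ${type}`);
    this.items.push({ priority: PRIORITY[type], order: this.sequence++, request });
  }

  // Remove and return the highest-priority, earliest-queued request.
  dequeue() {
    if (this.items.length === 0) return undefined;
    let best = 0;
    for (let i = 1; i < this.items.length; i++) {
      const candidate = this.items[i];
      const current = this.items[best];
      const higherPriority = candidate.priority < current.priority;
      const earlier = candidate.priority === current.priority && candidate.order < current.order;
      if (higherPriority || earlier) best = i;
    }
    return this.items.splice(best, 1)[0].request;
  }
}
```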
### Rate Limit Errors

When the rate limit is exceeded and the queue is full:

**HTTP Response**:
```
HTTP/1.1 429 Too Many Requests
Retry-After: 5
```

**GraphQL Error**:
```json
{
  "errors": [
    {
      "message": "Rate limit exceeded",
      "extensions": {
        "code": "RATE_LIMIT",
        "retryAfter": 5
      }
    }
  ]
}
```

## HTTP Client

GraphBrainz uses `got` v11.8.2 for HTTP requests.

### Client Configuration

```javascript
import got from 'got';
import debug from 'debug';

const client = got.extend({
  prefixUrl: process.env.MUSICBRAINZ_BASE_URL,
  headers: {
    'User-Agent': 'GraphBrainz/9.0.0 (https://github.com/exogen/graphbrainz)',
    Accept: 'application/json'
  },
  timeout: {
    request: 30000 // 30 seconds
  },
  retry: {
    limit: 3,
    methods: ['GET'],
    statusCodes: [408, 413, 429, 500, 502, 503, 504]
  },
  hooks: {
    beforeRequest: [
      options => {
        // Log every outgoing request under the graphbrainz:api/client namespace.
        debug('graphbrainz:api/client')(`${options.method} ${options.url}`);
      }
    ]
  }
});
```

### Request Headers

| Header | Value | Purpose |
|--------|-------|---------|
| User-Agent | GraphBrainz/9.0.0 (...) | API identification |
| Accept | application/json | Response format |

### Timeout Handling

- **Request timeout**: 30 seconds
- **Connection timeout**: 10 seconds (default)
- **Read timeout**: 30 seconds (default)

Timeout errors are propagated as GraphQL errors.

### Retry Logic

Transient failures are retried automatically:

- **Max retries**: 3
- **Retry methods**: GET only
- **Retry status codes**: 408, 413, 429, 500, 502, 503, 504
- **Backoff**: Exponential (1s, 2s, 4s)

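The retry rules listed above amount to two small decisions: whether a failed request is eligible for retry, and how long to wait before the next attempt. The sketch below spells both out. The function names are hypothetical, and the delay is the plain doubling sequence from the list; an HTTP client may add random jitter on top of it.

```javascript
const RETRY_METHODS = new Set(['GET']);
const RETRY_STATUS_CODES = new Set([408, 413, 429, 500, 502, 503, 504]);

// True when a failed request with this method and status may be retried.
function isRetryable(method, statusCode) {
  return RETRY_METHODS.has(method) && RETRY_STATUS_CODES.has(statusCode);
}

// Delay in milliseconds before retry number `attempt` (1-based),
// or null once the retry limit has been used up.
function retryDelayMs(attempt, { limit = 3, baseMs = 1000 } = {}) {
  if (attempt < 1 || attempt > limit) return null;
  return baseMs * 2 ** (attempt - 1);
}
```

With the defaults, attempts 1 to 3 wait 1000, 2000, and 4000 ms, and a fourth attempt is refused.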
## Data Transformation

MusicBrainz API responses are transformed into a GraphQL-friendly format:

### Field Name Conversion

| MusicBrainz | GraphQL |
|-------------|---------|
| sort-name | sortName |
| life-span | lifeSpan |
| artist-credit | artistCredit |
| release-group | releaseGroup |
| iso-3166-1-codes | iso31661Codes |

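The conversions in the table follow one rule: drop each hyphen and uppercase the character after it. A minimal sketch of that rule, applied recursively to a whole response, might look like this. The helper names `toCamelCase` and `camelCaseKeys` are illustrative and not taken from the GraphBrainz source.

```javascript
// "sort-name" -> "sortName"; digits after a hyphen are kept unchanged.
function toCamelCase(key) {
  return key.replace(/-([a-z0-9])/g, (_, char) => char.toUpperCase());
}

// Recursively rename the keys of plain objects, including objects in arrays.
function camelCaseKeys(value) {
  if (Array.isArray(value)) return value.map(camelCaseKeys);
  if (value !== null && typeof value === 'object') {
    return Object.fromEntries(
      Object.entries(value).map(([key, child]) => [toCamelCase(key), camelCaseKeys(child)])
    );
  }
  return value; // strings, numbers, booleans, and null pass through
}
```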
### Nested Object Flattening

Nested objects keep their shape; only their keys are converted to camelCase.

**MusicBrainz**:
```json
{
  "life-span": {
    "begin": "1985",
    "end": null
  }
}
```

**GraphQL**:
```json
{
  "lifeSpan": {
    "begin": "1985",
    "end": null
  }
}
```

### Array Normalization

**MusicBrainz**:
```json
{
  "releases": [
    { "id": "...", "title": "..." }
  ]
}
```

**GraphQL** (Relay connection):
```json
{
  "releases": {
    "edges": [
      {
        "node": { "id": "...", "title": "..." },
        "cursor": "..."
      }
    ],
    "pageInfo": { ... },
    "totalCount": 1
  }
}
```

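Wrapping a plain array in the connection shape shown above can be sketched as follows. The cursor encoding (base64 of `offset:<n>`) and the function names are assumptions made for this example; the `pageInfo` fields are the standard Relay ones. `Buffer` is the Node.js global.

```javascript
// Assumed cursor format: base64 of "offset:<n>".
function encodeCursor(offset) {
  return Buffer.from(`offset:${offset}`).toString('base64');
}

// Wrap one page of `items` in a Relay-style connection object.
// `offset` is the position of the first item; `totalCount` is the full result size.
function toConnection(items, { offset = 0, totalCount = items.length } = {}) {
  const edges = items.map((node, index) => ({
    node,
    cursor: encodeCursor(offset + index)
  }));
  return {
    edges,
    pageInfo: {
      hasPreviousPage: offset > 0,
      hasNextPage: offset + items.length < totalCount,
      startCursor: edges.length > 0 ? edges[0].cursor : null,
      endCursor: edges.length > 0 ? edges[edges.length - 1].cursor : null
    },
    totalCount
  };
}
```

The `offset` and `totalCount` values map directly onto the `offset` parameter and result count of a MusicBrainz browse or search request.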
### Relationship Expansion

MusicBrainz relationships are flattened into GraphQL fields:

**MusicBrainz**:
```json
{
  "relations": [
    {
      "type": "member of band",
      "target": "5b11f4ce-...",
      "artist": { "name": "Radiohead" }
    }
  ]
}
```

**GraphQL**:
```graphql
{
  relationships {
    edges {
      node {
        type
        target {
          ... on Artist {
            name
          }
        }
      }
    }
  }
}
```

## Memory Considerations

### Cache Memory Usage

With default configuration (8192 items per cache):

| Cache | Items | Avg Size | Total Memory |
|-------|-------|----------|--------------|
| MusicBrainz | 8192 | 5 KB | ~40 MB |
| Cover Art Archive | 8192 | 2 KB | ~16 MB |
| fanart.tv | 8192 | 3 KB | ~24 MB |
| MediaWiki | 8192 | 4 KB | ~32 MB |
| TheAudioDB | 8192 | 2 KB | ~16 MB |
| **Total** | **40960** | - | **~128 MB** |

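The totals in the table are simply items multiplied by average entry size. The sketch below reproduces that arithmetic so the estimate can be rerun for other cache sizes. It treats KB and MB as binary units (1 MB = 1024 KB), which is what makes 8192 × 5 KB come out at exactly 40 MB; the function name is hypothetical and the average sizes are the estimates from the table, not measurements.

```javascript
const KB_PER_MB = 1024;

// Estimate total cache memory in MB from per-cache item counts and
// average entry sizes in KB.
function estimateCacheMemoryMb(caches) {
  return caches.reduce(
    (total, { items, avgSizeKb }) => total + (items * avgSizeKb) / KB_PER_MB,
    0
  );
}

// Default configuration from the table above.
const defaultCaches = [
  { name: 'MusicBrainz', items: 8192, avgSizeKb: 5 },
  { name: 'Cover Art Archive', items: 8192, avgSizeKb: 2 },
  { name: 'fanart.tv', items: 8192, avgSizeKb: 3 },
  { name: 'MediaWiki', items: 8192, avgSizeKb: 4 },
  { name: 'TheAudioDB', items: 8192, avgSizeKb: 2 }
];
```

Calling `estimateCacheMemoryMb(defaultCaches)` gives the 128 MB total from the table.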
### DataLoader Memory Usage

DataLoader instances are created per request and garbage-collected after the response:

- **Per-request overhead**: ~1-5 MB (depends on query complexity)
- **Concurrent requests**: 100 requests × 5 MB = 500 MB peak

### Recommended Memory Allocation

| Deployment | Heap Size | Rationale |
|------------|-----------|-----------|
| Development | 512 MB | Single user, low traffic |
| Production (low) | 1 GB | 10-50 req/s, shared cache |
| Production (high) | 2 GB | 100+ req/s, full cache |

**Node.js Configuration**:
```bash
node --max-old-space-size=2048 cli.js
```

## Data Freshness

GraphBrainz does not implement cache invalidation beyond TTL expiration. Data freshness therefore depends on how often each data type changes upstream relative to the cache TTL:

| Data Type | Typical Update Frequency | Cache TTL | Staleness Risk |
|-----------|-------------------------|-----------|----------------|
| Artist metadata | Weeks to months | 1 day | Low |
| Release metadata | Days to weeks | 1 day | Low |
| Relationships | Weeks to months | 1 day | Low |
| Cover art | Months to years | 1 day | Very low |
| Artist images | Months to years | 1 day | Very low |
| Biographies | Months to years | 1 day | Very low |

If fresher data is required, reduce the cache TTL:

```bash
GRAPHBRAINZ_CACHE_TTL=3600000 # 1 hour
```

Or disable caching entirely:

```bash
GRAPHBRAINZ_CACHE_SIZE=0
```