feat: initial implementation of metadata aggregator

- gRPC service with MusicBrainz provider
- PostgreSQL schema with migrations
- Service layer with database-first caching
- Repository pattern for data access
- YAML configuration support
- Research documentation for 17 music metadata projects
@@ -0,0 +1,629 @@
# GraphBrainz Data Layer

## Data Source Architecture

GraphBrainz is a **stateless proxy** with no persistent database. All data originates from external APIs:

| Source | Purpose | Authentication |
|--------|---------|----------------|
| MusicBrainz REST API | Core music metadata | None |
| Cover Art Archive | Album artwork | None |
| fanart.tv | Artist images | API key required |
| MediaWiki | Wiki images | None |
| TheAudioDB | Artist biographies | API key required |

## MusicBrainz Backend

### Base URL Configuration

| Environment Variable | Default | Purpose |
|---------------------|---------|---------|
| MUSICBRAINZ_BASE_URL | http://musicbrainz.org/ws/2/ | API endpoint |

**Local Mirror Support**:
```bash
MUSICBRAINZ_BASE_URL=http://localhost:5000/ws/2/
```

Pointing GraphBrainz at a local MusicBrainz mirror avoids the public API's rate limits and reduces latency.

### API Operations

GraphBrainz uses three MusicBrainz API operations:

#### 1. Lookup

Retrieve a single entity by its MBID.

**URL Pattern**:
```
GET /ws/2/{entity}/{mbid}?inc={relationships}
```

**Example**:
```
GET /ws/2/artist/5b11f4ce-a62d-471e-81fc-a69a8278c7da?inc=releases+recordings
```

**Supported Entities**: area, artist, collection, event, instrument, label, place, recording, release, release-group, series, url, work

#### 2. Browse

Retrieve entities linked to a parent entity.

**URL Pattern**:
```
GET /ws/2/{entity}?{parent-entity}={mbid}&limit={limit}&offset={offset}&inc={relationships}
```

**Example**:
```
GET /ws/2/release?artist=5b11f4ce-a62d-471e-81fc-a69a8278c7da&limit=25&offset=0
```

**Supported Relationships**: See API.md for the full matrix

#### 3. Search

Lucene-based full-text search.

**URL Pattern**:
```
GET /ws/2/{entity}?query={lucene-query}&limit={limit}&offset={offset}
```

**Example**:
```
GET /ws/2/artist?query=artist:Radiohead%20AND%20country:GB&limit=25
```

**Supported Entities**: area, artist, event, instrument, label, place, recording, release, release-group, work
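
The three URL patterns above can be assembled with a small helper. This is a minimal sketch, not GraphBrainz's actual client code: `createUrlBuilder` and its method names are hypothetical, and the default base URL is the one listed under Base URL Configuration. `inc` values are joined with a space, which `URLSearchParams` serializes as `+`.

```javascript
// Hypothetical helper that builds Lookup, Browse, and Search URLs.
function createUrlBuilder(baseUrl = 'http://musicbrainz.org/ws/2/') {
  const build = (path, params) => {
    const url = new URL(path, baseUrl);
    for (const [name, value] of Object.entries(params)) {
      // Skip empty parameters so they do not appear in the query string.
      if (value !== undefined && value !== '') {
        url.searchParams.set(name, String(value));
      }
    }
    return url.toString();
  };

  return {
    // GET /ws/2/{entity}/{mbid}?inc=...
    lookup: (entity, mbid, inc = []) =>
      build(`${entity}/${mbid}`, { inc: inc.join(' ') }),
    // GET /ws/2/{entity}?{parent-entity}={mbid}&limit=...&offset=...&inc=...
    browse: (entity, parentEntity, mbid, { limit = 25, offset = 0, inc = [] } = {}) =>
      build(entity, { [parentEntity]: mbid, limit, offset, inc: inc.join(' ') }),
    // GET /ws/2/{entity}?query=...&limit=...&offset=...
    search: (entity, query, { limit = 25, offset = 0 } = {}) =>
      build(entity, { query, limit, offset }),
  };
}

const urls = createUrlBuilder();
console.log(urls.lookup('artist', '5b11f4ce-a62d-471e-81fc-a69a8278c7da', ['releases', 'recordings']));
console.log(urls.browse('release', 'artist', '5b11f4ce-a62d-471e-81fc-a69a8278c7da'));
console.log(urls.search('artist', 'artist:Radiohead AND country:GB'));
```

The search URL comes out percent-encoded differently from the hand-written example (`:` becomes `%3A`, spaces become `+`), but it decodes to the same Lucene query.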

### Include Parameters

GraphBrainz resolvers inspect the GraphQL AST to determine which `inc` parameters are needed:

| Parameter | Description | Entities |
|-----------|-------------|----------|
| aliases | Alternative names | All |
| annotation | Editorial notes | All |
| tags | User-generated tags | All |
| ratings | User ratings | All |
| genres | Genre classifications | All |
| artist-credits | Artist credit details | Recording, Release, ReleaseGroup, Track |
| artists | Related artists | Recording, Release, ReleaseGroup, Work |
| collections | Collections containing entity | All |
| labels | Record labels | Release |
| recordings | Recordings | Artist, Release, Work |
| releases | Releases | Artist, Label, Recording, ReleaseGroup |
| release-groups | Release groups | Artist, Release |
| works | Musical works | Artist, Recording |
| discids | Disc IDs | Release |
| media | Media/tracks | Release |
| isrcs | ISRC codes | Recording |
| url-rels | URL relationships | All |
| artist-rels | Artist relationships | All |
| label-rels | Label relationships | All |
| recording-rels | Recording relationships | All |
| release-rels | Release relationships | All |
| release-group-rels | Release group relationships | All |
| work-rels | Work relationships | All |
| area-rels | Area relationships | All |
| place-rels | Place relationships | All |
| event-rels | Event relationships | All |
| series-rels | Series relationships | All |
| instrument-rels | Instrument relationships | All |
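
To make the AST-driven selection concrete, the sketch below maps requested GraphQL field names to `inc` values. It is a simplified illustration under stated assumptions: the `FIELD_TO_INC` mapping and `incForFields` are hypothetical, the mapping covers only a few rows of the table, and a real resolver would read the field names from the GraphQL selection set instead of a plain array.

```javascript
// Hypothetical mapping from GraphQL field names to MusicBrainz `inc` values.
const FIELD_TO_INC = new Map([
  ['aliases', 'aliases'],
  ['tags', 'tags'],
  ['releases', 'releases'],
  ['releaseGroups', 'release-groups'],
  ['recordings', 'recordings'],
  ['works', 'works'],
]);

// Returns the sorted, de-duplicated `inc` values needed for the given fields.
function incForFields(fieldNames) {
  const inc = new Set();
  for (const name of fieldNames) {
    // Fields such as `name` need no `inc` parameter and are skipped.
    if (FIELD_TO_INC.has(name)) {
      inc.add(FIELD_TO_INC.get(name));
    }
  }
  return [...inc].sort();
}

console.log(incForFields(['name', 'releases', 'releaseGroups', 'releases']));
```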

### Response Format

MusicBrainz returns JSON with entity-specific structure:

```json
{
  "id": "5b11f4ce-a62d-471e-81fc-a69a8278c7da",
  "name": "Radiohead",
  "sort-name": "Radiohead",
  "type": "Group",
  "country": "GB",
  "life-span": {
    "begin": "1985"
  },
  "releases": [
    {
      "id": "...",
      "title": "OK Computer",
      "date": "1997-05-21"
    }
  ]
}
```

GraphBrainz transforms this into a GraphQL-friendly format (camelCase, nested objects).

## Two-Level Caching Strategy

### Level 1: DataLoader (Per-Request)

**Purpose**: Request batching and deduplication within a single GraphQL query.

**Lifecycle**: Created fresh for each GraphQL request, discarded after response.

**Implementation**:
```javascript
import DataLoader from 'dataloader';

const artistLoader = new DataLoader(
  async (keys) => {
    // Results must line up with `keys` by index.
    const results = await Promise.all(
      keys.map(key => fetchArtist(key.mbid, key.inc))
    );
    return results;
  },
  {
    // Keys are objects, so derive a string cache key. By default DataLoader
    // compares keys by identity and would not deduplicate equal objects.
    cacheKeyFn: key => `${key.mbid}:${key.inc}`
  }
);
```

**Benefits**:
- Batches multiple requests for the same entity type
- Deduplicates identical requests within a query
- Prevents N+1 query problems

**Example**:
```graphql
{
  lookup {
    release(mbid: "...") {
      artists {  # Artist 1
        name
      }
      tracks {
        artists {  # Artist 1 again (deduplicated)
          name
        }
      }
    }
  }
}
```

DataLoader ensures Artist 1 is fetched only once.

### Level 2: LRU Cache (Shared)

**Purpose**: Cross-request caching to reduce API calls.

**Lifecycle**: Shared across all requests, persists for configured TTL.

**Configuration**:

| Parameter | Environment Variable | Default |
|-----------|---------------------|---------|
| Size | GRAPHBRAINZ_CACHE_SIZE | 8192 items |
| TTL | GRAPHBRAINZ_CACHE_TTL | 86400000 ms (1 day) |

**Implementation**:
```javascript
import LRU from 'lru-cache';

const cache = new LRU({
  max: 8192,
  ttl: 86400000, // 1 day
  updateAgeOnGet: true,
  updateAgeOnHas: true
});
```

**Cache Key Strategy**:

Keys combine entity type, MBID, and `inc` parameters to prevent collisions:

```
artist:5b11f4ce-a62d-471e-81fc-a69a8278c7da:releases,recordings
release:f0c8b1e5-...:artist-credits,labels,media
```

The same entity requested with different `inc` parameters is cached under different keys.

**Cache Invalidation**:

- **Time-based**: Items expire after TTL (default 1 day)
- **Size-based**: LRU eviction when cache exceeds max size
- **No manual invalidation**: GraphBrainz assumes MusicBrainz data is relatively stable

**Cache Hit Ratio**:

Typical hit ratios for production workloads:

- Lookup queries: 60-80% (popular artists cached)
- Browse queries: 40-60% (pagination reduces hits)
- Search queries: 10-30% (diverse queries)
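
The cache key strategy described in this section can be sketched as a small function. `cacheKey` is a hypothetical name, and sorting and de-duplicating the `inc` values is an assumption of this sketch (the example keys above are not sorted); it makes the same set of includes map to one key regardless of request order.

```javascript
// Hypothetical key builder: `{entity}:{mbid}:{inc,inc,...}`.
function cacheKey(entity, mbid, inc = []) {
  // Assumption: normalize `inc` so that order and duplicates do not matter.
  const normalized = [...new Set(inc)].sort().join(',');
  return `${entity}:${mbid}:${normalized}`;
}

const id = '5b11f4ce-a62d-471e-81fc-a69a8278c7da';
console.log(cacheKey('artist', id, ['releases', 'recordings']));
console.log(cacheKey('artist', id, ['recordings', 'releases'])); // same key as above
console.log(cacheKey('artist', id, ['releases']));               // different key
```

Without normalization, two requests that differ only in include order would occupy two cache slots and trigger two upstream calls.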

## Extension Caching

Each extension maintains its own LRU cache with separate configuration.

### Cover Art Archive

| Parameter | Environment Variable | Default |
|-----------|---------------------|---------|
| Size | COVERART_CACHE_SIZE | 8192 |
| TTL | COVERART_CACHE_TTL | 86400000 ms |

**Cache Key**: `coverart:{release-mbid}`

### fanart.tv

| Parameter | Environment Variable | Default |
|-----------|---------------------|---------|
| Size | FANART_CACHE_SIZE | 8192 |
| TTL | FANART_CACHE_TTL | 86400000 ms |

**Cache Key**: `fanart:{artist-mbid}`

### TheAudioDB

| Parameter | Environment Variable | Default |
|-----------|---------------------|---------|
| Size | THEAUDIODB_CACHE_SIZE | 8192 |
| TTL | THEAUDIODB_CACHE_TTL | 86400000 ms |

**Cache Key**: `theaudiodb:{artist-mbid}`

### MediaWiki

| Parameter | Environment Variable | Default |
|-----------|---------------------|---------|
| Size | MEDIAWIKI_CACHE_SIZE | 8192 |
| TTL | MEDIAWIKI_CACHE_TTL | 86400000 ms |

**Cache Key**: `mediawiki:{artist-name}`

## Data Flow

Complete request flow from GraphQL query to response:

```
1. GraphQL Query Received
   ↓
2. Resolver Inspects AST
   ↓ (determines required inc parameters)
3. DataLoader.load({ mbid, inc })
   ↓
4. Check DataLoader Cache (per-request)
   ↓ (miss)
5. Check LRU Cache (shared)
   ↓ (miss)
6. Rate Limiter Queue
   ↓ (acquire token)
7. HTTP Request via got
   ↓
8. MusicBrainz API Response
   ↓
9. Store in LRU Cache
   ↓
10. Return to DataLoader
   ↓
11. Return to Resolver
   ↓
12. GraphQL Response
```

**Cache Hit Path**:
```
1. GraphQL Query Received
   ↓
2. Resolver Inspects AST
   ↓
3. DataLoader.load({ mbid, inc })
   ↓
4. Check DataLoader Cache (per-request)
   ↓ (hit - return immediately)
5. GraphQL Response
```

**Shared Cache Hit Path**:
```
1. GraphQL Query Received
   ↓
2. Resolver Inspects AST
   ↓
3. DataLoader.load({ mbid, inc })
   ↓
4. Check DataLoader Cache (per-request)
   ↓ (miss)
5. Check LRU Cache (shared)
   ↓ (hit - return immediately)
6. Store in DataLoader Cache
   ↓
7. GraphQL Response
```
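
The three paths above share one lookup order: per-request cache, then shared cache, then the upstream API. The sketch below illustrates that order with plain `Map` objects standing in for DataLoader and the LRU cache. `loadEntity` and `fetchEntity` are hypothetical names, and the rate limiter step is omitted for brevity.

```javascript
// Hypothetical layered lookup: request cache -> shared cache -> upstream fetch.
async function loadEntity(key, { requestCache, sharedCache, fetchEntity }) {
  // Level 1: per-request cache stores promises, so concurrent loads are shared.
  if (requestCache.has(key)) {
    return requestCache.get(key);
  }
  const promise = (async () => {
    // Level 2: shared cache stores resolved values across requests.
    if (sharedCache.has(key)) {
      return sharedCache.get(key);
    }
    // Miss on both levels: call the upstream API and remember the result.
    const value = await fetchEntity(key);
    sharedCache.set(key, value);
    return value;
  })();
  requestCache.set(key, promise);
  return promise;
}

// Usage: two loads in one request trigger a single upstream call.
let upstreamCalls = 0;
const sharedCache = new Map();
const fetchEntity = async (key) => {
  upstreamCalls += 1;
  return { key };
};
const requestCache = new Map();
Promise.all([
  loadEntity('artist:abc:releases', { requestCache, sharedCache, fetchEntity }),
  loadEntity('artist:abc:releases', { requestCache, sharedCache, fetchEntity }),
]).then(() => console.log(`upstream calls: ${upstreamCalls}`));
```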

## Rate Limiting

GraphBrainz implements custom rate limiting to comply with API policies.

### MusicBrainz Rate Limits

**Policy**: 5 requests per 5.5 seconds (approximately 0.909 requests/second)

**Implementation**:
- Token bucket algorithm
- 5 tokens maximum
- Refill rate: 0.909 tokens/second
- Sequential requests (concurrency: 1)

**Configuration**:
```javascript
const musicbrainzLimiter = new RateLimiter({
  limit: 5,
  interval: 5500, // milliseconds
  concurrency: 1
});
```
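
For readers unfamiliar with the token bucket algorithm named above, here is a minimal sketch using the same `limit` and `interval` values. It is not the `RateLimiter` class from the configuration block: `TokenBucket`, `tryAcquire`, and the injectable `now` clock are hypothetical, and queueing and concurrency control are left out.

```javascript
// Hypothetical token bucket: holds up to `limit` tokens and refills
// continuously at `limit / interval` tokens per millisecond.
class TokenBucket {
  constructor({ limit, interval, now = Date.now }) {
    this.capacity = limit;
    this.refillPerMs = limit / interval;
    this.tokens = limit; // start full, so an initial burst is allowed
    this.now = now;
    this.last = now();
  }

  // Returns true and consumes a token if one is available, else false.
  tryAcquire() {
    const current = this.now();
    const refill = (current - this.last) * this.refillPerMs;
    this.tokens = Math.min(this.capacity, this.tokens + refill);
    this.last = current;
    if (this.tokens < 1) {
      return false;
    }
    this.tokens -= 1;
    return true;
  }
}

// Usage with a fake clock so the behaviour is easy to follow.
let clock = 0;
const bucket = new TokenBucket({ limit: 5, interval: 5500, now: () => clock });
const burst = [1, 2, 3, 4, 5, 6].map(() => bucket.tryAcquire());
console.log(burst); // five successes, then a rejection
clock += 1200;      // a little over 1.1 s later, one token has refilled
console.log(bucket.tryAcquire());
```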

### Extension Rate Limits

**Default Policy**: 10 requests per second

**Implementation**:
- Token bucket algorithm
- 10 tokens maximum
- Refill rate: 10 tokens/second
- Parallel requests (concurrency: 5)

**Per-Extension Configuration**:

| Extension | Rate Limit | Concurrency |
|-----------|------------|-------------|
| Cover Art Archive | 10 req/s | 5 |
| fanart.tv | 10 req/s | 5 |
| MediaWiki | 10 req/s | 5 |
| TheAudioDB | 10 req/s | 5 |

### Priority Queue

Requests are queued with priority levels when the rate limit is reached:

| Priority | Query Type | Rationale |
|----------|------------|-----------|
| High | Lookup | Direct MBID access, user-initiated |
| Medium | Browse | Relationship traversal, pagination |
| Low | Search | Full-text search, exploratory |

Higher priority requests are processed first when tokens become available.
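
The ordering in the table can be illustrated with a small queue. This is a sketch under assumptions: `PriorityQueue` and the numeric priority values are hypothetical, and ties are broken first-in, first-out, which the description above does not specify.

```javascript
// Hypothetical priorities: lower number is served first.
const PRIORITY = { lookup: 0, browse: 1, search: 2 };

class PriorityQueue {
  constructor() {
    this.items = [];
    this.sequence = 0; // insertion counter used to keep FIFO order within a level
  }

  enqueue(type, task) {
    if (!Object.hasOwn(PRIORITY, type)) {
      throw new Error(`Unknown query type: ${type}`);
    }
    this.items.push({ priority: PRIORITY[type], sequence: this.sequence++, task });
    // Sort by priority, then by arrival order.
    this.items.sort((a, b) => a.priority - b.priority || a.sequence - b.sequence);
  }

  // Removes and returns the next task, or undefined when the queue is empty.
  dequeue() {
    const item = this.items.shift();
    return item ? item.task : undefined;
  }
}

const queue = new PriorityQueue();
queue.enqueue('search', 'search-1');
queue.enqueue('browse', 'browse-1');
queue.enqueue('lookup', 'lookup-1');
queue.enqueue('lookup', 'lookup-2');
console.log([queue.dequeue(), queue.dequeue(), queue.dequeue(), queue.dequeue()]);
```

Re-sorting on every insert is fine for short queues; a binary heap would be the usual choice if the queue grows large.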

### Rate Limit Errors

When the rate limit is exceeded and the queue is full:

**HTTP Response**:
```
HTTP/1.1 429 Too Many Requests
Retry-After: 5
```

**GraphQL Error**:
```json
{
  "errors": [
    {
      "message": "Rate limit exceeded",
      "extensions": {
        "code": "RATE_LIMIT",
        "retryAfter": 5
      }
    }
  ]
}
```

## HTTP Client

GraphBrainz uses `got` v11.8.2 for HTTP requests.

### Client Configuration

```javascript
import createDebug from 'debug';
import got from 'got';

// Namespaced logger; enable with DEBUG=graphbrainz:api/client
const debug = createDebug('graphbrainz:api/client');

const client = got.extend({
  prefixUrl: process.env.MUSICBRAINZ_BASE_URL,
  headers: {
    'User-Agent': 'GraphBrainz/9.0.0 (https://github.com/exogen/graphbrainz)'
  },
  timeout: {
    request: 30000 // 30 seconds
  },
  retry: {
    limit: 3,
    methods: ['GET'],
    statusCodes: [408, 413, 429, 500, 502, 503, 504]
  },
  hooks: {
    beforeRequest: [
      options => {
        // Log every outgoing request before it is sent.
        debug(`${options.method} ${options.url}`);
      }
    ]
  }
});
```

### Request Headers

| Header | Value | Purpose |
|--------|-------|---------|
| User-Agent | GraphBrainz/9.0.0 (...) | API identification |
| Accept | application/json | Response format |

### Timeout Handling

- **Request timeout**: 30 seconds
- **Connection timeout**: 10 seconds (default)
- **Read timeout**: 30 seconds (default)

Timeout errors are propagated as GraphQL errors.

### Retry Logic

Automatic retry for transient failures:

- **Max retries**: 3
- **Retry methods**: GET only
- **Retry status codes**: 408, 413, 429, 500, 502, 503, 504
- **Backoff**: Exponential (1s, 2s, 4s)
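
The retry rules listed above reduce to two small functions. This sketch is illustrative only: `shouldRetry` and `retryDelayMs` are hypothetical names, not part of `got`, and any random jitter that a real HTTP client may add to the delay is left out.

```javascript
const MAX_RETRIES = 3;
const RETRY_METHODS = new Set(['GET']);
const RETRY_STATUS_CODES = new Set([408, 413, 429, 500, 502, 503, 504]);

// `attempt` is 1 for the first retry, 2 for the second, and so on.
function shouldRetry(method, statusCode, attempt) {
  return (
    attempt <= MAX_RETRIES &&
    RETRY_METHODS.has(method) &&
    RETRY_STATUS_CODES.has(statusCode)
  );
}

// Exponential backoff: the delay doubles with each attempt.
function retryDelayMs(attempt) {
  return 1000 * 2 ** (attempt - 1);
}

console.log([1, 2, 3].map(retryDelayMs));
console.log(shouldRetry('GET', 503, 1), shouldRetry('POST', 503, 1), shouldRetry('GET', 404, 1));
```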

## Data Transformation

MusicBrainz API responses are transformed into a GraphQL-friendly format:

### Field Name Conversion

| MusicBrainz | GraphQL |
|-------------|---------|
| sort-name | sortName |
| life-span | lifeSpan |
| artist-credit | artistCredit |
| release-group | releaseGroup |
| iso-3166-1-codes | iso31661Codes |
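
The conversions in the table follow one rule: drop each hyphen and upper-case the character after it. A minimal sketch of that rule, applied recursively to a response object, is shown below. `toCamelCase` and `camelCaseKeys` are hypothetical helper names, not functions taken from GraphBrainz.

```javascript
// Converts a hyphenated key to camelCase, e.g. `sort-name` -> `sortName`.
// Digits after a hyphen are kept as-is, e.g. `iso-3166-1-codes` -> `iso31661Codes`.
function toCamelCase(key) {
  return key.replace(/-([a-z0-9])/g, (match, char) => char.toUpperCase());
}

// Recursively renames keys in objects and arrays; other values pass through.
function camelCaseKeys(value) {
  if (Array.isArray(value)) {
    return value.map(camelCaseKeys);
  }
  if (value !== null && typeof value === 'object') {
    return Object.fromEntries(
      Object.entries(value).map(([key, inner]) => [toCamelCase(key), camelCaseKeys(inner)])
    );
  }
  return value;
}

console.log(camelCaseKeys({
  'sort-name': 'Radiohead',
  'life-span': { begin: '1985', end: null },
  'iso-3166-1-codes': ['GB'],
}));
```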

### Nested Object Flattening

Nested objects keep their shape; only the keys are converted.

**MusicBrainz**:
```json
{
  "life-span": {
    "begin": "1985",
    "end": null
  }
}
```

**GraphQL**:
```json
{
  "lifeSpan": {
    "begin": "1985",
    "end": null
  }
}
```

### Array Normalization

**MusicBrainz**:
```json
{
  "releases": [
    { "id": "...", "title": "..." }
  ]
}
```

**GraphQL** (Relay connection):
```json
{
  "releases": {
    "edges": [
      {
        "node": { "id": "...", "title": "..." },
        "cursor": "..."
      }
    ],
    "pageInfo": { ... },
    "totalCount": 1
  }
}
```
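
Wrapping a plain array in a Relay-style connection can be sketched as follows. This is an assumption-laden illustration: `toConnection` is a hypothetical helper, and the cursor is simply the item's absolute offset as a string, whereas the actual cursor encoding is not described here.

```javascript
// Hypothetical helper: wraps one page of items in a Relay-style connection.
function toConnection(items, { offset = 0, totalCount = items.length } = {}) {
  const edges = items.map((node, index) => ({
    node,
    cursor: String(offset + index), // assumption: cursor is the absolute offset
  }));
  return {
    edges,
    pageInfo: {
      hasPreviousPage: offset > 0,
      hasNextPage: offset + items.length < totalCount,
      startCursor: edges.length > 0 ? edges[0].cursor : null,
      endCursor: edges.length > 0 ? edges[edges.length - 1].cursor : null,
    },
    totalCount,
  };
}

const page = toConnection(
  [{ id: 'r1', title: 'OK Computer' }],
  { offset: 25, totalCount: 40 }
);
console.log(JSON.stringify(page, null, 2));
```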

### Relationship Expansion

MusicBrainz relationships are flattened into GraphQL fields:

**MusicBrainz**:
```json
{
  "relations": [
    {
      "type": "member of band",
      "target": "5b11f4ce-...",
      "artist": { "name": "Radiohead" }
    }
  ]
}
```

**GraphQL**:
```graphql
{
  relationships {
    edges {
      node {
        type
        target {
          ... on Artist {
            name
          }
        }
      }
    }
  }
}
```

## Memory Considerations

### Cache Memory Usage

With default configuration (8192 items per cache):

| Cache | Items | Avg Size | Total Memory |
|-------|-------|----------|--------------|
| MusicBrainz | 8192 | 5 KB | ~40 MB |
| Cover Art Archive | 8192 | 2 KB | ~16 MB |
| fanart.tv | 8192 | 3 KB | ~24 MB |
| MediaWiki | 8192 | 4 KB | ~32 MB |
| TheAudioDB | 8192 | 2 KB | ~16 MB |
| **Total** | **40960** | - | **~128 MB** |

### DataLoader Memory Usage

DataLoader instances are created per-request and garbage collected after response:

- **Per-request overhead**: ~1-5 MB (depends on query complexity)
- **Concurrent requests**: 100 requests × 5 MB = 500 MB peak
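
The figures in the two subsections above can be reproduced with a few lines of arithmetic, which also makes it easy to re-run the estimate for other cache sizes. The average entry sizes are this document's estimates, not measurements, and the variable names are illustrative.

```javascript
// Average entry sizes (KB) are the estimates from the table above.
const caches = [
  { name: 'MusicBrainz', items: 8192, avgKb: 5 },
  { name: 'Cover Art Archive', items: 8192, avgKb: 2 },
  { name: 'fanart.tv', items: 8192, avgKb: 3 },
  { name: 'MediaWiki', items: 8192, avgKb: 4 },
  { name: 'TheAudioDB', items: 8192, avgKb: 2 },
];

// Shared caches: items x average size, converted from KB to MB.
const cacheMb = caches.reduce((sum, c) => sum + (c.items * c.avgKb) / 1024, 0);

// DataLoader peak: concurrent requests x worst-case per-request overhead (MB).
const dataLoaderPeakMb = 100 * 5;

console.log(`caches: ${cacheMb} MB, DataLoader peak: ${dataLoaderPeakMb} MB`);
console.log(`combined worst case: ${cacheMb + dataLoaderPeakMb} MB`);
```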

### Recommended Memory Allocation

| Deployment | Heap Size | Rationale |
|------------|-----------|-----------|
| Development | 512 MB | Single user, low traffic |
| Production (low) | 1 GB | 10-50 req/s, shared cache |
| Production (high) | 2 GB | 100+ req/s, full cache |

**Node.js Configuration**:
```bash
node --max-old-space-size=2048 cli.js
```

## Data Freshness

GraphBrainz does not implement cache invalidation beyond TTL expiration. Data freshness therefore depends on how often the upstream data changes relative to the cache TTL:

| Data Type | Typical Update Frequency | Cache TTL | Staleness Risk |
|-----------|-------------------------|-----------|----------------|
| Artist metadata | Weeks to months | 1 day | Low |
| Release metadata | Days to weeks | 1 day | Low |
| Relationships | Weeks to months | 1 day | Low |
| Cover art | Months to years | 1 day | Very low |
| Artist images | Months to years | 1 day | Very low |
| Biographies | Months to years | 1 day | Very low |

If fresher data is required, reduce the cache TTL:

```bash
GRAPHBRAINZ_CACHE_TTL=3600000  # 1 hour
```

Or disable caching entirely:

```bash
GRAPHBRAINZ_CACHE_SIZE=0
```
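
How these two variables might be read at startup is sketched below. This is not GraphBrainz's actual configuration code: `readCacheConfig` and the `enabled` flag are hypothetical, the defaults are the ones documented for the shared LRU cache, and treating a size of 0 as "caching disabled" follows the statement above.

```javascript
const DEFAULTS = { size: 8192, ttl: 86400000 };

// Parses a non-negative integer, falling back when the value is missing or invalid.
function parseNonNegativeInt(raw, fallback) {
  const parsed = Number.parseInt(raw, 10);
  return Number.isInteger(parsed) && parsed >= 0 ? parsed : fallback;
}

// Hypothetical reader for the shared cache settings.
function readCacheConfig(env = process.env) {
  const size = parseNonNegativeInt(env.GRAPHBRAINZ_CACHE_SIZE, DEFAULTS.size);
  const ttl = parseNonNegativeInt(env.GRAPHBRAINZ_CACHE_TTL, DEFAULTS.ttl);
  return { size, ttl, enabled: size > 0 };
}

console.log(readCacheConfig({}));
console.log(readCacheConfig({ GRAPHBRAINZ_CACHE_TTL: '3600000' }));
console.log(readCacheConfig({ GRAPHBRAINZ_CACHE_SIZE: '0' }));
```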