Alexander a1f6701bac feat: initial implementation of metadata aggregator

- gRPC service with MusicBrainz provider
- PostgreSQL schema with migrations
- Service layer with database-first caching
- Repository pattern for data access
- YAML configuration support
- Research documentation for 17 music metadata projects

2026-04-28 16:28:53 +02:00

GraphBrainz Data Layer

Data Source Architecture

GraphBrainz is a stateless proxy with no persistent database. All data originates from external APIs:

Source	Purpose	Authentication
MusicBrainz REST API	Core music metadata	None
Cover Art Archive	Album artwork	None
fanart.tv	Artist images	API key required
MediaWiki	Wiki images	None
TheAudioDB	Artist biographies	API key required

MusicBrainz Backend

Base URL Configuration

Environment Variable	Default	Purpose
MUSICBRAINZ_BASE_URL	http://musicbrainz.org/ws/2/	API endpoint

Local Mirror Support:

MUSICBRAINZ_BASE_URL=http://localhost:5000/ws/2/

Pointing GraphBrainz at a local MusicBrainz mirror avoids the public API's rate limits and reduces latency.

API Operations

GraphBrainz uses three MusicBrainz API operations:

1. Lookup

Retrieve a single entity by its MBID.

URL Pattern:

GET /ws/2/{entity}/{mbid}?inc={relationships}

Example:

GET /ws/2/artist/5b11f4ce-a62d-471e-81fc-a69a8278c7da?inc=releases+recordings

Supported Entities: area, artist, collection, event, instrument, label, place, recording, release, release-group, series, url, work

2. Browse

Retrieve entities linked to a parent entity.

URL Pattern:

GET /ws/2/{entity}?{parent-entity}={mbid}&limit={limit}&offset={offset}&inc={relationships}

Example:

GET /ws/2/release?artist=5b11f4ce-a62d-471e-81fc-a69a8278c7da&limit=25&offset=0

Supported Relationships: See API.md for full matrix

3. Search

Lucene-based full-text search.

URL Pattern:

GET /ws/2/{entity}?query={lucene-query}&limit={limit}&offset={offset}

Example:

GET /ws/2/artist?query=artist:Radiohead%20AND%20country:GB&limit=25

Supported Entities: area, artist, event, instrument, label, place, recording, release, release-group, work
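
The three URL patterns above can be sketched as small builder functions. This is a minimal illustration, not GraphBrainz's actual request code: the helper names (buildUrl, lookupUrl, browseUrl, searchUrl) and the baseUrl parameter are assumptions made for the example. The inc values are joined with a space, which URL encoding renders as the "+" seen in the examples.

```javascript
// Hypothetical helpers; GraphBrainz's real request builder may differ.
const DEFAULT_BASE_URL = 'http://musicbrainz.org/ws/2/';

function buildUrl(path, params, baseUrl = DEFAULT_BASE_URL) {
  // baseUrl must end with "/" so the path is appended rather than replaced.
  const url = new URL(path, baseUrl);
  for (const [name, value] of Object.entries(params)) {
    // Skip unset parameters so the URL stays minimal.
    if (value !== undefined && value !== '') {
      url.searchParams.set(name, String(value));
    }
  }
  return url.toString();
}

// inc values are space-separated; URL encoding turns the space into "+".
const joinInc = (inc = []) => inc.join(' ');

function lookupUrl(entity, mbid, inc, baseUrl) {
  return buildUrl(`${entity}/${mbid}`, { inc: joinInc(inc) }, baseUrl);
}

function browseUrl(entity, parentEntity, parentMbid, { limit, offset, inc } = {}, baseUrl) {
  return buildUrl(
    entity,
    { [parentEntity]: parentMbid, limit, offset, inc: joinInc(inc) },
    baseUrl
  );
}

function searchUrl(entity, query, { limit, offset } = {}, baseUrl) {
  return buildUrl(entity, { query, limit, offset }, baseUrl);
}
```

Passing a mirror address such as http://localhost:5000/ws/2/ as baseUrl produces the same paths against the local server.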

Include Parameters

GraphBrainz resolvers inspect the GraphQL AST to determine which inc parameters are needed:

Parameter	Description	Entities
aliases	Alternative names	All
annotation	Editorial notes	All
tags	User-generated tags	All
ratings	User ratings	All
genres	Genre classifications	All
artist-credits	Artist credit details	Recording, Release, ReleaseGroup, Track
artists	Related artists	Recording, Release, ReleaseGroup, Work
collections	Collections containing entity	All
labels	Record labels	Release
recordings	Recordings	Artist, Release, Work
releases	Releases	Artist, Label, Recording, ReleaseGroup
release-groups	Release groups	Artist, Release
works	Musical works	Artist, Recording
discids	Disc IDs	Release
media	Media/tracks	Release
isrcs	ISRC codes	Recording
url-rels	URL relationships	All
artist-rels	Artist relationships	All
label-rels	Label relationships	All
recording-rels	Recording relationships	All
release-rels	Release relationships	All
release-group-rels	Release group relationships	All
work-rels	Work relationships	All
area-rels	Area relationships	All
place-rels	Place relationships	All
event-rels	Event relationships	All
series-rels	Series relationships	All
instrument-rels	Instrument relationships	All
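
As a rough sketch of the AST inspection described above, the function below walks a selection set shaped like a graphql-js AST node and collects the inc values it implies. The FIELD_TO_INC mapping and the function name are assumptions for illustration; the real resolvers may derive the mapping differently and also handle fragments.

```javascript
// Hypothetical mapping from GraphQL field names to MusicBrainz inc values.
const FIELD_TO_INC = new Map([
  ['aliases', 'aliases'],
  ['tags', 'tags'],
  ['releases', 'releases'],
  ['recordings', 'recordings'],
  ['releaseGroups', 'release-groups'],
  ['works', 'works'],
]);

// selectionSet follows the graphql-js AST shape:
// { selections: [{ kind: 'Field', name: { value: 'releases' } }, ...] }
function incFromSelectionSet(selectionSet) {
  const inc = new Set();
  for (const selection of selectionSet.selections) {
    // Fragment spreads and inline fragments are ignored in this sketch.
    if (selection.kind !== 'Field') continue;
    const param = FIELD_TO_INC.get(selection.name.value);
    if (param) inc.add(param);
  }
  // Sort so the same set of fields always yields the same inc list.
  return [...inc].sort();
}
```

Fields with no entry in the mapping, such as name, add nothing to the request.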

Response Format

MusicBrainz returns JSON (requested through the Accept header) with an entity-specific structure:

{
  "id": "5b11f4ce-a62d-471e-81fc-a69a8278c7da",
  "name": "Radiohead",
  "sort-name": "Radiohead",
  "type": "Group",
  "country": "GB",
  "life-span": {
    "begin": "1985"
  },
  "releases": [
    {
      "id": "...",
      "title": "OK Computer",
      "date": "1997-05-21"
    }
  ]
}

GraphBrainz transforms this into a GraphQL-friendly format (camelCase field names, nested objects).

Two-Level Caching Strategy

Level 1: DataLoader (Per-Request)

Purpose: Request batching and deduplication within a single GraphQL query.

Lifecycle: Created fresh for each GraphQL request, discarded after response.

Implementation:

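import DataLoader from 'dataloader';

// fetchArtist(mbid, inc) is assumed to be defined elsewhere and to return a Promise.
const artistLoader = new DataLoader(
  // The batch function must return one value (or Error) per key, in key order.
  async (keys) =>
    Promise.all(
      // Map a failed fetch to an Error so one failure does not reject the whole batch.
      keys.map(key => fetchArtist(key.mbid, key.inc).catch(err => err))
    ),
  // Keys are objects, so derive a string cache key; without it, two equal
  // { mbid, inc } objects would not be deduplicated.
  { cacheKeyFn: key => `${key.mbid}:${key.inc}` }
);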

Benefits:

Batches multiple requests for the same entity type
Deduplicates identical requests within a query
Prevents N+1 query problems

Example:

{
  lookup {
    release(mbid: "...") {
      artists {  # Artist 1
        name
      }
      tracks {
        artists {  # Artist 1 again (deduplicated)
          name
        }
      }
    }
  }
}

DataLoader ensures Artist 1 is fetched only once.
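
The deduplication itself can be illustrated without the dataloader package. The sketch below is a simplified stand-in (createRequestLoader and fetchEntity are hypothetical names): it memoizes the pending promise per key for the lifetime of one request, so a second load for the same entity reuses the first fetch. It does not batch keys the way DataLoader does.

```javascript
// Simplified stand-in for a per-request loader; not the dataloader package itself.
function createRequestLoader(fetchEntity) {
  const pending = new Map(); // cache key -> Promise of the entity

  return function load({ mbid, inc = [] }) {
    // Equal { mbid, inc } objects must map to the same string key.
    const key = `${mbid}:${inc.join(',')}`;
    if (!pending.has(key)) {
      pending.set(key, fetchEntity(mbid, inc));
    }
    return pending.get(key);
  };
}
```

Creating one loader per GraphQL request and discarding it afterwards gives the per-request lifecycle described above.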

Level 2: LRU Cache (Shared)

Purpose: Cross-request caching to reduce API calls.

Lifecycle: Shared across all requests, persists for configured TTL.

Configuration:

Parameter	Environment Variable	Default
Size	GRAPHBRAINZ_CACHE_SIZE	8192 items
TTL	GRAPHBRAINZ_CACHE_TTL	86400000 ms (1 day)

Implementation:

import LRU from 'lru-cache';

const cache = new LRU({
  max: 8192,
  ttl: 86400000,  // 1 day
  updateAgeOnGet: true,
  updateAgeOnHas: true
});

Cache Key Strategy:

Keys combine entity type, MBID, and inc parameters to prevent collisions:

artist:5b11f4ce-a62d-471e-81fc-a69a8278c7da:releases,recordings
release:f0c8b1e5-...:artist-credits,labels,media

Different queries for the same entity use different cache keys.

Cache Invalidation:

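Time-based: Items expire after the TTL (default 1 day). Because updateAgeOnGet and updateAgeOnHas are enabled, each read resets an item's age, so frequently read items can outlive the TTL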
Size-based: LRU eviction when cache exceeds max size
No manual invalidation: GraphBrainz assumes MusicBrainz data is relatively stable
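
To make the two eviction rules concrete, here is a minimal TTL plus LRU cache built on a Map. It is a teaching sketch rather than the lru-cache package: the class name and the injectable now clock are assumptions, and unlike the configuration above it does not reset an item's age on reads.

```javascript
// Minimal sketch of time-based expiry and LRU eviction; not the lru-cache package.
class TinyLruCache {
  constructor({ max, ttl, now = () => Date.now() }) {
    this.max = max; // maximum number of entries
    this.ttl = ttl; // entry lifetime in milliseconds
    this.now = now; // injectable clock, useful for tests
    this.entries = new Map(); // Map preserves insertion order: oldest first
  }

  get(key) {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (this.now() - entry.storedAt >= this.ttl) {
      this.entries.delete(key); // time-based expiry
      return undefined;
    }
    // Re-insert to mark the entry as most recently used.
    this.entries.delete(key);
    this.entries.set(key, entry);
    return entry.value;
  }

  set(key, value) {
    this.entries.delete(key);
    this.entries.set(key, { value, storedAt: this.now() });
    if (this.entries.size > this.max) {
      // Size-based eviction: drop the least recently used entry.
      const oldestKey = this.entries.keys().next().value;
      this.entries.delete(oldestKey);
    }
  }
}
```

Keys would follow the entity:mbid:inc format shown earlier, for example artist:5b11f4ce-a62d-471e-81fc-a69a8278c7da:releases,recordings.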

Cache Hit Ratio:

Typical hit ratios for production workloads:

Lookup queries: 60-80% (popular artists cached)
Browse queries: 40-60% (pagination reduces hits)
Search queries: 10-30% (diverse queries)

Extension Caching

Each extension maintains its own LRU cache with separate configuration.

Cover Art Archive

Parameter	Environment Variable	Default
Size	COVERART_CACHE_SIZE	8192
TTL	COVERART_CACHE_TTL	86400000 ms

Cache Key: coverart:{release-mbid}

fanart.tv

Parameter	Environment Variable	Default
Size	FANART_CACHE_SIZE	8192
TTL	FANART_CACHE_TTL	86400000 ms

Cache Key: fanart:{artist-mbid}

TheAudioDB

Parameter	Environment Variable	Default
Size	THEAUDIODB_CACHE_SIZE	8192
TTL	THEAUDIODB_CACHE_TTL	86400000 ms

Cache Key: theaudiodb:{artist-mbid}

MediaWiki

Parameter	Environment Variable	Default
Size	MEDIAWIKI_CACHE_SIZE	8192
TTL	MEDIAWIKI_CACHE_TTL	86400000 ms

Cache Key: mediawiki:{artist-name}

Data Flow

Complete request flow from GraphQL query to response when both caches miss:

1. GraphQL Query Received
   ↓
2. Resolver Inspects AST
   ↓ (determines required inc parameters)
3. DataLoader.load({ mbid, inc })
   ↓
4. Check DataLoader Cache (per-request)
   ↓ (miss)
5. Check LRU Cache (shared)
   ↓ (miss)
6. Rate Limiter Queue
   ↓ (acquire token)
7. HTTP Request via got
   ↓
8. MusicBrainz API Response
   ↓
9. Store in LRU Cache
   ↓
10. Return to DataLoader
    ↓
11. Return to Resolver
    ↓
12. GraphQL Response

DataLoader Cache Hit Path:

1. GraphQL Query Received
   ↓
2. Resolver Inspects AST
   ↓
3. DataLoader.load({ mbid, inc })
   ↓
4. Check DataLoader Cache (per-request)
   ↓ (hit - return immediately)
5. GraphQL Response

Shared Cache Hit Path:

1. GraphQL Query Received
   ↓
2. Resolver Inspects AST
   ↓
3. DataLoader.load({ mbid, inc })
   ↓
4. Check DataLoader Cache (per-request)
   ↓ (miss)
5. Check LRU Cache (shared)
   ↓ (hit - return immediately)
6. Store in DataLoader Cache
   ↓
7. GraphQL Response
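
The three paths above can be condensed into one lookup function. This is a sketch under stated assumptions: requestCache and sharedCache are Map-like objects, fetchFromApi stands in for the rate-limited HTTP call, and the source field exists only to show which layer answered. Concurrent loads of the same key are not coalesced here.

```javascript
// Hypothetical sketch of the lookup order: per-request cache, shared cache, then API.
async function loadEntity(key, { requestCache, sharedCache, fetchFromApi }) {
  if (requestCache.has(key)) {
    // DataLoader cache hit path
    return { value: requestCache.get(key), source: 'dataloader' };
  }
  if (sharedCache.has(key)) {
    // Shared cache hit path: also remember the value for this request.
    const value = sharedCache.get(key);
    requestCache.set(key, value);
    return { value, source: 'lru' };
  }
  // Full miss: fetch from the API, then populate both layers.
  const value = await fetchFromApi(key);
  sharedCache.set(key, value);
  requestCache.set(key, value);
  return { value, source: 'api' };
}
```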

Rate Limiting

GraphBrainz implements custom rate limiting to comply with API policies.

MusicBrainz Rate Limits

Policy: 5 requests per 5.5 seconds (approximately 0.909 requests/second)

Implementation:

Token bucket algorithm
5 tokens maximum
Refill rate: 0.909 tokens/second
Sequential requests (concurrency: 1)

Configuration:

const musicbrainzLimiter = new RateLimiter({
  limit: 5,
  interval: 5500,  // milliseconds
  concurrency: 1
});
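
For readers unfamiliar with the token bucket algorithm, the sketch below shows the core bookkeeping with the same limit and interval values. It is not the RateLimiter class used above: the TokenBucket name, the tryRemoveToken method, and the injectable now clock are assumptions, and queueing and concurrency control are left out.

```javascript
// Minimal token bucket sketch; queueing and concurrency are intentionally omitted.
class TokenBucket {
  constructor({ limit, interval, now = () => Date.now() }) {
    this.limit = limit; // bucket capacity in tokens
    this.interval = interval; // milliseconds needed to refill `limit` tokens
    this.now = now; // injectable clock, useful for tests
    this.tokens = limit; // start with a full bucket
    this.lastRefill = now();
  }

  refill() {
    const current = this.now();
    const elapsed = current - this.lastRefill;
    // Tokens accrue continuously at limit / interval per millisecond, capped at capacity.
    this.tokens = Math.min(this.limit, this.tokens + (elapsed * this.limit) / this.interval);
    this.lastRefill = current;
  }

  // Returns true and consumes a token if a request may be sent now.
  tryRemoveToken() {
    this.refill();
    if (this.tokens < 1) return false;
    this.tokens -= 1;
    return true;
  }
}
```

With limit 5 and interval 5500, a burst of five requests is allowed immediately, after which roughly one request is admitted every 1.1 seconds.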

Extension Rate Limits

Default Policy: 10 requests per second

Implementation:

Token bucket algorithm
10 tokens maximum
Refill rate: 10 tokens/second
Parallel requests (concurrency: 5)

Per-Extension Configuration:

Extension	Rate Limit	Concurrency
Cover Art Archive	10 req/s	5
fanart.tv	10 req/s	5
MediaWiki	10 req/s	5
TheAudioDB	10 req/s	5

Priority Queue

Requests are queued with priority levels when the rate limit is reached:

Priority	Query Type	Rationale
High	Lookup	Direct MBID access, user-initiated
Medium	Browse	Relationship traversal, pagination
Low	Search	Full-text search, exploratory

Higher priority requests are processed first when tokens become available.
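
One simple way to get this ordering is a bucket per priority level, drained from highest to lowest. The sketch below assumes that design; the class name, the PRIORITY table, and first-in-first-out ordering within a level are illustrative choices, not confirmed details of the implementation.

```javascript
// Hypothetical priority levels: a lower number is served first.
const PRIORITY = { lookup: 0, browse: 1, search: 2 };

class PriorityRequestQueue {
  constructor() {
    // One FIFO bucket per priority level.
    this.buckets = [[], [], []];
  }

  enqueue(queryType, request) {
    const level = PRIORITY[queryType];
    if (level === undefined) {
      throw new Error(`Unknown query type: ${queryType}`);
    }
    this.buckets[level].push(request);
  }

  // Returns the oldest request of the highest non-empty priority, or undefined.
  dequeue() {
    for (const bucket of this.buckets) {
      if (bucket.length > 0) return bucket.shift();
    }
    return undefined;
  }
}
```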

Rate Limit Errors

When the rate limit is exceeded and the queue is full:

HTTP Response:

HTTP/1.1 429 Too Many Requests
Retry-After: 5

GraphQL Error:

{
  "errors": [
    {
      "message": "Rate limit exceeded",
      "extensions": {
        "code": "RATE_LIMIT",
        "retryAfter": 5
      }
    }
  ]
}

HTTP Client

GraphBrainz uses got v11.8.2 for HTTP requests.

Client Configuration

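import got from 'got';
import createDebug from 'debug';

// Create the debug logger once rather than on every request.
const debug = createDebug('graphbrainz:api/client');

const client = got.extend({
  // Fall back to the documented default when the variable is unset.
  prefixUrl: process.env.MUSICBRAINZ_BASE_URL || 'http://musicbrainz.org/ws/2/',
  headers: {
    'User-Agent': 'GraphBrainz/9.0.0 (https://github.com/exogen/graphbrainz)',
    Accept: 'application/json'  // MusicBrainz returns XML unless JSON is requested
  },
  timeout: {
    request: 30000  // 30 seconds for the whole request
  },
  retry: {
    limit: 3,
    methods: ['GET'],
    statusCodes: [408, 413, 429, 500, 502, 503, 504]
  },
  hooks: {
    beforeRequest: [
      options => {
        debug(`${options.method} ${options.url}`);
      }
    ]
  }
});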

Request Headers

Header	Value	Purpose
User-Agent	GraphBrainz/9.0.0 (...)	API identification
Accept	application/json	Response format

Timeout Handling

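Request timeout: 30 seconds (covers the whole request, from connection to the end of the response)
Connection and read timeouts: not set separately; got applies no timeout by default, so only the 30-second request timeout is enforced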

Timeout errors are propagated as GraphQL errors.

Retry Logic

Automatic retry for transient failures:

Max retries: 3
Retry methods: GET only
Retry status codes: 408, 413, 429, 500, 502, 503, 504
Backoff: Exponential (1s, 2s, 4s)
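
The policy above can be written down as two small functions. This is a sketch of the stated rules, not got's internal code; the function names are hypothetical, and the small random jitter that got adds to each delay is left out to keep the numbers exact.

```javascript
// Values taken from the retry settings listed above.
const RETRY_LIMIT = 3;
const RETRY_METHODS = new Set(['GET']);
const RETRY_STATUS_CODES = new Set([408, 413, 429, 500, 502, 503, 504]);

// attempt is 1 for the first retry, 2 for the second, and so on.
function shouldRetry(method, statusCode, attempt) {
  return (
    attempt <= RETRY_LIMIT &&
    RETRY_METHODS.has(method) &&
    RETRY_STATUS_CODES.has(statusCode)
  );
}

// Exponential backoff: 1000 ms, 2000 ms, 4000 ms for attempts 1, 2, 3.
function retryDelayMs(attempt) {
  return 2 ** (attempt - 1) * 1000;
}
```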

Data Transformation

MusicBrainz API responses are transformed to GraphQL-friendly format:

Field Name Conversion

MusicBrainz	GraphQL
sort-name	sortName
life-span	lifeSpan
artist-credit	artistCredit
release-group	releaseGroup
iso-3166-1-codes	iso31661Codes
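
A minimal sketch of this conversion, assuming a plain recursive walk over the response (the function names toCamelCase and camelCaseKeys are made up for the example): each hyphen is dropped and the following character is upper-cased, which leaves digits unchanged.

```javascript
// Convert one hyphenated key, e.g. "sort-name" -> "sortName".
function toCamelCase(key) {
  return key.replace(/-([a-z0-9])/g, (_, ch) => ch.toUpperCase());
}

// Apply the key conversion recursively to objects and arrays.
function camelCaseKeys(value) {
  if (Array.isArray(value)) return value.map(camelCaseKeys);
  if (value !== null && typeof value === 'object') {
    return Object.fromEntries(
      Object.entries(value).map(([key, inner]) => [toCamelCase(key), camelCaseKeys(inner)])
    );
  }
  return value; // strings, numbers, booleans, and null pass through unchanged
}
```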

Nested Object Conversion

Key conversion is applied recursively; the nesting itself is preserved rather than flattened.

MusicBrainz:

{
  "life-span": {
    "begin": "1985",
    "end": null
  }
}

GraphQL:

{
  "lifeSpan": {
    "begin": "1985",
    "end": null
  }
}

Array Normalization

MusicBrainz:

{
  "releases": [
    { "id": "...", "title": "..." }
  ]
}

GraphQL (Relay connection):

{
  "releases": {
    "edges": [
      {
        "node": { "id": "...", "title": "..." },
        "cursor": "..."
      }
    ],
    "pageInfo": { ... },
    "totalCount": 1
  }
}
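
The wrapping step can be sketched as follows. The function name, the cursor format (base64 of the item's offset), and the two pageInfo fields shown are assumptions for illustration; the document does not specify how cursors are actually encoded.

```javascript
// Hypothetical helper: wrap a plain array in a Relay-style connection.
// The cursor format (base64 of the item offset) is an assumption.
function toConnection(items, { offset = 0, totalCount = items.length } = {}) {
  const edges = items.map((node, index) => ({
    node,
    cursor: Buffer.from(String(offset + index)).toString('base64'),
  }));
  return {
    edges,
    pageInfo: {
      hasPreviousPage: offset > 0,
      hasNextPage: offset + items.length < totalCount,
    },
    totalCount,
  };
}
```

For a browse request, offset and totalCount would come from the paging values MusicBrainz returns alongside the list.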

Relationship Expansion

MusicBrainz relationships are flattened into GraphQL fields:

MusicBrainz:

{
  "relations": [
    {
      "type": "member of band",
      "target": "5b11f4ce-...",
      "artist": { "name": "Radiohead" }
    }
  ]
}

GraphQL query:

{
  relationships {
    edges {
      node {
        type
        target {
          ... on Artist {
            name
          }
        }
      }
    }
  }
}

Memory Considerations

Cache Memory Usage

With default configuration (8192 items per cache):

Cache	Items	Avg Size	Total Memory
MusicBrainz	8192	5 KB	~40 MB
Cover Art Archive	8192	2 KB	~16 MB
fanart.tv	8192	3 KB	~24 MB
MediaWiki	8192	4 KB	~32 MB
TheAudioDB	8192	2 KB	~16 MB
Total	40960	-	~128 MB
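
The totals in the table follow from items × average entry size, with 1 MB taken as 1024 KB. The short calculation below reproduces them; the average sizes are the estimates from the table, and the function name is made up for the example.

```javascript
const ITEMS_PER_CACHE = 8192;

// Average entry sizes in KB, taken from the table above.
const AVG_ENTRY_KB = {
  'MusicBrainz': 5,
  'Cover Art Archive': 2,
  'fanart.tv': 3,
  'MediaWiki': 4,
  'TheAudioDB': 2,
};

// Returns per-cache and total memory in MB (1 MB = 1024 KB).
function estimateCacheMemoryMb(avgEntryKb, items = ITEMS_PER_CACHE) {
  const perCache = {};
  let total = 0;
  for (const [name, kb] of Object.entries(avgEntryKb)) {
    perCache[name] = (items * kb) / 1024;
    total += perCache[name];
  }
  return { perCache, total };
}
```

Changing a cache size variable scales that cache's share linearly, so halving every cache to 4096 items would halve the estimate.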

DataLoader Memory Usage

DataLoader instances are created per-request and garbage collected after response:

Per-request overhead: ~1-5 MB (depends on query complexity)
Concurrent requests: 100 requests × 5 MB = 500 MB peak

Recommended Memory Allocation

Deployment	Heap Size	Rationale
Development	512 MB	Single user, low traffic
Production (low)	1 GB	10-50 req/s, shared cache
Production (high)	2 GB	100+ req/s, full cache

Node.js Configuration:

node --max-old-space-size=2048 cli.js

Data Freshness

GraphBrainz does not implement cache invalidation beyond TTL expiration. How stale a cached value can become depends on how often the underlying data changes relative to the cache TTL:

Data Type	Typical Update Frequency	Cache TTL	Staleness Risk
Artist metadata	Weeks to months	1 day	Low
Release metadata	Days to weeks	1 day	Low
Relationships	Weeks to months	1 day	Low
Cover art	Months to years	1 day	Very low
Artist images	Months to years	1 day	Very low
Biographies	Months to years	1 day	Very low

For real-time data requirements, reduce cache TTL:

GRAPHBRAINZ_CACHE_TTL=3600000  # 1 hour

Or disable caching entirely:

GRAPHBRAINZ_CACHE_SIZE=0
