GraphBrainz Data Layer

Data Source Architecture

GraphBrainz is a stateless proxy with no persistent database. All data originates from external APIs:

| Source | Purpose | Authentication |
| --- | --- | --- |
| MusicBrainz REST API | Core music metadata | None |
| Cover Art Archive | Album artwork | None |
| fanart.tv | Artist images | API key required |
| MediaWiki | Wiki images | None |
| TheAudioDB | Artist biographies | API key required |

MusicBrainz Backend

Base URL Configuration

| Environment Variable | Default | Purpose |
| --- | --- | --- |
| MUSICBRAINZ_BASE_URL | http://musicbrainz.org/ws/2/ | API endpoint |

Local Mirror Support:

MUSICBRAINZ_BASE_URL=http://localhost:5000/ws/2/

Using a local MusicBrainz mirror avoids the public API's rate limits and reduces latency.

API Operations

GraphBrainz uses three MusicBrainz API operations:

1. Lookup

Retrieve a single entity by its MBID.

URL Pattern:

GET /ws/2/{entity}/{mbid}?inc={relationships}

Example:

GET /ws/2/artist/5b11f4ce-a62d-471e-81fc-a69a8278c7da?inc=releases+recordings

Supported Entities: area, artist, collection, event, instrument, label, place, recording, release, release-group, series, url, work

2. Browse

Retrieve entities linked to a parent entity.

URL Pattern:

GET /ws/2/{entity}?{parent-entity}={mbid}&limit={limit}&offset={offset}&inc={relationships}

Example:

GET /ws/2/release?artist=5b11f4ce-a62d-471e-81fc-a69a8278c7da&limit=25&offset=0

Supported Relationships: See API.md for full matrix

3. Search

Lucene-based full-text search.

URL Pattern:

GET /ws/2/{entity}?query={lucene-query}&limit={limit}&offset={offset}

Example:

GET /ws/2/artist?query=artist:Radiohead%20AND%20country:GB&limit=25

Supported Entities: area, artist, event, instrument, label, place, recording, release, release-group, work
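The three URL patterns above can be sketched as small builder functions. This is a minimal illustration, not GraphBrainz's own code: the helper names (withParams, lookupUrl, browseUrl, searchUrl) are hypothetical, and the base URL is the documented default. It relies on URLSearchParams encoding a space as "+", which produces the "+"-separated inc lists shown in the examples.

```javascript
// Sketch only: helper names are hypothetical, not GraphBrainz's API.
const BASE_URL = 'http://musicbrainz.org/ws/2/'; // documented default

// Build a URL under BASE_URL, skipping empty parameters.
// URLSearchParams serializes a space as "+".
function withParams(path, params) {
  const url = new URL(path, BASE_URL);
  for (const [name, value] of Object.entries(params)) {
    if (value !== undefined && value !== '') {
      url.searchParams.set(name, String(value));
    }
  }
  return url.toString();
}

// Lookup: one entity by MBID.
function lookupUrl(entity, mbid, inc = []) {
  return withParams(`${entity}/${mbid}`, { inc: inc.join(' ') });
}

// Browse: entities linked to a parent entity, with paging.
function browseUrl(entity, parentEntity, mbid, { limit = 25, offset = 0, inc = [] } = {}) {
  return withParams(entity, { [parentEntity]: mbid, limit, offset, inc: inc.join(' ') });
}

// Search: Lucene query string, with paging.
function searchUrl(entity, query, { limit = 25, offset = 0 } = {}) {
  return withParams(entity, { query, limit, offset });
}

console.log(lookupUrl('artist', '5b11f4ce-a62d-471e-81fc-a69a8278c7da', ['releases', 'recordings']));
// → http://musicbrainz.org/ws/2/artist/5b11f4ce-a62d-471e-81fc-a69a8278c7da?inc=releases+recordings
console.log(browseUrl('release', 'artist', '5b11f4ce-a62d-471e-81fc-a69a8278c7da'));
console.log(searchUrl('artist', 'artist:Radiohead AND country:GB'));
```

The search URL comes out with ":" percent-encoded and spaces as "+", which is equivalent to the %20 form used in the example above.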

Include Parameters

GraphBrainz resolvers inspect the GraphQL AST to determine which inc parameters are needed:

| Parameter | Description | Entities |
| --- | --- | --- |
| aliases | Alternative names | All |
| annotation | Editorial notes | All |
| tags | User-generated tags | All |
| ratings | User ratings | All |
| genres | Genre classifications | All |
| artist-credits | Artist credit details | Recording, Release, ReleaseGroup, Track |
| artists | Related artists | Recording, Release, ReleaseGroup, Work |
| collections | Collections containing entity | All |
| labels | Record labels | Release |
| recordings | Recordings | Artist, Release, Work |
| releases | Releases | Artist, Label, Recording, ReleaseGroup |
| release-groups | Release groups | Artist, Release |
| works | Musical works | Artist, Recording |
| discids | Disc IDs | Release |
| media | Media/tracks | Release |
| isrcs | ISRC codes | Recording |
| url-rels | URL relationships | All |
| artist-rels | Artist relationships | All |
| label-rels | Label relationships | All |
| recording-rels | Recording relationships | All |
| release-rels | Release relationships | All |
| release-group-rels | Release group relationships | All |
| work-rels | Work relationships | All |
| area-rels | Area relationships | All |
| place-rels | Place relationships | All |
| event-rels | Event relationships | All |
| series-rels | Series relationships | All |
| instrument-rels | Instrument relationships | All |
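As a rough illustration of how selected GraphQL fields could translate into inc values, the sketch below maps field names to parameters from the table above. The mapping table and the function name incForFields are assumptions for illustration; a real resolver would first collect the field names by walking the query AST.

```javascript
// Hypothetical mapping from GraphQL field names to inc values.
// A Map avoids accidental hits on Object.prototype keys.
const FIELD_TO_INC = new Map([
  ['aliases', 'aliases'],
  ['tags', 'tags'],
  ['rating', 'ratings'],
  ['artistCredits', 'artist-credits'],
  ['releases', 'releases'],
  ['recordings', 'recordings'],
  ['releaseGroups', 'release-groups'],
  ['media', 'media'],
  ['isrcs', 'isrcs'],
]);

// Return the sorted, de-duplicated inc values for the selected fields.
function incForFields(selectedFields) {
  const inc = new Set();
  for (const field of selectedFields) {
    const value = FIELD_TO_INC.get(field);
    if (value !== undefined) inc.add(value); // scalar fields such as "name" need no inc
  }
  return [...inc].sort();
}

console.log(incForFields(['name', 'releases', 'aliases', 'releases']));
// → [ 'aliases', 'releases' ]
```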

Response Format

MusicBrainz returns JSON with entity-specific structure:

{
  "id": "5b11f4ce-a62d-471e-81fc-a69a8278c7da",
  "name": "Radiohead",
  "sort-name": "Radiohead",
  "type": "Group",
  "country": "GB",
  "life-span": {
    "begin": "1985"
  },
  "releases": [
    {
      "id": "...",
      "title": "OK Computer",
      "date": "1997-05-21"
    }
  ]
}

GraphBrainz transforms this into a GraphQL-friendly format (camelCase keys, nested objects).

Two-Level Caching Strategy

Level 1: DataLoader (Per-Request)

Purpose: Request batching and deduplication within a single GraphQL query.

Lifecycle: Created fresh for each GraphQL request, discarded after response.

Implementation:

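import DataLoader from 'dataloader';

// fetchArtist(mbid, inc) is assumed to return a Promise for one artist.
const artistLoader = new DataLoader(
  async (keys) =>
    // Resolve each key independently so one failure does not reject the
    // whole batch; DataLoader accepts Error instances as per-key results.
    Promise.all(
      keys.map((key) => fetchArtist(key.mbid, key.inc).catch((error) => error))
    ),
  {
    // Keys are objects, so derive a string cache key; without this,
    // equal-looking keys are compared by identity and never deduplicated.
    cacheKeyFn: (key) => `${key.mbid}:${key.inc}`
  }
);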

Benefits:

  • Batches multiple requests for same entity type
  • Deduplicates identical requests within query
  • Prevents N+1 query problems

Example:

{
  lookup {
    release(mbid: "...") {
      artists {  # Artist 1
        name
      }
      tracks {
        artists {  # Artist 1 again (deduplicated)
          name
        }
      }
    }
  }
}

DataLoader ensures Artist 1 is fetched only once.
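The deduplication itself can be shown without the library. The following is a minimal stand-in for per-request memoization only (no batching); createRequestLoader and fetchEntity are hypothetical names, and the key format is an assumption.

```javascript
// Minimal stand-in for DataLoader's per-request memoization (no batching).
// fetchEntity is a hypothetical async function supplied by the caller.
function createRequestLoader(fetchEntity) {
  const inFlight = new Map(); // cache key -> Promise

  return {
    load({ mbid, inc = [] }) {
      const key = `${mbid}:${[...inc].sort().join(',')}`;
      if (!inFlight.has(key)) {
        inFlight.set(key, fetchEntity(mbid, inc));
      }
      return inFlight.get(key);
    },
  };
}

// Usage: two loads of the same artist within one request trigger one fetch.
let calls = 0;
const loader = createRequestLoader(async (mbid) => {
  calls += 1;
  return { id: mbid, name: 'Radiohead' };
});

Promise.all([
  loader.load({ mbid: '5b11f4ce-a62d-471e-81fc-a69a8278c7da' }),
  loader.load({ mbid: '5b11f4ce-a62d-471e-81fc-a69a8278c7da' }),
]).then(([first, second]) => {
  console.log(calls, first === second); // 1 true
});
```

Storing the promise rather than the resolved value means a second load issued while the first is still in flight also reuses it.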

Level 2: LRU Cache (Shared)

Purpose: Cross-request caching to reduce API calls.

Lifecycle: Shared across all requests, persists for configured TTL.

Configuration:

| Parameter | Environment Variable | Default |
| --- | --- | --- |
| Size | GRAPHBRAINZ_CACHE_SIZE | 8192 items |
| TTL | GRAPHBRAINZ_CACHE_TTL | 86400000 ms (1 day) |
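Reading these two settings with their defaults might look like the sketch below. The variable names and default values come from the table above; the function name loadCacheConfig and the validation rule (non-negative integers only) are assumptions.

```javascript
// Sketch: parse cache settings from an environment-like object.
function loadCacheConfig(env = process.env) {
  // Fall back to the default when the value is missing or not a
  // non-negative integer.
  const parse = (raw, fallback) => {
    const value = Number.parseInt(raw ?? '', 10);
    return Number.isInteger(value) && value >= 0 ? value : fallback;
  };
  return {
    size: parse(env.GRAPHBRAINZ_CACHE_SIZE, 8192),
    ttlMs: parse(env.GRAPHBRAINZ_CACHE_TTL, 86400000),
  };
}

console.log(loadCacheConfig({}));
// → { size: 8192, ttlMs: 86400000 }
console.log(loadCacheConfig({ GRAPHBRAINZ_CACHE_TTL: '3600000' }));
// → { size: 8192, ttlMs: 3600000 }
```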

Implementation:

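// Option names follow lru-cache v7+; older releases use maxAge instead of ttl.
import LRU from 'lru-cache';

const cache = new LRU({
  max: 8192,             // maximum number of entries
  ttl: 86400000,         // 1 day, in milliseconds
  updateAgeOnGet: true,  // a read restarts the entry's TTL
  updateAgeOnHas: true   // an existence check restarts it too
});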

Cache Key Strategy:

Keys combine entity type, MBID, and inc parameters to prevent collisions:

artist:5b11f4ce-a62d-471e-81fc-a69a8278c7da:releases,recordings
release:f0c8b1e5-...:artist-credits,labels,media

Different queries for the same entity use different cache keys.
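A key builder along these lines would produce keys of that shape. It is a sketch under assumptions: the function name cacheKey is hypothetical, and it additionally sorts and de-duplicates the inc values so that the same set of includes always maps to one key, which the example keys above do not show.

```javascript
// Sketch: build a cache key from entity type, MBID, and inc values.
// Sorting the inc values is an assumption of this sketch, so that
// ['releases', 'recordings'] and ['recordings', 'releases'] share a key.
function cacheKey(entity, mbid, inc = []) {
  const normalized = [...new Set(inc)].sort().join(',');
  return `${entity}:${mbid}:${normalized}`;
}

const a = cacheKey('artist', '5b11f4ce-a62d-471e-81fc-a69a8278c7da', ['releases', 'recordings']);
const b = cacheKey('artist', '5b11f4ce-a62d-471e-81fc-a69a8278c7da', ['recordings', 'releases']);
console.log(a);       // artist:5b11f4ce-a62d-471e-81fc-a69a8278c7da:recordings,releases
console.log(a === b); // true
```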

Cache Invalidation:

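  • Time-based: Items expire after the TTL (default 1 day); with updateAgeOnGet enabled as configured above, each read restarts an item's timer
  • Size-based: The least recently used items are evicted when the cache exceeds its max size
  • No manual invalidation: GraphBrainz assumes MusicBrainz data is relatively stable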
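The two eviction rules can be demonstrated with a small hand-written cache. This is an illustrative stand-in, not the lru-cache library: the class name TtlLruCache and the injectable clock are assumptions, and unlike a cache with updateAgeOnGet it does not extend an entry's TTL on read.

```javascript
// Illustrative LRU cache with a fixed TTL per entry.
class TtlLruCache {
  constructor({ max, ttlMs, now = Date.now }) {
    this.max = max;
    this.ttlMs = ttlMs;
    this.now = now;
    this.entries = new Map(); // Map keeps insertion order: first = least recent
  }

  get(key) {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (this.now() >= entry.expiresAt) { // time-based expiry
      this.entries.delete(key);
      return undefined;
    }
    // Re-insert to mark the entry as most recently used.
    this.entries.delete(key);
    this.entries.set(key, entry);
    return entry.value;
  }

  set(key, value) {
    this.entries.delete(key);
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
    if (this.entries.size > this.max) { // size-based eviction
      const oldestKey = this.entries.keys().next().value;
      this.entries.delete(oldestKey);
    }
  }
}

let clock = 0; // fake clock in milliseconds
const cache = new TtlLruCache({ max: 2, ttlMs: 1000, now: () => clock });
cache.set('a', 1);
cache.set('b', 2);
cache.get('a');              // 'a' becomes most recently used
cache.set('c', 3);           // evicts 'b', the least recently used
console.log(cache.get('b')); // undefined
clock = 1000;
console.log(cache.get('a')); // undefined (expired)
```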

Cache Hit Ratio:

Approximate hit ratios for production workloads (no benchmark is cited, so treat these as rough guidance):

  • Lookup queries: 60-80% (popular artists cached)
  • Browse queries: 40-60% (pagination reduces hits)
  • Search queries: 10-30% (diverse queries)

Extension Caching

Each extension maintains its own LRU cache with separate configuration.

Cover Art Archive

| Parameter | Environment Variable | Default |
| --- | --- | --- |
| Size | COVERART_CACHE_SIZE | 8192 |
| TTL | COVERART_CACHE_TTL | 86400000 ms |

Cache Key: coverart:{release-mbid}

fanart.tv

| Parameter | Environment Variable | Default |
| --- | --- | --- |
| Size | FANART_CACHE_SIZE | 8192 |
| TTL | FANART_CACHE_TTL | 86400000 ms |

Cache Key: fanart:{artist-mbid}

TheAudioDB

| Parameter | Environment Variable | Default |
| --- | --- | --- |
| Size | THEAUDIODB_CACHE_SIZE | 8192 |
| TTL | THEAUDIODB_CACHE_TTL | 86400000 ms |

Cache Key: theaudiodb:{artist-mbid}

MediaWiki

| Parameter | Environment Variable | Default |
| --- | --- | --- |
| Size | MEDIAWIKI_CACHE_SIZE | 8192 |
| TTL | MEDIAWIKI_CACHE_TTL | 86400000 ms |

Cache Key: mediawiki:{artist-name}

Data Flow

Complete request flow from GraphQL query to response:

1. GraphQL Query Received
   ↓
2. Resolver Inspects AST
   ↓ (determines required inc parameters)
3. DataLoader.load({ mbid, inc })
   ↓
4. Check DataLoader Cache (per-request)
   ↓ (miss)
5. Check LRU Cache (shared)
   ↓ (miss)
6. Rate Limiter Queue
   ↓ (acquire token)
7. HTTP Request via got
   ↓
8. MusicBrainz API Response
   ↓
9. Store in LRU Cache
   ↓
10. Return to DataLoader
    ↓
11. Return to Resolver
    ↓
12. GraphQL Response

DataLoader Cache Hit Path:

1. GraphQL Query Received
   ↓
2. Resolver Inspects AST
   ↓
3. DataLoader.load({ mbid, inc })
   ↓
4. Check DataLoader Cache (per-request)
   ↓ (hit - return immediately)
5. GraphQL Response

Shared Cache Hit Path:

1. GraphQL Query Received
   ↓
2. Resolver Inspects AST
   ↓
3. DataLoader.load({ mbid, inc })
   ↓
4. Check DataLoader Cache (per-request)
   ↓ (miss)
5. Check LRU Cache (shared)
   ↓ (hit - skip rate limiter and HTTP request)
6. Store in DataLoader Cache
   ↓
7. GraphQL Response
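The three paths above differ only in where the lookup stops. A compact sketch of that order follows; every name in it (resolveEntity, requestCache, sharedCache, acquireToken, fetchFromApi) is hypothetical, and plain Maps stand in for the DataLoader and LRU caches.

```javascript
// Sketch of the lookup order: request cache, then shared cache, then API.
async function resolveEntity(key, { requestCache, sharedCache, acquireToken, fetchFromApi }) {
  // Per-request cache hit: return the stored promise.
  if (requestCache.has(key)) return requestCache.get(key);

  const pending = (async () => {
    // Shared cache hit: no rate limiter, no HTTP request.
    if (sharedCache.has(key)) return sharedCache.get(key);
    await acquireToken();                  // wait for the rate limiter
    const value = await fetchFromApi(key); // HTTP request and response
    sharedCache.set(key, value);           // store for later requests
    return value;
  })();

  // Later loads within the same request reuse this promise.
  requestCache.set(key, pending);
  return pending;
}

// Usage: the second "request" gets a fresh request cache but shares the
// cross-request cache, so the API is called once.
const sharedCache = new Map();
let apiCalls = 0;
const newRequest = () => ({
  requestCache: new Map(),
  sharedCache,
  acquireToken: async () => {},
  fetchFromApi: async (key) => {
    apiCalls += 1;
    return { key };
  },
});

(async () => {
  await resolveEntity('artist:abc:', newRequest()); // full miss
  await resolveEntity('artist:abc:', newRequest()); // shared cache hit
  console.log(apiCalls); // 1
})();
```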

Rate Limiting

GraphBrainz implements custom rate limiting to comply with API policies.

MusicBrainz Rate Limits

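Policy: 5 requests per 5.5 seconds (approximately 0.909 requests/second), which keeps the average just under the 1 request per second that MusicBrainz asks clients to respect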

Implementation:

  • Token bucket algorithm
  • 5 tokens maximum
  • Refill rate: 0.909 tokens/second
  • Sequential requests (concurrency: 1)

Configuration:

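// RateLimiter is the custom limiter class described above (definition not shown).
const musicbrainzLimiter = new RateLimiter({
  limit: 5,        // tokens per interval
  interval: 5500,  // milliseconds
  concurrency: 1   // one request in flight at a time
});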
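To make the token bucket behaviour concrete, here is a minimal sketch with the same numbers (5 tokens, refilled over 5500 ms). It is not GraphBrainz's RateLimiter: the class name TokenBucket, the tryAcquire method, and the injectable clock are assumptions, and queuing and concurrency are left out.

```javascript
// Minimal token bucket: `limit` tokens regenerate per `intervalMs`.
class TokenBucket {
  constructor({ limit, intervalMs, now = Date.now }) {
    this.capacity = limit;
    this.intervalMs = intervalMs;
    this.now = now;
    this.tokens = limit; // start with a full bucket
    this.lastRefill = now();
  }

  // Add tokens for the time elapsed since the last refill, capped at capacity.
  refill() {
    const current = this.now();
    const elapsedMs = current - this.lastRefill;
    const refilled = (elapsedMs * this.capacity) / this.intervalMs;
    this.tokens = Math.min(this.capacity, this.tokens + refilled);
    this.lastRefill = current;
  }

  // Take one token if available; return whether the request may proceed.
  tryAcquire() {
    this.refill();
    if (this.tokens < 1) return false;
    this.tokens -= 1;
    return true;
  }
}

let clock = 0; // fake clock in milliseconds
const bucket = new TokenBucket({ limit: 5, intervalMs: 5500, now: () => clock });
const burst = Array.from({ length: 6 }, () => bucket.tryAcquire());
console.log(burst); // [ true, true, true, true, true, false ]
clock += 1100;      // 1100 ms × (5 tokens / 5500 ms) = 1 token
console.log(bucket.tryAcquire()); // true
```

A burst of five requests passes immediately, after which requests are admitted at roughly one per 1.1 seconds.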

Extension Rate Limits

Default Policy: 10 requests per second

Implementation:

  • Token bucket algorithm
  • 10 tokens maximum
  • Refill rate: 10 tokens/second
  • Parallel requests (concurrency: 5)

Per-Extension Configuration:

| Extension | Rate Limit | Concurrency |
| --- | --- | --- |
| Cover Art Archive | 10 req/s | 5 |
| fanart.tv | 10 req/s | 5 |
| MediaWiki | 10 req/s | 5 |
| TheAudioDB | 10 req/s | 5 |

Priority Queue

Requests are queued by priority level when the rate limit is reached:

| Priority | Query Type | Rationale |
| --- | --- | --- |
| High | Lookup | Direct MBID access, user-initiated |
| Medium | Browse | Relationship traversal, pagination |
| Low | Search | Full-text search, exploratory |

Higher priority requests are processed first when tokens become available.
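One simple way to get this ordering is a queue sorted by priority, with arrival order as the tie-breaker. The sketch below is illustrative only: the class name PriorityRequestQueue and the numeric priority values are assumptions, and a production queue would more likely use a heap than re-sort on every insert.

```javascript
// Lower number = served first.
const PRIORITY = { lookup: 0, browse: 1, search: 2 };

class PriorityRequestQueue {
  constructor() {
    this.items = [];
    this.sequence = 0; // preserves FIFO order within a priority level
  }

  enqueue(type, request) {
    this.items.push({ priority: PRIORITY[type], order: this.sequence++, request });
    this.items.sort((a, b) => a.priority - b.priority || a.order - b.order);
  }

  // Remove and return the highest-priority request, or undefined if empty.
  dequeue() {
    const next = this.items.shift();
    return next ? next.request : undefined;
  }
}

const queue = new PriorityRequestQueue();
queue.enqueue('search', 's1');
queue.enqueue('lookup', 'l1');
queue.enqueue('browse', 'b1');
queue.enqueue('lookup', 'l2');
console.log([queue.dequeue(), queue.dequeue(), queue.dequeue(), queue.dequeue()]);
// → [ 'l1', 'l2', 'b1', 's1' ]
```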

Rate Limit Errors

When the rate limit is exceeded and the queue is full:

HTTP Response:

HTTP/1.1 429 Too Many Requests
Retry-After: 5

GraphQL Error:

{
  "errors": [
    {
      "message": "Rate limit exceeded",
      "extensions": {
        "code": "RATE_LIMIT",
        "retryAfter": 5
      }
    }
  ]
}

HTTP Client

GraphBrainz uses got v11.8.2 for HTTP requests.

Client Configuration

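import got from 'got';
import createDebug from 'debug';

const debug = createDebug('graphbrainz:api/client');

const client = got.extend({
  // Fall back to the documented default when the variable is unset.
  prefixUrl: process.env.MUSICBRAINZ_BASE_URL || 'http://musicbrainz.org/ws/2/',
  headers: {
    'User-Agent': 'GraphBrainz/9.0.0 (https://github.com/exogen/graphbrainz)',
    Accept: 'application/json'  // request JSON rather than the XML default
  },
  timeout: {
    request: 30000  // 30 seconds for the whole request
  },
  retry: {
    limit: 3,
    methods: ['GET'],
    statusCodes: [408, 413, 429, 500, 502, 503, 504]
  },
  hooks: {
    beforeRequest: [
      options => {
        // Log each outgoing request as "METHOD URL".
        debug(`${options.method} ${options.url}`);
      }
    ]
  }
});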

Request Headers

| Header | Value | Purpose |
| --- | --- | --- |
| User-Agent | GraphBrainz/9.0.0 (...) | API identification |
| Accept | application/json | Response format |

Timeout Handling

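  • Request timeout: 30 seconds, covering the whole request from connection to the end of the response
  • Connection and read timeouts: not set separately; got applies no timeout by default, so only the 30-second request limit from the client configuration is in effect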

Timeout errors are propagated as GraphQL errors.

Retry Logic

Automatic retry for transient failures:

  • Max retries: 3
  • Retry methods: GET only
  • Retry status codes: 408, 413, 429, 500, 502, 503, 504
  • Backoff: Exponential (1s, 2s, 4s)
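The retry rules listed above reduce to two small functions. This is a sketch of the arithmetic only, with hypothetical names (shouldRetry, backoffDelayMs); it ignores any jitter or Retry-After handling the HTTP client may add.

```javascript
const RETRY_STATUS_CODES = new Set([408, 413, 429, 500, 502, 503, 504]);
const MAX_RETRIES = 3;

// attempt is 1-based: 1 is the first retry.
function shouldRetry(method, statusCode, attempt) {
  return method === 'GET' && attempt <= MAX_RETRIES && RETRY_STATUS_CODES.has(statusCode);
}

// Exponential backoff: the delay doubles with each attempt, starting at 1 s.
function backoffDelayMs(attempt) {
  return 2 ** (attempt - 1) * 1000;
}

console.log([1, 2, 3].map(backoffDelayMs)); // [ 1000, 2000, 4000 ]
console.log(shouldRetry('GET', 503, 1));    // true
console.log(shouldRetry('POST', 503, 1));   // false
console.log(shouldRetry('GET', 404, 1));    // false
console.log(shouldRetry('GET', 503, 4));    // false
```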

Data Transformation

MusicBrainz API responses are transformed into a GraphQL-friendly format:

Field Name Conversion

| MusicBrainz | GraphQL |
| --- | --- |
| sort-name | sortName |
| life-span | lifeSpan |
| artist-credit | artistCredit |
| release-group | releaseGroup |
| iso-3166-1-codes | iso31661Codes |
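A recursive key converter reproduces every row of this table. The sketch below is one possible implementation, with hypothetical function names (toCamelCase, camelCaseKeys); it assumes keys are lowercase and hyphen-separated, as in the MusicBrainz responses shown here.

```javascript
// Convert one hyphenated key: drop each "-" and uppercase the next character.
function toCamelCase(key) {
  return key.replace(/-([a-z0-9])/g, (_, ch) => ch.toUpperCase());
}

// Apply the conversion to every key, descending into objects and arrays.
function camelCaseKeys(value) {
  if (Array.isArray(value)) return value.map(camelCaseKeys);
  if (value !== null && typeof value === 'object') {
    return Object.fromEntries(
      Object.entries(value).map(([key, inner]) => [toCamelCase(key), camelCaseKeys(inner)])
    );
  }
  return value; // strings, numbers, booleans, null
}

console.log(camelCaseKeys({ 'sort-name': 'Radiohead', 'life-span': { begin: '1985', end: null } }));
// → { sortName: 'Radiohead', lifeSpan: { begin: '1985', end: null } }
console.log(toCamelCase('iso-3166-1-codes')); // → iso31661Codes
```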

Nested Object Handling

Nested objects keep their structure; only the keys are converted to camelCase.

MusicBrainz:

{
  "life-span": {
    "begin": "1985",
    "end": null
  }
}

GraphQL:

{
  "lifeSpan": {
    "begin": "1985",
    "end": null
  }
}

Array Normalization

MusicBrainz:

{
  "releases": [
    { "id": "...", "title": "..." }
  ]
}

GraphQL (Relay connection):

{
  "releases": {
    "edges": [
      {
        "node": { "id": "...", "title": "..." },
        "cursor": "..."
      }
    ],
    "pageInfo": { ... },
    "totalCount": 1
  }
}
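Wrapping a page of results in that connection shape could look like the following sketch. The function names and the cursor format (a base64-encoded offset) are assumptions made for illustration; the source does not say how cursors are encoded.

```javascript
// Cursor format is an assumption: base64 of "offset:<absolute index>".
function encodeCursor(offset) {
  return Buffer.from(`offset:${offset}`).toString('base64');
}

// Wrap one page of items (starting at `offset`) in a Relay-style connection.
function toConnection(items, { offset = 0, totalCount = items.length } = {}) {
  const edges = items.map((node, index) => ({
    node,
    cursor: encodeCursor(offset + index),
  }));
  return {
    edges,
    pageInfo: {
      hasPreviousPage: offset > 0,
      hasNextPage: offset + items.length < totalCount,
      startCursor: edges.length ? edges[0].cursor : null,
      endCursor: edges.length ? edges[edges.length - 1].cursor : null,
    },
    totalCount,
  };
}

const page = toConnection([{ id: 'r1', title: 'OK Computer' }], { offset: 0, totalCount: 3 });
console.log(page.edges.length);         // 1
console.log(page.pageInfo.hasNextPage); // true
console.log(page.totalCount);           // 3
```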

Relationship Expansion

MusicBrainz relations are exposed through a GraphQL relationships connection whose targets are resolved by entity type:

MusicBrainz:

{
  "relations": [
    {
      "type": "member of band",
      "target": "5b11f4ce-...",
      "artist": { "name": "Radiohead" }
    }
  ]
}

GraphQL query:

{
  relationships {
    edges {
      node {
        type
        target {
          ... on Artist {
            name
          }
        }
      }
    }
  }
}

Memory Considerations

Cache Memory Usage

With default configuration (8192 items per cache):

| Cache | Items | Avg Size | Total Memory |
| --- | --- | --- | --- |
| MusicBrainz | 8192 | 5 KB | ~40 MB |
| Cover Art Archive | 8192 | 2 KB | ~16 MB |
| fanart.tv | 8192 | 3 KB | ~24 MB |
| MediaWiki | 8192 | 4 KB | ~32 MB |
| TheAudioDB | 8192 | 2 KB | ~16 MB |
| Total | 40960 | - | ~128 MB |

DataLoader Memory Usage

DataLoader instances are created per-request and garbage collected after response:

  • Per-request overhead: ~1-5 MB (depends on query complexity)
  • Concurrent requests: 100 requests × 5 MB = 500 MB peak
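The estimates above follow from simple arithmetic, reproduced here as a worked example. The average item sizes and the 100 × 5 MB concurrency figure are taken from this section as given; they are estimates, not measurements.

```javascript
// Inputs copied from the estimates in this section.
const ITEMS_PER_CACHE = 8192;
const AVG_ITEM_KB = {
  MusicBrainz: 5,
  'Cover Art Archive': 2,
  'fanart.tv': 3,
  MediaWiki: 4,
  TheAudioDB: 2,
};

// Each cache: 8192 items × average size in KB, converted to MB.
const cacheMb = Object.values(AVG_ITEM_KB)
  .reduce((total, kb) => total + (ITEMS_PER_CACHE * kb) / 1024, 0);

// Peak DataLoader usage: 100 concurrent requests × 5 MB each.
const dataLoaderPeakMb = 100 * 5;

console.log(cacheMb);                    // 128
console.log(dataLoaderPeakMb);           // 500
console.log(cacheMb + dataLoaderPeakMb); // 628
```

Under these assumptions, full caches plus 100 concurrent worst-case requests come to roughly 628 MB before any other heap usage.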

Recommended Heap Size

| Deployment | Heap Size | Rationale |
| --- | --- | --- |
| Development | 512 MB | Single user, low traffic |
| Production (low) | 1 GB | 10-50 req/s, shared cache |
| Production (high) | 2 GB | 100+ req/s, full cache |

Node.js Configuration:

node --max-old-space-size=2048 cli.js

Data Freshness

GraphBrainz does not implement cache invalidation beyond TTL expiration. Data freshness depends on:

| Data Type | Typical Update Frequency | Cache TTL | Staleness Risk |
| --- | --- | --- | --- |
| Artist metadata | Weeks to months | 1 day | Low |
| Release metadata | Days to weeks | 1 day | Low |
| Relationships | Weeks to months | 1 day | Low |
| Cover art | Months to years | 1 day | Very low |
| Artist images | Months to years | 1 day | Very low |
| Biographies | Months to years | 1 day | Very low |

For real-time data requirements, reduce cache TTL:

GRAPHBRAINZ_CACHE_TTL=3600000  # 1 hour

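Or set the cache size to 0 to disable caching entirely. Verify this against the installed lru-cache version first, because lru-cache itself treats max: 0 as having no size limit: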

GRAPHBRAINZ_CACHE_SIZE=0