# GraphBrainz Data Layer

## Data Source Architecture

GraphBrainz is a **stateless proxy** with no persistent database. All data originates from external APIs:

| Source | Purpose | Authentication |
|--------|---------|----------------|
| MusicBrainz REST API | Core music metadata | None |
| Cover Art Archive | Album artwork | None |
| fanart.tv | Artist images | API key required |
| MediaWiki | Wiki images | None |
| TheAudioDB | Artist biographies | API key required |

## MusicBrainz Backend

### Base URL Configuration

| Environment Variable | Default | Purpose |
|---------------------|---------|---------|
| MUSICBRAINZ_BASE_URL | http://musicbrainz.org/ws/2/ | API endpoint |

**Local Mirror Support**:

```bash
MUSICBRAINZ_BASE_URL=http://localhost:5000/ws/2/
```

Using a local MusicBrainz mirror avoids the public API's rate limits and reduces latency.

### API Operations

GraphBrainz uses three MusicBrainz API operations:

#### 1. Lookup

Retrieve a single entity by MBID.

**URL Pattern**:

```
GET /ws/2/{entity}/{mbid}?inc={relationships}
```

**Example**:

```
GET /ws/2/artist/5b11f4ce-a62d-471e-81fc-a69a8278c7da?inc=releases+recordings
```

**Supported Entities**: area, artist, collection, event, instrument, label, place, recording, release, release-group, series, url, work

#### 2. Browse

Retrieve entities linked to a parent entity.

**URL Pattern**:

```
GET /ws/2/{entity}?{parent-entity}={mbid}&limit={limit}&offset={offset}&inc={relationships}
```

**Example**:

```
GET /ws/2/release?artist=5b11f4ce-a62d-471e-81fc-a69a8278c7da&limit=25&offset=0
```

**Supported Relationships**: See API.md for the full matrix.

#### 3. Search

Lucene-based full-text search.

**URL Pattern**:

```
GET /ws/2/{entity}?query={lucene-query}&limit={limit}&offset={offset}
```

**Example**:

```
GET /ws/2/artist?query=artist:Radiohead%20AND%20country:GB&limit=25
```

**Supported Entities**: area, artist, event, instrument, label, place, recording, release, release-group, work

### Include Parameters

GraphBrainz resolvers inspect the GraphQL AST to determine which `inc` parameters are needed:

| Parameter | Description | Entities |
|-----------|-------------|----------|
| aliases | Alternative names | All |
| annotation | Editorial notes | All |
| tags | User-generated tags | All |
| ratings | User ratings | All |
| genres | Genre classifications | All |
| artist-credits | Artist credit details | Recording, Release, ReleaseGroup, Track |
| artists | Related artists | Recording, Release, ReleaseGroup, Work |
| collections | Collections containing entity | All |
| labels | Record labels | Release |
| recordings | Recordings | Artist, Release, Work |
| releases | Releases | Artist, Label, Recording, ReleaseGroup |
| release-groups | Release groups | Artist, Release |
| works | Musical works | Artist, Recording |
| discids | Disc IDs | Release |
| media | Media/tracks | Release |
| isrcs | ISRC codes | Recording |
| url-rels | URL relationships | All |
| artist-rels | Artist relationships | All |
| label-rels | Label relationships | All |
| recording-rels | Recording relationships | All |
| release-rels | Release relationships | All |
| release-group-rels | Release group relationships | All |
| work-rels | Work relationships | All |
| area-rels | Area relationships | All |
| place-rels | Place relationships | All |
| event-rels | Event relationships | All |
| series-rels | Series relationships | All |
| instrument-rels | Instrument relationships | All |

### Response Format

MusicBrainz returns JSON with an entity-specific structure:

```json
{
  "id": "5b11f4ce-a62d-471e-81fc-a69a8278c7da",
  "name": "Radiohead",
  "sort-name": "Radiohead",
  "type": "Group",
  "country": "GB",
  "life-span": { "begin": "1985" },
  "releases": [
    { "id": "...", "title": "OK Computer", "date": "1997-05-21" }
  ]
}
```

GraphBrainz transforms this into a GraphQL-friendly format (camelCase keys, nested objects).

## Two-Level Caching Strategy

### Level 1: DataLoader (Per-Request)

**Purpose**: Request batching and deduplication within a single GraphQL query.

**Lifecycle**: Created fresh for each GraphQL request, discarded after the response.

**Implementation**:

```javascript
import DataLoader from 'dataloader';

// Keys are { mbid, inc } objects. DataLoader compares keys by identity unless
// cacheKeyFn is given, so derive a string key to make deduplication work.
// fetchArtist is assumed to be defined elsewhere.
const artistLoader = new DataLoader(
  async (keys) => {
    // Must resolve to one result per key, in the same order as `keys`.
    return Promise.all(keys.map((key) => fetchArtist(key.mbid, key.inc)));
  },
  { cacheKeyFn: (key) => `${key.mbid}:${key.inc}` }
);
```

**Benefits**:

- Batches multiple requests for the same entity type
- Deduplicates identical requests within a query
- Prevents N+1 query problems

**Example**:

```graphql
{
  lookup {
    release(mbid: "...") {
      artists {        # Artist 1
        name
      }
      tracks {
        artists {      # Artist 1 again (deduplicated)
          name
        }
      }
    }
  }
}
```

DataLoader ensures Artist 1 is fetched only once.

### Level 2: LRU Cache (Shared)

**Purpose**: Cross-request caching to reduce API calls.

**Lifecycle**: Shared across all requests, persists for the configured TTL.

**Configuration**:

| Parameter | Environment Variable | Default |
|-----------|---------------------|---------|
| Size | GRAPHBRAINZ_CACHE_SIZE | 8192 items |
| TTL | GRAPHBRAINZ_CACHE_TTL | 86400000 ms (1 day) |

**Implementation**:

```javascript
import LRU from 'lru-cache';

const cache = new LRU({
  max: 8192,
  ttl: 86400000, // 1 day
  updateAgeOnGet: true,
  updateAgeOnHas: true
});
```

**Cache Key Strategy**:

Keys combine entity type, MBID, and `inc` parameters to prevent collisions:

```
artist:5b11f4ce-a62d-471e-81fc-a69a8278c7da:releases,recordings
release:f0c8b1e5-...:artist-credits,labels,media
```

Requests for the same entity with different `inc` parameters use different cache keys.

**Cache Invalidation**:

- **Time-based**: Items expire after the TTL (default 1 day). With `updateAgeOnGet: true`, as in the configuration shown earlier, each read resets an item's age, so frequently read items can outlive the nominal TTL.
- **Size-based**: LRU eviction when the cache exceeds its max size
- **No manual invalidation**: GraphBrainz assumes MusicBrainz data is relatively stable

**Cache Hit Ratio**:

Typical hit ratios for production workloads:

- Lookup queries: 60-80% (popular artists cached)
- Browse queries: 40-60% (pagination reduces hits)
- Search queries: 10-30% (diverse queries)

## Extension Caching

Each extension maintains its own LRU cache with separate configuration.

### Cover Art Archive

| Parameter | Environment Variable | Default |
|-----------|---------------------|---------|
| Size | COVERART_CACHE_SIZE | 8192 |
| TTL | COVERART_CACHE_TTL | 86400000 ms |

**Cache Key**: `coverart:{release-mbid}`

### fanart.tv

| Parameter | Environment Variable | Default |
|-----------|---------------------|---------|
| Size | FANART_CACHE_SIZE | 8192 |
| TTL | FANART_CACHE_TTL | 86400000 ms |

**Cache Key**: `fanart:{artist-mbid}`

### TheAudioDB

| Parameter | Environment Variable | Default |
|-----------|---------------------|---------|
| Size | THEAUDIODB_CACHE_SIZE | 8192 |
| TTL | THEAUDIODB_CACHE_TTL | 86400000 ms |

**Cache Key**: `theaudiodb:{artist-mbid}`

### MediaWiki

| Parameter | Environment Variable | Default |
|-----------|---------------------|---------|
| Size | MEDIAWIKI_CACHE_SIZE | 8192 |
| TTL | MEDIAWIKI_CACHE_TTL | 86400000 ms |

**Cache Key**: `mediawiki:{artist-name}`

## Data Flow

Complete request flow from GraphQL query to response:

```
1. GraphQL Query Received
   ↓
2. Resolver Inspects AST
   ↓ (determines required inc parameters)
3. DataLoader.load({ mbid, inc })
   ↓
4. Check DataLoader Cache (per-request)
   ↓ (miss)
5. Check LRU Cache (shared)
   ↓ (miss)
6. Rate Limiter Queue
   ↓ (acquire token)
7. HTTP Request via got
   ↓
8. MusicBrainz API Response
   ↓
9. Store in LRU Cache
   ↓
10. Return to DataLoader
   ↓
11. Return to Resolver
   ↓
12. GraphQL Response
```

**Per-Request Cache Hit Path**:

```
1. GraphQL Query Received
   ↓
2. Resolver Inspects AST
   ↓
3. DataLoader.load({ mbid, inc })
   ↓
4. Check DataLoader Cache (per-request)
   ↓ (hit - return immediately)
5. GraphQL Response
```

**Shared Cache Hit Path**:

```
1. GraphQL Query Received
   ↓
2. Resolver Inspects AST
   ↓
3. DataLoader.load({ mbid, inc })
   ↓
4. Check DataLoader Cache (per-request)
   ↓ (miss)
5. Check LRU Cache (shared)
   ↓ (hit - return immediately)
6. Store in DataLoader Cache
   ↓
7. GraphQL Response
```

## Rate Limiting

GraphBrainz implements custom rate limiting to comply with API policies.

### MusicBrainz Rate Limits

**Policy**: 5 requests per 5.5 seconds (approximately 0.909 requests/second)

**Implementation**:

- Token bucket algorithm
- 5 tokens maximum
- Refill rate: 0.909 tokens/second
- Sequential requests (concurrency: 1)

**Configuration**:

```javascript
const musicbrainzLimiter = new RateLimiter({
  limit: 5,
  interval: 5500, // milliseconds
  concurrency: 1
});
```

### Extension Rate Limits

**Default Policy**: 10 requests per second

**Implementation**:

- Token bucket algorithm
- 10 tokens maximum
- Refill rate: 10 tokens/second
- Parallel requests (concurrency: 5)

**Per-Extension Configuration**:

| Extension | Rate Limit | Concurrency |
|-----------|------------|-------------|
| Cover Art Archive | 10 req/s | 5 |
| fanart.tv | 10 req/s | 5 |
| MediaWiki | 10 req/s | 5 |
| TheAudioDB | 10 req/s | 5 |

### Priority Queue

When the rate limit is reached, requests are queued with priority levels:

| Priority | Query Type | Rationale |
|----------|------------|-----------|
| High | Lookup | Direct MBID access, user-initiated |
| Medium | Browse | Relationship traversal, pagination |
| Low | Search | Full-text search, exploratory |

Higher-priority requests are processed first when tokens become available.
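
The token bucket and priority ordering described above can be sketched as follows. This is a minimal illustration under stated assumptions, not GraphBrainz's actual limiter: `TokenBucket`, `PRIORITY`, and `takeNext` are hypothetical names, time is passed in explicitly so the behavior is easy to follow, and the concurrency setting is not modeled.

```javascript
// Hypothetical token bucket; names and structure are illustrative only.
class TokenBucket {
  constructor({ limit, interval, now = 0 }) {
    this.capacity = limit;               // maximum tokens held at once
    this.refillPerMs = limit / interval; // 5 / 5500 ms is about 0.909 tokens/s
    this.tokens = limit;                 // start with a full bucket
    this.lastRefill = now;
  }

  // Add the tokens earned since the last refill, capped at capacity.
  refill(now) {
    const elapsed = now - this.lastRefill;
    this.tokens = Math.min(this.capacity, this.tokens + elapsed * this.refillPerMs);
    this.lastRefill = now;
  }

  // Take one token if available; returns true when a request may proceed.
  tryAcquire(now) {
    this.refill(now);
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}

// Lower value is served first, matching the priority table above.
const PRIORITY = { lookup: 0, browse: 1, search: 2 };

// Remove and return the next request: best priority, FIFO within a level.
function takeNext(queue) {
  let best = 0;
  for (let i = 1; i < queue.length; i++) {
    if (PRIORITY[queue[i].type] < PRIORITY[queue[best].type]) best = i;
  }
  return queue.splice(best, 1)[0];
}

const bucket = new TokenBucket({ limit: 5, interval: 5500 });
const burst = [1, 2, 3, 4, 5, 6].map(() => bucket.tryAcquire(0));
console.log(burst);                   // [ true, true, true, true, true, false ]
console.log(bucket.tryAcquire(1200)); // true: about 1.09 tokens refilled in 1.2 s

const queue = [{ type: 'search' }, { type: 'browse' }, { type: 'lookup' }];
console.log(takeNext(queue).type);    // lookup
```

A burst of five requests passes immediately, after which throughput settles at the refill rate. The linear scan in `takeNext` is fine for short queues; a heap would be the usual choice if queues grow large.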

### Rate Limit Errors

When the rate limit is exceeded and the queue is full:

**HTTP Response**:

```
HTTP/1.1 429 Too Many Requests
Retry-After: 5
```

**GraphQL Error**:

```json
{
  "errors": [
    {
      "message": "Rate limit exceeded",
      "extensions": {
        "code": "RATE_LIMIT",
        "retryAfter": 5
      }
    }
  ]
}
```

## HTTP Client

GraphBrainz uses `got` v11.8.2 for HTTP requests.

### Client Configuration

```javascript
import createDebug from 'debug';
import got from 'got';

// Create the logger once rather than on every request.
const debug = createDebug('graphbrainz:api/client');

const client = got.extend({
  prefixUrl: process.env.MUSICBRAINZ_BASE_URL,
  headers: {
    'User-Agent': 'GraphBrainz/9.0.0 (https://github.com/exogen/graphbrainz)'
  },
  timeout: {
    request: 30000 // 30 seconds
  },
  retry: {
    limit: 3,
    methods: ['GET'],
    statusCodes: [408, 413, 429, 500, 502, 503, 504]
  },
  hooks: {
    beforeRequest: [
      options => {
        // Log every outgoing request, e.g. "GET http://.../ws/2/artist/..."
        debug(`${options.method} ${options.url}`);
      }
    ]
  }
});
```

### Request Headers

| Header | Value | Purpose |
|--------|-------|---------|
| User-Agent | GraphBrainz/9.0.0 (...) | API identification |
| Accept | application/json | Response format |

### Timeout Handling

- **Request timeout**: 30 seconds
- **Connection timeout**: 10 seconds (default)
- **Read timeout**: 30 seconds (default)

Timeout errors are propagated as GraphQL errors.

### Retry Logic

Automatic retry for transient failures:

- **Max retries**: 3
- **Retry methods**: GET only
- **Retry status codes**: 408, 413, 429, 500, 502, 503, 504
- **Backoff**: Exponential (1s, 2s, 4s)

## Data Transformation

MusicBrainz API responses are transformed into a GraphQL-friendly format:

### Field Name Conversion

| MusicBrainz | GraphQL |
|-------------|---------|
| sort-name | sortName |
| life-span | lifeSpan |
| artist-credit | artistCredit |
| release-group | releaseGroup |
| iso-3166-1-codes | iso31661Codes |

### Nested Objects

Nested objects keep their structure; only their keys are converted to camelCase.

**MusicBrainz**:

```json
{
  "life-span": {
    "begin": "1985",
    "end": null
  }
}
```

**GraphQL**:

```json
{
  "lifeSpan": {
    "begin": "1985",
    "end": null
  }
}
```

### Array Normalization

**MusicBrainz**:

```json
{
  "releases": [
    { "id": "...", "title": "..." }
  ]
}
```

**GraphQL** (Relay connection):

```json
{
  "releases": {
    "edges": [
      {
        "node": { "id": "...", "title": "..." },
        "cursor": "..."
      }
    ],
    "pageInfo": { ... },
    "totalCount": 1
  }
}
```

### Relationship Expansion

MusicBrainz relationships are exposed through a `relationships` connection whose targets resolve to typed entities:

**MusicBrainz**:

```json
{
  "relations": [
    {
      "type": "member of band",
      "target": "5b11f4ce-...",
      "artist": { "name": "Radiohead" }
    }
  ]
}
```

**GraphQL** (query):

```graphql
{
  relationships {
    edges {
      node {
        type
        target {
          ... on Artist {
            name
          }
        }
      }
    }
  }
}
```

## Memory Considerations

### Cache Memory Usage

With the default configuration (8192 items per cache):

| Cache | Items | Avg Size | Total Memory |
|-------|-------|----------|--------------|
| MusicBrainz | 8192 | 5 KB | ~40 MB |
| Cover Art Archive | 8192 | 2 KB | ~16 MB |
| fanart.tv | 8192 | 3 KB | ~24 MB |
| MediaWiki | 8192 | 4 KB | ~32 MB |
| TheAudioDB | 8192 | 2 KB | ~16 MB |
| **Total** | **40960** | - | **~128 MB** |

### DataLoader Memory Usage

DataLoader instances are created per request and garbage collected after the response:

- **Per-request overhead**: ~1-5 MB (depends on query complexity)
- **Concurrent requests**: 100 requests × 5 MB = 500 MB peak

### Recommended Memory Allocation

| Deployment | Heap Size | Rationale |
|------------|-----------|-----------|
| Development | 512 MB | Single user, low traffic |
| Production (low) | 1 GB | 10-50 req/s, shared cache |
| Production (high) | 2 GB | 100+ req/s, full cache |

**Node.js Configuration**:

```bash
node --max-old-space-size=2048 cli.js
```

## Data Freshness

GraphBrainz does not implement cache invalidation beyond TTL expiration. Data freshness depends on how often the underlying data changes relative to the cache TTL:

| Data Type | Typical Update Frequency | Cache TTL | Staleness Risk |
|-----------|-------------------------|-----------|----------------|
| Artist metadata | Weeks to months | 1 day | Low |
| Release metadata | Days to weeks | 1 day | Low |
| Relationships | Weeks to months | 1 day | Low |
| Cover art | Months to years | 1 day | Very low |
| Artist images | Months to years | 1 day | Very low |
| Biographies | Months to years | 1 day | Very low |

For real-time data requirements, reduce the cache TTL:

```bash
GRAPHBRAINZ_CACHE_TTL=3600000 # 1 hour
```

Or disable caching entirely:

```bash
GRAPHBRAINZ_CACHE_SIZE=0
```
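
As a rough illustration of how the `*_CACHE_SIZE` and `*_CACHE_TTL` variables in this document might be read with their documented defaults, here is a minimal sketch. `readCacheConfig` is a hypothetical helper, not part of GraphBrainz, and the fallback rules (ignore empty, non-numeric, or negative values) are assumptions.

```javascript
// Hypothetical helper: reads <PREFIX>_CACHE_SIZE and <PREFIX>_CACHE_TTL
// from an environment-like object, falling back to the documented defaults.
function readCacheConfig(env, prefix) {
  const parse = (value, fallback) => {
    if (value === undefined || value === '') return fallback;
    const n = Number(value);
    // Assumption: reject NaN and negatives, but keep 0 as an explicit setting.
    return Number.isFinite(n) && n >= 0 ? n : fallback;
  };
  return {
    max: parse(env[`${prefix}_CACHE_SIZE`], 8192),    // items
    ttl: parse(env[`${prefix}_CACHE_TTL`], 86400000), // milliseconds (1 day)
  };
}

// In a real process this would be called with process.env.
const defaults = readCacheConfig({}, 'GRAPHBRAINZ');
const hourly = readCacheConfig({ GRAPHBRAINZ_CACHE_TTL: '3600000' }, 'GRAPHBRAINZ');
const zeroSize = readCacheConfig({ GRAPHBRAINZ_CACHE_SIZE: '0' }, 'GRAPHBRAINZ');

console.log(defaults); // { max: 8192, ttl: 86400000 }
console.log(hourly);   // { max: 8192, ttl: 3600000 }
console.log(zeroSize); // { max: 0, ttl: 86400000 }
```

The explicit check for `0` matters: a shorthand such as `Number(value) || 8192` would silently replace a size of 0 with the default. How a size of 0 is then interpreted depends on the cache library in use, so it is worth verifying that it disables caching in the installed version.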