
# GraphBrainz Data Layer

## Data Source Architecture

GraphBrainz is a **stateless proxy** with no persistent database. All data originates from external APIs:

| Source | Purpose | Authentication |
|--------|---------|----------------|
| MusicBrainz REST API | Core music metadata | None |
| Cover Art Archive | Album artwork | None |
| fanart.tv | Artist images | API key required |
| MediaWiki | Wiki images | None |
| TheAudioDB | Artist biographies | API key required |

## MusicBrainz Backend

### Base URL Configuration

| Environment Variable | Default | Purpose |
|---------------------|---------|---------|
| MUSICBRAINZ_BASE_URL | http://musicbrainz.org/ws/2/ | API endpoint |

**Local Mirror Support**:
```bash
MUSICBRAINZ_BASE_URL=http://localhost:5000/ws/2/
```

Pointing GraphBrainz at a local MusicBrainz mirror avoids the public API's rate limits and reduces latency.

### API Operations

GraphBrainz uses three MusicBrainz API operations:

#### 1. Lookup

Retrieve a single entity by MBID.

**URL Pattern**:
```
GET /ws/2/{entity}/{mbid}?inc={relationships}
```

**Example**:
```
GET /ws/2/artist/5b11f4ce-a62d-471e-81fc-a69a8278c7da?inc=releases+recordings
```

**Supported Entities**: area, artist, collection, event, instrument, label, place, recording, release, release-group, series, url, work
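
As an illustration of the lookup pattern, the sketch below builds the request URL with the standard `URL` API. The helper name `buildLookupUrl` is hypothetical and not taken from GraphBrainz. It assumes the base URL ends with `/` and relies on `URLSearchParams` serializing a space as `+`.

```javascript
// Hypothetical helper (not GraphBrainz API): builds a lookup URL.
function buildLookupUrl(baseUrl, entity, mbid, inc = []) {
  // baseUrl must end with "/" so the relative path is appended to /ws/2/.
  const url = new URL(`${entity}/${mbid}`, baseUrl);
  if (inc.length > 0) {
    // URLSearchParams encodes the space separator as "+".
    url.searchParams.set('inc', inc.join(' '));
  }
  return url.href;
}

const lookupUrl = buildLookupUrl(
  'http://musicbrainz.org/ws/2/',
  'artist',
  '5b11f4ce-a62d-471e-81fc-a69a8278c7da',
  ['releases', 'recordings']
);
console.log(lookupUrl);
```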

#### 2. Browse

Retrieve entities linked to a parent entity.

**URL Pattern**:
```
GET /ws/2/{entity}?{parent-entity}={mbid}&limit={limit}&offset={offset}&inc={relationships}
```

**Example**:
```
GET /ws/2/release?artist=5b11f4ce-a62d-471e-81fc-a69a8278c7da&limit=25&offset=0
```

**Supported Relationships**: See API.md for full matrix
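
A browse request pages through linked entities with `limit` and `offset`. The following sketch shows one way to generate the page URLs. `buildBrowseUrl` and `browsePages` are hypothetical names, and the total result count is assumed to be known from an earlier response.

```javascript
// Hypothetical helper: builds a browse URL for entities linked to a parent.
function buildBrowseUrl(baseUrl, entity, parent, { limit = 25, offset = 0, inc = [] } = {}) {
  const url = new URL(entity, baseUrl);
  url.searchParams.set(parent.type, parent.mbid);
  url.searchParams.set('limit', String(limit));
  url.searchParams.set('offset', String(offset));
  if (inc.length > 0) url.searchParams.set('inc', inc.join(' '));
  return url.href;
}

// Yields one URL per page until `total` results are covered.
function* browsePages(baseUrl, entity, parent, total, limit = 25) {
  for (let offset = 0; offset < total; offset += limit) {
    yield buildBrowseUrl(baseUrl, entity, parent, { limit, offset });
  }
}

const parentArtist = { type: 'artist', mbid: '5b11f4ce-a62d-471e-81fc-a69a8278c7da' };
const pages = [...browsePages('http://musicbrainz.org/ws/2/', 'release', parentArtist, 60)];
console.log(pages); // three URLs with offsets 0, 25 and 50
```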

#### 3. Search

Lucene-based full-text search.

**URL Pattern**:
```
GET /ws/2/{entity}?query={lucene-query}&limit={limit}&offset={offset}
```

**Example**:
```
GET /ws/2/artist?query=artist:Radiohead%20AND%20country:GB&limit=25
```

**Supported Entities**: area, artist, event, instrument, label, place, recording, release, release-group, work
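
Search queries need URL encoding, and values containing spaces or quotes need Lucene quoting. The sketch below shows one possible approach. `buildLuceneQuery` and `buildSearchUrl` are hypothetical helpers; quoting every value is a simplifying assumption and turns each value into a phrase match.

```javascript
// Hypothetical helper: turns { field: value } pairs into a Lucene query.
function buildLuceneQuery(fields) {
  return Object.entries(fields)
    .map(([field, value]) => {
      // Quote each value and escape backslashes and double quotes inside it.
      const escaped = String(value).replace(/(["\\])/g, '\\$1');
      return `${field}:"${escaped}"`;
    })
    .join(' AND ');
}

// Hypothetical helper: builds the search URL; URLSearchParams handles encoding.
function buildSearchUrl(baseUrl, entity, fields, { limit = 25, offset = 0 } = {}) {
  const url = new URL(entity, baseUrl);
  url.searchParams.set('query', buildLuceneQuery(fields));
  url.searchParams.set('limit', String(limit));
  url.searchParams.set('offset', String(offset));
  return url.href;
}

const searchUrl = buildSearchUrl('http://musicbrainz.org/ws/2/', 'artist', {
  artist: 'Radiohead',
  country: 'GB',
});
console.log(searchUrl);
```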

### Include Parameters

GraphBrainz resolvers inspect the GraphQL AST to determine which `inc` parameters are needed:

| Parameter | Description | Entities |
|-----------|-------------|----------|
| aliases | Alternative names | All |
| annotation | Editorial notes | All |
| tags | User-generated tags | All |
| ratings | User ratings | All |
| genres | Genre classifications | All |
| artist-credits | Artist credit details | Recording, Release, ReleaseGroup, Track |
| artists | Related artists | Recording, Release, ReleaseGroup, Work |
| collections | Collections containing entity | All |
| labels | Record labels | Release |
| recordings | Recordings | Artist, Release, Work |
| releases | Releases | Artist, Label, Recording, ReleaseGroup |
| release-groups | Release groups | Artist, Release |
| works | Musical works | Artist, Recording |
| discids | Disc IDs | Release |
| media | Media/tracks | Release |
| isrcs | ISRC codes | Recording |
| url-rels | URL relationships | All |
| artist-rels | Artist relationships | All |
| label-rels | Label relationships | All |
| recording-rels | Recording relationships | All |
| release-rels | Release relationships | All |
| release-group-rels | Release group relationships | All |
| work-rels | Work relationships | All |
| area-rels | Area relationships | All |
| place-rels | Place relationships | All |
| event-rels | Event relationships | All |
| series-rels | Series relationships | All |
| instrument-rels | Instrument relationships | All |

### Response Format

MusicBrainz returns JSON with entity-specific structure:

```json
{
  "id": "5b11f4ce-a62d-471e-81fc-a69a8278c7da",
  "name": "Radiohead",
  "sort-name": "Radiohead",
  "type": "Group",
  "country": "GB",
  "life-span": {
    "begin": "1985"
  },
  "releases": [
    {
      "id": "...",
      "title": "OK Computer",
      "date": "1997-05-21"
    }
  ]
}
```

GraphBrainz transforms this into a GraphQL-friendly format (camelCase field names, nested objects).

## Two-Level Caching Strategy

### Level 1: DataLoader (Per-Request)

**Purpose**: Request batching and deduplication within a single GraphQL query.

**Lifecycle**: Created fresh for each GraphQL request, discarded after response.

**Implementation**:
```javascript
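import DataLoader from 'dataloader';

// fetchArtist(mbid, inc) is assumed to be defined elsewhere.
const artistLoader = new DataLoader(
  // Resolve each key independently; returning the Error for a failed key
  // rejects only that key instead of the whole batch.
  (keys) =>
    Promise.all(keys.map((key) => fetchArtist(key.mbid, key.inc).catch((error) => error))),
  {
    // Keys are objects, so derive a string cache key; otherwise two equal
    // requests would be treated as different keys and not deduplicated.
    cacheKeyFn: (key) => `${key.mbid}:${key.inc}`
  }
);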
```

**Benefits**:
- Batches multiple requests for same entity type
- Deduplicates identical requests within query
- Prevents N+1 query problems

**Example**:
```graphql
{
  lookup {
    release(mbid: "...") {
      artists {  # Artist 1
        name
      }
      tracks {
        artists {  # Artist 1 again (deduplicated)
          name
        }
      }
    }
  }
}
```

DataLoader ensures Artist 1 is fetched only once.
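
The deduplication step can be illustrated without the `dataloader` package. The sketch below memoizes one promise per key for the lifetime of a request. It covers deduplication only, not DataLoader's batch scheduling, and `createRequestLoader` is a hypothetical name.

```javascript
// Hypothetical per-request loader: memoizes the promise for each key.
function createRequestLoader(fetchEntity) {
  const inFlight = new Map();
  return function load({ mbid, inc = [] }) {
    // Objects are compared by identity, so derive a string key instead.
    const key = `${mbid}:${[...inc].sort().join(',')}`;
    if (!inFlight.has(key)) {
      inFlight.set(key, fetchEntity(mbid, inc));
    }
    return inFlight.get(key);
  };
}

let fetchCount = 0;
const load = createRequestLoader(async (mbid) => {
  fetchCount += 1; // stands in for one MusicBrainz API call
  return { id: mbid, name: 'Radiohead' };
});

const mbid = '5b11f4ce-a62d-471e-81fc-a69a8278c7da';
const first = load({ mbid, inc: ['releases'] });
const second = load({ mbid, inc: ['releases'] }); // same key, no second fetch
```

Both calls return the same promise, so resolvers that ask for the same artist share a single upstream request.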

### Level 2: LRU Cache (Shared)

**Purpose**: Cross-request caching to reduce API calls.

**Lifecycle**: Shared across all requests, persists for configured TTL.

**Configuration**:

| Parameter | Environment Variable | Default |
|-----------|---------------------|---------|
| Size | GRAPHBRAINZ_CACHE_SIZE | 8192 items |
| TTL | GRAPHBRAINZ_CACHE_TTL | 86400000 ms (1 day) |

**Implementation**:
```javascript
import LRU from 'lru-cache';

const cache = new LRU({
  max: 8192,
  ttl: 86400000,  // 1 day
  updateAgeOnGet: true,
  updateAgeOnHas: true
});
```

**Cache Key Strategy**:

Keys combine entity type, MBID, and `inc` parameters to prevent collisions:

```
artist:5b11f4ce-a62d-471e-81fc-a69a8278c7da:releases,recordings
release:f0c8b1e5-...:artist-credits,labels,media
```

Different queries for the same entity use different cache keys.
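
A key of this shape could be built as sketched below. The function name `cacheKey` is hypothetical. Sorting and deduplicating the `inc` values is an assumption made here so that the same set of includes always maps to one key; the example keys above do not necessarily follow that order.

```javascript
// Hypothetical helper: entity type, MBID and normalized inc values form the key.
function cacheKey(entity, mbid, inc = []) {
  // Sort and dedupe so ['releases', 'recordings'] and
  // ['recordings', 'releases'] share one cache entry.
  const normalized = [...new Set(inc)].sort().join(',');
  return `${entity}:${mbid}:${normalized}`;
}

const artistId = '5b11f4ce-a62d-471e-81fc-a69a8278c7da';
const keyA = cacheKey('artist', artistId, ['releases', 'recordings']);
const keyB = cacheKey('artist', artistId, ['recordings', 'releases']);
const keyC = cacheKey('artist', artistId, ['releases']);
console.log(keyA === keyB, keyA === keyC); // true false
```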

**Cache Invalidation**:

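- **Time-based**: Items expire after TTL (default 1 day). Because the configuration above sets `updateAgeOnGet`, each read resets an item's age, so frequently read items can stay cached longer than one TTL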
- **Size-based**: LRU eviction when cache exceeds max size
- **No manual invalidation**: GraphBrainz assumes MusicBrainz data is relatively stable
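
The two eviction rules can be illustrated with a small stand-in built on `Map` insertion order. This is a sketch of the semantics, not the `lru-cache` implementation: `TinyLruCache` is a hypothetical class, it uses a fixed TTL that is not refreshed on read, and the clock is injectable so the behavior is easy to follow.

```javascript
// Minimal LRU cache with a fixed TTL, built on Map insertion order.
class TinyLruCache {
  constructor({ max, ttl, now = Date.now }) {
    this.max = max;
    this.ttl = ttl;
    this.now = now;
    this.entries = new Map(); // key -> { value, expiresAt }
  }

  get(key) {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (entry.expiresAt <= this.now()) {
      this.entries.delete(key); // time-based expiry
      return undefined;
    }
    // Re-insert to mark the key as most recently used.
    this.entries.delete(key);
    this.entries.set(key, entry);
    return entry.value;
  }

  set(key, value) {
    this.entries.delete(key);
    this.entries.set(key, { value, expiresAt: this.now() + this.ttl });
    if (this.entries.size > this.max) {
      // The first key in iteration order is the least recently used.
      const oldest = this.entries.keys().next().value;
      this.entries.delete(oldest);
    }
  }
}

let clock = 0;
const cache = new TinyLruCache({ max: 2, ttl: 1000, now: () => clock });
cache.set('a', 1);
cache.set('b', 2);
cache.get('a');    // 'a' becomes the most recently used key
cache.set('c', 3); // size-based eviction removes 'b'
```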

**Cache Hit Ratio**:

Typical hit ratios for production workloads:

- Lookup queries: 60-80% (popular artists cached)
- Browse queries: 40-60% (pagination reduces hits)
- Search queries: 10-30% (diverse queries)

## Extension Caching

Each extension maintains its own LRU cache with separate configuration.

### Cover Art Archive

| Parameter | Environment Variable | Default |
|-----------|---------------------|---------|
| Size | COVERART_CACHE_SIZE | 8192 |
| TTL | COVERART_CACHE_TTL | 86400000 ms |

**Cache Key**: `coverart:{release-mbid}`

### fanart.tv

| Parameter | Environment Variable | Default |
|-----------|---------------------|---------|
| Size | FANART_CACHE_SIZE | 8192 |
| TTL | FANART_CACHE_TTL | 86400000 ms |

**Cache Key**: `fanart:{artist-mbid}`

### TheAudioDB

| Parameter | Environment Variable | Default |
|-----------|---------------------|---------|
| Size | THEAUDIODB_CACHE_SIZE | 8192 |
| TTL | THEAUDIODB_CACHE_TTL | 86400000 ms |

**Cache Key**: `theaudiodb:{artist-mbid}`

### MediaWiki

| Parameter | Environment Variable | Default |
|-----------|---------------------|---------|
| Size | MEDIAWIKI_CACHE_SIZE | 8192 |
| TTL | MEDIAWIKI_CACHE_TTL | 86400000 ms |

**Cache Key**: `mediawiki:{artist-name}`
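
Since every cache follows the same `<PREFIX>_CACHE_SIZE` / `<PREFIX>_CACHE_TTL` naming, the settings could be read with one helper. The sketch below is an assumption about how such parsing might look; `readCacheConfig` is a hypothetical name, and falling back to the default on unparsable input is a design choice made here.

```javascript
const DEFAULT_CACHE_SIZE = 8192;
const DEFAULT_CACHE_TTL = 86400000; // 1 day in milliseconds

// Hypothetical helper: reads <PREFIX>_CACHE_SIZE and <PREFIX>_CACHE_TTL.
function readCacheConfig(prefix, env = process.env) {
  const parse = (name, fallback) => {
    const value = Number.parseInt(env[`${prefix}_CACHE_${name}`], 10);
    // Fall back to the default when the variable is unset or not a number.
    return Number.isNaN(value) ? fallback : value;
  };
  return {
    max: parse('SIZE', DEFAULT_CACHE_SIZE),
    ttl: parse('TTL', DEFAULT_CACHE_TTL),
  };
}

const fanartConfig = readCacheConfig('FANART', {});
const coverArtConfig = readCacheConfig('COVERART', { COVERART_CACHE_TTL: '3600000' });
console.log(fanartConfig, coverArtConfig);
```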

## Data Flow

Complete request flow from GraphQL query to response:

```
1. GraphQL Query Received
   ↓
2. Resolver Inspects AST
   ↓ (determines required inc parameters)
3. DataLoader.load({ mbid, inc })
   ↓
4. Check DataLoader Cache (per-request)
   ↓ (miss)
5. Check LRU Cache (shared)
   ↓ (miss)
6. Rate Limiter Queue
   ↓ (acquire token)
7. HTTP Request via got
   ↓
8. MusicBrainz API Response
   ↓
9. Store in LRU Cache
   ↓
10. Return to DataLoader
    ↓
11. Return to Resolver
    ↓
12. GraphQL Response
```

**DataLoader Cache Hit Path**:
```
1. GraphQL Query Received
   ↓
2. Resolver Inspects AST
   ↓
3. DataLoader.load({ mbid, inc })
   ↓
4. Check DataLoader Cache (per-request)
   ↓ (hit - return immediately)
5. GraphQL Response
```

**Shared Cache Hit Path**:
```
1. GraphQL Query Received
   ↓
2. Resolver Inspects AST
   ↓
3. DataLoader.load({ mbid, inc })
   ↓
4. Check DataLoader Cache (per-request)
   ↓ (miss)
5. Check LRU Cache (shared)
   ↓ (hit - return immediately)
6. Store in DataLoader Cache
   ↓
7. GraphQL Response
```
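
The lookup order in the three paths above can be sketched as a single function. This is an illustration under simplifying assumptions: a plain `Map` stands in for the shared LRU cache, the rate limiter is omitted, and `createLoader` is a hypothetical name.

```javascript
// Sketch of the lookup order: per-request map, then shared cache, then fetch.
function createLoader(sharedCache, fetchEntity) {
  const requestCache = new Map(); // discarded together with the request
  return function load(key) {
    if (requestCache.has(key)) return requestCache.get(key); // level 1 hit
    const promise = sharedCache.has(key)
      ? Promise.resolve(sharedCache.get(key)) // level 2 hit
      : fetchEntity(key).then((value) => {
          sharedCache.set(key, value); // store for later requests
          return value;
        });
    requestCache.set(key, promise);
    return promise;
  };
}

const shared = new Map();
let fetches = 0;
const fetchEntity = async (key) => {
  fetches += 1; // stands in for one upstream API call
  return { key };
};

async function demo() {
  const requestA = createLoader(shared, fetchEntity);
  await Promise.all([requestA('artist:1'), requestA('artist:1')]); // one fetch
  const requestB = createLoader(shared, fetchEntity);
  await requestB('artist:1'); // served from the shared cache
  return fetches;
}
```

Calling `demo()` performs a single upstream fetch: the second load in the first request hits the per-request map, and the second request hits the shared cache.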

## Rate Limiting

GraphBrainz implements custom rate limiting to comply with API policies.

### MusicBrainz Rate Limits

**Policy**: 5 requests per 5.5 seconds (approximately 0.909 requests/second)

**Implementation**:
- Token bucket algorithm
- 5 tokens maximum
- Refill rate: 0.909 tokens/second
- Sequential requests (concurrency: 1)

**Configuration**:
```javascript
const musicbrainzLimiter = new RateLimiter({
  limit: 5,
  interval: 5500,  // milliseconds
  concurrency: 1
});
```
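
For readers unfamiliar with token buckets, the sketch below implements the refill arithmetic for the 5-per-5.5-seconds policy. `TokenBucket` is a hypothetical class, not the `RateLimiter` shown above; it leaves out queueing and concurrency control, and the clock is injectable for illustration.

```javascript
// Minimal token bucket: `limit` tokens refill evenly over `interval` ms.
class TokenBucket {
  constructor({ limit, interval, now = Date.now }) {
    this.limit = limit;       // bucket capacity
    this.interval = interval; // ms needed to refill `limit` tokens
    this.now = now;
    this.tokens = limit;
    this.lastRefill = now();
  }

  // Takes one token if available; returns whether the request may proceed.
  tryAcquire() {
    const current = this.now();
    const refill = ((current - this.lastRefill) * this.limit) / this.interval;
    this.tokens = Math.min(this.limit, this.tokens + refill);
    this.lastRefill = current;
    if (this.tokens < 1) return false;
    this.tokens -= 1;
    return true;
  }
}

let clock = 0;
const bucket = new TokenBucket({ limit: 5, interval: 5500, now: () => clock });
// Six immediate requests: the first five pass, the sixth is refused.
const burst = Array.from({ length: 6 }, () => bucket.tryAcquire());
console.log(burst);
```

With these numbers one token is restored every 1100 ms, which matches the rate of roughly 0.909 requests per second.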

### Extension Rate Limits

**Default Policy**: 10 requests per second

**Implementation**:
- Token bucket algorithm
- 10 tokens maximum
- Refill rate: 10 tokens/second
- Parallel requests (concurrency: 5)

**Per-Extension Configuration**:

| Extension | Rate Limit | Concurrency |
|-----------|------------|-------------|
| Cover Art Archive | 10 req/s | 5 |
| fanart.tv | 10 req/s | 5 |
| MediaWiki | 10 req/s | 5 |
| TheAudioDB | 10 req/s | 5 |

### Priority Queue

When the rate limit is reached, requests are queued by priority level:

| Priority | Query Type | Rationale |
|----------|------------|-----------|
| High | Lookup | Direct MBID access, user-initiated |
| Medium | Browse | Relationship traversal, pagination |
| Low | Search | Full-text search, exploratory |

Higher priority requests are processed first when tokens become available.
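
The ordering rule can be shown with a small queue. This is a sketch only: `PriorityRequestQueue` and the numeric priority values are assumptions, and first-in-first-out order within one priority level is a choice made here that the text above does not specify.

```javascript
// Lower number means higher priority, following the table above.
const PRIORITY = { lookup: 0, browse: 1, search: 2 };

class PriorityRequestQueue {
  constructor() {
    this.items = [];
    this.sequence = 0; // preserves FIFO order within a priority level
  }

  enqueue(type, request) {
    this.items.push({ priority: PRIORITY[type], order: this.sequence++, request });
    this.items.sort((a, b) => a.priority - b.priority || a.order - b.order);
  }

  // Returns the next request to run, or undefined when the queue is empty.
  dequeue() {
    const next = this.items.shift();
    return next ? next.request : undefined;
  }
}

const queue = new PriorityRequestQueue();
queue.enqueue('search', 'search-1');
queue.enqueue('browse', 'browse-1');
queue.enqueue('lookup', 'lookup-1');
queue.enqueue('lookup', 'lookup-2');

const order = [queue.dequeue(), queue.dequeue(), queue.dequeue(), queue.dequeue()];
console.log(order); // [ 'lookup-1', 'lookup-2', 'browse-1', 'search-1' ]
```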

### Rate Limit Errors

When the rate limit is exceeded and the queue is full:

**HTTP Response**:
```
HTTP/1.1 429 Too Many Requests
Retry-After: 5
```

**GraphQL Error**:
```json
{
  "errors": [
    {
      "message": "Rate limit exceeded",
      "extensions": {
        "code": "RATE_LIMIT",
        "retryAfter": 5
      }
    }
  ]
}
```
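
A client can use the `retryAfter` extension to decide how long to wait. The helper below is a hypothetical client-side sketch that assumes the error shape shown above, with `retryAfter` expressed in seconds.

```javascript
// Hypothetical client-side helper: finds the retry delay in a GraphQL response.
function retryDelayMs(response) {
  const errors = response.errors || [];
  const limited = errors.find(
    (error) => error.extensions && error.extensions.code === 'RATE_LIMIT'
  );
  // null means the response carries no rate limit error.
  return limited ? limited.extensions.retryAfter * 1000 : null;
}

const rateLimited = {
  errors: [
    {
      message: 'Rate limit exceeded',
      extensions: { code: 'RATE_LIMIT', retryAfter: 5 },
    },
  ],
};

console.log(retryDelayMs(rateLimited)); // 5000
console.log(retryDelayMs({ data: {} })); // null
```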

## HTTP Client

GraphBrainz uses `got` v11.8.2 for HTTP requests.

### Client Configuration

```javascript
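import createDebug from 'debug';
import got from 'got';

// Create the logger once instead of on every request.
const debug = createDebug('graphbrainz:api/client');

const client = got.extend({
  prefixUrl: process.env.MUSICBRAINZ_BASE_URL,
  headers: {
    'User-Agent': 'GraphBrainz/9.0.0 (https://github.com/exogen/graphbrainz)',
    // Request JSON explicitly, matching the Request Headers table.
    Accept: 'application/json'
  },
  timeout: {
    request: 30000  // 30 seconds for the whole request
  },
  retry: {
    limit: 3,
    methods: ['GET'],
    statusCodes: [408, 413, 429, 500, 502, 503, 504]
  },
  hooks: {
    beforeRequest: [
      (options) => {
        // Log the method and URL of each outgoing request.
        debug(`${options.method} ${options.url}`);
      }
    ]
  }
});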
```

### Request Headers

| Header | Value | Purpose |
|--------|-------|---------|
| User-Agent | GraphBrainz/9.0.0 (...) | API identification |
| Accept | application/json | Response format |

### Timeout Handling

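- **Request timeout**: 30 seconds (`timeout.request`), measured from the start of the request to the end of the response
- **Connection and read timeouts**: not configured separately; `got` applies no per-phase timeout unless options such as `timeout.connect` or `timeout.response` are set, so the request timeout is the only bound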

Timeout errors are propagated as GraphQL errors.

### Retry Logic

Automatic retry for transient failures:

- **Max retries**: 3
- **Retry methods**: GET only
- **Retry status codes**: 408, 413, 429, 500, 502, 503, 504
- **Backoff**: Exponential (1s, 2s, 4s)
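
The retry rules above amount to a small decision function. The sketch below restates them in plain JavaScript; `retryDelay` is a hypothetical name, and it ignores any random jitter that an HTTP client may add to the delay.

```javascript
const RETRY_STATUS_CODES = new Set([408, 413, 429, 500, 502, 503, 504]);
const MAX_RETRIES = 3;

// Returns the delay in ms before retry number `attempt` (1-based),
// or null when the request should not be retried.
function retryDelay({ method, statusCode, attempt }) {
  if (method !== 'GET') return null;
  if (attempt > MAX_RETRIES) return null;
  if (!RETRY_STATUS_CODES.has(statusCode)) return null;
  return 2 ** (attempt - 1) * 1000; // exponential backoff
}

const delays = [1, 2, 3, 4].map((attempt) =>
  retryDelay({ method: 'GET', statusCode: 503, attempt })
);
console.log(delays); // [ 1000, 2000, 4000, null ]
```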

## Data Transformation

MusicBrainz API responses are transformed into a GraphQL-friendly format:

### Field Name Conversion

| MusicBrainz | GraphQL |
|-------------|---------|
| sort-name | sortName |
| life-span | lifeSpan |
| artist-credit | artistCredit |
| release-group | releaseGroup |
| iso-3166-1-codes | iso31661Codes |
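
The conversion in the table can be reproduced with a short recursive function. This is a minimal sketch, not the GraphBrainz implementation; `camelizeKey` and `camelizeKeys` are hypothetical names.

```javascript
// Converts a hyphenated key such as "sort-name" to camelCase.
function camelizeKey(key) {
  return key.replace(/-(.)/g, (_, char) => char.toUpperCase());
}

// Recursively renames keys in objects and arrays; values are left unchanged.
function camelizeKeys(value) {
  if (Array.isArray(value)) return value.map((item) => camelizeKeys(item));
  if (value !== null && typeof value === 'object') {
    return Object.fromEntries(
      Object.entries(value).map(([key, inner]) => [camelizeKey(key), camelizeKeys(inner)])
    );
  }
  return value;
}

const converted = camelizeKeys({
  'sort-name': 'Radiohead',
  'life-span': { begin: '1985', end: null },
});
console.log(converted);
// { sortName: 'Radiohead', lifeSpan: { begin: '1985', end: null } }
console.log(camelizeKey('iso-3166-1-codes')); // iso31661Codes
```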

### Nested Object Conversion

Nested objects keep their nesting; only their keys are converted to camelCase:

**MusicBrainz**:
```json
{
  "life-span": {
    "begin": "1985",
    "end": null
  }
}
```

**GraphQL**:
```json
{
  "lifeSpan": {
    "begin": "1985",
    "end": null
  }
}
```

### Array Normalization

**MusicBrainz**:
```json
{
  "releases": [
    { "id": "...", "title": "..." }
  ]
}
```

**GraphQL** (Relay connection):
```json
{
  "releases": {
    "edges": [
      {
        "node": { "id": "...", "title": "..." },
        "cursor": "..."
      }
    ],
    "pageInfo": { ... },
    "totalCount": 1
  }
}
```
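
Wrapping a plain array in a connection could look like the sketch below. The cursor format (base64 of `offset:<n>`) and the `pageInfo` fields are assumptions based on the Relay connection convention, not details taken from GraphBrainz, and `toConnection` is a hypothetical name.

```javascript
// Hypothetical cursor format: base64 of "offset:<n>".
const encodeCursor = (offset) => Buffer.from(`offset:${offset}`).toString('base64');

// Wraps one page of items in a Relay-style connection object.
function toConnection(items, { offset = 0, totalCount = items.length } = {}) {
  const edges = items.map((node, index) => ({
    node,
    cursor: encodeCursor(offset + index),
  }));
  return {
    edges,
    pageInfo: {
      hasPreviousPage: offset > 0,
      hasNextPage: offset + items.length < totalCount,
      startCursor: edges.length > 0 ? edges[0].cursor : null,
      endCursor: edges.length > 0 ? edges[edges.length - 1].cursor : null,
    },
    totalCount,
  };
}

const connection = toConnection([{ id: 'r1', title: 'OK Computer' }], {
  offset: 0,
  totalCount: 1,
});
console.log(JSON.stringify(connection, null, 2));
```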

### Relationship Expansion

MusicBrainz `relations` arrays are exposed through a `relationships` connection whose nodes resolve the target entity:

**MusicBrainz**:
```json
{
  "relations": [
    {
      "type": "member of band",
      "target": "5b11f4ce-...",
      "artist": { "name": "Radiohead" }
    }
  ]
}
```

**GraphQL**:
```graphql
{
  relationships {
    edges {
      node {
        type
        target {
          ... on Artist {
            name
          }
        }
      }
    }
  }
}
```

## Memory Considerations

### Cache Memory Usage

With default configuration (8192 items per cache):

| Cache | Items | Avg Size | Total Memory |
|-------|-------|----------|--------------|
| MusicBrainz | 8192 | 5 KB | ~40 MB |
| Cover Art Archive | 8192 | 2 KB | ~16 MB |
| fanart.tv | 8192 | 3 KB | ~24 MB |
| MediaWiki | 8192 | 4 KB | ~32 MB |
| TheAudioDB | 8192 | 2 KB | ~16 MB |
| **Total** | **40960** | - | **~128 MB** |

### DataLoader Memory Usage

DataLoader instances are created per-request and garbage collected after response:

- **Per-request overhead**: ~1-5 MB (depends on query complexity)
- **Concurrent requests**: 100 requests × 5 MB = 500 MB peak
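
The estimates in this section are simple products, reproduced below so they can be recomputed for other cache sizes. The item counts and average sizes are copied from the table above; the variable names are illustrative.

```javascript
const KB = 1024;
const MB = 1024 * KB;

// Items per cache and average item size, as listed in the table above.
const caches = [
  { name: 'MusicBrainz', items: 8192, avgBytes: 5 * KB },
  { name: 'Cover Art Archive', items: 8192, avgBytes: 2 * KB },
  { name: 'fanart.tv', items: 8192, avgBytes: 3 * KB },
  { name: 'MediaWiki', items: 8192, avgBytes: 4 * KB },
  { name: 'TheAudioDB', items: 8192, avgBytes: 2 * KB },
];

const cacheBytes = caches.reduce((sum, cache) => sum + cache.items * cache.avgBytes, 0);
const dataLoaderPeakBytes = 100 * 5 * MB; // 100 concurrent requests at 5 MB each

const estimate = {
  cacheMB: cacheBytes / MB,
  dataLoaderPeakMB: dataLoaderPeakBytes / MB,
  totalMB: (cacheBytes + dataLoaderPeakBytes) / MB,
};
console.log(estimate); // { cacheMB: 128, dataLoaderPeakMB: 500, totalMB: 628 }
```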

### Recommended Memory Allocation

| Deployment | Heap Size | Rationale |
|------------|-----------|-----------|
| Development | 512 MB | Single user, low traffic |
| Production (low) | 1 GB | 10-50 req/s, shared cache |
| Production (high) | 2 GB | 100+ req/s, full cache |

**Node.js Configuration**:
```bash
node --max-old-space-size=2048 cli.js
```

## Data Freshness

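GraphBrainz does not invalidate cached entries other than through TTL expiry and LRU eviction. Staleness risk therefore depends on how often each data type changes upstream relative to the cache TTL: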

| Data Type | Typical Update Frequency | Cache TTL | Staleness Risk |
|-----------|-------------------------|-----------|----------------|
| Artist metadata | Weeks to months | 1 day | Low |
| Release metadata | Days to weeks | 1 day | Low |
| Relationships | Weeks to months | 1 day | Low |
| Cover art | Months to years | 1 day | Very low |
| Artist images | Months to years | 1 day | Very low |
| Biographies | Months to years | 1 day | Very low |

When fresher data is required, reduce the cache TTL:

```bash
GRAPHBRAINZ_CACHE_TTL=3600000  # 1 hour
```

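Setting the cache size to 0 is meant to disable caching entirely, but confirm this against the installed `lru-cache` version, since versions 5 and 6 document `max: 0` as an unbounded cache: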

```bash
GRAPHBRAINZ_CACHE_SIZE=0
```