Bedrock-API Data Layer

Database Technology

RDBMS: PostgreSQL 15
Driver: github.com/jackc/pgx/v5 (native PostgreSQL driver)
Connection Pooling: pgxpool (pgx connection pool)
Migration Tool: None (manual SQL execution)

Database Schema

Users Table

File: db/migrations/001_create_users_table.up.sql

CREATE TABLE users (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    email VARCHAR(255) UNIQUE NOT NULL,
    password_hash VARCHAR(255) NOT NULL,
    role VARCHAR(50) DEFAULT 'user',
    is_verified BOOLEAN DEFAULT false,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

CREATE INDEX idx_users_email ON users(email);

Columns:

| Column | Type | Constraints | Purpose |
|---|---|---|---|
| id | UUID | PRIMARY KEY, DEFAULT gen_random_uuid() | Unique user identifier |
| email | VARCHAR(255) | UNIQUE, NOT NULL | User email (login identifier) |
| password_hash | VARCHAR(255) | NOT NULL | bcrypt hashed password |
| role | VARCHAR(50) | DEFAULT 'user' | User role (user/admin) |
| is_verified | BOOLEAN | DEFAULT false | Email verification status |
| created_at | TIMESTAMP | DEFAULT CURRENT_TIMESTAMP | Account creation timestamp |

Indexes:

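  • Primary key index on id (automatic)
  • Unique B-tree index on email (automatic, created by the UNIQUE constraint)
  • B-tree index idx_users_email on email (for login lookups); redundant, because the UNIQUE constraint already provides an index on the same column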

No Foreign Keys: Single table schema, no relationships

Schema Limitations

Missing Tables:

  • No metadata cache (tracks, albums, artists, playlists)
  • No user listening history
  • No user playlists
  • No user favorites/likes
  • No play counts
  • No search history
  • No provider credentials (Spotify tokens, etc.)

Minimal User Data:

  • No user profile (name, avatar, bio)
  • No user preferences (language, region)
  • No user settings (privacy, notifications)
  • No user sessions (active logins)

Connection Management

Connection Pool Configuration

File: bedrock_server/main.go

func initDB() (*pgxpool.Pool, error) {
    dbURL := os.Getenv("DATABASE_URL")
    if dbURL == "" {
        return nil, errors.New("DATABASE_URL not set")
    }
    
    config, err := pgxpool.ParseConfig(dbURL)
    if err != nil {
        return nil, fmt.Errorf("parse config: %w", err)
    }
    
    // Pool configuration
    config.MaxConns = 10
    config.MinConns = 2
    config.MaxConnLifetime = time.Hour
    config.MaxConnIdleTime = 30 * time.Minute
    config.HealthCheckPeriod = 1 * time.Minute
    
    pool, err := pgxpool.NewWithConfig(context.Background(), config)
    if err != nil {
        return nil, fmt.Errorf("create pool: %w", err)
    }
    
    // Test connection
    if err := pool.Ping(context.Background()); err != nil {
        return nil, fmt.Errorf("ping: %w", err)
    }
    
    log.Println("Database connection pool initialized")
    return pool, nil
}

Pool Parameters:

| Parameter | Value | Rationale |
|---|---|---|
| MaxConns | 10 | Limit concurrent DB connections |
| MinConns | 2 | Keep warm connections ready |
| MaxConnLifetime | 1 hour | Prevent stale connections |
| MaxConnIdleTime | 30 minutes | Close idle connections |
| HealthCheckPeriod | 1 minute | Detect dead connections |

Connection String Format:

postgresql://username:password@host:port/database?sslmode=disable

Example:

DATABASE_URL=postgresql://bedrock:bedrock@localhost:5432/bedrock?sslmode=disable

Connection Lifecycle

Application Start:
1. Parse DATABASE_URL from environment
2. Create pgxpool.Config with custom parameters
3. Initialize connection pool
4. Ping database to verify connectivity
5. Pass pool to service layer

Request Handling:
1. Service method receives context and pool
2. Acquire connection from pool (automatic)
3. Execute query
4. Release connection back to pool (automatic: pgxpool's QueryRow releases the connection when Scan returns, no explicit defer is involved)

Application Shutdown:
1. Close connection pool
2. Wait for active connections to finish
3. Release all resources

Data Access Layer

User Store

File: store/user.go

type UserStore struct {
    db *pgxpool.Pool
}

func NewUserStore(db *pgxpool.Pool) *UserStore {
    return &UserStore{db: db}
}

User Operations

Save User

func (s *UserStore) Save(ctx context.Context, email, passwordHash string) (string, error) {
    var userID string
    
    query := `
        INSERT INTO users (email, password_hash)
        VALUES ($1, $2)
        RETURNING id
    `
    
    err := s.db.QueryRow(ctx, query, email, passwordHash).Scan(&userID)
    if err != nil {
        if strings.Contains(err.Error(), "duplicate key") {
            return "", errors.New("email already exists")
        }
        return "", fmt.Errorf("insert user: %w", err)
    }
    
    return userID, nil
}

Behavior:

  • Inserts new user with email and password hash
  • Returns generated UUID
  • Handles duplicate email error
  • Uses parameterized query (SQL injection safe)

Example:

userID, err := userStore.Save(ctx, "user@example.com", "$2a$10$...")
// userID = "550e8400-e29b-41d4-a716-446655440000"
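Matching on the text "duplicate key" depends on the server's message wording and locale. A more robust variant, sketched below, checks the SQLSTATE code instead: PostgreSQL reports 23505 for a unique violation, and pgx's *pgconn.PgError exposes it through an SQLState() method. The names ErrEmailExists, mapInsertError, and fakePgError are hypothetical; the fake error type only stands in for *pgconn.PgError so the sketch runs without a database.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrEmailExists is a hypothetical sentinel error that callers can test with errors.Is.
var ErrEmailExists = errors.New("email already exists")

// uniqueViolation is the SQLSTATE PostgreSQL reports for a violated UNIQUE constraint.
const uniqueViolation = "23505"

// sqlStater is satisfied by *pgconn.PgError, which exposes SQLState().
type sqlStater interface {
	SQLState() string
}

// mapInsertError turns a unique violation into ErrEmailExists and wraps every other error.
func mapInsertError(err error) error {
	if err == nil {
		return nil
	}
	var pgErr sqlStater
	if errors.As(err, &pgErr) && pgErr.SQLState() == uniqueViolation {
		return ErrEmailExists
	}
	return fmt.Errorf("insert user: %w", err)
}

// fakePgError stands in for *pgconn.PgError (assumption: used for illustration only).
type fakePgError struct{ code string }

func (e fakePgError) Error() string    { return "pg error " + e.code }
func (e fakePgError) SQLState() string { return e.code }

func main() {
	// A wrapped unique violation is still recognized because errors.As walks the chain.
	dup := fmt.Errorf("scan: %w", fakePgError{code: "23505"})
	fmt.Println(errors.Is(mapInsertError(dup), ErrEmailExists))

	// Any other SQLSTATE (here a foreign-key violation) is wrapped and passed through.
	fmt.Println(mapInsertError(fakePgError{code: "23503"}))
}
```

In Save, the body of the `if err != nil` branch would then reduce to `return "", mapInsertError(err)`.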

Find User by Email

func (s *UserStore) Find(ctx context.Context, email string) (*User, error) {
    var user User
    
    query := `
        SELECT id, email, password_hash, role, is_verified, created_at
        FROM users
        WHERE email = $1
    `
    
    err := s.db.QueryRow(ctx, query, email).Scan(
        &user.ID,
        &user.Email,
        &user.PasswordHash,
        &user.Role,
        &user.IsVerified,
        &user.CreatedAt,
    )
    
    if err != nil {
        if err == pgx.ErrNoRows {
            return nil, errors.New("user not found")
        }
        return nil, fmt.Errorf("query user: %w", err)
    }
    
    return &user, nil
}

Behavior:

  • Queries user by email (uses index)
  • Returns full user record
  • Handles not found case
  • Uses parameterized query

Example:

user, err := userStore.Find(ctx, "user@example.com")
// user.ID = "550e8400-e29b-41d4-a716-446655440000"
// user.Email = "user@example.com"
// user.PasswordHash = "$2a$10$..."

Find User by ID

func (s *UserStore) FindByID(ctx context.Context, id string) (*User, error) {
    var user User
    
    query := `
        SELECT id, email, password_hash, role, is_verified, created_at
        FROM users
        WHERE id = $1
    `
    
    err := s.db.QueryRow(ctx, query, id).Scan(
        &user.ID,
        &user.Email,
        &user.PasswordHash,
        &user.Role,
        &user.IsVerified,
        &user.CreatedAt,
    )
    
    if err != nil {
        if err == pgx.ErrNoRows {
            return nil, errors.New("user not found")
        }
        return nil, fmt.Errorf("query user: %w", err)
    }
    
    return &user, nil
}

Behavior: Similar to Find, but queries by UUID primary key
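Both lookups return a fresh errors.New("user not found"), so callers can only recognize the case by comparing strings, and the `err == pgx.ErrNoRows` comparison misses wrapped errors. A minimal sketch of the usual alternative, a package-level sentinel checked with errors.Is, follows. ErrUserNotFound, mapLookupError, and statusFor are hypothetical names, and errNoRows stands in for pgx.ErrNoRows so the example needs no database driver.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrUserNotFound is a hypothetical sentinel shared by Find and FindByID.
var ErrUserNotFound = errors.New("user not found")

// errNoRows stands in for pgx.ErrNoRows (assumption: illustration only).
var errNoRows = errors.New("no rows in result set")

// errConnLost is a sample unrelated failure used in the demonstration.
var errConnLost = errors.New("connection lost")

// mapLookupError converts the driver's "no rows" error into the sentinel
// and wraps anything else with context.
func mapLookupError(err error) error {
	switch {
	case err == nil:
		return nil
	case errors.Is(err, errNoRows):
		return ErrUserNotFound
	default:
		return fmt.Errorf("query user: %w", err)
	}
}

// statusFor shows how a caller (for example a gRPC handler) could branch on the result.
func statusFor(err error) string {
	switch {
	case err == nil:
		return "OK"
	case errors.Is(err, ErrUserNotFound):
		return "NOT_FOUND"
	default:
		return "INTERNAL"
	}
}

func main() {
	fmt.Println(statusFor(mapLookupError(nil)))
	fmt.Println(statusFor(mapLookupError(fmt.Errorf("scan: %w", errNoRows))))
	fmt.Println(statusFor(mapLookupError(errConnLost)))
}
```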

User Model

type User struct {
    ID           string
    Email        string
    PasswordHash string
    Role         string
    IsVerified   bool
    CreatedAt    time.Time
}

No ORM: Plain structs, manual scanning

Database Migrations

Migration Files

Directory: db/migrations/

Naming Convention: {number}_{description}.{up|down}.sql

Example Structure:

db/migrations/
├── 001_create_users_table.up.sql
├── 001_create_users_table.down.sql
├── 002_add_user_roles.up.sql
├── 002_add_user_roles.down.sql
├── 003_add_email_verification.up.sql
└── 003_add_email_verification.down.sql

Migration 001: Create Users Table

Up Migration (001_create_users_table.up.sql):

CREATE TABLE users (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    email VARCHAR(255) UNIQUE NOT NULL,
    password_hash VARCHAR(255) NOT NULL,
    role VARCHAR(50) DEFAULT 'user',
    is_verified BOOLEAN DEFAULT false,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

CREATE INDEX idx_users_email ON users(email);

Down Migration (001_create_users_table.down.sql):

DROP INDEX IF EXISTS idx_users_email;
DROP TABLE IF EXISTS users;

Migration Execution

No Automated Tool: Migrations must be run manually

Manual Execution:

# Apply migration
psql $DATABASE_URL -f db/migrations/001_create_users_table.up.sql

# Rollback migration
psql $DATABASE_URL -f db/migrations/001_create_users_table.down.sql

Recommended Tools (not integrated):

  • golang-migrate/migrate
  • pressly/goose
  • rubenv/sql-migrate

Migration Tracking

No Tracking Table: No record of applied migrations

Risks:

  • No way to know which migrations have been applied
  • Manual tracking required
  • Risk of applying migrations out of order
  • Risk of applying same migration twice

Recommendation: Integrate migration tool with tracking table
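To make the tracking problem concrete, the sketch below parses file names that follow the {number}_{description}.{up|down}.sql convention and computes which up migrations are still pending, given the set of versions already applied. This is roughly what a tool such as golang-migrate or goose does with its tracking table; it is an illustration under that assumption, not the API of either tool. The names migration, parseMigrationName, and pendingUp are hypothetical, and the applied set would normally be read from the database.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// migration describes one file named {number}_{description}.{up|down}.sql.
type migration struct {
	Version     int
	Description string
	Direction   string // "up" or "down"
}

// parseMigrationName splits a file name into version, description, and direction.
func parseMigrationName(name string) (migration, error) {
	if !strings.HasSuffix(name, ".sql") {
		return migration{}, fmt.Errorf("%q: missing .sql suffix", name)
	}
	base := strings.TrimSuffix(name, ".sql")
	dot := strings.LastIndex(base, ".")
	if dot < 0 {
		return migration{}, fmt.Errorf("%q: missing direction", name)
	}
	direction := base[dot+1:]
	if direction != "up" && direction != "down" {
		return migration{}, fmt.Errorf("%q: direction must be up or down", name)
	}
	number, description, found := strings.Cut(base[:dot], "_")
	if !found || description == "" {
		return migration{}, fmt.Errorf("%q: missing description", name)
	}
	version, err := strconv.Atoi(number)
	if err != nil || version <= 0 {
		return migration{}, fmt.Errorf("%q: invalid version %q", name, number)
	}
	return migration{Version: version, Description: description, Direction: direction}, nil
}

// pendingUp returns the up migrations whose version is not in applied, lowest version first.
func pendingUp(files []string, applied map[int]bool) ([]migration, error) {
	var pending []migration
	for _, f := range files {
		m, err := parseMigrationName(f)
		if err != nil {
			return nil, err
		}
		if m.Direction == "up" && !applied[m.Version] {
			pending = append(pending, m)
		}
	}
	sort.Slice(pending, func(i, j int) bool { return pending[i].Version < pending[j].Version })
	return pending, nil
}

func main() {
	files := []string{
		"003_add_email_verification.up.sql",
		"001_create_users_table.up.sql",
		"001_create_users_table.down.sql",
		"002_add_user_roles.up.sql",
	}
	// Assume version 1 is already recorded as applied.
	pending, err := pendingUp(files, map[int]bool{1: true})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, m := range pending {
		fmt.Printf("%03d %s\n", m.Version, m.Description)
	}
}
```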

Caching Strategy

Current Implementation

No Caching: All data fetched from providers on every request

Impact:

  • High latency (200-500ms per search)
  • Provider API rate limits
  • Unnecessary API quota consumption
  • No offline capability

Planned Caching (Redis)

Not Implemented: Redis integration planned but not built

Proposed Cache Keys:

| Key Pattern | TTL | Purpose |
|---|---|---|
| track:{platform}:{id} | 1 hour | Track metadata |
| album:{platform}:{id} | 1 hour | Album metadata |
| artist:{platform}:{id} | 1 hour | Artist metadata |
| playlist:{platform}:{id} | 5 minutes | Playlist metadata (changes frequently) |
| stream:{platform}:{id} | 1 hour | Stream URLs (expire after 1-6 hours) |
| search:{query}:{platform} | 5 minutes | Search results |
| lyrics:{artist}:{title} | 24 hours | Lyrics (rarely change) |
| play:{user_id}:{track_id} | 30 seconds | Play deduplication |
| status:{platform} | 5 minutes | Provider health status |

Proposed Cache Invalidation:

  • TTL-based expiration (no manual invalidation)
  • No cache warming (lazy loading)
  • No cache preloading
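The proposed key patterns and TTLs could be centralized in one small helper so that every caller builds keys the same way. The sketch below is one possible shape: entityKey, searchKey, and ttlFor are hypothetical names, and lowercasing and trimming the search query is an added assumption, not part of the proposal.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// ttlByKind mirrors the TTL column of the proposed key table.
var ttlByKind = map[string]time.Duration{
	"track":    time.Hour,
	"album":    time.Hour,
	"artist":   time.Hour,
	"playlist": 5 * time.Minute,
	"stream":   time.Hour,
	"search":   5 * time.Minute,
	"lyrics":   24 * time.Hour,
	"play":     30 * time.Second,
	"status":   5 * time.Minute,
}

// entityKey builds keys of the form {kind}:{platform}:{id}, e.g. track:spotify:abc.
func entityKey(kind, platform, id string) string {
	return fmt.Sprintf("%s:%s:%s", kind, platform, id)
}

// searchKey builds search:{query}:{platform}. Normalizing the query
// (trim + lowercase) is an assumption made here to improve hit rates.
func searchKey(query, platform string) string {
	return fmt.Sprintf("search:%s:%s", strings.ToLower(strings.TrimSpace(query)), platform)
}

// ttlFor looks up the TTL from the key's first segment; ok is false for unknown kinds.
func ttlFor(key string) (ttl time.Duration, ok bool) {
	kind, _, _ := strings.Cut(key, ":")
	ttl, ok = ttlByKind[kind]
	return ttl, ok
}

func main() {
	key := entityKey("playlist", "spotify", "37i9dQ")
	ttl, _ := ttlFor(key)
	fmt.Println(key, ttl)
	fmt.Println(searchKey("  Daft Punk ", "spotify"))
}
```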

Proposed Redis Configuration:

redisClient := redis.NewClient(&redis.Options{
    Addr:         os.Getenv("REDIS_URL"),
    Password:     os.Getenv("REDIS_PASSWORD"),
    DB:           0,
    MaxRetries:   3,
    PoolSize:     10,
    MinIdleConns: 2,
})

Cache-Aside Pattern (Proposed)

func (s *server) GetTrack(ctx context.Context, req *pb.GetRequest) (*pb.Track, error) {
    // Try cache first. Any cache error or corrupt entry falls through to the provider.
    cacheKey := fmt.Sprintf("track:%s", req.Id)
    if cached, err := s.redis.Get(ctx, cacheKey).Result(); err == nil {
        var track pb.Track
        if err := json.Unmarshal([]byte(cached), &track); err == nil {
            return &track, nil
        }
    }
    
    // Cache miss, fetch from provider
    platform, nativeID := parseNamespacedID(req.Id)
    provider := s.getProvider(platform)
    track, err := provider.GetTrack(ctx, nativeID)
    if err != nil {
        return nil, err
    }
    
    // Store in cache (best effort: a failed marshal or write must not fail the request)
    if trackJSON, err := json.Marshal(track); err == nil {
        s.redis.Set(ctx, cacheKey, trackJSON, 1*time.Hour)
    }
    
    return track, nil
}
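The same cache-aside flow can be exercised without Redis. The sketch below swaps in a small in-memory TTL map as a stand-in for the Redis client; ttlCache, getOrLoad, and trackTTL are hypothetical names, values are plain strings, and the context parameter a real implementation would carry is omitted for brevity.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// trackTTL matches the 1 hour TTL proposed for track metadata.
const trackTTL = time.Hour

type entry struct {
	value     string
	expiresAt time.Time
}

// ttlCache is an in-memory stand-in for Redis (assumption: illustration only).
type ttlCache struct {
	mu      sync.Mutex
	entries map[string]entry
}

func newTTLCache() *ttlCache {
	return &ttlCache{entries: make(map[string]entry)}
}

// get returns the value if present and not expired; expired entries are dropped.
func (c *ttlCache) get(key string) (string, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	e, ok := c.entries[key]
	if !ok {
		return "", false
	}
	if !time.Now().Before(e.expiresAt) {
		delete(c.entries, key)
		return "", false
	}
	return e.value, true
}

// set stores the value with an expiry of now + ttl.
func (c *ttlCache) set(key, value string, ttl time.Duration) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.entries[key] = entry{value: value, expiresAt: time.Now().Add(ttl)}
}

// getOrLoad is the cache-aside step: read the cache, on a miss call load,
// then store the result. Loader errors are returned and never cached.
func (c *ttlCache) getOrLoad(key string, ttl time.Duration, load func() (string, error)) (string, error) {
	if v, ok := c.get(key); ok {
		return v, nil
	}
	v, err := load()
	if err != nil {
		return "", err
	}
	c.set(key, v, ttl)
	return v, nil
}

func main() {
	cache := newTTLCache()
	providerCalls := 0
	load := func() (string, error) {
		providerCalls++ // stands in for provider.GetTrack
		return `{"title":"Example"}`, nil
	}
	for i := 0; i < 3; i++ {
		if _, err := cache.getOrLoad("track:spotify:abc", trackTTL, load); err != nil {
			fmt.Println("error:", err)
			return
		}
	}
	fmt.Println("provider calls:", providerCalls)
}
```

Concurrent misses for the same key would each call the loader in this sketch; collapsing them into one provider call would need additional coordination.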

Data Persistence Patterns

No Metadata Persistence

Current: All metadata is ephemeral (fetched from providers, not stored)

Implications:

  • No historical data
  • No offline access
  • No analytics on metadata changes
  • No data ownership

Alternative Approach (not implemented):

  • Store all fetched metadata in PostgreSQL
  • Update on cache miss
  • Enable historical queries
  • Reduce provider API dependency

No User Data Persistence

Current: Only authentication data is stored

Missing User Data:

  • Listening history
  • Favorite tracks/albums/artists
  • Created playlists
  • Search history
  • Playback state (current track, position)
  • User preferences

Implications:

  • No personalization
  • No recommendations based on history
  • No cross-device sync
  • No user analytics

Transaction Handling

No Transactions

Current: All database operations are single-statement

Example (no transaction):

func (s *UserStore) Save(ctx context.Context, email, passwordHash string) (string, error) {
    var userID string
    err := s.db.QueryRow(ctx,
        "INSERT INTO users (email, password_hash) VALUES ($1, $2) RETURNING id",
        email, passwordHash,
    ).Scan(&userID)
    return userID, err
}

No Multi-Statement Operations: No need for transactions with single table

Future Considerations: If schema expands (user profiles, playlists, etc.), transactions will be needed

Transaction Example (not used):

func (s *UserStore) SaveWithProfile(ctx context.Context, email, passwordHash, name string) error {
    tx, err := s.db.Begin(ctx)
    if err != nil {
        return err
    }
    defer tx.Rollback(ctx)
    
    var userID string
    err = tx.QueryRow(ctx,
        "INSERT INTO users (email, password_hash) VALUES ($1, $2) RETURNING id",
        email, passwordHash,
    ).Scan(&userID)
    if err != nil {
        return err
    }
    
    _, err = tx.Exec(ctx,
        "INSERT INTO profiles (user_id, name) VALUES ($1, $2)",
        userID, name,
    )
    if err != nil {
        return err
    }
    
    return tx.Commit(ctx)
}

Query Performance

Index Usage

Indexed Queries:

-- Uses idx_users_email (B-tree index)
SELECT * FROM users WHERE email = 'user@example.com';

-- Uses primary key index (automatic)
SELECT * FROM users WHERE id = '550e8400-e29b-41d4-a716-446655440000';

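No Full Table Scans Needed: Every query filters on an indexed column (the planner may still choose a sequential scan while the table is very small)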

Query Patterns

Point Lookups Only: No range queries, no aggregations, no joins

Example Queries:

-- Login (index scan on email)
SELECT id, email, password_hash, role, is_verified, created_at
FROM users
WHERE email = $1;

-- Token refresh (index scan on id)
SELECT id, email, role
FROM users
WHERE id = $1;

-- Registration (insert with RETURNING)
INSERT INTO users (email, password_hash)
VALUES ($1, $2)
RETURNING id;

No Complex Queries: Simple CRUD operations only

Data Consistency

Email Uniqueness

Constraint: UNIQUE constraint on email column

Enforcement: Database-level (PostgreSQL)

Race Condition Handling:

err := s.db.QueryRow(ctx, query, email, passwordHash).Scan(&userID)
if err != nil {
    if strings.Contains(err.Error(), "duplicate key") {
        return "", errors.New("email already exists")
    }
    return "", fmt.Errorf("insert user: %w", err)
}

Concurrent Registration: Database prevents duplicate emails even with concurrent requests

UUID Generation

Method: PostgreSQL gen_random_uuid() function

Collision Probability: Negligible (UUID v4 has 122 random bits)

No Application-Level ID Generation: Database handles ID creation

Backup and Recovery

No Automated Backups

Current: No backup strategy implemented

Risks:

  • Data loss on database failure
  • No point-in-time recovery
  • No disaster recovery plan

Recommendations:

  • Enable PostgreSQL continuous archiving (WAL archiving)
  • Schedule daily full backups
  • Test restore procedures
  • Store backups off-site (S3, etc.)

Manual Backup

pg_dump:

pg_dump $DATABASE_URL > backup.sql

Restore:

psql $DATABASE_URL < backup.sql

Data Security

Password Storage

Hashing Algorithm: bcrypt
Cost Factor: 10 (2^10 = 1024 iterations)

Implementation:

func hashPassword(password string) (string, error) {
    bytes, err := bcrypt.GenerateFromPassword([]byte(password), 10)
    return string(bytes), err
}

func checkPasswordHash(password, hash string) bool {
    err := bcrypt.CompareHashAndPassword([]byte(hash), []byte(password))
    return err == nil
}

Security Properties:

  • Salted (bcrypt includes random salt)
  • Slow (cost factor 10 is on the order of 100ms per hash, depending on hardware)
  • Resistant to rainbow tables
  • Resistant to brute force (with rate limiting, not implemented)

SQL Injection Prevention

Parameterized Queries: All queries use $1, $2 placeholders

Safe Example:

// Safe: parameterized query (Scan takes one destination per selected column)
err := s.db.QueryRow(ctx,
    "SELECT id, email FROM users WHERE email = $1",
    email,
).Scan(&user.ID, &user.Email)

Unsafe Example (not used):

// Unsafe: user input interpolated into the SQL string (NOT USED IN CODEBASE)
query := fmt.Sprintf("SELECT id, email FROM users WHERE email = '%s'", email)
err := s.db.QueryRow(ctx, query).Scan(&user.ID, &user.Email)

All Queries Are Safe: No string concatenation in SQL queries

Connection Security

SSL Mode: Configurable via connection string

Example (SSL disabled):

DATABASE_URL=postgresql://user:pass@localhost:5432/db?sslmode=disable

Example (SSL required):

DATABASE_URL=postgresql://user:pass@localhost:5432/db?sslmode=require

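Production Recommendation: Prefer sslmode=verify-full. sslmode=require encrypts the connection but does not verify the server certificate, so it does not protect against man-in-the-middle attacks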

Database Monitoring

No Monitoring

Current: No database monitoring implemented

Missing Metrics:

  • Connection pool utilization
  • Query latency
  • Slow query log
  • Deadlock detection
  • Table bloat
  • Index usage statistics

Recommendations:

  • Enable PostgreSQL pg_stat_statements extension
  • Monitor connection pool metrics (pgxpool provides stats)
  • Set up alerts for connection pool exhaustion
  • Log slow queries (> 1 second)

Connection Pool Stats (Available but Not Used)

stats := pool.Stat()
log.Printf("Total connections: %d", stats.TotalConns())
log.Printf("Idle connections: %d", stats.IdleConns())
log.Printf("Acquired connections: %d", stats.AcquiredConns())
log.Printf("Max connections: %d", stats.MaxConns())

Not Implemented: Stats are available but not logged or exposed
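As a sketch of how these stats could feed an exhaustion alert, the helper below computes pool utilization from the two counters it needs. The poolStats interface lists methods that *pgxpool.Stat provides (both return int32), so the value returned by pool.Stat() could be passed in directly. The 0.8 threshold and the names utilization, nearExhaustion, and fakeStats are assumptions for illustration.

```go
package main

import "fmt"

// poolStats lists the two *pgxpool.Stat methods this sketch relies on.
type poolStats interface {
	AcquiredConns() int32
	MaxConns() int32
}

// exhaustionThreshold is a hypothetical alerting threshold (80% of MaxConns in use).
const exhaustionThreshold = 0.8

// utilization returns the fraction of the pool currently acquired, in [0, 1].
func utilization(s poolStats) float64 {
	if s.MaxConns() <= 0 {
		return 0
	}
	return float64(s.AcquiredConns()) / float64(s.MaxConns())
}

// nearExhaustion reports whether utilization has reached the alert threshold.
func nearExhaustion(s poolStats) bool {
	return utilization(s) >= exhaustionThreshold
}

// fakeStats stands in for *pgxpool.Stat so the sketch runs without a database.
type fakeStats struct{ acquired, max int32 }

func (f fakeStats) AcquiredConns() int32 { return f.acquired }
func (f fakeStats) MaxConns() int32      { return f.max }

func main() {
	for _, s := range []fakeStats{{acquired: 2, max: 10}, {acquired: 9, max: 10}} {
		fmt.Printf("utilization=%.0f%% alert=%v\n", utilization(s)*100, nearExhaustion(s))
	}
}
```

A periodic goroutine could call nearExhaustion(pool.Stat()) and log or export the result.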

Data Retention

No Retention Policy

Current: Data is never deleted

User Data:

  • Users are never deleted (no account deletion endpoint)
  • No GDPR compliance (no data export, no right to be forgotten)

Recommendations:

  • Implement account deletion endpoint
  • Add soft delete (deleted_at timestamp)
  • Implement data export (GDPR compliance)
  • Add retention policy for inactive accounts

Scalability Considerations

Vertical Scaling

Current Limits:

  • Connection pool: 10 max connections
  • Single PostgreSQL instance
  • No read replicas

Scaling Up:

  • Increase connection pool size
  • Increase PostgreSQL resources (CPU, RAM)
  • Tune PostgreSQL configuration (shared_buffers, work_mem)

Horizontal Scaling

Not Supported: Single database instance

Challenges:

  • No sharding strategy
  • No read/write splitting
  • No multi-region support

Future Considerations:

  • Add read replicas for search queries
  • Shard by user ID for user data
  • Use connection pooler (PgBouncer) for connection management

Data Model Limitations

Single Table Schema

Pros:

  • Simple to understand
  • No joins required
  • Fast queries (index lookups only)

Cons:

  • No relational data (playlists, favorites, etc.)
  • No metadata persistence
  • No user activity tracking
  • Limited functionality

No Audit Trail

Missing:

  • No login history
  • No password change history
  • No account modification log
  • No admin action log

Implications:

  • No security forensics
  • No compliance audit trail
  • No user activity analytics

No Soft Deletes

Hard Delete Only: If delete functionality is added, records are permanently removed

Recommendation: Add deleted_at timestamp for soft deletes

ALTER TABLE users ADD COLUMN deleted_at TIMESTAMP;
CREATE INDEX idx_users_deleted_at ON users(deleted_at);

-- Query active users
SELECT * FROM users WHERE deleted_at IS NULL;

Testing Strategy

No Database Tests

Current: No unit tests for database operations

Missing Tests:

  • User creation with duplicate email
  • User lookup by email
  • User lookup by ID
  • Connection pool exhaustion
  • Database connection failure
  • Transaction rollback (if added)

Recommendation: Add integration tests with test database

Example Test (not implemented):

func TestUserStore_Save_DuplicateEmail(t *testing.T) {
    db := setupTestDB(t)
    defer db.Close()
    
    store := NewUserStore(db)
    
    // First save should succeed
    _, err := store.Save(context.Background(), "test@example.com", "hash1")
    if err != nil {
        t.Fatalf("first save failed: %v", err)
    }
    
    // Second save with same email should fail
    _, err = store.Save(context.Background(), "test@example.com", "hash2")
    if err == nil {
        t.Fatal("expected duplicate email error")
    }
}

Environment Configuration

Database URL

Environment Variable: DATABASE_URL

Format: PostgreSQL connection string

Example:

DATABASE_URL=postgresql://bedrock:bedrock@localhost:5432/bedrock?sslmode=disable

Components:

  • Protocol: postgresql://
  • Username: bedrock
  • Password: bedrock
  • Host: localhost
  • Port: 5432
  • Database: bedrock
  • SSL Mode: sslmode=disable

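No Validation: Beyond checking that DATABASE_URL is set and that pgxpool.ParseConfig accepts it, nothing is validated; the application fails at startup if the value is missing or invalid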

Recommendation: Validate connection string format on startup
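A startup check along these lines could look like the sketch below, which uses only net/url. validateDatabaseURL is a hypothetical helper. It assumes the URL form shown above, so it would reject the keyword/value form ("host=... user=...") that pgx also accepts, and requiring an explicit host and database name is a policy choice rather than a PostgreSQL rule.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
	"strings"
)

// validateDatabaseURL checks the URL-style connection string before a pool is created.
func validateDatabaseURL(raw string) error {
	if raw == "" {
		return errors.New("DATABASE_URL not set")
	}
	u, err := url.Parse(raw)
	if err != nil {
		return fmt.Errorf("parse DATABASE_URL: %w", err)
	}
	if u.Scheme != "postgres" && u.Scheme != "postgresql" {
		return fmt.Errorf("unsupported scheme %q", u.Scheme)
	}
	if u.Hostname() == "" {
		return errors.New("missing host")
	}
	if strings.TrimPrefix(u.Path, "/") == "" {
		return errors.New("missing database name")
	}
	// An empty sslmode is allowed; the driver then applies its default.
	switch mode := u.Query().Get("sslmode"); mode {
	case "", "disable", "allow", "prefer", "require", "verify-ca", "verify-full":
		return nil
	default:
		return fmt.Errorf("unknown sslmode %q", mode)
	}
}

func main() {
	candidates := []string{
		"postgresql://bedrock:bedrock@localhost:5432/bedrock?sslmode=disable",
		"mysql://bedrock:bedrock@localhost:3306/bedrock",
		"postgresql://bedrock:bedrock@localhost:5432/bedrock?sslmode=sometimes",
	}
	for _, c := range candidates {
		// Only the verdict is printed, to keep credentials out of logs.
		fmt.Println(validateDatabaseURL(c))
	}
}
```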

Docker Deployment

Docker Compose PostgreSQL

File: docker-compose.yml

version: '3.8'

services:
  postgres:
    image: postgres:15-alpine
    environment:
      POSTGRES_USER: bedrock
      POSTGRES_PASSWORD: bedrock
      POSTGRES_DB: bedrock
    ports:
      - "5432:5432"
    volumes:
      - postgres_data:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U bedrock"]
      interval: 10s
      timeout: 5s
      retries: 5

volumes:
  postgres_data:

Features:

  • PostgreSQL 15 Alpine (minimal image)
  • Named volume for data persistence
  • Health check for container orchestration
  • Exposed port for local development

Missing:

  • No initialization scripts (migrations must be run manually)
  • No backup configuration
  • No replication
  • No connection pooler (PgBouncer)

Database Initialization

Manual Process:

# Start PostgreSQL
docker-compose up -d postgres

# Wait for PostgreSQL to be ready
docker-compose exec postgres pg_isready -U bedrock

# Run migrations (the file is read from the host via stdin, since the container has no /migrations mount)
docker-compose exec -T postgres psql -U bedrock -d bedrock < db/migrations/001_create_users_table.up.sql

No Automated Initialization: Migrations must be run manually after container start

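Recommendation: Add init script to docker-compose. Two caveats: the postgres image runs scripts in /docker-entrypoint-initdb.d only when the data directory is empty, and it runs every *.sql file there in name order, so mounting db/migrations as-is would also execute the *.down.sql files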

postgres:
  image: postgres:15-alpine
  volumes:
    - postgres_data:/var/lib/postgresql/data
    - ./db/migrations:/docker-entrypoint-initdb.d

Data Layer Summary

Strengths

  • Simple, focused schema (users only)
  • Proper indexing (email lookup is fast)
  • Connection pooling (pgxpool from pgx/v5)
  • Parameterized queries (SQL injection safe)
  • bcrypt password hashing (secure)

Weaknesses

  • No metadata persistence (all data is ephemeral)
  • No caching (high latency, provider API dependency)
  • No migration tool (manual SQL execution)
  • No monitoring (connection pool, query performance)
  • No backup strategy (data loss risk)
  • No audit trail (security, compliance)
  • Minimal schema (no user data beyond auth)

Recommendations for Metadata Aggregator

Adopt:

  • pgx/v5 driver (excellent performance, native PostgreSQL features)
  • Connection pooling configuration (sensible defaults)
  • Parameterized queries (security best practice)

Avoid:

  • Manual migrations (use golang-migrate or goose)
  • No caching (implement Redis for metadata)
  • Minimal schema (metadata aggregator needs rich schema)

Enhance:

  • Add metadata tables (tracks, albums, artists, labels, etc.)
  • Add user data tables (favorites, playlists, history)
  • Add caching layer (Redis for hot data)
  • Add migration tool (automated schema management)
  • Add monitoring (connection pool, query latency)
  • Add backup strategy (automated backups, point-in-time recovery)