metadata-agregator/docs/research/bedrock-api/analysis/DATA.md
Alexander a1f6701bac feat: initial implementation of metadata aggregator
- gRPC service with MusicBrainz provider
- PostgreSQL schema with migrations
- Service layer with database-first caching
- Repository pattern for data access
- YAML configuration support
- Research documentation for 17 music metadata projects
2026-04-28 16:28:53 +02:00


# Bedrock-API Data Layer
## Database Technology
**RDBMS**: PostgreSQL 15
**Driver**: `github.com/jackc/pgx/v5` (native PostgreSQL driver)
**Connection Pooling**: `pgxpool` (pgx connection pool)
**Migration Tool**: None (manual SQL execution)
## Database Schema
### Users Table
**File**: `db/migrations/001_create_users_table.up.sql`
```sql
CREATE TABLE users (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    email VARCHAR(255) UNIQUE NOT NULL,
    password_hash VARCHAR(255) NOT NULL,
    role VARCHAR(50) DEFAULT 'user',
    is_verified BOOLEAN DEFAULT false,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
CREATE INDEX idx_users_email ON users(email);
```
**Columns**:
| Column | Type | Constraints | Purpose |
|--------|------|-------------|---------|
| id | UUID | PRIMARY KEY, DEFAULT gen_random_uuid() | Unique user identifier |
| email | VARCHAR(255) | UNIQUE, NOT NULL | User email (login identifier) |
| password_hash | VARCHAR(255) | NOT NULL | bcrypt hashed password |
| role | VARCHAR(50) | DEFAULT 'user' | User role (user/admin) |
| is_verified | BOOLEAN | DEFAULT false | Email verification status |
| created_at | TIMESTAMP | DEFAULT CURRENT_TIMESTAMP | Account creation timestamp |
**Indexes**:
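- Primary key index on `id` (automatic)
- Unique B-tree index on `email` (created automatically by the `UNIQUE` constraint)
- Explicit B-tree index `idx_users_email` on `email` (intended for login lookups; redundant, because the `UNIQUE` constraint's index already serves them)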
**No Foreign Keys**: Single table schema, no relationships
### Schema Limitations
**Missing Tables**:
- No metadata cache (tracks, albums, artists, playlists)
- No user listening history
- No user playlists
- No user favorites/likes
- No play counts
- No search history
- No provider credentials (Spotify tokens, etc.)
**Minimal User Data**:
- No user profile (name, avatar, bio)
- No user preferences (language, region)
- No user settings (privacy, notifications)
- No user sessions (active logins)
## Connection Management
### Connection Pool Configuration
**File**: `bedrock_server/main.go`
```go
func initDB() (*pgxpool.Pool, error) {
	dbURL := os.Getenv("DATABASE_URL")
	if dbURL == "" {
		return nil, errors.New("DATABASE_URL not set")
	}
	config, err := pgxpool.ParseConfig(dbURL)
	if err != nil {
		return nil, fmt.Errorf("parse config: %w", err)
	}
	// Pool configuration
	config.MaxConns = 10
	config.MinConns = 2
	config.MaxConnLifetime = time.Hour
	config.MaxConnIdleTime = 30 * time.Minute
	config.HealthCheckPeriod = 1 * time.Minute
	pool, err := pgxpool.NewWithConfig(context.Background(), config)
	if err != nil {
		return nil, fmt.Errorf("create pool: %w", err)
	}
	// Test connection
	if err := pool.Ping(context.Background()); err != nil {
		return nil, fmt.Errorf("ping: %w", err)
	}
	log.Println("Database connection pool initialized")
	return pool, nil
}
```
**Pool Parameters**:
| Parameter | Value | Rationale |
|-----------|-------|-----------|
| MaxConns | 10 | Limit concurrent DB connections |
| MinConns | 2 | Keep warm connections ready |
| MaxConnLifetime | 1 hour | Prevent stale connections |
| MaxConnIdleTime | 30 minutes | Close idle connections |
| HealthCheckPeriod | 1 minute | Detect dead connections |
**Connection String Format**:
```
postgresql://username:password@host:port/database?sslmode=disable
```
**Example**:
```
DATABASE_URL=postgresql://bedrock:bedrock@localhost:5432/bedrock?sslmode=disable
```
### Connection Lifecycle
```
Application Start:
1. Parse DATABASE_URL from environment
2. Create pgxpool.Config with custom parameters
3. Initialize connection pool
4. Ping database to verify connectivity
5. Pass pool to service layer
Request Handling:
1. Service method receives context and pool
2. Acquire connection from pool (automatic)
3. Execute query
4. Release connection back to pool (automatic; pgxpool releases it once the row has been scanned)
Application Shutdown:
1. Close connection pool
2. Wait for active connections to finish
3. Release all resources
```
## Data Access Layer
### User Store
**File**: `store/user.go`
```go
type UserStore struct {
	db *pgxpool.Pool
}

func NewUserStore(db *pgxpool.Pool) *UserStore {
	return &UserStore{db: db}
}
```
### User Operations
#### Save User
```go
func (s *UserStore) Save(ctx context.Context, email, passwordHash string) (string, error) {
	var userID string
	query := `
		INSERT INTO users (email, password_hash)
		VALUES ($1, $2)
		RETURNING id
	`
	err := s.db.QueryRow(ctx, query, email, passwordHash).Scan(&userID)
	if err != nil {
		if strings.Contains(err.Error(), "duplicate key") {
			return "", errors.New("email already exists")
		}
		return "", fmt.Errorf("insert user: %w", err)
	}
	return userID, nil
}
```
**Behavior**:
- Inserts new user with email and password hash
- Returns generated UUID
- Handles duplicate email error
- Uses parameterized query (SQL injection safe)
**Example**:
```go
userID, err := userStore.Save(ctx, "user@example.com", "$2a$10$...")
// userID = "550e8400-e29b-41d4-a716-446655440000"
```
#### Find User by Email
```go
func (s *UserStore) Find(ctx context.Context, email string) (*User, error) {
	var user User
	query := `
		SELECT id, email, password_hash, role, is_verified, created_at
		FROM users
		WHERE email = $1
	`
	err := s.db.QueryRow(ctx, query, email).Scan(
		&user.ID,
		&user.Email,
		&user.PasswordHash,
		&user.Role,
		&user.IsVerified,
		&user.CreatedAt,
	)
	if err != nil {
		if err == pgx.ErrNoRows {
			return nil, errors.New("user not found")
		}
		return nil, fmt.Errorf("query user: %w", err)
	}
	return &user, nil
}
```
**Behavior**:
- Queries user by email (uses index)
- Returns full user record
- Handles not found case
- Uses parameterized query
**Example**:
```go
user, err := userStore.Find(ctx, "user@example.com")
// user.ID = "550e8400-e29b-41d4-a716-446655440000"
// user.Email = "user@example.com"
// user.PasswordHash = "$2a$10$..."
```
#### Find User by ID
```go
func (s *UserStore) FindByID(ctx context.Context, id string) (*User, error) {
	var user User
	query := `
		SELECT id, email, password_hash, role, is_verified, created_at
		FROM users
		WHERE id = $1
	`
	err := s.db.QueryRow(ctx, query, id).Scan(
		&user.ID,
		&user.Email,
		&user.PasswordHash,
		&user.Role,
		&user.IsVerified,
		&user.CreatedAt,
	)
	if err != nil {
		if err == pgx.ErrNoRows {
			return nil, errors.New("user not found")
		}
		return nil, fmt.Errorf("query user: %w", err)
	}
	return &user, nil
}
```
**Behavior**: Similar to Find, but queries by UUID primary key
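Both lookups return an anonymous `errors.New("user not found")`, so callers can only recognize the not-found case by comparing message strings. One possible refinement, sketched below, is a package-level sentinel error matched with `errors.Is`. The names `ErrUserNotFound`, `errNoRows`, and `translateLookupErr` are hypothetical and not part of the analyzed code; `errNoRows` merely stands in for `pgx.ErrNoRows` so the sketch compiles without the driver.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrUserNotFound is a hypothetical sentinel that callers could match with errors.Is.
var ErrUserNotFound = errors.New("user not found")

// errNoRows stands in for pgx.ErrNoRows (assumption: keeps the sketch stdlib-only).
var errNoRows = errors.New("no rows in result set")

// translateLookupErr maps a driver-level error to a store-level error.
// errors.Is also matches when the driver error has been wrapped.
func translateLookupErr(err error) error {
	switch {
	case err == nil:
		return nil
	case errors.Is(err, errNoRows):
		return ErrUserNotFound
	default:
		return fmt.Errorf("query user: %w", err)
	}
}
```

A handler could then branch on `errors.Is(err, ErrUserNotFound)` to return a "not found" status and treat every other error as an internal failure.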
### User Model
```go
type User struct {
	ID           string
	Email        string
	PasswordHash string
	Role         string
	IsVerified   bool
	CreatedAt    time.Time
}
```
**No ORM**: Plain structs, manual scanning
## Database Migrations
### Migration Files
**Directory**: `db/migrations/`
**Naming Convention**: `{number}_{description}.{up|down}.sql`
**Example Structure**:
```
db/migrations/
├── 001_create_users_table.up.sql
├── 001_create_users_table.down.sql
├── 002_add_user_roles.up.sql
├── 002_add_user_roles.down.sql
├── 003_add_email_verification.up.sql
└── 003_add_email_verification.down.sql
```
### Migration 001: Create Users Table
**Up Migration** (`001_create_users_table.up.sql`):
```sql
CREATE TABLE users (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    email VARCHAR(255) UNIQUE NOT NULL,
    password_hash VARCHAR(255) NOT NULL,
    role VARCHAR(50) DEFAULT 'user',
    is_verified BOOLEAN DEFAULT false,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
CREATE INDEX idx_users_email ON users(email);
```
**Down Migration** (`001_create_users_table.down.sql`):
```sql
DROP INDEX IF EXISTS idx_users_email;
DROP TABLE IF EXISTS users;
```
### Migration Execution
**No Automated Tool**: Migrations must be run manually
**Manual Execution**:
```bash
# Apply migration
psql $DATABASE_URL -f db/migrations/001_create_users_table.up.sql
# Rollback migration
psql $DATABASE_URL -f db/migrations/001_create_users_table.down.sql
```
**Recommended Tools** (not integrated):
- `golang-migrate/migrate`
- `pressly/goose`
- `rubenv/sql-migrate`
### Migration Tracking
**No Tracking Table**: No record of applied migrations
**Risks**:
- No way to know which migrations have been applied
- Manual tracking required
- Risk of applying migrations out of order
- Risk of applying same migration twice
**Recommendation**: Integrate migration tool with tracking table
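To illustrate what a tracking table buys, the sketch below computes which up-migrations are still pending, given the file names in `db/migrations/` and the set of versions already recorded as applied. It is a minimal illustration under stated assumptions, not how `golang-migrate` or `goose` work internally; `migrationVersion` and `pendingMigrations` are hypothetical helpers, and the applied set would come from a tracking table that does not exist in the analyzed project.

```go
package main

import (
	"sort"
	"strconv"
	"strings"
)

// migrationVersion extracts the numeric prefix from an up-migration file name
// such as "002_add_user_roles.up.sql". ok is false for any other name.
func migrationVersion(name string) (version int, ok bool) {
	if !strings.HasSuffix(name, ".up.sql") {
		return 0, false
	}
	prefix, _, found := strings.Cut(name, "_")
	if !found {
		return 0, false
	}
	v, err := strconv.Atoi(prefix)
	if err != nil {
		return 0, false
	}
	return v, true
}

// pendingMigrations returns the up-migration files whose version is not in
// applied, ordered by ascending version so they run in sequence.
func pendingMigrations(files []string, applied map[int]bool) []string {
	type entry struct {
		version int
		name    string
	}
	var pending []entry
	for _, f := range files {
		if v, ok := migrationVersion(f); ok && !applied[v] {
			pending = append(pending, entry{v, f})
		}
	}
	sort.Slice(pending, func(i, j int) bool { return pending[i].version < pending[j].version })
	names := make([]string, 0, len(pending))
	for _, e := range pending {
		names = append(names, e.name)
	}
	return names
}
```

With version 1 recorded as applied, only the `002` and `003` up-migrations are returned, in that order, which addresses both the "out of order" and the "applied twice" risks listed above.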
## Caching Strategy
### Current Implementation
**No Caching**: All data fetched from providers on every request
**Impact**:
- High latency (200-500ms per search)
- Provider API rate limits
- Unnecessary API quota consumption
- No offline capability
### Planned Caching (Redis)
**Not Implemented**: Redis integration planned but not built
**Proposed Cache Keys**:
| Key Pattern | TTL | Purpose |
|-------------|-----|---------|
| `track:{platform}:{id}` | 1 hour | Track metadata |
| `album:{platform}:{id}` | 1 hour | Album metadata |
| `artist:{platform}:{id}` | 1 hour | Artist metadata |
| `playlist:{platform}:{id}` | 5 minutes | Playlist metadata (changes frequently) |
| `stream:{platform}:{id}` | 1 hour | Stream URLs (expire after 1-6 hours, so a 1-hour TTL may outlive the shortest-lived URLs) |
| `search:{query}:{platform}` | 5 minutes | Search results |
| `lyrics:{artist}:{title}` | 24 hours | Lyrics (rarely change) |
| `play:{user_id}:{track_id}` | 30 seconds | Play deduplication |
| `status:{platform}` | 5 minutes | Provider health status |
**Proposed Cache Invalidation**:
- TTL-based expiration (no manual invalidation)
- No cache warming (lazy loading)
- No cache preloading
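The key patterns and TTLs proposed above could be centralized in one place so that every caller builds keys the same way. The following is a minimal sketch; `cacheKind`, `cacheTTL`, `entityKey`, and `searchKey` are hypothetical names, and only a subset of the proposed keys is covered.

```go
package main

import (
	"fmt"
	"time"
)

// cacheKind is the first segment of a cache key, e.g. "track".
type cacheKind string

const (
	kindTrack    cacheKind = "track"
	kindAlbum    cacheKind = "album"
	kindArtist   cacheKind = "artist"
	kindPlaylist cacheKind = "playlist"
	kindSearch   cacheKind = "search"
)

// cacheTTL mirrors the TTL column of the proposed key table.
var cacheTTL = map[cacheKind]time.Duration{
	kindTrack:    time.Hour,
	kindAlbum:    time.Hour,
	kindArtist:   time.Hour,
	kindPlaylist: 5 * time.Minute,
	kindSearch:   5 * time.Minute,
}

// entityKey builds a "{kind}:{platform}:{id}" key.
func entityKey(kind cacheKind, platform, id string) string {
	return fmt.Sprintf("%s:%s:%s", kind, platform, id)
}

// searchKey builds a "search:{query}:{platform}" key.
func searchKey(query, platform string) string {
	return fmt.Sprintf("%s:%s:%s", kindSearch, query, platform)
}
```

Keeping the TTLs next to the key builders makes the purely TTL-based invalidation policy visible in a single file.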
**Proposed Redis Configuration**:
```go
redisClient := redis.NewClient(&redis.Options{
	// Addr expects "host:port"; a full redis:// URL would need redis.ParseURL instead
	Addr:         os.Getenv("REDIS_URL"),
	Password:     os.Getenv("REDIS_PASSWORD"),
	DB:           0,
	MaxRetries:   3,
	PoolSize:     10,
	MinIdleConns: 2,
})
```
### Cache-Aside Pattern (Proposed)
```go
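func (s *server) GetTrack(ctx context.Context, req *pb.GetRequest) (*pb.Track, error) {
	// Try cache first (req.Id is assumed to be namespaced as "{platform}:{id}")
	cacheKey := fmt.Sprintf("track:%s", req.Id)
	cached, err := s.redis.Get(ctx, cacheKey).Result()
	if err == nil {
		var track pb.Track
		// Treat an undecodable entry as a miss rather than returning a zero-value track
		if err := json.Unmarshal([]byte(cached), &track); err == nil {
			return &track, nil
		}
	}
	// Cache miss (or Redis error), fetch from provider
	platform, nativeID := parseNamespacedID(req.Id)
	provider := s.getProvider(platform)
	track, err := provider.GetTrack(ctx, nativeID)
	if err != nil {
		return nil, err
	}
	// Store in cache (best effort: a failed cache write must not fail the request)
	if trackJSON, err := json.Marshal(track); err == nil {
		if err := s.redis.Set(ctx, cacheKey, trackJSON, 1*time.Hour).Err(); err != nil {
			log.Printf("cache set %s: %v", cacheKey, err)
		}
	}
	return track, nil
}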
```
## Data Persistence Patterns
### No Metadata Persistence
**Current**: All metadata is ephemeral (fetched from providers, not stored)
**Implications**:
- No historical data
- No offline access
- No analytics on metadata changes
- No data ownership
**Alternative Approach** (not implemented):
- Store all fetched metadata in PostgreSQL
- Update on cache miss
- Enable historical queries
- Reduce provider API dependency
### No User Data Persistence
**Current**: Only authentication data is stored
**Missing User Data**:
- Listening history
- Favorite tracks/albums/artists
- Created playlists
- Search history
- Playback state (current track, position)
- User preferences
**Implications**:
- No personalization
- No recommendations based on history
- No cross-device sync
- No user analytics
## Transaction Handling
### No Transactions
**Current**: All database operations are single-statement
**Example** (no transaction):
```go
func (s *UserStore) Save(ctx context.Context, email, passwordHash string) (string, error) {
	var userID string
	err := s.db.QueryRow(ctx,
		"INSERT INTO users (email, password_hash) VALUES ($1, $2) RETURNING id",
		email, passwordHash,
	).Scan(&userID)
	return userID, err
}
```
**No Multi-Statement Operations**: No need for transactions with single table
**Future Considerations**: If schema expands (user profiles, playlists, etc.), transactions will be needed
**Transaction Example** (not used):
```go
func (s *UserStore) SaveWithProfile(ctx context.Context, email, passwordHash, name string) error {
	tx, err := s.db.Begin(ctx)
	if err != nil {
		return err
	}
	// Rolls back on any early return; has no effect once Commit has succeeded
	defer tx.Rollback(ctx)
	var userID string
	err = tx.QueryRow(ctx,
		"INSERT INTO users (email, password_hash) VALUES ($1, $2) RETURNING id",
		email, passwordHash,
	).Scan(&userID)
	if err != nil {
		return err
	}
	// Both inserts become visible together, or not at all
	_, err = tx.Exec(ctx,
		"INSERT INTO profiles (user_id, name) VALUES ($1, $2)",
		userID, name,
	)
	if err != nil {
		return err
	}
	return tx.Commit(ctx)
}
```
## Query Performance
### Index Usage
**Indexed Queries**:
```sql
-- Uses idx_users_email (B-tree index)
SELECT * FROM users WHERE email = 'user@example.com';
-- Uses primary key index (automatic)
SELECT * FROM users WHERE id = '550e8400-e29b-41d4-a716-446655440000';
```
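**No Full Table Scans**: Every query filters on an indexed column with an equality predicate, so the planner can answer it with an index scan (on a very small table it may still pick a sequential scan because that is cheaper)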
### Query Patterns
**Point Lookups Only**: No range queries, no aggregations, no joins
**Example Queries**:
```sql
-- Login (index scan on email)
SELECT id, email, password_hash, role, is_verified, created_at
FROM users
WHERE email = $1;
-- Token refresh (index scan on id)
SELECT id, email, role
FROM users
WHERE id = $1;
-- Registration (insert with RETURNING)
INSERT INTO users (email, password_hash)
VALUES ($1, $2)
RETURNING id;
```
**No Complex Queries**: Simple CRUD operations only
## Data Consistency
### Email Uniqueness
**Constraint**: `UNIQUE` constraint on `email` column
**Enforcement**: Database-level (PostgreSQL)
**Race Condition Handling**:
```go
err := s.db.QueryRow(ctx, query, email, passwordHash).Scan(&userID)
if err != nil {
	if strings.Contains(err.Error(), "duplicate key") {
		return "", errors.New("email already exists")
	}
	return "", fmt.Errorf("insert user: %w", err)
}
```
**Concurrent Registration**: Database prevents duplicate emails even with concurrent requests
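Matching on the text "duplicate key" depends on the server's message wording. A sturdier variant, sketched below, checks the SQLSTATE code instead: PostgreSQL reports unique violations as `23505`. The sketch assumes the driver's error type exposes an `SQLState() string` method (pgx's `*pgconn.PgError` is believed to, alongside its `Code` field); `sqlStateError`, `isUniqueViolation`, and `fakePgError` are hypothetical names, and `fakePgError` exists only so the example runs without a database.

```go
package main

import "errors"

// uniqueViolation is the PostgreSQL SQLSTATE code for a violated UNIQUE constraint.
const uniqueViolation = "23505"

// sqlStateError is satisfied by any error that can report its SQLSTATE
// (assumption: the driver's error type has this method).
type sqlStateError interface {
	error
	SQLState() string
}

// isUniqueViolation reports whether err, or any error it wraps,
// carries SQLSTATE 23505.
func isUniqueViolation(err error) bool {
	var se sqlStateError
	return errors.As(err, &se) && se.SQLState() == uniqueViolation
}

// fakePgError is a test double standing in for a driver error.
type fakePgError struct{ code string }

func (e fakePgError) Error() string    { return "ERROR (SQLSTATE " + e.code + ")" }
func (e fakePgError) SQLState() string { return e.code }
```

In `Save`, the `strings.Contains` check could then be replaced by `isUniqueViolation(err)`, leaving the rest of the error handling unchanged.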
### UUID Generation
**Method**: PostgreSQL `gen_random_uuid()` function
**Collision Probability**: Negligible (UUID v4 has 122 random bits)
**No Application-Level ID Generation**: Database handles ID creation
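To put a number on "negligible", the birthday approximation p ≈ n² / (2 · 2^122) gives the chance of at least one collision among n random version 4 UUIDs. The sketch below evaluates it; `collisionProbability` is a hypothetical helper, and the approximation only holds while the result is much smaller than 1.

```go
package main

import "math"

// uuidV4RandomBits is the number of random bits in a version 4 UUID.
const uuidV4RandomBits = 122

// collisionProbability approximates the chance of at least one collision
// among n random UUIDs using the birthday bound n^2 / (2 * 2^122).
func collisionProbability(n float64) float64 {
	return n * n / (2 * math.Exp2(uuidV4RandomBits))
}
```

For one billion users the estimate is on the order of 1e-19, far below any realistic failure rate of the surrounding system.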
## Backup and Recovery
### No Automated Backups
**Current**: No backup strategy implemented
**Risks**:
- Data loss on database failure
- No point-in-time recovery
- No disaster recovery plan
**Recommendations**:
- Enable PostgreSQL continuous archiving (WAL archiving)
- Schedule daily full backups
- Test restore procedures
- Store backups off-site (S3, etc.)
### Manual Backup
**pg_dump**:
```bash
pg_dump $DATABASE_URL > backup.sql
```
**Restore**:
```bash
psql $DATABASE_URL < backup.sql
```
## Data Security
### Password Storage
**Hashing Algorithm**: bcrypt
**Cost Factor**: 10 (2^10 = 1024 iterations)
**Implementation**:
```go
func hashPassword(password string) (string, error) {
	bytes, err := bcrypt.GenerateFromPassword([]byte(password), 10)
	return string(bytes), err
}

func checkPasswordHash(password, hash string) bool {
	err := bcrypt.CompareHashAndPassword([]byte(hash), []byte(password))
	return err == nil
}
```
**Security Properties**:
- Salted (bcrypt includes random salt)
- Slow by design (cost factor 10 takes on the order of 50-100 ms per hash, depending on hardware)
- Resistant to rainbow tables
- Resistant to brute force (with rate limiting, not implemented)
### SQL Injection Prevention
**Parameterized Queries**: All queries use `$1`, `$2` placeholders
**Safe Example**:
```go
// Safe: parameterized query (email is sent as a bound value, never as SQL text)
err := s.db.QueryRow(ctx,
	"SELECT id, email FROM users WHERE email = $1",
	email,
).Scan(&user.ID, &user.Email)
```
**Unsafe Example** (not used):
```go
// Unsafe: string concatenation (NOT USED IN CODEBASE)
query := fmt.Sprintf("SELECT id, email FROM users WHERE email = '%s'", email)
err := s.db.QueryRow(ctx, query).Scan(&user.ID, &user.Email)
```
**All Queries Are Safe**: No string concatenation in SQL queries
### Connection Security
**SSL Mode**: Configurable via connection string
**Example** (SSL disabled):
```
DATABASE_URL=postgresql://user:pass@localhost:5432/db?sslmode=disable
```
**Example** (SSL required):
```
DATABASE_URL=postgresql://user:pass@localhost:5432/db?sslmode=require
```
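**Production Recommendation**: Use at least `sslmode=require`; prefer `sslmode=verify-full`, because `require` encrypts the connection but does not verify the server certificate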
## Database Monitoring
### No Monitoring
**Current**: No database monitoring implemented
**Missing Metrics**:
- Connection pool utilization
- Query latency
- Slow query log
- Deadlock detection
- Table bloat
- Index usage statistics
**Recommendations**:
- Enable PostgreSQL `pg_stat_statements` extension
- Monitor connection pool metrics (pgxpool provides stats)
- Set up alerts for connection pool exhaustion
- Log slow queries (> 1 second)
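The slow-query recommendation can also be approximated on the application side by timing each store call. The sketch below is one way to do that; `timed` and `slowQueryThreshold` are hypothetical names, and the one-second threshold simply mirrors the recommendation above.

```go
package main

import (
	"log"
	"time"
)

// slowQueryThreshold mirrors the "> 1 second" recommendation.
const slowQueryThreshold = time.Second

// timed runs fn and reports whether it took longer than threshold.
// A slow call is logged with its name and duration; fn's error is passed through.
func timed(name string, threshold time.Duration, fn func() error) (slow bool, err error) {
	start := time.Now()
	err = fn()
	elapsed := time.Since(start)
	if elapsed > threshold {
		log.Printf("slow query %q took %s (threshold %s)", name, elapsed, threshold)
		return true, err
	}
	return false, err
}
```

A store method could wrap its `QueryRow(...).Scan(...)` call in `timed("users.find_by_email", slowQueryThreshold, ...)`. This complements, rather than replaces, server-side tools such as `pg_stat_statements`.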
### Connection Pool Stats (Available but Not Used)
```go
stats := pool.Stat()
log.Printf("Total connections: %d", stats.TotalConns())
log.Printf("Idle connections: %d", stats.IdleConns())
log.Printf("Acquired connections: %d", stats.AcquiredConns())
log.Printf("Max connections: %d", stats.MaxConns())
```
**Not Implemented**: Stats are available but not logged or exposed
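As a sketch of how these stats could feed the exhaustion alert recommended above, the helpers below turn two of the counters into a utilization ratio. The `poolStat` interface is an assumption that lists only the `AcquiredConns` and `MaxConns` methods used in the snippet above; `poolUtilization`, `nearExhaustion`, and the `fixedStat` test double are hypothetical names.

```go
package main

// poolStat lists the two pgxpool stat methods this sketch relies on
// (assumption: the value returned by pool.Stat() satisfies it).
type poolStat interface {
	AcquiredConns() int32
	MaxConns() int32
}

// poolUtilization returns acquired/max as a fraction in [0, 1],
// or 0 when the pool reports no capacity.
func poolUtilization(s poolStat) float64 {
	if s.MaxConns() <= 0 {
		return 0
	}
	return float64(s.AcquiredConns()) / float64(s.MaxConns())
}

// nearExhaustion reports whether utilization has reached the given threshold.
func nearExhaustion(s poolStat, threshold float64) bool {
	return poolUtilization(s) >= threshold
}

// fixedStat is a test double with constant counters.
type fixedStat struct{ acquired, max int32 }

func (f fixedStat) AcquiredConns() int32 { return f.acquired }
func (f fixedStat) MaxConns() int32      { return f.max }
```

A background goroutine could call `nearExhaustion(pool.Stat(), 0.8)` on a ticker and log or emit a metric when it returns true; the 0.8 threshold is only an example value.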
## Data Retention
### No Retention Policy
**Current**: Data is never deleted
**User Data**:
- Users are never deleted (no account deletion endpoint)
- No GDPR compliance (no data export, no right to be forgotten)
**Recommendations**:
- Implement account deletion endpoint
- Add soft delete (deleted_at timestamp)
- Implement data export (GDPR compliance)
- Add retention policy for inactive accounts
## Scalability Considerations
### Vertical Scaling
**Current Limits**:
- Connection pool: 10 max connections
- Single PostgreSQL instance
- No read replicas
**Scaling Up**:
- Increase connection pool size
- Increase PostgreSQL resources (CPU, RAM)
- Tune PostgreSQL configuration (shared_buffers, work_mem)
### Horizontal Scaling
**Not Supported**: Single database instance
**Challenges**:
- No sharding strategy
- No read/write splitting
- No multi-region support
**Future Considerations**:
- Add read replicas for search queries
- Shard by user ID for user data
- Use connection pooler (PgBouncer) for connection management
## Data Model Limitations
### Single Table Schema
**Pros**:
- Simple to understand
- No joins required
- Fast queries (index lookups only)
**Cons**:
- No relational data (playlists, favorites, etc.)
- No metadata persistence
- No user activity tracking
- Limited functionality
### No Audit Trail
**Missing**:
- No login history
- No password change history
- No account modification log
- No admin action log
**Implications**:
- No security forensics
- No compliance audit trail
- No user activity analytics
### No Soft Deletes
**Hard Delete Only**: If delete functionality is added, records are permanently removed
**Recommendation**: Add `deleted_at` timestamp for soft deletes
```sql
ALTER TABLE users ADD COLUMN deleted_at TIMESTAMP;
CREATE INDEX idx_users_deleted_at ON users(deleted_at);
-- Query active users
SELECT * FROM users WHERE deleted_at IS NULL;
```
## Testing Strategy
### No Database Tests
**Current**: No unit tests for database operations
**Missing Tests**:
- User creation with duplicate email
- User lookup by email
- User lookup by ID
- Connection pool exhaustion
- Database connection failure
- Transaction rollback (if added)
**Recommendation**: Add integration tests with test database
**Example Test** (not implemented):
```go
func TestUserStore_Save_DuplicateEmail(t *testing.T) {
	db := setupTestDB(t)
	defer db.Close()
	store := NewUserStore(db)
	// First save should succeed
	_, err := store.Save(context.Background(), "test@example.com", "hash1")
	if err != nil {
		t.Fatalf("first save failed: %v", err)
	}
	// Second save with same email should fail
	_, err = store.Save(context.Background(), "test@example.com", "hash2")
	if err == nil {
		t.Fatal("expected duplicate email error")
	}
}
```
## Environment Configuration
### Database URL
**Environment Variable**: `DATABASE_URL`
**Format**: PostgreSQL connection string
**Example**:
```
DATABASE_URL=postgresql://bedrock:bedrock@localhost:5432/bedrock?sslmode=disable
```
**Components**:
- Protocol: `postgresql://`
- Username: `bedrock`
- Password: `bedrock`
- Host: `localhost`
- Port: `5432`
- Database: `bedrock`
- SSL Mode: `sslmode=disable`
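**No Validation**: Beyond the empty check and `pgxpool.ParseConfig` in `initDB`, the connection string is not validated; the application fails at startup if `DATABASE_URL` is invalid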
**Recommendation**: Validate connection string format on startup
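A startup check along these lines could look like the sketch below. It only handles the URL form shown above (pgx also accepts keyword/value connection strings, which this would reject), and `validateDatabaseURL` is a hypothetical helper, not part of the analyzed code. Error messages deliberately avoid echoing the raw URL, since it contains the password.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
)

// validateDatabaseURL performs basic structural checks on a
// postgres:// or postgresql:// connection URL.
func validateDatabaseURL(raw string) error {
	if raw == "" {
		return errors.New("DATABASE_URL not set")
	}
	u, err := url.Parse(raw)
	if err != nil {
		// url.Error embeds the full URL (including the password); keep only the cause.
		var ue *url.Error
		if errors.As(err, &ue) {
			err = ue.Err
		}
		return fmt.Errorf("parse DATABASE_URL: %w", err)
	}
	if u.Scheme != "postgres" && u.Scheme != "postgresql" {
		return fmt.Errorf("unsupported scheme %q", u.Scheme)
	}
	if u.Hostname() == "" {
		return errors.New("DATABASE_URL has no host")
	}
	if len(u.Path) <= 1 { // "" or "/" means no database name
		return errors.New("DATABASE_URL has no database name")
	}
	switch mode := u.Query().Get("sslmode"); mode {
	case "", "disable", "allow", "prefer", "require", "verify-ca", "verify-full":
		return nil
	default:
		return fmt.Errorf("unknown sslmode %q", mode)
	}
}
```

Calling this before `pgxpool.ParseConfig` would let the service exit with a specific message instead of a generic parse or connection failure.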
## Docker Deployment
### Docker Compose PostgreSQL
**File**: `docker-compose.yml`
```yaml
version: '3.8'
services:
  postgres:
    image: postgres:15-alpine
    environment:
      POSTGRES_USER: bedrock
      POSTGRES_PASSWORD: bedrock
      POSTGRES_DB: bedrock
    ports:
      - "5432:5432"
    volumes:
      - postgres_data:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U bedrock"]
      interval: 10s
      timeout: 5s
      retries: 5
volumes:
  postgres_data:
```
**Features**:
- PostgreSQL 15 Alpine (minimal image)
- Named volume for data persistence
- Health check for container orchestration
- Exposed port for local development
**Missing**:
- No initialization scripts (migrations must be run manually)
- No backup configuration
- No replication
- No connection pooler (PgBouncer)
### Database Initialization
**Manual Process**:
```bash
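# Start PostgreSQL
docker-compose up -d postgres
# Wait for PostgreSQL to be ready (pg_isready exits non-zero until the server accepts connections)
until docker-compose exec -T postgres pg_isready -U bedrock; do sleep 1; done
# Run migrations (the SQL file lives on the host, so stream it to psql over stdin)
docker-compose exec -T postgres psql -U bedrock -d bedrock < db/migrations/001_create_users_table.up.sql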
```
**No Automated Initialization**: Migrations must be run manually after container start
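**Recommendation**: Add an init script to docker-compose. Mount only the `.up.sql` files: the image's entrypoint runs every `*.sql` file in `/docker-entrypoint-initdb.d` in name order, so mounting the whole `db/migrations` directory would also execute the `.down.sql` files. Init scripts run only when the data directory is empty (first start of a fresh volume).
```yaml
postgres:
  image: postgres:15-alpine
  volumes:
    - postgres_data:/var/lib/postgresql/data
    - ./db/migrations/001_create_users_table.up.sql:/docker-entrypoint-initdb.d/001_create_users_table.up.sql
```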
## Data Layer Summary
### Strengths
- Simple, focused schema (users only)
- Proper indexing (email lookup is fast)
- Connection pooling (pgx/v5)
- Parameterized queries (SQL injection safe)
- bcrypt password hashing (secure)
### Weaknesses
- No metadata persistence (all data is ephemeral)
- No caching (high latency, provider API dependency)
- No migration tool (manual SQL execution)
- No monitoring (connection pool, query performance)
- No backup strategy (data loss risk)
- No audit trail (security, compliance)
- Minimal schema (no user data beyond auth)
### Recommendations for Metadata Aggregator
**Adopt**:
- pgx/v5 driver (excellent performance, native PostgreSQL features)
- Connection pooling configuration (sensible defaults)
- Parameterized queries (security best practice)
**Avoid**:
- Manual migrations (use golang-migrate or goose)
- No caching (implement Redis for metadata)
- Minimal schema (metadata aggregator needs rich schema)
**Enhance**:
- Add metadata tables (tracks, albums, artists, labels, etc.)
- Add user data tables (favorites, playlists, history)
- Add caching layer (Redis for hot data)
- Add migration tool (automated schema management)
- Add monitoring (connection pool, query latency)
- Add backup strategy (automated backups, point-in-time recovery)