MusicFS/docs/v2/plans/resilience-testing-v2.md
Alexander c6aa47f440 Add resilience testing BlueDoc (v2)
Restructured resilience testing strategy into BlueDoc template format
with proper sections: Abstract, Background, Goals/Non-Goals, Proposed
Design, Cross-Cutting Concerns, Alternatives Considered, Implementation
Plan, and Glossary. Original resilience-testing.md preserved.
2026-05-13 12:46:25 +02:00

MusicFS Resilience Testing: Design Doc

Authors: AI-assisted
Status: Draft
Last Updated: 2026-05-13
Reviewers: TBD
Approvers: TBD
Prerequisites: resilience-fault-tolerance.md, architecture.md


[TOC]


1. Abstract

MusicFS has 162 unit/integration tests but zero fault injection, crash recovery, or resilience tests. This design doc defines the test infrastructure, tooling, and test cases needed to verify that MusicFS survives the 34 failure modes identified in the resilience audit.

The approach uses three testing layers: trait-based mocks with failpoints for fast unit-level verification, fork-kill process tests for crash and signal recovery, and Toxiproxy with Docker for real-protocol network fault injection. A new musicfs-test-utils crate centralizes shared test helpers that are currently duplicated across 29 files.


2. Background

2.1 Current Test State

| Metric | Value |
| --- | --- |
| Total tests | 162 |
| Test files with #[cfg(test)] | 43 |
| Async tests (#[tokio::test]) | 44 |
| Fault injection tests | 0 |
| Crash recovery tests | 0 |
| Signal handling tests | 0 |
| CI pipeline | None |
| Mocking framework | None (real components + TempDir) |

2.2 What Exists

  • Unit tests: Per-crate #[cfg(test)] modules using real implementations with TempDir isolation
  • Integration tests: crates/musicfs-cas/tests/integration.rs — CAS + fetcher + reader pipeline
  • E2E tests: tests/e2e/e2e_players.rs — mpv/VLC playback over mounted FUSE (#[ignore], manual)
  • Test helpers: make_file_meta(), mock_health() — duplicated across modules, not centralized
  • Test tooling: cargo-nextest and cargo-criterion available in Nix flake

2.3 What's Missing

The resilience audit identified 34 failure modes across 6 phases. None have test coverage. The audit covers:

  • Signal handling and graceful shutdown (Phase A)
  • Crash recovery and cache integrity (Phase B)
  • Network fault tolerance and origin failover (Phase C-D)
  • Runtime deadlocks and resource exhaustion (Phase E)
  • Cache/database sudden death and passthrough mode (Phase F)

2.4 Why "Doing Nothing" Is Not an Option

MusicFS is designed as a critical filesystem daemon. Untested failure paths mean:

  • Crashes that corrupt SQLite, sled, or tantivy go undetected until production
  • Signal handling code (once implemented) has no regression tests
  • Origin failover logic is tested for correctness but not for actual failure scenarios
  • No confidence that the daemon survives real-world conditions (disk full, NAS reboot, OOM)

3. Goals & Non-Goals

3.1 Goals

  • Every resilience issue gets a test — all 34 failure modes from the audit mapped to concrete test cases
  • Tests run without root — no kernel modules, no privileged containers for Layer 1 and Layer 2
  • Tests run fast — Layer 1 tests complete in <1 second each; full resilience suite in <60 seconds
  • Failpoints are zero-cost — conditional compilation via Cargo features; no runtime overhead in release builds
  • Test helpers are centralized: the musicfs-test-utils crate eliminates duplication across 29 files

3.2 Non-Goals

  • Full chaos engineering platform — this is not Jepsen; we test known failure modes, not random exploration
  • Performance benchmarking — covered separately by cargo-criterion; this doc is about correctness under failure
  • CI pipeline setup — pipeline configuration (GitHub Actions, Nix CI) is a separate task; this doc defines what to run, not where
  • FUSE kernel-level testing — testing kernel FUSE module behavior or /dev/fuse edge cases is out of scope

4. Proposed Design

4.1 Testing Layers

┌─────────────────────────────────────────────────────────────┐
│ Layer 3: Toxiproxy + Docker                                  │
│ Real protocols, real latency, real connection drops           │
│ ~5 tests, seconds each, requires docker-compose              │
├─────────────────────────────────────────────────────────────┤
│ Layer 2: Fork-Kill Process Tests                             │
│ Spawn daemon, send signals, kill -9, verify recovery         │
│ ~5 tests, seconds each, cargo test                           │
├─────────────────────────────────────────────────────────────┤
│ Layer 1: Trait Mocks + Failpoints                            │
│ FaultyOrigin, FaultyCasStore, fail_point! macros             │
│ ~25 tests, milliseconds each, cargo test                     │
└─────────────────────────────────────────────────────────────┘

Rule: Every resilience issue gets Layer 1 coverage at minimum. Critical issues (signal handling, crash recovery, FUSE unmount) additionally get Layer 2. Network-specific issues (origin failover, latency, connection drops) additionally get Layer 3.

4.2 New Dependencies

# Cargo.toml [workspace.dependencies]
fail = "0.5"                    # TiKV failpoints: conditional fault injection
rlimit = "0.10"                 # Resource limit manipulation (fd, memory)
nix = { version = "0.29", features = ["signal", "process"] }  # Signal sending, process control

# Per-crate Cargo.toml [features] (Cargo has no workspace-level features table)
failpoints = ["fail/failpoints"]  # Zero-cost when disabled

# dev-dependencies only (not shipped in release binary)
wiremock = "0.6"                # HTTP mock server (S3 origin tests)
assert_cmd = "2.0"              # CLI integration testing

Why these choices:

  • fail (TiKV failpoints): Production-proven by TiKV (distributed KV store). Zero overhead when failpoints feature is disabled. Supports deterministic failure injection with counter/probability controls.
  • rlimit: Test fd exhaustion and memory limits without root. Wraps setrlimit/getrlimit syscalls.
  • nix: Send signals to child processes (kill(pid, SIGTERM)). Already a transitive dependency via fuser.
  • wiremock: Pure-Rust HTTP mock server for S3 origin testing. No external process needed.

4.3 Test Infrastructure Crate

crates/musicfs-test-utils/ — new workspace crate providing shared test helpers.

4.3.1 FaultyOrigin

Wraps any Origin implementation with configurable failure injection:

pub struct FaultyOrigin {
    inner: Arc<dyn Origin>,
    fail_mode: Arc<RwLock<FailMode>>,
    call_count: AtomicUsize,
}

pub enum FailMode {
    Healthy,                           // Pass through to inner
    FailEveryNth(usize),               // Fail on every Nth call
    FailAfterN(usize),                 // Succeed N times, then always fail
    TimeoutMs(u64),                    // Sleep then fail (simulates hung NFS)
    PartialRead { max_bytes: usize },  // Return truncated data
    ReturnError(io::ErrorKind),        // Return specific error
}

FaultyOrigin implements the Origin trait, so it can stand in wherever a real origin is expected. fail_mode is wrapped in Arc<RwLock<>> so tests can change behavior mid-test (e.g., the origin "recovers" after a health check).
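A minimal sketch of how the wrapper could dispatch on FailMode is shown here. It assumes a simplified, synchronous Origin trait with a single read() method; the real trait is likely async and richer, so the trait, StaticOrigin, set_mode(), and the error messages are all hypothetical. TimeoutMs is left out to keep the sketch short.

```rust
use std::io;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, RwLock};

/// Hypothetical synchronous stand-in for the real `Origin` trait.
pub trait Origin: Send + Sync {
    fn read(&self, path: &str) -> io::Result<Vec<u8>>;
}

/// Always-healthy origin that returns fixed bytes (illustration only).
pub struct StaticOrigin(pub Vec<u8>);

impl Origin for StaticOrigin {
    fn read(&self, _path: &str) -> io::Result<Vec<u8>> {
        Ok(self.0.clone())
    }
}

#[derive(Clone, Copy)]
pub enum FailMode {
    Healthy,
    FailEveryNth(usize),
    FailAfterN(usize),
    PartialRead { max_bytes: usize },
    ReturnError(io::ErrorKind),
}

pub struct FaultyOrigin {
    inner: Arc<dyn Origin>,
    fail_mode: Arc<RwLock<FailMode>>,
    call_count: AtomicUsize,
}

impl FaultyOrigin {
    pub fn new(inner: Arc<dyn Origin>, mode: FailMode) -> Self {
        Self {
            inner,
            fail_mode: Arc::new(RwLock::new(mode)),
            call_count: AtomicUsize::new(0),
        }
    }

    /// Switch behavior mid-test, e.g. to let the origin "recover".
    pub fn set_mode(&self, mode: FailMode) {
        *self.fail_mode.write().unwrap() = mode;
    }

    pub fn call_count(&self) -> usize {
        self.call_count.load(Ordering::SeqCst)
    }
}

impl Origin for FaultyOrigin {
    fn read(&self, path: &str) -> io::Result<Vec<u8>> {
        // 1-based index of this call, counted since construction.
        let n = self.call_count.fetch_add(1, Ordering::SeqCst) + 1;
        let mode = *self.fail_mode.read().unwrap();
        let injected = || io::Error::new(io::ErrorKind::Other, "injected fault");
        match mode {
            FailMode::Healthy => self.inner.read(path),
            FailMode::FailEveryNth(k) if k > 0 && n % k == 0 => Err(injected()),
            FailMode::FailAfterN(k) if n > k => Err(injected()),
            // Guards above did not match: pass through to the inner origin.
            FailMode::FailEveryNth(_) | FailMode::FailAfterN(_) => self.inner.read(path),
            FailMode::PartialRead { max_bytes } => {
                let mut data = self.inner.read(path)?;
                data.truncate(max_bytes);
                Ok(data)
            }
            FailMode::ReturnError(kind) => Err(io::Error::from(kind)),
        }
    }
}
```

In this sketch the call counter is never reset, so FailAfterN counts calls from construction rather than from the last mode change; a real implementation may prefer to reset it in set_mode().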

4.3.2 FaultyCasStore

Wraps CasStore with injectable disk errors:

pub struct FaultyCasStore {
    inner: CasStore,
    inject_enospc: AtomicBool,       // put() fails with ENOSPC
    inject_eio_on_read: AtomicBool,  // get() fails with EIO
    inject_corruption: AtomicBool,   // get() returns bad data
}
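One way the three flags might gate put() and get() is sketched below. The in-memory CasStore, its put()/get() signatures, and the byte-flip used to simulate corruption are assumptions for illustration; the errno constants are the Linux values.

```rust
use std::collections::HashMap;
use std::io;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Mutex;

const ENOSPC: i32 = 28; // Linux errno: no space left on device
const EIO: i32 = 5; // Linux errno: I/O error

/// Hypothetical in-memory stand-in for the real `CasStore`.
#[derive(Default)]
pub struct CasStore {
    chunks: Mutex<HashMap<String, Vec<u8>>>,
}

impl CasStore {
    pub fn put(&self, key: &str, data: &[u8]) -> io::Result<()> {
        self.chunks.lock().unwrap().insert(key.to_string(), data.to_vec());
        Ok(())
    }

    pub fn get(&self, key: &str) -> io::Result<Vec<u8>> {
        self.chunks
            .lock()
            .unwrap()
            .get(key)
            .cloned()
            .ok_or_else(|| io::Error::from(io::ErrorKind::NotFound))
    }
}

#[derive(Default)]
pub struct FaultyCasStore {
    inner: CasStore,
    pub inject_enospc: AtomicBool,
    pub inject_eio_on_read: AtomicBool,
    pub inject_corruption: AtomicBool,
}

impl FaultyCasStore {
    pub fn put(&self, key: &str, data: &[u8]) -> io::Result<()> {
        if self.inject_enospc.load(Ordering::SeqCst) {
            return Err(io::Error::from_raw_os_error(ENOSPC));
        }
        self.inner.put(key, data)
    }

    pub fn get(&self, key: &str) -> io::Result<Vec<u8>> {
        if self.inject_eio_on_read.load(Ordering::SeqCst) {
            return Err(io::Error::from_raw_os_error(EIO));
        }
        let mut data = self.inner.get(key)?;
        if self.inject_corruption.load(Ordering::SeqCst) {
            // Flip the first byte so any checksum over the chunk no longer matches.
            if let Some(first) = data.first_mut() {
                *first ^= 0xFF;
            }
        }
        Ok(data)
    }
}
```

Returning raw OS errors rather than generic ones lets a test assert that the caller maps ENOSPC and EIO to the right recovery path.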

4.3.3 Centralized Fixtures

Currently duplicated across 29 test modules:

pub fn make_file_meta(id: i64, vpath: &str, size: u64) -> FileMeta;
pub fn make_audio_meta(artist: &str, album: &str, title: &str) -> AudioMeta;
pub async fn setup_test_cas(dir: &Path) -> Arc<CasStore>;
pub fn setup_test_tree(files: &[FileMeta]) -> Arc<RwLock<VirtualTree>>;

4.4 Failpoints Instrumentation

Production code locations that need fail_point! macros:

| Location | Failpoint Name | Simulates |
| --- | --- | --- |
| musicfs-cas/src/store.rs put() | cas-put-before-write | ENOSPC before chunk write |
| musicfs-cas/src/store.rs put() | cas-put-after-write-before-index | Crash between write and sled insert |
| musicfs-cas/src/reader.rs get_or_fetch_manifest() | reader-manifest-fetch | Manifest fetch failure |
| musicfs-sync/src/delta.rs detect_changes() | delta-sync-after-batch | Crash mid-sync |
| musicfs-origins/src/health.rs check_one() | health-check-hang | Health check hangs forever |
| musicfs-cache/src/db.rs open() | db-open-corrupt | Database corruption on open |

All guarded by #[cfg(feature = "failpoints")] — zero-cost in release builds.
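To illustrate what a failpoint between the chunk write and the index insert exercises, the sketch below uses a small hand-rolled registry as a stand-in. It is not the fail crate's API (the real code would use that crate's fail_point! macro); the Store type, enable(), disable(), and orphans() are hypothetical names.

```rust
use std::collections::{HashMap, HashSet};
use std::io;
use std::sync::Mutex;

// Hand-rolled stand-in for a global failpoint registry (NOT the `fail` crate API).
static ACTIVE: Mutex<Vec<&'static str>> = Mutex::new(Vec::new());

pub fn enable(name: &'static str) {
    ACTIVE.lock().unwrap().push(name);
}

pub fn disable(name: &'static str) {
    ACTIVE.lock().unwrap().retain(|n| *n != name);
}

/// Returns an error when the named failpoint is currently enabled.
fn fail_point(name: &str) -> io::Result<()> {
    if ACTIVE.lock().unwrap().iter().any(|n| *n == name) {
        return Err(io::Error::new(
            io::ErrorKind::Other,
            format!("failpoint {name} triggered"),
        ));
    }
    Ok(())
}

#[derive(Default)]
pub struct Store {
    pub chunks: HashMap<String, Vec<u8>>, // stands in for chunk files on disk
    pub index: HashSet<String>,           // stands in for the sled index
}

impl Store {
    pub fn put(&mut self, key: &str, data: &[u8]) -> io::Result<()> {
        fail_point("cas-put-before-write")?;
        self.chunks.insert(key.to_string(), data.to_vec());
        // An injected failure here leaves a chunk with no index entry.
        fail_point("cas-put-after-write-before-index")?;
        self.index.insert(key.to_string());
        Ok(())
    }

    /// Chunks that were written but never indexed.
    pub fn orphans(&self) -> Vec<&String> {
        self.chunks
            .keys()
            .filter(|k| !self.index.contains(*k))
            .collect()
    }
}
```

Because the registry is process-global, two tests toggling the same failpoint concurrently would interfere with each other, which is the reason the failpoint suite runs with a single test thread.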

4.5 Test File Organization

musicfs/
├── crates/
│   └── musicfs-test-utils/          # NEW — shared test helpers
│       ├── Cargo.toml
│       └── src/
│           ├── lib.rs
│           ├── faulty_origin.rs     # FaultyOrigin with FailMode
│           ├── faulty_cas.rs        # FaultyCasStore
│           ├── fixtures.rs          # make_file_meta, setup_test_cas, etc.
│           └── assertions.rs        # Custom assertions
├── tests/
│   ├── resilience/                  # NEW — resilience test suite
│   │   ├── mod.rs
│   │   ├── signal_handling.rs       # SIGTERM/SIGINT/double-signal
│   │   ├── crash_recovery.rs        # Fork-kill + state verification
│   │   ├── cache_corruption.rs      # SQLite/sled/tantivy/CAS corruption
│   │   ├── disk_failure.rs          # ENOSPC, permissions, passthrough mode
│   │   ├── resource_limits.rs       # fd exhaustion, memory limits
│   │   └── lock_poisoning.rs        # RwLock poison recovery
│   ├── failpoints/                  # NEW — failpoint-gated tests
│   │   ├── mod.rs
│   │   ├── origin_failures.rs       # Injected origin errors
│   │   ├── sync_interruption.rs     # Delta sync crash/resume
│   │   └── cas_failures.rs          # CAS write failures
│   ├── integration/                 # NEW — network integration (Docker)
│   │   ├── docker-compose.yml
│   │   ├── network_faults.rs        # Toxiproxy: latency, drops, bandwidth
│   │   └── origin_failover.rs       # Multi-origin failover integration
│   └── e2e/
│       └── e2e_players.rs           # Existing (unchanged)

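Running (the --test commands assume each directory under tests/ is registered as a Cargo test target, for example through a [[test]] entry with an explicit path):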

# Layer 1: Fast resilience tests (no Docker, no FUSE)
cargo test --lib --tests resilience

# Layer 1: Failpoint tests (sequential, feature-gated)
cargo test --features failpoints --test failpoints -- --test-threads 1

# Layer 2: Process-level tests (included in resilience/)
cargo test --test resilience

# Layer 3: Network integration (requires docker-compose up)
cargo test --test integration -- --ignored

# All layers
cargo nextest run --features failpoints

4.6 Integration Test Docker Setup

For Layer 3 network fault testing:

# tests/integration/docker-compose.yml
services:
  toxiproxy:
    image: ghcr.io/shopify/toxiproxy:2.9.0
    ports:
      - "8474:8474"          # Toxiproxy API
      - "20000-20010:20000-20010"  # Proxy ports

  minio:
    image: minio/minio
    command: server /data
    ports:
      - "9000:9000"
    environment:
      MINIO_ROOT_USER: test
      MINIO_ROOT_PASSWORD: testtest

  sftp:
    image: atmoz/sftp
    ports:
      - "2222:22"
    command: test:test:::music

Tests use noxious-client crate to configure Toxiproxy faults at runtime (latency injection, connection drops, bandwidth throttling).


5. Cross-Cutting Concerns

5.1 Security & Privacy

  • Tests run without root — no kernel modules, no privileged containers for Layer 1/2
  • Layer 3 Docker tests use ephemeral containers with test credentials only
  • No real music files or user data in tests — synthetic make_file_meta() fixtures
  • rlimit tests restore original limits after test (cleanup in all code paths)

5.2 Observability

  • Failpoint tests log injected faults via tracing — test failures include full trace context
  • Layer 2 (fork-kill) tests capture daemon stdout/stderr for failure diagnosis
  • Test coverage is tracked per resilience issue (coverage matrix in Section 8)

5.3 Scalability & Performance

  • Layer 1 tests: <10ms each, ~25 tests = <1s total
  • Layer 2 tests: ~2-5s each (process spawn + signal + verify), ~5 tests = <30s total
  • Layer 3 tests: ~5-10s each (Docker network), ~5 tests = <60s total
  • Full suite: <2 minutes including failpoint tests (sequential --test-threads 1)
  • Failpoint global state requires --test-threads 1 for failpoint tests; all other tests parallelize normally

5.4 Testing the Tests

  • Corruption tests self-validate: create known-good state → corrupt → verify detection
  • FaultyOrigin has mode assertions: assert_eq!(origin.call_count(), expected) to verify injection triggered
  • Failpoint tests verify both the error path AND the happy path (remove failpoint, retry, verify success)
  • Resource limit tests always restore original limits (even on panic — use scopeguard or Drop impl)
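The restore-on-panic requirement can be met with a Drop guard, sketched below. A process-wide atomic stands in for the real limit so the example stays dependency-free; in the actual tests the guard would wrap the rlimit calls instead. LimitGuard, lower_to(), and current_limit() are hypothetical names.

```rust
use std::panic;
use std::sync::atomic::{AtomicU64, Ordering};

// Stand-in for a process-wide limit such as RLIMIT_NOFILE.
static FD_LIMIT: AtomicU64 = AtomicU64::new(1024);

pub fn current_limit() -> u64 {
    FD_LIMIT.load(Ordering::SeqCst)
}

/// Lowers the limit on creation and restores the original value on drop.
pub struct LimitGuard {
    original: u64,
}

impl LimitGuard {
    pub fn lower_to(new_limit: u64) -> Self {
        let original = FD_LIMIT.swap(new_limit, Ordering::SeqCst);
        LimitGuard { original }
    }
}

impl Drop for LimitGuard {
    fn drop(&mut self) {
        // Runs on normal scope exit and during panic unwinding.
        FD_LIMIT.store(self.original, Ordering::SeqCst);
    }
}

/// Simulates a test body that panics while the limit is lowered.
/// Returns true if the panic was observed.
pub fn run_with_limit_then_panic() -> bool {
    let result = panic::catch_unwind(|| {
        let _guard = LimitGuard::lower_to(64);
        assert_eq!(current_limit(), 64);
        panic!("simulated test failure");
    });
    result.is_err()
}
```

The guard only helps when panics unwind; a binary built with panic = "abort" would skip the Drop call.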

6. Alternatives Considered

6.1 Jepsen / Full Chaos Engineering Framework

Rejected. Jepsen tests distributed consensus under network partitions. MusicFS is a single-daemon filesystem, so its failure modes are local (disk, signals, panics), not distributed. The 3-layer approach covers our actual failure surface with far less complexity.

6.2 proptest / Property-Based Testing

Deferred. Property-based testing (random input generation) is valuable for finding edge cases in path resolution, CDC chunking, and search queries. But it's orthogonal to resilience testing — it tests correctness under random input, not correctness under infrastructure failure. Can be added later without affecting this design.

6.3 loom (Concurrency Model Checker)

Deferred. loom exhaustively checks all possible thread interleavings for data races and deadlocks. It would be useful for the FUSE↔tokio deadlock issue (5.1) and RwLock poison issue (2.9). However, loom requires rewriting code to use loom::sync primitives and is very slow. Not practical for initial resilience coverage. Consider for Phase E hardening.

6.4 In-Process Failure Injection Without Failpoints

Rejected. Alternative: inject failures via trait-method overrides or runtime flags instead of the fail crate. This avoids a new dependency but requires modifying every function signature to accept an error injection parameter. Failpoints are cleaner — they're invisible in production (compiled out) and don't pollute the API surface.

6.5 Mock Framework (mockall)

Rejected for now. The codebase uses real implementations with TempDir isolation — this pattern is well-established across 43 test files. Introducing mockall would split the test codebase into two incompatible patterns. FaultyOrigin wrapper achieves the same result while staying consistent with existing patterns.


7. Implementation Plan

Phase 1: Test Infrastructure (Days 1-2.5)

| Task | Effort | Deliverable |
| --- | --- | --- |
| Create musicfs-test-utils crate | 1 day | FaultyOrigin, FaultyCasStore, centralized fixtures |
| Add fail crate, instrument 10 failpoints | 1 day | Failpoint macros in store.rs, reader.rs, delta.rs, health.rs, db.rs |
| Setup test directory structure | 0.5 day | tests/resilience/, tests/failpoints/, tests/integration/ |

Phase 2: Layer 1 Tests (Days 3-5.5)

| Test Group | Tests | Effort | Can Write Now? |
| --- | --- | --- | --- |
| Cache corruption (SQLite, sled, tantivy, CAS) | 4 | 0.5 day | Yes |
| RwLock poison recovery | 2 | 0.25 day | Yes |
| Health check timeout + parallel checks | 2 | 0.25 day | Yes |
| tantivy crash recovery | 2 | 0.25 day | Yes |
| fd exhaustion | 1 | 0.25 day | Yes |
| Disk space / ENOSPC | 2 | 0.25 day | Yes |
| Origin failover (FaultyOrigin) | 3 | 0.5 day | Yes |
| Panic hook + task supervisor | 3 | 0.5 day | Needs implementation |
| Shutdown orchestration | 3 | 0.5 day | Needs implementation |
| sd_notify mock socket | 1 | 0.25 day | Needs implementation |
| Passthrough mode | 1 | 0.25 day | Needs implementation |
| Systemd service file assertions | 1 | 0.1 day | Yes |
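The "health check timeout + parallel checks" group boils down to one assertion: a slow origin must not delay the verdict on the fast ones. The sketch below models that with plain threads and a shared deadline; the real health checker is presumably async, and check_all() plus the simulated delays are assumptions for illustration.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::{Duration, Instant};

/// Runs one simulated probe per origin in parallel. A probe that has not
/// reported back by the shared deadline is counted as unhealthy.
pub fn check_all(delays_ms: &[u64], timeout: Duration) -> Vec<bool> {
    let (tx, rx) = mpsc::channel();
    for (idx, &delay) in delays_ms.iter().enumerate() {
        let tx = tx.clone();
        thread::spawn(move || {
            thread::sleep(Duration::from_millis(delay)); // simulated probe latency
            let _ = tx.send(idx); // receiver may already be gone after the deadline
        });
    }
    drop(tx); // only worker threads hold senders now

    let deadline = Instant::now() + timeout;
    let mut healthy = vec![false; delays_ms.len()];
    loop {
        let remaining = deadline.saturating_duration_since(Instant::now());
        match rx.recv_timeout(remaining) {
            Ok(idx) => healthy[idx] = true,
            // Deadline reached, or every probe has already reported.
            Err(_) => break,
        }
    }
    healthy
}
```

A test can then pass two short delays and one long one, assert that only the slow origin is marked unhealthy, and assert that the call returned close to the timeout rather than after the slow probe finished.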

Phase 3: Layer 2 Tests (Days 6-7)

| Test | Effort | Requires |
| --- | --- | --- |
| SIGTERM triggers clean shutdown | 0.25 day | Signal handler implementation |
| SIGINT triggers clean shutdown | 0.1 day | Signal handler implementation |
| Double-signal forces immediate exit | 0.1 day | Signal handler implementation |
| Kill -9 + stale mount detection | 0.25 day | Stale mount detection implementation |
| 100 concurrent FUSE reads (deadlock) | 0.25 day | FUSE mount in test (Docker or privileged) |
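The spawn, kill, and reap skeleton shared by these tests might look like the sketch below. It is Unix-only and uses a sleep process as a stand-in for the daemon binary (an assumption for illustration). The standard library's Child::kill sends SIGKILL on Unix, so this covers only the kill -9 case; SIGTERM and SIGINT would go through the nix crate as proposed in Section 4.2.

```rust
use std::io;
use std::os::unix::process::ExitStatusExt;
use std::process::{Child, Command, Stdio};

/// Spawns a long-running child as a stand-in for the MusicFS daemon.
/// stdout/stderr are piped so a real test could capture them for diagnosis.
pub fn spawn_stand_in_daemon() -> io::Result<Child> {
    Command::new("sleep")
        .arg("30")
        .stdout(Stdio::piped())
        .stderr(Stdio::piped())
        .spawn()
}

/// Sends SIGKILL, reaps the child, and returns the terminating signal number.
pub fn kill_and_reap(mut child: Child) -> io::Result<Option<i32>> {
    child.kill()?;
    let status = child.wait()?;
    Ok(status.signal())
}
```

After the reap, a real crash-recovery test would restart the daemon against the same state directory and verify the recovery behavior, for example stale mount detection.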

Phase 4: Layer 3 Tests (Days 8-9)

| Task | Effort | Requires |
| --- | --- | --- |
| Docker Compose setup (Toxiproxy + MinIO + SFTP) | 0.5 day | Docker |
| S3 latency spike test | 0.25 day | S3 origin implementation |
| S3 connection drop + failover | 0.25 day | S3 origin implementation |
| SFTP connection drop + failover | 0.25 day | SFTP origin implementation |
| Origin recovery after partition heal | 0.25 day | Docker |

Rollout

  1. Phase 1 first — test infrastructure is prerequisite for everything else
  2. Phase 2 "write now" tests — 11 tests that can be written before resilience implementation; they document expected behavior as executable specs (currently failing)
  3. Phase 2 remaining — written alongside resilience implementation (test-first development)
  4. Phase 3 — after signal handling and shutdown are implemented
  5. Phase 4 — after S3/SFTP origins are implemented; deferred if origins remain stubs

8. Coverage Matrix

8.1 Issue → Test → Layer Mapping

| Issue | Description | Layer | Test Approach | Write Now? |
| --- | --- | --- | --- | --- |
| 2.1 | Signal handling | 2 | Fork daemon + send SIGTERM/SIGINT | |
| 2.2 | Panic hook | 1 | catch_unwind + log capture | |
| 2.3 | Shutdown orchestration | 1+2 | CancellationToken + ordered teardown | |
| 2.4 | Cache integrity on startup | 1 | Corrupt file bytes + reopen | |
| 2.5 | Interrupted sync | 1 | Failpoint delta-sync-after-batch | |
| 2.6 | Task supervisor | 1 | Spawn panicking task + verify restart | |
| 2.7 | FUSE unmount on crash | 2 | Fork + kill -9 + check /proc/mounts | |
| 2.8 | Disk space | 1 | Small max_size + oversized write | |
| 2.9 | RwLock poison | 1 | Panic in writer thread + verify read | |
| 2.10 | sd_notify | 1 | Mock Unix datagram socket | |
| 3.1 | Watchdog | 1 | Mock sd_notify + verify WATCHDOG=1 | |
| 3.5 | sled recovery | 1 | Corrupt sled files + reopen | |
| 3.7 | ExecStop stub | 1 | Assert service file contains fusermount | |
| 3.8 | FUSE read timeout | 1 | FaultyOrigin with TimeoutMs + verify EIO | |
| 4.2.1 | Health check timeout | 1 | FaultyOrigin with 30s hang + timer | |
| 4.2.2 | Parallel health checks | 1 | 3 origins (2 fast, 1 slow) + timer | |
| 4.2.3 | Offline mode | 1 | All origins fail + verify state machine | |
| 5.1 | FUSE↔tokio deadlock | 2 | 100 concurrent reads with timeout | |
| 5.2 | tantivy crash | 1 | Write + mem::forget + reopen | |
| 5.3 | fd exhaustion | 1 | rlimit NOFILE=64 + CAS operations | |
| 5.7 | CAS atomic write | 1 | Failpoint between write and index | |
| 6.3 | sled dies at runtime | 1 | Corrupt sled + verify EIO not panic | |
| 6.4 | CAS chunk corruption | 1 | Overwrite chunk file + verify auto-repair | |
| 6.6 | Passthrough mode | 1 | Read-only cache dir + verify origin read | |
| Network | Origin failover | 1+3 | FaultyOrigin + Toxiproxy (L1) | |
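As an example of how one matrix row becomes a test, the RwLock poison approach (issue 2.9: panic in a writer thread, then verify reads) can be sketched with the standard library alone. The Vec<u32> payload and the helper names are hypothetical; the real test would target the lock around the virtual tree.

```rust
use std::sync::{Arc, RwLock};
use std::thread;

/// Poisons the lock by panicking in a writer thread while it holds the guard.
pub fn poison(lock: &Arc<RwLock<Vec<u32>>>) {
    let writer = Arc::clone(lock);
    let _ = thread::spawn(move || {
        let mut guard = writer.write().unwrap();
        guard.push(1);
        panic!("writer panicked while holding the lock");
    })
    .join(); // join returns Err because the thread panicked; ignored here
}

/// Reads through a poisoned lock instead of propagating the poison as a panic.
pub fn read_recovering(lock: &RwLock<Vec<u32>>) -> Vec<u32> {
    match lock.read() {
        Ok(guard) => guard.clone(),
        // The data may reflect a partially applied write; whether it is safe
        // to keep using it is an application-level decision.
        Err(poisoned) => poisoned.into_inner().clone(),
    }
}
```

The test then asserts that the lock reports is_poisoned() and that read_recovering() still returns data rather than panicking.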

8.2 Summary

  • Total test cases: ~35
  • Can write now (before resilience implementation): 15
  • Need implementation first: 12
  • Need Docker (Layer 3 only): 5
  • Need FUSE mount (Layer 2): 3

9. Glossary / References

9.1 Libraries

| Library | Link | Purpose |
| --- | --- | --- |
| fail (TiKV failpoints) | github.com/tikv/fail-rs | Conditional fault injection |
| rlimit | docs.rs/rlimit | Resource limit manipulation |
| nix | docs.rs/nix | POSIX signal sending |
| wiremock | docs.rs/wiremock | HTTP mock server |
| assert_cmd | docs.rs/assert_cmd | CLI process testing |
| Toxiproxy | github.com/Shopify/toxiproxy | Network fault injection proxy |
| noxious-client | docs.rs/noxious-client | Async Toxiproxy Rust client |

9.2 References

| Document | Path |
| --- | --- |
| Resilience audit | resilience-fault-tolerance.md |
| Persistent state plan | persistent-state.md |
| Architecture | architecture.md |
| Requirements | requirements.md |

9.3 Glossary

| Term | Definition |
| --- | --- |
| Failpoint | A conditional injection point in production code, compiled out in release builds |
| FaultyOrigin | Test wrapper around Origin trait that injects configurable errors |
| Layer 1 | In-process tests (trait mocks, failpoints); fastest, no external deps |
| Layer 2 | Process-level tests (fork, signal, kill); tests daemon lifecycle |
| Layer 3 | Network-level tests (Toxiproxy, Docker); tests real protocol behavior |
| Passthrough mode | Operating mode where cache is bypassed; reads go directly to origin |