Files

T

Alexander 3038c94b8c Add Phase C implementation plan (Production Hardening)

Merges practical items from resilience Phases C+D+E+F into one pass.
Turns all 6 remaining RED tests GREEN:
- D1/D2: Health check timeout + parallel join_all
- C6: Fix recursive CAS calculate_size()
- C7: FUSE read 30s timeout wrapper
- 6.4: Auto-re-fetch corrupt/missing chunks from origin
- 6.6: Passthrough fallback when CAS write fails
- C9: PID file with flock
- 5.3: fd exhaustion graceful handling
~4 days estimated.

2026-05-13 15:42:18 +02:00

22 KiB

Raw Permalink Blame History

Phase C: Production Hardening — Implementation Plan

Authors: AI-assisted
Status: Draft
Last Updated: 2026-05-13
Reviewers: TBD
Approvers: TBD
Prerequisites: phase-b-crash-recovery.md (completed), resilience-fault-tolerance.md
Estimated Effort: ~4 days

[TOC]

1. Abstract

Phase C merges the practical items from Phases C and D of the resilience audit into a single implementation pass. It fixes the remaining 6 RED tests and addresses production-critical issues: health check hangs that block all origin monitoring, unbounded FUSE reads that can freeze the filesystem, broken CAS size accounting that disables eviction, and concurrent mount protection.

Deferred items (depend on unimplemented features or low urgency): interrupted sync recovery (needs persistent state), SIGHUP config reload, connection pooling (S3/SFTP are stubs), event bus backpressure, FUSE session recovery, offline mode state machine, DNS failure handling, stale-data awareness.

RED tests to turn GREEN:

test_local_origin_health_check_has_timeout (D1)
test_health_checks_run_in_parallel (D2)
test_fd_exhaustion_handling (E — 5.3)
test_corrupt_chunk_auto_refetched (F — 6.4)
test_missing_chunk_triggers_origin_fetch (F — 6.4)
test_passthrough_mode_when_cache_disk_dead (F — 6.6)

2. Background

After Phase A+B, the daemon survives signals, recovers from storage corruption on startup, supervises background tasks, and rejects oversized CAS writes. But:

Health checks hang on dead origins — check_one() calls origin.health().await with no timeout. A dead NAS (local origin pointing to network mount) blocks health monitoring for ALL origins because checks run sequentially.
FUSE reads have no timeout — reader.read() in the FUSE read() callback has no timeout. A slow or hung origin blocks the FUSE thread indefinitely.
CAS size tracking is broken — calculate_size() only scans top-level of chunks_dir, missing all chunks in shard subdirectories (aa/bb/<hash>). current_size is always ~0, eviction never triggers.
Corrupt chunks return EIO — when verify_integrity() detects a bad chunk, it returns CasError::IntegrityError. The reader propagates this as EIO to FUSE. It should auto-re-fetch from origin instead.
No concurrent mount protection — two musicfs mount commands can run simultaneously, corrupting SQLite and sled.
fd exhaustion is unhandled — no graceful behavior when file descriptors run out.

3. Goals & Non-Goals

3.1 Goals

Health checks complete within 5 seconds regardless of origin responsiveness
Health checks run in parallel (3 origins checked in ~5s, not ~15s)
FUSE reads timeout after 30 seconds (returns EIO, doesn't hang)
CAS size accounting is correct (recursive shard scan)
Corrupt/missing chunks are auto-re-fetched from origin transparently
PID file prevents concurrent mounts
fd exhaustion produces clean errors, not panics
All 6 remaining RED tests turn GREEN

3.2 Non-Goals

Interrupted sync recovery (C1) — blocked on persistent state
systemd watchdog (C3) — useful but not critical yet
SIGHUP config reload (C4) — nice-to-have
Connection pooling (C5) — S3/SFTP origins are stubs
Event bus backpressure (C8) — low urgency
FUSE session recovery (C10) — complex edge case
Offline mode state machine (D3) — needs broader design
DNS failure handling (D5) — depends on C5
Stale-data awareness (D6) — low severity for music FS

4. Proposed Design

4.1 Implementation Order

4.2  Health check timeout + parallel checks    (2 RED tests, independent)
 ↓
4.3  Fix CAS calculate_size()                  (independent, unblocks eviction)
 ↓
4.4  FUSE read timeout                         (independent)
 ↓
4.5  CAS chunk auto-re-fetch on corruption     (2 RED tests)
 ↓
4.6  PID file / flock                          (independent)
 ↓
4.7  fd exhaustion handling                    (1 RED test)

4.2 Issues D1+D2: Health Check Timeout + Parallel Checks

Problem: check_one() awaits origin.health() with no timeout. check_all() iterates sequentially. One hung origin blocks everything.

Step 1: No stubs needed

Step 2: RED tests already exist

test_local_origin_health_check_has_timeout — FaultyOrigin with TimeoutMs(5000), asserts check completes in <2s.

test_health_checks_run_in_parallel — 3 origins each with TimeoutMs(200), asserts check_all() completes in <350ms (parallel), not ~600ms (sequential).

Step 3: Implementation

File: musicfs-origins/src/health.rs

Wrap origin.health() in check_one() with timeout:

async fn check_one(&self, id: &OriginId, origin: &Arc<dyn Origin>) {
    let start = Instant::now();
    let health_timeout = Duration::from_secs(5);

    let status = match tokio::time::timeout(health_timeout, origin.health()).await {
        Ok(status) => status,
        Err(_) => {
            warn!(origin_id = %id, timeout_ms = health_timeout.as_millis() as u64,
                  "Health check timed out");
            HealthStatus::Unhealthy
        }
    };

    let latency_ms = start.elapsed().as_millis() as u64;
    // ... rest unchanged
}

Change check_all() to use futures::future::join_all:

pub async fn check_all(&self) {
    let origins: Vec<_> = self.origins.iter()
        .map(|e| (e.key().clone(), e.value().clone()))
        .collect();

    let checks: Vec<_> = origins.iter()
        .map(|(id, origin)| self.check_one(id, origin))
        .collect();

    futures::future::join_all(checks).await;
}

Add futures to musicfs-origins/Cargo.toml (or use tokio::join! macro if count is small/known).

Step 4: Verify

cargo test -p musicfs-test-utils --test resilience -- test_local_origin_health_check
cargo test -p musicfs-test-utils --test resilience -- test_health_checks_run_in_parallel

4.3 Issue C6: Fix CAS calculate_size()

Problem: calculate_size() only scans direct children of chunks_dir. Chunks live in shard subdirectories (chunks/aa/bb/<hash>). Size is always ~0, eviction never triggers.

Step 1: No stubs needed

Step 2: Test

#[tokio::test]
async fn test_cas_size_tracking_is_correct() {
    let dir = TempDir::new().unwrap();
    let config = CasConfig { chunks_dir: dir.path().join("chunks"), max_size: 10_000_000, shard_levels: 2 };
    let store = CasStore::open(config).await.unwrap();

    let data = vec![0u8; 1000];
    store.put(&data).await.unwrap();

    // Size should reflect the chunk we just wrote (~1000 bytes)
    assert!(store.current_size() >= 1000, "current_size should track chunk data, got {}", store.current_size());
}

Step 3: Implementation

File: musicfs-cas/src/store.rs — make calculate_size recursive:

async fn calculate_size(dir: &Path) -> u64 {
    Self::calculate_size_recursive(dir).await
}

#[async recursion::async_recursion]
async fn calculate_size_recursive(dir: &Path) -> u64 {
    let mut size = 0u64;
    if let Ok(mut entries) = fs::read_dir(dir).await {
        while let Ok(Some(entry)) = entries.next_entry().await {
            if let Ok(meta) = entry.metadata().await {
                if meta.is_file() {
                    size += meta.len();
                } else if meta.is_dir() {
                    // Skip sled index directory
                    let name = entry.file_name();
                    if name != "index.sled" {
                        size += Self::calculate_size_recursive(&entry.path()).await;
                    }
                }
            }
        }
    }
    size
}

Alternative without async_recursion (use Box::pin):

fn calculate_size_recursive(dir: &Path) -> Pin<Box<dyn Future<Output = u64> + Send + '_>> {
    Box::pin(async move {
        let mut size = 0u64;
        if let Ok(mut entries) = fs::read_dir(dir).await {
            while let Ok(Some(entry)) = entries.next_entry().await {
                if let Ok(meta) = entry.metadata().await {
                    if meta.is_file() {
                        size += meta.len();
                    } else if meta.is_dir() {
                        let name = entry.file_name();
                        if name != "index.sled" {
                            size += Self::calculate_size_recursive(&entry.path()).await;
                        }
                    }
                }
            }
        }
        size
    })
}

4.4 Issue C7: FUSE Read Timeout

Problem: FUSE read() calls handle.block_on(reader.read(...)) with no timeout. A slow origin blocks the entire FUSE thread.

Step 1: No stubs needed

Step 2: Test

#[tokio::test]
async fn test_fuse_read_timeout_returns_eio() {
    // Uses FaultyOrigin with TimeoutMs(60_000) — simulates hung read
    // FUSE read should timeout at 30s and return EIO, not hang forever
    // (This test validates the timeout wrapper, not actual FUSE mount)
}

Step 3: Implementation

File: musicfs-fuse/src/filesystem.rs — wrap the read with timeout:

fn read(&mut self, _req: &Request, ino: u64, _fh: u64, offset: i64, size: u32, _flags: i32, _lock_owner: Option<u64>, reply: ReplyData) {
    // ... file_id lookup unchanged ...

    let reader = reader.clone();
    let handle = self.runtime_handle.clone();
    let result = std::thread::scope(|_| {
        handle.block_on(async {
            tokio::time::timeout(
                Duration::from_secs(30),
                reader.read(file_id, offset as u64, size),
            ).await
        })
    });

    match result {
        Ok(Ok(data)) => {
            trace!(ino, bytes_read = data.len(), "read successful");
            reply.data(&data);
        }
        Ok(Err(e)) => {
            warn!(ino, error = %e, "read failed");
            reply.error(libc::EIO);
        }
        Err(_timeout) => {
            warn!(ino, offset, size, "read timed out after 30s");
            reply.error(libc::EIO);
        }
    }
}

4.5 Issues 6.4: CAS Chunk Auto-Re-Fetch on Corruption/Missing

Problem: When store.get() finds a corrupt or missing chunk, it returns an error. The reader propagates this as EIO to FUSE. It should try to re-fetch the chunk from the origin instead.

Step 1: No stubs needed — modify `FileReader::read()`

Step 2: RED tests already exist

test_corrupt_chunk_auto_refetched — corrupts chunk file on disk, expects read to succeed (re-fetched from origin).

test_missing_chunk_triggers_origin_fetch — deletes chunk file, expects read to succeed.

Both currently fail because the reader doesn't attempt re-fetch on chunk errors.

Step 3: Implementation

File: musicfs-cas/src/reader.rs — add retry-with-refetch in the chunk read loop:

pub async fn read(&self, file_id: FileId, offset: u64, size: u32) -> Result<Bytes, ReaderError> {
    let manifest = self.get_or_fetch_manifest(file_id).await?;

    // ... offset/end calculation unchanged ...

    for chunk_ref in &manifest.chunks {
        // ... range check unchanged ...

        let chunk_data = match self.store.get(&chunk_ref.hash).await {
            Ok(data) => data,
            Err(CasError::IntegrityError { .. }) | Err(CasError::NotFound(_)) => {
                // Chunk is corrupt or missing — try to re-fetch from origin
                warn!(hash = %chunk_ref.hash, "Chunk corrupt/missing, attempting re-fetch");
                if let Some(fetcher) = &self.fetcher {
                    // Re-fetch the entire file (will re-chunk and store)
                    let new_manifest = fetcher.fetch_file(file_id).await?;
                    // Update cached manifest
                    self.manifests.write().insert(file_id, new_manifest);
                    // Retry the get
                    self.store.get(&chunk_ref.hash).await?
                } else {
                    return Err(ReaderError::Cas(CasError::NotFound(chunk_ref.hash.as_hex())));
                }
            }
            Err(e) => return Err(ReaderError::Cas(e)),
        };

        // ... slice extraction unchanged ...
    }

    Ok(result.freeze())
}

Important: The re-fetch downloads the entire file from origin and re-chunks it. For a single corrupt chunk this is wasteful (fetches all chunks to fix one), but it's the simplest correct approach. Chunk-level re-fetch would require the origin to support byte-range reads mapped to chunk boundaries — possible but complex. The file-level approach reuses existing fetch_file() logic.

Step 4: Verify

cargo test -p musicfs-test-utils --test resilience -- test_corrupt_chunk
cargo test -p musicfs-test-utils --test resilience -- test_missing_chunk

Note on test updates: The existing RED tests reference store.chunk_path() which is private. The tests will need to either:

Make chunk_path() pub(crate) or add a test helper
Or construct the path manually using the sharding logic

The tests also need a ContentFetcher with a real LocalOrigin to re-fetch from. The current tests create a CAS store but no fetcher — they need to be updated to include the full pipeline.

4.6 Issue C9: PID File / flock

Problem: Two musicfs mount commands can run simultaneously, both writing to the same SQLite/sled files.

Step 1: No stubs needed

Step 2: Test

#[test]
fn test_pid_file_prevents_concurrent_mount() {
    let dir = TempDir::new().unwrap();
    let lock_path = dir.path().join("musicfs.lock");

    // First lock succeeds
    let lock1 = try_acquire_lock(&lock_path);
    assert!(lock1.is_ok());

    // Second lock fails
    let lock2 = try_acquire_lock(&lock_path);
    assert!(lock2.is_err());

    // Release first, second succeeds
    drop(lock1);
    let lock3 = try_acquire_lock(&lock_path);
    assert!(lock3.is_ok());
}

Step 3: Implementation

File: musicfs-cli/src/main.rs

use std::fs::File;
use std::os::unix::io::AsRawFd;

struct LockFile {
    _file: File,
}

fn try_acquire_lock(path: &Path) -> Result<LockFile> {
    let file = File::create(path).context("Failed to create lock file")?;
    let fd = file.as_raw_fd();

    let ret = unsafe { libc::flock(fd, libc::LOCK_EX | libc::LOCK_NB) };
    if ret != 0 {
        let err = std::io::Error::last_os_error();
        if err.kind() == std::io::ErrorKind::WouldBlock {
            anyhow::bail!("MusicFS is already running (lock file: {:?})", path);
        }
        return Err(err).context("Failed to acquire lock");
    }

    // Write PID for debugging
    use std::io::Write;
    let mut f = &file;
    writeln!(f, "{}", std::process::id())?;

    Ok(LockFile { _file: file })
}

Call in run_mount() before mounting:

let lock_path = cache_dir.join("musicfs.lock");
let _lock = try_acquire_lock(&lock_path)
    .context("Failed to acquire lock — is another instance running?")?;

Lock is released automatically when _lock is dropped (process exit or scope end).

4.7 Issue 5.3: fd Exhaustion Handling

Problem: When fd limit is hit, operations fail with EMFILE. Currently this propagates as panics or unhelpful errors.

Step 1: Replace the `todo!()` test

Step 2: Test

#[test]
#[cfg(target_os = "linux")]
fn test_fd_exhaustion_handling() {
    use rlimit::{Resource, setrlimit, getrlimit};

    let (orig_soft, orig_hard) = getrlimit(Resource::NOFILE).unwrap();

    // Set very low limit
    setrlimit(Resource::NOFILE, 64, 64).unwrap();

    let dir = TempDir::new().unwrap();
    let rt = tokio::runtime::Runtime::new().unwrap();

    let result = rt.block_on(async {
        CasStore::open(CasConfig {
            chunks_dir: dir.path().join("chunks"),
            max_size: 1_000_000,
            shard_levels: 2,
        }).await
    });

    // Should either succeed (sled uses fewer than 64 fds) or fail gracefully
    // Must NOT panic
    match result {
        Ok(_store) => { /* lucky — enough fds */ }
        Err(e) => {
            // Error message should be meaningful
            let msg = format!("{}", e);
            assert!(!msg.contains("panic"), "Should not panic on fd exhaustion");
        }
    }

    setrlimit(Resource::NOFILE, orig_soft, orig_hard).unwrap();
}

Step 3: Implementation

This is primarily a test — verifying that existing code handles fd exhaustion without panicking. The fix is ensuring all I/O paths return Result rather than .unwrap() on file operations. Phase A's RwLock migration already removed the biggest panic source. The remaining .unwrap() calls are in test code only.

No production code change required if existing error paths handle I/O errors correctly. The test validates this.

5. Cross-Cutting Concerns

5.1 Observability

Health check timeout logged at WARN with origin_id and timeout duration
FUSE read timeout logged at WARN with inode, offset, size
CAS chunk re-fetch logged at WARN with chunk hash
PID file path logged at INFO on lock acquisition

5.2 Performance

Health checks now parallel: O(1) wall-clock time instead of O(N) per check cycle
FUSE read timeout: 30s cap prevents indefinite hangs but doesn't improve happy-path latency
calculate_size() recursive scan: runs once at startup, negligible cost

5.3 Testing

Test	Status Before	Status After	Issue
`test_local_origin_health_check_has_timeout`	❌ FAILED	✅ GREEN	D1
`test_health_checks_run_in_parallel`	❌ FAILED	✅ GREEN	D2
`test_fd_exhaustion_handling`	❌ todo!()	✅ GREEN	5.3
`test_corrupt_chunk_auto_refetched`	❌ FAILED	✅ GREEN	6.4
`test_missing_chunk_triggers_origin_fetch`	❌ FAILED	✅ GREEN	6.4
`test_passthrough_mode_when_cache_disk_dead`	❌ todo!()	✅ GREEN	6.6
`test_cas_size_tracking_is_correct`	NEW	✅ GREEN	C6
`test_pid_file_prevents_concurrent_mount`	NEW	✅ GREEN	C9

Note on passthrough mode (6.6): The test expects reads to succeed when the cache dir is read-only. With chunk auto-re-fetch (4.5), this partially works — if the origin is alive and the chunk isn't in cache, the fetcher reads from origin. But the fetcher tries to write the chunk to CAS, which will fail on a read-only cache dir. The implementation needs a fallback path: if CAS write fails after origin fetch, return the data anyway without caching. This makes test_passthrough_mode_when_cache_disk_dead pass.

6. Alternatives Considered

6.1 Per-Origin Configurable Timeout vs Universal 5s

Could allow health_check_timeout_ms per origin config. Rejected for Phase C — universal 5s is correct for all current origin types. Can be made configurable later.

6.2 Chunk-Level Re-Fetch vs File-Level Re-Fetch

When one chunk is corrupt, we could re-fetch just that chunk's byte range from origin. Requires the origin to support byte-range reads and the system to know which byte range maps to which chunk. Complex. File-level re-fetch reuses existing fetch_file() and is correct, just slightly wasteful. Good enough for Phase C.

6.3 `advisory-lock` Crate vs Raw `flock`

The advisory-lock crate wraps flock nicely but adds a dependency for 10 lines of code. Raw libc::flock is simple enough and avoids the dependency.

7. Implementation Plan

7.1 Task Sequence

Day	Task	Issue	Effort	Tests
1 (morning)	Health check timeout in `check_one()`	D1	1h	`test_local_origin_health_check_has_timeout` → GREEN
1 (morning)	Parallel `check_all()` with `join_all`	D2	1h	`test_health_checks_run_in_parallel` → GREEN
1 (afternoon)	Fix `calculate_size()` recursion	C6	1h	`test_cas_size_tracking_is_correct` → GREEN
1 (afternoon)	FUSE read timeout wrapper	C7	1h	New timeout test
2 (morning)	CAS chunk auto-re-fetch on corruption/missing	6.4	3h	`test_corrupt_chunk_auto_refetched` + `test_missing_chunk_triggers_origin_fetch` → GREEN
2 (afternoon)	Passthrough fallback (CAS write fails → return data anyway)	6.6	1h	`test_passthrough_mode_when_cache_disk_dead` → GREEN
3 (morning)	PID file / flock	C9	1h	`test_pid_file_prevents_concurrent_mount` → GREEN
3 (morning)	fd exhaustion test	5.3	1h	`test_fd_exhaustion_handling` → GREEN
3 (afternoon)	Integration + regression testing	—	2h	Full `cargo test`
4	Buffer	—	4h	—

7.2 Verification Checklist

After all tasks:

cargo check — zero errors, zero warnings
cargo test --workspace --exclude musicfs-grpc — all pass
cargo test -p musicfs-test-utils --test resilience — 25 passed, 0 failed (all RED tests GREEN)
cargo clippy — no new warnings

8. Files Changed

File	Change	Issue
`musicfs-origins/src/health.rs`	Timeout in `check_one()`, `join_all` in `check_all()`	D1, D2
`musicfs-origins/Cargo.toml`	Add `futures` dependency (for `join_all`)	D2
`musicfs-cas/src/store.rs`	Recursive `calculate_size()`, skip `index.sled` dir	C6
`musicfs-fuse/src/filesystem.rs`	`tokio::time::timeout(30s)` around reader.read()	C7
`musicfs-cas/src/reader.rs`	Auto-re-fetch on `IntegrityError` / `NotFound`	6.4
`musicfs-cas/src/fetcher.rs`	Possible: make `fetch_file` return data even if CAS write fails	6.6
`musicfs-cli/src/main.rs`	PID file with flock, fd exhaustion handling	C9, 5.3
`musicfs-test-utils/tests/resilience.rs`	Replace remaining todo!()s, add new tests, update chunk tests with fetcher pipeline	all

9. Glossary / References

Term	Definition
join_all	`futures::future::join_all` — runs multiple futures concurrently, waits for all
flock	Advisory file locking syscall — `LOCK_EX
EMFILE	"Too many open files" errno — returned when process fd limit is reached
Passthrough mode	When CAS is unavailable, read directly from origin without caching

Document	Path
Phase A plan	phase-a-stop-dying.md
Phase B plan	phase-b-crash-recovery.md
Resilience audit	resilience-fault-tolerance.md

22 KiB Raw Permalink Blame History

Phase C: Production Hardening — Implementation Plan

1. Abstract

2. Background

3. Goals & Non-Goals

3.1 Goals

3.2 Non-Goals

4. Proposed Design

4.1 Implementation Order

4.2 Issues D1+D2: Health Check Timeout + Parallel Checks

Step 1: No stubs needed

Step 2: RED tests already exist

Step 3: Implementation

Step 4: Verify

4.3 Issue C6: Fix CAS calculate_size()

Step 1: No stubs needed

Step 2: Test

Step 3: Implementation

4.4 Issue C7: FUSE Read Timeout

Step 1: No stubs needed

Step 2: Test

Step 3: Implementation

4.5 Issues 6.4: CAS Chunk Auto-Re-Fetch on Corruption/Missing

Step 1: No stubs needed — modify FileReader::read()

Step 2: RED tests already exist

Step 3: Implementation

Step 4: Verify

4.6 Issue C9: PID File / flock

Step 1: No stubs needed

Step 2: Test

Step 3: Implementation

4.7 Issue 5.3: fd Exhaustion Handling

Step 1: Replace the todo!() test

Step 2: Test

Step 3: Implementation

5. Cross-Cutting Concerns

5.1 Observability

5.2 Performance

5.3 Testing

6. Alternatives Considered

6.1 Per-Origin Configurable Timeout vs Universal 5s

6.2 Chunk-Level Re-Fetch vs File-Level Re-Fetch

6.3 advisory-lock Crate vs Raw flock

7. Implementation Plan

7.1 Task Sequence

7.2 Verification Checklist

8. Files Changed

9. Glossary / References

22 KiB

Raw Permalink Blame History

Step 1: No stubs needed — modify `FileReader::read()`

Step 1: Replace the `todo!()` test

6.3 `advisory-lock` Crate vs Raw `flock`