BlueDoc covering 6 critical resilience fixes with TDD flow: - 2.9: RwLock → parking_lot (poison-free locks) - 2.2: Panic hook with tracing integration - 3.7+2.7: systemd ExecStopPost + stale mount cleanup - 2.10: sd_notify READY/STOPPING - 2.1: Signal handling via spawn_mount2 + tokio signals Each with: stubs → RED tests → implementation → GREEN verify. ~5 days estimated, exact files and code patterns specified.
19 KiB
Phase A: Stop Dying — Implementation Plan
Authors: AI-assisted
Status: Draft
Last Updated: 2026-05-13
Reviewers: TBD
Approvers: TBD
Prerequisites: resilience-fault-tolerance.md, resilience-testing.md
Estimated Effort: ~5 days
[TOC]
1. Abstract
Implement the 6 most critical resilience fixes (issues 2.1, 2.2, 2.7, 2.9, 2.10, 3.7 from the resilience audit) that prevent MusicFS from dying on common operational events: signals, panics, lock poisoning, and systemd lifecycle.
Issues 2.3 (shutdown orchestration), 2.4 (cache integrity), 2.5 (sync recovery), 2.6 (task supervisor), 2.8 (disk space) are deferred to Phase B — they depend on Phase A infrastructure or on the persistent state work.
Development flow (TDD, per-issue):
- Create stubs so the codebase compiles
- Write RED tests that express the expected behavior
- Implement the fix
- Verify tests turn GREEN
- Run full test suite — no regressions
2. Background
MusicFS currently dies on:
- Any signal (SIGTERM, SIGINT) — instant death, no cleanup
- Any panic in a writer thread — RwLock poisons, all FUSE ops crash
- systemd lifecycle —
Type=notifybut nosd_notify, ExecStop is a stub - Crash leaves stale FUSE mount — users must manually
fusermount -u
The resilience test crate and RED tests are already in place. This plan implements the fixes to turn them GREEN.
3. Goals & Non-Goals
3.1 Goals
- Signal handler catches SIGTERM/SIGINT and initiates clean exit
- Panics are logged with full context before process terminates
- RwLock poisoning cannot cascade to kill FUSE operations
- systemd integration works (
sd_notify READY=1,ExecStopPost) - Stale FUSE mounts are detected and cleaned on startup
- All existing 162 tests continue to pass
- All Phase A RED tests turn GREEN
3.2 Non-Goals
- Graceful shutdown orchestration with ordered teardown (Phase B — needs CancellationToken plumbing through all components)
- Task supervisor for background task restart (Phase B)
- Cache integrity checks on startup (Phase B — needs persistent state)
- Disk space monitoring (Phase B)
- Interrupted sync recovery (Phase B — needs persistent state)
4. Proposed Design
4.1 Implementation Order
Dependencies determine the order. Each issue is independent except where noted.
4.2 RwLock poison fix (no deps, instant win, unblocks safety)
↓
4.3 Panic hook (no deps, complements RwLock fix)
↓
4.4 systemd ExecStopPost (no deps, config-only change)
↓
4.5 sd_notify integration (no deps, new crate dependency)
↓
4.6 Signal handling (depends on: FUSE mount change to spawn_mount2)
↓
4.7 Stale mount detection (depends on: signal handling for clean test)
4.2 Issue 2.9: RwLock Poison Fix
Approach: Replace std::sync::RwLock with parking_lot::RwLock in all production paths. parking_lot never poisons — a panic in a writer releases the lock and subsequent readers see the pre-panic state.
Why parking_lot over poison recovery: The codebase already uses parking_lot in prefetch.rs and index.rs. Using it everywhere is consistent. The alternative (.unwrap_or_else(|p| p.into_inner())) is verbose and error-prone — one missed call re-introduces the bug.
Step 1: Stubs (compile)
None needed — parking_lot::RwLock is a drop-in replacement (same API, no PoisonError).
Step 2: RED tests
Already exist in tests/resilience.rs:
test_poisoned_tree_lock_returns_eio_not_panic— currently passes (demonstrates the problem)test_parking_lot_rwlock_survives_panic— currently passes (proves the fix works)
Additional test to add: verify FUSE filesystem survives a writer panic on the tree lock.
Step 3: Implementation
Files to change:
| File | Change |
|---|---|
musicfs-fuse/src/filesystem.rs |
use std::sync::RwLock → use parking_lot::RwLock; remove all .unwrap() on lock calls (parking_lot returns guard directly, not Result) |
musicfs-cas/src/reader.rs |
Same change for manifests: RwLock<HashMap<...>> |
musicfs-cas/src/fetcher.rs |
Same change for origins and file_meta locks |
musicfs-origins/src/registry.rs |
Same change for origins and watch_handles locks |
musicfs-cache/src/eviction.rs |
Same change for access_times and hash_to_time locks |
musicfs-core/src/metrics.rs |
Same change for histogram locks |
musicfs-cache/src/tree.rs |
Same change for last_refresh lock |
Pattern: In each file:
// BEFORE
use std::sync::RwLock;
let guard = self.tree.read().unwrap();
// AFTER
use parking_lot::RwLock;
let guard = self.tree.read(); // No unwrap needed
For the MusicFs struct in filesystem.rs, the tree field is Arc<RwLock<VirtualTree>> — this is passed in from main.rs. Change main.rs to use parking_lot::RwLock there too.
Step 4: Verify
cargo test # All 162+ tests pass
cargo test -p musicfs-test-utils # Resilience tests pass
cargo check # No warnings
4.3 Issue 2.2: Panic Hook
Approach: Install a custom panic hook at daemon startup that logs the panic with tracing::error! before the default behavior (abort/unwind). This ensures panics are captured in log files and journald.
Step 1: Stubs
Add to musicfs-core/src/lib.rs:
pub fn install_panic_hook() {
// stub — will be implemented
}
Step 2: RED tests
Write in tests/resilience.rs:
#[test]
fn test_panic_hook_logs_to_tracing() {
// Install hook with a test tracing subscriber
// Trigger panic via catch_unwind
// Verify error! log contains panic message + thread name
}
Step 3: Implementation
File: musicfs-core/src/lib.rs (or new musicfs-core/src/panic.rs)
pub fn install_panic_hook() {
let default_hook = std::panic::take_hook();
std::panic::set_hook(Box::new(move |info| {
let thread = std::thread::current();
let thread_name = thread.name().unwrap_or("<unnamed>");
let message = if let Some(s) = info.payload().downcast_ref::<&str>() {
s.to_string()
} else if let Some(s) = info.payload().downcast_ref::<String>() {
s.clone()
} else {
"unknown panic".to_string()
};
let location = info.location().map(|l| format!("{}:{}:{}", l.file(), l.line(), l.column()))
.unwrap_or_else(|| "unknown location".to_string());
tracing::error!(
thread = thread_name,
location = %location,
"PANIC: {}",
message
);
default_hook(info);
}));
}
Call site: musicfs-cli/src/main.rs, at the very top of main():
fn main() -> Result<()> {
musicfs_core::install_panic_hook();
let cli = Cli::parse();
// ...
}
Step 4: Verify
cargo test -p musicfs-core # Panic hook unit tests
cargo test -p musicfs-test-utils # Resilience tests
4.4 Issue 3.7 + 2.7: systemd Service Fix + FUSE Cleanup
Approach: Fix the systemd service file and add stale mount detection on startup.
Step 1: No stubs needed (config change)
Step 2: RED tests
Already exists: test_systemd_service_has_execstoppost — currently fails because service file lacks ExecStopPost.
Add test for stale mount detection:
#[test]
fn test_stale_mount_check_function_exists() {
// Verify the function signature exists
// (actual mount test needs privileged environment)
}
Step 3: Implementation
File: dist/musicfs.service
ExecStop=/usr/bin/musicfs shutdown
+ExecStopPost=/usr/bin/fusermount -uz %h/music || true
Restart=on-failure
Note: fusermount -uz is "lazy unmount" — always succeeds even if mount is busy. The || true prevents systemd from treating cleanup failure as a service failure.
File: musicfs-cli/src/main.rs — add stale mount check before mounting:
fn check_stale_mount(mountpoint: &Path) -> Result<()> {
// Check /proc/mounts for existing mount at this path
if let Ok(mounts) = std::fs::read_to_string("/proc/mounts") {
for line in mounts.lines() {
if line.contains(&mountpoint.to_string_lossy().as_ref()) && line.contains("fuse") {
warn!("Stale FUSE mount detected at {:?}, attempting cleanup", mountpoint);
let status = std::process::Command::new("fusermount")
.args(["-uz", &mountpoint.to_string_lossy()])
.status();
match status {
Ok(s) if s.success() => info!("Stale mount cleaned up"),
Ok(s) => warn!("fusermount exited with: {}", s),
Err(e) => warn!("Failed to run fusermount: {}", e),
}
}
}
}
Ok(())
}
Also fix the test_systemd_service_has_execstoppost test path — currently points to wrong location.
Step 4: Verify
cargo test -p musicfs-test-utils -- test_systemd # Service file test
4.5 Issue 2.10: sd_notify Integration
Approach: Add sd-notify crate, call READY=1 after mount, STOPPING on shutdown.
Step 1: Stubs
Add dependency to musicfs-cli/Cargo.toml:
sd-notify = "0.4"
Step 2: RED tests
Write test that mocks the notify socket:
#[test]
fn test_sd_notify_ready_sent() {
// Create Unix datagram socket at $NOTIFY_SOCKET
// Call sd_notify::notify(READY=1)
// Verify message received on socket
}
Step 3: Implementation
File: musicfs-cli/src/main.rs
After fs.mount() succeeds (or more precisely, after spawn_mount2 — see 4.6):
// Notify systemd we're ready
if let Err(e) = sd_notify::notify(false, &[sd_notify::NotifyState::Ready]) {
debug!("sd_notify not available (not running under systemd): {}", e);
}
On shutdown path:
let _ = sd_notify::notify(false, &[sd_notify::NotifyState::Stopping]);
Step 4: Verify
cargo test -p musicfs-test-utils -- test_sd_notify
cargo build -p musicfs-cli # Verify it compiles with new dep
4.6 Issue 2.1: Signal Handling
Approach: Switch from fuser::mount2 (blocking) to fuser::spawn_mount2 (background), then listen for signals on the main thread.
This is the most complex change in Phase A. It restructures the daemon's main loop.
Step 1: Stubs
Change MusicFs::mount() signature to return a session handle:
// BEFORE
pub fn mount(self, mountpoint: &Path) -> Result<()> {
fuser::mount2(self, mountpoint, &options)?;
Ok(())
}
// AFTER (stub — returns BackgroundSession)
pub fn spawn_mount(self, mountpoint: &Path) -> Result<fuser::BackgroundSession> {
let session = fuser::spawn_mount2(self, mountpoint, &options)?;
Ok(session)
}
Keep old mount() temporarily for compatibility.
Step 2: RED tests
Write in tests/resilience.rs:
#[tokio::test]
async fn test_sigterm_triggers_shutdown() {
// Spawn daemon as child process
// Wait for mount
// Send SIGTERM
// Verify clean exit within 10s
// Verify mountpoint is unmounted
}
This test requires the signal handler to exist. It will be RED until implementation.
Step 3: Implementation
File: musicfs-cli/src/main.rs — rewrite run_mount():
fn run_mount(mountpoint: PathBuf, origin_path: Option<PathBuf>, cache_dir: Option<PathBuf>) -> Result<()> {
let origin_path = origin_path.context("--origin is required")?;
let runtime = tokio::runtime::Runtime::new()?;
let handle = runtime.handle().clone();
let (tree, reader) = runtime.block_on(async {
// ... existing setup code (unchanged) ...
Ok::<_, anyhow::Error>((tree, reader))
})?;
// Check for stale mount before mounting
check_stale_mount(&mountpoint)?;
let fs = MusicFs::with_reader(tree, reader, handle.clone());
info!("Mounting filesystem at {:?}", mountpoint);
// spawn_mount2 returns immediately — FUSE runs in background
let session = fs.spawn_mount(&mountpoint)
.context("Failed to mount filesystem")?;
// Notify systemd
let _ = sd_notify::notify(false, &[sd_notify::NotifyState::Ready]);
info!("MusicFS ready, PID {}", std::process::id());
// Block on signal
runtime.block_on(async {
let mut sigterm = tokio::signal::unix::signal(
tokio::signal::unix::SignalKind::terminate()
)?;
let mut sigint = tokio::signal::unix::signal(
tokio::signal::unix::SignalKind::interrupt()
)?;
tokio::select! {
_ = sigterm.recv() => {
info!("Received SIGTERM, shutting down");
}
_ = sigint.recv() => {
info!("Received SIGINT, shutting down");
}
}
Ok::<_, anyhow::Error>(())
})?;
// Shutdown sequence
let _ = sd_notify::notify(false, &[sd_notify::NotifyState::Stopping]);
info!("Unmounting filesystem");
drop(session); // BackgroundSession::drop() calls unmount
info!("Shutdown complete");
Ok(())
}
File: musicfs-fuse/src/filesystem.rs — add spawn_mount():
pub fn spawn_mount(self, mountpoint: &Path) -> Result<fuser::BackgroundSession> {
info!("Mounting MusicFS at {:?}", mountpoint);
let options = vec![
fuser::MountOption::RO,
fuser::MountOption::FSName("musicfs".to_string()),
fuser::MountOption::AutoUnmount,
fuser::MountOption::AllowOther,
];
let session = fuser::spawn_mount2(self, mountpoint, &options)
.map_err(musicfs_core::Error::Io)?;
Ok(session)
}
Step 4: Verify
cargo build -p musicfs-cli
cargo test -p musicfs-test-utils -- test_sigterm # Process-level test
cargo test # No regressions
5. Cross-Cutting Concerns
5.1 Security & Privacy
- No new attack surface — changes are internal lifecycle management
- Panic hook does NOT log sensitive data (only panic message, thread name, location)
sd_notifyuses existing systemd socket — no new IPC
5.2 Observability
- Panic hook ensures all panics are captured in logs/journald
- Signal handling logs which signal triggered shutdown
- sd_notify gives systemd accurate service state
- Stale mount detection logs cleanup attempts
5.3 Testing
All changes follow the TDD flow:
- Stubs compile
- RED tests document expected behavior
- Implementation turns tests GREEN
- Full suite passes (no regressions)
6. Alternatives Considered
6.1 Poison Recovery Instead of parking_lot
Alternative: Keep std::sync::RwLock, add .unwrap_or_else(|p| p.into_inner()) to every lock call.
Rejected: 30+ call sites to change, easy to miss one, and the pattern is verbose. parking_lot is already a dependency and is strictly better for this use case (faster, no poison, correct API).
6.2 Keep mount2 (blocking) with Signal Thread
Alternative: Keep fuser::mount2, spawn a separate thread for signal handling, use a channel to communicate shutdown.
Rejected: mount2 consumes self and blocks — there's no clean way to interrupt it from another thread. spawn_mount2 is the canonical solution from the fuser crate.
6.3 Defer sd_notify Until Full Shutdown Orchestration
Alternative: Implement sd_notify only after CancellationToken + graceful shutdown are in place.
Rejected: sd_notify READY=1 is critical now — without it, Type=notify in the service file means systemd will timeout and kill the daemon on every start. The shutdown STOPPING notification is a bonus but not required for Phase A.
7. Implementation Plan
7.1 Task Sequence
| Day | Task | Issue | Effort | Test Approach |
|---|---|---|---|---|
| 1 (morning) | RwLock → parking_lot migration | 2.9 | 2h | Existing GREEN test validates; verify no .unwrap() on locks |
| 1 (afternoon) | Panic hook | 2.2 | 2h | New test: panic → verify tracing output |
| 2 (morning) | systemd ExecStopPost + stale mount check | 3.7 + 2.7 | 2h | Existing RED test → GREEN; new stale mount test |
| 2 (afternoon) | sd_notify integration | 2.10 | 2h | New test: mock socket → verify READY=1 |
| 3 | Signal handling (spawn_mount2 + signal loop) | 2.1 | 4h | Fork daemon → send SIGTERM → verify exit |
| 4 | Integration + regression testing | — | 4h | Full cargo test, manual FUSE mount test |
| 5 | Buffer for issues found during integration | — | 4h | — |
7.2 Verification Checklist
After all tasks complete:
cargo check— zero errors, zero warningscargo test— all 162+ existing tests passcargo test -p musicfs-test-utils— all resilience tests passcargo clippy— no new warningsgrep -r '\.read()\.unwrap()\|\.write()\.unwrap()' crates/— zero hits in production code (test code is OK)dist/musicfs.servicecontainsExecStopPost- Manual test:
musicfs mount, thenkill -TERM <pid>, verify clean exit + mount gone - Manual test:
kill -9 <pid>, thenmusicfs mountagain — no "already mounted" error
8. Files Changed
| File | Change | Issue |
|---|---|---|
musicfs-fuse/src/filesystem.rs |
std::sync::RwLock → parking_lot::RwLock; add spawn_mount() |
2.9, 2.1 |
musicfs-cas/src/reader.rs |
std::sync::RwLock → parking_lot::RwLock |
2.9 |
musicfs-cas/src/fetcher.rs |
std::sync::RwLock → parking_lot::RwLock |
2.9 |
musicfs-origins/src/registry.rs |
std::sync::RwLock → parking_lot::RwLock |
2.9 |
musicfs-cache/src/eviction.rs |
std::sync::RwLock → parking_lot::RwLock |
2.9 |
musicfs-cache/src/tree.rs |
std::sync::RwLock → parking_lot::RwLock |
2.9 |
musicfs-core/src/metrics.rs |
std::sync::RwLock → parking_lot::RwLock |
2.9 |
musicfs-core/src/lib.rs |
Add install_panic_hook() |
2.2 |
musicfs-cli/src/main.rs |
Panic hook, signal handler, spawn_mount2, sd_notify, stale mount check | 2.1, 2.2, 2.7, 2.10 |
musicfs-cli/Cargo.toml |
Add sd-notify, tokio-util deps |
2.10, 2.1 |
dist/musicfs.service |
Add ExecStopPost, fix ExecStop |
3.7 |
tests/resilience.rs |
Update/add tests for signal, panic hook, sd_notify | all |
9. Glossary / References
| Term | Definition |
|---|---|
| parking_lot | Fast, poison-free lock implementation. Already a project dependency. |
| spawn_mount2 | fuser API that mounts FUSE in a background thread, returning a BackgroundSession handle |
| sd_notify | systemd notification protocol. READY=1 signals service started, STOPPING signals shutdown. |
| BackgroundSession | Handle returned by spawn_mount2. Dropping it unmounts the filesystem. |
| Document | Path |
|---|---|
| Resilience audit | resilience-fault-tolerance.md |
| Resilience testing | resilience-testing.md |
| Architecture | architecture.md |