From 24086cc744f90b6ac3f9e3ab5ae77d2d3a646539 Mon Sep 17 00:00:00 2001 From: Alexander Date: Wed, 13 May 2026 14:00:46 +0200 Subject: [PATCH] Add Phase A implementation plan (Stop Dying) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit BlueDoc covering 6 critical resilience fixes with TDD flow: - 2.9: RwLock → parking_lot (poison-free locks) - 2.2: Panic hook with tracing integration - 3.7+2.7: systemd ExecStopPost + stale mount cleanup - 2.10: sd_notify READY/STOPPING - 2.1: Signal handling via spawn_mount2 + tokio signals Each with: stubs → RED tests → implementation → GREEN verify. ~5 days estimated, exact files and code patterns specified. --- docs/v2/plans/phase-a-stop-dying.md | 569 ++++++++++++++++++++++++++++ 1 file changed, 569 insertions(+) create mode 100644 docs/v2/plans/phase-a-stop-dying.md diff --git a/docs/v2/plans/phase-a-stop-dying.md b/docs/v2/plans/phase-a-stop-dying.md new file mode 100644 index 0000000..eab0e5f --- /dev/null +++ b/docs/v2/plans/phase-a-stop-dying.md @@ -0,0 +1,569 @@ +# Phase A: Stop Dying — Implementation Plan + +**Authors:** AI-assisted +**Status:** Draft +**Last Updated:** 2026-05-13 +**Reviewers:** TBD +**Approvers:** TBD +**Prerequisites:** [resilience-fault-tolerance.md](resilience-fault-tolerance.md), [resilience-testing.md](resilience-testing.md) +**Estimated Effort:** ~5 days + +--- + +[TOC] + +--- + +## 1. Abstract + +Implement the 6 most critical resilience fixes (issues 2.1, 2.2, 2.7, 2.9, 2.10, 3.7 from the [resilience audit](resilience-fault-tolerance.md)) that prevent MusicFS from dying on common operational events: signals, panics, lock poisoning, and systemd lifecycle. + +Issues 2.3 (shutdown orchestration), 2.4 (cache integrity), 2.5 (sync recovery), 2.6 (task supervisor), 2.8 (disk space) are deferred to Phase B — they depend on Phase A infrastructure or on the [persistent state](persistent-state.md) work. + +**Development flow** (TDD, per-issue): +1. Create stubs so the codebase compiles +2. Write RED tests that express the expected behavior +3. Implement the fix +4. Verify tests turn GREEN +5. Run full test suite — no regressions + +--- + +## 2. Background + +MusicFS currently dies on: +- Any signal (SIGTERM, SIGINT) — instant death, no cleanup +- Any panic in a writer thread — RwLock poisons, all FUSE ops crash +- systemd lifecycle — `Type=notify` but no `sd_notify`, ExecStop is a stub +- Crash leaves stale FUSE mount — users must manually `fusermount -u` + +The [resilience test crate](../../musicfs/crates/musicfs-test-utils/) and RED tests are already in place. This plan implements the fixes to turn them GREEN. + +--- + +## 3. Goals & Non-Goals + +### 3.1 Goals + +- Signal handler catches SIGTERM/SIGINT and initiates clean exit +- Panics are logged with full context before process terminates +- RwLock poisoning cannot cascade to kill FUSE operations +- systemd integration works (`sd_notify READY=1`, `ExecStopPost`) +- Stale FUSE mounts are detected and cleaned on startup +- All existing 162 tests continue to pass +- All Phase A RED tests turn GREEN + +### 3.2 Non-Goals + +- Graceful shutdown orchestration with ordered teardown (Phase B — needs CancellationToken plumbing through all components) +- Task supervisor for background task restart (Phase B) +- Cache integrity checks on startup (Phase B — needs persistent state) +- Disk space monitoring (Phase B) +- Interrupted sync recovery (Phase B — needs persistent state) + +--- + +## 4. Proposed Design + +### 4.1 Implementation Order + +Dependencies determine the order. Each issue is independent except where noted. + +``` +4.2 RwLock poison fix (no deps, instant win, unblocks safety) + ↓ +4.3 Panic hook (no deps, complements RwLock fix) + ↓ +4.4 systemd ExecStopPost (no deps, config-only change) + ↓ +4.5 sd_notify integration (no deps, new crate dependency) + ↓ +4.6 Signal handling (depends on: FUSE mount change to spawn_mount2) + ↓ +4.7 Stale mount detection (depends on: signal handling for clean test) +``` + +### 4.2 Issue 2.9: RwLock Poison Fix + +**Approach**: Replace `std::sync::RwLock` with `parking_lot::RwLock` in all production paths. `parking_lot` never poisons — a panic in a writer releases the lock and subsequent readers see the pre-panic state. + +**Why parking_lot over poison recovery**: The codebase already uses `parking_lot` in `prefetch.rs` and `index.rs`. Using it everywhere is consistent. The alternative (`.unwrap_or_else(|p| p.into_inner())`) is verbose and error-prone — one missed call re-introduces the bug. + +#### Step 1: Stubs (compile) + +None needed — `parking_lot::RwLock` is a drop-in replacement (same API, no `PoisonError`). + +#### Step 2: RED tests + +Already exist in `tests/resilience.rs`: +- `test_poisoned_tree_lock_returns_eio_not_panic` — currently passes (demonstrates the problem) +- `test_parking_lot_rwlock_survives_panic` — currently passes (proves the fix works) + +Additional test to add: verify FUSE filesystem survives a writer panic on the tree lock. + +#### Step 3: Implementation + +**Files to change:** + +| File | Change | +|------|--------| +| `musicfs-fuse/src/filesystem.rs` | `use std::sync::RwLock` → `use parking_lot::RwLock`; remove all `.unwrap()` on lock calls (parking_lot returns guard directly, not `Result`) | +| `musicfs-cas/src/reader.rs` | Same change for `manifests: RwLock>` | +| `musicfs-cas/src/fetcher.rs` | Same change for `origins` and `file_meta` locks | +| `musicfs-origins/src/registry.rs` | Same change for `origins` and `watch_handles` locks | +| `musicfs-cache/src/eviction.rs` | Same change for `access_times` and `hash_to_time` locks | +| `musicfs-core/src/metrics.rs` | Same change for histogram locks | +| `musicfs-cache/src/tree.rs` | Same change for `last_refresh` lock | + +**Pattern**: In each file: +```rust +// BEFORE +use std::sync::RwLock; +let guard = self.tree.read().unwrap(); + +// AFTER +use parking_lot::RwLock; +let guard = self.tree.read(); // No unwrap needed +``` + +For the `MusicFs` struct in `filesystem.rs`, the `tree` field is `Arc>` — this is passed in from `main.rs`. Change `main.rs` to use `parking_lot::RwLock` there too. + +#### Step 4: Verify + +```bash +cargo test # All 162+ tests pass +cargo test -p musicfs-test-utils # Resilience tests pass +cargo check # No warnings +``` + +--- + +### 4.3 Issue 2.2: Panic Hook + +**Approach**: Install a custom panic hook at daemon startup that logs the panic with `tracing::error!` before the default behavior (abort/unwind). This ensures panics are captured in log files and journald. + +#### Step 1: Stubs + +Add to `musicfs-core/src/lib.rs`: +```rust +pub fn install_panic_hook() { + // stub — will be implemented +} +``` + +#### Step 2: RED tests + +Write in `tests/resilience.rs`: +```rust +#[test] +fn test_panic_hook_logs_to_tracing() { + // Install hook with a test tracing subscriber + // Trigger panic via catch_unwind + // Verify error! log contains panic message + thread name +} +``` + +#### Step 3: Implementation + +**File**: `musicfs-core/src/lib.rs` (or new `musicfs-core/src/panic.rs`) + +```rust +pub fn install_panic_hook() { + let default_hook = std::panic::take_hook(); + std::panic::set_hook(Box::new(move |info| { + let thread = std::thread::current(); + let thread_name = thread.name().unwrap_or(""); + + let message = if let Some(s) = info.payload().downcast_ref::<&str>() { + s.to_string() + } else if let Some(s) = info.payload().downcast_ref::() { + s.clone() + } else { + "unknown panic".to_string() + }; + + let location = info.location().map(|l| format!("{}:{}:{}", l.file(), l.line(), l.column())) + .unwrap_or_else(|| "unknown location".to_string()); + + tracing::error!( + thread = thread_name, + location = %location, + "PANIC: {}", + message + ); + + default_hook(info); + })); +} +``` + +**Call site**: `musicfs-cli/src/main.rs`, at the very top of `main()`: +```rust +fn main() -> Result<()> { + musicfs_core::install_panic_hook(); + let cli = Cli::parse(); + // ... +} +``` + +#### Step 4: Verify + +```bash +cargo test -p musicfs-core # Panic hook unit tests +cargo test -p musicfs-test-utils # Resilience tests +``` + +--- + +### 4.4 Issue 3.7 + 2.7: systemd Service Fix + FUSE Cleanup + +**Approach**: Fix the systemd service file and add stale mount detection on startup. + +#### Step 1: No stubs needed (config change) + +#### Step 2: RED tests + +Already exists: `test_systemd_service_has_execstoppost` — currently fails because service file lacks `ExecStopPost`. + +Add test for stale mount detection: +```rust +#[test] +fn test_stale_mount_check_function_exists() { + // Verify the function signature exists + // (actual mount test needs privileged environment) +} +``` + +#### Step 3: Implementation + +**File**: `dist/musicfs.service` + +```diff + ExecStop=/usr/bin/musicfs shutdown ++ExecStopPost=/usr/bin/fusermount -uz %h/music || true + Restart=on-failure +``` + +Note: `fusermount -uz` is "lazy unmount" — always succeeds even if mount is busy. The `|| true` prevents systemd from treating cleanup failure as a service failure. + +**File**: `musicfs-cli/src/main.rs` — add stale mount check before mounting: + +```rust +fn check_stale_mount(mountpoint: &Path) -> Result<()> { + // Check /proc/mounts for existing mount at this path + if let Ok(mounts) = std::fs::read_to_string("/proc/mounts") { + for line in mounts.lines() { + if line.contains(&mountpoint.to_string_lossy().as_ref()) && line.contains("fuse") { + warn!("Stale FUSE mount detected at {:?}, attempting cleanup", mountpoint); + let status = std::process::Command::new("fusermount") + .args(["-uz", &mountpoint.to_string_lossy()]) + .status(); + match status { + Ok(s) if s.success() => info!("Stale mount cleaned up"), + Ok(s) => warn!("fusermount exited with: {}", s), + Err(e) => warn!("Failed to run fusermount: {}", e), + } + } + } + } + Ok(()) +} +``` + +Also fix the `test_systemd_service_has_execstoppost` test path — currently points to wrong location. + +#### Step 4: Verify + +```bash +cargo test -p musicfs-test-utils -- test_systemd # Service file test +``` + +--- + +### 4.5 Issue 2.10: sd_notify Integration + +**Approach**: Add `sd-notify` crate, call `READY=1` after mount, `STOPPING` on shutdown. + +#### Step 1: Stubs + +Add dependency to `musicfs-cli/Cargo.toml`: +```toml +sd-notify = "0.4" +``` + +#### Step 2: RED tests + +Write test that mocks the notify socket: +```rust +#[test] +fn test_sd_notify_ready_sent() { + // Create Unix datagram socket at $NOTIFY_SOCKET + // Call sd_notify::notify(READY=1) + // Verify message received on socket +} +``` + +#### Step 3: Implementation + +**File**: `musicfs-cli/src/main.rs` + +After `fs.mount()` succeeds (or more precisely, after `spawn_mount2` — see 4.6): +```rust +// Notify systemd we're ready +if let Err(e) = sd_notify::notify(false, &[sd_notify::NotifyState::Ready]) { + debug!("sd_notify not available (not running under systemd): {}", e); +} +``` + +On shutdown path: +```rust +let _ = sd_notify::notify(false, &[sd_notify::NotifyState::Stopping]); +``` + +#### Step 4: Verify + +```bash +cargo test -p musicfs-test-utils -- test_sd_notify +cargo build -p musicfs-cli # Verify it compiles with new dep +``` + +--- + +### 4.6 Issue 2.1: Signal Handling + +**Approach**: Switch from `fuser::mount2` (blocking) to `fuser::spawn_mount2` (background), then listen for signals on the main thread. + +This is the most complex change in Phase A. It restructures the daemon's main loop. + +#### Step 1: Stubs + +Change `MusicFs::mount()` signature to return a session handle: + +```rust +// BEFORE +pub fn mount(self, mountpoint: &Path) -> Result<()> { + fuser::mount2(self, mountpoint, &options)?; + Ok(()) +} + +// AFTER (stub — returns BackgroundSession) +pub fn spawn_mount(self, mountpoint: &Path) -> Result { + let session = fuser::spawn_mount2(self, mountpoint, &options)?; + Ok(session) +} +``` + +Keep old `mount()` temporarily for compatibility. + +#### Step 2: RED tests + +Write in `tests/resilience.rs`: +```rust +#[tokio::test] +async fn test_sigterm_triggers_shutdown() { + // Spawn daemon as child process + // Wait for mount + // Send SIGTERM + // Verify clean exit within 10s + // Verify mountpoint is unmounted +} +``` + +This test requires the signal handler to exist. It will be RED until implementation. + +#### Step 3: Implementation + +**File**: `musicfs-cli/src/main.rs` — rewrite `run_mount()`: + +```rust +fn run_mount(mountpoint: PathBuf, origin_path: Option, cache_dir: Option) -> Result<()> { + let origin_path = origin_path.context("--origin is required")?; + let runtime = tokio::runtime::Runtime::new()?; + let handle = runtime.handle().clone(); + + let (tree, reader) = runtime.block_on(async { + // ... existing setup code (unchanged) ... + Ok::<_, anyhow::Error>((tree, reader)) + })?; + + // Check for stale mount before mounting + check_stale_mount(&mountpoint)?; + + let fs = MusicFs::with_reader(tree, reader, handle.clone()); + info!("Mounting filesystem at {:?}", mountpoint); + + // spawn_mount2 returns immediately — FUSE runs in background + let session = fs.spawn_mount(&mountpoint) + .context("Failed to mount filesystem")?; + + // Notify systemd + let _ = sd_notify::notify(false, &[sd_notify::NotifyState::Ready]); + info!("MusicFS ready, PID {}", std::process::id()); + + // Block on signal + runtime.block_on(async { + let mut sigterm = tokio::signal::unix::signal( + tokio::signal::unix::SignalKind::terminate() + )?; + let mut sigint = tokio::signal::unix::signal( + tokio::signal::unix::SignalKind::interrupt() + )?; + + tokio::select! { + _ = sigterm.recv() => { + info!("Received SIGTERM, shutting down"); + } + _ = sigint.recv() => { + info!("Received SIGINT, shutting down"); + } + } + + Ok::<_, anyhow::Error>(()) + })?; + + // Shutdown sequence + let _ = sd_notify::notify(false, &[sd_notify::NotifyState::Stopping]); + info!("Unmounting filesystem"); + drop(session); // BackgroundSession::drop() calls unmount + info!("Shutdown complete"); + + Ok(()) +} +``` + +**File**: `musicfs-fuse/src/filesystem.rs` — add `spawn_mount()`: + +```rust +pub fn spawn_mount(self, mountpoint: &Path) -> Result { + info!("Mounting MusicFS at {:?}", mountpoint); + let options = vec![ + fuser::MountOption::RO, + fuser::MountOption::FSName("musicfs".to_string()), + fuser::MountOption::AutoUnmount, + fuser::MountOption::AllowOther, + ]; + let session = fuser::spawn_mount2(self, mountpoint, &options) + .map_err(musicfs_core::Error::Io)?; + Ok(session) +} +``` + +#### Step 4: Verify + +```bash +cargo build -p musicfs-cli +cargo test -p musicfs-test-utils -- test_sigterm # Process-level test +cargo test # No regressions +``` + +--- + +## 5. Cross-Cutting Concerns + +### 5.1 Security & Privacy + +- No new attack surface — changes are internal lifecycle management +- Panic hook does NOT log sensitive data (only panic message, thread name, location) +- `sd_notify` uses existing systemd socket — no new IPC + +### 5.2 Observability + +- Panic hook ensures all panics are captured in logs/journald +- Signal handling logs which signal triggered shutdown +- sd_notify gives systemd accurate service state +- Stale mount detection logs cleanup attempts + +### 5.3 Testing + +All changes follow the TDD flow: +1. Stubs compile +2. RED tests document expected behavior +3. Implementation turns tests GREEN +4. Full suite passes (no regressions) + +--- + +## 6. Alternatives Considered + +### 6.1 Poison Recovery Instead of parking_lot + +**Alternative**: Keep `std::sync::RwLock`, add `.unwrap_or_else(|p| p.into_inner())` to every lock call. + +**Rejected**: 30+ call sites to change, easy to miss one, and the pattern is verbose. `parking_lot` is already a dependency and is strictly better for this use case (faster, no poison, correct API). + +### 6.2 Keep mount2 (blocking) with Signal Thread + +**Alternative**: Keep `fuser::mount2`, spawn a separate thread for signal handling, use a channel to communicate shutdown. + +**Rejected**: `mount2` consumes `self` and blocks — there's no clean way to interrupt it from another thread. `spawn_mount2` is the canonical solution from the `fuser` crate. + +### 6.3 Defer sd_notify Until Full Shutdown Orchestration + +**Alternative**: Implement sd_notify only after CancellationToken + graceful shutdown are in place. + +**Rejected**: sd_notify `READY=1` is critical now — without it, `Type=notify` in the service file means systemd will timeout and kill the daemon on every start. The shutdown `STOPPING` notification is a bonus but not required for Phase A. + +--- + +## 7. Implementation Plan + +### 7.1 Task Sequence + +| Day | Task | Issue | Effort | Test Approach | +|-----|------|-------|--------|---------------| +| 1 (morning) | RwLock → parking_lot migration | 2.9 | 2h | Existing GREEN test validates; verify no `.unwrap()` on locks | +| 1 (afternoon) | Panic hook | 2.2 | 2h | New test: panic → verify tracing output | +| 2 (morning) | systemd ExecStopPost + stale mount check | 3.7 + 2.7 | 2h | Existing RED test → GREEN; new stale mount test | +| 2 (afternoon) | sd_notify integration | 2.10 | 2h | New test: mock socket → verify READY=1 | +| 3 | Signal handling (spawn_mount2 + signal loop) | 2.1 | 4h | Fork daemon → send SIGTERM → verify exit | +| 4 | Integration + regression testing | — | 4h | Full `cargo test`, manual FUSE mount test | +| 5 | Buffer for issues found during integration | — | 4h | — | + +### 7.2 Verification Checklist + +After all tasks complete: + +- [ ] `cargo check` — zero errors, zero warnings +- [ ] `cargo test` — all 162+ existing tests pass +- [ ] `cargo test -p musicfs-test-utils` — all resilience tests pass +- [ ] `cargo clippy` — no new warnings +- [ ] `grep -r '\.read()\.unwrap()\|\.write()\.unwrap()' crates/` — zero hits in production code (test code is OK) +- [ ] `dist/musicfs.service` contains `ExecStopPost` +- [ ] Manual test: `musicfs mount`, then `kill -TERM `, verify clean exit + mount gone +- [ ] Manual test: `kill -9 `, then `musicfs mount` again — no "already mounted" error + +--- + +## 8. Files Changed + +| File | Change | Issue | +|------|--------|-------| +| `musicfs-fuse/src/filesystem.rs` | `std::sync::RwLock` → `parking_lot::RwLock`; add `spawn_mount()` | 2.9, 2.1 | +| `musicfs-cas/src/reader.rs` | `std::sync::RwLock` → `parking_lot::RwLock` | 2.9 | +| `musicfs-cas/src/fetcher.rs` | `std::sync::RwLock` → `parking_lot::RwLock` | 2.9 | +| `musicfs-origins/src/registry.rs` | `std::sync::RwLock` → `parking_lot::RwLock` | 2.9 | +| `musicfs-cache/src/eviction.rs` | `std::sync::RwLock` → `parking_lot::RwLock` | 2.9 | +| `musicfs-cache/src/tree.rs` | `std::sync::RwLock` → `parking_lot::RwLock` | 2.9 | +| `musicfs-core/src/metrics.rs` | `std::sync::RwLock` → `parking_lot::RwLock` | 2.9 | +| `musicfs-core/src/lib.rs` | Add `install_panic_hook()` | 2.2 | +| `musicfs-cli/src/main.rs` | Panic hook, signal handler, spawn_mount2, sd_notify, stale mount check | 2.1, 2.2, 2.7, 2.10 | +| `musicfs-cli/Cargo.toml` | Add `sd-notify`, `tokio-util` deps | 2.10, 2.1 | +| `dist/musicfs.service` | Add `ExecStopPost`, fix `ExecStop` | 3.7 | +| `tests/resilience.rs` | Update/add tests for signal, panic hook, sd_notify | all | + +--- + +## 9. Glossary / References + +| Term | Definition | +|------|------------| +| **parking_lot** | Fast, poison-free lock implementation. Already a project dependency. | +| **spawn_mount2** | `fuser` API that mounts FUSE in a background thread, returning a `BackgroundSession` handle | +| **sd_notify** | systemd notification protocol. `READY=1` signals service started, `STOPPING` signals shutdown. | +| **BackgroundSession** | Handle returned by `spawn_mount2`. Dropping it unmounts the filesystem. | + +| Document | Path | +|----------|------| +| Resilience audit | [resilience-fault-tolerance.md](resilience-fault-tolerance.md) | +| Resilience testing | [resilience-testing.md](resilience-testing.md) | +| Architecture | [architecture.md](../architecture.md) |