Add Phase A implementation plan (Stop Dying)

BlueDoc covering 6 critical resilience fixes with TDD flow:
- 2.9: RwLock → parking_lot (poison-free locks)
- 2.2: Panic hook with tracing integration
- 3.7+2.7: systemd ExecStopPost + stale mount cleanup
- 2.10: sd_notify READY/STOPPING
- 2.1: Signal handling via spawn_mount2 + tokio signals
Each with: stubs → RED tests → implementation → GREEN verify.
~5 days estimated, exact files and code patterns specified.
This commit is contained in:
Alexander
2026-05-13 14:00:46 +02:00
parent e3eeba4650
commit 24086cc744
+569
View File
@@ -0,0 +1,569 @@
# Phase A: Stop Dying — Implementation Plan
**Authors:** AI-assisted
**Status:** Draft
**Last Updated:** 2026-05-13
**Reviewers:** TBD
**Approvers:** TBD
**Prerequisites:** [resilience-fault-tolerance.md](resilience-fault-tolerance.md), [resilience-testing.md](resilience-testing.md)
**Estimated Effort:** ~5 days
---
[TOC]
---
## 1. Abstract
Implement the 6 most critical resilience fixes (issues 2.1, 2.2, 2.7, 2.9, 2.10, 3.7 from the [resilience audit](resilience-fault-tolerance.md)) that prevent MusicFS from dying on common operational events: signals, panics, lock poisoning, and systemd lifecycle.
Issues 2.3 (shutdown orchestration), 2.4 (cache integrity), 2.5 (sync recovery), 2.6 (task supervisor), 2.8 (disk space) are deferred to Phase B — they depend on Phase A infrastructure or on the [persistent state](persistent-state.md) work.
**Development flow** (TDD, per-issue):
1. Create stubs so the codebase compiles
2. Write RED tests that express the expected behavior
3. Implement the fix
4. Verify tests turn GREEN
5. Run full test suite — no regressions
---
## 2. Background
MusicFS currently dies on:
- Any signal (SIGTERM, SIGINT) — instant death, no cleanup
- Any panic in a writer thread — RwLock poisons, all FUSE ops crash
- systemd lifecycle — `Type=notify` but no `sd_notify`, ExecStop is a stub
- Crash leaves stale FUSE mount — users must manually `fusermount -u`
The [resilience test crate](../../musicfs/crates/musicfs-test-utils/) and RED tests are already in place. This plan implements the fixes to turn them GREEN.
---
## 3. Goals & Non-Goals
### 3.1 Goals
- Signal handler catches SIGTERM/SIGINT and initiates clean exit
- Panics are logged with full context before process terminates
- RwLock poisoning cannot cascade to kill FUSE operations
- systemd integration works (`sd_notify READY=1`, `ExecStopPost`)
- Stale FUSE mounts are detected and cleaned on startup
- All existing 162 tests continue to pass
- All Phase A RED tests turn GREEN
### 3.2 Non-Goals
- Graceful shutdown orchestration with ordered teardown (Phase B — needs CancellationToken plumbing through all components)
- Task supervisor for background task restart (Phase B)
- Cache integrity checks on startup (Phase B — needs persistent state)
- Disk space monitoring (Phase B)
- Interrupted sync recovery (Phase B — needs persistent state)
---
## 4. Proposed Design
### 4.1 Implementation Order
Dependencies determine the order. Each issue is independent except where noted.
```
4.2 RwLock poison fix (no deps, instant win, unblocks safety)
4.3 Panic hook (no deps, complements RwLock fix)
4.4 systemd ExecStopPost (no deps, config-only change)
4.5 sd_notify integration (no deps, new crate dependency)
4.6 Signal handling (depends on: FUSE mount change to spawn_mount2)
4.7 Stale mount detection (depends on: signal handling for clean test)
```
### 4.2 Issue 2.9: RwLock Poison Fix
**Approach**: Replace `std::sync::RwLock` with `parking_lot::RwLock` in all production paths. `parking_lot` never poisons — a panic in a writer releases the lock and subsequent readers see the pre-panic state.
**Why parking_lot over poison recovery**: The codebase already uses `parking_lot` in `prefetch.rs` and `index.rs`. Using it everywhere is consistent. The alternative (`.unwrap_or_else(|p| p.into_inner())`) is verbose and error-prone — one missed call re-introduces the bug.
#### Step 1: Stubs (compile)
None needed — `parking_lot::RwLock` is a drop-in replacement (same API, no `PoisonError`).
#### Step 2: RED tests
Already exist in `tests/resilience.rs`:
- `test_poisoned_tree_lock_returns_eio_not_panic` — currently passes (demonstrates the problem)
- `test_parking_lot_rwlock_survives_panic` — currently passes (proves the fix works)
Additional test to add: verify FUSE filesystem survives a writer panic on the tree lock.
#### Step 3: Implementation
**Files to change:**
| File | Change |
|------|--------|
| `musicfs-fuse/src/filesystem.rs` | `use std::sync::RwLock``use parking_lot::RwLock`; remove all `.unwrap()` on lock calls (parking_lot returns guard directly, not `Result`) |
| `musicfs-cas/src/reader.rs` | Same change for `manifests: RwLock<HashMap<...>>` |
| `musicfs-cas/src/fetcher.rs` | Same change for `origins` and `file_meta` locks |
| `musicfs-origins/src/registry.rs` | Same change for `origins` and `watch_handles` locks |
| `musicfs-cache/src/eviction.rs` | Same change for `access_times` and `hash_to_time` locks |
| `musicfs-core/src/metrics.rs` | Same change for histogram locks |
| `musicfs-cache/src/tree.rs` | Same change for `last_refresh` lock |
**Pattern**: In each file:
```rust
// BEFORE
use std::sync::RwLock;
let guard = self.tree.read().unwrap();
// AFTER
use parking_lot::RwLock;
let guard = self.tree.read(); // No unwrap needed
```
For the `MusicFs` struct in `filesystem.rs`, the `tree` field is `Arc<RwLock<VirtualTree>>` — this is passed in from `main.rs`. Change `main.rs` to use `parking_lot::RwLock` there too.
#### Step 4: Verify
```bash
cargo test # All 162+ tests pass
cargo test -p musicfs-test-utils # Resilience tests pass
cargo check # No warnings
```
---
### 4.3 Issue 2.2: Panic Hook
**Approach**: Install a custom panic hook at daemon startup that logs the panic with `tracing::error!` before the default behavior (abort/unwind). This ensures panics are captured in log files and journald.
#### Step 1: Stubs
Add to `musicfs-core/src/lib.rs`:
```rust
pub fn install_panic_hook() {
// stub — will be implemented
}
```
#### Step 2: RED tests
Write in `tests/resilience.rs`:
```rust
#[test]
fn test_panic_hook_logs_to_tracing() {
// Install hook with a test tracing subscriber
// Trigger panic via catch_unwind
// Verify error! log contains panic message + thread name
}
```
#### Step 3: Implementation
**File**: `musicfs-core/src/lib.rs` (or new `musicfs-core/src/panic.rs`)
```rust
pub fn install_panic_hook() {
let default_hook = std::panic::take_hook();
std::panic::set_hook(Box::new(move |info| {
let thread = std::thread::current();
let thread_name = thread.name().unwrap_or("<unnamed>");
let message = if let Some(s) = info.payload().downcast_ref::<&str>() {
s.to_string()
} else if let Some(s) = info.payload().downcast_ref::<String>() {
s.clone()
} else {
"unknown panic".to_string()
};
let location = info.location().map(|l| format!("{}:{}:{}", l.file(), l.line(), l.column()))
.unwrap_or_else(|| "unknown location".to_string());
tracing::error!(
thread = thread_name,
location = %location,
"PANIC: {}",
message
);
default_hook(info);
}));
}
```
**Call site**: `musicfs-cli/src/main.rs`, at the very top of `main()`:
```rust
fn main() -> Result<()> {
musicfs_core::install_panic_hook();
let cli = Cli::parse();
// ...
}
```
#### Step 4: Verify
```bash
cargo test -p musicfs-core # Panic hook unit tests
cargo test -p musicfs-test-utils # Resilience tests
```
---
### 4.4 Issue 3.7 + 2.7: systemd Service Fix + FUSE Cleanup
**Approach**: Fix the systemd service file and add stale mount detection on startup.
#### Step 1: No stubs needed (config change)
#### Step 2: RED tests
Already exists: `test_systemd_service_has_execstoppost` — currently fails because service file lacks `ExecStopPost`.
Add test for stale mount detection:
```rust
#[test]
fn test_stale_mount_check_function_exists() {
// Verify the function signature exists
// (actual mount test needs privileged environment)
}
```
#### Step 3: Implementation
**File**: `dist/musicfs.service`
```diff
ExecStop=/usr/bin/musicfs shutdown
+ExecStopPost=/usr/bin/fusermount -uz %h/music || true
Restart=on-failure
```
Note: `fusermount -uz` is "lazy unmount" — always succeeds even if mount is busy. The `|| true` prevents systemd from treating cleanup failure as a service failure.
**File**: `musicfs-cli/src/main.rs` — add stale mount check before mounting:
```rust
fn check_stale_mount(mountpoint: &Path) -> Result<()> {
// Check /proc/mounts for existing mount at this path
if let Ok(mounts) = std::fs::read_to_string("/proc/mounts") {
for line in mounts.lines() {
if line.contains(&mountpoint.to_string_lossy().as_ref()) && line.contains("fuse") {
warn!("Stale FUSE mount detected at {:?}, attempting cleanup", mountpoint);
let status = std::process::Command::new("fusermount")
.args(["-uz", &mountpoint.to_string_lossy()])
.status();
match status {
Ok(s) if s.success() => info!("Stale mount cleaned up"),
Ok(s) => warn!("fusermount exited with: {}", s),
Err(e) => warn!("Failed to run fusermount: {}", e),
}
}
}
}
Ok(())
}
```
Also fix the `test_systemd_service_has_execstoppost` test path — currently points to wrong location.
#### Step 4: Verify
```bash
cargo test -p musicfs-test-utils -- test_systemd # Service file test
```
---
### 4.5 Issue 2.10: sd_notify Integration
**Approach**: Add `sd-notify` crate, call `READY=1` after mount, `STOPPING` on shutdown.
#### Step 1: Stubs
Add dependency to `musicfs-cli/Cargo.toml`:
```toml
sd-notify = "0.4"
```
#### Step 2: RED tests
Write test that mocks the notify socket:
```rust
#[test]
fn test_sd_notify_ready_sent() {
// Create Unix datagram socket at $NOTIFY_SOCKET
// Call sd_notify::notify(READY=1)
// Verify message received on socket
}
```
#### Step 3: Implementation
**File**: `musicfs-cli/src/main.rs`
After `fs.mount()` succeeds (or more precisely, after `spawn_mount2` — see 4.6):
```rust
// Notify systemd we're ready
if let Err(e) = sd_notify::notify(false, &[sd_notify::NotifyState::Ready]) {
debug!("sd_notify not available (not running under systemd): {}", e);
}
```
On shutdown path:
```rust
let _ = sd_notify::notify(false, &[sd_notify::NotifyState::Stopping]);
```
#### Step 4: Verify
```bash
cargo test -p musicfs-test-utils -- test_sd_notify
cargo build -p musicfs-cli # Verify it compiles with new dep
```
---
### 4.6 Issue 2.1: Signal Handling
**Approach**: Switch from `fuser::mount2` (blocking) to `fuser::spawn_mount2` (background), then listen for signals on the main thread.
This is the most complex change in Phase A. It restructures the daemon's main loop.
#### Step 1: Stubs
Change `MusicFs::mount()` signature to return a session handle:
```rust
// BEFORE
pub fn mount(self, mountpoint: &Path) -> Result<()> {
fuser::mount2(self, mountpoint, &options)?;
Ok(())
}
// AFTER (stub — returns BackgroundSession)
pub fn spawn_mount(self, mountpoint: &Path) -> Result<fuser::BackgroundSession> {
let session = fuser::spawn_mount2(self, mountpoint, &options)?;
Ok(session)
}
```
Keep old `mount()` temporarily for compatibility.
#### Step 2: RED tests
Write in `tests/resilience.rs`:
```rust
#[tokio::test]
async fn test_sigterm_triggers_shutdown() {
// Spawn daemon as child process
// Wait for mount
// Send SIGTERM
// Verify clean exit within 10s
// Verify mountpoint is unmounted
}
```
This test requires the signal handler to exist. It will be RED until implementation.
#### Step 3: Implementation
**File**: `musicfs-cli/src/main.rs` — rewrite `run_mount()`:
```rust
fn run_mount(mountpoint: PathBuf, origin_path: Option<PathBuf>, cache_dir: Option<PathBuf>) -> Result<()> {
let origin_path = origin_path.context("--origin is required")?;
let runtime = tokio::runtime::Runtime::new()?;
let handle = runtime.handle().clone();
let (tree, reader) = runtime.block_on(async {
// ... existing setup code (unchanged) ...
Ok::<_, anyhow::Error>((tree, reader))
})?;
// Check for stale mount before mounting
check_stale_mount(&mountpoint)?;
let fs = MusicFs::with_reader(tree, reader, handle.clone());
info!("Mounting filesystem at {:?}", mountpoint);
// spawn_mount2 returns immediately — FUSE runs in background
let session = fs.spawn_mount(&mountpoint)
.context("Failed to mount filesystem")?;
// Notify systemd
let _ = sd_notify::notify(false, &[sd_notify::NotifyState::Ready]);
info!("MusicFS ready, PID {}", std::process::id());
// Block on signal
runtime.block_on(async {
let mut sigterm = tokio::signal::unix::signal(
tokio::signal::unix::SignalKind::terminate()
)?;
let mut sigint = tokio::signal::unix::signal(
tokio::signal::unix::SignalKind::interrupt()
)?;
tokio::select! {
_ = sigterm.recv() => {
info!("Received SIGTERM, shutting down");
}
_ = sigint.recv() => {
info!("Received SIGINT, shutting down");
}
}
Ok::<_, anyhow::Error>(())
})?;
// Shutdown sequence
let _ = sd_notify::notify(false, &[sd_notify::NotifyState::Stopping]);
info!("Unmounting filesystem");
drop(session); // BackgroundSession::drop() calls unmount
info!("Shutdown complete");
Ok(())
}
```
**File**: `musicfs-fuse/src/filesystem.rs` — add `spawn_mount()`:
```rust
pub fn spawn_mount(self, mountpoint: &Path) -> Result<fuser::BackgroundSession> {
info!("Mounting MusicFS at {:?}", mountpoint);
let options = vec![
fuser::MountOption::RO,
fuser::MountOption::FSName("musicfs".to_string()),
fuser::MountOption::AutoUnmount,
fuser::MountOption::AllowOther,
];
let session = fuser::spawn_mount2(self, mountpoint, &options)
.map_err(musicfs_core::Error::Io)?;
Ok(session)
}
```
#### Step 4: Verify
```bash
cargo build -p musicfs-cli
cargo test -p musicfs-test-utils -- test_sigterm # Process-level test
cargo test # No regressions
```
---
## 5. Cross-Cutting Concerns
### 5.1 Security & Privacy
- No new attack surface — changes are internal lifecycle management
- Panic hook does NOT log sensitive data (only panic message, thread name, location)
- `sd_notify` uses existing systemd socket — no new IPC
### 5.2 Observability
- Panic hook ensures all panics are captured in logs/journald
- Signal handling logs which signal triggered shutdown
- sd_notify gives systemd accurate service state
- Stale mount detection logs cleanup attempts
### 5.3 Testing
All changes follow the TDD flow:
1. Stubs compile
2. RED tests document expected behavior
3. Implementation turns tests GREEN
4. Full suite passes (no regressions)
---
## 6. Alternatives Considered
### 6.1 Poison Recovery Instead of parking_lot
**Alternative**: Keep `std::sync::RwLock`, add `.unwrap_or_else(|p| p.into_inner())` to every lock call.
**Rejected**: 30+ call sites to change, easy to miss one, and the pattern is verbose. `parking_lot` is already a dependency and is strictly better for this use case (faster, no poison, correct API).
### 6.2 Keep mount2 (blocking) with Signal Thread
**Alternative**: Keep `fuser::mount2`, spawn a separate thread for signal handling, use a channel to communicate shutdown.
**Rejected**: `mount2` consumes `self` and blocks — there's no clean way to interrupt it from another thread. `spawn_mount2` is the canonical solution from the `fuser` crate.
### 6.3 Defer sd_notify Until Full Shutdown Orchestration
**Alternative**: Implement sd_notify only after CancellationToken + graceful shutdown are in place.
**Rejected**: sd_notify `READY=1` is critical now — without it, `Type=notify` in the service file means systemd will timeout and kill the daemon on every start. The shutdown `STOPPING` notification is a bonus but not required for Phase A.
---
## 7. Implementation Plan
### 7.1 Task Sequence
| Day | Task | Issue | Effort | Test Approach |
|-----|------|-------|--------|---------------|
| 1 (morning) | RwLock → parking_lot migration | 2.9 | 2h | Existing GREEN test validates; verify no `.unwrap()` on locks |
| 1 (afternoon) | Panic hook | 2.2 | 2h | New test: panic → verify tracing output |
| 2 (morning) | systemd ExecStopPost + stale mount check | 3.7 + 2.7 | 2h | Existing RED test → GREEN; new stale mount test |
| 2 (afternoon) | sd_notify integration | 2.10 | 2h | New test: mock socket → verify READY=1 |
| 3 | Signal handling (spawn_mount2 + signal loop) | 2.1 | 4h | Fork daemon → send SIGTERM → verify exit |
| 4 | Integration + regression testing | — | 4h | Full `cargo test`, manual FUSE mount test |
| 5 | Buffer for issues found during integration | — | 4h | — |
### 7.2 Verification Checklist
After all tasks complete:
- [ ] `cargo check` — zero errors, zero warnings
- [ ] `cargo test` — all 162+ existing tests pass
- [ ] `cargo test -p musicfs-test-utils` — all resilience tests pass
- [ ] `cargo clippy` — no new warnings
- [ ] `grep -r '\.read()\.unwrap()\|\.write()\.unwrap()' crates/` — zero hits in production code (test code is OK)
- [ ] `dist/musicfs.service` contains `ExecStopPost`
- [ ] Manual test: `musicfs mount`, then `kill -TERM <pid>`, verify clean exit + mount gone
- [ ] Manual test: `kill -9 <pid>`, then `musicfs mount` again — no "already mounted" error
---
## 8. Files Changed
| File | Change | Issue |
|------|--------|-------|
| `musicfs-fuse/src/filesystem.rs` | `std::sync::RwLock``parking_lot::RwLock`; add `spawn_mount()` | 2.9, 2.1 |
| `musicfs-cas/src/reader.rs` | `std::sync::RwLock``parking_lot::RwLock` | 2.9 |
| `musicfs-cas/src/fetcher.rs` | `std::sync::RwLock``parking_lot::RwLock` | 2.9 |
| `musicfs-origins/src/registry.rs` | `std::sync::RwLock``parking_lot::RwLock` | 2.9 |
| `musicfs-cache/src/eviction.rs` | `std::sync::RwLock``parking_lot::RwLock` | 2.9 |
| `musicfs-cache/src/tree.rs` | `std::sync::RwLock``parking_lot::RwLock` | 2.9 |
| `musicfs-core/src/metrics.rs` | `std::sync::RwLock``parking_lot::RwLock` | 2.9 |
| `musicfs-core/src/lib.rs` | Add `install_panic_hook()` | 2.2 |
| `musicfs-cli/src/main.rs` | Panic hook, signal handler, spawn_mount2, sd_notify, stale mount check | 2.1, 2.2, 2.7, 2.10 |
| `musicfs-cli/Cargo.toml` | Add `sd-notify`, `tokio-util` deps | 2.10, 2.1 |
| `dist/musicfs.service` | Add `ExecStopPost`, fix `ExecStop` | 3.7 |
| `tests/resilience.rs` | Update/add tests for signal, panic hook, sd_notify | all |
---
## 9. Glossary / References
| Term | Definition |
|------|------------|
| **parking_lot** | Fast, poison-free lock implementation. Already a project dependency. |
| **spawn_mount2** | `fuser` API that mounts FUSE in a background thread, returning a `BackgroundSession` handle |
| **sd_notify** | systemd notification protocol. `READY=1` signals service started, `STOPPING` signals shutdown. |
| **BackgroundSession** | Handle returned by `spawn_mount2`. Dropping it unmounts the filesystem. |
| Document | Path |
|----------|------|
| Resilience audit | [resilience-fault-tolerance.md](resilience-fault-tolerance.md) |
| Resilience testing | [resilience-testing.md](resilience-testing.md) |
| Architecture | [architecture.md](../architecture.md) |