Files

T

Alexander 24086cc744 Add Phase A implementation plan (Stop Dying)

BlueDoc covering 6 critical resilience fixes with TDD flow:
- 2.9: RwLock → parking_lot (poison-free locks)
- 2.2: Panic hook with tracing integration
- 3.7+2.7: systemd ExecStopPost + stale mount cleanup
- 2.10: sd_notify READY/STOPPING
- 2.1: Signal handling via spawn_mount2 + tokio signals
Each with: stubs → RED tests → implementation → GREEN verify.
~5 days estimated, exact files and code patterns specified.

2026-05-13 14:00:46 +02:00

19 KiB

Raw Permalink Blame History

Phase A: Stop Dying — Implementation Plan

Authors: AI-assisted
Status: Draft
Last Updated: 2026-05-13
Reviewers: TBD
Approvers: TBD
Prerequisites: resilience-fault-tolerance.md, resilience-testing.md
Estimated Effort: ~5 days

[TOC]

1. Abstract

Implement the 6 most critical resilience fixes (issues 2.1, 2.2, 2.7, 2.9, 2.10, 3.7 from the resilience audit) that prevent MusicFS from dying on common operational events: signals, panics, lock poisoning, and systemd lifecycle.

Issues 2.3 (shutdown orchestration), 2.4 (cache integrity), 2.5 (sync recovery), 2.6 (task supervisor), 2.8 (disk space) are deferred to Phase B — they depend on Phase A infrastructure or on the persistent state work.

Development flow (TDD, per-issue):

Create stubs so the codebase compiles
Write RED tests that express the expected behavior
Implement the fix
Verify tests turn GREEN
Run full test suite — no regressions

2. Background

MusicFS currently dies on:

Any signal (SIGTERM, SIGINT) — instant death, no cleanup
Any panic in a writer thread — RwLock poisons, all FUSE ops crash
systemd lifecycle — Type=notify but no sd_notify, ExecStop is a stub
Crash leaves stale FUSE mount — users must manually fusermount -u

The resilience test crate and RED tests are already in place. This plan implements the fixes to turn them GREEN.

3. Goals & Non-Goals

3.1 Goals

Signal handler catches SIGTERM/SIGINT and initiates clean exit
Panics are logged with full context before process terminates
RwLock poisoning cannot cascade to kill FUSE operations
systemd integration works (sd_notify READY=1, ExecStopPost)
Stale FUSE mounts are detected and cleaned on startup
All existing 162 tests continue to pass
All Phase A RED tests turn GREEN

3.2 Non-Goals

Graceful shutdown orchestration with ordered teardown (Phase B — needs CancellationToken plumbing through all components)
Task supervisor for background task restart (Phase B)
Cache integrity checks on startup (Phase B — needs persistent state)
Disk space monitoring (Phase B)
Interrupted sync recovery (Phase B — needs persistent state)

4. Proposed Design

4.1 Implementation Order

Dependencies determine the order. Each issue is independent except where noted.

4.2  RwLock poison fix        (no deps, instant win, unblocks safety)
 ↓
4.3  Panic hook               (no deps, complements RwLock fix)
 ↓
4.4  systemd ExecStopPost     (no deps, config-only change)
 ↓
4.5  sd_notify integration    (no deps, new crate dependency)
 ↓
4.6  Signal handling           (depends on: FUSE mount change to spawn_mount2)
 ↓
4.7  Stale mount detection    (depends on: signal handling for clean test)

4.2 Issue 2.9: RwLock Poison Fix

Approach: Replace std::sync::RwLock with parking_lot::RwLock in all production paths. parking_lot never poisons — a panic in a writer releases the lock and subsequent readers see the pre-panic state.

Why parking_lot over poison recovery: The codebase already uses parking_lot in prefetch.rs and index.rs. Using it everywhere is consistent. The alternative (.unwrap_or_else(|p| p.into_inner())) is verbose and error-prone — one missed call re-introduces the bug.

Step 1: Stubs (compile)

None needed — parking_lot::RwLock is a drop-in replacement (same API, no PoisonError).

Step 2: RED tests

Already exist in tests/resilience.rs:

test_poisoned_tree_lock_returns_eio_not_panic — currently passes (demonstrates the problem)
test_parking_lot_rwlock_survives_panic — currently passes (proves the fix works)

Additional test to add: verify FUSE filesystem survives a writer panic on the tree lock.

Step 3: Implementation

Files to change:

File	Change
`musicfs-fuse/src/filesystem.rs`	`use std::sync::RwLock` → `use parking_lot::RwLock`; remove all `.unwrap()` on lock calls (parking_lot returns guard directly, not `Result`)
`musicfs-cas/src/reader.rs`	Same change for `manifests: RwLock<HashMap<...>>`
`musicfs-cas/src/fetcher.rs`	Same change for `origins` and `file_meta` locks
`musicfs-origins/src/registry.rs`	Same change for `origins` and `watch_handles` locks
`musicfs-cache/src/eviction.rs`	Same change for `access_times` and `hash_to_time` locks
`musicfs-core/src/metrics.rs`	Same change for histogram locks
`musicfs-cache/src/tree.rs`	Same change for `last_refresh` lock

Pattern: In each file:

// BEFORE
use std::sync::RwLock;
let guard = self.tree.read().unwrap();

// AFTER
use parking_lot::RwLock;
let guard = self.tree.read();  // No unwrap needed

For the MusicFs struct in filesystem.rs, the tree field is Arc<RwLock<VirtualTree>> — this is passed in from main.rs. Change main.rs to use parking_lot::RwLock there too.

Step 4: Verify

cargo test                          # All 162+ tests pass
cargo test -p musicfs-test-utils    # Resilience tests pass
cargo check                         # No warnings

4.3 Issue 2.2: Panic Hook

Approach: Install a custom panic hook at daemon startup that logs the panic with tracing::error! before the default behavior (abort/unwind). This ensures panics are captured in log files and journald.

Step 1: Stubs

Add to musicfs-core/src/lib.rs:

pub fn install_panic_hook() {
    // stub — will be implemented
}

Step 2: RED tests

Write in tests/resilience.rs:

#[test]
fn test_panic_hook_logs_to_tracing() {
    // Install hook with a test tracing subscriber
    // Trigger panic via catch_unwind
    // Verify error! log contains panic message + thread name
}

Step 3: Implementation

File: musicfs-core/src/lib.rs (or new musicfs-core/src/panic.rs)

pub fn install_panic_hook() {
    let default_hook = std::panic::take_hook();
    std::panic::set_hook(Box::new(move |info| {
        let thread = std::thread::current();
        let thread_name = thread.name().unwrap_or("<unnamed>");

        let message = if let Some(s) = info.payload().downcast_ref::<&str>() {
            s.to_string()
        } else if let Some(s) = info.payload().downcast_ref::<String>() {
            s.clone()
        } else {
            "unknown panic".to_string()
        };

        let location = info.location().map(|l| format!("{}:{}:{}", l.file(), l.line(), l.column()))
            .unwrap_or_else(|| "unknown location".to_string());

        tracing::error!(
            thread = thread_name,
            location = %location,
            "PANIC: {}",
            message
        );

        default_hook(info);
    }));
}

Call site: musicfs-cli/src/main.rs, at the very top of main():

fn main() -> Result<()> {
    musicfs_core::install_panic_hook();
    let cli = Cli::parse();
    // ...
}

Step 4: Verify

cargo test -p musicfs-core          # Panic hook unit tests
cargo test -p musicfs-test-utils    # Resilience tests

4.4 Issue 3.7 + 2.7: systemd Service Fix + FUSE Cleanup

Approach: Fix the systemd service file and add stale mount detection on startup.

Step 1: No stubs needed (config change)

Step 2: RED tests

Already exists: test_systemd_service_has_execstoppost — currently fails because service file lacks ExecStopPost.

Add test for stale mount detection:

#[test]
fn test_stale_mount_check_function_exists() {
    // Verify the function signature exists
    // (actual mount test needs privileged environment)
}

Step 3: Implementation

File: dist/musicfs.service

 ExecStop=/usr/bin/musicfs shutdown
+ExecStopPost=/usr/bin/fusermount -uz %h/music || true
 Restart=on-failure

Note: fusermount -uz is "lazy unmount" — always succeeds even if mount is busy. The || true prevents systemd from treating cleanup failure as a service failure.

File: musicfs-cli/src/main.rs — add stale mount check before mounting:

fn check_stale_mount(mountpoint: &Path) -> Result<()> {
    // Check /proc/mounts for existing mount at this path
    if let Ok(mounts) = std::fs::read_to_string("/proc/mounts") {
        for line in mounts.lines() {
            if line.contains(&mountpoint.to_string_lossy().as_ref()) && line.contains("fuse") {
                warn!("Stale FUSE mount detected at {:?}, attempting cleanup", mountpoint);
                let status = std::process::Command::new("fusermount")
                    .args(["-uz", &mountpoint.to_string_lossy()])
                    .status();
                match status {
                    Ok(s) if s.success() => info!("Stale mount cleaned up"),
                    Ok(s) => warn!("fusermount exited with: {}", s),
                    Err(e) => warn!("Failed to run fusermount: {}", e),
                }
            }
        }
    }
    Ok(())
}

Also fix the test_systemd_service_has_execstoppost test path — currently points to wrong location.

Step 4: Verify

cargo test -p musicfs-test-utils -- test_systemd  # Service file test

4.5 Issue 2.10: sd_notify Integration

Approach: Add sd-notify crate, call READY=1 after mount, STOPPING on shutdown.

Step 1: Stubs

Add dependency to musicfs-cli/Cargo.toml:

sd-notify = "0.4"

Step 2: RED tests

Write test that mocks the notify socket:

#[test]
fn test_sd_notify_ready_sent() {
    // Create Unix datagram socket at $NOTIFY_SOCKET
    // Call sd_notify::notify(READY=1)
    // Verify message received on socket
}

Step 3: Implementation

File: musicfs-cli/src/main.rs

After fs.mount() succeeds (or more precisely, after spawn_mount2 — see 4.6):

// Notify systemd we're ready
if let Err(e) = sd_notify::notify(false, &[sd_notify::NotifyState::Ready]) {
    debug!("sd_notify not available (not running under systemd): {}", e);
}

On shutdown path:

let _ = sd_notify::notify(false, &[sd_notify::NotifyState::Stopping]);

Step 4: Verify

cargo test -p musicfs-test-utils -- test_sd_notify
cargo build -p musicfs-cli  # Verify it compiles with new dep

4.6 Issue 2.1: Signal Handling

Approach: Switch from fuser::mount2 (blocking) to fuser::spawn_mount2 (background), then listen for signals on the main thread.

This is the most complex change in Phase A. It restructures the daemon's main loop.

Step 1: Stubs

Change MusicFs::mount() signature to return a session handle:

// BEFORE
pub fn mount(self, mountpoint: &Path) -> Result<()> {
    fuser::mount2(self, mountpoint, &options)?;
    Ok(())
}

// AFTER (stub — returns BackgroundSession)
pub fn spawn_mount(self, mountpoint: &Path) -> Result<fuser::BackgroundSession> {
    let session = fuser::spawn_mount2(self, mountpoint, &options)?;
    Ok(session)
}

Keep old mount() temporarily for compatibility.

Step 2: RED tests

Write in tests/resilience.rs:

#[tokio::test]
async fn test_sigterm_triggers_shutdown() {
    // Spawn daemon as child process
    // Wait for mount
    // Send SIGTERM
    // Verify clean exit within 10s
    // Verify mountpoint is unmounted
}

This test requires the signal handler to exist. It will be RED until implementation.

Step 3: Implementation

File: musicfs-cli/src/main.rs — rewrite run_mount():

fn run_mount(mountpoint: PathBuf, origin_path: Option<PathBuf>, cache_dir: Option<PathBuf>) -> Result<()> {
    let origin_path = origin_path.context("--origin is required")?;
    let runtime = tokio::runtime::Runtime::new()?;
    let handle = runtime.handle().clone();

    let (tree, reader) = runtime.block_on(async {
        // ... existing setup code (unchanged) ...
        Ok::<_, anyhow::Error>((tree, reader))
    })?;

    // Check for stale mount before mounting
    check_stale_mount(&mountpoint)?;

    let fs = MusicFs::with_reader(tree, reader, handle.clone());
    info!("Mounting filesystem at {:?}", mountpoint);

    // spawn_mount2 returns immediately — FUSE runs in background
    let session = fs.spawn_mount(&mountpoint)
        .context("Failed to mount filesystem")?;

    // Notify systemd
    let _ = sd_notify::notify(false, &[sd_notify::NotifyState::Ready]);
    info!("MusicFS ready, PID {}", std::process::id());

    // Block on signal
    runtime.block_on(async {
        let mut sigterm = tokio::signal::unix::signal(
            tokio::signal::unix::SignalKind::terminate()
        )?;
        let mut sigint = tokio::signal::unix::signal(
            tokio::signal::unix::SignalKind::interrupt()
        )?;

        tokio::select! {
            _ = sigterm.recv() => {
                info!("Received SIGTERM, shutting down");
            }
            _ = sigint.recv() => {
                info!("Received SIGINT, shutting down");
            }
        }

        Ok::<_, anyhow::Error>(())
    })?;

    // Shutdown sequence
    let _ = sd_notify::notify(false, &[sd_notify::NotifyState::Stopping]);
    info!("Unmounting filesystem");
    drop(session); // BackgroundSession::drop() calls unmount
    info!("Shutdown complete");

    Ok(())
}

File: musicfs-fuse/src/filesystem.rs — add spawn_mount():

pub fn spawn_mount(self, mountpoint: &Path) -> Result<fuser::BackgroundSession> {
    info!("Mounting MusicFS at {:?}", mountpoint);
    let options = vec![
        fuser::MountOption::RO,
        fuser::MountOption::FSName("musicfs".to_string()),
        fuser::MountOption::AutoUnmount,
        fuser::MountOption::AllowOther,
    ];
    let session = fuser::spawn_mount2(self, mountpoint, &options)
        .map_err(musicfs_core::Error::Io)?;
    Ok(session)
}

Step 4: Verify

cargo build -p musicfs-cli
cargo test -p musicfs-test-utils -- test_sigterm  # Process-level test
cargo test                                         # No regressions

5. Cross-Cutting Concerns

5.1 Security & Privacy

No new attack surface — changes are internal lifecycle management
Panic hook does NOT log sensitive data (only panic message, thread name, location)
sd_notify uses existing systemd socket — no new IPC

5.2 Observability

Panic hook ensures all panics are captured in logs/journald
Signal handling logs which signal triggered shutdown
sd_notify gives systemd accurate service state
Stale mount detection logs cleanup attempts

5.3 Testing

All changes follow the TDD flow:

Stubs compile
RED tests document expected behavior
Implementation turns tests GREEN
Full suite passes (no regressions)

6. Alternatives Considered

6.1 Poison Recovery Instead of parking_lot

Alternative: Keep std::sync::RwLock, add .unwrap_or_else(|p| p.into_inner()) to every lock call.

Rejected: 30+ call sites to change, easy to miss one, and the pattern is verbose. parking_lot is already a dependency and is strictly better for this use case (faster, no poison, correct API).

6.2 Keep mount2 (blocking) with Signal Thread

Alternative: Keep fuser::mount2, spawn a separate thread for signal handling, use a channel to communicate shutdown.

Rejected: mount2 consumes self and blocks — there's no clean way to interrupt it from another thread. spawn_mount2 is the canonical solution from the fuser crate.

6.3 Defer sd_notify Until Full Shutdown Orchestration

Alternative: Implement sd_notify only after CancellationToken + graceful shutdown are in place.

Rejected: sd_notify READY=1 is critical now — without it, Type=notify in the service file means systemd will timeout and kill the daemon on every start. The shutdown STOPPING notification is a bonus but not required for Phase A.

7. Implementation Plan

7.1 Task Sequence

Day	Task	Issue	Effort	Test Approach
1 (morning)	RwLock → parking_lot migration	2.9	2h	Existing GREEN test validates; verify no `.unwrap()` on locks
1 (afternoon)	Panic hook	2.2	2h	New test: panic → verify tracing output
2 (morning)	systemd ExecStopPost + stale mount check	3.7 + 2.7	2h	Existing RED test → GREEN; new stale mount test
2 (afternoon)	sd_notify integration	2.10	2h	New test: mock socket → verify READY=1
3	Signal handling (spawn_mount2 + signal loop)	2.1	4h	Fork daemon → send SIGTERM → verify exit
4	Integration + regression testing	—	4h	Full `cargo test`, manual FUSE mount test
5	Buffer for issues found during integration	—	4h	—

7.2 Verification Checklist

After all tasks complete:

cargo check — zero errors, zero warnings
cargo test — all 162+ existing tests pass
cargo test -p musicfs-test-utils — all resilience tests pass
cargo clippy — no new warnings
grep -r '\.read()\.unwrap()\|\.write()\.unwrap()' crates/ — zero hits in production code (test code is OK)
dist/musicfs.service contains ExecStopPost
Manual test: musicfs mount, then kill -TERM <pid>, verify clean exit + mount gone
Manual test: kill -9 <pid>, then musicfs mount again — no "already mounted" error

8. Files Changed

File	Change	Issue
`musicfs-fuse/src/filesystem.rs`	`std::sync::RwLock` → `parking_lot::RwLock`; add `spawn_mount()`	2.9, 2.1
`musicfs-cas/src/reader.rs`	`std::sync::RwLock` → `parking_lot::RwLock`	2.9
`musicfs-cas/src/fetcher.rs`	`std::sync::RwLock` → `parking_lot::RwLock`	2.9
`musicfs-origins/src/registry.rs`	`std::sync::RwLock` → `parking_lot::RwLock`	2.9
`musicfs-cache/src/eviction.rs`	`std::sync::RwLock` → `parking_lot::RwLock`	2.9
`musicfs-cache/src/tree.rs`	`std::sync::RwLock` → `parking_lot::RwLock`	2.9
`musicfs-core/src/metrics.rs`	`std::sync::RwLock` → `parking_lot::RwLock`	2.9
`musicfs-core/src/lib.rs`	Add `install_panic_hook()`	2.2
`musicfs-cli/src/main.rs`	Panic hook, signal handler, spawn_mount2, sd_notify, stale mount check	2.1, 2.2, 2.7, 2.10
`musicfs-cli/Cargo.toml`	Add `sd-notify`, `tokio-util` deps	2.10, 2.1
`dist/musicfs.service`	Add `ExecStopPost`, fix `ExecStop`	3.7
`tests/resilience.rs`	Update/add tests for signal, panic hook, sd_notify	all

9. Glossary / References

Term	Definition
parking_lot	Fast, poison-free lock implementation. Already a project dependency.
spawn_mount2	`fuser` API that mounts FUSE in a background thread, returning a `BackgroundSession` handle
sd_notify	systemd notification protocol. `READY=1` signals service started, `STOPPING` signals shutdown.
BackgroundSession	Handle returned by `spawn_mount2`. Dropping it unmounts the filesystem.

Document	Path
Resilience audit	resilience-fault-tolerance.md
Resilience testing	resilience-testing.md
Architecture	architecture.md

19 KiB Raw Permalink Blame History

Phase A: Stop Dying — Implementation Plan

1. Abstract

2. Background

3. Goals & Non-Goals

3.1 Goals

3.2 Non-Goals

4. Proposed Design

4.1 Implementation Order

4.2 Issue 2.9: RwLock Poison Fix

Step 1: Stubs (compile)

Step 2: RED tests

Step 3: Implementation

Step 4: Verify

4.3 Issue 2.2: Panic Hook

Step 1: Stubs

Step 2: RED tests

Step 3: Implementation

Step 4: Verify

4.4 Issue 3.7 + 2.7: systemd Service Fix + FUSE Cleanup

Step 1: No stubs needed (config change)

Step 2: RED tests

Step 3: Implementation

Step 4: Verify

4.5 Issue 2.10: sd_notify Integration

Step 1: Stubs

Step 2: RED tests

Step 3: Implementation

Step 4: Verify

4.6 Issue 2.1: Signal Handling

Step 1: Stubs

Step 2: RED tests

Step 3: Implementation

Step 4: Verify

5. Cross-Cutting Concerns

5.1 Security & Privacy

5.2 Observability

5.3 Testing

6. Alternatives Considered

6.1 Poison Recovery Instead of parking_lot

6.2 Keep mount2 (blocking) with Signal Thread

6.3 Defer sd_notify Until Full Shutdown Orchestration

7. Implementation Plan

7.1 Task Sequence

7.2 Verification Checklist

8. Files Changed

9. Glossary / References

19 KiB

Raw Permalink Blame History