MusicFS/docs/v2/plans/phase-b-crash-recovery.md

# Phase B: Crash Recovery — Implementation Plan

**Authors:** AI-assisted
**Status:** Draft
**Last Updated:** 2026-05-13
**Reviewers:** TBD
**Approvers:** TBD
**Prerequisites:** [phase-a-stop-dying.md](phase-a-stop-dying.md) (completed), [resilience-fault-tolerance.md](resilience-fault-tolerance.md)
**Estimated Effort:** ~5 days

---

[TOC]

---

## 1. Abstract

Phase A made the daemon survive signals and panics. Phase B makes it **recover from crashes**: startup integrity checks for all storage layers (SQLite, tantivy, sled), graceful shutdown with ordered teardown of background tasks, a CAS size-limit pre-check, and a task supervisor that restarts dead background tasks.

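This covers issues 2.3, 2.4, 2.6, and 2.8 from the [resilience audit](resilience-fault-tolerance.md), deferred from Phase A, plus the related sled (3.5) and tantivy uncommitted-write (5.2) items tracked in the test table in §5.3.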

Issue 2.5 (interrupted sync recovery) is deferred to after [persistent state](persistent-state.md) is wired up — checkpoint/resume requires the DB to be in the mount path.

**RED tests to turn GREEN** (from current `resilience.rs`):
- `test_sqlite_integrity_check_detects_corruption` — currently `todo!()`
- `test_tantivy_corruption_triggers_rebuild` — currently `todo!()`
- `test_sled_corruption_triggers_repair` — currently `todo!()`
- `test_cas_put_handles_enospc` — currently fails (no size pre-check)
- `test_tantivy_survives_uncommitted_crash` — currently `todo!()`

**New tests to write:**
- Shutdown orchestration: CancellationToken propagation, ordered teardown, tantivy flush
- Task supervisor: panic detection, restart with backoff, status reporting

---

## 2. Background

### 2.1 What Phase A Delivered

- Signal handling via `spawn_mount2` + tokio signal loop ✅
- Panic hook logging via `tracing::error!` ✅
- RwLock → `parking_lot` (no more poison cascade) ✅
- sd_notify READY/STOPPING ✅
- ExecStopPost + stale mount detection ✅

### 2.2 What's Still Broken After Phase A

The daemon now **stops cleanly** on signals but:

1. **Shutdown is unordered** — `drop(session)` unmounts FUSE, but background tasks (health monitor, indexer, watcher, prefetcher) are killed mid-operation by runtime drop. No tantivy flush, no SQLite checkpoint.

2. **No startup integrity checks** — if the daemon was `kill -9`'d (or OOM-killed, power loss), SQLite/tantivy/sled may have partial writes. Currently these propagate as runtime errors or silent corruption.

3. **Background tasks are fire-and-forget** — health monitor, watcher, indexer, prefetcher use `tokio::spawn` with no `JoinHandle` stored. If a task panics, it's silently dead.

4. **CAS accepts oversized writes** — `put()` doesn't check `max_size` before writing. Cache grows unbounded.

---

## 3. Goals & Non-Goals

### 3.1 Goals

- Graceful shutdown flushes tantivy, checkpoints SQLite WAL, stops background tasks in order
- Corrupted SQLite detected on open via `PRAGMA integrity_check`
- Corrupted tantivy index detected and rebuilt from scratch
- Corrupted sled index detected and repaired
- CAS rejects writes that would exceed `max_size`
- Background tasks are supervised — panics detected, critical tasks restarted
- All 5 RED tests turn GREEN
- All new tests for shutdown + supervisor are GREEN

### 3.2 Non-Goals

- Interrupted sync recovery (2.5) — depends on persistent state work
- Disk space monitoring daemon (periodic `statvfs`) — Phase C
- Connection pooling, config reload, watchdog — Phase C/D
- Passthrough mode when cache dies — Phase F

---

## 4. Proposed Design

### 4.1 Implementation Order

```
4.2  CAS size pre-check              (no deps, simplest fix)
 ↓
4.3  SQLite integrity check           (no deps)
 ↓
4.4  tantivy corruption recovery      (no deps)
 ↓
4.5  sled corruption recovery         (no deps)
 ↓
4.6  Graceful shutdown orchestration  (depends on: Phase A signal handler)
 ↓
4.7  Task supervisor                  (depends on: 4.6 CancellationToken)
```

### 4.2 Issue 2.8: CAS Size Pre-Check

**Problem**: `CasStore::put()` writes data without checking if it would exceed `max_size`. The existing test `test_cas_put_handles_enospc` creates a store with `max_size: 100` and writes 1000 bytes — currently succeeds when it should fail.

#### Step 1: Stubs — none needed

#### Step 2: RED test — already exists

```rust
// Currently FAILS — this is what we need to fix
#[tokio::test]
async fn test_cas_put_handles_enospc() {
    let store = CasStore::open(CasConfig { max_size: 100, ... }).await.unwrap();
    let large_data = vec![0u8; 1000];
    let result = store.put(&large_data).await;
    assert!(result.is_err());
}
```

#### Step 3: Implementation

**File**: `musicfs-cas/src/store.rs` — add size check at top of `put()`:

```rust
pub async fn put(&self, data: &[u8]) -> Result<ChunkHash, CasError> {
    let hash = ChunkHash::from_bytes(data);
    let path = self.chunk_path(&hash);

    if path.exists() {
        trace!(hash = %hash, size_bytes = data.len(), "dedup hit");
        return Ok(hash);
    }

    // NEW: Pre-check size limit (the guard treats max_size == 0 as "no limit").
    // Load the counter once so the comparison, the log line, and the error agree.
    if self.config.max_size > 0 {
        let current = self.current_size.load(Ordering::SeqCst);
        let new_size = current.saturating_add(data.len() as u64);
        if new_size > self.config.max_size {
            warn!(
                current_size = current,
                chunk_size = data.len(),
                max_size = self.config.max_size,
                "CAS store full, rejecting write"
            );
            return Err(CasError::StoreFull {
                current,
                max: self.config.max_size,
            });
        }
    }

    // ... rest of put() unchanged
}
```

Also add new error variant:

```rust
pub enum CasError {
    // ... existing variants
    #[error("Store full: {current} / {max} bytes")]
    StoreFull { current: u64, max: u64 },
}
```
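
The pre-check in `put()` reads the counter and then compares, so two concurrent `put()` calls could both pass the check and together exceed `max_size`. One possible way to close that gap, sketched here under assumptions, is to make check-and-reserve a single atomic step with `AtomicU64::fetch_update`. `SizeBudget`, `try_reserve`, `release`, and `used` are hypothetical names for illustration; they are not part of `CasStore`.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical helper: tracks bytes used against an optional cap.
pub struct SizeBudget {
    current: AtomicU64,
    /// 0 means "no limit", mirroring the `max_size > 0` guard in `put()`.
    max: u64,
}

impl SizeBudget {
    pub fn new(current: u64, max: u64) -> Self {
        Self { current: AtomicU64::new(current), max }
    }

    /// Atomically adds `len` bytes if the result stays within `max`.
    /// Returns `Ok(new_total)` on success, or `Err(unchanged_total)` on rejection.
    pub fn try_reserve(&self, len: u64) -> Result<u64, u64> {
        self.current
            .fetch_update(Ordering::SeqCst, Ordering::SeqCst, |cur| {
                // Overflow counts as "does not fit".
                let next = cur.checked_add(len)?;
                if self.max > 0 && next > self.max {
                    None
                } else {
                    Some(next)
                }
            })
            // fetch_update yields the previous value; convert to the new total.
            .map(|prev| prev + len)
    }

    /// Gives bytes back, e.g. when the write that reserved them fails.
    pub fn release(&self, len: u64) {
        let _ = self
            .current
            .fetch_update(Ordering::SeqCst, Ordering::SeqCst, |cur| {
                Some(cur.saturating_sub(len))
            });
    }

    pub fn used(&self) -> u64 {
        self.current.load(Ordering::SeqCst)
    }
}
```

In `put()`, this would mean reserving before the chunk is written and calling `release` if the write fails, so that a rejected or failed write never leaves the counter inflated.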

#### Step 4: Verify

```bash
cargo test -p musicfs-test-utils --test resilience -- test_cas_put_handles_enospc
```

---

### 4.3 Issue 2.4 (part 1): SQLite Integrity Check

**Problem**: `Database::open()` runs the schema setup but performs no integrity check. After a crash, corrupt pages can serve bad data silently.

#### Step 1: Stubs

Add to `musicfs-cache/src/db.rs`:

```rust
pub fn open_with_integrity_check(path: &Path) -> Result<Self> {
    todo!()
}
```

#### Step 2: RED test — already exists as `todo!()`

Replace the `todo!()` with a real test:

```rust
#[tokio::test]
async fn test_sqlite_integrity_check_detects_corruption() {
    let dir = TempDir::new().unwrap();
    let db_path = dir.path().join("test.db");

    // Create valid DB with data
    {
        let db = Database::open(&db_path).unwrap();
        db.upsert_file(
            &OriginId::from("test"),
            Path::new("/test.flac"),
            &VirtualPath::new("/Test.flac"),
            &AudioMeta::default(),
            UNIX_EPOCH,
            1000,
        ).unwrap();
    }

    // Corrupt the file
    let mut data = std::fs::read(&db_path).unwrap();
    let mid = data.len() / 2;
    data[mid..mid+100].fill(0xFF);
    std::fs::write(&db_path, &data).unwrap();

    // open_with_integrity_check should detect corruption
    let result = Database::open_with_integrity_check(&db_path);
    assert!(result.is_err());
}
```

#### Step 3: Implementation

**File**: `musicfs-cache/src/db.rs`

```rust
pub fn open_with_integrity_check(path: &Path) -> Result<Self> {
    debug!(?path, "Opening database with integrity check");

    let conn = Connection::open(path)
        .map_err(|e| Error::Database(format!("open failed: {}", e)))?;

    // integrity_check(1) stops after the first reported problem; a single
    // error is enough to reject the file
    let integrity: String = conn
        .query_row("PRAGMA integrity_check(1)", [], |row| row.get(0))
        .map_err(|e| Error::Database(format!("integrity check failed: {}", e)))?;

    if integrity != "ok" {
        warn!(path = ?path, result = %integrity, "Database integrity check failed");
        return Err(Error::DatabaseCorrupted(format!(
            "integrity check failed: {}", integrity
        )));
    }

    conn.execute_batch(SCHEMA)
        .map_err(|e| Error::Database(format!("schema init failed: {}", e)))?;

    let db = Self { conn: Arc::new(Mutex::new(conn)) };
    let count = db.file_count().unwrap_or(0);
    info!(path = ?path, file_count = count, "Database opened (integrity verified)");
    Ok(db)
}
```

Also add the error variant to `musicfs-core/src/error.rs`:

```rust
pub enum Error {
    // ... existing
    #[error("Database corrupted: {0}")]
    DatabaseCorrupted(String),
}
```

#### Step 4: Verify

```bash
cargo test -p musicfs-test-utils --test resilience -- test_sqlite_integrity
```

---

### 4.4 Issue 2.4 (part 2): tantivy Corruption Recovery

**Problem**: If tantivy `meta.json` or segment files are corrupted, `Index::open_in_dir()` panics or returns an error. No recovery path — daemon crashes.

#### Step 1: Stubs

Add to `musicfs-search/src/index.rs`:

```rust
pub fn open_with_recovery(index_path: &Path) -> Result<Self, SearchError> {
    todo!()
}
```

#### Step 2: RED test — replace `todo!()` with real test

```rust
#[tokio::test]
async fn test_tantivy_corruption_triggers_rebuild() {
    let dir = TempDir::new().unwrap();
    let index_path = dir.path().join("search_idx");

    // Create valid index with data
    {
        let index = SearchIndex::open(&index_path).unwrap();
        index.index_file(&make_file_meta(1, "/a.flac", 1000)).unwrap();
        index.commit().unwrap();
    }

    // Corrupt meta.json
    std::fs::write(index_path.join("meta.json"), b"corrupted").unwrap();

    // open_with_recovery should detect corruption and rebuild empty
    let index = SearchIndex::open_with_recovery(&index_path).unwrap();
    let results = index.search("a", 10).unwrap();
    assert_eq!(results.len(), 0); // Rebuilt empty but functional
}
```

Also replace the tantivy crash test `todo!()`:

```rust
#[test]
fn test_tantivy_survives_uncommitted_crash() {
    let dir = TempDir::new().unwrap();
    let index_path = dir.path().join("search_idx");

    {
        let index = SearchIndex::open(&index_path).unwrap();
        index.index_file(&make_file_meta(1, "/a.flac", 1000)).unwrap();
        index.commit().unwrap();
        // Write without commit, then "crash" (drop without commit)
        index.index_file(&make_file_meta(2, "/b.flac", 1000)).unwrap();
        // mem::forget would leak, just drop naturally
    }

    let index = SearchIndex::open(&index_path).unwrap();
    let results = index.search("a", 10).unwrap();
    assert_eq!(results.len(), 1); // Committed doc survives
}
```

#### Step 3: Implementation

**File**: `musicfs-search/src/index.rs`

```rust
pub fn open_with_recovery(index_path: &Path) -> Result<Self, SearchError> {
    match Self::open(index_path) {
        Ok(index) => {
            // Sanity check: reading the doc count confirms the reader loaded its segments
            let docs = index.reader.searcher().num_docs();
            info!(docs, "Search index opened successfully");
            Ok(index)
        }
        Err(e) => {
            warn!(
                error = %e,
                path = ?index_path,
                "Search index corrupted, rebuilding from scratch"
            );
            // Delete corrupted index
            if index_path.exists() {
                std::fs::remove_dir_all(index_path).map_err(SearchError::Io)?;
            }
            // Create fresh (empty) index; a second failure propagates to the caller
            Self::open(index_path)
        }
    }
}
```
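
The same detect, delete, recreate shape appears again in the sled fallback in 4.5, so it may help to see it in isolation. The following is a minimal sketch using only the standard library; `open_or_rebuild` and its `open` callback are hypothetical names for illustration and do not exist in `musicfs-search` or `musicfs-cas`.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Tries `open`; on failure removes `dir` and calls `open` exactly once more.
/// Returns the opened value plus a flag saying whether a rebuild happened.
pub fn open_or_rebuild<T, E, F>(dir: &Path, open: F) -> Result<(T, bool), E>
where
    F: Fn(&Path) -> Result<T, E>,
    E: From<io::Error>,
{
    match open(dir) {
        Ok(value) => Ok((value, false)),
        Err(_first_err) => {
            // The caller is expected to log the first error before retrying.
            match fs::remove_dir_all(dir) {
                Ok(()) => {}
                // Directory already gone: nothing to delete.
                Err(e) if e.kind() == io::ErrorKind::NotFound => {}
                Err(e) => return Err(E::from(e)),
            }
            // No retry loop: a second failure is returned to the caller.
            open(dir).map(|value| (value, true))
        }
    }
}
```

Returning the `rebuilt` flag lets the caller decide whether to schedule a re-index, which matters once the index can be repopulated from SQLite metadata.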

#### Step 4: Verify

```bash
cargo test -p musicfs-test-utils --test resilience -- test_tantivy
```

---

### 4.5 Issue 3.5: sled Corruption Recovery

**Problem**: `sled::open()` on a corrupted DB returns `sled::Error::Corruption` which propagates as `CasError::Sled` and crashes the daemon on startup.

#### Step 1: Stubs — none needed, modify existing `open()`

#### Step 2: RED test — replace `todo!()`

```rust
#[tokio::test]
async fn test_sled_corruption_triggers_repair() {
    let dir = TempDir::new().unwrap();
    let chunks_dir = dir.path().join("chunks");
    let config = CasConfig { chunks_dir: chunks_dir.clone(), max_size: 10_000_000, shard_levels: 2 };

    // Create valid store with data
    {
        let store = CasStore::open(config.clone()).await.unwrap();
        store.put(b"test data").await.unwrap();
    }

    // Corrupt sled index files
    let sled_dir = chunks_dir.join("index.sled");
    if sled_dir.exists() {
        for entry in std::fs::read_dir(&sled_dir).unwrap() {
            let entry = entry.unwrap();
            if entry.metadata().unwrap().is_file() {
                std::fs::write(entry.path(), b"corrupted").unwrap();
            }
        }
    }

    // Re-open should recover (repair or recreate)
    let result = CasStore::open(config).await;
    assert!(result.is_ok(), "sled should recover from corruption");
}
```

#### Step 3: Implementation

**File**: `musicfs-cas/src/store.rs` — modify `open()`:

```rust
pub async fn open(config: CasConfig) -> Result<Self, CasError> {
    fs::create_dir_all(&config.chunks_dir).await?;

    let index_path = config.chunks_dir.join("index.sled");
    let index = match sled::open(&index_path) {
        Ok(db) => db,
        Err(e) => {
            warn!(error = %e, path = ?index_path, "sled index corrupted, attempting recovery");

            // Try repair.
            // NOTE: confirm that the pinned sled release exposes `Config::repair`;
            // if it does not, drop this match and go straight to delete-and-recreate.
            match sled::Config::new().path(&index_path).repair(true).open() {
                Ok(db) => {
                    info!("sled index repaired successfully");
                    db
                }
                Err(repair_err) => {
                    warn!(error = %repair_err, "sled repair failed, recreating index");
                    // Delete and recreate (async fs, matching create_dir_all above)
                    if index_path.exists() {
                        fs::remove_dir_all(&index_path).await?;
                    }
                    sled::open(&index_path)?
                }
            }
        }
    };

    let current_size = Self::calculate_size(&config.chunks_dir).await;

    Ok(Self {
        config,
        index,
        current_size: AtomicU64::new(current_size),
    })
}
```

#### Step 4: Verify

```bash
cargo test -p musicfs-test-utils --test resilience -- test_sled_corruption
```

---

### 4.6 Issue 2.3: Graceful Shutdown Orchestration

**Problem**: On signal, `drop(session)` unmounts FUSE, then `drop(runtime)` kills all tokio tasks abruptly. No tantivy flush, no SQLite WAL checkpoint, no ordered task shutdown.

**Approach**: `CancellationToken` from `tokio_util` propagated to all background tasks. Signal triggers token cancellation, then ordered shutdown.

#### Step 1: Add dependency

```toml
# musicfs-cli/Cargo.toml
tokio-util = { version = "0.7", features = ["rt"] }
```

#### Step 2: Tests

```rust
#[tokio::test]
async fn test_shutdown_cancels_background_tasks() {
    let token = CancellationToken::new();
    let stopped = Arc::new(AtomicBool::new(false));
    let stopped_clone = stopped.clone();
    let token_clone = token.clone();

    let task = tokio::spawn(async move {
        token_clone.cancelled().await;
        stopped_clone.store(true, Ordering::SeqCst);
    });

    assert!(!stopped.load(Ordering::SeqCst));
    token.cancel();
    // Await the task with a bound instead of sleeping, so the test is not timing-sensitive
    tokio::time::timeout(Duration::from_secs(1), task).await.unwrap().unwrap();
    assert!(stopped.load(Ordering::SeqCst));
}

#[tokio::test]
async fn test_shutdown_flushes_tantivy() {
    let dir = TempDir::new().unwrap();
    let index_path = dir.path().join("idx");

    {
        let index = SearchIndex::open(&index_path).unwrap();
        index.index_file(&make_file_meta(1, "/a.flac", 1000)).unwrap();
        // Stand-in for the shutdown path: graceful shutdown should commit.
        // Once the shutdown sequence is wired up, call it here instead.
        index.commit().unwrap();
    } // drop the first handle before reopening, as the other tests do

    let index2 = SearchIndex::open(&index_path).unwrap();
    assert_eq!(index2.search("a", 10).unwrap().len(), 1);
}
```

#### Step 3: Implementation

**File**: `musicfs-cli/src/main.rs` — restructure the signal loop:

The current code:
```rust
// Wait for signal
runtime.block_on(async { ... signal select ... })?;
// Drop session, exit
```

Change to:
```rust
let shutdown_token = CancellationToken::new();

// TODO: Pass token to health monitor, watcher, indexer, prefetcher
// (requires their start() methods to accept CancellationToken)
// For now, we just use it for the shutdown sequence

runtime.block_on(async {
    // ... signal select ...

    // Ordered shutdown
    info!("Beginning ordered shutdown");
    shutdown_token.cancel();

    // Wait briefly for tasks to notice cancellation
    tokio::time::sleep(Duration::from_millis(500)).await;

    // Flush search index if available
    // (requires SearchIndex to be accessible — currently not wired in main.rs)

    info!("Background tasks stopped");
})?;
```

**Note**: Full CancellationToken propagation through health monitor, watcher, indexer, and prefetcher `start()` methods requires changing their signatures. The current `mpsc::channel<()>` stop mechanism in each task should be replaced with or supplemented by the token. This can be done incrementally — start by adding the token to `run_mount()`, then wire it into each task as they're touched.

For this phase, the minimum viable change is:
1. Create the token in `run_mount()`
2. Cancel it on signal
3. Add a brief sleep for tasks to notice
4. The existing `drop(session)` and runtime drop handle cleanup

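Full per-task `CancellationToken` wiring is tracked as follow-up work. The tantivy flush listed in §3.1 likewise depends on `SearchIndex` being reachable from `main.rs`, which is not yet the case.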
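
To make "ordered shutdown" concrete, here is a small standard-library sketch of the sequencing logic only: steps run in a fixed order, and a failing step is recorded without skipping the steps after it. `ShutdownStep` and `run_ordered_shutdown` are hypothetical names, and the step names used with it (cancel tasks, flush tantivy, checkpoint SQLite) are an assumed ordering based on the goals in §3.1, not the daemon's actual code.

```rust
/// One named teardown step. Hypothetical type for illustration.
pub struct ShutdownStep<'a> {
    pub name: &'static str,
    pub run: Box<dyn FnOnce() -> Result<(), String> + 'a>,
}

/// Runs every step in the order given.
/// A failing step is recorded and the remaining steps still run,
/// so one broken component cannot block the rest of the teardown.
/// Returns the (step name, error) pairs of the steps that failed.
pub fn run_ordered_shutdown(steps: Vec<ShutdownStep<'_>>) -> Vec<(&'static str, String)> {
    let mut failures = Vec::new();
    for step in steps {
        if let Err(e) = (step.run)() {
            failures.push((step.name, e));
        }
    }
    failures
}
```

In `run_mount()`, the first step would cancel the token and wait for tasks, with the storage flushes after it and the existing `drop(session)` last. The returned failures can be logged at WARN before exit.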

---

### 4.7 Issue 2.6: Task Supervisor

**Problem**: 13 `tokio::spawn()` calls with no `JoinHandle` stored. Dead tasks go unnoticed.

**Approach**: New `TaskSupervisor` struct in `musicfs-core` that stores handles, checks liveness, and restarts critical tasks.

#### Step 1: Stubs

**File**: `musicfs-core/src/supervisor.rs` (new file)

```rust
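use std::future::Future;
use std::time::Instant;

pub struct TaskSupervisor { /* fields defined in Step 3 */ }

pub enum TaskStatus {
    Running,
    Failed { error: String, at: Instant },
    Restarting { attempt: u32 },
    Stopped,
}

impl TaskSupervisor {
    pub fn new() -> Self { todo!() }

    /// Spawn a task that is monitored but not restarted.
    pub fn spawn_supervised<F>(&self, name: &str, future: F)
    where
        F: Future<Output = ()> + Send + 'static,
    { todo!() }

    /// Spawn a task that is restarted on failure; `factory` builds a fresh future per attempt.
    pub fn spawn_critical<F, Fut>(&self, name: &str, factory: F)
    where
        F: Fn() -> Fut + Send + Sync + 'static,
        Fut: Future<Output = ()> + Send + 'static,
    { todo!() }

    pub fn task_status(&self, name: &str) -> TaskStatus { todo!() }
    pub fn check_all(&self) -> Vec<(String, TaskStatus)> { todo!() }
}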
```

#### Step 2: Tests

```rust
#[tokio::test]
async fn test_supervisor_detects_task_completion() {
    let supervisor = TaskSupervisor::new();
    supervisor.spawn_supervised("fast", async { /* returns immediately */ });
    tokio::time::sleep(Duration::from_millis(50)).await;
    // Task completed normally: should be Stopped, not Failed
    assert!(matches!(supervisor.task_status("fast"), TaskStatus::Stopped));
}

#[tokio::test]
async fn test_supervisor_detects_panic() {
    let supervisor = TaskSupervisor::new();
    supervisor.spawn_supervised("panicker", async {
        panic!("boom");
    });
    tokio::time::sleep(Duration::from_millis(50)).await;
    assert!(matches!(supervisor.task_status("panicker"), TaskStatus::Failed { .. }));
}

#[tokio::test]
async fn test_supervisor_restarts_critical_task() {
    let count = Arc::new(AtomicU32::new(0));
    let c = count.clone();

    let supervisor = TaskSupervisor::new();
    supervisor.spawn_critical("restartable", move || {
        let c = c.clone();
        async move {
            let n = c.fetch_add(1, Ordering::SeqCst);
            if n == 0 { panic!("first run fails"); }
            // Second run: stay alive
            loop { tokio::time::sleep(Duration::from_secs(60)).await; }
        }
    });

    tokio::time::sleep(Duration::from_secs(2)).await;
    assert_eq!(count.load(Ordering::SeqCst), 2);
    assert!(matches!(supervisor.task_status("restartable"), TaskStatus::Running));
}
```

#### Step 3: Implementation

**File**: `musicfs-core/src/supervisor.rs`

```rust
use parking_lot::RwLock;
use std::collections::HashMap;
use std::sync::Arc;
use std::time::{Duration, Instant};
use tokio::task::JoinHandle;
use tracing::{error, info, warn};

pub struct TaskSupervisor {
    tasks: Arc<RwLock<HashMap<String, TaskEntry>>>,
}

struct TaskEntry {
    handle: JoinHandle<()>,
    status: TaskStatus,
    restart_count: u32,
    last_restart: Option<Instant>,
}

#[derive(Debug, Clone)]
pub enum TaskStatus {
    Running,
    Failed { error: String, at: Instant },
    Restarting { attempt: u32 },
    Stopped,
}

impl TaskSupervisor {
    pub fn new() -> Self {
        Self {
            tasks: Arc::new(RwLock::new(HashMap::new())),
        }
    }

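    pub fn spawn_supervised<F>(&self, name: &str, future: F)
    where
        F: std::future::Future<Output = ()> + Send + 'static,
    {
        let name_owned = name.to_string();
        let inner = tokio::spawn(future);

        // Hold the write lock until the entry is inserted, so the monitor
        // (which also takes the lock) can never update a missing entry.
        let mut tasks = self.tasks.write();

        // Monitor: awaits the real task and records how it ended.
        let tasks_monitor = self.tasks.clone();
        let name_monitor = name_owned.clone();
        let monitor_handle = tokio::spawn(async move {
            let status = match inner.await {
                // Normal return is not a failure
                Ok(()) => TaskStatus::Stopped,
                // JoinError: the task panicked or was cancelled
                Err(e) => {
                    error!(task = %name_monitor, error = %e, "Supervised task failed");
                    TaskStatus::Failed {
                        error: e.to_string(),
                        at: Instant::now(),
                    }
                }
            };
            let mut map = tasks_monitor.write();
            if let Some(entry) = map.get_mut(&name_monitor) {
                entry.status = status;
            }
        });

        tasks.insert(
            name_owned,
            TaskEntry {
                // Handle of the monitor task, which owns the inner task's handle
                handle: monitor_handle,
                status: TaskStatus::Running,
                restart_count: 0,
                last_restart: None,
            },
        );
    }

    pub fn task_status(&self, name: &str) -> TaskStatus {
        // Unknown task names are reported as Stopped
        self.tasks
            .read()
            .get(name)
            .map(|entry| entry.status.clone())
            .unwrap_or(TaskStatus::Stopped)
    }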
}
```

**Note**: The full `spawn_critical` with automatic restart requires a task factory (`Fn() -> Future`) pattern. The supervisor spawns a monitor task that awaits the `JoinHandle` and, on failure, calls the factory again with increasing backoff (1s→5s→30s, max 5 restarts). This is the most complex piece. Its implementation, along with `check_all`, is not shown here; `test_supervisor_restarts_critical_task` above pins down the expected behavior.
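
The restart policy in the note can be captured as a pure function, which keeps it easy to test separately from the async monitor loop. This is a sketch under one stated assumption: the note only gives three delays (1s, 5s, 30s) for up to five restarts, so restarts four and five are assumed to stay at 30s. `restart_delay` and `MAX_RESTARTS` are hypothetical names.

```rust
use std::time::Duration;

/// Maximum number of restarts before a critical task is left in `Failed`.
pub const MAX_RESTARTS: u32 = 5;

/// Delay to wait before restart number `attempt` (1-based).
/// Returns `None` when `attempt` is 0 or the restart budget is spent.
pub fn restart_delay(attempt: u32) -> Option<Duration> {
    match attempt {
        // Attempts are 1-based; 0 is the initial start, not a restart.
        0 => None,
        1 => Some(Duration::from_secs(1)),
        2 => Some(Duration::from_secs(5)),
        // Assumption: the delay stays capped at 30s for the remaining attempts.
        n if n <= MAX_RESTARTS => Some(Duration::from_secs(30)),
        _ => None,
    }
}
```

With a 1s first delay, the 2s wait in `test_supervisor_restarts_critical_task` is enough to observe exactly one restart. Under the 30s cap assumed here, a task that fails every time is given up on after 96 seconds of accumulated delay.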

---

## 5. Cross-Cutting Concerns

### 5.1 Security & Privacy

- `PRAGMA integrity_check` is read-only — no risk to data
- sled repair may lose recently-written entries — acceptable for a cache
- tantivy rebuild deletes index entirely — no sensitive data exposure (metadata only)

### 5.2 Observability

- SQLite integrity check result logged at INFO (ok) or WARN (failed)
- sled repair attempts logged at WARN
- tantivy rebuild logged at WARN with file count before/after
- CAS `StoreFull` error logged at WARN with current/max sizes
- Task supervisor logs all state transitions (started, failed, restarting, stopped)

### 5.3 Testing

| Test | Status Before | Status After | Issue |
|------|---------------|--------------|-------|
| `test_cas_put_handles_enospc` | ❌ FAILED | ✅ GREEN | 2.8 |
| `test_sqlite_integrity_check_detects_corruption` | ❌ todo!() | ✅ GREEN | 2.4 |
| `test_tantivy_corruption_triggers_rebuild` | ❌ todo!() | ✅ GREEN | 2.4 |
| `test_tantivy_survives_uncommitted_crash` | ❌ todo!() | ✅ GREEN | 5.2 |
| `test_sled_corruption_triggers_repair` | ❌ todo!() | ✅ GREEN | 3.5 |
| `test_shutdown_cancels_background_tasks` | NEW | ✅ GREEN | 2.3 |
| `test_shutdown_flushes_tantivy` | NEW | ✅ GREEN | 2.3 |
| `test_supervisor_detects_panic` | NEW | ✅ GREEN | 2.6 |
| `test_supervisor_restarts_critical_task` | NEW | ✅ GREEN | 2.6 |

---

## 6. Alternatives Considered

### 6.1 Full `PRAGMA integrity_check` vs Quick Check

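`PRAGMA integrity_check` scans the whole database, which is slow for large DBs (seconds for 1M rows). The argument in `PRAGMA integrity_check(1)` caps the number of reported errors at one, so the check stops early only when it finds corruption; on a healthy database it still performs the full scan. We use `integrity_check(1)` because a single error is enough to reject the file. If the startup cost on healthy databases proves too high, `PRAGMA quick_check` is the cheaper variant: it skips the UNIQUE-constraint checks and the check that index content matches table content.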

### 6.2 tantivy Repair vs Rebuild

tantivy has no built-in repair. If `meta.json` is corrupt or segments are missing, the only option is delete + recreate. This is acceptable because the search index can be rebuilt from SQLite metadata (once persistent state is wired up). For now, rebuild produces an empty index.

### 6.3 sled Repair vs Recreate

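The plan relies on sled's `Config::repair(true)` to attempt recovery; this should be confirmed against the sled version pinned in `musicfs-cas`, since the option may not be exposed by every release. If repair fails or is unavailable, we delete and recreate. After recreation, the index is empty but chunk files still exist on disk, so a future reconciliation pass can rebuild the index from chunk files (Phase F).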

### 6.4 Custom Supervisor vs `tokio-graceful` Crate

`tokio-graceful` provides shutdown coordination but not task restart. Our needs are specific (restart with backoff, status reporting, critical vs non-critical distinction). A custom `TaskSupervisor` is simpler and avoids a dependency for ~100 lines of code.

---

## 7. Implementation Plan

### 7.1 Task Sequence

| Day | Task | Issue | Effort | Test |
|-----|------|-------|--------|------|
| 1 (morning) | CAS size pre-check + `StoreFull` error variant | 2.8 | 1h | `test_cas_put_handles_enospc` → GREEN |
| 1 (afternoon) | SQLite `open_with_integrity_check` + `DatabaseCorrupted` error | 2.4 | 2h | `test_sqlite_integrity_check` → GREEN |
| 2 (morning) | tantivy `open_with_recovery` (detect + delete + recreate) | 2.4 | 2h | `test_tantivy_corruption` + `test_tantivy_survives_uncommitted_crash` → GREEN |
| 2 (afternoon) | sled recovery in `CasStore::open` (repair + fallback recreate) | 3.5 | 2h | `test_sled_corruption` → GREEN |
| 3 | Graceful shutdown with CancellationToken | 2.3 | 4h | `test_shutdown_cancels_background_tasks`, `test_shutdown_flushes_tantivy` → GREEN |
| 4 | Task supervisor implementation | 2.6 | 4h | `test_supervisor_detects_panic`, `test_supervisor_restarts` → GREEN |
| 5 | Integration + regression testing | — | 4h | Full `cargo test`, verify no regressions |

### 7.2 Verification Checklist

After all tasks:

- [ ] `cargo check` — zero errors, zero warnings
- [ ] `cargo test --workspace --exclude musicfs-grpc` — all tests pass (exclude pre-existing grpc issue)
- [ ] `cargo test -p musicfs-test-utils --test resilience` — 5 previously-RED tests now GREEN
- [ ] `cargo clippy` — no new warnings
- [ ] Remaining RED tests are only for Phases C-F (health timeout, parallel checks, fd exhaustion, chunk auto-repair, passthrough mode)

---

## 8. Files Changed

| File | Change | Issue |
|------|--------|-------|
| `musicfs-cas/src/store.rs` | Size pre-check in `put()`, `StoreFull` error, sled recovery in `open()` | 2.8, 3.5 |
| `musicfs-cache/src/db.rs` | `open_with_integrity_check()` with `PRAGMA integrity_check(1)` | 2.4 |
| `musicfs-core/src/error.rs` | Add `DatabaseCorrupted(String)` variant | 2.4 |
| `musicfs-search/src/index.rs` | `open_with_recovery()` — detect, delete, recreate | 2.4 |
| `musicfs-core/src/supervisor.rs` | NEW — `TaskSupervisor`, `TaskStatus`, spawn/monitor/restart | 2.6 |
| `musicfs-core/src/lib.rs` | Re-export supervisor module | 2.6 |
| `musicfs-cli/src/main.rs` | CancellationToken creation, ordered shutdown sequence | 2.3 |
| `musicfs-cli/Cargo.toml` | Add `tokio-util` dependency | 2.3 |
| `musicfs-test-utils/tests/resilience.rs` | Replace `todo!()` stubs with real tests, add supervisor tests | all |

---

## 9. Glossary / References

| Term | Definition |
|------|------------|
| **CancellationToken** | `tokio_util::sync::CancellationToken` — cooperative cancellation signal for async tasks |
| **PRAGMA integrity_check** | SQLite command that verifies page-level data consistency |
| **sled repair** | sled's built-in recovery mode that attempts to reconstruct a corrupted database |
| **TaskSupervisor** | New struct that monitors `JoinHandle`s and restarts failed tasks with backoff |
| **StoreFull** | New `CasError` variant returned when a write would exceed `max_size` |

| Document | Path |
|----------|------|
| Phase A plan | [phase-a-stop-dying.md](phase-a-stop-dying.md) |
| Resilience audit | [resilience-fault-tolerance.md](resilience-fault-tolerance.md) |
| Resilience testing | [resilience-testing.md](resilience-testing.md) |
| Persistent state | [persistent-state.md](persistent-state.md) |