Files

T

Alexander 4e394c60ec Add Phase B implementation plan (Crash Recovery)

BlueDoc covering 6 issues with TDD flow:
- 2.8: CAS size pre-check (StoreFull error variant)
- 2.4: SQLite PRAGMA integrity_check on open
- 2.4: tantivy open_with_recovery (detect + rebuild)
- 3.5: sled corruption repair + fallback recreate
- 2.3: Graceful shutdown with CancellationToken
- 2.6: TaskSupervisor (monitor, detect panic, restart)
Turns 5 RED tests GREEN, adds 4 new tests. ~5 days.

2026-05-13 14:56:43 +02:00

27 KiB

Raw Permalink Blame History

Phase B: Crash Recovery — Implementation Plan

Authors: AI-assisted
Status: Draft
Last Updated: 2026-05-13
Reviewers: TBD
Approvers: TBD
Prerequisites: phase-a-stop-dying.md (completed), resilience-fault-tolerance.md
Estimated Effort: ~5 days

[TOC]

1. Abstract

Phase A made the daemon survive signals and panics. Phase B makes it recover from crashes — startup integrity checks for all storage layers (SQLite, tantivy, sled), graceful shutdown with ordered teardown of background tasks, disk space pre-checks, and a task supervisor that restarts dead background tasks.

This covers issues 2.3, 2.4, 2.6, and 2.8 from the resilience audit, deferred from Phase A.

Issue 2.5 (interrupted sync recovery) is deferred to after persistent state is wired up — checkpoint/resume requires the DB to be in the mount path.

RED tests to turn GREEN (from current resilience.rs):

test_sqlite_integrity_check_detects_corruption — currently todo!()
test_tantivy_corruption_triggers_rebuild — currently todo!()
test_sled_corruption_triggers_repair — currently todo!()
test_cas_put_handles_enospc — currently fails (no size pre-check)
test_tantivy_survives_uncommitted_crash — currently todo!()

New tests to write:

Shutdown orchestration: CancellationToken propagation, ordered teardown, tantivy flush
Task supervisor: panic detection, restart with backoff, status reporting

2. Background

2.1 What Phase A Delivered

Signal handling via spawn_mount2 + tokio signal loop ✅
Panic hook logging via tracing::error! ✅
RwLock → parking_lot (no more poison cascade) ✅
sd_notify READY/STOPPING ✅
ExecStopPost + stale mount detection ✅

2.2 What's Still Broken After Phase A

The daemon now stops cleanly on signals but:

Shutdown is unordered — drop(session) unmounts FUSE, but background tasks (health monitor, indexer, watcher, prefetcher) are killed mid-operation by runtime drop. No tantivy flush, no SQLite checkpoint.
No startup integrity checks — if the daemon was kill -9'd (or OOM-killed, power loss), SQLite/tantivy/sled may have partial writes. Currently these propagate as runtime errors or silent corruption.
Background tasks are fire-and-forget — health monitor, watcher, indexer, prefetcher use tokio::spawn with no JoinHandle stored. If a task panics, it's silently dead.
CAS accepts oversized writes — put() doesn't check max_size before writing. Cache grows unbounded.

3. Goals & Non-Goals

3.1 Goals

Graceful shutdown flushes tantivy, checkpoints SQLite WAL, stops background tasks in order
Corrupted SQLite detected on open via PRAGMA integrity_check
Corrupted tantivy index detected and rebuilt from scratch
Corrupted sled index detected and repaired
CAS rejects writes that would exceed max_size
Background tasks are supervised — panics detected, critical tasks restarted
All 5 RED tests turn GREEN
All new tests for shutdown + supervisor are GREEN

3.2 Non-Goals

Interrupted sync recovery (2.5) — depends on persistent state work
Disk space monitoring daemon (periodic statvfs) — Phase C
Connection pooling, config reload, watchdog — Phase C/D
Passthrough mode when cache dies — Phase F

4. Proposed Design

4.1 Implementation Order

4.2  CAS size pre-check              (no deps, simplest fix)
 ↓
4.3  SQLite integrity check           (no deps)
 ↓
4.4  tantivy corruption recovery      (no deps)
 ↓
4.5  sled corruption recovery         (no deps)
 ↓
4.6  Graceful shutdown orchestration  (depends on: Phase A signal handler)
 ↓
4.7  Task supervisor                  (depends on: 4.6 CancellationToken)

4.2 Issue 2.8: CAS Size Pre-Check

Problem: CasStore::put() writes data without checking if it would exceed max_size. The existing test test_cas_put_handles_enospc creates a store with max_size: 100 and writes 1000 bytes — currently succeeds when it should fail.

Step 1: Stubs — none needed

Step 2: RED test — already exists

// Currently FAILS — this is what we need to fix
#[tokio::test]
async fn test_cas_put_handles_enospc() {
    let store = CasStore::open(CasConfig { max_size: 100, ... }).await.unwrap();
    let large_data = vec![0u8; 1000];
    let result = store.put(&large_data).await;
    assert!(result.is_err());
}

Step 3: Implementation

File: musicfs-cas/src/store.rs — add size check at top of put():

pub async fn put(&self, data: &[u8]) -> Result<ChunkHash, CasError> {
    let hash = ChunkHash::from_bytes(data);
    let path = self.chunk_path(&hash);

    if path.exists() {
        trace!(hash = %hash, size_bytes = data.len(), "dedup hit");
        return Ok(hash);
    }

    // NEW: Pre-check size limit
    if self.config.max_size > 0 {
        let new_size = self.current_size.load(Ordering::SeqCst) + data.len() as u64;
        if new_size > self.config.max_size {
            warn!(
                current_size = self.current_size.load(Ordering::SeqCst),
                chunk_size = data.len(),
                max_size = self.config.max_size,
                "CAS store full, rejecting write"
            );
            return Err(CasError::StoreFull {
                current: self.current_size.load(Ordering::SeqCst),
                max: self.config.max_size,
            });
        }
    }

    // ... rest of put() unchanged
}

Also add new error variant:

pub enum CasError {
    // ... existing variants
    #[error("Store full: {current} / {max} bytes")]
    StoreFull { current: u64, max: u64 },
}

Step 4: Verify

cargo test -p musicfs-test-utils --test resilience -- test_cas_put_handles_enospc

4.3 Issue 2.4 (part 1): SQLite Integrity Check

Problem: Database::open() runs schema but no integrity check. After crash, corrupt pages serve bad data silently.

Step 1: Stubs

Add to musicfs-cache/src/db.rs:

pub fn open_with_integrity_check(path: &Path) -> Result<Self> {
    todo!()
}

Step 2: RED test — already exists as `todo!()`

Replace the todo!() with a real test:

#[tokio::test]
async fn test_sqlite_integrity_check_detects_corruption() {
    let dir = TempDir::new().unwrap();
    let db_path = dir.path().join("test.db");

    // Create valid DB with data
    {
        let db = Database::open(&db_path).unwrap();
        db.upsert_file(
            &OriginId::from("test"),
            Path::new("/test.flac"),
            &VirtualPath::new("/Test.flac"),
            &AudioMeta::default(),
            UNIX_EPOCH,
            1000,
        ).unwrap();
    }

    // Corrupt the file
    let mut data = std::fs::read(&db_path).unwrap();
    let mid = data.len() / 2;
    data[mid..mid+100].fill(0xFF);
    std::fs::write(&db_path, &data).unwrap();

    // open_with_integrity_check should detect corruption
    let result = Database::open_with_integrity_check(&db_path);
    assert!(result.is_err());
}

Step 3: Implementation

File: musicfs-cache/src/db.rs

pub fn open_with_integrity_check(path: &Path) -> Result<Self> {
    debug!(?path, "Opening database with integrity check");

    let conn = Connection::open(path)
        .map_err(|e| Error::Database(format!("open failed: {}", e)))?;

    // Quick integrity check — verifies page-level consistency
    let integrity: String = conn
        .query_row("PRAGMA integrity_check(1)", [], |row| row.get(0))
        .map_err(|e| Error::Database(format!("integrity check failed: {}", e)))?;

    if integrity != "ok" {
        warn!(path = ?path, result = %integrity, "Database integrity check failed");
        return Err(Error::DatabaseCorrupted(format!(
            "integrity check failed: {}", integrity
        )));
    }

    conn.execute_batch(SCHEMA)
        .map_err(|e| Error::Database(format!("schema init failed: {}", e)))?;

    let db = Self { conn: Arc::new(Mutex::new(conn)) };
    let count = db.file_count().unwrap_or(0);
    info!(path = ?path, file_count = count, "Database opened (integrity verified)");
    Ok(db)
}

Also add the error variant to musicfs-core/src/error.rs:

pub enum Error {
    // ... existing
    #[error("Database corrupted: {0}")]
    DatabaseCorrupted(String),
}

Step 4: Verify

cargo test -p musicfs-test-utils --test resilience -- test_sqlite_integrity

4.4 Issue 2.4 (part 2): tantivy Corruption Recovery

Problem: If tantivy meta.json or segment files are corrupted, Index::open_in_dir() panics or returns an error. No recovery path — daemon crashes.

Step 1: Stubs

Add to musicfs-search/src/index.rs:

pub fn open_with_recovery(index_path: &Path) -> Result<Self, SearchError> {
    todo!()
}

Step 2: RED test — replace `todo!()` with real test

#[tokio::test]
async fn test_tantivy_corruption_triggers_rebuild() {
    let dir = TempDir::new().unwrap();
    let index_path = dir.path().join("search_idx");

    // Create valid index with data
    {
        let index = SearchIndex::open(&index_path).unwrap();
        index.index_file(&make_file_meta(1, "/a.flac", 1000)).unwrap();
        index.commit().unwrap();
    }

    // Corrupt meta.json
    std::fs::write(index_path.join("meta.json"), b"corrupted").unwrap();

    // open_with_recovery should detect corruption and rebuild empty
    let index = SearchIndex::open_with_recovery(&index_path).unwrap();
    let results = index.search("a", 10).unwrap();
    assert_eq!(results.len(), 0); // Rebuilt empty but functional
}

Also replace the tantivy crash test todo!():

#[test]
fn test_tantivy_survives_uncommitted_crash() {
    let dir = TempDir::new().unwrap();
    let index_path = dir.path().join("search_idx");

    {
        let index = SearchIndex::open(&index_path).unwrap();
        index.index_file(&make_file_meta(1, "/a.flac", 1000)).unwrap();
        index.commit().unwrap();
        // Write without commit, then "crash" (drop without commit)
        index.index_file(&make_file_meta(2, "/b.flac", 1000)).unwrap();
        // mem::forget would leak, just drop naturally
    }

    let index = SearchIndex::open(&index_path).unwrap();
    let results = index.search("a", 10).unwrap();
    assert_eq!(results.len(), 1); // Committed doc survives
}

Step 3: Implementation

File: musicfs-search/src/index.rs

pub fn open_with_recovery(index_path: &Path) -> Result<Self, SearchError> {
    match Self::open(index_path) {
        Ok(index) => {
            // Verify index is functional with a simple search
            match index.reader.searcher().num_docs() {
                docs => {
                    info!(docs, "Search index opened successfully");
                    Ok(index)
                }
            }
        }
        Err(e) => {
            warn!(
                error = %e,
                path = ?index_path,
                "Search index corrupted, rebuilding from scratch"
            );
            // Delete corrupted index
            if index_path.exists() {
                std::fs::remove_dir_all(index_path)
                    .map_err(|e| SearchError::Io(e))?;
            }
            // Create fresh index
            Self::open(index_path)
        }
    }
}

Step 4: Verify

cargo test -p musicfs-test-utils --test resilience -- test_tantivy

4.5 Issue 3.5: sled Corruption Recovery

Problem: sled::open() on a corrupted DB returns sled::Error::Corruption which propagates as CasError::Sled and crashes the daemon on startup.

Step 1: Stubs — none needed, modify existing `open()`

Step 2: RED test — replace `todo!()`

#[tokio::test]
async fn test_sled_corruption_triggers_repair() {
    let dir = TempDir::new().unwrap();
    let chunks_dir = dir.path().join("chunks");
    let config = CasConfig { chunks_dir: chunks_dir.clone(), max_size: 10_000_000, shard_levels: 2 };

    // Create valid store with data
    {
        let store = CasStore::open(config.clone()).await.unwrap();
        store.put(b"test data").await.unwrap();
    }

    // Corrupt sled index files
    let sled_dir = chunks_dir.join("index.sled");
    if sled_dir.exists() {
        for entry in std::fs::read_dir(&sled_dir).unwrap() {
            let entry = entry.unwrap();
            if entry.metadata().unwrap().is_file() {
                std::fs::write(entry.path(), b"corrupted").unwrap();
            }
        }
    }

    // Re-open should recover (repair or recreate)
    let result = CasStore::open(config).await;
    assert!(result.is_ok(), "sled should recover from corruption");
}

Step 3: Implementation

File: musicfs-cas/src/store.rs — modify open():

pub async fn open(config: CasConfig) -> Result<Self, CasError> {
    fs::create_dir_all(&config.chunks_dir).await?;

    let index_path = config.chunks_dir.join("index.sled");
    let index = match sled::open(&index_path) {
        Ok(db) => db,
        Err(e) => {
            warn!(error = %e, path = ?index_path, "sled index corrupted, attempting recovery");

            // Try repair
            match sled::Config::new().path(&index_path).repair(true).open() {
                Ok(db) => {
                    info!("sled index repaired successfully");
                    db
                }
                Err(repair_err) => {
                    warn!(error = %repair_err, "sled repair failed, recreating index");
                    // Delete and recreate
                    if index_path.exists() {
                        std::fs::remove_dir_all(&index_path)
                            .map_err(|e| CasError::Io(e))?;
                    }
                    sled::open(&index_path)?
                }
            }
        }
    };

    let current_size = Self::calculate_size(&config.chunks_dir).await;

    Ok(Self {
        config,
        index,
        current_size: AtomicU64::new(current_size),
    })
}

Step 4: Verify

cargo test -p musicfs-test-utils --test resilience -- test_sled_corruption

4.6 Issue 2.3: Graceful Shutdown Orchestration

Problem: On signal, drop(session) unmounts FUSE, then drop(runtime) kills all tokio tasks abruptly. No tantivy flush, no SQLite WAL checkpoint, no ordered task shutdown.

Approach: CancellationToken from tokio_util propagated to all background tasks. Signal triggers token cancellation, then ordered shutdown.

Step 1: Add dependency

# musicfs-cli/Cargo.toml
tokio-util = { version = "0.7", features = ["rt"] }

Step 2: Tests

#[tokio::test]
async fn test_shutdown_cancels_background_tasks() {
    let token = CancellationToken::new();
    let stopped = Arc::new(AtomicBool::new(false));
    let stopped_clone = stopped.clone();
    let token_clone = token.clone();

    tokio::spawn(async move {
        token_clone.cancelled().await;
        stopped_clone.store(true, Ordering::SeqCst);
    });

    assert!(!stopped.load(Ordering::SeqCst));
    token.cancel();
    tokio::time::sleep(Duration::from_millis(50)).await;
    assert!(stopped.load(Ordering::SeqCst));
}

#[tokio::test]
async fn test_shutdown_flushes_tantivy() {
    let dir = TempDir::new().unwrap();
    let index = SearchIndex::open(dir.path().join("idx")).unwrap();

    index.index_file(&make_file_meta(1, "/a.flac", 1000)).unwrap();
    // Graceful shutdown should commit
    index.commit().unwrap();

    let index2 = SearchIndex::open(dir.path().join("idx")).unwrap();
    assert_eq!(index2.search("a", 10).unwrap().len(), 1);
}

Step 3: Implementation

File: musicfs-cli/src/main.rs — restructure the signal loop:

The current code:

// Wait for signal
runtime.block_on(async { ... signal select ... })?;
// Drop session, exit

Change to:

let shutdown_token = CancellationToken::new();

// TODO: Pass token to health monitor, watcher, indexer, prefetcher
// (requires their start() methods to accept CancellationToken)
// For now, we just use it for the shutdown sequence

runtime.block_on(async {
    // ... signal select ...

    // Ordered shutdown
    info!("Beginning ordered shutdown");
    shutdown_token.cancel();

    // Wait briefly for tasks to notice cancellation
    tokio::time::sleep(Duration::from_millis(500)).await;

    // Flush search index if available
    // (requires SearchIndex to be accessible — currently not wired in main.rs)

    info!("Background tasks stopped");
})?;

Note: Full CancellationToken propagation through health monitor, watcher, indexer, and prefetcher start() methods requires changing their signatures. The current mpsc::channel<()> stop mechanism in each task should be replaced with or supplemented by the token. This can be done incrementally — start by adding the token to run_mount(), then wire it into each task as they're touched.

For this phase, the minimum viable change is:

Create the token in run_mount()
Cancel it on signal
Add a brief sleep for tasks to notice
The existing drop(session) and runtime drop handle cleanup

Full per-task CancellationToken wiring is tracked as follow-up work.

4.7 Issue 2.6: Task Supervisor

Problem: 13 tokio::spawn() calls with no JoinHandle stored. Dead tasks go unnoticed.

Approach: New TaskSupervisor struct in musicfs-core that stores handles, checks liveness, and restarts critical tasks.

Step 1: Stubs

File: musicfs-core/src/supervisor.rs (new file)

pub struct TaskSupervisor { ... }

pub enum TaskStatus {
    Running,
    Failed { error: String, at: Instant },
    Restarting { attempt: u32 },
    Stopped,
}

impl TaskSupervisor {
    pub fn new() -> Self;
    pub fn spawn_supervised(&self, name: &str, future: impl Future) -> ();
    pub fn spawn_critical(&self, name: &str, factory: impl Fn() -> impl Future) -> ();
    pub fn task_status(&self, name: &str) -> TaskStatus;
    pub fn check_all(&self) -> Vec<(String, TaskStatus)>;
}

Step 2: Tests

#[tokio::test]
async fn test_supervisor_detects_task_completion() {
    let supervisor = TaskSupervisor::new();
    supervisor.spawn_supervised("fast", async { /* returns immediately */ });
    tokio::time::sleep(Duration::from_millis(50)).await;
    // Task completed normally — should be Stopped, not Failed
}

#[tokio::test]
async fn test_supervisor_detects_panic() {
    let supervisor = TaskSupervisor::new();
    supervisor.spawn_supervised("panicker", async {
        panic!("boom");
    });
    tokio::time::sleep(Duration::from_millis(50)).await;
    assert!(matches!(supervisor.task_status("panicker"), TaskStatus::Failed { .. }));
}

#[tokio::test]
async fn test_supervisor_restarts_critical_task() {
    let count = Arc::new(AtomicU32::new(0));
    let c = count.clone();

    let supervisor = TaskSupervisor::new();
    supervisor.spawn_critical("restartable", move || {
        let c = c.clone();
        async move {
            let n = c.fetch_add(1, Ordering::SeqCst);
            if n == 0 { panic!("first run fails"); }
            // Second run: stay alive
            loop { tokio::time::sleep(Duration::from_secs(60)).await; }
        }
    });

    tokio::time::sleep(Duration::from_secs(2)).await;
    assert_eq!(count.load(Ordering::SeqCst), 2);
    assert!(matches!(supervisor.task_status("restartable"), TaskStatus::Running));
}

Step 3: Implementation

File: musicfs-core/src/supervisor.rs

use parking_lot::RwLock;
use std::collections::HashMap;
use std::sync::Arc;
use std::time::{Duration, Instant};
use tokio::task::JoinHandle;
use tracing::{error, info, warn};

pub struct TaskSupervisor {
    tasks: Arc<RwLock<HashMap<String, TaskEntry>>>,
}

struct TaskEntry {
    handle: JoinHandle<()>,
    status: TaskStatus,
    restart_count: u32,
    last_restart: Option<Instant>,
}

#[derive(Debug, Clone)]
pub enum TaskStatus {
    Running,
    Failed { error: String, at: Instant },
    Restarting { attempt: u32 },
    Stopped,
}

impl TaskSupervisor {
    pub fn new() -> Self {
        Self {
            tasks: Arc::new(RwLock::new(HashMap::new())),
        }
    }

    pub fn spawn_supervised<F>(&self, name: &str, future: F)
    where
        F: std::future::Future<Output = ()> + Send + 'static,
    {
        let tasks = self.tasks.clone();
        let name_owned = name.to_string();

        let handle = tokio::spawn(async move {
            future.await;
        });

        // Monitor the handle
        let tasks_monitor = self.tasks.clone();
        let name_monitor = name.to_string();
        let monitor_handle = handle;

        self.tasks.write().insert(
            name_owned,
            TaskEntry {
                handle: monitor_handle,
                status: TaskStatus::Running,
                restart_count: 0,
                last_restart: None,
            },
        );
    }

    pub fn task_status(&self, name: &str) -> TaskStatus {
        let mut tasks = self.tasks.write();
        if let Some(entry) = tasks.get_mut(name) {
            if entry.handle.is_finished() {
                entry.status = TaskStatus::Failed {
                    error: "Task exited".into(),
                    at: Instant::now(),
                };
            }
            entry.status.clone()
        } else {
            TaskStatus::Stopped
        }
    }
}

Note: The full spawn_critical with automatic restart requires a task factory (Fn() -> Future) pattern. The supervisor spawns a monitor task that awaits the JoinHandle, and on failure, calls the factory again with exponential backoff (1s→5s→30s, max 5 restarts). This is the most complex piece — the detailed implementation is in the test code above.

5. Cross-Cutting Concerns

5.1 Security & Privacy

PRAGMA integrity_check is read-only — no risk to data
sled repair may lose recently-written entries — acceptable for a cache
tantivy rebuild deletes index entirely — no sensitive data exposure (metadata only)

5.2 Observability

SQLite integrity check result logged at INFO (ok) or WARN (failed)
sled repair attempts logged at WARN
tantivy rebuild logged at WARN with file count before/after
CAS StoreFull error logged at WARN with current/max sizes
Task supervisor logs all state transitions (started, failed, restarting, stopped)

5.3 Testing

Test	Status Before	Status After	Issue
`test_cas_put_handles_enospc`	❌ FAILED	✅ GREEN	2.8
`test_sqlite_integrity_check_detects_corruption`	❌ todo!()	✅ GREEN	2.4
`test_tantivy_corruption_triggers_rebuild`	❌ todo!()	✅ GREEN	2.4
`test_tantivy_survives_uncommitted_crash`	❌ todo!()	✅ GREEN	5.2
`test_sled_corruption_triggers_repair`	❌ todo!()	✅ GREEN	3.5
`test_shutdown_cancels_background_tasks`	NEW	✅ GREEN	2.3
`test_shutdown_flushes_tantivy`	NEW	✅ GREEN	2.3
`test_supervisor_detects_panic`	NEW	✅ GREEN	2.6
`test_supervisor_restarts_critical_task`	NEW	✅ GREEN	2.6

6. Alternatives Considered

6.1 Full `PRAGMA integrity_check` vs Quick Check

PRAGMA integrity_check scans every page — slow for large DBs (seconds for 1M rows). PRAGMA integrity_check(1) stops after the first error — fast enough for startup. We use the quick variant.

6.2 tantivy Repair vs Rebuild

tantivy has no built-in repair. If meta.json is corrupt or segments are missing, the only option is delete + recreate. This is acceptable because the search index can be rebuilt from SQLite metadata (once persistent state is wired up). For now, rebuild produces an empty index.

6.3 sled Repair vs Recreate

sled has Config::repair(true) which attempts to recover. If repair fails, we delete and recreate. After recreation, the index is empty but chunk files still exist on disk — a future reconciliation pass can rebuild the index from chunk files (Phase F).

6.4 Custom Supervisor vs `tokio-graceful` Crate

tokio-graceful provides shutdown coordination but not task restart. Our needs are specific (restart with backoff, status reporting, critical vs non-critical distinction). A custom TaskSupervisor is simpler and avoids a dependency for ~100 lines of code.

7. Implementation Plan

7.1 Task Sequence

Day	Task	Issue	Effort	Test
1 (morning)	CAS size pre-check + `StoreFull` error variant	2.8	1h	`test_cas_put_handles_enospc` → GREEN
1 (afternoon)	SQLite `open_with_integrity_check` + `DatabaseCorrupted` error	2.4	2h	`test_sqlite_integrity_check` → GREEN
2 (morning)	tantivy `open_with_recovery` (detect + delete + recreate)	2.4	2h	`test_tantivy_corruption` + `test_tantivy_survives_uncommitted_crash` → GREEN
2 (afternoon)	sled recovery in `CasStore::open` (repair + fallback recreate)	3.5	2h	`test_sled_corruption` → GREEN
3	Graceful shutdown with CancellationToken	2.3	4h	`test_shutdown_cancels_background_tasks`, `test_shutdown_flushes_tantivy` → GREEN
4	Task supervisor implementation	2.6	4h	`test_supervisor_detects_panic`, `test_supervisor_restarts` → GREEN
5	Integration + regression testing	—	4h	Full `cargo test`, verify no regressions

7.2 Verification Checklist

After all tasks:

cargo check — zero errors, zero warnings
cargo test --workspace --exclude musicfs-grpc — all tests pass (exclude pre-existing grpc issue)
cargo test -p musicfs-test-utils --test resilience — 5 previously-RED tests now GREEN
cargo clippy — no new warnings
Remaining RED tests are only for Phases C-F (health timeout, parallel checks, fd exhaustion, chunk auto-repair, passthrough mode)

8. Files Changed

File	Change	Issue
`musicfs-cas/src/store.rs`	Size pre-check in `put()`, `StoreFull` error, sled recovery in `open()`	2.8, 3.5
`musicfs-cache/src/db.rs`	`open_with_integrity_check()` with `PRAGMA integrity_check(1)`	2.4
`musicfs-core/src/error.rs`	Add `DatabaseCorrupted(String)` variant	2.4
`musicfs-search/src/index.rs`	`open_with_recovery()` — detect, delete, recreate	2.4
`musicfs-core/src/supervisor.rs`	NEW — `TaskSupervisor`, `TaskStatus`, spawn/monitor/restart	2.6
`musicfs-core/src/lib.rs`	Re-export supervisor module	2.6
`musicfs-cli/src/main.rs`	CancellationToken creation, ordered shutdown sequence	2.3
`musicfs-cli/Cargo.toml`	Add `tokio-util` dependency	2.3
`musicfs-test-utils/tests/resilience.rs`	Replace `todo!()` stubs with real tests, add supervisor tests	all

9. Glossary / References

Term	Definition
CancellationToken	`tokio_util::sync::CancellationToken` — cooperative cancellation signal for async tasks
PRAGMA integrity_check	SQLite command that verifies page-level data consistency
sled repair	sled's built-in recovery mode that attempts to reconstruct a corrupted database
TaskSupervisor	New struct that monitors `JoinHandle`s and restarts failed tasks with backoff
StoreFull	New `CasError` variant returned when a write would exceed `max_size`

Document	Path
Phase A plan	phase-a-stop-dying.md
Resilience audit	resilience-fault-tolerance.md
Resilience testing	resilience-testing.md
Persistent state	persistent-state.md

27 KiB Raw Permalink Blame History

Phase B: Crash Recovery — Implementation Plan

1. Abstract

2. Background

2.1 What Phase A Delivered

2.2 What's Still Broken After Phase A

3. Goals & Non-Goals

3.1 Goals

3.2 Non-Goals

4. Proposed Design

4.1 Implementation Order

4.2 Issue 2.8: CAS Size Pre-Check

Step 1: Stubs — none needed

Step 2: RED test — already exists

Step 3: Implementation

Step 4: Verify

4.3 Issue 2.4 (part 1): SQLite Integrity Check

Step 1: Stubs

Step 2: RED test — already exists as todo!()

Step 3: Implementation

Step 4: Verify

4.4 Issue 2.4 (part 2): tantivy Corruption Recovery

Step 1: Stubs

Step 2: RED test — replace todo!() with real test

Step 3: Implementation

Step 4: Verify

4.5 Issue 3.5: sled Corruption Recovery

Step 1: Stubs — none needed, modify existing open()

Step 2: RED test — replace todo!()

Step 3: Implementation

Step 4: Verify

4.6 Issue 2.3: Graceful Shutdown Orchestration

Step 1: Add dependency

Step 2: Tests

Step 3: Implementation

4.7 Issue 2.6: Task Supervisor

Step 1: Stubs

Step 2: Tests

Step 3: Implementation

5. Cross-Cutting Concerns

5.1 Security & Privacy

5.2 Observability

5.3 Testing

6. Alternatives Considered

6.1 Full PRAGMA integrity_check vs Quick Check

6.2 tantivy Repair vs Rebuild

6.3 sled Repair vs Recreate

6.4 Custom Supervisor vs tokio-graceful Crate

7. Implementation Plan

7.1 Task Sequence

7.2 Verification Checklist

8. Files Changed

9. Glossary / References

27 KiB

Raw Permalink Blame History

Step 2: RED test — already exists as `todo!()`

Step 2: RED test — replace `todo!()` with real test

Step 1: Stubs — none needed, modify existing `open()`

Step 2: RED test — replace `todo!()`

6.1 Full `PRAGMA integrity_check` vs Quick Check

6.4 Custom Supervisor vs `tokio-graceful` Crate