MusicFS/docs/v2/plans/phase-b-crash-recovery.md
Alexander 4e394c60ec Add Phase B implementation plan (Crash Recovery)
BlueDoc covering 6 issues with TDD flow:
- 2.8: CAS size pre-check (StoreFull error variant)
- 2.4: SQLite PRAGMA integrity_check on open
- 2.4: tantivy open_with_recovery (detect + rebuild)
- 3.5: sled corruption repair + fallback recreate
- 2.3: Graceful shutdown with CancellationToken
- 2.6: TaskSupervisor (monitor, detect panic, restart)
Turns 5 RED tests GREEN, adds 4 new tests. ~5 days.
2026-05-13 14:56:43 +02:00


# Phase B: Crash Recovery — Implementation Plan
**Authors:** AI-assisted
**Status:** Draft
**Last Updated:** 2026-05-13
**Reviewers:** TBD
**Approvers:** TBD
**Prerequisites:** [phase-a-stop-dying.md](phase-a-stop-dying.md) (completed), [resilience-fault-tolerance.md](resilience-fault-tolerance.md)
**Estimated Effort:** ~5 days
---
[TOC]
---
## 1. Abstract
Phase A made the daemon survive signals and panics. Phase B makes it **recover from crashes** — startup integrity checks for all storage layers (SQLite, tantivy, sled), graceful shutdown with ordered teardown of background tasks, disk space pre-checks, and a task supervisor that restarts dead background tasks.
This covers issues 2.3, 2.4, 2.6, 2.8, and 3.5 from the [resilience audit](resilience-fault-tolerance.md), deferred from Phase A.
Issue 2.5 (interrupted sync recovery) is deferred to after [persistent state](persistent-state.md) is wired up — checkpoint/resume requires the DB to be in the mount path.
**RED tests to turn GREEN** (from current `resilience.rs`):
- `test_sqlite_integrity_check_detects_corruption` — currently `todo!()`
- `test_tantivy_corruption_triggers_rebuild` — currently `todo!()`
- `test_sled_corruption_triggers_repair` — currently `todo!()`
- `test_cas_put_handles_enospc` — currently fails (no size pre-check)
- `test_tantivy_survives_uncommitted_crash` — currently `todo!()`
**New tests to write:**
- Shutdown orchestration: CancellationToken propagation, ordered teardown, tantivy flush
- Task supervisor: panic detection, restart with backoff, status reporting
---
## 2. Background
### 2.1 What Phase A Delivered
- Signal handling via `spawn_mount2` + tokio signal loop ✅
- Panic hook logging via `tracing::error!` ✅
- RwLock → `parking_lot` (no more poison cascade) ✅
- sd_notify READY/STOPPING ✅
- ExecStopPost + stale mount detection ✅
### 2.2 What's Still Broken After Phase A
The daemon now **stops cleanly** on signals but:
1. **Shutdown is unordered**: `drop(session)` unmounts FUSE, but background tasks (health monitor, indexer, watcher, prefetcher) are killed mid-operation by runtime drop. No tantivy flush, no SQLite checkpoint.
2. **No startup integrity checks** — if the daemon was `kill -9`'d (or OOM-killed, power loss), SQLite/tantivy/sled may have partial writes. Currently these propagate as runtime errors or silent corruption.
3. **Background tasks are fire-and-forget** — health monitor, watcher, indexer, prefetcher use `tokio::spawn` with no `JoinHandle` stored. If a task panics, it's silently dead.
4. **CAS accepts oversized writes**: `put()` doesn't check `max_size` before writing, so the cache grows unbounded.
---
## 3. Goals & Non-Goals
### 3.1 Goals
- Graceful shutdown flushes tantivy, checkpoints SQLite WAL, stops background tasks in order
- Corrupted SQLite detected on open via `PRAGMA integrity_check`
- Corrupted tantivy index detected and rebuilt from scratch
- Corrupted sled index detected and repaired
- CAS rejects writes that would exceed `max_size`
- Background tasks are supervised — panics detected, critical tasks restarted
- All 5 RED tests turn GREEN
- All new tests for shutdown + supervisor are GREEN
### 3.2 Non-Goals
- Interrupted sync recovery (2.5) — depends on persistent state work
- Disk space monitoring daemon (periodic `statvfs`) — Phase C
- Connection pooling, config reload, watchdog — Phase C/D
- Passthrough mode when cache dies — Phase F
---
## 4. Proposed Design
### 4.1 Implementation Order
```
4.2 CAS size pre-check (no deps, simplest fix)
4.3 SQLite integrity check (no deps)
4.4 tantivy corruption recovery (no deps)
4.5 sled corruption recovery (no deps)
4.6 Graceful shutdown orchestration (depends on: Phase A signal handler)
4.7 Task supervisor (depends on: 4.6 CancellationToken)
```
### 4.2 Issue 2.8: CAS Size Pre-Check
**Problem**: `CasStore::put()` writes data without checking if it would exceed `max_size`. The existing test `test_cas_put_handles_enospc` creates a store with `max_size: 100` and writes 1000 bytes — currently succeeds when it should fail.
#### Step 1: Stubs — none needed
#### Step 2: RED test — already exists
```rust
// Currently FAILS: this is what we need to fix
#[tokio::test]
async fn test_cas_put_handles_enospc() {
    let store = CasStore::open(CasConfig { max_size: 100, ... }).await.unwrap();
    let large_data = vec![0u8; 1000];
    let result = store.put(&large_data).await;
    assert!(result.is_err());
}
```
#### Step 3: Implementation
**File**: `musicfs-cas/src/store.rs` — add size check at top of `put()`:
```rust
pub async fn put(&self, data: &[u8]) -> Result<ChunkHash, CasError> {
    let hash = ChunkHash::from_bytes(data);
    let path = self.chunk_path(&hash);
    if path.exists() {
        trace!(hash = %hash, size_bytes = data.len(), "dedup hit");
        return Ok(hash);
    }

    // NEW: Pre-check size limit (a max_size of 0 skips the check).
    // Load the counter once so the check, the log line, and the error agree.
    if self.config.max_size > 0 {
        let current = self.current_size.load(Ordering::SeqCst);
        let new_size = current.saturating_add(data.len() as u64);
        if new_size > self.config.max_size {
            warn!(
                current_size = current,
                chunk_size = data.len(),
                max_size = self.config.max_size,
                "CAS store full, rejecting write"
            );
            return Err(CasError::StoreFull {
                current,
                max: self.config.max_size,
            });
        }
    }

    // ... rest of put() unchanged
}
```
Also add new error variant:
```rust
pub enum CasError {
    // ... existing variants
    #[error("Store full: {current} / {max} bytes")]
    StoreFull { current: u64, max: u64 },
}
```
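One caveat with the pre-check: the load and any later size update are separate steps, so two concurrent `put()` calls can both pass the check and together exceed `max_size`. A minimal sketch of an atomic reservation that would close this window, assuming `current_size` is the only counter involved. `try_reserve` and `release` are hypothetical helper names, not existing `CasStore` methods:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Atomically adds `len` to `current_size` unless that would exceed `max_size`.
/// Returns Ok(previous size) on success, Err(size at rejection) otherwise.
/// A `max_size` of 0 disables the limit, mirroring the check in `put()`.
pub fn try_reserve(current_size: &AtomicU64, len: u64, max_size: u64) -> Result<u64, u64> {
    current_size.fetch_update(Ordering::SeqCst, Ordering::SeqCst, |cur| {
        // Overflow counts as "does not fit".
        let new_size = cur.checked_add(len)?;
        if max_size > 0 && new_size > max_size {
            None
        } else {
            Some(new_size)
        }
    })
}

/// Returns reserved bytes when the write that followed the reservation failed.
pub fn release(current_size: &AtomicU64, len: u64) {
    current_size.fetch_sub(len, Ordering::SeqCst);
}
```

With this shape, `put()` would reserve before writing and call `release` if the write fails. Whatever size accounting the unchanged remainder of `put()` performs would then need to be dropped to avoid double counting.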
#### Step 4: Verify
```bash
cargo test -p musicfs-test-utils --test resilience -- test_cas_put_handles_enospc
```
---
### 4.3 Issue 2.4 (part 1): SQLite Integrity Check
**Problem**: `Database::open()` applies the schema but runs no integrity check. After a crash, corrupt pages can silently serve bad data.
#### Step 1: Stubs
Add to `musicfs-cache/src/db.rs`:
```rust
pub fn open_with_integrity_check(path: &Path) -> Result<Self> {
    todo!()
}
```
#### Step 2: RED test — already exists as `todo!()`
Replace the `todo!()` with a real test:
```rust
#[tokio::test]
async fn test_sqlite_integrity_check_detects_corruption() {
    let dir = TempDir::new().unwrap();
    let db_path = dir.path().join("test.db");

    // Create valid DB with data
    {
        let db = Database::open(&db_path).unwrap();
        db.upsert_file(
            &OriginId::from("test"),
            Path::new("/test.flac"),
            &VirtualPath::new("/Test.flac"),
            &AudioMeta::default(),
            UNIX_EPOCH,
            1000,
        ).unwrap();
    }

    // Corrupt the file
    let mut data = std::fs::read(&db_path).unwrap();
    let mid = data.len() / 2;
    data[mid..mid + 100].fill(0xFF);
    std::fs::write(&db_path, &data).unwrap();

    // open_with_integrity_check should detect corruption
    let result = Database::open_with_integrity_check(&db_path);
    assert!(result.is_err());
}
```
#### Step 3: Implementation
**File**: `musicfs-cache/src/db.rs`
```rust
pub fn open_with_integrity_check(path: &Path) -> Result<Self> {
    debug!(?path, "Opening database with integrity check");
    let conn = Connection::open(path)
        .map_err(|e| Error::Database(format!("open failed: {}", e)))?;

    // Integrity check capped at one reported error: yields the single row
    // "ok" on a healthy database, otherwise the first problem found.
    let integrity: String = conn
        .query_row("PRAGMA integrity_check(1)", [], |row| row.get(0))
        .map_err(|e| Error::Database(format!("integrity check failed: {}", e)))?;
    if integrity != "ok" {
        warn!(path = ?path, result = %integrity, "Database integrity check failed");
        return Err(Error::DatabaseCorrupted(format!(
            "integrity check failed: {}", integrity
        )));
    }

    conn.execute_batch(SCHEMA)
        .map_err(|e| Error::Database(format!("schema init failed: {}", e)))?;
    let db = Self { conn: Arc::new(Mutex::new(conn)) };
    let count = db.file_count().unwrap_or(0);
    info!(path = ?path, file_count = count, "Database opened (integrity verified)");
    Ok(db)
}
```
Also add the error variant to `musicfs-core/src/error.rs`:
```rust
pub enum Error {
    // ... existing
    #[error("Database corrupted: {0}")]
    DatabaseCorrupted(String),
}
```
#### Step 4: Verify
```bash
cargo test -p musicfs-test-utils --test resilience -- test_sqlite_integrity
```
---
### 4.4 Issue 2.4 (part 2): tantivy Corruption Recovery
**Problem**: If tantivy `meta.json` or segment files are corrupted, `Index::open_in_dir()` panics or returns an error. No recovery path — daemon crashes.
#### Step 1: Stubs
Add to `musicfs-search/src/index.rs`:
```rust
pub fn open_with_recovery(index_path: &Path) -> Result<Self, SearchError> {
    todo!()
}
```
#### Step 2: RED test — replace `todo!()` with real test
```rust
#[tokio::test]
async fn test_tantivy_corruption_triggers_rebuild() {
    let dir = TempDir::new().unwrap();
    let index_path = dir.path().join("search_idx");

    // Create valid index with data
    {
        let index = SearchIndex::open(&index_path).unwrap();
        index.index_file(&make_file_meta(1, "/a.flac", 1000)).unwrap();
        index.commit().unwrap();
    }

    // Corrupt meta.json
    std::fs::write(index_path.join("meta.json"), b"corrupted").unwrap();

    // open_with_recovery should detect corruption and rebuild empty
    let index = SearchIndex::open_with_recovery(&index_path).unwrap();
    let results = index.search("a", 10).unwrap();
    assert_eq!(results.len(), 0); // Rebuilt empty but functional
}
```
Also replace the tantivy crash test `todo!()`:
```rust
#[test]
fn test_tantivy_survives_uncommitted_crash() {
    let dir = TempDir::new().unwrap();
    let index_path = dir.path().join("search_idx");
    {
        let index = SearchIndex::open(&index_path).unwrap();
        index.index_file(&make_file_meta(1, "/a.flac", 1000)).unwrap();
        index.commit().unwrap();
        // Write without commit, then "crash" (drop without commit)
        index.index_file(&make_file_meta(2, "/b.flac", 1000)).unwrap();
        // mem::forget would leak, just drop naturally
    }
    let index = SearchIndex::open(&index_path).unwrap();
    let results = index.search("a", 10).unwrap();
    assert_eq!(results.len(), 1); // Committed doc survives
}
```
#### Step 3: Implementation
**File**: `musicfs-search/src/index.rs`
```rust
pub fn open_with_recovery(index_path: &Path) -> Result<Self, SearchError> {
    match Self::open(index_path) {
        Ok(index) => {
            // Report the document count as a basic sanity signal
            let docs = index.reader.searcher().num_docs();
            info!(docs, "Search index opened successfully");
            Ok(index)
        }
        Err(e) => {
            warn!(
                error = %e,
                path = ?index_path,
                "Search index corrupted, rebuilding from scratch"
            );
            // Delete corrupted index
            if index_path.exists() {
                std::fs::remove_dir_all(index_path).map_err(SearchError::Io)?;
            }
            // Create fresh index
            Self::open(index_path)
        }
    }
}
```
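The same "open, and on failure delete and reopen" shape recurs for sled in 4.5. A std-only sketch of it as a generic helper, where the `open` closure stands in for `SearchIndex::open`. `open_or_rebuild` is a hypothetical name, and the helper assumes, as the code above does, that `open` recreates a missing directory:

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Opens a store at `path`. If `open` fails, deletes the directory and
/// retries exactly once. The returned flag is true when a rebuild happened.
pub fn open_or_rebuild<T, E, F>(path: &Path, open: F) -> Result<(T, bool), E>
where
    F: Fn(&Path) -> Result<T, E>,
    E: From<io::Error>,
{
    match open(path) {
        Ok(store) => Ok((store, false)),
        Err(_first_error) => {
            // A real caller would log `_first_error` before discarding it.
            if path.exists() {
                fs::remove_dir_all(path)?;
            }
            // A second failure is returned as-is: no retry loop.
            open(path).map(|store| (store, true))
        }
    }
}
```

Returning the rebuild flag lets the caller decide whether to schedule a re-index, which matters once the index can be repopulated from SQLite metadata.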
#### Step 4: Verify
```bash
cargo test -p musicfs-test-utils --test resilience -- test_tantivy
```
---
### 4.5 Issue 3.5: sled Corruption Recovery
**Problem**: `sled::open()` on a corrupted DB returns `sled::Error::Corruption` which propagates as `CasError::Sled` and crashes the daemon on startup.
#### Step 1: Stubs — none needed, modify existing `open()`
#### Step 2: RED test — replace `todo!()`
```rust
#[tokio::test]
async fn test_sled_corruption_triggers_repair() {
    let dir = TempDir::new().unwrap();
    let chunks_dir = dir.path().join("chunks");
    let config = CasConfig {
        chunks_dir: chunks_dir.clone(),
        max_size: 10_000_000,
        shard_levels: 2,
    };

    // Create valid store with data
    {
        let store = CasStore::open(config.clone()).await.unwrap();
        store.put(b"test data").await.unwrap();
    }

    // Corrupt sled index files
    let sled_dir = chunks_dir.join("index.sled");
    if sled_dir.exists() {
        for entry in std::fs::read_dir(&sled_dir).unwrap() {
            let entry = entry.unwrap();
            if entry.metadata().unwrap().is_file() {
                std::fs::write(entry.path(), b"corrupted").unwrap();
            }
        }
    }

    // Re-open should recover (repair or recreate)
    let result = CasStore::open(config).await;
    assert!(result.is_ok(), "sled should recover from corruption");
}
```
#### Step 3: Implementation
**File**: `musicfs-cas/src/store.rs` — modify `open()`:
```rust
pub async fn open(config: CasConfig) -> Result<Self, CasError> {
    fs::create_dir_all(&config.chunks_dir).await?;
    let index_path = config.chunks_dir.join("index.sled");
    let index = match sled::open(&index_path) {
        Ok(db) => db,
        Err(e) => {
            warn!(error = %e, path = ?index_path, "sled index corrupted, attempting recovery");
            // Try repair. NOTE: confirm that the pinned sled release exposes
            // `Config::repair`; if it does not, go straight to recreate.
            match sled::Config::new().path(&index_path).repair(true).open() {
                Ok(db) => {
                    info!("sled index repaired successfully");
                    db
                }
                Err(repair_err) => {
                    warn!(error = %repair_err, "sled repair failed, recreating index");
                    // Delete and recreate
                    if index_path.exists() {
                        std::fs::remove_dir_all(&index_path).map_err(CasError::Io)?;
                    }
                    sled::open(&index_path)?
                }
            }
        }
    };
    let current_size = Self::calculate_size(&config.chunks_dir).await;
    Ok(Self {
        config,
        index,
        current_size: AtomicU64::new(current_size),
    })
}
```
#### Step 4: Verify
```bash
cargo test -p musicfs-test-utils --test resilience -- test_sled_corruption
```
---
### 4.6 Issue 2.3: Graceful Shutdown Orchestration
**Problem**: On signal, `drop(session)` unmounts FUSE, then `drop(runtime)` kills all tokio tasks abruptly. No tantivy flush, no SQLite WAL checkpoint, no ordered task shutdown.
**Approach**: `CancellationToken` from `tokio_util` propagated to all background tasks. Signal triggers token cancellation, then ordered shutdown.
#### Step 1: Add dependency
```toml
# musicfs-cli/Cargo.toml
tokio-util = { version = "0.7", features = ["rt"] }
```
#### Step 2: Tests
```rust
#[tokio::test]
async fn test_shutdown_cancels_background_tasks() {
    let token = CancellationToken::new();
    let stopped = Arc::new(AtomicBool::new(false));
    let stopped_clone = stopped.clone();
    let token_clone = token.clone();
    let handle = tokio::spawn(async move {
        token_clone.cancelled().await;
        stopped_clone.store(true, Ordering::SeqCst);
    });
    assert!(!stopped.load(Ordering::SeqCst));
    token.cancel();
    // Await the task with a bound instead of sleeping a fixed interval
    tokio::time::timeout(Duration::from_secs(1), handle)
        .await
        .expect("task did not stop in time")
        .unwrap();
    assert!(stopped.load(Ordering::SeqCst));
}

#[tokio::test]
async fn test_shutdown_flushes_tantivy() {
    let dir = TempDir::new().unwrap();
    let index_path = dir.path().join("idx");
    {
        let index = SearchIndex::open(&index_path).unwrap();
        index.index_file(&make_file_meta(1, "/a.flac", 1000)).unwrap();
        // Graceful shutdown should commit
        index.commit().unwrap();
        // Drop the first handle before reopening the same directory
    }
    let index2 = SearchIndex::open(&index_path).unwrap();
    assert_eq!(index2.search("a", 10).unwrap().len(), 1);
}
```
#### Step 3: Implementation
**File**: `musicfs-cli/src/main.rs` — restructure the signal loop:
The current code:
```rust
// Wait for signal
runtime.block_on(async { ... signal select ... })?;
// Drop session, exit
```
Change to:
```rust
let shutdown_token = CancellationToken::new();
// TODO: Pass token to health monitor, watcher, indexer, prefetcher
// (requires their start() methods to accept CancellationToken)
// For now, we just use it for the shutdown sequence
runtime.block_on(async {
    // ... signal select ...

    // Ordered shutdown
    info!("Beginning ordered shutdown");
    shutdown_token.cancel();
    // Wait briefly for tasks to notice cancellation
    tokio::time::sleep(Duration::from_millis(500)).await;
    // Flush search index if available
    // (requires SearchIndex to be accessible; currently not wired in main.rs)
    info!("Background tasks stopped");
})?;
```
**Note**: Full CancellationToken propagation through health monitor, watcher, indexer, and prefetcher `start()` methods requires changing their signatures. The current `mpsc::channel<()>` stop mechanism in each task should be replaced with or supplemented by the token. This can be done incrementally — start by adding the token to `run_mount()`, then wire it into each task as they're touched.
For this phase, the minimum viable change is:
1. Create the token in `run_mount()`
2. Cancel it on signal
3. Add a brief sleep for tasks to notice
4. The existing `drop(session)` and runtime drop handle cleanup
Full per-task CancellationToken wiring is tracked as follow-up work.
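The fixed 500 ms sleep is a guess at how long tasks need. A std-only analogue of the cancel-then-wait sequence, using an `AtomicBool` in place of `CancellationToken` and OS threads in place of tokio tasks, shows how the sleep could become a bounded wait for acknowledgements once the token is wired through. `spawn_worker` and `shutdown` are hypothetical names used only for illustration:

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::mpsc;
use std::sync::Arc;
use std::thread;
use std::time::{Duration, Instant};

/// Stand-in for a background task: polls the cancel flag, then reports
/// its name on `done` once it has stopped.
pub fn spawn_worker(
    name: &'static str,
    cancel: Arc<AtomicBool>,
    done: mpsc::Sender<&'static str>,
) -> thread::JoinHandle<()> {
    thread::spawn(move || {
        while !cancel.load(Ordering::Acquire) {
            thread::sleep(Duration::from_millis(5));
        }
        // Final flush or checkpoint work would go here.
        let _ = done.send(name);
    })
}

/// Signals cancellation, then waits until `expected` workers have reported
/// or `deadline` has elapsed. Returns the workers that stopped in time.
pub fn shutdown(
    cancel: &AtomicBool,
    done: &mpsc::Receiver<&'static str>,
    expected: usize,
    deadline: Duration,
) -> Vec<&'static str> {
    cancel.store(true, Ordering::Release);
    let start = Instant::now();
    let mut stopped = Vec::new();
    while stopped.len() < expected {
        let remaining = deadline.saturating_sub(start.elapsed());
        match done.recv_timeout(remaining) {
            Ok(name) => stopped.push(name),
            Err(_) => break, // deadline hit or all senders gone
        }
    }
    stopped
}
```

In the tokio version the same shape would presumably be `token.cancel()` followed by `tokio::time::timeout` around the stored `JoinHandle`s, with the tantivy flush and SQLite checkpoint running only after that wait returns.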
---
### 4.7 Issue 2.6: Task Supervisor
**Problem**: 13 `tokio::spawn()` calls with no `JoinHandle` stored. Dead tasks go unnoticed.
**Approach**: New `TaskSupervisor` struct in `musicfs-core` that stores handles, checks liveness, and restarts critical tasks.
#### Step 1: Stubs
**File**: `musicfs-core/src/supervisor.rs` (new file)
```rust
pub struct TaskSupervisor { ... }

pub enum TaskStatus {
    Running,
    Failed { error: String, at: Instant },
    Restarting { attempt: u32 },
    Stopped,
}

impl TaskSupervisor {
    pub fn new() -> Self { todo!() }

    pub fn spawn_supervised<F>(&self, name: &str, future: F)
    where
        F: Future<Output = ()> + Send + 'static,
    { todo!() }

    // The factory is called once per (re)start, so it must be reusable.
    pub fn spawn_critical<F, Fut>(&self, name: &str, factory: F)
    where
        F: Fn() -> Fut + Send + 'static,
        Fut: Future<Output = ()> + Send + 'static,
    { todo!() }

    pub fn task_status(&self, name: &str) -> TaskStatus { todo!() }

    pub fn check_all(&self) -> Vec<(String, TaskStatus)> { todo!() }
}
```
#### Step 2: Tests
```rust
#[tokio::test]
async fn test_supervisor_detects_task_completion() {
    let supervisor = TaskSupervisor::new();
    supervisor.spawn_supervised("fast", async { /* returns immediately */ });
    tokio::time::sleep(Duration::from_millis(50)).await;
    // Task completed normally: should be Stopped, not Failed
    assert!(matches!(supervisor.task_status("fast"), TaskStatus::Stopped));
}

#[tokio::test]
async fn test_supervisor_detects_panic() {
    let supervisor = TaskSupervisor::new();
    supervisor.spawn_supervised("panicker", async {
        panic!("boom");
    });
    tokio::time::sleep(Duration::from_millis(50)).await;
    assert!(matches!(supervisor.task_status("panicker"), TaskStatus::Failed { .. }));
}

#[tokio::test]
async fn test_supervisor_restarts_critical_task() {
    let count = Arc::new(AtomicU32::new(0));
    let c = count.clone();
    let supervisor = TaskSupervisor::new();
    supervisor.spawn_critical("restartable", move || {
        let c = c.clone();
        async move {
            let n = c.fetch_add(1, Ordering::SeqCst);
            if n == 0 { panic!("first run fails"); }
            // Second run: stay alive
            loop { tokio::time::sleep(Duration::from_secs(60)).await; }
        }
    });
    // Long enough to cover the first 1s backoff step
    tokio::time::sleep(Duration::from_secs(2)).await;
    assert_eq!(count.load(Ordering::SeqCst), 2);
    assert!(matches!(supervisor.task_status("restartable"), TaskStatus::Running));
}
```
#### Step 3: Implementation
**File**: `musicfs-core/src/supervisor.rs`
```rust
use parking_lot::RwLock;
use std::collections::HashMap;
use std::future::Future;
use std::sync::Arc;
use std::time::Instant;
use tokio::task::JoinHandle;
use tracing::{error, info};

pub struct TaskSupervisor {
    tasks: Arc<RwLock<HashMap<String, TaskEntry>>>,
}

struct TaskEntry {
    /// Handle of the monitor task that awaits the supervised future.
    handle: JoinHandle<()>,
    status: TaskStatus,
    restart_count: u32,
    last_restart: Option<Instant>,
}

#[derive(Debug, Clone)]
pub enum TaskStatus {
    Running,
    Failed { error: String, at: Instant },
    Restarting { attempt: u32 },
    Stopped,
}

impl TaskSupervisor {
    pub fn new() -> Self {
        Self {
            tasks: Arc::new(RwLock::new(HashMap::new())),
        }
    }

    pub fn spawn_supervised<F>(&self, name: &str, future: F)
    where
        F: Future<Output = ()> + Send + 'static,
    {
        let inner = tokio::spawn(future);
        let tasks = Arc::clone(&self.tasks);
        let key = name.to_string();

        // Hold the write lock across the spawn so the monitor cannot record
        // a final status before the entry exists.
        let mut guard = self.tasks.write();
        let handle = tokio::spawn(async move {
            // A clean return maps to Stopped; a panic or abort maps to Failed.
            let status = match inner.await {
                Ok(()) => {
                    info!(task = %key, "supervised task exited");
                    TaskStatus::Stopped
                }
                Err(e) => {
                    error!(task = %key, error = %e, "supervised task failed");
                    TaskStatus::Failed { error: e.to_string(), at: Instant::now() }
                }
            };
            if let Some(entry) = tasks.write().get_mut(&key) {
                entry.status = status;
            }
        });
        guard.insert(
            name.to_string(),
            TaskEntry {
                handle,
                status: TaskStatus::Running,
                restart_count: 0,
                last_restart: None,
            },
        );
    }

    /// Returns the last recorded status; unknown names report `Stopped`.
    pub fn task_status(&self, name: &str) -> TaskStatus {
        self.tasks
            .read()
            .get(name)
            .map(|entry| entry.status.clone())
            .unwrap_or(TaskStatus::Stopped)
    }
}
```
**Note**: The full `spawn_critical` with automatic restart requires a task factory (`Fn() -> Future`) pattern. The supervisor spawns a monitor task that awaits the `JoinHandle` and, on failure, calls the factory again with escalating backoff (1s→5s→30s, max 5 restarts). This is the most complex piece. `test_supervisor_restarts_critical_task` above specifies the expected behavior; the restart loop itself is not shown here.
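The backoff schedule can be isolated from the async machinery and tested on its own. A minimal sketch, assuming the delay stays at 30s after the third restart (the plan only lists the three steps) and that the budget counts restarts rather than total runs. `restart_delay` and `MAX_RESTARTS` are hypothetical names:

```rust
use std::time::Duration;

/// Restart budget from the plan: give up after five restarts.
pub const MAX_RESTARTS: u32 = 5;

/// Delay to wait before the next restart, given how many restarts have
/// already happened. Returns None once the budget is spent.
pub fn restart_delay(restarts_so_far: u32) -> Option<Duration> {
    if restarts_so_far >= MAX_RESTARTS {
        return None;
    }
    // 1s, then 5s, then 30s for every later attempt (assumption).
    let secs = match restarts_so_far {
        0 => 1,
        1 => 5,
        _ => 30,
    };
    Some(Duration::from_secs(secs))
}
```

The monitor task would call this with the entry's `restart_count`, sleep for the returned delay, then invoke the factory. A `None` result is the point to mark the task `Failed` permanently.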
---
## 5. Cross-Cutting Concerns
### 5.1 Security & Privacy
- `PRAGMA integrity_check` is read-only — no risk to data
- sled repair may lose recently-written entries — acceptable for a cache
- tantivy rebuild deletes index entirely — no sensitive data exposure (metadata only)
### 5.2 Observability
- SQLite integrity check result logged at INFO (ok) or WARN (failed)
- sled repair attempts logged at WARN
- tantivy rebuild logged at WARN with file count before/after
- CAS `StoreFull` error logged at WARN with current/max sizes
- Task supervisor logs all state transitions (started, failed, restarting, stopped)
### 5.3 Testing
| Test | Status Before | Status After | Issue |
|------|---------------|--------------|-------|
| `test_cas_put_handles_enospc` | ❌ FAILED | ✅ GREEN | 2.8 |
| `test_sqlite_integrity_check_detects_corruption` | ❌ todo!() | ✅ GREEN | 2.4 |
| `test_tantivy_corruption_triggers_rebuild` | ❌ todo!() | ✅ GREEN | 2.4 |
| `test_tantivy_survives_uncommitted_crash` | ❌ todo!() | ✅ GREEN | 5.2 |
| `test_sled_corruption_triggers_repair` | ❌ todo!() | ✅ GREEN | 3.5 |
| `test_shutdown_cancels_background_tasks` | NEW | ✅ GREEN | 2.3 |
| `test_shutdown_flushes_tantivy` | NEW | ✅ GREEN | 2.3 |
| `test_supervisor_detects_panic` | NEW | ✅ GREEN | 2.6 |
| `test_supervisor_restarts_critical_task` | NEW | ✅ GREEN | 2.6 |
---
## 6. Alternatives Considered
### 6.1 Full `PRAGMA integrity_check` vs Quick Check
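`PRAGMA integrity_check` with no argument reports up to 100 errors before it stops. `PRAGMA integrity_check(1)` stops at the first error, so a corrupt database fails fast. A healthy database is still scanned in full either way, which can take seconds for 1M rows. We use the `(1)` variant. If startup latency becomes a problem, `PRAGMA quick_check` is SQLite's cheaper alternative, at the cost of skipping some index consistency checks.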
### 6.2 tantivy Repair vs Rebuild
tantivy has no built-in repair. If `meta.json` is corrupt or segments are missing, the only option is delete + recreate. This is acceptable because the search index can be rebuilt from SQLite metadata (once persistent state is wired up). For now, rebuild produces an empty index.
### 6.3 sled Repair vs Recreate
sled has `Config::repair(true)` which attempts to recover. If repair fails, we delete and recreate. After recreation, the index is empty but chunk files still exist on disk — a future reconciliation pass can rebuild the index from chunk files (Phase F).
### 6.4 Custom Supervisor vs `tokio-graceful` Crate
`tokio-graceful` provides shutdown coordination but not task restart. Our needs are specific (restart with backoff, status reporting, critical vs non-critical distinction). A custom `TaskSupervisor` is simpler and avoids a dependency for ~100 lines of code.
---
## 7. Implementation Plan
### 7.1 Task Sequence
| Day | Task | Issue | Effort | Test |
|-----|------|-------|--------|------|
| 1 (morning) | CAS size pre-check + `StoreFull` error variant | 2.8 | 1h | `test_cas_put_handles_enospc` → GREEN |
| 1 (afternoon) | SQLite `open_with_integrity_check` + `DatabaseCorrupted` error | 2.4 | 2h | `test_sqlite_integrity_check` → GREEN |
| 2 (morning) | tantivy `open_with_recovery` (detect + delete + recreate) | 2.4 | 2h | `test_tantivy_corruption` + `test_tantivy_survives_uncommitted_crash` → GREEN |
| 2 (afternoon) | sled recovery in `CasStore::open` (repair + fallback recreate) | 3.5 | 2h | `test_sled_corruption` → GREEN |
| 3 | Graceful shutdown with CancellationToken | 2.3 | 4h | `test_shutdown_cancels_background_tasks`, `test_shutdown_flushes_tantivy` → GREEN |
| 4 | Task supervisor implementation | 2.6 | 4h | `test_supervisor_detects_panic`, `test_supervisor_restarts` → GREEN |
| 5 | Integration + regression testing | — | 4h | Full `cargo test`, verify no regressions |
### 7.2 Verification Checklist
After all tasks:
- [ ] `cargo check` — zero errors, zero warnings
- [ ] `cargo test --workspace --exclude musicfs-grpc` — all tests pass (exclude pre-existing grpc issue)
- [ ] `cargo test -p musicfs-test-utils --test resilience` — 5 previously-RED tests now GREEN
- [ ] `cargo clippy` — no new warnings
- [ ] Remaining RED tests are only for Phases C-F (health timeout, parallel checks, fd exhaustion, chunk auto-repair, passthrough mode)
---
## 8. Files Changed
| File | Change | Issue |
|------|--------|-------|
| `musicfs-cas/src/store.rs` | Size pre-check in `put()`, `StoreFull` error, sled recovery in `open()` | 2.8, 3.5 |
| `musicfs-cache/src/db.rs` | `open_with_integrity_check()` with `PRAGMA integrity_check(1)` | 2.4 |
| `musicfs-core/src/error.rs` | Add `DatabaseCorrupted(String)` variant | 2.4 |
| `musicfs-search/src/index.rs` | `open_with_recovery()` — detect, delete, recreate | 2.4 |
| `musicfs-core/src/supervisor.rs` | NEW — `TaskSupervisor`, `TaskStatus`, spawn/monitor/restart | 2.6 |
| `musicfs-core/src/lib.rs` | Re-export supervisor module | 2.6 |
| `musicfs-cli/src/main.rs` | CancellationToken creation, ordered shutdown sequence | 2.3 |
| `musicfs-cli/Cargo.toml` | Add `tokio-util` dependency | 2.3 |
| `musicfs-test-utils/tests/resilience.rs` | Replace `todo!()` stubs with real tests, add supervisor tests | all |
---
## 9. Glossary / References
| Term | Definition |
|------|------------|
| **CancellationToken** | `tokio_util::sync::CancellationToken` — cooperative cancellation signal for async tasks |
| **PRAGMA integrity_check** | SQLite command that verifies page-level data consistency |
| **sled repair** | sled's built-in recovery mode that attempts to reconstruct a corrupted database |
| **TaskSupervisor** | New struct that monitors `JoinHandle`s and restarts failed tasks with backoff |
| **StoreFull** | New `CasError` variant returned when a write would exceed `max_size` |

| Document | Path |
|----------|------|
| Phase A plan | [phase-a-stop-dying.md](phase-a-stop-dying.md) |
| Resilience audit | [resilience-fault-tolerance.md](resilience-fault-tolerance.md) |
| Resilience testing | [resilience-testing.md](resilience-testing.md) |
| Persistent state | [persistent-state.md](persistent-state.md) |