Add resilience audit and persistent state plans

Comprehensive fault tolerance analysis covering 34 issues across 6 phases: signal handling, crash recovery, cache corruption, network failures, resource exhaustion, and the critical finding that no persistent state is used on mount (every restart is a full origin rescan). Persistent state plan covers storage engine options, mount flow redesign, background delta sync, and the in-memory state inventory.
2026-05-13 12:09:41 +02:00
parent 5ac33987c0
commit 87574ce008
2 changed files with 1770 additions and 0 deletions
@@ -0,0 +1,353 @@
 # MusicFS Persistent State Plan
 **Date**: 2026-05-13
 **Status**: Research Complete — Design Decision Needed
 **Prerequisites**: [architecture.md](../architecture.md), [resilience-fault-tolerance.md](resilience-fault-tolerance.md)
 **Related Requirements**: G1 (O(1) mount time), NFR-1.7 (<500ms mount), FR-7.1 (cache persists across restarts)
 ---
 ## 1. Problem Statement
 Every mount is a full cold start. The `run_mount()` function in `main.rs` does not use any persistent storage — it walks the entire origin filesystem, parses metadata from every audio file, and builds all runtime state from scratch.
 The architecture designed persistence infrastructure (SQLite schema, `chunk_manifest` column, `ChunkManifest::from_db()`, `chunks_to_bytes()`) but **none of it is wired into the mount path**. The mount flow doesn't even open the database.
 ### Mount Time by Library Size (Current)
 | Library Size | Estimated Mount Time | Target (NFR-1.7) |
 |---|---|---|
 | 1K files | ~1-2s | <500ms |
 | 10K files | ~10-20s | <500ms |
 | 100K files | ~2-5 minutes | <500ms |
 | 1M files | ~20-60 minutes | <500ms |
 | 10M files (stretch) | hours | <500ms |
 ---
 ## 2. In-Memory State Inventory
 ### 2.1 State That Must Survive Restart
 These are the large, expensive-to-rebuild data structures. Losing them forces a full origin rescan.
 #### VirtualTree (~300-400MB at 1M files)
 **Location**: `musicfs-cache/src/tree.rs`
 **Contents**:
 - `nodes: HashMap<Inode, VirtualNode>` — every directory and file node
 - `path_to_inode: HashMap<VirtualPath, Inode>` — reverse path lookup
 - `next_inode: AtomicU64` — inode counter
 **Currently rebuilt from**: Full recursive origin scan + metadata parse of every audio file. This is the single most expensive operation on mount — it touches every file on origin, runs symphonia metadata extraction, and builds the entire tree structure.
 **What's needed**: Load from persistent storage on mount. Rebuild only on first-ever mount or if storage is corrupt.
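A minimal sketch of rebuilding the tree from stored records instead of an origin scan. `Tree`, `Node`, and `from_stored_paths` are simplified, hypothetical stand-ins for `VirtualTree`, `VirtualNode`, and whatever loader is eventually written; virtual paths are assumed to be '/'-separated strings. Inodes are assigned fresh on every load in this sketch; whether they must stay stable across restarts is a separate question this plan does not settle.

```rust
use std::collections::HashMap;

type Inode = u64;
const ROOT: Inode = 1;

/// Hypothetical, simplified node; the real `VirtualNode` is not shown in this plan.
#[derive(Debug)]
struct Node {
    name: String,
    parent: Inode,
    is_dir: bool,
}

/// Simplified stand-in for `VirtualTree`, keyed by '/'-separated virtual paths.
struct Tree {
    nodes: HashMap<Inode, Node>,
    path_to_inode: HashMap<String, Inode>,
    next_inode: Inode,
}

impl Tree {
    fn new() -> Self {
        let mut tree = Tree {
            nodes: HashMap::new(),
            path_to_inode: HashMap::new(),
            next_inode: ROOT + 1,
        };
        // The root directory is addressed by the empty path.
        tree.nodes.insert(ROOT, Node { name: String::new(), parent: ROOT, is_dir: true });
        tree.path_to_inode.insert(String::new(), ROOT);
        tree
    }

    /// Returns the inode for `path`, creating it and any missing ancestors.
    fn ensure(&mut self, path: &str, is_dir: bool) -> Inode {
        if let Some(&ino) = self.path_to_inode.get(path) {
            return ino;
        }
        let (parent_path, name) = match path.rfind('/') {
            Some(i) => (&path[..i], &path[i + 1..]),
            None => ("", path),
        };
        let parent = self.ensure(parent_path, true);
        let ino = self.next_inode;
        self.next_inode += 1;
        self.nodes.insert(ino, Node { name: name.to_string(), parent, is_dir });
        self.path_to_inode.insert(path.to_string(), ino);
        ino
    }

    /// Rebuilds the tree from stored virtual paths without touching origin.
    fn from_stored_paths<'a>(paths: impl IntoIterator<Item = &'a str>) -> Self {
        let mut tree = Tree::new();
        for path in paths {
            tree.ensure(path, false);
        }
        tree
    }
}
```

The cost is one hash lookup per path component, so load time depends on local reads and memory only, never on origin latency.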
 ---
 #### ContentFetcher.file_meta (~200MB at 1M files)
 **Location**: `musicfs-cas/src/fetcher.rs`
 **Contents**:
 - `file_meta: RwLock<HashMap<FileId, FileMeta>>` — full metadata for every file
 - Each `FileMeta` contains: id, virtual_path, real_path (origin_id + path), size, mtime, content_hash, audio metadata
 **Currently rebuilt from**: Same origin scan that builds the tree. Every file is registered via `fetcher.register_file(meta)`.
 **What's needed**: This is essentially a duplicate of the tree data in a different shape. If the tree is loaded from storage, this map should be populated from the same source.
 ---
 #### FileReader.manifests (~100MB at 1M files)
 **Location**: `musicfs-cas/src/reader.rs`
 **Contents**:
 - `manifests: RwLock<HashMap<FileId, ChunkManifest>>` — maps FileId to list of chunk hashes + offsets
 - Each `ChunkManifest` contains: file_id, total_size, mtime, chunks (Vec<ChunkRef> with hash + offset + size)
 **Currently rebuilt from**: Re-fetched from origin on first `read()` after restart. The fetcher downloads the entire file, chunks it via CDC, stores chunks in CAS (dedup catches existing ones), and builds the manifest. This means every file is re-downloaded once after restart even though the chunks are already on disk.
 **What's needed**: Persist manifests to storage after fetch. Load on mount. This is the difference between "restart = re-download everything" and "restart = instant reads from cache."
 **Existing dead code**: SQLite `files` table has `chunk_manifest BLOB` column. `ChunkManifest::chunks_to_bytes()` and `ChunkManifest::from_db()` exist but are never called.
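To illustrate what persisting a manifest involves, here is a sketch of a blob round trip. It deliberately uses a fixed-width little-endian layout rather than the msgpack encoding mentioned elsewhere in this plan, and it is not the existing `chunks_to_bytes()`/`from_db()` code. `ChunkRef` here is a simplified stand-in, and the field widths (32-byte hash, `u64` offset, `u32` size) are assumptions.

```rust
/// Simplified `ChunkRef`; field widths are assumptions for this sketch.
#[derive(Debug, Clone, PartialEq)]
struct ChunkRef {
    hash: [u8; 32],
    offset: u64,
    size: u32,
}

/// Bytes per encoded record: hash + offset + size.
const RECORD_LEN: usize = 32 + 8 + 4;

/// Encodes chunks as fixed-width little-endian records.
fn chunks_to_blob(chunks: &[ChunkRef]) -> Vec<u8> {
    let mut out = Vec::with_capacity(chunks.len() * RECORD_LEN);
    for c in chunks {
        out.extend_from_slice(&c.hash);
        out.extend_from_slice(&c.offset.to_le_bytes());
        out.extend_from_slice(&c.size.to_le_bytes());
    }
    out
}

/// Decodes a blob. Returns `None` when the length is not a whole number of
/// records, so a truncated blob is treated as a cache miss rather than trusted.
fn chunks_from_blob(blob: &[u8]) -> Option<Vec<ChunkRef>> {
    if blob.len() % RECORD_LEN != 0 {
        return None;
    }
    let mut chunks = Vec::with_capacity(blob.len() / RECORD_LEN);
    for rec in blob.chunks_exact(RECORD_LEN) {
        let mut hash = [0u8; 32];
        hash.copy_from_slice(&rec[..32]);
        let mut offset = [0u8; 8];
        offset.copy_from_slice(&rec[32..40]);
        let mut size = [0u8; 4];
        size.copy_from_slice(&rec[40..44]);
        chunks.push(ChunkRef {
            hash,
            offset: u64::from_le_bytes(offset),
            size: u32::from_le_bytes(size),
        });
    }
    Some(chunks)
}
```

Whatever encoding is chosen, the decode path should reject malformed blobs and fall back to a re-fetch, since the chunks themselves remain in CAS.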
 ---
 #### LruEviction access times (~50MB at 100K chunks)
 **Location**: `musicfs-cache/src/eviction.rs`
 **Contents**:
 - `access_times: RwLock<BTreeMap<Instant, ChunkHash>>` — ordered by access time
 - `hash_to_time: RwLock<HashMap<ChunkHash, Instant>>` — reverse lookup
 **Currently rebuilt from**: Nothing. After restart, all chunks have equal eviction priority. The album you're currently listening to is just as likely to be evicted as something you played 6 months ago.
 **What's needed**: Persist last-access timestamps. On mount, load and reconstruct the LRU order so hot data stays cached.
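One detail worth spelling out: `Instant` is opaque and only meaningful within a single process, so it cannot be written to disk and reloaded. The sketch below assumes access times are kept as wall-clock seconds since the Unix epoch instead. `PersistableLru` and its methods are hypothetical names, not the existing `LruEviction` API.

```rust
use std::collections::{BTreeSet, HashMap};

type ChunkHash = [u8; 32];

/// LRU index keyed by wall-clock seconds so it can be saved and reloaded.
#[derive(Default)]
struct PersistableLru {
    /// Ordered by (timestamp, hash); the tuple key keeps chunks that share
    /// a timestamp as distinct entries.
    by_time: BTreeSet<(u64, ChunkHash)>,
    hash_to_time: HashMap<ChunkHash, u64>,
}

impl PersistableLru {
    /// Records an access at `now_secs` (seconds since the Unix epoch).
    fn touch(&mut self, hash: ChunkHash, now_secs: u64) {
        if let Some(old) = self.hash_to_time.insert(hash, now_secs) {
            self.by_time.remove(&(old, hash));
        }
        self.by_time.insert((now_secs, hash));
    }

    /// Least recently used chunk, i.e. the next eviction candidate.
    fn coldest(&self) -> Option<ChunkHash> {
        self.by_time.iter().next().map(|&(_, h)| h)
    }

    /// Flat (hash, timestamp) pairs suitable for any key-value store.
    fn snapshot(&self) -> Vec<(ChunkHash, u64)> {
        self.hash_to_time.iter().map(|(h, t)| (*h, *t)).collect()
    }

    /// Rebuilds the LRU order from pairs loaded on mount.
    fn restore(entries: impl IntoIterator<Item = (ChunkHash, u64)>) -> Self {
        let mut lru = Self::default();
        for (hash, ts) in entries {
            lru.touch(hash, ts);
        }
        lru
    }
}
```

After `restore`, recently played chunks sort last in `by_time` again, so they are the last to be evicted.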
 ---
 ### 2.2 State That Survives But Is Ignored on Mount
 These persist on disk but `run_mount()` never opens them.
 | Component | Persisted To | Loaded on Mount? | Effect |
 |---|---|---|---|
 | SQLite metadata (files table) | `metadata.db` | ❌ | All metadata re-scanned from origin |
 | tantivy search index | `search.idx/` | ❌ | Index rebuilt from scratch (or not at all) |
 | PatternStore (access patterns) | SQLite (separate DB) | ❌ | Predictions reset to zero |
 | CollectionStore (smart collections) | SQLite (same as patterns) | ❌ | Collections unavailable until opened |
 ### 2.3 State That Correctly Does Not Need Persistence
 | Component | Why Transient Is Fine |
 |---|---|
 | OriginRegistry (origin connections) | Reconstructed from config on startup |
 | Router (priorities, latency stats) | Priorities from config; latency stats warm up within seconds |
 | HealthMonitor (health state) | All origins start as Unknown, converge within one check cycle (~30s) |
 | EventBus (in-flight events) | Transient by nature |
 | PrefetchEngine.in_flight | Transient work queue |
 | PluginManager | Re-loaded from config + plugin directories |
 | MusicFs.query_inodes | Transient search session state |
 | CasStore.current_size | Recalculated on open (though currently broken — see resilience doc 3.10) |
 ---
 ## 3. Storage Decision
 ### 3.1 Requirements for Persistent State
 1. **Bulk sequential read on mount** — load ~1M records into in-memory structures as fast as possible
 2. **Incremental updates at runtime** — delta sync adds/removes/modifies individual files
 3. **Crash safety** — no corruption on unclean shutdown (SIGKILL, power loss)
 4. **Manifest storage** — binary blobs (msgpack-encoded chunk lists), variable size (100 bytes to 10KB per file)
 5. **LRU timestamps** — simple key-value (ChunkHash → last_access_timestamp)
 6. **Already in project** — minimize new dependencies
 ### 3.2 Options
 #### Option A: SQLite (Current Architecture Choice)
 **Already in project**: `rusqlite` dependency, `schema.sql` with `files` table, `Database` struct with full CRUD, `chunk_manifest BLOB` column ready.
 | Metric | Performance |
 |---|---|
 | Bulk load 1M rows | ~2-4 seconds (WAL mode, indexed) |
 | Single row upsert | ~50μs |
 | Crash safety | WAL mode — excellent |
 | Manifest blobs | Native BLOB support, no size limit |
 **Pros**: Already built (schema, code, tests exist). Well-understood crash semantics. Single file backup. SQL queries for debugging. The `chunk_manifest` column and the `from_db()`/`chunks_to_bytes()` methods are already written.
 **Cons**: Not the fastest for pure key-value workloads. WAL checkpoint can cause brief write pauses. Single-writer limitation (Mutex around connection).
 **Effort to wire up**: ~5-7 days (mostly connecting existing code, not writing new code)
 ---
 #### Option B: sled (Already in Project for CAS Index)
 **Already in project**: Used for CAS chunk hash → location mapping.
 | Metric | Performance |
 |---|---|
 | Bulk load 1M entries | ~1-2 seconds (log-structured storage, sequential reads) |
 | Single entry upsert | ~10-20μs |
 | Crash safety | Built-in WAL — good |
 | Manifest blobs | Native byte value support |
 **Pros**: Faster than SQLite for pure key-value. Already a dependency. Good for LRU timestamps (simple k/v).
 **Cons**: No SQL — querying for debugging is harder. No schema migration story. Limited tooling. Has known issues with large datasets (memory usage during compaction). Two persistence engines = two things to maintain.
 **Effort**: ~7-9 days (new serialization layer, no existing code to reuse)
 ---
 #### Option C: Flat File (bincode/msgpack dump)
 | Metric | Performance |
 |---|---|
 | Bulk load 1M entries | <1 second (sequential read; note that bincode still needs a deserialization pass, so this is not truly zero-parse) |
 | Single entry upsert | N/A — full rewrite required |
 | Crash safety | Must write atomically (tmp + rename) |
 | Manifest blobs | Part of serialized struct |
 **Pros**: Fastest possible bulk load. Simplest implementation.
 **Cons**: No incremental updates — every change requires serializing and rewriting the entire file. At 1M files (~500MB serialized), a single file modification triggers a 500MB write. No concurrent access. No recovery from partial corruption.
 **Effort**: ~3-4 days but creates ongoing maintenance burden for delta updates
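The "tmp + rename" requirement in the table above can be sketched as follows. `write_atomically` is a hypothetical helper name, and the atomic-replace guarantee of `rename` is assumed to hold on the target (POSIX) filesystem.

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::Path;

/// Writes `bytes` to `path` so that readers see either the old file or the
/// complete new one, never a partially written file.
fn write_atomically(path: &Path, bytes: &[u8]) -> io::Result<()> {
    // Same directory as the target, so the rename stays on one filesystem.
    let tmp = path.with_extension("tmp");
    {
        let mut file = File::create(&tmp)?;
        file.write_all(bytes)?;
        // Flush file contents to disk before the rename makes them visible.
        file.sync_all()?;
    }
    // rename() replaces the destination atomically on POSIX filesystems.
    fs::rename(&tmp, path)?;
    // Best effort: sync the parent directory so the rename itself is durable.
    if let Some(dir) = path.parent() {
        if let Ok(d) = File::open(dir) {
            let _ = d.sync_all();
        }
    }
    Ok(())
}
```

This makes a full rewrite crash-safe, but it does not remove the cost described above: every change still rewrites the whole file.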
 ---
 #### Option D: Hybrid (SQLite for metadata + sled for hot-path data)
 Use SQLite for structured metadata (files, collections, patterns — already built) and sled for hot-path key-value data (manifests, LRU timestamps — performance-critical).
 **Pros**: Each store optimized for its access pattern. SQLite for queryable metadata, sled for fast blob lookup.
 **Cons**: Two persistence engines to coordinate. Consistency between them on crash. More complex startup/shutdown.
 ---
 ### 3.3 Recommendation
 **Pending your decision.** The tradeoffs are:
 - **Simplest path**: Option A (SQLite) — most code already exists, just needs wiring
 - **Fastest hot-path**: Option D (Hybrid) — but more complexity
 - **Fastest bulk load**: Option C (Flat file) — but no incremental updates
 The choice depends on what you value most. SQLite at 1M files loads in ~2-4 seconds, roughly 4-8x the <500ms target in NFR-1.7. Is that acceptable? If not, a flat file or sled for the tree data, with SQLite for everything else, might be needed.
 ---
 ## 4. What Needs to Change
 Regardless of storage choice, these are the code changes needed:
 ### 4.1 Mount Path (musicfs-cli/src/main.rs)
 Current `run_mount()` flow:
 ```
 1. Open CAS store                           → O(1)
 2. Create origin connection                 → O(1)
 3. scan_music_files() — FULL ORIGIN WALK    → O(N × origin_latency)  ← BOTTLENECK
 4. Build tree from scan results             → O(N)
 5. Register files in fetcher                → O(N)
 6. Mount FUSE                               → O(1)
 ```
 Required flow:
 ```
 1. Open CAS store                           → O(1)
 2. Open persistent state store              → O(1)
 3. IF store has data:
     Load tree from store                   → O(N × local_read)  ← ~1000x faster
     Load manifests from store              → O(N × local_read)
     Load LRU access times from store       → O(chunks)
   ELSE (first mount):
     Full origin scan (current behavior)    → O(N × origin_latency)
     Persist results to store               → O(N × local_write)
 4. Open tantivy search index                → O(1)
 5. Open PatternStore                        → O(1)
 6. Create origin connections                → O(1)
 7. Mount FUSE                               → O(1)
 8. Background: delta sync (origin vs store) → incremental, non-blocking
 ```
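Step 3 of the required flow (load from the store, else fall back to a full scan and persist it) can be sketched independently of the storage decision. Everything here is hypothetical: `StateStore`, `FileRecord`, `MountSource`, and `load_or_scan` are illustrative names, and "store is empty" is assumed to be the first-mount signal.

```rust
/// Minimal stand-in for a stored file record.
#[derive(Debug, Clone, PartialEq)]
struct FileRecord {
    virtual_path: String,
    size: u64,
    mtime: u64,
}

/// Hypothetical storage abstraction; SQLite, sled, or a flat file could sit behind it.
trait StateStore {
    fn load_all(&self) -> Vec<FileRecord>;
    fn save_all(&mut self, records: &[FileRecord]);
}

/// Where the mounted state came from.
#[derive(Debug, PartialEq)]
enum MountSource {
    Store,
    FullScan,
}

/// Loads records from the store when it has data; otherwise runs the full
/// origin scan and persists the result for the next mount.
fn load_or_scan<S: StateStore>(
    store: &mut S,
    scan_origin: impl FnOnce() -> Vec<FileRecord>,
) -> (Vec<FileRecord>, MountSource) {
    let stored = store.load_all();
    if !stored.is_empty() {
        return (stored, MountSource::Store);
    }
    let scanned = scan_origin();
    store.save_all(&scanned);
    (scanned, MountSource::FullScan)
}

/// In-memory store used only to exercise the flow.
#[derive(Default)]
struct MemStore(Vec<FileRecord>);

impl StateStore for MemStore {
    fn load_all(&self) -> Vec<FileRecord> {
        self.0.clone()
    }
    fn save_all(&mut self, records: &[FileRecord]) {
        self.0 = records.to_vec();
    }
}
```

Passing the scan as a closure keeps the expensive origin walk out of the fast path entirely: on a subsequent mount it is never invoked.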
 ### 4.2 Runtime Persistence (Write Path)
 These operations must persist state changes as they happen, not just on shutdown:
 | Event | What to Persist | When |
 |---|---|---|
 | File discovered during sync | FileMeta → store | Immediately (in batch if scanning) |
 | File removed during sync | Delete from store | Immediately |
 | File metadata changed | Update FileMeta in store | Immediately |
 | File content fetched (cache miss) | ChunkManifest → store | After fetch completes |
 | Chunk accessed | Update LRU timestamp | Batched (every 10s or 100 accesses) |
 | Search index updated | tantivy handles its own persistence | On commit (every 5s) |
 | Access pattern recorded | PatternStore handles its own persistence | Already persisted per-access |
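The batched LRU update in the table above ("every 10s or 100 accesses") could look like the sketch below. `AccessBatcher` is a hypothetical type; the caller passes in the current time, which keeps the logic deterministic and easy to test, and is responsible for writing the returned batch to the store.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

type ChunkHash = [u8; 32];

/// Buffers LRU timestamp updates and reports when a flush is due: after
/// `max_pending` recorded accesses or `max_age` since the last flush.
struct AccessBatcher {
    pending: HashMap<ChunkHash, u64>,
    accesses: usize,
    last_flush: Instant,
    max_pending: usize,
    max_age: Duration,
}

impl AccessBatcher {
    fn new(max_pending: usize, max_age: Duration, now: Instant) -> Self {
        Self { pending: HashMap::new(), accesses: 0, last_flush: now, max_pending, max_age }
    }

    /// Records an access; returns the batch to persist if a threshold was reached.
    fn record(&mut self, hash: ChunkHash, ts_secs: u64, now: Instant) -> Option<Vec<(ChunkHash, u64)>> {
        // Repeated accesses to one chunk collapse into a single pending write.
        self.pending.insert(hash, ts_secs);
        self.accesses += 1;
        let due = self.accesses >= self.max_pending
            || now.saturating_duration_since(self.last_flush) >= self.max_age;
        if due { Some(self.drain(now)) } else { None }
    }

    /// Takes all pending updates, for example on graceful shutdown.
    fn drain(&mut self, now: Instant) -> Vec<(ChunkHash, u64)> {
        self.accesses = 0;
        self.last_flush = now;
        self.pending.drain().collect()
    }
}
```

With the document's thresholds this would be constructed as `AccessBatcher::new(100, Duration::from_secs(10), Instant::now())`. A crash loses at most the pending map, which matches the 10s loss window accepted in the shutdown section.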
 ### 4.3 Files That Need Changes
 | File | Change |
 |---|---|
 | `musicfs-cli/src/main.rs` | Rewrite `run_mount()` to load from store; add background delta sync |
 | `musicfs-cache/src/db.rs` | Add `list_all_files()` bulk load; add manifest read/write methods (if SQLite) |
 | `musicfs-cache/src/tree.rs` | Add `TreeBuilder::from_file_metas(iter)` — build tree from stored records |
 | `musicfs-cas/src/reader.rs` | Load manifests from store on startup; persist after fetch |
 | `musicfs-cas/src/fetcher.rs` | After `fetch_file()`, persist manifest to store |
 | `musicfs-cache/src/eviction.rs` | Persist access times; load on startup |
 | `musicfs-search/src/indexer.rs` | On mount, check what's already indexed vs what's in store — skip known files |
 | `musicfs-sync/src/delta.rs` | Background delta sync: compare store state vs origin, sync differences |
 ### 4.4 Shutdown Persistence
 On graceful shutdown (after signal handling from resilience plan Phase A is implemented):
 | Step | What |
 |---|---|
 | 1 | Flush any batched LRU timestamp updates |
 | 2 | Commit tantivy index writer |
 | 3 | WAL checkpoint SQLite (if SQLite): `PRAGMA wal_checkpoint(TRUNCATE)` |
 | 4 | Flush sled (if sled): `sled::Db::flush()` |
 | 5 | Close all database connections |
 On crash (no graceful shutdown):
 - SQLite WAL mode: automatic recovery on next open (no data loss for committed transactions)
 - sled: automatic recovery via internal WAL
 - tantivy: up to 5 seconds of uncommitted documents lost, but recoverable from store
 - LRU timestamps: batched updates may lose last batch (10s window) — acceptable
 ---
 ## 5. Background Delta Sync
 After mounting from persistent state, the data may be stale (origin changed while daemon was stopped). A background sync reconciles:
 ```
 1. Walk origin (or use watcher for inotify-capable origins)
 2. For each file on origin:
   a. Compare mtime + size against stored record
   b. If unchanged → skip
   c. If modified → re-parse metadata, update store, update tree, invalidate manifest
   d. If new → parse metadata, add to store + tree
 3. For each file in store not found on origin:
   a. Remove from store + tree
 4. Update search index for changed files
 5. Log summary: "Delta sync complete: N added, M modified, K removed, T unchanged"
 ```
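The comparison at the heart of steps 2 and 3 can be sketched as a pure function over two path-to-stamp maps. `Stamp`, `Delta`, and `compute_delta` are hypothetical names, and the sketch assumes that `(mtime, size)` is the only change signal, as step 2a states. Re-parsing metadata, updating the tree, and invalidating manifests would be driven by the resulting lists.

```rust
use std::collections::HashMap;

/// (mtime, size) pair used for change detection, as in step 2a.
type Stamp = (u64, u64);

/// Result of comparing stored state against origin.
#[derive(Debug, Default, PartialEq)]
struct Delta {
    added: Vec<String>,
    modified: Vec<String>,
    removed: Vec<String>,
    unchanged: usize,
}

fn compute_delta(stored: &HashMap<String, Stamp>, origin: &HashMap<String, Stamp>) -> Delta {
    let mut delta = Delta::default();
    // Step 2: classify every file currently on origin.
    for (path, stamp) in origin {
        match stored.get(path) {
            None => delta.added.push(path.clone()),
            Some(old) if old != stamp => delta.modified.push(path.clone()),
            Some(_) => delta.unchanged += 1,
        }
    }
    // Step 3: anything stored but no longer on origin was removed.
    for path in stored.keys() {
        if !origin.contains_key(path) {
            delta.removed.push(path.clone());
        }
    }
    // HashMap iteration order is unspecified; sort for stable logs and tests.
    delta.added.sort();
    delta.modified.sort();
    delta.removed.sort();
    delta
}
```

The counts map directly onto the summary line in step 5 (added, modified, removed, unchanged).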
 This runs in the background AFTER mount completes. Users see the filesystem immediately (from stored state), and it converges to current reality within minutes.
 ### 5.1 Stale Data Window
 Between mount and delta sync completion, users may see:
 - Files that were deleted on origin (will get ENOENT or EIO on read — origin returns not found)
 - Files with old metadata (wrong track name, etc.)
 - Missing files that were added to origin (won't appear until sync discovers them)
 This is acceptable: it is the same behavior as any cached network filesystem (NFS, CIFS). The key insight: **data that is stale until the delta sync completes is far better than no data for 5 minutes.**
 ---
 ## 6. First Mount vs Subsequent Mount
 | | First Mount (empty store) | Subsequent Mount (store has data) |
 |---|---|---|
 | **Tree source** | Origin scan + metadata parse | Load from store |
 | **Manifests** | None (populated on first read) | Loaded from store |
 | **Search index** | Built during/after scan | Opened from disk |
 | **LRU data** | Empty (cold cache) | Loaded from store |
 | **Mount time** | O(N × origin_latency), same as today | O(N × local_read), target <5s for 1M files (still above the <500ms NFR-1.7 target; see Section 3.3) |
 | **Accuracy** | 100% current | Stale until delta sync completes |
 | **Detection** | Store file doesn't exist or is empty | Store file exists with data |
 ---
 ## 7. Estimated Effort
 | Task | Effort | Depends On |
 |---|---|---|
 | Rewrite `run_mount()` with store loading + fallback | 2 days | Storage decision |
 | Persist chunk manifests after fetch | 1 day | Storage decision |
 | Load manifests on mount + register in FileReader | 0.5 day | Above |
 | Open tantivy on mount, skip known files | 1 day | — |
 | Open PatternStore + CollectionStore on mount | 0.5 day | — |
 | Background delta sync | 1.5 days | — |
 | Persist LRU access times + load on mount | 1 day | Storage decision |
 | First-mount detection + fallback to full scan | 0.5 day | — |
 | **Total** | **~8 days** | |
 ---
 ## 8. Open Decision
 **Which storage engine for the persistent state?**
 The answer drives the implementation of every task above. See Section 3 for tradeoffs.