Comprehensive fault tolerance analysis covering 34 issues across 6 phases: signal handling, crash recovery, cache corruption, network failures, resource exhaustion, and the critical finding that no persistent state is used on mount (every restart is a full origin rescan). Persistent state plan covers storage engine options, mount flow redesign, background delta sync, and the in-memory state inventory.
MusicFS Persistent State Plan
Date: 2026-05-13
Status: Research Complete, Design Decision Needed
Prerequisites: architecture.md, resilience-fault-tolerance.md
Related Requirements: G1 (O(1) mount time), NFR-1.7 (<500ms mount), FR-7.1 (cache persists across restarts)
1. Problem Statement
Every mount is a full cold start. The run_mount() function in main.rs does not use any persistent storage — it walks the entire origin filesystem, parses metadata from every audio file, and builds all runtime state from scratch.
The architecture designed persistence infrastructure (SQLite schema, chunk_manifest column, ChunkManifest::from_db(), chunks_to_bytes()) but none of it is wired into the mount path. The mount flow doesn't even open the database.
Mount Time by Library Size (Current)
| Library Size | Estimated Mount Time | Target (NFR-1.7) |
|---|---|---|
| 1K files | ~1-2s | <500ms |
| 10K files | ~10-20s | <500ms |
| 100K files | ~2-5 minutes | <500ms |
| 1M files | ~20-60 minutes | <500ms |
| 10M files (stretch) | hours | <500ms |
2. In-Memory State Inventory
2.1 State That Must Survive Restart
These are the large, expensive-to-rebuild data structures. Losing them forces a full origin rescan.
VirtualTree (~300-400MB at 1M files)
Location: musicfs-cache/src/tree.rs
Contents:
- nodes: HashMap<Inode, VirtualNode> (every directory and file node)
- path_to_inode: HashMap<VirtualPath, Inode> (reverse path lookup)
- next_inode: AtomicU64 (inode counter)
Currently rebuilt from: Full recursive origin scan + metadata parse of every audio file. This is the single most expensive operation on mount — it touches every file on origin, runs symphonia metadata extraction, and builds the entire tree structure.
What's needed: Load from persistent storage on mount. Rebuild only on first-ever mount or if storage is corrupt.
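As a rough illustration of rebuilding the tree from stored records instead of an origin walk, the sketch below builds a path-to-inode map and a parent-to-children map from a flat list of virtual paths. The `Tree` type, its fields, and `from_paths` are hypothetical stand-ins, not the actual types in musicfs-cache/src/tree.rs, and paths are plain strings rather than `VirtualPath`.

```rust
use std::collections::HashMap;

pub type Inode = u64;

/// Hypothetical, simplified tree: only the lookup maps, no node metadata.
#[derive(Debug, Default)]
pub struct Tree {
    pub path_to_inode: HashMap<String, Inode>,
    pub children: HashMap<Inode, Vec<Inode>>,
    next_inode: Inode,
}

impl Tree {
    /// Build the tree from stored virtual paths; no origin access is needed.
    pub fn from_paths<'a, I: IntoIterator<Item = &'a str>>(paths: I) -> Self {
        let mut tree = Tree { next_inode: 2, ..Default::default() };
        // Inode 1 is the FUSE root.
        tree.path_to_inode.insert("/".to_string(), 1);
        for path in paths {
            tree.insert(path);
        }
        tree
    }

    /// Insert one path, creating any missing parent directories on the way.
    fn insert(&mut self, path: &str) {
        let mut parent: Inode = 1;
        let mut prefix = String::new();
        for part in path.split('/').filter(|s| !s.is_empty()) {
            prefix.push('/');
            prefix.push_str(part);
            let ino = match self.path_to_inode.get(&prefix) {
                Some(&existing) => existing,
                None => {
                    let ino = self.next_inode;
                    self.next_inode += 1;
                    self.path_to_inode.insert(prefix.clone(), ino);
                    self.children.entry(parent).or_default().push(ino);
                    ino
                }
            };
            parent = ino;
        }
    }
}
```

In this sketch inodes are assigned in insertion order, so they are only stable across restarts if the stored records are loaded in the same order or the inode is persisted alongside each record.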
ContentFetcher.file_meta (~200MB at 1M files)
Location: musicfs-cas/src/fetcher.rs
Contents:
- file_meta: RwLock<HashMap<FileId, FileMeta>> (full metadata for every file)
- Each FileMeta contains: id, virtual_path, real_path (origin_id + path), size, mtime, content_hash, audio metadata
Currently rebuilt from: Same origin scan that builds the tree. Every file is registered via fetcher.register_file(meta).
What's needed: This is essentially a duplicate of the tree data in a different shape. If the tree is loaded from storage, this map should be populated from the same source.
FileReader.manifests (~100MB at 1M files)
Location: musicfs-cas/src/reader.rs
Contents:
- manifests: RwLock<HashMap<FileId, ChunkManifest>> (maps FileId to list of chunk hashes + offsets)
- Each ChunkManifest contains: file_id, total_size, mtime, chunks (Vec with hash + offset + size)
Currently rebuilt from: Re-fetched from origin on first read() after restart. The fetcher downloads the entire file, chunks it via CDC, stores chunks in CAS (dedup catches existing ones), and builds the manifest. This means every file that is read after a restart is re-downloaded once, even though its chunks are already on disk.
What's needed: Persist manifests to storage after fetch. Load on mount. This is the difference between "restart = re-download everything" and "restart = instant reads from cache."
Existing dead code: SQLite files table has chunk_manifest BLOB column. ChunkManifest::chunks_to_bytes() and ChunkManifest::from_db() exist but are never called.
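To make the "persist after fetch, load on mount" round trip concrete, here is a minimal sketch of encoding a chunk list into a blob that could be stored in a BLOB column and decoded again. It is not the project's format: the doc describes msgpack-encoded chunk lists, while this uses fixed-width little-endian records purely for illustration. `ChunkRef`, `encode_chunks`, and `decode_chunks` are hypothetical names, and the 32-byte hash width is an assumption.

```rust
/// Hypothetical stand-in for one entry of a manifest's chunk list.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ChunkRef {
    pub hash: [u8; 32], // assumed 32-byte content hash
    pub offset: u64,
    pub size: u32,
}

/// Bytes per encoded record: hash + offset + size.
const RECORD_LEN: usize = 32 + 8 + 4;

/// Encode a chunk list as fixed-width little-endian records.
pub fn encode_chunks(chunks: &[ChunkRef]) -> Vec<u8> {
    let mut out = Vec::with_capacity(chunks.len() * RECORD_LEN);
    for c in chunks {
        out.extend_from_slice(&c.hash);
        out.extend_from_slice(&c.offset.to_le_bytes());
        out.extend_from_slice(&c.size.to_le_bytes());
    }
    out
}

/// Decode a blob; returns None if it is not a whole number of records,
/// which is how a truncated or corrupt blob shows up in this encoding.
pub fn decode_chunks(bytes: &[u8]) -> Option<Vec<ChunkRef>> {
    if bytes.len() % RECORD_LEN != 0 {
        return None;
    }
    let mut chunks = Vec::with_capacity(bytes.len() / RECORD_LEN);
    for rec in bytes.chunks_exact(RECORD_LEN) {
        let mut hash = [0u8; 32];
        hash.copy_from_slice(&rec[..32]);
        let mut off = [0u8; 8];
        off.copy_from_slice(&rec[32..40]);
        let mut size = [0u8; 4];
        size.copy_from_slice(&rec[40..44]);
        chunks.push(ChunkRef {
            hash,
            offset: u64::from_le_bytes(off),
            size: u32::from_le_bytes(size),
        });
    }
    Some(chunks)
}
```

A decode failure should be treated like a missing manifest: fall back to re-fetching that one file rather than failing the mount.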
LruEviction access times (~50MB at 100K chunks)
Location: musicfs-cache/src/eviction.rs
Contents:
- access_times: RwLock<BTreeMap<Instant, ChunkHash>> (ordered by access time)
- hash_to_time: RwLock<HashMap<ChunkHash, Instant>> (reverse lookup)
Currently rebuilt from: Nothing. After restart, all chunks have equal eviction priority. The album you're currently listening to is just as likely to be evicted as something you played 6 months ago.
What's needed: Persist last-access timestamps. On mount, load and reconstruct the LRU order so hot data stays cached.
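One detail worth noting: `std::time::Instant` is an opaque monotonic value and cannot be meaningfully stored across process restarts, so persisted access times need a wall-clock representation. The sketch below assumes timestamps are stored as milliseconds since the Unix epoch and rebuilds an oldest-first eviction order from them. `ChunkHash` as a 32-byte array, `to_unix_millis`, `rebuild_lru`, and `eviction_candidate` are all hypothetical.

```rust
use std::collections::BTreeSet;
use std::time::{SystemTime, UNIX_EPOCH};

/// Assumed representation of a chunk hash.
pub type ChunkHash = [u8; 32];

/// Convert a wall-clock time into a storable value (ms since Unix epoch).
/// Times before the epoch collapse to 0.
pub fn to_unix_millis(t: SystemTime) -> u64 {
    t.duration_since(UNIX_EPOCH)
        .map(|d| d.as_millis() as u64)
        .unwrap_or(0)
}

/// Rebuild the eviction order from persisted (hash, last_access_ms) pairs.
/// The set is ordered oldest first; the hash is part of the key so that two
/// chunks with the same timestamp do not overwrite each other.
pub fn rebuild_lru(persisted: &[(ChunkHash, u64)]) -> BTreeSet<(u64, ChunkHash)> {
    persisted.iter().map(|(hash, ts)| (*ts, *hash)).collect()
}

/// The next chunk to evict is the least recently accessed one.
pub fn eviction_candidate(order: &BTreeSet<(u64, ChunkHash)>) -> Option<ChunkHash> {
    order.iter().next().map(|(_, hash)| *hash)
}
```

Wall-clock time can jump (NTP corrections, manual changes), so the restored order is approximate; for eviction priority that is likely good enough.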
2.2 State That Survives But Is Ignored on Mount
These persist on disk but run_mount() never opens them.
| Component | Persisted To | Loaded on Mount? | Effect |
|---|---|---|---|
| SQLite metadata (files table) | metadata.db | ❌ | All metadata re-scanned from origin |
| tantivy search index | search.idx/ | ❌ | Index rebuilt from scratch (or not at all) |
| PatternStore (access patterns) | SQLite (separate DB) | ❌ | Predictions reset to zero |
| CollectionStore (smart collections) | SQLite (same as patterns) | ❌ | Collections unavailable until opened |
2.3 State That Correctly Does Not Need Persistence
| Component | Why Transient Is Fine |
|---|---|
| OriginRegistry (origin connections) | Reconstructed from config on startup |
| Router (priorities, latency stats) | Priorities from config; latency stats warm up within seconds |
| HealthMonitor (health state) | All origins start as Unknown, converge within one check cycle (~30s) |
| EventBus (in-flight events) | Transient by nature |
| PrefetchEngine.in_flight | Transient work queue |
| PluginManager | Re-loaded from config + plugin directories |
| MusicFs.query_inodes | Transient search session state |
| CasStore.current_size | Recalculated on open (though currently broken — see resilience doc 3.10) |
3. Storage Decision
3.1 Requirements for Persistent State
- Bulk sequential read on mount — load ~1M records into in-memory structures as fast as possible
- Incremental updates at runtime — delta sync adds/removes/modifies individual files
- Crash safety — no corruption on unclean shutdown (SIGKILL, power loss)
- Manifest storage — binary blobs (msgpack-encoded chunk lists), variable size (100 bytes to 10KB per file)
- LRU timestamps — simple key-value (ChunkHash → last_access_timestamp)
- Already in project — minimize new dependencies
3.2 Options
Option A: SQLite (Current Architecture Choice)
Already in project: rusqlite dependency, schema.sql with files table, Database struct with full CRUD, chunk_manifest BLOB column ready.
| Metric | Performance |
|---|---|
| Bulk load 1M rows | ~2-4 seconds (WAL mode, indexed) |
| Single row upsert | ~50μs |
| Crash safety | WAL mode — excellent |
| Manifest blobs | Native BLOB support, no size limit |
Pros: Already built (schema, code, tests exist). Well-understood crash semantics. Single file backup. SQL queries for debugging. The chunk_manifest column and the ChunkManifest::from_db() and chunks_to_bytes() methods are already written.
Cons: Not the fastest for pure key-value workloads. WAL checkpoint can cause brief write pauses. Single-writer limitation (Mutex around connection).
Effort to wire up: ~5-7 days (mostly connecting existing code, not writing new code)
Option B: sled (Already in Project for CAS Index)
Already in project: Used for CAS chunk hash → location mapping.
| Metric | Performance |
|---|---|
| Bulk load 1M entries | ~1-2 seconds (log-structured storage, sequential reads) |
| Single entry upsert | ~10-20μs |
| Crash safety | Built-in WAL — good |
| Manifest blobs | Native byte value support |
Pros: Faster than SQLite for pure key-value. Already a dependency. Good for LRU timestamps (simple k/v).
Cons: No SQL — querying for debugging is harder. No schema migration story. Limited tooling. Has known issues with large datasets (memory usage during compaction). Two persistence engines = two things to maintain.
Effort: ~7-9 days (new serialization layer, no existing code to reuse)
Option C: Flat File (bincode/msgpack dump)
| Metric | Performance |
|---|---|
| Bulk load 1M entries | <1 second (sequential read or mmap plus bincode decode; bincode still deserializes, so this is not truly zero-parse) |
| Single entry upsert | N/A — full rewrite required |
| Crash safety | Must write atomically (tmp + rename) |
| Manifest blobs | Part of serialized struct |
Pros: Fastest possible bulk load. Simplest implementation.
Cons: No incremental updates — every change requires serializing and rewriting the entire file. At 1M files (~500MB serialized), a single file modification triggers a 500MB write. No concurrent access. No recovery from partial corruption.
Effort: ~3-4 days but creates ongoing maintenance burden for delta updates
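The "tmp + rename" crash-safety requirement for Option C can be sketched as follows. This is a minimal illustration using only the standard library; `write_atomic` is a hypothetical helper and the `.tmp` naming is an assumption.

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::Path;

/// Write `bytes` to `dest` so that a crash leaves either the old file or the
/// new file in place, never a half-written one.
pub fn write_atomic(dest: &Path, bytes: &[u8]) -> io::Result<()> {
    // Temp file lives next to the destination so the rename stays on one filesystem.
    let tmp = dest.with_extension("tmp");
    {
        let mut file = File::create(&tmp)?;
        file.write_all(bytes)?;
        // Force the data to disk before the rename makes it visible.
        file.sync_all()?;
    }
    // On POSIX filesystems, rename within one filesystem replaces the target atomically.
    fs::rename(&tmp, dest)?;
    // Note: for full durability the parent directory should also be fsynced;
    // that step is omitted here.
    Ok(())
}
```

This addresses torn writes only. It does not change the cost noted above: every update still rewrites the whole serialized state.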
Option D: Hybrid (SQLite for metadata + sled for hot-path data)
Use SQLite for structured metadata (files, collections, patterns — already built) and sled for hot-path key-value data (manifests, LRU timestamps — performance-critical).
Pros: Each store optimized for its access pattern. SQLite for queryable metadata, sled for fast blob lookup.
Cons: Two persistence engines to coordinate. Consistency between them on crash. More complex startup/shutdown.
3.3 Recommendation
Pending your decision. The tradeoffs are:
- Simplest path: Option A (SQLite) — most code already exists, just needs wiring
- Fastest hot-path: Option D (Hybrid) — but more complexity
- Fastest bulk load: Option C (Flat file) — but no incremental updates
The choice depends on what you value most. SQLite at 1M files loads in ~2-4 seconds — is that acceptable vs the <500ms target? If not, a flat file or sled for the tree data with SQLite for everything else might be needed.
4. What Needs to Change
Regardless of storage choice, these are the code changes needed:
4.1 Mount Path (musicfs-cli/src/main.rs)
Current run_mount() flow:
1. Open CAS store → O(1)
2. Create origin connection → O(1)
3. scan_music_files() — FULL ORIGIN WALK → O(N × origin_latency) ← BOTTLENECK
4. Build tree from scan results → O(N)
5. Register files in fetcher → O(N)
6. Mount FUSE → O(1)
Required flow:
1. Open CAS store → O(1)
2. Open persistent state store → O(1)
3. IF store has data:
Load tree from store → O(N × local_read) ← ~1000x faster
Load manifests from store → O(N × local_read)
Load LRU access times from store → O(chunks)
ELSE (first mount):
Full origin scan (current behavior) → O(N × origin_latency)
Persist results to store → O(N × local_write)
4. Open tantivy search index → O(1)
5. Open PatternStore → O(1)
6. Create origin connections → O(1)
7. Mount FUSE → O(1)
8. Background: delta sync (origin vs store) → incremental, non-blocking
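The load-or-scan branch in step 3 of the required flow can be sketched independently of the storage engine. Everything here is hypothetical: `StateStore`, `FileRecord`, `MountSource`, `load_or_scan`, and the in-memory `MemStore` exist only to show the control flow, and the real run_mount() would also load manifests and LRU data.

```rust
/// Hypothetical minimal stored record for one file.
#[derive(Debug, Clone, PartialEq)]
pub struct FileRecord {
    pub path: String,
    pub size: u64,
    pub mtime: u64,
}

/// Hypothetical storage abstraction; SQLite, sled, or a flat file could sit behind it.
pub trait StateStore {
    fn load_all(&self) -> Vec<FileRecord>;
    fn save_all(&mut self, records: &[FileRecord]);
}

/// Where the mounted state came from.
#[derive(Debug, PartialEq)]
pub enum MountSource {
    Store,
    OriginScan,
}

/// Load from the store if it has data; otherwise scan origin once and persist.
pub fn load_or_scan<S, F>(store: &mut S, scan_origin: F) -> (Vec<FileRecord>, MountSource)
where
    S: StateStore,
    F: FnOnce() -> Vec<FileRecord>,
{
    let stored = store.load_all();
    if !stored.is_empty() {
        return (stored, MountSource::Store);
    }
    // First mount (or empty store): fall back to the full origin scan.
    let scanned = scan_origin();
    store.save_all(&scanned);
    (scanned, MountSource::OriginScan)
}

/// In-memory store, for illustration and tests only.
#[derive(Default)]
pub struct MemStore {
    records: Vec<FileRecord>,
}

impl StateStore for MemStore {
    fn load_all(&self) -> Vec<FileRecord> {
        self.records.clone()
    }
    fn save_all(&mut self, records: &[FileRecord]) {
        self.records = records.to_vec();
    }
}
```

Detecting first mount by "store is empty" means a genuinely empty library rescans on every mount; a stored marker such as a last-scan timestamp would avoid that, at the cost of one more piece of state.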
4.2 Runtime Persistence (Write Path)
These operations must persist state changes as they happen, not just on shutdown:
| Event | What to Persist | When |
|---|---|---|
| File discovered during sync | FileMeta → store | Immediately (in batch if scanning) |
| File removed during sync | Delete from store | Immediately |
| File metadata changed | Update FileMeta in store | Immediately |
| File content fetched (cache miss) | ChunkManifest → store | After fetch completes |
| Chunk accessed | Update LRU timestamp | Batched (every 10s or 100 accesses) |
| Search index updated | tantivy handles its own persistence | On commit (every 5s) |
| Access pattern recorded | PatternStore handles its own persistence | Already persisted per-access |
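The batched LRU row above (flush every 10s or 100 accesses) could look roughly like the sketch below. `LruBatcher` and its methods are hypothetical; the caller passes the current `Instant` in so the logic stays deterministic, and the returned batch is what would be written to the store in one transaction.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Assumed representation of a chunk hash.
pub type ChunkHash = [u8; 32];

/// Collects access timestamps in memory and hands them back in batches.
pub struct LruBatcher {
    pending: HashMap<ChunkHash, u64>, // latest timestamp per chunk
    accesses: usize,
    last_flush: Instant,
    max_accesses: usize,
    max_age: Duration,
}

impl LruBatcher {
    pub fn new(max_accesses: usize, max_age: Duration, now: Instant) -> Self {
        LruBatcher {
            pending: HashMap::new(),
            accesses: 0,
            last_flush: now,
            max_accesses,
            max_age,
        }
    }

    /// Record one access. Returns a batch to persist once either threshold
    /// (access count or time since last flush) is reached.
    pub fn record(
        &mut self,
        hash: ChunkHash,
        unix_ts: u64,
        now: Instant,
    ) -> Option<Vec<(ChunkHash, u64)>> {
        // Repeated accesses to one chunk collapse into a single pending entry.
        self.pending.insert(hash, unix_ts);
        self.accesses += 1;
        let too_many = self.accesses >= self.max_accesses;
        let too_old = now.saturating_duration_since(self.last_flush) >= self.max_age;
        if too_many || too_old {
            Some(self.drain(now))
        } else {
            None
        }
    }

    /// Take everything pending; also intended for the graceful-shutdown flush.
    pub fn drain(&mut self, now: Instant) -> Vec<(ChunkHash, u64)> {
        self.accesses = 0;
        self.last_flush = now;
        self.pending.drain().collect()
    }
}
```

In this sketch the time threshold is only checked when an access arrives, so an idle daemon would hold its last batch until the next access or shutdown; a periodic timer calling `drain` would close that gap.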
4.3 Files That Need Changes
| File | Change |
|---|---|
| musicfs-cli/src/main.rs | Rewrite run_mount() to load from store; add background delta sync |
| musicfs-cache/src/db.rs | Add list_all_files() bulk load; add manifest read/write methods (if SQLite) |
| musicfs-cache/src/tree.rs | Add TreeBuilder::from_file_metas(iter) to build the tree from stored records |
| musicfs-cas/src/reader.rs | Load manifests from store on startup; persist after fetch |
| musicfs-cas/src/fetcher.rs | After fetch_file(), persist manifest to store |
| musicfs-cache/src/eviction.rs | Persist access times; load on startup |
| musicfs-search/src/indexer.rs | On mount, check what's already indexed vs what's in store; skip known files |
| musicfs-sync/src/delta.rs | Background delta sync: compare store state vs origin, sync differences |
4.4 Shutdown Persistence
On graceful shutdown (after signal handling from resilience plan Phase A is implemented):
| Step | What |
|---|---|
| 1 | Flush any batched LRU timestamp updates |
| 2 | Commit tantivy index writer |
| 3 | WAL checkpoint SQLite (if SQLite): PRAGMA wal_checkpoint(TRUNCATE) |
| 4 | Flush sled (if sled): sled::Db::flush() |
| 5 | Close all database connections |
On crash (no graceful shutdown):
- SQLite WAL mode: automatic recovery on next open (no data loss for committed transactions)
- sled: automatic recovery via internal WAL
- tantivy: up to 5 seconds of uncommitted documents lost, but recoverable from store
- LRU timestamps: batched updates may lose last batch (10s window) — acceptable
5. Background Delta Sync
After mounting from persistent state, the data may be stale (origin changed while daemon was stopped). A background sync reconciles:
1. Walk origin (or use watcher for inotify-capable origins)
2. For each file on origin:
a. Compare mtime + size against stored record
b. If unchanged → skip
c. If modified → re-parse metadata, update store, update tree, invalidate manifest
d. If new → parse metadata, add to store + tree
3. For each file in store not found on origin:
a. Remove from store + tree
4. Update search index for changed files
5. Log summary: "Delta sync complete: N added, M modified, K removed, T unchanged"
This runs in the background AFTER mount completes. Users see the filesystem immediately (from stored state), and it converges to current reality within minutes.
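The comparison at the core of the delta sync (steps 2 and 3) is a diff of two path-keyed maps on mtime and size. The sketch below shows only that classification; `Stat`, `Delta`, and `diff` are hypothetical names, paths are plain strings, and metadata re-parsing, tree updates, and manifest invalidation are left out.

```rust
use std::collections::HashMap;

/// The two attributes used to detect change.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Stat {
    pub mtime: u64,
    pub size: u64,
}

/// Result of comparing stored state against the origin listing.
#[derive(Debug, Default, PartialEq, Eq)]
pub struct Delta {
    pub added: Vec<String>,
    pub modified: Vec<String>,
    pub removed: Vec<String>,
    pub unchanged: usize,
}

/// Classify every path as added, modified, removed, or unchanged.
pub fn diff(stored: &HashMap<String, Stat>, origin: &HashMap<String, Stat>) -> Delta {
    let mut delta = Delta::default();
    for (path, origin_stat) in origin {
        match stored.get(path) {
            None => delta.added.push(path.clone()),
            Some(stored_stat) if stored_stat != origin_stat => delta.modified.push(path.clone()),
            Some(_) => delta.unchanged += 1,
        }
    }
    for path in stored.keys() {
        if !origin.contains_key(path) {
            delta.removed.push(path.clone());
        }
    }
    // HashMap iteration order is arbitrary; sort for stable logs and tests.
    delta.added.sort();
    delta.modified.sort();
    delta.removed.sort();
    delta
}
```

Holding the full origin listing in memory is a simplification; a streaming walk that marks stored records as seen would give the same classification without the second map. Comparing only mtime and size also misses edits that preserve both.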
5.1 Stale Data Window
Between mount and delta sync completion, users may see:
- Files that were deleted on origin (will get ENOENT or EIO on read — origin returns not found)
- Files with old metadata (wrong track name, etc.)
- Missing files that were added to origin (won't appear until sync discovers them)
This is acceptable: it is the same behavior as any cached network filesystem (NFS, CIFS). The key insight is that briefly stale data is far better than no data for the minutes a full origin scan takes.
6. First Mount vs Subsequent Mount
| | First Mount (empty store) | Subsequent Mount (store has data) |
|---|---|---|
| Tree source | Origin scan + metadata parse | Load from store |
| Manifests | None (populated on first read) | Loaded from store |
| Search index | Built during/after scan | Opened from disk |
| LRU data | Empty (cold cache) | Loaded from store |
| Mount time | O(N × origin_latency) — same as today | O(N × local_read) — target <5s for 1M files |
| Accuracy | 100% current | Stale until delta sync completes |
| Detection | Store file doesn't exist or is empty | Store file exists with data |
7. Estimated Effort
| Task | Effort | Depends On |
|---|---|---|
| Rewrite run_mount() with store loading + fallback | 2 days | Storage decision |
| Persist chunk manifests after fetch | 1 day | Storage decision |
| Load manifests on mount + register in FileReader | 0.5 day | Above |
| Open tantivy on mount, skip known files | 1 day | — |
| Open PatternStore + CollectionStore on mount | 0.5 day | — |
| Background delta sync | 1.5 days | — |
| Persist LRU access times + load on mount | 1 day | Storage decision |
| First-mount detection + fallback to full scan | 0.5 day | — |
| Total | ~8 days | |
8. Open Decision
Which storage engine for the persistent state?
The answer drives the implementation of every task above. See Section 3 for tradeoffs.