MusicFS/docs/v2/plans/persistent-state.md
Alexander 87574ce008 Add resilience audit and persistent state plans
Comprehensive fault tolerance analysis covering 34 issues across 6 phases:
signal handling, crash recovery, cache corruption, network failures,
resource exhaustion, and the critical finding that no persistent state
is used on mount (every restart is a full origin rescan).

Persistent state plan covers storage engine options, mount flow redesign,
background delta sync, and the in-memory state inventory.
2026-05-13 12:09:41 +02:00


MusicFS Persistent State Plan

Date: 2026-05-13
Status: Research Complete; Design Decision Needed
Prerequisites: architecture.md, resilience-fault-tolerance.md
Related Requirements: G1 (O(1) mount time), NFR-1.7 (<500ms mount), FR-7.1 (cache persists across restarts)


1. Problem Statement

Every mount is a full cold start. The run_mount() function in main.rs does not use any persistent storage — it walks the entire origin filesystem, parses metadata from every audio file, and builds all runtime state from scratch.

The architecture defines persistence infrastructure (SQLite schema, chunk_manifest column, ChunkManifest::from_db(), chunks_to_bytes()), but none of it is wired into the mount path. The mount flow never even opens the database.

Mount Time by Library Size (Current)

| Library Size | Estimated Mount Time | Target (NFR-1.7) |
|---|---|---|
| 1K files | ~1-2s | <500ms |
| 10K files | ~10-20s | <500ms |
| 100K files | ~2-5 minutes | <500ms |
| 1M files | ~20-60 minutes | <500ms |
| 10M files (stretch) | hours | <500ms |

2. In-Memory State Inventory

2.1 State That Must Survive Restart

These are the large, expensive-to-rebuild data structures. Losing them forces a full origin rescan.

VirtualTree (~300-400MB at 1M files)

Location: musicfs-cache/src/tree.rs

Contents:

  • nodes: HashMap<Inode, VirtualNode> — every directory and file node
  • path_to_inode: HashMap<VirtualPath, Inode> — reverse path lookup
  • next_inode: AtomicU64 — inode counter

Currently rebuilt from: Full recursive origin scan + metadata parse of every audio file. This is the single most expensive operation on mount — it touches every file on origin, runs symphonia metadata extraction, and builds the entire tree structure.

What's needed: Load from persistent storage on mount. Rebuild only on first-ever mount or if storage is corrupt.


ContentFetcher.file_meta (~200MB at 1M files)

Location: musicfs-cas/src/fetcher.rs

Contents:

  • file_meta: RwLock<HashMap<FileId, FileMeta>> — full metadata for every file
  • Each FileMeta contains: id, virtual_path, real_path (origin_id + path), size, mtime, content_hash, audio metadata

Currently rebuilt from: Same origin scan that builds the tree. Every file is registered via fetcher.register_file(meta).

What's needed: This is essentially a duplicate of the tree data in a different shape. If the tree is loaded from storage, this map should be populated from the same source.


FileReader.manifests (~100MB at 1M files)

Location: musicfs-cas/src/reader.rs

Contents:

  • manifests: RwLock<HashMap<FileId, ChunkManifest>> — maps FileId to list of chunk hashes + offsets
  • Each ChunkManifest contains: file_id, total_size, mtime, chunks (Vec with hash + offset + size)

Currently rebuilt from: Re-fetched from origin on first read() after restart. The fetcher downloads the entire file, chunks it via CDC, stores chunks in CAS (dedup catches existing ones), and builds the manifest. This means every file is re-downloaded once after restart even though the chunks are already on disk.

What's needed: Persist manifests to storage after fetch. Load on mount. This is the difference between "restart = re-download everything" and "restart = instant reads from cache."

Existing dead code: SQLite files table has chunk_manifest BLOB column. ChunkManifest::chunks_to_bytes() and ChunkManifest::from_db() exist but are never called.
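
To make the persist-then-reload round trip concrete, here is a minimal sketch of encoding a chunk list to a blob and decoding it again. The real ChunkManifest::chunks_to_bytes() and from_db() signatures are not shown in this plan, and Section 3.1 describes manifests as msgpack-encoded. To stay dependency-free, the sketch uses a hypothetical fixed-width layout instead. `ChunkRef`, `encode_chunks`, `decode_chunks`, and the 32-byte hash width are all assumptions.

```rust
// Hypothetical stand-in for the chunk entries held by ChunkManifest.
// The 32-byte hash width is an assumption, not taken from the codebase.
#[derive(Debug, Clone, PartialEq)]
pub struct ChunkRef {
    pub hash: [u8; 32],
    pub offset: u64,
    pub size: u32,
}

// One record = hash (32) + offset (8) + size (4), all little-endian.
const RECORD_LEN: usize = 32 + 8 + 4;

/// Encode a chunk list as a flat blob suitable for a BLOB column.
pub fn encode_chunks(chunks: &[ChunkRef]) -> Vec<u8> {
    let mut out = Vec::with_capacity(chunks.len() * RECORD_LEN);
    for c in chunks {
        out.extend_from_slice(&c.hash);
        out.extend_from_slice(&c.offset.to_le_bytes());
        out.extend_from_slice(&c.size.to_le_bytes());
    }
    out
}

/// Decode a blob. Returns None when the length is not a whole number of
/// records, so the caller can treat the manifest as missing and re-fetch.
pub fn decode_chunks(blob: &[u8]) -> Option<Vec<ChunkRef>> {
    if blob.len() % RECORD_LEN != 0 {
        return None;
    }
    let mut chunks = Vec::with_capacity(blob.len() / RECORD_LEN);
    for rec in blob.chunks_exact(RECORD_LEN) {
        let mut hash = [0u8; 32];
        hash.copy_from_slice(&rec[..32]);
        let mut off = [0u8; 8];
        off.copy_from_slice(&rec[32..40]);
        let mut size = [0u8; 4];
        size.copy_from_slice(&rec[40..44]);
        chunks.push(ChunkRef {
            hash,
            offset: u64::from_le_bytes(off),
            size: u32::from_le_bytes(size),
        });
    }
    Some(chunks)
}
```

Returning `None` on a malformed blob lets a corrupt manifest degrade to today's behavior (re-fetch on first read) instead of failing the mount.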


LruEviction access times (~50MB at 100K chunks)

Location: musicfs-cache/src/eviction.rs

Contents:

  • access_times: RwLock<BTreeMap<Instant, ChunkHash>> — ordered by access time
  • hash_to_time: RwLock<HashMap<ChunkHash, Instant>> — reverse lookup

Currently rebuilt from: Nothing. After restart, all chunks have equal eviction priority. The album you're currently listening to is just as likely to be evicted as something you played 6 months ago.

What's needed: Persist last-access timestamps. On mount, load and reconstruct the LRU order so hot data stays cached.
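
One detail matters when persisting these timestamps: std::time::Instant is an opaque monotonic clock reading that is only meaningful inside the process that created it, so it cannot be written to disk and reloaded. A persisted LRU index would need a wall-clock value instead. The sketch below rebuilds an eviction order from stored (hash, Unix timestamp) pairs. `LruIndex`, its methods, and the 32-byte `ChunkHash` are hypothetical names, not the existing LruEviction API.

```rust
use std::collections::{BTreeSet, HashMap};

// Assumption: chunk hashes are 32 bytes wide.
pub type ChunkHash = [u8; 32];

/// LRU index keyed by a persistable wall-clock timestamp (Unix millis or
/// seconds) rather than Instant.
#[derive(Default)]
pub struct LruIndex {
    // The (timestamp, hash) pair keeps chunks with equal timestamps distinct.
    by_time: BTreeSet<(u64, ChunkHash)>,
    by_hash: HashMap<ChunkHash, u64>,
}

impl LruIndex {
    /// Rebuild the index from rows loaded out of the persistent store.
    pub fn from_persisted<I>(rows: I) -> Self
    where
        I: IntoIterator<Item = (ChunkHash, u64)>,
    {
        let mut idx = LruIndex::default();
        for (hash, ts) in rows {
            idx.touch(hash, ts);
        }
        idx
    }

    /// Record an access, replacing any older timestamp for the same chunk.
    pub fn touch(&mut self, hash: ChunkHash, ts: u64) {
        if let Some(old) = self.by_hash.insert(hash, ts) {
            self.by_time.remove(&(old, hash));
        }
        self.by_time.insert((ts, hash));
    }

    /// The least recently used chunk, i.e. the next eviction candidate.
    pub fn eviction_candidate(&self) -> Option<ChunkHash> {
        self.by_time.iter().next().map(|(_, hash)| *hash)
    }
}
```

With this shape, a chunk touched just before shutdown sorts last after reload, so the album currently being played is no longer first in line for eviction.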


2.2 State That Survives But Is Ignored on Mount

These persist on disk but run_mount() never opens them.

| Component | Persisted To | Loaded on Mount? | Effect |
|---|---|---|---|
| SQLite metadata (files table) | metadata.db | No | All metadata re-scanned from origin |
| tantivy search index | search.idx/ | No | Index rebuilt from scratch (or not at all) |
| PatternStore (access patterns) | SQLite (separate DB) | No | Predictions reset to zero |
| CollectionStore (smart collections) | SQLite (same as patterns) | No | Collections unavailable until opened |

2.3 State That Correctly Does Not Need Persistence

| Component | Why Transient Is Fine |
|---|---|
| OriginRegistry (origin connections) | Reconstructed from config on startup |
| Router (priorities, latency stats) | Priorities from config; latency stats warm up within seconds |
| HealthMonitor (health state) | All origins start as Unknown, converge within one check cycle (~30s) |
| EventBus (in-flight events) | Transient by nature |
| PrefetchEngine.in_flight | Transient work queue |
| PluginManager | Re-loaded from config + plugin directories |
| MusicFs.query_inodes | Transient search session state |
| CasStore.current_size | Recalculated on open (though currently broken; see resilience doc 3.10) |

3. Storage Decision

3.1 Requirements for Persistent State

  1. Bulk sequential read on mount — load ~1M records into in-memory structures as fast as possible
  2. Incremental updates at runtime — delta sync adds/removes/modifies individual files
  3. Crash safety — no corruption on unclean shutdown (SIGKILL, power loss)
  4. Manifest storage — binary blobs (msgpack-encoded chunk lists), variable size (100 bytes to 10KB per file)
  5. LRU timestamps — simple key-value (ChunkHash → last_access_timestamp)
  6. Already in project — minimize new dependencies

3.2 Options

Option A: SQLite (Current Architecture Choice)

Already in project: rusqlite dependency, schema.sql with files table, Database struct with full CRUD, chunk_manifest BLOB column ready.

| Metric | Performance |
|---|---|
| Bulk load 1M rows | ~2-4 seconds (WAL mode, indexed) |
| Single row upsert | ~50μs |
| Crash safety | WAL mode: excellent |
| Manifest blobs | Native BLOB support, no size limit |

Pros: Already built (schema, code, tests exist). Well-understood crash semantics. Single file backup. SQL queries for debugging. The chunk_manifest column and the from_db()/chunks_to_bytes() methods are already written.

Cons: Not the fastest for pure key-value workloads. WAL checkpoint can cause brief write pauses. Single-writer limitation (Mutex around connection).

Effort to wire up: ~5-7 days (mostly connecting existing code, not writing new code)


Option B: sled (Already in Project for CAS Index)

Already in project: Used for CAS chunk hash → location mapping.

| Metric | Performance |
|---|---|
| Bulk load 1M entries | ~1-2 seconds (log-structured storage, sequential reads) |
| Single entry upsert | ~10-20μs |
| Crash safety | Built-in log with recovery on open: good |
| Manifest blobs | Native byte value support |

Pros: Faster than SQLite for pure key-value. Already a dependency. Good for LRU timestamps (simple k/v).

Cons: No SQL — querying for debugging is harder. No schema migration story. Limited tooling. Has known issues with large datasets (memory usage during compaction). Two persistence engines = two things to maintain.

Effort: ~7-9 days (new serialization layer, no existing code to reuse)


Option C: Flat File (bincode/msgpack dump)

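| Metric | Performance |
|---|---|
| Bulk load 1M entries | <1 second (mmap; assumes a zero-copy layout, since plain bincode or msgpack must still deserialize every record) |
| Single entry upsert | N/A: full rewrite required |
| Crash safety | Must write atomically (tmp + rename) |
| Manifest blobs | Part of serialized struct |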

Pros: Fastest possible bulk load. Simplest implementation.

Cons: No incremental updates — every change requires serializing and rewriting the entire file. At 1M files (~500MB serialized), a single file modification triggers a 500MB write. No concurrent access. No recovery from partial corruption.

Effort: ~3-4 days but creates ongoing maintenance burden for delta updates
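
The "tmp + rename" requirement for Option C can be sketched with the standard library alone. `write_atomically` is a hypothetical helper name. The sketch assumes the temporary file lives in the same directory as the target, so the rename stays on one filesystem.

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::Path;

/// Replace `path` with `bytes` so that readers see either the old file or
/// the complete new file, never a partial write.
pub fn write_atomically(path: &Path, bytes: &[u8]) -> io::Result<()> {
    // Same directory as the target, so rename does not cross filesystems.
    let tmp = path.with_extension("tmp");
    {
        let mut file = File::create(&tmp)?;
        file.write_all(bytes)?;
        // Flush file contents to disk before the rename makes them visible.
        file.sync_all()?;
    }
    // On POSIX systems rename replaces the destination atomically.
    fs::rename(&tmp, path)?;
    Ok(())
}
```

For full durability across power loss, a real implementation would probably also fsync the parent directory after the rename; that step is omitted here for portability. Note that this protects against torn writes only. It does not remove the cost of rewriting the whole file on every change.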


Option D: Hybrid (SQLite for metadata + sled for hot-path data)

Use SQLite for structured metadata (files, collections, patterns — already built) and sled for hot-path key-value data (manifests, LRU timestamps — performance-critical).

Pros: Each store optimized for its access pattern. SQLite for queryable metadata, sled for fast blob lookup.

Cons: Two persistence engines to coordinate. Consistency between them on crash. More complex startup/shutdown.


3.3 Recommendation

Pending your decision. The tradeoffs are:

  • Simplest path: Option A (SQLite) — most code already exists, just needs wiring
  • Fastest hot-path: Option D (Hybrid) — but more complexity
  • Fastest bulk load: Option C (Flat file) — but no incremental updates

The choice depends on what you value most. SQLite at 1M files loads in ~2-4 seconds — is that acceptable vs the <500ms target? If not, a flat file or sled for the tree data with SQLite for everything else might be needed.


4. What Needs to Change

Regardless of storage choice, these are the code changes needed:

4.1 Mount Path (musicfs-cli/src/main.rs)

Current run_mount() flow:

1. Open CAS store                           → O(1)
2. Create origin connection                 → O(1)
3. scan_music_files() — FULL ORIGIN WALK    → O(N × origin_latency)  ← BOTTLENECK
4. Build tree from scan results             → O(N)
5. Register files in fetcher                → O(N)
6. Mount FUSE                               → O(1)

Required flow:

1. Open CAS store                           → O(1)
2. Open persistent state store              → O(1)
3. IF store has data:
     Load tree from store                   → O(N × local_read)  ← ~1000x faster
     Load manifests from store              → O(N × local_read)
     Load LRU access times from store       → O(chunks)
   ELSE (first mount):
     Full origin scan (current behavior)    → O(N × origin_latency)
     Persist results to store               → O(N × local_write)
4. Open tantivy search index                → O(1)
5. Open PatternStore                        → O(1)
6. Create origin connections                → O(1)
7. Mount FUSE                               → O(1)
8. Background: delta sync (origin vs store) → incremental, non-blocking
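
The branch in step 3 of the required flow can be sketched independently of the storage engine. `StateStore`, `FileRecord`, `MountSource`, `MemStore`, and `load_or_scan` are hypothetical names used for illustration only; the real run_mount() would also load manifests and LRU times in the same branch.

```rust
// Hypothetical, simplified record; the real FileMeta carries more fields.
#[derive(Debug, Clone, PartialEq)]
pub struct FileRecord {
    pub path: String,
    pub size: u64,
    pub mtime: i64,
}

/// Assumed minimal interface over whichever engine Section 3 selects.
pub trait StateStore {
    fn load_all(&self) -> Vec<FileRecord>;
    fn persist_all(&mut self, records: &[FileRecord]);
}

#[derive(Debug, PartialEq)]
pub enum MountSource {
    Store,
    OriginScan,
}

/// Load from the store when it has data; otherwise fall back to a full
/// origin scan and persist the result for the next mount.
pub fn load_or_scan<S, F>(store: &mut S, scan_origin: F) -> (Vec<FileRecord>, MountSource)
where
    S: StateStore,
    F: FnOnce() -> Vec<FileRecord>,
{
    let stored = store.load_all();
    if !stored.is_empty() {
        return (stored, MountSource::Store);
    }
    let scanned = scan_origin();
    store.persist_all(&scanned);
    (scanned, MountSource::OriginScan)
}

/// In-memory store, only for exercising the control flow.
#[derive(Default)]
pub struct MemStore(pub Vec<FileRecord>);

impl StateStore for MemStore {
    fn load_all(&self) -> Vec<FileRecord> {
        self.0.clone()
    }
    fn persist_all(&mut self, records: &[FileRecord]) {
        self.0 = records.to_vec();
    }
}
```

Passing the scan as a closure keeps the expensive origin walk lazy: it only runs when the store turns out to be empty.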

4.2 Runtime Persistence (Write Path)

These operations must persist state changes as they happen, not just on shutdown:

| Event | What to Persist | When |
|---|---|---|
| File discovered during sync | FileMeta → store | Immediately (in batch if scanning) |
| File removed during sync | Delete from store | Immediately |
| File metadata changed | Update FileMeta in store | Immediately |
| File content fetched (cache miss) | ChunkManifest → store | After fetch completes |
| Chunk accessed | Update LRU timestamp | Batched (every 10s or 100 accesses) |
| Search index updated | tantivy handles its own persistence | On commit (every 5s) |
| Access pattern recorded | PatternStore handles its own persistence | Already persisted per-access |
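
The "every 10s or 100 accesses" batching rule for LRU timestamps might look roughly like the sketch below. `LruBatcher` and its methods are hypothetical, and the thresholds are passed in rather than hard-coded. The caller supplies the current Instant so the logic stays testable without sleeping.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Collects chunk accesses and hands back a batch to persist once either
/// the access-count or the age threshold is reached.
pub struct LruBatcher {
    // Latest wall-clock timestamp per chunk (hash width is an assumption).
    pending: HashMap<[u8; 32], u64>,
    accesses: usize,
    last_flush: Instant,
    max_accesses: usize,
    max_age: Duration,
}

impl LruBatcher {
    pub fn new(max_accesses: usize, max_age: Duration, now: Instant) -> Self {
        LruBatcher {
            pending: HashMap::new(),
            accesses: 0,
            last_flush: now,
            max_accesses,
            max_age,
        }
    }

    /// Record one access. Returns Some(batch) when it is time to write.
    pub fn record(
        &mut self,
        hash: [u8; 32],
        unix_ts: u64,
        now: Instant,
    ) -> Option<Vec<([u8; 32], u64)>> {
        // Repeated accesses to one chunk collapse into a single row.
        self.pending.insert(hash, unix_ts);
        self.accesses += 1;
        let too_many = self.accesses >= self.max_accesses;
        let too_old = now.duration_since(self.last_flush) >= self.max_age;
        if too_many || too_old {
            Some(self.drain(now))
        } else {
            None
        }
    }

    /// Take everything pending, e.g. on graceful shutdown.
    pub fn drain(&mut self, now: Instant) -> Vec<([u8; 32], u64)> {
        self.accesses = 0;
        self.last_flush = now;
        self.pending.drain().collect()
    }
}
```

Because the age check only runs inside `record`, an idle daemon would hold its last batch until the next access or until `drain` is called at shutdown. A timer-driven flush would close that gap.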

4.3 Files That Need Changes

| File | Change |
|---|---|
| musicfs-cli/src/main.rs | Rewrite run_mount() to load from store; add background delta sync |
| musicfs-cache/src/db.rs | Add list_all_files() bulk load; add manifest read/write methods (if SQLite) |
| musicfs-cache/src/tree.rs | Add TreeBuilder::from_file_metas(iter): build tree from stored records |
| musicfs-cas/src/reader.rs | Load manifests from store on startup; persist after fetch |
| musicfs-cas/src/fetcher.rs | After fetch_file(), persist manifest to store |
| musicfs-cache/src/eviction.rs | Persist access times; load on startup |
| musicfs-search/src/indexer.rs | On mount, check what's already indexed vs what's in store; skip known files |
| musicfs-sync/src/delta.rs | Background delta sync: compare store state vs origin, sync differences |
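
As a rough illustration of the proposed TreeBuilder::from_file_metas(iter), the sketch below rebuilds the two tree maps from stored virtual paths, creating intermediate directories on the way. `Tree`, `Node`, `NodeKind`, and `from_paths` are simplified, hypothetical stand-ins for VirtualTree and VirtualNode, and paths are plain strings rather than VirtualPath.

```rust
use std::collections::HashMap;

pub type Inode = u64;

#[derive(Debug, PartialEq)]
pub enum NodeKind {
    Dir,
    File,
}

#[derive(Debug)]
pub struct Node {
    pub path: String,
    pub kind: NodeKind,
}

pub struct Tree {
    pub nodes: HashMap<Inode, Node>,
    pub path_to_inode: HashMap<String, Inode>,
    pub next_inode: Inode,
}

impl Tree {
    /// Return the inode for `path`, allocating a node if it is new.
    fn ensure(&mut self, path: &str, kind: NodeKind) -> Inode {
        if let Some(&ino) = self.path_to_inode.get(path) {
            return ino;
        }
        let ino = self.next_inode;
        self.next_inode += 1;
        self.nodes.insert(ino, Node { path: path.to_string(), kind });
        self.path_to_inode.insert(path.to_string(), ino);
        ino
    }

    /// Build the tree from stored file paths such as "Artist/Album/01.flac".
    pub fn from_paths<'a, I>(paths: I) -> Self
    where
        I: IntoIterator<Item = &'a str>,
    {
        let mut tree = Tree {
            nodes: HashMap::new(),
            path_to_inode: HashMap::new(),
            next_inode: 1,
        };
        // The root directory (empty path) takes inode 1, the FUSE root id.
        tree.ensure("", NodeKind::Dir);
        for path in paths {
            let parts: Vec<&str> = path.split('/').filter(|s| !s.is_empty()).collect();
            let mut prefix = String::new();
            for (i, part) in parts.iter().enumerate() {
                if !prefix.is_empty() {
                    prefix.push('/');
                }
                prefix.push_str(part);
                // Only the last component is a file; the rest are directories.
                let kind = if i + 1 == parts.len() { NodeKind::File } else { NodeKind::Dir };
                tree.ensure(&prefix, kind);
            }
        }
        tree
    }
}
```

This sketch assigns inodes in iteration order. A real implementation would likely persist inode numbers with each record so they stay stable across restarts.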

4.4 Shutdown Persistence

On graceful shutdown (after signal handling from resilience plan Phase A is implemented):

| Step | What |
|---|---|
| 1 | Flush any batched LRU timestamp updates |
| 2 | Commit tantivy index writer |
| 3 | WAL checkpoint SQLite (if SQLite): PRAGMA wal_checkpoint(TRUNCATE) |
| 4 | Flush sled (if sled): sled::Db::flush() |
| 5 | Close all database connections |

On crash (no graceful shutdown):

  • SQLite WAL mode: automatic recovery on next open (no data loss for committed transactions)
  • sled: automatic recovery via internal WAL
  • tantivy: up to 5 seconds of uncommitted documents lost, but recoverable from store
  • LRU timestamps: batched updates may lose last batch (10s window) — acceptable

5. Background Delta Sync

After mounting from persistent state, the data may be stale (origin changed while daemon was stopped). A background sync reconciles:

1. Walk origin (or use watcher for inotify-capable origins)
2. For each file on origin:
   a. Compare mtime + size against stored record
   b. If unchanged → skip
   c. If modified → re-parse metadata, update store, update tree, invalidate manifest
   d. If new → parse metadata, add to store + tree
3. For each file in store not found on origin:
   a. Remove from store + tree
4. Update search index for changed files
5. Log summary: "Delta sync complete: N added, M modified, K removed, T unchanged"

This runs in the background AFTER mount completes. Users see the filesystem immediately (from stored state), and it converges to current reality within minutes.
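
The comparison in steps 2 and 3 above can be sketched as a pure function over two snapshots. `Stat`, `Delta`, and `diff` are hypothetical names. The sketch assumes both sides fit in memory as maps keyed by path, whereas a real sync would probably stream the origin walk.

```rust
use std::collections::HashMap;

/// The fields compared in step 2a.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Stat {
    pub size: u64,
    pub mtime: i64,
}

#[derive(Debug, Default, PartialEq)]
pub struct Delta {
    pub added: Vec<String>,
    pub modified: Vec<String>,
    pub removed: Vec<String>,
    pub unchanged: usize,
}

/// Classify every path as added, modified, removed, or unchanged.
pub fn diff(stored: &HashMap<String, Stat>, origin: &HashMap<String, Stat>) -> Delta {
    let mut delta = Delta::default();
    for (path, stat) in origin {
        match stored.get(path) {
            None => delta.added.push(path.clone()),
            Some(old) if old != stat => delta.modified.push(path.clone()),
            Some(_) => delta.unchanged += 1,
        }
    }
    for path in stored.keys() {
        if !origin.contains_key(path) {
            delta.removed.push(path.clone());
        }
    }
    // Sort for deterministic logs; HashMap iteration order is arbitrary.
    delta.added.sort();
    delta.modified.sort();
    delta.removed.sort();
    delta
}
```

The lengths of `added`, `modified`, and `removed` plus `unchanged` map directly onto the N, M, K, and T counts in the step 5 log line.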

5.1 Stale Data Window

Between mount and delta sync completion, users may see:

  • Files that were deleted on origin (will get ENOENT or EIO on read — origin returns not found)
  • Files with old metadata (wrong track name, etc.)
  • Missing files that were added to origin (won't appear until sync discovers them)

This is acceptable: it is the same behavior as any caching network filesystem (NFS, CIFS). The key insight is that briefly stale data is far better than no data for the minutes a full rescan takes.


6. First Mount vs Subsequent Mount

| | First Mount (empty store) | Subsequent Mount (store has data) |
|---|---|---|
| Tree source | Origin scan + metadata parse | Load from store |
| Manifests | None (populated on first read) | Loaded from store |
| Search index | Built during/after scan | Opened from disk |
| LRU data | Empty (cold cache) | Loaded from store |
| Mount time | O(N × origin_latency), same as today | O(N × local_read), target <5s for 1M files |
| Accuracy | 100% current | Stale until delta sync completes |
| Detection | Store file doesn't exist or is empty | Store file exists with data |

7. Estimated Effort

| Task | Effort | Depends On |
|---|---|---|
| Rewrite run_mount() with store loading + fallback | 2 days | Storage decision |
| Persist chunk manifests after fetch | 1 day | Storage decision |
| Load manifests on mount + register in FileReader | 0.5 day | Above |
| Open tantivy on mount, skip known files | 1 day | |
| Open PatternStore + CollectionStore on mount | 0.5 day | |
| Background delta sync | 1.5 days | |
| Persist LRU access times + load on mount | 1 day | Storage decision |
| First-mount detection + fallback to full scan | 0.5 day | |
| Total | ~8 days | |

8. Open Decision

Which storage engine for the persistent state?

The answer drives the implementation of every task above. See Section 3 for tradeoffs.