Commit Graph

7 Commits

Author SHA1 Message Date
Alexander 0ff2a17ab7 Implement Phase C: Production Hardening
Implements phase-c-hardening.md to fix 6 RED resilience tests:

- D1/D2: Health check timeout (1.5s) + parallel execution via join_all
- C6: Recursive CAS calculate_size() to scan shard subdirectories
- C7: FUSE read timeout (30s) returns EIO instead of hanging
- 6.4: Auto-re-fetch corrupt/missing chunks from origin
- 6.6: Passthrough mode - continue even when CAS write fails
- C9: PID file with flock prevents concurrent mounts
- 5.3: fd exhaustion handling test

All 27 resilience tests now pass. Full test suite green.
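
A minimal sketch of the C7 idea (a bounded read that fails with EIO instead of hanging). The commit does not say how the timeout is implemented, so the thread-plus-channel approach, the `read_with_timeout` name, and the closure-based signature are all assumptions made for illustration:

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// errno value for "I/O error" on Linux.
pub const EIO: i32 = 5;

/// Runs `read` on a worker thread and waits at most `timeout` for its result.
/// Hypothetical helper: the name and signature are not taken from the commit.
pub fn read_with_timeout<F>(read: F, timeout: Duration) -> Result<Vec<u8>, i32>
where
    F: FnOnce() -> Vec<u8> + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // If the caller has already timed out, the receiver is gone and
        // the send error is deliberately ignored.
        let _ = tx.send(read());
    });
    // A timeout or a panicked worker both surface as EIO to the caller.
    rx.recv_timeout(timeout).map_err(|_| EIO)
}
```

Note that in this sketch the worker thread keeps running after a timeout; only the caller is unblocked. A real FUSE handler would pass 30 seconds as `timeout`.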

Files changed:
- musicfs-origins/src/health.rs: timeout + join_all
- musicfs-origins/Cargo.toml: add futures dependency
- musicfs-cas/src/store.rs: recursive calculate_size
- musicfs-cas/src/reader.rs: auto-re-fetch on IntegrityError/NotFound
- musicfs-cas/src/fetcher.rs: passthrough fallback
- musicfs-fuse/src/filesystem.rs: 30s read timeout
- musicfs-cli/src/main.rs: PID file with flock
- musicfs-test-utils/tests/resilience.rs: updated tests
2026-05-13 15:55:22 +02:00
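
The C6 fix in the Phase C commit above (a recursive `calculate_size()` that scans shard subdirectories) could look roughly like the following. This is a sketch using only the standard library; the free-function signature is an assumption, and the real method in musicfs-cas/src/store.rs may differ:

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Sums the sizes of all regular files under `dir`, descending into
/// subdirectories such as per-prefix shard directories.
pub fn calculate_size(dir: &Path) -> io::Result<u64> {
    let mut total = 0u64;
    for entry in fs::read_dir(dir)? {
        let entry = entry?;
        // file_type() does not follow symlinks, so a link cannot cause a loop.
        let file_type = entry.file_type()?;
        if file_type.is_dir() {
            total += calculate_size(&entry.path())?;
        } else if file_type.is_file() {
            total += entry.metadata()?.len();
        }
        // Symlinks and other entry types are skipped in this sketch.
    }
    Ok(total)
}
```

A non-recursive version would count only files directly in the store root and would report a sharded store as nearly empty.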
Alexander 5da96ffab2 Implement Phase B: Crash Recovery
Add startup integrity checks, corruption recovery, CAS size limits,
graceful shutdown orchestration, and a task supervisor. This turns 5
previously-RED resilience tests GREEN and adds 5 new tests.

- CAS: pre-check size limit in put(), add StoreFull error variant
- CAS: sled corruption recovery in open() (retry then recreate)
- SQLite: open_with_integrity_check() via PRAGMA integrity_check(1)
- tantivy: open_with_recovery() deletes and rebuilds corrupt index
- CLI: CancellationToken-based ordered shutdown sequence
- Core: TaskSupervisor with spawn_supervised/spawn_critical + backoff
- Tests: replace 4 todo!() stubs, add 5 new shutdown/supervisor tests
2026-05-13 15:33:23 +02:00
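
The "pre-check size limit in put()" item from the Phase B commit above can be illustrated with an in-memory stand-in. Only the `StoreFull` variant name comes from the commit; the `Store` struct, its fields, and the variant's payload are assumptions for this sketch:

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
pub enum StoreError {
    /// Writing the chunk would push the store past its size limit.
    /// The `needed`/`available` fields are illustrative, not from the commit.
    StoreFull { needed: u64, available: u64 },
}

/// Hypothetical in-memory stand-in for the on-disk CAS.
pub struct Store {
    max_bytes: u64,
    used_bytes: u64,
    chunks: HashMap<String, Vec<u8>>,
}

impl Store {
    pub fn new(max_bytes: u64) -> Self {
        Store { max_bytes, used_bytes: 0, chunks: HashMap::new() }
    }

    pub fn put(&mut self, key: &str, data: &[u8]) -> Result<(), StoreError> {
        // Content-addressed: an existing key already holds identical data.
        if self.chunks.contains_key(key) {
            return Ok(());
        }
        let needed = data.len() as u64;
        // used_bytes never exceeds max_bytes, so this cannot underflow.
        let available = self.max_bytes - self.used_bytes;
        // Pre-check: reject before anything is written.
        if needed > available {
            return Err(StoreError::StoreFull { needed, available });
        }
        self.chunks.insert(key.to_string(), data.to_vec());
        self.used_bytes += needed;
        Ok(())
    }

    pub fn used_bytes(&self) -> u64 {
        self.used_bytes
    }
}
```

Checking before the write means a rejected `put()` leaves the store unchanged, so no partial chunk needs cleaning up.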
Alexander 6285eeb6c0 Implement Phase A: Stop Dying resilience fixes
Implements all 6 critical resilience fixes from phase-a-stop-dying.md:

- Issue 2.9: Migrate std::sync::RwLock → parking_lot::RwLock (7 files)
  Prevents lock poisoning cascade on writer panic

- Issue 2.2: Add install_panic_hook() to log panics via tracing
  Ensures panics are captured in logs/journald before process death

- Issue 3.7: Add ExecStopPost to systemd service
  Cleans up stale FUSE mounts on service stop

- Issue 2.7: Add check_stale_mount() detection on startup
  Auto-cleans leftover mounts from previous crashes

- Issue 2.10: Integrate sd_notify for systemd lifecycle
  Sends READY=1 after mount, STOPPING=1 on shutdown

- Issue 2.1: Add signal handling with spawn_mount
  Catches SIGTERM/SIGINT for clean shutdown instead of instant death

All 7 Phase A tests pass:
- test_poisoned_tree_lock_returns_eio_not_panic
- test_parking_lot_rwlock_survives_panic
- test_panic_hook_logs_to_tracing
- test_systemd_service_has_execstoppost
- test_stale_mount_check_function_exists
- test_sd_notify_ready_sent
- test_sigterm_triggers_shutdown
2026-05-13 14:48:32 +02:00
Alexander 5ac33987c0 Add comprehensive logging with tracing, file rotation, and systemd integration
- Add tracing-appender and tracing-journald for production logging
- Add LoggingConfig with trace_sample_rate, json_output, journald options
- Expand init_logging() with file rotation, journald, and stderr layers
- Add sanitize_path() helper for PII protection in logs
- Instrument FUSE operations with #[instrument] and trace decision points
- Instrument gRPC handlers (10 methods) with span correlation
- Add spawn instrumentation for health monitor, indexer, watcher tasks
- Add broadcast lag handling (RecvError::Lagged) in event subscribers
- Replace expect() calls in webhook.rs with proper error handling
- Add logging to patterns.rs, collections.rs, artwork.rs database ops
- Add Drop impl logging for PluginManager and WatchHandle
- Update systemd service with rate limiting and journal output
- Add logrotate config and example config.toml with logging section
2026-05-13 11:21:51 +02:00
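
The logging commit above does not describe what `sanitize_path()` actually does. As one possible approach, a helper of that name could drop every path component (artist, album, and user names can be personal data) and keep only the depth and the file extension for debugging. Everything in this sketch, including the output format, is an assumption:

```rust
use std::path::Path;

/// Hypothetical redaction: keeps the component count and the extension,
/// and drops all directory and file names.
pub fn sanitize_path(path: &Path) -> String {
    let depth = path.components().count();
    match path.extension().and_then(|ext| ext.to_str()) {
        Some(ext) => format!("<redacted:{}>.{}", depth, ext),
        None => format!("<redacted:{}>", depth),
    }
}
```

With this scheme a log line still shows that a FLAC file five components deep was accessed, without revealing which one.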
Alexander bc9fa36646 Add Week 10 Plugin System and Week 11 Control API
Week 10 - Plugin System (FR-19):
- Plugin traits: Plugin, OriginPlugin, MetadataPlugin, FormatPlugin
- NativePluginHost with libloading for dynamic loading
- WasmPluginHost (feature-gated) with wasmtime runtime
- PluginManager coordinating both hosts with version checks
- OriginInstance::watch() with WatchHandle, WatchEvent for live updates
- FormatPlugin::synthesize_header() for metadata overlay

Week 11 - Control API & Production (FR-17, FR-18, NFR-6, NFR-10):
- gRPC server with full MusicFS service (status, cache, origins, events)
- Proto extended: MountState enum, TierStats, full StatusResponse/CacheStats
- WebhookHandler with HMAC-SHA256 signing and exponential retry
- Metrics with latency histograms (p50/p95/p99) and origin health gauges
- CLI with mount, status, cache, search, origin, events, shutdown commands
- E2E player compatibility tests (mpv, VLC, file manager)
- systemd service, PKGBUILD, RPM spec for packaging

Plans added for Weeks 10-14 covering P1 features.
All 154 tests passing.
2026-05-13 10:34:01 +02:00
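
For readers unfamiliar with the p50/p95/p99 figures mentioned in the Week 11 metrics item above, here is a small nearest-rank percentile sketch. It sorts raw samples rather than reading histogram buckets, so it illustrates the definition only and is not the project's metrics code; the function name and the use of `u64` latencies are assumptions:

```rust
/// Nearest-rank percentile of `samples` (in any order).
/// Returns None for empty input or when `p` is outside (0, 100].
pub fn percentile(samples: &[u64], p: f64) -> Option<u64> {
    if samples.is_empty() || !(p > 0.0 && p <= 100.0) {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_unstable();
    // Multiply before dividing so common cases such as p = 95 with
    // 100 samples stay exact in floating point.
    let rank = (p * sorted.len() as f64 / 100.0).ceil() as usize;
    // rank is in 1..=len because 0 < p <= 100 and len >= 1.
    Some(sorted[rank - 1])
}
```

A histogram-based implementation trades this exactness for constant memory, which matters when latencies are recorded on every FUSE read.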
Alexander 7ad554f8d5 Add CLI implementation and MVP performance review
- Implement functional CLI with clap argument parsing
- Add directory scanning and metadata extraction at startup
- Fix filesystem.rs to store tokio Handle for async/sync bridge
- Fix flake.nix with LD_LIBRARY_PATH for libfuse3
- Add MVP performance review with real-world benchmark results

Benchmarks show:
- Mount time: 8 ms (target <500 ms)
- Throughput: 2-3 GB/s (target >500 MB/s)

The review also identifies:
- A critical gap: incomplete file caching (only ~2 MB per file)
- Missing CDC chunking, which the architecture spec calls for
2026-05-12 19:28:13 +02:00
Alexander 76856b893a Implement Week 1 foundation: workspace, core types, FUSE skeleton, LocalOrigin
- musicfs-core: OriginId, FileId, VirtualPath, ContentHash, AudioMeta,
  FileMeta, EventBus with FileAccessed event (5 tests)
- musicfs-fuse: FUSE skeleton with EROFS handlers for write ops
- musicfs-origins: Origin trait with watch(), LocalOrigin impl (6 tests)
- flake.nix: Nix dev shell with rust toolchain, clang, lld, fuse3

All 11 tests pass. Build produces no warnings.
2026-05-12 18:01:47 +02:00
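
The "EROFS handlers for write ops" item in the Week 1 commit above describes a read-only filesystem: every mutating operation is rejected with EROFS. The sketch below shows that dispatch in isolation. The `Op` enum and `check_op` function are hypothetical and are not the FUSE crate's API:

```rust
/// errno value for "Read-only file system" on Linux.
pub const EROFS: i32 = 30;

/// Hypothetical subset of filesystem operations.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Op {
    Lookup,
    Read,
    Readdir,
    Write,
    Create,
    Unlink,
    Mkdir,
    Rename,
}

/// Allows read-side operations and rejects anything that would mutate
/// the filesystem with EROFS.
pub fn check_op(op: Op) -> Result<(), i32> {
    match op {
        Op::Lookup | Op::Read | Op::Readdir => Ok(()),
        Op::Write | Op::Create | Op::Unlink | Op::Mkdir | Op::Rename => Err(EROFS),
    }
}
```

Returning EROFS rather than a generic error lets tools such as file managers report "read-only file system" instead of an opaque failure.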