Fix WAL sync: durable conversation linkage, startup recovery, accurate indicator #6330
Conversation
Adds conversationId, retryCount, and lastRetryAt fields to the Wal class so WAL-to-conversation linkage survives app kill via wals.json persistence.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ocalWalSync (#6318)

- finalizeCurrentSession(): force-drains tail buffer frames and flushes to disk
- stampConversationId(): stamps all session WALs with conversation ID and persists
- getOrphanedWals(): returns miss+disk WALs with conversationId for startup recovery
- getInFlightSeconds(): returns seconds of in-flight frames for accurate indicator
- persistRetryMetadata(): saves retry count/timestamp for failed sync attempts
… retry, startup recovery (#6318)

- On ConversationProcessingStartedEvent: finalize tail buffer + stamp WALs with conversationId
- 30s fallback timer fires if ConversationEvent never arrives (WS disconnect)
- _syncSingleWal: retry with exponential backoff (3 attempts, 5/10/20s delays)
- recoverOrphanedWals(): called on startup to sync WALs that survived app kill
- inFlightAudioSeconds getter for accurate indicator
Indicator now counts both flushed disk WALs and in-flight memory frames, eliminating the 1-2 minute delay where users saw "0 audio saved" after disconnect.
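Roughly, the indicator total becomes the sum of both sources. This is a sketch only: `getInFlightSeconds()` is the new service method from this PR, while `_phoneSync` and `_flushedDiskSeconds` are hypothetical stand-ins for the existing disk-WAL plumbing.

```dart
// New: in-flight frames still in memory now count toward the indicator.
int get inFlightAudioSeconds => _phoneSync.getInFlightSeconds();

// Indicator shows disk + memory, so audio is visible immediately
// after a disconnect instead of only after the next flush to disk.
int get totalUnsyncedSeconds => _flushedDiskSeconds() + inFlightAudioSeconds;
```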
…finalize (#6318)

Tests cover conversationId persistence, stampConversationId filtering, getOrphanedWals with retry cap, getInFlightSeconds accuracy, and finalizeCurrentSession tail buffer drain.
Greptile Summary

This PR makes WAL audio sync durable across app kills and WebSocket disconnects by persisting conversation linkage directly on WAL objects and adding startup recovery.
Confidence Score: 4/5

Two P1 timing bugs make both startup recovery and the fast-path sync unreliable, but the WAL data itself is never lost — the issues manifest as delayed syncing, not data loss. The model layer (wal.dart), service layer (local_wal_sync.dart), and UI (page.dart) are solid. However, capture_provider.dart has two concrete correctness defects: startup recovery via Future.microtask always queries an empty _wals list (I/O not yet complete), and the ConversationEvent fast path cancels the fallback timer before _finalizeAndStampSession has flushed WALs to disk, leaving session WALs unsynced until the next launch — a path that is also broken by the first bug.

Important Files Changed

app/lib/providers/capture_provider.dart — the recoverOrphanedWals microtask timing and the _finalizeAndStampSession / ConversationEvent race both need fixing before the WAL durability guarantees hold.
Sequence Diagram

```mermaid
sequenceDiagram
    participant WS as WebSocket
    participant CP as CaptureProvider
    participant LWS as LocalWalSyncImpl
    participant Disk as Disk (wals.json)
    Note over CP: ConversationProcessingStartedEvent
    WS->>CP: ConversationProcessingStartedEvent
    CP->>LWS: finalizeCurrentSession() [fire-and-forget]
    LWS-->>Disk: flush WALs (async I/O)
    CP->>CP: _resetStateVariables()
    CP->>CP: start 30s fallback timer
    WS->>CP: ConversationEvent (may arrive before flush completes)
    CP->>CP: cancel fallback timer
    CP->>LWS: getSessionUnsyncedWals() [needs WalStorage.disk]
    Note over LWS: WALs may still be WalStorage.mem
    LWS-->>CP: [] empty — race condition
    Note over CP: On next app startup
    CP->>LWS: recoverOrphanedWals() via Future.microtask
    Note over LWS: _wals still const [] — I/O not complete
    LWS-->>CP: [] empty — timing bug
    Note over LWS: _initializeWals() completes later
    Disk-->>LWS: WALs loaded into _wals
    Note over CP: Recovery window missed
```
Reviews (1): Last reviewed commit: "Add 12 unit tests for WAL recovery: mode..."
```dart
// Recover orphaned WALs from previous sessions on startup
Future.microtask(() => recoverOrphanedWals());
```
Startup recovery runs before WAL list is populated
_initializeWals() is called without await from LocalWalSyncImpl.start() and performs async disk I/O (WalFileManager.init() + loadWals()). Microtasks are drained before the event loop services any I/O futures, so recoverOrphanedWals() always calls getOrphanedWals() while _wals is still const [] — recovering zero WALs every time. The startup recovery as written is effectively a no-op. A reliable fix would be to expose a Future<void> initialized completer from the WAL service and await it before calling getOrphanedWals().
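A minimal sketch of the completer-based fix this comment suggests, assuming `_initializeWals()` is the async loader described above. The class and member names follow the PR; the wiring itself is illustrative, not the PR's exact code.

```dart
import 'dart:async';

class LocalWalSyncImpl {
  final Completer<void> _walReady = Completer<void>();

  /// Completes once wals.json has been read and the WAL list is populated.
  Future<void> get walReady => _walReady.future;

  void start() {
    // Still fire-and-forget, but completion is now observable.
    _initializeWals().then((_) => _walReady.complete());
  }

  Future<void> _initializeWals() async {
    // WalFileManager.init() + loadWals() disk I/O, as in the PR.
  }
}

// In the capture provider, await readiness instead of using a microtask:
//   await phoneSync.walReady;
//   final orphans = phoneSync.getOrphanedWals();
```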
```dart
if (event is ConversationEvent) {
  event.memory.isNew = true;
  conversationProvider!.removeProcessingConversation(event.memory.id);
  _processConversationCreated(event.memory, event.messages.cast<ServerMessage>());
  _autoSyncFallbackTimer?.cancel();
  if (_pendingAutoSyncSessionStart > 0) {
    final sessionStart = _pendingAutoSyncSessionStart;
    _pendingAutoSyncSessionStart = 0;
    _pendingAutoSyncConversationId = null;
    _autoSyncSessionWals(sessionStart, event.memory.id);
  }
  return;
```
_finalizeAndStampSession race with ConversationEvent
_finalizeAndStampSession is fired-and-forgotten at line 1127, then ConversationEvent (which often follows in seconds) cancels the fallback timer and calls _autoSyncSessionWals. If _flush() inside finalizeCurrentSession hasn't completed yet when ConversationEvent arrives, the freshly drained WALs are still WalStorage.mem and getSessionUnsyncedWals (which filters for WalStorage.disk) silently returns empty — with the fallback timer already cancelled, the WALs won't auto-sync until the next app startup. _finalizeAndStampSession should be awaited before starting the fallback timer, or _autoSyncSessionWals should be chained to its completion rather than invoked independently in the ConversationEvent branch.
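One way to close the race, sketched with the `_pendingFinalizeAndStamp` future that later commits in this PR adopt. The surrounding helper bodies are stubs for illustration, not the PR's actual implementations.

```dart
Future<void>? _pendingFinalizeAndStamp;

Future<void> _finalizeAndStampSession() async {
  // finalizeCurrentSession() + stampConversationId(), as in the PR.
}

void _startAutoSyncFallbackTimer() {
  // 30s fallback in case ConversationEvent never arrives.
}

void _onConversationProcessingStarted() {
  // Keep a handle on the flush instead of firing and forgetting it.
  _pendingFinalizeAndStamp = _finalizeAndStampSession();
  _startAutoSyncFallbackTimer();
}

Future<void> _autoSyncSessionWals(int sessionStart, String conversationId) async {
  // Await the tail-buffer flush so getSessionUnsyncedWals() sees
  // WalStorage.disk entries instead of racing still-in-memory frames.
  await _pendingFinalizeAndStamp;
  // ... sync the WALs returned by getSessionUnsyncedWals(sessionStart) ...
}
```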
```dart
/// Sync a single WAL to a conversation with retry and backoff.
/// Retries up to 3 times with exponential delays (5s, 10s, 20s).
Future<void> _syncSingleWal(Wal wal, String conversationId, dynamic phoneSync) async {
```
dynamic parameter loses compile-time type safety
phoneSync is typed as dynamic, which disables all static checks. Both call-sites pass a LocalWalSyncImpl (which implements LocalWalSync). Use LocalWalSync instead so that any refactoring of the interface is caught at compile time.
```diff
- Future<void> _syncSingleWal(Wal wal, String conversationId, dynamic phoneSync) async {
+ Future<void> _syncSingleWal(Wal wal, String conversationId, LocalWalSync phoneSync) async {
```
```dart
} catch (e) {
  Logger.debug('Auto-sync WAL ${wal.id} failed: $e');
  wal.retryCount = attempt + 1;
  wal.lastRetryAt = DateTime.now().millisecondsSinceEpoch ~/ 1000;
  if (attempt < maxRetries - 1) {
    final delay = baseDelay * (1 << attempt); // 5s, 10s, 20s
    Logger.debug('Auto-sync WAL ${wal.id} attempt ${attempt + 1} failed, retrying in ${delay}s: $e');
    await Future.delayed(Duration(seconds: delay));
  } else {
    Logger.debug('Auto-sync WAL ${wal.id} failed after $maxRetries attempts: $e');
    // Persist retry metadata so startup recovery knows the attempt count
    await phoneSync.persistRetryMetadata(wal);
```
Intermediate retry count not persisted to disk
wal.retryCount and wal.lastRetryAt are updated in memory on every failure, but persistRetryMetadata is only called on the last attempt. If the app is killed during one of the earlier retry delays, the retry count is lost and the orphan recovery restarts from 0 on the next launch. Consider calling persistRetryMetadata after every failure (not just the final one) to make the backoff state durable across kills.
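A sketch of that change — the only move is hoisting `persistRetryMetadata` out of the final-attempt branch so every failure is durable. Variable names mirror the snippet above; `_uploadWal` is a hypothetical stand-in for the actual sync call.

```dart
Future<void> syncWithRetry(Wal wal, LocalWalSync phoneSync) async {
  const maxRetries = 3;
  const baseDelay = 5;
  for (var attempt = 0; attempt < maxRetries; attempt++) {
    try {
      await _uploadWal(wal); // hypothetical sync call
      return;
    } catch (e) {
      wal.retryCount = attempt + 1;
      wal.lastRetryAt = DateTime.now().millisecondsSinceEpoch ~/ 1000;
      // Persist after *every* failure so an app kill during a backoff
      // delay does not reset the retry budget to 0 on the next launch.
      await phoneSync.persistRetryMetadata(wal);
      if (attempt < maxRetries - 1) {
        await Future.delayed(Duration(seconds: baseDelay * (1 << attempt)));
      }
    }
  }
}
```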
- Add shouldStored check in finalizeCurrentSession() to skip storage when all frames were already synced via WebSocket (P1)
- Fix _wals.add() on unmodifiable const [] list — use List.from pattern (P2)
- Change dynamic phoneSync to LocalWalSyncImpl for type safety (P4)
- Move orphan recovery to updateProviderInstances() with 5s delay (P4)
- Use LocalWalSyncImpl instead of dynamic for type safety
- Move recoverOrphanedWals() to updateProviderInstances() with 5s delay so WALs are loaded from disk before recovery runs
- Increase frame count to 1100 (above 10*fps threshold) for stored test
- Add test verifying finalizeCurrentSession skips storage when all synced
Required fixes before merge (round 1 — all addressed):
Tests updated: frame count increased to 1100 (above shouldStored threshold); new test for synced-frames skip case. All 13 tests pass. Ready for re-review. by AI for @beastoin
Exposes a walReady future that completes when _initializeWals() finishes loading WALs from disk. recoverOrphanedWals() awaits this instead of relying on a fixed 5-second delay.
- P1: Store _pendingFinalizeAndStamp future; _autoSyncSessionWals awaits it before querying disk WALs, preventing a race when the backend responds fast
- P2: Skip orphan recovery when offline (don't burn retries); catch SocketException separately without incrementing retryCount
- P3: Await phoneSync.walReady in recoverOrphanedWals instead of a fixed 5s delay
- P1: Trigger recoverOrphanedWals() in onConnectionStateChanged when coming back online after an offline start (previously stranded)
- P2: Persist retryCount/lastRetryAt after every failed attempt, not just the final one, so app kill during backoff preserves the budget
Check SyncLocalFilesResponse.hasPartialFailure before marking WAL as synced. Partial failures (207) are now treated as retryable errors instead of silently losing the failed audio segments.
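A hedged sketch of that check, assuming the `SyncLocalFilesResponse` type named in the commit message; `syncWalFiles` and `markAsSynced` are hypothetical names for the surrounding calls.

```dart
final SyncLocalFilesResponse response = await syncWalFiles(wal); // hypothetical call
if (response.hasPartialFailure) {
  // A 207 means some segments failed; treat the whole WAL as retryable
  // rather than silently dropping the failed audio.
  throw Exception('Partial failure syncing WAL ${wal.id}');
}
markAsSynced(wal); // hypothetical; only reached on full success
```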
- Start retry loop from wal.retryCount (persisted value) instead of 0, so the cross-restart budget is honored and WALs aren't retried indefinitely
- Mark WALs with a missing filePath, unresolvable path, or deleted file as corrupted (matching main sync path behavior) instead of silently returning
After recoverOrphanedWals() runs, check if any orphans remain (e.g., SocketException while still "online"). If so, clear _orphanRecoveryDone so the next connectivity transition re-triggers recovery.
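The re-arm logic could look roughly like this; `_orphanRecoveryDone` is the flag named above, while the wrapper method and its signature are illustrative.

```dart
bool _orphanRecoveryDone = false;

Future<void> _maybeRecoverOrphans(LocalWalSyncImpl phoneSync) async {
  if (_orphanRecoveryDone) return;
  _orphanRecoveryDone = true;
  await recoverOrphanedWals();
  // A SocketException can fail recovery while connectivity still reports
  // "online"; if orphans remain, clear the flag so the next connectivity
  // transition triggers recovery again.
  if (phoneSync.getOrphanedWals().isNotEmpty) {
    _orphanRecoveryDone = false;
  }
}
```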
- Exact 10*fps threshold triggers storage (>= boundary)
- Just below threshold (999 frames) does NOT trigger storage
- All-synced 1100 frames skips storage (shouldStored=false)
- stampConversationId only stamps WALs >= sessionStartSeconds
- Exact timerStart == sessionStartSeconds is stamped
- walReady Completer resolves after start()
Test results:
CP9A — L1 live test (changed component standalone)
L1 synthesis: P1-P6 proven via 19 unit tests covering all LocalWalSyncImpl paths, including boundary cases. P7-P8 confirmed via a successful app build+boot with WAL service logs. P9-P10 untested at runtime due to missing HTTP mock and widget test infrastructure — these are provider-level integration paths that would require … by AI for @beastoin
CP9B — L2 live test (service + app integrated)

This PR changes only Flutter app code (0 backend files). The app was built and run with the WAL service active against the prod API.
Since no backend code was changed, the integration boundary is unchanged — the API surface is identical to main.
L2 synthesis: P1-P8, P10 proven. App builds, boots, and runs with WAL service active against prod API. P9 (_syncSingleWal recovery path) requires an actual orphaned WAL scenario (device capture → app kill → restart) which needs a physical Omi device. No backend changes in this PR, so the API integration boundary is unchanged from main. by AI for @beastoin
Closes #6318
Makes WAL audio sync survive app kill, WebSocket disconnects, and crashes by persisting conversation linkage directly on WAL objects and adding startup recovery.
Changes
WAL model (`wal.dart`):

- `conversationId`, `retryCount`, `lastRetryAt` fields with JSON serialization

Local WAL sync (`local_wal_sync.dart`):

- `finalizeCurrentSession()` — drains all in-flight frames to disk; respects `shouldStored` gate
- `stampConversationId()` — stamps miss/in-progress WALs in session window
- `getOrphanedWals()` — returns recoverable WALs (miss + disk + conversationId + retryCount < 3)
- `getInFlightSeconds()` — in-memory frame count as seconds for UI
- `persistRetryMetadata()` — persists retry state to disk
- `walReady` Completer — signals when WALs loaded from disk

Capture provider (`capture_provider.dart`):

- `ConversationProcessingStartedEvent`: finalize session + stamp WALs (fire-and-forget)
- `_pendingFinalizeAndStamp` future — prevents race with auto-sync
- 30s fallback timer in case `ConversationEvent` never arrives
- `recoverOrphanedWals()` — awaits `walReady`; skips offline; re-arms on remaining orphans
- `_syncSingleWal()` — durable retry budget from persisted `retryCount`; network-safe; partial failure aware; marks corrupted for missing files
- `onConnectionStateChanged()` — triggers recovery on reconnect

UI (`conversation_capturing/page.dart`):

- Indicator counts flushed disk WALs plus in-flight memory frames

Review & Testing

- `wal_recovery_test.dart` (model, stamp, orphan, in-flight, finalize boundaries, walReady)
- `capture_provider_test.dart` — all pass

by AI for @beastoin