Improve visual context OCR quality#482
Draft
FuJacob wants to merge 3 commits into
Draft
Conversation
SwiftLint (--strict) flagged generateContext at cyclomatic complexity 11 (limit 10). Extract the summarizer-vs-OCR fallback branch into resolvedContextText(...), a behavior-preserving helper that drops the function to ~8 and reads more clearly. XcodeGen drift: the project.pbxproj had hand-written synthetic object IDs (AA1000...) for the three new files instead of being regenerated. Ran 'xcodegen generate' so the committed project matches project.yml (hash-based IDs); diff vs main is now only the three new file references. Sparkle stays pinned exactVersion 2.9.1.
sanitizeOCR's noise filter only treated a token as real when it had an ASCII vowel or matched the English word lists, so CJK, Cyrillic, Greek, Arabic, Hebrew, Thai, and accented-Latin tokens were stripped to nothing. Non-English users would get empty visual context and fully generic suggestions. assessOCRToken now keeps any token carrying a non-ASCII letter as strong signal, after numbers and repeated-glyph junk are rejected so this is not a backdoor for ASCII garbage. The Latin-only tail moved into assessLatinToken to keep cyclomatic complexity under the limit; ASCII behavior is unchanged (a count<=2 token can never be repeated-glyph junk, so the reordering is a no-op for it). Adds regression tests for CJK/Cyrillic/accented-Latin preservation and for ASCII noise still being dropped on a mixed line.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Improve the screenshot/OCR visual-context pipeline so Cotabby sends richer, sanitized prompt context instead of dropping context when summarization fails. The change keeps Screen Recording required, expands OCR capture quality and budgets, adds deterministic OCR-noise filtering, and replaces the generic summary prompt with an autocomplete-focused context extraction prompt.
Validation
plutil -lint Cotabby.xcodeproj/project.pbxprojCotabby.xcodeproj/project.pbxproj: OKxcodebuild -project Cotabby.xcodeproj -scheme Cotabby -destination platform=macOS build-for-testing -derivedDataPath build/DerivedData** TEST BUILD SUCCEEDED **xcodebuild -project Cotabby.xcodeproj -scheme Cotabby -destination platform=macOS test -only-testing:CotabbyTests/PromptContextSanitizerTests -only-testing:CotabbyTests/ScreenshotContextGeneratorTests -only-testing:CotabbyTests/VisualContextSummaryPromptRendererTests -derivedDataPath build/DerivedDataCotabbyTests.xctestbundle because the test bundle and host app have different Team IDs. This matches the known local signing failure for app-hosted tests.rm -rf build/DerivedDataLinked issues
None.
Risk / rollout notes
Greptile Summary
This PR upgrades the screenshot/OCR visual-context pipeline: it expands capture/OCR/summary budgets, switches Vision to accurate mode, replaces the generic summary prompt with an autocomplete-focused one extracted into
VisualContextSummaryPromptRenderer, and adds deterministic OCR token scoring to filter hallucinated fragments. The fallback path now always delivers sanitized OCR text instead of dropping context when summarization fails or times out.maxImageDimension900→1600, OCR char cap 2000→5000, summary cap 900→1500, timeout 3→6 s, max tokens 80→160; OCR switches from.fastto.accuratewith a lower minimum text height (0.012→0.008).PromptContextSanitizer.sanitizeOCR: multi-stage heuristics classify tokens as strong-signal, kept, or noise, dropping a line when it lacks any strong-signal token; the newisLikelyShortMixedCaseNoiseheuristic works well for random garbage but has false positives for lowercase-initial camelCase brand names (\"iPhone\",\"macOS\",\"iCloud\") that are absent fromknownWordSignals.WindowScreenshotCapturingandScreenTextExtractingprotocols are introduced soScreenshotContextGeneratorcan be unit-tested without live ScreenCaptureKit or Vision; newScreenshotContextGeneratorTestsandVisualContextSummaryPromptRendererTestscover the main paths.Confidence Score: 4/5
Safe to merge; the core fallback behavior is well-tested and correct. The one heuristic edge case silently degrades context quality rather than causing errors.
The pipeline changes are well-structured: the OCR fallback path is thoroughly tested with stubs, error propagation from the summarizer is correctly caught upstream, and the new prompt renderer is straightforward. The only substantive concern is in
isLikelyShortMixedCaseNoise: the condition!firstCharacterIsUppercasecauses common lowercase-initial camelCase tokens like"iPhone","macOS", and"iCloud"to be silently dropped as OCR noise, since none of them appear inknownWordSignalsorpreservedTechnicalTokens. On a macOS screenshot tool these names appear frequently in UI text, so affected users may see slightly less relevant autocomplete context without any visible error. All other logic — the summarizer timeout rework, theContextSourcefallback tracking, and the testability protocols — looks correct.Cotabby/Support/PromptContextSanitizer.swift — specifically the
isLikelyShortMixedCaseNoisefunction and the absence of lowercase-initial camelCase entries inknownWordSignalsorpreservedTechnicalTokens.Important Files Changed
isLikelyShortMixedCaseNoiseheuristic produces false positives for lowercase-initial camelCase brand names (iPhone, macOS, iCloud) that are not inknownWordSignals.sanitizeOCRon text that is already sanitized by callers.ContextSourceenum..fastto.accurate, lowers minimum text height from 0.012 to 0.008, and addsScreenTextExtractingprotocol for test injection.WindowScreenshotCapturingprotocol seam for testing.Flowchart
%%{init: {'theme': 'neutral'}}%% flowchart TD A[FocusedInputSnapshot] --> B[captureScreenshot\nWindowScreenshotCapturing] B --> C[extractText\nScreenTextExtracting] C -->|ScreenTextExtractionError.noRecognizedText| D{windowTitle\navailable?} D -->|No| E[throw unavailable] D -->|Yes| F[normalizeRecognizedText windowTitle] F --> G{hasMeaningfulSignal?} G -->|No| E G -->|Yes| H[boundedSummaryText — ocrFallback] H --> Z[VisualContextExcerpt] C -->|other error| E C -->|success| I[normalizeRecognizedText — sanitizeOCR maxChars=5000] I --> J{hasMeaningfulSignal?} J -->|No| E J -->|Yes| K{summarizer available?} K -->|No| L[finalContextText = boundedSummaryText OCR] K -->|Yes| M[summarizer.summarize — timeout=6s tokens=160] M -->|success| N{hasMeaningfulSignal boundedSummary?} N -->|Yes| O[source=summary — finalContextText=boundedSummary] N -->|No| L M -->|throws| P[log reason — fallback to OCR] P --> L L --> Q{hasMeaningfulSignal finalContextText?} O --> Q Q -->|No| E Q -->|Yes| Z subgraph LlamaVisualContextSummarizer M1[deduplicateConsecutiveLines] --> M2[VisualContextSummaryPromptRenderer.prompt — sanitizeOCR] M2 --> M3[summarizeWithTimeout — generationTask + timeoutTask] M3 --> M4[truncateAtRepeatedBlock] M4 -->|empty| M5[throw emptyResult] M4 -->|non-empty| M6[return summary] endComments Outside Diff (2)
Cotabby/Support/PromptContextSanitizer.swift, line 614-633 (link)isLikelyShortMixedCaseNoisefalse-positives for lowercase-initial camelCase brand namesThe function is designed to flag OCR garbage like
"gLVWrt"but its final return condition —uppercaseCount >= 2 || !firstCharacterIsUppercase— also catches legitimate tokens where the first letter is lowercase but the token contains any uppercase. Concretely:"iPhone"(letters.first=i, uppercaseCount=1 →!firstCharacterIsUppercase=true→ classified as noise),"macOS"(letters.first=m, uppercaseCount=2 → both clauses true → noise), and"iCloud","iPad","eBay"all follow the same path. None of these appear inknownWordSignalsorpreservedTechnicalTokens, so they reachisLikelyShortMixedCaseNoiseand are silently dropped. On a macOS screenshot tool, references to Apple product names are common OCR output and their removal degrades autocomplete context.Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
Cotabby/Support/VisualContextSummaryPromptRenderer.swift, line 660-661 (link)sanitizeOCRcall on already-filtered textVisualContextSummaryPromptRenderer.promptis called fromLlamaVisualContextSummarizer.summarize, which receives text that was already processed bynormalizeRecognizedText(i.e.PromptContextSanitizer.sanitizeOCRwith a 5 000-character cap). The secondsanitizeOCRcall here without a character limit runs the full line-scoring logic on text that is already clean, making it a no-op in the production path. The extra pass is harmless and idempotent, but it creates a cost asymmetry: in production the sanitizer runs twice while tests that callpromptdirectly sanitize exactly once. Consider removing the internal call and documenting that callers are expected to pre-sanitize, or keep it but note the redundancy so future callers know the contract.Reviews (1): Last reviewed commit: "Improve visual context OCR quality" | Re-trigger Greptile