Improve visual context OCR quality by FuJacob · Pull Request #482 · FuJacob/cotabby

FuJacob · 2026-05-31T18:19:06Z

Summary

Improve the screenshot/OCR visual-context pipeline so Cotabby sends richer, sanitized prompt context instead of dropping context when summarization fails. The change keeps Screen Recording required, expands OCR capture quality and budgets, adds deterministic OCR-noise filtering, and replaces the generic summary prompt with an autocomplete-focused context extraction prompt.

Validation

plutil -lint Cotabby.xcodeproj/project.pbxproj
- Cotabby.xcodeproj/project.pbxproj: OK
xcodebuild -project Cotabby.xcodeproj -scheme Cotabby -destination platform=macOS build-for-testing -derivedDataPath build/DerivedData
- ** TEST BUILD SUCCEEDED **
xcodebuild -project Cotabby.xcodeproj -scheme Cotabby -destination platform=macOS test -only-testing:CotabbyTests/PromptContextSanitizerTests -only-testing:CotabbyTests/ScreenshotContextGeneratorTests -only-testing:CotabbyTests/VisualContextSummaryPromptRendererTests -derivedDataPath build/DerivedData
- Compiled, then failed to load the app-hosted CotabbyTests.xctest bundle because the test bundle and host app have different Team IDs. This matches the known local signing failure for app-hosted tests.
rm -rf build/DerivedData

Linked issues

None.

Risk / rollout notes

Visual context intentionally trades a bit more capture/OCR/summarization latency for higher-quality context: larger screenshot/OCR budgets, Vision accurate mode, and a longer summary timeout.
Fast Mode still skips screenshot/OCR capture after permissions are granted, but Screen Recording remains required for autocomplete eligibility.
No hosted APIs, raw screenshots, or unbounded OCR text are introduced.

Greptile Summary

This PR upgrades the screenshot/OCR visual-context pipeline: it expands capture/OCR/summary budgets, switches Vision to accurate mode, replaces the generic summary prompt with an autocomplete-focused one extracted into VisualContextSummaryPromptRenderer, and adds deterministic OCR token scoring to filter hallucinated fragments. The fallback path now always delivers sanitized OCR text instead of dropping context when summarization fails or times out.

- Budget and quality increases: snapshot dimension 500→700, maxImageDimension 900→1600, OCR char cap 2000→5000, summary cap 900→1500, timeout 3→6 s, max tokens 80→160; OCR switches from .fast to .accurate with a lower minimum text height (0.012→0.008).
- New token-scoring OCR filter in PromptContextSanitizer.sanitizeOCR: multi-stage heuristics classify tokens as strong-signal, kept, or noise, dropping a line when it lacks any strong-signal token; the new isLikelyShortMixedCaseNoise heuristic works well for random garbage but has false positives for lowercase-initial camelCase brand names ("iPhone", "macOS", "iCloud") that are absent from knownWordSignals.
- Testability: WindowScreenshotCapturing and ScreenTextExtracting protocols are introduced so ScreenshotContextGenerator can be unit-tested without live ScreenCaptureKit or Vision; new ScreenshotContextGeneratorTests and VisualContextSummaryPromptRendererTests cover the main paths.

Confidence Score: 4/5

Safe to merge; the core fallback behavior is well-tested and correct. The one heuristic edge case silently degrades context quality rather than causing errors.

The pipeline changes are well-structured: the OCR fallback path is thoroughly tested with stubs, error propagation from the summarizer is correctly caught upstream, and the new prompt renderer is straightforward. The only substantive concern is in isLikelyShortMixedCaseNoise: the condition !firstCharacterIsUppercase causes common lowercase-initial camelCase tokens like "iPhone", "macOS", and "iCloud" to be silently dropped as OCR noise, since none of them appear in knownWordSignals or preservedTechnicalTokens. On a macOS screenshot tool these names appear frequently in UI text, so affected users may see slightly less relevant autocomplete context without any visible error. All other logic — the summarizer timeout rework, the ContextSource fallback tracking, and the testability protocols — looks correct.

Cotabby/Support/PromptContextSanitizer.swift — specifically the isLikelyShortMixedCaseNoise function and the absence of lowercase-initial camelCase entries in knownWordSignals or preservedTechnicalTokens.

Important Files Changed

| Filename | Overview |
| --- | --- |
| Cotabby/Support/PromptContextSanitizer.swift | Adds a multi-stage deterministic OCR token scorer; the `isLikelyShortMixedCaseNoise` heuristic produces false positives for lowercase-initial camelCase brand names (iPhone, macOS, iCloud) that are not in `knownWordSignals`. |
| Cotabby/Support/VisualContextSummaryPromptRenderer.swift | New file: builds the autocomplete-focused summarization prompt with injection-resistance guardrails; internally calls `sanitizeOCR` on text that is already sanitized by callers. |
| Cotabby/Services/Visual/ScreenshotContextGenerator.swift | Refactored to use testable protocols; adds OCR fallback path so summarizer errors no longer discard visual context; adds structured debug logging via a typed `ContextSource` enum. |
| Cotabby/Services/Visual/LlamaVisualContextSummarizer.swift | Timeout extended to 6 s, max tokens doubled to 160, prompt delegated to renderer; errors now propagate so callers can fall back to OCR rather than silently swallowing failures. |
| Cotabby/Services/Visual/ScreenTextExtractor.swift | Switches OCR recognition level from `.fast` to `.accurate`, lowers minimum text height from 0.012 to 0.008, and adds `ScreenTextExtracting` protocol for test injection. |
| Cotabby/Services/Visual/WindowScreenshotService.swift | Increases capture padding (100→160 pt horizontal, 600→800 pt vertical) and adds `WindowScreenshotCapturing` protocol seam for testing. |
| CotabbyTests/ScreenshotContextGeneratorTests.swift | New test file: covers summary path, empty-summary OCR fallback, thrown-error fallback, capped OCR, and all-noise OCR rejection using protocol stubs. |
| CotabbyTests/PromptContextSanitizerTests.swift | Adds three new OCR sanitizer tests covering noise filtering and useful-token preservation; no tests for lowercase-initial camelCase names (iPhone, macOS) that the new heuristic incorrectly drops. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[FocusedInputSnapshot] --> B[captureScreenshot\nWindowScreenshotCapturing]
    B --> C[extractText\nScreenTextExtracting]

    C -->|ScreenTextExtractionError.noRecognizedText| D{windowTitle\navailable?}
    D -->|No| E[throw unavailable]
    D -->|Yes| F[normalizeRecognizedText windowTitle]
    F --> G{hasMeaningfulSignal?}
    G -->|No| E
    G -->|Yes| H[boundedSummaryText — ocrFallback]
    H --> Z[VisualContextExcerpt]

    C -->|other error| E
    C -->|success| I[normalizeRecognizedText — sanitizeOCR maxChars=5000]
    I --> J{hasMeaningfulSignal?}
    J -->|No| E
    J -->|Yes| K{summarizer available?}

    K -->|No| L[finalContextText = boundedSummaryText OCR]
    K -->|Yes| M[summarizer.summarize — timeout=6s tokens=160]

    M -->|success| N{hasMeaningfulSignal boundedSummary?}
    N -->|Yes| O[source=summary — finalContextText=boundedSummary]
    N -->|No| L

    M -->|throws| P[log reason — fallback to OCR]
    P --> L

    L --> Q{hasMeaningfulSignal finalContextText?}
    O --> Q
    Q -->|No| E
    Q -->|Yes| Z

    subgraph LlamaVisualContextSummarizer
        M1[deduplicateConsecutiveLines] --> M2[VisualContextSummaryPromptRenderer.prompt — sanitizeOCR]
        M2 --> M3[summarizeWithTimeout — generationTask + timeoutTask]
        M3 --> M4[truncateAtRepeatedBlock]
        M4 -->|empty| M5[throw emptyResult]
        M4 -->|non-empty| M6[return summary]
    end
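
The summarizer branch of the flowchart (nodes K through Q) can be sketched roughly as follows. This is a minimal synchronous sketch, not the PR's code: the `VisualContextSummarizing` protocol, `StubSummarizer`, `SummarizerTimeout`, the `bounded` helper, and the letters-or-digits definition of `hasMeaningfulSignal` are assumptions made for illustration. Only the names `resolvedContextText` and `ContextSource` and the 1500-character summary cap come from the PR text, and the real summarizer is presumably asynchronous.

```swift
import Foundation

// Where the final context text came from (case names follow the flowchart labels).
enum ContextSource { case summary, ocrFallback }

// Assumed seam for the summarizer; hypothetical and synchronous for brevity.
protocol VisualContextSummarizing {
    func summarize(_ text: String) throws -> String
}

// Hypothetical error standing in for a timeout or an empty result.
struct SummarizerTimeout: Error {}

// Assumption: "meaningful" means at least one letter or digit is present.
func hasMeaningfulSignal(_ text: String) -> Bool {
    text.contains(where: { $0.isLetter || $0.isNumber })
}

// Trim surrounding whitespace and cap the text at maxChars characters.
func bounded(_ text: String, maxChars: Int) -> String {
    String(text.trimmingCharacters(in: .whitespacesAndNewlines).prefix(maxChars))
}

// Prefer the summary; fall back to bounded OCR text when the summarizer
// is missing, throws, or returns nothing meaningful.
func resolvedContextText(
    ocrText: String,
    summarizer: VisualContextSummarizing?,
    summaryCap: Int = 1500
) -> (text: String, source: ContextSource) {
    let fallback: (text: String, source: ContextSource) =
        (bounded(ocrText, maxChars: summaryCap), .ocrFallback)
    guard let summarizer = summarizer else { return fallback }
    do {
        let raw = try summarizer.summarize(ocrText)
        let summary = bounded(raw, maxChars: summaryCap)
        if hasMeaningfulSignal(summary) {
            return (summary, .summary)
        }
        return fallback
    } catch {
        // The caller would log the reason here before falling back.
        return fallback
    }
}

// Test stub that either returns a canned summary or throws.
struct StubSummarizer: VisualContextSummarizing {
    let result: Result<String, Error>
    func summarize(_ text: String) throws -> String { try result.get() }
}

let ocr = "Draft reply to Alex about the Q3 report"
let viaSummary = resolvedContextText(
    ocrText: ocr,
    summarizer: StubSummarizer(result: .success("Replying to Alex about Q3")))
let viaFallback = resolvedContextText(
    ocrText: ocr,
    summarizer: StubSummarizer(result: .failure(SummarizerTimeout())))
```

The caller still has to run the final `hasMeaningfulSignal` check (node Q) on whichever text is returned, so an all-noise OCR fallback is rejected rather than forwarded.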

Comments Outside Diff (2)

Cotabby/Support/PromptContextSanitizer.swift, lines 614-633

isLikelyShortMixedCaseNoise false-positives for lowercase-initial camelCase brand names

The function is designed to flag OCR garbage like "gLVWrt" but its final return condition — uppercaseCount >= 2 || !firstCharacterIsUppercase — also catches legitimate tokens where the first letter is lowercase but the token contains any uppercase. Concretely: "iPhone" (letters.first=i, uppercaseCount=1 → !firstCharacterIsUppercase=true → classified as noise), "macOS" (letters.first=m, uppercaseCount=2 → both clauses true → noise), and "iCloud", "iPad", "eBay" all follow the same path. None of these appear in knownWordSignals or preservedTechnicalTokens, so they reach isLikelyShortMixedCaseNoise and are silently dropped. On a macOS screenshot tool, references to Apple product names are common OCR output and their removal degrades autocomplete context.
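
To make the failure mode concrete, here is a hedged reconstruction of the heuristic. Only the final return condition (`uppercaseCount >= 2 || !firstCharacterIsUppercase`) is quoted from the review; the length bounds, the mixed-case guard, and the `lowercaseInitialBrandTokens` allowlist with its `shouldDropAsNoise` wrapper are hypothetical, shown as one possible way to preserve these names.

```swift
// Hypothetical reconstruction: only the final return expression is taken
// from the review comment; the guards are assumptions.
func isLikelyShortMixedCaseNoise(_ token: String) -> Bool {
    let letters = token.filter { $0.isLetter }
    // Assumed length window for a "short" token.
    guard letters.count >= 3, letters.count <= 8 else { return false }
    let uppercaseCount = letters.filter { $0.isUppercase }.count
    let lowercaseCount = letters.filter { $0.isLowercase }.count
    // Only mixed-case tokens are candidates.
    guard uppercaseCount > 0, lowercaseCount > 0 else { return false }
    let firstCharacterIsUppercase = letters.first?.isUppercase ?? false
    return uppercaseCount >= 2 || !firstCharacterIsUppercase
}

// Hypothetical allowlist consulted before the heuristic runs.
let lowercaseInitialBrandTokens: Set<String> = ["iPhone", "iPad", "iCloud", "macOS", "eBay"]

func shouldDropAsNoise(_ token: String) -> Bool {
    if lowercaseInitialBrandTokens.contains(token) { return false }
    return isLikelyShortMixedCaseNoise(token)
}

// The bare heuristic flags brand names and garbage alike;
// the allowlisted wrapper keeps the brand names.
for token in ["iPhone", "macOS", "gLVWrt", "Hello"] {
    print(token, isLikelyShortMixedCaseNoise(token), shouldDropAsNoise(token))
}
```

An allowlist is the smallest change, but it only covers the names someone thought to list; adding the same entries to `knownWordSignals` or `preservedTechnicalTokens`, as the review suggests, would have the same limitation.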

Cotabby/Support/VisualContextSummaryPromptRenderer.swift, lines 660-661

Redundant sanitizeOCR call on already-filtered text

VisualContextSummaryPromptRenderer.prompt is called from LlamaVisualContextSummarizer.summarize, which receives text that was already processed by normalizeRecognizedText (i.e. PromptContextSanitizer.sanitizeOCR with a 5000-character cap). The second sanitizeOCR call here, made without a character limit, reruns the full line-scoring logic on text that is already clean, so it changes nothing in the production path. The extra pass is harmless and idempotent, but it creates a cost asymmetry: in production the sanitizer runs twice, while tests that call prompt directly sanitize exactly once. Consider removing the internal call and documenting that callers are expected to pre-sanitize, or keep it but note the redundancy so future callers know the contract.

Reviews (1): Last reviewed commit: "Improve visual context OCR quality"

SwiftLint (--strict) flagged generateContext at cyclomatic complexity 11 (limit 10). Extract the summarizer-vs-OCR fallback branch into resolvedContextText(...), a behavior-preserving helper that drops the function to ~8 and reads more clearly.

XcodeGen drift: the project.pbxproj had hand-written synthetic object IDs (AA1000...) for the three new files instead of being regenerated. Ran 'xcodegen generate' so the committed project matches project.yml (hash-based IDs); diff vs main is now only the three new file references. Sparkle stays pinned exactVersion 2.9.1.

sanitizeOCR's noise filter only treated a token as real when it had an ASCII vowel or matched the English word lists, so CJK, Cyrillic, Greek, Arabic, Hebrew, Thai, and accented-Latin tokens were stripped to nothing. Non-English users would get empty visual context and fully generic suggestions. assessOCRToken now keeps any token carrying a non-ASCII letter as strong signal, after numbers and repeated-glyph junk are rejected so this is not a backdoor for ASCII garbage. The Latin-only tail moved into assessLatinToken to keep cyclomatic complexity under the limit; ASCII behavior is unchanged (a count<=2 token can never be repeated-glyph junk, so the reordering is a no-op for it). Adds regression tests for CJK/Cyrillic/accented-Latin preservation and for ASCII noise still being dropped on a mixed line.
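
The ordering described above can be sketched as follows. This is a simplified illustration, not the PR's implementation: the `TokenAssessment` enum, the `isRepeatedGlyphJunk` rule (three or more identical characters), and the vowel-only `assessLatinToken` are assumptions, and the real filter also consults English word lists that are omitted here.

```swift
// Assumed three-way classification, following the strong-signal / kept / noise
// wording in the review summary.
enum TokenAssessment { case strongSignal, kept, noise }

// Assumption: junk means three or more copies of one glyph, so a token
// of two or fewer characters can never match.
func isRepeatedGlyphJunk(_ token: String) -> Bool {
    guard token.count >= 3, let first = token.first else { return false }
    return token.allSatisfy { $0 == first }
}

// Simplified Latin-only tail: keep a token only if it has an ASCII vowel.
func assessLatinToken(_ token: String) -> TokenAssessment {
    let vowels: Set<Character> = ["a", "e", "i", "o", "u"]
    let hasVowel = token.lowercased().contains(where: { vowels.contains($0) })
    return hasVowel ? .kept : .noise
}

func assessOCRToken(_ token: String) -> TokenAssessment {
    guard !token.isEmpty else { return .noise }
    // Reject pure numbers and repeated-glyph junk first, so the non-ASCII
    // rule below cannot admit them.
    if token.allSatisfy({ $0.isNumber }) { return .noise }
    if isRepeatedGlyphJunk(token) { return .noise }
    // Any non-ASCII letter (CJK, Cyrillic, accented Latin, ...) is strong signal.
    if token.contains(where: { $0.isLetter && !$0.isASCII }) { return .strongSignal }
    return assessLatinToken(token)
}

// On a mixed line, the ASCII noise token is dropped and the rest survive.
let mixedLine = "设置 xzqk Привет"
let survivors = mixedLine.split(separator: " ")
    .map(String.init)
    .filter { assessOCRToken($0) != .noise }
```

Checking for junk before the non-ASCII rule is what keeps a run like "ЖЖЖЖ" from being promoted to strong signal just because its glyphs are not ASCII.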

Improve visual context OCR quality

606833e

FuJacob marked this pull request as draft May 31, 2026 18:21

FuJacob added 2 commits May 31, 2026 11:35

Improve visual context OCR quality #482
FuJacob wants to merge 3 commits into main from fix/visual-context-quality
