Improve visual context OCR quality #482

Draft
FuJacob wants to merge 3 commits into main from fix/visual-context-quality

FuJacob (Owner) commented May 31, 2026

Summary

Improve the screenshot/OCR visual-context pipeline so Cotabby sends richer, sanitized prompt context instead of dropping context when summarization fails. The change keeps Screen Recording required, expands OCR capture quality and budgets, adds deterministic OCR-noise filtering, and replaces the generic summary prompt with an autocomplete-focused context extraction prompt.

Validation

  • plutil -lint Cotabby.xcodeproj/project.pbxproj
    • Cotabby.xcodeproj/project.pbxproj: OK
  • xcodebuild -project Cotabby.xcodeproj -scheme Cotabby -destination platform=macOS build-for-testing -derivedDataPath build/DerivedData
    • ** TEST BUILD SUCCEEDED **
  • xcodebuild -project Cotabby.xcodeproj -scheme Cotabby -destination platform=macOS test -only-testing:CotabbyTests/PromptContextSanitizerTests -only-testing:CotabbyTests/ScreenshotContextGeneratorTests -only-testing:CotabbyTests/VisualContextSummaryPromptRendererTests -derivedDataPath build/DerivedData
    • Compiled, then failed to load the app-hosted CotabbyTests.xctest bundle because the test bundle and host app have different Team IDs. This matches the known local signing failure for app-hosted tests.
  • rm -rf build/DerivedData

Linked issues

None.

Risk / rollout notes

  • Visual context intentionally trades a bit more capture/OCR/summarization latency for higher-quality context: larger screenshot/OCR budgets, Vision accurate mode, and a longer summary timeout.
  • Fast Mode still skips screenshot/OCR capture after permissions are granted, but Screen Recording remains required for autocomplete eligibility.
  • No hosted APIs, raw screenshots, or unbounded OCR text are introduced.

Greptile Summary

This PR upgrades the screenshot/OCR visual-context pipeline: it expands capture/OCR/summary budgets, switches Vision to accurate mode, replaces the generic summary prompt with an autocomplete-focused one extracted into VisualContextSummaryPromptRenderer, and adds deterministic OCR token scoring to filter hallucinated fragments. The fallback path now always delivers sanitized OCR text instead of dropping context when summarization fails or times out.

  • Budget and quality increases: snapshot dimension 500→700, maxImageDimension 900→1600, OCR char cap 2000→5000, summary cap 900→1500, timeout 3→6 s, max tokens 80→160; OCR switches from .fast to .accurate with a lower minimum text height (0.012→0.008).
  • New token-scoring OCR filter in PromptContextSanitizer.sanitizeOCR: multi-stage heuristics classify tokens as strong-signal, kept, or noise, dropping a line when it lacks any strong-signal token; the new isLikelyShortMixedCaseNoise heuristic works well for random garbage but has false positives for lowercase-initial camelCase brand names ("iPhone", "macOS", "iCloud") that are absent from knownWordSignals.
  • Testability: WindowScreenshotCapturing and ScreenTextExtracting protocols are introduced so ScreenshotContextGenerator can be unit-tested without live ScreenCaptureKit or Vision; new ScreenshotContextGeneratorTests and VisualContextSummaryPromptRendererTests cover the main paths.
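
To make the budget changes listed above easier to compare, here is a minimal sketch that records the before and after values in one place. The type `VisualContextBudget`, its property names, and the `capped` helper are hypothetical and are not taken from the Cotabby source; only the numbers come from this summary.

```swift
import Foundation

/// Hypothetical value type: the names are illustrative,
/// only the numbers come from the PR summary.
struct VisualContextBudget: Equatable {
    var snapshotDimension: Int
    var maxImageDimension: Int
    var ocrCharacterCap: Int
    var summaryCharacterCap: Int
    var summaryTimeoutSeconds: Double
    var summaryMaxTokens: Int
    var minimumTextHeight: Double

    /// Values before this PR.
    static let before = VisualContextBudget(
        snapshotDimension: 500,
        maxImageDimension: 900,
        ocrCharacterCap: 2000,
        summaryCharacterCap: 900,
        summaryTimeoutSeconds: 3,
        summaryMaxTokens: 80,
        minimumTextHeight: 0.012
    )

    /// Values after this PR.
    static let after = VisualContextBudget(
        snapshotDimension: 700,
        maxImageDimension: 1600,
        ocrCharacterCap: 5000,
        summaryCharacterCap: 1500,
        summaryTimeoutSeconds: 6,
        summaryMaxTokens: 160,
        minimumTextHeight: 0.008
    )
}

/// Hypothetical helper: bounds text to a character budget by keeping a prefix.
func capped(_ text: String, to limit: Int) -> String {
    String(text.prefix(limit))
}
```

With these values, `capped(text, to: VisualContextBudget.after.ocrCharacterCap)` keeps at most 5000 characters of OCR text, compared with 2000 before the change.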

Confidence Score: 4/5

Safe to merge; the core fallback behavior is well-tested and correct. The one heuristic edge case silently degrades context quality rather than causing errors.

The pipeline changes are well-structured: the OCR fallback path is thoroughly tested with stubs, error propagation from the summarizer is correctly caught upstream, and the new prompt renderer is straightforward. The only substantive concern is in isLikelyShortMixedCaseNoise: the condition !firstCharacterIsUppercase causes common lowercase-initial camelCase tokens like "iPhone", "macOS", and "iCloud" to be silently dropped as OCR noise, since none of them appear in knownWordSignals or preservedTechnicalTokens. On a macOS screenshot tool these names appear frequently in UI text, so affected users may see slightly less relevant autocomplete context without any visible error. All other logic — the summarizer timeout rework, the ContextSource fallback tracking, and the testability protocols — looks correct.

Cotabby/Support/PromptContextSanitizer.swift — specifically the isLikelyShortMixedCaseNoise function and the absence of lowercase-initial camelCase entries in knownWordSignals or preservedTechnicalTokens.
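
The false positive can be reproduced with a small sketch. This is not the code from PromptContextSanitizer.swift: only the final return condition (`uppercaseCount >= 2 || !firstCharacterIsUppercase`) is quoted in this review, so the length limit of 6, the letters-only guard, and the mixed-case guard below are assumptions. The allowlist type `LowercaseInitialNames` is likewise hypothetical and shows one possible mitigation, not the fix the author chose.

```swift
import Foundation

/// Sketch of the heuristic. Only the final return condition is taken from
/// the review; the guards and the default maxLength are assumptions.
func isLikelyShortMixedCaseNoiseSketch(_ token: String, maxLength: Int = 6) -> Bool {
    let letters = token.filter { $0.isLetter }
    // Assumption: only short, letters-only tokens are considered.
    guard letters.count == token.count, letters.count <= maxLength else { return false }

    let uppercaseCount = letters.filter { $0.isUppercase }.count
    let lowercaseCount = letters.filter { $0.isLowercase }.count
    // Assumption: "mixed case" means at least one letter of each case.
    guard uppercaseCount > 0, lowercaseCount > 0 else { return false }

    let firstCharacterIsUppercase = letters.first?.isUppercase ?? false
    return uppercaseCount >= 2 || !firstCharacterIsUppercase
}

/// Hypothetical allowlist of lowercase-initial names to preserve.
enum LowercaseInitialNames {
    static let preserved: Set<String> = ["iPhone", "iPad", "iCloud", "macOS", "eBay"]
}

/// One possible mitigation: consult the allowlist before the heuristic.
func isNoiseWithAllowlistSketch(_ token: String) -> Bool {
    if LowercaseInitialNames.preserved.contains(token) { return false }
    return isLikelyShortMixedCaseNoiseSketch(token)
}
```

Under these assumptions, "gLVWrt", "iPhone", and "macOS" are all classified as noise by the bare heuristic, while "Apple" and "hello" are not. With the allowlist check in front, the brand names survive and "gLVWrt" is still dropped.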

Important Files Changed

| Filename | Overview |
| --- | --- |
| Cotabby/Support/PromptContextSanitizer.swift | Adds a multi-stage deterministic OCR token scorer; the isLikelyShortMixedCaseNoise heuristic produces false positives for lowercase-initial camelCase brand names (iPhone, macOS, iCloud) that are not in knownWordSignals. |
| Cotabby/Support/VisualContextSummaryPromptRenderer.swift | New file: builds the autocomplete-focused summarization prompt with injection-resistance guardrails; internally calls sanitizeOCR on text that is already sanitized by callers. |
| Cotabby/Services/Visual/ScreenshotContextGenerator.swift | Refactored to use testable protocols; adds OCR fallback path so summarizer errors no longer discard visual context; adds structured debug logging via a typed ContextSource enum. |
| Cotabby/Services/Visual/LlamaVisualContextSummarizer.swift | Timeout extended to 6 s, max tokens doubled to 160, prompt delegated to renderer; errors now propagate so callers can fall back to OCR rather than silently swallowing failures. |
| Cotabby/Services/Visual/ScreenTextExtractor.swift | Switches OCR recognition level from .fast to .accurate, lowers minimum text height from 0.012 to 0.008, and adds ScreenTextExtracting protocol for test injection. |
| Cotabby/Services/Visual/WindowScreenshotService.swift | Increases capture padding (100→160 pt horizontal, 600→800 pt vertical) and adds WindowScreenshotCapturing protocol seam for testing. |
| CotabbyTests/ScreenshotContextGeneratorTests.swift | New test file: covers summary path, empty-summary OCR fallback, thrown-error fallback, capped OCR, and all-noise OCR rejection using protocol stubs. |
| CotabbyTests/PromptContextSanitizerTests.swift | Adds three new OCR sanitizer tests covering noise filtering and useful-token preservation; no tests for lowercase-initial camelCase names (iPhone, macOS) that the new heuristic incorrectly drops. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[FocusedInputSnapshot] --> B[captureScreenshot\nWindowScreenshotCapturing]
    B --> C[extractText\nScreenTextExtracting]

    C -->|ScreenTextExtractionError.noRecognizedText| D{windowTitle\navailable?}
    D -->|No| E[throw unavailable]
    D -->|Yes| F[normalizeRecognizedText windowTitle]
    F --> G{hasMeaningfulSignal?}
    G -->|No| E
    G -->|Yes| H[boundedSummaryText — ocrFallback]
    H --> Z[VisualContextExcerpt]

    C -->|other error| E
    C -->|success| I[normalizeRecognizedText — sanitizeOCR maxChars=5000]
    I --> J{hasMeaningfulSignal?}
    J -->|No| E
    J -->|Yes| K{summarizer available?}

    K -->|No| L[finalContextText = boundedSummaryText OCR]
    K -->|Yes| M[summarizer.summarize — timeout=6s tokens=160]

    M -->|success| N{hasMeaningfulSignal boundedSummary?}
    N -->|Yes| O[source=summary — finalContextText=boundedSummary]
    N -->|No| L

    M -->|throws| P[log reason — fallback to OCR]
    P --> L

    L --> Q{hasMeaningfulSignal finalContextText?}
    O --> Q
    Q -->|No| E
    Q -->|Yes| Z

    subgraph LlamaVisualContextSummarizer
        M1[deduplicateConsecutiveLines] --> M2[VisualContextSummaryPromptRenderer.prompt — sanitizeOCR]
        M2 --> M3[summarizeWithTimeout — generationTask + timeoutTask]
        M3 --> M4[truncateAtRepeatedBlock]
        M4 -->|empty| M5[throw emptyResult]
        M4 -->|non-empty| M6[return summary]
    end
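
The flowchart names the first summarizer step, deduplicateConsecutiveLines, but its body is not shown on this page. The following is a rough sketch of what a helper with that name might do, assuming it collapses runs of identical adjacent lines after trimming whitespace. The function name carries a `Sketch` suffix to make clear it is illustrative and not the Cotabby implementation.

```swift
import Foundation

/// Sketch (assumed behavior): collapse runs of identical adjacent lines.
/// Non-adjacent repeats are kept, because only consecutive duplicates are merged.
func deduplicateConsecutiveLinesSketch(_ text: String) -> String {
    var result: [String] = []
    for line in text.components(separatedBy: "\n") {
        let trimmed = line.trimmingCharacters(in: .whitespaces)
        // Skip a line that repeats the one just kept.
        if let last = result.last, last == trimmed { continue }
        result.append(trimmed)
    }
    return result.joined(separator: "\n")
}
```

OCR output from toolbars and lists often repeats the same label on adjacent lines, so a pass like this would reduce the amount of redundant text that reaches the summarization prompt.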

Comments Outside Diff (2)

  1. Cotabby/Support/PromptContextSanitizer.swift, lines 614-633

    P2 isLikelyShortMixedCaseNoise false-positives for lowercase-initial camelCase brand names

    The function is designed to flag OCR garbage like "gLVWrt" but its final return condition — uppercaseCount >= 2 || !firstCharacterIsUppercase — also catches legitimate tokens where the first letter is lowercase but the token contains any uppercase. Concretely: "iPhone" (letters.first=i, uppercaseCount=1 → !firstCharacterIsUppercase=true → classified as noise), "macOS" (letters.first=m, uppercaseCount=2 → both clauses true → noise), and "iCloud", "iPad", "eBay" all follow the same path. None of these appear in knownWordSignals or preservedTechnicalTokens, so they reach isLikelyShortMixedCaseNoise and are silently dropped. On a macOS screenshot tool, references to Apple product names are common OCR output and their removal degrades autocomplete context.

  2. Cotabby/Support/VisualContextSummaryPromptRenderer.swift, lines 660-661

    P2 Redundant sanitizeOCR call on already-filtered text

    VisualContextSummaryPromptRenderer.prompt is called from LlamaVisualContextSummarizer.summarize, which receives text that was already processed by normalizeRecognizedText (i.e. PromptContextSanitizer.sanitizeOCR with a 5,000-character cap). The second sanitizeOCR call here, made without a character limit, runs the full line-scoring logic on text that is already clean, making it a no-op in the production path. The extra pass is harmless and idempotent, but it creates a cost asymmetry: in production the sanitizer runs twice, while tests that call prompt directly sanitize exactly once. Consider removing the internal call and documenting that callers are expected to pre-sanitize, or keep it but note the redundancy so future callers know the contract.

Reviews (1): Last reviewed commit: "Improve visual context OCR quality"

FuJacob marked this pull request as draft May 31, 2026 18:21
FuJacob added 2 commits May 31, 2026 11:35
SwiftLint (--strict) flagged generateContext at cyclomatic complexity 11 (limit 10). Extract the summarizer-vs-OCR fallback branch into resolvedContextText(...), a behavior-preserving helper that drops the function to ~8 and reads more clearly.
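
The extracted branch can be pictured roughly as follows. This is a sketch under stated assumptions, not the code from ScreenshotContextGenerator.swift: the protocol, enum, stubs, and signature are hypothetical, the summarizer is modeled as synchronous for brevity (the real one may well be async), and the 1500-character cap is assumed to apply to both the summary and the OCR fallback, following the boundedSummaryText nodes in the flowchart.

```swift
import Foundation

/// Hypothetical seam standing in for the real summarizer type.
protocol VisualContextSummarizingSketch {
    func summarize(_ ocrText: String) throws -> String
}

/// Hypothetical mirror of the ContextSource enum mentioned in the review.
enum ContextSourceSketch: Equatable { case summary, ocrFallback }

struct SummarizerFailureSketch: Error {}

/// Assumption: "meaningful signal" means at least one letter or digit.
func hasMeaningfulSignalSketch(_ text: String) -> Bool {
    text.contains { $0.isLetter || $0.isNumber }
}

/// Sketch of the summarizer-vs-OCR fallback branch.
func resolvedContextTextSketch(
    ocrText: String,
    summarizer: VisualContextSummarizingSketch?,
    summaryCap: Int = 1500
) -> (text: String, source: ContextSourceSketch) {
    let boundedOCR = String(ocrText.prefix(summaryCap))
    // No summarizer available: use bounded OCR text directly.
    guard let summarizer = summarizer else {
        return (boundedOCR, .ocrFallback)
    }
    do {
        let boundedSummary = String(try summarizer.summarize(ocrText).prefix(summaryCap))
        if hasMeaningfulSignalSketch(boundedSummary) {
            return (boundedSummary, .summary)
        }
        // Empty or signal-free summary: fall back instead of dropping context.
        return (boundedOCR, .ocrFallback)
    } catch {
        // Summarizer threw (for example on timeout): fall back to OCR.
        return (boundedOCR, .ocrFallback)
    }
}

/// Test stubs, in the spirit of the protocol stubs used by the new tests.
struct FixedSummarizerStub: VisualContextSummarizingSketch {
    let output: String
    func summarize(_ ocrText: String) throws -> String { output }
}

struct ThrowingSummarizerStub: VisualContextSummarizingSketch {
    func summarize(_ ocrText: String) throws -> String { throw SummarizerFailureSketch() }
}
```

Keeping the three fallback cases (no summarizer, unusable summary, thrown error) in one helper leaves the caller with a single result to validate, which is consistent with the complexity reduction described in this commit.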

XcodeGen drift: the project.pbxproj had hand-written synthetic object IDs (AA1000...) for the three new files instead of being regenerated. Ran 'xcodegen generate' so the committed project matches project.yml (hash-based IDs); diff vs main is now only the three new file references. Sparkle stays pinned exactVersion 2.9.1.
sanitizeOCR's noise filter only treated a token as real when it had an ASCII vowel or matched the English word lists, so CJK, Cyrillic, Greek, Arabic, Hebrew, Thai, and accented-Latin tokens were stripped to nothing. Non-English users would get empty visual context and fully generic suggestions.

assessOCRToken now keeps any token carrying a non-ASCII letter as strong signal, after numbers and repeated-glyph junk are rejected so this is not a backdoor for ASCII garbage. The Latin-only tail moved into assessLatinToken to keep cyclomatic complexity under the limit; ASCII behavior is unchanged (a count<=2 token can never be repeated-glyph junk, so the reordering is a no-op for it). Adds regression tests for CJK/Cyrillic/accented-Latin preservation and for ASCII noise still being dropped on a mixed line.
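
The ordering described in this commit message can be sketched as below. This is an illustration, not the code from PromptContextSanitizer.swift: the verdict enum, the treatment of all-digit tokens as kept, the definition of repeated-glyph junk (more than two characters, all identical), and the vowel-only Latin tail are assumptions; the real assessLatinToken also consults English word lists that are not shown on this page.

```swift
import Foundation

/// Hypothetical verdicts, named after the three classes in the review summary.
enum TokenVerdictSketch: Equatable { case strongSignal, kept, noise }

/// Assumption: junk is a token of more than two characters that repeats one glyph.
func isRepeatedGlyphJunkSketch(_ token: String) -> Bool {
    token.count > 2 && Set(token).count == 1
}

/// Simplified Latin-only tail: keep a token only if it has an ASCII vowel.
func assessLatinTokenSketch(_ token: String) -> TokenVerdictSketch {
    let vowels: Set<Character> = ["a", "e", "i", "o", "u"]
    let hasVowel = token.lowercased().contains { vowels.contains($0) }
    return hasVowel ? .kept : .noise
}

/// Sketch of the ordering: numbers and repeated-glyph junk are handled first,
/// then any remaining token with a non-ASCII letter is strong signal.
func assessOCRTokenSketch(_ token: String) -> TokenVerdictSketch {
    guard !token.isEmpty else { return .noise }
    // Assumption: purely numeric tokens are kept without being strong signal.
    if token.allSatisfy({ $0.isNumber }) { return .kept }
    if isRepeatedGlyphJunkSketch(token) { return .noise }
    if token.contains(where: { $0.isLetter && !$0.isASCII }) { return .strongSignal }
    return assessLatinTokenSketch(token)
}
```

Because the repeated-glyph check runs before the non-ASCII check, a run such as "ーーー" is still rejected, while "東京", "Привет", and "café" are kept as strong signal and an ASCII fragment such as "qwrt" is still dropped.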