Skip to content

Replace the LLM OCR summarizer with direct OCR-hygiene filtering#499

Merged
FuJacob merged 1 commit into
mainfrom
experimental/m2-ocr-hygiene
Jun 1, 2026
Merged

Replace the LLM OCR summarizer with direct OCR-hygiene filtering#499
FuJacob merged 1 commit into
mainfrom
experimental/m2-ocr-hygiene

Conversation

@FuJacob
Copy link
Copy Markdown
Owner

@FuJacob FuJacob commented Jun 1, 2026

Summary

Drops the LLM visual-context summarizer in favor of direct OCR hygiene. ScreenshotContextGenerator now filters raw OCR through the new pure OCRTextHygiene (low-confidence, replacement-char, symbol-density, digit-substitution, and word-char-ratio drops, plus field-text stripping and bounding) and injects the cleaned text, instead of running a second on-device generation to summarize it.

Why: the summarizer cost an extra llama generation on every visual-context refresh (latency) and was a hallucination layer between the screen and the prompt (a likely source of off-context ghost text). A base model conditions fine on cleaned raw context, so the summary is unnecessary. Field-text stripping also stops the screen context from echoing the user's own typed text back into the prompt.

Removes LlamaVisualContextSummarizer, VisualContextSummaryPromptRenderer, and their test; drops the summarizer injection in CotabbyAppEnvironment. Adds OCRTextHygiene + thorough tests.

Validation

xcodebuild build-for-testing ... -derivedDataPath build/DerivedData   # ** BUILD SUCCEEDED **
swiftlint lint --config .swiftlint.yml --quiet                        # changed files, exit 0

OCRTextHygiene has unit coverage for each filter (notably the digit-substitution matrix: qu81ity/h3llo drop; utf8, v2, 3D, RTX5070, 20-core pass), symbol-density, field-text stripping, and clean bounding. App-hosted tests run in CI (local Team ID signing blocks the xctest bundle, as before).

Risk / rollout notes

  • Visual context is unchanged in when it runs (off the keystroke path, off by default, secure-field gated); only what it produces changes: cleaned raw OCR rather than a model summary.
  • Deferred follow-ups (intentionally small, to keep this PR out of the runtime actor):
    • LlamaRuntimeManager.summarize / LlamaRuntimeCore.summarize are now unused (only the deleted summarizer called them); left in place for a separate trivial cleanup.
    • The VisualContextStatus.summarizingText case is no longer emitted but kept so the debug-overlay's exhaustive switches don't churn.
    • ScreenTextExtractor still discards per-line OCR confidence, so the confidence filter is currently a no-op; surfacing confidence (and optionally .fast recognition) is a follow-up.

🤖 Generated with Claude Code

Greptile Summary

Replaces the LlamaVisualContextSummarizer + VisualContextSummaryPromptRenderer stack with a pure-Swift OCRTextHygiene pass that filters noisy OCR lines (low-confidence, replacement characters, symbol density, digit substitution, word-char ratio, field-text echoes) before injecting them into the autocomplete prompt. This eliminates the extra on-device LLM generation per visual-context refresh and removes a hallucination layer between the screen and the prompt.

  • OCRTextHygiene is a stateless, I/O-free enum with six independently testable filter stages and thorough unit coverage; all thresholds are tunable.
  • ScreenshotContextGenerator is simplified to a linear hygiene → normalize → bound pipeline; the VisualContextSummarizing protocol, resolvedContextText, and the summarizer injection in CotabbyAppEnvironment are all removed.

Confidence Score: 4/5

Safe to merge; the change is a clean removal of the on-device summarizer path and a well-tested replacement filter. No data loss or behavioral regression risks on the existing feature boundary.

The OCR hygiene filters are well-reasoned and the test suite covers each one individually and end-to-end. The digit-substitution heuristic (lowercase-before + letter-after) quietly drops lines containing numeronyms like k8s, e2e, i18n, and a11y — real developer terminology that is indistinguishable from OCR misreads by this rule. The PR acknowledges gpt-4o-mini as a known false positive but does not document or test the shorter numeronym family, which is more likely to appear in on-screen text for the target developer audience. This is a quality regression for those specific contexts rather than a hard breakage.

Cotabby/Support/OCRTextHygiene.swift — specifically the tokenHasDigitSubstitution predicate and its interaction with numeronyms.

Important Files Changed

Filename Overview
Cotabby/Support/OCRTextHygiene.swift New pure-Swift OCR hygiene module with 6 filters; well-structured and tested, but the digit-substitution filter (filter #4) silently drops lines containing numeronyms like k8s, e2e, i18n, a11y — cases not covered by tests or the PR description
Cotabby/Services/Visual/ScreenshotContextGenerator.swift Summarizer path cleanly removed; replaced with OCRTextHygiene.clean pipeline; hardcoded confidence of 1.0 is an acknowledged no-op; file-level comment still references "optional local summary" which no longer exists
Cotabby/App/Core/CotabbyAppEnvironment.swift Cleanly drops the summarizer injection; ScreenshotContextGenerator now takes no summarizer argument
CotabbyTests/OCRTextHygieneTests.swift Thorough per-filter unit tests with a preserve/drop matrix for digit-substitution; correctly documents the gpt-4o-mini limitation but does not cover numeronym false-positive cases
Cotabby.xcodeproj/project.pbxproj Correctly removes deleted files and adds new OCRTextHygiene files to both app and test targets

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[FocusedInputSnapshot] --> B[captureScreenshot]
    B --> C[ScreenTextExtractor.extractText]
    C -->|ScreenTextExtractionError.noRecognizedText| D[Use window title fallback]
    C -->|extractedText| E[OCRTextHygiene.clean]
    E --> E1[Filter 1: dropLowConfidence]
    E1 --> E2[Filter 2: dropReplacementCharacter]
    E2 --> E3[Filter 3: dropHighSymbolDensity]
    E3 --> E4[Filter 4: dropDigitSubstitution]
    E4 --> E5[Filter 5: dropLowWordCharacterRatio]
    E5 --> E6[Filter 6: strip field-text echoes]
    E6 --> F[normalizeRecognizedText / sanitizeOCR]
    F --> G[boundedSummaryText / sanitize]
    G --> H{hasMeaningfulSignal?}
    H -->|No| I[throw .unavailable]
    H -->|Yes| J[VisualContextExcerpt]
    D --> K[normalizeRecognizedText]
    K --> L{hasMeaningfulSignal?}
    L -->|No| I
    L -->|Yes| J
Loading

Comments Outside Diff (1)

  1. Cotabby/Services/Visual/ScreenshotContextGenerator.swift, line 9-10 (link)

    P2 The file-level pipeline comment still references "optional local summary" which was removed by this PR. The summarizer step no longer exists.

    Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

    Fix in Codex Fix in Claude Code

Fix All in Codex Fix All in Claude Code

Reviews (1): Last reviewed commit: "Replace the LLM OCR summarizer with dire..." | Re-trigger Greptile

Greptile also left 1 inline comment on this PR.

ScreenshotContextGenerator now filters raw OCR through the pure OCRTextHygiene (confidence, replacement-char, symbol-density, digit-substitution, word-char-ratio drops, plus field-text stripping and bounding) and injects the cleaned text, instead of running a second llama generation to summarize it. Removes the per-refresh generation (latency) and a hallucination layer; a base model conditions fine on cleaned raw context.

Deletes LlamaVisualContextSummarizer, VisualContextSummaryPromptRenderer, and their test; drops the summarizer injection in CotabbyAppEnvironment. Adds OCRTextHygiene + tests. Follow-ups left small: runtime summarize() is now unused (kept out of this PR), the .summarizingText status is no longer emitted, and ScreenTextExtractor still discards per-line confidence (confidence filter is currently a no-op).
@FuJacob FuJacob merged commit ecf6b0f into main Jun 1, 2026
3 of 4 checks passed
Comment on lines +117 to +123
/// The "lowercase before" + "letter after" pairing is deliberately narrow so genuine tokens
/// survive:
/// - trailing digits (`utf8`, `v2`): no letter after the digit.
/// - leading digits (`3D`, `5070`): no letter before the digit at all.
/// - hyphenated counts (`20-core`): the digits have no lowercase letter before them.
/// - ALL-CAPS identifiers (`RTX5070`, `N1X`): the letters before the digit are uppercase, so
/// the "lowercase before" condition is never met (uppercase model/product codes are real).
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 The digit-substitution rule (lowercase-before + letter-after) also matches common developer numeronyms: k8s (k → lowercase before 8 → s after), e2e, i18n, a11y, w3c, b2b. Entire lines containing these tokens will be silently dropped from autocomplete context. A developer working on Kubernetes infra or an i18n codebase would find that all relevant screen context is stripped. The PR description acknowledges this family of false-positives for gpt-4o-mini but these short numeronyms are a more common case not covered by the tests.

Suggested change
/// The "lowercase before" + "letter after" pairing is deliberately narrow so genuine tokens
/// survive:
/// - trailing digits (`utf8`, `v2`): no letter after the digit.
/// - leading digits (`3D`, `5070`): no letter before the digit at all.
/// - hyphenated counts (`20-core`): the digits have no lowercase letter before them.
/// - ALL-CAPS identifiers (`RTX5070`, `N1X`): the letters before the digit are uppercase, so
/// the "lowercase before" condition is never met (uppercase model/product codes are real).
/// The "lowercase before" + "letter after" pairing is deliberately narrow so genuine tokens
/// survive:
/// - trailing digits (`utf8`, `v2`): no letter after the digit.
/// - leading digits (`3D`, `5070`): no letter before the digit at all.
/// - hyphenated counts (`20-core`): the digits have no lowercase letter before them.
/// - ALL-CAPS identifiers (`RTX5070`, `N1X`): the letters before the digit are uppercase, so
/// the "lowercase before" condition is never met (uppercase model/product codes are real).
///
/// Known false positives (same pattern as OCR misreads): numeronyms such as `k8s`, `e2e`,
/// `i18n`, `a11y`, `w3c`, and model names like `gpt-4o-mini`. Lines containing these will be
/// dropped. This is an accepted trade-off; adding an allow-list is a potential follow-up.

Fix in Codex Fix in Claude Code

FuJacob added a commit that referenced this pull request Jun 1, 2026
Follow-up to #499 (which removed the OCR summarizer): deletes the now-unused runtime code it relied on - LlamaRuntimeManager.summarize, LlamaRuntimeCore.summarize, and the orphaned LlamaGenerationOptions.summary factory. No remaining callers; build-for-testing green.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant