fix(agent): five model-robustness bugs (parser, quirks, cannot_read, DPI, safety bypass)#82

Open
AmrDab wants to merge 4 commits into main from claude/model-robustness-fix

Conversation


@AmrDab AmrDab commented May 7, 2026

Summary

Five real bugs that surfaced from a single Kimi/Outlook task run, plus the cred-fix cherry-picked from claude/openclaw-cred-fix (so this branch installs cleanly without that one merged first).

All fixes are pattern-based, not model-name-matched. Adding new models / formats is one-line each.

Bugs fixed

1. Multi-format prose tool-call parser (src/llm-client.ts)

The OpenAI tool-calls path only fell back to a thin JSON parser when no native tool_calls block was present. Kimi (and several other providers) emit tool calls as PROSE in formats that JSON-only parsing missed:

  • functions.<NAME>:<id>$\n{...} (Kimi moonshot-v1-* text)
  • <NAME>(key: value, key2: "value2") (Kimi kimi-k2.5 vision — Python-call style)
  • <function=NAME>{...}</function> (Llama / some Mistral)
  • <|tool_call|>NAME\n{...} (some chat-formatted models)
  • JSON-only with explicit tool|action|name + args|input|parameters keys

Now all five families parse to {name, args} correctly. The legacy lenient {name:"X"} path now requires a peer args object so a parameter dictionary can never be misread as a tool call (this was the v0.8.8 "unknown tool: Outlook" bug).
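As a sketch of how the families can be distinguished (names and regexes here are illustrative assumptions, not the actual `tryParseProseToolCall` implementation, which handles more edge cases), one branch per format plus the peer-`args` requirement on the JSON-only path looks roughly like:

```typescript
interface ToolCall { name: string; args: Record<string, unknown>; }

// Hypothetical sketch of a multi-format prose tool-call parser.
function parseProseToolCall(text: string): ToolCall | null {
  const t = text.trim();

  // functions.<NAME>:<id>$\n{...}   (Kimi moonshot-v1-* text)
  let m = t.match(/^functions\.([\w-]+):\d+\$?\s*(\{[\s\S]*\})/);
  if (m) return { name: m[1], args: JSON.parse(m[2]) };

  // <function=NAME>{...}</function>   (Llama / some Mistral)
  m = t.match(/<function=([\w-]+)>\s*(\{[\s\S]*?\})\s*<\/function>/);
  if (m) return { name: m[1], args: JSON.parse(m[2]) };

  // <|tool_call|>NAME\n{...}   (some chat-formatted models)
  m = t.match(/<\|tool_call\|>\s*([\w-]+)\s*(\{[\s\S]*\})/);
  if (m) return { name: m[1], args: JSON.parse(m[2]) };

  // JSON-only with explicit tool|action|name + args|input|parameters keys.
  try {
    const obj = JSON.parse(t);
    const name = obj.tool ?? obj.action ?? obj.name;
    const args = obj.args ?? obj.input ?? obj.parameters;
    // Peer-args requirement: a bare parameter dict like {"name":"Outlook"}
    // has no args object, so it can never be misread as a tool call.
    if (typeof name === "string" && args && typeof args === "object") {
      return { name, args };
    }
  } catch { /* not JSON — fall through */ }

  return null;
}
```

The peer-`args` branch is exactly what prevents the v0.8.8 regression: the old lenient path would have extracted `"Outlook"` from `{"name":"Outlook"}` as a tool name.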

2. Per-model param quirks (src/llm-client.ts)

Some models reject perfectly-valid OpenAI-shape params:

Model substring   Quirk
kimi-k2           Requires temperature: 1 (rejects 0 with HTTP 400)
o1, o3            max_tokens → max_completion_tokens; temperature: 1
gpt-5             temperature: 1 only

New MODEL_QUIRKS table + applyModelQuirks() helper rewrites incompatible request bodies before send. Wired into all four request-building sites (Anthropic text/prefill/tool_use, OpenAI tool_calls).
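A minimal sketch of the table-plus-helper shape (field names and entry details are assumptions from this description, not the exact `MODEL_QUIRKS` code in `src/llm-client.ts`):

```typescript
interface RequestBody {
  temperature?: number;
  max_tokens?: number;
  max_completion_tokens?: number;
  [k: string]: unknown;
}

type Quirk = (body: RequestBody) => void;

// Substring-matched, so "kimi-k2.5" and "kimi-k2-vision" both hit "kimi-k2".
// Adding a new model is one row.
const MODEL_QUIRKS: Array<[substring: string, quirk: Quirk]> = [
  ["kimi-k2", (b) => { b.temperature = 1; }],
  ["o1", (b) => {
    if (b.max_tokens !== undefined) { b.max_completion_tokens = b.max_tokens; delete b.max_tokens; }
    b.temperature = 1;
  }],
  ["o3", (b) => {
    if (b.max_tokens !== undefined) { b.max_completion_tokens = b.max_tokens; delete b.max_tokens; }
    b.temperature = 1;
  }],
  ["gpt-5", (b) => { b.temperature = 1; }],
];

// Rewrite an incompatible request body before send; the input is not mutated.
function applyModelQuirks(model: string, body: RequestBody): RequestBody {
  const out = { ...body };
  for (const [sub, quirk] of MODEL_QUIRKS) {
    if (model.includes(sub)) quirk(out);
  }
  return out;
}
```

Because the helper runs at the request-building sites rather than per provider, all four call paths get the rewrite without duplicating the table.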

3. cannot_read guard (src/pipeline/agent/agent.ts + prompt.ts)

Some models bail out of an action loop by calling cannot_read after they ALREADY successfully located the target ("I see Send button but I need to confirm before clicking"). This stalls the pipeline.

The agent loop now refuses cannot_read calls when a perception/locator tool succeeded in the previous 4 turns. Pattern-based; the resolver-tool list (wait_for_element, find_element, invoke_element, set_field_value, read_screen, a11y_snapshot, screenshot, list_windows, focus_window) is hard-coded in clawdcursor, not model-specific. Prompt also tightened so the rule is explicit to the LLM.
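The guard reduces to a small check over recent turn history. A sketch under assumed shapes (the real agent loop tracks richer turn records):

```typescript
// Hard-coded resolver-tool list from the PR description — pattern-based,
// not model-specific.
const RESOLVER_TOOLS = new Set([
  "wait_for_element", "find_element", "invoke_element", "set_field_value",
  "read_screen", "a11y_snapshot", "screenshot", "list_windows", "focus_window",
]);

interface TurnRecord { tool: string; success: boolean; }

// Refuse cannot_read when any perception/locator tool succeeded in the
// previous `window` turns: the model already located its target and
// should act on it instead of bailing out.
function shouldSuppressCannotRead(history: TurnRecord[], window = 4): boolean {
  return history
    .slice(-window)
    .some((t) => t.success && RESOLVER_TOOLS.has(t.tool));
}
```

When the check fires, the model presumably gets a structured rejection telling it to act on what it already located, rather than a silent drop.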

4. DPI coord correction in vision + OCR paths

src/computer-use.ts scale() and src/tools/smart.ts smart_click OCR path returned PHYSICAL pixel coords that nut-js mouse.setPosition then double-scaled on Windows ≥125% DPI. Now both divide by dpiRatio (no-op on standard DPI; correct on hi-DPI). Note: the compact mouse(action:click) path in src/pipeline/agent/compound.ts still passes coords through raw — works on displays where image-space === logical-space, breaks otherwise. Worth fixing in a follow-up.
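The correction itself is a one-line division applied in both paths; a sketch (how `dpiRatio` is obtained is an assumption — the point is only the physical→logical conversion):

```typescript
// Convert physical pixel coordinates (from screen.grab() / OCR / vision
// scaling) to the logical pixels that nut-js mouse.setPosition expects.
function physicalToLogical(
  x: number, y: number, dpiRatio: number,
): { x: number; y: number } {
  // Standard DPI: dpiRatio === 1, so this is a no-op.
  // Windows at 150% scaling: dpiRatio === 1.5, so physical (1350, 675)
  // becomes the logical (900, 450) the mouse API expects — without the
  // division, the click would double-scale and land far off-target.
  return { x: Math.round(x / dpiRatio), y: Math.round(y / dpiRatio) };
}
```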

5. Intent-matched safety bypass (src/pipeline/safety/layer.ts + agent.ts)

The safety layer's CONFIRM_LABEL_PATTERNS (/\bsend\b/i, /\bdelete\b/i, etc.) blocked EVERY destructive click — even when the user explicitly asked for the action. The agent loop has no in-loop confirm dialog, so legitimate user-requested actions deadlocked.

Fix: when the user's task text contains a word that matches the same CONFIRM pattern as the target label, the user has provided explicit consent for THIS specific destructive action. Examples:

User task                 Target label   Decision
"hit send"                "Send"         bypass (both match \bsend\b)
"delete the row"          "Delete"       bypass (both match \bdelete\b)
"open my inbox"           "Send"         confirm (no intent match)
"purchase the laptop"     "Buy now"      confirm (different patterns)

Strictly safer than disabling the gate. A model that hallucinates a Send click in a context where the user didn't ask for it still gets blocked. Only when both texts name the same destructive operation does the bypass apply.
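The decision logic reduces to "same pattern must match both texts". A sketch with an abbreviated pattern list (the real `CONFIRM_LABEL_PATTERNS` table in `src/pipeline/safety/layer.ts` is longer):

```typescript
const CONFIRM_LABEL_PATTERNS = [
  /\bsend\b/i, /\bdelete\b/i, /\bsubmit\b/i, /\bpay\b/i, /\bbuy\b/i,
];

type Decision = "allow" | "confirm";

function evaluateClick(targetLabel: string, userTaskText: string): Decision {
  for (const pattern of CONFIRM_LABEL_PATTERNS) {
    if (pattern.test(targetLabel)) {
      // Bypass only when the SAME pattern also matches the user's task
      // text — the user named this specific destructive operation.
      return pattern.test(userTaskText) ? "allow" : "confirm";
    }
  }
  return "allow"; // non-destructive label
}
```

Note that a label matching one destructive pattern and a task matching a different one (e.g. "purchase" vs "Buy now") still confirms: the bypass requires the same pattern on both sides.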

Plus: cred fix cherry-picked

claude/openclaw-cred-fix (already approved, not yet merged to main) is included so this branch tests cleanly with OpenClaw-stored API keys. Will collapse into one merge if it lands first.

Validation — end-to-end with real task

Submitted the user's exact original failing task (open outlook from desktop and hit send, should already be opened up) against:

Run 1: Kimi text + Kimi vision — Phase 1 observation task succeeded. Phase 2 click failed because Kimi vision hallucinated coords (clicked sidebar instead of Send). Not a clawdcursor bug — known weakness of Kimi vision for spatial perception.

Run 2: Anthropic Haiku text + Sonnet vision — Full success:

turn 1 vision (Sonnet): invoke_element(name:"Send", controlType:"Button")
                        safety.intent_match.bypass — pattern \bsend\b matches user task
                        ✓ Invoked "Send" via a11y. (864ms)
turn 2 vision (Sonnet): screenshot()  ✓
turn 3 vision (Sonnet): done(evidence:"Email sent successfully...")
pipeline.done success:true cost:$0.027 duration:114s

Email actually sent. Compose window closed. Browser navigated away.

Test plan

  • npm run typecheck clean
  • npm run lint 0 errors
  • npm run test:ci 30 files / 434 passing
  • Runtime sanity-check: 5 prose-tool-call formats parse correctly
  • Runtime sanity-check: applyModelQuirks("kimi-k2.5", {temperature:0}) rewrites to {temperature:1}
  • Live end-to-end: Outlook Send task succeeded with Anthropic vision
  • Live end-to-end: safety.intent_match.bypass fired on the right action
  • Live end-to-end: agent.cannot_read.suppressed fired during blind/hybrid struggles

🤖 Generated with Claude Code

AmrDab and others added 4 commits May 6, 2026 19:02
A real production failure log gave us four orthogonal bugs in the agent
pipeline. All four are fixed here as PATTERN-BASED, not model-specific —
adding new models is one-line for each.

## Bug 1 — Prose tool-call parser misread the inner JSON

Kimi's `moonshot-v1-32k` emits tool calls as prose:
    functions.open_app:0$\n{ "name": "Outlook" }

The old parser ignored the `functions.open_app:0$` prefix and parsed the
JSON body — extracting `obj.name` ("Outlook") as the supposed tool name.
Result: every Kimi turn warned `agent.unknown_tool tool="Outlook"` and
the agent burned 20 turns before strategy escalation.

Fix: `tryParseProseToolCall` in `src/llm-client.ts` now recognises four
families of prose tool-call emissions:
  - `functions.<NAME>:<id>$\n{...}`        (Kimi, some Qwen/DeepSeek)
  - `<function=NAME>{...}</function>`      (Llama, some Mistral)
  - `<|tool_call|>NAME\n{...}`             (some chat-formatted models)
  - JSON-only with explicit `tool|action|name` + `args|input|parameters`

The legacy lenient path (bare `{name:"X"}`) now requires a peer `args`
object so a parameter dictionary can never be misread as a tool call.
Verified at runtime against the exact failing input from the log.

## Bug 2 — Per-model param quirks

Kimi `kimi-k2.5` rejects any `temperature` other than 1 with HTTP 400
("invalid temperature: only 1 is allowed for this model"). The vision
fallback died on its first call.

Fix: new `MODEL_QUIRKS` table + `applyModelQuirks()` helper in
`src/llm-client.ts`. Pattern-matches model id substrings and rewrites
incompatible request params before send. Initial entries:
  - `kimi-k2`      → temperature → 1
  - `o1`, `o3`     → max_tokens → max_completion_tokens, temperature → 1
  - `gpt-5`        → temperature → 1

Wired into all four request-building sites (Anthropic text, Anthropic
prefill, Anthropic tool_use, OpenAI tool_calls). Adding a new model is
one row.

## Bug 3 — `cannot_read` after a successful element resolution

In hybrid mode the agent located the Send button cleanly:
    → wait_for_element(name="Send")    ✓ Found Send [Button] @199,243

Then instead of clicking it, called:
    → cannot_read("Send button is visible but I need to confirm…")

This stalls the loop (cannot_read escalates strategies, but vision then
errored out from Bug 2). Likely a safety-trained model bailing on an
irreversible action. Same problem will hit any model on any "destructive"
click target.

Fix: agent loop now refuses cannot_read calls when ANY perception or
locator tool succeeded in the previous 4 turns. The model gets a
structured rejection message telling it to act on what it already
located. Pattern-based (a hard-coded list of resolver tool names),
not model-specific. Prompt in `src/pipeline/agent/prompt.ts` also
tightened to make the rule explicit.

## Bug 4 — DPI coord double-scaling on Windows ≥125% scaling

Two paths returned PHYSICAL pixel coordinates and passed them straight
to `mouseClick`, which on Windows nut-js uses LOGICAL pixels. A click
intended for logical (900, 450) on a 2x DPI display landed at logical
(1800, 900) — far off-target.

Affected:
  - `src/tools/smart.ts` smart_click OCR path — OCR returns physical
    pixels from `screen.grab()`; now divides by `dpiRatio` before
    mouseClick.
  - `src/computer-use.ts` `scale()` — vision LLM returns image-space
    coords; `scaleFactor * coord` produced physical, but mouse expects
    logical. Now divides by `dpiRatio` after scaling.

On standard-DPI displays `dpiRatio === 1` so the fix is a no-op —
zero regression risk on any config that worked before.

## Validation

- typecheck clean
- typecheck:tests clean
- lint 0 errors, 64 warnings (unchanged from main baseline)
- 30 test files, 434 passing, 1 skipped
- Runtime sanity check on the dist build:
  - Kimi prose `functions.open_app:0$\n{"name":"Outlook"}` → parses to
    `{name:"open_app", args:{name:"Outlook"}}` ✅
  - applyModelQuirks("kimi-k2.5", {temperature:0}) → temperature:1 ✅

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…h failure

Two coupled bugs surfaced when a user reinstalled v0.8.8 fresh: their Kimi
key worked in `clawdcursor doctor` but `clawdcursor start` immediately
reported "API key INVALID for Kimi", deleted the saved config, and then
crashed with a libuv UV_HANDLE_CLOSING assertion on Windows.

## Bug 1 — wrong provider's key sent to Kimi endpoint

`resolveApiConfig({ provider })` only consulted PROVIDER_ENV_VARS for the
requested provider. When the user's keys lived in OpenClaw's auth-profiles
(not as env vars), this returned an empty string. The downstream chain in
`loadPipelineConfig` then fell back to the generic `resolveApiConfig()`
result — which picks the "best" overall provider — and that turned out to
be Anthropic (whose key was actually invalid). Result: Kimi pipeline
config + Anthropic key sent to Kimi's endpoint = 401.

Doctor's `scanProviders()` reads the SAME auth-profiles and was correct;
the two paths just disagreed.

Fix: extract the auth-profile reader into `src/external-creds.ts` as a
shared, cached helper, and have `resolveApiConfig({ provider })` consult
it AFTER env vars (env still wins) and BEFORE the AI_API_KEY fallback.
Both code paths now agree on which key belongs to which provider.

The scanner's inline reader is left intact for now — it also picks up
base URLs from openclaw.json which the new helper doesn't need. Could be
unified in a follow-up.

## Bug 2 — libuv assertion on synchronous process.exit during teardown

The auth-failure handler called `agent.disconnect()` immediately followed
by `process.exit(1)`. With pending async handles (Express server, child
processes, fetch timers) mid-close, libuv asserts and crashes the process.
On Windows this surfaces as `Assertion failed: !(handle->flags &
UV_HANDLE_CLOSING), file src\win\async.c, line 76`.

Fix: new `gracefulExitOnInitFailure()` helper sets `process.exitCode`,
kicks off cleanup, and arms a 2-second hard-kill safety net via
`setTimeout(...).unref()`. The event loop drains naturally; the timer
itself doesn't keep the loop alive. Replaces three identical
`releasePidFile + agent.disconnect + process.exit(1)` blocks in the
start-action error paths.
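The drain-then-exit pattern is small enough to sketch (the cleanup callback is an assumption; the real helper also releases the PID file and disconnects the agent):

```typescript
// Set the exit code, start teardown, and let the event loop drain on its
// own instead of calling process.exit() synchronously — which is what
// triggered the libuv UV_HANDLE_CLOSING assertion on Windows.
function gracefulExitOnInitFailure(cleanup: () => Promise<void>, code = 1): void {
  process.exitCode = code;          // recorded now, applied when the loop drains
  void cleanup().catch(() => {});   // teardown errors must not block exit
  // Hard-kill safety net: if handles are still open after 2s, force-exit.
  // unref() keeps this timer itself from holding the event loop open.
  setTimeout(() => process.exit(code), 2000).unref();
}
```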

Tests: 30 files, 434 passing. typecheck clean. lint 0 errors,
64 warnings (unchanged from main baseline).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…mat)

In-test discovery — Kimi's `kimi-k2.5` vision model emits tool calls in
Python-call syntax that wasn't covered by the initial four-family parser:

    done(evidence: "Screenshot shows Outlook draft email")
    give_up(reason: "missing credentials")
    mouse_click(x: 100, y: 200)
    wait(seconds=2.5)

Both `:` and `=` are accepted as kwarg separators. Single-quoted strings
are converted to double-quoted before JSON.parse (Kimi sometimes mixes).
Balanced-paren walking handles values like `"text (with parens)"` so the
arg body is extracted correctly even with nested punctuation.
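A simplified sketch of the Python-call branch (illustrative only — it splits on top-level commas with a depth counter but skips the full balanced-paren walk and string-escape handling of the real parser):

```typescript
function parsePythonCall(
  text: string,
): { name: string; args: Record<string, unknown> } | null {
  const m = text.trim().match(/^([\w-]+)\((.*)\)$/s);
  if (!m) return null;
  const [, name, body] = m;
  const args: Record<string, unknown> = {};
  for (const kv of body.length ? splitTopLevel(body) : []) {
    // Accept both ':' and '=' as kwarg separators.
    const kvm = kv.match(/^\s*(\w+)\s*[:=]\s*([\s\S]+?)\s*$/);
    if (!kvm) return null;
    // Convert single-quoted strings to double-quoted before JSON.parse.
    const raw = kvm[2].replace(/^'([\s\S]*)'$/, '"$1"');
    try { args[kvm[1]] = JSON.parse(raw); } catch { args[kvm[1]] = raw; }
  }
  return { name, args };
}

// Split on commas that are outside quotes and outside nested parens, so
// values like "text (with parens)" or "a, b" stay intact.
function splitTopLevel(body: string): string[] {
  const parts: string[] = [];
  let depth = 0, inStr = false, cur = "";
  for (const ch of body) {
    if (ch === '"') inStr = !inStr;
    if (!inStr) {
      if (ch === "(") depth++;
      if (ch === ")") depth--;
      if (ch === "," && depth === 0) { parts.push(cur); cur = ""; continue; }
    }
    cur += ch;
  }
  parts.push(cur);
  return parts;
}
```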

Verified at runtime against five formats (the four prior + this new one);
all parse to the expected {name, args} shape. The agent now reaches a
clean done() terminal state on the same task that previously looped 9+
turns calling done(...) into the void.

Same-task re-test went from 32s+ failure to 15s success. No new tests
required — this is a pure additive parser branch.

Real-world scenario discovered during end-to-end testing:

User submits `clawdcursor task "open outlook from desktop and hit send"`.
Vision agent (Sonnet) correctly chose `invoke_element(name:"Send")` —
the right tool, the right target. Safety layer matched "Send" against
CONFIRM_LABEL_PATTERNS' /\bsend\b/i and returned `confirm` instead of
`allow`. Agent loop has no in-loop confirm dialog mechanism, so the
agent correctly called `give_up("needs confirm: Send button requires
user confirmation")`. Pipeline ended with success:false.

This was Audit Bug #4 (Suspect 1) confirmed live — the safety layer's
confirm-tier blocks every Send/Delete/Submit/Pay click without giving
the agent a way to proceed. Originally meant to prevent hallucinated
destructive clicks, but the same gate fires on EXPLICITLY user-requested
destructive actions.

## The fix — intent-matched bypass

When the user's task text contains a word that matches the same
CONFIRM_LABEL_PATTERN as the target label, the user has provided
explicit consent for THIS SPECIFIC destructive action. Examples:

  task="hit send" + target="Send"           → bypass (both match \bsend\b)
  task="delete the row" + target="Delete"   → bypass (both match \bdelete\b)
  task="open my inbox"  + target="Send"     → confirm (no intent match)
  task="purchase"       + target="Buy now"  → confirm (different patterns)

This is strictly safer than removing the confirm gate. A model that
hallucinates a Send click in a context where the user didn't ask for
it still gets blocked. Only when the user's intent text and the action
both name the same destructive operation does the bypass apply.

## Wiring

- `EvalContext` adds optional `userTaskText` field
- `evaluate()` checks pattern.test(userTaskText) before falling through
  to confirm; logs `safety.intent_match.bypass` for audit
- `agent.ts` passes `input.task` through to every safetyEvaluate call

## Validation

End-to-end test with Anthropic Claude Sonnet 4.5 vision:
  user: "open outlook from desktop and hit send, should already be opened up"
  agent: invoke_element(name:"Send", controlType:"Button")
  safety: intent_match.bypass tool="invoke_element" pattern="\bsend\b"
  safety: decision:"allow"
  result: ✓ Invoked "Send" via a11y. Email sent. Compose window closed.
  pipeline: success:true cost:$0.027 duration:114s

Same task previously: success:false, agent.tool.blocked, give_up.

Pattern-matched, model-agnostic, app-agnostic. Adds zero new attack
surface — only relaxes the gate for explicitly-consented actions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
AmrDab added a commit that referenced this pull request May 7, 2026
Closes the long-standing pattern where the agent's open_app tool reports
"Launched X (no window surfaced yet)" and the next wait_for_element times
out — even though the app is installed and launchable from a terminal.
The full smoke (Calculator / Edge / File Explorer) goes from 200+s with
give_up to <40s with done() per app.

Three layered root causes, three small fixes:

1. SINGLE-SHOT POLL → DIFF-AND-POLL
   `findLaunchedWindow` previously did one fixed 800ms / 1200ms settle
   followed by one `listWindows` scan. UWP cold-starts (Calculator,
   Notepad-Win11, Settings, etc.) take 2–6s to surface a window in the
   a11y tree on a fresh boot — they never matched. macOS first-launch
   apps (Xcode, Photoshop) hit the same wall.

   New `src/v2/platform/launch-poll.ts` exports
   `waitForLaunchedWindow(before, listFn, predicate, opts)` which polls
   every 300ms for up to 8s, prefers brand-new windows (diffed against
   the before-snapshot), accepts a spawn-PID hint, and falls back to a
   "best existing match" at the deadline so macOS `open -a` activate
   semantics don't lose the result. Pure async; no platform specifics.
   Survives transient `listWindows` exceptions. 18 unit tests.

   All three adapters (Windows, macOS, Linux) now use the helper.
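A condensed sketch of the diff-and-poll contract (the real helper also takes a spawn-PID hint and richer options; window shape is assumed):

```typescript
interface Win { id: number; title: string; }

// Poll listFn until a matching window appears, preferring windows that
// are NEW relative to the before-snapshot; at the deadline, fall back to
// the best existing match (macOS `open -a` activate semantics).
async function waitForLaunchedWindow(
  before: Win[],
  listFn: () => Promise<Win[]>,
  predicate: (w: Win) => boolean,
  opts: { timeoutMs?: number; intervalMs?: number } = {},
): Promise<Win | null> {
  const { timeoutMs = 8000, intervalMs = 300 } = opts;
  const beforeIds = new Set(before.map((w) => w.id));
  const deadline = Date.now() + timeoutMs;
  let bestExisting: Win | null = null;
  while (Date.now() < deadline) {
    let windows: Win[] = [];
    try { windows = await listFn(); } catch { /* survive transient failures */ }
    const matches = windows.filter(predicate);
    const fresh = matches.find((w) => !beforeIds.has(w.id));
    if (fresh) return fresh;   // brand-new window wins immediately
    bestExisting = matches[0] ?? bestExisting;
    await new Promise((r) => setTimeout(r, intervalMs));
  }
  return bestExisting;
}
```

Because the loop re-snapshots on every tick, a UWP app that takes 2–6s to surface in the a11y tree is found on the first tick after it appears instead of being missed by a single fixed-delay scan.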

2. ALIAS RESOLUTION HAPPENS AT THE PLATFORM LAYER
   `open_app("Calculator")` was going straight through to
   `Start-Process -FilePath "Calculator"`, which silently fails for
   UWP apps. The router knew the right launch path (UWP AppsFolder via
   `explorer.exe shell:AppsFolder\<id>`) but the agent's tool didn't.

   `WindowsAdapter.openApp`, `MacAdapter.openApp`, and `LinuxAdapter
   .openApp` now resolve the user-supplied name through the existing
   `APP_ALIASES` table before calling `launchApp`. The alias table is
   pure data — adding apps doesn't touch the platform code, so this
   stays app-agnostic. Per-OS, the right field is forwarded:
     • Windows: `uwpAppId` (UWP route) + `executable` (Start-Process)
     • macOS:   `macOSAppName` (open -a)
     • Linux:   `executable` (with `.exe` stripped)

3. NATIVE-SEARCH FALLBACK INSIDE launchApp
   When the primary path doesn't surface a window within 4s, launchApp
   now falls back to the OS's native launcher — same pattern the
   router's zero-LLM fast path proved. Ports the keyboard sequence into
   the platform adapter so every caller (agent's open_app, MCP
   `mcp__clawdcursor__window` open_app, REST /execute) gets the
   reliability without duplicating router logic.
     • Windows: Win key → type → Return (Start Menu search). Resolves
       Edge / VS Code / any Start-Menu-indexed app that Start-Process
       can't find by name.
     • macOS: Cmd+Space → type → Return (Spotlight). Same UX fallback
       the router already uses.
     • Linux: existing direct-spawn / xdg-open chain (no universal
       launcher pattern across DEs).

   Keyboard primitives go through the adapter directly, NOT the safety
   layer. The safety layer's `cmd+space` / `win+r` blocks are for
   agent actions, not internal platform plumbing — `launchApp` is
   fulfilling its own contract.

ALSO IN THIS PATCH:
  • `buildAppPredicate` strips trailing `.exe` / `.com` / `.app` so
    `launchName="msedge.exe"` matches `processName="msedge"` and
    `launchName="Calculator.app"` matches `processName="Calculator"`.
    Reverse-contains gated by a 3-char minimum on `processName` so
    short proc names ("ai", "ps") don't false-positive.
  • `findExistingAppWindowIn(windowsBefore, ...)` extracted as the
    in-memory variant of the idempotency helper so launchApp can
    reuse the snapshot it captured for diff-and-poll instead of
    round-tripping the PS bridge twice.
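The `buildAppPredicate` matching rules from the first bullet can be sketched as (simplified to process-name matching; the real predicate presumably also checks titles):

```typescript
// Strip trailing .exe / .com / .app so "msedge.exe" matches "msedge"
// and "Calculator.app" matches "Calculator".
function normalizeLaunchName(name: string): string {
  return name.toLowerCase().replace(/\.(exe|com|app)$/i, "");
}

function buildAppPredicate(launchName: string) {
  const target = normalizeLaunchName(launchName);
  return (processName: string): boolean => {
    const proc = normalizeLaunchName(processName);
    if (proc.includes(target)) return true;
    // Reverse-contains only for proc names of 3+ chars, so short names
    // like "ai" or "ps" don't false-positive against longer launch names.
    return proc.length >= 3 && target.includes(proc);
  };
}
```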

NOT CHANGED:
  • Tool signatures: `openApp`, `launchApp`, the agent's `open_app`
    tool, and MCP `mcp__clawdcursor__window` action `open_app` keep
    their schemas. Return shape unchanged. The MCP schema snapshot
    test still passes.
  • The blocked-keys list. PR #82's intent-match bypass is unchanged.
    Cmd+Space / Win+R remain blocked for agent-emitted keys.

Validation:
  • typecheck clean, lint 0 errors (64 pre-existing warnings unchanged).
  • 452/453 tests passing (1 pre-existing skip), with +18 new tests
    covering buildAppPredicate variants and waitForLaunchedWindow
    behavior under fast / slow / spawn-pid / deadline-fallback /
    minimized-window / transient-exception / default-budget paths.
  • Live smoke on Windows 11 with Anthropic Claude Haiku 4.5 as the
    text agent (the same config that produced the original failures):
        Calculator: 28s, 2 turns, done() ← was 208s, 48 turns, give_up
        Edge:       35s, 2 turns, done() ← was 133s, never reached idle
        File Exp.:  69s, 3 turns, done() ← was 147s, full ladder maxed

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
AmrDab added a commit that referenced this pull request May 7, 2026
…ged evidence; v0.8.11

Closes the "agent operates completely blindly, calls done() with hedged
evidence, pipeline reports success" pattern observed live with Kimi:
the agent typed into a stale screen for 7 turns, every turn warned
"⚠ stagnation — last 3 screens unchanged", and on turn 9 called
`done(evidence: "The email should have been sent...")` — note "should",
a clear hallucination — and the pipeline returned success.

TWO FIXES, EACH AT ROOT.

1. STAGNATION → ESCALATION (`src/pipeline/agent/agent.ts`)

   Before: `STAGNATION_WINDOW = 3` triggered a one-line warning that
   was appended to the next turn's prompt. The agent kept looping
   through its full max_turns budget. The pipeline ladder
   (blind → hybrid → vision) had a `'stagnation'` exit type already
   wired through `failureReason`, but no code path actually returned
   it — so the ladder never climbed on stagnation.

   After: track `consecutiveStagnantTurns`; reset to 0 every turn the
   fingerprint moves; when it crosses `STAGNATION_HARD_LIMIT = 5`,
   exit the rung with `exit: 'stagnation'`. The pipeline ladder now
   actually receives the signal and climbs to the next strategy.

   Two-stage:
     - turns 3–4 stagnant → warn "try a different approach"
     - turn 5 stagnant   → abort the rung, escalate

   The 3-turn warning still fires (legitimate stagnant patches —
   slow window cold-start, transient a11y blip — usually clear within
   one or two turns). The hard limit only catches genuinely-stuck
   agents.
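The two-stage policy per turn is small enough to sketch (constants from this commit; fingerprinting and the ladder plumbing are simplified away):

```typescript
const STAGNATION_WINDOW = 3;     // warn threshold
const STAGNATION_HARD_LIMIT = 5; // abort the rung, escalate

type StagnationAction = "continue" | "warn" | "escalate";

// Called once per turn: reset the counter whenever the screen fingerprint
// moves; warn at 3-4 consecutive stagnant turns; escalate at 5.
function stagnationStep(
  consecutiveStagnantTurns: number,
  fingerprintMoved: boolean,
): { count: number; action: StagnationAction } {
  if (fingerprintMoved) return { count: 0, action: "continue" };
  const count = consecutiveStagnantTurns + 1;
  if (count >= STAGNATION_HARD_LIMIT) return { count, action: "escalate" };
  if (count >= STAGNATION_WINDOW) return { count, action: "warn" };
  return { count, action: "continue" };
}
```

The warn band (turns 3-4) gives transient stalls — slow cold-starts, a11y blips — a chance to clear before the rung is aborted.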

2. DONE() EVIDENCE GUARD (`src/pipeline/agent/tools.ts`)

   Before: `done(evidence: <string>)` accepted ANY non-empty evidence
   string, including obvious hallucinations like
   "should have been sent". The verifier ground-truth module
   (`src/v2/verifier/ground-truth.ts`) exists but is wired into the
   v2 orchestrator, not the unified pipeline that drives blind /
   hybrid / vision. Plugging the verifier into the unified pipeline
   is a much bigger change; this PR adds the cheaper, narrower fix
   that addresses 80% of the symptom in 20 lines.

   After: `done`'s `execute` now runs two cheap guards before
   accepting the terminal exit:

     a) Length: empty / "ok" / "done" / whitespace-only is rejected.
        Forces the agent to write SOMETHING the pipeline / human
        reviewer can use.

     b) Hedging-language detection: a narrow regex matches the
        unambiguous "I'm guessing" phrasings — "should have", "might
        be", "may have", "could have", "probably", "I think",
        "I believe", "I assume", "appears to", "seems to",
        "presumably", "if successful", "if it worked". When matched,
        the tool returns `success: false` with an instruction to take
        a screenshot or call read_screen first, then call done with
        the literal observation. The agent's NEXT turn sees the
        rejection and re-tries.

   The pattern is intentionally narrow — word-boundary anchored, so
   "shoulder" doesn't match "should", "mighty" doesn't match "might",
   "appearance" doesn't match "appears to". 18 unit tests pin this
   down in `src/__tests__/done-evidence-guard.test.ts`.
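The guard's shape (phrase list taken from this commit message; the real `HEDGING_PATTERN` and guard wiring may differ in detail):

```typescript
// Word-boundary anchored so "shoulder" doesn't match "should have",
// "mighty" doesn't match "might be", "appearance" doesn't match "appears to".
const HEDGING_PATTERN = new RegExp(
  [
    "should have", "might be", "might have", "may have", "could have",
    "probably", "I think", "I believe", "I assume", "appears to",
    "seems to", "presumably", "if successful", "if it worked",
  ].map((p) => `\\b${p}\\b`).join("|"),
  "i",
);

// Returns true when done()'s evidence should be rejected: either too thin
// to verify, or phrased as a guess rather than an observation.
function isHedgedEvidence(evidence: string): boolean {
  const t = evidence.trim();
  if (!t || /^(ok|done)$/i.test(t)) return true;
  return HEDGING_PATTERN.test(t);
}
```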

3. PROMPT UPDATE (`src/pipeline/agent/prompt.ts`)

   The `done()` line in the system prompt now spells out the rule —
   "Never use 'should have', 'might have', 'probably', 'I think',
   'appears to', 'if successful'. Those mean you are guessing." —
   and tells the agent that the tool will reject hedged evidence.
   Defense-in-depth alongside the runtime guard.

PROPERTIES

  • Model-agnostic. The hedging regex runs in clawdcursor's tool
    layer, after whatever the LLM emitted. Works identically for
    Claude / Kimi / GPT / Gemini / any tool-calling model.
  • OS-agnostic. Pure logic in the agent / tool layer; no platform
    code touched.
  • App-agnostic. No allowlist of specific apps or task types.
  • MCP-safe. `done` is not exposed via MCP — it's a unified-agent
    internal terminal action. Tool signatures unchanged. The MCP
    schema snapshot test still passes.
  • Non-breaking for legitimate uses. Any concrete-observation
    evidence ("Window title shows X", "Calculator displays 391",
    "Compose closed, Sent folder selected, latest message visible
    at top") still passes. Only the "I'm guessing" phrasings are
    rejected.

VERSION

Bumps `package.json` 0.8.10 → 0.8.11.

VALIDATION

  • typecheck clean
  • lint 0 errors (no new warnings)
  • 487/488 tests passing (1 pre-existing skip), with +18 new
    tests covering the done() evidence guard:
        accepts: window-title, on-screen-text, focused-element,
                 multi-signal commas
        rejects: should-have-been, should-be, might-have, may-have,
                 probably, I-think, I-believe, I-assume, appears-to,
                 seems-to, if-successful, empty, whitespace, "ok"
        no false positives on: "shoulder", "mighty", "appearance",
                              "showing", "displayed"
  • Live smoke (Anthropic Haiku 4.5 — same baseline as the user's
    earlier Kimi failure, just with native tool_use):
        Task: "Open Outlook and start composing... STOP at the To
              field — DO NOT SEND. Report what the To field shows."
        Trace:
            turn 3 → consecutiveStagnantTurns: 1
            turn 4 → consecutiveStagnantTurns: 2
            turn 5 → consecutiveStagnantTurns: 3   ← counter live
            agent escalated: blind → hybrid → vision
            no fabricated done() — agent gave up cleanly when stuck
            runaway-guard ALSO fired (PR #82's prior fix, intact)

  Net behavior: when the agent can't observe, it now ESCALATES
  through the ladder instead of running out the clock and lying.
  When it does call done(), the evidence has to be observable.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
AmrDab added a commit that referenced this pull request May 8, 2026
…-reliability

Merges `claude/model-robustness-fix` into `claude/launch-reliability-fix`
so the full set of fixes is on one shippable branch:

  PR #82 brings (model-robustness):
    • parser: 5 prose tool-call families incl. Kimi `functions.X:N`
      and Python-call style (closes the agent.no_tool_call storm seen
      with Kimi text + vision)
    • MODEL_QUIRKS table (kimi-k2 vision temperature must be 1)
    • cannot_read guard (no escalation if last 4 turns include a
      successful resolver tool call)
    • DPI logical-pixel translation
    • intent-matched safety bypass for destructive verbs the user
      actually requested
    • per-provider OpenClaw cred lookup + graceful start exit on
      auth failure

  This branch already had (launch-reliability):
    • diff-and-poll launch helper + alias-aware platform.openApp
    • Start-Menu / Spotlight fallback w/ alias.searchTerm threading
    • normalizeAppName ("the outlook app" → "outlook")
    • stagnation hard-limit → exit:'stagnation' → ladder climbs
    • done() rejects hedged evidence ("should have been sent")
    • GroundTruthVerifier wired into the unified pipeline at
      `runOneSubtask` — every successful agent rung post-checked
      against actual screen state

VALIDATION

  • Auto-merge clean (no conflicts; ort strategy)
  • typecheck clean
  • lint 0 errors
  • 492/493 tests passing (1 pre-existing skip)
  • Live smoke with Kimi (moonshot-v1-32k text + kimi-k2.5 vision —
    the same config that previously fabricated done() and dropped
    `agent.no_tool_call` warnings):
        Task: "Open the Calculator app and call done once a
              Calculator window is on screen."
        Result: 14s, confidence=1, zero parser warnings, zero
                temperature errors, verifier confirmed externally.

  All 6 fixes confirmed in the compiled bundle:
    - dist/llm-client.js                      → MODEL_QUIRKS, tryParseProseToolCall
    - dist/pipeline/index.js                  → pipeline.verifier
    - dist/pipeline/agent/agent.js            → STAGNATION_HARD_LIMIT
    - dist/pipeline/agent/tools.js            → HEDGING_PATTERN
    - dist/pipeline/router/normalize.js       → normalizeAppName
    - dist/v2/platform/launch-poll.js         → waitForLaunchedWindow

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>