Skip to content

AmrDab/clawdcursor

Repository files navigation

Clawd Cursor

Clawd Cursor

The skill that gives any AI agent eyes, hands, and a keyboard on a real desktop.
Install it. Your agent uses it. Windows, macOS, Linux — any tool-calling model.

GitHub stars Latest release MIT license Discord Website

Website · Discord · Install · Connect · Tools · Changelog


What This Is

Clawd Cursor is a skill, not an application. It gives an AI agent the ability to use the user's computer — mouse, keyboard, screen, windows, browser — the same way a human would.

You install it once. Any tool-calling agent on the machine — Claude Code, Cursor, Windsurf, OpenClaw, the Claude Agent SDK, or a bring-your-own-model setup — picks it up through MCP or the skill registry once configured. The agent then knows how to click, type, read the screen, open apps, and drive GUIs whenever the task requires it.

User: "Open Outlook and reply to the latest email from Sarah."

Agent  →  window({"action":"open_app","name":"Outlook"})
       →  accessibility({"action":"read_tree"})
       →  accessibility({"action":"invoke","name":"Sarah's email"})
       →  computer({"action":"key","combo":"mod+r"})
       →  computer({"action":"type","text":"..."})
       →  accessibility({"action":"invoke","name":"Send"})
       → done (verified by ground-truth verifier)

No app-specific integrations. No per-service API keys. No cloud round-trip — everything runs locally on 127.0.0.1. If it renders on screen, the agent can read it and act on it.

Design principles. Model-agnostic (Claude, GPT, Gemini, local models via Ollama). OS-agnostic (a single PlatformAdapter handles Windows, macOS, and Linux behind one interface). Skill-first (the AI is the primary consumer; the CLI exists for testing).


Latest Release

v0.8.8 — reliability + correctness release. The mod modifier in compact computer({"action":"key","combo":"mod+s"}) now resolves correctly across platforms (Cmd on macOS, Ctrl on Win/Linux) instead of silently dropping or throwing. Compact accessibility({"action":"set_value", ...}) now actually works — previously the delegate target wasn't registered. smart_click OCR now prefers matches inside the focused window so it can't silently click into a background app. invoke-element.ps1 adds a 2s timeout so React/Electron buttons that advertise InvokePattern but block on invoke can no longer hang the script. Plus a routine round of dependency hygiene (express v5, commander v14, dotenv v17, sharp 0.34) and lint cleanup.

The substantive work landed earlier in the v0.8.x line:

  • v0.8.7 — security hardening. Direct tool calls now route through a shared safety gate; accessibility/window/clipboard reads consolidate onto PlatformAdapter; the version string is single-sourced from package.json with a CI guard against drift. Tooling: TypeScript 6.0, ESLint 10, Playwright 1.59.
  • v0.8.6 — polish release. Fixes a stale McpServer version string that had been advertising v0.7.2 in MCP client metadata since the v0.7.x line; adds SECURITY.md and a private vulnerability reporting channel; trims the homepage; prunes stale repo artifacts.
  • v0.8.5computer({"action":"key","combo":"..."}) now actually works (compact-tool keyboard remap was missing); 16 documentation accuracy fixes; cost-tier ladder added to SKILL.md.
  • v0.8.4 — security maintenance: patches every fixable CVE in the dependency tree (vite, path-to-regexp, picomatch, hono, follow-redirects); README rewritten to frame clawdcursor as a skill rather than a standalone server.
  • v0.8.3 — idempotent open_app (no more N copies of Outlook stacking up under retry), agent runaway guard, clawdcursor stop sweeps every mode.
  • v0.8.2 — silent-401 auth bug fixed, force-focus on Windows through the foreground lock, Electron/WebView2 detection + CDP relaunch hint.
  • v0.8.1 (rolled into 0.8.2) — unified blind/hybrid/vision pipeline (one loop, three strategy modes), compact MCP surface (6 tools, ~12× smaller catalog), Linux AT-SPI bridge, Wayland input routing.

Full per-release detail in CHANGELOG.md.


Install the Skill

Windows

powershell -c "irm https://clawdcursor.com/install.ps1 | iex"

macOS

curl -fsSL https://clawdcursor.com/install.sh | bash
clawdcursor grant     # Accessibility + Screen Recording

Linux

curl -fsSL https://clawdcursor.com/install.sh | bash

The installer clones the skill into ~/clawdcursor, runs npm install, builds, and registers a global clawdcursor shim via npm link. Runtime state (auth token, pidfiles, logs) lives at ~/.clawdcursor/. To wire the skill into an agent host, follow Connect Your Agent below — the installer does not edit any host config files automatically.

Linux notes: install tesseract-ocr for OCR, python3-gi + gir1.2-atspi-2.0 for accessibility (the AT-SPI typelib python3-gi consumes), and ydotool (or wtype) for Wayland input.


Connect Your Agent

The skill is transport-agnostic. Every agent below exposes the same tool catalog.

Claude Code

Add the MCP entry to ~/.claude/settings.json (the installer leaves agent host config untouched, so this step is required):

// ~/.claude/settings.json
{
  "mcpServers": {
    "clawdcursor": {
      "command": "clawdcursor",
      "args": ["mcp", "--compact"]
    }
  }
}

OpenClaw

openclaw skill install clawdcursor

The skill metadata in SKILL.md tells OpenClaw how to install, bootstrap, and discover the tool catalog. No further configuration needed.

Cursor, Windsurf, Zed

Any MCP-aware editor. Add a stdio MCP entry pointing to clawdcursor mcp --compact. Refer to the host's MCP configuration docs.

Claude Agent SDK / bring-your-own-model

The skill also exposes a local REST surface for agents that do not speak MCP. Start the skill server once, then discover tools at GET http://127.0.0.1:3847/tools?mode=compact and call them at POST /execute/:name. Bearer-token auth; token written to ~/.clawdcursor/token. See API below.


Tool Surface

The skill exposes two catalogs side by side. Agents pick the one that fits.

Compact — 6 compound tools (recommended)

Anthropic computer_20250124-style: one tool per capability, with an action enum for the verb. Small prompt footprint (~1,500 tokens), easy for a model to learn zero-shot, the default for most agents.

Most-used actions per compound below. The full enum is at GET /tools?mode=compact or via MCP list_tools.

Tool Most-used actions
computer screenshot, click, double_click, right_click, triple_click, hover, scroll, scroll_horizontal, drag, drag_path, type, key, wait
accessibility read_tree, find, get_element, focused, invoke, focus, set_value, get_value, expand, collapse, toggle, select, state, list_children, wait_for
window list, active, focus, maximize, minimize, restore, close, resize, list_displays, screen_size, open_app, open_file, open_url, switch_tab, navigate
system clipboard_read, clipboard_write, system_time, ocr, undo, shortcuts_list, shortcuts_run, delegate, detect_webview, relaunch_with_cdp
browser connect, page_context, read_text, click, type, select_option, evaluate, wait_for, list_tabs, switch_tab, scroll
task (no action enum — takes {instruction: string} and routes through the full pipeline)

Granular — 75 individual tools

Full catalog for agents that prefer one tool per verb. Sample of categories below; the full list is at GET /tools or list_tools over MCP.

Category Examples
Perception read_screen, desktop_screenshot, desktop_screenshot_region, ocr_read_screen, smart_read
Mouse mouse_click, mouse_double_click, mouse_drag, mouse_drag_stepped, mouse_scroll
Keyboard key_press, type_text, smart_type, shortcuts_list, shortcuts_execute
Window / App focus_window, open_app, get_windows, get_active_window, detect_webview_apps
Browser (CDP) cdp_connect, cdp_click, cdp_type, cdp_read_text, cdp_evaluate
Accessibility find_element, invoke_element, wait_for_element, get_focused_element, a11y_expand, a11y_toggle
System read_clipboard, write_clipboard, get_system_time, undo_last, delegate_to_agent
Orchestration smart_click, navigate_browser, wait

Full catalog visible to the agent through MCP list_tools or at GET /tools.


How the Skill Thinks

Every tool call — whether it arrives over MCP, REST, or the built-in agent — passes through the same decision layer.

         ┌────────────────────────────────────────────┐
agent ─▶ │  Router   (regex shortcuts · zero LLM)    │ ──▶ tool
         └───────────────────┬────────────────────────┘
                             │  (no shortcut match)
                             ▼
         ┌────────────────────────────────────────────┐
         │  Blind     (accessibility tree only)       │ ──▶ tool
         └───────────────────┬────────────────────────┘
                             │  (a11y sparse, stagnation)
                             ▼
         ┌────────────────────────────────────────────┐
         │  Hybrid    (a11y + screenshot-on-demand)   │ ──▶ tool
         └───────────────────┬────────────────────────┘
                             │  (still stuck)
                             ▼
         ┌────────────────────────────────────────────┐
         │  Vision    (screenshot every turn)         │ ──▶ tool
         └────────────────────────────────────────────┘

Every tool call routes through a single safety.evaluate() chokepoint. The agent cannot bypass this path — it is the only way tools execute.

Ground-truth verification. When a task is claimed complete, six independent signals are checked against the post-task screen: pixel diff, window-state change, focus change, OCR delta, task-type assertions (send_email, navigate_url, open_app, type_text, …), and anti-pattern detection (error dialogs, auth failures, "cannot send", "draft saved"). Weighted voting with hard-fail rules. The agent cannot self-report its way past the verifier.

Runaway guard. If the agent calls the same tool with identical arguments three or more times in a six-turn window, the loop exits with a targeted diagnostic — typically pointing at detect_webview when the target app is Electron/WebView2 with a sparse accessibility tree.


Safety

Tools are classified into three tiers, enforced at the single safety.evaluate() chokepoint:

Tier Actions Behavior
Auto Reading, navigation, opening apps Executes immediately
Preview Typing, form fill, arbitrary input Logged before executing
Confirm Sending messages, deleting, purchases Pauses for user approval

Hardening: server binds to 127.0.0.1 only, bearer-token auth on every request, dangerous key combinations (Cmd+Q, Alt+F4, Ctrl+Alt+Del) blocked by default, first-run consent prompt required. Sensitive categories (email, banking, password managers) require explicit user approval per action.


API

For agents that do not speak MCP. Base URL: http://127.0.0.1:3847 (localhost-only, bearer-token auth, token at ~/.clawdcursor/token).

Endpoint Method Purpose
/tools GET Full catalog in OpenAI function-calling format. ?mode=compact for the 6-tool surface.
/execute/:name POST Execute a tool by name. Returns structured JSON.
/status GET Current skill state.
/screenshot GET Current screen as PNG.
/confirm POST Approve or reject a safety-gated action.
/abort POST Stop the in-flight task.
/health GET Version, uptime, and health check.

Platform Support

Platform-specific code lives in src/v2/platform/{windows,macos,linux}.ts behind a single PlatformAdapter interface. Business logic never reads process.platform.

Platform UI Automation OCR Browser
Windows x64 / ARM64 UI Automation via PowerShell bridge Windows.Media.Ocr Chrome / Edge (CDP)
macOS Intel / Apple Silicon JXA + System Events (TCC-safe) Apple Vision Chrome / Edge (CDP)
Linux X11 AT-SPI + nut-js Tesseract Chrome / Edge (CDP)
Linux Wayland AT-SPI + ydotool / wtype Tesseract Chrome / Edge (CDP)

Prerequisites

  • Node.js 20 or newer
  • macOS — Xcode CLI tools (xcode-select --install), then clawdcursor grant for Accessibility + Screen Recording
  • Linuxtesseract-ocr, python3-gi + gir1.2-atspi-2.0 (AT-SPI typelib), ydotool or wtype (Wayland)
  • AI provider key — configured on the agent side; the skill itself is model-agnostic

Testing and Troubleshooting

The CLI below is intended for humans diagnosing an install. Agents should not invoke it; they should use MCP or the REST surface.

clawdcursor doctor       Diagnose install, permissions, and platform bridges
clawdcursor grant        Grant macOS permissions (interactive)
clawdcursor consent      Manage desktop-control consent (--accept / --revoke / --status)
clawdcursor status       Check readiness (consent, permissions, AI config)
clawdcursor mcp          MCP stdio server (the primary skill transport)
clawdcursor serve        REST-only tool server (bring-your-own-agent)
clawdcursor stop         Stop every running mode (mcp, serve, start)

# The web dashboard is reachable at http://127.0.0.1:3847 while
# `clawdcursor serve` (or `start`) is running — no separate command.

# The two commands below exist for manual end-to-end testing only.
# Real agents should not use these — they should call the skill through MCP.
clawdcursor start        Run the built-in autonomous agent (testing)
clawdcursor task <t>     Send a task to that agent (testing)

Options:
  --port <port>          Default: 3847 (start, serve, stop, task)
  --compact              MCP only: expose 6 compound tools instead of 75 granular.
                         For REST/serve, use the `?mode=compact` query parameter
                         on `GET /tools` instead.
  --provider <name>      `start` only: anthropic | openai | gemini | ollama | ...
  --accept               `start` and `consent` only: skip the consent prompt.
                         For `serve`, use `--skip-consent` (dev environments).

Tech Stack

TypeScript · Node.js 20+ · nut-js · Playwright · sharp · Express · Model Context Protocol SDK · Zod · commander


License

MIT — see LICENSE.


clawdcursor.com

About

OS-agnostic, model-agnostic desktop automation server. Gives any AI agent eyes, hands, and ground-truth verification on Windows, macOS, and Linux.

Topics

Resources

License

Contributing

Security policy

Stars

Watchers

Forks

Packages

 
 
 

Contributors