Lobster optics. Inside OpenClaw.

See every OS. Click, type, scroll.

OpenClaw's screen-observing agent.

Mac · Linux · Windows · FreeBSD · Haiku · Android · iOS

OS-level vision · Coordinator or commanded · Full UI permission — anything a human can click, type, scroll, or drag, SAI does

$ sai connect --os macos
 Describe loop active — Sonoma 14.5, M1
 Lobster eye: describe → locate → act
 MCP tool layer ready
 Ready — OpenClaw task chain active

Why lobster optics?

Lobster eyes taught SAI not to ask “what API is this?” — only “where is the human looking, and what are they looking for?”

100 million years of natural engineering. Reflection beats refraction in low light. SAI inherits the geometry.

10,000 mirrors. No lens.

01 / 05

Reflection, not refraction. Geometry is the lens.

02 / 05

SAI's world has no map either. Only pixels.

03 / 05

It sees what you're looking for. That's what OpenClaw agents — and their builders — need.

04 / 05

Same principle. China's Einstein Probe scans the X-ray sky with it.

05 / 05

One eye inside OpenClaw. Every OS.

SAI is how OpenClaw agents see and act — same eye, same hands, whether SAI leads the task chain or follows another agent's command.

And it's how the team sees what those agents were looking at when something breaks. Four scenarios below — same OS-level vision loop, four jobs.

Coordinator

9:07 AM, weekday start

macOS
3 apps, 0 APIsavg 12s/task

You're still pouring coffee. Coordinator has opened Notion, summarised last night's 3 unread Slack threads into your Daily Note, and queued up the 10 AM Zoom waiting room. You sit down. Screen is ready. No API. No Zapier.

Customer Service

A 3-day-old ticket

Linux
screen-to-fix in <4 minLinux + Mac + Windows

User shares a backend screen: "settings keep failing." Agent describes the visible state, spots the wrong toggle, then drives the rep's Linux desktop step-by-step — clicks, not text. 3 min 12s. Case closed.

Trader

2:51 AM, a spike

Windows
reads 3 windows/loopno exchange API key

BTC/USDT prints an outlier candle on Binance desktop. Agent scans the K-line, RSI, and orderbook depth — three windows, vision only. Calls short-term exhaustion, fills a limit order, waits for your tap. Zero exchange API key.

Broadcaster

A clutch moment

Windows
real-time commentaryany game, no SDK

Player pulls a low-HP reversal. Agent reads HP bar, kill feed, and minimap in 0.3s, generates: "INSANE — 30% HP, no mana, double kill from base." Subtitles go live. VTuber lipsync starts.

90%+ of the world's OS surface, one eye.

Seven operating-system targets, each marked by validation status. SAI runs with the same UI permission as a logged-in human — system settings, file managers, browsers, IDEs, full-screen games, niche pro tools, timeline editors, anything in between. If you can click it, type in it, scroll it, or drag it, SAI can operate it. No allow-list. No integration manifest.

Production: real workflows, misclick recovery validated, version-pinned. Beta: architecture confirmed, stress testing incomplete. Preview: vision parsing works, action coverage partial.

macOSSonoma+LinuxUbuntu 22.04+Windows11+FreeBSD15.0HaikuR1/beta4iOS17+Android13+SAI

See. Decide. Act. Self-correct.

The Vision-to-Action pipeline. Describe input, OS output, retry on miss.

Step 1

Describe screen

Grounder observes the active OS surface and returns state.

describe: active app, controls, focus

Step 2

Locate target

Natural-language targets resolve to labels and coordinates.

locate: WiFi toggle x:924, y:612

x: 248, y: 412

Step 3

OS action

Click, type, scroll, or drag through native input.

action: scroll down, then click x:924

Step 4

Self-correct

Misclick? Re-describe, re-locate, retry.

miss at x:901 → re-locate → retry x:924

describe → locate → act. Zero image tokens in the normal agent loop.

The agent observes with describe and locate. The vision layer handles pixels; the orchestrating LLM receives prose, labels, coordinates, and state. Grounder deployment can be local, self-hosted, or configured by endpoint.

open source·Import directly into your agent stack or wire as an MCP tool layer — one vision layer, every OS.

The eye is the first inch.

Universal embodiment is the mile.

Now

SAI inside OpenClaw

The screen-observing agent. Shipping today.

Runs as an OpenClaw agent with OS-level perception and input. Full UI permission — click, type, scroll, drag, and hotkeys. Coordinator or commanded. Self-correcting on misclick.

Soon

All 7 OS to production

Bring beta + preview platforms to parity.

FreeBSD, Android, iOS, and Haiku graduated to production: stress-tested, version-pinned, misclick recovery validated. One bar across every supported OS.

Next

Vertical Agents

Customer service. Trading. Broadcasting. Editing.

Pre-built agent templates for each vertical. Bring your own model — SAI handles the OS layer.

Future

Agent OS

Operating systems built for agents, not humans.

UIs optimized for vision-first parsing. File systems exposed as agent-readable graphs. No legacy GUI overhead.

Built by

Roger

Roger

Creator & Founder

Vision architecture and OS-level agent runtime. Published open-source fine-tunes across 7B and MoE architectures — Chihiro, Monsoon, Rain, DraftReasoner — GGUF-quantized for local inference.

YMOW

YMOW

Cofounder

Product builder focused on what AI makes possible at the application layer. Created ACP — an open protocol so human-AI teams can track contribution and split revenue fairly, without a platform. Cofounder of SAI.

Everywhere.

Questions.

Be first to use lobster eyes.

We're shipping early access to teams building with OpenClaw — and developers who want to give their agents eyes on any screen. First wave: Mac, Linux, Windows.