Roger
Creator & Founder
Vision architecture and OS-level agent runtime. Published open-source fine-tunes across 7B and MoE architectures — Chihiro, Monsoon, Rain, DraftReasoner — GGUF-quantized for local inference.
See every OS. Click, type, scroll.
OpenClaw's screen-observing agent.
Mac · Linux · Windows · FreeBSD · Haiku · Android · iOS
OS-level vision · Coordinator or commanded · Full UI permission — anything a human can click, type, scroll, or drag, SAI does
$ sai connect --os macos
✓ Describe loop active — Sonoma 14.5, M1
✓ Lobster eye: describe → locate → act
✓ MCP tool layer ready
→ Ready — OpenClaw task chain activeLobster eyes taught SAI not to ask “what API is this?” — only “where is the human looking, and what are they looking for?”
100 million years of natural engineering. Reflection beats refraction in low light. SAI inherits the geometry.
10,000 mirrors. No lens.
01 / 05
Reflection, not refraction. Geometry is the lens.
02 / 05
SAI's world has no map either. Only pixels.
03 / 05
It sees what you're looking for. That's what OpenClaw agents — and their builders — need.
04 / 05
Same principle. China's Einstein Probe scans the X-ray sky with it.
05 / 05
SAI is how OpenClaw agents see and act — same eye, same hands, whether SAI leads the task chain or follows another agent's command.
And it's how the team sees what those agents were looking at when something breaks. Four scenarios below — same OS-level vision loop, four jobs.
9:07 AM, weekday start
You're still pouring coffee. Coordinator has opened Notion, summarised last night's 3 unread Slack threads into your Daily Note, and queued up the 10 AM Zoom waiting room. You sit down. Screen is ready. No API. No Zapier.
A 3-day-old ticket
User shares a backend screen: "settings keep failing." Agent describes the visible state, spots the wrong toggle, then drives the rep's Linux desktop step-by-step — clicks, not text. 3 min 12s. Case closed.
2:51 AM, a spike
BTC/USDT prints an outlier candle on Binance desktop. Agent scans the K-line, RSI, and orderbook depth — three windows, vision only. Calls short-term exhaustion, fills a limit order, waits for your tap. Zero exchange API key.
A clutch moment
Player pulls a low-HP reversal. Agent reads HP bar, kill feed, and minimap in 0.3s, generates: "INSANE — 30% HP, no mana, double kill from base." Subtitles go live. VTuber lipsync starts.
Seven operating-system targets, each marked by validation status. SAI runs with the same UI permission as a logged-in human — system settings, file managers, browsers, IDEs, full-screen games, niche pro tools, timeline editors, anything in between. If you can click it, type in it, scroll it, or drag it, SAI can operate it. No allow-list. No integration manifest.
Production: real workflows, misclick recovery validated, version-pinned. Beta: architecture confirmed, stress testing incomplete. Preview: vision parsing works, action coverage partial.
The Vision-to-Action pipeline. Describe input, OS output, retry on miss.
Step 1
Grounder observes the active OS surface and returns state.
describe: active app, controls, focus
Step 2
Natural-language targets resolve to labels and coordinates.
locate: WiFi toggle x:924, y:612
Step 3
Click, type, scroll, or drag through native input.
action: scroll down, then click x:924
Step 4
Misclick? Re-describe, re-locate, retry.
miss at x:901 → re-locate → retry x:924
describe → locate → act. Zero image tokens in the normal agent loop.
The agent observes with describe and locate. The vision layer handles pixels; the orchestrating LLM receives prose, labels, coordinates, and state. Grounder deployment can be local, self-hosted, or configured by endpoint.
open source·Import directly into your agent stack or wire as an MCP tool layer — one vision layer, every OS.
Universal embodiment is the mile.
Now
The screen-observing agent. Shipping today.
Runs as an OpenClaw agent with OS-level perception and input. Full UI permission — click, type, scroll, drag, and hotkeys. Coordinator or commanded. Self-correcting on misclick.
Soon
Bring beta + preview platforms to parity.
FreeBSD, Android, iOS, and Haiku graduated to production: stress-tested, version-pinned, misclick recovery validated. One bar across every supported OS.
Next
Customer service. Trading. Broadcasting. Editing.
Pre-built agent templates for each vertical. Bring your own model — SAI handles the OS layer.
Future
Operating systems built for agents, not humans.
UIs optimized for vision-first parsing. File systems exposed as agent-readable graphs. No legacy GUI overhead.
Creator & Founder
Vision architecture and OS-level agent runtime. Published open-source fine-tunes across 7B and MoE architectures — Chihiro, Monsoon, Rain, DraftReasoner — GGUF-quantized for local inference.
Cofounder
Product builder focused on what AI makes possible at the application layer. Created ACP — an open protocol so human-AI teams can track contribution and split revenue fairly, without a platform. Cofounder of SAI.
We're shipping early access to teams building with OpenClaw — and developers who want to give their agents eyes on any screen. First wave: Mac, Linux, Windows.