macOS today · Local-first · Teachable

Watch once. Then take the stage.

Understudy is a teachable desktop agent. It operates your computer like a human colleague across GUI, browser, shell, files, and messaging. You demonstrate a task once; it learns the intent, remembers successful paths, and gradually upgrades to faster execution routes.

Unified desktop execution: One task can mix GUI, browser, terminal, web search, and local files inside one agent session.
Not a macro recorder: It learns task intent, parameter slots, and success criteria, not fixed coordinates.
Gets sharper with use: Successful paths become reusable skills, reducing randomness and repeat exploration.
Upgrades to faster routes: Start with GUI if needed, then graduate toward browser, CLI, or API execution when verified.
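
To make the contrast with a macro recorder concrete, here is a minimal Python sketch of what a learned task might carry: an intent, named parameter slots, and a success criterion, with no screen coordinates. The `TaughtTask` class, its field names, and the example values are hypothetical; this is not Understudy's actual skill schema.

```python
from dataclasses import dataclass, field
from string import Template


@dataclass
class TaughtTask:
    """Hypothetical shape of a learned task: intent plus slots, no coordinates."""
    intent: str  # template text with $slot placeholders
    slots: dict[str, str] = field(default_factory=dict)  # slot name -> description
    success_criteria: str = ""

    def bind(self, **values: str) -> str:
        """Fill every declared slot; refuse to run with missing parameters."""
        missing = sorted(set(self.slots) - set(values))
        if missing:
            raise ValueError(f"missing slots: {missing}")
        return Template(self.intent).substitute(values)


task = TaughtTask(
    intent="Remove the background from $photo and save it to $output_dir",
    slots={"photo": "path to the source image", "output_dir": "destination folder"},
    success_criteria="a cutout image exists in the output folder",
)
print(task.bind(photo="portrait.jpg", output_dir="~/Desktop"))
# prints: Remove the background from portrait.jpg and save it to ~/Desktop
```

Because the task stores slots rather than click positions, the same definition can be rebound to a different photo or folder without re-recording anything.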

Showcase

See Understudy in action: operating desktop apps, browsers, and the terminal in a single agent session.

The video above has been sped up. Full-length original: Google Drive

The published skill artifact generated from this showcase demo is available at examples/published-skills/taught-person-photo-cutout-bc88ec/SKILL.md.

Why This Is More Than Another GUI Clicker

Understudy is not interesting because it can press buttons. It is interesting because native desktop execution, teach-by-demonstration, memory crystallization, and route upgrades live inside one agent runtime.

Native Runtime

One planner across GUI, browser, shell, web, and memory

A single task can search the web, run a local command, switch to a native app, click through a workflow, and send the result through a messaging channel. This is not stitched-together automation. It is one routed agent loop.

GUI: screenshot grounding plus native input events
Browser: Playwright plus Chrome extension relay
Shell / Files: local execution and filesystem access

Teach By Demonstration

Show the task once, extract a reusable skill

Understudy records video and semantic events, then learns workflow intent rather than a coordinate trace. Scene detection, event clustering, keyframe extraction, evidence pack analysis, and teach drafts are already part of the product model.

Dual-track recording: screen capture plus Swift global event monitor
AI analysis: parameters, steps, success criteria, and route preferences
Validation loop: replay the learned task and analyze the trace

Path Upgrade

Day one fallback, long-term acceleration

GUI is the universal fallback, not the permanent destination. Understudy can begin with GUI execution and later promote the same task toward browser, CLI, or API routes after repeated verified success.

Day 1: GUI fallback means any visible software can be operated
Week 1: remembered paths reduce repeated exploration
Month 1+: faster routes become stable defaults
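
The promotion idea above can be sketched as a small success counter. This is only an illustration: the `RouteMemory` class, the route names, and the threshold of three consecutive verified successes are assumptions, not documented Understudy settings.

```python
from collections import defaultdict

# Fastest first; GUI is the universal fallback.
ROUTES = ["api", "cli", "browser", "gui"]
PROMOTE_AFTER = 3  # assumed threshold, not a documented Understudy setting


class RouteMemory:
    """Tracks verified successes per (task, route) and picks a default route."""

    def __init__(self) -> None:
        self.successes: dict[tuple[str, str], int] = defaultdict(int)

    def record(self, task: str, route: str, verified: bool) -> None:
        key = (task, route)
        # A failure resets the streak so only repeated verified success counts.
        self.successes[key] = self.successes[key] + 1 if verified else 0

    def default_route(self, task: str) -> str:
        for route in ROUTES:  # check fastest routes first
            if self.successes[(task, route)] >= PROMOTE_AFTER:
                return route
        return "gui"  # nothing promoted yet: stay on the fallback


memory = RouteMemory()
for _ in range(3):
    memory.record("export-report", "cli", verified=True)
print(memory.default_route("export-report"))  # prints: cli
```

A task with no recorded history still resolves to `"gui"`, which mirrors the "Day 1" behavior described above.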

What Understudy Already Demonstrates

This section stays intentionally strict: it reflects capabilities explicitly described in the docs today, without presenting future layers as already shipped.

01 · Grounded GUI

A full native desktop execution loop

From screenshot capture and HiDPI normalization to target resolution, click-point stabilization, complex-mode overlay validation, and post-action verification, the GUI pipeline is already a disciplined execution system rather than a fragile click bot.
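
As one illustration of the HiDPI normalization step, the sketch below converts a target's bounding box from screenshot pixels to the logical points that input events typically use on a Retina display. It assumes a uniform scale factor (2.0 is common on Retina Macs); the `Box` type and `click_point` function are hypothetical names, not taken from Understudy's code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box in screenshot pixels."""
    x: float
    y: float
    width: float
    height: float


def click_point(box: Box, scale: float = 2.0) -> tuple[float, float]:
    """Return the box center in logical points for a given scale factor."""
    if scale <= 0:
        raise ValueError("scale must be positive")
    center_x = box.x + box.width / 2
    center_y = box.y + box.height / 2
    return (center_x / scale, center_y / scale)


print(click_point(Box(x=200, y=100, width=80, height=40)))  # prints: (120.0, 60.0)
```

Skipping this conversion on a 2x display would send the click to twice the intended coordinates, which is one reason a normalization stage belongs before target resolution.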

Dual-model split: one model decides what to do, another decides where to do it
Hard cases handled: crop refinement, dense-area retries, grounding feedback memory
Structured outcomes: observed / resolved / action_sent / condition_met

02 · Teach Loop

Teach-by-demonstration is a product flow, not a concept slide

`/teach start`, `/teach stop`, `/teach confirm`, `/teach validate`, and `/teach publish` already define a concrete interface for turning one desktop demonstration into a task card and publishable skill output.
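
The five commands suggest a linear lifecycle. The sketch below models that lifecycle as a small state machine, assuming each command is only valid after the previous one. That ordering rule and the `TeachSession` class are assumptions for illustration, not documented Understudy behavior.

```python
# Assumed linear order of the documented /teach commands.
TEACH_ORDER = ["start", "stop", "confirm", "validate", "publish"]


class TeachSession:
    """Hypothetical model: each /teach command must follow the previous one."""

    def __init__(self) -> None:
        self.completed: list[str] = []

    def run(self, command: str) -> str:
        if command not in TEACH_ORDER:
            raise ValueError(f"unknown command: /teach {command}")
        if len(self.completed) == len(TEACH_ORDER):
            raise RuntimeError("session already published")
        expected = TEACH_ORDER[len(self.completed)]
        if command != expected:
            raise RuntimeError(f"expected /teach {expected}, got /teach {command}")
        self.completed.append(command)
        return command


session = TeachSession()
for cmd in TEACH_ORDER:
    session.run(cmd)
print(session.completed[-1])  # prints: publish
```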

Evidence pack: scene changes, event timelines, keyframes, and capability snapshots
Clarification dialog: refine title, parameters, and success criteria in natural language
Skill output: publish to a three-layer `SKILL.md` abstraction

03 · Unified Agent Surface

Already broader than a single CLI command

The repository already includes daemon, chat, agent, dashboard, webchat, channels, and scheduling surfaces. That matters because the project is aiming at a persistent local agent system, not a one-shot automation wrapper.

Messaging: Telegram, Slack, Discord, WhatsApp, Signal, LINE, iMessage, and Web
Automation: cron scheduling and one-shot timers
Subagents: child sessions for parallel task decomposition

Five Layers, No Cheap “Full Autonomy” Shortcut

The product narrative is unusually coherent: first make the agent capable, then teachable, then reliable, then faster, and only then proactive.

Layer 1 · Implemented
Operate Software Natively

Use software like a human

See, click, type, drag, scroll, and verify across real macOS applications.

Layer 2 · Implemented
Learn from Demonstrations

Learn tasks from demonstration

Capture evidence, extract workflow intent, refine task cards, and publish reusable skills.

Layer 3 · Partially Implemented
Crystallized Memory

Lock in successful paths

Repeated day-to-day usage is automatically crystallized into reusable workspace skills, with no explicit teaching required.

Layer 4 · Partially Implemented
Route Optimization

Promote faster execution routes

Teach prompts and route preferences already push tasks toward faster routes, with a simpler but real upgrade path in place today.

Layer 5 · North Star
Proactive Autonomy

Work proactively in an isolated space

Understand routines, suggest next actions, and act without monopolizing the user’s screen.

The Execution Route Pyramid

This is one of the strongest product ideas in Understudy: GUI is the universal start, but it is not the ideal final route.

Fastest

API / Direct Tool

If a capability already has a direct tool or API surface, the agent should not route back through GUI.

Best for structured actions, channel sends, and explicit tool surfaces.

Fast

Shell / CLI

Local commands, scripts, and filesystem operations often avoid repeated interface work entirely.

Best for developer workflows, file transforms, and local system automation.

Fallback+

Browser

When the natural interface is a website, managed browser control can beat GUI while preserving real workflows.

Best for SaaS products, dashboards, forms, and authenticated web tasks.

Universal

GUI

Anything with a visible interface can be operated. This is what gives Understudy broad real-world coverage.

Not a “low-end” route, but the core universality layer for long-tail software.
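
Read as a selection rule, the pyramid says: use the fastest route a step supports and fall back to GUI otherwise. A minimal sketch follows, assuming each step can report which routes are available to it; the route identifiers and the `pick_route` function are hypothetical.

```python
# Preference order from the pyramid: fastest route first, GUI as the fallback.
ROUTE_PREFERENCE = ["api", "cli", "browser", "gui"]


def pick_route(available: set[str]) -> str:
    """Return the fastest route that the current step supports."""
    for route in ROUTE_PREFERENCE:
        if route in available:
            return route
    # Anything with a visible interface can still be operated through GUI.
    return "gui"


print(pick_route({"browser", "gui"}))     # prints: browser
print(pick_route({"api", "cli", "gui"}))  # prints: api
```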

One task can span the whole machine

The planner is not choosing a favorite tool up front. It is routing step-by-step toward the objective. That means the same task can inspect the web, read local files, operate a native app, and send the result out through a channel inside one flow.

Observe: Capture screenshots, read pages, inspect local state, and recover memory context.
Resolve: Decide whether the next step should use API, CLI, browser, or GUI.
Execute: Invoke tools, scripts, browser actions, or native input events.
Verify: Re-observe after action and emit structured signals for memory and route upgrades.
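
The four phases can be sketched as a generic loop. The example below is a toy under stated assumptions: the `run_task` function, the `StepSignal` record, and the callable signatures are hypothetical, and the demo "task" simply increments a counter rather than driving real software.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class StepSignal:
    """Hypothetical structured record emitted after each verified step."""
    route: str
    action: str
    condition_met: bool


def run_task(
    observe: Callable[[], dict],
    resolve: Callable[[dict], tuple[str, str]],
    execute: Callable[[str, str], None],
    verify: Callable[[dict], bool],
    max_steps: int = 10,
) -> list[StepSignal]:
    """Loop observe -> resolve -> execute -> verify until the goal is met."""
    signals: list[StepSignal] = []
    for _ in range(max_steps):
        state = observe()               # Observe current state
        route, action = resolve(state)  # Resolve: choose a route for this step
        execute(route, action)          # Execute the chosen action
        done = verify(observe())        # Verify: re-observe after the action
        signals.append(StepSignal(route, action, done))
        if done:
            break
    return signals


# Toy usage: the goal is reached once the counter hits 2.
counter = {"n": 0}
signals = run_task(
    observe=lambda: dict(counter),
    resolve=lambda state: ("cli", f"increment from {state['n']}"),
    execute=lambda route, action: counter.update(n=counter["n"] + 1),
    verify=lambda state: state["n"] >= 2,
)
print(len(signals), signals[-1].condition_met)  # prints: 2 True
```

Routing is decided inside the loop on every step, not once up front, which is what lets a single task move between routes as it progresses.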

Quick Start

The current README recommends starting the daemon first, then using the terminal or dashboard entry points. The project is still in alpha, with development and testing currently centered on macOS.

understudy daemon --start
understudy chat
understudy dashboard
understudy webchat
understudy agent --message "..."