Understudy is a teachable desktop agent. It operates your computer like a human colleague across GUI, browser, shell, files, and messaging. You demonstrate a task once, it learns the intent, remembers successful paths, and gradually upgrades to faster execution routes.
Understudy is not interesting because it can press buttons. It is interesting because native desktop execution, teach-by-demonstration, memory crystallization, and route upgrades live inside one agent runtime.
Native Runtime
One planner across GUI, browser, shell, web, and memory
A single task can search the web, run a local command, switch to a native app, click through a workflow, and send the result through a messaging channel. This is not stitched-together automation. It is one routed agent loop.
GUI: screenshot grounding plus native input events
Browser: Playwright plus Chrome extension relay
Shell / Files: local execution and filesystem access
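The routed loop above can be pictured as a single dispatcher that sends each planned step to the surface it names. This is only a minimal sketch; the step shapes, surface names, and executor stubs are illustrative assumptions, not Understudy's real API.

```python
# Hypothetical sketch: one dispatcher routes steps across surfaces.
# Step shapes and executor names are illustrative assumptions.
from typing import Callable, Dict, List

def run_shell(step: dict) -> str:
    return f"shell ran {step['cmd']}"

def run_browser(step: dict) -> str:
    return f"browser opened {step['url']}"

def run_gui(step: dict) -> str:
    return f"gui clicked {step['target']}"

EXECUTORS: Dict[str, Callable[[dict], str]] = {
    "shell": run_shell,
    "browser": run_browser,
    "gui": run_gui,
}

def run_task(steps: List[dict]) -> List[str]:
    # One loop, many surfaces: each step picks its executor by name.
    return [EXECUTORS[s["surface"]](s) for s in steps]

results = run_task([
    {"surface": "browser", "url": "https://example.com"},
    {"surface": "shell", "cmd": "ls"},
    {"surface": "gui", "target": "Send button"},
])
```

The point is the single loop: the same task list can mix web, local, and native steps without leaving one runtime.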
Teach By Demonstration
Show the task once, extract a reusable skill
Understudy records video and semantic events, then learns workflow intent rather than a coordinate trace. Scene detection, event clustering, keyframe extraction, evidence pack analysis, and teach drafts are already part of the product model.
Dual-track recording: screen capture plus Swift global event monitor
AI analysis: parameters, steps, success criteria, and route preferences
Validation loop: replay the learned task and analyze the trace
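One simple way to turn a raw event stream into workflow steps, in the spirit of the event clustering above, is to split wherever the gap between events exceeds a threshold. The 1.5-second threshold and the `(timestamp, kind)` event shape below are assumptions for illustration.

```python
# Hypothetical sketch: cluster recorded input events into workflow steps
# by splitting wherever the time gap between events exceeds a threshold.
from typing import List, Tuple

Event = Tuple[float, str]  # (timestamp in seconds, event kind)

def cluster_events(events: List[Event], gap: float = 1.5) -> List[List[Event]]:
    clusters: List[List[Event]] = []
    for ev in sorted(events, key=lambda e: e[0]):
        # Extend the current cluster if this event follows closely enough.
        if clusters and ev[0] - clusters[-1][-1][0] <= gap:
            clusters[-1].append(ev)
        else:
            clusters.append([ev])
    return clusters

steps = cluster_events([
    (0.0, "click"), (0.4, "type"), (0.9, "type"),   # one burst of activity
    (4.0, "click"), (4.2, "drag"),                  # a later burst
])
```

A real pipeline would combine this with scene detection and keyframes, but gap-based clustering shows why the result is workflow intent rather than a raw coordinate trace.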
Path Upgrade
Day one fallback, long-term acceleration
GUI is the universal fallback, not the permanent destination. Understudy can begin with GUI execution and later promote the same task toward browser, CLI, or API routes after repeated verified success.
Day 1: GUI fallback means any visible software can be operated
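A promotion policy of this kind can be sketched as a counter: only after several consecutive verified successes on a faster route does the task's preferred route change. The route names, the threshold of three, and the tracker shape are assumptions, not the shipped policy.

```python
# Hypothetical sketch: promote a task's preferred route only after N
# consecutive verified successes on a faster candidate route.
ROUTE_ORDER = ["gui", "browser", "shell", "api"]  # slow -> fast

class RouteTracker:
    def __init__(self, promote_after: int = 3):
        self.promote_after = promote_after
        self.route = "gui"   # day-one universal fallback
        self.streak = 0

    def record(self, candidate: str, verified: bool) -> str:
        # Count verified successes on a faster route; any failure resets.
        if verified and ROUTE_ORDER.index(candidate) > ROUTE_ORDER.index(self.route):
            self.streak += 1
            if self.streak >= self.promote_after:
                self.route, self.streak = candidate, 0
        elif not verified:
            self.streak = 0
        return self.route

t = RouteTracker()
t.record("shell", True)
t.record("shell", True)
final = t.record("shell", True)  # third verified success promotes
```

The reset-on-failure behavior is what makes the upgrade "verified": a flaky faster route never displaces the GUI fallback.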
This section stays intentionally strict: it reflects capabilities explicitly described in the docs today, without presenting future layers as already shipped.
01 · Grounded GUI
A full native desktop execution loop
From screenshot capture and HiDPI normalization to target resolution, click-point stabilization, complex-mode overlay validation, and post-action verification, the GUI pipeline is already a disciplined execution system rather than a fragile click bot.
Dual-model split: one model decides what to do, another decides where to do it
Hard cases handled: crop refinement, dense-area retries, grounding feedback memory
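Two of the pipeline stages above can be made concrete with a small sketch: HiDPI normalization (a grounding model returns coordinates in screenshot pixel space, while native input expects points) and click-point stabilization across retries. The 2x scale factor, retry count, and median strategy are illustrative assumptions.

```python
# Hypothetical sketch: map grounding-model pixel coordinates (from a
# HiDPI screenshot) back to native points, then stabilize the click
# point across repeated grounding attempts with a median.
from statistics import median
from typing import List, Tuple

def pixels_to_points(px: float, py: float, scale: float) -> Tuple[float, float]:
    # On a 2x Retina display the screenshot has twice the point resolution.
    return px / scale, py / scale

def stabilize(candidates: List[Tuple[float, float]]) -> Tuple[float, float]:
    # The median across retries resists a single bad grounding result.
    return (median(x for x, _ in candidates),
            median(y for _, y in candidates))

raw = [(802, 604), (798, 600), (900, 598)]  # one outlying x estimate
points = [pixels_to_points(x, y, scale=2.0) for x, y in raw]
click = stabilize(points)
```

Note how the outlying 900-pixel estimate is simply ignored by the median, which is the basic idea behind stabilizing a click point before committing to native input.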
02 · Teach by Demonstration
Teach-by-demonstration is a product flow, not a concept slide
`/teach start`, `/teach stop`, `/teach confirm`, `/teach validate`, and `/teach publish` already define a concrete interface for turning one desktop demonstration into a task card and publishable skill output.
Evidence pack: scene changes, event timelines, keyframes, and capability snapshots
Clarification dialog: refine title, parameters, and success criteria in natural language
Skill output: publish to a three-layer `SKILL.md` abstraction
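To make the publish step concrete, here is what rendering a learned draft into a `SKILL.md` file could look like. The docs name a three-layer abstraction without defining the layers here, so the intent / parameters / steps split below is an assumption, as are all the field names.

```python
# Hypothetical sketch: render a learned task draft into SKILL.md text.
# The intent / parameters / steps layering is an assumed interpretation
# of the "three-layer SKILL.md abstraction".
def render_skill(title: str, intent: str, params: dict, steps: list) -> str:
    lines = [f"# {title}", "", "## Intent", intent, "", "## Parameters"]
    lines += [f"- `{k}`: {v}" for k, v in params.items()]
    lines += ["", "## Steps"]
    lines += [f"{i}. {s}" for i, s in enumerate(steps, 1)]
    return "\n".join(lines)

doc = render_skill(
    "Export weekly report",
    "Download the weekly report and send it to the team channel.",
    {"week": "ISO week number, defaults to the current week"},
    ["Open the dashboard", "Click Export", "Send the file via Slack"],
)
```

Whatever the real layer boundaries are, the shape matters: a published skill is a parameterized document, not a pixel-coordinate macro.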
03 · Unified Agent Surface
Already broader than a single CLI command
The repository already includes daemon, chat, agent, dashboard, webchat, channels, and scheduling surfaces. That matters because the project is aiming at a persistent local agent system, not a one-shot automation wrapper.
Messaging: Telegram, Slack, Discord, WhatsApp, Signal, LINE, iMessage, and Web
Automation: cron scheduling and one-shot timers
Subagents: child sessions for parallel task decomposition
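The scheduling surface above rests on a familiar primitive: checking whether a cron expression matches the current time on each scheduler tick. The sketch below handles only `*` and plain numbers in the five standard fields; real cron syntax (ranges, steps, lists) is richer, and this is not Understudy's scheduler.

```python
# Hypothetical sketch: minimal matching of a 5-field cron expression
# (minute hour day month weekday) against a datetime. Only "*" and
# plain numeric fields are handled.
from datetime import datetime

def cron_matches(expr: str, when: datetime) -> bool:
    fields = expr.split()
    # isoweekday() % 7 maps Sunday to 0, matching cron's convention.
    values = [when.minute, when.hour, when.day, when.month,
              when.isoweekday() % 7]
    return all(f == "*" or int(f) == v for f, v in zip(fields, values))

now = datetime(2025, 1, 6, 9, 30)            # a Monday at 09:30
daily = cron_matches("30 9 * * *", now)      # "every day at 09:30"
other = cron_matches("0 12 * * *", now)      # noon job: no match
```

One-shot timers are the degenerate case: fire once when the target time passes, then discard the entry instead of rescheduling.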
Five Layers, No Cheap “Full Autonomy” Shortcut
The product narrative is unusually coherent: first make the agent capable, then teachable, then reliable, then faster, and only then proactive.
Layer 1 · Implemented
Operate Software Natively
Use software like a human
See, click, type, drag, scroll, and verify across real macOS applications.
Layer 2
Teach by Demonstration
Show the task once, extract a reusable skill
One recorded demonstration becomes workflow intent, parameters, and success criteria.
Layer 3
Memory Crystallization
Crystallize repeated usage into skills
Repeated day-to-day usage is automatically crystallized into reusable workspace skills, with no explicit teaching required.
Layer 4 · Partially Implemented
Route Optimization
Promote faster execution routes
Teach prompts and route preferences already push tasks toward faster routes, with a simpler but real upgrade path in place today.
Layer 5 · North Star
Proactive Autonomy
Work proactively in an isolated space
Understand routines, suggest next actions, and act without monopolizing the user’s screen.
The Execution Route Pyramid
This is one of the strongest product ideas in Understudy: GUI is the universal start, but it is not the ideal final route.
Fastest
API / Direct Tool
If a capability already has a direct tool or API surface, the agent should not route back through GUI.
Best for structured actions, channel sends, and explicit tool surfaces.
Fast
Shell / CLI
Local commands, scripts, and filesystem operations often avoid repeated interface work entirely.
Best for developer workflows, file transforms, and local system automation.
Fallback+
Browser
When the natural interface is a website, managed browser control can beat GUI while preserving real workflows.
Best for SaaS products, dashboards, forms, and authenticated web tasks.
Universal
GUI
Anything with a visible interface can be operated. This is what gives Understudy broad real-world coverage.
Not a “low-end” route, but the core universality layer for long-tail software.
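The pyramid reads naturally as a preference order: try routes from fastest to slowest and fall back to GUI, which is always available for visible software. The route names and the capability-set check in this sketch are illustrative assumptions.

```python
# Hypothetical sketch: pick the fastest route a task supports, with GUI
# as the universal fallback for anything with a visible interface.
PYRAMID = ["api", "shell", "browser", "gui"]  # fastest first

def choose_route(available: set) -> str:
    for route in PYRAMID:
        if route in available:
            return route
    return "gui"  # universal fallback

dev_task = choose_route({"shell", "gui"})     # shell beats gui
saas_task = choose_route({"browser", "gui"})  # browser beats gui
legacy = choose_route({"gui"})                # long-tail software
```

The interesting property is the last line: even software with no API, CLI, or web surface still gets a route, which is exactly the universality claim above.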
One task can span the whole machine
The planner is not choosing a favorite tool up front. It is routing step-by-step toward the objective. That means the same task can inspect the web, read local files, operate a native app, and send the result out through a channel inside one flow.
Observe
Capture screenshots, read pages, inspect local state, and recover memory context.
Resolve
Decide whether the next step should use API, CLI, browser, or GUI.
Execute
Invoke tools, scripts, browser actions, or native input events.
Verify
Re-observe after action and emit structured signals for memory and route upgrades.
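The four phases above fit a compact loop skeleton: observe, check the goal, resolve a route, execute, and let the next observation serve as verification. Everything in this sketch is an assumption for illustration: the observation shape, the route labels, and the toy task.

```python
# Hypothetical skeleton of the observe/resolve/execute/verify loop.
def run_step_loop(goal_reached, observe, resolve, execute, max_steps=10):
    trace = []
    for _ in range(max_steps):
        obs = observe()                        # Observe: screens, pages, state
        if goal_reached(obs):                  # Verify: did the last step land?
            break
        route, action = resolve(obs)           # Resolve: API/CLI/browser/GUI
        result = execute(route, action)        # Execute: run the chosen action
        trace.append((route, action, result))  # Signals for memory and upgrades
    return trace

# Toy usage: "reach 3" by executing increments against a state dict.
state = {"n": 0}
trace = run_step_loop(
    goal_reached=lambda obs: obs >= 3,
    observe=lambda: state["n"],
    resolve=lambda obs: ("shell", "increment"),
    execute=lambda route, action: state.__setitem__("n", state["n"] + 1),
)
```

Because verification is just the next observation, failed steps are caught before the plan drifts, and the recorded trace is what feeds memory and route-upgrade decisions.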
Quick Start
The current README recommends starting the daemon first, then using the terminal or dashboard entry points. The project is still in alpha, with development and testing centered on macOS today.