Understudy, an open-source desktop agent published by the understudy-ai GitHub organization, takes a teach-by-demonstration approach to task automation: a user performs a task once, and the agent extracts intent, parameters, and success criteria — storing the result as a structured SKILL.md artifact rather than a brittle coordinate replay. The system runs entirely locally in Node.js (version 20.6 or higher) and is distributed via npm as @understudy-ai/understudy. Its unified runtime spans four execution surfaces — GUI (13 action tools with screenshot grounding and HiDPI normalization), browser (via Playwright plus a Chrome extension relay for session-persistent logins), shell (bash with full local access), and <a href="/news/2026-03-14-optimizing-web-content-for-ai-agents-via-http-content-negotiation">web search</a> — inside a single planner loop. A showcase demo running on macOS with GPT-5.4 via Codex achieved a 30/30 score on a grounding benchmark that included ambiguous prompts and icon-only controls.
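The HiDPI normalization mentioned above can be sketched roughly as follows (the function and parameter names are illustrative, not Understudy's actual API): a grounding model returns coordinates in screenshot pixel space, and those must be divided by the display's scale factor before a native click is dispatched at logical coordinates.

```typescript
// Hypothetical sketch of HiDPI coordinate normalization. Names are
// illustrative; this is not code from the Understudy repository.
interface Point { x: number; y: number }

function normalizeHiDpi(
  screenshotPoint: Point, // coordinate in captured-image pixel space
  scaleFactor: number,    // e.g. 2 on a Retina display
): Point {
  return {
    x: Math.round(screenshotPoint.x / scaleFactor),
    y: Math.round(screenshotPoint.y / scaleFactor),
  };
}

// A grounding model that saw a 2x screenshot returns (2840, 1410);
// the native click must land at the logical coordinate (1420, 705).
const logical = normalizeHiDpi({ x: 2840, y: 1410 }, 2);
console.log(logical); // { x: 1420, y: 705 }
```

Without this step, every click on a 2x display would land at double the intended distance from the origin, which is why grounding benchmarks exercise it explicitly.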
The architecture is organized into five progressive layers, framed by the project as mirroring a new hire's growth into a reliable colleague. Layers 1 and 2 — native OS operation and demonstration learning via a /teach command interface — are fully implemented. Layer 3 (crystallized memory, hardening successful paths through repeated use) and Layer 4 (route optimization, auto-promoting tasks from GUI to faster browser, CLI, or API routes once verified) are partially implemented. Layer 5, proactive autonomy without user disruption, remains a long-term roadmap item. Decision-making is split across two models: one decides what to do, while a separate grounding model determines where on screen to act — a design visible in the project's open system prompt at packages/core/src/system-prompt-sections.ts.
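The two-model split can be sketched as a loop (all interfaces and names below are hypothetical, not taken from the Understudy codebase): a planner model proposes the next action from the task and the latest screenshot, and a separate grounding model resolves any semantic GUI target to pixel coordinates before execution.

```typescript
// Hypothetical sketch of a planner/grounder split. None of these names
// come from Understudy's codebase; they illustrate the division of labor.
type Action =
  | { kind: "click"; target: string } // semantic target, e.g. "Send button"
  | { kind: "type"; text: string }
  | { kind: "done" };

interface Planner {
  // Decides WHAT to do next from the task and the latest screenshot.
  nextAction(task: string, screenshot: Buffer): Promise<Action>;
}

interface Grounder {
  // Decides WHERE on screen a semantic target is (pixel coordinates).
  locate(target: string, screenshot: Buffer): Promise<{ x: number; y: number }>;
}

interface Screen {
  capture(): Promise<Buffer>;
  click(x: number, y: number): Promise<void>;
  type(text: string): Promise<void>;
}

async function runTask(
  task: string,
  planner: Planner,
  grounder: Grounder,
  screen: Screen,
): Promise<void> {
  for (;;) {
    const shot = await screen.capture(); // "look"
    const action = await planner.nextAction(task, shot);
    if (action.kind === "done") return;
    if (action.kind === "click") {
      const { x, y } = await grounder.locate(action.target, shot);
      await screen.click(x, y); // "click"
    } else {
      await screen.type(action.text);
    }
  }
}
```

Each iteration requires at least one model round trip (two for clicks), which is why the look-click loop is latency-bound on current models.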
Understudy treats the GUI as a temporary bootstrapping surface rather than a persistent dependency. The SKILL.md artifact encodes three layers — an intent procedure in natural language, route options listing preferred and fallback paths, and GUI replay hints explicitly marked as last resort only. The architectural intent of Layer 4 is to automatically promote validated tasks up a route hierarchy, from pixel-level screen grounding toward shell commands or direct API calls, collapsing within a single agent's operational history what historically took years of ecosystem development (from screen-scraping to unofficial APIs to official integrations). The closest existing analogue is Playwright's codegen tool, which records and replays browser interactions, but Understudy's ambition extends the same autopromote logic across native desktop, web, and shell surfaces simultaneously.
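The three-layer artifact and the Layer 4 promotion hierarchy can be modeled roughly as follows (field and function names are illustrative assumptions, not Understudy's actual schema): a parsed skill carries its intent, route preferences, and last-resort GUI hints, and promotion moves the preferred route up the hierarchy once a faster route is verified.

```typescript
// Hypothetical shape of a parsed SKILL.md artifact, reflecting the three
// layers described above. Field names are illustrative, not Understudy's.
type Route = "api" | "cli" | "browser" | "gui";

// Fastest/most robust first; GUI replay is explicitly last resort.
const ROUTE_PREFERENCE: Route[] = ["api", "cli", "browser", "gui"];

interface Skill {
  intent: string;             // natural-language procedure: what the task achieves
  routes: {                   // route options: preferred path plus fallbacks
    preferred: Route;
    fallbacks: Route[];
  };
  guiReplayHints?: string[];  // last-resort replay data for the GUI route
}

// Layer 4's promotion logic, in spirit: once a faster route is verified,
// it becomes preferred and the old route is demoted to a fallback.
function promote(skill: Skill, verified: Route): Skill {
  const better =
    ROUTE_PREFERENCE.indexOf(verified) < ROUTE_PREFERENCE.indexOf(skill.routes.preferred);
  return better
    ? {
        ...skill,
        routes: {
          preferred: verified,
          fallbacks: [skill.routes.preferred, ...skill.routes.fallbacks],
        },
      }
    : skill;
}
```

Under this sketch, a skill taught by GUI demonstration starts with `preferred: "gui"`, and a verified CLI path promotes it to `preferred: "cli"` with GUI retained as a fallback — the "screen-scraping to official integration" trajectory compressed into one data structure.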
Hacker News reception was cautiously curious rather than enthusiastic. Commenters flagged the system-prompt-sections.ts file as worth reading for insight into the planner's structure, raised standard robustness concerns about GUI automation under real-world variability, and noted that the look-click-look-click execution loop appeared slow in the Telegram demo segment — a practical constraint given current model latency. The project is still in alpha, and the team acknowledges the gap between the route optimization vision and the current implementation: promotion thresholds are described as "heuristic rather than fully policy-driven," and crystallization logic is "LLM-first" in its segmentation. Those gaps are real — the teach-once promise depends on crystallization and route promotion that aren't yet fully automated, meaning users today are closer to a smart macro recorder than to the self-improving colleague the architecture envisions.