Understudy – Teach a Desktop Agent by Demonstrating a Task Once

Understudy, a new open-source macOS agent from Understudy AI, skips the workflow configuration that makes most desktop automation tools a slog. You show it a task once; it handles it from then on. The project ships under the MIT license and is available on npm.

It runs across the GUI, browser, shell, file system, and seven messaging platforms — Telegram, Slack, Discord, WhatsApp, Signal, LINE, and iMessage — from a single agent loop, with no separate integrations required.

The design separates two problems that computer-use agents often blur together: deciding what to do, and finding exactly where on screen to do it. A dedicated grounding model handles the targeting half, using crop refinement and retry logic to land accurately on ambiguous UI elements and icon-only controls. The project reports 30/30 on its internal benchmark, though the documentation does not fully address how well that result generalizes to the diversity of real-world UIs.
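To make the crop-refinement idea concrete, here is a minimal Python sketch of one way such a loop could work. It is not Understudy's implementation: the `Locator` interface, the confidence threshold, the zoom factor, and the retry count are all assumptions made for illustration. The locator is asked for a point, the crop is narrowed around that guess, and the loop repeats until the model is confident or the round budget runs out.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass(frozen=True)
class Box:
    left: float
    top: float
    width: float
    height: float


# Hypothetical model interface (an assumption, not Understudy's API): given a
# crop of the screenshot and a target description, return (rel_x, rel_y,
# confidence) with coordinates in [0, 1] relative to the crop, or None.
Locator = Callable[[Box, str], Optional[Tuple[float, float, float]]]


def ground(target: str, screen: Box, locate: Locator, rounds: int = 3,
           zoom: float = 0.5, min_conf: float = 0.8,
           max_retries: int = 2) -> Optional[Tuple[float, float]]:
    """Return an (x, y) screen point for `target`, or None if never located."""
    crop, best = screen, None
    for _ in range(rounds):
        guess = None
        for _ in range(max_retries + 1):  # retry when the locator gives no answer
            guess = locate(crop, target)
            if guess is not None:
                break
        if guess is None:
            break
        rel_x, rel_y, conf = guess
        # Map crop-relative coordinates back to full-screen coordinates.
        x = crop.left + rel_x * crop.width
        y = crop.top + rel_y * crop.height
        best = (x, y)
        if conf >= min_conf:
            break
        # Narrow the crop around the guess, clamped to stay on screen.
        w, h = crop.width * zoom, crop.height * zoom
        left = min(max(x - w / 2, screen.left), screen.left + screen.width - w)
        top = min(max(y - h / 2, screen.top), screen.top + screen.height - h)
        crop = Box(left, top, w, h)
    return best


# Stand-in locator for the demo: its error shrinks as the crop shrinks, and it
# only becomes confident once the crop is small.
SCREEN = Box(0, 0, 1440, 900)


def fake_locate(crop: Box, target: str) -> Tuple[float, float, float]:
    true_x, true_y = 1200.0, 40.0
    rel_x = (true_x - crop.left) / crop.width + 0.05
    rel_y = (true_y - crop.top) / crop.height + 0.05
    return rel_x, rel_y, (0.9 if crop.width <= 400 else 0.5)


print(ground("close button", SCREEN, fake_locate))
```

With the stand-in locator, the first guess on the full screen is off by about 70 pixels horizontally, and the third, tightly cropped guess lands within roughly 20 pixels of the true target at (1200, 40). That is the intuition behind refining on a crop: a fixed relative error costs fewer pixels when the region is smaller.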

Demonstrations are captured as two parallel tracks: a screen recording and a Swift-based semantic event stream. Rather than replaying these as a fixed script, the system analyzes them to extract intent, parameter slots, and success criteria, writing the result to a human-readable SKILL.md file. Because skills are re-grounded against live screenshots at runtime rather than locked to pixel coordinates, they should survive the UI updates that routinely break conventional macro automations.
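A short Python sketch can illustrate what "intent, parameter slots, and success criteria" might look like as data. The field names and the markdown layout below are hypothetical, since the source does not spell out the actual SKILL.md schema. The point is that steps name their targets semantically and leave slots open, so nothing is tied to pixel coordinates.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Skill:
    # All field names are illustrative assumptions, not the real schema.
    name: str
    intent: str
    parameters: Dict[str, str]   # slot name -> human-readable description
    steps: List[str]             # semantic steps with {slot} placeholders
    success_criteria: List[str]


def render_skill_md(skill: Skill) -> str:
    """Render a skill as a human-readable markdown document."""
    lines = [f"# {skill.name}", "", "## Intent", skill.intent, "", "## Parameters"]
    lines += [f"- `{slot}`: {desc}" for slot, desc in skill.parameters.items()]
    lines += ["", "## Steps"]
    lines += [f"{i}. {step}" for i, step in enumerate(skill.steps, start=1)]
    lines += ["", "## Success criteria"]
    lines += [f"- {criterion}" for criterion in skill.success_criteria]
    return "\n".join(lines) + "\n"


def bind(skill: Skill, **values: str) -> List[str]:
    """Fill the parameter slots for one run; fail loudly if any slot is missing."""
    missing = sorted(set(skill.parameters) - set(values))
    if missing:
        raise ValueError(f"missing parameters: {missing}")
    return [step.format(**values) for step in skill.steps]


skill = Skill(
    name="send-status-update",
    intent="Post a short status message to a chat channel.",
    parameters={"channel": "Channel to post in", "message": "Text to send"},
    steps=[
        "Open the channel named {channel}",
        "Type {message} into the message field",
        "Click the Send button",
    ],
    success_criteria=["The message appears as the newest entry in the channel"],
)

print(render_skill_md(skill))
print(bind(skill, channel="#ops", message="Deploy finished"))
```

Because a step such as "Click the Send button" carries no coordinates, an executor has to locate the button on the current screen at run time, which is what lets a skill of this shape tolerate layout changes.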

The project is built around a five-layer roadmap. Layers 1 and 2 — running software natively and learning from demonstrations — are fully shipped. Layers 3 and 4, covering persistent memory and route optimization, are partially implemented. The route optimization piece is worth watching: as the agent accumulates experience with a given task, it's designed to graduate from GUI replay toward faster execution paths through CLI, direct API calls, or browser automation. Layer 5, proactive autonomy, is the stated long-term direction.
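One simple way to picture that graduation is a selector that prefers a faster route only after it has proven reliable for the task, and otherwise falls back to the GUI. The Python sketch below is an assumption-laden illustration, not the project's algorithm: the route ordering, the thresholds, and the `RouteStats` type are all invented here, and it leaves out how an agent would first try a new route.

```python
from dataclasses import dataclass
from typing import Dict

# Assumed preference order, fastest first; the GUI is the universal fallback.
ROUTES_FASTEST_FIRST = ("api", "cli", "browser", "gui")


@dataclass
class RouteStats:
    successes: int = 0
    failures: int = 0

    @property
    def attempts(self) -> int:
        return self.successes + self.failures

    @property
    def success_rate(self) -> float:
        return self.successes / self.attempts if self.attempts else 0.0


def choose_route(stats: Dict[str, RouteStats], min_attempts: int = 3,
                 min_success_rate: float = 0.9) -> str:
    """Pick the fastest route with enough evidence behind it, else the GUI."""
    for route in ROUTES_FASTEST_FIRST:
        record = stats.get(route)
        if (record is not None
                and record.attempts >= min_attempts
                and record.success_rate >= min_success_rate):
            return route
    return "gui"


# A task with a proven CLI path and an API path that is still unproven.
history = {
    "gui": RouteStats(successes=10, failures=1),
    "cli": RouteStats(successes=5, failures=0),
    "api": RouteStats(successes=1, failures=0),
}
print(choose_route(history))   # cli
print(choose_route({}))        # gui
```

The thresholds encode a trade-off: set them too low and the agent abandons a working GUI path for a shortcut that fails intermittently; set them too high and it never gets the speedup.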

Understudy supports multiple model providers, ships bilingual documentation, and has an active Discord community. The field it is entering is crowded: Anthropic's Computer Use, Microsoft's Copilot agent efforts, and a growing list of open-source tools all cover overlapping ground. But asking users for nothing more than a single demonstration of a task is a meaningful point of difference, and a deliberate one.