AI agents built to control computers complete roughly 15% of tasks on standard desktop benchmarks. Thomas Kidane decided to put that performance on a live stream, open to anyone.
His project, iPad Playground, lets anyone on the internet queue up and issue natural language commands to a real, physical iPad. Accessible at play.thomaskidane.com, it works through a simple system: pick a nickname, join the queue, watch the live stream while waiting, then type plain English — "Open Safari," "Open Goodnotes then close it" — and an AI agent translates those instructions into multi-step actions on actual hardware. Each turn ends when the task is done or the time limit hits; then the next person in line gets control.
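The turn system described above is simple enough to sketch. This is a hypothetical reconstruction, not Kidane's actual code: a FIFO queue of nicknames, one active user at a time, and a turn that ends when the task finishes or a time limit expires (the `TurnQueue` class and the 60-second limit are assumptions for illustration).

```python
import time
from collections import deque

class TurnQueue:
    """Hypothetical sketch of iPad Playground's queue: users join under
    a nickname; the person at the front holds control until their task
    completes or the time limit hits, then the next user takes over."""

    def __init__(self, turn_seconds=60):
        self.turn_seconds = turn_seconds  # assumed limit, not confirmed
        self.queue = deque()
        self.active = None
        self.turn_started = None

    def join(self, nickname):
        # New viewers append to the back of the line.
        self.queue.append(nickname)

    def start_next_turn(self, now=None):
        # Promote the next queued user (or None if the line is empty).
        now = time.monotonic() if now is None else now
        self.active = self.queue.popleft() if self.queue else None
        self.turn_started = now
        return self.active

    def turn_expired(self, now=None):
        # A turn can also end early when the agent reports the task done.
        now = time.monotonic() if now is None else now
        return self.active is not None and now - self.turn_started >= self.turn_seconds

q = TurnQueue(turn_seconds=60)
q.join("alice")
q.join("bob")
assert q.start_next_turn(now=0.0) == "alice"
assert not q.turn_expired(now=30.0)   # alice still has control
assert q.turn_expired(now=60.0)       # limit hit
assert q.start_next_turn(now=61.0) == "bob"
```

The injectable `now` parameter is a testing convenience; a live system would just read the clock.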
At launch the system handles opening, closing, and switching apps; tapping UI elements; scrolling within apps; returning to the home screen; and chaining multi-step instructions. Text entry, home-screen page swiping, pinch-to-zoom, multi-finger gestures, and anything behind the lock screen or Control Center are all out of scope. Those limits aren't arbitrary — iOS sandboxing blocks the ADB-based shortcuts that Android-targeting frameworks like Tencent Research's AppAgent rely on, and programmatic text entry on iOS is substantially harder than on Android or desktop.
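One way to picture the capability boundary above is as a whitelist check applied to each step of a chained plan before any hardware action runs. The action names and the `validate_plan` helper here are illustrative assumptions, not the project's real API:

```python
# Actions the agent handles at launch (names are assumptions).
SUPPORTED = {"open_app", "close_app", "switch_app", "tap", "scroll", "go_home"}

# Out of scope at launch: text entry, page swipes, pinch/multi-finger
# gestures, and anything behind the lock screen or Control Center.
UNSUPPORTED = {"type_text", "swipe_home_page", "pinch", "multi_finger_gesture", "unlock"}

def validate_plan(steps):
    """Return (ok, rejected_steps) for a chained multi-step plan."""
    rejected = [s for s in steps if s["action"] not in SUPPORTED]
    return (len(rejected) == 0, rejected)

# "Open Goodnotes then close it" chains two supported steps.
ok, rejected = validate_plan([
    {"action": "open_app", "target": "Goodnotes"},
    {"action": "close_app", "target": "Goodnotes"},
])
assert ok and rejected == []

# Text entry is rejected before any gesture is attempted.
ok, rejected = validate_plan([{"action": "type_text", "text": "hello"}])
assert not ok
```

Rejecting unsupported steps up front, rather than letting them fail mid-gesture, is what keeps a public demo predictable under arbitrary input.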
That leaves iPad Playground doing something none of the major players have shipped: a publicly accessible, live, hardware-in-the-loop demo on real iPadOS. Anthropic's Computer Use API, still in beta as of early 2026, targets desktop Linux. AppAgent runs on Android. Benchmarks like OSWorld and AITW skew toward desktop and Android. Kidane built the first internet-facing iOS GUI agent demo, and he's running it continuously against arbitrary commands from anonymous strangers — less forgiving than any controlled evaluation suite.
The benchmark numbers sharpen what iPad Playground actually shows. Anthropic's published figures put Claude 3.5 Sonnet at 14.9% on OSWorld tasks against a human baseline of roughly 72%. iPad Playground's capability profile — reliable single-tap navigation and multi-step app switching, but not text entry or complex gestures — maps to a similar ceiling on mobile. The demo doesn't close that gap. It makes it visible, on real hardware, to anyone who wants a turn.