OpenAI released a research preview of GPT-5.3-Codex-Spark, a model built for real-time coding that delivers over 1,000 tokens per second through Cursor. It has a 128k context window, text-only input, and four effort modes — low through extra-high — letting developers trade response depth for speed. OpenAI distributed the preview through Cursor rather than its own interface, putting the model directly inside the IDE where it competes with GitHub Copilot and other native coding assistants.
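OpenAI hasn't published the preview's API shape, so the following is purely a hypothetical sketch of what selecting an effort mode at request time might look like: the model name comes from the announcement, but the `effort` field, the two middle mode names, and the overall payload layout are assumptions for illustration.

```python
# Hypothetical request payload with an effort-mode selector.
# The model name is from the announcement; the "effort" parameter,
# the middle two mode names, and the payload shape are assumptions,
# not a documented API.

VALID_EFFORT_MODES = ("low", "medium", "high", "extra-high")

def build_request(prompt: str, effort: str = "low") -> dict:
    """Build a chat-style request dict; lower effort favors latency."""
    if effort not in VALID_EFFORT_MODES:
        raise ValueError(f"effort must be one of {VALID_EFFORT_MODES}")
    return {
        "model": "gpt-5.3-codex-spark",
        "messages": [{"role": "user", "content": prompt}],
        "effort": effort,  # assumed parameter name
    }

req = build_request("Rename this variable across the file.", effort="low")
```

The point of the tradeoff is that quick edits like the one above would run at "low" for speed, while a deeper refactor would opt into "extra-high" and accept more latency per response.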
The throughput comes from Cerebras' Wafer Scale Engine 3 (WSE-3). OpenAI and Cerebras have disclosed a hardware partnership, though they haven't published a detailed breakdown of its terms or scope. The WSE-3 spans 46,255 mm² of die area with 4 trillion transistors and 900,000 AI-optimized cores, delivering a claimed 125 petaflops of AI compute. Cerebras positions the chip as packing 19 times more transistors and 28 times more compute than NVIDIA's B200 — figures the company has not had independently validated at scale. OpenAI, like other frontier labs, is diversifying inference infrastructure away from GPU clusters for latency-sensitive workloads, a trend that also benefits inference-focused chip vendors such as Groq and SambaNova.
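Cerebras' multipliers can be roughly sanity-checked against NVIDIA's published B200 transistor count of about 208 billion; the compute baseline used below (~4.5 dense FP8 petaflops) is an assumption about which B200 figure the comparison rests on, since Cerebras hasn't specified it.

```python
# Sanity-check Cerebras' "19x transistors, 28x compute" claim.
# WSE-3 figures are from the announcement; the B200 transistor count
# is NVIDIA's published spec, and the ~4.5 PFLOPS dense FP8 baseline
# is an assumption about Cerebras' point of comparison.

wse3_transistors = 4e12      # 4 trillion (claimed)
wse3_petaflops = 125         # claimed AI compute

b200_transistors = 208e9     # NVIDIA published figure
b200_petaflops = 4.5         # assumed dense FP8 baseline

transistor_ratio = wse3_transistors / b200_transistors
compute_ratio = wse3_petaflops / b200_petaflops

print(f"{transistor_ratio:.1f}x transistors, {compute_ratio:.1f}x compute")
# → 19.2x transistors, 27.8x compute
```

Both ratios land close to the claimed 19x and 28x, which suggests the comparison is internally consistent even if the underlying peak-compute figure remains unvalidated.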
Hardware alone doesn't explain the numbers. OpenAI reworked its full request-response stack alongside the silicon: persistent WebSockets cut per-client roundtrip overhead by 80%, per-token overhead by 30%, and time-to-first-token by 50%. Hitting those latency figures required simultaneous tuning of model architecture, silicon, and networking — not just swapping in a faster chip.
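The WebSocket gain is easy to see with a back-of-the-envelope latency model: a fresh HTTPS request pays connection setup (TCP and TLS handshakes) on every call, while a persistent socket pays it once and then only pays per-message round trips. The handshake and round-trip costs below are illustrative assumptions, not OpenAI's measurements.

```python
# Toy latency model: per-request HTTPS vs. one persistent WebSocket.
# Handshake and round-trip costs are illustrative assumptions, not
# measurements from OpenAI's stack.

HANDSHAKE_MS = 80  # TCP + TLS setup per new connection (assumed)
RTT_MS = 20        # one request/response round trip (assumed)

def per_request_total(n_requests: int) -> float:
    """Each request opens a fresh connection: handshake every time."""
    return n_requests * (HANDSHAKE_MS + RTT_MS)

def persistent_total(n_requests: int) -> float:
    """One WebSocket: handshake once, then round trips only."""
    return HANDSHAKE_MS + n_requests * RTT_MS

n = 50
saved = 1 - persistent_total(n) / per_request_total(n)
print(f"overhead cut by {saved:.0%} over {n} requests")
```

Under these assumed numbers the persistent connection eliminates most of the per-client overhead, in the same ballpark as the 80% reduction OpenAI reports; the exact figure depends entirely on the handshake-to-round-trip ratio of the real deployment.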
Jack Pearce, who published the first detailed writeup at jackpearce.co.uk, draws an explicit parallel to xAI's grok-code-fast-1, crediting that model as the first to establish raw speed as a first-class product differentiator for developer tooling. Developers who have used Codex-Spark describe near-instant feedback as transformative for small UI changes and <a href="/news/2026-03-15-developer-month-long-test-openai-codex-5-3-no-code">rapid codebase Q&A</a> — the kind of iterative work where waiting two seconds per response compounds into friction. OpenAI and xAI now both have dedicated fast-coding entries on the market. Cursor hasn't said when Codex-Spark moves from research preview to general availability, and whether the throughput numbers hold at production traffic loads is the next open question.