Cumulus Compute Labs, a Y Combinator W26 startup, has launched IonRouter, an inference platform whose core claim is easy to state and harder to independently verify: run multiple LLMs simultaneously on a single NVIDIA Grace Hopper chip, and you get throughput that incumbent providers can't match at the price point.
The underlying engine, IonAttention, multiplexes models on a single GH200 or B200 GPU with millisecond-scale swap times, adapting in real time as request traffic shifts rather than queuing for a dedicated model instance. On a single GH200, Cumulus benchmarks IonRouter at 7,167 tokens per second on Qwen2.5-7B — roughly twice what leading inference providers deliver, the company says. That figure is batched throughput, not per-request latency, and the benchmark methodology isn't detailed in the launch materials. For context, vLLM in optimized production configurations on a comparable H100 typically reaches 3,000–5,000 tokens per second on 7B-class models under heavy batching; the GH200's unified CPU-GPU memory does offer a genuine advantage on memory-bound workloads, but the size of the gap is worth watching as independent benchmarks emerge. Cumulus is also an NVIDIA Inception member.
The model roster leans heavily on Chinese frontier labs: Zhipu AI's GLM-5, a 600B-parameter mixture-of-experts model with EAGLE speculative decoding across 8× B200 GPUs; MoonShot AI's Kimi-K2.5 for long-context reasoning; MiniMax's M2.5 with a 1M-token context window; and Alibaba's Qwen3.5-122B. GPT-OSS-120B is priced at $0.020 per million input tokens and $0.095 per million output tokens. On the multimodal side, Wan2.2 handles text-to-video generation in under 10 seconds via the FastGen runtime, while Black Forest Labs' Flux Schnell covers image generation at under 4 seconds.
For agent developers, the billing model may be the more practically relevant detail: per-second charges with no cold starts or idle costs, and an OpenAI-compatible API that requires only a base_url swap. A published case study puts five concurrent VLMs on a single GPU serving 2,700 video clips with sub-one-second initialization — a scenario that maps directly onto multi-agent pipelines where spinning up specialized vision models on demand is a recurring bottleneck. Dedicated GPU streams are available for custom LoRA and finetuned model deployments, tracked via Cumulus's own model comparison layer, Iondex.
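To make the "base_url swap" claim concrete, here is a minimal sketch of what OpenAI-compatibility implies for a client: the request body is the standard OpenAI chat-completions schema, and only the endpoint changes. The base URL and model identifier below are placeholders, not Cumulus's actual values; consult IonRouter's documentation for the real endpoint and model names.

```python
import json

# Hypothetical IonRouter endpoint -- the only thing a migrating client changes.
BASE_URL = "https://api.ionrouter.example/v1"

# Standard OpenAI chat-completions payload; an OpenAI-compatible provider
# accepts this shape unchanged.
payload = {
    "model": "qwen2.5-7b",  # placeholder model identifier
    "messages": [
        {"role": "user", "content": "Describe this video clip in one sentence."}
    ],
}

# With the official OpenAI SDK, the equivalent swap is a single argument:
#   client = OpenAI(base_url=BASE_URL, api_key=...)
#   client.chat.completions.create(**payload)
body = json.dumps(payload)
print(body)
```

Because the schema is unchanged, existing agent frameworks that speak the OpenAI API can be pointed at a provider like this without touching their request-construction code.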
IonRouter enters a market where Groq, Together AI, and Fireworks AI already have developer traction and where price competition on commodity models is aggressive enough to make margins thin. Cumulus's bet is that hardware-native engine design — built specifically for GH200's unified memory architecture — can sustain a throughput edge even as rivals migrate to the same silicon. The more immediate question is whether the model roster, skewed toward Chinese labs and refreshed via Iondex, actually reflects what agentic workflow developers want to run. If Cumulus adds the Western frontier models that dominate enterprise agent stacks before competitors close the hardware gap, the benchmarks start to matter. If not, fast inference on models few teams are using is a narrower advantage than the launch materials suggest.