Cumulus Compute Labs, a Y Combinator W26 startup, has launched IonRouter, an LLM inference platform built around a proprietary inference engine called IonAttention. Running on NVIDIA Grace Hopper GH200 superchips and B200 GPUs, the service claims 7,167 tokens per second on Qwen2.5-7B on a single GH200, roughly double the throughput Cumulus attributes to leading competitors. The core architectural approach is model multiplexing: IonAttention lets multiple models share a single GPU with millisecond-level swap times and real-time traffic adaptation, enabling 0ms cold starts for custom LoRA and fine-tuned model deployments billed per second with no idle costs. The platform exposes an OpenAI-compatible API requiring only a one-line base_url change, lowering integration friction for existing SDK users.
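The "one-line base_url change" claim can be sketched against the standard OpenAI Chat Completions request shape. The base URL and model slug below are hypothetical placeholders; IonRouter's actual endpoint and model identifiers are not stated here.

```python
# Sketch of the OpenAI-compatible integration pattern: only the base URL
# distinguishes one provider from another. The URL and model slug are
# illustrative placeholders, not confirmed IonRouter values.
import json
import urllib.request

BASE_URL = "https://api.ionrouter.example/v1"  # hypothetical endpoint

def build_chat_request(base_url: str, api_key: str, model: str,
                       messages: list) -> urllib.request.Request:
    """Build a Chat Completions POST against any OpenAI-compatible host."""
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps({"model": model, "messages": messages}).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_chat_request(
    BASE_URL, "YOUR_KEY", "qwen2.5-7b",
    [{"role": "user", "content": "Hello"}],
)
# Switching providers changes only BASE_URL; the request body is unchanged.
```

In the official OpenAI SDKs the same effect is achieved by passing a `base_url` argument to the client constructor, which is the one-line change the marketing refers to.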
<a href="/news/2026-03-14-ionrouter-yc-w26-launches-high-throughput-llm-inference-platform-powered-by-ionattention-engine">IonRouter's model catalog</a> is notable for its concentration of Chinese-origin frontier open-weight models. The lineup includes GLM-5 from ZhiPu AI (a 600B+ mixture-of-experts model served on 8x B200s with EAGLE speculative decoding), Kimi-K2.5 from MoonShot AI, MiniMax's 1M-context M2.5, Qwen3.5-122B-A10B from Alibaba, and Wan2.2 text-to-video, also an Alibaba product. A sixth language model, listed as "GPT-OSS-120B," carries no disclosed lab attribution and is priced at an aggressively low $0.020 per million input tokens. Flux Schnell from Black Forest Labs rounds out the image-generation offering. Targeted use cases include robotics VLM perception, multi-stream surveillance, game asset generation, and AI video pipelines, with a case study demonstrating five vision-language models running concurrently on a single GPU while processing 2,700 video clips.
ZhiPu AI, the maker of GLM-5, was added to the US Department of Commerce BIS Entity List in January 2025 over concerns about advancing PRC military modernization, making it the first Chinese large-model company to receive that designation. Hosting ZhiPu's open-weight models does not require a BIS license under current rules: the Biden-era AI Diffusion Rule explicitly exempted open-weight models, and it was rescinded in May 2025 with no replacement in effect. Any formal commercial relationship with ZhiPu, however, could carry distinct compliance exposure. The practice of Western inference providers hosting Chinese open-weight models is now broadly established (Together AI, Fireworks AI, and OpenRouter all do so), but NIST's CAISI program has documented that some Chinese models exhibit measurably higher alignment with CCP positions on sensitive topics, and independent security research has flagged elevated rates of sensitive enterprise data being submitted to Chinese AI tools. IonRouter's own marketing copy, meanwhile, refers to Qwen3.5-122B as "Cumulus's most capable open-source model," attributing an Alibaba product to the company itself, a framing that obscures model provenance.
Hacker News reception at launch was broadly positive on the technical merits but surfaced several gaps relevant to potential enterprise and agentic customers. Commenters flagged that pricing and model listings sit too far down the landing page. Of greater concern to enterprise buyers: cached input pricing, critical for agentic workflows that repeatedly reuse system prompts, is not currently disclosed, nor are model quantization levels. One commenter pointed to an apparent presence on OpenRouter under the name "Ionstream," suggesting possible multi-channel distribution, though Cumulus has not confirmed this. A privacy policy concern was also raised: IonRouter's terms state that input prompts are stored, with no disclosed retention period, a red flag for enterprise customers handling sensitive data. Cumulus has not publicly responded to these specific points as of publication.
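A back-of-envelope calculation shows why undisclosed cached-input pricing matters so much for agentic workloads. All rates and volumes below are illustrative assumptions, not IonRouter's published prices.

```python
# Illustrative cost model for an agent that re-sends the same system prompt
# on every call. Rates are hypothetical ($ per million input tokens), chosen
# only to show the shape of the effect, not IonRouter's actual pricing.

def cost_usd(calls, system_tokens, fresh_tokens, input_rate, cached_rate=None):
    """Total input cost for `calls` requests, each carrying a reusable
    `system_tokens`-token prompt plus `fresh_tokens` of new input."""
    if cached_rate is None:
        cached_rate = input_rate  # no cache discount: system prompt billed fresh
    reused = calls * system_tokens * cached_rate / 1e6
    fresh = calls * fresh_tokens * input_rate / 1e6
    return reused + fresh

# Hypothetical agent: 100k calls/day, a 4k-token system prompt, 500 fresh tokens.
no_cache = cost_usd(100_000, 4_000, 500, input_rate=0.20)
with_cache = cost_usd(100_000, 4_000, 500, input_rate=0.20, cached_rate=0.05)
# With these assumed rates, cached pricing cuts the daily input bill by ~2/3,
# because the reused system prompt dominates total input tokens.
```

The dominant term is the repeated system prompt, which is exactly why commenters singled out the missing cached-input rate as a blocker for evaluating agentic economics.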