Infinity Inc, an early-stage AI infrastructure company with a minimal public profile, published a case study this week claiming that its AI-generated LLM inference stack outperforms vLLM when serving Alibaba's Qwen3 model. The core methodology is an iterative, ML-inspired optimization loop: an AI system generates candidate changes to the inference stack, evaluates them against a performance objective, retains improvements, and discards regressions, a process the Hacker News community dubbed "AI-descent" applied to low-level systems engineering. Rather than hand-tuning GPU kernels and memory-management code the way the teams behind vLLM and SGLang do, Infinity Inc generates the stack itself through automated optimization.
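The generate-evaluate-retain loop the case study describes can be sketched in a few lines. Everything below is a toy stand-in under assumed names: the config keys, the scoring function, and the proposal heuristic are illustrative, not Infinity Inc's actual system.

```python
import random

def optimize(config, propose, score, iterations=200, seed=0):
    """Generic generate-evaluate-retain loop: propose a candidate change,
    benchmark it, and keep it only if the objective improves."""
    rng = random.Random(seed)
    best, best_score = dict(config), score(config)
    for _ in range(iterations):
        candidate = propose(best, rng)       # generated candidate change
        s = score(candidate)                 # evaluate against the objective
        if s > best_score:                   # retain improvements,
            best, best_score = candidate, s  # discard regressions
    return best, best_score

# Toy stand-in objective: "throughput" peaks at block_size=16, batch=32.
def score(cfg):
    return -((cfg["block_size"] - 16) ** 2 + (cfg["batch"] - 32) ** 2)

# Toy proposal heuristic: nudge one config parameter at random.
def propose(cfg, rng):
    new = dict(cfg)
    key = rng.choice(sorted(new))
    new[key] = max(1, new[key] + rng.choice([-2, -1, 1, 2]))
    return new

best, s = optimize({"block_size": 8, "batch": 8}, propose, score)
```

In Infinity Inc's framing the proposal step is an AI system editing real kernel and scheduling code, and the score is a benchmark run, but the accept/reject skeleton is the same hill-climbing shape.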
The case study lands two months after Inferact, the vLLM commercialization startup founded by vLLM maintainers, raised a $150 million seed round at an $800 million valuation, backed by a16z and Lightspeed. As <a href="/news/2026-03-14-ionrouter-yc-w26-launches-high-throughput-llm-inference-platform-powered-by-ionattention-engine">other inference platforms</a> likewise push performance boundaries, the technical community has met Infinity Inc's performance claims with measured skepticism. Commenters on Hacker News pointed to the conspicuous absence of token probability verification: a truly equivalent inference implementation should reproduce the reference engine's output token probabilities for any given prompt, so benchmark metrics alone cannot establish that Infinity Inc's stack is semantically equivalent to vLLM; in production, correctness is non-negotiable.
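The parity check commenters are asking for amounts to comparing per-token log-probabilities between the candidate stack and a reference engine on the same prompt. A minimal sketch, with a tolerance because floating-point kernels rarely match bit-for-bit; the function name and data here are illustrative, not any engine's API:

```python
import math

def logprob_parity(ref_logprobs, candidate_logprobs, atol=1e-4):
    """Return True if two engines assign (near-)identical log-probabilities
    to the same token sequence. atol absorbs benign floating-point drift;
    a systematic divergence indicates the implementations are not
    semantically equivalent."""
    if len(ref_logprobs) != len(candidate_logprobs):
        return False
    return all(math.isclose(a, b, abs_tol=atol)
               for a, b in zip(ref_logprobs, candidate_logprobs))

# Example: per-token logprobs from two engines for the same prompt.
ref = [-0.12, -2.31, -0.05]
cand = [-0.12, -2.31, -0.05]
match = logprob_parity(ref, cand)  # True
```

A real harness would run this across a large, diverse prompt set and every sampled position, since divergences often surface only under specific sequence lengths or batch shapes.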
Two additional technical gaps drew criticism. The stack described in the case study does not appear to support speculative decoding, a widely adopted optimization that delivers substantial latency gains on decode-heavy workloads, which narrows the scope of the performance comparison. The apparent absence of paged attention, the memory-management technique vLLM uses to store the KV cache in fixed-size blocks and prevent fragmentation under high concurrency, also raises questions about scalability in real-world serving scenarios. Without these capabilities, critics argue, the benchmark is not yet an apples-to-apples comparison against a fully configured vLLM deployment.
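For context, speculative decoding pairs a cheap draft model with the target model: the draft proposes a short run of tokens, the target verifies them in one pass, and only the longest agreeing prefix (plus one corrected token) is kept. A greedy-verification sketch over toy next-token functions; all names and the integer "vocab" are hypothetical, not any real engine's interface:

```python
def speculative_decode(target_next, draft_next, prompt, k=4, max_new=8):
    """Greedy-verification speculative decoding. target_next/draft_next
    map a token sequence to the next greedy token id."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        # 1. Cheap draft model proposes k tokens autoregressively.
        draft = []
        for _ in range(k):
            draft.append(draft_next(out + draft))
        # 2. Target model checks each position (one parallel pass in
        #    practice); keep the agreeing prefix plus one corrected token.
        accepted = []
        for i in range(k):
            t = target_next(out + draft[:i])
            if t == draft[i]:
                accepted.append(t)   # draft token confirmed
            else:
                accepted.append(t)   # replace with the target's token
                break
        out.extend(accepted)
    return out[:len(prompt) + max_new]

# Toy next-token functions (purely illustrative).
def target_next(seq):
    return (sum(seq) * 3 + 1) % 7

def draft_next(seq):
    # A deliberately imperfect draft: wrong on every third position.
    return target_next(seq) if len(seq) % 3 else (sum(seq) + 2) % 7

out = speculative_decode(target_next, draft_next, [1, 2], k=4, max_new=8)
```

The appeal is that greedy verification reproduces the target model's output exactly while committing several tokens per target-model pass, which is why its absence matters for a latency comparison.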
Infinity Inc itself remains largely opaque: the company has no detectable presence on Crunchbase, LinkedIn, or Y Combinator's directory, no named executives, and no public funding announcements. Its choice to lead with a technically detailed case study rather than conventional marketing is consistent with a founding team drawn from systems or ML research backgrounds. What the community wants next is a correctness proof, in the form of token probability parity against a reference implementation; until that arrives, the performance numbers mean little in a production context.