Researchers from Peking University and Microsoft Research Asia built MegaTrain, a system that trains 100 billion parameter language models on a single GPU. That's not a typo. On an H200 paired with 1.5TB of host memory, it handles models up to 120 billion parameters. It also trained a 7 billion parameter model with a 512,000 token context window on a single GH200.
The core problem is memory.
GPUs don't have enough of it for massive models, and traditional offloading approaches choke on the CPU-GPU bandwidth bottleneck. MegaTrain fights back with a pipelined double-buffered execution engine that overlaps parameter prefetching, computation, and gradient offloading across multiple CUDA streams. It also uses stateless layer templates that bind weights dynamically as they stream in, dumping the memory overhead of persistent autograd graphs. The result: 1.84 times the training throughput of DeepSpeed ZeRO-3 with CPU offloading for 14 billion parameter models.
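To make the overlap concrete, here is a minimal pure-Python sketch of the idea. The names (`fetch_weights`, `stateless_forward`, `pipelined_forward`) are illustrative, not MegaTrain's actual API: a background thread stands in for the CUDA copy stream, a two-slot bounded queue plays the role of the double buffer, and the stateless layer binds weights per call instead of holding persistent parameters.

```python
import queue
import threading

def fetch_weights(layer_id):
    """Stand-in for a host-to-GPU parameter copy (the prefetch step)."""
    return {"layer": layer_id, "w": [layer_id] * 4}

def stateless_forward(weights, x):
    """A stateless layer template: weights are bound at call time,
    so no persistent parameters or autograd graph stay resident."""
    return x + sum(weights["w"])

def pipelined_forward(num_layers, x):
    """Overlap 'copy' and 'compute': while layer i runs, the
    prefetcher is already loading layer i+1 into the double buffer."""
    prefetched = queue.Queue(maxsize=2)  # two slots = double buffering

    def prefetcher():
        for i in range(num_layers):
            prefetched.put(fetch_weights(i))  # would run on a copy stream

    t = threading.Thread(target=prefetcher)
    t.start()
    for _ in range(num_layers):
        w = prefetched.get()           # weights arrive just in time
        x = stateless_forward(w, x)    # would run on the compute stream
    t.join()
    return x

print(pipelined_forward(3, 0))
```

In the real system the producer and consumer are CUDA streams rather than Python threads, and gradients flow back to host memory on a third stream, but the structure is the same: a bounded buffer keeps the compute stream fed without ever holding more than a couple of layers' weights on the GPU.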
Throughput is still nowhere near what you'd get with a proper distributed setup. Training large models on one GPU remains slow. And as Hacker News commenters pointed out, manual offloading strategies like this have been used in the wild for a while. MegaTrain's contribution is making the approach systematic and efficient, not inventing it from scratch.
The real value is access. If you're a researcher or small team who can't get time on a multi-GPU cluster, MegaTrain lets you experiment with genuinely large models on hardware you might actually have. For the agent space, that could mean fine-tuning models for tool use or domain-specific reasoning without a cluster budget.