Training a 100-billion parameter model usually requires a cluster of GPUs. Researchers Zhengqing Yuan, Hanchi Sun, Lichao Sun, and Yanfang Ye built MegaTrain, a system that does it on a single GPU. The trick is simple in concept: stop treating the GPU as the memory authority. MegaTrain stores all parameters and optimizer states in CPU RAM, streaming them to the GPU layer by layer for computation. On an NVIDIA H200 paired with 1.5TB of host memory, the team reliably trained models up to 120 billion parameters.
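The offload-centric loop can be illustrated with a CPU-only sketch: all weights live in host storage, and a single small device buffer holds one layer at a time while gradients flow back to the host. Every name here (`HOST_PARAMS`, `upload`, `offload_grads`) is illustrative, not MegaTrain's actual API, and the forward/gradient math is a stand-in:

```python
# Host RAM is the memory authority: parameters never live on the "device"
# for longer than one layer's compute. Stdlib-only simulation, no GPU.
HOST_PARAMS = {f"layer{i}": [0.1 * i] * 4 for i in range(3)}  # stays in CPU RAM

def upload(layer_name):
    """Copy one layer's weights host -> device (here: a plain list copy)."""
    return list(HOST_PARAMS[layer_name])

def compute(weights, activations):
    """Stand-in for a layer's forward pass."""
    return [a + w for a, w in zip(activations, weights)]

def offload_grads(layer_name, grads, lr=0.01):
    """Stream gradients back and update the host-resident parameters."""
    HOST_PARAMS[layer_name] = [w - lr * g
                               for w, g in zip(HOST_PARAMS[layer_name], grads)]

def train_step(activations):
    for name in HOST_PARAMS:                  # stream layers one at a time
        device_weights = upload(name)         # host -> device copy
        activations = compute(device_weights, activations)
        offload_grads(name, grads=[1.0] * 4)  # dummy gradient for illustration
    return activations

out = train_step([0.0] * 4)
```

The point is that peak device memory is one layer's working set, not the whole model, which is why a 120B-parameter model fits next to 1.5TB of host RAM.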

The approach fights the obvious bottleneck (CPU-GPU bandwidth) with two techniques. A pipelined double-buffered execution engine overlaps parameter prefetching, computation, and gradient offloading across multiple CUDA streams to keep the GPU fed. MegaTrain also ditches persistent autograd graphs in favor of stateless layer templates that bind weights as they arrive, cutting metadata overhead and giving the scheduler flexibility.
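Both ideas can be sketched together in a CPU-only simulation: a background worker prefetches layer i+1's weights while layer i computes (the double buffer), and the "layer" is a stateless function that binds whatever weights arrive rather than a persistent module. In the real system the copier is a separate CUDA stream over PCIe; here a thread pool and a `sleep` stand in for it, and all names are illustrative:

```python
import time
from concurrent.futures import ThreadPoolExecutor

HOST_WEIGHTS = [float(i) for i in range(4)]  # one scalar "weight" per layer

def fetch(i):
    time.sleep(0.01)          # stands in for the host->device PCIe copy
    return HOST_WEIGHTS[i]

def layer_template(weight, x):
    # Stateless template: no persistent autograd graph or module state;
    # the weight is bound at call time, as it arrives from the copier.
    return x + weight

def forward(x):
    with ThreadPoolExecutor(max_workers=1) as copier:
        pending = copier.submit(fetch, 0)                # prefetch layer 0
        for i in range(len(HOST_WEIGHTS)):
            weight = pending.result()                    # wait on current buffer
            if i + 1 < len(HOST_WEIGHTS):
                pending = copier.submit(fetch, i + 1)    # overlap next copy
            x = layer_template(weight, x)                # compute meanwhile
    return x

print(forward(0.0))  # 0+0+1+2+3 = 6.0
```

With the copy for layer i+1 in flight during layer i's compute, the transfer cost hides behind computation whenever a layer's math takes at least as long as its weight copy.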

The benchmarks are concrete. When training a 14-billion parameter model, MegaTrain hit 1.84 times the throughput of DeepSpeed ZeRO-3 with CPU offloading. On an NVIDIA GH200, it trained a 7-billion parameter model with 512,000-token context lengths, the kind of sequence length that normally demands distributed infrastructure. But the team doesn't compare against a proper multi-GPU setup, so we don't know the real speed penalty. The 1.84x figure is versus another single-GPU offloading approach, not a cluster. And it's unclear whether that advantage holds at the 120B parameter scale.

Hacker News commenters noted similarities to PyTorch's existing CPUOffloadPolicy in its FullyShardedDataParallel module, suggesting pieces of this approach could land in stock PyTorch. If that happens, the barrier shifts from GPU cost to system RAM capacity. For teams fine-tuning large models on constrained hardware, that changes what's actually buildable.