Hugging Face researchers have published a comprehensive technical survey of the open-source reinforcement learning ecosystem, benchmarking 16 libraries built for LLM post-training. The goal was practical: extract architectural lessons to inform TRL's own upcoming async trainer. The problem they set out to solve isn't complicated to describe. Synchronous RL training leaves GPUs idle for most of the wall-clock time because autoregressive generation is so slow. A single batch of 32K-token rollouts on a 32B-parameter model can take hours, during which the training GPUs do nothing. As reasoning models push chain-of-thought sequences longer, and as agentic workloads introduce unpredictable latencies — a tool call might return in seconds, a multi-step reasoning chain might not come back for hours — that idle time has become one of the more expensive inefficiencies in post-training infrastructure.

The answer most of these libraries have settled on is the same: split inference and training across separate GPU pools, connect them with a rollout buffer, and sync model weights asynchronously so neither side waits on the other. Of the 16 libraries evaluated — TRL, ART, AReaL, Atropos, NeMo-RL, OAT, PipelineRL, MiniMax Forge, and eight others, representing teams at Hugging Face, NVIDIA, Nous Research, Allen AI, Ant Group, and Tsinghua IIIS — Ray has become the default orchestration layer for half of them, and most transfer weights via NCCL broadcast. LoRA training support, notably, is still sparse across the board. The differentiator the researchers see emerging at the next tier of scale is distributed Mixture-of-Experts support.
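The shape of that architecture can be sketched in a few lines. This is a hypothetical toy, not any library's actual API: one thread stands in for the inference pool pushing rollouts into a bounded buffer, another stands in for the trainer consuming them and periodically publishing new weights. All names (`rollout_buffer`, `inference_worker`, `sync_every`) are invented for illustration; real systems replace the dict-and-lock with NCCL broadcast across GPU pools.

```python
import queue
import threading

# Toy sketch of a decoupled async RL loop (illustrative names throughout).
rollout_buffer = queue.Queue(maxsize=8)  # bounded: backpressure, not a stall
weights = {"version": 0}                 # stand-in for the shared policy weights
weights_lock = threading.Lock()
STOP = object()                          # sentinel to end the run

def inference_worker(num_rollouts):
    """Generate rollouts against whatever weight snapshot is current."""
    for i in range(num_rollouts):
        with weights_lock:               # cheap read of the latest version
            version = weights["version"]
        rollout = {"tokens": [i, i + 1], "policy_version": version}
        rollout_buffer.put(rollout)      # blocks only when the buffer is full
    rollout_buffer.put(STOP)

def trainer(sync_every=4):
    """Consume rollouts; publish updated weights asynchronously."""
    steps = 0
    while True:
        rollout = rollout_buffer.get()
        if rollout is STOP:
            break
        steps += 1                       # a gradient step would happen here
        if steps % sync_every == 0:      # async weight publication
            with weights_lock:
                weights["version"] += 1

gen = threading.Thread(target=inference_worker, args=(12,))
trn = threading.Thread(target=trainer)
gen.start(); trn.start()
gen.join(); trn.join()
print(weights["version"])  # 12 rollouts, sync every 4 -> 3 published updates
```

The bounded queue is the key design choice: it lets generation run ahead of training (so GPUs on either side rarely idle) while capping how off-policy the buffered rollouts can get.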

A few libraries are worth calling out for where they've placed their bets. OpenPipe's ART targets multi-step agentic workloads via GRPO, offering a serverless RL product built on Weights & Biases that claims 40% cost savings through shared inference cluster multiplexing, with LangGraph and MCP-RL hooks. AReaL, a joint project from Tsinghua IIIS and Ant Group built on top of ReaLHF, is a fully async system that has already run at 235B MoE scale with the AReaL-SEA model, posting competitive numbers on math, coding, and search agent benchmarks — including a claimed edge over GPT-5 on τ²-bench. Nous Research's Atropos treats RL environments as independent microservices connected through a trajectory API, a design that maps naturally onto how production agentic systems already work.

The survey closes by noting that several problems in this space look different on the surface but share the same underlying structure. Process reward models add synchronization checkpoints mid-rollout. Multi-agent co-evolution multiplies the straggler problem. On-policy distillation — where a student generates sequences and a teacher scores them — runs the same pipeline as async RL, with the teacher forward pass substituted for the reward function. TRL's own async trainer design, featuring a bounded queue, per-token model versioning to sidestep double-buffering, and partial rollout support for long agentic trajectories, reflects these patterns throughout. For anyone currently making infrastructure choices around agent training, the survey is a useful map of what the field has actually built.
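Per-token model versioning paired with partial rollouts can be illustrated with a small sketch. This is an assumption-laden toy, not TRL's actual implementation: the class name, `max_lag` parameter, and mask semantics are all invented to show the idea that tagging each token with the policy version that produced it lets the trainer discount stale tokens, rather than freezing a second full copy of the weights for the duration of a long trajectory.

```python
from dataclasses import dataclass, field

@dataclass
class PartialRollout:
    """Hypothetical sketch: a trajectory generated in chunks across updates."""
    tokens: list = field(default_factory=list)
    versions: list = field(default_factory=list)  # policy version per token

    def extend(self, new_tokens, policy_version):
        # A long agentic trajectory may be produced in several chunks,
        # with weight updates landing between chunks.
        self.tokens.extend(new_tokens)
        self.versions.extend([policy_version] * len(new_tokens))

    def staleness_mask(self, current_version, max_lag=1):
        # 1.0 for tokens fresh enough to train on, 0.0 for stale ones.
        return [1.0 if current_version - v <= max_lag else 0.0
                for v in self.versions]

# Usage: one rollout spanning three policy versions.
r = PartialRollout()
r.extend([101, 102], policy_version=0)   # chunk generated under v0
r.extend([103], policy_version=1)        # weights updated, resume under v1
r.extend([104, 105], policy_version=2)   # updated again, finish under v2

print(r.staleness_mask(current_version=2, max_lag=1))
# -> [0.0, 0.0, 1.0, 1.0, 1.0]
```

In a real system the mask would typically be a per-token importance weight rather than a hard cutoff, but the bookkeeping is the same: version metadata travels with the tokens, so no double-buffered weight copy is needed.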