The standard playbook for evaluating RL-trained LLM agents has a blind spot: nearly every benchmark tests performance on tasks drawn from the same environments used during training. A 14-author team from Fudan University, Meituan, and the Shanghai Artificial Intelligence Laboratory set out to close that gap. Their paper, arXiv:2603.12011, published March 12, 2026, asks directly whether reinforcement fine-tuning (RFT) generalizes beyond training distribution — and finds the answer depends heavily on what kind of generalization you're measuring.

The study runs RFT across five benchmark environments — WebShop, SearchQA, TextCraft, AlfWorld, and BabyAI — chosen to vary in reward density, action interface structure, and world knowledge requirements. Lead author Zhiheng Xi and corresponding authors Tao Gui, Qi Zhang, and Xuanjing Huang from Fudan's NLP Lab organized the analysis along three axes: within-environment generalization (training on easy tasks, testing on harder ones in the same environment), cross-environment transfer (training in one environment, testing in a structurally different unseen one), and multi-environment training, run both sequentially and as a mixture.
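To make the three axes concrete, here is an illustrative sketch (not the paper's code) of how the train/test pairings could be enumerated. The environment names come from the article; the `easy`/`hard` split labels and the pairing logic are assumptions for illustration.

```python
# Environments named in the study.
ENVS = ["WebShop", "SearchQA", "TextCraft", "AlfWorld", "BabyAI"]

def evaluation_suites(envs):
    """Enumerate train/test pairings for the three generalization axes.

    Hypothetical structure: each suite records which environment(s) an
    agent is fine-tuned on and which split it is evaluated on.
    """
    suites = []
    # Axis 1: within-environment -- train on easy tasks, test on harder
    # variants of the same environment.
    for env in envs:
        suites.append({"axis": "within",
                       "train": [f"{env}:easy"],
                       "test": f"{env}:hard"})
    # Axis 2: cross-environment -- train in one environment, test in a
    # structurally different, unseen one (all ordered pairs).
    for src in envs:
        for dst in envs:
            if src != dst:
                suites.append({"axis": "cross",
                               "train": [src],
                               "test": dst})
    # Axis 3: multi-environment -- train across several environments,
    # then test on a held-out one (sequential or mixed order).
    for held_out in envs:
        curriculum = [e for e in envs if e != held_out]
        suites.append({"axis": "multi",
                       "train": curriculum,
                       "test": held_out})
    return suites

suites = evaluation_suites(ENVS)
print(len([s for s in suites if s["axis"] == "cross"]))  # 20 ordered pairs (5 * 4)
```

With five environments this yields 5 within-environment suites, 20 cross-environment pairings, and 5 held-out multi-environment suites.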

Within a single environment, RFT holds up well — agents learn environment-level behavioral patterns and apply them to harder task variants. Cross-environment transfer is the weak point: performance degradation tracks with mismatches in both semantic priors and observation-action interfaces, so when background knowledge requirements and interface structures differ simultaneously, agents struggle. The sequential and mixture results are more encouraging. Training across environments in sequence produces meaningful downstream gains with little catastrophic forgetting; training across environments simultaneously offers the best overall balance between generalization and stability.
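The two multi-environment regimes differ only in how training batches are scheduled. A minimal sketch of the distinction, assuming a simple step-level view (the function names and the uniform sampling in the mixture case are hypothetical, not taken from the paper):

```python
import random

def sequential_schedule(envs, steps_per_env):
    """Sequential training: finish all updates in one environment
    before moving to the next, risking forgetting of earlier ones."""
    for env in envs:
        for _ in range(steps_per_env):
            yield env  # one gradient update on a batch from `env`

def mixture_schedule(envs, total_steps, seed=0):
    """Mixture training: sample the environment uniformly at every
    step, so all environments stay in the data stream throughout."""
    rng = random.Random(seed)
    for _ in range(total_steps):
        yield rng.choice(envs)

envs = ["WebShop", "TextCraft", "BabyAI"]
seq = list(sequential_schedule(envs, steps_per_env=2))
mix = list(mixture_schedule(envs, total_steps=6))
print(seq)  # ['WebShop', 'WebShop', 'TextCraft', 'TextCraft', 'BabyAI', 'BabyAI']
```

The study's finding maps onto this contrast: the sequential schedule still transfers with little forgetting, while the mixture schedule gives the best balance between generalization and stability.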

Meituan's presence in the author list, through co-author Xunliang Cai of the company's LongCat Team, isn't incidental. Meituan's LongCat-Flash-Thinking-2601 — a 560-billion-parameter open-source mixture-of-experts model trained with RL across more than 10,000 environments — is a direct test case for the generalization questions the paper investigates. Meituan's operations span food delivery, grocery, hotel booking, and drone logistics, exactly the kind of heterogeneous multi-environment problem the study examines. The finding that sequential and mixture training transfers positively with minimal forgetting makes the case for <a href="/news/2026-03-14-enterprise-context-layer-parallel-agents-organizational-knowledge">deploying generalizable agent models across multiple business domains</a>, rather than separate specialized models for each service line.