In mid-2024, independent researcher David Noel Ng claimed the top spot on the HuggingFace Open LLM Leaderboard without training for a single step, merging weights from different models, or running gradient descent. His model, dnhkng/RYS-XLarge, beat well-funded labs and fine-tuning specialists across all six benchmark categories by doing something far stranger: he took Alibaba's open-weight Qwen2-72B, duplicated a specific block of seven middle transformer layers, stitched the result back together, and ran the whole thing on two consumer NVIDIA RTX 4090 GPUs using ExLlamaV2 quantized inference.
No new compute budget. No research team. Just architectural surgery on existing weights.
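In practice, the surgery is mostly index bookkeeping: build a new layer ordering in which one contiguous block appears twice, then load the corresponding weights into a deeper stack. A minimal sketch of that ordering logic, with illustrative indices rather than Ng's actual block:

```python
# Hypothetical sketch of a "self-merge" by layer duplication. The start
# index and block length below are illustrative; the specific seven-layer
# block RYS-XLarge repeats is not reproduced here.

def duplicate_block(num_layers, start, length):
    """Return the new layer ordering after repeating the block
    [start, start + length) once, immediately after itself."""
    order = list(range(num_layers))
    block = order[start:start + length]
    # The copied block runs right after the original, so those layers
    # execute twice in sequence with identical weights.
    return order[:start + length] + block + order[start + length:]

# Qwen2-72B has 80 decoder layers; repeating a 7-layer middle block
# yields an 87-layer model with no new parameters learned.
order = duplicate_block(80, start=40, length=7)
print(len(order))    # 87
print(order[45:50])  # [45, 46, 40, 41, 42]: the block repeats
```

No weights change; only the depth and the ordering do, which is why the whole experiment fits on inference-grade hardware.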
The method grew from what Ng calls 'LLM Neuroanatomy' — a hypothesis assembled from two odd observations. The first came from jailbreaking experiments: sufficiently capable 2023-era models could accept, reason over, and respond in Base64-encoded text, a format well outside their training distribution. That pointed toward early transformer layers acting as format-agnostic translators, converting inputs into a shared abstract internal representation before late layers reverse the process on the way out. The second came in November 2023, when a HuggingFace user named Alpindale released Goliath-120b — a 'Frankenmerge' that stitched two Llama-2-70B fine-tunes together by interleaving their layers in ways that fed later-layer outputs back into earlier-layer inputs. By conventional ML logic, this should have produced garbage. It didn't.
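The interleaving behind a Frankenmerge is easy to state concretely: overlapping layer ranges are taken alternately from two source models and stacked in order, so a later-layer output of one model feeds an earlier layer of the other. A sketch with illustrative ranges, not Goliath-120b's actual recipe:

```python
# Hypothetical Goliath-style layer plan. Ranges are made up for
# illustration; the real merge used different slices of two
# Llama-2-70B fine-tunes.

def interleave_plan(ranges):
    """ranges: (model_name, start, end) slices in stack order.
    Returns the flattened (model, layer) sequence of the merged model."""
    plan = []
    for model, start, end in ranges:
        plan.extend((model, i) for i in range(start, end))
    return plan

plan = interleave_plan([
    ("A", 0, 20),   # model A, layers 0-19
    ("B", 10, 30),  # model B, layers 10-29: B's layer 10 now consumes
    ("A", 20, 40),  # the output of A's layer 19, far "deeper" than the
    ("B", 30, 50),  # activations it was trained on
])
print(len(plan))  # 80 layers in the merged stack
```

By conventional intuition, every seam in that plan hands a layer activations it never saw during training; the surprise was how little that mattered.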
Together, those two data points pointed somewhere interesting: middle transformer layers might be functionally modular and robust to rearrangement in ways the field hadn't systematically explored.
To test it, Ng built what he calls a 'brain scanner' — a sweep of 3,241 candidate layer-loop configurations across the 80-layer Qwen2-72B architecture. He evaluated candidates using fast proxy tasks and a logit-weighted LLM-as-judge scoring system designed to surface promising configurations without running full benchmark evaluations on each one. The winning configuration, duplicating seven specific middle layers, improved performance across IFEval, BBH, MATH Lvl 5, GPQA, MuSR, and MMLU-PRO — all six benchmarks, not just the ones you might expect to move.
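The 3,241 figure is suggestive on its own. One plausible reading, and it is only an assumption here, is that the sweep covered every contiguous block of the 80-layer stack (80 × 81 / 2 = 3,240 choices of start and length) plus the unmodified baseline. A sketch of that enumeration:

```python
# Hedged sketch of the candidate space. Whether Ng's "brain scanner"
# enumerated exactly this set is an assumption; the arithmetic just
# happens to land on 3,241 for an 80-layer model. Scoring each
# candidate (fast proxy tasks, logit-weighted LLM-as-judge) is
# omitted, since those details are not public.

def candidate_configs(num_layers=80):
    """Yield (start, length) for every contiguous duplication block,
    plus (0, 0) for the untouched baseline."""
    yield (0, 0)  # baseline: duplicate nothing
    for length in range(1, num_layers + 1):
        for start in range(num_layers - length + 1):
            yield (start, length)

configs = list(candidate_configs(80))
print(len(configs))  # 3241
```

Whatever the exact search space, the point of the proxy scoring is the same: make each candidate cheap enough to evaluate that a solo researcher can afford thousands of them.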
Ng's argument is that middle layers function as a kind of universal abstract reasoning cortex: modular enough to tolerate duplication, and apparently even to benefit from it. Whether the broader research community accepts that framing or reaches for a more conservative mechanistic explanation, the result is hard to dismiss. A solo researcher with consumer hardware and no training budget topped a competitive leaderboard by rearranging architecture rather than scaling compute. He published the work as a blog post rather than a paper — a deliberate choice given the speed at which the community moves. It circulated fast and widely; the decision looks correct.