David Noel Ng beat every major AI lab on Hugging Face's Open LLM Leaderboard in mid-2024 without running a single training step. No fine-tuning, no gradient descent, no weight updates of any kind. He took Alibaba's Qwen2-72B, found seven consecutive layers in the middle of the network, copied them, and stitched the architecture back together. The resulting model, dnhkng/RYS-XLarge, topped the leaderboard across six benchmarks including IFEval, MATH Level 5, GPQA, and MMLU-Pro, running on a pair of consumer RTX 4090 GPUs.
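Mechanically, a self-merge of this kind amounts to re-sequencing the decoder stack. The sketch below uses plain integers to stand in for transformer blocks; the split indices and function name are illustrative, not Ng's actual configuration, and the duplicated block reuses the same layers, so no parameters are copied or trained.

```python
def self_merge(layers, start, end):
    """Return the layer sequence after duplicating the block [start, end).

    `layers` is a stand-in list for a model's decoder blocks. The repeated
    block shares weights with the original: the network just visits those
    layers twice on the forward pass.
    """
    return layers[:end] + layers[start:end] + layers[end:]

# Qwen2-72B has 80 decoder layers; duplicating a 7-layer block in the
# middle (indices illustrative) yields an 87-layer model.
base = list(range(80))
merged = self_merge(base, 38, 45)
print(len(merged))  # 87
```

The key point the sketch makes concrete: the output is a deeper network whose extra depth costs nothing to produce, only memory and latency to run.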

The idea came from two earlier observations that stuck with Ng. Capable LLMs from 2023 could handle inputs encoded entirely in Base64, a tokenization pattern far removed from normal training data, and still produce coherent answers. Something inside the model was converting the input into an abstract internal form, reasoning there, and translating back out. Around the same time, a Hugging Face user named Alpindale released Goliath-120b, a model built by interleaving layers from two separate Llama-2 70B fine-tunes in a configuration no model had ever been trained on. It worked anyway. Both cases pointed to transformer architectures that were more modular than most researchers had assumed.
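The Goliath recipe can be sketched the same way. The version below alternates fixed-size chunks from two same-architecture layer lists; the real Goliath-120b used hand-picked, partly overlapping ranges rather than a regular stride, so treat the chunk size and boundaries here as illustrative.

```python
def interleave(a_layers, b_layers, chunk):
    """Frankenmerge sketch: alternate contiguous chunks of decoder layers
    from two fine-tunes of the same base model. No weights change; the
    layers are simply re-sequenced into one deeper network.
    """
    assert len(a_layers) == len(b_layers)
    out = []
    for k in range(0, len(a_layers), chunk):
        out += a_layers[k:k + chunk]
        out += b_layers[k:k + chunk]
    return out

# Two 80-layer Llama-2 70B fine-tunes, interleaved in 16-layer chunks,
# give a 160-layer stack (chunking scheme is illustrative).
merged = interleave([f"A{n}" for n in range(80)],
                    [f"B{n}" for n in range(80)], 16)
print(len(merged))  # 160
```

What made Goliath surprising is exactly what the sketch shows: every chunk boundary hands activations to layers that never saw that input distribution during training, and the model remains coherent anyway.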

To find out how modular, Ng ran a sweep across 3,241 layer-duplication configurations — every (i, j) index pair representing a possible duplication range — evaluated with ExLlamaV2 on his two-GPU setup. He called it a brain scanner. The winning configuration was a seven-layer block in the network's middle third. From this, Ng developed what he calls a theory of LLM neuroanatomy: early layers convert tokens into format-agnostic abstract representations; middle layers do the reasoning; late layers translate back into text. Duplicate the middle section and you get more reasoning capacity without touching a single weight.
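The sweep space is easy to reconstruct. Assuming Qwen2-72B's 80 decoder layers, enumerating every contiguous block [i, j) yields 3,240 candidate duplications; adding the unmodified baseline lands on 3,241, though the exact counting convention is our reading, not something the writeup spells out.

```python
def duplication_ranges(n_layers):
    """Every contiguous block [i, j) of decoder layers that could be
    duplicated: each (i, j) index pair with 0 <= i < j <= n_layers.
    """
    return [(i, j) for i in range(n_layers)
                   for j in range(i + 1, n_layers + 1)]

ranges = duplication_ranges(80)   # Qwen2-72B has 80 decoder layers
print(len(ranges))                # 3240 ranges (+1 baseline run = 3241)
```

Each range then becomes one evaluation run of the modified model, which is why quantized inference via ExLlamaV2 on two 4090s was the bottleneck rather than any training compute.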

For agent developers, this matters beyond benchmark trivia. If a model's performance ceiling can be raised through architectural surgery alone — no GPU cluster, no training run — it changes what's achievable on constrained hardware. The quality of a base model's middle layers may be a better predictor of ceiling than raw parameter count or training budget. Ng did the original work in 2024; the full writeup only arrived recently, handing mechanistic interpretability researchers a concrete, testable framework and giving anyone building agents on consumer hardware a reason to look more carefully at what's happening inside the stack.