Kazushige Goto's 2008 paper, "Anatomy of High-Performance Matrix Multiplication," reads like a victory lap. Working alone, he'd built a math library called GotoBLAS that beat Intel's Math Kernel Library on Intel's own processors. Not by a little. Benchmarks showed GotoBLAS outperforming MKL by 10-20% on general matrix multiplication.
How? Obsessive optimization for memory hierarchies. Goto wrote "micro-kernels," small hand-tuned assembly routines that kept data flowing through CPU registers and caches without stalling. He used a technique called "packing" to rearrange matrix data into cache-friendly blocks before computation. Most library writers treated cache management as an afterthought. Goto made it the whole game.
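Here is a minimal sketch of those two ideas in plain C rather than the hand-tuned assembly GotoBLAS actually uses. The block sizes, buffer sizes, and function names are invented for readability, and a real implementation adds further cache-level blocking on top of this.

```c
/* Toy illustration of packing + a register-blocked micro-kernel.
   All sizes are illustrative, not tuned for any CPU. */
#include <stdio.h>

#define MR 4   /* micro-kernel tile: MR rows of C ... */
#define NR 4   /* ... by NR columns of C              */

/* Micro-kernel: accumulate a small MR x NR tile of C in local variables
   (which the compiler keeps in registers), streaming through packed
   panels of A and B with unit stride. */
static void micro_kernel(int k, const double *a_panel, const double *b_panel,
                         double *c, int ldc)
{
    double acc[MR][NR] = {{0.0}};
    for (int p = 0; p < k; ++p)
        for (int i = 0; i < MR; ++i)
            for (int j = 0; j < NR; ++j)
                acc[i][j] += a_panel[p * MR + i] * b_panel[p * NR + j];
    for (int i = 0; i < MR; ++i)
        for (int j = 0; j < NR; ++j)
            c[i * ldc + j] += acc[i][j];
}

/* Packing: copy strided slices of A and B into contiguous buffers laid out
   in exactly the order the micro-kernel reads them, so the hot loop never
   strides across cache lines. */
static void pack_a(int k, const double *A, int lda, double *buf)
{
    for (int p = 0; p < k; ++p)
        for (int i = 0; i < MR; ++i)
            buf[p * MR + i] = A[i * lda + p];   /* MR rows of column p */
}

static void pack_b(int k, const double *B, int ldb, double *buf)
{
    for (int p = 0; p < k; ++p)
        for (int j = 0; j < NR; ++j)
            buf[p * NR + j] = B[p * ldb + j];   /* row p, NR columns */
}

/* C (m x n) += A (m x k) * B (k x n), all row-major; m and n are assumed
   to be multiples of MR and NR, and k <= 256, to keep the example short. */
static void gemm(int m, int n, int k,
                 const double *A, const double *B, double *C)
{
    double a_buf[256 * MR], b_buf[256 * NR];
    for (int j = 0; j < n; j += NR) {
        pack_b(k, &B[j], n, b_buf);
        for (int i = 0; i < m; i += MR) {
            pack_a(k, &A[i * k], k, a_buf);
            micro_kernel(k, a_buf, b_buf, &C[i * n + j], n);
        }
    }
}

int main(void)
{
    enum { M = 8, N = 8, K = 8 };
    double A[M * K], B[K * N], C[M * N] = {0.0};
    for (int i = 0; i < M * K; ++i) A[i] = i % 7;
    for (int i = 0; i < K * N; ++i) B[i] = i % 5;
    gemm(M, N, K, A, B, C);
    printf("C[0][0] = %g\n", C[0]);   /* sanity check */
    return 0;
}
```

The point of the structure is that the innermost loop touches only the small packed buffers and the C tile, which is what keeps data flowing through registers and cache instead of stalling on main memory.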
The results were striking. On some operations, his code hit over 90% of theoretical peak performance on hardware he didn't design. GotoBLAS became a standard in high-performance computing, and after Goto stopped maintaining it, the code was forked into OpenBLAS, which still ships with major Linux distributions and underpins much of today's AI infrastructure.
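For a sense of what "theoretical peak" means: peak FLOP rate is cores times clock frequency times floating-point operations per cycle, and efficiency is the measured rate divided by that number. The figures below are invented for illustration and do not describe the hardware in Goto's benchmarks.

```c
/* What "percent of theoretical peak" means, with made-up numbers. */
#include <stdio.h>

int main(void)
{
    double cores = 1.0;            /* single-core run (illustrative)       */
    double ghz = 2.4;              /* clock frequency in GHz               */
    double flops_per_cycle = 2.0;  /* e.g. one fused multiply-add / cycle  */
    double peak_gflops = cores * ghz * flops_per_cycle;      /* 4.8        */

    double measured_gflops = 4.4;  /* hypothetical benchmark result        */
    printf("efficiency: %.0f%% of peak (%.1f / %.1f GFLOP/s)\n",
           100.0 * measured_gflops / peak_gflops,
           measured_gflops, peak_gflops);
    return 0;
}
```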
Matrix multiplication is the computational bottleneck in deep learning. The packing and micro-kernel techniques Goto documented are baked into how modern frameworks handle linear algebra, particularly for LLM inference. His paper predates the deep learning boom by years, but the optimization playbook it describes is still the one everyone follows.
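To make the connection concrete, here is a hedged sketch of how a single transformer projection layer reduces to one GEMM call through the CBLAS interface that OpenBLAS exposes. The dimensions are illustrative stand-ins, not taken from any particular model.

```c
/* Why GEMM dominates LLM inference: every linear layer is one matrix
   multiply. Build with something like: cc demo.c -lopenblas */
#include <stdlib.h>
#include <stdio.h>
#include <cblas.h>

int main(void)
{
    int tokens  = 512;   /* sequence length (illustrative)     */
    int d_model = 1024;  /* hidden width (illustrative)        */
    int d_ff    = 4096;  /* feed-forward width (illustrative)  */

    /* Activations X (tokens x d_model), weights W (d_model x d_ff),
       output Y (tokens x d_ff), all row-major. */
    double *X = calloc((size_t)tokens * d_model, sizeof *X);
    double *W = calloc((size_t)d_model * d_ff, sizeof *W);
    double *Y = calloc((size_t)tokens * d_ff, sizeof *Y);

    /* Y = 1.0 * X * W + 0.0 * Y -- one GEMM call does the whole layer.
       OpenBLAS routes this through packed buffers and a tuned micro-kernel,
       the structure Goto described. */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                tokens, d_ff, d_model,
                1.0, X, d_model, W, d_ff, 0.0, Y, d_ff);

    printf("Y[0] = %g\n", Y[0]);
    free(X); free(W); free(Y);
    return 0;
}
```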