TheTom's llama-cpp-turboquant fork now supports TQ4_1S and TQ3_1S weight quantization, compressing models by 27-42% with minimal perplexity degradation. Merged via Pull Request #45, the implementation uses Walsh-Hadamard Transform (WHT) rotation combined with Lloyd-Max centroids for post-training quantization—no retraining, calibration data, or model modification required. Testing across Qwen, Phi-4, and Llama 3.1 models shows perplexity increases of just 1.0-1.9% for Qwen and Phi families at 27-37% size reduction.
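To make the pipeline concrete, here is a minimal sketch of the two building blocks named above: a Walsh-Hadamard Transform rotation followed by Lloyd-Max centroid quantization. This is an illustrative NumPy reconstruction, not the fork's Metal kernels; the function names `wht` and `lloyd_max` are hypothetical, and real TQ4_1S/TQ3_1S packing details (block sizes, scales) are omitted.

```python
import numpy as np

def wht(x):
    """Fast Walsh-Hadamard transform along the last axis.
    Length must be a power of two; normalized by 1/sqrt(n) so the
    transform is orthonormal (applying it twice recovers the input)."""
    x = x.astype(np.float64).copy()
    n = x.shape[-1]
    h = 1
    while h < n:
        for i in range(0, n, h * 2):
            a = x[..., i:i + h].copy()
            b = x[..., i + h:i + 2 * h].copy()
            x[..., i:i + h] = a + b          # butterfly: sum
            x[..., i + h:i + 2 * h] = a - b  # butterfly: difference
        h *= 2
    return x / np.sqrt(n)

def lloyd_max(values, n_levels=16, iters=30):
    """1-D Lloyd-Max quantizer: alternate nearest-centroid assignment
    and centroid re-estimation (scalar k-means). Returns the codebook
    and the index of the nearest centroid for each value."""
    centroids = np.quantile(values, np.linspace(0.0, 1.0, n_levels))
    for _ in range(iters):
        idx = np.abs(values[:, None] - centroids[None, :]).argmin(axis=1)
        for k in range(n_levels):
            sel = values[idx == k]
            if sel.size:
                centroids[k] = sel.mean()
    return centroids, idx
```

The point of the rotation is that the WHT spreads weight outliers across the whole vector, giving a smoother distribution that a small Lloyd-Max codebook (e.g. 16 levels for a 4-bit format) can cover with lower error; because the transform is orthonormal, dequantized weights are simply rotated back at inference time.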
The current implementation is Metal-only, targeting Apple Silicon, though developer signalnine has already contributed a CUDA port using a cuBLAS dequant-to-f16 path, with further optimizations planned; HIP/ROCm support is also in development. Llama-family models are harder to quantize: their WHT-rotated FFN tensors show 6-8x higher per-layer error amplification, so they require hybrid configurations that pair TQ4 attention layers with Q4_K or Q5_K/Q6_K FFN layers.
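A hybrid configuration of this kind can be expressed as a per-tensor quantization plan. The sketch below is hypothetical (the fork's actual selection logic and the `error_ratio` metric are assumptions), but it illustrates the idea: keep TQ4 for attention tensors and fall back to a K-quant for FFN tensors whose rotated error amplification crosses a threshold. Tensor names follow llama.cpp's GGUF naming convention.

```python
def plan_quant(tensor_name, error_ratio, threshold=6.0):
    """Pick a quant type for one tensor.

    error_ratio: measured per-layer error amplification of the
    WHT-rotated tensor relative to baseline (hypothetical metric;
    Llama-family FFN tensors land around 6-8x).
    """
    if ".ffn_" in tensor_name:
        # FFN tensors that amplify rotation error get a K-quant fallback.
        return "Q5_K" if error_ratio >= threshold else "TQ4_1S"
    # Attention (and other) tensors stay on the TurboQuant format.
    return "TQ4_1S"
```

For example, `plan_quant("blk.0.ffn_up.weight", 7.2)` selects `Q5_K`, while `plan_quant("blk.0.attn_q.weight", 7.2)` keeps `TQ4_1S`; in practice the threshold and the fallback type (Q4_K vs Q5_K/Q6_K) would be tuned per model family.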
This work exists in a fork and has not been merged into the official llama.cpp repository. An unrelated PR (#21089) by elusznik proposing CPU-only TurboQuant KV-cache support remains pending in the official repo. The fork's implementation builds on David Y. Tan's original TQ3_1S research, extending it with V2.1 fused Metal kernels featuring zero threadgroup memory and cooperative SIMD rotation. The PR notes indicate the code was generated using Claude Code, which community members suggest may complicate upstream acceptance given recent policy concerns around AI-generated submissions.