A GitHub project called Flash-MoE, published by developer danveloper, has demonstrated that Alibaba's Qwen3.5-397B-A17B mixture-of-experts model can technically be run on a consumer laptop. The technique pairs two aggressive optimizations: 2-bit quantization, which reduces each model weight to just 2 bits of precision, and a reduction of active experts per token from the model's intended 10 down to 4. Running on a MacBook Pro, the setup achieves approximately 5–6 tokens per second, a headline-grabbing result that drew significant attention on Hacker News as a proof of concept for running frontier-scale models locally.
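To make the quantization claim concrete, here is a minimal, illustrative sketch of 2-bit weight quantization in pure Python: every weight in a group is snapped to one of 2^2 = 4 levels sharing a single scale factor. This is not Flash-MoE's actual scheme (real 2-bit formats use per-group scales, calibration data, and bit-packing); it only shows how little information survives at this precision.

```python
def quantize_2bit(weights):
    """Map each float to one of 4 levels (2 bits) plus one shared scale.

    Illustrative symmetric scheme, NOT Flash-MoE's actual format:
    production 2-bit quants use per-group scales and bit-packing.
    """
    scale = max(abs(w) for w in weights) or 1.0
    levels = [scale * l for l in (-1.0, -1 / 3, 1 / 3, 1.0)]
    # Nearest-level assignment: each weight becomes a 2-bit code 0..3.
    codes = [min(range(4), key=lambda i: abs(w - levels[i])) for w in weights]
    return codes, scale

def dequantize_2bit(codes, scale):
    """Recover the (lossy) approximation from codes and the shared scale."""
    levels = [scale * l for l in (-1.0, -1 / 3, 1 / 3, 1.0)]
    return [levels[c] for c in codes]

weights = [0.90, -0.20, 0.05, -0.85, 0.31]
codes, scale = quantize_2bit(weights)
recon = dequantize_2bit(codes, scale)
print("original:     ", weights)
print("reconstructed:", [round(r, 2) for r in recon])
```

Every weight collapses onto just four representable values, so distinctions like 0.05 versus 0.31 vanish entirely. That information loss, multiplied across hundreds of billions of weights, is the mechanism behind the quality degradation commenters describe.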
The community reception was skeptical. The 2-bit quantization level is widely regarded as severely degrading output quality, and the author acknowledges this directly: JSON tool-calling becomes unreliable because the model produces malformed output. Hacker News commenter Aurornis argued that such extreme compression effectively "lobotomizes" the model, stripping out higher-order reasoning and causing the model to loop — rendering it unsuitable for real tasks. The expert reduction from 10 to 4 per token is an additional departure from the model's intended architecture, compounding the quality gap between Flash-MoE's demo and actual Qwen3.5-397B performance.
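The expert-reduction half of the trick is easier to see with a toy router. In a mixture-of-experts layer, a small router scores every expert for each token, and only the top-k experts actually run. The sketch below is a simplified stand-in rather than Qwen's actual routing code: cutting k from 10 to 4 simply discards experts the router ranked next, trading quality for compute.

```python
import math

def route_token(router_logits, k):
    """Pick the top-k experts for one token; softmax-normalize their gate weights.

    A simplified stand-in for MoE top-k routing, not Qwen's implementation.
    """
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    exps = [math.exp(router_logits[i]) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

# Toy router scores for 12 experts (the real model has far more).
logits = [1.2, -0.3, 2.1, 0.4, 3.0, -1.0, 0.9, 2.5, -0.2, 0.1, 1.8, 0.6]

full = route_token(logits, k=10)    # the model's intended budget
reduced = route_token(logits, k=4)  # Flash-MoE's cut-down budget

dropped = {i for i, _ in full} - {i for i, _ in reduced}
print("experts kept:", sorted(i for i, _ in reduced))
print("experts the router wanted but Flash-MoE skips:", sorted(dropped))
```

The per-token FLOPs scale with k, which is where the speedup comes from; but the skipped experts hold weights the model was trained to consult, so the output distribution shifts in ways the training never anticipated.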
Commenter tarruda flagged a more practical option in the thread: the ubergarm/Qwen3.5-397B-A17B-GGUF repository on Hugging Face, which provides GGUF quantizations at approximately 2.5 bits per weight. Running via <a href="/news/2026-03-16-locally-hosted-voice-assistant-llama-cpp-home-assistant">llama.cpp on an Apple M1 Ultra</a> with 128GB of unified memory, these quants reach around 20 tokens per second, four times Flash-MoE's throughput, while scoring 87.86% on MMLU and 82.32% on GPQA Diamond in lm-evaluation-harness benchmarks. The comparison also undercuts the "laptop" framing: both approaches require high-end Apple Silicon hardware costing $3,000 or more, which community members noted is a significant stretch of the word "consumer."
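Back-of-the-envelope arithmetic shows why 128GB of unified memory is the floor here. The sketch below estimates weight storage alone for a 397B-parameter model at various bits-per-weight; it deliberately ignores KV cache, activations, and format overhead, so real requirements run somewhat higher.

```python
def model_weight_gb(params, bits_per_weight):
    # Bytes needed to store the weights alone at a given precision,
    # expressed in decimal gigabytes. Ignores KV cache and overhead.
    return params * bits_per_weight / 8 / 1e9

P = 397e9  # total parameters in Qwen3.5-397B-A17B

for bpw in (16, 4, 2.5, 2):
    print(f"{bpw:>4} bpw -> ~{model_weight_gb(P, bpw):.0f} GB of weights")
```

At 2.5 bpw the weights come to roughly 124 GB, which is exactly why the M1 Ultra's 128GB configuration is the practical minimum, while unquantized FP16 would need nearly 800 GB. Even Flash-MoE's 2-bit weights, at roughly 99 GB, still demand workstation-class memory.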
Flash-MoE is a novelty demonstration, not a usable inference stack. At 2-bit compression, the model's JSON output is unreliable enough to break tool-calling — which rules it out for most <a href="/news/2026-03-14-run-openclaw-ai-agent-locally-on-amd-ryzen-ai-max-and-radeon-gpus">agentic workflows</a> without further workarounds. Anyone evaluating offline inference of Qwen3.5-397B for real use should look at the 2.5 BPW GGUF quants instead: four times the speed, benchmarks that hold up, and the same hardware requirement.