Max Taylor ran a proper benchmark and found that "be brief." matches a popular Claude Code compression plugin on both token reduction and quality. He tested 24 prompts across six categories with five arms: baseline Claude Code, "be brief." prepended, and three intensity levels of Julius Brussee's Caveman plugin. The results cut against the complexity narrative. "Be brief." averaged 419 tokens per response; Caveman's modes landed between 401 and 449. Quality scores stayed within 1.5% across all arms, every arm hit 100% coverage of key points, and none of the 120 responses triggered a dangerous claim.
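As a sketch of how per-arm results like these aggregate, assuming a simple record shape per scored response (the field names and toy numbers below are illustrative, not Taylor's raw data):

```python
from statistics import mean

def summarize(arm_results):
    """Collapse one arm's scored responses into the two headline
    metrics: mean token count and mean key-point coverage."""
    tokens = [r["tokens"] for r in arm_results]
    coverage = [r["key_points_hit"] / r["key_points_total"] for r in arm_results]
    return {"mean_tokens": mean(tokens), "coverage": mean(coverage)}

# Toy records standing in for two scored responses in one arm.
toy_arm = [
    {"tokens": 410, "key_points_hit": 4, "key_points_total": 4},
    {"tokens": 428, "key_points_hit": 3, "key_points_total": 3},
]
summary = summarize(toy_arm)
# mean_tokens == 419, coverage == 1.0 (100% of key points covered)
```

Running the same reduction over each of the five arms is what makes headline numbers like "419 vs. 401–449 tokens" directly comparable.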
The surprise comes from Caveman's strictest mode. Ultra, supposedly the most aggressive compression, produced the longest answers of the three plugin arms. That is the Auto-Clarity feature working as designed: when the plugin detects safety warnings or multi-step sequences, it drops compression and writes natural prose instead. Taylor's benchmark ran single-shot prompts only, so it didn't measure how that behavior holds up across a longer session. The testing setup is open source on GitHub as cc-compression-bench, and adding a new compression strategy is one shell script. Taylor makes the point plainly: most prompt engineering advice hasn't been measured against the boring default. Measure it.
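The one-script extension point might look something like the sketch below. The function name and the argument contract are assumptions for illustration; cc-compression-bench's actual strategy interface may differ, so check its README before wiring one in.

```shell
# Hypothetical compression strategy in the spirit of the benchmark's
# "one shell script" extension point: take a prompt, emit the
# transformed prompt. (The repo's real contract may differ.)
be_brief() {
  # Prepend the two-word directive; leave the prompt itself untouched.
  printf 'be brief. %s' "$1"
}

be_brief 'Explain how DNS resolution works.'
# → be brief. Explain how DNS resolution works.
```

The appeal of a contract this small is exactly Taylor's closing point: any new strategy, however clever, drops into the same harness and gets measured against the boring default.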