Practitioners on Hacker News this week traded notes on how they actually run AI evaluations — and the answers ranged from sophisticated multi-model review pipelines to "we basically eyeball it and hope."

The clearest shift: BLEU, ROUGE, and the traditional NLP metric toolkit are effectively dead for most LLM work. In their place, LLM-as-judge has become the de facto standard — a separate model grades outputs against a rubric rather than a fixed reference answer. It works well enough in practice, though commenters repeatedly flagged an obvious problem: using GPT-4 to grade GPT-4 outputs bakes the judge's own blind spots into the scores, a circularity nobody has cleanly solved. Cost at scale is a separate headache that came up more than once.
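The judge pattern itself is mechanically simple. A minimal sketch, in which the rubric wording, the 1-5 scale, and the JSON verdict shape are all illustrative assumptions (the actual call to a judge model is left as a comment):

```python
import json

# Hypothetical rubric; real ones are usually task-specific and much longer.
RUBRIC = """Score the RESPONSE to the QUESTION on a 1-5 scale for factual
accuracy and completeness. Reply with JSON: {"score": int, "reason": str}."""

def build_judge_prompt(question: str, response: str) -> str:
    # Assemble the grading prompt: rubric plus the pair under review.
    return f"{RUBRIC}\n\nQUESTION:\n{question}\n\nRESPONSE:\n{response}"

def parse_judge_reply(reply: str) -> dict:
    # Judge models sometimes wrap JSON in prose; extract the first {...} span.
    start, end = reply.find("{"), reply.rfind("}") + 1
    verdict = json.loads(reply[start:end])
    assert 1 <= verdict["score"] <= 5, "score outside rubric range"
    return verdict

# In production, build_judge_prompt(...) would be sent to a separate model
# (e.g. via an LLM provider's SDK) and its text fed to parse_judge_reply.
```

The parsing step is where most real implementations spend their defensive effort, since judges drift from the requested output format.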

On tools, the thread converged on a short list. Braintrust dominated mentions from teams running structured experiments — it handles dataset versioning, experiment tracking, and scoring in one place, which matters when you're iterating through prompt variants. LangSmith showed up wherever LangChain already lives in the stack. PromptFoo has a dedicated following among developers who don't want to pay for SaaS and are comfortable living in a CLI. Weights & Biases' Weave is picking up ML teams who already log training runs there and want eval observability in the same dashboard. Arize Phoenix covers the open-source RAG-and-tracing niche.
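To make the PromptFoo workflow concrete, a minimal config might look like the following. The assertion types (`icontains`, `llm-rubric`) come from promptfoo's documented test format; the prompt, provider ID, and test content are placeholders.

```yaml
# promptfooconfig.yaml — illustrative sketch, not a recommended setup
prompts:
  - "Summarize in one sentence: {{article}}"
providers:
  - openai:gpt-4o-mini
tests:
  - vars:
      article: "LLM evals are moving from fixed metrics to model-graded rubrics."
    assert:
      - type: icontains          # deterministic string check
        value: "rubric"
      - type: llm-rubric         # model-graded check against a plain-English rubric
        value: "The summary is faithful to the input"
```

Running `promptfoo eval` against a file like this is the whole loop, which is the appeal for teams that want evals in version control rather than a SaaS dashboard.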

Where things get genuinely messy is agent evaluation. Scoring a single prompt-response pair is tractable. Scoring a five-step agent task — where the model picks tools, issues API calls, and has to reach a specific end state — is not. There's no clean way to grade intermediate reasoning, and "did the task succeed" doesn't tell you whether the agent got lucky or actually understood the problem. Practitioners described stitching together deterministic unit tests for individual tool calls alongside broader LLM-graded rubrics for overall task quality. The biggest complaint wasn't tooling — it was that building and maintaining good eval datasets takes more time than anyone budgets for.
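The hybrid approach practitioners described (deterministic checks on individual tool calls, plus an LLM-graded rubric over the whole task) might be wired together like this sketch. The tool names, the end state, and the score threshold are all hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class ToolCall:
    name: str
    args: dict = field(default_factory=dict)

def check_tool_trace(trace: list[ToolCall]) -> list[str]:
    """Deterministic unit checks on an agent's tool-call trace.
    The tools checked here ('search', 'create_ticket') are made up."""
    failures = []
    if not any(c.name == "search" for c in trace):
        failures.append("agent never searched before acting")
    for c in trace:
        if c.name == "create_ticket" and not c.args.get("title"):
            failures.append("create_ticket called without a title")
    return failures

def grade_task(trace: list[ToolCall], final_state: str, rubric_score: int) -> bool:
    # Combine the layers: hard checks must all pass, the end state must
    # match, and an LLM-graded rubric score (obtained separately, e.g. via
    # the judge pattern) must clear a threshold.
    return (not check_tool_trace(trace)
            and final_state == "ticket_open"
            and rubric_score >= 4)
```

The design choice worth noting: the deterministic layer catches "got lucky" runs that reach the right end state via a broken trace, while the rubric layer catches plausible-looking traces that produce a bad result. Neither alone answers both questions.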

The thread covered familiar ground, but the specificity of the pain points made it worth reading. Evals are infrastructure. Most teams treat them as an afterthought until something breaks.