Data Science Weekly's Issue 642 includes one of the more practical Claude Code evaluations to appear outside a lab setting: a direct test of whether the tool can autonomously build dbt projects from real data, with other Claude models scoring the results.

The author wasn't new to using Claude for dbt work; they'd already been leaning on it conversationally for guidance. Asking it to construct entire pipelines from scratch is a bigger leap, and the results reflect that. The pipelines were credible, but they required careful prompting and still depended on the human to frame the analytical problem correctly. The author's verdict, 'Claude Code isn't going to replace data engineers (yet)', reads as an honest assessment rather than a dismissal. The tool does real work; it's just not as autonomous as the marketing sometimes implies.

The evaluation method is worth a look on its own terms. Rather than pass/fail testing, the author used LLM-as-judge scoring to assess schema adherence, code structure, and documentation quality — the kinds of output dimensions that don't reduce to a binary. This approach is showing up more often outside research contexts, and a dbt workflow is about as real-world as a testbed gets.
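The article doesn't reproduce the author's judge prompts, but the pattern is easy to sketch. Below is a minimal, hypothetical version: the rubric dimensions come from the article, while the prompt wording, function names, and the `judge` callable (a stand-in for whatever chat-completion API you use) are illustrative assumptions.

```python
import json

# Rubric dimensions named in the evaluation; the prompt text below is an
# illustrative reconstruction, not the author's actual wording.
DIMENSIONS = ["schema_adherence", "code_structure", "documentation_quality"]

def build_judge_prompt(generated_sql: str, spec: str) -> str:
    """Assemble a grading prompt asking the judge model for 1-5 scores."""
    return (
        "You are reviewing a dbt model. Score each dimension from 1 (poor) "
        "to 5 (excellent) and reply with JSON only, e.g. "
        '{"schema_adherence": 4, "code_structure": 3, "documentation_quality": 5}.\n\n'
        f"Project spec:\n{spec}\n\nGenerated model:\n{generated_sql}\n"
    )

def score_output(generated_sql: str, spec: str, judge) -> dict:
    """Run one judge call and validate the returned scores.

    `judge` is any callable that takes a prompt string and returns the
    model's text response -- wire in a real LLM client here.
    """
    raw = judge(build_judge_prompt(generated_sql, spec))
    scores = json.loads(raw)
    for dim in DIMENSIONS:
        if not 1 <= scores.get(dim, 0) <= 5:
            raise ValueError(f"judge returned an invalid score for {dim}")
    return scores

# Usage with a stubbed judge (a real run would call an LLM here):
stub = lambda prompt: (
    '{"schema_adherence": 4, "code_structure": 3, "documentation_quality": 5}'
)
print(score_output("select 1 as id", "one-column staging model", stub))
```

The useful property of this setup is that the rubric is explicit and machine-checkable: a malformed or out-of-range judge response fails loudly instead of silently skewing the aggregate scores.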

Elsewhere in the issue, a researcher ran topic modeling on more than 2,800 conversations from CantoAI, an AI-powered Cantonese conversation partner. Clustering the raw user queries reveals what people actually ask for, rather than what the product team assumed they would. The analysis maps which use cases pull the most traffic and where demand is going unmet. If you're building in the language-learning or AI companion space, this kind of usage data is more useful than any number of user interviews — it shows what the product is genuinely being used for, and what it's missing.
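The article doesn't detail the researcher's exact pipeline; real topic modeling on 2,800 conversations would typically use embeddings and a library like BERTopic. As a toy sketch of the underlying idea, here is a stdlib-only greedy clusterer that groups queries by token overlap (Jaccard similarity); the sample queries and the 0.25 threshold are invented for illustration.

```python
from collections import Counter

def tokens(text: str) -> set:
    """Crude tokenizer: lowercase words longer than two characters."""
    return {w for w in text.lower().split() if len(w) > 2}

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

def cluster_queries(queries, threshold=0.25):
    """Greedily assign each query to the first cluster whose accumulated
    token set is similar enough, else start a new cluster."""
    clusters = []  # each: {"tokens": set, "queries": [...]}
    for q in queries:
        t = tokens(q)
        for c in clusters:
            if jaccard(t, c["tokens"]) >= threshold:
                c["queries"].append(q)
                c["tokens"] |= t  # widen the cluster vocabulary
                break
        else:
            clusters.append({"tokens": set(t), "queries": [q]})
    return clusters

def top_terms(cluster, n=3):
    """Label a cluster by its most frequent tokens."""
    counts = Counter(w for q in cluster["queries"] for w in tokens(q))
    return [w for w, _ in counts.most_common(n)]

# Invented sample queries standing in for real CantoAI user requests:
queries = [
    "how do i order dim sum in cantonese",
    "order food at a restaurant in cantonese",
    "practice tones for cantonese vowels",
    "tones practice drills for beginners",
]
for c in cluster_queries(queries):
    print(len(c["queries"]), top_terms(c))
```

Even this crude version demonstrates the payoff the article describes: cluster sizes are a direct measure of which use cases pull traffic, with no product-team assumptions in the loop.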

The issue also covers PyAI, a conference co-organized by Prefect and Pydantic. Both companies build foundational Python infrastructure for agent pipelines: Prefect for workflow orchestration, Pydantic for keeping structured outputs well-formed across tool calls. Running a joint AI conference suggests they see themselves as part of a distinct layer in the stack — and that layer is starting to act like a real industry.
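To make Pydantic's role in that layer concrete, here is a minimal sketch of validating a tool call's output before an agent acts on it. The `FlowRunResult` schema and field names are hypothetical, not from the article or either company's docs; only the `BaseModel`/`ValidationError` usage is standard Pydantic.

```python
from pydantic import BaseModel, ValidationError

# Hypothetical schema for one tool's output -- invented for illustration.
class FlowRunResult(BaseModel):
    run_id: str
    state: str
    retries: int

def parse_tool_output(payload: dict) -> FlowRunResult:
    """Reject malformed model output before it propagates to the next tool call."""
    return FlowRunResult(**payload)

# Well-formed output passes through with typed fields:
good = parse_tool_output({"run_id": "abc", "state": "Completed", "retries": 0})
print(good.state)

# A model that hallucinates a non-numeric retry count fails fast:
try:
    parse_tool_output({"run_id": "abc", "state": "Completed", "retries": "many"})
except ValidationError:
    print("rejected malformed output")
```

The design point is that validation happens at the boundary between steps, so a single garbled tool response can't silently corrupt a multi-step agent pipeline.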