Frontier models now post expert-level scores on GPQA (Graduate-Level Google-Proof Q&A) and FrontierMath, benchmarks designed to stump human PhDs. A paper published this week in Science, the flagship journal of the American Association for the Advancement of Science (AAAS), asks what those scores actually prove — and concludes the field is measuring the wrong things.
Benchmark saturation, the authors argue, may be revealing the limits of evaluation frameworks rather than confirming that scientific AI has arrived. François Chollet, creator of the ARC-AGI benchmark, has made this case for years: high scores on static tests say little about genuine <a href="/news/2026-03-14-metr-research-half-of-swe-bench-passing-ai-prs-rejected-by-maintainers">fluid reasoning or real-world performance</a>. The Science paper gives the critique institutional weight.
The gap between benchmark performance and real scientific work is structural. Science requires hypothesis generation under genuine uncertainty, experimental design, anomaly recognition, and iterative self-correction across long time horizons — none of which map cleanly onto pattern-matching against known problem types. Google DeepMind's <a href="/news/2026-03-14-ai-engineer-uses-chatgpt-and-alphafold-to-develop-cancer-vaccine-for-his-dog">AlphaFold stands as the canonical example of AI impact</a> in a bounded domain; its protein structure predictions earned its creators a share of the 2024 Nobel Prize in Chemistry. But AlphaFold solved a specific prediction problem with extraordinary precision. It did not generate the hypotheses, design the experiments, or flag anomalous results. Sakana AI's "The AI Scientist" goes further, producing research ideas, running computational experiments, and drafting manuscripts, but critics note the outputs often lack novelty and rigorous error-checking — raising the question of whether such systems accelerate science or merely automate its surface forms.
Evaluation approaches discussed in the paper include open-ended discovery benchmarks, adversarial scientific tasks designed to resist memorization, and longitudinal tracking of whether AI-generated hypotheses produce reproducible results. The "AI co-scientist" framing is gaining traction in research circles as a more practical near-term standard: evaluate AI not in isolation but by how much it measurably accelerates human researchers, sidestepping the harder question of full autonomy. Anthropic, OpenAI, and Google DeepMind have each included capability thresholds in their responsible scaling policies, though whether any of those policies define specific scientific research milestones is unclear; the Science paper does not cite their RSPs directly.
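As a rough illustration of how that acceleration standard could be operationalized, here is a minimal sketch of a matched-team trial metric; the trial design, function name, and numbers are hypothetical, not drawn from the Science paper or any named lab's methodology.

```python
import statistics

def acceleration_ratio(control_days, assisted_days):
    """Median days-to-milestone for human-only teams divided by the
    median for AI-assisted teams; values above 1.0 mean the AI
    measurably accelerated the research."""
    return statistics.median(control_days) / statistics.median(assisted_days)

# Hypothetical data: days each team took to reach the same
# pre-registered experimental milestone.
control = [41.0, 38.5, 52.0, 47.0]    # human-only teams
assisted = [29.0, 33.5, 26.0, 40.0]   # human + AI co-scientist teams

print(f"Acceleration: {acceleration_ratio(control, assisted):.2f}x")  # 1.41x
```

The appeal of a ratio like this is that it requires matched teams and a pre-registered endpoint, which makes the result falsifiable in a way a leaderboard score is not.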
Autonomous research agents are already deployed in pharmaceutical R&D, materials science, and climate modeling, with no agreed-upon standards for measuring what they actually contribute. The scientific community's demands — reproducibility, falsifiability, peer scrutiny — sit in tension with the probabilistic and often opaque behavior of large language models. Chollet's ARC-AGI-2, released in early 2025 and designed specifically to resist the pattern-matching that inflated earlier benchmark scores, offers one possible template. Whether a comparable framework can be built for open-ended scientific discovery is the question the AAAS paper leaves squarely on the table.