The finding had looked settled: SwiGLU activation functions added nothing useful to the constrained training runs in Andrej Karpathy's autoresearch project. H100 data across multiple forks agreed — more parameters meant fewer gradient steps in the five-minute experiment window, and any quality gains were eaten by the throughput cost.
Then GB10 Blackwell results arrived and reversed the verdict. On Blackwell, SwiGLU won. According to Nervous Machine, which published a demonstration of its distributed agent framework this week, the reason was architectural: SDPA attention became the new throughput bottleneck on Blackwell, shifting the tradeoff in SwiGLU's favour. Same technique, different GPU, opposite outcome.
That split is exactly the kind of discrepancy that usually drowns in spreadsheet noise. Nervous Machine is building infrastructure designed to surface it instead.
The company's framework ingests experiment outputs from autoresearch forks, encodes findings as nodes in a knowledge graph, and assigns each a certainty score. Scores rise when independent runs on different hardware confirm a finding; they fall when contradictions appear. In the SwiGLU case, the graph created a CONTRADICTS link between the two hardware-specific claims — each retaining its own certainty score — rather than collapsing them into a single confusing result.
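The scoring mechanics described above can be sketched in a few lines of Python. Everything here is an assumption (the node fields, the 0.05 update steps, the 0.3 starting score), chosen to illustrate the confirm-and-contradict behaviour rather than reproduce Nervous Machine's actual implementation.

```python
from dataclasses import dataclass, field

@dataclass
class Finding:
    claim: str            # e.g. "SwiGLU not worth the parameter cost"
    hardware: str         # e.g. "H100"
    certainty: float = 0.3
    contradicts: list["Finding"] = field(default_factory=list)

class FindingsGraph:
    def __init__(self) -> None:
        self.nodes: list[Finding] = []

    def add(self, claim: str, hardware: str) -> Finding:
        node = Finding(claim, hardware)
        self.nodes.append(node)
        return node

    def confirm(self, node: Finding, step: float = 0.05) -> None:
        # An independent run confirms the finding: certainty rises.
        node.certainty = min(1.0, node.certainty + step)

    def contradict(self, a: Finding, b: Finding, step: float = 0.05) -> None:
        # Link the two claims with a CONTRADICTS edge instead of merging
        # them; each node keeps its own (now reduced) certainty score.
        a.contradicts.append(b)
        b.contradicts.append(a)
        a.certainty = max(0.0, a.certainty - step)
        b.certainty = max(0.0, b.certainty - step)

    def average_certainty(self) -> float:
        return sum(n.certainty for n in self.nodes) / len(self.nodes)

g = FindingsGraph()
h100 = g.add("SwiGLU not worth the parameter cost", "H100")
gb10 = g.add("SwiGLU wins; SDPA is the bottleneck", "GB10 Blackwell")
g.confirm(h100)           # a second H100 fork agrees
g.contradict(h100, gb10)  # the Blackwell result reverses the verdict
```

Keeping contradictory claims as separate scored nodes joined by a CONTRADICTS edge is what lets the H100 and Blackwell SwiGLU findings coexist instead of averaging into noise.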
Karpathy's autoresearch repository, which lets anyone with a single GPU run autonomous ML experiments overnight, has accumulated over 3,300 forks. Karpathy has written publicly that coordinating across those forks is an unsolved problem, describing the next step as needing to be "asynchronously massively collaborative for agents" — the SETI@home analogy is his. Nervous Machine frames its framework as a direct response. (The Agent Wars team could not independently verify the exact post; the quote is attributed in Nervous Machine's own write-up.)
The demonstration also surfaced what the company calls universal findings — batch halving as the biggest throughput win, label smoothing as catastrophic, value embeddings as load-bearing — which held across H100, GB10 Blackwell, and GH200 hardware. Hardware-specific findings, including RoPE base frequency and weight decay sweet spots, showed lower certainty as cross-platform contradictions accumulated. The average certainty score across the graph dropped from 0.34 to 0.315 over four sessions. Nervous Machine treats the decline as a feature: a graph whose scores only rise isn't detecting contradictions, it's confirming priors. What those absolute figures represent — whether 0.315 is meaningfully low or high — goes unexplained.
A few things the demonstration doesn't address: Is the five-minute budget a real-world constraint of autoresearch, or an artifact of the demo environment? How does the framework distinguish findings that differ because of hardware from those that differ because of uncontrolled hyperparameter or dataset variation across forks? A broader caveat: all the technical claims here originate from Nervous Machine's own write-up; no independent researchers with access to the system are quoted.
The framework uses the Model Context Protocol, so any MCP-compatible agent can connect to the shared graph, read certainty scores, and contribute findings without central coordination. Nervous Machine suggests the architecture could extend beyond ML experiments to any domain where distributed agents generate local observations. That's a larger claim than the demo supports.
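Nervous Machine hasn't published the graph's tool interface, so the following is a hypothetical sketch of what a server-side handler for agent tool calls might look like. The tool names, node IDs, and JSON shapes are all assumptions; only the read-scores and contribute-findings capabilities come from the company's description.

```python
import json

# Toy in-memory stand-in for the shared knowledge graph. The node ID and
# starting certainty of 0.3 are illustrative, not documented values.
GRAPH = {
    "swiglu-h100": {"claim": "SwiGLU not worth it on H100", "certainty": 0.30},
}

def handle_tool_call(name: str, arguments: dict) -> str:
    """Dispatch a tool call in the style of an MCP tools/call handler."""
    if name == "read_certainty":
        # Any connected agent can read a node's current certainty score.
        node = GRAPH[arguments["node_id"]]
        return json.dumps({"certainty": node["certainty"]})
    if name == "contribute_finding":
        # Agents contribute findings directly; no central coordinator
        # gates the write.
        node_id = arguments["node_id"]
        GRAPH[node_id] = {"claim": arguments["claim"], "certainty": 0.30}
        return json.dumps({"status": "recorded", "node_id": node_id})
    raise ValueError(f"unknown tool: {name}")

reply = handle_tool_call("read_certainty", {"node_id": "swiglu-h100"})
```

The appeal of routing this through MCP is that the agents need to agree only on the tool schema, not on each other's internals, which is what makes the "no central coordination" claim plausible at the protocol level.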