AI engineer Skylar Payne published a March 2026 post arguing for what he calls "Khattab's Law" — named after DSPy creator Omar Khattab — which holds that any sufficiently complex AI system eventually reinvents DSPy's core abstractions on its own: typed I/O signatures, composable modules, prompt versioning, retry logic, and model-swapping shims. Teams do this ad hoc, buggily, and after significant pain. To make the case, the article traces a canonical seven-stage evolution of a typical LLM pipeline, from a raw OpenAI API call to a fragile hand-rolled framework complete with a prompts database, Pydantic parsing, tenacity-based retries, RAG retrieval, and eval scaffolding. Companies including JetBlue, Databricks, Replit, VMware, and Sephora are cited as production DSPy users reporting consistent benefits: faster model swaps, more maintainable pipelines, and less engineering time spent on plumbing and more spent on context.
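The pattern the article describes converging on looks roughly like this: a typed output contract, JSON parsing, and a retry loop wrapped around a model call. This is a minimal sketch, not code from the post — the function names are invented, a stdlib dataclass stands in for the Pydantic model, a plain loop stands in for tenacity, and `call_model` is a stub where a real OpenAI or Anthropic call would go.

```python
import json
from dataclasses import dataclass


@dataclass
class Extraction:
    """Typed output contract -- the 'signature' teams end up reinventing."""
    company: str
    confidence: float


def call_model(prompt: str) -> str:
    # Stand-in for a real LLM API call, which occasionally returns
    # malformed JSON and forces the retry logic below to exist.
    return json.dumps({"company": "JetBlue", "confidence": 0.93})


def parse(raw: str) -> Extraction:
    data = json.loads(raw)
    if not isinstance(data.get("company"), str):
        raise ValueError("missing or malformed 'company' field")
    return Extraction(company=data["company"],
                      confidence=float(data["confidence"]))


def extract_company(text: str, max_retries: int = 3) -> Extraction:
    prompt = f"Extract the company name from: {text}\nReturn JSON."
    last_err = None
    for _ in range(max_retries):
        try:
            return parse(call_model(prompt))
        except (ValueError, KeyError) as err:
            last_err = err  # hand-rolled retry on parse failure
    raise RuntimeError(f"model output never validated: {last_err}")
```

Every piece of this — the schema, the validation, the retry policy — is plumbing that DSPy's signatures and modules absorb, which is the article's point about reinvention.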

The download numbers tell a different story. DSPy logs roughly 4.7 million monthly downloads against LangChain's 222 million, and Hacker News commenters were quick to interrogate Payne's framing. The most substantive pushback noted that several of DSPy's highlighted patterns — typed outputs via Pydantic, clean model abstraction — are already handled by lighter tools like LiteLLM and <a href="/news/2026-03-16-vibesdk-typescript-agent-framework-pydantic-ai-port">Vercel's AI SDK</a>, and that framing DSPy favorably against LangChain sets a low bar given LangChain's well-documented architectural problems. More critically, commenters pointed out that the article almost entirely omits DSPy's actual differentiator: MIPROv2, its Bayesian prompt optimizer that automatically searches over instruction phrasings and few-shot demonstrations. That capability is genuinely hard to replicate with lighter tooling, yet it receives minimal coverage in the piece that purports to make the case for DSPy.
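What that differentiator amounts to can be conveyed with a toy: treat the prompt as a search space of candidate instructions and few-shot demo subsets, score each candidate against labeled examples, and keep the best. This sketch uses exhaustive search with a deterministic stub in place of the model; MIPROv2's real procedure is Bayesian and queries an actual LM, and every name here is illustrative.

```python
import itertools

# Candidate instruction phrasings and a pool of few-shot demonstrations.
instructions = [
    "Extract the company name.",
    "Return only the company mentioned in the text.",
]
demo_pool = [("Apple shipped a new phone.", "Apple"),
             ("Delta delayed the flight.", "Delta")]

# The labeled eval set the optimizer scores against.
labeled_eval = [("JetBlue cut fares.", "JetBlue")]


def run_prompt(instruction, demos, text):
    # Stand-in for an LLM call; it ignores the demos and "rewards" the
    # stricter instruction so the search has something to find.
    return text.split()[0] if "only" in instruction else text.split()[-1]


def score(instruction, demos):
    return sum(run_prompt(instruction, demos, x) == y
               for x, y in labeled_eval)


# Search over (instruction, demo-subset) pairs and keep the top scorer.
best = max(
    ((inst, demos) for inst in instructions
     for demos in itertools.combinations(demo_pool, 1)),
    key=lambda cand: score(*cand),
)
```

The hard part MIPROv2 automates is exactly this loop at realistic scale, where candidates number in the thousands and each score costs model calls — which is why commenters saw its omission as the article burying DSPy's strongest argument.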

The adoption barrier commenters identified cuts deeper than unfamiliar APIs. DSPy's optimization loop requires a labeled training and evaluation dataset before it can function, demanding that engineering teams define measurable correctness upfront — a researcher's discipline that many product teams simply are not ready to apply when they are still figuring out what their LLM system should do. That friction is a direct artifact of DSPy's origins at Stanford's DAWN lab, where Khattab, working under Christopher Potts and Matei Zaharia, developed the framework as a logical extension of his retrieval research on ColBERT and BALEEN. Academic benchmarks always come with ground-truth labels; production AI systems frequently do not, and DSPy's architecture enforces an evaluation-first mental model that can actively impede iteration speed for teams in exploratory phases.
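Concretely, the upfront investment is a labeled dev set plus a metric function the optimizer can maximize — nothing can run until both exist. A minimal sketch of that prerequisite, with invented examples and a trivial predictor stub standing in for the actual pipeline:

```python
# Labeled examples must exist before optimization can start.
devset = [
    {"text": "Sephora opened a flagship store.", "company": "Sephora"},
    {"text": "VMware released a patch.", "company": "VMware"},
]


def exact_match(example, prediction):
    # The metric: maps (labeled example, prediction) to a score.
    # Defining this is the 'measurable correctness' commitment.
    return float(prediction == example["company"])


def predict(text):
    return text.split()[0]  # stand-in for the real pipeline


accuracy = sum(exact_match(ex, predict(ex["text"]))
               for ex in devset) / len(devset)
```

Writing `exact_match` is easy here because the toy task has an obvious ground truth; for the open-ended tasks product teams actually ship, deciding what this function should return is the hard, unresolved design work the framework demands first.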

The motivating example drew separate scrutiny: company name extraction is a classic NER task that traditional ML handles competently without LLM latency or cost, suggesting that some of the pipeline complexity being solved is self-inflicted by reaching for large language models unnecessarily. DSPy is genuinely valuable for teams with stable, well-defined tasks who need systematic prompt optimization at scale — the kind of work Databricks and Replit are doing. But its positioning as a general antidote to LLM engineering pain is undercut by its steep learning curve, its requirement for labeled evaluation data, and the availability of lighter-weight alternatives for the simpler abstractions it also provides.