DeepSeek just dropped the official API docs for its v4 model family. Two models: deepseek-v4-flash and deepseek-v4-pro. Both work with the OpenAI and Anthropic SDKs; you switch by pointing your existing client at a different base URL. That's it. Legacy models deepseek-chat and deepseek-reasoner stick around until July 2026, then they're gone.
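Here's what the switch looks like with the OpenAI Python SDK. The base URL below is the one DeepSeek documents for its existing OpenAI-compatible API; double-check the v4 docs before wiring it into anything real.

```python
from openai import OpenAI

# Same SDK, different endpoint. The key comes from your DeepSeek account.
client = OpenAI(
    api_key="YOUR_DEEPSEEK_API_KEY",
    base_url="https://api.deepseek.com",  # DeepSeek's OpenAI-compatible endpoint
)

response = client.chat.completions.create(
    model="deepseek-v4-flash",
    messages=[{"role": "user", "content": "Hello"}],
)
print(response.choices[0].message.content)
```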
Thinking Mode is the headline feature. It outputs chain-of-thought reasoning before the final answer, with configurable effort levels (high or max). Beyond that, you get multi-round conversation, tool calls, JSON output, and context caching.
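A quick sketch of what a Thinking Mode call might look like. The `reasoning_effort` field and the `reasoning_content` response attribute are assumptions modeled on DeepSeek's current reasoner API; the v4 parameter names could differ, so treat this as a sketch, not the reference.

```python
from openai import OpenAI

client = OpenAI(api_key="YOUR_DEEPSEEK_API_KEY", base_url="https://api.deepseek.com")

response = client.chat.completions.create(
    model="deepseek-v4-pro",
    messages=[{"role": "user", "content": "How many primes are there below 100?"}],
    # "high" or "max" per the docs; the exact field name is an assumption here.
    extra_body={"reasoning_effort": "high"},
)

message = response.choices[0].message
# deepseek-reasoner returns its chain of thought in `reasoning_content`
# today; assuming v4 does the same.
print(getattr(message, "reasoning_content", None))
print(message.content)
```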
Commenters on Hacker News praised the documentation's clarity, saying it beats what OpenAI and Google offer. That's a low bar, but credit where it's due. OpenRouter lists the Pro model at roughly $1.74 per million input tokens and $3.48 per million output tokens. Flash is small enough to run locally on a Mac with 48GB of RAM; you'll wait longer than you would for cloud inference, but that still makes it a viable option for independent developers.
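At those rates, the back-of-envelope math is easy. A hypothetical helper using the OpenRouter numbers above:

```python
# USD per million tokens, per the OpenRouter listing for the Pro model.
INPUT_PER_M = 1.74
OUTPUT_PER_M = 3.48

def request_cost(input_tokens: int, output_tokens: int) -> float:
    return (input_tokens * INPUT_PER_M + output_tokens * OUTPUT_PER_M) / 1_000_000

# A 20k-token prompt with a 2k-token answer runs about four cents:
print(f"${request_cost(20_000, 2_000):.4f}")  # $0.0418
```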
Under the hood, DeepSeek's Mixture of Experts approach splits experts into shared and routed categories, activating only a fraction of parameters per token. Multi-Head Latent Attention compresses the key-value cache during inference, cutting memory needs. V3 had 671B total parameters but activated only 37B per token; V4 likely continues that pattern. That's how you get models that run on consumer hardware and still compete with the big, infrastructure-rich players.
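To make the shared-versus-routed split concrete, here's a toy PyTorch sketch, not DeepSeek's code: every token runs through the shared experts, a router picks top-k of the routed ones, and the rest stay idle. The sizes are made up; the point is the activation pattern.

```python
import torch
import torch.nn as nn

def expert(dim: int) -> nn.Module:
    # A standard FFN block standing in for one expert.
    return nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

class SharedRoutedMoE(nn.Module):
    def __init__(self, dim=256, n_shared=2, n_routed=16, top_k=4):
        super().__init__()
        self.shared = nn.ModuleList(expert(dim) for _ in range(n_shared))
        self.routed = nn.ModuleList(expert(dim) for _ in range(n_routed))
        self.router = nn.Linear(dim, n_routed)
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (tokens, dim)
        shared_out = sum(e(x) for e in self.shared)  # shared experts: always on
        weights = self.router(x).softmax(dim=-1)
        top_w, top_i = weights.topk(self.top_k, dim=-1)
        routed_out = torch.zeros_like(x)
        for t in range(x.size(0)):  # each token hits only its top-k routed experts
            for w, i in zip(top_w[t], top_i[t]):
                routed_out[t] = routed_out[t] + w * self.routed[int(i)](x[t])
        return shared_out + routed_out

moe = SharedRoutedMoE()
tokens = torch.randn(8, 256)
print(moe(tokens).shape)  # torch.Size([8, 256])
# With 2 shared + 4 of 16 routed experts firing, most parameters sit idle
# per token -- the same trick that lets V3 activate 37B of its 671B.
```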