Mistral has never been shy about ambition. But with Mistral Small 4, the French AI company has swung harder than ever — releasing a 119-billion-parameter model it insists on calling "small," one that collapses four specialized systems into a single, Apache 2.0-licensed package.
The architecture is genuinely striking. Built on a Mixture-of-Experts framework deploying 128 experts — four active per token — the model draws on the capabilities of Magistral for deep reasoning, Pixtral for multimodal vision, and Devstral for agentic coding. A 256,000-token context window and a configurable reasoning_effort parameter round out a feature set that reads like a wish list for enterprise AI teams.
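To make the "128 experts, four active per token" figure concrete, here is a minimal sketch of top-k expert routing. It assumes the common softmax-over-selected-logits gating scheme; the article does not say how Mistral's router actually works. The function name `route_token` and the random logits are illustrative only.

```python
import math
import random

NUM_EXPERTS = 128  # expert count quoted in the article
TOP_K = 4          # experts active per token, as quoted in the article


def route_token(router_logits, top_k=TOP_K):
    """Pick the top_k highest-scoring experts and return (index, weight) pairs.

    Assumption: weights are a softmax over the selected logits only,
    a common gating choice, not a confirmed detail of this model.
    """
    # Indices of the top_k largest router logits.
    chosen = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)[:top_k]
    # Numerically stable softmax restricted to the chosen experts.
    peak = max(router_logits[i] for i in chosen)
    exps = [math.exp(router_logits[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]


# Stand-in router scores for one token (random, purely illustrative).
rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]

selection = route_token(logits)
active_fraction = TOP_K / NUM_EXPERTS  # share of experts consulted per token

for expert_index, weight in selection:
    print(f"expert {expert_index:3d} weight {weight:.3f}")
print(f"active experts per token: {active_fraction:.2%}")
```

Under these assumptions each token consults just over 3% of the experts, which is why a model with a large total parameter count can still be comparatively cheap to run per token.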
The efficiency numbers, if they hold, are hard to dismiss. Mistral claims a 40% reduction in end-to-end completion time and three times the throughput of its predecessor. On LiveCodeBench, the company reports that Small 4 outperforms GPT-OSS 120B while generating 20% shorter outputs. On competitive coding benchmarks, it reportedly matches the scores of Qwen models with 3.5 to 4 times less output.
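What shorter outputs mean for a bill can be worked out with simple arithmetic, assuming output tokens are billed at a flat per-token rate. The sketch below turns the article's figures (20% shorter, 3.5 to 4 times less output) into a fraction of output spend saved. The baseline reply length and the price are hypothetical placeholders, not Mistral's pricing.

```python
def saving_from_shorter_output(reduction_factor):
    """Fraction of output-token spend saved when outputs shrink by this factor.

    A reduction_factor of 4 means the model emits one quarter of the tokens.
    Assumes cost scales linearly with output tokens.
    """
    return 1.0 - 1.0 / reduction_factor


def output_cost(tokens_per_reply, replies, price_per_million_tokens):
    """Total output-token cost under flat per-token billing."""
    return tokens_per_reply * replies * price_per_million_tokens / 1_000_000


# Hypothetical workload, for illustration only.
BASELINE_TOKENS_PER_REPLY = 1_000
REPLIES = 1_000_000
PRICE_PER_MILLION = 0.60  # hypothetical USD price, not a quoted rate

baseline_cost = output_cost(BASELINE_TOKENS_PER_REPLY, REPLIES, PRICE_PER_MILLION)
# "20% shorter outputs" means 0.8x the tokens, i.e. a factor of 1 / 0.8.
shorter_cost = output_cost(BASELINE_TOKENS_PER_REPLY * 0.8, REPLIES, PRICE_PER_MILLION)

saving_20_percent = saving_from_shorter_output(1 / 0.8)
saving_low = saving_from_shorter_output(3.5)
saving_high = saving_from_shorter_output(4.0)

print(f"baseline ${baseline_cost:,.2f} vs 20% shorter ${shorter_cost:,.2f}")
print(f"3.5x to 4x less output saves {saving_low:.1%} to {saving_high:.1%}")
```

On those assumptions, a 3.5x to 4x reduction in output length cuts output-token spend by roughly 71% to 75%, a far larger effect than the 20% figure reported for LiveCodeBench.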
But the launch cracked under its own weight.
Engineers at CTOL Digital Solutions, who evaluated the model at release, delivered a measured but pointed verdict. The efficiency story, they acknowledged, is compelling — shorter outputs mean lower API costs and better latency, metrics that matter enormously at scale. The unified architecture is a genuine feat of consolidation.
Yet the problems were difficult to ignore. Most critically, the flagship reasoning_effort parameter, the very feature that defines Small 4's identity, was non-functional and undocumented in the API on launch day, leaving early adopters unable to access the model's core differentiator. Without it, CTOL's evaluators found the coding performance "very, very bad" and at times completely broken.
Then there is the naming problem. Calling a 119B-parameter model requiring at minimum four NVIDIA H100 GPUs "Small" invites ridicule, and CTOL did not hold back. The label creates confusion and, more practically, puts local deployment beyond reach for the vast majority of developers — a painful irony for a model released under an open-source license.
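A back-of-the-envelope calculation shows why a model of this size is out of reach for a single workstation. The sketch below estimates the memory needed for the weights alone at a few numeric precisions. It assumes the 80 GB variant of the H100 and ignores KV cache, activations and framework overhead, so real deployments need more headroom than these floors suggest. The article does not say which precision Mistral's four-GPU figure assumes.

```python
import math

PARAMS = 119e9        # parameter count quoted in the article
H100_MEMORY_GB = 80   # assumption: the 80 GB H100 variant

# Bytes stored per parameter at each precision (illustrative choices).
BYTES_PER_PARAM = {"bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(params, bytes_per_param):
    """Memory for the weights alone, in decimal gigabytes."""
    return params * bytes_per_param / 1e9


def min_gpus_for_weights(params, bytes_per_param, gpu_memory_gb=H100_MEMORY_GB):
    """Lower bound on GPU count: weights only, no KV cache or activations."""
    return math.ceil(weight_memory_gb(params, bytes_per_param) / gpu_memory_gb)


footprints = {name: weight_memory_gb(PARAMS, width)
              for name, width in BYTES_PER_PARAM.items()}
gpu_floors = {name: min_gpus_for_weights(PARAMS, width)
              for name, width in BYTES_PER_PARAM.items()}

for name in BYTES_PER_PARAM:
    print(f"{name}: {footprints[name]:.1f} GB of weights, "
          f"at least {gpu_floors[name]} GPU(s)")
```

At 16-bit precision the weights alone come to 238 GB, which already fills most of three 80 GB cards before any context is processed. That is consistent with a four-GPU minimum once working memory is counted, though the exact breakdown here is an estimate rather than a published specification.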
CTOL's own benchmark comparisons further muddied the picture. Its testing showed Mistral Small 4 scoring below Qwen3.5 122B, and evaluators expressed skepticism about real-world performance at the full 256,000-token context depth.
CTOL's conclusion was unambiguous: a meaningful step forward for European AI, and a genuine achievement in open-source development — but not good enough to claim the top position in its class, and far from the obvious choice for teams seeking a capable, deployable small model today.
Mistral has built something architecturally impressive. Whether it can deliver on that architecture — reliably, at launch, for real users — remains an open question. In AI, the distance between a benchmark and a product is where reputations are made or lost.
