1. Introduction & Overall Impressions
DeepSeek-R1 has garnered significant attention for its new approach to training large language models (LLMs). Compared to its predecessor DeepSeek-V3, this new work emphasizes a “simpler yet more elegant” style in its experimental and theoretical design.
In evaluating DeepSeek-R1, many researchers have been reminded of AlphaGo’s evolution, particularly because of the R1-Zero → R1 training process. DeepSeek-R1 stands out for its high performance on various challenging benchmarks, even outperforming or equaling top-tier models such as OpenAI-o1-1217. Additionally, a distilled 32B version (DeepSeek-R1-32B) has delivered impressive results, rivaling OpenAI-o1-mini.
At a high level, DeepSeek-R1 shows that it is possible to achieve strong reasoning capability without relying on massive supervised fine-tuning (SFT) at the outset. The model instead employs a combination of reinforcement learning (RL) with a lightweight SFT approach, plus a rule-based reward model that sidesteps some of the pitfalls of conventional reward modeling.
2. Reward Design: Moving Away from PRM & ORM
2.1 Why a Rule-Based Reward?
The authors opted for rule-based rewards rather than a process reward model (PRM). Their main points are:
- Granular Step Labeling is Difficult: In general reasoning tasks, it is hard to define clear, fine-grained criteria for each intermediate step.
- Labeling Cost & Accuracy: Automated label generation is typically subpar, while manual annotation is too expensive to scale.
- Avoiding Reward Hacking: When the reward function is itself a learned model (such as a PRM), the policy can learn to game or exploit that reward (reward hacking). Continually retraining a PRM also escalates complexity and resource demands.
Consequently, DeepSeek-R1 uses direct rule-based signals, especially in math or programming tasks, by comparing final answers to ground truths or by compiling code and running test cases to verify correctness. It also incorporates rules that check output format (e.g., whether the reasoning is enclosed in <think>...</think> tags) and language consistency.
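To make this concrete, here is a minimal sketch of what such a rule-based reward could look like. The specific tag convention, the weighting, and the exact-match check are illustrative assumptions, not the authors' implementation.

```python
import re

def rule_based_reward(output: str, ground_truth: str) -> float:
    """Toy rule-based reward: a format check plus a final-answer check.

    The <think>/<answer> tags and the 0.1/1.0 weights are assumptions
    for illustration, not DeepSeek's exact recipe.
    """
    reward = 0.0

    # Format rule: reasoning must be wrapped in <think>...</think>,
    # followed by an <answer>...</answer> block.
    has_think = re.search(r"<think>.*?</think>", output, flags=re.DOTALL) is not None
    answer_match = re.search(r"<answer>(.*?)</answer>", output, flags=re.DOTALL)
    if has_think and answer_match:
        reward += 0.1  # small bonus for obeying the required format

    # Accuracy rule: compare the extracted final answer against the ground truth.
    if answer_match and answer_match.group(1).strip() == ground_truth.strip():
        reward += 1.0  # main signal: the final answer is correct

    return reward
```

For programming tasks, the accuracy rule would instead compile the generated code and run it against test cases, but the structure of the reward stays the same.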
2.2 Discarding Model-Based Output Rewards (ORM)
DeepSeek-R1 also forgoes an outcome reward model (ORM), in which a separate model judges or scores final outputs, due to similar concerns about hallucination, potential reward hacking, and instability. Despite the advantages of “dense reward” methods in some tasks, the team values the simplicity, stability, and robustness of a purely rule-based approach.
3. Training Strategy: From “Zero” to a Multi-Stage Process
DeepSeek-R1’s training can be broken down into distinct phases:
- DeepSeek-R1-Zero
- Starting Point: Take DeepSeek-V3-Base (or a similarly pre-trained base model) and apply RL directly, without any initial SFT.
- Method: Use a rule-based reward combined with the GRPO (Group Relative Policy Optimization) algorithm.
- Goal: Maximize correctness in math/programming tasks and ensure certain formatting rules.
- Findings:
- The model outputs become longer over training, showing early signs of introspection or self-reflection in its responses.
- However, the text can be awkward to read, and there is some mixing of languages.
- Transition to Full DeepSeek-R1
- While R1-Zero successfully boosts reasoning performance, it still struggles with readability and linguistic consistency.
- The team then performs SFT on a small amount of high-quality data, improving overall clarity and coherence. After this SFT "cold start," they resume RL to push performance further.
The final R1 training pipeline consists of four steps:
- Minimal SFT with High-Quality Data
- Collect a few thousand curated examples (e.g., detailed CoT data).
- Perform short SFT to get the model to “speak” more coherently.
- Focused RL for Reasoning
- Same rule-based rewards for math/logic tasks as in R1-Zero.
- Adds a language consistency reward to reduce mixing of multiple languages within a single answer (a sketch follows this list).
- Rejection Sampling + SFT
- Use rejection sampling to filter the model outputs from the previous phase, removing low-quality or improperly formatted responses.
- For tasks that cannot easily be judged by a simple rule, apply "LLM-as-judge" style verification (e.g., using DeepSeek-V3 as the judge).
- Combine roughly 600k filtered reasoning samples with about 200k non-reasoning samples and run another round of SFT (2 epochs).
- RL for Full Coverage
- For different task types, the model uses different prompts and reward rules.
- Math/logic tasks continue to rely on the original rule-based scoring.
- “General tasks” use a standard reward model for helpfulness and safety.
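To make step 2's language-consistency reward concrete, here is a minimal sketch, as referenced above. The script-level heuristic and the idea of reusing the same score as a filter during step 3's rejection sampling are illustrative assumptions rather than the authors' exact recipe.

```python
def language_consistency_reward(text: str, target: str = "en") -> float:
    """Fraction of alphabetic characters that belong to the target language's script.

    A crude Latin-vs-CJK heuristic; the real reward is presumably more careful,
    but the idea is the same: penalize mixed-language output.
    """
    def is_latin(ch: str) -> bool:
        return ch.isascii() and ch.isalpha()

    def is_cjk(ch: str) -> bool:
        return "\u4e00" <= ch <= "\u9fff"

    relevant = [ch for ch in text if is_latin(ch) or is_cjk(ch)]
    if not relevant:
        return 1.0  # nothing to judge
    matches = [ch for ch in relevant if (is_latin(ch) if target == "en" else is_cjk(ch))]
    return len(matches) / len(relevant)


# The same score can double as a rejection-sampling criterion in step 3,
# e.g. keep only samples whose consistency exceeds a threshold:
# keep = language_consistency_reward(sample, "en") > 0.95
```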
In the end, DeepSeek-R1 achieves a balance between reasoning performance and user-oriented qualities such as clarity and harmlessness, effectively matching top-tier models on many benchmarks.
4. Observations: KL Loss & GRPO vs. PPO
DeepSeek-R1 uses GRPO for its RL phase, distinguishing it from methods like PPO:
- In PPO-style RLHF, the per-token KL penalty is typically folded into the reward signal before the policy gradient is computed.
- GRPO instead subtracts a KL term directly in the loss, using a low-variance estimator (k3) rather than a naive Monte Carlo estimate.
This keeps training more stable: the k3 estimator stays non-negative and avoids the high variance of a straightforward sample-based KL estimate when only a limited number of completions is drawn per prompt.
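For intuition, here is a minimal sketch of GRPO's group-relative advantage and the k3-style KL estimate described above. The group size, example rewards, and the commented loss shape are illustrative assumptions, not DeepSeek's exact implementation.

```python
import math

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO advantage: normalize each sampled completion's reward against
    the mean and std of its own group, with no learned critic."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var) + 1e-8
    return [(r - mean) / std for r in rewards]

def k3_kl_estimate(logp_policy: float, logp_ref: float) -> float:
    """Low-variance, non-negative per-token estimate of KL(policy || ref):
    ratio - log(ratio) - 1, where ratio = p_ref / p_policy."""
    log_ratio = logp_ref - logp_policy
    ratio = math.exp(log_ratio)
    return ratio - log_ratio - 1.0

# Example: 4 sampled answers to one prompt, scored by the rule-based reward.
advantages = group_relative_advantages([1.1, 0.1, 1.1, 0.0])
# The per-token objective then looks roughly like:
#   -(advantage * clipped_importance_ratio) + beta * k3_kl_estimate(...)
```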
5. Echoes of AlphaGo: Why “Zero” Feels Familiar
Readers often note parallels to AlphaGo because the authors also tried MCTS (Monte Carlo Tree Search) and a “Zero-like” approach:
- R1-Zero parallels AlphaGo Zero in that it starts with minimal or no supervised data.
- The original AlphaGo bootstrapped its policy from human game records via supervised learning and then improved through self-play; AlphaGo Zero and AlphaZero later dropped the human data entirely. By contrast, DeepSeek follows a nearly reversed workflow: R1-Zero first does RL from scratch, and SFT is added afterwards.
Ultimately, DeepSeek’s attempts to use MCTS in language reasoning encountered obstacles (large branching factor, difficulty of training a fine-grained value model, etc.), so MCTS was not deemed successful in the final pipeline.
6. Experimental Results & Benchmarks
On a range of high-difficulty tasks (math reasoning, code completion, complex QA), DeepSeek-R1 delivers performance comparable to OpenAI-o1-1217—placing it in the leading group of reasoning-capable LLMs.
Meanwhile, the interim R1-Zero already shows substantial gains over baseline in reasoning tasks. Yet, it produces more awkward or mixed-language output. Hence the SFT steps introduced later improve user experience and reliability, while continuing to preserve or even enhance the model’s strong reasoning capabilities.
7. Knowledge Distillation & Small Models
The authors note that simply distilling DeepSeek-R1 into smaller models (e.g., Qwen2.5-32B) can yield results that are on par with more expensive small-model RL training. This is a compelling argument that, instead of performing a full-blown RL pipeline on a small model, one might efficiently gather high-quality outputs from a more capable model (like R1) and then do supervised fine-tuning on these outputs.
Outcome:
- The distilled DeepSeek-R1-32B reportedly reaches performance close to OpenAI-o1-mini at a fraction of the cost of developing a small model from scratch with RL.
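A minimal sketch of the distillation recipe described above, kept deliberately generic: the sampling interface, the verifier, and the number of samples per prompt are placeholders for illustration, not the authors' exact setup.

```python
def build_distillation_set(teacher_generate, prompts, is_correct):
    """Collect chain-of-thought outputs from a stronger teacher (e.g. R1),
    keep only verified ones, and format them as plain SFT pairs.

    `teacher_generate` and `is_correct` are placeholders for the teacher's
    sampling API and a rule-based verifier.
    """
    sft_pairs = []
    for prompt, reference in prompts:
        for _ in range(4):  # sample a few completions per prompt
            completion = teacher_generate(prompt)
            if is_correct(completion, reference):
                sft_pairs.append({"prompt": prompt, "completion": completion})
                break
    return sft_pairs

# The student (e.g. Qwen2.5-32B) is then fine-tuned on `sft_pairs` with an
# ordinary supervised cross-entropy loss -- no RL on the small model at all.
```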
8. Challenges & Future Directions
- General-Purpose Abilities
- DeepSeek-R1 focuses on reasoning tasks but still falls short of DeepSeek-V3 in some general domains. The team plans to improve the model’s broader coverage, possibly using more extensive CoT or domain-specific data.
- Language Mixing & Multilingual Support
- Although R1 has language consistency checks for Chinese and English, it still struggles with other languages or language-switching scenarios.
- Prompt Engineering Sensitivity
- R1 can be sensitive to multi-turn or few-shot prompts. The authors recommend a zero-shot approach that simply states the problem and specifies the desired output format (an illustrative prompt follows this list).
- Software Engineering & Long Evaluations
- Because code tasks can take longer to verify, large-scale RL is more difficult. DeepSeek-R1 does show improvements on software tests but not a dramatic leap over DeepSeek-V3. Future plans include asynchronous evaluation to speed up RL in programming tasks.
- Scaling to 600B and Beyond
- The paper doesn’t fully demonstrate whether this approach remains stable and effective at extreme scales (e.g., 600B parameters). This is another open area the team might explore.
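To illustrate the zero-shot recommendation above, a prompt in the suggested style might look like the following; the wording and the sample problem are hypothetical, not taken from the paper.

```python
# A hypothetical zero-shot prompt: state the problem directly and specify
# the desired output format, rather than supplying few-shot examples.
prompt = (
    "Solve the following problem. Reason step by step, then give only the "
    "final answer on the last line in the form 'Answer: <value>'.\n\n"
    "Problem: A train travels 120 km in 1.5 hours. What is its average speed?"
)
```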
9. Conclusion
DeepSeek-R1 demonstrates that massive SFT is not an absolute prerequisite for significantly boosting a language model’s reasoning ability. By leveraging a simple yet robust rule-based reward, skipping or minimizing SFT at the outset, and then integrating a small curated dataset plus repeated RL phases, R1 achieves state-of-the-art performance on challenging benchmarks.
The study also highlights how knowledge distillation—taking outputs from a stronger model (R1) to train a smaller model—can be more efficient and produce superior results than having the small model directly undergo extensive RL training.
While DeepSeek-R1 still has some gaps in generality and remains sensitive to prompting, it points the way toward a future where hybrid RL + minimal SFT can yield powerful, flexible, and more controllable LLMs. This paper sets a promising milestone, showing that with the right rewards and iterative training phases, models can “discover” self-reflection, extended reasoning, and robust performance without large-scale step-by-step annotation.