ByteDance and Tsinghua Researchers Open-Source DAPO to Advance LLM Reinforcement Learning at Scale

By Lang Wang · 4 min read

DAPO: Open-Sourcing Reinforcement Learning for Large Language Models at Scale

Breaking the Barriers in LLM Reasoning with Open-Source Reinforcement Learning

In the race to build more intelligent Large Language Models (LLMs), the industry has largely relied on Reinforcement Learning (RL) to enhance reasoning capabilities. However, a persistent challenge has been the lack of transparency: state-of-the-art RL techniques for LLMs remain locked behind proprietary systems from major AI players such as OpenAI and DeepSeek. This secrecy not only stifles innovation but also makes it difficult for researchers and businesses to replicate or build upon these advancements.

A new research effort, DAPO (Decoupled Clip and Dynamic Sampling Policy Optimization), aims to change this by fully open-sourcing a scalable RL framework for LLM reasoning. Developed by ByteDance Seed, Tsinghua University’s AI Industry Research Institute, and the University of Hong Kong, DAPO offers a transparent, high-performing RL system, releasing not just the algorithm but also the training code and a curated dataset. The goal: to democratize RL for LLM reasoning and accelerate progress in AI research and industry applications.

Key Innovations of DAPO

At the heart of DAPO is a novel RL approach that improves reasoning in LLMs. Its effectiveness has been demonstrated on the AIME 2024 math competition benchmark, where it scores 50 points using the Qwen2.5-32B base model, surpassing the previous best reported result with the same base model while requiring fewer training steps.

1. Open-Sourcing an Entire Reinforcement Learning System

Unlike most proprietary models, DAPO provides a fully open RL training pipeline, including:

  • DAPO Algorithm – A refined RL method built on GRPO (Group Relative Policy Optimization); the underlying objective is sketched just after this list.
  • Training Code (verl framework) – Practical, scalable RL code for training LLMs.
  • Curated Dataset – A dataset specifically processed for mathematical reasoning and RL training.
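
For background, the GRPO family of methods that DAPO builds on optimizes a clipped, group-relative surrogate objective of roughly the following form. This is a standard restatement included for orientation, not an excerpt from the DAPO release, and the notation (G sampled responses o_i per prompt q, clip range ε) is assumed here:

```latex
J_{\mathrm{GRPO}}(\theta)
  = \mathbb{E}_{q \sim \mathcal{D},\ \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{\mathrm{old}}}(\cdot \mid q)}
    \left[ \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|o_i|} \sum_{t=1}^{|o_i|}
    \min\!\Big( r_{i,t}(\theta)\, \hat{A}_{i,t},\;
    \operatorname{clip}\big(r_{i,t}(\theta),\, 1-\varepsilon,\, 1+\varepsilon\big)\, \hat{A}_{i,t} \Big) \right]
```

Here r_{i,t}(θ) is the per-token importance ratio between the current and the behaviour policy, and Â_{i,t} = (R_i − mean({R_j})) / std({R_j}) is the group-normalized advantage shared by all tokens of response o_i. DAPO’s Clip-Higher technique, described below, splits the single clip range ε into separate lower and upper bounds.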

2. Algorithmic Innovations: Four Key Techniques

DAPO integrates four major technical improvements that enhance the efficiency and stability of RL training for LLMs:

  • Clip-Higher: PPO-style methods clip the importance ratio to avoid extreme policy updates, but a single symmetric clipping range often leads to entropy collapse, making the model overly deterministic. DAPO decouples the lower and upper clipping thresholds, leaving more room for low-probability tokens to grow and thus encouraging more diverse generation and better exploration (see the sketch after this list).
  • Dynamic Sampling: Many RL training runs waste compute on uninformative prompts. DAPO filters out prompt groups whose sampled responses are all correct or all wrong (which yield zero-gradient samples), ensuring each training batch carries useful signal and accelerating convergence.
  • Token-Level Policy Gradient Loss: Instead of averaging the loss per response, DAPO assigns gradients at the token level, so longer reasoning chains carry proportionally more weight. This is particularly useful for complex, multi-step problem solving.
  • Overlong Reward Shaping: Naive schemes harshly penalize responses that hit the length limit. DAPO scales the penalty gradually as a response approaches the cap, preventing abrupt loss of valuable information and leading to more stable training.
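
To make these four techniques concrete, here is a minimal PyTorch-style sketch. It is not the released verl implementation; the function names, tensor shapes, and default values (eps_high = 0.28, the 20,480-token budget with a 4,096-token soft buffer) are illustrative assumptions, and a real training loop would wire these pieces into the rollout and optimizer steps.

```python
import torch

def dapo_style_token_loss(logp_new, logp_old, advantages, response_mask,
                          eps_low=0.2, eps_high=0.28):
    """PPO-style surrogate with a decoupled (Clip-Higher) clip range and a
    token-level average, so longer reasoning chains contribute more tokens.

    All tensors have shape (batch, seq_len): per-token log-probs under the
    current and behaviour policies, per-token advantages (in GRPO-style
    training, the group-normalized sequence reward broadcast over tokens),
    and a 0/1 mask over response tokens.
    """
    ratio = torch.exp(logp_new - logp_old)                       # importance ratio r_t
    clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high)  # Clip-Higher: eps_high > eps_low
    surrogate = torch.minimum(ratio * advantages, clipped * advantages)
    # Token-level loss: sum over every response token in the batch, then
    # divide by the total token count rather than averaging per sequence.
    return -(surrogate * response_mask).sum() / response_mask.sum().clamp(min=1)


def keep_informative_prompts(group_rewards):
    """Dynamic-sampling filter: drop prompts whose sampled group is all-correct
    or all-wrong, since the group-normalized advantage is then zero and the
    batch entry contributes no gradient.
    group_rewards: (num_prompts, group_size); returns a boolean keep-mask."""
    return group_rewards.max(dim=1).values != group_rewards.min(dim=1).values


def soft_overlong_penalty(length, max_len=20480, buffer=4096):
    """One plausible overlong reward-shaping scheme: no penalty up to a soft
    limit, a linearly growing penalty inside the final `buffer` tokens, and a
    full penalty of -1 only beyond the hard cap."""
    if length <= max_len - buffer:
        return 0.0
    if length <= max_len:
        return float((max_len - buffer) - length) / buffer
    return -1.0
```

The points to notice are the asymmetric clip range, the single division by the total token count rather than a per-sequence average, and the fact that uniform-reward prompt groups are filtered out before the loss is ever computed.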

How DAPO Outperforms Existing Models

1. Higher Accuracy in Complex Reasoning Tasks

Empirical results show that DAPO achieves 50 points on AIME 2024, surpassing DeepSeek-R1-Zero-Qwen-32B’s score of 47. Moreover, it reaches this performance with roughly half the training steps, demonstrating both effectiveness and efficiency.

2. Enhanced Training Efficiency and Stability

By addressing common RL issues—entropy collapse, reward noise, and inefficient sampling—DAPO streamlines training, reducing the computational costs required for developing high-performance LLMs.

3. Full Reproducibility and Open-Source Transparency

A critical issue in LLM research is the lack of verifiable, open-source RL methods. DAPO is one of the few systems that provides a complete end-to-end RL training framework, making it easier for academic researchers and AI startups to replicate and extend the work.

Industry and Business Impact

1. Accelerating AI Research and Development

The availability of a state-of-the-art RL training system can dramatically accelerate research in mathematical reasoning, LLM-based tutoring, and other advanced problem-solving applications. Open-source accessibility reduces barriers to entry, fostering broader participation in AI development.

2. Expanding LLM Business Applications

Companies focused on AI-driven reasoning tasks—from automated customer support to coding assistants and financial modeling—stand to benefit from DAPO’s advancements. By integrating DAPO’s techniques, businesses can train more capable, cost-efficient AI models tailored to industry-specific challenges.

3. Lowering AI Training Costs

With increased efficiency and reduced training steps, DAPO makes it feasible for smaller companies and startups to train high-performing LLMs without massive computational expenses. This could lead to broader commercialization of advanced reasoning AI beyond tech giants.

Challenges and Considerations

While DAPO presents a groundbreaking contribution, certain factors should be noted:

  • Benchmark Scope: The model’s effectiveness has been validated on AIME 2024, a math-based dataset. Additional evaluations on other complex reasoning benchmarks (e.g., MATH, GSM8K) are needed to confirm broader applicability.
  • Computational Requirements: Despite improved efficiency, training LLMs with RL still demands substantial GPU resources. While DAPO lowers the barrier, smaller organizations may still face infrastructure challenges.
  • Implementation Complexity: DAPO’s advanced techniques, particularly token-level policy gradient loss and overlong reward shaping, require a deep understanding of RL principles, which may pose adoption challenges for teams unfamiliar with reinforcement learning.

A Game-Changer for Open-Source AI

DAPO represents a significant leap forward in scalable, transparent reinforcement learning for LLM reasoning. By open-sourcing a complete, high-performing RL system, the research team is not only advancing academic knowledge but also empowering businesses and startups to develop their own sophisticated AI models.

For investors and companies looking to enhance LLM reasoning capabilities, DAPO provides a rare opportunity: a fully accessible, state-of-the-art RL framework that reduces both the cost and complexity of developing advanced AI models. As AI adoption accelerates across industries, open-source innovations like DAPO will play a crucial role in shaping the future of AI-driven problem-solving.
