Can GRPO Be 10x More Efficient? Kwai AI’s SRPO Suggests Yes

Reinforcement learning (RL) has proven remarkably effective at eliciting sophisticated reasoning behaviors and significantly enhancing the capabilities of large language models (LLMs). However, the core training methodologies behind these groundbreaking reasoning models often remain veiled in their technical reports. Recent community efforts have predominantly focused on mathematical reasoning, leaving the challenge of cross-domain generalization largely unexplored. Furthermore, standard Group Relative Policy Optimization (GRPO) training is plagued by common issues such as performance bottlenecks, inefficient sample utilization, and difficulties in cultivating specialized reasoning skills when dealing with mixed-domain datasets.
These challenges complicate the effective scaling of RL methods for LLMs.

Addressing these limitations, researchers from the Kwaipilot team at Kuaishou have introduced a novel reinforcement learning framework: Two-Staged history-Resampling Policy Optimization (SRPO). The approach is designed to systematically tackle the aforementioned training challenges across multiple dimensions. The team has publicly released a technical report detailing the intricacies of their training method and has also open-sourced the SRPO-Qwen-32B model.

Notably, this work marks the first instance of achieving DeepSeek-R1-Zero-level performance concurrently in both the mathematical and code domains. By leveraging the same base model as DeepSeek (Qwen2.5-32B) and a purely reinforcement-learning training approach, SRPO achieves impressive results on the AIME24 (50) and LiveCodeBench (41.6) benchmarks, surpassing DeepSeek-R1-Zero-32B. Even more remarkably, SRPO reaches this level of performance with only one-tenth of the training steps required by R1-Zero.
Challenges with Vanilla GRPO

In their initial explorations, the Kwaipilot team experimented with the standard GRPO algorithm. However, they quickly encountered bottlenecks that prevented the model from reaching the desired R1-Zero performance levels.
These issues included:

Cross-Domain Optimization Conflicts (Math vs. Code): Mathematical problems tend to elicit longer and more detailed reasoning trajectories (long CoT), while code data exhibits a weaker inclination toward this. Directly mixing the two data types led to conflicts, resulting in suboptimal performance in both domains.

Reduced Training Efficiency due to Similar Group Rewards: The GRPO algorithm relies on the variance of non-zero rewards within a sampled group to calculate the advantage. When the rollouts within a group yield nearly identical reward values, the calculated advantage approaches zero. If a significant portion of the training batch exhibits this phenomenon, effective gradient contributions become minimal, drastically reducing training efficiency (see the sketch after this list).

Premature Performance Saturation: GRPO training encountered early performance plateaus and reward saturation on benchmark evaluations. This issue was partly attributed to insufficient data quality: when the training data lacks complexity or diversity, particularly when simpler problems are abundant, the model tends to conservatively maintain its performance on easier tasks, hindering its ability to develop the complex, in-depth reasoning required for challenging problems.
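To see why similar group rewards stall training, here is a minimal sketch of the group-relative advantage computation used in GRPO-style methods (an illustration, not the Kwaipilot implementation): when every rollout in a group earns the same reward, all advantages collapse to zero and the group contributes no gradient.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against its group's mean and std,
    the group-relative advantage at the heart of GRPO-style training."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Mixed outcomes within a group yield informative, non-zero advantages.
print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))  # ~[ 1. -1.  1. -1.]

# Identical rewards (e.g., every rollout solves an easy problem) yield
# zero advantage for every sample: no effective gradient contribution.
print(group_relative_advantages([1.0, 1.0, 1.0, 1.0]))  # [0. 0. 0. 0.]
```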
Two-Staged Training

To address the inherent response-length conflict between the mathematical and code domains, the Kwaipilot team implemented a two-stage training paradigm (a rough data-schedule sketch follows the two stages below):

Stage 1: Eliciting Reasoning Abilities: This initial training phase focuses exclusively on challenging mathematical data, with the goal of eliciting deep reasoning behaviors such as reflection, backtracking, and step-by-step decomposition.

Stage 2: Skill Integration: In this stage, code data is introduced into the training process. Building upon the reasoning foundation established in Stage 1, this phase aims to further enhance coding abilities while progressively strengthening procedural thinking, recursion, and tool-calling capabilities.
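A rough sketch of how such a two-stage data schedule could be wired up; the names (`math_data`, `code_data`) and the plateau heuristic are assumptions for illustration, not the team's actual pipeline:

```python
def next_epoch_data(math_data, code_data, reward_history,
                    plateau_window=20, plateau_eps=1e-3):
    """Choose the data mixture for the next epoch of a two-stage schedule.

    Stage 1 trains on math-only data to elicit long-form reasoning; once
    the reward curve plateaus, Stage 2 mixes code data in to integrate
    coding skills. The plateau test below is an illustrative heuristic.
    """
    recent = reward_history[-plateau_window:]
    plateaued = (len(recent) == plateau_window
                 and max(recent) - min(recent) < plateau_eps)
    if not plateaued:
        return list(math_data)                 # Stage 1: math only
    return list(math_data) + list(code_data)   # Stage 2: math + code
```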
Comparative Analysis of Training Strategies

The impact of different training-data strategies on response length was analyzed, revealing the following insights:

Mixed Training: Models trained on a mixture of math and code data showed limited growth in response length and poor benchmark performance. While math problems elicited some reasoning patterns, code problems often resulted in short, direct responses focused on immediate code output, with minimal preliminary analysis or planning.

Math-Only Training: Training solely on mathematical data led to a stable increase in response length and excellent performance on math benchmarks. Crucially, it fostered strong and generalizable reasoning abilities: when faced with programming tasks, the model attempted detailed, step-by-step reasoning, including the meticulous checking and revisiting of steps characteristic of mathematical problem-solving.

Code-Only Training: While this yielded improved performance on code benchmarks, explicit reasoning behavior barely developed, and achieving significant increases in response length proved difficult. Responses to both code and math problems were noticeably shorter than under math-only training, with code solutions often generated directly, without substantial step-by-step reasoning or initial analysis.

Staged Training: The two-stage approach proposed by the Kwaipilot team yielded superior results in both the mathematical and programming domains. The model consistently generated detailed step-by-step reasoning for math problems and structured reasoning patterns for programming tasks. Notably, complex behaviors emerged, such as the model spontaneously using code to assist in mathematical reasoning.
History Resampling

The Kwaipilot team observed that during the mid-to-late stages of training, nearly 50% of the sampled groups within a batch produced identical rewards. This typically occurred when the model consistently succeeded on easier problems, leading to minimal reward variance and ineffective gradient updates.

To address this inefficiency and improve the quality of the gradient signal, they introduced History Resampling. During training, they recorded the reward outcomes of all rollouts within each epoch. At the end of an epoch, they reconstructed the dataset for the next epoch according to the following criteria (see the sketch at the end of this section):

Filtering Overly Simple Samples: Samples for which all rollouts produced correct answers were excluded, as they provide no informative signal for policy improvement.

Retaining Informative Samples: Samples with diverse outcomes (both correct and incorrect) or with all-incorrect outcomes were retained. Samples with mixed outcomes generate positive reward variance, ensuring non-zero advantages and effective gradient signals. Difficult samples whose rollouts were all incorrect in the current epoch were also kept, the rationale being that these initially challenging problems may become relatively easier for the updated policy and thus generate effective gradients in subsequent training. This strategy aligns with the principle of curriculum learning: by gradually exposing the model to samples that are, on average, increasingly challenging, it enhances training efficiency.

Compared to the Dynamic Sampling method proposed in DAPO, History Resampling significantly improved computational efficiency and resulted in more stable response-length growth.
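A minimal sketch of the epoch-end filtering step, assuming each sample's recorded rollout rewards are binary correctness scores (the data layout is an illustrative assumption, not the team's code):

```python
def resample_history(epoch_records):
    """Rebuild the next epoch's dataset from this epoch's rollout rewards.

    `epoch_records` maps a sample id to the rewards its rollouts received
    this epoch (assumed binary: 1.0 = correct, 0.0 = incorrect). Samples
    solved by every rollout are dropped; samples with mixed or all-incorrect
    outcomes are kept for the next epoch.
    """
    kept = []
    for sample_id, rewards in epoch_records.items():
        if all(r == 1.0 for r in rewards):
            continue            # too easy: no informative gradient signal
        kept.append(sample_id)  # mixed outcomes, or still unsolved
    return kept

# Example: "a" is filtered out as too easy; "b" (mixed outcomes) and "c"
# (all wrong, kept as a harder target for the improving policy) survive.
records = {"a": [1.0, 1.0, 1.0], "b": [1.0, 0.0, 1.0], "c": [0.0, 0.0, 0.0]}
print(resample_history(records))  # ['b', 'c']
```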
Data

The Kwaipilot team performed meticulous data cleaning and filtering on publicly available code and math datasets. They applied heuristic rules to filter out irrelevant URLs and formatting noise, and ensured the completeness of the core fields (question and ground-truth answer) in the original data. Following the data-cleaning approach of PRIME for mathematical data, they removed multi-part questions, pure proof-based problems, and problems requiring image or table understanding. For code data, they excluded problems dependent on specific environments, file I/O, or network interactions, focusing on algorithmic logic.
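A rough sketch of what such heuristic filtering might look like; the field names, regular expressions, and keyword lists are assumptions for illustration, not the team's actual rules:

```python
import re

URL_RE   = re.compile(r"https?://\S+")
PROOF_RE = re.compile(r"\b(prove|proof|show that)\b", re.IGNORECASE)
ENV_RE   = re.compile(r"\b(filesystem|file I/O|socket|network|http request)\b",
                      re.IGNORECASE)

def keep_math_sample(sample: dict) -> bool:
    """Heuristic filter for a math sample with 'question'/'answer' fields."""
    q, a = sample.get("question", ""), sample.get("answer", "")
    if not q.strip() or not a.strip():   # core fields must be complete
        return False
    if URL_RE.search(q):                 # drop samples polluted by URLs
        return False
    if PROOF_RE.search(q):               # drop pure proof-based problems
        return False
    return True

def keep_code_sample(sample: dict) -> bool:
    """Heuristic filter for a code sample, keeping algorithmic problems."""
    q = sample.get("question", "")
    return bool(q.strip()) and not ENV_RE.search(q)
```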
Before data ingestion, they verified the correctness of both the math and code problems to ensure the accuracy and solvability of the answers, discarding those with incorrect or ambiguous solutions. Subsequently, they assessed the difficulty of each problem, categorizing it as easy, medium, or hard based on its pass rate (Pass@k).
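Difficulty bucketing by pass rate can be done along these lines; the value of k and the cutoffs are illustrative assumptions, as the report does not spell them out here:

```python
def difficulty_label(num_correct: int, k: int,
                     easy_cut: float = 0.8, hard_cut: float = 0.2) -> str:
    """Bucket a problem by its empirical pass rate over k sampled solutions.

    `num_correct` of `k` rollouts passed; the 0.8 / 0.2 cutoffs are
    placeholders, not the thresholds used by the Kwaipilot team.
    """
    pass_rate = num_correct / k
    if pass_rate >= easy_cut:
        return "easy"
    if pass_rate <= hard_cut:
        return "hard"
    return "medium"

print(difficulty_label(7, k=8))  # easy
print(difficulty_label(3, k=8))  # medium
print(difficulty_label(1, k=8))  # hard
```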
Experimental Results

This section details the experimental results obtained with the SRPO method. The Kwaipilot team focused on the changes in reward and in metrics such as response length during training.

Training Process

The complete reward curve and response-length curve recorded during SRPO training show that after the initial reward growth began to plateau, training transitioned into the second stage, followed by a further increase in reward during subsequent training.
Integrating code data did not significantly increase the response length, which aligned with their expectations. Simultaneously, benchmark results indicated a continuous and stable improvement in both the mathematical and coding abilities of the model, demonstrating the effectiveness of the new method.

Specifically, History Resampling ensured that gradient updates remained effective at each training step, directly increasing the proportion of informative gradients. This enhanced sampling efficiency led to stable reward growth, clearly showcasing the improved training efficiency achieved by the resampling strategy.
Reasoning Behaviors

The Kwaipilot team identified three representative reflective patterns: recheck, hesitation, and exploration. They statistically analyzed the responses containing each pattern and recorded the average response length for each, tracking how these behaviors emerged from the policy optimization process.
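One plausible way to gather such statistics is a simple keyword-based counter over sampled responses; the marker phrases below are assumptions for illustration, since the report does not list the exact markers used:

```python
# Assumed marker phrases for each reflective pattern (illustrative only).
PATTERNS = {
    "recheck":     ["let me recheck", "double-check", "verify again"],
    "hesitation":  ["wait,", "hmm", "on second thought"],
    "exploration": ["alternatively", "another approach", "let's try"],
}

def pattern_stats(responses):
    """Count responses containing each pattern and their average length."""
    stats = {}
    for name, markers in PATTERNS.items():
        hits = [r for r in responses
                if any(m in r.lower() for m in markers)]
        avg_len = sum(len(r) for r in hits) / len(hits) if hits else 0.0
        stats[name] = {"count": len(hits), "avg_length_chars": avg_len}
    return stats

responses = [
    "Wait, let me recheck the algebra before concluding...",
    "Alternatively, another approach is to use induction...",
]
print(pattern_stats(responses))
```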
As the resulting statistics show, the model exhibited almost no proactive checking of, or reflection on, previous reasoning steps in the early stages of training. As training progressed, however, the model displayed significant reflective and backtracking behaviors, forming response patterns such as step-by-step reasoning, numerical substitution, step-by-step verification, and self-optimization.

Interestingly, the team also discovered that the model learned to spontaneously use program code for verification when solving mathematical problems: it would first work through a solution via mathematical reasoning and then proactively write code to check that solution's correctness, indicating that by the later stages of training the model had mastered broad thinking and the integrated application of various code-based reasoning approaches to problem-solving.
The paper, SRPO: A Cross-Domain Implementation of Large-Scale Reinforcement Learning on LLM, is on arXiv. Try the SRPO-Qwen-32B model on Hugging Face.