![[ExGRPO] Teach LLMs to Learn from Experience](https://i.imgur.com/yeEd6aE.png)
[ExGRPO] Teach LLMs to Learn from Experience

Paper at a Glance
- Paper Title: ExGRPO: Learning to Reason from Experience
- Authors: Runzhe Zhan, Yafu Li, Zhi Wang, Xiaoye Qu, Dongrui Liu, Jing Shao, Derek F. Wong, and Yu Cheng
- Affiliation: University of Macau, Shanghai AI Laboratory, Nanjing University, The Chinese University of Hong Kong
- Published in: arXiv Preprint, October 2025
- Link to Paper: https://arxiv.org/abs/2510.02245
- Project Page: Code and Models.
The Gist of It: TL;DR
In one sentence: This paper introduces ExGRPO, a framework that significantly improves the reasoning ability of large language models by intelligently replaying and prioritizing valuable past experiences during reinforcement learning, leading to more stable, efficient, and powerful models.
Why It Matters: The Big Picture
Reinforcement learning (RL) is a powerful technique for teaching large language models (LLMs) complex reasoning skills, like solving math problems. The process, often called Reinforcement Learning from Verifiable Rewards (RLVR), involves letting the model generate a “chain of thought” to solve a problem and then rewarding it if the final answer is correct. This is great in theory, but in practice, it’s incredibly wasteful.
Most standard RLVR methods are “on-policy,” which means they generate a bunch of reasoning attempts, use them for a single model update, and then throw them away. Imagine a student solving a dozen math problems, learning a tiny bit from them, and then immediately forgetting every single problem and solution they just worked through. This not only squanders massive amounts of computation but also misses a huge opportunity to learn from past successes. This inefficiency is a major bottleneck preventing us from scaling up the reasoning capabilities of LLMs. How can we teach a model to learn from its own “stream of experience” more effectively?
The Core Idea: How It Works
The authors of ExGRPO start by asking a fundamental question: what makes a reasoning “experience” valuable for learning? Instead of just replaying every past success, they first investigate the characteristics of the most useful experiences and build a system to manage them strategically.
1. The Problem They’re Solving
On-policy RL is sample-inefficient. It discards valuable data. While the concept of “experience replay” (storing and reusing past interactions) is a classic RL technique, it hasn’t been systematically explored for LLM reasoning. A naive replay buffer might just store random successes, but the authors hypothesize that not all successful reasoning chains are equally good for learning. Some are elegant and direct, while others are lucky but logically flawed.
2. The Key Innovation
The core insight of the paper is the identification of two simple yet effective proxies for the “value” of a reasoning experience:
- Rollout Correctness (Problem Difficulty): This is the success rate of the model on a given problem. The authors found that problems of medium difficulty (where the model succeeds between 25% and 75% of the time) provide the strongest learning signal. Easy problems offer little new information, while very hard problems are often too difficult to learn from effectively.
- Trajectory Entropy (Reasoning Quality): This measures the model’s uncertainty while generating a reasoning path (a “trajectory”). A low-entropy trajectory means the model was more “confident” and direct in its steps. As shown in Figure 1 of the paper, even among trajectories that reach the same correct final answer, the low-entropy ones are far more likely to contain logically sound, high-quality reasoning.
Based on these insights, the best experiences come from medium-difficulty questions that produce low-entropy, correct reasoning chains.
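To make the two proxies concrete, here is a minimal Python sketch (not the authors’ code) of how they might be computed from a group of rollouts for one question. The 25%/75% difficulty band follows the paper; the entropy estimate uses the mean negative log-probability of the sampled tokens as a simple stand-in for the paper’s trajectory-entropy measure, and all function names are illustrative.

```python
# Illustrative sketch (not the authors' code) of the two experience-value proxies.
# Assumption: each rollout carries a binary verifiable reward and per-token
# log-probabilities under the current policy.

def rollout_correctness(rewards):
    """Fraction of rollouts for a question that end in a correct, verified answer."""
    return sum(rewards) / len(rewards)

def trajectory_entropy(token_logprobs):
    """Mean negative log-probability of the generated tokens, used here as a simple
    stand-in for the paper's trajectory-entropy measure; lower = more confident."""
    return -sum(token_logprobs) / len(token_logprobs)

def difficulty_bucket(correctness, low=0.25, high=0.75):
    """Partition questions by empirical success rate. The 25%-75% medium band follows
    the paper; fully mastered questions are retired from training."""
    if correctness >= 1.0:
        return "retired"
    if correctness > high:
        return "easy"
    if correctness >= low:
        return "medium"   # strongest learning signal according to the paper
    return "hard"
```

In a real pipeline the log-probabilities would come from re-running the policy over the stored trajectory, so the same experience can be re-scored as the model changes.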
3. The Method, Step-by-Step
ExGRPO (Experiential Group Relative Policy Optimization) is a two-phase framework designed to exploit these insights, as illustrated in Figure 2 of the paper.
Phase 1: Experience Management
This phase is all about curating a high-quality library of past successes.
- Collection & Partition: The model attempts problems, and all successful reasoning trajectories are stored in a replay buffer. This buffer is then partitioned into “buckets” based on the problem’s difficulty (Easy, Medium, Hard). To prevent overfitting, problems the model has mastered (100% success rate) are moved to a “Retired Set.”
- Selection: During training, ExGRPO prioritizes sampling from the Medium bucket. For each chosen problem, it then selects the stored reasoning trajectory with the lowest entropy under the current model’s policy, so the model learns from its most confident and direct past successes (a minimal sketch of this loop follows below).
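The sketch below illustrates this experience-management loop, reusing the hypothetical difficulty_bucket helper from the previous snippet; the class and method names are assumptions rather than the authors’ API, and re-bucketing as a question’s difficulty drifts is omitted for brevity.

```python
import random
from collections import defaultdict

class ExperienceBuffer:
    """Illustrative replay buffer with difficulty buckets and a retired set."""

    def __init__(self):
        self.buckets = defaultdict(dict)   # bucket name -> {question_id: [trajectories]}
        self.retired = set()

    def add(self, question_id, correctness, successful_trajectories):
        """Store successful trajectories; retire fully mastered questions."""
        bucket = difficulty_bucket(correctness)   # helper from the previous sketch
        if bucket == "retired":
            self.retired.add(question_id)
            for stored in self.buckets.values():
                stored.pop(question_id, None)     # drop stale entries for retired questions
            return
        self.buckets[bucket].setdefault(question_id, []).extend(successful_trajectories)

    def sample(self, k, entropy_fn):
        """Prefer Medium-bucket questions; for each, replay the stored trajectory with
        the lowest entropy under the current policy (entropy_fn re-scores stored text)."""
        pool = self.buckets["medium"] or self.buckets["easy"] or self.buckets["hard"]
        chosen = random.sample(list(pool), min(k, len(pool)))
        return [(qid, min(pool[qid], key=entropy_fn)) for qid in chosen]
```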
Phase 2: Experiential Policy Optimization
This phase is about balancing learning from the past with exploring new solutions.
- Mixed-Policy Objective: Each training batch is a mix. A portion (e.g., 50%) consists of fresh, on-policy attempts at new problems to encourage exploration. The other portion is composed of the high-value experiences curated in Phase 1 to ensure efficient exploitation.
- Stabilization: ExGRPO includes mechanisms like Policy Shaping to prevent the model from becoming overconfident and simply memorizing the replayed solutions, which could harm its ability to explore. It also uses a Delayed Start to ensure the experience buffer is only populated after the model has reached a baseline level of competence (see the sketch after this list).
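Below is an illustrative sketch of Phase 2’s batch mixing and delayed start, building on the ExperienceBuffer above. Only the data-selection side is shown; the actual ExGRPO loss (importance-sampling correction and Policy Shaping on replayed trajectories) is not reproduced here, and values such as warmup_steps are placeholder hyperparameters.

```python
import random

def build_mixed_batch(new_questions, buffer, entropy_fn,
                      replay_ratio=0.5, batch_size=64):
    """Mix fresh on-policy questions with replayed high-value experiences."""
    n_replay = int(batch_size * replay_ratio)              # e.g., 50% replayed experience
    replayed = buffer.sample(n_replay, entropy_fn)          # (question_id, trajectory) pairs
    fresh = random.sample(new_questions, batch_size - n_replay)  # rolled out on-policy
    return fresh, replayed

def select_batch(new_questions, buffer, entropy_fn, step,
                 warmup_steps=100, batch_size=64):
    """Delayed start: remain purely on-policy until a warm-up phase has passed and
    the buffer actually holds experience worth replaying."""
    if step < warmup_steps or not any(buffer.buckets.values()):
        return random.sample(new_questions, batch_size), []
    return build_mixed_batch(new_questions, buffer, entropy_fn,
                             batch_size=batch_size)
```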
Key Experimental Results
The authors tested ExGRPO on five different LLMs (from 1.5B to 8B parameters) across nine challenging math and general reasoning benchmarks.
- Consistent Performance Gains: ExGRPO consistently outperformed standard on-policy RLVR. On average, it achieved a +3.5 point gain on in-distribution math benchmarks and an impressive +7.6 point gain on out-of-distribution general reasoning tasks.
- Stabilizes Weaker Models: A standout result was ExGRPO’s ability to stabilize training for models that would otherwise fail. As shown in Figure 4, the Llama-3.1 8B base model collapses under standard on-policy training, with its reward signal flatlining. In contrast, ExGRPO allows the model to learn from its early “lucky hits,” build momentum, and achieve meaningful improvements.
- Data Efficiency: The framework achieves better results while using 50% less fresh on-policy data per batch compared to baselines, demonstrating the power of efficiently reusing past experience.
- Ablation studies confirmed the design choices: Removing either the difficulty-based question selection or the entropy-based trajectory selection resulted in a significant drop in performance, validating the paper’s core heuristics.
A Critical Look: Strengths & Limitations
Strengths / Contributions
- Addresses a Critical Bottleneck: The paper provides a practical and effective solution to the sample inefficiency of on-policy RL, a major hurdle in scaling LLM reasoning capabilities.
- Principled Experience Selection: It moves beyond naive replay by offering a well-motivated, data-driven heuristic (medium difficulty + low entropy) for identifying the most valuable learning experiences.
- Enhances Training Stability: The ability to successfully train weaker models where on-policy methods fail is a significant contribution, making advanced RL techniques more accessible to a wider range of base models.
- Empirically Robust: The method’s effectiveness is demonstrated across multiple model families, sizes, and a diverse set of challenging benchmarks.
Limitations / Open Questions
- Limited to Verifiable Tasks: The correctness-based bucketing system is tailored for problems with clear right or wrong answers, like math. Its applicability to more subjective, open-ended tasks (e.g., creative writing, summarization) remains an open question.
- Heuristics Might Miss “Valuable Failures”: The framework focuses exclusively on successful trajectories. It may miss out on learning opportunities from “valuable failures,” where an incorrect path contains useful reasoning steps.
- Potential for Premature Convergence: The strong emphasis on exploiting low-entropy (high-confidence) solutions could, in some scenarios, risk the model converging on a suboptimal reasoning strategy that it is simply very confident about.
Contribution Level: Significant Improvement. ExGRPO does not invent a new paradigm, but it provides a highly effective, practical, and well-justified solution to a major problem in applying reinforcement learning to LLMs. Its ability to improve both performance and training stability makes it a valuable contribution to the field of AI reasoning.
Conclusion: Potential Impact
ExGRPO makes a compelling case that how a model learns from its experience is just as important as the learning algorithm itself. By treating experience as a valuable, manageable resource, the authors have developed a framework that makes RL for reasoning more efficient, stable, and powerful. This work is likely to influence future research on scaling LLM capabilities, pushing the community to think more deeply about curriculum learning and data curation within RL loops. For practitioners, ExGRPO offers a tangible method to get more performance out of their models with less computational waste—a win-win for both research and application.
- Title: [ExGRPO] Teach LLMs to Learn from Experience
- Author: Jellyfish
- Created at: 2025-10-09 15:41:21
- Updated at: 2025-10-09 09:14:02
- Link: https://makepaperseasy.com/posts/20251009154121/
- License: This work is licensed under CC BY-NC-SA 4.0.

