Off-Policy Reinforcement Learning RL with KL Divergence Yields Superior Reasoning in Large Language Models

Policy gradient methods have significantly advanced the reasoning capabilities of LLMs, particularly through RL. A key tool in stabilizing these methods is Kullback-Leibler (KL) regularization, which discourages drastic changes between the current policy and the reference policy. While widely used in algorithms like PPO, there’s still much to explore in how different KL variants, such as Forward KL, Reverse KL, and their unnormalized form, can be estimated and applied within loss functions. These choices, along with various gradient estimators and on-policy vs. off-policy settings, shape training stability and performance in nuanced and underexplored ways.

Fine-tuning LLMs with human feedback is crucial for building aligned AI systems. Two main strategies are employed: optimizing with reward models using policy gradient methods, such as PPO, and directly training on human preferences through methods like Direct Preference Optimization (DPO). While PPO stabilizes training with reward models, DPO and its variants use pairwise comparisons to simplify and scale learning, gaining popularity in recent models. Reinforcement learning is also increasingly used to enhance LLM reasoning, especially in complex tasks like math and coding. New methods aim to reduce computational costs and improve training stability, often by replacing value networks or modifying KL penalties.

Researchers from UCLA, Tsinghua University, and Shanghai Qi Zhi introduce Regularized Policy Gradient (RPG), a unified framework for KL-regularized policy gradients in online reinforcement learning. They derive policy gradients and surrogate loss functions using both Forward and Reverse KL divergences, addressing normalized and unnormalized policies. RPG supports both fully differentiable objectives and REINFORCE-style estimators, tailored for off-policy training with importance sampling. The study also identifies and addresses theoretical issues in existing methods, such as GRPO, and examines KL regularization in REINFORCE++. Experiments on LLM reasoning tasks demonstrate that RPG achieves improved stability and performance compared to leading baselines, including GRPO, REINFORCE++, and DAPO.

The study presents policy gradient methods that incorporate KL divergence regularization in both online and off-policy settings using importance sampling from an older policy. For forward KL, the gradient involves importance-weighted rewards and a regularization term, with its loss resembling the maximum likelihood loss when the rewards are zero. The unnormalized forward KL adds a correction for mismatched distribution masses. Similarly, reverse KL and its unnormalized form penalize deviation from the reference policy, modifying the reward based on log-probability ratios. All approaches share a REINFORCE-like gradient structure, enabling alternative implementations using the stop-gradient operator, which supports stable and efficient optimization in practice.

The researchers conducted a thorough evaluation of their proposed RPG methods—both differentiable and REINFORCE-style—by comparing them to several established baselines on complex math reasoning tasks using Qwen2.5 language models. They trained on the DAPO-Math-17k dataset and evaluated performance using benchmarks such as AMC23 and AIME. RPG variants consistently demonstrated strong accuracy, training stability, and efficient memory usage. Implementation utilized the Verl framework and techniques such as KL regularization, PPO-style clipping, and Schedule-Free AdamW for smoother optimization. RPG models generally outperformed others in reward shaping, entropy control, and response length, highlighting their robustness and suitability for stable, high-performance learning.

In conclusion, RPG is a comprehensive framework for designing and analyzing policy gradient methods that incorporate KL-regularization in online, off-policy reinforcement learning. They explore a range of configurations, including both forward and reverse KL divergences, normalized and unnormalized policy distributions, and two types of estimators: fully differentiable and REINFORCE-style. RPG aims to provide a structured approach to understanding and implementing these variations. Applied to reasoning tasks with large language models, the proposed methods demonstrate more stable training and competitive or improved performance compared to established baselines, such as GRPO, REINFORCE++, and DAPO.

Check out the Paper and GitHub Page . All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 95k+ ML SubReddit and Subscribe to our Newsletter.

The post Off-Policy Reinforcement Learning RL with KL Divergence Yields Superior Reasoning in Large Language Models appeared first on MarkTechPost.

Source: Read MoreÂ

CodeSOD: A Unique Way to Primary Key

BrowserStack launches Figma plugin for detecting accessibility issues in design phase

Parasoft brings agentic AI to service virtualization in latest release

Node.js vs. Python for Backend: 7 Reasons C-Level Leaders Choose Node.js Talent

The best CRM software with email marketing in 2025: Expert tested and reviewed

This multi-port car charger can power 4 gadgets at once – and it’s surprisingly cheap

I’m a wearables editor and here are the 7 Pixel Watch 4 rumors I’m most curious about

8 ways I quickly leveled up my Linux skills – and you can too

The Intersection of Agile and Accessibility – A Series on Designing for Everyone

The Intersection of Agile and Accessibility – A Series on Designing for Everyone

Zero Trust & Cybersecurity Mesh: Your Org’s Survival Guide

Execute Ping Commands and Get Back Structured Data in PHP

A Tomb Raider composer has been jailed — His legacy overshadowed by $75k+ in loan fraud

A Tomb Raider composer has been jailed — His legacy overshadowed by $75k+ in loan fraud

“I don’t think I changed his mind” — NVIDIA CEO comments on H20 AI GPU sales resuming in China following a meeting with President Trump

Galaxy Z Fold 7 review: Six years later — Samsung finally cracks the foldable code

Off-Policy Reinforcement Learning RL with KL Divergence Yields Superior Reasoning in Large Language Models

How to Evaluate Jailbreak Methods: A Case Study with the StrongREJECT Benchmark

Boolformer: Symbolic Regression of Logic Functions with Transformers

CVE-2025-1478 – GitLab CE/EE Denial of Service Vulnerability in Board Names

CVE-2025-34044 – WIFISKY 7-layer Flow Control Router Command Injection Vulnerability

EC-DIT: Scaling Diffusion Transformers with Adaptive Expert-Choice Routing

Nintendo Switch 2 preorders delayed in US due to tariffs, expect similar problems for future Xbox hardware

Underdamped Diffusion Samplers Outperform Traditional Methods: Researchers from Karlsruhe Institute of Technology, NVIDIA, and Zuse Institute Berlin Introduce a New Framework for Efficient Sampling from Complex Distributions with Degenerate Noise

slop queries for a selection from the user and prints the region to stdout

Ripple’s xrpl.js npm Package Backdoored to Steal Private Keys in Major Supply Chain Attack

CVE-2025-48146 – LupsOnline SEO Flow CSRF Stored XSS

Off-Policy Reinforcement Learning RL with KL Divergence Yields Superior Reasoning in Large Language Models

Related Posts