    From Exploration Collapse to Predictable Limits: Shanghai AI Lab Proposes Entropy-Based Scaling Laws for Reinforcement Learning in LLMs

    June 3, 2025

    Recent advances in reasoning-centric large language models (LLMs) have expanded the scope of reinforcement learning (RL) beyond narrow, task-specific applications, enabling broader generalization and reasoning capabilities. However, this shift introduces significant challenges, particularly in scaling the training compute required for learning from experience. Unlike imitation learning through pre-training and fine-tuning, RL demands a more computationally intensive approach. A central issue is the decline in policy entropy, which affects the balance between exploiting known strategies and exploring new ones. This exploitation-exploration trade-off is fundamental in RL, and controlling policy entropy has become critical to maintaining effective exploration during training.
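    Policy entropy here is the standard Shannon entropy of the model's next-token distribution, averaged over positions, and tracking it during training is how the collapse is observed. Below is a minimal sketch of how it can be estimated for a Hugging Face causal LM; the model name and prompt are illustrative assumptions, not the paper's setup.

```python
# Minimal sketch: estimating a policy's mean token-level entropy from logits.
# The model name and prompt are illustrative assumptions, not the paper's setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B"  # assumed; any causal LM works the same way
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Prove that the sum of two even integers is even."
inputs = tok(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # (batch, seq_len, vocab)

log_probs = torch.log_softmax(logits, dim=-1)
# H_t = -sum_a pi(a | s_t) log pi(a | s_t), averaged over positions below
entropy_per_pos = -(log_probs.exp() * log_probs).sum(dim=-1)
print("mean policy entropy:", entropy_per_pos.mean().item())
```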

    Existing efforts address the exploration-exploitation trade-off in RL by utilizing policy entropy. Maximum entropy RL adds an entropy regularization term to the reward function, promoting uncertainty in action selection and encouraging broader exploration. While this technique is widely used in conventional RL algorithms, its value for LLMs remains debated. Moreover, the predictability of RL training for LLMs is largely unexplored: while neural scaling laws guide LLM pre-training, comparable predictive frameworks for RL training remain limited. Existing RL methods for LLMs with verifiable rewards show promising reasoning improvements, but their core mechanisms are not yet well understood.
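    As a concrete illustration of the regularization term described above, the sketch below adds an entropy bonus to a REINFORCE-style policy-gradient loss. The function signature and the coefficient value are assumptions for illustration, not the formulation used in this work.

```python
import torch

def entropy_regularized_pg_loss(log_probs, advantages, logits, beta=0.01):
    """REINFORCE-style policy-gradient loss with a maximum-entropy bonus.

    log_probs:  (batch, seq) log pi(a_t | s_t) of the sampled tokens
    advantages: (batch, seq) advantage estimates for those tokens
    logits:     (batch, seq, vocab) full next-token logits of the policy
    beta:       entropy coefficient (assumed value; tuned per task)
    """
    # Standard policy-gradient term: push up log-probability of high-advantage tokens.
    pg_loss = -(advantages.detach() * log_probs).mean()

    # Entropy of the full next-token distribution at every position.
    full_log_probs = torch.log_softmax(logits, dim=-1)
    entropy = -(full_log_probs.exp() * full_log_probs).sum(dim=-1).mean()

    # Subtracting beta * H rewards uncertainty in action selection (exploration).
    return pg_loss - beta * entropy
```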

    Researchers from Shanghai AI Laboratory, Tsinghua University, UIUC, Peking University, Nanjing University, and CUHK propose an approach to address the collapse of policy entropy in RL for reasoning-centric LLMs. They establish a transformation equation, R = −a·exp(H) + b, where H is the policy entropy, R is downstream performance, and a and b are fitting coefficients. This empirical law strongly suggests that policy performance is traded against policy entropy and is therefore bottlenecked by its exhaustion. The researchers also investigate entropy dynamics, and their derivation shows that the change in policy entropy is driven by the covariance between the action probability and the change in its logit. Based on this, they propose two techniques, Clip-Cov and KL-Cov, which respectively clip, and apply a KL penalty to, tokens with high covariance.
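    Because R is linear in exp(H), the coefficients a and b of the fitted law can be recovered with ordinary least squares from logged (entropy, performance) pairs. The sketch below uses synthetic numbers purely to illustrate the fitting procedure and the implied performance ceiling when entropy is exhausted; it is not the paper's data.

```python
import numpy as np

# Synthetic (entropy, performance) pairs standing in for logged training curves.
H = np.array([0.9, 0.7, 0.5, 0.35, 0.2, 0.1])
R = np.array([0.18, 0.27, 0.34, 0.39, 0.43, 0.45])

# R = -a * exp(H) + b is linear in exp(H), so ordinary least squares suffices.
X = np.column_stack([-np.exp(H), np.ones_like(H)])
(a, b), *_ = np.linalg.lstsq(X, R, rcond=None)

print(f"fitted a = {a:.3f}, b = {b:.3f}")
# When entropy is exhausted (H -> 0), the law predicts a ceiling of R = b - a.
print(f"predicted ceiling at H = 0: {b - a:.3f}")
```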

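    The authors' reference implementations of Clip-Cov and KL-Cov are in the project's GitHub repository; the fragment below is only a schematic of the KL-Cov idea as described above: rank tokens by a covariance proxy between their log-probability and advantage, and penalize divergence from the rollout policy on the small top fraction. The tensor names, the selection fraction, and the penalty form are assumptions for illustration.

```python
import torch

def kl_cov_penalty(log_probs, old_log_probs, advantages, top_frac=0.002, kl_coef=1.0):
    """Schematic KL-Cov-style regularizer (not the authors' implementation).

    log_probs:     (N,) current-policy log pi(a_t | s_t) for the sampled tokens
    old_log_probs: (N,) log-probabilities of the same tokens under the rollout policy
    advantages:    (N,) advantage estimates
    """
    # Per-token covariance proxy between log-probability and advantage.
    cov = ((log_probs - log_probs.mean()) * (advantages - advantages.mean())).detach()

    # Keep only the small fraction of tokens with the highest covariance.
    k = max(1, int(top_frac * cov.numel()))
    idx = torch.topk(cov, k).indices

    # Penalize divergence from the rollout policy on exactly those tokens
    # (squared log-ratio used here as a simple KL surrogate).
    log_ratio = log_probs[idx] - old_log_probs[idx]
    return kl_coef * (log_ratio ** 2).mean()
```

    In training, such a term would be added to the policy loss so that the update stays close to the rollout distribution precisely where the covariance, and hence the entropy drain, is largest.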
    To investigate and validate the entropy-collapse phenomenon in RL-tuned LLMs, the researchers applied RL to LLMs on verifiable tasks such as math and coding, using an autoregressive generation setup in which models produce token sequences from input prompts. The study involves 11 widely adopted open-source models spanning four families: Qwen2.5, Mistral, LLaMA, and DeepSeek, with parameter counts ranging from 0.5B to 32B. Evaluations are performed on eight public benchmarks, including MATH500, AIME 2024, AMC, and Eurus-2-RL-Code. RL training follows the veRL framework in a zero-shot setting, using algorithms such as GRPO, REINFORCE++, and PRIME to optimize policy performance while tracking entropy dynamics.

    The proposed Clip-Cov and KL-Cov techniques were evaluated on the Qwen2.5 models using the DAPOMATH dataset for math tasks, and they achieve non-trivial performance gains across all benchmarks, improving on the GRPO baseline by 2.0% on average for the 7B model and 6.4% for the 32B model. They also maintain a higher entropy level throughout training; for example, when the baseline's entropy reaches a plateau, KL-Cov still sustains an entropy level more than 10 times higher. The gains are more substantial on the larger Qwen2.5-32B model, with improvements of 15.0% and 14.6% over GRPO on the most challenging benchmarks, AIME24 and AIME25, respectively.

    In conclusion, the researchers address the challenge of policy entropy collapse in RL for reasoning-centric LLMs. Their findings highlight a trade-off in which performance improves at the cost of diminished exploration, which ultimately limits further gains. Through theoretical analysis and empirical validation, they identify entropy dynamics as a key bottleneck and propose two effective regularization strategies, Clip-Cov and KL-Cov, that manage high-covariance tokens and sustain exploration. As RL emerges as a crucial axis for scaling beyond pre-training, addressing entropy collapse becomes essential. This work provides foundational insight into the role of entropy and can guide future efforts to scale RL toward more intelligent and capable language models.


    Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project.

    The post From Exploration Collapse to Predictable Limits: Shanghai AI Lab Proposes Entropy-Based Scaling Laws for Reinforcement Learning in LLMs appeared first on MarkTechPost.
