    LLMs Can Learn Complex Math from Just One Example: Researchers from University of Washington, Microsoft, and USC Unlock the Power of 1-Shot Reinforcement Learning with Verifiable Reward

    May 3, 2025

Recent advancements in LLMs such as OpenAI-o1, DeepSeek-R1, and Kimi-1.5 have significantly improved their performance on complex mathematical reasoning tasks. A key contributor to these improvements is Reinforcement Learning with Verifiable Reward (RLVR), which uses rule-based rewards, typically a binary signal indicating whether a model's solution to a problem is correct. Beyond enhancing final-output accuracy, RLVR has also been observed to foster beneficial cognitive behaviors such as self-reflection and to improve generalization across tasks. While much research has focused on optimizing reinforcement learning algorithms like PPO and GRPO for greater stability and performance, the influence of training data, both its quantity and its quality, remains less understood. Questions about how much data, and what kind, is truly effective for RLVR are still open, despite work such as LIMR introducing metrics to identify impactful examples and reduce dataset size while maintaining performance.
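To make the reward setup concrete, here is a minimal sketch of a rule-based verifiable reward. It assumes the model marks its final answer with a "####" delimiter, which is an illustrative convention rather than anything specified in the paper; production RLVR pipelines typically use more robust checkers such as symbolic equivalence.

```python
# Minimal sketch of a binary verifiable reward for math problems.
# Assumption: the model ends its solution with '#### <answer>'.
# Real pipelines normalize answers far more carefully (e.g., via sympy).

def normalize(ans: str) -> str:
    """Strip whitespace and casing so '1/2' and ' 1/2 ' compare equal."""
    return ans.strip().replace(" ", "").lower()

def verifiable_reward(model_solution: str, gold_answer: str) -> float:
    """Return 1.0 if the final answer matches the gold answer, else 0.0."""
    final = model_solution.rsplit("####", 1)[-1]
    return 1.0 if normalize(final) == normalize(gold_answer) else 0.0

# A correct solution earns reward 1.0; anything else earns 0.0.
assert verifiable_reward("x = 3, so the area is #### 9", "9") == 1.0
assert verifiable_reward("the area is #### 8", "9") == 0.0
```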

In contrast to the extensive research on data selection in supervised fine-tuning and human-feedback-based reinforcement learning, the role of data in RLVR has seen limited exploration. While LIMR demonstrated that a small subset of data (1.4k of 8.5k examples) could maintain performance, it did not examine the extreme case of minimal data use. A concurrent study found that PPO training with just four examples led to notable improvements, but this finding was neither deeply investigated nor benchmarked against full-dataset performance. Although RLVR shows great promise for enhancing reasoning in LLMs, a deeper, systematic study of data efficiency and selection in this setting is still lacking.

Researchers from the University of Washington, University of Southern California, Microsoft, University of California, Santa Cruz, and Georgia Institute of Technology show that RLVR can significantly enhance a large language model's mathematical reasoning using a single training example, a setup they call 1-shot RLVR. Applying it to Qwen2.5-Math-1.5B improves the model's MATH500 accuracy from 36.0% to 73.6%, matching the performance obtained with much larger datasets. The improvements generalize across models, tasks, and algorithms. The study also reveals effects such as cross-domain generalization, increased self-reflection, and post-saturation generalization, and highlights the roles of policy gradient loss and entropy-driven exploration.

The study asks how far the RLVR training dataset can be reduced while retaining performance comparable to the full dataset. Remarkably, the authors find that a single training example (1-shot RLVR) can significantly boost mathematical reasoning, and that this effect generalizes across tasks, models, and domains: training on one example often enhances performance even on unrelated domains. A simple data selection strategy based on training accuracy variance is proposed (sketched below), though even randomly chosen examples can yield major gains.
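The variance-based selection idea can be sketched as follows: rank training examples by how much their per-step training accuracy fluctuates, on the intuition that high-variance examples carry a stronger learning signal. The function and example names below are hypothetical illustrations, not the paper's exact procedure.

```python
# Hedged sketch of variance-based data selection for RLVR.
import statistics

def variance_score(accuracy_history: list[float]) -> float:
    """Variance of an example's rollout accuracy across training steps."""
    return statistics.pvariance(accuracy_history) if len(accuracy_history) > 1 else 0.0

def select_examples(histories: dict[str, list[float]], k: int = 1) -> list[str]:
    """Pick the k examples with the highest training-accuracy variance."""
    ranked = sorted(histories, key=lambda ex: variance_score(histories[ex]), reverse=True)
    return ranked[:k]

# Example: an unstable example ranks above one the model already solves.
histories = {
    "pi1": [0.0, 0.3, 0.9, 0.6],   # accuracy swings during training: selected
    "pi2": [1.0, 1.0, 1.0, 1.0],   # always correct: zero variance, skipped
}
print(select_examples(histories, k=1))  # -> ['pi1']
```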

The authors evaluate their method using Qwen2.5-Math-1.5B as the primary model, along with Qwen2.5-Math-7B, Llama-3.2-3B-Instruct, and DeepSeek-R1-Distill-Qwen-1.5B. They use a 1,209-example subset of the DeepScaleR dataset for data selection and the MATH dataset for comparison. Training runs on the verl pipeline with carefully chosen hyperparameters and batch configurations. Surprisingly, training with just one or two examples, especially π1 and π13, leads to strong generalization, even beyond math tasks. This "post-saturation generalization" persists despite signs of overfitting. The study also finds increased model self-reflection and shows that even simple examples can significantly enhance performance across domains.
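As a rough illustration of what one GRPO-style update on a single example involves, the sketch below scores a group of rollouts with the binary reward and computes group-normalized advantages. The group size and normalization details here are assumptions for clarity, not the paper's verl configuration.

```python
# Illustrative GRPO-style advantage computation for 1-shot RLVR:
# sample several rollouts of the one training example, score each with
# the binary verifiable reward, and normalize within the group.
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Center and scale binary rewards within the rollout group."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Eight rollouts of the single example; three reached the right answer.
rewards = torch.tensor([1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0])
adv = grpo_advantages(rewards)
# Correct rollouts get positive advantage, incorrect ones negative, so the
# policy gradient pushes probability mass toward verified solutions.
print(adv)
```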

In conclusion, the study explores the mechanisms behind the success of 1-shot RLVR and finds that base models already possess strong reasoning abilities that a single example can elicit. Policy gradient loss is identified as the key driver of 1-shot RLVR's effectiveness, with entropy loss further enhancing performance; encouraging exploration through techniques such as entropy regularization also improves post-saturation generalization. Finally, the findings underscore the value of careful data selection for optimizing performance, particularly in data-constrained scenarios.
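A hedged sketch of the loss decomposition discussed here: a REINFORCE-style policy gradient term weighted by advantage, plus an entropy bonus that encourages exploration. The coefficient and tensor shapes are illustrative assumptions, not values from the paper.

```python
# Sketch of a policy gradient loss with an entropy bonus, assuming
# per-token log-probabilities, advantages, and entropies are available.
import torch

def pg_loss_with_entropy(logprobs: torch.Tensor,
                         advantages: torch.Tensor,
                         entropy: torch.Tensor,
                         ent_coef: float = 0.01) -> torch.Tensor:
    """REINFORCE-style loss -E[A * log pi] minus an entropy bonus."""
    policy_term = -(advantages.detach() * logprobs).mean()
    entropy_bonus = -ent_coef * entropy.mean()  # subtract entropy from the loss
    return policy_term + entropy_bonus

# Toy usage with synthetic per-token statistics.
logprobs = torch.randn(8, requires_grad=True)
advantages = torch.randn(8)
entropy = torch.rand(8)
loss = pg_loss_with_entropy(logprobs, advantages, entropy)
loss.backward()  # gradients flow through logprobs as in a real update
```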

