
    AREAL: Accelerating Large Reasoning Model Training with Fully Asynchronous Reinforcement Learning

    June 18, 2025

    Introduction: The Need for Efficient RL in LRMs

Reinforcement learning (RL) is increasingly used to enhance large language models (LLMs), especially for reasoning tasks. These models, known as Large Reasoning Models (LRMs), generate intermediate “thinking” steps before providing final answers, thereby improving performance on complex problems such as math and coding. However, training LRMs with RL at scale is challenging, as it demands massive parallelization and careful system design. Current systems often rely on synchronous batch processing, where generation must wait for the longest output in a batch to finish, leading to GPU underutilization. Even newer methods that tolerate outdated rollouts remain batch-based, so the bottleneck persists.
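To make the bottleneck concrete, here is a purely illustrative Python sketch (batch size and output lengths are made up, not taken from the paper): in a synchronous step, every slot in the batch waits for the slowest rollout, so the fraction of useful decode work drops when output lengths vary widely.

```python
import random

random.seed(0)
output_lens = [random.randint(200, 4096) for _ in range(16)]  # tokens per rollout

step_time = max(output_lens)             # the whole batch waits for the slowest rollout
useful = sum(output_lens)                # tokens actually decoded
capacity = step_time * len(output_lens)  # tokens the GPUs could have decoded
print(f"utilization: {useful / capacity:.0%}")  # well below 100% for skewed lengths
```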

Background: RL’s Impact on LLM Reasoning Abilities

RL has become a widely used strategy for enhancing the reasoning abilities of LLMs, particularly on tasks with clearly defined reward signals, such as math, coding, scientific reasoning, and tool use. These models typically improve by extending their chain-of-thought reasoning during training. Open-source efforts have shown that distilled, smaller models can also perform well on such tasks. Asynchronous RL methods, proven effective in gaming, have recently been explored for LLMs, but primarily in short-context or limited-overlap scenarios. Prior work has also investigated strategies, such as partial rollouts, to improve efficiency while maintaining training stability.

    System Overview: Introducing AReaL

    Researchers from IIIS, Tsinghua University, Ant Research, and HKUST introduce AReaL, a fully asynchronous reinforcement learning system designed to train large reasoning models more efficiently. Unlike traditional synchronous systems, AReaL separates the generation and training processes; rollout workers continuously produce outputs, while training workers update the model in parallel as new data arrives. This design enhances GPU usage and accelerates training. To handle data staleness, AReaL utilizes a tailored version of PPO and incorporates optimizations such as dynamic batching and parallel reward services. On math and code tasks, AReaL achieves up to 2.77× faster training while maintaining or improving final model performance. 
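A minimal producer/consumer sketch of this decoupling, using Python threads and a shared queue. All helpers (sample_prompt, generate, compute_reward, ppo_update) are hypothetical stand-ins for illustration, not AReaL's actual API:

```python
import queue
import random
import threading
import time

# Hypothetical stand-ins for the real components.
def sample_prompt():
    return f"prompt-{random.randint(0, 999)}"

def generate(prompt):
    time.sleep(random.uniform(0.01, 0.05))  # variable-length decoding
    return f"response to {prompt}"

def compute_reward(prompt, response):
    return random.random()  # stand-in for the reward service

def ppo_update(batch):
    time.sleep(0.02)  # stand-in for a trainer-worker optimizer step

rollout_buffer = queue.Queue(maxsize=64)
stop = threading.Event()

def rollout_worker():
    # Produces rollouts continuously; never waits for the trainer to finish a step.
    while not stop.is_set():
        p = sample_prompt()
        r = generate(p)  # may span several policy versions in the real system
        rollout_buffer.put((p, r, compute_reward(p, r)))

def trainer(batch_size=8, steps=10):
    # Consumes rollouts as they arrive and updates weights in parallel.
    for _ in range(steps):
        batch = [rollout_buffer.get() for _ in range(batch_size)]
        ppo_update(batch)
    stop.set()

for _ in range(4):
    threading.Thread(target=rollout_worker, daemon=True).start()
trainer()
print("done: generation and training overlapped throughout")
```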

    Technical Architecture: Key Components and Optimizations

AReaL decouples generation and training across separate GPU clusters, improving scalability, hardware efficiency, and flexibility for reinforcement learning with large models. The system comprises four main components: rollout workers that support interruptible generation and model updates, a reward service that evaluates responses, trainer workers that perform PPO updates, and a controller that coordinates the data flow. To address challenges such as data staleness and inconsistent policy versions, AReaL employs staleness-aware training and a decoupled PPO objective (sketched below). Additionally, system-level optimizations such as pipelined CPU-GPU operations, non-blocking asynchronous requests, and dynamic sequence packing enhance training speed and GPU efficiency.
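The decoupled objective separates the stale behavior policy that generated a rollout from a recent "proximal" policy that anchors the trust region. Below is a minimal PyTorch sketch of such a loss; tensor names and shapes are chosen for illustration and this is an interpretation of the idea, not AReaL's actual code:

```python
import torch

def decoupled_ppo_loss(logp_new, logp_prox, logp_behav, advantages, clip_eps=0.2):
    # Trust region is anchored at the recent proximal policy rather than the
    # stale behavior policy that generated the rollout.
    ratio = torch.exp(logp_new - logp_prox)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    surrogate = torch.min(ratio * advantages, clipped * advantages)
    # Importance weight corrects for the data having been sampled under the
    # older behavior policy; detached so no gradient flows through it.
    weight = torch.exp(logp_prox - logp_behav).detach()
    return -(weight * surrogate).mean()

# Toy per-token log-probabilities, only to show the call shape.
base = torch.randn(4, 8)
loss = decoupled_ppo_loss(base - 0.05, base, base - 0.2, torch.randn(4, 8))
print(loss.item())
```

Clipping against the proximal policy keeps the update size bounded even when the rollout data came from a policy several versions old.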

    Experimental Results: Scaling and Performance

AReaL was tested on math and coding tasks using distilled Qwen2 models of various sizes. It achieved 2–3× faster training than prior methods, such as DeepScaleR and DeepCoder, while maintaining comparable accuracy. The system scales efficiently across GPUs and handles long context lengths (up to 32k tokens), outperforming synchronous methods. Key design features, such as interruptible generation and dynamic microbatching, boost training speed and hardware utilization. Algorithmically, AReaL’s decoupled PPO objective allows stable learning even with stale data, unlike standard PPO. Overall, AReaL balances speed and performance effectively, making it well-suited for large-scale RL training of language models.
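Dynamic microbatching groups variable-length sequences under a token budget instead of padding everything to the longest sequence in the batch. A simplified greedy sketch of the idea (the budget and lengths are illustrative, not AReaL's implementation):

```python
def pack_microbatches(seq_lens, token_budget=32768):
    # Greedy packing: sort longest-first and start a new microbatch whenever
    # the next sequence would exceed the token budget.
    batches, current, used = [], [], 0
    for length in sorted(seq_lens, reverse=True):
        if current and used + length > token_budget:
            batches.append(current)
            current, used = [], 0
        current.append(length)
        used += length
    if current:
        batches.append(current)
    return batches

print(pack_microbatches([32000, 4096, 4096, 1024, 512, 30000]))
# [[32000], [30000], [4096, 4096, 1024, 512]]
```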

    Conclusion: Advancing Large-Scale RL for Language Models

In conclusion, AReaL is an asynchronous reinforcement learning system designed to make RL training of LLMs more efficient, particularly for tasks such as coding and mathematical reasoning. Unlike traditional synchronous methods that wait for all outputs before updating, AReaL allows generation and training to run in parallel. This decoupling reduces GPU idle time and boosts throughput. To keep learning stable, AReaL introduces staleness-aware strategies and a modified PPO algorithm that handles older training data effectively. Experiments show that it delivers up to 2.77× faster training than synchronous systems without sacrificing accuracy, marking a step forward in scaling up RL for large models.


Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project.

    The post AREAL: Accelerating Large Reasoning Model Training with Fully Asynchronous Reinforcement Learning appeared first on MarkTechPost.
