Close Menu
    DevStackTipsDevStackTips
    • Home
    • News & Updates
      1. Tech & Work
      2. View All

      CodeSOD: A Unique Way to Primary Key

      July 22, 2025

      BrowserStack launches Figma plugin for detecting accessibility issues in design phase

      July 22, 2025

      Parasoft brings agentic AI to service virtualization in latest release

      July 22, 2025

      Node.js vs. Python for Backend: 7 Reasons C-Level Leaders Choose Node.js Talent

      July 21, 2025

      The best CRM software with email marketing in 2025: Expert tested and reviewed

      July 22, 2025

      This multi-port car charger can power 4 gadgets at once – and it’s surprisingly cheap

      July 22, 2025

      I’m a wearables editor and here are the 7 Pixel Watch 4 rumors I’m most curious about

      July 22, 2025

      8 ways I quickly leveled up my Linux skills – and you can too

      July 22, 2025
    • Development
      1. Algorithms & Data Structures
      2. Artificial Intelligence
      3. Back-End Development
      4. Databases
      5. Front-End Development
      6. Libraries & Frameworks
      7. Machine Learning
      8. Security
      9. Software Engineering
      10. Tools & IDEs
      11. Web Design
      12. Web Development
      13. Web Security
      14. Programming Languages
        • PHP
        • JavaScript
      Featured

      The Intersection of Agile and Accessibility – A Series on Designing for Everyone

      July 22, 2025
      Recent

      The Intersection of Agile and Accessibility – A Series on Designing for Everyone

      July 22, 2025

      Zero Trust & Cybersecurity Mesh: Your Org’s Survival Guide

      July 22, 2025

      Execute Ping Commands and Get Back Structured Data in PHP

      July 22, 2025
    • Operating Systems
      1. Windows
      2. Linux
      3. macOS
      Featured

      A Tomb Raider composer has been jailed — His legacy overshadowed by $75k+ in loan fraud

      July 22, 2025
      Recent

      A Tomb Raider composer has been jailed — His legacy overshadowed by $75k+ in loan fraud

      July 22, 2025

      “I don’t think I changed his mind” — NVIDIA CEO comments on H20 AI GPU sales resuming in China following a meeting with President Trump

      July 22, 2025

      Galaxy Z Fold 7 review: Six years later — Samsung finally cracks the foldable code

      July 22, 2025
    • Learning Resources
      • Books
      • Cheatsheets
      • Tutorials & Guides
    Home»Development»Machine Learning»Scaling Reinforcement Learning Beyond Math: Researchers from NVIDIA AI and CMU Propose Nemotron-CrossThink for Multi-Domain Reasoning with Verifiable Reward Modeling

    Scaling Reinforcement Learning Beyond Math: Researchers from NVIDIA AI and CMU Propose Nemotron-CrossThink for Multi-Domain Reasoning with Verifiable Reward Modeling

    May 5, 2025

    Large Language Models (LLMs) have demonstrated remarkable reasoning capabilities across diverse tasks, with Reinforcement Learning (RL) serving as a crucial mechanism for refining their deep thinking abilities. While RL techniques have shown particular success in mathematical reasoning and coding domains with well-defined rules and verifiable correctness criteria, extending these approaches to broader reasoning contexts presents significant challenges, including limited training data and difficulties in ensuring cross-domain generalisation.

    Evolution of Reasoning in LLMs

    The development of Chain-of-Thought (CoT) methodology marked a significant advancement in LLM reasoning capabilities. CoT has demonstrated substantial improvements across mathematics, science, and programming domains by incorporating multi-step intermediate reasoning processes before reaching conclusions. This approach allows models to break down complex problems into manageable steps, mirroring human problem-solving processes.

    While mathematical reasoning has dominated recent research due to its verifiable nature, the expansion of RL training to diverse domains remains largely unexplored. Prior research works suggest that blending mathematical content with other verifiable domains can improve performance on broad reasoning benchmarks. However, systematic investigation into how non-mathematical reasoning data, such as legal analysis, social science, or historical interpretation, impacts RL training effectiveness still represents a significant research gap.

    Challenges in Diversifying Reasoning Domains

    Recent research has explored methods for diversifying RL training datasets, yet questions about optimal data-blending strategies and the relative importance of various sources remain unanswered. A fundamental challenge in applying RL to general reasoning tasks is developing verifiable reward models for domains lacking deterministic solutions. Domain-specific reasoning processes—whether rule-based and symbolic in mathematics or contextual and heuristic in fields like law and history—require different cognitive approaches. In addition to that, question formats (open-ended versus multiple-choice) demand distinct reasoning strategies, suggesting that incorporating diverse reasoning domains could significantly enhance LLMs’ broad cognitive capabilities.

    Nemotron-CrossThink: A Multi-Domain Approach

    Researchers from NVIDIA, Carnegie Mellon University, and Boston University introduce Nemotron-CrossThink, representing a systematic framework for incorporating multi-domain corpora into RL training to enhance cross-task generalisation. The methodology follows a comprehensive pipeline that curates diverse data sources, including synthetic data from CommonCrawl and open-source question-answer pairs across STEM, humanities, law, and social sciences. By applying templated formats (MCQ/Open-Ended) to constrain answer spaces, filtering samples for verifiable rewards, and implementing strategic data-blending recipes, the framework enables effective self-learning through RL across diverse reasoning domains.

    Key Results and Innovations

    Nemotron-CrossThink significantly enhances LLM reasoning capabilities by integrating multi-domain data with different question formats. Models trained with this approach demonstrate not only higher accuracy but also dynamic response strategies—generating concise answers for general-purpose questions while providing detailed responses for mathematical problems—thereby optimising inference costs while maintaining task-specific precision.

    The framework addresses the challenge of verifiable rewards in non-deterministic domains through templated data curation that limits answer space diversity. It also provides an efficient filtering approach that ranks general-purpose reasoning data by complexity, showing that training with more challenging samples amplifies RL impact across all domains. These innovations have led to substantial performance gains in both mathematical benchmarks (MATH-500: +30.1%, AMC23: +27.5%) and non-mathematical tasks (MMLU-PRO: +12.8%, GPQA-DIAMOND: +11.3%).

    Comprehensive Data Curation

    Nemotron-CrossThink begins with meticulous data curation from multiple sources to ensure diversity. The training dataset combines synthetically generated data from CommonCrawl and publicly available open-source QA datasets, encompassing both general-purpose reasoning and mathematical content. General-purpose reasoning data includes MMLU, Natural Reasoning, and synthesised QA pairs spanning STEM fields, economics, social sciences, and humanities, while mathematical reasoning incorporates datasets like MATH and Numina-Math alongside synthetically generated problems.

    Template Application and Data Filtering

    To address the challenge of verifiable rewards in non-mathematical domains, the framework applies specific templates to structure question-answer formats: Multiple Choice Questions (MCQ) and Open-Ended questions. This approach exposes the model to diverse answer formats and reasoning pathways while limiting answer space variability to enable effective reward modeling. Rigorous filtering removes samples that are infeasible to evaluate with rule-based reward functions, discarding MCQs where correct answers aren’t among the choices and open-ended responses exceeding ten words.

    Strategic Data Blending and Reinforcement Learning

    Nemotron-CrossThink employs Group Relative Policy Optimisation (GRPO) for reinforcement learning, which improves efficiency by estimating baselines from group scores rather than using a separate critic model. The methodology investigates the impact of diverse data sources, question types, and data usefulness through six distinct blending recipes. This systematic approach enables detailed analysis of how general-purpose reasoning data complements mathematical reasoning, ultimately producing more adaptable and generalizable language models.

    Technical Contributions

    The research demonstrates several key technical advances in multi-domain reasoning through reinforcement learning:

    1. Templated question-answer formats provide more stable reward modeling, with unified open-ended question formats improving performance by 1.21% over mixed formats, and short-form answer templates outperforming long-form ones by 1.20%.
    2. Strategic data-blending proves essential, with multi-domain corpora boosting average reasoning accuracy by 1.61% compared to math-only training while reducing token usage by 28%.
    3. Model-driven filtering techniques effectively select challenging samples by removing those solvable by smaller models, yielding an additional 2.15% accuracy gain for Qwen-2.5-32B.

    These findings represent significant progress in developing LLMs with robust reasoning capabilities across diverse domains, moving beyond the traditional focus on mathematical reasoning to encompass the full spectrum of human knowledge and inference patterns.

    Experiments and Results

    Experimental results demonstrate that different datasets significantly impact model performance across reasoning benchmarks. NuminaMath produced the highest overall average, outperforming the baseline by 8.30%, with particular strength in mathematical tasks while also generalizing well across diverse domains. Synthetic question-answering data improved performance by approximately 1.0%, showing strong accuracy in MMLU-PRO, AGIEVAL, and MATH-500 tasks, confirming that synthetically generated instruction-style data can effectively generalize when aligned with benchmark distributions.

    The Nemotron-CrossThink approach consistently outperformed the base model across various blending strategies. The general-purpose reasoning blend (Bgpr↑) achieved the highest overall average, exceeding OPEN-REASONER-ZERO by approximately 5% on average and showing substantial gains on reasoning-focused benchmarks (+12.82% on MMLU-PRO, +15.12% on AGIEVAL). Though Bonly_math performed slightly better on strictly mathematical tasks, it lagged on non-mathematical reasoning benchmarks, demonstrating Bgpr↑’s superior versatility through strong cross-domain transfer.

    Further analysis revealed that open-ended question formats (Bopen↑) yielded stronger results on mathematical benchmarks than multiple-choice formats (Bmcq↑), suggesting alignment with the inherently open-ended structure of mathematical problems. Mathematical reasoning data showed transferability to structured reasoning tasks, while general-purpose data proved less effective in isolation. This counterintuitive finding confirms that optimal general-purpose reasoning performance requires including mathematical problems in training blends.

    Conclusion

    Nemotron-CrossThink introduces a scalable framework that enhances LLM generalization through reinforcement learning with multi-domain corpora. By strategically blending diverse reasoning data with a 2:1 ratio of general-purpose to mathematical content, the approach achieves a remarkable 13.36% average improvement over baselines. The research demonstrates that data diversity, not merely volume, drives broader reasoning capabilities. Through difficulty-based filtering and thoughtful template design, Nemotron-CrossThink establishes a practical methodology for developing more generalizable, efficient, and reliable LLMs that extend self-learning beyond mathematical reasoning.


    Check out the Paper and Project Page. Also, don’t forget to follow us on Twitter.

    Here’s a brief overview of what we’re building at Marktechpost:

    • Newsletter– airesearchinsights.com/(30k+ subscribers)
    • miniCON AI Events – minicon.marktechpost.com
    • AI Reports & Magazines – magazine.marktechpost.com
    • AI Dev & Research News – marktechpost.com (1M+ monthly readers)
    • ML News Community – r/machinelearningnews (92k+ members)

    The post Scaling Reinforcement Learning Beyond Math: Researchers from NVIDIA AI and CMU Propose Nemotron-CrossThink for Multi-Domain Reasoning with Verifiable Reward Modeling appeared first on MarkTechPost.

    Source: Read More 

    Facebook Twitter Reddit Email Copy Link
    Previous ArticleIl laboratorio Open Source dell’Oregon State University rischia la chiusura: appello urgente alla comunità!
    Next Article How the Model Context Protocol (MCP) Standardizes, Simplifies, and Future-Proofs AI Agent Tool Calling Across Models for Scalable, Secure, Interoperable Workflows Traditional Approaches to AI–Tool Integration

    Related Posts

    Machine Learning

    How to Evaluate Jailbreak Methods: A Case Study with the StrongREJECT Benchmark

    July 22, 2025
    Machine Learning

    Boolformer: Symbolic Regression of Logic Functions with Transformers

    July 22, 2025
    Leave A Reply Cancel Reply

    For security, use of Google's reCAPTCHA service is required which is subject to the Google Privacy Policy and Terms of Use.

    Continue Reading

    Why Your Business Needs a React Native Mobile App in 2025

    Web Development

    CVE-2025-48119 – RS WP Book Showcase Code Injection Vulnerability

    Common Vulnerabilities and Exposures (CVEs)

    CVE-2025-4218 – Handrew BrowserPilot GPTSeleniumAgent Code Injection Vulnerability

    Common Vulnerabilities and Exposures (CVEs)

    CVE-2025-5498 – Slackero PHPwcms Remote Deserialization Vulnerability

    Common Vulnerabilities and Exposures (CVEs)

    Highlights

    Massive cloud outage knocks out internet services across the globe

    June 12, 2025

    Google Cloud appears to be at the heart of the issue, Cloudflare was down intermittently,…

    CVE-2024-12827 – WordPress DWT Directory & Listing Theme Privilege Escalation Vulnerability

    June 27, 2025

    Debian 13: Novità nell’Installer con Supporto per il Recupero dei Sistemi Btrfs

    July 5, 2025

    CVE-2020-36844 – KnowBe4 Security Awareness Training Reflective Cross-Site Scripting

    April 20, 2025
    © DevStackTips 2025. All rights reserved.
    • Contact
    • Privacy Policy

    Type above and press Enter to search. Press Esc to cancel.