    LLMs Can Now Learn to Try Again: Researchers from Menlo Introduce ReZero, a Reinforcement Learning Framework That Rewards Query Retrying to Improve Search-Based Reasoning in RAG Systems

    April 19, 2025

    The domain of LLMs has rapidly evolved to include tools that let these models integrate external knowledge into their reasoning processes. A significant advancement in this direction is Retrieval-Augmented Generation (RAG), which allows models to query databases and search engines for up-to-date or niche information not embedded during training. RAG enhances performance in knowledge-intensive scenarios by combining LLM generation with real-time information retrieval. Yet as tasks become more complex, especially those requiring multi-step reasoning or highly specific knowledge, it becomes critical that LLMs interact intelligently with these retrieval systems, since that interaction determines whether they can address ambiguous, evolving, or complex information needs effectively.
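    To ground the RAG pattern, here is a minimal single-shot sketch in Python. The `search` and `generate` functions are hypothetical stubs standing in for a retriever backend and an LLM call; nothing in this sketch comes from the ReZero release.

```python
from typing import List

def search(query: str, top_k: int = 3) -> List[str]:
    """Hypothetical retriever stub; a real system would query a search index."""
    return [f"(document {i} matching '{query}')" for i in range(top_k)]

def generate(prompt: str) -> str:
    """Hypothetical LLM stub; a real system would call a language model."""
    return f"(model output for: {prompt[:40]}...)"

def answer_with_rag(question: str) -> str:
    # Single-shot RAG: one query, one retrieval, one answer. If the initial
    # query is poor, there is no recovery path -- the gap ReZero targets.
    query = generate(f"Write a search query for: {question}")
    context = "\n\n".join(search(query))
    return generate(f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")

print(answer_with_rag("Who commanded Apollo 13?"))
```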

    A challenge in LLM-based systems that rely on retrieval mechanisms is the sensitivity to query quality. When an LLM generates an initial search query that fails to retrieve useful information, the system often lacks a robust strategy to recover from this failure. This leads to situations where the model either hallucinates an answer or terminates prematurely, yielding incorrect results. Current methods largely assume that a single good query will suffice, neglecting the scenario where persistence and retries are essential for uncovering the correct information. This limitation reduces the robustness of LLMs in complex tasks where understanding improves incrementally through trial, error, and refinement.

    Various tools have been developed to enhance the interaction between LLMs and external retrieval systems. Techniques such as Process Reward Models (PRMs) and Process Explanation Models (PEMs) reward intermediate reasoning improvements, whereas DeepRetrieval employs reinforcement learning (RL) to optimize query formulation. These methods reward either the quality of reasoning or the final retrieval result. Iterative techniques such as Self-Ask and IRCoT enable multi-step reasoning by decomposing questions and retrieving information in successive rounds. However, none of these systems rewards persistence after a failed attempt: they generally do not encourage retrying or reformulating a failed query, which can be crucial for navigating ambiguous information landscapes.

    Researchers at Menlo Research introduced a new framework called ReZero (Retry-Zero). This method is designed specifically to teach large language models to persist in their information search by explicitly rewarding the act of retrying a query. Rather than only valuing the final answer, ReZero builds a learning environment where the model receives positive feedback when it recognizes a failed search and attempts again with a revised query. The reinforcement signal is applied during interactions with a search system, meaning that the model is rewarded not only for reaching the correct conclusion but also for demonstrating persistence along the way. The idea mirrors human behavior: when an initial search or strategy fails, a rational approach is to reformulate the plan and try again. ReZero operationalizes this idea by using a reward mechanism that reflects the value of retrying after encountering difficulty in information retrieval.
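    To make the rewarded behavior concrete, the sketch below (reusing the `search` and `generate` stubs above) shows the kind of search-retry episode ReZero reinforces. The loop and the `looks_insufficient` heuristic are illustrative assumptions; the trained policy learns when to retry rather than following hand-written control flow.

```python
def looks_insufficient(documents: List[str], question: str) -> bool:
    """Hypothetical relevance check; the trained model judges this itself."""
    return all(question.lower() not in d.lower() for d in documents)

def answer_with_retries(question: str, max_attempts: int = 3) -> str:
    query = generate(f"Write a search query for: {question}")
    documents = search(query)
    for _ in range(max_attempts):
        if not looks_insufficient(documents, question):
            break
        # Failed retrieval: reformulate the query instead of answering
        # from thin air -- this is the act ReZero explicitly rewards.
        query = generate(
            f"The query '{query}' found nothing useful for: {question}. "
            f"Write a different search query."
        )
        documents = search(query)
    context = "\n\n".join(documents)
    return generate(f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")
```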

    The team released two versions of their ReZero-trained model, Menlo/ReZero-v0.1-llama-3.2-3b-it-grpo-250404 and its GGUF variant, on Hugging Face. Both are fine-tuned from the Llama-3.2-3B-Instruct base using GRPO and optimized to reinforce retry behavior in search tasks. Trained for over 1,000 steps on Apollo 3 mission data using an H200 GPU, the model achieved a peak accuracy of 46.88% at step 250, validating the impact of the retry reward. The GGUF version is quantized for efficient deployment, showcasing ReZero’s potential for both research and real-world search applications.
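    As a usage note, the released checkpoint loads like any Hugging Face causal LM. The snippet below is standard `transformers` boilerplate rather than ReZero-specific code; the prompt here is a placeholder, so consult the model card for the search-tool interaction format the checkpoint was trained with.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Menlo/ReZero-v0.1-llama-3.2-3b-it-grpo-250404"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Placeholder prompt; the checkpoint expects the search-tool template
# described on its model card.
inputs = tokenizer("Who commanded Apollo 13?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```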

    ReZero utilizes a reinforcement learning method known as Group Relative Policy Optimization (GRPO) to train the model. This setup doesn’t rely on a separate critic model, streamlining the training process. The model is taught using a suite of reward functions: correctness of the final answer, adherence to format, retrieval of relevant content, and crucially, the presence of a retry when needed. These rewards work in combination. For instance, the retry reward only applies if a valid final answer is eventually produced, ensuring that models do not engage in endless retries without resolution. Also, a search diversity reward encourages the generation of semantically varied queries, while a search strategy reward assesses how effectively the model conducts sequential searches. Training is further enhanced by injecting noise into the search results, forcing the model to adapt to less-than-ideal conditions. This noise strengthens its generalization ability and simulates real-world imperfections.
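    A hedged sketch of how such a composite reward might be assembled follows, reusing the `search` stub from the first sketch. The weights, the lexical diversity proxy, and the noise rate are invented for illustration; only the reward categories themselves (correctness, format, chunk match, retry, diversity) come from the paragraph above.

```python
import random
from dataclasses import dataclass, field
from typing import List

@dataclass
class Episode:
    answer_correct: bool
    format_ok: bool
    retrieved_gold_chunk: bool
    produced_final_answer: bool
    queries: List[str] = field(default_factory=list)

def composite_reward(ep: Episode) -> float:
    """Illustrative ReZero-style composite reward; weights are assumptions."""
    reward = 1.0 * ep.answer_correct         # final answer is correct
    reward += 0.2 * ep.format_ok             # output follows required format
    reward += 0.5 * ep.retrieved_gold_chunk  # relevant chunk was retrieved
    # Retry reward, gated on a valid final answer so the policy cannot
    # farm reward from endless unresolved retries.
    if len(ep.queries) > 1 and ep.produced_final_answer:
        reward += 0.3
        # Crude lexical stand-in for the semantic query-diversity reward.
        reward += 0.2 * len(set(ep.queries)) / len(ep.queries)
    return reward

def noisy_search(query: str, noise_rate: float = 0.2) -> List[str]:
    """Training-time noise injection: corrupt some results so the policy
    learns to recover from imperfect retrieval."""
    return [("(injected noise)" if random.random() < noise_rate else d)
            for d in search(query)]
```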

    The research team implemented ReZero using the Llama-3.2-3B-Instruct model and evaluated it on the Apollo 3 mission dataset. This dataset was split into 341 data chunks, with 32 reserved for testing. Training lasted approximately 1,000 steps (equivalent to three epochs) and was performed on a single NVIDIA H200 GPU. Two model configurations were compared: a baseline with three reward functions (correctness, format, and chunk exact match, the “em chunk” reward) and ReZero, which added a reward for retrying. The performance gap was substantial: ReZero achieved a peak accuracy of 46.88% at step 250, whereas the baseline peaked at only 25.00% at step 350. ReZero also learned faster in the early stages of training. However, both models subsequently suffered a sharp decline, falling to 0% accuracy by step 450 (ReZero) and step 700 (baseline). This collapse suggests overfitting or instability in extended RL runs, indicating the need for refined training schedules or improved reward balancing.
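    For orientation, the reported numbers are internally consistent with the 32-item test split: 46.88% is 15/32 and 25.00% is 8/32. The sketch below only mirrors that bookkeeping and is not the authors’ evaluation harness; the seed and chunk names are arbitrary.

```python
import random

# Mirror the reported split: 341 chunks total, 32 held out for testing.
rng = random.Random(0)  # arbitrary seed; the paper does not specify one
chunks = [f"chunk_{i:03d}" for i in range(341)]
rng.shuffle(chunks)
test_chunks, train_chunks = chunks[:32], chunks[32:]

def accuracy_pct(correct: int, total: int = 32) -> float:
    return 100.0 * correct / total

print(f"{accuracy_pct(15):.2f}%")  # 46.88% -- ReZero's reported peak
print(f"{accuracy_pct(8):.2f}%")   # 25.00% -- baseline's reported peak
```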

    Several Key Takeaways from the ReZero Framework:

    • Designed to enhance LLM search capabilities by rewarding retry behavior after a failed information retrieval attempt.  
    • Based on reinforcement learning using Group Relative Policy Optimization (GRPO).  
    • Includes rewards for correctness, format, retry actions, relevant information match, search strategy, and query diversity.  
    • The retry reward is granted only if the retries eventually lead to a valid final answer, preventing endless unproductive queries.  
    • ReZero utilized the Apollo 3 dataset, which consisted of 341 chunks; 32 were reserved for evaluation.  
    • It achieved a peak accuracy of 46.88% with a retry reward, compared to 25.00% without it.  
    • Conducted over 1,000 steps on an NVIDIA H200 GPU with the Llama-3.2-3B-Instruct model.  
    • Both models experienced an accuracy collapse after reaching their respective peaks, raising concerns about the stability of extended RL training.  
    • Introduced the idea of persistence as a trainable behavior in RAG systems, distinct from simply refining single queries.

    Here is the Paper and Model.


    The post LLMs Can Now Learn to Try Again: Researchers from Menlo Introduce ReZero, a Reinforcement Learning Framework That Rewards Query Retrying to Improve Search-Based Reasoning in RAG Systems appeared first on MarkTechPost.
