
    LLMs Struggle to Act on What They Know: Google DeepMind Researchers Use Reinforcement Learning Fine-Tuning to Bridge the Knowing-Doing Gap

    May 19, 2025

Language models trained on vast internet-scale datasets have become prominent tools for language understanding and generation. Their potential extends beyond language tasks to acting as decision-making agents in interactive environments. When applied to environments that require choosing actions, these models are expected to draw on their internal knowledge and reasoning to act effectively. Their ability to consider context, weigh options, and choose actions opens new possibilities for integrating them into agentic systems that interact with dynamic environments.

Despite this promise, these models exhibit critical limitations in decision-making. While capable of forming accurate chains of reasoning, they often fail to act upon them. This issue is known as the knowing-doing gap: models recognize correct strategies but do not implement them in practice. Another significant concern is greediness, where models prematurely lock onto high-reward options and ignore alternative strategies that could lead to better outcomes. Moreover, smaller models display frequency bias, favoring commonly seen actions regardless of reward, which impairs exploration and hinders learning from diverse scenarios.
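These failure modes are easy to quantify from logged interactions. A minimal sketch (our own illustrative metrics, not code from the paper) computes action coverage and the share of steps spent on the single most frequent arm, a simple proxy for greediness and frequency bias:

```python
from collections import Counter

def action_stats(actions, num_arms):
    """Summarize exploration behavior from a logged action sequence.

    actions  -- list of arm indices chosen by the agent
    num_arms -- total number of available arms

    Returns (coverage, top_action_share):
      coverage         -- fraction of arms tried at least once
      top_action_share -- fraction of steps spent on the most frequent
                          arm; high values suggest greedy or
                          frequency-biased behavior
    """
    counts = Counter(actions)
    coverage = len(counts) / num_arms
    top_action_share = max(counts.values()) / len(actions)
    return coverage, top_action_share

# Example: an agent that locks onto arm 3 early in a 10-armed bandit.
history = [3, 1, 3, 3, 3, 7, 3, 3, 3, 3]
print(action_stats(history, num_arms=10))  # (0.3, 0.8)
```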

To address these challenges, researchers have experimented with various strategies. Traditional reinforcement learning methods, including bandit algorithms such as Upper Confidence Bound (UCB), aim to manage the exploration-exploitation trade-off. In-context learning and behavior cloning instead imitate expert trajectories, but they often reinforce the same decision biases. While some exploration strategies improve performance marginally, these approaches lack a mechanism to reliably convert internal reasoning into optimal action, especially in complex or stochastic environments.
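For reference, a textbook UCB1 loop looks like the following (a generic sketch, not code from the paper); each arm's estimated value is augmented with an exploration bonus that shrinks as the arm accumulates samples:

```python
import math
import random

def ucb1(pull, num_arms, horizon):
    """UCB1 for a stochastic multi-armed bandit.

    pull(arm) -> reward in [0, 1]; horizon is the total number of pulls.
    """
    counts = [0] * num_arms    # times each arm was pulled
    values = [0.0] * num_arms  # running mean reward per arm

    for t in range(1, horizon + 1):
        if t <= num_arms:
            arm = t - 1  # pull each arm once to initialize
        else:
            # Mean reward plus an exploration bonus that decays as
            # the arm accumulates samples.
            arm = max(range(num_arms),
                      key=lambda a: values[a]
                      + math.sqrt(2 * math.log(t) / counts[a]))
        r = pull(arm)
        counts[arm] += 1
        values[arm] += (r - values[arm]) / counts[arm]
    return values, counts

# Example: three Bernoulli arms with hidden success probabilities.
probs = [0.2, 0.5, 0.8]
values, counts = ucb1(lambda a: float(random.random() < probs[a]), 3, 1000)
```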

Researchers from Google DeepMind and the LIT AI Lab at JKU Linz focused on refining language model behavior through Reinforcement Learning Fine-Tuning (RLFT). Their approach uses self-generated Chain-of-Thought (CoT) rationales as training signals. By evaluating the rewards of actions that follow specific reasoning steps, the model learns to favor decisions that both sound logical and yield high returns in practice. This reinforcement links model reasoning to environmental feedback, promoting better decision alignment and narrowing the gap between thought and behavior.
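In policy-gradient terms, the whole generated sequence (rationale plus action) is treated as the policy's output, and its log-likelihood is weighted by the reward the action earns. Below is a simplified REINFORCE-style sketch of one update, assuming a Hugging Face-style causal LM interface; the paper's exact objective and hyperparameters may differ:

```python
import torch

def rlft_step(model, prompt_ids, optimizer, env_reward, baseline):
    """One REINFORCE-style update on a self-generated rationale + action.

    The completion is sampled from the model itself; its log-probability
    is reinforced in proportion to the environment reward minus a
    baseline (for variance reduction).
    """
    # Sample a completion containing the CoT rationale and the action.
    out = model.generate(prompt_ids, max_new_tokens=128, do_sample=True)
    completion = out[:, prompt_ids.shape[1]:]

    # Log-probability of the sampled completion under the current policy
    # (logits at position i predict the token at position i + 1).
    logits = model(out).logits[:, prompt_ids.shape[1] - 1 : -1, :]
    logp = torch.log_softmax(logits, dim=-1)
    token_logp = logp.gather(-1, completion.unsqueeze(-1)).squeeze(-1).sum()

    # REINFORCE: push up sequences whose actions earned high reward.
    loss = -(env_reward - baseline) * token_logp
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```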

The methodology centers on token-based fine-tuning through environment interactions. At each step, the model receives an input instruction along with the recent action-reward history and generates a sequence containing both the rationale and the selected action. These outputs are evaluated on the environment reward and on whether the action conforms to the desired format, and a penalty is applied when the model fails to generate a valid action. Over time, this reward shaping encourages consistent output formatting while preserving exploration. The process includes Monte Carlo baseline estimates and Generalized Advantage Estimation (GAE) for variable-length tasks such as Tic-tac-toe, allowing the model to learn from diverse decision sequences.
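The format check and penalty can be pictured as follows (a hedged sketch; the penalty value and the 'Action: <choice>' pattern are illustrative choices of ours, not the paper's):

```python
import re

FORMAT_PENALTY = -1.0  # illustrative value; the paper's penalty may differ

def shaped_reward(completion_text, legal_actions, env_reward):
    """Combine the environment reward with a validity check.

    The completion must contain a parseable 'Action: <choice>' whose
    choice is legal in the current state; otherwise the penalty is
    applied, teaching the model to emit well-formed, executable actions.
    """
    match = re.search(r"Action:\s*(\S+)", completion_text)
    if match is None or match.group(1) not in legal_actions:
        return FORMAT_PENALTY, None       # unparseable or illegal action
    return env_reward, match.group(1)     # valid action: pass reward through

# Example with Tic-tac-toe-style numbered cells.
reward, action = shaped_reward(
    "The center controls the most lines. Action: 4",
    legal_actions={"0", "2", "4", "8"},
    env_reward=1.0,
)
```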

Performance results show that RLFT considerably improves the model’s decision-making abilities. In a button-based multi-armed bandit setting with 10 arms, action coverage for a 2B-parameter model increased from 40% to over 52% after 30,000 gradient updates. In environments with 20 choices, coverage remained suboptimal but showed meaningful improvement. The frequency bias in the 2B model decreased from 70% to 35% for early repetitions after RLFT. In Tic-tac-toe, the 2B model’s win rate against a random opponent rose from 15% to 75%, and it learned to draw against an optimal Monte Carlo Tree Search agent, with average return improving from -0.95 to 0.0. Furthermore, a larger 27B model generated correct rationales 87% of the time yet chose the optimal action only 21% of the time without RLFT; this gap was significantly reduced after fine-tuning.
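The 87%-versus-21% figure suggests a direct way to measure the knowing-doing gap on any evaluation set: among cases where the rationale identifies the right move, count how often the emitted action fails to follow it. A small sketch of that metric (our own formulation of the quantity the paper describes):

```python
def knowing_doing_gap(records):
    """records: iterable of (rationale_correct, action_optimal) booleans.

    Returns the fraction of 'knowing' cases (correct rationale) in which
    the model failed to 'do' (emitted a suboptimal action).
    """
    knew = [(r, a) for r, a in records if r]
    if not knew:
        return 0.0
    return sum(1 for _, acted in knew if not acted) / len(knew)
```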

    The research shows that refining large language models through reinforcement on their reasoning processes enhances their ability to act according to their knowledge. This connection between thought and action is vital in creating reliable decision-making agents. The proposed method offers a practical path forward for developing more capable and autonomous LLM-based agents by directly addressing common decision errors and reinforcing successful behaviors.


Check out the Paper. All credit for this research goes to the researchers of this project.

    The post LLMs Struggle to Act on What They Know: Google DeepMind Researchers Use Reinforcement Learning Fine-Tuning to Bridge the Knowing-Doing Gap appeared first on MarkTechPost.
