
    Apple Researchers Reveal Structural Failures in Large Reasoning Models Using Puzzle-Based Evaluation

    June 13, 2025

    Artificial intelligence has undergone a significant transition from basic language models to advanced models that focus on reasoning tasks. These newer systems, known as Large Reasoning Models (LRMs), represent a class of tools designed to simulate human-like thinking by producing intermediate reasoning steps before arriving at conclusions. The focus has moved from generating accurate outputs to understanding the process that leads to these answers. This shift has raised questions about how these models manage tasks with layered complexity and whether they truly possess reasoning abilities or are simply leveraging training patterns to guess outcomes.

    Redefining Evaluation: Moving Beyond Final Answer Accuracy

    A recurring problem with evaluating machine reasoning is that traditional benchmarks mostly assess the final answer without examining the steps involved in arriving at it. Final answer accuracy alone does not reveal the quality of internal reasoning, and many benchmarks are contaminated with data that may have been seen during training. This creates a misleading picture of a model’s true capabilities. To explore actual reasoning, researchers require environments where problem difficulty can be precisely controlled and intermediate steps can be analyzed. Without such settings, it is hard to determine whether these models can generalize solutions or merely memorize patterns.

    To evaluate reasoning more reliably, the research team at Apple designed a setup using four puzzle environments: Tower of Hanoi, River Crossing, Checkers Jumping, and Blocks World. These puzzles allow precise manipulation of complexity by changing elements such as the number of disks, checkers, or agents involved. Each task requires different reasoning abilities, such as constraint satisfaction and sequential planning. Importantly, these environments are free from typical data contamination, enabling thorough checks of both outcomes and the reasoning steps in between. This method ensures a detailed investigation of how models behave across varied task demands.
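    To make this setup concrete, the sketch below shows one way such a controllable puzzle environment might be implemented for the Tower of Hanoi case: the number of disks sets the difficulty, and a simulator validates every intermediate move rather than only the final configuration. The class and function names are hypothetical; this is an illustrative reconstruction, not the authors' evaluation code.

```python
from typing import List, Tuple

Move = Tuple[int, int]  # (from_peg, to_peg), pegs indexed 0..2


class HanoiEnv:
    """Tower of Hanoi whose difficulty is controlled by the number of disks."""

    def __init__(self, num_disks: int):
        self.num_disks = num_disks
        # Peg 0 starts with all disks, largest at the bottom.
        self.pegs: List[List[int]] = [list(range(num_disks, 0, -1)), [], []]

    def apply(self, move: Move) -> bool:
        """Apply one move; return False if it breaks the rules."""
        src, dst = move
        if not self.pegs[src]:
            return False  # nothing to move from the source peg
        disk = self.pegs[src][-1]
        if self.pegs[dst] and self.pegs[dst][-1] < disk:
            return False  # cannot place a larger disk on a smaller one
        self.pegs[dst].append(self.pegs[src].pop())
        return True

    def solved(self) -> bool:
        return len(self.pegs[2]) == self.num_disks


def score_trace(num_disks: int, moves: List[Move]) -> Tuple[int, bool]:
    """Count how many prefix moves are legal and whether the puzzle ends solved."""
    env = HanoiEnv(num_disks)
    for i, move in enumerate(moves):
        if not env.apply(move):
            return i, False  # the first illegal move ends the attempt
    return len(moves), env.solved()
```

    Because each move is checked as it is applied, an analysis can report not only whether a puzzle was solved but also how far into a model's proposed move sequence the first illegal move occurs, which is what makes step-level comparisons across complexity levels possible.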

    The research introduced a comparative study of two model families, Claude 3.7 Sonnet and DeepSeek-R1, pairing each “thinking” variant against its standard LLM counterpart. The models were tested across the puzzles under identical token budgets to measure both accuracy and reasoning efficiency, revealing how performance shifts across low, medium, and high-complexity tasks. The most revealing observation was the emergence of three performance zones: on simple tasks, non-thinking models outperformed the reasoning variants; at medium complexity, reasoning models gained an edge; and both types collapsed completely as complexity peaked.
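    A minimal sketch of how such a matched-budget comparison might be tabulated is shown below. The `generate` callable stands in for whichever model API produces a move sequence under a fixed token budget (a placeholder, not a real client), and `score_trace` is reused from the simulator sketch above.

```python
from typing import Callable, Dict, List

def accuracy_by_complexity(
    generate: Callable[[int, int], List[Move]],  # (num_disks, max_tokens) -> proposed moves
    complexities: List[int],
    max_tokens: int,
    trials: int = 25,
) -> Dict[int, float]:
    """Estimate the solve rate at each complexity level under one fixed token budget."""
    results: Dict[int, float] = {}
    for n in complexities:
        solved = 0
        for _ in range(trials):
            moves = generate(n, max_tokens)   # placeholder model call
            _, ok = score_trace(n, moves)     # simulator from the earlier sketch
            solved += int(ok)
        results[n] = solved / trials
    return results
```

    Plotting the returned pass rates for each model variant against complexity is what surfaces the three zones described above: a low-complexity regime where the non-thinking model wins, a middle band where the thinking variant pulls ahead, and a high-complexity regime where both collapse.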

    Comparative Insights: Thinking vs. Non-Thinking Models Under Stress

    An in-depth analysis revealed that reasoning effort increased with task difficulty up to a point but then declined, even when ample token budget remained. In the Tower of Hanoi, for instance, Claude 3.7 Sonnet (thinking) maintained high accuracy up to a complexity threshold, after which performance dropped to zero. Even when the models were supplied with an explicit solution algorithm, they failed to execute its steps beyond certain complexity levels. In one case, Claude 3.7 Sonnet produced roughly 100 correct moves in the Tower of Hanoi yet could not complete a simpler River Crossing task requiring only 11 moves when $N = 3$. This inconsistency points to serious limitations in symbolic manipulation and exact computation.
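    The scale of the gap is easier to see with the algorithm written out. The standard recursive Tower of Hanoi procedure below (textbook code, not taken from the paper) produces the optimal solution of $2^N - 1$ moves, so ten disks already demand 1,023 correct moves, whereas the $N = 3$ River Crossing instance needs only 11. Struggling to follow even an explicitly supplied recipe like this at scale suggests a limit in exact step-by-step execution rather than in discovering the strategy.

```python
from typing import List, Tuple

def hanoi_moves(n: int, src: int = 0, aux: int = 1, dst: int = 2) -> List[Tuple[int, int]]:
    """Optimal Tower of Hanoi move sequence; its length is always 2**n - 1."""
    if n == 0:
        return []
    return (
        hanoi_moves(n - 1, src, dst, aux)    # park the top n-1 disks on the auxiliary peg
        + [(src, dst)]                       # move the largest disk to the destination
        + hanoi_moves(n - 1, aux, src, dst)  # bring the n-1 disks back on top of it
    )

for n in (3, 7, 10):
    print(n, len(hanoi_moves(n)))  # prints 7, 127, and 1023 moves respectively
```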

    The performance breakdown also highlighted how LRMs handle their internal thought process. Models frequently engaged in “overthinking,” generating correct intermediate solutions early in the process but continuing to explore incorrect paths. This led to inefficient use of tokens. At medium complexity levels, models began to find correct answers later in their reasoning chains. However, at high levels of complexity, they failed to produce accurate solutions. Quantitative analysis confirmed that solution accuracy dropped to near zero as the problem complexity increased, and the number of reasoning tokens allocated began to decline unexpectedly.
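    Assuming intermediate candidate solutions can be extracted from a model's reasoning trace, one way to quantify this overthinking pattern is to record the relative position at which the first correct candidate appears. The helper below is a hypothetical sketch of that measurement, again reusing `score_trace` from the simulator example.

```python
from typing import List, Optional

def first_correct_position(num_disks: int, candidates: List[List[Move]]) -> Optional[float]:
    """Relative position (0.0-1.0) of the first fully correct candidate solution
    within a reasoning trace, or None if no candidate solves the puzzle."""
    for i, moves in enumerate(candidates):
        _, ok = score_trace(num_disks, moves)  # simulator from the earlier sketch
        if ok:
            return i / max(len(candidates) - 1, 1)
    return None
```

    Under this measure, low-complexity traces yield a small value (the correct answer appears early, yet generation continues), medium-complexity traces yield a value closer to 1.0, and high-complexity traces return None because no correct candidate appears at all.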

    Scaling Limits and the Collapse of Reasoning

    This research presents a sobering assessment of how current Large Reasoning Models (LRMs) operate. Apple's findings make clear that, despite some progress, today’s reasoning models are still far from achieving generalized reasoning. The work identifies how performance scales, where it collapses, and why over-reliance on benchmark accuracy fails to capture deeper reasoning behavior. Controlled puzzle environments have proven to be a powerful tool for uncovering hidden weaknesses in these systems and for motivating more robust designs in the future.


    Check out the Paper. All credit for this research goes to the researchers of this project.

    Source: MarkTechPost
