
    Augment Code Released Augment SWE-bench Verified Agent: An Open-Source Agent Combining Claude Sonnet 3.7 and OpenAI O1 to Excel in Complex Software Engineering Tasks

    April 4, 2025

    AI agents are increasingly vital in helping engineers handle complex coding tasks efficiently. However, one significant challenge has been accurately assessing whether these agents can handle real-world coding scenarios beyond simplified benchmark tests.

    Augment Code has announced the launch of its Augment SWE-bench Verified Agent, an open-source agent tailored specifically for software engineering. This release places the company at the top of open-source agent performance on the SWE-bench leaderboard. By combining Anthropic’s Claude Sonnet 3.7 with OpenAI’s O1 model, Augment Code’s approach pairs existing frontier models with a pragmatic system architecture.

    The SWE-bench benchmark is a rigorous test that measures an AI agent’s effectiveness in handling practical software engineering tasks drawn directly from GitHub issues in prominent open-source repositories. Unlike traditional coding benchmarks, which generally focus on isolated, algorithmic-style problems, SWE-bench offers a more realistic testbed that requires agents to navigate existing codebases, identify relevant tests autonomously, create scripts, and iterate against comprehensive regression test suites.
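The scoring rule behind this kind of benchmark can be sketched in a few lines. In SWE-bench, each task lists tests that the fix should turn from failing to passing and tests that must keep passing; the sketch below assumes that layout. The `TaskResult` class, its field names, and the sample data are hypothetical, made up for illustration, and are not the benchmark's actual harness.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Hypothetical record of one task's test outcomes after applying a patch."""
    task_id: str
    fail_to_pass: dict[str, bool]  # tests the fix should make pass -> did it pass?
    pass_to_pass: dict[str, bool]  # regression tests that must stay green -> did it pass?


def is_resolved(result: TaskResult) -> bool:
    """A task counts as resolved only if every target test now passes
    and no previously passing test regressed."""
    return all(result.fail_to_pass.values()) and all(result.pass_to_pass.values())


def resolve_rate(results: list[TaskResult]) -> float:
    """Percentage of tasks that were resolved."""
    if not results:
        return 0.0
    return 100.0 * sum(is_resolved(r) for r in results) / len(results)


# Made-up sample data, not real benchmark output.
sample = [
    TaskResult("task-a", {"test_bug": True}, {"test_old": True}),
    TaskResult("task-b", {"test_bug": True}, {"test_old": False}),  # regression introduced
    TaskResult("task-c", {"test_bug": False}, {"test_old": True}),  # bug not fixed
    TaskResult("task-d", {"test_bug": True}, {}),                   # no regression tests listed
]
print(f"{resolve_rate(sample):.1f}% resolved")  # prints "50.0% resolved"
```

The point of the sketch is that a patch which fixes the reported bug but breaks an existing test (like `task-b`) earns no credit, which is why agents need to iterate against the regression suite rather than only the failing test.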

    Augment Code’s initial submission achieved a 65.4% success rate, a notable result in this demanding environment. The company focused its first effort on existing state-of-the-art models, using Anthropic’s Claude Sonnet 3.7 as the primary driver for task execution and OpenAI’s O1 model for ensembling. This approach deliberately skipped training proprietary models in the initial phase while still establishing a robust baseline.


    One interesting aspect of Augment’s methodology was its exploration of different agent behaviors and strategies. For example, the team found that certain techniques expected to help, such as Claude Sonnet’s ‘thinking mode’ and separate regression-fixing agents, did not yield meaningful performance improvements. This highlights the nuanced and sometimes counterintuitive dynamics of agent performance optimization. Basic ensembling techniques such as majority voting were also explored but ultimately abandoned due to cost and efficiency considerations. However, simple ensembling with OpenAI’s O1 did provide incremental improvements in accuracy, underscoring the value of ensembling even in constrained scenarios.
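To make the ensembling idea concrete, here is a minimal sketch of plain majority voting over candidate patches, assuming several independent agent runs each produce a diff for the same issue. The function names and the whitespace normalization are assumptions chosen for illustration; this is not Augment's implementation, and it does not show how the O1 model is used.

```python
from collections import Counter


def normalize(patch: str) -> str:
    """Drop trailing whitespace and blank lines so that cosmetically
    different but otherwise identical patches compare equal."""
    lines = [line.rstrip() for line in patch.strip().splitlines()]
    return "\n".join(line for line in lines if line)


def majority_vote(candidates: list[str]) -> str:
    """Return the candidate whose normalized form occurs most often.
    Ties go to the earliest candidate in the list."""
    if not candidates:
        raise ValueError("need at least one candidate patch")
    counts = Counter(normalize(c) for c in candidates)
    # max() keeps the first item among equals, which gives the tie-break above.
    return max(candidates, key=lambda c: counts[normalize(c)])


# Made-up candidate diffs from three hypothetical agent runs.
candidates = [
    "-    return a - b\n+    return a + b\n",
    "-    return a - b\n+    return a + b   \n",  # same fix, trailing spaces
    "-    return a - b\n+    return abs(a + b)\n",
]
chosen = majority_vote(candidates)
print(chosen)
```

Each extra candidate in a scheme like this is another full agent run, so cost grows roughly linearly with the number of candidates, which is consistent with the cost concern described above.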

    While the success of Augment Code’s initial SWE-bench submission is commendable, the company is transparent about the benchmark’s limitations. Notably, SWE-bench problems are heavily skewed toward bug fixing rather than feature creation, the provided descriptions are more structured and LLM-friendly than typical real-world developer prompts, and the benchmark uses only Python. Real-world complexities, such as navigating massive production codebases and dealing with less descriptive programming languages, pose challenges that SWE-bench does not capture.

    Augment Code has openly acknowledged these limitations, emphasizing its continued commitment to optimizing agent performance beyond benchmark metrics. The company stresses that while improvements to prompts and ensembling can boost quantitative results, qualitative customer feedback and real-world usability remain its priorities. Its ultimate goal is to develop cost-effective, fast agents that provide strong coding assistance in practical professional environments.

    As part of its roadmap, Augment is exploring the fine-tuning of proprietary models using reinforcement learning (RL) techniques and proprietary data. Such advancements are expected to improve model accuracy and significantly reduce latency and operational costs, making AI-driven coding assistance more accessible and scalable.

    Some of the key takeaways from the Augment SWE-bench Verified Agent include:

    • Augment Code released Augment SWE-bench Verified Agent, achieving the top spot among open-source agents.
    • The agent combines Anthropic’s Claude Sonnet 3.7 as its core driver and OpenAI’s O1 model for ensembling.
    • Achieved a 65.4% success rate on SWE-bench, highlighting robust baseline capabilities.
    • Found counterintuitive results: features expected to help, such as ‘thinking mode’ and separate regression-fixing agents, offered no substantial performance gains.
    • Identified cost-effectiveness as a critical barrier to implementing extensive ensembling in real-world scenarios.
    • Acknowledged benchmark limitations, including its bias towards Python and smaller-scale bug-fixing tasks.
    • Future improvements will focus on cost reduction, lower latency, and improved usability through reinforcement learning and fine-tuning proprietary models.
    • Highlighted the importance of balancing benchmark-driven improvements with qualitative user-centric enhancements.

    Check out the GitHub Page. All credit for this research goes to the researchers of this project.


    The post Augment Code Released Augment SWE-bench Verified Agent: An Open-Source Agent Combining Claude Sonnet 3.7 and OpenAI O1 to Excel in Complex Software Engineering Tasks appeared first on MarkTechPost.
