    Salesforce AI Released APIGen-MT and xLAM-2-fc-r Model Series: Advancing Multi-Turn Agent Training with Verified Data Pipelines and Scalable LLM Architectures

    April 9, 2025

    AI agents are quickly becoming core components in handling complex human interactions, particularly in business environments where conversations span multiple turns and involve task execution, information extraction, and adherence to specific procedural rules. Unlike traditional chatbots that handle single-turn questions, these agents must maintain context across several dialogue exchanges while integrating external data and tool usage. These challenges demand systems capable of navigating user goals incrementally, engaging in feedback loops, and invoking structured functions such as API calls based on the conversation state. These capabilities depend heavily on the availability of training datasets that reflect the natural complexity and sequential structure of such tasks. As these AI agents are expected to operate under domain-specific constraints and execute task-relevant functions in finance, retail, and customer support, the demand for nuanced and verified training data grows significantly.
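    As an illustration (not part of the paper), the minimal Python sketch below shows the shape of such a multi-turn, tool-using loop: the agent accumulates dialogue state across turns and dispatches a structured function call when the model requests one. The call_llm stub and the tool registry are hypothetical stand-ins, not any vendor's API.

        import json

        # Hypothetical tool registry; a real agent would expose domain APIs here.
        TOOLS = {
            "get_order_status": lambda order_id: {"order_id": order_id, "status": "shipped"},
        }

        def call_llm(messages):
            """Stand-in for a chat model call; a real agent would query an LLM here."""
            return {"tool": "get_order_status", "args": {"order_id": "A-42"}}

        def run_turn(messages, user_input):
            """Advance the conversation one turn, executing any requested function call."""
            messages.append({"role": "user", "content": user_input})
            reply = call_llm(messages)
            if reply.get("tool"):  # the model asked to invoke a structured function
                result = TOOLS[reply["tool"]](**reply["args"])
                messages.append({"role": "tool", "content": json.dumps(result)})
            return messages

        history = []
        run_turn(history, "Where is my order A-42?")
        print(history[-1]["content"])  # {"order_id": "A-42", "status": "shipped"}

    The point of the loop is that the full message history, including tool results, is carried into every subsequent turn, which is exactly the context-tracking behavior that multi-turn training data must teach.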

    The central bottleneck in scaling agent capability has been the lack of high-quality, multi-turn datasets that reflect realistic user interactions. Collecting such data manually is slow and costly, and it requires domain knowledge to construct tasks that represent actual use cases. Even leading language models tend to underperform in conversations that require tracking prior context, using tools precisely, or dynamically adjusting their strategy. Without structured training datasets that reflect these challenges, models are prone to execution errors and struggle to maintain goal alignment across turns. These limitations become more pronounced in scenarios that involve tool usage, such as executing function calls, retrieving external data, or fulfilling service requests with multiple stages of information exchange.

    Various frameworks have attempted to bridge this gap through synthetic data generation or task-specific tuning. Efforts like APIGen and knowledge distillation methods have helped generate single-turn task data or simplified templates. Tool-usage models have been enhanced using frameworks that provide fixed sets of functions, but these often lack the flexibility to adapt to dynamic tool environments. Other attempts, such as MAG-V, MATRIX, and BUTTON, use multi-agent systems to simulate training interactions but suffer from inadequate quality controls or rely on fixed instruction structures. Many of these tools either fail to capture long-term dependencies or rely on brittle rule-based systems that lack generalizability. Even popular evaluation benchmarks like MultiChallenge and ToolDial struggle to emulate the intricacies of realistic conversations, often due to overly simplified interaction formats.

    A research team from Salesforce AI Research introduced APIGen-MT, a novel two-phase data generation pipeline designed to create high-quality, multi-turn interaction data between agents and simulated human users. The approach focuses on realism, structure, and verification by constructing validated task blueprints and then simulating detailed agent-human conversations in executable environments. Unlike earlier approaches, this method employs a layered validation mechanism, using both automated checkers and committees of large language models to assess task coherence, accuracy, and feasibility. Using this synthetic data, the researchers trained a family of models in the xLAM-2-fc-r series, ranging from 1 billion to 70 billion parameters, which significantly outperform strong baselines on major multi-turn agent evaluation benchmarks.
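    The layered validation can be pictured as cheap deterministic checks followed by an LLM committee vote. The sketch below is one plausible reading of that mechanism, not the authors' implementation; the checkers, reviewers, and acceptance threshold are hypothetical stand-ins.

        def validate_blueprint(task, checkers, committee, threshold=0.5):
            """Accept a task blueprint only if it passes all rule-based checks
            and a majority of LLM reviewers judge it coherent and feasible."""
            for check in checkers:           # cheap deterministic gates first
                ok, error = check(task)
                if not ok:
                    return False, error      # the error message feeds the repair loop
            votes = [reviewer(task) for reviewer in committee]
            if sum(votes) / len(votes) <= threshold:
                return False, "committee rejected: low coherence/feasibility"
            return True, None

        # Toy usage with stand-in checkers and reviewers.
        checkers = [lambda t: (bool(t.get("intent")), "missing user intent")]
        committee = [lambda t: True, lambda t: True, lambda t: False]
        print(validate_blueprint({"intent": "track an order"}, checkers, committee))
        # (True, None)

    Running the cheap format and executability checks before the committee keeps LLM review cost proportional to the number of plausible proposals rather than all of them.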

    The architecture behind APIGen-MT is split into two main operational phases. In Phase 1, a task configuration is created using an LLM-driven generator that proposes user intent instructions, a sequence of groundtruth actions, and the expected outputs. These proposals are then validated for format correctness, executability, and semantic coherence using a combination of rule-based checkers and a multi-agent LLM review committee. If a proposal fails at any stage, a feedback mechanism reflects on the errors and proposes improvements. Successful tasks move to Phase 2, where a simulation engine generates realistic dialogues between a simulated human user and a test agent. The agent responds to user inputs by calling APIs, interpreting outputs, and evolving the conversation across turns. Only dialogue trajectories that match the expected groundtruth are included in the final training dataset, ensuring functional accuracy and natural dialogue flow.
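    Phase 2's filtering criterion can be summarized as: simulate the dialogue, record which API calls the agent actually executed, and keep the trajectory only if those calls reproduce the blueprint's groundtruth sequence. The sketch below illustrates that filter under those assumptions; simulate_dialogue is a hypothetical stand-in for the human-simulator/test-agent loop.

        def filter_trajectories(blueprints, simulate_dialogue):
            """Keep only simulated dialogues whose executed API calls
            reproduce the blueprint's groundtruth action sequence."""
            dataset = []
            for bp in blueprints:
                dialogue, executed_actions = simulate_dialogue(bp)
                # Functional correctness: the executed calls must match exactly.
                if executed_actions == bp["groundtruth_actions"]:
                    dataset.append(dialogue)
            return dataset

        # Toy usage: one blueprint whose simulated run happens to match.
        bp = {"groundtruth_actions": [("get_order_status", {"order_id": "A-42"})]}
        sim = lambda b: (["user: Where is my order?", "agent: It shipped."],
                         b["groundtruth_actions"])
        print(len(filter_trajectories([bp], sim)))  # 1

    Matching on executed actions rather than on the dialogue text is what lets the pipeline accept many natural phrasings of a conversation while still guaranteeing the agent did the functionally correct thing.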

    Models trained on APIGen-MT data, specifically the xLAM-2-fc-r models, demonstrate superior performance on two industry-standard evaluation benchmarks: τ-bench and BFCL v3. For example, on the BFCL v3 benchmark in the Retail domain, the xLAM-2-70b-fc-r model achieved a score of 78.2, surpassing Claude 3.5 (56.5) and GPT-4o (72.1). Similarly, in the Airline domain it scored 67.1 compared to GPT-4o's 62.8. In more complex environments involving iterative interactions, the smaller xLAM-2-8b-fc-r model outperformed larger traditional models, illustrating the impact of higher-quality training data. These results suggest that training interactions carefully structured through feedback loops and task validation are more valuable than sheer model size. The consistency of these models across multiple trials also shows enhanced robustness, a critical factor for deployment in enterprise environments.

    The APIGen-MT framework is impactful not only because of its performance but also because of its scalability and open-source contribution. By releasing both the synthetic datasets and the xLAM-2-fc-r models to the public, the researchers aim to democratize access to high-quality agent training data. This modular, verifiable, and interaction-grounded approach opens avenues for future advancements in AI agents. It enables researchers to extend the framework across different domains, functions, and tools, making it adaptable to specific industrial requirements without sacrificing dialogue realism or execution integrity.

    Some Key Takeaways from the Research:

    • APIGen-MT creates multi-turn interaction datasets using a two-phase task blueprint generation followed by simulated conversation.  
    • The system integrates validation via format checks, execution tests, and LLM review committees.  
    • Feedback loops allow the improvement of failed tasks, creating a learning mechanism within the pipeline.  
    • Models trained with this data outperform GPT-4o and Claude 3.5 across τ-bench and BFCL v3 benchmarks.  
    • The xLAM-2-70b-fc-r scored 78.2 on Retail and 67.1 on Airline under BFCL v3, higher than all baselines.  
    • Smaller models like xLAM-2-8b-fc-r also beat larger alternatives in long-turn interactions, indicating better efficiency.  
    • The open-source release of both data and models ensures wider accessibility for research and industrial use.  
    • The framework enhances realism and technical reliability in agent training, setting a new standard for synthetic interaction data.

    Check out the Paper and Model. All credit for this research goes to the researchers of this project.
