Close Menu
    DevStackTipsDevStackTips
    • Home
    • News & Updates
      1. Tech & Work
      2. View All

      CodeSOD: A Unique Way to Primary Key

      July 22, 2025

      BrowserStack launches Figma plugin for detecting accessibility issues in design phase

      July 22, 2025

      Parasoft brings agentic AI to service virtualization in latest release

      July 22, 2025

      Node.js vs. Python for Backend: 7 Reasons C-Level Leaders Choose Node.js Talent

      July 21, 2025

      The best CRM software with email marketing in 2025: Expert tested and reviewed

      July 22, 2025

      This multi-port car charger can power 4 gadgets at once – and it’s surprisingly cheap

      July 22, 2025

      I’m a wearables editor and here are the 7 Pixel Watch 4 rumors I’m most curious about

      July 22, 2025

      8 ways I quickly leveled up my Linux skills – and you can too

      July 22, 2025
    • Development
      1. Algorithms & Data Structures
      2. Artificial Intelligence
      3. Back-End Development
      4. Databases
      5. Front-End Development
      6. Libraries & Frameworks
      7. Machine Learning
      8. Security
      9. Software Engineering
      10. Tools & IDEs
      11. Web Design
      12. Web Development
      13. Web Security
      14. Programming Languages
        • PHP
        • JavaScript
      Featured

      The Intersection of Agile and Accessibility – A Series on Designing for Everyone

      July 22, 2025
      Recent

      The Intersection of Agile and Accessibility – A Series on Designing for Everyone

      July 22, 2025

      Zero Trust & Cybersecurity Mesh: Your Org’s Survival Guide

      July 22, 2025

      Execute Ping Commands and Get Back Structured Data in PHP

      July 22, 2025
    • Operating Systems
      1. Windows
      2. Linux
      3. macOS
      Featured

      A Tomb Raider composer has been jailed — His legacy overshadowed by $75k+ in loan fraud

      July 22, 2025
      Recent

      A Tomb Raider composer has been jailed — His legacy overshadowed by $75k+ in loan fraud

      July 22, 2025

      “I don’t think I changed his mind” — NVIDIA CEO comments on H20 AI GPU sales resuming in China following a meeting with President Trump

      July 22, 2025

      Galaxy Z Fold 7 review: Six years later — Samsung finally cracks the foldable code

      July 22, 2025
    • Learning Resources
      • Books
      • Cheatsheets
      • Tutorials & Guides
    Home»Development»Machine Learning»LLMs Can Now Talk in Real-Time with Minimal Latency: Chinese Researchers Release LLaMA-Omni2, a Scalable Modular Speech Language Model

    LLMs Can Now Talk in Real-Time with Minimal Latency: Chinese Researchers Release LLaMA-Omni2, a Scalable Modular Speech Language Model

    May 6, 2025

    Researchers at the Institute of Computing Technology, Chinese Academy of Sciences, have introduced LLaMA-Omni2, a family of speech-capable large language models (SpeechLMs) now available on Hugging Face. This research introduces a modular framework that enables real-time spoken dialogue by integrating speech perception and synthesis with language understanding. Unlike earlier cascaded systems, LLaMA-Omni2 operates in an end-to-end pipeline while retaining modular interpretability and low training cost.

    Overview of the LLaMA-Omni2 Architecture

    LLaMA-Omni2 encompasses models ranging from 0.5B to 14B parameters, each built atop the Qwen2.5-Instruct series. The architecture consists of:

    • Speech Encoder: Utilizes Whisper-large-v3 to transform input speech into token-level acoustic representations.
    • Speech Adapter: Processes encoder outputs using a downsampling layer and a feed-forward network to align with the language model’s input space.
    • Core LLM: The Qwen2.5 models serve as the main reasoning engine.
    • Streaming TTS Decoder: Converts LLM outputs into speech tokens using an autoregressive Transformer and then generates mel spectrograms through a causal flow matching model inspired by CosyVoice2.

    A gating mechanism fuses LLM hidden states with textual embeddings before speech synthesis, enhancing contextual fidelity in the generated audio.

    Streaming Generation with Read-Write Scheduling

    The model adopts a read-write strategy to facilitate streaming output. Specifically, for every R tokens produced by the LLM, W speech tokens are generated. This enables synchronized textual and acoustic generation, minimizing latency without compromising fluency.

    Empirical findings suggest that setting R = 3 and W = 10 provides a favorable trade-off between latency (~583 ms), alignment (ASR-WER: 3.26), and perceptual quality (UTMOS: 4.19).

    Training Approach

    Despite achieving competitive performance, LLaMA-Omni2 is trained on a relatively compact corpus—200K multi-turn speech-to-speech dialogue samples. These samples are synthesized from instruction-following text datasets (Alpaca, UltraChat), with diverse input voices and a consistent output voice generated using FishSpeech and CosyVoice2 models.

    Training is executed in two stages:

    • Stage I: Independently optimizes the speech-to-text and text-to-speech modules.
    • Stage II: Fine-tunes the speech-to-speech generation path, including the gating and autoregressive decoding components.

    Benchmark Results

    The models are evaluated on spoken question answering and speech instruction following tasks using both speech-to-text (S2T) and speech-to-speech (S2S) modes.

    Model Llama Q (S2S) Web Q (S2S) GPT-4o Score ASR-WER Latency (ms)
    GLM-4-Voice (9B) 50.7 15.9 4.09 3.48 1562.8
    LLaMA-Omni (8B) 49.0 23.7 3.52 3.67 346.7
    LLaMA-Omni2-7B 60.7 31.3 4.15 3.26 582.9

    The performance scales consistently with model size. Notably, LLaMA-Omni2-14B outperforms all baselines across tasks, even with substantially less training data than native SpeechLMs such as GLM-4-Voice.

    Component Analyses

    • Gate Fusion Module: Removing the gating mechanism increases ASR-WER and reduces speech quality, confirming its role in aligning textual and contextual signals.
    • TTS Pretraining: Initializing the TTS model from Qwen2.5 and fine-tuning in a streaming setup yields the best performance. Training from scratch fails to converge effectively.
    • Read/Write Strategies: Adjusting the R:W ratio impacts latency and quality. Larger W improves UTMOS but at the cost of response delay.

    Additionally, the study demonstrates that multi-turn dialogue data is more effective than single-turn data in training speech interaction capabilities, and that performance plateaus around 200K samples.

    Conclusion

    LLaMA-Omni2 demonstrates that high-quality, low-latency spoken interaction with LLMs is feasible without the need for extensive pretraining on massive speech corpora. By combining modular architecture with autoregressive streaming synthesis, the system offers a practical pathway for real-time speech applications.


    Check out the Paper, Model on Hugging Face and GitHub Page. Also, don’t forget to follow us on Twitter.

    Here’s a brief overview of what we’re building at Marktechpost:

    ML News Community – r/machinelearningnews (92k+ members)

    Newsletter– airesearchinsights.com/(30k+ subscribers)

    miniCON AI Events – minicon.marktechpost.com

    AI Reports & Magazines – magazine.marktechpost.com

    AI Dev & Research News – marktechpost.com (1M+ monthly readers)

    The post LLMs Can Now Talk in Real-Time with Minimal Latency: Chinese Researchers Release LLaMA-Omni2, a Scalable Modular Speech Language Model appeared first on MarkTechPost.

    Source: Read More 

    Facebook Twitter Reddit Email Copy Link
    Previous ArticleOWASP Top 10 Vulnerabilities: A Guide for QA Testers
    Next Article Use custom metrics to evaluate your generative AI application with Amazon Bedrock

    Related Posts

    Machine Learning

    How to Evaluate Jailbreak Methods: A Case Study with the StrongREJECT Benchmark

    July 22, 2025
    Machine Learning

    Boolformer: Symbolic Regression of Logic Functions with Transformers

    July 22, 2025
    Leave A Reply Cancel Reply

    For security, use of Google's reCAPTCHA service is required which is subject to the Google Privacy Policy and Terms of Use.

    Continue Reading

    Cryptocurrency Miner and Clipper Malware Spread via SourceForge Cracked Software Listings

    Cryptocurrency Miner and Clipper Malware Spread via SourceForge Cracked Software Listings

    Development

    What’s the Real Cost of Building an AI Solution in 2025? A Comprehensive Guide💰

    Web Development

    CVE-2025-46689 – Ververica Platform Reflected Cross-Site Scripting

    Common Vulnerabilities and Exposures (CVEs)

    The Lost CSS Tricks of Cohost.org

    News & Updates

    Highlights

    CVE-2025-20324 – Splunk Enterprise/Cloud Platform System Source Type Configuration Injection Vulnerability

    July 7, 2025

    CVE ID : CVE-2025-20324

    Published : July 7, 2025, 6:15 p.m. | 4 hours, 29 minutes ago

    Description : In Splunk Enterprise versions below 9.4.2, 9.3.5, 9.2.7, and 9.1.10 and Splunk Cloud Platform versions below 9.3.2411.104, 9.3.2408.113, and 9.2.2406.119, a low-privileged user that does not hold the “admin” or “power” Splunk roles could create or overwrite [system source type](https://help.splunk.com/en/splunk-enterprise/get-started/get-data-in/9.2/configure-source-types/create-source-types) configurations by sending a specially-crafted payload to the `/servicesNS/nobody/search/admin/sourcetypes/` REST endpoint on the Splunk management port.

    Severity: 5.4 | MEDIUM

    Visit the link for more details, such as CVSS details, affected products, timeline, and more…

    Here’s How To Fix Missile Command Delta Not Installing

    July 7, 2025
    “It’s literally tens ofmillions of hours.” Xbox CEO Phil Spencer celebrates Xbox Cloud Gaming’s “dramatic growth,” now with per-device usage charts.

    “It’s literally tens ofmillions of hours.” Xbox CEO Phil Spencer celebrates Xbox Cloud Gaming’s “dramatic growth,” now with per-device usage charts.

    April 19, 2025

    CVE-2025-36049 – IBM webMethods Integration Server XML External Entity Injection (XXE) Vulnerability

    June 18, 2025
    © DevStackTips 2025. All rights reserved.
    • Contact
    • Privacy Policy

    Type above and press Enter to search. Press Esc to cancel.