    This AI Paper from ByteDance Introduces MegaScale-Infer: A Disaggregated Expert Parallelism System for Efficient and Scalable MoE-Based LLM Serving

    April 9, 2025

    Large language models are built on transformer architectures and power applications like chat, code generation, and search, but their growing scale with billions of parameters makes efficient computation increasingly challenging. Scaling such systems while maintaining low latency and high throughput puts pressure on algorithm design and system-level optimization. Effectively serving these models now requires careful orchestration of memory, communication, and compute resources.

    A critical challenge in this space is how sparsity, introduced through Mixture-of-Experts (MoE) models, affects inference performance. These models activate only a subset of feed-forward networks (FFNs, the "experts") per token, which reduces computational load. However, this selective activation leads to underutilization of hardware. During inference, attention modules become memory-bound because they access key-value caches frequently, while the FFN modules sit underutilized because each expert receives only a small fraction of the tokens in a batch. As a result, GPU utilization drops significantly, especially during decoding, which creates inefficiencies and inflates operational costs.
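To make the batching effect concrete, here is a minimal sketch of how the average number of tokens per expert shrinks as a model gets sparser. It assumes perfectly uniform routing, which real routers only approximate; the function name and the batch size of 128 are illustrative choices, not values from the paper. The 8-expert, top-2 configuration matches Mixtral's routing.

```python
def avg_tokens_per_expert(batch_tokens: int, num_experts: int, top_k: int) -> float:
    """Average tokens routed to each expert in one MoE layer.

    Assumes uniform routing: each token picks top_k of num_experts
    experts, and load is spread evenly across experts.
    """
    if num_experts <= 0 or not 0 < top_k <= num_experts:
        raise ValueError("need num_experts > 0 and 0 < top_k <= num_experts")
    return batch_tokens * top_k / num_experts


# Dense model: a single FFN sees every token in the decoding batch.
dense = avg_tokens_per_expert(batch_tokens=128, num_experts=1, top_k=1)

# Mixtral-style routing: 8 experts, 2 active per token.
sparse = avg_tokens_per_expert(batch_tokens=128, num_experts=8, top_k=2)

print(f"dense FFN batch: {dense:.0f} tokens")
print(f"per-expert batch: {sparse:.0f} tokens")
```

Under these assumptions, each expert sees a quarter of the batch that a dense FFN would see, so its matrix multiplications are correspondingly smaller and less able to keep the GPU busy.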

    Systems such as vLLM and TensorRT-LLM address inference scaling through parallelism and optimized kernels, but they remain constrained. They serve the model as a single unit, so attention and FFN modules cannot be scaled independently. As MoE models grow in size and sparsity, this approach leads to smaller active batches per expert, weakening the benefits of batching for FFNs. Moreover, tensor and pipeline parallelism add communication overhead, especially across nodes, which becomes a limiting factor in multi-GPU environments.

    ByteDance and Peking University researchers have introduced MegaScale-Infer, a system that rethinks the architecture of MoE serving. Instead of serving the model as a monolithic block, the researchers disaggregate the attention and FFN modules, deploying them on separate GPUs. This separation enables customized scaling and parallelism strategies tailored to the specific needs of each module. Attention modules, which are memory-intensive, are replicated to aggregate requests, while FFN modules are scaled using expert parallelism. The system also supports heterogeneous GPU deployment, assigning cost-effective memory-heavy GPUs to attention tasks and compute-optimized GPUs to FFNs. This disaggregation dramatically improves resource usage and flexibility in deployment.
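The benefit of replicating attention modules can be sketched with simple arithmetic: if several attention replicas feed the same set of experts, the per-expert batch grows with the number of replicas. The sketch below estimates how many replicas would be needed to reach a target per-expert batch. The function name, the uniform-routing assumption, and all numbers are hypothetical; the paper's actual deployment planning weighs further constraints such as memory capacity and latency targets.

```python
import math


def attention_replicas_needed(
    target_tokens_per_expert: int,
    tokens_per_replica: int,
    num_experts: int,
    top_k: int,
) -> int:
    """Attention replicas needed so each expert averages the target batch.

    Assumes uniform routing, so each expert receives
    (replicas * tokens_per_replica * top_k / num_experts) tokens.
    """
    if min(target_tokens_per_expert, tokens_per_replica, num_experts, top_k) <= 0:
        raise ValueError("all arguments must be positive")
    return math.ceil(
        target_tokens_per_expert * num_experts / (top_k * tokens_per_replica)
    )


# Hypothetical numbers: aim for 128 tokens per expert, with each
# attention replica contributing 64 tokens per step (8 experts, top-2).
print(attention_replicas_needed(128, 64, num_experts=8, top_k=2))
```

With these illustrative inputs the estimate is eight attention replicas, which shows why aggregating requests across replicas can restore a useful batch size for each expert.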

    To further optimize performance, MegaScale-Infer employs a ping-pong pipeline parallelism strategy. The idea is to break down batches of requests into smaller micro-batches that alternate between attention and FFN modules, ensuring that neither component sits idle. The system determines the optimal number of micro-batches required to maintain high utilization, considering compute time, communication latency, and hardware setup. For example, if the communication time is less than half the compute time, at least three micro-batches are used. Further, the system integrates a high-performance M2N communication library that avoids unnecessary GPU-to-CPU data copies, reducing latency and instability. This library replaces the traditional All-to-All routing with a more efficient sender-receiver model designed specifically for MoE’s token dispatch pattern.
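The three-micro-batch example can be reproduced with a simple utilization model. The sketch below assumes that attention and FFN take roughly the same time per micro-batch, and that a GPU stays busy as long as it has other micro-batches to process while one is away on its round trip (send, remote compute, send back). This gives the condition m * compute_time >= 2 * (compute_time + comm_time). The function and parameter names are illustrative, and the model is a simplification rather than the system's exact planning rule.

```python
import math


def min_micro_batches(compute_time: float, comm_time: float) -> int:
    """Smallest m with m * compute_time >= 2 * (compute_time + comm_time).

    compute_time: time one module spends on one micro-batch
                  (assumed equal for attention and FFN).
    comm_time:    one-way transfer time between attention and FFN GPUs.
    """
    if compute_time <= 0 or comm_time < 0:
        raise ValueError("compute_time must be > 0 and comm_time >= 0")
    return math.ceil(2 * (1 + comm_time / compute_time))


# Fast link: communication takes less than half the compute time.
print(min_micro_batches(compute_time=1.0, comm_time=0.4))

# Slower link: communication takes more than half the compute time.
print(min_micro_batches(compute_time=1.0, comm_time=0.8))
```

In this model a fast link needs three micro-batches in flight, matching the example above, while a slower link needs four to hide the extra transfer time.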

    MegaScale-Infer was tested on multiple large-scale MoE models, including Mixtral 8×22B, DBRX, and a scaled custom model with 317 billion parameters. In experiments on homogeneous setups using NVIDIA Ampere GPUs, MegaScale-Infer improved per-GPU decoding throughput by up to 2.56× compared to vLLM and 1.28× over TensorRT-LLM. The scaled model achieved a 7.11× gain over vLLM and a 1.90× gain over TensorRT-LLM. On heterogeneous clusters with H20 GPUs for attention and L40S for FFNs, the system achieved up to 3.24× and 1.86× higher throughput per dollar than the baselines. Its M2N communication library delivered up to 4.2× higher throughput and 68.2% lower latency than NCCL.

    The paper identifies a clear problem, underutilized GPUs during MoE inference, and offers a practical solution by disaggregating the architecture. The disaggregation strategy, combined with micro-batch pipelining and a custom communication library, substantially improves serving efficiency and lowers cost.


    Check out the Paper. All credit for this research goes to the researchers of this project.

    The post This AI Paper from ByteDance Introduces MegaScale-Infer: A Disaggregated Expert Parallelism System for Efficient and Scalable MoE-Based LLM Serving appeared first on MarkTechPost.
