    Microsoft Research Introduces MMInference to Accelerate Pre-filling for Long-Context Vision-Language Models

    April 25, 2025

Integrating long-context capabilities with visual understanding significantly enhances the potential of vision-language models (VLMs), particularly in domains such as robotics, autonomous driving, and healthcare. Expanding the context size enables VLMs to process extended video and text sequences, improving temporal resolution and performance on complex tasks such as video comprehension. One major limitation, however, is the quadratic complexity of the attention mechanism during the pre-fill phase, which results in high latency before autoregressive decoding begins. This delay, known as Time-to-First-Token (TTFT), makes real-world deployment of long-context VLMs challenging. Existing sparse attention methods, such as Sparse Transformer, Swin Transformer, and StreamingLLM, overlook the specific sparse patterns found in VLMs with mixed modalities, limiting their efficiency and effectiveness.
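
To see why pre-fill dominates latency at long contexts, consider a rough back-of-the-envelope sketch. The Python below (our illustration, not from the paper; the hidden dimension and layer count are arbitrary assumptions) counts only the two O(n²·d) attention matmuls per layer, which is enough to show the quadratic blow-up:

```python
# Rough FLOP estimate for the attention matmuls during pre-fill.
# QK^T and softmax(QK^T)V are each an (n x d) @ (d x n)-shaped product,
# so attention cost grows quadratically with sequence length n.

def prefill_attention_flops(seq_len: int, hidden_dim: int, num_layers: int) -> int:
    per_layer = 2 * 2 * seq_len * seq_len * hidden_dim  # two n*n*d matmuls, 2 FLOPs per MAC
    return num_layers * per_layer

# Hypothetical model shape: hidden_dim=4096, 32 layers.
for n in (128_000, 256_000, 512_000, 1_000_000):
    pflops = prefill_attention_flops(n, hidden_dim=4096, num_layers=32) / 1e15
    print(f"{n:>9,} tokens -> ~{pflops:,.0f} PFLOPs of attention pre-fill")
```

Doubling the context quadruples this cost, which is exactly the regime where Time-to-First-Token becomes the deployment bottleneck.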

    Unlike text-only inputs, visual and video data in VLMs demonstrate unique spatiotemporal attention structures, forming grid-like patterns due to local correlations. In mixed-modality scenarios, clear boundaries exist between different modalities, leading to distinct attention behaviors that general sparse methods fail to capture. Recent advancements, such as MInference and dynamic sparse attention approaches, aim to improve inference efficiency by adapting attention patterns online. Yet, these techniques often fall short in handling the intricacies of mixed-modality inputs. While vision token compression and RNN-Transformer hybrids have been explored to reduce computational load, most of these methods focus on long-video and short-text pairings, neglecting the more complex dynamics of multiturn, mixed-modality interactions, which are increasingly important in practical applications.
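
The grid-like pattern is easy to visualize with a small sketch. The NumPy snippet below (our construction under simplified assumptions, not the paper's kernels) flattens frames of h×w patches in time order; attention mass then concentrates on (a) spatial neighbors within a frame and (b) the same patch position across frames, which appears as diagonal stripes spaced h·w tokens apart in the attention matrix:

```python
import numpy as np

def grid_attention_mask(num_frames: int, h: int, w: int) -> np.ndarray:
    """Boolean mask over flattened video tokens: keep same-position tokens
    across frames plus a 3x3 spatial neighborhood within each frame."""
    n = num_frames * h * w
    t, y, x = np.unravel_index(np.arange(n), (num_frames, h, w))
    same_pos = (y[:, None] == y[None, :]) & (x[:, None] == x[None, :])
    same_frame_local = (
        (t[:, None] == t[None, :])
        & (np.abs(y[:, None] - y[None, :]) <= 1)
        & (np.abs(x[:, None] - x[None, :]) <= 1)
    )
    return same_pos | same_frame_local

mask = grid_attention_mask(num_frames=4, h=6, w=6)
print(f"attended entries: {mask.mean():.1%} of the full {mask.shape} matrix")
```

Even this toy mask keeps well under a tenth of the entries, hinting at how much computation a pattern-aware kernel can skip.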

    Researchers from the University of Surrey and Microsoft have introduced MMInference, a dynamic, sparse attention method designed to accelerate the pre-filling stage of long-context VLMs. By identifying grid-like sparsity patterns in video inputs and distinct modality boundaries, MMInference applies permutation-based strategies to optimize attention computation. It dynamically constructs sparse distributions for each input and utilizes custom GPU kernels for enhanced efficiency, all without requiring modifications to existing models. Tested on benchmarks like Video QA, Captioning, and Vision-NIAH, MMInference achieved up to 8.3× speedup at 1M tokens, outperforming previous methods while maintaining high accuracy across multiple state-of-the-art VLMs.
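
The permutation idea can be sketched in a few lines. The snippet below (a minimal sketch under our own assumptions; the released implementation does this inside custom GPU kernels) gathers tokens of each modality into contiguous blocks so that modality-specific sparse attention can run on dense tiles, then scatters the results back into the original interleaved order:

```python
import numpy as np

def permute_by_modality(x: np.ndarray, modality_ids: np.ndarray):
    """x: (n, d) token states; modality_ids: (n,) ints, e.g. 0=text, 1=vision.
    Returns block-contiguous tokens plus the inverse permutation."""
    order = np.argsort(modality_ids, kind="stable")  # contiguous per-modality blocks
    inverse = np.argsort(order)                      # restores the original layout
    return x[order], order, inverse

tokens = np.random.randn(10, 4)
mods = np.array([0, 1, 1, 0, 1, 0, 0, 1, 1, 0])
permuted, order, inverse = permute_by_modality(tokens, mods)
# ... per-modality sparse attention would run on the contiguous blocks here ...
assert np.allclose(permuted[inverse], tokens)  # scatter undoes the gather
```

Because the permutation is a cheap gather/scatter, the expensive attention computation only ever sees dense, modality-homogeneous tiles.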

    MMInference is a framework designed to speed up the pre-filling phase of long-context vision-language models by leveraging modality-aware sparse attention. It integrates three key components: (1) intra-modality sparse patterns like Grid, A-shape, and Vertical-Slash attention; (2) cross-modality patterns such as Q-Boundary and 2D-Boundary; and (3) a modality-aware sparse attention search algorithm. Instead of dense computation, it uses dynamic sparse attention with optimized GPU kernels and efficient tensor handling. The framework dynamically identifies attention patterns and permutes tensors based on modality, enabling efficient handling of multi-modal inputs and reducing computational overhead while maintaining strong performance.
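
The search component can be illustrated with a toy offline version. In the sketch below (hypothetical names and scoring of our own; MMInference performs the search dynamically with optimized kernels), each attention head is assigned whichever candidate mask retains the most attention mass on a short sample:

```python
import numpy as np

def retained_mass(attn: np.ndarray, mask: np.ndarray) -> float:
    """Fraction of attention weight a boolean sparsity mask preserves."""
    return float((attn * mask).sum() / attn.sum())

def search_pattern_for_head(attn: np.ndarray, candidates: dict) -> str:
    return max(candidates, key=lambda name: retained_mass(attn, candidates[name]))

n = 256
i, j = np.mgrid[0:n, 0:n]
causal = j <= i
candidates = {
    "A-shape":        causal & ((j < 16) | (i - j < 32)),      # sink tokens + local band
    "vertical-slash": causal & ((j % 24 == 0) | (i - j < 32)), # columns + diagonal
    "grid":           causal & ((i - j) % 36 < 4),             # stripes every "frame"
}
attn_sample = np.random.rand(n, n) * causal  # stand-in for sampled attention weights
print("chosen pattern:", search_pattern_for_head(attn_sample, candidates))
```

In practice the sample would come from real attention scores rather than random weights, so the chosen pattern reflects each head's actual sparsity structure.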

The study evaluates MMInference’s performance and efficiency on long-video tasks, including captioning, question answering, and retrieval, in both unimodal and mixed-modality settings. Experiments were conducted on state-of-the-art models such as LLaVA-Video and LongVILA, with comparisons against several sparse attention baselines. Results show that MMInference approaches full-attention accuracy while being substantially more computationally efficient. It performs particularly well on the newly introduced Mixed-Modality Needle in a Haystack (MM-NIAH) task by exploiting cross-modality sparse patterns. MMInference also delivers significant end-to-end latency speedups and remains robust across varying context lengths and input types.

In conclusion, MMInference is a modality-aware sparse attention technique designed to accelerate long-context VLMs without compromising accuracy. It employs a permutation-based grid attention pattern tailored to the spatiotemporal locality of video inputs, along with specialized handling for mixed-modality boundaries. A search algorithm identifies optimal sparse patterns per attention head, dynamically adapting to the input. The method integrates directly into current VLM pipelines without requiring model changes or fine-tuning. With optimized GPU kernels, MMInference achieves up to 8.3× acceleration during the pre-filling stage at 1M tokens across various tasks, including video QA, captioning, and mixed-modality benchmarks, while retaining full-attention performance.


Check out the Paper and Code.

Source: MarkTechPost