Multimodal Queries Require Multimodal RAG: Researchers from KAIST and DeepAuto.ai Propose UniversalRAG—A New Framework That Dynamically Routes Across Modalities and Granularities for Accurate and Efficient Retrieval-Augmented Generation

RAG has proven effective in enhancing the factual accuracy of LLMs by grounding their outputs in external, relevant information. However, most existing RAG implementations are limited to text-based corpora, which restricts their applicability to real-world scenarios where queries may require diverse types of information, ranging from textual definitions to spatial understanding from images or temporal reasoning from videos. While some recent approaches have extended RAG to handle different modalities like images and videos, these systems are often constrained to operate within a single modality-specific corpus. This limits their ability to effectively respond to a wide spectrum of user queries that demand multimodal reasoning. Moreover, current RAG methods usually retrieve from all modalities without discerning which is most relevant for a given query, making the process inefficient and less adaptive to specific information needs.

To address this, recent research emphasizes the need for adaptive RAG systems to determine the appropriate modality and retrieval granularity based on the query context. Strategies include routing queries based on complexity, such as deciding between no retrieval, single-step, or multi-step retrieval, and using model confidence to trigger retrieval only when needed. Furthermore, the granularity of retrieval plays a crucial role, as studies have shown that indexing corpora at finer levels, like propositions or specific video clips, can significantly improve retrieval relevance and system performance. Hence, for RAG to truly support complex, real-world information needs, it must handle multiple modalities and adapt its retrieval depth and scope to the specific demands of each query.

Researchers from KAIST and DeepAuto.ai introduce UniversalRAG, a RAG framework that retrieves and integrates knowledge from various modality-specific sources (text, image, video) and multiple granularity levels. Unlike traditional approaches that embed all modalities into a shared space, leading to modality bias, UniversalRAG uses a modality-aware routing mechanism to select the most relevant corpus dynamically based on the query. It further enhances retrieval precision by organizing each modality into granularity-specific corpora, such as paragraphs or video clips. Validated on eight multimodal benchmarks, UniversalRAG consistently outperforms unified and modality-specific baselines, demonstrating its adaptability to diverse query needs.

UniversalRAG is a retrieval-augmented generation framework that handles queries across various modalities and data granularities. Unlike standard RAG models limited to a single corpus, UniversalRAG separates knowledge into text, image, and video corpora, each with fine- and coarse-grained levels. A routing module first determines the optimal modality and granularity for a given query, choosing among options like paragraphs, full documents, video clips, or full video, and retrieves relevant information accordingly. This router can be either a training-free LLM-based classifier or a trained model using heuristic labels from benchmark datasets. An LVLM then uses the selected content to generate the final response.

The experimental setup assesses UniversalRAG across six retrieval scenarios: no retrieval, paragraph, document, image, clip, and video. For no-retrieval, MMLU tests general knowledge. Paragraph-level tasks use SQuAD and Natural Questions, while HotpotQA handles multi-hop document retrieval. Image-based queries come from WebQA, and video-related ones are sourced from LVBench and VideoRAG datasets, split into clip- and full-video levels. Corresponding retrieval corpora are curated for each modality—Wikipedia-based for text, WebQA for images, and YouTube videos for video tasks. This comprehensive benchmark ensures robust evaluation across varied modalities and retrieval granularities.

In conclusion, UniversalRAG is a Retrieval-Augmented Generation framework that can retrieve knowledge from multiple modalities and levels of granularity. Unlike existing RAG methods that rely on a single, often text-only, corpus or a single-modality source, UniversalRAG dynamically routes queries to the most appropriate modality- and granularity-specific corpus. This approach addresses issues like modality gaps and rigid retrieval structures. Evaluated on eight multimodal benchmarks, UniversalRAG outperforms both unified and modality-specific baselines. The study also emphasizes the benefits of fine-grained retrieval and highlights how both trained and train-free routing mechanisms contribute to robust, flexible multimodal reasoning.

Check out the Paper. Also, don’t forget to follow us on Twitter and join our Telegram Channel and LinkedIn Group. Don’t Forget to join our 90k+ ML SubReddit. For Promotion and Partnerships, please talk us.

[Register Now] miniCON Virtual Conference on AGENTIC AI: FREE REGISTRATION + Certificate of Attendance + 4 Hour Short Event (May 21, 9 am- 1 pm PST) + Hands on Workshop

The post Multimodal Queries Require Multimodal RAG: Researchers from KAIST and DeepAuto.ai Propose UniversalRAG—A New Framework That Dynamically Routes Across Modalities and Granularities for Accurate and Efficient Retrieval-Augmented Generation appeared first on MarkTechPost.

Source: Read MoreÂ

CodeSOD: A Unique Way to Primary Key

BrowserStack launches Figma plugin for detecting accessibility issues in design phase

Parasoft brings agentic AI to service virtualization in latest release

Node.js vs. Python for Backend: 7 Reasons C-Level Leaders Choose Node.js Talent

The best CRM software with email marketing in 2025: Expert tested and reviewed

This multi-port car charger can power 4 gadgets at once – and it’s surprisingly cheap

I’m a wearables editor and here are the 7 Pixel Watch 4 rumors I’m most curious about

8 ways I quickly leveled up my Linux skills – and you can too

The Intersection of Agile and Accessibility – A Series on Designing for Everyone

The Intersection of Agile and Accessibility – A Series on Designing for Everyone

Zero Trust & Cybersecurity Mesh: Your Org’s Survival Guide

Execute Ping Commands and Get Back Structured Data in PHP

A Tomb Raider composer has been jailed — His legacy overshadowed by $75k+ in loan fraud

A Tomb Raider composer has been jailed — His legacy overshadowed by $75k+ in loan fraud

“I don’t think I changed his mind” — NVIDIA CEO comments on H20 AI GPU sales resuming in China following a meeting with President Trump

Galaxy Z Fold 7 review: Six years later — Samsung finally cracks the foldable code

Multimodal Queries Require Multimodal RAG: Researchers from KAIST and DeepAuto.ai Propose UniversalRAG—A New Framework That Dynamically Routes Across Modalities and Granularities for Accurate and Efficient Retrieval-Augmented Generation

How to Evaluate Jailbreak Methods: A Case Study with the StrongREJECT Benchmark

Boolformer: Symbolic Regression of Logic Functions with Transformers

CVE-2025-49112 – Valkey TCP/IP Stack Integer Underflow Vulnerability

Playwright MCP: Expert Strategies for Success

New Copilot+ PC feature rolls out to Windows Insiders — even if your device is powered by Intel or AMD

Critical Vulnerability in Anthropic’s MCP Exposes Developer Machines to Remote Exploits

CVE-2025-4779 – Lunary Ai Lunary Stored Cross-Site Scripting (XSS)

InvenTree – inventory management system

Lizard Squad DDoS-for-hire service hacked – users’ details revealed

CVE-2022-41871 – SEPPmail Root Command Injection Vulnerability

Multimodal Queries Require Multimodal RAG: Researchers from KAIST and DeepAuto.ai Propose UniversalRAG—A New Framework That Dynamically Routes Across Modalities and Granularities for Accurate and Efficient Retrieval-Augmented Generation

Related Posts