    Researchers Introduce MMLONGBENCH: A Comprehensive Benchmark for Long-Context Vision-Language Models

    May 23, 2025

    Recent advances in long-context (LC) modeling have unlocked new capabilities for LLMs and large vision-language models (LVLMs). Long-context vision-language models (LCVLMs) represent an important step forward, enabling LVLMs to process hundreds of images and thousands of interleaved text tokens in a single forward pass. However, the development of effective evaluation benchmarks has lagged behind. It is still unclear how well current LCVLMs perform in long-context settings, which tasks they struggle with, and how robust they are to variation in input length. Current benchmarks suffer from four problems: (a) limited coverage of downstream tasks, (b) insufficient coverage of image types, (c) lack of context-length control, and (d) evaluation at only a single context length.

    Various techniques have extended the context windows of LVLMs, including longer pre-training lengths, position extrapolation, and efficient architectures. Models like Gemini-2.5 and Qwen2.5-VL have adopted these approaches alongside vision-token compression methods to accommodate longer sequences. For evaluation, the Needle-in-a-Haystack (NIAH) task became a standard test of LC ability: a piece of information is inserted at a specific depth within a long text, and the model must retrieve it. However, existing vision-language benchmarks remain limited, focusing only on NIAH variants or long-document VQA tasks. Even MileBench contains short-context tasks with an average length of only 9K tokens, failing to evaluate true LC capabilities across diverse vision-language applications.
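
    To make the NIAH setup concrete, here is a minimal sketch of how such an example can be constructed; the function and variable names are illustrative, not taken from any benchmark's codebase.

    ```python
    # Minimal sketch of a text-only Needle-in-a-Haystack (NIAH) example.
    # All names are illustrative, not from the MMLONGBENCH codebase.

    def build_niah_example(haystack_paragraphs, needle, depth_fraction):
        """Insert `needle` at a relative depth within the haystack.

        A depth_fraction of 0.0 places the needle at the very start of the
        context and 1.0 at the very end; sweeping intermediate values tests
        whether the model can retrieve facts buried mid-context.
        """
        position = int(len(haystack_paragraphs) * depth_fraction)
        paragraphs = (
            haystack_paragraphs[:position]
            + [needle]
            + haystack_paragraphs[position:]
        )
        return "\n\n".join(paragraphs)

    # Example: probe retrieval at 50% depth.
    haystack = [f"Filler paragraph number {i}." for i in range(1000)]
    needle = "The secret code mentioned in this document is 7481."
    context = build_niah_example(haystack, needle, depth_fraction=0.5)
    question = "What is the secret code mentioned in the document?"
    ```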

    Researchers from HKUST, Tencent AI Seattle Lab, the University of Edinburgh, Miniml.AI, and the NVIDIA AI Technology Center have proposed MMLONGBENCH, the first comprehensive benchmark for evaluating LCVLMs. It comprises 13,331 examples spanning five downstream task categories, including Visual RAG and Many-Shot ICL, and covers both natural and synthetic image types. All examples are standardized across five input lengths, from 8K to 128K tokens, using a cross-modal tokenization scheme that combines vision patches and text tokens. Benchmarking 46 closed-source and open-source models, the researchers find that single-task performance is a poor predictor of overall LC capability, that both model families struggle with LC tasks, and that models with stronger reasoning ability show better LC performance.
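
    As a rough mental model of the benchmark's two evaluation axes, the sketch below enumerates a plausible task-by-length grid. Visual RAG and Many-Shot ICL are named above; the remaining category names are inferred from tasks mentioned elsewhere in this article, and the doubling length schedule is one natural reading of "five input lengths from 8K to 128K", so treat both as assumptions rather than the official specification.

    ```python
    # Hypothetical sketch of MMLONGBENCH's evaluation grid; category names
    # beyond Visual RAG and Many-Shot ICL are inferred from this article,
    # and the doubling length schedule is an assumption.
    TASK_CATEGORIES = [
        "visual_rag",         # Visual RAG (named in the article)
        "many_shot_icl",      # Many-Shot ICL (named in the article)
        "niah",               # NIAH variants (mentioned as a task family)
        "summarization",      # referenced in the results discussion
        "long_document_vqa",  # referenced alongside NIAH
    ]
    CONTEXT_LENGTHS = [8_192, 16_384, 32_768, 65_536, 131_072]  # 8K ... 128K

    # One evaluation cell per (task, length) pair: every example is rendered
    # at each target length, so scores are comparable across lengths.
    EVAL_GRID = [(t, n) for t in TASK_CATEGORIES for n in CONTEXT_LENGTHS]
    ```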

    To build the long contexts, the researchers insert gold passages containing the answers among large sets of distracting passages retrieved from Wikipedia. For ViQuAE, gold passages from KILT are used, while InfoSeek uses the lead sections of Wikipedia entity pages. Wikipedia pages are split into 100-word passages, and retrieved distractors are added until the desired input length is reached. The many-shot in-context learning tasks draw on four diverse image-classification datasets (Stanford Cars, Food101, SUN397, and iNat2021), accommodating 500 images within a 128K context window. Cross-modal token counting combines text tokens, counted with the Llama2 tokenizer, with visual tokens processed as 14×14 patches followed by 2×2 pixel-unshuffle compression, ensuring the counts are compatible with modern LVLMs.
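
    The sketch below illustrates that token accounting and context-padding procedure, assuming each 14×14 patch yields one vision token before the 2×2 pixel unshuffle merges four patches into one. The helper names are illustrative, and a whitespace word count stands in for the Llama2 tokenizer so the sketch runs without model downloads.

    ```python
    import math
    import random

    def count_image_tokens(height: int, width: int) -> int:
        """14x14 patches; 2x2 pixel unshuffle merges 4 patches into 1 token."""
        patches = math.ceil(height / 14) * math.ceil(width / 14)
        return patches // 4

    def build_context(gold_passages, distractors, target_tokens, count_text_tokens):
        """Pad gold passages with retrieved distractors up to a target length,
        then shuffle so the gold evidence is interleaved among distractors."""
        context = list(gold_passages)
        total = sum(count_text_tokens(p) for p in context)
        for d in distractors:
            cost = count_text_tokens(d)
            if total + cost > target_tokens:
                break
            context.append(d)
            total += cost
        random.shuffle(context)  # gold passages end up among the distractors
        return context

    # The paper counts text with the Llama2 tokenizer; a whitespace word
    # count stands in here so the sketch is self-contained.
    def approx_tokens(passage: str) -> int:
        return len(passage.split())

    ctx = build_context(
        gold_passages=["The answer passage."],
        distractors=[f"Distractor passage {i}." for i in range(10_000)],
        target_tokens=8_192,
        count_text_tokens=approx_tokens,
    )
    print(len(ctx), count_image_tokens(448, 448))  # 1024 patches -> 256 tokens
    ```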

    The evaluation on MMLONGBENCH across tasks and context lengths shows that all models struggle, with closed-source models performing better. At the longest input length of 128K, every model struggles with long-context vision-language tasks; GPT-4o achieves an average score of only 62.9. Gemini-2.5-Pro is the strongest performer, outperforming open-source models by 20 points except on ICL tasks. The Ovis2-34B model achieves a score of 41.6 on summarization, similar to GPT-4o's 42.4, and Qwen2.5-VL-32B achieves a SubEM score of 64.6 on Visual RAG, surpassing even Gemini-2.0-Flash. Models also generalize beyond their training context lengths: Qwen2-VL-72B reaches a 51.9 average score at 128K despite a 32K training window.

    In conclusion, the researchers introduced MMLONGBENCH, the first comprehensive benchmark for evaluating LCVLMs across diverse downstream tasks. By covering five distinct task categories with unified cross-modal token counting and standardized context lengths, it provides a rigorous foundation for diagnosing the capabilities of frontier models. The evaluation of 46 models demonstrates that single-task performance is an unreliable predictor of overall long-context ability, and that frontier models face significant challenges in OCR accuracy and cross-modal retrieval. MMLONGBENCH offers a standard evaluation framework to drive future research toward more efficient vision-language token encodings, robust position-extrapolation schemes, and improved multi-modal retrieval and reasoning.


    Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project.

