    UniME: A Two-Stage Framework for Enhancing Multimodal Representation Learning with MLLMs

    April 29, 2025

    The CLIP framework has become foundational in multimodal representation learning, particularly for tasks such as image-text retrieval. However, it faces several limitations: a strict 77-token cap on text input, a dual-encoder design that separates image and text processing, and a limited compositional understanding that resembles bag-of-words models. These issues hinder its effectiveness in capturing nuanced, instruction-sensitive semantics. Although MLLMs like LLaVA, Qwen2-VL, and CogVLM offer significant advances in vision-language reasoning, their autoregressive next-token prediction objective restricts their ability to learn generalized, transferable embeddings. This has sparked growing interest in developing alternative methods that can combine the strengths of both contrastive learning and LLM-based reasoning.

    Recent approaches aim to overcome these limitations with new architectures and training strategies. For instance, E5-V proposes unimodal contrastive training on text to align cross-modal features, while VLM2Vec introduces the MMEB benchmark together with a contrastive training framework that converts advanced vision-language models into effective embedding generators. Models like LLM2Vec and NV-Embed enhance text-based representation learning by modifying the attention mechanisms in decoder-only LLMs. Despite these innovations, three challenges remain: handling long sequences, achieving better cross-modal fusion, and distinguishing hard negatives in contrastive learning. As multimodal applications expand, there is a pressing need for representation learning methods that are both scalable and capable of fine-grained semantic alignment.

    Researchers from institutions including The University of Sydney, DeepGlint, Tongyi Lab at Alibaba, and Imperial College London introduce UniME, a two-stage framework designed to improve multimodal representation learning using MLLMs. The first stage applies textual discriminative knowledge distillation from a strong LLM teacher to enhance the language encoder. The second stage employs hard negative enhanced instruction tuning, which involves filtering false negatives and sampling multiple challenging negatives per instance to improve the model’s discriminative and instruction-following abilities. Evaluations on the MMEB benchmark and various retrieval tasks show that UniME delivers consistent and significant improvements in both performance and compositional understanding.
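
The first stage can be illustrated with a small sketch. The article does not give the loss, so the code below makes an assumption: the student is pushed to reproduce the teacher's in-batch similarity distribution through a KL divergence. The function names, the temperature value, and the toy vectors are all hypothetical and are not taken from the UniME code.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def similarity_rows(embeddings, temperature):
    """For each text, a softmax distribution over its similarity to every text in the batch."""
    unit = [normalize(e) for e in embeddings]
    return [
        softmax([sum(a * b for a, b in zip(u, w)) / temperature for w in unit])
        for u in unit
    ]

def distillation_loss(student, teacher, temperature=0.05):
    """Mean KL(teacher || student) between in-batch similarity distributions.

    Assumption: comparing similarity distributions (not raw vectors) lets the
    teacher and student use different embedding sizes.
    """
    p_teacher = similarity_rows(teacher, temperature)
    p_student = similarity_rows(student, temperature)
    kl = 0.0
    for pt, ps in zip(p_teacher, p_student):
        kl += sum(t * math.log(t / s) for t, s in zip(pt, ps) if t > 0)
    return kl / len(student)

# Toy batch of three texts: the teacher uses 3-d vectors, the student 2-d vectors.
teacher = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0]]
aligned_student = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
misaligned_student = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0]]

loss_aligned = distillation_loss(aligned_student, teacher)
loss_misaligned = distillation_loss(misaligned_student, teacher)
```

In this toy batch the aligned student has the same pairwise cosine similarities as the teacher, so its loss is essentially zero, while the misaligned student is penalized.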

    In more detail, the first stage trains a student MLLM on text-only prompts under the supervision of a teacher model, which improves the quality of its embeddings. The second stage, hard negative enhanced instruction tuning, improves cross-modal alignment and task performance by filtering false negatives and sampling hard negatives. This stage also uses task-specific prompts to strengthen instruction-following for applications such as retrieval and visual question answering. Together, the two stages significantly boost UniME's performance on both in-distribution and out-of-distribution tasks.
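
The filtering and sampling step in the second stage can be sketched as follows. This is a minimal illustration under stated assumptions: a candidate scoring above the positive by more than a margin is treated as a probable false negative and dropped, and the `k` most similar remaining candidates are kept as hard negatives. The function name `select_hard_negatives`, the `margin` value, and the similarity scores are hypothetical.

```python
def select_hard_negatives(query_sims, positive_index, k=2, margin=0.1):
    """Pick hard negatives for one query from its in-batch similarity scores.

    query_sims[i] is the similarity between the query and candidate i.
    Candidates scoring above (positive similarity + margin) are assumed to be
    false negatives and are excluded; the k most similar survivors are returned.
    """
    threshold = query_sims[positive_index] + margin
    candidates = [
        (sim, idx)
        for idx, sim in enumerate(query_sims)
        if idx != positive_index and sim <= threshold
    ]
    # The highest-similarity negatives are the hardest to tell apart from the positive.
    candidates.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in candidates[:k]]

# Candidate 0 is the labeled positive. Candidate 1 scores far above it, so it is
# treated as a likely false negative and never used as a negative.
sims = [0.80, 0.95, 0.78, 0.40, 0.75]
hard = select_hard_negatives(sims, positive_index=0, k=2, margin=0.1)
# hard == [2, 4]
```

Easy negatives such as candidate 3 contribute little training signal, which is why only the top-scoring survivors are kept.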

    The study built UniME on two MLLM backbones, Phi3.5-V and LLaVA-1.6, and trained with PyTorch and DeepSpeed across 8 NVIDIA A100 GPUs. Training consisted of two stages: a textual knowledge distillation phase using the NLI dataset (273,000 pairs) and a hard negative instruction tuning phase on 662,000 multimodal pairs. NV-Embed V2 served as the teacher model. UniME was evaluated on the 36 datasets of the MMEB benchmark, where it achieved consistent improvements over baselines such as E5-V and VLM2Vec. Hard negatives significantly improved the model's ability to distinguish subtle differences, with the largest gains in long-caption and compositional retrieval tasks. Ablation studies confirmed that both training stages contribute to the result and supported the chosen tuning parameters.
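
To see why hard negatives carry more training signal than easy ones, consider a contrastive loss over one positive and a handful of negatives. The sketch below assumes an InfoNCE-style objective, which the article does not spell out; the function name `info_nce`, the temperature, and the similarity values are illustrative only.

```python
import math

def info_nce(pos_sim, neg_sims, temperature=0.05):
    """InfoNCE-style loss for one query: -log softmax of the positive's logit."""
    logits = [pos_sim / temperature] + [s / temperature for s in neg_sims]
    m = max(logits)
    # log-sum-exp with the maximum subtracted for numerical stability
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denominator - logits[0]

# Same positive, two different sets of negatives.
loss_hard = info_nce(0.80, [0.78, 0.75])  # negatives nearly as similar as the positive
loss_easy = info_nce(0.80, [0.20, 0.10])  # negatives that are trivially different
```

The loss with hard negatives is much larger than with easy ones, so the gradient concentrates on the near-miss candidates the model still confuses with the positive.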

    In conclusion, UniME is a two-stage framework designed to improve multimodal representation learning using MLLMs. In the first stage, UniME distills textual discriminative knowledge from a large language model to strengthen the language embeddings of the MLLM. In the second stage, it applies instruction tuning with multiple hard negatives per instance, which reduces interference from false negatives and pushes the model to distinguish challenging examples. Extensive evaluation on MMEB and various retrieval tasks shows that UniME consistently improves performance and offers strong discriminative and compositional abilities across tasks, addressing limitations of prior models such as CLIP.


    The post UniME: A Two-Stage Framework for Enhancing Multimodal Representation Learning with MLLMs appeared first on MarkTechPost.
