    Meta AI Introduces Perception Encoder: A Large-Scale Vision Encoder that Excels Across Several Vision Tasks for Images and Video

    April 18, 2025

    The Challenge of Designing General-Purpose Vision Encoders

    As AI systems grow increasingly multimodal, the role of visual perception models becomes more complex. Vision encoders are expected not only to recognize objects and scenes, but also to support tasks like captioning, question answering, fine-grained recognition, document parsing, and spatial reasoning across both images and videos. Existing models typically rely on diverse pretraining objectives—contrastive learning for retrieval, captioning for language tasks, and self-supervised methods for spatial understanding. This fragmentation complicates scalability and model deployment, and introduces trade-offs in performance across tasks.

    The key open challenge is designing a unified vision encoder that matches or exceeds task-specific methods, operates robustly in open-world scenarios, and scales efficiently across modalities.

    A Unified Solution: Meta AI’s Perception Encoder

    Meta AI introduces Perception Encoder (PE), a vision model family trained using a single contrastive vision-language objective and refined with alignment techniques tailored for downstream tasks. PE departs from the traditional multi-objective pretraining paradigm. Instead, it demonstrates that with a carefully tuned training recipe and appropriate alignment methods, contrastive learning alone can yield highly generalizable visual representations.
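    To make that objective concrete, below is a minimal sketch of a CLIP-style symmetric contrastive loss of the kind PE's pretraining builds on. The function names, shapes, and temperature value are illustrative assumptions, not Meta's released code.

        import torch
        import torch.nn.functional as F

        def contrastive_loss(image_emb, text_emb, temperature=0.07):
            # L2-normalize so dot products are cosine similarities.
            image_emb = F.normalize(image_emb, dim=-1)
            text_emb = F.normalize(text_emb, dim=-1)
            # (batch, batch) similarity matrix; the diagonal holds matched pairs.
            logits = image_emb @ text_emb.t() / temperature
            targets = torch.arange(logits.size(0), device=logits.device)
            # Symmetric InfoNCE: image-to-text plus text-to-image cross-entropy.
            return (F.cross_entropy(logits, targets)
                    + F.cross_entropy(logits.t(), targets)) / 2

        # Toy usage with random embeddings standing in for encoder outputs.
        print(contrastive_loss(torch.randn(8, 512), torch.randn(8, 512)))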

    The Perception Encoder operates across three scales—PEcoreB, PEcoreL, and PEcoreG—with the largest (G-scale) model containing 2B parameters. These models are designed to function as general-purpose encoders for both image and video inputs, offering strong performance in classification, retrieval, and multimodal reasoning.

    Training Approach and Architecture

    The pretraining of PE follows a two-stage process. The first stage involves robust contrastive learning on a large-scale curated image-text dataset (5.4B pairs), where several architectural and training enhancements improve both accuracy and robustness. These include progressive resolution scaling, large batch sizes (up to 131K), use of the LAMB optimizer, 2D RoPE positional encoding, tuned augmentations, and masked regularization.
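    Of these ingredients, 2D RoPE is the least standard. Below is one common formulation, assumed here for illustration rather than taken from PE's implementation: half of each query/key vector's channels are rotated by the patch's row index, the other half by its column index.

        import torch

        def rope_1d(x, pos):
            # x: (..., n, d) with d even; pos: (n,) positions along one axis.
            d = x.shape[-1]
            freqs = 1.0 / (10000 ** (torch.arange(0, d, 2).float() / d))
            angles = pos.float()[:, None] * freqs[None, :]   # (n, d/2)
            cos, sin = angles.cos(), angles.sin()
            x1, x2 = x[..., 0::2], x[..., 1::2]
            out = torch.empty_like(x)
            out[..., 0::2] = x1 * cos - x2 * sin
            out[..., 1::2] = x1 * sin + x2 * cos
            return out

        def rope_2d(x, grid_h, grid_w):
            # x: (batch, grid_h * grid_w, d); rotate half the channels by the
            # patch's row index and the other half by its column index.
            d = x.shape[-1]
            rows = torch.arange(grid_h).repeat_interleave(grid_w)
            cols = torch.arange(grid_w).repeat(grid_h)
            return torch.cat([rope_1d(x[..., : d // 2], rows),
                              rope_1d(x[..., d // 2:], cols)], dim=-1)

        q = torch.randn(2, 14 * 14, 64)   # 14x14 patch grid, 64-dim heads
        print(rope_2d(q, 14, 14).shape)   # torch.Size([2, 196, 64])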

    The second stage introduces video understanding by leveraging a video data engine that synthesizes high-quality video-text pairs. This pipeline incorporates captions from the Perception Language Model (PLM), frame-level descriptions, and metadata, which are then summarized using Llama 3.3. These synthetic annotations allow the same image encoder to be fine-tuned for video tasks via frame averaging.
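    Frame averaging here is as simple as it sounds: run the image encoder on each frame and mean-pool the resulting embeddings. A minimal sketch, assuming a generic image_encoder callable:

        import torch

        def encode_video(frames, image_encoder):
            # frames: (batch, time, channels, height, width)
            b, t = frames.shape[:2]
            feats = image_encoder(frames.flatten(0, 1))  # (b * t, dim)
            return feats.view(b, t, -1).mean(dim=1)      # average over frames

        # Toy usage: a stand-in encoder mapping each frame to a 512-dim vector.
        fake_encoder = lambda x: torch.randn(x.shape[0], 512)
        clip = torch.randn(2, 8, 3, 224, 224)            # 2 clips, 8 frames
        print(encode_video(clip, fake_encoder).shape)    # torch.Size([2, 512])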

    Despite using a single contrastive objective, PE develops general-purpose representations distributed across its intermediate layers. To access these, Meta introduces two alignment strategies (a sketch of intermediate-layer access follows the list):

    • Language alignment for tasks such as visual question answering and captioning.
    • Spatial alignment for detection, tracking, and depth estimation, using self-distillation and spatial correspondence distillation via SAM2.
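    As a rough picture of what "accessing intermediate layers" can look like in practice, here is a PyTorch forward hook that captures activations from a mid-network block. The encoder, layer index, and shapes are hypothetical placeholders, not PE's actual architecture.

        import torch
        import torch.nn as nn

        # Stand-in 12-block transformer encoder.
        encoder = nn.Sequential(*[
            nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True)
            for _ in range(12)
        ])
        captured = {}

        def hook(module, inputs, output):
            captured["block8"] = output          # stash block-8 activations

        handle = encoder[7].register_forward_hook(hook)
        tokens = torch.randn(2, 196, 256)        # (batch, patches, dim)
        _ = encoder(tokens)
        print(captured["block8"].shape)          # torch.Size([2, 196, 256])
        handle.remove()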

    Empirical Performance Across Modalities

    PE demonstrates strong zero-shot generalization across a wide range of vision benchmarks. On image classification, PEcoreG matches or exceeds proprietary models trained on large private datasets such as JFT-3B; the standard zero-shot protocol behind these numbers is sketched after the list. It achieves:

    • 86.6% on ImageNet-val
    • 92.6% on ImageNet-Adversarial
    • 88.2% on the full ObjectNet set
    • Competitive results on fine-grained datasets including iNaturalist, Food101, and Oxford Flowers
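    The zero-shot protocol behind such numbers is the standard contrastive one: embed a text prompt per class, score the image against each, and take the argmax. A minimal sketch with placeholder embeddings (PE's released API may differ):

        import torch
        import torch.nn.functional as F

        @torch.no_grad()
        def zero_shot_classify(image_emb, class_text_embs):
            # image_emb: (dim,); class_text_embs: (num_classes, dim)
            image_emb = F.normalize(image_emb, dim=-1)
            class_text_embs = F.normalize(class_text_embs, dim=-1)
            return (class_text_embs @ image_emb).argmax().item()

        # Toy usage: 3 classes, e.g. prompts like "a photo of a cat/dog/bird".
        img, classes = torch.randn(512), torch.randn(3, 512)
        print(zero_shot_classify(img, classes))  # index of best-matching class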

    In video tasks, PE achieves state-of-the-art performance on zero-shot classification and retrieval benchmarks, outperforming InternVideo2 and SigLIP2-g-opt, while being trained on just 22M synthetic video-caption pairs. The use of simple average pooling across frames—rather than temporal attention—demonstrates that architectural simplicity, when paired with well-aligned training data, can still yield high-quality video representations.

    An ablation study shows that each component of the video data engine contributes meaningfully to performance. Improvements of +3.9% in classification and +11.1% in retrieval over image-only baselines highlight the utility of synthetic video data, even at modest scale.

    Conclusion

    Perception Encoder provides a technically compelling demonstration that a single contrastive objective, if implemented with care and paired with thoughtful alignment strategies, is sufficient to build general-purpose vision encoders. PE not only matches specialized models in their respective domains but does so with a unified and scalable approach.

    The release of PE, along with its codebase and the PE Video Dataset, offers the research community a reproducible and efficient foundation for building multimodal AI systems. As visual reasoning tasks grow in complexity and scope, PE provides a path forward toward more integrated and robust visual understanding.


    Check out the Paper, Model, Code and Dataset.


    The post Meta AI Introduces Perception Encoder: A Large-Scale Vision Encoder that Excels Across Several Vision Tasks for Images and Video appeared first on MarkTechPost.
