    Mila & Université de Montréal Researchers Introduce the Forgetting Transformer (FoX) to Boost Long-Context Language Modeling without Sacrificing Efficiency

    April 25, 2025

    Transformers have revolutionized sequence modeling with an architecture that handles long-range dependencies efficiently without relying on recurrence. By processing all input tokens in parallel through self-attention, they achieve impressive performance on natural language tasks. However, despite their dominance, some essential features of recurrent neural networks, particularly the ability to forget irrelevant past information, are not natively present in standard Transformer models. This has led researchers to explore hybrid approaches that combine the best aspects of both architectures. The growing body of work on linear attention and gated recurrent designs has prompted interest in how such mechanisms can be meaningfully integrated into the Transformer to improve its adaptability and precision on context-sensitive sequences.

    A key challenge in sequential modeling is dynamically controlling memory. Standard attention-based models, such as the Transformer, process and store all input information uniformly, regardless of its relevance over time. This approach can be suboptimal when recent inputs carry more significance for a task, or when older inputs introduce noise. Traditional recurrent models address this with mechanisms such as forget gates, which allow them to modulate memory retention. However, these models struggle to maintain performance over extended sequences because of their fixed-size hidden states. The Transformer, while powerful, lacks a native method for discarding less useful past information in a context-sensitive manner. As a result, tasks that demand selective memory can suffer, especially when input lengths grow substantially and noise accumulates.

    To address memory challenges, some strategies have introduced static positional biases into attention mechanisms. For instance, ALiBi adds predefined slopes to attention logits to simulate a form of recency weighting. However, such methods lack adaptability, as they do not consider the content of the input when deciding what to retain. Other efforts, such as Mamba-2 and GLA, implement gating within linear attention frameworks but often sacrifice normalization, a key aspect of Transformer accuracy. Also, these models tend to deviate significantly from the Transformer structure, making them less compatible with Transformer-based optimizations and pretraining paradigms. Thus, a gap remains for an approach that can dynamically forget in a learnable and efficient manner while preserving the Transformer’s computational strengths.
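
    To make the contrast concrete, the snippet below sketches an ALiBi-style static bias in PyTorch; the slope value, tensor shapes, and names are illustrative assumptions rather than any of the cited implementations. The penalty depends only on the distance between query and key positions, never on the content of the tokens.

        import torch

        def alibi_bias(seq_len: int, slope: float) -> torch.Tensor:
            # A fixed, content-independent penalty that grows with the distance
            # between query position i and key position j (for j <= i).
            i = torch.arange(seq_len).unsqueeze(1)   # query positions
            j = torch.arange(seq_len).unsqueeze(0)   # key positions
            return -slope * (i - j).clamp(min=0).float()

        scores = torch.randn(8, 8)                    # raw q @ k logits for one head
        biased = scores + alibi_bias(8, slope=0.5)    # static recency weighting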

    Researchers from Mila & Université de Montréal and MakerMaker AI proposed a novel architecture called the Forgetting Transformer (FoX). This model introduces a mechanism known as Forgetting Attention, which inserts a scalar forget gate into the softmax attention process. Unlike existing recurrent models, this modification is fully compatible with parallel computation and avoids the need for positional embeddings. The forget gate adjusts the raw attention scores based on the data itself, allowing FoX to effectively down-weight less relevant past inputs. Importantly, the model retains full compatibility with the efficient FlashAttention algorithm, ensuring minimal deployment overhead. Two architectural variants were tested: FoX, based on LLaMA, and FoX (Pro), which incorporates normalization techniques and token-shifting mechanisms derived from recent recurrent models.
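
    As a rough sketch of how such a data-dependent gate could be computed, the module below produces one scalar gate logit per head and per timestep from a learned linear map of the token representation; the class name, tensor layout, and the choice to return pre-sigmoid logits are assumptions for illustration, not the authors' code.

        import torch
        import torch.nn as nn

        class ForgetGate(nn.Module):
            """One scalar forget-gate logit per head and per timestep (illustrative)."""

            def __init__(self, d_model: int, n_heads: int):
                super().__init__()
                self.proj = nn.Linear(d_model, n_heads)  # learned linear map of the input

            def forward(self, x: torch.Tensor) -> torch.Tensor:
                # x: (batch, seq_len, d_model) -> logits of shape (batch, n_heads, seq_len);
                # the sigmoid is applied later, in log space, when biasing attention scores.
                return self.proj(x).transpose(1, 2)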

    Technically, the model computes forget gate values for each timestep using a sigmoid activation on a learned linear transformation of the input. These scalar gate values are then used to bias attention logits through a log-sum formulation, modifying the softmax operation in a hardware-efficient manner. The modification is implemented by computing the cumulative sum of log forget values and adjusting attention weights without requiring the instantiation of large matrices. Multi-head attention support is retained, with each head maintaining independent forget gate parameters. The Pro variant introduces output normalization and output gates, along with a key-value shift mechanism that mixes current and previous tokens in a learnable manner. These adjustments further refine context sensitivity and model flexibility without significantly increasing the number of parameters.
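
    Putting the pieces together, here is a naive, quadratic-time sketch of Forgetting Attention in its parallel form, reusing gate logits shaped like those in the sketch above; the function name and shapes are assumptions. Unlike this reference version, the paper's implementation folds the cumulative log-gate terms into a FlashAttention-style kernel, so the full sequence-length-squared bias matrix is never materialized.

        import torch
        import torch.nn.functional as F

        def forgetting_attention(q, k, v, gate_logits):
            # q, k, v: (batch, heads, seq_len, head_dim); gate_logits: (batch, heads, seq_len).
            B, H, T, d = q.shape
            log_f = F.logsigmoid(gate_logits)          # log of the sigmoid forget gates
            c = torch.cumsum(log_f, dim=-1)            # c_i = log f_1 + ... + log f_i
            bias = c.unsqueeze(-1) - c.unsqueeze(-2)   # bias[..., i, j] = log(f_{j+1} * ... * f_i)
            scores = q @ k.transpose(-1, -2) / d ** 0.5 + bias
            causal = torch.ones(T, T, dtype=torch.bool, device=q.device).tril()
            scores = scores.masked_fill(~causal, float("-inf"))
            return F.softmax(scores, dim=-1) @ v

        # Example: batch of 1, 4 heads, 16 tokens, head dimension 32.
        q = k = v = torch.randn(1, 4, 16, 32)
        out = forgetting_attention(q, k, v, gate_logits=torch.randn(1, 4, 16))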

    In a long-context language modeling task on the LongCrawl64 dataset (a 48-billion-token subset of RedPajama-v2), FoX consistently surpassed both standard Transformer baselines and leading recurrent models. Per-token loss metrics showed a sharper decline for FoX across token positions, indicating better context utilization. At position 64,000, FoX (Pro) achieved significantly lower loss values than the Transformer (Pro) and LLaMA variants. In addition, perplexity evaluations demonstrated that FoX maintains robust accuracy as the validation context length grows, with performance degrading less sharply beyond the training limit of 16,384 tokens. Competing models, such as Mamba-2 and DeltaNet, plateaued earlier, highlighting FoX’s superior extrapolation capabilities. Training used 760 million parameters and the tiktoken GPT-2 tokenizer, with extensive tuning of learning rates and head dimensions. FoX preferred higher learning rates and smaller head dimensions, indicating architectural resilience and adaptability.

    The researchers emphasized that Forgetting Attention retains the core benefits of the Transformer while overcoming its limitations around selective memory. They demonstrated that the forget gate introduces a data-driven recency bias that strengthens performance on both short and long sequences. Additionally, the implementation incurs minimal computational cost and no additional memory overhead, thanks to its compatibility with FlashAttention. Notably, Forgetting Attention also generalizes static biases such as ALiBi by introducing learnable gates, providing evidence that dynamic biasing is significantly more effective. FoX models also matched or exceeded standard Transformer performance on downstream tasks, with the Pro variant showing consistent superiority, especially on tasks that reward adaptability across contexts.
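
    A quick numerical check of that relationship, under the same illustrative conventions as above: if every log forget value is the same constant, the resulting bias collapses to an ALiBi-style linear penalty, which the learnable, data-dependent gates then generalize.

        import torch

        T, slope = 8, 0.5
        const_log_f = torch.full((T,), -slope)        # pretend log f_t = -slope at every step
        c = torch.cumsum(const_log_f, dim=-1)
        fox_bias = c.unsqueeze(-1) - c.unsqueeze(-2)  # c_i - c_j = -slope * (i - j)
        i = torch.arange(T).unsqueeze(1)
        j = torch.arange(T).unsqueeze(0)
        assert torch.allclose(fox_bias, -slope * (i - j).float())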

    This work demonstrates that the effective integration of dynamic memory mechanisms into Transformer architectures is not only feasible but also beneficial across a wide range of benchmarks. The introduction of a forget gate within the attention computation allows models to discard irrelevant information in a learned manner, substantially improving focus and generalization. The compatibility with high-performance implementations, such as FlashAttention, ensures that such improvements come without trade-offs in efficiency.

    Key takeaways from the research on FoX include:

    • FoX introduces Forgetting Attention, enhancing standard softmax attention with learnable forget gates.
    • Two architectural variants were tested: FoX (LLaMA) and FoX (Pro), with the latter incorporating additional normalization and gating layers.
    • FoX models trained on 48B tokens with 760M parameters significantly outperformed Transformers in long-context modeling.
    • Per-token loss L(i) and perplexity P(l) confirmed that FoX maintained low error rates even beyond 64k-token sequences.
    • Forgetting Attention generalizes ALiBi, replacing its fixed biases with dynamic, data-dependent gating.
    • The Pro architecture further improved results with minimal overhead by using output normalization and token shift mechanisms.
    • Hardware compatibility was preserved through modifications to FlashAttention, enabling practical deployment at scale.

    Check out the Paper and Code.

    The post Mila & Université de Montréal Researchers Introduce the Forgetting Transformer (FoX) to Boost Long-Context Language Modeling without Sacrificing Efficiency appeared first on MarkTechPost.
