
    Decoupled Diffusion Transformers: Accelerating High-Fidelity Image Generation via Semantic-Detail Separation and Encoder Sharing

    April 22, 2025

    Diffusion Transformers have demonstrated outstanding performance in image generation tasks, surpassing traditional models, including GANs and autoregressive architectures. They operate by gradually adding noise to images during a forward diffusion process and then learning to reverse this process through denoising, which helps the model approximate the underlying data distribution. Unlike the commonly used UNet-based diffusion models, Diffusion Transformers use a transformer backbone, which performs well once it has been trained long enough. However, their training process is slow and computationally intensive. A key limitation lies in their architecture: during each denoising step, the same modules must both encode low-frequency semantic information and decode high-frequency details, which creates an optimization conflict between the two tasks.
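
    The forward noising step described above can be sketched in a few lines. This is a minimal illustration, not the paper's code: it assumes a linear interpolation path between data and Gaussian noise (one common choice for velocity-predicting models), and the function name `forward_noise` and the convention that t = 0 is clean data are assumptions made for this example.

```python
import random

def forward_noise(x0, t, rng):
    """Blend a clean sample x0 with Gaussian noise at time t in [0, 1].

    Assumed convention for this sketch: t = 0 is clean data, t = 1 is pure noise.
    """
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    # Linear interpolation between the data point and the noise sample.
    x_t = [(1.0 - t) * x + t * e for x, e in zip(x0, noise)]
    # Regression target for a velocity-predicting model: d x_t / d t = noise - x0.
    velocity = [e - x for x, e in zip(x0, noise)]
    return x_t, velocity

rng = random.Random(0)
x0 = [0.5, -1.0, 2.0]
x_t, v = forward_noise(x0, 0.25, rng)
print(x_t, v)
```

    A model trained to predict `velocity` from `x_t` and `t` can then be integrated step by step to turn noise back into data.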

    To address the slow training and performance bottlenecks, recent work has focused on improving the efficiency of Diffusion Transformers through various strategies. These include utilizing optimized attention mechanisms, such as linear and sparse attention, to reduce computational costs, and introducing more effective sampling techniques, including log-normal resampling and loss reweighting, to stabilize the learning process. Additionally, methods like REPA, RCG, and DoD incorporate domain-specific inductive biases, while masked modeling enforces structured feature learning, boosting the model’s reasoning capabilities. Models like DiT, SiT, SD3, Lumina, and PixArt have extended the diffusion transformer framework to advanced areas such as text-to-image and text-to-video generation. 

    Researchers from Nanjing University and ByteDance Seed Vision introduce the Decoupled Diffusion Transformer (DDT), which separates the model into a dedicated condition encoder for semantic extraction and a velocity decoder for detailed generation. This decoupled design enables faster convergence and improved sample quality. On the ImageNet 256×256 and 512×512 benchmarks, their DDT-XL/2 model achieves state-of-the-art FID scores of 1.31 and 1.28, respectively, with up to 4× faster training convergence. To further accelerate inference, they propose a statistical dynamic programming method that optimally shares encoder outputs across denoising steps with minimal impact on performance.

    DDT introduces a condition encoder and a velocity decoder to handle the low- and high-frequency components of image generation separately. The encoder extracts semantic features (z_t) from the noisy input, the timestep, and the class label; the decoder then uses z_t to estimate the velocity field. To keep z_t consistent across steps, representation alignment and decoder supervision are applied. During inference, a shared self-condition mechanism reduces computation by reusing z_t at certain timesteps instead of recomputing it. A dynamic programming approach identifies the optimal timesteps at which to recompute z_t, minimizing performance loss while accelerating the sampling process.
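
    The encoder-sharing idea can be illustrated with a toy sampler. The sketch below is not the DDT implementation: `condition_encoder` and `velocity_decoder` are hypothetical stand-ins built from simple arithmetic, and the time direction of the loop is an assumption. It only shows the control flow in which z_t is refreshed at selected steps and reused in between.

```python
def condition_encoder(x, t, label):
    """Toy stand-in for the condition encoder: summarizes x into a feature z."""
    mean = sum(x) / len(x)
    return [mean, t, float(label)]

def velocity_decoder(x, t, z):
    """Toy stand-in for the velocity decoder: moves x toward a label-dependent
    target, using the (possibly stale) summary stored in z."""
    target, summary = z[2], z[0]
    return [target - summary for _ in x]

def sample(x, label, num_steps, recompute_steps):
    """Euler sampler that refreshes z only at the steps in recompute_steps."""
    dt = 1.0 / num_steps
    z = None
    encoder_calls = 0
    for step in range(num_steps):
        t = step * dt
        if z is None or step in recompute_steps:
            z = condition_encoder(x, t, label)  # expensive call in a real model
            encoder_calls += 1
        v = velocity_decoder(x, t, z)  # decoder runs at every step
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x, encoder_calls

full, full_calls = sample([0.0, 0.0], 1, 10, set(range(10)))
shared, shared_calls = sample([0.0, 0.0], 1, 10, {0, 5})
print(full_calls, shared_calls, full[0], shared[0])
```

    In this toy setup the shared run makes 2 encoder calls instead of 10, and its result drifts slightly from the fully recomputed one because the decoder works from a stale feature. That trade-off between saved computation and drift is what the choice of recompute steps has to balance.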

    The researchers trained their models on 256×256 ImageNet using a batch size of 256 without gradient clipping or warm-up. With the VAE-ft-EMA autoencoder and an Euler sampler, they evaluated performance using FID, sFID, IS, Precision, and Recall. They built improved baselines with SwiGLU, RoPE, RMSNorm, and lognorm sampling. Their DDT models consistently outperformed prior baselines, particularly at larger model sizes, and converged significantly faster than REPA. Further gains were achieved through encoder sharing strategies and careful tuning of the encoder-decoder ratio, resulting in state-of-the-art FID scores on both 256×256 and 512×512 ImageNet.
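
    Two of the baseline components named above, RMSNorm and SwiGLU, have simple standard definitions. The sketch below writes them out in plain Python for a single vector. It is an illustration of the general techniques, not the authors' code; the helper names are chosen for this example, and only the gated part of a SwiGLU feed-forward layer is shown (the down projection is omitted).

```python
import math

def rms_norm(x, gain, eps=1e-6):
    """RMSNorm: divide by the root-mean-square of x, then scale by a learned gain."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for v, g in zip(x, gain)]

def silu(v):
    """SiLU (swish) activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def swiglu(x, w_gate, w_up):
    """Gated part of a SwiGLU layer: SiLU(W_gate x) * (W_up x), elementwise.

    Each weight matrix is a list of rows, one row per hidden unit.
    """
    gate = [silu(dot(x, row)) for row in w_gate]
    up = [dot(x, row) for row in w_up]
    return [g * u for g, u in zip(gate, up)]

normed = rms_norm([3.0, 4.0], [1.0, 1.0])
hidden = swiglu([1.0, 0.0], [[1.0, 0.0]], [[2.0, 0.0]])
print(normed, hidden)
```

    Unlike LayerNorm, RMSNorm does not subtract the mean, so the normalized vector keeps its direction and only its scale changes.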

    In conclusion, the study presents the DDT, which addresses the optimization challenge in traditional diffusion transformers by separating semantic encoding and high-frequency decoding into distinct modules. By scaling encoder capacity relative to the decoder, DDT achieves notable performance gains, especially in larger models. The DDT-XL/2 model sets new benchmarks on ImageNet, achieving faster training convergence and lower FID scores for both 256×256 and 512×512 resolutions. Additionally, the decoupled design enables encoder sharing across denoising steps, significantly improving inference efficiency. A dynamic programming strategy further enhances this by determining optimal sharing points, maintaining image quality while reducing computational load.
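
    Choosing where to recompute the encoder output can be framed as a small dynamic program. The sketch below is one possible formulation under stated assumptions, not the authors' exact recurrence: it assumes a precomputed matrix `cost[j][i]` giving the penalty for reusing the feature from step j at a later step i, and a fixed budget of encoder calls. The function name `plan_recompute_steps` is hypothetical.

```python
def plan_recompute_steps(cost, budget):
    """Pick `budget` recompute steps (always including step 0) that minimize the
    total reuse penalty, where cost[j][i] is the penalty of using the feature
    computed at step j for step i >= j. Requires 1 <= budget <= len(cost)."""
    n = len(cost)
    inf = float("inf")
    # seg[j][e]: total penalty of reusing the step-j feature for steps j..e-1.
    seg = [[0.0] * (n + 1) for _ in range(n)]
    for j in range(n):
        for e in range(j + 1, n + 1):
            seg[j][e] = seg[j][e - 1] + cost[j][e - 1]
    # best[k][e]: minimal penalty to cover steps 0..e-1 with k encoder calls.
    best = [[inf] * (n + 1) for _ in range(budget + 1)]
    prev = [[-1] * (n + 1) for _ in range(budget + 1)]
    best[0][0] = 0.0
    for k in range(1, budget + 1):
        for e in range(1, n + 1):
            for j in range(e):
                cand = best[k - 1][j] + seg[j][e]
                if cand < best[k][e]:
                    best[k][e] = cand
                    prev[k][e] = j
    # Walk back through the table to recover the chosen recompute steps.
    steps, e = [], n
    for k in range(budget, 0, -1):
        e = prev[k][e]
        steps.append(e)
    return sorted(steps), best[budget][n]

# Synthetic penalty: reuse gets worse linearly with the distance between steps.
cost = [[abs(i - j) for i in range(6)] for j in range(6)]
steps, total = plan_recompute_steps(cost, budget=2)
print(steps, total)  # [0, 3] 6.0
```

    With a uniform synthetic penalty the plan splits the steps evenly. With a measured penalty matrix, the same recurrence would place more recompute steps wherever the features change fastest.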


    Check out the Paper.

    The post Decoupled Diffusion Transformers: Accelerating High-Fidelity Image Generation via Semantic-Detail Separation and Encoder Sharing appeared first on MarkTechPost.
