xdit-project / xDiT
xDiT: A Scalable Inference Engine for Diffusion Transformers (DiTs) with Massive Parallelism
Repository Overview (README excerpt)
xDiT: A Scalable Inference Engine for Diffusion Transformers (DiTs) on Multiple Computing Devices

📝 Papers | 🚀 Quick Start | 🎯 Supported DiTs | 📚 Dev Guide | 📈 Discussion | 📝 Blogs

Table of Contents

• 🔥 Meet xDiT
• 📢 Open-source Community
• 🎯 Supported DiTs
• 📈 Performance
• 🚀 QuickStart
• 🖼️ ComfyUI with xDiT
• ✨ xDiT's Arsenal
  • Parallel Methods
    • 1. PipeFusion
    • 2. Unified Sequence Parallel
    • 3. Hybrid Parallel
    • 4. CFG Parallel
    • 5. Parallel VAE
  • Single GPU Acceleration
    • Compilation Acceleration
    • Cache Acceleration
• 📚 Develop Guide
• 🚧 History and Looking for Contributions
• 📝 Cite Us

## 🔥 Meet xDiT

Diffusion Transformers (DiTs) are driving advancements in high-quality image and video generation. As the input context length in DiTs escalates, the computational demand of the Attention mechanism grows **quadratically**. Consequently, multi-GPU and multi-machine deployments are essential to meet the **real-time** requirements of online services.

### Parallel Inference

To meet the real-time demands of DiT applications, parallel inference is a must. xDiT is an inference engine designed for the large-scale parallel deployment of DiTs. It provides a suite of efficient parallel approaches for diffusion models, as well as computation accelerations. The overview of xDiT is shown as follows:

• Sequence Parallelism: USP is a unified sequence-parallel approach we proposed that combines DeepSpeed-Ulysses and Ring-Attention.
• PipeFusion: a sequence-level pipeline parallelism, similar to TeraPipe, that exploits the input temporal redundancy characteristic of diffusion models.
• Data Parallel: processes multiple prompts, or multiple images generated from a single prompt, in parallel.
• CFG Parallel, also known as Split Batch: activates when classifier-free guidance (CFG) is used, with a constant parallelism degree of 2.
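These parallel degrees compose multiplicatively: the product of all enabled degrees must equal the number of GPUs in the deployment. A minimal sketch of that constraint, using hypothetical names rather than xDiT's actual configuration API:

```python
def check_hybrid_degrees(world_size, ulysses=1, ring=1, pipefusion=1, cfg=1, data=1):
    """Validate a hybrid parallel configuration (illustrative only).

    Sequence parallelism (USP) factors into a Ulysses degree and a
    Ring-Attention degree; CFG parallelism ("split batch") is either
    off (1) or a constant 2. The product of all degrees must match the
    total number of GPUs (the world size).
    """
    if cfg not in (1, 2):
        raise ValueError("CFG parallel degree must be 1 or 2")
    product = ulysses * ring * pipefusion * cfg * data
    if product != world_size:
        raise ValueError(
            f"parallel degrees multiply to {product}, "
            f"but world size is {world_size}"
        )
    return True

# e.g. 8 GPUs: 2-way Ulysses x 2-way Ring-Attention x CFG parallel
check_hybrid_degrees(8, ulysses=2, ring=2, cfg=2)
```

Which factorization is best depends on the network hardware: Ulysses and CFG parallelism prefer high-bandwidth links, while PipeFusion's point-to-point pipeline tolerates slower interconnects.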
The four parallel methods in xDiT can be combined in a hybrid manner, optimizing communication patterns to best suit the underlying network hardware. As shown in the following picture, xDiT offers a set of APIs that adapt DiT models in huggingface/diffusers to hybrid parallel implementations through simple wrappers. If the model you need is not in the model zoo, developing it yourself is not difficult; please refer to our Dev Guide.

We have also implemented the following parallel strategies for reference:

• Tensor Parallelism
• DistriFusion

### Cache Acceleration

Cache methods, including TeaCache, First-Block-Cache, and DiTFastAttn, exploit computational redundancies between different steps of the diffusion model to accelerate inference on a single GPU.

### Computing Acceleration

These optimizations are orthogonal to parallelism and focus on accelerating performance on a single GPU. First, xDiT employs a series of kernel acceleration methods. In addition to utilizing well-known Attention optimization libraries, we also leverage compilation acceleration technologies.

## 📢 Open-source Community

The following open-sourced DiT models were released with xDiT support on day 1:
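The cache methods above all rest on one observation: consecutive denoising steps produce similar intermediate activations, so expensive computation can sometimes be reused. The toy sketch below illustrates the first-block-cache pattern on plain floats; the class name, threshold, and control flow are illustrative assumptions, not xDiT's actual implementation:

```python
class FirstBlockCache:
    """Illustrative sketch of the first-block-cache idea (not xDiT's API):
    if the first transformer block's output barely changes between two
    denoising steps, reuse the cached output of the remaining blocks."""

    def __init__(self, threshold=0.05):
        self.threshold = threshold
        self.prev_first = None    # first-block output from the previous step
        self.cached_rest = None   # cached output of the remaining blocks

    def step(self, x, first_block, remaining_blocks):
        first_out = first_block(x)
        if self.prev_first is not None and self.cached_rest is not None:
            # relative change of the first block's output between steps
            change = abs(first_out - self.prev_first) / (abs(self.prev_first) + 1e-8)
            if change < self.threshold:
                self.prev_first = first_out
                return self.cached_rest  # skip the expensive blocks
        out = first_out
        for block in remaining_blocks:
            out = block(out)
        self.prev_first, self.cached_rest = first_out, out
        return out
```

In practice the comparison runs on tensors (e.g. a relative L1 distance of the hidden states), and the threshold trades speed against output fidelity.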
• HunyuanVideo
• StepVideo
• SkyReels-V1
• Wan2.1

## 🎯 Supported DiTs

| Model Name | CFG | SP | PipeFusion | TP | MR* | Performance Report Link |
| --- | --- | --- | --- | --- | --- | --- |
| 🎬 StepVideo | NA | ✔️ | ❎ | ✔️ | ❎ | Report |
| 🎬 HunyuanVideo | NA | ✔️ | ❎ | ❎ | ✔️ | Report |
| 🎬 HunyuanVideo-1.5 | ❎ | ✔️ | ❎ | ❎ | ✔️ | NA |
| 🎬 ConsisID-Preview | ✔️ | ✔️ | ❎ | ❎ | ❎ | Report |
| 🎬 CogVideoX1.5 | ✔️ | ✔️ | ❎ | ❎ | ❎ | Report |
| 🎬 Mochi-1 | ✔️ | ✔️ | ❎ | ❎ | ❎ | Report |
| 🎬 CogVideoX | ✔️ | ✔️ | ❎ | ❎ | ❎ | Report |
| 🎬 Latte | ❎ | ✔️ | ❎ | ❎ | ❎ | Report |
| 🎬 Wan2.1 | ❎ | ✔️ | ❎ | ❎ | ✔️ | NA |
| 🎬 Wan2.2 | ❎ | ✔️ | ❎ | ❎ | ✔️ | NA |
| 🎬 LTX-2 | ❎ | ✔️ | ❎ | ❎ | ✔️ | NA |
| 🔵 HunyuanDiT-v1.2-Diffusers | ✔️ | ✔️ | ✔️ | ❎ | ❎ | Report |
| 🔴 Z-Image Turbo | ❎ | ✔️ | ❎ | ❎ | ✔️ | NA |
| 🟠 Flux 2 klein | ❎ | ✔️ | ❎ | ❎ | ✔️ | NA |
| 🟠 Flux 2 | ❎ | ✔️ | ❎ | ❎ | ✔️ | NA |
| 🟠 Flux | NA | ✔️ | ✔️ | ❎ | ✔️ | Report |
| 🟠 Flux Kontext | ❎ | ✔️ | ❎ | ❎ | ✔️ | NA |
| 🟢 Qwen Image | ❎ | ✔️ | ❎ | ❎ | ✔️ | NA |
| 🟢 Qwen Image-Edit | ❎ | ✔️ | ❎ | ❎ | ✔️ | NA |
| 🔴 PixArt-Sigma | ✔️ | ✔️ | ✔️ | ❎ | ❎ | Report |
| 🟢 PixArt-alpha | ✔️ | ✔️ | ✔️ | ❎ | ❎ | Report |
| 🟠 Stable Diffusion 3 | ✔️ | ✔️ | ✔️ | ❎ | ✔️ | Report |
| 🟤 SANA | ✔️ | ✔️ | ✔️ | ❎ | ❎ | Report |
| ⚫ SANA Sprint | NA | ✔️ | ❎ | ❎ | ❎ | NA |
| 🟣 SDXL | ✔️ | ❎ | ❎ | ❎ | ❎ | NA |

MR* = the model is runnable via the model runner; if not, it is runnable via the provided example scripts.

## 🖼️ TACO-DiT: ComfyUI with xDiT

ComfyUI is the most popular web-based diffusion-model interface optimized for workflows. It provides users with a UI platform for image generation, supporting plugins like LoRA, ControlNet, and IPAdapter. Yet its native single-GPU design leaves it struggling with the demands of today's large DiTs, resulting in unacceptably high latency for models like Flux.1.
Using our commercial project **TACO-DiT**, a closed-source ComfyUI variant built with xDiT, we've implemented a multi-GPU parallel processing workflow within ComfyUI, effectively addressing Flux.1's performance challenges. Below is an example of using TACO-DiT to accelerate a Flux workflow with LoRA. With TACO-DiT, you can significantly reduce your ComfyUI workflow inference latency and boost throughput with multiple GPUs. It is now compatible with multiple plugins, including ControlNet and LoRA. More features and details can be found in our intro video:

• [[YouTube] TACO-DiT: Accelerating Your ComfyUI Generation Experience](https://www.youtube.com/watch?v=7DXnGrARqys)
• [[Bilibili] TACO-DiT: Accelerate Your ComfyUI Generation Experience](https://www.bilibili.com/video/BV18tU7YbEra/?vd_source=59c1f990379162c8f596974f34224e4f)

The blog article is also available: Supercharge Your AIGC Experience: Leverage xDiT for Multiple GPU Parallel in ComfyUI Flux.1 Workflow. The ComfyUI plugin for xDiT is now available.