
songmzhang / KDFlow

A user-friendly & efficient knowledge distillation framework for LLMs, supporting off-policy, on-policy (OPD), cross-tokenizer, multimodal, and on-policy self-distillation.

View on GitHub
63 stars
5 forks
0 issues

AI Architecture Analysis

This repository is indexed by RepoMind. By analyzing songmzhang/KDFlow in our AI interface, you can instantly generate complete architecture diagrams, visualize control flows, and perform automated security audits across the entire codebase.

Our Agentic Context Augmented Generation (Agentic CAG) engine loads full source files into context on-demand, avoiding the fragmentation of traditional RAG systems. Ask questions about the architecture, dependencies, or specific features to see it in action.

To optimize performance, source files are only loaded when you start an analysis.

Embed this Badge

Showcase RepoMind's analysis directly in your repository's README.

[![Analyzed by RepoMind](https://img.shields.io/badge/Analyzed%20by-RepoMind-4F46E5?style=for-the-badge)](https://repomind.in/repo/songmzhang/KDFlow)

Repository Overview (README excerpt)

Crawler view

### **A User-friendly and Efficient Framework for LLM Knowledge Distillation**

---

#### 📰 News

- **[2026/03]** 🎉 KDFlow v0.1.1 released! Now supports **vision-language (multimodal) models** and the **Qwen3.5 series** (as the teacher model).

---

#### 📑 Table of Contents

- ✨ Key Features
- 🚀 Quick Start
  - Installation
  - Off-Policy Knowledge Distillation
  - On-Policy Knowledge Distillation
  - Cross-Tokenizer Knowledge Distillation
  - Supervised Fine-Tuning (SFT)
- ⚙️ Configuration Reference
- 🧩 Extending KDFlow
  - Adding a Custom KD Algorithm
  - Adding a Custom KD Loss
- 🔑 Design Highlights
- 🙏 Acknowledgement
- 📖 Citation
- 📄 License
- ⭐ Star History

---

#### ✨ Key Features

- **Decoupled Infrastructure** — SGLang for teacher inference and FSDP2 for student training.
- **Off-Policy Knowledge Distillation** — Distill from pre-collected teacher hidden states on static datasets.
- **On-Policy Knowledge Distillation** — Student-generated rollout responses feed teacher forward passes and distillation training in a closed loop.
- **Cross-Tokenizer Distillation** — Native support for distilling between models with different tokenizers (e.g., Llama → Qwen).
- **SFT Training (Black-box KD)** — Supervised fine-tuning on a collected dataset.
- **Multimodal Support** — Distillation with vision-language models (e.g., Qwen3-VL).
- **Colocate Mode** — Teacher and student models **share the same GPUs** via a sleep/wakeup mechanism, maximizing GPU utilization.
- **Teacher on SGLang** — Teacher inference is powered by the SGLang Engine, enabling high-throughput prefilling and flexible parallelism strategies.
- **Pluggable KD Algorithms** — Built-in support for Vanilla KD and DSKD (Dual-Space Knowledge Distillation), with easy registration of custom algorithms.
- **Multiple Loss Functions** — Torch-compiled KL divergence, reverse KL divergence, JS divergence, adaptive KL (AKL), etc.
- **LoRA Support** — Optional LoRA fine-tuning for the student model.
- **W&B Integration** — Built-in Weights & Biases (wandb) logging for experiment tracking.
- **High Training Efficiency** — Achieves **1.4x to 6x** faster distillation than mainstream knowledge distillation frameworks.

---

#### 🚀 Quick Start

##### Installation

> Since SGLang 0.5.9 does not support transformers v5, please use transformers v4.57.1 to ensure correct teacher computation.

##### Off-Policy Knowledge Distillation

LLMs:

VLMs:

##### On-Policy Knowledge Distillation

LLMs:

VLMs:

##### Cross-Tokenizer Knowledge Distillation

Off-Policy — use `SimpleCrossTokenizerKD` (suggested):

or DSKD:

On-Policy — use `SimpleCrossTokenizerKD` (suggested):

or DSKD:

##### Supervised Fine-Tuning (SFT)

---

#### ⚙️ Configuration Reference

##### Model Arguments

| Argument | Default | Description |
|---|---|---|
| | | Student model name or path |
| | | Teacher model name or path |
| | | Attention implementation |
| | | Use Liger Kernel for student model |
| | | LoRA rank (0 = disabled) |
| | | LoRA alpha |
| | | Ring attention group size |
| | | Enable thinking mode |

##### Training Arguments

| Argument | Default | Description |
|---|---|---|
| | | Number of training nodes |
| | | GPUs per node |
| | | Number of training epochs |
| | | Global training batch size |
| | | Per-GPU micro batch size |
| | | Learning rate |
| | | LR scheduler type |
| | | Warmup ratio |
| | | Gradient clipping max norm |
| | | Training backend |
| | | Enable gradient checkpointing |
| | | Model save path |

##### Distillation Arguments

| Argument | Default | Description |
|---|---|---|
| | | KD loss weight: |
| | | Temperature for softmax in KD |
| | | KD algorithm ( / ) |
| | | Divergence function (e.g., / / / ) |
| | | Teacher tensor parallel size |
| | | Teacher expert parallel size |
| | | Teacher pipeline parallel size |
| | | Teacher data parallel size |
| | | Enable teacher sleep/wakeup for GPU sharing |
| | | Teacher forward N batches at once |
| | | SGLang static memory fraction for teacher |

##### Rollout Arguments (On-Policy)

| Argument | Default | Description |
|---|---|---|
| | | Number of SGLang rollout engines |
| | | Tensor parallel per rollout engine |
| | | Prompts per rollout iteration |
| | | Number of responses per prompt |
| | | Max generation length |
| | | Sampling temperature |
| | | Top-p sampling |
| | | Enable rollout sleep/wakeup |

##### Data Arguments

| Argument | Default | Description |
|---|---|---|
| | | Training dataset path |
| | | Max sequence length |
| | | Dataset input key |
| | | Apply tokenizer chat template |
| | | Pack sequences for efficiency |
| | | Dynamic batching by token count |

##### Logging Arguments

| Argument | Default | Description |
|---|---|---|
| | | Log every N steps |
| | | Enable W&B logging |
| | | W&B project name |
| | | W&B run name |

---

#### 🧩 Extending KDFlow

##### Adding a Custom KD Algorithm

Create a new file in and register it:

Then use it with .

##### Adding a Custom KD Loss

Create a new file in and register it:

Then use it with .

---

#### 🔑 Design Highlights

##### GPU Co-location via Sleep/Wakeup

KDFlow enables teacher and student to **share the same GPUs** through a sleep/wakeup mechanism:

- **Teacher phase**: Teacher model weights are loaded on GPU; student optimizer states are offloaded to CPU.
- **Student phase**: Student optimizer states are reloaded to GPU; teacher model weights are offloaded to CPU.

This allows running large teacher models (e.g., 200B+ parameters) on the same hardware as the student without requiring separate GPU pools.

##### Hidden States Transfer via Shared Memory

Instead of transferring full teacher logits (which can be enormous for large vocabularies), KDFlow:

- Extracts **hidden states** from the teacher's last layer via SGLang.
- Transfers them to the student via **shared memory** (zero-copy).
- Computes teacher logits **on the student side** using only the teacher's weights.

This dramatically reduces memory and communication overhead.

##### Token-Based Teacher Load Balancing

The uses a **greedy token-based load balancing** strategy to distribute micro-batches across teacher a…
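The hidden-states-transfer highlight can be sketched in a few lines: the teacher ships only last-layer hidden states (shape `[seq, hidden]` instead of `[seq, vocab]`), and the student process projects them through the teacher's output embedding to recover logits before computing the KD loss. Below is a minimal NumPy sketch of this idea; all names, shapes, and the forward-KL-with-temperature loss are illustrative assumptions, not KDFlow's actual API.

```python
# Illustrative sketch (NOT KDFlow's real API): recover teacher logits from
# transferred hidden states on the student side, then compute a KD loss.
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def kd_forward_kl(student_logits, teacher_hidden, teacher_lm_head, temperature=2.0):
    """Forward KL(teacher || student) computed from teacher *hidden states*.

    Instead of shipping [seq, vocab] teacher logits between processes, only
    [seq, hidden] states are transferred; the projection to the vocabulary
    happens here, on the student side, using the teacher's lm_head weights.
    """
    teacher_logits = teacher_hidden @ teacher_lm_head.T        # [seq, vocab]
    p = softmax(teacher_logits / temperature)                  # teacher probs
    log_p = np.log(p)
    log_q = np.log(softmax(student_logits / temperature))      # student log-probs
    # KL(p || q) summed over vocab, averaged over positions, scaled by T^2
    # (the standard knowledge-distillation temperature scaling).
    return float((p * (log_p - log_q)).sum(axis=-1).mean() * temperature**2)
```

With a hidden size of, say, 4096 and a 150k-token vocabulary, transferring hidden states instead of logits shrinks the payload by roughly the vocab/hidden ratio, which is the saving the design highlight above describes.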