Cross-Architecture Distillation for Diffusion Large Language Models
🌊 The first cross-architecture distillation framework for diffusion LLMs — distilling 8B dense and 16B MoE teachers into a 0.6B student 🌊
This organization hosts the distilled student checkpoints and pre-tokenized SFT datasets released with TIDE. The framework consists of three modular components — TIDAL (dual-axis interpolation), CompDemo (complementary mask-split teacher inference), and Reverse CALM (cross-tokenizer chunk-level matching) — and is evaluated across two heterogeneous distillation pipelines.
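As a rough intuition for the CompDemo component, a complementary mask-split can be pictured as partitioning a sequence's positions into two complementary boolean masks, so that two teacher passes jointly see every position masked exactly once. The sketch below is purely illustrative; the function name, the 50/50 split, and the seeding are assumptions, not the paper's algorithm:

```python
import random

def complementary_masks(length, seed=0):
    """Toy sketch: split positions into two complementary masks so that,
    across two teacher passes, each position is masked exactly once.
    (Illustrative assumption; the actual CompDemo split may differ.)"""
    rng = random.Random(seed)
    mask_a = [rng.random() < 0.5 for _ in range(length)]
    mask_b = [not m for m in mask_a]  # exact complement of mask_a
    return mask_a, mask_b

mask_a, mask_b = complementary_masks(8)
# Every position appears in exactly one of the two masks.
assert all(a != b for a, b in zip(mask_a, mask_b))
```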
Six 0.6B distilled student checkpoints are available (three per pipeline). Each is initialized from `dllm-hub/Qwen3-0.6B-diffusion-bd3lm-v0.1` and distilled from a larger dLLM teacher.
| Pipeline | Variant | Repo |
|---|---|---|
| A — Cross-Tokenizer (LLaDA2 teacher) | TIDE-Cross (native, paper-best) | distill-LLaDA2-TIDE_Cross |
| A — Cross-Tokenizer (LLaDA2 teacher) | TIDE-Shared variant | distill-LLaDA2-TIDE_Shared |
| A — Cross-Tokenizer (LLaDA2 teacher) | CALM baseline | distill-LLaDA2-CALM |
| B — Shared-Tokenizer (WeDLM teacher) | TIDE-Shared (native, paper-best) | distill-WeDLM-TIDE_Shared |
| B — Shared-Tokenizer (WeDLM teacher) | TIDE-Cross variant | distill-WeDLM-TIDE_Cross |
| B — Shared-Tokenizer (WeDLM teacher) | KL baseline | distill-WeDLM-KL |
Two pre-tokenized SFT mixtures (tulu-3-sft-mixture + smoltalk + opc-sft-stage1 + opc-sft-stage2) are provided, one per teacher tokenizer, so distillation jobs never re-tokenize at startup.
| Pipeline | Repo |
|---|---|
| A — for the LLaDA2 teacher | distill_llada2_sft |
| B — for the WeDLM teacher | distill_wedlm_sft |
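Because the examples are already stored as token ids, the training-time collator reduces to padding. A minimal sketch of the idea; `pad_batch`, the field names, and the pad id of 0 are illustrative assumptions, not the released data loader:

```python
def pad_batch(examples, pad_id=0):
    """Right-pad pre-tokenized id lists to the longest sequence in the
    batch. No tokenizer runs here, which is the point of shipping the
    SFT mixtures pre-tokenized."""
    max_len = max(len(ids) for ids in examples)
    return {
        "input_ids": [ids + [pad_id] * (max_len - len(ids)) for ids in examples],
        "attention_mask": [[1] * len(ids) + [0] * (max_len - len(ids)) for ids in examples],
    }

batch = pad_batch([[5, 6, 7], [8, 9]])
# batch["input_ids"] == [[5, 6, 7], [8, 9, 0]]
```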
```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

repo = "TIDE-dllm/distill-LLaDA2-TIDE_Cross"  # paper-best Pipeline-A checkpoint
device = "cuda" if torch.cuda.is_available() else "cpu"

model = AutoModelForMaskedLM.from_pretrained(
    repo, dtype=torch.bfloat16, trust_remote_code=True,
).to(device).eval()
tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
```
The same `generate()` routine published with `dllm-hub/Qwen3-0.6B-diffusion-bd3lm-v0.1` works on every TIDE checkpoint; just swap in the repo name.
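As a rough picture of what such a routine does, masked-diffusion decoding typically starts from an all-mask sequence and iteratively commits the highest-confidence predictions. The sketch below uses a hard-coded toy "model" purely to show the loop shape; the real `generate()` shipped with the base model differs in schedule, scoring, and batching:

```python
MASK = "<mask>"

def toy_model(tokens):
    """Hypothetical stand-in for a forward pass: return a (token,
    confidence) prediction for each still-masked position."""
    vocab = {0: ("hello", 0.9), 1: (",", 0.6), 2: ("world", 0.8), 3: ("!", 0.7)}
    return {i: vocab[i] for i, t in enumerate(tokens) if t == MASK}

def iterative_unmask(length, steps):
    """Start fully masked; each step, commit the single most confident
    prediction (a minimal confidence-based unmasking schedule)."""
    tokens = [MASK] * length
    for _ in range(steps):
        preds = toy_model(tokens)
        if not preds:
            break
        pos = max(preds, key=lambda i: preds[i][1])
        tokens[pos] = preds[pos][0]
    return tokens

print(iterative_unmask(4, 4))  # → ['hello', ',', 'world', '!']
```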
If you use these checkpoints or datasets, please cite:

```bibtex
@misc{zhang2026turningtidecrossarchitecturedistillation,
  title={Turning the TIDE: Cross-Architecture Distillation for Diffusion Large Language Models},
  author={Gongbo Zhang and Wen Wang and Ye Tian and Li Yuan},
  year={2026},
  eprint={2604.26951},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2604.26951},
}
```