
Turning the TIDE

Cross-Architecture Distillation for Diffusion Large Language Models

🌊 The first cross-architecture distillation framework for diffusion LLMs — distilling 8B dense and 16B MoE teachers into a 0.6B student 🌊



This organization hosts the distilled student checkpoints and pre-tokenized SFT datasets released with TIDE. The framework consists of three modular components — TIDAL (dual-axis interpolation), CompDemo (complementary mask-split teacher inference), and Reverse CALM (cross-tokenizer chunk-level matching) — and is evaluated across two heterogeneous distillation pipelines.

✨ Highlights

🤖 Released models

Six 0.6B distilled student checkpoints (3 per pipeline). Each is initialized from dllm-hub/Qwen3-0.6B-diffusion-bd3lm-v0.1 and distilled from a larger dLLM teacher.

| Pipeline | Variant | Repo |
| --- | --- | --- |
| A — Cross-Tokenizer (LLaDA2 teacher) | TIDE-Cross (native, paper-best) | distill-LLaDA2-TIDE_Cross |
| A — Cross-Tokenizer (LLaDA2 teacher) | TIDE-Shared variant | distill-LLaDA2-TIDE_Shared |
| A — Cross-Tokenizer (LLaDA2 teacher) | CALM baseline | distill-LLaDA2-CALM |
| B — Shared-Tokenizer (WeDLM teacher) | TIDE-Shared (native, paper-best) | distill-WeDLM-TIDE_Shared |
| B — Shared-Tokenizer (WeDLM teacher) | TIDE-Cross variant | distill-WeDLM-TIDE_Cross |
| B — Shared-Tokenizer (WeDLM teacher) | KL baseline | distill-WeDLM-KL |

📚 Released datasets

Pre-tokenized SFT mixtures (tulu-3-sft-mixture + smoltalk + opc-sft-stage1 + opc-sft-stage2) prepared for each teacher, so distillation jobs never re-tokenize at startup.

| Pipeline | Repo |
| --- | --- |
| A — for the LLaDA2 teacher | distill_llada2_sft |
| B — for the WeDLM teacher | distill_wedlm_sft |
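Because the rows ship pre-tokenized, a training job only needs to pad and batch them before feeding the distillation loop. A minimal sketch of such a collate step, assuming each example carries an `input_ids` list (the field names here are illustrative and may not match the released repos exactly):

```python
def pad_and_batch(examples, pad_id=0):
    """Pad pre-tokenized sequences to a common length and build attention masks.

    `examples` is a list of dicts, each with an "input_ids" list of token ids.
    No tokenizer is involved: the ids were produced once, offline.
    """
    max_len = max(len(ex["input_ids"]) for ex in examples)
    input_ids, attention_mask = [], []
    for ex in examples:
        ids = ex["input_ids"]
        pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * pad)          # right-pad to max_len
        attention_mask.append([1] * len(ids) + [0] * pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = pad_and_batch([
    {"input_ids": [5, 6, 7]},
    {"input_ids": [8, 9]},
])
```

Skipping startup tokenization matters here because the same SFT mixture is reused across many distillation runs per pipeline.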

🚀 Quick start

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

repo = "TIDE-dllm/distill-LLaDA2-TIDE_Cross"   # paper-best Pipeline-A checkpoint
device = "cuda" if torch.cuda.is_available() else "cpu"

model = AutoModelForMaskedLM.from_pretrained(
    repo, dtype=torch.bfloat16, trust_remote_code=True,
).to(device).eval()
tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
```

The same generate() routine published with dllm-hub/Qwen3-0.6B-diffusion-bd3lm-v0.1 works on every TIDE checkpoint — just swap the model name.
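For intuition only, the core loop behind diffusion-style decoding can be sketched without the real model: start from an all-mask sequence and, at each step, commit the positions the scorer is most confident about. This toy uses a dummy `toy_scores` function in place of a model forward pass and is not the official `generate()` routine:

```python
import random

MASK = -1  # placeholder mask-token id for this toy example

def toy_scores(seq):
    """Stand-in for a model forward pass: (token, confidence) per masked slot."""
    rng = random.Random(0)
    return {i: (rng.randrange(100), rng.random())
            for i, t in enumerate(seq) if t == MASK}

def diffusion_decode(length=8, steps=4):
    seq = [MASK] * length
    per_step = length // steps
    for _ in range(steps):
        scores = toy_scores(seq)
        # Commit the highest-confidence masked positions this step;
        # the rest stay masked and are re-scored next iteration.
        for pos in sorted(scores, key=lambda i: scores[i][1], reverse=True)[:per_step]:
            seq[pos] = scores[pos][0]
    return seq

out = diffusion_decode()
```

The real routine additionally handles block boundaries, temperature, and remasking schedules; this sketch only shows the confidence-ordered unmasking idea.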

📝 Citation

```bibtex
@misc{zhang2026turningtidecrossarchitecturedistillation,
      title={Turning the TIDE: Cross-Architecture Distillation for Diffusion Large Language Models},
      author={Gongbo Zhang and Wen Wang and Ye Tian and Li Yuan},
      year={2026},
      eprint={2604.26951},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2604.26951},
}
```