Cross-Architecture Distillation for Diffusion Large Language Models
🌊 The first cross-architecture distillation framework for diffusion LLMs — distilling 8B dense and 16B MoE teachers into a 0.6B student 🌊
This organization hosts the distilled student checkpoints and pre-tokenized SFT datasets released with TIDE. The framework consists of three modular components — TIDAL (dual-axis interpolation), CompDemo (complementary mask-split teacher inference), and Reverse CALM (cross-tokenizer chunk-level matching) — and is evaluated across two heterogeneous distillation pipelines.
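As a rough intuition for the CompDemo component, a complementary mask-split can be pictured as partitioning a sequence's positions into two complementary boolean masks, so that two teacher passes jointly see every position masked exactly once. The sketch below is purely illustrative; the function name, the 50/50 split, and the seeding are assumptions, not the paper's algorithm:

```python
import random

def complementary_masks(length, seed=0):
    """Toy sketch: split positions into two complementary masks so that,
    across two teacher passes, each position is masked exactly once.
    (Illustrative assumption; the actual CompDemo split may differ.)"""
    rng = random.Random(seed)
    mask_a = [rng.random() < 0.5 for _ in range(length)]
    mask_b = [not m for m in mask_a]  # exact complement of mask_a
    return mask_a, mask_b

mask_a, mask_b = complementary_masks(8)
# Every position appears in exactly one of the two masks.
assert all(a != b for a, b in zip(mask_a, mask_b))
```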
Six 0.6B distilled student checkpoints are available (three per pipeline). Each is initialized from `dllm-hub/Qwen3-0.6B-diffusion-bd3lm-v0.1` and distilled from a larger dLLM teacher.
| Pipeline | Variant | Repo |
|---|---|---|
| A — Cross-Tokenizer (LLaDA2 teacher) | TIDE-Cross (native, paper-best) | distill-LLaDA2-TIDE_Cross |
| A — Cross-Tokenizer (LLaDA2 teacher) | TIDE-Shared variant | distill-LLaDA2-TIDE_Shared |
| A — Cross-Tokenizer (LLaDA2 teacher) | CALM baseline | distill-LLaDA2-CALM |
| B — Shared-Tokenizer (WeDLM teacher) | TIDE-Shared (native, paper-best) | distill-WeDLM-TIDE_Shared |
| B — Shared-Tokenizer (WeDLM teacher) | TIDE-Cross variant | distill-WeDLM-TIDE_Cross |
| B — Shared-Tokenizer (WeDLM teacher) | KL baseline | distill-WeDLM-KL |
Two pre-tokenized SFT mixtures (tulu-3-sft-mixture + smoltalk + opc-sft-stage1 + opc-sft-stage2) are provided, one per teacher tokenizer, so distillation jobs never re-tokenize at startup.
| Pipeline | Repo |
|---|---|
| A — for the LLaDA2 teacher | distill_llada2_sft |
| B — for the WeDLM teacher | distill_wedlm_sft |
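Because the examples are already stored as token ids, the training-time collator reduces to padding. A minimal sketch of the idea; `pad_batch`, the field names, and the pad id of 0 are illustrative assumptions, not the released data loader:

```python
def pad_batch(examples, pad_id=0):
    """Right-pad pre-tokenized id lists to the longest sequence in the
    batch. No tokenizer runs here, which is the point of shipping the
    SFT mixtures pre-tokenized."""
    max_len = max(len(ids) for ids in examples)
    return {
        "input_ids": [ids + [pad_id] * (max_len - len(ids)) for ids in examples],
        "attention_mask": [[1] * len(ids) + [0] * (max_len - len(ids)) for ids in examples],
    }

batch = pad_batch([[5, 6, 7], [8, 9]])
# batch["input_ids"] == [[5, 6, 7], [8, 9, 0]]
```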
```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

repo = "TIDE-dllm/distill-LLaDA2-TIDE_Cross"  # paper-best Pipeline-A checkpoint
device = "cuda" if torch.cuda.is_available() else "cpu"

model = AutoModelForMaskedLM.from_pretrained(
    repo, dtype=torch.bfloat16, trust_remote_code=True,
).to(device).eval()
tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
```
The same `generate()` routine published with `dllm-hub/Qwen3-0.6B-diffusion-bd3lm-v0.1` works on every TIDE checkpoint; just swap in the repo name.
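As a rough picture of what such a routine does, masked-diffusion decoding typically starts from an all-mask sequence and iteratively commits the highest-confidence predictions. The sketch below uses a hard-coded toy "model" purely to show the loop shape; the real `generate()` shipped with the base model differs in schedule, scoring, and batching:

```python
MASK = "<mask>"

def toy_model(tokens):
    """Hypothetical stand-in for a forward pass: return a (token,
    confidence) prediction for each still-masked position."""
    vocab = {0: ("hello", 0.9), 1: (",", 0.6), 2: ("world", 0.8), 3: ("!", 0.7)}
    return {i: vocab[i] for i, t in enumerate(tokens) if t == MASK}

def iterative_unmask(length, steps):
    """Start fully masked; each step, commit the single most confident
    prediction (a minimal confidence-based unmasking schedule)."""
    tokens = [MASK] * length
    for _ in range(steps):
        preds = toy_model(tokens)
        if not preds:
            break
        pos = max(preds, key=lambda i: preds[i][1])
        tokens[pos] = preds[pos][0]
    return tokens

print(iterative_unmask(4, 4))  # → ['hello', ',', 'world', '!']
```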
If you use these checkpoints or datasets, please cite:

```bibtex
@misc{zhang2026turningtidecrossarchitecturedistillation,
  title={Turning the TIDE: Cross-Architecture Distillation for Diffusion Large Language Models},
  author={Gongbo Zhang and Wen Wang and Ye Tian and Li Yuan},
  year={2026},
  eprint={2604.26951},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2604.26951},
}
```