A PyTorch Lightning implementation of LLaDA (Large Language Diffusion with mAsking) - a novel approach that challenges the autoregressive paradigm in large language models.
After working extensively with various text diffusion models that failed to deliver satisfactory results, I implemented LLaDA as soon as the paper was released. It is the first text diffusion approach I have gotten to work effectively.
Unlike autoregressive models that generate token by token from left to right, LLaDA works very differently:
- Starts with completely masked text (all tokens are [MASK])
- Iteratively unmasks tokens via a reverse diffusion process
- Starts from masked tokens rather than the Gaussian noise used by traditional diffusion models
- Can generate in any direction thanks to bidirectional dependencies
It's like having hidden text and gradually revealing words until you have the complete text.
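As a rough sketch of that reveal loop (hypothetical helper, not this repo's actual API; which positions to reveal is chosen at random here, loosely mirroring a random remasking strategy):

```python
import torch

MASK_ID = 103  # DistilBERT's [MASK] token id

def reverse_diffusion(model, seq_len, n_steps):
    """Start fully masked and reveal a random subset of positions each step."""
    tokens = torch.full((seq_len,), MASK_ID, dtype=torch.long)
    for step in range(n_steps):
        logits = model(tokens)            # (seq_len, vocab_size), bidirectional
        pred = logits.argmax(dim=-1)      # model's guess for every position
        masked = (tokens == MASK_ID).nonzero().squeeze(-1)
        # reveal an even share of the remaining masked positions
        k = max(1, masked.numel() // (n_steps - step))
        reveal = masked[torch.randperm(masked.numel())[:k]]
        tokens[reveal] = pred[reveal]
    return tokens
```

By the last step the remaining masked positions are all revealed, so the loop always terminates with a full sequence.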
- ✅ Masked token diffusion instead of Gaussian noise
- ✅ Bidirectional generation - not limited to left-to-right
- ✅ Two sampling strategies: greedy (deterministic) and multinomial (probabilistic)
- ✅ PyTorch Lightning training - easy to use and scale
- ✅ Automatic multi-GPU support
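The two sampling strategies differ only in how a token is drawn from the model's per-position distribution. A minimal sketch (standalone functions for illustration, not the repo's class names):

```python
import torch

def greedy_sample(logits):
    """Deterministic: always take the highest-scoring token per position."""
    return logits.argmax(dim=-1)

def multinomial_sample(logits, temperature=1.0):
    """Probabilistic: draw each token from the softmax distribution,
    trading determinism for diversity."""
    probs = (logits / temperature).softmax(dim=-1)
    return torch.multinomial(probs, num_samples=1).squeeze(-1)
```

Greedy always yields the same output for the same input; multinomial can produce a different sequence on every call.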
Install:

```shell
git clone https://github.com/yourusername/diffusion-llm.git
cd diffusion-llm
uv sync
```

Train:

```shell
uv run train.py
```

Generate text from the command line:

```shell
# With greedy sampling (deterministic)
uv run inference.py --sampling greedy --n_tokens 50

# With multinomial sampling (more diverse)
uv run inference.py --sampling multinomial --n_tokens 50
```

Or from Python:

```python
from engine import LLADAEngine

# Load trained model
engine = LLADAEngine.load_from_checkpoint("weights/model.ckpt")

# Generate text
engine.generate(sampling="multinomial", n_tokens=50)
```

The model is built on top of DistilBERT and implements:
- MaskGenerator: Masks tokens during training
- RandomRemaskStrategy: Implements the reverse diffusion process
- Sampling strategies: GreedySampling and MultinomialSampling
- Custom loss: Optimization following equation 5 from the paper
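As a rough sketch of that loss (my reading of Eq. 5: cross-entropy on the masked positions only, weighted by 1/t where t is the sampled masking ratio; the function name, shapes, and per-token normalization are illustrative, not the repo's exact code):

```python
import torch
import torch.nn.functional as F

def llada_loss(logits, targets, mask, t):
    """logits: (B, L, V); targets: (B, L) original token ids;
    mask: (B, L) with 1.0 where the input was [MASK]; t: masking ratio in (0, 1]."""
    per_token = F.cross_entropy(
        logits.flatten(0, 1), targets.flatten(), reduction="none"
    ).view_as(mask)
    # cross-entropy over masked positions only, weighted by 1/t
    return (per_token * mask).sum() / (t * targets.numel())
```

The 1/t weight compensates for the fact that sequences with a small masking ratio contribute fewer supervised positions.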
LLaDA demonstrates that a language model does not need to generate left to right to be effective. After my struggles with other text diffusion approaches, this implementation shows that effective text diffusion is achievable.
The original paper shows that LLaDA 8B is competitive with LLaMA3 8B on many tasks, and even outperforms GPT-4o on reversal reasoning tasks (such as completing a poem given its ending).
MIT License - you can use this code freely.
Note: This implementation focuses on the generation component of LLaDA. For the complete system including supervised fine-tuning, consult the original paper.
