A PyTorch Lightning implementation of LLaDA (Large Language Diffusion with mAsking) - a novel approach that challenges the autoregressive paradigm in large language models.
After working extensively with various text diffusion models that failed to deliver satisfactory results, I implemented LLaDA as soon as the paper was released. It is the first text diffusion approach I have gotten to work effectively.
Unlike autoregressive models that generate token by token from left to right, LLaDA works very differently:
- Starts with completely masked text (all tokens are [MASK])
- Iteratively unmasks tokens via a reverse diffusion process
- Starts from masked tokens rather than the Gaussian noise used by traditional diffusion models
- Can generate in any direction thanks to bidirectional dependencies
It's like having hidden text and gradually revealing words until you have the complete text.
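As a rough sketch of that reveal loop (hypothetical helper, not this repo's actual API; which positions to reveal is chosen at random here, loosely mirroring a random remasking strategy):

```python
import torch

MASK_ID = 103  # DistilBERT's [MASK] token id

def reverse_diffusion(model, seq_len, n_steps):
    """Start fully masked and reveal a random subset of positions each step."""
    tokens = torch.full((seq_len,), MASK_ID, dtype=torch.long)
    for step in range(n_steps):
        logits = model(tokens)            # (seq_len, vocab_size), bidirectional
        pred = logits.argmax(dim=-1)      # model's guess for every position
        masked = (tokens == MASK_ID).nonzero().squeeze(-1)
        # reveal an even share of the remaining masked positions
        k = max(1, masked.numel() // (n_steps - step))
        reveal = masked[torch.randperm(masked.numel())[:k]]
        tokens[reveal] = pred[reveal]
    return tokens
```

By the last step the remaining masked positions are all revealed, so the loop always terminates with a full sequence.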
- ✅ Masked token diffusion instead of Gaussian noise
- ✅ Bidirectional generation - not limited to left-to-right
- ✅ Two sampling strategies: greedy (deterministic) and multinomial (probabilistic)
- ✅ PyTorch Lightning training - easy to use and scale
- ✅ Automatic multi-GPU support
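The two sampling strategies differ only in how a token is drawn from the model's per-position distribution. A minimal sketch (standalone functions for illustration, not the repo's class names):

```python
import torch

def greedy_sample(logits):
    """Deterministic: always take the highest-scoring token per position."""
    return logits.argmax(dim=-1)

def multinomial_sample(logits, temperature=1.0):
    """Probabilistic: draw each token from the softmax distribution,
    trading determinism for diversity."""
    probs = (logits / temperature).softmax(dim=-1)
    return torch.multinomial(probs, num_samples=1).squeeze(-1)
```

Greedy always yields the same output for the same input; multinomial can produce a different sequence on every call.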
Install:

```shell
git clone https://github.com/yourusername/diffusion-llm.git
cd diffusion-llm
uv sync
```

Train:

```shell
uv run train.py
```

Generate text from the command line:

```shell
# With greedy sampling (deterministic)
uv run inference.py --sampling greedy --n_tokens 50

# With multinomial sampling (more diverse)
uv run inference.py --sampling multinomial --n_tokens 50
```

Or from Python:

```python
from engine import LLADAEngine

# Load trained model
engine = LLADAEngine.load_from_checkpoint("weights/model.ckpt")

# Generate text
engine.generate(sampling="multinomial", n_tokens=50)
```

The model is built on top of DistilBERT and implements:
- MaskGenerator: Masks tokens during training
- RandomRemaskStrategy: Implements the reverse diffusion process
- Sampling strategies: GreedySampling and MultinomialSampling
- Custom loss: Optimization following equation 5 from the paper
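As a rough sketch of that loss (my reading of Eq. 5: cross-entropy on the masked positions only, weighted by 1/t where t is the sampled masking ratio; the function name, shapes, and per-token normalization are illustrative, not the repo's exact code):

```python
import torch
import torch.nn.functional as F

def llada_loss(logits, targets, mask, t):
    """logits: (B, L, V); targets: (B, L) original token ids;
    mask: (B, L) with 1.0 where the input was [MASK]; t: masking ratio in (0, 1]."""
    per_token = F.cross_entropy(
        logits.flatten(0, 1), targets.flatten(), reduction="none"
    ).view_as(mask)
    # cross-entropy over masked positions only, weighted by 1/t
    return (per_token * mask).sum() / (t * targets.numel())
```

The 1/t weight compensates for the fact that sequences with a small masking ratio contribute fewer supervised positions.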
LLaDA demonstrates that a language model does not need to generate left to right to be effective. After my struggles with other text diffusion approaches, this implementation shows that effective text diffusion is achievable.
The original paper shows that LLaDA 8B is competitive with LLaMA3 8B on many tasks, and even outperforms GPT-4o on reversal reasoning tasks (such as completing a poem given its ending).
MIT License - you can use this code freely.
Note: This implementation focuses on the generation component of LLaDA. For the complete system including supervised fine-tuning, consult the original paper.
