LLaDA: Large Language Diffusion with mAsking

This project is an implementation of a character-level diffusion model (LLaDA) for text generation, based on the principles outlined in the take-home exercise. It also includes a standard autoregressive Transformer model as a baseline for comparison.
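
To make the masking idea concrete, the sketch below shows the forward (noising) step commonly used by masked diffusion language models: a masking ratio `t` is drawn uniformly at random, and each token is independently replaced by a mask token with probability `t`. The model is then trained to reconstruct the masked positions. The function name `mask_tokens`, the `MASK_ID` value, and the use of plain Python lists are assumptions for illustration only; the actual code in `model.py` and `train.py` may differ.

```python
import random

MASK_ID = -1  # hypothetical id reserved for the mask token


def mask_tokens(token_ids, t, rng=random):
    """Replace each token with MASK_ID independently with probability t.

    Returns the noised sequence and a parallel list of booleans marking
    which positions were masked (the positions a model would be trained
    to reconstruct).
    """
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must be in [0, 1]")
    noised, is_masked = [], []
    for tok in token_ids:
        masked = rng.random() < t  # rng.random() is in [0, 1)
        noised.append(MASK_ID if masked else tok)
        is_masked.append(masked)
    return noised, is_masked


# Example: noise one character-level sequence with a random masking ratio.
rng = random.Random(0)
ids = [ord(c) for c in "O Romeo"]
t = rng.uniform(0.0, 1.0)
noised, is_masked = mask_tokens(ids, t, rng)
print(noised, is_masked)
```

With `t = 1.0` every position is masked and with `t = 0.0` the sequence is returned unchanged, which makes the two extremes easy to check in a unit test.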

The project has been refactored from a monolithic script into a modular, testable structure. It supports training and inference from the command line and tracks experiments with Weights & Biases.

The research paper

Project Structure

/
├── configs/            # Centralized configuration files
├── data/               # Raw data files (e.g., tinyshakespeare.txt)
├── outputs/            # Saved models (.pth) and plots (.png)
├── tests/              # Unit tests for the project
├── .gitignore
├── data_utils.py       # Tokenizer and PyTorch Dataset classes
├── model.py            # Model architectures (LLaDA and Autoregressive)
├── train.py            # Training script with wandb integration
├── generate.py         # Inference script for text generation
├── main.py             # Main entry point for the CLI
├── README.md           # This file
├── requirements.txt    # Python dependencies
└── EVALUATION_REPORT.md # Analysis of the model performance
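
As an illustration of what a character-level tokenizer such as the one in `data_utils.py` might look like, here is a minimal sketch. The class name `CharTokenizer` and its methods are hypothetical and are not taken from the repository; the sketch only shows the general idea of mapping each distinct character to an integer id and back.

```python
class CharTokenizer:
    """Maps each distinct character in a corpus to an integer id."""

    def __init__(self, text):
        # Sort the characters so that ids are deterministic across runs.
        self.chars = sorted(set(text))
        self.stoi = {ch: i for i, ch in enumerate(self.chars)}
        self.itos = {i: ch for ch, i in self.stoi.items()}

    @property
    def vocab_size(self):
        return len(self.chars)

    def encode(self, s):
        # Raises KeyError for characters that were not in the corpus.
        return [self.stoi[ch] for ch in s]

    def decode(self, ids):
        return "".join(self.itos[i] for i in ids)


tokenizer = CharTokenizer("O Romeo, Romeo!")
ids = tokenizer.encode("Romeo")
print(ids, tokenizer.decode(ids))
```

A diffusion model would additionally need one id reserved for the mask token; how the repository handles that is not shown here.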

Setup

Clone the repository:

git clone <repository_url>
cd <repository_directory>

Create a virtual environment:

python3 -m venv .venv
source .venv/bin/activate

Install dependencies:
```
pip install -r requirements.txt
```
(Optional but recommended) Log in to Weights & Biases: to enable experiment tracking, log in to your W&B account. You will be prompted for your API key.
```
wandb login
```

Usage

The project is controlled via the main.py script with command-line arguments.
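
The flags used throughout this section (`--mode`, `--model_type`, `--prompt`) could be parsed with `argparse` roughly as follows. This is a sketch under assumptions, not the contents of `main.py`: the function names `build_parser` and `main`, the default for `--prompt`, and the dispatch strings are hypothetical.

```python
import argparse


def build_parser():
    """Build a parser for the flags shown in this README."""
    parser = argparse.ArgumentParser(
        description="Train or generate with the LLaDA or autoregressive model."
    )
    parser.add_argument("--mode", choices=["train", "generate"], required=True)
    parser.add_argument(
        "--model_type", choices=["llada", "autoregressive"], required=True
    )
    # Assumed default: an empty prompt when none is given.
    parser.add_argument("--prompt", type=str, default="",
                        help="Seed text for generation.")
    return parser


def main(argv=None):
    args = build_parser().parse_args(argv)
    # Hypothetical dispatch: a real entry point would call into
    # train.py or generate.py here.
    if args.mode == "train":
        return f"train:{args.model_type}"
    return f"generate:{args.model_type}:{args.prompt}"


print(main(["--mode", "generate", "--model_type", "llada",
            "--prompt", "O Romeo, Romeo!"]))
```

Restricting `--mode` and `--model_type` with `choices` makes `argparse` reject unknown values with a usage message instead of failing later inside the training or generation code.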

Training

To train a model, use the --mode train argument and specify the model type. The script will download the dataset, train the model, save the best version to outputs/models/, and log the experiment to Weights & Biases.

Train the LLaDA model:

python3 main.py --mode train --model_type llada

Train the Autoregressive model:

python3 main.py --mode train --model_type autoregressive

Generation

To generate text with a trained model, use the --mode generate argument. The script will automatically load the best saved model weights.

Generate with the LLaDA model:

python3 main.py --mode generate --model_type llada --prompt "O Romeo, Romeo!"

Generate with the Autoregressive model:

python3 main.py --mode generate --model_type autoregressive --prompt "O Romeo, Romeo!"

Running Tests

To ensure all components are working correctly, run the unit tests:

python3 -m unittest discover tests
