Commit 659b202

Docs/update (#311)

* updated docs
* updated docs
* updated docs

Co-authored-by: root <FrankLeeeee>
Co-authored-by: lukec <118525388+sleepcoo@users.noreply.github.com>

1 parent 2ab6da7 · commit 659b202

16 files changed: 329 additions & 339 deletions

docs/advanced_features/customization.md

Lines changed: 5 additions & 5 deletions
@@ -1,6 +1,6 @@
 # 💡 Customize Your Own Training
 
-### 🔧 Customize Training Args
+## 🔧 Customize Training Args
 
 ```bash
 torchrun \
@@ -23,7 +23,7 @@ If you wish to understand what each argument does, you can run `python scripts/t
 - `--chat-template`: This should be the chat template to use for the model, so please make sure you set it to the correct value.
 - `--cache-dir`: This directory contains the dataset cache including the `input_ids`, `loss_mask`, `attention_mask` and `vocab_mapping`. These caches can make your data loading much faster once a cache is generated. The cache file has a name which is obtained by hashing the dataset path to avoid cache collision.
 
-### 💬 Customize Chat Template
+## 💬 Customize Chat Template
 
 You can register a new chat template for your model by adding a new entry to the `TEMPLATE_REGISTRY` in the `specforge.data.template.py` file.
 
@@ -39,9 +39,9 @@ TEMPLATE_REGISTRY.register(
 )
 ```
 
-### 🪅 Customize Model
+## 🪅 Customize Model
 
-#### Customize Target Model
+### Customize Target Model
 
 If you wish to train Eagle3 for other models, you need to modify the `--target-model-path` value. We support loading these models directly from HuggingFace.
 
@@ -71,7 +71,7 @@ class AutoDistributedTargetModel(AutoModelForCausalLMBase):
 
 When `tp_size` is greater than 1, the script will automatically load the distributed version of the model for tensor parallelism.
 
-#### Customize Draft Model
+### Customize Draft Model
 
 If you want to change the draft model configuration, you can write your own configuration file and pass its path to the `--draft-model-config` argument. Or, if you do not provide the `--draft-model-config` argument, the script will automatically generate the draft model configuration based on the target model configuration. If you wish to serve your customized draft model with SGLang, make sure you implement the draft model in SGLang as well and the architecture name must match. To implement your own draft model, you can create a new class and inherit it from the `Eagle3DraftModel` class in the `specforge.modeling.draft.base.py` file.

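The registry-based chat template pattern referenced in the diff above can be sketched as follows. This is a minimal, self-contained illustration of the pattern only: the `ChatTemplate` fields and the registry API below are assumptions for demonstration, not SpecForge's actual schema.

```python
from dataclasses import dataclass


@dataclass
class ChatTemplate:
    """Illustrative template record; field names are assumed, not SpecForge's."""
    system_prompt: str
    user_header: str
    assistant_header: str
    end_of_turn_token: str


class Registry:
    """Minimal name -> template registry mirroring the TEMPLATE_REGISTRY idea."""

    def __init__(self):
        self._templates = {}

    def register(self, name, template):
        # Refuse duplicate registrations so two models cannot silently collide.
        if name in self._templates:
            raise ValueError(f"template {name!r} already registered")
        self._templates[name] = template

    def get(self, name):
        return self._templates[name]


TEMPLATE_REGISTRY = Registry()
TEMPLATE_REGISTRY.register(
    "my-model",
    ChatTemplate(
        system_prompt="You are a helpful assistant.",
        user_header="<|user|>",
        assistant_header="<|assistant|>",
        end_of_turn_token="<|eot|>",
    ),
)
```

A training script would then look the template up by the `--chat-template` name, e.g. `TEMPLATE_REGISTRY.get("my-model")`.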
docs/advanced_features/regenerate_dataset.md

Lines changed: 0 additions & 35 deletions
This file was deleted.

docs/basic_usage/benchmarking.md

Lines changed: 1 addition & 1 deletion
@@ -1,3 +1,3 @@
-## 🤖 Benchmarking On SGLang
+# 🤖 Benchmarking On SGLang
 
 Please refer to the [benchmarks](./benchmarks/README.md) document for more details.

docs/basic_usage/data_preparation.md

Lines changed: 45 additions & 48 deletions
@@ -1,31 +1,18 @@
-## 📝 Data Preparation
+# 📝 Data Preparation
 
-In this section, we will introduce how to prepare the dataset for both online and offline training. As mentioned in the [Overview](#-overview) section, online training only requires the raw dataset while offline training requires the hidden states generated from the target model. In the section below, we will introduce how to prepare both the raw dataset and the hidden states.
+## 📍 Overview
 
-### 🔄 Regenerate Train Dataset
+Data is an important aspect of speculative decoding, as the quality of the dataset directly affects the acceptance rate of the draft model. In this section, we will introduce how to prepare the dataset for both online and offline training.
 
-Many public datasets were not generated by your target model, which may lead to misalignment between the draft model's outputs and the target model's behavior, reducing acceptance rate and inference efficiency. To address this, we **recommend regenerating the dataset using the target model**, which better aligns the draft model with the target model's output distribution, improving acceptance length and overall performance.
+## ☁️ Pre-supported Datasets
 
-Run the following command to regenerate your dataset:
+We have provided a script to prepare some sample datasets out of the box. These datasets include:
+1. [ultrachat](https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k) (200k)
+2. [sharegpt](https://huggingface.co/datasets/Aeala/ShareGPT_Vicuna_unfiltered) (120k)
+3. [perfectblend](https://huggingface.co/datasets/mlabonne/open-perfectblend) (1.4M)
+4. and others (we continuously add support for more datasets)
 
-```bash
-python3 \
-    scripts/regenerate_data.py \
-    --model <target-model-path> \
-    --input-file-path <jsonl-file-path> \
-    --output-file-path <regenerated-jsonl-file-path> \
-    --batch-size 128 \
-    --tp-size 8 \
-    --num-samples 1000 \
-    --port 30000 \
-    --temperature 0 \
-    --mem-fraction-static 0.85 \
-    --auto-launch-server
-```
-
-### ☁️ Prepare Online Training Dataset
-
-We have provided a script to prepare some sample datasets including ultrachat (200k) and sharegpt (120k) for demo purposes. You can easily process the dataset by running the following command. The jsonl files will be placed in the `cache/dataset/<dataset_name>` directory of the project path by default. These datasets will be processed into `jsonl` files, which are the raw dataset ready for online training!
+You can run the script below to prepare the corresponding dataset.
 
 ```bash
 # ultrachat
@@ -35,37 +22,45 @@ python scripts/prepare_data.py --dataset ultrachat
 python scripts/prepare_data.py --dataset sharegpt
 ```
 
-### 💾 Prepare Offline Training Dataset
+You can view the full list of pre-supported datasets using `python scripts/prepare_data.py --help`. The datasets are processed and saved as `jsonl` files in the `cache/dataset/<dataset_name>` directory of the project path by default.
 
-Compared to online data, offline data requires one more step for hidden states generation. Thus, before you delve into this section, make sure you have your `jsonl` files ready as mentioned in the [Prepare Online Training Dataset](#-prepare-online-training-dataset) section. Once you have the `jsonl` files, you can start the hidden states generation.
 
-You can run the following command to obtain the hidden states.
+## ↩️ Regenerate Datasets
 
-```bash
-torchrun --nproc_per_node=8 \
-    scripts/prepare_hidden_states.py \
-    --model-path <target-model-path> \
-    --enable-aux-hidden-states \
-    --data-path <jsonl-file-path> \
-    --chat-template llama3 \
-    --max-length 2048 \
-    --tp-size 8 \
-    --batch-size 4 \
-    --mem-frac=0.75 \
-    --num-samples 1000
+When training speculative decoding draft models for a specific target model, instead of using the original dataset, we can regenerate the assistant responses using the target model to better align the draft model with the target model's output distribution. This improves the acceptance rate of the draft model and the overall performance of speculative decoding. According to the [EAGLE1 paper](https://arxiv.org/pdf/2401.15077), the EAGLE method is not very sensitive to dataset quality, which means the performance is still good even if you use the original dataset. However, if you are looking for optimal performance in a production environment, it is recommended to regenerate the dataset using the target model.
+
+Follow the steps below to regenerate the dataset. In the example, we use `meta-llama/Llama-3.1-8B-Instruct`; you can replace it with your own target model.
+
+1. Start the SGLang server for the target model.
+
+```shell
+python3 -m sglang.launch_server \
+    --model meta-llama/Llama-3.1-8B-Instruct \
+    --cuda-graph-bs 1 2 4 8 16 32 64 128 \
+    --dtype bfloat16 \
+    --mem-frac=0.8 \
+    --port 30000
+```
+
+2. Regenerate the dataset using the `regenerate_train_data.py` script.
+
+```shell
+python scripts/regenerate_train_data.py \
+    --model meta-llama/Llama-3.1-8B-Instruct \
+    --concurrency 128 \
+    --max-tokens 98304 \
+    --server-address localhost:30000 \
+    --temperature 0.8 \
+    --input-file-path ./cache/dataset/sharegpt_train.jsonl \
+    --output-file-path ./cache/dataset/sharegpt_train_regen.jsonl
 ```
-> ⚠️ This extract may take 2 hours and about 5T Disk
 
-You need to specify the following arguments:
-- `--model-path`: this is the huggingface repo name or path to the target model.
-- `--data-path`: this is the actual output path from the previous `prepare_data.py` script.
-- `--chat-template`: this is the chat template to use for this model.
-- `--num-samples`: this specifies how many data samples to use for hidden states generation. By default it will use all the data from `data-path`.
+For maximum performance, we recommend scaling the number of GPUs to regenerate the dataset in data parallel mode. To do this, you can simply add more server addresses to the `--server-address` argument, e.g. `--server-address localhost:30000 localhost:30001 localhost:30002 localhost:30003`.
 
 
-### 🤩 Prepare your own dataset
+## 🤩 Prepare your own dataset
 
-Besides the provided ShareGPT/Ultrachat datasets, you can also prepare your own dataset. We support two formats:
+Besides the provided datasets, you can also prepare your own dataset. We support two formats:
 
 #### Option 1: Conversation Format
 
@@ -102,12 +97,14 @@ To use pre-formatted datasets, add the `--is-preformatted` flag to your training
 torchrun --standalone --nproc_per_node 8 \
     scripts/train_eagle3.py \
     --is-preformatted \
-    --chat-template qwen \
     --train-data-path ./your_preformatted_dataset.jsonl \
     # ... other arguments
 ```
 
-Once you have the `jsonl` file ready, you can go straight for online training or hidden states generation for offline training.
+Once you have the `jsonl` file ready, you can proceed with online training or generate hidden states for offline training. See the Training guide for more details.
+
+
+## ➕ Handling Multiple Datasets
 
 If you have multiple datasets, you can just merge them into one jsonl file. For example, you can do something like this:
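The data-parallel regeneration described above amounts to spreading requests over several servers. A minimal sketch of such a round-robin assignment looks like this; it is an illustration of the idea only, not the actual `regenerate_train_data.py` logic, and the addresses and sample count are made up.

```python
from collections import Counter
from itertools import cycle

# Hypothetical server pool, matching the multi-address example above.
servers = ["localhost:30000", "localhost:30001", "localhost:30002"]
picker = cycle(servers)

# Assign each dataset sample to a server in round-robin order
# (9 pretend samples here).
assignment = {sample_id: next(picker) for sample_id in range(9)}

# Count how many samples each server received.
counts = Counter(assignment.values())
```

With a round-robin policy, each server ends up with an equal share of the samples, so throughput scales roughly linearly with the number of servers.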

docs/basic_usage/training.md

Lines changed: 44 additions & 19 deletions
@@ -1,37 +1,62 @@
 ## 🚀 Training
 
-### 🏎️ Online Training
+## 📍 Overview
 
-We have provided a simple startup script to train the Eagle3 model for the Llama 3 and 4, Qwen3 models. You can run the following command to start the training.
+Existing speculative decoding methods such as EAGLE3 require training in the feature space, which means the draft model relies on the hidden states generated from the target model for autoregressive prediction. In SpecForge, we provide two orthogonal paths to cater to users' specific needs when training this kind of draft model. We name these two methods `Online` and `Offline`; by definition, they are easy to understand:
+
+- **`Online`**: the hidden states are generated on the fly during training.
+- **`Offline`**: the hidden states are generated beforehand, stored to disk, and loaded back to the GPU during training.
+
+Online training is suitable for users with limited disk space but sufficient GPUs, while offline training is suitable for users with sufficient disk space but limited GPUs.
+
+| Method | Target Model | Disk Space Requirement | GPU Requirement | One-liner rationale |
+| --- | --- | --- | --- | --- |
+| Online | Used during training | Small | More GPUs are needed if your target model is large | Generates auxiliary hidden states on the fly |
+| Offline | Only used during data preparation | Huge (e.g. ultrachat + sharegpt needs ~12 TB of storage) | As low as 1 GPU, since it only needs to accommodate the draft model | Prepares auxiliary hidden states beforehand, and only once |
+
+> **Why does disk matter?**
+> During Eagle3 training, the frozen target model first generates the hidden states for each token of the data sample. The hidden states are then fed to the draft model for training.
+> Offline mode stores these hidden states to the local disk, so a small disk can fill up fast.
+> Online mode generates these hidden states on the fly without storing them to disk, but needs to keep the target model resident in memory during training, trading GPU RAM for an almost-zero disk footprint.
+
+## 🏎️ Online Training
+
+We have provided training scripts for the EAGLE3 models in the `examples` directory. These scripts cover a wide range of models, from Llama to Qwen, small to large, and dense to MoE. Online training is often conducted in two steps; we will use ShareGPT and Llama3-8B-Instruct as an example.
+
+**Step 1: Prepare the dataset**
+
+```bash
+# prepare the dataset
+python scripts/prepare_data.py --dataset sharegpt
+```
+
+**Step 2: Start the training**
 
 ```bash
-# make sure you have sharegpt data prepared
 # train llama3-8B-instruct
 bash ./examples/run_llama3_eagle3_online.sh
+```
 
-# train llama4-scout
-bash ./examples/run_llama4_eagle3_online.sh
-
-# train Qwen3-30B-A3B
-# Qwen3-235B-A22B online training is also supported;
-bash ./examples/run_qwen3_moe_eagle3_online.sh
+## 💨 Offline Training
 
-# train Qwen3-8B
-bash ./examples/run_qwen3_dense_eagle3_online.sh
+The difference between online and offline training is that we need to generate the hidden states before training. We again use ShareGPT and Llama3-8B-Instruct as an example.
 
-# train Qwq-32B
-bash ./examples/run_qwq_eagle3_online.sh
-```
+**Step 1: Prepare the dataset**
 
-### 💨 Offline Training
+Same as above.
 
-We have provided a simple startup script to train the Eagle3 model for the Llama-3.1-8B-Instruct model in an offline manner. You can run the following command to start the training. Almost everything is the same as in online training, except that you don't need to configure anything about the target model. Instead, you need to pass `--train-hidden-states-path` to the script.
+**Step 2: Generate the hidden states and train**
 
 ```bash
-# make sure you have sharegpt data prepared
+# train llama3-8B-instruct in an offline manner
 bash ./examples/run_llama3_eagle3_offline.sh
 ```
 
-### 📈 Experiment Tracking
+It is important to note that the `run_llama3_eagle3_offline.sh` script consists of two steps:
+
+1. Generate the hidden states using the `prepare_hidden_states.py` script. This script generates the hidden states for the test and train datasets and saves them to disk.
+2. Train the model, supplying the `--train-hidden-states-path` argument so that the script loads the hidden states from disk during training.
+
+## 📈 Experiment Tracking
 
-This project supports logging training progress to Wandb, TensorBoard, and SwanLab. You can enable tracking by adding the --report-to argument to the command line in your shell script.
+This project supports logging training progress to Wandb, TensorBoard, and SwanLab. You can enable tracking by adding the `--report-to` argument to the command line in your shell script.

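The ~12 TB disk figure quoted for offline training in the diff above can be sanity-checked with a back-of-envelope calculation. Every number below (hidden size, vectors stored per token, sample count, average sequence length) is an assumption introduced for illustration, so treat this as an order-of-magnitude sketch only, not SpecForge's exact storage layout.

```python
# Back-of-envelope estimate of offline hidden-state storage.
# All quantities are illustrative assumptions.
hidden_size = 4096            # 8B-class target model
bytes_per_value = 2           # bfloat16
vectors_per_token = 4         # 3 auxiliary layers + the last hidden state (assumed)
num_samples = 320_000         # roughly ultrachat + sharegpt combined (assumed)
avg_tokens_per_sample = 1024  # assumed average sequence length

total_bytes = (num_samples * avg_tokens_per_sample
               * vectors_per_token * hidden_size * bytes_per_value)
terabytes = total_bytes / 1e12  # ~10.7 TB under these assumptions
```

Under these assumptions the estimate lands around 11 TB, the same order of magnitude as the 12 TB quoted in the table, which is why offline mode trades disk for GPU memory.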
docs/concepts/EAGLE3.md

Lines changed: 19 additions & 0 deletions
@@ -0,0 +1,19 @@
+# 🦅 EAGLE3
+
+## 📍 Overview
+
+In earlier speculative decoding practice, we usually choose a small language model from the same family as the draft model. For example, we can use `Llama-3.1-8B-Instruct` as the draft model and `Llama-3.1-70B-Instruct` as the target model. However, this approach is not always feasible because a suitable small language model may not be available. Thus, researchers have proposed training a separate small model as the speculator; this type of model usually uses the target model's hidden states or KV cache as input to predict the next few tokens.
+
+Among this type of model, EAGLE3 is the state of the art and has been integrated into [SGLang](https://github.com/sgl-project/sglang). It relies on the hidden states of the target model and often consists of only one dense decoder layer. Before you read on, you can revisit the details of [speculative decoding](./speculative_decoding.md) first if you are not familiar with it.
+
+## 🔧 How it works
+
+<p align="center">
+  <img src="https://developer-blogs.nvidia.com/wp-content/uploads/2025/09/speculative-decoding-eagle-drafting-mechanism.gif" alt="EAGLE3"><br>
+  <span>Source: <a href="https://developer.nvidia.com/blog/an-introduction-to-speculative-decoding-for-reducing-latency-in-ai-inference/">Blog by NVIDIA</a></span>
+</p>
+
+The workflow of EAGLE3 is shown in the animation above. It differs from other speculative decoding methods in several ways:
+1. **`Feature-based Drafting`**: Unlike other speculative decoding methods, which directly feed tokens to the draft model to generate predictions, EAGLE3 operates in the feature space. It extracts hidden states from the target model at 3 layers of different depths and concatenates them to form a single feature vector. This feature vector is fed to the draft model to generate predictions.
+2. **`Training-time Test`**: During training, EAGLE3 simulates the autoregressive generation process by autoregressively generating the next few tokens. It then computes the loss between the predicted output sequence and the ground-truth sequence. This improves the draft model's performance because it reduces the generation errors accumulated from previous tokens, yielding a higher acceptance rate.
+3. **`Dynamic Draft Tree`**: EAGLE3 uses a dynamic draft tree to store candidate tokens, as proposed in [EAGLE2](https://arxiv.org/abs/2406.16858). In simple words, it only keeps the candidate tokens that are most likely to be accepted by the target model, to improve the acceptance rate.

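The feature-based drafting step (item 1 in the list above) can be illustrated with a small sketch. The shapes, the choice of shallow/middle/deep layers, and the concatenation layout are assumptions for demonstration, not SpecForge's actual implementation.

```python
import random

seq_len, hidden_size = 4, 8
random.seed(0)


def fake_hidden_states(seq_len, hidden_size):
    """Stand-in for one target-model layer's per-token hidden states."""
    return [[random.random() for _ in range(hidden_size)] for _ in range(seq_len)]


# Hidden states taken from 3 layers at different depths of the target model
# (shallow / middle / deep are illustrative labels).
low, mid, high = (fake_hidden_states(seq_len, hidden_size) for _ in range(3))

# Per token, concatenate the 3 vectors into one feature vector
# of size 3 * hidden_size, which the draft model then consumes.
features = [l + m + h for l, m, h in zip(low, mid, high)]
```

In the real model the concatenated vector is projected back down before the draft model's single decoder layer runs, but the key idea is that drafting operates on these fused features rather than on raw token embeddings alone.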