Commit 659b202

Docs/update (#311)

* updated docs
* updated docs
* updated docs

Co-authored-by: root <FrankLeeeee>
Co-authored-by: lukec <118525388+sleepcoo@users.noreply.github.com>

1 parent 2ab6da7 · commit 659b202

16 files changed: 329 additions & 339 deletions

docs/advanced_features/customization.md

Lines changed: 5 additions & 5 deletions
@@ -1,6 +1,6 @@
 # 💡 Customize Your Own Training
 
-### 🔧 Customize Training Args
+## 🔧 Customize Training Args
 
 ```bash
 torchrun \
@@ -23,7 +23,7 @@ If you wish to understand what each argument does, you can run `python scripts/t
 - `--chat-template`: This should be the chat template to use for the model, so please make sure you set it to the correct value.
 - `--cache-dir`: This directory contains the dataset cache including the `input_ids`, `loss_mask`, `attention_mask` and `vocab_mapping`. These caches can make your data loading much faster once a cache is generated. The cache file has a name which is obtained by hashing the dataset path to avoid cache collision.
 
-### 💬 Customize Chat Template
+## 💬 Customize Chat Template
 
 You can register a new chat template for your model by adding a new entry to the `TEMPLATE_REGISTRY` in the `specforge.data.template.py` file.
 
@@ -39,9 +39,9 @@ TEMPLATE_REGISTRY.register(
 )
 ```
 
-### 🪅 Customize Model
+## 🪅 Customize Model
 
-#### Customize Target Model
+### Customize Target Model
 
 If you wish to train Eagle3 for other models, you need to modify the `--target-model-path` value. We support loading these models directly from HuggingFace.
 
@@ -71,7 +71,7 @@ class AutoDistributedTargetModel(AutoModelForCausalLMBase):
 
 When `tp_size` is greater than 1, the script will automatically load the distributed version of the model for tensor parallelism.
 
-#### Customize Draft Model
+### Customize Draft Model
 
 If you want to change the draft model configuration, you can write your own configuration file and pass its path to the `--draft-model-config` argument. Or, if you do not provide the `--draft-model-config` argument, the script will automatically generate the draft model configuration based on the target model configuration. If you wish to serve your customized draft model with SGLang, make sure you implement the draft model in SGLang as well and the architecture name must match. To implement your own draft model, you can create a new class and inherit it from the `Eagle3DraftModel` class in the `specforge.modeling.draft.base.py` file.

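The registry-based chat template pattern referenced in the diff above can be sketched as follows. This is a minimal, self-contained illustration of the pattern only: the `ChatTemplate` fields and the registry API below are assumptions for demonstration, not SpecForge's actual schema.

```python
from dataclasses import dataclass


@dataclass
class ChatTemplate:
    """Illustrative template record; field names are assumed, not SpecForge's."""
    system_prompt: str
    user_header: str
    assistant_header: str
    end_of_turn_token: str


class Registry:
    """Minimal name -> template registry mirroring the TEMPLATE_REGISTRY idea."""

    def __init__(self):
        self._templates = {}

    def register(self, name, template):
        # Refuse duplicate registrations so two models cannot silently collide.
        if name in self._templates:
            raise ValueError(f"template {name!r} already registered")
        self._templates[name] = template

    def get(self, name):
        return self._templates[name]


TEMPLATE_REGISTRY = Registry()
TEMPLATE_REGISTRY.register(
    "my-model",
    ChatTemplate(
        system_prompt="You are a helpful assistant.",
        user_header="<|user|>",
        assistant_header="<|assistant|>",
        end_of_turn_token="<|eot|>",
    ),
)
```

A training script would then look the template up by the `--chat-template` name, e.g. `TEMPLATE_REGISTRY.get("my-model")`.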
docs/advanced_features/regenerate_dataset.md

Lines changed: 0 additions & 35 deletions
This file was deleted.

docs/basic_usage/benchmarking.md

Lines changed: 1 addition & 1 deletion
@@ -1,3 +1,3 @@
-## 🤖 Benchmarking On SGLang
+# 🤖 Benchmarking On SGLang
 
 Please refer to the [benchmarks](./benchmarks/README.md) document for more details.

docs/basic_usage/data_preparation.md

Lines changed: 45 additions & 48 deletions
@@ -1,31 +1,18 @@
-## 📝 Data Preparation
+# 📝 Data Preparation
 
-In this section, we will introduce how to prepare the dataset for both online and offline training. As mentioned in the [Overview](#-overview) section, online training only requires the raw dataset while offline training requires the hidden states generated from the target model. In the section below, we will introduce how to prepare both the raw dataset and the hidden states.
+## 📍 Overview
 
-### 🔄 Regenerate Train Dataset
+Data is an important aspect of speculative decoding, as the quality of the dataset directly affects the acceptance rate of the draft model. In this section, we will introduce how to prepare the dataset for both online and offline training.
 
-Many public datasets were not generated by your target model, which may lead to misalignment between the draft model's outputs and the target model's behavior, reducing acceptance rate and inference efficiency. To address this, we **recommend regenerating the dataset using the target model**, which better aligns the draft model with the target model's output distribution, improving acceptance length and overall performance.
+## ☁️ Pre-supported Datasets
 
-Run the following command to regenerate your dataset:
+We have provided a script to prepare some sample datasets out of the box. These datasets include:
+1. [ultrachat](https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k) (200k)
+2. [sharegpt](https://huggingface.co/datasets/Aeala/ShareGPT_Vicuna_unfiltered) (120k)
+3. [perfectblend](https://huggingface.co/datasets/mlabonne/open-perfectblend) (1.4M)
+4. and others (we continuously add support for more datasets)
 
-```bash
-python3 \
-    scripts/regenerate_data.py \
-    --model <target-model-path> \
-    --input-file-path <jsonl-file-path> \
-    --output-file-path <regenerated-jsonl-file-path> \
-    --batch-size 128 \
-    --tp-size 8 \
-    --num-samples 1000 \
-    --port 30000 \
-    --temperature 0 \
-    --mem-fraction-static 0.85 \
-    --auto-launch-server
-```
-
-### ☁️ Prepare Online Training Dataset
-
-We have provided a script to prepare some sample datasets including ultrachat (200k) and sharegpt (120k) for demo purposes. You can easily process the dataset by running the following command. The jsonl files will be placed in the `cache/dataset/<dataset_name>` directory of the project path by default. These datasets will be processed into `jsonl` files, which are the raw dataset ready for online training!
+You can run the script below to prepare the corresponding dataset.
 
 ```bash
 # ultrachat
@@ -35,37 +22,45 @@ python scripts/prepare_data.py --dataset ultrachat
 python scripts/prepare_data.py --dataset sharegpt
 ```
 
-### 💾 Prepare Offline Training Dataset
+You can view the full list of pre-supported datasets using `python scripts/prepare_data.py --help`. The datasets are processed and saved as `jsonl` files in the `cache/dataset/<dataset_name>` directory of the project path by default.
 
-Compared to online data, offline data requires one more step for hidden states generation. Thus, before you delve into this section, make sure you have your `jsonl` files ready as mentioned in the [Prepare Online Training Dataset](#-prepare-online-training-dataset) section. Once you have the `jsonl` files, you can start the hidden states generation.
 
-You can run the following command to obtain the hidden states.
+## ↩️ Regenerate Datasets
 
-```bash
-torchrun --nproc_per_node=8 \
-    scripts/prepare_hidden_states.py \
-    --model-path <target-model-path> \
-    --enable-aux-hidden-states \
-    --data-path <jsonl-file-path> \
-    --chat-template llama3 \
-    --max-length 2048 \
-    --tp-size 8 \
-    --batch-size 4 \
-    --mem-frac=0.75 \
-    --num-samples 1000
+When training speculative decoding draft models for a specific target model, instead of using the original dataset, we can regenerate the assistant responses using the target model to better align the draft model with the target model's output distribution. This improves the acceptance rate of the draft model and the overall performance of speculative decoding. According to the [EAGLE1 paper](https://arxiv.org/pdf/2401.15077), the EAGLE method is not very sensitive to dataset quality, which means the performance is still good even if you use the original dataset. However, if you are looking for optimal performance in a production environment, it is recommended to regenerate the dataset using the target model.
+
+Follow the steps below to regenerate the dataset. In the example, we use `meta-llama/Llama-3.1-8B-Instruct`; you can replace it with your own target model.
+
+1. Start the SGLang server for the target model.
+
+```shell
+python3 -m sglang.launch_server \
+    --model meta-llama/Llama-3.1-8B-Instruct \
+    --cuda-graph-bs 1 2 4 8 16 32 64 128 \
+    --dtype bfloat16 \
+    --mem-frac=0.8 \
+    --port 30000
+```
+
+2. Regenerate the dataset using the `regenerate_train_data.py` script.
+
+```shell
+python scripts/regenerate_train_data.py \
+    --model meta-llama/Llama-3.1-8B-Instruct \
+    --concurrency 128 \
+    --max-tokens 98304 \
+    --server-address localhost:30000 \
+    --temperature 0.8 \
+    --input-file-path ./cache/dataset/sharegpt_train.jsonl \
+    --output-file-path ./cache/dataset/sharegpt_train_regen.jsonl
 ```
-> ⚠️ This extract may take 2 hours and about 5T Disk
 
-You need to specify the following arguments:
-- `--model-path`: this is the huggingface repo name or path to the target model.
-- `--data-path`: this is the actual output path from the previous `prepare_data.py` script.
-- `--chat-template`: this is the chat template to use for this model.
-- `--num-samples`: this specifies how many data samples to use for hidden states generation. By default it will use all the data from `data-path`.
+For maximum performance, we recommend scaling the number of GPUs to regenerate the dataset in data parallel mode. To do this, you can simply add more server addresses to the `--server-address` argument, e.g. `--server-address localhost:30000 localhost:30001 localhost:30002 localhost:30003`.
 
 
-### 🤩 Prepare your own dataset
+## 🤩 Prepare your own dataset
 
-Besides the provided ShareGPT/Ultrachat datasets, you can also prepare your own dataset. We support two formats:
+Besides the provided datasets, you can also prepare your own dataset. We support two formats:
 
 #### Option 1: Conversation Format
 
@@ -102,12 +97,14 @@ To use pre-formatted datasets, add the `--is-preformatted` flag to your training
 torchrun --standalone --nproc_per_node 8 \
     scripts/train_eagle3.py \
     --is-preformatted \
-    --chat-template qwen \
     --train-data-path ./your_preformatted_dataset.jsonl \
     # ... other arguments
 ```
 
-Once you have the `jsonl` file ready, you can go straight for online training or hidden states generation for offline training.
+Once you have the `jsonl` file ready, you can proceed with online training or generate hidden states for offline training. See the Training guide for more details.
+
+
+## ➕ Handling Multiple Datasets
 
 If you have multiple datasets, you can just merge them into one jsonl file. For example, you can do something like this:
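The data-parallel regeneration described above amounts to spreading requests over several servers. A minimal sketch of such a round-robin assignment looks like this; it is an illustration of the idea only, not the actual `regenerate_train_data.py` logic, and the addresses and sample count are made up.

```python
from collections import Counter
from itertools import cycle

# Hypothetical server pool, matching the multi-address example above.
servers = ["localhost:30000", "localhost:30001", "localhost:30002"]
picker = cycle(servers)

# Assign each dataset sample to a server in round-robin order
# (9 pretend samples here).
assignment = {sample_id: next(picker) for sample_id in range(9)}

# Count how many samples each server received.
counts = Counter(assignment.values())
```

With a round-robin policy, each server ends up with an equal share of the samples, so throughput scales roughly linearly with the number of servers.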

docs/basic_usage/training.md

Lines changed: 44 additions & 19 deletions
@@ -1,37 +1,62 @@
 ## 🚀 Training
 
-### 🏎️ Online Training
+## 📍 Overview
 
-We have provided a simple startup script to train the Eagle3 model for the Llama 3 and 4, Qwen3 models. You can run the following command to start the training.
+Existing speculative decoding methods such as EAGLE3 require training in the feature space, which means the draft model relies on the hidden states generated from the target model for autoregressive prediction. In SpecForge, we provide two orthogonal paths to cater to users' specific needs when training this kind of draft model. We name these two methods `Online` and `Offline`; by definition, they are easy to understand:
+
+- **`Online`**: the hidden states are generated on the fly during training.
+- **`Offline`**: the hidden states are generated beforehand, stored to disk, and loaded back to the GPU during training.
+
+Online training is suitable for users with limited disk space but sufficient GPUs, while offline training is suitable for users with sufficient disk space but limited GPUs.
+
+| Method | Target Model | Disk Space Requirement | GPU Requirement | One-liner rationale |
+| --- | --- | --- | --- | --- |
+| Online | Used during training | Small | More GPUs are needed if your target model is large | Generates auxiliary hidden states on the fly |
+| Offline | Only used during data preparation | Huge (e.g. ultrachat + sharegpt needs ~12 TB of storage) | As low as 1 GPU, since it only needs to accommodate the draft model | Prepares auxiliary hidden states beforehand, and only once |
+
+> **Why does disk matter?**
+> During Eagle3 training, the frozen target model first generates the hidden states for each token of the data sample. The hidden states are then fed to the draft model for training.
+> Offline mode stores these hidden states to the local disk, so a small disk can fill up fast.
+> Online mode generates these hidden states on the fly without storing them to disk, but needs to keep the target model resident in memory during training, trading GPU RAM for an almost-zero disk footprint.
+
+## 🏎️ Online Training
+
+We have provided training scripts for the EAGLE3 models in the `examples` directory. These scripts cover a wide range of models, from Llama to Qwen, small to large, and dense to MoE. Online training is often conducted in two steps; we will use ShareGPT and Llama3-8B-Instruct as an example.
+
+**Step 1: Prepare the dataset**
+
+```bash
+# prepare the dataset
+python scripts/prepare_data.py --dataset sharegpt
+```
+
+**Step 2: Start the training**
 
 ```bash
-# make sure you have sharegpt data prepared
 # train llama3-8B-instruct
 bash ./examples/run_llama3_eagle3_online.sh
+```
 
-# train llama4-scout
-bash ./examples/run_llama4_eagle3_online.sh
-
-# train Qwen3-30B-A3B
-# Qwen3-235B-A22B online training is also supported;
-bash ./examples/run_qwen3_moe_eagle3_online.sh
+## 💨 Offline Training
 
-# train Qwen3-8B
-bash ./examples/run_qwen3_dense_eagle3_online.sh
+The difference between online and offline training is that we need to generate the hidden states before training. We again use ShareGPT and Llama3-8B-Instruct as an example.
 
-# train Qwq-32B
-bash ./examples/run_qwq_eagle3_online.sh
-```
+**Step 1: Prepare the dataset**
 
-### 💨 Offline Training
+Same as above.
 
-We have provided a simple startup script to train the Eagle3 model for the Llama-3.1-8B-Instruct model in an offline manner. You can run the following command to start the training. Almost everything is the same as in online training, except that you don't need to configure anything about the target model. Instead, you need to pass `--train-hidden-states-path` to the script.
+**Step 2: Generate the hidden states and train**
 
 ```bash
-# make sure you have sharegpt data prepared
+# train llama3-8B-instruct in an offline manner
 bash ./examples/run_llama3_eagle3_offline.sh
 ```
 
-### 📈 Experiment Tracking
+It is important to note that the `run_llama3_eagle3_offline.sh` script consists of two steps:
+
+1. Generate the hidden states using the `prepare_hidden_states.py` script. This script generates the hidden states for the test and train datasets and saves them to disk.
+2. Train the model, supplying the `--train-hidden-states-path` argument so that the script loads the hidden states from disk during training.
+
+## 📈 Experiment Tracking
 
-This project supports logging training progress to Wandb, TensorBoard, and SwanLab. You can enable tracking by adding the --report-to argument to the command line in your shell script.
+This project supports logging training progress to Wandb, TensorBoard, and SwanLab. You can enable tracking by adding the `--report-to` argument to the command line in your shell script.

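The ~12 TB disk figure quoted for offline training in the diff above can be sanity-checked with a back-of-envelope calculation. Every number below (hidden size, vectors stored per token, sample count, average sequence length) is an assumption introduced for illustration, so treat this as an order-of-magnitude sketch only, not SpecForge's exact storage layout.

```python
# Back-of-envelope estimate of offline hidden-state storage.
# All quantities are illustrative assumptions.
hidden_size = 4096            # 8B-class target model
bytes_per_value = 2           # bfloat16
vectors_per_token = 4         # 3 auxiliary layers + the last hidden state (assumed)
num_samples = 320_000         # roughly ultrachat + sharegpt combined (assumed)
avg_tokens_per_sample = 1024  # assumed average sequence length

total_bytes = (num_samples * avg_tokens_per_sample
               * vectors_per_token * hidden_size * bytes_per_value)
terabytes = total_bytes / 1e12  # ~10.7 TB under these assumptions
```

Under these assumptions the estimate lands around 11 TB, the same order of magnitude as the 12 TB quoted in the table, which is why offline mode trades disk for GPU memory.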
docs/concepts/EAGLE3.md

Lines changed: 19 additions & 0 deletions
@@ -0,0 +1,19 @@
+# 🦅 EAGLE3
+
+## 📍 Overview
+
+In earlier speculative decoding practice, we usually choose a small language model from the same family as the draft model. For example, we can use `Llama-3.1-8B-Instruct` as the draft model and `Llama-3.1-70B-Instruct` as the target model. However, this approach is not always feasible because a suitable small language model may not be available. Thus, researchers have proposed training a separate small model as the speculator; this type of model usually uses the target model's hidden states or KV cache as input to predict the next few tokens.
+
+Among this type of model, EAGLE3 is the state of the art and has been integrated into [SGLang](https://github.com/sgl-project/sglang). It relies on the hidden states of the target model and often consists of only one dense decoder layer. Before you read on, you can revisit the details of [speculative decoding](./speculative_decoding.md) first if you are not familiar with it.
+
+## 🔧 How it works
+
+<p align="center">
+  <img src="https://developer-blogs.nvidia.com/wp-content/uploads/2025/09/speculative-decoding-eagle-drafting-mechanism.gif" alt="EAGLE3"><br>
+  <span>Source: <a href="https://developer.nvidia.com/blog/an-introduction-to-speculative-decoding-for-reducing-latency-in-ai-inference/">Blog by NVIDIA</a></span>
+</p>
+
+The workflow of EAGLE3 is shown in the animation above. It differs from other speculative decoding methods in several ways:
+1. **`Feature-based Drafting`**: Unlike other speculative decoding methods, which directly feed tokens to the draft model to generate predictions, EAGLE3 operates in the feature space. It extracts hidden states from the target model at 3 layers of different depths and concatenates them to form a single feature vector. This feature vector is fed to the draft model to generate predictions.
+2. **`Training-time Test`**: During training, EAGLE3 simulates the autoregressive generation process by autoregressively generating the next few tokens. It then computes the loss between the predicted output sequence and the ground-truth sequence. This improves the draft model's performance because it reduces the generation errors accumulated from previous tokens, yielding a higher acceptance rate.
+3. **`Dynamic Draft Tree`**: EAGLE3 uses a dynamic draft tree to store candidate tokens, as proposed in [EAGLE2](https://arxiv.org/abs/2406.16858). In simple words, it only keeps the candidate tokens that are most likely to be accepted by the target model, to improve the acceptance rate.

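The feature-based drafting step (item 1 in the list above) can be illustrated with a small sketch. The shapes, the choice of shallow/middle/deep layers, and the concatenation layout are assumptions for demonstration, not SpecForge's actual implementation.

```python
import random

seq_len, hidden_size = 4, 8
random.seed(0)


def fake_hidden_states(seq_len, hidden_size):
    """Stand-in for one target-model layer's per-token hidden states."""
    return [[random.random() for _ in range(hidden_size)] for _ in range(seq_len)]


# Hidden states taken from 3 layers at different depths of the target model
# (shallow / middle / deep are illustrative labels).
low, mid, high = (fake_hidden_states(seq_len, hidden_size) for _ in range(3))

# Per token, concatenate the 3 vectors into one feature vector
# of size 3 * hidden_size, which the draft model then consumes.
features = [l + m + h for l, m, h in zip(low, mid, high)]
```

In the real model the concatenated vector is projected back down before the draft model's single decoder layer runs, but the key idea is that drafting operates on these fused features rather than on raw token embeddings alone.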