| title | PD Disaggregation |
|---|
Prefill-Decode (PD) disaggregation separates the two phases of LLM inference — prompt processing (prefill) and token generation (decode) — onto specialized workers, so that Time to First Token (TTFT) and decode throughput can be optimized independently.
Prerequisites:

- Completed the Getting Started guide
- At least one prefill worker and one decode worker
- For vLLM PD: workers started with the gRPC entrypoint and a KV transfer backend
| Phase | Compute Pattern | Bottleneck |
|---|---|---|
| Prefill | Compute-bound, parallel | GPU compute |
| Decode | Memory-bound, sequential | Memory bandwidth |
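The bottleneck column follows from a back-of-envelope arithmetic-intensity estimate. The sketch below uses illustrative numbers (a 70B-parameter FP16 model, a 2048-token prompt), not measurements from this guide:

```python
# Back-of-envelope arithmetic intensity (FLOPs per byte of weights read)
# for a hypothetical 70B-parameter FP16 model. Illustrative numbers only.

params = 70e9            # model parameters (assumed)
bytes_per_param = 2      # FP16 weights
weight_bytes = params * bytes_per_param

def arithmetic_intensity(tokens_per_pass: int) -> float:
    """FLOPs per byte of weights when `tokens_per_pass` tokens share one
    weight read. A forward pass costs roughly 2 FLOPs per parameter per token."""
    flops = 2 * params * tokens_per_pass
    return flops / weight_bytes

# Prefill processes the whole prompt in one pass: each weight read is
# amortized over many tokens, so the phase is compute-bound.
prefill = arithmetic_intensity(tokens_per_pass=2048)

# Decode generates one token per pass: every step re-reads all weights
# for a single token, so the phase is memory-bandwidth-bound.
decode = arithmetic_intensity(tokens_per_pass=1)

print(f"prefill: {prefill:.0f} FLOPs/byte, decode: {decode:.0f} FLOPs/byte")
```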
Running both on the same worker creates contention — prefill batches wait for decode slots, and decode batches stay small due to memory pressure. Dedicating workers to each phase removes this conflict.
For SGLang PD, SMG sends the request to both prefill and decode workers simultaneously, and the workers coordinate the KV cache transfer through a bootstrap mechanism.
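A minimal sketch of this concurrent dispatch, with stub workers standing in for real SGLang endpoints (the worker internals, the `bootstrap_room` field name, and the handshake are illustrative assumptions, not SMG's actual implementation):

```python
import asyncio

# Sketch: the router sends the same request to the prefill and decode
# workers at the same time, tagged with a shared identifier so the two
# sides can rendezvous for the KV cache transfer.

async def prefill_worker(request: dict) -> str:
    # Real worker: run prefill, then push the KV cache to its bootstrap peer.
    return f"prefill done (room {request['bootstrap_room']})"

async def decode_worker(request: dict) -> str:
    # Real worker: wait for the KV cache from prefill, then stream tokens.
    return f"decode done (room {request['bootstrap_room']})"

async def dispatch(request: dict) -> list[str]:
    # Tag the request with a shared room id, then fan out to both workers.
    tagged = {**request, "bootstrap_room": 42}
    return await asyncio.gather(prefill_worker(tagged), decode_worker(tagged))

prefill_msg, decode_msg = asyncio.run(dispatch({"prompt": "Hello!"}))
print(prefill_msg, "|", decode_msg)
```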
```bash
# Prefill worker
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-70B-Instruct \
  --port 8000 \
  --prefill-only

# Decode worker
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-70B-Instruct \
  --port 8001 \
  --decode-only
```

Each prefill worker needs a bootstrap port for coordination:
```bash
smg \
  --pd-disaggregation \
  --prefill http://prefill:8000 9001 \
  --decode http://decode:8001 \
  --host 0.0.0.0 \
  --port 30000
```

With multiple workers per phase, you can also set a routing policy for each phase:

```bash
smg \
  --pd-disaggregation \
  --prefill http://prefill1:8000 9001 \
  --prefill http://prefill2:8000 9002 \
  --decode http://decode1:8001 \
  --decode http://decode2:8001 \
  --prefill-policy cache_aware \
  --decode-policy power_of_two
```

For vLLM PD, SMG sends the request to prefill first with `max_tokens=1`, then sends the original request to decode. The KV backend (NIXL or Mooncake) transfers the cache transparently.
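That two-step flow can be sketched as follows. The worker functions and dict shapes are illustrative stand-ins, and the real cache handoff happens inside the NIXL/Mooncake connector, not in the router:

```python
# Sketch of vLLM-style PD dispatch: the router first sends the request to
# the prefill worker with max_tokens=1 so it computes the KV cache (and at
# most one token), then forwards the original request to the decode worker,
# which reuses the transferred cache. Workers are stubbed out.

def prefill_worker(request: dict) -> dict:
    # Real worker: run prefill; the kv_producer connector exports the cache.
    assert request["max_tokens"] == 1
    return {"kv_ready": True, "prompt": request["prompt"]}

def decode_worker(request: dict, kv: dict) -> str:
    # Real worker: the kv_consumer connector imports the cache, then decodes.
    assert kv["kv_ready"]
    return f"generated up to {request['max_tokens']} tokens"

def dispatch(request: dict) -> str:
    # Step 1: prefill, with max_tokens clamped to 1.
    kv = prefill_worker({**request, "max_tokens": 1})
    # Step 2: the original request goes to decode.
    return decode_worker(request, kv)

print(dispatch({"prompt": "Hello!", "max_tokens": 128}))
```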
With the NIXL backend (the default):

```bash
# Prefill worker
VLLM_NIXL_SIDE_CHANNEL_PORT=5600 \
python -m vllm.entrypoints.grpc_server \
  --model /path/to/model \
  --port 50051 \
  --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_producer"}'

# Decode worker
VLLM_NIXL_SIDE_CHANNEL_PORT=5601 \
python -m vllm.entrypoints.grpc_server \
  --model /path/to/model \
  --port 50052 \
  --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_consumer"}'
```

vLLM workers use `grpc://` URLs and require `--model-path` for tokenizer loading:
```bash
smg \
  --pd-disaggregation \
  --prefill grpc://prefill:50051 \
  --decode grpc://decode:50052 \
  --model-path /path/to/model \
  --host 0.0.0.0 \
  --port 30000
```

Mooncake supports TCP transport (no RDMA required). Each prefill worker needs a unique bootstrap port:
```bash
# Prefill worker
VLLM_MOONCAKE_BOOTSTRAP_PORT=8998 \
python -m vllm.entrypoints.grpc_server \
  --model /path/to/model \
  --port 50051 \
  --kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_producer"}'

# Decode worker
python -m vllm.entrypoints.grpc_server \
  --model /path/to/model \
  --port 50052 \
  --kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_consumer"}'
```

Pass the bootstrap port after the prefill URL:

```bash
smg \
  --pd-disaggregation \
  --prefill grpc://prefill:50051 8998 \
  --decode grpc://decode:50052 \
  --model-path /path/to/model
```

Use the provided script to launch workers with either backend:
```bash
# NIXL (default)
./scripts/launch-pd-workers.sh vllm /path/to/model

# Mooncake
KV_BACKEND=mooncake ./scripts/launch-pd-workers.sh vllm /path/to/model
```

Verify the setup:

```bash
# Check workers and their roles
curl http://localhost:30000/workers | jq

# Send a request
curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-70B-Instruct",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```

| | SGLang PD | vLLM PD |
|---|---|---|
| Protocol | HTTP | gRPC |
| Dispatch | Both workers receive request simultaneously | Prefill first, then decode |
| KV Transfer | Bootstrap-based coordination | NIXL (RDMA) or Mooncake (TCP/RDMA) |
| SMG flags | `--prefill http://... <bootstrap_port>` | `--prefill grpc://...` + `--model-path` |
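The `/workers` check can also be scripted. The sketch below assumes a response shape (a `workers` list with `url` and `role` fields) that is not documented here:

```python
import json

# Sketch: validate a PD deployment from a /workers-style payload.
# The field names ("workers", "url", "role") are assumptions about the
# response shape, not a documented schema.

def check_pd_roles(payload: str) -> dict:
    """Count workers per role and require at least one prefill and one decode."""
    counts = {"prefill": 0, "decode": 0}
    for worker in json.loads(payload)["workers"]:
        counts[worker["role"]] += 1
    if not (counts["prefill"] and counts["decode"]):
        raise RuntimeError(f"incomplete PD deployment: {counts}")
    return counts

sample = json.dumps({"workers": [
    {"url": "http://prefill:8000", "role": "prefill"},
    {"url": "http://decode:8001", "role": "decode"},
]})
print(check_pd_roles(sample))
```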
For sizing guidelines, per-phase routing policies, Kubernetes service discovery, and monitoring, see the full PD Disaggregation Concepts page.