| title | PD Disaggregation |
|---|
Prefill-Decode (PD) disaggregation separates the two phases of LLM inference — prompt processing (prefill) and token generation (decode) — onto specialized workers, so that Time to First Token (TTFT) and decode throughput can be optimized independently.
Prerequisites:

- Completed the Getting Started guide
- At least one prefill worker and one decode worker
- For vLLM PD: workers started with the gRPC entrypoint and a KV transfer backend
| Phase | Compute Pattern | Bottleneck |
|---|---|---|
| Prefill | Compute-bound, parallel | GPU compute |
| Decode | Memory-bound, sequential | Memory bandwidth |
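The bottleneck column follows from a back-of-envelope arithmetic-intensity estimate. The sketch below uses illustrative numbers (a 70B-parameter FP16 model, a 2048-token prompt), not measurements from this guide:

```python
# Back-of-envelope arithmetic intensity (FLOPs per byte of weights read)
# for a hypothetical 70B-parameter FP16 model. Illustrative numbers only.

params = 70e9            # model parameters (assumed)
bytes_per_param = 2      # FP16 weights
weight_bytes = params * bytes_per_param

def arithmetic_intensity(tokens_per_pass: int) -> float:
    """FLOPs per byte of weights when `tokens_per_pass` tokens share one
    weight read. A forward pass costs roughly 2 FLOPs per parameter per token."""
    flops = 2 * params * tokens_per_pass
    return flops / weight_bytes

# Prefill processes the whole prompt in one pass: each weight read is
# amortized over many tokens, so the phase is compute-bound.
prefill = arithmetic_intensity(tokens_per_pass=2048)

# Decode generates one token per pass: every step re-reads all weights
# for a single token, so the phase is memory-bandwidth-bound.
decode = arithmetic_intensity(tokens_per_pass=1)

print(f"prefill: {prefill:.0f} FLOPs/byte, decode: {decode:.0f} FLOPs/byte")
```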
Running both on the same worker creates contention — prefill batches wait for decode slots, and decode batches stay small due to memory pressure. Dedicating workers to each phase removes this conflict.
For SGLang PD, SMG sends the request to both prefill and decode workers simultaneously, and the workers coordinate the KV cache transfer through a bootstrap mechanism.
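A minimal sketch of this concurrent dispatch, with stub workers standing in for real SGLang endpoints (the worker internals, the `bootstrap_room` field name, and the handshake are illustrative assumptions, not SMG's actual implementation):

```python
import asyncio

# Sketch: the router sends the same request to the prefill and decode
# workers at the same time, tagged with a shared identifier so the two
# sides can rendezvous for the KV cache transfer.

async def prefill_worker(request: dict) -> str:
    # Real worker: run prefill, then push the KV cache to its bootstrap peer.
    return f"prefill done (room {request['bootstrap_room']})"

async def decode_worker(request: dict) -> str:
    # Real worker: wait for the KV cache from prefill, then stream tokens.
    return f"decode done (room {request['bootstrap_room']})"

async def dispatch(request: dict) -> list[str]:
    # Tag the request with a shared room id, then fan out to both workers.
    tagged = {**request, "bootstrap_room": 42}
    return await asyncio.gather(prefill_worker(tagged), decode_worker(tagged))

prefill_msg, decode_msg = asyncio.run(dispatch({"prompt": "Hello!"}))
print(prefill_msg, "|", decode_msg)
```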
```bash
# Prefill worker
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-70B-Instruct \
  --port 8000 \
  --prefill-only

# Decode worker
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-70B-Instruct \
  --port 8001 \
  --decode-only
```

Each prefill worker needs a bootstrap port for coordination:
```bash
smg \
  --pd-disaggregation \
  --prefill http://prefill:8000 9001 \
  --decode http://decode:8001 \
  --host 0.0.0.0 \
  --port 30000
```

With multiple workers per phase, you can also set a routing policy for each phase:

```bash
smg \
  --pd-disaggregation \
  --prefill http://prefill1:8000 9001 \
  --prefill http://prefill2:8000 9002 \
  --decode http://decode1:8001 \
  --decode http://decode2:8001 \
  --prefill-policy cache_aware \
  --decode-policy power_of_two
```

For vLLM PD, SMG sends the request to prefill first with `max_tokens=1`, then sends the original request to decode. The KV backend (NIXL or Mooncake) transfers the cache transparently.
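That two-step flow can be sketched as follows. The worker functions and dict shapes are illustrative stand-ins, and the real cache handoff happens inside the NIXL/Mooncake connector, not in the router:

```python
# Sketch of vLLM-style PD dispatch: the router first sends the request to
# the prefill worker with max_tokens=1 so it computes the KV cache (and at
# most one token), then forwards the original request to the decode worker,
# which reuses the transferred cache. Workers are stubbed out.

def prefill_worker(request: dict) -> dict:
    # Real worker: run prefill; the kv_producer connector exports the cache.
    assert request["max_tokens"] == 1
    return {"kv_ready": True, "prompt": request["prompt"]}

def decode_worker(request: dict, kv: dict) -> str:
    # Real worker: the kv_consumer connector imports the cache, then decodes.
    assert kv["kv_ready"]
    return f"generated up to {request['max_tokens']} tokens"

def dispatch(request: dict) -> str:
    # Step 1: prefill, with max_tokens clamped to 1.
    kv = prefill_worker({**request, "max_tokens": 1})
    # Step 2: the original request goes to decode.
    return decode_worker(request, kv)

print(dispatch({"prompt": "Hello!", "max_tokens": 128}))
```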
With the NIXL backend (the default):

```bash
# Prefill worker
VLLM_NIXL_SIDE_CHANNEL_PORT=5600 \
python -m vllm.entrypoints.grpc_server \
  --model /path/to/model \
  --port 50051 \
  --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_producer"}'

# Decode worker
VLLM_NIXL_SIDE_CHANNEL_PORT=5601 \
python -m vllm.entrypoints.grpc_server \
  --model /path/to/model \
  --port 50052 \
  --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_consumer"}'
```

vLLM workers use `grpc://` URLs and require `--model-path` for tokenizer loading:
```bash
smg \
  --pd-disaggregation \
  --prefill grpc://prefill:50051 \
  --decode grpc://decode:50052 \
  --model-path /path/to/model \
  --host 0.0.0.0 \
  --port 30000
```

Mooncake supports TCP transport (no RDMA required). Each prefill worker needs a unique bootstrap port:
```bash
# Prefill worker
VLLM_MOONCAKE_BOOTSTRAP_PORT=8998 \
python -m vllm.entrypoints.grpc_server \
  --model /path/to/model \
  --port 50051 \
  --kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_producer"}'

# Decode worker
python -m vllm.entrypoints.grpc_server \
  --model /path/to/model \
  --port 50052 \
  --kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_consumer"}'
```

Pass the bootstrap port after the prefill URL:

```bash
smg \
  --pd-disaggregation \
  --prefill grpc://prefill:50051 8998 \
  --decode grpc://decode:50052 \
  --model-path /path/to/model
```

Use the provided script to launch workers with either backend:
```bash
# NIXL (default)
./scripts/launch-pd-workers.sh vllm /path/to/model

# Mooncake
KV_BACKEND=mooncake ./scripts/launch-pd-workers.sh vllm /path/to/model
```

Verify the setup:

```bash
# Check workers and their roles
curl http://localhost:30000/workers | jq

# Send a request
curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-70B-Instruct",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```

| | SGLang PD | vLLM PD |
|---|---|---|
| Protocol | HTTP | gRPC |
| Dispatch | Both workers receive request simultaneously | Prefill first, then decode |
| KV Transfer | Bootstrap-based coordination | NIXL (RDMA) or Mooncake (TCP/RDMA) |
| SMG flags | `--prefill http://... <bootstrap_port>` | `--prefill grpc://...` + `--model-path` |
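The `/workers` check can also be scripted. The sketch below assumes a response shape (a `workers` list with `url` and `role` fields) that is not documented here:

```python
import json

# Sketch: validate a PD deployment from a /workers-style payload.
# The field names ("workers", "url", "role") are assumptions about the
# response shape, not a documented schema.

def check_pd_roles(payload: str) -> dict:
    """Count workers per role and require at least one prefill and one decode."""
    counts = {"prefill": 0, "decode": 0}
    for worker in json.loads(payload)["workers"]:
        counts[worker["role"]] += 1
    if not (counts["prefill"] and counts["decode"]):
        raise RuntimeError(f"incomplete PD deployment: {counts}")
    return counts

sample = json.dumps({"workers": [
    {"url": "http://prefill:8000", "role": "prefill"},
    {"url": "http://decode:8001", "role": "decode"},
]})
print(check_pd_roles(sample))
```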
For sizing guidelines, per-phase routing policies, Kubernetes service discovery, and monitoring, see the full PD Disaggregation Concepts page.