[Serve][LLM][SGLang] Wide / elastic expert parallelism serving pattern

The existing Wide-EP pattern (`DPServer` + decode-as-orchestrator PD + gang scheduling) passes vLLM-keyed kwargs and sets `VLLM_RAY_BUNDLE_INDICES`, so it is vLLM-specific. SGLang consumes different argument names (`dist_init_addr`, `moe_dp_size`, `enable_dp_attention`, etc.) and provides its own in-engine elastic expert backup. This issue tracks a parallel SGLang Wide-EP server class.

## Subitems

- [ ] A new SGLang DP / Wide-EP server class.
- [ ] SGLang's MoE / EP engine args (`moe_dp_size`, `moe_ep_size`, `enable_dp_attention`, `elastic_ep_backend`, `enable_elastic_expert_backup`) are exposed as first-class options through the new server.
- [ ] Layering: engine-level elastic backup handles expert-local failures; gang `RESTART_GANG` handles control-plane / NCCL-level failures. Document this contract.
- [ ] SGLang Wide-EP release test on a real MoE model (DeepSeek-V3 class). A fault-injection variant kills one scheduler actor and asserts engine-level recovery.
- [ ] SGLang Wide-EP user guide documenting topology and failure-mode contract.
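
As a rough illustration of the "first-class engine args" subitem, the sketch below groups the five arg names listed above into one typed container. The class name `SGLangWideEPArgs`, the method `to_engine_kwargs`, the defaults, and the validation rules are all hypothetical and are not part of Ray Serve or SGLang; only the field names come from this issue.

```python
from dataclasses import asdict, dataclass
from typing import Any, Dict, Optional


@dataclass(frozen=True)
class SGLangWideEPArgs:
    """Hypothetical container for the SGLang MoE / EP args named in this issue."""

    moe_dp_size: int = 1
    moe_ep_size: int = 1
    enable_dp_attention: bool = False
    elastic_ep_backend: Optional[str] = None
    enable_elastic_expert_backup: bool = False

    def __post_init__(self) -> None:
        if self.moe_dp_size < 1 or self.moe_ep_size < 1:
            raise ValueError("moe_dp_size and moe_ep_size must be >= 1")
        # Assumed rule for this sketch only: backup requires a backend.
        if self.enable_elastic_expert_backup and self.elastic_ep_backend is None:
            raise ValueError(
                "enable_elastic_expert_backup requires elastic_ep_backend"
            )

    def to_engine_kwargs(self) -> Dict[str, Any]:
        """Return a kwargs dict, dropping unset (None) values."""
        return {k: v for k, v in asdict(self).items() if v is not None}


# Usage: build the kwargs a server class might forward to the engine.
args = SGLangWideEPArgs(moe_dp_size=2, moe_ep_size=8, enable_dp_attention=True)
engine_kwargs = args.to_engine_kwargs()
```

Validating at construction time would surface misconfiguration at deploy time rather than inside the engine, though whether that belongs in the server class or in SGLang itself is left open here.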

## Open questions

- Failure-mode layering between engine-level elastic expert backup and Ray-side `RESTART_GANG`: the engine handles expert-local failures and the gang handles control-plane failures. This needs an explicit contract so the two mechanisms don't fight.
- Whether expert-placement topology is observable at the Ray layer or stays engine-internal.
- Autoscaling behavior on Wide-EP deployments: does scale-up trigger an expert rebalance, or is it opaque DP-group scaling?
- Whether the new server class subclasses `DPServer` or stands alone.
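
One way to make the failure-mode layering concrete is a single routing function that names the owner of each failure class. This is a minimal sketch, not a proposed API: `FailureKind`, `RecoveryOwner`, and `recovery_owner` are hypothetical names, and the fallback to gang restart when elastic backup is disabled is an assumption of the sketch rather than a settled decision.

```python
from enum import Enum


class FailureKind(Enum):
    EXPERT_LOCAL = "expert_local"
    CONTROL_PLANE = "control_plane"
    NCCL = "nccl"


class RecoveryOwner(Enum):
    ENGINE_ELASTIC_BACKUP = "engine_elastic_backup"
    RESTART_GANG = "restart_gang"


def recovery_owner(kind: FailureKind, elastic_backup_enabled: bool) -> RecoveryOwner:
    """Pick exactly one recovery mechanism for a failure.

    Expert-local failures go to the engine when elastic backup is on.
    Everything else (and, by assumption, expert-local failures when
    backup is off) goes to the Ray-side gang restart.
    """
    if kind is FailureKind.EXPERT_LOCAL and elastic_backup_enabled:
        return RecoveryOwner.ENGINE_ELASTIC_BACKUP
    return RecoveryOwner.RESTART_GANG
```

Returning exactly one owner per failure is what keeps the two mechanisms from acting on the same event.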

## Upstream coordination

- Expose expert-placement state over a `TokenizerManager` RPC for Ray-side observability.
- Provide a clear engine signal while elastic backup is in progress, so gang restart can defer.
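
To show how the Ray side might consume an "elastic backup in progress" signal, here is a small polling sketch. The helper `wait_for_elastic_backup` and its parameters are hypothetical; the `backup_in_progress` callable stands in for whatever signal SGLang ends up exposing, which does not exist in this form today.

```python
import time
from typing import Callable


def wait_for_elastic_backup(
    backup_in_progress: Callable[[], bool],
    timeout_s: float,
    poll_interval_s: float = 0.5,
) -> bool:
    """Defer while the engine reports an elastic backup in progress.

    Returns True if the backup finished before the timeout, and False
    if the timeout elapsed while the backup was still running.
    """
    deadline = time.monotonic() + timeout_s
    while backup_in_progress():
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval_s)
    return True


# Usage with a fake signal that clears on the third poll.
states = iter([True, True, False])
finished = wait_for_elastic_backup(lambda: next(states), timeout_s=5.0, poll_interval_s=0.0)
```

What the gang controller does after the wait (skip the restart, re-check health, or restart anyway on timeout) is part of the contract this issue still has to define.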


[Serve][LLM][SGLang] Wide / elastic expert parallelism serving pattern #62793
