---
title: Getting Started
---
Shepherd Model Gateway (SMG) routes and manages LLM traffic across workers. This page gives you a fast path to a working gateway, then points you to feature-specific setup guides.
=== "pip (recommended)"
Pre-built wheels are available for Linux (x86_64, aarch64, musllinux), macOS (Intel and Apple Silicon), and Windows (x86_64), with Python 3.9–3.14.
```bash
pip install smg
```
This installs both:
- `smg serve` (the Python orchestration command that launches workers and the gateway)
- `smg launch` (the router-only launch path, provided by the Rust CLI)
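A quick smoke test after installation (this assumes the CLI exposes the usual `--help` flag; exact output depends on your version):
```bash
# Confirm both entry points are on your PATH
smg --help
smg serve --help
```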
=== "Cargo (crates.io)"
```bash
cargo install smg
```
=== "Docker"
**SMG only** (gateway/router, no inference engine):
Multi-architecture images are available for x86_64 and ARM64.
```bash
docker pull lightseekorg/smg:latest
```
Available tags: `latest` (stable), `v1.4.x` (specific version), `nightly` (development, from `ghcr.io/lightseekorg/smg:nightly`).
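To avoid surprises from the moving `latest` tag, you can pin a specific release; the patch version below is only an example, so pick one from the tag list linked further down:
```bash
# Pin a specific release instead of `latest`
docker pull lightseekorg/smg:v1.4.1
# Development builds are published to GHCR
docker pull ghcr.io/lightseekorg/smg:nightly
```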
**SMG + Engine** (all-in-one, ready to serve models):
Engine images bundle SMG with a specific inference engine (x86_64/CUDA only). Use these when you want a single container that can both route and serve.
```bash
# SGLang
docker pull ghcr.io/lightseekorg/smg:1.4.1-sglang-v0.5.10
# vLLM
docker pull ghcr.io/lightseekorg/smg:1.4.1-vllm-v0.19.0
# TensorRT-LLM
docker pull ghcr.io/lightseekorg/smg:1.4.1-trtllm-1.3.0rc10
```
Tag format: `{smg_version}-{engine}-{engine_version}`. Browse all tags at [ghcr.io/lightseekorg/smg](https://github.com/lightseekorg/smg/pkgs/container/smg).
=== "From Source"
```bash
# Install Rust
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
source "$HOME/.cargo/env"
# Clone and build
git clone https://github.com/lightseekorg/smg.git
cd smg
cargo build --release
```
The binary is available at `./target/release/smg`.
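Optionally copy the binary onto your `PATH` so the commands below work as written (the destination directory is just an example and may require `sudo`; `--help` assumes the usual help flag):
```bash
# Install the release binary and confirm it runs
install -m 755 ./target/release/smg /usr/local/bin/smg
smg --help
```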
Choose one of these startup paths.
`smg serve` launches the backend worker process(es) and then starts SMG with the generated worker URLs.
=== "SGLang"
```bash
smg serve \
--backend sglang \
--model-path meta-llama/Llama-3.1-8B-Instruct \
--data-parallel-size 2 \
--connection-mode grpc \
--host 0.0.0.0 \
--port 30000
```
=== "vLLM"
```bash
smg serve \
--backend vllm \
--model meta-llama/Llama-3.1-8B-Instruct \
--data-parallel-size 2 \
--host 0.0.0.0 \
--port 30000
```
=== "TensorRT-LLM (gRPC)"
```bash
smg serve \
--backend trtllm \
--model meta-llama/Llama-3.1-8B-Instruct \
--data-parallel-size 2 \
--host 0.0.0.0 \
--port 30000
```
This starts `--data-parallel-size` worker replicas, waits for them to report ready, then starts the gateway.
| Option | Default | Description |
|---|---|---|
| `--backend` | `sglang` | Inference backend: `sglang`, `vllm`, or `trtllm` |
| `--connection-mode` | `grpc` | Worker connection mode: `grpc` or `http` (TensorRT-LLM only supports gRPC) |
| `--data-parallel-size` | `1` | Number of worker replicas (one per GPU) |
| `--worker-base-port` | `31000` | Base port for worker processes |
| `--host` | `127.0.0.1` | Router host |
| `--port` | `8080` | Router port |
`smg launch` connects SMG to workers that are already running or are managed by another platform.
For gRPC workers:
```bash
smg launch \
--worker-urls grpc://localhost:50051 \
--model-path meta-llama/Llama-3.1-8B-Instruct \
--policy round_robin \
--host 0.0.0.0 \
--port 30000
```
For HTTP workers:
```bash
smg launch \
--worker-urls http://localhost:8000 \
--policy round_robin \
--host 0.0.0.0 \
--port 30000
```
Health:
```bash
curl http://localhost:30000/health
curl http://localhost:30000/readiness
```
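If you script startup, a small polling loop can wait for the gateway to come up before sending traffic (a minimal sketch against the endpoint above; the timeout is arbitrary):
```bash
# Poll /health until the gateway answers, giving up after ~60 seconds
for _ in $(seq 1 60); do
  curl -sf http://localhost:30000/health > /dev/null && { echo "gateway is up"; break; }
  sleep 1
done
```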
OpenAI-compatible chat completions:
```bash
curl http://localhost:30000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.1-8B-Instruct",
"messages": [{"role": "user", "content": "Say hello in one sentence."}]
}'
```
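Streaming uses the same endpoint; this sketch assumes the gateway passes through the OpenAI-style `stream` parameter, which OpenAI-compatible servers generally accept, so verify against your version:
```bash
# -N disables curl's output buffering so chunks appear as they arrive
curl -N http://localhost:30000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.1-8B-Instruct",
"messages": [{"role": "user", "content": "Count to five."}],
"stream": true
}'
```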
Responses API:
```bash
curl http://localhost:30000/v1/responses \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.1-8B-Instruct",
"input": "Say hello in one sentence."
}'
```
Use the following commands when workers are not started via `smg serve`:
=== "SGLang (gRPC)"
```bash
python -m sglang.launch_server \
--model-path meta-llama/Llama-3.1-8B-Instruct \
--host 0.0.0.0 \
--port 50051 \
--grpc-mode
```
=== "SGLang (HTTP)"
```bash
python -m sglang.launch_server \
--model-path meta-llama/Llama-3.1-8B-Instruct \
--host 0.0.0.0 \
--port 8000
```
=== "vLLM (gRPC)"
```bash
python -m vllm.entrypoints.grpc_server \
--model meta-llama/Llama-3.1-8B-Instruct \
--host 0.0.0.0 \
--port 50051 \
--tensor-parallel-size 1
```
=== "TensorRT-LLM (gRPC)"
```bash
python -m tensorrt_llm.commands.serve \
meta-llama/Llama-3.1-8B-Instruct \
--grpc \
--host 0.0.0.0 \
--port 50051 \
--backend pytorch \
--tp_size 1
```
For prefill-decode disaggregation, start separate prefill and decode workers:
=== "SGLang PD (gRPC)"
```bash
# Prefill worker
python -m sglang.launch_server \
--model-path meta-llama/Llama-3.1-8B-Instruct \
--host 0.0.0.0 \
--port 50051 \
--grpc-mode \
--disaggregation-mode prefill \
--disaggregation-bootstrap-port 8998
# Decode worker
python -m sglang.launch_server \
--model-path meta-llama/Llama-3.1-8B-Instruct \
--host 0.0.0.0 \
--port 50052 \
--grpc-mode \
--disaggregation-mode decode \
--disaggregation-bootstrap-port 8999
```
Start SMG with bootstrap ports for SGLang coordination:
```bash
smg launch \
--pd-disaggregation \
--prefill grpc://localhost:50051 8998 \
--decode grpc://localhost:50052 \
--model-path meta-llama/Llama-3.1-8B-Instruct \
--host 0.0.0.0 \
--port 30000
```
=== "SGLang PD (HTTP)"
```bash
# Prefill worker
python -m sglang.launch_server \
--model-path meta-llama/Llama-3.1-8B-Instruct \
--host 0.0.0.0 \
--port 8000 \
--disaggregation-mode prefill \
--disaggregation-bootstrap-port 8998
# Decode worker
python -m sglang.launch_server \
--model-path meta-llama/Llama-3.1-8B-Instruct \
--host 0.0.0.0 \
--port 8001 \
--disaggregation-mode decode \
--disaggregation-bootstrap-port 8999
```
Start SMG with bootstrap ports for SGLang coordination:
```bash
smg launch \
--pd-disaggregation \
--prefill http://localhost:8000 8998 \
--decode http://localhost:8001 \
--host 0.0.0.0 \
--port 30000
```
=== "vLLM PD (gRPC + NIXL)"
vLLM uses NIXL for KV cache transfer between prefill and decode workers:
```bash
# Prefill worker
VLLM_NIXL_SIDE_CHANNEL_PORT=5600 \
python -m vllm.entrypoints.grpc_server \
--model meta-llama/Llama-3.1-8B-Instruct \
--host 0.0.0.0 \
--port 50051 \
--kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_producer"}'
# Decode worker
VLLM_NIXL_SIDE_CHANNEL_PORT=5601 \
python -m vllm.entrypoints.grpc_server \
--model meta-llama/Llama-3.1-8B-Instruct \
--host 0.0.0.0 \
--port 50052 \
--kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_consumer"}'
```
Start SMG (no bootstrap ports needed — NIXL handles KV transfer):
```bash
smg \
--pd-disaggregation \
--prefill grpc://localhost:50051 \
--decode grpc://localhost:50052 \
--model-path meta-llama/Llama-3.1-8B-Instruct \
--host 0.0.0.0 \
--port 30000
```
See PD Disaggregation for full details including Mooncake backend and scaling.
Send a test request through the gateway:
```bash
curl http://localhost:30000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.1-8B-Instruct",
"messages": [
{"role": "user", "content": "What is the capital of France?"}
],
"max_tokens": 50
}'
```
Expected response:
```json
{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "model": "meta-llama/Llama-3.1-8B-Instruct",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "The capital of France is Paris."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 14,
    "completion_tokens": 8,
    "total_tokens": 22
  }
}
```
Check gateway and worker status:
```bash
# Gateway health
curl http://localhost:30000/health
# Worker status
curl http://localhost:30000/workers
```
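You can also ask the gateway which models it currently serves:
```bash
curl http://localhost:30000/v1/models
```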
For local deployment, run SMG in a container and point it at your worker:
```bash
docker pull lightseekorg/smg:latest
docker run -d \
--name smg \
-p 30000:30000 \
-p 29000:29000 \
lightseekorg/smg:latest \
--worker-urls http://host.docker.internal:8000 \
--policy cache_aware \
--prometheus-port 29000
```
Verify:
```bash
docker ps | grep smg
curl http://localhost:30000/health
```
Engine images include both SMG and an inference engine. Use `serve` to launch workers and the gateway together:
```bash
docker run -d --gpus all \
--name smg \
-p 30000:30000 \
-v /path/to/models:/models \
ghcr.io/lightseekorg/smg:1.4.1-sglang-v0.5.10 \
serve \
--backend sglang \
--model-path /models/meta-llama/Llama-3.1-8B-Instruct \
--port 30000
```
Verify:
```bash
curl http://localhost:30000/health
curl http://localhost:30000/v1/models
```
Run SMG in-cluster and use service discovery to pick up worker pods automatically.
Start SMG with service discovery:
```bash
smg \
--service-discovery \
--selector app=sglang-worker \
--service-discovery-namespace inference \
--service-discovery-port 8000 \
--policy cache_aware
```
Required RBAC permissions:
```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: smg-discovery
  namespace: inference
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list", "watch"]
```
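The Role above only grants permissions; it still has to be bound to whatever service account the SMG pod runs as. A sketch with kubectl (the role, binding, and service-account names are examples, not something SMG requires):
```bash
# Bind the smg-discovery Role to the service account used by the SMG pod
kubectl create rolebinding smg-discovery \
--role=smg-discovery \
--serviceaccount=inference:smg \
-n inference
```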
Verify:
```bash
kubectl get pods -n inference -l app=sglang-worker
curl http://localhost:30000/workers
```
Feature-specific setup guides:
- Multiple Workers — connect local or external worker endpoints
- gRPC Workers — gateway-side tokenization, parsing, and tool handling
- PD Disaggregation — split prefill and decode paths
- Service Discovery — Kubernetes pod-based worker registration
- Monitoring — Prometheus metrics, tracing, and alerts
- Logging — structured logs and aggregation patterns
- TLS — HTTPS gateway configuration
- Control Plane Auth — secure worker/tokenizer/WASM management endpoints
- Reliability Controls — concurrency limits, retries, and circuit breakers
- Data Connections — history backend setup for Postgres, Redis, and Oracle
- Tokenization and Parsing APIs — tokenize, detokenize, and parser endpoints
- Load Balancing — policy selection and tuning
- Tokenizer Caching — L0/L1 cache setup for gRPC mode
- MCP in Responses API — configure and execute MCP tools through `/v1/responses`
??? question "Gateway starts but can't connect to worker"
**Symptoms:** Gateway logs show connection errors.
**Solutions:**
1. Verify the worker is running: `curl http://localhost:8000/health`
2. Check network connectivity between gateway and worker
3. If using Docker, ensure proper network configuration (`--network host` or Docker network)
??? question "Request times out"
**Symptoms:** Requests hang or return 504 errors.
**Solutions:**
1. Check worker health: `curl http://localhost:30000/workers`
2. Increase timeout: `--request-timeout-secs 120`
3. Check worker logs for errors
??? question "Model not found error"
**Symptoms:** `model not found` in response.
**Solutions:**
1. Make sure the `model` field in the request matches the model loaded on the worker
2. Check available models: `curl http://localhost:30000/v1/models`