---
title: Getting Started
---

# Getting Started

Shepherd Model Gateway (SMG) routes and manages LLM traffic across workers. This page gives you a fast path to a working gateway, then points you to feature-specific setup guides.

## Install

=== "pip (recommended)"

Pre-built wheels are available for Linux (x86_64, aarch64, musllinux), macOS (Intel and Apple Silicon), and Windows (x86_64), with Python 3.9–3.14.

```bash
pip install smg
```

This installs both entry points:

- `smg serve` (Python orchestration command that starts workers and the gateway together)
- `smg launch` (Rust CLI that launches the gateway/router only)
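
A quick post-install sanity check (this assumes the CLI follows the usual `--help` convention; adjust if your build differs):

```bash
# Confirm both entry points are on PATH and print their options
smg serve --help
smg launch --help
```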

=== "Cargo (crates.io)"

```bash
cargo install smg
```

=== "Docker"

**SMG only** (gateway/router, no inference engine):

Multi-architecture images are available for x86_64 and ARM64.

```bash
docker pull lightseekorg/smg:latest
```

Available tags: `latest` (stable), `v1.4.x` (specific version), `nightly` (development, from `ghcr.io/lightseekorg/smg:nightly`).
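
To pin deployments to a release instead of `latest`, pull a versioned tag; the exact version below is illustrative of the `v1.4.x` scheme, not a guaranteed tag:

```bash
# Pin to a specific release (illustrative version)
docker pull lightseekorg/smg:v1.4.1
```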

**SMG + Engine** (all-in-one, ready to serve models):

Engine images bundle SMG with a specific inference engine (x86_64/CUDA only). Use these when you want a single container that can both route and serve.

```bash
# SGLang
docker pull ghcr.io/lightseekorg/smg:1.4.1-sglang-v0.5.10

# vLLM
docker pull ghcr.io/lightseekorg/smg:1.4.1-vllm-v0.19.0

# TensorRT-LLM
docker pull ghcr.io/lightseekorg/smg:1.4.1-trtllm-1.3.0rc10
```

Tag format: `{smg_version}-{engine}-{engine_version}`. Browse all tags at [ghcr.io/lightseekorg/smg](https://github.com/lightseekorg/smg/pkgs/container/smg).

=== "From Source"

```bash
# Install Rust
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
source "$HOME/.cargo/env"

# Clone and build
git clone https://github.com/lightseekorg/smg.git
cd smg
cargo build --release
```

The binary is available at `./target/release/smg`.
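
To put the binary on your `PATH`, you can let Cargo install it from the checkout (a standard Cargo workflow, not SMG-specific); the `--help` call assumes the usual CLI convention:

```bash
# Installs into ~/.cargo/bin, which rustup already adds to PATH
cargo install --path .
smg --help
```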

## Step 1: Start SMG

Choose one of these startup paths.

### Option A: All-in-one with `smg serve`

`smg serve` launches the backend worker process(es) and then starts SMG with the generated worker URLs.

=== "SGLang"

```bash
smg serve \
  --backend sglang \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --data-parallel-size 2 \
  --connection-mode grpc \
  --host 0.0.0.0 \
  --port 30000
```

=== "vLLM"

```bash
smg serve \
  --backend vllm \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --data-parallel-size 2 \
  --host 0.0.0.0 \
  --port 30000
```

=== "TensorRT-LLM (gRPC)"

```bash
smg serve \
  --backend trtllm \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --data-parallel-size 2 \
  --host 0.0.0.0 \
  --port 30000
```

This starts `--data-parallel-size` worker replicas, waits for them to become ready, then starts the gateway. Key options:

| Option | Default | Description |
|--------|---------|-------------|
| `--backend` | `sglang` | Inference backend: `sglang`, `vllm`, or `trtllm` |
| `--connection-mode` | `grpc` | Worker connection mode: `grpc` or `http` (TensorRT-LLM only supports gRPC) |
| `--data-parallel-size` | `1` | Number of worker replicas (one per GPU) |
| `--worker-base-port` | `31000` | Base port for worker processes |
| `--host` | `127.0.0.1` | Router host |
| `--port` | `8080` | Router port |
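
For example, with `--data-parallel-size 2` and the default base port, you would expect two workers on consecutive ports; note that sequential allocation is an assumption inferred from the `--worker-base-port` description, not documented behavior:

```bash
# Assumed port layout: worker N listens at worker-base-port + N
curl http://localhost:31000/health   # replica 0 (assumed)
curl http://localhost:31001/health   # replica 1 (assumed)
```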

### Option B: Launch gateway only with `smg launch`

Use this when workers are already running or managed by another platform.

For gRPC workers:

```bash
smg launch \
  --worker-urls grpc://localhost:50051 \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --policy round_robin \
  --host 0.0.0.0 \
  --port 30000
```

For HTTP workers:

```bash
smg launch \
  --worker-urls http://localhost:8000 \
  --policy round_robin \
  --host 0.0.0.0 \
  --port 30000
```
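
With more than one pre-existing worker, the plural `--worker-urls` flag suggests several URLs can be listed at once; a sketch under that assumption:

```bash
# Assumes --worker-urls accepts multiple space-separated URLs,
# as the plural flag name suggests
smg launch \
  --worker-urls http://localhost:8000 http://localhost:8001 \
  --policy round_robin \
  --host 0.0.0.0 \
  --port 30000
```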

## Step 2: Verify Core Endpoints

Health:

```bash
curl http://localhost:30000/health
curl http://localhost:30000/readiness
```

OpenAI-compatible chat completions:

```bash
curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}]
  }'
```
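
To confirm streaming works end to end, the same endpoint should accept the standard OpenAI `stream` field (this assumes SMG passes the field through to the worker, as OpenAI-compatible gateways typically do):

```bash
# Stream tokens as server-sent events; -N disables curl's buffering
curl -N http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Count to five."}],
    "stream": true
  }'
```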

Responses API:

```bash
curl http://localhost:30000/v1/responses \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "input": "Say hello in one sentence."
  }'
```

## Step 3: Choose Your Setup Track

- Core Deployment
- Operations and Security
- Reliability and Data
- Advanced Features


## Worker Startup Recipes (Standalone)

Use these when workers are not started via `smg serve`.

=== "SGLang (gRPC)"

```bash
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --host 0.0.0.0 \
  --port 50051 \
  --grpc-mode
```

=== "SGLang (HTTP)"

```bash
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --host 0.0.0.0 \
  --port 8000
```

=== "vLLM (gRPC)"

```bash
python -m vllm.entrypoints.grpc_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --host 0.0.0.0 \
  --port 50051 \
  --tensor-parallel-size 1
```

=== "TensorRT-LLM (gRPC)"

```bash
python -m tensorrt_llm.commands.serve \
  meta-llama/Llama-3.1-8B-Instruct \
  --grpc \
  --host 0.0.0.0 \
  --port 50051 \
  --backend pytorch \
  --tp_size 1
```
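
To bring up several standalone workers on one host, a small shell loop keeps the port arithmetic in one place; the port range and the GPU pinning via `CUDA_VISIBLE_DEVICES` below are illustrative choices, not SMG requirements:

```bash
# Start two SGLang HTTP workers on ports 8000-8001, one GPU each
for i in 0 1; do
  CUDA_VISIBLE_DEVICES=$i python -m sglang.launch_server \
    --model-path meta-llama/Llama-3.1-8B-Instruct \
    --host 0.0.0.0 \
    --port $((8000 + i)) &
done
# Then point smg launch at http://localhost:8000 and http://localhost:8001
```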

## PD Disaggregation Workers

For prefill-decode disaggregation, start separate prefill and decode workers:

=== "SGLang PD (gRPC)"

```bash
# Prefill worker
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --host 0.0.0.0 \
  --port 50051 \
  --grpc-mode \
  --disaggregation-mode prefill \
  --disaggregation-bootstrap-port 8998

# Decode worker
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --host 0.0.0.0 \
  --port 50052 \
  --grpc-mode \
  --disaggregation-mode decode \
  --disaggregation-bootstrap-port 8999
```

Start SMG with bootstrap ports for SGLang coordination:

```bash
smg launch \
  --pd-disaggregation \
  --prefill grpc://localhost:50051 8998 \
  --decode grpc://localhost:50052 \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --host 0.0.0.0 \
  --port 30000
```

=== "SGLang PD (HTTP)"

```bash
# Prefill worker
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --disaggregation-mode prefill \
  --disaggregation-bootstrap-port 8998

# Decode worker
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --host 0.0.0.0 \
  --port 8001 \
  --disaggregation-mode decode \
  --disaggregation-bootstrap-port 8999
```

Start SMG with bootstrap ports for SGLang coordination:

```bash
smg launch \
  --pd-disaggregation \
  --prefill http://localhost:8000 8998 \
  --decode http://localhost:8001 \
  --host 0.0.0.0 \
  --port 30000
```

=== "vLLM PD (gRPC + NIXL)"

vLLM uses NIXL for KV cache transfer between prefill and decode workers:

```bash
# Prefill worker
VLLM_NIXL_SIDE_CHANNEL_PORT=5600 \
python -m vllm.entrypoints.grpc_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --host 0.0.0.0 \
  --port 50051 \
  --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_producer"}'

# Decode worker
VLLM_NIXL_SIDE_CHANNEL_PORT=5601 \
python -m vllm.entrypoints.grpc_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --host 0.0.0.0 \
  --port 50052 \
  --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_consumer"}'
```

Start SMG (no bootstrap ports needed — NIXL handles KV transfer):

```bash
smg \
  --pd-disaggregation \
  --prefill grpc://localhost:50051 \
  --decode grpc://localhost:50052 \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --host 0.0.0.0 \
  --port 30000
```

See PD Disaggregation for full details, including the Mooncake backend and scaling.

## Send a Request

```bash
curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ],
    "max_tokens": 50
  }'
```

Expected response:

```json
{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "model": "meta-llama/Llama-3.1-8B-Instruct",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "The capital of France is Paris."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 14,
    "completion_tokens": 8,
    "total_tokens": 22
  }
}
```
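
Given that response shape, a quick way to pull out just the assistant text is to pipe through `jq` (requires `jq` installed locally):

```bash
# Extract only the generated text from the JSON response
curl -s http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "What is the capital of France?"}]
  }' | jq -r '.choices[0].message.content'
```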

## Verify Health

```bash
# Gateway health
curl http://localhost:30000/health

# Worker status
curl http://localhost:30000/workers
```

## Deploy with Docker

For local deployment, run SMG in a container and point it at your worker:

```bash
docker pull lightseekorg/smg:latest

docker run -d \
  --name smg \
  -p 30000:30000 \
  -p 29000:29000 \
  lightseekorg/smg:latest \
  --worker-urls http://host.docker.internal:8000 \
  --policy cache_aware \
  --prometheus-port 29000
```

Verify:

```bash
docker ps | grep smg
curl http://localhost:30000/health
```
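
On Linux, `host.docker.internal` is not available by default; host networking is a common workaround (a sketch: with `--network host` the `-p` port mappings are unnecessary):

```bash
# Linux only: share the host network so localhost reaches the worker directly
docker run -d \
  --name smg \
  --network host \
  lightseekorg/smg:latest \
  --worker-urls http://localhost:8000 \
  --policy cache_aware \
  --prometheus-port 29000
```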

### All-in-one with engine images

Engine images include both SMG and an inference engine. Use `serve` to launch workers and the gateway together:

```bash
docker run -d --gpus all \
  --name smg \
  -p 30000:30000 \
  -v /path/to/models:/models \
  ghcr.io/lightseekorg/smg:1.4.1-sglang-v0.5.10 \
  serve \
  --backend sglang \
  --model-path /models/meta-llama/Llama-3.1-8B-Instruct \
  --port 30000
```

Verify:

```bash
curl http://localhost:30000/health
curl http://localhost:30000/v1/models
```

## Deploy to Kubernetes (Quick Start)

Run SMG in-cluster and use service discovery to pick up worker pods automatically.

Start SMG with service discovery:

```bash
smg \
  --service-discovery \
  --selector app=sglang-worker \
  --service-discovery-namespace inference \
  --service-discovery-port 8000 \
  --policy cache_aware
```

Required RBAC permissions:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: smg-discovery
  namespace: inference
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "watch"]
```

Verify:

```bash
kubectl get pods -n inference -l app=sglang-worker
curl http://localhost:30000/workers
```

## Navigate by Category

### Core Setup

### Operations

- Monitoring — Prometheus metrics, tracing, and alerts
- Logging — structured logs and aggregation patterns
- TLS — HTTPS gateway configuration
- Control Plane Auth — secure worker/tokenizer/WASM management endpoints

### Reliability and Data

### Advanced Features

## Troubleshooting

??? question "Gateway starts but can't connect to worker"

**Symptoms:** Gateway logs show connection errors.

**Solutions:**

1. Verify the worker is running: `curl http://localhost:8000/health`
2. Check network connectivity between gateway and worker
3. If using Docker, ensure the gateway container can reach the worker (`--network host` on Linux, `host.docker.internal` on macOS/Windows, or a shared Docker network)

??? question "Request times out"

**Symptoms:** Requests hang or return 504 errors.

**Solutions:**

1. Check worker health: `curl http://localhost:30000/workers`
2. Increase timeout: `--request-timeout-secs 120`
3. Check worker logs for errors

??? question "Model not found error"

**Symptoms:** `model not found` in response.

**Solutions:**

1. Ensure the `model` field in requests matches the model ID loaded on the worker
2. Check available models: `curl http://localhost:30000/v1/models` (see the one-liner below)
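
A compact check, assuming the standard OpenAI `/v1/models` response shape with a `data` array (requires `jq`):

```bash
# List the model IDs the gateway actually serves
curl -s http://localhost:30000/v1/models | jq -r '.data[].id'
```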