---
title: Getting Started
---

# Getting Started

Shepherd Model Gateway (SMG) routes and manages LLM traffic across workers. This page gives you a fast path to a working gateway, then points you to feature-specific setup guides.

## Install

=== "pip (recommended)"

Pre-built wheels are available for Linux (x86_64, aarch64, musllinux), macOS (Intel and Apple Silicon), and Windows (x86_64), with Python 3.9–3.14.

```bash
pip install smg
```

This installs both entry points:

- `smg serve` (Python orchestration command that starts workers and the gateway together)
- `smg launch` (Rust CLI that launches the gateway/router only)
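
A quick post-install sanity check (this assumes the CLI follows the usual `--help` convention; adjust if your build differs):

```bash
# Confirm both entry points are on PATH and print their options
smg serve --help
smg launch --help
```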

=== "Cargo (crates.io)"

```bash
cargo install smg
```

=== "Docker"

**SMG only** (gateway/router, no inference engine):

Multi-architecture images are available for x86_64 and ARM64.

```bash
docker pull lightseekorg/smg:latest
```

Available tags: `latest` (stable), `v1.4.x` (specific version), `nightly` (development, from `ghcr.io/lightseekorg/smg:nightly`).
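
To pin deployments to a release instead of `latest`, pull a versioned tag; the exact version below is illustrative of the `v1.4.x` scheme, not a guaranteed tag:

```bash
# Pin to a specific release (illustrative version)
docker pull lightseekorg/smg:v1.4.1
```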

**SMG + Engine** (all-in-one, ready to serve models):

Engine images bundle SMG with a specific inference engine (x86_64/CUDA only). Use these when you want a single container that can both route and serve.

```bash
# SGLang
docker pull ghcr.io/lightseekorg/smg:1.4.1-sglang-v0.5.10

# vLLM
docker pull ghcr.io/lightseekorg/smg:1.4.1-vllm-v0.19.0

# TensorRT-LLM
docker pull ghcr.io/lightseekorg/smg:1.4.1-trtllm-1.3.0rc10
```

Tag format: `{smg_version}-{engine}-{engine_version}`. Browse all tags at [ghcr.io/lightseekorg/smg](https://github.com/lightseekorg/smg/pkgs/container/smg).

=== "From Source"

```bash
# Install Rust
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
source "$HOME/.cargo/env"

# Clone and build
git clone https://github.com/lightseekorg/smg.git
cd smg
cargo build --release
```

The binary is available at `./target/release/smg`.
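
To put the binary on your `PATH`, you can let Cargo install it from the checkout (a standard Cargo workflow, not SMG-specific); the `--help` call assumes the usual CLI convention:

```bash
# Installs into ~/.cargo/bin, which rustup already adds to PATH
cargo install --path .
smg --help
```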

## Step 1: Start SMG

Choose one of these startup paths.

### Option A: All-in-one with `smg serve`

`smg serve` launches the backend worker process(es) and then starts SMG with the generated worker URLs.

=== "SGLang"

```bash
smg serve \
  --backend sglang \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --data-parallel-size 2 \
  --connection-mode grpc \
  --host 0.0.0.0 \
  --port 30000
```

=== "vLLM"

```bash
smg serve \
  --backend vllm \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --data-parallel-size 2 \
  --host 0.0.0.0 \
  --port 30000
```

=== "TensorRT-LLM (gRPC)"

```bash
smg serve \
  --backend trtllm \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --data-parallel-size 2 \
  --host 0.0.0.0 \
  --port 30000
```

This starts `--data-parallel-size` worker replicas, waits for them to become ready, then starts the gateway. Key options:

| Option | Default | Description |
|--------|---------|-------------|
| `--backend` | `sglang` | Inference backend: `sglang`, `vllm`, or `trtllm` |
| `--connection-mode` | `grpc` | Worker connection mode: `grpc` or `http` (TensorRT-LLM only supports gRPC) |
| `--data-parallel-size` | `1` | Number of worker replicas (one per GPU) |
| `--worker-base-port` | `31000` | Base port for worker processes |
| `--host` | `127.0.0.1` | Router host |
| `--port` | `8080` | Router port |
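
For example, with `--data-parallel-size 2` and the default base port, you would expect two workers on consecutive ports; note that sequential allocation is an assumption inferred from the `--worker-base-port` description, not documented behavior:

```bash
# Assumed port layout: worker N listens at worker-base-port + N
curl http://localhost:31000/health   # replica 0 (assumed)
curl http://localhost:31001/health   # replica 1 (assumed)
```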

### Option B: Launch gateway only with `smg launch`

Use this when workers are already running or managed by another platform.

For gRPC workers:

```bash
smg launch \
  --worker-urls grpc://localhost:50051 \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --policy round_robin \
  --host 0.0.0.0 \
  --port 30000
```

For HTTP workers:

```bash
smg launch \
  --worker-urls http://localhost:8000 \
  --policy round_robin \
  --host 0.0.0.0 \
  --port 30000
```
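
With more than one pre-existing worker, the plural `--worker-urls` flag suggests several URLs can be listed at once; a sketch under that assumption:

```bash
# Assumes --worker-urls accepts multiple space-separated URLs,
# as the plural flag name suggests
smg launch \
  --worker-urls http://localhost:8000 http://localhost:8001 \
  --policy round_robin \
  --host 0.0.0.0 \
  --port 30000
```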

## Step 2: Verify Core Endpoints

Health:

```bash
curl http://localhost:30000/health
curl http://localhost:30000/readiness
```

OpenAI-compatible chat completions:

```bash
curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}]
  }'
```
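
To confirm streaming works end to end, the same endpoint should accept the standard OpenAI `stream` field (this assumes SMG passes the field through to the worker, as OpenAI-compatible gateways typically do):

```bash
# Stream tokens as server-sent events; -N disables curl's buffering
curl -N http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Count to five."}],
    "stream": true
  }'
```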

Responses API:

```bash
curl http://localhost:30000/v1/responses \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "input": "Say hello in one sentence."
  }'
```

## Step 3: Choose Your Setup Track

- Core Deployment
- Operations and Security
- Reliability and Data
- Advanced Features


## Worker Startup Recipes (Standalone)

Use these when workers are not started via `smg serve`.

=== "SGLang (gRPC)"

```bash
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --host 0.0.0.0 \
  --port 50051 \
  --grpc-mode
```

=== "SGLang (HTTP)"

```bash
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --host 0.0.0.0 \
  --port 8000
```

=== "vLLM (gRPC)"

```bash
python -m vllm.entrypoints.grpc_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --host 0.0.0.0 \
  --port 50051 \
  --tensor-parallel-size 1
```

=== "TensorRT-LLM (gRPC)"

```bash
python -m tensorrt_llm.commands.serve \
  meta-llama/Llama-3.1-8B-Instruct \
  --grpc \
  --host 0.0.0.0 \
  --port 50051 \
  --backend pytorch \
  --tp_size 1
```
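
To bring up several standalone workers on one host, a small shell loop keeps the port arithmetic in one place; the port range and the GPU pinning via `CUDA_VISIBLE_DEVICES` below are illustrative choices, not SMG requirements:

```bash
# Start two SGLang HTTP workers on ports 8000-8001, one GPU each
for i in 0 1; do
  CUDA_VISIBLE_DEVICES=$i python -m sglang.launch_server \
    --model-path meta-llama/Llama-3.1-8B-Instruct \
    --host 0.0.0.0 \
    --port $((8000 + i)) &
done
# Then point smg launch at http://localhost:8000 and http://localhost:8001
```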

## PD Disaggregation Workers

For prefill-decode disaggregation, start separate prefill and decode workers:

=== "SGLang PD (gRPC)"

```bash
# Prefill worker
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --host 0.0.0.0 \
  --port 50051 \
  --grpc-mode \
  --disaggregation-mode prefill \
  --disaggregation-bootstrap-port 8998

# Decode worker
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --host 0.0.0.0 \
  --port 50052 \
  --grpc-mode \
  --disaggregation-mode decode \
  --disaggregation-bootstrap-port 8999
```

Start SMG with bootstrap ports for SGLang coordination:

```bash
smg launch \
  --pd-disaggregation \
  --prefill grpc://localhost:50051 8998 \
  --decode grpc://localhost:50052 \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --host 0.0.0.0 \
  --port 30000
```

=== "SGLang PD (HTTP)"

```bash
# Prefill worker
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --disaggregation-mode prefill \
  --disaggregation-bootstrap-port 8998

# Decode worker
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --host 0.0.0.0 \
  --port 8001 \
  --disaggregation-mode decode \
  --disaggregation-bootstrap-port 8999
```

Start SMG with bootstrap ports for SGLang coordination:

```bash
smg launch \
  --pd-disaggregation \
  --prefill http://localhost:8000 8998 \
  --decode http://localhost:8001 \
  --host 0.0.0.0 \
  --port 30000
```

=== "vLLM PD (gRPC + NIXL)"

vLLM uses NIXL for KV cache transfer between prefill and decode workers:

```bash
# Prefill worker
VLLM_NIXL_SIDE_CHANNEL_PORT=5600 \
python -m vllm.entrypoints.grpc_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --host 0.0.0.0 \
  --port 50051 \
  --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_producer"}'

# Decode worker
VLLM_NIXL_SIDE_CHANNEL_PORT=5601 \
python -m vllm.entrypoints.grpc_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --host 0.0.0.0 \
  --port 50052 \
  --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_consumer"}'
```

Start SMG (no bootstrap ports needed — NIXL handles KV transfer):

```bash
smg \
  --pd-disaggregation \
  --prefill grpc://localhost:50051 \
  --decode grpc://localhost:50052 \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --host 0.0.0.0 \
  --port 30000
```

See PD Disaggregation for full details, including the Mooncake backend and scaling.

## Send a Request

```bash
curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ],
    "max_tokens": 50
  }'
```

Expected response:

```json
{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "model": "meta-llama/Llama-3.1-8B-Instruct",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "The capital of France is Paris."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 14,
    "completion_tokens": 8,
    "total_tokens": 22
  }
}
```
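
Given that response shape, a quick way to pull out just the assistant text is to pipe through `jq` (requires `jq` installed locally):

```bash
# Extract only the generated text from the JSON response
curl -s http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "What is the capital of France?"}]
  }' | jq -r '.choices[0].message.content'
```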

## Verify Health

```bash
# Gateway health
curl http://localhost:30000/health

# Worker status
curl http://localhost:30000/workers
```

## Deploy with Docker

For local deployment, run SMG in a container and point it at your worker:

```bash
docker pull lightseekorg/smg:latest

docker run -d \
  --name smg \
  -p 30000:30000 \
  -p 29000:29000 \
  lightseekorg/smg:latest \
  --worker-urls http://host.docker.internal:8000 \
  --policy cache_aware \
  --prometheus-port 29000
```

Verify:

```bash
docker ps | grep smg
curl http://localhost:30000/health
```
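
On Linux, `host.docker.internal` is not available by default; host networking is a common workaround (a sketch: with `--network host` the `-p` port mappings are unnecessary):

```bash
# Linux only: share the host network so localhost reaches the worker directly
docker run -d \
  --name smg \
  --network host \
  lightseekorg/smg:latest \
  --worker-urls http://localhost:8000 \
  --policy cache_aware \
  --prometheus-port 29000
```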

### All-in-one with engine images

Engine images include both SMG and an inference engine. Use `serve` to launch workers and the gateway together:

```bash
docker run -d --gpus all \
  --name smg \
  -p 30000:30000 \
  -v /path/to/models:/models \
  ghcr.io/lightseekorg/smg:1.4.1-sglang-v0.5.10 \
  serve \
  --backend sglang \
  --model-path /models/meta-llama/Llama-3.1-8B-Instruct \
  --port 30000
```

Verify:

```bash
curl http://localhost:30000/health
curl http://localhost:30000/v1/models
```

## Deploy to Kubernetes (Quick Start)

Run SMG in-cluster and use service discovery to pick up worker pods automatically.

Start SMG with service discovery:

```bash
smg \
  --service-discovery \
  --selector app=sglang-worker \
  --service-discovery-namespace inference \
  --service-discovery-port 8000 \
  --policy cache_aware
```

Required RBAC permissions:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: smg-discovery
  namespace: inference
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "watch"]
```

Verify:

```bash
kubectl get pods -n inference -l app=sglang-worker
curl http://localhost:30000/workers
```

## Navigate by Category

### Core Setup

### Operations

- Monitoring — Prometheus metrics, tracing, and alerts
- Logging — structured logs and aggregation patterns
- TLS — HTTPS gateway configuration
- Control Plane Auth — secure worker/tokenizer/WASM management endpoints

### Reliability and Data

### Advanced Features

## Troubleshooting

??? question "Gateway starts but can't connect to worker"

**Symptoms:** Gateway logs show connection errors.

**Solutions:**

1. Verify the worker is running: `curl http://localhost:8000/health`
2. Check network connectivity between gateway and worker
3. If using Docker, ensure the gateway container can reach the worker (`--network host` on Linux, `host.docker.internal` on macOS/Windows, or a shared Docker network)

??? question "Request times out"

**Symptoms:** Requests hang or return 504 errors.

**Solutions:**

1. Check worker health: `curl http://localhost:30000/workers`
2. Increase timeout: `--request-timeout-secs 120`
3. Check worker logs for errors

??? question "Model not found error"

**Symptoms:** `model not found` in response.

**Solutions:**

1. Ensure the `model` field in requests matches the model ID loaded on the worker
2. Check available models: `curl http://localhost:30000/v1/models` (see the one-liner below)
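
A compact check, assuming the standard OpenAI `/v1/models` response shape with a `data` array (requires `jq`):

```bash
# List the model IDs the gateway actually serves
curl -s http://localhost:30000/v1/models | jq -r '.data[].id'
```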