title	gRPC Workers

gRPC Workers

When workers connect via gRPC instead of HTTP, SMG becomes a full OpenAI-compatible server — handling tokenization, chat templates, reasoning extraction, and tool calling at the gateway level. Workers run raw inference only.

Before you begin

Completed the Getting Started guide
A gRPC-capable inference worker (vLLM with gRPC entrypoint)
Access to the model weights or a HuggingFace model path (for tokenizer loading)

What gRPC Mode Enables

Capability	HTTP Mode (worker handles)	gRPC Mode (gateway handles)
Chat templates	Worker	Gateway
Tokenization	Worker	Gateway (with caching)
Load balancing	Request-level	Token-aware
Reasoning extraction	Worker	Gateway
Tool call parsing	Worker	Gateway
MCP tool execution (Responses API)	N/A	Gateway

In HTTP mode, SMG is a smart proxy — routing and failover only. In gRPC mode, SMG takes over the full request processing pipeline.

Start a gRPC Worker

=== "SGLang"

```bash
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --host 0.0.0.0 \
  --port 50051 \
  --grpc-mode
```

=== "vLLM"

```bash
python -m vllm.entrypoints.grpc_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --host 0.0.0.0 \
  --port 50051
```

=== "TensorRT-LLM"

```bash
python -m tensorrt_llm.commands.serve serve \
  meta-llama/Llama-3.1-8B-Instruct \
  --grpc \
  --host 0.0.0.0 \
  --port 50051 \
  --backend pytorch
```

Connect SMG

Point SMG at the gRPC worker using grpc:// URLs and provide --model-path so the gateway can load the tokenizer:

smg \
  --worker-urls grpc://localhost:50051 \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --host 0.0.0.0 \
  --port 30000

!!! note "--model-path is required" The gateway needs the tokenizer to apply chat templates, count tokens for load balancing, and parse tool calls. This can be a HuggingFace model ID or a local path.

The API is still OpenAI-compatible — clients send the same requests as with HTTP workers:

curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [
      {"role": "user", "content": "Hello!"}
    ]
  }'

Multiple gRPC Workers

smg \
  --worker-urls grpc://worker1:50051 grpc://worker2:50052 \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --policy round_robin

Reasoning Extraction

For thinking models (DeepSeek-R1, Qwen3, etc.), SMG can extract chain-of-thought content into a separate field:

smg \
  --worker-urls grpc://worker:50051 \
  --model-path deepseek-ai/DeepSeek-R1 \
  --reasoning-parser deepseek_r1

The parser is auto-detected from the model name by default. Override with --reasoning-parser if needed.

Request with separate_reasoning: true:

curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-ai/DeepSeek-R1",
    "messages": [{"role": "user", "content": "What is 25 * 37?"}],
    "separate_reasoning": true
  }'

Response includes both fields:

{
  "choices": [{
    "message": {
      "role": "assistant",
      "content": "925",
      "reasoning_content": "Let me calculate 25 * 37 step by step..."
    }
  }]
}

Supported Reasoning Parsers

Auto-detected from the model name. Override with --reasoning-parser if needed.

Parser	Models
`deepseek_r1`	DeepSeek-R1
`qwen3`	Qwen3
`qwen3_thinking`	Qwen3-Thinking
`kimi`	Kimi
`glm45`	GLM-4.5, GLM-4.7
`step3`	Step-3
`minimax`	MiniMax, MiniMax-M2
`cohere_cmd`	Command-R, Command-A, C4AI

Tool Calling

In gRPC mode, SMG parses function calls from model output:

smg \
  --worker-urls grpc://worker:50051 \
  --model-path meta-llama/Llama-3.2-70B-Instruct \
  --tool-call-parser llama

For MCP tool execution in Responses API, see the dedicated guide:

# See:
#   Getting Started → MCP in Responses API
#   /v1/responses + --mcp-config-path

Supported Tool Call Parsers

Auto-detected from the model name. Override with --tool-call-parser if needed.

Parser	Models
`json`	GPT-4/4o, Claude, Gemini, Gemma, Llama (generic)
`llama`	Llama 3.2
`pythonic`	Llama 4, DeepSeek (generic)
`deepseek`	DeepSeek-V3
`mistral`	Mistral, Mixtral
`qwen`	Qwen
`qwen_coder`	Qwen3-Coder, Qwen2.5-Coder
`glm45_moe`	GLM-4.5, GLM-4.6
`glm47_moe`	GLM-4.7
`step3`	Step-3
`kimik2`	Kimi-K2
`minimax_m2`	MiniMax
`cohere`	Command-R, Command-A, C4AI

HTTP vs gRPC: When to Use Which

Use Case	Recommended Mode
Workers already run OpenAI servers (SGLang, vLLM HTTP)	HTTP
You need gateway-level tool parsing or Responses MCP	gRPC
You want token-aware load balancing	gRPC
You use thinking models and want reasoning extraction	gRPC
Simplest possible setup	HTTP

Next Steps

gRPC Pipeline Concepts — Full pipeline architecture, all supported parsers
Tokenizer Caching — Two-level cache for reduced CPU overhead
MCP in Responses API — Configure Model Context Protocol servers for /v1/responses
PD Disaggregation — Separate prefill and decode with gRPC workers

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

gRPC Workers

Before you begin

What gRPC Mode Enables

Start a gRPC Worker

Connect SMG

Multiple gRPC Workers

Reasoning Extraction

Supported Reasoning Parsers

Tool Calling

Supported Tool Call Parsers

HTTP vs gRPC: When to Use Which

Next Steps

FilesExpand file tree

grpc-workers.md

Latest commit

History

grpc-workers.md

File metadata and controls

gRPC Workers

Before you begin

What gRPC Mode Enables

Start a gRPC Worker

Connect SMG

Multiple gRPC Workers

Reasoning Extraction

Supported Reasoning Parsers

Tool Calling

Supported Tool Call Parsers

HTTP vs gRPC: When to Use Which

Next Steps