- Adds a `context_size` provider_opt for DMR so that `max_tokens` no longer does double duty, avoiding confusion
- Improves how flags are sent to the DMR model/runtime configuration endpoint
Signed-off-by: Christopher Petito <chrisjpetito@gmail.com>
`docs/providers/dmr/index.md` (32 additions, 12 deletions)
````diff
@@ -64,29 +64,49 @@ models:
     model: ai/qwen3
     max_tokens: 8192
     provider_opts:
-      runtime_flags: ["--ngl=33", "--top-p=0.9"]
+      runtime_flags: ["--threads", "8"]
 ```
 
 Runtime flags also accept a single string:
 
 ```yaml
 provider_opts:
-  runtime_flags: "--ngl=33 --top-p=0.9"
+  runtime_flags: "--threads 8"
 ```
 
-## Parameter Mapping
+Use only flags your Model Runner backend allows (see `docker model configure --help` and backend docs). **Do not** put sampling parameters (`temperature`, `top_p`, penalties) in `runtime_flags` — set them on the model (`temperature`, `top_p`, etc.); they are sent **per request** via the OpenAI-compatible chat API.
 
-docker-agent model config fields map to llama.cpp flags automatically:
+## Context size
 
-| Config              | llama.cpp Flag        |
-| ------------------- | --------------------- |
-| `temperature`       | `--temp`              |
-| `top_p`             | `--top-p`             |
-| `frequency_penalty` | `--frequency-penalty` |
-| `presence_penalty`  | `--presence-penalty`  |
-| `max_tokens`        | `--context-size`      |
+`max_tokens` controls the **maximum output tokens** per chat completion request. To set the engine's **total context window**, use `provider_opts.context_size`:
 
-`runtime_flags` always take priority over derived flags on conflict.
+```yaml
+models:
+  local:
+    provider: dmr
+    model: ai/qwen3
+    max_tokens: 4096        # max output tokens (per-request)
+    provider_opts:
+      context_size: 32768   # total context window (sent via _configure)
+```
+
+If `context_size` is omitted, Model Runner uses its default. `max_tokens` is **not** used as the context window.
+
+## Thinking / reasoning budget
+
+When using the **llama.cpp** backend, `thinking_budget` is sent as structured `llamacpp.reasoning-budget` on `_configure` (maps to `--reasoning-budget`). String efforts use the same token mapping as other providers; `adaptive` maps to unlimited (`-1`).
+
+When using the **vLLM** backend, `thinking_budget` is sent as `thinking_token_budget` in each chat completion request. Effort levels map to token counts using the same scale as other providers; `adaptive` maps to unlimited (`-1`).
+
+```yaml
+models:
+  local:
+    provider: dmr
+    model: ai/qwen3
+    thinking_budget: medium # llama.cpp: reasoning-budget=8192; vLLM: thinking_token_budget=8192
+```
+
+On **MLX** and **SGLang** backends, `thinking_budget` is silently ignored — those engines do not currently expose a per-request reasoning token budget knob.
````
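As a rough illustration of the split this PR documents — per-request sampling parameters go to the OpenAI-compatible chat API, while engine-level settings go to the configuration endpoint — here is a minimal sketch. The field names mirror the doc; the splitting function itself (`split_dmr_config`) and the payload shapes are hypothetical, not the provider's actual implementation.

```python
# Hypothetical sketch: separate a DMR model config into the parameters
# sent with each chat completion request vs. the payload sent once to
# the model/runtime configuration (_configure) endpoint.

# Fields the doc says are sent per request via the OpenAI-compatible API.
PER_REQUEST_FIELDS = {
    "model", "max_tokens", "temperature", "top_p",
    "frequency_penalty", "presence_penalty",
}


def split_dmr_config(model_cfg: dict) -> tuple[dict, dict]:
    """Return (per-request chat params, configure-time payload)."""
    request_params = {
        k: v for k, v in model_cfg.items() if k in PER_REQUEST_FIELDS
    }

    opts = model_cfg.get("provider_opts", {})
    configure_payload = {}

    # context_size sets the engine's total context window, not max output.
    if "context_size" in opts:
        configure_payload["context_size"] = opts["context_size"]

    # runtime_flags may be a list or a single string (split on whitespace,
    # matching the doc's two accepted YAML forms).
    if "runtime_flags" in opts:
        flags = opts["runtime_flags"]
        configure_payload["runtime_flags"] = (
            flags.split() if isinstance(flags, str) else list(flags)
        )

    return request_params, configure_payload


req, cfg = split_dmr_config({
    "model": "ai/qwen3",
    "max_tokens": 4096,
    "provider_opts": {"context_size": 32768, "runtime_flags": "--threads 8"},
})
```

Note that `max_tokens` lands only in the per-request params and `context_size` only in the configure payload, matching the separation of responsibilities introduced by this change.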