
CUDA: opt-in managed pool + OOM fallback for alloc_buffer #22158

Closed

icex wants to merge 2 commits into ggml-org:master from icex:gfx1201-kv-spill-fallback

Conversation


@icex icex commented Apr 20, 2026

Summary

Two opt-in escape hatches for users hitting OOM on small-VRAM AMD cards running long-context models. Not a bug fix — as noted on #21376, ROCm doesn't spill to system RAM by design, and this PR doesn't change that default. Both paths are gated behind env vars; with neither set, behavior is byte-for-byte identical to master.

Motivating case: gfx1201 (16 GB) on ROCm without HIP VMM (system_info reports NO_VMM=1), where the legacy pool grows monotonically and the user has no way to trade raw throughput for "don't abort."

1. GGML_CUDA_POOL_USE_MANAGED=1 (opt-in)

Routes the legacy pool's grow path through cudaMallocManaged (→ hipMallocManaged on ROCm). On the grow path we also free the pool's idle smaller buffers — under managed memory those otherwise sit in system RAM, inflate pool_size, and cause PCIe page-faults during later TG.
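The grow-path heuristic can be illustrated with a minimal sketch. This is not the PR's actual code: the pool structure and allocation calls are stand-ins (plain `malloc`/`free` instead of `cudaMallocManaged`/`cudaFree`), kept only to show the control flow of dropping idle smaller buffers before growing under the managed path.

```cpp
// Hypothetical sketch of the managed grow path described above. Allocation
// is stubbed with malloc so the logic runs without a GPU; the real code
// would call cudaMalloc / cudaMallocManaged and cudaFree instead.
#include <cstdlib>
#include <vector>

struct pool_buf { void * ptr; size_t size; bool in_use; };

struct legacy_pool {
    std::vector<pool_buf> bufs;
    size_t pool_size = 0;
    // Opt-in via env var, mirroring GGML_CUDA_POOL_USE_MANAGED.
    bool use_managed = std::getenv("GGML_CUDA_POOL_USE_MANAGED") != nullptr;

    void * grow(size_t size) {
        if (use_managed) {
            // Under managed memory, idle smaller buffers would otherwise sit
            // in system RAM and inflate pool_size; free them before growing.
            for (auto it = bufs.begin(); it != bufs.end(); ) {
                if (!it->in_use && it->size < size) {
                    std::free(it->ptr);      // stand-in for cudaFree
                    pool_size -= it->size;
                    it = bufs.erase(it);
                } else {
                    ++it;
                }
            }
        }
        void * ptr = std::malloc(size);      // stand-in for cudaMalloc[Managed]
        bufs.push_back({ptr, size, true});
        pool_size += size;
        return ptr;
    }
};
```

With the flag unset, `grow` never touches existing buffers, which matches the claim that the default path is unchanged.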

2. GGML_CUDA_ALLOC_FALLBACK_MANAGED=1 (opt-in)

When set, ggml_backend_cuda_buffer_type_alloc_buffer retries once with managed memory if plain cudaMalloc OOMs, so the overflowing buffer (typically the KV cache at high context) spills instead of aborting. With the flag unset, OOM behavior is unchanged: the original error is logged and nullptr is returned. Buffers that fit already landed on device via the first attempt; only the one that overflowed pages on demand, so the common TG working set stays on GPU.
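The retry-once shape can be sketched as follows. This is an illustration, not the PR's implementation: the function name `alloc_with_fallback` is hypothetical, and the allocators are injected as function pointers so the flow runs without a GPU (stand-ins for `cudaMalloc` and `cudaMallocManaged`).

```cpp
// Hypothetical sketch of the opt-in OOM fallback: try the plain device
// allocation first; only if it fails and GGML_CUDA_ALLOC_FALLBACK_MANAGED=1
// is set, retry exactly once with a managed allocation.
#include <cstdlib>

typedef void * (*alloc_fn)(size_t);

// Returns the buffer, or nullptr when the device alloc fails and the
// fallback is either disabled or also fails.
void * alloc_with_fallback(size_t size, alloc_fn device_alloc, alloc_fn managed_alloc) {
    void * ptr = device_alloc(size);             // stand-in for cudaMalloc
    if (ptr != nullptr) {
        return ptr;                              // common case: fits on device
    }
    const char * flag = std::getenv("GGML_CUDA_ALLOC_FALLBACK_MANAGED");
    if (flag == nullptr || flag[0] != '1') {
        return nullptr;                          // master behavior: log + return nullptr
    }
    return managed_alloc(size);                  // stand-in for cudaMallocManaged
}
```

Because the fallback only runs after the device attempt fails, buffers that fit still land on device, and only the overflowing buffer ends up in managed memory.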

Motivation

Reproduces with Qwen3.6-35B-A3B (Unsloth IQ3_S) at -c 262144 -ctk q4_0 -ctv q4_0 -b 1024 -ub 128 -fa on 16 GB gfx1201. Previously crashed at ~41k PP tokens:

  • ggml_cuda_pool_leg::alloc at ggml-cuda.cu:410 OOMs during FA tile scratch growth.
  • ggml_backend_cuda_buffer_type_alloc_buffer OOMs on KV cache alloc.

Refs #21376.

Behavior matrix

Flags                               | Pool grow path                        | alloc_buffer OOM
(none)                              | cudaMalloc                            | log + return nullptr (master behavior)
GGML_CUDA_POOL_USE_MANAGED=1        | cudaMallocManaged + drop idle buffers | log + return nullptr
GGML_CUDA_ALLOC_FALLBACK_MANAGED=1  | cudaMalloc                            | retry once with cudaMallocManaged
both                                | cudaMallocManaged + drop idle buffers | retry once with cudaMallocManaged

Test plan

  • Build on ROCm (gfx1201), run Qwen3.6-35B-A3B IQ3_S at the repro params with both flags — no longer aborts, PP continues past 41k.
  • Sanity run on CUDA (NVIDIA) with both flags unset to confirm no behavior change.
  • Sanity run with each flag individually.

Notes

Draft while I gather more numbers across configurations and GPU vendors. Feedback welcome on: (a) whether the idle-buffer drop heuristic on the managed grow path should be its own flag, (b) whether the two env vars should be merged into one, and (c) naming.

icex added 2 commits April 20, 2026 11:15
Adds two opt-in mitigations for OOM on gfx1201 (16 GB, ROCm without HIP
VMM) and other small-VRAM AMD cards when running long-context models.

1. GGML_CUDA_POOL_USE_MANAGED=1: route the legacy pool's grow path
   through cudaMallocManaged. On the grow path we also free the pool's
   idle smaller buffers, since under managed memory they otherwise sit
   in system RAM inflating pool_size and trigger PCIe page-faults during
   later TG.

2. alloc_buffer OOM fallback: if plain cudaMalloc fails, retry once with
   managed memory so the overflowing buffer (typically the KV cache at
   high context) can spill to system RAM instead of aborting. Buffers
   that fit still land on device from the first attempt; only the one
   that overflowed pages on demand.

Reproduces with Qwen3.6-35B-A3B (Unsloth IQ3_S) at -c 262144 -ctk q4_0
-ctv q4_0 -b 1024 -ub 128 -fa on 16 GB gfx1201, which previously OOM'd
at ~41k PP tokens in ggml_cuda_pool_leg::alloc during FA tile scratch
growth and in ggml_backend_cuda_buffer_type_alloc_buffer on KV alloc.

Refs ggml-org#21376
Default behavior (no flag set) is now byte-for-byte identical to
master: plain cudaMalloc, and on OOM log the original error and
return nullptr. The managed-memory spill path now requires explicit
opt-in via GGML_CUDA_ALLOC_FALLBACK_MANAGED=1, matching the stance
that ROCm does not spill to system RAM by design.

GGML_CUDA_POOL_USE_MANAGED remains a separate opt-in for the pool
grow path.

ggml-gh-bot bot commented Apr 20, 2026

Hi @icex, thanks for your contribution!

Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:

  • AI-generated content: This project does not accept PRs, descriptions or commit messages that are fully or predominantly AI-generated. If you have used AI to assist you in writing code, please make sure to disclose that explicitly.

Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.

@JohannesGaessler
Contributor

According to the llama.cpp AI usage policy:

It is strictly prohibited to use AI to write your posts for you (bug reports, feature requests, pull request descriptions, Github discussions, responding to humans, ...).

