
CUDA: opt-in managed pool + OOM fallback for alloc_buffer #22158

Closed

icex wants to merge 2 commits into ggml-org:master from icex:gfx1201-kv-spill-fallback

Conversation


@icex icex commented Apr 20, 2026

Summary

Two opt-in escape hatches for users hitting OOM on small-VRAM AMD cards running long-context models. Not a bug fix — as noted on #21376, ROCm doesn't spill to system RAM by design, and this PR doesn't change that default. Both paths are gated behind env vars; with neither set, behavior is byte-for-byte identical to master.

Motivating case: gfx1201 (16 GB) on ROCm without HIP VMM (system_info reports NO_VMM=1), where the legacy pool grows monotonically and the user has no way to trade raw throughput for "don't abort."

1. GGML_CUDA_POOL_USE_MANAGED=1 (opt-in)

Routes the legacy pool's grow path through cudaMallocManaged (→ hipMallocManaged on ROCm). On the grow path we also free the pool's idle smaller buffers — under managed memory those otherwise sit in system RAM, inflate pool_size, and cause PCIe page-faults during later TG.
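The grow-path heuristic can be illustrated with a minimal sketch. This is not the PR's actual code: the pool structure and allocation calls are stand-ins (plain `malloc`/`free` instead of `cudaMallocManaged`/`cudaFree`), kept only to show the control flow of dropping idle smaller buffers before growing under the managed path.

```cpp
// Hypothetical sketch of the managed grow path described above. Allocation
// is stubbed with malloc so the logic runs without a GPU; the real code
// would call cudaMalloc / cudaMallocManaged and cudaFree instead.
#include <cstdlib>
#include <vector>

struct pool_buf { void * ptr; size_t size; bool in_use; };

struct legacy_pool {
    std::vector<pool_buf> bufs;
    size_t pool_size = 0;
    // Opt-in via env var, mirroring GGML_CUDA_POOL_USE_MANAGED.
    bool use_managed = std::getenv("GGML_CUDA_POOL_USE_MANAGED") != nullptr;

    void * grow(size_t size) {
        if (use_managed) {
            // Under managed memory, idle smaller buffers would otherwise sit
            // in system RAM and inflate pool_size; free them before growing.
            for (auto it = bufs.begin(); it != bufs.end(); ) {
                if (!it->in_use && it->size < size) {
                    std::free(it->ptr);      // stand-in for cudaFree
                    pool_size -= it->size;
                    it = bufs.erase(it);
                } else {
                    ++it;
                }
            }
        }
        void * ptr = std::malloc(size);      // stand-in for cudaMalloc[Managed]
        bufs.push_back({ptr, size, true});
        pool_size += size;
        return ptr;
    }
};
```

With the flag unset, `grow` never touches existing buffers, which matches the claim that the default path is unchanged.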

2. GGML_CUDA_ALLOC_FALLBACK_MANAGED=1 (opt-in)

When set, ggml_backend_cuda_buffer_type_alloc_buffer retries once with managed memory if plain cudaMalloc OOMs, so the overflowing buffer (typically the KV cache at high context) spills instead of aborting. With the flag unset, OOM behavior is unchanged: the original error is logged and nullptr is returned. Buffers that fit already landed on device via the first attempt; only the one that overflowed pages on demand, so the common TG working set stays on GPU.
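The retry-once shape can be sketched as follows. This is an illustration, not the PR's implementation: the function name `alloc_with_fallback` is hypothetical, and the allocators are injected as function pointers so the flow runs without a GPU (stand-ins for `cudaMalloc` and `cudaMallocManaged`).

```cpp
// Hypothetical sketch of the opt-in OOM fallback: try the plain device
// allocation first; only if it fails and GGML_CUDA_ALLOC_FALLBACK_MANAGED=1
// is set, retry exactly once with a managed allocation.
#include <cstdlib>

typedef void * (*alloc_fn)(size_t);

// Returns the buffer, or nullptr when the device alloc fails and the
// fallback is either disabled or also fails.
void * alloc_with_fallback(size_t size, alloc_fn device_alloc, alloc_fn managed_alloc) {
    void * ptr = device_alloc(size);             // stand-in for cudaMalloc
    if (ptr != nullptr) {
        return ptr;                              // common case: fits on device
    }
    const char * flag = std::getenv("GGML_CUDA_ALLOC_FALLBACK_MANAGED");
    if (flag == nullptr || flag[0] != '1') {
        return nullptr;                          // master behavior: log + return nullptr
    }
    return managed_alloc(size);                  // stand-in for cudaMallocManaged
}
```

Because the fallback only runs after the device attempt fails, buffers that fit still land on device, and only the overflowing buffer ends up in managed memory.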

Motivation

Reproduces with Qwen3.6-35B-A3B (Unsloth IQ3_S) at -c 262144 -ctk q4_0 -ctv q4_0 -b 1024 -ub 128 -fa on 16 GB gfx1201. Previously crashed at ~41k PP tokens:

  • ggml_cuda_pool_leg::alloc at ggml-cuda.cu:410 OOMs during FA tile scratch growth.
  • ggml_backend_cuda_buffer_type_alloc_buffer OOMs on KV cache alloc.

Refs #21376.

Behavior matrix

Flags                               | Pool grow path                        | alloc_buffer OOM
(none)                              | cudaMalloc                            | log + return nullptr (master behavior)
GGML_CUDA_POOL_USE_MANAGED=1        | cudaMallocManaged + drop idle buffers | log + return nullptr
GGML_CUDA_ALLOC_FALLBACK_MANAGED=1  | cudaMalloc                            | retry once with cudaMallocManaged
both                                | cudaMallocManaged + drop idle buffers | retry once with cudaMallocManaged

Test plan

  • Build on ROCm (gfx1201), run Qwen3.6-35B-A3B IQ3_S at the repro params with both flags — no longer aborts, PP continues past 41k.
  • Sanity run on CUDA (NVIDIA) with both flags unset to confirm no behavior change.
  • Sanity run with each flag individually.

Notes

Draft while I gather more numbers across configurations and GPU vendors. Feedback welcome on: (a) whether the idle-buffer drop heuristic on the managed grow path should be its own flag, (b) whether the two env vars should be merged into one, and (c) naming.

icex added 2 commits April 20, 2026 11:15
Adds two opt-in mitigations for OOM on gfx1201 (16 GB, ROCm without HIP
VMM) and other small-VRAM AMD cards when running long-context models.

1. GGML_CUDA_POOL_USE_MANAGED=1: route the legacy pool's grow path
   through cudaMallocManaged. On the grow path we also free the pool's
   idle smaller buffers, since under managed memory they otherwise sit
   in system RAM inflating pool_size and trigger PCIe page-faults during
   later TG.

2. alloc_buffer OOM fallback: if plain cudaMalloc fails, retry once with
   managed memory so the overflowing buffer (typically the KV cache at
   high context) can spill to system RAM instead of aborting. Buffers
   that fit still land on device from the first attempt; only the one
   that overflowed pages on demand.

Reproduces with Qwen3.6-35B-A3B (Unsloth IQ3_S) at -c 262144 -ctk q4_0
-ctv q4_0 -b 1024 -ub 128 -fa on 16 GB gfx1201, which previously OOM'd
at ~41k PP tokens in ggml_cuda_pool_leg::alloc during FA tile scratch
growth and in ggml_backend_cuda_buffer_type_alloc_buffer on KV alloc.

Refs ggml-org#21376
Default behavior (no flag set) is now byte-for-byte identical to
master: plain cudaMalloc, and on OOM log the original error and
return nullptr. The managed-memory spill path now requires explicit
opt-in via GGML_CUDA_ALLOC_FALLBACK_MANAGED=1, matching the stance
that ROCm does not spill to system RAM by design.

GGML_CUDA_POOL_USE_MANAGED remains a separate opt-in for the pool
grow path.

ggml-gh-bot bot commented Apr 20, 2026

Hi @icex, thanks for your contribution!

Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:

  • AI-generated content: This project does not accept PRs, descriptions or commit messages that are fully or predominantly AI-generated. If you have used AI to assist you in writing code, please make sure to disclose that explicitly.

Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.

@JohannesGaessler
Contributor

According to the llama.cpp AI usage policy:

It is strictly prohibited to use AI to write your posts for you (bug reports, feature requests, pull request descriptions, Github discussions, responding to humans, ...).

