CUDA: opt-in managed pool + OOM fallback for alloc_buffer #22158
Closed
icex wants to merge 2 commits into ggml-org:master from
Conversation
Adds two opt-in mitigations for OOM on gfx1201 (16 GB, ROCm without HIP VMM) and other small-VRAM AMD cards when running long-context models.

1. `GGML_CUDA_POOL_USE_MANAGED=1`: route the legacy pool's grow path through `cudaMallocManaged`. On the grow path we also free the pool's idle smaller buffers, since under managed memory they otherwise sit in system RAM inflating `pool_size` and trigger PCIe page faults during later TG.
2. alloc_buffer OOM fallback: if plain `cudaMalloc` fails, retry once with managed memory so the overflowing buffer (typically the KV cache at high context) can spill to system RAM instead of aborting. Buffers that fit still land on device on the first attempt; only the one that overflowed pages on demand.

Reproduces with Qwen3.6-35B-A3B (Unsloth IQ3_S) at `-c 262144 -ctk q4_0 -ctv q4_0 -b 1024 -ub 128 -fa` on 16 GB gfx1201, which previously OOM'd at ~41k PP tokens in `ggml_cuda_pool_leg::alloc` during FA tile scratch growth and in `ggml_backend_cuda_buffer_type_alloc_buffer` on the KV alloc. Refs ggml-org#21376
Default behavior (no flag set) is now byte-for-byte identical to master: plain cudaMalloc, and on OOM log the original error and return nullptr. The managed-memory spill path now requires explicit opt-in via GGML_CUDA_ALLOC_FALLBACK_MANAGED=1, matching the stance that ROCm does not spill to system RAM by design. GGML_CUDA_POOL_USE_MANAGED remains a separate opt-in for the pool grow path.
Hi @icex, thanks for your contribution! Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:

Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.
Contributor

According to the llama.cpp AI usage policy:
Summary
Two opt-in escape hatches for users hitting OOM on small-VRAM AMD cards running long-context models. Not a bug fix — as noted on #21376, ROCm doesn't spill to system RAM by design, and this PR doesn't change that default. Both paths are gated behind env vars; with neither set, behavior is byte-for-byte identical to master.
Motivating case: gfx1201 (16 GB) on ROCm without HIP VMM (`system_info` reports `NO_VMM=1`), where the legacy pool grows monotonically and the user has no way to trade raw throughput for "don't abort."

1. `GGML_CUDA_POOL_USE_MANAGED=1` (opt-in)

   Routes the legacy pool's grow path through `cudaMallocManaged` (→ `hipMallocManaged` on ROCm). On the grow path we also free the pool's idle smaller buffers; under managed memory those otherwise sit in system RAM, inflate `pool_size`, and cause PCIe page faults during later TG.

2. `GGML_CUDA_ALLOC_FALLBACK_MANAGED=1` (opt-in)

   When set, `ggml_backend_cuda_buffer_type_alloc_buffer` retries once with managed memory if plain `cudaMalloc` OOMs, so the overflowing buffer (typically the KV cache at high context) spills instead of aborting. With the flag unset, OOM behavior is unchanged: log the original error and return `nullptr`. Buffers that fit already landed on device via the first attempt; only the one that overflowed pages on demand, so the common TG working set stays on GPU.

Motivation
Reproduces with Qwen3.6-35B-A3B (Unsloth IQ3_S) at `-c 262144 -ctk q4_0 -ctv q4_0 -b 1024 -ub 128 -fa` on 16 GB gfx1201. Previously crashed at ~41k PP tokens:

- `ggml_cuda_pool_leg::alloc` at `ggml-cuda.cu:410` OOMs during FA tile scratch growth.
- `ggml_backend_cuda_buffer_type_alloc_buffer` OOMs on the KV cache alloc.

Refs #21376.
Behavior matrix
| Flags set | Pool grow | `alloc_buffer` OOM |
| --- | --- | --- |
| (none) | `cudaMalloc` | log error, return `nullptr` |
| `GGML_CUDA_POOL_USE_MANAGED=1` | `cudaMallocManaged` + drop idle buffers | log error, return `nullptr` |
| `GGML_CUDA_ALLOC_FALLBACK_MANAGED=1` | `cudaMalloc` | retry with `cudaMallocManaged` |
| both | `cudaMallocManaged` + drop idle buffers | retry with `cudaMallocManaged` |

Test plan
Notes
Draft while I gather more numbers across configurations and GPU vendors. Feedback welcome on: (a) whether the idle-buffer drop heuristic on the managed grow path should be its own flag, (b) whether the two env vars should be merged into one, and (c) naming.