NIXL/UCX: enforce VRAM memtype hint query behavior and add tests#1536
NIXL/UCX: enforce VRAM memtype hint query behavior and add tests#1536yafshar wants to merge 7 commits intoai-dynamo:mainfrom
Conversation
- Add UCX backend parameter ucx_vram_memtype_hint with supported values: auto, none, cuda, cuda-managed, rocm, ze-device. - Resolve and validate the VRAM memtype hint once at context initialization using ucp_context_query and UCS_BIT_GET. - Apply the hint only to VRAM registrations in memReg and preserve the invariant that VRAM must not resolve to host memory. - Keep the auto policy conservative by enabling ZE hints only in ZE-only accelerator stacks. - Use ucs_memory_type_names for canonical naming and strict parsing. Co-authored-by: Daniel Socek daniel.socek@intel.com
Add user-facing documentation for UCX backend initialization options
and clarify VRAM memtype hint usage.
- docs/BackendGuide.md:
- document `ucx_error_handling_mode` and `ucx_vram_memtype_hint`
- docs/python_api.md:
- add a backend initialization parameters section with
`get_plugin_params("UCX")` and `create_backend("UCX", params)`
examples
- examples/device/ep/csrc/nixl_ep.cpp:
- add a commented example showing
`init_params["ucx_vram_memtype_hint"] = "auto"`
Fail backend creation when an explicit ucx_vram_memtype_hint is set but UCX memory type queries are not supported. Document accepted ucx_vram_memtype_hint values and the behavior on query failure. Add gtests covering default hint exposure and case-sensitive rejection of invalid hint values, and suppress expected error logs in negative-path tests.
|
👋 Hi yafshar! Thank you for contributing to ai-dynamo/nixl. Your PR reviewers will review your contribution then trigger the CI to test your changes. 🚀 |
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds a UCX backend initialization option Changes
Sequence Diagram(s)sequenceDiagram
actor User
participant Plugin as "Backend Plugin"
participant Context as "nixlUcxContext"
participant UCX as "UCX (ucp_context_query)"
User->>Plugin: get_plugin_params("UCX")
Plugin-->>User: returns params (ucx_vram_memtype_hint default = "auto")
User->>Plugin: create_backend("UCX", params)
Plugin->>Plugin: parse ucx_vram_memtype_hint -> policy
Plugin->>Context: construct(..., vram_mem_type_hint_policy)
Context->>UCX: ucp_context_query() -> supported memtype mask
alt policy == AUTO
UCX-->>Context: mask
Context->>Context: ucx_auto_detect_vram_hint(mask) -> maybe set resolved memtype
else policy == NONE
UCX-->>Context: mask
Context->>Context: leave hint unset
else explicit policy
UCX-->>Context: mask
Context->>Context: validate policy in mask (throw if unsupported)
end
Context-->>Plugin: constructed (resolved hint state)
User->>Plugin: memReg(VRAM_SEG)
Plugin->>Context: memReg -> include UCP_MEM_MAP_PARAM_FIELD_MEMORY_TYPE if resolved hint present
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/plugins/ucx/ucx_backend.cpp (1)
849-869:⚠️ Potential issue | 🟡 MinorRun clang-format on the new UCX hint parsing block.
This hunk is currently tripping the clang-format check, so CI stays red until it is reformatted.
As per coding guidelines, "Enforce formatting constraints: 4-space indentation, max 100 char lines, braces on same line (including if/for/try/catch), and break long function signatures/calls across lines."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/plugins/ucx/ucx_backend.cpp` around lines 849 - 869, Reformat the new UCX hint parsing block to satisfy clang-format: enforce 4-space indentation, max 100-char lines, braces on same line, and break long calls/initializers across lines; specifically adjust the nixl_ucx_vram_memtype_hint_t declaration and the vram_memtype_hint_it lookup/assignment so lines do not exceed 100 chars and align with 4-space indent, and reflow the uc = std::make_unique<nixlUcxContext>(...) call arguments (devs, init_params.enableProgTh, num_workers, init_params.syncMode, num_device_channels, engine_config, vram_memtype_hint_policy) onto separate indented lines so the call fits formatting rules, then run clang-format to apply the project's style.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/python_api.md`:
- Around line 60-66: Update the ucx_vram_memtype_hint docs to document the
additional failure path: besides the case where UCX context memory types cannot
be queried, explicitly mention that if UCX context is queried but does not
advertise the requested memory type (e.g., user selects
`cuda`/`rocm`/`ze-device` but UCX reports no such memtype), backend creation
will be rejected for those explicit hints; reference the same behavior described
in BackendGuide.md and the ucx_utils.cpp logic to keep the doc consistent with
implementation.
In `@test/gtest/error_handling.cpp`:
- Around line 26-29: The new pointer constant declarations
(ucx_err_handling_mode_key, ucx_err_handling_mode_peer,
ucx_vram_memtype_hint_key, ucx_vram_memtype_hint_auto) are misformatted and
failing clang-format; reformat them to match the project's pointer spacing style
(adjust spacing around the '*' to be consistent with existing constants) and run
clang-format to apply the repository style so the declarations conform to the
file's formatting rules.
- Around line 441-484: Set the plugin directory and add a runtime-negative case:
ensure these tests set the same NIXL_PLUGIN_DIR used by TestErrorHandling (so
UCX is discovered) before calling agent.getAvailPlugins (mirror the setup at
TestErrorHandling lines that export NIXL_PLUGIN_DIR), and replace the
rejected-from-parser hint ("CUDA") with two things: 1) keep the existing
RejectsCaseMismatchedVramHint test but set the env/plugin dir so it doesn't
GTEST_SKIP, and 2) add a new negative test that sets
params[nixl::ucx_vram_memtype_hint_key] to a syntactically valid hint string
(one accepted by nixl::ucx_vram_memtype_hint_from_string) so that code reaches
nixlUcxContext::resolveMemoryTypeConfig() and triggers the intended runtime
failure path; locate params by the symbol params and the parser by
nixl::ucx_vram_memtype_hint_from_string and the runtime logic by
nixlUcxContext::resolveMemoryTypeConfig when making changes.
---
Outside diff comments:
In `@src/plugins/ucx/ucx_backend.cpp`:
- Around line 849-869: Reformat the new UCX hint parsing block to satisfy
clang-format: enforce 4-space indentation, max 100-char lines, braces on same
line, and break long calls/initializers across lines; specifically adjust the
nixl_ucx_vram_memtype_hint_t declaration and the vram_memtype_hint_it
lookup/assignment so lines do not exceed 100 chars and align with 4-space
indent, and reflow the uc = std::make_unique<nixlUcxContext>(...) call arguments
(devs, init_params.enableProgTh, num_workers, init_params.syncMode,
num_device_channels, engine_config, vram_memtype_hint_policy) onto separate
indented lines so the call fits formatting rules, then run clang-format to apply
the project's style.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 0912178a-5e1f-4234-9199-e54269f2b338
📒 Files selected for processing (7)
docs/BackendGuide.mddocs/python_api.mdexamples/device/ep/csrc/nixl_ep.cppsrc/plugins/ucx/ucx_backend.cppsrc/plugins/ucx/ucx_utils.cppsrc/plugins/ucx/ucx_utils.htest/gtest/error_handling.cpp
- document explicit-hint backend rejection when UCX context memory types cannot be queried - document explicit-hint backend rejection when requested VRAM memtype is not advertised by UCX - align wording with BackendGuide and resolveMemoryTypeConfig behavior - normalize pointer spacing for new UCX hint constants in gtest - ensure standalone UcxBackendParams tests set NIXL_PLUGIN_DIR before plugin discovery - keep parser-negative coverage for case-mismatched hint value (CUDA) - add runtime-negative coverage with valid hints (cuda, cuda-managed, rocm, ze-device) to exercise resolve logic and validate unsupported memtype rejection - apply clang-format to the changed UCX backend
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@test/gtest/error_handling.cpp`:
- Around line 502-544: The loop currently only sets
found_runtime_unsupported_path when lig_expected_runtime_unsupported reports a
match; update the failure handling after calling agent.createBackend(...) so
that if status != NIXL_SUCCESS and backend==nullptr you also treat a matched
lig_expected_runtime_query_failure as a covered runtime-negative path: check
lig_expected_runtime_query_failure.getIgnoredCount() > 0 (in addition to the
existing lig_expected_runtime_unsupported check) and set
found_runtime_unsupported_path = true and break; reference this change around
the createBackend/status/backend handling block.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 264439e5-867b-47c3-9183-806e663789e7
📒 Files selected for processing (3)
docs/python_api.mdsrc/plugins/ucx/ucx_backend.cpptest/gtest/error_handling.cpp
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/plugins/ucx/ucx_utils.cpp`:
- Around line 467-475: In ucx_memory_type_to_string, the bounds check uses
`mem_type_index > static_cast<unsigned>(UCS_MEMORY_TYPE_LAST)` which is
off-by-one; change the comparison to `>=
static_cast<unsigned>(UCS_MEMORY_TYPE_LAST)` so that when mem_type ==
UCS_MEMORY_TYPE_LAST we return "invalid" instead of indexing
`ucs_memory_type_names` out-of-bounds; update the check around the
`mem_type_index` variable in the ucx_memory_type_to_string function accordingly.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: b92f7346-d303-43fe-985d-17b2ad7e6791
📒 Files selected for processing (2)
src/plugins/ucx/ucx_utils.cppsrc/plugins/ucx/ucx_utils.h
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/plugins/ucx/ucx_utils.cpp (1)
579-586:⚠️ Potential issue | 🟠 MajorConstructor leaks UCX context if
resolveMemoryTypeConfig()throws.At line 579,
ucp_init()acquires a UCX context. At line 585,resolveMemoryTypeConfig()can throw. If it does, the constructor never completes, the destructor is not invoked, anducp_cleanup(ctx)at line 589 never runs.Suggested fix
const auto status = ucp_init(&ucp_params, config.getUcpConfig(), &ctx); if (status != UCS_OK) { throw std::runtime_error("Failed to create UCX context: " + std::string(ucs_status_string(status))); } - - resolveMemoryTypeConfig(); + try { + resolveMemoryTypeConfig(); + } + catch (...) { + ucp_cleanup(ctx); + ctx = nullptr; + throw; + } }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/plugins/ucx/ucx_utils.cpp` around lines 579 - 586, The constructor currently calls ucp_init(...) to set ctx and then calls resolveMemoryTypeConfig(), which may throw and leak the UCX context; fix by ensuring ctx is cleaned up on exceptional exit—either wrap ctx in a RAII guard (e.g., a small local scope guard or std::unique_ptr with a custom deleter that calls ucp_cleanup) immediately after successful ucp_init, or surround resolveMemoryTypeConfig() with a try/catch that calls ucp_cleanup(ctx) and rethrows; update symbols: ucp_init, ctx, resolveMemoryTypeConfig, and ucp_cleanup so ctx is always cleaned if construction fails.
♻️ Duplicate comments (1)
test/gtest/error_handling.cpp (1)
532-546:⚠️ Potential issue | 🟠 MajorUnexpected backend failures are still silently converted into skip.
When
createBackend()fails but neither expected runtime log is matched, the test currently keeps looping and can end inGTEST_SKIP()instead of failing. This can hide real regressions in the new fail-fast path.Suggested tightening
- bool found_runtime_unsupported_path = false; + bool found_expected_runtime_failure = false; @@ nixlBackendH *backend = nullptr; const auto status = agent.createBackend(*it, params, backend); if (status != NIXL_SUCCESS) { EXPECT_EQ(nullptr, backend); - if ((backend == nullptr) && - ((lig_expected_runtime_unsupported.getIgnoredCount() > 0) || - (lig_expected_runtime_query_failure.getIgnoredCount() > 0))) { - found_runtime_unsupported_path = true; - break; - } + const bool saw_runtime_unsupported = + lig_expected_runtime_unsupported.getIgnoredCount() > 0; + const bool saw_runtime_query_failure = + lig_expected_runtime_query_failure.getIgnoredCount() > 0; + ASSERT_TRUE(saw_runtime_unsupported || saw_runtime_query_failure) + << "createBackend() failed for hint '" << hint + << "' without expected runtime failure path"; + found_expected_runtime_failure = true; + break; } @@ - if (!found_runtime_unsupported_path) { + if (!found_expected_runtime_failure) { GTEST_SKIP() - << "No explicit hint exercised the unsupported-memtype runtime path on this system"; + << "No explicit hint exercised an expected runtime failure path on this system"; }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/gtest/error_handling.cpp` around lines 532 - 546, When createBackend() fails (status != NIXL_SUCCESS) and backend == nullptr but neither lig_expected_runtime_unsupported nor lig_expected_runtime_query_failure show ignored hits, don't allow the loop to continue into a GTEST_SKIP; instead fail the test immediately. In the block guarded by "if (status != NIXL_SUCCESS)" check the condition where backend == nullptr and both lig_expected_runtime_unsupported.getIgnoredCount() == 0 and lig_expected_runtime_query_failure.getIgnoredCount() == 0, and call GTEST_FAIL() (or ASSERT/FAIL) with a clear message and exit the test rather than setting found_runtime_unsupported_path or continuing the loop; keep the existing path that sets found_runtime_unsupported_path when an expected ignored count is present.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@src/plugins/ucx/ucx_utils.cpp`:
- Around line 579-586: The constructor currently calls ucp_init(...) to set ctx
and then calls resolveMemoryTypeConfig(), which may throw and leak the UCX
context; fix by ensuring ctx is cleaned up on exceptional exit—either wrap ctx
in a RAII guard (e.g., a small local scope guard or std::unique_ptr with a
custom deleter that calls ucp_cleanup) immediately after successful ucp_init, or
surround resolveMemoryTypeConfig() with a try/catch that calls ucp_cleanup(ctx)
and rethrows; update symbols: ucp_init, ctx, resolveMemoryTypeConfig, and
ucp_cleanup so ctx is always cleaned if construction fails.
---
Duplicate comments:
In `@test/gtest/error_handling.cpp`:
- Around line 532-546: When createBackend() fails (status != NIXL_SUCCESS) and
backend == nullptr but neither lig_expected_runtime_unsupported nor
lig_expected_runtime_query_failure show ignored hits, don't allow the loop to
continue into a GTEST_SKIP; instead fail the test immediately. In the block
guarded by "if (status != NIXL_SUCCESS)" check the condition where backend ==
nullptr and both lig_expected_runtime_unsupported.getIgnoredCount() == 0 and
lig_expected_runtime_query_failure.getIgnoredCount() == 0, and call GTEST_FAIL()
(or ASSERT/FAIL) with a clear message and exit the test rather than setting
found_runtime_unsupported_path or continuing the loop; keep the existing path
that sets found_runtime_unsupported_path when an expected ignored count is
present.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 9db16a46-f3e7-4b09-a374-87bd004606fd
📒 Files selected for processing (2)
src/plugins/ucx/ucx_utils.cpptest/gtest/error_handling.cpp
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/plugins/ucx/ucx_utils.cpp`:
- Around line 584-590: The catch qualifier in the try/catch around
resolveMemoryTypeConfig() is not on its own line; update the block in the
function containing resolveMemoryTypeConfig() so the catch is on a separate line
(i.e., place "catch (...)" on its own line), preserving the ucp_cleanup(ctx)
call and rethrow behavior that currently follows in the catch block.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 10865e65-3ecd-40b1-9e95-60fbfb0c3c14
📒 Files selected for processing (2)
src/plugins/ucx/ucx_utils.cpptest/gtest/error_handling.cpp
3380f8e to
9b019c7
Compare
brminich
left a comment
There was a problem hiding this comment.
why is it needed? Isn't it enough to check if GPU memory type is not detected as host?
I'd not expectd that rocm memory can be detected as cuda
This PR is needed beyond any single platform: memory-type auto-detection can be unreliable or ambiguous when allocations come from external runtimes/allocators, cross-context boundaries, or incomplete cache coverage. In those cases, relying on automatic detection alone can misclassify VRAM and fail registration. This NIXL change was motivated by the discussion in UCX PR 11332: openucx/ucx#11332. We are not changing UCX slowpath/cache semantics here; instead, we add an integration-level explicit memtype hint with early capability validation to reduce reliance on auto-detection in ambiguous runtime environments. |
What?
This PR hardens UCX VRAM memtype hint handling and aligns documentation and
tests with runtime behavior.
ucx_vram_memtype_hintis setbut UCX context memory types cannot be queried.
autoandnoneas non-failing modes that proceed without hintingwhen memory-type queries are unavailable.
ucx_vram_memtype_hint:auto,none,cuda,cuda-managed,rocm,ze-device.case-mismatched values (e.g.
CUDA).global log-failure checks.
Why?
Previous behavior around explicit VRAM memtype hints was not fully documented
or covered by tests, which could lead to confusion and fragile failure paths.
This change improves correctness and predictability by:
memory-type capabilities.
How?
memory-type query for explicit hint modes.
autoandnonepermissive when queries are unavailable.failure semantics.
ucx_vram_memtype_hintexposure (auto)Testing performed
Credits
Co-authored-by: Daniel Socek daniel.socek@intel.com
Summary by CodeRabbit
Documentation
New Features
Tests