UCP/CORE: Detect memory type on cache miss with non-host detect MDs#11332
yafshar wants to merge 1 commit into openucx:master
On memtype cache miss, avoid assuming host memory when non-host detect-capable MDs are present. Run the detection slowpath first to determine memory type and sys_dev. This prevents incorrect transport selection on cold-cache paths (e.g. host paths chosen for accelerator memory). Add has_non_host_detect_md flag to ucp_context and use it to trigger slowpath detection instead of immediate host fallback.
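The decision described above can be sketched as a small standalone model. This is only an illustration under stated assumptions: every type and function name below (`model_ctx_t`, `model_detect`, and so on) is hypothetical and stands in for the real UCX code, the "cache" tracks a single accelerator range, and the slowpath is reduced to a range check plus a call counter.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins; none of these are real UCX definitions. */
typedef enum { MODEL_MEM_HOST, MODEL_MEM_ACCEL } model_mem_type_t;

typedef struct {
    int       has_non_host_detect_md; /* models the flag this PR adds */
    int       cache_warm;             /* 1 once the accelerator range is cached */
    unsigned  slowpath_calls;         /* how many times the slowpath ran */
    uintptr_t accel_base;             /* start of the single accelerator range */
    size_t    accel_len;              /* length of that range in bytes */
} model_ctx_t;

static int model_in_accel_range(const model_ctx_t *ctx, uintptr_t addr)
{
    return (addr >= ctx->accel_base) &&
           ((size_t)(addr - ctx->accel_base) < ctx->accel_len);
}

/* Models the memtype cache, which tracks only non-host memory.
 * Returns 1 on a hit and 0 on a miss (the NO_ELEM case). */
static int model_cache_lookup(const model_ctx_t *ctx, uintptr_t addr,
                              model_mem_type_t *type)
{
    if (ctx->cache_warm && model_in_accel_range(ctx, addr)) {
        *type = MODEL_MEM_ACCEL;
        return 1;
    }
    return 0;
}

/* Models the detection slowpath: ask the memory domains directly. */
static model_mem_type_t model_slowpath_detect(model_ctx_t *ctx, uintptr_t addr)
{
    ctx->slowpath_calls++;
    return model_in_accel_range(ctx, addr) ? MODEL_MEM_ACCEL : MODEL_MEM_HOST;
}

/* On a cache miss, run the slowpath only when a non-host detect-capable MD
 * is present; otherwise keep the immediate host fallback. */
static model_mem_type_t model_detect(model_ctx_t *ctx, uintptr_t addr)
{
    model_mem_type_t type;

    assert(ctx != NULL);
    if (model_cache_lookup(ctx, addr, &type)) {
        return type;
    }
    if (ctx->has_non_host_detect_md) {
        return model_slowpath_detect(ctx, addr);
    }
    return MODEL_MEM_HOST;
}
```

In this model, a cold cache with the flag cleared classifies an accelerator address as host memory, which is the misclassification the PR targets. With the flag set, every miss goes through the slowpath, including misses for plain host addresses.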
@@ -683,6 +684,14 @@ ucp_memory_detect_internal(ucp_context_h context, const void *address,

    status = ucs_memtype_cache_lookup(address, length, mem_info);
    if (ucs_likely(status == UCS_ERR_NO_ELEM)) {
Do you know which memory allocator is being used for such unknown memory, and would it make sense to add a hook under src/ucm instead? As far as I understand, the slow path was meant to be used when the memtype cache is disabled.
Unknown memtype here is not a specific allocator. It can happen even with memtype cache enabled, for example when UCM reports UNKNOWN for existing allocations or paths it cannot classify immediately, or when cache coverage is incomplete for the queried range.
Because of that, the internal slowpath is not only for the cache-disabled case. It is the correctness fallback for unknown or non-covered entries while cache is active.
Adding a hook under src/ucm is useful only if we identify a concrete allocator/runtime path that currently bypasses UCM memtype events. That may reduce slowpath frequency, but it will not remove the need for fallback in cross-context cases such as separate L0 contexts (for example PyTorch or SYCL vs UCX).
Thanks for the details. I also think the memtype cache only tracks non-host memory, so a missing element currently means host memory. If so, running the slowpath for those cases could have a performance impact. For all the cases you mention, maybe the memtype could be passed along with the pointer?
Thanks, that makes sense. I agree we should avoid per-call warnings in this path, but we can add lightweight observability: track slowpath hits (miss and unknown cases) in the existing UCP stats tree using UCS_STATS_NODE_DECLARE, and emit a one-time or end-of-run summary hint when the counters are non-zero. That gives users actionable feedback to pass explicit memtype hints without adding log noise to the hot path.
Also agreed that passing memtype with the pointer is the preferred fix at the application boundary. In our NIXL/Dynamo integration we already do this for PyTorch GPU buffers by setting UCP_MEM_MAP_PARAM_FIELD_MEMORY_TYPE to UCS_MEMORY_TYPE_ZE_DEVICE, and in that path detection is bypassed.
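The counting idea above can be sketched with plain counters. This is a minimal model, not the UCS stats implementation: the names (`model_stats_t`, `model_stats_hit`, `model_stats_summary`) and the message text are hypothetical, and a real change would presumably use the existing stats tree instead.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical counter IDs for the two slowpath entry reasons. */
enum {
    MODEL_STAT_SLOWPATH_MISS,    /* cache lookup found no element */
    MODEL_STAT_SLOWPATH_UNKNOWN, /* cache entry had an unknown memory type */
    MODEL_STAT_LAST
};

typedef struct {
    unsigned long counters[MODEL_STAT_LAST];
    int           summary_printed; /* ensures the hint is emitted only once */
} model_stats_t;

/* Hot path: a single increment, no logging. */
static void model_stats_hit(model_stats_t *stats, int id)
{
    assert((id >= 0) && (id < MODEL_STAT_LAST));
    stats->counters[id]++;
}

/* End-of-run (or one-time) summary. Returns 1 if a hint was written and
 * 0 if there was nothing to report or the hint was already emitted. */
static int model_stats_summary(model_stats_t *stats, FILE *out)
{
    unsigned long miss    = stats->counters[MODEL_STAT_SLOWPATH_MISS];
    unsigned long unknown = stats->counters[MODEL_STAT_SLOWPATH_UNKNOWN];

    if (stats->summary_printed || ((miss + unknown) == 0)) {
        return 0;
    }

    fprintf(out,
            "memtype detection slowpath hits: miss=%lu unknown=%lu; "
            "consider passing an explicit memory type\n",
            miss, unknown);
    stats->summary_printed = 1;
    return 1;
}
```

Keeping the hot path to a bare increment is the point of the design: the cost of reporting is paid once, outside the detection path. The same counters would also show how often host-memory lookups reach the slowpath.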
I think there is the logging aspect, but also the implied uct_md_mem_query() call, which could hurt performance in the host memory case: it would be called repeatedly, because host memory is never in the memtype cache.
I agree we should avoid changing NO_ELEM semantics globally, due to the performance risk on host-heavy paths. I will convert this PR to a draft for now.