
UCP/CORE: Detect memory type on cache miss with non-host detect MDs #11332

Draft
yafshar wants to merge 1 commit into openucx:master from intel-staging:fix/ucp-core-memtype-cache-miss


yafshar commented Apr 8, 2026

What?

  • On memtype cache miss, avoid assuming host memory when detect-capable MDs supporting non-host memory types are present.
  • Add a context-level flag indicating whether any detect-capable MD supports non-host memory types.
  • Use this flag to choose cache-miss behavior: run slowpath detection or fall back to host memory.

Why?

  • External runtimes may allocate accelerator memory outside UCX-visible contexts, resulting in missing memtype cache entries.
  • Falling back to host memory on cache miss can select host-only transports for accelerator pointers, leading to incorrect behavior or runtime failures.
  • Running slowpath detection when non-host detection is possible prevents wrong transport and protocol selection.

How?

  • Set the new context flag during resource discovery when any MD advertises non-host detect support.
  • On memtype cache miss, invoke slowpath detection when the flag is set.
  • Slowpath queries detect-capable MDs to resolve memory type and sys_dev before transport and protocol selection.
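
The cache-miss decision described above can be sketched as a small standalone model. This is not the UCX implementation: every `demo_*` name is hypothetical, the "slowpath" is a stand-in for querying detect-capable MDs, and only the flag name `has_non_host_detect_md` comes from this PR.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins for UCX types; all demo_* names are hypothetical. */
typedef enum {
    DEMO_MEM_HOST,
    DEMO_MEM_ACCEL
} demo_mem_type_t;

typedef struct {
    bool      has_non_host_detect_md; /* flag name taken from the PR */
    uintptr_t accel_base;             /* range the toy slowpath treats as accelerator */
    size_t    accel_len;
    unsigned  slowpath_calls;         /* number of slowpath invocations */
} demo_context_t;

/* Stand-in for querying detect-capable MDs: classifies by address range. */
static demo_mem_type_t demo_slowpath_detect(demo_context_t *ctx,
                                            const void *address)
{
    uintptr_t p = (uintptr_t)address;

    ctx->slowpath_calls++;
    if ((p >= ctx->accel_base) && ((p - ctx->accel_base) < ctx->accel_len)) {
        return DEMO_MEM_ACCEL;
    }
    return DEMO_MEM_HOST;
}

/* Called after a memtype cache miss: run the slowpath only if some MD can
 * detect non-host memory, otherwise keep the host fallback. */
static demo_mem_type_t demo_detect_on_miss(demo_context_t *ctx,
                                           const void *address)
{
    assert(ctx != NULL);

    if (ctx->has_non_host_detect_md) {
        return demo_slowpath_detect(ctx, address);
    }
    return DEMO_MEM_HOST;
}
```

With the flag set, a cold-cache accelerator pointer is classified by the slowpath; with the flag clear, the lookup falls back to host without any slowpath call.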

On memtype cache miss, avoid assuming host memory when non-host detect-capable
MDs are present. Run the detection slowpath first to determine memory type
and sys_dev.

This prevents incorrect transport selection on cold-cache paths (e.g. host
paths chosen for accelerator memory).

Add has_non_host_detect_md flag to ucp_context and use it to trigger slowpath
detection instead of immediate host fallback.
yafshar marked this pull request as ready for review April 8, 2026 21:30
@@ -683,6 +684,14 @@ ucp_memory_detect_internal(ucp_context_h context, const void *address,

status = ucs_memtype_cache_lookup(address, length, mem_info);
if (ucs_likely(status == UCS_ERR_NO_ELEM)) {
Contributor

Do you know which memory allocator is being used for such unknown memory, and would it make sense to add a hook under src/ucm instead? AFAIU, the slow path was meant to be used when the memtype cache is disabled.

Contributor Author

> Do you know which memory allocator is being used for such unknown memory, and would it make sense to add a hook under src/ucm instead? AFAIU, the slow path was meant to be used when the memtype cache is disabled.

An unknown memtype here is not tied to a specific allocator. It can occur even with the memtype cache enabled, for example when UCM reports UNKNOWN for existing allocations or for paths it cannot classify immediately, or when cache coverage is incomplete for the queried range.

Because of that, the internal slowpath is not only for the cache-disabled case. It is the correctness fallback for unknown or non-covered entries while the cache is active.

Adding a hook under src/ucm is useful only if we identify a concrete allocator or runtime path that currently bypasses UCM memtype events. That may reduce slowpath frequency, but it will not remove the need for a fallback in cross-context cases such as separate Level Zero (L0) contexts (for example, PyTorch or SYCL versus UCX).

Contributor

Thanks for the details. I also think the memtype cache only tracks non-host memory, so a missing element currently means host memory. If so, running the slowpath in those cases could have a performance impact. For all the cases you mention, could the memtype be passed along with the pointer instead?

Contributor Author

Thanks, that makes sense. I agree we should avoid per-call warnings in this path, but we can add lightweight observability: track slowpath hits (miss and unknown cases) in the existing UCP stats tree using UCS_STATS_NODE_DECLARE, and emit a one-time or end-of-run summary hint when the counters are non-zero. That gives actionable feedback to pass explicit memtype hints without adding hot-path log noise.
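
A rough sketch of that counting idea, using plain counters instead of the UCS stats API. All `demo_*` names and the hint text are hypothetical; a real change would presumably hook into the UCP stats tree rather than a standalone struct.

```c
#include <assert.h>
#include <stdio.h>

/* Why the slowpath was entered; hypothetical names for illustration. */
typedef enum {
    DEMO_SLOWPATH_MISS,    /* no entry in the memtype cache */
    DEMO_SLOWPATH_UNKNOWN  /* entry present but memory type is unknown */
} demo_slowpath_reason_t;

typedef struct {
    unsigned long miss_hits;
    unsigned long unknown_hits;
    int           hint_emitted; /* non-zero once the summary was printed */
} demo_slowpath_stats_t;

/* Hot path: only increments a counter, never logs. */
static void demo_stats_record(demo_slowpath_stats_t *stats,
                              demo_slowpath_reason_t reason)
{
    assert(stats != NULL);

    if (reason == DEMO_SLOWPATH_MISS) {
        stats->miss_hits++;
    } else {
        stats->unknown_hits++;
    }
}

/* End-of-run (or one-time) summary. Returns 1 if a hint was printed,
 * 0 if there was nothing to report or the hint was already emitted. */
static int demo_stats_summary(demo_slowpath_stats_t *stats, FILE *out)
{
    assert((stats != NULL) && (out != NULL));

    if (stats->hint_emitted ||
        ((stats->miss_hits + stats->unknown_hits) == 0)) {
        return 0;
    }

    fprintf(out,
            "memtype slowpath hits: miss=%lu unknown=%lu; "
            "consider passing an explicit memory type\n",
            stats->miss_hits, stats->unknown_hits);
    stats->hint_emitted = 1;
    return 1;
}
```

Keeping the record step to a single increment is what keeps the hot path free of log noise; the hint is printed at most once.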

Also agreed that passing memtype with the pointer is the preferred fix at the application boundary. In our NIXL/Dynamo integration we already do this for PyTorch GPU buffers by setting UCP_MEM_MAP_PARAM_FIELD_MEMORY_TYPE to UCS_MEMORY_TYPE_ZE_DEVICE, and in that path detection is bypassed.
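
To illustrate why an explicit hint sidesteps the problem, here is a standalone model of the field-mask pattern. It does not call UCX: the `demo_*` types and `DEMO_*` macros are hypothetical stand-ins for the mem-map parameters, and the toy detector always answers "host" to mimic a cold cache.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins; not the UCX definitions. */
typedef enum {
    DEMO_MEM_HOST,
    DEMO_MEM_ZE_DEVICE
} demo_mem_type_t;

#define DEMO_PARAM_FIELD_ADDRESS     (1u << 0)
#define DEMO_PARAM_FIELD_LENGTH      (1u << 1)
#define DEMO_PARAM_FIELD_MEMORY_TYPE (1u << 2)

typedef struct {
    uint64_t        field_mask;  /* which of the fields below are valid */
    void            *address;
    size_t          length;
    demo_mem_type_t memory_type; /* used only if the MEMORY_TYPE bit is set */
} demo_map_params_t;

static unsigned demo_detect_calls; /* counts detection attempts */

/* Toy detector for a cold cache: it knows nothing, so it reports host. */
static demo_mem_type_t demo_detect(const void *address)
{
    (void)address;
    demo_detect_calls++;
    return DEMO_MEM_HOST;
}

/* Use the caller's hint when present; detect only as a fallback. */
static demo_mem_type_t demo_resolve_mem_type(const demo_map_params_t *params)
{
    assert(params != NULL);

    if (params->field_mask & DEMO_PARAM_FIELD_MEMORY_TYPE) {
        return params->memory_type;
    }
    return demo_detect(params->address);
}
```

In this model, a buffer registered with the memory-type bit set never reaches the detector, which mirrors the bypass described for the NIXL/Dynamo path.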

Contributor

I think there is the logging aspect, but also the implied uct_md_mem_query() call, which could have a performance impact in the host memory case: it would be called repeatedly, since host memory is never in the memtype cache.

Contributor Author

> I think there is the logging aspect, but also the implied uct_md_mem_query() call, which could have a performance impact in the host memory case: it would be called repeatedly, since host memory is never in the memtype cache.

I agree we should avoid changing NO_ELEM semantics globally, due to the performance risk on host-heavy paths. I will convert this PR to a draft for now.
