Description
Concrete memory resources (device_memory_pool, legacy_pinned_memory_resource, etc.) default to default_cuda_malloc_alignment (256 bytes) when callers omit the alignment argument. However, when one of these resources is type-erased into any_resource / any_synchronous_resource, wrapped in shared_resource, or adapted with synchronous_resource_adapter, the default alignment silently changes — or disappears entirely.
I noticed this when attempting to call shared_resource's allocate(stream, bytes) method, only to realize that alignment had to be specified explicitly (there is no two-argument overload with a default alignment).
// Calls allocate_sync(bytes, 256) — correct
pool.allocate_sync(bytes);
// Calls allocate_sync(bytes, 16) — silent behavior change
any_synchronous_resource<> any{pool};
any.allocate_sync(bytes);
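The effect can be reproduced without CUDA in a minimal sketch. Everything below is a hypothetical stand-in (mock_pool, the two mock wrappers, and the sketch_* constants are illustration names, not the real library types), and actual type erasure is elided; the sketch only shows how a wrapper's own default argument replaces the wrapped resource's default.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-ins for the two alignment constants (not the real library symbols).
constexpr std::size_t sketch_device_alignment = 256;
constexpr std::size_t sketch_host_alignment   = alignof(std::max_align_t);

// Hypothetical concrete resource: records the alignment it receives.
struct mock_pool {
  std::size_t last_alignment = 0;

  void* allocate_sync(std::size_t /*bytes*/, std::size_t alignment = sketch_device_alignment) {
    last_alignment = alignment;
    return nullptr;  // no real allocation in this sketch
  }
};

// Hypothetical wrapper whose own default differs from the pool's default,
// mimicking the behavior reported in this issue.
struct mock_wrapper_host_default {
  mock_pool* pool;

  void* allocate_sync(std::size_t bytes, std::size_t alignment = sketch_host_alignment) {
    return pool->allocate_sync(bytes, alignment);
  }
};

// The same wrapper with the device-friendly default.
struct mock_wrapper_device_default {
  mock_pool* pool;

  void* allocate_sync(std::size_t bytes, std::size_t alignment = sketch_device_alignment) {
    return pool->allocate_sync(bytes, alignment);
  }
};

// Helpers returning the alignment that actually reaches the pool on each path.
inline std::size_t alignment_direct() {
  mock_pool pool;
  pool.allocate_sync(1024);
  return pool.last_alignment;
}

inline std::size_t alignment_via_host_default() {
  mock_pool pool;
  mock_wrapper_host_default wrapper{&pool};
  wrapper.allocate_sync(1024);
  return pool.last_alignment;
}

inline std::size_t alignment_via_device_default() {
  mock_pool pool;
  mock_wrapper_device_default wrapper{&pool};
  wrapper.allocate_sync(1024);
  return pool.last_alignment;
}
```

In this sketch the pool sees 256 when called directly, alignof(std::max_align_t) through the first wrapper, and 256 again through the second, even though the call sites look identical.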
Background
The codebase uses two alignment constants:
default_cuda_malloc_alignment (256) — required for device-accessible memory
default_cuda_malloc_host_alignment / alignof(max_align_t) (~16) — sufficient for host-only memory
Host-only resources like legacy_pinned_memory_resource technically need only 16-byte alignment on the host side. But type-erased wrappers (any_resource, shared_resource, etc.) don't know what they hold: the underlying resource could target device memory, host memory, or both. Defaulting to 16 bytes is unsafe because it can silently under-align device allocations. Most underlying resources that only call CUDA runtime APIs ignore a 16-byte alignment request and return 256-byte-aligned pointers anyway, but that isn't safe to assume when the user has implemented a custom device allocator, such as RMM's pool or arena. Defaulting to 256 bytes is always safe: it satisfies both host and device requirements, at the cost of potentially wasted alignment padding for host-only resources.
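The claim that 256 satisfies both requirements follows from both values being powers of two: any address that is a multiple of 256 is also a multiple of 16, but not the other way around. A small sketch of the check, where is_aligned is a hypothetical helper written for illustration rather than a library function:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: true when addr is a multiple of alignment.
// Assumes alignment is a nonzero power of two.
constexpr bool is_aligned(std::uintptr_t addr, std::size_t alignment) {
  return (addr & (alignment - 1)) == 0;
}

// Pointer convenience wrapper around the integer check above.
inline bool is_pointer_aligned(const void* ptr, std::size_t alignment) {
  return is_aligned(reinterpret_cast<std::uintptr_t>(ptr), alignment);
}
```

A custom suballocator that honors a 16-byte request could legitimately return an address such as base + 16, which passes the 16-byte check and fails the 256-byte one.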
Type-erased and wrapper types should therefore default to 256 bytes, since they do not know the requirements of the underlying resource. I would accept a solution where resources with only the host_accessible property (and no device_accessible property) use 16-byte alignment by default, but I think that's potentially messy because some resources might be device-accessible at runtime even though that isn't statically known.
Affected locations
Current host-friendly default (16) should be device-friendly (256):
any_resource.h:116 — __ibasic_resource::allocate_sync
any_resource.h:123 — __ibasic_resource::deallocate_sync
any_resource.h:143 — __ibasic_async_resource::allocate (2-arg overload)
any_resource.h:155 — __ibasic_async_resource::deallocate (3-arg overload)
shared_resource.h:188 — shared_resource::allocate_sync
shared_resource.h:197 — shared_resource::deallocate_sync
No default and no convenience overload:
synchronous_resource_adapter.h:67 — allocate (no 2-arg overload)
synchronous_resource_adapter.h:79 — allocate_sync (no default alignment)
synchronous_resource_adapter.h:85 — deallocate (no 3-arg overload)
synchronous_resource_adapter.h:98 — deallocate_sync (no default alignment)
shared_resource.h:212 — allocate (no 2-arg overload)
shared_resource.h:227 — deallocate (no 3-arg overload)
Proposed fix
- Replace alignof(::cuda::std::max_align_t) defaults with ::cuda::mr::default_cuda_malloc_alignment in any_resource.h and shared_resource.h.
- Add default = ::cuda::mr::default_cuda_malloc_alignment to allocate_sync and deallocate_sync in synchronous_resource_adapter.h.
- Add 2-arg allocate(stream, bytes) and 3-arg deallocate(stream, ptr, bytes) convenience overloads to shared_resource and synchronous_resource_adapter, matching the overloads already present on __memory_pool_base and __ibasic_async_resource.
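One possible shape for the convenience overloads is sketched below. This is an assumption about how they could look, not the actual implementation: mock_stream, mock_async_resource, mock_forwarding_wrapper, and sketch_default_alignment are hypothetical names, and only the argument order (stream, bytes) and (stream, ptr, bytes) is taken from the list above.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for the 256-byte default alignment constant.
constexpr std::size_t sketch_default_alignment = 256;

// Hypothetical stream tag; a real stream handle would go here.
struct mock_stream {};

// Hypothetical underlying resource that only offers explicit-alignment overloads
// and records the alignment it receives.
struct mock_async_resource {
  std::size_t last_alignment = 0;

  void* allocate(mock_stream, std::size_t /*bytes*/, std::size_t alignment) {
    last_alignment = alignment;
    return nullptr;  // no real allocation in this sketch
  }

  void deallocate(mock_stream, void* /*ptr*/, std::size_t /*bytes*/, std::size_t alignment) noexcept {
    last_alignment = alignment;
  }
};

// Hypothetical wrapper showing the proposed convenience overloads.
template <class Resource>
struct mock_forwarding_wrapper {
  Resource* resource;

  // Existing explicit-alignment overloads.
  void* allocate(mock_stream stream, std::size_t bytes, std::size_t alignment) {
    return resource->allocate(stream, bytes, alignment);
  }
  void deallocate(mock_stream stream, void* ptr, std::size_t bytes, std::size_t alignment) noexcept {
    resource->deallocate(stream, ptr, bytes, alignment);
  }

  // Proposed 2-arg allocate: forwards with the device-friendly default.
  void* allocate(mock_stream stream, std::size_t bytes) {
    return allocate(stream, bytes, sketch_default_alignment);
  }
  // Proposed 3-arg deallocate: forwards with the same default.
  void deallocate(mock_stream stream, void* ptr, std::size_t bytes) noexcept {
    deallocate(stream, ptr, bytes, sketch_default_alignment);
  }
};

// Helpers returning the alignment the underlying resource sees on each short overload.
inline std::size_t alignment_from_two_arg_allocate() {
  mock_async_resource resource;
  mock_forwarding_wrapper<mock_async_resource> wrapper{&resource};
  wrapper.allocate(mock_stream{}, 1024);
  return resource.last_alignment;
}

inline std::size_t alignment_from_three_arg_deallocate() {
  mock_async_resource resource;
  mock_forwarding_wrapper<mock_async_resource> wrapper{&resource};
  wrapper.deallocate(mock_stream{}, nullptr, 1024);
  return resource.last_alignment;
}
```

Separate overloads are used here instead of a defaulted trailing parameter only to mirror the wording of the proposal; a defaulted parameter would presumably behave the same way for callers.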