[BUG] Type-erasing a memory resource changes default allocation alignment from 256 to 16 bytes #8063

@bdice

Description

Concrete memory resources (device_memory_pool, legacy_pinned_memory_resource, etc.) default to default_cuda_malloc_alignment (256 bytes) when callers omit the alignment argument. However, when one of these resources is type-erased into any_resource / any_synchronous_resource, wrapped in shared_resource, or adapted with synchronous_resource_adapter, the default alignment either silently changes to 16 bytes or disappears entirely, so that callers must pass it explicitly.

I noticed this when attempting to call shared_resource's allocate(stream, bytes) method, only to realize that alignment had to be specified explicitly (there is no two-argument overload with a default alignment).

// Calls allocate_sync(bytes, 256) — correct
pool.allocate_sync(bytes);

// Calls allocate_sync(bytes, 16) — silent behavior change
any_synchronous_resource<> any{pool};
any.allocate_sync(bytes);
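The mechanism can be reproduced without CUDA. The following is a minimal sketch using mock types (mock_pool and mock_any_resource are hypothetical stand-ins, not the real library classes): each layer carries its own default argument, and the wrapper's default is the one that reaches the underlying resource.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-ins for the two alignment constants discussed in this issue.
inline constexpr std::size_t device_default_alignment = 256;
inline constexpr std::size_t host_default_alignment   = alignof(std::max_align_t);

// Mock concrete resource: records the alignment it was asked for.
struct mock_pool
{
  std::size_t last_alignment = 0;

  void* allocate_sync(std::size_t /*bytes*/, std::size_t alignment = device_default_alignment)
  {
    last_alignment = alignment;
    return nullptr; // no real allocation; only the requested alignment matters here
  }
};

// Mock type-erased wrapper: forwards to the pool, but with its own default.
struct mock_any_resource
{
  mock_pool* pool;

  void* allocate_sync(std::size_t bytes, std::size_t alignment = host_default_alignment)
  {
    assert(pool != nullptr);
    // The wrapper always passes an explicit alignment, so the pool's default never applies.
    return pool->allocate_sync(bytes, alignment);
  }
};

// Alignment the pool observes for a direct call with no alignment argument.
inline std::size_t alignment_seen_direct()
{
  mock_pool pool;
  pool.allocate_sync(1024);
  return pool.last_alignment;
}

// Alignment the pool observes when the same call goes through the wrapper.
inline std::size_t alignment_seen_wrapped()
{
  mock_pool pool;
  mock_any_resource any{&pool};
  any.allocate_sync(1024);
  return pool.last_alignment;
}
```

In this sketch alignment_seen_direct() reports the pool's own default, while alignment_seen_wrapped() reports alignof(std::max_align_t), even though the caller wrote the same one-argument call in both cases.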

Background

The codebase uses two alignment constants:

  • default_cuda_malloc_alignment (256) — required for device-accessible memory
  • default_cuda_malloc_host_alignment / alignof(max_align_t) (~16) — sufficient for host-only memory

Host-only resources like legacy_pinned_memory_resource technically need only 16-byte alignment on the host side. But type-erased wrappers (any_resource, shared_resource, etc.) don't know what they hold: the underlying resource could target device memory, host memory, or both. Defaulting to 16 bytes is unsafe because it can silently under-align device allocations. Most underlying resources that only call CUDA runtime APIs will ignore a 16-byte alignment request and return 256-byte-aligned pointers anyway, but that isn't safe to assume if the user has implemented a custom device allocator like RMM's pool or arena. Defaulting to 256 bytes is always safe: it satisfies both host and device requirements, at the cost of potentially wasted alignment padding for host-only resources.
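To illustrate why a suballocating resource is the risky case, here is a small sketch of a bump allocator that honors the requested alignment exactly. The bump_arena type, the base address, and the block sizes are all made up for illustration; this is not how RMM's pool or arena is implemented, only an example of an allocator that takes the alignment argument at face value.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical bump allocator over an address range. It rounds the next free
// address up to the requested alignment and nothing more.
struct bump_arena
{
  std::uintptr_t next; // next free address

  std::uintptr_t allocate(std::size_t bytes, std::size_t alignment)
  {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0); // must be a power of two
    const auto a       = static_cast<std::uintptr_t>(alignment);
    const auto aligned = (next + a - 1) & ~(a - 1);
    next               = aligned + bytes;
    return aligned;
  }
};

inline bool is_aligned(std::uintptr_t addr, std::size_t alignment)
{
  return addr % alignment == 0;
}

// Allocates a 48-byte block followed by a 64-byte block, both with the given
// alignment, and returns the address of the second block.
inline std::uintptr_t second_block_address(std::size_t alignment)
{
  bump_arena arena{0x10000}; // illustrative base address, itself 256-byte aligned
  arena.allocate(48, alignment);
  return arena.allocate(64, alignment);
}
```

With a 16-byte request the second block lands at 0x10030, which is not 256-byte aligned; with a 256-byte request it lands at 0x10100. An allocator of this shape only produces device-safe pointers if the caller, or the wrapper's default, asks for 256.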

Type-erased and wrapper types should default to 256 bytes since they do not know the requirements of the underlying resource. I would accept a solution where resources with only the host_accessible property (and no device_accessible property) use 16-byte alignment by default, but I think that's potentially messy because some resources might be device-accessible at runtime even though that is not statically known.
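For reference, the property-based alternative could look roughly like the following C++17 sketch. The tag types and the default_alignment_for variable template are hypothetical and only mirror the names used in this issue; the real property machinery may differ. It also shows the limitation mentioned above: the choice is made purely from the static property list, so a resource that is device-accessible only at runtime would still get the 16-byte default.

```cpp
#include <cstddef>
#include <type_traits>

// Hypothetical property tags, named after the properties mentioned in this issue.
struct host_accessible {};
struct device_accessible {};

// True if the property pack contains the given tag (false for an empty pack).
template <class Tag, class... Properties>
inline constexpr bool has_property = (std::is_same_v<Tag, Properties> || ...);

// Default alignment chosen from the static property list:
// 16-ish bytes only when the resource is statically host-only, 256 otherwise.
template <class... Properties>
inline constexpr std::size_t default_alignment_for =
  (has_property<host_accessible, Properties...> && !has_property<device_accessible, Properties...>)
    ? alignof(std::max_align_t)
    : std::size_t{256};
```

A wrapper with no properties at all, or with both properties, falls back to 256 in this sketch; only a statically host-only wrapper gets the smaller default.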

Affected locations

Current host-friendly default (16) should be device-friendly (256):

  • any_resource.h:116: __ibasic_resource::allocate_sync
  • any_resource.h:123: __ibasic_resource::deallocate_sync
  • any_resource.h:143: __ibasic_async_resource::allocate (2-arg overload)
  • any_resource.h:155: __ibasic_async_resource::deallocate (3-arg overload)
  • shared_resource.h:188: shared_resource::allocate_sync
  • shared_resource.h:197: shared_resource::deallocate_sync

No default and no convenience overload:

  • synchronous_resource_adapter.h:67: allocate (no 2-arg overload)
  • synchronous_resource_adapter.h:79: allocate_sync (no default alignment)
  • synchronous_resource_adapter.h:85: deallocate (no 3-arg overload)
  • synchronous_resource_adapter.h:98: deallocate_sync (no default alignment)
  • shared_resource.h:212: allocate (no 2-arg overload)
  • shared_resource.h:227: deallocate (no 3-arg overload)

Proposed fix

  1. Replace alignof(::cuda::std::max_align_t) defaults with ::cuda::mr::default_cuda_malloc_alignment in any_resource.h and shared_resource.h.
  2. Add default = ::cuda::mr::default_cuda_malloc_alignment to allocate_sync and deallocate_sync in synchronous_resource_adapter.h.
  3. Add 2-arg allocate(stream, bytes) and 3-arg deallocate(stream, ptr, bytes) convenience overloads to shared_resource and synchronous_resource_adapter, matching the overloads already present on __memory_pool_base and __ibasic_async_resource.
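The shape of the proposed interface can be sketched with mock types. Everything below (mock_stream, recording_resource, mock_shared_resource, and the local default_alignment constant) is hypothetical and only stands in for the real classes; the point is the 256-byte default on the synchronous methods and the shorter convenience overloads that forward to the full-signature ones.

```cpp
#include <cassert>
#include <cstddef>

// Stands in for ::cuda::mr::default_cuda_malloc_alignment in this sketch.
inline constexpr std::size_t default_alignment = 256;

struct mock_stream {}; // hypothetical stand-in for a stream handle

// Mock underlying resource that records the alignment of the last request.
struct recording_resource
{
  std::size_t last_alignment = 0;

  void* allocate(mock_stream, std::size_t, std::size_t alignment)
  {
    last_alignment = alignment;
    return nullptr;
  }
  void deallocate(mock_stream, void*, std::size_t, std::size_t alignment)
  {
    last_alignment = alignment;
  }
  void* allocate_sync(std::size_t, std::size_t alignment)
  {
    last_alignment = alignment;
    return nullptr;
  }
  void deallocate_sync(void*, std::size_t, std::size_t alignment)
  {
    last_alignment = alignment;
  }
};

// Mock wrapper with the defaults and overloads described in the proposed fix.
struct mock_shared_resource
{
  recording_resource* upstream;

  // Steps 1 and 2: synchronous methods default to the device-safe alignment.
  void* allocate_sync(std::size_t bytes, std::size_t alignment = default_alignment)
  {
    assert(upstream != nullptr);
    return upstream->allocate_sync(bytes, alignment);
  }
  void deallocate_sync(void* ptr, std::size_t bytes, std::size_t alignment = default_alignment)
  {
    assert(upstream != nullptr);
    upstream->deallocate_sync(ptr, bytes, alignment);
  }

  // Full-signature stream-ordered methods.
  void* allocate(mock_stream stream, std::size_t bytes, std::size_t alignment)
  {
    assert(upstream != nullptr);
    return upstream->allocate(stream, bytes, alignment);
  }
  void deallocate(mock_stream stream, void* ptr, std::size_t bytes, std::size_t alignment)
  {
    assert(upstream != nullptr);
    upstream->deallocate(stream, ptr, bytes, alignment);
  }

  // Step 3: convenience overloads that fill in the device-safe alignment.
  void* allocate(mock_stream stream, std::size_t bytes)
  {
    return allocate(stream, bytes, default_alignment);
  }
  void deallocate(mock_stream stream, void* ptr, std::size_t bytes)
  {
    deallocate(stream, ptr, bytes, default_alignment);
  }
};
```

Callers that pass an explicit alignment keep their current behavior; only calls that omit it pick up the new default.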


Status: In Review
