
Add ComputeEncoder abstraction with parallel/serial barrier modes #1100

Draft
MarijnS95 wants to merge 8 commits into llvm:main from Traverse-Research:command-encoder

@MarijnS95
Collaborator

Depends on #1057

Introduces a generic command encoder abstraction for recording GPU commands to a command buffer. ComputeEncoder provides dispatch() and barrier() operations with two modes: Parallel (no automatic barriers, caller manages synchronization) and Serial (auto-inserts barriers between commands using tracked destination scope as next source).

Each backend implements barrier tracking natively:

  • Metal: MTL::BarrierScope accumulated on the encoder
  • Vulkan: VkPipelineStageFlags/VkAccessFlags on the command buffer
  • DX12: UAV barrier flag on the command buffer

VK/DX store barrier state on the command buffer (encoders hold a back-reference) so it persists across encoder lifetimes. Metal stores it on the encoder since each MTL::ComputeCommandEncoder is a separate native object with implicit inter-encoder ordering.

Creating a parallel encoder flushes a full barrier on VK/DX to ensure prior work is visible. endEncoding() flushes any pending barriers.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
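
To make the difference between the two modes concrete, here is a minimal sketch of the tracking described above. It is a mock, not the PR's code: `MockComputeEncoder`, `BarrierMode`, `Scope`, and the string log are hypothetical names, and the choice to clear the tracked scope after an explicit barrier is an assumption of this sketch.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-ins for illustration; not the PR's actual declarations.
enum class BarrierMode { Parallel, Serial };
enum Scope : uint32_t { ScopeNone = 0, ScopeCompute = 1, ScopeTransfer = 2 };

class MockComputeEncoder {
public:
  explicit MockComputeEncoder(BarrierMode Mode) : Mode(Mode) {}

  void dispatch(uint32_t X, uint32_t Y, uint32_t Z) {
    record("dispatch " + std::to_string(X) + "," + std::to_string(Y) + "," +
               std::to_string(Z),
           ScopeCompute);
  }

  void copyBuffer(uint64_t Size) {
    record("copy " + std::to_string(Size), ScopeTransfer);
  }

  // Explicit barrier: the only synchronization a Parallel encoder gets.
  void barrier(uint32_t Src, uint32_t Dst) {
    Log.push_back(barrierText(Src, Dst));
    LastScope = ScopeNone; // Assumption: prior work is now fully ordered.
  }

  const std::vector<std::string> &log() const { return Log; }

private:
  static std::string barrierText(uint32_t Src, uint32_t Dst) {
    return "barrier " + std::to_string(Src) + "->" + std::to_string(Dst);
  }

  void record(const std::string &Cmd, uint32_t CmdScope) {
    // Serial mode: the previous command's scope becomes the barrier source.
    if (Mode == BarrierMode::Serial && LastScope != ScopeNone)
      Log.push_back(barrierText(LastScope, CmdScope));
    Log.push_back(Cmd);
    LastScope = CmdScope;
  }

  BarrierMode Mode;
  uint32_t LastScope = ScopeNone;
  std::vector<std::string> Log;
};
```

In this mock, a Serial encoder that records a dispatch, a copy, and another dispatch logs five entries (three commands with two barriers between them), while a Parallel encoder logs only the commands.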

Contributor

@manon-traverse manon-traverse left a comment

GitHub is not loading all the files for me right now :(
But I am able to provide some feedback on the files that did manage to load.

Comment thread include/API/Encoder.h
Comment thread include/API/Encoder.h Outdated
Comment on lines +97 to +100
virtual llvm::Error dispatch(uint32_t GroupCountX, uint32_t GroupCountY,
                             uint32_t GroupCountZ, uint32_t ThreadsPerGroupX,
                             uint32_t ThreadsPerGroupY,
                             uint32_t ThreadsPerGroupZ) = 0;
Contributor

I would like to be able to specify the pipeline and descriptor sets as well. That way, we can enforce that everything that needs to be set up to dispatch has been set up.

Also, could we perhaps grab the threads per group from the pipeline and let the user only specify the group count OR thread count to launch?

Collaborator Author

I think reading back the threads-per-group from reflection is the most convenient. While Metal has ways to specify the total dispatch in thread-count, with automatic bounds-checking in the shader, that is not applicable to DX and Vulkan and makes this API a lot less convenient to use.

Specifically, the threads-per-group in Metal is fixed when HLSL with numthreads() is transpiled to Metallib, so that value should be persisted in the pipeline and passed through.
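
For the "group count OR thread count" idea raised in this thread, the conversion is a per-axis round-up division by the threads-per-group value read from reflection. The following is a rough sketch under that assumption; `Extent3`, `divideRoundUp`, and `groupCountFor` are hypothetical helpers, not part of the PR.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

using Extent3 = std::array<uint32_t, 3>;

// Round-up integer division without risking overflow in N + D - 1.
constexpr uint32_t divideRoundUp(uint32_t N, uint32_t D) {
  return N / D + (N % D != 0 ? 1 : 0);
}

// Hypothetical helper: number of groups needed to cover ThreadCount threads,
// given the ThreadsPerGroup value a pipeline would report from reflection.
inline Extent3 groupCountFor(const Extent3 &ThreadCount,
                             const Extent3 &ThreadsPerGroup) {
  Extent3 Groups{};
  for (size_t I = 0; I < 3; ++I) {
    assert(ThreadsPerGroup[I] != 0 && "numthreads components are non-zero");
    Groups[I] = divideRoundUp(ThreadCount[I], ThreadsPerGroup[I]);
  }
  return Groups;
}
```

With 1000 threads and 64 threads per group on the X axis this yields 16 groups, which launches 1024 threads. On DX and Vulkan the shader then has to bounds-check the 24 surplus threads itself, which is the inconvenience mentioned above.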

Comment thread include/API/Encoder.h
Comment thread lib/API/DX/Device.cpp Outdated
Comment thread lib/API/VK/Device.cpp Outdated
Move command buffer submission logic from each backend's Device into
Queue::submit(), which takes ownership of the command buffers. Each
backend uses its Fence abstraction for GPU synchronization:

- Metal: commit() + waitUntilCompleted()
- Vulkan: vkQueueSubmit() signaling a timeline semaphore (VulkanFence),
  then VulkanFence::waitForCompletion()
- DX12: ExecuteCommandLists() + Queue::Signal() on the queue-owned
  DXFence, then DXFence::waitForCompletion()

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
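
The ownership and fence flow described in this commit can be sketched with mock types. Nothing here is the PR's actual code: `MockQueue`, `MockFence`, and `MockCommandBuffer` are hypothetical, and the fence is modeled as a plain monotonically increasing counter in the spirit of a timeline semaphore or an ID3D12Fence.

```cpp
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical mock types; names are illustrative, not the PR's classes.
struct MockCommandBuffer {
  int RecordedCommands = 0;
};

// Models a fence whose completed value only ever increases.
class MockFence {
public:
  void signal(uint64_t Value) { Completed = Value; }
  bool waitForCompletion(uint64_t Value) const { return Completed >= Value; }
  uint64_t completedValue() const { return Completed; }

private:
  uint64_t Completed = 0;
};

class MockQueue {
public:
  // Taking the vector by value moves ownership of the buffers into submit().
  uint64_t submit(std::vector<std::unique_ptr<MockCommandBuffer>> Buffers) {
    for (const auto &CB : Buffers)
      ExecutedCommands += CB->RecordedCommands; // Stand-in for GPU execution.
    const uint64_t Value = NextValue++;
    Fence.signal(Value); // A real backend signals from the GPU timeline.
    (void)Fence.waitForCompletion(Value);
    return Value;
  } // Buffers are destroyed here, only after the wait has returned.

  int executedCommands() const { return ExecutedCommands; }
  const MockFence &fence() const { return Fence; }

private:
  MockFence Fence;
  uint64_t NextValue = 1;
  int ExecutedCommands = 0;
};
```

Passing the command buffers by value means the caller has to `std::move` them in, so they cannot be reused or destroyed while the queue may still be executing them.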
MarijnS95 and others added 4 commits April 20, 2026 12:43
Introduces a generic command encoder abstraction for recording GPU
commands to a command buffer. ComputeEncoder provides dispatch() and
barrier() operations with two modes: Parallel (no automatic barriers,
caller manages synchronization) and Serial (auto-inserts barriers
between commands using tracked destination scope as next source).

Each backend implements barrier tracking natively:
- Metal: MTL::BarrierScope accumulated on the encoder
- Vulkan: VkPipelineStageFlags/VkAccessFlags on the command buffer
- DX12: UAV barrier flag on the command buffer

VK/DX store barrier state on the command buffer (encoders hold a
back-reference) so it persists across encoder lifetimes. Metal stores
it on the encoder since each MTL::ComputeCommandEncoder is a separate
native object with implicit inter-encoder ordering.

Creating a parallel encoder flushes a full barrier on VK/DX to ensure
prior work is visible. endEncoding() flushes any pending barriers.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ntations

Buffer copies are recorded through the encoder abstraction. DX and VK
record directly on the underlying command list/buffer. Metal lazily
switches between compute and blit encoders as needed — Metal 4 removes
this separation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Consistent with the readback barrier helpers.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
fillBuffer fills a buffer region with a repeated byte value (uint8_t).
Vulkan broadcasts the byte to a uint32_t for vkCmdFillBuffer; Metal
uses the native byte-fill BlitCommandEncoder API. DX12 returns
not_supported for now (ClearUnorderedAccessViewUint requires extra
descriptor heap management).
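
The byte-to-word broadcast for the Vulkan path can be sketched as a multiplication by 0x01010101. The helper names below are hypothetical. The alignment check reflects the vkCmdFillBuffer requirement that the offset and the size (when not VK_WHOLE_SIZE) be multiples of 4; whether the PR validates this is not stated here.

```cpp
#include <cstdint>

// Replicates one byte into all four bytes of a 32-bit word,
// e.g. 0xAB becomes 0xABABABAB.
constexpr uint32_t broadcastByte(uint8_t Value) {
  return 0x01010101u * Value;
}

// Hypothetical pre-check: vkCmdFillBuffer works in 4-byte units, so a
// byte-granular fill needs an aligned offset and size to map onto it.
constexpr bool isFillRangeAligned(uint64_t Offset, uint64_t Size) {
  return Offset % 4 == 0 && Size % 4 == 0;
}

static_assert(broadcastByte(0x00) == 0x00000000u, "zero stays zero");
static_assert(broadcastByte(0xFF) == 0xFFFFFFFFu, "no overflow at 0xFF");
```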

dispatchIndirect dispatches from an indirect argument buffer containing
{GroupCountX, GroupCountY, GroupCountZ}. Vulkan uses vkCmdDispatchIndirect,
Metal uses dispatchThreadgroups with an indirect buffer (ThreadsPerGroup
passed separately), DX12 uses ExecuteIndirect with a lazily-created
ID3D12CommandSignature.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
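
The indirect argument layout is three consecutive 32-bit group counts, which is the same 12-byte shape as VkDispatchIndirectCommand and D3D12_DISPATCH_ARGUMENTS. A small CPU-side sketch of writing that record into a byte buffer follows; `DispatchIndirectArgs`, `writeIndirectArgs`, and `readIndirectArgs` are hypothetical names, and the std::vector stands in for a mapped GPU buffer.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>
#include <vector>

// Hypothetical CPU-side mirror of the indirect dispatch record.
struct DispatchIndirectArgs {
  uint32_t GroupCountX;
  uint32_t GroupCountY;
  uint32_t GroupCountZ;
};

static_assert(std::is_trivially_copyable<DispatchIndirectArgs>::value,
              "record is copied with memcpy");
static_assert(sizeof(DispatchIndirectArgs) == 3 * sizeof(uint32_t),
              "record must be tightly packed (12 bytes)");

// Writes the record at a byte offset, growing the buffer if needed.
inline void writeIndirectArgs(std::vector<uint8_t> &Buffer, size_t Offset,
                              const DispatchIndirectArgs &Args) {
  if (Buffer.size() < Offset + sizeof(Args))
    Buffer.resize(Offset + sizeof(Args));
  std::memcpy(Buffer.data() + Offset, &Args, sizeof(Args));
}

// Reads the record back; the caller guarantees the range is in bounds.
inline DispatchIndirectArgs readIndirectArgs(const std::vector<uint8_t> &Buffer,
                                             size_t Offset) {
  DispatchIndirectArgs Args{};
  std::memcpy(&Args, Buffer.data() + Offset, sizeof(Args));
  return Args;
}
```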
MarijnS95 and others added 2 commits April 20, 2026 14:04
Add pushDebugGroup/popDebugGroup/insertDebugSignpost to CommandEncoder
with no-op defaults. Each backend overrides them:

- Vulkan: vkCmd{Begin,End,Insert}DebugUtilsLabelEXT, loaded via
  vkGetDeviceProcAddr (naturally gated by VK_EXT_debug_utils availability)
- DX12: ID3D12GraphicsCommandList BeginEvent/EndEvent/SetMarker with
  ANSI string encoding
- Metal: pushDebugGroup/popDebugGroup/insertDebugSignpost on the active
  native encoder, with correct pop/push across compute/blit switches

Every encoder command automatically emits a signpost with its parameters
(e.g. "Dispatch [8,1,1]", "CopyBuffer 4096B", "FillBuffer 256B
value=0x00"). Encoder creation pushes a "ComputeEncoder (Serial/
Parallel)" debug group, balanced by a pop in endEncoding.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
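
The signpost strings quoted in this commit message could be produced by helpers along these lines. This is only a sketch: the function names are hypothetical, and the lowercase two-digit hex format is an assumption, since the only example given is 0x00.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical label builders matching the examples in the commit message.
inline std::string dispatchLabel(uint32_t X, uint32_t Y, uint32_t Z) {
  return "Dispatch [" + std::to_string(X) + "," + std::to_string(Y) + "," +
         std::to_string(Z) + "]";
}

inline std::string copyBufferLabel(uint64_t SizeInBytes) {
  return "CopyBuffer " + std::to_string(SizeInBytes) + "B";
}

inline std::string fillBufferLabel(uint64_t SizeInBytes, uint8_t Value) {
  char Hex[5]; // "0x", two hex digits, and the terminating NUL.
  std::snprintf(Hex, sizeof(Hex), "0x%02x", static_cast<unsigned>(Value));
  return "FillBuffer " + std::to_string(SizeInBytes) + "B value=" + Hex;
}
```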
Makes CommandEncoder::endEncoding() idempotent (split into a non-virtual wrapper plus a protected virtual endEncodingImpl()), then has each backend's ComputeEncoder destructor call endEncoding() so an encoder destroyed without an explicit end still flushes pending barriers and pops its debug group.
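
A minimal sketch of that pattern, assuming a simplified void-returning interface (the real methods may well return llvm::Error) and hypothetical class names:

```cpp
#include <string>
#include <vector>

// Hypothetical base: a non-virtual wrapper guards the virtual implementation.
class CommandEncoderSketch {
public:
  virtual ~CommandEncoderSketch() = default;

  // Idempotent: only the first call reaches endEncodingImpl().
  void endEncoding() {
    if (Ended)
      return;
    Ended = true;
    endEncodingImpl();
  }

protected:
  virtual void endEncodingImpl() = 0;

private:
  bool Ended = false;
};

class ComputeEncoderSketch final : public CommandEncoderSketch {
public:
  explicit ComputeEncoderSketch(std::vector<std::string> &Log) : Log(Log) {
    Log.push_back("push debug group");
  }

  // Called from the derived destructor, while the override is still valid;
  // a base-class destructor could not dispatch to endEncodingImpl().
  ~ComputeEncoderSketch() override { endEncoding(); }

protected:
  void endEncodingImpl() override {
    Log.push_back("flush pending barriers");
    Log.push_back("pop debug group");
  }

private:
  std::vector<std::string> &Log;
};
```

Whether the encoder is ended explicitly, dropped without an end, or both, the log in this sketch contains exactly one flush and one pop.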
DX and Vulkan bake the threadgroup size into the compiled shader, making the
ThreadsPerGroup parameters redundant.  Metal reads it back from shader
reflection (the numthreads attribute persisted in the transpiled Metallib) and
stores it in the encoder via setThreadGroupSize() before dispatching.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>