PTO2 (Parallel Task Orchestration v2) is a runtime system for executing task graphs on Ascend AI processors. It coordinates four layers of execution:
- Host (x86/ARM CPU): compiles kernels, allocates device memory, initializes the Runtime, and launches AICPU/AICore threads.
- AICPU (device ARM cores): runs the orchestrator (task graph builder) and scheduler threads.
- AICore (AI compute cores): executes kernel functions dispatched by the scheduler.
- Shared Memory (Global Memory): ring buffers, task descriptors, heap, and TensorMap shared between orchestrator and schedulers.
┌───────────────────────────────────────────────────────────────────────┐
│ Host (CPU) │
│ golden.py → code_runner.py → compile kernels → init Runtime │
│ → upload binaries → launch AICPU/AICore → collect results │
└───────────────────────────┬───────────────────────────────────────────┘
│ device memory / GM
┌───────────────────────────▼───────────────────────────────────────────┐
│ AICPU (4 threads) │
│ Thread 3: Orchestrator (builds task graph) │
│ Threads 0-2: Schedulers (dispatch tasks to AICore) │
│ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ Shared Memory (GM) │ │
│ │ SharedMemoryHeader │ TaskDescriptors[] │ DepListPool │ │
│ │ GM Heap (output buffers) │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
│ Scheduler ──Handshake/Registers──► AICore workers (AIC + AIV) │
└───────────────────────────────────────────────────────────────────────┘
Three runtime backends exist under src/runtime/, each representing a different orchestration and scheduling strategy.
The host builds the complete task graph before launching device execution. The orchestration SO is loaded and executed on the host CPU.
- Task storage: fixed Task[] array (up to 131072 tasks)
- Scheduling: AICPU receives the pre-built graph and dispatches tasks by traversing dependencies
- Use case: development and debugging; no device-side orchestration overhead
The orchestration runs on an AICPU thread, building the task graph on device. Supports concurrent build + schedule (build_mode=1).
- Task storage: same Task[] array as host_build_graph
- AicpuBuildApi: add_task, add_successor_conditional, publish_task, device_malloc
- Use case: reduced host→device data transfer; graph can depend on device-side data
The primary production runtime. Uses ring buffers for task slots and output memory, with a TensorMap for automatic dependency tracking.
- Task storage: PTO2TaskDescriptor[] in shared memory ring buffer
- Memory: GM Heap ring for output buffer allocation
- Dependencies: automatically derived from tensor read/write patterns via TensorMap
- Thread model: 3 scheduler threads + 1 orchestrator thread on AICPU
- Use case: production workloads; supports streaming, flow control, and large batch sizes
Two platform implementations exist under src/platform/, sharing a common interface.
| Component | Description |
|---|---|
| device_runner.cpp | Uses CANN APIs: rtMalloc, rtMemcpy, rtLaunchKernel |
| memory_allocator.cpp | Wraps rtMalloc/rtFree with allocation tracking |
| aicore/kernel.cpp | KERNEL_ENTRY(aicore_kernel) → aicore_execute |
| aicpu/kernel.cpp | DynTileFwkBackendKernelServer entry → aicpu_execute |
| spin_hint.h | ARM wfe/yield instructions for efficient spinning |
| Component | Description |
|---|---|
| device_runner.cpp | Uses std::thread to simulate AICPU/AICore |
| memory_allocator.cpp | Wraps malloc/free |
| aicore/kernel.cpp | aicore_execute_wrapper sets g_sim_reg_base per core |
| upload_kernel_binary | dlopen kernel SO, dlsym entry point |
| Constant | Value | Description |
|---|---|---|
| PLATFORM_MAX_BLOCKDIM | 24 | Maximum blocks (each = 1 AIC + 2 AIV) |
| PLATFORM_MAX_AICPU_THREADS | 4 | AICPU thread count (3 schedulers + 1 orchestrator) |
| PLATFORM_MAX_AIC_PER_THREAD | 24 | Max AIC cores per scheduler thread |
| PLATFORM_MAX_AIV_PER_THREAD | 48 | Max AIV cores per scheduler thread |
| PLATFORM_PROF_SYS_CNT_FREQ | 50 MHz | System counter frequency for profiling |
The orchestrator and schedulers communicate through a contiguous shared memory region in Global Memory (GM):
┌─────────────────────────────┐ offset 0
│ PTO2SharedMemoryHeader │ (flow control, config, sync flags)
├─────────────────────────────┤ aligned
│ PTO2TaskDescriptor[N] │ N = task_window_size (default 65536)
├─────────────────────────────┤ aligned
│ PTO2DepListEntry[M+1] │ M = dep_list_pool_size (entry 0 = NULL sentinel)
└─────────────────────────────┘
| Field | Writer | Reader | Purpose |
|---|---|---|---|
| current_task_index | Orchestrator | Scheduler | Next task ID to allocate (task ring head) |
| last_task_alive | Scheduler | Orchestrator | Oldest still-active task (task ring tail) |
| heap_top | Orchestrator | Scheduler | Heap ring allocation pointer |
| heap_tail | Scheduler | Orchestrator | Heap ring reclamation pointer |
| orchestrator_done | Orchestrator | Scheduler | Signals orchestration completion |
| task_window_size | Init | Both | Number of task slots |
| heap_size | Init | Both | Heap total size |
total = ALIGN(Header) + ALIGN(window_size * sizeof(TaskDescriptor))
+ ALIGN((dep_pool_size + 1) * sizeof(DepListEntry))
Alignment is 64 bytes (PTO2_ALIGN_SIZE).
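The size formula above can be sketched as a small helper. This is only an illustration: the struct sizes passed in are hypothetical placeholders (the real values come from sizeof on the runtime's own types), and the names align_up, LayoutSizes, and shared_memory_bytes are not part of the runtime.

```cpp
#include <cassert>
#include <cstddef>

// Alignment taken from the text (PTO2_ALIGN_SIZE = 64).
constexpr std::size_t kAlign = 64;

// Round n up to the next multiple of kAlign (kAlign is a power of two).
constexpr std::size_t align_up(std::size_t n) {
    return (n + kAlign - 1) & ~(kAlign - 1);
}

// Hypothetical element sizes; the runtime would use sizeof(...) on its own structs.
struct LayoutSizes {
    std::size_t header_bytes;
    std::size_t task_desc_bytes;
    std::size_t dep_entry_bytes;
};

// total = ALIGN(Header) + ALIGN(window * sizeof(TaskDescriptor))
//       + ALIGN((dep_pool + 1) * sizeof(DepListEntry))
constexpr std::size_t shared_memory_bytes(const LayoutSizes& s,
                                          std::size_t window_size,
                                          std::size_t dep_pool_size) {
    return align_up(s.header_bytes)
         + align_up(window_size * s.task_desc_bytes)
         + align_up((dep_pool_size + 1) * s.dep_entry_bytes);
}
```

The extra dep-list entry accounts for the NULL sentinel at index 0, so a pool of M usable entries occupies M + 1 slots.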
The task ring manages task slot allocation with back-pressure flow control.
Structure (PTO2TaskRing):
- descriptors: pointer to TaskDescriptor[] in shared memory
- window_size: number of slots (power of 2)
- current_index: next task ID to allocate (monotonically increasing)
- last_alive_ptr: pointer to header->last_task_alive
Slot mapping: slot = task_id & (window_size - 1)
Allocation (pto2_task_ring_alloc):
active_count = current_index - *last_alive_ptr
if active_count < window_size - 1:
    allocate slot, advance current_index
else:
    spin-wait (back-pressure from scheduler)
Reclamation: Scheduler threads advance last_task_alive via lock-free CAS when the oldest task reaches state CONSUMED (3). This frees slots for reuse.
Flow control: When the ring is full, the orchestrator blocks until the scheduler advances last_task_alive. With PTO2_RING_TASK_WINDOW=16 and 208 tasks, slots are recycled ~13 times each.
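A minimal sketch of the allocation check and slot mapping follows. It is a non-blocking variant (it returns -1 where pto2_task_ring_alloc would spin-wait), and TaskRingSketch and try_alloc are illustrative names, not the runtime's actual definitions.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Illustrative stand-in for PTO2TaskRing; field names follow the text.
struct TaskRingSketch {
    int32_t window_size;                   // number of slots, power of two
    int32_t current_index;                 // next task ID (orchestrator-local)
    std::atomic<int32_t>* last_alive_ptr;  // advanced by scheduler threads
};

// Returns the slot for a new task, or -1 when the ring is full.
// On success *task_id_out receives the newly assigned task ID.
inline int32_t try_alloc(TaskRingSketch& ring, int32_t* task_id_out) {
    assert((ring.window_size & (ring.window_size - 1)) == 0);  // power of two
    const int32_t last_alive =
        ring.last_alive_ptr->load(std::memory_order_acquire);
    const int32_t active_count = ring.current_index - last_alive;
    if (active_count >= ring.window_size - 1) {
        return -1;  // back-pressure: the real allocator spin-waits here
    }
    const int32_t task_id = ring.current_index++;
    *task_id_out = task_id;
    return task_id & (ring.window_size - 1);  // slot mapping
}
```

Because one slot is kept free to distinguish full from empty, a window of 16 admits at most 15 active tasks before the orchestrator has to wait.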
The heap ring manages output buffer allocation from a circular GM heap.
Structure (PTO2HeapRing):
- base: GM heap base address
- size: total heap size (default 1 GB)
- top: allocation pointer (local to orchestrator)
- tail_ptr: pointer to header->heap_tail (updated by scheduler)
Allocation: Buffers are allocated contiguously from top. When reaching the end, allocation wraps to the beginning if tail has advanced far enough. Buffers never straddle the wrap-around boundary.
Reclamation: When last_task_alive advances past a task, its packed_buffer_end is used to advance heap_tail, freeing the memory region.
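The wrap-around rule can be illustrated with a non-blocking sketch that works on offsets relative to the heap base. The exact full/empty convention used here (top == tail means empty, and an allocation never makes top catch up with tail) is an assumption for illustration; HeapRingSketch and heap_try_alloc are hypothetical names.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Illustrative stand-in for PTO2HeapRing, using offsets instead of addresses.
struct HeapRingSketch {
    uint64_t size;                          // total heap bytes
    uint64_t top;                           // next allocation offset (orchestrator-local)
    const std::atomic<uint64_t>* tail_ptr;  // reclamation offset (scheduler-owned)
};

// Returns the offset of a contiguous buffer, or -1 if none is available now
// (the real allocator would spin-wait until heap_tail advances).
inline int64_t heap_try_alloc(HeapRingSketch& h, uint64_t bytes) {
    assert(bytes > 0 && bytes <= h.size);
    const uint64_t tail = h.tail_ptr->load(std::memory_order_acquire);
    if (h.top >= tail) {
        // Free space is [top, size) plus [0, tail).
        if (h.size - h.top >= bytes) {
            const uint64_t off = h.top;
            h.top += bytes;
            return static_cast<int64_t>(off);
        }
        // Not enough room at the end: wrap to offset 0 rather than straddle.
        if (tail > bytes) {
            h.top = bytes;
            return 0;
        }
        return -1;
    }
    // Already wrapped: free space is [top, tail).
    if (tail - h.top > bytes) {
        const uint64_t off = h.top;
        h.top += bytes;
        return static_cast<int64_t>(off);
    }
    return -1;
}
```

Wrapping early wastes the unused bytes at the end of the heap, which is the price of never splitting a buffer across the boundary.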
A simple bump allocator for PTO2DepListEntry nodes used in fanin/fanout linked lists.
- Entry 0: NULL sentinel (task_id=-1, next_offset=0)
- Allocation: pool->top++, wraps around when full
- Reclamation: implicit; old entries become unreachable as last_task_alive advances
The ring buffer mechanism provides flow control between the orchestrator (producer) and the scheduler (consumer). When a ring is exhausted, the orchestrator blocks — it cannot submit new tasks or allocate more output memory until the scheduler reclaims slots/space by advancing the watermarks.
Task Ring back-pressure: When active_count = current_index - last_task_alive >= window_size - 1, pto2_task_ring_alloc spin-waits until the scheduler completes tasks and advances last_task_alive.
Heap Ring back-pressure: When the heap has insufficient contiguous space, pto2_heap_ring_alloc spin-waits until the scheduler advances heap_tail past completed tasks' output buffers.
TensorMap pool back-pressure: When the entry pool is exhausted, new_entry() spin-waits on pto2_orchestrator_sync_tensormap(force=true) until cleanup frees entries (see Section 5.4).
This back-pressure is essential for correctness with small ring sizes — for example, with PTO2_RING_TASK_WINDOW=16 and 208 tasks, the orchestrator blocks ~192 times, each time waiting for the scheduler to drain completed tasks before continuing.
A ring that is too small can cause a deadlock. The root cause is the scope mechanism: each task's fanout_count includes a reference from its owning scope. The scope reference is only released when scope_end() runs — but scope_end() is called by the orchestrator, which is blocked waiting for ring space. This creates a circular dependency:
Orchestrator blocked on task_ring_alloc (ring full)
→ needs scheduler to advance last_task_alive
→ needs tasks to reach CONSUMED state (fanout_count == 0)
→ needs scope_end() to release scope reference
→ needs orchestrator to continue
→ DEADLOCK
The runtime detects this automatically by counting spin iterations in the allocation functions:
Periodic BLOCKED warnings (every 10,000 spins):
[TaskRing] BLOCKED (Flow Control): current=208, last_alive=192, active=16/16 (100.0%), spins=10000
[HeapRing] BLOCKED: requesting 4096 bytes, available=0, top=65536, tail=0, spins=10000
Deadlock detection (after 100,000 spins with no progress):
FATAL: Flow Control Deadlock Detected!
Task Ring is FULL and no progress after 100000 spins.
- Active tasks: 16
- Window size: 16
Root Cause:
Tasks cannot transition to CONSUMED state because fanout_count
includes 1 for the owning scope, and scope_end() requires the
orchestrator to continue — creating a circular dependency.
Solution:
Recommended: 32 (at least 2x current active tasks)
The FATAL message is logged to the device log and the process exits. The solution is to increase the ring size so that it can hold at least all tasks within the largest parallel scope. For example, if a scope submits 13 tasks, task_window >= 14 is required (13 + 1 to distinguish full from empty).
Sizing guideline: task_window_size must be larger than the maximum number of tasks in any single PTO2_SCOPE. A safe choice is 2 × max_tasks_per_scope or simply the default 65536 for production.
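As a worked example of the guideline, the helper below picks the smallest power-of-two window that is at least 2 × max_tasks_per_scope, with a floor of 4. The function name is hypothetical, and the floor of 4 is taken from the documented constraint on PTO2_RING_TASK_WINDOW.

```cpp
#include <cassert>
#include <cstdint>

// Smallest power-of-two window >= 2 * max_tasks_per_scope (minimum 4).
// Hypothetical helper; not part of the runtime API.
inline uint32_t suggested_task_window(uint32_t max_tasks_per_scope) {
    assert(max_tasks_per_scope <= (1u << 30));  // keep 2 * n within range
    const uint32_t need = 2u * max_tasks_per_scope;
    uint32_t window = 4;
    while (window < need) {
        window <<= 1;
    }
    return window;
}
```

For a scope that submits 13 tasks this yields 32, comfortably above the hard minimum of 14 (rounded up to 16 by the power-of-two requirement).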
TensorMap maintains a mapping from tensor memory regions to their producer task IDs. When a new task reads a tensor (INPUT/INOUT), TensorMap automatically discovers the producer and establishes a dependency edge.
- Key: tensor base address (buffer.addr)
- Value: producer task ID, with overlap detection for sub-regions
- Overlap: COVERED (new region fully contains old) or OTHER (partial overlap)
- Sub-tensors of the same base tensor hash to the same bucket, enabling overlap detection
Unlike the Task Ring and Heap Ring, TensorMap entries are not managed by a ring buffer. Instead, a fixed-size pool + free list is used:
- Free list first: free_entry_list[] stores indices of released entries. Allocation pops from here (O(1)).
- Bump allocation: if the free list is empty, next_entry_idx++ allocates from the end of the pool.
- Blocking reclaim: if the pool is fully exhausted, pto2_orchestrator_sync_tensormap(force=true) reads the latest last_task_alive and calls cleanup_retired to batch-free all entries belonging to retired tasks, returning them to the free list.
This design avoids the complexity of ring-based wrapping while still being bounded by PTO2_TENSORMAP_POOL_SIZE (default 65536 entries).
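The allocation order (free list, then bump, then reclaim) can be sketched as follows. EntryPoolSketch is a hypothetical stand-in that returns -1 at the point where the runtime would enter the blocking reclaim path; it uses std::vector for brevity, whereas the runtime's pool is presumably a fixed array.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Fixed-capacity index pool: free list first, then bump allocation.
class EntryPoolSketch {
public:
    explicit EntryPoolSketch(int32_t pool_size)
        : pool_size_(pool_size), next_entry_idx_(0) {
        assert(pool_size > 0);
    }

    // Returns an entry index, or -1 when both sources are exhausted
    // (the real new_entry() would block on cleanup instead).
    int32_t try_alloc() {
        if (!free_entry_list_.empty()) {
            const int32_t idx = free_entry_list_.back();
            free_entry_list_.pop_back();
            return idx;
        }
        if (next_entry_idx_ < pool_size_) {
            return next_entry_idx_++;
        }
        return -1;
    }

    // Return an entry to the free list for immediate reuse.
    void release(int32_t idx) { free_entry_list_.push_back(idx); }

private:
    int32_t pool_size_;
    int32_t next_entry_idx_;
    std::vector<int32_t> free_entry_list_;
};
```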
TensorMap must ensure entries for retired tasks (producer_task_id < last_task_alive) are removed, so that:
- The pool does not grow unboundedly (capacity is finite)
- Lookup performance does not degrade as stale entries accumulate in bucket chains
Three complementary mechanisms achieve this:
Layer 1 — Chain Truncation during Lookup (lazy, per-bucket):
Since insert always prepends to the bucket head, entries in each bucket chain are in descending task_id order. When pto2_tensormap_lookup encounters the first stale entry (producer_task_id < last_task_alive), all subsequent entries in the chain are guaranteed stale too. The entire tail is truncated in one operation:
// pto2_tensormap_lookup: chain truncation
if (!pto2_tensormap_entry_valid(tm, entry)) {
    *prev_ptr = -1;                      // cut chain here
    while (offset >= 0) {
        // stale: the pool entry at `offset` (lookup elided in this excerpt)
        stale->in_bucket = false;        // mark for reuse
        offset = stale->next_in_bucket;
    }
    return;
}

This guarantees that lookup only traverses valid entries: the cost is O(valid_entries_in_bucket), not O(total_entries).
Layer 2 — Periodic Batch Cleanup (cleanup_retired, per-task):
Every time the orchestrator submits a task (Step 0 of pto2_submit_task), it calls pto2_orchestrator_sync_tensormap. When last_task_alive has advanced by more than PTO2_TENSORMAP_CLEANUP_INTERVAL (default 64) tasks since the last cleanup, pto2_tensormap_cleanup_retired runs:
// pto2_tensormap_cleanup_retired: batch free by per-task chain
for (task_id = old_last_task_alive; task_id < new_last_task_alive; task_id++) {
    task_slot = task_id & (TASK_WINDOW_SIZE - 1);
    offset = task_entry_head[task_slot];
    while (offset >= 0) {
        // next: link to the following entry of this task, read before freeing (elided)
        free_entry(offset);   // remove from bucket + return to free list
        offset = next;
    }
    task_entry_head[task_slot] = -1;
}

This uses the per-task entry chain (task_entry_head[task_slot]). Each task's entries are linked together at insert time, allowing O(entries_per_task) cleanup without scanning the entire pool or all buckets. Freed entries are returned to free_entry_list for immediate reuse.
Layer 3 — Back-Pressure on Pool Exhaustion (blocking):
If both the free list and bump region are depleted, new_entry() spins on pto2_orchestrator_sync_tensormap(force=true), waiting for the scheduler to advance last_task_alive so that cleanup_retired can free entries:
// PTO2TensorMap::new_entry: back-pressure
while (free_num == 0) {
    pto2_orchestrator_sync_tensormap(this, /*force=*/true);
}

This forms a back-pressure mechanism analogous to the Task Ring's flow control.
Summary:
| Layer | Trigger | Method | Guarantees |
|---|---|---|---|
| Chain Truncation | Every lookup | Truncate stale tail of bucket chain | Lookup only visits valid entries |
| Periodic Cleanup | Every 64 retired tasks | Walk per-task chains, free entries | Pool capacity reclaimed in bounded time |
| Pool Back-Pressure | Pool exhausted | Block until scheduler advances watermark | Hard capacity bound, no OOM |
In steady state, the number of valid TensorMap entries ≈ active_tasks × avg_outputs_per_task. With the default task_window=65536 and pool_size=65536, this is well within bounds. With small windows (e.g., task_window=16), active entries are even fewer (~16 × a few), and cleanup runs frequently.
When pto2_submit_task processes parameters:
- INPUT/INOUT: pto2_tensormap_lookup searches for overlapping producers (with chain truncation)
- For each producer found: pto2_add_consumer_to_producer adds the dependency
- OUTPUT/INOUT: pto2_tensormap_insert registers the current task as the new producer at bucket head
- Stale entries are pruned lazily during lookup (Layer 1) and periodically by cleanup (Layer 2)
| Field | Description |
|---|---|
| task_id | Monotonically increasing ID |
| kernel_id | Function ID (maps to compiled kernel binary) |
| worker_type | CUBE (AIC) or VECTOR (AIV) |
| fanin_head | Head of fanin dependency list (DepListPool offset) |
| fanin_count | Number of producer dependencies |
| fanout_lock | Spinlock for concurrent fanout modification |
| fanout_head | Head of fanout consumer list |
| fanout_count | 1 (scope ref) + number of consumers |
| packed_buffer_base/end | GM heap region for output buffers |
| output_index[] | Maps outputs to param indices |
| params[] | Tensor and scalar parameters |
[0] INITIAL ──scan/orch_ready──► [1] READY ──dispatch──► RUNNING
▲ │
│ ▼
slot recycled ◄── [3] CONSUMED ◄──fanout done── [2] COMPLETED
In the scheduler's s_pto2_task_completed[] array:
- 0: not yet ready (initial or recycled)
- 1: ready for dispatch (all fanin satisfied)
- 2: hardware execution complete
- 3: fanout traversed, fully consumed
The orchestrator runs on AICPU Thread 3 and builds the task graph by calling the user-provided orchestration function.
Key members:
- task_ring, heap_ring, dep_pool: ring buffer state
- tensor_map, tensor_pool: dependency tracking
- scope_tasks[], scope_stack_top: scope nesting stack
- aicpu_fanin_refcount, aicpu_task_completed, aicpu_completed_by_task: pointers to scheduler-side arrays for parallel mode
| Step | Operation |
|---|---|
| 0 | pto2_orchestrator_sync_tensormap — prune stale TensorMap entries |
| 1 | pto2_task_ring_alloc — allocate task slot (may block on flow control) |
| 1b | Reset completed[slot]=0, completed_by_task[slot]=-1 for recycled slots |
| 2 | Initialize task descriptor, copy parameters |
| 3 | Lookup: for each INPUT/INOUT param, search TensorMap for producers |
| 4 | Dependency: pto2_add_consumer_to_producer for each producer found |
| 5 | Heap alloc: pto2_alloc_packed_buffer for OUTPUT params (addr=0) |
| 6 | Insert: register OUTPUT/INOUT params in TensorMap |
| 7 | Fanin: finalize fanin_count; if already satisfied, push to orch_ready_queue |
| 8 | Publish: STORE_RELEASE(current_task_index) makes task visible to scanners |
The orchestrator and scheduler run concurrently. When adding a consumer to a producer's fanout list:
- Orchestrator acquires the producer's fanout_lock
- Check early-return: if completed[prod_slot] >= 2 AND completed_by_task[prod_slot] == producer_id, the producer has already finished, so directly increment the consumer's refcount
- Normal path: prepend consumer to the producer's fanout list
- Unlock
The scheduler's completion handler mirrors this:
- Set completed_by_task[slot] = task_id (RELEASE)
- Set completed[slot] = 2 (RELEASE)
- Acquire fanout_lock, read fanout_head, release lock
- Traverse fanout, incrementing each consumer's fanin_refcount
This lock protocol guarantees every consumer is accounted for exactly once.
Scopes control the lifetime of intermediate buffers. Each scope:
- Tracks tasks submitted within it via scope_tasks[]
- On scope_end: decrements fanout_count for scope tasks; when it reaches 0, the task's packed buffer can be reclaimed
PTO2_SCOPE(rt) {
    // Tasks submitted here belong to this scope
    pto2_rt_submit_task(rt, FUNC_QK, PTO2_WORKER_CUBE, params, n);
    pto2_rt_submit_task(rt, FUNC_SF, PTO2_WORKER_VECTOR, params, n);
}
// scope_end: scope reference released from all tasks above

With aicpu_thread_num=4, the AICPU runs 4 threads:
| Thread | Role | Cores |
|---|---|---|
| 0 | Scheduler | 6 AIC + ~13 AIV |
| 1 | Scheduler | 6 AIC + ~13 AIV |
| 2 | Scheduler | 6 AIC + ~13 AIV |
| 3 | Orchestrator | none |
Core assignment: AICs and AIVs are divided equally among the 3 scheduler threads.
Each scheduler thread runs a tight loop with four phases:
Phase 1 — Completion Handling:
- Poll register COND on each managed core
- When TASK_FIN_STATE is detected: record completion timestamps, set completed[slot]=2, acquire fanout lock, traverse fanout list, set completed[slot]=3, advance the last_task_alive watermark
Phase 2 — Dispatch:
- For each idle core: pop a task from the ready queue (own shard first, then steal from other shards)
- Build PTO2DispatchPayload from TaskDescriptor
- Write task pointer to Handshake.task, signal AICore via register DATA_MAIN_BASE
Phase 3 — Incremental Scan:
- Atomically claim task indices from next_scan_index
- For root tasks (fanin_count == 0): CAS completed[slot] 0→1, push to ready queue
Phase 4 — Orch Ready Queue Drain:
- Consume entries pushed by the orchestrator's early-ready path (Step 7 in submit)
- CAS completed[slot] 0→1, push to ready queue
Ready queues are sharded to reduce lock contention:
- active_shards (default 3, configurable via PTO2_READY_QUEUE_SHARDS)
- Separate queues for AIC and AIV tasks, each with active_shards shards
- Push: thread t pushes to shard t % active_shards
- Pop: try own shard first, then scan other shards (work stealing)
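The push and pop rules can be sketched with a mutex-protected queue per shard. This is an illustration only: ShardedReadyQueueSketch is a hypothetical class, and the use of std::mutex and std::deque is an assumption, since the text does not say how each shard is implemented.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <mutex>
#include <vector>

// One ready queue split into shards; each shard has its own lock.
class ShardedReadyQueueSketch {
    struct Shard {
        std::mutex lock;
        std::deque<int32_t> tasks;
    };

public:
    explicit ShardedReadyQueueSketch(std::size_t active_shards)
        : shards_(active_shards) {
        assert(active_shards > 0);
    }

    // Push: thread t pushes to shard t % active_shards.
    void push(std::size_t thread, int32_t task_id) {
        Shard& s = shards_[thread % shards_.size()];
        std::lock_guard<std::mutex> guard(s.lock);
        s.tasks.push_back(task_id);
    }

    // Pop: try the caller's own shard first, then steal from the others.
    bool pop(std::size_t thread, int32_t* task_id_out) {
        const std::size_t n = shards_.size();
        for (std::size_t i = 0; i < n; ++i) {
            Shard& s = shards_[(thread + i) % n];
            std::lock_guard<std::mutex> guard(s.lock);
            if (!s.tasks.empty()) {
                *task_id_out = s.tasks.front();
                s.tasks.pop_front();
                return true;
            }
        }
        return false;
    }

private:
    std::vector<Shard> shards_;
};
```

In the runtime there would be one such structure per core type (AIC and AIV), so a scheduler thread only contends with others when its own shard is empty.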
After a task reaches state 3 (CONSUMED), the scheduler tries to advance last_task_alive:
while la < current_task_index:
    if completed[la & mask] < 3: break
    reset fanin_refcount[la & mask] = 0
    CAS(last_task_alive, la, la+1)
    advance heap_tail from task's packed_buffer_end
    la++
This is lock-free (CAS-based) and multiple scheduler threads can attempt it concurrently.
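A reduced sketch of the CAS loop is shown below using std::atomic. It covers only the watermark itself; the fanin_refcount reset and the heap_tail update are left out, and the function name and signature are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Advance last_task_alive past every consecutive CONSUMED (state 3) task.
// Returns the watermark value seen when the loop stops.
inline int32_t advance_last_task_alive(std::atomic<int32_t>& last_task_alive,
                                       int32_t current_task_index,
                                       const std::atomic<int32_t>* completed,
                                       int32_t mask) {
    int32_t la = last_task_alive.load(std::memory_order_acquire);
    while (la < current_task_index) {
        if (completed[la & mask].load(std::memory_order_acquire) < 3) {
            break;  // oldest task is not yet consumed
        }
        int32_t expected = la;
        if (last_task_alive.compare_exchange_strong(
                expected, la + 1, std::memory_order_acq_rel)) {
            ++la;           // this thread retired task `la`
        } else {
            la = expected;  // another thread advanced it; resume from there
        }
    }
    return la;
}
```

When the CAS fails, the losing thread simply continues from the value published by the winner, so no task is retired twice.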
Each AICore worker has a Handshake struct in shared memory:
| Field | Direction | Purpose |
|---|---|---|
| task | AICPU→AICore | Pointer to PTO2DispatchPayload |
| control | AICPU→AICore | 0=normal, 1=shutdown |
| perf_records_addr | AICPU→AICore | Performance buffer address |
Instead of polling Handshake.task_status, the production protocol uses hardware registers:
| Register | Direction | Usage |
|---|---|---|
| DATA_MAIN_BASE | AICPU→AICore | Write task_id + 1 to dispatch; EXIT_SIGNAL to shut down |
| COND | AICore→AICPU | [bit31=state, bits30:0=task_id]: ACK (state=0) or FIN (state=1) |
AICore execution loop:
- Poll DATA_MAIN_BASE for non-zero value
- Read payload from Handshake.task
- Write ACK to COND
- Execute kernel function via func_id_to_addr lookup
- Write FIN to COND
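The register encodings in the table can be expressed as a few bit helpers. The sketch assumes 32-bit register values (consistent with bit 31 being the state bit); all function and constant names are illustrative, and the EXIT_SIGNAL value is not given in the text, so it is not modelled.

```cpp
#include <cassert>
#include <cstdint>

// COND state bit values, as described in the register table.
constexpr uint32_t kStateAck = 0;
constexpr uint32_t kStateFin = 1;
constexpr uint32_t kTaskIdMask = 0x7FFFFFFFu;  // bits 30:0

// DATA_MAIN_BASE carries task_id + 1 so that 0 can mean "nothing to do".
constexpr uint32_t encode_dispatch(uint32_t task_id) { return task_id + 1; }
constexpr uint32_t decode_dispatch(uint32_t reg) { return reg - 1; }

// COND packs [bit31 = state, bits30:0 = task_id].
constexpr uint32_t encode_cond(uint32_t state, uint32_t task_id) {
    return (state << 31) | (task_id & kTaskIdMask);
}
constexpr uint32_t cond_state(uint32_t reg) { return reg >> 31; }
constexpr uint32_t cond_task_id(uint32_t reg) { return reg & kTaskIdMask; }
```

Carrying the task ID alongside the state lets the scheduler tell a FIN for the current task apart from a stale value left by an earlier one.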
Built by the scheduler from PTO2TaskDescriptor:
| Field | Description |
|---|---|
| task_id | Task identifier |
| kernel_id | Function ID |
| function_bin_addr | GM address of compiled kernel binary |
| num_args | Number of arguments |
| args[] | Tensor addresses and scalar values |
- Host compiles each kernel source (.cpp) into a binary (.o or .so)
- host_api.upload_kernel_binary(func_id, binary, size) uploads to GM
- The returned GM address is stored in Runtime.func_id_to_addr_[func_id]
- When dispatching, the scheduler copies this address into PTO2DispatchPayload.function_bin_addr
- Host compiles the orchestration source into a shared library (.so)
- The SO binary is embedded into Runtime.device_orch_so_storage_[] and copied to device
- AICPU Thread 3 writes the SO to a temp file, calls dlopen
- dlsym("aicpu_orchestration_config") returns configuration (expected arg count)
- dlsym("aicpu_orchestration_entry") returns the orchestration function pointer
- Thread 3 creates a PTO2Runtime, calls the orchestration function within a PTO2_SCOPE
- After orchestration completes: dlclose, delete temp file
| Flag | Set by | Waited by | Purpose |
|---|---|---|---|
| sm_header_ready_ | Thread 3 | Threads 0-2 | SM header initialized |
| pto2_init_complete_ | First init thread | Others | One-time memset of arrays done |
| orch_pointers_ready_ | Thread 3 | Threads 0-2 | Parallel mode pointers configured |
Startup sequence:
- Thread 3: create SM handle → set sm_header_ready_
- Scheduler threads: wait for sm_header_ready_ → one-time init → set pto2_init_complete_
- Thread 3: wait for pto2_init_complete_ → configure pointers → set orch_pointers_ready_
- Scheduler threads: wait for orch_pointers_ready_ → enter main loop
- Thread 3: call orchestration function → set orchestrator_done
The orchestration API is defined in pto_orchestration_api.h. Orchestration code depends only on this header.
| Function/Macro | Purpose |
|---|---|
| pto2_rt_submit_task(rt, kernel_id, worker_type, params, n) | Submit a task with parameters |
| PTO2_SCOPE(rt) { ... } | RAII scope for buffer lifetime |
| pto2_rt_orchestration_done(rt) | Signal orchestration complete |
| pto2_rt_init_tensor_pool(rt) | Initialize tensor pool for make_tensor() |
| Function | Description |
|---|---|
| make_tensor_external(ptr, shapes, ndim, dtype) | Wrap an existing device pointer as a tensor |
| make_tensor(shapes, ndim, dtype) | Create an intermediate tensor (addr=0, allocated by runtime from heap) |
| make_input_param(tensor) | INPUT parameter: read by the task |
| make_output_param(tensor) | OUTPUT parameter: written by the task (auto-allocated if addr=0) |
| make_inout_param(tensor) | INOUT parameter: read then written |
| make_scalar_param(value) | 64-bit scalar parameter |
| Type | Target |
|---|---|
| PTO2_WORKER_CUBE | AIC cores (matrix multiplication) |
| PTO2_WORKER_VECTOR | AIV cores (vector operations) |
Each orchestration .so must export:
extern "C" PTO2OrchestrationConfig aicpu_orchestration_config(uint64_t* args, int arg_count);
extern "C" int aicpu_orchestration_entry(PTO2Runtime* rt, uint64_t* args, int arg_count);

KERNELS = [
{"func_id": 0, "name": "QK", "source": "aic/aic_qk_matmul.cpp", "core_type": "aic"},
{"func_id": 1, "name": "SF", "source": "aiv/aiv_softmax_prepare.cpp", "core_type": "aiv"},
{"func_id": 2, "name": "PV", "source": "aic/aic_pv_matmul.cpp", "core_type": "aic"},
{"func_id": 3, "name": "UP", "source": "aiv/aiv_online_update.cpp", "core_type": "aiv"},
{"func_id": 5, "name": "AIV_HUB", "source": "aiv/aiv_hub.cpp", "core_type": "aiv"},
]
ORCHESTRATION = {
"source": "orchestration/paged_attention_orch.cpp",
"function_name": "aicpu_orchestration_entry",
}
RUNTIME_CONFIG = {
"runtime": "tensormap_and_ringbuffer",
"aicpu_thread_num": 4,
"block_dim": 24,
}

void aicpu_orchestration_entry(PTO2Runtime* rt, uint64_t* args, int arg_count) {
    // Unpack args: query, key_cache, value_cache, block_table, context_lens, out, config
    for (q_idx = 0; q_idx < q_loop; q_idx++) {
        for (batch_start = 0; batch_start < batch; batch_start += IN_CORE_BATCH) {
            PTO2_SCOPE(rt) {
                // Allocate accumulator tensors (oi, li, mi) via make_tensor()
                // Submit AIV_HUB to initialize accumulators
                for (bn = 0; bn < max_bn; bn++) {
                    // Allocate intermediate tensors (sij, pij, mij, lij, oi_new)
                    // Submit QK (CUBE) → SF (VECTOR) → PV (CUBE) → UP (VECTOR)
                }
            }
        }
    }
}

The task graph per chunk (16 batches):
AIV_HUB ──► QK ──► SF ──► PV ──► UP
│
QK ──► SF ──► PV ──► UP (next block, depends on UP above via INOUT oi/li/mi)
With batch=256, IN_CORE_BATCH=16: 16 chunks × 13 tasks = 208 tasks, parallelizable across cores.
ALL_CASES = {
"Case1": {"batch": 1, "num_heads": 16, "head_dim": 16, "context_len": 16},
"CaseBatch256": {"batch": 256, "num_heads": 1, "head_dim": 256, "context_len": 16},
...
}
def generate_inputs(params) -> list:
    # Returns [(name, tensor_or_scalar), ...] for host→device transfer
    return [("query", query), ("key_cache", key_cache), ..., ("out", out), ("config", config)]

def compute_golden(tensors, params):
    # PyTorch reference implementation of online softmax paged attention
    tensors["out"][:] = paged_attention(...)

1. Parse kernel_config.py (KERNELS, ORCHESTRATION, RUNTIME_CONFIG)
2. Compile in parallel:
   - Runtime shared library
   - Orchestration SO
   - Each kernel binary (AIC/AIV)
3. Load host binary: bind_host_binary() → Runtime class
4. For each test case:
   a. golden.py:generate_inputs() → func_args, arg_types, arg_sizes
   b. runtime.initialize(orch_so, func_name, func_args, arg_types, arg_sizes, kernels)
      → allocates device memory, uploads binaries, prepares SM and heap
   c. launch_runtime(runtime, aicpu_threads=4, block_dim=24)
      → spawns AICPU + AICore threads
   d. runtime.finalize() → copy results back to host
   e. Compare output vs golden.py:compute_golden()
Time ──────────────────────────────────────────────────────────────────────►
Thread 3: [create SM] [wait init] [set pointers] [orchestrate: submit 208 tasks] [done]
│ ▲ │
▼ │ ▼
Threads 0-2: [wait SM] [init arrays] [wait ptrs] [scan/dispatch/complete loop] [shutdown]
│
▼
AICore: [execute kernels...]
| Variable | Default | Description |
|---|---|---|
| PTO2_RING_TASK_WINDOW | 65536 | Task ring window size (power of 2, >= 4) |
| PTO2_RING_HEAP | 1 GB | GM heap size (>= 1024) |
| PTO2_RING_DEP_POOL | 65536 | Dependency list pool size (>= 16) |
| PTO2_READY_QUEUE_SHARDS | 3 | Ready queue shard count per core type |
| PA_CASE | Case1 | Test case name for batch_paged_attention |
| PA_SEQ_LEN | - | Comma-separated per-batch sequence lengths |
| Constant | Value | Description |
|---|---|---|
| PTO2_TASK_WINDOW_SIZE | 65536 | Default task window |
| PTO2_HEAP_SIZE | 1 GB | Default heap size |
| PTO2_DEP_LIST_POOL_SIZE | 65536 | Default dep list pool |
| PTO2_TENSORMAP_POOL_SIZE | 65536 | TensorMap entry pool |
| PTO2_TENSORMAP_NUM_BUCKETS | 65536 | TensorMap hash buckets |
| PTO2_ALIGN_SIZE | 64 | Memory alignment |
| PTO2_PACKED_OUTPUT_ALIGN | 1024 | Output buffer alignment |
The docs/pypto-frontend-coding-style.md describes the Python-to-C++ code generation pipeline:
| Type | Description |
|---|---|
| Opaque | Default function type; may contain pl.incore() calls |
| Orchestration | Host/AICPU orchestration function; calls InCore functions |
| InCore | AICore kernel subgraph (load/compute/store) |
pypto IR ──► Orchestration Codegen ──► orchestration.cpp (uses PTO2 API)
pypto IR ──► InCore Codegen ──► kernel.cpp (AIC/AIV kernels)
The generated orchestration code uses the same PTO2 API described in Section 11:
- make_tensor_external() for external inputs/outputs
- make_tensor() for intermediate buffers
- pto2_rt_submit_task() for kernel submission
- PTO2_SCOPE() for buffer lifetime management
Dependencies are inferred automatically by the TensorMap from tensor read/write patterns — the orchestration code does not need to specify explicit dependency edges.
| Backend | Output | Description |
|---|---|---|
| PTO | .pto → ptoas → C++ | PTO ISA assembly |
| CCE | C++ with set_flag/wait_flag | Direct C++ with synchronization |
Sections 10–11 (in the pypto-frontend-coding-style) describe the language-level semantics of cluster allocation and block_incore functions. This section describes the runtime-level changes required to support these features in the PTO2 runtime, orchestration codegen, and scheduler.
All incore functions submitted between allocate_cluster() and free_cluster() (or scope-based automatic release) form an in-cluster function group. The runtime must treat this group as a co-scheduled unit: every task in the group executes on the same physical cluster identified by clusterID.
The key invariant:
allocate_cluster() → clusterID
submit_task(kernel_A, clusterID, ...) ─┐
submit_task(kernel_B, clusterID, ...) │ function group
submit_task(kernel_C, clusterID, ...) ─┘
free_cluster(clusterID) // or automatic release when clusterID tensor leaves scope
All tasks within the group carry the same clusterID constraint. The scheduler dispatches them only to the cores belonging to that cluster, while still respecting data dependencies for ordering.
The current PTO2TaskDescriptor must be extended to record function group membership:
| New Field | Type | Description |
|---|---|---|
| cluster_id | int32_t | ID of the allocated cluster (-1 = unconstrained) |
| group_id | int32_t | Function group identifier (all tasks in the same allocate/free scope share the same group_id) |
When cluster_id >= 0, the scheduler must not dispatch the task to any core outside the designated cluster. When cluster_id == -1, the task follows the current unconstrained scheduling policy.
New API functions for orchestration code (generated or hand-written):
// Allocate a cluster. Blocks if no cluster is available.
// Returns a clusterID (integer) identifying the allocated cluster.
int32_t pto2_rt_allocate_cluster(PTO2Runtime* rt);
// Release a cluster back to the free pool.
// All tasks in the group must have completed before release.
void pto2_rt_free_cluster(PTO2Runtime* rt, int32_t cluster_id);
// Submit a task constrained to a specific cluster.
void pto2_rt_submit_task_clustered(PTO2Runtime* rt, int kernel_id,
                                   int worker_type, PTOParam* params,
                                   int n, int32_t cluster_id);

Scope-based usage pattern (generated by codegen):
PTO2_SCOPE(rt) {
    int32_t cid = pto2_rt_allocate_cluster(rt);  // may block
    // All tasks in this group are pinned to cluster cid
    pto2_rt_submit_task_clustered(rt, FUNC_A, PTO2_WORKER_VECTOR, ..., cid);
    pto2_rt_submit_task_clustered(rt, FUNC_B, PTO2_WORKER_CUBE, ..., cid);
    pto2_rt_submit_task_clustered(rt, FUNC_C, PTO2_WORKER_VECTOR, ..., cid);
    pto2_rt_free_cluster(rt, cid);
    // or: automatic release when scope ends and clusterID tensor is reclaimed
}

The scheduler must be extended to support cluster-constrained tasks:
- Cluster ↔ Core mapping: A static mapping from cluster_id to the set of physical cores (e.g., cluster 0 = {AIC0, AIV0, AIV1}). This mapping is platform-specific and configured at initialization.
- Ready queue partitioning: When popping a task for a core, the scheduler checks task.cluster_id:
  - If -1: dispatch to any idle core of the correct type (current behavior).
  - If >= 0: dispatch only to a core belonging to that cluster.
- Cluster free pool: A ring or bitset tracking which clusters are currently free. allocate_cluster pops from this pool (blocking if empty); free_cluster pushes back.
- Dependency ordering within a group: Tasks within a function group are still ordered by TensorMap dependencies (PIPE_IN/PIPE_OUT produce read/write edges). The scheduler respects these edges as usual; cluster pinning only constrains where, not when.
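One possible shape for the cluster free pool is an atomic bitset, sketched below. This is an assumption for illustration, not the proposed implementation: ClusterPoolSketch is a hypothetical class, it supports at most 32 clusters (enough for PLATFORM_MAX_BLOCKDIM = 24), and try_allocate returns -1 where pto2_rt_allocate_cluster would spin-wait.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Free pool as a bitset: bit i set means cluster i is free.
class ClusterPoolSketch {
public:
    explicit ClusterPoolSketch(int num_clusters)
        : free_mask_(num_clusters >= 32 ? 0xFFFFFFFFu
                                        : ((1u << num_clusters) - 1u)) {
        assert(num_clusters > 0 && num_clusters <= 32);
    }

    // Claims the lowest-numbered free cluster, or returns -1 if none is free.
    int32_t try_allocate() {
        uint32_t mask = free_mask_.load(std::memory_order_acquire);
        while (mask != 0) {
            int32_t id = 0;
            while (((mask >> id) & 1u) == 0) {
                ++id;
            }
            // On failure the CAS reloads `mask`, and the loop retries.
            if (free_mask_.compare_exchange_weak(mask, mask & ~(1u << id),
                                                 std::memory_order_acq_rel)) {
                return id;
            }
        }
        return -1;
    }

    // Returns a cluster to the pool.
    void release(int32_t id) {
        free_mask_.fetch_or(1u << id, std::memory_order_release);
    }

private:
    std::atomic<uint32_t> free_mask_;
};
```

A blocking allocator would wrap try_allocate in the same spin-and-warn loop used by the ring allocators.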
pto2_rt_allocate_cluster uses the same spin-wait pattern as the ring buffer allocators:
spin_count = 0
while (no free cluster):
    spin_count++
    if spin_count % BLOCK_NOTIFY_INTERVAL == 0:
        LOG_WARN("[Cluster] BLOCKED: no free cluster, spins=%d", spin_count)
    if spin_count >= CLUSTER_SPIN_LIMIT:
        LOG_ERROR("FATAL: Cluster allocation deadlock: all clusters occupied")
        exit(1)
This provides the same deadlock detection as the task ring and heap ring (Section 4.5).
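One way the bitset variant of the free pool could look is sketched below. It assumes C11 atomics, at most 32 clusters, and hypothetical names (`SketchClusterPool`, `POOL_NUM_CLUSTERS`, `POOL_SPIN_LIMIT`). Unlike the pseudocode above, it returns -1 when the spin limit is reached instead of logging and calling exit(1), so the behavior can be exercised in a test.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define POOL_NUM_CLUSTERS 4      /* assumed cluster count for the sketch */
#define POOL_SPIN_LIMIT   1000u  /* stand-in for CLUSTER_SPIN_LIMIT */

/* Bit i set means cluster i is free. */
typedef struct {
    _Atomic uint32_t free_mask;
} SketchClusterPool;

static void sketch_pool_init(SketchClusterPool* pool) {
    atomic_init(&pool->free_mask, (1u << POOL_NUM_CLUSTERS) - 1u);
}

/* Try once to claim the lowest-numbered free cluster; -1 if none is free. */
static int32_t sketch_try_allocate(SketchClusterPool* pool) {
    uint32_t mask = atomic_load(&pool->free_mask);
    while (mask != 0) {
        uint32_t lowest = mask & (~mask + 1u);  /* isolate lowest set bit */
        if (atomic_compare_exchange_weak(&pool->free_mask, &mask, mask & ~lowest)) {
            int32_t id = 0;
            while ((lowest >> id) != 1u) id++;
            return id;
        }
        /* A failed CAS reloads `mask`; loop and retry with the fresh value. */
    }
    return -1;
}

/* Spin until a cluster is free, giving up with -1 after POOL_SPIN_LIMIT tries. */
static int32_t sketch_allocate_cluster(SketchClusterPool* pool) {
    for (uint32_t spin = 0; spin < POOL_SPIN_LIMIT; spin++) {
        int32_t id = sketch_try_allocate(pool);
        if (id >= 0) return id;
    }
    return -1;  /* caller treats this as the deadlock condition */
}

static void sketch_free_cluster(SketchClusterPool* pool, int32_t cluster_id) {
    assert(cluster_id >= 0 && cluster_id < POOL_NUM_CLUSTERS);
    atomic_fetch_or(&pool->free_mask, 1u << cluster_id);
}
```

A bitset keeps allocation to a single compare-and-swap on one word. A ring would preserve FIFO order among freed clusters at the cost of separate head and tail indices.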
A block_incore function is written as a single SPMD kernel parameterized by (block_dim, block_id). At the runtime level, the orchestration layer expands this single logical SPMD call into block_dim individual MPMD tasks, each with a distinct block_id:
block_incore call (block_dim=N):
    ──► submit_task(kernel, block_id=0, ...)
    ──► submit_task(kernel, block_id=1, ...)
    ──► ...
    ──► submit_task(kernel, block_id=N-1, ...)
Each expanded task is an independent PTO2TaskDescriptor submitted through the standard pto2_rt_submit_task path. The scheduler treats them as N separate tasks that can be dispatched to any available cores.
The generated orchestration code for a block_incore call produces a loop:
PTO2_SCOPE(rt) {
    for (int bid = 0; bid < block_dim; bid++) {
        PTOParam params[] = {
            make_input_param(input),
            make_output_param(output),
            make_scalar_param(block_dim),
            make_scalar_param(bid),
        };
        pto2_rt_submit_task(rt, KERNEL_FUNC_ID, PTO2_WORKER_VECTOR, params, 4);
    }
}
The kernel binary is the same for all block_dim tasks; only the block_id scalar parameter differs. The runtime's TensorMap tracks per-task tensor dependencies as usual.
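On the kernel side, the (block_dim, block_id) pair is typically what lets identical code work on different data. The helper below is a hypothetical illustration of one common convention, an even split of a 1-D range; the document does not specify how real kernels partition their work, and `sketch_block_range` is not part of the PTO2 API.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    int64_t begin;  /* first element owned by this block */
    int64_t end;    /* one past the last element owned by this block */
} SketchBlockRange;

/* Split `total` elements across `block_dim` blocks as evenly as possible.
 * The first (total % block_dim) blocks each receive one extra element. */
static SketchBlockRange sketch_block_range(int64_t total, int32_t block_dim,
                                           int32_t block_id) {
    assert(total >= 0 && block_dim > 0);
    assert(block_id >= 0 && block_id < block_dim);
    int64_t base = total / block_dim;
    int64_t extra = total % block_dim;
    int64_t begin = block_id * base + (block_id < extra ? block_id : extra);
    int64_t len = base + (block_id < extra ? 1 : 0);
    SketchBlockRange r = { begin, begin + len };
    return r;
}
```

With total=10 and block_dim=4, the four tasks would cover [0,3), [3,6), [6,8), and [8,10), so every element is processed exactly once.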
When block_incore is used within a cluster function group, each of the block_dim expanded tasks carries the cluster_id constraint.
If the block_incore function group requires block_dim clusters (one per block, as described in Section 11.2), the orchestration allocates block_dim clusters and assigns each block to its own cluster:
int32_t cluster_ids[block_dim];
for (int bid = 0; bid < block_dim; bid++) {
    cluster_ids[bid] = pto2_rt_allocate_cluster(rt);
}
PTO2_SCOPE(rt) {
    for (int bid = 0; bid < block_dim; bid++) {
        // Each block's function group runs on its own cluster
        pto2_rt_submit_task_clustered(rt, FUNC_A, PTO2_WORKER_VECTOR, ..., cluster_ids[bid]);
        pto2_rt_submit_task_clustered(rt, FUNC_B, PTO2_WORKER_CUBE, ..., cluster_ids[bid]);
        pto2_rt_submit_task_clustered(rt, FUNC_C, PTO2_WORKER_VECTOR, ..., cluster_ids[bid]);
    }
}
for (int bid = 0; bid < block_dim; bid++) {
    pto2_rt_free_cluster(rt, cluster_ids[bid]);
}
The SPMD-to-MPMD expansion is the simplest correct approach, but has overhead:
| Concern | Current (MPMD expansion) | Potential Optimization |
|---|---|---|
| Task descriptors | block_dim descriptors per call | Batch descriptor: single descriptor with block_dim field |
| Orchestrator submission | O(block_dim) submit_task calls | Single submit_block_task call |
| Scheduler scan | O(block_dim) tasks to scan and dispatch | Group-aware dispatch: scan one, expand to block_dim dispatches |
| TensorMap entries | O(block_dim × params) entries | Shared-tensor optimization: one entry per logical tensor |
| Ring pressure | block_dim slots consumed simultaneously | Block-aware flow control: reserve block_dim slots atomically |
Measurement-first strategy: The current MPMD expansion is used as the baseline. Performance profiling (Perfetto swimlane, scheduler overhead breakdown) will identify whether the submission overhead, ring pressure, or TensorMap pressure is the bottleneck. Optimization is applied only where measured data shows a need.
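Before profiling, the baseline costs in the table can be estimated with simple arithmetic. The sketch below assumes one descriptor and one submit call per expanded task, and one TensorMap entry per tensor parameter per task. These are upper-bound assumptions for illustration, not measured runtime behavior, and the names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint64_t descriptors;        /* task descriptors, i.e. ring slots consumed */
    uint64_t submit_calls;       /* orchestrator submit_task calls */
    uint64_t tensormap_entries;  /* assumed: one per tensor parameter per task */
} SketchExpansionCost;

/* Estimate the cost of expanding one block_incore call under MPMD expansion. */
static SketchExpansionCost sketch_expansion_cost(uint32_t block_dim,
                                                 uint32_t tasks_per_block,
                                                 uint32_t tensor_params_per_task) {
    assert(block_dim > 0 && tasks_per_block > 0);
    SketchExpansionCost c;
    c.descriptors = (uint64_t)block_dim * tasks_per_block;
    c.submit_calls = c.descriptors;
    c.tensormap_entries = c.descriptors * tensor_params_per_task;
    return c;
}

/* True if the whole expansion fits in the given number of free ring slots. */
static bool sketch_fits_in_ring(const SketchExpansionCost* c, uint64_t free_slots) {
    return c->descriptors <= free_slots;
}
```

For example, block_dim=8 with a cube+vector pair per block and three tensor parameters per task gives 16 descriptors and 48 TensorMap entries under these assumptions.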
An incore function (see Section 15.1) is a subgraph of load/compute/store operations that executes on AICore. A single incore function may involve both AIC (cube/matrix) and AIV (vector) cores working together — for example, a fused matmul+activation where the matmul runs on AIC and the activation runs on AIV.
A block_incore function can also be an incore function. This means each block instance is itself an incore subgraph that may require both a cube kernel and a vector kernel co-operating on the same data.
When a block_incore function is an incore function consisting of a cube kernel and a vector kernel, the orchestration expands each block into two tasks (or more, depending on the pipeline depth):
block_incore call (block_dim=N, incore = cube + vector):
    for bid in 0..N-1:
        submit_task(cube_kernel, WORKER_CUBE, ..., block_id=bid)
        submit_task(vector_kernel, WORKER_VECTOR, ..., block_id=bid)
The TensorMap automatically tracks the dependency between the cube and vector tasks through their shared intermediate tensors (the cube kernel writes the intermediate, the vector kernel reads it).
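The producer/consumer edge can be illustrated with a toy last-writer table. This is a deliberately simplified stand-in for the real TensorMap (one input and one output per task, integer tensor ids, read-after-write edges only), with hypothetical names throughout.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_MAX_TENSORS 16
#define SKETCH_NO_TASK (-1)

/* Toy TensorMap: remembers the last task that wrote each tensor id. */
typedef struct {
    int32_t last_writer[SKETCH_MAX_TENSORS];
    int32_t next_task_id;
} SketchTensorMap;

static void sketch_tm_init(SketchTensorMap* tm) {
    for (int i = 0; i < SKETCH_MAX_TENSORS; i++) {
        tm->last_writer[i] = SKETCH_NO_TASK;
    }
    tm->next_task_id = 0;
}

/* Submit a task that reads `input` and writes `output` (tensor ids, or -1
 * for none). Stores the producer the new task must wait for in *dep
 * (SKETCH_NO_TASK if there is none) and returns the new task id. */
static int32_t sketch_tm_submit(SketchTensorMap* tm, int input, int output,
                                int32_t* dep) {
    assert(input < SKETCH_MAX_TENSORS && output < SKETCH_MAX_TENSORS);
    int32_t task_id = tm->next_task_id++;
    *dep = (input >= 0) ? tm->last_writer[input] : SKETCH_NO_TASK;
    if (output >= 0) tm->last_writer[output] = task_id;
    return task_id;
}
```

Submitting the cube task (writes the intermediate) and then the vector task (reads it) yields a dependency from vector to cube without the orchestration code naming it. Blocks that use distinct intermediates stay independent of each other.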
When combined with cluster allocation, both the cube and vector tasks of each block are pinned to the same cluster, ensuring they execute on co-located cores with local interconnect:
int32_t cid = pto2_rt_allocate_cluster(rt);
PTO2_SCOPE(rt) {
    // Cube kernel on AIC within cluster cid
    pto2_rt_submit_task_clustered(rt, CUBE_KERNEL, PTO2_WORKER_CUBE, ..., cid);
    // Vector kernel on AIV within cluster cid (depends on cube output via TensorMap)
    pto2_rt_submit_task_clustered(rt, VEC_KERNEL, PTO2_WORKER_VECTOR, ..., cid);
}
pto2_rt_free_cluster(rt, cid);
The intermediate data between cube and vector can use PIPE_IN/PIPE_OUT (local interconnect, no GM allocation) or regular tensors (GM-backed), depending on whether the cluster's local interconnect is used.
block_incore (block_dim=4, incore=cube+vector)
                      │
      ┌───────────────┼───────────────┬───────────────┐
      ▼               ▼               ▼               ▼
   Block 0         Block 1         Block 2         Block 3
  Cluster 0       Cluster 1       Cluster 2       Cluster 3
 ┌─────────┐     ┌─────────┐     ┌─────────┐     ┌─────────┐
 │AIC: cube│     │AIC: cube│     │AIC: cube│     │AIC: cube│
 │    │    │     │    │    │     │    │    │     │    │    │
 │    ▼    │     │    ▼    │     │    ▼    │     │    ▼    │
 │AIV: vec │     │AIV: vec │     │AIV: vec │     │AIV: vec │
 └─────────┘     └─────────┘     └─────────┘     └─────────┘
(local pipe)    (local pipe)    (local pipe)    (local pipe)
Each block runs its cube and vector kernels on the same cluster. Data flows through the local interconnect (TPUSH/TPOP) within each cluster. Cross-block data flows through global memory.
| Component | Change |
|---|---|
| PTO2TaskDescriptor | Add cluster_id, group_id, block_id, block_dim fields |
| PTO2SharedMemoryHeader | Add cluster free pool (bitset or ring) |
| Orchestration API | Add allocate_cluster, free_cluster, submit_task_clustered |
| Scheduler | Cluster-aware dispatch, cluster→core mapping table |
| Ready Queue | Optional: per-cluster queues for pinned tasks |
| TensorMap | No change — PIPE_IN/PIPE_OUT handled as minimal-shape tensors |
| Codegen | Generate cluster allocation + block_dim expansion loop + clustered submit |