
perf(nbd): zero-alloc, zero-copy synchronous dispatch #2395

Closed
ValentaTomas wants to merge 12 commits into main from perf/zero-alloc-nbd

Conversation

@ValentaTomas
Member

Summary

  • Replace async goroutine-per-request NBD dispatch with a synchronous loop using mmap slices directly
  • Reads: Overlay.ReadSlices returns per-block mmap slices (overlay cache or template cache), written to the NBD socket via writev (net.Buffers) — zero user-space copies, zero allocations after warmup
  • Writes: Cache.WriteSlice returns a writable mmap reference; socket data is read directly into the mmap page — zero user-space copies, zero allocations
  • Removes per-request: goroutines, channels, sync.Pool, write lock, WaitGroup, and shutting-down machinery

Profile context

From profiling on n2-highmem-80 nodes, NBD dispatch was the dominant allocator after the multipart upload fix (#2394):

Path                              Cumulative allocs   What it was
Dispatch.Handle (write buffers)   57 TB (47%)         make([]byte, request.Length) per write
cmdRead.func1 (read buffers)      20 TB (16%)         make([]byte, length) per read

This PR eliminates both — zero allocations after the first request at each size.

How it works

Reads: Instead of allocating a buffer, copying mmap data into it, then writing to socket — we collect per-block mmap slice references and send them via writev in a single syscall.

Writes: Instead of allocating a buffer, reading socket data into it, then copying into mmap — we get a writable reference to the overlay cache mmap and read socket data directly into it.

Parallelism is preserved via FlagCanMulticonn (multiple NBD socket connections per device), rather than per-request goroutines within a connection.

Test plan

  • go build ./... passes
  • go vet passes
  • Existing NBD tests pass (require root / nbd kernel module)
  • Profile on staging to verify allocation elimination

@cursor

cursor Bot commented Apr 14, 2026

PR Summary

High Risk
High risk because it rewires the NBD dispatch I/O path (provider interface, read/write execution model, and buffer management), where mistakes can cause data corruption, stalled connections, or increased memory use under load.

Overview
Refactors NBD dispatch to use a new Provider API (ReadSlices/WriteFrom) and stages write payloads in a reusable buffer pool while handling reads/writes asynchronously, aiming to avoid per-request allocations and prevent a slow backend operation from stalling pipelined requests. Adds Cache.WriteSlice and Overlay.ReadSlices/Overlay.WriteFrom to expose mmap-backed slices and support direct reads/writes, and updates direct-path mounting and test utilities to use the new provider interface.

Reviewed by Cursor Bugbot for commit d23c0a4. Bugbot is set up for automated code reviews on this repo.


// done(false) releases the write lock without marking dirty (use on write failure).
func (c *Cache) WriteSlice(off, length int64) ([]byte, func(bool), error) {
c.mu.Lock()


Cache mutex held during socket I/O. c.mu.Lock() is acquired here and not released until done() is called in dispatch.go after the entire fp.Read loop completes (line 195). The mutex is held for the full duration of reading the write payload off the socket. With FlagCanMulticonn, multiple Dispatch goroutines share the same Provider. While one goroutine is blocked on slow socket data holding c.mu, every other connection's reads and writes stall on this lock. The old design held the lock only briefly inside WriteAt, after socket data was already buffered locally. The fix is to read socket data into a temp buffer first, then acquire the lock only to copy into and mark the mmap.

func (o *Overlay) ReadSlices(ctx context.Context, off, length int64, dest [][]byte) ([][]byte, error) {
dest = dest[:0]
numBlocks := length / o.blockSize


Silent truncation for non-blockSize-aligned lengths. numBlocks := length / o.blockSize discards trailing bytes when length is not a multiple of blockSize, with no error. If a non-aligned length is passed, writeReadResponse sends fewer bytes than requested and the kernel NBD driver treats this as an I/O error. Kernel NBD requests are block-aligned in practice, but a guard returning an error for non-aligned lengths would surface any future misuse immediately.

Replace async goroutine-per-request NBD dispatch with a synchronous
loop using mmap slices directly.

Reads: Overlay.ReadSlices returns per-block mmap slices, written to the
NBD socket via writev (net.Buffers) — zero user-space copies.

Writes: socket data is read into a reusable buffer, then copied into
the mmap under a brief lock via Cache.WriteSlice.

Removes per-request: goroutines, channels, sync.Pool, write lock,
WaitGroup, and shutting-down machinery.
…ignment

Restore the 32 MB dispatchMaxWriteBufferSize check removed in the
zero-alloc rewrite — a corrupted NBD header could cause an unbounded
allocation and OOM.

Add offset alignment validation in Overlay.ReadSlices to match the
existing length check.
ReadSlices calls device.Slice instead of ReadAt, so the SlowDevice
test wrapper must delay Slice as well to trigger the kernel NBD
I/O timeout in TestSlowBackend_ShortTimeout.
Copy mmap slice data into a reusable contiguous buffer before writing
the NBD read response, matching the old ReadAt behavior. This avoids
potential issues with sending mmap references directly while the
underlying data could be modified by concurrent operations.
@ValentaTomas ValentaTomas force-pushed the perf/zero-alloc-nbd branch 3 times, most recently from 5c06f2c to f0ec082 Compare April 16, 2026 07:52
Replace per-request buffer allocations in NBD dispatch with zero-copy
mmap slice access. Reads use ReadSlices to get per-block mmap references
and copy into a pooled buffer; writes use WriteSlice to get a writable
mmap reference and copy socket data directly in.

Keeps the existing goroutine-per-request dispatch structure unchanged.
cmdRead spawned a goroutine that called ReadSlices and then wrote the
returned mmap slices to the NBD socket. Meanwhile the dispatch loop kept
reading the next request and could process a write that mutated the same
mmap pages while the read goroutine was still streaming them out. The
result was silently corrupted data on the device — manifesting as
"adduser: exit status 126" during template builds because executable
bytes on the rootfs were partially overwritten before being delivered.

Process the read response inline so the slices are fully written before
the loop accepts the next request. Concurrency across requests is still
preserved by FlagCanMulticonn (one Handle goroutine per NBD socket
connection). The shuttingDown / pendingResponses bookkeeping that only
existed to wait for the in-flight goroutine becomes unnecessary, but is
retained as no-ops in the struct/Drain method to keep the diff minimal.
Adopt the provName field and structured per-request error logging that
main introduced for cmdRead/cmdWrite, while preserving this PR's
zero-copy dispatch design (synchronous cmdRead + WriteFrom). This
removes the merge conflict so the pull_request workflow can run.
Resolve dispatch.go conflict by keeping this branch's zero-copy design
(synchronous cmdRead with mmap slices, WriteFrom for writes) while
incorporating main's structured per-request error logging and the
provName field.
cmdRead returned mmap slices directly to the NBD socket, which means
concurrent mutations of the underlying mmap pages (other NBD
connection's writes, snapshot/diff export, chunker activity, etc.)
could corrupt an in-flight read response. The smoketest exhibited this
as 'adduser ... exit status 126' from a corrupted busybox/adduser
binary on the freshly built rootfs.

Re-introduce the per-request copy into a reusable Dispatch buffer so
the bytes that go on the wire match a single point-in-time snapshot of
the backend, mirroring the original ReadAt behaviour. The buffer is
sized to cmdLength and reused across reads, so the only allocation
happens when a request is larger than any seen before.
The previous implementation used io.MultiReader to stream NBD write
payloads directly into the cache mmap via Overlay.WriteFrom. That
required Cache.WriteSlice to hold the cache write mutex for the entire
duration of the network read, serializing every other cache operation
(other connections' WriteFrom, ExportToDiff, Close, ...) behind a
potentially slow socket.

Stage the payload in a reusable per-Dispatch buffer first, then hand
that bytes.Reader to WriteFrom. The cache write lock is now held only
for the in-memory copy, while still avoiding per-request allocations.
The synchronous dispatch loop only allowed one in-flight NBD request per
connection. With NBDConnectionsPerDevice defaulting to 1, that serializes
every read/write the kernel pipelines onto the device, which surfaces as
flaky "build was cancelled" failures during sandbox boot.

Restore the goroutine-per-request model from main while keeping the new
ReadSlices/WriteFrom interface and zero-copy mmap handoff. Read responses
and write payloads use a sync.Pool to recycle buffers so we still avoid
the per-request allocations the previous synchronous version was tuned
to remove.

@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.



if err != nil {
d.fatalIfFirst(ctx, "write", cmdHandle, cmdFrom, uint32(len(*payloadBuf)), err)
}
}()

Pool buffer use-after-put race on context cancellation

Medium Severity

When performWrite returns early due to context cancellation, the cmdWrite goroutine's deferred putPayloadBuf returns the buffer to the pool. However, performWrite's inner goroutine may still be running WriteFrom(bytes.NewReader(payload), ...), where the bytes.Reader holds a reference to the same underlying array. A subsequent getPayloadBuf call from another request can reclaim and overwrite the buffer, causing the leaked goroutine to read corrupted data and write it into the cache mmap. The old code allocated a fresh []byte per request, avoiding this shared-buffer hazard.



@ValentaTomas
Member Author

Closing for now as this is a more invasive change.

@ValentaTomas ValentaTomas deleted the perf/zero-alloc-nbd branch April 19, 2026 00:01
