Merged
9 changes: 5 additions & 4 deletions packages/orchestrator/pkg/sandbox/map.go
@@ -11,10 +11,14 @@
"github.com/e2b-dev/infra/packages/shared/pkg/smap"
)

// MapSubscriber receives lifecycle notifications from the sandbox Map.
//
// Callbacks are invoked synchronously from the goroutine that performed the
// state change. Implementations must be non-blocking; if async work is needed,
// the subscriber is responsible for dispatching it.
type MapSubscriber interface {
// OnInsert is triggered when a sandbox transitions to the running state.
OnInsert(ctx context.Context, sandbox *Sandbox)

Check notice on line 21 in packages/orchestrator/pkg/sandbox/map.go

Claude / Claude Code Review

NFSHandler.OnNetworkRelease violates new non-blocking subscriber contract

Comment on lines 14 to 21

🟣 NFSHandler.OnNetworkRelease performs blocking I/O (blocking on <-doneCh inside chroot.Close()), which violates the new non-blocking contract added to MapSubscriber in this PR (map.go:14-21). This is a pre-existing issue: NetworkReleased was already invoked synchronously before this PR and NFSHandler was already blocking, but the PR replaces the old 'synchronous, caller can rely on completion' comment with a new 'must be non-blocking' requirement without updating NFSHandler to comply. A slow chroot teardown will stall the goroutine calling NetworkReleased, delaying sandbox network slot recycling.

Extended reasoning...

What the bug is

The PR adds an interface-level contract to MapSubscriber (map.go:14-18):

Callbacks are invoked synchronously from the goroutine that performed the state change. Implementations must be non-blocking; if async work is needed, the subscriber is responsible for dispatching it.

At the same time, the PR removes the old NetworkReleased comment that said "Subscribers are invoked synchronously so the caller can rely on them having completed before taking any follow-up action." The old contract explicitly allowed blocking; the new contract explicitly forbids it.

The code path that triggers the violation

NFSHandler.OnNetworkRelease (nfsproxy/chroot/nfs.go:93-113) acquires h.mu.Lock(), removes entries from an internal map, releases the lock, then calls chroot.Close() in a loop for each removed entry. chroot.Close() calls fs.ns.Close() (chroot.go:119-121), which dispatches to mountNS.Close() (mountns.go:76-99). mountNS.Close() closes the stopCh channel to signal a dedicated OS thread to restore the original mount namespace, then blocks on <-doneCh (mountns.go:91), waiting for that thread to call unix.Setns() and signal completion. This is a genuine synchronous blocking operation that can take non-trivial time for each chroot being torn down.

Why existing code doesn't prevent it

NetworkReleased calls m.trigger() synchronously (no goroutine dispatch), and trigger() iterates all subscribers calling OnNetworkRelease in sequence. There is no timeout, context cancellation check, or goroutine wrapper around the subscriber calls. OnNetworkRelease is called directly on the same goroutine that initiated the network release.

Impact

Each chroot teardown in OnNetworkRelease stalls the goroutine calling NetworkReleased for the duration of the mount namespace restoration syscall. When multiple network slots are released concurrently or sequentially, these stalls accumulate, delaying sandbox network slot recycling. Future contributors reading the MapSubscriber interface will see "must be non-blocking" and assume compliance, potentially introducing race conditions or deadlocks if they rely on that guarantee.

Proof by example

  1. A sandbox is stopped; its IP slot is released via NetworkReleased(ctx, ip).
  2. NetworkReleased calls m.trigger(), which iterates subscribers and calls NFSHandler.OnNetworkRelease(ctx, sbx).
  3. NFSHandler.OnNetworkRelease iterates its chroot map and calls chroot.Close() for each entry.
  4. Each chroot.Close() → mountNS.Close() closes stopCh and then does <-doneCh.
  5. The dedicated OS goroutine receives the stop signal, calls unix.Setns() to restore the original namespace, then closes doneCh.
  6. Only after doneCh is closed does OnNetworkRelease return, unblocking NetworkReleased.
  7. The network slot recycling (and any follow-up logic) is delayed by the sum of all Setns syscall latencies.

How to fix

Wrap the blocking work in a goroutine inside NFSHandler.OnNetworkRelease, or provide a separate async teardown path. For example:

func (h *NFSHandler) OnNetworkRelease(ctx context.Context, sbx *sandbox.Sandbox) {
    chroots := h.collectChroots(sbx) // non-blocking extraction
    go func() {
        for _, c := range chroots {
            c.Close()
        }
    }()
}

This would satisfy the new non-blocking contract while preserving the teardown semantics.
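The "separate async teardown path" alternative could be a single worker goroutine that drains the blocking Close() calls, so the subscriber callback only performs a channel send. Everything below (teardownWorker, enqueue, shutdown, fakeChroot) is a hypothetical sketch, not code from the repository:

```go
package main

import (
	"fmt"
	"sync"
)

// closer stands in for the chroot type; only Close() matters here.
type closer interface{ Close() }

// teardownWorker moves blocking Close() calls onto one background goroutine,
// keeping the subscriber callback non-blocking (as long as the buffer has room).
type teardownWorker struct {
	ch chan closer
	wg sync.WaitGroup
}

func newTeardownWorker(buf int) *teardownWorker {
	w := &teardownWorker{ch: make(chan closer, buf)}
	w.wg.Add(1)
	go func() {
		defer w.wg.Done()
		for c := range w.ch {
			c.Close() // blocking teardown happens off the caller's goroutine
		}
	}()
	return w
}

// enqueue is what OnNetworkRelease would call instead of Close().
func (w *teardownWorker) enqueue(c closer) { w.ch <- c }

// shutdown flushes pending teardowns, e.g. at orchestrator exit, so no
// chroot teardown is silently dropped.
func (w *teardownWorker) shutdown() {
	close(w.ch)
	w.wg.Wait()
}

type fakeChroot struct{ closed *int }

func (f fakeChroot) Close() { (*f.closed)++ }

func main() {
	w := newTeardownWorker(16)
	n := 0
	for i := 0; i < 3; i++ {
		w.enqueue(fakeChroot{closed: &n})
	}
	w.shutdown()
	fmt.Println("chroots torn down:", n) // prints: chroots torn down: 3
}
```

Compared with spawning one goroutine per release, a single drain goroutine bounds concurrency and gives a natural place to wait for outstanding teardowns during shutdown.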

// OnNetworkRelease is triggered when a sandbox's network slot is released.
OnNetworkRelease(ctx context.Context, sbx *Sandbox)
}
@@ -100,7 +104,7 @@
return
}

-	go m.trigger(ctx, func(ctx context.Context, s MapSubscriber) {
+	m.trigger(ctx, func(ctx context.Context, s MapSubscriber) {
s.OnInsert(ctx, sbx)
})

@@ -146,9 +150,6 @@

// NetworkReleased unregisters a sandbox's IP and notifies OnNetworkRelease
// subscribers after a successful removal.
-//
-// Subscribers are invoked synchronously so the caller can rely on them
-// having completed before taking any follow-up action.
func (m *Map) NetworkReleased(ctx context.Context, ip string) {
var sbx *Sandbox
removed := m.network.RemoveCb(ip, func(_ string, v *Sandbox, exists bool) bool {