feat: add NetworkPolicy support for sandbox isolation#291
feat: add NetworkPolicy support for sandbox isolation#291Sanchit2662 wants to merge 2 commits intovolcano-sh:mainfrom
Conversation
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
There was a problem hiding this comment.
Code Review
This pull request implements network isolation for sandboxes using Kubernetes NetworkPolicies, supporting 'None', 'Restricted', and 'Custom' modes via the AgentRuntime and CodeInterpreter CRDs. The workload-manager is updated to manage the lifecycle of these policies, including automated creation and cleanup. A critical issue was identified in pkg/workloadmanager/workload_builder.go where the per-session NetworkPolicy override is ignored during the creation of CodeInterpreter sandboxes when warm pools are not utilized.
| } | ||
| sandbox := buildSandboxObject(buildParams) | ||
| return sandbox, nil, sandboxEntry, nil | ||
| np := buildNetworkPolicy(sandboxName, namespace, codeInterpreterObj.Spec.Template.NetworkPolicy, routerSelector, routerNamespace) |
There was a problem hiding this comment.
The per-session NetworkPolicy override is ignored in this code path. While it is correctly handled in the warm-pool path (line 403) and in buildSandboxByAgentRuntime (line 305), it is missing here. This means that a custom NetworkPolicy provided in the CreateSandboxRequest will not be applied when creating a CodeInterpreter sandbox without a warm pool.
| np := buildNetworkPolicy(sandboxName, namespace, codeInterpreterObj.Spec.Template.NetworkPolicy, routerSelector, routerNamespace) | |
| np := buildNetworkPolicy(sandboxName, namespace, effectiveNetworkPolicy(sessionNetworkPolicy, codeInterpreterObj.Spec.Template.NetworkPolicy), routerSelector, routerNamespace) |
- Add SandboxNetworkPolicy API type (None/Restricted/Custom modes) - Build per-sandbox NetworkPolicy with router ingress always allowed - Support cross-namespace router via NamespaceSelector - Per-session NP override in CreateSandboxRequest (replace semantics) - Configurable router selector/namespace via chart values and CLI flags - Clean up NP on sandbox delete, rollback, and GC - Add networkpolicies RBAC to workloadmanager ClusterRole Signed-off-by: Sanchit2662 <sanchit2662@gmail.com>
c8b70b4 to
9e2f065
Compare
There was a problem hiding this comment.
Pull request overview
Adds optional Kubernetes NetworkPolicy generation/cleanup to improve sandbox network isolation in AgentCube’s Workload Manager, closing the “no network restrictions” gap for untrusted sandbox code execution.
Changes:
- Introduces
SandboxNetworkPolicyin runtime API templates (AgentRuntime + CodeInterpreter) and a per-session override viaCreateSandboxRequest. - Implements NetworkPolicy building (None/Restricted/Custom) plus lifecycle management (create on sandbox create; best-effort delete on rollback, delete API, and GC).
- Adds router selector/namespace configuration (flags + Helm values) to support cross-namespace router→sandbox ingress allow rules.
Reviewed changes
Copilot reviewed 21 out of 21 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| pkg/workloadmanager/workload_builder.go | Wires template/session policy into sandbox build flow and returns a NetworkPolicy object for creation. |
| pkg/workloadmanager/server.go | Adds config defaults for router selector/namespace used by Restricted mode ingress. |
| pkg/workloadmanager/networkpolicy_builder.go | Builds NetworkPolicy objects for None/Restricted/Custom modes (router ingress + optional DNS egress). |
| pkg/workloadmanager/networkpolicy_builder_test.go | Unit tests for NetworkPolicy building and “replace, not merge” selection behavior. |
| pkg/workloadmanager/k8s_client.go | Adds helpers to create/delete NetworkPolicies and generalizes clientset type. |
| pkg/workloadmanager/handlers.go | Creates NetworkPolicy during sandbox creation; deletes on rollback and explicit delete handler. |
| pkg/workloadmanager/handlers_test.go | Updates handler tests for new builder signatures and adds session override forwarding test. |
| pkg/workloadmanager/garbage_collection.go | Ensures NetworkPolicy cleanup runs even when Sandbox/SandboxClaim is already NotFound. |
| pkg/workloadmanager/garbage_collection_test.go | Adds fake clientset to cover NP cleanup path without nil panics. |
| pkg/common/types/sandbox.go | Extends CreateSandboxRequest to accept per-session NetworkPolicy override. |
| pkg/apis/runtime/v1alpha1/networkpolicy_types.go | Adds new API types for SandboxNetworkPolicy modes and rules. |
| pkg/apis/runtime/v1alpha1/agent_type.go | Adds NetworkPolicy field to AgentRuntime SandboxTemplate. |
| pkg/apis/runtime/v1alpha1/codeinterpreter_types.go | Adds NetworkPolicy field to CodeInterpreterSandboxTemplate. |
| pkg/apis/runtime/v1alpha1/zz_generated.deepcopy.go | Updates generated deep-copies for new NetworkPolicy fields/types. |
| manifests/charts/base/values.yaml | Adds Helm values for router selector/namespace used in Restricted mode. |
| manifests/charts/base/templates/workloadmanager.yaml | Passes router selector/namespace flags into workload-manager deployment. |
| manifests/charts/base/templates/rbac/workloadmanager.yaml | Grants workload-manager RBAC to manage networkpolicies. |
| manifests/charts/base/crds/runtime.agentcube.volcano.sh_agentruntimes.yaml | Updates AgentRuntime CRD schema with networkPolicy fields. |
| manifests/charts/base/crds/runtime.agentcube.volcano.sh_codeinterpreters.yaml | Updates CodeInterpreter CRD schema with networkPolicy fields. |
| cmd/workload-manager/main.go | Adds flags + selector parsing for router selector/namespace configuration. |
| cmd/workload-manager/main_test.go | Unit tests for selector parsing logic. |
| } | ||
| sandbox := buildSandboxObject(buildParams) | ||
| return sandbox, nil, sandboxEntry, nil | ||
| np := buildNetworkPolicy(sandboxName, namespace, codeInterpreterObj.Spec.Template.NetworkPolicy, routerSelector, routerNamespace) |
| // createNetworkPolicy creates a NetworkPolicy using the workloadmanager's own client. | ||
| func createNetworkPolicy(ctx context.Context, clientset kubernetes.Interface, np *networkingv1.NetworkPolicy) error { | ||
| _, err := clientset.NetworkingV1().NetworkPolicies(np.Namespace).Create(ctx, np, metav1.CreateOptions{}) | ||
| if err != nil && !apierrors.IsAlreadyExists(err) { | ||
| return fmt.Errorf("failed to create network policy %s/%s: %w", np.Namespace, np.Name, err) | ||
| } | ||
| return nil |
| @@ -21,6 +21,7 @@ import ( | |||
| "time" | |||
|
|
|||
| metav1 "k8s.io/apimachinery/pkg/apis/meta/v1" | |||
|
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #291 +/- ##
==========================================
- Coverage 46.73% 46.32% -0.42%
==========================================
Files 30 31 +1
Lines 2822 3033 +211
==========================================
+ Hits 1319 1405 +86
- Misses 1363 1481 +118
- Partials 140 147 +7
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
…eter path - Fix session NP override being ignored in direct sandbox path - Add CIDR format validation markers to SandboxEgressRule and SandboxIngressRule - Add example manifest for NetworkPolicy modes (Restricted, Custom) Signed-off-by: Sanchit2662 <sanchit2662@gmail.com>
What type of PR is this?
/kind feature
What this PR does / why we need it:
Sandboxes had zero network restrictions. A pod inside a sandbox could reach anything, internal or external. Since sandboxes run untrusted code this is a real isolation gap, so I picked up issue #216 to add NetworkPolicy support.
I added a
SandboxNetworkPolicyfield to bothSandboxTemplate(AgentRuntime) andCodeInterpreterSandboxTemplate. It has three modes you pick from:None- default, no policy created, nothing breaks for existing users -Restricted- deny-all baseline. DNS egress on port 53 (UDP+TCP) is always allowed so pods don't lose name resolution. Router ingress is always allowed toosince the router has to be able to reach the sandbox -
Custom- you pass raw k8s NetworkPolicy ingress/egress rules directlyA few specific things I was careful about:
Cross-namespace router ingress - using just
PodSelectorwould only match pods in the same namespace as the NP. In a real deployment the router usually lives in a different namespace than the sandbox pods, so that would silently break the router rule. I fixed this by adding aNamespaceSelectorusingkubernetes.io/metadata.nameso it works across namespaces.Router selector is configurable - you can set it in
values.yamlunderworkloadmanager.networkPolicy.routerSelectorandrouterNamespace, or pass--router-selectorand--router-namespaceflags. Defaults toapp=agentcube-routerand the workloadmanager's own namespace.Per-session override - added
NetworkPolicyfield toCreateSandboxRequest. When set it fully replaces the template-level policy for that session only. No merging, just a clean replace.Cleanup - NPs are deleted when you delete the sandbox through the API, on rollback if creation fails partway through, and in the GC loop. I also fixed a bug where the GC was returning early on
IsNotFoundbefore deleting the NP. Since the store record gets purged right after, that would've leaked the NP permanently. Now NP cleanup runs regardless of whether the sandbox was already gone.Rollback - simplified it to always attempt NP delete when
networkPolicy != nilrather than tracking anpCreatedflag. This also handles the edge case where the k8s write succeeded but returned an error to us.RBAC - added
networking.k8s.io/networkpoliciescreate/get/list/watch/delete to the workloadmanager ClusterRole. Without this the API server returns 403 on NP create.Which issue(s) this PR fixes:
Fixes #216
Special notes for your reviewer:
Known gap: for the warm-pool path (SandboxClaim), the NP gets created but its
PodSelectorusesSandboxNameLabelKeywhich agent-sandbox doesn't propagate onto warm pool pods. So the NP exists but won't enforce anything for warm-pool sessions right now. I want to confirm the label propagation behavior on the agent-sandbox side before fixing this, happy to do it as a follow-up.No OwnerReferences on the NP for now. Cleanup is handled explicitly through the delete handler and GC. Can add owner refs in a follow-up if you'd prefer that approach.
Does this PR introduce a user-facing change?: