Commit 75d0dfc

feat: use NIXL native P2P metadata exchange instead of centralized blob storage (#177)
Signed-off-by: Nicolas 'Pixel' Noble <nicolas@nobis-crew.org>
1 parent 9b19667 commit 75d0dfc

19 files changed, +806 -86 lines

docs/ARCHITECTURE.md

Lines changed: 19 additions & 5 deletions
@@ -251,7 +251,15 @@ Key message types: `ModelProvider` (HuggingFace), `ModelStatus` (Downloading, Do
 | `GetMetadata` | `GetMetadataRequest` | `GetMetadataResponse` | Fetch full tensor metadata for one specific worker (MB-scale, on demand) |
 | `UpdateStatus` | `UpdateStatusRequest` | `UpdateStatusResponse` | Update per-worker lifecycle status (Initializing/Ready/Stale) |
 
-Key message types: `SourceIdentity` (all fields affecting tensor layout compatibility), `WorkerMetadata` (rank, oneof backend_metadata, tensors, status), `TensorDescriptor` (name, addr, size, device_id, dtype), `SourceInstanceRef` (lightweight worker reference for listing).
+Key message types: `SourceIdentity` (all fields affecting tensor layout compatibility), `WorkerMetadata` (rank, oneof backend_metadata, tensors, status, P2P endpoint fields), `TensorDescriptor` (name, addr, size, device_id, dtype), `SourceInstanceRef` (lightweight worker reference for listing).
+
+### p2p.proto - WorkerService (P2P, opt-in)
+
+| RPC | Request | Response | Purpose |
+|-----|---------|----------|---------|
+| `GetTensorManifest` | `GetTensorManifestRequest` | `GetTensorManifestResponse` | Fetch tensor descriptors directly from a source worker |
+
+Per-worker gRPC service started when `MX_P2P_METADATA=1`. Targets call this instead of fetching tensor descriptors from the central server. Validates `mx_source_id` to catch stale discovery.
 
 See [`metadata.md`](metadata.md) for the full metadata architecture including storage schemas and coordination protocol.
 
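The stale-discovery check in `GetTensorManifest` can be sketched in plain Python (the dataclass and function names below are hypothetical illustrations, not the actual `worker_server.py` implementation):

```python
from dataclasses import dataclass, field


@dataclass
class TensorDescriptor:
    name: str
    addr: int
    size: int
    device_id: int
    dtype: str


@dataclass
class WorkerState:
    mx_source_id: str
    tensors: list[TensorDescriptor] = field(default_factory=list)


def get_tensor_manifest(state: WorkerState, requested_source_id: str) -> list[TensorDescriptor]:
    """Serve tensor descriptors, rejecting a stale mx_source_id.

    A mismatched id means the target discovered an earlier incarnation of
    this worker via the central server and must re-run discovery.
    """
    if requested_source_id != state.mx_source_id:
        raise ValueError(
            f"stale mx_source_id {requested_source_id!r}; "
            f"current is {state.mx_source_id!r}"
        )
    return state.tensors
```

In the real service this check would sit inside the gRPC handler and map the mismatch to a gRPC error status rather than a `ValueError`.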
@@ -430,6 +438,7 @@ Loading precedence: CLI args > environment variables > config file > defaults.
 | `gds_transfer.py` | GPUDirect Storage availability check and transfer utilities |
 | `gds_loader.py` | `MxGdsLoader` - GDS-based model loader (direct file-to-GPU) |
 | `vllm_loader.py` | `MxModelLoader` - auto-detecting model loader (RDMA -> GDS -> disk) |
+| `worker_server.py` | `WorkerGrpcServer` - per-worker gRPC server for P2P tensor manifest exchange |
 | `vllm_worker.py` | `ModelExpressWorker` - custom vLLM worker class (use `--worker-cls=modelexpress.vllm_worker.ModelExpressWorker`) |
 | `types.py` | `TensorDescriptor`, `WorkerMetadata`, `GetMetadataResponse` dataclasses |
 | `p2p_pb2.py` / `p2p_pb2_grpc.py` | Generated protobuf/gRPC stubs |
@@ -452,10 +461,11 @@ Manages a NIXL agent and RDMA transfers for a single GPU worker:
 
 | Method | Purpose |
 |--------|---------|
-| `__init__(agent_name, device_id)` | Create NIXL agent with UCX backend |
+| `__init__(agent_name, device_id, listen_port)` | Create NIXL agent with UCX backend; `listen_port` enables P2P listen thread |
 | `register_tensors(tensors)` | Register GPU tensors for RDMA, return serialized metadata |
 | `get_registered_descriptors()` | Return region descriptors (`MX_CONTIGUOUS_REG=1`) or tensor descriptors |
-| `receive_from_source(source_metadata, source_tensors, ...)` | Execute RDMA read transfer with optional coalescing |
+| `fetch_remote_and_wait(agent_name, ip, port)` | P2P: fetch remote NIXL metadata via listen thread (polls until loaded) |
+| `receive_from_source(source_metadata, source_tensors, ..., remote_agent_name)` | Execute RDMA read transfer; `remote_agent_name` skips `add_remote_agent` (P2P) |
 | `shutdown()` | Clean up NIXL agent and resources |
 
### vLLM Loader
@@ -558,10 +568,10 @@ graph TD
 ### Flow
 
 1. **Source loads**: Loads weights from disk (or GDS), runs `process_weights_after_loading()`
-2. **Source publishes**: Registers tensors with NIXL, calls `PublishMetadata(identity, worker, worker_id)` -> gets `mx_source_id` (status=INITIALIZING)
+2. **Source publishes**: Registers tensors with NIXL, calls `PublishMetadata(identity, worker, worker_id)` -> gets `mx_source_id` (status=INITIALIZING). In P2P mode (`MX_P2P_METADATA=1`), publishes only lightweight endpoint pointers and starts a `WorkerGrpcServer` for tensor manifest serving.
 3. **Heartbeat starts**: `HeartbeatThread` sends `UpdateStatus(READY)` every 30s, refreshing `updated_at`
 4. **Target discovers**: Calls `ListSources(identity, status=READY)`, filters by `worker_rank`
-5. **Target fetches on demand**: Calls `GetMetadata(mx_source_id, worker_id)` for the chosen candidate
+5. **Target fetches on demand**: Calls `GetMetadata(mx_source_id, worker_id)` for the chosen candidate. Auto-detects P2P mode if `worker_grpc_endpoint` is populated - fetches tensors from the source worker's `WorkerService` and NIXL metadata via the listen thread instead of from the central server.
 6. **Target transfers**: Executes RDMA reads from source; on `SourceTransferError` tries next candidate (max 3)
 7. **Target becomes source**: After receiving weights, publishes own metadata and starts its own heartbeat
 8. **Stale detection**: Server-side reaper marks workers STALE if `updated_at` > 90s old; GC deletes after 1 hour
@@ -579,6 +589,10 @@ See [`metadata.md`](metadata.md) for the full storage schema and debugging guide
 | `MX_SERVER_ADDRESS` | `localhost:8001` | Backward-compat alias for `MODEL_EXPRESS_URL` |
 | `MX_METADATA_BACKEND` | (required) | Metadata backend: `redis` or `kubernetes` |
 | `MX_CONTIGUOUS_REG` | `0` | Enable contiguous region registration (experimental) |
+| `MX_P2P_METADATA` | `0` | Enable P2P metadata exchange on source workers |
+| `MX_METADATA_PORT` | `0` | NIXL listen thread port for P2P metadata exchange |
+| `MX_WORKER_GRPC_PORT` | `0` | Worker gRPC port for P2P tensor manifest serving |
+| `MX_WORKER_HOST` | (auto-detect) | Override worker IP/hostname for P2P endpoints |
 | `MX_HEARTBEAT_INTERVAL_SECS` | `30` | Client heartbeat frequency |
 | `MX_HEARTBEAT_TIMEOUT_SECS` | `90` | Server reaper staleness threshold |
 | `MX_REAPER_SCAN_INTERVAL_SECS` | `30` | Server reaper scan frequency |

docs/DEPLOYMENT.md

Lines changed: 16 additions & 0 deletions
@@ -230,6 +230,10 @@ ModelExpress supports GPU-to-GPU model weight transfers between vLLM instances u
 | `MX_SERVER_ADDRESS` | `localhost:8001` | Backward-compat alias for `MODEL_EXPRESS_URL` |
 | `MX_REGISTER_LOADERS` | `1` | Auto-register the mx loader with vLLM |
 | `MX_CONTIGUOUS_REG` | `0` | Contiguous region registration (experimental) |
+| `MX_P2P_METADATA` | `0` | Enable P2P metadata exchange (source workers only) |
+| `MX_METADATA_PORT` | `0` | NIXL listen thread port for P2P metadata exchange |
+| `MX_WORKER_GRPC_PORT` | `0` | Worker gRPC port for P2P tensor manifest serving |
+| `MX_WORKER_HOST` | (auto-detect) | Override worker IP/hostname for P2P endpoints |
 | `MX_STATUS_TTL_SECS` | `3600` | TTL for Redis metadata keys (seconds) |
 | `REDIS_URL` | `redis://localhost:6379` | Redis connection URL (Redis backend only) |
 | `MX_METADATA_NAMESPACE` | `default` | K8s namespace for CRD backend |
@@ -238,6 +242,18 @@ ModelExpress supports GPU-to-GPU model weight transfers between vLLM instances u
 
 Each GPU worker publishes independently using its global rank (`torch.distributed.get_rank()`). No inter-worker coordination or barriers required.
 
+### P2P Metadata Exchange (Opt-In)
+
+By default, source workers publish full tensor metadata (NIXL blobs + tensor descriptors) to the central server. With `MX_P2P_METADATA=1`, source workers instead publish lightweight endpoint pointers and exchange metadata directly with targets:
+
+- **NIXL agent blobs** exchanged via NIXL's native listen thread (`MX_METADATA_PORT`)
+- **Tensor descriptors** served by a per-worker gRPC `WorkerService` (`MX_WORKER_GRPC_PORT`)
+- **Central server** stores only endpoint addresses, not MB-scale metadata
+
+Targets auto-detect which mode a source is using based on whether `worker_grpc_endpoint` is populated in the metadata. No configuration needed on the target side.
+
+Set `MX_METADATA_PORT` and `MX_WORKER_GRPC_PORT` to fixed ports when running in K8s (port 0 picks an ephemeral port). Set `MX_WORKER_HOST` if the pod IP auto-detection doesn't produce a routable address.
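How a source worker might assemble its advertised endpoints from these variables, as a sketch (the helper name is hypothetical; the actual resolution logic lives in the client):

```python
import os
import socket


def resolve_p2p_endpoints(env=os.environ):
    """Build the (NIXL metadata, worker gRPC) endpoint strings to advertise.

    Port 0 asks the OS for an ephemeral port, which is only usable when
    targets can learn the bound port afterward; in K8s, set fixed ports.
    MX_WORKER_HOST overrides the auto-detected host when the default
    address is not routable from other pods.
    """
    host = env.get("MX_WORKER_HOST") or socket.gethostbyname(socket.gethostname())
    meta_port = int(env.get("MX_METADATA_PORT", "0"))
    grpc_port = int(env.get("MX_WORKER_GRPC_PORT", "0"))
    return f"{host}:{meta_port}", f"{host}:{grpc_port}"
```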
+
 ### UCX/NIXL Tuning
 
 | Variable | Recommended | Description |
Lines changed: 139 additions & 0 deletions
@@ -0,0 +1,139 @@
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0

# Single-node vLLM deployment with P2P metadata exchange enabled.
# Same as vllm-single-node.yaml but with MX_P2P_METADATA=1 on source workers.
#
# With P2P enabled, source workers exchange NIXL metadata and tensor manifests
# directly with targets instead of routing through the central server. The
# central server stores only lightweight endpoint pointers.
#
# Targets auto-detect P2P sources and need no special configuration.
#
# Prerequisites:
# - ModelExpress server deployed (see ../../server/)
# - PVC with model weights pre-downloaded
# - kubectl create secret generic hf-token-secret --from-literal=HF_TOKEN=<token>
apiVersion: v1
kind: Service
metadata:
  name: mx-vllm-p2p
  labels:
    app: mx-vllm-p2p
spec:
  type: ClusterIP
  ports:
    - port: 8000
      targetPort: 8000
      name: http
  selector:
    app: mx-vllm-p2p
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mx-vllm-p2p
  labels:
    app: mx-vllm-p2p
spec:
  replicas: 1
  selector:
    matchLabels:
      app: mx-vllm-p2p
  template:
    metadata:
      labels:
        app: mx-vllm-p2p
    spec:
      serviceAccountName: modelexpress
      containers:
        - name: vllm
          image: nvcr.io/nvidian/dynamo-dev/modelexpress-client:latest
          imagePullPolicy: IfNotPresent
          securityContext:
            capabilities:
              add:
                - IPC_LOCK
          env:
            - name: VLLM_RPC_TIMEOUT
              value: "7200000"
            - name: HF_HUB_CACHE
              value: "/models"
            - name: MODEL_NAME
              value: "deepseek-ai/DeepSeek-V3"
            - name: VLLM_PLUGINS
              value: "modelexpress"
            - name: MX_SERVER_ADDRESS
              value: "modelexpress-server:8001"
            - name: MX_CONTIGUOUS_REG
              value: "0"
            # P2P metadata exchange: source workers serve NIXL metadata and
            # tensor manifests directly to targets instead of via the server.
            - name: MX_P2P_METADATA
              value: "1"
            # Fixed ports for NIXL listen thread and worker gRPC server.
            # Use fixed ports in K8s so they can be reached across pods.
            - name: MX_METADATA_PORT
              value: "5555"
            - name: MX_WORKER_GRPC_PORT
              value: "6555"
            - name: NIXL_LOG_LEVEL
              value: "INFO"
            - name: UCX_LOG_LEVEL
              value: "INFO"
            - name: UCX_TLS
              value: "rc_x,rc,dc_x,dc,cuda_copy"
            - name: UCX_RNDV_SCHEME
              value: "get_zcopy"
            - name: UCX_RNDV_THRESH
              value: "0"
            - name: POD_IP
              valueFrom:
                fieldRef:
                  fieldPath: status.podIP
            - name: NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
            - name: POD_NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
            - name: HF_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-token-secret
                  key: HF_TOKEN
          args:
            - --model
            - $(MODEL_NAME)
            - --load-format
            - mx
            - --tensor-parallel-size
            - "8"
            - --enable-expert-parallel
          resources:
            limits:
              nvidia.com/gpu: "8"
              rdma/ib: "8"
            requests:
              nvidia.com/gpu: "8"
              rdma/ib: "8"
              memory: "200Gi"
              cpu: "16"
          volumeMounts:
            - name: shm
              mountPath: /dev/shm
            - name: model-cache-block
              mountPath: /models
      volumes:
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: 64Gi
        - name: model-cache-block
          persistentVolumeClaim:
            claimName: model-cache-block
      imagePullSecrets:
        - name: nvcr-imagepullsecret

examples/p2p_transfer_k8s/server/kubernetes_backend/crd-modelmetadata.yaml

Lines changed: 9 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -51,7 +51,6 @@ spec:
5151
- none
5252
nixlMetadata:
5353
type: string
54-
format: byte
5554
description: Base64-encoded NIXL agent metadata blob
5655
transferEngineSessionId:
5756
type: string
@@ -62,6 +61,15 @@ spec:
6261
tensorConfigMap:
6362
type: string
6463
description: Name of ConfigMap containing tensor descriptors
64+
metadataEndpoint:
65+
type: string
66+
description: P2P NIXL listen thread endpoint (host:port)
67+
agentName:
68+
type: string
69+
description: P2P NIXL agent name for remote identification
70+
workerGrpcEndpoint:
71+
type: string
72+
description: P2P worker gRPC endpoint for tensor manifest (host:port)
6573
status:
6674
type: string
6775
description: Worker lifecycle status

modelexpress_client/python/modelexpress/heartbeat.py

Lines changed: 1 addition & 1 deletion
@@ -50,7 +50,7 @@ def __init__(
         worker_id: str,
         worker_rank: int,
         nixl_manager: NixlTransferManager,
-
+
     ):
         self._mx_client = mx_client
         self._mx_source_id = mx_source_id

modelexpress_client/python/modelexpress/nixl_transfer.py

Lines changed: 69 additions & 6 deletions
@@ -53,16 +53,22 @@ class NixlTransferManager:
         device_id: GPU device ID for this worker
     """
 
-    def __init__(self, agent_name: str, device_id: int):
+    def __init__(self, agent_name: str, device_id: int, listen_port: int | None = None):
         self._agent_name = agent_name
         self._device_id = device_id
+        self._listen_port = listen_port
 
         self._agent: Any = None
         self._metadata: bytes = b""
         self._tensor_descriptors: list[TensorDescriptor] = []
         self._tensors: dict[str, torch.Tensor] = {}
         self._registered_regions: list[tuple[int, int]] | None = None
 
+    @property
+    def agent_name(self) -> str:
+        """Get NIXL agent name."""
+        return self._agent_name
+
     @property
     def nixl_metadata(self) -> bytes:
         """Get NIXL metadata for this agent."""
@@ -83,7 +89,19 @@ def initialize(self) -> None:
 
         torch.cuda.set_device(self._device_id)
 
-        config = nixl_agent_config(backends=["UCX"]) if nixl_agent_config else None
+        if self._listen_port is not None and nixl_agent_config:
+            config = nixl_agent_config(
+                backends=["UCX"],
+                enable_listen_thread=True,
+                listen_port=self._listen_port,
+            )
+            logger.info(
+                f"NIXL listen thread enabled on port {self._listen_port}"
+            )
+        elif nixl_agent_config:
+            config = nixl_agent_config(backends=["UCX"])
+        else:
+            config = None
         self._agent = NixlAgent(self._agent_name, config)
         logger.info(f"NIXL agent '{self._agent_name}' created on device {self._device_id}")
 
@@ -227,21 +245,59 @@ def _find_contiguous_regions(
 
         return regions
 
+    def fetch_remote_and_wait(
+        self,
+        remote_agent_name: str,
+        ip: str,
+        port: int,
+        timeout_seconds: float = 120.0,
+    ) -> None:
+        """Fetch remote NIXL agent metadata via the P2P listen thread.
+
+        Initiates an async fetch and polls until the remote agent's metadata
+        is loaded locally. Used in P2P mode instead of add_remote_agent().
+        """
+        if self._agent is None:
+            raise RuntimeError("NIXL agent not initialized")
+
+        logger.info(
+            f"Fetching remote metadata from {remote_agent_name} at {ip}:{port}"
+        )
+        self._agent.fetch_remote_metadata(remote_agent_name, ip, port)
+
+        start = time.perf_counter()
+        while True:
+            if time.perf_counter() - start >= timeout_seconds:
+                raise TimeoutError(
+                    f"Timed out waiting for remote metadata from "
+                    f"{remote_agent_name} at {ip}:{port}"
+                )
+            if self._agent.check_remote_metadata(remote_agent_name):
+                logger.info(
+                    f"Remote metadata loaded for {remote_agent_name} "
+                    f"({time.perf_counter() - start:.2f}s)"
+                )
+                return
+            time.sleep(0.01)
+
     def receive_from_source(
         self,
         source_metadata: bytes,
         source_tensors: list[TensorDescriptor],
         timeout_seconds: float | None = None,
         coalesce_transfers: bool = True,
+        remote_agent_name: str | None = None,
     ) -> tuple[int, int, float]:
         """
         Receive weights from a remote source via NIXL RDMA.
 
         Args:
-            source_metadata: NIXL metadata from the source agent
+            source_metadata: NIXL metadata from the source agent (unused if remote_agent_name set)
             source_tensors: Tensor descriptors from the source
             timeout_seconds: Maximum time to wait for transfer (None for no timeout)
            coalesce_transfers: If True, coalesce contiguous memory regions (optimization)
+            remote_agent_name: If set, use this pre-loaded agent (P2P mode) instead of
+                calling add_remote_agent with source_metadata (centralized mode)
 
         Returns:
             Tuple of (total_bytes, total_tensors, duration)
@@ -252,9 +308,16 @@ def receive_from_source(
         start_time = time.perf_counter()
         torch.cuda.set_device(self._device_id)
 
-        # Add remote agent
-        remote_agent_name = self._agent.add_remote_agent(source_metadata)
-        logger.info(f"Added remote agent {remote_agent_name}")
+        if remote_agent_name is None:
+            add_start = time.perf_counter()
+            remote_agent_name = self._agent.add_remote_agent(source_metadata)
+            add_time = time.perf_counter() - add_start
+            logger.info(
+                f"[TIMING] add_remote_agent: {add_time:.3f}s "
+                f"(agent={remote_agent_name}, blob={len(source_metadata)} bytes)"
+            )
+        else:
+            logger.info(f"Using pre-loaded remote agent {remote_agent_name}")
 
         # Check if source is sending region descriptors (MX_CONTIGUOUS_REG=1 on source)
         is_region_transfer = (
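The poll-until-loaded loop in `fetch_remote_and_wait` is an instance of a general poll-with-timeout pattern; a standalone sketch (not part of the module, with injectable clock/sleep so the behavior is testable without real waiting):

```python
import time


def poll_until(predicate, timeout_seconds=120.0, interval=0.01,
               clock=time.perf_counter, sleep=time.sleep):
    """Poll predicate() until it returns True or timeout_seconds elapses.

    Returns the elapsed time on success; raises TimeoutError otherwise.
    """
    start = clock()
    while True:
        elapsed = clock() - start
        if predicate():
            return elapsed
        if elapsed >= timeout_seconds:
            raise TimeoutError(f"condition not met within {timeout_seconds}s")
        sleep(interval)
```

In `fetch_remote_and_wait` the predicate is `check_remote_metadata(remote_agent_name)`; the 10 ms sleep keeps the wait loop from spinning a CPU core while NIXL's listen thread completes the exchange.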