Detailed reference document for the ModelExpress codebase. For deployment and configuration, see DEPLOYMENT.md. For contribution guidelines and dev setup, see CONTRIBUTING.md. For coding standards and AI assistant instructions, see CLAUDE.md. For CLI usage, see CLI.md.
ModelExpress is a Rust-based model cache management service and GPU-to-GPU model weight transfer system. It serves two roles:
- Model Cache Service - A sidecar alongside inference solutions (vLLM, SGLang, NVIDIA Dynamo) that accelerates model downloads from HuggingFace and NGC with SQLite-backed tracking and LRU cache eviction.
- P2P Weight Transfer - GPU-to-GPU model weight transfers between vLLM instances using NVIDIA NIXL over RDMA/InfiniBand, enabling ~15-second transfers for 681GB models.
| Model | Status | Transfer Time | Notes |
|---|---|---|---|
| DeepSeek-V3 (671B, FP8) | Working | ~15s | 681GB across 8 GPUs @ ~45 Gbps per link |
| Llama 3.3 70B | Working | ~5s | 140GB across 8 GPUs @ ~28 Gbps per link |
```mermaid
graph TD
    subgraph "Model Cache Mode"
        C1[Client CLI / Library] -->|gRPC| S1[ModelExpress Server]
        S1 --> DB[(SQLite)]
        S1 --> HF[HuggingFace Hub]
        S1 --> NGC[NVIDIA NGC]
        S1 --> Cache[Model Cache Dir]
    end
    subgraph "P2P Transfer Mode"
        subgraph "Node A"
            A[vLLM + MxModelLoader]
        end
        subgraph "Node B"
            B[vLLM + MxModelLoader]
        end
        A -->|gRPC metadata| S2[ModelExpress Server]
        B -->|gRPC metadata| S2
        S2 --> R[(Redis)]
        A -- "RDMA via NIXL" --> B
    end
```
| Component | Language | Location | Purpose |
|---|---|---|---|
| Server | Rust | `modelexpress_server/` | gRPC server: model downloads, cache eviction, P2P coordination |
| Rust Client | Rust | `modelexpress_client/src/` | Client library and CLI tool |
| Python Client | Python | `modelexpress_client/python/` | vLLM loaders, NIXL transfer manager, gRPC client |
| Common | Rust | `modelexpress_common/` | Protobuf definitions, shared types, provider trait, config |
| Workspace Tests | Rust | `workspace-tests/` | Integration tests and Criterion benchmarks |
```
ModelExpress/
├── Cargo.toml                        # Workspace root (4 members)
├── Cargo.lock
├── Dockerfile                        # Multi-stage production image
├── docker-compose.yml                # Single-service dev setup
├── run_integration_tests.sh          # Integration test runner
├── test_client.sh                    # Client test script
├── test_grpc_transfer_k8s.sh         # K8s gRPC transfer test
├── deny.toml                         # cargo-deny config
├── rust-toolchain.toml               # Rust version pinning
├── rustfmt.toml                      # Formatting config
├── modelexpress-cli-completion.bash  # Shell completions
│
├── modelexpress_server/
│   ├── Cargo.toml
│   └── src/
│       ├── main.rs                   # Server startup, service registration
│       ├── lib.rs                    # Module exports
│       ├── config.rs                 # ServerConfig, layered loading, validation
│       ├── database.rs               # SQLite ModelDatabase, atomic claim
│       ├── cache.rs                  # CacheEvictionService, LRU policy
│       ├── state.rs                  # P2pStateManager wrapper
│       ├── services.rs               # Health, API, Model gRPC services
│       ├── p2p_service.rs            # P2P gRPC service implementation
│       ├── source_identity.rs        # SHA256-based mx_source_id computation
│       ├── reaper.rs                 # Server-side stale source detection and GC
│       ├── k8s_types.rs              # Kubernetes CRD type definitions
│       ├── metadata_backend.rs       # MetadataBackend trait + types
│       ├── metadata_backend/
│       │   ├── redis.rs              # Redis backend implementation
│       │   └── kubernetes.rs         # Kubernetes CRD backend implementation
│       └── bin/
│           └── config_gen.rs         # Config file generator/migrator
│
├── modelexpress_client/
│   ├── Cargo.toml
│   └── src/
│       ├── lib.rs                    # Client struct, public API
│       └── bin/
│           ├── cli.rs                # CLI entry point (modelexpress-cli)
│           ├── test_client.rs        # Concurrent/single download tests
│           ├── fallback_test.rs      # Provider fallback tests
│           └── modules/
│               ├── args.rs           # CLI args (Cli struct, embeds ClientArgs)
│               ├── handlers.rs       # CLI command handlers
│               ├── output.rs         # Output formatting (human, JSON, JSON-pretty)
│               └── payload.rs        # JSON payload reader (inline, file, stdin)
│
├── modelexpress_client/python/
│   ├── pyproject.toml                # Python package config
│   ├── generate_proto.sh             # Proto stub generation script
│   └── modelexpress/
│       ├── __init__.py               # Package init, vLLM loader auto-registration
│       ├── client.py                 # MxClient gRPC client
│       ├── heartbeat.py              # Client-side heartbeat for source liveness
│       ├── nixl_transfer.py          # NixlTransferManager
│       ├── gds_transfer.py           # GPUDirect Storage transfer support
│       ├── gds_loader.py             # GDS model loader
│       ├── vllm_loader.py            # MxModelLoader (thin orchestration)
│       ├── load_strategy/            # Loading strategy chain
│       │   ├── __init__.py           # LoadStrategyChain.run()
│       │   ├── base.py               # LoadStrategy ABC, LoadContext, shared helpers
│       │   ├── rdma_strategy.py      # RdmaStrategy (P2P GPU transfer via NIXL)
│       │   ├── model_streamer_strategy.py  # ModelStreamerStrategy (S3/GCS/Azure/local)
│       │   ├── gds_strategy.py       # GdsStrategy (GPUDirect Storage)
│       │   └── default_strategy.py   # DefaultStrategy (vLLM DefaultModelLoader)
│       ├── tensor_utils.py           # Tensor collection, checksums, storage views
│       ├── transfer_safety.py        # MLA feature gate, TransferFingerprint
│       ├── metadata.py               # Metadata building and publishing
│       ├── rank_utils.py             # Rank detection utilities
│       ├── worker_server.py          # WorkerGrpcServer (P2P tensor manifest)
│       ├── vllm_worker.py            # ModelExpressWorker (custom vLLM worker)
│       ├── types.py                  # TensorDescriptor, WorkerMetadata dataclasses
│       ├── p2p_pb2.py                # Generated protobuf stubs
│       └── p2p_pb2_grpc.py           # Generated gRPC stubs
│
├── modelexpress_common/
│   ├── Cargo.toml
│   ├── build.rs                      # tonic-build: compiles all 4 proto files
│   ├── proto/
│   │   ├── health.proto              # HealthService
│   │   ├── api.proto                 # ApiService
│   │   ├── model.proto               # ModelService
│   │   └── p2p.proto                 # P2pService
│   └── src/
│       ├── lib.rs                    # Module exports, gRPC stubs, type conversions
│       ├── cache.rs                  # CacheEvictionConfig, LruConfig, DurationConfig
│       ├── client_config.rs          # ClientConfig, ClientArgs (shared CLI args)
│       ├── config.rs                 # Config trait utilities
│       ├── download.rs               # Download orchestration (smart-fallback, direct, server-only)
│       ├── models.rs                 # Status, ModelProvider, ModelStatus, ModelStatusResponse
│       ├── providers.rs              # ModelProviderTrait definition, re-exports
│       └── providers/
│           ├── huggingface.rs        # HuggingFaceProvider implementation
│           └── ngc.rs                # NgcProvider implementation
│
├── workspace-tests/
│   ├── Cargo.toml
│   ├── tests/
│   │   └── integration_tests.rs      # Health, ping, download, fallback tests
│   └── benches/
│       └── performance.rs            # Criterion: DB ops, serialization benchmarks
│
├── helm/
│   ├── Chart.yaml                    # v0.2.2
│   ├── deploy.sh                     # Deploy script (microk8s/kubectl auto-detect)
│   ├── values.yaml                   # Default (1 replica, 10Gi PVC)
│   ├── values-development.yaml       # Dev (debug, 512Mi)
│   ├── values-production.yaml        # Prod (3 replicas, 2Gi, ingress)
│   ├── values-local-storage.yaml     # Test (no PVC, emptyDir)
│   └── templates/
│       ├── deployment.yaml
│       ├── service.yaml
│       ├── pvc.yaml
│       ├── ingress.yaml
│       └── serviceaccount.yaml
│
├── examples/
│   ├── p2p_transfer_k8s/             # GPU-to-GPU weight transfer example
│   │   ├── README.md
│   │   ├── Dockerfile.client         # vLLM + ModelExpress client image
│   │   ├── model-download.yaml       # Model weights download job
│   │   ├── server/
│   │   │   ├── kubernetes_backend/   # CRD-based metadata (crd, rbac, server)
│   │   │   └── redis_backend/        # Redis-based metadata (redis, server)
│   │   └── client/
│   │       └── vllm/
│   │           ├── vllm-single-node.yaml  # TP-only (DeepSeek-V3)
│   │           └── vllm-multi-node.yaml   # TP+PP (Kimi-K2.5, 2 nodes)
│   ├── aggregated_k8s/               # Dynamo aggregated serving example
│   │   ├── README.md
│   │   └── agg.yaml
│   └── dynamo_p2p_transfer_k8s/      # Dynamo DGD with P2P weight transfer
│       ├── Dockerfile                # dynamo vllm-runtime + MX client
│       ├── README.md
│       └── vllm/
│           ├── crd-modelmetadata.yaml       # ModelMetadata CRD (cluster-admin)
│           ├── rbac-modelmetadata.yaml      # ServiceAccount + Role + RoleBinding
│           └── vllm-multi-node-aggregated.yaml  # DGD: MX server + Frontend + VllmWorker
│
├── docs/
│   ├── ARCHITECTURE.md               # Architecture reference
│   ├── CLI.md                        # CLI tool documentation
│   ├── DEPLOYMENT.md                 # Deployment and configuration guide
│   └── metadata.md                   # Metadata storage and coordination protocol
│
├── .devcontainer/
│   ├── devcontainer.json             # VSCode config: rust-analyzer, port 8001
│   └── Dockerfile                    # Ubuntu 24.04 dev env
│
├── .github/
│   ├── copilot-instructions.md       # GitHub Copilot agent instructions
│   ├── dco.yml                       # DCO enforcement
│   └── workflows/
│       ├── ci.yml                    # CI pipeline
│       └── codeql.yml                # Security scanning
│
├── .cursor/
│   └── rules/
│       └── rust.mdc                  # Cursor agent instructions
│
├── .pre-commit-config.yaml           # Pre-commit hooks config
└── .coderabbit.yaml                  # CodeRabbit review config
```
| Crate | Package Name | Type | Binary Targets |
|---|---|---|---|
| `modelexpress_server` | `modelexpress-server` | lib + bin | `modelexpress-server`, `config_gen` |
| `modelexpress_client` | `modelexpress-client` | lib + bin | `modelexpress-cli`, `test_client`, `fallback_test` |
| `modelexpress_common` | `modelexpress-common` | lib | (none) |
| `workspace-tests` | `workspace-tests` | test + bench | (integration tests, Criterion benchmarks) |
All cargo dependencies are declared in the root Cargo.toml. Sub-crates use workspace dependencies exclusively.
Four proto files define four services, all compiled via tonic-build in modelexpress_common/build.rs:
| RPC | Request | Response | Purpose |
|---|---|---|---|
| `GetHealth` | `HealthRequest` | `HealthResponse` | Server version, status, uptime |
| RPC | Request | Response | Purpose |
|---|---|---|---|
| `SendRequest` | `ApiRequest` | `ApiResponse` | Generic API (e.g., "ping" -> "pong") |
| RPC | Request | Response | Purpose |
|---|---|---|---|
| `EnsureModelDownloaded` | `ModelDownloadRequest` | `stream ModelStatusUpdate` | Trigger download, stream progress |
| `StreamModelFiles` | `ModelFilesRequest` | `stream FileChunk` | Stream model file contents (1MB chunks) |
| `ListModelFiles` | `ModelFilesRequest` | `ModelFileList` | List files with sizes |
Key message types: ModelProvider (HuggingFace, NGC), ModelStatus (Downloading, Downloaded, Error), ModelStatusUpdate, FileChunk.
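The 1MB `FileChunk` streaming above can be sketched as a simple chunking generator. This is an illustration only: the `stream_file_chunks` helper and the `(offset, bytes)` tuple shape are assumptions for the sketch, not the server's actual proto message layout.

```python
import io

CHUNK_SIZE = 1024 * 1024  # 1 MB, matching the FileChunk streaming granularity

def stream_file_chunks(fileobj, chunk_size=CHUNK_SIZE):
    """Yield (offset, data) pairs, mimicking how a server could fill FileChunk messages."""
    offset = 0
    while True:
        data = fileobj.read(chunk_size)
        if not data:
            break
        yield offset, data
        offset += len(data)

# Example: a 2.5 MB payload splits into two full 1 MB chunks and one 512 KB tail.
payload = b"x" * (2 * 1024 * 1024 + 512 * 1024)
chunks = list(stream_file_chunks(io.BytesIO(payload)))
```

Streaming fixed-size chunks keeps each gRPC message well under the server's 100MB message cap regardless of model file size.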
| RPC | Request | Response | Purpose |
|---|---|---|---|
| `PublishMetadata` | `PublishMetadataRequest` | `PublishMetadataResponse` | Source publishes worker metadata (identity + tensors + backend metadata) |
| `ListSources` | `ListSourcesRequest` | `ListSourcesResponse` | Lightweight listing of available source workers (no tensor data) |
| `GetMetadata` | `GetMetadataRequest` | `GetMetadataResponse` | Fetch full tensor metadata for one specific worker (MB-scale, on demand) |
| `UpdateStatus` | `UpdateStatusRequest` | `UpdateStatusResponse` | Update per-worker lifecycle status (Initializing/Ready/Stale) |
Key message types: SourceIdentity (all fields affecting tensor layout compatibility), WorkerMetadata (rank, oneof backend_metadata, tensors, status, P2P endpoint fields), TensorDescriptor (name, addr, size, device_id, dtype), SourceInstanceRef (lightweight worker reference for listing).
| RPC | Request | Response | Purpose |
|---|---|---|---|
| `GetTensorManifest` | `GetTensorManifestRequest` | `GetTensorManifestResponse` | Fetch tensor descriptors directly from a source worker |
Per-worker gRPC service started when MX_P2P_METADATA=1. Targets call this instead of fetching tensor descriptors from the central server. Validates mx_source_id to catch stale discovery.
See metadata.md for the full metadata architecture including storage schemas and coordination protocol.
1. Parse CLI args (`ServerArgs` via clap)
2. Load config (`ServerConfig::load()`): CLI > env vars > config file > defaults
3. Initialize structured logging (tracing-subscriber)
4. Create SQLite database (`ModelDatabase`)
5. Optionally start `CacheEvictionService` background task
6. Connect to metadata backend (Redis or Kubernetes CRD) for P2P state
7. Start reaper background task for stale source detection and GC
8. Register 4 gRPC services with tonic (max message size: 100MB)
9. Listen on configured address (default `0.0.0.0:8001`)
10. Graceful shutdown on CTRL+C (signals cache eviction service and reaper)
```yaml
server:
  host: "0.0.0.0"          # MODEL_EXPRESS_SERVER_HOST
  port: 8001               # MODEL_EXPRESS_SERVER_PORT
database:
  path: "./models.db"      # MODEL_EXPRESS_DATABASE_PATH
cache:
  directory: "./cache"     # MODEL_EXPRESS_CACHE_DIRECTORY
  max_size_bytes: null
  eviction:
    enabled: true          # MODEL_EXPRESS_CACHE_EVICTION_ENABLED
    policy:
      type: lru
      unused_threshold: "7d"
      max_models: null
      min_free_space_bytes: null
    check_interval: "1h"
logging:
  level: info              # MODEL_EXPRESS_LOG_LEVEL
  format: pretty           # MODEL_EXPRESS_LOG_FORMAT
  file: null
  structured: false
```

Schema: a single `models` table with columns `model_name` (PK), `provider`, `status`, `created_at`, `last_used_at`, `message`. Indexed on `last_used_at` for LRU queries.
Key operations:
| Method | Purpose |
|---|---|
| `get_status(name)` | Query download status |
| `set_status(name, provider, status, msg)` | Create/update via INSERT OR REPLACE |
| `try_claim_for_download(name, provider)` | Atomic compare-and-swap (INSERT OR IGNORE) |
| `touch_model(name)` | Update `last_used_at` for LRU tracking |
| `delete_model(name)` | Remove record |
| `get_models_by_last_used(limit)` | LRU query, oldest first |
| `get_status_counts()` | Count by status (downloading, downloaded, error) |
Concurrency: `Arc<Mutex<Connection>>` with poison recovery.
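The atomic claim can be sketched against the schema above: INSERT OR IGNORE acts as a compare-and-swap, so only the first caller wins the race. The server implements this in Rust; the Python sqlite3 version below is a semantic illustration, and the column defaults are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE models (
        model_name   TEXT PRIMARY KEY,
        provider     TEXT,
        status       TEXT,
        created_at   TEXT DEFAULT CURRENT_TIMESTAMP,
        last_used_at TEXT DEFAULT CURRENT_TIMESTAMP,
        message      TEXT
    )"""
)

def try_claim_for_download(name, provider):
    """Atomically claim a model: only the first caller inserts the DOWNLOADING row."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO models (model_name, provider, status) "
        "VALUES (?, ?, 'DOWNLOADING')",
        (name, provider),
    )
    conn.commit()
    return cur.rowcount == 1  # 1 row inserted -> we won; 0 -> someone else holds it

first = try_claim_for_download("meta-llama/Llama-3.3-70B", "huggingface")
second = try_claim_for_download("meta-llama/Llama-3.3-70B", "huggingface")
```

Because the PRIMARY KEY conflict is resolved inside SQLite, no application-level locking is needed to decide which client performs the download.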
Runs in a background tokio task on a configurable interval (default 1 hour). LRU eviction policy:
- Time-based: evict models with `last_used_at` older than `unused_threshold` (default 7 days)
- Count-based: if total > `max_models`, evict the oldest excess
- Only DOWNLOADED models are eligible for eviction
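The two eviction rules above can be sketched as a pure selection function. This is a hedged sketch, not the server's Rust implementation: `select_evictions` and the `(name, status, last_used_at)` tuple shape are hypothetical.

```python
from datetime import datetime, timedelta

def select_evictions(models, unused_threshold=timedelta(days=7), max_models=None, now=None):
    """Pick model names to evict under the LRU policy described above.

    `models` is a list of (name, status, last_used_at) tuples (illustrative shape).
    Only DOWNLOADED models are eligible; oldest-first ordering mirrors
    get_models_by_last_used().
    """
    now = now or datetime.utcnow()
    eligible = sorted((m for m in models if m[1] == "DOWNLOADED"), key=lambda m: m[2])
    # Rule 1 (time-based): unused longer than the threshold.
    evict = {m[0] for m in eligible if now - m[2] > unused_threshold}
    # Rule 2 (count-based): keep evicting oldest until total fits under max_models.
    if max_models is not None:
        excess = len(models) - max_models
        for name, _, _ in eligible:
            if len(evict) >= excess:
                break
            evict.add(name)
    return evict

now = datetime(2025, 1, 10)
models = [
    ("old-model", "DOWNLOADED", now - timedelta(days=30)),
    ("recent-model", "DOWNLOADED", now - timedelta(days=1)),
    ("in-flight", "DOWNLOADING", now - timedelta(days=30)),  # never eligible
]
evicted = select_evictions(models, now=now)
```

Note that a stale-but-still-downloading entry is never selected, matching the "only DOWNLOADED models" rule.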
The server supports two metadata backends, selected via MX_METADATA_BACKEND:
- Redis (`redis`): source index hashes (`mx:source:{source_id}`) with an `__attributes__` field storing `SourceIdentity`, and `{worker_id}` fields as presence markers. Worker data is stored in separate hashes (`mx:source:{source_id}:{worker_id}`). Stale detection and cleanup are handled by the server-side reaper.
- Kubernetes (`kubernetes`/`k8s`/`crd`): `ModelMetadata` CRDs (one per worker) with ConfigMaps for tensor descriptors. Owner references enable automatic garbage collection. Standard Kubernetes `status.conditions` (Ready) and `status.observedGeneration` are maintained so that `kubectl wait --for=condition=Ready` works. Stale detection is handled by the server-side reaper.
Each worker publishes independently (no Lua merge scripts needed). The mx_source_id is a 16-char hex key computed from SHA256(canonical_json(SourceIdentity)). Large u64 values (GPU addresses) are serialized as strings to avoid JSON precision loss.
See metadata.md for the full storage layout and schemas.
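The `mx_source_id` derivation above can be sketched as follows. The field names in the example identity are illustrative (the real `SourceIdentity` fields are defined in p2p.proto), but the mechanics match the description: canonical JSON, SHA256, first 16 hex chars, and u64 values pre-serialized as strings.

```python
import hashlib
import json

def compute_mx_source_id(identity: dict) -> str:
    """16-char hex key from SHA256 over canonical JSON.

    Canonical here means sorted keys and no whitespace. Large u64 values
    (e.g. GPU addresses) are passed as strings to avoid JSON precision loss.
    """
    canonical = json.dumps(identity, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]

# Illustrative identity; real field names live in SourceIdentity (p2p.proto).
identity = {
    "model": "deepseek-ai/DeepSeek-V3",
    "tensor_parallel_size": "8",
    "dtype": "fp8",
    "base_addr": str(0xFFFFFFFFFFFF0000),  # u64 serialized as a string
}
source_id = compute_mx_source_id(identity)
```

Sorting keys makes the id independent of field insertion order, so every worker that shares a tensor-layout-compatible identity derives the same key.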
Global singleton (`LazyLock<ModelDownloadTracker>`) that coordinates concurrent downloads. Uses `try_claim_for_download()` for race-free model claiming and tokio channels for streaming status updates to multiple waiting clients.
The Client struct in modelexpress_client/src/lib.rs wraps gRPC connections:
| Method | Purpose |
|---|---|
| `new(config)` | Create client with config |
| `health_check()` | Call HealthService |
| `ping()` | Call ApiService with "ping" |
| `download_model(name, provider)` | Trigger download via streaming RPC |
| `download_model_direct(name, provider, cache_dir)` | Download directly from provider |
| `ensure_model(name, provider, strategy)` | Smart download with fallback strategy |
| `stream_model_files(name)` | Stream model files from server |
| `list_model_files(name)` | List model files on server |
| `delete_model(name, cache_dir)` | Delete cached model |
| `validate_model(name, cache_dir)` | Validate cached model integrity |
| Strategy | Behavior |
|---|---|
| `SmartFallback` | Try the server first; fall back to direct download on failure |
| `ServerOnly` | Only download through the server |
| `DirectOnly` | Only download directly from the provider |
The Cli struct in args.rs embeds ClientArgs via #[command(flatten)]. Commands:
| Command | Purpose |
|---|---|
| `health` | Check server health |
| `ping` | Ping server |
| `model download <name>` | Download a model |
| `model list-files <name>` | List model files |
| `model clear <name>` | Delete cached model |
| `model validate <name>` | Validate model integrity |
Output formats: --format human (default), --format json, --format json-pretty.
| Module | Purpose |
|---|---|
| `cache` | CacheEvictionConfig, LruConfig, DurationConfig (used by both server and client configs) |
| `client_config` | ClientConfig + ClientArgs shared struct for CLI argument handling |
| `config` | Config trait utilities |
| `download` | Download orchestration with strategy pattern |
| `models` | Status, ModelProvider, ModelStatus, ModelStatusResponse |
| `providers` | ModelProviderTrait + HuggingFaceProvider + NgcProvider |
| `grpc` | Generated tonic stubs for all 4 services |
| `constants` | DEFAULT_GRPC_PORT (8001), DEFAULT_TIMEOUT_SECS (30), DEFAULT_TRANSFER_CHUNK_SIZE (32KB) |
```rust
#[async_trait]
pub trait ModelProviderTrait: Send + Sync {
    async fn download_model(&self, name: &str, cache_path: Option<PathBuf>, ignore_weights: bool) -> Result<PathBuf>;
    async fn delete_model(&self, name: &str) -> Result<()>;
    async fn get_model_path(&self, name: &str, cache_dir: PathBuf) -> Result<PathBuf>;
    fn provider_name(&self) -> &'static str;
    fn is_ignored(filename: &str) -> bool;
    fn is_image(path: &Path) -> bool;
    fn is_weight_file(filename: &str) -> bool;
}
```

Two implementations:

- `HuggingFaceProvider` - uses the `hf-hub` crate with high-CPU download mode.
- `NgcProvider` - downloads from NVIDIA NGC via the V2 artifact API (Bearer-authenticated `/files/{path}` for team artifacts; presigned S3 URLs for org-level artifacts). Falls back to `checksums.blake3` manifest enumeration when bulk file listing returns 400. Resolves the NGC API key from `NGC_API_KEY`, `NGC_CLI_API_KEY`, or `~/.ngc/config`.
ClientArgs is the single source of truth for shared CLI arguments. Cli embeds it via #[command(flatten)].
Loading precedence: CLI args > environment variables > config file > defaults.
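The layered precedence can be sketched as a simple merge where later layers win. This is an illustration of the loading order only: the real `ClientConfig`/`ServerConfig` are typed Rust structs, and `resolve_config` with dict layers is a hypothetical stand-in.

```python
def resolve_config(cli_args: dict, env: dict, file_cfg: dict, defaults: dict) -> dict:
    """Merge config layers with the precedence above: CLI > env > file > defaults.

    Each layer maps option names to values; None means "not set" in that layer.
    """
    merged = dict(defaults)
    for layer in (file_cfg, env, cli_args):  # later (higher-precedence) layers win
        merged.update({k: v for k, v in layer.items() if v is not None})
    return merged

cfg = resolve_config(
    cli_args={"port": 9001, "host": None},          # --port set, --host omitted
    env={"host": "0.0.0.0"},                         # MODEL_EXPRESS_SERVER_HOST
    file_cfg={"port": 8001, "log_level": "debug"},   # config file
    defaults={"host": "localhost", "port": 8001, "log_level": "info"},
)
```

Here the CLI's `port` overrides the file's, the env var supplies `host` because the CLI left it unset, and `log_level` falls through to the file layer.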
| Module | Purpose |
|---|---|
| `__init__.py` | Package init; exports `register_modelexpress_loaders()` for callers to register the mx loader with vLLM |
| `client.py` | `MxClient` - gRPC client wrapping PublishMetadata, ListSources, GetMetadata, and UpdateStatus RPCs |
| `heartbeat.py` | `HeartbeatThread` - background thread sending periodic UpdateStatus(READY) and STALE on shutdown |
| `nixl_transfer.py` | `NixlTransferManager` - NIXL agent lifecycle, tensor registration, RDMA transfers |
| `gds_transfer.py` | GPUDirect Storage availability check and transfer utilities |
| `gds_loader.py` | `MxGdsLoader` - GDS-based model loader (direct file-to-GPU) |
| `vllm_loader.py` | `MxModelLoader` - thin orchestration, delegates to `LoadStrategyChain` |
| `load_strategy/` | Loading strategy chain: RdmaStrategy, ModelStreamerStrategy (S3/GCS/Azure/local), GdsStrategy, DefaultStrategy |
| `tensor_utils.py` | Tensor collection, checksums, storage views, `capture_tensor_attrs` |
| `metadata.py` | `build_source_identity`, `publish_metadata_and_ready`, retry logic |
| `rank_utils.py` | `get_global_rank`, `get_worker_rank` |
| `worker_server.py` | `WorkerGrpcServer` - per-worker gRPC server for P2P tensor manifest exchange |
| `vllm_worker.py` | `ModelExpressWorker` - custom vLLM worker class (use `--worker-cls=modelexpress.vllm_worker.ModelExpressWorker`) |
| `types.py` | `TensorDescriptor`, `WorkerMetadata`, `GetMetadataResponse` dataclasses |
| `p2p_pb2.py` / `p2p_pb2_grpc.py` | Generated protobuf/gRPC stubs |
gRPC client wrapping the P2P service stubs:
| Method | Purpose |
|---|---|
| `publish_metadata(identity, worker, worker_id)` | Publish worker metadata; returns `mx_source_id` |
| `list_sources(identity, status_filter)` | List available source workers (lightweight, no tensor data) |
| `get_metadata(mx_source_id, worker_id)` | Fetch full tensor metadata for one worker (on demand) |
| `update_status(mx_source_id, worker_id, worker_rank, status)` | Update worker lifecycle status (e.g., READY) |
| `close()` | Close the underlying gRPC channel |
Manages a NIXL agent and RDMA transfers for a single GPU worker:
| Method | Purpose |
|---|---|
| `__init__(agent_name, device_id, listen_port)` | Create NIXL agent with UCX backend; `listen_port` enables the P2P listen thread |
| `register_tensors(tensors)` | Register GPU tensors for RDMA, return serialized metadata |
| `get_registered_descriptors()` | Return region descriptors (`MX_CONTIGUOUS_REG=1`) or tensor descriptors |
| `fetch_remote_and_wait(agent_name, ip, port)` | P2P: fetch remote NIXL metadata via the listen thread (polls until loaded) |
| `receive_from_source(source_metadata, source_tensors, ..., remote_agent_name)` | Execute RDMA read transfer; `remote_agent_name` skips `add_remote_agent` (P2P) |
| `shutdown()` | Clean up NIXL agent and resources |
MxModelLoader (extends BaseModelLoader, registered as --load-format mx):
Thin orchestration layer that delegates to LoadStrategyChain.run(). Builds a LoadContext from vLLM config, initializes the model, runs the strategy chain, and updates global registries.
LoadStrategyChain (load_strategy/):
Auto-detects the best loading strategy with a prioritized chain. Each strategy is a subclass of LoadStrategy (ABC) with is_available(ctx) and load(model, ctx) methods. The chain filters to eligible strategies and runs them in order until one succeeds:
| Priority | Strategy | `is_available()` | Behavior |
|---|---|---|---|
| p0 | `RdmaStrategy` | NIXL available + MLA check passes | `ListSources(READY)`, filter by `worker_rank`, shuffle, try candidates (max 3). On `SourceTransferError` or `ManifestMismatchError`, try next. |
| p1 | `ModelStreamerStrategy` | `MX_MODEL_URI` set + `runai_model_streamer` installed | Stream safetensors to GPU via a CPU staging buffer. `MX_MODEL_URI` accepts remote URIs (s3://, gs://, az://), absolute local paths, or HF model IDs (resolved via `HF_HUB_CACHE`). All storage backends (S3, GCS, Azure) included by default. |
| p2 | `GdsStrategy` | GDS hardware available | Load via `MxGdsLoader` (direct file-to-GPU). Falls through on failure. |
| p3 | `DefaultStrategy` | Always | vLLM `DefaultModelLoader` (CPU-staged, auto-downloads from HF Hub). |
Each strategy is fully self-contained: it handles weight loading, post-processing (process_weights_after_loading), NIXL tensor registration, and metadata publishing. New strategies can be added by creating a new file in load_strategy/ and registering it in LoadStrategyChain.run().
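The chain pattern described above can be sketched minimally. The real classes live in `load_strategy/` and take a `LoadContext` and a vLLM model; the dict-based `ctx`, the `run_chain` helper, and the simplified availability checks here are assumptions made for a self-contained illustration.

```python
from abc import ABC, abstractmethod

class LoadStrategy(ABC):
    """Simplified sketch of the LoadStrategy ABC."""
    @abstractmethod
    def is_available(self, ctx: dict) -> bool: ...
    @abstractmethod
    def load(self, ctx: dict) -> str: ...

class RdmaStrategy(LoadStrategy):
    def is_available(self, ctx):
        # Stand-in for: NIXL importable + MLA feature gate passes.
        return bool(ctx.get("nixl")) and ctx.get("mla_ok", True)
    def load(self, ctx):
        if not ctx.get("sources"):
            raise RuntimeError("no READY sources")  # stand-in for SourceTransferError
        return "rdma"

class DefaultStrategy(LoadStrategy):
    def is_available(self, ctx):
        return True  # always eligible, lowest priority
    def load(self, ctx):
        return "default"

def run_chain(ctx, strategies):
    """Filter to eligible strategies, run in priority order until one succeeds."""
    for strategy in strategies:
        if not strategy.is_available(ctx):
            continue
        try:
            return strategy.load(ctx)
        except RuntimeError:
            continue  # fall through to the next strategy
    raise RuntimeError("no strategy could load the model")

chain = [RdmaStrategy(), DefaultStrategy()]
```

A new strategy slots in by appending a subclass to the ordered list, mirroring how strategies are registered in `LoadStrategyChain.run()`.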
RdmaStrategy.is_available() checks transfer_safety.check_transfer_allowed() before attempting P2P transfer. Currently the only blocked feature is MLA (Multi-head Latent Attention), detected via kv_lora_rank in the HF model config. MLA models (DeepSeek-V2/V3, Kimi K2/K2.5, GLM-5.1, and any future model using MLA) fall back to GDS or disk loading automatically.
NVFP4 MoE models (known case: Kimi-K2.5-NVFP4) previously produced corrupted inference after RDMA transfer despite all registered tensor bytes matching. This is a specific instance of a broader class of bugs: post-processing stashes computed state on non-Module objects (e.g. FusedMoEQuantConfig.a1_gscale = 1/activation_scale on the quant method), which is invisible to named_parameters()/named_buffers(). On the target those values are computed from dummy weights before RDMA, and RDMA only overwrites the registered tensors, so the stashed values stay wrong. Note DeepSeek-V3 (MLA + FP8) is not affected — FP8 scale state lives on Modules and is already captured. The fix (adopt_hidden_tensors()) recursively scans module attributes for orphaned CUDA tensors and registers them as non-persistent buffers so they are included in the RDMA manifest. Verified correct on vLLM v0.17.1 and v0.19.0 with Kimi-K2.5-NVFP4. Set MX_SKIP_FEATURE_CHECK=1 to bypass the MLA feature gate for models in this class.
During transfer, ManifestMismatchError is raised if source and target tensor names or sizes don't match. This triggers trying the next source candidate rather than marking the source as stale, which is important during rolling updates where pods may run different image versions.
After loading by any strategy, the worker starts a HeartbeatThread that periodically sends UpdateStatus(READY) to keep updated_at fresh. On clean shutdown, the heartbeat sends UpdateStatus(STALE) via an atexit handler. Metadata publish failures are logged but do not crash the worker.
Each GPU worker generates a unique worker_id (uuid4().hex[:8]) at init and publishes independently. Workers use torch.distributed.get_rank() as their global rank (captures both TP and PP position).
The loader uses iter_module_tensors() (in tensor_utils.py) to walk the full PyTorch module tree and find all CUDA tensors after post-processing. This discovers three categories:
| Category | Source | Example |
|---|---|---|
| Parameters | `module._parameters` | `layers.0.attention.weight` |
| Buffers | `module._buffers` | Batch norm running mean |
| Tensor attributes | `dir(module)` scan | FP8 `weight_scale`, `_k_scale` |
This is more thorough than named_parameters() which only finds parameters and would miss tensors created during process_weights_after_loading(). Non-contiguous tensors (e.g. MLA's W_UV and W_UK_T sharing a dequantized intermediate) are registered as flat byte views of their underlying storage via storage_view(). Multiple views into the same storage are deduplicated by data_ptr() so tied weights are only transferred once.
Before tensor collection, adopt_hidden_tensors() scans each module's non-Module attributes recursively for CUDA tensors not already in named_parameters()/named_buffers(). These "orphaned" tensors (e.g. FusedMoEQuantConfig.a1_gscale, FlashInfer workspace buffers) are registered as non-persistent buffers so they appear in the manifest. Without this, the target's quant method objects retain values computed from dummy weights, causing incorrect inference despite all registered tensor bytes matching.
NIXL (NVIDIA Interconnect eXchange Library) provides zero-copy GPU-to-GPU RDMA transfers on top of UCX:
| Concept | Description |
|---|---|
| Agent | NIXL instance managing one GPU's memory registrations and transfers |
| Memory Registration | GPU memory must be registered before RDMA access; generates rkeys |
| Metadata | Serialized agent info (address, rkeys) shared between source/target |
| Transfer Descriptor | Prepared list of (addr, size, device) for bulk transfer |
| rkey | Remote key - RDMA authorization token for remote memory access |
```python
# 1. Create NIXL agent (one per GPU worker)
from nixl._api import nixl_agent, nixl_agent_config

config = nixl_agent_config(backends=["UCX"])
agent = nixl_agent("worker-0", config)

# 2. Register GPU tensors for RDMA access
tensors = [(tensor.data_ptr(), tensor.numel() * tensor.element_size(), device_id, "cuda")]
agent.register_memory(tensors, "VRAM")

# 3. Get metadata for remote agent connection
metadata = agent.get_local_md()  # Share this with target

# 4. On target: connect to source and transfer
agent.add_remote_agent("source-worker-0", source_metadata)
src_descs = agent.prep_xfer_dlist("source-worker-0", source_tensors, "cuda", ["UCX"])
dst_descs = agent.prep_xfer_dlist("", local_tensors, "cuda", ["UCX"])
handle = agent.make_prepped_xfer("READ", dst_descs, indices, src_descs, indices, ["UCX"])
agent.transfer(handle)

# 5. Wait for completion
while agent.check_xfer_state(handle) not in ("DONE", "SUCCESS"):
    time.sleep(0.001)
agent.release_xfer_handle(handle)
```

vLLM's `process_weights_after_loading()` transforms model weights into kernel-friendly formats (FP8 scale repacking, NVFP4 padding/swizzling, MLA dequantized projections) and may create new tensors as bare attributes, buffers, or on quant method objects.
The solution: both source and target run process_weights_after_loading() first, then adopt_hidden_tensors() discovers any CUDA tensors on non-Module objects (quant configs, kernel objects), and finally register_tensors() collects everything for RDMA. The target runs post-processing on dummy data to establish the correct tensor layout, receives the real data via RDMA, and all state (including hidden quant config tensors) is correct.
```mermaid
graph TD
    subgraph Source
        S1[Load real weights from disk]
        S2[process_weights_after_loading]
        S3[adopt_hidden_tensors - find orphaned CUDA tensors]
        S4[Register ALL tensors with NIXL]
        S5[Publish metadata]
        S1 --> S2 --> S3 --> S4 --> S5
    end
    subgraph Target
        T1[Load dummy weights]
        T2[process_weights_after_loading on dummy]
        T3[adopt_hidden_tensors - same layout as source]
        T4[Register ALL tensors with NIXL]
        T5[Receive processed weights via RDMA]
        T1 --> T2 --> T3 --> T4 --> T5
    end
    S4 -- "RDMA" --> T5
```
1. Source loads: loads weights from storage (S3/GCS/Azure/local via ModelStreamer, GDS, or disk), runs `process_weights_after_loading()`
2. Source publishes: registers tensors with NIXL, calls `PublishMetadata(identity, worker, worker_id)` -> gets `mx_source_id` (status=INITIALIZING). In P2P mode (`MX_P2P_METADATA=1`), publishes only lightweight endpoint pointers and starts a `WorkerGrpcServer` for tensor manifest serving.
3. Heartbeat starts: `HeartbeatThread` sends `UpdateStatus(READY)` every 30s, refreshing `updated_at`
4. Target discovers: calls `ListSources(identity, status=READY)`, filters by `worker_rank`
5. Target fetches on demand: calls `GetMetadata(mx_source_id, worker_id)` for the chosen candidate. Auto-detects P2P mode if `worker_grpc_endpoint` is populated: fetches tensors from the source worker's `WorkerService` and NIXL metadata via the listen thread instead of from the central server.
6. Target transfers: executes RDMA reads from the source; on `SourceTransferError` tries the next candidate (max 3)
7. Target becomes source: after receiving weights, publishes its own metadata and starts its own heartbeat
8. Stale detection: the server-side reaper marks workers STALE when `updated_at` is more than 90s old; GC deletes entries after 1 hour
See metadata.md for the full storage schema and debugging guide.
| Variable | Default | Description |
|---|---|---|
| `MX_REGISTER_LOADERS` | `1` | Auto-register the mx loader with vLLM |
| `MODEL_EXPRESS_URL` | `localhost:8001` | gRPC server address |
| `MX_SERVER_ADDRESS` | `localhost:8001` | Backward-compat alias for `MODEL_EXPRESS_URL` |
| `MX_METADATA_BACKEND` | (required) | Metadata backend: `redis` or `kubernetes` |
| `MX_CONTIGUOUS_REG` | `0` | Enable contiguous region registration (experimental) |
| `MX_P2P_METADATA` | `0` | Enable P2P metadata exchange on source workers |
| `MX_METADATA_PORT` | `5555` | Base NIXL listen port; effective port is `MX_METADATA_PORT + device_id` |
| `MX_WORKER_GRPC_PORT` | `0` | Worker gRPC port for P2P tensor manifest serving |
| `MX_WORKER_HOST` | (auto-detect) | Override worker IP/hostname for P2P endpoints |
| `MX_HEARTBEAT_INTERVAL_SECS` | `30` | Client heartbeat frequency |
| `MX_HEARTBEAT_TIMEOUT_SECS` | `90` | Server reaper staleness threshold |
| `MX_REAPER_SCAN_INTERVAL_SECS` | `30` | Server reaper scan frequency |
| `MX_GC_TIMEOUT_SECS` | `3600` | Time before stale entries are deleted |
| `VLLM_RPC_TIMEOUT` | `7200000` | vLLM RPC timeout in ms |
| Variable | Recommended | Description |
|---|---|---|
| `UCX_TLS` | `rc_x,rc,dc_x,dc,cuda_copy` | Transport layers for InfiniBand |
| `UCX_RNDV_SCHEME` | `get_zcopy` | Zero-copy RDMA reads |
| `UCX_RNDV_THRESH` | `0` | Force rendezvous for all transfers |
| `NIXL_LOG_LEVEL` | `INFO` | NIXL logging (DEBUG for troubleshooting) |
| `UCX_LOG_LEVEL` | `WARN` | UCX logging (DEBUG for troubleshooting) |
Target fails with Remote access error on mlx5_X:1/IB. Common causes: source crashed/restarted (stale rkeys), UCX transport misconfiguration, premature target connection. Fix: use robust ready coordination, check for restarts, enable UCX_LOG_LEVEL=DEBUG.
When MX_CONTIGUOUS_REG=1, transfers fail even when source is stable. Current workaround: use baseline mode (MX_CONTIGUOUS_REG=0).
DeepSeek-V3 takes ~40 minutes to warm up (loading + DeepGemm + CUDA graphs). Target must wait via ready coordination.
| Metric | Value |
|---|---|
| Model | DeepSeek-V3 (671B, FP8) |
| Total Data | 681 GB (8 workers x 85 GB) |
| Transfer Time | ~15 seconds (8 parallel RDMA streams @ ~45 Gbps each) |
| Per-Worker Speed | ~45 Gbps |
| Theoretical Max | 400 Gbps per NIC |
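The table's per-worker figure follows directly from the totals; a quick arithmetic check (using decimal GB and approximate rounding, as the table does):

```python
# Sanity-check the benchmark table: 681 GB split evenly across 8 workers,
# transferred in ~15 s over 8 parallel RDMA streams.
total_gb = 681
workers = 8
transfer_secs = 15

per_worker_gb = total_gb / workers                    # ~85 GB per worker
per_worker_gbps = per_worker_gb * 8 / transfer_secs   # GB -> Gb, then per second
utilization = per_worker_gbps / 400                   # fraction of a 400 Gbps NIC
```

Each worker thus moves roughly 45 Gbps, about 11% of the theoretical per-NIC maximum, which is why the optimization list below targets parallelism and registration overhead rather than raw link capacity.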
Optimization opportunities: contiguous regions (blocked), warm source pool, DeepGemm kernel caching, multi-rail RDMA (UCX_IB_NUM_PATHS=2).
See DEPLOYMENT.md for the full deployment guide covering server/client configuration, Docker, Kubernetes, Helm, P2P transfer setup, and debugging commands.