User-facing guide for configuring and deploying ModelExpress. For architecture details, see ARCHITECTURE.md. For development setup, see ../CONTRIBUTING.md.
ModelExpress uses a layered configuration system. Sources are applied in order of precedence:
- Command line arguments (highest priority)
- Environment variables (`MODEL_EXPRESS_*` prefix)
- Configuration file (YAML)
- Default values (lowest priority)
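As a hedged illustration (not the server's actual code), the layering behaves like nested fallbacks, with each layer shadowing the ones below it:

```shell
# Illustrative only: how the four layers shadow each other for the port.
default_port=8001   # built-in default (lowest priority)
file_port=8005      # hypothetical value from model-express.yaml
env_port=""         # would come from MODEL_EXPRESS_SERVER_PORT; unset here
cli_port=""         # would come from --port; unset here

# Each layer falls through to the next when empty.
effective_port="${cli_port:-${env_port:-${file_port:-$default_port}}}"
echo "$effective_port"    # -> 8005: the file value wins since env and CLI are unset
```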
```shell
cargo run --bin config_gen -- --output model-express.yaml
```

The generated file contains all options with their defaults:
```yaml
server:
  host: "0.0.0.0"
  port: 8001
database:
  path: "./models.db"
cache:
  directory: "./cache"
  max_size_bytes: null
  eviction:
    enabled: true
    policy:
      type: lru
      unused_threshold: "7d"
      max_models: null
    min_free_space_bytes: null
    check_interval: "1h"
logging:
  level: info
  format: pretty
  file: null
  structured: false
```

```shell
# With defaults
cargo run --bin modelexpress-server

# With a configuration file
cargo run --bin modelexpress-server -- --config model-express.yaml

# With CLI overrides
cargo run --bin modelexpress-server -- --port 8080 --log-level debug

# Validate config without starting
cargo run --bin modelexpress-server -- --config model-express.yaml --validate-config
```

| Option | CLI Flag | Env Var | Default | Description |
|---|---|---|---|---|
| host | `--host` | `MODEL_EXPRESS_SERVER_HOST` | `0.0.0.0` | Bind address |
| port | `--port`, `-p` | `MODEL_EXPRESS_SERVER_PORT` | `8001` | gRPC port |
| Option | CLI Flag | Env Var | Default | Description |
|---|---|---|---|---|
| path | `--database-path`, `-d` | `MODEL_EXPRESS_DATABASE_PATH` | `./models.db` | SQLite file path |
In multi-node Kubernetes deployments, the database should be on a shared persistent volume.
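For example, a ReadWriteMany claim along these lines could back both the database and the cache (an illustrative sketch; the claim name and storage class are assumptions, not shipped defaults):

```yaml
# Illustrative sketch: an RWX PVC so multiple server pods can share
# the SQLite file and model cache. Names/class are assumptions.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: modelexpress-shared-data
spec:
  accessModes:
    - ReadWriteMany          # required for multi-node access
  storageClassName: nfs-client   # any RWX-capable storage class
  resources:
    requests:
      storage: 10Gi
```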
| Option | CLI Flag | Env Var | Default | Description |
|---|---|---|---|---|
| directory | `--cache-directory` | `MODEL_EXPRESS_CACHE_DIRECTORY` | `./cache` | Model cache directory |
| max_size_bytes | - | - | null (unlimited) | Max cache size in bytes |
| eviction.enabled | `--cache-eviction-enabled` | `MODEL_EXPRESS_CACHE_EVICTION_ENABLED` | `true` | Enable LRU eviction |

Eviction policy settings (in config file only):
- `eviction.policy.unused_threshold` - Evict models unused for this duration (default: 7 days)
- `eviction.policy.max_models` - Max models to keep (default: unlimited)
- `eviction.check_interval` - How often to check for eviction (default: 1 hour)
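Putting those settings together, a stricter policy in model-express.yaml might look like this (values are illustrative, not recommendations):

```yaml
cache:
  eviction:
    enabled: true
    policy:
      type: lru
      unused_threshold: "3d"   # evict models unused for 3 days
      max_models: 10           # keep at most 10 models
    check_interval: "30m"      # check twice per hour
```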
| Option | CLI Flag | Env Var | Default | Description |
|---|---|---|---|---|
| level | `--log-level`, `-l` | `MODEL_EXPRESS_LOG_LEVEL` | `info` | `trace`, `debug`, `info`, `warn`, `error` |
| format | `--log-format` | `MODEL_EXPRESS_LOG_FORMAT` | `pretty` | `json`, `pretty`, `compact` |
| file | - | - | null (stdout) | Log file path |
| structured | - | - | `false` | Structured logging |
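For instance, a production-style logging section might combine JSON output with a log file (values are illustrative):

```yaml
logging:
  level: info
  format: json                       # machine-parseable output
  file: "/var/log/modelexpress.log"  # hypothetical path; null logs to stdout
  structured: true
```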
```shell
export MODEL_EXPRESS_SERVER_HOST="127.0.0.1"
export MODEL_EXPRESS_SERVER_PORT=8080
export MODEL_EXPRESS_DATABASE_PATH="/data/models.db"
export MODEL_EXPRESS_CACHE_DIRECTORY="/data/cache"
export MODEL_EXPRESS_CACHE_EVICTION_ENABLED=true
export MODEL_EXPRESS_LOG_LEVEL=debug
export MODEL_EXPRESS_LOG_FORMAT=json
```

The CLI client also uses layered configuration: CLI args > env vars > config file > defaults.
| Env Var | Default | Description |
|---|---|---|
| `MODEL_EXPRESS_ENDPOINT` | `http://localhost:8001` | Server endpoint |
| `MODEL_EXPRESS_TIMEOUT` | `30` | Request timeout (seconds) |
| `MODEL_EXPRESS_CACHE_DIRECTORY` | (auto) | Cache path override |
| `MODEL_EXPRESS_MAX_RETRIES` | (none) | Max retry attempts |
| `MODEL_EXPRESS_NO_SHARED_STORAGE` | `false` | Use gRPC streaming instead of shared storage |
| `MODEL_EXPRESS_TRANSFER_CHUNK_SIZE` | `32768` | Transfer chunk size (bytes) |
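A typical client environment for talking to an in-cluster server might look like this (the hostname is an assumption; adjust it to your Service name):

```shell
export MODEL_EXPRESS_ENDPOINT="http://modelexpress-server:8001"  # assumed Service name
export MODEL_EXPRESS_TIMEOUT=120       # allow slower large-model downloads
export MODEL_EXPRESS_MAX_RETRIES=3     # retry transient failures
echo "$MODEL_EXPRESS_ENDPOINT"
```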
Cache directory resolution for HuggingFace: `MODEL_EXPRESS_CACHE_DIRECTORY` -> `HF_HUB_CACHE` -> `~/.cache/huggingface/hub`.
Cache directory resolution for NGC: `MODEL_EXPRESS_CACHE_DIRECTORY` -> `~/.cache/ngc`.
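The HuggingFace fallback chain can be sketched with shell parameter expansion (illustrative, not the client's actual code):

```shell
# First non-empty value wins: MX override, then HF_HUB_CACHE, then the HF default.
hf_cache="${MODEL_EXPRESS_CACHE_DIRECTORY:-${HF_HUB_CACHE:-$HOME/.cache/huggingface/hub}}"
echo "$hf_cache"
```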
See CLI.md for full CLI usage documentation.
The multi-stage Dockerfile builds all binaries (server, CLI, test tools):
```shell
docker build -t model-express .
docker run -p 8001:8001 model-express
```

Single-service setup for local development:

```shell
docker-compose up --build
```

For GPU-to-GPU weight transfers with vLLM:
```shell
docker build -f examples/p2p_transfer_k8s/Dockerfile.client \
  -t your-registry/mx-client:TAG .
docker push your-registry/mx-client:TAG
```

Deploy the server using one of the example manifests under examples/:
- With Redis backend: `examples/p2p_transfer_k8s/server/redis_backend/modelexpress-server-redis.yaml`
- With Kubernetes CRD backend: `examples/p2p_transfer_k8s/server/kubernetes_backend/modelexpress-server-kubernetes.yaml`
- Aggregated with Dynamo: `examples/aggregated_k8s/agg.yaml`
Most deployments need a HuggingFace token for model downloads:
```shell
export HF_TOKEN=your_hf_token
kubectl create secret generic hf-token-secret \
  --from-literal=HF_TOKEN=${HF_TOKEN} \
  -n ${NAMESPACE}
```

To download models from NVIDIA NGC, set an NGC API key. The server resolves it in this order:
1. `NGC_API_KEY` environment variable
2. `NGC_CLI_API_KEY` environment variable
3. `~/.ngc/config` (written by `ngc config set`)
```shell
export NGC_API_KEY=your_ngc_api_key
kubectl create secret generic ngc-api-key-secret \
  --from-literal=NGC_API_KEY=${NGC_API_KEY} \
  -n ${NAMESPACE}
```

Pass it to the server pod via `envFrom` or individual `env` entries in your deployment manifest.
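Wiring both secrets into the server container could look like this fragment of a Deployment spec (the container name is an assumption):

```yaml
containers:
  - name: modelexpress-server   # assumed container name
    envFrom:
      - secretRef:
          name: hf-token-secret
      - secretRef:
          name: ngc-api-key-secret
```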
The helm/ directory provides a full Helm chart with configurable replicas, PVC, ingress, and resource limits.
```shell
# Deploy with defaults (1 replica, 10Gi PVC)
helm/deploy.sh --namespace my-ns

# Development (debug logging, 512Mi memory)
helm/deploy.sh --namespace my-ns --values helm/values-development.yaml

# Production (3 replicas, 2Gi memory, ingress, pod anti-affinity)
helm/deploy.sh --namespace my-ns --values helm/values-production.yaml

# Local testing (no PVC, emptyDir)
helm/deploy.sh --namespace my-ns --values helm/values-local-storage.yaml
```

See ../helm/README.md for the full parameter reference and installation guide.
For deploying ModelExpress alongside Dynamo with a vLLM worker:
```shell
kubectl apply -f examples/aggregated_k8s/agg.yaml
```

See ../examples/aggregated_k8s/README.md for the full guide.
ModelExpress supports GPU-to-GPU model weight transfers between vLLM instances using NVIDIA NIXL over RDMA. Use `--load-format mx`, which auto-detects whether to load from disk or receive via RDMA.
| Variable | Default | Description |
|---|---|---|
| `MX_METADATA_BACKEND` | (required on server) | `redis` or `kubernetes` |
| `MODEL_EXPRESS_URL` | `localhost:8001` | gRPC server address |
| `MX_SERVER_ADDRESS` | `localhost:8001` | Backward-compat alias for `MODEL_EXPRESS_URL` |
| `MX_REGISTER_LOADERS` | `1` | Auto-register the `mx` loader with vLLM |
| `MX_CONTIGUOUS_REG` | `0` | Contiguous region registration (experimental) |
| `MODEL_EXPRESS_LOG_LEVEL` | (inherits vLLM) | Override log level for `modelexpress.*` loggers. `DEBUG` enables per-tensor checksums and adopted tensor details |
| `MX_SKIP_FEATURE_CHECK` | `0` | Bypass the MLA feature gate for P2P transfer (testing only) |
| `MX_P2P_METADATA` | `0` | Enable P2P metadata exchange (source workers only) |
| `MX_METADATA_PORT` | `5555` | Base NIXL listen port; effective port is `MX_METADATA_PORT + device_id` |
| `MX_WORKER_GRPC_PORT` | `0` | Worker gRPC port for P2P tensor manifest serving |
| `MX_WORKER_HOST` | (auto-detect) | Override worker IP/hostname for P2P endpoints |
| `MX_STATUS_TTL_SECS` | `3600` | TTL for Redis metadata keys (seconds) |
| `REDIS_URL` | `redis://localhost:6379` | Redis connection URL (Redis backend only) |
| `MX_METADATA_NAMESPACE` | `default` | K8s namespace for CRD backend |
| `VLLM_RPC_TIMEOUT` | `7200000` | vLLM RPC timeout in ms (2 hours for large models) |
| `VLLM_PLUGINS` | - | Set to `modelexpress` to register the `mx` loader |
Each GPU worker publishes independently using its global rank (`torch.distributed.get_rank()`). No inter-worker coordination or barriers are required.
By default, source workers publish full tensor metadata (NIXL blobs + tensor descriptors) to the central server. With `MX_P2P_METADATA=1`, source workers instead publish lightweight endpoint pointers and exchange metadata directly with targets:
- NIXL agent blobs exchanged via NIXL's native listen thread (`MX_METADATA_PORT`)
- Tensor descriptors served by a per-worker gRPC `WorkerService` (`MX_WORKER_GRPC_PORT`)
- Central server stores only endpoint addresses, not MB-scale metadata
Targets auto-detect which mode a source is using based on whether `worker_grpc_endpoint` is populated in the metadata. No configuration is needed on the target side.
Set `MX_METADATA_PORT` and `MX_WORKER_GRPC_PORT` to fixed ports when running in K8s (port `0` picks an ephemeral port). Set `MX_WORKER_HOST` if the pod IP auto-detection doesn't produce a routable address.
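In a pod spec, pinning those ports and advertising the pod IP might look like this fragment (port numbers are illustrative):

```yaml
env:
  - name: MX_METADATA_PORT
    value: "5555"              # base NIXL listen port (+ device_id per GPU)
  - name: MX_WORKER_GRPC_PORT
    value: "50055"             # illustrative fixed port instead of ephemeral 0
  - name: MX_WORKER_HOST
    valueFrom:
      fieldRef:
        fieldPath: status.podIP   # advertise a routable address
```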
ModelStreamer streams safetensors directly to GPU memory via `runai-model-streamer`. It supports S3, GCS, Azure Blob Storage, and local filesystem (PVC) paths. The first pod streams from storage; subsequent pods use P2P RDMA from GPU memory.
All storage backends (S3, GCS, Azure) are included as core dependencies; no extra install step is needed. The strategy activates when `MX_MODEL_URI` is set.
General configuration:
| Variable | Default | Description |
|---|---|---|
| `MX_MODEL_URI` | (none) | Model location. Must be set to enable ModelStreamer. Accepts: remote URI (`s3://bucket/model`, `gs://...`, `az://...`), absolute local path (`/models/deepseek-ai/DeepSeek-V3`), or HuggingFace model ID (`deepseek-ai/DeepSeek-V3`, resolved via `HF_HUB_CACHE`). |
| `RUNAI_STREAMER_CONCURRENCY` | `8` | Number of concurrent read threads |
| `RUNAI_STREAMER_MEMORY_LIMIT` | (none) | CPU staging buffer size in bytes. `0` reuses a single-tensor buffer (most memory efficient). See runai-model-streamer docs. |
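For example, streaming from S3 with more reader threads (the bucket path and concurrency value are illustrative):

```shell
export MX_MODEL_URI="s3://my-bucket/deepseek-ai/DeepSeek-V3"  # hypothetical bucket
export RUNAI_STREAMER_CONCURRENCY=16                          # up from the default 8
echo "$MX_MODEL_URI"
```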
S3 / S3-compatible:
| Variable | Description |
|---|---|
| `AWS_ACCESS_KEY_ID` | S3 credentials (auto-detected by boto3) |
| `AWS_SECRET_ACCESS_KEY` | S3 credentials |
| `AWS_SESSION_TOKEN` | Required for temporary credentials (SSO/IRSA) |
| `AWS_DEFAULT_REGION` | AWS region |
| `AWS_ENDPOINT_URL` | Custom endpoint for S3-compatible storage (MinIO, Ceph) |
Google Cloud Storage:
| Variable | Description |
|---|---|
| `GOOGLE_APPLICATION_CREDENTIALS` | Path to service account JSON key file |
Also supports GKE Workload Identity and Application Default Credentials (ADC) — no env vars needed when running on GKE with a properly configured service account.
Azure Blob Storage:
| Variable | Description |
|---|---|
| `AZURE_ACCOUNT_NAME` | Storage account name |
| `AZURE_ACCOUNT_KEY` | Storage account access key |
Or use service principal auth (AZURE_CLIENT_ID + AZURE_CLIENT_SECRET + AZURE_TENANT_ID) or Azure Managed Identity (no env vars needed on AKS).
Credentials are auto-detected by the underlying cloud SDKs. No credentials flow through the MX server or gRPC.
| Variable | Recommended | Description |
|---|---|---|
| `UCX_TLS` | `rc_x,rc,dc_x,dc,cuda_copy` | Transport layers for InfiniBand |
| `UCX_RNDV_SCHEME` | `get_zcopy` | Zero-copy RDMA reads |
| `UCX_RNDV_THRESH` | `0` | Force rendezvous for all transfers |
| `NIXL_LOG_LEVEL` | `INFO` | NIXL logging (`DEBUG` for troubleshooting) |
| `UCX_LOG_LEVEL` | `WARN` | UCX logging (`DEBUG` for troubleshooting) |
Deploy multiple identical instances - the first one loads from disk and subsequent ones receive via RDMA.
```shell
NAMESPACE=my-namespace

# Deploy server with Redis sidecar
kubectl -n $NAMESPACE apply -f examples/p2p_transfer_k8s/server/redis_backend/modelexpress-server-redis.yaml

# Deploy single-node vLLM (TP=8, 1 node)
kubectl -n $NAMESPACE apply -f examples/p2p_transfer_k8s/client/vllm/vllm-single-node.yaml
```

```shell
# Install CRD and RBAC
kubectl apply -f examples/p2p_transfer_k8s/server/kubernetes_backend/crd-modelmetadata.yaml
kubectl -n $NAMESPACE apply -f examples/p2p_transfer_k8s/server/kubernetes_backend/rbac-modelmetadata.yaml

# Deploy server with CRD backend
kubectl -n $NAMESPACE apply -f examples/p2p_transfer_k8s/server/kubernetes_backend/modelexpress-server-kubernetes.yaml

# Deploy multi-node vLLM (TP=8, PP=2, 2 nodes)
kubectl -n $NAMESPACE apply -f examples/p2p_transfer_k8s/client/vllm/vllm-multi-node.yaml
```

See ../examples/p2p_transfer_k8s/README.md for the full P2P transfer guide, including architecture, prerequisites, and performance expectations.
```shell
# Stream server logs
kubectl -n $NAMESPACE logs -f deploy/modelexpress-server

# Stream vLLM instance logs
kubectl -n $NAMESPACE logs -f deploy/mx-vllm

# Check Redis state (P2P metadata)
kubectl -n $NAMESPACE exec deploy/modelexpress-server -c redis -- redis-cli KEYS 'mx:source:*'

# Inspect a source index (identity + worker list)
kubectl -n $NAMESPACE exec deploy/modelexpress-server -c redis -- redis-cli HGETALL 'mx:source:<source_id>'

# Flush Redis (clear stale metadata - do this on redeploy)
kubectl -n $NAMESPACE exec deploy/modelexpress-server -c redis -- redis-cli FLUSHALL

# Check Kubernetes CRD state
kubectl -n $NAMESPACE get modelmetadatas

# Test inference
kubectl -n $NAMESPACE exec deploy/mx-vllm -- curl -s http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "deepseek-ai/DeepSeek-V3", "prompt": "Hello", "max_tokens": 10}'
```

| Model | Total Data | Transfer Time | Per-Worker Speed |
|---|---|---|---|
| DeepSeek-V3 (671B, FP8) | 681 GB (8 GPUs) | ~15 seconds | ~45 Gbps |
| Llama 3.3 70B | 140 GB (8 GPUs) | ~5 seconds | ~28 Gbps |
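The per-worker figures are consistent with the totals: gigabytes per worker per second, times 8 bits per byte:

```shell
# DeepSeek-V3: 681 GB across 8 workers in ~15 s
awk 'BEGIN { printf "%.0f Gbps\n", 681 / 8 / 15 * 8 }'   # -> 45 Gbps
# Llama 3.3 70B: 140 GB across 8 workers in ~5 s
awk 'BEGIN { printf "%.0f Gbps\n", 140 / 8 / 5 * 8 }'    # -> 28 Gbps
```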