This guide covers deploying LandmarkDiff for production or demo use, including Docker, HuggingFace Spaces, REST API setup, and production considerations.
The simplest deployment is the built-in Gradio interface:
```bash
pip install -e ".[app]"
python scripts/app.py
# Open http://localhost:7860
```

This launches a five-tab demo (single procedure, multi-comparison, intensity sweep, face analysis, multi-angle capture) on port 7860. The first run downloads the model weights (~6 GB), which are cached for subsequent launches.
To use a different port:
```bash
python scripts/app.py --port 8080
```

For demos that only need TPS (geometric warping) mode:
```bash
# Build
docker build -t landmarkdiff:cpu -f Dockerfile.cpu .

# Run
docker run -p 7860:7860 landmarkdiff:cpu
```

The CPU Dockerfile uses python:3.11-slim, installs CPU-only PyTorch from https://download.pytorch.org/whl/cpu, and runs the Gradio demo in TPS mode. The resulting image is smaller and requires no GPU drivers.
For ControlNet and diffusion-based inference:
```bash
# Build the GPU image (runtime CUDA, smaller footprint)
docker build -t landmarkdiff:gpu -f Dockerfile.gpu .

# Run with GPU passthrough
docker run --gpus all -p 7860:7860 landmarkdiff:gpu
```

Dockerfile.gpu uses nvidia/cuda:12.1.1-runtime-ubuntu22.04 with Python 3.11. It requires the NVIDIA Container Toolkit on the host.
For detailed GPU prerequisites, VRAM requirements by GPU tier, verification steps, and troubleshooting, see the Docker GPU Setup guide.
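As a quick sanity check before running the GPU image, you can confirm the NVIDIA Container Toolkit is working by launching a bare CUDA container (the base image tag here is illustrative; any CUDA image works):

```shell
# If the toolkit is configured correctly, this prints the host's GPU table
docker run --rm --gpus all nvidia/cuda:12.1.1-base-ubuntu22.04 nvidia-smi
```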
The docker-compose.yml defines five services:
```bash
# CPU demo (default)
docker compose up app

# GPU demo (runtime image, recommended)
docker compose up gpu

# GPU demo (devel image, for compiling extensions)
docker compose up app-gpu

# Build Sphinx docs
docker compose run docs

# Training (requires GPU)
docker compose --profile training run train
```

Service details:
| Service | Dockerfile | GPU | Port | Description |
|---|---|---|---|---|
| `app` | Dockerfile.cpu | No | 7860 | TPS-mode Gradio demo |
| `gpu` | Dockerfile.gpu | Yes (1 GPU) | 7861 | GPU inference (runtime image) |
| `app-gpu` | Dockerfile | Yes (1 GPU) | 7860 | GPU inference (devel image) |
| `docs` | python:3.11-slim | No | -- | Sphinx documentation builder |
| `train` | Dockerfile | Yes (1 GPU) | -- | ControlNet training |
All services mount the following volumes:

- `./data:/app/data`: training data and test pairs
- `./checkpoints:/app/checkpoints`: model checkpoints
- `model-cache:/root/.cache`: shared HuggingFace model cache
Create the host directories before running:
```bash
mkdir -p data checkpoints
```

To change the default inference mode, set the LANDMARKDIFF_MODE environment variable:

```bash
docker run -e LANDMARKDIFF_MODE=controlnet --gpus all -p 7860:7860 landmarkdiff
```

To pre-download model weights during the build (so the first inference is fast), add to the Dockerfile:
```dockerfile
RUN python -c "from diffusers import ControlNetModel; ControlNetModel.from_pretrained('CrucibleAI/ControlNetMediaPipeFace')"
```

To deploy on HuggingFace Spaces:

- Create a new Space at huggingface.co/new-space.
- Select Gradio as the SDK.
- Choose hardware:
  - CPU Basic (free): works for TPS mode only
  - T4 Small: minimum for ControlNet inference
  - A10G Small: recommended for faster inference
- Push the repository contents to the Space.
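Pushing works like any other HuggingFace git repository; a typical sequence looks like the following (the Space URL is a placeholder for your own username and Space name):

```shell
# Add the Space as a git remote and push (URL is hypothetical)
git remote add space https://huggingface.co/spaces/<user>/<space-name>
git push space main
```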
The scripts/app.py Gradio demo is compatible with HuggingFace Spaces out of the box. It auto-detects the environment and sets share=False (Spaces already provides a public URL).
Create an app.py at the repository root that imports and launches the demo, or point the Space to scripts/app.py in your README.md metadata:
```yaml
---
title: LandmarkDiff
emoji: 🔬
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 4.0.0
app_file: scripts/app.py
pinned: false
---
```

Set secrets in the Space settings (Settings > Repository Secrets):

- `HF_TOKEN`: if using gated models
- `LANDMARKDIFF_MODE`: set to `tps` for CPU Spaces
HuggingFace Spaces provides /data as a persistent volume. Use it for caching model weights:
```bash
export HF_HOME=/data/huggingface_cache
```

For programmatic access without the Gradio UI, wrap the pipeline in a FastAPI server:
```python
"""LandmarkDiff REST API server."""

import io
import logging
from contextlib import asynccontextmanager

import cv2
import numpy as np
from fastapi import FastAPI, File, HTTPException, Query, UploadFile
from fastapi.responses import StreamingResponse

from landmarkdiff.inference import LandmarkDiffPipeline

logger = logging.getLogger(__name__)

# Pipeline singleton: loaded once at startup
_pipeline = None


@asynccontextmanager
async def lifespan(app: FastAPI):
    """Load the pipeline on startup, clean up on shutdown."""
    global _pipeline
    logger.info("Loading LandmarkDiff pipeline...")
    _pipeline = LandmarkDiffPipeline(mode="controlnet", device="cuda")
    _pipeline.load()
    logger.info("Pipeline ready.")
    yield
    _pipeline = None


app = FastAPI(
    title="LandmarkDiff API",
    version="0.2.0",
    lifespan=lifespan,
)

VALID_PROCEDURES = [
    "rhinoplasty",
    "blepharoplasty",
    "rhytidectomy",
    "orthognathic",
    "brow_lift",
    "mentoplasty",
]


@app.post("/predict")
async def predict(
    image: UploadFile = File(...),
    procedure: str = Query("rhinoplasty", enum=VALID_PROCEDURES),
    intensity: int = Query(60, ge=0, le=100),
    seed: int = Query(42, ge=0),
):
    """Generate a surgical prediction.

    Returns the predicted post-operative image as PNG.
    """
    if _pipeline is None:
        raise HTTPException(503, "Pipeline not loaded")

    # Read and decode image
    contents = await image.read()
    nparr = np.frombuffer(contents, np.uint8)
    img = cv2.imdecode(nparr, cv2.IMREAD_COLOR)
    if img is None:
        raise HTTPException(400, "Could not decode image")

    # Run prediction
    try:
        result = _pipeline.generate(
            img,
            procedure=procedure,
            intensity=intensity,
            seed=seed,
        )
    except Exception as e:
        raise HTTPException(500, f"Prediction failed: {e}") from e

    # Encode output as PNG
    output_img = result["output"]
    _, buffer = cv2.imencode(".png", output_img)
    return StreamingResponse(
        io.BytesIO(buffer.tobytes()),
        media_type="image/png",
        headers={"Content-Disposition": "inline; filename=prediction.png"},
    )


@app.get("/health")
async def health():
    """Health check endpoint."""
    return {
        "status": "ok",
        "pipeline_loaded": _pipeline is not None,
    }
```

Save this as scripts/api_server.py and run with uvicorn:
```bash
pip install fastapi uvicorn python-multipart
uvicorn scripts.api_server:app --host 0.0.0.0 --port 8000
```

```bash
# Health check
curl http://localhost:8000/health

# Run prediction (procedure, intensity, and seed are query parameters)
curl -X POST "http://localhost:8000/predict?procedure=rhinoplasty&intensity=60" \
  -F "image=@face.jpg" \
  -o prediction.png
```

If you prefer Flask:
```python
"""Minimal Flask API for LandmarkDiff."""

import io

import cv2
import numpy as np
from flask import Flask, request, send_file

from landmarkdiff.inference import LandmarkDiffPipeline

app = Flask(__name__)
pipeline = LandmarkDiffPipeline(mode="controlnet", device="cuda")
pipeline.load()


@app.route("/predict", methods=["POST"])
def predict():
    file = request.files.get("image")
    if file is None:
        return {"error": "No image provided"}, 400

    procedure = request.form.get("procedure", "rhinoplasty")
    intensity = int(request.form.get("intensity", 60))

    nparr = np.frombuffer(file.read(), np.uint8)
    img = cv2.imdecode(nparr, cv2.IMREAD_COLOR)
    if img is None:
        return {"error": "Could not decode image"}, 400

    result = pipeline.generate(img, procedure=procedure, intensity=intensity)
    _, buffer = cv2.imencode(".png", result["output"])
    return send_file(io.BytesIO(buffer.tobytes()), mimetype="image/png")


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8000)
```

On first inference, the pipeline downloads ~6 GB of model weights from HuggingFace. For production deployments:
- Pre-download models during the Docker build or deployment setup:

  ```bash
  export HF_HOME=/persistent/cache
  python -c "
  from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
  ControlNetModel.from_pretrained('CrucibleAI/ControlNetMediaPipeFace')
  "
  ```

- Use a persistent volume for the cache so container restarts do not re-download:

  ```bash
  docker run -v model-cache:/root/.cache --gpus all -p 7860:7860 landmarkdiff
  ```

- Set `HF_HOME` to a stable location:

  ```bash
  export HF_HOME=/data/huggingface_cache
  ```
For processing many images (e.g., generating training data or running evaluations):
- Load the pipeline once and reuse it:

  ```python
  pipeline = LandmarkDiffPipeline(mode="controlnet", device="cuda")
  pipeline.load()
  for image_path in image_paths:
      result = pipeline.generate(load_image(image_path), ...)
  ```

- Use the batch inference script for directory-level processing:

  ```bash
  python examples/batch_inference.py /path/to/images/ \
      --procedure rhinoplasty \
      --intensity 50 \
      --output output/batch/
  ```

- For large batches on HPC, use SLURM array jobs:

  ```bash
  #!/bin/bash
  #SBATCH --array=0-9
  #SBATCH --gres=gpu:1

  TOTAL=1000
  PER_JOB=$((TOTAL / 10))
  START=$((SLURM_ARRAY_TASK_ID * PER_JOB))

  python scripts/batch_process.py \
      --start $START --count $PER_JOB \
      --input data/images/ --output output/
  ```
Memory: The full inference pipeline uses ~5.2 GB VRAM and ~4 GB CPU RAM. For a web server, budget at least 8 GB RAM per worker process.
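The 8 GB/worker guideline translates directly into a worker-count ceiling for a given box. A small sizing sketch (the helper name and the Linux-only `sysconf` lookup are illustrative, not project code):

```python
import os


def max_workers(total_ram_bytes: int, per_worker_gb: float = 8.0) -> int:
    """Upper bound on worker processes, given ~8 GB RAM per worker."""
    per_worker = int(per_worker_gb * 1024**3)
    return max(1, total_ram_bytes // per_worker)


if __name__ == "__main__":
    # Linux: total physical RAM via sysconf
    total = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
    print(f"Suggested max workers: {max_workers(total)}")
```

In practice the number of GPUs, not RAM, is usually the binding constraint, as the concurrency notes below describe.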
Concurrency: The pipeline is not thread-safe. Use process-based concurrency:
- uvicorn: Run with `--workers N`, where N is the number of GPUs.
- gunicorn: Use `--workers N --worker-class uvicorn.workers.UvicornWorker`.
- Each worker loads its own pipeline and GPU. Do not share pipeline objects across workers.
Timeouts: ControlNet inference takes 3-15 seconds depending on hardware. Set request timeouts accordingly:
```bash
uvicorn scripts.api_server:app --timeout-keep-alive 30
```

Disk: Model weights take ~6 GB on disk. Temporary files (intermediate images) are cleaned up automatically, but ensure enough temp space for concurrent requests.
For public-facing deployments, add rate limiting to prevent abuse:
```python
from fastapi import Request
from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.errors import RateLimitExceeded
from slowapi.util import get_remote_address

limiter = Limiter(key_func=get_remote_address)
app.state.limiter = limiter
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)

@app.post("/predict")
@limiter.limit("10/minute")
async def predict(request: Request, ...):  # slowapi requires a Request argument
    ...
```

Never expose the API without authentication in production:
```python
import os
import secrets

from fastapi import Depends, HTTPException
from fastapi.security import HTTPBearer, HTTPAuthorizationCredentials

security = HTTPBearer()
EXPECTED_TOKEN = os.environ["API_TOKEN"]  # or however you manage secrets

async def verify_token(credentials: HTTPAuthorizationCredentials = Depends(security)):
    # Constant-time comparison avoids timing side channels
    if not secrets.compare_digest(credentials.credentials, EXPECTED_TOKEN):
        raise HTTPException(401, "Invalid token")

@app.post("/predict", dependencies=[Depends(verify_token)])
async def predict(...):
    ...
```

Put the API behind nginx for SSL termination and additional security:
```nginx
server {
    listen 443 ssl;
    server_name landmarkdiff.example.com;

    ssl_certificate /etc/ssl/cert.pem;
    ssl_certificate_key /etc/ssl/key.pem;

    client_max_body_size 10M;

    location / {
        proxy_pass http://127.0.0.1:8000;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_read_timeout 60s;
    }
}
```

- Use HTTPS for all connections
- Authenticate all API endpoints
- Rate limit requests to prevent abuse
- Validate and sanitize all inputs (file type, file size)
- Do not store patient photos without explicit consent
- Follow HIPAA guidelines if handling medical data
- Set `client_max_body_size` to a reasonable limit (10 MB)
- Log requests for audit trails, but do not log image content
- Run the container as a non-root user in production
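The input-validation item above can start as a whitelist on file extension plus a size cap, applied before any decoding. A minimal sketch (the helper name and limits are illustrative, not part of the project):

```python
import os

ALLOWED_EXTENSIONS = {".jpg", ".jpeg", ".png"}
MAX_UPLOAD_BYTES = 10 * 1024 * 1024  # mirror nginx's client_max_body_size


def is_acceptable_upload(filename: str, size_bytes: int) -> bool:
    """Reject files with unexpected extensions or out-of-range sizes."""
    ext = os.path.splitext(filename)[1].lower()
    return ext in ALLOWED_EXTENSIONS and 0 < size_bytes <= MAX_UPLOAD_BYTES
```

A stricter check would also sniff the file's magic bytes; note that `cv2.imdecode` returning `None` in the servers above already rejects most malformed payloads.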
Add basic metrics to track API health:
```python
import time

from prometheus_client import Counter, Histogram

REQUEST_COUNT = Counter("predict_total", "Total prediction requests", ["procedure"])
REQUEST_LATENCY = Histogram("predict_seconds", "Prediction latency in seconds")

@app.post("/predict")
async def predict(...):
    start = time.time()
    # ... run prediction ...
    REQUEST_LATENCY.observe(time.time() - start)
    REQUEST_COUNT.labels(procedure=procedure).inc()
```

- Evaluation Guide: Measure model quality
- Training Guide: Train your own checkpoint
- FAQ: Common questions and troubleshooting