This guide covers deploying LandmarkDiff for production or demo use, including Docker, HuggingFace Spaces, REST API setup, and production considerations.
The simplest deployment is the built-in Gradio interface:
```bash
pip install -e ".[app]"
python scripts/app.py
# Open http://localhost:7860
```

This launches a five-tab demo (single procedure, multi-comparison, intensity sweep, face analysis, multi-angle capture) on port 7860. The first run downloads the model weights (~6 GB), which are cached for subsequent launches.
To use a different port:
```bash
python scripts/app.py --port 8080
```

For demos that only need TPS (geometric warping) mode:
```bash
# Build
docker build -t landmarkdiff:cpu -f Dockerfile.cpu .

# Run
docker run -p 7860:7860 landmarkdiff:cpu
```

The CPU Dockerfile uses python:3.11-slim, installs CPU-only PyTorch from https://download.pytorch.org/whl/cpu, and runs the Gradio demo in TPS mode. The resulting image is smaller and requires no GPU drivers.
For ControlNet and diffusion-based inference:
```bash
# Build the GPU image (runtime CUDA, smaller footprint)
docker build -t landmarkdiff:gpu -f Dockerfile.gpu .

# Run with GPU passthrough
docker run --gpus all -p 7860:7860 landmarkdiff:gpu
```

Dockerfile.gpu uses nvidia/cuda:12.1.1-runtime-ubuntu22.04 with Python 3.11. It requires the NVIDIA Container Toolkit on the host.
For detailed GPU prerequisites, VRAM requirements by GPU tier, verification steps, and troubleshooting, see the Docker GPU Setup guide.
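As a quick sanity check before running the GPU image, you can confirm the NVIDIA Container Toolkit is working by launching a bare CUDA container (the base image tag here is illustrative; any CUDA image works):

```shell
# If the toolkit is configured correctly, this prints the host's GPU table
docker run --rm --gpus all nvidia/cuda:12.1.1-base-ubuntu22.04 nvidia-smi
```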
The docker-compose.yml defines five services:
```bash
# CPU demo (default)
docker compose up app

# GPU demo (runtime image, recommended)
docker compose up gpu

# GPU demo (devel image, for compiling extensions)
docker compose up app-gpu

# Build Sphinx docs
docker compose run docs

# Training (requires GPU)
docker compose --profile training run train
```

Service details:
| Service | Dockerfile | GPU | Port | Description |
|---|---|---|---|---|
| `app` | Dockerfile.cpu | No | 7860 | TPS-mode Gradio demo |
| `gpu` | Dockerfile.gpu | Yes (1 GPU) | 7861 | GPU inference (runtime image) |
| `app-gpu` | Dockerfile | Yes (1 GPU) | 7860 | GPU inference (devel image) |
| `docs` | python:3.11-slim | No | -- | Sphinx documentation builder |
| `train` | Dockerfile | Yes (1 GPU) | -- | ControlNet training |
All services mount the following volumes:

- `./data:/app/data`: training data and test pairs
- `./checkpoints:/app/checkpoints`: model checkpoints
- `model-cache:/root/.cache`: shared HuggingFace model cache
Create the host directories before running:
```bash
mkdir -p data checkpoints
```

To change the default inference mode, set the LANDMARKDIFF_MODE environment variable:

```bash
docker run -e LANDMARKDIFF_MODE=controlnet --gpus all -p 7860:7860 landmarkdiff
```

To pre-download model weights during the build (so the first inference is fast), add to the Dockerfile:
```dockerfile
RUN python -c "from diffusers import ControlNetModel; ControlNetModel.from_pretrained('CrucibleAI/ControlNetMediaPipeFace')"
```

To deploy on HuggingFace Spaces:

- Create a new Space at huggingface.co/new-space.
- Select Gradio as the SDK.
- Choose hardware:
  - CPU Basic (free): works for TPS mode only
  - T4 Small: minimum for ControlNet inference
  - A10G Small: recommended for faster inference
- Push the repository contents to the Space.
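Pushing works like any other HuggingFace git repository; a typical sequence looks like the following (the Space URL is a placeholder for your own username and Space name):

```shell
# Add the Space as a git remote and push (URL is hypothetical)
git remote add space https://huggingface.co/spaces/<user>/<space-name>
git push space main
```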
The scripts/app.py Gradio demo is compatible with HuggingFace Spaces out of the box. It auto-detects the environment and sets share=False (Spaces already provides a public URL).
Create an app.py at the repository root that imports and launches the demo, or point the Space to scripts/app.py in your README.md metadata:
```yaml
---
title: LandmarkDiff
emoji: 🔬
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 4.0.0
app_file: scripts/app.py
pinned: false
---
```

Set secrets in the Space settings (Settings > Repository Secrets):

- `HF_TOKEN`: if using gated models
- `LANDMARKDIFF_MODE`: set to `tps` for CPU Spaces
HuggingFace Spaces provides /data as a persistent volume. Use it for caching model weights:
```bash
export HF_HOME=/data/huggingface_cache
```

For programmatic access without the Gradio UI, wrap the pipeline in a FastAPI server:
```python
"""LandmarkDiff REST API server."""

import io
import logging
from contextlib import asynccontextmanager

import cv2
import numpy as np
from fastapi import FastAPI, File, HTTPException, Query, UploadFile
from fastapi.responses import StreamingResponse

from landmarkdiff.inference import LandmarkDiffPipeline

logger = logging.getLogger(__name__)

# Pipeline singleton: loaded once at startup
_pipeline = None


@asynccontextmanager
async def lifespan(app: FastAPI):
    """Load the pipeline on startup, clean up on shutdown."""
    global _pipeline
    logger.info("Loading LandmarkDiff pipeline...")
    _pipeline = LandmarkDiffPipeline(mode="controlnet", device="cuda")
    _pipeline.load()
    logger.info("Pipeline ready.")
    yield
    _pipeline = None


app = FastAPI(
    title="LandmarkDiff API",
    version="0.2.0",
    lifespan=lifespan,
)

VALID_PROCEDURES = [
    "rhinoplasty",
    "blepharoplasty",
    "rhytidectomy",
    "orthognathic",
    "brow_lift",
    "mentoplasty",
]


@app.post("/predict")
async def predict(
    image: UploadFile = File(...),
    procedure: str = Query("rhinoplasty", enum=VALID_PROCEDURES),
    intensity: int = Query(60, ge=0, le=100),
    seed: int = Query(42, ge=0),
):
    """Generate a surgical prediction.

    Returns the predicted post-operative image as PNG.
    """
    if _pipeline is None:
        raise HTTPException(503, "Pipeline not loaded")

    # Read and decode image
    contents = await image.read()
    nparr = np.frombuffer(contents, np.uint8)
    img = cv2.imdecode(nparr, cv2.IMREAD_COLOR)
    if img is None:
        raise HTTPException(400, "Could not decode image")

    # Run prediction
    try:
        result = _pipeline.generate(
            img,
            procedure=procedure,
            intensity=intensity,
            seed=seed,
        )
    except Exception as e:
        raise HTTPException(500, f"Prediction failed: {e}") from e

    # Encode output as PNG
    output_img = result["output"]
    _, buffer = cv2.imencode(".png", output_img)
    return StreamingResponse(
        io.BytesIO(buffer.tobytes()),
        media_type="image/png",
        headers={"Content-Disposition": "inline; filename=prediction.png"},
    )


@app.get("/health")
async def health():
    """Health check endpoint."""
    return {
        "status": "ok",
        "pipeline_loaded": _pipeline is not None,
    }
```

Save this as scripts/api_server.py and run with uvicorn:
```bash
pip install fastapi uvicorn python-multipart
uvicorn scripts.api_server:app --host 0.0.0.0 --port 8000
```

```bash
# Health check
curl http://localhost:8000/health

# Run prediction (procedure, intensity, and seed are query parameters)
curl -X POST "http://localhost:8000/predict?procedure=rhinoplasty&intensity=60" \
  -F "image=@face.jpg" \
  -o prediction.png
```

If you prefer Flask:
```python
"""Minimal Flask API for LandmarkDiff."""

import io

import cv2
import numpy as np
from flask import Flask, request, send_file

from landmarkdiff.inference import LandmarkDiffPipeline

app = Flask(__name__)
pipeline = LandmarkDiffPipeline(mode="controlnet", device="cuda")
pipeline.load()


@app.route("/predict", methods=["POST"])
def predict():
    file = request.files.get("image")
    if file is None:
        return {"error": "No image provided"}, 400

    procedure = request.form.get("procedure", "rhinoplasty")
    intensity = int(request.form.get("intensity", 60))

    nparr = np.frombuffer(file.read(), np.uint8)
    img = cv2.imdecode(nparr, cv2.IMREAD_COLOR)
    if img is None:
        return {"error": "Could not decode image"}, 400

    result = pipeline.generate(img, procedure=procedure, intensity=intensity)
    _, buffer = cv2.imencode(".png", result["output"])
    return send_file(io.BytesIO(buffer.tobytes()), mimetype="image/png")


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8000)
```

On first inference, the pipeline downloads ~6 GB of model weights from HuggingFace. For production deployments:
- Pre-download models during the Docker build or deployment setup:

  ```bash
  export HF_HOME=/persistent/cache
  python -c "
  from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
  ControlNetModel.from_pretrained('CrucibleAI/ControlNetMediaPipeFace')
  "
  ```

- Use a persistent volume for the cache so container restarts do not re-download:

  ```bash
  docker run -v model-cache:/root/.cache --gpus all -p 7860:7860 landmarkdiff
  ```

- Set `HF_HOME` to a stable location:

  ```bash
  export HF_HOME=/data/huggingface_cache
  ```
For processing many images (e.g., generating training data or running evaluations):
- Load the pipeline once and reuse it:

  ```python
  pipeline = LandmarkDiffPipeline(mode="controlnet", device="cuda")
  pipeline.load()
  for image_path in image_paths:
      result = pipeline.generate(load_image(image_path), ...)
  ```

- Use the batch inference script for directory-level processing:

  ```bash
  python examples/batch_inference.py /path/to/images/ \
      --procedure rhinoplasty \
      --intensity 50 \
      --output output/batch/
  ```

- For large batches on HPC, use SLURM array jobs:

  ```bash
  #!/bin/bash
  #SBATCH --array=0-9
  #SBATCH --gres=gpu:1

  TOTAL=1000
  PER_JOB=$((TOTAL / 10))
  START=$((SLURM_ARRAY_TASK_ID * PER_JOB))

  python scripts/batch_process.py \
      --start $START --count $PER_JOB \
      --input data/images/ --output output/
  ```
Memory: The full inference pipeline uses ~5.2 GB VRAM and ~4 GB CPU RAM. For a web server, budget at least 8 GB RAM per worker process.
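The 8 GB/worker guideline translates directly into a worker-count ceiling for a given box. A small sizing sketch (the helper name and the Linux-only `sysconf` lookup are illustrative, not project code):

```python
import os


def max_workers(total_ram_bytes: int, per_worker_gb: float = 8.0) -> int:
    """Upper bound on worker processes, given ~8 GB RAM per worker."""
    per_worker = int(per_worker_gb * 1024**3)
    return max(1, total_ram_bytes // per_worker)


if __name__ == "__main__":
    # Linux: total physical RAM via sysconf
    total = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
    print(f"Suggested max workers: {max_workers(total)}")
```

In practice the number of GPUs, not RAM, is usually the binding constraint, as the concurrency notes below describe.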
Concurrency: The pipeline is not thread-safe. Use process-based concurrency:
- uvicorn: Run with `--workers N`, where N is the number of GPUs.
- gunicorn: Use `--workers N --worker-class uvicorn.workers.UvicornWorker`.
- Each worker loads its own pipeline and GPU. Do not share pipeline objects across workers.
Timeouts: ControlNet inference takes 3-15 seconds depending on hardware. Set request timeouts accordingly:
```bash
uvicorn scripts.api_server:app --timeout-keep-alive 30
```

Disk: Model weights take ~6 GB on disk. Temporary files (intermediate images) are cleaned up automatically, but ensure enough temp space for concurrent requests.
For public-facing deployments, add rate limiting to prevent abuse:
```python
from fastapi import Request
from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.errors import RateLimitExceeded
from slowapi.util import get_remote_address

limiter = Limiter(key_func=get_remote_address)
app.state.limiter = limiter
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)

@app.post("/predict")
@limiter.limit("10/minute")
async def predict(request: Request, ...):  # slowapi requires a Request argument
    ...
```

Never expose the API without authentication in production:
```python
import os
import secrets

from fastapi import Depends, HTTPException
from fastapi.security import HTTPBearer, HTTPAuthorizationCredentials

security = HTTPBearer()
EXPECTED_TOKEN = os.environ["API_TOKEN"]  # or however you manage secrets

async def verify_token(credentials: HTTPAuthorizationCredentials = Depends(security)):
    # Constant-time comparison avoids timing side channels
    if not secrets.compare_digest(credentials.credentials, EXPECTED_TOKEN):
        raise HTTPException(401, "Invalid token")

@app.post("/predict", dependencies=[Depends(verify_token)])
async def predict(...):
    ...
```

Put the API behind nginx for SSL termination and additional security:
```nginx
server {
    listen 443 ssl;
    server_name landmarkdiff.example.com;

    ssl_certificate /etc/ssl/cert.pem;
    ssl_certificate_key /etc/ssl/key.pem;

    client_max_body_size 10M;

    location / {
        proxy_pass http://127.0.0.1:8000;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_read_timeout 60s;
    }
}
```

- Use HTTPS for all connections
- Authenticate all API endpoints
- Rate limit requests to prevent abuse
- Validate and sanitize all inputs (file type, file size)
- Do not store patient photos without explicit consent
- Follow HIPAA guidelines if handling medical data
- Set `client_max_body_size` to a reasonable limit (10 MB)
- Log requests for audit trails, but do not log image content
- Run the container as a non-root user in production
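The input-validation item above can start as a whitelist on file extension plus a size cap, applied before any decoding. A minimal sketch (the helper name and limits are illustrative, not part of the project):

```python
import os

ALLOWED_EXTENSIONS = {".jpg", ".jpeg", ".png"}
MAX_UPLOAD_BYTES = 10 * 1024 * 1024  # mirror nginx's client_max_body_size


def is_acceptable_upload(filename: str, size_bytes: int) -> bool:
    """Reject files with unexpected extensions or out-of-range sizes."""
    ext = os.path.splitext(filename)[1].lower()
    return ext in ALLOWED_EXTENSIONS and 0 < size_bytes <= MAX_UPLOAD_BYTES
```

A stricter check would also sniff the file's magic bytes; note that `cv2.imdecode` returning `None` in the servers above already rejects most malformed payloads.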
Add basic metrics to track API health:
```python
import time

from prometheus_client import Counter, Histogram

REQUEST_COUNT = Counter("predict_total", "Total prediction requests", ["procedure"])
REQUEST_LATENCY = Histogram("predict_seconds", "Prediction latency in seconds")

@app.post("/predict")
async def predict(...):
    start = time.time()
    # ... run prediction ...
    REQUEST_LATENCY.observe(time.time() - start)
    REQUEST_COUNT.labels(procedure=procedure).inc()
```

- Evaluation Guide: Measure model quality
- Training Guide: Train your own checkpoint
- FAQ: Common questions and troubleshooting