
AMD64/Intel Setup Guide

This guide covers setting up and running the AFM-4.5B model on AMD64/Intel architecture using Docker, without needing Amazon SageMaker or the SageMaker SDK.

Prerequisites

  • AMD64/Intel system (x86_64)
  • Docker with AMD64 support
  • HuggingFace account and token (for gated models)
  • Sufficient disk space (models can be several GB)

Quick Start

1. Build the Container

# Auto-detect your AMD64 system
source scripts/detect-architecture.sh

# Build specifically for AMD64
./scripts/build-amd64.sh

2. Run with a Public Model

For public models that don't require authentication:

docker run -d -p 8080:8080 \
  -e HF_MODEL_ID="arcee-ai/arcee-lite" \
  -e QUANTIZATION="Q4_K_M" \
  --name llm-test \
  sagemaker-inference-container-cpu:amd64

3. Run with a Private or Gated Model

For private or gated models that require authentication, with persistent storage so later runs can reuse the converted model:

First Run (Download + Convert + Quantize)

# Create local directory for models
mkdir -p ./local_models

# First run: Download, convert, and quantize the model
docker run -d -p 8080:8080 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -v $(pwd)/local_models:/opt/models \
  -e HF_MODEL_ID="arcee-ai/AFM-4.5B" \
  -e HF_TOKEN="your_hf_token_here" \
  -e QUANTIZATION="Q4_K_M" \
  --name llm-test \
  sagemaker-inference-container-cpu:amd64

Subsequent Runs (Use Existing Quantized Model)

# Subsequent runs: Use the existing quantized model (much faster!)
docker run -d -p 8080:8080 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -v $(pwd)/local_models:/opt/models \
  -e MODEL_FILENAME="model-f16.Q4_K_M.gguf" \
  --name llm-test \
  sagemaker-inference-container-cpu:amd64

4. Run with Docker Compose (Alternative)

# First run (download, convert, quantize)
docker-compose -f docker/amd64/docker-compose.yml --profile first-run up --build afm-first-run

# Subsequent runs (fast startup)
docker-compose -f docker/amd64/docker-compose.yml --profile fast up afm-fast

Important:

  • Replace your_hf_token_here with your actual HuggingFace token, either from the HuggingFace tokens page or from ~/.cache/huggingface/token (written by huggingface-cli login)
  • First run: Takes 5+ minutes (download, convert, quantize)
  • Subsequent runs: Takes ~30 seconds (loads existing model directly)
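Rather than pasting the token inline, you can read it from the local cache. A minimal sketch; `load_hf_token` is an illustrative helper, not part of the container:

```shell
# load_hf_token: read a cached HuggingFace token from a file into HF_TOKEN.
# Defaults to the path written by `huggingface-cli login`.
load_hf_token() {
  token_file="${1:-$HOME/.cache/huggingface/token}"
  if [ -f "$token_file" ]; then
    HF_TOKEN="$(cat "$token_file")"
    export HF_TOKEN
    return 0
  fi
  echo "no cached token at $token_file" >&2
  return 1
}
```

Then start the container with `load_hf_token && docker run ... -e HF_TOKEN="$HF_TOKEN" ...` so the token never appears in your shell history.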

AMD64-Specific Optimizations

Compilation Optimizations

The AMD64 build includes the following Intel optimizations:

  • AVX/AVX2/AVX-512: Advanced Vector Extensions for Intel processors
  • OpenBLAS: Optimized BLAS library for x86_64
  • Intel MKL: Math Kernel Library (if available)
  • Native compilation: -march=native -mtune=native
  • Memory alignment: Optimized for Intel cache lines

Environment Variables

| Variable | Description | Required | Example |
|----------|-------------|----------|---------|
| HF_MODEL_ID | HuggingFace model repository | Yes | arcee-ai/AFM-4.5B |
| HF_TOKEN | HuggingFace token for private/gated models | For gated models | hf_xxxxxxxxxxxx |
| QUANTIZATION | Quantization level | No (default: F16) | Q4_K_M, Q8_0 |
| LLAMA_CPP_ARGS | Additional llama-server arguments | No | "--ctx-size 4096" |
| MODEL_FILENAME | Specific GGUF file (for pre-quantized models) | For GGUF models | model.q4_k_m.gguf |

Volume Mounts Explained

HuggingFace Cache Mount

-v ~/.cache/huggingface:/root/.cache/huggingface
  • Purpose: Reuses downloaded models from your local HuggingFace cache
  • Benefit: Avoids re-downloading models you already have
  • Contains: Model files, tokens, and HuggingFace metadata

Models Directory Mount

-v $(pwd)/local_models:/opt/models
  • Purpose: Persists converted and quantized models locally
  • Benefit: Subsequent runs skip conversion/quantization (much faster!)
  • Contains: Original model files, GGUF conversions, and quantized models

Testing the APIs

Health Check

curl http://localhost:8080/ping
# Expected: OK
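Since the first run can spend several minutes downloading and quantizing, it helps to poll the health endpoint until the server is up before sending requests. A sketch; `wait_for_ready` is an illustrative helper:

```shell
# wait_for_ready: poll the health endpoint until it answers, or give up.
wait_for_ready() {
  url="${1:-http://localhost:8080/ping}"
  attempts="${2:-60}"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    if curl -fsS "$url" >/dev/null 2>&1; then
      echo "ready"
      return 0
    fi
    i=$((i + 1))
    sleep 5
  done
  echo "timed out waiting for $url" >&2
  return 1
}
```

Call `wait_for_ready` after `docker run`; it returns 0 as soon as /ping responds, or 1 after the attempt limit.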

OpenAI-Compatible Chat Completions

curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "system", "content": "You are a helpful AI assistant."},
      {"role": "user", "content": "Hello! How are you?"}
    ],
    "max_tokens": 100,
    "temperature": 0.7
  }'
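To pull just the generated text out of the JSON response, you can pipe it through a small helper. A sketch assuming the standard OpenAI response shape; `extract_reply` is an illustrative name:

```shell
# extract_reply: pull the assistant's text out of an OpenAI-style response body.
extract_reply() {
  python3 -c 'import json,sys; print(json.load(sys.stdin)["choices"][0]["message"]["content"])'
}

# Example with a canned response body:
echo '{"choices":[{"message":{"role":"assistant","content":"Hi!"}}]}' | extract_reply
# → Hi!
```

In practice, pipe the curl command above into it: `curl -s ... | extract_reply`.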

OpenAI-Compatible Completions

curl -X POST http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "The benefits of small language models include:",
    "max_tokens": 80,
    "temperature": 0.7
  }'

SageMaker-Style Invocations

curl -X POST http://localhost:8080/invocations \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "Write a haiku about Docker"}
    ],
    "max_tokens": 50
  }'

Streaming Responses

curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "Count from 1 to 10"}
    ],
    "max_tokens": 50,
    "stream": true
  }'
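With "stream": true, the server sends Server-Sent Events: each chunk arrives as a `data: {json}` line and the stream ends with `data: [DONE]`. A minimal sketch for extracting the JSON payloads; `parse_sse` is an illustrative helper:

```shell
# parse_sse: strip the `data: ` prefix from SSE lines and stop at [DONE].
parse_sse() {
  while IFS= read -r line; do
    case "$line" in
      "data: [DONE]") break ;;
      "data: "*) printf '%s\n' "${line#data: }" ;;
    esac
  done
}
```

Use it with `curl -sN ... | parse_sse` (`-N` disables curl's output buffering so chunks appear as they arrive).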

Model Configuration Examples

Using a Pre-quantized GGUF Model

docker run -d -p 8080:8080 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -v $(pwd)/local_models:/opt/models \
  -e HF_MODEL_ID="TheBloke/Llama-2-7B-Chat-GGUF" \
  -e MODEL_FILENAME="llama-2-7b-chat.q4_k_m.gguf" \
  --name llm-test \
  sagemaker-inference-container-cpu:amd64

Custom llama.cpp Arguments

docker run -d -p 8080:8080 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -v $(pwd)/local_models:/opt/models \
  -e HF_MODEL_ID="arcee-ai/arcee-lite" \
  -e QUANTIZATION="Q4_K_M" \
  -e LLAMA_CPP_ARGS="--ctx-size 4096 --threads 8" \
  --name llm-test \
  sagemaker-inference-container-cpu:amd64

Finding Your MODEL_FILENAME

After the first run, you can find the quantized model filename by listing the files:

ls -la ./local_models/current/*.gguf

The filename pattern depends on your quantization setting:

  • Q4_K_M → model-f16.Q4_K_M.gguf
  • Q8_0 → model-f16.Q8_0.gguf
  • F16 (no quantization) → model-f16.gguf

Use this filename in the MODEL_FILENAME environment variable for subsequent runs.
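The mapping above can be scripted so MODEL_FILENAME is derived from whatever QUANTIZATION you used on the first run (a sketch assuming the naming pattern shown above):

```shell
# Derive MODEL_FILENAME from the QUANTIZATION setting used on the first run.
QUANTIZATION="Q4_K_M"   # change to Q8_0, F16, etc.
if [ "$QUANTIZATION" = "F16" ]; then
  MODEL_FILENAME="model-f16.gguf"
else
  MODEL_FILENAME="model-f16.${QUANTIZATION}.gguf"
fi
echo "$MODEL_FILENAME"   # → model-f16.Q4_K_M.gguf
```

You can then pass it straight through: `docker run ... -e MODEL_FILENAME="$MODEL_FILENAME" ...`.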

File Structure After Running

local_models/
└── current/
    ├── config.json                      # Model configuration
    ├── model-00001-of-00002.safetensors # Original model (part 1)
    ├── model-00002-of-00002.safetensors # Original model (part 2)
    ├── model-f16.gguf                   # Converted F16 GGUF
    ├── model-f16.Q4_K_M.gguf           # Quantized model (used by server)
    ├── tokenizer.json                   # Tokenizer
    └── ... (other model files)

The quantized .gguf file is what the llama.cpp server actually uses for inference.

Troubleshooting

Common Issues

  1. Build failures: Ensure your Docker installation supports the x86_64 (AMD64) platform
  2. Performance issues: Check AVX support and thread count
  3. Model loading errors: Verify sufficient disk space and memory
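For item 2, you can check which AVX feature levels the host CPU actually exposes. A Linux-only sketch (reads /proc/cpuinfo; `check_avx` is an illustrative helper):

```shell
# check_avx: report which AVX feature levels the host CPU exposes (Linux).
check_avx() {
  for feat in avx avx2 avx512f; do
    if grep -qw "$feat" /proc/cpuinfo 2>/dev/null; then
      echo "$feat: yes"
    else
      echo "$feat: no"
    fi
  done
}

check_avx
```

If avx2 reports "no", expect noticeably slower inference; the build's AVX-512 path only helps on CPUs that report avx512f.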

Debug Commands

# Check architecture
uname -m

# Check CPU features
lscpu

# Check Docker platform
docker version

# Check container logs
docker-compose -f docker/amd64/docker-compose.yml logs afm-fast

# Check resource usage
docker stats

Next Steps