
AMD64/Intel Setup Guide

This guide covers setting up and running the AFM-4.5B model on AMD64/Intel architecture using Docker, without needing Amazon SageMaker or the SageMaker SDK.

Prerequisites

  • AMD64/Intel system (x86_64)
  • Docker with AMD64 support
  • HuggingFace account and token (for gated models)
  • Sufficient disk space (models can be several GB)

Quick Start

1. Build the Container

# Auto-detect your AMD64 system
source scripts/detect-architecture.sh

# Build specifically for AMD64
./scripts/build-amd64.sh

2. Run with a Public Model

For public models that don't require authentication:

docker run -d -p 8080:8080 \
  -e HF_MODEL_ID="arcee-ai/arcee-lite" \
  -e QUANTIZATION="Q4_K_M" \
  --name llm-test \
  sagemaker-inference-container-cpu:amd64

3. Run with a Private or Gated Model

For private or gated models that require authentication, with persistent storage so later runs can reuse the converted model:

First Run (Download + Convert + Quantize)

# Create local directory for models
mkdir -p ./local_models

# First run: Download, convert, and quantize the model
docker run -d -p 8080:8080 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -v $(pwd)/local_models:/opt/models \
  -e HF_MODEL_ID="arcee-ai/AFM-4.5B" \
  -e HF_TOKEN="your_hf_token_here" \
  -e QUANTIZATION="Q4_K_M" \
  --name llm-test \
  sagemaker-inference-container-cpu:amd64

Subsequent Runs (Use Existing Quantized Model)

# Subsequent runs: Use the existing quantized model (much faster!)
docker run -d -p 8080:8080 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -v $(pwd)/local_models:/opt/models \
  -e MODEL_FILENAME="model-f16.Q4_K_M.gguf" \
  --name llm-test \
  sagemaker-inference-container-cpu:amd64

4. Run with Docker Compose (Alternative)

# First run (download, convert, quantize)
docker-compose -f docker/amd64/docker-compose.yml --profile first-run up --build afm-first-run

# Subsequent runs (fast startup)
docker-compose -f docker/amd64/docker-compose.yml --profile fast up afm-fast

Important:

  • Replace your_hf_token_here with your actual HuggingFace token, either from the HuggingFace tokens page or from ~/.cache/huggingface/token (written by huggingface-cli login)
  • First run: Takes 5+ minutes (download, convert, quantize)
  • Subsequent runs: Takes ~30 seconds (loads existing model directly)
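Rather than pasting the token inline, you can read it from the local cache. A minimal sketch; `load_hf_token` is an illustrative helper, not part of the container:

```shell
# load_hf_token: read a cached HuggingFace token from a file into HF_TOKEN.
# Defaults to the path written by `huggingface-cli login`.
load_hf_token() {
  token_file="${1:-$HOME/.cache/huggingface/token}"
  if [ -f "$token_file" ]; then
    HF_TOKEN="$(cat "$token_file")"
    export HF_TOKEN
    return 0
  fi
  echo "no cached token at $token_file" >&2
  return 1
}
```

Then start the container with `load_hf_token && docker run ... -e HF_TOKEN="$HF_TOKEN" ...` so the token never appears in your shell history.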

AMD64-Specific Optimizations

Compilation Optimizations

The AMD64 build includes the following Intel optimizations:

  • AVX/AVX2/AVX-512: Advanced Vector Extensions for Intel processors
  • OpenBLAS: Optimized BLAS library for x86_64
  • Intel MKL: Math Kernel Library (if available)
  • Native compilation: -march=native -mtune=native
  • Memory alignment: Optimized for Intel cache lines

Environment Variables

| Variable | Description | Required | Example |
|----------|-------------|----------|---------|
| HF_MODEL_ID | HuggingFace model repository | Yes | arcee-ai/AFM-4.5B |
| HF_TOKEN | HuggingFace token for private/gated models | For gated models | hf_xxxxxxxxxxxx |
| QUANTIZATION | Quantization level | No (default: F16) | Q4_K_M, Q8_0 |
| LLAMA_CPP_ARGS | Additional llama-server arguments | No | "--ctx-size 4096" |
| MODEL_FILENAME | Specific GGUF file (for pre-quantized models) | For GGUF models | model.q4_k_m.gguf |

Volume Mounts Explained

HuggingFace Cache Mount

-v ~/.cache/huggingface:/root/.cache/huggingface
  • Purpose: Reuses downloaded models from your local HuggingFace cache
  • Benefit: Avoids re-downloading models you already have
  • Contains: Model files, tokens, and HuggingFace metadata

Models Directory Mount

-v $(pwd)/local_models:/opt/models
  • Purpose: Persists converted and quantized models locally
  • Benefit: Subsequent runs skip conversion/quantization (much faster!)
  • Contains: Original model files, GGUF conversions, and quantized models

Testing the APIs

Health Check

curl http://localhost:8080/ping
# Expected: OK
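Since the first run can spend several minutes downloading and quantizing, it helps to poll the health endpoint until the server is up before sending requests. A sketch; `wait_for_ready` is an illustrative helper:

```shell
# wait_for_ready: poll the health endpoint until it answers, or give up.
wait_for_ready() {
  url="${1:-http://localhost:8080/ping}"
  attempts="${2:-60}"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    if curl -fsS "$url" >/dev/null 2>&1; then
      echo "ready"
      return 0
    fi
    i=$((i + 1))
    sleep 5
  done
  echo "timed out waiting for $url" >&2
  return 1
}
```

Call `wait_for_ready` after `docker run`; it returns 0 as soon as /ping responds, or 1 after the attempt limit.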

OpenAI-Compatible Chat Completions

curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "system", "content": "You are a helpful AI assistant."},
      {"role": "user", "content": "Hello! How are you?"}
    ],
    "max_tokens": 100,
    "temperature": 0.7
  }'
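To pull just the generated text out of the JSON response, you can pipe it through a small helper. A sketch assuming the standard OpenAI response shape; `extract_reply` is an illustrative name:

```shell
# extract_reply: pull the assistant's text out of an OpenAI-style response body.
extract_reply() {
  python3 -c 'import json,sys; print(json.load(sys.stdin)["choices"][0]["message"]["content"])'
}

# Example with a canned response body:
echo '{"choices":[{"message":{"role":"assistant","content":"Hi!"}}]}' | extract_reply
# → Hi!
```

In practice, pipe the curl command above into it: `curl -s ... | extract_reply`.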

OpenAI-Compatible Completions

curl -X POST http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "The benefits of small language models include:",
    "max_tokens": 80,
    "temperature": 0.7
  }'

SageMaker-Style Invocations

curl -X POST http://localhost:8080/invocations \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "Write a haiku about Docker"}
    ],
    "max_tokens": 50
  }'

Streaming Responses

curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "Count from 1 to 10"}
    ],
    "max_tokens": 50,
    "stream": true
  }'
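With "stream": true, the server sends Server-Sent Events: each chunk arrives as a `data: {json}` line and the stream ends with `data: [DONE]`. A minimal sketch for extracting the JSON payloads; `parse_sse` is an illustrative helper:

```shell
# parse_sse: strip the `data: ` prefix from SSE lines and stop at [DONE].
parse_sse() {
  while IFS= read -r line; do
    case "$line" in
      "data: [DONE]") break ;;
      "data: "*) printf '%s\n' "${line#data: }" ;;
    esac
  done
}
```

Use it with `curl -sN ... | parse_sse` (`-N` disables curl's output buffering so chunks appear as they arrive).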

Model Configuration Examples

Using a Pre-quantized GGUF Model

docker run -d -p 8080:8080 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -v $(pwd)/local_models:/opt/models \
  -e HF_MODEL_ID="TheBloke/Llama-2-7B-Chat-GGUF" \
  -e MODEL_FILENAME="llama-2-7b-chat.q4_k_m.gguf" \
  --name llm-test \
  sagemaker-inference-container-cpu:amd64

Custom llama.cpp Arguments

docker run -d -p 8080:8080 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -v $(pwd)/local_models:/opt/models \
  -e HF_MODEL_ID="arcee-ai/arcee-lite" \
  -e QUANTIZATION="Q4_K_M" \
  -e LLAMA_CPP_ARGS="--ctx-size 4096 --threads 8" \
  --name llm-test \
  sagemaker-inference-container-cpu:amd64

Finding Your MODEL_FILENAME

After the first run, you can find the quantized model filename by listing the files:

ls -la ./local_models/current/*.gguf

The filename pattern depends on your quantization setting:

  • Q4_K_M → model-f16.Q4_K_M.gguf
  • Q8_0 → model-f16.Q8_0.gguf
  • F16 (no quantization) → model-f16.gguf

Use this filename in the MODEL_FILENAME environment variable for subsequent runs.
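The mapping above can be scripted so MODEL_FILENAME is derived from whatever QUANTIZATION you used on the first run (a sketch assuming the naming pattern shown above):

```shell
# Derive MODEL_FILENAME from the QUANTIZATION setting used on the first run.
QUANTIZATION="Q4_K_M"   # change to Q8_0, F16, etc.
if [ "$QUANTIZATION" = "F16" ]; then
  MODEL_FILENAME="model-f16.gguf"
else
  MODEL_FILENAME="model-f16.${QUANTIZATION}.gguf"
fi
echo "$MODEL_FILENAME"   # → model-f16.Q4_K_M.gguf
```

You can then pass it straight through: `docker run ... -e MODEL_FILENAME="$MODEL_FILENAME" ...`.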

File Structure After Running

local_models/
└── current/
    ├── config.json                      # Model configuration
    ├── model-00001-of-00002.safetensors # Original model (part 1)
    ├── model-00002-of-00002.safetensors # Original model (part 2)
    ├── model-f16.gguf                   # Converted F16 GGUF
    ├── model-f16.Q4_K_M.gguf           # Quantized model (used by server)
    ├── tokenizer.json                   # Tokenizer
    └── ... (other model files)

The quantized .gguf file is what the llama.cpp server actually uses for inference.

Troubleshooting

Common Issues

  1. Build failures: Ensure your Docker installation supports the x86_64 (AMD64) platform
  2. Performance issues: Check AVX support and thread count
  3. Model loading errors: Verify sufficient disk space and memory
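For item 2, you can check which AVX feature levels the host CPU actually exposes. A Linux-only sketch (reads /proc/cpuinfo; `check_avx` is an illustrative helper):

```shell
# check_avx: report which AVX feature levels the host CPU exposes (Linux).
check_avx() {
  for feat in avx avx2 avx512f; do
    if grep -qw "$feat" /proc/cpuinfo 2>/dev/null; then
      echo "$feat: yes"
    else
      echo "$feat: no"
    fi
  done
}

check_avx
```

If avx2 reports "no", expect noticeably slower inference; the build's AVX-512 path only helps on CPUs that report avx512f.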

Debug Commands

# Check architecture
uname -m

# Check CPU features
lscpu

# Check Docker platform
docker version

# Check container logs
docker-compose -f docker/amd64/docker-compose.yml logs afm-fast

# Check resource usage
docker stats

Next Steps