This guide covers setting up and running the AFM-4.5B model on ARM64 architecture (Apple Silicon, AWS Graviton, etc.) using Docker, without needing Amazon SageMaker or the SageMaker SDK.
- ARM64 system (Apple Silicon Mac, AWS Graviton, etc.)
- Docker with ARM64/multi-platform support
- HuggingFace account and token (for gated models)
- Sufficient disk space (models can be several GB)
```bash
# Auto-detect your ARM64 system
source scripts/detect-architecture.sh

# Build specifically for ARM64
./scripts/build-arm64.sh
```

For public models that don't require authentication:
```bash
docker run -d -p 8080:8080 \
  -e HF_MODEL_ID="arcee-ai/arcee-lite" \
  -e QUANTIZATION="Q4_K_M" \
  --name llm-test \
  sagemaker-inference-container-cpu:arm64
```

For private or gated models, and for persistent storage:
```bash
# Create local directory for models
mkdir -p ./local_models

# First run: download, convert, and quantize the model
docker run -d -p 8080:8080 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -v $(pwd)/local_models:/opt/models \
  -e HF_MODEL_ID="arcee-ai/AFM-4.5B" \
  -e HF_TOKEN="your_hf_token_here" \
  -e QUANTIZATION="Q4_K_M" \
  --name llm-test \
  sagemaker-inference-container-cpu:arm64

# Subsequent runs: use the existing quantized model (much faster!)
docker run -d -p 8080:8080 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -v $(pwd)/local_models:/opt/models \
  -e MODEL_FILENAME="model-f16.Q4_K_M.gguf" \
  --name llm-test \
  sagemaker-inference-container-cpu:arm64
```

Alternatively, with Docker Compose:

```bash
# First run (download, convert, quantize)
docker-compose -f docker/arm64/docker-compose.yml --profile first-run up --build afm-first-run

# Subsequent runs (fast startup)
docker-compose -f docker/arm64/docker-compose.yml --profile fast up afm-fast
```

Important:
- Replace `your_hf_token_here` with your actual HuggingFace token, from the HuggingFace tokens page or `~/.cache/huggingface/token`
- First run: takes 5+ minutes (download, convert, quantize)
- Subsequent runs: take ~30 seconds (the existing quantized model is loaded directly)
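Because the first run can take several minutes, it helps to wait for the health endpoint before sending requests. A minimal sketch; the `wait_ready` helper and its default timeout are our own additions, not something shipped with the container:

```shell
# wait_ready URL [TIMEOUT_SECONDS]: poll an endpoint until it answers.
# Hypothetical helper -- adjust the URL to your port mapping.
wait_ready() {
  local url="$1" timeout="${2:-600}" deadline
  deadline=$(( $(date +%s) + timeout ))
  until curl -sf --max-time 2 "$url" >/dev/null 2>&1; do
    if [ "$(date +%s)" -ge "$deadline" ]; then
      echo "server at $url not ready after ${timeout}s" >&2
      return 1
    fi
    sleep 5
  done
  echo "server at $url is ready"
}

# Usage after starting the container:
# wait_ready http://localhost:8080/ping 900
```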
The ARM64 build includes the following optimizations:
- ARM NEON: Vector instructions for ARM64
- OpenBLAS: Optimized BLAS library for ARM64
- Native compilation: `-march=native -mtune=native`
- Memory alignment: Optimized for ARM64 cache lines
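Before building, it's worth confirming the host actually reports ARM64 (macOS reports `arm64`, Linux reports `aarch64`); this quick sanity check is our own addition:

```shell
# Print the machine architecture and flag non-ARM64 hosts.
arch="$(uname -m)"
case "$arch" in
  arm64|aarch64) echo "ARM64 host: $arch" ;;
  *) echo "WARNING: not an ARM64 host ($arch); the arm64 image may run emulated" ;;
esac
```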
| Variable | Description | Required | Example |
|---|---|---|---|
| `HF_MODEL_ID` | HuggingFace model repository | Yes | `arcee-ai/AFM-4.5B` |
| `HF_TOKEN` | HuggingFace token for private/gated models | For gated models | `hf_xxxxxxxxxxxx` |
| `QUANTIZATION` | Quantization level | No (default: F16) | `Q4_K_M`, `Q8_0` |
| `LLAMA_CPP_ARGS` | Additional llama-server arguments | No | `"--ctx-size 4096"` |
| `MODEL_FILENAME` | Specific GGUF file (for pre-quantized models) | For GGUF models | `model.q4_k_m.gguf` |
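A small pre-flight check before `docker run` can catch a missing variable early. This `check_env` helper is illustrative, not part of the container:

```shell
# Fail fast if neither HF_MODEL_ID nor MODEL_FILENAME is set.
check_env() {
  if [ -z "${HF_MODEL_ID:-}" ] && [ -z "${MODEL_FILENAME:-}" ]; then
    echo "set HF_MODEL_ID (or MODEL_FILENAME for an existing GGUF)" >&2
    return 1
  fi
  echo "environment looks ok"
}

# Example:
# HF_MODEL_ID="arcee-ai/AFM-4.5B" check_env
```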
`-v ~/.cache/huggingface:/root/.cache/huggingface`
- Purpose: Reuses downloaded models from your local HuggingFace cache
- Benefit: Avoids re-downloading models you already have
- Contains: Model files, tokens, and HuggingFace metadata

`-v $(pwd)/local_models:/opt/models`
- Purpose: Persists converted and quantized models locally
- Benefit: Subsequent runs skip conversion/quantization (much faster!)
- Contains: Original model files, GGUF conversions, and quantized models
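Creating both host-side directories before the first run avoids a common bind-mount gotcha: on Linux, Docker creates missing mount sources owned by root.

```shell
# Prepare the host-side mount points used by the -v flags above.
mkdir -p ./local_models "${HOME}/.cache/huggingface"
ls -d ./local_models "${HOME}/.cache/huggingface"
```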
```bash
curl http://localhost:8080/ping
# Expected: OK
```

```bash
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "system", "content": "You are a helpful AI assistant."},
      {"role": "user", "content": "Hello! How are you?"}
    ],
    "max_tokens": 100,
    "temperature": 0.7
  }'
```

```bash
curl -X POST http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "The benefits of small language models include:",
    "max_tokens": 80,
    "temperature": 0.7
  }'
```

```bash
curl -X POST http://localhost:8080/invocations \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "Write a haiku about Docker"}
    ],
    "max_tokens": 50
  }'
```

```bash
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "Count from 1 to 10"}
    ],
    "max_tokens": 50,
    "stream": true
  }'
```

```bash
docker run -d -p 8080:8080 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -v $(pwd)/local_models:/opt/models \
  -e HF_MODEL_ID="TheBloke/Llama-2-7B-Chat-GGUF" \
  -e MODEL_FILENAME="llama-2-7b-chat.q4_k_m.gguf" \
  --name llm-test \
  sagemaker-inference-container-cpu:arm64
```

```bash
docker run -d -p 8080:8080 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -v $(pwd)/local_models:/opt/models \
  -e HF_MODEL_ID="arcee-ai/arcee-lite" \
  -e QUANTIZATION="Q4_K_M" \
  -e LLAMA_CPP_ARGS="--ctx-size 4096 --threads 8" \
  --name llm-test \
  sagemaker-inference-container-cpu:arm64
```

After the first run, you can find the quantized model filename by listing the files:
```bash
ls -la ./local_models/current/*.gguf
```

The filename pattern depends on your quantization setting:

- `Q4_K_M` → `model-f16.Q4_K_M.gguf`
- `Q8_0` → `model-f16.Q8_0.gguf`
- `F16` (no quantization) → `model-f16.gguf`

Use this filename in the `MODEL_FILENAME` environment variable for subsequent runs.
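The mapping above can be expressed as a small helper, assuming the converter's default `model-f16` stem (the `gguf_name` function is our own sketch):

```shell
# Derive the expected GGUF filename from a QUANTIZATION setting.
gguf_name() {
  local quant="${1:-F16}"
  if [ "$quant" = "F16" ]; then
    echo "model-f16.gguf"
  else
    echo "model-f16.${quant}.gguf"
  fi
}

gguf_name Q4_K_M   # -> model-f16.Q4_K_M.gguf
```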
```
local_models/
└── current/
    ├── config.json                        # Model configuration
    ├── model-00001-of-00002.safetensors   # Original model (part 1)
    ├── model-00002-of-00002.safetensors   # Original model (part 2)
    ├── model-f16.gguf                     # Converted F16 GGUF
    ├── model-f16.Q4_K_M.gguf              # Quantized model (used by server)
    ├── tokenizer.json                     # Tokenizer
    └── ... (other model files)
```
The quantized `.gguf` file is what the llama.cpp server actually uses for inference.
- Build failures: Ensure you have ARM64-compatible Docker
- Performance issues: Check thread count and memory allocation
- Model loading errors: Verify sufficient disk space and memory
```bash
# Check architecture
uname -m

# Check Docker platform
docker version

# Check container logs
docker-compose -f docker/arm64/docker-compose.yml logs afm-fast

# Check resource usage
docker stats
```
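When you're done testing, stop and remove the container; the converted models persist in `./local_models`, so the next start stays fast:

```shell
# Tear down the test container (errors are ignored if it isn't running).
docker stop llm-test 2>/dev/null || true
docker rm llm-test 2>/dev/null || true
echo "cleaned up"
```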