This guide covers setting up and running the AFM-4.5B model on ARM64 architecture (Apple Silicon, AWS Graviton, etc.) using Docker, without needing Amazon SageMaker or the SageMaker SDK.
- ARM64 system (Apple Silicon Mac, AWS Graviton, etc.)
- Docker with ARM64/multi-platform support
- HuggingFace account and token (for gated models)
- Sufficient disk space (models can be several GB)
```bash
# Auto-detect your ARM64 system
source scripts/detect-architecture.sh

# Build specifically for ARM64
./scripts/build-arm64.sh
```

For public models that don't require authentication:
```bash
docker run -d -p 8080:8080 \
  -e HF_MODEL_ID="arcee-ai/arcee-lite" \
  -e QUANTIZATION="Q4_K_M" \
  --name llm-test \
  sagemaker-inference-container-cpu:arm64
```

For private or gated models, and for persistent storage:
```bash
# Create local directory for models
mkdir -p ./local_models

# First run: download, convert, and quantize the model
docker run -d -p 8080:8080 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -v $(pwd)/local_models:/opt/models \
  -e HF_MODEL_ID="arcee-ai/AFM-4.5B" \
  -e HF_TOKEN="your_hf_token_here" \
  -e QUANTIZATION="Q4_K_M" \
  --name llm-test \
  sagemaker-inference-container-cpu:arm64

# Subsequent runs: use the existing quantized model (much faster!)
docker run -d -p 8080:8080 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -v $(pwd)/local_models:/opt/models \
  -e MODEL_FILENAME="model-f16.Q4_K_M.gguf" \
  --name llm-test \
  sagemaker-inference-container-cpu:arm64
```

Alternatively, with Docker Compose:

```bash
# First run (download, convert, quantize)
docker-compose -f docker/arm64/docker-compose.yml --profile first-run up --build afm-first-run

# Subsequent runs (fast startup)
docker-compose -f docker/arm64/docker-compose.yml --profile fast up afm-fast
```

Important:
- Replace `your_hf_token_here` with your actual HuggingFace token, from the HuggingFace tokens page or `~/.cache/huggingface/token`
- First run: takes 5+ minutes (download, convert, quantize)
- Subsequent runs: take ~30 seconds (the existing quantized model is loaded directly)
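Because the first run can take several minutes, it helps to wait for the health endpoint before sending requests. A minimal sketch; the `wait_ready` helper and its default timeout are our own additions, not something shipped with the container:

```shell
# wait_ready URL [TIMEOUT_SECONDS]: poll an endpoint until it answers.
# Hypothetical helper -- adjust the URL to your port mapping.
wait_ready() {
  local url="$1" timeout="${2:-600}" deadline
  deadline=$(( $(date +%s) + timeout ))
  until curl -sf --max-time 2 "$url" >/dev/null 2>&1; do
    if [ "$(date +%s)" -ge "$deadline" ]; then
      echo "server at $url not ready after ${timeout}s" >&2
      return 1
    fi
    sleep 5
  done
  echo "server at $url is ready"
}

# Usage after starting the container:
# wait_ready http://localhost:8080/ping 900
```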
The ARM64 build includes the following optimizations:
- ARM NEON: Vector instructions for ARM64
- OpenBLAS: Optimized BLAS library for ARM64
- Native compilation: `-march=native -mtune=native`
- Memory alignment: Optimized for ARM64 cache lines
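Before building, it's worth confirming the host actually reports ARM64 (macOS reports `arm64`, Linux reports `aarch64`); this quick sanity check is our own addition:

```shell
# Print the machine architecture and flag non-ARM64 hosts.
arch="$(uname -m)"
case "$arch" in
  arm64|aarch64) echo "ARM64 host: $arch" ;;
  *) echo "WARNING: not an ARM64 host ($arch); the arm64 image may run emulated" ;;
esac
```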
| Variable | Description | Required | Example |
|---|---|---|---|
| `HF_MODEL_ID` | HuggingFace model repository | Yes | `arcee-ai/AFM-4.5B` |
| `HF_TOKEN` | HuggingFace token for private/gated models | For gated models | `hf_xxxxxxxxxxxx` |
| `QUANTIZATION` | Quantization level | No (default: F16) | `Q4_K_M`, `Q8_0` |
| `LLAMA_CPP_ARGS` | Additional llama-server arguments | No | `"--ctx-size 4096"` |
| `MODEL_FILENAME` | Specific GGUF file (for pre-quantized models) | For GGUF models | `model.q4_k_m.gguf` |
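A small pre-flight check before `docker run` can catch a missing variable early. This `check_env` helper is illustrative, not part of the container:

```shell
# Fail fast if neither HF_MODEL_ID nor MODEL_FILENAME is set.
check_env() {
  if [ -z "${HF_MODEL_ID:-}" ] && [ -z "${MODEL_FILENAME:-}" ]; then
    echo "set HF_MODEL_ID (or MODEL_FILENAME for an existing GGUF)" >&2
    return 1
  fi
  echo "environment looks ok"
}

# Example:
# HF_MODEL_ID="arcee-ai/AFM-4.5B" check_env
```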
`-v ~/.cache/huggingface:/root/.cache/huggingface`
- Purpose: Reuses downloaded models from your local HuggingFace cache
- Benefit: Avoids re-downloading models you already have
- Contains: Model files, tokens, and HuggingFace metadata

`-v $(pwd)/local_models:/opt/models`
- Purpose: Persists converted and quantized models locally
- Benefit: Subsequent runs skip conversion/quantization (much faster!)
- Contains: Original model files, GGUF conversions, and quantized models
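Creating both host-side directories before the first run avoids a common bind-mount gotcha: on Linux, Docker creates missing mount sources owned by root.

```shell
# Prepare the host-side mount points used by the -v flags above.
mkdir -p ./local_models "${HOME}/.cache/huggingface"
ls -d ./local_models "${HOME}/.cache/huggingface"
```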
```bash
curl http://localhost:8080/ping
# Expected: OK
```

```bash
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "system", "content": "You are a helpful AI assistant."},
      {"role": "user", "content": "Hello! How are you?"}
    ],
    "max_tokens": 100,
    "temperature": 0.7
  }'
```

```bash
curl -X POST http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "The benefits of small language models include:",
    "max_tokens": 80,
    "temperature": 0.7
  }'
```

```bash
curl -X POST http://localhost:8080/invocations \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "Write a haiku about Docker"}
    ],
    "max_tokens": 50
  }'
```

```bash
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "Count from 1 to 10"}
    ],
    "max_tokens": 50,
    "stream": true
  }'
```

```bash
docker run -d -p 8080:8080 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -v $(pwd)/local_models:/opt/models \
  -e HF_MODEL_ID="TheBloke/Llama-2-7B-Chat-GGUF" \
  -e MODEL_FILENAME="llama-2-7b-chat.q4_k_m.gguf" \
  --name llm-test \
  sagemaker-inference-container-cpu:arm64
```

```bash
docker run -d -p 8080:8080 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -v $(pwd)/local_models:/opt/models \
  -e HF_MODEL_ID="arcee-ai/arcee-lite" \
  -e QUANTIZATION="Q4_K_M" \
  -e LLAMA_CPP_ARGS="--ctx-size 4096 --threads 8" \
  --name llm-test \
  sagemaker-inference-container-cpu:arm64
```

After the first run, you can find the quantized model filename by listing the files:
```bash
ls -la ./local_models/current/*.gguf
```

The filename pattern depends on your quantization setting:

- `Q4_K_M` → `model-f16.Q4_K_M.gguf`
- `Q8_0` → `model-f16.Q8_0.gguf`
- `F16` (no quantization) → `model-f16.gguf`

Use this filename in the `MODEL_FILENAME` environment variable for subsequent runs.
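The mapping above can be expressed as a small helper, assuming the converter's default `model-f16` stem (the `gguf_name` function is our own sketch):

```shell
# Derive the expected GGUF filename from a QUANTIZATION setting.
gguf_name() {
  local quant="${1:-F16}"
  if [ "$quant" = "F16" ]; then
    echo "model-f16.gguf"
  else
    echo "model-f16.${quant}.gguf"
  fi
}

gguf_name Q4_K_M   # -> model-f16.Q4_K_M.gguf
```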
```
local_models/
└── current/
    ├── config.json                        # Model configuration
    ├── model-00001-of-00002.safetensors   # Original model (part 1)
    ├── model-00002-of-00002.safetensors   # Original model (part 2)
    ├── model-f16.gguf                     # Converted F16 GGUF
    ├── model-f16.Q4_K_M.gguf              # Quantized model (used by server)
    ├── tokenizer.json                     # Tokenizer
    └── ... (other model files)
```
The quantized `.gguf` file is what the llama.cpp server actually uses for inference.
- Build failures: Ensure you have ARM64-compatible Docker
- Performance issues: Check thread count and memory allocation
- Model loading errors: Verify sufficient disk space and memory
```bash
# Check architecture
uname -m

# Check Docker platform
docker version

# Check container logs
docker-compose -f docker/arm64/docker-compose.yml logs afm-fast

# Check resource usage
docker stats
```
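When you're done testing, stop and remove the container; the converted models persist in `./local_models`, so the next start stays fast:

```shell
# Tear down the test container (errors are ignored if it isn't running).
docker stop llm-test 2>/dev/null || true
docker rm llm-test 2>/dev/null || true
echo "cleaned up"
```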