Skip to content

dagc-ai/ai-infra-learning

Repository files navigation

AI Infrastructure Learning Journey


Most people who work with AI understand it at the layer they live in. What almost nobody has is a coherent mental model of the whole stack. This repo is my attempt to build that from the ground up. Over six phases and roughly six months, I worked through the full AI compute stack hands-on: writing CUDA kernels, implementing Ring AllReduce from scratch, deploying vLLM on A100s, and benchmarking Groq's LPU against my own A100 baseline. Every concept has an exercise. Every exercise produces a real number. Every number is committed here with the hardware, the conditions, and where I hit failures, an honest account of what went wrong and why.

The throughline is the same insight wearing different clothes at every layer: the binding constraint in AI systems is almost always data movement, not computation. Flash Attention unlocked 100K-token context windows not through better hardware, but by restructuring attention to avoid materializing a giant intermediate matrix in HBM. AWQ INT4 with the wrong kernel is 5x slower than BF16. The roofline model from Phase 1, built from two numbers on a spec sheet, predicted every benchmark result across six phases before I ran a single line of code.

I'm not an AI engineer. I'm a sales professional with an EE background who wanted to have technically grounded conversations about AI infrastructure that most people in go-to-market roles can't have because they've never run the code. This is what I built to close that gap.


Progress

Phase Topic Status Key Result
0 Environment & Tooling ✅ Complete RTX 4090 verified, 99.7% peak FP32 utilization on first benchmark
1 GPU Architecture & Memory Hierarchy ✅ Complete Roofline model measured: 947 GB/s bandwidth, 158 TFLOPS compute, ridge at ~327 FLOP/byte
2 CUDA Kernel Programming ✅ Complete Tiled matmul 7.2 TFLOPS vs cuBLAS 54 TFLOPS; 8x HBM traffic reduction confirmed via ncu
3 Triton Kernel Programming ✅ Complete 92% peak bandwidth, 2x fusion win, Flash Attention 6.4x speedup at N=4096
4 Distributed Training Primitives ✅ Complete 228 GB/s measured vs 600 GB/s NVLink spec; Ring AllReduce from scratch; 38% of spec in virtualized env
5 Inference & Serving Infrastructure ✅ Complete AWQ Marlin 2.27x BF16; 8x concurrency = 8x throughput flat latency; A100 TTFT 18ms vs Groq API 200ms; Groq 9.4x faster on throughput
6 Alternative Hardware Architectures ✅ Complete Groq 672.9 tok/s (7.2x A100); TPU v5e 142.3 TFLOPS / XLA auto-fusion confirmed; deployment friction documented across all 4 architectures
7 Model Architecture & Full Stack View ✅ Complete 30M param GPT trained from scratch; scaling law verified empirically (1–2% fit error); warmup misconfiguration = silent 20% permanent loss penalty

Phase 0 — Environment & Tooling

Hardware: RTX 4090, 49GB VRAM, CUDA 13.1, PyTorch 2.10, vast.ai ($0.57/hr)

Key Results

Environment verified via four gates: nvidia-smi (driver 580.95.05), nvcc --version (CUDA 13.1), torch.cuda.is_available() → True, torch.__version__ → 2.10.0a.

Baseline matmul benchmark (4096 × 4096, FP32, 100-iteration average):

Metric Result
Elapsed per call 1.67ms
Achieved throughput 82.4 TFLOPS
RTX 4090 FP32 peak 82.6 TFLOPS
Peak utilization 99.7%

What This Means

A GPU is rented hardware you access over the internet and immediately put to work running math at extraordinary scale. The 82.4 TFLOPS result — 99.7% of theoretical peak — is achievable because torch.matmul calls NVIDIA's hand-tuned cuBLAS library, not naive code. That ceiling is the baseline everything else in this curriculum gets measured against. Every kernel written by hand starts below it; the work of optimization is understanding why and closing the gap.

Key Insight

torch.cuda.synchronize() is the first thing that separates someone who understands GPU execution from someone who doesn't. GPU operations are asynchronous — without an explicit sync barrier, you're timing kernel launches (microseconds), not kernel execution (milliseconds). Every benchmarking script in this repo includes it.


Phase 1 — GPU Architecture & Memory Hierarchy

Hardware: RTX 4090, 24GB GDDR6X, CUDA 13.0, vast.ai VM (~$0.40/hr)

Key Results

Memory-bound operation — element-wise multiply, 32M FP32 elements:

Metric Result
Achieved bandwidth 947.5 GB/s
Peak bandwidth 1,008 GB/s
Utilization 94.0%
DRAM throughput (ncu) 92%
SM compute throughput (ncu) 2.6%
Arithmetic intensity 0.083 FLOP/byte

Compute-bound operation — FP16 matmul, N=8192:

Metric Result
Achieved throughput 158.1 TFLOPS
Peak throughput 330 TFLOPS
Utilization 47.9%
DRAM throughput (ncu) 13%
SM compute throughput (ncu) 47%
Arithmetic intensity 2,731 FLOP/byte

Roofline regime transition (FP16 matmul):

Matrix size N Arithmetic intensity Regime
64 21.3 FLOP/byte Memory-bound
128 42.7 FLOP/byte Memory-bound
256 85.3 FLOP/byte Approaching ridge
8192 2,731 FLOP/byte Compute-bound

Ridge point on RTX 4090: ~327 FLOP/byte (1,008 GB/s bandwidth ÷ 330 TFLOPS peak × 1000)

What This Means

Every GPU operation is bottlenecked by one of two things: how fast it can move data from memory, or how fast it can do arithmetic. Which one limits you depends on the ratio of math to data movement — arithmetic intensity. We measured this directly: element-wise multiply spent 92% of its time moving data and 2.6% computing. Large matmul flipped that entirely. Below the ridge point (~N=256 for matmul), buying a faster GPU does nothing. You need faster memory or smarter data reuse.

Key Insight

Arithmetic intensity determines optimization strategy. Kernel fusion and quantization help memory-bound workloads. More TFLOPS only helps compute-bound ones. Knowing which regime your workload is in — a 60-second calculation from the spec sheet — should precede every infrastructure decision.


Phase 2 — CUDA Kernel Programming

Hardware: RTX 4090, 49,140MB VRAM, CUDA 12.1, vast.ai VM

Key Results

Exercise 2.1 — Vector Addition

  • 1M elements across 4,096 blocks × 256 threads
  • Global load bytes (ncu): 8.39MB — exactly 2 arrays × 1M × 4 bytes, zero waste
  • Arithmetic intensity: 0.125 FLOP/byte — most memory-bound operation measured across Phases 1–2
  • Correctness: h_c[0] = 3.0, h_c[1048575] = 3.0, no CUDA errors

Exercise 2.2 — Naive vs. Tiled Matmul vs. cuBLAS

Wall clock (20-iteration average, CUDA events):

Version N=1024 N=2048 N=4096 % of cuBLAS
Naive 0.43ms / 5.0 TFLOPS 3.37ms / 5.1 TFLOPS — (skipped) ~9%
Tiled (T=16) 0.33ms / 6.4 TFLOPS 2.61ms / 6.6 TFLOPS 19.17ms / 7.2 TFLOPS ~13%
cuBLAS 0.04ms / 48.7 TFLOPS 0.32ms / 54.5 TFLOPS 2.55ms / 53.9 TFLOPS 100%

ncu hardware counters (N=1024, single kernel instance):

Metric Naive Tiled (T=16)
Global load bytes (HBM) 4.29 GB 537 MB
Shared memory wavefronts 0 50,339,258
DRAM throughput % of peak 1.70% 2.20%
SM throughput % of peak 97.28% 94.25%

Tiling reduced HBM traffic 8x — exactly what the arithmetic predicts.

Naive matmul skipped at N=4096: all three matrices total 192MB, blowing past the 72MB L2 cache. Performance would have collapsed. At N=1024 (12MB total, fits in L2), naive measured 5.0 TFLOPS — L2 masking the expected collapse, which ncu confirmed via the 8x HBM traffic difference between naive and tiled.

Exercise 2.3 — Softmax with Warp Shuffle

  • 4,096 rows × 32 columns, one warp (32 threads) per row
  • Global load bytes (ncu): 1.57MB — input read exactly once
  • Correctness verified: all rows sum to 1.000000
  • Warp shuffle reduction: 5 steps of __shfl_down_sync, no shared memory, no barriers
  • ~1 cycle per step vs. ~5 cycles for shared memory equivalent

What This Means

A GPU is not a fast calculator — it is a memory system with compute attached. A straightforwardly written matmul used 0.17% of available compute because the same data was repeatedly fetched from slow off-chip memory by independent threads. Tiling fixed this by loading data into fast on-chip SRAM once and reusing it many times, cutting HBM traffic 8x. The remaining gap to cuBLAS is decades of additional engineering: three tiling levels instead of one, Tensor Core integration, and assembly-level inner loop optimization. The warp shuffle exercise showed that even cross-thread coordination (finding a row maximum) can execute entirely in registers in 5 instructions — the same primitive that makes Flash Attention's fused kernel feasible.

Key Insight

The L2 cache masking effect is a production trap. A kernel benchmarking acceptably at small problem sizes can collapse at production scale when the working set exceeds L2. Any kernel evaluation that doesn't profile at actual production problem sizes is incomplete. SM throughput is not the same as compute utilization — naive kernels hammering HBM with load instructions register high SM activity while compute units starve. Wall clock is the honest arbiter.


Phase 3 — Triton Kernel Programming

Hardware: RTX 4090, 24GB GDDR6X, RunPod persistent pod, CUDA 12.4, PyTorch 2.4.1, Triton 3.0.0

Key Results

Exercise 3.1 — Triton Vector Addition

  • Test size: 98,432 elements (non-power-of-2, stress-tests mask logic)
  • Correctness: max absolute difference vs PyTorch = 0.00e+00
  • At 16M elements (only size exceeding 72MB L2, measuring real GDDR6X traffic):
Implementation Bandwidth % of Peak
Triton 924.0 GB/s 91.7%
PyTorch 924.5 GB/s 91.8%
RTX 4090 peak 1,008 GB/s 100%

Statistically identical. Triton's compiler produces code on par with hand-tuned CUDA with no manual thread indexing.

Exercise 3.2 — Fused Softmax

Benchmark at 1024 rows × 2048 cols:

Implementation Bandwidth Latency Notes
Triton fused 823 GB/s 20.4 µs 1 HBM round trip
PyTorch unfused (5-op) 414 GB/s 40.4 µs 5 HBM round trips
torch.softmax (internally fused) 1,208 GB/s 13.9 µs L2 cache effects at this size
  • Correctness: max absolute difference vs PyTorch = 7.45e-09, allclose = True
  • Fusion win: 2x throughput vs explicit unfused implementation
  • Unfused bandwidth plateaus at ~411–414 GB/s across all sizes — signature of multiple HBM passes

Exercise 3.3 — Flash Attention Forward Pass

  • Correctness: max absolute difference vs naive = 1.14e-03 (float32 accumulation across tile rescaling — not a logic bug)
  • Benchmark (batch=2, heads=4, d_head=64, float32):
N Flash Naive Speedup Naive attn matrix
512 0.041ms 0.072ms 1.8x 8.4 MB
1024 0.044ms 0.176ms 4.0x 33.6 MB
2048 0.155ms 0.969ms 6.3x 134.2 MB
4096 0.590ms 3.750ms 6.4x 536.9 MB

Speedup compounds with N — consistent with O(N) vs O(N²) prediction. Every doubling of N: Flash cost ~doubles, naive cost ~quadruples.

What This Means

Triton proved three things experimentally: a Python-level GPU compiler can match hand-tuned CUDA performance (92% of peak bandwidth with no manual thread management); kernel fusion cuts memory traffic in half and directly doubles throughput; and Flash Attention's breakthrough wasn't a faster chip — it was a mathematical restructuring of attention to avoid writing a giant intermediate matrix to HBM at all. That restructuring is what made 100K-token context windows physically possible. The algorithm unlocked the hardware market for long-context inference — not the other way around.

Key Insight

Context length is not a free parameter. Without fused tiled attention, memory requirements scale as O(N²). With it, O(N). At N=4096, naive attention requires 537MB just for the attention matrix. At N=32K, that number exceeds the GPU entirely. Evaluating any inference framework requires confirming Flash Attention (or equivalent) is the default attention path, not an optional flag.


Notes


Hardware

Phase Hardware Provider Cost
0 RTX 4090, 49GB VRAM, CUDA 13.1 vast.ai $0.57/hr
1 RTX 4090, 24GB GDDR6X, CUDA 13.0 vast.ai VM ~$0.40/hr
2 RTX 4090, 49GB VRAM, CUDA 12.1 vast.ai VM
3 RTX 4090, 24GB GDDR6X, CUDA 12.4 RunPod persistent pod

All RTX 4090 specs: 16,384 CUDA cores, 1,008 GB/s GDDR6X bandwidth, 82.6 TFLOPS FP32, 330 TFLOPS FP16 tensor core peak.

Phase 4 — Distributed Training Primitives

Hardware: 4× A100 SXM 40GB, NVLink NV12 topology, CUDA 12.4.1, PyTorch 2.4.0, RunPod

Key Results

Exercise 4.1 — DDP Training

  • 4-process DDP training loop across 4× A100 SXM
  • Initial param sum: 20.335490 — identical across all 4 ranks (broadcast at init confirmed)
  • Final param sum: 20.307633 — identical across all 4 ranks after 20 steps on different random data (AllReduce confirmed working)
  • Step timing: 0.21s (rank 0) to 0.26s (rank 3) — variance demonstrates the synchronization barrier in practice: fastest rank waits for slowest before any rank proceeds

Exercise 4.2 — Ring AllReduce from Scratch

  • Implemented full Ring AllReduce using only dist.P2POp point-to-point primitives — no dist.all_reduce()
  • Result matched PyTorch dist.all_reduce() to 9.54e-07 max absolute difference across all 4 ranks
  • Hit deadlock on first implementation using naive isend + recv — all ranks blocked simultaneously because NCCL isend doesn't move data until a matching recv is posted on the remote end
  • Fixed with dist.batch_isend_irecv, which posts sends and receives atomically

Exercise 4.3 — NVLink Bandwidth Benchmark

Tensor size Time (ms) Measured BW % of 600 GB/s spec
1 MB 0.051 31.0 GB/s 5.2%
16 MB 0.202 124.3 GB/s 20.7%
64 MB 0.596 168.8 GB/s 28.1%
256 MB 2.092 192.5 GB/s 32.1%
512 MB 3.817 211.0 GB/s 35.2%
1 GB 7.244 222.3 GB/s 37.1%
2 GB 14.152 227.6 GB/s 37.9%

Peak measured: 228 GB/s — 38% of 600 GB/s A100 SXM NVLink spec. Curve still rising at 2GB. Gap vs. spec attributed to RunPod virtualization overhead, AllReduce synchronization cost, and NCCL algorithm overhead vs. raw point-to-point benchmark conditions. Bus bandwidth formula: 2 × (N-1)/N × tensor_size / time — matches NCCL benchmark methodology.

What This Means

When training across multiple GPUs, each GPU sees different data and computes its own gradient. Before any weight update can happen, all gradients must be averaged across every GPU so the model stays synchronized. That averaging — AllReduce — runs over the physical wire connecting GPUs on every single training step. We implemented that algorithm from scratch, proved it produces the correct result, then measured how fast the wire actually runs versus its rated spec. The gap between 228 GB/s measured and 600 GB/s rated is not a malfunction — it's what real-world overhead looks like. Understanding that gap is what separates people who read spec sheets from people who understand systems.

Key Insight

Interconnect topology is a first-class architectural decision, not a procurement detail. PCIe between GPUs delivers ~32 GB/s per link; NVLink delivers 600 GB/s. At real gradient tensor sizes, PCIe puts you in a regime where communication dominates compute — and that is not fixable in software. GPU utilization is also the wrong metric for distributed training: a cluster showing 80% utilization may be spending 40% of that time blocked at AllReduce barriers. The correct metric is Model FLOP Utilization (MFU). Most production training runs achieve 30–50% MFU. NVLink is NVIDIA's most durable competitive moat — more defensible than the GPU itself.

Phase 5 — Inference & Serving Infrastructure

Hardware: A100 SXM4 80GB (single and 2×), CUDA 12.4, PyTorch 2.4.0, vLLM 0.18.0, RunPod
Models: Mistral-7B-Instruct-v0.3, Llama-3.1-8B-Instruct, Llama-3.3-70B-Instruct
Note: Exercises run across three parts — Mistral 7B baseline, Llama 3.1 8B cross-validation, and a TTFT streaming benchmark designed for direct comparison with published Groq LPU numbers.

Key Results

Exercise 5.1 — vLLM Deployment & Continuous Batching

vLLM startup telemetry:

Metric Mistral 7B Llama 3.1 8B
Model weights 13.51 GiB 14.99 GiB
KV cache pool 57.41 GiB 55.20 GiB
KV cache capacity 470,288 tokens 452,192 tokens
Max concurrency (8K context) 57x 55x

The 1.5 GiB weight difference between models directly reduces KV cache pool — the tradeoff between model size and concurrent user capacity is exact and measurable.

Concurrent request scaling (continuous batching):

Concurrency Mistral 7B Llama 3.1 8B
1 94.7 tok/s / 2,110ms 91.2 tok/s / 2,192ms
2 193.0 tok/s / 2,066ms 186.5 tok/s / 2,138ms
4 382.7 tok/s / 2,083ms 367.6 tok/s / 2,169ms
8 752.3 tok/s / 2,110ms 731.1 tok/s / 2,179ms

8x requests = 8x throughput with flat latency on both models. Continuous batching behavior is architecture-agnostic — a property of vLLM's scheduler, not the model.

Exercise 5.2 — Quantization Benchmarks

Format Model size Throughput vs BF16 Kernel
Mistral 7B BF16 13.51 GiB 96.5 tok/s 1.0x FlashAttention v2
Mistral 7B AWQ naive 3.88 GiB 20.9 tok/s 0.2x naive AWQ
Mistral 7B AWQ Marlin 3.88 GiB 218.9 tok/s 2.27x fused Marlin
Llama 3.1 8B BF16 14.99 GiB 93.1 tok/s 1.0x FlashAttention v2
Llama 3.1 8B AWQ naive 5.37 GiB 20.7 tok/s 0.2x naive AWQ
Llama 3.1 8B AWQ Marlin 5.37 GiB 199.1 tok/s 2.14x fused Marlin

Key findings confirmed across both model families:

  • AWQ naive penalty is model-agnostic. Both models landed at ~20.8 tok/s. Dequantization overhead completely dominates — a kernel property, not a model property.
  • BF16 throughput tracks weight size directly. 3% throughput gap matches 11% weight size difference. Memory-bandwidth-bound result predicted by roofline model before running.
  • Marlin speedup is slightly lower for the larger model (2.14x vs 2.27x) because attention and unquantized operations become a larger fraction of total runtime as weight loading accelerates.

Error encountered and documented: initial AWQ run used --quantization awq instead of --quantization awq_marlin. vLLM warned at startup that Marlin was available but not selected. Result: 20.9 tok/s — 5x slower than BF16. Switching flags recovered full performance. One flag, 10x difference.

Exercise 5.3 — Speculative Decoding

Two runs: mismatched draft model, then correctly matched draft model.

Configuration Throughput vs BF16
Mistral 7B AWQ Marlin 218.9 tok/s 2.27x
+ TinyLlama 15M draft (mismatched family) 67.5 tok/s 0.7x
Llama 3.1 8B AWQ Marlin 199.1 tok/s 2.14x
+ Llama 3.2 1B draft (matched family) 106.4 tok/s 1.14x

Mismatched draft: same tokenizer vocabulary, different training family. High rejection rate — paid cost of both models, benefit of neither. 3x slower than target alone.

Matched draft: same Meta training family, same RLHF process. Draft acceptance rate: 59.5%. Mean accepted tokens per verify pass: 3.98. Result: latency improvement confirmed at low concurrency. Throughput measurement penalty vs Marlin alone is a vLLM 0.18.0 scheduling limitation — async scheduling is disabled when speculative decoding is active.

Shared tokenizer vocabulary is necessary but not sufficient for speculative decoding. Training dynamics must align. Llama 3.2 1B → Llama 3.1 8B is a legitimate production configuration. TinyLlama → Mistral is not.

Exercise 5.4 — TTFT Streaming Benchmark & Hardware Comparison Baseline

Methodology: fixed ~100-token prompt, 200-token output, temperature=0, streaming enabled, 1 warmup run excluded, 5 timed runs averaged. Matches Groq's published benchmark methodology for direct comparability.

A100 results:

Model Precision TTFT Throughput Total time Config
Llama 3.1 8B BF16 18.1ms 93.0 tok/s 2,157ms TP=1, 1× A100
Llama 3.3 70B BF16 58.0ms 21.2 tok/s 9,436ms TP=2, 2× A100

70B constraint: 134GB of weights split across 2× 80GB GPUs left only 3.93 GiB for KV cache (25,728 token capacity). Viable for single-request benchmarking, not production multi-user serving.

A100 vs Groq LPU (Artificial Analysis published benchmarks):

Model A100 tok/s Groq tok/s Groq advantage A100 TTFT Groq API TTFT TTFT winner
Llama 3.1 8B 93.0 877 9.4x 18.1ms ~200ms A100 (11x)
Llama 3.3 70B 21.2 276 13x 58.0ms ~450ms A100 (7.8x)

The TTFT reversal is explained by network round-trip. Groq API TTFT includes client-to-datacenter latency. Local vLLM has none. Groq's throughput advantage is real and architectural — covered in Phase 6. The TTFT advantage of local deployment shrinks with longer outputs as Groq's throughput lead compounds.

Warmup effect documented: first-request TTFT was 42ms for 8B vs 18ms steady state. CUDA graph initialization on the first request; subsequent requests reuse it. Warmup exclusion is required for accurate TTFT measurement.

What This Means

Inference is not a compute problem — it is a memory problem. The A100 has 312 TFLOPS sitting mostly idle during decode. What limits token generation speed is how fast model weights can be loaded from HBM on every single decode step. Every optimization in this phase — batching, quantization, speculative decoding — is a variation of the same principle from Phases 1–3: minimize expensive data movement, maximize useful work per memory access. The roofline model built in Phase 1 predicted every result here before a single benchmark was run.

The non-obvious finding: optimization method and optimization implementation are different things. AWQ INT4 with the wrong kernel was 5x slower than BF16. AWQ INT4 with the right kernel was 2.27x faster. Speculative decoding with the wrong draft family was 3x slower than no speculative decoding. Same technique, opposite outcomes.

Key Insight

The optimization sequence matters. Fix in this order before buying hardware:

  1. Continuous batching — 8x throughput at the same latency, free
  2. Quantization kernel — 10x difference between naive AWQ and Marlin, one flag
  3. Quantization level — 2.27x BF16 → Marlin INT4, 3.5x smaller model
  4. Speculative decoding — latency benefit at low batch sizes only, requires same training family as target (shared vocabulary is necessary, not sufficient)

The gap between a misconfigured and well-configured inference stack on identical hardware exceeded 10x in measured results. The 9–13x throughput gap between a correctly configured A100 and Groq's LPU is architectural — not a configuration problem. Understanding which gap is which is the foundation of Phase 6.

Phase 6 — Alternative Hardware Architectures

Hardware accessed: Tenstorrent Wormhole N300S (Koyeb), AMD MI300X (RunPod + AMD Developer Cloud), Groq LPU (GroqCloud API), Google TPU v5e (Google Colab)
Note: Direct measurement was not possible on Tenstorrent or AMD due to platform failures and driver mismatches. Those results use clearly labeled community benchmarks. Groq and TPU v5e produced real measurements. Deployment experience across all four is itself a primary finding.

Key Results

Tenstorrent Wormhole N300S — BF16 Matmul + Inference

Hardware BF16 Peak Memory BW Achieved Data Source
A100 SXM4 ~312 TFLOPS 2.0 TB/s HBM ~250 TFLOPS Real — Phase 2
RTX 4090 ~165 TFLOPS 1.01 TB/s GDDR6X ~130 TFLOPS Real — Phase 1
N300S Wormhole ~233 TFLOPS 576 GB/s GDDR6 ~140–175 TFLOPS Community benchmarks (corsix.org, FOSDEM 2025)

Inference throughput — single-user, 7–8B model class, BF16:

Hardware Model tok/s Data Source
A100 SXM4 Mistral 7B BF16 96.5 Real — Phase 5
N300S Wormhole Llama 3.1 8B BF16 ~24 Tenstorrent official (hand-optimized)

~4x throughput gap reflects the core constraint: Llama 3.1 8B (14GB) exceeds the N300S's 192MB SRAM and spills to GDDR6 at 576 GB/s — less than a third of A100 HBM bandwidth. TILE_LAYOUT mandatory at storage level adds real porting cost vs CUDA's row-major default. Koyeb deployment failures (instances 3f2265e1, 8ebd286b) prevented direct measurement.

AMD MI300X — Llama 70B Inference

Hardware Config tok/s Data Source
A100 SXM4 ×2 TP=2 required 21.2 (70B proxy) Real — Phase 5 Part 3
MI300X TP=1 native ~37 Chips & Cheese; AMD MLPerf v4.1

192GB HBM3 is the architectural differentiator — the entire Llama 70B model fits on one chip, eliminating the AllReduce overhead that forces A100 and H100 onto TP=2. ~40% lower latency vs H100 on 70B decode comes entirely from eliminating that inter-GPU communication step.

Realized throughput: independent analysis (arxiv:2510.27583) found MI300X achieves 37–66% of H100/H200 throughput despite higher theoretical specs — ROCm kernel optimization still lags CUDA. RunPod deployment failed across three container configurations due to ROCm 6.1 userspace / 6.10.5 kernel driver mismatch. AMD Developer Cloud was at capacity.

Groq LPU — TTFT and Throughput Benchmark (real measurements)

Methodology: ~100 token prompt, 200 token output, temperature=0, streaming, 1 warmup excluded, 5 run mean. Identical to Phase 5 Part 3. Network latency to Groq datacenter: ~13ms round-trip from Austin, TX.

Hardware Model tok/s TTFT Data Source
A100 SXM4 (local) Llama 3.1 8B 93.0 18.1ms Real — Phase 5 Part 3
A100 SXM4 (local) Llama 3.3 70B 21.2 58.0ms Real — Phase 5 Part 3
Cloud API median Llama 3.1 8B 154.8 930ms ArtificialAnalysis
Cloud API median Llama 3.3 70B 85.5 1,410ms ArtificialAnalysis
Groq LPU (API) Llama 3.1 8B 672.9 130.8ms Real — this session
Groq LPU (API) Llama 3.3 70B 263.1 135.4ms Real — this session
Groq LPU (spec dec) Llama 3 70B 1,665 Groq published

Groq vs cloud API median: 4.3x faster throughput on 8B, 3.1x on 70B, 7x lower TTFT on 8B. TTFT breakdown: ~6.5ms network, ~14ms LPU compute, ~110ms API infrastructure overhead. On-premise LPU compute TTFT (~14ms) actually beats A100 local (18ms) by 22% — the API overhead obscures this in cloud comparisons.

NVIDIA acquired Groq in December 2025. GroqCloud remains operational; long-term roadmap under NVIDIA ownership is uncertain.

Google TPU v5e — BF16 Matmul + Fused Attention (real measurements)

Hardware BF16 Peak Achieved Utilization Data Source
A100 SXM4 ~312 TFLOPS ~250 TFLOPS ~80% Real — Phase 2
RTX 4090 ~165 TFLOPS ~130 TFLOPS ~79% Real — Phase 1
TPU v5e-1 ~197 TFLOPS 142.3 TFLOPS 72.2% Real — this session

Fused attention (batch=4, heads=16, seq=2048, d_head=64, BF16): 2.60ms via jax.jit. XLA automatically fused matmul+softmax+matmul into a single TPU kernel with no programmer intervention — the same optimization Phase 3 required expert Triton kernel engineering to achieve.

Google Cloud provisioning failed across five zones (capacity errors, wrong accelerator type, quota limits). Accessed via Google Colab free tier — zero provisioning friction.

Phase 6 Meta-Finding: Developer Experience as NVIDIA's Deepest Moat

Every alternative hardware exercise encountered significant access friction. Every NVIDIA exercise in Phases 0–5 produced real measurements on the first attempt.

Architecture Access method Outcome
Tenstorrent N300S Koyeb cloud Platform deployment failures — container never initialized
AMD MI300X RunPod ROCm userspace/kernel driver mismatch across 3 container configs
AMD MI300X AMD Developer Cloud At capacity
Groq LPU GroqCloud API Accessible — API only, no raw hardware access
Google TPU v5e Google Cloud Console Capacity errors, wrong zone errors, quota limits across 5 zones
Google TPU v5e Google Colab Accessible — free tier, zero friction
NVIDIA A100/RTX 4090 RunPod Deployed and benchmarking in under 20 minutes, every time

This is not bad luck — it reflects 15 years of accumulated ecosystem investment. NVIDIA driver compatibility matrices are battle-tested across thousands of container configurations. Alternative hardware ecosystems are newer, thinner, and less validated in containerized cloud environments. Every hour debugging ROCm version mismatches or navigating TPU quota limits is engineering time that belongs in any TCO analysis alongside chip price and memory bandwidth.

What This Means

Every accelerator architecture is an answer to the same question: where is the binding constraint? Tenstorrent bets it's on-chip memory bandwidth for small models. AMD bets it's memory capacity for large ones. Groq bets it's latency unpredictability. TPU bets it's the mismatch between general-purpose hardware and transformer-specific operations. Each is correct for a specific workload shape — and each requires real engineering work to access and operate, in a way NVIDIA currently does not.

Key Insight

NVIDIA's moat is not just TFLOPS or memory bandwidth. It is the accumulated infrastructure of developer convenience built over 15 years: validated container images, community-tested compatibility matrices, thousands of Stack Overflow answers for every failure mode, and a path from zero to running code measured in minutes. Alternative hardware vendors that close this gap — not just on specs but on developer experience — will be the ones that successfully challenge NVIDIA's position. Until then, ubiquity is the moat.


Phase 7 — Model Architecture & The Full Stack View

Hardware: A100 SXM4 80GB, CUDA 12.4, PyTorch 2.4.0, RunPod
Model: 30M parameter GPT-style transformer, trained from scratch on TinyShakespeare

Key Results

Exercise 7.1 — Data Preparation

Metric Value
Raw characters 1,115,394
Total tokens (GPT-2 BPE) 338,025
Vocabulary size 50,257
Compression ratio 3.30 chars/token
Train / val split 304,222 / 33,803 tokens

BPE compression ratio directly determines attention cost — shorter sequences mean cheaper O(N²) attention computation. uint16 storage (not int32) halves memory bandwidth requirements while covering the full 50,257 vocabulary.

Exercise 7.2 — Model Implementation (nanoGPT)

Component Parameters
Token embedding (wte) 19.3M (64% of total)
Position embedding (wpe) 98K
6 transformer blocks 10.6M
Total 30.02M

Sanity check: initial loss 10.926 vs expected ln(50257) = 10.825 — gap of 0.10, within normal range for random initialization. Logits shape (4, 64, 50257) correct.

The embedding table dominates at small n_embd — 64% of a "30M parameter model" is a lookup table, not reasoning capacity. Weight tying (sharing embedding and output head weights) reduces parameters by 19.3M and empirically improves language modeling performance.

Exercise 7.3 — Overfitting: Capacity vs. Data Mismatch

30M parameter model on 304K training tokens (0.01 tokens/param):

Iteration Train loss Val loss
0 10.91 10.91
500 2.49 5.19
1,000 0.46 6.94
2,000 0.14 8.70
5,000 0.07 9.58

Generated output at iter 5,000: verbatim Romeo and Juliet Act IV. Complete memorization.

7.25M model (same dataset, 0.042 tokens/param): val loss floors at 5.70, generates novel Shakespeare-style text not present in training data.

Throughput delta: 7.25M at 389K tok/s vs 30M at 203K tok/s — nearly 2x, driven by smaller activation tensors fitting better in L2 cache. The memory hierarchy from Phase 1 reappears at the model architecture level.

Exercise 7.4 — Scaling Law Experiment

Four models trained at increasing scale, fixed dataset:

Model Non-embed params Best val loss Best at iter Chinchilla starvation
nano 98,944 5.3410 500 6.5x underfed
micro 333,120 4.9774 750 21.9x underfed
mini 788,736 4.7920 750 51.8x underfed
small 2,659,200 4.8181 500 174.8x underfed

Power law fit (nano, micro, mini): L(N) = 7.5929 × N^(-0.0321), fit error 1–2.4%. The small model broke the law — best val loss worse than mini despite 3.4x more parameters. Cause: 174.8x overtrained relative to Chinchilla-optimal token count. The scaling law has a hidden precondition: each model must be trained to its optimal point. Violate that and the prediction fails.

Scaling exponent α = 0.0321 vs Kaplan et al. α = 0.076 — direction correct, rate slower. Every model was data-starved, compressing the loss range and flattening the curve. The exponent is dataset-dependent, not universal.

Exercise 7.5b — Engineered Failure Modes

Deliberately removed residuals, LayerNorm, warmup, and gradient clipping to produce and characterize each failure mode.

Gradient death (residuals + LayerNorm both removed):

Config Residuals LayerNorm Result
Broken OFF OFF Loss frozen at 10.81, grad_norm = 0.0
Residuals only ON OFF Learns but noisy, grad_norm spikes to 8.0
LayerNorm only OFF ON Learns smoothly, floors at 6.3
Control ON ON Fastest descent, lowest loss 5.50

BF16 underflow mechanism: sequential attenuation through 12 unnormalized layers pushed activations below BF16 minimum representable value (~1e-4), producing exactly zero gradients. Complete freeze from iteration 0 — not an explosion, a silent death.

Warmup sensitivity (batch=128):

Config Final loss vs control
No warmup 4.2703 +20% worse
Short warmup (50 iters) 3.5801 +4.5% worse
Full warmup (200 iters) 3.4243 baseline

No-warmup damage is permanent and silent — loss goes down, training appears healthy, the 20% gap is only visible against a control run.

High LR (100x, AdamW): did not NaN. AdamW's adaptive per-parameter learning rates absorbed most of the raw LR magnitude increase. Effective step size scales sub-linearly — a 100x LR increase produced ~3-5x larger effective updates, not 100x. Damage occurred in the first 50 iterations and persisted to final loss despite cosine decay bringing LR near baseline by iter 275.

What This Means

Building a transformer from scratch closes the loop on everything in Phases 1–6. The KV cache in Phase 5 exists because token generation is a sequential loop — one forward pass per output token, 500 passes for a 500-token response. Flash Attention in Phase 3 exists because attention cost scales as O(N²) with sequence length, and the N here is the token count computed in Exercise 7.1. The memory hierarchy principle from Phase 1 reappeared in the throughput delta between the 30M and 7.25M models. The same roofline model that predicted Phase 5 quantization results predicted training throughput here.

The failure mode experiments revealed something counterintuitive: modern training infrastructure is collectively so robust that engineering visible failures at small scale required removing multiple components simultaneously. Each technique — AdamW, residuals, LayerNorm, gradient clipping — was added in response to a real failure at scale. The failures that required deliberate engineering to produce at 12M parameters on 300K tokens are routine operational concerns at 70B parameters on trillion-token datasets. Scale is the multiplier that makes each fragility relevant.

Key Insight

The most dangerous training failures are the silent ones. Gradient explosion produces NaN — immediately visible. Warmup misconfiguration produces a model that trains normally, converges cleanly, and is permanently 20% worse than it should be. Without a control run, you would never know. The scaling law has the same property: it predicts correctly when models are trained to their optimal point, and breaks silently when they are not. Log gradient norms alongside loss on every run. Track best val loss and checkpoint accordingly. Run a control. The control is not overhead — it is the only way to detect silent degradation.

About

Hands-on AI infrastructure from the ground up: GPU memory hierarchy, CUDA kernel optimization, Triton, distributed training, and inference serving. Real benchmarks across the full compute stack, from naive kernels to Groq LPUs, Tenstorrent, AMD MI300X, and Google TPU

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors