Most people who work with AI understand it at the layer they live in. What almost nobody has is a coherent mental model of the whole stack. This repo is my attempt to build that from the ground up. Over eight phases and roughly six months, I worked through the full AI compute stack hands-on: writing CUDA kernels, implementing Ring AllReduce from scratch, deploying vLLM on A100s, and benchmarking Groq's LPU against my own A100 baseline. Every concept has an exercise. Every exercise produces a real number. Every number is committed here with the hardware and conditions; where I hit failures, there is an honest account of what went wrong and why.
The throughline is the same insight wearing different clothes at every layer: the binding constraint in AI systems is almost always data movement, not computation. Flash Attention unlocked 100K-token context windows not through better hardware, but by restructuring attention to avoid materializing a giant intermediate matrix in HBM. AWQ INT4 with the wrong kernel is 5x slower than BF16. The roofline model from Phase 1, built from two numbers on a spec sheet, predicted every subsequent benchmark result before I ran a single line of code.
I'm not an AI engineer. I'm a sales professional with an EE background who wanted to have technically grounded conversations about AI infrastructure that most people in go-to-market roles can't have because they've never run the code. This is what I built to close that gap.
| Phase | Topic | Status | Key Result |
|---|---|---|---|
| 0 | Environment & Tooling | ✅ Complete | RTX 4090 verified, 99.7% peak FP32 utilization on first benchmark |
| 1 | GPU Architecture & Memory Hierarchy | ✅ Complete | Roofline model measured: 947 GB/s bandwidth, 158 TFLOPS compute, ridge at ~327 FLOP/byte |
| 2 | CUDA Kernel Programming | ✅ Complete | Tiled matmul 7.2 TFLOPS vs cuBLAS 54 TFLOPS; 8x HBM traffic reduction confirmed via ncu |
| 3 | Triton Kernel Programming | ✅ Complete | 92% peak bandwidth, 2x fusion win, Flash Attention 6.4x speedup at N=4096 |
| 4 | Distributed Training Primitives | ✅ Complete | 228 GB/s measured vs 600 GB/s NVLink spec; Ring AllReduce from scratch; 38% of spec in virtualized env |
| 5 | Inference & Serving Infrastructure | ✅ Complete | AWQ Marlin 2.27x BF16; 8x concurrency = 8x throughput flat latency; A100 TTFT 18ms vs Groq API 200ms; Groq 9.4x faster on throughput |
| 6 | Alternative Hardware Architectures | ✅ Complete | Groq 672.9 tok/s (7.2x A100); TPU v5e 142.3 TFLOPS / XLA auto-fusion confirmed; deployment friction documented across all 4 architectures |
| 7 | Model Architecture & Full Stack View | ✅ Complete | 30M param GPT trained from scratch; scaling law verified empirically (1–2% fit error); warmup misconfiguration = silent 20% permanent loss penalty |
Hardware: RTX 4090, 49GB VRAM, CUDA 13.1, PyTorch 2.10, vast.ai ($0.57/hr)
Environment verified via four gates: nvidia-smi (driver 580.95.05), nvcc --version
(CUDA 13.1), torch.cuda.is_available() → True, torch.__version__ → 2.10.0a.
Baseline matmul benchmark (4096 × 4096, FP32, 100-iteration average):
| Metric | Result |
|---|---|
| Elapsed per call | 1.67ms |
| Achieved throughput | 82.4 TFLOPS |
| RTX 4090 FP32 peak | 82.6 TFLOPS |
| Peak utilization | 99.7% |
A cloud GPU is rented hardware you access over the internet and immediately put to work
running math at extraordinary scale. The 82.4 TFLOPS result — 99.7% of theoretical
peak — is achievable because torch.matmul calls NVIDIA's hand-tuned cuBLAS library,
not naive code. That ceiling is the baseline everything else in this curriculum gets
measured against. Every kernel written by hand starts below it; the work of optimization
is understanding why and closing the gap.
torch.cuda.synchronize() is the first thing that separates someone who understands
GPU execution from someone who doesn't. GPU operations are asynchronous — without an
explicit sync barrier, you're timing kernel launches (microseconds), not kernel
execution (milliseconds). Every benchmarking script in this repo includes it.
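The pattern those scripts follow can be sketched as a small harness. This is illustrative, not the repo's actual script; it falls back to a no-op sync on machines without CUDA so the shape of the logic is still visible:

```python
import time

try:
    import torch
    # Only meaningful on a CUDA machine; harmless no-op fallback otherwise.
    sync = torch.cuda.synchronize if torch.cuda.is_available() else (lambda: None)
except ImportError:
    sync = lambda: None

def bench(fn, iters=100, warmup=10):
    # Warm up first (caches, clocks, lazy init), then sync BEFORE starting the
    # clock and AFTER the last call, so we time kernel execution, not launches.
    for _ in range(warmup):
        fn()
    sync()
    t0 = time.perf_counter()
    for _ in range(iters):
        fn()
    sync()
    return (time.perf_counter() - t0) / iters
```

Without the two `sync()` calls, the loop measures only the microsecond-scale launch queue, and a 1.67ms matmul can appear orders of magnitude faster than it is.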
Hardware: RTX 4090, 24GB GDDR6X, CUDA 13.0, vast.ai VM (~$0.40/hr)
Memory-bound operation — element-wise multiply, 32M FP32 elements:
| Metric | Result |
|---|---|
| Achieved bandwidth | 947.5 GB/s |
| Peak bandwidth | 1,008 GB/s |
| Utilization | 94.0% |
| DRAM throughput (ncu) | 92% |
| SM compute throughput (ncu) | 2.6% |
| Arithmetic intensity | 0.083 FLOP/byte |
Compute-bound operation — FP16 matmul, N=8192:
| Metric | Result |
|---|---|
| Achieved throughput | 158.1 TFLOPS |
| Peak throughput | 330 TFLOPS |
| Utilization | 47.9% |
| DRAM throughput (ncu) | 13% |
| SM compute throughput (ncu) | 47% |
| Arithmetic intensity | 2,731 FLOP/byte |
Roofline regime transition (FP16 matmul):
| Matrix size N | Arithmetic intensity | Regime |
|---|---|---|
| 64 | 21.3 FLOP/byte | Memory-bound |
| 128 | 42.7 FLOP/byte | Memory-bound |
| 256 | 85.3 FLOP/byte | Approaching ridge |
| 8192 | 2,731 FLOP/byte | Compute-bound |
Ridge point on RTX 4090: ~327 FLOP/byte (330 TFLOPS FP16 peak ÷ 1,008 GB/s bandwidth)
Every GPU operation is bottlenecked by one of two things: how fast it can move data from memory, or how fast it can do arithmetic. Which one limits you depends on the ratio of math to data movement — arithmetic intensity. We measured this directly: element-wise multiply drove DRAM at 92% of peak while the SMs sat at 2.6%. Large matmul flipped that entirely. Below the ridge point (~N=256 for matmul), buying a faster GPU does nothing. You need faster memory or smarter data reuse.
Arithmetic intensity determines optimization strategy. Kernel fusion and quantization help memory-bound workloads. More TFLOPS only helps compute-bound ones. Knowing which regime your workload is in — a 60-second calculation from the spec sheet — should precede every infrastructure decision.
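That 60-second calculation can be written down directly. A sketch using this phase's RTX 4090 spec-sheet numbers; byte counts assume ideal traffic, one read per operand and one write per result:

```python
def regime(flops, bytes_moved, peak_tflops=330.0, peak_gbps=1008.0):
    # Arithmetic intensity vs. the ridge point decides the bottleneck.
    ai = flops / bytes_moved                          # FLOP/byte
    ridge = peak_tflops * 1e12 / (peak_gbps * 1e9)    # ~327 FLOP/byte on 4090
    return ai, ("compute-bound" if ai > ridge else "memory-bound")

# Element-wise multiply, 32M FP32 elements: 1 FLOP each, 12 bytes moved
# (read a, read b, write c at 4 bytes apiece). AI ~0.083 -> memory-bound.
n = 32 * 1024**2
print(regime(n, 12 * n))

# FP16 matmul, N=8192: 2*N^3 FLOPs, three N x N FP16 tensors of ideal traffic.
# AI = N/3 ~ 2,731 -> compute-bound, matching the table.
N = 8192
print(regime(2 * N**3, 3 * N * N * 2))
```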
Hardware: RTX 4090, 49,140MB VRAM, CUDA 12.1, vast.ai VM
Exercise 2.1 — Vector Addition
- 1M elements across 4,096 blocks × 256 threads
- Global load bytes (ncu): 8.39MB — exactly 2 arrays × 1M × 4 bytes, zero waste
- Arithmetic intensity: 0.125 FLOP/byte — most memory-bound operation measured across Phases 1–2
- Correctness: h_c[0] = 3.0, h_c[1048575] = 3.0, no CUDA errors
Exercise 2.2 — Naive vs. Tiled Matmul vs. cuBLAS
Wall clock (20-iteration average, CUDA events):
| Version | N=1024 | N=2048 | N=4096 | % of cuBLAS |
|---|---|---|---|---|
| Naive | 0.43ms / 5.0 TFLOPS | 3.37ms / 5.1 TFLOPS | — (skipped) | ~9% |
| Tiled (T=16) | 0.33ms / 6.4 TFLOPS | 2.61ms / 6.6 TFLOPS | 19.17ms / 7.2 TFLOPS | ~13% |
| cuBLAS | 0.04ms / 48.7 TFLOPS | 0.32ms / 54.5 TFLOPS | 2.55ms / 53.9 TFLOPS | 100% |
ncu hardware counters (N=1024, single kernel instance):
| Metric | Naive | Tiled (T=16) |
|---|---|---|
| Global load bytes (HBM) | 4.29 GB | 537 MB |
| Shared memory wavefronts | 0 | 50,339,258 |
| DRAM throughput % of peak | 1.70% | 2.20% |
| SM throughput % of peak | 97.28% | 94.25% |
Tiling reduced HBM traffic 8x — exactly what the arithmetic predicts.
Naive matmul skipped at N=4096: all three matrices total 192MB, blowing past the 72MB L2 cache. Performance would have collapsed. At N=1024 (12MB total, fits in L2), naive measured 5.0 TFLOPS — L2 masking the expected collapse, which ncu confirmed via the 8x HBM traffic difference between naive and tiled.
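The 8x arithmetic is worth writing out. Ideal (cache-free) global read traffic for an N×N matmul is 2N reads per output element naively, and 2N/T with T×T tiles. A small sketch: the tiled prediction lands exactly on the ncu measurement, while the naive worst case (~8.6 GB) is halved to the measured 4.29 GB by L2 and coalescing:

```python
def matmul_read_bytes(n, tile=1, dtype_bytes=4):
    # Ideal global-memory read traffic for C = A @ B (n x n, FP32).
    # tile=1 models the naive kernel: every thread independently re-reads a
    # full row of A and a full column of B. A T x T shared-memory tile lets
    # T threads share each load, dividing read traffic by T.
    return n * n * (2 * n // tile) * dtype_bytes

print(matmul_read_bytes(1024))            # 8,589,934,592 (~8.6 GB worst case)
print(matmul_read_bytes(1024, tile=16))   # 536,870,912 (537 MB, matches ncu)
```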
Exercise 2.3 — Softmax with Warp Shuffle
- 4,096 rows × 32 columns, one warp (32 threads) per row
- Global load bytes (ncu): 1.57MB — input read exactly once
- Correctness verified: all rows sum to 1.000000
- Warp shuffle reduction: 5 steps of `__shfl_down_sync`, no shared memory, no barriers; ~1 cycle per step vs. ~5 cycles for the shared-memory equivalent
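The reduction's data flow can be sanity-checked off-GPU. A minimal Python simulation of the 5-step `__shfl_down_sync` max-reduction, modeling the lane semantics (lane i reads lane i+offset; out-of-range lanes keep their own value) rather than real CUDA:

```python
def warp_reduce_max(lanes):
    # One 32-thread warp; lanes[i] is the register value held by lane i.
    # Each step halves the active distance, register-to-register, with no
    # shared memory and no __syncthreads().
    assert len(lanes) == 32
    vals = list(lanes)
    for offset in (16, 8, 4, 2, 1):          # 5 steps = log2(32)
        vals = [max(v, vals[i + offset]) if i + offset < 32 else v
                for i, v in enumerate(vals)]
    return vals[0]                            # lane 0 ends with the row max
```

Building a fresh list each step models the fact that all lanes exchange values simultaneously within one instruction.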
A GPU is not a fast calculator — it is a memory system with compute attached. A straightforwardly written matmul reached roughly 9% of cuBLAS throughput because the same data was repeatedly fetched from slow off-chip memory by independent threads. Tiling fixed this by loading data into fast on-chip SRAM once and reusing it many times, cutting HBM traffic 8x. The remaining gap to cuBLAS is decades of additional engineering: three tiling levels instead of one, Tensor Core integration, and assembly-level inner loop optimization. The warp shuffle exercise showed that even cross-thread coordination (finding a row maximum) can execute entirely in registers in 5 instructions — the same primitive that makes Flash Attention's fused kernel feasible.
The L2 cache masking effect is a production trap. A kernel benchmarking acceptably at small problem sizes can collapse at production scale when the working set exceeds L2. Any kernel evaluation that doesn't profile at actual production problem sizes is incomplete. SM throughput is not the same as compute utilization — naive kernels hammering HBM with load instructions register high SM activity while compute units starve. Wall clock is the honest arbiter.
Hardware: RTX 4090, 24GB GDDR6X, RunPod persistent pod, CUDA 12.4, PyTorch 2.4.1, Triton 3.0.0
Exercise 3.1 — Triton Vector Addition
- Test size: 98,432 elements (non-power-of-2, stress-tests mask logic)
- Correctness: max absolute difference vs PyTorch = 0.00e+00
- At 16M elements (only size exceeding 72MB L2, measuring real GDDR6X traffic):
| Implementation | Bandwidth | % of Peak |
|---|---|---|
| Triton | 924.0 GB/s | 91.7% |
| PyTorch | 924.5 GB/s | 91.8% |
| RTX 4090 peak | 1,008 GB/s | 100% |
Statistically identical. Triton's compiler produces code on par with hand-tuned CUDA with no manual thread indexing.
Exercise 3.2 — Fused Softmax
Benchmark at 1024 rows × 2048 cols:
| Implementation | Bandwidth | Latency | Notes |
|---|---|---|---|
| Triton fused | 823 GB/s | 20.4 µs | 1 HBM round trip |
| PyTorch unfused (5-op) | 414 GB/s | 40.4 µs | 5 HBM round trips |
| torch.softmax (internally fused) | 1,208 GB/s | 13.9 µs | L2 cache effects at this size |
- Correctness: max absolute difference vs PyTorch = 7.45e-09, allclose = True
- Fusion win: 2x throughput vs explicit unfused implementation
- Unfused bandwidth plateaus at ~411–414 GB/s across all sizes — signature of multiple HBM passes
Exercise 3.3 — Flash Attention Forward Pass
- Correctness: max absolute difference vs naive = 1.14e-03 (float32 accumulation across tile rescaling — not a logic bug)
- Benchmark (batch=2, heads=4, d_head=64, float32):
| N | Flash | Naive | Speedup | Naive attn matrix |
|---|---|---|---|---|
| 512 | 0.041ms | 0.072ms | 1.8x | 8.4 MB |
| 1024 | 0.044ms | 0.176ms | 4.0x | 33.6 MB |
| 2048 | 0.155ms | 0.969ms | 6.3x | 134.2 MB |
| 4096 | 0.590ms | 3.750ms | 6.4x | 536.9 MB |
Speedup compounds with N, consistent with the O(N) vs O(N²) HBM-traffic prediction: every doubling of N quadruples the score matrix that naive attention must write to and re-read from HBM, while Flash never materializes it at all.
Triton proved three things experimentally: a Python-level GPU compiler can match hand-tuned CUDA performance (92% of peak bandwidth with no manual thread management); kernel fusion cuts memory traffic in half and directly doubles throughput; and Flash Attention's breakthrough wasn't a faster chip — it was a mathematical restructuring of attention to avoid writing a giant intermediate matrix to HBM at all. That restructuring is what made 100K-token context windows physically possible. The algorithm unlocked the hardware market for long-context inference — not the other way around.
Context length is not a free parameter. Without fused tiled attention, memory requirements scale as O(N²). With it, O(N). At N=4096, naive attention requires 537MB just for the attention matrix. At N=32K, that number exceeds the GPU entirely. Evaluating any inference framework requires confirming Flash Attention (or equivalent) is the default attention path, not an optional flag.
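The memory claim is pure arithmetic, using the batch and head counts from the benchmark above and FP32 scores:

```python
def naive_attn_scores_bytes(batch, heads, seq, dtype_bytes=4):
    # The batch x heads x N x N score matrix that naive attention writes to
    # HBM and Flash Attention never materializes.
    return batch * heads * seq * seq * dtype_bytes

for n in (512, 1024, 2048, 4096, 32_768):
    gb = naive_attn_scores_bytes(2, 4, n) / 1e9
    print(f"N={n:>6}: {gb:8.3f} GB")   # N=4096 -> 0.537 GB; N=32K -> 34.4 GB
```

At N=32K the score matrix alone (34.4 GB) exceeds a 24 GB RTX 4090 before weights or activations are counted.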
| Phase | Hardware | Provider | Cost |
|---|---|---|---|
| 0 | RTX 4090, 49GB VRAM, CUDA 13.1 | vast.ai | $0.57/hr |
| 1 | RTX 4090, 24GB GDDR6X, CUDA 13.0 | vast.ai VM | ~$0.40/hr |
| 2 | RTX 4090, 49GB VRAM, CUDA 12.1 | vast.ai VM | — |
| 3 | RTX 4090, 24GB GDDR6X, CUDA 12.4 | RunPod persistent pod | — |
All RTX 4090 specs: 16,384 CUDA cores, 1,008 GB/s GDDR6X bandwidth, 82.6 TFLOPS FP32, 330 TFLOPS FP16 tensor core peak.
Hardware: 4× A100 SXM 40GB, NVLink NV12 topology, CUDA 12.4.1, PyTorch 2.4.0, RunPod
Exercise 4.1 — DDP Training
- 4-process DDP training loop across 4× A100 SXM
- Initial param sum: `20.335490` — identical across all 4 ranks (broadcast at init confirmed)
- Final param sum: `20.307633` — identical across all 4 ranks after 20 steps on different random data (AllReduce confirmed working)
- Step timing: 0.21s (rank 0) to 0.26s (rank 3) — variance demonstrates the synchronization barrier in practice: fastest rank waits for slowest before any rank proceeds
Exercise 4.2 — Ring AllReduce from Scratch
- Implemented full Ring AllReduce using only `dist.P2POp` point-to-point primitives — no `dist.all_reduce()`
- Result matched PyTorch `dist.all_reduce()` to `9.54e-07` max absolute difference across all 4 ranks
- Hit deadlock on first implementation using naive `isend` + `recv` — all ranks blocked simultaneously because NCCL `isend` doesn't move data until a matching `recv` is posted on the remote end
- Fixed with `dist.batch_isend_irecv`, which posts sends and receives atomically
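For readers without 4 GPUs at hand, the algorithm's two phases can be simulated in a single process. This mirrors the structure of the P2P implementation (reduce-scatter, then all-gather, over P simulated ranks) but is an illustrative sketch, not the repo's code:

```python
def ring_allreduce(chunks):
    # chunks[r][c]: rank r's local copy of chunk c (a list of floats).
    # Each rank sends to its right neighbor and receives from its left,
    # moving 2*(P-1)/P of the tensor in total across both phases.
    P = len(chunks)
    # Phase 1 -- reduce-scatter: after P-1 steps, rank r holds the fully
    # summed chunk (r+1) % P.
    for step in range(P - 1):
        outgoing = [list(chunks[r][(r - step) % P]) for r in range(P)]
        for r in range(P):
            c, left = (r - step - 1) % P, (r - 1) % P
            chunks[r][c] = [a + b for a, b in zip(chunks[r][c], outgoing[left])]
    # Phase 2 -- all-gather: circulate the reduced chunks until all ranks match.
    for step in range(P - 1):
        outgoing = [list(chunks[r][(r + 1 - step) % P]) for r in range(P)]
        for r in range(P):
            c, left = (r - step) % P, (r - 1) % P
            chunks[r][c] = list(outgoing[left])
    return chunks
```

Capturing `outgoing` before mutating models the simultaneous sends within a step — the same ordering problem the real implementation solves with `dist.batch_isend_irecv`.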
Exercise 4.3 — NVLink Bandwidth Benchmark
| Tensor size | Time (ms) | Measured BW | % of 600 GB/s spec |
|---|---|---|---|
| 1 MB | 0.051 | 31.0 GB/s | 5.2% |
| 16 MB | 0.202 | 124.3 GB/s | 20.7% |
| 64 MB | 0.596 | 168.8 GB/s | 28.1% |
| 256 MB | 2.092 | 192.5 GB/s | 32.1% |
| 512 MB | 3.817 | 211.0 GB/s | 35.2% |
| 1 GB | 7.244 | 222.3 GB/s | 37.1% |
| 2 GB | 14.152 | 227.6 GB/s | 37.9% |
Peak measured: 228 GB/s — 38% of 600 GB/s A100 SXM NVLink spec. Curve still rising at
2GB. Gap vs. spec attributed to RunPod virtualization overhead, AllReduce synchronization
cost, and NCCL algorithm overhead vs. raw point-to-point benchmark conditions. Bus
bandwidth formula: 2 × (N-1)/N × tensor_size / time — matches NCCL benchmark
methodology.
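Reproducing the table's bottom row from that formula (a sketch; sizes use GB = 1024³ to match the table):

```python
def allreduce_busbw(nbytes, seconds, world_size=4):
    # NCCL's bus-bandwidth convention: a ring AllReduce moves 2*(P-1)/P of
    # the tensor per rank, so that factor converts wall time into the
    # per-link bandwidth the hardware actually sustained.
    P = world_size
    return 2 * (P - 1) / P * nbytes / seconds

bw = allreduce_busbw(2 * 1024**3, 14.152e-3)
print(f"{bw / 1e9:.1f} GB/s")   # 227.6 GB/s, the 2 GB row above
```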
When training across multiple GPUs, each GPU sees different data and computes its own gradient. Before any weight update can happen, all gradients must be averaged across every GPU so the model stays synchronized. That averaging — AllReduce — runs over the physical wire connecting GPUs on every single training step. We implemented that algorithm from scratch, proved it produces the correct result, then measured how fast the wire actually runs versus its rated spec. The gap between 228 GB/s measured and 600 GB/s rated is not a malfunction — it's what real-world overhead looks like. Understanding that gap is what separates people who read spec sheets from people who understand systems.
Interconnect topology is a first-class architectural decision, not a procurement detail. PCIe between GPUs delivers ~32 GB/s per link; NVLink delivers 600 GB/s. At real gradient tensor sizes, PCIe puts you in a regime where communication dominates compute — and that is not fixable in software. GPU utilization is also the wrong metric for distributed training: a cluster showing 80% utilization may be spending 40% of that time blocked at AllReduce barriers. The correct metric is Model FLOP Utilization (MFU). Most production training runs achieve 30–50% MFU. NVLink is NVIDIA's most durable competitive moat — more defensible than the GPU itself.
Hardware: A100 SXM4 80GB (single and 2×), CUDA 12.4, PyTorch 2.4.0, vLLM 0.18.0, RunPod
Models: Mistral-7B-Instruct-v0.3, Llama-3.1-8B-Instruct, Llama-3.3-70B-Instruct
Note: Exercises run across three parts — Mistral 7B baseline, Llama 3.1 8B cross-validation, and a TTFT streaming benchmark designed for direct comparison with published Groq LPU numbers.
Exercise 5.1 — vLLM Deployment & Continuous Batching
vLLM startup telemetry:
| Metric | Mistral 7B | Llama 3.1 8B |
|---|---|---|
| Model weights | 13.51 GiB | 14.99 GiB |
| KV cache pool | 57.41 GiB | 55.20 GiB |
| KV cache capacity | 470,288 tokens | 452,192 tokens |
| Max concurrency (8K context) | 57x | 55x |
The 1.5 GiB weight difference between models directly reduces KV cache pool — the tradeoff between model size and concurrent user capacity is exact and measurable.
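The KV cache capacities in the table follow from each model's config. Both Mistral-7B-v0.3 and Llama-3.1-8B use 32 layers, grouped-query attention with 8 KV heads, and head_dim 128 per their published model cards — treat those constants as assumptions in this sketch — which puts the BF16 KV cost at 128 KiB per token:

```python
def kv_bytes_per_token(n_layers=32, n_kv_heads=8, head_dim=128, dtype_bytes=2):
    # One K vector and one V vector, per layer, per KV head, in BF16.
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes

per_tok = kv_bytes_per_token()
print(per_tok)                        # 131072 bytes = 128 KiB per token
print(int(57.41 * 2**30 / per_tok))   # ~470,300 tokens from a 57.41 GiB pool
```

Dividing the pool by the per-token cost reproduces the 470,288-token capacity above to within rounding of the reported pool size.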
Concurrent request scaling (continuous batching):
| Concurrency | Mistral 7B | Llama 3.1 8B |
|---|---|---|
| 1 | 94.7 tok/s / 2,110ms | 91.2 tok/s / 2,192ms |
| 2 | 193.0 tok/s / 2,066ms | 186.5 tok/s / 2,138ms |
| 4 | 382.7 tok/s / 2,083ms | 367.6 tok/s / 2,169ms |
| 8 | 752.3 tok/s / 2,110ms | 731.1 tok/s / 2,179ms |
8x requests = 8x throughput with flat latency on both models. Continuous batching behavior is architecture-agnostic — a property of vLLM's scheduler, not the model.
Exercise 5.2 — Quantization Benchmarks
| Format | Model size | Throughput | vs BF16 | Kernel |
|---|---|---|---|---|
| Mistral 7B BF16 | 13.51 GiB | 96.5 tok/s | 1.0x | FlashAttention v2 |
| Mistral 7B AWQ naive | 3.88 GiB | 20.9 tok/s | 0.2x | naive AWQ |
| Mistral 7B AWQ Marlin | 3.88 GiB | 218.9 tok/s | 2.27x | fused Marlin |
| Llama 3.1 8B BF16 | 14.99 GiB | 93.1 tok/s | 1.0x | FlashAttention v2 |
| Llama 3.1 8B AWQ naive | 5.37 GiB | 20.7 tok/s | 0.2x | naive AWQ |
| Llama 3.1 8B AWQ Marlin | 5.37 GiB | 199.1 tok/s | 2.14x | fused Marlin |
Key findings confirmed across both model families:
- AWQ naive penalty is model-agnostic. Both models landed at ~20.8 tok/s. Dequantization overhead completely dominates — a kernel property, not a model property.
- BF16 throughput tracks weight size in the predicted direction: Llama's 11% larger weights came with a 3% throughput deficit, the memory-bandwidth-bound behavior the roofline model predicted before running.
- Marlin speedup is slightly lower for the larger model (2.14x vs 2.27x) because attention and unquantized operations become a larger fraction of total runtime as weight loading accelerates.
Error encountered and documented: initial AWQ run used --quantization awq instead
of --quantization awq_marlin. vLLM warned at startup that Marlin was available but
not selected. Result: 20.9 tok/s — 5x slower than BF16. Switching flags recovered
full performance. One flag, 10x difference.
Exercise 5.3 — Speculative Decoding
Two runs: mismatched draft model, then correctly matched draft model.
| Configuration | Throughput | vs BF16 |
|---|---|---|
| Mistral 7B AWQ Marlin | 218.9 tok/s | 2.27x |
| + TinyLlama 15M draft (mismatched family) | 67.5 tok/s | 0.7x |
| Llama 3.1 8B AWQ Marlin | 199.1 tok/s | 2.14x |
| + Llama 3.2 1B draft (matched family) | 106.4 tok/s | 1.14x |
Mismatched draft: same tokenizer vocabulary, different training family. High rejection rate — paid cost of both models, benefit of neither. 3x slower than target alone.
Matched draft: same Meta training family, same RLHF process. Draft acceptance rate: 59.5%. Mean accepted tokens per verify pass: 3.98. Result: latency improvement confirmed at low concurrency. Throughput measurement penalty vs Marlin alone is a vLLM 0.18.0 scheduling limitation — async scheduling is disabled when speculative decoding is active.
Shared tokenizer vocabulary is necessary but not sufficient for speculative decoding. Training dynamics must align. Llama 3.2 1B → Llama 3.1 8B is a legitimate production configuration. TinyLlama → Mistral is not.
Exercise 5.4 — TTFT Streaming Benchmark & Hardware Comparison Baseline
Methodology: fixed ~100-token prompt, 200-token output, temperature=0, streaming enabled, 1 warmup run excluded, 5 timed runs averaged. Matches Groq's published benchmark methodology for direct comparability.
A100 results:
| Model | Precision | TTFT | Throughput | Total time | Config |
|---|---|---|---|---|---|
| Llama 3.1 8B | BF16 | 18.1ms | 93.0 tok/s | 2,157ms | TP=1, 1× A100 |
| Llama 3.3 70B | BF16 | 58.0ms | 21.2 tok/s | 9,436ms | TP=2, 2× A100 |
70B constraint: 134GB of weights split across 2× 80GB GPUs left only 3.93 GiB for KV cache (25,728 token capacity). Viable for single-request benchmarking, not production multi-user serving.
A100 vs Groq LPU (Artificial Analysis published benchmarks):
| Model | A100 tok/s | Groq tok/s | Groq advantage | A100 TTFT | Groq API TTFT | TTFT winner |
|---|---|---|---|---|---|---|
| Llama 3.1 8B | 93.0 | 877 | 9.4x | 18.1ms | ~200ms | A100 (11x) |
| Llama 3.3 70B | 21.2 | 276 | 13x | 58.0ms | ~450ms | A100 (7.8x) |
The TTFT reversal is explained by network round-trip. Groq API TTFT includes client-to-datacenter latency. Local vLLM has none. Groq's throughput advantage is real and architectural — covered in Phase 6. The TTFT advantage of local deployment shrinks with longer outputs as Groq's throughput lead compounds.
Warmup effect documented: first-request TTFT was 42ms for 8B vs 18ms steady state. CUDA graph initialization on the first request; subsequent requests reuse it. Warmup exclusion is required for accurate TTFT measurement.
Inference is not a compute problem — it is a memory problem. The A100 has 312 TFLOPS sitting mostly idle during decode. What limits token generation speed is how fast model weights can be loaded from HBM on every single decode step. Every optimization in this phase — batching, quantization, speculative decoding — is a variation of the same principle from Phases 1–3: minimize expensive data movement, maximize useful work per memory access. The roofline model built in Phase 1 predicted every result here before a single benchmark was run.
The non-obvious finding: optimization method and optimization implementation are different things. AWQ INT4 with the wrong kernel was 5x slower than BF16. AWQ INT4 with the right kernel was 2.27x faster. Speculative decoding with the wrong draft family was 3x slower than no speculative decoding. Same technique, opposite outcomes.
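The decode roofline behind that claim is one line of arithmetic: if every generated token must stream all weights from HBM, single-stream throughput is capped at bandwidth divided by weight bytes. A sketch assuming the A100 80GB's ~2,039 GB/s spec; KV-cache reads, activations, and kernel overhead push measured numbers below the cap:

```python
def decode_tok_per_s_ceiling(weight_gib, hbm_gb_per_s=2039.0):
    # Memory-bound decode ceiling: one full weight read per output token.
    return hbm_gb_per_s * 1e9 / (weight_gib * 2**30)

print(decode_tok_per_s_ceiling(13.51))   # ~141 tok/s; BF16 measured 96.5
print(decode_tok_per_s_ceiling(3.88))    # ~489 tok/s; AWQ Marlin measured 218.9
```

Shrinking the weights is the only way to raise this ceiling — which is why the quantization kernel, not extra TFLOPS, moved the number.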
The optimization sequence matters. Fix in this order before buying hardware:
- Continuous batching — 8x throughput at the same latency, free
- Quantization kernel — 10x difference between naive AWQ and Marlin, one flag
- Quantization level — 2.27x BF16 → Marlin INT4, 3.5x smaller model
- Speculative decoding — latency benefit at low batch sizes only, requires same training family as target (shared vocabulary is necessary, not sufficient)
The gap between a misconfigured and well-configured inference stack on identical hardware exceeded 10x in measured results. The 9–13x throughput gap between a correctly configured A100 and Groq's LPU is architectural — not a configuration problem. Understanding which gap is which is the foundation of Phase 6.
Hardware accessed: Tenstorrent Wormhole N300S (Koyeb), AMD MI300X (RunPod + AMD Developer Cloud),
Groq LPU (GroqCloud API), Google TPU v5e (Google Colab)
Note: Direct measurement was not possible on Tenstorrent or AMD due to platform failures and driver
mismatches. Those results use clearly labeled community benchmarks. Groq and TPU v5e produced real
measurements. Deployment experience across all four is itself a primary finding.
Tenstorrent Wormhole N300S — BF16 Matmul + Inference
| Hardware | BF16 Peak | Memory BW | Achieved | Data Source |
|---|---|---|---|---|
| A100 SXM4 | ~312 TFLOPS | 2.0 TB/s HBM | ~250 TFLOPS | Real — Phase 2 |
| RTX 4090 | ~165 TFLOPS | 1.01 TB/s GDDR6X | ~130 TFLOPS | Real — Phase 1 |
| N300S Wormhole | ~233 TFLOPS | 576 GB/s GDDR6 | ~140–175 TFLOPS | Community benchmarks (corsix.org, FOSDEM 2025) |
Inference throughput — single-user, 7–8B model class, BF16:
| Hardware | Model | tok/s | Data Source |
|---|---|---|---|
| A100 SXM4 | Mistral 7B BF16 | 96.5 | Real — Phase 5 |
| N300S Wormhole | Llama 3.1 8B BF16 | ~24 | Tenstorrent official (hand-optimized) |
~4x throughput gap reflects the core constraint: Llama 3.1 8B (14GB) exceeds the N300S's 192MB SRAM and spills to GDDR6 at 576 GB/s — less than a third of A100 HBM bandwidth. TILE_LAYOUT mandatory at storage level adds real porting cost vs CUDA's row-major default. Koyeb deployment failures (instances 3f2265e1, 8ebd286b) prevented direct measurement.
AMD MI300X — Llama 70B Inference
| Hardware | Config | tok/s | Data Source |
|---|---|---|---|
| A100 SXM4 ×2 | TP=2 required | 21.2 (70B proxy) | Real — Phase 5 Part 3 |
| MI300X | TP=1 native | ~37 | Chips & Cheese; AMD MLPerf v4.1 |
192GB HBM3 is the architectural differentiator — the entire Llama 70B model fits on one chip, eliminating the AllReduce overhead that forces A100 and H100 onto TP=2. ~40% lower latency vs H100 on 70B decode comes entirely from eliminating that inter-GPU communication step.
Realized throughput: independent analysis (arxiv:2510.27583) found MI300X achieves 37–66% of H100/H200 throughput despite higher theoretical specs — ROCm kernel optimization still lags CUDA. RunPod deployment failed across three container configurations due to ROCm 6.1 userspace / 6.10.5 kernel driver mismatch. AMD Developer Cloud was at capacity.
Groq LPU — TTFT and Throughput Benchmark (real measurements)
Methodology: ~100 token prompt, 200 token output, temperature=0, streaming, 1 warmup excluded, 5 run mean. Identical to Phase 5 Part 3. Network latency to Groq datacenter: ~13ms round-trip from Austin, TX.
| Hardware | Model | tok/s | TTFT | Data Source |
|---|---|---|---|---|
| A100 SXM4 (local) | Llama 3.1 8B | 93.0 | 18.1ms | Real — Phase 5 Part 3 |
| A100 SXM4 (local) | Llama 3.3 70B | 21.2 | 58.0ms | Real — Phase 5 Part 3 |
| Cloud API median | Llama 3.1 8B | 154.8 | 930ms | ArtificialAnalysis |
| Cloud API median | Llama 3.3 70B | 85.5 | 1,410ms | ArtificialAnalysis |
| Groq LPU (API) | Llama 3.1 8B | 672.9 | 130.8ms | Real — this session |
| Groq LPU (API) | Llama 3.3 70B | 263.1 | 135.4ms | Real — this session |
| Groq LPU (spec dec) | Llama 3 70B | 1,665 | — | Groq published |
Groq vs cloud API median: 4.3x faster throughput on 8B, 3.1x on 70B, 7x lower TTFT on 8B. TTFT breakdown: ~6.5ms network, ~14ms LPU compute, ~110ms API infrastructure overhead. On-premise LPU compute TTFT (~14ms) actually beats A100 local (18ms) by 22% — the API overhead obscures this in cloud comparisons.
NVIDIA acquired Groq in December 2025. GroqCloud remains operational; long-term roadmap under NVIDIA ownership is uncertain.
Google TPU v5e — BF16 Matmul + Fused Attention (real measurements)
| Hardware | BF16 Peak | Achieved | Utilization | Data Source |
|---|---|---|---|---|
| A100 SXM4 | ~312 TFLOPS | ~250 TFLOPS | ~80% | Real — Phase 2 |
| RTX 4090 | ~165 TFLOPS | ~130 TFLOPS | ~79% | Real — Phase 1 |
| TPU v5e-1 | ~197 TFLOPS | 142.3 TFLOPS | 72.2% | Real — this session |
Fused attention (batch=4, heads=16, seq=2048, d_head=64, BF16): 2.60ms via jax.jit.
XLA automatically fused matmul+softmax+matmul into a single TPU kernel with no programmer
intervention — the same optimization Phase 3 required expert Triton kernel engineering to achieve.
Google Cloud provisioning failed across five zones (capacity errors, wrong accelerator type, quota limits). Accessed via Google Colab free tier — zero provisioning friction.
Every alternative hardware exercise encountered significant access friction. Every NVIDIA exercise in Phases 0–5 produced real measurements on the first attempt.
| Architecture | Access method | Outcome |
|---|---|---|
| Tenstorrent N300S | Koyeb cloud | Platform deployment failures — container never initialized |
| AMD MI300X | RunPod | ROCm userspace/kernel driver mismatch across 3 container configs |
| AMD MI300X | AMD Developer Cloud | At capacity |
| Groq LPU | GroqCloud API | Accessible — API only, no raw hardware access |
| Google TPU v5e | Google Cloud Console | Capacity errors, wrong zone errors, quota limits across 5 zones |
| Google TPU v5e | Google Colab | Accessible — free tier, zero friction |
| NVIDIA A100/RTX 4090 | RunPod | Deployed and benchmarking in under 20 minutes, every time |
This is not bad luck — it reflects 15 years of accumulated ecosystem investment. NVIDIA driver compatibility matrices are battle-tested across thousands of container configurations. Alternative hardware ecosystems are newer, thinner, and less validated in containerized cloud environments. Every hour debugging ROCm version mismatches or navigating TPU quota limits is engineering time that belongs in any TCO analysis alongside chip price and memory bandwidth.
Every accelerator architecture is an answer to the same question: where is the binding constraint? Tenstorrent bets it's on-chip memory bandwidth for small models. AMD bets it's memory capacity for large ones. Groq bets it's latency unpredictability. TPU bets it's the mismatch between general-purpose hardware and transformer-specific operations. Each is correct for a specific workload shape — and each requires real engineering work to access and operate, in a way NVIDIA currently does not.
NVIDIA's moat is not just TFLOPS or memory bandwidth. It is the accumulated infrastructure of developer convenience built over 15 years: validated container images, community-tested compatibility matrices, thousands of Stack Overflow answers for every failure mode, and a path from zero to running code measured in minutes. Alternative hardware vendors that close this gap — not just on specs but on developer experience — will be the ones that successfully challenge NVIDIA's position. Until then, ubiquity is the moat.
Hardware: A100 SXM4 80GB, CUDA 12.4, PyTorch 2.4.0, RunPod
Model: 30M parameter GPT-style transformer, trained from scratch on TinyShakespeare
Exercise 7.1 — Data Preparation
| Metric | Value |
|---|---|
| Raw characters | 1,115,394 |
| Total tokens (GPT-2 BPE) | 338,025 |
| Vocabulary size | 50,257 |
| Compression ratio | 3.30 chars/token |
| Train / val split | 304,222 / 33,803 tokens |
BPE compression ratio directly determines attention cost — shorter sequences mean cheaper O(N²) attention computation. uint16 storage (not int32) halves memory bandwidth requirements while covering the full 50,257 vocabulary.
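Two one-line checks behind that paragraph, computed from the table's own numbers:

```python
chars, tokens = 1_115_394, 338_025
ratio = chars / tokens
# Character-level tokenization would multiply N by ~3.3 and O(N^2) attention
# cost by ~10.9x per layer.
print(f"{ratio:.2f} chars/token -> {ratio**2:.1f}x attention cost if char-level")

# The full GPT-2 vocabulary fits in uint16 (2 bytes vs int32's 4).
assert 50_257 <= 2**16 - 1
```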
Exercise 7.2 — Model Implementation (nanoGPT)
| Component | Parameters |
|---|---|
| Token embedding (wte) | 19.3M (64% of total) |
| Position embedding (wpe) | 98K |
| 6 transformer blocks | 10.6M |
| Total | 30.02M |
Sanity check: initial loss 10.926 vs expected ln(50257) = 10.825 — gap of 0.10, within normal range for random initialization. Logits shape (4, 64, 50257) correct.
The embedding table dominates at small n_embd — 64% of a "30M parameter model" is a lookup table, not reasoning capacity. Weight tying (sharing embedding and output head weights) reduces parameters by 19.3M and empirically improves language modeling performance.
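The table's parameter budget can be reconstructed from the config. The values below are inferred, not quoted from the repo: n_embd = 19.3M / 50,257 ≈ 384, block_size = 98K / 384 ≈ 256, and nanoGPT's bias=False convention with weight tying is assumed (the tied output head adds zero parameters):

```python
def gpt_params(vocab=50_257, block_size=256, n_layer=6, n_embd=384):
    wte = vocab * n_embd                    # token embedding (tied output head)
    wpe = block_size * n_embd               # learned position embedding
    # Per block: QKV + attention proj (4*n_embd^2) + 4x-wide MLP (8*n_embd^2)
    # + two LayerNorm weight vectors (bias=False).
    block = 12 * n_embd**2 + 2 * n_embd
    total = wte + wpe + n_layer * block + n_embd   # + final LayerNorm
    return wte, wpe, n_layer * block, total

wte, wpe, blocks, total = gpt_params()
print(total, round(wte / total, 2))   # ~30.0M total, wte ~64% of it
```

Under these assumptions the breakdown lands on the table's figures: 19.3M / 98K / 10.6M / 30.02M.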
Exercise 7.3 — Overfitting: Capacity vs. Data Mismatch
30M parameter model on 304K training tokens (0.01 tokens/param):
| Iteration | Train loss | Val loss |
|---|---|---|
| 0 | 10.91 | 10.91 |
| 500 | 2.49 | 5.19 |
| 1,000 | 0.46 | 6.94 |
| 2,000 | 0.14 | 8.70 |
| 5,000 | 0.07 | 9.58 |
Generated output at iter 5,000: verbatim Romeo and Juliet Act IV. Complete memorization.
7.25M model (same dataset, 0.042 tokens/param): val loss floors at 5.70, generates novel Shakespeare-style text not present in training data.
Throughput delta: 7.25M at 389K tok/s vs 30M at 203K tok/s — nearly 2x, driven by smaller activation tensors fitting better in L2 cache. The memory hierarchy from Phase 1 reappears at the model architecture level.
Exercise 7.4 — Scaling Law Experiment
Four models trained at increasing scale, fixed dataset:
| Model | Non-embed params | Best val loss | Best at iter | Chinchilla starvation |
|---|---|---|---|---|
| nano | 98,944 | 5.3410 | 500 | 6.5x underfed |
| micro | 333,120 | 4.9774 | 750 | 21.9x underfed |
| mini | 788,736 | 4.7920 | 750 | 51.8x underfed |
| small | 2,659,200 | 4.8181 | 500 | 174.8x underfed |
Power law fit (nano, micro, mini): L(N) = 7.5929 × N^(-0.0321), fit error 1–2.4%. The small model broke the law — best val loss worse than mini despite 3.4x more parameters. Cause: 174.8x data-starved relative to the Chinchilla-optimal token count. The scaling law has a hidden precondition: each model must be trained to its optimal point. Violate that and the prediction fails.
Scaling exponent α = 0.0321 vs Kaplan et al. α = 0.076 — direction correct, rate slower. Every model was data-starved, compressing the loss range and flattening the curve. The exponent is dataset-dependent, not universal.
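The fit itself is a two-parameter regression in log-log space. A plain least-squares version — a sketch that will not reproduce the reported a = 7.5929, α = −0.0321 exactly, since the repo's fitting procedure isn't specified, but recovers the same shape within a couple of percent:

```python
import math

def fit_power_law(points):
    # Fit L(N) = a * N**alpha by linear least squares on (log N, log L).
    xs = [math.log(n) for n, _ in points]
    ys = [math.log(loss) for _, loss in points]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    alpha = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return math.exp(my - alpha * mx), alpha

pts = [(98_944, 5.3410), (333_120, 4.9774), (788_736, 4.7920)]  # nano/micro/mini
a, alpha = fit_power_law(pts)
print(a, alpha)   # small negative exponent: loss falls slowly with N
```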
Exercise 7.5b — Engineered Failure Modes
Deliberately removed residuals, LayerNorm, warmup, and gradient clipping to produce and characterize each failure mode.
Gradient death (residuals + LayerNorm both removed):
| Config | Residuals | LayerNorm | Result |
|---|---|---|---|
| Broken | OFF | OFF | Loss frozen at 10.81, grad_norm = 0.0 |
| Residuals only | ON | OFF | Learns but noisy, grad_norm spikes to 8.0 |
| LayerNorm only | OFF | ON | Learns smoothly, floors at 6.3 |
| Control | ON | ON | Fastest descent, lowest loss 5.50 |
BF16 underflow mechanism: sequential attenuation through 12 unnormalized layers drove activations toward zero until gradients read exactly 0.0 (BF16 keeps FP32's 8-bit exponent but only a 7-bit mantissa, so small contributions vanish well before its ~1e-38 underflow floor). Complete freeze from iteration 0 — not an explosion, a silent death.
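BF16's failure surface can be felt without a GPU: the format is an FP32 with the low 16 bits dropped, wide exponent range but only 7 mantissa bits. A pure-Python truncation model, illustrative of the format rather than the repo's instrumentation:

```python
import struct

def to_bf16(x):
    # bfloat16 = top 16 bits of an FP32: 1 sign + 8 exponent + 7 mantissa.
    # (Real hardware rounds-to-nearest; truncation is close enough to show
    # the effect.)
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFF0000))[0]

print(to_bf16(1.0 + 2**-8) == 1.0)   # True: contributions below ~1/256 of a
                                     # value are absorbed outright
print(to_bf16(1e-30) != 0.0)         # True: the exponent range reaches ~1e-38
```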
Warmup sensitivity (batch=128):
| Config | Final loss | vs control |
|---|---|---|
| No warmup | 4.2703 | +20% worse |
| Short warmup (50 iters) | 3.5801 | +4.5% worse |
| Full warmup (200 iters) | 3.4243 | baseline |
No-warmup damage is permanent and silent — loss goes down, training appears healthy, the 20% gap is only visible against a control run.
High LR (100x, AdamW): did not NaN. AdamW's adaptive per-parameter learning rates absorbed most of the raw LR magnitude increase. Effective step size scales sub-linearly — a 100x LR increase produced ~3-5x larger effective updates, not 100x. Damage occurred in the first 50 iterations and persisted to final loss despite cosine decay bringing LR near baseline by iter 275.
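The schedule whose absence cost 20% is a few lines. Hyperparameters below are illustrative nanoGPT-style defaults, not the repo's exact config; only the 200-iteration warmup comes from the table above:

```python
import math

def lr_at(it, max_lr=6e-4, min_lr=6e-5, warmup=200, max_iters=3000):
    # Linear warmup to max_lr, then cosine decay to min_lr. The warmup ramp
    # keeps early AdamW steps small while its second-moment estimates are
    # still noise.
    if it < warmup:
        return max_lr * (it + 1) / warmup
    t = min((it - warmup) / (max_iters - warmup), 1.0)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * t))
```

Deleting the `if` branch reproduces the no-warmup configuration that trained "normally" and landed 20% worse.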
Building a transformer from scratch closes the loop on everything in Phases 1–6. The KV cache in Phase 5 exists because token generation is a sequential loop — one forward pass per output token, 500 passes for a 500-token response. Flash Attention in Phase 3 exists because attention cost scales as O(N²) with sequence length, and the N here is the token count computed in Exercise 7.1. The memory hierarchy principle from Phase 1 reappeared in the throughput delta between the 30M and 7.25M models. The same roofline model that predicted Phase 5 quantization results predicted training throughput here.
The failure mode experiments revealed something counterintuitive: modern training infrastructure is collectively so robust that engineering visible failures at small scale required removing multiple components simultaneously. Each technique — AdamW, residuals, LayerNorm, gradient clipping — was added in response to a real failure at scale. The failures that required deliberate engineering to produce at 12M parameters on 300K tokens are routine operational concerns at 70B parameters on trillion-token datasets. Scale is the multiplier that makes each fragility relevant.
The most dangerous training failures are the silent ones. Gradient explosion produces NaN — immediately visible. Warmup misconfiguration produces a model that trains normally, converges cleanly, and is permanently 20% worse than it should be. Without a control run, you would never know. The scaling law has the same property: it predicts correctly when models are trained to their optimal point, and breaks silently when they are not. Log gradient norms alongside loss on every run. Track best val loss and checkpoint accordingly. Run a control. The control is not overhead — it is the only way to detect silent degradation.