Most people who work with AI understand it at the layer they live in. What almost nobody has is a coherent mental model of the whole stack. This repo is my attempt to build that from the ground up. Over eight phases and roughly six months, I worked through the full AI compute stack hands-on: writing CUDA kernels, implementing Ring AllReduce from scratch, deploying vLLM on A100s, and benchmarking Groq's LPU against my own A100 baseline. Every concept has an exercise. Every exercise produces a real number. Every number is committed here with the hardware and conditions; where I hit failures, there is an honest account of what went wrong and why.
The throughline is the same insight wearing different clothes at every layer: the binding constraint in AI systems is almost always data movement, not computation. Flash Attention unlocked 100K-token context windows not through better hardware, but by restructuring attention to avoid materializing a giant intermediate matrix in HBM. AWQ INT4 with the wrong kernel is 5x slower than BF16. The roofline model from Phase 1, built from two numbers on a spec sheet, predicted every subsequent benchmark result before I ran a single line of code.
I'm not an AI engineer. I'm a sales professional with an EE background who wanted to have technically grounded conversations about AI infrastructure that most people in go-to-market roles can't have because they've never run the code. This is what I built to close that gap.
| Phase | Topic | Status | Key Result |
|---|---|---|---|
| 0 | Environment & Tooling | ✅ Complete | RTX 4090 verified, 99.7% peak FP32 utilization on first benchmark |
| 1 | GPU Architecture & Memory Hierarchy | ✅ Complete | Roofline model measured: 947 GB/s bandwidth, 158 TFLOPS compute, ridge at ~327 FLOP/byte |
| 2 | CUDA Kernel Programming | ✅ Complete | Tiled matmul 7.2 TFLOPS vs cuBLAS 54 TFLOPS; 8x HBM traffic reduction confirmed via ncu |
| 3 | Triton Kernel Programming | ✅ Complete | 92% peak bandwidth, 2x fusion win, Flash Attention 6.4x speedup at N=4096 |
| 4 | Distributed Training Primitives | ✅ Complete | 228 GB/s measured vs 600 GB/s NVLink spec; Ring AllReduce from scratch; 38% of spec in virtualized env |
| 5 | Inference & Serving Infrastructure | ✅ Complete | AWQ Marlin 2.27x BF16; 8x concurrency = 8x throughput flat latency; A100 TTFT 18ms vs Groq API 200ms; Groq 9.4x faster on throughput |
| 6 | Alternative Hardware Architectures | ✅ Complete | Groq 672.9 tok/s (7.2x A100); TPU v5e 142.3 TFLOPS / XLA auto-fusion confirmed; deployment friction documented across all 4 architectures |
| 7 | Model Architecture & Full Stack View | ✅ Complete | 30M param GPT trained from scratch; scaling law verified empirically (1–2% fit error); warmup misconfiguration = silent 20% permanent loss penalty |
Hardware: RTX 4090, 49GB VRAM, CUDA 13.1, PyTorch 2.10, vast.ai ($0.57/hr)
Environment verified via four gates: nvidia-smi (driver 580.95.05), nvcc --version
(CUDA 13.1), torch.cuda.is_available() → True, torch.__version__ → 2.10.0a.
Baseline matmul benchmark (4096 × 4096, FP32, 100-iteration average):
| Metric | Result |
|---|---|
| Elapsed per call | 1.67ms |
| Achieved throughput | 82.4 TFLOPS |
| RTX 4090 FP32 peak | 82.6 TFLOPS |
| Peak utilization | 99.7% |
A cloud GPU is rented hardware you access over the internet and immediately put to work
running math at extraordinary scale. The 82.4 TFLOPS result — 99.7% of theoretical
peak — is achievable because torch.matmul calls NVIDIA's hand-tuned cuBLAS library,
not naive code. That ceiling is the baseline everything else in this curriculum gets
measured against. Every kernel written by hand starts below it; the work of optimization
is understanding why and closing the gap.
torch.cuda.synchronize() is the first thing that separates someone who understands
GPU execution from someone who doesn't. GPU operations are asynchronous — without an
explicit sync barrier, you're timing kernel launches (microseconds), not kernel
execution (milliseconds). Every benchmarking script in this repo includes it.
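The pattern those scripts follow can be sketched as a small harness. This is illustrative, not the repo's actual script; it falls back to a no-op sync on machines without CUDA so the shape of the logic is still visible:

```python
import time

try:
    import torch
    # Only meaningful on a CUDA machine; harmless no-op fallback otherwise.
    sync = torch.cuda.synchronize if torch.cuda.is_available() else (lambda: None)
except ImportError:
    sync = lambda: None

def bench(fn, iters=100, warmup=10):
    # Warm up first (caches, clocks, lazy init), then sync BEFORE starting the
    # clock and AFTER the last call, so we time kernel execution, not launches.
    for _ in range(warmup):
        fn()
    sync()
    t0 = time.perf_counter()
    for _ in range(iters):
        fn()
    sync()
    return (time.perf_counter() - t0) / iters
```

Without the two `sync()` calls, the loop measures only the microsecond-scale launch queue, and a 1.67ms matmul can appear orders of magnitude faster than it is.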
Hardware: RTX 4090, 24GB GDDR6X, CUDA 13.0, vast.ai VM (~$0.40/hr)
Memory-bound operation — element-wise multiply, 32M FP32 elements:
| Metric | Result |
|---|---|
| Achieved bandwidth | 947.5 GB/s |
| Peak bandwidth | 1,008 GB/s |
| Utilization | 94.0% |
| DRAM throughput (ncu) | 92% |
| SM compute throughput (ncu) | 2.6% |
| Arithmetic intensity | 0.083 FLOP/byte |
Compute-bound operation — FP16 matmul, N=8192:
| Metric | Result |
|---|---|
| Achieved throughput | 158.1 TFLOPS |
| Peak throughput | 330 TFLOPS |
| Utilization | 47.9% |
| DRAM throughput (ncu) | 13% |
| SM compute throughput (ncu) | 47% |
| Arithmetic intensity | 2,731 FLOP/byte |
Roofline regime transition (FP16 matmul):
| Matrix size N | Arithmetic intensity | Regime |
|---|---|---|
| 64 | 21.3 FLOP/byte | Memory-bound |
| 128 | 42.7 FLOP/byte | Memory-bound |
| 256 | 85.3 FLOP/byte | Approaching ridge |
| 8192 | 2,731 FLOP/byte | Compute-bound |
Ridge point on RTX 4090: ~327 FLOP/byte (330 TFLOPS FP16 peak ÷ 1,008 GB/s bandwidth)
Every GPU operation is bottlenecked by one of two things: how fast it can move data from memory, or how fast it can do arithmetic. Which one limits you depends on the ratio of math to data movement — arithmetic intensity. We measured this directly: element-wise multiply drove DRAM at 92% of peak while the SMs sat at 2.6%. Large matmul flipped that entirely. Below the ridge point (~N=256 for matmul), buying a faster GPU does nothing. You need faster memory or smarter data reuse.
Arithmetic intensity determines optimization strategy. Kernel fusion and quantization help memory-bound workloads. More TFLOPS only helps compute-bound ones. Knowing which regime your workload is in — a 60-second calculation from the spec sheet — should precede every infrastructure decision.
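That 60-second calculation can be written down directly. A sketch using this phase's RTX 4090 spec-sheet numbers; byte counts assume ideal traffic, one read per operand and one write per result:

```python
def regime(flops, bytes_moved, peak_tflops=330.0, peak_gbps=1008.0):
    # Arithmetic intensity vs. the ridge point decides the bottleneck.
    ai = flops / bytes_moved                          # FLOP/byte
    ridge = peak_tflops * 1e12 / (peak_gbps * 1e9)    # ~327 FLOP/byte on 4090
    return ai, ("compute-bound" if ai > ridge else "memory-bound")

# Element-wise multiply, 32M FP32 elements: 1 FLOP each, 12 bytes moved
# (read a, read b, write c at 4 bytes apiece). AI ~0.083 -> memory-bound.
n = 32 * 1024**2
print(regime(n, 12 * n))

# FP16 matmul, N=8192: 2*N^3 FLOPs, three N x N FP16 tensors of ideal traffic.
# AI = N/3 ~ 2,731 -> compute-bound, matching the table.
N = 8192
print(regime(2 * N**3, 3 * N * N * 2))
```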
Hardware: RTX 4090, 49,140MB VRAM, CUDA 12.1, vast.ai VM
Exercise 2.1 — Vector Addition
- 1M elements across 4,096 blocks × 256 threads
- Global load bytes (ncu): 8.39MB — exactly 2 arrays × 1M × 4 bytes, zero waste
- Arithmetic intensity: 0.125 FLOP/byte — most memory-bound operation measured across Phases 1–2
- Correctness: h_c[0] = 3.0, h_c[1048575] = 3.0, no CUDA errors
Exercise 2.2 — Naive vs. Tiled Matmul vs. cuBLAS
Wall clock (20-iteration average, CUDA events):
| Version | N=1024 | N=2048 | N=4096 | % of cuBLAS |
|---|---|---|---|---|
| Naive | 0.43ms / 5.0 TFLOPS | 3.37ms / 5.1 TFLOPS | — (skipped) | ~9% |
| Tiled (T=16) | 0.33ms / 6.4 TFLOPS | 2.61ms / 6.6 TFLOPS | 19.17ms / 7.2 TFLOPS | ~13% |
| cuBLAS | 0.04ms / 48.7 TFLOPS | 0.32ms / 54.5 TFLOPS | 2.55ms / 53.9 TFLOPS | 100% |
ncu hardware counters (N=1024, single kernel instance):
| Metric | Naive | Tiled (T=16) |
|---|---|---|
| Global load bytes (HBM) | 4.29 GB | 537 MB |
| Shared memory wavefronts | 0 | 50,339,258 |
| DRAM throughput % of peak | 1.70% | 2.20% |
| SM throughput % of peak | 97.28% | 94.25% |
Tiling reduced HBM traffic 8x — exactly what the arithmetic predicts.
Naive matmul skipped at N=4096: all three matrices total 192MB, blowing past the 72MB L2 cache. Performance would have collapsed. At N=1024 (12MB total, fits in L2), naive measured 5.0 TFLOPS — L2 masking the expected collapse, which ncu confirmed via the 8x HBM traffic difference between naive and tiled.
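The 8x arithmetic is worth writing out. Ideal (cache-free) global read traffic for an N×N matmul is 2N reads per output element naively, and 2N/T with T×T tiles. A small sketch: the tiled prediction lands exactly on the ncu measurement, while the naive worst case (~8.6 GB) is halved to the measured 4.29 GB by L2 and coalescing:

```python
def matmul_read_bytes(n, tile=1, dtype_bytes=4):
    # Ideal global-memory read traffic for C = A @ B (n x n, FP32).
    # tile=1 models the naive kernel: every thread independently re-reads a
    # full row of A and a full column of B. A T x T shared-memory tile lets
    # T threads share each load, dividing read traffic by T.
    return n * n * (2 * n // tile) * dtype_bytes

print(matmul_read_bytes(1024))            # 8,589,934,592 (~8.6 GB worst case)
print(matmul_read_bytes(1024, tile=16))   # 536,870,912 (537 MB, matches ncu)
```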
Exercise 2.3 — Softmax with Warp Shuffle
- 4,096 rows × 32 columns, one warp (32 threads) per row
- Global load bytes (ncu): 1.57MB — input read exactly once
- Correctness verified: all rows sum to 1.000000
- Warp shuffle reduction: 5 steps of `__shfl_down_sync`, no shared memory, no barriers; ~1 cycle per step vs. ~5 cycles for the shared-memory equivalent
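The reduction's data flow can be sanity-checked off-GPU. A minimal Python simulation of the 5-step `__shfl_down_sync` max-reduction, modeling the lane semantics (lane i reads lane i+offset; out-of-range lanes keep their own value) rather than real CUDA:

```python
def warp_reduce_max(lanes):
    # One 32-thread warp; lanes[i] is the register value held by lane i.
    # Each step halves the active distance, register-to-register, with no
    # shared memory and no __syncthreads().
    assert len(lanes) == 32
    vals = list(lanes)
    for offset in (16, 8, 4, 2, 1):          # 5 steps = log2(32)
        vals = [max(v, vals[i + offset]) if i + offset < 32 else v
                for i, v in enumerate(vals)]
    return vals[0]                            # lane 0 ends with the row max
```

Building a fresh list each step models the fact that all lanes exchange values simultaneously within one instruction.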
A GPU is not a fast calculator — it is a memory system with compute attached. A straightforwardly written matmul reached roughly 9% of cuBLAS throughput because the same data was repeatedly fetched from slow off-chip memory by independent threads. Tiling fixed this by loading data into fast on-chip SRAM once and reusing it many times, cutting HBM traffic 8x. The remaining gap to cuBLAS is decades of additional engineering: three tiling levels instead of one, Tensor Core integration, and assembly-level inner loop optimization. The warp shuffle exercise showed that even cross-thread coordination (finding a row maximum) can execute entirely in registers in 5 instructions — the same primitive that makes Flash Attention's fused kernel feasible.
The L2 cache masking effect is a production trap. A kernel benchmarking acceptably at small problem sizes can collapse at production scale when the working set exceeds L2. Any kernel evaluation that doesn't profile at actual production problem sizes is incomplete. SM throughput is not the same as compute utilization — naive kernels hammering HBM with load instructions register high SM activity while compute units starve. Wall clock is the honest arbiter.
Hardware: RTX 4090, 24GB GDDR6X, RunPod persistent pod, CUDA 12.4, PyTorch 2.4.1, Triton 3.0.0
Exercise 3.1 — Triton Vector Addition
- Test size: 98,432 elements (non-power-of-2, stress-tests mask logic)
- Correctness: max absolute difference vs PyTorch = 0.00e+00
- At 16M elements (only size exceeding 72MB L2, measuring real GDDR6X traffic):
| Implementation | Bandwidth | % of Peak |
|---|---|---|
| Triton | 924.0 GB/s | 91.7% |
| PyTorch | 924.5 GB/s | 91.8% |
| RTX 4090 peak | 1,008 GB/s | 100% |
Statistically identical. Triton's compiler produces code on par with hand-tuned CUDA with no manual thread indexing.
Exercise 3.2 — Fused Softmax
Benchmark at 1024 rows × 2048 cols:
| Implementation | Bandwidth | Latency | Notes |
|---|---|---|---|
| Triton fused | 823 GB/s | 20.4 µs | 1 HBM round trip |
| PyTorch unfused (5-op) | 414 GB/s | 40.4 µs | 5 HBM round trips |
| torch.softmax (internally fused) | 1,208 GB/s | 13.9 µs | L2 cache effects at this size |
- Correctness: max absolute difference vs PyTorch = 7.45e-09, allclose = True
- Fusion win: 2x throughput vs explicit unfused implementation
- Unfused bandwidth plateaus at ~411–414 GB/s across all sizes — signature of multiple HBM passes
Exercise 3.3 — Flash Attention Forward Pass
- Correctness: max absolute difference vs naive = 1.14e-03 (float32 accumulation across tile rescaling — not a logic bug)
- Benchmark (batch=2, heads=4, d_head=64, float32):
| N | Flash | Naive | Speedup | Naive attn matrix |
|---|---|---|---|---|
| 512 | 0.041ms | 0.072ms | 1.8x | 8.4 MB |
| 1024 | 0.044ms | 0.176ms | 4.0x | 33.6 MB |
| 2048 | 0.155ms | 0.969ms | 6.3x | 134.2 MB |
| 4096 | 0.590ms | 3.750ms | 6.4x | 536.9 MB |
Speedup compounds with N, consistent with the O(N) vs O(N²) HBM-traffic prediction: every doubling of N quadruples the score matrix that naive attention must write to and re-read from HBM, while Flash never materializes it at all.
Triton proved three things experimentally: a Python-level GPU compiler can match hand-tuned CUDA performance (92% of peak bandwidth with no manual thread management); kernel fusion cuts memory traffic in half and directly doubles throughput; and Flash Attention's breakthrough wasn't a faster chip — it was a mathematical restructuring of attention to avoid writing a giant intermediate matrix to HBM at all. That restructuring is what made 100K-token context windows physically possible. The algorithm unlocked the hardware market for long-context inference — not the other way around.
Context length is not a free parameter. Without fused tiled attention, memory requirements scale as O(N²). With it, O(N). At N=4096, naive attention requires 537MB just for the attention matrix. At N=32K, that number exceeds the GPU entirely. Evaluating any inference framework requires confirming Flash Attention (or equivalent) is the default attention path, not an optional flag.
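The memory claim is pure arithmetic, using the batch and head counts from the benchmark above and FP32 scores:

```python
def naive_attn_scores_bytes(batch, heads, seq, dtype_bytes=4):
    # The batch x heads x N x N score matrix that naive attention writes to
    # HBM and Flash Attention never materializes.
    return batch * heads * seq * seq * dtype_bytes

for n in (512, 1024, 2048, 4096, 32_768):
    gb = naive_attn_scores_bytes(2, 4, n) / 1e9
    print(f"N={n:>6}: {gb:8.3f} GB")   # N=4096 -> 0.537 GB; N=32K -> 34.4 GB
```

At N=32K the score matrix alone (34.4 GB) exceeds a 24 GB RTX 4090 before weights or activations are counted.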
| Phase | Hardware | Provider | Cost |
|---|---|---|---|
| 0 | RTX 4090, 49GB VRAM, CUDA 13.1 | vast.ai | $0.57/hr |
| 1 | RTX 4090, 24GB GDDR6X, CUDA 13.0 | vast.ai VM | ~$0.40/hr |
| 2 | RTX 4090, 49GB VRAM, CUDA 12.1 | vast.ai VM | — |
| 3 | RTX 4090, 24GB GDDR6X, CUDA 12.4 | RunPod persistent pod | — |
All RTX 4090 specs: 16,384 CUDA cores, 1,008 GB/s GDDR6X bandwidth, 82.6 TFLOPS FP32, 330 TFLOPS FP16 tensor core peak.
Hardware: 4× A100 SXM 40GB, NVLink NV12 topology, CUDA 12.4.1, PyTorch 2.4.0, RunPod
Exercise 4.1 — DDP Training
- 4-process DDP training loop across 4× A100 SXM
- Initial param sum: `20.335490` — identical across all 4 ranks (broadcast at init confirmed)
- Final param sum: `20.307633` — identical across all 4 ranks after 20 steps on different random data (AllReduce confirmed working)
- Step timing: 0.21s (rank 0) to 0.26s (rank 3) — variance demonstrates the synchronization barrier in practice: fastest rank waits for slowest before any rank proceeds
Exercise 4.2 — Ring AllReduce from Scratch
- Implemented full Ring AllReduce using only `dist.P2POp` point-to-point primitives — no `dist.all_reduce()`
- Result matched PyTorch `dist.all_reduce()` to `9.54e-07` max absolute difference across all 4 ranks
- Hit deadlock on first implementation using naive `isend` + `recv` — all ranks blocked simultaneously because NCCL `isend` doesn't move data until a matching `recv` is posted on the remote end
- Fixed with `dist.batch_isend_irecv`, which posts sends and receives atomically
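For readers without 4 GPUs at hand, the algorithm's two phases can be simulated in a single process. This mirrors the structure of the P2P implementation (reduce-scatter, then all-gather, over P simulated ranks) but is an illustrative sketch, not the repo's code:

```python
def ring_allreduce(chunks):
    # chunks[r][c]: rank r's local copy of chunk c (a list of floats).
    # Each rank sends to its right neighbor and receives from its left,
    # moving 2*(P-1)/P of the tensor in total across both phases.
    P = len(chunks)
    # Phase 1 -- reduce-scatter: after P-1 steps, rank r holds the fully
    # summed chunk (r+1) % P.
    for step in range(P - 1):
        outgoing = [list(chunks[r][(r - step) % P]) for r in range(P)]
        for r in range(P):
            c, left = (r - step - 1) % P, (r - 1) % P
            chunks[r][c] = [a + b for a, b in zip(chunks[r][c], outgoing[left])]
    # Phase 2 -- all-gather: circulate the reduced chunks until all ranks match.
    for step in range(P - 1):
        outgoing = [list(chunks[r][(r + 1 - step) % P]) for r in range(P)]
        for r in range(P):
            c, left = (r - step) % P, (r - 1) % P
            chunks[r][c] = list(outgoing[left])
    return chunks
```

Capturing `outgoing` before mutating models the simultaneous sends within a step — the same ordering problem the real implementation solves with `dist.batch_isend_irecv`.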
Exercise 4.3 — NVLink Bandwidth Benchmark
| Tensor size | Time (ms) | Measured BW | % of 600 GB/s spec |
|---|---|---|---|
| 1 MB | 0.051 | 31.0 GB/s | 5.2% |
| 16 MB | 0.202 | 124.3 GB/s | 20.7% |
| 64 MB | 0.596 | 168.8 GB/s | 28.1% |
| 256 MB | 2.092 | 192.5 GB/s | 32.1% |
| 512 MB | 3.817 | 211.0 GB/s | 35.2% |
| 1 GB | 7.244 | 222.3 GB/s | 37.1% |
| 2 GB | 14.152 | 227.6 GB/s | 37.9% |
Peak measured: 228 GB/s — 38% of 600 GB/s A100 SXM NVLink spec. Curve still rising at
2GB. Gap vs. spec attributed to RunPod virtualization overhead, AllReduce synchronization
cost, and NCCL algorithm overhead vs. raw point-to-point benchmark conditions. Bus
bandwidth formula: 2 × (N-1)/N × tensor_size / time — matches NCCL benchmark
methodology.
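Reproducing the table's bottom row from that formula (a sketch; sizes use GB = 1024³ to match the table):

```python
def allreduce_busbw(nbytes, seconds, world_size=4):
    # NCCL's bus-bandwidth convention: a ring AllReduce moves 2*(P-1)/P of
    # the tensor per rank, so that factor converts wall time into the
    # per-link bandwidth the hardware actually sustained.
    P = world_size
    return 2 * (P - 1) / P * nbytes / seconds

bw = allreduce_busbw(2 * 1024**3, 14.152e-3)
print(f"{bw / 1e9:.1f} GB/s")   # 227.6 GB/s, the 2 GB row above
```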
When training across multiple GPUs, each GPU sees different data and computes its own gradient. Before any weight update can happen, all gradients must be averaged across every GPU so the model stays synchronized. That averaging — AllReduce — runs over the physical wire connecting GPUs on every single training step. We implemented that algorithm from scratch, proved it produces the correct result, then measured how fast the wire actually runs versus its rated spec. The gap between 228 GB/s measured and 600 GB/s rated is not a malfunction — it's what real-world overhead looks like. Understanding that gap is what separates people who read spec sheets from people who understand systems.
Interconnect topology is a first-class architectural decision, not a procurement detail. PCIe between GPUs delivers ~32 GB/s per link; NVLink delivers 600 GB/s. At real gradient tensor sizes, PCIe puts you in a regime where communication dominates compute — and that is not fixable in software. GPU utilization is also the wrong metric for distributed training: a cluster showing 80% utilization may be spending 40% of that time blocked at AllReduce barriers. The correct metric is Model FLOP Utilization (MFU). Most production training runs achieve 30–50% MFU. NVLink is NVIDIA's most durable competitive moat — more defensible than the GPU itself.
Hardware: A100 SXM4 80GB (single and 2×), CUDA 12.4, PyTorch 2.4.0, vLLM 0.18.0, RunPod
Models: Mistral-7B-Instruct-v0.3, Llama-3.1-8B-Instruct, Llama-3.3-70B-Instruct
Note: Exercises run across three parts — Mistral 7B baseline, Llama 3.1 8B cross-validation, and a TTFT streaming benchmark designed for direct comparison with published Groq LPU numbers.
Exercise 5.1 — vLLM Deployment & Continuous Batching
vLLM startup telemetry:
| Metric | Mistral 7B | Llama 3.1 8B |
|---|---|---|
| Model weights | 13.51 GiB | 14.99 GiB |
| KV cache pool | 57.41 GiB | 55.20 GiB |
| KV cache capacity | 470,288 tokens | 452,192 tokens |
| Max concurrency (8K context) | 57x | 55x |
The 1.5 GiB weight difference between models directly reduces KV cache pool — the tradeoff between model size and concurrent user capacity is exact and measurable.
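The KV cache capacities in the table follow from each model's config. Both Mistral-7B-v0.3 and Llama-3.1-8B use 32 layers, grouped-query attention with 8 KV heads, and head_dim 128 per their published model cards — treat those constants as assumptions in this sketch — which puts the BF16 KV cost at 128 KiB per token:

```python
def kv_bytes_per_token(n_layers=32, n_kv_heads=8, head_dim=128, dtype_bytes=2):
    # One K vector and one V vector, per layer, per KV head, in BF16.
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes

per_tok = kv_bytes_per_token()
print(per_tok)                        # 131072 bytes = 128 KiB per token
print(int(57.41 * 2**30 / per_tok))   # ~470,300 tokens from a 57.41 GiB pool
```

Dividing the pool by the per-token cost reproduces the 470,288-token capacity above to within rounding of the reported pool size.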
Concurrent request scaling (continuous batching):
| Concurrency | Mistral 7B | Llama 3.1 8B |
|---|---|---|
| 1 | 94.7 tok/s / 2,110ms | 91.2 tok/s / 2,192ms |
| 2 | 193.0 tok/s / 2,066ms | 186.5 tok/s / 2,138ms |
| 4 | 382.7 tok/s / 2,083ms | 367.6 tok/s / 2,169ms |
| 8 | 752.3 tok/s / 2,110ms | 731.1 tok/s / 2,179ms |
8x requests = 8x throughput with flat latency on both models. Continuous batching behavior is architecture-agnostic — a property of vLLM's scheduler, not the model.
Exercise 5.2 — Quantization Benchmarks
| Format | Model size | Throughput | vs BF16 | Kernel |
|---|---|---|---|---|
| Mistral 7B BF16 | 13.51 GiB | 96.5 tok/s | 1.0x | FlashAttention v2 |
| Mistral 7B AWQ naive | 3.88 GiB | 20.9 tok/s | 0.2x | naive AWQ |
| Mistral 7B AWQ Marlin | 3.88 GiB | 218.9 tok/s | 2.27x | fused Marlin |
| Llama 3.1 8B BF16 | 14.99 GiB | 93.1 tok/s | 1.0x | FlashAttention v2 |
| Llama 3.1 8B AWQ naive | 5.37 GiB | 20.7 tok/s | 0.2x | naive AWQ |
| Llama 3.1 8B AWQ Marlin | 5.37 GiB | 199.1 tok/s | 2.14x | fused Marlin |
Key findings confirmed across both model families:
- AWQ naive penalty is model-agnostic. Both models landed at ~20.8 tok/s. Dequantization overhead completely dominates — a kernel property, not a model property.
- BF16 throughput tracks weight size in the predicted direction: Llama's 11% larger weights came with a 3% throughput deficit, the memory-bandwidth-bound behavior the roofline model predicted before running.
- Marlin speedup is slightly lower for the larger model (2.14x vs 2.27x) because attention and unquantized operations become a larger fraction of total runtime as weight loading accelerates.
Error encountered and documented: initial AWQ run used --quantization awq instead
of --quantization awq_marlin. vLLM warned at startup that Marlin was available but
not selected. Result: 20.9 tok/s — 5x slower than BF16. Switching flags recovered
full performance. One flag, 10x difference.
Exercise 5.3 — Speculative Decoding
Two runs: mismatched draft model, then correctly matched draft model.
| Configuration | Throughput | vs BF16 |
|---|---|---|
| Mistral 7B AWQ Marlin | 218.9 tok/s | 2.27x |
| + TinyLlama 15M draft (mismatched family) | 67.5 tok/s | 0.7x |
| Llama 3.1 8B AWQ Marlin | 199.1 tok/s | 2.14x |
| + Llama 3.2 1B draft (matched family) | 106.4 tok/s | 1.14x |
Mismatched draft: same tokenizer vocabulary, different training family. High rejection rate — paid cost of both models, benefit of neither. 3x slower than target alone.
Matched draft: same Meta training family, same RLHF process. Draft acceptance rate: 59.5%. Mean accepted tokens per verify pass: 3.98. Result: latency improvement confirmed at low concurrency. Throughput measurement penalty vs Marlin alone is a vLLM 0.18.0 scheduling limitation — async scheduling is disabled when speculative decoding is active.
Shared tokenizer vocabulary is necessary but not sufficient for speculative decoding. Training dynamics must align. Llama 3.2 1B → Llama 3.1 8B is a legitimate production configuration. TinyLlama → Mistral is not.
Exercise 5.4 — TTFT Streaming Benchmark & Hardware Comparison Baseline
Methodology: fixed ~100-token prompt, 200-token output, temperature=0, streaming enabled, 1 warmup run excluded, 5 timed runs averaged. Matches Groq's published benchmark methodology for direct comparability.
A100 results:
| Model | Precision | TTFT | Throughput | Total time | Config |
|---|---|---|---|---|---|
| Llama 3.1 8B | BF16 | 18.1ms | 93.0 tok/s | 2,157ms | TP=1, 1× A100 |
| Llama 3.3 70B | BF16 | 58.0ms | 21.2 tok/s | 9,436ms | TP=2, 2× A100 |
70B constraint: 134GB of weights split across 2× 80GB GPUs left only 3.93 GiB for KV cache (25,728 token capacity). Viable for single-request benchmarking, not production multi-user serving.
A100 vs Groq LPU (Artificial Analysis published benchmarks):
| Model | A100 tok/s | Groq tok/s | Groq advantage | A100 TTFT | Groq API TTFT | TTFT winner |
|---|---|---|---|---|---|---|
| Llama 3.1 8B | 93.0 | 877 | 9.4x | 18.1ms | ~200ms | A100 (11x) |
| Llama 3.3 70B | 21.2 | 276 | 13x | 58.0ms | ~450ms | A100 (7.8x) |
The TTFT reversal is explained by network round-trip. Groq API TTFT includes client-to-datacenter latency. Local vLLM has none. Groq's throughput advantage is real and architectural — covered in Phase 6. The TTFT advantage of local deployment shrinks with longer outputs as Groq's throughput lead compounds.
Warmup effect documented: first-request TTFT was 42ms for 8B vs 18ms steady state. CUDA graph initialization on the first request; subsequent requests reuse it. Warmup exclusion is required for accurate TTFT measurement.
Inference is not a compute problem — it is a memory problem. The A100 has 312 TFLOPS sitting mostly idle during decode. What limits token generation speed is how fast model weights can be loaded from HBM on every single decode step. Every optimization in this phase — batching, quantization, speculative decoding — is a variation of the same principle from Phases 1–3: minimize expensive data movement, maximize useful work per memory access. The roofline model built in Phase 1 predicted every result here before a single benchmark was run.
The non-obvious finding: optimization method and optimization implementation are different things. AWQ INT4 with the wrong kernel was 5x slower than BF16. AWQ INT4 with the right kernel was 2.27x faster. Speculative decoding with the wrong draft family was 3x slower than no speculative decoding. Same technique, opposite outcomes.
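The decode roofline behind that claim is one line of arithmetic: if every generated token must stream all weights from HBM, single-stream throughput is capped at bandwidth divided by weight bytes. A sketch assuming the A100 80GB's ~2,039 GB/s spec; KV-cache reads, activations, and kernel overhead push measured numbers below the cap:

```python
def decode_tok_per_s_ceiling(weight_gib, hbm_gb_per_s=2039.0):
    # Memory-bound decode ceiling: one full weight read per output token.
    return hbm_gb_per_s * 1e9 / (weight_gib * 2**30)

print(decode_tok_per_s_ceiling(13.51))   # ~141 tok/s; BF16 measured 96.5
print(decode_tok_per_s_ceiling(3.88))    # ~489 tok/s; AWQ Marlin measured 218.9
```

Shrinking the weights is the only way to raise this ceiling — which is why the quantization kernel, not extra TFLOPS, moved the number.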
The optimization sequence matters. Fix in this order before buying hardware:
- Continuous batching — 8x throughput at the same latency, free
- Quantization kernel — 10x difference between naive AWQ and Marlin, one flag
- Quantization level — 2.27x BF16 → Marlin INT4, 3.5x smaller model
- Speculative decoding — latency benefit at low batch sizes only, requires same training family as target (shared vocabulary is necessary, not sufficient)
The gap between a misconfigured and well-configured inference stack on identical hardware exceeded 10x in measured results. The 9–13x throughput gap between a correctly configured A100 and Groq's LPU is architectural — not a configuration problem. Understanding which gap is which is the foundation of Phase 6.
Hardware accessed: Tenstorrent Wormhole N300S (Koyeb), AMD MI300X (RunPod + AMD Developer Cloud),
Groq LPU (GroqCloud API), Google TPU v5e (Google Colab)
Note: Direct measurement was not possible on Tenstorrent or AMD due to platform failures and driver
mismatches. Those results use clearly labeled community benchmarks. Groq and TPU v5e produced real
measurements. Deployment experience across all four is itself a primary finding.
Tenstorrent Wormhole N300S — BF16 Matmul + Inference
| Hardware | BF16 Peak | Memory BW | Achieved | Data Source |
|---|---|---|---|---|
| A100 SXM4 | ~312 TFLOPS | 2.0 TB/s HBM | ~250 TFLOPS | Real — Phase 2 |
| RTX 4090 | ~165 TFLOPS | 1.01 TB/s GDDR6X | ~130 TFLOPS | Real — Phase 1 |
| N300S Wormhole | ~233 TFLOPS | 576 GB/s GDDR6 | ~140–175 TFLOPS | Community benchmarks (corsix.org, FOSDEM 2025) |
Inference throughput — single-user, 7–8B model class, BF16:
| Hardware | Model | tok/s | Data Source |
|---|---|---|---|
| A100 SXM4 | Mistral 7B BF16 | 96.5 | Real — Phase 5 |
| N300S Wormhole | Llama 3.1 8B BF16 | ~24 | Tenstorrent official (hand-optimized) |
~4x throughput gap reflects the core constraint: Llama 3.1 8B (14GB) exceeds the N300S's 192MB SRAM and spills to GDDR6 at 576 GB/s — less than a third of A100 HBM bandwidth. TILE_LAYOUT mandatory at storage level adds real porting cost vs CUDA's row-major default. Koyeb deployment failures (instances 3f2265e1, 8ebd286b) prevented direct measurement.
AMD MI300X — Llama 70B Inference
| Hardware | Config | tok/s | Data Source |
|---|---|---|---|
| A100 SXM4 ×2 | TP=2 required | 21.2 (70B proxy) | Real — Phase 5 Part 3 |
| MI300X | TP=1 native | ~37 | Chips & Cheese; AMD MLPerf v4.1 |
192GB HBM3 is the architectural differentiator — the entire Llama 70B model fits on one chip, eliminating the AllReduce overhead that forces A100 and H100 onto TP=2. ~40% lower latency vs H100 on 70B decode comes entirely from eliminating that inter-GPU communication step.
Realized throughput: independent analysis (arxiv:2510.27583) found MI300X achieves 37–66% of H100/H200 throughput despite higher theoretical specs — ROCm kernel optimization still lags CUDA. RunPod deployment failed across three container configurations due to ROCm 6.1 userspace / 6.10.5 kernel driver mismatch. AMD Developer Cloud was at capacity.
Groq LPU — TTFT and Throughput Benchmark (real measurements)
Methodology: ~100 token prompt, 200 token output, temperature=0, streaming, 1 warmup excluded, 5 run mean. Identical to Phase 5 Part 3. Network latency to Groq datacenter: ~13ms round-trip from Austin, TX.
| Hardware | Model | tok/s | TTFT | Data Source |
|---|---|---|---|---|
| A100 SXM4 (local) | Llama 3.1 8B | 93.0 | 18.1ms | Real — Phase 5 Part 3 |
| A100 SXM4 (local) | Llama 3.3 70B | 21.2 | 58.0ms | Real — Phase 5 Part 3 |
| Cloud API median | Llama 3.1 8B | 154.8 | 930ms | ArtificialAnalysis |
| Cloud API median | Llama 3.3 70B | 85.5 | 1,410ms | ArtificialAnalysis |
| Groq LPU (API) | Llama 3.1 8B | 672.9 | 130.8ms | Real — this session |
| Groq LPU (API) | Llama 3.3 70B | 263.1 | 135.4ms | Real — this session |
| Groq LPU (spec dec) | Llama 3 70B | 1,665 | — | Groq published |
Groq vs cloud API median: 4.3x faster throughput on 8B, 3.1x on 70B, 7x lower TTFT on 8B. TTFT breakdown: ~6.5ms network, ~14ms LPU compute, ~110ms API infrastructure overhead. On-premise LPU compute TTFT (~14ms) actually beats A100 local (18ms) by 22% — the API overhead obscures this in cloud comparisons.
NVIDIA acquired Groq in December 2025. GroqCloud remains operational; long-term roadmap under NVIDIA ownership is uncertain.
Google TPU v5e — BF16 Matmul + Fused Attention (real measurements)
| Hardware | BF16 Peak | Achieved | Utilization | Data Source |
|---|---|---|---|---|
| A100 SXM4 | ~312 TFLOPS | ~250 TFLOPS | ~80% | Real — Phase 2 |
| RTX 4090 | ~165 TFLOPS | ~130 TFLOPS | ~79% | Real — Phase 1 |
| TPU v5e-1 | ~197 TFLOPS | 142.3 TFLOPS | 72.2% | Real — this session |
Fused attention (batch=4, heads=16, seq=2048, d_head=64, BF16): 2.60ms via jax.jit.
XLA automatically fused matmul+softmax+matmul into a single TPU kernel with no programmer
intervention — the same optimization Phase 3 required expert Triton kernel engineering to achieve.
Google Cloud provisioning failed across five zones (capacity errors, wrong accelerator type, quota limits). Accessed via Google Colab free tier — zero provisioning friction.
Every alternative hardware exercise encountered significant access friction. Every NVIDIA exercise in Phases 0–5 produced real measurements on the first attempt.
| Architecture | Access method | Outcome |
|---|---|---|
| Tenstorrent N300S | Koyeb cloud | Platform deployment failures — container never initialized |
| AMD MI300X | RunPod | ROCm userspace/kernel driver mismatch across 3 container configs |
| AMD MI300X | AMD Developer Cloud | At capacity |
| Groq LPU | GroqCloud API | Accessible — API only, no raw hardware access |
| Google TPU v5e | Google Cloud Console | Capacity errors, wrong zone errors, quota limits across 5 zones |
| Google TPU v5e | Google Colab | Accessible — free tier, zero friction |
| NVIDIA A100/RTX 4090 | RunPod | Deployed and benchmarking in under 20 minutes, every time |
This is not bad luck — it reflects 15 years of accumulated ecosystem investment. NVIDIA driver compatibility matrices are battle-tested across thousands of container configurations. Alternative hardware ecosystems are newer, thinner, and less validated in containerized cloud environments. Every hour debugging ROCm version mismatches or navigating TPU quota limits is engineering time that belongs in any TCO analysis alongside chip price and memory bandwidth.
Every accelerator architecture is an answer to the same question: where is the binding constraint? Tenstorrent bets it's on-chip memory bandwidth for small models. AMD bets it's memory capacity for large ones. Groq bets it's latency unpredictability. TPU bets it's the mismatch between general-purpose hardware and transformer-specific operations. Each is correct for a specific workload shape — and each requires real engineering work to access and operate, in a way NVIDIA currently does not.
NVIDIA's moat is not just TFLOPS or memory bandwidth. It is the accumulated infrastructure of developer convenience built over 15 years: validated container images, community-tested compatibility matrices, thousands of Stack Overflow answers for every failure mode, and a path from zero to running code measured in minutes. Alternative hardware vendors that close this gap — not just on specs but on developer experience — will be the ones that successfully challenge NVIDIA's position. Until then, ubiquity is the moat.
Hardware: A100 SXM4 80GB, CUDA 12.4, PyTorch 2.4.0, RunPod
Model: 30M parameter GPT-style transformer, trained from scratch on TinyShakespeare
Exercise 7.1 — Data Preparation
| Metric | Value |
|---|---|
| Raw characters | 1,115,394 |
| Total tokens (GPT-2 BPE) | 338,025 |
| Vocabulary size | 50,257 |
| Compression ratio | 3.30 chars/token |
| Train / val split | 304,222 / 33,803 tokens |
BPE compression ratio directly determines attention cost — shorter sequences mean cheaper O(N²) attention computation. uint16 storage (not int32) halves memory bandwidth requirements while covering the full 50,257 vocabulary.
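Two one-line checks behind that paragraph, computed from the table's own numbers:

```python
chars, tokens = 1_115_394, 338_025
ratio = chars / tokens
# Character-level tokenization would multiply N by ~3.3 and O(N^2) attention
# cost by ~10.9x per layer.
print(f"{ratio:.2f} chars/token -> {ratio**2:.1f}x attention cost if char-level")

# The full GPT-2 vocabulary fits in uint16 (2 bytes vs int32's 4).
assert 50_257 <= 2**16 - 1
```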
Exercise 7.2 — Model Implementation (nanoGPT)
| Component | Parameters |
|---|---|
| Token embedding (wte) | 19.3M (64% of total) |
| Position embedding (wpe) | 98K |
| 6 transformer blocks | 10.6M |
| Total | 30.02M |
Sanity check: initial loss 10.926 vs expected ln(50257) = 10.825 — gap of 0.10, within normal range for random initialization. Logits shape (4, 64, 50257) correct.
The embedding table dominates at small n_embd — 64% of a "30M parameter model" is a lookup table, not reasoning capacity. Weight tying (sharing embedding and output head weights) reduces parameters by 19.3M and empirically improves language modeling performance.
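The table's parameter budget can be reconstructed from the config. The values below are inferred, not quoted from the repo: n_embd = 19.3M / 50,257 ≈ 384, block_size = 98K / 384 ≈ 256, and nanoGPT's bias=False convention with weight tying is assumed (the tied output head adds zero parameters):

```python
def gpt_params(vocab=50_257, block_size=256, n_layer=6, n_embd=384):
    wte = vocab * n_embd                    # token embedding (tied output head)
    wpe = block_size * n_embd               # learned position embedding
    # Per block: QKV + attention proj (4*n_embd^2) + 4x-wide MLP (8*n_embd^2)
    # + two LayerNorm weight vectors (bias=False).
    block = 12 * n_embd**2 + 2 * n_embd
    total = wte + wpe + n_layer * block + n_embd   # + final LayerNorm
    return wte, wpe, n_layer * block, total

wte, wpe, blocks, total = gpt_params()
print(total, round(wte / total, 2))   # ~30.0M total, wte ~64% of it
```

Under these assumptions the breakdown lands on the table's figures: 19.3M / 98K / 10.6M / 30.02M.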
Exercise 7.3 — Overfitting: Capacity vs. Data Mismatch
30M parameter model on 304K training tokens (0.01 tokens/param):
| Iteration | Train loss | Val loss |
|---|---|---|
| 0 | 10.91 | 10.91 |
| 500 | 2.49 | 5.19 |
| 1,000 | 0.46 | 6.94 |
| 2,000 | 0.14 | 8.70 |
| 5,000 | 0.07 | 9.58 |
Generated output at iter 5,000: verbatim Romeo and Juliet Act IV. Complete memorization.
7.25M model (same dataset, 0.042 tokens/param): val loss floors at 5.70, generates novel Shakespeare-style text not present in training data.
Throughput delta: 7.25M at 389K tok/s vs 30M at 203K tok/s — nearly 2x, driven by smaller activation tensors fitting better in L2 cache. The memory hierarchy from Phase 1 reappears at the model architecture level.
Exercise 7.4 — Scaling Law Experiment
Four models trained at increasing scale, fixed dataset:
| Model | Non-embed params | Best val loss | Best at iter | Chinchilla starvation |
|---|---|---|---|---|
| nano | 98,944 | 5.3410 | 500 | 6.5x underfed |
| micro | 333,120 | 4.9774 | 750 | 21.9x underfed |
| mini | 788,736 | 4.7920 | 750 | 51.8x underfed |
| small | 2,659,200 | 4.8181 | 500 | 174.8x underfed |
Power law fit (nano, micro, mini): L(N) = 7.5929 × N^(-0.0321), fit error 1–2.4%. The small model broke the law — best val loss worse than mini despite 3.4x more parameters. Cause: 174.8x data-starved relative to the Chinchilla-optimal token count. The scaling law has a hidden precondition: each model must be trained to its optimal point. Violate that and the prediction fails.
Scaling exponent α = 0.0321 vs Kaplan et al. α = 0.076 — direction correct, rate slower. Every model was data-starved, compressing the loss range and flattening the curve. The exponent is dataset-dependent, not universal.
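The fit itself is a two-parameter regression in log-log space. A plain least-squares version — a sketch that will not reproduce the reported a = 7.5929, α = −0.0321 exactly, since the repo's fitting procedure isn't specified, but recovers the same shape within a couple of percent:

```python
import math

def fit_power_law(points):
    # Fit L(N) = a * N**alpha by linear least squares on (log N, log L).
    xs = [math.log(n) for n, _ in points]
    ys = [math.log(loss) for _, loss in points]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    alpha = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return math.exp(my - alpha * mx), alpha

pts = [(98_944, 5.3410), (333_120, 4.9774), (788_736, 4.7920)]  # nano/micro/mini
a, alpha = fit_power_law(pts)
print(a, alpha)   # small negative exponent: loss falls slowly with N
```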
Exercise 7.5b — Engineered Failure Modes
Deliberately removed residuals, LayerNorm, warmup, and gradient clipping to produce and characterize each failure mode.
Gradient death (residuals + LayerNorm both removed):
| Config | Residuals | LayerNorm | Result |
|---|---|---|---|
| Broken | OFF | OFF | Loss frozen at 10.81, grad_norm = 0.0 |
| Residuals only | ON | OFF | Learns but noisy, grad_norm spikes to 8.0 |
| LayerNorm only | OFF | ON | Learns smoothly, floors at 6.3 |
| Control | ON | ON | Fastest descent, lowest loss 5.50 |
BF16 underflow mechanism: sequential attenuation through 12 unnormalized layers drove activations toward zero until gradients read exactly 0.0 (BF16 keeps FP32's 8-bit exponent but only a 7-bit mantissa, so small contributions vanish well before its ~1e-38 underflow floor). Complete freeze from iteration 0 — not an explosion, a silent death.
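BF16's failure surface can be felt without a GPU: the format is an FP32 with the low 16 bits dropped, wide exponent range but only 7 mantissa bits. A pure-Python truncation model, illustrative of the format rather than the repo's instrumentation:

```python
import struct

def to_bf16(x):
    # bfloat16 = top 16 bits of an FP32: 1 sign + 8 exponent + 7 mantissa.
    # (Real hardware rounds-to-nearest; truncation is close enough to show
    # the effect.)
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFF0000))[0]

print(to_bf16(1.0 + 2**-8) == 1.0)   # True: contributions below ~1/256 of a
                                     # value are absorbed outright
print(to_bf16(1e-30) != 0.0)         # True: the exponent range reaches ~1e-38
```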
Warmup sensitivity (batch=128):
| Config | Final loss | vs control |
|---|---|---|
| No warmup | 4.2703 | +20% worse |
| Short warmup (50 iters) | 3.5801 | +4.5% worse |
| Full warmup (200 iters) | 3.4243 | baseline |
No-warmup damage is permanent and silent — loss goes down, training appears healthy, the 20% gap is only visible against a control run.
High LR (100x, AdamW): did not NaN. AdamW's adaptive per-parameter learning rates absorbed most of the raw LR magnitude increase. Effective step size scales sub-linearly — a 100x LR increase produced ~3-5x larger effective updates, not 100x. Damage occurred in the first 50 iterations and persisted to final loss despite cosine decay bringing LR near baseline by iter 275.
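The schedule whose absence cost 20% is a few lines. Hyperparameters below are illustrative nanoGPT-style defaults, not the repo's exact config; only the 200-iteration warmup comes from the table above:

```python
import math

def lr_at(it, max_lr=6e-4, min_lr=6e-5, warmup=200, max_iters=3000):
    # Linear warmup to max_lr, then cosine decay to min_lr. The warmup ramp
    # keeps early AdamW steps small while its second-moment estimates are
    # still noise.
    if it < warmup:
        return max_lr * (it + 1) / warmup
    t = min((it - warmup) / (max_iters - warmup), 1.0)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * t))
```

Deleting the `if` branch reproduces the no-warmup configuration that trained "normally" and landed 20% worse.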
Building a transformer from scratch closes the loop on everything in Phases 1–6. The KV cache in Phase 5 exists because token generation is a sequential loop — one forward pass per output token, 500 passes for a 500-token response. Flash Attention in Phase 3 exists because attention cost scales as O(N²) with sequence length, and the N here is the token count computed in Exercise 7.1. The memory hierarchy principle from Phase 1 reappeared in the throughput delta between the 30M and 7.25M models. The same roofline model that predicted Phase 5 quantization results predicted training throughput here.
The failure mode experiments revealed something counterintuitive: modern training infrastructure is collectively so robust that engineering visible failures at small scale required removing multiple components simultaneously. Each technique — AdamW, residuals, LayerNorm, gradient clipping — was added in response to a real failure at scale. The failures that required deliberate engineering to produce at 12M parameters on 300K tokens are routine operational concerns at 70B parameters on trillion-token datasets. Scale is the multiplier that makes each fragility relevant.
The most dangerous training failures are the silent ones. Gradient explosion produces NaN — immediately visible. Warmup misconfiguration produces a model that trains normally, converges cleanly, and is permanently 20% worse than it should be. Without a control run, you would never know. The scaling law has the same property: it predicts correctly when models are trained to their optimal point, and breaks silently when they are not. Log gradient norms alongside loss on every run. Track best val loss and checkpoint accordingly. Run a control. The control is not overhead — it is the only way to detect silent degradation.