Hardware tuning¶
This guide is about getting peak throughput from whatever GPUs you have. It focuses on NVIDIA data-center GPUs (V100 / A100 / H100), which are the common targets for LLM pretraining, but the config advice generalizes.
TL;DR¶
- GPT-Simple already picks a good attention kernel and precision via PyTorch SDPA + Accelerate. On modern hardware the main wins are free given a recent PyTorch and a sensible config.
- Keep
mixed_precision: bf16(ornullfor auto),attention_mode: causal, andcompile: true, then pushper_device_batch_sizeup. - fp8 (Hopper) is a further ~1.5–2× but is not free — it needs extra integration that GPT-Simple does not ship today.
The hardware hierarchy¶
For LLM pretraining (bf16/fp16, seq_len ≈ 2k–8k, matmul-dominated):
| GPU | Arch | bf16 tensor-core throughput | HBM bandwidth | Memory | Practical training factor |
|---|---|---|---|---|---|
| V100 (32GB) | Volta (sm_70) | ~125 TFLOPS (fp16) | 0.9 TB/s | 32 GB | ~0.35–0.5× A100 |
| A100 (80GB) | Ampere (sm_80) | ~312 TFLOPS | 2.0 TB/s | 80 GB | 1.0× (baseline) |
| H100 (80GB) | Hopper (sm_90) | ~756 TFLOPS dense | 3.35 TB/s | 80 GB | 2.5–3× A100 (bf16) |
| ~1500 TFLOPS (fp8) | 3–5× A100 (fp8) |
The "practical training factor" is end-to-end on realistic workloads (attention, optimizer step, communication, dataloader), not isolated kernel benchmarks.
Notes:
- V100 lacks FlashAttention-2 — it falls back to memory-efficient attention, which is meaningfully slower for long sequences.
- A100 → H100 is the largest single-generation jump in the LLM era, in both tensor-core throughput and bandwidth.
- fp8 is Hopper-only — Ampere has no native fp8 tensor cores.
Why newer GPUs are faster¶
In rough order of impact for LLM training:
- More tensor-core throughput per cycle — Hopper produces ~2.4× more bf16 FLOPS/clock than Ampere; this dominates compute-bound kernels.
- Higher HBM bandwidth — the memory-bound ops (optimizer step, normalization, residual adds) speed up roughly linearly with bandwidth.
- FlashAttention-3 (Hopper, opt-in) — uses Hopper-specific hardware (TMA, WGMMA). Not engaged by default (see below).
- fp8 tensor cores (Hopper) — double the bf16 throughput at the cost of per-tensor scaling.
- Faster interconnect and larger L2 — cheaper all-reduce and better attention block reuse.
Config knobs that matter¶
GPT-Simple picks good defaults; these are the ones worth thinking about.
Mixed precision — bf16 on Ampere and newer¶
training:
mixed_precision: bf16 # or null to auto-detect
null) picks bf16 on Ampere+, fp16 on V100/T4, no mixed
precision on CPU. Prefer bf16 over fp16 where available: it has the same
exponent range as fp32 and needs no loss scaling.
Attention — keep causal unless you need a custom mask¶
model:
attention_mode: causal
causal calls SDPA with is_causal=True, which dispatches to
FlashAttention-2 on both A100 and H100 — no mask tensor materialized.
sdpa_mask and flex are slower fast-path-wise and only worth it when
you genuinely need per-document or custom masking (see
Architecture).
Compile — more impactful on faster GPUs¶
training:
compile: true
compile typically gains ~10–20% on A100 and ~20–30% on H100.
Batch size — push it up on faster GPUs¶
With a faster step, the same 80 GB can feed a larger batch:
training:
per_device_batch_size: 32 # vs 16 for the same model on A100
gradient_accumulation_steps: 1 # compensate to keep global batch constant
Dataloader workers — scale with step speed¶
data:
num_workers: 8
tokens_per_second plateaus as you raise per_device_batch_size, the
dataloader is the bottleneck — raise num_workers, then prefer
pretokenized over jsonl (no on-the-fly tokenization).
FlashAttention-3 (Hopper, advanced)¶
FA-3 is not engaged by default, even on H100. On a modern PyTorch,
scaled_dot_product_attention(..., is_causal=True) dispatches to bundled
FA-2 on both Ampere and Hopper. The Hopper FA-3 kernel lives in cuDNN
under the CUDNN_ATTENTION SDPA backend, but FLASH_ATTENTION outranks
it in SDPA's selection priority, so cuDNN only fires when explicitly
forced.
Indicative per-call cost at B=2, H=12, S=2048, D=64, bf16, is_causal
(PyTorch 2.6):
| Setup | µs/call | vs A100 |
|---|---|---|
| A100, default (FA-2) | ~149 | 1.00× |
| H100, default (FA-2 on better silicon) | ~76 | 1.96× |
| H100, forced cuDNN (FA-3) | ~52 | 2.84× |
The headline 2.5–3× end-to-end H100 win comes from FA-2 on Hopper
hardware, not FA-3. The marginal FA-3 win (~1.45× at the attention call)
is roughly 10–15% end-to-end at 125M/seq=2048 — bigger at longer
sequences. Engaging it requires patching the SDPA call site in
src/gpt_simple/model.py to wrap it in
sdpa_kernel([SDPBackend.CUDNN_ATTENTION]) (with a probe-once-at-init or
try/except fallback, since sdpa_kernel takes a set of allowed
backends and FLASH still outranks CUDNN within it). GPT-Simple does not
do this by default.
fp8 — the bigger Hopper win, with engineering cost¶
fp8 (E4M3/E5M2) offers another ~1.5–2× over bf16 on H100. The silicon is
free; the software is not. It requires per-tensor scaling factors, fp8-aware
layer wrapping, recipe management (which layers run fp8 vs bf16), and
careful numerics validation. The two routes are NVIDIA's
Transformer Engine
(production-grade — swap nn.Linear/norm/attention for te equivalents
and wrap the step in te.fp8_autocast) or PyTorch's native fp8 dtypes
(experimental). GPT-Simple ships neither integration today.
It is worth the effort only for long, expensive runs on large models (≥1B, ideally ≥7B) where attention + matmul dominate step time; skip it for models below ~350M or while still iterating on architecture/data.
Verifying you're on the fast path¶
It's easy to think you're running FA + bf16 + compile while a flag is wrong. Quick checks:
python -c "import torch; print(torch.__version__, torch.version.cuda)"
Confirm SDPA picks a FlashAttention backend for your shapes:
import torch
from torch.nn.attention import SDPBackend, sdpa_kernel
q = k = v = torch.randn(2, 16, 2048, 64, device="cuda", dtype=torch.bfloat16)
with sdpa_kernel([SDPBackend.FLASH_ATTENTION]):
out = torch.nn.functional.scaled_dot_product_attention(q, k, v, is_causal=True)
print("FA backend works:", out.shape)
Identify which kernel actually fires by profiling and reading the CUDA
trace (torch.profiler). Kernel-name fragments:
| Fragment | Backend | Where |
|---|---|---|
pytorch_flash::flash_fwd_kernel |
FA-2 | A100/H100 default |
cudnn_generated_..._sdpa_sm90_... |
FA-3 (cuDNN) | H100, only when forced |
fmha_cutlassF_... |
memory-efficient (CUTLASS) | FA disabled/ineligible |
anything with math |
naive eager | last-resort — investigate |
Throughput sanity ranges¶
At step ~50 (past warmup), for a 125M model at seq=2048, 8 GPUs, bf16 + FA + compile:
- 8× V100: ~100–150 k tok/s
- 8× A100: ~400–600 k tok/s
- 8× H100: ~1.2–1.8 M tok/s
Far below these? Check, in order: GPU SM utilization (nvidia-smi),
num_workers/dataloader timing, whether compile actually engaged, then
SDPA backend selection.
Migrating a run between GPU generations¶
The checkpoint is portable, so migrating (e.g. A100 → H100) is mostly a config and launch change:
- Don't change
model.*ordata.*— keep the architecture and dataset identical so the checkpoint stays compatible. - Update the launch/scheduler script for the new partition and GPU
count per node (your scheduler's account/constraint/queue and
--gres). - Bump
per_device_batch_sizetoward ~85% memory. - Adjust
gradient_accumulation_stepsto keep the global batch constant. - Increase
num_workers. - Keep
mixed_precision: bf16,attention_mode: causal,compile: true. - Resume from the same checkpoint —
resume: autowith the sameoutput_dircontinues the run; a change in GPU count per node is handled by the topology-agnostic data cursors (see Checkpointing & resume). - Watch
tokens_per_second— expect roughly the per-generation factor from the table; if you see less, work down the verification checklist above.
For a concrete, site-specific worked example (partition names, module
loads, scheduler headers), see the notes alongside the templates in
examples/orchestrators/.