Inference¶
Two CLI commands run trained checkpoints:
- generate: one model, one set of sampling parameters, many prompts.
- batch-generate: a self-describing JSONL where each record can name its own model and sampling parameters (eval sweeps, multi-checkpoint comparisons).
The same functionality is available from Python through gpt_simple
(generate, load_for_inference, validate_checkpoint).
Pointing at a checkpoint¶
Anywhere a model location is accepted, you can pass either:
- a run directory (output_dir): the latest checkpoint under checkpoints/ is selected automatically (same rule as resume: auto); or
- a specific checkpoint directory (.../checkpoint-N).
The tokenizer is found automatically next to the checkpoint or at the run
root; override it with --tokenizer. Paths may contain ~ and
environment variables (e.g. $WORK), which are expanded even inside a
JSONL.
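The resolution rule can be pictured with a small standalone sketch. It is not the code gpt_simple runs: the helper names are hypothetical, and it assumes that "latest" means the checkpoint-N directory with the highest numeric N.

```python
import os
import re
from pathlib import Path

_CKPT = re.compile(r"checkpoint-(\d+)$")


def expand(location: str) -> Path:
    """Expand environment variables (e.g. $WORK) and a leading ~."""
    return Path(os.path.expanduser(os.path.expandvars(location)))


def resolve_checkpoint(location: str) -> Path:
    """Map a run directory or a checkpoint directory to a checkpoint directory."""
    path = expand(location)
    if _CKPT.search(path.name):
        return path  # already a specific .../checkpoint-N directory

    # Assumption: "latest" is the highest N, compared numerically so that
    # checkpoint-100 sorts after checkpoint-9.
    candidates = [
        (int(match.group(1)), child)
        for child in (path / "checkpoints").glob("checkpoint-*")
        if child.is_dir() and (match := _CKPT.search(child.name))
    ]
    if not candidates:
        raise FileNotFoundError(f"no checkpoint-N directory under {path / 'checkpoints'}")
    return max(candidates, key=lambda item: item[0])[1]
```

Comparing N as an integer matters: a plain string sort would rank checkpoint-9 above checkpoint-100.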
generate¶
gpt-simple generate \
--output-dir ./outputs \
--prompt "Once upon a time" \
--max-new-tokens 200
Output is JSONL on stdout by default (pipe to jq -r .completion for just
the text); --output FILE.jsonl writes to a file. Feed many prompts with
--prompts-file prompts.jsonl (one {"prompt": ...} per line). Each
output record echoes the input and adds completion plus a generation
block recording the sampling parameters used.
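A prompts file is easy to produce and consume from a script. The sketch below uses only the standard library; the helper names are hypothetical, and it assumes each output record carries the echoed prompt field alongside completion, as described above.

```python
import json
from pathlib import Path


def write_prompts(prompts, path):
    """Write one {"prompt": ...} JSON object per line."""
    with Path(path).open("w", encoding="utf-8") as fh:
        for text in prompts:
            fh.write(json.dumps({"prompt": text}, ensure_ascii=False) + "\n")


def read_completions(path):
    """Yield (prompt, completion) pairs from a generate output file."""
    with Path(path).open(encoding="utf-8") as fh:
        for line in fh:
            if line.strip():  # tolerate blank lines
                record = json.loads(line)
                yield record["prompt"], record["completion"]
```

Using json.dumps rather than string formatting keeps quotes and newlines inside a prompt correctly escaped, so each record stays on one line.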
batch-generate¶
Use this when records need different models or sampling. Each line is
self-describing; only prompt is required, everything else falls back to
the CLI defaults:
{"id": "ex-1", "prompt": "Once upon a time", "model": {"checkpoint": "./runs/r1/checkpoints/checkpoint-12000", "dtype": "bf16"}, "generation": {"max_new_tokens": 200, "temperature": 0.8, "top_k": 50}}
{"id": "ex-2", "prompt": "def fibonacci(n):", "model": {"output_dir": "./runs/code-model"}, "generation": {"greedy": true}}
gpt-simple batch-generate \
--input setup.jsonl \
--output completions.jsonl \
--output-dir ./runs/r1 \
--temperature 0.8 --top-k 50
Records that share the same (model, dtype, tokenizer) are grouped so
each model is loaded once (a multi-billion-parameter checkpoint costs
minutes to load), and only one model is held in memory at a time. Output
order always matches input order.
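The grouping idea can be sketched in a few lines. This is an illustration rather than the actual implementation: run_grouped and the complete callback are hypothetical names, and the key is built from the per-record model block only.

```python
from collections import defaultdict


def model_key(record):
    """Build the (model, dtype, tokenizer) grouping key for one record."""
    model = record.get("model", {})
    location = model.get("checkpoint") or model.get("output_dir")
    return (location, model.get("dtype"), model.get("tokenizer"))


def run_grouped(records, complete):
    """Process records group by group, returning results in input order.

    `complete(key, batch)` is a hypothetical callback that loads the model
    identified by `key` once and returns one result per record in `batch`.
    """
    groups = defaultdict(list)
    for index, record in enumerate(records):
        groups[model_key(record)].append(index)

    results = [None] * len(records)
    for key, indices in groups.items():
        outputs = complete(key, [records[i] for i in indices])
        for i, output in zip(indices, outputs):
            results[i] = output  # slot each result back at its input position
    return results
```

Remembering each record's original index is what lets the groups run in any order while the output still lines up with the input.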
Per-record model accepts checkpoint or output_dir, plus optional
dtype and tokenizer. The generation block accepts the sampling
parameters below.
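For an eval sweep, the input file is usually generated rather than written by hand. The following sketch builds one record per (prompt, checkpoint, temperature) combination using the record layout shown above; the function names and the id scheme are illustrative assumptions.

```python
import json
from itertools import product


def sweep_records(prompts, checkpoints, temperatures, max_new_tokens=200):
    """Yield one batch-generate record per combination of the inputs."""
    combos = product(enumerate(prompts), enumerate(checkpoints), temperatures)
    for (p_idx, prompt), (c_idx, checkpoint), temperature in combos:
        yield {
            "id": f"p{p_idx}-c{c_idx}-t{temperature}",  # hypothetical id scheme
            "prompt": prompt,
            "model": {"checkpoint": checkpoint},
            "generation": {
                "max_new_tokens": max_new_tokens,
                "temperature": temperature,
            },
        }


def to_jsonl(records):
    """Serialize records as JSONL text, one record per line."""
    return "".join(json.dumps(record) + "\n" for record in records)
```

Writing the result of to_jsonl(sweep_records(...)) to a file gives an input suitable for --input.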
Sampling parameters¶
| Parameter | Meaning |
|---|---|
| max_new_tokens | Number of tokens to generate. |
| temperature | Softmax temperature; lower is more deterministic. |
| top_k | Keep only the top-k logits before sampling. |
| top_p | Nucleus sampling: keep the smallest set of tokens with cumulative probability ≥ p. |
| greedy / do_sample | Greedy (argmax) decoding instead of sampling. |
| repetition_penalty | Penalize already-generated tokens (1.0 = off). |
| seed | Seed for reproducible sampling. |
| dtype | Load/compute dtype: fp32, fp16/half, bf16. |
| return_full_text | Include the prompt in the output (default: completion only). |
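To make the interaction between these parameters concrete, here is a pure-Python sketch of choosing one token from a list of logits. It is a textbook rendering, not gpt_simple's sampler: the order of operations (top-k, then temperature, then top-p) is an assumption, repetition_penalty is omitted, and temperature must be greater than zero.

```python
import math
import random


def sample_token(logits, temperature=1.0, top_k=None, top_p=None,
                 greedy=False, seed=None):
    """Pick one token id from raw logits (illustrative sketch)."""
    if greedy:
        return max(range(len(logits)), key=logits.__getitem__)  # argmax

    # Rank token ids by logit, highest first, then keep the top-k.
    ranked = sorted(range(len(logits)), key=logits.__getitem__, reverse=True)
    if top_k is not None:
        ranked = ranked[:top_k]

    # Temperature-scaled softmax over the surviving tokens.
    scaled = [logits[i] / temperature for i in ranked]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    total = sum(weights)
    probs = [w / total for w in weights]

    # Nucleus: smallest prefix whose cumulative probability reaches top_p.
    if top_p is not None:
        cumulative, cutoff = 0.0, len(probs)
        for count, p in enumerate(probs, start=1):
            cumulative += p
            if cumulative >= top_p:
                cutoff = count
                break
        ranked, probs = ranked[:cutoff], probs[:cutoff]

    # random.choices accepts unnormalized weights, so no renormalization needed.
    return random.Random(seed).choices(ranked, weights=probs, k=1)[0]
```

With a fixed seed the draw is repeatable, which is the purpose of the seed parameter; top_k=1 and greedy both reduce to picking the highest logit.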
Pre-flight validation (--dry-run)¶
Before any weights load, batch-generate parses every record, checks the
sampling-parameter ranges, and confirms each distinct checkpoint resolves
(config parses, weights and tokenizer present). --dry-run stops there
and prints the execution plan without touching a GPU — safe to run on a
login node to gate a job submission:
gpt-simple batch-generate --input setup.jsonl --output-dir ./runs/r1 --dry-run
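A submission script can gate on the dry run's exit status. The helper below is a hypothetical sketch; it assumes that --dry-run exits non-zero on the same structural problems as a real run, and it uses only the flags shown in the command above.

```python
import subprocess


def preflight_ok(command):
    """Run a command and report whether it exited with status 0."""
    return subprocess.run(command, check=False).returncode == 0


# Same dry-run invocation as above, expressed as an argument list.
DRY_RUN = [
    "gpt-simple", "batch-generate",
    "--input", "setup.jsonl",
    "--output-dir", "./runs/r1",
    "--dry-run",
]

# Usage: submit the job only when preflight_ok(DRY_RUN) returns True.
```

Passing the command as a list avoids shell quoting issues with paths that contain spaces.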
Structural problems (bad JSON, missing prompt, out-of-range parameters,
a missing checkpoint) are hard errors: the job exits non-zero with
nothing loaded. Failures that only surface during generation are
soft: that record gets an error field instead of completion and
the run continues.
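Because soft failures stay in the output file, a downstream script should separate them from successful records. A minimal sketch, assuming a failed record is identified by the presence of an error key (split_results is a hypothetical helper):

```python
import json


def split_results(lines):
    """Split output JSONL lines into (succeeded, failed) record lists."""
    succeeded, failed = [], []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        # A soft failure carries an "error" field instead of "completion".
        (failed if "error" in record else succeeded).append(record)
    return succeeded, failed
```

The failed list can then be written back out as a new input file to retry only those records.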
The same checks are available programmatically via
gpt_simple.validate_checkpoint(path, ...), which returns the resolved
locations without loading weights or touching CUDA.
Scaling out¶
batch-generate is a plain CLI entry point with no orchestrator
assumptions. To fan out across a job array, shard the input by model
(one model per task) so no checkpoint is loaded twice. See
examples/orchestrators/ for inference job
templates.
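Sharding by model can be done with a short script before submission. The sketch below is one possible approach, not part of gpt_simple: the function name and the shard-NNN.jsonl naming are hypothetical, and records without a model block are collected into a single shard that relies on the CLI defaults.

```python
import json
from collections import defaultdict
from pathlib import Path


def shard_by_model(input_path, shard_dir):
    """Write one JSONL shard per distinct model; return the shard paths."""
    shards = defaultdict(list)
    with Path(input_path).open(encoding="utf-8") as fh:
        for line in fh:
            if not line.strip():
                continue
            model = json.loads(line).get("model", {})
            key = (model.get("checkpoint") or model.get("output_dir")
                   or "<cli-default>")
            shards[key].append(line if line.endswith("\n") else line + "\n")

    shard_dir = Path(shard_dir)
    shard_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    # Shard N is intended for array task N; dicts keep first-seen model order.
    for task_id, lines in enumerate(shards.values()):
        path = shard_dir / f"shard-{task_id:03d}.jsonl"
        path.write_text("".join(lines), encoding="utf-8")
        paths.append(path)
    return paths
```

Each array task would then pass its own shard to --input. Record order is preserved within a shard but not across shards, so keep the id field if results need to be merged afterwards.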
Authoritative source: src/gpt_simple/generate.py,
src/gpt_simple/cli/batch_generate_cmd.py.