Data pipeline
GPT-Simple trains on causal-LM token streams. It supports two input formats and handles tokenization, long-document windowing, sequence packing, and (optionally) curriculum mixing.
Two formats
| Format | What it is | Use it when |
|---|---|---|
| pretokenized (default) | Memory-mapped .bin/.idx shards, tokenized ahead of time. | Real training runs: no per-step tokenization or JSON parsing overhead, and it supports curriculum learning and deterministic resume. |
| jsonl | Raw .jsonl files (one {"text": ...} per line), tokenized on the fly. | Quick experiments and small datasets. No curriculum support. |
Set the format with data.format. See Configuration
for all data fields.
Pretokenizing
Convert raw text into binary shards once, up front:
gpt-simple tokenize \
--input_dir ./raw_data \
--output_dir ./data/tokenized \
--tokenizer_path gpt2 \
--max_length 2048 \
--overlap_size 256 \
--num_workers 8
Inputs are .jsonl or .jsonl.gz files with a text field. Each output
shard is a .bin + .idx pair.
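As a rough illustration of the input side, the sketch below writes and reads files in the shape described above: one JSON object with a text field per line, optionally gzip-compressed. The helper names (write_jsonl, read_texts, jsonl_roundtrip_demo) are hypothetical and are not part of gpt-simple.

```python
import gzip
import json
import os
import tempfile


def write_jsonl(path, texts):
    """Write one {"text": ...} object per line; gzip if the path ends in .gz."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "wt", encoding="utf-8") as f:
        for text in texts:
            f.write(json.dumps({"text": text}) + "\n")


def read_texts(path):
    """Yield the text field of every non-empty line."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)["text"]


def jsonl_roundtrip_demo():
    """Write a tiny .jsonl.gz file to a temporary directory and read it back."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "part-0000.jsonl.gz")  # hypothetical file name
        write_jsonl(path, ["first document", "second document"])
        return list(read_texts(path))
```

Files written this way could then be placed in the directory passed as --input_dir.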
Binary format
- .bin: a flat array of token IDs (uint16, or uint32 for large vocabularies). Documents and windows are concatenated back-to-back, each terminated with an end-of-document (EOD) token.
- .idx: an index with a 16-byte header (magic GPTS, version, dtype code, document count), an int64 offsets array (token-level start of each entry, plus a final total-count sentinel), and a per-entry uint16 overlap-prefix length (how many leading tokens are windowing overlap and must be masked in the loss).
At training time the .bin is memory-mapped, so shards far larger than
RAM stream with constant memory.
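The layout above can be sketched with NumPy and struct. This is a minimal illustration, not the real on-disk format: the text only states that the header is 16 bytes and which fields it holds, so the field widths (a 4-byte magic followed by three little-endian uint32 values) and the dtype codes below are assumptions, and all function names are hypothetical.

```python
import os
import struct
import tempfile

import numpy as np

# ASSUMPTION: field widths and dtype codes are illustrative only.
HEADER = struct.Struct("<4sIII")        # magic, version, dtype code, entry count
DTYPES = {1: np.uint16, 2: np.uint32}   # hypothetical dtype codes


def write_shard(prefix, entries, overlaps, eod, dtype_code=1):
    """Write entries (lists of token IDs) as <prefix>.bin and <prefix>.idx."""
    dtype = DTYPES[dtype_code]
    # Entries are stored back-to-back, each terminated with the EOD token.
    flat = np.concatenate([np.asarray(e + [eod], dtype=dtype) for e in entries])
    flat.tofile(f"{prefix}.bin")
    # Offsets: token-level start of each entry plus a final total-count sentinel.
    offsets = np.zeros(len(entries) + 1, dtype=np.int64)
    offsets[1:] = np.cumsum([len(e) + 1 for e in entries])
    with open(f"{prefix}.idx", "wb") as f:
        f.write(HEADER.pack(b"GPTS", 1, dtype_code, len(entries)))
        f.write(offsets.tobytes())
        f.write(np.asarray(overlaps, dtype=np.uint16).tobytes())


def read_shard(prefix):
    """Return (memory-mapped tokens, offsets, overlap-prefix lengths)."""
    with open(f"{prefix}.idx", "rb") as f:
        magic, _version, dtype_code, count = HEADER.unpack(f.read(HEADER.size))
        if magic != b"GPTS":
            raise ValueError("not a GPTS index file")
        offsets = np.frombuffer(f.read(8 * (count + 1)), dtype=np.int64)
        overlaps = np.frombuffer(f.read(2 * count), dtype=np.uint16)
    # np.memmap maps the file lazily; only the pages that are touched get read.
    tokens = np.memmap(f"{prefix}.bin", dtype=DTYPES[dtype_code], mode="r")
    return tokens, offsets, overlaps


def get_entry(tokens, offsets, i):
    """Slice entry i (including its trailing EOD) out of the flat token array."""
    return tokens[offsets[i]:offsets[i + 1]]


def shard_roundtrip_demo():
    """Write a toy shard to a temporary directory and read entry 1 back."""
    with tempfile.TemporaryDirectory() as tmp:
        prefix = os.path.join(tmp, "shard_00000")  # hypothetical shard name
        write_shard(prefix, [[5, 6, 7], [8, 9]], overlaps=[0, 0], eod=0)
        tokens, offsets, _overlaps = read_shard(prefix)
        entry = get_entry(tokens, offsets, 1).tolist()
        del tokens  # release the mapping before the directory is removed
        return offsets.tolist(), entry
```

Because the offsets array carries a final sentinel, entry i is always the slice between offsets[i] and offsets[i + 1], with no special case for the last entry.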
Document windowing
Documents longer than max_length are split into windows. Consecutive
windows can share an overlap_size-token overlap so the model still sees
local context across the cut. The overlapping prefix of each later window
is masked out of the loss (label -100) to avoid training on the same
tokens twice. overlap_size must be at most half of max_length.
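The windowing rule can be sketched as follows. This is an illustration under stated assumptions rather than the code in pretokenize.py: the function name window_document is hypothetical, labels are taken to be an unshifted copy of the inputs (with the causal shift applied later by the model or loss), and EOD handling is left out.

```python
IGNORE_INDEX = -100  # label value excluded from the loss


def window_document(tokens, max_length, overlap_size):
    """Split tokens into windows; return a list of (window, labels) pairs.

    Every window after the first starts overlap_size tokens before the end
    of the previous one, and that overlapping prefix is masked in its labels.
    """
    if max_length <= 0:
        raise ValueError("max_length must be positive")
    if not 0 <= overlap_size <= max_length // 2:
        raise ValueError("overlap_size must be at most half of max_length")
    if not tokens:
        return []
    windows = []
    start = 0
    while True:
        end = min(start + max_length, len(tokens))
        window = tokens[start:end]
        # The first window has no overlap; later windows repeat overlap_size tokens.
        prefix = 0 if start == 0 else overlap_size
        labels = [IGNORE_INDEX] * prefix + window[prefix:]
        windows.append((window, labels))
        if end == len(tokens):
            break
        start = end - overlap_size
    return windows
```

For a 10-token document with max_length 6 and overlap_size 2, this yields two windows covering tokens 0 to 5 and 4 to 9. The second window's first two labels are -100, so each token contributes to the loss exactly once.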
Sequence packing
With packing: true, multiple shorter documents are packed into a single
max_length sequence to minimize padding (length-binned greedy packing).
Within a packed sequence:
- positions are continuous (RoPE is not reset at document boundaries);
- the token after each EOD is masked in the loss;
- with attention_mode of sdpa_mask or flex, per-token doc_ids prevent attention across document boundaries (see Architecture).
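The three properties above can be illustrated with a small sketch. It uses plain first-fit-decreasing packing as a stand-in for the length-binned greedy packer, whose details are not specified here. All function names are hypothetical, labels are assumed to be an unshifted copy of the tokens, and padding short sequences up to max_length is left to the caller.

```python
IGNORE_INDEX = -100  # label value excluded from the loss


def pack_documents(docs, max_length):
    """First-fit-decreasing packing: returns bins, each a list of documents."""
    bins = []
    for doc in sorted(docs, key=len, reverse=True):
        if len(doc) > max_length:
            raise ValueError("document is longer than max_length; window it first")
        for b in bins:
            if sum(len(d) for d in b) + len(doc) <= max_length:
                b.append(doc)
                break
        else:
            bins.append([doc])  # no existing bin had room
    return bins


def build_sequence(bin_docs, eod):
    """Concatenate one bin into tokens, doc_ids, positions and labels."""
    tokens, doc_ids = [], []
    for doc_id, doc in enumerate(bin_docs):
        tokens.extend(doc)
        doc_ids.extend([doc_id] * len(doc))
    positions = list(range(len(tokens)))  # continuous, not reset per document
    labels = list(tokens)
    for i in range(1, len(tokens)):
        if tokens[i - 1] == eod:
            labels[i] = IGNORE_INDEX      # mask the token that follows an EOD
    return {"tokens": tokens, "doc_ids": doc_ids,
            "positions": positions, "labels": labels}


def attention_allowed(doc_ids):
    """Causal mask that also blocks attention across document boundaries."""
    n = len(doc_ids)
    return [[j <= i and doc_ids[i] == doc_ids[j] for j in range(n)]
            for i in range(n)]
```

In the mask, row i marks which earlier positions token i may attend to: only those at or before i that share its doc_id.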
Curriculum learning
A curriculum trains through ordered phases, each mixing data buckets at
chosen ratios for a token budget (pretokenized format only). Buckets are
subdirectories of data.path:
<data.path>/
├── train/
│   ├── web/    *.bin *.idx
│   ├── code/   ...
│   ├── math/   ...
│   └── wiki/   ...
└── val/
    └── ...
data:
  path: ./data/tokenized
  format: pretokenized
  curriculum:
    - duration_tokens: 5_000_000_000
      mix: {web: 0.6, code: 0.2, math: 0.1, wiki: 0.1}
    - duration_tokens: 5_000_000_000
      mix: {web: 0.3, code: 0.3, math: 0.2, wiki: 0.2}
Bucket selection uses a counter-based PRNG so the exact mix is reproducible and resumable.
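To make the idea of a counter-based PRNG concrete, here is a minimal sketch in which each draw is a pure function of (seed, counter), so resuming only requires restoring the counter. The generator the loader actually uses is not specified here: hashing with BLAKE2b is a swapped-in stand-in, and the function names are hypothetical.

```python
import hashlib


def uniform_from_counter(seed, counter):
    """Map (seed, counter) to a float in [0, 1) with no hidden state."""
    digest = hashlib.blake2b(f"{seed}:{counter}".encode(), digest_size=8).digest()
    # Keep the top 53 bits so the result is an exact double below 1.0.
    return (int.from_bytes(digest, "big") >> 11) / 2**53


def pick_bucket(mix, seed, counter):
    """Choose a bucket name with probability proportional to its mix weight."""
    names = sorted(mix)  # fixed order, so the choice is reproducible
    total = sum(mix[name] for name in names)
    threshold = uniform_from_counter(seed, counter) * total
    cumulative = 0.0
    for name in names:
        cumulative += mix[name]
        if threshold < cumulative:
            return name
    return names[-1]  # guards against floating-point round-off


def phase_for(curriculum, tokens_seen):
    """Return the phase that is active after tokens_seen tokens."""
    for phase in curriculum:
        if tokens_seen < phase["duration_tokens"]:
            return phase
        tokens_seen -= phase["duration_tokens"]
    return curriculum[-1]
```

Because no state is carried between draws, the sequence of buckets from counter 50 onward is identical whether it is generated in one run or after a restart at counter 50.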
Bucket exhaustion
When a phase asks for more of a bucket than exists, that bucket runs dry.
What happens then is controlled by data.allow_bucket_exhaustion, which
expresses intent and is enforced at two stages:
| Stage | allow_bucket_exhaustion: false (default) | allow_bucket_exhaustion: true |
|---|---|---|
| Validation (gpt-simple validate) | A predicted shortfall (curriculum demand > inventory) is a blocking error. | The shortfall is a warning; the run proceeds. |
| Runtime (a bucket actually runs dry) | The trainer halts: it saves a checkpoint and reports status halted, rather than silently changing the mix. | The loader drops the exhausted bucket and renormalizes the remaining weights; training continues. |
A single bucket emptying never silently alters your mix. In a correctly validated default run no bucket should exhaust at all, so the runtime halt is a safety net for an estimate that was off or for resume-time drift. The halt is coordinated across ranks and fires at the first sign of drift (the first worker slot to drain the bucket).
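The two checks in the table can be sketched in a few lines. This is an illustration only: it assumes demand is the sum over phases of ratio times duration_tokens, that inventory is a per-bucket token count, and the function names are hypothetical. How gpt-simple validate actually estimates demand may differ.

```python
def predicted_shortfalls(curriculum, inventory):
    """Return {bucket: missing_tokens} where curriculum demand exceeds inventory."""
    demand = {}
    for phase in curriculum:
        for bucket, ratio in phase["mix"].items():
            demand[bucket] = demand.get(bucket, 0.0) + ratio * phase["duration_tokens"]
    return {
        bucket: need - inventory.get(bucket, 0)
        for bucket, need in demand.items()
        if need > inventory.get(bucket, 0)
    }


def renormalize(mix, exhausted):
    """Drop the exhausted bucket and rescale the remaining weights to sum to 1."""
    remaining = {b: w for b, w in mix.items() if b != exhausted}
    total = sum(remaining.values())
    if total <= 0:
        raise ValueError("no buckets with positive weight remain")
    return {b: w / total for b, w in remaining.items()}
```

With the two-phase curriculum shown earlier, math needs about 1.5 billion tokens in total, so an inventory of 1 billion math tokens would be reported as a shortfall of about 0.5 billion. If math then ran dry during the first phase with the flag set, the renormalized mix would be roughly web 0.667, code 0.222, wiki 0.111.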
To continue past a halt with a renormalized mix, resume with the flag set:
gpt-simple train --config config.yaml --data.allow_bucket_exhaustion true
The halted status is terminal: the auto-resume orchestrators treat it as
a stop and do not resubmit, so the run doesn't loop on the same
exhaustion (see Orchestration). allow_bucket_exhaustion is the same
flag you would already need to set to pass validation for a deliberate
drain. A budget mismatch (curriculum total ≠ tokens the loop consumes)
is governed separately by allow_budget_mismatch.
Deterministic resume
The pretokenized path resumes the data stream exactly. Each emitted item
carries a small cursor describing the dataset position after it; the
training loop commits per-worker cursors at checkpoint time and restores
them on resume. Because progress is tracked per file (not per
global step), a run can resume with a different world_size or
num_workers and still consume every document exactly once. Details in
Checkpointing & resume.
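The toy model below shows why tracking progress per file makes resume independent of the worker count. It simplifies heavily: one integer cursor per shard, round-robin file assignment, and hypothetical function names. The real cursor format and sharding rule live in data.py and may differ.

```python
def assign_files(files, num_workers):
    """Round-robin shard assignment; a stand-in for the real sharding rule."""
    ordered = sorted(files)
    return [ordered[w::num_workers] for w in range(num_workers)]


def consume(worker_files, cursors, sizes, budget):
    """Read up to budget entries, advancing each file's cursor as it goes."""
    items = []
    for name in worker_files:
        while cursors[name] < sizes[name] and len(items) < budget:
            items.append((name, cursors[name]))
            cursors[name] += 1
    return items


def run_then_resume(sizes, first_workers, resumed_workers, budget):
    """Read part of the data, checkpoint, then finish with a new worker count."""
    cursors = {name: 0 for name in sizes}
    seen = []
    for files in assign_files(sizes, first_workers):
        seen += consume(files, cursors, sizes, budget)
    checkpoint = dict(cursors)   # what a checkpoint would persist
    restored = dict(checkpoint)  # what a resumed run would load
    for files in assign_files(sizes, resumed_workers):
        seen += consume(files, restored, sizes, budget=sum(sizes.values()))
    return seen
```

Since the cursor belongs to the file rather than to a worker or a step count, whichever worker picks up a shard after the restart continues from the same entry, and no entry is read twice or skipped.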
Authoritative source: src/gpt_simple/pretokenize.py,
src/gpt_simple/data.py.