Orchestration¶

GPT-Simple has no built-in scheduler and makes no assumptions about your cluster. It is designed to be driven by any orchestrator — SLURM, Kubernetes, a plain shell loop — by combining two features:

resume: auto — the same gpt-simple train command starts a fresh run or resumes the latest checkpoint, so re-running it is always correct.
Graceful, walltime-aware shutdown — the trainer saves and exits 0 before a deadline or on a signal, so the orchestrator can simply re-queue the job.

Together these turn N sequential time-limited jobs into one continuous run. The details of what is saved and restored are in Checkpointing & resume.

The contract with your orchestrator¶

Your orchestrator only needs to:

Re-run the same command until training reaches max_steps. Each run resumes where the last left off.
Communicate the deadline (optional but recommended), via either:
the SLURM_JOB_END_TIME environment variable (set automatically by SLURM), or
the generic GPT_SIMPLE_MAX_RUNTIME environment variable (seconds), or
the training.max_runtime_seconds config field.
Send a signal to stop early (optional): SIGTERM/SIGUSR1 trigger a graceful save-and-exit.

The trainer reserves walltime_reserve_seconds before the deadline to finish its final checkpoint — size this to your checkpoint write time.

When to stop vs. resume¶

After each run, decide what to do from gpt-simple status, not the exit code (a graceful stop and a finished run both exit 0):

COMPLETED — reached max_steps. Stop.
HALTED — a curriculum bucket ran dry with data.allow_bucket_exhaustion=false. Terminal — do not resubmit, or the next run halts at the same point. Resume manually with --data.allow_bucket_exhaustion true to continue with a renormalized mix (see Bucket exhaustion).
ERROR / CRASHED — refuse to resubmit; inspect the logs.
Anything else (STOPPED, walltime, transient) — resume.

The bundled templates already implement these checks.

Templates¶

Ready-to-adapt templates live in examples/orchestrators/:

Template	Use case
`slurm_resume_chain.sh`	Auto-resubmitting SLURM job (generic clusters).
`kubernetes_job.yaml`	Kubernetes Job with `restartPolicy: OnFailure`.
`local_loop.sh`	A plain bash loop on a single machine.

The examples/ directory may contain site-specific details (account names, partitions, module loads). Treat those as illustrations and adapt them to your environment — the library itself never depends on them.

Minimal example: a shell loop¶

A graceful shutdown and a completed run both exit 0, so the loop checks gpt-simple status rather than the exit code to decide when to stop:

export GPT_SIMPLE_MAX_RUNTIME=7200          # stop ~2h in, save, exit 0
while true; do
  gpt-simple train --config config.yaml
  status=$(gpt-simple status --output_dir ./outputs 2>/dev/null || true)
  echo "$status" | grep -qE "COMPLETED|HALTED" && break   # HALTED needs attention
  sleep 5                                    # otherwise resume and continue
done

Each iteration trains until the budget, saves a shutdown checkpoint, and exits; the loop relaunches and resume: auto continues until status reports COMPLETED. examples/orchestrators/local_loop.sh is a more complete version (attempt cap, error detection).

Authoritative source: src/gpt_simple/_shutdown.py, examples/orchestrators/.