Production-style, decoder-only LLM engineering project focused on reproducible data pipelines, tokenizer/sharding workflows, and GPU training from scratch.
- Scope: end-to-end LLM workflow from raw corpora to checkpoints and generation.
- Data focus: ZIM + FineWeb workflows with hot (`./data`) and warm (`/mnt/ceph/llm/data`) storage patterns.
- Engineering focus: deterministic scripts, integrity checks, CI gating, and wiki-backed docs.
- Wiki: `wiki/`
- Setup: `docs/SERVER_SETUP.md`
- RTX 5070 tuning: `docs/RTX5070_TUNING.md`
- HF release + deploy: `docs/HF_RELEASE_AND_DEPLOY.md`
- Contributor guide: `AGENTS.md`
- Build a minimal but production-style training stack incrementally.
- Keep each subsystem testable (`tokenizer`, `data`, `model`, `training`, `evaluation`).
- Favor reproducible experiments through explicit configs and scripts.
- `src/llm/`: core Python package
- `tests/`: unit tests
- `docs/`: architecture and roadmap notes
- `information/`: reference material and external links for project guidance
- `requirements/`: system and Python dependency lists for server setup
- `scripts/`: bootstrap/install/doctor scripts
- `data/`: local/intermediate corpora (gitignored except `data/README.md`)
- `artifacts/`: local outputs (vocab, checkpoints, logs; gitignored)
- `Makefile`: common developer commands
bash scripts/bootstrap_dev.sh
make setup-infer # install inference/deploy dependencies
make install-systemd-services # install/reload long-run systemd units
make install-user-systemd-services # install/reload user-level systemd units (no sudo)
make test # run unit tests
make lint # run Ruff checks
make format # run Black formatter
make typecheck # run MyPy
make smoke # tiny CLI smoke check
make verify-shards # print shard integrity check usage
make train # print baseline training command usage
make generate # print checkpoint text-generation command usage
make average-checkpoints # print checkpoint averaging usage
make eval-checkpoint # print standardized prompt-suite eval usage
make render-eval-dashboard # print eval trend dashboard render usage
make package-inference-bundle # print deploy bundle packaging usage
make train-tokenizer-global # print shared-tokenizer command usage
make corpus-quality-report # print quality report command usage
make clean-corpus-batch # print batch cleanup command usage
make dataset-risk-report # print heuristic dataset risk audit command usage
make pull-hf-rows # print Hugging Face rows API pull helper usage
make fineweb-parquet-to-shards # print direct FineWeb parquet->token-shards usage
make fineweb-manifest-dedupe # print overlap-manifest dedupe helper usage
make stage-fineweb-from-warm # print warm->hot FineWeb chunk staging usage
make fineweb-prefetch-hot-queue # print hot-queue prefetch worker usage
make fineweb-revalidate-bad-parquet # print bad parquet revalidate/restage usage
make offload-shard-bins-warm # print shard .bin offload-to-warm usage
make fineweb-stage-shard-loop # print rolling stage->shard->verify->sync->purge usage
make fineweb-stage-shard-watchdog # print auto-restart watchdog usage for stage/shard loop
make lr-sweep-350bt # print RTX 5070 LR sweep usage for staged 350BT shards
make train-350bt-v2 # print 350BT long-run launcher usage
make train-350bt-ctx1024 # print long-context continuation launcher usage
make train-supervisor-350bt # print auto-resume trainer supervisor usage
make train-supervisor-phase1-talk # print phase-1 English conversation supervisor usage
make pipeline-eta # print combined download/shard/train ETA reporter usage
make pipeline-live # print live terminal pipeline dashboard usage
make shard-corpus-batch # print shared-tokenizer batch sharding usage
make hf-download-resumable # print self-healing HF resume-download worker usage
make hf-download-watchdog # print auto-restart wrapper for stalled/exited HF downloads
make sync-warm # sync raw/training data + artifacts to warm storage
make hydrate-warm # hydrate hot workspace from warm storage
make offload-zim # continuously move raw ZIMs hot -> warm
make checkpoint-offload-prune # sync checkpoints to warm and prune older local runs
make set-swappiness # print vm.swappiness tuning usage (root)
make hf-prepare-publish # print HF bundle/publish usage
make hf-download-model # print full HF model download usage
make serve-openai # print local OpenAI-compatible server usage
make doctor # verify binaries and Python deps

`make smoke` is expected to run in CI without installing torch; keep non-training CLI import paths torch-optional.
GitHub Actions workflows are defined in .github/workflows/:
- `ci.yml`: script sanity (`bash -n` + `py_compile`), lint, typecheck, unit tests, and smoke checks on pull requests and pushes to `main`
- `wiki-sync.yml`: publishes `wiki/*.md` changes to the GitHub Wiki
- Dependabot config: `.github/dependabot.yml` (weekly updates for `pip`, `requirements/`, and GitHub Actions)
Recommended branch protection for `main`:
- Require pull request before merging
- Require status checks: `CI Gate`
- Require branches to be up to date before merge
- Install system packages: `bash scripts/install_server_system.sh`
- Bootstrap dev environment: `bash scripts/bootstrap_dev.sh`
- Install training extras: `bash scripts/bootstrap_train.sh`
- Run health check: `bash scripts/doctor.sh`
- Install persistent workers:
  - system units (root): `bash scripts/install_systemd_services.sh --install-watchdog`
  - user units (no sudo): `bash scripts/install_user_systemd_services.sh --install-watchdog`

Detailed guide: docs/SERVER_SETUP.md
Keep raw .zim files on server storage (for example /data/iiab/zim/), not in Git.
For a first-pass talking-only dataset profile (English prose focus), generate include/exclude manifests:
bash scripts/first_pass_zim_profile.sh

To also move excluded local ZIMs from hot storage to warm storage:

bash scripts/first_pass_zim_profile.sh --move-excluded

This writes:
- `artifacts/reports/first_pass_include_targets.txt` (target profile, includes Gutenberg)
- `artifacts/reports/first_pass_include_zims.txt` (currently present and included)
- `artifacts/reports/first_pass_exclude_zims.txt` (currently present and excluded)
- Extract text corpus from ZIM:
PYTHONPATH=src .venv/bin/python -m llm.cli extract-zim-text \
--input-zim /data/iiab/zim/wikipedia_en_all_maxi.zim \
--output data/extracted/wiki_corpus.txt \
--max-articles 50000 \
--min-chars 200

If extraction returns written_articles=0, retry with a lower --min-chars (for example 20).
If extract-zim-text reports no fulltext index, generate a --paths-file from
ZIM suggestions/title index and rerun extraction with that file.
- Analyze extracted corpora and generate boilerplate candidates:
PYTHONPATH=src .venv/bin/python -m llm.cli corpus-quality-report \
--input-dir data/extracted \
--output artifacts/reports/corpus_quality.json

- Clean corpora before tokenizer training:
PYTHONPATH=src .venv/bin/python -m llm.cli clean-corpus-batch \
--input-dir data/extracted \
--output-dir data/cleaned \
--boilerplate-report artifacts/reports/corpus_quality.json \
--en-only

By default, this cleanup step also decodes HTML entities and strips common web-shell artifacts
(HTML-like tags, repeated nav/menu phrases, site suffixes such as - Stack Overflow).
Disable individual transforms with:
--no-decode-html-entities, --no-strip-html-tags, --no-strip-site-suffixes,
--no-strip-nav-phrases, --no-strip-stack-metadata, --no-collapse-repeated-prefix,
--no-strip-inline-score-tokens.
To enforce English-only cleanup, add --en-only (with tunable thresholds:
--en-min-words, --en-min-stopword-ratio, --en-min-stopword-count,
--en-min-latin-ratio).
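As an illustration of how these English-only thresholds interact, here is a minimal sketch of a stopword/Latin-ratio heuristic. It is not the project's implementation; the parameter names mirror the CLI flags, but the stopword list and logic are assumptions.

```python
# Assumed re-implementation of --en-only style heuristics for illustration.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it", "that", "for"}

def looks_english(line: str,
                  min_words: int = 4,
                  min_stopword_ratio: float = 0.08,
                  min_stopword_count: int = 1,
                  min_latin_ratio: float = 0.9) -> bool:
    words = line.lower().split()
    if len(words) < min_words:
        return False
    stop_hits = sum(1 for w in words if w.strip(".,!?") in STOPWORDS)
    if stop_hits < min_stopword_count:
        return False
    if stop_hits / len(words) < min_stopword_ratio:
        return False
    letters = [c for c in line if c.isalpha()]
    if not letters:
        return False
    # Latin-ratio guard: mostly non-ASCII letters suggests non-English text.
    latin = sum(1 for c in letters if c.isascii())
    return latin / len(letters) >= min_latin_ratio
```

Raising `--en-min-stopword-ratio` tightens the filter toward fluent prose; lowering `--en-min-latin-ratio` tolerates more mixed-script lines.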
Additional quality guards are enabled by default:
- minimum words per line (`--min-words`, default `6`)
- symbol-density filter (`--max-symbol-ratio`, default `0.20`)
- URL-heavy line filter (`--max-urls-per-line`, default `1`)
- repetitive-token noise filter (`--repeated-token-run-threshold`, default `8`)
- normalized dedupe keys across punctuation/case variants (`--no-dedupe-normalized` to disable)
- contamination filter for benchmark/prompt/refusal fragments (`--no-drop-contamination` to disable)

For talking-only passes, keep code filtering enabled (default) or tune with `--code-symbol-ratio-threshold` and `--code-keyword-hits-threshold`. You can extend contamination filtering with a repeatable `--contamination-pattern` or a `--contamination-patterns-file`.
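The guards above can be sketched as a single line-level predicate. This is an illustrative re-implementation under assumed semantics, not the project's code; only the defaults match the documented flags.

```python
# Sketch of the documented line-quality guards: min-words, symbol density,
# URL count, and repeated-token runs. Semantics here are assumptions.
def passes_quality_guards(line: str,
                          min_words: int = 6,
                          max_symbol_ratio: float = 0.20,
                          max_urls_per_line: int = 1,
                          repeated_token_run_threshold: int = 8) -> bool:
    tokens = line.split()
    if len(tokens) < min_words:
        return False
    if line:
        symbols = sum(1 for c in line if not (c.isalnum() or c.isspace()))
        if symbols / len(line) > max_symbol_ratio:
            return False
    if sum(t.startswith(("http://", "https://")) for t in tokens) > max_urls_per_line:
        return False
    run = 1  # detect long runs of the same token, a common noise signature
    for prev, cur in zip(tokens, tokens[1:]):
        run = run + 1 if cur == prev else 1
        if run >= repeated_token_run_threshold:
            return False
    return True
```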
3a. Pull a bounded Hugging Face dataset slice (for example FineWeb sample rows):
python3 scripts/pull_hf_rows.py \
--dataset HuggingFaceFW/fineweb \
--config sample-350BT \
--split train \
--output /mnt/ceph/llm/data/extracted/fineweb_sample-350BT_rows100k.txt \
--max-rows 100000

Use warm storage for these pulls first; full FineWeb variants are much larger than a typical hot disk.
3aa. Bulk-download FineWeb parquet shards (resumable):
# create token in Hugging Face web UI: Settings -> Access Tokens (read scope)
export HF_TOKEN=hf_xxx
# sample-350BT (~1.06 TB) -> warm storage, auto-resume + retry forever
bash scripts/hf_download_resumable.sh \
--dataset HuggingFaceFW/fineweb \
--repo-type dataset \
--include "sample/350BT/*.parquet" \
--local-dir /mnt/ceph/llm/data/fineweb/sample-350BT \
--max-workers 6 \
--enable-hf-transfer \
--skip-dry-run \
--attempt-timeout-seconds 5400 \
--retry-delay-seconds 30 \
--max-retries 0 \
--log-file artifacts/reports/fineweb_350bt_download_resumable.log

Notes:
- `HF_TOKEN` is recommended (higher rate limits) but not strictly required for public datasets.
- Hugging Face SSH keys are for Git-over-SSH and are not used by `hf download`.
- `hf_download_resumable.sh` writes a lock file in the local dir to prevent duplicate workers.
- `hf_download_resumable.sh` auto-detects `hf_transfer`; it can be forced on with `--enable-hf-transfer`.
- For very large pulls (like 350BT), `--skip-dry-run` avoids metadata preflight stalls.
- `--attempt-timeout-seconds` prevents one hung transfer from stalling progress forever.
- Keep 350BT parquet on warm storage and stage bounded chunks to hot storage before sharding.
- For unattended runs, wrap with `scripts/hf_download_watchdog.sh` to auto-restart on stalls.
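The retry semantics behind these flags (bounded attempts, fixed backoff, `--max-retries 0` meaning "retry forever") can be sketched generically. This is an assumed Python rendering of the shell script's control flow, with an injectable `sleep` so the loop is testable.

```python
# Illustrative retry loop in the spirit of hf_download_resumable.sh.
import time

def run_with_retries(attempt,                 # callable returning True on success
                     max_retries: int = 0,    # 0 = retry forever
                     retry_delay_seconds: float = 30.0,
                     sleep=time.sleep) -> int:
    tries = 0
    while True:
        tries += 1
        try:
            if attempt():
                return tries                  # number of attempts used
        except Exception:
            pass                              # a raised error counts as a failed attempt
        if max_retries and tries > max_retries:
            raise RuntimeError(f"gave up after {tries} attempts")
        sleep(retry_delay_seconds)
```

In the real script each `attempt` is an `hf download` invocation bounded by `--attempt-timeout-seconds`, so one hung transfer only costs one attempt.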
3aaa. Optional watchdog wrapper for stalled/exited downloads:
bash scripts/hf_download_watchdog.sh \
--dataset HuggingFaceFW/fineweb \
--repo-type dataset \
--include "sample/350BT/*.parquet" \
--local-dir /mnt/ceph/llm/data/fineweb/sample-350BT \
--max-workers 4 \
--enable-hf-transfer \
--skip-dry-run \
--attempt-timeout-seconds 5400 \
--stall-seconds 1200 \
--exit-on-complete \
--expected-parquet-files 510 \
--expected-bytes 1061360917731 \
--worker-log-file artifacts/reports/fineweb_350bt_download_resumable.log \
--watchdog-log-file artifacts/reports/hf_download_watchdog.log

Use --exit-on-complete with expected file and/or byte targets so the watchdog exits once the
download is complete (instead of looping forever and relaunching workers).
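The completeness test the watchdog needs is simple: count the local parquet files and sum their bytes against the expected targets. A minimal sketch (the real script's check may differ in details):

```python
# Sketch of an --exit-on-complete style check: done when the local dir holds
# at least the expected parquet count and total bytes.
from pathlib import Path

def download_complete(local_dir: str,
                      expected_parquet_files: int,
                      expected_bytes: int) -> bool:
    files = [p for p in Path(local_dir).rglob("*.parquet") if p.is_file()]
    total = sum(p.stat().st_size for p in files)
    return len(files) >= expected_parquet_files and total >= expected_bytes
```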
3ab. Stage FineWeb chunks from warm to hot as needed:
bash scripts/stage_fineweb_from_warm.sh --max-files 4 --max-gib 8 --copy-jobs 2

You can pass --skip-list artifacts/reports/fineweb_stage_shard_loop/bad_parquet_files.txt
to avoid restaging files previously flagged as invalid.
Use --min-free-gib <N> to keep a floor of free space on hot storage while staging.
The staging script now copies into *.parquet.incomplete first and renames atomically,
so sharding/preflight never reads partially written parquet files.
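The copy-then-rename pattern is the standard way to publish files atomically. A minimal sketch of what the staging script does, assuming source and destination are on the same filesystem (where `os.rename` is atomic on POSIX):

```python
# Sketch of atomic staging: write to <dst>.incomplete, then rename.
import os
import shutil

def stage_file(src: str, dst: str) -> None:
    tmp = dst + ".incomplete"
    shutil.copyfile(src, tmp)   # readers never see a half-written dst
    os.rename(tmp, dst)         # atomic publish on same-filesystem renames
```

Sharding preflight can therefore safely ignore any `*.incomplete` names and trust that every `*.parquet` it sees is fully written.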
3ac. Run rolling warm->hot staging + sharding loop (recommended for 350BT on limited hot disk):
bash scripts/fineweb_stage_shard_loop.sh \
--hot-queue-min-files 18 \
--stage-max-files 12 \
--stage-copy-jobs 2 \
--stage-min-free-gib 80 \
--process-max-files 12 \
--shard-jobs 2 \
--auto-tune-shard-jobs \
--auto-tune-min-shard-jobs 1 \
--auto-tune-max-shard-jobs 4 \
--tokenizer-threads 10 \
--encode-batch-size 1024 \
--shard-size-tokens 20000000 \
--sync-background \
--sync-max-inflight 2 \
--sleep-seconds 60 \
--shard-min-batch-size 512

This loop stages bounded parquet files to hot storage, builds verified shard batches under
data/shards_global/fineweb-global-bpe-v1/, syncs those batches back to warm storage,
and purges processed hot parquet files.
Before sharding each batch, the loop now runs a parquet preflight check (row groups/rows/field),
quarantines failing hot files, and records their basenames in
artifacts/reports/fineweb_stage_shard_loop/bad_parquet_files.txt so they are skipped in future staging.
It also bootstraps processed parquet basenames from existing shard manifests on startup,
builds a combined stage skip list (processed + bad), and removes already-known files from hot storage,
so restarted loops continue forward instead of re-staging the earliest parquet files.
It also reconciles bad_parquet_files.txt against warm-source parquet validity on startup, so
transient hot-copy failures do not permanently blacklist valid warm files.
--hot-queue-min-files keeps a small parquet queue staged locally so shard building is less likely to idle on copy waits.
--stage-copy-jobs controls warm->hot copy parallelism for staging throughput.
--stage-min-free-gib prevents staging from filling hot disk below a safety floor.
--auto-tune-shard-jobs adapts --shard-jobs (and matching tokenizer threads) from loadavg + batch runtime.
--sync-background overlaps warm-storage sync with the next shard batch to reduce idle gaps.
--shard-size-tokens 20000000 reduces shard file-count overhead vs the old 5M-token default.
If a shard build fails with OOM-like errors, the loop retries automatically with a smaller batch size.
Batch guardrails now require valid report/manifest + non-empty shard outputs before files are marked
processed or purged from hot storage.
If a shard build fails with non-OOM errors (for example parquet decode errors), that job's input files
are quarantined as bad and the loop continues with remaining files.
Guardrail checks are implemented in src/llm/fineweb_guardrails.py and are unit-tested.
For 20-core hosts, --shard-jobs 2 --tokenizer-threads 10 --encode-batch-size 1024 is the
current high-throughput profile.
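The auto-tune behavior can be pictured as a small feedback rule over normalized loadavg. This is an illustrative sketch only; the actual thresholds and adjustment logic in `fineweb_stage_shard_loop.sh` may differ, and the percentage knobs here mirror the `--auto-tune-*-load-pct` flags by assumption.

```python
# Illustrative --auto-tune-shard-jobs rule: step jobs down under high load,
# up under low load, clamped to [min_jobs, max_jobs].
def tune_shard_jobs(current_jobs: int,
                    loadavg_1m: float,
                    cpu_count: int,
                    min_jobs: int = 1,
                    max_jobs: int = 4,
                    low_load_pct: int = 80,
                    high_load_pct: int = 95) -> int:
    load_pct = 100.0 * loadavg_1m / cpu_count
    if load_pct >= high_load_pct:
        current_jobs -= 1           # back off: host is saturated
    elif load_pct <= low_load_pct:
        current_jobs += 1           # headroom: add a shard worker
    return max(min_jobs, min(max_jobs, current_jobs))
```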
3ad. Optional watchdog for stage/shard loop auto-restart on exit/stall:
bash scripts/fineweb_stage_shard_watchdog.sh \
--worker-args "--hot-queue-min-files 18 --stage-max-files 12 --stage-copy-jobs 2 --process-max-files 12 --shard-jobs 2 --tokenizer-threads 10 --encode-batch-size 1024 --sleep-seconds 60 --shard-min-batch-size 512" \
--check-interval-seconds 120 \
--stall-seconds 5400

The stage-watchdog now enforces a singleton lock in the stage state directory
(artifacts/reports/fineweb_stage_shard_loop/watchdog.lock by default), independent of
the log filename. It also adopts an already-running stage-loop controller by default so
watchdog restarts do not leave direct loop runs unmanaged. Use --no-adopt-existing-loop
to force launching a fresh worker process.
Watchdog progress snapshots now include hot .incomplete file count/bytes, so long warm->hot
copy phases are treated as active progress (not false stalls).
3ae. Build tokenizer + token shards directly from FineWeb parquet:
PYTHONPATH=src .venv/bin/python scripts/fineweb_parquet_to_shards.py \
--input-dir data/fineweb/sample-350BT \
--output-dir data/shards_global/fineweb-global-bpe-v1 \
--tokenizer-out artifacts/tokenizer/fineweb-global-bpe-v1.json \
--field text \
--min-chars 80 \
--shard-size-tokens 5000000 \
--val-ratio 0.01

This writes manifest.json + shard .bin files directly, skipping extracted text.
Use --max-files to do bounded test runs.
3b. Run heuristic dataset risk audit:
PYTHONPATH=src .venv/bin/python -m llm.cli dataset-risk-report \
--input-dir data/cleaned \
--output artifacts/reports/dataset_risk.json

This reports lexical cues for toxicity, stereotypes, political content, and refusal-like phrases. Use it as a screening signal, then manually review high-risk segments.
- Train tokenizer on cleaned corpus:
PYTHONPATH=src .venv/bin/python -m llm.cli train-tokenizer \
--input data/cleaned/wiki_corpus.clean.txt \
--output artifacts/tokenizer/vocab.json \
--bpe-vocab-size 32000 \
--bpe-min-frequency 2

- Shard tokenized corpus for training:
PYTHONPATH=src .venv/bin/python -m llm.cli shard-corpus \
--input data/cleaned/wiki_corpus.clean.txt \
--tokenizer artifacts/tokenizer/vocab.json \
--output-dir data/shards/wiki_bpe \
--shard-size-tokens 5000000 \
--val-ratio 0.01

5b. Build one global tokenizer for multi-dataset training:
PYTHONPATH=src .venv/bin/python -m llm.cli train-tokenizer-global \
--input-dir data/cleaned \
--pattern "*.clean.txt" \
--from-shards-path data/shards \
--output artifacts/tokenizer/global-bpe-v1.json \
--bpe-vocab-size 32000 \
--bpe-min-frequency 2

5c. Re-shard many corpora with that global tokenizer:
PYTHONPATH=src .venv/bin/python -m llm.cli shard-corpus-batch \
--input-dir data/cleaned \
--pattern "*.clean.txt" \
--from-shards-path data/shards \
--tokenizer artifacts/tokenizer/global-bpe-v1.json \
--output-root data/shards_global/global-bpe-v1

- Inspect corpus quickly:
PYTHONPATH=src .venv/bin/python -m llm.cli stats --input data/cleaned/wiki_corpus.clean.txt

- Verify shard integrity before training:
PYTHONPATH=src .venv/bin/python -m llm.cli verify-shards \
--path data/shards \
--raw-zim-dir data/raw_zim \
--strict-source

- Run a baseline training test:
PYTHONPATH=src .venv/bin/python -m llm.cli train \
--shards-path data/shards/medlineplus.gov_en_all_2025-01 \
--output-dir artifacts/checkpoints/medlineplus_baseline \
--max-steps 200 \
--batch-size 8 \
--context-length 256 \
--lr-schedule cosine \
--lr-warmup-steps 50 \
--grad-accum-steps 1 \
--fail-on-eval-regression \
--precision auto

Note: train requires all selected manifests to share the exact same tokenizer mapping.
Use a global tokenizer + shard-corpus-batch output root for multi-dataset runs.
For higher sustained GPU utilization on CUDA, use --precision auto and keep
validation less frequent (--eval-interval 500 --eval-steps 10).
If utilization is still bursty on smaller models, test --compile-model.
Training now supports:
- warmup + cosine LR schedule (`--lr-schedule`, `--lr-warmup-steps`, `--lr-min-ratio`)
- gradient accumulation (`--grad-accum-steps`)
- fixed held-out eval batches (`--no-eval-freeze-batches` to disable)
- eval regression gate (`--fail-on-eval-regression --eval-regression-tolerance 0.20`)
- checkpoint retention pruning (`--checkpoint-keep-last`, `--checkpoint-keep-every`)
- optional EMA weights (`--ema-decay`, `--ema-update-every`, `--ema-start-step`)
- optional weights-only export (`--export-safetensors`)
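For intuition about the first item, here is a minimal sketch of a warmup + cosine schedule matching the documented knobs. The trainer's exact curve is not reproduced here; this is the standard shape (linear warmup, then cosine decay to `min_ratio * base_lr`).

```python
# Sketch of warmup + cosine LR: linear ramp for warmup_steps, then cosine
# decay from base_lr down to base_lr * min_ratio at max_steps.
import math

def lr_at_step(step: int, base_lr: float, max_steps: int,
               warmup_steps: int, min_ratio: float = 0.1) -> float:
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    t = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    cos = 0.5 * (1.0 + math.cos(math.pi * min(t, 1.0)))
    return base_lr * (min_ratio + (1.0 - min_ratio) * cos)
```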
- Generate text from a checkpoint:
PYTHONPATH=src .venv/bin/python -m llm.cli generate \
--checkpoint artifacts/checkpoints/medlineplus_baseline/last.pt \
--prompt "The future of medicine is" \
--max-new-tokens 200 \
--temperature 0.9 \
--top-k 50

Use --use-ema to generate from ema_state when the checkpoint includes EMA weights.
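The `--temperature` and `--top-k` knobs implement the standard sampling recipe: keep the k highest logits, divide by the temperature, softmax, and draw. A self-contained sketch of that technique (not the project's exact code):

```python
# Standard temperature + top-k sampling over a plain list of logits.
import math
import random

def sample_top_k(logits, k: int = 50, temperature: float = 0.9, rng=random):
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    scaled = [logits[i] / temperature for i in top]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]     # numerically stable softmax
    total = sum(exps)
    r = rng.random() * total                     # inverse-CDF draw
    for idx, e in zip(top, exps):
        r -= e
        if r <= 0:
            return idx
    return top[-1]
```

Lower temperature sharpens the distribution; `k=1` degenerates to greedy decoding.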
9a. Average multiple checkpoints for a more stable inference snapshot:
PYTHONPATH=src .venv/bin/python -m llm.cli average-checkpoints \
--checkpoint artifacts/checkpoints/medlineplus_baseline/ckpt_step_0001000.pt \
--checkpoint artifacts/checkpoints/medlineplus_baseline/ckpt_step_0002000.pt \
--output artifacts/checkpoints/medlineplus_baseline/avg_last2.pt \
--state-key model_state- Run standardized checkpoint eval (fixed prompt suite + scored report):
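Checkpoint averaging is an element-wise mean over matching weights in each checkpoint's `model_state`. This sketch uses plain lists in place of torch tensors so the idea is visible without a torch dependency; the CLI's actual tensor handling is assumed.

```python
# Uniform averaging of state dicts; lists stand in for tensors.
def average_states(states):
    keys = states[0].keys()
    assert all(s.keys() == keys for s in states), "mismatched state dicts"
    out = {}
    for k in keys:
        cols = zip(*(s[k] for s in states))      # align elements across checkpoints
        out[k] = [sum(vals) / len(states) for vals in cols]
    return out
```

Averaging nearby checkpoints tends to smooth optimization noise, which is why the averaged snapshot can be a more stable inference target than any single step.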
PYTHONPATH=src .venv/bin/python scripts/eval_checkpoint_prompts.py \
--checkpoint artifacts/checkpoints/medlineplus_baseline/last.pt \
--suite configs/eval/standard_prompt_suite_v3.json \
--baseline-report artifacts/reports/evals/<previous_report>.json \
--promotion-policy configs/eval/promotion_policy_v1.json \
--fail-on-regression

Writes a JSON report under artifacts/reports/evals/ so runs can be compared over time.
The report now includes regression deltas and a promotion verdict when a policy is provided.
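The regression/promotion idea reduces to comparing the current pass rate against a baseline under a policy. The sketch below is illustrative; the actual schema of `promotion_policy_v1.json` and the report's verdict fields are not reproduced here, and the threshold names are assumptions.

```python
# Sketch of a promotion verdict: pass-rate floor plus a max allowed
# regression versus the baseline report.
def promotion_verdict(current_pass_rate: float,
                      baseline_pass_rate: float,
                      min_pass_rate: float = 0.6,
                      max_regression: float = 0.05):
    delta = current_pass_rate - baseline_pass_rate
    promoted = current_pass_rate >= min_pass_rate and delta >= -max_regression
    return {"delta": delta, "promoted": promoted}
```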
Use this when you want round-1 pretraining only from FineWeb (no ZIM mix yet):
# 1) build tokenizer + shards directly from parquet
PYTHONPATH=src .venv/bin/python scripts/fineweb_parquet_to_shards.py \
--input-dir data/fineweb/sample-350BT \
--output-dir data/shards_global/fineweb-global-bpe-v1 \
--tokenizer-out artifacts/tokenizer/fineweb-global-bpe-v1.json \
--field text \
--min-chars 80 \
--shard-size-tokens 5000000 \
--val-ratio 0.01
# 2) verify and train
PYTHONPATH=src .venv/bin/python -m llm.cli verify-shards \
--path data/shards_global/fineweb-global-bpe-v1
PYTHONPATH=src .venv/bin/python -m llm.cli train \
--shards-path data/shards_global/fineweb-global-bpe-v1 \
--output-dir artifacts/checkpoints/fineweb-350bt-run1 \
--device cuda \
--max-steps 1000 \
--batch-size 12 \
--context-length 256 \
--lr-schedule cosine \
--lr-warmup-steps 200 \
--fail-on-eval-regression \
--precision auto

Resume training from the latest checkpoint:
PYTHONPATH=src .venv/bin/python -m llm.cli train \
--shards-path data/shards_global/fineweb-global-bpe-v1 \
--output-dir artifacts/checkpoints/fineweb-350bt-run1 \
--device cuda \
--resume-from artifacts/checkpoints/fineweb-350bt-run1/last.pt \
--max-steps 3000

Long-context continuation from a converged ctx512 run:

bash scripts/train_rtx5070_fineweb_350bt_bpe_v2_ctx1024.sh

This path resumes from the base run and uses --allow-context-extension.
Optional text-first path still exists for inspection-heavy runs:
parquet_to_corpus -> clean-corpus-batch -> train-tokenizer-global -> shard-corpus-batch.
You can start training on a subset, then add new parquet files with the same tokenizer and resume:
# phase 1 file snapshot (example: first 10 files)
find data/fineweb/sample-350BT/sample/350BT -maxdepth 1 -type f -name '*.parquet' | sort | head -n 10 | sed 's#^data/fineweb/sample-350BT/##' > artifacts/reports/fineweb_sample350bt_phase1_files.txt
# build phase 1 tokenizer + shards
PYTHONPATH=src .venv/bin/python scripts/fineweb_parquet_to_shards.py \
--input-dir data/fineweb/sample-350BT \
--files-list artifacts/reports/fineweb_sample350bt_phase1_files.txt \
--output-dir data/shards_global/fineweb-350bt-incremental/phase1 \
--tokenizer-out artifacts/tokenizer/fineweb-350bt-incremental-bpe-v1.json \
--field text
# start training on phase 1
PYTHONPATH=src .venv/bin/python -m llm.cli train \
--shards-path data/shards_global/fineweb-350bt-incremental \
--output-dir artifacts/checkpoints/fineweb-350bt-incremental-run1 \
--device cuda
# later: build phase 2 from newly arrived files using same tokenizer
find data/fineweb/sample-350BT/sample/350BT -maxdepth 1 -type f -name '*.parquet' | sort | sed 's#^data/fineweb/sample-350BT/##' > /tmp/all_parquets.txt
comm -23 /tmp/all_parquets.txt artifacts/reports/fineweb_sample350bt_phase1_files.txt > artifacts/reports/fineweb_sample350bt_phase2_files.txt
PYTHONPATH=src .venv/bin/python scripts/fineweb_parquet_to_shards.py \
--input-dir data/fineweb/sample-350BT \
--files-list artifacts/reports/fineweb_sample350bt_phase2_files.txt \
--output-dir data/shards_global/fineweb-350bt-incremental/phase2 \
--tokenizer-in artifacts/tokenizer/fineweb-350bt-incremental-bpe-v1.json \
--field text
# resume; train sees both manifests under shards-path
PYTHONPATH=src .venv/bin/python -m llm.cli train \
--shards-path data/shards_global/fineweb-350bt-incremental \
--output-dir artifacts/checkpoints/fineweb-350bt-incremental-run1 \
--device cuda \
--resume-from artifacts/checkpoints/fineweb-350bt-incremental-run1/last.pt

On this 20-core host, default FineWeb shard splitting should use 15 parallel streams.
- Tuned profile docs: `docs/RTX5070_TUNING.md`
- Saved JSON profiles:
  - `configs/train/rtx5070/fineweb_global_bpe_v1_big.json` (recommended, BPE)
  - `configs/train/rtx5070/fineweb_350bt_bpe_v2_longrun.json` (350BT long-run preset)
- Launch tuned big profile: `bash scripts/train_rtx5070_fineweb_bpe_v1_big.sh`
- 350BT-first LR sweep (ctx 512, LR 2e-4..4e-4): `bash scripts/lr_sweep_rtx5070_fineweb_350bt_ctx512.sh`
- 350BT-first long-run launcher: `bash scripts/train_rtx5070_fineweb_350bt_bpe_v2.sh`
- Auto-resume supervisor (refreshes the manifest set between step chunks):
bash scripts/train_supervisor_rtx5070_350bt.sh \
--step-chunk 2000 \
--poll-seconds 60 \
--batch-size 12 \
--target-effective-batch 24 \
--min-unique-input-files 510 \
--min-batch-size 6 \
--max-batch-size 20 \
--batch-step 2 \
--generation-suite configs/eval/generation_smoke_suite_v1.json \
--generation-every-chunks 1

- Phase-1 English conversation gating profile (before coding specialization):

bash scripts/train_supervisor_phase1_english_talk.sh

This uses configs/eval/english_talk_suite_v1.json,
configs/eval/generation_talk_smoke_v1.json, and
configs/eval/promotion_policy_talk_v1.json.
It also uses a dedicated state dir (artifacts/reports/train_supervisor_phase1_talk) and
lower-variance generation-gate settings (--generation-temperature 0.2 --generation-top-k 1).
Successful chunks update artifacts/reports/train_supervisor_phase1_talk/trained_batch_names.txt,
which can be used to gate shard offload so only already-trained batches move to warm storage.
On each supervisor loop, hot-only manifest guard now runs automatically and disables any
active manifest that references symlinked shard bins.
When monitoring this profile, point status tools at that state dir:
PYTHONPATH=src .venv/bin/python scripts/pipeline_live_view.py --supervisor-state-dir artifacts/reports/train_supervisor_phase1_talk
and
PYTHONPATH=src .venv/bin/python scripts/pipeline_eta_report.py --supervisor-state-dir artifacts/reports/train_supervisor_phase1_talk.
For continuous 350BT ingestion/training, keep exactly one stage watchdog and one train supervisor running.
Avoid launching one-off llm.cli train --max-steps ... jobs in parallel with the supervisor.
Stage watchdog now performs stale worker cleanup before relaunch, so restarted controllers
do not leave orphan shard-build workers behind.
Supervisor now runs a manifest dedupe pass before each train chunk launch
(scripts/fineweb_manifest_dedupe.py, keep strategy newest) that disables exact duplicate
manifest file-sets and reports partial overlaps for review.
Use --no-dedupe-overlap-manifests to disable, or --dedupe-dry-run to audit without disabling duplicates.
Use --dedupe-report-keep <N> to cap saved dedupe report/log artifacts during long waits.
Use --min-unique-input-files <N> to hold training until enough unique parquet inputs are represented in manifests.
Use --min-train-tokens <N> to gate startup by total train-token coverage instead of raw file count.
Supervisor enforces a singleton lock at
artifacts/reports/train_supervisor_350bt/supervisor.lock.
Add --no-train-fail-on-eval-regression if you want chunk runs to continue even when
the train-loop held-out perplexity gate is noisy; prompt-suite regression/promotion
checks still run in the supervisor eval step.
Supervisor resume guardrails now validate last.pt/ckpt_step_*.pt before resume and
quarantine invalid checkpoint files automatically, then continue from the newest valid one.
When post-chunk eval passes promotion logic (or beats prior pass-rate baseline), supervisor
also exports best.pt, best_eval_report.json, and safetensors best aliases.
Supervisor outputs:
- `artifacts/reports/train_supervisor_350bt/train_trend.tsv` (per-chunk train telemetry)
- `artifacts/reports/train_supervisor_350bt/eval_trend.tsv` (post-chunk eval trend, including regression/promotion columns)
- `artifacts/reports/train_supervisor_350bt/generation_trend.tsv` (scheduled generation-gate trend, with regression columns)
- `artifacts/reports/train_supervisor_350bt/eval_dashboard.html` (rendered trend dashboard)
- `artifacts/reports/train_supervisor_350bt/eval_dashboard_summary.json` (dashboard summary JSON)

The supervisor now auto-selects the latest successful eval baseline from the same suite name/path as the active eval suite (the same applies to generation-gate suite baselines), so changing suites does not compare against mismatched historical reports.
Combined pipeline ETA/status reporter:
PYTHONPATH=src .venv/bin/python scripts/pipeline_eta_report.py --loop --interval-seconds 60

Use --once for an explicit single-snapshot mode (the default behavior when --loop is not set).
Outputs:
- `artifacts/reports/pipeline_status.json`
- `artifacts/reports/pipeline_status.txt`

Includes embedded snapshots of `top -b -n1`, `free -h`, `nvidia-smi`, and `df -h`. Also reports manifest coverage metrics (`manifest_unique_input_files`, overlap counts, `coverage_complete`), hot-manifest metrics (`active_manifests`, `offloaded_manifests`, `active_manifests_with_symlink_bins`, `trained_batch_names_count`), `trainer_stall_seconds`, and shard offload eligibility (`offload_eligible_batches`, raw/capped counts, trained-registry presence). Also includes per-task `RUN`/`STOP` state with stop reasons (for example `download complete`, `staging handled by stage-loop`, `idle between chunks/eval`, or gate waits). Task process counts are root-deduped (controller processes), so wrapper/child shells do not inflate `RUN xN`.
Live terminal view (single command to watch continuously):
PYTHONPATH=src .venv/bin/python scripts/pipeline_live_view.py --refresh-seconds 5

This is a live-only monitor (no report/status files are written) and includes:
- system status (CPU, memory, GPU, disk mounts)
- pipeline progress (download/staging/sharding/training)
- staging line includes `hot_parquet` and `hot_incomplete` to show active warm->hot copy progress
- hot-set status (`active_manifests`, `offloaded_manifests`, `active_symlink_manifests`, `trained_batches`)
- hot-set also shows shard offload readiness (`offload_eligible_batches`, raw eligible, cap)
- manifest coverage status (unique/510, overlap inputs/manifests, coverage rate + ETA, completion flag)
- supervisor gate status (for example waiting on `min_unique_input_files`)
- training row includes `stall=<seconds since last step progress>` for direct trainer stall visibility
- running project task states with pid/runtime/cpu/mem summaries
- explicit stop reasons for tasks that are not running
- alert rows for stage-controller health and shard-manifest stall conditions
- training ETA fallback from `pipeline_status.json` (`--eta-status-file`) when live step deltas are temporarily flat
Coverage ETA/rate now falls back to sharding throughput when manifest overlap is zero, so
ETA remains visible between manifest update bursts.
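The fallback reduces to a simple rate computation: remaining files over the best available throughput signal. A sketch of that idea, assuming the 510-file total of the sample-350BT pull; the report script's actual formula may differ.

```python
# Sketch of coverage ETA with a throughput fallback: returns seconds
# remaining, 0.0 when complete, or None when no usable rate exists.
def coverage_eta_seconds(unique_files: int,
                         total_files: int,
                         files_per_second: float):
    remaining = max(0, total_files - unique_files)
    if remaining == 0:
        return 0.0
    if files_per_second <= 0:
        return None                  # no throughput signal yet
    return remaining / files_per_second
```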
Alerts also flag duplicate train controllers (train-supervisor/trainer) and unmanaged
stage-loop runs (stage-loop active without stage-watchdog).
Alerts also flag active manifests that still reference symlinked shard bins.
The train supervisor also self-checks process singleton by PID age within the same
--state-dir scope and exits newer duplicates, so accidental second launches do not persist.
It refreshes in-place (full-screen mode). If your terminal does not handle full-screen
escape codes well, add --no-alt-screen.
For reboot-safe long runs, install service units for supervisor + stage watchdog:
make install-systemd-services

No-sudo alternative (user units):

make install-user-systemd-services

If you launch the supervisor as a transient user unit (systemd-run --user), set a high
open-files limit (for example --property=LimitNOFILE=1048576) so large shard sets
do not fail with OSError: [Errno 24] Too many open files.
Templates:
- deploy/systemd/llm-train-supervisor.service
- deploy/systemd/llm-fineweb-stage-shard-loop.service
- deploy/systemd/llm-fineweb-stage-shard-watchdog.service
- deploy/systemd/llm-hf-download-watchdog.service
- deploy/systemd/llm-checkpoint-offload-prune.service
- deploy/systemd/llm-checkpoint-offload-prune.timer
- deploy/systemd/llm-bad-parquet-revalidate.service
- deploy/systemd/llm-bad-parquet-revalidate.timer
- deploy/systemd/llm-shard-offload.service
- deploy/systemd/llm-shard-offload.timer
- deploy/systemd/llm-vm-swappiness.service
- user equivalents under deploy/systemd/user/
Note: prefetch is optional when stage-loop already uses hot-queue staging flags
(--hot-queue-min-files, --stage-max-files, --stage-copy-jobs, --stage-min-free-gib).
Install/enable the prefetch unit only when you explicitly want separate queue prefetching:
bash scripts/install_systemd_services.sh --install-watchdog --install-prefetch

If you run prefetch with stage-loop, keep auto skip enabled so it respects
processed/bad parquet lists from artifacts/reports/fineweb_stage_shard_loop/.
Prefetch now forwards --min-free-gib to stage_fineweb_from_warm.sh, so
stage-loop and prefetch apply the same hot-disk free-space guardrail.
Revalidate and optionally restore bad parquet files:
```
PYTHONPATH=src .venv/bin/python scripts/revalidate_bad_parquet.py \
  --restage-valid \
  --max-restage-files 15 \
  --min-free-gib 80
```
This also prunes `artifacts/reports/fineweb_stage_shard_loop/quarantine_bad_parquet` by default:
- removes quarantine copies for files no longer marked bad
- for still-bad files, keeps only the newest copy per basename (`--quarantine-keep-per-name 1`)
- disable with `--no-prune-quarantine`
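The keep-newest-per-basename rule can be sketched as a pure selection function, assuming copies are grouped by their original basename (illustrative only, not the script's internals):

```python
from pathlib import Path

def prune_copies(copies: dict, still_bad: set, keep_per_name: int = 1) -> list:
    """Given quarantine copies grouped by original basename, return the paths
    to delete: every copy of a recovered file, and all but the newest
    `keep_per_name` copies (by mtime) of files still marked bad."""
    to_delete = []
    for name, paths in copies.items():
        newest_first = sorted(paths, key=lambda p: p.stat().st_mtime, reverse=True)
        keep = newest_first[:keep_per_name] if name in still_bad else []
        to_delete.extend(p for p in newest_first if p not in keep)
    return to_delete
```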
Offload older shard binaries to warm storage while keeping training hot-only:
```
PYTHONPATH=src .venv/bin/python scripts/offload_shard_bins_to_warm.py \
  --keep-local-batches 24 \
  --target-free-gib 180 \
  --max-batches 40 \
  --disable-offloaded-manifests \
  --require-trained-batches-file artifacts/reports/train_supervisor_phase1_talk/trained_batch_names.txt \
  --min-active-manifests 48
```
This replaces older local shard `.bin` files with warm-storage symlinks and renames
their `manifest.json` to `manifest.offloaded.json`, so `llm.cli train` only sees
local hot-disk manifests while disk usage stays bounded.
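Since offloaded batches leave symlinked `.bin` files behind, the alert condition mentioned earlier (active manifests still referencing symlinked shard bins) reduces to a filesystem scan; a sketch assuming one `manifest.json` per batch directory:

```python
from pathlib import Path

def manifests_with_symlinked_bins(shards_root: str) -> list:
    """Return active manifest.json paths whose sibling .bin shards are symlinks,
    i.e. batches offloaded to warm storage but left with an active manifest."""
    flagged = []
    for manifest in Path(shards_root).rglob("manifest.json"):
        if any(p.is_symlink() for p in manifest.parent.glob("*.bin")):
            flagged.append(manifest)
    return flagged
```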
The --require-trained-batches-file guard prevents offloading any batch that has
not yet been included in a successful supervisor training chunk.
Use --min-active-manifests (and optional --min-active-train-tokens) as an offload
safety floor so hot-local training coverage never drops below your target.
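Composed together, the two guards amount to a simple eligibility filter over oldest-first batches; a sketch with hypothetical names (not the script's internals):

```python
def offload_candidates(batches, trained: set, active_count: int,
                       min_active: int, max_batches: int) -> list:
    """Pick oldest batches to offload, skipping anything not yet trained on
    and stopping before the active-manifest count drops below the floor."""
    picked = []
    remaining_active = active_count
    for name in batches:  # assumed oldest-first order
        if len(picked) >= max_batches:
            break
        if name not in trained:
            continue  # --require-trained-batches-file guard
        if remaining_active - 1 < min_active:
            break  # --min-active-manifests safety floor
        picked.append(name)
        remaining_active -= 1
    return picked
```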
Environment template: `deploy/systemd/llm.env.example` (installed to `/etc/llm/llm.env`)
Recommended LLM_STAGE_SHARD_LOOP_ARGS baseline for 20-core hosts:
```
LLM_STAGE_SHARD_LOOP_ARGS="--hot-queue-min-files 10 --stage-max-files 8 --stage-copy-jobs 4 --stage-min-free-gib 80 --process-max-files 15 --shard-jobs 2 --auto-tune-shard-jobs --auto-tune-min-shard-jobs 2 --auto-tune-max-shard-jobs 3 --auto-tune-low-load-pct 80 --auto-tune-high-load-pct 95 --auto-tune-min-batch-seconds 300 --tokenizer-threads 10 --encode-batch-size 1024 --shard-size-tokens 20000000 --sync-background --sync-max-inflight 2 --sleep-seconds 60 --shard-min-batch-size 512"
```
Recommended stage watchdog wrapper:
```
LLM_STAGE_SHARD_WATCHDOG_ARGS="--worker-args \"${LLM_STAGE_SHARD_LOOP_ARGS}\" --check-interval-seconds 120 --stall-seconds 5400 --watchdog-log-file artifacts/reports/fineweb_stage_shard_loop/watchdog.log"
```
Build a portable local deploy bundle (with checksums and optional tarball):
```
PYTHONPATH=src .venv/bin/python scripts/package_inference_bundle.py \
  --checkpoint artifacts/checkpoints/fineweb-350bt-bpe-v2-run1/best.pt \
  --model-id local/fineweb-bpe-v2 \
  --create-tar
```
Use `./data` and `./artifacts` as the hot working set.
Use `/mnt/ceph/llm/data` as warm cache/backup for durability and overflow.
- Recommended mount layout:
  - `/mnt/ceph/llm/data/raw_zim/`
  - `/mnt/ceph/llm/data/extracted/`
  - `/mnt/ceph/llm/data/shards/`
  - `/mnt/ceph/llm/data/tokenizer/`
- Version datasets by ZIM date stamp:
  - ZIM: `serverfault.com_en_all_2025-08.zim`
  - Version tag: `serverfault_2025-08`
  - Raw ZIM: `/mnt/ceph/llm/data/raw_zim/serverfault.com_en_all_2025-08.zim`
  - Extracted text: `/mnt/ceph/llm/data/extracted/serverfault_2025-08.txt`
  - Tokenizer: `/mnt/ceph/llm/data/tokenizer/serverfault_2025-08-vocab.json`
  - Shards: `/mnt/ceph/llm/data/shards/serverfault_2025-08/`
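Under this convention the version tag is mechanically derivable from the ZIM filename; a small sketch (the regex is an assumption generalized from the example names above):

```python
import re

def zim_version_tag(zim_name: str) -> str:
    """Derive a dataset version tag from a ZIM filename, e.g.
    'serverfault.com_en_all_2025-08.zim' -> 'serverfault_2025-08'."""
    m = re.match(r"([a-z0-9-]+)\.[a-z.]+_[a-z]+_all_(\d{4}-\d{2})\.zim$", zim_name)
    if not m:
        raise ValueError(f"unrecognized ZIM filename: {zim_name}")
    return f"{m.group(1)}_{m.group(2)}"
```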
- Default run model:
  - Process locally in `data/extracted`, `data/shards`, and `artifacts/tokenizer`.
  - Periodically sync to Ceph for backup/caching.
- Push local artifacts to warm storage:
  ```
  bash scripts/sync_warm_storage.sh /mnt/ceph/llm/data
  ```
  This now syncs training-critical inputs/outputs including `data/raw_zim`,
  `data/fineweb`, `data/cleaned`, `data/extracted`, `data/shards`,
  `data/shards_global`, `artifacts/tokenizer`, `artifacts/checkpoints`,
  and `artifacts/reports`.
- Periodic checkpoint offload + local prune:
  ```
  bash scripts/checkpoint_offload_prune.sh \
    --local-checkpoints-dir artifacts/checkpoints \
    --warm-checkpoints-dir /mnt/ceph/llm/data/checkpoints \
    --keep-local-runs 1
  ```
- VM swappiness tuning (root):
  ```
  sudo bash scripts/set_swappiness.sh --value 10 --persist
  ```
- Continuous ZIM offload worker (hot -> warm):
  ```
  bash scripts/zim_offload_worker.sh data/raw_zim /mnt/ceph/llm/data/raw_zim 120
  ```
- Pull artifacts back from warm storage to local hot workspace:
  ```
  bash scripts/hydrate_from_warm_storage.sh /mnt/ceph/llm/data
  ```
- Text stats CLI for quick corpus sanity checks.
- Batch corpus quality report generation (`corpus-quality-report`).
- Batch corpus cleanup and dedupe (`clean-corpus-batch`).
- Heuristic dataset risk auditing (`dataset-risk-report`).
- Direct FineWeb parquet -> tokenizer -> shard pipeline (`scripts/fineweb_parquet_to_shards.py`).
- BPE tokenizer workflow with train/save/load + contract fingerprinting.
- Token-window data pipeline (`TokenWindowDataset`) for next-token training pairs.
- ZIM archive text extraction (`extract-zim-text`) for server-hosted `.zim` files.
  - Automatically falls back to suggestion-index paths if fulltext search returns no matches.
- Corpus sharding (`shard-corpus`) into train/val token shard binaries + manifest.
- Batch corpus sharding (`shard-corpus-batch`) with one shared tokenizer.
- Baseline GPT training (`train`) with checkpoint save/resume.
  - Default architecture: RoPE + RMSNorm + SwiGLU (`gpt_rope_rmsnorm_swiglu_v1`).
  - Includes AdamW no-decay param groups, warmup/cosine LR, and grad accumulation.
- Checkpoint-based text generation (`generate`) with temperature/top-k sampling.
- Optional safetensors export for deployment (`--export-safetensors`).
- Unit tests for tokenizer round-trips and unknown token behavior.
- Expand checkpoint eval suite and track regressions in CI.
- Add tokenizer-aware dataset manifests for long-running incremental FineWeb phases.
- Add larger-context training profiles and memory/throughput benchmarking.
- Add finetuning flows for classification and instruction datasets.
- Internal reference index: `information/README.md`
- Working notes from loaded PDF + external references: `information/raschka-reference-notes.md`
- Implementation checklist from those references: `information/raschka-implementation-checklist.md`
- Sebastian Raschka article: https://magazine.sebastianraschka.com/p/coding-llms-from-the-ground-up
- Raschka repository: https://github.com/rasbt/LLMs-from-scratch
- Local checkout (submodule): `information/external/LLMs-from-scratch`
```
git submodule update --init --recursive
git submodule update --remote information/external/LLMs-from-scratch
```
Use the first command after clone; use the second to pull newer upstream reference commits.
Repository wiki pages are maintained from wiki/*.md.
Publish updates to GitHub wiki:
```
bash scripts/publish_wiki.sh git@github.com:aditaa/llm.wiki.git
```
Preferred workflow:
- Update `README.md` and `AGENTS.md` as needed.
- Update matching pages in `wiki/`.
- Publish wiki with `scripts/publish_wiki.sh`.
Dataset inventory and intended use are tracked in `wiki/Dataset-Registry.md`.