Σ₀ Continual-Training Loop — the closed flywheel
📖 In plain English (start here)
What this is: the machinery that lets the local coding model get better over time from its own work — without a human hand-feeding it examples each round.
The loop, in one breath: the system writes coding tasks → it actually runs the code to check each one really works → the ones that pass become study material → a new version of the model is trained on them → both the new and old versions sit a coding exam → the new one is kept only if it scores higher. If it doesn't beat the old one, nothing changes. Then it goes around again.
The honest part — proof beats claims. A piece of code only becomes training data if it survives execution in a sealed sandbox. The model saying "this works" counts for nothing; only a green test run counts. Same for promotion: a new model is adopted only on a measured exam win, not a vibe.
What's actually built today: the harvest step, the run-it-to-prove-it gate, and the keep-only-if-better promotion gate are all built and tested. The one expensive step — the actual retrain — is a button you press on the GPU box, on purpose: see the boundary note below.
The deliberate limit: this loop runs offline and only when you ask it to. It is not wired into the live chat. That's not laziness — the project's North Star forbids the live loop from ever retraining its own weights. So the flywheel is real, but a human (or a scheduled job) is the one who spins it.
The rest of this page is the precise version. ↓
This documents the continual-training loop for the Σ₀ Ouro Coder: the offline pipeline that turns the system's own execution-verified coding successes into a better local adapter, gated end-to-end by ground truth. It closes the data→train→promote flywheel that was previously hand-cranked.
The loop
harvest ──▶ execution-verify ──▶ train ──▶ eval ──▶ EVAL-GATED promote ──▶ (repeat)
│ │ │ │ │
│ │ │ │ └─ swap final/ iff
│ │ │ │ pass@1 ≥ incumbent + margin
│ │ │ └─ HumanEval pass@1 on candidate AND incumbent
│ │ └─ QLoRA r16/α32 on the verified set (gated; ~hours)
│ └─ compile + exec + run asserts in an isolated subprocess
│ THE Σ₀ GROUND-TRUTH GATE — only a green subprocess counts
└─ gather {fn, instruction, code, asserts} from the system's own runs
| Stage | Script | Status |
|---|---|---|
| harvest | scripts/harvest_coding_corpus.py |
✅ built + run |
| execution-verify | build_ouro_coding_dataset.load_extra_candidates (sandboxed subprocess) |
✅ built + run |
| train | scripts/train-qlora-ouro.py |
✅ built (GPU, opt-in) |
| eval | scripts/eval_humaneval_ouro.py |
✅ built |
| eval-gated promote | scripts/continual_ouro_pipeline.py stage_promote |
✅ built + self-tested |
The orchestrator is scripts/continual_ouro_pipeline.py.
The two ground-truth gates
This loop has exactly two places where something is accepted, and both require external evidence, per the Σ₀ rule (nothing is accepted without evidence):
- Into training — execution gate. Every harvested candidate is compiled,
exec'd,
and run against its own asserts in an isolated subprocess with a timeout (verify_candidate_sandboxed). A model's claim that its code works does not count; only returncode == 0 counts. Anything that errors, infinite-loops, or fails an assertion is dropped. (Evidence: a first run verified 185/186 harvested candidates; the one drop was a real AssertionError the gate caught.)
- Into production — eval gate. A newly trained adapter replaces the live
final/
adapter only if its HumanEval pass@1 beats the incumbent's by --margin (default: any strict improvement). Ties and regressions are rejected; final/ is untouched. The decision is a pure, unit-tested function (decide_promotion); the incumbent is backed up to final.bak-<ts>/ before any swap.
Harvest sources
The harvester normalizes {fn, instruction, code, asserts} candidates from the system's own runs, dedups (by name + code hash), and writes one --extra-candidates JSONL:
data/ouro-corpus-raw.json— the multi-agent corpus receipt (a 12-agent run
producedcandidates). Populated today.
--source-jsonl <path>(repeatable) — the live-run extension point: any future
stream of coding successes that autowork / keystone / chat emit as {fn, instruction, code, asserts} rows. Interface built; no live emitter yet.
Normalization detail: agent corpora emit a placeholder def fn(...) while labeling the task with a descriptive name (swap_case). The harvester renames the function consistently across instruction + code + asserts so the verifier accepts it and the training signal uses real names.
Deliberately NOT harvested:
data/eval/humaneval/*.jsonlpassing completions — training on HumanEval would
contaminate the pass@1 eval the loop optimizes against. Excluded on purpose.
lessons.db— that is the Kalshi/trading lessons store, not a coding-success source.
The architectural boundary (why it's offline)
The Σ₀ Briefing North Star states:
PERSISTENT LEARNING, NOT WEIGHT MODIFICATION. Improve via retrieval and reasoning, not retraining.
QLoRA is weight modification. This loop survives that rule by being offline, opt-in, and outside the live request path:
- It is a script you (or a cron) run — it is never invoked from chat / the live
Observe→Reason→Act→Verify→Converge loop.
- The live loop still learns the Σ₀ way: append-only memory + routing leaderboard, no
retraining.
- Promotion only swaps a drop-in adapter the model-broker already treats as one
interchangeable local model — the architecture doesn't change, one pluggable brain does.
Wiring this into the live loop (auto-retrain on the fly) would cross that line and is out of scope by design. The flywheel is real; a human or scheduled job spins it.
Run it
# Safe default — harvest + execution-verify only; reports what WOULD train (no GPU):
.venv-train/Scripts/python scripts/continual_ouro_pipeline.py
# Verify the promote-gate logic with no GPU/model:
.venv-train/Scripts/python scripts/continual_ouro_pipeline.py --self-test
# Full loop on the GPU box (train → eval both → eval-gated promote):
.venv-train/Scripts/python scripts/continual_ouro_pipeline.py --train --eval --promote
# Eval-gate a pre-trained candidate against the incumbent (no retrain). The adapter dir
# is the one holding adapter_config.json — train-qlora-ouro.py writes that to <out>/final:
.venv-train/Scripts/python scripts/continual_ouro_pipeline.py \
--candidate D:/lantern-train/ouro-sigma0-adapters/coding-v3/final --eval --promote --full
Outputs:
data/ouro-harvest-candidates.jsonl— normalized candidatesmodels/lantern-sigma0-coder/training-data.harvested.jsonl— base + verified rows (non-destructive)data/eval/leaderboard.jsonl— onepass@1summary row per evaldata/eval/ouro-promotion-log.jsonl— append-only Convergence Record per promote decision
Measured baseline (incumbent final/)
The promotion gate does not read a stored number — stage_eval re-evaluates the incumbent final/ adapter live each run and compares it to the candidate in memory.
The full set is harder than the first-20 subset. Logged points to date (FAST cached path) were on a 20-problem slice: baseline HE-20 pass@1 0.05 → coding-v3 HE-20 pass@1 0.75. The full 164-problem rerun (ouro-final-rerun-full) was attempted but did not complete — no entry exists in data/eval/leaderboard.jsonl for it. The in-flight partial (~67/115 at ~58% pass rate) is an interrupted estimate, not a measured result. Treat the first-20 0.75 as the only hard data point until the full rerun completes. HumanEval's back half is materially harder; the real full-set figure is likely well below 0.75 but is pending measurement.
Status (2026-06-22): harvest + execution-verify + promote-gate built, run, and tested (adversarially reviewed — loop-breaking and data-quality findings fixed; the crash-safe promote swap is unit-tested incl. forced mid-swap failure → rollback). Full-HumanEval rerun of
final/(ouro-final-rerun-full) did not complete — the eval was interrupted;data/eval/leaderboard.jsonlhas no full-run entry. A 600-step combined adapter training run was started 2026-06-22 ~06:29 (est. ~6.3 hr); the eval + promote-gate have not been run against the resulting adapter yet.Next required steps (see #911): stop
ouro_serve.pyto free the GPU, then run:continual_ouro_pipeline.py --eval --promote --full --candidate <new-adapter-path>This will re-evaluate the incumbent live, compare against the candidate, and — if the candidate wins — swap it into
final/and append a Convergence Record todata/eval/ouro-promotion-log.jsonl. Record theouro-final-rerun-fullpass@1in this section when it completes.
Cloud GPU Rotation Orchestrator (2026-06-22)
The local RTX(8GB VRAM) runs at seq=1536 in ~38 sec/step but is single-process — it can't train while serving Ouro. To decouple training from the live session, all cloud training now goes through a provider rotation system in routes/gpu-training.js and lib/training-dispatcher.js, configured by data/pcsf/gpu-training.pcsf.json.
Providers (rotation order, most to least preferred)
| Provider | GPU | VRAM | Quota | Automatable | Status |
|---|---|---|---|---|---|
| Kaggle | T4 / P100 | 16 GB | 30 h/wk | Yes (REST kernel push) | Active |
| SageMaker Studio Lab | T4 | 16 GB | 28 h/wk | Manual | Next up |
| Google Colab | T4 | 15 GB | 22 h/wk | Manual (notebook emit) | Available |
| Lightning AI | T4 / A10 | 16–24 GB | 22 h/wk | Yes (lightning-sdk) | Available |
| Local RTX 3070 | RTX 3070 | 8 GB | Unlimited | Yes | Available when idle |
| Paperspace | M4000 | 8 GB | 0 h/wk | Yes | PRO required ($8/mo) |
Total free cloud quota:h/wk.
API endpoints
GET /api/gpu-training/status — providers + last 20 jobs + active job + next provider
GET /api/gpu-training/providers — PCSF config
POST /api/gpu-training/dispatch — { provider?, steps?, checkpointUri? }
POST /api/gpu-training/poll — { provider, jobId }
POST /api/gpu-training/test — smoke-test credential resolution (no GPU spin-up)
GET /api/gpu-training/keys — credential status (masked)
POST /api/gpu-training/keys — { key, value } — set + persist to Windows User env
The UI (Dashboard / orchestration.html) wraps these endpoints with a provider card grid, a dispatch form, and a live job log.
Training job log (as of 2026-06-23)
| Time (UTC) | Provider | Job | Status | Notes |
|---|---|---|---|---|
| 2026-06-22 22:14 | Colab | ouro-training-1782166464438 | manual_required |
Notebook emitted; awaiting manual launch |
| 2026-06-22 22:14 | Kaggle | ouro-training-test | done |
Smoke-test — no training run |
| 2026-06-22 22:45 | Kaggle | ouro-training-600-steps | running (last poll) |
Polled once as running; no completion record — likely session expired |
| 2026-06-22 23:40 | Kaggle | ouro-training-202606222340-600-steps | failed |
rawStatus: error; polled 00:19 UTC; no failure message returned by Kaggle API |
Root cause of Kaggle failure (likely): the kernel title slug mismatch logged in cliOutput ("Your kernel title does not resolve to the specified id") suggests the kernel was pushed but Kaggle's scheduler rejected it before it ran. The training script itself was not the issue — the kernel push/title resolution needs fixing in training-dispatcher.js.
Next action: SageMaker is next in rotation (manual launch). Open https://studiolab.sagemaker.aws/, start a GPU runtime, paste the notebook template from the next dispatch record, run all cells.
Convergence record implication
Every dispatch + poll is written to data/self-improvement/training-jobs.jsonl. Every model interaction in dream-chat now also writes a convergence record to data/convergence/records.jsonl (implemented 2026-06-22, PR #1066). Failures are logged with priority: HIGH and confidence: 0.1 for audit.
Related
- SIGMA0-OURO-CODER.md — the looped local coder this trains (single source of truth, incl. the adaptive-depth Q-exit mechanism)
- CONVERGANCE-SIGMA0-BRIEFING.md — the North Star boundary