Σ₀ Ouro Coder — the local coding agent (single source of truth)
This is the one doc for the local Σ₀ coder — then and now. It supersedes and folds in two older pages:
LANTERN-SIGMA0-CODER.md— what we had then: the Qwen2.5-Coder-3B QLoRA modelserved via the Ollama binary (deprecated, kept as a tombstone).
OURO-LOOPLM.md— the loop mechanism (Q-exit math + the two loopimplementations), now described in §The loop mechanism below (deprecated, kept as a redirect).
If you landed on one of those, you're in the right place now.
📖 In plain English (start here)
What this is: a coding assistant that runs entirely on your own computer — no cloud, no internet needed. Its "brain" is a small AI model called Ouro.
The trick — it thinks in loops. Most AI models get smarter by being bigger. Ouro gets smarter by going around again: it reuses the same small set of layers several times on one problem — like re-reading a hard sentence until it clicks. That's why it's named "Ouro," after the ouroboros, the snake that eats its own tail. A loop.
It decides how hard to think. Easy question? It loops a couple of times and answers fast. Hard question? It keeps looping to think it through. A built-in "good enough yet?" check (the Q-exit gate) decides when to stop — so a tiny model can punch above its size on the hard parts without being slow on the easy ones.
It learned this project. It was fine-tuned on this repo's own past coding sessions, so it already knows the house style and conventions.
Two speeds: a Fast mode (the default — quick, reuses cached work) and a Deep "think-harder" mode you switch on for tough problems (slower, ~1 second per word).
Where it fits: it's just one swappable "brain" plugged into the bigger Lantern loop — Observe → Remember → Reason → Act → Verify → Converge. Unplug it, drop in a different model, and the rest of the system doesn't change.
What came before: an earlier version used a different brain (Qwen) and needed a separate "Ollama" program to run it. We retired that — the new one is smaller, loops, and runs itself. See *What we had then → what we have now*.
Honest about limits: it's small (1.4 billion parameters), it's a real but modest fine-tune (not a production-grade model), and its Deep mode is genuinely slow. A capable local helper — not a frontier model.
🎙️ Want it read aloud? Press the Listen bar at the bottom of this page.
The rest of this page is the precise, technical version. ↓
The Σ₀ Ouro Coder is the Σ₀ coding agent running on Ouro (the Ouroboros looped language model, arXiv:2510.25741) instead of a plain transformer. It is the same Convergence-Core coder path — Reason → Act for code — but its local brain is Ouro-1.4B with weight-tied recurrent depth + a learned Q-exit gate, plus our own Σ₀ fine-tune. It runs fully local, served as a drop-in Ollama-API model.
What we had then → what we have now
There used to be two local coders documented separately; there is now one. This is the arc:
Then (lantern-sigma0-coder, 2026-06-18) |
Now (Σ₀ Ouro Coder, since 2026-06-20) | |
|---|---|---|
| Base model | Qwen/Qwen2.5-Coder-3B-Instruct (plain transformer) |
ByteDance/Ouro-1.4B-Thinking (weight-tied looped transformer) |
| Σ₀ tune | QLoRA onpairs /sessions;epochs /steps; loss 2.87 → 1.78 | QLoRA on the Σ₀ Claude-session set;epochs, bf16 base (4-bit arch-gated), r=16/α=32 over all-linear |
| Serving | Ollama binary (lantern-sigma0-coder-v2) |
scripts/ouro_serve.py — drop-in Ollama HTTP API, no Ollama binary |
| Adaptive depth | none (single forward pass) | Sigma0LoopLM Q-exit — loop until CDF(t) ≥ q |
| Routing | leaderboard-preferred (model-leaderboard.js) |
drop-in on :11434; leaderboard integration is a follow-up |
| Status | deprecated & removed | active local coder |
Why the switch (issue #811 / PR #823 — "Ollama sunset"): we retired the external Ollama binary as a hard dependency and moved to a Python server that speaks the Ollama API. That made it natural to swap the brain for Ouro, whose looped recurrent depth lets a 1.4B model spend extra computation on hard turns and exit early on easy ones — a better trade for a small local model than a larger single-pass one.
Verified on disk (2026-06-20): the Qwen training outputs (D:\lantern-train\sigma0-adapters, sigma0-merged) were removed and lantern-sigma0-coder-v2 is no longer registered in Ollama (only the base qwen2.5-coder blob remained). The active local coder is the Ouro Σ₀ adapter at D:\lantern-train\ouro-sigma0-adapters\final\ (base ByteDance/Ouro-1.4B, LoRA r=16/α=32, trained locally). The Qwen continual-training track was deleted as bloat — do not rebuild it; the live retrain pipeline is SIGMA0-CONTINUAL-TRAINING.md.
What it is
| Base model | ByteDance/Ouro-1.4B-Thinking (weight-tied recurrent transformer) |
| Σ₀ tune | QLoRA on the Σ₀ Claude-session set (scripts/train-qlora-ouro.py;epochs, bf16 base, LoRA r=16/α=32 over all-linear, lr 2e-4, seq 1536) |
| Adaptive depth | Sigma0LoopLM (src/sigma0/loop_lm.py) — three exit policies (OURO_MODE): qexit (trained gate, default), converge (first-order fixed point), accel (spiral-robust, certificate-consistent) |
| Collapse guard | DecodeCanary per-token sigma0_proximity monitor (observe-only by default; OURO_ADAPT=1 lets it deepen the loop to fight its own incipient collapse) — the collapse certificate is the safety foundation |
| Serving | scripts/ouro_serve.py — drop-in Ollama HTTP API (ouro:latest on :11434); fast cached default + opt-in native deep mode |
| 8GB / long-context | OURO_4BIT=1 (NF4 base, ~7.7→1.85 GB) + OURO_KV_INT8=1 (int8 KV cache) + OURO_UT_STEPS=2 (halves the recurrent KV) — reaches CC-scale (15–20k) prompts on anGB card |
| Integration | transparent: the coder/agent path POSTs to OLLAMA_BASE_URL (default :11434) — point it at ouro_serve and the whole path uses Ouro |
| Claude Code | protocol bridge solved (scripts/ouro_anthropic_bridge.py); the 1.4B adapter is not yet reliable enough to drive CC — see integration status |
Why Ouro for the coder
Ouro builds reasoning into computation depth — reusing weight-tied layers R times in latent space — rather than into token length (the paper's "third scaling axis": loop depth). For a small local model that's a good trade: spend extra recurrent steps on hard coding/reasoning turns and exit early on easy ones. The Σ₀ QLoRA tune adapts it to this codebase from past Claude-Code sessions, so it learns the repo's idioms while staying 1.4B and local.
The loop mechanism
(absorbed from the former OURO-LOOPLM.md.)
Source: Scaling Latent Reasoning via Looped Language Models (Ouro, arXiv:2510.25741). PDF in repo: docs/research-papers/ouro-looped-llm-2510.25741.pdf.
The idea (paper)
LoopLM builds reasoning into computation by reusing weight-tied layers R times in latent space (a "third scaling axis": loop depth). Key mechanisms we borrow:
- Adaptive depth + learned early-exit (Q-exit): a gate emits per-step exit probabilities;
exit at the first step where the cumulative CDF(t) ≥ q. q trades compute for accuracy.
- Entropy-regularized depth (uniform prior) prevents collapse to always-shallow/deep.
- Deeper-is-better, with diminishing returns — most inputs converge by mid-depth.
1. Native latent loop on real Ouro weights (the real thing)
src/sigma0/loop_lm.py — Sigma0LoopLM is our implementation of the paper's Q-exit adaptive-depth policy (λ→survival→CDF→first-step-≥q), run on Ouro's pretrained weight-tied block + exit gate (we do not pretrain a LoopLM — that needs 7.7T tokens). This activates the adaptive inference the stock Ouro checkpoint leaves off: its generate() threads no per-call exit threshold, so it runs fixed full depth. Our module reads the per-step gates, applies Q-exit, and reports the realized per-token loop depth (mean_depth); generate() returns exit_reason: "adaptive_qexit". Defaults: q=0.5, max_new_tokens=200, repetition penalty 1.3.
- Probe it:
python -m sigma0.loop_lmprints the realized mean depth — adaptive and
below the recurrent step count (total_ut_steps), i.e. not fixed-depth. (This probe output is not yet persisted to an eval artifact, so treat the number as a live observation, not a benchmark.)
- Trained on our data: QLoRA fine-tune of Ouro-1.4B on the Σ₀ Claude-session set
(scripts/train-qlora-ouro.py). Adapter loads via Sigma0LoopLM.load(base, adapter=…).
- Three exit policies (
OURO_MODE, now wired into serving):Sigma0LoopLM.generate()takes
mode=:
qexit(default,exit_reason: "adaptive_qexit") — the trained entropy/confidence
gate; exit at the first step with CDF(t) ≥ q. This is what Ouro was trained for.
converge(exit_reason: "convergence_exit", returnsmean_contraction) — exit on a
first-order latent fixed point ‖hₜ − hₜ₋₁‖/‖hₜ₋₁‖ < ε. The falsifiable "spiral" experiment (E2).
accel— exit on the spiral-robust second-order acceleration criterion
‖Δᵏ − Δᵏ⁻¹‖/‖·‖ < ε held for patience steps (Two-Scale, arXiv:2509.23314). First-order converge false-exits on SPIRAL dynamics — the case the collapse certificate §1.1 flags as hard (where the energy proof fails); accel is the certificate-consistent upgrade.
All three are selectable on the served deep path (ouro_serve.py, OURO_NATIVE=1, OURO_MODE=…); qexit remains the default. See research/2026-06-19-convergence-tesseract-spiral.md and the collapse explainer.
- DecodeCanary + depth coupling (the intrinsic anti-collapse mechanism): in native mode the
per-token DecodeCanary (#766/#793) folds self-repeat / n-gram echo / argmax-margin / entropy-collapse z-alarms into one sigma0_proximity score. OURO_CANARY=1 (default in native) runs it observe-only — telemetry only (canary_max_proximity / spooks / signal). OURO_ADAPT=1 arms the actuator: as proximity rises, knobs() deepens the recurrent loop and raises the repetition penalty — the model stepping deeper to resolve its own incipient degeneration (#1014, divergence→depth coupling). Native loop only; the fast cached path is plain HF decode and never sees the canary.
2. API-level re-prompt loop (provider-agnostic approximation)
For any plain (non-looped) local model, we also approximate the loop by re-prompting:
lib/loop-reasoner.js—loopedReason()
runs the model up to MAX_LOOPS (4, = Ouro R4), feeding each prior answer back as a Coconut-style context prefix, and exits via cdfExit():
threshold_met— confidence≥ CDF_THRESHOLD(0.85)converged—|Δconfidence| < CONVERGENCE_EPS(0.04), the entropy-plateau analog
(requires ≥loops)
max_loops— compute budget hit
Confidence is heuristic — extractConfidence() parses a Confidence: field or estimates from structure. The module also exports a one-shot singleReason() (exit_reason: "single_pass") and the three constants; callers may override maxLoops/cdfThreshold per call.
- Wired into
lib/stream-chat.js— for
reasoning/coding intents (and only when not Keystone-debug, not roleplay, and no explicit provider was picked), a looped pass runs on the local model and the done event carries loop_n / confidence / exit_reason. The "Loop Depth (Σ₀)" panel in dream-chat.html renders them as ⟳ N loop(s) · X% conf · <exit_reason>; the provider dropdown's "Local Σ₀ Loop (Ouro)" option is the user-facing entry. On error the pass falls through to normal streaming (non-fatal).
Where it maps in the codebase
| Paper concept | Lantern |
|---|---|
| Recurrent steps R | Ouro total_ut_steps (native) · MAX_LOOPS (re-prompt) |
Q-exit CDF(t) ≥ q |
qexit_step() in loop_lm.py (native) · cdfExit() (re-prompt) |
| Realized adaptive depth | mean_depth (native) · "Loop Depth (Σ₀)" panel (loop_n/confidence/exit_reason) |
| Deeper-is-better, diminishing | early-exit at the first step with CDF ≥ q |
| Knowledge manipulation > capacity | small local model + KB grounding (CSF spec §2.9) |
Grounding note: this doc is markdown, indexed by
scripts/build_knowledge_index.pyintodata/knowledge/index.jsonl, so the Knowledge Center can ground / near-route on it. A doc becomes grounded by being linked fromknowledgecenter.html(the indexer scrapes/repo/*.mdhrefs). Re-run the indexer after editing so the snapshot matches the live text.
How the agent uses it (no code change)
ouro_serve.py speaks the Ollama HTTP API (/api/chat, /api/generate, /api/tags) and defaults to port 11434, advertising the model as ouro:latest. The Σ₀ coder/agent path already calls a local model over exactly that API:
- streaming chat is Ollama-first (
OLLAMA_BASE_URL, defaulthttp://127.0.0.1:11434); - the looped re-prompt pass (
lib/loop-reasoner.js)
and the MCP Kernel worker (task_run → /api/dream/chat) hit the same local endpoint.
So run ouro_serve.py on 11434 and the entire coder/agent path transparently runs on Ouro — Observe → Remember → Reason → Act → Verify → Converge with a looped brain, no code change. OURO_MODEL defaults to ByteDance/Ouro-1.4B-Thinking; set OURO_ADAPTER for the Σ₀ tune. (Ouro is a drop-in; unlike the old Qwen coder it is not yet registered in the model-broker leaderboard — that's a follow-up.)
Two inference modes
- Default — fast cached. Uses Ouro's
UniversalTransformerCache; this is the chat/coder
default (the product gate is speed). Plain HF decode — no canary, no adaptive depth.
- Deep — native adaptive loop.
OURO_NATIVE=1activatesSigma0LoopLM: per-token exit by
OURO_MODE (qexit default, q = OURO_Q, default 0.5; or converge/accel, eps = OURO_EPS, default 0.05), realized depth reported as mean_depth. It is no-cache (~1 s/token), so it's an opt-in "think-harder" mode. Tunable via OURO_NATIVE_MAX (80). The canary (OURO_CANARY, on) and actuator (OURO_ADAPT) live here.
Knob reference (ouro_serve.py)
| Knob | Default | What it does |
|---|---|---|
OURO_NATIVE |
0 |
1 = deep adaptive loop; 0 = fast cached |
OURO_MODE |
qexit |
exit policy: qexit / converge / accel (native only) |
OURO_Q / OURO_EPS |
0.5 / 0.05 |
Q-exit threshold · convergence/accel ε |
OURO_CANARY / OURO_ADAPT |
1 / 0 |
collapse monitor (observe) · depth-coupling actuator |
OURO_UT_STEPS |
model default | recurrent-step count — the proven decode-speed lever (3 ≈ 1.28×) and the long-context KV lever (2 halves the recurrent cache) |
OURO_4BIT |
0 |
NF4 base (~7.7→1.85 GB; forces LoRA unmerged) |
OURO_KV_INT8 |
0 |
int8 KV cache (~halves it, near-lossless; cached path) |
OURO_MERGE / OURO_ATTN |
1 / sdpa |
merge LoRA into base · attention kernel — together ~2.8× faster (#775) |
OURO_REP_PENALTY / OURO_NO_REPEAT_NGRAM |
1.3 / 3 |
small-model degeneration guards (both paths) |
OURO_SAMPLE / OURO_TEMPERATURE / OURO_TOP_P |
0(greedy) / 0.7 / 0.9 |
sampling for chat-natural output |
OURO_ADAPTER / OURO_MODEL |
— / …/Ouro-1.4B-Thinking |
Σ₀ adapter dir · base model id |
The API re-prompt loop (§2 above) is gated separately by LOOP_REASONER=1.
Transformers version: Ouro's custom modeling code requires transformers ≥ 4.54 (its configuration_ouro.py imports layer_type_validation, added in 4.54); the local .venv-train runs 4.57 and the Kaggle/Lightning dispatch wrappers are now pinned to 4.57 (the old >=4.40,<4.53 cap broke the model load — fixed in b5c62465). OuroConfig.pad_token_id is None and must be patched to bos_token_id before from_pretrained; the train/serve scripts do this. No transformers entry in requirements.txt (training env only).
Run it
# 1. (optional) train the Σ₀ adapter — needs transformers>=4.40 + a CUDA GPU (local: 4.57.6 works)
python scripts/train-qlora-ouro.py --epochs 3
# 2. serve Ouro as a drop-in Ollama model on :11434
# set OURO_ADAPTER to the adapter dir produced by step 1
OURO_ADAPTER=<adapter_dir> python scripts/ouro_serve.py
# 3. (optional) deep adaptive-depth mode — qexit (default), or converge/accel
OURO_NATIVE=1 OURO_ADAPTER=<adapter_dir> python scripts/ouro_serve.py
OURO_NATIVE=1 OURO_MODE=accel OURO_ADAPT=1 OURO_ADAPTER=<adapter_dir> python scripts/ouro_serve.py
# 3b. 8GB / CC-scale (15-20k-token) prompts: 4-bit base + int8 KV + shallow loop
OURO_4BIT=1 OURO_KV_INT8=1 OURO_UT_STEPS=2 OURO_ADAPTER=<adapter_dir> python scripts/ouro_serve.py
# 4. probe the realized loop depth directly
python -m sigma0.loop_lm
The garage chat path (4177/4178) and the MCP task_run worker then use Ouro with no further config — they already point at :11434.
Continual training
The local adapter improves offline via the Σ₀ continual-training loop (SIGMA0-CONTINUAL-TRAINING.md): harvest → execution-verify → train → eval → eval-gated promote. Two ground-truth gates (only green subprocesses train; only a measured pass@1 win promotes), kept offline by design. This replaces the old scripts/continual-train.ps1 Qwen flow.
Where it fits the loop
This is [06] LANTERN-CODER realized on a looped model: Coder = Kernel + Memory + Tools + "improve the codebase" task type — a task type, not a separate system. Ouro plugs into [02] LANTERN-MODEL-BROKER as one interchangeable local model; its adaptive depth serves the Reason/Act stages; every turn still emits a PCSF receipt + Convergence Record (Verify/Converge). It is fully in-house and offline — see the Σ₀ Briefing and the Superfleet design (workers run this loop on Tasks).
Honest scope
- 1.4B, single-pass QLoRA — a genuine fine-tune, not production-grade; quality ratchets
via continual training.
- Native loop (§1) is real adaptive depth on Ouro's weight-tied checkpoint — but
inference-time only (we don't pretrain), and the no-cache Q-exit path is slow (~1 s/token), so it's opt-in deep mode; the default served path is the fast cached one.
- Re-prompt loop (§2) is an API-level approximation — it refines by re-prompting a
standard model, not shared-weight latent loops. Confidence and exit are heuristic.
- Drop-in, not yet leaderboard-routed — you select Ouro by serving it on 11434, not via
the model-broker leaderboard (that integration is a follow-up).
- **Can't yet drive Claude Code** — the protocol bridge round-trips cleanly, but the 1.4B
adapter under-triggers tools and is overwhelmed by CC's ~20k-token system prompt. The reliable surfaces are the in-app chat and the standalone agent loop. See SIGMA0-CODER-CLAUDE-CODE-STATUS.md.
- Deep-mode depth is now logged; bench-grade numbers still aren't — the served native path
appends realized mean_depth + contraction to the eval leaderboard (_persist_loop_meta, #777), but the python -m sigma0.loop_lm probe output remains a live observation, not a persisted benchmark.
Training status (2026-06-25)
Cloud GPU dispatch (orchestration.html → routes/gpu-training.js → lib/training-dispatcher.js) is wired; dispatch + poll + convergence-logging all work, and the weekly scheduled task (KeystoneWeeklyTraining, Mondays 00:00 UTC) is live. Providers are configured via data/pcsf/gpu-training.pcsf.json; credentials live in Windows User-scope env vars and sync into process.env at first call. Dispatching a real run this week drove out a chain of five bugs and a strategic finding — full write-up: research/gpu-training-pipeline-diagnosis-2026-06-25.md.
The strategic finding — Kaggle is the wrong GPU class. The recipe deliberately prefers bf16 (fp16 QLoRA on this reasoning LM overflows gradients to NaN, which clipping bakes into a garbage adapter), and bf16 is Ampere-only (cc ≥ 8.0). Kaggle's free fleet is exclusively pre-Ampere (P100 cc 6.0, T4 cc 7.5), so it can't be a trustworthy target. The arch-aware fixes stop the crashes (Kaggle now degrades to plain fp16 LoRA instead of dying), but the correct automatable target is Lightning AI's A10 (cc 8.6) — the same Ampere class as the localGB RTX where the good adapters trained.
| Provider | Class | Status |
|---|---|---|
| Local RTX (8 GB) | Ampere (bf16 ✓) | ✓ primary — where the live Σ₀ adapter trained |
| Lightning AI (A10) | Ampere (bf16 ✓) | recommended cloud target; wired, but dispatch currently fails on a Lightning-SDK teamspace/owner-inference bug (error_count: 3 in the PCSF) — restore is the open follow-up |
| Kaggle (P100/T4,h/wk free) | pre-Ampere (no bf16) | crash-free fallback only; reaches the training loop but fp16 adapter quality is not dependable |
HuggingFace Hub (lanternfounder/ouro-checkpoints) |
— | ✓ upload + download roundtrip passes |
| Paperspace / Colab / SageMaker | — | credentials present; full dispatch untested/blocked |
The five bugs fixed this week (each advanced the run one stage):
| Stage reached | Root cause | Fix |
|---|---|---|
| deploy gate | api-tools-log.js exported an Express Router() not the (req,res,url,deps)⇒bool convention; threw fn.apply and 500'd /api/convergence/health — the deploy health-check endpoint — silently rolling back every stable deploy for ~2 days |
0e98dbfe |
| dataset mount | kernel looked for .json; the Kaggle Dataset ships .jsonl |
225880ee (probe both) |
| model load | transformers>=4.40,<4.53 pin too old — Ouro needs layer_type_validation (4.54+) |
b5c62465 (pin → 4.57) |
| CUDA init (4-bit) | hardcoded NF4; bitsandbytes kernels need cc ≥ 7.5; Kaggle gave a P100 (6.0) | 5e7e9e87 (arch-aware: skip 4-bit on cc < 7.5) |
| CUDA init (bf16) | torch.cuda.is_bf16_supported() false-positives on P100; first bf16 op crashes |
8b1475a0 (gate bf16 on cc ≥ 8.0) |
Seq-length note: corpus p99 audited attokens; bumped to seq=1536 so the tail of function-call outputs is no longer truncated — fits an A10/local-RTX without swapping to CPU.
Related
- SIGMA0-CONTINUAL-TRAINING.md — the offline retrain flywheel that improves this adapter
- SIGMA0-COLLAPSE-CERTIFICATE.md · SIGMA0-COLLAPSE-EXPLAINER.md — the safety foundation; why
accelexit is the certificate-consistent policy - SIGMA0-CODER-CLAUDE-CODE-STATUS.md — can it drive Claude Code? (bridge solved, model-reliability blocked)
- research/gpu-training-pipeline-diagnosis-2026-06-25.md — the 5-bug chain + Kaggle-is-pre-Ampere finding
- SUPERFLEET-SWARM-DESIGN.md — the worker swarm that runs this loop on Tasks
- CONVERGANCE-SIGMA0-BRIEFING.md — the architecture North Star
- CSF-FORMAT-SPECIFICATION.md — §2.9 KB grounding index + near routing
- LANTERN-SIGMA0-CODER.md · OURO-LOOPLM.md — the two superseded pages this consolidates (tombstones)