docs/SIGMA0-OURO-CODER.md
By Alex Place · Updated 2026-06-25

Σ₀ Ouro Coder — the local coding agent (single source of truth)

This is the one doc for the local Σ₀ coder — then and now. It supersedes and folds in two older pages:

  • LANTERN-SIGMA0-CODER.mdwhat we had then: the Qwen2.5-Coder-3B QLoRA model

served via the Ollama binary (deprecated, kept as a tombstone).

  • OURO-LOOPLM.md — the loop mechanism (Q-exit math + the two loop

implementations), now described in §The loop mechanism below (deprecated, kept as a redirect).

If you landed on one of those, you're in the right place now.

📖 In plain English (start here)

What this is: a coding assistant that runs entirely on your own computer — no cloud, no internet needed. Its "brain" is a small AI model called Ouro.

The trick — it thinks in loops. Most AI models get smarter by being bigger. Ouro gets smarter by going around again: it reuses the same small set of layers several times on one problem — like re-reading a hard sentence until it clicks. That's why it's named "Ouro," after the ouroboros, the snake that eats its own tail. A loop.

It decides how hard to think. Easy question? It loops a couple of times and answers fast. Hard question? It keeps looping to think it through. A built-in "good enough yet?" check (the Q-exit gate) decides when to stop — so a tiny model can punch above its size on the hard parts without being slow on the easy ones.

It learned this project. It was fine-tuned on this repo's own past coding sessions, so it already knows the house style and conventions.

Two speeds: a Fast mode (the default — quick, reuses cached work) and a Deep "think-harder" mode you switch on for tough problems (slower, ~1 second per word).

Where it fits: it's just one swappable "brain" plugged into the bigger Lantern loop — Observe → Remember → Reason → Act → Verify → Converge. Unplug it, drop in a different model, and the rest of the system doesn't change.

What came before: an earlier version used a different brain (Qwen) and needed a separate "Ollama" program to run it. We retired that — the new one is smaller, loops, and runs itself. See *What we had then → what we have now*.

Honest about limits: it's small (1.4 billion parameters), it's a real but modest fine-tune (not a production-grade model), and its Deep mode is genuinely slow. A capable local helper — not a frontier model.

🎙️ Want it read aloud? Press the Listen bar at the bottom of this page.

The rest of this page is the precise, technical version. ↓

The Σ₀ Ouro Coder is the Σ₀ coding agent running on Ouro (the Ouroboros looped language model, arXiv:2510.25741) instead of a plain transformer. It is the same Convergence-Core coder path — Reason → Act for code — but its local brain is Ouro-1.4B with weight-tied recurrent depth + a learned Q-exit gate, plus our own Σ₀ fine-tune. It runs fully local, served as a drop-in Ollama-API model.

What we had then → what we have now

There used to be two local coders documented separately; there is now one. This is the arc:

Then (lantern-sigma0-coder, 2026-06-18) Now (Σ₀ Ouro Coder, since 2026-06-20)
Base model Qwen/Qwen2.5-Coder-3B-Instruct (plain transformer) ByteDance/Ouro-1.4B-Thinking (weight-tied looped transformer)
Σ₀ tune QLoRA onpairs /sessions;epochs /steps; loss 2.87 → 1.78 QLoRA on the Σ₀ Claude-session set;epochs, bf16 base (4-bit arch-gated), r=16/α=32 over all-linear
Serving Ollama binary (lantern-sigma0-coder-v2) scripts/ouro_serve.pydrop-in Ollama HTTP API, no Ollama binary
Adaptive depth none (single forward pass) Sigma0LoopLM Q-exit — loop until CDF(t) ≥ q
Routing leaderboard-preferred (model-leaderboard.js) drop-in on :11434; leaderboard integration is a follow-up
Status deprecated & removed active local coder

Why the switch (issue #811 / PR #823 — "Ollama sunset"): we retired the external Ollama binary as a hard dependency and moved to a Python server that speaks the Ollama API. That made it natural to swap the brain for Ouro, whose looped recurrent depth lets a 1.4B model spend extra computation on hard turns and exit early on easy ones — a better trade for a small local model than a larger single-pass one.

Verified on disk (2026-06-20): the Qwen training outputs (D:\lantern-train\sigma0-adapters, sigma0-merged) were removed and lantern-sigma0-coder-v2 is no longer registered in Ollama (only the base qwen2.5-coder blob remained). The active local coder is the Ouro Σ₀ adapter at D:\lantern-train\ouro-sigma0-adapters\final\ (base ByteDance/Ouro-1.4B, LoRA r=16/α=32, trained locally). The Qwen continual-training track was deleted as bloat — do not rebuild it; the live retrain pipeline is SIGMA0-CONTINUAL-TRAINING.md.

What it is

Base model ByteDance/Ouro-1.4B-Thinking (weight-tied recurrent transformer)
Σ₀ tune QLoRA on the Σ₀ Claude-session set (scripts/train-qlora-ouro.py;epochs, bf16 base, LoRA r=16/α=32 over all-linear, lr 2e-4, seq 1536)
Adaptive depth Sigma0LoopLM (src/sigma0/loop_lm.py) — three exit policies (OURO_MODE): qexit (trained gate, default), converge (first-order fixed point), accel (spiral-robust, certificate-consistent)
Collapse guard DecodeCanary per-token sigma0_proximity monitor (observe-only by default; OURO_ADAPT=1 lets it deepen the loop to fight its own incipient collapse) — the collapse certificate is the safety foundation
Serving scripts/ouro_serve.py — drop-in Ollama HTTP API (ouro:latest on :11434); fast cached default + opt-in native deep mode
8GB / long-context OURO_4BIT=1 (NF4 base, ~7.7→1.85 GB) + OURO_KV_INT8=1 (int8 KV cache) + OURO_UT_STEPS=2 (halves the recurrent KV) — reaches CC-scale (15–20k) prompts on anGB card
Integration transparent: the coder/agent path POSTs to OLLAMA_BASE_URL (default :11434) — point it at ouro_serve and the whole path uses Ouro
Claude Code protocol bridge solved (scripts/ouro_anthropic_bridge.py); the 1.4B adapter is not yet reliable enough to drive CC — see integration status

Why Ouro for the coder

Ouro builds reasoning into computation depth — reusing weight-tied layers R times in latent space — rather than into token length (the paper's "third scaling axis": loop depth). For a small local model that's a good trade: spend extra recurrent steps on hard coding/reasoning turns and exit early on easy ones. The Σ₀ QLoRA tune adapts it to this codebase from past Claude-Code sessions, so it learns the repo's idioms while staying 1.4B and local.

The loop mechanism

(absorbed from the former OURO-LOOPLM.md.)

Source: Scaling Latent Reasoning via Looped Language Models (Ouro, arXiv:2510.25741). PDF in repo: docs/research-papers/ouro-looped-llm-2510.25741.pdf.

The idea (paper)

LoopLM builds reasoning into computation by reusing weight-tied layers R times in latent space (a "third scaling axis": loop depth). Key mechanisms we borrow:

  • Adaptive depth + learned early-exit (Q-exit): a gate emits per-step exit probabilities;

exit at the first step where the cumulative CDF(t) ≥ q. q trades compute for accuracy.

  • Entropy-regularized depth (uniform prior) prevents collapse to always-shallow/deep.
  • Deeper-is-better, with diminishing returns — most inputs converge by mid-depth.

1. Native latent loop on real Ouro weights (the real thing)

src/sigma0/loop_lm.pySigma0LoopLM is our implementation of the paper's Q-exit adaptive-depth policy (λ→survival→CDF→first-step-≥q), run on Ouro's pretrained weight-tied block + exit gate (we do not pretrain a LoopLM — that needs 7.7T tokens). This activates the adaptive inference the stock Ouro checkpoint leaves off: its generate() threads no per-call exit threshold, so it runs fixed full depth. Our module reads the per-step gates, applies Q-exit, and reports the realized per-token loop depth (mean_depth); generate() returns exit_reason: "adaptive_qexit". Defaults: q=0.5, max_new_tokens=200, repetition penalty 1.3.

  • Probe it: python -m sigma0.loop_lm prints the realized mean depth — adaptive and

below the recurrent step count (total_ut_steps), i.e. not fixed-depth. (This probe output is not yet persisted to an eval artifact, so treat the number as a live observation, not a benchmark.)

  • Trained on our data: QLoRA fine-tune of Ouro-1.4B on the Σ₀ Claude-session set

(scripts/train-qlora-ouro.py). Adapter loads via Sigma0LoopLM.load(base, adapter=…).

  • Three exit policies (OURO_MODE, now wired into serving): Sigma0LoopLM.generate() takes

mode=:

  • qexit (default, exit_reason: "adaptive_qexit") — the trained entropy/confidence

gate; exit at the first step with CDF(t) ≥ q. This is what Ouro was trained for.

  • converge (exit_reason: "convergence_exit", returns mean_contraction) — exit on a

first-order latent fixed point ‖hₜ − hₜ₋₁‖/‖hₜ₋₁‖ < ε. The falsifiable "spiral" experiment (E2).

  • accel — exit on the spiral-robust second-order acceleration criterion

‖Δᵏ − Δᵏ⁻¹‖/‖·‖ < ε held for patience steps (Two-Scale, arXiv:2509.23314). First-order converge false-exits on SPIRAL dynamics — the case the collapse certificate §1.1 flags as hard (where the energy proof fails); accel is the certificate-consistent upgrade.

All three are selectable on the served deep path (ouro_serve.py, OURO_NATIVE=1, OURO_MODE=…); qexit remains the default. See research/2026-06-19-convergence-tesseract-spiral.md and the collapse explainer.

  • DecodeCanary + depth coupling (the intrinsic anti-collapse mechanism): in native mode the

per-token DecodeCanary (#766/#793) folds self-repeat / n-gram echo / argmax-margin / entropy-collapse z-alarms into one sigma0_proximity score. OURO_CANARY=1 (default in native) runs it observe-only — telemetry only (canary_max_proximity / spooks / signal). OURO_ADAPT=1 arms the actuator: as proximity rises, knobs() deepens the recurrent loop and raises the repetition penalty — the model stepping deeper to resolve its own incipient degeneration (#1014, divergence→depth coupling). Native loop only; the fast cached path is plain HF decode and never sees the canary.

2. API-level re-prompt loop (provider-agnostic approximation)

For any plain (non-looped) local model, we also approximate the loop by re-prompting:

runs the model up to MAX_LOOPS (4, = Ouro R4), feeding each prior answer back as a Coconut-style context prefix, and exits via cdfExit():

  • threshold_met — confidence ≥ CDF_THRESHOLD (0.85)
  • converged|Δconfidence| < CONVERGENCE_EPS (0.04), the entropy-plateau analog

(requires ≥loops)

  • max_loops — compute budget hit

Confidence is heuristicextractConfidence() parses a Confidence: field or estimates from structure. The module also exports a one-shot singleReason() (exit_reason: "single_pass") and the three constants; callers may override maxLoops/cdfThreshold per call.

reasoning/coding intents (and only when not Keystone-debug, not roleplay, and no explicit provider was picked), a looped pass runs on the local model and the done event carries loop_n / confidence / exit_reason. The "Loop Depth (Σ₀)" panel in dream-chat.html renders them as ⟳ N loop(s) · X% conf · <exit_reason>; the provider dropdown's "Local Σ₀ Loop (Ouro)" option is the user-facing entry. On error the pass falls through to normal streaming (non-fatal).

Where it maps in the codebase

Paper concept Lantern
Recurrent steps R Ouro total_ut_steps (native) · MAX_LOOPS (re-prompt)
Q-exit CDF(t) ≥ q qexit_step() in loop_lm.py (native) · cdfExit() (re-prompt)
Realized adaptive depth mean_depth (native) · "Loop Depth (Σ₀)" panel (loop_n/confidence/exit_reason)
Deeper-is-better, diminishing early-exit at the first step with CDF ≥ q
Knowledge manipulation > capacity small local model + KB grounding (CSF spec §2.9)

Grounding note: this doc is markdown, indexed by scripts/build_knowledge_index.py into data/knowledge/index.jsonl, so the Knowledge Center can ground / near-route on it. A doc becomes grounded by being linked from knowledgecenter.html (the indexer scrapes /repo/*.md hrefs). Re-run the indexer after editing so the snapshot matches the live text.

How the agent uses it (no code change)

ouro_serve.py speaks the Ollama HTTP API (/api/chat, /api/generate, /api/tags) and defaults to port 11434, advertising the model as ouro:latest. The Σ₀ coder/agent path already calls a local model over exactly that API:

  • streaming chat is Ollama-first (OLLAMA_BASE_URL, default http://127.0.0.1:11434);
  • the looped re-prompt pass (lib/loop-reasoner.js)

and the MCP Kernel worker (task_run/api/dream/chat) hit the same local endpoint.

So run ouro_serve.py on 11434 and the entire coder/agent path transparently runs on Ouro — Observe → Remember → Reason → Act → Verify → Converge with a looped brain, no code change. OURO_MODEL defaults to ByteDance/Ouro-1.4B-Thinking; set OURO_ADAPTER for the Σ₀ tune. (Ouro is a drop-in; unlike the old Qwen coder it is not yet registered in the model-broker leaderboard — that's a follow-up.)

Two inference modes

  • Default — fast cached. Uses Ouro's UniversalTransformerCache; this is the chat/coder

default (the product gate is speed). Plain HF decode — no canary, no adaptive depth.

  • Deep — native adaptive loop. OURO_NATIVE=1 activates Sigma0LoopLM: per-token exit by

OURO_MODE (qexit default, q = OURO_Q, default 0.5; or converge/accel, eps = OURO_EPS, default 0.05), realized depth reported as mean_depth. It is no-cache (~1 s/token), so it's an opt-in "think-harder" mode. Tunable via OURO_NATIVE_MAX (80). The canary (OURO_CANARY, on) and actuator (OURO_ADAPT) live here.

Knob reference (ouro_serve.py)

Knob Default What it does
OURO_NATIVE 0 1 = deep adaptive loop; 0 = fast cached
OURO_MODE qexit exit policy: qexit / converge / accel (native only)
OURO_Q / OURO_EPS 0.5 / 0.05 Q-exit threshold · convergence/accel ε
OURO_CANARY / OURO_ADAPT 1 / 0 collapse monitor (observe) · depth-coupling actuator
OURO_UT_STEPS model default recurrent-step count — the proven decode-speed lever (3 ≈ 1.28×) and the long-context KV lever (2 halves the recurrent cache)
OURO_4BIT 0 NF4 base (~7.7→1.85 GB; forces LoRA unmerged)
OURO_KV_INT8 0 int8 KV cache (~halves it, near-lossless; cached path)
OURO_MERGE / OURO_ATTN 1 / sdpa merge LoRA into base · attention kernel — together ~2.8× faster (#775)
OURO_REP_PENALTY / OURO_NO_REPEAT_NGRAM 1.3 / 3 small-model degeneration guards (both paths)
OURO_SAMPLE / OURO_TEMPERATURE / OURO_TOP_P 0(greedy) / 0.7 / 0.9 sampling for chat-natural output
OURO_ADAPTER / OURO_MODEL — / …/Ouro-1.4B-Thinking Σ₀ adapter dir · base model id

The API re-prompt loop (§2 above) is gated separately by LOOP_REASONER=1.

Transformers version: Ouro's custom modeling code requires transformers ≥ 4.54 (its configuration_ouro.py imports layer_type_validation, added in 4.54); the local .venv-train runs 4.57 and the Kaggle/Lightning dispatch wrappers are now pinned to 4.57 (the old >=4.40,<4.53 cap broke the model load — fixed in b5c62465). OuroConfig.pad_token_id is None and must be patched to bos_token_id before from_pretrained; the train/serve scripts do this. No transformers entry in requirements.txt (training env only).

Run it


# 1. (optional) train the Σ₀ adapter — needs transformers>=4.40 + a CUDA GPU (local: 4.57.6 works)

python scripts/train-qlora-ouro.py --epochs 3



# 2. serve Ouro as a drop-in Ollama model on :11434

#    set OURO_ADAPTER to the adapter dir produced by step 1

OURO_ADAPTER=<adapter_dir> python scripts/ouro_serve.py



# 3. (optional) deep adaptive-depth mode — qexit (default), or converge/accel

OURO_NATIVE=1 OURO_ADAPTER=<adapter_dir> python scripts/ouro_serve.py

OURO_NATIVE=1 OURO_MODE=accel OURO_ADAPT=1 OURO_ADAPTER=<adapter_dir> python scripts/ouro_serve.py



# 3b. 8GB / CC-scale (15-20k-token) prompts: 4-bit base + int8 KV + shallow loop

OURO_4BIT=1 OURO_KV_INT8=1 OURO_UT_STEPS=2 OURO_ADAPTER=<adapter_dir> python scripts/ouro_serve.py



# 4. probe the realized loop depth directly

python -m sigma0.loop_lm

The garage chat path (4177/4178) and the MCP task_run worker then use Ouro with no further config — they already point at :11434.

Continual training

The local adapter improves offline via the Σ₀ continual-training loop (SIGMA0-CONTINUAL-TRAINING.md): harvest → execution-verify → train → eval → eval-gated promote. Two ground-truth gates (only green subprocesses train; only a measured pass@1 win promotes), kept offline by design. This replaces the old scripts/continual-train.ps1 Qwen flow.

Where it fits the loop

This is [06] LANTERN-CODER realized on a looped model: Coder = Kernel + Memory + Tools + "improve the codebase" task type — a task type, not a separate system. Ouro plugs into [02] LANTERN-MODEL-BROKER as one interchangeable local model; its adaptive depth serves the Reason/Act stages; every turn still emits a PCSF receipt + Convergence Record (Verify/Converge). It is fully in-house and offline — see the Σ₀ Briefing and the Superfleet design (workers run this loop on Tasks).

Honest scope

  • 1.4B, single-pass QLoRA — a genuine fine-tune, not production-grade; quality ratchets

via continual training.

  • Native loop (§1) is real adaptive depth on Ouro's weight-tied checkpoint — but

inference-time only (we don't pretrain), and the no-cache Q-exit path is slow (~1 s/token), so it's opt-in deep mode; the default served path is the fast cached one.

  • Re-prompt loop (§2) is an API-level approximation — it refines by re-prompting a

standard model, not shared-weight latent loops. Confidence and exit are heuristic.

  • Drop-in, not yet leaderboard-routed — you select Ouro by serving it on 11434, not via

the model-broker leaderboard (that integration is a follow-up).

  • **Can't yet drive Claude Code** — the protocol bridge round-trips cleanly, but the 1.4B

adapter under-triggers tools and is overwhelmed by CC's ~20k-token system prompt. The reliable surfaces are the in-app chat and the standalone agent loop. See SIGMA0-CODER-CLAUDE-CODE-STATUS.md.

  • Deep-mode depth is now logged; bench-grade numbers still aren't — the served native path

appends realized mean_depth + contraction to the eval leaderboard (_persist_loop_meta, #777), but the python -m sigma0.loop_lm probe output remains a live observation, not a persisted benchmark.

Training status (2026-06-25)

Cloud GPU dispatch (orchestration.htmlroutes/gpu-training.jslib/training-dispatcher.js) is wired; dispatch + poll + convergence-logging all work, and the weekly scheduled task (KeystoneWeeklyTraining, Mondays 00:00 UTC) is live. Providers are configured via data/pcsf/gpu-training.pcsf.json; credentials live in Windows User-scope env vars and sync into process.env at first call. Dispatching a real run this week drove out a chain of five bugs and a strategic finding — full write-up: research/gpu-training-pipeline-diagnosis-2026-06-25.md.

The strategic finding — Kaggle is the wrong GPU class. The recipe deliberately prefers bf16 (fp16 QLoRA on this reasoning LM overflows gradients to NaN, which clipping bakes into a garbage adapter), and bf16 is Ampere-only (cc ≥ 8.0). Kaggle's free fleet is exclusively pre-Ampere (P100 cc 6.0, T4 cc 7.5), so it can't be a trustworthy target. The arch-aware fixes stop the crashes (Kaggle now degrades to plain fp16 LoRA instead of dying), but the correct automatable target is Lightning AI's A10 (cc 8.6) — the same Ampere class as the localGB RTX where the good adapters trained.

Provider Class Status
Local RTX (8 GB) Ampere (bf16 ✓) ✓ primary — where the live Σ₀ adapter trained
Lightning AI (A10) Ampere (bf16 ✓) recommended cloud target; wired, but dispatch currently fails on a Lightning-SDK teamspace/owner-inference bug (error_count: 3 in the PCSF) — restore is the open follow-up
Kaggle (P100/T4,h/wk free) pre-Ampere (no bf16) crash-free fallback only; reaches the training loop but fp16 adapter quality is not dependable
HuggingFace Hub (lanternfounder/ouro-checkpoints) ✓ upload + download roundtrip passes
Paperspace / Colab / SageMaker credentials present; full dispatch untested/blocked

The five bugs fixed this week (each advanced the run one stage):

Stage reached Root cause Fix
deploy gate api-tools-log.js exported an Express Router() not the (req,res,url,deps)⇒bool convention; threw fn.apply and 500'd /api/convergence/health — the deploy health-check endpoint — silently rolling back every stable deploy for ~2 days 0e98dbfe
dataset mount kernel looked for .json; the Kaggle Dataset ships .jsonl 225880ee (probe both)
model load transformers>=4.40,<4.53 pin too old — Ouro needs layer_type_validation (4.54+) b5c62465 (pin → 4.57)
CUDA init (4-bit) hardcoded NF4; bitsandbytes kernels need cc ≥ 7.5; Kaggle gave a P100 (6.0) 5e7e9e87 (arch-aware: skip 4-bit on cc < 7.5)
CUDA init (bf16) torch.cuda.is_bf16_supported() false-positives on P100; first bf16 op crashes 8b1475a0 (gate bf16 on cc ≥ 8.0)

Seq-length note: corpus p99 audited attokens; bumped to seq=1536 so the tail of function-call outputs is no longer truncated — fits an A10/local-RTX without swapping to CPU.