Re-grounding the "Keystone OS kernel" model question — Ouro LoopLM vs thedense small-model frontier
📖 In plain English (start here)
What this is about: every AI assistant needs a "brain" — the model that does the actual thinking. Lantern's brain is a small model called Ouro, and this page asks a simple question: is Ouro still the right pick, or has something better come out?
Why we even checked. Someone ran a ChatGPT/Google report suggesting which model Lantern should use. It had two problems: it never mentioned Ouro (the brain we already run), and its suggestions were about a year out of date. So we redid the homework against real, checkable sources.
What makes Ouro special. Most AI gets smarter by getting bigger. Ouro gets smarter by thinking in loops — re-reading a hard problem a few times instead of answering in one pass. That lets a tiny model (1.4 billion "knobs") keep up with models several times its size on reasoning — while still fitting on a normal gaming GPU.
The honest comparison. The fair contest is Ouro against today's best small open models — Qwen3.5-4B, Phi-4-mini, and Gemma 4 (the tiny phone-friendly one from Google). Not the older models the original report listed. (Updated 2026-06-21: the frontier moved again — Gemmaand Qwen3.5 shipped in 2026; see §2.)
The catch — we measure, we don't guess. We score these brains on Lantern's own tests (does it continue a thought correctly? does it cite real sources? does it call the right tool?), not on generic trivia quizzes. Right now only Ouro has real scores (80% correct). So this page is a plan for what to test next — not a final pick. Nothing here changes the running system.
🎙️ Want it read aloud? Press the Listen bar at the bottom of this page.
The rest of this page is the precise, technical version. ↓
Date: 2026-06-19 Type: Research note (grounded follow-up to an external ChatGPT/Google shortlist) Status: Research-only. No serving code changed; no model selected. Recommends what to bench, not what to ship. Grounding contract: External Reality Rule — every model claim below carries a primary source (HF model card / vendor blog / arXiv). The prior report's citeturn… markers were ChatGPT rendering artifacts, not citations, and are treated as unsourced.
Related canon: RESEARCH-CANON.md · OURO-LOOPLM.md · SERVING-ARCHITECTURE-2026.md · KEYSTONE-PRODUCT.md · CONVERGANCE-SIGMA0-BRIEFING.md
0. Why the external shortlist needed re-grounding
A ChatGPT/Google report shortlisted edge LLMs as a "Keystone OS kernel": Llama 3.2 3B, Gemma "4 E2B", Phi-3.5-mini, Qwen2.5-7B (+ Mistral-7B / Mixtral / Falcon3 / MPT alternates). Two grounding faults:
- It omits the kernel Lantern actually runs. Lantern's served default is ByteDance Ouro 1.4B LoopLM behind an Ollama-compatible endpoint (
scripts/ouro_serve.py), with a native Σ₀ adaptive Q-exit loop as opt-in deep mode (src/sigma0/loop_lm.py). The question is not "pick an edge model from scratch" — it is "does anything beat the looped model we already serve, measured on our own metrics?" - Its list is ~1 generation stale. Llama 3.2 3B, Phi-3.5-mini, Qwen2.5-7B, Mistral-7B-v0.3 are the 2024–25 generation. (Correction 2026-06-21: the report's "GemmaE2B" is not a mis-name — GemmaE2B is a real model Google shipped Apr 2026; see the verified update in §2. This note originally read it as Gemma 3n E2B (Google, 2025), and that reinterpretation was itself the grounding error.) The realdense small-model frontier (verified 2026-06-21) is Qwen3.5-4B, Phi-4-mini (3.8B), GemmaE2B/E4B — with Ouro-1.4B as the incumbent and Llama 3.3 dropped (70B, wrong tier; caveat below).
This note re-frames the question on the axis Ouro's own paper uses — Gemma/ Qwen3 — and scores candidates on Lantern's metrics, not MMLU.
1. The current kernel: Ouro LoopLM (what the external report omitted)
Paper: Scaling Latent Reasoning via Looped Language Models — arXiv:2510.25741 (HF paper page). PDF in repo: docs/research-papers/ouro-looped-llm-2510.25741.pdf.
Mechanism (verified against the paper): LoopLM reuses weight-tied layers R times in latent space (loop depth as a "third scaling axis"), with an entropy-regularized objective for learned depth allocation and adaptive early-exit (Q-exit). Pre-trained to 7.7T tokens. Core claim: the advantage is **knowledge manipulation, not knowledge capacity** — exactly the property a small grounded kernel wants (the KB supplies facts; the model reasons over them).
Headline benchmark results (from the paper / HF page — primary):
- Ouro-1.4B withrecurrent steps performs comparably to Qwen3-4B, and scores 82.40 on MATH500 vs 59.60 for Qwen3-4B.
- Ouro-2.6B outperforms dense models up to 8B: 80.46 on BBH, surpassing Qwen3-8B (77.65) and Llama-3.1-8B (71.56).
- Net: 1.4B ≈ 4B-class, 2.6B ≈ 8B-class on reasoning (BBH, GSM8K, MATH500). The paper benchmarks against Gemma(1B/4B/12B) and Qwen3 (1.7B/4B/8B) — that is the correct comparison axis for Lantern, not themodels the external report listed.
What Lantern actually implements (from OURO-LOOPLM.md):
Sigma0LoopLMruns the paper's Q-exit policy (λ→survival→CDF→first-step ≥ q) on Ouro's pretrained weight-tied block + exit gate. We do not pretrain (that needs 7.7T tokens) — we activate adaptive inference the stock checkpoint leaves off (stockgenerate()runs fixed full depth).- Two serving modes (decided 2026-06-18,
SERVING-ARCHITECTURE-2026.md): FAST = stockgenerate()+UniversalTransformerCache+ anti-repetition decode (product default, target <2 s); DEEP = native no-cache Q-exit loop (OURO_NATIVE=1, ~1 s/token, opt-in). - Measured baseline (
KEYSTONE-PRODUCT.md, RTX8GB,eval_keystonegolden set):ouro-fast-merged-sdpa= 80% (8/10) accuracy, 23.7 s avg latency (OURO_MERGE=1 OURO_ATTN=sdpa, ~2.8× faster than the 65.8 s unmerged/eager row at equal accuracy). Persistent misses: #5 multi-step arithmetic, #6 primary-colors→RGB — small-model reasoning limits.
Honest scope (carried from the canon): the native loop is inference-time only (no pretraining); realized-depth and training-loss figures are live console observations, not persisted benchmark artifacts. Speed — not degeneration — is now the product gate; llama.cpp cannot run the looped arch (relevant to runtime choices below).
2. The realfrontier (primary-source verified)
Every row verified against the HF model card / vendor blog. The external report's picks are shown for contrast.
| Model | Params | Context | Modality | Released | Primary source |
|---|---|---|---|---|---|
| Ouro-1.4B / 2.6B LoopLM (current kernel) | 1.4B / 2.6B (4 loop steps) | — | text | Oct 2025 | arXiv:2510.25741 |
| Qwen3-4B | 4B dense | 32K native / 131K YaRN | text, hybrid thinking | 2025 | HF: Qwen/Qwen3-4B |
| Qwen3-8B | 8B dense | 32K / 131K YaRN | text | 2025 | HF: Qwen/Qwen3-8B |
| Phi-4-mini | 3.8B dense | 128K | text, function-calling | Feb 2025 | HF: microsoft/Phi-4-mini-instruct |
| Gemma 3n E2B | ~1.91B effective (5B raw, PLE/MatFormer) | 32K | multimodal (text/image/audio), on-device | 2025 | Google AI: Gemma 3n · dev blog |
| Llama 3.3 | 70B only | 128K | text | Dec 2024 | HF: meta-llama/Llama-3.3-70B-Instruct |
| (external report's stale picks) | |||||
| Llama 3.2 3B | 3B | 128K | text, edge | Sep 2024 | Meta blog |
| Phi-3.5-mini | 3.8B | 128K | text | Aug 2024 | HF: microsoft/Phi-3.5-mini-instruct |
| Qwen2.5-7B / Mistral-7B-v0.3 | 7B | — | text | 2024 | vendor cards |
Three corrections that matter for kernel selection:
- "GemmaE2B" was real all along — NOT a mis-name (corrected 2026-06-21; see the verified update above). Google shipped Gemma 4 in Aprwith a genuine E2B = 2.3B-effective (5.1B w/ embeddings) edge variant, so the external report's name resolves to an actual model. This note's original move — reinterpreting it as Gemma 3n E2B — was the grounding error (whether the source report predated Gemma 4's Apr-2026 release, making the name a lucky hallucination, or referenced the real model, is unknown; either way it must now be treated as real). Historical context: Gemma 3n E2B (Google 2025) is the predecessor it was confused with —
E= Effective params (raw 5B → ~1.91B effective / ~2 GB VRAM via PLE caching + MatFormer + parameter-skipping), multimodal (MobileNet-v5 vision + audio,langs) (Gemma 3n). GemmaE2B/E4B now supersede it as the edge/multimodal candidate — still capacity-light cores, the property that cuts against the LoopLM thesis. - Llama 3.3 is 70B-only. Meta shipped no small 3.3 variant (HF card, Dec 2024). The edge-class Llama is 3.2 1B/3B (Meta blog). So "Llama 3.3" does not belong on a sub-8B edge-kernel shortlist — at 70B it is a different deployment tier than Ouro-1.4B (~6 GB on GPU). Treat it as a cloud/ceiling reference, not a kernel candidate.
- Qwen3-4B is the paper's head-to-head; Qwen3.5-4B is now the current one. Qwen3-4B is exactly the model Ouro-1.4B is benchmarked against in the source paper — so reproduce that comparison on our golden set rather than trust the paper's external benchmarks transitively. (Corrected 2026-06-21: for a current kernel decision, bench against Qwen3.5-4B, the shipped successor — see §2 verified update; Qwen3-4B remains the apples-to-apples check against the paper's numbers.)
Frontier-has-moved-again — VERIFIED 2026-06-21 (was a low-confidence forward-watch item; now grounded to model-card / arXiv level). The June-2026 titles were real, not rendering artifacts. Confirmed against primary sources:
- Gemma— REAL. Google shipped it (initial Apr 2026; encoder-free "unified" 12B variant Jun2026). Edge-class sizes: E2B = 2.3B effective (5.1B w/ embeddings) and E4B = 4.5B effective (8B w/ embeddings), both 128K, multimodal text/image/audio; plus 12B (256K), 31B dense, and a 26B-A4B MoE (3.8B active) (Gemmamodel card · blog.google). This supersedes Gemma 3n E2B as the edge/multimodal candidate scored below.
- Qwen3.5 & Qwen3.6 — REAL (both). Qwen3.5 (Feb2026; 397B-A17B flagship, 262K ctx,langs) ships small dense variants Qwen3.5-9B / 4B / 2B / 0.8B; Qwen3.6 (Apr 2026; agentic-coding focus, 27B dense + 35B-A3B MoE, 262K ctx) (QwenLM/Qwen3.6 · CNBC). Qwen3.5-4B is the new head-to-head, superseding Qwen3-4B.
- Phi-4 successor — NOT FOUND. No Phi-5 / Phi-4.5 surfaced; Phi-4-mini (3.8B) remains current (Azure Phi). The Phi rows below stand unchanged.
- arXiv:2604.07035 — REAL paper, wrong title. The title guessed above was incorrect. Actual = "Unified Deployment-Aware Evaluation of Open Reasoning Language Models" (Md Motaleb Hossen Manik & Ge Wang) — a deployment-aware, multi-objective (accuracy × latency × memory) eval, i.e. exactly Lantern's bytes-per-correct axis, not a leaderboard. Finding: Gemma-4-E4B is "a strong practical operating point … substantially lower latency and memory" (arXiv:2604.07035).
Net: the frontier this note itself named (Qwen3-4B / Phi-4-mini / Gemma 3n E2B) is ~1 generation stale as of 2026-06-21. Corrected head-to-head set = Qwen3.5-4B, Phi-4-mini (unchanged), GemmaE2B/E4B. Ouro-1.4B remains the incumbent (no new LoopLM release found). The §3 benchmark plan is unchanged in shape — only the opponent names move. The §2 table and §3 cells below are preserved as the original (now prior-gen) snapshot; read them through this correction.
3. Score on Lantern's OWN metrics, not MMLU
Per the Convergence Core Research Program (Drive: Instrumenting the Black Box — A Ten-Year Research Program; repo PDFs Convergence-Core-Research-Program-v1.1.pdf) and CONVERGANCE-SIGMA0-BRIEFING.md, Lantern grades a kernel on whether it advances the single loop (Observe→Remember→Reason→Act→Verify→Converge), not on leaderboard trivia. The kernel-relevant metrics:
| Lantern metric | Loop stage | What it measures | Why a general benchmark misses it |
|---|---|---|---|
| Continuation accuracy | Reason | next-step correctness on our golden set (data/eval/sigma0-prompts.jsonl) |
MMLU is trivia; we serve repo-grounded reasoning |
| Grounding precision | Observe/Verify | fraction of [claim, evidence, confidence, source] tuples whose source actually supports the claim (External Reality Rule) |
no public benchmark scores citation-faithfulness against your KB |
| Action correctness | Act | did the chosen tool/route achieve the intended state change | benchmarks don't exercise our convergence router |
| Tool-call validity | Act | fraction of emitted tool calls that are schema-valid and executable | function-calling benches ≠ our MCP tool schemas |
| CSF compression ratio | Remember | bytes of CSF needed to round-trip a belief/observation state | architecture-specific to the Status Cube / CSF format |
| Bytes-per-correct-continuation | (efficiency) | served bytes (latency×tok) per correct continuation — the true product cost | conflates the two things FAST/DEEP split apart |
scripts/eval_keystone.py already produces the first two cheaply for any Ollama-API backend → data/eval/leaderboard.jsonl. The rubric below is the measurement plan; the cells are priors to be tested, not measured results (only the Ouro-fast row is measured today).
Qualitative priors (to be replaced by eval_keystone rows)
| Candidate | Continuation acc. | Grounding precision | Action / tool-call validity | Bytes-per-correct (efficiency) | CSF / fit notes |
|---|---|---|---|---|---|
| Ouro-1.4B (current) | 80% measured (8/10) | strong-by-design (manipulation > capacity; KB supplies facts) | untested on MCP schemas | FAST 23.7 s/prompt measured; loop recurrence is the per-token tax | runtime locked to transformers 4.57; no llama.cpp |
| Ouro-2.6B | prior: ≥1.4B (paper: 8B-class) | ≥1.4B | untested | slower — 2.6B ×loops × VRAM; gated on "bench before VRAM spend" roadmap item | same runtime constraint |
| Qwen3-4B | the head-to-head (paper: ≈Ouro-1.4B, worse on MATH500) | hybrid-thinking may help Reason; needs measuring | mature tool/function-calling ecosystem | single forward pass → likely cheapest bytes-per-correct of the dense set | runs on llama.cpp/vLLM/Ollama → trivial to serve |
| Phi-4-mini (3.8B) | strong on math/code/reasoning per card; unmeasured here | unknown vs our KB | explicit function-calling in card — promising for Act | single pass, 3.8B → cheap | drop-in via Ollama; good Act candidate |
| Gemma 3n E2B | ~2B-effective → likely below Ouro on Reason | unknown | unknown | cheapest VRAM (~2 GB); multimodal opens Observe over images/audio | only multimodal + truest edge option; capacity-light cuts against LoopLM thesis |
| Llama 3.3 (70B) | ceiling reference, not a kernel | — | — | wrong tier (70B ≠ edge) | use as cloud upper-bound only |
Reading: the LoopLM bet is that continuation accuracy + grounding precision per VRAM beats the dense models because reasoning is looped in latent space rather than spent on knowledge capacity — and the one measured Lantern number (80% on the golden set at 1.4B) is consistent with that. The dense frontier's counter-bet is bytes-per-correct-continuation: a single forward pass with no 4× recurrence tax, on runtimes (llama.cpp/vLLM) that the looped arch can't use. That tradeoff is precisely what eval_keystone + blind_study exist to settle — so settle it with rows, not priors.
4. LoopLM-adjacent context (Drive local_llm + Coconut)
LoopLM is one point in a latent-reasoning family; the kernel decision should be read against it:
- **Coconut — Training LLMs to Reason in a Continuous Latent Space (arXiv:2412.06769, Meta, Dec 2024; code). Feeds the model's last hidden state back as the next input embedding ("continuous thought") instead of decoding to tokens — reasoning in latent rather than language space, BFS-like exploration of multiple paths. This is the conceptual parent of Lantern's re-prompt loop:**
lib/loop-reasoner.jsalready feeds each prior answer back as a "Coconut-style context prefix" (an API-level approximation — not true shared-weight latent loops, confidence/exit are heuristic). Coconut = the why behind LoopLM; Ouro = the pretrained-in version Lantern serves. - Distinction worth keeping straight: Coconut/LoopLM (continuous latent loop, weight-tied) vs. token-space chain-of-thought (the dense frontier's "thinking" modes, e.g. Qwen3 hybrid thinking). Gemma 3n / Phi-4-mini / Qwen3 all reason in token space; only Ouro reasons in latent loop space. That is the single biggest architectural axis in this decision — and the one MMLU cannot see.
- Drive
local_llmfolder + the Ouro PDF (founder's Google Drive) are the off-repo source set; per data architecture real papers/data live in Drive, code + sanitized notes in repo. This note is the repo-side index; the Drive folder is the corpus.
5. Conclusions (research-only — no decision taken)
- The kernel question is not open — it has an incumbent. Lantern serves Ouro-1.4B LoopLM (FAST default, 80%/23.7 s measured). The external report's omission of it makes its entire shortlist moot as written. Reframe: "what, measured on our metrics, beats the looped model we already run?"
- The honest comparison axis is Gemma / Qwen3 — the model families Ouro's own paper benchmarks against — not Llama 3.2 3B / Phi-3.5 / Qwen2.5-7B. Use the verified-currentfrontier (Qwen3.5-4B, Phi-4-mini, GemmaE2B/E4B; see §2 update of 2026-06-21); drop "Llama 3.3" as a kernel candidate (70B, wrong tier).
- Decide with
eval_keystonerows, not paper transitivity. The highest-value next experiments, in order:- (a) Ouro-1.4B vs Qwen3.5-4B on the golden set (continuation acc + grounding precision + bytes-per-correct) — the current successor to the paper's exact head-to-head (Qwen3-4B), reproduced on our data; run Qwen3-4B too if cross-checking the paper's numbers.
- (b) Phi-4-mini as the Act/tool-call candidate (explicit function-calling) — measure tool-call validity against our MCP schemas.
- (c) Ouro-2.6B vs 1.4B — already a roadmap item ("bench before VRAM spend"); gate the VRAM on a measured accuracy delta.
- (d) GemmaE2B/E4B only if multimodal Observe (image/audio grounding) becomes a product requirement — these supersede Gemma 3n E2B, but the ~2–4.5B-effective cores are still unlikely to win on pure Reason.
- Grow the golden set first (roadmap item 5): today'sprompts are trivia-heavy; grounding precision / action correctness / CSF compression are not yet instrumented in
eval_keystone. Those metrics must exist before any of the above rows mean anything — otherwise we'd be re-importing the MMLU mistake the external report made.
Nothing here changes serving config or selects a model. It converts an unsourced external shortlist into a grounded, primary-source-verified benchmark plan on Lantern's own axes.
Sources (primary, verified 2026-06-19)
- Ouro LoopLM paper — arXiv:2510.25741 · HF paper page
- Coconut — arXiv:2412.06769 · facebookresearch/coconut
- Qwen3-4B — HF model card · Qwen3-8B — HF
- Phi-4-mini — HF model card
- Gemma 3n — Google AI for Developers · Google dev blog
- Llama 3.3 70B — HF model card · Llama 3.2 edge — Meta blog
- 2026-06-21 frontier verification (primary): GemmaAzure Phi— model card · blog.google · Qwen3.5/3.6 — QwenLM/Qwen3.6 · CNBC · Phi-4-mini still current — Azure Phi · arXiv:2604.07035 = Unified Deployment-Aware Evaluation of Open Reasoning Language Models (Manik & Wang) — arXiv
Internal: OURO-LOOPLM.md · SERVING-ARCHITECTURE-2026.md · KEYSTONE-PRODUCT.md · RESEARCH-CANON.md · CONVERGANCE-SIGMA0-BRIEFING.md · lib/loop-reasoner.js