Keystone — Progress Report (Shareholder Edition)
Period ending: 2026-06-19 · Product: Keystone chat (the Convergence Core / Σ₀ agent) · Status: Active development, pre-revenue.
Reading contract. This report follows Keystone's own External Reality Rule: every material claim carries evidence (a commit, a test count, a measured number) and is tagged [shipped], [measured], or [design]. We do not report aspiration as achievement. No financial or user-base figures are claimed — Keystone is solo-built and pre-revenue; this is an engineering-progress report.
1. The thesis, in one line
Keystone is a self-driving car for reasoning — it observes, remembers, reasons, acts, verifies, and improves from checked experience, with a safety system (Σ₀) that mathematically bounds it so it won't collapse or run away.
"Tesla-class" is the honest version of the metaphor: a real, autonomous, production vehicle — not a concept car, and not a claim of general intelligence. It drives well on the roads you give it. It is engineered to stay on the road and not crash itself. Where it drives is set by the tools and tasks you put in it.
2. The car, part by part (and how complete each is)
| Part of the car | What it is in Keystone | Status |
|---|---|---|
| Engine | the six-stage loop: Observe → Remember → Reason → Act → Verify → Converge | built (Python kernel.py) |
| Drivetrain coupling | the loop actually running end-to-end in the live product | first slice now engaged — see §3 |
| Memory / odometer | append-only JSONL + CSF archive; per-session token-budgeted context | shipped (#772) |
| Lane-keeping | grounding throttle — buy evidence when uncertain, "stop reasoning, go look" | built; not yet gating Act |
| Crash-avoidance (the differentiator) | Σ₀ collapse certificate + surprise canary — won't collapse or run away | built + proven (30 passing tests) |
| Self-improvement | grade decisions against real outcomes; compile what survives | first reasoner now closing (§3) |
The crash-avoidance math is the part most "agent" products never ship. Keystone has it, proven.
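The lane-keeping row in the table above ("buy evidence when uncertain; stop reasoning, go look") can be illustrated with a minimal sketch. This is not the Keystone implementation: the function name `next_step`, the `threshold` value, and the `evidence_cost` / `budget` parameters are all assumptions made for illustration only.

```python
def next_step(confidence, evidence_cost, budget, threshold=0.8):
    """Hypothetical grounding throttle: decide what to do before the Act stage.

    confidence    -- the reasoner's own confidence in its answer, 0.0..1.0
    evidence_cost -- assumed cost of one more lookup (tokens, calls, etc.)
    budget        -- assumed remaining allowance for lookups
    threshold     -- assumed confidence needed to act without more evidence
    """
    if confidence >= threshold:
        return "act"                 # confident enough: proceed
    if evidence_cost <= budget:
        return "gather_evidence"     # uncertain and affordable: go look
    return "abstain"                 # uncertain and cannot afford evidence


print(next_step(0.92, evidence_cost=5, budget=10))  # act
print(next_step(0.40, evidence_cost=5, budget=10))  # gather_evidence
print(next_step(0.40, evidence_cost=50, budget=10))  # abstain
```

Gating Act on a check of this shape is what the roadmap item in §6 refers to; the thresholds and cost model there may look quite different.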
3. Shipped this period (evidence-tagged)
- Σ₀-K1 kernel spec frozen + a real measurement harness. [shipped fb523163/7b0a776a]
Replaced a 10-prompt trivia eval with a 65-prompt, repo-grounded golden set (Gate A) so every model/serving change is graded, not asserted. Measured cold baseline: 34% on the local kernel, with a clean difficulty gradient (100/50/29/13% across smoke→hard) — proof the harness discriminates. [measured — data/eval/leaderboard.jsonl]
- Token-budgeted memory (the REMEMBER stage). [shipped 66ad7024, closes #772]
Long chats used to silently drop their own beginning. Keystone now assembles a token-budgeted context — a rolling summary of older turns plus recent verbatim turns within the active model's window — from the full session log. 28 unit tests; live in production.
- The loop's first real slice now closes end-to-end. [shipped 8608e5e7+25101abf]
The Kalshi trading reasoner — which has ground truth (a trade wins or loses) — now runs Reason → Verify → Converge: it emits a prediction, the settled market grades it, and the survivors compile into patterns. Demonstrated: a record went unverified 0.90 → verified 0.95 → extracted as a pattern. [measured, end-to-end] 12 unit tests.
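The Reason → Verify → Converge slice described in the last item can be sketched roughly as follows. This is a toy model, not the code in `convergence_close_loop.py`: the `Record` fields, the fixed `step` of 0.05, and the `threshold` are assumptions, chosen only so the toy numbers echo the 0.90 → 0.95 example. The real update rule is not stated in this report.

```python
from dataclasses import dataclass


@dataclass
class Record:
    market: str            # hypothetical market identifier
    predicted_yes: bool    # the reasoner's prediction (Reason)
    confidence: float      # score attached to the record
    status: str = "unverified"


def verify(record, settled_yes, step=0.05):
    """Grade a prediction against the settled outcome (Verify).

    Assumed rule: nudge confidence up on a correct call, down on a wrong one.
    """
    correct = record.predicted_yes == settled_yes
    delta = step if correct else -step
    record.confidence = round(min(1.0, max(0.0, record.confidence + delta)), 2)
    record.status = "verified" if correct else "refuted"
    return record


def converge(records, threshold=0.95):
    """Keep only verified records that clear the threshold (Converge)."""
    return [r for r in records
            if r.status == "verified" and r.confidence >= threshold]


hit = verify(Record("MKT-A", predicted_yes=True, confidence=0.90), settled_yes=True)
miss = verify(Record("MKT-B", predicted_yes=True, confidence=0.90), settled_yes=False)
patterns = converge([hit, miss])
print(hit.status, hit.confidence)    # verified 0.95
print(miss.status, miss.confidence)  # refuted 0.85
print([r.market for r in patterns])  # ['MKT-A']
```

The property that matters is that grading uses an external outcome (the settled market), not the reasoner's own opinion of its answer.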
Why the third item matters most: it is the difference between an agent that talks about improving and one that measurably does. Keystone now has at least one loop where experience is checked against reality and compounded, which is the foundation everything else stacks on.
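The token-budgeted context from the REMEMBER item above (a rolling summary of older turns plus recent verbatim turns) can be sketched as below. The shipped version lives in `context-budget.js`; this Python sketch is an illustration under assumptions, not a port. `count_tokens`, `naive_summarize`, and the `summary_reserve` parameter are hypothetical stand-ins, and a real build would use the active model's tokenizer and an actual summarizer.

```python
def count_tokens(text):
    # Crude stand-in: whitespace word count, not a real tokenizer.
    return len(text.split())


def naive_summarize(older_turns, max_tokens):
    # Placeholder summarizer: first three words of each older turn,
    # truncated to max_tokens words.
    words = []
    for turn in older_turns:
        words.extend(turn.split()[:3])
    return " ".join(words[:max_tokens])


def assemble_context(turns, window, summary_reserve, summarize=naive_summarize):
    """Return (summary, recent_turns) that together fit inside `window`.

    Walk backwards from the newest turn, keeping turns verbatim while they
    fit in the window minus the space reserved for the summary. Everything
    older is handed to the summarizer.
    """
    verbatim_budget = window - summary_reserve
    recent, used, split = [], 0, len(turns)
    for i in range(len(turns) - 1, -1, -1):
        cost = count_tokens(turns[i])
        if used + cost > verbatim_budget:
            break
        recent.append(turns[i])
        used += cost
        split = i
    recent.reverse()  # restore chronological order
    older = turns[:split]
    summary = summarize(older, summary_reserve) if older else ""
    return summary, recent


turns = ["a b c d", "e f", "g h i", "j"]
summary, recent = assemble_context(turns, window=8, summary_reserve=3)
print(summary)  # a b c
print(recent)   # ['g h i', 'j']
```

Reserving summary space up front means the beginning of a long chat is compressed rather than silently dropped.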
4. Metrics that are real (no vanity numbers)
| Metric | Value | Source |
|---|---|---|
| Golden-set baseline (cold local kernel) | 34% (22/65) | [measured] leaderboard.jsonl |
| Golden-set difficulty gradient | 100/50/29/13% (smoke→hard) | [measured] |
| Loop slices closing end-to-end | 1 (Kalshi: Reason→Verify→Converge) | [measured] §3 |
| Σ₀ safety certificate | 30 passing tests | [shipped] SIGMA0-COLLAPSE-CERTIFICATE.md |
| New automated tests added this period | 47 (7 Gate A + 28 for #772 + 12 Kalshi) | [shipped] |
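The golden-set rows in the table above come down to pass-rate arithmetic, sketched here. The input shape (a list of `(tier, passed)` pairs) and the function names are assumptions; the actual schema of `leaderboard.jsonl` is not described in this report, and the toy data below is invented purely to exercise the functions.

```python
from collections import defaultdict


def pass_rate_by_tier(results):
    """Percent passed per difficulty tier, rounded to whole percent."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for tier, passed in results:
        totals[tier] += 1
        passes[tier] += int(passed)
    return {tier: round(100 * passes[tier] / totals[tier]) for tier in totals}


def overall_rate(results):
    """Percent passed across all prompts, rounded to whole percent."""
    results = list(results)
    return round(100 * sum(int(p) for _, p in results) / len(results))


# Toy data only (not Keystone's results): two tiers, six prompts.
toy = [("smoke", True), ("smoke", True),
       ("hard", True), ("hard", False), ("hard", False), ("hard", False)]
print(pass_rate_by_tier(toy))  # {'smoke': 100, 'hard': 25}
print(overall_rate(toy))       # 50

# The reported baseline: 22 of 65 prompts passed.
print(round(100 * 22 / 65))    # 34
```

A gradient that falls from the easiest tier to the hardest is what indicates the harness discriminates between difficulty levels.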
We deliberately do not report the local kernel as "smart": it scores ~10% pass@1 on HumanEval. That is the point of the architecture — value comes from the loop, grounding, and provider routing, not a single model. Models are interchangeable; the Core is the asset.
5. The honest ceiling (forward-looking statement)
Keystone is becoming a dependable, bounded agent: it won't quietly fall apart, won't run away, and improves by stacking checked experience. It is not a general "smart-at-everything" mind:
- **Σ₀ guarantees it won't collapse; it does not guarantee it will be clever.** Safety ≠ capability.
- The local kernel is intentionally small/cheap; capability is bought via grounding + cloud routing, not a bigger brain.
- One loop slice closes today; the chat path emits records but has no ground truth to grade against yet.
This is a larger, more honest claim than most agent stacks ship — because it is bounded and verified, not despite it.
6. Roadmap — next milestones
| Next | What it buys | Source |
|---|---|---|
| Gate Act with the grounding throttle + attach the Σ₀ canary to the live loop | lane-keeping + crash-avoidance engaged on every action | agent-spine §6.5 |
| Schedule the close-loop pass (periodic / on-settlement) | the Kalshi slice closes continuously, unattended | this period's follow-up |
| State-ABI shim (Σ₀-K1 component 6) | connect the Ouro reasoning loop to the hot-swap VM | SIGMA0-K1-KERNEL-SPEC.md |
| Grow the golden set + measure grounded-vs-cold lift | turn the 34% baseline into a tracked, improving curve | Gate B |
7. Governance & honesty note
Every claim above is traceable to a commit, a test, or a measured artifact on disk. This report contains no projected revenue, user counts, or capability claims unsupported by a run. Keystone's credibility is the External Reality Rule — we would rather under-claim and be trusted than over-claim and be checked.
Sources (verified on disk 2026-06-19)
- Loop + four objects: docs/research/2026-06-19-convergence-core-agent-spine.md
- Kernel spec + Gate A: docs/SIGMA0-K1-KERNEL-SPEC.md · data/eval/leaderboard.jsonl
- Token-budgeted memory: apps/lantern-garage/lib/stream-chat/context-budget.js (#772)
- Loop-closing slice: apps/lantern-garage/lib/kalshi-convergence-outcomes.js · scripts/convergence_close_loop.py
- Safety certificate: docs/SIGMA0-COLLAPSE-CERTIFICATE.md