ADR-0010: Distillation is a deferred last resort — verify-gated, benchmark-never-the-target, never removed

Status

Proposed — awaiting approval from Alex Place.

Context

The target is high real reliability (aiming for ~99% verified correctness on tasks where ground truth exists), not a high score on a public leaderboard. Those are different bars, and conflating them is the trap this ADR exists to prevent:

"99% on HumanEval" is vanity — the benchmark is saturated/contaminated (we already

decontaminate leaks, [[humaneval-optimized-corpus]] / build_humaneval_corpus.py).

"99% on SWE-bench Verified / GPQA" is past the field's frontier today (SOTA ~70s;

GPQA human-expert ceiling ~65–75%).

"99% of what we ship passes its tests / is grounded" is real, reachable, and frozen-base —

it comes from verification (run the code, check the source), which needs no weight change.

The decision this forces: continual learning by updating model weights (a LoRA distillation flywheel) is one available lever for compounding capability over time. The North Star currently forbids it outright (principle 5: "never retrain"; restated in ADR-0004: "improvement comes from retrieval over accumulated memory, never retraining"). Two facts pull against a blanket ban:

Weight-distillation is a real lever — verified self-distillation compounds capability

(multi-agent finetuning, SWE-Gym / DeepSWE-style verifier-trained agents).

Used naively it is the most dangerous lever — training on your own outputs to chase a

number is Goodhart's law plus model collapse (diversity degrades across cycles). In the gains analysis it was the smallest, lowest-confidence, highest-risk lever on the board.

Loop stage: Converge (how the loop improves itself over time). The owner's instruction is explicit: keep the option, but make it the last resort — only after the frozen-base levers are exhausted, and never delete it from the toolbox.

Decision

We will preserve distillation (continual weight updates) as a deferred, guard-railed, last-resort option — never the first move, and never removed.

RuleADR-0004— Default mode is unchanged. The system's default operating mode remains frozen-base + retrieval/experience (North Star principle 5, ADR-0004). This ADR does not start a training program; it records a sanctioned, dormant escape hatch.

Rule— Frozen-base levers come first, in order. To raise any measured capability gap, exhaust the frozen-base path first, with evidence at each step:

Route to the best available member (cloud/local specialist) — ADR-0009.
Execution-verify + best-of-N — sample, then select by running tests / a real verifier.
RAG + citation grounding — retrieve and verify sources actually support the claim.

None of these touch weights. They are the primary, always-preferred path.

**Rule— Distillation may only be proposed after Rule[claim, evidence, confidence, source]is demonstrably exhausted** for a specific, measured capability gap, with the exhaustion recorded as a Convergence Record (data/convergence/records.jsonl) carrying [claim, evidence, confidence, source]. "We haven't tried verification yet" is never grounds to distill.

Rule— If initiated, distillation runs only under all of these guardrails:

Source gate: only externally-verified-correct experience may become training data —

code that passed its tests, claims that passed grounding checks, operator-approved decisions. Never raw model self-output. Every pair traces to a Convergence Record.

Adapter-only, base frozen: updates land in a replaceable LoRA adapter, never the base

weights. The base stays interchangeable (ADR-0005); the adapter is regenerable from the append-only memory log — a distilled cache of verified experience, not a new memory store.

Benchmark is never the target: the optimization signal is **verified-pass-rate on held-out,

decontaminated, freshly-sourced tasks. Public benchmarks (HumanEval/SWE/GPQA) are read-only instruments**, never a loss signal. Saturated/contaminated benchmarks are explicitly barred as targets.

Collapse tripwire (mandatory): every cycle evaluates a frozen hold-out and **mixes in

external (non-self) data. If the hold-out or a diversity metric regresses across cycles, distillation halts and the last-good adapter is restored**.

Reversible + audited: each adapter version is content-addressed, logged to the convergence

audit, and revertible in one step. No cycle is irreversible.

Operator-gated: cycles are operator-initiated/approved, not autonomous, until the tripwire

history earns trust.

Rule— The option is never removed. Forbidding distillation outright is itself rejected (see Alternatives). The toolbox keeps it; discipline keeps it last.

Consequences

Positive:
- Keeps a path to compounding, long-horizon improvement without sacrificing the grounding

guarantee — the benchmark can never become the target, so "99%" stays real (verified), not fake (overfit).

The base model stays interchangeable; the adapter is a regenerable cache, not a dependency.
Everything is reversible, audited, and operator-gated — the dangerous lever is defanged.
Honors the owner's intent precisely: keep the option, use it last.

Negative / trade-offs:
- This is a real, named departure from North Star principle 5's blanket "never retrain."

We accept a narrow, dormant exception in exchange for not foreclosing the flywheel.

When (if) activated, it adds a training pipeline (GPU, eval upkeep, collapse monitoring) and

a model-specific artifact — mitigated by adapter-only + regenerable-from-memory + the tripwire.

Requires ongoing discipline: the moment a public benchmark leaks into the loss, the guarantee

is broken. The hold-out + source-gate are the enforcement, not good intentions.

Follow-ups:
- Build the collapse-tripwire eval harness (frozen hold-out + diversity metric) — prerequisite

before any cycle.

Wire execution-verify as the source gate (Rule 1.2 / Rulesource gate share machinery).
Run the operator-escalation backtest (separate work) — part of proving Rulelevers first.
On acceptance: Alex updates briefing principleto reference this ADR's bounded exception.

Alternatives considered

Keep the blanket ban (do nothing). Rejected per the owner's instruction — but legitimately

the safe default: it preserves a clean grounding guarantee at the cost of no compounding. If this ADR is rejected, this is what we keep, and that is an acceptable outcome.

Allow active continual learning now (benchmark-targeted). Rejected — Goodhart + model collapse;

the exact failure mode the North Star exists to prevent. Chasing a contaminated benchmark by distilling own outputs buys a fake 99%.

Unguarded / autonomous self-distillation. Rejected — training on own output without a

source-gate and tripwire collapses the model over cycles.

Full base-weight retraining (not adapter). Rejected — breaks model interchangeability

(ADR-0005) and is irreversible.

Evidence

Claim	Evidence (file:line / commit / PR)	Confidence	Source
North Star forbids weight modification (the rule amended)	CONVERGANCE-SIGMA0-BRIEFING.md principle 5; ADR-0004	High	project docs
Frozen-base verification is the primary lever (route first)	ADR-0009	High	repo ADR
Local coder weak → routing/verify needed before distill	Ouro-1.4B ~0/5 via chat harness	High	#1263, `eval_humaneval_chat.py`
Execution-verify + best-of-N is the high-confidence capability lever	Satori-SWE Best@50 ≈ Best@500 SOTA	Med	arXiv 2505.23604
RAG + citation cuts hallucination materially	MEGA-RAG ~40% reduction vs baseline	Med	PMC12540348
Verified self-distillation compounds capability	SWE-Gym / DeepSWE verifier-trained agents	Med	arXiv 2504.21798 / together.ai DeepSWE
Naive self-training collapses (tripwire justified)	model-collapse on training-over-own-output	Med	recursive-training literature
Adapter is regenerable from the one memory log	`data/convergence/records.jsonl` + ADR-0004	High	repo