Lantern OS Serving Architecture 2026: Fast Default + Deep Research Mode

Decision Date: 2026-06-18
Status: Implementation Complete (Phase 1)
Owner: Lantern Core Team

Problem Statement

The previous serving approach produced degraded replies: responses took 70–85 seconds and degenerated into token loops (✅✅✅), which blocked a sustainable product.

Two blockers existed:

No-cache speed: Every reply required full inference without KV caching
Decode degeneration: Missing anti-repetition parameters allowed token loops

THE DECISION: Make fast cached inference the product default. Keep the native Σ₀ Q-exit loop as an opt-in research mode.

Architecture

Default: FAST MODE (Fast Cached Inference)

When: All requests unless OURO_NATIVE=1 is set.

What:

Uses cached KV inference (Ollama/Ouro UniversalTransformerCache)
Anti-repetition decode parameters enabled by default
Target latency: <2 seconds for dream chat
Suitable for interactive UX, real-time feedback, production use

Decode Parameters (FAST mode):


{
    "temperature": 0.7,
    "top_p": 0.95,
    "frequency_penalty": 0.5,      # OpenAI/DeepSeek/Groq
    "repetition_penalty": 1.1,     # Ollama
    "repeat_last_n": 64,           # Ollama: look-back window for the repetition penalty
}

Rationale:

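KV caching removes the no-cache speed blocker: replies no longer require full inference on every turn
Anti-repetition penalties (frequency_penalty=0.5, repetition_penalty=1.1) suppress the ✅✅✅ loop; top_p=0.95 trims the low-probability tail
A 64-token look-back window for repetition detection balances freshness against tone consistency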
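
To make the interaction between the penalty value and the look-back window concrete, here is a minimal sketch of the common logit-rescaling form of a repetition penalty, applied only to tokens seen in the last N positions. The function name and the plain-list logits are illustrative assumptions; this is not taken from Ollama or from the Lantern code base.

```python
from typing import List, Sequence


def apply_repetition_penalty(
    logits: Sequence[float],
    history: Sequence[int],
    penalty: float = 1.1,
    last_n: int = 64,
) -> List[float]:
    """Down-weight tokens that appear in the last `last_n` generated tokens.

    Hypothetical helper: positive logits are divided by `penalty` and negative
    logits are multiplied by it, so a penalized token always becomes less likely.
    """
    out = list(logits)
    # Guard last_n == 0 explicitly: history[-0:] would select the whole list.
    window = history[-last_n:] if last_n > 0 else []
    for token_id in set(window):
        if 0 <= token_id < len(out):
            value = out[token_id]
            out[token_id] = value / penalty if value > 0 else value * penalty
    return out


# Token 2 was emitted three times in a row; tokens 0 and 1 fall outside the window.
logits = [1.0, -1.0, 2.2, 0.5]
penalized = apply_repetition_penalty(logits, history=[0, 1, 2, 2, 2], last_n=3)
```

With a small window only very recent tokens are penalized, which is why a wider window (as in DEEP mode) interacts with how freely earlier phrasing can be reused.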

Opt-In: DEEP MODE (Native Σ₀ Q-exit Loop)

When: OURO_NATIVE=1 environment variable is set.

What:

Adaptive depth via native Σ₀ Q-exit loop (grounded reasoning)
No KV cache — full adaptive inference per query
Higher latency acceptable (70–85 seconds for complex reasoning)
Suitable for: architecture decisions, research, grant writing, core system design

Decode Parameters (DEEP mode):


{
    "temperature": 0.7,
    "top_p": 0.98,                 # Slightly less aggressive than FAST (0.95)
    "frequency_penalty": 0.2,      # Allow more repetition for grounding
    "repetition_penalty": 1.05,    # Ollama: softer penalty
    "repeat_last_n": 128,          # Ollama: wider look-back window
}

Rationale:

The adaptive loop may reference prior states, and softer anti-repetition settings allow this
Higher top_p (0.98 vs 0.95) gives reasoning more token diversity
A 128-token look-back window captures cross-turn grounding patterns
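
Because each parameter block mixes provider-specific keys, a connector has to split it per backend. The sketch below shows one way that split could look, following the comments in the parameter blocks. The function names are hypothetical, and the rename of `repetition_penalty` to Ollama's `repeat_penalty` option is an assumption about how the real connector maps the key.

```python
from typing import Any, Dict

FAST_PARAMS: Dict[str, Any] = {
    "temperature": 0.7,
    "top_p": 0.95,
    "frequency_penalty": 0.5,
    "repetition_penalty": 1.1,
    "repeat_last_n": 64,
}


def to_openai_style(params: Dict[str, Any]) -> Dict[str, Any]:
    """Keep only the sampling fields that OpenAI-style chat APIs accept."""
    keys = ("temperature", "top_p", "frequency_penalty")
    return {key: params[key] for key in keys if key in params}


def to_ollama_options(params: Dict[str, Any]) -> Dict[str, Any]:
    """Build an Ollama `options` object from the unified parameter block.

    Assumption: the unified `repetition_penalty` key is sent under Ollama's
    `repeat_penalty` option name; the real connector may map it differently.
    """
    keys = ("temperature", "top_p", "repeat_last_n")
    options = {key: params[key] for key in keys if key in params}
    if "repetition_penalty" in params:
        options["repeat_penalty"] = params["repetition_penalty"]
    return options
```

Keeping the mapping in one place means a new provider only needs one more translation function rather than changes to every mode definition.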

Configuration

Environment Variables


# Product default: fast cached inference
unset OURO_NATIVE        # FAST mode

# Opt in to deep research mode
export OURO_NATIVE=1     # DEEP mode

Detection (Code)


from serving_modes import get_serving_mode, get_decode_params

mode = get_serving_mode()  # Returns FAST_MODE or DEEP_MODE
decode_params = get_decode_params(mode)
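
For readers without the repository at hand, the following is a minimal sketch of how such a module could behave, assuming the mode depends only on whether OURO_NATIVE equals "1". The constant values, the optional `env` argument (added here only to make the function testable), and the internal table are assumptions; src/serving_modes.py remains the authoritative definition.

```python
import os
from typing import Any, Dict, Mapping, Optional

# Assumed constant values; the real module may use an enum or other names.
FAST_MODE = "fast"
DEEP_MODE = "deep"

_DECODE_PARAMS: Dict[str, Dict[str, Any]] = {
    FAST_MODE: {
        "temperature": 0.7,
        "top_p": 0.95,
        "frequency_penalty": 0.5,
        "repetition_penalty": 1.1,
        "repeat_last_n": 64,
    },
    DEEP_MODE: {
        "temperature": 0.7,
        "top_p": 0.98,
        "frequency_penalty": 0.2,
        "repetition_penalty": 1.05,
        "repeat_last_n": 128,
    },
}


def get_serving_mode(env: Optional[Mapping[str, str]] = None) -> str:
    """Return DEEP_MODE only when OURO_NATIVE is exactly "1", else FAST_MODE."""
    env = os.environ if env is None else env
    return DEEP_MODE if env.get("OURO_NATIVE") == "1" else FAST_MODE


def get_decode_params(mode: str) -> Dict[str, Any]:
    """Return a copy so callers cannot mutate the shared defaults."""
    return dict(_DECODE_PARAMS[mode])
```

Treating every value other than "1" as FAST keeps the product default safe against typos in deployment configuration.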

Performance Baseline (measured, not estimated)

The baseline accrues from real benchmark runs in data/benchmarks/leaderboard.jsonl — appended by the daily workflow and by python src/serving_benchmark.py --providers <provider:model> --mode fast.

⚠️ A previous version of this section hard-coded estimated numbers (e.g. "450ms / 0.92") that were never measured. They were removed per the External Reality Rule: the leaderboard and data/benchmarks/REPORT.md are the single source of truth, and the benchmark records only real provider responses (an unreachable provider is logged as an error, never as fabricated data).

Measurement & Iteration

Standing Benchmark


# Run one provider:model pair (FAST mode, the product default)
python src/serving_benchmark.py --run anthropic:claude-haiku-4-5-20251001 --mode fast

# Run several pairs at once (in DEEP mode the benchmark sets OURO_NATIVE for the run)
python src/serving_benchmark.py --providers "anthropic:claude-haiku-4-5-20251001,openai:gpt-4.1-mini" --mode fast

# Validate the latest run per config against the FAST/DEEP contract (exit 1 on regression)
python src/serving_benchmark.py --validate

# Refresh the Markdown monitoring report
python src/serving_benchmark.py --report   # → data/benchmarks/REPORT.md

Golden set: diverse prompts (reasoning, creative, code, domain). Metrics: latency, tokens, repetition_ratio (plus a per-task pass), cost, throughput. Leaderboard: data/benchmarks/leaderboard.jsonl · Report: data/benchmarks/REPORT.md.
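
As an illustration of the repetition_ratio metric, the sketch below computes the share of unique whitespace-separated words, so higher is better and a looping reply scores near zero. The exact tokenization used by src/serving_benchmark.py is not shown in this document, so the lower-casing and whitespace split are assumptions.

```python
def repetition_ratio(text: str) -> float:
    """Share of unique whitespace-separated words in `text` (1.0 means no repeats)."""
    words = text.lower().split()
    if not words:
        # An empty reply carries no signal; score it as fully degenerate.
        return 0.0
    return len(set(words)) / len(words)


loop_reply = "✅ " * 10
varied_reply = "the lantern keeps a steady light over quiet water"

loop_score = repetition_ratio(loop_reply)      # 10 words, 1 unique
varied_score = repetition_ratio(varied_reply)  # 9 words, 9 unique
```

Note that a loop emitted without separators (for example a single unbroken run of ✅) would count as one word under this simple split, so a production metric would likely need token-level or character n-gram counting as well.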

Honesty contract

The connector silently falls back to a canned offline persona stub when a provider is unreachable. The benchmark therefore pins the requested model onto the provider config, streams with fallback=False, and rejects any run whose metadata reports source: offline or whose output is empty — recording it as an error, never as provider data. (The CLI model used to be ignored; it is now the model actually queried, so a leaderboard row always belongs to the model it names.)
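
The rejection rule described above can be sketched as a small guard that runs before any row is written. The function name and the shape of the metadata dictionary are hypothetical; only the two conditions (source reported as offline, or empty output) come from this document.

```python
from typing import Any, Dict, Optional


def rejection_reason(output: str, metadata: Dict[str, Any]) -> Optional[str]:
    """Return why a run must be logged as an error, or None if it is real provider data."""
    if metadata.get("source") == "offline":
        return "offline fallback stub"
    if not output.strip():
        return "empty output"
    return None
```

A caller would record the returned reason in the error row instead of computing metrics on the stub text.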

Validation contract (#730)

Mode   Latency                Repetition (target / floor)   Success rate
FAST   ≤s (error)             0.85 / 0.80                   ≥ 0.90 (error)
DEEP   70–85 s band (warn*)   0.80 / 0.75                   ≥ 0.90 (error)

Repetition is a WARN below the target but an ERROR only below the floor: a real ✅✅✅ token loop scores ~0.1–0.3, far under the floor, while honest short replies hover near the target. A per-task check also fails if any single golden-set task collapses below a repetition ratio of 0.5.

* The DEEP latency band binds only the native Σ₀ runtime; it is informational for cached providers.
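
The target/floor logic can be sketched as follows, using the repetition and success-rate thresholds from the contract. Latency checks are left out on purpose, and all names here are hypothetical rather than the actual --validate implementation.

```python
from typing import Dict, List, Sequence, Tuple

# (target, floor) pairs for the repetition ratio, per mode.
REPETITION_BOUNDS: Dict[str, Tuple[float, float]] = {
    "FAST": (0.85, 0.80),
    "DEEP": (0.80, 0.75),
}
MIN_SUCCESS_RATE = 0.90
PER_TASK_FLOOR = 0.5


def classify_repetition(mode: str, ratio: float) -> str:
    """Return "error" below the floor, "warn" below the target, otherwise "ok"."""
    target, floor = REPETITION_BOUNDS[mode]
    if ratio < floor:
        return "error"
    if ratio < target:
        return "warn"
    return "ok"


def validate_run(
    mode: str,
    mean_ratio: float,
    task_ratios: Sequence[float],
    success_rate: float,
) -> List[str]:
    """Collect the error-level findings for one benchmark run (warnings do not fail)."""
    errors: List[str] = []
    if classify_repetition(mode, mean_ratio) == "error":
        errors.append("repetition below floor")
    if any(ratio < PER_TASK_FLOOR for ratio in task_ratios):
        errors.append("per-task repetition collapse")
    if success_rate < MIN_SUCCESS_RATE:
        errors.append("success rate below 0.90")
    return errors
```

A gate script could exit with status 1 whenever the returned list is non-empty, which matches the documented behavior of failing only on error-level findings.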

Daily automation: .github/workflows/serving-benchmark.yml runs the logic tests, benchmarks every cloud provider whose API key is a repo secret, regenerates the report, commits the leaderboard, then runs --validate as a gate. Providers without a key are skipped — never fabricated. Over time the leaderboard becomes a measurable performance history; each run appends one row.
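
The append-only leaderboard can be pictured as one JSON object per line. The sketch below shows a possible writer and reader; the helper names and the row fields are hypothetical and only illustrate the JSON Lines pattern, not the real schema.

```python
import json
from pathlib import Path
from typing import Any, Dict, List


def append_row(path: Path, row: Dict[str, Any]) -> None:
    """Append one benchmark result as a single JSON line, creating the file if needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(row, ensure_ascii=False) + "\n")


def read_rows(path: Path) -> List[Dict[str, Any]]:
    """Load every row in file order, skipping blank lines."""
    with path.open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]
```

Because rows are only ever appended, earlier results are never rewritten and the file can be committed by CI as a growing history.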

Code Integration

Updated Files

src/unified_agent_connector.py

Integrated serving modes
All provider streamers now respect mode-appropriate decode params

src/serving_modes.py (NEW)

Mode definitions (FAST, DEEP)
get_serving_mode(), get_decode_params()

src/serving_benchmark.py (NEW)

Golden set runner
Leaderboard tracking
CLI: --run provider:model, --summarize

apps/lantern-garage/lib/unified-agent.js (TODO in Phase 2)

Add OURO_NATIVE detection for Node.js side
Route requests to correct inference path

Success Criteria

✅ FAST mode (product default):

Latency < 2s for 90% of dream chat requests
No token loops (✅✅✅ degeneration)
Repetition ratio > 0.85 (unique words)
Cost stable or decreasing
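
The latency criterion is a share-of-requests condition, not an average. Below is a minimal sketch of that check, assuming per-request latencies in seconds are available as a list; the function names are illustrative.

```python
from typing import Sequence


def share_within(latencies_s: Sequence[float], limit_s: float = 2.0) -> float:
    """Fraction of requests that finished in under `limit_s` seconds."""
    if not latencies_s:
        return 0.0
    fast_enough = sum(1 for value in latencies_s if value < limit_s)
    return fast_enough / len(latencies_s)


def meets_fast_latency(
    latencies_s: Sequence[float],
    limit_s: float = 2.0,
    required_share: float = 0.90,
) -> bool:
    """True when at least `required_share` of requests beat the latency limit."""
    return share_within(latencies_s, limit_s) >= required_share
```

A mean-based check would let a few very slow replies hide behind many fast ones, or the reverse, which is why the share is computed directly.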

✅ DEEP mode (research opt-in):

Grounded reasoning (Σ₀ loop validates claims)
Latency 70–85s acceptable for complex decisions
Repetition ratio > 0.80 (grounding may repeat concepts)
Used successfully for ≥3 real architecture decisions

✅ Benchmark sustainability:

Daily runs on CI (appendable leaderboard)
Golden set results tracked for all changes
Alert on regression (latency +20%, repetition -0.05)
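
The regression thresholds can be read as a comparison between the latest run and a baseline run for the same configuration. The sketch below assumes that reading; how the baseline is chosen (previous run, rolling median, and so on) is not specified here, and the function name is hypothetical.

```python
from typing import List


def regression_alerts(
    baseline_latency_s: float,
    latest_latency_s: float,
    baseline_repetition: float,
    latest_repetition: float,
    latency_tolerance: float = 0.20,
    repetition_tolerance: float = 0.05,
) -> List[str]:
    """Name the metrics that regressed beyond tolerance relative to the baseline."""
    alerts: List[str] = []
    # Latency regresses when it grows by more than 20% over the baseline.
    if latest_latency_s > baseline_latency_s * (1 + latency_tolerance):
        alerts.append("latency")
    # Repetition ratio regresses when it drops by more than 0.05.
    if latest_repetition < baseline_repetition - repetition_tolerance:
        alerts.append("repetition")
    return alerts
```

Returning the list of regressed metrics, rather than a single boolean, lets the alert message say which criterion was violated.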

Backward Compatibility

Existing code paths unchanged (no breaking changes)
Default behavior is now FAST; DEEP is opt-in via OURO_NATIVE=1
unified-agent Python interface: drop-in compatible
Node.js callers: no changes needed (will use FAST by default)

References

Σ₀ Framework: docs/CONVERGANCE-SIGMA0-BRIEFING.md
Model Architecture: docs/research-canon.md
Benchmark: src/serving_benchmark.py
Serving Modes Config: src/serving_modes.py