Lantern OS Serving Architecture 2026: Fast Default + Deep Research Mode
Decision Date: 2026-06-18 · Status: Implementation Complete (Phase 1) · Owner: Lantern Core Team
Problem Statement
The previous serving approach produced degraded replies: latencies of 70–85 seconds and output that degenerated into token loops (✅✅✅). Both blocked a sustainable product.
Two blockers existed:
- No-cache speed: Every reply required full inference without KV caching
- Decode degeneration: Missing anti-repetition parameters allowed token loops
THE DECISION: Make fast cached inference the product default. Keep native Σ₀ Q-exit loop as an opt-in research mode.
Architecture
Default: FAST MODE (Fast Cached Inference)
When: All requests unless OURO_NATIVE=1 is set.
What:
- Uses cached KV inference (Ollama/Ouro UniversalTransformerCache)
- Anti-repetition decode parameters enabled by default
- Target latency: <2 seconds for dream chat
- Suitable for interactive UX, real-time feedback, production use
Decode Parameters (FAST mode):
{
"temperature": 0.7,
"top_p": 0.95,
"frequency_penalty": 0.5, # OpenAI/Deepseek/Groq
"repetition_penalty": 1.1, # Ollama
"repeat_last_n": 64, # Ollama context window
}
Rationale:
- KV caching removes the no-cache speed blocker, so a reply no longer pays for full inference every time
- Tighter sampling and aggressive repetition penalties (top_p=0.95, frequency_penalty=0.5) kill the ✅✅✅ loop
- 64-token context for repetition detection balances freshness vs. tone consistency
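The inline comments in the FAST parameter block suggest that each key targets one provider family. The following is a minimal sketch of how a connector could forward only the keys a given provider understands. The `PROVIDER_KEYS` table, the `params_for_provider` helper, and the fallback for unknown providers are hypothetical and not taken from the codebase; the key names are kept exactly as listed above.

```python
# Hypothetical helper: names and grouping are illustrative, not from src/.
FAST_DECODE_PARAMS = {
    "temperature": 0.7,
    "top_p": 0.95,
    "frequency_penalty": 0.5,   # OpenAI/Deepseek/Groq
    "repetition_penalty": 1.1,  # Ollama
    "repeat_last_n": 64,        # Ollama
}

# Keys assumed to be accepted by every provider.
COMMON_KEYS = {"temperature", "top_p"}
OPENAI_STYLE_KEYS = COMMON_KEYS | {"frequency_penalty"}

# Assumed grouping, inferred from the comments in the parameter block.
PROVIDER_KEYS = {
    "openai": OPENAI_STYLE_KEYS,
    "deepseek": OPENAI_STYLE_KEYS,
    "groq": OPENAI_STYLE_KEYS,
    "ollama": COMMON_KEYS | {"repetition_penalty", "repeat_last_n"},
}


def params_for_provider(provider: str, params: dict) -> dict:
    """Return only the decode params the given provider family accepts.

    Unknown providers fall back to the common keys (an assumption).
    """
    allowed = PROVIDER_KEYS.get(provider.lower(), COMMON_KEYS)
    return {key: value for key, value in params.items() if key in allowed}
```

Filtering at the connector boundary keeps a single parameter block per mode while avoiding unsupported keys in any one provider request.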
Opt-In: DEEP MODE (Native Σ₀ Q-exit Loop)
When: OURO_NATIVE=1 environment variable is set.
What:
- Adaptive depth via native Σ₀ Q-exit loop (grounded reasoning)
- No KV cache — full adaptive inference per query
- Higher latency acceptable (70–85 seconds for complex reasoning)
- Suitable for: architecture decisions, research, grant writing, core system design
Decode Parameters (DEEP mode):
{
"temperature": 0.7,
"top_p": 0.98, # Slightly less aggressive
"frequency_penalty": 0.2, # Allow more repetition for grounding
"repetition_penalty": 1.05, # Ollama: softer penalty
"repeat_last_n": 128, # Ollama: wider context
}
Rationale:
- The adaptive loop may reference prior states; softer anti-repetition allows this
- Higher top_p (0.98 vs 0.95) gives reasoning more token diversity
- 128-token context captures cross-turn grounding patterns
Configuration
Environment Variables
# Product default: fast cached inference
(unset OURO_NATIVE) → FAST mode
# Opt-in to deep research mode
OURO_NATIVE=1 → DEEP mode
Detection (Code)
from serving_modes import get_serving_mode, get_decode_params
mode = get_serving_mode() # Returns FAST_MODE or DEEP_MODE
decode_params = get_decode_params(mode)
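For orientation, here is a minimal sketch of what a module exposing these two functions could look like. It is not the contents of src/serving_modes.py: the constant values, the `_DECODE_PARAMS` table, and the optional `env` argument (added here only to make the function testable) are assumptions. The parameter values are the ones listed under Architecture.

```python
import os

# Assumed constant values; only the names FAST_MODE / DEEP_MODE appear above.
FAST_MODE = "fast"
DEEP_MODE = "deep"

_DECODE_PARAMS = {
    FAST_MODE: {
        "temperature": 0.7,
        "top_p": 0.95,
        "frequency_penalty": 0.5,
        "repetition_penalty": 1.1,
        "repeat_last_n": 64,
    },
    DEEP_MODE: {
        "temperature": 0.7,
        "top_p": 0.98,
        "frequency_penalty": 0.2,
        "repetition_penalty": 1.05,
        "repeat_last_n": 128,
    },
}


def get_serving_mode(env=None) -> str:
    """Return DEEP_MODE only when OURO_NATIVE is exactly "1", else FAST_MODE."""
    env = os.environ if env is None else env
    return DEEP_MODE if env.get("OURO_NATIVE") == "1" else FAST_MODE


def get_decode_params(mode: str) -> dict:
    """Return a copy so callers cannot mutate the shared defaults."""
    return dict(_DECODE_PARAMS[mode])
```

Treating anything other than `OURO_NATIVE=1` as FAST keeps the product default safe when the variable is unset, empty, or mistyped.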
Performance Baseline (measured, not estimated)
The baseline accrues from real benchmark runs in data/benchmarks/leaderboard.jsonl — appended by the daily workflow and by python src/serving_benchmark.py --providers <provider:model> --mode fast.
⚠️ A previous version of this section hard-coded estimated numbers (e.g. "450ms / 0.92") that were never measured. They were removed per the External Reality Rule: the leaderboard and data/benchmarks/REPORT.md are the single source of truth, and the benchmark records only real provider responses (an unreachable provider is logged as an error, never as fabricated data).
Measurement & Iteration
Standing Benchmark
# Run one provider:model pair (FAST mode, the product default)
python src/serving_benchmark.py --run anthropic:claude-haiku-4-5-20251001 --mode fast
# Run several at once; DEEP mode sets OURO_NATIVE for the run
python src/serving_benchmark.py --providers "anthropic:claude-haiku-4-5-20251001,openai:gpt-4.1-mini" --mode fast
# Validate the latest run per config against the FAST/DEEP contract (exit 1 on regression)
python src/serving_benchmark.py --validate
# Refresh the Markdown monitoring report
python src/serving_benchmark.py --report # → data/benchmarks/REPORT.md
Golden set: diverse prompts (reasoning, creative, code, domain). Metrics: latency, tokens, repetition_ratio (+ per-task pass), cost, throughput. Leaderboard: data/benchmarks/leaderboard.jsonl · Report: data/benchmarks/REPORT.md.
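The success criteria describe repetition_ratio in terms of unique words, so a higher value means less repetition. One plausible reading is unique words divided by total words; the sketch below assumes that definition, whitespace tokenization, and a score of 0.0 for empty text. The benchmark's actual formula may differ.

```python
def repetition_ratio(text: str) -> float:
    """Unique words divided by total words (assumed definition).

    1.0 means every word is distinct; values near 0 indicate a loop.
    Empty text scores 0.0 in this sketch.
    """
    words = text.lower().split()
    if not words:
        return 0.0
    return len(set(words)) / len(words)
```

Under this definition a reply that repeats one word ten times scores 0.1, while a short reply with no repeated words scores 1.0.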
Honesty contract
The connector silently falls back to a canned offline persona stub when a provider is unreachable. The benchmark therefore pins the requested model onto the provider config, streams with fallback=False, and rejects any run whose metadata reports source: offline or whose output is empty — recording it as an error, never as provider data. (The CLI model used to be ignored; it is now the model actually queried, so a leaderboard row always belongs to the model it names.)
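The rejection rule can be sketched as a small check applied before a row is written. This is a hedged illustration, not the benchmark's code: `rejection_reason`, `to_leaderboard_row`, and the row field names are hypothetical, and only the `source: offline` marker and the empty-output condition come from the text above.

```python
from typing import Optional


def rejection_reason(output: str, metadata: dict) -> Optional[str]:
    """Return why a run must be logged as an error, or None if it is usable."""
    if metadata.get("source") == "offline":
        return "offline stub response"
    if not output.strip():
        return "empty output"
    return None


def to_leaderboard_row(provider: str, model: str, output: str, metadata: dict) -> dict:
    """Build a row that never presents a rejected run as provider data."""
    row = {"provider": provider, "model": model}
    reason = rejection_reason(output, metadata)
    if reason is None:
        row["status"] = "ok"
    else:
        row["status"] = "error"
        row["error"] = reason
    return row
```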
Validation contract (#730)
| Mode | Latency | Repetition (target / floor) | Success rate |
|---|---|---|---|
| FAST | ≤s (error) | 0.85 / 0.80 | ≥ 0.90 (error) |
| DEEP | 70-85 s band (warn\*) | 0.80 / 0.75 | ≥ 0.90 (error) |
Repetition is WARN below target but ERROR only below the floor — a real ✅✅✅ token-loop scores ~0.1-0.3, far under the floor, while honest short replies hover near the target. A per-task check also fails if any single golden-set task collapses below repetition 0.5. \* The DEEP latency band only binds the native Σ₀ runtime; it is informational for cached providers.
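The two-tier repetition rule can be sketched as follows. The targets, floors, and the 0.5 per-task floor come from the table and paragraph above; the function name, the return labels, and the idea of passing one aggregate ratio plus a list of per-task ratios are assumptions. Latency and success-rate checks are left out of this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RepetitionContract:
    target: float  # below this: WARN
    floor: float   # below this: ERROR


CONTRACTS = {
    "fast": RepetitionContract(target=0.85, floor=0.80),
    "deep": RepetitionContract(target=0.80, floor=0.75),
}
PER_TASK_FLOOR = 0.5


def check_repetition(mode: str, overall_ratio: float, per_task_ratios: list) -> str:
    """Return "error", "warn", or "ok" for one benchmark run."""
    contract = CONTRACTS[mode]
    if overall_ratio < contract.floor:
        return "error"
    # A single collapsed task fails the run even if the aggregate looks fine.
    if any(ratio < PER_TASK_FLOOR for ratio in per_task_ratios):
        return "error"
    if overall_ratio < contract.target:
        return "warn"
    return "ok"
```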
Daily automation: .github/workflows/serving-benchmark.yml runs the logic tests, benchmarks every cloud provider whose API key is a repo secret, regenerates the report, commits the leaderboard, then runs --validate as a gate. Providers without a key are skipped — never fabricated. Over time the leaderboard becomes a measurable performance history; each run appends one row.
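The skip-when-no-key behaviour might look like the sketch below. The mapping from provider name to environment variable is an assumption made for illustration, as is the `runnable_providers` helper; the workflow file itself is not reproduced here.

```python
import os

# Assumed mapping from provider name to the env var holding its API key.
PROVIDER_KEY_ENV = {
    "anthropic": "ANTHROPIC_API_KEY",
    "openai": "OPENAI_API_KEY",
}


def runnable_providers(env=None) -> list:
    """Return providers whose API key is present and non-empty."""
    env = os.environ if env is None else env
    return [name for name, var in PROVIDER_KEY_ENV.items() if env.get(var)]
```

Providers missing from the returned list are simply not benchmarked, so no row is written for them.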
Migration Path
Phase 1: ✅ COMPLETE (2026-06-18)
- Add anti-repetition decode params to all providers
- Implement FAST/DEEP mode system
- Create standing benchmark
- Default to FAST (product-ready)
Phase 2: Validation (#730)
- Harden benchmark for honest metrics (model pinning, offline-stub rejection)
- Add --validate FAST/DEEP threshold gate (two-tier repetition)
- Daily benchmark automation (CI workflow, validates as a gate)
- Reasoning/coding regression check (per-task repetition floor)
- Document FAST/DEEP expectations + per-provider decode params (PROVIDERS.md)
- Accrue ≥7 daily runs (automated; accrues once provider secrets are set)
Phase 3: Optimization (Weeks of 2026-07-02)
- Tune decode params per provider
- Explore lighter KV cache configs for DEEP mode
- Cache DEEP mode results for common research questions
- Consider hybrid mode: FAST with DEEP fallback for hard problems
Phase 4: Research (Ongoing)
- Compare DEEP mode to other high-reasoning approaches (Claude Opus, etc.)
- Measure Σ₀ Q-exit effectiveness (grounding quality vs. latency)
- Publish results as case study
Code Integration
Updated Files
src/unified_agent_connector.py
- Integrated serving modes
- All provider streamers now respect mode-appropriate decode params
src/serving_modes.py (NEW)
- Mode definitions (FAST, DEEP)
- get_serving_mode(), get_decode_params()
src/serving_benchmark.py (NEW)
- Golden set runner
- Leaderboard tracking
- CLI: --run provider:model, --summarize
apps/lantern-garage/lib/unified-agent.js (TODO in Phase 2)
- Add OURO_NATIVE detection for Node.js side
- Route requests to correct inference path
Success Criteria
✅ FAST mode (product default):
- Latency < 2s for 90% of dream chat requests
- No token loops (✅✅✅ degeneration)
- Repetition ratio > 0.85 (unique words)
- Cost stable or decreasing
✅ DEEP mode (research opt-in):
- Grounded reasoning (Σ₀ loop validates claims)
- Latency 70–85s acceptable for complex decisions
- Repetition ratio > 0.80 (grounding may repeat concepts)
- Used successfully for ≥3 real architecture decisions
✅ Benchmark sustainability:
- Daily runs on CI (appendable leaderboard)
- Golden set results tracked for all changes
- Alert on regression (latency +20%, repetition -0.05)
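The regression alert thresholds can be expressed as a comparison between two leaderboard rows. This sketch assumes the baseline is the previous run for the same configuration and that rows carry `latency_s` and `repetition_ratio` fields; both the field name `latency_s` and the `regression_alerts` helper are hypothetical.

```python
LATENCY_INCREASE_LIMIT = 0.20  # alert when latency grows by more than 20%
REPETITION_DROP_LIMIT = 0.05   # alert when the ratio falls by more than 0.05


def regression_alerts(previous: dict, latest: dict) -> list:
    """Return the names of metrics that regressed beyond the thresholds."""
    alerts = []
    if latest["latency_s"] > previous["latency_s"] * (1 + LATENCY_INCREASE_LIMIT):
        alerts.append("latency")
    if latest["repetition_ratio"] < previous["repetition_ratio"] - REPETITION_DROP_LIMIT:
        alerts.append("repetition")
    return alerts
```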
Backward Compatibility
- Existing code paths unchanged (no breaking changes)
- Default behavior is now FAST; opt-in to DEEP
- unified-agent Python interface: drop-in compatible
- Node.js callers: no changes needed (will use FAST by default)
References
- Σ₀ Framework: docs/CONVERGANCE-SIGMA0-BRIEFING.md
- Model Architecture: docs/research-canon.md
- Benchmark: src/serving_benchmark.py
- Serving Modes Config: src/serving_modes.py