docs/SERVING-ARCHITECTURE-2026.md

Lantern OS Serving Architecture 2026: Fast Default + Deep Research Mode

Decision Date: 2026-06-18
Status: Implementation Complete (Phase 1)
Owner: Lantern Core Team

Problem Statement

The previous serving approach produced degraded replies: latencies of 70–85 seconds and output that degenerated into token loops (✅✅✅). Together these blocked a sustainable product.

Two blockers existed:

  1. No-cache speed: Every reply required full inference without KV caching
  2. Decode degeneration: Missing anti-repetition parameters allowed token loops

THE DECISION: Make fast cached inference the product default. Keep the native Σ₀ Q-exit loop as an opt-in research mode.


Architecture

Default: FAST MODE (Fast Cached Inference)

When: All requests unless OURO_NATIVE=1 is set.

What:

  • Uses cached KV inference (Ollama/Ouro UniversalTransformerCache)
  • Anti-repetition decode parameters enabled by default
  • Target latency: <2 seconds for dream chat
  • Suitable for interactive UX, real-time feedback, production use

Decode Parameters (FAST mode):


{
    "temperature": 0.7,
    "top_p": 0.95,
    "frequency_penalty": 0.5,      # OpenAI/Deepseek/Groq
    "repetition_penalty": 1.1,     # Ollama
    "repeat_last_n": 64,           # Ollama: lookback window (tokens) for the penalty
}

Rationale:

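  • Cached KV inference removes the no-cache speed blocker, so a reply no longer requires full uncached inference
  • Anti-repetition penalties (frequency_penalty=0.5, repetition_penalty=1.1) suppress the ✅✅✅ token loop; top_p=0.95 trims the low-probability tail
  • A 64-token repetition window (repeat_last_n) balances freshness vs. tone consistency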
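To make the role of the penalty and its lookback window concrete, here is a minimal sketch of the common multiplicative repetition penalty applied over the last N generated tokens. It illustrates the general technique only; how Ollama or any other provider implements it internally is not described in this document, and the function name and toy logits are hypothetical.

```python
from typing import Dict, List


def apply_repetition_penalty(
    logits: Dict[int, float],
    generated: List[int],
    penalty: float = 1.1,
    last_n: int = 64,
) -> Dict[int, float]:
    """Return a copy of `logits` with recently generated tokens made less likely."""
    # Only tokens inside the lookback window are penalized.
    recent = set(generated[-last_n:]) if last_n > 0 else set()
    adjusted = dict(logits)
    for token_id in recent:
        if token_id not in adjusted:
            continue
        score = adjusted[token_id]
        # Positive scores shrink; negative scores become more negative.
        adjusted[token_id] = score / penalty if score > 0 else score * penalty
    return adjusted


# Toy example: token 7 has been looping and would win again without the penalty.
logits = {7: 2.1, 8: 2.0, 9: -1.0}
history = [7, 7, 7, 9]
penalized = apply_repetition_penalty(logits, history)
print(max(logits, key=logits.get), "->", max(penalized, key=penalized.get))
```

In the toy example the looping token loses its lead once the penalty is applied, which is the effect the FAST defaults rely on.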

Opt-In: DEEP MODE (Native Σ₀ Q-exit Loop)

When: OURO_NATIVE=1 environment variable is set.

What:

  • Adaptive depth via native Σ₀ Q-exit loop (grounded reasoning)
  • No KV cache — full adaptive inference per query
  • Higher latency acceptable (70–85 seconds for complex reasoning)
  • Suitable for: architecture decisions, research, grant writing, core system design

Decode Parameters (DEEP mode):


{
    "temperature": 0.7,
    "top_p": 0.98,                 # Slightly less aggressive
    "frequency_penalty": 0.2,      # Allow more repetition for grounding
    "repetition_penalty": 1.05,    # Ollama: softer penalty
    "repeat_last_n": 128,          # Ollama: wider lookback window (tokens)
}

Rationale:

  • The adaptive loop may reference prior states, and softer anti-repetition allows this
  • Higher top_p (0.98 vs 0.95) gives reasoning more token diversity
  • 128-token context captures cross-turn grounding patterns

Configuration

Environment Variables


# Product default: fast cached inference
(unset OURO_NATIVE) → FAST mode

# Opt-in to deep research mode
OURO_NATIVE=1 → DEEP mode

Detection (Code)


from serving_modes import get_serving_mode, get_decode_params

mode = get_serving_mode()  # Returns FAST_MODE or DEEP_MODE
decode_params = get_decode_params(mode)
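For readers without the repository at hand, the following is a minimal sketch of what a module exposing these two functions could look like. It assumes the mode constants are plain strings and adds an optional `env` argument for testability; the real src/serving_modes.py may be structured differently. The parameter values are the ones listed in this document.

```python
import os
from typing import Any, Dict, Mapping, Optional

# Assumption: modes are represented as plain strings.
FAST_MODE = "fast"
DEEP_MODE = "deep"

_DECODE_PARAMS: Dict[str, Dict[str, Any]] = {
    FAST_MODE: {
        "temperature": 0.7,
        "top_p": 0.95,
        "frequency_penalty": 0.5,
        "repetition_penalty": 1.1,
        "repeat_last_n": 64,
    },
    DEEP_MODE: {
        "temperature": 0.7,
        "top_p": 0.98,
        "frequency_penalty": 0.2,
        "repetition_penalty": 1.05,
        "repeat_last_n": 128,
    },
}


def get_serving_mode(env: Optional[Mapping[str, str]] = None) -> str:
    """Return DEEP_MODE only when OURO_NATIVE=1; every other case is FAST_MODE."""
    env = os.environ if env is None else env
    return DEEP_MODE if env.get("OURO_NATIVE") == "1" else FAST_MODE


def get_decode_params(mode: str) -> Dict[str, Any]:
    """Return a copy so callers cannot mutate the shared defaults."""
    return dict(_DECODE_PARAMS[mode])


print(get_serving_mode({}), get_serving_mode({"OURO_NATIVE": "1"}))
```

Treating any value other than `1` as FAST keeps the product default in force when the variable is unset, empty, or mistyped.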


Performance Baseline (measured, not estimated)

The baseline accrues from real benchmark runs in data/benchmarks/leaderboard.jsonl — appended by the daily workflow and by python src/serving_benchmark.py --providers <provider:model> --mode fast.

⚠️ A previous version of this section hard-coded estimated numbers (e.g. "450ms / 0.92") that were never measured. They were removed per the External Reality Rule: the leaderboard and data/benchmarks/REPORT.md are the single source of truth, and the benchmark records only real provider responses (an unreachable provider is logged as an error, never as fabricated data).


Measurement & Iteration

Standing Benchmark


# Run one provider:model pair (FAST mode, the product default)
python src/serving_benchmark.py --run anthropic:claude-haiku-4-5-20251001 --mode fast

# Run several at once (selecting DEEP mode instead sets OURO_NATIVE for the run)
python src/serving_benchmark.py --providers "anthropic:claude-haiku-4-5-20251001,openai:gpt-4.1-mini" --mode fast

# Validate the latest run per config against the FAST/DEEP contract (exit 1 on regression)
python src/serving_benchmark.py --validate

# Refresh the Markdown monitoring report
python src/serving_benchmark.py --report   # → data/benchmarks/REPORT.md

Golden set: diverse prompts (reasoning, creative, code, domain). Metrics: latency, tokens, repetition_ratio (+ per-task pass), cost, throughput. Leaderboard: data/benchmarks/leaderboard.jsonl · Report: data/benchmarks/REPORT.md.
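As an illustration of the repetition_ratio metric, the sketch below computes the share of unique words in a reply, so higher values mean less repetition. The exact tokenization used by src/serving_benchmark.py is not stated here; whitespace splitting and lowercasing are assumptions.

```python
def repetition_ratio(text: str) -> float:
    """Share of unique words in `text`; 1.0 means no word is repeated."""
    words = text.lower().split()
    if not words:
        return 0.0  # An empty reply carries no usable signal.
    return len(set(words)) / len(words)


looped = "✅ " * 10
healthy = "the cache keeps replies fast"
print(repetition_ratio(looped), repetition_ratio(healthy))
```

A ten-token ✅ loop scores 0.1 under this definition, while a short reply with no repeated words scores 1.0.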

Honesty contract

The connector silently falls back to a canned offline persona stub when a provider is unreachable. The benchmark therefore pins the requested model onto the provider config, streams with fallback=False, and rejects any run whose metadata reports source: offline or whose output is empty — recording it as an error, never as provider data. (The CLI model used to be ignored; it is now the model actually queried, so a leaderboard row always belongs to the model it names.)
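The rejection rule can be sketched as a small pure function. The row layout and the `record_run` name are hypothetical; only the two rejection conditions (metadata reporting `source: offline`, or empty output) come from the contract described above.

```python
from typing import Any, Dict


def record_run(
    provider: str, model: str, output: str, metadata: Dict[str, Any]
) -> Dict[str, Any]:
    """Build a leaderboard row, refusing to count stub or empty replies as data."""
    row: Dict[str, Any] = {"provider": provider, "model": model}
    if metadata.get("source") == "offline":
        # The connector fell back to the canned persona stub.
        row.update(status="error", error="offline stub response")
    elif not output.strip():
        row.update(status="error", error="empty output")
    else:
        row.update(status="ok", output_chars=len(output))
    return row


print(record_run("openai", "gpt-4.1-mini", "hello", {"source": "offline"}))
```

Recording the failure as an error row, instead of dropping it, keeps the outage visible in the leaderboard history.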

Validation contract (#730)

| Mode | Latency | Repetition (target / floor) | Success rate |
|------|---------|-----------------------------|--------------|
| FAST | ≤s (error) | 0.85 / 0.80 | ≥ 0.90 (error) |
| DEEP | 70–85 s band (warn*) | 0.80 / 0.75 | ≥ 0.90 (error) |

Repetition here is the repetition ratio, where higher means fewer repeats. A score below the target raises a WARN; only a score below the floor raises an ERROR. A real ✅✅✅ token loop scores roughly 0.1–0.3, far under the floor, while honest short replies hover near the target. A per-task check also fails if any single golden-set task falls below a repetition ratio of 0.5.

* The DEEP latency band binds only the native Σ₀ runtime; it is informational for cached providers.
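The two-tier repetition rule and the per-task floor can be sketched as follows. This is an illustrative reimplementation, not the code behind --validate: the function name and return shape are hypothetical, and latency checks are left out of the sketch.

```python
from typing import Dict, List

# Thresholds taken from the validation contract above.
THRESHOLDS = {
    "fast": {"rep_target": 0.85, "rep_floor": 0.80, "min_success": 0.90},
    "deep": {"rep_target": 0.80, "rep_floor": 0.75, "min_success": 0.90},
}
PER_TASK_FLOOR = 0.5


def validate(
    mode: str,
    repetition: float,
    success_rate: float,
    per_task: Dict[str, float],
) -> Dict[str, List[str]]:
    """Classify one run into errors (gate fails) and warnings (gate passes)."""
    limits = THRESHOLDS[mode]
    errors: List[str] = []
    warnings: List[str] = []
    if repetition < limits["rep_floor"]:
        errors.append(f"repetition {repetition:.2f} below floor")
    elif repetition < limits["rep_target"]:
        warnings.append(f"repetition {repetition:.2f} below target")
    if success_rate < limits["min_success"]:
        errors.append(f"success rate {success_rate:.2f} below minimum")
    for task, score in per_task.items():
        if score < PER_TASK_FLOOR:
            errors.append(f"task '{task}' collapsed to {score:.2f}")
    return {"errors": errors, "warnings": warnings}


result = validate("fast", 0.82, 0.95, {"reasoning": 0.9, "code": 0.4})
print(result)
```

In this example the aggregate score only warns, but the collapsed `code` task still produces an error, which is the case the per-task floor exists to catch.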

Daily automation: .github/workflows/serving-benchmark.yml runs the logic tests, benchmarks every cloud provider whose API key is a repo secret, regenerates the report, commits the leaderboard, then runs --validate as a gate. Providers without a key are skipped — never fabricated. Over time the leaderboard becomes a measurable performance history; each run appends one row.
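The "one row per run" accrual maps naturally onto the JSON Lines format. The sketch below appends and reloads rows in a temporary directory; the field names are hypothetical and may not match the real leaderboard schema.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict, List


def append_row(path: Path, row: Dict[str, Any]) -> None:
    """Append one JSON object as a single line, creating the file if needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(row, ensure_ascii=False) + "\n")


def load_rows(path: Path) -> List[Dict[str, Any]]:
    """Read every non-empty line back as a dictionary."""
    with path.open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


with tempfile.TemporaryDirectory() as tmp:
    leaderboard = Path(tmp) / "benchmarks" / "leaderboard.jsonl"
    append_row(leaderboard, {"model": "example-a", "mode": "fast", "status": "ok"})
    append_row(leaderboard, {"model": "example-b", "mode": "fast", "status": "error"})
    rows = load_rows(leaderboard)

print(len(rows), rows[-1]["status"])
```

Because each run only appends, earlier rows are never rewritten and the file doubles as the performance history.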


Migration Path

Phase 1: ✅ COMPLETE (2026-06-18)

  • Add anti-repetition decode params to all providers
  • Implement FAST/DEEP mode system
  • Create standing benchmark
  • Default to FAST (product-ready)

Phase 2: Validation (#730)

  • Harden benchmark for honest metrics (model pinning, offline-stub rejection)
  • Add --validate FAST/DEEP threshold gate (two-tier repetition)
  • Daily benchmark automation (CI workflow, validates as a gate)
  • Reasoning/coding regression check (per-task repetition floor)
  • Document FAST/DEEP expectations + per-provider decode params (PROVIDERS.md)
  • Accrue ≥7 daily runs (automated; accrues once provider secrets are set)

Phase 3: Optimization (Weeks of 2026-07-02)

  • Tune decode params per provider
  • Explore lighter KV cache configs for DEEP mode
  • Cache DEEP mode results for common research questions
  • Consider hybrid mode: FAST with DEEP fallback for hard problems

Phase 4: Research (Ongoing)

  • Compare DEEP mode to other high-reasoning approaches (Claude Opus, etc.)
  • Measure Σ₀ Q-exit effectiveness (grounding quality vs. latency)
  • Publish results as case study

Code Integration

Updated Files

src/unified_agent_connector.py

  • Integrated serving modes
  • All provider streamers now respect mode-appropriate decode params
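One way a streamer could respect mode-appropriate decode params is to forward only the keys its provider understands. The sketch below follows the provider annotations in the decode parameter listings (frequency_penalty for OpenAI/Deepseek/Groq; repetition_penalty and repeat_last_n for Ollama). The provider identifiers, the helper name, and the choice to send temperature and top_p to every provider are assumptions, not the connector's actual code.

```python
from typing import Any, Dict

# Assumption: temperature and top_p are accepted by every provider.
SHARED_KEYS = ("temperature", "top_p")
PROVIDER_KEYS = {
    "openai": ("frequency_penalty",),
    "deepseek": ("frequency_penalty",),
    "groq": ("frequency_penalty",),
    "ollama": ("repetition_penalty", "repeat_last_n"),
}


def params_for_provider(provider: str, decode_params: Dict[str, Any]) -> Dict[str, Any]:
    """Keep only the decode params that apply to `provider`."""
    allowed = SHARED_KEYS + PROVIDER_KEYS.get(provider, ())
    return {key: decode_params[key] for key in allowed if key in decode_params}


fast_params = {
    "temperature": 0.7,
    "top_p": 0.95,
    "frequency_penalty": 0.5,
    "repetition_penalty": 1.1,
    "repeat_last_n": 64,
}
print(params_for_provider("ollama", fast_params))
print(params_for_provider("openai", fast_params))
```

Unknown providers fall back to the shared sampling keys only, which avoids sending a parameter that a provider might reject.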

src/serving_modes.py (NEW)

  • Mode definitions (FAST, DEEP)
  • get_serving_mode(), get_decode_params()

src/serving_benchmark.py (NEW)

  • Golden set runner
  • Leaderboard tracking
  • CLI: --run provider:model, --summarize

apps/lantern-garage/lib/unified-agent.js (TODO in Phase 2)

  • Add OURO_NATIVE detection for Node.js side
  • Route requests to correct inference path

Success Criteria

FAST mode (product default):

  • Latency < 2s for 90% of dream chat requests
  • No token loops (✅✅✅ degeneration)
  • Repetition ratio > 0.85 (unique words)
  • Cost stable or decreasing
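The latency criterion can be checked directly as the share of requests that finish inside the 2-second budget. The sample latencies below are made up for illustration and are not benchmark results.

```python
from typing import Sequence


def share_within_budget(latencies_s: Sequence[float], budget_s: float = 2.0) -> float:
    """Fraction of requests whose latency is strictly under the budget."""
    if not latencies_s:
        return 0.0
    within = sum(1 for value in latencies_s if value < budget_s)
    return within / len(latencies_s)


# Illustrative values only: nine fast replies and one slow outlier.
samples = [0.8, 1.1, 1.4, 0.9, 1.7, 1.2, 1.0, 1.9, 1.3, 2.6]
share = share_within_budget(samples)
print(f"{share:.0%} under 2s, criterion met: {share >= 0.90}")
```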

DEEP mode (research opt-in):

  • Grounded reasoning (Σ₀ loop validates claims)
  • Latency 70–85s acceptable for complex decisions
  • Repetition ratio > 0.80 (grounding may repeat concepts)
  • Used successfully for ≥3 real architecture decisions

Benchmark sustainability:

  • Daily runs on CI (appendable leaderboard)
  • Golden set results tracked for all changes
  • Alert on regression (latency +20%, repetition -0.05)
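The regression alert can be sketched as a comparison between a baseline row and the latest row, using the two thresholds named above (latency up by more than 20%, repetition ratio down by more than 0.05). The field names and the choice of baseline are assumptions.

```python
from typing import Dict, List

LATENCY_INCREASE_LIMIT = 0.20  # +20% relative to baseline
REPETITION_DROP_LIMIT = 0.05   # absolute drop in repetition ratio


def regression_alerts(baseline: Dict[str, float], latest: Dict[str, float]) -> List[str]:
    """Return one message per metric that regressed beyond its limit."""
    alerts: List[str] = []
    if latest["latency_s"] > baseline["latency_s"] * (1 + LATENCY_INCREASE_LIMIT):
        alerts.append("latency regressed by more than 20%")
    if latest["repetition_ratio"] < baseline["repetition_ratio"] - REPETITION_DROP_LIMIT:
        alerts.append("repetition ratio dropped by more than 0.05")
    return alerts


baseline = {"latency_s": 1.0, "repetition_ratio": 0.90}
latest = {"latency_s": 1.3, "repetition_ratio": 0.88}
print(regression_alerts(baseline, latest))
```

Here only the latency alert fires: 1.3 s exceeds the 1.2 s limit, while a 0.02 drop in repetition ratio stays within tolerance.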

Backward Compatibility

  • Existing code paths unchanged (no breaking changes)
  • Default behavior is now FAST; opt in to DEEP with OURO_NATIVE=1
  • unified-agent Python interface: drop-in compatible
  • Node.js callers: no changes needed (will use FAST by default)

References

  • Σ₀ Framework: docs/CONVERGANCE-SIGMA0-BRIEFING.md
  • Model Architecture: docs/research-canon.md
  • Benchmark: src/serving_benchmark.py
  • Serving Modes Config: src/serving_modes.py