ADR-0003: One canonical CSF module
Status
Proposed — awaiting approval from Alex Place.
Context
CSF (Convergence-Fitted Searchable Format) is the project's lossless binary archive — the "one CSF archive" half of the persistence model (the other half is append-only JSONL; see ADR-0004). Historically CSF accreted multiple writers: a segmented v1 CsfArchive with its own csf_compress/decompress/merge/search CLIs, a v0.3 csf_file writer, and lossy v0.7 symbolic text compressors. Multiple formats meant callers picked the wrong one, archives were written in incompatible layouts, and "which CSF do I use?" was a recurring tax.
The v2 consolidation (2026-06, repo v1.5.0) resolved this.
Decision
We will keep exactly one canonical CSF module with a single public API at the package root, src/csf/__init__.py: pack/unpack/read_file for file/blob archives and compress/decompress for byte strings, backed by the engine csf_pack.py.
- The duplicate/legacy writers are deleted so they cannot be called by mistake.
- Existing on-disk archives remain openable read-only via legacy.py.
- The v07 lattice primitives (Tesseract "storage face") and the Status-Cube container (status_cube.py) are retained.
- No second CSF format may be re-introduced. New needs extend the one module.
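The shape of this decision (one writer, plus a reader that still understands old data) can be sketched as follows. This is a minimal illustration, not the real csf code: the magic prefixes, the `_read_legacy` helper, and the use of zlib as the codec are all assumptions made for the sketch, and only the names compress/decompress come from the API described above.

```python
import zlib

# Hypothetical magic prefixes, invented for this sketch only.
CURRENT_MAGIC = b"CSF2"
LEGACY_MAGIC = b"CSF1"


def compress(data: bytes) -> bytes:
    """The single writer: it can only emit the current layout."""
    # zlib is a stand-in codec here, not a claim about the real engine.
    return CURRENT_MAGIC + zlib.compress(data, 9)


def _read_legacy(blob: bytes) -> bytes:
    """Read-only path for old archives; deliberately has no writer twin."""
    return zlib.decompress(blob[len(LEGACY_MAGIC):])


def decompress(blob: bytes) -> bytes:
    """Decode a current blob, falling back to the read-only legacy reader."""
    if blob.startswith(CURRENT_MAGIC):
        return zlib.decompress(blob[len(CURRENT_MAGIC):])
    if blob.startswith(LEGACY_MAGIC):
        return _read_legacy(blob)
    raise ValueError("not a recognised archive")
```

Because no function in the sketch produces the legacy layout, old data stays readable while new divergence cannot be written, which is the property the bullets above are after.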
Options Considered
Option A: One canonical module, legacy read-only (chosen)
Pros: one obvious API; impossible to write the wrong format; old data still readable. Cons: one-time migration; loses the niche strengths of the deleted writers (e.g. symbolic ratios, which were lossy fiction anyway).
Option B: Keep multiple writers behind a façade (rejected)
Cons: the façade hides, but does not remove, the footgun; writers drift; the "which format" tax persists.
Trade-off Analysis
Deleting working code feels lossy, but a writer that can be called by mistake is a liability, not an asset. Read-only legacy support preserves the only thing that mattered (existing data) while removing the ways to create new divergence.
Consequences
- Positive: import csf is the one true entry point; new code can't fork the format; archives stay consistent.
- Negative / trade-offs: docstrings and the spec must not over-promise the codec; see the known divergence below.
- Follow-ups: see ARCHITECTURE.md §9.3. The public API advertises zstd-19+LDM (__init__.py:12), but active paths have bottlenecked on zlib / low-level zstd. Verify the real codec before quoting ratios. This is a perf-debt follow-up, not a reason to add a second format.
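One hedged way to check which codec actually produced a payload before quoting a ratio is to inspect its leading bytes. The sketch below assumes you already hold the raw compressed stream. If the archive wraps that stream in its own header, the header has to be skipped first, and its layout is not described here. The helper names `sniff_codec` and `ratio` are hypothetical. The Zstandard frame magic and the zlib header check are standard properties of those formats.

```python
import zlib

# Zstandard frame magic number 0xFD2FB528, stored little-endian.
ZSTD_MAGIC = b"\x28\xb5\x2f\xfd"


def sniff_codec(payload: bytes) -> str:
    """Guess the codec of a raw compressed stream from its first bytes."""
    if payload.startswith(ZSTD_MAGIC):
        return "zstd"
    if len(payload) >= 2:
        cmf, flg = payload[0], payload[1]
        # zlib header: method 8 (deflate) and a 16-bit header divisible by 31.
        if cmf & 0x0F == 8 and ((cmf << 8) | flg) % 31 == 0:
            return "zlib"
    return "unknown"


def ratio(raw: bytes, payload: bytes) -> float:
    """Compression ratio as original size divided by compressed size."""
    if not payload:
        raise ValueError("empty payload")
    return len(raw) / len(payload)
```

A ratio is only worth quoting alongside the codec that `sniff_codec` reports for the same payload. A result of "unknown" means the check was inconclusive and says nothing about the codec.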
Alternatives considered
See Options Considered. "Do nothing" (keep all writers) was the status quo that caused the problem.
Evidence
| Claim | Evidence (file:line / commit / PR) | Confidence | Source |
|---|---|---|---|
| One public API at package root | src/csf/__init__.py:12-18 | High | code |
| Legacy writers deleted, read-only via legacy.py | __init__.py:27, legacy.py | High | code |
| v07 lattice + Status-Cube kept | status_cube.py; CLAUDE.md CSF section | High | code + doc |
| Codec claim vs reality is open debt | __init__.py:12 vs ARCHITECTURE.md §9 | Medium | doc + prior measurement |