CSF Format Specification (canonical, consolidated)

CSF = Convergence-Fitted Searchable Format — Keystone OS's binary container family for memory, symbolic data, and (as of v0.8) arbitrary files.

This is the single canonical spec. It consolidates the previously-scattered CSF documentation (whitepaper, backend notes, CADD, code docstrings).

v2 consolidation (2026-06). CSF is now one lossless, zstd-backed format. The duplicate/legacy writers were deleted so they can't be called by mistake: the segmented CsfArchive v1 (+ header/dictionary/sparse/search.py and the csf_compress/decompress/merge/search CLIs), the root csf_file.py v0.3 symbolic writer, and the lossy v0.7 symbolic text compressors (csf_symbolic_compressor, the ClassicalCompressor class). The public API is the package root csf/__init__.py over the engine csf/csf_pack.py. Existing on-disk archives still open read-only via csf/legacy.py. The 3¹² lattice primitives (csf/v07/quantum_dust.py, qutrit_delta.py) and the Status-Cube binary container (csf/v07/csf_file.py) are kept — CSF stores a point on that lattice; it is not the lattice (see §6).

Lattice view (singularity). The symbolic v0.7 engine (src/csf/v07/) is the storage face of the 3**12 balanced-ternary Convergence Lattice — the same object the Converged Tesseract moves across. See TESSERACT-CSF-SINGULARITY.md for the consolidation; §6 below is the short bridge.

1. Version lineage

Version	Magic	Status	Reference code
v0.8 — CSF-Pack (canonical)	`CSF\0`	Active. The one format. Lossless arbitrary-file/blob container; zstd-19+LDM default	`csf/csf_pack.py` · API `csf/__init__.py`
v0.3 symbolic	`CSF\0`	Removed (writer) — read-only via `csf.legacy`	retired
v1 segmented	`CSFv1\0\0\0`	Removed (writer + CLIs) — read-only via `csf.legacy`	retired
v0.7 symbolic text compressors	—	Removed (lossy, non-invertible)	retired
v0.7 lattice primitives	—	Kept — Tesseract storage face (§6)	`csf/v07/`

Use the canonical core for everything. import csf; csf.pack(...) / csf.compress(...). The removed symbolic "compressors" were lossy and had no decoder — never a real format. Legacy on-disk archives open read-only through csf.legacy.

2. CSF-Pack (v0.8) — arbitrary-file container

2.1 Binary layout


[Magic        4 bytes : b"CSF\x00"]

[Version      2 bytes : major, minor = 0, 8]

[Flags        2 bytes : bit0 = blobs zlib-compressed]

[ManifestLen  4 bytes : uint32 BE]

[Manifest     N bytes : UTF-8 JSON]

[Blob region  M bytes : concatenated (optionally compressed) file bytes]

[Footer      40 bytes : sha256(all preceding bytes) (32) + total size uint64 BE (8)]

2.2 Manifest JSON


{

  "format": "csf-pack", "version": "0.8", "created_at": 1750000000.0,

  "compressed": true, "file_count": 3,

  "files": [

    {"path": "src/a.txt", "size": 1050, "csize": 60,

     "sha256": "…", "offset": 0, "compressed": true,

     "description": "one-line gloss of what this file is",

     "metadata": {"loop_stage": "Remember", "verdict": "grounded"}}

  ]

}

path — POSIX-relative arc path (directory structure preserved on unpack).
size/csize — original / stored byte length; offset is relative to the blob region.
sha256 — digest of the original bytes; verified on unpack.
description (optional) — human-readable summary of the member (Σ₀ gloss).
metadata (optional) — JSON-serialisable dict of grounding (purpose, loop

stage, verdict, confidence, source). Both are omitted when absent, so annotating is fully backward compatible — un-annotated archives are byte-identical to before the fields existed, and older readers ignore the extra keys.

Attach them at pack time with annotations={arc_path: {"description": ..., "metadata": ...}} (a bare string value is shorthand for description-only), and read them back with csf.list_archive, csf.file_annotation(archive, path), or csf.annotations(archive) (the searchable grounding index of every annotated member). Existing archives can be regenerated with descriptions via scripts/annotate_csf_archive.py.

2.3 Integrity & safety

Footer digest (sha256 of everything before the footer) is verified before

the manifest is parsed — any tampering fails with a clean integrity error.

Per-file sha256 is verified on extraction.
Path traversal (.., absolute paths) is rejected on unpack (_safe_join).

2.4 API


import csf



# archive (files or in-memory blobs); per-file SHA-256 + footer integrity

csf.pack(["mydir", "file.bin"], "out.csf")          # default codec = zstd-19+LDM

csf.pack(["mydir"], "out.csf", codec="zstd", use_dict=True)  # shared dict, keeps random access

csf.list_archive("out.csf")                          # manifest (no extract)

csf.unpack("out.csf", "dest_dir")                    # -> [written paths]

data = csf.read_file("out.csf", "file.bin")          # verified single member



# per-file grounding (Σ₀): description + metadata, retrievable without extract

csf.pack(["mydir"], "out.csf",

         annotations={"mydir/a.py": {"description": "…", "metadata": {"loop_stage": "Act"}}})

csf.file_annotation("out.csf", "mydir/a.py")         # {"description": …, "metadata": …}

csf.annotations("out.csf")                            # {arc_path: {description, metadata}} index



# lightweight single-blob stream (1-byte codec header, no manifest)

blob = csf.compress(b"...")                           # -> bytes

raw  = csf.decompress(blob)

Codec is per-file and self-describing (zstd | zlib | store | omni); a missing codec reads as zlib, so pre-codec archives still extract byte-for-byte. The opt-in omni codec is the best-fit / max-ratio tier (see §2.7.1).

2.5 CLI


python -m csf.csf_pack pack <paths...> -o out.csf [--no-compress]

python -m csf.csf_pack unpack out.csf -d <dest_dir>

python -m csf.csf_pack list out.csf

Tests: tests/test_csf_pack.py (round-trip ×2, list, tamper-detection, traversal).

2.6 App routes

Wired into the server alongside the legacy tesseract pack (routes/csf.js):


POST /api/csf/pack    { paths: ["docs","README.md"], out: "data/exports/bundle.csf", compress?: true }

POST /api/csf/unpack  { archive: "data/exports/bundle.csf", dest: "data/exports/out" }

Both constrain paths to within repoRoot (no traversal) on top of the module's own guards.

2.7 Benchmark — does it work better?

python scripts/csf_pack_benchmark.py onreal repo files (2.8 MB):

Format	Size	Ratio	Integrity / safety
CSF-Pack v0.8	1.0 MB	2.73×	SHA-256 per file + whole-archive footer digest + path-traversal guard
zip (DEFLATE-9)	1.0 MB	2.76×	CRC-32 only; no crypto hash; no path guard
tar.gz	888 KB	3.22×	no per-file checksum
legacy symbolic CSF	—	—	cannot store arbitrary files (255 B/record payload cap)

Verdict (honest): CSF-Pack is size-competitive with zip (within ~1%) and strictly safer — cryptographic per-file + whole-archive integrity and path-traversal protection that zip/tar don't provide by default. tar.gz compresses better (solid stream) but offers no per-file integrity. Against the legacy symbolic CSF it's a categorical upgrade — that format can't hold arbitrary file bytes at all. Use CSF-Pack when integrity + safety matter; it's the format for general bundling.

2.7.1 CSF-Omni — the opt-in best-fit codec

The table above is the archive-vs-zip picture (text+code, ~30% spread). On the append-only memory log the codec choice dominates. CSF already ships zstd-19 + LDM by default, which compresses aMB JSONL memory log 362× (vs zlib's 14× before #835). CSF-Omni (src/csf/omni.py) is a new opt-in codec (codec="omni") that goes one better: it runs the whole panel per blob (store · zlib · bz2 · lzma · zstd · brotli + a byte transform), round-trip-verifies each, and keeps the smallest behind a 7-byte self-describing, CRC-checked header — deterministically (same input → same bytes).

Measured against the shipped zstd-19, every codec round-trip-verified lossless (experiments/csf_compression_benchmark.py; full write-up: CSF Compression Benchmark — Review v3 (PDF)):

Corpus (raw)	zstd-19 (ships)	brotli-11	CSF-Omni	Omni vs zstd
text + code (3.07 MB)	4.11×	4.23×	4.23×	+2.8%
JSONL memory log (4.0 MB)	362.6×	422.1×	421.8×	+16.3%
cube delta stream (25 KB)	16.0×	17.15×	17.07×	+6.5%

CSF-Omni beats every other codec (zlib/zstd/lzma/bz2) on every single stream and is the only configuration that is best-or-tied everywhere. On the multi-file archive the picture is narrower: codec="omni" reaches 3.06× on the 340-file corpus, but master's existing zstd + shared dictionary (use_dict=True) already reaches 3.00× — so omni's archive edge is only +2 %, at ~7× the encode cost (omni ~31 s vs zstd+dict ~4 s vs plain zstd ~1 s). The dictionary recovers the cross-file redundancy that per-file omni cannot, so omni's durable win is on single streams, not archives.

Honest framing (Σ₀). CSF-Omni is the upper envelope, not a new entropy coder: on these corpora brotli is the frontier, so Omni matches it (payload-identical, +7-byte header) — it does not beat brotli's raw bytes, and no library available here (PPMd, paq) does. Its value is guaranteed best-in-field selection on any input plus integrity (the CRC catches corruption that zstd's default frame returns silently) and a portable stdlib-only mode. Trade-off: the panel sweep makes the omni archiver ~31 s forfiles, so zstd stays the default (hot paths) and omni is the opt-in max-ratio tier for cold/archival single-stream blobs; decode is fast.

Adversarially verified. A six-agent fleet stress-tested it (fuzz round-trip — 3,817 checks · envelope · backward-compat · code review · decode-safety · honesty audit). Two defects were found and fixed: a corrupt brotli payload could decode to silently-wrong bytes (→ CRC-32 + a ValueError decode contract), and a docstring over-claim (→ reworded to the envelope framing above). CSF tests pass.

2.7.2 Beyond the envelope — beating zstd-19 (theorized, tracked)

Omni is the upper envelope of off-the-shelf byte codecs; to go past it you must model structure or statistics LZ cannot express. Four grounded techniques are theorized in research/2026-06-29-csf-beating-zstd.md:

#	Technique	Lossless	Regime	Beats zstd-19?	Issue
1	CSF-Col — known-schema row→column transpose + typed coding → zstd backend	yes	hot	yes, 1.5–2.5× predicted on memory logs	#1593
2	RKD — retrieval-keyed lossless delta vs nearest prior record	yes	batch	yes on similar records	#1594
3	GRC — grounded resident-model residual coding, Σ₀-gated adaptive depth (corrected E1)	yes	cold	only if grounded (ungrounded raises entropy — proven)	#1595
4	Hybrid — GRC over a CSF-Col residual	yes	cold	highest ceiling	#1596

The Σ₀ collapse certificate (§ external) is load-bearing for #3: "deeper recurrence → fewer bits" holds only inside the grounded, non-collapsed regime; the NIS/anisotropy canary supplies the measured depth-exit. CSF-Col (#1593) is the recommended first build — see § 2.7.3 for the shipped result.

2.7.3 CSF-Col — shipped (transform id 2)

src/csf/col_transform.py is a lossless invertible byte→byte transform (registered as Omni transform id 2) that transposes flat-ish JSONL records from row-major to column-major before the entropy backend, so like-typed fields (timestamps, confidence, the near-constant reasoner/verified) form long compressible runs. Values are captured as raw source substrings (no JSON re-serialization → byte-exact), and forward() self-checks its own round-trip and raises NotApplicable otherwise — so Omni (which re-verifies and keeps the strict min) auto-selects it only on JSONL where it actually wins, and fast-skips everything else.

Measured, all round-trip-verified lossless (col+brotli selected by Omni vs the prior best baseline):

Corpus (raw)	best before	Omni + CSF-Col	gain
`csf_memory/deltas.jsonl` (21 KB)	19.9× (brotli)	24.4×	+23%
`csf_memory/raw.jsonl` (320 KB)	15.5× (omni)	16.4×	+6%
`convergence/records.jsonl` (353 KB)	8.5×	8.7×	+2%
small / text-dominated (e.g. 1.9 KB `agi-benchmark`)	—	falls back to brotli	no regression (not selected)

Honest framing. The win is real but modest on these corpora because they are dominated by large free-text fields (hypothesis/result) that don't columnarize — the gain comes from the small structured fields, and it grows with schema homogeneity (largest on the append-only deltas stream). It does not reach the 2–3× that schema-rich (mostly-typed-field) NDJSON sees in the literature. Because Omni selects per input, CSF-Col never regresses: on text-dominated or tiny blobs the framing overhead loses and it simply isn't picked. Tests: tests/test_csf_col.py (13, incl. 1.5k-case fuzz).

2.8 Per-user profile pack (one file per user, KB-grounded)

src/csf/profile_pack.py compacts all of one user's CSF data into a single file — data/profiles/<user>.csf (CSF-Pack v0.8) — and grounds it in the base Knowledge Center.

Archive contents:

user/… — the user's cube (data/cubes/<user>.private), deltas, indexes,

dreamer notebooks, csf_memory, profile json.

knowledge/index.jsonl — the embedded base KB grounding index (so the file

is self-contained and grounded), plus its .meta.json.

_profile.json — sources, user file count, and the grounding reference (KB sha256 + section count).


python -m csf.profile_pack pack <user>            # -> data/profiles/<user>.csf

python -m csf.profile_pack info  <archive>        # embedded _profile.json

python -m csf.profile_pack unpack <archive> -d <dest>

Routes: POST /api/csf/profile/pack {user} · GET /api/csf/profile/info?user=<id>. User profile archives are gitignored (user data — privacy). Tests: tests/test_csf_profile.py.

2.9 Base Knowledge Center grounding + cheaper routing

KB index — scripts/build_knowledge_index.py turns the Knowledge Center

source docs into data/knowledge/index.jsonl (one record per doc section with heading path + snippet). This is the base grounding corpus for "better LLM grounding."

Cheaper deterministic / near routing —

lib/knowledge-router.js answers from the KB index before paying for an LLM:

deterministic — exact heading match → that section verbatim ($0)
near — TF-IDF nearest section above threshold → grounded answer ($0)
miss — caller falls through to the provider chain

Route: GET|POST /api/knowledge/query { q } → { tier, hit, source, text, score }.

Rebuild the KB index after editing core docs: python scripts/build_knowledge_index.py.

3. Retired formats & the read-only bridge

The v2 consolidation removed every legacy writer. Nothing in the codebase produces these formats anymore; existing archives open read-only through csf.legacy.

Retired format	Was	Why removed	Reading it now
v0.3 symbolic (`csf_file.py`)	`CSF\0` baseline+dict+delta writer	superseded by the lossless core	`csf.legacy` (no on-disk files existed)
v1 segmented (`header.py` + `dictionary/sparse/search.py` +CLIs)	`CSFv1\0\0\0` segment container	duplicate archive format	`csf.legacy` (no on-disk files existed)
v0.7 symbolic text (`csf_symbolic_compressor`, `ClassicalCompressor`)	lossy "ratio" projection	lossy + no decoder — never reversible	n/a (was never a real archive)
raw DEFLATE blobs (dream-journal previews, `archive-commons/*.csf`)	bare zlib streams	— (kept)	`csf.legacy.decode_bytes` (inflate)

csf.legacy.open(path) sniffs and dispatches: modern CSF\0 → core; zlib blob → inflate; anything else → CsfLegacyError.

3.1 Still-live v0.7 lattice (`v07/`) — kept

quantum_dust, qutrit_delta, convergence_engine, plus the binary container csf_file.py (used by the Status-Cube store) and the lossless primitives in classical_compressor.py (SymbolicDictionary, sparse CSR). These are the storage face of the 3¹² lattice (§6), not a compression format.

4. Code map

Path	Role
`src/csf/__init__.py`	Canonical public API (facade over the engine)
`src/csf/csf_pack.py`	The format engine — pack/unpack/read_file + codec layer
`src/csf/legacy.py`	Read-only decoders for retired/legacy on-disk archives
`src/csf/profile_pack.py`	per-user profile archive (over the core)
`src/csf/v07/`	3¹² lattice primitives (Tesseract storage face) + Status-Cube container
`src/csf/status_cube.py`	StatusCube (player ImagniVerse)
`src/csf/memory_engine.py`	memory archive over CSF
`caad/README.md`	CADD (Context Archive for Dream Data) — built on CSF

5. Consolidation pointers (previously scattered)

docs/CSF-Whitepaper-v0.3.pdf — original whitepaper
docs/PHASE-1-CSF-BACKEND.md — backend phase notes
caad/README.md, caad/dollhouse-csf-upgrade.md — CADD layer
CSF-IMAGE-TRAINING.md — image-LoRA training over CSF
csf/ingest/ — CSF ingest docs are the memory/task queue, not format specs

This spec is the authoritative format reference; the above remain for history.

6. The 3¹² lattice (storage face of the singularity)

The v0.7 symbolic engine is not a standalone compressor — it is the storage face of a single 3**12 = 531,441-cell balanced-ternary lattice that the project also reasons over geometrically (the "Tesseract"). The two are one object; the full argument and grounding live in TESSERACT-CSF-SINGULARITY.md. Bridge facts:

Spec concept	Lattice role	Code
`NUM_DIMENSIONS = 12`, `TOTAL_POSITIONS =** 12`	12 ternary axes (one per Convergence-12 component)	`qutrit_delta.py`
`QutritState` (amp 0-7, phase 0-7) + `QutritDelta` (2 B)	a lattice cell + its signed change	`qutrit_delta.py`
`QuantumDustField` baseline + active deltas + dust	a stored point; most cells implicit ("dust")	`quantum_dust.py`
observer-collapsed wavefront	the motion face (Tesseract) reads the same field	`converged_tesseract.py`

Why base-3, not base-2: ternary is the most economical integer radix (optimum is e, nearest integer 3), and balanced ternary {-1,0,+1} gives symmetric arithmetic — the same substrate as BitNet b1.58's ternary weights (arXiv:2402.17764). The "no change is free" dust optimisation is the storage-side twin of BitNet's ~66 % zero-weight sparsity. Citations and falsifiable experiments: see the singularity doc §5–6.