A deterministic context compression layer that runs as a single Rust binary with no ML dependencies at query time. Exposes 11 MCP tools including compress_prompt for structured input, compress_history for conversation memory, compress_output with 21 domain-aware filters for shell commands, and read_file_delta/write_file_delta for incremental edits. Built to stack in front of provider caching and memory systems like mem0, targeting the dynamic conversation blocks those layers don't touch. The real-data benchmarks show 84.7% token savings across a weighted blend of history, shell output, file reads, and search operations measured on the repo's own development cycle. Reach for this when you're burning tokens on repeated file reads, verbose tool output, or growing conversation windows in long-running agent sessions.
Deterministic context compression for MCP agents. One Rust binary. Zero ML at query time. Reproducible benchmarks, real-data measurements.
Sophon is a deterministic context layer for agents speaking the Model Context Protocol. It compresses prompts, conversation memory, code digests, file deltas, and shell output — without an embedding model at query time, without a GPU, and without API keys.
Single 5.2 MB Rust binary. MCP-native. cl100k_base-accurate. Default build pulls no Python, no ML weights, no network.
| Tool | What it solves |
|---|---|
| compress_prompt | Long structured prompt → keep only sections relevant to the query |
| compress_history | Growing conversation → summary + facts + recent window + optional retrieval |
| compress_output | Shell stdout/stderr → 21 domain-aware filters (git, cargo, docker, kubectl, JSON, …) |
| read_file_delta / write_file_delta | Re-reads + edits → diffs only, never the whole file |
| encode_fragments | Repeated boilerplate → single token reference |
| update_memory | Append turn → JSONL persist + incremental rolling summary |
| navigate_codebase | Repo digest with tree-sitter / regex + PageRank, ranked by query |
11 MCP tools total (full table below).
We built four independent benches, each capturing a different slice of an agent's tool traffic. All four run against this repo's actual git history and working tree on the operator's machine, and anyone with cargo build --release can reproduce them byte-for-byte.
| Dimension | What it measures | Saved | Bench |
|---|---|---|---|
| history | compress_history over real commits | 94.6 % | real_session_capture.py |
| shell | compress_output on real git/cargo/gh/ls stdout | 84.4 % | real_session_shell.py |
| filereads | compress_prompt on real Rust / Python / Markdown / TOML files | 71.7 % | real_session_filereads.py |
| search | compress_output on real grep/find patterns | 79.5 % | real_session_search.py |
| 🎯 Weighted blend (35/30/20/15) | typical agent session estimate | 84.7 % | real_session_holistic.py |
real_session_holistic.py runs all four sub-benches with --json, parses them, and produces the weighted blend. Default weights reflect this repo's observed shape; pass --weights "history=0.4,..." to model your own workload.
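The blend is plain arithmetic over the four per-dimension figures, so it can be checked by hand. The sketch below recomputes it from the table values; the function name and the normalisation by the sum of the weights are assumptions for illustration, not a copy of what real_session_holistic.py does internally.

```python
# Per-dimension savings (percent), copied from the table above.
SAVED = {"history": 94.6, "shell": 84.4, "filereads": 71.7, "search": 79.5}

# Default 35/30/20/15 weights from the table.
DEFAULT_WEIGHTS = {"history": 0.35, "shell": 0.30, "filereads": 0.20, "search": 0.15}


def weighted_blend(saved, weights):
    """Linear weighted average of per-dimension savings.

    Dividing by the weight total is an assumption of this sketch, so that
    custom weights which do not sum to 1 still give a percentage.
    """
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(saved[name] * weight for name, weight in weights.items()) / total


blend = weighted_blend(SAVED, DEFAULT_WEIGHTS)
print(f"{blend:.1f} %")  # 84.7 %
```

Swapping in your own weights (for example a session dominated by shell output) shows how sensitive the headline number is to workload shape.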
| Saved per session | |
|---|---|
| Naive input pricing ($15/MT) | $2.03 |
| With prompt caching (25-turn reads at $1.50/MT) | $3.24 |
Pass --model sonnet or --model haiku to real_session_deep_dive.py if you're re-pricing for a cheaper tier.
- history: … git captures (commits + diffs), typically ~5-10 % of a real session's tool traffic. The 94.6 % is the upper bound, not the typical case.
- shell: … git diff 95 %) with commands that don't (gh repo view --json adds tokens, −9 %). 84.4 % is a real-world average, not a curated highlight.
- filereads: compress_prompt on raw source files compresses by budget cap, not by query routing: the same file with 3 different queries → identical output. Section detection only fires on structured input (Markdown headers, XML tags). Documented inline in the bench.
- search: … grep TODO.

The blended 84.7 % is napkin math from a linear weighted average across four real measurements. Not a cherry-picked synthetic. Run the benches yourself to verify.
| Test | Result | Bench |
|---|---|---|
| compress_output across 18 command families | 90.1 % weighted aggregate | compress_output_per_command.py |
| 25-turn synthetic Claude Code session | 68.1 % session tokens saved | session_token_economics.py |
| compress_prompt across 22 prompt shapes | 70.2 % mean, 36 ms mean latency | prompt_compression_extended.py |
| Code retrieval on "where is X?" questions | recall@3 = 70 % (vs grep 10 %, FULL 20 %) | repo_qa.py |
| vs LLMLingua-2 on structured prompts | +8.9 pt accuracy at 35× lower latency | llmlingua_compare.py |
| Sophon + Anthropic prompt caching | +24 % tokens / +49 % $ on top of caching | sophon_plus_prompt_caching.py |
| Sophon + mem0 | Additional savings on retrieved memories | sophon_plus_mem0.py |
Sophon is not a memory platform, a recall system, an OCR stack, or a replacement for provider-side caching. It's a deterministic compressor that slots in front of whatever memory / cache / code-nav layer you already use, and attacks the tokens those layers can't.
Provider caching handles the static half of a request — system prompt, tool definitions, reused documents. It doesn't touch the dynamic half (growing conversation history, tool outputs). Sophon compresses exactly that half. The two stack cleanly.
+24 % tokens / +49 % $ saved on top of prompt caching on a 25-turn Claude session, because the uncached dynamic block is billed at 10× the cached rate. See sophon_plus_prompt_caching.py.
Memory systems retrieve the right memories. Sophon shrinks what gets sent to the LLM after retrieval. If mem0 returns 2 kB of raw memories, compress_prompt keeps only the sections the query actually references.
Honest caveat: on very short retrieved blocks (< ~200 tokens) Sophon's wrapper adds overhead and you should pass through. The bench reports this directly.
Primary use case. Every repeat file read becomes a read_file_delta; every shell command output goes through compress_output; every repeated boilerplate block gets a fragment_cache token. Install transparently with sophon hook install --agent claude --global.
navigate_codebase produces a PageRanked repo digest that a RAG retriever would otherwise spend expensive embedding calls to build. Tree-sitter / regex symbol extraction over 11 languages, sub-second.
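To make the PageRank part concrete, here is a minimal power-iteration sketch over a toy symbol-reference graph. It is plain, unpersonalised PageRank; the graph, the damping factor, and the iteration count are illustrative assumptions, and Sophon's actual query-aware ranking may differ.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict of node -> list of referenced nodes."""
    nodes = sorted(set(graph) | {t for targets in graph.values() for t in targets})
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes with no outgoing references is spread evenly.
        dangling = sum(rank[node] for node in nodes if not graph.get(node))
        nxt = {node: (1.0 - damping) / n + damping * dangling / n for node in nodes}
        for src, targets in graph.items():
            if not targets:
                continue
            share = damping * rank[src] / len(targets)
            for target in targets:
                nxt[target] += share
        rank = nxt
    return rank


# Hypothetical symbol graph: an edge means "this symbol references that one".
symbols = {"main": ["parse", "compress"], "parse": ["compress"], "compress": []}
ranks = pagerank(symbols)
for name in sorted(ranks, key=ranks.get, reverse=True):
    print(name, round(ranks[name], 3))
```

Symbols that many others reference float to the top, which is why a ranked digest can surface the central definitions of a repo without any embedding calls.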
npm install -g mcp-sophon
sophon doctor # verify install + show config
The postinstall script downloads the right prebuilt binary for your platform from the GitHub Releases page. Supported: macOS arm64/x64, Linux arm64/x64, Windows x64.
git clone https://github.com/lacausecrypto/mcp-sophon
cd mcp-sophon/sophon
cargo build --release -p mcp-integration # ~5.2 MB binary
Optional features:
# 11-language tree-sitter AST extraction (~25 MB):
cargo build --release -p mcp-integration --features codebase-navigator/tree-sitter
# BGE-small semantic embedder (~34 MB), activate with SOPHON_EMBEDDER=bge:
cargo build --release -p mcp-integration --features bge
# All features (~42 MB):
cargo build --release -p mcp-integration --features "codebase-navigator/tree-sitter,bge"
Requires Rust 1.75+.
Most clients accept this snippet (Claude Desktop, Claude Code, Cursor, Cline, Continue):
{
  "mcpServers": {
    "sophon": {
      "command": "sophon",
      "args": ["serve"]
    }
  }
}
Run sophon doctor to print the right config path for your client.
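If your client config already lists other servers, the entry has to be merged rather than pasted over the file. A minimal sketch of that merge follows; the helper name and the example "other" server are hypothetical, and the config file location is deliberately left out because it differs per client.

```python
import json


def add_sophon_server(config: dict) -> dict:
    """Return a copy of an MCP client config with the sophon entry added.

    Existing servers are preserved; the input dict is not mutated.
    """
    merged = dict(config)
    servers = dict(merged.get("mcpServers", {}))
    servers["sophon"] = {"command": "sophon", "args": ["serve"]}
    merged["mcpServers"] = servers
    return merged


# Hypothetical existing config with one unrelated server.
existing = {"mcpServers": {"other": {"command": "other-server", "args": []}}}
updated = add_sophon_server(existing)
print(json.dumps(updated, indent=2))
```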
# Persistent memory + on-disk retriever store + BM25+Hash hybrid
export SOPHON_MEMORY_PATH=~/.sophon/memory.jsonl
export SOPHON_RETRIEVER_PATH=~/.sophon/retriever
export SOPHON_HYBRID=1
sophon serve
sophon exec -- cargo test # run + compress combined output
sophon compress-prompt --prompt ./system.txt --query "rust errors" --max-tokens 500
sophon hook install --agent claude --global # transparent Claude Code integration
sophon stats --period session # token savings rollup
echo '{"jsonrpc":"2.0","id":1,"method":"tools/call","params":{"name":"compress_prompt","arguments":{"prompt":"<rust>?: operator</rust><web>fetch()</web>","query":"rust errors","max_tokens":500}}}' \
| sophon serve
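Hand-escaping JSON inside a shell string gets fragile once prompts contain quotes or newlines. The sketch below builds the same tools/call request programmatically; the helper name is hypothetical, and the payload shape simply mirrors the one-liner above.

```python
import json


def tools_call(request_id, tool, arguments):
    """Serialise one JSON-RPC 2.0 tools/call request as a single line."""
    request = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }
    # One compact line per message, terminated by a newline for stdio transport.
    return json.dumps(request, separators=(",", ":")) + "\n"


line = tools_call(
    1,
    "compress_prompt",
    {
        "prompt": "<rust>?: operator</rust><web>fetch()</web>",
        "query": "rust errors",
        "max_tokens": 500,
    },
)
print(line, end="")
```

The printed line can be piped into sophon serve the same way as the echo command.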
11 MCP tools, all stdio:
| Tool | What it does |
|---|---|
| compress_prompt | Keep query-relevant sections of a long prompt |
| compress_history | Summary + facts + recent + optional retrieval over the conversation |
| compress_output | Strip noise from command stdout/stderr (21 domain filters + JsonStructural) |
| navigate_codebase | tree-sitter / regex digest of a repo, PageRanked by query |
| update_memory | Append messages, JSONL persist, optional rolling summary |
| read_file_delta | Version/hash-aware file read, unchanged → minimal payload |
| write_file_delta | Send edits as diffs, not full files |
| encode_fragments / decode_fragments | Detect repeated boilerplate, swap with tokens |
| count_tokens | cl100k_base-accurate token count |
| get_token_stats | Session-level savings rollup |
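The idea behind a hash-aware delta read can be sketched in a few lines: if the caller's known hash matches the current content, return almost nothing; otherwise return a diff instead of the whole file. The function and the "status" / "hash" / "diff" fields below are hypothetical and only illustrate the concept, not Sophon's actual response format.

```python
import difflib
import hashlib
from typing import Optional


def read_delta(new_text: str, known_hash: Optional[str] = None,
               old_text: Optional[str] = None) -> dict:
    """Return a minimal payload when content is unchanged, else a diff or full text."""
    new_hash = hashlib.sha256(new_text.encode("utf-8")).hexdigest()
    if known_hash == new_hash:
        # Caller already holds this exact version: send only the hash.
        return {"status": "unchanged", "hash": new_hash}
    if old_text is None:
        # No previous version to diff against: fall back to the full content.
        return {"status": "full", "hash": new_hash, "content": new_text}
    diff = "".join(
        difflib.unified_diff(
            old_text.splitlines(keepends=True),
            new_text.splitlines(keepends=True),
            fromfile="previous",
            tofile="current",
        )
    )
    return {"status": "delta", "hash": new_hash, "diff": diff}


first = read_delta("a\nb\n")
again = read_delta("a\nb\n", known_hash=first["hash"])
changed = read_delta("a\nc\n", known_hash=first["hash"], old_text="a\nb\n")
print(again["status"], changed["status"])
```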
Binary sizes by feature set:
| Build | Size |
|---|---|
| Default (regex extractors, HashEmbedder) | 5.2 MB |
| + tree-sitter (11 languages) | ~25 MB |
| + BGE semantic embedder | ~34 MB |
| All features | ~42 MB |
MCP protocol: 2025-06-18. notifications/cancelled actually drops the response (since v0.5.4). Structured JSON-RPC error codes (-32000..-32099 reserved for Sophon). Infallible dispatcher — a malformed request can't kill the stdio loop.
Run sophon doctor to see every SOPHON_* env var currently set with validation warnings. Full catalogue (24 flags) lives in runtime_flags.rs. The flags worth knowing:
| Flag | Effect | Cost |
|---|---|---|
| SOPHON_RETRIEVER_PATH=/dir | Activate the semantic retriever (chunk store on disk) | ~0 |
| SOPHON_MEMORY_PATH=/file.jsonl | Persistent conversation memory across sophon serve runs | ~0 |
| SOPHON_HYBRID=1 | BM25 sparse-lexical + HashEmbedder fused via RRF | ~1 ms |
| SOPHON_ROLLING_SUMMARY=1 | Build rolling summary at update_memory time, not at query time | LLM call moved to ingest |
| SOPHON_CHUNK_TARGET=500 | Bigger chunks preserve cross-sentence context | ~0 |
| SOPHON_EMBEDDER=bge | Swap HashEmbedder for BGE-small (needs --features bge) | model load at startup |
| SOPHON_LLM_CMD="claude -p --model haiku" | LLM shell-out command (used by summarizer when configured) | per-call subprocess |
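For readers unfamiliar with RRF (reciprocal rank fusion, the technique SOPHON_HYBRID names), here is a minimal sketch of how two ranked lists are fused. The constant k=60 is the value commonly used in the literature and is an assumption here; the chunk ids and the tie-breaking rule are illustrative, not taken from Sophon's source.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse several ranked lists: each item scores sum(1 / (k + rank))."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical chunk ids as ranked by a lexical and an embedding retriever.
bm25_ranking = ["c1", "c2", "c3"]
hash_ranking = ["c3", "c1", "c4"]
fused = rrf_fuse([bm25_ranking, hash_ranking])
print(fused)  # ['c1', 'c3', 'c2', 'c4']
```

Because RRF only looks at ranks, not raw scores, it needs no calibration between the BM25 and embedding score scales, which fits a deterministic, low-latency pipeline.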
Deprecated v0.4.0 recall-chasing flags (SOPHON_HYDE, SOPHON_FACT_CARDS, SOPHON_ENTITY_GRAPH, SOPHON_ADAPTIVE, SOPHON_LLM_RERANK, SOPHON_TAIL_SUMMARY, SOPHON_REACT, SOPHON_GRAPH_MEMORY, SOPHON_MULTIHOP_LLM) chase LOCOMO recall, an axis we no longer optimise. They still work, but sophon doctor flags them, and they will be removed in a future major release.
The full list lives in BENCHMARK.md § 8. Headlines:
- … SOPHON_EMBEDDER=bge) for semantic recall, which costs +25 MB binary + model load.
- … SOPHON_LLM_CMD set.
- … gh repo view --json adds tokens, git log --oneline saves 0.4 %. Sophon's job isn't to compress already-compact output; it's to compress redundant verbose output. The benches name the gaps explicitly.

.
├── README.md ← you are here
├── BENCHMARK.md ← full per-section benchmark detail
├── CHANGELOG.md ← version history + deprecated numbers
├── benchmarks/ ← reproducible scripts for every number above
├── npm/ ← npm wrapper package
└── sophon/crates/ ← 11-crate Rust workspace
├── prompt-compressor/ compress_prompt
├── memory-manager/ compress_history, update_memory, rolling summary
├── delta-streamer/ read/write_file_delta
├── fragment-cache/ encode/decode_fragments
├── semantic-retriever/ chunker + HashEmbedder + BM25 + entity graph
├── output-compressor/ 21 command-aware filters + JsonStructural
├── codebase-navigator/ tree-sitter / regex + PageRank
├── cli-hooks/ transparent agent installer
└── mcp-integration/ stdio server, async dispatch, cancellation
PRs welcome. Run the test suite:
cd sophon && cargo test --workspace --lib --tests --exclude prompt-compressor # 405 tests
cd sophon && cargo test --features codebase-navigator/tree-sitter # +AST tests
cd sophon-py && .venv/bin/pytest tests/ # 4 Python tests
Every benchmark claim is reproducible — pointers to the scripts live in BENCHMARK.md. If a number doesn't reproduce on your machine, open an issue.
Particularly welcome:
- … sophon-py/)
- gh family filter (gh run list, gh pr list, gh repo view --json): the bench shows this is currently a gap
- SOPHON_EMBEDDER_CMD shell-out plugin pattern (mirror of SOPHON_LLM_CMD) for Voyage / OpenAI / Cohere
- real_session_holistic.py runs against popular open-source repos

MIT. See LICENSE.