If you're burning thousands of tokens on tool outputs, logs, or RAG results before they even reach the LLM, this MCP server gives you compress and retrieve operations that claim 60 to 95 percent savings while preserving answer quality. It routes JSON through SmartCrusher, code through an AST compressor, and prose through a Kompress model, then caches originals locally so the LLM can pull them back with headroom_retrieve if needed. The same library also ships as a proxy and agent wrapper for Claude, Cursor, and Aider. Benchmarks on GSM8K and TruthfulQA show no accuracy drop. You'd reach for this when context windows fill up faster than your budget allows or when you want the same agent behavior at a fraction of the API cost.
██╗ ██╗███████╗ █████╗ ██████╗ ██████╗ ██████╗ ██████╗ ███╗ ███╗
██║ ██║██╔════╝██╔══██╗██╔══██╗██╔══██╗██╔═══██╗██╔═══██╗████╗ ████║
███████║█████╗ ███████║██║ ██║██████╔╝██║ ██║██║ ██║██╔████╔██║
██╔══██║██╔══╝ ██╔══██║██║ ██║██╔══██╗██║ ██║██║ ██║██║╚██╔╝██║
██║ ██║███████╗██║ ██║██████╔╝██║ ██║╚██████╔╝╚██████╔╝██║ ╚═╝ ██║
╚═╝ ╚═╝╚══════╝╚═╝ ╚═╝╚═════╝ ╚═╝ ╚═╝ ╚═════╝ ╚═════╝ ╚═╝ ╚═╝
The context compression layer for AI agents
60–95% fewer tokens · library · proxy · MCP · 6 algorithms · local-first · reversible
AI agents / LLMs: read /llms.txt here, or fetch the live index / full docs blob.
Headroom compresses everything your AI agent reads — tool outputs, logs, RAG chunks, files, and conversation history — before it reaches the LLM. Same answers, fraction of the tokens.
Live: 10,144 → 1,260 tokens — same FATAL found.
- compress(messages) in Python or TypeScript, inline in any app
- headroom proxy --port 8787, zero code changes, any language
- headroom wrap claude|codex|cursor|aider|copilot|opencode in one command
- headroom_compress, headroom_retrieve, headroom_stats for any MCP client
- headroom learn: mines failed sessions, writes corrections to CLAUDE.md / AGENTS.md

Your agent / app
(Claude Code, Cursor, Codex, LangChain, Agno, Strands, your own code…)
│ prompts · tool outputs · logs · RAG results · files
▼
┌────────────────────────────────────────────────────┐
│ Headroom (runs locally — your data stays here) │
│ ──────────────────────────────────────────────── │
│ CacheAligner → ContentRouter → CCR │
│ ├─ SmartCrusher (JSON) │
│ ├─ CodeCompressor (AST) │
│ └─ Kompress-base (text, HF) │
│ │
│ Cross-agent memory · headroom learn · MCP │
└────────────────────────────────────────────────────┘
│ compressed prompt + retrieval tool
▼
LLM provider (Anthropic · OpenAI · Bedrock · …)
Originals are cached locally; the LLM calls headroom_retrieve if it needs them.
→ Architecture · CCR reversible compression · Kompress-v2-base model card
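The diagram above routes content by type and keeps originals retrievable. As a rough illustration of those two ideas only, here is a minimal sketch. Every name in it (`classify`, `OriginalStore`, `compress_reversibly`) is hypothetical, the "compression" is a plain prefix truncation, and none of it reflects how SmartCrusher, CodeCompressor, Kompress, or CCR are actually implemented.

```python
import ast
import hashlib
import json
from typing import Optional, Tuple


def classify(content: str) -> str:
    """Guess whether content is JSON, Python source, or plain text."""
    try:
        json.loads(content)
        return "json"
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        pass
    try:
        tree = ast.parse(content)
    except (SyntaxError, ValueError):
        return "text"
    # Require a definition or import so a bare word is not mistaken for code.
    code_nodes = (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef,
                  ast.Import, ast.ImportFrom)
    if any(isinstance(node, code_nodes) for node in tree.body):
        return "code"
    return "text"


class OriginalStore:
    """In-memory stand-in for a local cache of uncompressed originals."""

    def __init__(self) -> None:
        self._originals: dict = {}

    def put(self, content: str) -> str:
        # A short content hash serves as the retrieval key.
        key = hashlib.sha256(content.encode("utf-8")).hexdigest()[:12]
        self._originals[key] = content
        return key

    def get(self, key: str) -> str:
        return self._originals[key]


def compress_reversibly(
    content: str, store: OriginalStore, keep: int = 40
) -> Tuple[str, Optional[str]]:
    """Placeholder compression: keep a prefix and a key to fetch the rest."""
    if len(content) <= keep:
        return content, None
    key = store.put(content)
    return f"{content[:keep]}... [original: {key}]", key


store = OriginalStore()
log = "FATAL: disk full on node-7. " * 10
short, key = compress_reversibly(log, store)
print(classify(log), len(log), "->", len(short))
print(store.get(key) == log)
```

The point of the sketch is the shape of the contract: the shortened text carries a key, and the key is enough to get the exact original back from local storage.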
# 1 — Install
pip install "headroom-ai[all]" # Python
npm install headroom-ai # Node / TypeScript
# 2 — Pick your mode
headroom wrap claude # wrap a coding agent
headroom proxy --port 8787 # drop-in proxy, zero code changes
# or: from headroom import compress # inline library
# 3 — See the savings
headroom perf
headroom dashboard # live savings dashboard (proxy must be running)
Granular extras: [proxy], [mcp], [ml], [code], [memory], [relevance], [image], [agno], [langchain], [evals], [pytorch-mps] (Apple-GPU memory-embedder offload — set HEADROOM_EMBEDDER_RUNTIME=pytorch_mps). Requires Python 3.10+.
Savings on real agent workloads:
| Workload | Before | After | Savings |
|---|---|---|---|
| Code search (100 results) | 17,765 | 1,408 | 92% |
| SRE incident debugging | 65,694 | 5,118 | 92% |
| GitHub issue triage | 54,174 | 14,761 | 73% |
| Codebase exploration | 78,502 | 41,254 | 47% |
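The Savings column follows directly from the Before and After token counts. A minimal sketch that recomputes it from the table (the helper name `savings_percent` is illustrative and not part of Headroom):

```python
def savings_percent(before: int, after: int) -> int:
    """Return the share of tokens removed, as a whole-number percentage."""
    if before <= 0:
        raise ValueError("before must be a positive token count")
    return round((1 - after / before) * 100)


# Token counts copied from the table above.
workloads = {
    "Code search (100 results)": (17_765, 1_408),
    "SRE incident debugging": (65_694, 5_118),
    "GitHub issue triage": (54_174, 14_761),
    "Codebase exploration": (78_502, 41_254),
}

for name, (before, after) in workloads.items():
    print(f"{name}: {savings_percent(before, after)}%")
```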
Accuracy preserved on standard benchmarks:
| Benchmark | Category | N | Baseline | Headroom | Delta |
|---|---|---|---|---|---|
| GSM8K | Math | 100 | 0.870 | 0.870 | ±0.000 |
| TruthfulQA | Factual | 100 | 0.530 | 0.560 | +0.030 |
| SQuAD v2 | QA | 100 | — | 97% | 19% compression |
| BFCL | Tools | 100 | — | 97% | 32% compression |
Reproduce: python -m headroom.evals suite --tier 1 · Full benchmarks & methodology
Everything above shrinks the prompt you send. But you also pay for every token the model writes back — and on Opus-class models output costs 5× input. A lot of that output is waste: "Great, let me…" preambles, re-printing code you just showed it, and deep "thinking" on routine steps like reading a file.
Headroom can trim that too, from the proxy, without you changing any code:
Turn it on:
export HEADROOM_OUTPUT_SHAPER=1 # off by default
headroom proxy --port 8787
Already running a proxy? These switches are read live on every request. A proxy that headroom wrap reused (rather than started) would otherwise not see a value you export afterwards, because its environment was snapshotted at launch. headroom wrap now hot-syncs your current settings to the running proxy via a loopback POST /admin/runtime-env, so they take effect immediately with no restart (no cold start, no dropped requests, no lost caches). Set them before you wrap. On a shared proxy these overrides are global: the last explicit setting wins.
Learn the right terseness for you. People don't say how terse they want
answers — they show it (they interrupt long replies, or move on before they
could have read them). headroom learn --verbosity reads your past sessions and
picks the level automatically:
headroom learn --verbosity # preview what it found (dry run)
headroom learn --verbosity --apply # save it; the proxy uses it from now on
See how many output tokens you saved. Output savings are counterfactual — we never see what the model would have written — so Headroom reports an honest estimate with a confidence range, never a made-up number:
headroom output-savings
# Reduction: 31.7% (95% CI 27.7% … 35.7%) [estimated]
Want a measured number instead of an estimate? Leave 10% of conversations
unshaped as a control group: export HEADROOM_OUTPUT_HOLDOUT=0.1. The dashboard
shows an Output Tokens Saved card next to input compression, labelled
measured or estimated with the confidence band.
→ Full write-up incl. the measurement methodology: docs/proposals/output-token-reduction.md
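To make the control-group idea concrete, here is a minimal sketch of a deterministic holdout and the resulting measured reduction. It is an illustration under stated assumptions, not Headroom's method: the function names, the hash-based assignment, and the simple ratio of means are all hypothetical, and it produces only a point estimate with no confidence band.

```python
import hashlib
from statistics import mean


def in_holdout(conversation_id: str, rate: float) -> bool:
    """Deterministically place roughly `rate` of conversations in the control group."""
    digest = hashlib.sha256(conversation_id.encode("utf-8")).digest()
    # Treat the first 8 bytes as an integer in [0, 2**64) and compare it
    # against the rate scaled to the same range.
    bucket = int.from_bytes(digest[:8], "big")
    return bucket < rate * 2**64


def measured_reduction(control_tokens: list, shaped_tokens: list) -> float:
    """Relative drop in mean output tokens, shaped vs. unshaped control."""
    return 1 - mean(shaped_tokens) / mean(control_tokens)


# Made-up output token counts per conversation, for illustration only.
control = [100, 120, 80]   # unshaped conversations (the holdout)
shaped = [70, 60, 80]      # conversations with output shaping on
print(f"Reduction: {measured_reduction(control, shaped):.1%} [measured]")
```

Hashing the conversation ID keeps a conversation in the same group for its whole lifetime, which a per-request random draw would not.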
| Agent | headroom wrap | Notes |
|---|---|---|
| Claude Code | ✅ | --memory · --code-graph |
| Codex | ✅ | shares memory with Claude |
| Cursor | ✅ | prints config — paste once |
| Aider | ✅ | starts proxy + launches |
| Copilot CLI | ✅ | starts proxy + launches |
| OpenClaw | ✅ | installs as ContextEngine plugin |
| OpenCode | ✅ | injects config · starts proxy + launches |
| Cortex Code | ✅ | 60–65% savings · library mode |
Any OpenAI-compatible client works via headroom proxy. MCP-native: headroom mcp install.
Headroom can route GitHub Copilot CLI subscription traffic through the local proxy:
headroom copilot-auth login
headroom wrap copilot --subscription -- --model gpt-4o
This lets Headroom intercept OpenAI-compatible Copilot CLI requests and apply the same proxy compression pipeline before forwarding to GitHub Copilot's hosted API. The wrapper exchanges Headroom's reusable GitHub OAuth token for Copilot's short-lived API token and prints the upstream endpoint as COPILOT_PROVIDER_API_URL=... during launch.
headroom copilot-auth login stores a Headroom-specific Copilot OAuth token.
This avoids relying on generic GitHub or Copilot CLI tokens that can read
Copilot account metadata but may still be rejected by Copilot's token-exchange
endpoint.
For GitHub Enterprise Server or custom-domain Copilot deployments, set the deployment domain before launching:
export GITHUB_COPILOT_ENTERPRISE_DOMAIN=ghe.example.com
For GitHub.com Enterprise Cloud URLs such as
github.com/enterprises/your-enterprise, do not set an enterprise-domain
override. Headroom uses GitHub's normal token-exchange endpoint and the Copilot
API endpoint advertised for the signed-in account.
Platform support note: macOS auth reuse via Copilot CLI Keychain storage has been smoke-tested. Windows Credential Manager, Linux Secret Service / secret-tool, and Docker/CI token-injection paths are implemented or planned as auth-discovery paths, but still need real OS validation before they should be considered fully vetted. For Docker and CI, prefer passing an explicit GITHUB_COPILOT_TOKEN or GITHUB_COPILOT_GITHUB_TOKEN rather than relying on host keychain access.
Great fit if you…
Skip it if you…
| Your setup | Hook in with |
|---|---|
| Any Python app | compress(messages, model=…) |
| Any TypeScript app | await compress(messages, { model }) |
| Anthropic / OpenAI SDK | withHeadroom(new Anthropic()) · withHeadroom(new OpenAI()) |
| Vercel AI SDK | wrapLanguageModel({ model, middleware: headroomMiddleware() }) |
| LiteLLM | litellm.callbacks = [HeadroomCallback()] |
| LangChain | HeadroomChatModel(your_llm) |
| Agno | HeadroomAgnoModel(your_model) |
| Strands | Strands guide |
| ASGI apps | app.add_middleware(CompressionMiddleware) |
| Multi-agent | SharedContext().put / .get |
| MCP clients | headroom mcp install |
headroom learn: plugin-based failure mining for Claude, Codex, Gemini.

Headroom exposes one stable request lifecycle across compress(), the SDK, and the proxy:
Setup → Pre-Start → Post-Start → Input Received → Input Cached → Input Routed → Input Compressed → Input Remembered → Pre-Send → Post-Send → Response Received
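One way to picture this lifecycle is as an ordered enumeration that observed events can be checked against. The sketch below is hypothetical: the stage names mirror the list above, but the `Stage` enum and `is_in_lifecycle_order` helper are invented for illustration and say nothing about how Headroom represents or emits these events.

```python
from enum import IntEnum


class Stage(IntEnum):
    """Lifecycle stages, numbered in the order listed above."""
    SETUP = 1
    PRE_START = 2
    POST_START = 3
    INPUT_RECEIVED = 4
    INPUT_CACHED = 5
    INPUT_ROUTED = 6
    INPUT_COMPRESSED = 7
    INPUT_REMEMBERED = 8
    PRE_SEND = 9
    POST_SEND = 10
    RESPONSE_RECEIVED = 11


def is_in_lifecycle_order(events: list) -> bool:
    """True if stages strictly advance; skipped stages are allowed."""
    return all(earlier < later for earlier, later in zip(events, events[1:]))


observed = [Stage.INPUT_RECEIVED, Stage.INPUT_ROUTED, Stage.INPUT_COMPRESSED, Stage.PRE_SEND]
print(is_in_lifecycle_order(observed))
```

A check like this is mostly useful in tests for custom hooks, where an out-of-order event usually signals a handler attached to the wrong stage.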
on_pipeline_event(...).

Provider and tool-specific behavior lives under headroom/providers/ so core orchestration stays focused on lifecycle, sequencing, and policy.
- headroom/providers/claude, copilot, codex, openclaw
- headroom/providers/claude, gemini, plus shared backend/runtime dispatch in headroom/providers/registry.py
- wrap.py, client.py, cli/proxy.py, and proxy/server.py delegate provider-specific env shaping, API target normalization, backend selection, and transport dispatch.

pip install "headroom-ai[all]" # Python, everything
npm install headroom-ai # TypeScript / Node
docker pull ghcr.io/chopratejas/headroom:latest
Granular extras: [proxy], [mcp], [ml] (Kompress-base), [code], [memory], [relevance], [image], [agno], [langchain], [evals], [pytorch-mps] (Apple-GPU memory-embedder offload — set HEADROOM_EMBEDDER_RUNTIME=pytorch_mps). Requires Python 3.10+.
Using pipx? Choose a supported interpreter explicitly:
pipx install --python python3.13 "headroom-ai[all]"
→ Installation guide — Docker tags, persistent service, PowerShell, devcontainers.
headroom update # detects pip / pipx / uv tool and upgrades in place
headroom update --check # report the latest release without upgrading
headroom update --pre # include pre-releases
headroom update figures out how Headroom was installed (pip/venv, pip --user,
pipx, uv tool) and runs the matching upgrade across macOS, Linux, and Windows.
For git checkouts, editable installs, Docker images, and externally-managed
system Pythons (PEP 668) it prints the correct manual step instead of guessing.
The proxy also shows a one-line "update available" notice on startup. It checks
PyPI at most once a day, in the background, and never blocks. Opt out with
HEADROOM_UPDATE_CHECK=off (also skipped in --stateless mode and CI).
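The rules in this paragraph (at most once a day, opt-out variable, skipped in stateless mode and CI) can be sketched as a single decision function. This is an assumption-laden illustration, not Headroom's code: the function name and parameters are hypothetical, and detecting CI through a `CI` environment variable is a common convention assumed here rather than something the source states.

```python
from typing import Mapping, Optional

ONE_DAY_SECONDS = 24 * 60 * 60


def should_check_for_update(
    env: Mapping[str, str],
    last_check: Optional[float],
    now: float,
    stateless: bool = False,
) -> bool:
    """Decide whether a background update check is due."""
    if env.get("HEADROOM_UPDATE_CHECK") == "off":
        return False  # explicit opt-out
    if stateless or env.get("CI"):
        return False  # assumption: CI is signalled by a non-empty CI variable
    if last_check is None:
        return True  # never checked before
    return now - last_check >= ONE_DAY_SECONDS


# Last check was two hours ago, so no new check is due yet.
print(should_check_for_update({}, last_check=0.0, now=2 * 60 * 60))
```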
If pip install "headroom-ai[all]" fails with CERTIFICATE_VERIFY_FAILED
(unable to get local issuer certificate), your network uses SSL inspection — a MITM
proxy presenting a company-issued CA. The build backend (maturin) downloads rustup over a
connection your TLS stack doesn't trust. Install Rust first so the build doesn't fetch it:
# macOS / Linux
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh && rustup default stable
# Windows
winget install Rustlang.Rustup && rustup default stable
Restart your shell, then pip install "headroom-ai[all]". A prebuilt wheel avoids the Rust
build entirely where available: pip install --only-binary headroom-ai headroom-ai.
Two runtime assets are fetched over TLS; if they are blocked, trust your corporate CA via
REQUESTS_CA_BUNDLE / SSL_CERT_FILE / CURL_CA_BUNDLE:
- cdn.pyke.io: the ONNX Runtime for the Rust core. Alternatively pre-provide it with ORT_STRATEGY=system and ORT_LIB_LOCATION=/path/to/onnxruntime.
- huggingface.co: the kompress-base compression model. Pre-download it and run with HF_HUB_OFFLINE=1, or set HF_ENDPOINT to a trusted mirror.

Running with compression disabled (pure gateway) requires neither asset.
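Before retrying a blocked download, it can help to confirm that the CA variables named above are set and point at real files. A small diagnostic sketch, assuming nothing about Headroom itself (the helper `ca_bundle_status` is hypothetical and only inspects the environment):

```python
import os
from typing import Dict, Mapping

CA_VARS = ("REQUESTS_CA_BUNDLE", "SSL_CERT_FILE", "CURL_CA_BUNDLE")


def ca_bundle_status(env: Mapping[str, str]) -> Dict[str, str]:
    """Report, per CA variable, whether it is unset, dangling, or usable."""
    status = {}
    for name in CA_VARS:
        path = env.get(name)
        if not path:
            status[name] = "unset"
        elif os.path.isfile(path):
            status[name] = "ok"
        else:
            status[name] = "missing file"
    return status


for name, state in ca_bundle_status(os.environ).items():
    print(f"{name}: {state}")
```

Each variable is read by a different part of the toolchain, so the sketch reports all three rather than picking a winner.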
headroom learn — mines failed sessions, writes corrections to CLAUDE.md / AGENTS.md / GEMINI.md.
| Start here | Go deeper |
|---|---|
| Quickstart | Architecture |
| Proxy | How compression works |
| MCP tools | CCR — reversible compression |
| Memory | Cache optimization |
| Failure learning | Benchmarks |
| Configuration | Limitations |
Headroom runs locally, covers every content type, works with every major framework, and is reversible.
| | Scope | Deploy | Local | Reversible |
|---|---|---|---|---|
| Headroom | All context — tools, RAG, logs, files, history | Proxy · library · middleware · MCP | Yes | Yes |
| RTK | CLI command outputs | CLI wrapper | Yes | No |
| lean-ctx | CLI commands, MCP tools, editor rules | CLI wrapper · MCP | Yes | No |
| Compresr, Token Co. | Text sent to their API | Hosted API call | No | No |
| OpenAI Compaction | Conversation history | Provider-native | No | No |
Attribution. Headroom ships with the excellent RTK binary for shell-output rewriting: git show --short, scoped ls, summarized installers. Huge thanks to the RTK team; their tool is a first-class part of our stack, and Headroom compresses everything downstream of it. Headroom can also use lean-ctx as the selected CLI context tool; set HEADROOM_CONTEXT_TOOL=lean-ctx before running headroom wrap ....
git clone https://github.com/chopratejas/headroom.git && cd headroom
uv sync --extra dev && uv run pytest
Devcontainers in .devcontainer/ (default + memory-stack with Qdrant & Neo4j). See CONTRIBUTING.md.
Apache 2.0 — see LICENSE.