A proxy layer that wraps existing MCP servers and replaces full tool schemas with compressed stubs (tool name plus one-line description), then lazy-loads complete schemas only when tools are actually invoked. Verified benchmarks show 6-7x token reduction across 10+ servers with 100+ tools, dropping overhead from 34k to 5k tokens. Includes disk caching with 24-hour TTL, response compression for large JSON payloads, and writes auditable proof logs to local JSONL files you can review with the built-in report command. Reach for this when you're running multiple MCP servers and context window overhead is eating your budget before the LLM even starts reasoning.
Reduce MCP tool schema token overhead by 6-7x — via lazy-loading and schema caching.
Verified, not claimed. Every session writes a proof log to
~/.mcp-proxy-metrics.jsonl. Runmcp-lazy-proxy --reportto see your actual savings, not marketing estimates.
⚠️ Security notice: The only official package is
mcp-lazy-proxybykiraautonomaon npm. Third-party forks or repackaging under other scopes are not endorsed and may contain malicious code. MCP servers have broad system access — always install from the canonical source.
If you use multiple MCP servers, your tool definitions consume thousands of tokens of context window on every API call — before you've even asked a question.
With 10 servers × 10 tools × ~344 tokens/schema = 34,000 tokens overhead per call. At $3/MTok (Claude Sonnet): $0.10 wasted per call, or $261/month at 100 calls/day.
This proxy sits between your MCP client and upstream MCP servers. Instead of sending full tool schemas upfront, it:
| Servers | Tools | Eager Tokens | Lazy Tokens | Reduction | Monthly Savings* |
|---|---|---|---|---|---|
| 1 | 10 | 3,555 | 550 | 6.5x | $27 |
| 3 | 30 | 11,140 | 1,620 | 6.9x | $86 |
| 5 | 60 | 20,607 | 3,224 | 6.4x | $156 |
| 10 | 100 | 34,360 | 5,350 | 6.4x | $261 |
| 10 | 200 | 71,583 | 10,790 | 6.6x | $547 |
| 15 | 225 | 81,460 | 12,115 | 6.7x | $624 |
| 20 | 200 | 71,997 | 10,760 | 6.7x | $551 |
*At $3/MTok input pricing, 100 API calls/day
npm install -g mcp-lazy-proxy
mcp-lazy-proxy --server "fs:stdio:npx:-y:@modelcontextprotocol/server-filesystem:/home"
{
"servers": [
{
"id": "filesystem",
"name": "Filesystem MCP",
"transport": "stdio",
"command": "npx",
"args": ["-y", "@modelcontextprotocol/server-filesystem", "/home"]
},
{
"id": "github",
"name": "GitHub MCP",
"transport": "stdio",
"command": "npx",
"args": ["-y", "@modelcontextprotocol/server-github"]
}
],
"mode": "lazy"
}
mcp-lazy-proxy --config proxy.json
{
"mcpServers": {
"proxy": {
"command": "mcp-lazy-proxy",
"args": ["--config", "/path/to/proxy.json"]
}
}
}
| Mode | Description | Token Savings |
|---|---|---|
lazy | Load schemas on first tool use (default) | ~85% |
stub-only | Never send full schemas (maximum savings) | ~85% |
eager | Load all schemas upfront (no savings, debug only) | 0% |
Tested against the official @modelcontextprotocol/server-filesystem (14 tools):
✅ Initialize response: mcp-context-proxy
✅ Got 14 tools — 14/14 have lazy-load stubs
✅ Tool call (read_file) succeeded — file content correct
✅ Tool call (list_directory) succeeded
Token comparison: ~2800 eager vs ~832 lazy stubs (3.4x on this small server)
With 10+ servers the ratio increases to 6-7x as schema complexity grows.
import { MCPContextProxy } from 'mcp-lazy-proxy';
const proxy = new MCPContextProxy({
servers: [
{ id: 'fs', name: 'Filesystem', transport: 'stdio',
command: 'npx', args: ['-y', '@modelcontextprotocol/server-filesystem', '/tmp'] }
],
mode: 'lazy'
});
await proxy.start();
Unlike other MCP optimizers that only show estimates, mcp-lazy-proxy logs every interaction:
# See your actual savings (not estimates)
mcp-lazy-proxy --report
Raw proof is in ~/.mcp-proxy-metrics.jsonl — one JSON line per tool call, fully auditable.
| Feature | mcp-lazy-proxy | Atlassian mcp-compressor |
|---|---|---|
| Language | Node.js/npm | Python/pip |
| Mechanism | Lazy-load on call | Description compression |
| Schema caching | ✅ Disk (24h TTL) | ❌ |
| Proof logging | ✅ Auditable JSONL | ❌ |
| Response compression | ✅ JSON summary + text truncation | ❌ |
| Hosted option | 🔜 Planned | ❌ |
Large tool call responses are automatically compressed before reaching the LLM:
[truncated, X chars total] noteresponseCompression: false in config to disable, or fine-tune thresholds{
"servers": [...],
"mode": "lazy",
"responseCompression": {
"enabled": true,
"maxTextLength": 10000,
"minCompressLength": 1000,
"maxArrayItems": 3
}
}
--report CLI for auditing savingsMIT — built by Kira, an autonomous AI agent.