This is a local memory layer for multi-agent setups that catches drift before agents act. It embeds raw prompts with BGE-m3 instead of extracting entities through an LLM, which keeps everything on your machine and lets it score the next action against past failure patterns. You get seven MCP tools including add_memory, recall, and add_worker for registering deterministic worker specs. The trade is real: it scores 56.6% on LongMemEval compared to white-box leaders in the 80s, but runs 14x cheaper and works across any LLM without sending data out. Built for Claude Desktop, Cline, Cursor, and Continue. Reach for it when you need cross-session coordination without an orchestrator or when agents keep forgetting rules you already set.
Reliability layer for multi-agent setups · keep multiple agents — or your own long-running sessions — coordinating reliably without an orchestrator. Cross-dialog contracts + drift detection + a 4-tier memory lifecycle schema (activation in progress). Plugin for Claude Code/Desktop · Cline · Cursor · Continue.dev · Zed.
When an agent drifts from a rule you set, takes a shortcut you flagged, or claims a prior agreement that never happened — compass catches it before the agent acts.
Why it holds up technically: the memory underneath is black-box — raw text embedded locally with BGE-m3, no LLM extraction step, no graph, no data leaving your machine (~14× cheaper to reproduce than white-box stacks like Mem0 / Letta / Cognee / Zep / MemOS). That same raw-prompt index is exactly what lets compass score the next action against your past mistakes — drift detection that white-box entity-graph memory structurally can't do. Full argument: paper/BLACKBOX_VS_WHITEBOX.md.
Built by Nautilus Platform · open agent ecosystem
compass's #1 job is multi-agent reliability without an orchestrator. The reason it can do that — and not be just another memory store — is its black-box memory core:
White-box memory layers (Mem0, Letta, Cognee, Zep, MemOS, smrti):
"I call an LLM to extract facts from your conversation,
then store them in a graph. Pay extraction tokens. Send
data to the provider."
Black-box memory (compass · this project):
"I embed raw text locally with BGE-m3. No extraction LLM.
No graph. No data leaving your machine. And because raw
prompts are still in the index, I can score the next
prompt against your past mistakes before the agent acts."
The trade is real: −30 points on LongMemEval-S vs white-box leaders that build entity graphs, in exchange for 14× cheaper reproduction, full local-deployment, cross-LLM portability, and drift detection that white-box systems can't offer. Full argument: paper/BLACKBOX_VS_WHITEBOX.md.
In one line: when the AI is about to forget a rule you set, take a shortcut you flagged, or fabricate a prior agreement, it gets stopped by its own history of failure patterns.
v2.1.0 unifies two development lines (daemon/reliability + lifecycle/PoI) onto a
single main and hardens the drift loop.
The old OR-vote firing (neg_cos ≥ 0.538) fired on 64.5% of events in 11.5k
records of real traffic — benign prompts with high anti-anchor cosine overlapped
genuine drift, so agents tuned out (act-on rate 9.87%). v2.1.0 makes firing
high-signal:
should_alert = rule_hit (danger-command regex) OR drift_score < −0.07
Production-measured fire rate 0.5% · danger commands (rm -rf / force push /
DROP / hardcoded key) always caught · the multi-signal drift/firing.py vote is
retained behind an env flag for A/B.
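The firing rule above can be sketched in a few lines. This is a minimal illustration, not the project's code: the regex patterns and the names `DANGER_PATTERNS`, `rule_hit`, and `should_alert` are hypothetical stand-ins, and only the −0.07 threshold and the OR structure come from the text.

```python
import re

# Hypothetical danger-command patterns (assumption: the real regex set is not shown here).
DANGER_PATTERNS = [
    re.compile(r"\brm\s+-(rf|fr)\b"),                        # rm -rf / rm -fr
    re.compile(r"\bgit\s+push\b.*(--force|\s-f\b)"),         # force push
    re.compile(r"\bDROP\s+(TABLE|DATABASE)\b", re.IGNORECASE),
    re.compile(r"""(api[_-]?key|secret)\s*=\s*['"][A-Za-z0-9_\-]{16,}['"]""",
               re.IGNORECASE),                               # hardcoded key
]

DRIFT_THRESHOLD = -0.07  # threshold quoted in the firing rule above


def rule_hit(text: str) -> bool:
    """True when any danger-command pattern matches the text."""
    return any(p.search(text) for p in DANGER_PATTERNS)


def should_alert(text: str, drift_score: float) -> bool:
    """Fire on a rule hit OR on a drift score strictly below the threshold."""
    return rule_hit(text) or drift_score < DRIFT_THRESHOLD
```

With this shape, a danger command fires regardless of its drift score, while an ordinary prompt fires only when its score drops below the threshold.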
inbound_/outbound_ handoff files
Activation status (honest): the L3 lifecycle machinery (tier promotion, forget_at archival, the promotion driver) is shipped and unit-tested, but the production recall path does not yet promote tiers or apply forget_at at query time (query ranking currently uses file-age archived_at decay + an importance gate). PoI emission requires cross-agent outcome events, which depend on the L4 data pipeline now being wired. Treat the lifecycle below as a schema + tested functions, with production activation + validation in progress.
bounded handler pool · in-flight semaphore (CLOSE_WAIT cure) · server-side recall cache · pkl warmup (cold-start CPU cure) · BM25 + vector RRF fusion (opt-in) · inotify cache invalidation.
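For readers unfamiliar with the BM25 + vector RRF fusion mentioned above, here is a minimal sketch of Reciprocal Rank Fusion. The function name `rrf_fuse`, the memory ids, and the constant `k=60` (a commonly used default) are assumptions for illustration; the project's own fusion code may differ.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank_of_d)."""
    scores = defaultdict(float)
    for ranked in rankings:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical ranked lists from a keyword index and a vector index.
bm25_ranked = ["m3", "m1", "m7"]
vector_ranked = ["m1", "m9", "m3"]
fused = rrf_fuse([bm25_ranked, vector_ranked])
```

RRF only needs ranks, not comparable scores, which is why it suits combining BM25 and cosine-similarity lists whose raw scores live on different scales.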
v2.0.0 ships a deterministic lifecycle layer on top of the black-box memory base, fusing ideas from llm-wiki2 (Karpathy v2), agentmemory (LongMemEval-S 95.2% R@5), and GBrain (Garry Tan · MIT).
The bet: every other memory project (Mem0, Letta, Cognee, Zep, MemOS, llm-wiki2, agentmemory) calls an LLM at some lifecycle decision — ingest, promotion, consolidation, or forgetting. compass v2.0.0 makes them all schema-declared.
tier: working | episodic | semantic | procedural # 4 tiers verbatim from llm-wiki2
decay_rate: 0.5 # Ebbinghaus exponential decay
forget_at: 2026-06-01T00:00:00Z # null = never · soft-archive when reached
promote_after: "7d" | "5_access" # duration or access count
reinforce_count: 0 # access event counter
- reinforce_count >= promote_after → tier++
- reinforce_count++
- forget_at reached → soft-archive flag
- procedural (top tier) does not promote

Full design rationale in paper/LLM_WIKI2_FUSE_DESIGN.md; implementation at recall.py:708+.
The promotion rule above is implemented as promote_lifecycle_tier() and covered by tests/test_lifecycle_fuse.py, but is not yet invoked on the production recall path; see the activation-status note under L3 tier promotion above.
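To make the schema-declared lifecycle concrete, the following is a rough sketch of how the frontmatter fields could drive promotion and soft-archiving without an LLM call. It is not promote_lifecycle_tier() itself. The `MemoryRecord` class, `promotion_due`, and `on_access` are hypothetical names, and two behaviours are assumptions: durations are measured from creation time, and the counter is not reset after a promotion.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

TIERS = ["working", "episodic", "semantic", "procedural"]  # tier names from the schema


@dataclass
class MemoryRecord:  # hypothetical container for the frontmatter fields
    tier: str
    created_at: datetime
    promote_after: str                    # "7d" (duration) or "5_access" (access count)
    reinforce_count: int = 0
    forget_at: Optional[datetime] = None  # None = never forget
    archived: bool = False


def promotion_due(rec: MemoryRecord, now: datetime) -> bool:
    """Evaluate promote_after as either an access count or a duration in days."""
    rule = rec.promote_after
    if rule.endswith("_access"):
        return rec.reinforce_count >= int(rule[: -len("_access")])
    if rule.endswith("d"):
        # Assumption: the duration is measured from creation time.
        return now - rec.created_at >= timedelta(days=int(rule[:-1]))
    raise ValueError(f"unsupported promote_after: {rule!r}")


def on_access(rec: MemoryRecord, now: datetime) -> MemoryRecord:
    """Count the access, soft-archive if forget_at has passed, else maybe promote."""
    rec.reinforce_count += 1
    if rec.forget_at is not None and now >= rec.forget_at:
        rec.archived = True               # soft-archive flag, nothing is deleted
        return rec
    idx = TIERS.index(rec.tier)
    if idx < len(TIERS) - 1 and promotion_due(rec, now):  # top tier never promotes
        rec.tier = TIERS[idx + 1]
    return rec


now = datetime(2026, 1, 10, tzinfo=timezone.utc)
rec = MemoryRecord("working", datetime(2026, 1, 1, tzinfo=timezone.utc),
                   "5_access", reinforce_count=4)
on_access(rec, now)  # fifth access meets the "5_access" rule
```

Every decision here is a comparison against declared fields, which is the point of the bet described above: ingest, promotion, and forgetting stay deterministic and auditable.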
- stop_hook.py for Claude Code: SessionStart, UserPromptSubmit, PreToolUse, PostToolUse, PostToolUseFailure, PreCompact, SubagentStart/Stop, SessionEnd
- add_worker(spec) MCP tool: super-agents register deterministic worker specs (cron / pubsub / queue / http / custom) to .cache/workers.jsonl
- recall.py: combine BM25 + vector + KG ranked lists with session-diversified output (max 3 per session · agentmemory verbatim)
- npx nautilus-compass init: one-command workspace setup creating .compass/.env, sample anchors, and Claude Code hook templates

Frame borrowed from GBrain ("Garry's Opinionated OpenClaw/Hermes Agent Brain"). compass v2.0.0 takes a stance on what not to include:
- reinforce_count + promote_after)
- forget_at + counter only)
- paper/RESULTS_v0.8.md)

Active development on the v3-full-fusion branch · not in any release. Plan: ~2 work weeks · 8 Sprints · each Sprint has a prove-or-kill gate (statistical · SQL/eval · not agent self-assessment).
Default-off byte-equal promise: with no opt-in env set, v3.0 / v3.5
behavior is byte-equal to v2.0.1. Verified by
tests/test_llm_opt_in.py ·
the test_default_off_invariant_* family gates every PR into main.
- MEMORY_REPORT.md auto-gen (Sprint 2 · session-end hook · 4-tier distribution + cumulative_impact + drift summary)
- implementation_notes frontmatter (Sprint 2 · rationale + rejected: [{alt, why}])

| env var | tier | feature (Sprint) |
|---|---|---|
| COMPASS_USE_LLM_RESOLVE | 1 (session-end) | LLM contradiction resolution (Sprint 3) |
| COMPASS_USE_LLM_VERIFY | 4 (runtime) | anti-confabulation cite-or-refuse (Sprint 4) |
| COMPASS_USE_LLM_DRIFT_PAY | 4 (runtime) | drift × outcome anchor feedback (Sprint 5) |
| COMPASS_USE_LLM_REFLECT | 3 (periodic) | self-reflection semantic emit (Sprint 6) |
| COMPASS_USE_LLM_ECON | 4 (runtime) | memory-as-economy NAU budget (Sprint 7) |
Pattern mirrors the existing COMPASS_USE_GEMINI_FLASH opt-in
(judges/gemini_flash.py) — env truthy
(1/true/yes/on) activates · anything else disables. Registry: llm_opt_in.py.
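The opt-in convention above (truthy values activate, anything else disables) can be sketched as follows. The helper name `llm_opt_in` is hypothetical and is not the API of llm_opt_in.py; treating values case-insensitively and trimming whitespace are assumptions beyond what the text states.

```python
import os

TRUTHY = {"1", "true", "yes", "on"}  # the four truthy spellings listed above


def llm_opt_in(name, env=None):
    """Return True only when the env var holds one of the truthy spellings.

    Assumption: matching is case-insensitive and ignores surrounding whitespace.
    """
    env = os.environ if env is None else env
    return env.get(name, "").strip().lower() in TRUTHY


# Unset or unrecognised values leave the feature off, preserving the default-off promise.
verify_enabled = llm_opt_in("COMPASS_USE_LLM_VERIFY", {"COMPASS_USE_LLM_VERIFY": "on"})
resolve_enabled = llm_opt_in("COMPASS_USE_LLM_RESOLVE", {})
```

Defaulting to False on any unrecognised value means a typo in the environment disables a feature instead of silently enabling an LLM call.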
Per-Sprint gates are pre-registered. If a Sprint's gate metric does not
pass (e.g. Sprint 1: multi-hop +3pp on LongMemEval-S multi-session subset,
n=133), that Sprint stops · no further Sprints attempted · the
corresponding paper3 v2 novelty claim is removed. This protects against
post-hoc rationalization of negative results.
Across 28 hours on 2026-05-30 / 31, four Claude Code dialogs (compass / Soul / V5 / nautilus-core) ran concurrently on shared filesystem-mediated protocols. The recorded run includes:
- act_on_rate measured at 9.87% / 7d · 40.79% / 24h
- cnt_compass_soul_sub_a1 closing in 17.92h (vs 6d 21h budget · 5.8d slack)
- scripts/__init__.py was added (commit pushed in the same change as the case study)

The full field log, including 7 generalizable patterns for OSS multi-agent reliability, is at docs/case_study_4dialog_compass.md.
You told Claude at session start: "never claim deployment success without verification." Fifty prompts later Claude says "deployed successfully ✅" — without verifying. The memory rule was there; the AI forgot it under context pressure.
Persona Vectors (Anthropic, 2025) proved that LLM activations contain directions for sycophancy and hallucination. But that requires model weights — closed APIs (Claude, GPT-4) don't expose them. There has been no production black-box equivalent that runs in a Claude Code hook.
Mem0, Letta, claude-mem, Zep all compete on "recall the most relevant past memory." But memory recalled doesn't stop the AI from breaking the rule this time — that other half has been unsolved.
User prompt: "Fix bug X for me"
                   │
                   ▼
┌─────────────────────────────────────┐
│  UserPromptSubmit Hook (this plugin)│
└─────────────────────────────────────┘
                   │
      ┌────────────┼────────────┐
      ▼            ▼            ▼
   ┌────────┐ ┌─────────┐ ┌──────────┐
   │ recall │ │  drift  │ │ profile  │
   │ memory │ │  check  │ │ aggregate│
   └────────┘ └─────────┘ └──────────┘
                   │
                   ▼
Hooks inject results into Claude's system prompt:
  - Time-bucketed past memory (BGE-m3 semantic recall)
  - Drift score + nearest negative anchor (if score < threshold)
  - Profile facts ("you have 3 unfinished tasks in this repo")
                   │
                   ▼
Claude answers — with full context loaded
The drift detector compares each prompt against an anchor set (25 positive + 35 negative behavioral patterns drawn from real failure transcripts) using BGE-m3 cosine similarity. AUC 0.83 on held-out, 50ms p95 hook latency.
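As an illustration of anchor-based scoring, the sketch below compares a prompt embedding against positive and negative anchor embeddings with cosine similarity. The scoring formula (best positive match minus best negative match) is an assumption chosen for clarity, not the project's documented formula, and the 3-dimensional vectors are toy stand-ins for BGE-m3 embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def drift_score(query, pos_anchors, neg_anchors):
    """Assumed scoring: max positive cosine minus max negative cosine.

    Returns (score, index of the nearest negative anchor). A negative score
    means the prompt sits closer to a known failure pattern than to a good one.
    """
    pos = [cosine(query, a) for a in pos_anchors]
    neg = [cosine(query, a) for a in neg_anchors]
    nearest_neg = max(range(len(neg)), key=neg.__getitem__)
    return max(pos) - neg[nearest_neg], nearest_neg


# Toy embeddings (assumption: real anchors would be BGE-m3 vectors).
positive = [[1.0, 0.0, 0.0]]
negative = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
risky_score, risky_anchor = drift_score([0.1, 0.9, 0.0], positive, negative)
safe_score, _ = drift_score([1.0, 0.1, 0.0], positive, negative)
```

Returning the nearest negative anchor alongside the score mirrors the hook behaviour described above, where the injected context names the failure pattern the prompt resembles.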
Drift detection without ack instrumentation is an open loop · the detector fires alerts but nothing measures whether the agent (or user) actually acted on them. v3 closes this loop with a single rate metric.
The signal: every fired drift alert gets a stable alert_id and lands
in .cache/drift_mitigation_log.jsonl. When the user acknowledges the alert
via the feedback CLI
python ~/.claude/plugins/nautilus-compass/feedback.py log <alert_id> fp|tp
(fp = false positive · tp = true positive · either way the alert was
seen and judged), a matching kind: "ack" record is appended to the same
sidecar.
The metric: act_on_rate(window_hours) groups records by alert_id
within the window and reports the fraction of fired alerts that received at
least one ack. The legacy KPI script prints it alongside everything else:
python ~/.claude/plugins/nautilus-compass/audit_kpi.py
=== act-on rate (drift alert closure · target ≥0.70) ===
· 24h: fires=81 acked=1 rate=0.012
· 7d: fires=294 acked=1 rate=0.003
Target: ≥0.70 over rolling 7d. A rate below 0.30 indicates the agent is tuning
out alerts (cry-wolf · cf. the open-loop write-up); in that case, raise the
firing threshold (drift/firing.py:should_fire_drift) or recalibrate negative
anchors via feedback retrain. Programmatic API for CI / cron monitors:
from audit_kpi import act_on_rate
m = act_on_rate(window_hours=168)
assert m["rate"] >= 0.70, f"drift loop open · rate={m['rate']:.3f} fires={m['fires']}"
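For intuition about what the metric measures, here is a rough sketch of the grouping logic over already-parsed log records. It is not the code in audit_kpi.py. The function name `act_on_rate_from_records`, the `ts` epoch-seconds field, and the `"fire"` kind label are assumptions; only `alert_id`, `kind: "ack"`, and the `fires` / `acked` / `rate` outputs come from the text.

```python
import time


def act_on_rate_from_records(records, window_hours, now=None):
    """Fraction of alerts fired inside the window that received at least one ack."""
    now = time.time() if now is None else now
    cutoff = now - window_hours * 3600
    fired, acked = set(), set()
    for rec in records:
        if rec["ts"] < cutoff:          # assumed field: epoch seconds
            continue
        if rec.get("kind") == "ack":
            acked.add(rec["alert_id"])
        else:                           # assumed: any non-ack record is a fire
            fired.add(rec["alert_id"])
    closed = fired & acked              # several acks for one alert count once
    rate = len(closed) / len(fired) if fired else 0.0
    return {"fires": len(fired), "acked": len(closed), "rate": rate}


# Hypothetical sidecar contents: four recent fires, one ack, one stale fire.
NOW = 1_000_000.0
sample = [
    {"alert_id": "a1", "kind": "fire", "ts": NOW - 3600},
    {"alert_id": "a2", "kind": "fire", "ts": NOW - 7200},
    {"alert_id": "a3", "kind": "fire", "ts": NOW - 100},
    {"alert_id": "a4", "kind": "fire", "ts": NOW - 50},
    {"alert_id": "a1", "kind": "ack", "ts": NOW - 1800},
    {"alert_id": "old", "kind": "fire", "ts": NOW - 30 * 24 * 3600},
]
metric = act_on_rate_from_records(sample, window_hours=24, now=NOW)
```

Grouping by alert_id before dividing is what keeps repeated acknowledgements of the same alert from inflating the rate.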
| Benchmark | Score | Honest compare |
|---|---|---|
| LongMemEval-S (n=500) | 56.6% (locked at v0.8) | open-source 50–60% band · white-box leaders (OMEGA, Mem0g, ByteRover) report 90+% — that gap is an architectural ceiling for black-box, not a tuning gap. See BLACKBOX_VS_WHITEBOX. |
| EverMemBench-Dynamic (n=500) | 44.4% (Run 1) / 47.3% (Run 2) | tops the four published Table 4 baselines (Mem0 37.09, Zep 39.97, MemOS 42.55, MemoBase 34.27). Not "industry SOTA" — OMEGA / Mem0g haven't reported on EverMemBench publicly. |
| Drift detector AUC | 0.83 held-out / 0.92 in-set | only public memory layer that does drift detection at all — white-box systems abstract prompts into facts before drift becomes checkable |
| Reproduction cost | ~$3.50 for 500 LongMemEval questions | ~14× cheaper than GPT-4o-judged stacks ($50+) |
| p95 hook latency | <50 ms | safe for every-prompt invocation |
We deliberately report Run 1 (44.4%) as the abstract headline for
EverMemBench to avoid cherry-picking; the cross-run mean (45.84%) clears
MemOS by +3.3 pts. See paper/sections/paper2_06_5_evermembench.tex
for honest dual-run + Gemini cross-judge sensitivity analysis.
Try it without installing: live drift-detection + Merkle-integrity demo at huggingface.co/spaces/chunxiaox/nautilus-compass (CPU only · metadata-mode jaccard fallback · no signup needed).
Reproduce the numbers: evaluation dataset (behavioral anchors + labeled session traces for drift ROC + LongMemEval-S / EverMemBench scoring) is live on the Hugging Face Hub: huggingface.co/datasets/chunxiaox/nautilus-compass-test-data
from datasets import load_dataset
ds = load_dataset("chunxiaox/nautilus-compass-test-data")
git clone https://github.com/chunxiaoxx/nautilus-compass ~/.claude/plugins/nautilus-compass
bash ~/.claude/plugins/nautilus-compass/install.sh
# Start the BGE-m3 daemon (one-time per boot)
bash ~/.claude/plugins/nautilus-compass/daemon_start.sh
The installer wires three hooks into ~/.claude/settings.json:
- UserPromptSubmit → injects time-bucketed memory recall + drift
- PostToolUse → mid-session writer
- Stop → end-of-session summary writer

Five user-facing slash commands appear in Claude Code:
/compass-verify · /compass-drift · /compass-recall ·
/compass-search · /compass-status.
python ~/.claude/plugins/nautilus-compass/scripts/install_to_agent.py
Auto-detects Claude Desktop, Cursor, Cline, Continue.dev, Zed Editor and
patches their MCP config. See docs/AGENT_ONBOARDING.md
for per-agent copy-paste configs and docs/mcp-usage.md
for the raw protocol specification.
curl https://compass.nautilus.social/.well-known/agent.json
Returns the standard A2A discovery descriptor. Sign up at
compass.nautilus.social/signup for a hosted gateway with multi-user
sync, audit log, and managed BGE-m3 deployment.
| Tool | Purpose | Latency |
|---|---|---|
| ingest_obs(name, body, agent_id?) | Write observation with auto-anchor + drift signal | ~150 ms |
| recall(query, project?, top_k?) | BGE-m3 semantic + keyword search | ~200 ms |
| session_search(query, since?) | Time-bucketed session-log search | ~80 ms |
| profile(user_id?) | Work-profile aggregate (topics, agents, drift trend) | ~100 ms |
| drift_check(prompt, project?) | Black-box drift score against anchors | <50 ms |
| drift_history(since?, agent_id?) | Drift score timeline for trend audit | ~30 ms |
| feedback_log(direction, reason) | Log positive/negative anchor signal | <20 ms |
The MCP server speaks JSON-RPC 2.0 over stdio / TCP / TLS / mTLS.
Per-token RBAC, per-token rate limiting, notifications/{progress, cancelled, message}, logging/setLevel, and resources/* for session-log
streaming are all spec-complete.
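A tool invocation over this transport is an ordinary JSON-RPC 2.0 request using MCP's `tools/call` method. The sketch below only builds the request for drift_check; the helper name `tools_call_request` and the argument values are illustrative, and the response shape is left out because it is not documented here.

```python
import json


def tools_call_request(request_id, tool, arguments):
    """Build an MCP tools/call request as a JSON-RPC 2.0 message."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }


# Argument names follow the drift_check(prompt, project?) signature in the table;
# the values are made up for illustration.
request = tools_call_request(
    1, "drift_check", {"prompt": "deployed successfully", "project": "demo"}
)
wire_line = json.dumps(request)  # over stdio, one JSON message per line
```

In practice an MCP client library would send this after the initialize handshake; writing the message by hand is mainly useful for debugging a server from a terminal.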
| Capability | this | mem0 | Letta | Zep | claude-mem | MemOS | Smriti |
|---|---|---|---|---|---|---|---|
| Cross-agent memory | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | archive-only |
| MCP A2A protocol native | ✅ TLS+mTLS+RBAC | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| Drift detection | ✅ AUC 0.83 | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| Merkle integrity audit log | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| LongMemEval-S verified | ✅ 56.6% (locked) | n/r | n/r | n/r | ❌ | n/r | ❌ |
| EverMemBench verified | ✅ 44.4-47.3% | 37.09 | n/r | 39.97 | n/r | 42.55 | ❌ |
| Self-host + hosted both | ✅ | ☁ only | ✅ | ☁ only | ✅ | OSS only | OSS only |
| License | MIT | Apache | Apache | proprietary | MIT | Apache | MIT |
n/r = not reported in their published evaluations. Smriti is a team
conversation archive with git-based sharing — different scope from a
runtime memory layer, so most rows are intentionally out-of-scope rather
than missing features.
If you run the OSS plugin alongside a Nautilus-style task platform (or your own multi-agent backend), two MCP tools open a bidirectional channel without any new HTTP server:
| Tool | Direction | Purpose |
|---|---|---|
| submit_platform_task(name, channels, payload, anchor_pack_hint, priority) | compass dialog → platform | Push a task into the platform's queue. File-based by default (~/.claude/projects/_platform_queue/<id>.json); auto-promotes to HTTP POST when COMPASS_PLATFORM_QUEUE_URL is set. |
| ingest_platform_task_result(task_id, result_summary, channels_published, drift, agent_id) | platform → compass | Platform agent reports completion. Writes a JSON archive AND a session_*.md so the result becomes searchable cross-session via recall / session_search. |
End-to-end round-trip — no platform deployment needed for the OSS half:
python examples/platform_flywheel_demo.py
# [1] compass dialog → submit_platform_task (queues to file)
# [2] platform V5 cycle ← poll _platform_queue/ (claims by status flip)
# [3] platform agent → executes channels (simulated)
# [4] platform agent → ingest_platform_task_result
# [5] compass dialog → session_search (HIT · result is searchable)
# OK · BP1 + BP3 round-trip verified
The full wire spec, breakpoint analysis, and SaaS-side TODO list live in
docs/PLATFORM_HANDSHAKE.md §7.
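The file-based half of the round-trip (queue a task as JSON, then claim it by flipping its status) can be sketched as below. This is an illustration under assumptions, not the wire spec: the `status` field and its `"queued"` / `"claimed"` values, the helper names, and the use of a temporary directory in place of the real queue path are all hypothetical.

```python
import json
import tempfile
import uuid
from pathlib import Path


def submit_task(queue_dir, name, channels, payload, priority=0):
    """Write one task as <id>.json into the queue directory and return its id."""
    task_id = uuid.uuid4().hex
    task = {"id": task_id, "name": name, "channels": channels,
            "payload": payload, "priority": priority,
            "status": "queued"}           # assumed status field
    (Path(queue_dir) / f"{task_id}.json").write_text(json.dumps(task))
    return task_id


def claim_next(queue_dir):
    """Claim the first queued task by flipping its status; None if nothing is queued.

    Note: this read-modify-write is not atomic, so concurrent pollers would
    need a lock or an atomic rename on top of it.
    """
    for path in sorted(Path(queue_dir).glob("*.json")):
        task = json.loads(path.read_text())
        if task.get("status") == "queued":
            task["status"] = "claimed"
            path.write_text(json.dumps(task))
            return task
    return None


# A temporary directory stands in for the real queue location.
with tempfile.TemporaryDirectory() as queue:
    submitted_id = submit_task(queue, "weekly-digest", ["blog"], {"topic": "drift"})
    first_claim = claim_next(queue)
    second_claim = claim_next(queue)
```

Using plain files as the queue is what lets the OSS half run without any new HTTP server, at the cost of needing extra care once several pollers compete for the same task.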
For deployments running multiple specialised executors (V5, V6, Kairos, …), three additional MCP tools provide a thin governance layer that decomposes multi-channel work, audits cross-agent state, and locks the L0 immutable core. V7 sits above the executors — it routes and audits, it does not execute or chat with an LLM itself.
| Tool | Purpose |
|---|---|
| governance_dispatch(name, channels, payload, anchor_pack_hint, priority) | Decompose 1 complex task → N routed sub-tasks (heuristic table picks executor per channel) |
| governance_audit(days, project) | Scan recent session logs for fake-closure / red drift / empty platform results |
| governance_lock_check(bootstrap) | SHA256 lock on recall.py, merkle_chain.py, anchors.json, selftest.py |
python examples/v7_governance_demo.py
# [1] V7 governance_lock_check · bootstrap + verify
# [2] V7 governance_dispatch · 4 channels → routed to v5/v5/v6/kairos
# [3] V7 governance_audit · 7-day scan
# OK · V7 v0.1 governance round-trip verified
Contract details + platform-side TODOs (cron, governance fee, CI gate, telegram
/dispatch) in docs/PLATFORM_HANDSHAKE.md §8.
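The idea behind a SHA256 lock on core files can be shown with a short sketch: bootstrap records a digest per file, and later checks report any file whose digest changed. The function names, the lock-file format, and the return shape are assumptions for illustration and do not describe governance_lock_check's actual contract.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def file_digest(path):
    """SHA256 hex digest of a file's bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def lock_check(files, lock_path, bootstrap=False):
    """Bootstrap writes the digests; otherwise compare against the stored ones."""
    current = {str(f): file_digest(f) for f in files}
    lock_path = Path(lock_path)
    if bootstrap:
        lock_path.write_text(json.dumps(current, indent=2))
        return {"ok": True, "changed": []}
    locked = json.loads(lock_path.read_text())
    changed = sorted(f for f, d in current.items() if locked.get(f) != d)
    return {"ok": not changed, "changed": changed}


# Demo on a throwaway file standing in for a locked core module.
with tempfile.TemporaryDirectory() as tmp:
    core = Path(tmp) / "core.py"
    lock = Path(tmp) / "core.lock.json"
    core.write_text("VERSION = 1\n")
    bootstrapped = lock_check([core], lock, bootstrap=True)
    clean = lock_check([core], lock)
    core.write_text("VERSION = 2\n")      # simulate an unreviewed edit
    tampered = lock_check([core], lock)
    tampered_names = [Path(p).name for p in tampered["changed"]]
```

A check like this detects modification but does not prevent it, so it is most useful when run as a CI gate or at session start.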
- docs/AGENT_ONBOARDING.md: per-agent install configs (6 platforms + 3 frameworks)
- docs/mcp-usage.md: raw MCP protocol guide, TLS setup, RBAC
- docs/PLATFORM_HANDSHAKE.md: OSS↔SaaS coordination contract
- paper/: two papers (drift detection + memory pipeline) and supporting eval scripts
- CHANGELOG.md: versioned release notes
- CONTRIBUTING.md: adding new domain anchors / running benchmarks

If you use this work, please cite:
Paper 1 · drift detection:
@misc{nautiluscompass-drift-2026,
title = {Nautilus Compass: Black-box Persona Drift Detection
for Production LLM Agents},
author = {Chunxiao Wang},
year = {2026},
note = {Yiluo Technology Co., Ltd.},
howpublished = {\url{https://github.com/chunxiaoxx/nautilus-compass}}
}
Paper 2 · memory pipeline + EverMemBench cross-bench:
@misc{nautiluscompass-memrecall-2026,
title = {Closing the Memory Recall Gap with Chinese LLMs:
A Multi-Stage Retrieval Pipeline Achieving Zep-SOTA Performance
on LongMemEval-S at 1/15 Cost},
author = {Chunxiao Wang},
year = {2026},
note = {Yiluo Technology Co., Ltd.},
howpublished = {\url{https://github.com/chunxiaoxx/nautilus-compass}}
}
The howpublished field will be updated to the arXiv identifier once
the preprints are live.
We also build on prior work — please cite as appropriate:
- LICENSE)
- anchors*.json): CC0 1.0 Universal (see LICENSE-ANCHORS)

You may use this in any project, commercial or otherwise, with attribution.
PRs welcome — see CONTRIBUTING.md.
chunxiaoxx@gmail.com