This is the research and memory engine behind NodeBench's entity intelligence product, packaged as 260 tools across 49 domains. It exposes search pipelines, entity extraction, claim verification, and workspace memory operations through three distribution profiles: base, power, and admin. The hosted MCP endpoint runs anonymous, metered public research without signup, returning dossiers with sources, freshness scores, and attribution headers. You'd reach for this when you need structured company research, event intelligence, or progressive fact-checking that compounds into reusable artifacts instead of disappearing after one chat turn. Tools cover web scraping, quality gates, and the AI flywheel patterns that turn answers into tracked entities and nudges.
Entity intelligence for any company, market, or question.
Live: nodebenchai.com
npm: npx nodebench-mcp / npx nodebench-mcp-power / npx nodebench-mcp-admin
GitHub: HomenShum/nodebench-ai
NodeBench is a research and reporting product built around five user-facing surfaces:
Home = start quickly
Reports = reusable memory
Chat = do the work
Inbox = captures, nudges, alerts, automations, and unassigned review
Me = operator context and control
Deep research opens in the separate Workspace surface at nodebench.workspace; it is not a sixth tab in the operating app.
The core idea is simple:
users do not just need a chatbot that answers once.
They need a system that can:
Home, Reports, Chat, Inbox, and Me
nodebench.workspace
https://nodebench-mcp-unified.onrender.com?profile=public-research
Reports with code-gen, design-gen, research, composed runs, schedules, streaming previews, eval scorecard, and MCP HTTP bridge
nodebench-mcp, nodebench-mcp-power, and nodebench-mcp-admin distribution lanes
NodeBench can be used as a public research memory and tool server from any agent or app without forcing signup before the first useful result.
Use the hosted MCP endpoint:
https://nodebench-mcp-unified.onrender.com?profile=public-research
For Gmail/job-match style integrations, use the smaller profile:
https://nodebench-mcp-unified.onrender.com?profile=gmail-research
Public profiles are anonymous by default, but still metered. Responses include:
x-nodebench-request-id
x-nodebench-profile
x-nodebench-auth-mode
x-nodebench-account-key
Apps should send stable, non-sensitive client headers:
x-nodebench-client: your-app-name
x-nodebench-client-version: 1.0.0
x-nodebench-client-id: stable-install-or-workspace-id
x-nodebench-client-id lets NodeBench attribute anonymous usage and estimated
costs without requiring a NodeBench token. Do not put private email text,
resume text, API keys, or user secrets in this header.
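As a rough illustration of the header contract above, the sketch below builds the three client headers and picks the four attribution headers out of a response. The helper names (`endpointFor`, `buildClientHeaders`, `readAttribution`) and the `looksSensitive` heuristic are assumptions made for this example; only the URL, the profile names, and the header names come from this README.

```typescript
// Hypothetical helpers; only the URL, profile names, and header names are from the README.
const MCP_BASE_URL = "https://nodebench-mcp-unified.onrender.com";

type PublicProfile = "public-research" | "gmail-research";

interface ClientIdentity {
  appName: string;    // sent as x-nodebench-client
  appVersion: string; // sent as x-nodebench-client-version
  clientId: string;   // sent as x-nodebench-client-id
}

// Build the hosted endpoint URL for one of the public profiles.
function endpointFor(profile: PublicProfile): string {
  return `${MCP_BASE_URL}?profile=${encodeURIComponent(profile)}`;
}

// Assumed heuristic (not a NodeBench rule): treat anything that looks like
// an email address, contains whitespace, or is very long as unsafe to send.
function looksSensitive(value: string): boolean {
  return value.includes("@") || /\s/.test(value) || value.length > 128;
}

// Build the stable, non-sensitive client headers an app should send.
function buildClientHeaders(identity: ClientIdentity): Record<string, string> {
  if (looksSensitive(identity.clientId)) {
    throw new Error("x-nodebench-client-id must be a stable, non-sensitive identifier");
  }
  return {
    "x-nodebench-client": identity.appName,
    "x-nodebench-client-version": identity.appVersion,
    "x-nodebench-client-id": identity.clientId,
  };
}

// Extract the attribution headers from a response's headers (case-insensitive).
function readAttribution(responseHeaders: Record<string, string>): Record<string, string> {
  const wanted = [
    "x-nodebench-request-id",
    "x-nodebench-profile",
    "x-nodebench-auth-mode",
    "x-nodebench-account-key",
  ];
  const lowered: Record<string, string> = {};
  for (const key of Object.keys(responseHeaders)) {
    const value = responseHeaders[key];
    if (value !== undefined) lowered[key.toLowerCase()] = value;
  }
  const found: Record<string, string> = {};
  for (const headerName of wanted) {
    const value = lowered[headerName];
    if (value !== undefined) found[headerName] = value;
  }
  return found;
}
```

The returned header object can be passed to whichever HTTP or MCP client the app already uses; the request body format is defined by the MCP client and is not shown here.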
The intended user flow is:
First public dossier works without signup
-> show sources, freshness, and confidence
-> offer "Link NodeBench" after value is visible
-> linked users get stable history, higher budgets, team usage, webhooks,
token management, billing controls, and reusable private workspace context
Do not block public-source research behind login. Promote sign-in when the user wants persistence, shared team memory, budget controls, private workspace linking, or API/MCP tokens.
See MCP_TOOL_PROFILES.md for the full profile list, tool catalog, account attribution, and cost tracking contract.
USER SURFACES
-------------
Home -> start quickly
Reports -> reusable report memory
Chat -> answer, sources, trace, follow-ups
Inbox -> captures, nudges, automations, alerts, unassigned items
Me -> operator context, permissions, controls
BACKEND
-------
Convex tables and product state for sessions, reports, entities, nudges,
files, shared context, and evaluation artifacts
RUNTIME
-------
search pipeline
-> answer packet
-> saved report
-> tracked entity / tracked theme / follow-up task
-> nudge or prep brief
-> resumed chat or reopened report
COMPOUNDING LOOP
----------------
question
-> answer
-> saved report
-> watch item
-> useful nudge
-> better next run
DISTRIBUTION
------------
nodebenchai.com
nodebench.workspace
nodebench-mcp
nodebench-mcp-power
nodebench-mcp-admin
Event serving extends the same budgeted search route used by the main app and MCP runtime. NodeBench treats search as a memory-building operation with a budget. For events, NodeBench checks the event corpus and workspace memory before live search, then persists useful results as entities, claims, sources, and workspace context.
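The lookup order described above (event corpus, then workspace memory, then live search under a budget) could be sketched roughly as follows. Everything here is a simplified assumption for illustration: the `Map`-based stores, the `SearchBudget` shape, and the synchronous `liveSearch` callback are hypothetical stand-ins, not the actual search route.

```typescript
// Hypothetical shapes; the real route, stores, and budget policy are not shown in this README.
type Tier = "event-corpus" | "workspace-memory" | "live-search";

interface LookupResult {
  answer: string;
  tier: Tier;
  paidCalls: number;
}

interface SearchBudget {
  remainingPaidCalls: number;
}

function budgetedLookup(
  query: string,
  eventCorpus: Map<string, string>,
  workspaceMemory: Map<string, string>,
  liveSearch: (q: string) => string, // real searches would be async; kept sync for brevity
  budget: SearchBudget,
): LookupResult | null {
  const key = query.trim().toLowerCase();

  // 1. Event corpus first: free.
  const fromCorpus = eventCorpus.get(key);
  if (fromCorpus !== undefined) {
    return { answer: fromCorpus, tier: "event-corpus", paidCalls: 0 };
  }

  // 2. Workspace memory next: also free.
  const fromMemory = workspaceMemory.get(key);
  if (fromMemory !== undefined) {
    return { answer: fromMemory, tier: "workspace-memory", paidCalls: 0 };
  }

  // 3. Live search only if the budget allows it.
  if (budget.remainingPaidCalls <= 0) {
    return null;
  }
  budget.remainingPaidCalls -= 1;
  const answer = liveSearch(query);

  // Persist the result so the next identical question is served from memory.
  workspaceMemory.set(key, answer);
  return { answer, tier: "live-search", paidCalls: 1 };
}
```

The point of the sketch is the ordering and the write-back: a paid call happens at most once per question, and later runs start from memory.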
ScratchNode is the lightweight live-event sidecar for this model: a disposable
room that turns public chat and sourced /ask answers into a public wiki while
keeping attendee notes private. It complements Luma, Slack, Eventbrite, and
other event surfaces instead of replacing them; the explicit handoff to
NodeBench opens https://nodebenchai.com/events/:eventSlug/private with
private-note continuation context and no URL-borne ownerKey.
The event flow is:
Before event
-> build event corpus
During event
-> capture messy notes instantly
After event
-> turn captures into report, cards, follow-ups, and reusable memory
The product model is:
ScratchNode sidecar room + public event corpus + private NodeBench continuation
Event corpus and capture data stay separated:
Shared event corpus = public event info, speakers, sponsors, company pages, sessions, and public source cache.
Private captures = what a user personally heard, wrote, recorded, or photographed; ScratchNode private notes never enter the public feed, public wiki, or public /ask cache.
Team/org memory = shared only inside the fund, company, or workspace.
Event aggregate insights = opt-in or anonymized only.
During the event, most captures should hit the event corpus first and avoid paid search:
voice memo / text / screenshot
-> captureRouter
-> active event corpus
-> entity and claim extraction
-> active event session attachment
-> budget policy
-> ack + next action
Example mobile ack:
Saved to Ship Demo Day session
Detected 1 person | 1 company | 2 claims | 1 follow-up
Using event corpus | 0 paid calls
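A small sketch of how an ack like the one above might be rendered from capture counts. The `CaptureSummary` shape and the `formatAck` helper are hypothetical, and the "Using live search" wording for the non-corpus case is an assumption; only the three-line layout is taken from the example.

```typescript
// Hypothetical summary shape for one capture; field names are assumptions.
interface CaptureSummary {
  sessionName: string;
  people: number;
  companies: number;
  claims: number;
  followUps: number;
  paidCalls: number;
  usedEventCorpus: boolean;
}

// Render "1 person" / "2 people" style counts.
function countLabel(count: number, singular: string, pluralForm: string): string {
  return `${count} ${count === 1 ? singular : pluralForm}`;
}

// Build the three-line mobile ack shown in the README example.
function formatAck(summary: CaptureSummary): string {
  const detected = [
    countLabel(summary.people, "person", "people"),
    countLabel(summary.companies, "company", "companies"),
    countLabel(summary.claims, "claim", "claims"),
    countLabel(summary.followUps, "follow-up", "follow-ups"),
  ].join(" | ");
  // Assumed wording for the case where the event corpus was not enough.
  const source = summary.usedEventCorpus ? "Using event corpus" : "Using live search";
  return [
    `Saved to ${summary.sessionName} session`,
    `Detected ${detected}`,
    `${source} | ${countLabel(summary.paidCalls, "paid call", "paid calls")}`,
  ].join("\n");
}
```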
After the event, the report opens in nodebench.workspace:
Brief -> post-event memo
Cards -> people, companies, products, themes
Notebook -> raw notes, transcripts, screenshot OCR, cleaned notes
Sources -> field notes, public evidence, verification status
Chat -> follow-up questions and deeper refreshes
Map -> graph view later
The canonical spec lives in EVENT_INTELLIGENCE_SERVING_MODEL.md. The ScratchNode/NodeBench privacy boundary lives in SCRATCHNODE_NODEBENCH_BOUNDARY.md.
NodeBench is designed around a few product realities:
That drives the current design:
opusplan split: stronger planning lane, cheaper execution lane
Harness v2 work focused on specification, operator context, and compounding behavior
Plain English:
NodeBench should not spend the most expensive reasoning path on every request.
It should move fast by default, then go deeper when the task, evidence, or user
request justifies it.
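One way to picture "fast by default, deeper when justified" is a small routing function. The signal names and rules below are illustrative assumptions only; the README does not describe how the actual lanes are selected.

```typescript
// Hypothetical routing signals; the real selection logic is not documented here.
type Lane = "fast" | "deep";

interface RequestSignals {
  userAskedForDepth: boolean;   // the user explicitly requested a deeper pass
  conflictingSources: number;   // evidence that disagrees and needs reconciling
  multiStepTask: boolean;       // the task needs planning, not a single lookup
}

// Default to the cheaper lane; escalate only when a signal justifies it.
function chooseLane(signals: RequestSignals): Lane {
  if (signals.userAskedForDepth) return "deep";
  if (signals.conflictingSources > 0) return "deep";
  if (signals.multiStepTask) return "deep";
  return "fast";
}
```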
The detailed implementation, verification, and evaluation plan for this mode lives in:
NodeBench should not feel like five separate destinations.
The intended product behavior is:
Home
-> start quickly
Reports
-> turn that artifact into reusable memory
Chat
-> do the work
-> create the first useful artifact
Inbox
-> triage captures, nudges, automations, alerts, and unassigned items
Me
-> improve how the next run is handled
Workspace
-> open deep Brief / Cards / Notebook / Sources / Chat / Map work
-> lives at nodebench.workspace, not in the operating tab bar
Next Home or Chat run
-> starts with more context than before
The shortest version of the compounding loop is:
question
-> answer
-> saved report
-> watch item
-> useful nudge
-> better next run
Plain-English artifact flow:
input
-> answer packet
-> saved report
-> tracked entity / tracked theme / follow-up task
-> nudge or prep brief
-> resumed report or resumed chat
-> user correction or confirmation
-> updated operator context
-> better next run
What each page contributes:
Home starts the run with the least friction possible
Reports turns those into a durable report the user can reopen, refresh, and reuse
Chat creates the answer, sources, trace, entities, and next actions
Inbox collects nudges, captures, automations, alerts, and unassigned items
Me stores the operator context that improves the next answer
Workspace owns recursive cards, notebook editing, source verification, and long-lived intelligence memory
NodeBench is not starting from zero. The repo already contains a substantial legacy stack that works today.
Current legacy foundation:
What that means:
The near-term goal is:
keep the working legacy foundation
remove accidental complexity
add specification-aware operator context
ship one clear compounding workflow
Main tasks still to finish:
Home -> Reports -> Chat -> Inbox -> Me behave like one continuous workflow instead of five adjacent surfaces
Layer 0 operator context so the system can learn useful workflow patterns without forcing a heavy onboarding flow
nodebench-mcp
Nudges as an Inbox section with at least one working daily trigger
Me clearly improve future runs by exposing what context is being used and why
nodebench-mcp v3 cut-and-split plan so default runtime, power runtime, and admin runtime are clearly separated
Open nodebenchai.com and start in Home.
# Claude Code
claude mcp add nodebench -- npx -y nodebench-mcp
# Claude Code power lane
claude mcp add nodebench-power -- npx -y nodebench-mcp-power
# Claude Code admin lane
claude mcp add nodebench-admin -- npx -y nodebench-mcp-admin
# Cursor
npx nodebench-mcp --preset cursor
# Generic MCP client
npx nodebench-mcp
git clone https://github.com/HomenShum/nodebench-ai.git
cd nodebench-ai
npm install
cp .env.example .env.local
# Frontend + Convex + voice server
npm run dev
# Production build
npm run build
nodebenchai.com (React + Vite + Tailwind)
|
Convex Cloud (sessions, reports, entities, nudges, files, product state)
|
server runtime + search pipeline + SSE
|
answer packet
|
saved report
|
tracked entities / watch conditions / nudges
|
future runs with better operator context
The notebook and diligence stack in this repo are a good example of a common product engineering tradeoff:
For NodeBench, that means:
founder is a trait and diligence block, not a permanent sixth tab
*Identify.ts features
scratchpad-first -> structuring pass -> deterministic merge
Why the notebook does not use one giant live editor model yet:
The practical rule in this repo is:
UX should feel monolithic.
Runtime should stay layered.
Typing should be local-first.
Agent output should be overlay-first.
Accepted output should become owned prose.
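To make the "scratchpad-first -> structuring pass -> deterministic merge" idea more concrete, here is a minimal sketch of a structuring pass that validates model JSON and falls back to a deterministic parse of the scratchpad. The `StructuredNote` shape and the function names are hypothetical, and the sketch covers only validation and fallback, not a repair step.

```typescript
// Hypothetical output shape for a structured note.
interface StructuredNote {
  title: string;
  bullets: string[];
  source: "model" | "fallback";
}

// Validate that parsed JSON has the fields this sketch expects.
function isNoteShape(value: unknown): value is { title: string; bullets: string[] } {
  if (typeof value !== "object" || value === null) return false;
  const candidate = value as { title?: unknown; bullets?: unknown };
  return (
    typeof candidate.title === "string" &&
    Array.isArray(candidate.bullets) &&
    candidate.bullets.every((item) => typeof item === "string")
  );
}

// Deterministic fallback: first non-empty line is the title, the rest are bullets.
function deterministicFallback(scratchpad: string): StructuredNote {
  const lines = scratchpad
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0);
  const [firstLine, ...rest] = lines;
  return { title: firstLine ?? "Untitled", bullets: rest, source: "fallback" };
}

// Structuring pass: trust model output only if it parses and validates.
function structureNote(scratchpad: string, modelOutput: string): StructuredNote {
  try {
    const parsed: unknown = JSON.parse(modelOutput);
    if (isNoteShape(parsed)) {
      return { title: parsed.title, bullets: parsed.bullets, source: "model" };
    }
  } catch (_err) {
    // Malformed JSON falls through to the deterministic path.
  }
  return deterministicFallback(scratchpad);
}
```

Because the fallback never depends on the model, a bad structuring pass degrades the note's polish rather than losing the user's text.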
Current notebook refactor lessons:
scratchpad checkpoint -> JSON -> validation/repair -> deterministic fallback -> projection row
For students reading the code, the most relevant docs are:
The live notebook refactor is deliberately incremental:
Set these in .env.local for local work or in Convex / Vercel for deployed
environments.
| Key | Required | Purpose |
|---|---|---|
| GEMINI_API_KEY | Yes | classification, extraction, synthesis |
| LINKUP_API_KEY | Recommended | web search and sourced answers |
| VITE_CONVEX_URL | Yes | Convex deployment URL |
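A startup check for the keys in the table might look like the sketch below. The `checkEnv` helper is hypothetical and not part of the repo; the key names and their Required/Recommended status are taken from the table.

```typescript
// Key lists mirror the table above; the helper itself is an illustrative assumption.
const REQUIRED_KEYS = ["GEMINI_API_KEY", "VITE_CONVEX_URL"];
const RECOMMENDED_KEYS = ["LINKUP_API_KEY"];

interface EnvCheck {
  missingRequired: string[];
  missingRecommended: string[];
}

// Report which keys are absent or blank in the given environment object.
function checkEnv(env: Record<string, string | undefined>): EnvCheck {
  const isMissing = (key: string): boolean => {
    const value = env[key];
    return value === undefined || value.trim() === "";
  };
  return {
    missingRequired: REQUIRED_KEYS.filter(isMissing),
    missingRecommended: RECOMMENDED_KEYS.filter(isMissing),
  };
}
```

In a Node process this could be called with `process.env`, failing fast when `missingRequired` is non-empty and only warning for `missingRecommended`.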
Top-3 levels, annotated. See ARCHITECTURE.md for the
pipeline diagram and docs/architecture/README.md
for the 13 canonical architecture docs.
nodebench-ai/
├── README.md ← you are here
├── ARCHITECTURE.md ← top-level pipeline diagram
├── CONTRIBUTING.md ← contribution bar
├── CLAUDE.md ← Claude Code conventions for this repo
├── AGENTS.md ← agent methodology + eval bench
├── LICENSE ← MIT
│
├── src/ ← React frontend (Vite)
│ ├── features/ ← feature-first, 30 folders (Home · Reports · Chat · Inbox · Me · Workspace · entities · agents · …)
│ │ └── <feature>/ ← views · components · hooks · lib · __tests__ (colocated)
│ ├── shared/ ← shared UI primitives, hooks, utils
│ ├── lib/ ← registry, analytics, error reporting
│ └── layouts/ ← shell + cockpit + public
│
├── server/ ← Node runtime (Express + MCP gateway)
│ ├── pipeline/ ← agent harness runtime + diligence blocks
│ ├── routes/ ← HTTP routes (search, harness, founder episodes)
│ ├── mcpGateway.ts ← WebSocket MCP gateway
│ └── services/ ← shared services
│
├── convex/ ← Convex backend
│ ├── domains/ ← 19 domain folders (agents · product · research · founder · search · …)
│ ├── schema.ts ← database schema (includes agentScratchpads)
│ └── crons.ts ← scheduled jobs
│
├── packages/
│ ├── mcp-local/ ← the published nodebench-mcp npm package (MIT)
│ ├── mcp-client/ ← typed client SDK
│ └── convex-mcp-nodebench/ ← Convex-side MCP auditor
│
├── .claude/
│ ├── README.md ← map of the .claude/ layout
│ ├── rules/ ← 31 modular rules with related_ cross-refs
│ ├── skills/ ← reusable how-to procedures
│ ├── agents/ ← subagent configs
│ └── commands/ ← custom slash commands
│
├── docs/
│ ├── README.md ← docs tree map
│ ├── ONBOARDING.md ← 30-minute new-contributor path
│ ├── architecture/ ← 13 canonical specs + plans/ + README index
│ ├── agents/ ← agent docs + bootstrap configs
│ ├── guides/ ← how-to for builders
│ ├── decisions/ ← ADRs
│ ├── changelog/ ← release notes
│ ├── product/ ← product decisions
│ ├── qa/ ← QA protocols
│ └── archive/ ← superseded content, provenance-only
│
├── tests/
│ ├── e2e/ ← Playwright end-to-end
│ └── fixtures/ ← shared fixtures
│
├── scripts/ ← dogfood, eval harness, one-offs
├── public/ ← static assets served by Vite + Vercel
└── vendor/ ← third-party references
Start here: docs/ONBOARDING.md · ARCHITECTURE.md · docs/architecture/README.md
The 13 canonical architecture docs are organized in 4 tiers. See docs/architecture/README.md for the indexed map:
AGENT_PIPELINE · DILIGENCE_BLOCKS · USER_FEEDBACK_SECURITY
SCRATCHPAD_PATTERN · PROSEMIRROR_DECORATIONS · AGENT_OBSERVABILITY · SESSION_ARTIFACTS
FOUNDER_FEATURE · REPORTS_AND_ENTITIES · AUTH_AND_SHARING
MCP_INTEGRATION · EVAL_AND_FLYWHEEL · DESIGN_SYSTEM
Active architecture addenda:
GRAPH_SEARCH_AGENT_CONTEXT
captures the graph/search/agent-context strategy, exact product questions,
scale projection, node attention model, and human-vs-agent retrieval split.
Historical specs are preserved in docs/archive/2026-q1/.
NodeBench ships with a comprehensive evaluation harness that proves correctness across 32+ scenarios, 9 user personas, and 9 feature categories. This is not hand-wavy "it works" — it is measured, versioned, and reproducible.
Pi-AI pipeline cascade: merged to main on 2026-04-30 at
2a541037874c0f8c675ab393d5c08f50123cf6d2.
| Lane | Result |
|---|---|
| PR chain | #211 -> #212 -> #213 -> #214 -> #215 -> #216 all merged |
| Production surface | https://www.nodebenchai.com/?surface=packets |
| MCP bridge | https://agile-caribou-964.convex.site/mcp/pipeline/* behind MCP_SECRET |
| Code-gen run | pipeline_mokobe4y_6n23be succeeded, verified, 6 files, 32.9s, about $0.001 |
| Research streaming | pipeline_mokpvi1b_yoj8ot completed with 4,317 streamed characters |
| Linkup research | pipeline_mol2wj2j_2lgx2u succeeded with 18 snippets across 5 sub-questions |
| Composed pipeline | research_then_code completed stage 1 research and stage 2 code-gen |
| Schedule workflow | once schedule swept by cron/manual sweep and auto-disabled after run |
| Design output | design-gen produced a PNG stored in Convex storage |
| UI launcher | DOM-submitted composed run updated the reactive run list |
| Pipeline scorecard | 41.7% verified, Brier 0.135 across 12 runs |
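For readers unfamiliar with the Brier figure in the scorecard row, the sketch below computes the standard Brier score: the mean squared difference between a predicted probability and the 0/1 outcome, where lower is better. Whether the scorecard uses exactly this definition is an assumption, and the function name is hypothetical.

```typescript
// Standard Brier score: mean of (forecast - outcome)^2 over all runs.
// forecasts[i] is the predicted probability that run i verifies;
// outcomes[i] is whether it actually did.
function brierScore(forecasts: number[], outcomes: boolean[]): number {
  if (forecasts.length === 0 || forecasts.length !== outcomes.length) {
    throw new Error("forecasts and outcomes must be non-empty and the same length");
  }
  let total = 0;
  forecasts.forEach((probability, index) => {
    if (probability < 0 || probability > 1) {
      throw new Error("probabilities must be in [0, 1]");
    }
    const outcome = outcomes[index] === true ? 1 : 0;
    const diff = probability - outcome;
    total += diff * diff;
  });
  return total / forecasts.length;
}
```

A perfectly confident and correct forecaster scores 0; always answering 0.5 scores 0.25, which is a useful baseline when reading a value such as 0.135.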
The implementation is mounted on the Reports surface:
PipelineLauncher
PipelineSchedulesPanel
PipelineEvalScorecard
PipelineRunsPanel
EntityFindingsPanel
The detailed handoff is in docs/handoff/PI_AI_PIPELINES_HANDOFF.md.
Workflow-loop eval bank: added on 2026-04-30 to test the full product loop, not just answer text.
query / capture
-> memory search
-> entity resolution
-> report update
-> notebook update
-> graph edges
-> sources / claims
-> follow-up / export
| Eval bank | Result |
|---|---|
| Total workflow cases | 124 |
| Minimum P0 suite | 30 cases |
| Coverage categories | 11 |
| Score dimensions | 12 |
| Validator | src/features/evaluation/data/nodebenchWorkflowEvalBank.test.ts |
| Latest local check | npx vitest run src/features/evaluation/data/nodebenchWorkflowEvalBank.test.ts -> 4/4 passed |
The eval bank lives in
src/features/evaluation/data/nodebenchWorkflowEvalBank.ts.
Every production run is evaluated by two independent systems:
Layer 1: Deterministic Boolean Gates (server/pipeline/diligenceJudge.ts)
verified | provisionally_verified | needs_review | failed
Layer 2: LLM Semantic Scoring (server/pipeline/diligenceLlmJudge.ts)
llmjudge-v1) for cohort separation
This dual-layer approach means hallucinations and quality regressions are caught by two independent systems before they reach users.
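As a rough sketch of how two independent layers can be combined, the function below ships a run only when both the deterministic verdict and the LLM score pass. The four verdict names come from this README, and the threshold of 7 matches the stated production target; which verdicts count as passing, and the `releaseDecision` helper itself, are assumptions for illustration and are not taken from diligenceJudge.ts or diligenceLlmJudge.ts.

```typescript
// Verdict names are from the README; everything else is a hypothetical sketch.
type GateVerdict = "verified" | "provisionally_verified" | "needs_review" | "failed";

interface DualLayerResult {
  verdict: GateVerdict; // layer 1: deterministic boolean gates
  llmScore: number;     // layer 2: LLM semantic score on a 0-10 scale
}

// The README states a production target of at least 7 for the LLM judge.
const LLM_SCORE_TARGET = 7;

// Assumption: only "verified" and "provisionally_verified" pass layer 1.
function gatesPass(verdict: GateVerdict): boolean {
  return verdict === "verified" || verdict === "provisionally_verified";
}

// Both layers must agree before a run is treated as shippable.
function releaseDecision(result: DualLayerResult): "ship" | "hold" {
  const judgePasses = result.llmScore >= LLM_SCORE_TARGET;
  return gatesPass(result.verdict) && judgePasses ? "ship" : "hold";
}
```

Requiring both layers means a fluent but unsupported answer (high LLM score, failed gates) and a well-sourced but unhelpful one (passed gates, low score) are both held back.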
Latest Full-Stack Eval: 2026-04-23T06:46:53Z
Overall Pass Rate: 100% ✅
LLM Judge Average: 9.6/10 (target: ≥7 for production)
Dogfood Score: 100/100 (0 real issues)
Entity Resolution: 100% ✅
Factual Accuracy: 90.6% ✅
No Hallucinations: 90.6% ✅
Actionable Output: 100% ✅
Answer Control: 100% ✅ (all 8 dimensions)
Feature Breadth: 100% ✅ (31 scenarios)
Retention/Continuity: 4/4 passed ✅
All production gates passing:
Note: The only outstanding item is p95 latency optimization (174s vs 90s target) — a performance enhancement, not a correctness blocker. The system is production-ready for all quality scenarios.
Capability Eval — 32 Persona Scenarios
| Persona | Example Query | Status |
|---|---|---|
| JPM Startup Banker | "DISCO — worth reaching out? Fastest debrief" | ✅ 100% |
| Early Stage VC | "OpenAutoGLM — what's the wedge?" | ✅ 100% |
| CTO Tech Lead | "QuickJS — do I have exposure?" | ✅ 100% |
| Enterprise Exec | "Gemini 3 — procurement next step?" | ✅ 100% |
| Ecosystem Partner | "SoundCloud VPN — who benefits?" | ✅ 100% |
| Founder Strategy | "Salesforce Agentforce — counter-positioning?" | ✅ 100% |
| Academic R&D | "RyR2/Alzheimer's — literature anchor?" | ✅ 100% |
| Quant Analyst | "DISCO — extract funding signal" | ✅ 100% |
| Product Designer | "DISCO — schema-dense UI card JSON" | ✅ 100% |
| Sales Engineer | "DISCO — share-ready outbound summary" | ✅ 100% |
Expanded Feature Breadth — 31 Scenarios
| Category | Count | Pass Rate |
|---|---|---|
| Calendar | 3 | 100% ✅ |
| Disclosure | 4 | 100% ✅ |
| Document | 3 | 100% ✅ |
| Hybrid | 4 | 100% ✅ |
| Media | 3 | 100% ✅ |
| Skills | 4 | 100% ✅ |
| Spreadsheet | 3 | 100% ✅ |
| Tools | 4 | 100% ✅ |
| Web | 3 | 100% ✅ |
Answer Control — 8 Dimensions
Run the full production evaluation suite:
# Full 8-phase evaluation (typecheck → build → capability → expanded →
# answer-control → dogfood → notebook → history)
npm run eval
# Quick verification (3 scenarios)
npm run eval:quick-slice
# Individual lanes
npm run eval:capability # 32 persona scenarios
npm run eval:feature-breadth # 31 feature scenarios
npm run eval:retention # Wiki continuity suite
All artifacts are versioned in docs/architecture/benchmarks/:
full-stack-eval-latest.md: aggregate summary
comprehensive-eval-*.md: capability results
expanded-eval-*.md: feature breadth results
product-answer-control-eval-*.md: answer control results
The system meets all of these. The only remaining work is latency optimization: making fast answers even faster, not making broken answers work.
moonshotai/kimi-k2.6 (OpenRouter): 100% capability pass
gpt-5.4: automatic retry on empty/missing debrief
kimi-k2.6: 9.6/10 average across all scenarios
Kimi is the primary lane. GPT-5.4 remains the safety fallback until Kimi's first-attempt stability improves, but both paths are production-tested.
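The primary-then-fallback behavior described above can be sketched as a retry on an empty or missing debrief. The `ModelCall` type and `debriefWithFallback` helper are hypothetical, and the calls are synchronous here only to keep the example short; real model calls would be asynchronous.

```typescript
// Hypothetical call signature: returns the debrief text, or null when missing.
type ModelCall = (prompt: string) => string | null;

interface Debrief {
  lane: "primary" | "fallback";
  text: string;
}

// A debrief is usable only if it exists and is not blank.
function isUsable(text: string | null): text is string {
  return text !== null && text.trim() !== "";
}

// Try the primary lane first; retry on the fallback lane if the debrief is empty or missing.
function debriefWithFallback(prompt: string, primary: ModelCall, fallback: ModelCall): Debrief | null {
  const primaryText = primary(prompt);
  if (isUsable(primaryText)) {
    return { lane: "primary", text: primaryText };
  }
  const fallbackText = fallback(prompt);
  if (isUsable(fallbackText)) {
    return { lane: "fallback", text: fallbackText };
  }
  return null; // both lanes produced nothing usable
}
```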
NodeBench AI = flagship user surface
nodebench-mcp = workflow lane
Attrition.sh = measured replay + optimization lane
Attrition is not a third flagship. It is the measurable optimization lane for the same NodeBench workflow.
MIT