MCP Eval Runner

dbsectrainer/mcp-eval-runner
auth · STDIO · registry active
Summary

A testing harness that lets you write eval fixtures as YAML files and run them directly from your MCP client. Each fixture defines steps with tool calls, inputs, and assertions like output_contains, schema_match, or latency_under. It has two modes: live mode spawns a real MCP server via stdio and tests against actual tool responses, while simulation mode runs assertions against static expected_output strings. You get tools like run_suite for executing all tests, regression_report to compare runs, and create_test_case for scaffolding new fixtures. Step outputs can pipe into downstream inputs using template syntax. Useful when you're building MCP servers and need regression tests in version control without leaving your editor.

MCP Eval Runner

A standardized testing harness for MCP servers and agent workflows. Define test cases as YAML fixtures (steps → expected tool calls → expected outputs), run regression suites directly from your MCP client, and get pass/fail results with diffs — without leaving Claude Code or Cursor.

Tool reference | Configuration | Fixture format | Contributing | Troubleshooting | Design principles

Key features

  • YAML fixtures: Test cases are plain files in version control — diffable, reviewable, and shareable.
  • Two execution modes: Live mode spawns a real MCP server and calls tools via stdio; simulation mode runs assertions against expected_output without a server.
  • Composable assertions: Combine output_contains, output_not_contains, output_equals, output_matches, schema_match, tool_called, and latency_under per step.
  • Step output piping: Reference a previous step's output in downstream inputs via {{steps.<step_id>.output}}.
  • Regression reports: Compare the current run to any past run and surface what changed.
  • Watch mode: Automatically reruns the affected fixture when files change.
  • CI-ready: Includes a GitHub Action for running evals on every config change.

Requirements

  • Node.js v22.5.0 or newer.
  • npm.

Getting started

Add the following config to your MCP client:

{
  "mcpServers": {
    "eval-runner": {
      "command": "npx",
      "args": ["-y", "mcp-eval-runner@latest"]
    }
  }
}

By default, eval fixtures are loaded from ./evals/ in the current working directory. To use a different path:

{
  "mcpServers": {
    "eval-runner": {
      "command": "npx",
      "args": ["-y", "mcp-eval-runner@latest", "--fixtures=~/my-project/evals"]
    }
  }
}

MCP Client configuration

Amp · Claude Code · Cline · Cursor · VS Code · Windsurf · Zed

Your first prompt

Create a file at evals/smoke.yaml. Use live mode (recommended) by including a server block:

name: smoke
description: "Verify eval runner itself is working"
server:
  command: node
  args: ["dist/index.js"]
steps:
  - id: list_check
    description: "List available test cases"
    tool: list_cases
    input: {}
    expect:
      output_contains: "smoke"

Then enter the following in your MCP client:

Run the eval suite.

Your client should return a pass/fail result for the smoke test.

Fixture format

Fixtures are YAML (or JSON) files placed in the fixtures directory. Each file defines one test case.

Top-level fields

Field | Required | Description
name | Yes | Unique name for the test case
description | No | Human-readable description
server | No | Server config; if present, the fixture runs in live mode, and if absent it runs in simulation mode
steps | Yes | Array of steps to execute

server block (live mode)

server:
  command: node # executable to spawn
  args: ["dist/index.js"] # arguments
  env: # optional environment variables
    MY_VAR: "value"

When server is present the eval runner spawns the server as a child process, connects via MCP stdio transport, and calls each step's tool against the live server.

steps array

Each step has the following fields:

Field | Required | Description
id | Yes | Unique identifier within the fixture (used for output piping)
tool | Yes | MCP tool name to call
description | No | Human-readable step description
input | No | Key-value map of arguments passed to the tool (default: {})
expected_output | No | Literal string used as output in simulation mode
expect | No | Assertions evaluated against the step output

Execution modes

Live mode — fixture has a server block:

  • The server is spawned and each step calls the named tool via MCP stdio.
  • Assertions run against the real tool response.
  • Errors from the server cause the step (and by default the case) to fail immediately.

Simulation mode — no server block:

  • No server is started.
  • Each step's output is taken from expected_output (or empty string if absent).
  • Assertions run against that static output.
  • Useful for authoring and CI dry-runs, but if expected_output is not set the output is an empty string, so any output_contains assertion with a non-empty substring fails.
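
The mode rules above can be sketched in a few lines of TypeScript. This is a minimal illustration, not the package's implementation: the ServerBlock, FixtureStep, and FixtureFile types and both function names are hypothetical, modeled only on the fixture fields documented in this README.

```typescript
// Hypothetical types that mirror the documented fixture fields.
// They are illustrative and not taken from the package source.
interface ServerBlock {
  command: string;
  args?: string[];
  env?: Record<string, string>;
}

interface FixtureStep {
  id: string;
  tool: string;
  input?: Record<string, unknown>;
  expected_output?: string;
}

interface FixtureFile {
  name: string;
  server?: ServerBlock;
  steps: FixtureStep[];
}

type ExecutionMode = "live" | "simulation";

// A fixture runs in live mode only when it has a server block.
function executionMode(fixture: FixtureFile): ExecutionMode {
  return fixture.server !== undefined ? "live" : "simulation";
}

// In simulation mode the step output is expected_output, or "" when it is absent.
function simulatedOutput(step: FixtureStep): string {
  return step.expected_output ?? "";
}

const withOutput: FixtureStep = { id: "a", tool: "search", expected_output: "result: ok" };
const withoutOutput: FixtureStep = { id: "b", tool: "search" };
const dryRun: FixtureFile = { name: "dry-run", steps: [withOutput, withoutOutput] };

console.log(executionMode(dryRun)); // simulation
console.log(simulatedOutput(withOutput).includes("ok")); // true
console.log(simulatedOutput(withoutOutput).includes("ok")); // false
```

The last line shows why a substring assertion cannot pass in simulation mode without expected_output: the check runs against an empty string.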

Assertion types

All assertions go inside a step's expect block:

expect:
  output_contains: "substring" # output includes this text
  output_not_contains: "error" # output must NOT include this text
  output_equals: "exact string" # output exactly matches
  output_matches: "regex pattern" # output matches a regular expression
  tool_called: "tool_name" # verifies which tool was called
  latency_under: 500 # latency in ms must be below this threshold
  schema_match: # output (parsed as JSON) matches JSON Schema
    type: object
    required: [id]
    properties:
      id:
        type: number

Multiple assertions in one expect block are all evaluated; the step fails if any assertion fails.
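
To make the "all evaluated, any failure fails the step" rule concrete, here is a rough TypeScript sketch. The Expectations type and checkExpectations function are assumptions made for illustration; they are not the Assertion interface from src/assertions.ts, and schema_match and tool_called are omitted for brevity.

```typescript
// Hypothetical type covering the string and latency assertions listed above.
// schema_match and tool_called are left out to keep the sketch short.
interface Expectations {
  output_contains?: string;
  output_not_contains?: string;
  output_equals?: string;
  output_matches?: string;
  latency_under?: number;
}

// Evaluates every assertion present and returns the name of each one that failed.
// An empty array means the step passed.
function checkExpectations(output: string, latencyMs: number, expected: Expectations): string[] {
  const failures: string[] = [];
  if (expected.output_contains !== undefined && !output.includes(expected.output_contains)) {
    failures.push("output_contains");
  }
  if (expected.output_not_contains !== undefined && output.includes(expected.output_not_contains)) {
    failures.push("output_not_contains");
  }
  if (expected.output_equals !== undefined && output !== expected.output_equals) {
    failures.push("output_equals");
  }
  if (expected.output_matches !== undefined && !new RegExp(expected.output_matches).test(output)) {
    failures.push("output_matches");
  }
  // The latency must be strictly below the threshold.
  if (expected.latency_under !== undefined && latencyMs >= expected.latency_under) {
    failures.push("latency_under");
  }
  return failures;
}

const sampleOutput = "result: mcp-eval-runner v1.0";

// All three assertions hold, so no failures are reported.
const passing = checkExpectations(sampleOutput, 120, {
  output_contains: "mcp-eval-runner",
  output_matches: "v\\d+\\.\\d+$",
  latency_under: 500,
});
console.log(passing); // []

// Two assertions fail; both are reported, not just the first.
const failing = checkExpectations(sampleOutput, 800, {
  output_contains: "result",
  output_not_contains: "mcp",
  latency_under: 500,
});
console.log(failing); // [ 'output_not_contains', 'latency_under' ]
```

Collecting failures instead of returning at the first one is what lets a report show every broken assertion for a step in a single run.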

Step output piping

Reference the output of a previous step in a downstream step's input using {{steps.<step_id>.output}}:

steps:
  - id: search_step
    tool: search
    input:
      query: "mcp eval runner"
    expected_output: "result: mcp-eval-runner v1.0"
    expect:
      output_contains: "mcp-eval-runner"

  - id: summarize_step
    tool: summarize
    input:
      text: "{{steps.search_step.output}}"
    expected_output: "Summary: mcp-eval-runner v1.0"
    expect:
      output_contains: "Summary"

Piping works in both live mode and simulation mode.
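
One plausible way to resolve these placeholders is a single regex substitution over each input string. The sketch below is an assumption about how such a resolver could look, not the package's code; resolvePiping, the allowed step-id characters, and the handling of unknown step ids are all hypothetical.

```typescript
// Replaces each {{steps.<step_id>.output}} placeholder with that step's recorded output.
// Assumption: step ids consist of letters, digits, underscores, and hyphens.
function resolvePiping(template: string, outputs: Map<string, string>): string {
  return template.replace(
    /\{\{steps\.([A-Za-z0-9_-]+)\.output\}\}/g,
    (placeholder: string, stepId: string): string => {
      const output = outputs.get(stepId);
      // Assumption: a reference to an unknown step is left as written.
      return output !== undefined ? output : placeholder;
    },
  );
}

const recorded = new Map<string, string>([
  ["search_step", "result: mcp-eval-runner v1.0"],
]);

console.log(resolvePiping("{{steps.search_step.output}}", recorded));
// result: mcp-eval-runner v1.0
console.log(resolvePiping("Summarize: {{steps.missing.output}}", recorded));
// Summarize: {{steps.missing.output}}
```

Using a replacer function rather than a replacement string means characters such as $ in a step's output are inserted literally.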

Note on create_test_case

Fixtures created with the create_test_case tool do not include a server block. They always run in simulation mode. To use live mode, add a server block manually to the generated YAML file.

Tools

Running

  • run_suite — execute all fixtures in the fixtures directory; returns a pass/fail summary
  • run_case — run a single named fixture by name
  • list_cases — enumerate available fixtures with step counts and descriptions

Authoring

  • create_test_case — create a new YAML fixture file (simulation mode; no server block)
  • scaffold_fixture — generate a boilerplate fixture with placeholder steps and pre-filled assertion comments

Reporting

  • regression_report — compare the current fixture state to the last run; surfaces regressions and fixes
  • compare_results — diff two specific runs by run ID
  • generate_html_report — generate a single-file HTML report for a completed run
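
As a rough picture of what "regressions and fixes" means when two runs are compared, consider the sketch below. It assumes each run can be reduced to a map from case name to pass/fail; the CaseResults and RunDiff types and the diffRuns function are hypothetical and say nothing about how the package stores or diffs its run history.

```typescript
// Hypothetical reduction of a run: case name mapped to its outcome.
type CaseResults = Record<string, "pass" | "fail">;

interface RunDiff {
  regressions: string[]; // passed before, fail now
  fixes: string[]; // failed before, pass now
}

// Compares two runs case by case. Cases missing from the earlier run are ignored.
function diffRuns(previous: CaseResults, current: CaseResults): RunDiff {
  const diff: RunDiff = { regressions: [], fixes: [] };
  for (const caseName of Object.keys(current)) {
    const before = previous[caseName];
    const after = current[caseName];
    if (before === "pass" && after === "fail") diff.regressions.push(caseName);
    if (before === "fail" && after === "pass") diff.fixes.push(caseName);
  }
  return diff;
}

const lastRun: CaseResults = { smoke: "pass", search: "fail", summarize: "pass" };
const thisRun: CaseResults = { smoke: "fail", search: "pass", summarize: "pass" };

console.log(diffRuns(lastRun, thisRun));
// { regressions: [ 'smoke' ], fixes: [ 'search' ] }
```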

Operations

  • evaluate_deployment_gate — CI gate; fails if recent pass rate drops below a configurable threshold
  • discover_fixtures — discover fixture files across one or more directories (respects FIXTURE_LIBRARY_DIRS)
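
The gate described for evaluate_deployment_gate amounts to a pass-rate comparison. The following sketch shows that arithmetic only; gatePasses, its parameters, and the choice to fail the gate when there is no run history are assumptions, not the tool's documented behavior.

```typescript
// Returns true when the share of passing runs is at or above the threshold.
// recentOutcomes holds one boolean per recent run (true = passed).
function gatePasses(recentOutcomes: boolean[], minPassRate: number): boolean {
  // Assumption: with no history there is nothing to vouch for the deployment.
  if (recentOutcomes.length === 0) return false;
  const passed = recentOutcomes.filter((ok) => ok).length;
  return passed / recentOutcomes.length >= minPassRate;
}

const recent = [true, true, true, false]; // pass rate 0.75

console.log(gatePasses(recent, 0.75)); // true
console.log(gatePasses(recent, 0.8)); // false
```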

Configuration

--fixtures / --fixtures-dir

Directory to load YAML/JSON eval fixture files from.

Type: string
Default: ./evals

--db / --db-path

Path to the SQLite database file used to store run history.

Type: string
Default: ~/.mcp/evals.db

--timeout

Maximum time in milliseconds to wait for a single step before marking it as failed.

Type: number
Default: 30000
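
A per-step timeout of this kind is commonly built by racing the step against a timer. The sketch below shows that general pattern under the assumption that a step is represented as a Promise; withTimeout is a hypothetical helper and is not claimed to be how the package enforces --timeout.

```typescript
// Rejects if `work` has not settled within timeoutMs; otherwise passes its result through.
function withTimeout<T>(work: Promise<T>, timeoutMs: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_resolve, reject) => {
    timer = setTimeout(
      () => reject(new Error(`step timed out after ${timeoutMs} ms`)),
      timeoutMs,
    );
  });
  // Clear the timer either way so it does not keep the process alive.
  return Promise.race([work, timeout]).finally(() => {
    if (timer !== undefined) clearTimeout(timer);
  });
}

// A step that takes 200 ms, run with a 20 ms budget.
const slowStep = new Promise<string>((resolve) => setTimeout(() => resolve("late"), 200));

withTimeout(slowStep, 20).catch((err: Error) => console.log(err.message));
// step timed out after 20 ms
```

Note that the race only stops waiting; it does not cancel the underlying work, so a real runner would also need to stop or discard the timed-out step.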

--watch

Watch the fixtures directory and rerun the affected fixture automatically when files change.

Type: boolean
Default: false

--format

Output format for eval results.

Type: string
Choices: console, json, html
Default: console

--concurrency

Number of test cases to run in parallel.

Type: number
Default: 1

--http-port

Start an HTTP server on this port instead of stdio transport.

Type: number
Default: disabled (uses stdio)

Pass flags via the args property in your JSON config:

{
  "mcpServers": {
    "eval-runner": {
      "command": "npx",
      "args": ["-y", "mcp-eval-runner@latest", "--watch", "--timeout=60000"]
    }
  }
}

Design principles

  • No mocking: Live mode evals run against real servers. Correctness is non-negotiable.
  • Fixtures are text: YAML/JSON in version control; no proprietary formats or databases.
  • Dogfood-first: The eval runner's own smoke fixture tests the eval runner itself.

Verification

Before publishing a new version, verify the server with MCP Inspector to confirm all tools are exposed correctly and the protocol handshake succeeds.

Interactive UI (opens browser):

npm run build && npm run inspect

CLI mode (scripted / CI-friendly):

# List all tools
npx @modelcontextprotocol/inspector --cli node dist/index.js --method tools/list

# List resources and prompts
npx @modelcontextprotocol/inspector --cli node dist/index.js --method resources/list
npx @modelcontextprotocol/inspector --cli node dist/index.js --method prompts/list

# Call a read-only tool without arguments
npx @modelcontextprotocol/inspector --cli node dist/index.js \
  --method tools/call --tool-name list_cases

# Call a tool with arguments
npx @modelcontextprotocol/inspector --cli node dist/index.js \
  --method tools/call --tool-name run_case --tool-arg name=smoke

Run these checks before publishing to catch regressions in tool registration and runtime startup.

Contributing

New assertion types go in src/assertions.ts: implement the Assertion interface and add a test. Tests live under tests/ (unit tests) and under evals/ (eval fixtures).

npm install && npm test

MCP Registry & Marketplace

This plugin is available on:

  • MCP Registry
  • MCP Market

Search for mcp-eval-runner.

Configuration

YOUR_API_KEY* (secret)

Your API key for the service

Categories: AI & LLM Tools
Registry: active
Package: mcp-eval-runner
Transport: STDIO
Auth: Required
Updated: Mar 23, 2026