Adds consensus measurement tools to Claude, using Fleiss' kappa and bootstrap confidence intervals to check whether AI models agree with themselves or with each other. The eight MCP tools let you run multi-model evaluations across Bedrock, OpenAI, and Gemini, generate statistical reports, compare runs over time, and estimate costs before executing. Self-consistency mode uses MCP Sampling to test the host model without external API keys. Reach for this when you need statistically rigorous evidence that an AI gives consistent answers, especially in high-stakes applications where agreement matters more than speed. Also includes schema validation and AI-powered schema suggestion from your data.
One command. Find out if your AI agrees with itself.
ConKurrence is a statistically validated consensus measurement toolkit for AI evaluation pipelines. It uses multiple AI models as independent raters, measures inter-rater reliability with Fleiss' kappa and bootstrap confidence intervals, and routes contested items to human experts.
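To make the statistics concrete, here is a minimal sketch of Fleiss' kappa with a percentile bootstrap confidence interval. It is an illustration only, not ConKurrence's actual implementation: the function names `fleiss_kappa` and `bootstrap_ci`, the count-matrix input format, and the choice of a percentile bootstrap over items are all assumptions made for this example.

```python
import math
import random


def fleiss_kappa(counts):
    """Fleiss' kappa for counts[i][j] = raters who put item i in category j.

    Assumes every item was rated by the same number of raters (at least 2).
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each item needs the same number (>= 2) of ratings")
    n_cats = len(counts[0])
    total = n_items * n_raters

    # Share of all ratings that fell into each category.
    p_j = [sum(row[j] for row in counts) / total for j in range(n_cats)]
    # Observed agreement per item, averaged over items.
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_i) / n_items
    # Agreement expected by chance.
    p_e = sum(p * p for p in p_j)
    if 1.0 - p_e < 1e-12:
        return float("nan")  # undefined when only one category is ever used
    return (p_bar - p_e) / (1.0 - p_e)


def bootstrap_ci(counts, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for kappa, resampling items with replacement."""
    rng = random.Random(seed)
    n = len(counts)
    stats = []
    for _ in range(n_boot):
        sample = [counts[rng.randrange(n)] for _ in range(n)]
        k = fleiss_kappa(sample)
        if not math.isnan(k):  # skip degenerate resamples
            stats.append(k)
    if not stats:
        raise ValueError("every bootstrap resample was degenerate")
    stats.sort()
    low = stats[int((alpha / 2) * (len(stats) - 1))]
    high = stats[int((1 - alpha / 2) * (len(stats) - 1))]
    return low, high


# Hypothetical data: 6 items, 3 raters, 3 categories.
ratings = [[3, 0, 0], [0, 3, 0], [2, 1, 0], [1, 1, 1], [0, 0, 3], [3, 0, 0]]
kappa = fleiss_kappa(ratings)
low, high = bootstrap_ci(ratings)
print(f"kappa={kappa:.3f}, 95% CI=({low:.3f}, {high:.3f})")
```

Resampling whole items rather than individual ratings keeps each item's set of ratings together, which is why the interval reflects uncertainty about the item sample. Items whose per-item agreement is low are the natural candidates for human review.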
npm install -g conkurrence
Use ConKurrence as an MCP server in Claude Desktop or any MCP-compatible client:
npx conkurrence mcp
Add to your claude_desktop_config.json:
{
  "mcpServers": {
    "conkurrence": {
      "command": "npx",
      "args": ["-y", "conkurrence", "mcp"]
    }
  }
}
To install it as a Claude Code plugin instead, add the marketplace:
/plugin marketplace add AlligatorC0der/conkurrence
| Tool | Description |
|---|---|
| `conkurrence_run` | Execute an evaluation across multiple AI raters |
| `conkurrence_report` | Generate a detailed markdown report |
| `conkurrence_compare` | Side-by-side comparison of two runs |
| `conkurrence_trend` | Track agreement over multiple runs |
| `conkurrence_suggest` | AI-powered schema suggestion from your data |
| `conkurrence_validate_schema` | Validate a schema before running |
| `conkurrence_estimate` | Estimate cost and token usage |
BUSL-1.1 — Business Source License 1.1