RAGScore

hzyai/ragscore
32 • auth • STDIO • registry active
Summary

This server brings RAGScore's evaluation toolkit into Claude, letting you generate synthetic QA datasets from documents and benchmark RAG systems without switching contexts. You can create tailored question-answer pairs targeting specific audiences (developers, customers, auditors), run multi-metric evaluations across correctness, completeness, and faithfulness, and diagnose failure modes (retriever miss vs generator hallucination). It works with any LLM provider, including local Ollama models for fully private workflows. The detailed evaluation mode gives you five diagnostic dimensions per answer in a single call, making it practical for iterating on retrieval strategies or prompt engineering. Use it when you're building or debugging a RAG pipeline and need systematic test coverage rather than manual spot checking.

Generate QA datasets & evaluate RAG systems in 2 commands

🔒 Privacy-First • ⚡ Lightning Fast • 🤖 Any LLM • 🏠 Local or Cloud • 🌍 Multilingual


⚡ 2-Line RAG Evaluation

# Step 1: Generate QA pairs from your docs
ragscore generate docs/

# Step 2: Evaluate your RAG system
ragscore evaluate http://localhost:8000/query

That's it. Get accuracy scores and incorrect QA pairs instantly.

============================================================
✅ EXCELLENT: 85/100 correct (85.0%)
Average Score: 4.20/5.0
============================================================

❌ 15 Incorrect Pairs:

  1. Q: "What is RAG?"
     Score: 2/5 - Factually incorrect

  2. Q: "How does retrieval work?"
     Score: 3/5 - Incomplete answer

🚀 Quick Start

Install

pip install ragscore              # Core (works with Ollama)
pip install "ragscore[openai]"    # + OpenAI support
pip install "ragscore[notebook]"  # + Jupyter/Colab support
pip install "ragscore[all]"       # + All providers

Already installed? Keep up to date — new versions add features like failure diagnosis and retrieved context capture:

pip install --upgrade ragscore

Option 1: Python API (Notebook-Friendly)

Perfect for Jupyter, Colab, and rapid iteration. Get instant visualizations.

from ragscore import quick_test

# 1. Audit your RAG in one line
result = quick_test(
    endpoint="http://localhost:8000/query",  # Your RAG API
    docs="docs/",                            # Your documents
    n=10,                                    # Number of test questions
)

# 1b. Tailored QA — target specific audiences
result = quick_test(
    endpoint="http://localhost:8000/query",
    docs="docs/",
    audience="developers",                   # Who asks the questions?
    purpose="api-integration",               # What's the document for?
)

# 2. See the report
result.plot()

# 3. Inspect failures
bad_rows = result.df[result.df['score'] < 3]
display(bad_rows[['question', 'rag_answer', 'reason']])

Rich Object API:

  • result.accuracy - Accuracy score
  • result.df - Pandas DataFrame of all results
  • result.plot() - 3-panel visualization (4-panel with detailed=True)
  • result.corrections - List of items to fix

Option 2: CLI (Production)

Generate QA Pairs

# Set API key (or use local Ollama - no key needed!)
export OPENAI_API_KEY="sk-..."

# Generate from any document
ragscore generate paper.pdf
ragscore generate docs/*.pdf --concurrency 10

# Tailored QA generation — target specific audiences
ragscore generate docs/ --audience developers --purpose faq
ragscore generate docs/ --audience customers --purpose "pre-sales"
ragscore generate docs/ --audience "compliance auditors" --purpose "security audit"

Evaluate Your RAG

# Point to your RAG endpoint
ragscore evaluate http://localhost:8000/query

# Custom options
ragscore evaluate http://api/ask --model gpt-4o --output results.json

🔬 Detailed Multi-Metric Evaluation

Go beyond a single score. Add detailed=True to get 5 diagnostic dimensions per answer — in the same single LLM call.

result = quick_test(
    endpoint=my_rag,  # Your RAG endpoint, e.g. the URL used in the examples above
    docs="docs/",
    n=10,
    detailed=True,  # ⭐ Enable multi-metric evaluation
)

# Inspect per-question metrics
display(result.df[[
    "question", "score", "correctness", "completeness",
    "relevance", "conciseness", "faithfulness"
]])

# Radar chart + 4-panel visualization
result.plot()

Sample summary output:

==================================================
✅ PASSED: 9/10 correct (90%)
Average Score: 4.3/5.0
Threshold: 70%
──────────────────────────────────────────────────
  Correctness: 4.5/5.0
  Completeness: 4.2/5.0
  Relevance: 4.8/5.0
  Conciseness: 4.1/5.0
  Faithfulness: 4.6/5.0
==================================================
Metric       | What it measures                | Scale
Correctness  | Semantic match to golden answer | 5 = fully correct
Completeness | Covers all key points           | 5 = fully covered
Relevance    | Addresses the question asked    | 5 = perfectly on-topic
Conciseness  | Focused, no filler              | 5 = concise and precise
Faithfulness | No fabricated claims            | 5 = fully faithful
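
As a rough sketch of how the per-answer metrics roll up into the per-metric averages in the summary, the snippet below averages each dimension over a few hand-made rows and reports the weakest one. The rows are illustrative, not real RAGScore output; only the column names come from the result.df example, and metric_averages and weakest_metric are hypothetical helpers, not part of the ragscore package.

```python
from statistics import mean

METRICS = ["correctness", "completeness", "relevance", "conciseness", "faithfulness"]

# Illustrative rows only: shaped like the detailed columns of result.df
rows = [
    {"question": "What is RAG?", "correctness": 5, "completeness": 4,
     "relevance": 5, "conciseness": 4, "faithfulness": 5},
    {"question": "How does retrieval work?", "correctness": 3, "completeness": 2,
     "relevance": 4, "conciseness": 4, "faithfulness": 4},
]

def metric_averages(rows):
    """Return the mean score (out of 5) for each metric across all rows."""
    return {m: mean(r[m] for r in rows) for m in METRICS}

def weakest_metric(averages):
    """Return the name of the metric with the lowest average score."""
    return min(averages, key=averages.get)

averages = metric_averages(rows)
for name, value in averages.items():
    print(f"{name.capitalize()}: {value:.1f}/5.0")
print("Weakest dimension:", weakest_metric(averages))
```

With a real run, the same aggregation could be applied to the rows of result.df; a low completeness average with high faithfulness, for instance, points at partial answers rather than fabricated ones.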

CLI:

ragscore evaluate http://localhost:8000/query --detailed

🔍 Failure Diagnosis (--diagnose)

When answers fail, --diagnose tells you why — retriever miss, generator hallucination, incomplete answer, or wrong interpretation:

ragscore evaluate http://localhost:8000/query --diagnose

Sample output:

🔍 Failure Diagnosis:
  Retriever Miss: 3 (42.9%)
  Generator Hallucination: 2 (28.6%)
  Incomplete Answer: 1 (14.3%)
  Wrong Interpretation: 1 (14.3%)

Diagnosis uses the support_span already generated with each QA pair to give the judge grounding context. Combine with --detailed for full diagnostics:

ragscore evaluate http://localhost:8000/query --diagnose --detailed -o results.json

Category                | Meaning
Retriever Miss          | RAG didn't retrieve the chunk containing the evidence
Generator Hallucination | Retrieved correctly but fabricated information
Incomplete Answer       | Retrieved correctly but answer is partial
Wrong Interpretation    | Retrieved correctly but misunderstood the content
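
The percentages in the sample diagnosis output are each category's share of the failed answers (seven in that example). The sketch below reproduces that arithmetic from a plain list of labels; the list is hand-made to match the sample counts, and failure_breakdown is a hypothetical helper rather than a RAGScore function.

```python
from collections import Counter

# Illustrative labels for seven failed answers (category names from the table above)
diagnoses = (
    ["Retriever Miss"] * 3
    + ["Generator Hallucination"] * 2
    + ["Incomplete Answer"]
    + ["Wrong Interpretation"]
)

def failure_breakdown(labels):
    """Return (category, count, percent) tuples, most frequent first."""
    counts = Counter(labels)
    total = sum(counts.values())
    return [(cat, n, round(100 * n / total, 1)) for cat, n in counts.most_common()]

breakdown = failure_breakdown(diagnoses)
for category, count, percent in breakdown:
    print(f"{category}: {count} ({percent}%)")
```

A breakdown dominated by Retriever Miss suggests working on chunking or retrieval first; one dominated by Generator Hallucination suggests working on the prompt or the generating model.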

📓 Full demo notebook — build a mini RAG and test it with detailed metrics.

🎯 Audience & Purpose demo — generate tailored QA for developers, customers, auditors, and more.

🏠 Ollama local demo — 100% private RAG evaluation with no API keys.


🏠 100% Private with Local LLMs

# Use Ollama - no API keys, no cloud, 100% private
ollama pull llama3.1
ragscore generate confidential_docs/*.pdf
ragscore evaluate http://localhost:8000/query

Perfect for: Healthcare 🏥 • Legal ⚖️ • Finance 🏦 • Research 🔬

Ollama Model Recommendations

RAGScore generates complex structured QA pairs (question + answer + rationale + support span) in JSON format. This requires models with strong instruction-following and JSON output capabilities.

Model        | Size  | Min memory | QA Quality | Recommended use
llama3.1:70b | 40GB  | 48GB VRAM  | Excellent  | GPU server (A100, L40)
qwen2.5:32b  | 18GB  | 24GB VRAM  | Excellent  | GPU server (A10, L20)
llama3.1:8b  | 4.7GB | 8GB VRAM   | Good       | Best local choice
qwen2.5:7b   | 4.4GB | 8GB VRAM   | Good       | Good local alternative
mistral:7b   | 4.1GB | 8GB VRAM   | Good       | Good local alternative
llama3.2:3b  | 2.0GB | 4GB RAM    | Fair       | CPU-only / testing
qwen2.5:1.5b | 1.0GB | 2GB RAM    | Poor       | Not recommended

Minimum recommended: 8B+ models. Smaller models (1.5B–3B) produce lower-quality support spans and may time out on longer chunks.

Ollama Performance Guide

# Recommended: 8B model with concurrency 2 for local machines
ollama pull llama3.1:8b
ragscore generate docs/ --provider ollama --model llama3.1:8b

# GPU server (A10/L20): larger model with higher concurrency
ollama pull qwen2.5:32b
ragscore generate docs/ --provider ollama --model qwen2.5:32b --concurrency 5

Expected performance (28 chunks, 5 QA pairs per chunk):

Hardware       | Model       | Time     | Concurrency
MacBook (CPU)  | llama3.2:3b | ~45 min  | 2
MacBook (CPU)  | llama3.1:8b | ~25 min  | 2
A10 (24GB)     | llama3.1:8b | ~3–5 min | 5
L20/L40 (48GB) | qwen2.5:32b | ~3–5 min | 5
OpenAI API     | gpt-4o-mini | ~2 min   | 10

RAGScore auto-reduces concurrency to 2 for local Ollama to avoid GPU/CPU contention.
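
To put the timings above in per-item terms: 28 chunks at 5 QA pairs per chunk is 140 pairs, so a 25-minute run works out to roughly 11 seconds per pair. The sketch below does that arithmetic and projects a larger corpus. The projection assumes time scales linearly with the number of chunks, which is an assumption for illustration rather than a measured property of RAGScore, and both helper functions are hypothetical.

```python
def seconds_per_pair(total_minutes, chunks=28, pairs_per_chunk=5):
    """Average wall-clock seconds spent per generated QA pair."""
    total_pairs = chunks * pairs_per_chunk
    return total_minutes * 60 / total_pairs

def estimate_minutes(chunks, reference_minutes, reference_chunks=28):
    """Scale a reference run linearly to a different number of chunks (assumed)."""
    return reference_minutes * chunks / reference_chunks

# Reference timings from the table above (upper bound where a range is given)
timings = {
    "MacBook (CPU), llama3.2:3b": 45,
    "MacBook (CPU), llama3.1:8b": 25,
    "A10 (24GB), llama3.1:8b": 5,
    "OpenAI API, gpt-4o-mini": 2,
}
for setup, minutes in timings.items():
    print(f"{setup}: ~{seconds_per_pair(minutes):.1f} s per QA pair")

# Rough projection for a 280-chunk corpus on the MacBook 8B setup
print(f"280 chunks: ~{estimate_minutes(280, 25):.0f} min")
```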


🔌 Supported LLMs

Provider              | Setup                          | Notes
Ollama                | ollama serve                   | Local, free, private
OpenAI                | export OPENAI_API_KEY="sk-..." | Best quality
Anthropic             | export ANTHROPIC_API_KEY="..." | Long context
DashScope             | export DASHSCOPE_API_KEY="..." | Qwen models
vLLM                  | export LLM_BASE_URL="..."      | Production-grade
Any OpenAI-compatible | export LLM_BASE_URL="..."      | Groq, Together, etc.

📊 Output Formats

Generated QA Pairs (output/generated_qas.jsonl)

{
  "id": "abc123",
  "question": "What is RAG?",
  "answer": "RAG (Retrieval-Augmented Generation) combines...",
  "rationale": "This is explicitly stated in the introduction...",
  "support_span": "RAG systems retrieve relevant documents...",
  "difficulty": "medium",
  "source_path": "docs/rag_intro.pdf"
}
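
A minimal sketch of reading such a file, assuming the usual JSON Lines layout of one JSON object per line (the record above is pretty-printed for readability). The two records are illustrative: the second id and the "hard" difficulty value are made up for the example, and load_qa_pairs is a hypothetical helper.

```python
import json
from collections import Counter

# Illustrative records only; field names follow the sample record above.
# "def456" and the "hard" difficulty are invented for this example.
records = [
    {"id": "abc123", "question": "What is RAG?", "difficulty": "medium",
     "source_path": "docs/rag_intro.pdf"},
    {"id": "def456", "question": "How does retrieval work?", "difficulty": "hard",
     "source_path": "docs/rag_intro.pdf"},
]
# JSON Lines: one compact JSON object per line
sample_jsonl = "\n".join(json.dumps(r) for r in records) + "\n"

def load_qa_pairs(text):
    """Parse JSONL text into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]

pairs = load_qa_pairs(sample_jsonl)
by_difficulty = Counter(p["difficulty"] for p in pairs)
by_source = Counter(p["source_path"] for p in pairs)
print(dict(by_difficulty))
print(dict(by_source))
```

For a real run, the text would come from the generated file instead, for example by reading output/generated_qas.jsonl with pathlib.Path(...).read_text(encoding="utf-8").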

Evaluation Results (--output results.json)

{
  "summary": {
    "total": 100,
    "correct": 85,
    "incorrect": 15,
    "accuracy": 0.85,
    "avg_score": 4.2
  },
  "incorrect_pairs": [
    {
      "question": "What is RAG?",
      "golden_answer": "RAG combines retrieval with generation...",
      "rag_answer": "RAG is a database system.",
      "score": 2,
      "reason": "Factually incorrect - RAG is not a database"
    }
  ]
}
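
Because the results file is plain JSON, it is straightforward to gate a CI job on it. The sketch below parses a trimmed copy of the sample above, checks that the summary counts are consistent, and compares accuracy to a threshold. The gate function is a hypothetical helper, and the 70% default simply mirrors the threshold shown in the detailed-evaluation sample output.

```python
import json

# Trimmed copy of the sample results shown above
results_text = """
{
  "summary": {"total": 100, "correct": 85, "incorrect": 15,
              "accuracy": 0.85, "avg_score": 4.2},
  "incorrect_pairs": [
    {"question": "What is RAG?", "score": 2,
     "reason": "Factually incorrect - RAG is not a database"}
  ]
}
"""

def gate(results, min_accuracy=0.70):
    """Return (passed, accuracy) after a basic consistency check on the summary."""
    summary = results["summary"]
    if summary["correct"] + summary["incorrect"] != summary["total"]:
        raise ValueError("summary counts are inconsistent")
    accuracy = summary["correct"] / summary["total"]
    return accuracy >= min_accuracy, accuracy

results = json.loads(results_text)
passed, accuracy = gate(results)
print(f"accuracy={accuracy:.1%} passed={passed}")
for pair in results["incorrect_pairs"]:
    print(f"- {pair['question']} (score {pair['score']}): {pair['reason']}")
```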

🧪 Python API

from ragscore import run_pipeline, run_evaluation

# Generate QA pairs
run_pipeline(paths=["docs/"], concurrency=10)

# Generate tailored QA pairs for specific audiences
run_pipeline(
    paths=["docs/"],
    audience="support engineers",
    purpose="fine-tuning a support chatbot",
)

# Evaluate RAG
results = run_evaluation(
    endpoint="http://localhost:8000/query",
    model="gpt-4o",  # LLM for judging
)
print(f"Accuracy: {results.accuracy:.1%}")

🤖 AI Agent Integration

RAGScore is designed for AI agents and automation:

# Structured CLI with predictable output
ragscore generate docs/ --concurrency 5
ragscore evaluate http://api/query --output results.json

# Exit codes: 0 = success, 1 = error
# JSON output for programmatic parsing

CLI Reference:

Command                                    | Description
ragscore generate <paths>                  | Generate QA pairs from documents
ragscore generate <paths> --audience <who> | Tailored QA for specific audience
ragscore generate <paths> --purpose <why>  | Focus QA on document purpose
ragscore evaluate <endpoint>               | Evaluate RAG against golden QAs
ragscore evaluate <endpoint> --detailed    | Multi-metric evaluation
ragscore evaluate <endpoint> --diagnose    | Failure root-cause classification
ragscore --help                            | Show all commands and options
ragscore generate --help                   | Show generate options
ragscore evaluate --help                   | Show evaluate options
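
One way an agent or script might drive the CLI is sketched below: assemble the argument list from the flags documented above, run it with subprocess, and treat exit code 0 as success. Both helpers are hypothetical wrappers written for this example, not part of the ragscore package, and the run step is defined but not executed here since it needs ragscore installed and a live endpoint.

```python
import subprocess

def build_evaluate_command(endpoint, output=None, detailed=False,
                           diagnose=False, model=None):
    """Assemble the argv list for `ragscore evaluate` from documented flags."""
    cmd = ["ragscore", "evaluate", endpoint]
    if model:
        cmd += ["--model", model]
    if detailed:
        cmd.append("--detailed")
    if diagnose:
        cmd.append("--diagnose")
    if output:
        cmd += ["--output", output]
    return cmd

def run_evaluation_cli(cmd):
    """Run the command; return True on exit code 0 (success), False otherwise."""
    completed = subprocess.run(cmd, capture_output=True, text=True)
    return completed.returncode == 0

cmd = build_evaluate_command(
    "http://localhost:8000/query",
    output="results.json",
    detailed=True,
    diagnose=True,
)
print(" ".join(cmd))
```

Passing the arguments as a list rather than a shell string avoids quoting problems with values such as multi-word audiences or URLs containing query strings.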

⚙️ Configuration

Zero config required. Optional environment variables:

export RAGSCORE_CHUNK_SIZE=512          # Chunk size for documents
export RAGSCORE_QUESTIONS_PER_CHUNK=5   # QAs per chunk
export RAGSCORE_WORK_DIR=/path/to/dir   # Working directory
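
When RAGScore is launched from a Python script rather than a shell, the same variables can be set on the child process environment. The sketch below builds such an environment; ragscore_env is a hypothetical helper, and its default values just mirror the example exports above (the source does not state whether they are RAGScore's built-in defaults).

```python
import os

def ragscore_env(chunk_size=512, questions_per_chunk=5, work_dir=None, base_env=None):
    """Return a copy of the environment with RAGScore settings applied.

    Default values mirror the example exports above; they are not
    confirmed to be RAGScore's own defaults.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["RAGSCORE_CHUNK_SIZE"] = str(chunk_size)
    env["RAGSCORE_QUESTIONS_PER_CHUNK"] = str(questions_per_chunk)
    if work_dir is not None:
        env["RAGSCORE_WORK_DIR"] = str(work_dir)
    return env

# Start from an empty base so the printed result shows only the RAGScore settings
env = ragscore_env(chunk_size=256, questions_per_chunk=3, base_env={})
print(env)
```

The resulting dictionary can be passed as the env argument of subprocess.run, which leaves the parent process environment untouched.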

🔐 Privacy & Security

Data               | Cloud LLM      | Local LLM
Documents          | ✅ Local       | ✅ Local
Text chunks        | ⚠️ Sent to LLM | ✅ Local
Generated QAs      | ✅ Local       | ✅ Local
Evaluation results | ✅ Local       | ✅ Local

Compliance: GDPR ✅ • HIPAA ✅ (with local LLMs) • SOC 2 ✅


🧪 Development

git clone https://github.com/HZYAI/RagScore.git
cd RagScore
pip install -e ".[dev,all]"
pytest

📡 Telemetry

RAGScore collects telemetry only in MCP server mode (ragscore serve). Standard CLI and Python API usage do not send telemetry.

We collect limited anonymous operational metrics to understand feature usage and improve reliability. No document content, prompts, QA text, model outputs, API keys, endpoint URLs, or file paths are collected.

Collected in MCP mode:

  • MCP tool invoked
  • LLM provider and model name
  • ragscore version, Python version, OS type
  • Success/failure status
  • Random anonymous installation ID

Opt out:

export RAGSCORE_NO_TELEMETRY=1

🔗 Links

  • GitHub • PyPI • Issues • Discussions

⭐ Star us on GitHub if RAGScore helps you!
Made with ❤️ for the RAG community

Configuration

  • OPENAI_API_KEY (secret) - OpenAI API key (if using the OpenAI provider)
  • ANTHROPIC_API_KEY (secret) - Anthropic API key (if using the Anthropic provider)

Categories: AI & LLM Tools
Registry: active
Package: ragscore
Transport: STDIO
Auth: Required
Updated: May 19, 2026