Promptfoo Evaluation

daymade/claude-code-skills
547 installs · 1.1k stars
Summary

Sets up and runs LLM evaluations with Promptfoo, an open-source CLI for testing prompts across different models. You'll reach for this when you need to compare Claude and GPT outputs side by side, write custom Python assertions for specific quality checks, or use LLM-as-judge scoring with rubrics. The skill covers the whole workflow: creating promptfooconfig.yaml, managing test cases with variable injection, implementing few-shot examples in chat format, and handling the gotchas like maxConcurrency placement and file path resolution. One thing to watch: if you're running through a relay API, every llm-rubric assertion needs its own provider config with apiBaseUrl or you'll hit 401 errors.

Install to Claude Code

npx -y skills add daymade/claude-code-skills --skill promptfoo-evaluation --agent claude-code

Installs into .claude/skills of the current project.

Files
SKILL.md

Promptfoo Evaluation

Overview

This skill provides guidance for configuring and running LLM evaluations using Promptfoo, an open-source CLI tool for testing and comparing LLM outputs.

Quick Start

# Initialize a new evaluation project
npx promptfoo@latest init

# Run evaluation
npx promptfoo@latest eval

# View results in browser
npx promptfoo@latest view

Configuration Structure

A typical Promptfoo project structure:

project/
├── promptfooconfig.yaml    # Main configuration
├── prompts/
│   ├── system.md           # System prompt
│   └── chat.json           # Chat format prompt
├── tests/
│   └── cases.yaml          # Test cases
└── scripts/
    └── metrics.py          # Custom Python assertions

Core Configuration (promptfooconfig.yaml)

# yaml-language-server: $schema=https://promptfoo.dev/config-schema.json
description: "My LLM Evaluation"

# Prompts to test
prompts:
  - file://prompts/system.md
  - file://prompts/chat.json

# Models to compare
providers:
  - id: anthropic:messages:claude-sonnet-4-6
    label: Claude-Sonnet-4.6
  - id: openai:gpt-4.1
    label: GPT-4.1

# Test cases
tests: file://tests/cases.yaml

# Concurrency control (MUST be under commandLineOptions, NOT top-level)
commandLineOptions:
  maxConcurrency: 2

# Default assertions for all tests
defaultTest:
  assert:
    - type: python
      value: file://scripts/metrics.py:custom_assert
    - type: llm-rubric
      value: |
        Evaluate the response quality on a 0-1 scale.
      threshold: 0.7

# Output path
outputPath: results/eval-results.json

Prompt Formats

Text Prompt (system.md)

You are a helpful assistant.

Task: {{task}}
Context: {{context}}

Chat Format (chat.json)

[
  {"role": "system", "content": "{{system_prompt}}"},
  {"role": "user", "content": "{{user_input}}"}
]

Few-Shot Pattern

Embed examples directly in the prompt text, or use the chat format with assistant messages carrying the example outputs:

[
  {"role": "system", "content": "{{system_prompt}}"},
  {"role": "user", "content": "Example input: {{example_input}}"},
  {"role": "assistant", "content": "{{example_output}}"},
  {"role": "user", "content": "Now process: {{actual_input}}"}
]

Test Cases (tests/cases.yaml)

- description: "Test case 1"
  vars:
    system_prompt: file://prompts/system.md
    user_input: "Hello world"
    # Load content from files
    context: file://data/context.txt
  assert:
    - type: contains
      value: "expected text"
    - type: python
      value: file://scripts/metrics.py:custom_check
      threshold: 0.8

Python Custom Assertions

Create a Python file for custom assertions (e.g., scripts/metrics.py):

def get_assert(output: str, context: dict) -> dict:
    """Default assertion function."""
    vars_dict = context.get('vars', {})

    # Access test variables
    expected = vars_dict.get('expected', '')

    # Return result
    return {
        "pass": expected in output,
        "score": 0.8,
        "reason": "Contains expected content",
        "named_scores": {"relevance": 0.9}
    }

def custom_check(output: str, context: dict) -> dict:
    """Custom named assertion."""
    word_count = len(output.split())
    passed = 100 <= word_count <= 500

    return {
        "pass": passed,
        "score": min(1.0, word_count / 300),
        "reason": f"Word count: {word_count}"
    }

Key points:

  • Default function name is get_assert
  • Specify function with file://path.py:function_name
  • Return bool, float (score), or dict with pass/score/reason
  • Access variables via context['vars']
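The three return shapes listed above can be sketched side by side. The function names (is_nonempty, length_score, keyword_check) and the keyword test variable are hypothetical, chosen only for illustration; each function would be referenced as file://scripts/metrics.py:function_name.

```python
def is_nonempty(output: str, context: dict) -> bool:
    """Bool return: a plain pass/fail."""
    return bool(output.strip())


def length_score(output: str, context: dict) -> float:
    """Float return: used as the score (capped at 1.0 here)."""
    return min(1.0, len(output.split()) / 100)


def keyword_check(output: str, context: dict) -> dict:
    """Dict return: explicit pass, score, and reason."""
    # `keyword` is a hypothetical test variable read from context['vars'].
    keyword = context.get("vars", {}).get("keyword", "")
    found = bool(keyword) and keyword.lower() in output.lower()
    return {
        "pass": found,
        "score": 1.0 if found else 0.0,
        "reason": f"Keyword {keyword!r} {'found' if found else 'missing'}",
    }
```

The dict form is the one to prefer when the results view should show why a case failed, since only it carries a reason.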

LLM-as-Judge (llm-rubric)

assert:
  - type: llm-rubric
    value: |
      Evaluate the response based on:
      1. Accuracy of information
      2. Clarity of explanation
      3. Completeness

      Score 0.0-1.0 where 0.7+ is passing.
    threshold: 0.7
    provider: openai:gpt-4.1  # Optional: override grader model

When using a relay/proxy API, each llm-rubric assertion needs its own provider config with apiBaseUrl. Otherwise the grader falls back to the default Anthropic/OpenAI endpoint and gets 401 errors:

assert:
  - type: llm-rubric
    value: |
      Evaluate quality on a 0-1 scale.
    threshold: 0.7
    provider:
      id: anthropic:messages:claude-sonnet-4-6
      config:
        apiBaseUrl: https://your-relay.example.com/api

Best practices:

  • Provide clear scoring criteria
  • Use threshold to set minimum passing score
  • Default grader uses available API keys (OpenAI → Anthropic → Google)
  • When using relay/proxy: every llm-rubric must have its own provider with apiBaseUrl — the main provider's apiBaseUrl is NOT inherited
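Because the grader provider is not inherited, it has to be repeated on every llm-rubric assertion. If the configuration is generated by a script, that repetition can be automated. The sketch below works on already-parsed assertion data (for example, loaded with a YAML library); the helper name with_grader_provider is hypothetical and not part of Promptfoo.

```python
import copy


def with_grader_provider(assertions: list, provider: dict) -> list:
    """Return a copy of `assertions` in which every llm-rubric entry that has
    no provider of its own receives the given provider config."""
    patched = copy.deepcopy(assertions)
    for assertion in patched:
        if assertion.get("type") == "llm-rubric" and "provider" not in assertion:
            assertion["provider"] = copy.deepcopy(provider)
    return patched


grader = {
    "id": "anthropic:messages:claude-sonnet-4-6",
    "config": {"apiBaseUrl": "https://your-relay.example.com/api"},
}
asserts = [
    {"type": "contains", "value": "expected text"},
    {"type": "llm-rubric", "value": "Evaluate quality on a 0-1 scale.", "threshold": 0.7},
]
patched = with_grader_provider(asserts, grader)
```

Only the llm-rubric entry is changed; the contains assertion and the original list are left untouched.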

Common Assertion Types

Type         Usage              Example
contains     Check substring    value: "hello"
icontains    Case-insensitive   value: "HELLO"
equals       Exact match        value: "42"
regex        Pattern match      value: "\\d{4}"
python       Custom logic       value: file://script.py
llm-rubric   LLM grading        value: "Is professional"
latency      Response time      threshold: 1000
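To make the string-based rows of the table concrete, here is a rough Python approximation of contains, icontains, equals, and regex. This is not Promptfoo's implementation, and the check helper is hypothetical; in particular, letting the regex match anywhere in the output is an assumption.

```python
import re


def check(assert_type: str, value: str, output: str) -> bool:
    """Approximate the string-based assertion types from the table."""
    if assert_type == "contains":
        return value in output
    if assert_type == "icontains":
        return value.lower() in output.lower()
    if assert_type == "equals":
        return output == value
    if assert_type == "regex":
        # Assumption: the pattern may match anywhere in the output.
        return re.search(value, output) is not None
    raise ValueError(f"unsupported assertion type: {assert_type}")
```

Note that the YAML value "\\d{4}" in the table is a double-quoted string, so the pattern the matcher actually sees is \d{4}.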

File References

All file:// paths are resolved relative to promptfooconfig.yaml location (NOT the YAML file containing the reference). This is a common gotcha when tests: references a separate YAML file — the file:// paths inside that test file still resolve from the config root.

# Load file content as variable
vars:
  content: file://data/input.txt

# Load prompt from file
prompts:
  - file://prompts/main.md

# Load test cases from file
tests: file://tests/cases.yaml

# Load Python assertion
assert:
  - type: python
    value: file://scripts/check.py:validate
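The resolution rule can be illustrated with a few lines of Python. The helper resolve_file_ref is hypothetical and only mimics the behavior described above for relative paths; it is not how Promptfoo resolves paths internally.

```python
from pathlib import Path


def resolve_file_ref(config_path: str, ref: str) -> Path:
    """Resolve a file:// reference against the directory holding the config."""
    if not ref.startswith("file://"):
        raise ValueError(f"not a file:// reference: {ref}")
    relative = ref[len("file://"):]
    # Drop the optional ":function_name" suffix used by Python assertions.
    file_part = relative.split(":", 1)[0]
    return Path(config_path).parent / file_part


# A reference written inside tests/cases.yaml still resolves from project/,
# not from project/tests/.
resolved = resolve_file_ref("project/promptfooconfig.yaml", "file://data/context.txt")
print(resolved.as_posix())  # project/data/context.txt
```

In practice this means a path such as file://data/context.txt reads the same whether it appears in the main config or in a separate test file.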

Running Evaluations

# Basic run
npx promptfoo@latest eval

# With specific config
npx promptfoo@latest eval --config path/to/config.yaml

# Output to file
npx promptfoo@latest eval --output results.json

# Filter tests
npx promptfoo@latest eval --filter-metadata category=math

# View results
npx promptfoo@latest view

Relay / Proxy API Configuration

When using an API relay or proxy instead of direct Anthropic/OpenAI endpoints:

providers:
  - id: anthropic:messages:claude-sonnet-4-6
    label: Claude-Sonnet-4.6
    config:
      max_tokens: 4096
      apiBaseUrl: https://your-relay.example.com/api  # Promptfoo appends /v1/messages

# CRITICAL: maxConcurrency MUST be under commandLineOptions (NOT top-level)
commandLineOptions:
  maxConcurrency: 1  # Respect relay rate limits

Key rules:

  • apiBaseUrl goes in providers[].config — Promptfoo appends /v1/messages automatically
  • maxConcurrency must be under commandLineOptions: — placing it at top level is silently ignored
  • When using relay with LLM-as-judge, set maxConcurrency: 1 to avoid concurrent request limits (generation + grading share the same pool)
  • Pass relay token as ANTHROPIC_API_KEY env var
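The key rules above lend themselves to a small pre-flight check. The sketch below is a hypothetical lint (lint_relay_config is not a Promptfoo feature) that takes the config as an already-parsed dict and, for brevity, only looks at assertions under defaultTest.

```python
def lint_relay_config(config: dict) -> list:
    """Return warnings for the relay pitfalls listed in the key rules."""
    warnings = []
    if "maxConcurrency" in config:
        warnings.append("maxConcurrency is at top level; move it under commandLineOptions")

    concurrency = config.get("commandLineOptions", {}).get("maxConcurrency")
    default_asserts = config.get("defaultTest", {}).get("assert", [])
    uses_judge = any(a.get("type") == "llm-rubric" for a in default_asserts)
    if uses_judge and concurrency != 1:
        warnings.append("llm-rubric with a relay: set commandLineOptions.maxConcurrency to 1")

    for provider in config.get("providers", []):
        # String providers such as "echo" carry no config and are skipped.
        if isinstance(provider, dict) and "apiBaseUrl" not in provider.get("config", {}):
            warnings.append(f"provider {provider.get('id')} has no apiBaseUrl in config")
    return warnings


rubric = {"type": "llm-rubric", "value": "Evaluate quality on a 0-1 scale."}
bad = {
    "maxConcurrency": 4,
    "providers": [{"id": "anthropic:messages:claude-sonnet-4-6", "config": {"max_tokens": 4096}}],
    "defaultTest": {"assert": [rubric]},
}
good = {
    "commandLineOptions": {"maxConcurrency": 1},
    "providers": [{
        "id": "anthropic:messages:claude-sonnet-4-6",
        "config": {"apiBaseUrl": "https://your-relay.example.com/api"},
    }],
    "defaultTest": {"assert": [rubric]},
}
for warning in lint_relay_config(bad):
    print(warning)
```

Running it on bad reports all three problems, while good produces no warnings.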

Troubleshooting

Python not found:

export PROMPTFOO_PYTHON=python3

Large outputs truncated: Outputs over 30000 characters are truncated. Use head_limit in assertions.

File not found errors: All file:// paths resolve relative to promptfooconfig.yaml location.

maxConcurrency ignored (shows "up to N at a time"): maxConcurrency must be under commandLineOptions:, not at the YAML top level. This is a common mistake.

LLM-as-judge returns 401 with relay API: Each llm-rubric assertion must have its own provider with apiBaseUrl. The main provider config is not inherited by grader assertions.

HTML tags in model output inflating metrics: Models may output <br>, <b>, etc. in structured content. Strip HTML in Python assertions before measuring:

import re
clean_text = re.sub(r'<[^>]+>', '', raw_text)
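Placed inside an assertion, the same idea might look like the sketch below. The function name word_count_check is hypothetical, and the 100 to 500 word window is borrowed from the custom_check example earlier in this document. One deliberate difference from the snippet above: tags are replaced with a space rather than an empty string, so that words separated only by <br> are not glued together.

```python
import re

TAG_RE = re.compile(r"<[^>]+>")


def word_count_check(output: str, context: dict) -> dict:
    """Count words after stripping HTML tags so markup does not inflate the metric."""
    # Replace each tag with a space to keep neighbouring words separate.
    clean_text = TAG_RE.sub(" ", output)
    word_count = len(clean_text.split())
    return {
        "pass": 100 <= word_count <= 500,
        "score": min(1.0, word_count / 300),
        "reason": f"Word count after stripping HTML: {word_count}",
    }
```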

Echo Provider (Preview Mode)

Use the echo provider to preview rendered prompts without making API calls:

# promptfooconfig-preview.yaml
providers:
  - echo  # Returns prompt as output, no API calls

tests:
  - vars:
      input: "test content"

Use cases:

  • Preview prompt rendering before expensive API calls
  • Verify few-shot examples are loaded correctly
  • Debug variable substitution issues
  • Validate prompt structure

# Run preview mode
npx promptfoo@latest eval --config promptfooconfig-preview.yaml

Cost: Free - no API tokens consumed.

Advanced Few-Shot Implementation

Multi-turn Conversation Pattern

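For complex few-shot learning with full examples, list each example as a user/assistant pair ahead of the real input. In the template below, the first pair is few-shot example 1, the second pair (optional) is few-shot example 2, and the final user message is the actual test. Keep such annotations out of the file itself: JSON has no comment syntax, so // lines would make the file invalid JSON.

[
  {"role": "system", "content": "{{system_prompt}}"},
  {"role": "user", "content": "Task: {{example_input_1}}"},
  {"role": "assistant", "content": "{{example_output_1}}"},
  {"role": "user", "content": "Task: {{example_input_2}}"},
  {"role": "assistant", "content": "{{example_output_2}}"},
  {"role": "user", "content": "Task: {{actual_input}}"}
]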

Test case configuration:

tests:
  - vars:
      system_prompt: file://prompts/system.md
      # Few-shot examples
      example_input_1: file://data/examples/input1.txt
      example_output_1: file://data/examples/output1.txt
      example_input_2: file://data/examples/input2.txt
      example_output_2: file://data/examples/output2.txt
      # Actual test
      actual_input: file://data/test1.txt

Best practices:

  • Use 1-3 few-shot examples (more may dilute effectiveness)
  • Ensure examples match the task format exactly
  • Load examples from files for better maintainability
  • Use echo provider first to verify structure

Long Text Handling

For Chinese/long-form content evaluations (10k+ characters):

Configuration:

providers:
  - id: anthropic:messages:claude-sonnet-4-6
    config:
      max_tokens: 8192  # Increase for long outputs

defaultTest:
  assert:
    - type: python
      value: file://scripts/metrics.py:check_length

Python assertion for text metrics:

import re

def strip_tags(text: str) -> str:
    """Remove HTML tags for pure text."""
    return re.sub(r'<[^>]+>', '', text)

def check_length(output: str, context: dict) -> dict:
    """Check output length constraints."""
    raw_input = context['vars'].get('raw_input', '')

    input_len = len(strip_tags(raw_input))
    output_len = len(strip_tags(output))

    reduction_ratio = 1 - (output_len / input_len) if input_len > 0 else 0

    return {
        "pass": 0.7 <= reduction_ratio <= 0.9,
        "score": max(0.0, min(1.0, reduction_ratio)),  # clamp: ratio goes negative when output is longer than input
        "reason": f"Reduction: {reduction_ratio:.1%} (target: 70-90%)",
        "named_scores": {
            "input_length": input_len,
            "output_length": output_len,
            "reduction_ratio": reduction_ratio
        }
    }

Real-World Example

Project: Chinese short-video content curation from long transcripts

Structure:

tiaogaoren/
├── promptfooconfig.yaml          # Production config
├── promptfooconfig-preview.yaml  # Preview config (echo provider)
├── prompts/
│   ├── tiaogaoren-prompt.json   # Chat format with few-shot
│   └── v4/system-v4.md          # System prompt
├── tests/cases.yaml              # 3 test samples
├── scripts/metrics.py            # Custom metrics (reduction ratio, etc.)
├── data/                         # 5 samples (2 few-shot, 3 eval)
└── results/

See: ./tiaogaoren/ (example project root) for full implementation.

Resources

For detailed API reference and advanced patterns, see references/promptfoo_api.md.

First Seen: Jun 3, 2026