Sets up and runs LLM evaluations with Promptfoo, an open-source CLI for testing prompts across different models. You'll reach for this when you need to compare Claude and GPT outputs side by side, write custom Python assertions for specific quality checks, or use LLM-as-judge scoring with rubrics. The skill covers the whole workflow: creating promptfooconfig.yaml, managing test cases with variable injection, implementing few-shot examples in chat format, and handling the gotchas like maxConcurrency placement and file path resolution. One thing to watch: if you're running through a relay API, every llm-rubric assertion needs its own provider config with apiBaseUrl or you'll hit 401 errors.
npx -y skills add daymade/claude-code-skills --skill promptfoo-evaluation --agent claude-codeInstalls into .claude/skills of the current project.
This skill provides guidance for configuring and running LLM evaluations using Promptfoo, an open-source CLI tool for testing and comparing LLM outputs.
# Initialize a new evaluation project
npx promptfoo@latest init
# Run evaluation
npx promptfoo@latest eval
# View results in browser
npx promptfoo@latest view
A typical Promptfoo project structure:
project/
├── promptfooconfig.yaml # Main configuration
├── prompts/
│ ├── system.md # System prompt
│ └── chat.json # Chat format prompt
├── tests/
│ └── cases.yaml # Test cases
└── scripts/
└── metrics.py # Custom Python assertions
# yaml-language-server: $schema=https://promptfoo.dev/config-schema.json
description: "My LLM Evaluation"
# Prompts to test
prompts:
- file://prompts/system.md
- file://prompts/chat.json
# Models to compare
providers:
- id: anthropic:messages:claude-sonnet-4-6
label: Claude-Sonnet-4.6
- id: openai:gpt-4.1
label: GPT-4.1
# Test cases
tests: file://tests/cases.yaml
# Concurrency control (MUST be under commandLineOptions, NOT top-level)
commandLineOptions:
maxConcurrency: 2
# Default assertions for all tests
defaultTest:
assert:
- type: python
value: file://scripts/metrics.py:custom_assert
- type: llm-rubric
value: |
Evaluate the response quality on a 0-1 scale.
threshold: 0.7
# Output path
outputPath: results/eval-results.json
You are a helpful assistant.
Task: {{task}}
Context: {{context}}
[
{"role": "system", "content": "{{system_prompt}}"},
{"role": "user", "content": "{{user_input}}"}
]
Embed examples directly in prompt or use chat format with assistant messages:
[
{"role": "system", "content": "{{system_prompt}}"},
{"role": "user", "content": "Example input: {{example_input}}"},
{"role": "assistant", "content": "{{example_output}}"},
{"role": "user", "content": "Now process: {{actual_input}}"}
]
- description: "Test case 1"
vars:
system_prompt: file://prompts/system.md
user_input: "Hello world"
# Load content from files
context: file://data/context.txt
assert:
- type: contains
value: "expected text"
- type: python
value: file://scripts/metrics.py:custom_check
threshold: 0.8
Create a Python file for custom assertions (e.g., scripts/metrics.py):
def get_assert(output: str, context: dict) -> dict:
"""Default assertion function."""
vars_dict = context.get('vars', {})
# Access test variables
expected = vars_dict.get('expected', '')
# Return result
return {
"pass": expected in output,
"score": 0.8,
"reason": "Contains expected content",
"named_scores": {"relevance": 0.9}
}
def custom_check(output: str, context: dict) -> dict:
"""Custom named assertion."""
word_count = len(output.split())
passed = 100 <= word_count <= 500
return {
"pass": passed,
"score": min(1.0, word_count / 300),
"reason": f"Word count: {word_count}"
}
Key points:
get_assertfile://path.py:function_namebool, float (score), or dict with pass/score/reasoncontext['vars']assert:
- type: llm-rubric
value: |
Evaluate the response based on:
1. Accuracy of information
2. Clarity of explanation
3. Completeness
Score 0.0-1.0 where 0.7+ is passing.
threshold: 0.7
provider: openai:gpt-4.1 # Optional: override grader model
When using a relay/proxy API, each llm-rubric assertion needs its own provider config with apiBaseUrl. Otherwise the grader falls back to the default Anthropic/OpenAI endpoint and gets 401 errors:
assert:
- type: llm-rubric
value: |
Evaluate quality on a 0-1 scale.
threshold: 0.7
provider:
id: anthropic:messages:claude-sonnet-4-6
config:
apiBaseUrl: https://your-relay.example.com/api
Best practices:
threshold to set minimum passing scorellm-rubric must have its own provider with apiBaseUrl — the main provider's apiBaseUrl is NOT inherited| Type | Usage | Example |
|---|---|---|
contains | Check substring | value: "hello" |
icontains | Case-insensitive | value: "HELLO" |
equals | Exact match | value: "42" |
regex | Pattern match | value: "\\d{4}" |
python | Custom logic | value: file://script.py |
llm-rubric | LLM grading | value: "Is professional" |
latency | Response time | threshold: 1000 |
All file:// paths are resolved relative to promptfooconfig.yaml location (NOT the YAML file containing the reference). This is a common gotcha when tests: references a separate YAML file — the file:// paths inside that test file still resolve from the config root.
# Load file content as variable
vars:
content: file://data/input.txt
# Load prompt from file
prompts:
- file://prompts/main.md
# Load test cases from file
tests: file://tests/cases.yaml
# Load Python assertion
assert:
- type: python
value: file://scripts/check.py:validate
# Basic run
npx promptfoo@latest eval
# With specific config
npx promptfoo@latest eval --config path/to/config.yaml
# Output to file
npx promptfoo@latest eval --output results.json
# Filter tests
npx promptfoo@latest eval --filter-metadata category=math
# View results
npx promptfoo@latest view
When using an API relay or proxy instead of direct Anthropic/OpenAI endpoints:
providers:
- id: anthropic:messages:claude-sonnet-4-6
label: Claude-Sonnet-4.6
config:
max_tokens: 4096
apiBaseUrl: https://your-relay.example.com/api # Promptfoo appends /v1/messages
# CRITICAL: maxConcurrency MUST be under commandLineOptions (NOT top-level)
commandLineOptions:
maxConcurrency: 1 # Respect relay rate limits
Key rules:
apiBaseUrl goes in providers[].config — Promptfoo appends /v1/messages automaticallymaxConcurrency must be under commandLineOptions: — placing it at top level is silently ignoredmaxConcurrency: 1 to avoid concurrent request limits (generation + grading share the same pool)ANTHROPIC_API_KEY env varPython not found:
export PROMPTFOO_PYTHON=python3
Large outputs truncated:
Outputs over 30000 characters are truncated. Use head_limit in assertions.
File not found errors:
All file:// paths resolve relative to promptfooconfig.yaml location.
maxConcurrency ignored (shows "up to N at a time"):
maxConcurrency must be under commandLineOptions:, not at the YAML top level. This is a common mistake.
LLM-as-judge returns 401 with relay API:
Each llm-rubric assertion must have its own provider with apiBaseUrl. The main provider config is not inherited by grader assertions.
HTML tags in model output inflating metrics:
Models may output <br>, <b>, etc. in structured content. Strip HTML in Python assertions before measuring:
import re
clean_text = re.sub(r'<[^>]+>', '', raw_text)
Use the echo provider to preview rendered prompts without making API calls:
# promptfooconfig-preview.yaml
providers:
- echo # Returns prompt as output, no API calls
tests:
- vars:
input: "test content"
Use cases:
# Run preview mode
npx promptfoo@latest eval --config promptfooconfig-preview.yaml
Cost: Free - no API tokens consumed.
For complex few-shot learning with full examples:
[
{"role": "system", "content": "{{system_prompt}}"},
// Few-shot Example 1
{"role": "user", "content": "Task: {{example_input_1}}"},
{"role": "assistant", "content": "{{example_output_1}}"},
// Few-shot Example 2 (optional)
{"role": "user", "content": "Task: {{example_input_2}}"},
{"role": "assistant", "content": "{{example_output_2}}"},
// Actual test
{"role": "user", "content": "Task: {{actual_input}}"}
]
Test case configuration:
tests:
- vars:
system_prompt: file://prompts/system.md
# Few-shot examples
example_input_1: file://data/examples/input1.txt
example_output_1: file://data/examples/output1.txt
example_input_2: file://data/examples/input2.txt
example_output_2: file://data/examples/output2.txt
# Actual test
actual_input: file://data/test1.txt
Best practices:
For Chinese/long-form content evaluations (10k+ characters):
Configuration:
providers:
- id: anthropic:messages:claude-sonnet-4-6
config:
max_tokens: 8192 # Increase for long outputs
defaultTest:
assert:
- type: python
value: file://scripts/metrics.py:check_length
Python assertion for text metrics:
import re
def strip_tags(text: str) -> str:
"""Remove HTML tags for pure text."""
return re.sub(r'<[^>]+>', '', text)
def check_length(output: str, context: dict) -> dict:
"""Check output length constraints."""
raw_input = context['vars'].get('raw_input', '')
input_len = len(strip_tags(raw_input))
output_len = len(strip_tags(output))
reduction_ratio = 1 - (output_len / input_len) if input_len > 0 else 0
return {
"pass": 0.7 <= reduction_ratio <= 0.9,
"score": reduction_ratio,
"reason": f"Reduction: {reduction_ratio:.1%} (target: 70-90%)",
"named_scores": {
"input_length": input_len,
"output_length": output_len,
"reduction_ratio": reduction_ratio
}
}
Project: Chinese short-video content curation from long transcripts
Structure:
tiaogaoren/
├── promptfooconfig.yaml # Production config
├── promptfooconfig-preview.yaml # Preview config (echo provider)
├── prompts/
│ ├── tiaogaoren-prompt.json # Chat format with few-shot
│ └── v4/system-v4.md # System prompt
├── tests/cases.yaml # 3 test samples
├── scripts/metrics.py # Custom metrics (reduction ratio, etc.)
├── data/ # 5 samples (2 few-shot, 3 eval)
└── results/
See: ./tiaogaoren/ (example project root) for full implementation.
For detailed API reference and advanced patterns, see references/promptfoo_api.md.
juliusbrussee/caveman
mattpocock/skills
shadcn/improve
obra/superpowers
forrestchang/andrej-karpathy-skills
vercel-labs/skills