This skill walks you through building LLM-as-judge evaluators for qualities code can't check: tone, faithfulness, relevance. The core rules are binary pass/fail verdicts only, one failure mode per judge, and a detailed critique written before the verdict. You need 20+ labeled examples per outcome, and the guide is firm about exhausting regex and keyword checks before reaching for semantic evaluation. The structured approach (task definition, pass/fail criteria, few-shot examples, forced JSON output) is practical, and the anti-patterns section saves time by calling out common mistakes such as using Likert scales or skipping validation. It assumes you've already done error analysis and have labeled data ready.
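The four-part structure described above can be sketched roughly as follows. This is a minimal illustration, not the skill's actual template: the names `Example`, `build_judge_prompt`, and `parse_judgment`, the prompt wording, and the JSON field names `critique` and `verdict` are all assumptions made for the example. It shows one way to place the critique ahead of the verdict and to reject any reply that is not a binary pass/fail.

```python
import json
from dataclasses import dataclass


@dataclass
class Example:
    """A labeled few-shot example (hypothetical shape): output, critique, verdict."""
    output: str
    critique: str
    passed: bool


def build_judge_prompt(task, pass_criteria, fail_criteria, examples, candidate):
    """Assemble a judge prompt that targets a single failure mode."""
    # Each few-shot judgment is serialized with the critique before the verdict,
    # so the examples model the ordering the judge is asked to follow.
    shots = "\n\n".join(
        "Output: {}\nJudgment: {}".format(
            ex.output,
            json.dumps({"critique": ex.critique,
                        "verdict": "pass" if ex.passed else "fail"}),
        )
        for ex in examples
    )
    return (
        f"Task: {task}\n\n"
        f"PASS when: {pass_criteria}\n"
        f"FAIL when: {fail_criteria}\n\n"
        f"Examples:\n{shots}\n\n"
        'Respond with JSON only, critique first: '
        '{"critique": "...", "verdict": "pass" | "fail"}\n\n'
        f"Output: {candidate}\nJudgment:"
    )


def parse_judgment(raw):
    """Parse the judge's reply; accept only a non-empty critique and a binary verdict."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("judgment must be a JSON object")
    verdict = data.get("verdict")
    if verdict not in ("pass", "fail"):
        # Rejects Likert-style scores such as 4 or "4/5".
        raise ValueError(f"non-binary verdict: {verdict!r}")
    if not data.get("critique"):
        raise ValueError("missing critique")
    return data["critique"], verdict == "pass"
```

The prompt string would be sent to whatever model client you use; the sketch deliberately leaves that call out. Parsing strictly means a reply carrying a numeric score raises an error instead of being silently coerced into a pass or a fail.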
npx -y skills add hamelsmu/evals-skills --skill write-judge-prompt --agent claude-code
Installs into .claude/skills of the current project.
sickn33/antigravity-awesome-skills
moizibnyousaf/ai-agent-skills
github/awesome-copilot