This skill walks you through calibrating an LLM judge against human labels using train/dev/test splits and TPR/TNR (true positive rate and true negative rate) metrics. Use it after writing a judge prompt, when you need to verify that the judge actually agrees with human judgment before trusting it in production. The workflow is methodical: split your labeled data, iterate on the dev set until both TPR and TNR reach 90%, then measure once on the held-out test set. It includes the Rogan-Gladen bias correction formula for estimating true success rates from biased judge scores, plus bootstrap confidence intervals. The anti-pattern section is worth reading, since most people skip validation entirely and simply assume their judge works.
npx -y skills add hamelsmu/evals-skills --skill validate-evaluator --agent claude-code
Installs into .claude/skills of the current project.
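The measurement steps named in the description (TPR/TNR on the held-out test set, the Rogan-Gladen correction, and a bootstrap confidence interval) can be sketched roughly as follows. This is a minimal illustration and not the skill's own code: the function names, the 0/1 label encoding (1 = pass), and the percentile bootstrap are all assumptions made here. The correction itself is the standard Rogan-Gladen form, theta = (p_obs + TNR - 1) / (TPR + TNR - 1), clipped to [0, 1].

```python
import random

def tpr_tnr(human, judge):
    """TPR and TNR of judge labels against human labels (1 = pass, 0 = fail)."""
    tp = sum(1 for h, j in zip(human, judge) if h and j)
    fn = sum(1 for h, j in zip(human, judge) if h and not j)
    tn = sum(1 for h, j in zip(human, judge) if not h and not j)
    fp = sum(1 for h, j in zip(human, judge) if not h and j)
    tpr = tp / (tp + fn) if tp + fn else float("nan")
    tnr = tn / (tn + fp) if tn + fp else float("nan")
    return tpr, tnr

def rogan_gladen(p_obs, tpr, tnr):
    """Estimate the true pass rate from the judge's observed pass rate."""
    denom = tpr + tnr - 1
    if not denom > 0:  # also rejects NaN from an empty class
        raise ValueError("judge is no better than chance; correction undefined")
    return min(1.0, max(0.0, (p_obs + tnr - 1) / denom))

def bootstrap_ci(human, judge, unlabeled_judge, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the corrected pass rate.

    human/judge: paired labels on the held-out test set.
    unlabeled_judge: judge verdicts on data that has no human labels.
    """
    rng = random.Random(seed)
    pairs = list(zip(human, judge))
    estimates = []
    for _ in range(n_boot):
        # Resample the labeled test set to capture uncertainty in TPR/TNR.
        h, j = zip(*rng.choices(pairs, k=len(pairs)))
        tpr, tnr = tpr_tnr(h, j)
        # Resample the unlabeled verdicts to capture uncertainty in p_obs.
        resampled = rng.choices(unlabeled_judge, k=len(unlabeled_judge))
        p_obs = sum(resampled) / len(resampled)
        try:
            estimates.append(rogan_gladen(p_obs, tpr, tnr))
        except ValueError:
            continue  # skip degenerate resamples
    if not estimates:
        raise ValueError("no valid bootstrap resamples")
    estimates.sort()
    lo = estimates[int(alpha / 2 * len(estimates))]
    hi = estimates[min(len(estimates) - 1, int((1 - alpha / 2) * len(estimates)))]
    return lo, hi
```

For example, a judge with TPR = TNR = 0.9 that reports a 60% pass rate gives `rogan_gladen(0.6, 0.9, 0.9)`, which is about 0.625: the raw score understates the true rate slightly because the judge's false negatives outweigh its false positives at that base rate.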