This skill walks you through the unglamorous work of figuring out how your LLM system actually fails. You review about 100 traces, note what went wrong in each one, then group similar failures into 5-10 categories you can measure and fix. The process is deliberately manual at first because pre-defined failure lists cause confirmation bias. It pushes you to distinguish root causes (a missing filter in the SQL) from symptoms (wrong results downstream) and to build evaluators only for failures that warrant the effort. Use it when starting evals, after big pipeline changes, or when production metrics tank and you need to know why.
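Once the traces are annotated and grouped, measuring each category is a simple tally. The following is a minimal sketch, not the skill's own format: the annotation fields (`trace_id`, `note`, `category`), the category names, and the `failure_rates` helper are all hypothetical, chosen only to illustrate counting failures per category across reviewed traces.

```python
from collections import Counter

# Hypothetical annotations: one free-text note per reviewed trace, plus the
# category assigned after grouping similar notes. None means no failure found.
annotations = [
    {"trace_id": "t1", "note": "SQL missing date filter", "category": "missing_filter"},
    {"trace_id": "t2", "note": "ignored region constraint", "category": "missing_filter"},
    {"trace_id": "t3", "note": "made up a product name", "category": "hallucinated_entity"},
    {"trace_id": "t4", "note": None, "category": None},
]


def failure_rates(annotations):
    """Return (category, count, share of all reviewed traces), most frequent first."""
    total = len(annotations)
    counts = Counter(a["category"] for a in annotations if a["category"])
    return [(cat, n, n / total) for cat, n in counts.most_common()]


for category, count, rate in failure_rates(annotations):
    print(f"{category}: {count} traces ({rate:.0%})")
```

Ranking categories by share of reviewed traces is one way to decide which failures warrant an evaluator; in practice you would likely weigh severity as well as frequency.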
npx -y skills add hamelsmu/evals-skills --skill error-analysis --agent claude-code
Installs into .claude/skills of the current project.