If you're building eval pipelines for LLM outputs, this is worth your time. It treats LLM-as-a-Judge as a family of techniques rather than one approach, which is the right mental model. You get patterns for picking the right evaluation method, mitigating judge biases, and correlating automated scores with human judgment. The skill synthesizes academic research with industry practice, so it's not just theory. Most useful when you're comparing model responses, debugging inconsistent evals, or setting up A/B tests for prompt changes. It's already seen 163 installs and passed security audits from three providers, which suggests people are actually using it in production.
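To make "correlating automated scores with human judgment" concrete, here's a minimal sketch of a Spearman rank correlation between judge scores and human ratings on the same responses. This is not taken from the skill itself. The function names and the sample scores are hypothetical, and rank correlation is only one reasonable agreement measure.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / (var_x * var_y) ** 0.5


def spearman(judge_scores, human_scores):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    if len(judge_scores) != len(human_scores):
        raise ValueError("score lists must have the same length")
    return pearson(average_ranks(judge_scores), average_ranks(human_scores))


if __name__ == "__main__":
    # Hypothetical scores for five responses (illustrative only, not real data).
    judge = [4, 2, 5, 3, 1]
    human = [5, 2, 4, 3, 1]
    print(f"Spearman correlation: {spearman(judge, human):.2f}")
```

A value near 1 means the judge orders responses much like the human raters do. A low or negative value is a hint that the judge prompt or rubric needs another look before you trust it for A/B tests.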
npx -y skills add flora131/atomic --skill advanced-evaluation --agent claude-code
Installs into .claude/skills of the current project.
juliusbrussee/caveman
mattpocock/skills
shadcn/improve
obra/superpowers
forrestchang/andrej-karpathy-skills
vercel-labs/skills