This skill helps you design and implement systematic evaluations for AI products, whether you're measuring LLM output quality, building test cases, or creating scoring rubrics. It walks you through the practitioner workflow: understand failure modes through manual review, open-code what's broken, cluster the patterns, then write specific binary criteria. The approach draws on practitioners like Hamel Husain and Shreya Shankar, who argue that evals are becoming a core product skill, not just an ML engineering concern. It's useful when you need to move beyond "does this feel good" to actually measuring whether your AI feature works, and it flags common mistakes such as skipping manual trace analysis or using fuzzy Likert scales instead of clear pass/fail criteria.
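To make the workflow concrete, here is a minimal sketch of the last two steps: tallying open-coding notes into failure clusters, then scoring traces against binary pass/fail criteria. It is not taken from the skill itself; the `Trace` type, the two criteria, and the sample data are all hypothetical and chosen only for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Trace:
    """One reviewed interaction (hypothetical structure)."""
    user_input: str
    output: str
    note: str = ""  # free-text open-coding note written during manual review


# Hypothetical binary criteria: each returns True (pass) or False (fail).
def cites_source(trace: Trace) -> bool:
    return "http" in trace.output


def under_word_limit(trace: Trace) -> bool:
    return len(trace.output.split()) <= 50


CRITERIA: Dict[str, Callable[[Trace], bool]] = {
    "cites_source": cites_source,
    "under_word_limit": under_word_limit,
}


def cluster_notes(traces: List[Trace]) -> Counter:
    """Count identical open-coding notes to surface the most common failures."""
    return Counter(t.note for t in traces if t.note)


def pass_rates(traces: List[Trace],
               criteria: Dict[str, Callable[[Trace], bool]]) -> Dict[str, float]:
    """Fraction of traces that pass each binary criterion."""
    return {
        name: sum(check(t) for t in traces) / len(traces)
        for name, check in criteria.items()
    }


# Illustrative sample data, not real results.
traces = [
    Trace("What is our refund policy?",
          "Refunds within 30 days. See https://example.com/refunds"),
    Trace("Summarize the ticket", "The customer is upset.", "missing source"),
    Trace("Explain pricing", " ".join(["word"] * 60), "too long"),
    Trace("Where are the docs?", "Check the docs.", "missing source"),
]

if __name__ == "__main__":
    print(cluster_notes(traces).most_common())
    print(pass_rates(traces, CRITERIA))
```

Because each criterion is a yes/no check rather than a 1-to-5 rating, the pass rate per criterion is unambiguous and can be tracked from one release to the next. In practice the criteria would come from the failure clusters you actually observe, not from a predefined list like this one.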
npx -y skills add refoundai/lenny-skills --skill ai-evals --agent claude-code
Installs into .claude/skills of the current project.
sickn33/antigravity-awesome-skills
moizibnyousaf/ai-agent-skills
github/awesome-copilot