This skill covers building evaluation frameworks for agent systems, where conventional step-by-step test assertions break down. Agents are non-deterministic and can take completely different valid paths to the same goal, so evaluation needs outcome-focused rubrics rather than checks on specific steps. One notable finding from BrowseComp research is that token usage explains 80% of performance variance, so evaluations should run under realistic token budgets, not unlimited resources. The framework covers LLM-as-judge for scale, human evaluation for edge cases, and multi-dimensional scoring across accuracy, completeness, and tool efficiency. Use it when you need systematic testing before shipping changes or want to catch regressions in production agent systems.
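As a rough illustration of the outcome-focused, multi-dimensional scoring described above, here is a minimal Python sketch. It is not the skill's own implementation: the `RunResult` type, the `evaluate` function, the dimension weights, and the pass threshold are all hypothetical, chosen only to show how per-dimension scores (from an LLM judge or a human rater) might be combined with a token-budget check.

```python
from dataclasses import dataclass

# Hypothetical weights: the description names these dimensions
# but does not say how they are weighted.
WEIGHTS = {"accuracy": 0.5, "completeness": 0.3, "tool_efficiency": 0.2}


@dataclass
class RunResult:
    """Outcome of one agent run (illustrative structure, not from the skill)."""
    scores: dict[str, float]  # per-dimension scores in [0, 1], e.g. from a judge
    tokens_used: int          # total tokens the agent consumed on this run


def evaluate(run: RunResult, token_budget: int, pass_threshold: float = 0.7) -> dict:
    """Score a run by its outcome, ignoring which path the agent took."""
    missing = set(WEIGHTS) - set(run.scores)
    if missing:
        raise ValueError(f"missing rubric dimensions: {sorted(missing)}")

    # Weighted sum over the rubric dimensions.
    overall = sum(WEIGHTS[dim] * run.scores[dim] for dim in WEIGHTS)

    # A run that exceeds its token budget fails regardless of quality.
    within_budget = run.tokens_used <= token_budget

    return {
        "overall": round(overall, 3),
        "within_budget": within_budget,
        "passed": within_budget and overall >= pass_threshold,
    }


if __name__ == "__main__":
    run = RunResult(
        scores={"accuracy": 1.0, "completeness": 0.5, "tool_efficiency": 0.5},
        tokens_used=8_000,
    )
    print(evaluate(run, token_budget=10_000))
```

Because the function only inspects final scores and token usage, two runs that reach the same result through different tool sequences receive the same verdict, which is the point of an outcome-focused rubric.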
npx -y skills add sickn33/antigravity-awesome-skills --skill evaluation --agent claude-code
Installs into .claude/skills of the current project.