This skill automates the full cycle of running model comparison benchmarks, following the Benchmark Suite V3 reference implementation. It executes tasks across different Claude models (Opus vs Sonnet), spins up reviewer agents to score code quality, tracks metrics such as duration and tool calls, then publishes a comprehensive report as a GitHub issue with archived artifacts. The mandatory cleanup phase closes all test PRs and issues, which is honestly the kind of housekeeping that's easy to forget when you're running benchmarks manually. Best for systematic model evaluations where you need reproducible results and proper documentation, not for one-off performance checks.
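To make the metrics and reporting steps concrete, here is a minimal sketch of how per-run measurements might be aggregated into a comparison table for the report. It is not the skill's actual implementation: the `RunResult` fields, the 0-10 review scale, the function names, and the sample numbers are all assumptions made up for illustration.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RunResult:
    """One benchmark task run. All field names here are hypothetical."""
    model: str
    task: str
    duration_s: float
    tool_calls: int
    review_score: float  # reviewer agent's code-quality score, assumed 0-10


def summarize(results):
    """Group runs by model and average each tracked metric."""
    by_model = {}
    for r in results:
        by_model.setdefault(r.model, []).append(r)
    return {
        model: {
            "runs": len(runs),
            "avg_duration_s": mean(r.duration_s for r in runs),
            "avg_tool_calls": mean(r.tool_calls for r in runs),
            "avg_review_score": mean(r.review_score for r in runs),
        }
        for model, runs in by_model.items()
    }


def render_report(summary):
    """Render the summary as a Markdown table suitable for an issue body."""
    lines = [
        "| Model | Runs | Avg duration (s) | Avg tool calls | Avg review score |",
        "| --- | --- | --- | --- | --- |",
    ]
    for model in sorted(summary):
        s = summary[model]
        lines.append(
            f"| {model} | {s['runs']} | {s['avg_duration_s']:.1f} "
            f"| {s['avg_tool_calls']:.1f} | {s['avg_review_score']:.1f} |"
        )
    return "\n".join(lines)


# Placeholder values for illustration only, not real benchmark results.
sample = [
    RunResult("opus", "task-1", 120.0, 10, 8.0),
    RunResult("opus", "task-2", 180.0, 14, 9.0),
    RunResult("sonnet", "task-1", 90.0, 8, 7.0),
    RunResult("sonnet", "task-2", 110.0, 12, 8.0),
]

print(render_report(summarize(sample)))
```

Keeping aggregation separate from rendering means the same summary could feed both the issue body and an archived artifact, which fits the reproducibility goal described above.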
npx -y skills add rysweet/amplihack --skill model-evaluation-benchmark --agent claude-code
Installs into .claude/skills of the current project.