This skill uses the BigCode Evaluation Harness to benchmark code generation models across 15+ standardized benchmarks, including HumanEval, MBPP, and MultiPL-E in 18 languages. Use it when you need objective metrics on how well a model generates code, whether you're comparing models, tracking improvements over time, or validating a fine-tuned version. It's a fork from an AI research collection, so it's built for people who want real numbers rather than vibes. Fair warning: the security audits show mixed results, including a fail from Gen Agent Trust Hub, so review what you're installing before running it in production environments.
npx -y skills add davila7/claude-code-templates --skill evaluating-code-models --agent claude-code
Installs into .claude/skills of the current project.
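To make "objective metrics" concrete: HumanEval and MBPP results are commonly reported as pass@k, the probability that at least one of k sampled completions passes the unit tests. The description above does not name the metric, so treat the following as a minimal sketch under that assumption. It is not the harness's own code, and the function names `pass_at_k` and `mean_pass_at_k` are hypothetical, chosen for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: completions sampled for the problem
    c: completions that passed the tests
    k: sample budget being scored
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # If fewer than k completions failed, every size-k subset contains a pass.
    if n - c < k:
        return 1.0
    # 1 minus the probability that a random size-k subset contains only failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over (n, c) pairs, one pair per benchmark problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

For example, `mean_pass_at_k([(10, 3), (10, 0)], k=1)` averages one problem where 3 of 10 samples passed with one where none did. Scoring several samples per problem, rather than a single greedy completion, is what lets the same run yield pass@1, pass@10, and so on.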