NVIDIA's enterprise benchmarking platform that runs your LLMs through 100+ evaluation tasks from 18+ harnesses including MMLU, HumanEval, and GSM8K. Works with any OpenAI-compatible endpoint and handles execution across local Docker, Slurm HPC clusters, or cloud platforms. The containerized approach means reproducible results, and you get built-in exports to MLflow and Weights & Biases. If you're running evals on a single machine with simpler needs, lm-evaluation-harness is lighter weight. But if you're benchmarking at scale across infrastructure or need that full harness coverage in one tool, this delivers the industrial-grade setup.
npx -y skills add orchestra-research/ai-research-skills --skill nemo-evaluator-sdk --agent claude-codeInstalls into .claude/skills of the current project.
Select a file.
sickn33/antigravity-awesome-skills
moizibnyousaf/ai-agent-skills
github/awesome-copilot