A reference for measuring LLM output quality with automated metrics (BLEU, ROUGE, BERTScore), human evaluation rubrics, and LLM-as-judge patterns. It includes working Python examples for scoring translations and summaries, comparing model outputs pairwise, and building custom metrics such as groundedness checks. Automated metrics are fast but often miss nuance, so the guide walks through when to layer in human ratings or use a stronger model as a judge. It is most useful when you want to catch regressions before shipping prompt changes, or when you need to justify which of two models performs better on your specific use case.
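To make the regression-catching idea concrete, here is a minimal sketch of a ROUGE-1-style unigram overlap score and a check that flags examples where a new prompt scores worse than the old one. It is not taken from the skill itself: the function names `unigram_f1` and `find_regressions`, the whitespace tokenization, and the `tolerance` default are all illustrative assumptions. A real setup would more likely use an established metric library.

```python
from collections import Counter


def unigram_f1(candidate: str, reference: str) -> float:
    """ROUGE-1-style F1 from clipped unigram overlap (hypothetical helper).

    Tokenization is a plain lowercase whitespace split, which is an
    assumption made for brevity, not what a production metric would do.
    """
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count per token (clipping).
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def find_regressions(baseline, candidate, references, tolerance=0.05):
    """Return indices where the candidate output scores lower than the
    baseline output by more than `tolerance` (an assumed threshold)."""
    regressions = []
    for i, (old, new, ref) in enumerate(zip(baseline, candidate, references)):
        if unigram_f1(new, ref) < unigram_f1(old, ref) - tolerance:
            regressions.append(i)
    return regressions
```

Overlap scores like this are cheap enough to run on every prompt change, but they only measure shared words, so a flagged index is a prompt for human or LLM-as-judge review rather than a verdict.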
npx -y skills add sickn33/antigravity-awesome-skills --skill llm-evaluation --agent claude-code
Installs into .claude/skills of the current project.
sickn33/antigravity-awesome-skills
moizibnyousaf/ai-agent-skills
github/awesome-copilot