This skill is a workflow tool for running LLM-as-judge evaluations on Arize's platform. It covers the full lifecycle: creating evaluators with custom judge prompts and classification choices, mapping template variables to your span or experiment data, running evals at span, trace, or session granularity, and setting up continuous monitoring for production traffic. It includes troubleshooting guidance for common issues such as missing credentials or failed API calls, and a strict rule against fabricating evaluation results when something goes wrong. It is most useful when you need to systematically score properties like hallucination, faithfulness, or relevance across LLM outputs at scale. The documentation is detailed on column mapping and on the differences between evaluating individual spans and evaluating entire conversation sessions.
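To make the column-mapping idea concrete, here is a minimal sketch in plain Python of how template variables in a judge prompt might be bound to span columns and how a judge's reply might be checked against the classification choices. It does not call the Arize API; the template text, the column names (`attributes.input.value`, `attributes.output.value`), the choice labels, and the helper names are all assumptions made for illustration.

```python
# Hypothetical judge prompt; {input}, {output}, {choices} are template variables.
JUDGE_TEMPLATE = (
    "Question: {input}\n"
    "Answer: {output}\n"
    "Is the answer free of unsupported claims? "
    "Reply with exactly one of: {choices}."
)

# Assumed mapping from template variable to a column in the span data.
COLUMN_MAPPING = {
    "input": "attributes.input.value",
    "output": "attributes.output.value",
}

# Illustrative classification choices and the score each label maps to.
CHOICES = {"factual": 1.0, "hallucinated": 0.0}


def render_prompt(span: dict) -> str:
    """Fill the judge template from one span, failing loudly on missing columns."""
    missing = [col for col in COLUMN_MAPPING.values() if col not in span]
    if missing:
        raise KeyError(f"span is missing mapped columns: {missing}")
    values = {var: span[col] for var, col in COLUMN_MAPPING.items()}
    return JUDGE_TEMPLATE.format(choices=", ".join(CHOICES), **values)


def parse_label(judge_reply: str) -> tuple[str, float]:
    """Turn the judge's reply into (label, score); never guess a result."""
    label = judge_reply.strip().lower()
    if label not in CHOICES:
        raise ValueError(f"unexpected judge label: {judge_reply!r}")
    return label, CHOICES[label]
```

Both helpers raise an error instead of returning a default, which mirrors the skill's rule against fabricating evaluation results when a step fails.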
npx -y skills add github/awesome-copilot --skill arize-evaluator --agent claude-code
Installs into .claude/skills of the current project.
cursor/plugins
github/awesome-copilot
alirezarezvani/claude-skills
microsoft/win-dev-skills