If you're building LLM applications, you need evals, and Phoenix gives you a practical framework for them. The approach is sensible: start with error analysis to understand what's actually failing, build code-based evaluators for deterministic checks first, then layer in LLM judges for the nuanced cases. The skill covers pre-built evaluators for common patterns like RAG, but the real value is in helping you build custom ones from your actual failures. One thing I appreciate: they're explicit about validating your evaluators against human labels (aiming for 80%+ accuracy), and they prefer binary pass/fail over fuzzy scoring scales. It works in both Python and TypeScript and requires a running Phoenix server.
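To make the validation step concrete, here is a minimal sketch in plain Python of a deterministic pass/fail evaluator checked against human labels. It does not use the Phoenix API: `Example`, `contains_citation`, `agreement`, and the sample data are all hypothetical names and values made up for illustration, and the only detail taken from the description above is the 80% agreement target.

```python
import re
from dataclasses import dataclass


@dataclass
class Example:
    output: str        # model output under evaluation
    human_label: bool  # True means a human reviewer marked it as passing


def contains_citation(output: str) -> bool:
    """Hypothetical code-based evaluator: pass only if the output has a [n] citation."""
    return re.search(r"\[\d+\]", output) is not None


def agreement(evaluator, examples) -> float:
    """Fraction of examples where the evaluator's verdict matches the human label."""
    if not examples:
        raise ValueError("need at least one labeled example")
    matches = sum(evaluator(ex.output) == ex.human_label for ex in examples)
    return matches / len(examples)


# Made-up labeled data, standing in for examples pulled from real failures.
labeled = [
    Example("Paris is the capital of France [1].", True),
    Example("Paris is the capital of France.", False),
    Example("See [2] for details.", True),
    Example("I think it's Paris [citation needed].", False),
    Example("Sources: [3], but the answer is wrong.", False),
]

score = agreement(contains_citation, labeled)
trusted = score >= 0.80  # assumed threshold, taken from the 80%+ target above
print(f"agreement with human labels: {score:.0%}, trusted: {trusted}")
```

The last example is the interesting one: the output has a citation, but the human marked it as failing for a different reason. Disagreements like that are the signal that a deterministic check has reached its limit and an LLM judge may be worth adding.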
npx -y skills add arize-ai/phoenix --skill phoenix-evals --agent claude-code
Installs into .claude/skills of the current project.
juliusbrussee/caveman
mattpocock/skills
shadcn/improve
obra/superpowers
forrestchang/andrej-karpathy-skills
vercel-labs/skills