This is how you actually test whether your agent workflow works in practice. It walks you through five evaluation dimensions (task completion, output quality, error behavior, user experience, consistency) and forces you to run concrete scenarios including edge cases and adversarial inputs. You get a structured table to document what happened versus what should have happened, then a graded report with specific improvement actions. The real value is that it won't let you skip the uncomfortable tests. Most people only check the happy path, but this pushes you to test malformed input, tool failures, and tricky cases. If you're shipping an agent workflow to users, run this first and prepare to feel slightly embarrassed by what breaks.
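To make the "what happened versus what should have happened" table concrete, here is a minimal Python sketch of how such results could be recorded and rolled up per dimension. It is not the skill's actual implementation: `ScenarioResult`, `grade`, and `untested` are hypothetical names, and the pass-rate grading is an assumption made for illustration.

```python
from dataclasses import dataclass

# The five evaluation dimensions named in the description above.
DIMENSIONS = (
    "task completion",
    "output quality",
    "error behavior",
    "user experience",
    "consistency",
)


@dataclass
class ScenarioResult:
    """One row of the expected-vs-actual table (hypothetical structure)."""
    scenario: str
    dimension: str
    expected: str
    actual: str
    passed: bool


def grade(results):
    """Return the pass rate per dimension, or None if nothing was tested."""
    report = {}
    for dim in DIMENSIONS:
        rows = [r for r in results if r.dimension == dim]
        report[dim] = sum(r.passed for r in rows) / len(rows) if rows else None
    return report


def untested(report):
    """List the dimensions that have no scenarios at all."""
    return [dim for dim, rate in report.items() if rate is None]


if __name__ == "__main__":
    results = [
        ScenarioResult("happy path", "task completion",
                       "summary returned", "summary returned", True),
        ScenarioResult("malformed input", "error behavior",
                       "clear error message", "unhandled exception", False),
        ScenarioResult("tool failure", "error behavior",
                       "retry, then report", "retry, then report", True),
    ]
    report = grade(results)
    for dim, rate in report.items():
        print(f"{dim}: {'not tested' if rate is None else f'{rate:.0%}'}")
    print("Skipped dimensions:", untested(report))
```

Reporting untested dimensions separately from failed ones reflects the point above: a dimension with no scenarios is a gap in the evaluation, not a pass.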
npx -y skills add sharpdeveye/maestro --skill evaluate --agent claude-code
Installs into .claude/skills of the current project.
juliusbrussee/caveman
mattpocock/skills
shadcn/improve
obra/superpowers
forrestchang/andrej-karpathy-skills
vercel-labs/skills