When your HyperPod training run is slower on some nodes than others or checkpoint saves are eating your step time, this skill pulls host-side telemetry without changing anything. It runs a read-only snapshot script via SSM that grabs nvidia-smi state, EFA port health, FSx CloudWatch metrics, dmesg Xids, and NVLink topology, then tells you whether you're looking at a cross-AZ placement issue, a straggler node with thermal problems, or saturated filesystem throughput. It doesn't fix things itself. Instead it routes you to the right sibling skill for remediation: hyperpod-node-debugger for GPU hardware faults, hyperpod-nccl for AllReduce hangs, or hyperpod-version-checker if driver versions have drifted. Useful for diagnosing why identical jobs have different performance profiles across your cluster.
npx -y skills add awslabs/agent-plugins --skill hyperpod-performance-debugger --agent claude-codeInstalls into .claude/skills of the current project.
Select a file.
cursor/plugins
github/awesome-copilot
alirezarezvani/claude-skills
microsoft/win-dev-skills