This handles Datadog APM setup and trace analysis, with a strong focus on getting Single Step Instrumentation working correctly in Kubernetes. The skill is opinionated about common failures (like using the wrong Helm chart or accidentally having ddtrace dependencies that silently break SSI) and walks you through operator install, DatadogAgent CR configuration, and pod restarts. It also covers Linux SSI and includes pup commands for searching traces, analyzing service dependencies, and debugging performance issues. The routing logic is detailed, maybe overly so, but if you've ever fought with Kubernetes auto-instrumentation that looks like it worked but didn't, you'll appreciate the paranoia.
npx -y skills add datadog-labs/agent-skills --skill dd-apm --agent claude-code

Installs into .claude/skills of the current project.
Distributed tracing, service maps, and performance analysis.
Match the user's request to one of the entries below. Each entry has the same shape: triggers → which sub-skill to load → the anti-pattern to avoid. If a request seems to fit more than one entry, see "Overlap disambiguation". If nothing matches, see "None of the above" at the end.
Kubernetes APM install / instrument / onboard — trigger when the user mentions Kubernetes, K8s, EKS, GKE, AKS, kind, minikube, K3s, helm, DatadogAgent CR, kubectl, SSI on a cluster, pod injection, or init containers.
Immediately read .claude/skills/dd-apm/k8s-ssi/agent-install/SKILL.md now, then .claude/skills/dd-apm/k8s-ssi/enable-ssi/SKILL.md, then .claude/skills/dd-apm/k8s-ssi/verify-ssi/SKILL.md — do not proceed from memory.
Common wrong approaches that LOOK like they work but silently fail:
- `helm install datadog datadog/datadog`: the standard chart does NOT support SSI via DatadogAgent CR.
- Adding `ddtrace` imports or `ddtrace-run` to the app: SSI auto-instruments WITHOUT any code changes.
- `admission.datadoghq.com/enabled` annotations: that's admission controller config injection, not SSI init container injection.
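The ddtrace pitfall can be checked for before SSI is enabled by scanning the app's dependency manifests. The sketch below is a hypothetical pre-flight helper: the function name `find_ddtrace_refs` and the list of manifest file names are assumptions, not part of the skill. It only reports matches and changes nothing.

```shell
# Hypothetical pre-flight check: print each manifest that mentions ddtrace.
# The manifest names are assumptions; adjust them for your project layout.
find_ddtrace_refs() {
  local dir="${1:-.}" file found=1
  for file in requirements.txt pyproject.toml Pipfile Dockerfile; do
    if [ -f "$dir/$file" ] && grep -qi 'ddtrace' "$dir/$file"; then
      echo "$file"
      found=0
    fi
  done
  # Exit status 0 means at least one reference was found.
  return "$found"
}
```

Running `find_ddtrace_refs .` in the app's source directory lists the files to clean up; a non-zero exit status means none of the checked files mention ddtrace.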
Linux APM install / instrument / onboard — trigger when the user mentions a single host, VM, EC2 instance, bare-metal, RHEL/Ubuntu/Debian, systemd, or no orchestrator.
Immediately read .claude/skills/dd-apm/linux-ssi/agent-install/SKILL.md now, then .claude/skills/dd-apm/linux-ssi/enable-ssi/SKILL.md, then .claude/skills/dd-apm/linux-ssi/verify-ssi/SKILL.md — do not proceed from memory.
Do NOT install the agent via plain `apt-get install datadog-agent` (or yum equivalent) and assume SSI follows: host auto-instrumentation requires the install script with the SSI flags, which the sub-skill walks through.
Service rename / service remapping — trigger when the user mentions renaming a service, collapsing multiple service names, stripping suffixes/prefixes, or cleaning up inferred services.
Immediately read .claude/skills/dd-apm/service-remapping/SKILL.md now — do not proceed from memory.
Do NOT change `tags.datadoghq.com/service` labels or `DD_SERVICE` env vars to rename a service in Datadog. That requires a rollout and only affects new data. Use a service remapping rule: it rewrites the name at ingestion time with no deployment change.
When a request could plausibly fit more than one entry above, use these tiebreakers:
| Hint | Route to |
|---|---|
| Cluster orchestrator mentioned (EKS/GKE/AKS/kind/K3s/minikube) — even if "just one node" | k8s-ssi |
| Single host, VM, or EC2 with no orchestrator | linux-ssi |
| "Several services that should be one" | service-remapping — the sub-skill picks the rule type based on whether the duplicates are real instrumented services or inferred entities (DBs, queues, external APIs) |
| "My service shows under the wrong name" | First check DD_SERVICE on the deploy. If correct and the name is still wrong → service-remapping. |
| "Reduce APM volume / cost / noise" | No sub-skill yet. Ask whether the user means sampling (fewer ingested traces) or retention filters (less indexed data) before suggesting commands. |
If the request doesn't match any entry above, continue reading the trace-search, service analysis, and metrics content below. If even that doesn't fit, ask the user to clarify — do not invent a workflow.
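The routing rules above can be sketched as a small keyword matcher. This is a rough illustration only: the function name `route_request` and the `none-of-the-above` label are hypothetical, the keyword list is a subset of the triggers, and plain substring matching will misfire on ordinary words (ambiguous triggers such as "kind" are left out for that reason).

```shell
# Hypothetical routing sketch: map a free-text request to a sub-skill name.
route_request() {
  local req
  # Lowercase the request so matching is case-insensitive.
  req="$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')"
  case "$req" in
    # An orchestrator mention wins, even for "just one node".
    *kubernetes*|*k8s*|*eks*|*gke*|*aks*|*minikube*|*k3s*|*helm*|*kubectl*)
      echo "k8s-ssi" ;;
    *renam*|*remap*)
      echo "service-remapping" ;;
    *ec2*|*bare-metal*|*systemd*|*ubuntu*|*debian*|*rhel*)
      echo "linux-ssi" ;;
    *)
      # Fall through to the trace-search content, or ask the user to clarify.
      echo "none-of-the-above" ;;
  esac
}
```

Putting the Kubernetes arm first mirrors the tiebreaker table: a request that mentions both EC2 and EKS routes to k8s-ssi.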
Datadog Labs Pup should be installed. See Setup Pup if not.
For scoped commands, use this order:
pup auth login
# Confirm env tag with the user first (do not assume production/prod/prd).
pup apm services list --env <env> --from 1h --to now
pup traces search --query "service:api-gateway" --from 1h
pup apm services list --env <env> --from 1h --to now
pup apm services stats --env <env> --from 1h --to now
# View dependencies
pup apm flow-map --query "service:api-gateway&from=$(($(date +%s)-3600))000&to=$(date +%s)000" --env <env> --limit 10
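The flow-map query embeds its time window as epoch milliseconds, built by appending "000" to epoch seconds. A small helper can make that arithmetic easier to read and reuse. The function name `flow_map_query` and its optional third argument are hypothetical conveniences, not pup features.

```shell
# Build the flow-map query string for a service over the last N seconds.
# Arguments: service name, window in seconds (default 3600), optional "now"
# in epoch seconds (hypothetical convenience for reproducible output).
flow_map_query() {
  local service="$1" window="${2:-3600}" now
  now="${3:-$(date +%s)}"
  # Appending "000" converts epoch seconds to epoch milliseconds.
  echo "service:${service}&from=$((now - window))000&to=${now}000"
}

# Usage (placeholder env, as elsewhere in this document):
#   pup apm flow-map --query "$(flow_map_query api-gateway)" --env <env> --limit 10
```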
# By service
pup traces search --query "service:api-gateway" --from 1h
# Errors only
pup traces search --query "service:api-gateway status:error" --from 1h
# Slow traces (>1s)
pup traces search --query "service:api-gateway @duration:>1000ms" --from 1h
# With specific tag
pup traces search --query "service:api-gateway @http.url:/api/users" --from 1h
# No direct get command for a single trace ID.
# Use traces search with a narrow query and time window.
pup traces search --query "trace_id:<trace_id>" --from 1h
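The search queries above are built from the same pieces: a service, an optional status, and an optional duration floor. A minimal sketch that composes them follows; the function name `trace_query` is hypothetical, and the query syntax simply mirrors the examples in this document.

```shell
# Compose a trace search query from a service, optional status, and optional
# minimum duration in milliseconds. Pass "" to skip the status filter.
trace_query() {
  local query="service:$1"
  if [ -n "${2:-}" ]; then
    query="$query status:$2"
  fi
  if [ -n "${3:-}" ]; then
    query="$query @duration:>${3}ms"
  fi
  echo "$query"
}

# Usage:
#   pup traces search --query "$(trace_query api-gateway error 1000)" --from 1h
```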
| Metric | What It Measures |
|---|---|
| trace.http.request.hits | Request count |
| trace.http.request.duration | Latency |
| trace.http.request.errors | Error count |
| trace.http.request.apdex | User satisfaction |
Link APM to SLOs:
pup slos create --file slo.json
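The contents of slo.json are not shown here. Assuming pup forwards the file as a Datadog SLO create request body (an assumption worth confirming against pup's own help output), a metric-based availability SLO over the trace metrics listed above might be written like this. The SLO name, the 30d/99.9 target, and the api-gateway service are illustrative values only.

```shell
# Write an illustrative slo.json. ASSUMPTION: the file follows the Datadog
# SLO API body shape (name, type, query, thresholds); verify before use.
cat > slo.json <<'EOF'
{
  "name": "api-gateway availability",
  "type": "metric",
  "query": {
    "numerator": "sum:trace.http.request.hits{service:api-gateway}.as_count() - sum:trace.http.request.errors{service:api-gateway}.as_count()",
    "denominator": "sum:trace.http.request.hits{service:api-gateway}.as_count()"
  },
  "thresholds": [
    { "timeframe": "30d", "target": 99.9 }
  ],
  "tags": ["service:api-gateway"]
}
EOF
```

The numerator counts good events (hits minus errors) and the denominator counts all hits, so the SLO tracks the share of successful requests.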
| Goal | Query |
|---|---|
| Slowest endpoints | avg:trace.http.request.duration{*} by {resource_name} |
| Error rate | sum:trace.http.request.errors{*} / sum:trace.http.request.hits{*} |
| Throughput | sum:trace.http.request.hits{*}.as_rate() |
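The error-rate query divides error count by request count. As a worked example of the same arithmetic on raw numbers, 25 errors out of 1000 hits is 2.50%. The helper name `error_rate_pct` below is hypothetical, and it guards against the zero-traffic case that the metric query leaves undefined.

```shell
# Compute an error rate as a percentage from raw error and hit counts.
error_rate_pct() {
  local errors="$1" hits="$2"
  if [ "$hits" -eq 0 ]; then
    # No traffic: the ratio is undefined, so say so instead of dividing by zero.
    echo "n/a"
    return 0
  fi
  awk -v e="$errors" -v h="$hits" 'BEGIN { printf "%.2f\n", (e / h) * 100 }'
}
```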
| Problem | Fix |
|---|---|
| No traces | Check ddtrace installed, DD_TRACE_ENABLED=true |
| Missing service | Verify DD_SERVICE env var |
| Traces not linked | Check trace headers propagated |
| High cardinality | Don't tag with user_id/request_id |
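The first two rows of the table can be checked from inside the application's environment. The sketch below is a hypothetical helper (the name `check_apm_env` is an assumption); it treats an unset DD_TRACE_ENABLED as enabled and only flags the literal value false.

```shell
# Report environment settings that commonly explain missing traces or services.
check_apm_env() {
  local problems=0
  if [ -z "${DD_SERVICE:-}" ]; then
    echo "DD_SERVICE is not set"
    problems=$((problems + 1))
  fi
  # Unset is treated as enabled; only an explicit "false" is reported.
  if [ "${DD_TRACE_ENABLED:-true}" = "false" ]; then
    echo "DD_TRACE_ENABLED is false"
    problems=$((problems + 1))
  fi
  # Exit status 0 means no problems were found.
  [ "$problems" -eq 0 ]
}
```

Run it in the same shell or container as the service, since the variables are read from the current environment.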