This handles Datadog monitor management through the command line, letting you list, search, create from files, and manage alerting downtimes. What makes it worth looking at is the opinionated best practices baked in: it pushes you toward stable alert windows, proper recovery thresholds to prevent flapping, and a safe deletion workflow that marks monitors instead of nuking them. The guidance on avoiding alert fatigue is solid: use 5-minute windows instead of 1-minute ones, scope alerts to what actually matters, and include runbooks in messages. Requires pup in your path to work with the Datadog API.
npx -y skills add datadog-labs/agent-skills --skill dd-monitors --agent claude-code
Installs into .claude/skills of the current project.
Create, manage, and maintain monitors for alerting.
This requires pup in your path. See Setup Pup.
For scoped commands, use this order:
pup auth login
pup monitors list
pup monitors list --tags "team:platform"
pup monitors get <id>
pup monitors create --file monitor.json
# No pup monitors mute/unmute commands.
# Use downtime payloads to silence monitor notifications.
pup downtime create --file downtime.json
pup downtime cancel <downtime_id>
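`pup monitors create --file monitor.json` reads a monitor definition from disk. The following is a minimal sketch of generating that file, assuming it mirrors the body of the Datadog Monitors API (`name`, `type`, `query`, `message`, `tags`, `options`). The exact schema pup accepts is not stated here, and the monitor name, owner tag, and notification handle are placeholders.

```python
import json


def build_cpu_monitor() -> dict:
    """Return a monitor definition shaped like the Datadog Monitors API body (assumed)."""
    return {
        "name": "High CPU on prod API hosts",  # placeholder name
        "type": "metric alert",
        "query": "avg(last_5m):avg:system.cpu.user{env:prod,service:api} by {host} > 80",
        "message": "CPU above 80% on {{host.name}}. @slack-ops",  # placeholder handle
        "tags": ["team:platform"],  # placeholder owner tag
        "options": {
            # The critical value matches the threshold used in the query.
            "thresholds": {"critical": 80, "critical_recovery": 70},
        },
    }


def write_monitor_file(path: str = "monitor.json") -> None:
    """Write the definition to the file passed to `pup monitors create --file`."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(build_cpu_monitor(), fh, indent=2)
```

Calling `write_monitor_file()` produces `monitor.json` in the current directory; review the file before creating the monitor.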
| Rule | Why |
|---|---|
| No flapping alerts | Use last_5m or a longer window, not last_1m |
| Meaningful thresholds | Based on SLOs, not guesses |
| Actionable alerts | If no action needed, don't alert |
| Include runbook | @runbook-url in message |
# WRONG - will flap constantly
query = "avg(last_1m):avg:system.cpu.user{*} > 50" # ❌ Too sensitive
# CORRECT - stable alerting
query = "avg(last_5m):avg:system.cpu.user{env:prod} by {host} > 80" # ✅ Reasonable window
# WRONG - alerts on everything
query = "avg(last_5m):avg:system.cpu.user{*} > 80" # ❌ No scope
# CORRECT - scoped to what matters
query = "avg(last_5m):avg:system.cpu.user{env:prod,service:api} by {host} > 80" # ✅
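The two query rules above (a window of at least five minutes, and a scope narrower than `{*}`) can be checked mechanically before a monitor is created. This is a rough sketch, not part of pup: `lint_query` and the default five-minute minimum are illustrative choices, and the regular expression only covers the `last_<n><unit>` window form shown in these examples.

```python
import re

# Matches the evaluation window in a query such as "avg(last_5m):...".
WINDOW_RE = re.compile(r"\(last_(\d+)([mhd])\)")
MINUTES_PER_UNIT = {"m": 1, "h": 60, "d": 1440}


def lint_query(query: str, min_window_minutes: int = 5) -> list[str]:
    """Return warnings for a monitor query; an empty list means no findings."""
    warnings = []
    match = WINDOW_RE.search(query)
    if match:
        minutes = int(match.group(1)) * MINUTES_PER_UNIT[match.group(2)]
        if minutes < min_window_minutes:
            warnings.append(f"window of {minutes}m is shorter than {min_window_minutes}m")
    else:
        warnings.append("no evaluation window found")
    # A wildcard scope means the monitor evaluates every reporting source.
    if "{*}" in query:
        warnings.append("query is unscoped ({*})")
    return warnings
```

Run against the examples above, the first "WRONG" query yields both warnings, the unscoped one yields only the scope warning, and the scoped `last_5m` query yields none.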
monitor = {
    "query": "avg(last_5m):avg:system.cpu.user{env:prod} > 80",
    "options": {
        "thresholds": {
            "critical": 80,
            "critical_recovery": 70,  # ✅ Prevents flapping
            "warning": 60,
            "warning_recovery": 50
        }
    }
}
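Recovery thresholds only prevent flapping when they sit on the correct side of the alert thresholds. For an "above" (`>`) monitor like the one shown, each recovery value has to be lower than the threshold it recovers from. The helper below is a small sketch of that ordering check; `check_thresholds` is a hypothetical name, and it does not handle "below" (`<`) monitors, where the ordering is reversed.

```python
def check_thresholds(thresholds: dict) -> list[str]:
    """Check threshold ordering for an 'above' (>) monitor; return a list of problems."""
    if "critical" not in thresholds:
        return ["critical threshold is missing"]
    problems = []
    # Each pair is (value that must be lower, value it must stay below).
    ordering = [
        ("critical_recovery", "critical"),
        ("warning", "critical"),
        ("warning_recovery", "warning"),
    ]
    for lower, upper in ordering:
        if lower in thresholds and upper in thresholds:
            if not thresholds[lower] < thresholds[upper]:
                problems.append(f"{lower} must be below {upper}")
    # Without a recovery value the monitor resolves as soon as it dips under critical.
    if "critical_recovery" not in thresholds:
        problems.append("critical_recovery is missing, so the alert may flap")
    return problems
```

The thresholds in the example above (80 / 70 / 60 / 50) pass this check with no problems reported.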
message = """
## High CPU Alert
Host: {{host.name}}
Current Value: {{value}}
Threshold: {{threshold}}
### Runbook
1. Check top processes: `ssh {{host.name}} 'top -bn1 | head -20'`
2. Check recent deploys
3. Scale if needed
@slack-ops @pagerduty-oncall
"""
Use the safe deletion workflow (same as dashboards):
def safe_mark_monitor_for_deletion(monitor_id: str, client) -> bool:
    """Mark monitor instead of deleting."""
    monitor = client.get_monitor(monitor_id)
    name = monitor.get("name", "")
    if "[MARKED FOR DELETION]" in name:
        print(f"Already marked: {name}")
        return False
    new_name = f"[MARKED FOR DELETION] {name}"
    client.update_monitor(monitor_id, {"name": new_name})
    print(f"✓ Marked: {new_name}")
    return True
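Marking is only the first half of the workflow: someone still has to review the marked monitors later. The sketch below lists them, assuming the monitors are available as a list of dicts with `id` and `name` keys (for example, parsed from saved `pup monitors list` output). `find_marked_monitors` is a hypothetical helper, not a pup feature.

```python
MARK = "[MARKED FOR DELETION]"


def find_marked_monitors(monitors: list[dict]) -> list[dict]:
    """Return id and name for every monitor whose name carries the deletion mark."""
    return [
        {"id": m.get("id"), "name": m.get("name", "")}
        for m in monitors
        if MARK in m.get("name", "")
    ]
```

Keeping the review step read-only means the actual deletion stays a deliberate, separate action.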
| Type | Use Case |
|---|---|
| metric alert | CPU, memory, custom metrics |
| query alert | Complex metric queries |
| service check | Agent check status |
| event alert | Event stream patterns |
| log alert | Log pattern matching |
| composite | Combine multiple monitors |
| apm | APM metrics |
# Find monitors without owners
pup monitors list | jq '.[] | select(.tags | contains(["team:"]) | not) | {id, name}'
# Find the 10 monitors whose state changed most recently
# (a rough proxy for noise; this does not count alerts)
pup monitors list | jq 'sort_by(.overall_state_modified) | reverse | .[:10] | .[] | {id, name, status: .overall_state}'
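The ownership audit can also be done in Python when the results feed a larger script. This sketch assumes `pup monitors list` emits a JSON array of monitor objects with a `tags` list, as the jq filters above imply. `monitors_without_team` is a hypothetical helper, and it treats a missing or null `tags` field as "no owner".

```python
def monitors_without_team(monitors: list[dict]) -> list[dict]:
    """Return id and name for monitors that have no tag starting with 'team:'."""
    return [
        {"id": m.get("id"), "name": m.get("name")}
        for m in monitors
        # `or []` covers monitors whose tags field is absent or null.
        if not any(tag.startswith("team:") for tag in (m.get("tags") or []))
    ]
```

Unlike the jq `contains` filter, which matches `team:` anywhere inside a tag, this version only accepts tags that begin with `team:`.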
| Use | When |
|---|---|
| Downtime | Any planned silence window |
| Monitor edit | Query/threshold behavior changes |
# Downtime (preferred)
pup downtime create --file downtime.json
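The contents of `downtime.json` are not shown here. The sketch below builds a one-time downtime for a single monitor, assuming pup accepts a body shaped like the Datadog v2 Downtimes API (`data.attributes` with `monitor_identifier`, `scope`, and `schedule`). Treat that shape, the function names, and the message text as assumptions to verify against pup's own documentation.

```python
import json
from datetime import datetime, timedelta, timezone


def build_downtime(monitor_id: int, scope: str, start: datetime, hours: float) -> dict:
    """Build a one-time downtime body shaped like the v2 Downtimes API (assumed)."""
    end = start + timedelta(hours=hours)
    return {
        "data": {
            "type": "downtime",
            "attributes": {
                "message": "Planned maintenance",  # placeholder text
                "monitor_identifier": {"monitor_id": monitor_id},
                "scope": scope,
                # Timezone-aware datetimes serialize to ISO 8601 with an offset.
                "schedule": {"start": start.isoformat(), "end": end.isoformat()},
            },
        }
    }


def write_downtime_file(body: dict, path: str = "downtime.json") -> None:
    """Write the body to the file passed to `pup downtime create --file`."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(body, fh, indent=2)
```

A two-hour window starting now could be built with `build_downtime(12345, "env:prod", datetime.now(timezone.utc), 2)`, where `12345` stands in for a real monitor ID.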
| Problem | Fix |
|---|---|
| Alert not firing | Check query returns data, thresholds |
| Too many alerts | Increase window, add recovery threshold |
| No data alerts | Check agent connectivity, metric exists |
| Auth error | pup auth refresh |