AgentWatch

Summary

AgentWatch provides post-mortem tooling for multi-agent systems. Agents call `report()` to send heartbeats and `trace()` to record actions, which are linked by trace ID and parent event ID. When something fails, `correlate()` walks backward through the chain to find the root cause. It exposes 13 MCP tools, including `agentwatch_cascade`, `agentwatch_replay`, and `agentwatch_dashboard`, so Claude can diagnose why your swarm fell over. Everything is stored in a local SQLite file, and it works with any agent framework. Run `npx @nicofains1/agentwatch demo` to see a 5-agent cascade failure traced end to end. Useful if you're running CrewAI, AutoGen, or custom agents and need forensics when one failure triggers three others.

agentwatch

Your agent swarm crashed at 2am. You have logs from 10 agents and no idea which one started the cascade. AgentWatch tells you.

It tracks heartbeats, links actions across agents, walks backward from any failure to the root cause, and replays the full sequence. Works with any agent framework (CrewAI, AutoGen, LangGraph, PocketFlow, custom). Stores everything in a local SQLite file.

Early stage. Issues and feedback welcome: https://github.com/nicofains1/agentwatch/issues

See it in action

No install needed:

npx @nicofains1/agentwatch demo

This seeds a 5-agent fleet, triggers a cascade failure, and shows you the full trace:

AgentWatch Fleet Dashboard
============================================================
Agents: 5 total | 3 healthy | 1 degraded | 1 error | 0 offline

Cascade Failure (4 steps, root cause: scheduler/dispatch-batch)
============================================================
[ROOT] scheduler/dispatch-batch [ok] 15ms
       {"assigned_to": "fetcher"}
       |
[  1 ] fetcher/call-api [error] 30000ms
       TIMEOUT after 30000ms
       |
[  2 ] processor/transform [error] 120ms
       Error: input is null - expected array from fetcher
       |
[FAIL] notifier/send-alert [error] 8ms
       Error: no processed data to report

Install

npm install @nicofains1/agentwatch

Requires Node 18+. Uses better-sqlite3 (native bindings, no external database needed).

Quick start

import { AgentWatch } from '@nicofains1/agentwatch';

const aw = new AgentWatch(); // creates agentwatch.db in the current directory

// Report heartbeats from your agents
aw.report('agent-a', 'healthy');
aw.report('agent-b', 'healthy');

// Trace an action in agent-a
const traceId = aw.createTraceId();
const e1 = aw.trace(traceId, 'agent-a', 'fetch-data',
  'url=https://api.example.com', 'rows=150');

// Trace a dependent action in agent-b that fails
const e2 = aw.trace(traceId, 'agent-b', 'process',
  JSON.stringify({ rows: 150 }), 'Error: out of memory', {
    parentEventId: e1.id,
    status: 'error',
    durationMs: 4200,
  });

// Walk back to the root cause
const chain = aw.correlate(e2.id);
console.log(chain?.root_cause);
// -> { agent: 'agent-a', action: 'fetch-data', ... }

// Print fleet status
console.log(aw.dashboardText());

What it does

Heartbeats - Each agent calls aw.report(name, status) on a schedule. AgentWatch tracks health over time and marks agents as stale or offline based on configurable thresholds.

Cross-agent tracing - Actions are linked by trace ID and optional parent event ID. When agent-c fails because agent-b sent bad data that came from agent-a, the full chain is queryable.

Cascade detection - correlate(failureEventId) walks backward from any failure to the root cause, returning the full chain with timing and output at each step.

Alert de-duplication - The same alert type from the same agent within a time window collapses into one entry with an incrementing count. Severity auto-escalates: info (1x) -> warning (3x) -> critical (10x).

Forensic replay - replay(traceId) returns all cascade chains within a trace. Useful for post-mortem analysis when a single trace touched multiple agents.

OpenTelemetry export - Export traces as OTEL spans (GenAI semantic conventions). Works with Jaeger, Grafana, or any OTEL-compatible backend. Requires optional peer deps.

CLI

npx @nicofains1/agentwatch demo                   # run the demo
npx @nicofains1/agentwatch dashboard              # fleet health overview
npx @nicofains1/agentwatch cascade <event-id>     # trace cascade from a failure
npx @nicofains1/agentwatch failures [agent]       # list recent failures
npx @nicofains1/agentwatch alerts [agent]         # list active alerts
npx @nicofains1/agentwatch replay <trace-id>      # replay all cascades in a trace
npx @nicofains1/agentwatch mcp                    # start MCP server (stdio)

Set AGENTWATCH_DB to point to your database file. Default: agentwatch.db in the current directory.

MCP server

AgentWatch runs as an MCP server. Add it to your Claude Code or Cursor config:

Claude Code (~/.claude/claude_desktop_config.json or .claude/settings.json):

{
  "mcpServers": {
    "agentwatch": {
      "command": "npx",
      "args": ["@nicofains1/agentwatch", "mcp"],
      "env": {
        "AGENTWATCH_DB": "/absolute/path/to/agentwatch.db"
      }
    }
  }
}

Cursor (.cursor/mcp.json):

{
  "mcpServers": {
    "agentwatch": {
      "command": "npx",
      "args": ["@nicofains1/agentwatch", "mcp"],
      "env": {
        "AGENTWATCH_DB": "/absolute/path/to/agentwatch.db"
      }
    }
  }
}

This exposes 13 tools: agentwatch_dashboard, agentwatch_report_heartbeat, agentwatch_trace, agentwatch_cascade, agentwatch_replay, agentwatch_get_alerts, agentwatch_get_failures, agentwatch_get_trace, agentwatch_fleet_health, agentwatch_create_trace_id, agentwatch_alert, agentwatch_resolve_alert, agentwatch_dashboard_text.

API reference

Constructor

const aw = new AgentWatch({
  db_path: 'agentwatch.db',        // SQLite file path
  alert_window_minutes: 30,         // de-dup window for alerts
  heartbeat_stale_minutes: 30,      // when to mark agents as offline
});

Heartbeats

aw.report(agent, status, context?)     // status: 'healthy' | 'degraded' | 'error' | 'offline'
aw.getLatestHeartbeat(agent)           // -> Heartbeat | undefined
aw.getFleetHealth()                    // -> AgentHealth[]
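To make the staleness rule concrete, here is a minimal sketch of how a last-seen timestamp could be turned into an effective status. It is an illustration only, not AgentWatch's implementation: the `Heartbeat` shape, the `timestampMs` field, and the `effectiveStatus` helper are hypothetical, and the 30-minute default simply mirrors `heartbeat_stale_minutes` from the constructor.

```typescript
// Hypothetical types for illustration; not AgentWatch's actual schema.
type Status = 'healthy' | 'degraded' | 'error' | 'offline';

interface Heartbeat {
  agent: string;
  status: Status;
  timestampMs: number; // when the heartbeat was reported (epoch milliseconds)
}

// Returns the last reported status, unless the heartbeat is older than the
// stale threshold, in which case the agent is treated as 'offline'.
function effectiveStatus(hb: Heartbeat, nowMs: number, staleMinutes: number = 30): Status {
  const ageMs = nowMs - hb.timestampMs;
  return ageMs > staleMinutes * 60 * 1000 ? 'offline' : hb.status;
}

const now = Date.UTC(2024, 0, 1, 12, 0, 0);
const fresh: Heartbeat = { agent: 'agent-a', status: 'healthy', timestampMs: now - 5 * 60 * 1000 };
const stale: Heartbeat = { agent: 'agent-b', status: 'healthy', timestampMs: now - 45 * 60 * 1000 };

console.log(effectiveStatus(fresh, now)); // healthy
console.log(effectiveStatus(stale, now)); // offline
```

Passing the current time in as an argument, rather than reading the clock inside the helper, keeps the check deterministic and easy to test.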

Tracing

aw.createTraceId()                                // -> string (UUID)
aw.trace(traceId, agent, action, input, output, {
  parentEventId?: number,
  status?: 'ok' | 'error',                        // default: 'ok'
  durationMs?: number,
})                                                // -> TraceEvent
aw.getTraceEvents(traceId)                        // -> TraceEvent[]
aw.getRecentFailures(agent?, limit?)              // -> TraceEvent[]

Cascade detection

aw.correlate(failureEventId)    // -> CascadeChain | null
aw.replay(traceId)              // -> CascadeChain[]
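The backward walk that correlate() performs can be pictured as following parent pointers until an event with no parent is reached. The sketch below works on an in-memory array and is an assumption about the general technique, not the library's code: the `TraceEvent` and `Chain` shapes and the `walkBack` function are hypothetical, and only `root_cause` and `parentEventId` are names taken from the examples above. The sample events mirror the demo cascade.

```typescript
// Hypothetical event shape for illustration; the real TraceEvent may differ.
interface TraceEvent {
  id: number;
  agent: string;
  action: string;
  status: 'ok' | 'error';
  parentEventId?: number;
}

interface Chain {
  root_cause: TraceEvent;
  steps: TraceEvent[]; // ordered from root cause to the failing event
}

// Follows parentEventId links from the failing event back to the first event
// that has no (known) parent. Returns null if the failing event is unknown.
function walkBack(events: TraceEvent[], failureEventId: number): Chain | null {
  const byId = new Map<number, TraceEvent>();
  for (const e of events) byId.set(e.id, e);

  const steps: TraceEvent[] = [];
  const seen = new Set<number>(); // guards against accidental parent cycles
  let current: TraceEvent | undefined = byId.get(failureEventId);
  while (current !== undefined && !seen.has(current.id)) {
    seen.add(current.id);
    steps.unshift(current);
    const parentId: number | undefined = current.parentEventId;
    current = parentId === undefined ? undefined : byId.get(parentId);
  }

  const root = steps[0];
  if (root === undefined) return null;
  return { root_cause: root, steps };
}

const events: TraceEvent[] = [
  { id: 1, agent: 'scheduler', action: 'dispatch-batch', status: 'ok' },
  { id: 2, agent: 'fetcher', action: 'call-api', status: 'error', parentEventId: 1 },
  { id: 3, agent: 'processor', action: 'transform', status: 'error', parentEventId: 2 },
  { id: 4, agent: 'notifier', action: 'send-alert', status: 'error', parentEventId: 3 },
];

const chain = walkBack(events, 4);
console.log(chain?.root_cause.agent, chain?.steps.length); // scheduler 4
```

Note that in this sketch the root cause is simply the earliest ancestor, regardless of its own status, which matches the demo output where the root step is marked ok.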

Alerts

aw.alert(agent, alertType, message)
aw.resolveAlert(alertId)
aw.activeAlerts(agent?)         // -> Alert[]
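The de-duplication and escalation behaviour described earlier (same agent and alert type collapsing within a window, with severity moving from info at 1x to warning at 3x to critical at 10x) can be sketched as follows. This is a hedged illustration rather than the library's implementation: `AlertEntry`, `severityFor`, and `raiseAlert` are hypothetical names, and measuring the window from the first occurrence is an assumption.

```typescript
type Severity = 'info' | 'warning' | 'critical';

// Hypothetical alert record for illustration; not AgentWatch's actual schema.
interface AlertEntry {
  agent: string;
  alertType: string;
  message: string;
  count: number;
  severity: Severity;
  firstSeenMs: number;
}

// Thresholds taken from the description above: 1x info, 3x warning, 10x critical.
function severityFor(count: number): Severity {
  if (count >= 10) return 'critical';
  if (count >= 3) return 'warning';
  return 'info';
}

// Collapses repeats of the same (agent, alertType) inside the window into one
// entry with an incrementing count; outside the window a new entry starts.
function raiseAlert(
  book: Map<string, AlertEntry>,
  agent: string,
  alertType: string,
  message: string,
  nowMs: number,
  windowMinutes: number = 30,
): AlertEntry {
  const key = agent + '|' + alertType;
  const existing = book.get(key);
  if (existing !== undefined && nowMs - existing.firstSeenMs <= windowMinutes * 60 * 1000) {
    existing.count += 1;
    existing.message = message;
    existing.severity = severityFor(existing.count);
    return existing;
  }
  const entry: AlertEntry = {
    agent, alertType, message, count: 1, severity: severityFor(1), firstSeenMs: nowMs,
  };
  book.set(key, entry);
  return entry;
}

const book = new Map<string, AlertEntry>();
const t0 = Date.UTC(2024, 0, 1);
let last = raiseAlert(book, 'fetcher', 'timeout', 'TIMEOUT after 30000ms', t0);
for (let i = 1; i < 3; i++) {
  last = raiseAlert(book, 'fetcher', 'timeout', 'TIMEOUT after 30000ms', t0 + i * 1000);
}
console.log(last.count, last.severity); // 3 warning
```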

Dashboard

aw.dashboard()      // -> DashboardOutput (structured)
aw.dashboardText()  // -> string (formatted for terminal)

OpenTelemetry export

Requires optional peer deps @opentelemetry/api and @opentelemetry/sdk-trace-base.

await aw.exportTraceToOtel(traceId, { serviceName: 'my-agents' });
await aw.exportRecentToOtel(1); // last 1 hour

Storage

SQLite via better-sqlite3. The database file is created automatically on first use. WAL mode is on for concurrent reads.

Tables: heartbeats, trace_events, alerts.

License

MIT

Configuration

AGENTWATCH_DB - Path to the SQLite database file. Default: agentwatch.db in the current directory.
