Agent Health Monitoring & Alerting
Monitor AI agent health, detect anomalies, set up alerting, and maintain observability dashboards for production multi-agent systems. Covers liveness checks, performance metrics, drift detection, and incident response.
Overview
Production multi-agent systems often fail silently. An agent that stops responding, returns empty results, or enters an infinite loop can degrade an entire workflow without triggering traditional infrastructure alerts, which typically watch hosts, processes, and networks rather than the quality of agent output. This skill covers how to build health monitoring, metrics collection, and alerting for AI agent fleets.
Core Concepts
Agent Vital Signs
| Metric | What It Measures | Why It Matters |
|---|---|---|
| Response Rate | % of agent invocations that return a result | Dropping rate indicates crashes or context overflows |
| Latency (P50/P95/P99) | Time from invocation to response | Spikes indicate context bloat or degraded model performance |
| Error Rate | % of invocations with errors/tool failures | Rising rate indicates systemic issues |
| Step Count | Number of reasoning steps per task | Unbounded growth indicates looping behavior |
| Tool Call Success Rate | % of tool calls that succeed | Drop indicates broken integrations or rate limiting |
| Token Consumption | Tokens used per agent run | Budget anomalies indicate runaway agents |
| Context Utilization | % of context window used | High utilization risks truncation and quality loss |
| Hallucination Score | Confidence calibration or factuality checks | Degrading accuracy undermines trust |
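
As a rough sketch of how a few of these vital signs could be derived from raw run records: the `Invocation` record and its field names below are assumptions made for illustration, not part of any particular agent framework. Percentiles come from the standard library's `statistics.quantiles`.

```python
import statistics
from dataclasses import dataclass


@dataclass
class Invocation:
    """Hypothetical record of one agent run (field names are assumptions)."""
    returned_result: bool   # did the agent return any result at all?
    errored: bool           # did the run raise or hit a tool failure?
    latency_s: float        # seconds from invocation to response


def vital_signs(runs: list[Invocation]) -> dict[str, float]:
    """Summarize response rate, error rate, and latency percentiles."""
    if len(runs) < 2:
        raise ValueError("need at least two runs to compute percentiles")
    total = len(runs)
    # quantiles(n=100) returns 99 cut points; index k-1 is the k-th percentile.
    cuts = statistics.quantiles(
        [r.latency_s for r in runs], n=100, method="inclusive"
    )
    return {
        "response_rate": sum(r.returned_result for r in runs) / total,
        "error_rate": sum(r.errored for r in runs) / total,
        "latency_p50": cuts[49],
        "latency_p95": cuts[94],
        "latency_p99": cuts[98],
    }
```

In practice a metrics backend would usually compute these over a sliding time window; the batch form here only shows the arithmetic behind each row of the table.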
Alert Severity Levels
| Level | Color | Response Time | Examples |
|---|---|---|---|
| P0 (Critical) | 🔴 Red | < 5 min | Agent completely down, data loss, security breach |
| P1 (High) | 🟠 Orange | < 15 min | Error rate > 20%, latency 5x baseline |
| P2 (Medium) | 🟡 Yellow | < 1 hour | Error rate > 5%, slow degradation |
| P3 (Low) | 🔵 Blue | < 24 hours | Single agent underperforming, minor drift |
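
The example thresholds in the table can be turned into a small classifier. This is a minimal sketch: the function name and parameters are hypothetical, and it only encodes the numeric examples for P0 to P2. P3 conditions such as minor drift need their own signals and are left out.

```python
from typing import Optional


def classify_severity(agent_up: bool, error_rate: float,
                      latency_s: float, baseline_latency_s: float) -> Optional[str]:
    """Map current readings to a severity level using the table's examples."""
    if not agent_up:
        return "P0"   # agent completely down
    if error_rate > 0.20 or latency_s >= 5 * baseline_latency_s:
        return "P1"   # error rate > 20% or latency at 5x baseline
    if error_rate > 0.05:
        return "P2"   # error rate > 5%
    return None       # below the alerting thresholds covered by this sketch
```

Checking the most severe condition first means a reading that satisfies several rows is reported once, at its highest level.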
Step-by-Step Implementation
Step 1: Instrument Every Agent
Wrap every agent invocation with telemetry:

    import time


    class MonitoredAgent:
        """Agent wrapper that collects metrics on every invocation."""

        def __init__(self, agent, agent_name: str, metrics_client):
            self.agent = agent
            self.agent_name = agent_name
            self.metrics = metrics_client

        async def run(self, task: str) -> str:
            prefix = f"agent.{self.agent_name}"
            # perf_counter is monotonic, so durations survive clock adjustments.
            start_time = time.perf_counter()
            # Count every attempt, not only successes, so that
            # errors / invocations yields a meaningful error rate.
            self.metrics.increment(f"{prefix}.invocations")
            try:
                result = await self.agent.run(task)
            except Exception:
                duration = time.perf_counter() - start_time
                self.metrics.increment(f"{prefix}.errors")
                self.metrics.timing(f"{prefix}.error_latency", duration)
                raise  # re-raise so callers still see the failure

            duration = time.perf_counter() - start_time
            self.metrics.timing(f"{prefix}.latency", duration)
            self.metrics.increment(f"{prefix}.success")
            # Assumption: the wrapped agent exposes step_count and token_usage
            # after a run; adapt these lookups to your agent framework.
            for name, attr in (("steps", "step_count"), ("tokens", "token_usage")):
                value = getattr(self.agent, attr, None)
                if value is not None:
                    self.metrics.gauge(f"{prefix}.{name}", value)
            return result

Step 2: Implement Liveness & Readiness Probes


    import asyncio
    import time


    class AgentHealthProbe:
        """Kubernetes-style health probes for AI agents.

        The _check_* helpers and _validate_output are deployment-specific
        and are not shown here; implement them for your own stack.
        """

        async def liveness_check(self, agent) -> bool:
            """Is the agent process alive and responding?"""
            try:
                result = await asyncio.wait_for(
                    agent.run("Respond with: OK"),
                    timeout=5.0,
                )
                return "OK" in result
            except Exception:  # includes asyncio.TimeoutError
                return False

        async def readiness_check(self, agent) -> dict:
            """Is the agent ready to accept tasks?"""
            checks = {
                "model_available": await self._check_model(agent),
                "tools_available": await self._check_tools(agent),
                "memory_available": await self._check_memory(agent),
                "context_capacity": await self._check_context(agent),
            }
            return {"ready": all(checks.values()), "checks": checks}

        async def deep_check(self, agent) -> dict:
            """Full diagnostic: run a test task and validate output."""
            test_task = agent.config.test_prompt
            # Time the run here rather than relying on state set elsewhere.
            start = time.perf_counter()
            result = await agent.run(test_task)
            latency_ms = (time.perf_counter() - start) * 1000
            return {
                "passed": self._validate_output(result),
                "output_preview": result[:200],
                "latency_ms": latency_ms,
            }

Step 3: Set Up Anomaly Detection


    import statistics
    from collections import defaultdict, deque


    class AnomalyDetector:
        """Detect unusual agent behavior using statistical methods."""

        def __init__(self, window_size: int = 100):
            self.window_size = window_size
            # deque(maxlen=...) drops the oldest value automatically,
            # which gives a rolling window without manual pops.
            self.metrics_history = defaultdict(lambda: deque(maxlen=window_size))

        def record(self, agent_name: str, metric: str, value: float) -> None:
            self.metrics_history[f"{agent_name}:{metric}"].append(value)

        def is_anomalous(self, agent_name: str, metric: str, value: float,
                         z_threshold: float = 3.0) -> tuple[bool, float]:
            """Check if a value is anomalous using z-score."""
            # .get() avoids creating an empty window for unseen keys.
            history = self.metrics_history.get(f"{agent_name}:{metric}", ())
            if len(history) < 10:
                return False, 0.0  # Not enough data
            mean = statistics.mean(history)
            stdev = statistics.stdev(history)
            if stdev == 0:
                return False, 0.0  # Flat history: z-score is undefined
            z_score = (value - mean) / stdev
            return abs(z_score) > z_threshold, z_score

Step 4: Build the Alerting Pipeline

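
    from datetime import datetime, timezone
    from typing import Optional


    class AlertManager:
        """Route alerts to the right channels based on severity."""

        def __init__(self):
            self.channels = {
                "p0": ["pagerduty", "slack-critical", "phone"],
                "p1": ["slack-critical", "email"],
                "p2": ["slack-warn", "email"],
                "p3": ["dashboard", "weekly-report"],
            }

        async def alert(self, severity: str, title: str, message: str,
                        context: Optional[dict] = None):
            """Send an alert through the appropriate channels."""
            # Normalize case so "P0" from the rules file matches the "p0" key.
            severity = severity.lower()
            # Unknown severities fall back to the lowest-priority channels.
            channels = self.channels.get(severity, self.channels["p3"])
            payload = {
                "severity": severity,
                "title": title,
                "message": message,
                "context": context or {},
                # Timezone-aware UTC timestamp, unambiguous across regions.
                "timestamp": datetime.now(timezone.utc).isoformat(),
            }
            for channel in channels:
                # _send is the transport hook (Slack, PagerDuty, ...); not shown here.
                await self._send(channel, payload)

Step 5: Define Alert Rules
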

    # alert-rules.yaml
    rules:
      - name: agent_down
        condition: liveness_check == false
        for: 30s
        severity: P0
        message: "Agent {name} is unresponsive"
      - name: high_error_rate
        condition: error_rate > 0.20
        for: 5m
        severity: P1
        message: "Agent {name} error rate is {error_rate:.0%}"
      - name: latency_spike
        condition: p99_latency > 30s
        for: 3m
        severity: P1
        message: "Agent {name} p99 latency is {latency:.1f}s"
      - name: looping_detected
        condition: step_count > max_steps * 0.8
        for: 1m
        severity: P2
        message: "Agent {name} approaching step limit on {task_count} tasks"
      - name: budget_anomaly
        condition: token_usage > daily_budget * 0.5
        for: 1h
        severity: P2
        message: "Agent {name} used {usage} tokens in last hour (50% of daily budget)"

Step 6: Build the Dashboard

Essential dashboard panels for a multi-agent system:
| Panel | Metric | Display |
|---|---|---|
| Agent Grid | Liveness per agent | Green/Red status cards |
| Latency Heatmap | P50/P95/P99 per agent | Color-coded time series |
| Error Waterfall | Error rate by agent + error type | Stacked area chart |
| Token Burn Rate | Tokens/min per agent | Line chart with budget line |
| Active Tasks | Tasks in-flight per agent | Gauge per agent |
| Top Errors | Most frequent error messages | Ranked list with count |
| Context Pressure | % context window used | Per-agent gauge cluster |
| Alert Timeline | Alerts over past 24h | Event timeline |
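
For the Token Burn Rate panel, the series and its budget line could be computed along these lines. This is a sketch under stated assumptions: usage arrives as `(unix_timestamp, agent_name, tokens)` tuples, and the budget is a daily token allowance. Both function names are hypothetical.

```python
from collections import defaultdict


def tokens_per_minute(events: list[tuple[float, str, int]]) -> dict[str, dict[int, int]]:
    """Bucket (unix_ts, agent_name, tokens) events into per-agent minute totals."""
    series: dict[str, dict[int, int]] = defaultdict(lambda: defaultdict(int))
    for ts, agent, tokens in events:
        minute = int(ts // 60)   # minute index since the epoch
        series[agent][minute] += tokens
    # Convert to plain dicts so the result is easy to serialize for a chart.
    return {agent: dict(buckets) for agent, buckets in series.items()}


def budget_line(daily_budget: int) -> float:
    """Flat tokens/min rate that would spend exactly the daily budget in 24 hours."""
    return daily_budget / (24 * 60)
```

Plotting each agent's series against `budget_line(...)` shows at a glance which agents are burning faster than a sustainable rate.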
Trigger Phrases
| Phrase | Action |
|---|---|
| "Check agent health" | Run liveness probes on all agents |
| "Show me the dashboard" | Generate or link to monitoring dashboard |
| "Why is agent X slow?" | Show latency breakdown for specific agent |
| "Any anomalies?" | Run anomaly detection on recent metrics |
| "Set up alert for..." | Create a new alert rule |
| "Agent X is down" | Trigger incident response workflow |
| "Run a health check" | Execute full liveness + readiness + deep check |
Production Runbook
Incident: Agent Unresponsive
- Check the liveness probe: is the process running?
- Check the model endpoint: is the LLM provider healthy?
- Check the context window: has the agent exceeded its limit?
- Restart the agent with a fresh context
- If the problem recurs, put a circuit breaker in front of the agent
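
The circuit breaker in the last step can be sketched as follows. This is a minimal illustration, not a production implementation: the class name, parameters, and defaults are assumptions, and the clock is injectable only so the behavior can be tested deterministically.

```python
import time


class CircuitBreaker:
    """Open after `max_failures` consecutive failures; retry after `reset_after_s`."""

    def __init__(self, max_failures: int = 3, reset_after_s: float = 60.0,
                 clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.clock = clock        # injectable so tests can control time
        self.failures = 0
        self.opened_at = None     # None means the circuit is closed

    def allow(self) -> bool:
        """Return True if a call to the agent should be attempted now."""
        if self.opened_at is None:
            return True
        # After the cool-down, allow trial calls again (half-open);
        # the next recorded failure re-opens the circuit.
        return self.clock() - self.opened_at >= self.reset_after_s

    def record_success(self) -> None:
        """A successful call closes the circuit and clears the failure count."""
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        """Count a failure and open the circuit once the limit is reached."""
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = self.clock()
```

A caller would check `allow()` before invoking the agent, then report the outcome with `record_success()` or `record_failure()`. While the circuit is open, tasks can be routed to a fallback instead of piling up on a failing agent.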
Incident: Error Rate Spike
- Identify the error type: tool failure, model error, or parsing issue?
- Check recent deploys: did a prompt or tool change?
- Roll back if a recent change correlates with the spike
- Check rate limits: are external APIs throttling?
- Scale out if traffic increased
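
The first three steps lend themselves to a quick script. The sketch below assumes errors have already been labeled with a kind string and that deploys are available as `(unix_timestamp, name)` pairs. Both helper names and the 30-minute window are hypothetical choices.

```python
from collections import Counter


def dominant_error_type(error_kinds: list[str]) -> tuple[str, float]:
    """Return the most common error kind and its share of all errors."""
    if not error_kinds:
        raise ValueError("no errors to analyze")
    kind, count = Counter(error_kinds).most_common(1)[0]
    return kind, count / len(error_kinds)


def suspect_deploys(spike_start: float, deploys: list[tuple[float, str]],
                    window_s: float = 1800.0) -> list[str]:
    """List deploys that landed within `window_s` seconds before the spike began."""
    return [name for ts, name in deploys if 0 <= spike_start - ts <= window_s]
```

A deploy that shows up in `suspect_deploys` is only a rollback candidate; timing alone does not prove it caused the spike.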
Incident: Token Budget Spike
- Identify which agents are consuming the most tokens
- Check for looping: excessive step counts
- Review recent tasks: unusually long inputs?
- Implement budget caps per task
- Alert the team if the pattern persists
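
A per-task budget cap can be as simple as a counter that raises once the cap is crossed. This is a sketch with hypothetical names. It assumes the agent loop can report token usage after each model call and will stop the task when the exception is raised.

```python
class TokenBudgetExceeded(RuntimeError):
    """Raised when a task spends more tokens than its cap allows."""


class TaskTokenBudget:
    """Track token spend for one task against a hard cap."""

    def __init__(self, max_tokens: int):
        self.max_tokens = max_tokens
        self.used = 0

    def charge(self, tokens: int) -> int:
        """Record usage and return the remaining budget; raise once over the cap."""
        self.used += tokens
        if self.used > self.max_tokens:
            raise TokenBudgetExceeded(
                f"used {self.used} tokens, cap is {self.max_tokens}"
            )
        return self.max_tokens - self.used
```

Creating one `TaskTokenBudget` per task and calling `charge()` after every model call turns a runaway loop into a bounded, visible failure instead of an open-ended bill.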
Anti-Patterns
| Anti-Pattern | Why It Fails | Fix |
|---|---|---|
| Monitoring only liveness | Agent can be "alive" but useless | Add readiness + deep checks |
| Same threshold for all agents | Different agents have different baselines | Per-agent dynamic thresholds |
| No alert deduplication | Alert fatigue leads to ignored alerts | Group by fingerprint, rate-limit |
| Fixing symptoms, not causes | Band-aid solutions mask root issues | Always capture root cause in alerts |
| No dashboard | No shared visibility | Build and maintain a live dashboard |
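
The "group by fingerprint, rate-limit" fix for alert fatigue might look like the sketch below. The fingerprint fields (agent, rule, severity) and the five-minute cool-down are assumptions chosen for illustration, and the clock is injectable so the logic can be tested.

```python
import hashlib
import time


class AlertDeduplicator:
    """Suppress repeats of the same alert within a cool-down window."""

    def __init__(self, cooldown_s: float = 300.0, clock=time.monotonic):
        self.cooldown_s = cooldown_s
        self.clock = clock
        self._last_sent: dict[str, float] = {}

    @staticmethod
    def fingerprint(agent_name: str, rule_name: str, severity: str) -> str:
        """Stable short ID for alerts that should be grouped together."""
        key = f"{agent_name}|{rule_name}|{severity}".encode()
        return hashlib.sha256(key).hexdigest()[:16]

    def should_send(self, agent_name: str, rule_name: str, severity: str) -> bool:
        """Return True at most once per fingerprint per cool-down window."""
        fp = self.fingerprint(agent_name, rule_name, severity)
        now = self.clock()
        last = self._last_sent.get(fp)
        if last is not None and now - last < self.cooldown_s:
            return False   # same alert fired recently: suppress it
        self._last_sent[fp] = now
        return True
```

Calling `should_send(...)` before dispatching an alert keeps a flapping rule from paging repeatedly, while alerts with a different fingerprint still go out immediately.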