📈

MONITORING & OBSERVABILITY

Agents check Datadog, Grafana, or CloudWatch before making a recommendation

Connect your monitoring stack. During deliberation about infrastructure, releases, or incidents, agents pull real-time metrics — error rates, latency percentiles, CPU/memory, deployment health — to ground their arguments in data, not guesses.

WHAT AGENTS SEE

GET /api/monitoring/service/{name}/health → { error_rate: 0.02%, p99_latency: 340ms, cpu: 42%, deploys_24h: 3 }. Agents cite these numbers in their contributions.
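As a minimal sketch, here is how an agent might gate a decision on that payload. The field names mirror the example response above, but the threshold values and the convention of expressing rates as fractions (0.0002 = 0.02%) are illustrative assumptions, not the product's actual schema.

```python
# Hypothetical health gate over the payload shape shown above.
# Thresholds and field names are assumptions; adapt to your schema.

def is_healthy(metrics: dict,
               max_error_rate: float = 0.001,   # 0.1%, as a fraction
               max_p99_ms: float = 500.0) -> bool:
    """Return True if the service looks safe to build an argument on."""
    return (metrics["error_rate"] <= max_error_rate
            and metrics["p99_latency_ms"] <= max_p99_ms)

# Sample payload mirroring GET /api/monitoring/service/{name}/health
sample = {"error_rate": 0.0002, "p99_latency_ms": 340, "cpu": 0.42, "deploys_24h": 3}
print(is_healthy(sample))  # True
```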

LIVE EXAMPLE

Council deliberates: "Is it safe to deploy the new auth service to production?"

SRE-01: GET /api/monitoring/auth-service/health → error_rate: 0.01%, p99: 120ms

Current health is excellent. Baseline is clean for comparison.

DEPLOY-01: GET /api/monitoring/deployments/last?service=auth → 2 deploys in 24h, 1 rollback

Recent rollback is a red flag. We just recovered — another deploy adds risk.

LOAD-01: GET /api/monitoring/traffic/forecast?hours=6 → peak in 3h

Traffic peak in 3 hours. Deploying now means the new code hits peak untested at scale.

◈ Synthesis:

RECOMMENDATION: Wait 6 hours. Deploy after peak, not before. The recent rollback + upcoming peak = too much compounded risk. Schedule for tonight's low-traffic window.
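The synthesis above can be sketched as a simple decision rule: deploy only when the baseline is healthy, there is no recent rollback, and the next traffic peak is comfortably far away. The signal names and the 4-hour buffer are illustrative assumptions, not the council's actual weighting.

```python
# Illustrative decision rule behind the synthesis: each check maps to one
# agent's signal from the deliberation above. Thresholds are assumptions.

def deploy_window(healthy: bool, rollbacks_24h: int, hours_to_peak: float,
                  min_hours_before_peak: float = 4.0) -> str:
    if not healthy:
        return "block: service unhealthy"
    if rollbacks_24h > 0:
        return "wait: recent rollback, let the service soak"
    if hours_to_peak < min_hours_before_peak:
        return "wait: too close to traffic peak, deploy after it passes"
    return "deploy"

# The deliberation above: healthy baseline, 1 rollback in 24h, peak in 3h
print(deploy_window(True, 1, 3.0))  # wait: recent rollback, let the service soak
```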

GET CONNECTED

1

Expose metrics via API

Wrap the Datadog or Grafana API in an internal endpoint, or point agents at their APIs directly.

{"name": "monitoring", "type": "http", "base_url": "https://api.datadoghq.com/v1", "auth": "Bearer $DD_API_KEY"}
2

Define metric endpoints

Service health, deployment history, traffic forecast, error breakdowns.
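One way to sketch this step is a small routing table that maps each metric endpoint to a handler. The paths come from the examples above; the handler bodies and return values are placeholder assumptions.

```python
# Illustrative routing table for the metric endpoints listed above.
# Handlers are stubs; replace their bodies with real vendor queries.

ROUTES = {
    "/api/monitoring/service/{name}/health":
        lambda name: {"service": name, "error_rate": 0.0002},
    "/api/monitoring/deployments/last":
        lambda service: {"service": service, "deploys_24h": 2, "rollbacks": 1},
    "/api/monitoring/traffic/forecast":
        lambda hours: {"window_hours": hours, "peak_in_hours": 3},
}

def dispatch(path: str, **params) -> dict:
    return ROUTES[path](**params)

print(dispatch("/api/monitoring/service/{name}/health", name="auth-service"))
```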

3

Assign to SRE/ops agents

SRE-01 and DEPLOY-01 get monitoring access. Product agents don't.
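A per-agent allowlist is one simple way to model this. The agent IDs come from the example above; the allowlist structure itself is an assumption about the config format, not the product's actual mechanism.

```python
# Hypothetical per-agent tool allowlists. SRE/ops agents get monitoring;
# product agents do not. Structure is illustrative.

ALLOWLISTS = {
    "SRE-01":     {"monitoring"},
    "DEPLOY-01":  {"monitoring"},
    "PRODUCT-01": set(),          # no monitoring access
}

def can_use(agent: str, tool: str) -> bool:
    """Unknown agents default to no access."""
    return tool in ALLOWLISTS.get(agent, set())

print(can_use("SRE-01", "monitoring"))     # True
print(can_use("PRODUCT-01", "monitoring")) # False
```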

▸ CONNECT YOUR MONITORING