Monitoring & Alerting
Verity exposes Prometheus metrics from all services and ships alerting rules via the Helm chart. This guide covers the metrics, alerts, and recommended Grafana dashboards.
Metrics Endpoint
All Verity services expose Prometheus metrics at:
Prometheus scrape configuration (typically handled by the Prometheus Operator):
Prometheus Metrics
Ingestion
| Metric | Type | Labels | Description |
|---|---|---|---|
| `verity_events_ingested_total` | Counter | `platform`, `connector` | Total events ingested from source platforms |
| `verity_events_lag_seconds` | Gauge | `platform`, `connector` | Lag between event occurrence and ingestion time |
Key queries:
# Ingestion rate per connector (events/sec)
rate(verity_events_ingested_total[5m])
# Current ingestion lag
verity_events_lag_seconds
Decay Scoring
| Metric | Type | Labels | Description |
|---|---|---|---|
| `verity_scores_computed_total` | Counter | `trigger` | Total access-decay scores computed |
| `verity_score_computation_duration_seconds` | Histogram | `trigger` | Duration of score computation |
Key queries:
# Score computation rate
rate(verity_scores_computed_total[5m])
# p99 score computation latency
histogram_quantile(0.99, rate(verity_score_computation_duration_seconds_bucket[5m]))
# Average score computation time
rate(verity_score_computation_duration_seconds_sum[5m])
/ rate(verity_score_computation_duration_seconds_count[5m])
Reviews
| Metric | Type | Labels | Description |
|---|---|---|---|
| `verity_reviews_open_total` | Gauge | `risk_level` | Currently open review packets |
| `verity_reviews_sla_breached_total` | Counter | `risk_level` | Total reviews that breached SLA |
Key queries:
# Open reviews by risk level
verity_reviews_open_total
# SLA breaches in the last hour (by risk level)
increase(verity_reviews_sla_breached_total[1h])
Remediation
| Metric | Type | Labels | Description |
|---|---|---|---|
| `verity_remediations_executed_total` | Counter | `platform`, `status` | Total remediation actions executed |
| `verity_remediations_failed_total` | Counter | `platform` | Total failed remediations |
Key queries:
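# Remediation success rate per platform
# Both sides are aggregated by platform: the two counters carry different
# label sets, so an unaggregated division would match no series.
1 - (
  sum by (platform) (rate(verity_remediations_failed_total[5m]))
  / sum by (platform) (rate(verity_remediations_executed_total[5m]))
)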
# Failed remediations in last hour
increase(verity_remediations_failed_total[1h])
API Gateway
| Metric | Type | Labels | Description |
|---|---|---|---|
| `verity_api_request_duration_seconds` | Histogram | `method`, `endpoint`, `status` | API request duration |
Key queries:
# Request rate
rate(verity_api_request_duration_seconds_count[5m])
# p95 API latency
histogram_quantile(0.95, rate(verity_api_request_duration_seconds_bucket[5m]))
# Error rate (5xx)
sum(rate(verity_api_request_duration_seconds_count{status=~"5.."}[5m]))
/ sum(rate(verity_api_request_duration_seconds_count[5m]))
Audit
| Metric | Type | Labels | Description |
|---|---|---|---|
| `verity_audit_write_lag_seconds` | Gauge | none | Lag between event creation and ClickHouse persistence |
Alerting Rules
The following alerts are defined in infra/helm/verity/templates/prometheus-rules.yaml and deployed automatically with the Helm chart:
Critical Alerts
| Alert | Expression | For | Description |
|---|---|---|---|
| AuditWriteLagHigh | `verity_audit_write_lag_seconds > 30` | 2m | Audit log write lag exceeds 30 seconds. May indicate ClickHouse performance issues or Kafka consumer lag. |
| RemediationFailed | `increase(verity_remediations_failed_total[5m]) > 0` | 0m | One or more remediation actions failed. Requires immediate investigation: access may not have been revoked. |
| ReviewSLABreach | CRITICAL review open > 4 hours | 0m | A CRITICAL-risk review packet has exceeded its SLA. Escalation required. |
| ReviewSLABreachCount | `increase(verity_reviews_sla_breached_total{risk_level="CRITICAL"}[1h]) > 0` | 0m | CRITICAL review SLA breached in the last hour. |
| APIHighErrorRate | 5xx rate > 5% for 5 minutes | 5m | API is returning excessive server errors. |
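The APIHighErrorRate row states its condition in words rather than PromQL. Here is a minimal sketch of how that condition might be written in plain Prometheus rule-file syntax, assuming it reuses the 5xx error-rate query from the API Gateway section. The group name, summary text, and runbook URL are illustrative placeholders, and the chart's own template may express the rule differently.

```yaml
groups:
  - name: verity-api-example          # hypothetical group name
    rules:
      - alert: APIHighErrorRate
        # Share of requests answered with a 5xx status over the last 5 minutes.
        expr: |
          sum(rate(verity_api_request_duration_seconds_count{status=~"5.."}[5m]))
            / sum(rate(verity_api_request_duration_seconds_count[5m]))
            > 0.05
        for: 5m                       # matches the "For" column above
        labels:
          severity: critical
          team: verity
        annotations:
          summary: API is returning excessive server errors.
          runbook_url: https://example.com/runbooks/api-high-error-rate  # placeholder URL
```

The ratio is only defined while the API receives traffic. With no requests in the window the expression returns no data, so the alert stays silent rather than firing.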
Warning Alerts
| Alert | Expression | For | Description |
|---|---|---|---|
| ConnectorStopped | `increase(verity_events_ingested_total[10m]) == 0` | 10m | No events ingested from a connector for 10 minutes. May indicate connector crash or source API issues. |
| ScoreComputationSlow | p99 score computation > 5s | 5m | Score computation is degraded. May indicate database performance issues. |
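The ScoreComputationSlow condition is also given in words. One way to express it, sketched here in plain Prometheus rule-file syntax, is to build on the p99 query from the Decay Scoring section. The group name, the per-trigger aggregation, and the runbook URL are assumptions for illustration, not values taken from the chart.

```yaml
groups:
  - name: verity-scoring-example      # hypothetical group name
    rules:
      - alert: ScoreComputationSlow
        # p99 score computation time per trigger, estimated from histogram buckets.
        expr: |
          histogram_quantile(
            0.99,
            sum by (le, trigger) (
              rate(verity_score_computation_duration_seconds_bucket[5m])
            )
          ) > 5
        for: 5m                       # matches the "For" column above
        labels:
          severity: warning
          team: verity
        annotations:
          summary: Score computation is degraded.
          runbook_url: https://example.com/runbooks/score-computation-slow  # placeholder URL
```

Keeping `trigger` in the aggregation makes the alert fire per trigger type, which helps narrow down which scoring path is slow. Dropping it from the `by` clause would give a single service-wide p99 instead.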
Alert Routing
Each alert includes:
- `severity` label: `critical` or `warning`
- `team` label: `verity`
- `runbook_url` annotation: link to the relevant runbook
Configure Alertmanager to route alerts labeled `team: verity` to the appropriate on-call channel (Slack, PagerDuty, etc.).
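As an illustration of the routing described above, the Alertmanager fragment below sends every `team: verity` alert to one receiver and escalates the `critical` ones to a second. This is a sketch under assumptions: all three receiver names are hypothetical, and the receivers are left empty.

```yaml
route:
  receiver: default                   # fallback for alerts from other teams
  routes:
    - matchers:
        - team = "verity"
      receiver: verity-chat           # hypothetical receiver, e.g. a Slack channel
      routes:
        # Critical Verity alerts are escalated to the paging receiver.
        - matchers:
            - severity = "critical"
          receiver: verity-pager      # hypothetical receiver, e.g. PagerDuty

receivers:
  - name: default
  - name: verity-chat                 # add slack_configs here
  - name: verity-pager                # add pagerduty_configs here
```

A receiver with only a name is valid but sends no notifications, so each one needs its integration block filled in before the routing has any effect.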
Grafana Dashboards
Recommended Dashboards
1. Service Health Overview
Panels:
- Pod status by service (Running / Pending / Failed)
- CPU and memory utilisation per service
- Restart count per pod
- Container readiness
2. Ingestion Pipeline
Panels:
- Events ingested per second (by platform/connector)
- Ingestion lag (by connector)
- Kafka consumer lag (by consumer group)
- Event enrichment throughput
3. Decay Score Analytics
Panels:
- Scores computed per second (by trigger type)
- Score computation latency (p50, p95, p99)
- Score distribution histogram
- High-risk grants over time
4. Review SLA Compliance
Panels:
- Open reviews by risk level (stacked gauge)
- SLA breach count (by risk level, over time)
- Review decision rate
- Average time to decision
5. API Performance
Panels:
- Request rate (by endpoint)
- Latency percentiles (p50, p95, p99)
- Error rate (4xx vs 5xx)
- Top slowest endpoints
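The latency panels in this dashboard evaluate `histogram_quantile` over every bucket series on each refresh. Precomputing the percentiles with recording rules is one way to keep that cheap. The sketch below assumes plain Prometheus rule-file syntax, and the group and rule names are hypothetical rather than shipped with the chart.

```yaml
groups:
  - name: verity-api-latency-example  # hypothetical group name
    rules:
      # Per-endpoint p95 latency over a 5-minute window.
      - record: verity:api_request_duration_seconds:p95_5m   # hypothetical rule name
        expr: |
          histogram_quantile(
            0.95,
            sum by (le, endpoint) (
              rate(verity_api_request_duration_seconds_bucket[5m])
            )
          )
      # Per-endpoint p99 latency over a 5-minute window.
      - record: verity:api_request_duration_seconds:p99_5m   # hypothetical rule name
        expr: |
          histogram_quantile(
            0.99,
            sum by (le, endpoint) (
              rate(verity_api_request_duration_seconds_bucket[5m])
            )
          )
```

With rules like these in place, a "Top slowest endpoints" panel could query `topk(5, verity:api_request_duration_seconds:p95_5m)` instead of recomputing the quantile.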
6. Remediation Status
Panels:
- Remediations executed (by platform, success/failure)
- Failure rate trend
- Time from decision to remediation
ClickHouse Query Performance
Monitor ClickHouse query performance for audit and compliance reports:
-- Active queries
SELECT query_id, elapsed, read_rows, memory_usage
FROM system.processes
ORDER BY elapsed DESC;
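-- Slow query log: queries slower than 5 s in the last 24 hours
SELECT
    query,
    query_duration_ms,
    read_rows,
    read_bytes,
    memory_usage
FROM system.query_log
-- event_date narrows the scan; event_time enforces the exact 24-hour window
WHERE event_date >= yesterday()
  AND event_time >= now() - INTERVAL 24 HOUR
  AND query_duration_ms > 5000
ORDER BY query_duration_ms DESC
LIMIT 20;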
-- Table sizes
SELECT
database,
table,
formatReadableSize(sum(bytes_on_disk)) AS size,
sum(rows) AS total_rows
FROM system.parts
WHERE active
GROUP BY database, table
ORDER BY sum(bytes_on_disk) DESC;
Infrastructure Monitoring
In addition to application metrics, monitor the underlying infrastructure:
| Component | Key Metrics | Source |
|---|---|---|
| PostgreSQL | Connections, query latency, replication lag, disk usage | Azure Monitor / pg_stat |
| ClickHouse | Query rate, merge operations, disk usage, memory | ClickHouse system tables |
| Kafka | Consumer lag, partition count, broker health | Kafka JMX / Azure Event Hubs metrics |
| Redis | Memory usage, hit rate, connected clients | Redis INFO / Azure Monitor |
| Temporal | Workflow execution latency, task queue depth, worker health | Temporal metrics |
| Kubernetes | Node CPU/memory, pod restarts, OOMKills | kube-state-metrics |