Monitoring & Alerting¶

Verity exposes Prometheus metrics from all services and ships alerting rules via the Helm chart. This guide covers the metrics, alerts, and recommended Grafana dashboards.

Metrics Endpoint¶

All Verity services expose Prometheus metrics at:

GET /v1/metrics

Prometheus scrape configuration (typically handled by the Prometheus Operator):

serviceMonitor:
  enabled: true
  endpoints:
    - port: http
      path: /v1/metrics
      interval: 15s
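
If the snippet above is read as Helm values (an assumption; the chart's templates are not shown in this guide), the Prometheus Operator resource it renders would look roughly like the sketch below. The resource name and the selector label are hypothetical and would need to match the labels on the Verity Services in your cluster.

```yaml
# Sketch of a ServiceMonitor equivalent to the values above.
# metadata.name and the selector label are assumptions, not taken from the chart.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: verity
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: verity   # hypothetical Service label
  endpoints:
    - port: http          # named port on the Service
      path: /v1/metrics
      interval: 15s
```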

Prometheus Metrics¶

Ingestion¶

Metric	Type	Labels	Description
`verity_events_ingested_total`	Counter	`platform`, `connector`	Total events ingested from source platforms
`verity_events_lag_seconds`	Gauge	`platform`, `connector`	Lag between event occurrence and ingestion time

Key queries:

# Ingestion rate per connector (events/sec)
rate(verity_events_ingested_total[5m])

# Current ingestion lag
verity_events_lag_seconds

Decay Scoring¶

Metric	Type	Labels	Description
`verity_scores_computed_total`	Counter	`trigger`	Total access-decay scores computed
`verity_score_computation_duration_seconds`	Histogram	`trigger`	Duration of score computation

Key queries:

# Score computation rate
rate(verity_scores_computed_total[5m])

# p99 score computation latency (aggregated across instances)
histogram_quantile(0.99, sum by (le) (rate(verity_score_computation_duration_seconds_bucket[5m])))

# Average score computation time
rate(verity_score_computation_duration_seconds_sum[5m])
/ rate(verity_score_computation_duration_seconds_count[5m])

Reviews¶

Metric	Type	Labels	Description
`verity_reviews_open_total`	Gauge	`risk_level`	Currently open review packets
`verity_reviews_sla_breached_total`	Counter	`risk_level`	Total reviews that breached SLA

Key queries:

# Open reviews by risk level
verity_reviews_open_total

# SLA breaches in the last hour (by risk level)
increase(verity_reviews_sla_breached_total[1h])

Remediation¶

Metric	Type	Labels	Description
`verity_remediations_executed_total`	Counter	`platform`, `status`	Total remediation actions executed
`verity_remediations_failed_total`	Counter	`platform`	Total failed remediations

Key queries:

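# Remediation success rate per platform
# (aggregate away `status` so both sides of the division share the same labels)
1 - (
  sum by (platform) (rate(verity_remediations_failed_total[5m]))
  / sum by (platform) (rate(verity_remediations_executed_total[5m]))
)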

# Failed remediations in last hour
increase(verity_remediations_failed_total[1h])

API Gateway¶

Metric	Type	Labels	Description
`verity_api_request_duration_seconds`	Histogram	`method`, `endpoint`, `status`	API request duration

Key queries:

# Request rate
rate(verity_api_request_duration_seconds_count[5m])

# p95 API latency (aggregated across instances and endpoints)
histogram_quantile(0.95, sum by (le) (rate(verity_api_request_duration_seconds_bucket[5m])))

# Error rate (5xx)
sum(rate(verity_api_request_duration_seconds_count{status=~"5.."}[5m]))
/ sum(rate(verity_api_request_duration_seconds_count[5m]))

Audit¶

Metric	Type	Labels	Description
`verity_audit_write_lag_seconds`	Gauge	—	Lag between event creation and ClickHouse persistence

Alerting Rules¶

The following alerts are defined in infra/helm/verity/templates/prometheus-rules.yaml and deployed automatically with the Helm chart:

Critical Alerts¶

Alert	Expression	For	Description
AuditWriteLagHigh	`verity_audit_write_lag_seconds > 30`	2m	Audit log write lag exceeds 30 seconds. May indicate ClickHouse performance issues or Kafka consumer lag.
RemediationFailed	`increase(verity_remediations_failed_total[5m]) > 0`	0m	One or more remediation actions failed. Requires immediate investigation — access may not have been revoked.
ReviewSLABreach	CRITICAL review open > 4 hours	0m	A CRITICAL-risk review packet has exceeded its SLA. Escalation required.
ReviewSLABreachCount	`increase(verity_reviews_sla_breached_total{risk_level="CRITICAL"}[1h]) > 0`	0m	CRITICAL review SLA breached in the last hour.
APIHighErrorRate	5xx rate > 5% for 5 minutes	5m	API is returning excessive server errors.
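
The table gives APIHighErrorRate in prose only. One way to express it, reusing the 5xx error-rate query from the API Gateway section, is sketched below as a PrometheusRule. The rule shipped in the chart may differ; the resource name, group name, summary text, and runbook URL are placeholders.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: verity-api-example        # hypothetical name
spec:
  groups:
    - name: verity.api.example    # hypothetical group
      rules:
        - alert: APIHighErrorRate
          # Share of requests answered with a 5xx status over the last 5 minutes
          expr: |
            sum(rate(verity_api_request_duration_seconds_count{status=~"5.."}[5m]))
              / sum(rate(verity_api_request_duration_seconds_count[5m]))
              > 0.05
          for: 5m
          labels:
            severity: critical
            team: verity
          annotations:
            summary: More than 5% of API requests are failing with 5xx
            runbook_url: https://example.com/runbooks/api-high-error-rate  # placeholder
```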

Warning Alerts¶

Alert	Expression	For	Description
ConnectorStopped	`increase(verity_events_ingested_total[10m]) == 0`	10m	No events ingested from a connector over a 10-minute window, sustained for 10 minutes (roughly 20 minutes without events before the alert fires). May indicate connector crash or source API issues.
ScoreComputationSlow	p99 score computation > 5s	5m	Score computation is degraded. May indicate database performance issues.
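
ScoreComputationSlow is likewise described in prose. A possible rule, built on the p99 query from the Decay Scoring section, is sketched here in plain Prometheus rule-file form (in a PrometheusRule the same `groups` list sits under `spec`). The group name and runbook URL are placeholders, and the chart's own expression may differ.

```yaml
groups:
  - name: verity.scoring.example   # hypothetical group name
    rules:
      - alert: ScoreComputationSlow
        # p99 over all triggers and instances, derived from the histogram buckets
        expr: |
          histogram_quantile(
            0.99,
            sum by (le) (rate(verity_score_computation_duration_seconds_bucket[5m]))
          ) > 5
        for: 5m
        labels:
          severity: warning
          team: verity
        annotations:
          runbook_url: https://example.com/runbooks/score-computation-slow  # placeholder
```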

Alert Routing¶

Each alert includes:

severity label: critical or warning
team label: verity
runbook_url annotation: Link to the relevant runbook

Configure Alertmanager to route alerts labelled team: verity to the appropriate on-call channel (Slack, PagerDuty, etc.).
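
As a rough illustration of such routing, the Alertmanager configuration below sends all Verity alerts to Slack and additionally pages on critical ones. It assumes Alertmanager 0.22 or later (for the `matchers` syntax); the receiver names, Slack channel, webhook URL, and PagerDuty key are placeholders, not values from the Verity chart.

```yaml
# alertmanager.yml sketch; receiver names, channel, and secrets are placeholders
route:
  receiver: default
  routes:
    - matchers:
        - team = "verity"
      receiver: verity-slack            # warnings and any other Verity alert
      routes:
        - matchers:
            - severity = "critical"
          receiver: verity-pagerduty    # page on-call for critical alerts
receivers:
  - name: default
  - name: verity-slack
    slack_configs:
      - api_url: https://hooks.slack.com/services/REPLACE_ME
        channel: "#verity-alerts"
  - name: verity-pagerduty
    pagerduty_configs:
      - routing_key: REPLACE_ME
```

The nested route is evaluated first for alerts that already matched team = "verity", so critical alerts go to PagerDuty and everything else from Verity falls back to the Slack receiver.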

Grafana Dashboards¶

Recommended Dashboards¶

1. Service Health Overview¶

Panels:

Pod status by service (Running / Pending / Failed)
CPU and memory utilisation per service
Restart count per pod
Container readiness

2. Ingestion Pipeline¶

Panels:

Events ingested per second (by platform/connector)
Ingestion lag (by connector)
Kafka consumer lag (by consumer group)
Event enrichment throughput

# Example: Ingestion throughput panel
rate(verity_events_ingested_total[5m])

3. Decay Score Analytics¶

Panels:

Scores computed per second (by trigger type)
Score computation latency (p50, p95, p99)
Score distribution histogram
High-risk grants over time

4. Review SLA Compliance¶

Panels:

Open reviews by risk level (stacked gauge)
SLA breach count (by risk level, over time)
Review decision rate
Average time to decision

5. API Performance¶

Panels:

Request rate (by endpoint)
Latency percentiles (p50, p95, p99)
Error rate (4xx vs 5xx)
Top slowest endpoints
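
Percentile panels re-evaluate histogram_quantile on every dashboard refresh. One option, sketched here and not something the Verity chart is known to ship, is to precompute the series with Prometheus recording rules and point the panels at the recorded names. The group and record names are hypothetical.

```yaml
groups:
  - name: verity.api.recording.example   # hypothetical group name
    rules:
      # Per-endpoint bucket rates, aggregated across instances
      - record: endpoint_le:verity_api_request_duration_seconds_bucket:rate5m
        expr: sum by (endpoint, le) (rate(verity_api_request_duration_seconds_bucket[5m]))
      # p95 per endpoint, computed from the recorded bucket rates
      - record: endpoint:verity_api_request_duration_seconds:p95_5m
        expr: |
          histogram_quantile(
            0.95,
            endpoint_le:verity_api_request_duration_seconds_bucket:rate5m
          )
```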

6. Remediation Status¶

Panels:

Remediations executed (by platform, success/failure)
Failure rate trend
Time from decision to remediation

ClickHouse Query Performance¶

Monitor ClickHouse query performance for audit and compliance reports:

-- Active queries
SELECT query_id, elapsed, read_rows, memory_usage
FROM system.processes
ORDER BY elapsed DESC;

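-- Slow queries (over 5 s) from the last 24 hours
SELECT
    query,
    query_duration_ms,
    read_rows,
    read_bytes,
    memory_usage
FROM system.query_log
WHERE event_date >= yesterday()                  -- coarse date filter
  AND event_time >= now() - INTERVAL 24 HOUR     -- exact 24-hour window
  AND query_duration_ms > 5000
ORDER BY query_duration_ms DESC
LIMIT 20;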

-- Table sizes
SELECT
    database,
    table,
    formatReadableSize(sum(bytes_on_disk)) AS size,
    sum(rows) AS total_rows
FROM system.parts
WHERE active
GROUP BY database, table
ORDER BY sum(bytes_on_disk) DESC;

Infrastructure Monitoring¶

In addition to application metrics, monitor the underlying infrastructure:

Component	Key Metrics	Source
PostgreSQL	Connections, query latency, replication lag, disk usage	Azure Monitor / pg_stat
ClickHouse	Query rate, merge operations, disk usage, memory	ClickHouse system tables
Kafka	Consumer lag, partition count, broker health	Kafka JMX / Azure Event Hubs metrics
Redis	Memory usage, hit rate, connected clients	Redis INFO / Azure Monitor
Temporal	Workflow execution latency, task queue depth, worker health	Temporal metrics
Kubernetes	Node CPU/memory, pod restarts, OOMKills	kube-state-metrics