
Monitoring & Alerting

Verity exposes Prometheus metrics from all services and ships alerting rules via the Helm chart. This guide covers the metrics, alerts, and recommended Grafana dashboards.

Metrics Endpoint

All Verity services expose Prometheus metrics at:

GET /v1/metrics

Scrape configuration (typically handled by the Prometheus Operator through a ServiceMonitor):

serviceMonitor:
  enabled: true
  endpoints:
    - port: http
      path: /v1/metrics
      interval: 15s

Prometheus Metrics

Ingestion

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| verity_events_ingested_total | Counter | platform, connector | Total events ingested from source platforms |
| verity_events_lag_seconds | Gauge | platform, connector | Lag between event occurrence and ingestion time |

Key queries:

# Ingestion rate per connector (events/sec)
rate(verity_events_ingested_total[5m])

# Current ingestion lag
verity_events_lag_seconds

Decay Scoring

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| verity_scores_computed_total | Counter | trigger | Total access-decay scores computed |
| verity_score_computation_duration_seconds | Histogram | trigger | Duration of score computation |

Key queries:

# Score computation rate
rate(verity_scores_computed_total[5m])

# p99 score computation latency
histogram_quantile(0.99, rate(verity_score_computation_duration_seconds_bucket[5m]))

# Average score computation time
rate(verity_score_computation_duration_seconds_sum[5m])
/ rate(verity_score_computation_duration_seconds_count[5m])

Reviews

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| verity_reviews_open_total | Gauge | risk_level | Currently open review packets |
| verity_reviews_sla_breached_total | Counter | risk_level | Total reviews that breached SLA |

Key queries:

# Open reviews by risk level
verity_reviews_open_total

# SLA breaches in the last hour
increase(verity_reviews_sla_breached_total[1h])

Remediation

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| verity_remediations_executed_total | Counter | platform, status | Total remediation actions executed |
| verity_remediations_failed_total | Counter | platform | Total failed remediations |

Key queries:

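# Remediation success rate per platform
# (aggregate by platform so the two metrics' label sets match in the division)
1 - (
  sum by (platform) (rate(verity_remediations_failed_total[5m]))
  / sum by (platform) (rate(verity_remediations_executed_total[5m]))
)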

# Failed remediations in last hour
increase(verity_remediations_failed_total[1h])

API Gateway

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| verity_api_request_duration_seconds | Histogram | method, endpoint, status | API request duration |

Key queries:

# Request rate
rate(verity_api_request_duration_seconds_count[5m])

# p95 API latency
histogram_quantile(0.95, rate(verity_api_request_duration_seconds_bucket[5m]))

# Error rate (5xx)
sum(rate(verity_api_request_duration_seconds_count{status=~"5.."}[5m]))
/ sum(rate(verity_api_request_duration_seconds_count[5m]))

Audit

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| verity_audit_write_lag_seconds | Gauge | (none) | Lag between event creation and ClickHouse persistence |

Alerting Rules

The following alerts are defined in infra/helm/verity/templates/prometheus-rules.yaml and deployed automatically with the Helm chart:

Critical Alerts

| Alert | Expression | For | Description |
| --- | --- | --- | --- |
| AuditWriteLagHigh | verity_audit_write_lag_seconds > 30 | 2m | Audit log write lag exceeds 30 seconds. May indicate ClickHouse performance issues or Kafka consumer lag. |
| RemediationFailed | increase(verity_remediations_failed_total[5m]) > 0 | 0m | One or more remediation actions failed. Requires immediate investigation: access may not have been revoked. |
| ReviewSLABreach | CRITICAL review open > 4 hours | 0m | A CRITICAL-risk review packet has exceeded its SLA. Escalation required. |
| ReviewSLABreachCount | increase(verity_reviews_sla_breached_total{risk_level="CRITICAL"}[1h]) > 0 | 0m | CRITICAL review SLA breached in the last hour. |
| APIHighErrorRate | 5xx rate > 5% for 5 minutes | 5m | API is returning excessive server errors. |
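
The APIHighErrorRate condition is given in prose rather than as an expression. As a minimal sketch, assuming the rule reuses the 5xx error-rate query from the API Gateway section, it could look like this in Prometheus rule syntax. The group name, summary text, and runbook URL are illustrative placeholders, not values copied from prometheus-rules.yaml, and the rule shipped with the chart may differ.

```yaml
# Hypothetical sketch; not the chart's actual rule definition.
groups:
  - name: verity-api                 # assumed group name
    rules:
      - alert: APIHighErrorRate
        # Share of requests answered with a 5xx status over the last 5 minutes.
        expr: |
          sum(rate(verity_api_request_duration_seconds_count{status=~"5.."}[5m]))
          / sum(rate(verity_api_request_duration_seconds_count[5m]))
          > 0.05
        for: 5m                      # condition must hold for 5 minutes before firing
        labels:
          severity: critical
          team: verity
        annotations:
          summary: API is returning excessive server errors.
          runbook_url: https://example.com/runbooks/api-high-error-rate  # placeholder
```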

Warning Alerts

| Alert | Expression | For | Description |
| --- | --- | --- | --- |
| ConnectorStopped | increase(verity_events_ingested_total[10m]) == 0 | 10m | No events ingested from a connector for 10 minutes. May indicate connector crash or source API issues. |
| ScoreComputationSlow | p99 score computation > 5s | 5m | Score computation is degraded. May indicate database performance issues. |
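
The ScoreComputationSlow condition is also stated in prose. One way to express it, assuming it is built on the verity_score_computation_duration_seconds histogram and aggregated across all trigger types, is sketched below. The group name and the choice to aggregate with sum by (le) are assumptions. The actual rule may evaluate each trigger separately.

```yaml
# Hypothetical sketch; not the chart's actual rule definition.
groups:
  - name: verity-scoring             # assumed group name
    rules:
      - alert: ScoreComputationSlow
        # p99 latency across all triggers, estimated from histogram buckets.
        expr: |
          histogram_quantile(
            0.99,
            sum by (le) (rate(verity_score_computation_duration_seconds_bucket[5m]))
          ) > 5
        for: 5m
        labels:
          severity: warning
          team: verity
```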

Alert Routing

Each alert includes:

  • severity label: critical or warning
  • team label: verity
  • runbook_url annotation: Link to the relevant runbook

Configure Alertmanager to route alerts labelled team: verity to the appropriate on-call channel (Slack, PagerDuty, etc.).
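
A minimal routing sketch, assuming critical alerts should page and everything else for the team should go to chat. The receiver names, Slack channel, and credentials are placeholders to replace with your own. Only the team and severity labels come from the alert definitions described above.

```yaml
# Hypothetical Alertmanager configuration fragment.
route:
  receiver: default                  # fallback for alerts from other teams
  routes:
    - matchers:
        - team = "verity"
      receiver: verity-slack         # assumed receiver name
      routes:
        # Critical alerts are additionally escalated to the pager.
        - matchers:
            - severity = "critical"
          receiver: verity-pagerduty # assumed receiver name

receivers:
  - name: default
  - name: verity-slack
    slack_configs:
      - channel: "#verity-alerts"    # placeholder channel
        api_url: https://hooks.slack.com/services/REPLACE_ME
  - name: verity-pagerduty
    pagerduty_configs:
      - routing_key: REPLACE_ME      # PagerDuty Events API v2 integration key
```

Nesting the critical route under the team route keeps all Verity routing in one subtree, so other teams' routes are unaffected.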

Grafana Dashboards

1. Service Health Overview

Panels:

  • Pod status by service (Running / Pending / Failed)
  • CPU and memory utilisation per service
  • Restart count per pod
  • Container readiness

2. Ingestion Pipeline

Panels:

  • Events ingested per second (by platform/connector)
  • Ingestion lag (by connector)
  • Kafka consumer lag (by consumer group)
  • Event enrichment throughput

# Example: Ingestion throughput panel
rate(verity_events_ingested_total[5m])

3. Decay Score Analytics

Panels:

  • Scores computed per second (by trigger type)
  • Score computation latency (p50, p95, p99)
  • Score distribution histogram
  • High-risk grants over time

4. Review SLA Compliance

Panels:

  • Open reviews by risk level (stacked gauge)
  • SLA breach count (by risk level, over time)
  • Review decision rate
  • Average time to decision

5. API Performance

Panels:

  • Request rate (by endpoint)
  • Latency percentiles (p50, p95, p99)
  • Error rate (4xx vs 5xx)
  • Top slowest endpoints

6. Remediation Status

Panels:

  • Remediations executed (by platform, success/failure)
  • Failure rate trend
  • Time from decision to remediation

ClickHouse Query Performance

Monitor ClickHouse query performance for audit and compliance reports:

-- Active queries
SELECT query_id, elapsed, read_rows, memory_usage
FROM system.processes
ORDER BY elapsed DESC;

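-- Slow query log (last 24h)
SELECT
    query,
    query_duration_ms,
    read_rows,
    read_bytes,
    memory_usage
FROM system.query_log
WHERE event_date >= today() - 1               -- coarse filter on the date column
  AND event_time >= now() - INTERVAL 24 HOUR  -- exact rolling 24-hour window
  AND query_duration_ms > 5000                -- slower than 5 seconds
ORDER BY query_duration_ms DESC
LIMIT 20;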

-- Table sizes
SELECT
    database,
    table,
    formatReadableSize(sum(bytes_on_disk)) AS size,
    sum(rows) AS total_rows
FROM system.parts
WHERE active
GROUP BY database, table
ORDER BY sum(bytes_on_disk) DESC;

Infrastructure Monitoring

In addition to application metrics, monitor the underlying infrastructure:

| Component | Key Metrics | Source |
| --- | --- | --- |
| PostgreSQL | Connections, query latency, replication lag, disk usage | Azure Monitor / pg_stat |
| ClickHouse | Query rate, merge operations, disk usage, memory | ClickHouse system tables |
| Kafka | Consumer lag, partition count, broker health | Kafka JMX / Azure Event Hubs metrics |
| Redis | Memory usage, hit rate, connected clients | Redis INFO / Azure Monitor |
| Temporal | Workflow execution latency, task queue depth, worker health | Temporal metrics |
| Kubernetes | Node CPU/memory, pod restarts, OOMKills | kube-state-metrics |