Production Checklist

Complete this checklist before deploying Verity to production. Each item should be verified and signed off by the responsible team.

Identity & Authentication

  • Azure AD / Entra ID app registration configured with appropriate API permissions
  • Service principal or managed identity provisioned for each connector
  • JWT signing key generated and stored in Azure Key Vault
  • CORS policy restricted to production domain(s) only
  • API rate limiting configured (recommended: 100 req/s per client)
  • OAuth 2.0 / OIDC token validation enabled on API Gateway
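
If the cluster fronts the API Gateway with ingress-nginx (an assumption; other gateways expose equivalent settings), the rate-limit and TLS-redirect items above can be expressed as Ingress annotations. The 100 req/s figure matches the recommendation in this checklist:

```yaml
# Sketch only: annotation names are ingress-nginx specific.
metadata:
  annotations:
    nginx.ingress.kubernetes.io/limit-rps: "100"
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
```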

Secrets Management

  • All database credentials stored in Kubernetes Secrets (via Azure Key Vault CSI driver)
  • Azure Key Vault SecretProviderClass configured and syncing (Key Vault secret → Kubernetes Secret):
    • db-password → verity-db-credentials
    • clickhouse-password → verity-clickhouse-credentials
    • kafka-connection-string → verity-kafka-credentials
    • redis-password → verity-redis-credentials
    • azure-ad-client-secret → verity-azure-ad
    • jwt-signing-key → verity-jwt
  • Secrets rotation policy established (recommended: 90-day rotation)
  • No secrets in environment variables, ConfigMaps, or source code
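
As an illustration of one of the mappings above, a minimal SecretProviderClass for the database credentials might look like the following (Key Vault name and tenant ID are placeholders; adjust to your environment):

```yaml
apiVersion: secrets-store.csi.x-k8s.io/v1
kind: SecretProviderClass
metadata:
  name: verity-db-credentials
  namespace: verity
spec:
  provider: azure
  parameters:
    keyvaultName: <key-vault-name>   # placeholder
    tenantId: <tenant-id>            # placeholder
    objects: |
      array:
        - |
          objectName: db-password
          objectType: secret
  secretObjects:
    - secretName: verity-db-credentials   # synced Kubernetes Secret
      type: Opaque
      data:
        - objectName: db-password
          key: db-password
```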

TLS & Networking

  • TLS certificates provisioned via cert-manager with letsencrypt-prod cluster issuer
  • Ingress configured with ssl-redirect: "true"
  • NetworkPolicies enabled (networkPolicy.enabled: true in Helm values)
  • Default deny-all network policy active
  • All 21 service-specific network policies verified
  • Internal service communication over private network (no public endpoints for databases)
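
The default deny-all item above corresponds to a policy like this minimal sketch (namespace name taken from this checklist; the 21 service-specific policies then selectively re-allow traffic):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: verity
spec:
  podSelector: {}          # empty selector matches every pod in the namespace
  policyTypes:
    - Ingress
    - Egress
```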

Compute & Scaling

  • Resource requests and limits set for all pods (per values-prod.yaml)
  • Horizontal Pod Autoscaler configured for API Gateway and ingestion services
  • Node pools sized appropriately:
    • System pool: 3 nodes minimum
    • Workload pool: Auto-scaling enabled (min 3, max 20)
  • Pod Disruption Budgets set for critical services (≥1 pod always available)
  • Temporal namespace created (verity) with appropriate worker task queue scaling
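
A Pod Disruption Budget satisfying the "≥1 pod always available" requirement can be sketched as follows (label selector is a hypothetical example; match it to the chart's actual labels):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: verity-api-gateway
  namespace: verity
spec:
  minAvailable: 1          # keep at least one pod up during voluntary disruptions
  selector:
    matchLabels:
      app: verity-api-gateway   # hypothetical label
```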

Databases

  • PostgreSQL (Azure Database for PostgreSQL Flexible Server):
    • High Availability enabled (zone-redundant)
    • Automated backups configured (retention: 30 days minimum)
    • Point-in-time restore tested
    • TimescaleDB extension enabled
    • Connection pooling via PgBouncer enabled
    • max_connections tuned for expected load
  • ClickHouse:
    • Replication configured (≥2 replicas)
    • TTL policies set for audit data retention
    • Backup strategy documented and tested
    • max_memory_usage configured per query
  • Redis (Azure Cache for Redis):
    • Premium tier with TLS enabled (rediss://)
    • Persistence enabled (AOF or RDB)
    • Maxmemory policy set to allkeys-lru

Kafka / Event Hubs

  • Azure Event Hubs namespace provisioned with Kafka protocol support
  • Topic replication factor ≥ 3 (or Event Hubs Standard/Premium tier)
  • All required topics created:
    • verity.events.raw.{platform} (per connector)
    • verity.events.normalised
    • verity.scores.updated
    • verity.reviews.created
    • verity.reviews.decided
    • verity.remediation.completed
    • verity.audit.trail
  • Consumer group IDs registered for each service
  • SASL_SSL authentication configured (securityProtocol: SASL_SSL)
  • Message retention set (recommended: 7 days for event topics, 30 days for audit)
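
When running against a Kafka-compatible endpoint, topic creation with the recommended retention can be sketched with the standard CLI (partition count is an assumption; 604800000 ms = 7 days, 2592000000 ms = 30 days):

```shell
kafka-topics.sh --bootstrap-server "$BOOTSTRAP" --create \
  --topic verity.events.normalised \
  --partitions 12 --replication-factor 3 \
  --config retention.ms=604800000

kafka-topics.sh --bootstrap-server "$BOOTSTRAP" --create \
  --topic verity.audit.trail \
  --partitions 12 --replication-factor 3 \
  --config retention.ms=2592000000
```

Note that on Azure Event Hubs, topics (event hubs) and retention are managed through the namespace instead, so the equivalent settings are applied there.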

Temporal

  • Temporal namespace verity created
  • Temporal worker scaling configured (min 2 workers per task queue)
  • Workflow execution timeout set appropriately
  • Retention period for closed workflows configured (recommended: 30 days)
  • Temporal server health monitored
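
The namespace and retention items above can be sketched with tctl (retention is given in days here; the newer `temporal` CLI uses different subcommands and flags, so treat this as a version-dependent example):

```shell
tctl --namespace verity namespace register --retention 30
```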

Monitoring & Observability

  • Prometheus scraping enabled for all services at /v1/metrics
  • Prometheus alerting rules deployed (from prometheus-rules.yaml):
    • AuditWriteLagHigh — audit write lag > 30s
    • RemediationFailed — any remediation failure
    • ConnectorStopped — no events for 10 minutes
    • ReviewSLABreach — CRITICAL review open > 4 hours
    • ScoreComputationSlow — p99 score latency > 5s
    • APIHighErrorRate — 5xx rate > 5%
  • Grafana dashboards provisioned for:
    • Service health overview
    • Ingestion pipeline throughput
    • Decay score distribution
    • Review SLA compliance
    • API latency and error rates
  • Log aggregation configured (Azure Monitor / ELK / Loki)
  • Structured JSON logging enabled (LOG_FORMAT: json)
  • OpenTelemetry collector configured for distributed tracing
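
As one example of the rules deployed from prometheus-rules.yaml, the APIHighErrorRate alert might be expressed like this (the metric name `http_requests_total` and its labels are assumptions; substitute whatever the gateway actually exports):

```yaml
groups:
  - name: verity-api
    rules:
      - alert: APIHighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "API 5xx rate above 5% for 5 minutes"
```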

Security

  • Container image scanning enabled in CI (Trivy)
  • Bandit SAST running on every PR
  • Semgrep SAST running on every PR
  • Weekly scheduled security scans configured
  • Container images pinned to specific SHA tags (not latest) in production
  • Pod security standards enforced (restricted profile)
  • Azure Workload Identity configured (no stored service principal credentials)
  • Audit trail immutability verified (ClickHouse append-only)
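
The image-scanning gate can be enforced in CI with Trivy, failing the pipeline on serious findings. A sketch (registry path and digest are placeholders):

```shell
# Exit non-zero if any HIGH or CRITICAL vulnerability is found.
trivy image --severity HIGH,CRITICAL --exit-code 1 \
  myregistry.azurecr.io/verity/api-gateway@sha256:<digest>
```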

Disaster Recovery

  • PostgreSQL backup strategy documented:
    • Automated daily backups
    • Cross-region geo-redundant backup storage
    • Recovery time objective (RTO) defined
    • Recovery point objective (RPO) defined
  • ClickHouse backup strategy documented
  • Kafka/Event Hubs data retention configured
  • Runbook for full cluster recovery documented
  • Disaster recovery drill scheduled (quarterly recommended)

Compliance

  • Data retention policies configured per regulatory requirements
  • Compliance reports validated (SOC 2, ISO 27001, GDPR as applicable)
  • Audit trail completeness verified
  • Access review SLA thresholds configured:
    • CRITICAL: 4 hours
    • HIGH: 24 hours
    • MEDIUM: 7 days
    • LOW: 30 days
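
The SLA thresholds above might be carried in Helm values along these lines (the `review.sla` keys are hypothetical; align them with the chart's actual schema):

```yaml
review:
  sla:
    critical: 4h
    high: 24h
    medium: 168h   # 7 days
    low: 720h      # 30 days
```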

Final Verification

  • End-to-end smoke test passed:
    1. Connector ingests events → Kafka
    2. Events enriched and normalised
    3. Decay scores computed
    4. Review packets generated for high-risk access
    5. Review decision workflow completes
    6. Remediation executes on source platform
    7. Audit trail written to ClickHouse
    8. API returns all data correctly
    9. Dashboard displays real-time metrics
  • Load test completed at 2× expected peak traffic
  • Rollback procedure tested with Helm
  • On-call rotation established
  • Incident response playbook documented
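
The Helm rollback test above amounts to verifying the following flow works end to end (release and namespace names taken from this checklist; the revision number is a placeholder):

```shell
helm history verity -n verity                      # inspect recent revisions
helm rollback verity <revision> -n verity --wait   # roll back to a known-good revision
```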