# Production Checklist
Complete this checklist before deploying Verity to production. Each item should be verified and signed off by the responsible team.
## Identity & Authentication
- Azure AD / Entra ID app registration configured with appropriate API permissions
- Service principal or managed identity provisioned for each connector
- JWT signing key generated and stored in Azure Key Vault
- CORS policy restricted to production domain(s) only
- API rate limiting configured (recommended: 100 req/s per client)
- OAuth 2.0 / OIDC token validation enabled on API Gateway
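If the API Gateway is fronted by ingress-nginx, the 100 req/s limit can be expressed directly as ingress annotations. A minimal sketch, assuming ingress-nginx is in use; the hostname and service name are placeholders. Note that `limit-rps` throttles per client IP, which may differ from "per client" if clients share a NAT.

```yaml
# Illustrative only: assumes the ingress-nginx controller; host and
# backend names are placeholders, not values from this deployment.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: verity-api-gateway
  annotations:
    nginx.ingress.kubernetes.io/limit-rps: "100"          # ~100 req/s per client IP
    nginx.ingress.kubernetes.io/limit-burst-multiplier: "5"
spec:
  rules:
    - host: api.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: verity-api-gateway
                port:
                  number: 443
```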
## Secrets Management
- All database credentials stored in Kubernetes Secrets (via Azure Key Vault CSI driver)
- Azure Key Vault `SecretProviderClass` configured and syncing:
    - `db-password` → `verity-db-credentials`
    - `clickhouse-password` → `verity-clickhouse-credentials`
    - `kafka-connection-string` → `verity-kafka-credentials`
    - `redis-password` → `verity-redis-credentials`
    - `azure-ad-client-secret` → `verity-azure-ad`
    - `jwt-signing-key` → `verity-jwt`
- Secrets rotation policy established (recommended: 90-day rotation)
- No secrets in environment variables, ConfigMaps, or source code
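One of the Key Vault mappings above can be sketched as a `SecretProviderClass` for the Azure CSI provider. The vault name, tenant ID, and client ID below are placeholders; only the `db-password` → `verity-db-credentials` mapping is shown.

```yaml
# Sketch of a single secret sync; keyvaultName, tenantId, and clientID
# are placeholders to be replaced with real values.
apiVersion: secrets-store.csi.x-k8s.io/v1
kind: SecretProviderClass
metadata:
  name: verity-db
spec:
  provider: azure
  parameters:
    keyvaultName: verity-kv              # placeholder vault name
    tenantId: <tenant-id>
    clientID: <workload-identity-client-id>
    objects: |
      array:
        - |
          objectName: db-password
          objectType: secret
  secretObjects:
    - secretName: verity-db-credentials  # synced Kubernetes Secret
      type: Opaque
      data:
        - objectName: db-password
          key: db-password
```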
## TLS & Networking
- TLS certificates provisioned via cert-manager with the `letsencrypt-prod` cluster issuer
- Ingress configured with `ssl-redirect: "true"`
- NetworkPolicies enabled (`networkPolicy.enabled: true` in Helm values)
- Default deny-all network policy active
- All 21 service-specific network policies verified
- Internal service communication over private network (no public endpoints for databases)
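The default deny-all policy referenced above is a standard Kubernetes construct; the namespace name is an assumption. With this in place, every allowed flow must be granted by one of the service-specific policies.

```yaml
# Standard default-deny policy; the namespace is assumed to be "verity".
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: verity
spec:
  podSelector: {}        # empty selector matches every pod in the namespace
  policyTypes:
    - Ingress
    - Egress
```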
## Compute & Scaling
- Resource requests and limits set for all pods (per `values-prod.yaml`)
- Horizontal Pod Autoscaler configured for API Gateway and ingestion services
- Node pools sized appropriately:
- System pool: 3 nodes minimum
- Workload pool: Auto-scaling enabled (min 3, max 20)
- Pod Disruption Budgets set for critical services (≥1 pod always available)
- Temporal namespace (`verity`) created with appropriate worker task queue scaling
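The PDB and HPA items above can be sketched as follows. The deployment name and label selector are assumptions; the replica bounds mirror the workload-pool sizing listed in this section.

```yaml
# Illustrative PDB: keeps at least one API Gateway pod available during
# voluntary disruptions. The label selector is an assumption.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: verity-api-gateway
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: verity-api-gateway
---
# Illustrative HPA scaling on CPU utilisation; the target deployment
# name is an assumption.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: verity-api-gateway
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: verity-api-gateway
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```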
## Databases
- PostgreSQL (Azure Database for PostgreSQL Flexible Server):
- High Availability enabled (zone-redundant)
- Automated backups configured (retention: 30 days minimum)
- Point-in-time restore tested
- TimescaleDB extension enabled
- Connection pooling via PgBouncer enabled
- `max_connections` tuned for expected load
- ClickHouse:
- Replication configured (≥2 replicas)
- TTL policies set for audit data retention
- Backup strategy documented and tested
- `max_memory_usage` configured per query
- Redis (Azure Cache for Redis):
- Premium tier with TLS enabled (`rediss://`)
- Persistence enabled (AOF or RDB)
- Maxmemory policy set to `allkeys-lru`
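As a rough starting point for the PgBouncer and `max_connections` items, pooling settings might look like the following. The value keys are hypothetical (not a real chart schema) and the numbers must be tuned against measured load.

```yaml
# Hypothetical Helm-style values for PgBouncer pooling; keys and numbers
# are illustrative, not a specific chart's schema.
pgbouncer:
  poolMode: transaction      # transaction pooling suits short-lived app queries
  maxClientConn: 1000        # client connections PgBouncer will accept
  defaultPoolSize: 20        # server connections per user/database pair
  reservePoolSize: 5
postgresql:
  maxConnections: 200        # server-side cap; keep well above pool totals
```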
## Kafka / Event Hubs
- Azure Event Hubs namespace provisioned with Kafka protocol support
- Topic replication factor ≥ 3 (or Event Hubs Standard/Premium tier)
- All required topics created:
    - `verity.events.raw.{platform}` (per connector)
    - `verity.events.normalised`
    - `verity.scores.updated`
    - `verity.reviews.created`
    - `verity.reviews.decided`
    - `verity.remediation.completed`
    - `verity.audit.trail`
- Consumer group IDs registered for each service
- SASL_SSL authentication configured (`securityProtocol: SASL_SSL`)
- Message retention set (recommended: 7 days for event topics, 30 days for audit)
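For Event Hubs over the Kafka protocol, the SASL_SSL settings typically follow the standard pattern below: port 9093 with SASL PLAIN and the literal username `$ConnectionString`. The namespace name is a placeholder and the value keys are illustrative, not this deployment's actual Helm schema.

```yaml
# Hypothetical connector values; the standard Event Hubs Kafka-protocol
# settings are real, but the key names and namespace are illustrative.
kafka:
  bootstrapServers: <namespace>.servicebus.windows.net:9093
  securityProtocol: SASL_SSL
  saslMechanism: PLAIN
  saslUsername: "$ConnectionString"               # literal string for Event Hubs
  saslPasswordSecretRef: verity-kafka-credentials # synced from Key Vault
```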
## Temporal
- Temporal namespace `verity` created
- Temporal worker scaling configured (min 2 workers per task queue)
- Workflow execution timeout set appropriately
- Retention period for closed workflows configured (recommended: 30 days)
- Temporal server health monitored
## Monitoring & Observability
- Prometheus scraping enabled for all services at `/v1/metrics`
- Prometheus alerting rules deployed (from `prometheus-rules.yaml`):
    - `AuditWriteLagHigh` — audit write lag > 30s
    - `RemediationFailed` — any remediation failure
    - `ConnectorStopped` — no events for 10 minutes
    - `ReviewSLABreach` — CRITICAL review open > 4 hours
    - `ScoreComputationSlow` — p99 score latency > 5s
    - `APIHighErrorRate` — 5xx rate > 5%
```
- Grafana dashboards provisioned for:
- Service health overview
- Ingestion pipeline throughput
- Decay score distribution
- Review SLA compliance
- API latency and error rates
- Log aggregation configured (Azure Monitor / ELK / Loki)
- Structured JSON logging enabled (`LOG_FORMAT: json`)
- OpenTelemetry collector configured for distributed tracing
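One of the alerting rules above can be sketched as a Prometheus Operator `PrometheusRule`; the metric name `verity_audit_write_lag_seconds` is an assumption, so align it with what the audit service actually exports.

```yaml
# Sketch of the AuditWriteLagHigh rule; the metric name is assumed,
# not taken from the actual prometheus-rules.yaml.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: verity-alerts
spec:
  groups:
    - name: verity
      rules:
        - alert: AuditWriteLagHigh
          expr: verity_audit_write_lag_seconds > 30
          for: 5m               # require the condition to persist before firing
          labels:
            severity: critical
          annotations:
            summary: Audit write lag above 30s
```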
## Security
- Container image scanning enabled in CI (Trivy)
- Bandit SAST running on every PR
- Semgrep SAST running on every PR
- Weekly scheduled security scans configured
- Container images pinned to specific SHA tags (not `latest`) in production
- Pod security standards enforced (restricted profile)
- Azure Workload Identity configured (no stored service principal credentials)
- Audit trail immutability verified (ClickHouse append-only)
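The restricted Pod Security profile can be enforced with the standard Pod Security admission labels on the namespace; the namespace name is an assumption.

```yaml
# Standard Pod Security admission labels; the namespace is assumed
# to be "verity".
apiVersion: v1
kind: Namespace
metadata:
  name: verity
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/enforce-version: latest
```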
## Disaster Recovery
- PostgreSQL backup strategy documented:
- Automated daily backups
- Cross-region geo-redundant backup storage
- Recovery time objective (RTO) defined
- Recovery point objective (RPO) defined
- ClickHouse backup strategy documented
- Kafka/Event Hubs data retention configured
- Runbook for full cluster recovery documented
- Disaster recovery drill scheduled (quarterly recommended)
## Compliance
- Data retention policies configured per regulatory requirements
- Compliance reports validated (SOC 2, ISO 27001, GDPR as applicable)
- Audit trail completeness verified
- Access review SLA thresholds configured:
- CRITICAL: 4 hours
- HIGH: 24 hours
- MEDIUM: 7 days
- LOW: 30 days
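The SLA thresholds above might be carried in application configuration along these lines; the key names are hypothetical and only illustrate one way to make the thresholds auditable alongside the deployment.

```yaml
# Hypothetical application config for review SLA thresholds;
# key names are illustrative.
review:
  slaHours:
    CRITICAL: 4
    HIGH: 24
    MEDIUM: 168   # 7 days
    LOW: 720      # 30 days
```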
## Final Verification
- End-to-end smoke test passed:
- Connector ingests events → Kafka
- Events enriched and normalised
- Decay scores computed
- Review packets generated for high-risk access
- Review decision workflow completes
- Remediation executes on source platform
- Audit trail written to ClickHouse
- API returns all data correctly
- Dashboard displays real-time metrics
- Load test completed at 2× expected peak traffic
- Rollback procedure tested with Helm
- On-call rotation established
- Incident response playbook documented