Operational Runbooks¶
Step-by-step procedures for common operational tasks and incident response.
Service Restart Procedures¶
Restart a Single Service¶
# Rolling restart (zero-downtime)
kubectl rollout restart deployment/verity-api-gateway -n verity
# Watch rollout progress
kubectl rollout status deployment/verity-api-gateway -n verity
# Verify pods are healthy
kubectl get pods -n verity -l app.kubernetes.io/component=api-gateway
Restart All Services¶
# Rolling restart of all deployments
kubectl rollout restart deployment -n verity
# Monitor
kubectl get pods -n verity -w
Force Restart a Stuck Pod¶
# Identify the stuck pod
kubectl get pods -n verity | grep -v Running
# Delete the pod (replacement will be created automatically)
kubectl delete pod <pod-name> -n verity
# If pod is stuck in Terminating
kubectl delete pod <pod-name> -n verity --force --grace-period=0
Database Maintenance¶
PostgreSQL / TimescaleDB¶
Chunk Management¶
TimescaleDB automatically creates chunks for hypertables. Monitor and manage them:
-- View chunk information (the chunks view has no size column; use pg_total_relation_size)
SELECT hypertable_name, chunk_name, range_start, range_end,
pg_size_pretty(pg_total_relation_size(format('%I.%I', chunk_schema, chunk_name)::regclass)) AS size
FROM timescaledb_information.chunks
ORDER BY range_start DESC
LIMIT 20;
-- Drop old chunks (e.g., older than 90 days)
SELECT drop_chunks('access_events', older_than => INTERVAL '90 days');
SELECT drop_chunks('access_scores', older_than => INTERVAL '365 days');
-- Compress old chunks (if compression policy is set)
SELECT compress_chunk(format('%I.%I', c.chunk_schema, c.chunk_name)::regclass)
FROM timescaledb_information.chunks c
WHERE c.hypertable_name = 'access_events'
AND c.range_end < now() - INTERVAL '7 days'
AND NOT c.is_compressed;
Continuous Aggregate Refresh¶
-- Check continuous aggregate status
SELECT view_name, materialization_hypertable_name
FROM timescaledb_information.continuous_aggregates;
-- Manually refresh a continuous aggregate
CALL refresh_continuous_aggregate('daily_access_summary',
now() - INTERVAL '7 days', now());
Vacuum and Analyse¶
-- Run VACUUM ANALYZE on key tables
VACUUM ANALYZE principals;
VACUUM ANALYZE assets;
VACUUM ANALYZE access_grants;
VACUUM ANALYZE review_packets;
VACUUM ANALYZE review_decisions;
Connection Pool Check¶
-- Check active connections
SELECT count(*) AS connections, state, usename, application_name
FROM pg_stat_activity
WHERE datname = 'verity'
GROUP BY state, usename, application_name
ORDER BY connections DESC;
-- Kill idle connections older than 10 minutes (excluding this session)
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE datname = 'verity'
AND state = 'idle'
AND state_change < now() - INTERVAL '10 minutes'
AND pid <> pg_backend_pid();
Kafka Consumer Lag Investigation¶
Check Consumer Group Lag¶
# List all consumer groups
kubectl exec -n verity -it deploy/verity-ingestion -- \
kafka-consumer-groups.sh \
--bootstrap-server $KAFKA_BOOTSTRAP_SERVERS \
--list
# Describe a specific consumer group
kubectl exec -n verity -it deploy/verity-ingestion -- \
kafka-consumer-groups.sh \
--bootstrap-server $KAFKA_BOOTSTRAP_SERVERS \
--group decay-engine \
--describe
Expected output columns: TOPIC, PARTITION, CURRENT-OFFSET, LOG-END-OFFSET, LAG
Diagnose High Lag¶
1. Check consumer health: inspect pod status, restart counts, and logs for the consuming service.
2. Check for consumer rebalancing: frequent rebalances stall consumption while partitions are reassigned.
3. Scale consumers if processing is too slow (never more consumers than the topic has partitions).
4. Check if Kafka is healthy: broker availability and under-replicated partitions.
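The lag check above can be scripted. As a sketch, this helper sums per-topic lag from `kafka-consumer-groups.sh --describe` output, assuming the column order documented above (TOPIC, PARTITION, CURRENT-OFFSET, LOG-END-OFFSET, LAG); adjust the field numbers if your Kafka version prefixes a GROUP column:

```shell
# Sum consumer lag per topic from `kafka-consumer-groups.sh --describe` output.
# Assumed columns: TOPIC PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG ...
total_lag() {
  awk 'NR > 1 && $5 ~ /^[0-9]+$/ { lag[$1] += $5 }
       END { for (t in lag) printf "%s %d\n", t, lag[t] }'
}

# Live usage (run against the cluster):
# kubectl exec -n verity deploy/verity-ingestion -- \
#   kafka-consumer-groups.sh --bootstrap-server "$KAFKA_BOOTSTRAP_SERVERS" \
#   --group decay-engine --describe | total_lag
```

A steadily growing total indicates consumers cannot keep up; a total that spikes and recovers usually points at a rebalance.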
Reset Consumer Offset (Emergency)¶
Data Loss Risk
Resetting to latest skips all unprocessed messages, and resetting to an earlier timestamp reprocesses messages already handled. Stop the consumers first (the group must be inactive for the reset to succeed), and only use this in emergencies.
# Reset to latest (skip all pending messages)
kubectl exec -n verity -it deploy/verity-ingestion -- \
kafka-consumer-groups.sh \
--bootstrap-server $KAFKA_BOOTSTRAP_SERVERS \
--group decay-engine \
--topic verity.events.normalised \
--reset-offsets \
--to-latest \
--execute
# Reset to a specific timestamp
kubectl exec -n verity -it deploy/verity-ingestion -- \
kafka-consumer-groups.sh \
--bootstrap-server $KAFKA_BOOTSTRAP_SERVERS \
--group decay-engine \
--topic verity.events.normalised \
--reset-offsets \
--to-datetime "2024-06-15T00:00:00.000" \
--execute
Temporal Workflow Management¶
List Running Workflows¶
# Port-forward Temporal UI
kubectl port-forward -n verity svc/verity-temporal 8233:8080
# Or use the CLI
kubectl exec -n verity -it deploy/verity-workflow-engine -- \
temporal workflow list \
--namespace verity \
--query "ExecutionStatus='Running'"
Investigate a Stuck Workflow¶
# Get workflow details
temporal workflow describe \
--namespace verity \
--workflow-id <workflow-id>
# View workflow history
temporal workflow show \
--namespace verity \
--workflow-id <workflow-id>
Common causes of stuck workflows:
- Activity timeout — The activity is waiting for an external response (e.g., platform API)
- Worker crash — The worker pod restarted mid-execution
- Dependency failure — PostgreSQL or Kafka is unavailable
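For the worker-crash case, a quick triage sketch (assuming the default `kubectl get pods` column layout of NAME, READY, STATUS, RESTARTS, AGE) is to surface pods with a non-zero restart count:

```shell
# Print pods that have restarted, from `kubectl get pods` output.
# Assumed columns: NAME READY STATUS RESTARTS AGE
restarted_pods() {
  awk 'NR > 1 && $4 + 0 > 0 { print $1, "restarts:", $4 }'
}

# Live usage:
# kubectl get pods -n verity -l app.kubernetes.io/component=workflow-engine | restarted_pods
```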
Retry a Failed Workflow¶
# Reset to the last workflow task (re-runs from the point of failure)
temporal workflow reset \
--namespace verity \
--workflow-id <workflow-id> \
--type LastWorkflowTask \
--reason "Manual retry after failure"
Terminate a Workflow¶
# Terminate (immediate stop, no cleanup)
temporal workflow terminate \
--namespace verity \
--workflow-id <workflow-id> \
--reason "Manual termination: <reason>"
# Cancel (graceful stop, allows cleanup)
temporal workflow cancel \
--namespace verity \
--workflow-id <workflow-id>
Scale Temporal Workers¶
# Check current worker count
kubectl get pods -n verity -l app.kubernetes.io/component=workflow-engine
# Scale up
kubectl scale -n verity deployment/verity-workflow-engine --replicas=4
ClickHouse Cleanup and TTL Management¶
Check Table Sizes¶
SELECT
database,
table,
formatReadableSize(sum(bytes_on_disk)) AS disk_size,
sum(rows) AS total_rows,
max(modification_time) AS last_modified
FROM system.parts
WHERE active AND database = 'verity_audit'
GROUP BY database, table
ORDER BY sum(bytes_on_disk) DESC;
Configure TTL¶
-- Set TTL to automatically delete audit events older than 2 years
ALTER TABLE verity_audit.audit_events
MODIFY TTL occurred_at + INTERVAL 2 YEAR;
-- Verify TTL settings (the TTL clause appears in the full engine definition)
SELECT name, engine_full
FROM system.tables
WHERE database = 'verity_audit';
Manual Cleanup¶
-- Delete old data manually
ALTER TABLE verity_audit.audit_events
DELETE WHERE occurred_at < now() - INTERVAL 2 YEAR;
-- Optimize table after deletion (merge parts)
OPTIMIZE TABLE verity_audit.audit_events FINAL;
Monitor Merge Operations¶
-- Check active merges
SELECT database, table, elapsed, progress,
formatReadableSize(total_size_bytes_compressed) AS size
FROM system.merges;
-- Check merge performance
SELECT event_date, count() AS merges,
sum(duration_ms)/1000 AS total_duration_s
FROM system.part_log
WHERE event_type = 'MergeParts'
GROUP BY event_date
ORDER BY event_date DESC
LIMIT 7;
Emergency: Disabling a Connector¶
If a connector is causing issues (e.g., overwhelming the pipeline, API errors, credential compromise):
Immediate Stop¶
# Scale the connector to 0 replicas
kubectl scale -n verity deployment/verity-connector-aad --replicas=0
# Verify it's stopped
kubectl get pods -n verity -l app.kubernetes.io/component=connectors
Verify Pipeline Drains¶
# Check remaining messages in the connector's raw event topic
kubectl exec -n verity -it deploy/verity-ingestion -- \
kafka-consumer-groups.sh \
--bootstrap-server $KAFKA_BOOTSTRAP_SERVERS \
--group event-enricher \
--describe | grep "verity.events.raw"
Investigate and Restore¶
1. Check connector logs for errors.
2. Check the source platform's API status.
3. Verify credentials are still valid.
4. Re-enable when ready by scaling the deployment back to its normal replica count.
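A sketch of the restore flow; the error patterns below are a starting point, not an exhaustive list, and the replica count of 1 is an assumption to adapt:

```shell
# Filter connector log lines that look like errors.
filter_errors() {
  grep -iE 'error|exception|unauthorized|denied'
}

# Live usage (if the pods are already scaled to 0, pull logs from your
# log aggregator instead, since kubectl has no pods left to read from):
# kubectl logs -n verity deploy/verity-connector-aad --tail=500 | filter_errors
#
# Once the cause is fixed, re-enable the connector and watch the rollout:
# kubectl scale -n verity deployment/verity-connector-aad --replicas=1
# kubectl rollout status -n verity deployment/verity-connector-aad
```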
Emergency: Manually Revoking Access¶
If the automated remediation pipeline is unavailable and access must be revoked immediately:
1. Identify the Access Grant¶
-- Find the active grant
SELECT ag.id, p.display_name, p.email, a.fqn, a.platform,
ag.privilege, ag.grant_mechanism, ag.granted_via
FROM access_grants ag
JOIN principals p ON p.id = ag.principal_id
JOIN assets a ON a.id = ag.asset_id
WHERE ag.id = '<grant-uuid>'
AND ag.is_active = true;
2. Revoke on the Source Platform¶
Perform the revocation directly on the source platform (Azure AD, Fabric, etc.) using the platform's admin portal or CLI.
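For example, if the grant is delivered through an Azure AD group membership, the revocation is a group-member removal via the `az` CLI. A dry-run sketch (the object IDs are placeholders taken from the grant row in step 1; drop the `echo` to execute for real):

```shell
# Placeholder object IDs; take them from the grant found in step 1.
GROUP_ID="<group-object-id>"
MEMBER_ID="<principal-object-id>"

# Dry run: prints the command instead of executing it. Remove `echo` to run.
echo az ad group member remove --group "$GROUP_ID" --member-id "$MEMBER_ID"
```

Other grant mechanisms (direct role assignments, Fabric workspace roles) need the corresponding platform-specific call instead.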
3. Update Verity Database¶
-- Mark the grant as revoked
UPDATE access_grants
SET is_active = false,
revoked_at = now(),
revoked_by_id = '<admin-principal-uuid>'
WHERE id = '<grant-uuid>';
4. Update the Review Packet¶
-- Mark the review as decided
UPDATE review_packets
SET status = 'DECIDED'
WHERE id = '<packet-uuid>';
-- Record the decision
INSERT INTO review_decisions (id, packet_id, decision, decided_by_id, decided_at, justification)
VALUES (gen_random_uuid(), '<packet-uuid>', 'REVOKE', '<admin-principal-uuid>', now(),
'Emergency manual revocation: <reason>');
5. Log the Emergency Action¶
-- Write to audit trail
INSERT INTO audit_events (event_id, event_type, occurred_at, action, detail_json, risk_level)
VALUES (
gen_random_uuid()::text,
'EMERGENCY_REVOCATION',
now(),
'MANUAL_REVOKE',
'{"grant_id": "<grant-uuid>", "reason": "<reason>", "operator": "<admin-email>"}'::jsonb,
'CRITICAL'
);
6. Publish Audit Event to Kafka¶
If Kafka is available, also publish to the audit trail topic for ClickHouse persistence:
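A sketch of the publish step. The topic name `verity.audit.events` and the payload shape are assumptions; match them to the deployment's actual audit topic and the ClickHouse `audit_events` schema, and fill in the same field values written in step 5:

```shell
# Build the audit payload (mirror the row inserted in step 5).
EVENT_JSON='{"event_id":"<uuid>","event_type":"EMERGENCY_REVOCATION","action":"MANUAL_REVOKE","risk_level":"CRITICAL"}'

# Publish via the console producer inside the ingestion pod:
# printf '%s\n' "$EVENT_JSON" | kubectl exec -n verity -i deploy/verity-ingestion -- \
#   kafka-console-producer.sh \
#     --bootstrap-server "$KAFKA_BOOTSTRAP_SERVERS" \
#     --topic verity.audit.events
printf '%s\n' "$EVENT_JSON"
```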