Operational Runbooks

Step-by-step procedures for common operational tasks and incident response.


Service Restart Procedures

Restart a Single Service

# Rolling restart (zero-downtime)
kubectl rollout restart deployment/verity-api-gateway -n verity

# Watch rollout progress
kubectl rollout status deployment/verity-api-gateway -n verity

# Verify pods are healthy
kubectl get pods -n verity -l app.kubernetes.io/component=api-gateway

Restart All Services

# Rolling restart of all deployments
kubectl rollout restart deployment -n verity

# Monitor
kubectl get pods -n verity -w

Force Restart a Stuck Pod

# Identify the stuck pod
kubectl get pods -n verity | grep -v Running

# Delete the pod (replacement will be created automatically)
kubectl delete pod <pod-name> -n verity

# If pod is stuck in Terminating
kubectl delete pod <pod-name> -n verity --force --grace-period=0

Database Maintenance

PostgreSQL / TimescaleDB

Chunk Management

TimescaleDB automatically creates chunks for hypertables. Monitor and manage them:

-- View chunk information with sizes (per hypertable; the chunks view
-- itself has no size column, so join chunks_detailed_size)
SELECT c.hypertable_name, c.chunk_name, c.range_start, c.range_end,
       pg_size_pretty(s.total_bytes) AS size
FROM timescaledb_information.chunks c
JOIN chunks_detailed_size('access_events') s
  ON s.chunk_schema = c.chunk_schema AND s.chunk_name = c.chunk_name
ORDER BY c.range_start DESC
LIMIT 20;

-- Drop old chunks (e.g., older than 90 days)
SELECT drop_chunks('access_events', older_than => INTERVAL '90 days');
SELECT drop_chunks('access_scores', older_than => INTERVAL '365 days');

-- Compress old chunks (if compression is enabled on the hypertable);
-- compress_chunk takes a regclass, so schema-qualify the chunk name
SELECT compress_chunk(format('%I.%I', c.chunk_schema, c.chunk_name)::regclass)
FROM timescaledb_information.chunks c
WHERE c.hypertable_name = 'access_events'
  AND c.range_end < now() - INTERVAL '7 days'
  AND NOT c.is_compressed;

Continuous Aggregate Refresh

-- Check continuous aggregate status
SELECT view_name, materialization_hypertable_name
FROM timescaledb_information.continuous_aggregates;

-- Manually refresh a continuous aggregate
CALL refresh_continuous_aggregate('daily_access_summary', 
    now() - INTERVAL '7 days', now());

Vacuum and Analyse

-- Run VACUUM ANALYZE on key tables
VACUUM ANALYZE principals;
VACUUM ANALYZE assets;
VACUUM ANALYZE access_grants;
VACUUM ANALYZE review_packets;
VACUUM ANALYZE review_decisions;

Connection Pool Check

-- Check active connections
SELECT count(*), state, usename, application_name
FROM pg_stat_activity
WHERE datname = 'verity'
GROUP BY state, usename, application_name
ORDER BY count DESC;

-- Kill idle connections older than 10 minutes
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE datname = 'verity'
  AND state = 'idle'
  AND state_change < now() - INTERVAL '10 minutes';

Kafka Consumer Lag Investigation

Check Consumer Group Lag

# List all consumer groups
kubectl exec -n verity -it deploy/verity-ingestion -- \
  kafka-consumer-groups.sh \
  --bootstrap-server $KAFKA_BOOTSTRAP_SERVERS \
  --list

# Describe a specific consumer group
kubectl exec -n verity -it deploy/verity-ingestion -- \
  kafka-consumer-groups.sh \
  --bootstrap-server $KAFKA_BOOTSTRAP_SERVERS \
  --group decay-engine \
  --describe

Expected output columns: GROUP, TOPIC, PARTITION, CURRENT-OFFSET, LOG-END-OFFSET, LAG (LOG-END-OFFSET minus CURRENT-OFFSET), plus consumer identity columns.
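For a quick per-topic total, the describe output can be piped through awk. This is a sketch that assumes the default column layout (GROUP, TOPIC, PARTITION, CURRENT-OFFSET, LOG-END-OFFSET, LAG, ...); field positions may differ across Kafka versions:

```shell
# Sum per-partition lag by topic from kafka-consumer-groups.sh --describe
# output. Assumes TOPIC is field 2 and LAG is field 6.
sum_lag() {
  awk 'NR > 1 && $6 ~ /^[0-9]+$/ { lag[$2] += $6 }
       END { for (t in lag) printf "%s %d\n", t, lag[t] }'
}

# Example against captured output:
printf '%s\n' \
  'GROUP TOPIC PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG' \
  'decay-engine verity.events.normalised 0 100 150 50' \
  'decay-engine verity.events.normalised 1 200 210 10' \
  | sum_lag
```

In practice, pipe the real `--describe` output into `sum_lag` and alert when a topic's total exceeds a threshold.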

Diagnose High Lag

  1. Check consumer health:

    kubectl logs -n verity -l app.kubernetes.io/component=decay-engine --tail=50
    

  2. Check for consumer rebalancing:

    kubectl logs -n verity -l app.kubernetes.io/component=decay-engine | grep -i rebalance
    

  3. Scale consumers if processing is too slow:

    kubectl scale -n verity deployment/verity-decay-engine --replicas=4
    

  4. Check if Kafka is healthy:

    # Check broker status
    kubectl exec -n verity -it deploy/verity-ingestion -- \
      kafka-broker-api-versions.sh --bootstrap-server $KAFKA_BOOTSTRAP_SERVERS
    

Reset Consumer Offset (Emergency)

Data Loss Risk

Resetting offsets to latest skips all unprocessed messages; resetting to an earlier point reprocesses messages. The consumer group must have no active members (scale the consumer to 0 first) or the reset will fail. Only use in emergencies.

# Reset to latest (skip all pending messages)
kubectl exec -n verity -it deploy/verity-ingestion -- \
  kafka-consumer-groups.sh \
  --bootstrap-server $KAFKA_BOOTSTRAP_SERVERS \
  --group decay-engine \
  --topic verity.events.normalised \
  --reset-offsets \
  --to-latest \
  --execute

# Reset to a specific timestamp
kubectl exec -n verity -it deploy/verity-ingestion -- \
  kafka-consumer-groups.sh \
  --bootstrap-server $KAFKA_BOOTSTRAP_SERVERS \
  --group decay-engine \
  --topic verity.events.normalised \
  --reset-offsets \
  --to-datetime "2024-06-15T00:00:00.000" \
  --execute

Temporal Workflow Management

List Running Workflows

# Port-forward Temporal UI
kubectl port-forward -n verity svc/verity-temporal 8233:8080

# Or use the CLI
kubectl exec -n verity -it deploy/verity-workflow-engine -- \
  temporal workflow list \
  --namespace verity \
  --query "ExecutionStatus='Running'"

Investigate a Stuck Workflow

# Get workflow details
temporal workflow describe \
  --namespace verity \
  --workflow-id <workflow-id>

# View workflow history
temporal workflow show \
  --namespace verity \
  --workflow-id <workflow-id>

Common causes of stuck workflows:

  1. Activity timeout — The activity is waiting for an external response (e.g., platform API)
  2. Worker crash — The worker pod restarted mid-execution
  3. Dependency failure — PostgreSQL or Kafka is unavailable
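The second cause (worker crash) can be confirmed from pod restart counts. A small helper to flag restarted pods from `kubectl get pods` output — a sketch, assuming the default column layout where RESTARTS is the fourth field:

```shell
# Flag pods with a nonzero RESTARTS count from `kubectl get pods` output.
# Default column layout: NAME READY STATUS RESTARTS AGE.
flag_restarts() {
  awk 'NR > 1 && $4 + 0 > 0 { printf "%s restarted %d time(s)\n", $1, $4 }'
}

# Usage:
#   kubectl get pods -n verity -l app.kubernetes.io/component=workflow-engine | flag_restarts
printf '%s\n' \
  'NAME READY STATUS RESTARTS AGE' \
  'verity-workflow-engine-abc 1/1 Running 3 2d' \
  'verity-workflow-engine-def 1/1 Running 0 2d' \
  | flag_restarts
```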

Retry a Failed Workflow

# Re-run from the failure point by resetting to the last workflow task
# (the Temporal CLI has no "retry" command; reset is the mechanism)
temporal workflow reset \
  --namespace verity \
  --workflow-id <workflow-id> \
  --type LastWorkflowTask \
  --reason "Retry after failure"

Terminate a Workflow

# Terminate (immediate stop, no cleanup)
temporal workflow terminate \
  --namespace verity \
  --workflow-id <workflow-id> \
  --reason "Manual termination: <reason>"

# Cancel (graceful stop, allows cleanup)
temporal workflow cancel \
  --namespace verity \
  --workflow-id <workflow-id>

Scale Temporal Workers

# Check current worker count
kubectl get pods -n verity -l app.kubernetes.io/component=workflow-engine

# Scale up
kubectl scale -n verity deployment/verity-workflow-engine --replicas=4

ClickHouse Cleanup and TTL Management

Check Table Sizes

SELECT
    database,
    table,
    formatReadableSize(sum(bytes_on_disk)) AS disk_size,
    sum(rows) AS total_rows,
    max(modification_time) AS last_modified
FROM system.parts
WHERE active AND database = 'verity_audit'
GROUP BY database, table
ORDER BY sum(bytes_on_disk) DESC;

Configure TTL

-- Set TTL to automatically delete audit events older than 2 years
ALTER TABLE verity_audit.audit_events
MODIFY TTL occurred_at + INTERVAL 2 YEAR;

-- Verify TTL settings (the TTL clause appears in the table DDL)
SELECT name, engine, create_table_query
FROM system.tables
WHERE database = 'verity_audit';

Manual Cleanup

-- Delete old data manually (runs as an asynchronous mutation;
-- track progress in system.mutations)
ALTER TABLE verity_audit.audit_events
DELETE WHERE occurred_at < now() - INTERVAL 2 YEAR;

-- Optimize table after deletion (merge parts)
OPTIMIZE TABLE verity_audit.audit_events FINAL;

Monitor Merge Operations

-- Check active merges
SELECT database, table, elapsed, progress, 
       formatReadableSize(total_size_bytes_compressed) AS size
FROM system.merges;

-- Check merge performance
SELECT event_date, count() AS merges, 
       sum(duration_ms)/1000 AS total_duration_s
FROM system.part_log
WHERE event_type = 'MergeParts'
GROUP BY event_date
ORDER BY event_date DESC
LIMIT 7;

Emergency: Disabling a Connector

If a connector is causing issues (e.g., overwhelming the pipeline, API errors, credential compromise):

Immediate Stop

# Scale the connector to 0 replicas
kubectl scale -n verity deployment/verity-connector-aad --replicas=0

# Verify it's stopped
kubectl get pods -n verity -l app.kubernetes.io/component=connectors

Verify Pipeline Drains

# Check remaining messages in the connector's raw event topic
kubectl exec -n verity -it deploy/verity-ingestion -- \
  kafka-consumer-groups.sh \
  --bootstrap-server $KAFKA_BOOTSTRAP_SERVERS \
  --group event-enricher \
  --describe | grep "verity.events.raw"

Investigate and Restore

  1. Check connector logs for errors:

    kubectl logs -n verity -l app.kubernetes.io/name=connector-aad --tail=500
    

  2. Check the source platform's API status

  3. Verify credentials are still valid

  4. Re-enable when ready:

    kubectl scale -n verity deployment/verity-connector-aad --replicas=2
    


Emergency: Manually Revoking Access

If the automated remediation pipeline is unavailable and access must be revoked immediately:

1. Identify the Access Grant

-- Find the active grant
SELECT ag.id, p.display_name, p.email, a.fqn, a.platform,
       ag.privilege, ag.grant_mechanism, ag.granted_via
FROM access_grants ag
JOIN principals p ON p.id = ag.principal_id
JOIN assets a ON a.id = ag.asset_id
WHERE ag.id = '<grant-uuid>'
  AND ag.is_active = true;

2. Revoke on the Source Platform

Perform the revocation directly on the source platform (Azure AD, Fabric, etc.) using the platform's admin portal or CLI.
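For Azure AD group-based grants, the revocation can be scripted with the az CLI. This is a sketch with placeholder IDs: `az ad group member remove` covers group membership, whereas direct role assignments would use `az role assignment delete` instead. The DRY_RUN guard is purely illustrative:

```shell
# Sketch: revoke an Azure AD group membership via the az CLI.
# Placeholders (<...>) must be replaced with real object IDs.
# Set DRY_RUN=1 to print the command instead of executing it.
revoke_group_membership() {
  group_id="$1"; member_id="$2"
  cmd="az ad group member remove --group $group_id --member-id $member_id"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$cmd"
  else
    $cmd
  fi
}

DRY_RUN=1 revoke_group_membership "<group-object-id>" "<principal-object-id>"
```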

3. Update Verity Database

-- Mark the grant as revoked
UPDATE access_grants
SET is_active = false,
    revoked_at = now(),
    revoked_by_id = '<admin-principal-uuid>'
WHERE id = '<grant-uuid>';

4. Update the Review Packet

-- Mark the review as decided
UPDATE review_packets
SET status = 'DECIDED'
WHERE id = '<packet-uuid>';

-- Record the decision
INSERT INTO review_decisions (id, packet_id, decision, decided_by_id, decided_at, justification)
VALUES (gen_random_uuid(), '<packet-uuid>', 'REVOKE', '<admin-principal-uuid>', now(), 
        'Emergency manual revocation: <reason>');

5. Log the Emergency Action

-- Write to audit trail
INSERT INTO audit_events (event_id, event_type, occurred_at, action, detail_json, risk_level)
VALUES (
    gen_random_uuid()::text,
    'EMERGENCY_REVOCATION',
    now(),
    'MANUAL_REVOKE',
    '{"grant_id": "<grant-uuid>", "reason": "<reason>", "operator": "<admin-email>"}'::jsonb,
    'CRITICAL'
);

6. Publish Audit Event to Kafka

If Kafka is available, also publish to the audit trail topic for ClickHouse persistence:

# From a pod with Kafka access
echo '{"event_type":"EMERGENCY_REVOCATION","action":"MANUAL_REVOKE","risk_level":"CRITICAL"}' | \
  kafka-console-producer.sh \
  --bootstrap-server $KAFKA_BOOTSTRAP_SERVERS \
  --topic verity.audit.trail