# Troubleshooting Guide

Common errors, their root causes, and solutions for the Verity platform.

## Common Errors
### PostgreSQL INET Returns IPv4Address

**Symptom:** Pydantic validation error when loading `AccessEvent` from the database:

```text
ValidationError: 1 validation error for AccessEvent
source_ip
  Input should be a valid string [type=string_type]
```
**Cause:** asyncpg returns PostgreSQL `INET` columns as Python `IPv4Address` objects, not strings, and Pydantic's strict validation rejects them.

**Solution:** The `AccessEvent` model includes a `field_validator` that coerces `IPv4Address` to string:
```python
@field_validator("source_ip", mode="before")
@classmethod
def _coerce_source_ip(cls, v: object) -> Optional[str]:
    if v is None:
        return None
    return str(v)
```
If you encounter this with a new model, add the same validator pattern.
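To see why the validation fails in the first place, the following stdlib-only sketch shows what the driver hands back for an `INET` column (the literal address here is just an example):

```python
from ipaddress import IPv4Address, ip_address

# asyncpg parses INET columns into ipaddress objects rather than strings.
raw = ip_address("192.168.1.10")   # what the driver hands back

print(type(raw) is IPv4Address)    # True
print(isinstance(raw, str))        # False -- this is what Pydantic rejects
print(str(raw))                    # the validator's coercion: 192.168.1.10
```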
### SQLAlchemy `metadata` Reserved Name

**Symptom:** SQLAlchemy error or missing data when the Pydantic model has a `metadata` field.

**Cause:** SQLAlchemy's declarative base reserves `metadata` as a class attribute for table metadata, so the ORM column must be named `metadata_` to avoid the conflict.

**Solution:** The domain models use `AliasChoices` to accept both names:
```python
metadata: dict = Field(
    default_factory=dict,
    validation_alias=AliasChoices("metadata_", "metadata"),
)
```
With `from_attributes=True` and `populate_by_name=True` on `VerityModel`, Pydantic reads `metadata_` from the ORM object and exposes it as `metadata` in the API.
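A minimal sketch of how the alias resolves, assuming Pydantic v2; `OrmRow` and `ExampleModel` are hypothetical stand-ins, not Verity classes:

```python
from pydantic import AliasChoices, BaseModel, ConfigDict, Field

class OrmRow:
    """Stand-in for a SQLAlchemy object whose column is named metadata_."""
    metadata_ = {"source": "scim"}

class ExampleModel(BaseModel):
    model_config = ConfigDict(from_attributes=True, populate_by_name=True)

    metadata: dict = Field(
        default_factory=dict,
        validation_alias=AliasChoices("metadata_", "metadata"),
    )

# Validation reads the metadata_ attribute; the field is exposed as metadata.
row = ExampleModel.model_validate(OrmRow())
print(row.metadata)
```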
### ClickHouse Nullable(LowCardinality()) Ordering

**Symptom:** ClickHouse error when ordering or grouping by a `Nullable(LowCardinality(String))` column.

**Cause:** ClickHouse does not support direct comparison or ordering on `Nullable(LowCardinality(...))` columns in some contexts.
**Solution:** Use `assumeNotNull()` or `coalesce()` in your queries:

```sql
-- Option 1: assumeNotNull
SELECT * FROM audit_events
ORDER BY assumeNotNull(risk_level);

-- Option 2: coalesce with a default
SELECT * FROM audit_events
ORDER BY coalesce(risk_level, 'UNKNOWN');
```
When creating new ClickHouse tables, prefer `LowCardinality(String)` without `Nullable` where possible, using an empty string as the default.
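As a sketch, a new table following that guideline could look like this (table name, columns, and engine settings here are illustrative, not the actual Verity schema):

```sql
CREATE TABLE audit_events_example
(
    event_id   UUID,
    -- LowCardinality without Nullable: orderable and groupable directly,
    -- with '' standing in for "no value"
    risk_level LowCardinality(String) DEFAULT '',
    created_at DateTime
)
ENGINE = MergeTree
ORDER BY (created_at, event_id);
```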
### Temporal DB=postgres12 (Not postgresql)

**Symptom:** Temporal fails to start with database connection errors.

**Cause:** The Temporal auto-setup image requires `DB=postgres12` as the database-type environment variable, not `DB=postgresql` or `DB=postgres`.

**Solution:** Ensure the docker-compose or Helm configuration uses:
```yaml
environment:
  DB: postgres12          # NOT "postgresql" or "postgres"
  DB_PORT: "5432"
  POSTGRES_USER: verity
  POSTGRES_PWD: verity_dev
  POSTGRES_SEEDS: postgres
```
### Python Dashed Module Names in Containers

**Symptom:** `ModuleNotFoundError` when importing a service in a Docker container.

**Cause:** Python does not allow hyphens in module or package names, so directory names like `api-gateway/` cannot be imported directly.

**Solution:** Verity services use underscored package names internally (e.g. `api_gateway/`) while the directory and Docker image names use hyphens. Ensure your Dockerfile `WORKDIR` and `CMD` reference the correct Python module path.
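A sketch of the pattern in a Dockerfile (the base image, paths, and `api_gateway.main` entrypoint are illustrative, not the actual Verity layout):

```dockerfile
FROM python:3.12-slim

WORKDIR /app
# On disk the service directory is hyphenated (api-gateway/), but the
# importable package inside it is underscored (api_gateway/).
COPY api-gateway/api_gateway/ /app/api_gateway/

# CMD must reference the importable (underscored) module path
CMD ["python", "-m", "api_gateway.main"]
```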
### structlog add_logger_name Crash with PrintLoggerFactory

**Symptom:** Application crashes on startup.

**Cause:** The `structlog.stdlib.add_logger_name` processor expects a stdlib `Logger` object with a `name` attribute. `structlog.PrintLoggerFactory` creates `PrintLogger` instances, which have no `name` attribute, so the processor fails.

**Solution:** Verity's logging configuration intentionally omits `add_logger_name` from the processor chain and uses `PrintLoggerFactory(sys.stdout)` for direct JSON output:
```python
structlog.configure(
    processors=[
        structlog.contextvars.merge_contextvars,
        structlog.stdlib.add_log_level,        # ← this is fine
        # structlog.stdlib.add_logger_name,    # ← DO NOT use with PrintLoggerFactory
        structlog.processors.TimeStamper(fmt="iso"),
        ...
        structlog.processors.JSONRenderer(),
    ],
    logger_factory=structlog.PrintLoggerFactory(sys.stdout),
)
```
If you need the logger name, switch to `structlog.stdlib.LoggerFactory()` and use the stdlib logging integration.
## Health Check Endpoints

All Verity services expose the following endpoints:

| Endpoint | Method | Purpose | Expected Response |
|---|---|---|---|
| `/health` | GET | Liveness probe | `200 OK` with `{"status": "ok"}` |
| `/health/ready` | GET | Readiness probe | `200 OK` when all dependencies are healthy |
| `/v1/metrics` | GET | Prometheus metrics | Prometheus text format |
### Check Health from Inside the Cluster

```bash
# API Gateway health
kubectl exec -n verity -it deploy/verity-api-gateway -- \
  curl -s http://localhost:8000/health | python -m json.tool

# Check readiness
kubectl exec -n verity -it deploy/verity-api-gateway -- \
  curl -s http://localhost:8000/health/ready | python -m json.tool
```
### Check via Port Forward
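The same endpoints can be checked from a workstation via `kubectl port-forward` (a sketch; assumes the API Gateway listens on port 8000, as in the in-cluster examples above):

```bash
# Forward the API Gateway port to localhost
kubectl port-forward -n verity deploy/verity-api-gateway 8000:8000 &

# Probe the endpoints locally
curl -s http://localhost:8000/health | python -m json.tool
curl -s http://localhost:8000/health/ready | python -m json.tool
```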
## Log Analysis Commands

### View Structured Logs

```bash
# Recent logs from a service (JSON format)
kubectl logs -n verity -l app.kubernetes.io/component=decay-engine --tail=50

# Parse JSON logs with jq
kubectl logs -n verity -l app.kubernetes.io/component=decay-engine --tail=100 | \
  jq -r '. | "\(.timestamp) [\(.level)] \(.event)"'

# Filter for errors
kubectl logs -n verity -l app.kubernetes.io/component=api-gateway --tail=500 | \
  jq 'select(.level == "error")'

# Search for a specific trace ID
kubectl logs -n verity --all-containers --tail=1000 | \
  jq 'select(.trace_id == "abc-123-trace")'
```
### Follow Logs in Real-Time

```bash
# Follow a single service
kubectl logs -n verity -l app.kubernetes.io/component=ingestion -f

# Follow with jq formatting
kubectl logs -n verity -l app.kubernetes.io/component=ingestion -f | \
  jq -r '. | "\(.timestamp) [\(.level)] \(.service): \(.event)"'
```
### Aggregate Logs Across Services

```bash
# All errors in the last 5 minutes
kubectl logs -n verity --all-containers --since=5m | \
  jq 'select(.level == "error")' | \
  jq -r '. | "\(.service): \(.event)"'
```
## Database Connection Issues

### PostgreSQL Connection Refused

**Symptom:** Service logs show connection-refused errors against PostgreSQL.

**Diagnosis:**
```bash
# Check PostgreSQL pod
kubectl get pods -n verity -l app.kubernetes.io/component=postgresql

# Check PostgreSQL logs
kubectl logs -n verity -l app.kubernetes.io/component=postgresql --tail=50

# Test connectivity from a service pod
kubectl exec -n verity -it deploy/verity-api-gateway -- \
  python -c "import asyncio, asyncpg; asyncio.run(asyncpg.connect('postgresql://verity:verity@verity-postgres:5432/verity'))"
```
**Common causes:**

- PostgreSQL pod not running or restarting
- Network policy blocking the connection
- Incorrect `DB_HOST` in ConfigMap
- Connection pool exhausted (`max_connections` reached)
### ClickHouse Connection Timeout

```bash
# Test ClickHouse connectivity
kubectl exec -n verity -it deploy/verity-audit-writer -- \
  curl -s "http://verity-clickhouse:8123/?query=SELECT%201"
```
## Kafka Consumer Group Debugging

### Consumer Not Joining Group

**Symptom:** Service logs show the consumer failing to join its group.

**Diagnosis:**
```bash
# Describe the consumer group
kubectl exec -n verity -it deploy/verity-ingestion -- \
  kafka-consumer-groups.sh \
    --bootstrap-server $KAFKA_BOOTSTRAP_SERVERS \
    --group decay-engine \
    --describe

# Check for partition assignments
kubectl exec -n verity -it deploy/verity-ingestion -- \
  kafka-consumer-groups.sh \
    --bootstrap-server $KAFKA_BOOTSTRAP_SERVERS \
    --group decay-engine \
    --members --verbose
```
**Common causes:**

- Too many consumers for the number of partitions
- Consumer session timeout too short
- Consumer processing too slow (exceeds `max.poll.interval.ms`)
### Messages Not Being Consumed

```bash
# Check topic exists and has messages
kubectl exec -n verity -it deploy/verity-ingestion -- \
  kafka-topics.sh \
    --bootstrap-server $KAFKA_BOOTSTRAP_SERVERS \
    --describe \
    --topic verity.events.normalised

# Check end offsets
kubectl exec -n verity -it deploy/verity-ingestion -- \
  kafka-run-class.sh kafka.tools.GetOffsetShell \
    --broker-list $KAFKA_BOOTSTRAP_SERVERS \
    --topic verity.events.normalised \
    --time -1
```
## Quick Diagnostic Checklist

When investigating an issue, work through this checklist:

- Check pod status: `kubectl get pods -n verity`
- Check recent events: `kubectl get events -n verity --sort-by=.metadata.creationTimestamp | tail -20`
- Check service logs: `kubectl logs -n verity deploy/verity-<service> --tail=100`
- Check health endpoints: `curl http://localhost:<port>/health`
- Check Prometheus alerts: port-forward Prometheus and check `/alerts`
- Check database connectivity: test from within a pod
- Check Kafka consumer lag: use `kafka-consumer-groups.sh --describe`
- Check resource utilisation: `kubectl top pods -n verity`
- Check recent deployments: `helm history verity -n verity`
- Check network policies: `kubectl get networkpolicies -n verity`
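The first few checks can be bundled into a small triage script (a sketch, not part of the Verity tooling; it assumes the `verity` namespace and that `kubectl top` is available via metrics-server):

```bash
#!/usr/bin/env bash
# First-pass triage for the verity namespace (sketch).
set -euo pipefail
NS=verity

echo "== Pod status =="
kubectl get pods -n "$NS"

echo "== Recent events =="
kubectl get events -n "$NS" --sort-by=.metadata.creationTimestamp | tail -20

echo "== Resource utilisation =="
kubectl top pods -n "$NS"
```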