Troubleshooting Guide

Common errors, their root causes, and solutions for the Verity platform.


Common Errors

PostgreSQL INET Returns IPv4Address

Symptom: Pydantic validation error when loading AccessEvent from the database:

ValidationError: 1 validation error for AccessEvent
source_ip
  Input should be a valid string [type=string_type]

Cause: asyncpg returns PostgreSQL INET columns as Python ipaddress objects (IPv4Address or IPv6Address), not strings. Pydantic's strict validation rejects them.

Solution: The AccessEvent model includes a field_validator that coerces IPv4Address to string:

@field_validator("source_ip", mode="before")
@classmethod
def _coerce_source_ip(cls, v: object) -> Optional[str]:
    if v is None:
        return None
    return str(v)

If you encounter this with a new model, add the same validator pattern.
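The coercion itself is just str() over the ipaddress object, and it round-trips cleanly for IPv6 values as well. A minimal stdlib-only sketch of what the validator does (no Pydantic required):

```python
from ipaddress import ip_address

def coerce_source_ip(v):
    """Mirror of the model validator: pass None through, stringify everything else."""
    if v is None:
        return None
    return str(v)

# asyncpg hands back ipaddress objects for INET columns
assert coerce_source_ip(ip_address("10.0.0.1")) == "10.0.0.1"
assert coerce_source_ip(ip_address("::1")) == "::1"
assert coerce_source_ip(None) is None
```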


SQLAlchemy metadata Reserved Name

Symptom: SQLAlchemy error or missing data when the Pydantic model has a metadata field:

AttributeError: 'Principal' object has no attribute 'metadata'

Cause: SQLAlchemy's declarative base uses metadata as a reserved class attribute for table metadata. The ORM column must be named metadata_ to avoid the conflict.

Solution: The domain models use AliasChoices to accept both names:

metadata: dict = Field(
    default_factory=dict,
    validation_alias=AliasChoices("metadata_", "metadata")
)

With from_attributes=True and populate_by_name=True on VerityModel, Pydantic reads metadata_ from the ORM object and exposes it as metadata in the API.
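As a sketch of how the alias resolves (assuming Pydantic v2; the Principal field set and OrmRow here are illustrative stand-ins, not the real models):

```python
from pydantic import AliasChoices, BaseModel, ConfigDict, Field

class Principal(BaseModel):
    model_config = ConfigDict(from_attributes=True, populate_by_name=True)
    metadata: dict = Field(
        default_factory=dict,
        validation_alias=AliasChoices("metadata_", "metadata"),
    )

# Stand-in for a SQLAlchemy row, where the column must be named metadata_
class OrmRow:
    metadata_ = {"team": "platform"}

# The alias picks up metadata_ from the ORM attribute...
p = Principal.model_validate(OrmRow())
assert p.metadata == {"team": "platform"}

# ...while plain dicts using the public name still validate
assert Principal.model_validate({"metadata": {"a": 1}}).metadata == {"a": 1}
```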


ClickHouse Nullable(LowCardinality()) Ordering

Symptom: ClickHouse error when ordering or grouping by a Nullable(LowCardinality(String)) column:

DB::Exception: Argument at index 1 for function less must not be Nullable

Cause: ClickHouse does not support direct comparison or ordering on Nullable(LowCardinality(...)) columns in some contexts (notably ORDER BY and GROUP BY), because comparison functions such as less reject the Nullable-wrapped argument.

Solution: Use assumeNotNull() or coalesce() in your queries:

-- Option 1: assumeNotNull
SELECT * FROM audit_events
ORDER BY assumeNotNull(risk_level);

-- Option 2: coalesce with a default
SELECT * FROM audit_events
ORDER BY coalesce(risk_level, 'UNKNOWN');

When creating new ClickHouse tables, prefer LowCardinality(String) without Nullable where possible, using an empty string as the default.
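For example, a hypothetical table definition following that guidance (table and column names are illustrative):

```sql
CREATE TABLE audit_events
(
    event_id   UUID,
    -- Non-Nullable LowCardinality with an empty-string default avoids the
    -- comparison restriction entirely
    risk_level LowCardinality(String) DEFAULT '',
    ts         DateTime
)
ENGINE = MergeTree
ORDER BY (ts, risk_level);
```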


Temporal DB=postgres12 (Not postgresql)

Symptom: Temporal fails to start with database connection errors:

unable to connect to database: invalid dsn

Cause: The Temporal auto-setup image requires DB=postgres12 as the database type environment variable, not DB=postgresql or DB=postgres.

Solution: Ensure the docker-compose or Helm configuration uses:

environment:
  DB: postgres12          # NOT "postgresql" or "postgres"
  DB_PORT: "5432"
  POSTGRES_USER: verity
  POSTGRES_PWD: verity_dev
  POSTGRES_SEEDS: postgres

Python Dashed Module Names in Containers

Symptom: ModuleNotFoundError when importing a service in a Docker container:

ModuleNotFoundError: No module named 'api-gateway'

Cause: Python does not allow hyphens in module/package names. Directory names like api-gateway/ cannot be imported directly.

Solution: Verity services use underscored package names internally (e.g., api_gateway/) while the directory and Docker image names use hyphens. Ensure your Dockerfile WORKDIR and CMD reference the correct Python module path:

# Correct
CMD ["python", "-m", "api_gateway.main"]

# Wrong
CMD ["python", "-m", "api-gateway.main"]
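The restriction follows directly from Python's identifier rules, which you can check from the REPL:

```python
# Module names must be valid identifiers; a hyphen disqualifies them,
# which is why "import api-gateway" can never work.
assert not "api-gateway".isidentifier()
assert "api_gateway".isidentifier()
```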

structlog add_logger_name Crash with PrintLoggerFactory

Symptom: Application crashes on startup with:

AttributeError: 'PrintLogger' object has no attribute 'name'

Cause: The structlog.stdlib.add_logger_name processor expects a stdlib Logger object with a name attribute. When using structlog.PrintLoggerFactory (which creates PrintLogger instances), this processor fails because PrintLogger has no name attribute.

Solution: Verity's logging configuration intentionally omits add_logger_name from the processor chain and uses PrintLoggerFactory(sys.stdout) for direct JSON output:

structlog.configure(
    processors=[
        structlog.contextvars.merge_contextvars,
        structlog.stdlib.add_log_level,        # ← This is fine
        # structlog.stdlib.add_logger_name,    # ← DO NOT use with PrintLoggerFactory
        structlog.processors.TimeStamper(fmt="iso"),
        ...
        structlog.processors.JSONRenderer(),
    ],
    logger_factory=structlog.PrintLoggerFactory(sys.stdout),
)

If you need the logger name, switch to structlog.stdlib.LoggerFactory() and use the stdlib logging integration.
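The failure mode is easy to reproduce without structlog: any processor that reads logger.name breaks on a logger object that lacks the attribute. A simplified re-creation (FakePrintLogger and the processor body are illustrative, not structlog's actual code):

```python
class FakePrintLogger:
    """Stands in for structlog.PrintLogger, which has no .name attribute."""

class StdlibStyleLogger:
    """Stands in for a stdlib logging.Logger, which does have .name."""
    name = "api_gateway"

def add_logger_name(logger, method_name, event_dict):
    # Simplified: the real processor ultimately reads logger.name
    event_dict["logger"] = logger.name
    return event_dict

# Works with a stdlib-style logger...
assert add_logger_name(StdlibStyleLogger(), "info", {})["logger"] == "api_gateway"

# ...but raises AttributeError with a PrintLogger-style factory
try:
    add_logger_name(FakePrintLogger(), "info", {})
except AttributeError:
    pass
else:
    raise AssertionError("expected AttributeError")
```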


Health Check Endpoints

All Verity services expose the following endpoints:

Endpoint       Method  Purpose             Expected Response
/health        GET     Liveness probe      200 OK with {"status": "ok"}
/health/ready  GET     Readiness probe     200 OK when all dependencies are healthy
/v1/metrics    GET     Prometheus metrics  Prometheus text format

Check Health from Inside the Cluster

# API Gateway health
kubectl exec -n verity -it deploy/verity-api-gateway -- \
  curl -s http://localhost:8000/health | python -m json.tool

# Check readiness
kubectl exec -n verity -it deploy/verity-api-gateway -- \
  curl -s http://localhost:8000/health/ready | python -m json.tool

Check via Port Forward

kubectl port-forward -n verity svc/verity-api-gateway 8000:8000
curl http://localhost:8000/health

Log Analysis Commands

View Structured Logs

# Recent logs from a service (JSON format)
kubectl logs -n verity -l app.kubernetes.io/component=decay-engine --tail=50

# Parse JSON logs with jq
kubectl logs -n verity -l app.kubernetes.io/component=decay-engine --tail=100 | \
  jq -r '. | "\(.timestamp) [\(.level)] \(.event)"'

# Filter for errors
kubectl logs -n verity -l app.kubernetes.io/component=api-gateway --tail=500 | \
  jq 'select(.level == "error")'

# Search for a specific trace ID
kubectl logs -n verity --all-containers --tail=1000 | \
  jq 'select(.trace_id == "abc-123-trace")'

Follow Logs in Real-Time

# Follow a single service
kubectl logs -n verity -l app.kubernetes.io/component=ingestion -f

# Follow with jq formatting
kubectl logs -n verity -l app.kubernetes.io/component=ingestion -f | \
  jq -r '. | "\(.timestamp) [\(.level)] \(.service): \(.event)"'

Aggregate Logs Across Services

# All errors in the last 5 minutes
kubectl logs -n verity --all-containers --since=5m | \
  jq 'select(.level == "error")' | \
  jq -r '. | "\(.service): \(.event)"'

Database Connection Issues

PostgreSQL Connection Refused

Symptom: Service logs show:

ConnectionRefusedError: [Errno 111] Connection refused

Diagnosis:

# Check PostgreSQL pod
kubectl get pods -n verity -l app.kubernetes.io/component=postgresql

# Check PostgreSQL logs
kubectl logs -n verity -l app.kubernetes.io/component=postgresql --tail=50

# Test connectivity from a service pod
kubectl exec -n verity -it deploy/verity-api-gateway -- \
  python -c "import asyncio, asyncpg; asyncio.run(asyncpg.connect('postgresql://verity:verity@verity-postgres:5432/verity'))"

Common causes:

  1. PostgreSQL pod not running or restarting
  2. Network policy blocking the connection
  3. Incorrect DB_HOST in ConfigMap
  4. Connection pool exhausted (max_connections reached)

ClickHouse Connection Timeout

# Test ClickHouse connectivity
kubectl exec -n verity -it deploy/verity-audit-writer -- \
  curl -s "http://verity-clickhouse:8123/?query=SELECT%201"

Kafka Consumer Group Debugging

Consumer Not Joining Group

Symptom: Service logs show:

WARNING: Consumer group decay-engine is rebalancing...

Diagnosis:

# Describe the consumer group
kubectl exec -n verity -it deploy/verity-ingestion -- \
  kafka-consumer-groups.sh \
  --bootstrap-server $KAFKA_BOOTSTRAP_SERVERS \
  --group decay-engine \
  --describe

# Check for partition assignments
kubectl exec -n verity -it deploy/verity-ingestion -- \
  kafka-consumer-groups.sh \
  --bootstrap-server $KAFKA_BOOTSTRAP_SERVERS \
  --group decay-engine \
  --members --verbose

Common causes:

  1. Too many consumers for the number of partitions
  2. Consumer session timeout too short
  3. Consumer processing too slow (exceeds max.poll.interval.ms)

Messages Not Being Consumed

# Check topic exists and has messages
kubectl exec -n verity -it deploy/verity-ingestion -- \
  kafka-topics.sh \
  --bootstrap-server $KAFKA_BOOTSTRAP_SERVERS \
  --describe \
  --topic verity.events.normalised

# Check end offsets
kubectl exec -n verity -it deploy/verity-ingestion -- \
  kafka-run-class.sh kafka.tools.GetOffsetShell \
  --broker-list $KAFKA_BOOTSTRAP_SERVERS \
  --topic verity.events.normalised \
  --time -1
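A quick way to interpret the output: per-partition lag is the log-end offset minus the group's committed offset. A small helper for eyeballing it (the offset numbers below are made up):

```python
def consumer_lag(end_offsets, committed_offsets):
    """Per-partition lag: log-end offset minus the group's committed offset."""
    return {
        partition: end - committed_offsets.get(partition, 0)
        for partition, end in end_offsets.items()
    }

end = {0: 1500, 1: 1480, 2: 900}
committed = {0: 1500, 1: 1200}  # partition 2 has no committed offset yet
lag = consumer_lag(end, committed)
assert lag == {0: 0, 1: 280, 2: 900}
```

Zero lag on every partition means the consumer is caught up; a persistently growing lag means it is not keeping pace with producers.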

Quick Diagnostic Checklist

When investigating an issue, work through this checklist:

  1. Check pod status: kubectl get pods -n verity
  2. Check recent events: kubectl get events -n verity --sort-by=.metadata.creationTimestamp | tail -20
  3. Check service logs: kubectl logs -n verity deploy/verity-<service> --tail=100
  4. Check health endpoints: curl http://localhost:<port>/health
  5. Check Prometheus alerts: Port-forward Prometheus and check /alerts
  6. Check database connectivity: Test from within a pod
  7. Check Kafka consumer lag: Use kafka-consumer-groups.sh --describe
  8. Check resource utilisation: kubectl top pods -n verity
  9. Check recent deployments: helm history verity -n verity
  10. Check network policies: kubectl get networkpolicies -n verity