Troubleshooting and Error Conditions(3.0)

This page describes the most common error conditions that can occur in the system and provides information to help resolve them.

We separate error conditions into six different layers, as described below.

Infrastructure Layer

The infrastructure layer includes all AWS resources used by the  application. These resources are typically controlled by Infrastructure as Code, as described in Assets and Services(3.0).

See https://docs.aws.amazon.com/awssupport/latest/user/troubleshooting.html for instructions on how to troubleshoot AWS services.

For the EKS cluster specifically, see https://docs.aws.amazon.com/eks/latest/userguide/troubleshooting.html.

Orchestration Layer

General Kubernetes Troubleshooting

There are plenty of good guides available online for troubleshooting problems related to Kubernetes resources. Rather than duplicating such procedures here, we recommend consulting those guides, starting with the official Kubernetes debugging documentation.

Kubernetes Resource Health Monitoring

All Kubernetes pods deployed by  define Liveness and Readiness probes according to https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/

The state of these probes can be checked using the 'kubectl describe' command. The output of this command applied to a pod displays the probe configuration, as in this example:


$ kubectl describe pod platform-0
...
    Liveness:   http-get https://:http/health/live delay=300s timeout=10s period=15s #success=1 #failure=3
    Readiness:  http-get https://:http/health/ready delay=10s timeout=10s period=5s #success=1 #failure=120
...

as well as all events related to these probes, for example:

...
Events:
  Type     Reason     Age                     From     Message
  ----     ------     ----                    ----     -------
  Warning  Unhealthy  8m46s (x880 over 20d)   kubelet  Liveness probe failed: Get "https://192.168.83.227:9000/health/live": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
  Warning  Unhealthy  8m31s (x582 over 20d)   kubelet  Liveness probe failed: Get "https://192.168.83.227:9000/health/live": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
  Normal   Killing    8m31s (x496 over 20d)   kubelet  Container platform failed liveness probe, will be restarted
  Warning  Unhealthy  8m22s (x2098 over 20d)  kubelet  Readiness probe failed: Get "https://192.168.83.227:9000/health/ready": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
  Warning  Unhealthy  8m2s (x1332 over 20d)   kubelet  Readiness probe failed: Get "https://192.168.83.227:9000/health/ready": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
  Warning  Unhealthy  8m1s                    kubelet  Readiness probe failed: Get "https://192.168.83.227:9000/health/ready": read tcp 192.168.69.131:46652->192.168.83.227:9000: read: connection reset by peer

These events often contain useful information for troubleshooting.
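One way to surface these quickly is to cut the description down to its Events section and filter for probe-related entries; a sketch, assuming the pod name platform-0 from the example above:

```shell
# Print everything from the "Events:" line to the end of the description,
# then keep only probe failures and the resulting container restarts:
kubectl describe pod platform-0 \
  | awk '/^Events:/,0' \
  | grep -E 'Unhealthy|Killing'
```

The awk range pattern /^Events:/,0 simply prints from the first line matching "Events:" to the end of the input.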

Kubernetes Logs

All processes in a  system produce log data that is collected by the Kubernetes logging facility and can easily be forwarded to a common log target. The centralized System Log (System Log) can be forwarded to the same target. A stack of Fluentd, Elasticsearch, and Kibana can be used to collect, store, and visualize the log data. See Configure Log Collection, Target, and Visualization - AWS for a description of how to set this up.

If centralized logging is not used, all process logs can instead be accessed using the 'kubectl logs' command. For example:
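A few common invocations (the pod name platform-0 and container name platform are taken from the examples above; adjust to the actual resources):

```shell
# Tail the most recent log lines of a pod:
kubectl logs platform-0 --tail=100

# Follow the log of a specific container in a multi-container pod:
kubectl logs platform-0 -c platform -f

# Inspect the log of the previous container instance after a restart:
kubectl logs platform-0 --previous

# Quick filter for error lines:
kubectl logs platform-0 --tail=500 | grep -i error
```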


System Layer

Event Notifications 

To monitor error conditions in the system layer,  provides a flexible Event Notifier feature with targets such as AWS SNS topics and several others. See Event Notifications(3.0) for information on how to configure this.

System and Process Logs

 logs events from the entire system in the central System Log. See System Log(3.0) for details.

Metrics 

All metrics in  are exposed on a REST interface in a format that can be scraped by Prometheus. This means that if Prometheus is installed according to Setting up Prometheus(3.0), it will automatically start scraping metrics from all system resources.
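To verify manually that a metrics endpoint is reachable and producing data, the endpoint can be fetched directly. A sketch, where the pod name and port 9090 are assumptions; check the actual deployment for the real values:

```shell
# Forward the metrics port of the pod to localhost:
kubectl port-forward pod/platform-0 9090:9090 &

# Fetch the metrics in Prometheus exposition format,
# dropping HELP/TYPE comment lines for readability:
curl -s http://localhost:9090/metrics | grep -v '^#' | head -n 20
```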

Alerts

Since Prometheus is the central integration point for metrics, it is straightforward to implement alerting here as well, using the Prometheus Alertmanager. See https://prometheus.io/docs/alerting/latest/alertmanager/ for details.
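As an illustration, a minimal alerting rule could look like the following. This is a sketch only: it uses the built-in Prometheus up metric, and the group name, threshold, and labels are assumptions, not values from this system.

```yaml
groups:
  - name: platform-alerts
    rules:
      # Fire when a scrape target has been unreachable for 5 minutes.
      - alert: TargetDown
        expr: up == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Target {{ $labels.instance }} is down"
```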

Verifying System Database Connectivity

If the system does not start up and the platform process log indicates problems with attaching to the system database, the connection can be verified by running the Postgres command-line tool (psql) through the kubectl command.

# Fetch the database username (stored in a dedicated variable so that $USER is not overwritten)
export DB_USER=$(kubectl get secret env-secrets -o jsonpath='{.data.jdbcUser}' | base64 -d)
# Fetch the password and store it in the variable psql reads automatically
export PGPASSWORD=$(kubectl get secret env-secrets -o jsonpath='{.data.jdbcPassword}' | base64 -d)
# Connect from the platform pod using the above credentials.
# Note: locally exported variables are not propagated by 'kubectl exec',
# so PGPASSWORD must be passed into the pod explicitly.
kubectl exec -it platform -- env PGPASSWORD="$PGPASSWORD" psql -U "$DB_USER" -d mz -h <db-hostname>
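If the connection fails with an authentication error, it can help to remember that Kubernetes stores secret values base64-encoded, and base64 -d reverses that encoding. A standalone illustration with a made-up value, not a real credential:

```shell
# Decode a base64 string the way the commands above decode secret values.
# "bXp1c2Vy" is a made-up example payload.
echo -n 'bXp1c2Vy' | base64 -d   # prints: mzuser
```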

Configuration Layer

Configuration Generation Errors

If for some reason a discrepancy occurs in the dependencies between resources generated from workflows, Ultra definitions, and other resources, error messages like "Corrupt type information" or similar can occur when trying to save or import resources. It may then be necessary to regenerate all resources to fix the inconsistencies. This is done with the regenerateconfigs command described in regenerateconfigs(3.0).

Execution Layer

Workflow Monitor

For workflow troubleshooting, the Workflow Monitor can be used to view detailed debug information; see Workflow Monitor on the Web Interface(3.0).

Execution Layer Metrics

 has extensive monitoring capabilities for troubleshooting workflow scheduling and execution. See  Metrics(3.0) for more information on available metrics. See Reading JMX Metrics, MIMs and Prometheus Agent Metrics from Execution Context Endpoint(3.0) for information on how to expose the execution layer metrics in Prometheus and/or Grafana.

Troubleshooting EC Deployments

ECDeployments, or ECDs, are instances of a Kubernetes Custom Resource. Just like other Kubernetes resources, an ECD has a status section that contains informative text describing the state of the resource. This information can be displayed using the kubectl command:

kubectl describe ecd <ecdname>

Data Layer

When unexpected things happen while processing payload data, such as decoding errors or unexpected type codes,  provides a powerful subsystem called Data Veracity to help resolve the error condition. See Data Veracity(3.0) for information on how to configure this.