# Runbook: Debugging a KUDO Kafka cluster

This runbook explains how to debug a KUDO Kafka cluster.

## Pre-conditions

- Kubernetes cluster with KUDO version >= 0.10.1 installed
- A KUDO Kafka cluster version 1.3.1 up and running (the examples below use the `default` namespace)
- The `jq` binary installed and available in your `$PATH`
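Before starting, it can be worth a quick sanity check that the tooling is in place. A minimal sketch, assuming the KUDO CLI is installed as a kubectl plugin:

```bash
kubectl kudo version   # verify the KUDO CLI plugin is installed
jq --version           # verify jq is on the $PATH
kubectl cluster-info   # verify the cluster is reachable
```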

## Steps

### Verifying if KUDO Kafka plans are COMPLETE

#### Get the KUDO Kafka Instance object name

Verify the KUDO Kafka Instance object is present in the expected namespace:

```bash
kubectl get instances
```

Expected output is the KUDO Instance objects present in the `default` namespace:

```
NAME                 AGE
kafka-instance       82m
zookeeper-instance   82m
```
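If the Instance objects were deployed into a different namespace, list them across all namespaces instead:

```bash
kubectl get instances --all-namespaces
```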

#### Verify the KUDO Kafka plans

```bash
kubectl kudo plan status --instance=kafka-instance
```

Expected output is the current status of the KUDO Kafka instance plans:

```
Plan(s) for "kafka-instance" in namespace "default":
.
└── kafka-instance (Operator-Version: "kafka-1.3.1" Active-Plan: "deploy")
    ├── Plan cruise-control (serial strategy) [NOT ACTIVE]
    │   └── Phase cruise-addon (serial strategy) [NOT ACTIVE]
    │       └── Step deploy-cruise-control [NOT ACTIVE]
    ├── Plan deploy (serial strategy) [COMPLETE], last updated 2020-04-21 11:31:33
    │   ├── Phase deploy-kafka (serial strategy) [COMPLETE]
    │   │   ├── Step generate-tls-certificates [COMPLETE]
    │   │   ├── Step configuration [COMPLETE]
    │   │   ├── Step service [COMPLETE]
    │   │   └── Step app [COMPLETE]
    │   └── Phase addons (parallel strategy) [COMPLETE]
    │       ├── Step monitoring [COMPLETE]
    │       ├── Step access [COMPLETE]
    │       ├── Step mirror [COMPLETE]
    │       └── Step load [COMPLETE]
    ├── Plan external-access (serial strategy) [NOT ACTIVE]
    │   └── Phase resources (serial strategy) [NOT ACTIVE]
    │       └── Step deploy [NOT ACTIVE]
    ├── Plan kafka-connect (serial strategy) [NOT ACTIVE]
    │   └── Phase deploy-kafka-connect (serial strategy) [NOT ACTIVE]
    │       ├── Step deploy [NOT ACTIVE]
    │       └── Step setup [NOT ACTIVE]
    ├── Plan mirrormaker (serial strategy) [NOT ACTIVE]
    │   └── Phase app (serial strategy) [NOT ACTIVE]
    │       └── Step deploy [NOT ACTIVE]
    ├── Plan not-allowed (serial strategy) [NOT ACTIVE]
    │   └── Phase not-allowed (serial strategy) [NOT ACTIVE]
    │       └── Step not-allowed [NOT ACTIVE]
    ├── Plan service-monitor (serial strategy) [NOT ACTIVE]
    │   └── Phase enable-service-monitor (serial strategy) [NOT ACTIVE]
    │       └── Step deploy [NOT ACTIVE]
    ├── Plan update-instance (serial strategy) [NOT ACTIVE]
    │   └── Phase app (serial strategy) [NOT ACTIVE]
    │       ├── Step conf [NOT ACTIVE]
    │       ├── Step svc [NOT ACTIVE]
    │       └── Step sts [NOT ACTIVE]
    └── Plan user-workload (serial strategy) [NOT ACTIVE]
        └── Phase workload (serial strategy) [NOT ACTIVE]
            └── Step toggle-workload [NOT ACTIVE]
```
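For scripting, it can be convenient to read the plan status straight from the Instance object instead of parsing the CLI output. A minimal sketch, assuming the status is exposed under `.status.planStatus` (the exact layout can vary between KUDO versions, so verify the jq path against `kubectl get instance kafka-instance -o json` first):

```bash
# Print the status of the deploy plan; 'COMPLETE' means the deployment finished
# NOTE: the .status.planStatus.deploy.status path is an assumption for this KUDO version
kubectl get instance kafka-instance -o json \
  | jq -r '.status.planStatus.deploy.status'
```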

#### Get all KUDO Kafka instance pods

We can use the KUDO Kafka instance name to retrieve all pods belonging to the KUDO Kafka cluster:

```bash
kubectl get pods -l kudo.dev/instance=kafka-instance
```

Expected output is the list of pods that belong to the current KUDO Kafka instance:

```
NAME                     READY   STATUS    RESTARTS   AGE
kafka-instance-kafka-0   2/2     Running   1          124m
kafka-instance-kafka-1   2/2     Running   0          124m
kafka-instance-kafka-2   2/2     Running   0          123m
```
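When automating this check, `kubectl wait` can block until every broker pod reports Ready:

```bash
kubectl wait --for=condition=Ready pod \
  -l kudo.dev/instance=kafka-instance --timeout=300s
```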

### Debugging the pod logs

#### Get logs from KUDO Kafka combined pods

Sometimes we need the combined log output of all the Kafka pods:

```bash
kubectl logs -l kudo.dev/instance=kafka-instance -c k8skafka -f
```

Expected output is the current logs from all the Kafka pods. We can identify which broker each line belongs to by the brokerId present in the log lines:

```
[2020-02-03 12:13:22,030] INFO [GroupMetadataManager brokerId=2] Removed 0 expired offsets in 0 milliseconds. (kafka.coordinator.group.GroupMetadataManager)
[2020-02-03 12:23:22,030] INFO [GroupMetadataManager brokerId=2] Removed 0 expired offsets in 0 milliseconds. (kafka.coordinator.group.GroupMetadataManager)
[2020-02-03 12:33:22,030] INFO [GroupMetadataManager brokerId=2] Removed 0 expired offsets in 0 milliseconds. (kafka.coordinator.group.GroupMetadataManager)
[2020-02-03 11:02:36,834] INFO [GroupMetadataManager brokerId=1] Removed 0 expired offsets in 0 milliseconds. (kafka.coordinator.group.GroupMetadataManager)
[2020-02-03 11:12:36,834] INFO [GroupMetadataManager brokerId=1] Removed 0 expired offsets in 0 milliseconds. (kafka.coordinator.group.GroupMetadataManager)
[2020-02-03 11:22:36,834] INFO [GroupMetadataManager brokerId=1] Removed 0 expired offsets in 0 milliseconds. (kafka.coordinator.group.GroupMetadataManager)
[2020-02-03 11:32:36,834] INFO [GroupMetadataManager brokerId=1] Removed 0 expired offsets in 0 milliseconds. (kafka.coordinator.group.GroupMetadataManager)
[2020-02-03 11:42:36,834] INFO [GroupMetadataManager brokerId=1] Removed 0 expired offsets in 0 milliseconds. (kafka.coordinator.group.GroupMetadataManager)
[2020-02-03 11:52:36,834] INFO [GroupMetadataManager brokerId=1] Removed 0 expired offsets in 0 milliseconds. (kafka.coordinator.group.GroupMetadataManager)
[2020-02-03 12:02:36,834] INFO [GroupMetadataManager brokerId=1] Removed 0 expired offsets in 0 milliseconds. (kafka.coordinator.group.GroupMetadataManager)
[2020-02-03 12:12:36,834] INFO [GroupMetadataManager brokerId=1] Removed 0 expired offsets in 0 milliseconds. (kafka.coordinator.group.GroupMetadataManager)
[2020-02-03 12:22:36,834] INFO [GroupMetadataManager brokerId=1] Removed 0 expired offsets in 0 milliseconds. (kafka.coordinator.group.GroupMetadataManager)
[2020-02-03 12:32:36,834] INFO [GroupMetadataManager brokerId=1] Removed 0 expired offsets in 0 milliseconds. (kafka.coordinator.group.GroupMetadataManager)
[2020-02-03 11:22:09,613] INFO [GroupMetadataManager brokerId=0] Removed 0 expired offsets in 0 milliseconds. (kafka.coordinator.group.GroupMetadataManager)
[2020-02-03 11:32:09,613] INFO [GroupMetadataManager brokerId=0] Removed 0 expired offsets in 0 milliseconds. (kafka.coordinator.group.GroupMetadataManager)
[2020-02-03 11:42:09,613] INFO [GroupMetadataManager brokerId=0] Removed 0 expired offsets in 0 milliseconds. (kafka.coordinator.group.GroupMetadataManager)
[2020-02-03 11:52:09,613] INFO [GroupMetadataManager brokerId=0] Removed 0 expired offsets in 0 milliseconds. (kafka.coordinator.group.GroupMetadataManager)
[2020-02-03 12:02:09,613] INFO [GroupMetadataManager brokerId=0] Removed 0 expired offsets in 0 milliseconds. (kafka.coordinator.group.GroupMetadataManager)
[2020-02-03 12:12:09,613] INFO [GroupMetadataManager brokerId=0] Removed 0 expired offsets in 0 milliseconds. (kafka.coordinator.group.GroupMetadataManager)
[2020-02-03 12:22:09,613] INFO [GroupMetadataManager brokerId=0] Removed 0 expired offsets in 0 milliseconds. (kafka.coordinator.group.GroupMetadataManager)
[2020-02-03 12:32:09,613] INFO [GroupMetadataManager brokerId=0] Removed 0 expired offsets in 0 milliseconds. (kafka.coordinator.group.GroupMetadataManager)
```
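To focus on a single broker within the combined stream, filter on the brokerId tag, for example broker 1:

```bash
kubectl logs -l kudo.dev/instance=kafka-instance -c k8skafka | grep 'brokerId=1'
```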

#### Get logs from KUDO Kafka individual pod

```bash
kubectl logs kafka-instance-kafka-0 -c k8skafka
```

```
[ ... lines removed for clarity ...]
[2020-02-03 11:52:09,613] INFO [GroupMetadataManager brokerId=0] Removed 0 expired offsets in 0 milliseconds. (kafka.coordinator.group.GroupMetadataManager)
[2020-02-03 12:02:09,613] INFO [GroupMetadataManager brokerId=0] Removed 0 expired offsets in 0 milliseconds. (kafka.coordinator.group.GroupMetadataManager)
[2020-02-03 12:12:09,613] INFO [GroupMetadataManager brokerId=0] Removed 0 expired offsets in 0 milliseconds. (kafka.coordinator.group.GroupMetadataManager)
[2020-02-03 12:22:09,613] INFO [GroupMetadataManager brokerId=0] Removed 0 expired offsets in 0 milliseconds. (kafka.coordinator.group.GroupMetadataManager)
[2020-02-03 12:32:09,613] INFO [GroupMetadataManager brokerId=0] Removed 0 expired offsets in 0 milliseconds. (kafka.coordinator.group.GroupMetadataManager)
[ ... lines removed for clarity ...]
```
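Note that `kafka-instance-kafka-0` showed one restart in the pod list above. To see the logs from before the restart, add the `--previous` flag:

```bash
kubectl logs kafka-instance-kafka-0 -c k8skafka --previous
```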

#### Get logs from KUDO Kafka node exporter container

```bash
kubectl logs kafka-instance-kafka-0 -c kafka-node-exporter
```

Expected output is the logs from the node exporter container running as a sidecar alongside broker 0:

```
time="2020-02-03T10:31:44Z" level=info msg="Starting node_exporter (version=0.18.1, branch=HEAD, revision=3db77732e925c08f675d7404a8c46466b2ece83e)" source="node_exporter.go:156"
time="2020-02-03T10:31:44Z" level=info msg="Build context (go=go1.12.5, user=root@b50852a1acba, date=20190604-16:41:18)" source="node_exporter.go:157"
```

### Debugging service issues

#### Verify the service endpoints

```bash
kubectl get endpoints kafka-instance-svc -o json | jq -r '.subsets[].addresses[].hostname'
```

Expected output is the pod names of the brokers:

```
kafka-instance-kafka-2
kafka-instance-kafka-0
kafka-instance-kafka-1
```
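If a broker is missing from this list, its pod is either not Ready or its labels do not match the service selector. A quick sketch to count the registered broker endpoints (it should match the broker count, 3 here):

```bash
kubectl get endpoints kafka-instance-svc -o json \
  | jq '[.subsets[].addresses[].hostname] | unique | length'
```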

#### Verify the service selector labels match the ones on the pods

Get the labels present on the Kafka pods:

```bash
kubectl get pods -l kudo.dev/instance=kafka-instance -o json | jq -r '.items[].metadata.labels'
```

Expected output is the list of labels used on the pods of the kafka-instance cluster:

```
{
  "app": "kafka",
  "controller-revision-hash": "kafka-instance-kafka-76b8b8559b",
  "kafka": "kafka",
  "kudo.dev/instance": "kafka-instance",
  "statefulset.kubernetes.io/pod-name": "kafka-instance-kafka-0"
}
{
  "app": "kafka",
  "controller-revision-hash": "kafka-instance-kafka-76b8b8559b",
  "kafka": "kafka",
  "kudo.dev/instance": "kafka-instance",
  "statefulset.kubernetes.io/pod-name": "kafka-instance-kafka-1"
}
{
  "app": "kafka",
  "controller-revision-hash": "kafka-instance-kafka-76b8b8559b",
  "kafka": "kafka",
  "kudo.dev/instance": "kafka-instance",
  "statefulset.kubernetes.io/pod-name": "kafka-instance-kafka-2"
}
```

Now we need to verify that the service selector uses a subset of these labels:

```bash
kubectl get svc kafka-instance-svc -o json | jq -r '.spec.selector'
```

Expected output is the two labels the service uses to find the Kafka pods:

```
{
  "app": "kafka",
  "kudo.dev/instance": "kafka-instance"
}
```
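As a cross-check, you can feed the service's own selector back into kubectl and confirm it matches exactly the broker pods. A small sketch:

```bash
# Render the selector map as key=value,key=value and list the pods it selects
SELECTOR=$(kubectl get svc kafka-instance-svc -o json \
  | jq -r '.spec.selector | to_entries | map("\(.key)=\(.value)") | join(",")')
kubectl get pods -l "$SELECTOR"
```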

### Debugging health of all objects

#### Get list of all objects created by KUDO Kafka Instance

```bash
kubectl api-resources --verbs=get --namespaced -o name \
  | xargs -n 1 kubectl get --show-kind --ignore-not-found -l kudo.dev/instance=kafka-instance
```

Expected output is all resources created by the KUDO Kafka Instance kafka-instance. The list can differ depending on which KUDO Kafka features are enabled.

```
NAME                                           DATA   AGE
configmap/kafka-instance-bootstrap             1      5h26m
configmap/kafka-instance-enable-tls            1      5h26m
configmap/kafka-instance-health-check-script   1      5h26m
configmap/kafka-instance-jaas-config           1      5h26m
configmap/kafka-instance-krb5-config           1      5h26m
configmap/kafka-instance-metrics-config        1      5h26m
configmap/kafka-instance-serverproperties      1      5h26m
NAME                           ENDPOINTS                                                                AGE
endpoints/kafka-instance-svc   192.168.183.18:9096,192.168.51.87:9096,192.168.65.150:9096 + 9 more...   5h26m
NAME                         READY   STATUS    RESTARTS   AGE
pod/kafka-instance-kafka-0   2/2     Running   1          5h26m
pod/kafka-instance-kafka-1   2/2     Running   0          5h25m
pod/kafka-instance-kafka-2   2/2     Running   0          5h25m
NAME                            SECRETS   AGE
serviceaccount/kafka-instance   1         5h26m
NAME                         TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)                               AGE
service/kafka-instance-svc   ClusterIP   None         <none>        9093/TCP,9092/TCP,9094/TCP,9096/TCP   5h26m
NAME                                                      CONTROLLER                              REVISION   AGE
controllerrevision.apps/kafka-instance-kafka-76b8b8559b   statefulset.apps/kafka-instance-kafka   1          5h26m
NAME                                    READY   AGE
statefulset.apps/kafka-instance-kafka   3/3     5h26m
NAME                                               AGE
podmetrics.metrics.k8s.io/kafka-instance-kafka-1   1s
podmetrics.metrics.k8s.io/kafka-instance-kafka-2   1s
podmetrics.metrics.k8s.io/kafka-instance-kafka-0   1s
NAME                                            MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
poddisruptionbudget.policy/kafka-instance-pdb   N/A             1                 1                     5h29m
NAME                                                           AGE
rolebinding.rbac.authorization.k8s.io/kafka-instance-binding   5h29m
NAME                                                 AGE
role.rbac.authorization.k8s.io/kafka-instance-role   5h29m
```
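Note that the PersistentVolumeClaims created from the StatefulSet's volumeClaimTemplates may not carry the kudo.dev/instance label, so they can be missing from the sweep above. Assuming the claims follow the usual `<template>-<pod-name>` naming, you can list them by name instead:

```bash
# PVC names are an assumption based on the kafka-instance-datadir volume seen later in events
kubectl get pvc | grep kafka-instance
```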

### Debugging deployment issues

#### Pod stuck with Status ContainerCreating

After deploying KUDO Kafka, pods can get stuck in the ContainerCreating status. This can have several causes, ranging from resource starvation to storage issues.

Let's debug the root cause of a ContainerCreating issue for the following case:

```bash
kubectl get pods
```

```
NAME                             READY   STATUS              RESTARTS   AGE
kafka-instance-kafka-0           0/2     ContainerCreating   0          4m17s
```

Run the pod describe command and look for the events related to the pod:

```bash
kubectl describe pod kafka-instance-kafka-0
```

Expected output should reveal the reason the pod is stuck:

```
[ ... lines removed for clarity ...]
Events:
  Type     Reason              Age                  From                                                Message
  ----     ------              ----                 ----                                                -------
  Normal   Scheduled           <unknown>            default-scheduler                                   Successfully assigned default/kafka-instance-kafka-0 to ip-10-0-128-61.us-west-2.compute.internal
  Warning  FailedMount         36s                  kubelet, ip-10-0-128-61.us-west-2.compute.internal  Unable to attach or mount volumes: unmounted volumes=[kafka-instance-datadir], unattached volumes=[config health-check-script metrics kafka-instance-datadir kafka-instance-token-m2d2h bootstrap]: timed out waiting for the condition
[ ... lines removed for clarity ...]
```

Here the FailedMount event shows that the pod cannot mount its data volume.

To get more details on what is happening, we can fetch the events:

```bash
kubectl get events --sort-by='.metadata.creationTimestamp'
```

```
[ ... lines removed for clarity ...]
default       54s         Warning   FailedAttachVolume       pod/kafka-instance-kafka-0                                                        AttachVolume.Attach failed for volume "pvc-e949f30a-79d6-46ba-9ec1-5658c8e66c17" : PersistentVolume "pvc-e949f30a-79d6-46ba-9ec1-5658c8e66c17" is marked for deletion
[ ... lines removed for clarity ...]
```

We can see that the root cause of the pod being stuck is an issue with the PersistentVolume: it is marked for deletion.
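From here, inspecting the claim and the volume directly usually narrows things down. A sketch for the volume named in the event above (the PVC name is an assumption derived from the `kafka-instance-datadir` volume and the pod name):

```bash
# The claim bound to the broker pod (name assumed from <template>-<pod-name>)
kubectl describe pvc kafka-instance-datadir-kafka-instance-kafka-0
# The underlying volume flagged in the FailedAttachVolume event
kubectl describe pv pvc-e949f30a-79d6-46ba-9ec1-5658c8e66c17
```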

#### Pod stuck with Status Pending

The same debugging approach can be applied to pods stuck in other states, such as Pending.

```bash
kubectl get pods
```

Expected output is a list of pods with one Pending pod:

```
NAME                             READY   STATUS    RESTARTS   AGE
kafka-instance-kafka-0           2/2     Running   0          35m
kafka-instance-kafka-1           2/2     Running   0          35m
kafka-instance-kafka-2           0/2     Pending   0          17s
```

Run the pod describe command and look for the events related to the pod:

```bash
kubectl describe pod kafka-instance-kafka-2
```

Expected output should reveal the reason that is stopping the scheduling of the pod:

```
[ ... lines removed for clarity ...]
Events:
  Type     Reason            Age        From               Message
  ----     ------            ----       ----               -------
  Warning  FailedScheduling  <unknown>  default-scheduler  0/7 nodes are available: 7 Insufficient memory.
[ ... lines removed for clarity ...]
```

We can see that the pod is in the Pending state because no node has enough free memory to satisfy its resource requests.
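To confirm, compare the pod's memory requests against what the nodes have left. A small sketch:

```bash
# What the broker pod is requesting
kubectl get pod kafka-instance-kafka-2 -o json \
  | jq '.spec.containers[].resources.requests'

# What each node has already allocated
kubectl describe nodes | grep -A 5 'Allocated resources'
```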

The same information can also be retrieved by checking the events:

```bash
kubectl get events --sort-by='.metadata.creationTimestamp'
```

Expected output should reveal that pod kafka-instance-kafka-2 is stuck due to resource starvation:

```
[ ... lines removed for clarity ...]
default     <unknown>   Warning   FailedScheduling         pod/kafka-instance-kafka-2         0/7 nodes are available: 7 Insufficient memory.
[ ... lines removed for clarity ...]
```
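When the cluster is busy, the event stream can be noisy. You can narrow it down to a single object with a field selector:

```bash
kubectl get events \
  --field-selector involvedObject.name=kafka-instance-kafka-2 \
  --sort-by='.metadata.creationTimestamp'
```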