5. Troubleshooting

For more details on troubleshooting, have a look at Cilium’s Troubleshooting documentation.

Component & Cluster Health

An initial overview of Cilium can be retrieved by listing all Cilium pods and verifying that they have the status Running:

kubectl -n kube-system get pods -l k8s-app=cilium

In our single node cluster there is only one cilium pod running:

NAME           READY     STATUS    RESTARTS   AGE
cilium-ksr7h   1/1       Running   0          12m
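This check can also be scripted. The awk filter below counts pods whose STATUS column is not Running; the sample output from above is inlined as a stand-in, and on a live cluster you would pipe `kubectl -n kube-system get pods -l k8s-app=cilium` into the same filter instead:

```shell
# Sample kubectl output used as stand-in input for this sketch.
sample='NAME           READY     STATUS    RESTARTS   AGE
cilium-ksr7h   1/1       Running   0          12m'

# Skip the header line (NR > 1) and print rows whose third column
# (STATUS) is not "Running"; count them with wc -l.
not_running=$(printf '%s\n' "$sample" | awk 'NR > 1 && $3 != "Running"' | wc -l)
echo "pods not Running: $not_running"
# → pods not Running: 0
```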

If Cilium encounters a problem that it cannot recover from, it will automatically report the failure state via cilium status, which is regularly queried by the Kubernetes liveness probe to automatically restart Cilium pods. If a Cilium Pod is in the state CrashLoopBackOff, this indicates a permanent failure scenario.

If a particular Cilium Pod is not in a running state, the status and health of the agent on that node can be retrieved by running cilium status in the context of that pod:

kubectl -n kube-system exec ds/cilium -- cilium status

The output looks similar to this:

Defaulted container "cilium-agent" out of: cilium-agent, mount-cgroup (init), apply-sysctl-overwrites (init), mount-bpf-fs (init), clean-cilium-state (init)
KVStore:                 Ok   Disabled
Kubernetes:              Ok   1.24 (v1.24.3) [linux/amd64]
Kubernetes APIs:         ["cilium/v2::CiliumClusterwideNetworkPolicy", "cilium/v2::CiliumEndpoint", "cilium/v2::CiliumNetworkPolicy", "cilium/v2::CiliumNode", "core/v1::Namespace", "core/v1::Node", "core/v1::Pods", "core/v1::Service", "discovery/v1::EndpointSlice", "networking.k8s.io/v1::NetworkPolicy"]
KubeProxyReplacement:    Disabled   
Host firewall:           Disabled
CNI Chaining:            none
Cilium:                  Ok   1.12.5 (v1.12.5-701acde)
NodeMonitor:             Listening for events on 8 CPUs with 64x4096 of shared memory
Cilium health daemon:    Ok   
IPAM:                    IPv4: 10/254 allocated from 10.1.0.0/24, 
ClusterMesh:             0/0 clusters ready, 0 global-services
BandwidthManager:        Disabled
Host Routing:            Legacy
Masquerading:            IPTables [IPv4: Enabled, IPv6: Disabled]
Controller Status:       50/50 healthy
Proxy Status:            OK, ip 10.1.0.182, 0 redirects active on ports 10000-20000
Global Identity Range:   min 256, max 65535
Hubble:                  Ok   Current/Max Flows: 4095/4095 (100.00%), Flows/s: 8.71   Metrics: Ok
Encryption:              Disabled
Cluster health:          1/1 reachable   (2023-01-26T08:23:50Z)
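Individual lines of this output can be checked mechanically, assuming the textual format shown above. The sketch below parses the Controller Status line and warns when not all controllers are healthy; the heredoc stands in for live output from `kubectl -n kube-system exec ds/cilium -- cilium status`:

```shell
# Abbreviated stand-in for real `cilium status` output.
status=$(cat <<'EOF'
Cilium:                  Ok   1.12.5 (v1.12.5-701acde)
Controller Status:       50/50 healthy
Cilium health daemon:    Ok
EOF
)

# The third whitespace-separated field of the Controller Status line
# has the form "<healthy>/<total>"; split it on "/".
ok=$(printf '%s\n' "$status" | awk '/^Controller Status:/ {split($3, a, "/"); print a[1]}')
total=$(printf '%s\n' "$status" | awk '/^Controller Status:/ {split($3, a, "/"); print a[2]}')

if [ "$ok" = "$total" ]; then
    echo "all $total controllers healthy"
else
    echo "WARNING: only $ok of $total controllers healthy"
fi
# → all 50 controllers healthy
```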

More detailed information about the status of Cilium can be inspected with:

kubectl -n kube-system exec ds/cilium -- cilium status --verbose

Verbose output includes detailed IPAM state (allocated addresses), Cilium controller status, and details of the Proxy status.

Logs

To retrieve log files of a cilium pod, run:

kubectl -n kube-system logs --timestamps <pod-name>

The <pod-name> can be determined with the following command, selecting the name of one of the listed pods:

kubectl -n kube-system get pods -l k8s-app=cilium

If the Cilium Pod was already restarted by the liveness probe after encountering an issue, it can be useful to retrieve the logs of the Pod from before the last restart:

kubectl -n kube-system logs --timestamps -p <pod-name>
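Cilium agent logs use a key=value format that includes a level= field, so error and warning entries can be filtered with grep. The log lines below are illustrative samples, not real Cilium output; on a live cluster you would pipe `kubectl -n kube-system logs --timestamps <pod-name>` into the same filter:

```shell
# Illustrative sample log entries standing in for real agent logs.
logs='2023-01-26T08:23:50Z level=info msg="agent starting" subsys=daemon
2023-01-26T08:23:51Z level=warning msg="sample warning entry" subsys=daemon
2023-01-26T08:23:52Z level=info msg="agent ready" subsys=daemon'

# Keep only warning- and error-level entries.
printf '%s\n' "$logs" | grep -E 'level=(warning|error)'
```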

Policy Troubleshooting - Ensure Pod is managed by Cilium

A potential cause for policy enforcement not functioning as expected is that the networking of the Pod selected by the policy is not being managed by Cilium. The following situations result in unmanaged pods:

  • The Pod is running in host networking and will use the host’s IP address directly. Such pods have full network connectivity but Cilium will not provide security policy enforcement for such pods.
  • The Pod was started before Cilium was deployed. Cilium only manages pods that have been deployed after Cilium itself was started. Cilium will not provide security policy enforcement for such pods.

If Pod networking is not managed by Cilium, ingress and egress policy rules selecting the respective pods will not be applied. See the section Network Policy for more details.

For a quick assessment of whether any pods are not managed by Cilium, the Cilium CLI prints the number of managed pods. If all pods are reported as managed by Cilium, there is no problem:

cilium status
    /¯¯\
 /¯¯\__/¯¯\    Cilium:         OK
 \__/¯¯\__/    Operator:       OK
 /¯¯\__/¯¯\    Hubble:         OK
 \__/¯¯\__/    ClusterMesh:    disabled
    \__/

Deployment        cilium-operator    Desired: 2, Ready: 2/2, Available: 2/2
Deployment        hubble-relay       Desired: 1, Ready: 1/1, Available: 1/1
Deployment        hubble-ui          Desired: 1, Ready: 1/1, Available: 1/1
DaemonSet         cilium             Desired: 2, Ready: 2/2, Available: 2/2
Containers:       cilium-operator    Running: 2
                  hubble-relay       Running: 1
                  hubble-ui          Running: 1
                  cilium             Running: 2
Cluster Pods:     5/5 managed by Cilium
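Assuming the output format shown above, the Cluster Pods line can also be checked in a script. The variable below stands in for live output from the Cilium CLI:

```shell
# Stand-in for the "Cluster Pods" line of `cilium status` output.
line='Cluster Pods:     5/5 managed by Cilium'

# The third field has the form "<managed>/<total>"; split it on "/".
managed=$(printf '%s\n' "$line" | awk '{split($3, a, "/"); print a[1]}')
total=$(printf '%s\n' "$line" | awk '{split($3, a, "/"); print a[2]}')

if [ "$managed" = "$total" ]; then
    echo "all $total pods managed by Cilium"
else
    echo "$((total - managed)) pods are NOT managed by Cilium"
fi
# → all 5 pods managed by Cilium
```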

You can run the following script to list the pods which are not managed by Cilium:

curl -sLO https://raw.githubusercontent.com/cilium/cilium/master/contrib/k8s/k8s-unmanaged.sh
chmod +x k8s-unmanaged.sh
./k8s-unmanaged.sh

Reporting a problem - Automatic log & state collection

Before you report a problem, make sure to retrieve the necessary information from your cluster before the failure state is lost.

Execute the cilium sysdump command to collect troubleshooting information from your Kubernetes cluster:

cilium sysdump

Note that by default cilium sysdump will attempt to collect as many logs as possible for all the nodes in the cluster. If your cluster size is above 20 nodes, consider setting the following options to limit the size of the sysdump. This is not required, but is useful for those who have a constraint on bandwidth or upload size.

  • set the --node-list option to pick only a few nodes in case the cluster has many of them.
  • set the --logs-since-time option to go back in time to when the issues started.
  • set the --logs-limit-bytes option to limit the size of the log files (note: this value is passed to kubectl logs and does not apply to the collection archive as a whole).

Ideally, prefer a sysdump with the full history of a few selected nodes (--node-list) over a brief history of all the nodes. If you can narrow down when the issues started, use --logs-since-time. Lastly, if the Cilium agent and Operator logs are still too large, limit them with --logs-limit-bytes.
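A size-constrained invocation combining these options might look as follows. The node names are placeholders, the time window (start of the current day, UTC) and the 10 MiB per-container log limit are arbitrary example values, and the command is assembled with echo for illustration; drop the echo to run it against a real cluster:

```shell
# Example time bound: logs since midnight UTC today (portable date call).
since=$(date -u +%Y-%m-%dT00:00:00Z)

# Assemble the sysdump command; node-1,node-2 are placeholder node names.
echo cilium sysdump \
    --node-list node-1,node-2 \
    --logs-since-time "$since" \
    --logs-limit-bytes 10485760
```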

Use --help to see more options:

cilium sysdump --help