Kubernetes Troubleshooting: Why Did My Pod Die?
Learn how to diagnose and fix Kubernetes pod failures. A practical guide to solving CrashLoopBackOff, OOMKilled, and Pending states in production.
Pods die because of scheduling failures, startup crashes or runtime terminations. When a Kubernetes pod fails, it rarely tells you exactly why in a single line. Instead, it gives you a status like CrashLoopBackOff or Pending, which is a symptom rather than a root cause. To fix a pod, you must first work out which of these three failure modes you are facing. This guide provides a decision tree for diagnosing pod deaths, moving from high-level status checks to deep-dive container inspection.
For a comprehensive look at the most common restart cycles, check out the guide on how to fix Kubernetes CrashLoopBackOff in production.
Understanding the failure states
A pod “death” is a transition in the Pod Lifecycle. When you see CrashLoopBackOff, the container started, crashed, and Kubernetes is now waiting an exponentially increasing amount of time before trying to start it again to avoid hammering the node.
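To make the back-off concrete, here is a rough sketch of the restart delay schedule. It assumes the defaults described in the Kubernetes documentation (a 10-second initial delay that doubles after each crash and is capped at 300 seconds). The crashloop_delay function is an illustrative helper written for this guide, not a kubectl feature.

```shell
# Illustrative helper: delay in seconds before restart attempt N (0-based),
# assuming the documented kubelet defaults of 10s base, doubling, 300s cap.
crashloop_delay() {
  delay=10
  i=0
  while [ "$i" -lt "$1" ] && [ "$delay" -lt 300 ]; do
    delay=$((delay * 2))
    i=$((i + 1))
  done
  if [ "$delay" -gt 300 ]; then
    delay=300
  fi
  echo "$delay"
}

# Print the delay before each of the first seven restarts.
for attempt in 0 1 2 3 4 5 6; do
  echo "restart $attempt: wait $(crashloop_delay "$attempt")s"
done
```

Under these assumptions the delay reaches the five-minute cap after about five crashes, which is why a pod can stay in CrashLoopBackOff for minutes even after you have fixed the underlying problem. The kubelet resets the timer once a container has run for 10 minutes without issues.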
OOMKilled means the Linux kernel terminated the process because it exceeded its memory limit. I have seen this happen frequently in Java applications where the JVM heap is set higher than the Kubernetes memory limit. ImagePullBackOff means the kubelet cannot retrieve the container image from the registry. Each of these states points to a different layer of the stack: the infrastructure, the container runtime or the application code itself. Refer to the official Kubernetes Pod Lifecycle documentation for the full state machine.
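For the JVM case, one common mitigation is to size the heap as a fraction of the container limit instead of hard-coding it. The sketch below is an illustration only: the 75% fraction and the 1024 MiB limit are assumed values, and heap_mib_for_limit is a hypothetical helper. -XX:MaxRAMPercentage is a HotSpot flag available in JDK 10 and later, and JAVA_TOOL_OPTIONS is an environment variable the JVM reads at startup.

```shell
# Hypothetical helper: heap size in MiB that leaves headroom under a memory limit.
# $1 = container memory limit in MiB, $2 = percentage of the limit given to the heap.
heap_mib_for_limit() {
  echo $(( $1 * $2 / 100 ))
}

limit_mib=1024   # assumed value of resources.limits.memory (1Gi)
heap_pct=75      # assumed fraction; the rest covers metaspace, threads and buffers

echo "limit ${limit_mib}Mi -> heap about $(heap_mib_for_limit "$limit_mib" "$heap_pct")Mi"

# Let the JVM derive the same bound itself instead of using a fixed -Xmx.
export JAVA_TOOL_OPTIONS="-XX:MaxRAMPercentage=${heap_pct}.0"
echo "JAVA_TOOL_OPTIONS=$JAVA_TOOL_OPTIONS"
```

The point of the headroom is that the JVM uses memory outside the heap as well, so a heap equal to the limit can still end in OOMKilled. The right fraction depends on the application and should be measured rather than taken from this example.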
Common root causes
Pod failures generally fall into three categories.
Startup and Configuration Failures
- Missing Environment Variables: The application panics immediately because a required database URL or API key is missing.
- Invalid Image Tags: A typo in the image version or a deleted tag in the registry leads to ErrImagePull.
- Registry Authentication: The imagePullSecrets are missing or the service account lacks permission to pull from a private repository.
Resource and Infrastructure Constraints
- Memory Limits (OOMKilled): The container attempted to allocate more memory than defined in its limits section.
- CPU Throttling: Extreme throttling can trigger Liveness probe timeouts, causing Kubernetes to kill and restart the pod.
- Scheduling Constraints: Pods stuck in Pending usually suffer from “Insufficient cpu” or “Insufficient memory” on all available nodes, or they have nodeSelector constraints that no node meets.
Health Check Failures
- Liveness Probe Misconfiguration: The probe checks a /health endpoint that takes 10 seconds to respond, but the timeoutSeconds is set to 1. Kubernetes assumes the app is dead and kills it.
- Slow Startup: Heavy frameworks like Spring Boot may take 60 seconds to start. If the Liveness probe starts checking after 10 seconds, the pod is killed before it ever becomes ready.
Step-by-step recovery process
Follow this decision tree to isolate the root cause.
Step 1: Identify the Symptom
Start with the high-level status.
kubectl get pods
Observation:
- If status is Pending → Go to Step 2.
- If status is CrashLoopBackOff or Error → Go to Step 3.
- If status is Running but the pod keeps restarting → Go to Step 4.
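The routing in this step can be written down as a small helper. This is only a sketch: next_step is a hypothetical function, and it expects you to pass in the STATUS and RESTARTS values you read from kubectl get pods.

```shell
# Illustrative triage helper mirroring the decision tree in this step.
# $1 = STATUS column from `kubectl get pods`, $2 = RESTARTS count (defaults to 0).
next_step() {
  case "$1" in
    Pending)
      echo "Step 2" ;;
    CrashLoopBackOff|Error)
      echo "Step 3" ;;
    Running)
      if [ "${2:-0}" -gt 0 ]; then
        echo "Step 4"
      else
        echo "No action"
      fi ;;
    *)
      echo "Run kubectl describe pod" ;;
  esac
}

next_step Pending
next_step CrashLoopBackOff
next_step Running 7
```

Any status the tree does not cover falls through to kubectl describe pod, which is the safest next move when the symptom is unfamiliar.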
Step 2: Debugging Pending Pods
If the pod isn’t even starting, check the events.
kubectl describe pod <pod-name>
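Look at the Events section at the bottom. If you see FailedScheduling, read the message: it names the cause, such as taints, resource pressure, an unbound PVC or a volume that lives in another zone. If you see FailedMount, the pod was scheduled but the kubelet could not mount a volume, often because a referenced Secret or ConfigMap is missing or the volume is still attached to another node.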
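When the describe output is long, it can help to pull just the events for one pod and compare the pod's requests with what the nodes offer. The wrappers below are a sketch: pod_events and node_allocatable are hypothetical function names, and the "default" namespace fallback is an assumption. The kubectl flags themselves (--field-selector, --sort-by, custom-columns) are standard.

```shell
# Illustrative wrappers; replace the pod name and namespace with your own.

# Show only the events that belong to one pod, oldest first.
# $1 = pod name, $2 = namespace (assumed to be "default" when omitted).
pod_events() {
  kubectl get events -n "${2:-default}" \
    --field-selector "involvedObject.name=$1" \
    --sort-by=.lastTimestamp
}

# List each node's total allocatable CPU and memory. This is the ceiling for
# scheduling, not what is currently free.
node_allocatable() {
  kubectl get nodes \
    -o custom-columns=NAME:.metadata.name,CPU:.status.allocatable.cpu,MEMORY:.status.allocatable.memory
}

# Usage (requires access to a cluster):
#   pod_events my-pod my-namespace
#   node_allocatable
```

To see how much of a node is already committed, kubectl describe node prints an "Allocated resources" section with the summed requests and limits.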
Step 3: Debugging CrashLoops and Errors
If the pod starts and then dies, check the application logs.
kubectl logs <pod-name> --previous
The --previous flag is critical. It allows you to see the logs from the container that just crashed, rather than the logs of the new container currently starting.
If logs are empty, check the exit code using kubectl describe pod.
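- Exit Code 137: The process was killed with SIGKILL (128 + 9). This is most often OOMKilled; confirm it in the Last State reason shown by kubectl describe pod, then raise the memory limit in your manifest or reduce the application's memory use.
- Exit Code 1: Application crash (NullPointerException, missing config, etc.).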
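Exit codes above 128 follow the usual convention of 128 plus the signal number, which is where 137 comes from. The decoder below is a minimal sketch of that arithmetic; explain_exit_code is a hypothetical helper and only names a few common signals.

```shell
# Illustrative decoder for container exit codes.
# Codes above 128 mean the process was killed by a signal (code - 128).
explain_exit_code() {
  code=$1
  if [ "$code" -eq 0 ]; then
    echo "exit 0: completed normally"
  elif [ "$code" -gt 128 ]; then
    sig=$((code - 128))
    case "$sig" in
      9)  name="SIGKILL" ;;
      11) name="SIGSEGV" ;;
      15) name="SIGTERM" ;;
      *)  name="signal $sig" ;;
    esac
    echo "exit $code: killed by $name"
  else
    echo "exit $code: process exited with an error"
  fi
}

explain_exit_code 137
explain_exit_code 143
explain_exit_code 1

# Read the real values from the cluster (requires access to one):
#   kubectl get pod <pod-name> -o jsonpath='{.status.containerStatuses[*].lastState.terminated.exitCode}'
#   kubectl get pod <pod-name> -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}'
```

SIGKILL alone does not prove a memory problem, since the kubelet also sends it when a container ignores SIGTERM past its grace period. The terminated reason field is what distinguishes OOMKilled from other kills.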
Step 4: Debugging Pods that Restart
If the pod is Running but the restart count is climbing, the Liveness probe is likely killing it.
kubectl describe pod <pod-name> | grep -i "Liveness probe failed"
Implement a startupProbe. This tells Kubernetes to hold off Liveness and Readiness probes until the startup probe succeeds, so the container can finish its initial boot sequence. In clusters with slow-starting legacy apps, this can eliminate nearly all of the restarts caused by probes firing during boot.
# Example Startup Probe for slow apps
startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  failureThreshold: 30
  periodSeconds: 10
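The two numbers in the probe above define how long the application has to start: the probe may fail failureThreshold times, periodSeconds apart, before the container is restarted. A quick sketch of that arithmetic, using a hypothetical startup_budget_seconds helper:

```shell
# Illustrative arithmetic: the longest a container may take to start before
# the startupProbe gives up, i.e. failureThreshold * periodSeconds.
startup_budget_seconds() {
  echo $(( $1 * $2 ))
}

# Values from the example probe: failureThreshold=30, periodSeconds=10.
echo "startup budget: $(startup_budget_seconds 30 10)s"
```

With these values the application gets up to 300 seconds. Size the budget from your slowest observed startup plus some margin, since a budget that is too tight recreates the restart loop the probe was meant to prevent.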
Step 5: Advanced Inspection
If you cannot get logs and the pod dies too fast to exec into, use an ephemeral debug container.
kubectl debug -it <pod-name> --image=busybox --target=<container-name>
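This adds an ephemeral container to the failing pod without restarting it. With --target, the debug container shares the process namespace of the named container, so you can inspect its processes and the pod's network state in real time. While the target process is running, its filesystem is usually reachable through /proc/<pid>/root.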
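If the container exits before you can look at anything, another option is to debug a copy of the pod with its command replaced by a shell. The wrapper below is a sketch: debug_copy is a hypothetical function name, the "-debug" suffix is an arbitrary choice, and it assumes the container image ships sh.

```shell
# Illustrative wrapper: start a copy of a crashing pod with its command
# replaced by a shell, so the container stays up long enough to inspect.
# $1 = pod name, $2 = container name. The "-debug" suffix is an arbitrary choice.
debug_copy() {
  kubectl debug "$1" -it --copy-to="$1-debug" --container="$2" -- sh
}

# Usage (requires access to a cluster; delete the copy when finished):
#   debug_copy my-pod my-container
#   kubectl delete pod my-pod-debug
```

Because the copy runs a shell instead of the real entrypoint, you can start the application by hand inside it and watch it fail with the same environment variables and mounts.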
How to prevent future failures
Preventing pod failure requires a production-ready manifest checklist. Never deploy a pod without these four elements:
- Explicit Resource Requests and Limits: Set requests to what the app needs to run and limits to a reasonable ceiling. This prevents a single pod from consuming all node memory and triggering a node-wide Out-of-Memory event.
- Proper Probe Hierarchy: Use startupProbe for initial boot, livenessProbe for deadlock detection and readinessProbe to control traffic flow.
- Non-Root Users: Use securityContext to run as a non-root user and set fsGroup so that user can write to mounted volumes, rather than crashing on permission errors.
- Graceful Shutdown: Handle SIGTERM in your application code. This lets the application drain connections before terminationGracePeriodSeconds (30 seconds by default) expires, avoiding 502 errors during deployments.
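Graceful shutdown often breaks at the entrypoint: when a shell script runs as PID 1, it does not pass SIGTERM on to the application it started. The simplest fix is to end the script with exec so the application receives the signal directly. Where the wrapper has to stay alive, something like the sketch below can forward the signal. run_graceful is a hypothetical helper, not part of Kubernetes or any image.

```shell
# Illustrative entrypoint helper: run a command, forward SIGTERM to it,
# and return the command's own exit status.
run_graceful() {
  # $child is expanded when the trap fires, after it has been set below.
  trap 'echo "SIGTERM received, stopping child"; kill -TERM "$child" 2>/dev/null' TERM
  "$@" &
  child=$!
  status=0
  wait "$child" || status=$?
  # A trapped signal interrupts wait with a status above 128 while the child
  # may still be draining, so wait once more for its real exit status.
  if [ "$status" -gt 128 ] && kill -0 "$child" 2>/dev/null; then
    status=0
    wait "$child" || status=$?
  fi
  trap - TERM
  return "$status"
}

# Usage inside a container entrypoint (the path is a placeholder):
#   run_graceful /app/server
```

The application itself still has to stop accepting new requests and finish in-flight ones when it receives SIGTERM. The wrapper only makes sure the signal arrives.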