Kubernetes Troubleshooting: Why Did My Pod Die?

Pods die because of scheduling failures, startup crashes or runtime terminations. When a Kubernetes pod fails, it rarely tells you why in a single line. Instead, it reports a status such as CrashLoopBackOff or Pending, which is a symptom rather than a root cause. To fix a pod, you must first work out which of these three failure modes you are facing. This guide provides a decision tree for diagnosing pod deaths, moving from high-level status checks to deep-dive container inspection.

For a comprehensive look at the most common restart cycles, check out the guide on how to fix Kubernetes CrashLoopBackOff in production.

Understanding the failure states

A pod “death” is a transition in the Pod Lifecycle. When you see CrashLoopBackOff, the container started, crashed, and Kubernetes is now waiting an exponentially increasing amount of time before trying to start it again to avoid hammering the node.
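
The waiting period is easier to reason about with numbers. The sketch below assumes the kubelet defaults described in the Kubernetes documentation (a 10-second initial delay that doubles after each crash and is capped at 300 seconds). The function name backoff_schedule is illustrative and not part of any tool.

```shell
# Print the back-off delay applied before each of the first N restarts.
# Assumption: kubelet defaults of 10s initial delay, doubling, capped at 300s.
backoff_schedule() {
  restarts=$1
  delay=10
  cap=300
  out=""
  i=1
  while [ "$i" -le "$restarts" ]; do
    # Append the current delay, separated by a space after the first entry.
    out="${out}${out:+ }${delay}s"
    delay=$((delay * 2))
    if [ "$delay" -gt "$cap" ]; then
      delay=$cap
    fi
    i=$((i + 1))
  done
  printf '%s\n' "$out"
}

backoff_schedule 7   # prints: 10s 20s 40s 80s 160s 300s 300s
```

After the sixth crash the pod is retried only once every five minutes, which is why a fixed pod can still look stuck for a while. The kubelet resets the timer once a container has run for 10 minutes without problems.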

OOMKilled means the Linux kernel terminated a process in the container because the container exceeded its memory limit. I have seen this happen frequently in Java applications where the JVM heap is set higher than the Kubernetes memory limit (the limit has to cover the heap plus non-heap memory such as metaspace and thread stacks). ImagePullBackOff means the kubelet cannot retrieve the container image from the registry. Each of these states points to a different layer of the stack: the infrastructure, the container runtime or the application code itself. Refer to the official Kubernetes Pod Lifecycle documentation for the full state machine.
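
To see which of these states each container is in without reading the full describe output, the pod status can be queried directly. This is a minimal sketch: container_states is a hypothetical helper name, and it assumes kubectl is already configured for the target cluster.

```shell
# Print one line per container: name, current waiting reason,
# last termination reason, last exit code (tab-separated).
# Usage: container_states POD [NAMESPACE]
container_states() {
  kubectl get pod "$1" --namespace "${2:-default}" \
    -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.state.waiting.reason}{"\t"}{.lastState.terminated.reason}{"\t"}{.lastState.terminated.exitCode}{"\n"}{end}'
}

# Example (pod and namespace names are placeholders):
# container_states web-0 prod
```

A container that is crash-looping after an out-of-memory kill would typically show CrashLoopBackOff in the second column and OOMKilled with exit code 137 in the last two.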

Common root causes

Pod failures generally fall into three categories.

Startup and Configuration Failures

Missing Environment Variables: The application panics immediately because a required database URL or API key is missing.
Invalid Image Tags: A typo in the image version or a deleted tag in the registry leads to ErrImagePull, then ImagePullBackOff as the kubelet retries.
Registry Authentication: The imagePullSecrets are missing or the service account lacks permission to pull from a private repository.

Resource and Infrastructure Constraints

Memory Limits (OOMKilled): The container attempted to allocate more memory than defined in its limits section.
CPU Throttling: Extreme throttling can trigger Liveness probe timeouts, causing Kubernetes to kill and restart the pod.
Scheduling Constraints: Pods stuck in Pending usually suffer from “Insufficient cpu” or “Insufficient memory” on all available nodes, or they have nodeSelector constraints that no node meets.

Health Check Failures

Liveness Probe Misconfiguration: The probe checks a /health endpoint that takes 10 seconds to respond, but the timeoutSeconds is set to 1. Kubernetes assumes the app is dead and kills it.
Slow Startup: Heavy frameworks like Spring Boot may take 60 seconds to start. If the Liveness probe starts checking after 10 seconds, it fails repeatedly and the kubelet restarts the container before it ever becomes ready.

Step-by-step recovery process

Follow this decision tree to isolate the root cause.

Step 1: Identify the Symptom

Start with the high-level status.

kubectl get pods

Observation:

If status is Pending → Go to Step 2.
If status is CrashLoopBackOff or Error → Go to Step 3.
If status is Running but the pod keeps restarting → Go to Step 4.
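
The routing above can be written down as a small lookup. This is only a sketch of the decision tree in this guide: next_step is a hypothetical helper, and treating any restart count above zero as "keeps restarting" is a simplifying assumption.

```shell
# Map a STATUS value from `kubectl get pods` to the step that covers it.
# Usage: next_step STATUS [RESTART_COUNT]
next_step() {
  status=$1
  restarts=${2:-0}
  case "$status" in
    Pending)
      echo "Step 2" ;;
    CrashLoopBackOff|Error)
      echo "Step 3" ;;
    Running)
      if [ "$restarts" -gt 0 ]; then
        echo "Step 4"
      else
        echo "no restarts recorded"
      fi ;;
    *)
      # Statuses this guide does not route explicitly.
      echo "unmapped status: start with kubectl describe pod" ;;
  esac
}

next_step CrashLoopBackOff   # prints: Step 3
next_step Running 12         # prints: Step 4
```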

Step 2: Debugging Pending Pods

If the pod isn’t even starting, check the events.

kubectl describe pod <pod-name>

Look at the Events section at the bottom. If you see FailedScheduling, check for taints or resource pressure. If you see FailedMount, your PVC is likely stuck in another zone or not bound.
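
On a busy pod the Events section can be long. As a sketch, the same events can be pulled on their own and narrowed to warnings; pod_warnings is a hypothetical helper name, and sorting by lastTimestamp assumes your cluster populates that field on events.

```shell
# List only Warning events for a single pod, oldest first.
# Usage: pod_warnings POD [NAMESPACE]
pod_warnings() {
  kubectl get events \
    --namespace "${2:-default}" \
    --field-selector "involvedObject.kind=Pod,involvedObject.name=$1,type=Warning" \
    --sort-by=.lastTimestamp
}

# Example (pod and namespace names are placeholders):
# pod_warnings web-0 prod
```

FailedScheduling and FailedMount both surface as Warning events, so they should appear in this filtered list.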

Step 3: Debugging CrashLoops and Errors

If the pod starts and then dies, check the application logs.

kubectl logs <pod-name> --previous

The --previous flag is critical. It allows you to see the logs from the container that just crashed, rather than the logs of the new container currently starting.

If the logs are empty, check the exit code in the output of kubectl describe pod, under Last State: Terminated.

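Exit Code 137: The process received SIGKILL (128 + 9). This is most often an OOM kill, which you can confirm when the Reason field reads OOMKilled. In that case, raise the memory limit in your manifest or reduce the application's memory usage.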
Exit Code 1: Application crash (NullPointerException, missing config, etc.).
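
Exit codes above 128 follow the shell convention of 128 plus the signal number, which makes codes other than 137 readable as well. The helper below is a minimal sketch; explain_exit_code is a hypothetical name, and the signal numbers are the standard Linux values.

```shell
# Explain a container exit code.
# Codes above 128 mean "terminated by signal (code - 128)".
explain_exit_code() {
  code=$1
  if [ "$code" -eq 0 ]; then
    echo "exit 0: completed successfully"
  elif [ "$code" -gt 128 ]; then
    sig=$((code - 128))
    case "$sig" in
      6)  name="SIGABRT" ;;
      9)  name="SIGKILL" ;;
      11) name="SIGSEGV" ;;
      15) name="SIGTERM" ;;
      *)  name="signal $sig" ;;
    esac
    echo "exit $code: terminated by $name"
  else
    echo "exit $code: application exited with an error"
  fi
}

explain_exit_code 137   # prints: exit 137: terminated by SIGKILL
explain_exit_code 143   # prints: exit 143: terminated by SIGTERM
```

A SIGKILL on its own does not prove an OOM kill, since the kubelet also sends SIGKILL when a container ignores SIGTERM past its grace period. The Reason field in the describe output is what tells the two apart.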

Step 4: Debugging Pods that Restart

If the pod is Running but the restart count is climbing, the Liveness probe is likely killing it.

kubectl describe pod <pod-name> | grep -i "Liveness probe failed"

Implement a startupProbe. Kubernetes holds off Liveness and Readiness probes until the startup probe succeeds, which gives the container time to finish its initial boot sequence. In clusters with slow-starting legacy apps, this can eliminate most of the restarts caused by probes firing during boot.

# Example Startup Probe for slow apps
startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  failureThreshold: 30
  periodSeconds: 10
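
The two numbers in this probe define the boot window: the container may fail the startup probe failureThreshold times, once every periodSeconds, before the kubelet restarts it. A quick way to check a candidate setting against a known boot time is sketched below; startup_budget is a hypothetical helper name.

```shell
# Maximum time, in seconds, that a startupProbe gives the container to boot.
# Usage: startup_budget FAILURE_THRESHOLD PERIOD_SECONDS
startup_budget() {
  failure_threshold=$1
  period_seconds=$2
  echo $((failure_threshold * period_seconds))
}

startup_budget 30 10   # prints: 300
```

With the values above the application gets up to 300 seconds, which comfortably covers a framework that needs 60 seconds to start.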

Step 5: Advanced Inspection

If you cannot get logs and the pod dies too fast to exec into, use an ephemeral debug container.

kubectl debug -it <pod-name> --image=busybox --target=<container-name>

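This adds an ephemeral container to the failing pod without restarting it. With --target, the debug shell shares the process namespace of the named container, so you can inspect its processes and the pod's network state in real time. The target's filesystem is usually reachable through /proc/<pid>/root.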

How to prevent future failures

Preventing pod failure requires a production-ready manifest checklist. Never deploy a pod without these four elements:

Explicit Resource Requests and Limits: Set requests to what the app needs to run and limits to a reasonable ceiling. This prevents a single pod from consuming all node memory and triggering a node-wide Out-of-Memory event.
Proper Probe Hierarchy: Use startupProbe for initial boot, livenessProbe for deadlock detection and readinessProbe to control traffic flow.
Non-Root Users: Use securityContext to run the container as a non-root user (runAsNonRoot, runAsUser) and set fsGroup so that user can write to mounted volumes. Otherwise the pod can crash on permission errors.
Graceful Shutdown: Handle SIGTERM in your application code. This lets the application finish in-flight requests and close connections before the default 30-second terminationGracePeriodSeconds expires and the kubelet sends SIGKILL, avoiding 502 errors during deployments.
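
How SIGTERM is handled depends on your application's language and framework. As an illustration only, the sketch below shows the pattern for a shell entrypoint, a common place where the signal gets lost; on_term and serve are hypothetical names and the work loop is a stand-in for real request handling.

```shell
#!/bin/sh
# Sketch of a container entrypoint that shuts down cleanly on SIGTERM.
shutting_down=0

on_term() {
  # Runs when the kubelet sends SIGTERM at the start of the grace period.
  shutting_down=1
}

trap on_term TERM

serve() {
  # Hypothetical work loop: replace the sleep with real request handling.
  while [ "$shutting_down" -eq 0 ]; do
    sleep 1
  done
  echo "draining connections and exiting"
}

# serve   # uncomment to run the loop as the container's main process
```

Without the trap, a shell running as PID 1 ignores SIGTERM, so the container would sit idle for the whole grace period and then be killed with SIGKILL.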