How to Fix Kubernetes CrashLoopBackOff in Production
Stop Kubernetes CrashLoopBackOff crashes with this production triage guide. Learn to decode exit codes, analyze logs, and prevent OOMKills effectively.
Problem: What CrashLoopBackOff actually means
When you see CrashLoopBackOff in your kubectl get pods output, you aren’t looking at a specific error, but a state. It is a symptom. It tells you that the kubelet tried to start your container, the container crashed, and Kubernetes is now waiting before trying again.
To prevent the API server and the node from being hammered by a process that crashes instantly, Kubernetes implements an exponential backoff delay. The first restart happens quickly, but subsequent failures increase the wait time (10s, 20s, 40s, up to a maximum of 5 minutes). If you don’t intervene, your pod will spend more time waiting than running, making it nearly impossible to catch logs in real-time.
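The schedule above can be approximated in a few lines of shell. This is only a sketch of the doubling-with-cap arithmetic described in the paragraph, not the kubelet's actual implementation, and the function name backoff_delay is made up for illustration.

```shell
# Approximate CrashLoopBackOff wait before restart attempt N (1-based),
# assuming a 10s base that doubles each time and caps at 300s (5 minutes).
# backoff_delay is a hypothetical helper, not part of kubectl.
backoff_delay() {
  attempt=$1
  delay=10
  i=1
  while [ "$i" -lt "$attempt" ]; do
    delay=$((delay * 2))
    if [ "$delay" -ge 300 ]; then
      delay=300
      break
    fi
    i=$((i + 1))
  done
  echo "$delay"
}

# Print the first six waits: 10, 20, 40, 80, 160, 300 seconds.
for n in 1 2 3 4 5 6; do
  echo "restart $n: wait $(backoff_delay "$n")s"
done
```

By the sixth restart the pod is already at the cap, which is why a loop that has been running for a while appears to sit idle for minutes at a time.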
You can find more details on the pod lifecycle in the official Kubernetes Documentation.
Root Causes: The “Why” behind the crash
Diagnosing a crash loop requires moving from the symptom to the cause. I’ve seen this fail most often in clusters with >50 nodes where configuration drift becomes common. Root causes generally fall into three severity tiers.
Low Severity: Configuration and Environment
These are “fail-fast” errors. The application starts, realizes it is missing a critical piece of information, and exits. Common culprits include:
- Missing ConfigMaps or Secrets: The pod is configured to expect a volume or environment variable that doesn’t exist.
- Incorrect Environment Variables: A typo in a database URL or a missing API key.
- Wrong Command/Args: An incorrect entrypoint in the Dockerfile or a typo in the args section of the YAML.
Medium Severity: Resource and Infrastructure
These crashes are often intermittent or happen shortly after the app begins processing traffic:
- OOMKilled (Exit Code 137): The container exceeded its memory limit and was killed. This is one of the most common production crashes, and it takes the affected replica completely out of service until it restarts.
- Liveness Probe Death Spiral: The application takes 30 seconds to start, but the liveness probe kills it after 10 seconds. The pod is healthy, but Kubernetes thinks it’s dead.
- Storage Permissions: The container user doesn’t have write access to a mounted PersistentVolume.
High Severity: External Dependencies
The application is healthy, but its environment is hostile:
- Database Connection Timeouts: The app crashes because it cannot reach the DB due to a firewall rule or incorrect credentials.
- DNS Resolution Failures: CoreDNS issues prevent the app from finding other services within the cluster.
- Dependency API Outages: A hard dependency (like an external Auth provider) is down, and the app isn’t designed to handle the failure gracefully.
Solution: Step-by-Step Production Triage
When a production service goes into CrashLoopBackOff, follow this severity-based logic tree. Do not guess; use the data provided by the cluster.
Step 1: The Rapid Triage
Start by checking the pod status and events. This tells you if the crash is happening because of the image, the scheduler, or the application itself.
kubectl describe pod <pod-name>
Look at the Containers section for the Last State. You will see an Exit Code. This is your most important clue.
Expected Output:
You should see a section like this:
Last State: Terminated
Reason: Error
Exit Code: 137
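If you want the exit code without reading the whole describe output, a small filter can pull it out. The sketch below assumes the field labels shown above ("Last State" followed by "Exit Code"); last_exit_code is a hypothetical helper name, not a kubectl feature.

```shell
# Reads `kubectl describe pod` output on stdin and prints the exit code
# recorded under "Last State" for each container.
# last_exit_code is a hypothetical helper, not part of kubectl.
last_exit_code() {
  awk '
    /Last State:/           { in_last = 1; next }
    in_last && /Exit Code:/ { print $NF; in_last = 0 }
    /Ready:/                { in_last = 0 }
  '
}

# Usage against a real cluster (placeholder as in the command above):
#   kubectl describe pod <pod-name> | last_exit_code
# The same value is also exposed in the pod status, for example:
#   kubectl get pod <pod-name> \
#     -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'
```

The filter only reports the code from the previous run, so a container that is currently waiting in backoff still yields the value you need for the next step.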
Step 2: Decoding the Exit Code
Map the exit code from describe to your action plan:
- Exit Code 0: The app finished its task and exited. If this is a Deployment, it shouldn’t happen. You likely need a Job instead.
- Exit Code 1: General application crash. Move to Step 3 to check logs.
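- Exit Code 137: SIGKILL (128 + 9). If the Reason field says OOMKilled, the container exceeded its memory limit: increase the memory limit in the deployment YAML or fix the leak. If it does not, check whether a failing liveness probe or an expired termination grace period triggered the kill.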
- Exit Code 139: Segmentation fault. This is usually a binary incompatibility or a memory corruption issue in the code.
- Exit Code 143: SIGTERM (128 + 15). The container was told to stop and died on the signal instead of shutting down gracefully.
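The mapping above can be kept at hand as a small lookup function. This is a minimal sketch: explain_exit_code is a hypothetical helper, and the fallback branch relies on the shell convention that an exit code above 128 means 128 plus the number of the signal that killed the process.

```shell
# Hypothetical helper: maps a container exit code to a first triage hint.
explain_exit_code() {
  code=$1
  case "$code" in
    0)   echo "exited cleanly: check whether this workload should be a Job" ;;
    1)   echo "application error: check logs" ;;
    137) echo "SIGKILL (128+9): check for OOMKilled" ;;
    139) echo "SIGSEGV (128+11): segmentation fault" ;;
    143) echo "SIGTERM (128+15): stopped without graceful shutdown" ;;
    *)
      # Codes above 128 conventionally encode 128 + signal number.
      if [ "$code" -gt 128 ]; then
        echo "killed by signal $((code - 128))"
      else
        echo "application-defined exit code: check logs"
      fi
      ;;
  esac
}

explain_exit_code 137
```

Codes that are not in the list are not necessarily mysterious: anything from 2 to 127 is usually chosen by the application itself, so its own documentation or logs are the place to look.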
Step 3: Retrieving the “Hidden” Logs
If a pod is crashing, kubectl logs <pod-name> often returns nothing because the current container has just started and hasn’t logged anything yet. You must check the logs of the previous failed instance.
kubectl logs <pod-name> --previous
If the logs show a connection timeout to a database, check your network policies or secret values. If you need a faster way to handle urgent fixes, you can use /tips/rapid-rollback-kubectl-set-image-for-urgent-fixes to revert to a known working image version.
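When the previous container was chatty, it can help to narrow the output to lines that look like failures. The filter below is a rough sketch: triage_logs is a hypothetical helper and the keyword list is an assumption that you should tune to your application's log format.

```shell
# Hypothetical filter: keeps only lines that match common failure keywords.
# The keyword list is an assumption, not an exhaustive set of error patterns.
triage_logs() {
  grep -Ei 'timeout|timed out|connection refused|permission denied|no such file|panic|fatal'
}

# Usage against a real cluster (placeholder as in the command above):
#   kubectl logs <pod-name> --previous --tail=200 | triage_logs
```

If the filter prints nothing, fall back to reading the unfiltered tail, since the last few lines before the exit are often the most informative.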
Step 4: Debugging “Silent” Crashes with Ephemeral Containers
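Sometimes logs are empty and the exit code is vague. In Kubernetes v1.23+ (stable since v1.25), you can use kubectl debug to attach an ephemeral container with debugging tools to the crashing pod. With --target, it shares the process namespace of the named container. The busybox image ships wget, nslookup, and vi; pick a larger image if you need curl, dig, or vim.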
kubectl debug -it <pod-name> --image=busybox --target=<container-name>
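Once inside, you can inspect the /tmp directory, check network connectivity, or look at the filesystem to see if a config file was mounted incorrectly. The ephemeral container has its own root filesystem, so the target container’s files are typically reachable under /proc/<pid>/root, where <pid> is the target’s process ID. For a wider array of helpful commands, refer to the /tips/kubectl-essential-commands guide.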
Prevention: Stopping the loop before it starts
Prevention is about shifting from “fixing” to “hardening”. Implement these four strategies to eliminate CrashLoopBackOff in production:
- Right-Size Resources: Use a Vertical Pod Autoscaler (VPA) in Recommender mode to find the actual memory usage of your app. Set your limits roughly 20% higher than your requests to handle spikes without triggering an OOMKill.
- Graceful Probes: Never set a livenessProbe with the same timing as your readinessProbe. Give your application an initialDelaySeconds that covers its worst-case startup time. If your app takes 20 seconds to boot, set the delay to 30 seconds.
- Robust Signal Handling: Ensure your application handles SIGTERM (Exit Code 143). If your app ignores this signal, Kubernetes will eventually force-kill it with SIGKILL (Exit Code 137) after the terminationGracePeriodSeconds expires.
- Better Observability: Integrate deep monitoring. If you are running AI workloads, implementing /tutorials/llm-observability-on-kubernetes-a-practical-guide will help you identify if crashes are caused by GPU memory exhaustion or model loading timeouts.
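To make the signal-handling point concrete, here is a minimal sketch of a shell entrypoint loop that exits with code 0 when it receives SIGTERM instead of dying with 143. The function name graceful_worker and the one-second work loop are placeholders; a real service would flush buffers or drain connections inside the trap.

```shell
# Hypothetical entrypoint loop that shuts down cleanly on SIGTERM.
graceful_worker() {
  # Run cleanup and exit 0 when the kubelet sends SIGTERM.
  trap 'echo "SIGTERM received, shutting down"; exit 0' TERM
  while :; do
    # Sleep in the background and wait on it, so the trap fires right away
    # instead of only after the sleep finishes.
    sleep 1 &
    wait $!
  done
}

# Usage as the last line of a container entrypoint script:
#   graceful_worker
```

Waiting on a background sleep matters because a shell runs its trap only after the current foreground command returns, which could otherwise eat into the grace period.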
FAQ
Why does my pod stay in CrashLoopBackOff even after I fix the ConfigMap?
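Kubernetes uses an exponential backoff. If your pod has crashed multiple times, it might wait up to 5 minutes before the next restart attempt. If the pod is managed by a Deployment or another controller, you can force an immediate restart by deleting it, and the controller will create a replacement with a fresh backoff timer: kubectl delete pod <pod-name>.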
Is an Exit Code 137 always an OOMKill?
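Almost always, but not exclusively. It means the process received a SIGKILL (Signal 9). The usual cause is the kernel’s OOM killer terminating a container that exceeded its memory limit, but the kubelet also sends SIGKILL when a container ignores SIGTERM past its grace period, and the node’s OOM killer can do the same under node-wide memory pressure. Check the Reason field in kubectl describe pod: OOMKilled confirms the memory case.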
How can I prevent a crashing pod from affecting other pods on the same node?
Define strict resources.limits. Without memory limits, a single leaking container can consume all node memory, triggering the Node-level OOM killer and causing unrelated pods to be evicted.
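As a rough illustration of setting a limit, the sketch below derives one from a measured request using the 20% headroom suggested in the prevention section. suggest_memory_limit is a hypothetical helper, and the 500Mi figure is a placeholder rather than a recommendation.

```shell
# Hypothetical helper: suggests a memory limit 20% above a request given in Mi.
suggest_memory_limit() {
  request_mi=$1
  echo "$((request_mi * 120 / 100))Mi"
}

# Example with a real deployment (name and values are placeholders):
#   kubectl set resources deployment <deployment-name> \
#     --requests=memory=500Mi --limits=memory="$(suggest_memory_limit 500)"
```

Integer arithmetic rounds down, so a 256Mi request yields 307Mi; round up by hand if you prefer tidy values.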
Conclusion and Next Steps
CrashLoopBackOff is a safety mechanism, not a bug. By isolating the exit code and inspecting previous logs, you can quickly determine if you are facing a configuration error, a resource constraint, or a dependency failure.
To further harden your production environment, take these next steps:
- Audit your livenessProbes to ensure they aren’t too aggressive.
- Deploy a Vertical Pod Autoscaler (VPA) to get data-driven memory limits.
- Implement a structured logging pipeline (EFK/ELK) so you don’t have to rely on --previous logs during an incident.