
How to Fix Kubernetes CrashLoopBackOff in Production

Stop Kubernetes CrashLoopBackOff crashes with this production triage guide. Learn to decode exit codes, analyze logs, and prevent OOMKills effectively.


Problem: What CrashLoopBackOff actually means

When you see CrashLoopBackOff in your kubectl get pods output, you aren’t looking at a specific error, but a state. It is a symptom. It tells you that the kubelet tried to start your container, the container crashed, and Kubernetes is now waiting before trying again.

To keep a process that crashes instantly from hammering the node and the API server, the kubelet applies an exponential backoff delay between restarts. The first restart happens quickly, but each subsequent failure doubles the wait (10s, 20s, 40s, and so on, up to a maximum of 5 minutes). The timer resets once a container has run for 10 minutes without problems. If you don’t intervene, your pod will spend more time waiting than running, making it nearly impossible to catch logs in real time.
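
To make the cost of the backoff concrete, here is a minimal sketch of the schedule described above. It is a simplified model built only from the figures in this section (10-second base, doubling, 5-minute cap); the function name and parameters are illustrative, and real kubelet timing will not match to the second.

```python
def restart_backoff_delays(restarts, base_seconds=10, cap_seconds=300):
    """Return the modelled wait before each of the first `restarts` restarts."""
    delays = []
    delay = base_seconds
    for _ in range(restarts):
        # The delay doubles after every crash but never exceeds the cap.
        delays.append(min(delay, cap_seconds))
        delay *= 2
    return delays


delays = restart_backoff_delays(8)
print(delays)       # [10, 20, 40, 80, 160, 300, 300, 300]
print(sum(delays))  # 1210 seconds of waiting across eight restarts
```

Under this model the cap is reached on the sixth restart, which is why a pod that has been looping for a while appears to do nothing for minutes at a time.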

You can find more details on the pod lifecycle in the official Kubernetes Documentation.

Root Causes: The “Why” behind the crash

Diagnosing a crash loop requires moving from the symptom to the cause. I’ve seen this fail most often in clusters with >50 nodes where configuration drift becomes common. Root causes generally fall into three severity tiers.

Low Severity: Configuration and Environment

These are “fail-fast” errors. The application starts, realizes it is missing a critical piece of information, and exits. Common culprits include:

  • Missing ConfigMaps or Secrets: The pod is configured to expect a volume or environment variable that doesn’t exist.
  • Incorrect Environment Variables: A typo in a database URL or a missing API key.
  • Wrong Command/Args: An incorrect entrypoint in the Dockerfile or a typo in the args section of the YAML.
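
Fail-fast errors are much easier to triage when the application names what is missing before it exits. The following is a minimal sketch of that idea, assuming a Python service; the variable names DATABASE_URL and API_KEY are hypothetical placeholders, not settings taken from any real deployment.

```python
import os
import sys

# Hypothetical required settings, used only for illustration.
REQUIRED_VARS = ("DATABASE_URL", "API_KEY")


def missing_settings(environ, required=REQUIRED_VARS):
    """Return the required variable names that are unset or empty."""
    return [name for name in required if not environ.get(name)]


def validate_or_exit(environ=None):
    """Exit with code 1 and a one-line message if any setting is missing."""
    environ = os.environ if environ is None else environ
    missing = missing_settings(environ)
    if missing:
        # A single explicit line is easy to spot in `kubectl logs --previous`.
        print(f"missing required settings: {', '.join(missing)}", file=sys.stderr)
        sys.exit(1)
```

Calling validate_or_exit() at the top of the entrypoint turns a vague Exit Code 1 into a log line that points straight at the missing ConfigMap or Secret key.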

Medium Severity: Resource and Infrastructure

These crashes are often intermittent or happen shortly after the app begins processing traffic:

  • OOMKilled (Exit Code 137): The container exceeded its memory limit and was killed. This is one of the most common production crashes, and it takes that specific replica completely out of service until it restarts.
  • Liveness Probe Death Spiral: The application takes 30 seconds to start, but the liveness probe kills it after 10 seconds. The pod is healthy, but Kubernetes thinks it’s dead.
  • Storage Permissions: The container user doesn’t have write access to a mounted PersistentVolume.
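
The liveness probe death spiral is easy to check with arithmetic before you deploy. The sketch below is a simplified model, not the kubelet's actual algorithm: it assumes the first probe fires at initialDelaySeconds, later probes fire every periodSeconds, every probe fails until the app has finished starting, and the container is restarted after failureThreshold consecutive failures. It ignores timeoutSeconds and scheduling jitter.

```python
def liveness_kill_time(startup_seconds, initial_delay, period, failure_threshold):
    """Return the second at which the container would be restarted,
    or None if the app becomes healthy before the threshold is reached."""
    failures = 0
    probe_time = initial_delay
    while probe_time < startup_seconds:
        # The app is still booting, so this probe fails.
        failures += 1
        if failures >= failure_threshold:
            return probe_time
        probe_time += period
    return None


# App needs 30s to start; probes begin immediately, every 5s, 3 strikes.
print(liveness_kill_time(30, initial_delay=0, period=5, failure_threshold=3))   # 10
# Same app, but the first probe waits until startup is done.
print(liveness_kill_time(30, initial_delay=30, period=5, failure_threshold=3))  # None
```

The first call reproduces the scenario in the list above: a healthy app that needs 30 seconds is killed at the 10-second mark, on every restart, forever.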

High Severity: External Dependencies

The application is healthy, but its environment is hostile:

  • Database Connection Timeouts: The app crashes because it cannot reach the DB due to a firewall rule or incorrect credentials.
  • DNS Resolution Failures: CoreDNS issues prevent the app from finding other services within the cluster.
  • Dependency API Outages: A hard dependency (like an external Auth provider) is down, and the app isn’t designed to handle the failure gracefully.
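
One way to keep a short dependency outage from becoming a crash loop is to retry the connection inside the process instead of exiting on the first failure. This is a minimal sketch under assumptions: the function name, its parameters, and the choice of ConnectionError and TimeoutError are illustrative, and a real client library may raise its own exception types.

```python
import time


def call_with_retry(operation, attempts=5, base_delay=1.0, max_delay=30.0,
                    sleep=time.sleep):
    """Call operation(), retrying connection failures with a capped,
    doubling delay. Re-raises the last error once attempts are used up."""
    delay = base_delay
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except (ConnectionError, TimeoutError):
            if attempt == attempts:
                raise  # Give up; the crash is now a deliberate signal.
            sleep(min(delay, max_delay))
            delay *= 2
```

The sleep parameter exists so the delay can be replaced in tests. If the dependency stays down past the last attempt, the process still exits, which is reasonable: at that point restarting under backoff is the intended behavior.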

Solution: Step-by-Step Production Triage

When a production service goes into CrashLoopBackOff, follow this severity-based logic tree. Do not guess; use the data provided by the cluster.

Step 1: The Rapid Triage

Start by checking the pod status and events. This tells you if the crash is happening because of the image, the scheduler, or the application itself.

kubectl describe pod <pod-name>

Look at the Containers section for the Last State. You will see an Exit Code. This is your most important clue.

Expected Output: You should see a section like this:

    Last State:     Terminated
      Reason:       Error
      Exit Code:    137
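
If you need the same fields for many pods, they are also available in machine-readable form from kubectl get pod <pod-name> -o json, under status.containerStatuses[].lastState.terminated. The sketch below pulls them out of an already-loaded pod object; the sample document is a trimmed, made-up example with a hypothetical container name, not output from a real cluster.

```python
import json


def last_terminations(pod):
    """Return (container, exit_code, reason) for each container that has
    a recorded previous termination."""
    results = []
    for status in pod.get("status", {}).get("containerStatuses", []):
        terminated = status.get("lastState", {}).get("terminated")
        if terminated:
            results.append(
                (status["name"], terminated["exitCode"], terminated.get("reason"))
            )
    return results


# Trimmed, hypothetical sample shaped like `kubectl get pod <pod-name> -o json`.
sample = json.loads("""
{
  "status": {
    "containerStatuses": [
      {
        "name": "web",
        "restartCount": 7,
        "lastState": {"terminated": {"exitCode": 137, "reason": "OOMKilled"}}
      }
    ]
  }
}
""")
print(last_terminations(sample))  # [('web', 137, 'OOMKilled')]
```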

Step 2: Decoding the Exit Code

Map the exit code from describe to your action plan:

  • Exit Code 0: The app finished its task and exited. If this is a Deployment, it shouldn’t happen. You likely need a Job instead.
  • Exit Code 1: General application crash. Move to Step 3 to check logs.
  • Exit Code 137: SIGKILL (128 + 9), most often an OOMKill. If the Reason field shows OOMKilled, increase the memory limit in the deployment YAML or reduce the app’s memory use.
  • Exit Code 139: Segmentation fault. This is usually a binary incompatibility or a memory corruption issue in the code.
  • Exit Code 143: SIGTERM (128 + 15). The container was told to stop and died on the signal instead of shutting down gracefully.
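
The codes above 128 in this list follow one convention: the exit code is 128 plus the number of the signal that ended the process. The sketch below decodes that convention with Python's standard signal module. It assumes a POSIX system (signal names differ on Windows), and the wording of the returned strings is illustrative.

```python
import signal


def describe_exit_code(code):
    """Translate a container exit code into a short triage hint."""
    if code == 0:
        return "exited cleanly"
    if code > 128:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Not a signal number known on this platform.
            name = f"signal {signum}"
        return f"killed by {name}"
    return "application error"


for code in (0, 1, 137, 139, 143):
    print(code, describe_exit_code(code))
# 0 exited cleanly
# 1 application error
# 137 killed by SIGKILL
# 139 killed by SIGSEGV
# 143 killed by SIGTERM
```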

Step 3: Retrieving the “Hidden” Logs

If a pod is crashing, kubectl logs <pod-name> often returns nothing because the current container has just started and hasn’t logged anything yet. You must check the logs of the previous failed instance.

kubectl logs <pod-name> --previous

If the logs show a connection timeout to a database, check your network policies or secret values. If you need a faster way to handle urgent fixes, you can use /tips/rapid-rollback-kubectl-set-image-for-urgent-fixes to revert to a known working image version.

Step 4: Debugging “Silent” Crashes with Ephemeral Containers

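Sometimes logs are empty and the exit code is vague. In Kubernetes v1.23+, you can use kubectl debug to attach an ephemeral container to the pod. With --target, the ephemeral container shares the process namespace of the named container, provided the container runtime supports it. Choose an image that ships the tools you need: busybox covers basics such as wget and nslookup, while curl, dig, or vim require a fuller debugging image.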

kubectl debug -it <pod-name> --image=busybox --target=<container-name>

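Once inside, you can check network connectivity directly. The ephemeral container has its own root filesystem, so to inspect the target container’s /tmp directory or to see if a config file was mounted incorrectly, go through /proc/<pid>/root, where <pid> is a process of the target container. This only works while that process is running and may require matching user permissions. For a wider array of helpful commands, refer to the /tips/kubectl-essential-commands guide.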

Prevention: Stopping the loop before it starts

Prevention is about shifting from “fixing” to “hardening”. Implement these four strategies to eliminate CrashLoopBackOff in production:

  1. Right-Size Resources: Use a Vertical Pod Autoscaler (VPA) in recommendation-only mode (updateMode: "Off") to find the actual memory usage of your app. Set your limits roughly 20% higher than your requests to handle spikes without triggering an OOMKill.

  2. Graceful Probes: Never set a livenessProbe with the same timing as your readinessProbe. Give your application an initialDelaySeconds that covers its worst-case startup time. If your app takes 20 seconds to boot, set the delay to 30 seconds.

  3. Robust Signal Handling: Ensure your application handles SIGTERM (Exit Code 143). If your app ignores this signal, Kubernetes will eventually force-kill it with SIGKILL (Exit Code 137) after the terminationGracePeriodSeconds expires.

  4. Better Observability: Integrate deep monitoring. If you are running AI workloads, implementing /tutorials/llm-observability-on-kubernetes-a-practical-guide will help you identify if crashes are caused by GPU memory exhaustion or model loading timeouts.
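
For the signal handling strategy, the core pattern is small: record that SIGTERM arrived, let the current unit of work finish, and return a clean exit code before terminationGracePeriodSeconds runs out. The following is a minimal sketch assuming a Python worker; the function names and the process_one callback are hypothetical, and a real service would also close connections and flush buffers before returning.

```python
import signal
import threading

# Set once a shutdown has been requested.
stop_requested = threading.Event()


def handle_sigterm(signum, frame):
    """Record the request; the main loop decides when it is safe to stop."""
    stop_requested.set()


def install_handlers():
    # Must be called from the main thread.
    signal.signal(signal.SIGTERM, handle_sigterm)


def serve(process_one, idle_seconds=0.1):
    """Run process_one() repeatedly until shutdown is requested,
    then return 0 so the container exits cleanly."""
    while not stop_requested.is_set():
        process_one()
        # Wakes up early if SIGTERM arrives while idle.
        stop_requested.wait(idle_seconds)
    return 0
```

In an entrypoint this would typically be wired up as install_handlers() followed by sys.exit(serve(...)). Keeping the handler down to setting a flag avoids doing real work inside signal context.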

FAQ

Why does my pod stay in CrashLoopBackOff even after I fix the ConfigMap?

Kubernetes uses an exponential backoff. If your pod has crashed multiple times, it might wait up to 5 minutes before the next restart attempt. You can force an immediate restart by deleting the pod: kubectl delete pod <pod-name>.

Is an Exit Code 137 always an OOMKill?

Almost always, but not exclusively. It means the process received a SIGKILL (Signal 9). While the kubelet usually sends this when a container exceeds its memory limit, it can also happen if an external process or the node’s OOM killer terminates the process.

How can I prevent a crashing pod from affecting other pods on the same node?

Define strict resources.limits. Without memory limits, a single leaking container can consume all node memory, triggering the Node-level OOM killer and causing unrelated pods to be evicted.

Conclusion and Next Steps

CrashLoopBackOff is a safety mechanism, not a bug. By isolating the exit code and inspecting previous logs, you can quickly determine if you are facing a configuration error, a resource constraint, or a dependency failure.

To further harden your production environment, take these next steps:

  • Audit your livenessProbes to ensure they aren’t too aggressive.
  • Deploy a Vertical Pod Autoscaler (VPA) to get data-driven memory limits.
  • Implement a structured logging pipeline (EFK/ELK) so you don’t have to rely on --previous logs during an incident.