How to Debug Kubernetes CrashLoopBackOff

How do I debug Kubernetes CrashLoopBackOff?

TL;DR

Constraints

Quick Reference

| # | Cause | Likelihood | Key Signal | Fix |
|---|-------|------------|------------|-----|
| 1 | Application error / crash | ~30% | Exit code 1; error in logs | Fix application code [src1, src3] |
| 2 | Missing/wrong env vars or config | ~20% | Exit 1; "env var not set" | Fix ConfigMap/Secret/env [src3, src4] |
| 3 | OOMKilled (exit 137) | ~15% | OOMKilled: true | Increase resources.limits.memory [src3, src5] |
| 4 | Liveness probe failure | ~10% | "Liveness probe failed" in events | Use startupProbe [src2, src3] |
| 5 | Command not found (exit 127) | ~5% | "exec format error" | Fix CMD/entrypoint [src4, src6] |
| 6 | Missing ConfigMap or Secret | ~5% | "CreateContainerConfigError" | Create missing resource [src3, src5] |
| 7 | Init container failure | ~4% | Init container crashing | kubectl logs -c init-name [src7] |
| 8 | Volume mount failure | ~4% | "MountVolume.SetUp failed" | Fix PVC/storage class [src3, src5] |
| 9 | Permission denied (exit 126) | ~3% | "Permission denied" in logs | Fix securityContext [src4] |
| 10 | Image pull issues then crash | ~2% | "ImagePullBackOff" | Fix image/registry [src1, src3] |
| 11 | Container exits successfully (exit 0) | ~2% | Repeated exit 0 | Use Job; or keep process running [src5, src6] |

Decision Tree

START — Pod shows CrashLoopBackOff
├── kubectl describe pod → Check "Last State" Exit Code
│   ├── Exit 0 → App exits but shouldn't → needs long-running process or Job [src6]
│   ├── Exit 1 → App error → check logs: kubectl logs --previous [src1, src3]
│   ├── Exit 126 → Permission denied → fix securityContext [src4]
│   ├── Exit 127 → Command not found → fix image/CMD; check arch [src4, src6]
│   ├── Exit 137 → OOMKilled? → increase memory limits [src3, src5]
│   └── Exit 143 → SIGTERM → check probe timing or preStop hook [src2]
├── Check Events section
│   ├── "Liveness probe failed" → fix probe config; use startupProbe [src2, src3]
│   ├── "MountVolume.SetUp failed" → fix PVC [src3]
│   └── "configmap/secret not found" → create resource [src3, src5]
├── Check Init Containers
│   └── Failing? → kubectl logs -c <init-name> [src7]
├── Check Sidecar Containers (K8s 1.29+)
│   └── Sidecar healthy but main crashing? → debug main container separately [src7]
└── kubectl logs --previous → find the error message [src1]
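
The exit-code branch of the tree can be sketched as a small lookup. This is a minimal illustration, not part of any Kubernetes tooling: the function name `classify_exit` and the hint strings are assumptions made for this example, and the mapping only restates the branches listed above.

```python
from typing import Optional

# Exit-code meanings restated from the decision tree above.
EXIT_CODE_HINTS = {
    0: "Process exited cleanly; run it as a Job or keep it in the foreground",
    1: "Application error; read `kubectl logs --previous`",
    126: "Permission denied; check securityContext",
    127: "Command not found; check image CMD/ENTRYPOINT and architecture",
    143: "SIGTERM; check probe timing or preStop hook",
}


def classify_exit(exit_code: Optional[int], reason: str = "") -> str:
    """Map a container's last exit code and reason to a suggested next step."""
    if exit_code is None:
        return "No previous termination recorded; check pod events"
    if exit_code == 137:
        # 137 means SIGKILL; only the reason tells OOM apart from other kills.
        if reason == "OOMKilled":
            return "OOMKilled; raise resources.limits.memory"
        return "SIGKILL without OOM; check liveness probe events"
    return EXIT_CODE_HINTS.get(
        exit_code, f"Unrecognized exit code {exit_code}; read the logs")


if __name__ == "__main__":
    print(classify_exit(137, "OOMKilled"))
    print(classify_exit(137, "Error"))
    print(classify_exit(2))
```

The two 137 cases are kept separate because the exit code alone does not say whether the kernel OOM killer or the kubelet ended the container.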

Step-by-Step Guide

1. Get the pod status and restart count

Identify which pods are in CrashLoopBackOff. [src1, src3]

kubectl get pods -A | grep CrashLoopBackOff
kubectl get pod <pod> -o jsonpath='{.status.containerStatuses[0].restartCount}'

2. Describe the pod for events and state

kubectl describe pod is the single most useful command here: it shows each container's Last State, its exit code, and the pod's recent events. [src1, src3, src5]

kubectl describe pod <pod> -n <ns>
kubectl get pod <pod> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'
kubectl get pod <pod> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
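
The jsonpath queries above read only `containerStatuses[0]`, so they miss other containers and init containers. As a rough sketch, the same fields can be pulled for every container from the output of `kubectl get pod <pod> -o json`. The helper name `last_terminations` and the sample pod are hypothetical; the field names follow the pod status structure already used in this guide.

```python
def last_terminations(pod: dict) -> list:
    """Return one record per container that has a previous termination."""
    status = pod.get("status", {})
    records = []
    # Init containers (including sidecars) are reported in their own list.
    for field in ("initContainerStatuses", "containerStatuses"):
        for cs in status.get(field) or []:
            terminated = (cs.get("lastState") or {}).get("terminated")
            if not terminated:
                continue  # this container has not crashed yet
            records.append({
                "container": cs.get("name"),
                "init": field == "initContainerStatuses",
                "restarts": cs.get("restartCount", 0),
                "exit_code": terminated.get("exitCode"),
                "reason": terminated.get("reason", ""),
            })
    return records


if __name__ == "__main__":
    # Hypothetical sample shaped like `kubectl get pod <pod> -o json`.
    sample = {
        "status": {
            "initContainerStatuses": [
                {"name": "wait-for-db", "restartCount": 0, "lastState": {}},
            ],
            "containerStatuses": [
                {"name": "api", "restartCount": 7, "lastState": {
                    "terminated": {"exitCode": 137, "reason": "OOMKilled"}}},
            ],
        }
    }
    for record in last_terminations(sample):
        print(record)
```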

3. Read the previous container's logs

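The current container instance may have only just started, so its log is often empty. The *previous* instance holds the crash output; --previous retrieves it. [src1, src3, src5]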

kubectl logs <pod> --previous
kubectl logs <pod> -c <container> --previous
kubectl logs <pod> -c <init-container-name>
kubectl logs <pod> -c <sidecar-name>  # K8s 1.29+

4. Fix OOMKilled (exit code 137)

Container exceeded its memory limit. [src3, src5]

resources:
  requests:
    memory: 256Mi
  limits:
    memory: 512Mi

# Compare actual usage against the limit (requires metrics-server)
kubectl top pod <pod>

5. Fix liveness probe failures

Use a startupProbe for slow starters: liveness checks do not begin until the startup probe has succeeded. [src2, src3]

startupProbe:
  httpGet:
    path: /health
    port: 8080
  periodSeconds: 10
  failureThreshold: 30    # 5 min startup allowance
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  periodSeconds: 15
  failureThreshold: 3
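
The "5 min startup allowance" comment follows from simple arithmetic: a probe tolerates roughly periodSeconds × failureThreshold seconds of consecutive failures, plus any initialDelaySeconds. The sketch below is an approximation that ignores timeoutSeconds; the function name `probe_budget_seconds` is made up for this example, and its defaults mirror the Kubernetes probe defaults (period 10s, threshold 3, no initial delay).

```python
def probe_budget_seconds(period_seconds: int = 10,
                         failure_threshold: int = 3,
                         initial_delay_seconds: int = 0) -> int:
    """Approximate seconds of failure a probe tolerates before it trips."""
    return initial_delay_seconds + period_seconds * failure_threshold


if __name__ == "__main__":
    # startupProbe from the example above: 10s x 30 failures
    print(probe_budget_seconds(10, 30))    # 300
    # livenessProbe from the example above: 15s x 3 failures
    print(probe_budget_seconds(15, 3))     # 45
    # An aggressive probe: 5s delay, 3s period, 3 failures
    print(probe_budget_seconds(3, 3, 5))   # 14
```

An application that needs 60 seconds to boot survives the first configuration and is killed by the last, which is why the startup budget should exceed the slowest observed start.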

6. Fix missing ConfigMap / Secret / env vars

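A missing ConfigMap or Secret keeps the container from starting at all (CreateContainerConfigError), while a missing or wrong env var typically crashes the app at startup with exit 1. [src3, src4, src5]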

kubectl get configmap -n <ns>
kubectl get secret -n <ns>
kubectl get pod <pod> -o jsonpath='{.spec.containers[0].env}'
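
Comparing the two listings against the pod spec by eye is error-prone. One possible way to automate the comparison is sketched below. It assumes the pod spec comes from `kubectl get pod <pod> -o json` (the `.spec` object) and that the names of existing ConfigMaps and Secrets were collected separately. The helper names are hypothetical, and references marked `optional: true` are not treated specially.

```python
def _add(names: set, ref) -> None:
    """Record ref['name'] when the reference carries one."""
    name = (ref or {}).get("name")
    if name:
        names.add(name)


def referenced_config(pod_spec: dict) -> tuple:
    """Collect the ConfigMap and Secret names a pod spec refers to."""
    configmaps, secrets = set(), set()
    containers = ((pod_spec.get("initContainers") or [])
                  + (pod_spec.get("containers") or []))
    for container in containers:
        for source in container.get("envFrom") or []:
            _add(configmaps, source.get("configMapRef"))
            _add(secrets, source.get("secretRef"))
        for env in container.get("env") or []:
            value_from = env.get("valueFrom") or {}
            _add(configmaps, value_from.get("configMapKeyRef"))
            _add(secrets, value_from.get("secretKeyRef"))
    for volume in pod_spec.get("volumes") or []:
        _add(configmaps, volume.get("configMap"))
        # Secret volumes name their source with `secretName`, not `name`.
        secret_name = (volume.get("secret") or {}).get("secretName")
        if secret_name:
            secrets.add(secret_name)
    return configmaps, secrets


def missing_config(pod_spec: dict, existing_configmaps, existing_secrets) -> dict:
    """Return referenced ConfigMaps/Secrets that are absent from the namespace."""
    configmaps, secrets = referenced_config(pod_spec)
    return {
        "configmaps": sorted(configmaps - set(existing_configmaps)),
        "secrets": sorted(secrets - set(existing_secrets)),
    }


if __name__ == "__main__":
    spec = {"containers": [{"name": "api", "envFrom": [
        {"configMapRef": {"name": "api-config"}},
        {"secretRef": {"name": "api-secrets"}},
    ]}]}
    print(missing_config(spec, ["api-config"], []))
```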

7. Debug with ephemeral containers

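When the logs are not enough, attach an ephemeral debug container (GA since K8s 1.25). kubectl exec works only while the target container is running, so it often fails on a crash-looping pod. [src1, src4]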

kubectl debug -it <pod> --image=busybox --target=<container>
kubectl exec -it <pod> -- /bin/sh
kubectl get events -n <ns> --sort-by='.lastTimestamp' | tail -20

Code Examples

Comprehensive CrashLoopBackOff diagnostic script

#!/bin/bash
POD="$1"; NS="${2:-default}"
if [ -z "$POD" ]; then
    echo "Usage: $0 <pod> [namespace]"
    kubectl get pods -A | grep CrashLoopBackOff; exit 1
fi

echo "=== CrashLoopBackOff Diagnostic: $POD (ns: $NS) ==="
RESTARTS=$(kubectl get pod "$POD" -n "$NS" -o jsonpath='{.status.containerStatuses[0].restartCount}')
EXIT=$(kubectl get pod "$POD" -n "$NS" -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}')
REASON=$(kubectl get pod "$POD" -n "$NS" -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}')
echo "Restarts: $RESTARTS | Exit: $EXIT | Reason: $REASON"

# Translate the last exit code into a likely cause
case "$EXIT" in
    0)   echo "Exit 0: completed (shouldn't for services)" ;;
    1)   echo "Exit 1: Application error" ;;
    126) echo "Exit 126: Permission denied" ;;
    127) echo "Exit 127: Command not found" ;;
    137) echo "Exit 137: $( [ "$REASON" = "OOMKilled" ] && echo "OOMKilled" || echo "SIGKILL")" ;;
    143) echo "Exit 143: SIGTERM" ;;
    "")  echo "No previous termination recorded" ;;
    *)   echo "Exit $EXIT: see previous logs below" ;;
esac

echo "=== Resources ==="
kubectl get pod "$POD" -n "$NS" -o jsonpath='{range .spec.containers[*]}{.name}: req={.resources.requests.memory} lim={.resources.limits.memory}{"\n"}{end}'
kubectl top pod "$POD" -n "$NS" 2>/dev/null || echo "(metrics unavailable)"

echo "=== Events ==="
kubectl get events -n "$NS" --field-selector "involvedObject.name=$POD" --sort-by='.lastTimestamp' | tail -10

echo "=== Previous Logs (30 lines) ==="
kubectl logs "$POD" -n "$NS" --previous --tail=30 2>/dev/null || echo "(none)"

Production-ready pod with proper probes and resources

apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api-server
  template:
    metadata:
      labels:
        app: api-server
    spec:
      initContainers:
        - name: wait-for-db
          image: busybox:1.36
          command: ['sh', '-c', 'until nc -zv db-service 5432; do sleep 2; done']
          resources:
            limits: { cpu: 50m, memory: 32Mi }
        - name: log-shipper      # Sidecar (K8s 1.29+)
          image: fluentbit:2.2
          restartPolicy: Always   # Survives main container restarts
          resources:
            requests: { cpu: 50m, memory: 64Mi }
            limits: { cpu: 100m, memory: 128Mi }
      containers:
        - name: api
          image: myapp:1.2.3
          ports: [{ containerPort: 8080 }]
          envFrom:
            - configMapRef: { name: api-config }
            - secretRef: { name: api-secrets }
          resources:
            requests: { cpu: 100m, memory: 256Mi }
            limits: { cpu: 500m, memory: 512Mi }
          startupProbe:
            httpGet: { path: /health, port: 8080 }
            periodSeconds: 10
            failureThreshold: 30
          livenessProbe:
            httpGet: { path: /health, port: 8080 }
            periodSeconds: 15
            failureThreshold: 3
          readinessProbe:
            httpGet: { path: /ready, port: 8080 }
            periodSeconds: 10
      terminationGracePeriodSeconds: 30

Automated CrashLoopBackOff watcher

#!/usr/bin/env python3
import subprocess, json, time
from datetime import datetime

def get_crash_pods():
    result = subprocess.run(
        "kubectl get pods -A -o json".split(), capture_output=True, text=True)
    if result.returncode != 0: return []
    pods = json.loads(result.stdout)
    crash = []
    for pod in pods.get("items", []):
        for cs in pod.get("status", {}).get("containerStatuses", []):
            w = cs.get("state", {}).get("waiting", {})
            if w.get("reason") == "CrashLoopBackOff":
                last = cs.get("lastState", {}).get("terminated", {})
                crash.append({
                    "name": pod["metadata"]["name"],
                    "ns": pod["metadata"]["namespace"],
                    "restarts": cs.get("restartCount", 0),
                    "exit": last.get("exitCode"),
                    "reason": last.get("reason", "Unknown"),
                })
    return crash

def monitor(interval=30):
    seen = set()
    while True:
        for p in get_crash_pods():
            key = f"{p['ns']}/{p['name']}"
            if key not in seen:
                seen.add(key)
                print(f"[{datetime.now():%H:%M:%S}] ALERT {key} "
                      f"exit={p['exit']} ({p['reason']}) restarts={p['restarts']}")
        time.sleep(interval)

if __name__ == "__main__":
    monitor()

Anti-Patterns

Wrong: No resource limits

# BAD — unlimited resources; OOM kills other pods [src3, src5]
containers:
  - name: app
    image: myapp:latest

Correct: Always set resource requests and limits

# GOOD — predictable; scheduler places properly [src3, src5]
containers:
  - name: app
    image: myapp:1.2.3
    resources:
      requests: { cpu: 100m, memory: 256Mi }
      limits: { cpu: 500m, memory: 512Mi }

Wrong: Aggressive liveness probe on slow-starting app

# BAD — kills Spring Boot during startup [src2, src3]
livenessProbe:
  httpGet: { path: /health, port: 8080 }
  initialDelaySeconds: 5
  periodSeconds: 3
  failureThreshold: 3

Correct: Use startupProbe for slow starters

# GOOD — startupProbe allows up to 5 min init [src2]
startupProbe:
  httpGet: { path: /health, port: 8080 }
  periodSeconds: 10
  failureThreshold: 30
livenessProbe:
  httpGet: { path: /health, port: 8080 }
  periodSeconds: 15

Wrong: Using :latest tag

# BAD — unpredictable; hard to rollback [src3, src6]
containers:
  - name: app
    image: myapp:latest
    imagePullPolicy: Always

Correct: Pin specific image version

# GOOD — reproducible; easy rollback [src3, src6]
containers:
  - name: app
    image: myapp:1.2.3
    imagePullPolicy: IfNotPresent

Wrong: Sidecar as regular container (K8s 1.29+)

# BAD: no startup or shutdown ordering relative to the app; a regular sidecar also keeps a Job pod from completing [src7]
containers:
  - name: app
    image: myapp:1.2.3
  - name: log-shipper
    image: fluentbit:2.2

Correct: Sidecar as init container with restartPolicy: Always

# GOOD: sidecar starts before the app, keeps running across app restarts, and is stopped after it [src7, src8]
initContainers:
  - name: log-shipper
    image: fluentbit:2.2
    restartPolicy: Always
containers:
  - name: app
    image: myapp:1.2.3

Decision Logic

If kubectl describe pod shows Reason: OOMKilled (exit 137)

The container hit its resources.limits.memory ceiling. Profile with kubectl top pod first, then raise the memory limit (typically 1.5–2× observed working set). Do NOT just remove the limit — unbounded pods evict their neighbors. [src3, src5, src9]
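
The 1.5–2× rule of thumb is easy to apply to the figure reported by kubectl top pod. The following is a minimal sketch with hypothetical helper names (`parse_memory`, `suggest_limit_mi`); it handles only the common binary and decimal suffixes, not every Kubernetes quantity form.

```python
import math

# Common memory suffixes; exotic forms (e.g. exponents, "m") are not handled.
_UNITS = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
}


def parse_memory(quantity: str) -> int:
    """Convert a memory quantity such as '512Mi' to bytes."""
    text = quantity.strip()
    # Try two-letter suffixes first so 'Mi' is not mistaken for another unit.
    for suffix in sorted(_UNITS, key=len, reverse=True):
        if text.endswith(suffix):
            return int(float(text[:-len(suffix)]) * _UNITS[suffix])
    return int(float(text))  # plain byte count


def suggest_limit_mi(working_set: str, factor: float = 1.5) -> int:
    """Suggest a memory limit in Mi as `factor` times the observed working set."""
    return math.ceil(parse_memory(working_set) * factor / 2**20)


if __name__ == "__main__":
    print(suggest_limit_mi("400Mi"))        # 600
    print(suggest_limit_mi("400Mi", 2.0))   # 800
```

Sample the working set under realistic load before applying the factor; a figure taken right after startup usually understates steady-state usage.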

If exit code is 1 AND kubectl logs --previous shows a stack trace

This is an application-level bug, not an orchestration problem. Fix the code path, not the pod spec. Common: unhandled exception during init, missing config value, schema migration not run. [src1, src3, src11]

If exit code is 0 AND restartPolicy: Always

The container completed and the controller keeps restarting it. Either (a) the entrypoint is a one-shot task — convert the Deployment to a Job/CronJob; or (b) the long-running process exited cleanly — keep it in the foreground (no daemonizing, no &). [src5, src6, src9]

If exit code is 137 AND Reason is NOT OOMKilled

The kubelet killed it (usually a failed liveness probe with SIGKILL after grace period). Check kubectl describe pod Events for “Liveness probe failed”. Fix: add startupProbe for slow starters, raise failureThreshold, or relax periodSeconds. [src2, src3]

If exit code is 127 (“command not found”) OR “exec format error”

Either the CMD/ENTRYPOINT references a binary that isn’t in the image, or the image architecture does not match the node (for example, an amd64 image scheduled on an arm64 node). Run kubectl describe pod to find the node the pod landed on, then kubectl get node <node> -o jsonpath='{.status.nodeInfo.architecture}' to confirm its architecture. Then either rebuild as a multi-arch image or add nodeSelector: kubernetes.io/arch: amd64. [src4, src6]

If you have a sidecar container AND main container keeps crashing

K8s 1.29+ (GA in 1.33): sidecar containers (init containers with restartPolicy: Always) survive main container restarts and have separate lifecycle. The pod can appear partially ready while the main container is in CrashLoopBackOff. Debug each container individually with kubectl logs -c <name>; don’t assume pod-level signals reflect the main app. [src1, src7, src10]

If you’re on K8s 1.33+ AND the 5-minute backoff is slowing down recovery

Enable the ReduceDefaultCrashLoopBackOffDecay alpha feature gate on the kubelet (--feature-gates=ReduceDefaultCrashLoopBackOffDecay=true). This caps backoff at 60s instead of 300s, with a 1s initial delay. Per-node override is available via KubeletCrashLoopBackOffMax (K8s 1.32+ alpha). Both are alpha — do not enable in production without testing. [src8, src10]
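
To see what the change means in practice, the restart delays can be tabulated. The sketch below assumes the delay doubles after each crash until it reaches the cap, which matches the documented default of 10s, 20s, 40s and so on up to 300s. The initial and maximum values for the reduced schedule are the ones quoted above, and the function name `backoff_schedule` is illustrative.

```python
def backoff_schedule(initial: float, cap: float, restarts: int) -> list:
    """Return the delay before each of the first `restarts` restarts."""
    delays, delay = [], initial
    for _ in range(restarts):
        delays.append(min(delay, cap))
        delay *= 2  # assumed doubling after every crash
    return delays


if __name__ == "__main__":
    default = backoff_schedule(10, 300, 7)
    reduced = backoff_schedule(1, 60, 8)
    print(default, sum(default))   # [10, 20, 40, 80, 160, 300, 300] 910
    print(reduced, sum(reduced))   # [1, 2, 4, 8, 16, 32, 60, 60] 183
```

Under these assumptions a pod waits about 15 minutes in total across its first seven restarts with the default schedule, versus about 3 minutes across eight restarts with the reduced one.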

Common Pitfalls

Diagnostic Commands

# === Find CrashLoopBackOff pods ===
kubectl get pods -A | grep CrashLoopBackOff

# === Pod details ===
kubectl describe pod <pod> -n <ns>
kubectl get pod <pod> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'

# === Logs ===
kubectl logs <pod> --previous
kubectl logs <pod> -c <container> --previous
kubectl logs <pod> -c <init-container>
kubectl logs <pod> -c <sidecar>

# === Events ===
kubectl get events -n <ns> --sort-by='.lastTimestamp' | tail -20

# === Resources ===
kubectl top pod <pod>
kubectl top nodes

# === Debugging ===
kubectl debug -it <pod> --image=busybox --target=<container>
kubectl exec -it <pod> -- /bin/sh

# === Config ===
kubectl get configmap -n <ns>
kubectl get secret -n <ns>

# === Node health ===
kubectl describe node <node> | grep -A5 "Conditions"

Version History & Compatibility

| Version | Behavior | Key Changes |
|---------|----------|-------------|
| K8s 1.34 (Aug 2025) | Current | ContainerRestartRules alpha (KEP-5307): per-container restartPolicyRules matching exit codes; restart in place even when pod-level restartPolicy: Never. Targets AI/ML batch jobs on GPUs [src10] |
| K8s 1.33 (Apr 2025) | Stable | Sidecar containers GA; ReduceDefaultCrashLoopBackOffDecay alpha: initial 1s, max 60s backoff (down from 10s/300s) [src10, src11] |
| K8s 1.32 (Dec 2024) | Stable | KubeletCrashLoopBackOffMax alpha: configurable max backoff delay (1s-300s per node) [src8] |
| K8s 1.29 (Dec 2023) | Stable | Sidecar containers beta; better init container lifecycle [src1, src7] |
| K8s 1.28 | Stable | Sidecar alpha; improved probe logging [src1] |
| K8s 1.25 | Ephemeral GA | Ephemeral debug containers GA [src1] |
| K8s 1.23 | Debug beta | kubectl debug beta [src1] |
| K8s 1.20 | startupProbe GA | startupProbe graduated to stable [src2] |
| K8s 1.18 | startupProbe beta | startupProbe for slow-starting containers [src2] |
| K8s 1.16 | Stable probes | Liveness/readiness stable in all workloads [src2] |

When to Use / When Not to Use

| Use When | Don't Use When | Use Instead |
|----------|----------------|-------------|
| Pod shows CrashLoopBackOff | Pod stuck in Pending | Debug scheduling: kubectl describe pod |
| Container keeps restarting | Pod in ImagePullBackOff | Fix image name/registry |
| Exit code is non-zero | Pod running but not ready | Debug readiness probe |
| Events show probe failures | Container is Evicted | Check node pressure |

Important Caveats