How to Debug Kubernetes CrashLoopBackOff
How do I debug Kubernetes CrashLoopBackOff?
TL;DR
- Bottom line: CrashLoopBackOff means a container in a pod keeps crashing and Kubernetes keeps restarting it with increasing delays (10s, 20s, 40s, ... up to a 5 min maximum). It is not an error in itself; it is Kubernetes telling you the container can't stay running. The root cause is almost always in the container or its configuration.
- Key tool/command: kubectl describe pod <pod> shows events, exit codes, and probe config. kubectl logs <pod> --previous shows the crashed container's last output. These two commands resolve most cases.
- Watch out for: Liveness probes with too-short initialDelaySeconds. Kubernetes kills the container before it finishes starting. Use startupProbe for slow starters.
- Works with: All Kubernetes versions (1.20+). Same concepts apply to OpenShift, EKS, GKE, AKS, k3s, minikube. K8s 1.32+ adds configurable max backoff via the KubeletCrashLoopBackOffMax alpha feature gate. K8s 1.33+ (Apr 2025) reduces default backoff to 1s initial / 60s max under the ReduceDefaultCrashLoopBackOffDecay alpha feature gate, and graduates sidecar containers to GA.
Constraints
- Capture before deleting: Never delete a CrashLoopBackOff pod without first running kubectl logs --previous and kubectl describe pod. Deletion permanently destroys crash logs and events. [src1]
- Logs are ephemeral: kubectl logs --previous only retrieves the last crash. Earlier crash logs require centralized logging (ELK, Loki, Datadog). [src1, src3]
- startupProbe blocks other probes: When a startupProbe is configured, liveness and readiness probes are disabled until it succeeds. If the startup probe never passes, the kubelet kills the container once failureThreshold is reached and the pod enters CrashLoopBackOff with no liveness/readiness evaluation. [src2]
- Memory limits are hard cgroup enforcement: Setting resources.limits.memory below an app's baseline working set guarantees OOMKill (exit 137). Always profile memory usage with kubectl top pod before setting limits. [src3, src5]
- Init containers block startup: If an init container fails, the kubelet retries that init container until it succeeds (or fails the pod when restartPolicy is Never). Later init containers and the app containers do not start until it does. All init containers re-run only when the pod itself restarts. [src7]
- Sidecar containers have a separate lifecycle: Sidecar containers (init containers with restartPolicy: Always; beta and enabled by default since K8s 1.29, GA in 1.33) survive main container restarts. Debug them separately with kubectl logs -c <sidecar-name>. [src1, src7]
Quick Reference
| # | Cause | Likelihood | Key Signal | Fix |
|---|---|---|---|---|
| 1 | Application error / crash | ~30% | Exit code 1; error in logs | Fix application code [src1, src3] |
| 2 | Missing/wrong env vars or config | ~20% | Exit 1; "env var not set" | Fix ConfigMap/Secret/env [src3, src4] |
| 3 | OOMKilled (exit 137) | ~15% | OOMKilled: true | Increase resources.limits.memory [src3, src5] |
| 4 | Liveness probe failure | ~10% | "Liveness probe failed" in events | Use startupProbe [src2, src3] |
| 5 | Command not found (exit 127) | ~5% | "command not found" or "exec format error" | Fix CMD/entrypoint; check image architecture [src4, src6] |
| 6 | Missing ConfigMap or Secret | ~5% | "CreateContainerConfigError" | Create missing resource [src3, src5] |
| 7 | Init container failure | ~4% | Init container crashing | kubectl logs -c init-name [src7] |
| 8 | Volume mount failure | ~4% | "MountVolume.SetUp failed" | Fix PVC/storage class [src3, src5] |
| 9 | Permission denied (exit 126) | ~3% | "Permission denied" in logs | Fix securityContext [src4] |
| 10 | Image pull issues then crash | ~2% | "ImagePullBackOff" | Fix image/registry [src1, src3] |
| 11 | Container exits successfully (exit 0) | ~2% | Repeated exit 0 | Use Job; or keep process running [src5, src6] |
Decision Tree
START — Pod shows CrashLoopBackOff
├── kubectl describe pod → Check "Last State" Exit Code
│ ├── Exit 0 → App exits but shouldn't → needs long-running process or Job [src6]
│ ├── Exit 1 → App error → check logs: kubectl logs --previous [src1, src3]
│ ├── Exit 126 → Permission denied → fix securityContext [src4]
│ ├── Exit 127 → Command not found → fix image/CMD; check arch [src4, src6]
│ ├── Exit 137 → OOMKilled? → increase memory limits [src3, src5]
│ └── Exit 143 → SIGTERM → check probe timing or preStop hook [src2]
├── Check Events section
│ ├── "Liveness probe failed" → fix probe config; use startupProbe [src2, src3]
│ ├── "MountVolume.SetUp failed" → fix PVC [src3]
│ └── "configmap/secret not found" → create resource [src3, src5]
├── Check Init Containers
│ └── Failing? → kubectl logs -c <init-name> [src7]
├── Check Sidecar Containers (K8s 1.29+)
│ └── Sidecar healthy but main crashing? → debug main container separately [src7]
└── kubectl logs --previous → find the error message [src1]
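The exit-code branch of the tree can be written as a small lookup. The following is only a sketch: the function name `classify_exit` and the hint strings are illustrative, and the mapping simply mirrors the branches listed above rather than any official Kubernetes table.

```python
from typing import Optional

# First-guess hints, copied from the exit-code branches of the decision tree.
EXIT_CODE_HINTS = {
    0: "Container exited cleanly; use a Job or keep the process running",
    1: "Application error; read kubectl logs --previous",
    126: "Permission denied; fix securityContext",
    127: "Command not found; fix image/CMD and check architecture",
    143: "SIGTERM; check probe timing or the preStop hook",
}


def classify_exit(exit_code: Optional[int], reason: str = "") -> str:
    """Return a first-guess diagnosis for a terminated container."""
    if exit_code is None:
        return "No previous termination recorded; check the Events section"
    if exit_code == 137:
        # 137 = 128 + 9 (SIGKILL). Only the reason field separates an OOM
        # kill from a kubelet kill such as a failed liveness probe.
        if reason == "OOMKilled":
            return "OOMKilled; increase memory limits"
        return "SIGKILL; look for 'Liveness probe failed' in events"
    return EXIT_CODE_HINTS.get(
        exit_code, f"Unrecognized exit code {exit_code}; read the logs")
```

Feed it the `exitCode` and `reason` values from the pod's last terminated state. Treat the result as a starting point for reading logs and events, not as a verdict.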
Step-by-Step Guide
1. Get the pod status and restart count
Identify which pods are in CrashLoopBackOff. [src1, src3]
kubectl get pods -A | grep CrashLoopBackOff
kubectl get pod <pod> -o jsonpath='{.status.containerStatuses[0].restartCount}'
2. Describe the pod for events and state
The single most important debugging command. [src1, src3, src5]
kubectl describe pod <pod> -n <ns>
kubectl get pod <pod> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'
kubectl get pod <pod> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
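The jsonpath queries above only look at `containerStatuses[0]`, which misses init containers and additional containers in multi-container pods. As a rough sketch, the helper below (the name `last_terminations` is hypothetical) walks a pod object parsed from `kubectl get pod <pod> -o json` and reports the last termination of every container that has one.

```python
from typing import Any, Dict, List


def last_terminations(pod: Dict[str, Any]) -> List[Dict[str, Any]]:
    """List the last terminated state of every container in a pod object.

    `pod` is the parsed JSON output of `kubectl get pod <pod> -o json`.
    """
    status = pod.get("status", {})
    rows = []
    for field in ("initContainerStatuses", "containerStatuses"):
        for cs in status.get(field, []):
            term = cs.get("lastState", {}).get("terminated")
            if not term:
                continue  # this container has no recorded crash
            rows.append({
                "container": cs.get("name"),
                "kind": "init" if field.startswith("init") else "app",
                "restarts": cs.get("restartCount", 0),
                "exit_code": term.get("exitCode"),
                "reason": term.get("reason"),
            })
    return rows
```

One way to use it is to pipe the kubectl output through `json.load(sys.stdin)` and print each row; containers without a `lastState.terminated` entry are skipped.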
3. Read the previous container's logs
The *previous* container has the crash output. [src1, src3, src5]
kubectl logs <pod> --previous
kubectl logs <pod> -c <container> --previous
kubectl logs <pod> -c <init-container-name>
kubectl logs <pod> -c <sidecar-name> # K8s 1.29+
4. Fix OOMKilled (exit code 137)
Container exceeded its memory limit. [src3, src5]
resources:
  requests:
    memory: 256Mi
  limits:
    memory: 512Mi
kubectl top pod <pod>
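To turn a `kubectl top pod` reading into a candidate limit, a common rule of thumb (also used in the Decision Logic section) is 1.5 to 2 times the observed working set. The sketch below assumes the reading is a Kubernetes memory quantity such as `340Mi`; the helper names `parse_memory` and `suggest_limit_mi` are illustrative, and only the common suffixes are handled.

```python
# Binary and decimal suffixes used by Kubernetes memory quantities.
_UNITS = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
}


def parse_memory(quantity: str) -> int:
    """Convert a quantity such as '256Mi' to bytes; bare integers pass through."""
    # Try two-character suffixes first so 'Mi' is not mistaken for 'M'.
    for suffix in sorted(_UNITS, key=len, reverse=True):
        if quantity.endswith(suffix):
            return int(float(quantity[: -len(suffix)]) * _UNITS[suffix])
    return int(quantity)


def suggest_limit_mi(working_set: str, headroom: float = 1.5) -> int:
    """Suggest a memory limit in Mi: working set times headroom, rounded up."""
    needed = parse_memory(working_set) * headroom
    return -int(-needed // 2**20)  # ceiling division to whole Mi
```

For a pod that `kubectl top pod` reports at `340Mi`, `suggest_limit_mi("340Mi")` gives 510 and `suggest_limit_mi("340Mi", 2.0)` gives 680. The headroom factor is a judgment call; measure under realistic load before settling on a value.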
5. Fix liveness probe failures
Use startupProbe for slow starters. [src2, src3]
startupProbe:
  httpGet:
    path: /health
    port: 8080
  periodSeconds: 10
  failureThreshold: 30   # 5 min startup allowance
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  periodSeconds: 15
  failureThreshold: 3
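The "5 min startup allowance" comes from multiplying `periodSeconds` by `failureThreshold`. The sketch below makes that arithmetic explicit; `probe_budget_seconds` is a hypothetical helper, and the result is an approximation that ignores `timeoutSeconds` and scheduling jitter.

```python
def probe_budget_seconds(period_seconds: int = 10,
                         failure_threshold: int = 3,
                         initial_delay_seconds: int = 0) -> int:
    """Approximate how long a container has before a probe gives up on it.

    Defaults match the Kubernetes probe defaults (period 10s, threshold 3,
    no initial delay).
    """
    return initial_delay_seconds + period_seconds * failure_threshold
```

With the startupProbe above, `probe_budget_seconds(10, 30)` is 300 seconds. A liveness probe with `initialDelaySeconds: 5`, `periodSeconds: 3`, and `failureThreshold: 3` gets only 14 seconds, well short of what a slow-starting JVM service typically needs.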
6. Fix missing ConfigMap / Secret / env vars
Missing configuration causes immediate crashes. [src3, src4, src5]
kubectl get configmap -n <ns>
kubectl get secret -n <ns>
kubectl get pod <pod> -o jsonpath='{.spec.containers[0].env}'
7. Debug with ephemeral containers
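Use these when the logs don't show enough. Ephemeral containers are GA since K8s 1.25. [src1, src4]
# Attach a debug container that shares the target container's process namespace
kubectl debug -it <pod> --image=busybox --target=<container>
# exec only works while the container is up, so it often fails mid-backoff
kubectl exec -it <pod> -- /bin/sh
kubectl get events -n <ns> --sort-by='.lastTimestamp' | tail -20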
Code Examples
Comprehensive CrashLoopBackOff diagnostic script
#!/bin/bash
# Usage: ./diagnose.sh <pod> [namespace]
# Note: only the first container's status is inspected.
POD="$1"; NS="${2:-default}"
if [ -z "$POD" ]; then
  echo "Usage: $0 <pod> [namespace]"
  kubectl get pods -A | grep CrashLoopBackOff; exit 1
fi
echo "=== CrashLoopBackOff Diagnostic: $POD (ns: $NS) ==="
RESTARTS=$(kubectl get pod "$POD" -n "$NS" -o jsonpath='{.status.containerStatuses[0].restartCount}')
EXIT=$(kubectl get pod "$POD" -n "$NS" -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}')
REASON=$(kubectl get pod "$POD" -n "$NS" -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}')
echo "Restarts: $RESTARTS | Exit: $EXIT | Reason: $REASON"
case "$EXIT" in
  0)   echo "Exit 0: completed (shouldn't for services)" ;;
  1)   echo "Exit 1: Application error" ;;
  126) echo "Exit 126: Permission denied" ;;
  127) echo "Exit 127: Command not found" ;;
  137) echo "Exit 137: $( [ "$REASON" = "OOMKilled" ] && echo "OOMKilled" || echo "SIGKILL")" ;;
  143) echo "Exit 143: SIGTERM" ;;
  *)   echo "Exit ${EXIT:-unknown}: see events and logs below" ;;
esac
echo "=== Resources ==="
kubectl get pod "$POD" -n "$NS" -o jsonpath='{range .spec.containers[*]}{.name}: req={.resources.requests.memory} lim={.resources.limits.memory}{"\n"}{end}'
kubectl top pod "$POD" -n "$NS" 2>/dev/null || echo "(metrics unavailable)"
echo "=== Events ==="
kubectl get events -n "$NS" --field-selector "involvedObject.name=$POD" --sort-by='.lastTimestamp' | tail -10
echo "=== Previous Logs (30 lines) ==="
kubectl logs "$POD" -n "$NS" --previous --tail=30 2>/dev/null || echo "(none)"
Production-ready pod with proper probes and resources
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api-server
  template:
    metadata:
      labels:
        app: api-server
    spec:
      initContainers:
        - name: wait-for-db
          image: busybox:1.36
          command: ['sh', '-c', 'until nc -zv db-service 5432; do sleep 2; done']
          resources:
            limits: { cpu: 50m, memory: 32Mi }
        - name: log-shipper          # Sidecar (K8s 1.29+)
          image: fluentbit:2.2
          restartPolicy: Always      # Survives main container restarts
          resources:
            requests: { cpu: 50m, memory: 64Mi }
            limits: { cpu: 100m, memory: 128Mi }
      containers:
        - name: api
          image: myapp:1.2.3
          ports: [{ containerPort: 8080 }]
          envFrom:
            - configMapRef: { name: api-config }
            - secretRef: { name: api-secrets }
          resources:
            requests: { cpu: 100m, memory: 256Mi }
            limits: { cpu: 500m, memory: 512Mi }
          startupProbe:
            httpGet: { path: /health, port: 8080 }
            periodSeconds: 10
            failureThreshold: 30
          livenessProbe:
            httpGet: { path: /health, port: 8080 }
            periodSeconds: 15
            failureThreshold: 3
          readinessProbe:
            httpGet: { path: /ready, port: 8080 }
            periodSeconds: 10
      terminationGracePeriodSeconds: 30
Automated CrashLoopBackOff watcher
#!/usr/bin/env python3
import subprocess, json, time
from datetime import datetime


def get_crash_pods():
    result = subprocess.run(
        "kubectl get pods -A -o json".split(), capture_output=True, text=True)
    if result.returncode != 0:
        return []
    pods = json.loads(result.stdout)
    crash = []
    for pod in pods.get("items", []):
        for cs in pod.get("status", {}).get("containerStatuses", []):
            w = cs.get("state", {}).get("waiting", {})
            if w.get("reason") == "CrashLoopBackOff":
                last = cs.get("lastState", {}).get("terminated", {})
                crash.append({
                    "name": pod["metadata"]["name"],
                    "ns": pod["metadata"]["namespace"],
                    "restarts": cs.get("restartCount", 0),
                    "exit": last.get("exitCode"),
                    "reason": last.get("reason", "Unknown"),
                })
    return crash


def monitor(interval=30):
    seen = set()
    while True:
        for p in get_crash_pods():
            key = f"{p['ns']}/{p['name']}"
            if key not in seen:
                seen.add(key)
                print(f"[{datetime.now():%H:%M:%S}] ALERT {key} "
                      f"exit={p['exit']} ({p['reason']}) restarts={p['restarts']}")
        time.sleep(interval)


if __name__ == "__main__":
    monitor()
Anti-Patterns
Wrong: No resource limits
# BAD: unlimited resources; can OOM-kill other pods [src3, src5]
containers:
  - name: app
    image: myapp:latest
Correct: Always set resource requests and limits
# GOOD: predictable; scheduler places properly [src3, src5]
containers:
  - name: app
    image: myapp:1.2.3
    resources:
      requests: { cpu: 100m, memory: 256Mi }
      limits: { cpu: 500m, memory: 512Mi }
Wrong: Aggressive liveness probe on slow-starting app
# BAD: kills Spring Boot during startup [src2, src3]
livenessProbe:
  httpGet: { path: /health, port: 8080 }
  initialDelaySeconds: 5
  periodSeconds: 3
  failureThreshold: 3
Correct: Use startupProbe for slow starters
# GOOD: startupProbe allows up to 5 min init [src2]
startupProbe:
  httpGet: { path: /health, port: 8080 }
  periodSeconds: 10
  failureThreshold: 30
livenessProbe:
  httpGet: { path: /health, port: 8080 }
  periodSeconds: 15
Wrong: Using :latest tag
# BAD: unpredictable; hard to roll back [src3, src6]
containers:
  - name: app
    image: myapp:latest
    imagePullPolicy: Always
Correct: Pin specific image version
# GOOD: reproducible; easy rollback [src3, src6]
containers:
  - name: app
    image: myapp:1.2.3
    imagePullPolicy: IfNotPresent
Wrong: Sidecar as regular container (K8s 1.29+)
# BAD: as a regular container the sidecar has no start/stop ordering relative to the app [src7]
containers:
  - name: app
    image: myapp:1.2.3
  - name: log-shipper
    image: fluentbit:2.2
Correct: Sidecar as init container with restartPolicy: Always
# GOOD: sidecar starts before the app, survives main container restarts, and stops after it [src7, src8]
initContainers:
  - name: log-shipper
    image: fluentbit:2.2
    restartPolicy: Always
containers:
  - name: app
    image: myapp:1.2.3
Decision Logic
If kubectl describe pod shows Reason: OOMKilled (exit 137)
The container hit its resources.limits.memory ceiling. Profile with kubectl top pod first, then raise the memory limit (typically 1.5–2× observed working set). Do NOT just remove the limit: unbounded pods evict their neighbors. [src3, src5, src9]
If exit code is 1 AND kubectl logs --previous shows a stack trace
This is an application-level bug, not an orchestration problem. Fix the code path, not the pod spec. Common: unhandled exception during init, missing config value, schema migration not run. [src1, src3, src11]
If exit code is 0 AND restartPolicy: Always
The container completed and the kubelet keeps restarting it. Either (a) the entrypoint is a one-shot task, in which case convert the Deployment to a Job/CronJob; or (b) the long-running process exited cleanly, in which case keep it in the foreground (no daemonizing, no &). [src5, src6, src9]
If exit code is 137 AND Reason is NOT OOMKilled
Something other than the OOM killer sent SIGKILL, usually the kubelet after a failed liveness probe once the termination grace period expired. Check kubectl describe pod Events for “Liveness probe failed”. Fix: add startupProbe for slow starters, raise failureThreshold, or relax periodSeconds. [src2, src3]
If exit code is 127 (“command not found”) OR “exec format error”
Either the CMD/ENTRYPOINT references a binary that isn’t in the image, OR there’s an image architecture mismatch (amd64 image scheduled on arm64 node). Run kubectl describe pod to find the node, then kubectl describe node <node> to confirm its architecture (the kubernetes.io/arch label). Then either rebuild as a multi-arch image or add nodeSelector: kubernetes.io/arch: amd64. [src4, src6]
If you have a sidecar container AND main container keeps crashing
K8s 1.29+ (GA in 1.33): sidecar containers (init containers with restartPolicy: Always) survive main container restarts and have a separate lifecycle. The pod can appear partially ready while the main container is in CrashLoopBackOff. Debug each container individually with kubectl logs -c <name>; don’t assume pod-level signals reflect the main app. [src1, src7, src10]
If you’re on K8s 1.33+ AND the 5-minute backoff is slowing down recovery
Enable the ReduceDefaultCrashLoopBackOffDecay alpha feature gate on the kubelet (--feature-gates=ReduceDefaultCrashLoopBackOffDecay=true). This caps backoff at 60s instead of 300s, with a 1s initial delay. Per-node override is available via KubeletCrashLoopBackOffMax (K8s 1.32+ alpha). Both are alpha; do not enable in production without testing. [src8, src10]
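To see what the two backoff settings mean in practice, the doubling-with-cap rule can be sketched in a few lines. `backoff_schedule` is a hypothetical helper that models only the delay sequence described in this article, not the kubelet's actual implementation.

```python
from typing import List


def backoff_schedule(initial: int, cap: int) -> List[int]:
    """Return successive restart delays in seconds, doubling until the cap."""
    delays = []
    delay = initial
    while delay < cap:
        delays.append(delay)
        delay *= 2
    delays.append(cap)  # every later restart waits the capped delay
    return delays
```

`backoff_schedule(10, 300)` yields `[10, 20, 40, 80, 160, 300]` for the default behavior, and `backoff_schedule(1, 60)` yields `[1, 2, 4, 8, 16, 32, 60]` with the alpha gate enabled.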
Common Pitfalls
- --previous flag forgotten: kubectl logs <pod> shows the current (just-restarted) container. Use --previous for the crash logs. [src1, src3]
- Liveness probe killing healthy containers: Too-short initialDelaySeconds creates permanent CrashLoopBackOff. Use startupProbe (K8s 1.20+ GA). [src2, src3]
- Exit 0 CrashLoopBackOff: restartPolicy: Always (default) restarts even successful exits. Use a Job for one-shot tasks. [src5, src6]
- ConfigMap/Secret race condition: The pod starts before its config exists and crashes. Ensure configs exist before deployments. [src3, src5]
- Node resource pressure: Node memory/disk pressure evicts pods. Check the Conditions section of kubectl describe node. [src1]
- Image architecture mismatch: An amd64 image on an arm64 node fails with "exec format error". Use multi-arch images. [src4]
- Sidecar masking real failure: In K8s 1.29+, sidecar containers survive main container restarts. Check each container status individually. [src7]
- Backoff timer confusion: The 5-minute max backoff means long waits. Delete and recreate the pod to reset. On K8s 1.32+, use KubeletCrashLoopBackOffMax to reduce the max delay. [src8]
Diagnostic Commands
# === Find CrashLoopBackOff pods ===
kubectl get pods -A | grep CrashLoopBackOff
# === Pod details ===
kubectl describe pod <pod> -n <ns>
kubectl get pod <pod> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'
# === Logs ===
kubectl logs <pod> --previous
kubectl logs <pod> -c <container> --previous
kubectl logs <pod> -c <init-container>
kubectl logs <pod> -c <sidecar>
# === Events ===
kubectl get events -n <ns> --sort-by='.lastTimestamp' | tail -20
# === Resources ===
kubectl top pod <pod>
kubectl top nodes
# === Debugging ===
kubectl debug -it <pod> --image=busybox --target=<container>
kubectl exec -it <pod> -- /bin/sh
# === Config ===
kubectl get configmap -n <ns>
kubectl get secret -n <ns>
# === Node health ===
kubectl describe node <node> | grep -A5 "Conditions"
Version History & Compatibility
| Version | Behavior | Key Changes |
|---|---|---|
| K8s 1.34 (Aug 2025) | Current | ContainerRestartRules alpha (KEP-5307): per-container restartPolicyRules matching exit codes; restart in place even when pod-level restartPolicy: Never. Targets AI/ML batch jobs on GPUs [src10] |
| K8s 1.33 (Apr 2025) | Stable | Sidecar containers GA; ReduceDefaultCrashLoopBackOffDecay alpha: initial 1s, max 60s backoff (down from 10s/300s) [src10, src11] |
| K8s 1.32 (Dec 2024) | Stable | KubeletCrashLoopBackOffMax alpha: configurable max backoff delay (1s-300s per node) [src8] |
| K8s 1.29 (Dec 2023) | Stable | Sidecar containers beta; better init container lifecycle [src1, src7] |
| K8s 1.28 | Stable | Sidecar alpha; improved probe logging [src1] |
| K8s 1.25 | Ephemeral GA | Ephemeral debug containers GA [src1] |
| K8s 1.23 | Debug beta | kubectl debug beta [src1] |
| K8s 1.20 | startupProbe GA | startupProbe graduated to stable [src2] |
| K8s 1.18 | startupProbe beta | startupProbe for slow-starting containers [src2] |
| K8s 1.16 | Stable probes | Liveness/readiness stable in all workloads [src2] |
When to Use / When Not to Use
| Use When | Don't Use When | Use Instead |
|---|---|---|
| Pod shows CrashLoopBackOff | Pod stuck in Pending | Debug scheduling: kubectl describe pod |
| Container keeps restarting | Pod in ImagePullBackOff | Fix image name/registry |
| Exit code is non-zero | Pod running but not ready | Debug readiness probe |
| Events show probe failures | Container is Evicted | Check node pressure |
Important Caveats
- Backoff delay doubles: 10s, 20s, 40s, 80s, 160s, 300s (max 5 min). Delete and recreate the pod to reset. K8s 1.32+ allows configuring max backoff via the KubeletCrashLoopBackOffMax feature gate.
- kubectl logs --previous only shows the last crash. For earlier crashes, use centralized logging (ELK, Loki, Datadog).
- Resource requests affect scheduling; limits affect runtime. Don't set them equal unless you need QoS "Guaranteed".
- startupProbe disables liveness/readiness until it succeeds. A failing startup probe prevents the pod from ever being "live".
- Init containers run sequentially. If one fails, the kubelet retries that init container until it succeeds (or fails the pod when restartPolicy is Never); all of them re-run only when the pod itself restarts.
- Always use resources.requests and limits in production for predictable scheduling and OOM prevention.
- K8s 1.34 (Aug 2025) introduced ContainerRestartRules alpha (KEP-5307): per-container restartPolicyRules keyed off exit codes that restart in place even when the pod’s restartPolicy is Never. Targets AI/ML batch jobs on expensive GPUs. Enable via --feature-gates=ContainerRestartRules=true.
- K8s 1.33 (Apr 2025) added the ReduceDefaultCrashLoopBackOffDecay alpha feature gate. When enabled on the kubelet, restart backoff starts at 1s (was 10s) and caps at 60s (was 300s). Sequence becomes 1s → 2s → 4s → 8s → 16s → 32s → 60s. Recovery is ~5× faster but failing pods burn more CPU.
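The caveat about requests and limits ties into QoS classes, which influence which pods are evicted first under node pressure. The sketch below approximates how a class follows from the container resources. `qos_class` is a hypothetical helper that compares quantities as plain strings, so `500m` and `0.5` would wrongly count as different values.

```python
from typing import Any, Dict, List

RESOURCES = ("cpu", "memory")


def qos_class(containers: List[Dict[str, Any]]) -> str:
    """Approximate the QoS class for a list of container specs."""
    any_set = False
    all_guaranteed = True
    for c in containers:
        res = c.get("resources", {})
        requests = res.get("requests", {})
        limits = res.get("limits", {})
        if requests or limits:
            any_set = True
        for name in RESOURCES:
            limit = limits.get(name)
            # An omitted request defaults to the limit.
            request = requests.get(name, limit)
            if limit is None or request != limit:
                all_guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if all_guaranteed else "Burstable"
```

The example Deployment's `api` container (requests 100m/256Mi, limits 500m/512Mi) comes out as Burstable. Setting requests equal to limits for both CPU and memory on every container yields Guaranteed.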