kubectl describe pod <pod> shows events, exit codes, and probe config. kubectl logs <pod> --previous shows the crashed container's last output. These two commands solve ~90% of cases. [src1]

- A liveness probe with too short an initialDelaySeconds means Kubernetes kills the container before it finishes starting. Use startupProbe for slow starters. [src2]
- The maximum backoff delay is 5 minutes by default; the alpha KubeletCrashLoopBackOffMax feature gate makes it configurable. [src8]
- Capture kubectl logs --previous and kubectl describe pod before deleting a crashing pod — deletion permanently destroys crash logs and events. [src1]
- kubectl logs --previous only retrieves the last crash. Earlier crash logs require centralized logging (ELK, Loki, Datadog). [src1, src3]
- When a startupProbe is configured, it disables liveness and readiness probes until it succeeds. If the startup probe never passes, the pod enters CrashLoopBackOff with no liveness/readiness evaluation. [src2]
- Setting resources.limits.memory below an app's baseline working set guarantees OOMKill (exit 137). Always profile memory usage with kubectl top pod before setting limits. [src3, src5]
- Native sidecar containers (init containers with restartPolicy: Always) survive main container restarts. Debug them separately with kubectl logs -c <sidecar-name>. [src1, src7]

| # | Cause | Likelihood | Key Signal | Fix |
|---|---|---|---|---|
| 1 | Application error / crash | ~30% | Exit code 1; error in logs | Fix application code [src1, src3] |
| 2 | Missing/wrong env vars or config | ~20% | Exit 1; "env var not set" | Fix ConfigMap/Secret/env [src3, src4] |
| 3 | OOMKilled (exit 137) | ~15% | OOMKilled: true | Increase resources.limits.memory [src3, src5] |
| 4 | Liveness probe failure | ~10% | "Liveness probe failed" in events | Use startupProbe [src2, src3] |
| 5 | Command not found (exit 127) | ~5% | "command not found" / "exec format error" | Fix CMD/entrypoint; check image arch [src4, src6] |
| 6 | Missing ConfigMap or Secret | ~5% | "CreateContainerConfigError" | Create missing resource [src3, src5] |
| 7 | Init container failure | ~4% | Init container crashing | kubectl logs -c init-name [src7] |
| 8 | Volume mount failure | ~4% | "MountVolume.SetUp failed" | Fix PVC/storage class [src3, src5] |
| 9 | Permission denied (exit 126) | ~3% | "Permission denied" in logs | Fix securityContext [src4] |
| 10 | Image pull issues then crash | ~2% | "ImagePullBackOff" | Fix image/registry [src1, src3] |
| 11 | Container exits successfully (exit 0) | ~2% | Repeated exit 0 | Use Job; or keep process running [src5, src6] |
```
START — Pod shows CrashLoopBackOff
├── kubectl describe pod → Check "Last State" Exit Code
│   ├── Exit 0   → App exits but shouldn't → needs long-running process or Job [src6]
│   ├── Exit 1   → App error → check logs: kubectl logs --previous [src1, src3]
│   ├── Exit 126 → Permission denied → fix securityContext [src4]
│   ├── Exit 127 → Command not found → fix image/CMD; check arch [src4, src6]
│   ├── Exit 137 → OOMKilled? → increase memory limits [src3, src5]
│   └── Exit 143 → SIGTERM → check probe timing or preStop hook [src2]
├── Check Events section
│   ├── "Liveness probe failed" → fix probe config; use startupProbe [src2, src3]
│   ├── "MountVolume.SetUp failed" → fix PVC [src3]
│   └── "configmap/secret not found" → create resource [src3, src5]
├── Check Init Containers
│   └── Failing? → kubectl logs -c <init-name> [src7]
├── Check Sidecar Containers (K8s 1.29+)
│   └── Sidecar healthy but main crashing? → debug main container separately [src7]
└── kubectl logs --previous → find the error message [src1]
```
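The exit-code branch of the tree above can be scripted. A minimal sketch in Python; `classify_exit` and its lookup table are illustrative helpers, not part of kubectl:

```python
# Map a container's last exit code to the likely cause and next step,
# mirroring the decision tree above. Hypothetical helper, not a kubectl feature.
EXIT_CODES = {
    0:   ("App exits but shouldn't", "use a long-running process or a Job"),
    1:   ("Application error", "kubectl logs <pod> --previous"),
    126: ("Permission denied", "fix securityContext"),
    127: ("Command not found", "fix image/CMD; check architecture"),
    137: ("SIGKILL (often OOMKilled)", "check terminated reason; raise memory limits"),
    143: ("SIGTERM", "check probe timing or preStop hook"),
}

def classify_exit(code, reason=None):
    """Return (likely cause, suggested next step) for a container exit code."""
    if code == 137 and reason == "OOMKilled":
        return ("OOMKilled", "increase resources.limits.memory")
    return EXIT_CODES.get(code, ("Unknown exit code", "inspect logs and events"))
```

Feed it the values from `lastState.terminated.exitCode` and `.reason` (shown in the commands below).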
Identify which pods are in CrashLoopBackOff. [src1, src3]

```shell
kubectl get pods -A | grep CrashLoopBackOff
kubectl get pod <pod> -o jsonpath='{.status.containerStatuses[0].restartCount}'
```
kubectl describe pod is the single most important debugging command. [src1, src3, src5]

```shell
kubectl describe pod <pod> -n <ns>
kubectl get pod <pod> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'
kubectl get pod <pod> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
```
The *previous* container has the crash output. [src1, src3, src5]

```shell
kubectl logs <pod> --previous
kubectl logs <pod> -c <container> --previous
kubectl logs <pod> -c <init-container-name>
kubectl logs <pod> -c <sidecar-name>   # K8s 1.29+
```
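For scripting, the variants above can be generated from one helper. A sketch; `previous_logs_args` is a hypothetical name, not a kubectl or client-go API:

```python
def previous_logs_args(pod, container=None, ns="default", previous=True, tail=None):
    """Build the argv for fetching a pod's (previous) container logs."""
    args = ["kubectl", "logs", pod, "-n", ns]
    if container:
        args += ["-c", container]   # init and sidecar containers need -c
    if previous:
        args.append("--previous")   # logs from the crashed instance, not the restart
    if tail:
        args.append(f"--tail={tail}")
    return args
```

Pass the result to `subprocess.run(...)`, as the monitor script later in this guide does.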
Container exceeded its memory limit. [src3, src5]

```yaml
resources:
  requests:
    memory: 256Mi
  limits:
    memory: 512Mi
```

```shell
kubectl top pod <pod>
```
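A common rule of thumb (an assumption, not an official Kubernetes formula) is to set the limit roughly 50% above the observed peak, rounded up to a tidy value:

```python
import math

def suggest_memory_limit(peak_mi, headroom=0.5, round_to=64):
    """Suggest a memory limit in Mi: observed peak plus headroom,
    rounded up to a multiple of round_to. Heuristic only — profile
    with `kubectl top pod` first."""
    raw = peak_mi * (1 + headroom)
    return math.ceil(raw / round_to) * round_to
```

For a pod peaking at 300Mi this suggests 512Mi, which matches the example limit above.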
Use startupProbe for slow starters. [src2, src3]

```yaml
startupProbe:
  httpGet:
    path: /health
    port: 8080
  periodSeconds: 10
  failureThreshold: 30   # 5 min startup allowance
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  periodSeconds: 15
  failureThreshold: 3
```
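The total time a startupProbe grants is periodSeconds × failureThreshold. A quick sanity check (the `startup_allowance` helper is illustrative):

```python
def startup_allowance(period_seconds, failure_threshold):
    """Seconds a container may take to start before the startupProbe
    gives up: periodSeconds * failureThreshold."""
    return period_seconds * failure_threshold

# The config above grants 10s * 30 = 300s (5 minutes) before restart.
```

Size failureThreshold so this product comfortably exceeds your slowest observed cold start.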
Missing configuration causes immediate crashes. [src3, src4, src5]

```shell
kubectl get configmap -n <ns>
kubectl get secret -n <ns>
kubectl get pod <pod> -o jsonpath='{.spec.containers[0].env}'
```
When logs don't show enough, attach an ephemeral debug container (GA since K8s 1.25). [src1, src4]

```shell
kubectl debug -it <pod> --image=busybox --target=<container>
kubectl exec -it <pod> -- /bin/sh
kubectl get events -n <ns> --sort-by='.lastTimestamp' | tail -20
```
```shell
#!/bin/bash
POD="$1"; NS="${2:-default}"
if [ -z "$POD" ]; then
  echo "Usage: $0 <pod> [namespace]"
  kubectl get pods -A | grep CrashLoopBackOff; exit 1
fi
echo "=== CrashLoopBackOff Diagnostic: $POD (ns: $NS) ==="
RESTARTS=$(kubectl get pod "$POD" -n "$NS" -o jsonpath='{.status.containerStatuses[0].restartCount}')
EXIT=$(kubectl get pod "$POD" -n "$NS" -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}')
REASON=$(kubectl get pod "$POD" -n "$NS" -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}')
echo "Restarts: $RESTARTS | Exit: $EXIT | Reason: $REASON"
case "$EXIT" in
  0)   echo "Exit 0 — completed (shouldn't for services)" ;;
  1)   echo "Exit 1 — Application error" ;;
  126) echo "Exit 126 — Permission denied" ;;
  127) echo "Exit 127 — Command not found" ;;
  137) echo "Exit 137 — $( [ "$REASON" = "OOMKilled" ] && echo "OOMKilled" || echo "SIGKILL")" ;;
  143) echo "Exit 143 — SIGTERM" ;;
esac
echo "=== Resources ==="
kubectl get pod "$POD" -n "$NS" -o jsonpath='{range .spec.containers[*]}{.name}: req={.resources.requests.memory} lim={.resources.limits.memory}{"\n"}{end}'
kubectl top pod "$POD" -n "$NS" 2>/dev/null || echo "(metrics unavailable)"
echo "=== Events ==="
kubectl get events -n "$NS" --field-selector "involvedObject.name=$POD" --sort-by='.lastTimestamp' | tail -10
echo "=== Previous Logs (30 lines) ==="
kubectl logs "$POD" -n "$NS" --previous --tail=30 2>/dev/null || echo "(none)"
```
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api-server
  template:
    metadata:
      labels:
        app: api-server
    spec:
      initContainers:
      - name: wait-for-db
        image: busybox:1.36
        command: ['sh', '-c', 'until nc -zv db-service 5432; do sleep 2; done']
        resources:
          limits: { cpu: 50m, memory: 32Mi }
      - name: log-shipper          # Sidecar (K8s 1.29+)
        image: fluentbit:2.2
        restartPolicy: Always      # Survives main container restarts
        resources:
          requests: { cpu: 50m, memory: 64Mi }
          limits: { cpu: 100m, memory: 128Mi }
      containers:
      - name: api
        image: myapp:1.2.3
        ports: [{ containerPort: 8080 }]
        envFrom:
        - configMapRef: { name: api-config }
        - secretRef: { name: api-secrets }
        resources:
          requests: { cpu: 100m, memory: 256Mi }
          limits: { cpu: 500m, memory: 512Mi }
        startupProbe:
          httpGet: { path: /health, port: 8080 }
          periodSeconds: 10
          failureThreshold: 30
        livenessProbe:
          httpGet: { path: /health, port: 8080 }
          periodSeconds: 15
          failureThreshold: 3
        readinessProbe:
          httpGet: { path: /ready, port: 8080 }
          periodSeconds: 10
      terminationGracePeriodSeconds: 30
```
```python
#!/usr/bin/env python3
import subprocess, json, time
from datetime import datetime

def get_crash_pods():
    result = subprocess.run(
        "kubectl get pods -A -o json".split(), capture_output=True, text=True)
    if result.returncode != 0:
        return []
    pods = json.loads(result.stdout)
    crash = []
    for pod in pods.get("items", []):
        for cs in pod.get("status", {}).get("containerStatuses", []):
            w = cs.get("state", {}).get("waiting", {})
            if w.get("reason") == "CrashLoopBackOff":
                last = cs.get("lastState", {}).get("terminated", {})
                crash.append({
                    "name": pod["metadata"]["name"],
                    "ns": pod["metadata"]["namespace"],
                    "restarts": cs.get("restartCount", 0),
                    "exit": last.get("exitCode"),
                    "reason": last.get("reason", "Unknown"),
                })
    return crash

def monitor(interval=30):
    seen = set()
    while True:
        for p in get_crash_pods():
            key = f"{p['ns']}/{p['name']}"
            if key not in seen:
                seen.add(key)
                print(f"[{datetime.now():%H:%M:%S}] ALERT {key} "
                      f"exit={p['exit']} ({p['reason']}) restarts={p['restarts']}")
        time.sleep(interval)

if __name__ == "__main__":
    monitor()
```
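The detection logic above can be exercised without a cluster by factoring the per-pod check into a pure function and feeding it canned `kubectl get pods -o json` items. A sketch; `find_crashes` is not part of the script above:

```python
def find_crashes(pod_list):
    """Return (namespace/name, exitCode) for containers currently
    waiting in CrashLoopBackOff, given pod items from
    'kubectl get pods -o json'."""
    out = []
    for pod in pod_list:
        for cs in pod.get("status", {}).get("containerStatuses", []):
            if cs.get("state", {}).get("waiting", {}).get("reason") == "CrashLoopBackOff":
                term = cs.get("lastState", {}).get("terminated", {})
                meta = pod["metadata"]
                out.append((f"{meta['namespace']}/{meta['name']}", term.get("exitCode")))
    return out

# Canned pod object mimicking the API shape the monitor parses.
sample = [{
    "metadata": {"name": "api-0", "namespace": "prod"},
    "status": {"containerStatuses": [{
        "state": {"waiting": {"reason": "CrashLoopBackOff"}},
        "lastState": {"terminated": {"exitCode": 137, "reason": "OOMKilled"}},
    }]},
}]
# find_crashes(sample) → [("prod/api-0", 137)]
```

Unit-testing the parsing this way catches schema mistakes before you run the monitor against production.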
```yaml
# BAD — unlimited resources; OOM kills other pods [src3, src5]
containers:
- name: app
  image: myapp:latest

# GOOD — predictable; scheduler places properly [src3, src5]
containers:
- name: app
  image: myapp:1.2.3
  resources:
    requests: { cpu: 100m, memory: 256Mi }
    limits: { cpu: 500m, memory: 512Mi }
```
```yaml
# BAD — kills Spring Boot during startup [src2, src3]
livenessProbe:
  httpGet: { path: /health, port: 8080 }
  initialDelaySeconds: 5
  periodSeconds: 3
  failureThreshold: 3

# GOOD — startupProbe allows up to 5 min init [src2]
startupProbe:
  httpGet: { path: /health, port: 8080 }
  periodSeconds: 10
  failureThreshold: 30
livenessProbe:
  httpGet: { path: /health, port: 8080 }
  periodSeconds: 15
```
```yaml
# BAD — unpredictable; hard to rollback [src3, src6]
containers:
- name: app
  image: myapp:latest
  imagePullPolicy: Always

# GOOD — reproducible; easy rollback [src3, src6]
containers:
- name: app
  image: myapp:1.2.3
  imagePullPolicy: IfNotPresent
```
```yaml
# BAD — sidecar dies with main container [src7]
containers:
- name: app
  image: myapp:1.2.3
- name: log-shipper
  image: fluentbit:2.2

# GOOD — sidecar survives main container restarts [src7, src8]
initContainers:
- name: log-shipper
  image: fluentbit:2.2
  restartPolicy: Always
containers:
- name: app
  image: myapp:1.2.3
```
- --previous flag forgotten: kubectl logs <pod> shows the current (just-restarted) container. Use --previous for the crash logs. [src1, src3]
- A liveness probe with too short an initialDelaySeconds creates permanent CrashLoopBackOff. Use startupProbe (K8s 1.20+ GA). [src2, src3]
- restartPolicy: Always (the default) restarts even successful exits. Use a Job for one-shot tasks. [src5, src6]
- Node problems (memory/disk pressure) can surface as pod instability; check the Conditions section of kubectl describe node. [src1]
- Backoff delay caps at 5 minutes; on K8s 1.32+, the alpha KubeletCrashLoopBackOffMax feature gate can reduce the max delay. [src8]

```shell
# === Find CrashLoopBackOff pods ===
kubectl get pods -A | grep CrashLoopBackOff

# === Pod details ===
kubectl describe pod <pod> -n <ns>
kubectl get pod <pod> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'

# === Logs ===
kubectl logs <pod> --previous
kubectl logs <pod> -c <container> --previous
kubectl logs <pod> -c <init-container>
kubectl logs <pod> -c <sidecar>

# === Events ===
kubectl get events -n <ns> --sort-by='.lastTimestamp' | tail -20

# === Resources ===
kubectl top pod <pod>
kubectl top nodes

# === Debugging ===
kubectl debug -it <pod> --image=busybox --target=<container>
kubectl exec -it <pod> -- /bin/sh

# === Config ===
kubectl get configmap -n <ns>
kubectl get secret -n <ns>

# === Node health ===
kubectl describe node <node> | grep -A5 "Conditions"
```
| Version | Behavior | Key Changes |
|---|---|---|
| K8s 1.32+ | Current | KubeletCrashLoopBackOffMax alpha — configurable max backoff [src8] |
| K8s 1.29+ | Stable | Sidecar containers GA; better init container lifecycle [src1, src7] |
| K8s 1.28 | Stable | Sidecar alpha; improved probe logging [src1] |
| K8s 1.25 | Ephemeral GA | Ephemeral debug containers GA [src1] |
| K8s 1.23 | Debug beta | kubectl debug beta [src1] |
| K8s 1.20 | startupProbe GA | startupProbe graduated to stable [src2] |
| K8s 1.18 | startupProbe beta | startupProbe for slow-starting containers [src2] |
| K8s 1.16 | Stable probes | Liveness/readiness stable in all workloads [src2] |
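The backoff delays behind CrashLoopBackOff can be sketched assuming the default kubelet behavior (10s initial delay, doubling per restart, capped at 5 minutes); the helper name is illustrative:

```python
def backoff_delays(restarts, initial=10, cap=300):
    """Back-off delay in seconds before each of the first `restarts`
    restarts: exponential from `initial`, capped at `cap` (the default
    5-minute max that KubeletCrashLoopBackOffMax makes tunable)."""
    return [min(initial * 2 ** i, cap) for i in range(restarts)]

# Six restarts → delays 10, 20, 40, 80, 160, 300 seconds.
```

This is why a crashing pod appears to "settle" at one restart roughly every 5 minutes.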
| Use When | Don't Use When | Use Instead |
|---|---|---|
| Pod shows CrashLoopBackOff | Pod stuck in Pending | Debug scheduling: kubectl describe pod |
| Container keeps restarting | Pod in ImagePullBackOff | Fix image name/registry |
| Exit code is non-zero | Pod running but not ready | Debug readiness probe |
| Events show probe failures | Container is Evicted | Check node pressure |
- The maximum backoff delay is only configurable via the alpha KubeletCrashLoopBackOffMax feature gate (K8s 1.32+).
- kubectl logs --previous only shows the last crash. For earlier crashes, use centralized logging (ELK, Loki, Datadog).
- requests affect scheduling; limits affect runtime. Don't set them equal unless you need QoS "Guaranteed".
- startupProbe disables liveness/readiness until it succeeds. A failing startup probe prevents the pod from ever being "live".
- Always set resources.requests and limits in production for predictable scheduling and OOM prevention.
- RestartAllContainers (alpha): per-container restart rules that trigger in-place restart of all containers on specific exit codes, avoiding expensive pod rescheduling for AI/ML workloads.
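The QoS rule mentioned above can be sketched as a simplified classifier. This is an illustrative reduction of the real Kubernetes rules (requests default to limits when only limits are set; the kubelet also considers every container, including init containers):

```python
def qos_class(containers):
    """Simplified pod QoS classification. Each container is a dict like
    {"requests": {"cpu": ..., "memory": ...}, "limits": {...}}."""
    def full(res):
        # Both cpu and memory must be specified.
        return bool(res) and "cpu" in res and "memory" in res

    # Guaranteed: every container has full limits, and requests (if set) equal limits.
    if containers and all(
        full(c.get("limits")) and
        (c.get("requests") in (None, {}) or c["requests"] == c["limits"])
        for c in containers
    ):
        return "Guaranteed"
    # Burstable: at least one request or limit is set somewhere.
    if any(c.get("requests") or c.get("limits") for c in containers):
        return "Burstable"
    return "BestEffort"
```

Guaranteed pods are the last candidates for OOM-killing under node memory pressure, which is why the requests-equal-limits pattern matters for crash-sensitive workloads.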