How to Debug Kubernetes CrashLoopBackOff

How do I debug Kubernetes CrashLoopBackOff?

TL;DR

Bottom line: CrashLoopBackOff means a container in a pod keeps crashing and Kubernetes keeps restarting it with increasing delays (10s, 20s, 40s, ... up to a 5-minute maximum). It is not an error in itself; it is Kubernetes telling you the container can't stay running. The root cause is almost always in the container or its configuration: app error, bad config, missing dependency, OOM, or failing health probes.
Key tool/command: kubectl describe pod <pod> shows events, exit codes, and probe config. kubectl logs <pod> --previous shows the crashed container's last output. These two solve 90% of cases.
Watch out for: Liveness probes with a too-short initialDelaySeconds, which make Kubernetes kill the container before it finishes starting. Use startupProbe for slow starters.
Works with: Kubernetes 1.20 and later. The same concepts apply to OpenShift, EKS, GKE, AKS, k3s, and minikube. K8s 1.32+ adds a configurable max backoff via the alpha KubeletCrashLoopBackOffMax feature gate. K8s 1.33 (Apr 2025) graduates sidecar containers to GA and adds the alpha ReduceDefaultCrashLoopBackOffDecay feature gate, which, when enabled, reduces the backoff to 1s initial / 60s max.

Constraints

Capture before deleting: Never delete a CrashLoopBackOff pod without first running kubectl logs --previous and kubectl describe pod — deletion permanently destroys crash logs and events. [src1]
Logs are ephemeral: kubectl logs --previous only retrieves the last crash. Earlier crash logs require centralized logging (ELK, Loki, Datadog). [src1, src3]
startupProbe blocks other probes: When a startupProbe is configured, it disables liveness and readiness probes until it succeeds. If the startup probe never passes, the pod enters CrashLoopBackOff with no liveness/readiness evaluation. [src2]
Memory limits are hard-enforced by cgroups: Setting resources.limits.memory below an app's baseline working set guarantees an OOMKill (exit 137). Profile memory usage with kubectl top pod before setting limits. [src3, src5]
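Init containers retry in place: If an init container fails, the kubelet retries that init container (subject to the pod's restartPolicy), and later init containers and the main containers do not start until it succeeds. All init containers run again only when the pod itself is restarted. [src7]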
Sidecar containers have a separate lifecycle: Sidecar containers (beta and enabled by default since K8s 1.29, GA in 1.33; defined as init containers with restartPolicy: Always) keep running across main container restarts. Debug them separately with kubectl logs -c <sidecar-name>. [src1, src7]
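
The "capture before deleting" constraint above can be scripted so the evidence is saved in one step. The following is a minimal sketch, not a definitive tool: the function names and the crash-evidence output directory are hypothetical, and it assumes kubectl is on the PATH and already pointed at the right cluster.

```python
import subprocess
from pathlib import Path


def capture_commands(pod: str, namespace: str = "default") -> dict[str, list[str]]:
    """Map an output file name to the kubectl argv that produces its content."""
    base = ["kubectl", "-n", namespace]
    return {
        f"{pod}.previous.log": base + ["logs", pod, "--previous"],
        f"{pod}.describe.txt": base + ["describe", "pod", pod],
        f"{pod}.yaml": base + ["get", "pod", pod, "-o", "yaml"],
    }


def save_crash_evidence(pod: str, namespace: str = "default",
                        out_dir: str = "crash-evidence") -> None:
    """Run each capture command and write its output to out_dir (hypothetical path)."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for file_name, argv in capture_commands(pod, namespace).items():
        result = subprocess.run(argv, capture_output=True, text=True)
        # Keep stderr when the command produced no stdout, so failures are visible.
        (out / file_name).write_text(result.stdout or result.stderr)
```

Calling save_crash_evidence("<pod>", "<ns>") before any kubectl delete keeps the last crash log, the events, and the full pod status on disk. For multi-container pods the logs command would also need -c <container>.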

Quick Reference

#	Cause	Likelihood	Key Signal	Fix
1	Application error / crash	~30%	Exit code 1; error in logs	Fix application code [src1, src3]
2	Missing/wrong env vars or config	~20%	Exit 1; "env var not set"	Fix ConfigMap/Secret/env [src3, src4]
3	OOMKilled (exit 137)	~15%	`OOMKilled: true`	Increase `resources.limits.memory` [src3, src5]
4	Liveness probe failure	~10%	"Liveness probe failed" in events	Use `startupProbe` [src2, src3]
5	Command not found (exit 127)	~5%	"command not found", or "exec format error" on an architecture mismatch	Fix CMD/entrypoint [src4, src6]
6	Missing ConfigMap or Secret	~5%	"CreateContainerConfigError"	Create missing resource [src3, src5]
7	Init container failure	~4%	Init container crashing	`kubectl logs -c init-name` [src7]
8	Volume mount failure	~4%	"MountVolume.SetUp failed"	Fix PVC/storage class [src3, src5]
9	Permission denied (exit 126)	~3%	"Permission denied" in logs	Fix `securityContext` [src4]
10	Image pull issues then crash	~2%	"ImagePullBackOff"	Fix image/registry [src1, src3]
11	Container exits successfully (exit 0)	~2%	Repeated exit 0	Use Job; or keep process running [src5, src6]
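
Exit codes above 128 in the table follow the common shell convention of 128 plus the signal number, which is why 137 means SIGKILL (9) and 143 means SIGTERM (15). A small sketch of that arithmetic follows; the function name and the short signal table are illustrative choices, not part of any Kubernetes API.

```python
# Common Linux signal numbers; hard-coded so the sketch does not depend on the host OS.
SIGNAL_NAMES = {1: "SIGHUP", 2: "SIGINT", 6: "SIGABRT", 9: "SIGKILL", 11: "SIGSEGV", 15: "SIGTERM"}


def signal_from_exit_code(exit_code: int):
    """Return (signal_number, signal_name) for codes above 128, else None."""
    if exit_code <= 128:
        return None  # ordinary exit status chosen by the process itself
    number = exit_code - 128
    return number, SIGNAL_NAMES.get(number, "unknown")
```

For example, signal_from_exit_code(137) yields signal 9, so the process was killed rather than exiting on its own; whether the OOM killer or the kubelet sent it is what the Reason field distinguishes.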

Decision Tree

START — Pod shows CrashLoopBackOff
├── kubectl describe pod → Check "Last State" Exit Code
│   ├── Exit 0 → App exits but shouldn't → needs long-running process or Job [src6]
│   ├── Exit 1 → App error → check logs: kubectl logs --previous [src1, src3]
│   ├── Exit 126 → Permission denied → fix securityContext [src4]
│   ├── Exit 127 → Command not found → fix image/CMD; check arch [src4, src6]
│   ├── Exit 137 → OOMKilled? → increase memory limits [src3, src5]
│   └── Exit 143 → SIGTERM → check probe timing or preStop hook [src2]
├── Check Events section
│   ├── "Liveness probe failed" → fix probe config; use startupProbe [src2, src3]
│   ├── "MountVolume.SetUp failed" → fix PVC [src3]
│   └── "configmap/secret not found" → create resource [src3, src5]
├── Check Init Containers
│   └── Failing? → kubectl logs -c <init-name> [src7]
├── Check Sidecar Containers (K8s 1.29+)
│   └── Sidecar healthy but main crashing? → debug main container separately [src7]
└── kubectl logs --previous → find the error message [src1]
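
The tree above can be approximated as a small lookup that combines the exit code, the termination reason, and event messages. This is a rough sketch under the assumption that those three values have already been copied out of kubectl describe pod; the function and table names are hypothetical, and the string matching is deliberately simple.

```python
EXIT_HINTS = {
    0: "app exits but shouldn't: run a long-lived process or use a Job",
    1: "app error: read kubectl logs --previous",
    126: "permission denied: fix securityContext",
    127: "command not found: fix image/CMD; check architecture",
    143: "SIGTERM: check probe timing or preStop hook",
}


def event_hints(events):
    """Translate known event messages into next steps, without duplicates."""
    hints = []
    for message in events:
        text = message.lower()
        if "liveness probe failed" in text:
            hint = "fix probe config; use startupProbe"
        elif "mountvolume.setup failed" in text:
            hint = "fix PVC"
        elif "not found" in text and ("configmap" in text or "secret" in text):
            hint = "create the missing ConfigMap/Secret"
        else:
            continue
        if hint not in hints:
            hints.append(hint)
    return hints


def triage(exit_code, reason="", events=()):
    """Return next steps in the same order the decision tree checks them."""
    hints = []
    if exit_code == 137:
        if reason == "OOMKilled":
            hints.append("increase memory limits")
        else:
            hints.append("SIGKILL without OOMKilled: look for probe failures in events")
    elif exit_code in EXIT_HINTS:
        hints.append(EXIT_HINTS[exit_code])
    hints.extend(event_hints(events))
    if not hints:
        hints.append("read kubectl logs --previous for the error message")
    return hints
```

The function returns every branch that applies rather than stopping at the first, since an exit code and an event often point at the same underlying problem.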

Step-by-Step Guide

1. Get the pod status and restart count

Identify which pods are in CrashLoopBackOff. [src1, src3]

kubectl get pods -A | grep CrashLoopBackOff
kubectl get pod <pod> -o jsonpath='{.status.containerStatuses[0].restartCount}'

2. Describe the pod for events and state

The single most important debugging command. [src1, src3, src5]

kubectl describe pod <pod> -n <ns>
kubectl get pod <pod> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'
kubectl get pod <pod> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
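
The jsonpath queries above read containerStatuses[0], which covers only the first container. For multi-container pods, a sketch like the following walks every container in the output of kubectl get pod <pod> -o json. The function name is hypothetical; the field names are the ones the pod status uses.

```python
def last_terminations(pod: dict) -> list[dict]:
    """List the last recorded termination of every init and regular container."""
    status = pod.get("status", {})
    rows = []
    for field in ("initContainerStatuses", "containerStatuses"):
        for cs in status.get(field) or []:
            terminated = (cs.get("lastState") or {}).get("terminated")
            if terminated:  # containers that never crashed have no lastState.terminated
                rows.append({
                    "container": cs["name"],
                    "restarts": cs.get("restartCount", 0),
                    "exitCode": terminated.get("exitCode"),
                    "reason": terminated.get("reason"),
                })
    return rows
```

Feeding it json.loads() of the kubectl output gives one row per container that has crashed at least once, which makes it easier to spot a failing init container or sidecar next to a healthy main container.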

3. Read the previous container's logs

The *previous* container has the crash output. [src1, src3, src5]

kubectl logs <pod> --previous
kubectl logs <pod> -c <container> --previous
kubectl logs <pod> -c <init-container-name>
kubectl logs <pod> -c <sidecar-name>  # K8s 1.29+

4. Fix OOMKilled (exit code 137)

Container exceeded its memory limit. [src3, src5]

resources:
  requests:
    memory: 256Mi
  limits:
    memory: 512Mi

kubectl top pod <pod>

5. Fix liveness probe failures

Use startupProbe for slow starters. [src2, src3]

startupProbe:
  httpGet:
    path: /health
    port: 8080
  periodSeconds: 10
  failureThreshold: 30    # 5 min startup allowance
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  periodSeconds: 15
  failureThreshold: 3

6. Fix missing ConfigMap / Secret / env vars

Missing configuration causes immediate crashes. [src3, src4, src5]

kubectl get configmap -n <ns>
kubectl get secret -n <ns>
kubectl get pod <pod> -o jsonpath='{.spec.containers[0].env}'
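
Comparing what a pod references with what exists in the namespace can be automated. The sketch below assumes the pod spec comes from kubectl get pod <pod> -o json (the .spec object) and that the names of existing ConfigMaps and Secrets have been collected separately; both function names are hypothetical, and references marked optional are not treated specially.

```python
def referenced_config(pod_spec: dict) -> dict[str, set[str]]:
    """Collect ConfigMap and Secret names referenced by env, envFrom, and volumes."""
    refs = {"configmaps": set(), "secrets": set()}
    containers = (pod_spec.get("initContainers") or []) + (pod_spec.get("containers") or [])
    for container in containers:
        for source in container.get("envFrom") or []:
            if "configMapRef" in source:
                refs["configmaps"].add(source["configMapRef"]["name"])
            if "secretRef" in source:
                refs["secrets"].add(source["secretRef"]["name"])
        for env in container.get("env") or []:
            value_from = env.get("valueFrom") or {}
            if "configMapKeyRef" in value_from:
                refs["configmaps"].add(value_from["configMapKeyRef"]["name"])
            if "secretKeyRef" in value_from:
                refs["secrets"].add(value_from["secretKeyRef"]["name"])
    for volume in pod_spec.get("volumes") or []:
        if "configMap" in volume:
            refs["configmaps"].add(volume["configMap"]["name"])
        if "secret" in volume:
            refs["secrets"].add(volume["secret"]["secretName"])
    return refs


def missing_config(pod_spec: dict, configmaps: set[str], secrets: set[str]) -> dict[str, set[str]]:
    """Return referenced names that are absent from the namespace."""
    refs = referenced_config(pod_spec)
    return {
        "configmaps": refs["configmaps"] - configmaps,
        "secrets": refs["secrets"] - secrets,
    }
```

Any name left in the result is a resource to create before the pod can start cleanly.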

7. Debug with ephemeral containers

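When logs don't show enough, attach an ephemeral debug container (GA since K8s 1.25). kubectl exec only works while the target container is running, so it often fails in the middle of a crash loop. [src1, src4]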

kubectl debug -it <pod> --image=busybox --target=<container>
kubectl exec -it <pod> -- /bin/sh
kubectl get events -n <ns> --sort-by='.lastTimestamp' | tail -20

Code Examples

Comprehensive CrashLoopBackOff diagnostic script

#!/bin/bash
# Summarize why a pod is crash-looping: last exit code, resources, events, previous logs.
POD="$1"; NS="${2:-default}"
if [ -z "$POD" ]; then
    echo "Usage: $0 <pod> [namespace]"
    # No pod given: list crash-looping pods so the caller can pick one.
    kubectl get pods -A | grep CrashLoopBackOff; exit 1
fi

echo "=== CrashLoopBackOff Diagnostic: $POD (ns: $NS) ==="
# containerStatuses[0] is the first container only; change the index for multi-container pods.
RESTARTS=$(kubectl get pod "$POD" -n "$NS" -o jsonpath='{.status.containerStatuses[0].restartCount}')
EXIT=$(kubectl get pod "$POD" -n "$NS" -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}')
REASON=$(kubectl get pod "$POD" -n "$NS" -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}')
echo "Restarts: $RESTARTS | Exit: $EXIT | Reason: $REASON"

# Map the last exit code to its most likely meaning.
case "$EXIT" in
    0)   echo "Exit 0: completed (unexpected for a long-running service)" ;;
    1)   echo "Exit 1: application error" ;;
    126) echo "Exit 126: permission denied" ;;
    127) echo "Exit 127: command not found" ;;
    137) echo "Exit 137: $( [ "$REASON" = "OOMKilled" ] && echo "OOMKilled" || echo "SIGKILL")" ;;
    143) echo "Exit 143: SIGTERM" ;;
    "")  echo "No previous termination recorded" ;;
    *)   echo "Exit $EXIT: check the previous logs below" ;;
esac

echo "=== Resources ==="
# Configured memory requests and limits per container, then live usage if metrics-server is present.
kubectl get pod "$POD" -n "$NS" -o jsonpath='{range .spec.containers[*]}{.name}: req={.resources.requests.memory} lim={.resources.limits.memory}{"\n"}{end}'
kubectl top pod "$POD" -n "$NS" 2>/dev/null || echo "(metrics unavailable)"

echo "=== Events ==="
kubectl get events -n "$NS" --field-selector "involvedObject.name=$POD" --sort-by='.lastTimestamp' | tail -10

echo "=== Previous Logs (30 lines) ==="
# --previous reads the crashed container instance, not the one that just restarted.
kubectl logs "$POD" -n "$NS" --previous --tail=30 2>/dev/null || echo "(none)"

Production-ready pod with proper probes and resources

apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api-server
  template:
    metadata:
      labels:
        app: api-server
    spec:
      initContainers:
        - name: wait-for-db
          image: busybox:1.36
          command: ['sh', '-c', 'until nc -zv db-service 5432; do sleep 2; done']
          resources:
            limits: { cpu: 50m, memory: 32Mi }
        - name: log-shipper      # Sidecar (K8s 1.29+)
          image: fluentbit:2.2
          restartPolicy: Always   # Survives main container restarts
          resources:
            requests: { cpu: 50m, memory: 64Mi }
            limits: { cpu: 100m, memory: 128Mi }
      containers:
        - name: api
          image: myapp:1.2.3
          ports: [{ containerPort: 8080 }]
          envFrom:
            - configMapRef: { name: api-config }
            - secretRef: { name: api-secrets }
          resources:
            requests: { cpu: 100m, memory: 256Mi }
            limits: { cpu: 500m, memory: 512Mi }
          startupProbe:
            httpGet: { path: /health, port: 8080 }
            periodSeconds: 10
            failureThreshold: 30
          livenessProbe:
            httpGet: { path: /health, port: 8080 }
            periodSeconds: 15
            failureThreshold: 3
          readinessProbe:
            httpGet: { path: /ready, port: 8080 }
            periodSeconds: 10
      terminationGracePeriodSeconds: 30

Automated CrashLoopBackOff watcher

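#!/usr/bin/env python3
"""Poll the cluster and print an alert the first time a container enters CrashLoopBackOff."""
import json
import subprocess
import time
from datetime import datetime


def get_crash_pods():
    """Return one record per container currently waiting in CrashLoopBackOff."""
    result = subprocess.run(
        ["kubectl", "get", "pods", "-A", "-o", "json"], capture_output=True, text=True)
    if result.returncode != 0:
        return []  # kubectl failed (no cluster access, wrong context): report nothing
    pods = json.loads(result.stdout)
    crash = []
    for pod in pods.get("items", []):
        status = pod.get("status", {})
        # Init containers and sidecars report under initContainerStatuses.
        statuses = status.get("initContainerStatuses", []) + status.get("containerStatuses", [])
        for cs in statuses:
            w = cs.get("state", {}).get("waiting", {})
            if w.get("reason") == "CrashLoopBackOff":
                last = cs.get("lastState", {}).get("terminated", {})
                crash.append({
                    "name": pod["metadata"]["name"],
                    "ns": pod["metadata"]["namespace"],
                    "container": cs.get("name", "?"),
                    "restarts": cs.get("restartCount", 0),
                    "exit": last.get("exitCode"),
                    "reason": last.get("reason", "Unknown"),
                })
    return crash


def monitor(interval=30):
    """Alert once per crash-looping container, checking every `interval` seconds."""
    seen = set()
    while True:
        for p in get_crash_pods():
            key = f"{p['ns']}/{p['name']}/{p['container']}"
            if key not in seen:
                seen.add(key)
                print(f"[{datetime.now():%H:%M:%S}] ALERT {key} "
                      f"exit={p['exit']} ({p['reason']}) restarts={p['restarts']}")
        time.sleep(interval)


if __name__ == "__main__":
    monitor()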

Anti-Patterns

Wrong: No resource limits

# BAD — unlimited resources; OOM kills other pods [src3, src5]
containers:
  - name: app
    image: myapp:latest

Correct: Always set resource requests and limits

# GOOD — predictable; scheduler places properly [src3, src5]
containers:
  - name: app
    image: myapp:1.2.3
    resources:
      requests: { cpu: 100m, memory: 256Mi }
      limits: { cpu: 500m, memory: 512Mi }

Wrong: Aggressive liveness probe on slow-starting app

# BAD — kills Spring Boot during startup [src2, src3]
livenessProbe:
  httpGet: { path: /health, port: 8080 }
  initialDelaySeconds: 5
  periodSeconds: 3
  failureThreshold: 3

Correct: Use startupProbe for slow starters

# GOOD — startupProbe allows up to 5 min init [src2]
startupProbe:
  httpGet: { path: /health, port: 8080 }
  periodSeconds: 10
  failureThreshold: 30
livenessProbe:
  httpGet: { path: /health, port: 8080 }
  periodSeconds: 15

Wrong: Using :latest tag

# BAD — unpredictable; hard to rollback [src3, src6]
containers:
  - name: app
    image: myapp:latest
    imagePullPolicy: Always

Correct: Pin specific image version

# GOOD — reproducible; easy rollback [src3, src6]
containers:
  - name: app
    image: myapp:1.2.3
    imagePullPolicy: IfNotPresent

Wrong: Sidecar as regular container (K8s 1.29+)

# BAD: no startup-order guarantee, and a never-exiting sidecar keeps a Job's pod from completing [src7]
containers:
  - name: app
    image: myapp:1.2.3
  - name: log-shipper
    image: fluentbit:2.2

Correct: Sidecar as init container with restartPolicy: Always

# GOOD: sidecar starts before the app, keeps running across app restarts, and stops after it [src7, src8]
initContainers:
  - name: log-shipper
    image: fluentbit:2.2
    restartPolicy: Always
containers:
  - name: app
    image: myapp:1.2.3
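
Several of the anti-patterns above can be caught mechanically before a manifest is applied. The following is a minimal lint sketch, assuming each container is available as a dict parsed from the manifest; the function name and warning strings are hypothetical, the liveness check is only a heuristic, and the sidecar anti-pattern is not covered.

```python
def lint_container(container: dict) -> list[str]:
    """Return warnings for unpinned images, missing resources, and unguarded liveness probes."""
    warnings = []

    image = container.get("image", "")
    last_segment = image.rsplit("/", 1)[-1]  # ignore a registry host:port prefix
    pinned_by_digest = "@" in last_segment
    tag = ""
    if ":" in last_segment and not pinned_by_digest:
        tag = last_segment.split(":", 1)[1]
    if not pinned_by_digest and tag in ("", "latest"):
        warnings.append("image tag is not pinned")

    resources = container.get("resources") or {}
    if not resources.get("requests") or not resources.get("limits"):
        warnings.append("missing resource requests or limits")

    # Heuristic: a liveness probe with no startup probe can kill slow starters.
    if "livenessProbe" in container and "startupProbe" not in container:
        warnings.append("livenessProbe without startupProbe")

    return warnings
```

Running it over every entry in spec.containers of a Deployment template gives a quick pre-deploy check; an empty list means none of these three patterns were found.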

Decision Logic

If `kubectl describe pod` shows `Reason: OOMKilled` (exit 137)

The container hit its resources.limits.memory ceiling. Profile with kubectl top pod first, then raise the memory limit (typically 1.5–2× observed working set). Do NOT just remove the limit — unbounded pods evict their neighbors. [src3, src5, src9]
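
The 1.5–2× rule of thumb can be turned into a quick calculation from a kubectl top pod reading. This sketch assumes the reading is a quantity such as 300Mi and that it reflects typical load; kubectl top shows current usage, not the peak, so the result is a starting point rather than a guarantee. The function names are hypothetical.

```python
import math

# Binary suffixes used by Kubernetes memory quantities.
UNITS = {"Ki": 1024, "Mi": 1024 ** 2, "Gi": 1024 ** 3}


def parse_memory(quantity: str) -> int:
    """Convert a quantity like '300Mi' or '1Gi' to bytes; bare numbers are bytes."""
    for suffix, factor in UNITS.items():
        if quantity.endswith(suffix):
            return int(float(quantity[: -len(suffix)]) * factor)
    return int(quantity)


def suggest_limit(observed: str, headroom: float = 1.5) -> str:
    """Suggest a memory limit in Mi as headroom times the observed working set."""
    mebibytes = parse_memory(observed) * headroom / UNITS["Mi"]
    return f"{math.ceil(mebibytes)}Mi"
```

For an observed working set of 300Mi, a headroom of 1.5 suggests 450Mi and a headroom of 2.0 suggests 600Mi.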

If exit code is 1 AND `kubectl logs --previous` shows a stack trace

This is an application-level bug, not an orchestration problem. Fix the code path, not the pod spec. Common: unhandled exception during init, missing config value, schema migration not run. [src1, src3, src11]

If exit code is 0 AND `restartPolicy: Always`

The container completed and the controller keeps restarting it. Either (a) the entrypoint is a one-shot task — convert the Deployment to a Job/CronJob; or (b) the long-running process exited cleanly — keep it in the foreground (no daemonizing, no &). [src5, src6, src9]

If exit code is 137 AND `Reason` is NOT `OOMKilled`

Something other than the container's memory limit triggered the SIGKILL, most often the kubelet after a failed liveness probe once the termination grace period expired. Check kubectl describe pod Events for “Liveness probe failed”. Fix: add startupProbe for slow starters, raise failureThreshold, or relax periodSeconds. [src2, src3]

If exit code is 127 (“command not found”) OR “exec format error”

Either the CMD/ENTRYPOINT references a binary that isn’t in the image, or the image architecture doesn’t match the node (for example, an amd64 image scheduled on an arm64 node). Confirm the node architecture with kubectl get node <node> -o jsonpath='{.status.nodeInfo.architecture}', then either rebuild as a multi-arch image or add nodeSelector: kubernetes.io/arch: amd64. [src4, src6]

If you have a sidecar container AND main container keeps crashing

K8s 1.29+ (GA in 1.33): sidecar containers (init containers with restartPolicy: Always) survive main container restarts and have separate lifecycle. The pod can appear partially ready while the main container is in CrashLoopBackOff. Debug each container individually with kubectl logs -c <name>; don’t assume pod-level signals reflect the main app. [src1, src7, src10]

If you’re on K8s 1.33+ AND the 5-minute backoff is slowing down recovery

Enable the ReduceDefaultCrashLoopBackOffDecay alpha feature gate on the kubelet (--feature-gates=ReduceDefaultCrashLoopBackOffDecay=true). This caps backoff at 60s instead of 300s, with a 1s initial delay. Per-node override is available via KubeletCrashLoopBackOffMax (K8s 1.32+ alpha). Both are alpha — do not enable in production without testing. [src8, src10]

Common Pitfalls

--previous flag forgotten: kubectl logs <pod> shows the current (just-restarted) container. Use --previous for the crash logs. [src1, src3]
Liveness probe killing healthy containers: Too-short initialDelaySeconds creates permanent CrashLoopBackOff. Use startupProbe (K8s 1.20+ GA). [src2, src3]
Exit 0 CrashLoopBackOff: restartPolicy: Always (default) restarts even successful exits. Use a Job for one-shot tasks. [src5, src6]
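ConfigMap/Secret race condition: If a pod references a ConfigMap or Secret that doesn't exist yet, the container can't be created (CreateContainerConfigError); if the app starts with incomplete config instead, it crashes. Create config resources before the workloads that use them. [src3, src5]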
Node resource pressure: Node memory/disk evicts pods. Check kubectl describe node Conditions. [src1]
Image architecture mismatch: amd64 image on arm64 node — "exec format error". Use multi-arch images. [src4]
Sidecar masking real failure: In K8s 1.29+, sidecar containers survive main container restarts. Check each container status individually. [src7]
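Backoff timer confusion: The 5-minute max backoff means long waits between restart attempts. After capturing logs and events, delete the pod so its controller recreates it and the timer resets. On K8s 1.32+, the alpha KubeletCrashLoopBackOffMax feature gate can lower the max delay. [src8]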

Diagnostic Commands

# === Find CrashLoopBackOff pods ===
kubectl get pods -A | grep CrashLoopBackOff

# === Pod details ===
kubectl describe pod <pod> -n <ns>
kubectl get pod <pod> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'

# === Logs ===
kubectl logs <pod> --previous
kubectl logs <pod> -c <container> --previous
kubectl logs <pod> -c <init-container>
kubectl logs <pod> -c <sidecar>

# === Events ===
kubectl get events -n <ns> --sort-by='.lastTimestamp' | tail -20

# === Resources ===
kubectl top pod <pod>
kubectl top nodes

# === Debugging ===
kubectl debug -it <pod> --image=busybox --target=<container>
kubectl exec -it <pod> -- /bin/sh

# === Config ===
kubectl get configmap -n <ns>
kubectl get secret -n <ns>

# === Node health ===
kubectl describe node <node> | grep -A5 "Conditions"

Version History & Compatibility

Version	Behavior	Key Changes
K8s 1.34 (Aug 2025)	Current	`ContainerRestartRules` alpha (KEP-5307) — per-container `restartPolicyRules` matching exit codes; restart in place even when pod-level `restartPolicy: Never`. Targets AI/ML batch jobs on GPUs [src10]
K8s 1.33 (Apr 2025)	Stable	Sidecar containers GA; `ReduceDefaultCrashLoopBackOffDecay` alpha — initial 1s, max 60s backoff (down from 10s/300s) [src10, src11]
K8s 1.32 (Dec 2024)	Stable	`KubeletCrashLoopBackOffMax` alpha — configurable max backoff delay (1s-300s per node) [src8]
K8s 1.29 (Dec 2023)	Stable	Sidecar containers beta; better init container lifecycle [src1, src7]
K8s 1.28	Stable	Sidecar alpha; improved probe logging [src1]
K8s 1.25	Ephemeral GA	Ephemeral debug containers GA [src1]
K8s 1.23	Debug beta	`kubectl debug` beta [src1]
K8s 1.20	startupProbe GA	startupProbe graduated to stable [src2]
K8s 1.18	startupProbe beta	startupProbe for slow-starting containers [src2]
K8s 1.16	Stable probes	Liveness/readiness stable in all workloads [src2]

When to Use / When Not to Use

Use When	Don't Use When	Use Instead
Pod shows CrashLoopBackOff	Pod stuck in Pending	Debug scheduling: `kubectl describe pod`
Container keeps restarting	Pod in ImagePullBackOff	Fix image name/registry
Exit code is non-zero	Pod running but not ready	Debug readiness probe
Events show probe failures	Container is Evicted	Check node pressure

Important Caveats

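Backoff delay doubles: 10s, 20s, 40s, 80s, 160s, then 300s (5 min max). The timer resets once a container has run for 10 minutes without problems, or when the pod is deleted and recreated (capture logs first). K8s 1.32+ allows configuring the max backoff via the alpha KubeletCrashLoopBackOffMax feature gate.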
kubectl logs --previous only shows the last crash. For earlier crashes, use centralized logging (ELK, Loki, Datadog).
Resource requests affect scheduling; limits affect runtime. Don't set them equal unless you need QoS "Guaranteed".
startupProbe disables liveness/readiness until it succeeds. A failing startup probe prevents the pod from ever being "live".
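Init containers run sequentially. If one fails, the kubelet retries it, and later init containers and the main containers wait until it succeeds. All init containers rerun only if the pod itself restarts.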
Always use resources.requests and limits in production for predictable scheduling and OOM prevention.
K8s 1.34 (Aug 2025) introduced ContainerRestartRules alpha (KEP-5307) — per-container restartPolicyRules keyed off exit codes that restart in place even when the pod’s restartPolicy: Never. Targets AI/ML batch jobs on expensive GPUs. Enable via --feature-gates=ContainerRestartRules=true.
K8s 1.33 (Apr 2025) added the ReduceDefaultCrashLoopBackOffDecay alpha feature gate. When enabled on the kubelet, restart backoff starts at 1s (was 10s) and caps at 60s (was 300s). Sequence becomes 1s → 2s → 4s → 8s → 16s → 32s → 60s. Recovery is ~5× faster but failing pods burn more CPU.
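
The two backoff sequences quoted above follow the same rule: start at an initial delay, double after each crash, and clamp at the maximum. A small sketch of that rule follows; the function name is illustrative, and it models only the delay sequence, not the kubelet's reset behavior.

```python
def backoff_schedule(initial: int = 10, cap: int = 300) -> list[int]:
    """Return the restart delays in seconds, doubling from initial until cap is reached."""
    delays = []
    delay = initial
    while delay < cap:
        delays.append(delay)
        delay *= 2
    delays.append(cap)  # every later restart waits the capped delay
    return delays
```

backoff_schedule() reproduces the default 10s to 300s sequence, and backoff_schedule(1, 60) reproduces the sequence under ReduceDefaultCrashLoopBackOffDecay. Summing the delays before the cap shows that the default reaches its maximum after about 310 seconds of accumulated waiting, versus about 63 seconds with the gate enabled.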
