How to Debug Kubernetes Pods Stuck in Pending State

Type: Software Reference | Confidence: 0.94 | Sources: 8 | Verified: 2026-02-23 | Freshness: quarterly

TL;DR

Constraints

Quick Reference

| # | Cause | Likelihood | Scheduler Message | Fix |
|---|-------|------------|-------------------|-----|
| 1 | Insufficient CPU/memory | ~35% | "Insufficient cpu/memory" | Reduce requests, add nodes [src1, src4] |
| 2 | Taints without tolerations | ~20% | "had taint pod didn't tolerate" | Add toleration or remove taint [src2, src4] |
| 3 | PVC not bound | ~15% | "unbound PersistentVolumeClaims" | Fix StorageClass, provision PV [src3, src6] |
| 4 | nodeSelector mismatch | ~12% | "didn't match node affinity/selector" | Fix selector or label nodes [src4, src7] |
| 5 | Node affinity mismatch | ~5% | "didn't match node affinity" | Fix affinity or add nodes [src4, src7] |
| 6 | Pod anti-affinity | ~4% | "didn't match anti-affinity" | Reduce replicas or add nodes [src4, src7] |
| 7 | Node cordoned | ~3% | "nodes were unschedulable" | kubectl uncordon <node> [src3, src4] |
| 8 | ResourceQuota exceeded | ~2% | "exceeded quota" | Increase quota [src1, src3] |
| 9 | PodDisruptionBudget blocking | ~2% | "disruption budget violated" | Adjust PDB [src3] |
| 10 | Topology spread constraints | ~1% | "didn't satisfy topology spread" | Relax constraints [src7] |
| 11 | Node condition (DiskPressure/MemoryPressure) | ~1% | "node(s) had condition" | Resolve node pressure [src4, src5] |
| 12 | Scheduler not running | <1% | No events at all | Check kube-scheduler [src1] |
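The Scheduler Message column maps mechanically to a cause, which makes triage scriptable. A minimal sketch (the `triage` helper and its cause labels are illustrative, not a kubectl feature; the substrings come from the table above):

```python
# Map scheduler-event message fragments to likely causes, mirroring the
# Quick Reference table. Checks are case-insensitive substring matches,
# ordered roughly by likelihood.
CAUSES = [
    ("insufficient", "Insufficient CPU/memory"),
    ("taint", "Taints without tolerations"),
    ("persistentvolumeclaim", "PVC not bound"),
    ("affinity", "Affinity/selector mismatch"),
    ("selector", "Affinity/selector mismatch"),
    ("unschedulable", "Node cordoned"),
    ("exceeded quota", "ResourceQuota exceeded"),
    ("topology spread", "Topology spread constraints"),
    ("had condition", "Node condition (DiskPressure/MemoryPressure)"),
]

def triage(event_message):
    """Return the likely cause for a FailedScheduling event message."""
    msg = event_message.lower()
    for fragment, cause in CAUSES:
        if fragment in msg:
            return cause
    return "Unknown: check kube-scheduler"
```

Feed it the last FailedScheduling event message from `kubectl describe pod`; an empty result falls through to the "check kube-scheduler" case, matching row 12.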

Decision Tree

START — Pod stuck in Pending
├── kubectl describe pod → Check Events section
│   ├── "Insufficient cpu/memory" → Check requests vs allocatable [src1, src4]
│   │   └── K8s 1.35+? → Consider In-Place Pod Resize for running pods [src8]
│   ├── "taint pod didn't tolerate" → Add toleration or remove taint [src2]
│   ├── "unbound PersistentVolumeClaims" → Fix PVC/StorageClass [src6]
│   ├── "didn't match node affinity/selector" → Fix labels/selectors [src7]
│   ├── "nodes were unschedulable" → kubectl uncordon [src3]
│   ├── "exceeded quota" → Increase ResourceQuota [src1]
│   ├── "didn't satisfy topology spread" → Relax maxSkew or ScheduleAnyway [src7]
│   ├── "node(s) had condition" → Check DiskPressure/MemoryPressure [src4]
│   └── No events → Check kube-scheduler is running [src1]
├── Managed K8s? → Check Cluster Autoscaler events [src4]
└── kubectl get events -A --sort-by='.lastTimestamp'

Step-by-Step Guide

1. Identify the Pending pods

Find which pods are stuck. [src1, src3]

kubectl get pods -A --field-selector=status.phase=Pending
kubectl get pods -o wide | grep Pending

2. Read the scheduler events

This is the most important step: scheduler events state exactly why the pod cannot be placed. [src1, src3, src4]

kubectl describe pod <pod> -n <ns>
kubectl get events -n <ns> --field-selector involvedObject.name=<pod>

3. Check node resources

Compare requests against allocatable. [src1, src4, src5]

kubectl top nodes
kubectl describe nodes | grep -A5 "Allocatable:"
kubectl describe node <node> | grep -A20 "Allocated resources"
kubectl get pod <pod> -o jsonpath='{.spec.containers[*].resources.requests}'
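The scheduler's fit check is simple arithmetic: a pod fits a node when each request is at most allocatable minus what is already requested there. A minimal sketch, handling only the common quantity suffixes (`parse_quantity` and `fits` are illustrative names, not library functions):

```python
# Sketch of the scheduler's resource check. CPU is normalized to
# millicores, memory to bytes; only common suffixes are handled.
UNITS = {"m": 1, "Ki": 1024, "Mi": 1024**2, "Gi": 1024**3}

def parse_quantity(q):
    """Parse a Kubernetes quantity: CPU -> millicores, memory -> bytes."""
    q = str(q)
    for suffix, factor in UNITS.items():
        if q.endswith(suffix):
            return int(q[:-len(suffix)]) * factor
    return int(float(q) * 1000)  # bare value treated as CPU cores -> millicores

def fits(request, allocatable, already_requested):
    """True when the request fits in allocatable minus current requests."""
    return parse_quantity(request) <= parse_quantity(allocatable) - already_requested
```

For example, on a 2-core node with 1600m already requested, a 250m pod fits but a 500m pod does not, which is exactly the arithmetic behind an "Insufficient cpu" event.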

4. Check taints and tolerations

Taints repel pods without matching tolerations. [src2, src4]

kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints[*].key
kubectl get pod <pod> -o jsonpath='{.spec.tolerations}'
kubectl taint node <node> key=value:NoSchedule-    # Remove taint
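The matching rule itself is small: a toleration matches a taint on key, effect, and, for operator `Equal`, value; an empty key with operator `Exists` tolerates everything. A sketch with an illustrative function name, using plain dicts in place of the API objects:

```python
# Sketch of taint/toleration matching. `taint` and each toleration are
# dicts with "key", "value", "effect", "operator" fields, as in pod spec.
def tolerates(taint, tolerations):
    """True when any toleration in the list matches the taint."""
    for t in tolerations:
        op = t.get("operator", "Equal")
        # An empty effect on the toleration matches all effects.
        if t.get("effect") and t["effect"] != taint["effect"]:
            continue
        # Empty key with Exists tolerates every taint.
        if op == "Exists" and not t.get("key"):
            return True
        if t.get("key") != taint["key"]:
            continue
        if op == "Exists" or t.get("value") == taint.get("value"):
            return True
    return False
```

This is why a pod with no `tolerations` stanza is repelled by every `NoSchedule` taint: the loop body never runs.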

5. Fix PVC binding issues

PVCs must bind before pods can schedule. [src3, src6]

kubectl get pvc -n <ns>
kubectl describe pvc <name> -n <ns>
kubectl get pv
kubectl get storageclass
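If the PVC stays Pending because no PV satisfies it, the StorageClass is often the culprit. An illustrative example (the name is hypothetical and the provisioner is an example CSI driver you would replace for your platform); `WaitForFirstConsumer` delays binding until the pod is scheduled, so the PV lands in a zone the pod can actually reach: [src6]

```yaml
# Illustrative StorageClass; name and provisioner are placeholders.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd                  # hypothetical name
provisioner: ebs.csi.aws.com      # example CSI driver; replace for your platform
volumeBindingMode: WaitForFirstConsumer
```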

6. Fix nodeSelector and affinity

Selectors must match actual node labels. [src4, src7]

kubectl get pod <pod> -o jsonpath='{.spec.nodeSelector}'
kubectl get nodes --show-labels
kubectl label node <node> disktype=ssd
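`nodeSelector` is a plain equality match: every key/value pair on the pod must appear among the node's labels. A one-function sketch (illustrative name):

```python
# Sketch of nodeSelector matching: the pod's selector must be a subset
# of the node's labels, compared by exact key and value equality.
def selector_matches(node_selector, node_labels):
    """True when every selector entry appears verbatim in the labels."""
    return all(node_labels.get(k) == v for k, v in node_selector.items())
```

An empty selector matches every node, and a single missing or misspelled label value fails the whole match, which is why typos in `kubectl label` commands show up as "didn't match node selector".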

7. Check ResourceQuota and node cordon

Quotas and cordoned nodes block scheduling. [src1, src3]

kubectl get resourcequota -n <ns>
kubectl get nodes     # "SchedulingDisabled" = cordoned
kubectl uncordon <node>
kubectl describe limitrange -n <ns>
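For reference, a minimal ResourceQuota (names and amounts are placeholders). Requests and limits of every pod in the namespace are counted against `hard`; once a line is exhausted, new pods fail to create with an "exceeded quota" message: [src1, src3]

```yaml
# Illustrative ResourceQuota; name, namespace, and amounts are placeholders.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: my-ns
spec:
  hard:
    requests.cpu: "10"
    requests.memory: 20Gi
    limits.cpu: "20"
    limits.memory: 40Gi
```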

Code Examples

Comprehensive Pending pod diagnostic script

#!/bin/bash
POD="$1"; NS="${2:-default}"
if [ -z "$POD" ]; then
    echo "Usage: $0 <pod> [ns]"
    kubectl get pods -A --field-selector=status.phase=Pending; exit 1
fi

echo "=== Pending Pod Diagnostic: $POD (ns: $NS) ==="
kubectl get pod "$POD" -n "$NS" -o wide

echo "=== Resource Requests ==="
kubectl get pod "$POD" -n "$NS" -o jsonpath='{range .spec.containers[*]}  {.name}: cpu={.resources.requests.cpu} mem={.resources.requests.memory}{"\n"}{end}'

echo "=== Node Selector ==="
kubectl get pod "$POD" -n "$NS" -o jsonpath='{.spec.nodeSelector}'

echo "=== Tolerations ==="
kubectl get pod "$POD" -n "$NS" -o jsonpath='{range .spec.tolerations[*]}  {.key}={.value}:{.effect}{"\n"}{end}'

echo "=== Node Resources ==="
kubectl get nodes -o custom-columns=NAME:.metadata.name,CPU:.status.allocatable.cpu,MEM:.status.allocatable.memory,TAINTS:.spec.taints[*].key

echo "=== PVCs ==="
PVCS=$(kubectl get pod "$POD" -n "$NS" -o jsonpath='{.spec.volumes[*].persistentVolumeClaim.claimName}')
for PVC in $PVCS; do
    kubectl get pvc "$PVC" -n "$NS" 2>/dev/null
done

echo "=== Scheduler Events ==="
kubectl get events -n "$NS" --field-selector "involvedObject.name=$POD" --sort-by='.lastTimestamp' | tail -10

# Auto-diagnosis
EVENTS=$(kubectl get events -n "$NS" --field-selector "involvedObject.name=$POD" --sort-by='.lastTimestamp' -o jsonpath='{.items[-1:].message}')
if echo "$EVENTS" | grep -qi "insufficient"; then echo "DIAGNOSIS: Insufficient resources"
elif echo "$EVENTS" | grep -qi "taint"; then echo "DIAGNOSIS: Taint/toleration mismatch"
elif echo "$EVENTS" | grep -qi "PersistentVolumeClaim"; then echo "DIAGNOSIS: PVC not bound"
elif echo "$EVENTS" | grep -qi "affinity\|selector"; then echo "DIAGNOSIS: Selector/affinity mismatch"
elif echo "$EVENTS" | grep -qi "unschedulable"; then echo "DIAGNOSIS: Node cordoned"
elif echo "$EVENTS" | grep -qi "condition"; then echo "DIAGNOSIS: Node condition issue"
elif [ -z "$EVENTS" ]; then echo "DIAGNOSIS: No events — check kube-scheduler"
fi

Production-ready pod with proper scheduling config

apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api-server
  template:
    metadata:
      labels:
        app: api-server
    spec:
      containers:
        - name: api
          image: myapp:1.2.3
          resources:
            requests: { cpu: 250m, memory: 256Mi }
            limits: { cpu: 500m, memory: 512Mi }
      nodeSelector:
        kubernetes.io/os: linux
      tolerations:
        - key: "dedicated"
          operator: "Equal"
          value: "api"
          effect: "NoSchedule"
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 80
              preference:
                matchExpressions:
                  - key: disktype
                    operator: In
                    values: ["ssd"]
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app: api-server
                topologyKey: kubernetes.io/hostname
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels:
              app: api-server

Cluster capacity audit script

#!/usr/bin/env python3
import subprocess, json

def run_kubectl(cmd):
    result = subprocess.run(f"kubectl {cmd} -o json".split(), capture_output=True, text=True)
    return json.loads(result.stdout) if result.returncode == 0 else None

def parse_resource(value):
    """Parse a Kubernetes quantity: CPU -> millicores, memory -> bytes."""
    if not value: return 0
    value = str(value)
    if value.endswith("m"): return int(value[:-1])            # millicores
    if value.endswith("Ki"): return int(value[:-2]) * 1024    # node memory is usually reported in Ki
    if value.endswith("Mi"): return int(value[:-2]) * 1024**2
    if value.endswith("Gi"): return int(value[:-2]) * 1024**3
    try: return int(float(value) * 1000)                      # bare CPU cores -> millicores
    except ValueError: return 0

def audit():
    nodes = run_kubectl("get nodes")
    pods = run_kubectl("get pods -A")
    if not nodes or not pods: print("Cannot access cluster"); return

    node_req = {}
    for pod in pods.get("items", []):
        node = pod.get("spec", {}).get("nodeName")
        if not node or pod["status"].get("phase") != "Running": continue
        if node not in node_req: node_req[node] = {"cpu": 0, "mem": 0}
        for c in pod["spec"].get("containers", []):
            r = c.get("resources", {}).get("requests", {})
            node_req[node]["cpu"] += parse_resource(r.get("cpu", "0"))
            node_req[node]["mem"] += parse_resource(r.get("memory", "0"))

    for node in nodes["items"]:
        name = node["metadata"]["name"]
        alloc = node["status"]["allocatable"]
        taints = [t["key"] for t in node["spec"].get("taints", [])]
        req = node_req.get(name, {"cpu": 0, "mem": 0})
        print(f"{name}: CPU free={parse_resource(alloc['cpu'])-req['cpu']}m "
              f"Mem free={((parse_resource(alloc['memory'])-req['mem'])/1024**2):.0f}Mi "
              f"Taints={taints or 'none'}")

if __name__ == "__main__":
    audit()

Anti-Patterns

Wrong: Requesting more resources than any node has

# BAD — no node has 64Gi allocatable [src1, src4]
resources:
  requests:
    cpu: "16"
    memory: 64Gi

Correct: Size requests based on node capacity

# GOOD — fits within typical node [src1, src4]
resources:
  requests: { cpu: 500m, memory: 512Mi }
  limits: { cpu: "1", memory: 1Gi }

Wrong: nodeSelector for non-existent labels

# BAD — label doesn't exist on any node [src4, src7]
nodeSelector:
  gpu-type: a100

Correct: Verify labels exist first

# GOOD — check labels, then set selector [src4, src7]
kubectl get nodes --show-labels | grep gpu-type
kubectl label node worker-1 gpu-type=a100

Wrong: No tolerations for tainted nodes

# BAD — all nodes tainted, no toleration [src2, src4]
spec:
  containers:
    - name: app
      image: myapp
  # No tolerations!

Correct: Add matching tolerations

# GOOD — toleration matches taint [src2, src4]
spec:
  tolerations:
    - key: "dedicated"
      operator: "Equal"
      value: "special"
      effect: "NoSchedule"

Wrong: RequiredDuringScheduling anti-affinity with too many replicas

# BAD — 5 replicas with required anti-affinity on 3-node cluster [src7]
spec:
  replicas: 5
  template:
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: myapp
              topologyKey: kubernetes.io/hostname
# 2 pods will be Pending forever

Correct: Use preferredDuringScheduling anti-affinity

# GOOD — preferred anti-affinity allows co-location as fallback [src7]
spec:
  replicas: 5
  template:
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app: myapp
                topologyKey: kubernetes.io/hostname

Common Pitfalls

Diagnostic Commands

# === Find Pending pods ===
kubectl get pods -A --field-selector=status.phase=Pending

# === Pod details ===
kubectl describe pod <pod> -n <ns>

# === Node resources ===
kubectl top nodes
kubectl describe nodes | grep -A5 "Allocatable:"

# === Taints ===
kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints[*].key

# === Labels ===
kubectl get nodes --show-labels

# === PVC ===
kubectl get pvc -n <ns>
kubectl get pv
kubectl get storageclass

# === Quotas ===
kubectl get resourcequota -n <ns>
kubectl describe limitrange -n <ns>

# === Schedulability ===
kubectl get nodes     # SchedulingDisabled = cordoned
kubectl uncordon <node>

# === Node conditions ===
kubectl describe node <node> | grep -A5 "Conditions:"

# === Cluster Autoscaler (managed K8s) ===
kubectl get events -n kube-system | grep cluster-autoscaler

Version History & Compatibility

| Version | Behavior | Key Changes |
|---------|----------|-------------|
| K8s 1.35 (2025-12) | Current | In-Place Pod Resize GA; gang scheduling alpha; mutable PV node affinity [src8] |
| K8s 1.34 (2025-08) | Stable | Async scheduler API; nominatedNodeName for more pods [src1] |
| K8s 1.32 (2024-12) | Stable | QueueingHint beta (faster Pending pod requeue) [src1] |
| K8s 1.29+ (2024) | Stable | Improved scheduling hints; sidecar containers beta [src1] |
| K8s 1.27 (2023) | Stable | In-place resource resize alpha [src8] |
| K8s 1.24 (2022) | Stable | Non-graceful node shutdown; PV topology [src6] |
| K8s 1.19 (2020) | TopologySpread GA | PodTopologySpreadConstraints GA [src7] |
| K8s 1.18 (2020) | WaitForFirstConsumer | More StorageClasses default to delayed binding [src6] |

When to Use / When Not to Use

| Use When | Don't Use When | Use Instead |
|----------|----------------|-------------|
| Pod shows Pending status | Pod shows CrashLoopBackOff | Debug container crash (logs, exit code) |
| Events mention scheduling failures | Pod shows ContainerCreating | Wait; or debug image pull / volume |
| No node selected for pod | Pod Running but not Ready | Debug readiness probe |
| PVC stuck in Pending | Pod Evicted | Check node pressure conditions |
| Cluster Autoscaler not scaling up | Pod OOMKilled | Debug memory limits (not scheduling) |

Important Caveats

Related Units