Pod Troubleshooting Commands

This guide provides actionable commands and best practices for troubleshooting pods in Kubernetes clusters (AKS, EKS, GKE, and on-prem). Use these steps for real-life incident response and GitOps workflows.

Common Troubleshooting Commands

List all Pods in all Namespaces:

kubectl get pods --all-namespaces

Check Resource Consumption:

kubectl top pods --all-namespaces

Describe a Pod:

kubectl describe pod <pod-name> -n <namespace>

View Pod Logs:

kubectl logs <pod-name> -n <namespace>

Follow Pod Logs (stream in real-time):

kubectl logs -f <pod-name> -n <namespace>

Exec into a Pod:

kubectl exec -it <pod-name> -n <namespace> -- <command>

Get Events for a Pod:

kubectl get events --field-selector involvedObject.name=<pod-name> -n <namespace>

Check Pod Health (Readiness/Liveness):

kubectl describe pod <pod-name> -n <namespace> | grep -iE 'readiness|liveness|conditions'

Retrieve Pod IP and Node:

kubectl get pod <pod-name> -n <namespace> -o wide

Restart a Pod (the Pod is recreated only if a controller such as a Deployment or StatefulSet manages it):

kubectl delete pod <pod-name> -n <namespace>
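
Deleting a pod restarts the workload only when a controller recreates it. For pods owned by a Deployment, a rolling restart is often the gentler option. The helper below is a minimal sketch: the function name `restart_deployment` and the 120-second timeout are assumptions chosen for illustration, not part of kubectl.

```shell
# Hypothetical helper: restart every pod of a Deployment with a rolling
# update, then wait for the rollout to finish (or time out).
# Usage: restart_deployment <deployment-name> <namespace>
restart_deployment() {
  kubectl rollout restart "deployment/$1" -n "$2" &&
    kubectl rollout status "deployment/$1" -n "$2" --timeout=120s
}
```

Example: `restart_deployment web default` restarts the Deployment named `web` in the `default` namespace.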

Check Pod Status (STATUS, READY, and RESTARTS columns):

kubectl get pod <pod-name> -n <namespace> -o wide

List Pod Events (sorted):

kubectl get events --field-selector involvedObject.name=<pod-name> -n <namespace> --sort-by='.metadata.creationTimestamp'
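
During an incident, Normal events can bury the ones that matter. As a sketch, the hypothetical helper `pod_warnings` narrows the same query to Warning events by adding `type=Warning` to the field selector, and sorts on `.lastTimestamp` so repeated events appear where they last fired. The function name is an assumption.

```shell
# Hypothetical helper: list only Warning events for one pod, oldest first.
# Usage: pod_warnings <pod-name> <namespace>
pod_warnings() {
  kubectl get events -n "$2" \
    --field-selector "involvedObject.name=$1,type=Warning" \
    --sort-by='.lastTimestamp'
}
```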

Verify Pod Affinity/Anti-Affinity:

kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.affinity}'

Check Resource Requests and Limits:

kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.containers[*].resources}'

Show the Latest Event for a Stuck Pod:

kubectl get events --field-selector involvedObject.name=<pod-name> -n <namespace> --sort-by='.metadata.creationTimestamp' | tail -n 1

Real-Life Troubleshooting Workflow

Identify the failing pod:
```
kubectl get pods -A | grep -Ev 'Running|Completed'
```
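
A text filter on the STATUS column alone can miss pods that report Running while some of their containers are not ready. The sketch below defines a hypothetical helper, `unhealthy_pods`, that also compares the two halves of the READY column. It assumes the default column order of `kubectl get pods` (NAMESPACE, NAME, READY, STATUS, ...).

```shell
# Hypothetical helper (not part of kubectl): print pods that look unhealthy.
# Assumes the default columns of `kubectl get pods --all-namespaces`:
#   NAMESPACE NAME READY STATUS RESTARTS AGE
unhealthy_pods() {
  kubectl get pods --all-namespaces --no-headers | awk '
    {
      split($3, ready, "/")          # READY looks like "1/2"
      if ($4 == "Completed") next    # finished Jobs are not failures
      if ($4 != "Running" || ready[1] != ready[2])
        print $1 "/" $2 "\t" $4 "\t" $3
    }'
}
```

Each output line has the form `namespace/pod`, status, and ready count, separated by tabs.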

Check pod status and events:

kubectl describe pod <pod-name> -n <namespace>
kubectl get events --field-selector involvedObject.name=<pod-name> -n <namespace>

Inspect logs:
```
kubectl logs <pod-name> -n <namespace>
```
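
When a container is crash-looping, the current instance may have logged nothing yet, and the useful messages sit in the previous instance. The following sketch wraps both lookups in one hypothetical helper, `pod_logs`. The default of 100 lines is an arbitrary assumption.

```shell
# Hypothetical helper: show recent logs for every container in a pod, then
# try the previous container instances, which is where a crash-looping
# container usually leaves its last messages.
# Usage: pod_logs <pod-name> <namespace> [line-count]
pod_logs() {
  pod="$1"; ns="$2"; lines="${3:-100}"
  echo "--- current logs ---"
  kubectl logs "$pod" -n "$ns" --all-containers --tail="$lines"
  echo "--- previous instance (if any) ---"
  # --previous fails when nothing has restarted; report that and carry on.
  kubectl logs "$pod" -n "$ns" --all-containers --previous --tail="$lines" 2>/dev/null ||
    echo "no previous container logs available"
}
```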

Check resource usage:

kubectl top pod <pod-name> -n <namespace>

Exec into the pod for deeper inspection:

kubectl exec -it <pod-name> -n <namespace> -- /bin/sh
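
Minimal or distroless images often ship without `/bin/sh`, in which case `kubectl exec` has nothing to run. One alternative is an ephemeral debug container attached with `kubectl debug`. This is a sketch under assumptions: the cluster supports ephemeral containers, the `busybox` image can be pulled, and the function name `debug_pod` is made up for illustration.

```shell
# Hypothetical helper: attach a throwaway busybox container to a running pod
# and share the process namespace of the target container.
# Swap busybox for any debug image your registry allows.
# Usage: debug_pod <pod-name> <namespace> <container-name>
debug_pod() {
  kubectl debug -it "$1" -n "$2" --image=busybox --target="$3" -- sh
}
```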

Review affinity, resource limits, and node assignment:

kubectl get pod <pod-name> -n <namespace> -o yaml | grep -iE -A5 'affinity:|resources:|nodeName:'

If using GitOps: Check if the manifest in Git matches the running pod. If not, investigate drift or failed syncs (ArgoCD/Flux dashboards).
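
Outside the ArgoCD or Flux dashboards, drift can also be checked from a Git checkout with `kubectl diff`, which exits with 0 when the live state matches, 1 when it differs, and a higher code on error. The helper below is a minimal sketch. The name `check_drift` and the manifest path passed to it are assumptions about your repository layout.

```shell
# Hypothetical helper: compare a manifest from the Git checkout with the
# live cluster state and summarize the result.
# Usage: check_drift <path-to-manifest>
check_drift() {
  status=0
  kubectl diff -f "$1" || status=$?
  case "$status" in
    0) echo "in sync: live state matches $1" ;;
    1) echo "drift detected: live state differs from $1" ;;
    *) echo "kubectl diff failed with exit code $status" >&2 ;;
  esac
  return "$status"
}
```

The function returns the same exit code as `kubectl diff`, so it can gate a CI step or a pre-sync check.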

Best Practices (2025)

- Always check pod events and logs before restarting or deleting pods
- Use kubectl get events sorted by timestamp for recent issues
- Validate resource requests/limits to avoid OOMKilled or throttling
- Use LLMs (Copilot, Claude) to generate troubleshooting scripts or analyze logs
- Document recurring issues and solutions in your team knowledge base
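
To confirm whether memory limits are the cause of restarts, the last termination reason of each container can be read from the pod status. The two helpers below are a sketch, and both function names are hypothetical. They read `.status.containerStatuses[*].lastState.terminated.reason`, which is empty for containers that have never been terminated.

```shell
# Hypothetical helper: print "<container>\t<last termination reason>" per
# container; the reason is empty if the container never terminated.
# Usage: last_exit_reasons <pod-name> <namespace>
last_exit_reasons() {
  kubectl get pod "$1" -n "$2" -o \
    jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.lastState.terminated.reason}{"\n"}{end}'
}

# Hypothetical helper: succeed if any container was last killed for
# exceeding its memory limit.
# Usage: was_oom_killed <pod-name> <namespace>
was_oom_killed() {
  last_exit_reasons "$1" "$2" | grep -q 'OOMKilled'
}
```

Example: `was_oom_killed web-1 default && echo "raise the memory limit or fix the leak"`.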

Common Pitfalls

- Ignoring events (often contain the root cause)
- Restarting pods without root cause analysis
- Not checking for node-level issues (disk, network, taints)
- Manual changes outside Git in GitOps-managed clusters
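
For the node-level pitfall, a quick check is to look up the node that hosts the pod and print its conditions (such as DiskPressure, MemoryPressure, and Ready) and taints. The helper below is a minimal sketch, and the name `node_checks` is hypothetical.

```shell
# Hypothetical helper: show conditions and taints of the node running a pod.
# Usage: node_checks <pod-name> <namespace>
node_checks() {
  node="$(kubectl get pod "$1" -n "$2" -o jsonpath='{.spec.nodeName}')"
  if [ -z "$node" ]; then
    echo "pod $1 is not scheduled on any node yet"
    return 1
  fi
  echo "node: $node"
  echo "conditions:"
  kubectl get node "$node" -o \
    jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}'
  echo "taints:"
  kubectl get node "$node" -o \
    jsonpath='{range .spec.taints[*]}{.key}={.value}:{.effect}{"\n"}{end}'
}
```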

Last updated 1 day ago
