DevOps Debugging Part 4: ps
This is a multi-part series in which we explore essential Unix commands for debugging applications. These skills are critical when an outage occurs or something doesn’t work as expected. The series is aimed at DevOps engineers, SREs and Linux sysadmins.
In this part we are going to cover ps. This command is a great way to inspect Linux processes and find bottlenecks during outages. When a server is under high load or unavailable, the usual first step is to find the cause and then a mitigation. With ps we can identify parent and child processes to narrow down our targets for a root cause analysis. Keep in mind that we will not cover everything the command can do, only how to use it to debug servers and applications.
Installation
To install ps on Red Hat/CentOS, Ubuntu or macOS, run:
# redhat/centos/amazon linux
$ yum install procps
# ubuntu
$ apt-get install procps
# OSX/Mac (usually already installed)
# test for installation
$ ps --help
If you get a “command not found” error, please reach out in the comments section below.
Usage
List all running processes inside an instance or Docker container.
$ ps ax
  PID TTY      STAT   TIME COMMAND
    1 ?        Ss     0:03 /sbin/init
    2 ?        S      0:00 [kthreadd]
    3 ?        I<     0:00 [rcu_gp]
    4 ?        I<     0:00 [rcu_par_gp]
    5 ?        I<     0:00 [netns]
    6 ?        I      0:00 [kworker/0:0-events]
...
We can see the PID (Process ID), STAT (Process state) and the command that is running. Let’s look for a particular process.
$ ps ax | grep mongo
449 ? Ssl 0:56 /usr/bin/mongod --config /etc/mongod.conf
As we can see above, I can use grep to search for a particular process. In my case the MongoDB daemon is running with the above command. With this I can confirm that the process itself is running, so if it isn’t responding I can try a few things.
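One caveat with piping ps into grep: the grep process can show up in its own results, because its command line contains the search term. Below is a minimal sketch of a way around that, assuming pgrep (shipped in the same procps package on Linux) is available. The function name find_proc is a hypothetical helper, not something ps provides.

```shell
# find_proc is a hypothetical helper name, not part of ps or pgrep.
# It prints PID, parent PID, state and full command line for every
# process whose command line matches the given pattern.
find_proc() {
    # pgrep never reports itself. -f matches against the full command line,
    # -d, joins the matching PIDs with commas so ps -p accepts them as a list.
    pids=$(pgrep -d, -f -- "$1") || return 1   # pgrep exits non-zero on no match
    ps -o pid,ppid,stat,args -p "$pids"
}
```

With this loaded, find_proc mongo would list mongod without a stray grep line. Without a helper, the common trick ps ax | grep '[m]ongo' has the same effect, since the bracketed pattern no longer matches its own text on grep's command line.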
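# I can try to restart the process
$ sudo systemctl restart mongod
# If a restart is not working, I can try to kill
$ sudo kill 449
# kill by force
$ sudo kill -9 449
# If the process is in Z (zombie) state it is already dead and kill has no effect;
# restart or kill its parent so it gets reaped, and reboot only as a last resort
$ sudo reboot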
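The escalation above (ask nicely first, force only if needed) can be wrapped in a small function. This is a sketch under assumptions: the name stop_gracefully and the 10 second default timeout are made up for illustration, and for a systemd-managed service such as mongod, systemctl stop is usually the better first choice.

```shell
# stop_gracefully is a hypothetical helper: send SIGTERM, wait up to
# a timeout, and send SIGKILL only if the process is still there.
# Usage: stop_gracefully PID [TIMEOUT_SECONDS]
stop_gracefully() {
    pid=$1
    timeout=${2:-10}   # seconds to wait before forcing; 10 is an arbitrary default

    # Ask the process to shut down; return 1 if the signal could not be sent
    kill -TERM "$pid" 2>/dev/null || return 1

    while [ "$timeout" -gt 0 ]; do
        # kill -0 sends no signal, it only checks that the PID still exists
        kill -0 "$pid" 2>/dev/null || return 0
        sleep 1
        timeout=$((timeout - 1))
    done

    # Still alive after the timeout: force it
    kill -KILL "$pid" 2>/dev/null
}
```

Run it with sudo privileges when the target belongs to another user, for example as root: stop_gracefully 449 30.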
Sometimes if a process is killed, it can be respawned by the parent process. Let’s find out the parent of our mongodb process.
$ ps -ef | grep mongo
mongodb 449 1 0 09:27 ? 00:00:58 /usr/bin/mongod --config /etc/mongod.conf
We can see from the above output that the parent process ID (the third column) is 1. PID 1 is not the kernel but the init process, the first userspace process the kernel starts at boot. On this server that is systemd, so systemd started mongod as a service. We should never try to kill PID 1, but if we had rogue parent processes, we could get their PIDs and kill them.
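Reading the parent PID out of ps -ef by eye works for one process, but when hunting a rogue parent it helps to walk the whole chain. Here is a minimal sketch using only the standard -o and -p options of ps. The helper names parent_of and ancestry are hypothetical.

```shell
# parent_of and ancestry are hypothetical helper names.

# Print the parent PID of the given PID (empty if the process is gone).
parent_of() {
    # "ppid=" selects the parent PID column and suppresses the header
    ps -o ppid= -p "$1" | tr -d ' '
}

# Print PID, parent PID and command line for a process and each of its
# ancestors, stopping once the parent PID is 0 or the process vanished.
ancestry() {
    pid=$1
    while [ -n "$pid" ] && [ "$pid" -gt 0 ]; do
        # Separate -o options keep the header-less form portable
        ps -o pid= -o ppid= -o args= -p "$pid"
        pid=$(parent_of "$pid")
    done
}
```

For the mongod example, ancestry 449 would print the mongod line followed by the line for PID 1.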
Debugging
During an outage ps can be a great way to determine whether a process is running or not. SSH into the server, run ps and see if the process shows up. It might not be the Python or Java process that is at fault; it could be that Docker is not running or that kubelet has died. Check whether disk space is running out, or check the logs to see why the process died. Once the root cause is determined, restart the process with systemctl.
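The first two checks described here (is the process there, is the disk full) are easy to script. The sketch below assumes a Linux host; the function names and the 90 percent threshold are illustrative choices, not a standard.

```shell
# is_running and disk_usage_over are hypothetical helper names.

# Succeed if at least one process has exactly this command name.
# Note: on Linux the comm field is truncated to 15 characters.
is_running() {
    ps -e -o comm= | grep -qx -- "$1"
}

# Print "mountpoint usage%" for every filesystem at or above the given
# percentage (default 90). df -P gives one stable, parseable line per filesystem.
disk_usage_over() {
    df -P | awk -v limit="${1:-90}" '
        NR > 1 {
            use = $5
            sub(/%/, "", use)              # strip the percent sign
            if (use + 0 >= limit) print $6, $5
        }'
}

# Example triage, left as comments so nothing runs on load:
#   for name in mongod dockerd kubelet; do
#       is_running "$name" || echo "$name is NOT running"
#   done
#   disk_usage_over 90
#   journalctl -u mongod --since "1 hour ago" --no-pager | tail -n 50
```

The journalctl line in the comments assumes the service logs to the systemd journal; a service that writes its own log file needs tail on that file instead.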
Alternatives
ps is super powerful at the low level for tracking down bottlenecks in a given Docker container or Kubernetes worker. However, there are better alternatives.
htop gives a much more detailed view of what is running on a server. We can see how much CPU or memory each process is taking, plus aggregate CPU/memory, alongside the PID. The only downside of htop is that when a server is under high load or hung, it might not be possible to load the interface, since it needs more processing power than ps.
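When the box is too loaded for an interactive viewer, ps can still produce a one-shot snapshot of the heaviest processes. The following is a sketch, with top_by as a hypothetical helper name. It sorts with the standard sort utility and does not rely on any ps-specific sorting flag.

```shell
# top_by is a hypothetical helper name.
# Usage: top_by cpu|mem [COUNT]
# Prints a header plus the COUNT (default 10) processes using the most CPU or memory.
top_by() {
    col=$1
    n=${2:-10}
    case $col in
        cpu) key=2 ;;   # second column is %CPU
        mem) key=3 ;;   # third column is %MEM
        *) echo "usage: top_by cpu|mem [count]" >&2; return 2 ;;
    esac
    printf '%6s %5s %5s %s\n' PID %CPU %MEM COMMAND
    # Header-less ps output, sorted numerically in descending order on the chosen column
    ps -e -o pid= -o pcpu= -o pmem= -o comm= |
        LC_ALL=C sort -k "${key},${key}nr" |
        head -n "$n"
}
```

For example, top_by mem 5 lists the five processes holding the most memory. Keep in mind that the CPU figure ps reports is an average over the lifetime of the process, not the live reading htop shows.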
Similarly, top is the poor man’s version of htop. It shows pretty much the same information but is more lightweight. A bunch of alternatives have popped up in recent years, for example gtop, vtop and ptop, but keep in mind that these require external dependencies such as Python or npm to install.
Conclusion
In the next part we are going to cover less for debugging applications. These parts will be released on a weekly basis. If you want to skip the queue, please buy the book here:
https://www.amazon.com/dp/B0BJC4Y1N1
Please leave comments and share your outage debugging stories.