Unusual Suspects - in heavy I/O systems

The other day, I was called aside to handle one of those issues where system utilization had climbed well above normal levels. Of course, it was one of those systems that has its hand in everything. The only warnings coming from the server were for elevated CPU utilization and system wait times. So, it was time to put on the Sherlock Holmes hat and grab a magnifying glass!

The Suspects
I began by attaching myself to the system and went through the preliminary routine of looking for the usual suspects. After a couple of passes of 'ps' and 'vmstat' showing a high number in the 'b' (blocked) column, I narrowed the suspects down to multiple rsync processes, which looked like the following:


  USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
  root      23453  0.0  0.2   1024  7216 ?        D    May17   0:00
	rsync -aupSz --stats --progress --include=* --include=.*
	/repository/http_docs/ /system3/repository3/http_docs
  root      29213  0.0  0.2   6840  2018 ?        DN   May11   0:42
	rsync -aupSz --stats --progress --include=* --include=.*
	/web_databases/mysql/ /system4/repository4/mysql
  root      25256  0.0  0.2   2016  4006 ?        D    May16   4:31
	rsync -aupSz --stats --progress --include=* --include=.*
	/cust_databases/mysql/ /system3/repository3/cust_mysql
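
For reference, the preliminary check was along these lines. This is only a sketch of the commands, not the actual capture from that day; the column to keep an eye on in 'vmstat' is 'b', the number of processes in uninterruptible sleep:

	# a few ps passes to see what is running and in what state
	ps aux | grep rsync

	# sample system activity every 2 seconds; watch the 'b' column
	# (processes blocked in uninterruptible sleep) and the 'bi'/'bo' I/O columns
	vmstat 2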

Interrogation of the Suspects
Now, the rsyncs themselves were not unusual, because this system normally has a lot of data transfer routines going back and forth. The rsync processes could genuinely have been waiting on I/O from the disk, which might account for the high 'b' numbers in 'vmstat'. However, 'vmstat' did not show much data transfer in or out; if it had, I could have used the 'iostat' command to pinpoint exactly where the heavy disk activity was taking place.

The key thing to notice was the STAT field in the output of 'ps' used with the x flag. If you notice 'D' state processes, you will need to investigate further. A 'D' state means the process is in an uninterruptible sleep; I would liken it to REM sleep in a human sleep cycle. But when a person is in REM sleep, we know the person is not a zombie and will eventually wake up. With these processes, how do you know which ones are actually doing legitimate work and which have somehow become stuck, never to wake up?
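
As a sketch of that triage, these are the sort of commands I mean; the 'iostat -x' example assumes the sysstat package is installed, and the wchan column shows the kernel wait channel each blocked process is sleeping in:

	# list processes in uninterruptible sleep ('D'), with their kernel wait channel
	ps axo pid,stat,wchan:32,args | awk 'NR==1 || $2 ~ /^D/'

	# extended per-device I/O statistics, refreshed every 2 seconds
	iostat -x 2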

A good way to examine the suspects is to use 'strace'. With 'strace' we can attach to a process and see what it is really doing, like this:


	sastri@serv_mframe02: strace -p 29213
	Process 29213 attached - interrupt to quit

	read(3,"221yWt16337Np344212s-247362&'37263314"..., 16384) = 52
	write(3,"256I336371226344377375331736364200\372-c|344211E
		236"315320362W357"..., 516) = 516
	select(10, [3 5 9], [3 7], NULL, NULL)  = 3 (in [9], out [3 7])
	rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
	rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
	......and so on

	sastri@serv_mframe02: strace -p 23453
	Process 23453 attached - interrupt to quit
	read(3,
	........nothing
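
Another way to interrogate a stuck 'D' state process is to ask the kernel where it is sleeping, via /proc. This is a sketch using the PID from above; /proc/<pid>/stack is root-only and only available on reasonably recent kernels:

	# the kernel function the process is currently blocked in
	cat /proc/23453/wchan; echo

	# the full kernel stack of the stuck task (root only, if the kernel provides it)
	cat /proc/23453/stack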

From the details above we can see one process doing plenty of I/O and one doing pretty much nothing. Although multiple processes were in 'D' states, only one was the real culprit. Normally you can't kill a 'D' state process: the signal isn't delivered until the process leaves its uninterruptible sleep, and forcing the issue would risk leaving the kernel in an inconsistent state, possibly leading to a panic. If the process is not tied to a specific driver, you may try to kill the parent process, but oftentimes a reboot is necessary. And yes, sometimes UNIX systems need a reboot to solve problems...but rarely :-)
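
If you do go after the parent, something along these lines will show who spawned the stuck rsync before you decide whether killing it is safe; the PID is the one from the ps listing above:

	# find the parent PID of the stuck rsync
	ps -o ppid= -p 23453

	# look at the parent before deciding to kill it
	ps -fp "$(ps -o ppid= -p 23453)"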


Peace be unto you. Thank you for visiting!