We had a Bamboo incident the other day: many Bamboo agents went offline, and quite a few build plans had to wait in the queue for a long time. We checked the logs and saw messages like “Agent could not access JMS invoker queue”.
JMS stands for Java Message Service, which in this case the Bamboo server uses to manage the job queue. JMS is I/O intensive and needs fast disks to keep up. A quick iostat 6 10 command showed that %iowait was consistently above 15, and some samples even reached 50. A %iowait of 50 means the CPUs spent 50% of their time idle, waiting for outstanding disk I/O to complete. That pointed clearly to a problem with the disks. After the VMware team vMotioned the Bamboo server to another host, we saw a dramatic drop in %iowait (from 15 down to 0.5), and most of the agents came back online by themselves. A couple had to be re-enabled. It turned out that the previous host had some degraded disks that had been overlooked.
The disk busy time graph.
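The iostat check described above can be automated. What follows is only a minimal sketch, assuming the usual sysstat text layout, where an "avg-cpu:" header row is followed by a row of values. The function names, the sample output, and the threshold of 15 are illustrative assumptions. The threshold is borrowed from what we observed in this incident and is not a general rule.

```python
# Hypothetical sample in the usual sysstat "avg-cpu" layout.
# These numbers are illustrative, not captured from the incident.
SAMPLE = """\
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           3.10    0.00    1.20   17.45    0.00   78.25

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           2.80    0.00    1.05   50.12    0.00   46.03

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           4.00    0.00    1.50    0.50    0.00   94.00
"""


def iowait_samples(iostat_output):
    """Return the %iowait value from each avg-cpu report in iostat text."""
    samples = []
    lines = iostat_output.splitlines()
    for i, line in enumerate(lines):
        fields = line.split()
        if fields and fields[0] == "avg-cpu:" and "%iowait" in fields:
            # The header row has a leading "avg-cpu:" label that the
            # value row lacks, so shift the column index by one.
            col = fields.index("%iowait") - 1
            if i + 1 < len(lines):
                values = lines[i + 1].split()
                if col < len(values):
                    samples.append(float(values[col]))
    return samples


def high_iowait(samples, threshold=15.0):
    """Return the samples above the threshold (assumed value, tune it)."""
    return [s for s in samples if s > threshold]


if __name__ == "__main__":
    print(high_iowait(iowait_samples(SAMPLE)))
```

In practice the text would come from running iostat rather than from a hard-coded string. Keep in mind that the first report iostat prints covers the time since boot, so it is usually worth ignoring when looking for a current problem.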