PBS kills suspended jobs when pbs_comm is restarted


One of our customers suspended all running jobs on their cluster in order to perform some system activities, namely turning off 16 racks that will be retired. Part of the work involved restarting pbs_comm on all cluster rack leaders so that they would pick up a config file with the now-smaller list of pbs_comm instances to connect to (due to the rack retirement).

It turns out that when we restarted pbs_comm we inadvertently killed about 100 of the 300+ suspended jobs. My post-mortem analysis and testing indicate a bug in im_eof() in mom_comm.c in how it decides what to do when a mom's pbs_comm connection is dropped. I've attached the code in question from 13.1; our version is nearly identical except for the first if-statement:

if (pjob->ji_qs.ji_substate == JOB_SUBSTATE_PRERUN ||
pjob->ji_qs.ji_substate == JOB_SUBSTATE_RUNNING) {

I believe the bug in both versions of the if-statement is that it only protects running jobs from being killed due to possibly-transient pbs_comm trouble, and doesn't extend that protection to suspended jobs. My guess is that adding JOB_SUBSTATE_SUSPEND as another option in the if-statement will fix the bug and allow suspended multi-node jobs to survive. Based on the 13.1 code, the modified if-statement might look like:

if ((((pjob->ji_qs.ji_svrflags & JOB_SVFLG_HERE) == 0) && (pjob->ji_qs.ji_substate == JOB_SUBSTATE_PRERUN)) ||
(pjob->ji_qs.ji_substate == JOB_SUBSTATE_RUNNING) ||
(pjob->ji_qs.ji_substate == JOB_SUBSTATE_SUSPEND)) {
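
To make the intent of the proposed condition easier to check in isolation, here is a minimal standalone sketch of it as a predicate. It is not the actual im_eof() code: the helper name job_survives_comm_loss() and SUBSTATE_OTHER_EXAMPLE are hypothetical, and the numeric values given to the substate and flag names are placeholders chosen for illustration (the real definitions come from the PBS headers). The flag test is copied as-is from the snippet above.

```c
#include <stdbool.h>

/* Placeholder values for illustration only; the real values are
 * defined in the PBS headers and are likely different. */
enum {
    JOB_SUBSTATE_PRERUN    = 1,
    JOB_SUBSTATE_RUNNING   = 2,
    JOB_SUBSTATE_SUSPEND   = 3,
    SUBSTATE_OTHER_EXAMPLE = 99  /* hypothetical: stands in for any other substate */
};
#define JOB_SVFLG_HERE 0x01      /* placeholder bit value */

/* Hypothetical helper mirroring the proposed if-condition: returns true
 * when the job would be covered by the check, i.e. not killed because
 * of a dropped pbs_comm connection. */
static bool job_survives_comm_loss(int substate, int svrflags)
{
    /* PRERUN only counts when JOB_SVFLG_HERE is not set. */
    bool prerun_without_here =
        ((svrflags & JOB_SVFLG_HERE) == 0) &&
        (substate == JOB_SUBSTATE_PRERUN);

    return prerun_without_here ||
           substate == JOB_SUBSTATE_RUNNING ||
           substate == JOB_SUBSTATE_SUSPEND;  /* the proposed addition */
}
```

With the extra clause, a suspended job is treated the same as a running one regardless of the flag, which is the behavior the fix is meant to achieve.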

I was able to reproduce the failure by suspending a multi-node job that spanned two or more pbs_comm instances (head node connected to pbs_comm A, some number of sister nodes attached to pbs_comm B), then stopping and restarting pbs_comm A.


David Block
November 21, 2016, 6:35 PM

pbsproadmin pbsproadmin added a comment - 02/Apr/16 4:05 PM - edited

I am able to reproduce the issue and have verified that the supplied fix works, i.e. including the SUSPEND substate as part of the check: (pjob->ji_qs.ji_substate == JOB_SUBSTATE_SUSPEND)

So, should I still provide them with the updated source code? (They already know exactly what and where to fix.)

David Block
November 21, 2016, 6:36 PM

rampranesh Ram Pranesh added a comment - 12/Jun/16 9:50 PM - edited

Closing this issue as it was addressed as part of CLOSED .
Committed with 1f7af20fc4bfb46deec95635e39d9c172e0be242


Ram Pranesh


Former user




