PBS kills suspended jobs when pbs_comm is restarted

Description

When pbs_comm is restarted in a cluster, all suspended jobs are killed. Analysis of mom_comm.c showed that the following check does not account for jobs in the suspended state:

if (pjob->ji_qs.ji_substate == JOB_SUBSTATE_PRERUN ||
    pjob->ji_qs.ji_substate == JOB_SUBSTATE_RUNNING) {

The issue can be reproduced by suspending a multi-node job that spans two or more pbs_comm instances (head node connected to pbs_comm A, one or more sister nodes attached to pbs_comm B) and then stopping and restarting pbs_comm A.

Acceptance Criteria

None

Activity

Jayadev Chanakath
June 13, 2016, 2:26 AM

This issue is a duplicate of PP-68, and that ticket has some additional comments. Please mark this ticket as a duplicate of PP-68 and use that ticket instead. The status of this ticket needs to be updated as well.

Ram Pranesh
June 13, 2016, 4:39 AM

This issue is a duplicate of PP-68, so I am closing it.

Assignee

Ram Pranesh

Reporter

Ram Pranesh

Severity

None

OS

None

Start Date

None

Pull Request URL

None

Story Points

1

Components

Affects versions

Priority

Low