Suspended job sometimes does not resume after the high-priority job finishes, even though sched_preempt_enforce_resumption is set

Description

A suspended job sometimes does not resume after the high-priority job that preempted it finishes, even though sched_preempt_enforce_resumption is set.

The failure is intermittent. It is usually seen when the scheduler's simulation reaches a policy change (a prime/non-prime boundary) and the scheduler then reports an error in calculating the start time of the suspended job.

failure:

Fri May 20 21:51:34 2016 qaca-02-s11p2: jobs 49.qaca-02-s11p2 - submitted successfully
Fri May 20 21:51:34 2016 qaca-02-s11p2: job 44.qaca-02-s11p2 is in S state as expected
Fri May 20 21:51:35 2016 qaca-02-s11p2: job 45.qaca-02-s11p2 is in R state as expected
Fri May 20 21:51:35 2016 qaca-02-s11p2: job 46.qaca-02-s11p2 is in Q state as expected
Fri May 20 21:51:35 2016 qaca-02-s11p2: job 47.qaca-02-s11p2 is in R state as expected
Fri May 20 21:51:35 2016 qaca-02-s11p2: job 48.qaca-02-s11p2 is in Q state as expected
Fri May 20 21:51:36 2016 qaca-02-s11p2: job 49.qaca-02-s11p2 is in R state as expected
Fri May 20 21:51:36 2016 qaca-02-s11p2: job 46.qaca-02-s11p2 qstat output got as expected
Fri May 20 21:51:36 2016 qaca-02-s11p2: qstat -f run, comment value for 46.qaca-02-s11p2 is Not Running: Job would conflict with reservation or top job
Fri May 20 21:51:36 2016 qaca-02-s11p2: job comment of 46.qaca-02-s11p2 is as expected - Not Running: Job would conflict with reservation or top job
Fri May 20 21:51:36 2016 qaca-02-s11p2: job 48.qaca-02-s11p2 qstat output got as expected
Fri May 20 21:51:36 2016 qaca-02-s11p2: qstat -f run, comment value for 48.qaca-02-s11p2 is Not Running: Job would conflict with reservation or top job
Fri May 20 21:51:36 2016 qaca-02-s11p2: job comment of 48.qaca-02-s11p2 is as expected - Not Running: Job would conflict with reservation or top job
Fri May 20 21:51:38 2016 qaca-02-s11p2: Got log message from sched_logs
Fri May 20 21:51:39 2016 qaca-02-s11p2: Got log message from sched_logs
Fri May 20 21:53:22 2016 qaca-02-s11p2: job 49.qaca-02-s11p2 is completed
Fri May 20 21:55:23 2016 qaca-02-s11p2: *** ERROR: job 44.qaca-02-s11p2 should be R:44.qaca-02-s11p2 has S status - should be R

Job id           Name             User             Time Use S Queue
---------------- ---------------- ---------------- -------- - -----
44.qaca-02-s11p2 J2               pbsroot          00:00:00 S workq
46.qaca-02-s11p2 J2               pbsroot          00:00:00 R workq
48.qaca-02-s11p2 J2               pbsroot          00:00:00 R workq
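
To spot this condition without reading the table by eye, the qstat output above can be scanned for jobs that are still in the S state. This is a minimal sketch that assumes the default six-column qstat layout shown in this report; parse_qstat and suspended_jobs are hypothetical helper names, not part of PBS or PTL.

```python
def parse_qstat(text):
    """Return {job_id: state} from default tabular qstat output."""
    states = {}
    for line in text.splitlines():
        fields = line.split()
        # Data rows have exactly six whitespace-separated columns:
        # job id, name, user, time use, state, queue.
        # The header row splits into more fields and the dashed
        # separator row starts with "-", so both are skipped.
        if len(fields) != 6 or fields[0].startswith("-"):
            continue
        states[fields[0]] = fields[4]
    return states


def suspended_jobs(text):
    """Return the ids of jobs whose state column is S."""
    return [job for job, state in parse_qstat(text).items() if state == "S"]


# Sample taken from the qstat output in this report.
SAMPLE = """\
Job id Name User Time Use S Queue
---------------- ---------------- ---------------- -------- - -----
44.qaca-02-s11p2 J2 pbsroot 00:00:00 S workq
46.qaca-02-s11p2 J2 pbsroot 00:00:00 R workq
48.qaca-02-s11p2 J2 pbsroot 00:00:00 R workq
"""

if __name__ == "__main__":
    print(suspended_jobs(SAMPLE))
```

With the sample above, only job 44 is reported, which matches the failure: the two lower-priority jobs ran while the suspended job stayed in S.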

05/20/2016 21:51:34;0400;pbs_sched;Node;49.qaca-02-s11p2;Evaluating subchunk: ncpus=4
05/20/2016 21:51:34;0400;pbs_sched;Node;49.qaca-02-s11p2;Allocated one subchunk: ncpus=4
05/20/2016 21:51:34;0100;pbs_sched;Job;49.qaca-02-s11p2;Simulation: Preempted enough work to run job
05/20/2016 21:51:34;0040;pbs_sched;Job;44.qaca-02-s11p2;Job preempted by suspension
05/20/2016 21:51:34;0400;pbs_sched;Node;49.qaca-02-s11p2;Evaluating subchunk: ncpus=4
05/20/2016 21:51:34;0400;pbs_sched;Node;49.qaca-02-s11p2;Allocated one subchunk: ncpus=4
05/20/2016 21:51:34;0040;pbs_sched;Job;49.qaca-02-s11p2;Job run
...
05/20/2016 21:53:22;0100;pbs_sched;Svr;;It is non-primetime. It will end in 201998 seconds at 05/23/2016 06:00:00
05/20/2016 21:53:22;0100;pbs_sched;Node;qaca-02-s11p2;Job 49.qaca-02-s11p2 reported running on node no longer exists or is not in running state
05/20/2016 21:53:22;0100;pbs_sched;Node;qaca-02-s11p2;Job 49.qaca-02-s11p2 reported running on node no longer exists or is not in running state
05/20/2016 21:53:22;0100;pbs_sched;Node;qaca-02-s11p2;Job 49.qaca-02-s11p2 reported running on node no longer exists or is not in running state
05/20/2016 21:53:22;0100;pbs_sched;Node;qaca-02-s11p2;Job 49.qaca-02-s11p2 reported running on node no longer exists or is not in running state
...
05/20/2016 21:53:22;0400;pbs_sched;Job;prime time;Simulation: Policy change [Mon May 23 06:00:00 2016]
05/20/2016 21:53:22;0040;pbs_sched;Sched;44.qaca-02-s11p2;Can't find start time estimate Insufficient amount of resource: ncpus (R: 8 A: 4 T: 8)
05/20/2016 21:53:22;0040;pbs_sched;Job;44.qaca-02-s11p2;Error in calculation of start time of top job
05/20/2016 21:53:22;0040;pbs_sched;Job;44.qaca-02-s11p2;Insufficient amount of resource: ncpus (R: 8 A: 4 T: 8)
05/20/2016 21:53:22;0080;pbs_sched;Job;46.qaca-02-s11p2;Considering job to run
...
05/20/2016 21:53:22;0080;pbs_sched;Job;44.qaca-02-s11p2;Considering job to run
05/20/2016 21:53:22;0100;pbs_sched;Job;44.qaca-02-s11p2;Estimating the start time for a top job.
05/20/2016 21:53:22;0400;pbs_sched;Job;46.qaca-02-s11p2;Simulation: job end point [Fri May 20 21:58:23 2016]
05/20/2016 21:53:22;0400;pbs_sched;Job;prime time;Simulation: Policy change [Mon May 23 06:00:00 2016]
05/20/2016 21:53:22;0400;pbs_sched;Job;non-prime time;Simulation: Policy change [Mon May 23 17:30:00 2016]
05/20/2016 21:53:22;0400;pbs_sched;Job;prime time;Simulation: Policy change [Tue May 24 06:00:00 2016]
05/20/2016 21:53:22;0400;pbs_sched;Job;non-prime time;Simulation: Policy change [Tue May 24 17:30:00 2016]
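
When triaging similar logs, it can help to pull the resource figures out of the scheduler messages programmatically. The sketch below assumes the semicolon-separated line layout seen above (timestamp, event code, daemon, object type, object name, message) and reads R, A and T as requested, available and total, which is an interpretation rather than something this report states. The names SchedLogRecord, parse_sched_line and resource_shortfall are hypothetical.

```python
import re
from typing import NamedTuple, Optional, Tuple


class SchedLogRecord(NamedTuple):
    timestamp: str
    event: str
    daemon: str
    obj_type: str
    obj_name: str
    message: str


def parse_sched_line(line: str) -> Optional[SchedLogRecord]:
    """Split one scheduler log line into its six fields, or return None."""
    # Limit the split so any ';' inside the message text is preserved.
    parts = line.rstrip("\n").split(";", 5)
    if len(parts) != 6:
        return None
    return SchedLogRecord(*parts)


_SHORTFALL = re.compile(
    r"Insufficient amount of resource: (\w+) \(R: (\d+) A: (\d+) T: (\d+)\)"
)


def resource_shortfall(message: str) -> Optional[Tuple[str, int, int, int]]:
    """Return (resource, R, A, T) from an 'Insufficient amount' message."""
    match = _SHORTFALL.search(message)
    if match is None:
        return None
    name, r, a, t = match.groups()
    return name, int(r), int(a), int(t)


# Line copied from the scheduler log in this report.
LINE = (
    "05/20/2016 21:53:22;0040;pbs_sched;Sched;44.qaca-02-s11p2;"
    "Can't find start time estimate Insufficient amount of resource: "
    "ncpus (R: 8 A: 4 T: 8)"
)

if __name__ == "__main__":
    record = parse_sched_line(LINE)
    print(record.obj_name, resource_shortfall(record.message))
```

For job 44 the values are R: 8, A: 4, T: 8, so under the reading above the job asks for all 8 ncpus while only 4 are counted as available at the moment the estimate is made.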

Prime/Nonprime Table
*
          Prime  Non-Prime
Day       Start  Start
*
weekday   0600   1730
saturday  none   all
sunday    none   all
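
The policy-change times in the scheduler log can be cross-checked against this table. The sketch below is an illustration only: it hard-codes the weekday 0600/1730 boundaries from the table, treats Saturday and Sunday as non-prime all day, and ignores holidays. next_policy_change is a hypothetical helper, not scheduler code.

```python
from datetime import datetime, time, timedelta

# Boundaries from the table above; weekends have no prime time.
PRIME_START = time(6, 0)
NONPRIME_START = time(17, 30)


def next_policy_change(now):
    """Return (kind, when) for the first prime/non-prime boundary after now."""
    for offset in range(8):
        day = now.date() + timedelta(days=offset)
        if day.weekday() >= 5:  # Saturday=5, Sunday=6: non-prime all day
            continue
        for kind, start in (("prime time", PRIME_START),
                            ("non-prime time", NONPRIME_START)):
            candidate = datetime.combine(day, start)
            if candidate > now:
                return kind, candidate
    raise RuntimeError("no boundary found within a week")


if __name__ == "__main__":
    # Timestamp of the scheduling cycle in which job 49 was seen to have ended.
    now = datetime(2016, 5, 20, 21, 53, 22)
    kind, when = next_policy_change(now)
    print(kind, when, int((when - now).total_seconds()))
```

For the cycle at 05/20/2016 21:53:22 (a Friday evening) this gives prime time starting Mon May 23 06:00:00 2016, 201998 seconds later, which agrees with the "It is non-primetime. It will end in 201998 seconds" line and with the first simulated policy change in the log.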

Acceptance Criteria

None

Activity

Arun Grover
July 2, 2016, 8:01 AM

This is a timing issue that happens intermittently. I have only managed to reproduce it by tweaking values with gdb in a running scheduler process.
Have only one node, with a single ncpu, added to the PBS complex.
Submit a low-priority job.
Attach gdb to the running scheduler process: gdb -p <pid>
Set breakpoints at collect_jobs_on_node() and at find_resource_resv().
From another terminal, submit a high-priority job.
gdb will stop at collect_jobs_on_node(); continue ('c') twice (this is because at this point the scheduler is trying to run the high-priority job).
gdb will stop at collect_jobs_on_node() again; continue ('c') once.
gdb will stop at find_resource_resv(); type "return 0" and press Enter.
Now delete both breakpoints and continue.
This makes the suspended job run as well. You might see 2 ncpus in use when the complex has only one ncpu, but that is expected because we altered values with gdb.

Assignee

Arun Grover

Reporter

Jon Shelley

Severity

None

OS

None

Start Date

None

Pull Request URL

None

Story Points

1

Fix versions

Affects versions

Priority

Low