Changes to run_count attribute, hold and release operation of subjob(s) in a Job Array

This design has been checked into mainline through https://github.com/PBSPro/pbspro/pull/1119

This design document was written in accordance with the PBS Pro Design Document Guidelines.

Forum discussion is located at: http://community.pbspro.org/t/proposal-of-interface-regarding-hold-release-of-subjob-s-and-job-array/1442

Overview:

Currently, the run_count attribute of an instantiated subjob is not incremented each time the server retries running it on mom. As a result, if mom rejects the subjob or raises an exception while starting it, the subjob is passed back and forth between server and mom indefinitely, wasting mom and server CPU time and log space. The interfaces in this document solve this problem and let an admin indirectly release a subjob of a Job Array, which is currently not possible. The PBS commands qhold and qrls can be used on jobs and job arrays, but not on subjobs or ranges of subjobs.

I propose we allow qrls of a job array to indirectly release its held subjobs.

Motivation:
The v18.2 guides (sections 14.12.3 and 14.18 of the AG; sections 6.4.5, 6.5.6, and 6.7 of the UG) describe the limit on how many times the server retries a job that failed to run on an execution node. For example, part of section 6.7 of the UG is reproduced below:

6.7 Controlling Number of Times Job is Re-run
PBS has a built-in limit of 21 on the number of times it will try to run your job. The number of attempts is tracked in the job’s run_count attribute. By default, the value of run_count is zero at job submission. The job is held when the value of run_count goes above 20.

To simulate this scenario, create and import a simple execution hook that rejects the job or raises an exception during the execjob_begin event.
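A minimal sketch of such a hook, assuming the standard pbs hook module and its pbs.event().reject() call. The file name reject_begin.py, the reject_job helper, and the rejection message are made up for illustration, and the import is guarded only so that the file can also be loaded outside of a PBS hook context.

```python
# reject_begin.py: hypothetical execjob_begin hook that rejects every job.


def reject_job(event):
    """Reject the job attached to the given hook event."""
    event.reject("rejected on purpose to exercise the run_count limit")


try:
    import pbs  # only importable when PBS runs this file as a hook
except ImportError:
    pbs = None  # loaded outside PBS, e.g. for a syntax check

if pbs is not None:
    reject_job(pbs.event())
```

Such a file would typically be created as a hook with the execjob_begin event and imported through qmgr (create hook, set hook ... event = execjob_begin, import hook ... application/x-python default reject_begin.py).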

Now when we submit a regular job, the server tries to run the job 20 times; after that, the job is moved to the Held state with its comment field set to "job held, too many failed attempts to run" and its run_count attribute set to 21.
After we correct the hook to accept the job, the admin can release the system hold on this job using qrls -h s <held jobid>

In the above scenario, if we submit a job array instead of a regular job, the server reincarnates the retrying subjob with run_count reset to 0 instead of incrementing the previous value, so subjob retries do not adhere to the limit of 20. Because of this “bug”, mom and server play ping-pong with the subjob indefinitely, wasting mom and server CPU time and log space.

Technical details:

This document proposes the following interfaces:

Interface 1: "run_count" attribute of subjob

Change control: Stable
Synopsis: increment and enforce limit on "run_count" of subjob similar to a regular job
Details:
- Increment the "run_count" attribute value each time the server tries to run the subjob by sending it to mom.
- Once the value of "run_count" goes above 20, the server will set a system Hold on the subjob.
- When this system Hold gets applied, the comment field of the subjob is set as with a regular job.
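The three rules above can be modelled with a small sketch. This is not the server code: the Subjob class, the try_run function, and the constant names are hypothetical, and only the limit of 20 and the comment text come from this document.

```python
from dataclasses import dataclass

MAX_RUN_COUNT = 20  # a subjob is held once run_count goes above this value
HOLD_COMMENT = "job held, too many failed attempts to run"


@dataclass
class Subjob:
    """Hypothetical stand-in for the server's view of one subjob."""
    job_id: str
    run_count: int = 0
    state: str = "Q"
    hold_types: str = "n"  # "n" = no hold, "s" = system hold
    comment: str = ""


def try_run(subjob, mom_accepts):
    """Model one attempt by the server to run a subjob on mom.

    Returns True if the subjob started running.
    """
    if subjob.state == "H":
        return False  # held subjobs are not sent to mom
    subjob.run_count += 1  # incremented on every attempt, never reset
    if subjob.run_count > MAX_RUN_COUNT:
        subjob.state = "H"
        subjob.hold_types = "s"
        subjob.comment = HOLD_COMMENT
        return False
    if mom_accepts:
        subjob.state = "R"
        return True
    subjob.state = "Q"  # rejected by mom, queued for another attempt
    return False
```

In this model, a subjob that mom always rejects is sent to mom 20 times; the next attempt raises run_count to 21 and places the system Hold, which matches the behaviour described for a regular job in the Motivation section.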

Interface 2: system Hold on parent Job Array

Change control: Stable
Synopsis: system Hold on Parent Job Array due to held subjob
Details:
- When any subjob gets Held by the system due to Interface 1, a system Hold is consequently set on its parent Job Array.
- This will inhibit further instantiation of its subjobs, which are likely to fail to launch as well.
- The comment field of the Job Array will be updated to "Job Array Held, too many failed attempts to run subjob <subjob jobid>"
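As an illustration of how the hold propagates to the parent, here is a minimal sketch. The JobArray class and both function names are hypothetical; the comment format is the one quoted above, and the use of "B" for an array that has begun is an assumption of this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class JobArray:
    """Hypothetical stand-in for the server's view of a parent Job Array."""
    job_id: str
    state: str = "B"       # assumption: the array has already begun
    hold_types: str = "n"  # "n" = no hold, "s" = system hold
    comment: str = ""
    held_subjobs: list = field(default_factory=list)


def on_subjob_system_held(array, subjob_id):
    """Propagate a subjob's system Hold to its parent Job Array."""
    array.held_subjobs.append(subjob_id)
    array.state = "H"
    array.hold_types = "s"
    array.comment = (
        "Job Array Held, too many failed attempts to run subjob " + subjob_id
    )


def may_instantiate(array):
    """Further subjobs are instantiated only while the parent is not held."""
    return array.state != "H"
```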

Interface 3: changes to PBS command "qrls"

Change control: Stable
Synopsis: enable "qrls" to indirectly release system Hold on subjob
Details:
- When qrls is invoked with the id of a system Held Job Array, all of its system Held subjobs are released before the hold on the Job Array itself is released.
- If the above operation fails, the error message will be the same as with a regular job, and the Job Array will remain in the H state.
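The release order described above can be sketched as follows. The dictionary layout, the function names, and the states that jobs return to after release ("Q" for subjobs, "B" for the array) are assumptions made for illustration; the failure path is not modelled.

```python
SYSTEM_HOLD = "s"


def release_system_hold(holds):
    """Return a copy of the hold set with the system hold removed."""
    return holds - {SYSTEM_HOLD}


def qrls_job_array(array, subjobs):
    """Model qrls on a system Held Job Array.

    array:   dict with keys "holds" (set of hold types) and "state"
    subjobs: dict mapping subjob id to a dict of the same shape
    Returns the ids of the subjobs whose system hold was released.
    """
    if SYSTEM_HOLD not in array["holds"]:
        return []  # nothing to do for this sketch
    released = []
    # Step 1: release every system Held subjob first.
    for sj_id, sj in subjobs.items():
        if SYSTEM_HOLD in sj["holds"]:
            sj["holds"] = release_system_hold(sj["holds"])
            if not sj["holds"]:
                sj["state"] = "Q"  # assumption: back to queued
            released.append(sj_id)
    # Step 2: only then release the hold on the parent array.
    array["holds"] = release_system_hold(array["holds"])
    if not array["holds"]:
        array["state"] = "B"  # assumption: the array had already begun
    return released
```

A subjob that also carries another hold type stays in the H state in this model, since only the system hold is removed.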
