PP-928 Reliable Job Startup
Objective
Provide the ability to pad a job's node request (i.e. request additional chunks of resources for the job), so that the job can still start if some nodes fail. Any leftover nodes not needed by the job can be released back to the server.
Forum: http://community.pbspro.org/t/pp-928-reliable-job-startup/649
Interface 1: New job attribute 'tolerate_node_failures'
Visibility: Public
Change Control: Stable
Value: 'all', 'job_start', or 'none'
Python type: str
Synopsis:
When set to 'all', the job tolerates all node failures resulting from communication problems (e.g. polling) between the primary mom and the sister moms assigned to the job, as well as failures due to rejections from execjob_begin or execjob_prologue hook execution by remote moms.
When set to 'job_start', the job tolerates only node failures that occur during job start, such as an assigned sister mom failing to join the job or communication errors between the primary mom and sister moms, up to just before the job executes the execjob_launch hook and/or the top-level shell or executable.
When set to 'none', or if the attribute is unset, no node failures are tolerated (default behavior).
It can be set via qsub, qalter, or in a Python hook such as a queuejob hook. If set via qalter while the job is already running, the new value is consulted the next time the job is rerun.
This can also be specified in the server attribute 'default_qsub_arguments' to allow all jobs to be submitted with the tolerate_node_failures attribute set.
This option is best used when the job is assigned extra nodes using the pbs.event().job.select.increment_chunks() method (interface 7).
The ‘tolerate_node_failures’ job option is currently not supported on Cray systems. If specified, a Cray primary mom would ignore the setting.
Privilege: user, admin, or operator can set it
Examples:
Via qsub:
qsub -W tolerate_node_failures="all" <job_script>
Via qalter:
qalter -W tolerate_node_failures="job_start" <jobid>
Via a queuejob hook:
# cat qjob.py
import pbs
e=pbs.event()
e.job.tolerate_node_failures = "all"
# qmgr -c "create hook qjob event=queuejob"
# qmgr -c "import hook application/x-python default qjob.py"
% qsub job.scr
23.borg
% qstat -f 23
...
tolerate_node_failures = all
Log/Error messages:
When a job has the tolerate_node_failures attribute set to 'all' or 'job_start', the following mom_logs message appears under any of these conditions: a sister mom fails to join the job due to either a communication error or an execjob_begin hook rejection; a sister mom fails to set up the job (e.g. cpuset creation failure); a sister mom rejects an execjob_prologue hook; the primary mom fails to poll a sister mom for status; or any other communication error to a sister mom occurs:
DEBUG level: "ignoring from <node_host> error as job is tolerant of node failures"
Interface 2: New server accounting record: 's' for secondary start record when job's assigned resources get pruned during job startup
Visibility: Public
Change Control: Stable
Synopsis: When a job has the tolerate_node_failures attribute set to 'all' or 'job_start', this new accounting record reflects the adjusted (pruned) values of the job's assigned resources, as a result of the call to pbs.event().job.release_nodes() inside an execjob_prologue or execjob_launch hook.
Note: This is a new accounting record; the start of job record ('S') remains as before.
Example:
04/07/2016 17:08:09;s;20.corretja.pbspro.com;user=bayucan group=users project=_pbs_project_default jobname=jobr.scr queue=workq ctime=1460063202 qtime=1460063202 etime=1460063202 start=1460063203 exec_host=corretja/0*3+lendl/0*2+nadal/0 exec_vnode=(corretja:ncpus=3:mem=1048576kb)+(lendl:ncpus=2:mem=2097152kb)+(nadal:ncpus=1:mem=3145728kb) Resource_List.mem=6291456kb Resource_List.ncpus=6 Resource_List.nodect=3 Resource_List.place=scatter:excl Resource_List.select=1:ncpus=3:mem=1048576kb+1:ncpus=2:mem=2097152kb+1:ncpus=1:mem=3145728kb Resource_List.site=ncpus=3:mem=1gb+ncpus=2:mem=2gb+ncpus=1:mem=3gb resource_assigned.mem=24gb resource_assigned.ncpus=9
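A site tool that consumes the 's' record might split it as in the following minimal sketch. It assumes only what the example above shows: four ';'-separated fields, with the last field holding space-separated key=value pairs whose values contain no spaces. The function name parse_accounting_record and the shape of the returned dictionary are hypothetical, not part of PBS.

```python
def parse_accounting_record(line):
    """Split one accounting log line into its four ';'-separated fields.

    Assumption: the message field is a list of space-separated key=value
    pairs, and no value contains a space (true for the example record).
    """
    timestamp, record_type, job_id, message = line.rstrip("\n").split(";", 3)
    attrs = {}
    for token in message.split():
        # Split on the first '=' only; values such as Resource_List.select
        # contain further '=' characters that must be preserved.
        key, _, value = token.partition("=")
        attrs[key] = value
    return {
        "timestamp": timestamp,
        "type": record_type,
        "job_id": job_id,
        "attrs": attrs,
    }


# Shortened version of the example record above.
record = parse_accounting_record(
    "04/07/2016 17:08:09;s;20.corretja.pbspro.com;"
    "user=bayucan exec_host=corretja/0*3+lendl/0*2+nadal/0 "
    "Resource_List.select=1:ncpus=3:mem=1048576kb+1:ncpus=2:mem=2097152kb "
    "Resource_List.ncpus=6 Resource_List.nodect=3"
)
```

Filtering on record["type"] == "s" distinguishes the secondary start record from the regular 'S' record, since the two differ only in letter case.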
Interface 3: sister_join_job_alarm mom config option
Visibility: Public
Change Control: Stable
Details:
This is the number of seconds that the primary mom waits to receive acknowledgement from all the sister moms for the IM_JOIN_JOB requests sent, if the job's tolerate_node_failures attribute is set to 'all' or 'job_start'. That is, just before the job officially launches its program (script/executable), the primary pbs_mom ignores any errors from sister moms, including failed IM_JOIN_JOB requests. Once all the IM_JOIN_JOB requests have been acknowledged, or when the 'sister_join_job_alarm' wait time has been exceeded, pre-starting the job (calling finish_exec()) continues.
Default value: the sum of the 'alarm' values of all enabled execjob_begin hooks. For example, if there are 2 execjob_begin hooks, with the first hook having alarm=30 and the second hook having alarm=20, then the default value of sister_join_job_alarm is 50 seconds. If there are no execjob_begin hooks, the default is 30 seconds.
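The default-value rule can be written out as a small calculation. This is an illustrative sketch only: the hook list is a hypothetical list of dictionaries with 'event', 'enabled', and 'alarm' keys, not the way pbs_mom stores hooks internally.

```python
FALLBACK_ALARM_SECS = 30  # used when no enabled execjob_begin hook exists


def default_sister_join_job_alarm(hooks):
    """Sum the alarm values of all enabled execjob_begin hooks.

    `hooks` is a hypothetical list of dicts with 'event', 'enabled',
    and 'alarm' keys, used here purely for illustration.
    """
    alarms = [
        h["alarm"]
        for h in hooks
        if h["event"] == "execjob_begin" and h["enabled"]
    ]
    return sum(alarms) if alarms else FALLBACK_ALARM_SECS


two_hooks = [
    {"event": "execjob_begin", "enabled": True, "alarm": 30},
    {"event": "execjob_begin", "enabled": True, "alarm": 20},
]
with_disabled = two_hooks + [
    {"event": "execjob_begin", "enabled": False, "alarm": 100},
]
```

With the two hooks from the example, the function yields 30 + 20 = 50; a disabled hook does not contribute, and an empty hook list falls back to 30.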
To change the value, add the following line to mom's config file:
$sister_join_job_alarm <# of seconds>
Log/Error messages:
When the $sister_join_job_alarm value is specified, a PBSEVENT_SYSTEM level message is logged when mom starts up or receives a HUP signal (kill -HUP): "sister_join_job_alarm;<alarm_value>"
When not all join job requests have been acknowledged by the sister moms within the $sister_join_job_alarm time limit, the following mom_logs message appears at DEBUG2 level: "sister_join_job_alarm wait time <alarm_value> secs exceeded"
Interface 4: job_launch_delay mom config option
Visibility: Public
Change Control: Stable
Details:
This is the number of seconds that the primary mom waits before launching the job (executing the job script or executable), if the job has tolerate_node_failures set to "all" or "job_start". This wait time can be used to let execjob_prologue hooks finish execution so that they can capture or report any node failures, or to let mother superior notice any communication problems with other nodes. pbs_mom will not necessarily wait for the entire time; it proceeds to execute the execjob_launch hook (when specified) once all prologue hook acknowledgements have been received from the sister moms.
Default value: the sum of the 'alarm' values of all enabled execjob_prologue hooks. For example, if there are 2 execjob_prologue hooks, where the first hook has alarm=30 and the second hook has alarm=60, then the default job_launch_delay value is 90 seconds. If there are no execjob_prologue hooks, the default is 30 seconds.
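The "wait at most job_launch_delay seconds, but proceed early once every sister mom has acknowledged" behavior can be modeled as follows. This is a simplified sketch, not pbs_mom's implementation: the function name and the representation of acknowledgements as a list of arrival times (None meaning the acknowledgement never arrived) are assumptions made for illustration.

```python
def seconds_before_launch(ack_times, job_launch_delay):
    """Return how long the primary mom waits in this simplified model.

    ack_times: for each sister mom, the number of seconds after which its
    execjob_prologue acknowledgement arrived, or None if it never arrived.
    """
    all_acked_in_time = all(
        t is not None and t <= job_launch_delay for t in ack_times
    )
    if all_acked_in_time:
        # Every sister mom answered: proceed once the slowest ack is in.
        return max(ack_times, default=0)
    # At least one ack is missing or late: wait out the full delay.
    return job_launch_delay
```

With job_launch_delay at 90, acknowledgements at 2 and 5 seconds let the job proceed after 5 seconds, while a missing or late acknowledgement holds the launch for the full 90 seconds.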
To change the value, add the following line to mom's config file:
$job_launch_delay <number of seconds>
Restriction:
This option is currently not supported under Windows. NOTE: Allowing it would cause the primary mom to hang waiting on the job_launch_delay timeout, preventing other jobs from starting. This is because jobs are not pre-started in a forked child process, unlike in Linux/Unix systems.
Log/Error messages:
When the $job_launch_delay value is set, a PBSEVENT_SYSTEM level message is logged when mom starts up or receives a HUP signal (kill -HUP): "job_launch_delay;<delay_value>"
When the primary mom notices that not all acknowledgements for execjob_prologue hook execution were received from the sister moms, mom_logs shows the DEBUG2 level message: "not all prologue hooks to sister moms completed, but job will proceed to execute"