PP-928 Reliable Job Startup

Objective

Provide the ability to pad a job's node resources request (i.e. request additional chunks of resources for the job) so that if some nodes fail, the job can still start. Any leftover nodes not needed by the job can be released back to the server.

Forum: http://community.pbspro.org/t/pp-928-reliable-job-startup/649

Interface 1: New job attribute 'tolerate_node_failures'

  • Visibility: Public
  • Change Control: Stable
  • Value: 'all', 'job_start', or 'none'
  • Python type: str
  • Synopsis:  
    • When set to 'all', tolerate all node failures resulting from communication problems (e.g. polling) between the primary mom and the sister moms assigned to the job, as well as from rejections by execjob_begin or execjob_prologue hooks executed by remote moms.
    • When set to 'job_start', tolerate only node failures that occur during job start, such as an assigned sister mom failing to join the job or communication errors between the primary mom and sister moms, up to just before the job executes the execjob_launch hook and/or the top-level shell or executable.
    • When set to 'none', or if the attribute is unset, no node failures are tolerated (default behavior).

    • It can be set via qsub, qalter, or in a Python hook (e.g. a queuejob hook). If set via qalter while the job is already running, the new value is consulted the next time the job is rerun.
    • This can also be specified in the server attribute 'default_qsub_arguments' to allow all jobs to be submitted with tolerate_node_failures attribute set.
    • This option is best used when a job is assigned extra nodes using the pbs.event().job.select.increment_chunks() method (interface 7).
    • The 'tolerate_node_failures' job option is currently not supported on Cray systems. If specified, a Cray primary mom will ignore the setting.
  • Privilege: user, admin, or operator can set it
  • Examples:
    • Via qsub:

                            qsub -W tolerate_node_failures="all" <job_script>

    • Via qalter:

                            qalter -W tolerate_node_failures="job_start" <jobid>

    • Via a hook:

                            # cat qjob.py
                            import pbs
                            e=pbs.event()
                            e.job.tolerate_node_failures = "all"
                            # qmgr -c "create hook qjob event=queuejob"
                            # qmgr -c "import hook application/x-python default qjob.py"
                            % qsub job.scr
                            23.borg
                            % qstat -f 23
                              ...
                              tolerate_node_failures = all

  • Log/Error messages:
    • When a job has the tolerate_node_failures attribute set to 'all' or 'job_start', the following mom_logs message appears when a sister mom fails to join the job due to either a communication error or an execjob_begin hook rejection, when a sister mom fails to set up the job (e.g. a cpuset creation failure), when a sister mom rejects an execjob_prologue hook, when the primary mom fails to poll a sister mom for status, or on any other communication error with a sister mom:
      • DEBUG level: "ignoring from <node_host> error as job is tolerant of node failures"

Interface 2: New server accounting record: 's' for secondary start record when job's assigned resources get pruned during job startup

  • Visibility: Public
  • Change Control: Stable
  • Synopsis: When a job has the tolerate_node_failures attribute set to 'all' or 'job_start', this new accounting record reflects the adjusted (pruned) values of the job's assigned resources, as a result of a call to pbs.event().job.release_nodes() inside an execjob_prologue or execjob_launch hook. (A parsing sketch follows the example below.)
  • Note: This is a new accounting record; the start of job record ('S') remains as before.
  • Example:

    04/07/2016 17:08:09;s;20.corretja.pbspro.com;user=bayucan group=users project=_pbs_project_default jobname=jobr.scr queue=workq ctime=1460063202 qtime=1460063202 etime=1460063202 start=1460063203  exec_host=corretja/0*3+lendl/0*2+nadal/0 exec_vnode=(corretja:ncpus=3:mem=1048576kb)+(lendl:ncpus=2:mem=2097152kb)+(nadal:ncpus=1:mem=3145728kb) Resource_List.mem=6291456kb Resource_List.ncpus=6 Resource_List.nodect=3 Resource_List.place=scatter:excl Resource_List.select=1:ncpus=3:mem=1048576kb+1:ncpus=2:mem=2097152kb+1:ncpus=1:mem=3145728kb Resource_List.site=ncpus=3:mem=1gb+ncpus=2:mem=2gb+ncpus=1:mem=3gb resource_assigned.mem=24gb resource_assigned.ncpus=9
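
  • Parsing sketch (informational): The record follows the standard accounting log layout of a semicolon-separated timestamp, record type, and job id, followed by space-separated key=value attributes. A minimal, hypothetical Python sketch (not a PBS utility; the record text below is abbreviated from the example above):

      # split an 's' record into its timestamp, record type, job id, and attributes
      record = ("04/07/2016 17:08:09;s;20.corretja.pbspro.com;"
                "user=bayucan group=users Resource_List.ncpus=6 Resource_List.nodect=3")
      timestamp, rtype, jobid, attrs = record.split(";", 3)
      fields = dict(kv.split("=", 1) for kv in attrs.split())
      print("%s %s %s" % (rtype, jobid, fields["Resource_List.ncpus"]))   # -> s 20.corretja.pbspro.com 6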

Interface 3: sister_join_job_alarm mom config option

  • Visibility: Public
  • Change Control: Stable
  • Details:
    This is the number of seconds that the primary mom will wait to receive acknowledgements from all the sister moms for the IM_JOIN_JOB requests sent, when the job's tolerate_node_failures attribute is set to 'all' or 'job_start'. During this wait, and until just before the job officially launches its program (script/executable), the primary pbs_mom ignores any errors from sister moms, including failed IM_JOIN_JOB requests. Once all the IM_JOIN_JOB requests have been acknowledged, or the 'sister_join_job_alarm' wait time has been exceeded, pre-starting the job (calling finish_exec()) continues.
  • Default value: set to the sum of the 'alarm' values associated with enabled execjob_begin hooks. For example, if there are 2 execjob_begin hooks, with the first hook having alarm=30 and the second having alarm=20, then the default value of sister_join_job_alarm will be 50 seconds. If there are no execjob_begin hooks, this is set to 30 seconds (see the sketch at the end of this interface).
        To change value, add the following line in mom's config file:
                            $sister_join_job_alarm <# of seconds>
  • Log/Error messages:
  1. When the $sister_join_job_alarm value is specified, a PBSEVENT_SYSTEM level message is shown when mom starts up or is kill -HUPed:
       "sister_join_job_alarm;<alarm_value>"
  2. When not all join job requests from sister moms have been acknowledged within the $sister_join_job_alarm time limit, the following mom_logs message appears at DEBUG2 level:
       "sister_join_job_alarm wait time <alarm_value> secs exceeded"
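  • Illustration: a rough sketch of the default-value computation described above (plain Python arithmetic, not PBS source code; the alarm values are the ones from the example):

      # default = sum of the alarm values of enabled execjob_begin hooks, else 30
      begin_hook_alarms = [30, 20]
      default_sister_join_job_alarm = sum(begin_hook_alarms) if begin_hook_alarms else 30
      print(default_sister_join_job_alarm)   # -> 50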

Interface 4: job_launch_delay mom config option

  • Visibility: Public
  • Change Control: Stable
  • Details:
    This is the number of seconds that the primary mom will wait before launching the job (executing the job script or executable), when the job has tolerate_node_failures set to "all" or "job_start". This wait time can be used to let execjob_prologue hooks finish execution in order to capture or report any node failures, or for mother superior to notice any communication problems with other nodes. pbs_mom will not necessarily wait for the entire time; it proceeds to execute the execjob_launch hook (when specified) once all prologue hook acknowledgements have been received from the sister moms.
  • Default value: set to the sum of the 'alarm' values associated with enabled execjob_prologue hooks. For example, if there are 2 execjob_prologue hooks, where the first hook has alarm=30 and the second has alarm=60, then the default job_launch_delay value will be 90 seconds. If there are no execjob_prologue hooks, this is set to 30 seconds.
    To change value, add the following line in mom's config file:
                   $job_launch_delay <number of seconds>
  • Restriction:
    • This option is currently not supported under Windows. NOTE: Allowing it would cause the primary mom to hang waiting on the job_launch_delay timeout, preventing other jobs from starting. This is because jobs are not pre-started in a forked child process, unlike in Linux/Unix systems.
  • Log/Error messages:
  1. When the $job_launch_delay value is set, a PBSEVENT_SYSTEM level message appears upon mom startup or when it is kill -HUPed:
       "job_launch_delay;<delay_value>"
  2. When the primary mom notices that not all acknowledgements were received from the sister moms for execjob_prologue hook execution, mom_logs shows the following DEBUG2 level message:
       "not all prologue hooks to sister moms completed, but job will proceed to execute"

Interface 5: pbs.event().vnode_list_fail[] hook parameter

  • Visibility: Public
  • Change Control: Stable
  • Python Type: dict (dictionary of pbs.vnode objects keyed by vnode name)
  • Details:
    This is a new event parameter for the execjob_prologue and execjob_launch hooks. It contains the vnodes and their assigned resources that are managed by unhealthy moms. This can include vnodes from sister moms that failed to join the job, that rejected an execjob_begin or execjob_prologue hook request, or that encountered a communication error while the primary mom was polling the sister mom host. The dictionary is keyed by vnode name, and one can walk through it and offline the vnodes, for example:

    import pbs
    e = pbs.event()

    # offline every vnode reported as managed by an unhealthy mom
    for vn in e.vnode_list_fail:
        v = e.vnode_list_fail[vn]
        pbs.logmsg(pbs.LOG_DEBUG, "offlining %s" % (vn,))
        v.state = pbs.ND_OFFLINE

  • Additional Details:
    • Any sister nodes that are able to join the job are considered healthy.
    • The success of a join job request may be the result of a check made by a remote execjob_begin hook. After successfully joining the job, the node may further check its status via a remote execjob_prologue hook. A rejection by the remote prologue hook causes the primary mom to treat the sister node as a problem node and mark it as unhealthy. Unhealthy nodes are not selected when pruning a job's request via the release_nodes(keep_select) call (see interface 8 below).
    • If an execjob_prologue hook is in place, the primary mom tracks the node hosts that have sent an IM_ALL_OKAY acknowledgement for their execution of the execjob_prologue hook. Then, after the 'job_launch_delay' amount of time from job startup (interface 4), the primary mom starts reporting as failed those nodes that have not sent a positive acknowledgement during prologue hook execution. This information is communicated to the child mom running on behalf of the job, so that vnodes from the failed hosts are not used when pruning the job (i.e. the release_nodes(keep_select=X) call).
    • If, after some time, a node's host comes back with an acknowledgement of successful prologue hook execution, the primary mom adds the host back to the healthy list.
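  • Example: a minimal execjob_prologue hook sketch (illustration only; it assumes the job was submitted with tolerate_node_failures set) that offlines the vnodes of failing hosts from the primary mom:

      import pbs

      e = pbs.event()

      # only the hook instance run by the primary mom has the full picture
      if e.job.in_ms_mom():
          for vn in e.vnode_list_fail:
              v = e.vnode_list_fail[vn]
              pbs.logmsg(pbs.LOG_DEBUG, "offlining failed vnode %s" % (vn,))
              v.state = pbs.ND_OFFLINE

      e.accept()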

Interface 6: Allow execjob_launch hooks to modify job and vnode attributes

  • Visibility: Public
  • Change Control: Stable
  • Detail: With this feature, execjob_launch hooks are now allowed to modify job and vnode attributes, in particular the job's Execution_Time, Hold_Types, resources_used, and run_count values, as well as vnode object attributes such as state and resources_available (see the example hook at the end of this interface).
  • Examples:

                           Set a job's Hold_Types in case the hook script rejects the execjob_launch event:

                                pbs.event().job.Hold_Types = pbs.hold_types('s')

                           Set a vnode's state to offline:

                               pbs.event().vnode_list[<node_name>].state = pbs.ND_OFFLINE

  • Log/Error messages:

                           In previous versions of PBS, when a job or vnode attribute/resource was set in an execjob_launch hook, the hook rejected the request and returned the following message:

                                      "Can only set progname, argv, env event parameters under execjob_launch hook"

                           Now, setting vnode and job attributes is allowed and no longer gives the above message. If something else gets set in the hook, such as a server attribute, the following new DEBUG2 level mom_logs message appears instead:

                                      "Can only set progname, argv, env event parameters as well as job, resource, vnode under execjob_launch hook."

Interface 7: pbs.select.increment_chunks(increment_spec)

  • Visibility: Public
  • Change Control: Stable
  • Return Python Type: pbs.select
  • Details:
    This is a new method on the pbs.select type where 'increment_spec' additional chunks are added to each chunk (except for the first chunk, which is assigned to the primary mom) in the chunk specification. So given a select spec of "[N:][chunk specification][+[N:]chunk specification]", this function returns "[N+increment:][chunk specification][+[N+increment:]chunk specification]". A missing 'N' value means 1. The first chunk is the single chunk inside the first item (of the plus-separated specs) that is assigned to the primary mom; it is left as is.
    For instance, given a chunk spec of "3:ncpus=2+2:ncpus=4", this can be viewed as "(1:ncpus=2+2:ncpus=2)+(2:ncpus=4)", and the increment spec described below applies to the chunks after the initial single chunk "1:ncpus=2" and to all the succeeding chunks.
  • Input:
    • if 'increment_spec' is a number (int or long), it is the number of chunks to add to each chunk specification (excluding the first chunk) in the pbs.select spec.
    • if 'increment_spec' is a numeric string (representing an int or long), it is likewise the number of chunks to add to each chunk specification (excluding the first chunk) in the pbs.select spec.
    • if 'increment_spec' is a numeric string that ends with a percent sign (%), it is the percentage by which to increase the number of chunks of each chunk specification (excluding the first chunk) in the pbs.select spec. The resulting amount is rounded up (i.e. ceiling); e.g. 1.23 rounds up to 2.
    • Finally, if 'increment_spec' is a dictionary with elements of the form:
                       {<chunk_index_to_select_spec> : <increment>, ...}
      where <chunk_index_to_select_spec> starts at 0 for the first chunk, and <increment> can be a number, a numeric string, or a percent increase value. This allows the increase in the number of chunks to be specified individually per chunk. Note that for the first chunk in the list (index 0), the increment applies only to the chunks beyond the initial single chunk, which is assigned to the primary mom.
  • Example:

Given:
      sel=pbs.select("ncpus=3:mem=1gb+1:ncpus=2:mem=2gb+2:ncpus=1:mem=3gb")

Calling sel.increment_chunks(2) would return a pbs.select value mapping to:
     "1:ncpus=3:mem=1gb+3:ncpus=2:mem=2gb+4:ncpus=1:mem=3gb"

Calling sel.increment_chunks("3") would return a string:
     "1:ncpus=3:mem=1gb+4:ncpus=2:mem=2gb+5:ncpus=1:mem=3gb"

Calling sel.increment_chunks("23.5%"), would return a pbs.select value mapping to:
      "1:ncpus=3:mem=1gb+2:ncpus=2:mem=2gb+3:ncpus=1:mem=3gb"

where the first chunk, which is a single chunk, is left as is, and the second and third chunks are increased by 23.5%, giving 1.24 rounded up to 2 and 2.47 rounded up to 3.

Calling sel.increment_chunks({0: 0, 1: 4, 2: "50%"}) would return a pbs.select value mapping to:
     "1:ncpus=3:mem=1gb+5:ncpus=2:mem=2gb+3:ncpus=1:mem=3gb"

where chunk 1 gets no increase (0), chunk 2 gets 4 additional chunks, and chunk 3 gets a 50% increase, resulting in 3.

Given:
      sel=pbs.select("5:ncpus=3:mem=1gb+1:ncpus=2:mem=2gb+2:ncpus=1:mem=3gb")

Calling sel.increment_chunks("50%") or sel.increment_chunks({0: "50%", 1: "50%", 2: "50%"}) would return a pbs.select value mapping to:
      "7:ncpus=3:mem=1gb+2:ncpus=2:mem=2gb+3:ncpus=1:mem=3gb"

For the first chunk, the initial single chunk of "1:ncpus=3:mem=1gb" is left as is, the "50%" increase is applied to the remaining "4:ncpus=3:mem=1gb" (giving 6), and that is added back to the single chunk to make 7, while chunks 2 and 3 are increased to 2 and 3, respectively.

Interface 8: pbs.event().job.release_nodes(keep_select) method

  • Visibility: Public
  • Change Control: Stable
  • Return Python type: a PBS job object reflecting the new values of some attributes, such as 'exec_vnode' and Resource_List.*, as a result of nodes getting released.
  • Input: keep_select - a pbs.select string that should be a subset of the job’s original select request, mapping to a set of nodes that should be kept.
  • Restriction: This is currently callable only from the mom hooks execjob_launch and execjob_prologue, and it makes sense only when executed from the hook instance run by the primary mom.
    • It is advisable to put this call in an 'if pbs.event().job.in_ms_mom()' clause.
    • Also, since the execjob_launch hook also gets called when spawning tasks via pbsdsh or tm_spawn, ensure that the execjob_launch hook invoking the release_nodes() call checks for 'PBS_NODEFILE' in the pbs.event().env list. The presence of 'PBS_NODEFILE' in the environment ensures that the primary mom is executing on behalf of starting the top-level job, and not spawning a sister task. One can just add at the top of the hook:

                   if 'PBS_NODEFILE' not in pbs.event().env:
                       pbs.event().accept()

                   ...

                   pbs.event().job.release_nodes(keep_select=...)

NOTE: On Windows, where PBS_NODEFILE always appears in pbs.event().env, one needs to instead put the following at the top of the execjob_launch hook:

if any("mom_open_demux.exe" in s for s in e.argv):
    e.accept()


    • This call makes sense only when the job is node-failure tolerant (i.e. tolerate_node_failures=job_start or tolerate_node_failures=all), since that is when the lists of healthy and failed nodes are gathered and consulted by release_nodes() to determine which chunks should remain assigned and which should be freed. If it is invoked but the job is not tolerant of node failures, the following message is displayed in mom_logs under DEBUG level:

                     "<jobid>: no nodes released as job does not tolerate node failures"

  • Detail: Release nodes that are assigned to a job in such a way that the assignment still satisfies the given 'keep_select' specification, using no nodes that are known to be bad (i.e. in pbs.event().vnode_list_fail). Upon a successful execution of the release_nodes() call from an execjob_prologue or execjob_launch hook, the 's' accounting record (interface 2) is generated, and the primary mom notifies the sister moms to also update their internal nodes tables, so that future use of the task manager API (e.g. tm_spawn, pbsdsh) is aware of the change. If pbs_cgroups is enabled (PP-325 Support Cgroups), the cgroup already created for the job is also updated to match the job's new resources. If the kernel rejects the update to the job's cgroup resources, the job is aborted on the execution host side and requeued/rerun on the server side.
  • Examples:

           Given an execjob_prologue hook, a hook writer can release a set of nodes from a job by doing:

                pj = e.job.release_nodes(keep_select="ncpus=2:mem=2gb+ncpus=2:mem=2gb+ncpus=1:mem=1gb")
                if pj is not None:
                    pbs.logmsg(pbs.LOG_DEBUG, "pj.exec_vnode=%s" % (pj.exec_vnode,))
                else:               # returned a None job object, so hold the job, requeue it, and reject the hook event
                    e.job.Hold_Types = pbs.hold_types("s")
                    e.job.rerun()
                    e.reject("unsuccessful at LAUNCH")


  • Log/Error messages:
    • When a job's assigned nodes get pruned (nodes released to satisfy 'keep_select'), mom_logs shows the following info under the PBSEVENT_JOB log level:

      ";Job;<jobid>;pruned from exec_vnode=<original value>"
      ";Job;<jobid>;pruned to exec_vnode=<new value>"

    • When a multinode job's assigned resources have been modified, the primary mom does a quick 5-second wait for acknowledgements from the sister moms that they have updated their nodes tables. When not all acknowledgements are received by the primary mom during that 5-second wait, the following DEBUG2 level mom_logs message appears:

                       "not all job updates to sister moms completed"

                   Seeing this log message means that a job can momentarily receive an error when doing tm_spawn or pbsdsh to a node that did not complete the nodes table update yet.

    • When mother superior fails to prune the currently assigned chunk resources, the following detailed mom_logs messages are shown at DEBUG2 level:
      • "could not satisfy select chunk (<resc1>=<val1> <resc2>=<val2> ...<rescN>=<valN>)"
      • "NEED chunks for keep_select (<resc1>=<val1> <resc2>=<val2> ...<rescN>=<valN>)"
      • "HAVE chunks from job's exec_vnode (<exec_vnode value>)"
    • When a sister mom has updated its internal nodes table due to some nodes getting released as a result of the release_nodes() call, mom_logs on the sister host shows the following message at PBSEVENT_JOB level:
       ";<jobid>;updated nodes info"
    • Calling release_nodes() from a hook other than an execjob_prologue or execjob_launch hook returns None, as this is currently not supported.
    • Upon successful execution of a release_nodes() call, it is normal to see messages in mom_logs of the form:

                     " stream <num> not found to job nodes"
                     "im_eof, No error from addr <ipaddr>:<port> on stream <num>"

      which correspond to the connection streams of the released mom hosts.

Interface 9: new hook event: execjob_resize

  • Visibility: Public
  • Change Control: Stable
  • Python constant: pbs.EXECJOB_RESIZE
  • Event Parameters: 
    • pbs.event().job - This is a pbs.job object representing the job whose resources have been updated. This job object cannot be modified under this hook.
    • pbs.event().vnode_list[] - This is a dictionary of pbs.vnode objects, keyed by vnode name, listing the vnodes that are assigned to the job. The vnode objects in the vnode_list cannot be modified.
  • Restriction: The execjob_resize hook will run under the security context of Admin user.
  • Details:
    • The execjob_resize event has been introduced primarily as a new event for the pbs_cgroups hook, to be executed when there is an update to the job's assigned resources as a result of a release_nodes() call. This allows pbs_cgroups to act on a change to the job's resources; the action is to update the limits of the job's cgroup. (A minimal example hook appears at the end of this interface.)
    • If the pbs_cgroups hook, while executing in response to an execjob_resize event, calls pbs.event().reject(<message>), encounters an exception, or terminates due to an alarm call, the following DEBUG2 mom_logs messages result, and the job is aborted on the mom side and requeued/rerun on the server side:

      "execjob_resize" request rejected by 'pbs_cgroups'
      <message>

  • New qmgr output:
    • The returned error message from qmgr upon seeing an unrecognized hook event has changed (due to this additional hook event):

      # qmgr -c "set hook <hook_name> event = <bad_event>"

      from:
      invalid argument (yaya) to event. Should be one or more of: queuejob,modifyjob,resvsub,movejob,runjob,provision,periodic,resv_end,execjob_begin,execjob_prologue,execjob_epilogue,execjob_preterm,execjob_end,exechost_periodic,execjob_launch,exechost_startup,execjob_attach or "" for no event

      to:
      invalid argument (yaya) to event. Should be one or more of: queuejob,modifyjob,resvsub,movejob,runjob,provision,periodic,resv_end,execjob_begin,execjob_prologue,execjob_epilogue,execjob_preterm,execjob_end,exechost_periodic,execjob_launch,exechost_startup,execjob_attach,execjob_resize or "" for no event

  • External documentation:
    • This hook event is intentionally not added to the external documentation (as of 2021.1.3), because it is intended for use primarily by the cgroups hook.
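  • Example hook: although this event is intended primarily for pbs_cgroups, a minimal, logging-only execjob_resize hook sketch (hypothetical hook name 'resize_log'; illustration only) could look like:

      # cat resize_log.py
      import pbs

      e = pbs.event()
      j = e.job    # read-only under this event

      # log the job's updated assignment; vnode_list holds the vnodes still assigned
      pbs.logmsg(pbs.LOG_DEBUG, "job %s resized; exec_vnode=%s" % (j.id, j.exec_vnode))
      for vn in e.vnode_list:
          pbs.logmsg(pbs.LOG_DEBUG, "still assigned vnode: %s" % (vn,))

      e.accept()

      # qmgr -c "create hook resize_log event=execjob_resize"
      # qmgr -c "import hook resize_log application/x-python default resize_log.py"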

Case of Reliable Job Startup:

In order to have a job start reliably, we need a queuejob hook that makes the job tolerate node failures by setting the 'tolerate_node_failures' attribute to 'job_start', saves the job's original select value into a builtin resource (here "site"), and adds extra chunks to the job's select specification using the pbs.event().job.select.increment_chunks() method (interface 7). We also need an execjob_launch hook that calls pbs.event().job.release_nodes() to prune the job's node assignment back to the original select request.

NOTE: In the future, we may allow any custom resource to be created and used to save the 'select' value. Currently, however, custom resources populating Resource_List are not propagated from the server to the mom, and they need to be, since the mom hook will use the value.

                  

First, introduce a queuejob hook:
% cat qjob.py

import pbs
e=pbs.event()

j = e.job

j.tolerate_node_failures = "job_start"

Then, save the current value of 'select' in the builtin resource "site":

e.job.Resource_List["site"] = str(e.job.Resource_List["select"])

Next, add extra chunks to the current select:

new_select = e.job.Resource_List["select"].increment_chunks(1)
e.job.Resource_List["select"] = new_select

Now instantiate the queuejob hook:
# qmgr -c "c h qjob event=queuejob"
# qmgr -c "i h qjob application/x-python default qjob.py"

Next, introduce an execjob_launch hook so that, before the job officially runs its program, the job's currently assigned resources are pruned to match the user's original 'select' request:

% cat launch.py

import pbs
e=pbs.event()

if 'PBS_NODEFILE' not in e.env:

    e.accept()

j = e.job
pj = j.release_nodes(keep_select=e.job.Resource_List["site"])

if pj is None:          # was not successful pruning the nodes
    j.rerun()           # rerun (requeue) the job
    e.reject("something went wrong pruning the job back to its original select request")

Instantiate the launch hook:

# qmgr -c "c h launch event=execjob_launch"
# qmgr -c "i h launch application/x-python default launch.py"


And say a job is of the form:


% cat jobr.scr
#PBS -l select="ncpus=3:mem=1gb+ncpus=2:mem=2gb+ncpus=1:mem=3gb"
#PBS -l place=scatter:excl

echo $PBS_NODEFILE
cat $PBS_NODEFILE
echo END
echo "HOSTNAME tests"
echo "pbsdsh -n 0 hostname"
pbsdsh -n 0 hostname
echo "pbsdsh -n 1 hostname"
pbsdsh -n 1 hostname
echo "pbsdsh -n 2 hostname"
pbsdsh -n 2 hostname
echo "PBS_NODEFILE tests"
for host in `cat $PBS_NODEFILE`
do
    echo "HOST=$host"
    echo "pbs_tmrsh $host hostname"
    pbs_tmrsh $host hostname
    echo "ssh $host pbs_attach -j $PBS_JOBID hostname"
    ssh $host pbs_attach -j $PBS_JOBID hostname
done


When the job first starts, it gets assigned 5 nodes, since the modified select specification causes 2 extra nodes to be assigned:

% qstat -f 20
Job Id: 20.borg.pbspro.com
...
exec_host = borg/0*3+federer/0*2+lendl/0*2+agassi/0+sampras/0
exec_vnode = (borg:ncpus=3:mem=1048576kb)+(federer:ncpus=2:mem=2097152kb)+(lendl:ncpus=2:mem=2097152kb)+(agassi:ncpus=1:mem=3145728kb)+(sampras:ncpus=1:mem=3145728kb)
Resource_List.mem = 11gb
Resource_List.ncpus = 9
Resource_List.nodect = 5
Resource_List.place = scatter:excl
Resource_List.select = ncpus=3:mem=1gb+2:ncpus=2:mem=2gb+2:ncpus=1:mem=3gb
Resource_List.site = 1:ncpus=3:mem=1gb+1:ncpus=2:mem=2gb+1:ncpus=1:mem=3gb

tolerate_node_failures = job_start

Suppose federer and sampras went down. Then, just before the job runs its program, the execjob_launch hook executes and prunes the job's node assignment back to the original select request, and the job detail now shows:

% qstat -f 20
Job Id: 20.borg.pbspro.com
...
exec_host = borg/0*3+lendl/0*2+agassi/0
exec_vnode = (borg:ncpus=3:mem=1048576kb)+(lendl:ncpus=2:mem=2097152kb)+(agassi:ncpus=1:mem=3145728kb)
Resource_List.mem = 6gb
Resource_List.ncpus = 6
Resource_List.nodect = 3
Resource_List.place = scatter:excl
Resource_List.select = 1:ncpus=3:mem=1gb+1:ncpus=2:mem=2gb+1:ncpus=1:mem=3gb

Resource_List.site = 1:ncpus=3:mem=1gb+1:ncpus=2:mem=2gb+1:ncpus=1:mem=3gb

A snapshot of the job's output would show the pruned list of nodes:

/var/spool/PBS/aux/20.borg.pbspro.com <-- updated contents of $PBS_NODEFILE
borg.pbspro.com
lendl.pbspro.com
agassi.pbspro.com
END

HOSTNAME tests

pbsdsh -n 0 hostname
borg.pbspro.com
pbsdsh -n 1 hostname
lendl.pbspro.com
pbsdsh -n 2 hostname
agassi.pbspro.com

PBS_NODEFILE tests
HOST=borg.pbspro.com
pbs_tmrsh borg.pbspro.com hostname
borg.pbspro.com
ssh borg.pbspro.com pbs_attach -j 20.borg.pbspro.com hostname
borg.pbspro.com
HOST=lendl.pbspro.com
pbs_tmrsh lendl.pbspro.com hostname
lendl.pbspro.com
ssh lendl.pbspro.com pbs_attach -j 20.borg.pbspro.com hostname
lendl.pbspro.com
HOST=agassi.pbspro.com
pbs_tmrsh agassi.pbspro.com hostname
agassi.pbspro.com
ssh agassi.pbspro.com pbs_attach -j 20.borg.pbspro.com hostname
agassi.pbspro.com