Objective

...

  • Visibility: Public
  • Change Control: Stable
  • Value: 'all', 'job_start', or 'none'
  • Python type: str
  • Synopsis:  
    • When set to 'all', node failures are tolerated when they result from communication problems (e.g. polling) between the primary mom and the sister moms assigned to the job, or from rejections by execjob_begin or execjob_prologue hooks executed by remote moms.
    • When set to 'job_start', only node failures that occur during job start are tolerated, such as an assigned sister mom failing to join the job, or communication errors between the primary mom and sister moms just before the job executes the execjob_launch hook and/or the top-level shell or executable.
    • When set to 'none', or if the attribute is unset, no node failures are tolerated (default behavior).

    • It can be set via qsub, qalter, or in a Python hook such as a queuejob hook. If it is set via qalter while the job is already running, the new value is consulted the next time the job is rerun.
    • It can also be specified in the server attribute 'default_qsub_arguments' so that all jobs are submitted with the tolerate_node_failures attribute set.
    • This option is best used when the job is assigned extra nodes using the pbs.event().job.select.increment_chunks() method (interface 7).
    • The ‘tolerate_node_failures’ job option is currently not supported on Cray systems. If specified, a Cray primary mom would ignore the setting.
  • Privilege: user, admin, or operator can set it
  • Examples:
    • Via qsub:

...

                            qalter -W tolerate_node_failures="job_start" <jobid>

    • Via a hook:

                            # cat qjob.py
                            import pbs
                            e=pbs.event()
                            e.job.tolerate_node_failures = "all"
                            # qmgr -c "create hook qjob event=queuejob"
                            # qmgr -c "import hook application/x-python default qjob.py"
                            % qsub job.scr
                            23.borg
                            % qstat -f 23
                              ...
                              tolerate_node_failures = all
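
The three attribute values described in the synopsis can be summarized with a small model. This is only an illustrative sketch, not PBS code: the function name `failure_tolerated` and the `during_job_start` flag are hypothetical, and the model covers only the kinds of node failure listed in the synopsis.

```python
# Hypothetical model of the tolerate_node_failures values; not part of PBS.
VALID_SETTINGS = ("all", "job_start", "none")


def failure_tolerated(setting, during_job_start):
    """Return True if a node failure would be tolerated under `setting`.

    `setting` is the attribute value (None means the attribute is unset).
    `during_job_start` is True when the failure happens before the job's
    top-level shell or executable starts.
    """
    if setting is None:
        setting = "none"  # unset behaves like 'none' (the default)
    if setting not in VALID_SETTINGS:
        raise ValueError("unknown tolerate_node_failures value: %r" % (setting,))
    if setting == "all":
        return True
    if setting == "job_start":
        return during_job_start
    return False
```

Under this model, a job submitted with 'job_start' survives a sister mom failing to join the job but not a communication failure later in the run.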

...

                   This log message means that a job may momentarily receive an error when it does a tm_spawn or pbsdsh to a node that has not yet completed the nodes table update.

    • When the mother superior fails to prune the currently assigned chunk resources, the following detailed mom_logs messages are logged at DEBUG2 level:
      • "could not satisfy select chunk (<resc1>=<val1> <resc2>=<val2> ... <rescN>=<valN>)"

      • "NEED chunks for keep_select (<resc1>=<val1> <resc2>=<val2> ... <rescN>=<valN>)"
      • "HAVE chunks from job's exec_vnode (<exec_vnode value>)"
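
When reading these DEBUG2 messages, it can help to pull the resource list out of the parenthesized part. The following is a minimal sketch that assumes the chunk is written as space-separated `<resc>=<val>` pairs inside one pair of parentheses, as in the first two messages above. The function name `parse_chunk_resources` is hypothetical, and the sketch is not meant for the exec_vnode message, whose value may contain nested parentheses.

```python
import re

# Assumed shape: <prefix text> (<resc1>=<val1> <resc2>=<val2> ...)
_CHUNK_RE = re.compile(r"\(([^()]*)\)")


def parse_chunk_resources(log_message):
    """Return a dict of resource name -> value string from a chunk message."""
    match = _CHUNK_RE.search(log_message)
    if match is None:
        return {}
    resources = {}
    for pair in match.group(1).split():
        name, sep, value = pair.partition("=")
        if sep:  # skip tokens that are not <resc>=<val> pairs
            resources[name] = value
    return resources
```

Values are kept as strings, since sizes such as 2gb carry a unit suffix.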

...

  • Visibility: Public
  • Change Control: Stable
  • Python constant: pbs.EXECJOB_RESIZE
  • Event Parameters: 
    • pbs.event().job - This is a pbs.job object representing the job whose resources have been updated. This job object cannot be modified under this hook.
    • pbs.event().vnode_list[] - This is a dictionary of pbs.vnode objects, keyed by vnode name, listing the vnodes that are assigned to the job. The vnode objects in the vnode_list cannot be modified.
  • Restriction: The execjob_resize hook will run under the security context of the Admin user.
  • Details:
    • The execjob_resize event has been introduced primarily so that the pbs_cgroups hook can be executed when a job's assigned resources are updated as a result of a release_nodes() call. This allows pbs_cgroups to act on the change, namely by updating the limits of the job's cgroup.
    • If the pbs_cgroups hook is executing in response to an execjob_resize event and it calls pbs.event().reject(<message>), encounters an exception, or terminates due to an alarm call, then the following DEBUG2 mom_logs message is logged, the job is aborted on the mom side, and it is requeued/rerun on the server side:

      "execjob_resize" request rejected by 'pbs_cgroups'
      <message>

  • New qmgr output:
    • The returned error message from qmgr upon seeing an unrecognized hook event has changed (due to this additional hook event):

      # qmgr -c "set hook <hook_name> event = <bad_event>"

      from:
      invalid argument (yaya) to event. Should be one or more of: queuejob,modifyjob,resvsub,movejob,runjob,provision,periodic,resv_end,execjob_begin,execjob_prologue,execjob_epilogue,execjob_preterm,execjob_end,exechost_periodic,execjob_launch,exechost_startup,execjob_attach or "" for no event

      to:
      invalid argument (yaya) to event. Should be one or more of: queuejob,modifyjob,resvsub,movejob,runjob,provision,periodic,resv_end,execjob_begin,execjob_prologue,execjob_epilogue,execjob_preterm,execjob_end,exechost_periodic,execjob_launch,exechost_startup,execjob_attach,execjob_resize or "" for no event

  • External documentation:
    • This hook event is intentionally not added to the external documentation (as of 2021.1.3), because it is intended for use primarily by the cgroups hook.
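
The event parameters described above can be exercised with a hook body along the following lines. This is a hedged sketch, not the pbs_cgroups implementation: the `handle_resize` helper and its "reject when no vnodes remain" policy are hypothetical, chosen only to show read-only access to the job and vnode_list and the accept/reject calls. The `import pbs` guard is there solely so that the file can be loaded outside the PBS hook interpreter.

```python
try:
    import pbs  # available only inside the PBS hook interpreter
except ImportError:
    pbs = None  # lets the sketch be loaded outside PBS


def handle_resize(event, log=None):
    """Accept the resize unless no vnodes remain assigned to the job.

    The job and vnode objects are read-only in this event, so the
    handler only inspects them.
    """
    vnode_names = sorted(event.vnode_list.keys())
    if log is not None:
        log("job %s now has vnodes: %s" % (event.job.id, ",".join(vnode_names)))
    if not vnode_names:
        # Hypothetical policy: a rejection aborts the job on the mom side
        # and requeues it on the server side.
        event.reject("no vnodes left after resize")
        return False
    event.accept()
    return True


if pbs is not None:
    handle_resize(pbs.event(), log=lambda msg: pbs.logmsg(pbs.LOG_DEBUG, msg))
```

Passing the event in as a parameter keeps the decision logic separate from the `pbs` module, so it can be tried with a stand-in event object before the hook is imported with qmgr.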

Case of Reliable Job Startup:

...