Objective
PBS already provides several mom hook events: EXECJOB_BEGIN, when a job first enters pbs_mom for execution; EXECJOB_PROLOGUE, when a job does its initial setup; EXECJOB_PRETERM, when a job is requested to terminate early; EXECJOB_EPILOGUE, when a job starts performing its cleanup; and EXECJOB_END, when a job finally leaves pbs_mom. This design proposes an additional hook event that responds when a job exits prematurely during startup. Such an event is especially useful when a site has an execjob_begin (or execjob_prologue) hook that performs system setup, such as pre-creating files for a job, and relies on an execjob_epilogue or execjob_end hook to clean those files up after the job ends. If the job ends prematurely, the epilogue or end hook may not always execute. The new hook event, called EXECJOB_ABORT, handles this situation.
Forum: http://community.pbspro.org
Interface 1: new hook event: execjob_abort
- Visibility: Public
- Change Control: Stable
- Python constant: pbs.EXECJOB_ABORT
- Event Parameters:
- pbs.event().job - This is a pbs.job object representing the job that is ending prematurely. This job object cannot be modified under this hook.
- pbs.event().vnode_list[] - This is a dictionary of pbs.vnode objects, keyed by vnode name, listing the vnodes that are assigned to the job. The vnode objects in the vnode_list cannot be modified.
- Restriction: The execjob_abort hook will run under the security context of Admin user.
- Details:
- An execjob_abort hook is executed by the primary mom when a job has problems starting up and needs to be aborted. Some sample failure conditions include:
- execjob_prologue hook rejections (from primary mom or sister mom) whether explicitly or implicitly due to unhandled exceptions
- execjob_launch hook rejections (whether explicitly or implicitly due to unhandled exceptions) from primary mom before executing top-level job script
- errors in fork() calls when starting child job process
- failure to save task information on disk for checkpoint recovery later
- communication pipe and socket errors
- failure to restart the job from a checkpoint file image
- failure to create a cpuset
- failure to set up ptys for an interactive job
- An execjob_abort hook is executed by the sister mom when it encounters an error while attempting to join a job. Error conditions include:
- errors during job setup
- failure to create a cpuset
- failure in the mkdir() call for a temp dir/file
- failure in the mkjobdir() call
- problem obtaining the owning user's credential
- communication errors with the primary mom
- An execjob_abort hook is also executed by a sister mom on behalf of a job that the primary mom has requested to be aborted because the primary mom encountered problems starting the job.
- A call to pbs.event().accept() means the hook code has executed cleanly, but there will be no changes to job attributes, resources, or vnodes in the vnode list. If the hook rejects the request instead, the following message would appear in the MoM log at log event class PBSEVENT_DEBUG2:
“execjob_abort request rejected by ”
- If the execjob_abort hook script encounters an unexpected error causing an unhandled exception, the following messages would appear in the MoM logs at event class PBSEVENT_DEBUG2:
“execjob_abort hook encountered an exception, request rejected”
“alarm call while running execjob_abort hook '', request rejected”
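The event parameters and the accept/exception behavior described above can be sketched as a minimal hook body. This is only an illustration, not code from the PBS distribution: `summarize_abort` and `run_hook` are hypothetical helpers, and the event and logger are passed in as arguments so the logic can be exercised outside pbs_mom. The sketch only reads the job and vnode objects, since they cannot be modified under this hook.

```python
def summarize_abort(event):
    """Build a one-line description of the aborted job (read-only access)."""
    job_id = event.job.id
    # vnode_list is keyed by vnode name; only the names are read here.
    vnode_names = sorted(event.vnode_list.keys())
    return "job %s aborted during startup; vnodes: %s" % (
        job_id, ",".join(vnode_names))


def run_hook(event, log):
    """Hypothetical hook body: log a summary, then accept.

    Exceptions are caught so that a failure in site code is logged by the
    hook itself instead of surfacing as an unhandled exception.
    """
    try:
        log(summarize_abort(event))
    except Exception as exc:
        log("execjob_abort hook problem: %r" % (exc,))
    event.accept()
```

Inside an actual hook script, this would presumably be wired up along the lines of `run_hook(pbs.event(), lambda m: pbs.logmsg(pbs.LOG_DEBUG, m))`.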
- Additional details:
- A primary mom that fails to start a job now causes both an execjob_abort hook and an execjob_end hook to execute. In contrast, a sister mom that fails to join a job causes only an execjob_abort hook to execute.
- NOTE: A site that has cleanup code in an execjob_end hook could simply add the same cleanup code to an execjob_abort hook, but let it execute only when the hook is called by a sister mom, which can be done as follows:
import pbs
e = pbs.event()
# Clean up only on a sister mom; on the primary mom the execjob_end hook also runs.
if not e.job.in_ms_mom():
    <do cleanup code>
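The rule in the note above can be written out as a small decision function. This is a sketch under the assumptions stated in this section (a primary mom runs both execjob_abort and execjob_end for a job that fails to start, while a sister mom runs only execjob_abort). `should_cleanup` is a hypothetical helper, and the event names are plain strings standing in for the `pbs.EXECJOB_*` constants.

```python
def should_cleanup(event_name, on_primary_mom):
    """Return True if this hook invocation should run the site cleanup code.

    event_name:     "execjob_end" or "execjob_abort" (stand-ins for the
                    pbs.EXECJOB_END / pbs.EXECJOB_ABORT constants).
    on_primary_mom: True when the hook runs on the primary mom.
    """
    if event_name == "execjob_end":
        # execjob_end already carries the cleanup code.
        return True
    if event_name == "execjob_abort":
        # The primary mom will also run execjob_end, so only a sister mom
        # needs to clean up from execjob_abort.
        return not on_primary_mom
    return False
```

With this rule, the cleanup runs exactly once per mom for a job that fails during startup: from execjob_end on the primary mom and from execjob_abort on each sister mom.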
- New qmgr output:
The returned error message from qmgr upon seeing an unrecognized hook event has changed (due to this additional hook event):
# qmgr -c "set hook <hook_name> event = <bad_event>"
from:
invalid argument (yaya) to event. Should be one or more of: queuejob,modifyjob,resvsub,movejob,runjob,provision,periodic,resv_end,execjob_begin,execjob_prologue,execjob_epilogue,execjob_preterm,execjob_end,exechost_periodic,execjob_launch,exechost_startup,execjob_attach,execjob_resize or "" for no event
to:
invalid argument (yaya) to event. Should be one or more of: queuejob,modifyjob,resvsub,movejob,runjob,provision,periodic,resv_end,execjob_begin,execjob_prologue,execjob_epilogue,execjob_preterm,execjob_end,exechost_periodic,execjob_launch,exechost_startup,execjob_attach,execjob_resize,execjob_abort or "" for no event
- External dependency:
- The pbs_cgroups hook will be modified to add an execjob_abort handler, which would call the cleanup code done in the execjob_end handler.
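One way to picture this change is a handler table in which the execjob_abort entry reuses the execjob_end cleanup routine. The sketch below is not the pbs_cgroups source: the `HANDLERS` table, `cleanup_job`, `dispatch`, and the per-job directory layout are all hypothetical. Because a primary mom runs both execjob_abort and execjob_end for a job that fails to start, the cleanup is written so that a second call is harmless.

```python
import os
import shutil


def cleanup_job(job_id, base_dir):
    """Remove the per-job directory; safe to call more than once.

    Returns True if the directory existed before this call.
    """
    job_dir = os.path.join(base_dir, job_id)
    existed = os.path.isdir(job_dir)
    # ignore_errors=True makes a repeat call on a missing directory a no-op.
    shutil.rmtree(job_dir, ignore_errors=True)
    return existed


# Both events map to the same routine, so abort reuses the end-of-job cleanup.
HANDLERS = {
    "execjob_end": cleanup_job,
    "execjob_abort": cleanup_job,
}


def dispatch(event_name, job_id, base_dir):
    """Run the handler registered for event_name, if any."""
    handler = HANDLERS.get(event_name)
    if handler is None:
        return False
    return handler(job_id, base_dir)
```

Mapping both events to one routine keeps the cleanup logic in a single place, so a fix made for the end-of-job path would also apply to the abort path.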