Execjob_abort hook
Objective
PBS already provides several mom hook events that fire when a job first enters pbs_mom for execution (EXECJOB_BEGIN), when the job does its initial setup (EXECJOB_PROLOGUE), when the job is asked to terminate early (EXECJOB_PRETERM), when the job starts its cleanup (EXECJOB_EPILOGUE), and when the job finally leaves pbs_mom (EXECJOB_END). This design proposes one more: a hook event that fires when a job exits prematurely during startup. Such an event is especially useful when a site has an execjob_begin (or execjob_prologue) hook that performs system setup, such as pre-creating files for a job, and relies on an execjob_epilogue or execjob_end hook to clean those files up after the job ends. If the job ends prematurely, the epilogue or end hook may not always execute. The new hook event, EXECJOB_ABORT, handles this situation.
Forum: http://community.pbspro.org/t/a-new-hook-event-execjob-abort/1460
Why add a new execjob_abort hook event instead of calling the existing execjob_end hook for all abort cases?
- The execjob_end hook is called by both primary and sister moms at the end of a job, either after the job runs to completion or when the job stops after being interrupted by qdel or by a communication problem between the server and the sister moms.
- A sister mom may not always execute an execjob_end hook on behalf of a job, especially when the sister mom has problems joining the job.
A new execjob_abort hook has been introduced, instead of reusing the execjob_end hook, to preserve backward compatibility. Some sites may already have execjob_end hooks in place, and would be surprised to see their hook called additionally when a sister mom fails to join the job, or when the primary mom fails to start a job.
The end hook's purpose may not always be cleanup. It may just print end-of-job statistics, which a site would not expect to see when a sister mom fails even to become part of the job.
Interface 1: new mom hook event: execjob_abort
- Visibility: Public
- Change Control: Stable
- Python constant: pbs.EXECJOB_ABORT
- Event Parameters:
- pbs.event().job - This is a pbs.job object representing the job that is ending prematurely. This job object cannot be modified under this hook.
- pbs.event().vnode_list[] - This is a dictionary of pbs.vnode objects, keyed by vnode name, listing the vnodes that are assigned to the job. The vnode objects in the vnode_list cannot be modified.
- Restriction: The execjob_abort hook will run under the security context of Admin user.
- Details:
- An execjob_abort hook is executed by the primary mom when a job has problems starting up and needs to be aborted. Some sample failure conditions include:
- execjob_prologue hook rejections (from primary mom or sister mom) whether explicitly or implicitly due to unhandled exceptions
- execjob_launch hook rejections (whether explicitly or implicitly due to unhandled exceptions) from primary mom before executing top-level job script
- errors in fork() calls when starting child job process
- failure to save task information on disk for checkpoint recovery later
- communication pipe and socket errors
- failure to restart the job from a checkpoint file image
- failure to create a cpuset
- failure to set up ptys for an interactive job
An execjob_abort hook is executed by the sister mom when it encounters an error while attempting to join a job, where error conditions include:
- errors during job setup
- failure to create a cpuset
- failure in a mkdir() call for a temp dir/file
- failure in a mkjobdir() call
- problem obtaining the owning user's credential
- communication errors with the primary mom
An execjob_abort hook is also executed by a sister mom on behalf of a job that the primary mom has requested to be aborted because the primary mom encountered problems starting the job.
A call to pbs.event().accept() means the hook code has executed cleanly, but this hook will not cause changes to job attributes, resources, or vnodes in the vnode list.
A call to pbs.event().reject() means the hook code was not able to fully accomplish its task. The following message would appear in the MoM log at log event class PBSEVENT_DEBUG2:
“execjob_abort request rejected by ”
If the execjob_abort hook script encounters an unexpected error causing an unhandled exception, the following messages would appear in the MoM logs at event class PBSEVENT_DEBUG2:
“execjob_abort hook encountered an exception, request rejected”
“alarm call while running execjob_abort hook '', request rejected”
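The accept/reject behaviour described above can be illustrated with a minimal sketch of an execjob_abort hook that removes per-job files created by an earlier execjob_begin hook. The scratch location `/var/tmp/pbs_scratch` and the helper names `remove_job_scratch` and `handle_abort` are hypothetical and not part of this design. The `pbs` module is only importable inside the PBS hook interpreter, so the entry point is guarded.

```python
import os
import shutil

# Hypothetical location where an execjob_begin hook pre-created per-job files.
SCRATCH_ROOT = "/var/tmp/pbs_scratch"


def remove_job_scratch(job_id, root=SCRATCH_ROOT):
    """Remove the per-job scratch directory; return True if one was removed."""
    path = os.path.join(root, job_id)
    if not os.path.isdir(path):
        return False  # nothing to clean up (setup never got that far)
    shutil.rmtree(path)
    return True


def handle_abort(event, root=SCRATCH_ROOT):
    """Clean up for the aborted job, then accept or reject the event.

    The job object is only read here, since it cannot be modified
    under execjob_abort.
    """
    try:
        remove_job_scratch(event.job.id, root)
    except OSError as err:
        # reject() signals that the hook could not finish its task.
        event.reject("scratch cleanup failed: %s" % err)
        return
    event.accept()


try:
    import pbs  # provided by the PBS hook interpreter only
except ImportError:
    pbs = None

if pbs is not None:
    handle_abort(pbs.event())
```

Treating a missing directory as a non-error keeps the hook safe to run when the job failed before any setup happened.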
- Additional details:
- When a primary mom fails to start a job, both an execjob_abort hook and an execjob_end hook now execute. In contrast, when a sister mom fails to join a job, only an execjob_abort hook executes.
- When both execjob_abort and execjob_end hooks execute, the former is always called first.
- Normally, a job is requeued after the abort hook executes, but there may be cases where the job exits completely: when an earlier execjob_begin or execjob_prologue hook instructed the job to be deleted via the pbs.event().job.delete() call, or when an earlier execjob_launch hook resulted in a rejection.
- Normally, a job exits completely after the execjob_end hook runs. However, the job may be requeued if an earlier execjob_epilogue hook instructed the job to be requeued via the pbs.event().job.rerun() call.
- NOTE: A site that has cleanup code in an execjob_end hook could simply add the same cleanup code to an execjob_abort hook, but let it execute only when the hook is called by a sister mom, which can be done as follows:
import pbs
e = pbs.event()
if not e.job.in_ms_mom():
    <do cleanup code>
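For a site that registers one script for both execjob_abort and execjob_end, the guard in the note can be written as a small decision function. This is a sketch under stated assumptions: the function name `should_run_cleanup` and the string event names are illustrative, and the glue code assumes the job object exposes `in_ms_mom()` and that the event type is available as `pbs.event().type`.

```python
def should_run_cleanup(event_name, on_primary_mom):
    """Decide whether this hook invocation should perform cleanup.

    execjob_end: always clean up, as the site's end hook did before.
    execjob_abort on the primary mom: skip, because execjob_end runs
        after the abort hook there.
    execjob_abort on a sister mom: clean up, following the note above.
    """
    if event_name == "execjob_end":
        return True
    if event_name == "execjob_abort":
        return not on_primary_mom
    return False


try:
    import pbs  # provided by the PBS hook interpreter only
except ImportError:
    pbs = None

if pbs is not None:
    e = pbs.event()
    # Map event constants to the illustrative names used above.
    names = {pbs.EXECJOB_ABORT: "execjob_abort", pbs.EXECJOB_END: "execjob_end"}
    # Assumption: in_ms_mom() returns True on the primary mom.
    if should_run_cleanup(names.get(e.type), e.job.in_ms_mom()):
        pass  # site cleanup code goes here
```

Keeping the decision in a plain function lets it be checked outside pbs_mom, where the `pbs` module is not available.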
- New qmgr output:
The returned error message from qmgr upon seeing an unrecognized hook event has changed (due to this additional hook event):
# qmgr -c "set hook <hook_name> event = <bad_event>"
from:
invalid argument (yaya) to event. Should be one or more of: queuejob,modifyjob,resvsub,movejob,runjob,provision,periodic,resv_end,execjob_begin,execjob_prologue,execjob_epilogue,execjob_preterm,execjob_end,exechost_periodic,execjob_launch,exechost_startup,execjob_attach,execjob_resize or "" for no event
to:
invalid argument (yaya) to event. Should be one or more of: queuejob,modifyjob,resvsub,movejob,runjob,provision,periodic,resv_end,execjob_begin,execjob_prologue,execjob_epilogue,execjob_preterm,execjob_end,exechost_periodic,execjob_launch,exechost_startup,execjob_attach,execjob_resize,execjob_abort or "" for no event
- External dependency:
- The pbs_cgroups hook will be modified to add an execjob_abort handler, which would call the cleanup code done in the execjob_end handler.
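The delegation described for pbs_cgroups might look like the following sketch. It is not the actual pbs_cgroups code: the class name, handler names, and string event keys are hypothetical, and the "cleanup" only records the job id so the flow can be followed.

```python
class CleanupHookSketch(object):
    """Hypothetical event-to-handler table with an abort handler
    that reuses the end handler's cleanup."""

    def __init__(self):
        self.cleaned = []  # job ids whose cleanup has run
        self.handlers = {
            "execjob_end": self.on_end,
            "execjob_abort": self.on_abort,
        }

    def on_end(self, job_id):
        """Stand-in for the cleanup done at the end of a job."""
        if job_id not in self.cleaned:
            self.cleaned.append(job_id)
        return True

    def on_abort(self, job_id):
        """Abort handler: delegate to the same cleanup as on_end."""
        return self.on_end(job_id)

    def dispatch(self, event_name, job_id):
        """Run the handler for event_name; return False if none exists."""
        handler = self.handlers.get(event_name)
        if handler is None:
            return False
        return handler(job_id)
```

Because the primary mom runs execjob_abort and then execjob_end for the same job, the stand-in cleanup is written so that a second call is harmless.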