[WIP] PP-1303: Adding Job-Suspend hook event in MoM side
Overview:
The objective of https://pbspro.atlassian.net/browse/PP-1303 is to provide a hook event that is invoked when MoM receives a suspend request for a running job. Just before SIGSTOP is sent, an execjob_suspend hook can:
- Modify the job’s Execution_Time, Hold_Types, and resources_used attributes.
- Cause the job to keep running.
- Set attributes and resources on the vnode(s) managed by the MoM where the job executes.
- Flag the job to be rerun.
The event is motivated by the following use cases:
- Third-party licensing software: in mainline today, license resources are deemed to be freed on suspension, but sending a STOP signal to a job is not enough to actually free the licenses.
- Special checkpointing operations, which can also free many of the resources that preempted jobs are holding.
- Weather domain customers would like to customize the suspend signal so that weather workflow tools such as Cylc work properly.
Jira ID | PP-1303 - Adding Job-Suspend hook event in MoM side |
Forum Discussion | Click here |
Requirements and Use cases | Click here |
Interface 1: MoM hook event - “execjob_suspend”
Visibility: Public
Change Control: Experimental
- Details:
- A hook created for this event is invoked when there is a suspend request for a running job.
- This hook is created by pbsadmin.
- The hook runs in the foreground.
- The hook shall be executed before the job is signaled to stop with SIGSTOP.
- The hook can accept or reject the suspend request. Upon reject(), the hook does not interrupt the execution of the process invoking it, and the log message "an execjob_suspend request rejected by <hook name>" is added to the MoM logs.
- If the hook’s alarm is hit, the request is rejected and the message "alarm call while running execjob_suspend hook <hook name>, request rejected" is added to the MoM logs.
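The accept/reject behavior described above could be exercised with a hook script along the following lines. This is a minimal sketch, not the reference implementation: it assumes the new event exposes pbs.event().job the same way the other execjob_* events do, and release_licenses() and handle_suspend() are hypothetical names introduced only for illustration. The pbs module import is guarded so the decision logic can also be run outside MoM.

```python
def release_licenses(job_id):
    """Hypothetical site helper: return True once the job's licenses are freed."""
    return True  # placeholder; a real site would call its license manager here


def handle_suspend(event, release=release_licenses):
    """Accept the suspend request only if the job's licenses were released."""
    job_id = event.job.id
    if release(job_id):
        event.accept()  # suspend proceeds; MoM then sends SIGSTOP
        return "accepted"
    # Rejecting is logged by MoM as a rejected execjob_suspend request.
    event.reject("licenses not released for %s" % job_id)
    return "rejected"


try:
    import pbs  # provided by the PBS hook interpreter; absent elsewhere
except ImportError:
    pbs = None

if pbs is not None:
    # Assumption: the event object carries the job being suspended.
    handle_suspend(pbs.event())
```

Inside PBS, accept() and reject() end the hook's execution, so the return values only matter when the function is driven by a stand-in event object during testing.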
Usage:
qmgr -c "create hook sus_hook event=execjob_suspend"
qmgr -c "import hook sus_hook application/x-python default <script path>"
If one tries to set the hook's event to an invalid event name, qmgr reports an error:
qmgr -c "s h sus_hook event=execjobsuspend"
qmgr obj=sus_hook svr=default: invalid argument (execjobsuspend) to event.
Should be one or more of: queuejob,modifyjob,resvsub,movejob,runjob,provision,periodic,resv_end,execjob_begin,execjob_prologue
execjob_suspend,execjob_epilogue,execjob_preterm,execjob_end,exechost_periodic,execjob_launch,exechost_startup,execjob_attach
or "" for no event
qmgr: hook error returned from server