Overview:
The objective of PP-1303 (https://pbspro.atlassian.net/browse/PP-1303) is to provide a hook event that is invoked when MoM receives a suspend request for a running job. Just before SIGSTOP is sent, an execjob_suspend hook can:
a) Modify the job's Execution_Time, Hold_Types, and resources_used attributes
b) Cause the job to keep running
c) Set attributes and resources on the vnode(s) managed by the MoM where the job executes
d) Flag the job to be rerun
The event also addresses the following use cases:
e) Third-party licensing software: currently, in mainline, license resources are deemed to be freed on suspension, but sending a STOP signal to a job is not enough to actually free the licenses.
f) Special checkpointing operations that can free many of the resources that preempted jobs are holding.
g) Weather-domain customers would like to customize the suspend signal so that weather workflow tools such as Cylc work properly.
Jira ID | PP-1303 - Adding Job-Suspend hook event in MoM side |
Forum Discussion | Click here |
Requirements and Use cases | Click here |
Interface 1: MoM hook event - “execjob_suspend”
Visibility: Public
Change Control: Experimental
Usage:
qmgr -c "create hook sus_hook event=execjob_suspend"
qmgr -c "import hook sus_hook application/x-python default <script path>"
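As an illustration of what the imported script might contain, here is a minimal sketch of a hook body that logs the pending suspend and then accepts the event. The function name on_suspend and the try/except guard are assumptions made for this example so the file can also be loaded outside PBS; only pbs.event(), the event's job attribute, pbs.logmsg(), pbs.LOG_DEBUG, and accept() are taken from the existing PBS hook interface. The design above does not prescribe this layout.

```python
def on_suspend(pbs):
    """Hypothetical body of an execjob_suspend hook.

    `pbs` is the PBS hook module, passed in so the logic can be
    exercised with a stand-in object outside a MoM.
    """
    e = pbs.event()
    job = e.job
    pbs.logmsg(pbs.LOG_DEBUG,
               "execjob_suspend: about to suspend job %s" % job.id)
    # Site-specific work would go here (assumption), for example
    # releasing third-party licenses before the job is stopped.
    e.accept()


try:
    import pbs  # only importable when run by PBS as a hook
except ImportError:
    pbs = None  # allows importing this file outside PBS for a dry run

if pbs is not None:
    on_suspend(pbs)
```

Passing the module in as an argument is a design choice for this sketch only; a site hook could equally call pbs.event() at the top level of the script.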
If a hook is set with an invalid event name, qmgr rejects the request and lists the valid events:
qmgr -c "s h sus_hook event=execjobsuspend"
qmgr obj=sus_hook svr=default: invalid argument (execjobsuspend) to event.
Should be one or more of: queuejob,modifyjob,resvsub,movejob,runjob,provision,periodic,resv_end,execjob_begin,execjob_prologue
execjob_suspend,execjob_epilogue,execjob_preterm,execjob_end,exechost_periodic,execjob_launch,exechost_startup,execjob_attach
or "" for no event
qmgr: hook error returned from server