[WIP] PP-1303: Adding Job-Suspend hook event in MoM side

Overview:

The objective of https://pbspro.atlassian.net/browse/PP-1303 is to provide a hook event that is invoked when MoM receives a suspend request for a running job. Just before SIGSTOP is sent, an execjob_suspend hook can:

  • Modify the job’s Execution_Time, Hold_Types, and resources_used attributes
  • Cause the job to keep running
  • Set attributes and resources on the vnode(s) managed by the MoM where the job executes
  • Flag the job to be rerun

The event also addresses the following use cases:

  • Third-party licensing software: currently, in mainline, license resources are deemed to be freed on suspension, but sending a STOP signal to a job is not enough to actually free the licenses.
  • Special checkpointing operations, which can free many of the resources that preempted jobs are hogging.
  • Weather-domain customers would like to customize the suspend signal so that weather workflow tools such as Cylc work properly.

Jira ID

PP-1303 - Adding Job-Suspend hook event in MoM side

Forum Discussion  

Click here

Requirements and Use cases Click here

Interface 1: MoM hook event - “execjob_suspend”

  • Visibility: Public

  • Change Control: Experimental

  • Details: 
    • The hook created for this event is invoked when a suspend request is made for a running job.
    • The hook is created by pbsadmin.
    • The hook runs in the foreground.
    • The hook is executed before the job is signaled to stop with SIGSTOP.
    • The hook can accept or reject the suspend request. Upon reject(), the hook does not interrupt the execution of the process invoking it, and the log message "an execjob_suspend request rejected by <hook name>" is added to the MoM logs.
    • If the hook’s alarm expires, the request is rejected and the message "alarm call while running execjob_suspend hook <hook name>, request rejected" is written to the MoM logs.

      Usage:

                 qmgr -c "create hook sus_hook event=execjob_suspend"
                 qmgr -c "import hook sus_hook application/x-python default <script path>"
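
      As an illustration only, the script imported above might look like the following minimal sketch. It assumes the proposed execjob_suspend event exposes the job as pbs.event().job, as the existing execjob_* events do. The policy (rejecting suspension for jobs whose name starts with a given prefix), the PROTECTED_PREFIX value, and the should_reject() helper are hypothetical and not part of this design.

```python
# Sketch of an execjob_suspend hook script; not the final interface.
try:
    import pbs  # provided only inside the PBS hook interpreter
except ImportError:
    pbs = None  # allows the decision logic to be exercised outside PBS

# Hypothetical site policy: jobs with this name prefix must keep running.
PROTECTED_PREFIX = "cylc_"


def should_reject(job_name):
    """Return True when the suspend request should be rejected."""
    return job_name.startswith(PROTECTED_PREFIX)


def main():
    e = pbs.event()
    # Assumption: the new event provides e.job like other execjob_* events.
    job_name = str(e.job.Job_Name)
    if should_reject(job_name):
        # reject() ends the hook; per this design the job keeps running.
        e.reject("suspend refused for protected job %s" % e.job.id)
    pbs.logmsg(pbs.LOG_DEBUG,
               "execjob_suspend: allowing suspend of %s" % e.job.id)
    e.accept()


if pbs is not None:
    main()
```

      Keeping the policy in a plain function separates the site-specific decision from the PBS calls, so the decision can be checked without a running MoM.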

      If the hook is given an invalid event name, qmgr returns an error:

                 qmgr -c "s h sus_hook event=execjobsuspend"
      qmgr obj=sus_hook svr=default: invalid argument (execjobsuspend) to event.
      Should be one or more of: queuejob,modifyjob,resvsub,movejob,runjob,provision,periodic,resv_end,execjob_begin,execjob_prologue
      execjob_suspend,execjob_epilogue,execjob_preterm,execjob_end,exechost_periodic,execjob_launch,exechost_startup,execjob_attach
      or "" for no event
      qmgr: hook error returned from server