...
- Python constant: pbs.EXECJOB_PRERESUME
- Event Parameters:
- pbs.event().job - This is a pbs.job object representing the job that will be resumed. This job object cannot be modified under this hook.
- pbs.event().vnode_list[] - This is a dictionary of pbs.vnode objects, keyed by vnode name, listing the vnodes that are assigned to the job. The vnode objects in the vnode_list cannot be modified.
- Hook Attributes:
- fail_action: This hook will not allow a fail action to be set.
- user: This hook will only allow the value "pbsadmin".
- Details:
- An execjob_preresume hook is executed by the primary mom when a request to resume the job is received.
An execjob_preresume hook is executed by the sister mom when a request from the primary mom to resume the job's tasks is received.
A call to pbs.event().accept() means the hook code has executed cleanly.
A call to pbs.event().reject() means the hook code was not able to fully accomplish its task.
- Note: this will prevent all MoMs from resuming the job.
- In keeping with hook design, if one execjob_preresume hook rejects the event, the remaining execjob_preresume hooks with a higher order value will not run.
If the execjob_preresume hook script encounters an unexpected error that causes an unhandled exception, or times out due to the hook's alarm setting, the hook behaves as though pbs.event().reject() had been called.
- Note: this will prevent all MoMs from resuming the job.
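As a rough illustration of the accept/reject contract described above, the following sketch shows how the body of an execjob_preresume hook might be structured. The `FakeEvent` class and the `restore_job_resources` callback are hypothetical and exist only so the sketch runs outside PBS. In a real hook the event would come from pbs.event(), and accept() and reject() are expected to end hook execution rather than return.

```python
from types import SimpleNamespace


class FakeEvent:
    """Hypothetical stand-in for the object returned by pbs.event()."""

    def __init__(self, job_id, vnode_names):
        # The real event carries a pbs.job object; only its id is modeled here.
        self.job = SimpleNamespace(id=job_id)
        # The real vnode_list maps vnode names to pbs.vnode objects.
        self.vnode_list = {name: object() for name in vnode_names}
        self.outcome = None
        self.message = None

    def accept(self):
        self.outcome = "accept"

    def reject(self, message=""):
        self.outcome = "reject"
        self.message = message


def run_preresume(event, restore_job_resources):
    """Read the job and vnodes (never modify them), then accept or reject."""
    vnode_names = sorted(event.vnode_list)
    try:
        # restore_job_resources is a hypothetical site-specific callback.
        restore_job_resources(event.job.id, vnode_names)
    except Exception as exc:
        # Reject explicitly so the reason is recorded, instead of relying on
        # the unhandled-exception path, which has the same effect.
        event.reject("cannot prepare %s for resume: %s" % (event.job.id, exc))
        return
    event.accept()
```

Catching the exception and rejecting with a message does not change the outcome compared with an unhandled exception, but it leaves a more useful explanation of why the job was not resumed.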
- Internal Design:
- The MS (mother superior, i.e. the primary mom) will complete the event first. If the event is not rejected, the sisters will then run their hooks.
- All moms must accept the event before the job can be resumed.
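The ordering rules above can be sketched as a small simulation. This is a model of the described behavior under simplifying assumptions, not PBS code: each hook is a plain callable that returns True to accept, the function names are hypothetical, and the sisters are evaluated one after another purely for simplicity.

```python
def run_hooks_in_order(hooks):
    """Run (order, hook) pairs by ascending order; stop at the first reject.

    An exception raised by a hook is treated like a reject.
    """
    for _order, hook in sorted(hooks, key=lambda pair: pair[0]):
        try:
            accepted = hook()
        except Exception:
            accepted = False
        if not accepted:
            return False  # hooks with a higher order value do not run
    return True


def job_can_resume(ms_hooks, sister_hooks_by_mom):
    """The primary mom goes first; sisters run only if it did not reject."""
    if not run_hooks_in_order(ms_hooks):
        return False
    # Every sister must accept before the job can be resumed.
    results = [run_hooks_in_order(h) for h in sister_hooks_by_mom.values()]
    return all(results)
```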
- Consumer:
- Hooks like cgroups that need to take action when resource allocation changes.
- Because the cgroups are cleaned up on suspension, they have to be recreated or modified when the job is resumed. Otherwise, the job will not have the resources it requested.
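To make the motivation concrete, here is a minimal in-memory model of the suspend/resume cycle for such a consumer. It is only a sketch: `CgroupModel`, its method names, and the resource keys are hypothetical, and nothing here touches real cgroups or reflects how the actual cgroups hook is written.

```python
class CgroupModel:
    """In-memory stand-in for per-job cgroup limits (not the real hook)."""

    def __init__(self):
        self.limits = {}  # job id -> resource limits currently in place

    def on_launch(self, job_id, requested):
        # Limits are created when the job starts.
        self.limits[job_id] = dict(requested)

    def on_postsuspend(self, job_id):
        # Suspension cleans up the job's limits, releasing its resources.
        self.limits.pop(job_id, None)

    def on_preresume(self, job_id, requested):
        # The limits must be put back before the job's tasks continue.
        self.limits[job_id] = dict(requested)

    def has_resources(self, job_id, requested):
        return self.limits.get(job_id) == requested
```

Without the `on_preresume` step, `has_resources` stays false after a suspend, which mirrors the problem the execjob_preresume event is meant to solve.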
- Caveats:
- Currently, when the mom rejects a PBS_BATCH_SignalJob request to resume a job, the server starts another scheduling cycle. If the scheduler determines that the job can still be resumed, the server will try again. If the execjob_preresume hook always rejects, nothing prevents this loop from repeating. This is consistent with existing behavior, but the new hook makes the loop easier to enter.
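The retry loop described in this caveat can be illustrated with a toy simulation. The function and its parameters are hypothetical, and `max_cycles` exists only so the simulation terminates; the caveat is precisely that no such limit applies in the behavior described.

```python
def count_resume_attempts(hook_accepts, scheduler_allows, max_cycles):
    """Count resume requests sent to the mom before the loop ends.

    hook_accepts and scheduler_allows are zero-argument callables that
    return True or False. max_cycles is an artificial cap for this sketch.
    """
    attempts = 0
    for _ in range(max_cycles):
        if not scheduler_allows():
            break  # the scheduler no longer wants to resume the job
        attempts += 1
        if hook_accepts():
            break  # the job resumes and the loop ends
    return attempts
```

With a hook that always rejects and a scheduler that always allows the resume, the number of attempts equals whatever cap the simulation is given.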
...