...

  • Python constant: pbs.EXECJOB_PRERESUME
  • Event Parameters: 
    • pbs.event().job - This is a pbs.job object representing the job that will be resumed. This job object cannot be modified in this hook.
    • pbs.event().vnode_list[] - This is a dictionary of pbs.vnode objects, keyed by vnode name, listing the vnodes that are assigned to the job. The vnode objects in the vnode_list cannot be modified.
  • Hook Attributes:
    • fail_action: A fail action cannot be set for this hook.
    • user: The only value allowed for this hook is "pbsadmin".
  • Details:
    • An execjob_preresume hook is executed by the primary MoM when a request to resume the job is received.
    • An execjob_preresume hook is executed by each sister MoM when a request from the primary MoM to resume the job's tasks is received.

    • A call to pbs.event().accept() means the hook code has executed cleanly.

    • A call to pbs.event().reject() means the hook code was not able to fully accomplish its task.

      • Note: this prevents all MoMs from resuming the job.
      • In keeping with hook design, if one execjob_preresume hook rejects the event, the remaining execjob_preresume hooks with a higher order value will not run.
    • If the execjob_preresume hook script encounters an unexpected error causing an unhandled exception, or times out due to the hook's alarm setting, the hook will act similarly to a pbs.event().reject().

      • Note: this prevents all MoMs from resuming the job.
  • Internal Design:
    • The primary MoM (MS) completes the event first. If the event is not rejected there, the sister MoMs then run their hooks.
    • All MoMs must accept the event before the job can be resumed.
  • Consumer:
    • Hooks like cgroups that need to take action when resource allocation changes.
      • Because the cgroups are cleaned up on suspension, they have to be recreated or modified when the job is resumed. Otherwise, the job will not have the resources it requested.
  • Caveats:
    • Currently, when the MoM rejects a PBS_BATCH_SignalJob request to resume a job, the server starts another scheduling cycle. If the scheduler decides the job can still be resumed, the resume is attempted again. If the execjob_preresume hook always rejects, nothing breaks this loop. This is consistent with existing behavior, but the new hook makes the loop easier to enter.
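
The accept/reject semantics described under Details can be sketched as a small hook body. This is a minimal illustration, not the actual cgroups hook: check_preresume, run_hook, make_fake_event and the restore_resources callback are hypothetical names introduced here, and the only PBS interfaces assumed are the ones listed above (pbs.event().job, pbs.event().vnode_list, accept() and reject()). The logic takes the event as an argument so it can be tried outside PBS with a stand-in object.

```python
from types import SimpleNamespace


def check_preresume(event, restore_resources):
    """Decide whether the job in `event` may be resumed on this MoM.

    `restore_resources` is a hypothetical callback taking (job_id, vnode_names)
    that re-creates whatever was cleaned up on suspension (for example cgroups).
    The job and vnode objects are only read, never modified.
    Returns a tuple (ok, message).
    """
    job_id = event.job.id
    # vnode_list is a dictionary keyed by vnode name.
    vnode_names = sorted(event.vnode_list.keys())
    try:
        restore_resources(job_id, vnode_names)
    except Exception as exc:
        # Reject explicitly with a message instead of relying on the
        # unhandled-exception path, which acts like a reject anyway.
        return False, "%s: cannot restore resources: %s" % (job_id, exc)
    return True, "%s: resources restored on %s" % (job_id, ",".join(vnode_names))


def run_hook(event, restore_resources):
    """Translate the decision into accept() or reject() on the event."""
    ok, message = check_preresume(event, restore_resources)
    if ok:
        event.accept()
    else:
        event.reject(message)
    return ok, message


def make_fake_event(job_id, vnode_names):
    """Stand-in for pbs.event(), only for exercising the logic outside PBS."""
    calls = []
    return SimpleNamespace(
        job=SimpleNamespace(id=job_id),
        vnode_list={name: SimpleNamespace(name=name) for name in vnode_names},
        accept=lambda: calls.append(("accept", None)),
        reject=lambda msg="": calls.append(("reject", msg)),
        calls=calls,
    )

# Inside a real hook script the call would look like (assumption):
#   import pbs
#   run_hook(pbs.event(), my_restore_function)
```

Because every MoM must accept before the job is resumed, a reject from this sketch on any one MoM is enough to keep the job suspended, so the restore callback should fail only when the resources truly cannot be re-created.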

...