PP-481 Execute execjob_prologue hooks on all sister moms all the time

Objective

Currently, the execjob_prologue hook runs on the mother superior, which then launches the job. On a sister mom, the execjob_prologue hook executes only upon receiving the first tm_spawn request (for example, from the pbsdsh or pbs_tmrsh command) or tm_attach request (for example, from the pbs_attach command) from the mother superior. The objective is to change this so that execjob_prologue runs on every sister node before the job is launched, similar to the execjob_begin behavior.

Interface 1: execjob_prologue executes on all sister moms every time

  • Visibility: Public
  • Change Control: Stable
  • Details:
    The execjob_prologue hook now executes on all the sister moms as soon as the mother superior has successfully executed its own prologue hook. Previously, the execjob_prologue hook executed on a sister mom only when the first task was spawned on that mom via tm_spawn or tm_attach. If the prologue hook on the mother superior ends in a reject, encounters an unhandled exception, or times out, the sister moms do not receive a request to execute execjob_prologue hooks.
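The dispatch rule described in Interface 1 can be modeled in a few lines of Python. This is only an illustrative sketch, not the mom's implementation: the `HookResult` enum and the `sisters_to_notify` function are hypothetical names introduced here, and the sister mom names are made up.

```python
from enum import Enum


class HookResult(Enum):
    """Hypothetical outcomes of the prologue hook on the mother superior."""
    ACCEPT = "accept"
    REJECT = "reject"
    EXCEPTION = "unhandled exception"
    ALARM = "timed out"


def sisters_to_notify(ms_result, sister_moms):
    """Return the sister moms that should get an execjob_prologue request.

    Sisters are contacted only when the mother superior's prologue hook
    succeeded; a reject, unhandled exception, or timeout means no sister
    receives a request.
    """
    if ms_result is HookResult.ACCEPT:
        return list(sister_moms)
    return []


if __name__ == "__main__":
    sisters = ["momA", "momB"]  # hypothetical sister mom names
    for result in HookResult:
        print(result.name, "->", sisters_to_notify(result, sisters))
```

The point of the sketch is that the sister-side prologue no longer depends on a tm_spawn or tm_attach arriving; it depends only on the outcome of the mother superior's prologue hook.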

Interface 2: fail_action=offline_vnodes hook attribute value for execjob_prologue

  • Visibility: Public
  • Change Control: Stable
  • Details:
    The execjob_prologue hook now recognizes the hook attribute value fail_action = "offline_vnodes", which automatically offlines the vnodes managed by the executing mom when the hook ends prematurely due to an unhandled exception or when its alarm expires.
  • Example:

    # cat prolo.py
    import pbs
    import time
    time.sleep(60) # long running hook

    # qmgr -c "create hook prolo event=execjob_prologue,alarm=10,fail_action=offline_vnodes"
    # qmgr -c "import hook prolo application/x-python default prolo.py"
    Given the following vnodes:
    # pbsnodes -av
    ricardo
    Mom = ricardo.pbspro.com
    Port = 15002
    ntype = PBS
    state = free
    pcpus = 4
    resources_available.arch = linux

    ricardo[1]
    Mom = ricardo.pbspro.com
    Port = 15002
    ntype = PBS
    state = free

    Submit job:

    % qsub job.scr
    <job-id>

    The job starts but fails because the prologue hook times out, and the job is requeued.

    pbsnodes now shows the vnodes as offline:
    % pbsnodes -av
    ricardo
    Mom = ricardo.pbspro.com
    Port = 15002
    ntype = PBS
    state = offline
    pcpus = 4
    comment = offlined by hook 'prolo' due to hook error

    ricardo[1]
    Mom = ricardo.pbspro.com
    Port = 15002
    ntype = PBS
    state = offline
    resources_available.arch = linux
    comment = offlined by hook 'prolo' due to hook error

  • Log/Error messages:
    • Setting fail_action to "offline_vnodes" on a hook whose events include none of 'execjob_begin', 'exechost_startup', or (now) 'execjob_prologue' results in the following error message on STDERR:

      # qmgr -c "set hook <server_hook_name> fail_action=offline_vnodes"

      "Can't set hook fail_action value to 'offline_vnodes': hook event must contain at least one of execjob_begin, exechost_startup, execjob_prologue"

      NOTE: The above message is also returned by pbs_geterrmsg() after calling pbs_manager() operating on a hook and its 'fail_action' attribute.
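As a rough model of the offline_vnodes behavior shown in the example, the following Python sketch applies the fail action to a set of vnodes. It is not PBS code: `apply_offline_vnodes`, the outcome strings, and the dictionary layout are hypothetical. It also assumes that an explicit reject does not trigger the fail action, since the text names only unhandled exceptions and alarms. The comment string follows the pbsnodes output in the example.

```python
# Hook outcomes that count as a premature end (hypothetical labels).
FAIL_OUTCOMES = {"exception", "alarm"}


def apply_offline_vnodes(outcome, fail_action, hook_name, vnodes):
    """Return a copy of vnodes with the fail action applied.

    vnodes maps a vnode name to a dict of attributes such as {"state": "free"}.
    The input mapping is left unmodified.
    """
    triggered = fail_action == "offline_vnodes" and outcome in FAIL_OUTCOMES
    if not triggered:
        return {name: dict(attrs) for name, attrs in vnodes.items()}
    comment = "offlined by hook '%s' due to hook error" % hook_name
    return {
        name: dict(attrs, state="offline", comment=comment)
        for name, attrs in vnodes.items()
    }


if __name__ == "__main__":
    before = {"ricardo": {"state": "free"}, "ricardo[1]": {"state": "free"}}
    after = apply_offline_vnodes("alarm", "offline_vnodes", "prolo", before)
    for name, attrs in after.items():
        print(name, attrs["state"], "-", attrs["comment"])
```

Every vnode managed by the executing mom is offlined together, which matches the example where both ricardo and ricardo[1] change state.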

Interface 3: fail_action=scheduler_restart_cycle hook attribute value for execjob_prologue

  • Visibility: Public
  • Change Control: Stable
  • Details:
    The execjob_prologue hook now recognizes the hook attribute value fail_action = "scheduler_restart_cycle", which restarts the scheduling cycle when the hook ends prematurely due to an unhandled exception or when its alarm expires.
  • Log/Error messages:
    • Setting fail_action to "scheduler_restart_cycle" on a hook whose events include neither 'execjob_begin' nor (now) 'execjob_prologue' results in the following error message on STDERR:

      # qmgr -c "set hook <server_hook_name> fail_action=scheduler_restart_cycle"

      "Can't set hook fail_action value to 'scheduler_restart_cycle': hook event must contain at least one of execjob_begin, execjob_prologue"

      NOTE: The above message is also returned by pbs_geterrmsg() after calling pbs_manager() operating on a hook and its 'fail_action' attribute.
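The event checks for both fail_action values in Interfaces 2 and 3 can be summarized with a small Python sketch. This is an illustration of the rule only, not the server's validation code: `REQUIRED_EVENTS` and `check_fail_action` are hypothetical names. The event lists and message text are taken from the error messages quoted in this document.

```python
# For each fail_action value, the hook must have at least one of these events.
REQUIRED_EVENTS = {
    "offline_vnodes": ("execjob_begin", "exechost_startup", "execjob_prologue"),
    "scheduler_restart_cycle": ("execjob_begin", "execjob_prologue"),
}


def check_fail_action(fail_action, hook_events):
    """Return None if the setting is allowed, else the error message text."""
    required = REQUIRED_EVENTS.get(fail_action)
    if required is None:
        # Values outside the two covered here are not checked by this sketch.
        return None
    if any(event in hook_events for event in required):
        return None
    return (
        "Can't set hook fail_action value to '%s': "
        "hook event must contain at least one of %s"
        % (fail_action, ", ".join(required))
    )


if __name__ == "__main__":
    # A server hook with only a queuejob event fails the check.
    print(check_fail_action("offline_vnodes", ["queuejob"]))
    # A prologue hook passes the check, so nothing is reported.
    print(check_fail_action("scheduler_restart_cycle", ["execjob_prologue"]))
```

Note that exechost_startup satisfies the check for offline_vnodes but not for scheduler_restart_cycle, which is the only difference between the two rules.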


