PP-481 Execute execjob_prologue hooks on all sister moms all the time
Objective
Currently, the execjob_prologue hook runs on the mother superior, which then launches the job. On a sister mom, the execjob_prologue hook executes only upon receiving the first tm_spawn request (e.g. from the pbsdsh or pbs_tmrsh command) or a tm_attach request (e.g. from the pbs_attach command) from the mother superior. The objective is to change this so that execjob_prologue runs on every sister node before the job is launched, similar to the execjob_begin behavior.
Interface 1: execjob_prologue executes on all sister moms every time
- Visibility: Public
- Change Control: Stable
- Details:
- The execjob_prologue hook now executes on all sister moms as soon as the mother superior has successfully executed the prologue hook. Previously, the execjob_prologue hook executed on a sister mom only when the first task was spawned on that mom via tm_spawn or tm_attach. If the prologue hook on the mother superior ends in a rejection, encounters an unhandled exception, or times out, the sister moms will not receive a request to execute execjob_prologue hooks.
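The ordering described above can be modelled with a small, self-contained Python sketch. It is not PBS code: HookResult, run_prologue, and the run_hook callback are hypothetical names introduced only for illustration, and the sketch makes no claim about what happens after a sister mom's own hook fails.

```python
from enum import Enum


class HookResult(Enum):
    """Hypothetical outcomes of one execjob_prologue hook execution."""
    ACCEPT = "accept"
    REJECT = "reject"
    EXCEPTION = "unhandled exception"
    TIMEOUT = "alarm timeout"


def run_prologue(mother_superior, sisters, run_hook):
    """Return {mom: HookResult} for every mom on which the hook executed.

    run_hook is a caller-supplied function (an assumption of this sketch)
    that takes a mom name and returns a HookResult.
    """
    results = {mother_superior: run_hook(mother_superior)}
    if results[mother_superior] is not HookResult.ACCEPT:
        # Rejection, unhandled exception, or timeout on the mother
        # superior: sister moms never receive the request.
        return results
    for mom in sisters:
        # With this change, every sister mom runs the hook before the
        # job is launched, not on its first tm_spawn/tm_attach.
        results[mom] = run_hook(mom)
    return results
```

For example, calling run_prologue("ms", ["s1", "s2"], hook) with a hook that always accepts yields results for all three moms, while a hook that times out on "ms" yields a result for "ms" only.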
Interface 2: fail_action=offline_vnodes hook attribute value for execjob_prologue
- Visibility: Public
- Change Control: Stable
- Details:
The execjob_prologue hook will recognize the hook attribute value fail_action = "offline_vnodes", which automatically offlines the vnodes managed by the executing mom when the hook ends prematurely due to an unhandled exception or an alarm timeout.
- Example:
# cat prolo.py
import pbs
import time
time.sleep(60) # long running hook
# qmgr -c "create hook prolo event=execjob_prologue,alarm=10,fail_action=offline_vnodes"
# qmgr -c "import hook prolo application/x-python default prolo.py"
Given the following vnodes:
# pbsnodes -av
ricardo
Mom = ricardo.pbspro.com
Port = 15002
ntype = PBS
state = free
pcpus = 4
resources_available.arch = linux
ricardo[1]
Mom = ricardo.pbspro.com
Port = 15002
ntype = PBS
state = free
Submit job:
% qsub job.scr
<job-id>
The job runs but fails because the prologue hook times out, and the job is requeued.
pbsnodes now shows the vnodes as offline:
% pbsnodes -av
ricardo
Mom = ricardo.pbspro.com
Port = 15002
ntype = PBS
state = offline
pcpus = 4
comment = offlined by hook 'prolo' due to hook error
ricardo[1]
Mom = ricardo.pbspro.com
Port = 15002
ntype = PBS
state = offline
resources_available.arch = linux
comment = offlined by hook 'prolo' due to hook error
- Log/Error messages:
Setting the fail_action value to "offline_vnodes" on a hook that has none of the mom hook events 'execjob_begin', 'exechost_startup', or now 'execjob_prologue' results in the following error message on STDERR:
# qmgr -c "set hook <server_hook_name> fail_action=offline_vnodes"
"Can't set hook fail_action value to 'offline_vnodes': hook event must contain at least one of execjob_begin, exechost_startup, execjob_prologue"
NOTE: The above message is also returned by pbs_geterrmsg() after calling pbs_manager() on a hook and its 'fail_action' attribute.
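To check the outcome shown in the pbsnodes example for this interface, the listing can be scanned for vnodes whose state includes offline. The following is a minimal sketch, not part of PBS: parse_pbsnodes and offlined_vnodes are hypothetical helpers, and the sketch assumes that each record starts with a vnode name on its own line followed by "attribute = value" lines, as in the listing above.

```python
def parse_pbsnodes(text):
    """Parse `pbsnodes -av` style text into {vnode: {attribute: value}}.

    Assumption: a line containing " = " is an attribute of the most
    recently seen vnode; any other non-blank line starts a new vnode.
    """
    vnodes = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if " = " in line and current is not None:
            key, value = line.split(" = ", 1)
            vnodes[current][key.strip()] = value.strip()
        else:
            current = line
            vnodes[current] = {}
    return vnodes


def offlined_vnodes(text):
    """Return the sorted names of vnodes whose state includes 'offline'."""
    return sorted(
        name
        for name, attrs in parse_pbsnodes(text).items()
        if "offline" in attrs.get("state", "").split(",")
    )


SAMPLE = """\
ricardo
     Mom = ricardo.pbspro.com
     state = offline
     comment = offlined by hook 'prolo' due to hook error

ricardo[1]
     Mom = ricardo.pbspro.com
     state = offline
"""

print(offlined_vnodes(SAMPLE))
```

In practice the text would come from running pbsnodes -av on the server host rather than from an embedded sample.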
Interface 3: fail_action=scheduler_restart_cycle hook attribute value for execjob_prologue
- Visibility: Public
- Change Control: Stable
- Details:
The execjob_prologue hook will recognize the hook attribute value fail_action = "scheduler_restart_cycle", which restarts the scheduling cycle when the hook ends prematurely due to an unhandled exception or an alarm timeout.
- Log/Error messages:
Setting the fail_action value to "scheduler_restart_cycle" on a hook that has neither of the mom hook events 'execjob_begin' or now 'execjob_prologue' results in the following error message on STDERR:
# qmgr -c "set hook <server_hook_name> fail_action=scheduler_restart_cycle"
"Can't set hook fail_action value to 'scheduler_restart_cycle': hook event must contain at least one of execjob_begin, execjob_prologue"
NOTE: The above message is also returned by pbs_geterrmsg() after calling pbs_manager() on a hook and its 'fail_action' attribute.
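The event checks for the two fail_action values can be summarized in one small Python sketch. This is an illustration of the rule as stated in this document, not the server's implementation: REQUIRED_EVENTS and check_fail_action are hypothetical names, and fail_action values not described here are simply passed through unchecked.

```python
# Events that permit each fail_action value, as listed in the error
# messages in this document.
REQUIRED_EVENTS = {
    "offline_vnodes": (
        "execjob_begin", "exechost_startup", "execjob_prologue"),
    "scheduler_restart_cycle": (
        "execjob_begin", "execjob_prologue"),
}


def check_fail_action(fail_action, hook_events):
    """Return None if the value is acceptable, else the error message.

    hook_events is the collection of event names configured on the hook.
    """
    required = REQUIRED_EVENTS.get(fail_action)
    if required is None:
        # Values outside the scope of this document are not checked here.
        return None
    if any(event in hook_events for event in required):
        return None
    return (
        "Can't set hook fail_action value to '%s': hook event must "
        "contain at least one of %s" % (fail_action, ", ".join(required))
    )
```

For example, check_fail_action("scheduler_restart_cycle", ["exechost_startup"]) returns the error text, because exechost_startup qualifies a hook for offline_vnodes but not for scheduler_restart_cycle.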