Add node state change hook event
Links
Link to discussion on Developer Forum: https://community.openpbs.org/t/design-document-for-node-state-change-hook-event/2223
Link to ALCF PBS development repo: https://github.com/ericpershey/pbspro/tree/hook_modifyvnode
Link to pull request: https://github.com/openpbs/openpbs/pull/2105
Overview
Some sites that track system node availability must account for every node second. For such sites, timeliness and accuracy in recording node state changes are paramount. To this end we introduce a modifyvnode hook event that is triggered when a vnode changes state. This new event lets admins deploy site-specific scripts for real-time node availability tracking.
Glossary
Node availability - Percentage of time a system is available to users, computed as: (time in period - time unavailable due to outages in period) / time in period * 100
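The glossary formula can be written as a small helper. This is a minimal sketch: the function name node_availability and its parameter names are illustrative and not part of PBS, and both arguments are assumed to be in the same unit (seconds here).

```python
def node_availability(period_seconds, outage_seconds):
    """Return availability as a percentage of the reporting period.

    Implements: (time in period - time unavailable) / time in period * 100
    """
    if period_seconds <= 0:
        raise ValueError("period must be positive")
    if not 0 <= outage_seconds <= period_seconds:
        raise ValueError("outage time must lie within the period")
    return (period_seconds - outage_seconds) / period_seconds * 100


# One day (86400 s) with 864 s of outage is roughly 99% available.
availability = node_availability(86400, 864)
print(f"{availability:.2f}%")
```

For a multi-node system, a site would typically sum node seconds across all nodes before applying the same formula.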
Technical Details
Info for hook writers
The type for the modifyvnode event is pbs.MODIFYVNODE
modifyvnode hooks run at the server
Hooks registered to the modifyvnode event will execute after the vnode's state attribute is changed by the server
Two objects are available to modifyvnode event hook writers:
pbs.event().vnode: this read-only object’s attributes represent the new/current state of the vnode (i.e., after the server has successfully changed the vnode state attribute)
pbs.event().vnode_o: this read-only object’s attributes appear as they were prior to the server changing the vnode state
Two new functions have been added to the python vnode object:
extract_state_strs() returns a list of the string values currently set in the vnode’s state bits
extract_state_ints() returns a list of the integer values currently set in the vnode’s state bits
A pbs.event().accept() call terminates hook execution, as does pbs.event().reject(). The vnode and vnode_o event objects are unaffected by either call.
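Because both the new and the previous vnode objects are available, a hook can derive how long the vnode stayed in its previous state. The sketch below is an illustration under stated assumptions: it uses types.SimpleNamespace stand-ins instead of the real pbs.event().vnode and pbs.event().vnode_o objects so that it runs outside the server, it assumes last_state_change_time holds epoch seconds, and the helper name seconds_in_previous_state is hypothetical. The sample values are taken from the log excerpt later in this document.

```python
from types import SimpleNamespace


def seconds_in_previous_state(vnode, vnode_o):
    """Return how long the vnode stayed in its previous state, in seconds.

    Assumes last_state_change_time is expressed in epoch seconds on both
    objects (an assumption based on the values seen in the sample log).
    """
    return int(vnode.last_state_change_time) - int(vnode_o.last_state_change_time)


# Stand-ins for pbs.event().vnode and pbs.event().vnode_o (not real PBS objects).
vnode = SimpleNamespace(name="pbsdev-centos7-mvn6-mom1", state=0x1,
                        last_state_change_time=1604466504)
vnode_o = SimpleNamespace(name="pbsdev-centos7-mvn6-mom1", state=0x0,
                          last_state_change_time=1604466244)

elapsed = seconds_in_previous_state(vnode, vnode_o)
print(f"{vnode.name} spent {elapsed} s in state {hex(vnode_o.state)}")
```

In a real hook the same function would be called with the objects returned by pbs.event(), and the result could be written with pbs.logmsg() or forwarded to a site accounting system.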
vnode state constant changes
New constants
ND_STATE_FREE
ND_STATE_OFFLINE
ND_STATE_DOWN
ND_STATE_DELETED
ND_STATE_STALE
ND_STATE_JOBBUSY
ND_STATE_JOB_EXCLUSIVE
ND_STATE_RESV_EXCLUSIVE
ND_STATE_BUSY
ND_STATE_UNKNOWN
ND_STATE_NEEDS_HELLOSVR
ND_STATE_INIT
ND_STATE_PROV
ND_STATE_WAIT_PROV
ND_STATE_UNRESOLVABLE
ND_STATE_SLEEP
ND_STATE_OFFLINE_BY_MOM
ND_STATE_MARKEDDOWN
ND_STATE_NEED_ADDRS
ND_STATE_MAINTENANCE
ND_STATE_NEED_CREDENTIALS
ND_STATE_VNODE_AVAILABLE
ND_STATE_VNODE_UNAVAILABLE
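The state attribute is a bit field, so a hook usually tests it with a bitwise AND against one of these constants. The sketch below shows how an "available to unavailable" transition can be detected. The numeric values are assumptions read off the sample log's "state int list" line later in this document (1 and 409903 for the new state, 0 and 8400 for the old one); inside a real hook the pbs.ND_STATE_* constants should be used instead of these local copies. The function name just_became_unavailable is illustrative.

```python
# Local copies for illustration only; values inferred from the sample log,
# not from the PBS headers. In a hook, use pbs.ND_STATE_* instead.
ND_STATE_FREE = 0
ND_STATE_OFFLINE = 1
ND_STATE_VNODE_AVAILABLE = 8400
ND_STATE_VNODE_UNAVAILABLE = 409903


def just_became_unavailable(new_state, old_state):
    """True when the new state has an 'unavailable' bit and the old one had none."""
    now_unavailable = bool(new_state & ND_STATE_VNODE_UNAVAILABLE)
    was_unavailable = bool(old_state & ND_STATE_VNODE_UNAVAILABLE)
    return now_unavailable and not was_unavailable


# free -> offline counts as going down; offline -> offline does not.
print(just_became_unavailable(ND_STATE_OFFLINE, ND_STATE_FREE))
print(just_became_unavailable(ND_STATE_OFFLINE, ND_STATE_OFFLINE))
```

Testing against the combined mask rather than individual bits keeps the hook independent of which specific condition (offline, down, stale, and so on) caused the outage.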
Deprecated constants
ND_FREE
ND_OFFLINE
ND_DOWN
ND_STALE
ND_JOBBUSY
ND_JOB_EXCLUSIVE
ND_RESV_EXCLUSIVE
ND_BUSY
ND_PROV
ND_WAIT_PROV
ND_UNRESOLVABLE
ND_SLEEP
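Existing hook scripts that use the deprecated names can be updated mechanically. The sketch below assumes each deprecated constant maps to the new constant with the same suffix (for example ND_OFFLINE to ND_STATE_OFFLINE), which is what the two lists above suggest but is not stated explicitly. The helper name migrate_constants is hypothetical.

```python
import re

# Suffixes of the deprecated constants listed above.
DEPRECATED_SUFFIXES = [
    "FREE", "OFFLINE", "DOWN", "STALE", "JOBBUSY", "JOB_EXCLUSIVE",
    "RESV_EXCLUSIVE", "BUSY", "PROV", "WAIT_PROV", "UNRESOLVABLE", "SLEEP",
]

# Match whole identifiers only, so ND_OFFLINE_BY_MOM or ND_STATE_FREE are untouched.
_DEPRECATED_RE = re.compile(r"\bND_(" + "|".join(DEPRECATED_SUFFIXES) + r")\b")


def migrate_constants(source):
    """Rewrite deprecated ND_* names in hook source text to ND_STATE_* names."""
    return _DEPRECATED_RE.sub(r"ND_STATE_\1", source)


old_line = "if int(vnode.state) & pbs.ND_OFFLINE:"
print(migrate_constants(old_line))
```

The rewritten script should still be reviewed by hand, since the sketch only performs a textual substitution.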
Example hook script that records current and previous vnode values in the pbs log only if the vnode just went down:
# VnodeDownReport draft 20201102 19:46
# Sample modifyvnode event hook script
import pbs
import sys

try:
    e = pbs.event()
    vnode = e.vnode      # Represents the current (recently changed) state
    vnode_o = e.vnode_o  # Represents the state prior to the change

    if (int(vnode.state) & pbs.ND_STATE_VNODE_UNAVAILABLE) and \
            not (int(vnode_o.state) & pbs.ND_STATE_VNODE_UNAVAILABLE):
        #
        # A node just went down. Report current and previous vnode values.
        #
        # Reports attributes in "Table 5-7: Vnode Attributes" from the 2020.1 Hooks Guide,
        # EXCEPT:
        #   arch (vnode attribute not defined in demo deployment)
        #   hpcbp_enable (vnode attribute not defined in demo deployment)
        #   hpcbp_stage_protocol (vnode attribute not defined in demo deployment)
        #   hpcbp_webservice_address (vnode attribute not defined in demo deployment)
        #   hpcbp_user_name (vnode attribute not defined in demo deployment)
        #   topology_info (due to output size)
        #

        # Demonstrate the new vnode state list functions
        vnode_state_str_list = ",".join(vnode.extract_state_strs())
        vnode_o_state_str_list = ",".join(vnode_o.extract_state_strs())
        vnode_state_int_list = ','.join([str(_) for _ in vnode.extract_state_ints()])
        vnode_o_state_int_list = ','.join([str(_) for _ in vnode_o.extract_state_ints()])

        # First print the state values
        pbs.logmsg(pbs.LOG_DEBUG,
                   '%s;%s;state: vnode=%s vnode_o=%s' %
                   (e.hook_name, vnode.name, hex(vnode.state), hex(vnode_o.state)))
        pbs.logmsg(pbs.LOG_DEBUG,
                   '%s;%s;state string list: vnode=%s vnode_o=%s' %
                   (e.hook_name, vnode.name, vnode_state_str_list, vnode_o_state_str_list))
        pbs.logmsg(pbs.LOG_DEBUG,
                   '%s;%s;state int list: vnode=%s vnode_o=%s' %
                   (e.hook_name, vnode.name, vnode_state_int_list, vnode_o_state_int_list))

        # Next print the remaining vnode members
        pbs.logmsg(pbs.LOG_DEBUG,
                   '%s;%s;comment: vnode=%s vnode_o=%s' %
                   (e.hook_name, vnode.name, vnode.comment, vnode_o.comment))
        pbs.logmsg(pbs.LOG_DEBUG,
                   '%s;%s;current_aoe: vnode=%s vnode_o=%s' %
                   (e.hook_name, vnode.name, vnode.current_aoe, vnode_o.current_aoe))
        pbs.logmsg(pbs.LOG_DEBUG,
                   '%s;%s;in_multivnode_host: vnode=%s vnode_o=%s' %
                   (e.hook_name, vnode.name, vnode.in_multivnode_host, vnode_o.in_multivnode_host))
        pbs.logmsg(pbs.LOG_DEBUG,
                   '%s;%s;jobs: vnode=%s vnode_o=%s' %
                   (e.hook_name, vnode.name, vnode.jobs, vnode_o.jobs))
        pbs.logmsg(pbs.LOG_DEBUG,
                   '%s;%s;last_state_change_time: vnode=%s vnode_o=%s' %
                   (e.hook_name, vnode.name, str(vnode.last_state_change_time), str(vnode_o.last_state_change_time)))
        pbs.logmsg(pbs.LOG_DEBUG,
                   '%s;%s;Mom: vnode=%s vnode_o=%s' %
                   (e.hook_name, vnode.name, vnode.Mom, vnode_o.Mom))
        pbs.logmsg(pbs.LOG_DEBUG,
                   '%s;%s;ntype: vnode=%s vnode_o=%s' %
                   (e.hook_name, vnode.name, hex(vnode.ntype), hex(vnode_o.ntype)))
        pbs.logmsg(pbs.LOG_DEBUG,
                   '%s;%s;pcpus: vnode=%s vnode_o=%s' %
                   (e.hook_name, vnode.name, vnode.pcpus, vnode_o.pcpus))
        pbs.logmsg(pbs.LOG_DEBUG,
                   '%s;%s;pnames: vnode=%s vnode_o=%s' %
                   (e.hook_name, vnode.name, vnode.pnames, vnode_o.pnames))
        pbs.logmsg(pbs.LOG_DEBUG,
                   '%s;%s;Port: vnode=%s vnode_o=%s' %
                   (e.hook_name, vnode.name, vnode.Port, vnode_o.Port))
        pbs.logmsg(pbs.LOG_DEBUG,
                   '%s;%s;Priority: vnode=%s vnode_o=%s' %
                   (e.hook_name, vnode.name, vnode.Priority, vnode_o.Priority))
        pbs.logmsg(pbs.LOG_DEBUG,
                   '%s;%s;provision_enable: vnode=%s vnode_o=%s' %
                   (e.hook_name, vnode.name, vnode.provision_enable, vnode_o.provision_enable))
        pbs.logmsg(pbs.LOG_DEBUG,
                   '%s;%s;queue: vnode=%s vnode_o=%s' %
                   (e.hook_name, vnode.name, vnode.queue, vnode_o.queue))
        pbs.logmsg(pbs.LOG_DEBUG,
                   '%s;%s;resources_assigned: vnode=%s vnode_o=%s' %
                   (e.hook_name, vnode.name, vnode.resources_assigned, vnode_o.resources_assigned))
        pbs.logmsg(pbs.LOG_DEBUG,
                   '%s;%s;resources_available: vnode=%s vnode_o=%s' %
                   (e.hook_name, vnode.name, vnode.resources_available, vnode_o.resources_available))
        pbs.logmsg(pbs.LOG_DEBUG,
                   '%s;%s;resv: vnode=%s vnode_o=%s' %
                   (e.hook_name, vnode.name, vnode.resv, vnode_o.resv))
        pbs.logmsg(pbs.LOG_DEBUG,
                   '%s;%s;resv_enable: vnode=%s vnode_o=%s' %
                   (e.hook_name, vnode.name, vnode.resv_enable, vnode_o.resv_enable))
        pbs.logmsg(pbs.LOG_DEBUG,
                   '%s;%s;sharing: vnode=%s vnode_o=%s' %
                   (e.hook_name, vnode.name, vnode.sharing, vnode_o.sharing))

    e.accept()
except SystemExit:
    # accept() and reject() terminate the hook by raising SystemExit
    pass
except:
    pbs.event().reject("%s hook failed with %s" % (pbs.event().hook_name, sys.exc_info()[:2]))
PBS log excerpt of a vnode state change in response to a host being offlined by "sudo pbsnodes -o pbsdev-centos7-mvn6-mom1":
11/04/2020 05:08:24.288615;0004;Server@pbsdev-centos7-mvn6-server;Node;pbsdev-centos7-mvn6-mom1;attributes set: at request of root@pbsdev-centos7-mvn6-server.pbsdev-centos7-mvn6.local
11/04/2020 05:08:24.294421;0100;Server@pbsdev-centos7-mvn6-server;Node;pbsdev-centos7-mvn6-mom1;set_vnode_state;vnode.state=0x1 vnode_o.state=0x0 vnode.last_state_change_time=1604466504 vnode_o.last_state_change_time=1604466244 state_bits=0x1 state_bit_op_type_str=Nd_State_Set state_bit_op_type_enum=0
11/04/2020 05:08:24.296099;0800;Server@pbsdev-centos7-mvn6-server;Hook;hook_perf_stat;label=hook_modifyvnode_VnodeDownReport_278 action=server_process_hooks profile_start
11/04/2020 05:08:24.296171;0400;Server@pbsdev-centos7-mvn6-server;Hook;VnodeDownReport;started
11/04/2020 05:08:24.296208;0086;Server@pbsdev-centos7-mvn6-server;Svr;Server@pbsdev-centos7-mvn6-server;Compiling script file: </var/spool/pbs/server_priv/hooks/VnodeDownReport.PY>
11/04/2020 05:08:24.296986;0006;Server@pbsdev-centos7-mvn6-server;Hook;Server@pbsdev-centos7-mvn6-server;VnodeDownReport;pbsdev-centos7-mvn6-mom1;state: vnode=0x1 vnode_o=0x0
11/04/2020 05:08:24.297004;0006;Server@pbsdev-centos7-mvn6-server;Hook;Server@pbsdev-centos7-mvn6-server;VnodeDownReport;pbsdev-centos7-mvn6-mom1;state string list: vnode=ND_STATE_OFFLINE,ND_STATE_VNODE_UNAVAILABLE vnode_o=ND_STATE_FREE,ND_STATE_VNODE_AVAILABLE
11/04/2020 05:08:24.297012;0006;Server@pbsdev-centos7-mvn6-server;Hook;Server@pbsdev-centos7-mvn6-server;VnodeDownReport;pbsdev-centos7-mvn6-mom1;state int list: vnode=1,409903 vnode_o=0,8400
11/04/2020 05:08:24.297021;0006;Server@pbsdev-centos7-mvn6-server;Hook;Server@pbsdev-centos7-mvn6-server;VnodeDownReport;pbsdev-centos7-mvn6-mom1;comment: vnode=None vnode_o=None
11/04/2020 05:08:24.297029;0006;Server@pbsdev-centos7-mvn6-server;Hook;Server@pbsdev-centos7-mvn6-server;VnodeDownReport;pbsdev-centos7-mvn6-mom1;current_aoe: vnode=None vnode_o=None
11/04/2020 05:08:24.297037;0006;Server@pbsdev-centos7-mvn6-server;Hook;Server@pbsdev-centos7-mvn6-server;VnodeDownReport;pbsdev-centos7-mvn6-mom1;in_multivnode_host: vnode=None vnode_o=None
11/04/2020 05:08:24.297053;0006;Server@pbsdev-centos7-mvn6-server;Hook;Server@pbsdev-centos7-mvn6-server;VnodeDownReport;pbsdev-centos7-mvn6-mom1;jobs: vnode=None vnode_o=None
11/04/2020 05:08:24.297062;0006;Server@pbsdev-centos7-mvn6-server;Hook;Server@pbsdev-centos7-mvn6-server;VnodeDownReport;pbsdev-centos7-mvn6-mom1;last_state_change_time: vnode=1604466504 vnode_o=1604466244
11/04/2020 05:08:24.297071;0006;Server@pbsdev-centos7-mvn6-server;Hook;Server@pbsdev-centos7-mvn6-server;VnodeDownReport;pbsdev-centos7-mvn6-mom1;Mom: vnode=pbsdev-centos7-mvn6-mom1.pbsdev-centos7-mvn6.local vnode_o=pbsdev-centos7-mvn6-mom1.pbsdev-centos7-mvn6.local
11/04/2020 05:08:24.297079;0006;Server@pbsdev-centos7-mvn6-server;Hook;Server@pbsdev-centos7-mvn6-server;VnodeDownReport;pbsdev-centos7-mvn6-mom1;ntype: vnode=0x0 vnode_o=0x0
11/04/2020 05:08:24.297087;0006;Server@pbsdev-centos7-mvn6-server;Hook;Server@pbsdev-centos7-mvn6-server;VnodeDownReport;pbsdev-centos7-mvn6-mom1;pcpus: vnode=4 vnode_o=4
11/04/2020 05:08:24.297095;0006;Server@pbsdev-centos7-mvn6-server;Hook;Server@pbsdev-centos7-mvn6-server;VnodeDownReport;pbsdev-centos7-mvn6-mom1;pnames: vnode=None vnode_o=None
11/04/2020 05:08:24.297103;0006;Server@pbsdev-centos7-mvn6-server;Hook;Server@pbsdev-centos7-mvn6-server;VnodeDownReport;pbsdev-centos7-mvn6-mom1;Port: vnode=15002 vnode_o=15002
11/04/2020 05:08:24.297110;0006;Server@pbsdev-centos7-mvn6-server;Hook;Server@pbsdev-centos7-mvn6-server;VnodeDownReport;pbsdev-centos7-mvn6-mom1;Priority: vnode=None vnode_o=None
11/04/2020 05:08:24.297118;0006;Server@pbsdev-centos7-mvn6-server;Hook;Server@pbsdev-centos7-mvn6-server;VnodeDownReport;pbsdev-centos7-mvn6-mom1;provision_enable: vnode=None vnode_o=None
11/04/2020 05:08:24.297126;0006;Server@pbsdev-centos7-mvn6-server;Hook;Server@pbsdev-centos7-mvn6-server;VnodeDownReport;pbsdev-centos7-mvn6-mom1;queue: vnode=None vnode_o=None
11/04/2020 05:08:24.297137;0006;Server@pbsdev-centos7-mvn6-server;Hook;Server@pbsdev-centos7-mvn6-server;VnodeDownReport;pbsdev-centos7-mvn6-mom1;resources_assigned: vnode=accelerator_memory=0kb,hbmem=0kb,mem=0kb,naccelerators=0,ncpus=0,vmem=0kb vnode_o=accelerator_memory=0kb,hbmem=0kb,mem=0kb,naccelerators=0,ncpus=0,vmem=0kb
11/04/2020 05:08:24.297146;0006;Server@pbsdev-centos7-mvn6-server;Hook;Server@pbsdev-centos7-mvn6-server;VnodeDownReport;pbsdev-centos7-mvn6-mom1;resources_available: vnode=arch=linux,host=pbsdev-centos7-mvn6-mom1,mem=2038904kb,ncpus=4,vnode=pbsdev-centos7-mvn6-mom1 vnode_o=arch=linux,host=pbsdev-centos7-mvn6-mom1,mem=2038904kb,ncpus=4,vnode=pbsdev-centos7-mvn6-mom1
11/04/2020 05:08:24.297154;0006;Server@pbsdev-centos7-mvn6-server;Hook;Server@pbsdev-centos7-mvn6-server;VnodeDownReport;pbsdev-centos7-mvn6-mom1;resv: vnode=None vnode_o=None
11/04/2020 05:08:24.297163;0006;Server@pbsdev-centos7-mvn6-server;Hook;Server@pbsdev-centos7-mvn6-server;VnodeDownReport;pbsdev-centos7-mvn6-mom1;resv_enable: vnode=1 vnode_o=1
11/04/2020 05:08:24.297171;0006;Server@pbsdev-centos7-mvn6-server;Hook;Server@pbsdev-centos7-mvn6-server;VnodeDownReport;pbsdev-centos7-mvn6-mom1;sharing: vnode=1 vnode_o=1
11/04/2020 05:08:24.297186;0800;Server@pbsdev-centos7-mvn6-server;Hook;hook_perf_stat;label=hook_modifyvnode_VnodeDownReport_278 action=run_code walltime=0.000320 cputime=0.000000
11/04/2020 05:08:24.297246;0400;Server@pbsdev-centos7-mvn6-server;Hook;VnodeDownReport;finished
11/04/2020 05:08:24.297285;0800;Server@pbsdev-centos7-mvn6-server;Hook;hook_perf_stat;label=hook_modifyvnode_VnodeDownReport_278 action=server_process_hooks walltime=0.001183 cputime=0.000000 profile_stop
11/04/2020 05:08:24.297300;0004;Server@pbsdev-centos7-mvn6-server;Node;pbsdev-centos7-mvn6-mom1;attributes set: state + offline
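Sites that post-process the server log may want to pull the state transition out of the set_vnode_state entry shown above. The following is a rough sketch, assuming the field layout seen in this single excerpt (semicolon-separated fields, with the sixth field equal to set_vnode_state and the seventh holding space-separated key=value pairs); that layout is inferred from the excerpt rather than taken from a documented format, and the function name parse_set_vnode_state is hypothetical.

```python
def parse_set_vnode_state(line):
    """Parse one set_vnode_state log line into a dict, or return None.

    Layout assumed from the excerpt:
    timestamp;event;server;type;vnode;set_vnode_state;key=value key=value ...
    """
    fields = line.rstrip("\n").split(";")
    if len(fields) < 7 or fields[5] != "set_vnode_state":
        return None
    record = {"timestamp": fields[0], "vnode": fields[4]}
    for token in fields[6].split():
        key, _, value = token.partition("=")
        record[key] = value
    return record


sample = (
    "11/04/2020 05:08:24.294421;0100;Server@pbsdev-centos7-mvn6-server;Node;"
    "pbsdev-centos7-mvn6-mom1;set_vnode_state;vnode.state=0x1 vnode_o.state=0x0 "
    "vnode.last_state_change_time=1604466504 "
    "vnode_o.last_state_change_time=1604466244 state_bits=0x1 "
    "state_bit_op_type_str=Nd_State_Set state_bit_op_type_enum=0"
)
record = parse_set_vnode_state(sample)
print(record["vnode"], record["vnode_o.state"], "->", record["vnode.state"])
```

All values are kept as strings; a caller can convert the hexadecimal state fields with int(value, 16) where needed.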
Internals
New functional tests for vnode state changes defined in pbs_hook_modifyvnode_state_changes.py:
Includes tests that induce state changes via various operations (e.g., mom stop, offline mom, server restart, etc.)
Includes checks verifying the existence of the expected node state constants
Reuses existing vnode object logic where possible; two functions added to class _vnode in _svrtypes.py:
extract_state_strs() returns list of string values from the vnode’s state bits
extract_state_ints() returns list of int values from the vnode’s state bits
New code for propagating vnode state changes has been added, including:
New batch request structure:
/* ModifyVnode - used for node state changes */
struct rq_modifyvnode {
	struct pbsnode *rq_vnode_o; /* old/previous vnode state */
	struct pbsnode *rq_vnode;   /* new/current vnode state */
};
New event type:
New event object:
New event param:
A call to process_hooks() has been added to set_vnode_state() in node_manager.c to fire off the modifyvnode event
New pbs log entry added to set_vnode_state() in node_manager.c: