JOBOBIT Hook Event
Links
- Link to discussion on Developer Forum: https://community.openpbs.org/t/design-document-for-endjob-hook-event/2525
- Link to ALCF development repo: https://github.com/ericpershey/pbspro/compare/hook_event_jobobit
- Link to pull requests:
Overview
Some sites need more detailed job accounting, which requires recording job information and the time at which a job leaves execution. To support this, a jobobit hook event has been added to the PBS server. Registered jobobit hook scripts are executed when a job or subjob leaves execution. Each script is provided with information about the job and its associated attributes.
Technical Details
Added Python constants:
- pbs.HOOK_EVENT_QUEUEJOB
- pbs.HOOK_EVENT_MODIFYJOB
- pbs.HOOK_EVENT_RESVSUB
- pbs.HOOK_EVENT_MODIFYRESV
- pbs.HOOK_EVENT_MOVEJOB
- pbs.HOOK_EVENT_RUNJOB
- pbs.HOOK_EVENT_JOBOBIT
- pbs.HOOK_EVENT_MANAGEMENT
- pbs.HOOK_EVENT_MODIFYVNODE
- pbs.HOOK_EVENT_PROVISION
- pbs.HOOK_EVENT_RESV_END
- pbs.HOOK_EVENT_RESV_BEGIN
- pbs.HOOK_EVENT_RESV_CONFIRM
- pbs.HOOK_EVENT_EXECJOB_BEGIN
- pbs.HOOK_EVENT_EXECJOB_PROLOGUE
- pbs.HOOK_EVENT_EXECJOB_EPILOGUE
- pbs.HOOK_EVENT_EXECJOB_PRETERM
- pbs.HOOK_EVENT_EXECJOB_END
- pbs.HOOK_EVENT_EXECJOB_LAUNCH
- pbs.HOOK_EVENT_EXECHOST_PERIODIC
- pbs.HOOK_EVENT_EXECHOST_STARTUP
- pbs.HOOK_EVENT_EXECJOB_ATTACH
- pbs.HOOK_EVENT_EXECJOB_RESIZE
- pbs.HOOK_EVENT_EXECJOB_ABORT
- pbs.HOOK_EVENT_EXECJOB_POSTSUSPEND
- pbs.HOOK_EVENT_EXECJOB_PRERESUME
- pbs.HOOK_EVENT_MOM_EVENTS
- pbs.HOOK_EVENT_PERIODIC
- pbs.REVERSE_JOB_STATE
- pbs.REVERSE_JOB_SUBSTATE
- pbs.REVERSE_HOOK_EVENT
- pbs.JOB_SUBSTATE_UNKNOWN
- pbs.JOB_SUBSTATE_TRANSIN
- pbs.JOB_SUBSTATE_TRANSICM
- pbs.JOB_SUBSTATE_TRNOUT
- pbs.JOB_SUBSTATE_TRNOUTCM
- pbs.JOB_SUBSTATE_QUEUED
- pbs.JOB_SUBSTATE_PRESTAGEIN
- pbs.JOB_SUBSTATE_SYNCRES
- pbs.JOB_SUBSTATE_STAGEIN
- pbs.JOB_SUBSTATE_STAGEGO
- pbs.JOB_SUBSTATE_STAGECMP
- pbs.JOB_SUBSTATE_HELD
- pbs.JOB_SUBSTATE_SYNCHOLD
- pbs.JOB_SUBSTATE_DEPNHOLD
- pbs.JOB_SUBSTATE_WAITING
- pbs.JOB_SUBSTATE_STAGEFAIL
- pbs.JOB_SUBSTATE_PRERUN
- pbs.JOB_SUBSTATE_RUNNING
- pbs.JOB_SUBSTATE_SUSPEND
- pbs.JOB_SUBSTATE_SCHSUSP
- pbs.JOB_SUBSTATE_EXITING
- pbs.JOB_SUBSTATE_STAGEOUT
- pbs.JOB_SUBSTATE_STAGEDEL
- pbs.JOB_SUBSTATE_EXITED
- pbs.JOB_SUBSTATE_ABORT
- pbs.JOB_SUBSTATE_KILLSIS
- pbs.JOB_SUBSTATE_RUNEPILOG
- pbs.JOB_SUBSTATE_OBIT
- pbs.JOB_SUBSTATE_TERM
- pbs.JOB_SUBSTATE_DELJOB
- pbs.JOB_SUBSTATE_RERUN
- pbs.JOB_SUBSTATE_RERUN1
- pbs.JOB_SUBSTATE_RERUN2
- pbs.JOB_SUBSTATE_RERUN3
- pbs.JOB_SUBSTATE_EXPIRED
- pbs.JOB_SUBSTATE_BEGUN
- pbs.JOB_SUBSTATE_PROVISION
- pbs.JOB_SUBSTATE_WAITING_JOIN_JOB
- pbs.JOB_SUBSTATE_TERMINATED
- pbs.JOB_SUBSTATE_FINISHED
- pbs.JOB_SUBSTATE_FAILED
- pbs.JOB_SUBSTATE_MOVED
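The REVERSE_* constants are used in the example hook script as dictionary-like lookups from a numeric value to a constant name. The following is a minimal sketch of that idea only, not the actual implementation: pbs_stub and build_reverse_map are hypothetical names introduced for illustration, since the real pbs module is only importable inside a hook. The two numeric values are the ones that appear in the server log excerpt in this document.

```python
from types import SimpleNamespace

# Hypothetical stand-in for the real `pbs` module, which is only available
# inside a PBS hook. The two values below are taken from the log excerpt.
pbs_stub = SimpleNamespace(
    JOB_STATE_EXITING=5,
    JOB_SUBSTATE_EXITED=53,
)


def build_reverse_map(namespace, prefix):
    """Map each integer constant whose name starts with `prefix` to its name."""
    return {
        value: name
        for name, value in vars(namespace).items()
        if name.startswith(prefix) and isinstance(value, int)
    }


reverse_job_state = build_reverse_map(pbs_stub, "JOB_STATE_")
reverse_job_substate = build_reverse_map(pbs_stub, "JOB_SUBSTATE_")

# Same lookup pattern as the example hook script: fall back to '(None)'.
print(reverse_job_state.get(5, "(None)"))
print(reverse_job_substate.get(53, "(None)"))
print(reverse_job_substate.get(99, "(None)"))
```

Using .get() with a default keeps a hook from raising when it encounters a value that has no named constant.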
Details:
- The jobobit event will contain job information
- The job object attached to the jobobit event is read-only
- The type of the jobobit event is pbs.HOOK_EVENT_JOBOBIT
- jobobit hooks are run by the server
- Hooks registered to the jobobit event will execute when:
  - a job ends
    - A call to process_hooks() has been added to the end_job function (req_jobobit.c)
  - a job array parent job ends
    - A call to process_hooks() has been added to the chk_array_doneness function (array_func.c)
  - a subjob ends
    - A call to process_hooks() has been added to the end_job function (req_jobobit.c)
  - a job needs to rerun
    - A call to process_hooks() has been added to the force_reque function (req_rerun.c)
    - A call to process_hooks() has been added to the on_job_rerun function (req_jobobit.c)
  - a running job is deleted and the MoM cannot be reached
    - A call to process_hooks() has been added to the post_discard_job function (node_manager.c)
  - a running job is force-deleted
    - A call to process_hooks() has been added to the req_deletejob2 function (req_delete.c)
- New job attribute "obittime"
  - Format: seconds since epoch time
  - Permissions: read-only for all users
  - Represents the time a job or subjob obit occurs (state 'F')
- A reservation object is attached to the job object if a reservation is associated with the job
- A pbs.event().accept() call terminates hook execution, as does pbs.event().reject(). The jobobit event object is unaffected by either call.
- If the hook script raises an exception, the error is logged.
- New functional test for the jobobit hook event, defined in pbs_hook_jobobit.py, which includes tests for:
  - single jobs
  - array jobs
  - jobs run under reservations
  - jobs that are requeued
- New code for jobobit hook event has been added, including:
  - New batch request structure:
      /* JobObit */
      struct rq_jobobit {
          struct job *rq_pjob;
          char rq_jid[PBS_MAXSVRJOBID + 1];
          char *rq_destin;
      };
  - New event type:
      HOOK_EVENT_JOBOBIT
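Because obittime is expressed in seconds since the epoch, it can be combined with the job start time for accounting purposes. The sketch below is an illustration only: summarize_obit is a hypothetical helper, not part of PBS, and the two input values are copied from the server log excerpt in this document. It assumes both values can be converted with int().

```python
from datetime import datetime, timezone


def summarize_obit(stime, obittime):
    """Return (obit time as an ISO 8601 UTC string, seconds from start to obit)."""
    obit_utc = datetime.fromtimestamp(int(obittime), tz=timezone.utc)
    return obit_utc.isoformat(), int(obittime) - int(stime)


# Start time and obit time as logged by the example hook.
iso_time, elapsed = summarize_obit(1633982235, 1633982236)
print(iso_time, elapsed)
```

Converting with an explicit UTC timezone avoids depending on the local timezone of the host that runs the conversion.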
Example Hook Script for jobobit:
import pbs
import sys

try:
    e = pbs.event()
    job = e.job
    pbs.logjobmsg(job.id, 'jobobit hook, "%s" started' % (e.hook_name,))
    pbs.logjobmsg(job.id, 'jobobit hook, job starttime:%s' % (job.stime,))
    pbs.logjobmsg(job.id, 'jobobit hook, job obittime:%s' % (job.obittime,))
    pbs.logjobmsg(job.id, 'jobobit hook, job_state=%s' % (job.job_state,))
    pbs.logjobmsg(job.id, 'jobobit hook, job_substate=%s' % (job.substate,))
    state_desc = pbs.REVERSE_JOB_STATE.get(job.job_state, '(None)')
    substate_desc = pbs.REVERSE_JOB_SUBSTATE.get(job.substate, '(None)')
    pbs.logjobmsg(job.id, 'jobobit hook, job_state_desc=%s' % (state_desc,))
    pbs.logjobmsg(job.id, 'jobobit hook, job_substate_desc=%s' % (substate_desc,))
    if hasattr(job, "resv") and job.resv:
        pbs.logjobmsg(job.id, 'jobobit hook, resv:%s' % (job.resv.resvid,))
        pbs.logjobmsg(job.id, 'jobobit hook, resv_nodes:%s' % (job.resv.resv_nodes,))
        pbs.logjobmsg(job.id, 'jobobit hook, resv_state:%s' % (job.resv.reserve_state,))
    else:
        pbs.logjobmsg(job.id, 'jobobit hook, resv:(None)')
    pbs.logjobmsg(job.id, 'jobobit hook, "%s" finished' % (e.hook_name,))
except Exception as err:
    ty, _, tb = sys.exc_info()
    pbs.logmsg(pbs.LOG_ERROR, "jobobit hook, error: " + str(ty) +
               str(tb.tb_frame.f_code.co_filename) + str(tb.tb_lineno))
    e.reject()
else:
    e.accept()
PBS server log excerpt of a jobobit hook executing:
10/11/2021 19:57:16.179481;0008;Server@pdw-s1;Job;7.pdw-s1;jobobit hook, "jobobit_example" started
10/11/2021 19:57:16.179498;0008;Server@pdw-s1;Job;7.pdw-s1;jobobit hook, job starttime:1633982235
10/11/2021 19:57:16.179505;0008;Server@pdw-s1;Job;7.pdw-s1;jobobit hook, job obittime:1633982236
10/11/2021 19:57:16.179511;0008;Server@pdw-s1;Job;7.pdw-s1;jobobit hook, job_state=5
10/11/2021 19:57:16.179521;0008;Server@pdw-s1;Job;7.pdw-s1;jobobit hook, job_substate=53
10/11/2021 19:57:16.179529;0008;Server@pdw-s1;Job;7.pdw-s1;jobobit hook, job_state_desc=JOB_STATE_EXITING
10/11/2021 19:57:16.179534;0008;Server@pdw-s1;Job;7.pdw-s1;jobobit hook, job_substate_desc=JOB_SUBSTATE_EXITED
10/11/2021 19:57:16.179540;0008;Server@pdw-s1;Job;7.pdw-s1;jobobit hook, resv:(None)
10/11/2021 19:57:16.179545;0008;Server@pdw-s1;Job;7.pdw-s1;jobobit hook, "jobobit_example" finished
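Each log line in the excerpt consists of six semicolon-separated fields. For sites that post-process these lines for accounting, a minimal parsing sketch might look like the following. LogRecord, its field names, and parse_log_line are hypothetical labels chosen for illustration; they are not taken from PBS itself.

```python
from typing import NamedTuple


class LogRecord(NamedTuple):
    # Field names are illustrative labels, not official PBS terminology.
    timestamp: str
    event_code: str
    daemon: str
    object_type: str
    object_id: str
    message: str


def parse_log_line(line):
    """Split one server log line into its six semicolon-separated fields."""
    # maxsplit=5 keeps any semicolons inside the message text intact.
    parts = line.rstrip("\n").split(";", 5)
    if len(parts) != 6:
        raise ValueError("expected 6 fields, got %d" % len(parts))
    return LogRecord(*parts)


sample = ("10/11/2021 19:57:16.179505;0008;Server@pdw-s1;Job;7.pdw-s1;"
          "jobobit hook, job obittime:1633982236")
record = parse_log_line(sample)
print(record.object_id, record.message)
```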