PP-305: If server_dyn_res script does not return on UNIX/LINUX, scheduler will hang
Target Release | 15.0.0 |
---|---|
JIRA Link | |
Document status | Initial Version |
Document owner | |
Forum Discussion Link | http://community.pbspro.org/t/pp-305-if-server-dyn-res-script-does-not-return-scheduler-hangs/547 |
Objective:
As of today, if a server_dyn_res program/script hangs or does not return, the scheduler waits indefinitely for it to complete.
The objective of this design document is to propose a solution for this hang.
Interface 1: New Configurable Scheduler attribute: server_dyn_res_alarm
- Visibility: sched object |Operator Read | Manager Read/Write
- Change Control: Stable
- Details:
- Admin can configure the scheduler attribute "server_dyn_res_alarm". Default value is 30 seconds.
- Usage :
qmgr -c "set sched server_dyn_res_alarm = 15"
- PBS will start polling from the time the server_dyn_res program/script starts executing and will wait for "server_dyn_res_alarm" time. After the timeout, the interaction with the script/program will end and the scheduler will log a timeout info message.
- This timeout will be applicable for each server_dyn_res program/script.
- On timeout, the value of the resource will be assumed to be "0" and the scheduling cycle will continue normally.
- If the alarm is set to 0, the scheduler will not time out server_dyn_res scripts.
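The timeout rules above can be illustrated with a minimal Python sketch. This is not the scheduler's implementation; the names `run_dyn_res` and `DEFAULT_ALARM` are hypothetical, and the sketch assumes the program reports its value on standard output.

```python
import subprocess

DEFAULT_ALARM = 30  # seconds; the default value stated in this design


def run_dyn_res(command, alarm=DEFAULT_ALARM):
    """Run one server_dyn_res program and return (value, timed_out).

    Hypothetical helper for illustration only.
    alarm == 0 disables the timeout, as described for server_dyn_res_alarm.
    """
    timeout = alarm if alarm > 0 else None
    try:
        done = subprocess.run(
            command,
            stdout=subprocess.PIPE,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child and waits for it before re-raising,
        # so no process is left behind. The resource value is assumed "0".
        return "0", True
    return done.stdout.strip(), False
```

With this sketch, a call such as `run_dyn_res(["/bin/get_foo"], alarm=15)` would wait at most 15 seconds and then fall back to "0". The timeout applies to each program separately, so a caller would invoke the helper once per configured server_dyn_res entry.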
Interface 2: Log message for timeout of server_dyn_res program/script
- Visibility: Scheduler log message at PBSEVENT_SCHED, PBS_EVENTCLASS_SERVER, and syslog LOG_INFO
- Change Control: Unstable
- Details:
- Once the timeout is reached, a timeout info message is logged in the scheduler logs. For example:
... ...;0040;pbs_sched;Svr;server_dyn_res;program /bin/get_foo timed out
Interface 3: Log message for value of server_dyn_res on timeout
- Visibility: Scheduler log message at PBSEVENT_DEBUG, PBS_EVENTCLASS_SERVER, and syslog LOG_DEBUG
- Change Control: Unstable
- Details:
- On timeout, a debug message is logged in the scheduler logs stating that the resource value is assumed to be "0". For example:
... ...;0080;pbs_sched;Svr;server_dyn_res;/bin/get_foo = 0
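For anyone who wants to watch for these two messages in the scheduler log, a small Python sketch of a parser follows. It is an illustration only: `parse_dyn_res_log` is a hypothetical name, the sketch assumes the elided leading portion of the line contains no semicolons, and because both messages are marked Unstable their wording may change.

```python
def parse_dyn_res_log(line):
    """Classify a scheduler log line about server_dyn_res.

    Hypothetical helper. Returns a dict for the timeout message and for the
    assumed-value message, or None for any other line.
    """
    # Expected layout: <prefix>;<event>;<daemon>;<class>;<object>;<message>
    fields = line.rstrip("\n").split(";", 5)
    if len(fields) != 6 or fields[4] != "server_dyn_res":
        return None
    _prefix, event, _daemon, _event_class, _obj, message = fields

    head, tail = "program ", " timed out"
    if message.startswith(head) and message.endswith(tail):
        program = message[len(head):-len(tail)]
        return {"kind": "timeout", "program": program, "event": event}

    if " = " in message:
        program, value = message.rsplit(" = ", 1)
        return {"kind": "value", "program": program, "value": value, "event": event}

    return None
```

Applied to the two sample lines above, the first is classified as a timeout for `/bin/get_foo` and the second as an assumed value of `0` for the same program.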