PP-305: If server_dyn_res script does not return on UNIX/LINUX, scheduler will hang

Target Release: 15.0.0
JIRA Link: PP-305

Document status: Initial Version
Document owner
Forum Discussion Link: http://community.pbspro.org/t/pp-305-if-server-dyn-res-script-does-not-return-scheduler-hangs/547

Objective:

As of today, if a server_dyn_res program/script hangs or never returns, the scheduler waits indefinitely for it to complete, so the scheduling cycle itself hangs.

The objective of this design document is to propose a solution for this hang.


Interface 1: New Configurable Scheduler attribute: server_dyn_res_alarm

  • Visibility: sched object | Operator Read | Manager Read/Write
  • Change Control: Stable
  • Details: 
    • An admin can configure the scheduler attribute "server_dyn_res_alarm". The value is in seconds; the default is 30 seconds.
    • Usage:

qmgr -c "set sched server_dyn_res_alarm = 15"

    • PBS will start polling when the server_dyn_res program/script starts executing and will wait up to "server_dyn_res_alarm" seconds for it to finish. If the program/script has not returned by then, the interaction with it ends and the scheduler logs a timeout message at info level.
    • The timeout applies separately to each server_dyn_res program/script.
    • On timeout, the value of the resource is assumed to be "0" and the scheduling cycle continues normally.
    • If the alarm is set to 0, the scheduler will not time out server_dyn_res scripts.
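
The timeout behaviour described above can be illustrated with a minimal Python sketch. This is not the scheduler's implementation (the scheduler is not written in Python), and the function name query_dyn_res, its parameters, and the choice to kill the program on timeout are assumptions made for illustration only; the design states only that the interaction with the program ends.

```python
import subprocess
import sys


def query_dyn_res(command, alarm_seconds=30):
    """Run a server_dyn_res-style program and return (value, timed_out).

    Hypothetical helper: the name and signature are illustrative only.
    """
    # An alarm of 0 disables the timeout, as proposed for server_dyn_res_alarm.
    timeout = alarm_seconds if alarm_seconds > 0 else None
    try:
        result = subprocess.run(
            command,
            stdout=subprocess.PIPE,
            stderr=subprocess.DEVNULL,
            universal_newlines=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising; killing it is an
        # assumption of this sketch, not something the design specifies.
        print("program %s timed out" % command[0], file=sys.stderr)
        # On timeout the resource value is assumed to be "0".
        return "0", True
    # Exit-status handling is outside the scope of this sketch.
    return result.stdout.strip(), False
```

Because the alarm is passed per call, each program gets its own full timeout window, which matches the statement that the timeout applies to each server_dyn_res program/script.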


Interface 2: Log message for timeout of server_dyn_res program/script

  • Visibility: Scheduler log message at PBSEVENT_SCHED, PBS_EVENTCLASS_SERVER, and syslog LOG_INFO
  • Change Control: Unstable
  • Details:
    • Once the timeout is reached, a timeout info message is logged in the scheduler log, for example:
      ... ...;0040;pbs_sched;Svr;server_dyn_res;program /bin/get_foo timed out


Interface 3: Log message for value of server_dyn_res on timeout

  • Visibility: Scheduler log message at PBSEVENT_DEBUG, PBS_EVENTCLASS_SERVER, and syslog LOG_DEBUG
  • Change Control: Unstable
  • Details:
    • On timeout, a debug message is logged in the scheduler log recording that the resource value is assumed to be "0", for example:
      ... ...;0080;pbs_sched;Svr;server_dyn_res;/bin/get_foo = 0
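
As an illustration of how a site might detect these timeouts, the following Python sketch scans a scheduler log line for the Interface 2 message. It assumes the semicolon-separated layout of the example lines above (timestamp, event code, daemon, object type, object name, message). The function name timed_out_program is hypothetical, and because both messages are marked Unstable, the exact wording it matches may change.

```python
def timed_out_program(log_line):
    """Return the program path if log_line reports a server_dyn_res
    timeout in the example format, otherwise None.

    Hypothetical helper: the name is illustrative only.
    """
    # Split into at most six fields so any ';' inside the message is preserved.
    fields = log_line.rstrip("\n").split(";", 5)
    if len(fields) != 6:
        return None
    _timestamp, _event, daemon, _obj_type, obj_name, message = fields
    if daemon != "pbs_sched" or obj_name != "server_dyn_res":
        return None
    prefix, suffix = "program ", " timed out"
    if (
        len(message) > len(prefix) + len(suffix)
        and message.startswith(prefix)
        and message.endswith(suffix)
    ):
        # Strip the fixed wording to leave only the program path.
        return message[len(prefix):-len(suffix)]
    return None
```

The Interface 3 debug line for the same program does not match, so a monitoring script built this way would report each timeout once.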