Make the wait time between Cray ALPS release reservation requests adjustable

Add your comments in the Discussion Forum.


This design only applies to Cray ALPS systems.

On a Cray ALPS system, when a job's script finishes running, PBS sends ALPS a release reservation request.  PBS then polls intermittently until it receives the ALPS response "No entry for resId".  This response indicates to PBS that the ALPS reservation has been successfully canceled.

What happens today

Today, the time between successive ALPS release reservation requests grows exponentially with each try.  PBS also adds a random 0 to 4 seconds to each interval as jitter.  The jitter ensures that when many jobs end at the same time, PBS does not overwhelm ALPS with release reservation requests all at once; it spreads the requests for different reservations across different intervals.  The total time between ALPS release reservation requests is therefore the exponentially growing base interval plus a randomly generated value between 0 and 4 seconds.  Both the interval and the jitter timings were requested by Cray.
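The current behavior can be sketched roughly as follows. This is only an illustration, not the PBS source: the starting interval (`base`) and growth factor (`factor`) are assumptions, since this page says only that the interval grows exponentially, and the function name is hypothetical. Whether the real jitter is a whole number of seconds or a fractional value is also not stated here; the sketch uses a fractional value.

```python
import random


def legacy_release_interval(attempt, base=1.0, factor=2.0, max_jitter=4.0, rng=random):
    """Seconds to wait before the next release request (sketch only).

    attempt    -- 0-based count of requests already sent
    base       -- ASSUMED starting interval in seconds (not given on this page)
    factor     -- ASSUMED exponential growth factor (not given on this page)
    max_jitter -- upper bound of the random jitter; 4 seconds per this page
    """
    backoff = base * (factor ** attempt)      # exponentially growing part
    jitter = rng.uniform(0.0, max_jitter)     # random 0 to 4 seconds
    return backoff + jitter


if __name__ == "__main__":
    for attempt in range(5):
        print(attempt, round(legacy_release_interval(attempt), 2))
```

With these assumed values, the wait before the fifth request is already at least 16 seconds, which shows why a released reservation may be discovered late.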

New proposal

Cray says that things have changed and that PBS should now be able to poll at a different interval.  This way, PBS can discover sooner that a job's ALPS reservation has been released, and the next job can use those resources sooner.  The best way for PBS to handle this is to put the control in the PBS administrator's hands.  Two new mom tunables will allow the PBS administrator to adjust the base interval and the maximum amount of jitter independently.  The total interval time is the value of alps_release_wait_time plus a randomly generated value between 0 and alps_release_jitter.  The minimum wait time interval is implementation dependent and may differ between versions of ALPS and PBS Pro.  The supplied value may be adjusted (rounded or truncated) based on the available timer resolution.
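As a minimal sketch of the proposed calculation, the total interval could be computed as below. The function name is hypothetical, the defaults are the ones listed for the two tunables on this page, and the sketch does not model any rounding or truncation to the available resolution.

```python
import random

DEFAULT_WAIT_TIME = 0.4   # default alps_release_wait_time, in seconds
DEFAULT_JITTER = 0.12     # default alps_release_jitter, in seconds


def release_interval(wait_time=DEFAULT_WAIT_TIME, jitter=DEFAULT_JITTER, rng=random):
    """Total seconds to wait between two release requests (sketch only)."""
    if wait_time < 0 or jitter < 0:
        raise ValueError("wait_time and jitter must be non-negative")
    # Base wait time plus a random amount between 0 and the jitter limit.
    return wait_time + rng.uniform(0.0, jitter)


if __name__ == "__main__":
    print(release_interval())            # somewhere between 0.4 and 0.52
    print(release_interval(2.0, 0.0))    # jitter disabled: exactly 2.0
```

Setting the jitter to 0 gives a fixed interval, while a larger jitter spreads requests from jobs that end together over a wider window.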

Tunable 1 - alps_release_wait_time
  • This sets the base time, in seconds, to wait between ALPS release reservation requests.
  • It is a floating point number.
    • Remember, there is an existing mom tunable, alps_release_timeout, which defaults to 600 seconds (10 min).  That is the point at which PBS gives up trying to contact ALPS; after that, no more release reservation requests are sent to ALPS.
  • Set alps_release_wait_time in the mom_priv/config file.
  • If it is not set in the mom_priv/config file, the default value of alps_release_wait_time is 0.4 sec.
Tunable 2 - alps_release_jitter

The jitter is now a tunable so that the PBS administrator can choose to increase or decrease the amount of jitter added to the interval.

  • Based on this value, PBS randomly generates how much time to add as jitter.  The jitter amount can range from 0 to alps_release_jitter.
  • alps_release_jitter is a floating point number.
  • Set alps_release_jitter in the mom_priv/config file.
  • If it is not set in the mom_priv/config file, the default value of alps_release_jitter is 0.12 sec.
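To show how the two new tunables interact with the existing alps_release_timeout, here is a rough sketch of the polling loop. It is not the PBS implementation: `send_release` is a hypothetical callable standing in for the ALPS release reservation request and is assumed to return the ALPS response text, and the exact point at which PBS gives up relative to the timeout is also an assumption.

```python
import random
import time


def release_reservation(send_release, wait_time=0.4, jitter=0.12, timeout=600.0,
                        sleep=time.sleep, clock=time.monotonic, rng=random):
    """Poll until ALPS reports the reservation gone or the timeout expires.

    send_release -- HYPOTHETICAL callable returning the ALPS response text
    wait_time    -- alps_release_wait_time, in seconds
    jitter       -- alps_release_jitter, in seconds
    timeout      -- alps_release_timeout, in seconds
    Returns (released, attempts).
    """
    deadline = clock() + timeout
    attempts = 0
    while True:
        attempts += 1
        response = send_release()
        if "No entry for resId" in response:
            return True, attempts          # reservation has been canceled
        interval = wait_time + rng.uniform(0.0, jitter)
        if clock() + interval > deadline:
            return False, attempts         # give up; send no more requests
        sleep(interval)


if __name__ == "__main__":
    # Simulated ALPS: the reservation disappears on the third request.
    replies = iter(["pending", "pending", "No entry for resId"])
    print(release_reservation(lambda: next(replies)))
```

The `sleep` and `clock` parameters exist only so the loop can be exercised without real delays; with the default values the loop would send a request roughly every 0.4 to 0.52 seconds for up to 10 minutes.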



