PP-864 Support suspend/resume on Cray X* series
https://pbspro.atlassian.net/browse/PP-864
Overview:
Cray X* series systems can suspend one or more jobs in order to run a higher-priority job. PBS needs to modify the suspend pseudo-signal (used by the qsig command and by preemption) to support suspend and resume on a Cray X* series system.
Important things to note:
Cray systems with a Gemini interconnect do NOT support suspend/resume
Cray systems with an Aries interconnect and newer Cray X* series systems DO support suspend/resume
To enable suspend/resume, set suspendResume 1 in /etc/opt/cray/alps/alps.conf (using xtopview on CLE 5.2 and earlier CLEs) and then restart ALPS
Please refer to Cray's System Administration Guide for more details about using suspend/resume on Cray X* series
On a Cray X* series system, PBS issues a request to ALPS to switch an ALPS reservation IN (resume) or OUT (suspend)
On a Cray X* series, the suspended low priority job and the high priority job must fit into the Cray compute node’s memory
Cray X* series systems allow a maximum of 4 co-resident jobs on a compute node. Please read the Cray documentation for more details.
Interface #1 - New error code when ALPS fails to switch reservation from suspend to resume or resume to suspend
Change Control: Stable
Details:
The new error code "15219" is returned by the pbs_sigjob() IFL call when ALPS fails to switch the reservation in mom.
The "qsig" command prints the following error message when ALPS fails to switch the reservation:
"qsig: Switching ALPS reservation failed <job id>"
Interface #2 - New mom log messages
Change Control: Unstable
Details:
The following mom log messages are logged on Cray X* series systems when:
PBS tries to suspend/resume a job (PBSEVENT_DEBUG2):
"Switching ALPS reservation <ALPS reservation id> to <suspend/resume>"
ALPS fails to accept a reservation switch request (PBSEVENT_SYSTEM):
"Failed to switch <OUT/IN> ALPS reservation"
PBS issues the ALPS reservation switch request successfully (PBSEVENT_DEBUG2):
"Made the ALPS SWITCH request"
ALPS returns an 'EMPTY' response (PBSEVENT_DEBUG2). It is possible to incorrectly get an 'EMPTY' response (which means there is no claim on the ALPS reservation) when in reality there is a claim on the ALPS reservation. PBS prints this log message so that it is possible to see how often the false 'EMPTY' response is received:
"ALPS reservation <ALPS reservation ID> SWITCH status is = 'EMPTY'"
Interface #3 - Server attribute "restrict_res_to_release_on_suspend" is set to "ncpus" by default on Cray X* series systems
Change Control: Stable
Details:
During suspension of a job, PBS releases only ncpus on Cray by default
This attribute is set in the pbs_habitat script, which is executed when PBS is started for the first time after an install/upgrade
An admin may choose to update "restrict_res_to_release_on_suspend" and add more resources to it. However, removing "ncpus" from the list of resource names is not advisable on a Cray X* series system (Please read this page for more information on the "restrict_res_to_release_on_suspend" server attribute)
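An admin script that extends the attribute could guard against dropping "ncpus". The following is a minimal sketch under assumptions: the helper names are hypothetical, and the value is assumed to be a comma-separated list of resource names set through qmgr; check the attribute's own documentation for the exact syntax.

```python
def build_release_list(extra_resources):
    """Return a comma-separated resource list that always keeps ncpus."""
    resources = ["ncpus"]
    for name in extra_resources:
        name = name.strip()
        # Skip blanks and duplicates (including a repeated "ncpus").
        if name and name not in resources:
            resources.append(name)
    return ",".join(resources)


def qmgr_command(extra_resources):
    """Build a qmgr argument list that sets the server attribute.

    Assumption: the attribute value is a comma-separated list.
    """
    value = build_release_list(extra_resources)
    directive = "set server restrict_res_to_release_on_suspend = " + value
    return ["qmgr", "-c", directive]
```

The returned list can be passed to subprocess.run() on a host where qmgr is available and the caller has manager privilege.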
Interface #4 - Jobs with exclusive placement cannot be suspended
Change Control: Stable
Details:
On a Cray X* series system, a job that requests exclusive access to a node (i.e. -lplace=excl) cannot be suspended. ALPS returns an error when PBS tries to switch out an exclusive job's ALPS reservation
If a suspend is attempted on such a job, mom returns error code 15219 and logs a DEBUG-level error message, as described in Interface #5
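A site tool could check a job's placement before attempting a suspend, to avoid the error described above. The sketch below is hypothetical and not part of PBS: it assumes the place specification is a colon-separated string such as "scatter:excl", and it looks only for the "excl" token named in this interface.

```python
def requests_exclusive_placement(place_spec):
    """Return True if a place specification contains the 'excl' token.

    Assumption: tokens are separated by ':' (for example "scatter:excl").
    Only "excl" is checked, since that is the case this document describes.
    """
    if not place_spec:
        return False
    tokens = [token.strip() for token in place_spec.split(":")]
    return "excl" in tokens
```

For example, requests_exclusive_placement("scatter:excl") is True, while requests_exclusive_placement("scatter:shared") is False.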
Interface #5 - New mom log messages
Change Control: Unstable
Details:
The following mom log message is logged on Cray X* series systems when switching out an exclusive ALPS reservation (suspending a job with exclusive placement) fails (PBSEVENT_DEBUG):
"BASIL;ERROR: ALPS error: apsched: at least resid <ALPS reservation id> is exclusive"