PP-864 Support suspend/resume on Cray X* series
https://pbspro.atlassian.net/browse/PP-864
Overview:
Cray X* series systems can suspend one or more jobs in order to run a higher-priority job. PBS needs to modify the suspend pseudo signal (used by the qsig command and by preemption) to support suspend and resume on Cray X* series systems.
Important things to note:
- Cray systems with a Gemini interconnect do NOT support suspend/resume
- Cray systems with an Aries interconnect and newer Cray X* series systems DO support suspend/resume
- To enable suspend/resume, set suspendResume 1 in /etc/opt/cray/alps/alps.conf (using xtopview on CLE 5.2 and earlier CLE releases) and then restart ALPS
- Please refer to Cray's System Administration Guide for more details about using suspend/resume on Cray X* series
- On a Cray X* series system, PBS issues a request to ALPS to switch an ALPS reservation IN (resume) or OUT (suspend)
- On a Cray X* series system, the suspended low priority job and the high priority job must fit together into the Cray compute node’s memory
- Cray X* series systems allow at most 4 co-resident jobs on a compute node. Please read the Cray documentation for more details.
Interface #1 - New error code when ALPS fails to switch reservation from suspend to resume or resume to suspend
- Change Control: Stable
- Details:
- The pbs_sigjob() IFL call returns the new error code "15219" when ALPS fails to switch the reservation in mom.
- The "qsig" command prints the following error message when ALPS fails to switch the reservation:
"qsig: Switching ALPS reservation failed <job id>"
Interface #2 - New mom log messages
- Change Control: Unstable
- Details:
- The following mom log messages are logged on Cray X* series systems when:
- PBS tries to suspend/resume a job (PBSEVENT_DEBUG2):
"Switching ALPS reservation <ALPS reservation id> to <suspend/resume>"
- ALPS fails to accept a reservation switch request (PBSEVENT_SYSTEM):
"Failed to switch <OUT/IN> ALPS reservation"
- PBS issues the ALPS reservation switch request successfully (PBSEVENT_DEBUG2):
"Made the ALPS SWITCH request"
- It is possible to incorrectly get an 'EMPTY' response (which means there is no claim on the ALPS reservation) when in reality there is a claim on the ALPS reservation. PBS prints this log message so that it is possible to see how often the false 'EMPTY' response is received (PBSEVENT_DEBUG2):
"ALPS reservation <ALPS reservation ID> SWITCH status is = 'EMPTY'"
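Since the 'EMPTY' message exists so that its frequency can be observed, a small log scanner may help. The following is a minimal sketch, not part of PBS: it matches only the message texts listed in this interface, and the function name, the category names, and the sample log line prefix are hypothetical. It assumes the placeholders in the messages are replaced by single whitespace-free tokens.

```python
import re
from collections import Counter

# Message texts are taken from Interface #2; the category names are illustrative.
PATTERNS = {
    "switch_attempt": re.compile(r"Switching ALPS reservation \S+ to \S+"),
    "switch_failed": re.compile(r"Failed to switch (OUT|IN) ALPS reservation"),
    "switch_requested": re.compile(r"Made the ALPS SWITCH request"),
    "empty_status": re.compile(r"ALPS reservation \S+ SWITCH status is = 'EMPTY'"),
}


def count_switch_events(lines):
    """Count how many lines match each ALPS switch message."""
    counts = Counter()
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
                break  # a log line carries at most one of these messages
    return counts


if __name__ == "__main__":
    # Hypothetical sample lines; the prefix before the message is made up.
    sample = [
        "ts;mom;Job;12.svr;Switching ALPS reservation 345 to suspend",
        "ts;mom;Job;12.svr;Made the ALPS SWITCH request",
        "ts;mom;Job;12.svr;ALPS reservation 345 SWITCH status is = 'EMPTY'",
    ]
    for name, total in count_switch_events(sample).items():
        print(name, total)
```

Any iterable of strings works as input, so an open mom log file can be passed directly. Only the message text is matched, so the sketch does not depend on the exact layout of the fields that precede it.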
Interface #3 - Server attribute "restrict_res_to_release_on_suspend" is set to "ncpus" by default on Cray X* series systems
- Change Control: Stable
- Details:
- During suspension of a job, PBS releases only ncpus on Cray by default
- This attribute is set in the pbs_habitat script, which is executed when PBS is started for the first time after an install/upgrade
- An admin may update "restrict_res_to_release_on_suspend" to add more resources to it. However, removing "ncpus" from the list of resource names is not advisable on a Cray X* series system (please read this page for more information on the "restrict_res_to_release_on_suspend" server attribute)
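The advice to keep "ncpus" in the list can be enforced by whatever tooling generates the qmgr input. The following is a minimal sketch under stated assumptions: the helper name is hypothetical, and the quoted comma-separated value form of the qmgr "set server" directive is an assumption that should be checked against the attribute's documentation before use.

```python
def build_release_on_suspend_directive(resources):
    """Build a qmgr directive for restrict_res_to_release_on_suspend.

    Refuses to produce a directive that drops "ncpus", following the
    recommendation for Cray X* series systems.
    """
    names = [name.strip() for name in resources if name.strip()]
    if "ncpus" not in names:
        raise ValueError(
            "'ncpus' should stay in restrict_res_to_release_on_suspend "
            "on Cray X* series systems"
        )
    # Drop duplicates while keeping the order the admin supplied.
    value = ",".join(dict.fromkeys(names))
    # Assumed value syntax: a quoted, comma-separated list of resource names.
    return f'set server restrict_res_to_release_on_suspend = "{value}"'


if __name__ == "__main__":
    print(build_release_on_suspend_directive(["ncpus", "mem"]))
```

The resulting string is meant to be passed to qmgr (for example through its -c option); the helper itself never contacts the server.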
Interface #4 - Jobs with exclusive placement cannot be suspended
- Change Control: Stable
- Details:
- On a Cray X* series system, a job that requests exclusive access to a node (i.e. -lplace=excl) cannot be suspended. ALPS returns an error when PBS tries to switch out an exclusive job's ALPS reservation
- If a suspend is attempted on such a job, mom returns error code 15219 and logs a DEBUG level error message, as described in Interface #5
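A script that suspends jobs may want to skip exclusive jobs up front and recognize the qsig error from Interface #1 when it does occur. The following is a minimal sketch, not the PBS implementation: the function names and result labels are hypothetical, the caller is assumed to already know the job's place string, and only the literal "excl" token named above is treated as exclusive. The failure is detected from the error text because the qsig exit status for this case is not specified here.

```python
import subprocess

# Message text from Interface #1; qsig appends the job id after it.
ALPS_SWITCH_FAILED = "Switching ALPS reservation failed"


def is_exclusive_placement(place):
    """Return True if a place string such as "scatter:excl" contains "excl"."""
    return "excl" in place.split(":")


def suspend_job(job_id, place, run=subprocess.run):
    """Try to suspend a job with qsig and classify the outcome.

    Returns "skipped_exclusive", "alps_switch_failed", "other_error" or "ok".
    `run` is injectable so the helper can be exercised without a PBS install.
    """
    if is_exclusive_placement(place):
        # ALPS rejects switching out an exclusive reservation, so do not try.
        return "skipped_exclusive"
    result = run(["qsig", "-s", "suspend", job_id],
                 capture_output=True, text=True)
    if ALPS_SWITCH_FAILED in result.stderr:
        return "alps_switch_failed"
    if result.returncode != 0:
        return "other_error"
    return "ok"
```

Checking the placement first avoids a round trip to mom and ALPS for a request that is known to fail; the stderr check still covers other causes of a failed reservation switch.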
Interface #5 - New mom log messages
- Change Control: Unstable
- Details:
- The following mom log message is logged on Cray X* series systems:
- While switching out an exclusive ALPS reservation (suspending a job with exclusive placement), the following error is logged (PBSEVENT_DEBUG):
"BASIL;ERROR: ALPS error: apsched: at least resid <ALPS reservation id> is exclusive"