PBS Design Changes for Shasta


v1.4 - Pass the "exclusive" field in POST - Lisa
v1.3 - clarify that all vnodes reported by the mom will be offlined when the hook wants to offline nodes - Vincent
v1.2 - change PBS_cray_jacs to PBS_cray_atom - Lisa/Vincent
v1.1 - "What happens when a hook times out" - Lisa
v1.0 - initial design


Please provide comments in the Discussion Forum: http://community.pbspro.org/t/pbs-design-changes-for-shasta-support/1481

Cray is introducing a new supercomputer called Shasta that allows PBS daemons to run directly on the compute nodes; in this way the Cray will look more like a Linux cluster.  On XC systems the interface was ALPS; on Shasta the interface is REST-based.  Cray provides a service called ATOM with REST interfaces that PBS will use to notify the Cray services when a job is starting and when a job has ended.  In order to work with the new Shasta supercomputers, PBS must be changed to call the appropriate new interface at the right time.  This design is about supporting existing PBS behavior on a new platform; it does not introduce new PBS behavior.  As such, it goes into internal design details about the changes that will be made to PBS in order to support Shasta supercomputers.  In addition, a few new externally visible tunables specific to Shasta will be added, which admins can use to affect the behavior of PBS on Shasta.


New PBS hook

As noted above, Cray's REST APIs need to be called at job start and at job end, and the execjob_begin and execjob_end hook events are perfectly suited for this.  Therefore, a new PBS hook called PBS_cray_atom.py will be created.  The hook will use Cray's new REST APIs to notify Cray services when a job is starting and when a job should be deleted.  The PBS_cray_atom hook will use the default hook attribute settings, except for alarm, which by default will be set to 300 seconds (according to Cray, this is the maximum amount of time its tasks can take to put the node into a usable state), and it will run as pbsadmin.  For the PBS_cray_atom hook there will also be a hook configuration file that the admin can use to modify hook-specific tunables.
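
For orientation, here is a minimal sketch of how a hook registered for both events might be structured.  The notify_* helpers are hypothetical placeholders for the REST calls detailed below:

import pbs

e = pbs.event()

if e.type == pbs.EXECJOB_BEGIN:
    # Tell ATOM the job is starting (see the POST sketch below).
    notify_job_start(e.job)   # hypothetical helper
elif e.type == pbs.EXECJOB_END:
    # Tell ATOM the job has ended (see the DELETE sketch below).
    notify_job_end(e.job)     # hypothetical helper

e.accept()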

What happens if the hook alarms

If the hook alarms while running in the execjob_begin event, all vnodes reported by the mom whose hook timed out will be marked offline.  This will be done via the hook's fail_action.
If the hook alarms while running in the execjob_end event, the hook will reject the event.

Hook's attributes/HK file

These are the default attributes for the PBS_cray_atom hook:

type=pbs
enabled=false
user=pbsadmin
event=execjob_begin,execjob_end
order=100
alarm=300
fail_action=offline_vnodes


Hook Configuration File

The configuration will be formatted as a JSON object.  For example, this is what the default will look like:

{
    "post_timeout": 30,
    "delete_timeout": 30,
    "unix_socket_file": "/var/run/atomd/atomd.sock"
}

The Cray REST API responses should be quick, but just in case, we have introduced new timeout tunables in the hook configuration file.  Their behavior is described below; a sketch of loading them follows the list.

  • "post_timeout"A new tunable in a hook configuration file.
    This timeout will be used for the POST requests
    • The value is a float in seconds.
    • The default timeout value will be 30.
  • "delete_timeout"A new tunable in a hook configuration file.
    This timeout will be used for the DELETE requests
    • The value is a float in seconds.
    • The default timeout value will be 30.
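
For illustration, a minimal sketch of how the hook might load these tunables over built-in defaults, assuming the pbs.hook_config_filename attribute that PBS makes available to hooks:

import json
import pbs

DEFAULTS = {
    "post_timeout": 30,
    "delete_timeout": 30,
    "unix_socket_file": "/var/run/atomd/atomd.sock",
}

def load_config():
    # Merge the hook's JSON config file, if one was deployed, over the defaults.
    cfg = dict(DEFAULTS)
    if pbs.hook_config_filename:
        with open(pbs.hook_config_filename) as fp:
            cfg.update(json.load(fp))
    return cfg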

Details about what happens at the execjob_begin event

In order to notify the Cray services (apconfig) when a job is starting, the new PBS_cray_atom hook will be called at the mom execjob_begin event.  For execjob_begin events, PBS will POST to the Shasta REST API endpoint /jobs, providing the job ID and the Linux user ID.
The expected response is 200 OK; if we get this response, the PBS_cray_atom hook will accept the event and PBS will continue with running the job.  If the response is a 400 (bad request), the hook will retry the request.
If a response other than 200 OK or 400 (bad request) is received, the hook will reject and the vnodes reported by the mom will be offlined (fail_action = offline_vnodes).  The response code and all of the fields of the response will be printed in the mom logs at the default log level.  The rejection will cause the job to be rejected by the job's main mom and requeued (if the job is requeueable).

If the POST encounters a timeout, the hook will reject the event and the vnodes reported by the mom will be offlined (fail_action = offline_vnodes).

There is a new field in the ATOM POST for jobs called "exclusive"; it will be set to true or false depending on the combination of the job's placement and the node's sharing value.
The hook will follow the same logic the scheduler uses to determine exclusivity, as described in the PBS Professional Reference Guide (see RG-322 of the 19.4 Reference Guide).
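
Putting this together, the begin-event flow might look roughly like the sketch below.  The payload key names, the job_is_exclusive helper, and the single-retry policy are illustrative assumptions; the endpoint and status handling follow the behavior described above.

import pwd
import pbs
import requests
import requests_unixsocket

def post_job_start(e, base_url, post_timeout):
    # base_url is e.g. "http+unix://%2Fvar%2Frun%2Fatomd%2Fatomd.sock/rm/v1"
    # (see "Forming the URIs" below).
    payload = {
        "jobid": str(e.job.id),                        # key names are assumptions
        "uid": pwd.getpwnam(str(e.job.euser)).pw_uid,  # Linux user ID of the job owner
        "exclusive": job_is_exclusive(e.job),          # hypothetical helper (RG-322 logic)
    }
    try:
        with requests_unixsocket.Session() as session:
            resp = session.post(base_url + "/jobs", json=payload,
                                timeout=post_timeout)
            if resp.status_code == 400:                # bad request: retry
                resp = session.post(base_url + "/jobs", json=payload,
                                    timeout=post_timeout)
    except requests.exceptions.Timeout:
        # reject() ends hook execution; fail_action then offlines the vnodes.
        e.reject("ATOM POST timed out")
    if resp.status_code != 200:
        pbs.logmsg(pbs.LOG_DEBUG, "ATOM POST: %s %s"
                   % (resp.status_code, resp.text))
        e.reject("unexpected ATOM response")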

Details about what happens at the execjob_end event

The PBS_cray_atom hook will also be called at the mom execjob_end event when a job has ended.  For execjob_end events, PBS will send a DELETE to the Shasta REST API endpoint /jobs/<jobid>, where <jobid> is the ID of the job that is ending.
The expected response is 204 No Content; a 204 response means the request was received and fully processed, and no response body is included.

If the DELETE encounters a timeout, the hook will reject the event.

NOTE:

If the DELETE encounters a timeout, the hook will not offline the vnodes, since it is possible the node could be fixed between that point and the next job arriving.  If the POST from that new job then fails, it is evident there is a problem with the node, and the hook will offline the vnodes at that point.
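
A corresponding sketch of the end-event flow, under the same assumptions as the POST sketch above (delete_job is an illustrative helper):

import pbs
import requests
import requests_unixsocket

def delete_job(e, base_url, delete_timeout):
    url = "%s/jobs/%s" % (base_url, e.job.id)
    try:
        with requests_unixsocket.Session() as session:
            resp = session.delete(url, timeout=delete_timeout)
    except requests.exceptions.Timeout:
        # Reject, but do not offline the vnodes, per the note above.
        e.reject("ATOM DELETE timed out")
    if resp.status_code != 204:
        e.reject("unexpected ATOM response %s" % resp.status_code)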


When Cray's tasks encounter a problem

It is important to note that if one of Cray's tasks finds a problem with the node, Cray's tasks are responsible for marking the compute node as unavailable and later marking it as available again.  (Cray plans to bring the PBS mom down if the node health is marked “Admindown”, and to tell the administrator to bring the PBS mom back up after the node health issue is resolved.)  Note that the PBS MoM should be restarted with the "-p" option to preserve running jobs and to track them.


Authentication and the mandatory Python modules

In order to use the Shasta REST APIs easily, we plan to use the Python requests module in the PBS_cray_atom hook, so the requests module is mandatory on any system that uses the hook.  And because Cray's REST API authenticates over a UNIX domain socket, the requests-unixsocket Python module is also mandatory: PBS needs to be able to speak HTTP over a UNIX domain socket to reach the Shasta REST APIs.  This module will be incorporated as part of this feature work.
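
As an illustration (not necessarily how the shipped hook will do it), the imports could be guarded so that a missing module produces a clear rejection message rather than a raw traceback:

import pbs

try:
    import requests
    import requests_unixsocket
except ImportError as err:
    pbs.event().reject("PBS_cray_atom: missing required Python module: %s" % err)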


Forming the URIs

PBS will have to compose a URL for each request.  The URL consists of the percent-encoded path to the UNIX domain socket file, concatenated with the base URL (the version path), followed by the rest of the API endpoint; a sketch follows the list below.

  • ["unix_socket_file"] – A new tunable in a hook configuration file
    It will be the path to the unix socket file to use for authentication.
    • The value is a string.
    • The default path will be “/var/run/atomd/atomd.sock
  • ["version_uri"] – an internal variable
    This is the base URL of the API.  New versions of the interface can have a new base URLs.  Due to a lack of a use case, and not knowing how a new version will affect PBS, this is not being made into an external tunable at this time.  However, rather than code the base URL into the request, we want to make it easy to update the path/version.
    • The value is a string
    • It will be set to "/rm/v1”
    • This value will be stored in the internal representation of the configuration, but will NOT be over-writable by a config file.
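
A minimal sketch of the composition (make_url is an illustrative helper; "http+unix" is the URL scheme the requests-unixsocket module understands):

import urllib.parse   # Python 3; Python 2 hooks would use urllib.quote

VERSION_URI = "/rm/v1"   # internal; not overridable from the config file

def make_url(unix_socket_file, endpoint):
    # e.g. make_url("/var/run/atomd/atomd.sock", "/jobs") returns
    # "http+unix://%2Fvar%2Frun%2Fatomd%2Fatomd.sock/rm/v1/jobs"
    encoded = urllib.parse.quote(unix_socket_file, safe="")
    return "http+unix://%s%s%s" % (encoded, VERSION_URI, endpoint)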


Rationale for why we decided to use hook configuration file tunables

  • If we use attributes or custom resources, each MoM will have to query the server for each attribute/resource it uses.  For a 1000-node job, the main MoM will query the server for the info and accept the event, and then the 999 sister MoMs will query at the same time.
  • On the other hand, the hook config file is sent to the MoMs initially, and every time a MoM starts up, the MoM and the server confirm that the config file is current.  That is more efficient than using an attribute or resource.


Ideas for possible future enhancements

In the future (this does not apply currently), if it becomes necessary to have specific hook configurations per node, the configuration could be extended like this:

{
    "post_timeout": 30,
    "delete_timeout": 30,
    "unix_socket_file": "/var/run/atomd/atomd.sock",
    "vn_type": {
        "arm": {
            "delete_timeout": 30
        }
    },
    "nodes": {
        "cmp17": {
            "unix_socket_file": "/var/atomd/atomd.sock"
        }
    }
}


