Introduction:
Use case:
As clusters get larger and workloads vary, it is becoming critical that jobs are evaluated in as short a time as possible to ensure that the correct workload is being run. Using multiple schedulers to address this issue allows for different scheduling policies and a quicker turnaround time for a large number of jobs or nodes.
Gist of design proposal:
The PBS scheduler in its current form can easily run as multiple instances on the same machine. There are only two major problems that we have to deal with:
- Managing the schedulers - This includes starting the schedulers, configuring them, making the PBS server connect to each one of them, and then making them run on specific events.
- Making sure that schedulers do not overrun each other's territory, i.e. that each one runs on a clearly partitioned part of the complex in terms of jobs and nodes.
The design proposal below addresses both of these problems.
Interface 1: Extend PBS to support a list of scheduler objects
- Visibility: Public
- Change Control: Stable
- Details:
- PBS supports a list of scheduler objects to be created using qmgr. It is similar to how we create nodes in server.
- The qmgr command can be used to create a scheduler object. It must be invoked by a PBS admin/manager.
- To create a scheduler object and make it run, the following are the attributes that can be set by the user
- Name of the scheduler is mandatory to be given while creating a scheduler object.
- qmgr -c "c sched multi_sched_1"
- This will create/set the following attributes for the sched object
- port - If not defined by the user, the server starts from 15050 and tries to run the scheduler on the next available port number.
- partition = "None" (default)*
- sched_priv = $PBS_HOME/sched_priv_multi_sched_1 (default)*
- sched_log = $PBS_HOME/sched_log_multi_sched_1 (default)*
- scheduling = False (default)*
- scheduler_iteration = 600 (default)*
- sched_user = <pbs_server user> (default)
- comment - Sites can use the comment field to:
- flag a scheduler that needs attention, for example if it restarts 2-3 times in an hour due to potential crashes (i.e. comment => "NEEDS_ATTENTION")
- indicate when a particular scheduler is ready to function again (i.e. comment => "READY_TO_USE")
- "*" indicates that the value will be visible when the admin lists or prints the sched object after the sched object is created.
- Set the priv directory for the scheduler.
- The directory must be owned by the sched_user specified while creating the scheduler object. It should have permissions set to "750". By default a sched object has its priv directory set to $PBS_HOME/sched_priv_<sched-name>. If the directory is already used by another scheduler, error code "15215" is set with the error message
"Another Sched object also has same value for its sched_priv directory"
- qmgr -c "s sched multi_sched_1 sched_priv=/var/spool/pbs/sched_priv_1"
- If the priv directory is not accessible by the scheduler process, or the scheduler files are not found in the directory, then the comment is updated with the following error message
"scheduler can not access it's priv directory"
- Set the log directory for the scheduler.
- The directory must be owned by the sched_user specified while creating the scheduler object. It should have permissions set to "755". By default a sched object has its logs directory set to $PBS_HOME/sched_logs_<sched_name>. If the directory is already used by another scheduler, error code "15216" is set with the error message
"Another Sched object also has same value for its sched_log directory"
- qmgr -c "s sched multi_sched_1 sched_log=/var/spool/pbs/sched_logs"
- If the log directory is not accessible by the scheduler process, or the scheduler files are not found in the directory, then the comment is updated with the following error message
"scheduler can not access it's log directory"
- By default a multi-sched object has scheduling set to False. It can be enabled as follows:
qmgr -c "s sched <scheduler name> scheduling = 1"
- The following attributes will be set on the default scheduler if the user sets them on the server
- scheduling
- scheduler_iteration
- Max length of scheduler name is 15
- By default PBS server will configure a default scheduler which will run out of the box.
- The name of this default scheduler will be "default"
- The sched_priv directory of this default scheduler will be set to $PBS_HOME/sched_priv.
- The default scheduler will log to the $PBS_HOME/sched_logs directory.
- Default scheduler will be provided with default set of policies as mentioned in sched_config.
- One can set a scheduler attribute through qmgr either in the usual way or with the new syntax, which names the scheduler, as shown below. The old syntax is supported for backward compatibility and will be deprecated soon.
Ex: qmgr -c "set sched job_sort_formula_threshold = <value>"
qmgr -c "set sched default job_sort_formula_threshold = <value>"
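The defaults listed for a newly created sched object can be sketched as a small function. This is only an illustration of the rules stated above, not the server's code: the function names are hypothetical, "next available port" is modelled as a lookup in a set of ports already in use rather than by probing sockets, and the sched_log default follows the `sched_log_<name>` form from the attribute list.

```python
from typing import Dict, Set

BASE_PORT = 15050    # first port tried, as stated in the text
MAX_NAME_LEN = 15    # maximum scheduler name length, as stated in the text


def next_free_port(used_ports: Set[int], start: int = BASE_PORT) -> int:
    """Return the first port >= start that is not in used_ports."""
    port = start
    while port in used_ports:
        port += 1
    return port


def default_sched_attrs(name: str, pbs_home: str,
                        used_ports: Set[int]) -> Dict[str, object]:
    """Build the default attribute set for a new scheduler object."""
    if not name:
        raise ValueError("scheduler name is mandatory")
    if len(name) > MAX_NAME_LEN:
        raise ValueError(f"scheduler name longer than {MAX_NAME_LEN} characters")
    return {
        "port": next_free_port(used_ports),
        "partition": "None",
        "sched_priv": f"{pbs_home}/sched_priv_{name}",
        "sched_log": f"{pbs_home}/sched_log_{name}",
        "scheduling": False,
        "scheduler_iteration": 600,
    }


if __name__ == "__main__":
    # Hypothetical PBS_HOME and one port already taken by another scheduler.
    print(default_sched_attrs("multi_sched_1", "/var/spool/pbs", {15050}))
```

Keeping the port search separate from the attribute defaults makes it easy to swap in a real availability check later.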
Interface 2: Changes to PBS scheduler
- Visibility: Public
- Change Control: Stable
- Details:
- Scheduler now has additional attributes which can be set in order to run it.
- sched_priv - to point to the directory where scheduler keeps the fairshare usage, resource_group, holidays file and sched_config
- sched_logs - to point to the directory where scheduler logs.
- partitions - list of all the partitions for which this scheduler is going to schedule jobs.
- host - hostname on which scheduler is running. For default scheduler it is set to pbs server hostname.
- port - port number on which scheduler is listening.
- state - This attribute shows the status of the scheduler. It is a parameter that is set only by pbs server.
- One can set a partition or a comma-separated list of partitions on a scheduler object. Once set, the scheduler object will only schedule jobs from the queues attached to the specified partitions.
- qmgr -c "s sched multi_sched_1 partitions='part1,part2'"
- If no partitions are specified for a scheduler object (other than the default scheduler, on which no partition value can be set), then that scheduler will not schedule any jobs.
- By default, all new queues will be scheduled by the default scheduler until they have been assigned to a specific partition.
- A partition attached to one scheduler can not be attached to a second scheduler without first removing it from the first scheduler. Attempting this produces the following error:
- qmgr -c "s sched multi_sched_1 partitions+='part2'"
Partition part2 is already associated with scheduler <scheduler name>.
- Scheduler object "state" attribute will show one of these 3 values - DOWN, IDLE, SCHEDULING
- If a scheduler object is created but scheduler is not running for some reason state will be shown as "DOWN"
- If a scheduler is up and running but waiting for a cycle to be triggered the state will be shown as "IDLE"
- If a scheduler is up and running and also running a scheduling cycle then the state will be shown as "SCHEDULING"
- The default sched object is the only sched object that cannot be deleted.
- Setting sched_port, sched_priv or sched_host on the default scheduler is not allowed. The following error message is logged in server_logs when we try to change the sched_priv directory.
- qmgr -c "s sched default sched_priv = /tmp"
Operation is not permitted on default scheduler
- Trying to start a new scheduler other than the default scheduler without assigning a partition will log the following error message in sched_logs.
Scheduler does not contain a partition
- If the scheduler fails to accept a new value for its sched_log directory, then the comment of the corresponding scheduler object at the server is updated with the following message. The scheduling attribute is also set to false.
Unable to change the sched_logs directory
- If the scheduler fails to accept a new value for its sched_priv directory, then the comment of the corresponding scheduler object at the server is updated with the following message. The scheduling attribute is also set to false.
Unable to change the sched_priv directory
- If the PBS validation checks for the new value of the sched_priv directory do not pass, then the comment of the corresponding scheduler object at the server is updated with the following message. The scheduling attribute is also set to false.
PBS failed validation checks for sched_priv directory
- If the scheduler successfully accepts the new log_dir configured through qmgr, then the following message is logged in sched_logs.
Scheduler log directory is changed to <value of path of the log directory>
- If the scheduler successfully accepts the new sched_priv configured through qmgr, then the following message is logged in sched_logs.
Scheduler priv directory is changed to <value of path of the sched_priv directory>
- If we keep disassociating partitions from a scheduler until it no longer contains any partition, then this scheduler is identical to the default scheduler. In that case we shut this scheduler down and the following error message is logged in sched_logs.
Scheduler does not contain a partition.
- If the scheduler fails to get its stats from the server, then the following error message is shown in sched_logs.
Unable to retrieve the scheduler attributes from server
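Two of the rules in this interface, partition exclusivity and the three-valued "state" attribute, can be sketched as follows. This is a minimal model and not the server's implementation: the function and class names are hypothetical, and the ownership table is a plain dictionary from scheduler name to its set of partitions.

```python
from typing import Dict, Set


class PartitionError(Exception):
    """Raised when a partition is already owned by another scheduler."""


def add_partition(owners: Dict[str, Set[str]], sched: str, partition: str) -> None:
    """Attach a partition to a scheduler, enforcing one owner per partition."""
    for other, parts in owners.items():
        if other != sched and partition in parts:
            raise PartitionError(
                f"Partition {partition} is already associated with scheduler {other}."
            )
    owners.setdefault(sched, set()).add(partition)


def sched_state(process_running: bool, in_cycle: bool) -> str:
    """Map the scheduler's condition to DOWN, IDLE or SCHEDULING."""
    if not process_running:
        return "DOWN"
    return "SCHEDULING" if in_cycle else "IDLE"


if __name__ == "__main__":
    owners: Dict[str, Set[str]] = {}
    add_partition(owners, "multi_sched_1", "part1")
    add_partition(owners, "multi_sched_2", "part2")
    try:
        add_partition(owners, "multi_sched_1", "part2")
    except PartitionError as err:
        print(err)
    print(sched_state(True, False))
```

To move a partition between schedulers in this model, it has to be removed from the first scheduler's set before it is added to the second.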
Interface 3: Removed
Interface 4: Changes to PBS server.
- Visibility: Public
- Change Control: Stable
- Details:
- PBS does not allow attributes like scheduling and scheduler_iteration to be set on the PBS server object.
- scheduling and scheduler_iteration now belong to the sched object.
- If set at the server level, the changes will be applied to the default sched object.
- During failover, when the secondary server takes control, it will try to connect to the schedulers by using their host attribute.
- If the secondary server is unable to connect to a scheduler running on a remote host, then it will start that scheduler locally and update its "host" attribute.
- When the primary PBS server takes control back from the secondary, it will always check whether each scheduler's host attribute matches its own server name. If it doesn't, it will shut down the remote scheduler and spawn it locally on the primary server.
- For backward compatibility, PBS still allows attributes like scheduling and scheduler_iteration to be set on the PBS server object. Any changes made to these attributes are automatically reflected in the scheduler object. Similarly, any changes made to these attributes in the scheduler object are automatically reflected in the server object.
- If at any point the server is not able to contact the corresponding scheduler, one of the following messages is shown in server_logs.
Unable to reach scheduler associated with partition
Unable to reach scheduler associated with job <job id>
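The failover behaviour described above can be sketched as two decision functions. This is only an illustration of the stated rules: the function names and action labels are hypothetical, and connectivity is represented by a caller-supplied callback rather than a real network check.

```python
from typing import Callable, Tuple


def secondary_takeover(sched_host: str, secondary_host: str,
                       can_connect: Callable[[str], bool]) -> Tuple[str, str]:
    """Return (new host attribute, action) when the secondary takes control."""
    if can_connect(sched_host):
        # The scheduler is reachable where it already runs: just connect.
        return sched_host, "connect"
    # Unreachable: start it locally and update the host attribute.
    return secondary_host, "start_local"


def primary_takeover(sched_host: str, primary_host: str) -> Tuple[str, str]:
    """Return (new host attribute, action) when the primary takes control back."""
    if sched_host == primary_host:
        return sched_host, "connect"
    # Host mismatch: shut the remote scheduler down and spawn it on the primary.
    return primary_host, "shutdown_remote_and_start_local"


if __name__ == "__main__":
    # Hypothetical host names; the scheduler's host is unreachable here.
    host, action = secondary_takeover("hostA", "hostB", lambda h: False)
    print(host, action)
    print(primary_takeover(host, "hostA"))
```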
Interface 5: Changes to PBS Nodes objects.
- Visibility: Public
- Change Control: Stable
- Details:
- Node object in PBS will have an additional attribute called "partition" which can be used to associate a node to a particular partition.
- This attribute will be of type string and it will be settable only by Manager/operator and viewable by all users.
- If "partition" attribute is not set, node will not belong to any partition and default scheduler will schedule jobs on this node.
- A PBS admin/manager can set a node's partition attribute to an existing partition name, and the corresponding scheduler will then schedule jobs on this node.
- If nodes are associated with a partition, they can not be linked to any queue which isn't part of that partition. Trying to set a node to a queue which isn't part of its partition will result in the following error:
- Qmgr: s n node1 queue=workq1
qmgr obj=stblr3 svr=default: Partition p1 is not part of queue for node
qmgr: Error (15220) returned from server
- If a node is associated with a queue, then trying to set a partition on this node which does not match that queue's partition will result in the following error.
- Qmgr: s n stblr3 partition=p2
qmgr obj=stblr3 svr=default: Queue q1 is not part of partition for node
qmgr: Error (15219) returned from server
- If a queue is associated with one or more nodes, then trying to change the partition of this queue to a value other than the one set on these nodes will result in the following error.
- Qmgr: s q q1 partition=p2
qmgr obj=q1 svr=default: Invalid partition in queue
qmgr: Error (15221) returned from server
- Nodes with a partition set (but no queue set) can run jobs from any of the queues assigned to the same partition (depending upon resource constraints).
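The three consistency checks between nodes, queues and partitions can be sketched as below. This is a hedged model rather than the server's code: the class and function names are hypothetical, the error codes and messages are the ones quoted above, and a queue without a partition is treated as not matching any node partition, which is an assumption about how the rule applies in that case.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


class PbsError(Exception):
    def __init__(self, code: int, message: str):
        super().__init__(message)
        self.code = code


@dataclass
class Node:
    name: str
    partition: Optional[str] = None
    queue: Optional[str] = None


def set_node_queue(node: Node, queue: str,
                   queue_partition: Dict[str, Optional[str]]) -> None:
    """Link a node to a queue; the queue must be in the node's partition."""
    if node.partition is not None and queue_partition.get(queue) != node.partition:
        raise PbsError(15220, f"Partition {node.partition} is not part of queue for node")
    node.queue = queue


def set_node_partition(node: Node, partition: str,
                       queue_partition: Dict[str, Optional[str]]) -> None:
    """Set a node's partition; it must match the partition of the node's queue."""
    if node.queue is not None and queue_partition.get(node.queue) != partition:
        raise PbsError(15219, f"Queue {node.queue} is not part of partition for node")
    node.partition = partition


def set_queue_partition(queue: str, partition: str, nodes: List[Node],
                        queue_partition: Dict[str, Optional[str]]) -> None:
    """Change a queue's partition; nodes on the queue must not disagree."""
    for n in nodes:
        if n.queue == queue and n.partition is not None and n.partition != partition:
            raise PbsError(15221, "Invalid partition in queue")
    queue_partition[queue] = partition


if __name__ == "__main__":
    qp: Dict[str, Optional[str]] = {"q1": "p1", "workq1": None}
    node = Node("stblr3", partition="p1")
    set_node_queue(node, "q1", qp)
    try:
        set_node_partition(node, "p2", qp)
    except PbsError as err:
        print(err, err.code)
```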
Interface 6: Changes to Queues.
- Visibility: Public
- Change Control: Stable
- Details:
- Queue will have a new queue attribute named "partition" which can be used to associate a queue to a particular partition.
- This attribute will be of type string and it will be settable only by admin/manager and viewable by all users.
- If "partition" attribute is not set to anything, queue will not belong to any partition and the default scheduler will schedule jobs from this queue.
- Setting the "partition" attribute on routing queues is not allowed. Trying to set it will throw the following error.
Qmgr: s q q4 partition=p1
qmgr obj=q4 svr=default: Can not assign a partition to route queue
qmgr: Error (15217) returned from server
- An execution queue can not be changed to a routing queue if the "partition" attribute is set on it. Trying to do so will throw the following error.
Qmgr: s q q1 queue_type=route
qmgr obj=q1 svr=default: Route queues are incompatible with the partition attribute queue_type
qmgr: Error (15218) returned from server
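The two route-queue restrictions can be modelled in a few lines. This is a sketch under stated assumptions, not the server's implementation: the `Queue` class and function names are hypothetical, while the error codes and messages are taken from the transcripts above.

```python
from dataclasses import dataclass
from typing import Optional


class QueueError(Exception):
    def __init__(self, code: int, message: str):
        super().__init__(message)
        self.code = code


@dataclass
class Queue:
    name: str
    queue_type: str = "execution"
    partition: Optional[str] = None


def set_partition(queue: Queue, partition: str) -> None:
    """Assign a partition; route queues may not carry one."""
    if queue.queue_type == "route":
        raise QueueError(15217, "Can not assign a partition to route queue")
    queue.partition = partition


def set_queue_type(queue: Queue, queue_type: str) -> None:
    """Change the queue type; a queue with a partition may not become a route queue."""
    if queue_type == "route" and queue.partition is not None:
        raise QueueError(
            15218,
            "Route queues are incompatible with the partition attribute queue_type",
        )
    queue.queue_type = queue_type


if __name__ == "__main__":
    q1 = Queue("q1")
    set_partition(q1, "p1")
    try:
        set_queue_type(q1, "route")
    except QueueError as err:
        print(err.code, err)
```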
Interface 7: How PBS server runs scheduler.
- Visibility: Public
- Change Control: Stable
- Details:
- Upon startup, the PBS server will start all schedulers which have their scheduling attribute set to "True".
- The "PBS_START_SCHED" pbs.conf variable is now deprecated and its value will be overridden by each scheduler's "scheduling" attribute.
- The PBS server will connect to these schedulers on their respective host names and port numbers.
- Scheduling cycles for all configured schedulers are started by the PBS server when a job is queued or finishes, when the scheduling attribute is set to True, or when scheduler_iteration has elapsed.
- When a job is queued or finishes, the server will check its queue and try to connect to the corresponding scheduler to run a scheduling cycle.
- If a scheduler is already running a scheduling cycle, the server will wait for the previous cycle to finish before trying to start another one.
- If job_accumulation_time is set, the server will wait until that time has passed after the submission of a job before starting a new cycle.
- While querying the server, each scheduler specifies its scheduler name and then gets only the chunk of the universe that is relevant to it.
- It gets all the running, queued and exiting jobs from the queues associated with one of its partitions.
- It gets the list of all nodes which are associated with the partitions managed by the scheduler.
- It gets the list of all the global policies like run soft/hard limits set on the server object.
- PBS's init script will now report the status of the PBS server only. Schedulers will be managed by the server and their status can be fetched using a qmgr command.
- When the pbs_server daemon is stopped using "qterm -s", it will also stop all the running scheduler processes.
- While shutting down pbs_server, the pbs init script will use the "-s" option of qterm so that all schedulers also come down along with the server.
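The per-scheduler view of the complex described above can be sketched as a filter over queues, nodes and jobs. This is an illustrative model only: the function name and data layout are hypothetical, job states are spelled out as words, a `partitions` value of `None` stands for the default scheduler (which sees objects with no partition set), and the global server policies are left out.

```python
from typing import Dict, List, Optional, Set

VISIBLE_JOB_STATES = {"running", "queued", "exiting"}


def scheduler_view(partitions: Optional[Set[str]],
                   queue_partition: Dict[str, Optional[str]],
                   node_partition: Dict[str, Optional[str]],
                   jobs: List[Dict[str, str]]) -> Dict[str, List[str]]:
    """Return the queues, nodes and jobs that one scheduler gets to see."""

    def owned(p: Optional[str]) -> bool:
        # The default scheduler owns unpartitioned objects; others own their partitions.
        return p is None if partitions is None else p in partitions

    queues = {q for q, p in queue_partition.items() if owned(p)}
    nodes = sorted(n for n, p in node_partition.items() if owned(p))
    visible = [j["id"] for j in jobs
               if j["queue"] in queues and j["state"] in VISIBLE_JOB_STATES]
    return {"queues": sorted(queues), "nodes": nodes, "jobs": visible}


if __name__ == "__main__":
    # Hypothetical complex: one partitioned queue and node, one unpartitioned of each.
    qp = {"q1": "p1", "workq": None}
    np_ = {"node1": "p1", "node2": None}
    jobs = [
        {"id": "1", "queue": "q1", "state": "queued"},
        {"id": "2", "queue": "workq", "state": "running"},
        {"id": "3", "queue": "q1", "state": "held"},
    ]
    print(scheduler_view({"p1"}, qp, np_, jobs))
    print(scheduler_view(None, qp, np_, jobs))
```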
Interface 8: Changes to pbs_rsub command
- Visibility: Public
- Change Control: Stable
- Details:
- Reservations can now be submitted to a specific partition using a new "-p" option of the pbs_rsub command.
- The "-p" option of the pbs_rsub command takes a partition name as input and makes pbs_server trigger a scheduling cycle of the scheduler that is servicing the partition.
- If the scheduler servicing the requested partition isn't up and running, the PBS server will store the reservation and mark it as "UNCONFIRMED" until it is able to trigger a scheduling cycle of that scheduler.
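The handling of a reservation submitted with "-p" can be sketched as follows. This is a rough model with hypothetical names and reservation ids: it only covers what the text states, namely that the server either triggers a cycle of the partition's scheduler or holds the reservation as "UNCONFIRMED" until it can. What the scheduler does with the reservation during that cycle is outside this sketch.

```python
from typing import Dict, List


def submit_reservation(resv_id: str, partition: str,
                       sched_up: Dict[str, bool],
                       pending: Dict[str, List[str]]) -> Dict[str, object]:
    """Record a reservation and trigger a cycle if the partition's scheduler is up."""
    resv = {"id": resv_id, "partition": partition,
            "state": "UNCONFIRMED", "cycle_triggered": False}
    if sched_up.get(partition, False):
        resv["cycle_triggered"] = True
    else:
        # Scheduler is down: the server keeps the reservation until it can trigger a cycle.
        pending.setdefault(partition, []).append(resv_id)
    return resv


def scheduler_came_up(partition: str, pending: Dict[str, List[str]]) -> List[str]:
    """Return the held reservations that can now be handed to a scheduling cycle."""
    return pending.pop(partition, [])


if __name__ == "__main__":
    pending: Dict[str, List[str]] = {}
    print(submit_reservation("R1", "p1", {"p1": False}, pending))
    print(scheduler_came_up("p1", pending))
```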
Interface 9: Deleted
Notes:
- What is not supported when multiple scheduler objects are present.
- With the introduction of this feature, the following are not supported.
- Run limits set on the server are not supported, because a scheduler object does not have a view of the whole PBS universe.
- A fairshare scheduling policy across the whole PBS complex is no longer supported; instead, this policy is limited to each individual scheduler.
- The server's backfill_depth will be the default value for all the schedulers in the complex.
Ex: With the default server backfill_depth of 1, 1 job per scheduler will be backfilled.
If the server's backfill_depth is set to 5, 5 jobs from each scheduler will be backfilled.