Limit on maximum number of subjobs that can run at a time

Overview

Motivation: When the PBS scheduler decides to run subjobs of an array job, it runs as many as it can until no more can be started. A very large array job can therefore occupy all the machines. Admins want finer control over how many subjobs of a given array job can run at the same time.

Use case: A user may have a job array in which all subjobs interface with a single instance of a shared data file, and job performance degrades sharply as more subjobs run simultaneously. Further, different applications, or different runs of the same application, may have different impacts on the shared resource, so we have had requests for the ability to set this limit per job array at the user's request.

External Interface changes

Interface 1: Change to qsub -J option

  • Users can add a %<num> extension to the -J option. The number following '%' specifies the maximum number of subjobs of the array that may run concurrently.
  • If the user does not extend the -J option with '%<num>', the PBS scheduler behaves as it does today: it runs as many subjobs as it can and stops when no more subjobs can run.
  • The "%<num>" extension is not reflected in the job's "array_indices_submitted" attribute; instead it sets the "max_run_subjobs" job attribute (described in Interface 2). This change is made in qsub only for ease of use. Hook writers can read and modify the maximum number of concurrently running subjobs through the "max_run_subjobs" job attribute. Clients that call the IFL interface directly must also use the "max_run_subjobs" attribute.
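
As an illustration of the split described above, the following Python sketch separates a -J argument into the range text (what would end up in "array_indices_submitted") and the optional limit (what would end up in "max_run_subjobs"). It is not the qsub implementation: the function name parse_array_spec is hypothetical, and the sketch assumes the argument has the form start-end with an optional :step, and that the %<num> extension comes last.

```python
import re

# Assumed shape of the argument: start-end, optional :step, optional %limit.
# The placement of %limit after the step is an assumption of this sketch.
_SPEC_RE = re.compile(r"^(\d+)-(\d+)(?::(\d+))?(?:%(\d+))?$")


def parse_array_spec(spec):
    """Split a -J style argument into (range_text, limit).

    range_text is the part before '%'; limit is the integer after '%',
    or None when no %<num> extension was given.
    """
    match = _SPEC_RE.match(spec)
    if match is None:
        raise ValueError(f"invalid array range: {spec!r}")
    limit = match.group(4)
    # Everything before '%' is the index range the user submitted.
    range_text = spec.split("%", 1)[0]
    return range_text, (int(limit) if limit is not None else None)
```

With this sketch, parse_array_spec("1-100%5") yields the range "1-100" and the limit 5, while parse_array_spec("1-100:2") yields the range "1-100:2" and no limit, which corresponds to today's behavior.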

Interface 2: New job attribute "max_run_subjobs"

  • A new attribute of type 'long' is added to the job object. It holds the number of subjobs the user has requested to run concurrently. Its value must be a whole number.
  • If the user does not specify the number of concurrently running subjobs, either through Interface 1 or with -Wmax_run_subjobs=<num>, the PBS scheduler behaves as it does today and considers all subjobs runnable.
  • A user can qalter an array job using this attribute name. For example: qalter -Wmax_run_subjobs=8 12[].server1
  • If both the max_run_subjobs option and the -J "%<num>" extension are given in a qsub command, the command is rejected with the following message - "qsub: multiple max_run_subjobs values found"
  • Admins can modify this attribute in the following job-specific server hook events - queuejob, modifyjob.
  • If an array job hits the max_run_subjobs limit, it will accrue eligible time as long as eligible time is enabled and the job is not hitting any other limit.
  • PBS scheduler will not try preemption if an array subjob cannot run because it is hitting max_run_subjobs limit.
  • Suspended subjobs are not counted as running and thus these subjobs will not be counted against the max_run_subjobs limit.
  • If an admin issues 'qrun' on an array subjob, the PBS scheduler will try to run it even if the array is at its max_run_subjobs limit.
  • When the scheduler detects that an array job has hit the limit, it logs the following debug message - "Number of concurrent running subjobs limit reached".
  • If a user tries to set "max_run_subjobs" on a non-array job, PBS rejects the job submission IFL request with a new error number - 15231 and the following error message - "Attribute has to be set on an array job".
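
The scheduler-side rules in this list can be sketched as a small Python model. This is only an illustration under stated assumptions, not scheduler code: the ArrayJob class, the can_start_subjob function, and the forced_by_qrun flag are hypothetical names, and the state letters "Q", "R", and "S" stand for queued, running, and suspended subjobs. The check covers the max_run_subjobs limit alone; every other scheduling decision is out of scope.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

LIMIT_MSG = "Number of concurrent running subjobs limit reached"


@dataclass
class ArrayJob:
    """Hypothetical in-memory model of an array job (not a PBS structure)."""
    max_run_subjobs: Optional[int] = None  # None: no limit was requested
    subjob_states: Dict[int, str] = field(default_factory=dict)  # index -> state

    def running_count(self) -> int:
        # Only running subjobs count; suspended ("S") subjobs do not.
        return sum(1 for state in self.subjob_states.values() if state == "R")


def can_start_subjob(job: ArrayJob, forced_by_qrun: bool = False, log=print) -> bool:
    """Return True if the limit does not prevent starting one more subjob."""
    if forced_by_qrun:
        # An admin's qrun is attempted even when the array is at its limit.
        return True
    if job.max_run_subjobs is None:
        # No limit set: behave as today and treat the subjob as runnable.
        return True
    if job.running_count() >= job.max_run_subjobs:
        log(LIMIT_MSG)
        return False
    return True
```

For example, an array with max_run_subjobs set to 2 and subjobs in states "R", "S", and "Q" still has room for one more subjob, because the suspended subjob is not counted. Once a second subjob is running, the check returns False and emits the debug message, unless the request comes from qrun.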
