PP-1018: [STALLED] As an admin, I would like to be able to prescribe the order in which placement sets are considered, so that I can direct the scheduler to always consider a particular set of resources first

[[This EDD is stalled]]



Target release: Future release

JIRA link: https://pbspro.atlassian.net/browse/PP-1018

Document status: DRAFT

Document owner: Shrinivas Harapanahalli

Designer: Shrinivas Harapanahalli

Developers:

QA:

Forum Discussion: PP-1018: Design document review of Placement set sorting feature





Interface: New sched_config "node_group_sort_key" to enable placement set sorting.

  • Change Control:  stable

  • Standing of the interface:  new interface

  • Interface type:  Configuration variable (in sched_config)

  • Synopsis:  A new sched_config option named "node_group_sort_key" is added, which enables the admin to configure the order in which placement sets are used for job placement.

  • Details:

Currently, the order in which placement sets are used during job placement is hard-coded as described below and cannot be customized by the admin.

The scheduler examines the placement sets in the pool and orders them, from smallest to largest, according to the following rules:

  1. Static total ncpus of all vnodes in set

  2. Static total mem of all vnodes in set

  3. Dynamic free ncpus of all vnodes in set

  4. Dynamic free mem of all vnodes in set

The new interface introduced in this EDD, node_group_sort_key, gives the admin a free hand in customizing the order in which placement sets in the placement pool are considered during job placement. This new interface also provides multiple sorting domains in a placement pool with multiple placement series.

The syntax of node_group_sort_key is similar to that of the node_sort_key sched_config option, i.e., a multi-word, multi-line key, as described below:

node_group_sort_key: "<resource> HIGH|LOW total|assigned|unused" <prime option>

where

resource    a custom vnode resource (including string / string array) or a built-in vnode resource such as ncpus or mem

total       use the resources_available value
assigned    use the resources_assigned value
unused      use the value given by resources_available - resources_assigned

**ignored for non-consumable resources

HIGH    sort in descending order, i.e., high first and low last
LOW     sort in ascending order, i.e., low first and high last

Unlike node_sort_key, the resource here is not restricted to numerical values. The idea is to allow the use of string values, such as the string array resource names that define a placement series, so that placement sets can be ordered alphabetically by the string values used in defining each placement set.

node_group_sort_key can be defined more than once, on multiple lines with different resources, in sched_config. The scheduler orders the placement sets based on the multiple node_group_sort_key entries in the order they appear in sched_config. This behavior is the same as for node_sort_key.
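The multi-key ordering described above can be sketched in Python. This is an illustrative model only, not the scheduler's actual data structures: the pset values and the sort_keys list (mirroring two node_group_sort_key lines, "ncpus LOW" then "mem LOW") are hypothetical.

```python
from functools import cmp_to_key

# Illustrative placement sets: name plus static resource totals.
psets = [
    {"name": "sw4", "ncpus": 20, "mem": 20},
    {"name": "sw2", "ncpus": 10, "mem": 24},
    {"name": "sw3", "ncpus": 12, "mem": 28},
    {"name": "sw5", "ncpus": 12, "mem": 16},
]

# Each entry mirrors one node_group_sort_key line: (resource, direction).
# LOW sorts ascending (low first); HIGH sorts descending (high first).
sort_keys = [("ncpus", "LOW"), ("mem", "LOW")]

def compare(a, b):
    # Keys are applied in the order they appear; later keys break ties.
    for resource, direction in sort_keys:
        if a[resource] != b[resource]:
            lt = a[resource] < b[resource]
            ascending = (direction == "LOW")
            return -1 if (lt == ascending) else 1
    return 0

ordered = sorted(psets, key=cmp_to_key(compare))
print([p["name"] for p in ordered])  # ['sw2', 'sw5', 'sw3', 'sw4']
```

Note how the second key ("mem LOW") breaks the tie between sw3 and sw5, which both have 12 ncpus.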

Default Value:

The default value of node_group_sort_key shall be defined in sched_config as below. This value is equivalent to the hard-coded rules listed at the beginning of this section.

node_group_sort_key: "ncpus LOW total" all
node_group_sort_key: "mem LOW total" all
node_group_sort_key: "ncpus LOW unused" all
node_group_sort_key: "mem LOW unused" all

  • Subtle change in Placement set identification in a Placement pool:

Current behavior: Placement sets are created and partitioned based on the different string values defined at the vnodes under the custom string array resources named in the node_group_key list, indexed in a single dimension, i.e., after flattening the resource names. For example, if the server's node_group_key attribute contains "router,switch", and router can take the values "R1" and "R2" and switch can take the values "S1", "S2", and "S3", then there are five placement sets "R1, R2, S1, S2, S3", in two placement series, in the server's placement pool.

New behavior introduced with this EDD: Placement sets are created and partitioned based on the combinations of the different string values defined at the vnodes under the custom string array resources named in the node_group_key list, indexed multi-dimensionally, with the custom string array resources named in node_group_key serving as the indices in that order. This is in addition to the placement sets created under the current behavior, i.e., the flattened index. For example, if the server's node_group_key attribute contains "router,switch", and router can take the values "R1" and "R2" and switch can take the values "S1", "S2", and "S3", then there are eleven placement sets "R1-S1, R1-S2, R1-S3, R2-S1, R2-S2, R2-S3, R1, R2, S1, S2, S3", in two placement series, in the server's placement pool.

**The implicit unset placement set and the placement set containing all nodes are not shown here.
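The pset identification above can be sketched as follows. The resource names and values are taken from the router/switch example; how the scheduler actually enumerates combinations is an implementation detail, so this is only a model of the counting.

```python
from itertools import product

# Hypothetical node_group_key resources and their observed values,
# in the order they appear in node_group_key ("router,switch").
values = {"router": ["R1", "R2"], "switch": ["S1", "S2", "S3"]}

# Current behavior: one pset per individual value (flattened index).
flat = [v for vals in values.values() for v in vals]

# New behavior adds one pset per combination of values, indexed
# multi-dimensionally in node_group_key order.
combos = ["-".join(c) for c in product(*values.values())]

psets = combos + flat
print(psets)
# 2*3 combination psets + (2+3) flat psets = 11 placement sets
```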


Example Scenario of placement set sorting:

  1. Simple Placement set (single Placement series)

Let's assume we have a simple placement set, i.e., a single placement series, configured as below:

set server node_group_key = switch
set server node_group_enable = True
set sched do_not_span_psets = True


Let's assume we have the vnode list below:

Vnode name    resources_available
              ncpus    mem     switch
vn0           4        8GB     "sw3,sw5"
vn1           2        8GB     "sw2"
vn2           8        8GB     "sw4,sw1"
vn3           2        4GB     "sw1"
vn4           8        16GB    "sw2"
vn5           4        16GB    "sw3,sw1"
vn6           8        8GB     "sw4,sw5,sw6"
vn7           4        4GB     "sw3,sw4"

Since we have 8 vnodes in a single placement series here, the partitioning and identification of psets is the same under the current and the new behavior. Six placement sets are created and identified as below.

pset    Total ncpus    Total mem    Vnodes in pset
sw1     14             28GB         vn2, vn3, vn5
sw2     10             24GB         vn1, vn4
sw3     12             28GB         vn0, vn5, vn7
sw4     20             20GB         vn2, vn6, vn7
sw5     12             16GB         vn0, vn6
sw6     8              8GB          vn6
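The partitioning in the table above can be sketched in Python. The vnode data is taken from the example; the dict layout is illustrative, not the scheduler's internal representation.

```python
# Example vnodes: name -> (ncpus, mem in GB, switch string-array values).
vnodes = {
    "vn0": (4, 8,  ["sw3", "sw5"]),
    "vn1": (2, 8,  ["sw2"]),
    "vn2": (8, 8,  ["sw4", "sw1"]),
    "vn3": (2, 4,  ["sw1"]),
    "vn4": (8, 16, ["sw2"]),
    "vn5": (4, 16, ["sw3", "sw1"]),
    "vn6": (8, 8,  ["sw4", "sw5", "sw6"]),
    "vn7": (4, 4,  ["sw3", "sw4"]),
}

# A vnode belongs to one pset per value in its string-array resource,
# so a vnode like vn6 ("sw4,sw5,sw6") appears in three psets.
psets = {}
for name, (ncpus, mem_gb, switches) in vnodes.items():
    for sw in switches:
        psets.setdefault(sw, []).append(name)

# Static totals per pset, as in the table above.
totals = {
    sw: (
        sum(vnodes[v][0] for v in members),  # total ncpus in pset
        sum(vnodes[v][1] for v in members),  # total mem (GB) in pset
    )
    for sw, members in psets.items()
}
print(totals["sw1"])  # (14, 28)
```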

Now the order of placement sets considered for job placement under the current and the new behaviour is described below.

**Only static resources are considered here for simplicity.

Ordering of Placement sets:

Current Behaviour (hard-coded ordering):

pset    Total ncpus    Total mem    Vnodes in pset
sw6     8              8GB          vn6
sw2     10             24GB         vn1, vn4
sw5     12             16GB         vn0, vn6
sw3     12             28GB         vn0, vn5, vn7
sw1     14             28GB         vn2, vn3, vn5
sw4     20             20GB         vn2, vn6, vn7

New Behaviour, with node_group_sort_key: "switch HIGH" all

pset    Total ncpus    Total mem    Vnodes in pset
sw6     8              8GB          vn6
sw5     12             16GB         vn0, vn6
sw4     20             20GB         vn2, vn6, vn7
sw3     12             28GB         vn0, vn5, vn7
sw2     10             24GB         vn1, vn4
sw1     14             28GB         vn2, vn3, vn5