Finer-grained node grouping

Overview

There is a desire for jobs to be placed on nodes of a single type.  The type groupings can have many values.  Job submitters do not care which value of the group their job is placed on, as long as all the nodes for the job have the same value for the group.  In PBS today, this grouping is called a placement set.  Placement sets can be defined complex-wide, queue-wide, or just for the job.  The more specific placement sets override the less specific ones (e.g. queue placement sets override the complex-wide placement sets).  The new desire is more finely grained grouping: instead of applying to the whole job, grouping should be able to apply to part of a resource request, but not all of it.

Technical Details

Glossary

chunk complex - The part of a select statement between the pluses: one or more identical chunks specified in the form N:chunk.

New PBS resource: group

There is a new PBS resource named group.  It can only be requested in the select statement.  The value of the group resource is the name of a placement set resource (just like the resources in node_group_key).  When a chunk complex requests group, placement sets will be created based on the specified resource, and the chunks from that chunk complex will be placed on nodes where the resource is set to the same value.  A select statement can contain multiple group requests with the same or different values, as long as there are not two group requests in the same chunk complex.  If two chunk complexes request the same group resource, each chunk complex is evaluated individually.  This means the different chunk complexes can be placed on different placement sets in the same placement pool.
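
As a sketch of the proposed syntax (the group resource is what this design introduces; job.sh is a placeholder script), a submission with two chunk complexes grouped by different node resources might look like:

    qsub -l select=2:ncpus=1:group=color+2:ncpus=1:group=shape job.sh

Here color and shape are assumed to be host-level resources already set on the nodes, as with any resource usable in node_group_key.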


Interaction with other placement sets

Currently a job can be run within one pool of placement sets.  These come from the server, queue, or job: the job's place=group overrides the queue's node_group_key, which overrides the server's node_group_key.  Per-chunk grouping will work similarly.  If a chunk complex requests a group, it will override any other placement sets for the job.  If a job has multiple chunk complexes where some request a group and others do not, the chunk complexes that do not request a group will be placed over all nodes available to the job (e.g. if the job is in a queue with nodes associated with it, only those nodes).  It is invalid to request place=group and per-chunk grouping together.

Example:

Nodes  Color  Shape
1-2    blue   square
3-4    blue   triangle
5-6    red    square
7-8    red    triangle


Current:

If the server has node_group_key=color and a job requests place=group=shape, the job's place=group overrides the server's node_group_key, so the job will be placed within one of the shape placement sets:

shape=square - nodes 1-2, 5-6

shape=triangle - nodes 3-4, 7-8
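
For comparison, this current per-job grouping is requested through the place statement.  A submission producing the behavior above might look like (job.sh is a placeholder script):

    qsub -l select=4:ncpus=1 -l place=group=shape job.sh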

Example 1: Interaction between the server's node_group_key and multiple chunk complexes.  One chunk complex has per-chunk grouping and the other does not.

node_group_key=color

select=2:ncpus=1:group=shape+2:ncpus=1

Chunk complex 2:ncpus=1:group=shape will be placed within one of the shape placement sets:

shape=square - nodes 1-2, 5-6

shape=triangle - nodes 3-4, 7-8

Chunk complex 2:ncpus=1 will be run on any node available to the job:

all nodes - 1-8
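
A minimal setup sketch for this example, assuming the color and shape resources do not exist yet (node and script names are placeholders):

    # define host-level string resources usable for placement sets
    qmgr -c "create resource color type=string, flag=h"
    qmgr -c "create resource shape type=string, flag=h"

    # tag the nodes, e.g. node1 is blue/square
    qmgr -c "set node node1 resources_available.color = blue"
    qmgr -c "set node node1 resources_available.shape = square"

    # enable placement sets and group complex-wide by color
    qmgr -c "set server node_group_enable = true"
    qmgr -c "set server node_group_key = color"

    # per-chunk grouping on the first chunk complex only
    qsub -l select=2:ncpus=1:group=shape+2:ncpus=1 job.sh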

Example 2: Per-chunk grouping with two chunk complexes that request different groups.

no per-job placement

select=2:ncpus=1:group=color+2:ncpus=1:group=shape

Chunk complex 2:ncpus=1:group=color will be placed within one of the color placement sets:

color=blue - nodes 1-4

color=red - nodes 5-8

Chunk complex 2:ncpus=1:group=shape will be placed within one of the shape placement sets:

shape=square - nodes 1-2, 5-6

shape=triangle - nodes 3-4, 7-8
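
The corresponding submission for this example might be (job.sh is a placeholder script):

    qsub -l select=2:ncpus=1:group=color+2:ncpus=1:group=shape job.sh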


Example 3: Per-chunk grouping where two chunk complexes request the same group.

no per-job placement

select=2:ncpus=1:group=color+2:ncpus=1:group=color

Both chunk complexes will be placed similarly: 

color=blue - nodes 1-4

color=red - nodes 5-8

The difference between per-job place=group=color and this request is that the two chunk complexes can be placed on different placement sets.  It is possible for the first chunk complex to be placed on color=blue and the second chunk complex to be placed on color=red.
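
To make the contrast concrete, the two requests might be submitted as follows (job.sh is a placeholder script):

    # per-job grouping: all four chunks must land in the same color placement set
    qsub -l select=4:ncpus=1 -l place=group=color job.sh

    # per-chunk grouping: each chunk complex is evaluated individually,
    # so the two complexes may land in different color placement sets
    qsub -l select=2:ncpus=1:group=color+2:ncpus=1:group=color job.sh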

Interaction with placement set spanning

Currently, if no placement set is large enough (when empty) to fit a job, the job will span across all nodes available to the job.  This can be controlled with the scheduler's do_not_span_psets attribute.  If do_not_span_psets is true and a job cannot fit within any placement set, the job will not span and will never run.

Spanning will not change.  The decision to span will be made at the job level.

If the job requests per-job grouping and not per-chunk grouping, spanning will happen like it does today.

If the job requests per-chunk grouping and any chunk complex cannot fit, the entire job will span.  This happens regardless of whether the other chunks can fit in their placement sets.
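
Assuming a version where do_not_span_psets is settable through qmgr, disabling spanning complex-wide might look like:

    qmgr -c "set sched do_not_span_psets = true"

With this set, a per-chunk job whose chunk complexes cannot all fit within their placement sets would never run.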

Clarifications

The nodes used to create placement sets:

  • If the job is in a queue with nodes associated with it (via the node's queue attribute), only the nodes associated with that queue are available (see the sketch after this list)
  • If the job is not in a queue with nodes associated with it, but other queues do have nodes associated with them, then all nodes that are not associated with any queue are available
  • If there are no nodes associated with any queue, then all nodes are available.
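
As a sketch of the first case, associating nodes with a queue uses the node's queue attribute (queue and node names are placeholders):

    qmgr -c "set node node1 queue = workq"
    qmgr -c "set node node2 queue = workq"

A job submitted to workq would then build its placement sets only from node1 and node2.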

Restrictions

  • Per-chunk grouping is incompatible with place=pack.  With place=pack, all chunks must be placed on one host.
  • Per-chunk grouping is incompatible with place=group.  You can either have per-job placement or per-chunk placement, but not both (see the sketch after this list).
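
Under this design, both of the following submissions would therefore be invalid (job.sh is a placeholder script):

    # invalid: per-chunk grouping combined with place=pack
    qsub -l select=2:ncpus=1:group=shape -l place=pack job.sh

    # invalid: per-chunk grouping combined with per-job place=group
    qsub -l select=2:ncpus=1:group=shape -l place=group=color job.sh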