PP-586: On a Cray X-series, create a vnode per compute node.


Forum discussion (EDD review).

Overview:

In order to support Cray features like requesting particular NUMA nodes, or some number of NUMA nodes, PBS had previously
created one vnode for each NUMA node that is reported by the ALPS inventory output.

Sites have requested the ability to change the way PBS reports the Cray nodes, so that they have fewer vnodes to deal with and do not
encounter some of the differences that multiple vnodes per host bring.

The behavior of PBS will be changed to create, by default, a vnode per compute node as reported in the ALPS inventory.

Please note, this feature is available only when PBS is built with configure --enable-alps.
The same is true of most Cray-specific feature support.
 

New Interface: vnode_per_numa_node

  • Visibility: PBS Private
  • Change Control: Experimental
  • Details:
    • New interface: new mom_priv/config variable.
    • Creates PBS vnodes based on the information provided by ALPS.
      • If there is no information to use, then this feature has no effect.
    • vnode_per_numa_node is a Boolean. 
      Setting it to FALSE will cause PBS to create one vnode per Cray compute node reported via ALPS.  This is the default behavior
      when $vnode_per_numa_node is not set in mom_priv/config.
      Setting it to TRUE will cause PBS to create one vnode per Cray NUMA node reported via ALPS.
    • The value of vnode_per_numa_node should be the same on all of the PBS MoMs for that Cray host.
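
As a sketch, an entry in mom_priv/config that restores the one-vnode-per-NUMA-node layout might look like the line below. This assumes the usual "$name value" directive form of mom_priv/config, and the exact spelling accepted for the Boolean is also an assumption. Leaving the line out gives the default of one vnode per compute node.

```
$vnode_per_numa_node TRUE
```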

Interface: PBScrayseg

  • resources_available.PBScrayseg will be unset when vnode_per_numa_node is unset or set to FALSE.
    • When PBS creates a vnode per NUMA node, PBScrayseg specifies the particular segment of the compute node that the PBS vnode represents.
      However, when all of the NUMA nodes are represented by one vnode (i.e. vnode_per_numa_node is unset or set to FALSE), PBScrayseg has no meaning.
  • resources_available.PBScrayseg will be set to the segment ordinal of the associated NUMA node when vnode_per_numa_node is set to TRUE.

Change to interface: the vnode name

  • Visibility: Public
  • Change Control: Stable 
  • The vnode name was previously created by concatenating <mpp_host>_<node_id>_<NUMA node ordinal>.  With the new default behavior, the vnode name will no longer contain the _<NUMA node ordinal> suffix.  Thus the vnode name will be made up of <mpp_host>_<node_id>.
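
The two name forms described above can be illustrated with a small shell helper. This is only a sketch: vnode_name is a hypothetical function written for this example, not part of PBS, and the host name and IDs are made-up values.

```shell
# Hypothetical helper: build a vnode name from the ALPS-reported fields.
# Usage: vnode_name <mpp_host> <node_id> [<NUMA node ordinal>]
vnode_name() {
    mpp_host=$1
    node_id=$2
    numa_ordinal=${3-}    # empty when one vnode represents the whole compute node
    if [ -n "$numa_ordinal" ]; then
        # Previous form: one vnode per NUMA node
        printf '%s_%s_%s\n' "$mpp_host" "$node_id" "$numa_ordinal"
    else
        # New default form: one vnode per compute node
        printf '%s_%s\n' "$mpp_host" "$node_id"
    fi
}

vnode_name examplehost 40      # prints examplehost_40
vnode_name examplehost 40 1    # prints examplehost_40_1
```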


Administrator's instructions.

  • When switching between a version of PBS that creates vnodes per NUMA node and one that creates vnodes per compute node, you must:
    • Quiesce the system (ensure there are no running jobs).
    • Remove all existing vnodes of the vntype cray_compute. The easiest way to do this is to delete all vnodes using:
      • qmgr -c "delete node @default"
    • Stop the version of the MoM that creates vnodes per NUMA node.
    • Start the new version of the MoM that will create vnodes per compute node.
    • All MoMs on a Cray X* system must be of the same PBS version.
    • On the server add the MoMs back using qmgr: qmgr -c "create node <output of hostname for the login node you are trying to add>"
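
The server-side part of the steps above could be scripted roughly as follows. This is a sketch under stated assumptions: the run wrapper, the DRY_RUN switch and the LOGIN_NODE placeholder are all hypothetical names introduced for this example. Quiescing the system and stopping and starting the MoMs are site-specific, so they appear only as comments. By default the script only prints what it would do.

```shell
#!/bin/sh
# LOGIN_NODE is a hypothetical placeholder; substitute the output of
# `hostname` on the login node being added back.
LOGIN_NODE=${LOGIN_NODE:-login1}

# run: in dry-run mode (the default) print the command, otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Step 1 (manual): quiesce the system so that no jobs are running.

# Step 2: remove all existing vnodes.
run qmgr -c "delete node @default"

# Steps 3 and 4 (manual): stop the old MoM, then start the new MoM,
# using whatever service mechanism the site normally uses.

# Step 5: add the MoM back on the server.
run qmgr -c "create node $LOGIN_NODE"
```

Setting DRY_RUN=0 makes the script execute the qmgr commands instead of printing them. Note that the dry-run output drops the quoting around the qmgr -c argument.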


Definition of technical terms, spelling out acronyms and abbreviations.

| Technical term | Description or definition |
|---|---|
| NUMA nodes | NUMA stands for Non-Uniform Memory Access. NUMA nodes are the individual segments that make up a Cray compute node. There can be 1, 2 or 4 of them per Cray compute node, depending on the hardware. |