Architecture Design

Synopsis

As part of the BASIL 1.7 project, PBS will issue a System Query to retrieve KNL Node information. One vnode will be created per KNL node from this information.

...

PBS currently supports BASIL 1.4. Altair has not implemented the BASIL 1.5 and 1.6 changes in PBS. This project aims to support the BASIL 1.7 System Query for KNL nodes only.

...

In the current system, for non-KNL nodes returned as part of the Inventory (BASIL 1.4) Query, we create one vnode per Segment (vnode_per_numa_node=True). The PBScrayseg attribute of the created vnode reflects the segment ordinal, e.g. for ordinal=0, PBScrayseg=0; for ordinal=1, PBScrayseg=1; and so on.

...

Additional attributes such as numa_cfg, hbm_cache_pct and hbm_size_mb will also be considered when creating KNL vnodes.

 

Current behavior

PBS makes an INVENTORY Query request (using BASIL 1.4).

The Query response (from ALPS) is an XML representation of Compute Nodes.

Flow of control

New behavior

PBS will make a SYSTEM Query request (using BASIL 1.7) to collect information on KNL Nodes.
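
As an illustration of what such a request could look like, the sketch below assembles a SYSTEM Query string. The BasilRequest element layout, the BASIL_VERSION_1_7 macro and the helper name build_system_query() are assumptions made for this example; they are not taken from basil.h or from the PBS sources.

```c
#include <stdio.h>

/* Assumed macro for illustration; the real definition would live in basil.h. */
#define BASIL_VERSION_1_7 "1.7"

/*
 * Write a BASIL SYSTEM Query request into buf (hypothetical helper).
 * Returns 0 on success, -1 if buf is too small or formatting fails.
 */
int
build_system_query(char *buf, size_t buflen)
{
    int n = snprintf(buf, buflen,
        "<?xml version=\"1.0\"?>\n"
        "<BasilRequest protocol=\"%s\" method=\"QUERY\" type=\"SYSTEM\"/>",
        BASIL_VERSION_1_7);

    /* snprintf returns the length it wanted to write; detect truncation. */
    return (n < 0 || (size_t)n >= buflen) ? -1 : 0;
}
```

A caller would pass a buffer large enough for the whole request and treat -1 as a reason not to send anything to ALPS.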

...

    • It also has definitions of structures that we populate during XML Parsing.
      • The Structure used to store information per <Nodes> element (parsed from the SYSTEM Query XML Response) is: basil_system_element_t.
    • This file has information pertaining to BASIL 1.5, 1.6 & 1.7.
    • This Project is only concerned with one BASIL 1.7 feature, the System Query. Macro/Structure definitions will therefore be selectively taken from this file & incorporated into basil.h (in the Altair Cray Release Branch in Perforce). A few new definitions will be added as needed.
    • It was decided not to use the Cray-supplied header file as-is, since Cray has made changes to it that would break existing Inventory (BASIL 1.4) functionality if the file were used in its entirety.
    • Moreover, this file also has BASIL 1.5 & 1.6 definitions that are not currently supported in PBS Cray code; hence importing those additions into the existing basil.h could lead to confusion.
    • Comments documenting some of the above information will be included in the latest basil.h header file to be used in this project.
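
To make the role of basil_system_element_t more concrete, here is a rough sketch of a per-<Nodes> record and of mapping the "role" attribute onto it. The member names follow the attribute mapping table in this document, but the member types, the sketch_ prefixed names and the helper role_from_string() are assumptions for illustration; the actual definitions in basil.h may differ.

```c
#include <string.h>

/* Hypothetical enum; the real basil.h may use different names and values. */
typedef enum {
    SKETCH_ROLE_UNKNOWN = 0,
    SKETCH_ROLE_BATCH,
    SKETCH_ROLE_INTERACTIVE
} sketch_role_t;

/*
 * Hypothetical record holding one <Nodes> element of the SYSTEM Query
 * response. All member types are assumed for this sketch.
 */
typedef struct sketch_system_element {
    sketch_role_t role;
    const char *state;
    const char *speed;
    const char *numa_nodes;
    const char *n_dies;
    const char *compute_units;
    const char *cpus_per_cu;
    const char *pgszl2;
    const char *avlmem;
    const char *accel_name;
    const char *accel_state;
    const char *numa_cfg;
    const char *hbmsize;
    const char *hbm_cfg;
    const char *nidlist;
} sketch_system_element_t;

/*
 * Map the XML "role" attribute value to the enum.
 * Anything other than "batch" or "interactive" (including NULL)
 * becomes SKETCH_ROLE_UNKNOWN.
 */
sketch_role_t
role_from_string(const char *value)
{
    if (value == NULL)
        return SKETCH_ROLE_UNKNOWN;
    if (strcmp(value, "batch") == 0)
        return SKETCH_ROLE_BATCH;
    if (strcmp(value, "interactive") == 0)
        return SKETCH_ROLE_INTERACTIVE;
    return SKETCH_ROLE_UNKNOWN;
}
```

Keeping an explicit UNKNOWN value lets the parser record an unexpected attribute value without aborting, leaving the decision about the node to later processing.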

The following Table shows how the System Query attributes (in the XML Response) map into the basil.h structure (basil_system_element_t) that gets populated with this parsed XML information.


| XML attribute name | Corresponding Structure element name (in basil.h) | Expected Values | Comments |
|---|---|---|---|
| role | role | batch, interactive | This attribute is used for KNL node determination. The structure element "role" will be set to "UNKNOWN" when unexpected attribute values are encountered in the XML response. |
| state | state | up, down, unavailable, routing, suspect, admin | This attribute is used for KNL node determination. The structure element "state" will be set to "UNKNOWN" when unexpected attribute values are encountered in the XML response. |
| speed | speed | Value cannot be an empty string, cannot be negative, cannot be "0". | |
| numa_nodes | numa_nodes | Value cannot be an empty string, cannot be negative, cannot be "0". | This attribute is ignored during KNL vnode creation. |
| dies | n_dies | Value cannot be an empty string, cannot be negative, can be "0". | This attribute is ignored during KNL vnode creation. |
| compute_units | compute_units | Value cannot be an empty string, cannot be negative, can be "0". | This attribute will be displayed in 'resources_available.nppus'. |
| cpus_per_cu | cpus_per_cu | Value cannot be an empty string, cannot be negative, cannot be "0". | This will be displayed in 'resources_available.vps_per_ppu' (the product of compute_units & cpus_per_cu will be displayed in 'resources_available.ncpus'). |
| page_size_kb | avlmem | Value of attribute page_size_kb cannot be an empty string, cannot be negative, cannot be "0". avlmem holds the product of page_size_kb & page_count. | This represents conventional DRAM memory (will be displayed as 'resources_available.mem'). |
| page_size_kb (continued) | pgszl2 | pgszl2 holds X, where 2^X is page_size_kb in Bytes. | |
| page_count | Refer to the avlmem note above (under "Expected Values") | Value cannot be an empty string, cannot be negative, can be "0". | |
| accels | accel_name | Not every Node group in the System 1.7 XML response may have this attribute. When it is present, the attribute value cannot be an empty string. | If this attribute is present in the XML response, we capture the attribute value during XML parsing. However, this attribute is ignored during subsequent KNL vnode creation, i.e. KNL vnodes will be created without this attribute. KNL nodes cannot have GPUs. |
| accel_state | accel_state | Not every Node group in the System 1.7 XML response may have this attribute. When it is present, the attribute value should be "up" or "down". | If this attribute is present in the XML response, we capture the attribute value during XML parsing and set the structure element "accel_state" to "UNKNOWN" when unexpected values are encountered. However, this attribute is ignored during subsequent KNL vnode creation, i.e. KNL vnodes will be created without this attribute. |
| numa_cfg | numa_cfg | a2a, snc2, snc4, hemi, quad. This attribute will always have a value (non-empty string) for KNL Nodes. The value will be an empty string for non-KNL Nodes. | |
| hbm_size_mb | hbmsize | Value of hbm_size_mb cannot be negative. This attribute will always have a value (non-empty string) for KNL Nodes. This will be an empty string for non-KNL Nodes. | This represents High Bandwidth MCDRAM memory (in MB) (will be displayed as 'resources_available.hbmem'). |
| hbm_cache_pct | hbm_cfg | Value of hbm_cache_pct will be 0, 25, 50, 100. This attribute will always have a value (non-empty string) for KNL Nodes. This will be an empty string for non-KNL Nodes. | |
| None | nidlist | The Rangelist of Node IDs. | The XML response does not have a specific attribute name corresponding to the "nidlist" structure element. During XML parsing, the Rangelist of Node IDs (in the incoming XML) is assigned to the "nidlist" structure element. This is repeated for every Node group in the XML response. |
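
The value rules and derived quantities in the table can be sketched as follows. This is a minimal illustration under stated assumptions, not the PBS implementation: the names parse_count(), log2_exact(), derive_resources() and sketch_resources_t are hypothetical, and the memory product is assumed to be in KB because page_size_kb is.

```c
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>

/*
 * Parse a decimal attribute value (hypothetical helper).
 * Rejects NULL, empty strings, a leading sign or whitespace, trailing
 * characters and out-of-range values; also rejects zero unless
 * allow_zero is non-zero. Returns 0 and stores the value on success.
 */
int
parse_count(const char *s, int allow_zero, long *out)
{
    char *end = NULL;
    long v;

    if (s == NULL || !isdigit((unsigned char)*s))
        return -1;
    errno = 0;
    v = strtol(s, &end, 10);
    if (errno != 0 || *end != '\0')
        return -1;
    if (v == 0 && !allow_zero)
        return -1;
    *out = v;
    return 0;
}

/* Return X such that 2^X == bytes, or -1 if bytes is not a power of two. */
int
log2_exact(long long bytes)
{
    int x = 0;

    if (bytes <= 0 || (bytes & (bytes - 1)) != 0)
        return -1;
    while (bytes > 1) {
        bytes >>= 1;
        x++;
    }
    return x;
}

/* Hypothetical container for the values shown in resources_available. */
typedef struct {
    long nppus;        /* compute_units */
    long vps_per_ppu;  /* cpus_per_cu */
    long long ncpus;   /* compute_units * cpus_per_cu */
    long long mem_kb;  /* page_size_kb * page_count (assumed KB) */
    int pgszl2;        /* X where 2^X is the page size in bytes */
} sketch_resources_t;

/* Apply the table's zero/non-zero rules, then compute the derived values. */
int
derive_resources(const char *compute_units, const char *cpus_per_cu,
    const char *page_size_kb, const char *page_count, sketch_resources_t *r)
{
    long cu, per_cu, psz, pcnt;

    if (parse_count(compute_units, 1, &cu) != 0 ||
        parse_count(cpus_per_cu, 0, &per_cu) != 0 ||
        parse_count(page_size_kb, 0, &psz) != 0 ||
        parse_count(page_count, 1, &pcnt) != 0)
        return -1;
    r->nppus = cu;
    r->vps_per_ppu = per_cu;
    r->ncpus = (long long)cu * per_cu;
    r->mem_kb = (long long)psz * pcnt;
    r->pgszl2 = log2_exact((long long)psz * 1024);
    return 0;
}
```

For example, compute_units=68 with cpus_per_cu=4 yields ncpus=272, and page_size_kb=4 yields pgszl2=12, since 4 KB is 2^12 bytes.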



Handling unexpected attribute values.

...

The following new functions will be added as part of Task PBS-14859:

    • alps_system_KNL(), new_alps_req_KNL(), system_start(), node_group_start(), parse_nidlist_char_data().

The following existing functions will be modified as part of Task PBS-14859:

    • response_start(), response_data_start(), allow_char_data(), free_basil_response_data().

The following new functions will be added as part of Task PBS-14861:

    • alps_engine_query_KNL(), exclude_from_KNL_processing(), system_to_vnodes_KNL(), create_vnodes_KNL(), process_nodelist_KNL(), store_nids(), free_basil_elements_KNL().

The following existing functions will be modified as part of Task PBS-14861:

    • alps_system_KNL(), parse_nidlist_char_data(), response_data_start().
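
Several of the functions listed above (for example parse_nidlist_char_data(), process_nodelist_KNL() and store_nids()) work with the Rangelist of Node IDs. The sketch below shows one way such a rangelist could be expanded into individual IDs. It assumes a comma-separated list of single IDs and dash ranges such as "12-15,20"; that format and the helper name expand_rangelist() are assumptions for illustration and do not describe how the listed functions are implemented.

```c
#include <ctype.h>
#include <stdlib.h>

/*
 * Expand a rangelist such as "12-15,20" into out[] (hypothetical helper).
 * Returns the number of IDs written, or -1 if the input is malformed,
 * a range is descending, or out[] (capacity max) is too small.
 */
int
expand_rangelist(const char *s, long *out, int max)
{
    int n = 0;

    if (s == NULL || *s == '\0')
        return -1;
    while (*s != '\0') {
        char *end;
        long lo, hi, id;

        /* Every item must start with a digit. */
        if (!isdigit((unsigned char)*s))
            return -1;
        lo = hi = strtol(s, &end, 10);
        s = end;

        /* Optional "-<hi>" part turns the item into a range. */
        if (*s == '-') {
            s++;
            if (!isdigit((unsigned char)*s))
                return -1;
            hi = strtol(s, &end, 10);
            s = end;
        }
        if (hi < lo)
            return -1;
        for (id = lo; id <= hi; id++) {
            if (n >= max)
                return -1;
            out[n++] = id;
        }

        /* Items are separated by commas; a trailing comma is an error. */
        if (*s == ',') {
            s++;
            if (*s == '\0')
                return -1;
        } else if (*s != '\0') {
            return -1;
        }
    }
    return n;
}
```

Returning -1 when the output array is too small keeps the caller from silently dropping node IDs.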

...