KNL Analysis

Brief overview of KNL

Intel® Xeon Phi™ processors, codenamed “Knights Landing,” are based on Intel’s
Many Integrated Core (MIC) architecture and offer an alternative performance/power configuration
to Intel Xeon processor products.

The “Knights Landing” processor family comprises self-booting host processors with up to
64–68 cores (each core supports 4 threads with hyperthreading) that can be embedded in Cray XC compute blade configurations. This
many-core device supports wider vector units and more threads per core to deliver in excess of 3 TF per device. One of the keys to
scaling parallelism is localizing and optimizing data movement.

Each of the new XC series compute nodes implements an Intel Xeon Phi processor socket with a pipeline to local DDR memory, as well
as PCIe Gen 3 x16 access to the high-performance Cray-developed Aries interconnect. This processor family implements
new onboard high-bandwidth DRAM memory (HBM, referred to below as MCDRAM, in configurations up to 16 GB), which is tightly coupled
with the host compute die. Users can see a significant performance boost in their HPC applications by identifying
bandwidth-sensitive data and placing it in this high-bandwidth memory.

The high-performance memory can be configured on the flexible XC series compute nodes at boot time (job launch) to be used as local
cache or as directly addressable fast memory. The device is binary compatible with Xeon processors, so whether programmers write all
their own code or users run pre-existing ISV applications, it is easy to support different use modes.

The NUMA modes for KNL processors are as follows:

Mode	Name	Description
a2a	All-to-All	Addresses are uniformly hashed across distributed directories
quad	Quadrant	Addresses are hashed to a directory in the same quadrant as the memory
hemi	Hemisphere	Addresses are hashed to a directory in the same hemisphere as the memory
snc4	(4) Sub-NUMA	Tiles are divided into four sub-NUMA clusters; each cluster is one NUMA node
snc2	(2) Sub-NUMA	Tiles are divided into two sub-NUMA clusters; each cluster is one NUMA node

Following are the MCDRAM modes for KNL processors:

Mode	Name	Description
cache	Cache	MCDRAM is used as a cache between the processor and DDR4 memory
flat	Flat	MCDRAM is physically addressable, in a separate NUMA node
equal	Hybrid Equal	50% of MCDRAM is Flat, and 50% of MCDRAM is Cache
split	Hybrid Split	75% of MCDRAM is Flat, and 25% of MCDRAM is Cache

The KNL processor and MCDRAM can be configured in twenty different combinations of NUMA and MCDRAM
modes (five NUMA modes times four MCDRAM modes). For example, the quadrant NUMA mode can be combined with
the cache MCDRAM mode, and this combination is known as quad/cache. For more detail on these modes, please check here.
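
The twenty combinations can be enumerated directly from the two tables above. The following is a minimal sketch, not part of PBS or ALPS: the names NUMA_MODES, MCDRAM_CACHE_PCT and mode_combinations are hypothetical, and the cache percentages for the cache and flat modes (100 and 0) are inferred from the table descriptions rather than stated there.

```python
from itertools import product

# NUMA mode abbreviations, copied from the NUMA mode table.
NUMA_MODES = ["a2a", "quad", "hemi", "snc4", "snc2"]

# MCDRAM mode -> percentage of MCDRAM used as cache.
# "equal" and "split" come from the table; 100 for "cache" and 0 for "flat"
# are assumptions inferred from the mode descriptions.
MCDRAM_CACHE_PCT = {"cache": 100, "flat": 0, "equal": 50, "split": 25}


def mode_combinations():
    """Return every NUMA/MCDRAM pairing in the "numa/mcdram" notation."""
    return [f"{numa}/{mcdram}"
            for numa, mcdram in product(NUMA_MODES, MCDRAM_CACHE_PCT)]


combos = mode_combinations()
print(len(combos))
print("quad/cache" in combos)
```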

Requirements & Analysis

Requirements:

Create a vnode per KNL node.

Analysis:

Currently the smallest theoretical node entry in the BASIL QUERY (INVENTORY) response requires 21 lines of XML,
and actual node entries are much larger. Thus a full inventory for a system of thousands of nodes is
extremely large, and much of its information is redundant.

From BASIL 1.5 onwards, a new BASIL query type (SYSTEM) returns simpler, smaller portions of the Cray system's inventory
information. As far as possible, the node subelements of the response XML are converted into attributes
containing unique values or counts. This allows nodes that share the same values to be grouped, with multiple nodes per record. The
node lists are specified in rangelist format.
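
As an illustration of the rangelist format, the sketch below expands a rangelist into individual node IDs. It assumes a rangelist is a comma-separated list of single IDs and hyphenated inclusive ranges (only the "40-47" form appears in the sample on this page); expand_rangelist is a hypothetical helper, not a BASIL or PBS function.

```python
def expand_rangelist(rangelist):
    """Expand a rangelist such as "40-47" or "1-3,7" into a list of node IDs.

    Assumption: items are separated by commas and a hyphen denotes an
    inclusive range of integer node IDs.
    """
    node_ids = []
    for item in rangelist.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate stray commas or surrounding whitespace
        if "-" in item:
            first, last = (int(part) for part in item.split("-", 1))
            if last < first:
                raise ValueError(f"descending range: {item!r}")
            node_ids.extend(range(first, last + 1))
        else:
            node_ids.append(int(item))
    return node_ids


print(expand_rangelist("40-47"))
```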

BASIL 1.7 was introduced to support KNL nodes. Cray considered changing the INVENTORY and SUMMARY queries, but decided
that the changes to the SYSTEM query fully describe the current configuration of KNL nodes.

We considered using the new SYSTEM query to generate both KNL and non-KNL vnodes. However, certain elements and attributes
are exclusive to one query type or the other, so using the SYSTEM query alone is not viable.

The differences are as follows:

Element/Attribute	INVENTORY query	SYSTEM query
Architecture	Available (the value is usually XT, but other architectures appear to be supported)	NA
Segment	Available	NA (but can be computed)
PBScrayorder	Available (easily computed)	NA
LabelArray	Available	NA
numa_cfg	NA	Available
hbm_size_mb	NA	Available
hbm_cache_pct	NA	Available

NA: Not Available

Solution:

PBS will make two BASIL queries: a SYSTEM query to get information about KNL nodes, and an INVENTORY query for non-KNL nodes.
It will then combine the results to create the vnodes.

The existing implementation already handles non-KNL node information through the BASIL INVENTORY query. An implementation for
handling the SYSTEM query response is required.
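
The combining step could look roughly like the sketch below. It is an illustration only: the per-node records are plain dictionaries keyed by a hypothetical node_id field, the source field is added purely for illustration, and the rule that a SYSTEM record replaces an INVENTORY record for the same node is an assumption, not a statement of how PBS resolves overlaps.

```python
def combine_vnodes(inventory_nodes, system_knl_nodes):
    """Merge per-node records from both queries into one vnode list.

    inventory_nodes:  records built from the INVENTORY response (non-KNL).
    system_knl_nodes: records built from the SYSTEM response (KNL).
    Assumption: if a node ID appears in both, the SYSTEM record wins.
    """
    vnodes = {n["node_id"]: dict(n, source="INVENTORY") for n in inventory_nodes}
    for n in system_knl_nodes:
        vnodes[n["node_id"]] = dict(n, source="SYSTEM")
    # One vnode per node, ordered by node ID.
    return [vnodes[node_id] for node_id in sorted(vnodes)]


merged = combine_vnodes(
    [{"node_id": 28, "architecture": "XT"}, {"node_id": 40, "architecture": "XT"}],
    [{"node_id": 40, "numa_cfg": "quad", "hbm_cache_pct": 100}],
)
print([(v["node_id"], v["source"]) for v in merged])
```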

Sample ALPS SYSTEM query response:

<Nodes role="interactive" state="up" speed="1200" numa_nodes="1" dies="1" compute_units="68" cpus_per_cu="4" page_size_kb="4" page_count="25165874"
numa_cfg="quad" hbm_size_mb="16584" hbm_cache_pct="100">
40-47
</Nodes>
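
A minimal sketch of turning one such Nodes record into one vnode record per KNL node, using Python's standard xml.etree.ElementTree. Several things here are assumptions rather than PBS behaviour: the attribute name hbm_size_mb is taken from the comparison table above, a record is treated as KNL when it carries a non-empty numa_cfg, and the function names and output keys (ncpus, mem_kb) are hypothetical.

```python
import xml.etree.ElementTree as ET

SAMPLE = """\
<Nodes role="interactive" state="up" speed="1200" numa_nodes="1" dies="1"
       compute_units="68" cpus_per_cu="4" page_size_kb="4" page_count="25165874"
       numa_cfg="quad" hbm_size_mb="16584" hbm_cache_pct="100">
40-47
</Nodes>
"""


def expand_range(text):
    """Expand comma-separated IDs and "first-last" ranges (assumed format)."""
    ids = []
    for item in text.strip().split(","):
        first, _, last = item.strip().partition("-")
        ids.extend(range(int(first), int(last or first) + 1))
    return ids


def knl_vnodes(xml_text):
    """Return one dict per node in a SYSTEM <Nodes> record, or [] if not KNL."""
    group = ET.fromstring(xml_text)
    attrs = group.attrib
    # Assumption: only KNL groups carry a non-empty numa_cfg attribute.
    if not attrs.get("numa_cfg"):
        return []
    shared = {
        "ncpus": int(attrs["compute_units"]) * int(attrs["cpus_per_cu"]),
        "mem_kb": int(attrs["page_size_kb"]) * int(attrs["page_count"]),
        "numa_cfg": attrs["numa_cfg"],
        "hbm_size_mb": int(attrs["hbm_size_mb"]),
        "hbm_cache_pct": int(attrs["hbm_cache_pct"]),
    }
    # Every node in the rangelist shares the attributes of its group.
    return [dict(shared, node_id=node_id) for node_id in expand_range(group.text)]


vnodes = knl_vnodes(SAMPLE)
print(len(vnodes), vnodes[0]["node_id"], vnodes[0]["ncpus"])
```

Because the attributes are stored once per group, the eight nodes 40 through 47 all receive identical values; only node_id differs between the resulting records.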

Sample ALPS INVENTORY query response for a single node:

<Node node_id="28" name="xxxx" architecture="XT" role="BATCH" state="UP">
     <SocketArray>
      <Socket ordinal="0" architecture="x86_64" clock_mhz="2100">
       <SegmentArray>
        <Segment ordinal="0">
         <ComputeUnitArray>
          <ComputeUnit ordinal="0">
           <ProcessorArray>
            <Processor ordinal="0"/>
            <Processor ordinal="1"/>
           </ProcessorArray>
          </ComputeUnit>
          <ComputeUnit ordinal="1">
           <ProcessorArray>
            <Processor ordinal="0"/>
            <Processor ordinal="1"/>
           </ProcessorArray>
          </ComputeUnit>
          <ComputeUnit ordinal="2">
           <ProcessorArray>
            <Processor ordinal="0"/>
            <Processor ordinal="1"/>
           </ProcessorArray>
          </ComputeUnit>
          <ComputeUnit ordinal="3">
           <ProcessorArray>
            <Processor ordinal="0"/>
            <Processor ordinal="1"/>
           </ProcessorArray>
          </ComputeUnit>
         </ComputeUnitArray>
         <MemoryArray>
          <Memory type="OS" page_size_kb="4" page_count="4194304"/>
         </MemoryArray>
         <LabelArray/>
        </Segment>
        <Segment ordinal="1">
         <ComputeUnitArray>
          <ComputeUnit ordinal="0">
           <ProcessorArray>
            <Processor ordinal="0"/>
            <Processor ordinal="1"/>
           </ProcessorArray>
          </ComputeUnit>
          <ComputeUnit ordinal="1">
           <ProcessorArray>
            <Processor ordinal="0"/>
            <Processor ordinal="1"/>
           </ProcessorArray>
          </ComputeUnit>
          <ComputeUnit ordinal="2">
           <ProcessorArray>
            <Processor ordinal="0"/>
            <Processor ordinal="1"/>
           </ProcessorArray>
          </ComputeUnit>
          <ComputeUnit ordinal="3">
           <ProcessorArray>
            <Processor ordinal="0"/>
            <Processor ordinal="1"/>
           </ProcessorArray>
          </ComputeUnit>
         </ComputeUnitArray>
         <MemoryArray>
          <Memory type="OS" page_size_kb="4" page_count="4194304"/>
         </MemoryArray>
         <LabelArray/>
        </Segment>
       </SegmentArray>
      </Socket>
     </SocketArray>
     <AcceleratorArray>
      <Accelerator ordinal="0" type="GPU" state="UP" family="Tesla_K20X" memory_mb="6144" clock_mhz="732"/>
     </AcceleratorArray>
</Node>
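
For comparison with the SYSTEM record, the sketch below walks an INVENTORY Node entry and derives the counts that the SYSTEM query reports directly as attributes. The embedded XML is a trimmed copy of the sample above (one segment, two compute units, no accelerator) to keep the example short; summarize_node and its output keys are hypothetical names, not part of PBS.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the INVENTORY sample: one segment with two compute units.
SAMPLE = """\
<Node node_id="28" name="xxxx" architecture="XT" role="BATCH" state="UP">
 <SocketArray>
  <Socket ordinal="0" architecture="x86_64" clock_mhz="2100">
   <SegmentArray>
    <Segment ordinal="0">
     <ComputeUnitArray>
      <ComputeUnit ordinal="0">
       <ProcessorArray>
        <Processor ordinal="0"/>
        <Processor ordinal="1"/>
       </ProcessorArray>
      </ComputeUnit>
      <ComputeUnit ordinal="1">
       <ProcessorArray>
        <Processor ordinal="0"/>
        <Processor ordinal="1"/>
       </ProcessorArray>
      </ComputeUnit>
     </ComputeUnitArray>
     <MemoryArray>
      <Memory type="OS" page_size_kb="4" page_count="4194304"/>
     </MemoryArray>
     <LabelArray/>
    </Segment>
   </SegmentArray>
  </Socket>
 </SocketArray>
</Node>
"""


def summarize_node(xml_text):
    """Count segments, compute units and processors, and total the memory."""
    node = ET.fromstring(xml_text)
    mem_kb = sum(int(m.get("page_size_kb")) * int(m.get("page_count"))
                 for m in node.iter("Memory"))
    return {
        "node_id": int(node.get("node_id")),
        "architecture": node.get("architecture"),
        "segments": len(list(node.iter("Segment"))),
        "compute_units": len(list(node.iter("ComputeUnit"))),
        "processors": len(list(node.iter("Processor"))),
        "mem_kb": mem_kb,
    }


summary = summarize_node(SAMPLE)
print(summary)
```

Applied to the full sample above, the same traversal would count two segments, eight compute units and sixteen processors, which shows how much of the INVENTORY XML is spent on structure that the SYSTEM query collapses into a few attributes.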
