PP-277: Multinode jobs may fail to start

Objective:

Address a corner case where a multihomed MoM is unable to join a multinode job because it fails to identify itself in the job's node list.

The following document was authored by Altair field support and provides some background:

Interface 1: PBS_MOM_NODE_NAME configuration variable

  • Visibility: Public
  • Change Control: Stable
  • Synopsis: Adding PBS_MOM_NODE_NAME configuration variable
  • Details:
    PBS_MOM_NODE_NAME is a new variable that may be defined in the pbs.conf configuration file. It is used to ensure that when the MoM starts up it uses a name for the natural vnode that is consistent with the name used when creating the node on the server.

    The value is used when MoM builds a list of local vnodes at startup. The list consists of either the natural vnode alone, or a list of local vnodes (either configured with a v2 configuration file or with an exechost_startup or exechost_periodic hook). MoM cannot check what the value is on the server because the server may not be running at the time MoM is started. By default, MoM assumes that the name of the natural vnode is the (non-canonicalized) hostname returned by gethostname().

    If administrators want to create nodes using an alias (or a name bound to another IP address on the host) rather than the default hostname, PBS_MOM_NODE_NAME lets them override the default hostname. When PBS_MOM_NODE_NAME is defined, MoM will perform a sanity check to ensure the value is a resolvable host.
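
The selection rule described above can be sketched in Python. This is only an illustration, not MoM's actual C implementation: the helper names (parse_pbs_conf, natural_vnode_name, is_resolvable), the sample host names (headnode, node01, node01-ib), and the use of socket.gethostbyname as the resolvability check are all assumptions made for the example.

```python
import socket


def parse_pbs_conf(text):
    """Parse NAME=value lines, skipping blank lines and # comments."""
    conf = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        conf[key.strip()] = value.strip()
    return conf


def natural_vnode_name(conf, gethostname=socket.gethostname):
    """Use PBS_MOM_NODE_NAME if set, else the non-canonicalized hostname."""
    return conf.get("PBS_MOM_NODE_NAME") or gethostname()


def is_resolvable(name, resolve=socket.gethostbyname):
    """Sanity check (assumed form): True if the name resolves to an address."""
    try:
        resolve(name)
        return True
    except OSError:  # socket.gaierror and socket.herror are OSError subclasses
        return False


# Hypothetical pbs.conf contents for a host whose default hostname is
# "node01" but which was created on the server as "node01-ib".
sample = """\
PBS_SERVER=headnode
PBS_MOM_NODE_NAME=node01-ib
"""
conf = parse_pbs_conf(sample)
print(natural_vnode_name(conf, gethostname=lambda: "node01"))
```

The hostname lookup and the resolver are passed in as parameters so the sketch can be exercised without depending on the local machine's DNS configuration.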

Interface 2: Log messages when MoM fails to identify its hostname

  • Visibility: Public
  • Change Control: Stable
  • Synopsis: Messages that could appear in the MoM log when it fails to identify and validate its hostname
  • Details:
    If PBS_MOM_NODE_NAME is unset and the call to gethostname() fails, or if PBS_MOM_NODE_NAME is set and the value does not conform to RFCs 952 and 1123, the following message will be printed to the log:
    Unable to obtain my host name

    Once the hostname is obtained, MoM will ensure the hostname resolves properly by calling get_fullhostname(). If the hostname fails to resolve, the following message will be printed to the log:
    Unable to resolve my host name
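
The two-stage check and its log messages can be sketched as follows. This is a hedged illustration rather than MoM's code: check_mom_hostname and is_valid_hostname are hypothetical names, the local-hostname lookup and the resolver are injectable callables standing in for the system call and for get_fullhostname(), and the syntax check is one common reading of the RFC 952 and RFC 1123 rules (labels of letters, digits, and hyphens, 1 to 63 characters, no leading or trailing hyphen).

```python
import re
import socket

OBTAIN_FAILED = "Unable to obtain my host name"
RESOLVE_FAILED = "Unable to resolve my host name"

# One label: letters, digits, hyphens; no leading or trailing hyphen; 1-63 chars.
_LABEL = re.compile(r"[A-Za-z0-9]([A-Za-z0-9-]{0,61}[A-Za-z0-9])?")


def is_valid_hostname(name):
    """Approximate RFC 952/1123 syntax check (assumed interpretation)."""
    if not name or len(name) > 255:
        return False
    if name.endswith("."):
        name = name[:-1]  # tolerate a single trailing dot
    return all(_LABEL.fullmatch(label) for label in name.split("."))


def check_mom_hostname(override=None,
                       lookup_hostname=socket.gethostname,
                       resolve=socket.gethostbyname):
    """Return (hostname, log_message); log_message is None on success."""
    # Stage 1: obtain the name, from the override or from the system.
    if override is not None:
        if not is_valid_hostname(override):
            return None, OBTAIN_FAILED
        name = override
    else:
        try:
            name = lookup_hostname()
        except OSError:
            return None, OBTAIN_FAILED
    # Stage 2: make sure the name resolves (stand-in for get_fullhostname()).
    try:
        resolve(name)
    except OSError:
        return name, RESOLVE_FAILED
    return name, None
```

For example, check_mom_hostname("bad_name!") fails in the first stage because of the underscore and exclamation mark, while a syntactically valid name that has no DNS or hosts entry fails in the second stage.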