Moms cannot communicate with one another in a cloud configuration when cloud nodes resolve each other's hostnames to IP addresses not known to the PBS server/comm

Description

When PBS is used in a configuration where cloud nodes resolve each other's names to one set of IP addresses, but the local PBS server/comm host resolves the same names to a different set of IP addresses (through a VPN), the moms cannot communicate with one another for multinode jobs. When the server runs a job, it sends exec_host2 to the primary execution host (a cloud node in this example) to tell it which nodes are in the job, and the primary execution host resolves those hostnames to the cloud addresses. When it then tries to send messages to these addresses through the pbs_comm, delivery fails because only the VPN addresses are known to the comm.
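
The mismatch can be illustrated with a toy model. This is only a sketch, not PBS code: the `Comm` class, the host names, and all addresses are hypothetical, and the real pbs_comm routing is more involved than a dictionary lookup.

```python
# Toy model of the failure; not PBS code. All names and addresses are hypothetical.

class Comm:
    """Stands in for pbs_comm: routes messages by registered leaf address."""

    def __init__(self):
        self.leaves = {}  # registered address -> mom name

    def register(self, mom_name, address):
        self.leaves[address] = mom_name

    def route(self, dest_address):
        # Return the mom that owns dest_address, or None if the address is unknown.
        return self.leaves.get(dest_address)


# The same hostnames resolve differently depending on where the lookup happens.
VPN_VIEW = {"cloud1": "10.8.0.11", "cloud2": "10.8.0.12"}        # seen by server/comm
CLOUD_VIEW = {"cloud1": "172.31.0.11", "cloud2": "172.31.0.12"}  # seen by cloud nodes

comm = Comm()
for name, addr in VPN_VIEW.items():
    comm.register(name, addr)  # each mom is known by its VPN address only

# The primary execution host (cloud1) resolves its sister from exec_host2 locally.
sister_addr = CLOUD_VIEW["cloud2"]
delivered_to = comm.route(sister_addr)
print(delivered_to)  # None: the comm has no leaf registered at the cloud address
```

In this model the lookup only succeeds for `VPN_VIEW["cloud2"]`, which is the address the cloud node never uses.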

Acceptance Criteria

Multinode jobs work properly when cloud nodes resolve each other's names to one set of IP addresses but the local PBS server/comm host resolves the same names to a different set of IP addresses (through a VPN).

Activity

Scott Campbell
October 17, 2017, 5:32 AM

One way to fix this would be to have the moms register all of their interfaces with the pbs_comm rather than just one. With this change alone, mom-to-mom traffic between cloud nodes would still pass through the VPN to the pbs_comm on the local cluster and back. Keeping cloud-to-cloud traffic off the VPN requires additional, standard PBS Professional configuration: a pbs_comm would need to be installed in the cloud environment, the local PBS server and pbs_comm would need to be configured to use the cloud pbs_comm as well as the local one, and the cloud moms would need to be configured to use the cloud pbs_comm (PBS_LEAF_ROUTERS and PBS_COMM_ROUTERS in the pbs.conf files). The pbs.conf files on the cloud nodes would differ from those on the local nodes, but their contents would not be unique to each cloud node.
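
The idea of registering every interface can be sketched with a toy model. This is an illustration under stated assumptions, not the actual change: the `Comm` class, the host names, and all addresses are hypothetical, and nothing here reflects how pbs_comm stores leaf addresses internally.

```python
# Toy model of the proposed fix; not PBS code. All names and addresses are hypothetical.

class Comm:
    """Stands in for pbs_comm: routes messages by registered leaf address."""

    def __init__(self):
        self.leaves = {}  # registered address -> mom name

    def register(self, mom_name, addresses):
        # Record every interface address the mom reports, not just one.
        for address in addresses:
            self.leaves[address] = mom_name

    def route(self, dest_address):
        # Return the mom that owns dest_address, or None if the address is unknown.
        return self.leaves.get(dest_address)


# Hypothetical interface lists: one VPN address and one cloud-internal address per mom.
MOM_INTERFACES = {
    "cloud1": ["10.8.0.11", "172.31.0.11"],
    "cloud2": ["10.8.0.12", "172.31.0.12"],
}

comm = Comm()
for name, addrs in MOM_INTERFACES.items():
    comm.register(name, addrs)

via_cloud = comm.route("172.31.0.12")  # address as resolved on a cloud node
via_vpn = comm.route("10.8.0.12")      # address as resolved on the server/comm host
print(via_cloud, via_vpn)  # cloud2 cloud2
```

Either view of the hostname now maps to the same mom, which is the property the acceptance criteria depend on.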

Brem Anand
December 5, 2017, 7:22 PM

I have raised a PR for this issue:
https://github.com/PBSPro/pbspro/pull/483
Testing logs are also attached.

Assignee

Brem Anand

Reporter

Scott Campbell

Severity

3-High

OS

None

Start Date

None

Pull Request URL

None

Story Points

1

Components

Fix versions

Affects versions

Priority

Critical