Add option to resume multi-node jobs

Overview

Currently, single-node jobs can be resumed when mom is restarted, but multi-node jobs are killed and have to start over. This is troublesome whenever a PBS cluster has to be restarted (server maintenance, PBS update, etc.). I therefore propose a change that makes it possible to resume multi-node jobs when PBS mom is started with the -p flag.

Proposed changes

  • When saving jobs to disk, also save the stdout and stderr ports, so that communication on these ports can be resumed when mother superior restarts
  • When mom is restarted and receives a multi-node job, one of two things happens:
    • mom is a sister
      • the sister sends a request to mother superior for the required information about the job (number of nodes, stdout/stderr ports, security credentials)
      • mother superior sends this information to all available sisters
      • sisters that already have this information discard it; sisters that need it use it
    • mom is mother superior (MS)
      • mom recovers the stdout/stderr ports from the job file
      • mom sends the number of nodes, stdout/stderr ports, and security credentials to all available sisters (this is needed in case the whole cluster is restarted at once and MS is the last node to boot)
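The two cases above can be illustrated with a small in-memory model. This is only a sketch of the proposed message flow, not PBS code: the names `JobInfo`, `Mom`, `recover`, `handle_request`, `broadcast` and `receive` are hypothetical, the job file is stood in for by a plain attribute, and network messages are modelled as direct method calls.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class JobInfo:
    """Data the proposal says sisters need (field names are hypothetical)."""
    num_nodes: int
    stdout_port: int
    stderr_port: int
    credentials: str


@dataclass
class Mom:
    name: str
    is_mother_superior: bool
    # Stand-in for the on-disk job file that would hold the saved ports.
    job_file: Optional[JobInfo] = None
    # In-memory copy of the job information; lost when mom restarts.
    job_info: Optional[JobInfo] = None
    # Sisters reachable from this mom (only used by mother superior).
    peers: List["Mom"] = field(default_factory=list)

    def restart(self) -> None:
        """Simulate a restart: in-memory state is dropped."""
        self.job_info = None

    def recover(self, mother_superior: Optional["Mom"] = None) -> None:
        """Run the recovery step for this mom after a restart."""
        if self.is_mother_superior:
            # MS case: reload the saved data, then push it to all sisters.
            self.job_info = self.job_file
            self.broadcast()
        else:
            # Sister case: ask MS, which answers by broadcasting.
            if mother_superior is None:
                raise ValueError("a sister needs a mother superior to ask")
            mother_superior.handle_request()

    def handle_request(self) -> None:
        """MS side of a sister's request; a no-op while MS is still down."""
        if self.job_info is not None:
            self.broadcast()

    def broadcast(self) -> None:
        for sister in self.peers:
            sister.receive(self.job_info)

    def receive(self, info: Optional[JobInfo]) -> None:
        # Sisters that already have the information discard the message.
        if self.job_info is None:
            self.job_info = info
```

In this model, a sister that restarts while MS is still down gets no answer to its request and only receives the job information once MS itself recovers and broadcasts, which is the whole-cluster-restart case mentioned in the last bullet.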

Since every sister has its own job file with information about the progress of the job, and resuming single-node jobs is already implemented, I believe these are the only changes needed to enable resuming multi-node jobs.


This design has one problem: if clients for multi-node jobs (e.g. pbsdsh) don't support reconnecting to nodes, the current task might either fail or enter an infinite loop. Such clients will need to be rewritten to support reconnecting (mpirun, for example, already supports this).
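One way a client could avoid both failure modes is to retry the connection a bounded number of times with a growing delay. The sketch below is a generic illustration, not how pbsdsh or mpirun work: `connect_with_retry` and its parameters are hypothetical, and the actual connection attempt is passed in as a callable.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def connect_with_retry(
    connect: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call connect() until it succeeds or max_attempts is reached.

    The delay doubles after every failed attempt. The bound on attempts
    keeps the client from looping forever if the node never comes back.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return connect()
        except OSError:
            # Connection errors such as ConnectionRefusedError are OSError
            # subclasses; give up and re-raise on the final attempt.
            if attempt == max_attempts:
                raise
            sleep(delay)
            delay *= 2
    raise AssertionError("unreachable")
```

A real client would pass something like `lambda: socket.create_connection((host, port))` as `connect`; the attempt limit and delays here are placeholder values.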







