...

  • When saving jobs to disk, save stdout and stderr ports (so when mother superior restarts, communication on these ports can be resumed)
  • When saving tasks to disk, save obits (so that when mom restarts, it knows it should send an obit to MS)
  • When mom is restarted and receives a multi-node job, one of two possibilities happens:
    • mom is a sister
      • sister sends a request to mother superior for required information about the job (number of nodes, stdout/stderr ports, security credentials)
      • mother superior sends this information to all available sisters
      • sisters that already have this information discard it; sisters that need it use it
    • mom is MS
      • mom recovers stdout/stderr ports from job file
      • mom sends number of nodes, stdout/stderr ports, and security credentials to all available sisters (this is needed in case the whole cluster is restarted at once and MS boots up as the last node)
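
The recovery flow above can be illustrated with a small in-memory model. This is only a sketch of the proposed behaviour, not mom's actual code: the `JobInfo` and `Mom` classes, their field and method names, and the use of JSON for the job file are all assumptions made for illustration.

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class JobInfo:
    """What a sister needs to rejoin a job (hypothetical field names)."""
    job_id: str
    num_nodes: int
    stdout_port: int
    stderr_port: int
    credentials: str


class Mom:
    """Toy model of a mom daemon; real moms talk over the network."""

    def __init__(self, name, is_ms):
        self.name = name
        self.is_ms = is_ms
        self.job_info = None   # Optional[JobInfo]
        self.sisters = []      # filled in on mother superior only
        self.ms = None         # filled in on sisters only

    def save_job(self) -> str:
        """Serialize the job record, including stdout/stderr ports."""
        return json.dumps(asdict(self.job_info))

    def recover(self, saved: Optional[str]) -> None:
        """Run after a restart; `saved` is the job file content, if any."""
        if self.is_ms:
            # MS recovers the ports from its job file, then informs all
            # sisters (covers the case where MS boots up last).
            self.job_info = JobInfo(**json.loads(saved))
            self.broadcast()
        else:
            # A sister asks MS for the information it is missing.
            self.ms.handle_info_request()

    def handle_info_request(self) -> None:
        """MS answers a sister's request by informing all sisters."""
        if self.job_info is not None:
            self.broadcast()

    def broadcast(self) -> None:
        for sister in self.sisters:
            sister.receive_job_info(self.job_info)

    def receive_job_info(self, info: JobInfo) -> None:
        # Sisters that already have the information discard the message.
        if self.job_info is None:
            self.job_info = info


def make_cluster():
    """Build one MS and two sisters sharing a running job."""
    ms = Mom("ms", is_ms=True)
    sisters = [Mom("s1", is_ms=False), Mom("s2", is_ms=False)]
    ms.sisters = sisters
    for s in sisters:
        s.ms = ms
    info = JobInfo("1.server", 3, 40001, 40002, "cred")
    ms.job_info = info
    for s in sisters:
        s.job_info = JobInfo(**asdict(info))
    return ms, sisters
```

In this model, a restarted sister calls `recover(None)` and gets its information back from MS, while a restarted MS calls `recover(saved)` with the string previously returned by `save_job()` and pushes the information to every sister.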

Since all sisters have their own job files with information about the progress of the job, and resuming single-node jobs is already implemented, I believe these are the only changes necessary to enable resuming multi-node jobs.


This design has one problem: if clients for multi-node jobs (e.g. pbsdsh) don't support reconnecting to nodes, the current task might either fail or enter an infinite loop. These clients will therefore need to be rewritten to support reconnecting (mpirun, for example, already supports this).
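
One way a client could support reconnecting without either failing on the first error or looping forever is a bounded retry with backoff. The sketch below is a generic illustration, not code from pbsdsh or mpirun; the function name `connect_with_retry` and its parameters are hypothetical.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def connect_with_retry(
    connect: Callable[[], T],
    attempts: int = 5,
    delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `connect` until it succeeds or `attempts` is exhausted.

    `connect` is any callable that raises ConnectionError on failure
    (for example, a wrapper around socket.create_connection).
    """
    last_error = None
    for attempt in range(attempts):
        try:
            return connect()
        except ConnectionError as err:
            last_error = err
            # Wait longer after each failure, but not after the last one.
            if attempt < attempts - 1:
                sleep(delay * 2 ** attempt)
    # Bounded: give up with an error instead of retrying forever.
    raise ConnectionError(f"gave up after {attempts} attempts") from last_error
```

The retry limit is the design point here: a node that is restarting gets time to come back, while a node that never returns produces a clear error rather than a hung task.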







...
