Links
- Link to discussion post: https://community.openpbs.org/t/option-to-resume-multinode-jobs/2244/1
- Link to pull request: https://github.com/openpbs/openpbs/pull/1955
Overview
Currently, single-node jobs can be resumed when mom is restarted, but multi-node jobs are killed and have to start over. This is troublesome whenever the PBS cluster needs to be restarted (server maintenance, PBS update, etc.). I therefore propose a change that makes it possible to resume multi-node jobs when PBS mom is started with the -p flag.
Proposed changes
- When saving jobs to disk, save the stdout and stderr ports, so that communication on these ports can be resumed when mother superior (MS) restarts
- When saving tasks to disk, save obits, so that a restarted mom knows it should send an obit to MS
- When mom is restarted and receives a multi-node job, one of two things happens:
  - mom is a sister
    - the sister sends a request to mother superior for the required information about the job (number of nodes, stdout/stderr ports, security credentials)
    - mother superior sends this information to all available sisters
    - sisters that already have this information discard it; sisters that need it use it
  - mom is MS
    - mom recovers the stdout/stderr ports from the job file
    - mom sends the number of nodes, stdout/stderr ports, and security credentials to all available sisters (this is needed in case the whole cluster is restarted at once and MS boots up as the last node)
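The restart handling described in the list above can be sketched roughly as follows. This is only an illustration of the decision logic, not code from the pull request: every type and function name here (`mom_role`, `recover_action`, `job_info`, `on_restart`, `sister_accept_info`) is hypothetical, and security credentials are left out for brevity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical types for illustration; these are not OpenPBS structures. */
enum mom_role { ROLE_MS, ROLE_SISTER };

enum recover_action {
    ACT_REQUEST_FROM_MS,      /* sister asks mother superior for job info */
    ACT_BROADCAST_TO_SISTERS  /* MS sends job info to all available sisters */
};

struct job_info {
    int  num_nodes;
    int  stdout_port;
    int  stderr_port;
    bool have_info;  /* true once node count and ports are known locally */
};

/* Decide what a restarted mom does for a recovered multi-node job. */
enum recover_action on_restart(enum mom_role role)
{
    return role == ROLE_SISTER ? ACT_REQUEST_FROM_MS
                               : ACT_BROADCAST_TO_SISTERS;
}

/*
 * Handle job info sent by MS. A sister that already has the information
 * discards the message (returns false); a sister that needs it copies it
 * (returns true).
 */
bool sister_accept_info(struct job_info *local, const struct job_info *from_ms)
{
    assert(local != NULL && from_ms != NULL);
    if (local->have_info)
        return false;
    local->num_nodes   = from_ms->num_nodes;
    local->stdout_port = from_ms->stdout_port;
    local->stderr_port = from_ms->stderr_port;
    local->have_info   = true;
    return true;
}
```

Because MS sends the information to all available sisters in both cases, the discard check in `sister_accept_info` is what keeps repeated messages harmless for sisters that were never restarted.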
Since all sisters have their own job files with information about the progress of the job, and resuming single-node jobs is already implemented, I believe these are the only changes necessary to enable resuming multi-node jobs.
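Saving the stdout and stderr ports with the job amounts to persisting two extra values and reading them back on restart. A minimal sketch, assuming a made-up record layout: `saved_job_ports`, `save_job_ports`, and `load_job_ports` are hypothetical names and do not reflect the actual OpenPBS job file format.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical on-disk record; not the actual OpenPBS job file layout. */
struct saved_job_ports {
    int32_t stdout_port;  /* stdout port recorded for the job */
    int32_t stderr_port;  /* stderr port recorded for the job */
};

/* Write the record at the current file position. Returns 0 on success, -1 on error. */
int save_job_ports(FILE *fp, const struct saved_job_ports *p)
{
    if (fp == NULL || p == NULL)
        return -1;
    if (fwrite(p, sizeof(*p), 1, fp) != 1)
        return -1;
    return fflush(fp) == 0 ? 0 : -1;
}

/* Read the record from the current file position. Returns 0 on success, -1 on error. */
int load_job_ports(FILE *fp, struct saved_job_ports *p)
{
    if (fp == NULL || p == NULL)
        return -1;
    return fread(p, sizeof(*p), 1, fp) == 1 ? 0 : -1;
}
```

The obit flag for tasks could be persisted the same way, as one more field in whatever record is already written for each task.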