Currently, single-node jobs can be resumed when mom is restarted, but multi-node jobs are killed and have to start over. This is troublesome whenever the PBS cluster has to be restarted (server maintenance, PBS update, etc.). I therefore propose a change that makes it possible to resume multi-node jobs when PBS mom is started with the -p flag.
Since every sister keeps its own job files with information about the progress of the job, and resuming single-node jobs is already implemented, I believe these are the only changes necessary to enable resuming multi-node jobs.
This design has one problem: if the clients for multi-node jobs (e.g. pbsdsh) don't support reconnecting to nodes, the current task might either fail or enter an infinite loop. These clients will therefore need to be rewritten to support reconnecting (mpirun, for example, already supports this).
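To illustrate the kind of reconnect behaviour a client would need, here is a minimal sketch of a bounded retry loop with exponential backoff. It is not taken from pbsdsh or any PBS code: `connect_with_retry`, its parameters, and the `connect` callable are hypothetical names chosen for this example. The cap on attempts is what avoids the infinite loop mentioned above.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def connect_with_retry(
    connect: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call connect() until it succeeds or max_attempts is exhausted.

    Hypothetical helper: `connect` stands in for whatever call a client
    uses to reach a node. Only OSError (which includes connection
    refused/reset errors) is treated as retryable.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return connect()
        except OSError:
            # Give up after the last attempt instead of looping forever.
            if attempt == max_attempts:
                raise
            # Wait base_delay, 2*base_delay, 4*base_delay, ... between tries.
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

A client could wrap its node connection in such a helper so that a mom restart shows up as a short delay, while a node that never comes back still produces an error after the last attempt.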