Links

Overview

Currently, single-node jobs can be resumed when mom is restarted, but multi-node jobs are killed and have to start over. This is troublesome whenever the PBS cluster needs to be restarted (server maintenance, PBS update, etc.). I therefore propose a change that makes it possible to resume multi-node jobs when PBS mom is started with the -p flag.

Proposed changes

Since all sisters have their own job files with information about the progress of the job, and resuming single-node jobs is already implemented, I believe these are the only changes necessary to enable resuming multi-node jobs.


This design has one problem: if clients for multi-node jobs (e.g. pbsdsh) don't support reconnecting to nodes, the current task might either fail or enter an infinite loop. These clients will therefore need to be rewritten to support reconnecting (mpirun, for example, already supports this).
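To illustrate the kind of reconnect behaviour a client would need, here is a minimal sketch in Python. It is not taken from pbsdsh, mpirun, or any PBS source; the function name connect_with_retry and its parameters (attempts, delay, connect) are hypothetical. The point it shows is that retries are bounded, so a client whose node never comes back fails cleanly instead of looping forever.

```python
import socket
import time


def connect_with_retry(host, port, attempts=5, delay=1.0,
                       connect=socket.create_connection):
    """Try to (re)connect to host:port, giving up after `attempts` tries.

    Hypothetical helper for illustration only. `connect` is injectable so
    the retry logic can be exercised without a real network.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")

    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            # On success, hand the connection back to the caller.
            return connect((host, port))
        except OSError as exc:
            # The node may still be restarting: remember the error and retry.
            last_error = exc
            if attempt < attempts:
                time.sleep(delay)

    # Bounded retries: report failure instead of looping indefinitely.
    raise ConnectionError(
        f"could not reach {host}:{port} after {attempts} attempts"
    ) from last_error
```

A client would call this helper each time its connection to a sister node is lost, and treat the final ConnectionError as a task failure. Suitable values for attempts and delay depend on how long a mom restart takes at a given site.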







