Resilient Interactive Job


Overview

This enhancement provides a way for an interactive job to survive even after the client host issuing 'qsub -I' loses connection to the execution host.
Possible reasons for the qsub disconnection include:

  • change in VPN affecting the client host connection
  • the client host going down either due to a crash or scheduled maintenance
  • someone forcibly killing the qsub -I process

In all these cases, the proposal is to let the job continue running so that the user can reconnect to the interactive session from a different terminal or host.

Approach

  • Allow an existing pbs.conf parameter "PBS_REMOTE_VIEWER" to be recognized under Linux/Unix. It is currently only consumed by PBS on Windows when running GUI interactive jobs.
  • It will name the viewer for the job's interactive session, whether that session is run locally or remotely.
  • For now, it would recognize only the value:

PBS_REMOTE_VIEWER=screen

to use the screen utility. The screen command in Linux can launch and manage multiple shell sessions from a single shell session. That single session can then be detached interactively, which preserves its terminal and any processes it started. At a later time, another process can re-attach to the original 'screen' session.
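As a rough illustration of the first step, the sketch below reads the parameter out of pbs.conf-style text. It assumes the file consists of NAME=value lines, that lines starting with '#' are comments, and that a later assignment overrides an earlier one; the function name read_remote_viewer is hypothetical and is not part of PBS.

```python
def read_remote_viewer(conf_text):
    """Return the PBS_REMOTE_VIEWER value found in pbs.conf-style text, or None.

    Assumptions (not taken from the design): NAME=value lines, '#' comments,
    and the last assignment wins.
    """
    value = None
    for raw_line in conf_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        name, sep, rest = line.partition("=")
        if sep and name.strip() == "PBS_REMOTE_VIEWER":
            value = rest.strip()
    # Treat an empty value the same as an absent parameter.
    return value or None
```

A caller would pass the contents of /etc/pbs.conf and treat a None result as "no viewer configured".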

Extend /etc/pbs.conf parameter on the client host: PBS_REMOTE_VIEWER=screen

  • One can change the file path to 'screen' using the PBS_REMOTE_VIEWER parameter:

PBS_REMOTE_VIEWER=/usr/bin/screen

  • PBS_REMOTE_VIEWER accepts either the bare value "screen" or the full path to the screen command. With the bare value, the command is found only if the job's PATH setting includes the directory that contains it.
  • If PBS_REMOTE_VIEWER is not specified, the interactive job runs like before, in a foreground shell session.
  • With PBS_REMOTE_VIEWER set to "screen" on the client host, running qsub -I would give the submitting user this message when the interactive session runs:

    Your interactive job is running 'screen' on <job-id> under host <exec_hostname>.
    
    To disconnect, you can run:
    
    	screen -d
    
    You can reconnect to the job at a later time in one of 2 ways:
    
    	1.		ssh <exec_hostname>
    			<exec_hostname>% screen -d -r <job-id>
    
    	2.		pbs_interact <job-id>
    
    If you want to shut down the screen session, simply run:
    
    	exit
    
    Press <return>, <ctrl+D>, <ctrl+J>, or <ctrl+M> to continue...
    Press <ctrl+C> to exit the job...
    
    
  • How it works:
    1. qsub will pass the PBS_REMOTE_VIEWER value to the PBS server, which in turn will give it to the primary mom when the job executes.
    2. The job itself will carry a new environment variable, PBS_REMOTE_VIEWER, set to the same value given in pbs.conf.
    3. When the primary mom sees the PBS_REMOTE_VIEWER value of 'screen', it would execute:

      /usr/bin/screen -S <job-id>
      
      where <job-id> is the screen name.

      The screen session would continue to run in the background, until an "exit" is done in that session.

    4. The primary mom would monitor for the existence of the 'screen' process.

    5. The interactive job will end when the 'screen' process exits, the job reaches its 'walltime' limit, or a qdel is issued.

    6. If an interactive job disconnects and then reconnects at a later time, the connected session would continue to be tracked by PBS for accounting purposes.

    7. When a submitted interactive job is waiting to run on an execution host and the connecting qsub client is interrupted (killed, or the submit host goes down), the job would remain queued on the server.
      1. The job could then go into execution with screen running in the background, and the owning user could reconnect to the screen session.
    8. A screen session will terminate when the user types <ctrl+D>.
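The steps above can be sketched in a few lines. This is only an illustration under stated assumptions: the function names viewer_argv and job_end_reason are hypothetical, rejecting unrecognized viewer values with an error is an assumption (the design only says 'screen' is the one recognized value), and the order in which the end conditions are checked is also assumed.

```python
import os


def viewer_argv(viewer, job_id):
    """Build the command the primary mom would run for the job's viewer.

    Returns None when no viewer is configured, meaning the interactive job
    runs as before in a foreground shell session.
    """
    if not viewer:
        return None
    # Accept the bare name "screen" or a full path ending in "screen".
    if os.path.basename(viewer) != "screen":
        # Assumption: unrecognized values are rejected outright.
        raise ValueError("unsupported PBS_REMOTE_VIEWER value: %r" % viewer)
    # The job ID doubles as the screen session name.
    return [viewer, "-S", job_id]


def job_end_reason(screen_alive, elapsed_s, walltime_s, qdel_issued):
    """Return why the interactive job should end, or None to keep running.

    The precedence (qdel, then walltime, then screen exit) is an assumption.
    """
    if qdel_issued:
        return "qdel"
    if walltime_s is not None and elapsed_s >= walltime_s:
        return "walltime"
    if not screen_alive:
        return "screen exited"
    return None
```

For a hypothetical job ID "123.svr" and a viewer of "/usr/bin/screen", viewer_argv yields the same command shape as step 3: /usr/bin/screen -S 123.svr.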

New utility: pbs_interact

  • Introduce a new Linux/Unix-only command called pbs_interact that is executed as follows:

pbs_interact <job-id>

  • This utility will connect to the interactive session of job <job-id> that was started by a PBS_REMOTE_VIEWER.
    • Standard input, output, error are connected to the terminal executing 'pbs_interact'.
  • With 'screen' being the only value recognized right now, this would essentially do the equivalent of:

ssh <primary_host>

<primary_host>% screen -d -r <job-id>

but instead of 'ssh' talking to the sshd daemon on the execution host, it would be 'pbs_interact' communicating with the primary pbs_mom executing <job-id>.

  • pbs_interact can be run only by the user who owns the job.
  • If it is executed by any other user, the following message is written to stderr:

"Unauthorized Request"

  • If pbs_interact <job-id> is called but the job does not exist, the message below is shown:

"Unknown Job ID"

  • If pbs_interact <job-id> is called on a non-interactive job, or on a job that is not running screen, the message below is shown:

"Not an interactive job running screen"

  • pbs_interact can be called from a remote host, as long as the requesting user at remote host is authorized by the server to make requests.
    • Authorization is granted either by setting qmgr -c "set server flatuid = true" or by the requesting user passing an ruserok() test.
    • ruserok() authorization is done via /etc/hosts.equiv on the server host, or via a local .rhosts file set up by the user on the server host to authorize access.
    • Without proper authorization, pbs_interact returns the message: "Unauthorized Request".
  • Because pbs_interact does the equivalent of 'screen -d -r', any terminal currently attached to the job's screen session (for example, the one started by qsub -I) will be detached automatically. This allows pbs_interact to attach to the screen session in its place.
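The request checks described in this section could be arranged roughly as below. This is a sketch, not the implementation: the job table layout, the field names (owner, interactive, runs_screen), the function names, and the order of the checks are all assumptions made for illustration. Only the three message strings and the 'screen -d -r <job-id>' equivalent come from the text above.

```python
def check_interact_request(jobs, job_id, requester):
    """Return the error message for a pbs_interact request, or None if allowed.

    `jobs` is a hypothetical mapping of job ID to a dict with the keys
    "owner", "interactive", and "runs_screen".
    """
    job = jobs.get(job_id)
    if job is None:
        return "Unknown Job ID"
    if requester != job["owner"]:
        return "Unauthorized Request"
    if not (job["interactive"] and job["runs_screen"]):
        return "Not an interactive job running screen"
    return None


def reattach_argv(job_id):
    """Command equivalent to what pbs_interact does on the primary host."""
    # -d detaches any terminal still attached; -r re-attaches this one.
    return ["screen", "-d", "-r", job_id]
```

With a hypothetical table such as {"7.svr": {"owner": "alice", "interactive": True, "runs_screen": True}}, a request from "alice" for "7.svr" passes all checks, while the same request from another user is answered with "Unauthorized Request".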

Future

Allow PBS_REMOTE_VIEWER to accept other screen-like facilities, such as 'tmux'.

