Option for sister moms to not delete job's files sitting on shared location


Overview

If a job is submitted with the sandbox attribute set to PRIVATE (i.e. qsub -W sandbox=PRIVATE), a job-specific staging and execution directory is created under the directory specified by the MoM $jobdir_root configuration option.
Setting $jobdir_root to a location shared by multiple MoMs breaks when the node rampdown feature (PP-Node Rampdown feature) is used. When a node is released early from a job by calling 'pbs_release_nodes', the sister mom managing that node removes any job-specific files, including the job's temporary directories, under the $jobdir_root directory. Because those files sit on the same shared location seen by the primary mom and the other sister moms, the user's files disappear. The primary mom loses the ability to stage out files and return the stdout/stderr files, and the executing job could abort.

If $jobdir_root is unset, the location defaults to the user's home directory. If a site sets up user home directories to be shared, the same pbs_release_nodes problem will be encountered.

Approach

Introduce a new, optional directive 'shared' to the $jobdir_root mom configuration option that tells mom that the specified path is shared among the primary mom and the sister moms.

$jobdir_root <stage directory root> [shared]

The optional 'shared' directive (brackets mean optional) tells PBS mom that the <stage directory root> is a shared (e.g. NFS) location, which means the primary mom and the sister moms see the same job-specific staging and execution directories for each job. With this option set, a sister mom does not clean up the job's directories when it receives a request to delete the job on the node it manages. Such a request can arrive when the node has been released from the job early, as a result of a call to pbs_release_nodes. At the end of the job, the primary mom takes care of cleaning up the files.
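
The syntax above can be illustrated with a small shell sketch. It is not the pbs_mom parser: the function name parse_jobdir_root is hypothetical, the path is assumed to contain no whitespace, and only the exact word 'shared' is assumed to enable the flag, with any other trailing word ignored.

```shell
# Hypothetical helper, not pbs_mom code: parse one mom_priv/config line.
# Prints "<stage directory root> <flag>", where flag is 1 when 'shared' is given.
# Assumes the path contains no whitespace.
parse_jobdir_root() {
    # read splits on whitespace without expanding globs in the path.
    read -r key root directive rest <<EOF
$1
EOF
    # Reject lines that are not a $jobdir_root setting or that lack a path.
    [ "$key" = '$jobdir_root' ] || return 1
    [ -n "$root" ] || return 1
    shared=0
    # Only the exact word "shared" sets the flag; any other value is ignored.
    if [ "$directive" = "shared" ]; then
        shared=1
    fi
    echo "$root $shared"
}

parse_jobdir_root '$jobdir_root /r/shared/users/root shared'
parse_jobdir_root '$jobdir_root /r/shared/users/root'
```

The first call prints the path followed by 1, the second prints the path followed by 0.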

For testing purposes, the sister mom writes the following mom_logs message at the LOG_DEBUG3 level:

"shared jobdir <stageout/execution directory path> to be removed by primary mom"

Note: A directive value other than 'shared' is ignored, and pbs_mom continues to start.
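
The sister mom behaviour and the debug message described above can be sketched in shell as follows. This is only an illustration of the decision, not the pbs_mom implementation; the function cleanup_jobdir and its arguments are hypothetical.

```shell
# Hypothetical sketch of the delete-job cleanup decision, not pbs_mom code.
#   role:   "primary" or "sister"
#   shared: 1 when $jobdir_root carries the 'shared' directive, else 0
#   jobdir: the job's staging and execution directory
cleanup_jobdir() {
    role=$1
    shared=$2
    jobdir=$3
    # Refuse to run with an empty path.
    [ -n "$jobdir" ] || return 1
    if [ "$role" = "sister" ] && [ "$shared" = "1" ]; then
        # Leave the directory in place; the primary mom removes it at job end.
        echo "shared jobdir $jobdir to be removed by primary mom"
        return 0
    fi
    # Primary mom, or a sister mom with a non-shared jobdir: remove it.
    rm -rf -- "$jobdir"
}

demo=$(mktemp -d)
cleanup_jobdir sister 1 "$demo"    # directory is kept, message is printed
cleanup_jobdir primary 1 "$demo"   # directory is removed
```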

A second option is also introduced: the special <stage directory root> value "PBS_USER_HOME", which refers to the default location set up by PBS, namely the user's home directory. So given:

$jobdir_root PBS_USER_HOME shared

this means the sister mom does not clean up the job files under the user's home directory, which is shared, when it receives a request to delete the job on the node it manages. Only the primary mom takes care of cleaning up the files. Specifying the following is a no-op, since the default jobdir_root is already the user's home directory:


$jobdir_root PBS_USER_HOME
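
How the PBS_USER_HOME value relates to the default can be sketched as below. The helper resolve_jobdir_root is hypothetical and not part of PBS, and /home/bayucan is an assumed home directory used only for illustration.

```shell
# Hypothetical helper, not pbs_mom code: pick the directory under which the
# job's staging and execution directory is created.
#   $1: <stage directory root> from the config ("" when $jobdir_root is unset)
#   $2: the job owner's home directory
resolve_jobdir_root() {
    case "$1" in
        ''|PBS_USER_HOME) echo "$2" ;;   # default location: user's home
        *)                echo "$1" ;;   # explicitly configured root
    esac
}

resolve_jobdir_root PBS_USER_HOME /home/bayucan
resolve_jobdir_root /r/shared/users/root /home/bayucan
```

In this sketch an unset value and PBS_USER_HOME resolve to the same directory, which is why PBS_USER_HOME only matters when it is combined with 'shared'.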

Examples

Set $jobdir_root to a shared location:

# cat /var/spool/pbs/mom_priv/config
$jobdir_root /r/shared/users/root shared
# /etc/init.d/pbs restart


Submit job, watch it run, and release the sister node:

% cat job.scr
#PBS -l select=2:ncpus=1
#PBS -l place=scatter
#PBS -W sandbox=PRIVATE
sleep 30

% qsub job.scr
3.a01

% qstat -f | egrep "exec|jobdir"
jobdir = /r/shared/users/root/pbs.3.a01.x8z
exec_vnode = (a01:ncpus=1)+(a02:ncpus=1)

% qstat
Job id           Name             User             Time Use S Queue
---------------- ---------------- ---------------- -------- - -----
3.a01            job.scr          bayucan          00:00:01 R workq

% pbs_release_nodes -j 3 a02

The jobdir is still preserved:

% ls -l /r/shared/users/root/pbs.3.a01.x8z
total 0
-rw-------. 1 bayucan agt 0 Aug 3 08:44 3.a01.ER
-rw-------. 1 bayucan agt 0 Aug 3 08:44 3.a01.OU

With $logevent set high (0xffff) for pbs_mom on sister node a02, we see these mom_logs messages:
08/03/2020 08:43:28.415089;0008;pbs_mom;Job;2.a01;created the job directory /r/shared/users/root/pbs.2.a01.x8z
08/03/2020 08:43:28.415131;0008;pbs_mom;Job;2.a01;JOIN_JOB as node 1
08/03/2020 08:43:45.521132;0008;pbs_mom;Job;2.a01;DELETE_JOB2 received
08/03/2020 08:43:45.521181;0008;pbs_mom;Job;2.a01;kill_job
08/03/2020 08:43:45.522479;0400;pbs_mom;Job;2.a01;shared jobdir /r/shared/users/root/pbs.2.a01.x8z to be removed by primary mom
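
When checking a sister mom's log for this behaviour, a small filter such as the one below may help. It is a sketch that assumes the semicolon-separated field layout seen in the messages above; the function name shared_jobdir_jobs is hypothetical.

```shell
# Print the job id of every "shared jobdir" message in a mom log file.
# Fields are separated by ';': timestamp, event class, daemon, object type,
# job id, message text (layout assumed from the sample messages).
shared_jobdir_jobs() {
    awk -F';' '$6 ~ /^shared jobdir .* to be removed by primary mom$/ { print $5 }' "$1"
}
```

Run against a file containing the messages above, for example shared_jobdir_jobs /var/spool/pbs/mom_logs/20200803 (a path assumed from the PBS home used in this example and the date of the messages), it prints 2.a01.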

% qstat

After the job finishes, the primary mom has deleted the jobdir as expected:
% ls -l /r/shared/users/root/pbs.3.a01.x8z
ls: cannot access /r/shared/users/root/pbs.3.a01.x8z: No such file or directory




