Option for sister moms to not delete job's files sitting on shared location
Links
- Link to discussion: https://community.openpbs.org/t/option-for-sister-moms-to-node-delete-jobs-files-sitting-on-shared-location/2219
- Link to issue: <issue link if available>
- Link to pull request: <PR link if available>
Overview
If a job is submitted with the sandbox attribute set to PRIVATE (i.e., qsub -W sandbox=PRIVATE), a job-specific staging and execution directory is created under the directory specified by the MoM $jobdir_root configuration option.
Setting $jobdir_root to a location shared by multiple MoMs breaks when the node rampdown feature (PP-Node Rampdown feature) is used. When a node is released early from a job by calling pbs_release_nodes, the sister mom managing that node removes all job-specific files, including the job's temporary directories, under the $jobdir_root directory. Because those files sit in the same shared location seen by the primary mom and the other sister moms, the user's files disappear. The primary mom loses the ability to stage out files and return the stdout/stderr files, and the executing job could abort.
If $jobdir_root is unset, the location defaults to the user's home directory. If a site sets up user home directories to be shared, the same pbs_release_nodes problem will be encountered.
Approach
Introduce a new, optional directive 'shared' for the $jobdir_root mom configuration option. It tells mom that the specified path is a location shared among the primary mom and the sister moms.
$jobdir_root <stage directory root> [shared]
The optional 'shared' directive (brackets denote an optional element) tells PBS mom that <stage directory root> is a shared (e.g., NFS) location, meaning the primary mom and the sister moms see the same job-specific staging and execution directories for each job. With this option set, a sister mom does not clean up the job's directories when it receives a request to delete the job on the node it manages. Such a request can arrive when the node has been released from the job early through a call to pbs_release_nodes. At the end of the job, the primary mom cleans up the files.
For testing purposes, the sister mom writes the following message to mom_logs at the LOG_DEBUG3 level:
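The cleanup rule described above can be sketched as follows. This is an illustrative Python sketch only, not the pbs_mom implementation (which is written in C); the names cleanup_jobdir, remove_tree, root_is_shared, and is_primary_mom are hypothetical, and Python's DEBUG level stands in for LOG_DEBUG3.

```python
import logging
import shutil

log = logging.getLogger("mom_sketch")  # hypothetical stand-in for mom_logs


def remove_tree(path):
    """Default removal action: delete the directory tree, ignoring errors."""
    shutil.rmtree(path, ignore_errors=True)


def cleanup_jobdir(jobdir, root_is_shared, is_primary_mom, remove=remove_tree):
    """Remove jobdir unless the root is shared and this mom is a sister mom.

    Returns True if this mom removed the directory, and False if removal
    was left to the primary mom.
    """
    if root_is_shared and not is_primary_mom:
        # Message text follows this design; DEBUG stands in for LOG_DEBUG3.
        log.debug("shared jobdir %s to be removed by primary mom", jobdir)
        return False
    remove(jobdir)
    return True
```

The remove parameter exists only so the rule can be exercised without touching a real file system; when the root is not shared, every mom still removes its own copy of the directory.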
"shared jobdir <stageout/execution directory path> to be removed by primary mom"
Note: A directive value other than 'shared' is ignored, so pbs_mom still starts.
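One way to read the syntax and the note above is sketched below in Python. This is an assumption-laden illustration, not the actual pbs_mom parser: the function name parse_jobdir_root is hypothetical, the match on 'shared' is assumed to be exact and case-sensitive, and the path is assumed to contain no whitespace.

```python
def parse_jobdir_root(value):
    """Parse the text that follows '$jobdir_root' into (path, shared).

    Assumptions (not confirmed by the design): tokens are separated by
    whitespace, and only the exact word 'shared' enables the option.
    """
    tokens = value.split()
    if not tokens:
        raise ValueError("missing stage directory root")
    path = tokens[0]
    # Any directive value other than 'shared' is ignored rather than
    # treated as an error, so startup can continue.
    shared = len(tokens) > 1 and tokens[1] == "shared"
    return path, shared
```

For the configuration line used in the Examples section, this sketch yields the path /r/shared/users/root with the shared flag set.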
A special <stage directory root> value, "PBS_USER_HOME", is also introduced. It refers to the default location set up by PBS, which is the user's home directory. Given:
$jobdir_root PBS_USER_HOME shared
the sister mom would not clean up the job files under the user's home directory, which is shared, when it receives a request to delete the job on the node it manages. Only the primary mom would clean up the files. Specifying the following would be a no-op, since the default $jobdir_root is also the user's home directory:
$jobdir_root PBS_USER_HOME
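The way PBS_USER_HOME maps to the default location could look like the Python sketch below. It is only an illustration under stated assumptions: resolve_stage_root, passwd_home, and home_lookup are hypothetical names, and looking up the home directory through the password database is an assumption about how the default is found.

```python
import pwd  # Unix-only standard library module

PBS_USER_HOME = "PBS_USER_HOME"


def passwd_home(username):
    """Assumed default lookup: the home directory from the passwd database."""
    return pwd.getpwnam(username).pw_dir


def resolve_stage_root(configured_root, username, home_lookup=passwd_home):
    """Return the directory under which the job's directory is created.

    An unset $jobdir_root (None here) and the special value PBS_USER_HOME
    both resolve to the job owner's home directory.
    """
    if configured_root is None or configured_root == PBS_USER_HOME:
        return home_lookup(username)
    return configured_root
```

Because an unset root and PBS_USER_HOME resolve to the same place, the special value only matters as a way to attach the 'shared' directive to the default location.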
Examples
Set $jobdir_root to a shared location:
# cat /var/spool/pbs/mom_priv/config
$jobdir_root /r/shared/users/root shared
# /etc/init.d/pbs restart
Submit job, watch it run, and release the sister node:
% cat job.scr
#PBS -l select=2:ncpus=1
#PBS -l place=scatter
#PBS -W sandbox=PRIVATE
sleep 30
% qsub job.scr
3.a01
% qstat -f | egrep "exec|jobdir"
jobdir = /r/shared/users/root/pbs.3.a01.x8z
exec_vnode = (a01:ncpus=1)+(a02:ncpus=1)
% qstat
Job id           Name             User             Time Use S Queue
---------------- ---------------- ---------------- -------- - -----
3.a01            job.scr          bayucan          00:00:01 R workq
% pbs_release_nodes -j 3 a02
The jobdir is still preserved:
% ls -l /r/shared/users/root/pbs.3.a01.x8z
total 0
-rw-------. 1 bayucan agt 0 Aug 3 08:44 3.a01.ER
-rw-------. 1 bayucan agt 0 Aug 3 08:44 3.a01.OU
With pbs_mom on sister node a02 running with a high $logevent (0xffff), mom_logs shows these messages:
08/03/2020 08:43:28.415089;0008;pbs_mom;Job;2.a01;created the job directory /r/shared/users/root/pbs.2.a01.x8z
08/03/2020 08:43:28.415131;0008;pbs_mom;Job;2.a01;JOIN_JOB as node 1
08/03/2020 08:43:45.521132;0008;pbs_mom;Job;2.a01;DELETE_JOB2 received
08/03/2020 08:43:45.521181;0008;pbs_mom;Job;2.a01;kill_job
08/03/2020 08:43:45.522479;0400;pbs_mom;Job;2.a01;shared jobdir /r/shared/users/root/pbs.2.a01.x8z to be removed by primary mom
% qstat
After the job runs, the primary mom has deleted the jobdir as expected:
% ls -l /r/shared/users/root/pbs.3.a01.x8z
ls: cannot access /r/shared/users/root/pbs.3.a01.x8z: No such file or directory