PTL should cleanup job folders

Description

Job temporary directories were not cleaned up for some jobs. According to the logs, this happens when a job is in substate 41 and qdel -W force is issued. The failure is seen intermittently on a few test systems.

drwx------. 2 pbsuser tstgrp00 4096 Nov 21 14:01 pbs.5.x53-c6p6

from mom_logs:
[root@x53-c6p6 tmp]# grep 5.x53 /var/spool/pbs/mom_logs/20171121
11/21/2017 14:01:03;0080;pbs_mom;Job;5.x53-c6p6;signal job request received
11/21/2017 14:01:03;0004;pbs_mom;Job;5.x53-c6p6;signal job with SIGKILL
11/21/2017 14:01:03;0008;pbs_mom;Job;5.x53-c6p6;kill_job
11/21/2017 14:01:03;0001;pbs_mom;Job;5.x53-c6p6;Job recycled into exiting on signal from substate 41
11/21/2017 14:01:03;0001;pbs_mom;Job;5.x53-c6p6;Job discarded at request of Server
11/21/2017 14:01:03;0008;pbs_mom;Job;5.x53-c6p6;kill_job
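
To check other MoM logs for the same pattern, a minimal sketch along these lines could help. It assumes the semicolon-separated layout visible in the excerpt above (timestamp;code;daemon;object type;job ID;message). The function name `find_recycled_jobs` is hypothetical and not part of PBS or PTL.

```python
import re

# Message text copied from the MoM log excerpt above.
SUBSTATE_RE = re.compile(
    r"Job recycled into exiting on signal from substate (\d+)"
)


def find_recycled_jobs(log_lines, substate=41):
    """Return job IDs whose log lines show a recycle from `substate`.

    Hypothetical helper: assumes six semicolon-separated fields per line,
    as seen in the excerpt (timestamp;code;daemon;type;job ID;message).
    """
    job_ids = []
    for line in log_lines:
        fields = line.rstrip("\n").split(";", 5)
        if len(fields) != 6:
            continue  # not a log record in the assumed layout
        job_id, message = fields[4], fields[5]
        match = SUBSTATE_RE.search(message)
        if match and int(match.group(1)) == substate:
            if job_id not in job_ids:
                job_ids.append(job_id)
    return job_ids
```

For example, the function can be fed the lines of the log file grepped above:

```python
with open("/var/spool/pbs/mom_logs/20171121") as log:
    print(find_recycled_jobs(log))
```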

[root@x53-c6p6 tmp]# grep 5.x53 /var/spool/pbs/server_logs/20171121
11/21/2017 14:00:53;0100;Server@x53-c6p6;Job;5.x53-c6p6;enqueuing into R2, state 1 hop 1
11/21/2017 14:00:53;0008;Server@x53-c6p6;Job;5.x53-c6p6;Job Queued at request of pbsuser@x53-c6p6.pbspro.com, owner = pbsuser@x53-c6p6.pbspro.com, job name = STDIN, queue = R2
11/21/2017 14:00:53;0008;Server@x53-c6p6;Job;5.x53-c6p6;Job Modified at request of Scheduler@x53-c6p6.pbspro.com
11/21/2017 14:01:02;0008;Server@x53-c6p6;Job;5.x53-c6p6;Job Modified at request of Scheduler@x53-c6p6.pbspro.com
11/21/2017 14:01:03;0008;Server@x53-c6p6;Job;5.x53-c6p6;Job Run at request of Scheduler@x53-c6p6.pbspro.com on exec_vnode (x53-c6p6:ncpus=1)
11/21/2017 14:01:03;0080;Server@x53-c6p6;Job;5.x53-c6p6;delete job request received
11/21/2017 14:01:03;0008;Server@x53-c6p6;Job;5.x53-c6p6;Delete forced
11/21/2017 14:01:03;0008;Server@x53-c6p6;Job;5.x53-c6p6;Job to be deleted at request of pbsroot@x53-c6p6.pbspro.com
11/21/2017 14:01:03;0008;Server@x53-c6p6;Job;5.x53-c6p6;Discard running job, Forced Delete
11/21/2017 14:01:03;0100;Server@x53-c6p6;Job;5.x53-c6p6;dequeuing from R2, state 5

[root@x53-c6p6 tmp]# grep 5.x53 /var/spool/pbs/sched_logs/20171121
11/21/2017 14:00:53;0040;pbs_sched;Job;5.x53-c6p6;Queue not started
11/21/2017 14:00:53;0040;pbs_sched;Job;5.x53-c6p6;Queue not started
11/21/2017 14:01:02;0080;pbs_sched;Job;5.x53-c6p6;Considering job to run
11/21/2017 14:01:02;0040;pbs_sched;Job;5.x53-c6p6;Insufficient amount of queue resource: ncpus (R: 1 A: 0 T: 1)
11/21/2017 14:01:02;0080;pbs_sched;Job;5.x53-c6p6;Considering job to run
11/21/2017 14:01:02;0040;pbs_sched;Job;5.x53-c6p6;Insufficient amount of queue resource: ncpus (R: 1 A: 0 T: 1)
11/21/2017 14:01:03;0080;pbs_sched;Job;5.x53-c6p6;Considering job to run
11/21/2017 14:01:03;0040;pbs_sched;Job;5.x53-c6p6;Job run

[root@x53-c6p6 tmp]# tracejob 5

Job: 5.x53-c6p6

11/21/2017 14:00:53 S Job Queued at request of pbsuser@x53-c6p6.pbspro.com, owner = pbsuser@x53-c6p6.pbspro.com, job name = STDIN, queue = R2
11/21/2017 14:00:53 S Job Modified at request of Scheduler@x53-c6p6.pbspro.com
11/21/2017 14:00:53 L Queue not started
11/21/2017 14:00:53 L Queue not started
11/21/2017 14:00:53 S enqueuing into R2, state 1 hop 1
11/21/2017 14:00:53 A queue=R2
11/21/2017 14:01:02 L Considering job to run
11/21/2017 14:01:02 L Insufficient amount of queue resource: ncpus (R: 1 A: 0 T: 1)
11/21/2017 14:01:02 L Considering job to run
11/21/2017 14:01:02 L Insufficient amount of queue resource: ncpus (R: 1 A: 0 T: 1)
11/21/2017 14:01:02 S Job Modified at request of Scheduler@x53-c6p6.pbspro.com
11/21/2017 14:01:03 L Considering job to run
11/21/2017 14:01:03 S Job Run at request of Scheduler@x53-c6p6.pbspro.com on exec_vnode (x53-c6p6:ncpus=1)
11/21/2017 14:01:03 S Delete forced
11/21/2017 14:01:03 S Discard running job, Forced Delete
11/21/2017 14:01:03 M Job recycled into exiting on signal from substate 41
11/21/2017 14:01:03 M Job discarded at request of Server
11/21/2017 14:01:03 L Job run
11/21/2017 14:01:03 S delete job request received
11/21/2017 14:01:03 S Job to be deleted at request of pbsroot@x53-c6p6.pbspro.com
11/21/2017 14:01:03 S dequeuing from R2, state 5
11/21/2017 14:01:03 M kill_job
11/21/2017 14:01:03 M kill_job
11/21/2017 14:01:03 A requestor=pbsroot@x53-c6p6.pbspro.com
11/21/2017 14:01:03 M signal job request received
11/21/2017 14:01:03 M signal job with SIGKILL

Acceptance Criteria

None

Activity

Kumar Jakkali
January 15, 2018, 6:32 AM
Edited

I tried the following steps to reproduce this issue.

  • set the ncpus = 100

  • submitted 25 jobs (sleep 1000)

  • all jobs were in R state

  • kill the comm and mom

  • after some time qdel -Wforce jobs

  • start the comm and mom

The job folders were cleaned up when I started the MoM again.

I then tried a second scenario:

  • install PBS

  • set the ncpus = 100

  • submitted 25 jobs (sleep 1000)

  • all jobs were in R state

  • kill the comm and mom

  • after some time qdel -Wforce jobs

    • now /var/tmp has job folders

  • install PBS again

  • now when I submit a job, it goes into the H state because the job folder (from the previous PBS installation) still exists in /var/tmp.

I suspect this is not a PBS issue. PTL should clean up job folders.

This is a PTL issue rather than a PBS issue, so I am changing the component from MOM to PTL_Framework.
I am updating the Summary to 'PTL should cleanup job folders'.
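
The cleanup proposed in this comment could look roughly like the sketch below. It is not the actual PTL change: the function name `cleanup_job_dirs` and the `pbs.*` name pattern are assumptions, the latter taken from the leftover folder name in the description (pbs.5.x53-c6p6).

```python
import fnmatch
import os
import shutil


def cleanup_job_dirs(tmp_dirs=("/var/tmp", "/tmp"), pattern="pbs.*"):
    """Remove leftover job folders matching `pattern`; return removed paths.

    Hypothetical helper: the default directories and the name pattern are
    assumptions based on the folder shown in this ticket.
    """
    removed = []
    for tmp_dir in tmp_dirs:
        if not os.path.isdir(tmp_dir):
            continue
        for name in sorted(os.listdir(tmp_dir)):
            path = os.path.join(tmp_dir, name)
            # Only touch real directories (not files or symlinks)
            # whose name looks like a job folder.
            if not fnmatch.fnmatch(name, pattern):
                continue
            if os.path.islink(path) or not os.path.isdir(path):
                continue
            shutil.rmtree(path, ignore_errors=True)
            if not os.path.exists(path):
                removed.append(path)
    return removed
```

Calling `cleanup_job_dirs()` during test teardown would remove stale folders before the next PBS installation runs jobs. Matching on the name and skipping non-directories keeps unrelated files in the temporary directories untouched.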

Kumar Jakkali
January 25, 2018, 4:06 AM

This PR does not need any new PTL tests.

Travis log
-----------
2018-01-25 01:56:53,317 INFO ============================
2018-01-25 01:56:53,318 INFO ok

2018-01-25 01:56:53,318 INFO ================================================================================
run: 49, succeeded: 49, failed: 0, errors: 0, skipped: 0, timedout: 0
Tests run in 0:07:47.570491
*2018-01-25 01:56:53,318 INFO Cleaning up temporary files
2018-01-25 01:56:53,319 INFO Cleaning up /var/tmp dir
2018-01-25 01:56:53,342 INFO Cleaning up /tmp dir*

travis_time:end:216bc6a8:start=1516844944821356595,finish=1516845413927564909,duration=469106208314
travis_fold:end:install.4
travis_time:start:14f8e35a
$ true

Assignee

Kumar Jakkali

Reporter

anamika upadhyay

Severity

3-High

OS

None

Start Date

None

Pull Request URL

None

Story Points

1

Components

Affects versions

Priority

High