Qdel optimization for a huge number of jobs

Link to the discussion forum: https://community.openpbs.org/t/qdel-optimization-for-a-huge-number-of-jobs/2327

Overview:

qdel takes an enormous amount of time to delete tens of thousands of jobs on PBS_SERVER, so its performance needs to be optimized. Analysis showed that the qdel client iterates over all the jobs and sends the pbs_deljob IFL call for each one serially, which is a conspicuous design problem. This approach works well for a small set of jobs, but it does not scale to a large number of jobs, say 1 million.
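The difference in request count can be illustrated with a small sketch. The functions mock_delete_one and mock_delete_list are hypothetical stand-ins, not PBS calls; they only count how many client-to-server requests each strategy would issue.

```c
#include <stddef.h>

/* Counts simulated client-to-server requests. */
static long requests_sent;

/* Hypothetical stand-in for one per-job delete request. */
static int mock_delete_one(const char *jobid)
{
    (void) jobid;
    requests_sent++;
    return 0;
}

/* Hypothetical stand-in for one request that carries the whole list. */
static int mock_delete_list(const char **jobids)
{
    (void) jobids;
    requests_sent++;
    return 0;
}

/* Serial approach: one request per job in the NULL-terminated array. */
long serial_delete(const char **jobids)
{
    requests_sent = 0;
    for (; *jobids != NULL; jobids++)
        mock_delete_one(*jobids);
    return requests_sent;
}

/* Batched approach: a single request regardless of the list length. */
long batched_delete(const char **jobids)
{
    requests_sent = 0;
    mock_delete_list(jobids);
    return requests_sent;
}
```

With N jobs, the serial path issues N requests while the batched path issues one, which is the cost the new IFL is meant to remove.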

Technical details:

Need for new IFL API pbs_deljobbatch()

  1. Backward compatibility: the reply choice type differs between single-job deletion and multiple-job deletion (pbs_deljob vs. pbs_deljobbatch).

  2. Reservation deletion also creates multiple delete job requests, so reply processing there would need refactoring as well.

  3. With a new IFL, the existing IFL call (pbs_deljob) can be kept for job deletion between Server->Mom; otherwise, supporting these changes in Server->Mom requests would take additional implementation time.

  4. Changing the existing IFL would affect the IFL wrappers (pbs_ifl_wrap.c and pbs_tclWrap.c) and would require further changes to hooks in the server and mom, because the reply type changes. For a single job, the return type is an integer that gives the status of the job deletion. For multiple jobs, a reply struct needs to be processed instead.

  5. The test framework also needs to be refactored to accommodate the IFL signature change.

 

Interface 1: New IFL API to send a list of jobs to delete.

  • Visibility: Public

  • Change Control: Stable

  • Details:

    • A new IFL call is added to send a list of jobs to delete:
      struct batch_deljob_status *pbs_deljoblist(int, char **, char *);

    • The first argument is the server connection handle, the second is a NULL-terminated array of job IDs, and the third passes any extended parameters.

    • This IFL call returns a batch_deljob_status as the response when a job fails to be deleted.

    • If the server is unable to delete a requested job, the return value is a batch_deljob_status record containing the job ID and the error code.

    • On successful deletion, a NULL value is returned.

    • This new IFL is used only by the qdel command, i.e., when sending the list of jobs from the client command to pbs_server. It uses a new batch request, PBS_BATCH_DeleteJobbatch.

    • Other internally generated delete requests would continue to use the existing batch request PBS_BATCH_DeleteJob.
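
The calling convention described above can be sketched as follows. This is only an illustration under stated assumptions: the field layout of batch_deljob_status (a linked list with next, name, and code) is assumed, since this document only says each record carries a job ID and an error code. The pbs_deljoblist body here is a stand-in so the sketch compiles without libpbs; it is not the server-backed implementation. The helper report_failures is hypothetical.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Assumed layout: one record per job that could not be deleted. */
struct batch_deljob_status {
    struct batch_deljob_status *next; /* assumed: records form a list */
    char *name;                       /* job ID that failed */
    int code;                         /* error code for that job */
};

/* Stand-in with the signature from this document. For illustration it
 * "fails" every job whose ID starts with 'x', using a made-up code 1. */
struct batch_deljob_status *
pbs_deljoblist(int c, char **jobids, char *extend)
{
    struct batch_deljob_status *head = NULL, *tail = NULL;

    (void) c;
    (void) extend;
    for (; *jobids != NULL; jobids++) {
        struct batch_deljob_status *rec;

        if ((*jobids)[0] != 'x')
            continue; /* treated as deleted successfully */
        rec = malloc(sizeof(*rec));
        if (rec == NULL)
            break;
        rec->name = malloc(strlen(*jobids) + 1);
        if (rec->name == NULL) {
            free(rec);
            break;
        }
        strcpy(rec->name, *jobids);
        rec->code = 1;
        rec->next = NULL;
        if (tail == NULL)
            head = rec;
        else
            tail->next = rec;
        tail = rec;
    }
    return head; /* NULL means every job was deleted */
}

/* Hypothetical helper: print each failure, free the list, and return
 * the number of failed jobs. */
int report_failures(struct batch_deljob_status *list)
{
    int failed = 0;

    while (list != NULL) {
        struct batch_deljob_status *next = list->next;

        fprintf(stderr, "qdel: %s: error %d\n", list->name, list->code);
        free(list->name);
        free(list);
        list = next;
        failed++;
    }
    return failed;
}
```

A caller would pass a NULL-terminated array of job IDs, treat a NULL return as full success, and otherwise walk the returned records to report each failed job.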