Preemption via deletion
Overview
The main motivation behind this RFE is to ensure that jobs are preempted and high priority work runs whenever possible. The scheduler can currently preempt jobs via suspension, checkpointing, and requeuing. If the scheduler tries to preempt via checkpointing and the job requested -c n, or if the scheduler tries to preempt via requeue and the job requested -r n, the scheduler skips the job. Deletion gives the scheduler a preemption method that a job cannot opt out of, so jobs that would otherwise be skipped can still be moved out of the way. This gives a high priority job the best chance to run.
Technical Details
Interface 1:
The sched object's 'preempt_order' attribute will accept a new letter, 'D', which means to preempt by deleting jobs. The full set of accepted letters will be 'SCRD'.
This interface will be set by the admin. It will be consumed by both the scheduler and the server. The scheduler uses the interface when it decides if a job can be preempted. The server uses the interface when it decides how a job is to be preempted.
The default preempt_order will not change (SCR).
Examples:
qmgr -c 's sched default preempt_order = RD'
qmgr -c 's sched default preempt_order = SCRD 50 R'
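To make the second example concrete, here is a minimal sketch of how a value such as 'SCRD 50 R' could be read once 'D' is accepted. It is not the scheduler's parser. It assumes the existing preempt_order syntax, where a number is a percentage of job time remaining and the letters after it apply at or below that percentage. The names parse_preempt_order and methods_for are hypothetical, and the handling of the exact boundary value is an assumption.

```python
VALID_METHODS = set("SCRD")  # 'D' is the new letter proposed in this design


def parse_preempt_order(value):
    """Parse a preempt_order string into (upper_bound_percent, letters) pairs.

    Illustrative only: assumes letter groups alternate with percentages,
    starting and ending with a letter group.
    """
    tokens = value.split()
    if not tokens:
        raise ValueError("empty preempt_order")
    ranges = []
    upper = 100  # the first letter group applies from 100% downward
    i = 0
    while i < len(tokens):
        letters = tokens[i]
        if set(letters) - VALID_METHODS:
            raise ValueError("unknown preemption method in %r" % letters)
        if len(set(letters)) != len(letters):
            raise ValueError("repeated preemption method in %r" % letters)
        ranges.append((upper, letters))
        i += 1
        if i < len(tokens):
            pct = int(tokens[i])  # raises ValueError if not a number
            if not 0 < pct < upper:
                raise ValueError("percentages must decrease and stay in 1..99")
            upper = pct
            i += 1
            if i == len(tokens):
                raise ValueError("a percentage must be followed by letters")
    return ranges


def methods_for(ranges, percent_remaining):
    """Return the letter group for a job with the given percent of time left."""
    chosen = ranges[0][1]
    for upper, letters in ranges:
        if percent_remaining <= upper:
            chosen = letters
    return chosen
```

Under these assumptions, a job with 75% of its time remaining would be offered 'SCRD', and a job with 30% remaining would be offered only 'R'.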
The pbs_deljob() IFL call normally returns to the caller as soon as the server has received the request and started the delete process. pbs_preempt_jobs() behaves differently: the server will not reply to a pbs_preempt_jobs() call until all the jobs have been fully preempted. For a job preempted by deletion, this means the server waits until the job is truly deleted. The scheduler needs the jobs to be out of the way before it starts the high priority job. If pbs_preempt_jobs() returned sooner, the scheduler would oversubscribe the nodes until the jobs finished being deleted.
Preemption is done via the pbs_preempt_jobs() IFL call. This call only tells the server which jobs to preempt. The server then uses the preempt_order attribute to determine the preemption method for each job. Once a job is preempted, the response to the scheduler reports which method was used. We'll add a new method 'D' to that response.
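The method choice described above can be sketched as follows. This is a simplified model, not the server's code: it only checks static eligibility (a job submitted with -c n cannot be checkpointed, a job submitted with -r n cannot be requeued) and does not model a method failing at run time. The Job class and pick_method function are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    checkpointable: bool = True  # False models a job submitted with -c n
    rerunnable: bool = True      # False models a job submitted with -r n


def pick_method(job, methods):
    """Return the first letter in `methods` that applies to `job`, or None.

    Assumption: suspension and deletion always apply, while checkpoint and
    requeue depend on what the job requested at submission.
    """
    for letter in methods:
        if letter == "S":
            return "S"
        if letter == "C" and job.checkpointable:
            return "C"
        if letter == "R" and job.rerunnable:
            return "R"
        if letter == "D":
            return "D"
    return None  # no method applies, so the job would be skipped


# A job that opted out of both checkpoint and requeue:
stubborn = Job("stubborn", checkpointable=False, rerunnable=False)
without_delete = pick_method(stubborn, "CR")   # None: job is skipped
with_delete = pick_method(stubborn, "CRD")     # "D": job can be preempted
```

Placing 'D' last in the order keeps deletion as a fallback, so less destructive methods are tried first.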
On the server side, we will create an internal batch request to delete the job. This is similar to what happens when a user runs qdel, except that the request originates inside the server.
We can't ack the original pbs_preempt_jobs() request as soon as the delete job requests are finished, because the jobs are not actually deleted until the obit is returned from the mom and end of job processing is finished. Since the obit comes from the mom, the code that handles it has no access to the initial preq. This means we'll have to keep the preq around on the job, similar to how a pbs_rerun request works. When end of job processing is finished and the job is purged, reply_preempt_jobs_request() will be called on the saved preq to add the job to the list returned to the scheduler. Once all jobs have called reply_preempt_jobs_request(), the server will reply to the scheduler.
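The deferred reply can be modeled with a small sketch. It is an illustration of the bookkeeping only, not the server's implementation: the PreemptRequest and Server classes, the job_purged method, and the arguments given to reply_preempt_jobs_request() here are all assumptions, and only the function name comes from this design.

```python
class PreemptRequest:
    """Stands in for the saved preq of one pbs_preempt_jobs() call."""

    def __init__(self, job_ids):
        self.pending = set(job_ids)  # jobs that have not reported back yet
        self.results = {}            # job id -> method letter used
        self.replied = False


class Server:
    def __init__(self):
        self.saved_preq = {}  # job id -> PreemptRequest kept on the job
        self.sent = []        # replies sent back to the scheduler

    def preempt_jobs(self, job_ids):
        # Save the request on every job so it can be found again later,
        # when the obit arrives and the job is purged.
        preq = PreemptRequest(job_ids)
        for jid in job_ids:
            self.saved_preq[jid] = preq
        return preq

    def job_purged(self, jid):
        # Called at the end of end-of-job processing for a deleted job.
        preq = self.saved_preq.pop(jid, None)
        if preq is not None:
            self.reply_preempt_jobs_request(preq, jid, "D")

    def reply_preempt_jobs_request(self, preq, jid, method):
        # Record this job's result; reply only once every job has reported.
        preq.results[jid] = method
        preq.pending.discard(jid)
        if not preq.pending and not preq.replied:
            preq.replied = True
            self.sent.append(dict(preq.results))


server = Server()
server.preempt_jobs(["1.svr", "2.svr"])
server.job_purged("1.svr")  # one job still pending, so no reply yet
replies_after_first = list(server.sent)
server.job_purged("2.svr")  # last job purged, so the reply goes out
```

The point of the sketch is that the reply is driven by job purge events rather than by completion of the delete requests themselves.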
Advice:
It is unwise to use a runjob hook with preemption via deletion. A runjob hook can reject the high priority job's run request after the lower priority jobs have already been deleted. If this happens, we'll have deleted jobs for no reason.