PP-506, PP-507 discussion artifact
Features/Scenarios | Option 1: job pool | Option 2: job array | Option 3: multiple resource requests in a single job |
Example syntax | qsub -A "abcd" -lscratch=5gb --job_pool="pool1" -lselect=1:ncpus=10:mem=20gb job.scr ; qsub -A "abcd" -lscratch=6gb --job_pool="pool1" -lselect=5:ncpus=2:mem=4gb job.scr ; qsub -A "xyz" --job_pool="pool2" -lselect=2:ncpus=5:mem=10gb job.scr ; qsub -A "xyz" --job_pool="pool2" -lselect=5:ncpus=2:mem=4gb job.scr | qsub -A "abcd" -J 1-2 -lscratch=5gb -l "select=1:ncpus=10:mem=20gb" --OR "select=5:ncpus=2:mem=4gb" job.scr ; qsub -A "xyz" -J 1-2 -l "select=2:ncpus=5:mem=10gb" --OR "select=5:ncpus=2:mem=4gb" job.scr | qsub -A "abcd" -lscratch=5gb -l "select=1:ncpus=10:mem=20gb" --OR "select=5:ncpus=2:mem=4gb" job.scr ; qsub -A "xyz" -l "select=2:ncpus=5:mem=10gb" --OR "select=5:ncpus=2:mem=4gb" job.scr |
Job submission | Each job in a job pool has a separate job id, and the ids may not even be contiguous. Jobs can be submitted with different job-wide resources. Jobs from different queues cannot be linked to the same pool (peer scheduling would otherwise be affected). Jobs from different users cannot be linked to the same pool. A job cannot be submitted to a pool in which a previously queued job is already running. Users must submit jobs with a user hold if they do not want any pool job to run before they have finished submitting the whole pool. | Every job gets a job id similar to that of an array job, and each subjob represents a runnable job. The number of select specifications must match the number of subjob indices. New syntax is needed to handle job-wide resources. | Only one job is submitted; PBS internally treats it as a bunch of jobs (one per resource request) and then tries to run the job. The syntax for this option and the "job array" option will probably be similar, and it also needs to allow specifying multiple job-wide resources. |
qstat | Users can query the jobs of a pool by selecting on job_pool, e.g. qstat -f `qselect -j job_pool="pool1"` | Users can stat this job as they do for array jobs: qstat -st <parent job id>. qstat -f of each subjob should show its respective resource request; qstat -f of the parent job should show all the resource requests it was submitted with. | Users can stat the job. They will see nothing in the job's schedselect unless the job is running; all resource requests will be visible in the submitted arguments. |
qdel | Users can delete the jobs of a pool by selecting on job_pool, e.g. qdel `qselect -j job_pool="pool1"` | Users can delete the whole array job or a specific array subjob. | Users can delete this job just as they do any normal job. |
qhold | Users can apply or release a hold on any one job of the pool, and it will take effect on all jobs in the job_pool. | Users can apply or release a hold on the array parent, which holds every subjob of that array from running. | Users can apply or release a hold on the job just as they do today with any normal job. |
qrerun | A qrerun on a running job of a pool puts all the pool's jobs back into the queued state. | qrerun requeues any running subjob of the array and puts the array back in the queued state. | Users can rerun a running job. Doing so should clear the schedselect on requeue, making all submitted select options available for consideration again. |
qalter | Users can qalter jobs the same way they do today. | qalter does not work on subjobs; in its current state it can only operate on the array parent. | If users decide to alter any one of the resource requests, they will have to specify all of the ORed resource requests again. |
queued limits | Queued limits are applied to each job (whether or not it is part of a job_pool). | Queued limits can either be the sum of all the resource requests, OR the PBS server can identify the maximum amount of a resource requested across all requests and use that for the limit (a small sketch of both policies follows the table). | Same as for the job array option: either the sum of all resource requests or the per-resource maximum across requests. |
queued threshold limits | Queued threshold limits are applied to each job (whether or not it is part of a job_pool). | Queued threshold limits can either be the sum of all the resource requests, OR the PBS server can identify the maximum amount of a resource requested across all requests and use that for the limit. | Same as for the job array option. |
run hard limits | Hard run limits need no change in their behavior. | Same. | Same. |
run soft limits | Soft run limits need no change in their behavior. | Same. | Same. |
Server hooks | Server hooks will see an additional job attribute, "job_pool", and may also change it (but only in queuejob hooks). The rest of hook operation is unaffected. | Server hooks will need changes to parse a select specification with multiple resource requests (a sketch follows the table). It is unclear whether a server hook can modify a specific subjob (open question). | Server hooks will need changes to parse a select specification with multiple resource requests. |
Mom hooks | No effect on mom hooks. | No effect on mom hooks. | No effect on mom hooks. |
job completion | On job completion, the whole job_pool will be marked as finished. | On a subjob's completion, the whole array parent will be marked as finished. | On job completion, the job will be marked as finished like any other normal job. |
node fail requeue | On node fail requeue, the whole job_pool is considered requeued (all jobs of the pool show up as queued). | On node fail requeue, the array parent is marked as requeued. | On node fail requeue, the job gets requeued like any other normal job. |
Preemption | No effect, as each job is looked upon as a completely independent job. | Almost no effect on preemption; each subjob is looked upon as an independent job. | When the scheduler tries to run the job and finds that it cannot, it will try to make the job run through preemption, and it will do this for each of the resource requests until it finds a way to run the job (see the scheduling-loop sketch after the table). If this job is running and gets preempted, it retains its schedselect if the preemption is by suspension or checkpointing; in case of requeue, all of the submitted resource requests are available for evaluation again. |
peer scheduling | Because of peer scheduling, jobs from different queues cannot be allowed to be linked to the same job_pool. Once a job of a job_pool is moved to a peer complex, all other jobs in the pool will be marked with a state indicating they are not to be considered for running. | Other than the existing array job issues with peer scheduling, this option shouldn't have any more problems with peer scheduling. | Peer scheduling should work the same way as it does for other normal jobs. |
Array jobs | Array jobs can be submitted to a job_pool too. | Since the syntax is already array job syntax, a subjob can't also act as another array parent. This is a big limitation of this option. | These jobs can be submitted as an array job, which also means there can eventually be multiple array subjobs running with different resource requests. The scheduler currently assumes that if one subjob fails to run, the next subjob would need the same node solution and cannot run either, and it therefore marks the array parent as "can not run"; this assumption needs to change. |
Job sorting | All jobs appear as independent jobs, so job sorting is unaffected. | The server or scheduler will have to treat each subjob as a queued job and place the subjobs at their respective positions in the job sort order. | This is the tricky part for the scheduler: it needs to make the same job appear at multiple places in the sort order, once per resource request. The scheduler currently sorts jobs, not resource requests; it either needs to sort on the basis of resource requests or somehow make the same job appear multiple times, maybe by creating dummy jobs (a sketch of this expansion follows the table). |
Fairshare | Fairshare would work the way it does today. | Fairshare would work the way it does today, since it only deals with running jobs. | Fairshare would work the way it does today. |
Calendaring | A job from a job_pool may already be present on the calendar. If one is found, the scheduler can decide not to add another job from the same pool to the calendar, OR it may add the one that starts sooner, OR the one that takes less walltime (based on policy). | A subjob may already be present on the calendar; if one is found, the scheduler can decide not to place another one on the calendar, or place it if its start time turns out to be sooner or its walltime shorter. | One of the job's resource requests may already be on the calendar, and the scheduler can then decide not to add another resource request to the calendar, OR add the one that starts sooner, OR the one that takes less walltime (a sketch of these policies follows the table). |
qmove | When a job of a pool is moved to another complex, all other jobs in the pool will be marked with a state indicating they are not to be considered for running. | This should work fine here. | This should work fine with this syntax. |
Job wide resources | Each job of a job_pool can have a different job-wide resource specified. | More complex syntax is needed to make sure each subjob has a choice of a different job-wide resource. | The syntax needs to make sure that each ORed resource request also has a way of specifying a different job-wide resource. |
Accounting logs | Mostly unchanged, since every job is seen as an independent job. | Multiple resource requests may get logged in the accounting logs, which may require changes in components that consume accounting logs. | Multiple resource requests may get logged in the accounting logs, and a queued job may not have a "schedselect" (not sure whether schedselect goes into the accounting logs today). This would need changes in components that consume accounting logs. |
estimated start time | Each job is looked at independently and may get its own estimated start time. | Each subjob may get an estimated start time. | The same job may appear multiple times in the job order (once per resource request), so the estimated start time should be the one that starts the job soonest (see the calendaring sketch below). |
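
For the queued-limits rows: a minimal sketch of the two accounting policies mentioned there (charge the sum of all ORed requests vs. only the largest single request). The job representation below is a simplified assumption for illustration, not the server's actual data structures.

```python
# Sketch: two candidate ways to charge a queued multi-request job against
# a queued limit. Data structures here are illustrative assumptions.

def charged_ncpus(resource_requests, policy="max"):
    """ncpus a queued job counts against a queued limit.

    policy="sum": charge the sum over all ORed requests (pessimistic).
    policy="max": charge only the largest single request.
    """
    totals = []
    for request in resource_requests:
        # each request is a list of (chunk_count, per_chunk_resources)
        totals.append(sum(n * res["ncpus"] for n, res in request))
    return sum(totals) if policy == "sum" else max(totals)

# The two requests from the example syntax row:
# select=1:ncpus=10:mem=20gb  --OR  select=5:ncpus=2:mem=4gb
reqs = [
    [(1, {"ncpus": 10})],
    [(5, {"ncpus": 2})],
]
print(charged_ncpus(reqs, "sum"))  # 20: double-charges a job that runs once
print(charged_ncpus(reqs, "max"))  # 10: the worst single request
```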
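
For the server-hooks row: a sketch of what a queuejob hook might do under options 2 and 3 if it had to split the proposed "--OR" select syntax. The "--OR" separator is only the syntax proposed in this document; today's hooks see a single select string, so treat the split as an assumption, not current PBS behavior.

```python
# Sketch of a queuejob hook splitting the *proposed* multi-request select
# syntax. "--OR" is the separator proposed here, not existing PBS syntax.
import pbs

e = pbs.event()
job = e.job

sel = str(job.Resource_List["select"] or "")
requests = [part.strip() for part in sel.split("--OR") if part.strip()]
if len(requests) > 1:
    pbs.logmsg(pbs.LOG_DEBUG,
               "job submitted with %d ORed resource requests" % len(requests))
e.accept()
```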
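
For the preemption row of option 3: a sketch of the scheduling loop that first tries each ORed request normally and then tries to make room by preemption. find_nodes and preempt_for are hypothetical stand-ins for scheduler internals, not real PBS functions.

```python
# Sketch of the option-3 run attempt: try each ORed resource request, then
# retry each with preemption.

def try_run(requests, find_nodes, preempt_for):
    # First pass: can any request be placed without disturbing other jobs?
    for req in requests:
        nodes = find_nodes(req)
        if nodes is not None:
            return ("run", req, nodes)
    # Second pass: can preemption free enough resources for any request?
    for req in requests:
        victims = preempt_for(req)
        if victims is not None:
            return ("preempt", req, victims)
    return ("cannot_run", None, None)
```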
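
For the job-sorting row of option 3: a sketch of the "dummy job" idea, expanding each multi-request job into one sortable entry per resource request so the same job can occupy several positions in the sort order. The fields and the sort key are illustrative assumptions.

```python
# Sketch: expand multi-request jobs into per-request "dummy" entries so one
# job can appear at several places in the sort order. Fields are illustrative.

def expand_for_sort(jobs):
    entries = []
    for job in jobs:
        for i, req in enumerate(job["requests"]):
            entries.append({"job_id": job["id"], "req_index": i, **req})
    return entries

jobs = [
    {"id": "101", "requests": [{"ncpus": 10}, {"ncpus": 2}]},
    {"id": "102", "requests": [{"ncpus": 5}]},
]
# Stand-in sort policy: widest request first.
for entry in sorted(expand_for_sort(jobs), key=lambda e: -e["ncpus"]):
    print(entry["job_id"], entry["req_index"], entry["ncpus"])
# Job 101 appears twice, at different positions: once at 10 ncpus, once at 2.
```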
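
For the calendaring and estimated-start-time rows: a sketch of the two placement policies mentioned (put on the calendar the request that starts soonest, or the one with the shortest walltime), with the soonest start also serving as the whole job's estimated start time. The numbers are made up.

```python
# Sketch of the calendaring policies above: pick which ORed request goes on
# the calendar. Estimates are made-up (start_time, walltime) pairs.

def pick_for_calendar(estimates, policy="soonest"):
    if policy == "soonest":
        return min(estimates, key=lambda e: e[0])   # earliest start
    return min(estimates, key=lambda e: e[1])       # shortest walltime

estimates = [(1000, 3600), (400, 7200)]
print(pick_for_calendar(estimates, "soonest"))   # (400, 7200)
print(pick_for_calendar(estimates, "walltime"))  # (1000, 3600)

# Estimated start time for the whole job: the soonest across all requests.
print(min(start for start, _ in estimates))      # 400
```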