New sched attribute to control runjob wait + making pbs_asyrunjob truly async + deprecating 'throughput_mode'

Forum: http://community.pbspro.org/t/adding-truly-asynchronous-option-to-throughput-mode-making-pbs-asyrunjob-truly-asynchronous/2057

Motivation:

The scheduler can spend up to 94% of its total time waiting for the server to ACK runjob requests, which can have a big impact on performance. I propose adding a new scheduler attribute that controls how long the scheduler waits when issuing runjob requests. See http://community.pbspro.org/t/scheduler-can-spend-94-of-its-time-waiting-for-job-run-ack/2053 for more details on the motivation.

Things to consider if scheduler doesn’t wait for an ACK:

  • Not waiting for an ACK will mean that if the job gets rejected by a runjob hook, or an execjob_begin hook, then the scheduler will not be able to re-purpose those resources for other jobs in that cycle.

  • If a site has runjob hooks that reject jobs, PBS does not penalize the rejected jobs. In the next cycle, the scheduler will therefore see such a job exactly as it did in the last cycle and might try to run it again, assuming that it will pass. If a large number of jobs are rejected repeatedly by a runjob hook without the hook penalizing them, the cluster may see under-utilization of resources.

Interface Changes:

  • New sched attribute ‘job_run_wait’ which will accept the following values:

    • execjob_hook: the scheduler will wait for the server to send an ACK. The server will send the ACK only after it has run the runjob hook on the server side and sent the job to the mom, and the mom has sent an ACK back after running the execjob hooks on the mom side.

      • Implications:

        • Scheduling cycles can be much slower than the other options.

        • On the upside, the scheduler will know about any job rejections by runjob or execjob_begin hooks, and can re-purpose those resources to run some other job in that cycle.

    • runjob_hook: the scheduler will wait for the server to send an ACK, and the server will send the ACK immediately after running any runjob hooks. It will NOT wait for the job to be sent to the mom or for the mom to send an ACK. This will be the default value.

      • If no runjob hooks are configured, there is no point in waiting for the server, so the scheduler will internally behave as if the value were ‘none’.

      • Implications:

        • Scheduling cycles will be slower than the “none” mode below, but still faster than the “execjob_hook” mode above.

        • When a mom level hook rejects a job, the job’s run_count is increased, so such jobs eventually get penalized by getting Held by the server. So, this mode shouldn’t need the admin to penalize such jobs via their execjob hooks.

    • none: the scheduler will not wait for an ACK from the server at all. It will just send the runjob request and move on to the next job. This can make scheduling cycles an order of magnitude faster.

      • Implications:

        • To prevent the under-utilization situation described in the second bullet under “Things to consider if scheduler doesn’t wait for an ACK” above, it is recommended that a site with runjob hooks in place make the hook itself penalize jobs that get rejected repeatedly. This can be done by either de-prioritizing such jobs or putting them on Hold.

    • Any changes to this attribute will take effect from the next scheduling cycle of the particular scheduler.

    • Attribute permissions: readable by all, writable only by Manager.

  • pbs_asyrunjob() will be truly asynchronous:

    • No change to the function’s signature. The only difference is that the function will no longer wait for any reply from the server.

  • throughput_mode sched attribute will be deprecated:

    • The “job_run_wait=runjob_hook” will take over the role of “throughput_mode=True” value, and “job_run_wait=execjob_hook” will take over the role of “throughput_mode=False” value.

    • While throughput_mode is still part of PBS, setting either throughput_mode or job_run_wait will also set the other. The exception is when job_run_wait is set to none: in that case throughput_mode will be unset (without being reset to its default).
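
The coupling between throughput_mode and job_run_wait described in the last two bullets can be sketched roughly as follows. This is only an illustrative Python model, not PBS code: the plain dictionary standing in for the scheduler's attributes and the helper names set_job_run_wait and set_throughput_mode are hypothetical, and the real implementation may differ.

```python
# Hypothetical model of the attribute coupling; not taken from PBS source.
JOB_RUN_WAIT_VALUES = ("execjob_hook", "runjob_hook", "none")

# job_run_wait value -> equivalent throughput_mode value (None = no equivalent)
_TO_THROUGHPUT = {"runjob_hook": True, "execjob_hook": False, "none": None}
# throughput_mode value -> equivalent job_run_wait value
_FROM_THROUGHPUT = {True: "runjob_hook", False: "execjob_hook"}


def set_job_run_wait(attrs, value):
    """Set job_run_wait and keep the deprecated throughput_mode in sync."""
    if value not in JOB_RUN_WAIT_VALUES:
        raise ValueError(f"invalid job_run_wait value: {value!r}")
    attrs["job_run_wait"] = value
    equivalent = _TO_THROUGHPUT[value]
    if equivalent is None:
        # "none" has no throughput_mode counterpart, so throughput_mode is
        # unset rather than being reset to its default.
        attrs.pop("throughput_mode", None)
    else:
        attrs["throughput_mode"] = equivalent


def set_throughput_mode(attrs, value):
    """Set the deprecated throughput_mode and mirror it into job_run_wait."""
    if not isinstance(value, bool):
        raise ValueError("throughput_mode must be True or False")
    attrs["throughput_mode"] = value
    attrs["job_run_wait"] = _FROM_THROUGHPUT[value]
```

In this model, calling set_throughput_mode(attrs, False) leaves attrs with job_run_wait set to "execjob_hook", and a later set_job_run_wait(attrs, "none") removes the throughput_mode key entirely.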

Technical/Internal details:

  • Scheduler will make a different run job related batch request for different values of the job_run_wait attribute.

  • It will call pbs_runjob (PBS_BATCH_RunJob) for “execjob_hook”, pbs_asyrunjob (PBS_BATCH_AsyrunJob) for “none”, and pbs_asyrunjob_ack (PBS_BATCH_AsyrunJob_ack) for “runjob_hook” mode.

  • Adding a new internal/secondary server attribute called “is_runjob_hook” which will be used to communicate to scheduler that there’s a runjob hook enabled. Scheduler will use this knowledge to decide whether to promote “runjob_hook” to “none” or not.
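
The request selection described in the bullets above, including the promotion of “runjob_hook” to “none”, might look something like the sketch below. It is a simplified Python model rather than the scheduler's actual C code: the function names effective_mode and select_run_request are hypothetical, and the batch request names are represented as plain strings.

```python
# Hypothetical model of the scheduler's request selection; not PBS source.
# Maps each job_run_wait mode to the batch request named in this document.
REQUEST_FOR_MODE = {
    "execjob_hook": "PBS_BATCH_RunJob",
    "runjob_hook": "PBS_BATCH_AsyrunJob_ack",
    "none": "PBS_BATCH_AsyrunJob",
}


def effective_mode(job_run_wait, is_runjob_hook):
    """Return the mode the scheduler actually uses for this cycle.

    is_runjob_hook mirrors the internal server attribute that reports
    whether a runjob hook is enabled.
    """
    if job_run_wait not in REQUEST_FOR_MODE:
        raise ValueError(f"invalid job_run_wait value: {job_run_wait!r}")
    if job_run_wait == "runjob_hook" and not is_runjob_hook:
        # No runjob hook to wait for, so behave as if the value were "none".
        return "none"
    return job_run_wait


def select_run_request(job_run_wait, is_runjob_hook):
    """Pick the batch request to send for a job in the current cycle."""
    return REQUEST_FOR_MODE[effective_mode(job_run_wait, is_runjob_hook)]
```

Under these assumptions, the default “runjob_hook” setting sends PBS_BATCH_AsyrunJob_ack only when a runjob hook is enabled and otherwise falls through to PBS_BATCH_AsyrunJob, while “execjob_hook” always uses PBS_BATCH_RunJob regardless of is_runjob_hook.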