...

  • New sched attribute “attr_update_period”: This will specify the time in seconds after which the scheduler will send “job can’t run” related attribute updates to the server. These attributes are as follows:

    • comment

    • estimated.exec_vnode

    • estimated.start_time

  • Default: The default value will be 0 seconds, i.e. the scheduler will send updates every cycle, just like today. For sites which see a large volume of jobs, admins might want to increase this. With a workload of 100k jobs every cycle, 50k of which wouldn’t run, I saw a 3x+ performance boost when setting this value to just 60 seconds.

  • Exception: If “accrue_type” needs to be updated for a job, then the scheduler will ignore the waiting window and send all attribute updates for that job immediately, so that eligible time is accrued accurately.

    • To ameliorate this, I’m also proposing that we change the default accrue_type of jobs to ‘eligible_time’ instead of ‘initial_time’. It seems like we already document “eligible_time” as the default for accrue_type, so this shouldn’t need a separate design change document.

  • Permissions: Manager write only, everyone can read

  • Implications:

    • Depending on the configured update period, there will be a delay before attributes like the job comment, which tells users why their job isn’t running, get updated. Since the sched cycles will be faster, this delay might not be much, but admins should keep this in mind and figure out what value of this attribute works best for their site.

  • Can be configured per scheduler in a multi-sched scenario

  • All “job run” related attribute updates, like pset, walltime for STF jobs, etc., will be sent like before.

...

  • Caveats:

    • Scheduler will determine at the start of every cycle whether it should send attribute updates or not. So, there can be some additional delay in the attribute updates getting sent. For example:

      • If attr_update_period is set to 5 mins and each sched cycle takes 2 mins, the scheduler will send updates in the 4th sched cycle, which happens 6 mins after it last sent updates, even though the update period is 5 mins.

    • Depending on the configured period, there will be a delay before attributes like the job comment, which tells users why their job isn’t running, get updated. Admins should keep this in mind and figure out what value of this attribute works best for their site.
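
To make the extra delay in the caveat above concrete, here is a minimal sketch of the arithmetic. It assumes that sched cycles run back to back, that every cycle takes the same amount of time, and that the check uses "elapsed >= period". The function name effective_update_delay is hypothetical and is not part of the scheduler.

```python
import math


def effective_update_delay(period_s: float, cycle_s: float) -> float:
    """Seconds between two consecutive cycles that send updates.

    Assumptions (illustrative only): cycles run back to back, each one
    takes cycle_s seconds, and the check happens only at cycle start.
    """
    if cycle_s <= 0:
        raise ValueError("cycle_s must be positive")
    if period_s <= 0:
        # No waiting window: every cycle sends updates.
        return cycle_s
    # Updates go out at the first cycle start that is at least period_s
    # seconds after the previous send, i.e. period rounded up to a
    # whole number of cycles.
    return math.ceil(period_s / cycle_s) * cycle_s


# The example from the caveat: 5 min period, 2 min cycles.
print(effective_update_delay(300, 120))  # 360 seconds, i.e. 6 mins
```

In practice cycle times vary with workload, so this should be read as a rough estimate rather than a guarantee.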

Technical details:

  • Scheduler will check at the beginning of each cycle whether attr_update_period seconds have elapsed since it last sent the updates, and will send updates in that cycle only if they have.
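
The check described above, together with the accrue_type exception, could be sketched roughly as follows. This is an illustration under stated assumptions, not the scheduler's actual code: the class AttrUpdateThrottle and its methods are hypothetical names, and the very first cycle is assumed to always send updates.

```python
import time
from typing import Optional


class AttrUpdateThrottle:
    """Hypothetical helper that decides once per cycle whether to send
    "job can't run" attribute updates."""

    def __init__(self, period_s: float) -> None:
        self.period_s = period_s
        self.last_sent: Optional[float] = None  # time of last sending cycle
        self.send_this_cycle = True

    def start_cycle(self, now: Optional[float] = None) -> bool:
        """Call at the beginning of each sched cycle."""
        now = time.monotonic() if now is None else now
        if self.last_sent is None or now - self.last_sent >= self.period_s:
            self.send_this_cycle = True
            self.last_sent = now
        else:
            self.send_this_cycle = False
        return self.send_this_cycle

    def should_send(self, accrue_type_changed: bool = False) -> bool:
        """Per-job decision: a pending accrue_type update bypasses
        the waiting window."""
        return self.send_this_cycle or accrue_type_changed


throttle = AttrUpdateThrottle(period_s=300)
# Cycle start times in seconds for 2 min cycles.
print([throttle.start_cycle(now=t) for t in (0, 120, 240, 360)])
```

Making the decision once at cycle start keeps it consistent for every job looked at in that cycle, at the cost of the rounding delay described in the caveats.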