Exceeded resources notification

Overview:

  • Before this change, a job that exceeds its resources is reported only on stderr; the email notification is either not sent or is imprecise. The job comment on the server also lacks any information about the exceeded resource.

  • This document proposes a distinct job exit code for each resource that can be exceeded. A job killed for exceeding a resource will also be treated as an aborted job, so an email with the appropriate message will be sent to the user, and the job comment on the server will be set to a suitable message.

  • Resources in OpenPBS that a job can exceed: ncpus (burst), ncpus (sum), vmem, mem, cput, walltime

Interfaces:

  • Interface: job exit codes

    • Values:

      • JOB_EXEC_KILL_NCPUS_BURST -24

      • JOB_EXEC_KILL_NCPUS_SUM -25

      • JOB_EXEC_KILL_VMEM -26

      • JOB_EXEC_KILL_MEM -27

      • JOB_EXEC_KILL_CPUT -28

      • JOB_EXEC_KILL_WALLTIME -29

    • Visibility: public

    • Synopsis: job exit code

    • Details: The exit code is a value that the mom sends to the server to report how the job ended. The new exit codes tell the server that the job was killed for exceeding a resource. E.g.: if a job is killed for exceeding its walltime, the exit code JOB_EXEC_KILL_WALLTIME is returned to the server.
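
The mapping from the new exit codes to the resources they represent can be sketched as a lookup table. This is an illustrative Python sketch, not the OpenPBS implementation; the constant names and values are taken from the list above, while the `EXCEEDED_RESOURCE` table and the `exceeded_resource` helper are hypothetical names introduced only for this example.

```python
# Exit code values as listed in this document.
JOB_EXEC_KILL_NCPUS_BURST = -24
JOB_EXEC_KILL_NCPUS_SUM = -25
JOB_EXEC_KILL_VMEM = -26
JOB_EXEC_KILL_MEM = -27
JOB_EXEC_KILL_CPUT = -28
JOB_EXEC_KILL_WALLTIME = -29

# Hypothetical lookup table: exit code -> name of the exceeded resource.
EXCEEDED_RESOURCE = {
    JOB_EXEC_KILL_NCPUS_BURST: "ncpus (burst)",
    JOB_EXEC_KILL_NCPUS_SUM: "ncpus (sum)",
    JOB_EXEC_KILL_VMEM: "vmem",
    JOB_EXEC_KILL_MEM: "mem",
    JOB_EXEC_KILL_CPUT: "cput",
    JOB_EXEC_KILL_WALLTIME: "walltime",
}


def exceeded_resource(exit_code):
    """Return the exceeded resource name, or None for any other exit code."""
    return EXCEEDED_RESOURCE.get(exit_code)
```

Keeping the mapping in a single table means the job comment and the abort email can both be derived from the same exit code, so the two messages cannot drift apart.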

  • Interface: job comments

    • Values:

      • Job run at … on … and exceeded resource ncpus (burst)

      • Job run at … on … and exceeded resource ncpus (sum)

      • Job run at … on … and exceeded resource vmem

      • Job run at … on … and exceeded resource mem

      • Job run at … on … and exceeded resource cput

      • Job run at … on … and exceeded resource walltime

    • Visibility: public

    • Synopsis: job comment

    • Details: If the job is killed for exceeding a resource, the job comment is set to the corresponding value. E.g.: if a job exceeds its walltime, the job comment is set to: “Job run at … on … and exceeded resource walltime”.
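
One way to read the comment values above is as the usual "Job run at … on …" comment with a suffix naming the exceeded resource. The Python sketch below illustrates only that suffix; it treats the existing run comment as an opaque string and does not attempt to reproduce its elided parts. The function name `comment_with_exceeded` is hypothetical, and the real server code may build the comment differently.

```python
def comment_with_exceeded(run_comment, resource):
    """Append the exceeded-resource suffix to an existing job comment.

    run_comment is passed through unchanged (assumption: the server already
    holds the "Job run at ... on ..." text); resource is a name such as
    "walltime" or "ncpus (sum)".
    """
    return f"{run_comment} and exceeded resource {resource}"
```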

  • Interface: email abort messages

    • Values:

      • "Job exceeded resource ncpus (burst)\nSee job standard error file"

      • "Job exceeded resource ncpus (sum)\nSee job standard error file"

      • "Job exceeded resource vmem\nSee job standard error file"

      • "Job exceeded resource mem\nSee job standard error file"

      • "Job exceeded resource cput\nSee job standard error file"

      • "Job exceeded resource walltime\nSee job standard error file"

    • Visibility: public

    • Synopsis: email message on job abort

    • Details: An appropriate abort email is sent when the job is killed for exceeding a resource. E.g.: if a job exceeds its walltime, an abort email with the message "Job exceeded resource walltime\nSee job standard error file" is sent.
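
All six abort messages listed above share one template and differ only in the resource name. A minimal Python sketch of that template follows; the helper name `abort_message` is hypothetical and is not part of OpenPBS.

```python
def abort_message(resource):
    """Build the abort email body for a job killed for exceeding `resource`.

    The two-line wording follows the message values listed in this document.
    """
    return f"Job exceeded resource {resource}\nSee job standard error file"
```

For example, `abort_message("walltime")` yields the walltime message from the list, with the `\n` rendered as an actual line break.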