Exceeded resources notification
Overview:
Before this change, a job that exceeds a resource limit is reported only on the job's stderr; the email notification is either not sent or is imprecise. The job comment on the server also lacks any information about the exceeded resource.
This document proposes a distinct job exit code for each resource that can be exceeded. A job killed for exceeding a resource will also be treated as an aborted job, which means an email (with the appropriate message) will be sent to the user. In addition, the job comment on the server will be set to a suitable message.
Resources in OpenPBS that a job can exceed: ncpus (burst), ncpus (sum), vmem, mem, cput, walltime
Interfaces:
Interfaces job exit codes:
Values:
JOB_EXEC_KILL_NCPUS_BURST -24
JOB_EXEC_KILL_NCPUS_SUM -25
JOB_EXEC_KILL_VMEM -26
JOB_EXEC_KILL_MEM -27
JOB_EXEC_KILL_CPUT -28
JOB_EXEC_KILL_WALLTIME -29
Visibility: public
Synopsis: job exit code
Details: The exit code is a value that the MoM sends to the server to report how the job ended. The new exit codes tell the server which resource limit caused the job to be killed. For example, when a job is killed for exceeding its walltime, the exit code JOB_EXEC_KILL_WALLTIME is returned to the server.
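The exit code values above can be sketched as an enumeration with a small lookup helper. This is an illustrative Python sketch only, not the OpenPBS implementation: the class name JobExecKill and the function classify_exit_code are hypothetical, and only the numeric values come from this document.

```python
from enum import IntEnum
from typing import Optional


class JobExecKill(IntEnum):
    """Exit codes from the table above; the class name is hypothetical."""
    NCPUS_BURST = -24
    NCPUS_SUM = -25
    VMEM = -26
    MEM = -27
    CPUT = -28
    WALLTIME = -29


def classify_exit_code(exit_code: int) -> Optional[JobExecKill]:
    """Return the matching resource-kill code, or None for any other exit code."""
    try:
        return JobExecKill(exit_code)
    except ValueError:
        return None
```

With this sketch, classify_exit_code(-29) yields JobExecKill.WALLTIME, while an exit code outside the range -24 to -29 yields None.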
Interfaces job comments:
Values:
Job run at … on … and exceeded resource ncpus (burst)
Job run at … on … and exceeded resource ncpus (sum)
Job run at … on … and exceeded resource vmem
Job run at … on … and exceeded resource mem
Job run at … on … and exceeded resource cput
Job run at … on … and exceeded resource walltime
Visibility: public
Synopsis: job comment
Details: If the job is killed for exceeding a resource, the job comment is set to the corresponding value. For example, when a job exceeds its walltime, the job comment is set to: “Job run at … on … and exceeded resource walltime”.
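A minimal Python sketch of how such a comment could be composed from the exit code. The names EXCEEDED_RESOURCE and exceeded_comment are hypothetical, and the existing "Job run at … on …" text is taken as an input rather than generated, since its exact contents are not specified here.

```python
# Exit code -> resource label, using the values listed in this document.
EXCEEDED_RESOURCE = {
    -24: "ncpus (burst)",
    -25: "ncpus (sum)",
    -26: "vmem",
    -27: "mem",
    -28: "cput",
    -29: "walltime",
}


def exceeded_comment(run_comment: str, exit_code: int) -> str:
    """Append the exceeded-resource suffix to an existing job comment.

    run_comment is the "Job run at ... on ..." text the job already has
    (assumed to be supplied by the caller). Exit codes that are not
    resource kills leave the comment unchanged.
    """
    resource = EXCEEDED_RESOURCE.get(exit_code)
    if resource is None:
        return run_comment
    return f"{run_comment} and exceeded resource {resource}"
```

Passing the existing comment in as a parameter keeps the sketch independent of how the run time and host are formatted.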
Interfaces email abort messages:
Values:
"Job exceeded resource ncpus (burst)\nSee job standard error file"
"Job exceeded resource ncpus (sum)\nSee job standard error file"
"Job exceeded resource vmem\nSee job standard error file"
"Job exceeded resource mem\nSee job standard error file"
"Job exceeded resource cput\nSee job standard error file"
"Job exceeded resource walltime\nSee job standard error file"
Visibility: public
Synopsis: email message on job abort
Details: When a job is killed for exceeding a resource, the corresponding abort email is sent. For example, when a job exceeds its walltime, an abort email with the message "Job exceeded resource walltime\nSee job standard error file" is sent.
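The abort messages above follow a single pattern, which can be sketched in Python as below. The helper name abort_message and the lookup table are hypothetical; the message text and exit code values are the ones listed in this document, and how the email is actually sent is outside the scope of the sketch.

```python
from typing import Optional

# Exit code -> resource label, using the values listed in this document.
RESOURCE_BY_EXIT_CODE = {
    -24: "ncpus (burst)",
    -25: "ncpus (sum)",
    -26: "vmem",
    -27: "mem",
    -28: "cput",
    -29: "walltime",
}


def abort_message(exit_code: int) -> Optional[str]:
    """Return the abort email text for a resource-kill exit code, else None."""
    resource = RESOURCE_BY_EXIT_CODE.get(exit_code)
    if resource is None:
        return None
    # Two lines: the exceeded resource, then a pointer to the stderr file.
    return f"Job exceeded resource {resource}\nSee job standard error file"
```

For exit code -29 this returns the two-line walltime message; for any exit code that is not a resource kill it returns None, so the caller can fall back to its existing behaviour.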