PP-339 and PP-647: node ramp down feature: release vnodes early from running jobs
http://community.pbspro.org/t/pp-339-and-pp-647-release-vnodes-early-from-running-jobs/419/44
Objective
This is to introduce the node ramp down feature, which releases no-longer-needed sister nodes/vnodes early from running jobs.
(Forum Discussion: http://community.pbspro.org/t/pp-339-and-pp-647-release-vnodes-early-from-running-jobs)
Interface 1: New command: 'pbs_release_nodes'
Visibility: Public
Change Control: Stable
Synopsis: Releases sister vnodes early from a running job:
pbs_release_nodes -j <job_id> <host_or_vnode> [<host_or_vnode> ...]
pbs_release_nodes -j <job_id> -a
With -a, all the sister vnodes assigned to the job are released; otherwise, only the named hosts/vnodes are released.
Interface 2: New job attribute 'release_nodes_on_stageout'
Visibility: Public
Change Control: Stable
Value: 'true' or 'false'
Synopsis: When set to 'true', this does the equivalent of 'pbs_release_nodes -a', releasing all the sister vnodes when the stageout operation begins.
Example:
% qsub -W stageout=my_stageout@federer:my_stageout.out -W release_nodes_on_stageout=true job.scr
This can also be specified in the server attribute 'default_qsub_arguments' so that all jobs are submitted with release_nodes_on_stageout set by default.
If no stageout parameter was specified, then release_nodes_on_stageout is not consulted, even if it is set to true.
The use of this attribute is not currently supported with nodes/vnodes that are tied to Cray X* series systems. These are nodes/vnodes whose 'vntype' value matches the "cray_" prefix.
It is also not supported with nodes/vnodes managed by cpuset moms, since partial release of vnodes may result in leftover cpusets. These are the vnodes whose 'arch' attribute value is "linux_cpuset".
If cgroups support is enabled, and this option is used to release some but not all of the vnodes from the same mom host, resources on those vnodes that are part of a cgroup are not automatically released until the entire cgroup is released.
This attribute can also be set in a queuejob or modifyjob hook; the Python type is boolean, with valid values True and False.
Example:
# cat qjob.py
import pbs
e=pbs.event()
e.job.release_nodes_on_stageout = True
# qmgr -c "create hook qjob event=queuejob"
# qmgr -c "import hook application/x-python default qjob.py"
% qsub job.scr
23.borg
% qstat -f 23
...
release_nodes_on_stageout = True
Interface 3: New server accounting record: 'u' for update record
Visibility: Public
Change Control: Stable
Synopsis: For every release nodes action, an accounting_logs record is written: the 'u' (for update) record.
Details: The 'u' record represents a just-concluded phase of the job: the set of resources assigned to the job (exec_vnode, exec_host, Resource_List items) and the amount of resources used (resources_used) during that phase of the job.
Example:
% qsub -l select=3:ncpus=1:mem=1gb job.scr
242.borg
% qstat -f 242 | egrep "exec|Resource_List|select"
exec_host = borg/0+federer/0+lendl/0
exec_vnode = (borg[0]:ncpus=1:mem=1048576kb)+(federer:ncpus=1:mem=1048576kb)+(lendl:ncpus=1:mem=1048576kb)
Resource_List.mem = 3gb
Resource_List.ncpus = 3
Resource_List.nodect = 3
Resource_List.place = scatter
Resource_List.select = 3:ncpus=1:mem=1gb
schedselect = 3:ncpus=1:mem=1gb
% pbs_release_nodes -j 242 lendl
Accounting logs show:
# tail -f /var/spool/PBS/server_priv/accounting/20170123
01/23/2017 18:53:24;u;242.borg;user=bayucan group=users project=_pbs_project_default jobname=STDIN queue=workq ctime=1485215572 qtime=1485215572 etime=1485215572 start=1485215572 session=7503 run_count=1 exec_host=borg/0+federer/0+lendl/0 exec_vnode=(borg[0]:ncpus=1:mem=1048576kb)+(federer:ncpus=1:mem=1048576kb)+(lendl:ncpus=1:mem=1048576kb) Resource_List.mem=3gb Resource_List.ncpus=3 Resource_List.nodect=3 Resource_List.place=scatter Resource_List.select=3:ncpus=1:mem=1gb resources_used.cpupercent=5 resources_used.cput=00:04:35 resources_used.mem=4288kb resources_used.ncpus=3 resources_used.vmem=42928kb resources_used.walltime=00:00:26
Another pbs_release_nodes call yields:
% pbs_release_nodes -j 242 federer
# tail -f /var/spool/PBS/server_priv/accounting/20170123
01/23/2017 18:59:35;u;242.borg;user=bayucan group=users project=_pbs_project_default jobname=STDIN queue=workq ctime=1485215572 qtime=1485215572 etime=1485215572 start=1485215572 session=7773 run_count=1 exec_host=borg/0+federer/0 exec_vnode=(borg[0]:ncpus=1:mem=1048576kb)+(federer:ncpus=1:mem=1048576kb) Resource_List.mem=2097152kb Resource_List.ncpus=2 Resource_List.nodect=2 Resource_List.place=scatter Resource_List.select=1:ncpus=1:mem=1048576kb+1:ncpus=1:mem=1048576kb resources_used.cpupercent=3 resources_used.cput=00:03:35 resources_used.mem=2048kb resources_used.ncpus=2 resources_used.vmem=32928kb resources_used.walltime=00:00:26
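A log parser consuming these 'u' records might split each line on the first three semicolons and then on spaces. A minimal sketch in Python (the helper name is ours, and values are assumed to contain no embedded spaces, as in the records above):

```python
def parse_accounting_record(line):
    """Split a PBS accounting record into (timestamp, type, jobid, fields).

    Assumes the format shown above:
    <date time>;<record-type>;<job-id>;k1=v1 k2=v2 ...
    """
    timestamp, rtype, jobid, rest = line.split(";", 3)
    fields = dict(pair.split("=", 1) for pair in rest.split())
    return timestamp, rtype, jobid, fields

# A shortened 'u' record, based on the example above:
record = ("01/23/2017 18:59:35;u;242.borg;user=bayucan group=users "
          "resources_used.ncpus=2 resources_used.walltime=00:00:26")
ts, rtype, jobid, fields = parse_accounting_record(record)
print(rtype, jobid, fields["resources_used.ncpus"])
```

Note that exec_vnode and Resource_List.select values contain '+' and parentheses but no spaces, so the space split above still isolates each key=value pair.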
Interface 4: New server accounting record: 'c' for continue record:
Visibility: Public
Change Control: Stable
Synopsis: The 'c' accounting record shows the next assigned exec_vnode, exec_host, and Resource_List items, along with the job attributes in the new/next phase of the job. It is generated for every release nodes action and is paired with the 'u' accounting record (Interface 3).
Given the following example:
% qsub -l select=3:ncpus=1:mem=1gb job.scr
242.borg
% qstat -f 242 | egrep "exec|Resource_List|select"
exec_host = borg/0+federer/0+lendl/0
exec_vnode = (borg[0]:ncpus=1:mem=1048576kb)+(federer:ncpus=1:mem=1048576kb)+(lendl:ncpus=1:mem=1048576kb)
Resource_List.mem = 3gb
Resource_List.ncpus = 3
Resource_List.nodect = 3
Resource_List.place = scatter
Resource_List.select = 3:ncpus=1:mem=1gb
schedselect = 3:ncpus=1:mem=1gb
% pbs_release_nodes -j 242 lendl
Accounting logs show:
# tail -f /var/spool/PBS/server_priv/accounting/20170123
01/23/2017 18:53:24;u;242.borg;user=bayucan group=users project=_pbs_project_default jobname=STDIN queue=workq ctime=1485215572 qtime=1485215572 etime=1485215572 start=1485215572 session=7503 run_count=1 exec_host=borg/0+federer/0+lendl/0 exec_vnode=(borg[0]:ncpus=1:mem=1048576kb)+(federer:ncpus=1:mem=1048576kb)+(lendl:ncpus=1:mem=1048576kb) Resource_List.mem=3gb Resource_List.ncpus=3 Resource_List.nodect=3 Resource_List.place=scatter Resource_List.select=3:ncpus=1:mem=1gb resources_used.cpupercent=5 resources_used.cput=00:04:35 resources_used.mem=4288kb resources_used.ncpus=3 resources_used.vmem=42928kb resources_used.walltime=00:00:26
01/23/2017 18:53:24;c;242.borg;user=bayucan group=users project=_pbs_project_default jobname=STDIN queue=workq ctime=1485215572 qtime=1485215572 etime=1485215572 start=1485215572 session=7503 run_count=1 exec_host=borg/0+federer/0 exec_vnode=(borg[0]:ncpus=1:mem=1048576kb)+(federer:ncpus=1:mem=1048576kb) Resource_List.mem=2097152kb Resource_List.ncpus=2 Resource_List.nodect=2 Resource_List.place=scatter updated_Resource_List.select=1:ncpus=1:mem=1048576kb+1:ncpus=1:mem=1048576kb resources_used_incr.cpupercent=5 ...
Another pbs_release_nodes call yields records with the 'federer' vnode assignment gone:
% pbs_release_nodes -j 242 federer
# tail -f /var/spool/PBS/server_priv/accounting/20170123
01/23/2017 18:53:24;u;242.borg;user=bayucan group=users project=_pbs_project_default jobname=STDIN queue=workq ctime=1485215572 qtime=1485215572 etime=1485215572 start=1485215572 session=7503 run_count=1 exec_host=borg/0+federer/0 exec_vnode=(borg[0]:ncpus=1:mem=1048576kb)+(federer:ncpus=1:mem=1048576kb) Resource_List.mem=2097152kb Resource_List.ncpus=2 Resource_List.nodect=2 Resource_List.place=scatter
01/23/2017 18:53:24;c;242.borg;user=bayucan group=users project=_pbs_project_default jobname=STDIN queue=workq ctime=1485215572 qtime=1485215572 etime=1485215572 start=1485215572 session=7773 run_count=1 exec_host=borg/0 exec_vnode=(borg[0]:ncpus=1:mem=1048576kb) Resource_List.mem=1048576kb Resource_List.ncpus=1 Resource_List.nodect=1 Resource_List.place=scatter Resource_List.select=1:ncpus=1:mem=1048576kb
Interface 5: New server accounting record: 'e' (end), the end-of-job record for a phased job
Visibility: Public
Change Control: Stable
Synopsis: The 'e' accounting record shows the resources assigned to the job (exec_vnode, exec_host, Resource_List items) and the amount of resources used (resources_used) during the last phase of the job.
Details: It is up to the log parser to take all the 'u' records and the 'e' record of the job and either sum the resources_used.* values (e.g. resources_used.walltime) or average them, whichever makes sense (e.g. resources_used.ncpus). Note that the regular 'E' (end) accounting record continues to be generated for every job, whether or not it has released nodes, showing the job's total values at the end of the job.
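As a sketch of the summing case, assuming records in the format shown under Interface 3 (the helper names are ours, and the example records below are shortened for illustration):

```python
def hhmmss_to_seconds(value):
    """Convert an HH:MM:SS accounting value to seconds."""
    h, m, s = (int(part) for part in value.split(":"))
    return h * 3600 + m * 60 + s

def total_cput(records, jobid):
    """Sum resources_used.cput over the 'u' and 'e' records of one job."""
    total = 0
    for line in records:
        _ts, rtype, rec_jobid, rest = line.split(";", 3)
        if rec_jobid != jobid or rtype not in ("u", "e"):
            continue
        fields = dict(pair.split("=", 1) for pair in rest.split())
        total += hhmmss_to_seconds(fields["resources_used.cput"])
    return total

records = [
    "01/23/2017 18:53:24;u;242.borg;resources_used.cput=00:04:35",
    "01/23/2017 18:59:35;u;242.borg;resources_used.cput=00:03:35",
    "01/23/2017 19:02:10;e;242.borg;resources_used.cput=00:01:00",
]
print(total_cput(records, "242.borg"))  # 275 + 215 + 60 = 550 seconds
```

The same loop could instead average a per-phase value such as resources_used.ncpus, where summing across phases would overcount.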
Interface 6: Additions to log messages
Visibility: Public
Change Control: Stable
Details:
Special mom_logs messages:
A pbs_release_nodes request causes the server to send a job update to the mother superior (MS) of the job. The MS in turn looks into the list of nodes being removed; if a node is the last one assigned to the job from its host, MS sends a new DELETE_JOB2 request to the owning sister mom. Upon receiving this request, the sister mom kills the job processes on that node and sends back to the mother superior the summary accounting information for the job on that node. Mom_logs will show the following DEBUG messages:
sister mom_logs: "DELETE_JOB2 received"
mother superior mom_logs: "<reporting_sister_host>;cput=YY mem=ZZ"
Special server_logs messages:
When a job has been completely removed from an early released vnode, the following DEBUG2 message will be shown:
"clearing job <job-id> from node <vnode-name>
"Node<sister-mom-hostname>;deallocating 1 cpu(s) from job <job-id>
Interface 7: New server attribute 'show_hidden_attribs'
Visibility: Private
Change Control: Unstable
Value: 'true' or 'false'
Synopsis: When set to 'true', this allows qstat -f to also show the values of internal attributes created by the server to implement the node ramp down feature. Example internal job attributes whose values may be shown are exec_vnode_orig, exec_vnode_acct, exec_vnode_deallocated, exec_host_orig, exec_host_acct, Resource_List_orig, Resource_List_acct.
Note: This attribute is provided as an aid to debugging PBS.