Forum discussion link: http://community.pbspro.org/t/external-design-document-for-pp-824-cray-ramp-rate-limiting/693

...

  • Interface 3: New node attribute: poweroff_eligible
    • Change Control: Stable
    • Synopsis: Node attribute for power control.
    • Details: This new node attribute controls whether a node may be powered off.
      • PBS type: Boolean
      • Python type: Boolean
      • Default value: False
      • Managers have set permission. All users have read permission.
      • To change the value from its default, use qmgr:
        • qmgr -c "set node <node_name> poweroff_eligible=True"
  • Interface 4: New node attribute: last_state_change_time
    • Change Control: Stable
    • Synopsis: Read-only node attribute that records a timestamp.
    • Details: This new node attribute is updated with a timestamp whenever the node changes from its current state to a new state.
      • Managers and Operators have read permission.
      • The node status command pbsnodes converts the internal date format (seconds since epoch) to a human-readable format and displays the value of this attribute in "MON DD YY HH:MM:SS" format.
      • PBS type: long
      • Python type: int
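The conversion from seconds since epoch to the displayed format can be sketched as follows. This is a minimal illustration, not the pbsnodes implementation: the function name is hypothetical, "MON" is assumed to be the abbreviated month name and "YY" a two-digit year, and UTC is used only to keep the example deterministic (pbsnodes presumably displays local time).

```python
import time


def format_node_timestamp(epoch_seconds: int) -> str:
    """Render seconds since the epoch as 'MON DD YY HH:MM:SS' (assumed layout)."""
    # %b = abbreviated month name, %d = day of month, %y = two-digit year.
    # gmtime() is used so the result does not depend on the local time zone.
    return time.strftime("%b %d %y %H:%M:%S", time.gmtime(epoch_seconds))


if __name__ == "__main__":
    # 1459217159 corresponds to 29 March 2016, 02:05:59 UTC.
    print(format_node_timestamp(1459217159))
```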
  • Interface 5: New node attribute: last_used_time
    • Change Control: Stable
    • Synopsis: Read-only node attribute that records a timestamp.
    • Details: This new node attribute is updated with a timestamp at the end of any job or reservation.
      • If a node is released early from a running job, this timestamp is also updated.
      • The node status command pbsnodes converts the internal date format (seconds since epoch) to a human-readable format and displays the value of this attribute in "MON DD YY HH:MM:SS" format.
      • The attribute is reset when the node is ramped up.
      • Managers and Operators have read permission.
      • For vnodes, this attribute is first set to the current timestamp when the vnodes are created or when the nodes are rebooted.
      • This attribute can be used in sched_config as a node_sort_key, which sorts the nodes by their last used time.
      • PBS type: long
      • Python type: int
      • Example:
        • node_sort_key: "last_used_time HIGH"
        • node_sort_key: "last_used_time LOW"
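The effect of the two sort directions can be illustrated with a small sketch. It assumes that HIGH means descending order (most recently used nodes first) and LOW means ascending order, as for other node_sort_key entries; the function name and the dictionary layout are hypothetical and are not part of the scheduler.

```python
def sort_nodes_by_last_used(nodes, order="HIGH"):
    """Return nodes ordered by last_used_time (epoch seconds).

    Assumption: HIGH sorts descending (most recently used first),
    LOW sorts ascending (longest idle first).
    """
    if order not in ("HIGH", "LOW"):
        raise ValueError("order must be 'HIGH' or 'LOW'")
    return sorted(
        nodes,
        key=lambda node: node["last_used_time"],
        reverse=(order == "HIGH"),
    )


if __name__ == "__main__":
    nodes = [
        {"name": "nid24", "last_used_time": 1459217159},
        {"name": "nid25", "last_used_time": 1459210000},
        {"name": "nid26", "last_used_time": 1459220000},
    ]
    for order in ("HIGH", "LOW"):
        print(order, [n["name"] for n in sort_nodes_by_last_used(nodes, order)])
```

With LOW, the nodes that have been idle longest are considered first, which tends to keep recently used nodes free for ramp-down; whether that is the desired policy depends on the site.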
  • Interface 6: New node state: sleep
    • Change Control: Stable
    • Synopsis: New node state indicating that the node has been ramped down or powered off by PBS.
    • Details: This new node state is set when nodes are ramped down or powered off by PBS via the power ramp rate limiting or power on/off feature.
      • A server periodic hook (the PBS hook PBS_power, provided as part of the PBS package) runs every $freq seconds, takes a list of vnodes to ramp down or power off, and marks them with the new sleep node state.
      • At most max_concurrent_power_limit nodes are ramped down or powered off every freq seconds, where freq is the server periodic hook frequency.
      • The scheduler can now consider nodes in the sleep state for running jobs.
      • The server periodic hook can ramp up or power on nodes that are in the sleep state as required. The requirement is calculated by analyzing the jobs' estimated start times and exec_vnode lists.
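The per-run decisions described for the periodic hook can be sketched in plain Python. This is not the PBS_power hook itself and does not use the pbs module: the dictionary keys, the function names, the "longest idle first" ordering, the restriction to nodes in state free, and the lead_time parameter are all assumptions made for illustration. Only the cap of max_concurrent_power_limit nodes per run and the use of estimated start times and exec_vnode come from the design.

```python
def pick_ramp_down(nodes, max_concurrent_power_limit):
    """Pick at most max_concurrent_power_limit nodes to ramp down in one hook run."""
    # Assumption: only free nodes with poweroff_eligible=True are candidates.
    candidates = [
        n for n in nodes if n["poweroff_eligible"] and n["state"] == "free"
    ]
    # Assumption: prefer the nodes that have been idle the longest.
    candidates.sort(key=lambda n: n["last_used_time"])
    return [n["name"] for n in candidates[:max_concurrent_power_limit]]


def vnodes_in_exec_vnode(exec_vnode):
    """Extract vnode names from a string like '(vnA:ncpus=2)+(vnB:ncpus=2)'."""
    names = []
    for part in exec_vnode.split("+"):
        name = part.strip("()").split(":")[0]
        if name and name not in names:
            names.append(name)
    return names


def pick_ramp_up(jobs, sleeping, now, lead_time):
    """Pick sleeping vnodes needed by jobs that start within lead_time seconds."""
    needed = []
    for job in jobs:
        if job["estimated_start_time"] - now > lead_time:
            continue  # job starts too far in the future to act on yet
        for vnode in vnodes_in_exec_vnode(job["exec_vnode"]):
            if vnode in sleeping and vnode not in needed:
                needed.append(vnode)
    return needed
```

A real hook would obtain the node and job data from the server and would then invoke the platform power commands on the selected lists.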
  • Interface 7: Log/Error messages.
    • Change Control: Unstable
    • Synopsis: New log/error messages.
    • Details: The new log and error messages introduced by the power ramp rate limiting feature are listed below.

      • Scenario 1: Nodes are being ramped down
        • In server logs:
          • Job;power_ramp_down;launch: /opt/cray/capmc/default/bin/capmc set_sleep_state_limit --nids 24-25 --limit 4
          • Job;power_ramp_down;launch: finished
        • Log level: LOG_INFO
      • Scenario 2: Nodes are being ramped up
        • In server logs:
          • Job;power_ramp_up;launch: /opt/cray/capmc/default/bin/capmc set_sleep_state_limit --nids 24-25 --limit 0
          • Job;power_ramp_up;launch: finished
        • Log level: LOG_INFO
      • Scenario 3: Server periodic hook output
        • In server logs:
          • power_ramp_limit: nodes to ramp up: <node_list>
          • power_ramp_limit: nodes to ramp down: <node_list>
        • Log level: LOG_INFO
      • Scenario 4: Nodes are being powered off
        • In server logs:
          • 03/29/2016 02:05:59;0008;Server@sdb;Job;node_power_off;launch: /opt/cray/capmc/default/bin/capmc node_off --nids 24-25
          • 03/29/2016 02:06:01;0008;Server@sdb;Job;node_power_off;launch: finished
        • Log level: LOG_INFO
      • Scenario 5: Nodes are being powered on
        • In server logs:
          • 03/29/2016 02:05:59;0008;Server@sdb;Job;node_power_on;launch: /opt/cray/capmc/default/bin/capmc node_on --nids 24-25
          • 03/29/2016 02:06:01;0008;Server@sdb;Job;node_power_on;launch: finished
        • Log level: LOG_INFO
  • Interface 8: Node state Stale
    • Change Control: Stable
    • Synopsis: New behaviour.
    • Details: Node state Stale can be set from hooks and qmgr.
      • qmgr -c "set node <node_name> state=Stale"
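For anyone who needs to pick the power events listed under Interface 7 out of the server logs, the following is a minimal parsing sketch. It assumes the semicolon-separated layout seen in the timestamped examples (timestamp, event code, server name, object type, object name, message). The function names are hypothetical, and support for comma-separated nid lists is an assumption, since the examples only show a single range such as 24-25.

```python
import re


def parse_server_log_line(line):
    """Split one timestamped server log line into its six fields."""
    # Split at most 5 times so semicolons inside the message are preserved.
    timestamp, code, server, obj_type, obj_name, message = line.split(";", 5)
    return {
        "timestamp": timestamp,
        "event_code": code,
        "server": server,
        "object_type": obj_type,
        "object_name": obj_name,
        "message": message,
    }


def expand_nids(spec):
    """Expand a nid spec such as '24-25' (or, by assumption, '1,3-5') to ints."""
    nids = []
    for part in spec.split(","):
        if "-" in part:
            first, last = part.split("-", 1)
            nids.extend(range(int(first), int(last) + 1))
        else:
            nids.append(int(part))
    return nids


def nids_from_message(message):
    """Return the nids passed via --nids in a launch message, or [] if absent."""
    match = re.search(r"--nids\s+(\S+)", message)
    return expand_nids(match.group(1)) if match else []


if __name__ == "__main__":
    line = (
        "03/29/2016 02:05:59;0008;Server@sdb;Job;node_power_off;"
        "launch: /opt/cray/capmc/default/bin/capmc node_off --nids 24-25"
    )
    record = parse_server_log_line(line)
    print(record["object_name"], nids_from_message(record["message"]))
```

Because these messages are marked Unstable, any such parser should tolerate lines that do not match.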