...

  • mem_fences set to true is unsafe unless the workload is organised so that no NUMA vnode is shared with jobs that are not entirely contained within that vnode.

    When the (documented) safety rules are not followed, it is possible to deplete the physical memory on a NUMA node even though no job exceeds its cgroup memory usage limits. The fences may then force jobs to use swap instead of physical memory; when there is no swap, or once swap is fully depleted, the jobs are killed by the OOM killer through no fault of their own (only because a site admin inadvertently failed to follow the safety restrictions for the workload). Since the cause is resource depletion in swap or on a NUMA node, there are no hints in the job's stdout/stderr.

    We have seen at least three sites completely baffled by that behaviour.

    mem_fences was set to true by default to mimic the behaviour of the cpuset MoM, but the context for enabling memory fences in the cpuset MoM was different: that MoM has no separate cgroup memory controller, so memory fences are the only way to contain rogue jobs that spread beyond their assigned NUMA nodes by allocating more memory than they reserved.

    With the cgroup hook, there is now a separate memory controller to stop rogue jobs from allocating too much memory.

    With this in mind, enabling memory fences by default in the cgroup hook is probably seriously misguided.

    If sites adhere to the restrictions necessary to make memory fences safe, they can enable them, but they should not do so before reading the documentation. Enabling fences by default defeats that: they are then active even when site admins have no idea of the implications. A sketch of the safer configuration follows this list.

  • the 'memory' section reserve_amount used to be 64MB. Sites have invariably felt the OOM killer's wrath when they did not tune this: if the jobs really use all the memory they requested, the OS processes need far more than 64MB, so the host runs out of memory and the OOM killer kills a job anyway. 1GB is a more reasonable default for today's clusters, even though it is still on the low side for most of them (5-6GB are more typical sane values on HPC nodes running a fairly substantial software stack and accessing a remote shared filesystem).

    We should not ship a default that is invariably too low for any modern compute node. Sites often fail to adjust it and are then puzzled by the inevitable job failures. It is better for the default to be slightly more conservative: the net effect is to reduce resources_available.mem on the nodes, and if sites wonder why 1GB is missing, that will usually prompt them to read the documentation. The arithmetic is sketched after this list.
     
  • vnode_hidden_mb needs to be at least 1 in all sections of the cgroup configuration file. For example, when a job requests some 'mem' but no 'vmem' and enforce_defaults is false for memsw, the cgroup memsw limit is raised to all memory available to jobs. To the scheduler, however, such jobs use zero vmem, so when computing the memory already assigned to jobs running on the node, we need to ignore those raised limits.

    When vnode_hidden_mb is 1, the scheduler sees 1MB less than the memory actually available to jobs, so any job whose cgroup limit comes from memory assigned by the scheduler is guaranteed to get a limit lower than an implicitly raised one, which lets us recognize the latter. The comparison is sketched after this list.

  • in contrast to the other values in the "memsw" section, the "default" for memsw used to be inclusive of the default value specified for physical memory in the "memory" subsection.

    That is an aberration: it makes it possible to specify impossible defaults (the kernel will refuse to set memsw to something smaller than mem), it makes it impossible to view the "memory" and "memsw" sections as controlling different resources (physical memory vs. swap), and it makes "default" inconsistent with "reserve_amount" and "reserve_percent", which specify an amount of swap to be added to the physical memory in the "memory" section. The default values must also have an independent meaning in each section if enforcement of the defaults in the two sections can be turned on or off independently; otherwise the behaviour becomes completely baffling (enforcement of the memory default could be turned off, yet its value would still affect the other section).

    Most sites that enable both "memory" and "memsw" expect jobs that do not explicitly request swap to use no swap at all (i.e. to have identical memory and memsw limits). That suggests semantics where "default" in memsw specifies an amount of swap to be added to whatever limit was set for memory, with a default of "0B" so that a job is allowed no swap unless it requests some. This arithmetic is sketched after this list.
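
For the mem_fences discussion above, here is a minimal sketch of the relevant fragment of a cgroup hook configuration, written as a Python dict standing in for the hook's JSON configuration file. The section names and nesting around mem_fences are assumptions about the file layout; only the mem_fences setting itself, and the reasoning for leaving it off, comes from this page.

    # Sketch only: the nesting of sections is an assumption, not the
    # authoritative layout of the cgroup hook configuration file.
    cgroup_config = {
        "cgroup": {
            "cpuset": {
                "enabled": True,
                # Leave memory fences off unless the workload guarantees that
                # no NUMA vnode is shared with jobs spanning multiple vnodes;
                # the cgroup memory controller already contains rogue jobs.
                "mem_fences": False,
            },
        },
    }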
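
A minimal sketch of the reserve_amount arithmetic, with made-up node sizes; the function name and numbers are illustrative only, the subtraction itself is the point.

    # Hypothetical illustration: reserve_amount is held back from the host's
    # memory, reducing what is advertised as resources_available.mem.
    def available_to_jobs_mb(host_mem_mb, reserve_amount_mb):
        return host_mem_mb - reserve_amount_mb

    host_mem_mb = 256 * 1024                          # a 256GB node
    print(available_to_jobs_mb(host_mem_mb, 64))      # old default: OS squeezed into 64MB
    print(available_to_jobs_mb(host_mem_mb, 1024))    # proposed default: 1GB held back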
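
A sketch of the comparison that vnode_hidden_mb makes possible. The helper name and values are hypothetical, but the logic follows the item above: the scheduler never sees the hidden megabyte, so a limit equal to the full memory available to jobs can only be an implicitly raised one.

    # Hypothetical helper: decide whether an existing cgroup memory/memsw limit
    # was assigned by the scheduler or implicitly raised to "all memory
    # available to jobs" (e.g. memsw for a job that requested mem but no vmem).
    def is_implicitly_raised(limit_mb, avail_to_jobs_mb, vnode_hidden_mb=1):
        # The scheduler only sees avail_to_jobs_mb - vnode_hidden_mb, so any
        # limit it assigns is at most that; anything larger must have been
        # raised by the hook and should be ignored when summing the memory
        # already assigned to jobs on the node.
        return limit_mb > avail_to_jobs_mb - vnode_hidden_mb

    assert is_implicitly_raised(limit_mb=192_000, avail_to_jobs_mb=192_000)
    assert not is_implicitly_raised(limit_mb=8_192, avail_to_jobs_mb=192_000)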
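
A sketch of the proposed memsw "default" semantics, where the value is extra swap added on top of whatever physical-memory limit was set; the function name and numbers are illustrative.

    # Proposed semantics: memsw "default" is extra swap on top of the memory
    # limit, so "0B" means the job gets no swap unless it asks for some.
    # This construction also guarantees memsw >= mem, which the kernel requires.
    def memsw_limit_mb(mem_limit_mb, memsw_default_extra_mb=0):
        return mem_limit_mb + memsw_default_extra_mb

    print(memsw_limit_mb(4096))          # mem and memsw limits identical: 4096
    print(memsw_limit_mb(4096, 1024))    # site allows 1GB of swap by default: 5120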

Implementation note about 0 memory size. 

...