Server crashed after requesting multi-chunk reservation with scatter.

Description

The PBS server crashed after requesting a multiple-chunk reservation with place=scatter.

The following is the stack trace with debug info:

  1. gdb $PBS_EXEC/sbin/pbs_server.bin /var/spool/pbs/server_priv/core
    GNU gdb (GDB; SUSE Linux Enterprise 12) 7.7
    Copyright (C) 2014 Free Software Foundation, Inc.
    License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
    This is free software: you are free to change and redistribute it.
    There is NO WARRANTY, to the extent permitted by law. Type "show copying"
    and "show warranty" for details.
    This GDB was configured as "x86_64-suse-linux".
    Type "show configuration" for configuration details.
    For bug reporting instructions, please see:
    <http://bugs.opensuse.org/>.
    Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.
    For help, type "help".
    Type "apropos word" to search for commands related to "word"...

warning: /etc/gdbinit.d/gdb-heap.py: No such file or directory
/opt/pbs123/sbin/pbs_server.bin: No such file or directory.
[New LWP 8004]
[New LWP 8006]
Reading symbols from /opt/pbs/sbin/pbs_server.bin...Reading symbols from /usr/lib/debug/opt/pbs/sbin/pbs_server.bin.debug...done.
done.

warning: Ignoring non-absolute filename: <linux-vdso.so.1>
Missing separate debuginfo for linux-vdso.so.1
Try: zypper install -C "debuginfo(build-id)=3c8095bc13d95c966b8b8c2123f14ef3eac4372c"

warning: Could not load shared library symbols for /tmp/xf-dll/xf-8004861d44a89f0562ba4fe42c4861d3fb60.tmp.
Do you need "set solib-search-path" or "set sysroot"?
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/opt/pbs/sbin/pbs_server.bin'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x000000000049d280 in req_confirmresv (preq=<optimized out>)
at /home/pbsbuild/ramdisk/workspace/build/pbspro/src/server/req_rescq.c:700
700 /home/pbsbuild/ramdisk/workspace/build/pbspro/src/server/req_rescq.c: No such file or directory.
Missing separate debuginfos, use: zypper install glibc-debuginfo-2.19-17.72.x86_64 libopenssl1_0_0-debuginfo-1.0.1i-2.12.x86_64 libz1-debuginfo-1.2.8-5.1.x86_64
(gdb)
(gdb) where
#0 0x000000000049d280 in req_confirmresv (preq=<optimized out>)
at /home/pbsbuild/ramdisk/workspace/build/pbspro/src/server/req_rescq.c:700
#1 0x383778282b29313d in ?? ()
#2 0x7570636e3a36345f in ?? ()
#3 0x3778282b29313d73 in ?? ()
#4 0x70636e3a37345f38 in ?? ()
#5 0x78282b29313d7375 in ?? ()
#6 0x636e3a38345f3837 in ?? ()
<...snip...>
#284 0x2b29313d73757063 in ?? ()
#285 0x3139315f38377828 in ?? ()
#286 0x313d737570636e3a in ?? ()
#287 0x0000000000000029 in ?? ()
#288 0x0000000000000000 in ?? ()
(gdb)

Acceptance Criteria

None

Activity

Bhroam Mann
October 12, 2017, 10:35 PM

It's a buffer overrun. The req_confirmresv() function has a buffer which is ~600 characters long. It copies the accounting log record into this buffer using sprintf(). The accounting log includes the resv_nodes which can be very long.

Two things need to happen: the buffer needs to be much larger (or dynamically grown), and snprintf() needs to be used instead of sprintf().

Line 650:
(void)sprintf(buf, "requestor=%s@%s start=%ld end=%ld nodes=%s",
preq->rq_user, preq->rq_host,
presv->ri_qs.ri_stime, presv->ri_qs.ri_etime,
next_execvnode);

The crash happens at line 658:
free(next_execvnode);

The buf variable is the first variable declared, so when the sprintf() overflows, the variables declared after it are all overwritten. By the time this free() is called, next_execvnode is garbage.
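
The backtrace in the description looks consistent with this: the bogus frame addresses (#1, #2, ...) read as ASCII text when their bytes are taken least significant first, which is the order they sit in memory on x86_64. Below is a minimal sketch for checking that. The helper name frames_to_ascii and the sample_frames array are made up for illustration and are not part of PBS; the six values are copied from frames #1 through #6 of the trace.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, for illustration only (not PBS code).
 * Unpacks each 64-bit value into 8 bytes, least significant byte first,
 * and writes printable ASCII to out ('.' for anything else).
 * out must have room for n * 8 + 1 bytes. */
void frames_to_ascii(const uint64_t *frames, size_t n, char *out)
{
    size_t i, k = 0;
    int b;

    for (i = 0; i < n; i++) {
        for (b = 0; b < 8; b++) {
            unsigned char c = (unsigned char)(frames[i] >> (8 * b));
            out[k++] = (c >= 0x20 && c < 0x7f) ? (char)c : '.';
        }
    }
    out[k] = '\0';
}

/* Frames #1 through #6 from the backtrace in the description. */
static const uint64_t sample_frames[6] = {
    0x383778282b29313dULL,
    0x7570636e3a36345fULL,
    0x3778282b29313d73ULL,
    0x70636e3a37345f38ULL,
    0x78282b29313d7375ULL,
    0x636e3a38345f3837ULL
};
```

Decoding those six frames gives "=1)+(x78_46:ncpus=1)+(x78_47:ncpus=1)+(x78_48:nc", which looks like a run of execvnode chunks written over the saved frames.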

The crash should happen on any moderately large reservation. I used one of 128 nodes, but I suspect one much smaller could cause the crash.
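
A rough sketch of the "dynamically grown plus snprintf()" option, assuming a C99 snprintf() (which returns the length it would have written when given a zero size). The helper name format_confirm_record and the RECORD_FMT macro are hypothetical; this is not the actual change to req_confirmresv(), only the sizing pattern with the same format string as line 650.

```c
#include <stdio.h>
#include <stdlib.h>

/* Same format string as the sprintf() call at line 650. */
#define RECORD_FMT "requestor=%s@%s start=%ld end=%ld nodes=%s"

/* Hypothetical helper, not part of the PBS source.
 * Builds the record in a heap buffer sized to fit, however long
 * nodes is. Returns NULL on failure; the caller frees the result. */
char *format_confirm_record(const char *user, const char *host,
                            long stime, long etime, const char *nodes)
{
    int need;
    char *buf;

    /* First pass: size 0 writes nothing and returns the needed length. */
    need = snprintf(NULL, 0, RECORD_FMT, user, host, stime, etime, nodes);
    if (need < 0)
        return NULL;

    buf = malloc((size_t)need + 1);
    if (buf == NULL)
        return NULL;

    /* Second pass: bounded write into a buffer known to be large enough. */
    (void)snprintf(buf, (size_t)need + 1, RECORD_FMT,
                   user, host, stime, etime, nodes);
    return buf;
}
```

If a fixed-size buffer is kept instead, the same return value can be compared against sizeof(buf) to detect truncation rather than silently writing past the end.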

Assignee

Prakash Varandani

Reporter

Voltaire Cardenas

Severity

4-Critical

OS

None

Start Date

None

Pull Request URL

None

Story Points

1

Components

Affects versions

Priority

Blocker