Workers dying with "out of memory in new"

We must have crossed some threshold yesterday. Suddenly we are suffering an epidemic of workers dying with “out of memory in new” even though we made no changes. Previously, we would have a few die each day. Now we have had 250 alerts of workers dying and being restarted from 00:00 to 10:00. I have no idea where to start debugging the problem. Any suggestions?

What causes a worker to die by running out of memory? The sensors have lots of memory (see below), so I would not expect any out-of-memory deaths. (To monitor the problem, I am in the process of setting up collectd and Grafana.)
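Until proper monitoring is in place, a quick way to see whether individual workers are growing is to read the resident set size of each worker from /proc. The following is only a minimal sketch: it assumes a Linux host and that the worker processes are named "bro" (2.x) or "zeek" (3.x). The function names are made up for illustration.

```python
import os


def parse_status(text):
    """Return (process name, resident set size in kB) from /proc/<pid>/status text."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key] = value.strip()
    name = fields.get("Name", "")
    # Kernel threads have no VmRSS line; treat them as using 0 kB.
    rss_kb = int(fields.get("VmRSS", "0 kB").split()[0])
    return name, rss_kb


def worker_memory(names=("bro", "zeek"), proc="/proc"):
    """Yield (pid, name, rss_kb) for every running process whose name matches.

    The process names are an assumption; adjust them to your installation.
    """
    for entry in os.listdir(proc):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc, entry, "status")) as fh:
                name, rss_kb = parse_status(fh.read())
        except OSError:
            continue  # the process exited while we were scanning
        if name in names:
            yield int(entry), name, rss_kb


if __name__ == "__main__":
    # Largest consumers first.
    for pid, name, rss_kb in sorted(worker_memory(), key=lambda row: -row[2]):
        print(f"{pid:>7} {name:<6} {rss_kb / 1024 / 1024:6.2f} GiB")
```

Running this from cron every minute and appending the output to a file gives a rough growth curve per worker, which helps tell a slow leak from a single huge allocation.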

Some details:

  • 5 sensors, each with 16-core, AMD Epyc 7351P, 128 GB RAM, Intel X520-T2
  • Zeek 2.6.1
  • node.cfg: lb_procs=15, pin_cpus=1-15, af_packet_buffer_size=1*1024*1024*1024
  • broctl.cfg: setcap enabled
  • Not shunting any traffic
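
As a sanity check on the settings above, the memory reserved for AF_PACKET ring buffers can be estimated. This is a rough sketch under two assumptions that the post does not confirm: that the buffer size is meant to be 1*1024*1024*1024 bytes (1 GiB), and that every worker process allocates its own ring of that size.

```python
def af_packet_ring_total(buffer_size_bytes, lb_procs):
    """Total ring-buffer memory if each worker allocates its own buffer (assumption)."""
    return buffer_size_bytes * lb_procs


GIB = 1024 ** 3

buffer_size = 1 * 1024 * 1024 * 1024  # assumed reading of af_packet_buffer_size
workers = 15                          # lb_procs from node.cfg

total = af_packet_ring_total(buffer_size, workers)
print(f"ring buffers: {total / GIB:.0f} GiB across {workers} workers")
```

Under those assumptions the rings account for about 15 GiB of the 128 GB per sensor, so the buffers alone would not explain workers running out of memory.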

Mark

Interestingly enough, we started suffering the same problem at the same time.

  • 1 node with 44 cores, 256GB of RAM
  • Zeek 2.5.5
  • node.cfg:
    [worker-1]
    type=worker
    host=localhost
    interface=af_packet::ens4f0
    lb_method=custom
    lb_procs=25
    pin_cpus=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24

  • broctl.cfg:
    MemLimit = 100000000 #100GB
    setcap.enabled=1

For additional reference:

Linux snout 4.9.0-9-amd64 #1 SMP Debian 4.9.168-1+deb9u5 (2019-08-11) x86_64 GNU/Linux

On 10-11 I patched libssl and libc.
On 10-17 I upgraded sudo (about 30 minutes after the first worker crashed).

The "[Bro] Crash report from worker-1-12" email was received at 16:00.

Log output from dpkg for reference:

less /var/log/dpkg.log | grep "status installed"

2019-10-11 14:59:23 status installed telegraf:amd64 1.12.3-1
2019-10-11 14:59:23 status installed libssl1.0.2:amd64 1.0.2t-1~deb9u1
2019-10-11 14:59:23 status installed libc-bin:amd64 2.24-11+deb9u4
2019-10-11 14:59:23 status installed libssl1.1:amd64 1.1.0l-1~deb9u1
2019-10-11 14:59:23 status installed openssl:amd64 1.1.0l-1~deb9u1
2019-10-11 14:59:24 status installed man-db:amd64 2.7.6.1-2
2019-10-11 14:59:24 status installed libssl1.0-dev:amd64 1.0.2t-1~deb9u1
2019-10-11 14:59:24 status installed libc-bin:amd64 2.24-11+deb9u4
2019-10-17 16:25:47 status installed sudo:amd64 1.8.19p1-2.1+deb9u1
2019-10-17 16:25:47 status installed apache2-utils:amd64 2.4.25-3+deb9u9
2019-10-17 16:25:47 status installed apache2-bin:amd64 2.4.25-3+deb9u9
2019-10-17 16:25:47 status installed apache2-data:all 2.4.25-3+deb9u9
2019-10-17 16:25:47 status installed systemd:amd64 232-25+deb9u12
2019-10-17 16:25:47 status installed man-db:amd64 2.7.6.1-2
2019-10-17 16:25:48 status installed apache2:amd64 2.4.25-3+deb9u9
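
To see which of these package changes could even be related, the log can be filtered down to what was installed before the first crash. The snippet below is just an illustrative sketch: the helper names are invented, the sample lines are copied from the log above, and the cutoff assumes the first crash report (received at 16:00) was on 2019-10-17.

```python
from datetime import datetime


def installed_packages(lines):
    """Yield (timestamp, package, version) for each 'status installed' line."""
    for line in lines:
        parts = line.split()
        if len(parts) == 6 and parts[2:4] == ["status", "installed"]:
            stamp = datetime.strptime(f"{parts[0]} {parts[1]}", "%Y-%m-%d %H:%M:%S")
            yield stamp, parts[4], parts[5]


def installed_before(lines, cutoff):
    """Return the installs that completed at or before the cutoff time."""
    return [row for row in installed_packages(lines) if row[0] <= cutoff]


SAMPLE = [
    "2019-10-11 14:59:23 status installed libssl1.1:amd64 1.1.0l-1~deb9u1",
    "2019-10-11 14:59:24 status installed libc-bin:amd64 2.24-11+deb9u4",
    "2019-10-17 16:25:47 status installed sudo:amd64 1.8.19p1-2.1+deb9u1",
]

# Assumed time of the first crash report.
FIRST_CRASH = datetime(2019, 10, 17, 16, 0, 0)

before = installed_before(SAMPLE, FIRST_CRASH)
for stamp, package, version in before:
    print(stamp, package, version)
```

With the full log, only the 10-11 updates (libssl, libc-bin, openssl, telegraf, man-db) fall before that cutoff; the sudo, apache2 and systemd updates came after the first crash.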

Hi,

Both of you are running rather old versions of Zeek.

Both 2.5.5 and 2.6.1 have a number of issues.

One of the issues that has since been fixed could be the cause of the crashes. A bug could result in Zeek requesting huge allocations that the operating system cannot fulfill; see e.g. "tcmalloc large alloc crashes" (issue #245 in zeek/zeek on GitHub) for more details. This specific issue was fixed in 2.6.3.

So upgrading to 2.6.4 (or, even better, 3.0.0) might fix those problems for you.
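
For anyone with a mix of sensors, here is a small sketch of checking which versions predate the 2.6.3 release that carries the fix. It only compares version numbers; it says nothing about whether a given older release is actually affected, and the helper names are made up for illustration.

```python
def parse_version(text):
    """Turn a dotted version string such as '2.6.1' into a comparable tuple."""
    return tuple(int(part) for part in text.split("."))


FIXED_IN = parse_version("2.6.3")  # release named above as containing the fix


def predates_fix(version):
    """True if the version is older than the release containing the fix."""
    return parse_version(version) < FIXED_IN


for version in ("2.5.5", "2.6.1", "2.6.4", "3.0.0"):
    label = "predates the fix" if predates_fix(version) else "includes the fix"
    print(f"{version}: {label}")
```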

Besides that, both 2.5.5 and 2.6.1 have several vulnerabilities, and you really really really should upgrade them :).

Johanna