Workers dying with "out of memory in new"

Mark_Gardner · October 18, 2019, 2:47pm

We must have crossed some threshold yesterday. Suddenly we are suffering an epidemic of workers dying with “out of memory in new” even though we made no changes. Previously, we would have a few die each day. Now we have had 250 alerts of workers dying and being restarted from 00:00 to 10:00. I have no idea where to start debugging the problem. Any suggestions?

What causes a worker to die by running out of memory? The sensors have lots of memory (see below), so I would not expect any out-of-memory deaths. (To monitor the problem, I am in the process of setting up collectd and Grafana.)
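Until proper monitoring is in place, per-worker resident memory can be read straight from /proc. The following is only a minimal sketch, assuming a Linux host and assuming the worker processes are named "bro" or "zeek"; the helper names `parse_vmrss_kb` and `worker_rss` are made up for illustration. It reports resident memory only, so it will not show a failure caused by an address-space limit.

```python
import os


def parse_vmrss_kb(status_text):
    """Return the VmRSS value (in kB) from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            return int(line.split()[1])
    return None  # kernel threads have no VmRSS line


def worker_rss(names=("bro", "zeek")):
    """Map pid -> resident memory in kB for processes with a matching name."""
    result = {}
    for pid in os.listdir("/proc"):
        if not pid.isdigit():
            continue
        try:
            with open(f"/proc/{pid}/status") as fh:
                text = fh.read()
        except OSError:
            continue  # process exited or is not readable
        lines = text.splitlines()
        # The first line of the status file looks like "Name:\tbro".
        if not lines or not lines[0].startswith("Name:"):
            continue
        name = lines[0].split(":", 1)[1].strip()
        rss = parse_vmrss_kb(text)
        if name in names and rss is not None:
            result[int(pid)] = rss
    return result


if __name__ == "__main__" and os.path.isdir("/proc"):
    for pid, rss_kb in sorted(worker_rss().items()):
        print(f"pid {pid}: {rss_kb / 1024:.0f} MiB resident")
```

Running it periodically (for example from cron) and keeping the output would show whether workers grow steadily or jump right before they die.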

Some details:

5 sensors, each with 16-core, AMD Epyc 7351P, 128 GB RAM, Intel X520-T2
Zeek 2.6.1
node.cfg: lb_procs=15, pin_cpus=1-15, af_packet_buffer_size=1*1024*1024*1024
broctl.cfg: setcap enabled
Not shunting any traffic
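
For a sense of scale, here is a back-of-the-envelope check of how much memory the AF_PACKET ring buffers take on one sensor. It is a sketch under two assumptions that are not confirmed in this thread: that the buffer size is meant to be 1 GiB (1*1024*1024*1024 bytes), and that each of the 15 workers allocates its own buffer of that size. The 128 GB of RAM is treated as 128 GiB for simplicity.

```python
# Rough estimate of AF_PACKET ring buffer memory on one sensor.
# Assumptions: 1 GiB buffer per worker, one buffer per worker process.
GIB = 1024 * 1024 * 1024

lb_procs = 15                    # workers per sensor, from node.cfg
af_packet_buffer_size = 1 * GIB  # assumed reading of the node.cfg value
total_ram_gib = 128              # sensor RAM, treated as GiB

ring_buffer_total_gib = lb_procs * af_packet_buffer_size / GIB
share_percent = 100 * ring_buffer_total_gib / total_ram_gib

print(f"ring buffers: {ring_buffer_total_gib:.0f} GiB "
      f"({share_percent:.1f}% of RAM)")
```

Under those assumptions the ring buffers account for a small fraction of the RAM, so they alone would not explain workers running out of memory.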

Mark

Munroe_Sollog · October 18, 2019, 3:12pm

Interestingly enough, we started suffering the same problem at the same time.

1 node with 44 cores, 256GB of RAM
Zeek 2.5.5
node.cfg:
[worker-1]
type=worker
host=localhost
interface=af_packet::ens4f0
lb_method=custom
lb_procs=25
pin_cpus=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24

broctl.cfg:
MemLimit = 100000000 #100GB
setcap.enabled=1

Munroe_Sollog · October 18, 2019, 3:26pm

For additional reference:

Linux snout 4.9.0-9-amd64 #1 SMP Debian 4.9.168-1+deb9u5 (2019-08-11) x86_64 GNU/Linux

On 10-11 I patched libssl and libc.
On 10-17 I upgraded sudo (about 30 minutes after the first worker crashed).

The "[Bro] Crash report from worker-1-12" email was received at 16:00.

Log output from dpkg for reference:

less /var/log/dpkg.log | grep "status installed"

2019-10-11 14:59:23 status installed telegraf:amd64 1.12.3-1
2019-10-11 14:59:23 status installed libssl1.0.2:amd64 1.0.2t-1~deb9u1
2019-10-11 14:59:23 status installed libc-bin:amd64 2.24-11+deb9u4
2019-10-11 14:59:23 status installed libssl1.1:amd64 1.1.0l-1~deb9u1
2019-10-11 14:59:23 status installed openssl:amd64 1.1.0l-1~deb9u1
2019-10-11 14:59:24 status installed man-db:amd64 2.7.6.1-2
2019-10-11 14:59:24 status installed libssl1.0-dev:amd64 1.0.2t-1~deb9u1
2019-10-11 14:59:24 status installed libc-bin:amd64 2.24-11+deb9u4
2019-10-17 16:25:47 status installed sudo:amd64 1.8.19p1-2.1+deb9u1
2019-10-17 16:25:47 status installed apache2-utils:amd64 2.4.25-3+deb9u9
2019-10-17 16:25:47 status installed apache2-bin:amd64 2.4.25-3+deb9u9
2019-10-17 16:25:47 status installed apache2-data:all 2.4.25-3+deb9u9
2019-10-17 16:25:47 status installed systemd:amd64 232-25+deb9u12
2019-10-17 16:25:47 status installed man-db:amd64 2.7.6.1-2
2019-10-17 16:25:48 status installed apache2:amd64 2.4.25-3+deb9u9
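
To line package changes up against crash times, the same log can be grouped by date. This is a minimal sketch that assumes every relevant line has the layout shown above (date, time, "status installed", package:arch, version); the function name `installed_by_date` is made up for illustration, and the sample lines are copied from the output above.

```python
from collections import defaultdict


def installed_by_date(lines):
    """Group 'status installed' dpkg.log entries by date."""
    by_date = defaultdict(list)
    for line in lines:
        parts = line.split()
        # Expected layout: DATE TIME status installed PACKAGE:ARCH VERSION
        if len(parts) >= 6 and parts[2:4] == ["status", "installed"]:
            by_date[parts[0]].append((parts[4], parts[5]))
    return dict(by_date)


sample = [
    "2019-10-11 14:59:23 status installed libssl1.1:amd64 1.1.0l-1~deb9u1",
    "2019-10-11 14:59:24 status installed libc-bin:amd64 2.24-11+deb9u4",
    "2019-10-17 16:25:47 status installed sudo:amd64 1.8.19p1-2.1+deb9u1",
    "2019-10-17 16:25:47 status installed systemd:amd64 232-25+deb9u12",
]

for date, packages in sorted(installed_by_date(sample).items()):
    names = ", ".join(name for name, _version in packages)
    print(f"{date}: {names}")
```

For the real log, pass the lines of /var/log/dpkg.log to the function instead of the sample list.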

johanna · October 18, 2019, 11:38pm

Hi,

both of you are running rather old versions of Zeek.

Both 2.5.5 and 2.6.1 have a number of issues.

One of the issues that was fixed could be the cause of the crashes. A bug could result in Zeek requesting huge allocations that cannot be fulfilled by the operating system; see e.g. "tcmalloc large alloc crashes" (Issue #245 in zeek/zeek on GitHub) for more details. This specific issue was fixed in 2.6.3.

So - upgrading to 2.6.4 (or even better - 3.0.0) might fix those problems for you.

Besides that - both 2.5.5 and 2.6.1 have several vulnerabilities - and you really really really should upgrade them :).

Johanna
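
For anyone checking several sensors, the version comparison above can be sketched as follows. It assumes plain dotted numeric version strings such as "2.6.1" and only tells you whether a version is at or after 2.6.3, the release named above as containing the fix; the function names are made up for illustration.

```python
def parse_version(text):
    """Turn a dotted numeric version like '2.6.1' into a comparable tuple."""
    return tuple(int(part) for part in text.split("."))


# Release in which the large-allocation bug is said to be fixed.
FIXED_IN = parse_version("2.6.3")


def has_alloc_fix(version):
    """True if the version is at or after the release with the fix."""
    return parse_version(version) >= FIXED_IN


for version in ("2.5.5", "2.6.1", "2.6.4", "3.0.0"):
    status = "includes the fix" if has_alloc_fix(version) else "predates the fix"
    print(f"{version}: {status}")
```

Tuples compare element by element, so "2.6.10" would correctly sort after "2.6.3", which a plain string comparison would get wrong.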
