Bro-2.6.4 to Zeek-3.0.x upgrade attempt - huge memory consumption

Hi folks,

We are still running bro-2.6.4 (I know, pretty old), but I’m trying to move off it.

So I started with the current LTS, 3.0.12, but as soon as I pushed it to a test environment we ran out of memory.
I then worked back through every release on GitHub down to 3.0.0, but the issue remains:

Here is a zeekctl top from one of our hosts:

$ zeekctl top
Name Type Host Pid VSize Rss Cpu Cmd
logger-1 logger localhost 45226 50G 49G 201% zeek
manager manager localhost 45505 438M 339M 71% zeek
proxy-1 proxy localhost 45745 432M 336M 35% zeek
worker-snf0-1 worker localhost 46258 21G 20G 89% zeek
worker-snf0-2 worker localhost 46255 21G 20G 89% zeek
worker-snf0-3 worker localhost 46262 21G 20G 79% zeek
worker-snf0-4 worker localhost 46204 21G 20G 95% zeek
worker-snf0-5 worker localhost 46273 21G 20G 80% zeek
worker-snf0-6 worker localhost 46299 21G 20G 88% zeek
worker-snf0-7 worker localhost 46287 21G 20G 93% zeek
worker-snf0-8 worker localhost 46334 21G 20G 86% zeek
worker-snf0-9 worker localhost 46350 21G 20G 89% zeek
worker-snf0-10 worker localhost 46345 21G 20G 93% zeek
worker-snf0-11 worker localhost 46378 21G 20G 91% zeek
worker-snf0-12 worker localhost 46370 21G 20G 80% zeek
worker-snf0-13 worker localhost 46408 21G 20G 93% zeek
worker-snf0-14 worker localhost 46435 21G 20G 84% zeek
worker-snf0-15 worker localhost 46438 21G 20G 89% zeek
worker-snf0-16 worker localhost 46443 21G 20G 93% zeek
worker-snf0-17 worker localhost 46482 21G 20G 86% zeek
worker-snf0-18 worker localhost 46483 21G 20G 91% zeek
worker-snf2-1 worker localhost 46481 21G 20G 93% zeek
worker-snf2-2 worker localhost 46472 21G 20G 79% zeek
worker-snf2-3 worker localhost 46527 21G 20G 93% zeek
worker-snf2-4 worker localhost 46517 21G 20G 71% zeek
worker-snf2-5 worker localhost 46552 21G 20G 79% zeek
worker-snf2-6 worker localhost 46546 21G 20G 84% zeek
worker-snf2-7 worker localhost 46580 21G 20G 93% zeek
worker-snf2-8 worker localhost 46570 21G 20G 80% zeek
worker-snf2-9 worker localhost 46586 21G 20G 93% zeek
worker-snf2-10 worker localhost 46607 21G 20G 91% zeek
worker-snf2-11 worker localhost 46638 21G 20G 91% zeek
worker-snf2-12 worker localhost 46642 21G 20G 91% zeek
worker-snf2-13 worker localhost 46641 21G 20G 84% zeek
worker-snf2-14 worker localhost 46645 21G 20G 93% zeek
worker-snf2-15 worker localhost 46640 21G 20G 62% zeek
worker-snf2-16 worker localhost 46656 21G 20G 93% zeek
worker-snf2-17 worker localhost 46657 21G 20G 93% zeek
worker-snf2-18 worker localhost 46660 21G 20G 95% zeek

The zeekctl top shows the logger process running really hot on CPU and eating almost all the memory in the system, to the point where the OOM killer kicks in.
I also noticed that the workers are using more CPU than they did under bro.

I’ve tested several releases and hit this same issue on every 3.0.x, as well as on 3.1.5 and 3.2.2. When running 3.2.3 the memory consumption issue seems to disappear, but the workers still consume more CPU than before.

A bit more of context, we are using Myricom 10G NIC and running in a cluster model with 18 workers per RX port.
I didn’t change any configuration other than naming changes from bro to zeek.
We also run Suricata on the same host looking at the same data and we didn’t notice any issue on Suricata while running bro-2.6.4 or zeek.

When running a Zeek version with this issue, I can see that files.log and x509.log get much bigger than they did under bro.
For comparison, below are the file sizes from the same host:
running zeek-3.0.12:
-rw-r--r-- 1 root root 1.2G Feb 16 23:27 files.log
-rw-r--r-- 1 root root 1.1G Feb 16 23:27 x509.log

running bro-2.6.4
-rw-r--r-- 1 root root 390M Feb 16 22:47 files.log
-rw-r--r-- 1 root root 354M Feb 16 22:47 x509.log

I’ve also tried starting Zeek with all plugins and scripts disabled, but the memory issue remains the same.

Could you please help me to understand what is causing this issue?

Thanks,

Hello Sergio,

I moved from bro-2.5.5 to zeek-3.1.4 without any problems!

I do see that you are using myricom cards so that is very similar to my setup!
Could you try with zeek-3.1.4 ?

I would suggest installing into a fresh new directory instead of installing into the
existing one (i.e. if you have been installing into /usr/local/zeek, for the
test, try configuring with --prefix=/usr/local/zeek-test).

This is just to make sure previously existing lib, include, etc. directories don't affect it.
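If it helps, a minimal sketch of what that looks like (the paths and flags here are illustrative, not your actual layout, and assume a standard source build):

```shell
# Build Zeek into a throwaway prefix so the existing install is untouched.
./configure --prefix=/usr/local/zeek-test
make -j$(nproc)
sudo make install

# Point at the test install explicitly when trying it out.
/usr/local/zeek-test/bin/zeekctl deploy
```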

I take it that in your test environment you are running with the default policies, or do
you have any specific packages added too?

Looking at zeekctl top, your CPU usage is way too high on all the workers and the
logger! What link are you monitoring? How many conn.log entries per minute?

Aashish

Thanks Aashish,

We realised that we were duplicating the packets within the workers; I removed the Myricom configuration and the issue has not occurred since.
We are still running a bit hot on CPU, but I’m trying to start fresh with an all-new configuration, not loading any special scripts/plugins.
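For what it's worth, a minimal node.cfg sketch of ZeekControl's built-in Myricom load balancing, which avoids hand-wiring per-worker SNF settings (the interface name and worker count below are placeholders, and this assumes the Myricom capture plugin is installed):

```
[worker-snf0]
type=worker
host=localhost
interface=snf0
lb_method=myricom
lb_procs=18
```

With lb_method set, zeekctl gives each of the lb_procs workers its own slice of the NIC's flow-hashed stream, so no two workers should see the same packet.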

(After I replied, I noticed the conversation on #general )

> We realised that we were duplicating the packets within the workers, I

Really high CPU on *all* workers is a good indicator of it.

This policy (below) helps me quite a bit in making sure load-balancing is right
(or even the fact all workers are seeing traffic):

$ cat peer.zeek

redef record Conn::Info += {
   ## Name of the worker that handled the connection.
   peer: string &log &optional;
};

event connection_state_remove(c: connection) {
   if ( c?$conn )
       c$conn$peer = peer_description;
}

$ tail -n 10000 ~zeek/logs/current/conn.log | awk -F'\t' '{print $22}' | sort | uniq -c

<snip>
266 worker-1-2
274 worker-1-3
254 worker-1-4
260 worker-1-5
148 worker-1-6
133 worker-1-7
136 worker-1-8
270 worker-1-9
270 worker-1-10
<snip>
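If the hard-coded awk field number turns out to be brittle across conn.log layouts, the same per-worker tally can be sketched in Python (the sample rows and counts below are made up; this assumes the peer column is the last field, since the redef appends it):

```python
from collections import Counter

def tally_peers(lines):
    """Count conn.log entries per worker, reading the last
    tab-separated field (the added 'peer' column)."""
    counts = Counter()
    for line in lines:
        if line.startswith("#"):  # skip Zeek log header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if fields and fields[-1]:
            counts[fields[-1]] += 1
    return counts

# Two workers, made-up abbreviated rows:
sample = [
    "#fields\tts\t...\tpeer",
    "1613519000.1\t...\tworker-1-2",
    "1613519000.2\t...\tworker-1-2",
    "1613519000.3\t...\tworker-1-3",
]
print(tally_peers(sample).most_common())
# → [('worker-1-2', 2), ('worker-1-3', 1)]
```

A heavily skewed result here (like the worker-1-6 through worker-1-8 dip in the output above) points at uneven load balancing.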

Hope this helps,
Aashish