Worker System Memory Exhaustion

Hello List,

We're running Bro version 2.5-467 and shunting connections in an Arista
5100 via the excellent Dumbno. The system is essentially a Bro-in-a-box
with 251G RAM and a Myricom SNF+ card. Bro spawns 32 workers to monitor
a 10G link that averages roughly 5G and spikes to 7 on occasion. Logs
are rotated every 1/2 hour.

The current configuration has been running for about 2 weeks and thus
far the only problem I have encountered is that given enough time, the
workers will eventually exhaust all of the available system memory
including scratch.

Currently I am running a watcher process which slowly restarts the
workers once available memory drops below 4G. This has worked around the
problem, but it seems an imperfect solution, so I am wondering if anyone
knows the source of the apparent memory leak. Loaded scripts are
attached; thanks in advance for any illumination on this issue.

loaded_scripts.11:19:27-11:30:00.log.gz (2.8 KB)
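For context, the watcher is roughly this idea. A minimal sketch (not the actual script; the Linux /proc/meminfo source, the 4G threshold, and the `broctl restart` invocation are all assumptions to adapt for your deployment):

```python
"""Sketch of a memory watcher: poll MemAvailable and restart the Bro
workers when it drops below a threshold. Illustrative only."""
import subprocess
import time

THRESHOLD_KB = 4 * 1024 * 1024  # 4G, matching the watcher described above


def mem_available_kb(meminfo_text):
    """Extract MemAvailable (in kB) from /proc/meminfo contents."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            return int(line.split()[1])
    raise ValueError("MemAvailable not found")


def should_restart(available_kb, threshold_kb=THRESHOLD_KB):
    return available_kb < threshold_kb


def watch(interval=60):
    while True:
        with open("/proc/meminfo") as f:
            avail = mem_available_kb(f.read())
        if should_restart(avail):
            # Blunt approach; a gentler watcher would cycle workers one
            # at a time ("slowly restarts the workers").
            subprocess.call(["/usr/local/bro/bin/broctl", "restart", "workers"])
        time.sleep(interval)
```

A gentler variant would iterate over `broctl status` output and restart one worker per interval.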

> Hello List,
>
> We're running Bro version 2.5-467 and shunting connections in an Arista
> 5100 via the excellent Dumbno. The system is essentially a Bro-in-a-box
> with 251G RAM and a Myricom SNF+ card. Bro spawns 32 workers to monitor
> a 10G link that averages roughly 5G and spikes to 7 on occasion. Logs
> are rotated every 1/2 hour.

Cool :slight_smile:

> The current configuration has been running for about 2 weeks and thus
> far the only problem I have encountered is that given enough time, the
> workers will eventually exhaust all of the available system memory
> including scratch.

Less cool :frowning:

> Currently I am running a watcher process which slowly restarts the
> workers once available memory is < 4G and this has solved the problem,
> but this seems an imperfect solution thus I am wondering if anyone knows
> the source of the apparent memory leak? Loaded scripts are attached and
> thanks in advance for any informative illumination on this issue.

Try commenting out

@load misc/scan

from local.bro.

If you have a lot of address space and Bro sits in front of any firewall rather than behind it, this is likely the source of the problem.

If that fixes it, there are other scan-detection implementations that are a bit more efficient.

This was a battle we endured for many, many moons (12+ months); look to the archives for the pain and suffering.

Final solution: enable multiple loggers (now part of Bro), disable writing logs to disk, and stream logs to Kafka. (Thank you, KafkaLogger author.)

Reasoning: at some point Bro's log writing cannot keep up with the volume. This is believed to be a bottleneck in the default architecture, which uses a single "logger" node.

Possible alternative: enable multiple loggers but keep writing to disk. You might then hit a race condition with filenames and dates, and you'll have multiple logs for each rotation interval (e.g., 4 loggers means 4 conn.log, 4 http.log, 4 ssh.log, etc.).
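For reference, the multi-logger layout is just extra logger stanzas in node.cfg. A sketch, assuming a Bro version with multi-logger support; hostnames, interface, and counts here are made up:

```
# node.cfg sketch -- hostnames, interface, and counts are illustrative
[logger-1]
type=logger
host=10.0.0.1

[logger-2]
type=logger
host=10.0.0.1

[manager]
type=manager
host=10.0.0.1

[worker-1]
type=worker
host=10.0.0.1
interface=p1p1
lb_method=myricom
lb_procs=32
```

With logs streamed to Kafka instead of disk, the filename race described above goes away since nothing is rotated locally.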

^ Hovsep

Ah, yeah, it could be that too. Things got better for the most part once the logger node was introduced, so this hasn't been the problem for people recently.

I think most of the remaining problems with logger-node scaling are limited to extremely large log volumes and to people who had AMD systems with many slow cores. I think you had one of those.

In any case, that is easy to check by looking at broctl top and monitoring the log lag. If the logs are not behind, the problem is something else.

https://gist.github.com/JustinAzoff/01396a34c8f92d4dda1b

is a script for munin that will output how old the most recent record in the conn.log is. You can just run it manually though:

[jazoff@bro-dev ~]$ curl -o log_lag.py https://gist.githubusercontent.com/JustinAzoff/01396a34c8f92d4dda1b/raw/2dba7fdf93915748948b238c20de965b4636cb9e/log_lag.py
[jazoff@bro-dev ~]$ python log_lag.py
lag.value 5.526168

The number should be 5-10s and not growing.
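The gist boils down to something like this. A simplified sketch, assuming the default ASCII conn.log format (tab-separated, '#'-prefixed header lines, epoch timestamp in the first `ts` column); see Justin's gist for the real script:

```python
"""Simplified log-lag check: how old is the newest conn.log record?"""
import time


def last_ts(conn_log_lines):
    """Return the ts of the most recent non-comment record, or None."""
    ts = None
    for line in conn_log_lines:
        if line.startswith("#") or not line.strip():
            continue
        ts = float(line.split("\t", 1)[0])
    return ts


def lag_seconds(conn_log_lines, now=None):
    """Seconds between now and the newest record; None if no records."""
    now = time.time() if now is None else now
    ts = last_ts(conn_log_lines)
    return None if ts is None else now - ts
```

If the lag stays in the 5-10s range and is not growing, the logger is keeping up and memory is leaking elsewhere.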

I endorse commenting out "@load misc/scan", that really stabilized memory usage in our environment. We also commented out "@load misc/detect-traceroute" and that seemed to help as well.

From your description I don't know if this applies to your situation, but for what it's worth, we use PF_RING and initially experienced high resource utilization because PF_RING was not loading correctly. The issue tracker entry is here: https://bro-tracker.atlassian.net/browse/BIT-1864. TL;DR: change pfringclusterid in broctl.cfg to 21; if pfringclusterid is set to 0, PF_RING doesn't actually do anything, even though everything will indicate it's running. Again, I don't know if it applies, but I thought I'd throw it out there. :slight_smile:
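That workaround is a one-line change (21 is just the value from the ticket; the key point is that it be nonzero):

```
# broctl.cfg -- PF_RING workaround from BIT-1864;
# a cluster ID of 0 silently disables PF_RING clustering
pfringclusterid = 21
```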

Thanks,
Carl
University of Idaho

Fast Intel CPUs, and the live logs are written to a RAID 1 virtual disk
built on enterprise SSDs; logs are archived to a RAID 10 virtual disk
built on 15K SAS spindles.

I will give your debugging a try and see what it says.

Thanks all and have a good weekend.

Greg

Thanks Carl, will consider this avenue too.

Greg

I think Justin hit the nail on the head: we monitor two full /16s, three
/24s, and two partial /16s, in front of any local FW devices, similar to
LBL. Commenting out misc/scan did the trick; memory is now being freed
as one would expect.

We already know we have TONS of scanners traversing the network, so we
probably don't need this at all, although I am interested in hearing of
good alternatives.

Thanks again everyone, greatly appreciate the help.

Greg

https://github.com/ncsa/bro-simple-scan

https://github.com/initconf/scan-NG

both are available in bro-pkg. I'm obviously partial to simple-scan, but
Aashish is closer to you if you need someone to blame if it breaks :slight_smile:
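Both should install via bro-pkg along these lines (package shortnames assumed from the GitHub URLs above; check `bro-pkg search scan` for the exact names):

```
bro-pkg install ncsa/bro-simple-scan
# or
bro-pkg install initconf/scan-NG
```

Then make sure local.bro has `@load packages` (or @load the package directly) so the installed scripts are picked up.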

[ sorry to chime late - still catching up on the thread ]

Greg,

I generally shunt (disable) sumstats on all my clusters. Back in the day,
while early-adopting, I had memory issues in the cluster, and I have
disabled it everywhere for years since.

Basically: in /usr/local/bro/share/bro/base/frameworks/sumstats/main.bro,

to make sumstats completely ineffective, add a return at the top of "function observe":

--- /home/bro/master/share/bro/base/frameworks/sumstats/main.bro 2018-04-06 14:02:05.131016000 -0700
+++ /home/bro/master/share/bro/base/frameworks/sumstats/main.bro.dis 2018-04-06 14:01:54.384697000 -0700
@@ -402,6 +402,9 @@ function create(ss: SumStat)
function observe(id: string, key: Key, obs: Observation)
        {

+ ### this return disables the sumstats
+ return;

I’ve been monitoring a /20 and several small subnets, plus an internal /8, with Justin’s simple-scan, and it’s excellent. Highly recommended. It’s also faster when it comes to detection.

So why not just replace the core misc/scan.bro with Justin’s, which seems clearly better?

-Dop