We're running Bro version 2.5-467 and shunting connections in an Arista
5100 via the excellent Dumbno. The system is essentially a Bro-in-a-box
with 251 GB of RAM and a Myricom SNF+ card. Bro spawns 32 workers to monitor
a 10G link that averages roughly 5 Gb/s and occasionally spikes to 7 Gb/s.
Logs are rotated every half hour.
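For reference, the worker and rotation setup corresponds to node.cfg and
broctl.cfg entries along these lines (a rough sketch only; the interface
and host names below are placeholders, not our exact values):

    # node.cfg - worker entry (placeholders for host/interface)
    [worker-1]
    type=worker
    host=localhost
    interface=snf0        # placeholder for the Myricom SNF+ interface
    lb_method=myricom     # Myricom SNF load balancing across processes
    lb_procs=32           # the 32 workers mentioned above

    # broctl.cfg
    LogRotationInterval = 1800   # rotate logs every half hour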
The current configuration has been running for about two weeks, and thus
far the only problem I have encountered is that, given enough time, the
workers eventually exhaust all of the available system memory, including
scratch.
At the moment I am running a watcher process that slowly restarts the
workers once available memory drops below 4 GB. This has kept things
running, but it is an imperfect solution, so I am wondering whether anyone
knows the source of the apparent memory leak. The loaded scripts are
attached, and thanks in advance for any illumination on this issue.
Cool setup.
The memory exhaustion is less cool.
Try commenting out "@load misc/scan" in local.bro.
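The edit is just prefixing the existing line with a comment marker, e.g.:

    # local.bro
    # Scan detection - disabled while tracking down the memory growth
    # @load misc/scan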
If you have a lot of address space and Bro sits in front of any firewall
rather than behind it, this is likely the source of the problem.
If that fixes it, there are other scan-detection implementations that are a bit more efficient.
This was a battle we endured for many, many moons (12+ months); look to the list archives for the pain and suffering.
Final solution: enable multiple loggers (now part of Bro), disable writing logs to disk, and stream the logs to Kafka instead. (Thank you, KafkaLogger author.)
Reasoning: at some point Bro's log writing cannot keep up with the volume. This is believed to be a bottleneck in the default architecture, which uses a single "Logger" node.
Possible alternative: enable multiple loggers but keep writing to disk. You may hit a race condition with filenames and dates, and you will end up with multiple logs for each rotation interval (e.g. 4 loggers means 4 conn.log, 4 http.log, 4 ssh.log, etc.).
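If you do try multiple loggers, the node.cfg side looks roughly like this
(names and count are illustrative; check the docs for your Bro version,
since multi-logger support is relatively new, and the Kafka output itself
is configured separately via the Kafka log writer plugin):

    # node.cfg - multiple logger nodes
    [logger-1]
    type=logger
    host=localhost

    [logger-2]
    type=logger
    host=localhost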
Ah, yeah, it could be that too. Things got better for the most part once the logger node was introduced, so this hasn't been the problem for people recently.
I think most of the remaining problems with logger-node scaling are limited to extremely large log volumes and people who had AMD systems with many slow cores. I think you had one of those.
In any case, that is easy to check for by looking at broctl top and monitoring the log lag. If the logs are not behind, the problem is something else.
I endorse commenting out "@load misc/scan"; that really stabilized memory usage in our environment. We also commented out "@load misc/detect-traceroute" and that seemed to help as well.
From your description I don't know if this applies to your situation, but for what it's worth, we use PF_RING and initially experienced high resource utilization because PF_RING was not loading correctly. The issue is tracked as BIT-1864 in the Bro tracker. TL;DR: change pfringclusterid in broctl.cfg to 21; if pfringclusterid is set to 0, PF_RING doesn't actually do anything, even though everything will indicate it's running. Again, I don't know if it applies, but I thought I'd throw it out there.
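Concretely, that is a one-line change in broctl.cfg:

    # broctl.cfg - must be nonzero, or PF_RING silently does nothing (BIT-1864)
    pfringclusterid = 21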
Fast Intel CPUs here. The live logs are written to a RAID 1 virtual disk
built on enterprise SSDs, and logs are archived to a RAID 10 virtual disk
built on 15K SAS spindles.
I will give your debugging suggestion a try and see what it says.
I think Justin hit the nail on the head: we monitor two full /16s, three
/24s, and two partial /16s, all in front of any local firewall devices, similar to LBL.
Commenting out misc/scan did the trick; memory is now being freed as one
would expect.
We already know we have TONS of scanners traversing the network, so we
probably don't need this at all, although I am interested in hearing about
good alternatives.
Thanks again everyone, greatly appreciate the help.
[ Sorry to chime in late - still catching up on the thread. ]
Greg,
I generally shunt (disable) SumStats on all my clusters. Back in the day,
as an early adopter, I hit memory issues in the cluster, and I have
disabled it everywhere ever since - for years now.
Basically: to render SumStats completely inert, add a return at the top of
"function observe" in /usr/local/bro/share/bro/base/frameworks/sumstats/main.bro.
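Roughly like this (the exact parameter names may differ a bit between Bro
versions, so match whatever your main.bro already has):

    # /usr/local/bro/share/bro/base/frameworks/sumstats/main.bro
    function observe(id: string, key: Key, obs: Observation)
        {
        return;   # added line: every observation is dropped, so SumStats never accumulates state

        # ... the original function body stays below, now unreachable ...
        }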
I've been monitoring a /20 and several small subnets, plus an internal /8, with Justin's simple-scan, and it's excellent. Highly recommended. It's also faster when it comes to detection.
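If you want to try it, it is available through the Bro package manager;
something like the following should work (the package name is from memory,
so double-check what bro-pkg actually lists it as):

    # install Justin Azoff's scan detector (package name assumed)
    #   bro-pkg install bro-simple-scan
    # then load installed packages from local.bro, with misc/scan left commented out:
    @load packages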