Our current configuration is showing heavy use on the master node. We currently run around six worker nodes that feed data to the master, and while the master is keeping up in terms of CPU, it is consistently teetering on the edge of using all the RAM we can throw at it (128 GB at the moment). There are plans in place to increase our available bandwidth 10-fold, so the traffic coming to Bro will ramp up as well.
We could split the subnets apart and create multiple Bro clusters, but it would be nice to keep a single cluster and be able to keep throwing more workers and managers at it. I have not seen any documentation about configurations using multiple managers; if that does exist, can someone point me in the right direction?
And if that doesn’t exist, can I get some suggestions for mitigating this problem? I know there are a lot of cool things being done with Bro, especially using scripts and APIs to help reduce the traffic being thrown at it. But given the taps we have in place and our current manpower, spinning up a little more hardware would be a much easier and more economical investment of our time right now.
Thanks.
Jason
Our current configuration is showing heavy use on the master node. We currently run around six worker nodes that feed data to the master, and while the master is keeping up in terms of CPU, it is consistently teetering on the edge of using all the RAM we can throw at it (128 GB at the moment).
That indicates a problem. I’m going to send you a script off-list that you can run, and we’ll see if we can nail down what’s causing it.
We could split the subnets apart and create multiple Bro clusters, but it would be nice to keep a single cluster and be able to keep throwing more workers and managers at it. I have not seen any documentation about configurations using multiple managers; if that does exist, can someone point me in the right direction?
You can only run a single manager.
But given the taps we have in place and our current manpower, spinning up a little more hardware would be a much easier and more economical investment of our time right now.
Unfortunately in this case you need to fix the problem and can’t really just throw more hardware at it.
.Seth
Is it actually 100% RAM usage by applications? Since the manager can be performing a significant amount of disk writes, the kernel will allocate ‘free’ memory as ‘cached’ to improve file I/O performance. The cached memory is released when applications demand more memory.
Below is the current memory usage on one of my managers that is handling 25 workers and 2 proxies. At first glance it appears that all the memory has been consumed, but notice how 122G is cached.
                    total       used       free     shared    buffers     cached
Mem:                 126G       125G       384M         0B       329M       122G
-/+ buffers/cache:               2.6G       123G
Swap:                 33G         0B        33G
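If you want to confirm what the Bro processes themselves are actually holding, something along these lines works on most Linux boxes (the process name "bro" is what a stock install uses; adjust if yours differs):

# Application memory only: read the "-/+ buffers/cache" line, which excludes page cache
free -m
# Resident (RSS) and virtual (VSZ) memory of each Bro process on this host
ps -C bro -o pid,rss,vsz,args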
-Dave
Thanks.
I went ahead and rebooted the cluster, and that cleared things up (as well as sent out a LOT of emails…).
Has anyone else noticed a memory leak in the sensors? We see memory usage grow slowly, by roughly 10 GB a month, even when our total traffic has gone down. I attached an image from our Zabbix monitor. You can see that once we reboot the box, memory drops back down and then slowly creeps up. And traffic isn’t increasing (in fact, it drops by about half over the summer).
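(For tracking this per process rather than box-wide, BroControl can report memory for each cluster node; a rough sketch, assuming you run it from the manager:)

# Snapshot CPU and memory of every Bro process in the cluster; running this
# periodically (e.g. from cron) makes it easy to spot which node is growing.
broctl top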
Jason Close
Information Security Analyst
OU Information Technology
Office: 405.325.8661 Cell: 405.421.1096

I too have noticed complete memory exhaustion in Bro 2.3.2 (not sure what version Jason is running). If the workers are not restarted every few days, or at least once a week, I run out of usable memory on a few of the sensors I’m testing.
I have found that just doing a restart from broctl frees up the consumed memory, but I have to perform restarts regularly to keep the sensors I’m testing running smoothly.
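Something like the following cron entry is a crude way to automate that stopgap until the underlying leak is found (the broctl path is just an example; adjust for your install):

# Restart the whole cluster nightly at 03:30 as a temporary workaround for the leak
30 3 * * * /usr/local/bro/bin/broctl restart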
Mark
You’ll both want to check reporter.log. In most cases memory leaks are introduced by scripts (either built-in or custom) that generate errors; errors in Bro script-land can result in memory leaks, so you want to do your best to avoid them. If you’re willing to share your reporter.log, I could possibly help you fix some of the errors that you’re running into.
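A quick way to see which messages dominate (assuming bro-cut is on your PATH and you are in the current log directory):

# Count the most common reporter messages, most frequent first
bro-cut level message < reporter.log | sort | uniq -c | sort -rn | head -20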
I’m using all stock Bro scripts in the test I have going, but adding some intel indicators.
The most frequently recurring message I have is “NB-DNS error in DNS_Mgr::Process (recvfrom(): Connection refused)”. This machine does not have DNS access, and the build we used put a DNS server that is not in service into /etc/resolv.conf. This error makes up about 90% of what is in my reporter.log. I tried commenting out the /etc/resolv.conf entry and restarting Bro through broctl, but I am still seeing the errors.
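(For what it’s worth, that recvfrom() error means the kernel got an ICMP port-unreachable back, i.e. nothing is listening at the configured resolver address; a quick sanity check, assuming dig is installed and 192.0.2.53 stands in for whatever resolv.conf lists:)

# Show which resolvers Bro's DNS manager will try
cat /etc/resolv.conf
# Probe one of them directly; replace 192.0.2.53 with an address from the file above
dig +time=2 +tries=1 @192.0.2.53 www.example.com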
The other significant portion is miscellaneous base64 messages:
“incomplete base64 group, padding with bits of 0” - ~5%
“extra base64 groups after ‘=’ padding are ignored” - ~4%
“character ignored by Base64 decoding” - ~< 1%
Mark
When I recently debugged memory exhaustion on my workers, the root cause was related to the software detection scripts, specifically /protocols/http/software. If that script is running on a sensor that monitors the inside interface of a web proxy, it tracks all the remote software as being on your local network.
I used to see some stability issues on a reasonably large network (100K plus hosts) that appeared related to software asset tracking, coupled with a lot of IP churn due to wireless devices and short lease times, which resulted in the manager being oom-killed after some number of hours or days. It was suggested, I think by Seth, that I experiment with disabling it in local.bro:
I’m still on 2.3-419, so not the latest build by any means, but I was able to test this and have seen greatly improved stability in my particular environment with the line below in local.bro:
redef Software::asset_tracking = NO_HOSTS;
It might be worth experimenting with if you think you may be having software asset tracking related issues. Be mindful if you rely on that data for other scripts.
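If you want to confirm the change actually took effect on a running cluster, broctl’s print command can query a live node for a script-level value (node names depend on your node.cfg):

# Ask the live manager process for the current value; expect NO_HOSTS after
# installing the change and restarting
broctl print Software::asset_tracking manager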
Regards,
Gary