Manager memory requirements for the intel framework

Hello, All,

We’re trying to understand manager memory requirements when the intel framework is in use, after experiencing multiple manager crashes per day when using the framework on a low-bandwidth (less than 1Gbps) CentOS 6 machine running a production Bro 2.4.1 cluster. These are happening because the manager is exhausting its tcmalloc heap limit of 16G, as reported in its stderr.log. We removed the heap limit on an idle (no network traffic) Bro 2.4.1 test system, and found the parent VSize reported by “broctl top manager” went to 27G for an intel input file of 18K unique Intel::DOMAIN items. It remained at 27G after many cycles of replacing the input file with 18K new unique items.

Restoring the heap limit and attaching gdb to the manager on the test system shows a malloc failure backtrace coming out of RemoteSerializer::SendCall(). We commented out the conditional that invokes “event Intel::new_item(item)” in base/frameworks/intel/main.bro to disable remote synchronization with the workers, and the huge VSize disappeared.

We then built Bro from master (version 2.5-569) and retested. The manager VSize is much lower, but still about 15G.

Any advice on how to proceed with further diagnostics to hopefully rein in the manager memory requirements for intel? At first blush it doesn’t appear that upgrading Bro will fix it, at least not entirely, and we’re reluctant to upgrade the production system without fully understanding the problem.

Thanks,

Brian

It may be worth upgrading your production system either way, since I don’t believe 2.4.x is still supported; it’s a release from 2015. Plus, there have been lots of improvements since then…

I’m not sure if this will help your specific problem, but I’d be curious to know whether you’re also running the default scan detection script on this box. In your local.bro, are you loading “misc/scan”? If so, that script is known to be a huge memory hog and could be indirectly contributing to your high memory usage. I just wanted to mention it, though since commenting out the conditional that invokes “event Intel::new_item(item)” made the problem go away, the issue may reside more directly in the Intel framework.
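
If it helps, a quick way to rule the scan script out (assuming the default policy script path) is to comment out its @load in local.bro and watch whether the manager’s memory growth changes:

# In local.bro: temporarily disable the scan script as a test.
# @load misc/scan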

How many indicators of each indicator type are you loading into the Intel Framework?

-Drew

You should update to at least 2.4.2 due to this vulnerability:

http://blog.bro.org/2017/10/bro-252-242-release-security-update.html

> It remained at 27G after many cycles of replacing the input file with 18K new unique items.

That is interesting, because by default the intel framework doesn't expire items, so every time you replaced the file you were loading an additional 18k items.
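
As an aside, and only as a sketch: if I remember right, newer versions grew an Intel::item_expiration option, disabled by default, that would keep replaced indicators from accumulating. Assuming you end up on a version that has it, something like:

# Assumes a Bro version that supports intel item expiration;
# a negative interval (the default) disables it.
redef Intel::item_expiration = 1 hr;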

If I get a chance I will resurrect the benchmarking code I was working on a while ago. It would do things like create a table of hosts, add 10k, 20k, 30k, and 40k hosts to it, and record the memory usage at each count, to see what the real-world data usage is for different sized data structures. I never tried it with the intel framework, though.
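
Something along these lines, as a rough sketch (count_to_v4_addr() and val_size() are standard BiFs; the while loop needs 2.5+; the addresses are just synthetic test data):

# Rough benchmark sketch: grow a table of hosts and report its
# approximate in-memory footprint every 10k entries.
global hosts: table[addr] of count;

event bro_init()
    {
    local i: count = 0;
    while ( i < 40000 )
        {
        # Generate synthetic IPv4 addresses from the counter.
        hosts[count_to_v4_addr(i)] = i;
        ++i;
        if ( i % 10000 == 0 )
            print fmt("%d entries: ~%d bytes", i, val_size(hosts));
        }
    }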

> We commented out the conditional that invokes “event Intel::new_item(item)” in base/frameworks/intel/main.bro to disable remote synchronization with the workers, and the huge VSize disappeared.

This makes more sense. I don't think your memory usage has anything to do with the intel data itself; I think the communication code is falling behind.

How many worker processes do you have configured? Are they running on the same box or separate boxes?

If you load up 18k indicators but have 100 worker nodes, the Bro manager needs to send out 1,800,000 events to the workers. If the workers can't keep up, that data just ends up buffered in memory on the manager until it can be sent out.

Jon: this is the use case I had for Cluster::relay_rr, offloading the messaging load from the manager. The relevant code is:

# On the manager, the new_item event indicates a new indicator that
# has to be distributed.
event Intel::new_item(item: Item) &priority=5
    {
    Broker::publish(indicator_topic, Intel::insert_indicator, item);
    }

So relay_rr should maybe be used there, instead of the manager having to do all of the communication work.
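
Roughly like this, as a sketch only (assuming the Cluster::relay_rr signature and the standard proxy pool; untested):

# Sketch: hand each new item to one proxy (round-robin), which then
# re-publishes it to all workers on indicator_topic, so the manager
# sends each item once instead of once per worker.
event Intel::new_item(item: Item) &priority=5
    {
    Cluster::relay_rr(Cluster::proxy_pool, "intel_new_item_rr",
                      indicator_topic, Intel::insert_indicator, item);
    }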