Version: 2.0-907 -- Bro manager memory exhaustion

Hello,

I’m using the latest development build 2.0-907.

The deployment consists of six servers; one as a manager and the other five as nodes. Each node runs 20 workers and 2 proxies. The manager is FreeBSD; the workers are Linux with PF_RING transparent_mode=2.

After starting Bro, the manager continually consumes memory until the system's 64 GB are exhausted. CPU usage is high as well.

Another problem is that over 50% of the workers consume 100% CPU. This is very odd considering the low traffic volume of 400-1000 Mbps per node.

Where do you suggest I start debugging this?

The deployment consists of six servers; one as a manager and the other five
as nodes. Each node runs 20 workers and 2 proxies. The manager is
FreeBSD; the workers are Linux with PF_RING transparent_mode=2.

(Note, you don't need 2 proxies per node; it may actually already be
fine to run a single proxy on the manager box).
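
(Roughly, that layout would correspond to a node.cfg along the lines of the
sketch below -- the addresses are placeholders and the worker sections are
abbreviated, so adapt it to however you currently spread the 20 capture
processes per box:)

[manager]
type=manager
host=192.168.1.1

[proxy-1]
type=proxy
host=192.168.1.1

[worker-1]
type=worker
host=192.168.1.2
interface=eth0

# ... further [worker-N] sections for the remaining capture boxes ...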

After starting Bro, the manager continually consumes memory until the
system's 64 GB are exhausted. CPU usage is high as well.

That's not a good sign for the manager ... It's possible that we have
a memory leak in there. Has it worked better with 2.0? (If you have
tried that?)

Another problem is that over 50% of the workers consume 100% CPU. This is
very odd considering the low traffic volume of 400-1000 Mbps per node.

So that's 400-1000 Mbps divided by 20 worker processes? I'll let
others chime in here, not really sure what to expect with PF_RING in
that setup.

Robin

(Taking this to bro-dev.)

That's not a good sign for the manager ... It's possible that we have
a memory leak in there.

I just reran our leak tests and they didn't report anything (which is
good, but doesn't completely rule out any leaks).

I did see this though from valgrind:

    Object at 0x94e3410 of 68 bytes from an IgnoreObject() has disappeared

Does anybody know what valgrind is trying to tell me with that? Is it
a problem?

Robin

(…)

After starting Bro, the manager continually consumes memory until the
system's 64 GB are exhausted. CPU usage is high as well.

That’s not a good sign for the manager … It’s possible that we have
a memory leak in there. Has it worked better with 2.0? (If you have
tried that?)

I had the same problem with earlier development builds and with 2.0, but the exhaustion was not as rapid. I suspect the number of peers has some influence in triggering or accelerating a leak.

Another problem is that over 50% of the workers consume 100% CPU. This is
very odd considering the low traffic volume of 400-1000 Mbps per node.

So that's 400-1000 Mbps divided by 20 worker processes? I'll let
others chime in here, not really sure what to expect with PF_RING in
that setup.

I based the design on the very rough suggestion that each core can handle 60-120 Mbps of traffic, so 20 workers on a 48-core machine should cover 1.2-2.4 Gbps. I want to say I even recall reading about that threshold in a paper somewhere.

–TC

Have you seen any of my threads from earlier this year?

http://bit.ly/JJQVVf
http://bit.ly/N2l4yT

Your issue sounds similar to what I was experiencing.

Bro 2.0 routinely uses up all available memory and then crashes for me.

In my case, an early suggestion was that Bro should not be run in a
virtual machine. I set up a second instance of Bro 2.0 on a FreeBSD
machine (not a VM), though, and got the same results -- routine
crashes. I read quite a few stack traces from those crashes, and
noticed that there seemed to be an issue (maybe a leak) allocating
memory when Bro attempts to reassemble fragmented traffic.

Can you get a stack trace from any of your crashes?

I have a cron job that restarts everything as soon as it experiences a
crash, so I can get fairly continuous coverage. Unfortunately, seeing
bro crash so frequently reduces my confidence that it's catching
everything.
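
(For what it's worth, that restart is just broctl's own cron command driven
from crontab -- something along these lines, with the install prefix adjusted
for your system:)

    */5 * * * * /usr/local/bro/bin/broctl cron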

If I want to be more confident about the results bro produces, I'll
run it over pcap from tcpdump.

-Chris

OK, so we are lucky to share the same misfortune. It sounds like the same problem I've had since I started using Bro. Contact me off-list and we'll exchange notes.

Someone mentioned it's likely due to the traffic on the network; they had a similar problem that involved certain SSL traffic. The idea is to disable features until the problem is found and then devise a workaround. That's the plan for now.
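
(A rough sketch of that approach: comment out candidate @load lines in
local.bro one at a time, restart, and watch the manager's memory after each
change. The script names below are only examples from a stock local.bro and
may differ on your install; since SSL was mentioned, its policies are a
natural place to start.)

# Keep the basics loaded ...
@load tuning/defaults

# ... and disable suspects one at a time:
# @load protocols/ssl/validate-certs
# @load protocols/ssl/known-certs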

–TC

I have the following in my local.bro file:

redef SMTP::generate_md5 += /image.*/;
redef HTTP::generate_md5 += /image.*/;
redef SMTP::generate_md5 += /text.*/;
redef HTTP::generate_md5 += /text.*/;
redef SMTP::generate_md5 += /application.*/;
redef HTTP::generate_md5 += /application.*/;
redef SMTP::generate_md5 += /audio.*/;
redef HTTP::generate_md5 += /audio.*/;
redef SMTP::generate_md5 += /video.*/;
redef HTTP::generate_md5 += /video.*/;

Using broctl's top command and a little trial and error, I can see that
these lines are the cause of my high CPU usage. They also cause higher
memory usage, but memory usage always climbs and never gets smaller. I
don't know if these lines are responsible just for higher memory usage in
general, or whether they are also responsible for the gradual climb in
memory. It appears that memory gradually climbs even without these lines,
but I haven't had enough time to test that idea. I believe that the
climbing memory eventually leads to a crash, typically when Reassem.cc
attempts to allocate some new memory and an unhandled exception is
triggered. The broctl cron command restarts Bro for me.
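
(For anyone trying to reproduce this with less load: a much narrower pattern
set, e.g. hashing only executable content instead of every image, text,
application, audio, and video object, produces far fewer digests. Which MIME
types are worth keeping is of course site-specific.)

# Narrower example: only hash bodies identified as Windows executables.
redef HTTP::generate_md5 += /application\/x-dosexec/;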

-Chris

I've also noticed something peculiar about the node.cfg file that
causes high CPU usage, independent of generating MD5s.

I was under the impression that I needed to change node.cfg from the default:

[bro]
type=standalone
host=localhost
interface=eth0

to something that makes more sense for my environment:

[bro]
type=standalone
host=1.2.3.4
interface=eth0

For some reason, when I do this, broctl takes a very long time to return
from the status command, and the number of peers reported is "???" rather
than the expected 0. Setting host to an IP address also causes CPU usage
to spike to about 100%.

-Chris

I was under the impression that I needed to change node.cfg from the default:

[bro]
type=standalone
host=localhost
interface=eth0

to something that makes more sense for my environment:

[bro]
type=standalone
host=1.2.3.4
interface=eth0

For some reason, when I do this, broctl takes a very long time to return
from the status command, and the number of peers reported is "???" rather
than the expected 0. Setting host to an IP address also causes CPU usage
to spike to about 100%.

For a standalone config, localhost is probably preferable -- node.cfg just specifies how nodes contact/communicate with each other, so going over loopback, or at least private IP space, is what most people aim for.

The 'peers' column of `broctl status` is obtained by the manager sending a status request event to the Bro worker instance and expecting a status response event back. The failure cases in which "???" is displayed are:

(1) The manager's broccoli connection to the Bro peer fails
(2) The manager times out sending a status request event
(3) The manager times out receiving a status response event

To see which is the case, you could exit all BroControl shells, add "Debug=1" to your etc/broctl.cfg and then try `broctl status` again. There will be a spool/debug.log in which you should find one of these messages:

    broccoli: cannot connect
    broccoli: timeout during send
    broccoli: timeout during receive

I'm guessing you're just running into (1) because a firewall is now blocking the connection. In that case, I think the timeout length for connect(2) can be pretty long (~2 minutes), but I'm not sure whether it also generally results in high CPU usage.

For (2) or (3), event status is polled once every second for a maximum of "CommTimeout" seconds (default 10 secs), so it's probably not CPU intensive, but it can end up taking a long time for cluster setups, especially since the manager queries node status serially.
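
(Concretely, the two knobs mentioned above go into etc/broctl.cfg like this;
the 10 seconds below is just the default made explicit:)

    Debug = 1
    # Seconds to wait for a peer's status before giving up (default 10).
    CommTimeout = 10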

    Jon

I have the following in my local.bro file:

redef SMTP::generate_md5 += /image.*/;
redef HTTP::generate_md5 += /image.*/;

...

Using broctl's top command and a little trial and error, I can see that
these lines are the cause of my high CPU usage. They also cause higher
memory usage, but memory usage always climbs and never gets smaller. I
don't know if these lines are responsible just for higher memory usage in
general, or whether they are also responsible for the gradual climb in
memory. It appears that memory gradually climbs even without these lines,
but I haven't had enough time to test that idea.

In general, the digest BiFs don't look like they leak, but if there is not an md5_hash_finish() for each corresponding md5_hash_init(), that could lead to growth of some internal state over time. The base scripts all attempt to clean up each md5_hash_init() with a corresponding md5_hash_finish(), but I'm not confident all edge cases are covered.
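
(For anyone writing their own scripts against these BiFs, the pairing looks
roughly like the sketch below -- signatures as in the 2.0 BiFs, where digest
state is keyed by an arbitrary index value, so an init without a matching
finish leaves that entry behind:)

event bro_init()
    {
    # Any value can serve as the index; a plain string here.
    md5_hash_init("example");
    md5_hash_update("example", "some data");
    md5_hash_update("example", "some more data");

    # Without this call, the digest state for "example" would linger in
    # Bro's internal table.
    print md5_hash_finish("example");
    }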

If you have any other local changes, you might check whether there's a difference running with just the vanilla Bro scripts rather than with them -- it can be easy to add something that causes too much state to accumulate over time. Another quick check is to look for any errors in reporter.log -- currently, interpreter exceptions due to scripting errors will not abort Bro, but they do cause a memory leak. Otherwise, it might be easiest for you to start looking into a memory profiling tool (e.g. valgrind, gperftools) to try to locate the problem more definitively.

    Jon