troubleshooting bro memory usage?

Hello,

I've just put in two sensors running bro (with security onion), and am having trouble with the bro processes progressively growing in RAM usage, until they crash or become unresponsive. For example, I have one bro worker process right now that's reached 2.8 GB in 2 hours while watching a < 100MB link. None of the other processes (manager/proxy/other workers) are anywhere near that...it's just this one worker.

Are there any config options I can enable to attempt to find the cause of the memory leak? Also, since I'm confident the link I'm watching is missing some traffic (the span it's on is slightly mis-configured at the moment), where can I configure protocol timeouts?

Thanks.

aaron

Hello,

I've just come across something that implies Bro is caching all DNS resolutions that go past it (https://bro-tracker.atlassian.net/browse/BIT-964). The bro systems I recently put in are in front of our main internal DNS resolvers, so almost all of the traffic they see is DNS resolution requests/answers. If Bro is caching all DNS, that would go a long way to explaining why bro's memory usage is continually increasing for my two sensors.

Is there a way to disable this caching? (or have I mis-understood what bro's doing with DNS?)

Thanks.

aaron

That's unrelated. It's referring to DNS lookup requests happening in script land. We ran into a case once where someone had written a script that did two reverse hostname lookups for every connection that was established (don't do this, it's *really* not a good idea). I should point out that their Bro cluster was running quite well even in the face of that, though I don't think their DNS resolver was very happy about it. :)

In general, monitoring in front of a DNS resolver should be just fine.

  .Seth

>> Is there a way to disable this caching? (or have I mis-understood what
>> bro's doing with DNS?)

> That's unrelated. It's referring to DNS lookup requests happening in script land. We ran into a case once where someone had written a script that did two reverse hostname lookups for every connection that was established (don't do this, it's *really* not a good idea). I should point out that their Bro cluster was running quite well even in the face of that, though I don't think their DNS resolver was very happy about it. :)

Heh. I'll keep that in mind.

> In general, monitoring in front of a DNS resolver should be just fine.

Hmm...that leaves me with my original problem, then: I have two vanilla securityonion installs (no custom .bro scripts added, just the ones that came with securityonion), watching just traffic to two different DNS resolvers...right now one of the worker parent processes (according to "broctl top") on each securityonion box grows monotonically in RAM usage until it gets killed by Linux (and is then restarted by broctl's cron job).

Any ideas on where I should start looking to identify what's causing the worker to grow in RAM like that?

Thanks.

aaron

> I have two vanilla
> securityonion installs (no custom .bro scripts added, just the ones that
> came with securityonion), watching just traffic to two different DNS
> resolvers

What traffic rate do you see?

> right now one of the worker parent processes (according to
> "broctl top") on each securityonion box grows monotonically in RAM usage
> until it gets killed by Linux (and is then restarted by broctl's cron job).

How much RAM is in the box?

  --Vlad

>> I have two vanilla
>> securityonion installs (no custom .bro scripts added, just the ones that
>> came with securityonion), watching just traffic to two different DNS
>> resolvers
>
> What traffic rate do you see?

95th percentile over a week (according to MRTG): Box 1: 34.6 Mbps. Box 2: 28 Mbps.

>> right now one of the worker parent processes (according to
>> "broctl top") on each securityonion box grows monotonically in RAM usage
>> until it gets killed by Linux (and is then restarted by broctl's cron job).
>
> How much RAM is in the box?

16 GB. Both have 6-core 2.2GHz CPUs, also.

Thanks.

aaron

All,

I think I know what's causing this on the surface, but I'm unsure of the deeper cause. When I commented out the SecurityOnion bro scripts, bro's memory usage was stable and reasonable. So the problem was clearly coming from securityonion's scripts. I then started adding the SecurityOnion scripts back in one by one, adding a ton of Reporter::warn statements, and watching the reporter.log. What I noticed was that the securityonion hostname.bro script never completed *if* the device's hostname had a dash in it ("location-onion", for example). When I changed the server's hostname to not have a dash, the hostname script completed without issue.

I suspect this means that the "hostname" and "interface" variables from the securityonion scripts weren't being initialized properly while trying to start up with a dashed hostname, doing who-knows-what when bro was told to add those variables to every logged event.

Given that, I have an easy fix in the short term, which is to rename the box running securityonion to not have a dash in its hostname. What I'm confused by is why this would happen in the first place. (So I'm not clear yet on what patch to suggest to the securityonion folks to prevent this from coming up again.)

The securityonion hostname.bro file does the following:

    module SecurityOnion;

    @load base/frameworks/input

    export {
        ## Event to capture when the hostname is discovered.
        global SecurityOnion::found_hostname: event(hostname: string);

        ## Hostname for this box.
        global hostname = "";
    }

    type HostnameCmdLine: record { s: string; };

    event SecurityOnion::hostname_line(description: Input::EventDescription, tpe: Input::Event, s: string)
        {
        hostname = s;
        system(fmt("rm %s", description$source));
        event SecurityOnion::found_hostname(hostname);
        }

    event add_hostname_reader(name: string)
        {
        Input::add_event([$source=name,
                          $name=name,
                          $reader=Input::READER_RAW,
                          $want_record=F,
                          $fields=HostnameCmdLine,
                          $ev=SecurityOnion::hostname_line]);
        }

    event bro_init() &priority=5
        {
        local tmpfile = "/tmp/bro-hostname-" + unique_id("");
        system(fmt("hostname > %s", tmpfile));
        event add_hostname_reader(tmpfile);
        }

The SecurityOnion::hostname_line event never fires if the hostname has a dash in it (for example, if the contents of the tmpfile are "location-onion"). I see the add_hostname_reader event fire, but not the hostname_line event. Do you all have any idea why that would fail if there's a string with a dash in the file? Is bro thinking it's an expression rather than a string? Two strings?

Thanks for all the help so far. This has been hard to nail down.

aaron

Hi Aaron,

There are definitely some issues with the hostname and interface scripts.

My demo at Bro Exchange last week failed due to the hostname script,
even though I put precautions in place which had always worked in the
past. My hostname did include a hyphen, but I recorded a video later
with the same VM (and same hostname) and everything worked fine:
http://youtu.be/0a2WDyBsxzk?t=2m36s

I'll also mention that all of my production servers have a hyphen in
the hostname and they work fine.

Another thing I noticed in testing a few weeks ago in a VM was that if
the VM had only a single CPU core the scripts were much more likely to
fail. Increasing to 2 or more CPU cores resulted in much higher
levels of success. Perhaps resource contention on Bro startup?

Seth, I know you're going to rewrite these scripts for Bro 2.2, but do
you have any ideas for troubleshooting in the meantime?

Thanks!

Doug

Can you send a sample of those messages? How much is a ton? :)

There's a known memory leak in Bro when the script interpreter reports
certain errors in script code. If this happens very often, it could
explain what you're seeing (unfortunately the leak is hard to fix, but
the messages usually indicate a problem in the corresponding script in
the first place).
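To put a number on "very often", one option is to tally reporter.log entries by level. The following is a minimal Python sketch, not part of Bro: the function name `count_reporter_levels` is hypothetical, the sample rows are made up, and it assumes the usual tab-separated Bro log layout with a `#fields` header that includes a `level` column.

```python
from collections import Counter

def count_reporter_levels(lines):
    """Count log entries per level, using the #fields header to find columns."""
    fields = []
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("#fields"):
            # Header row, e.g. "#fields<TAB>ts<TAB>level<TAB>message<TAB>location"
            fields = line.split("\t")[1:]
            continue
        if not line or line.startswith("#"):
            continue  # skip other metadata lines (#separator, #close, ...)
        row = dict(zip(fields, line.split("\t")))
        counts[row.get("level", "unknown")] += 1
    return counts

# Made-up sample in the shape of a reporter.log, for illustration only.
sample = [
    "#fields\tts\tlevel\tmessage\tlocation",
    "0.000000\tReporter::WARNING\texample warning\t/path/to/script.bro, line 1",
    "0.000000\tReporter::WARNING\tanother warning\t/path/to/script.bro, line 2",
    "1376401730.326379\tReporter::INFO\tprocessing suspended\t(empty)",
]
counts = count_reporter_levels(sample)
print(counts)
```

On a real sensor the same function could be fed an open file handle for wherever the installation writes reporter.log; a count that keeps climbing between runs would suggest the messages are frequent enough to matter.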

Robin

The hyphen in the hostname might be a red herring: at least part of the issue is a race condition in the script. The system() call that invokes `hostname` and writes the output to a temporary file runs in a separate background process, subject to the OS scheduler. If that process gets scheduled after the input reader has already tried and failed to open the temporary file, the input reader won't automatically recover from that.
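The same kind of race can be reproduced outside Bro. This is a minimal Python sketch, not how Bro's input reader is implemented: a child process stands in for `hostname > tmpfile`, the 0.3 second sleep is an assumption that exaggerates the scheduling delay, and the names `WRITER`, `racy_read`, and `safe_read` are hypothetical.

```python
import os
import subprocess
import sys
import tempfile

# Child process standing in for `hostname > tmpfile`; the sleep exaggerates
# the scheduling delay so the race is easy to observe.
WRITER = "import sys, time; time.sleep(0.3); open(sys.argv[1], 'w').write('location-onion\\n')"

def racy_read(path):
    """Start the writer in the background and try to read immediately."""
    proc = subprocess.Popen([sys.executable, "-c", WRITER, path])
    try:
        with open(path) as f:
            return f.read().strip()
    except FileNotFoundError:
        return None  # the reader lost the race and does not retry
    finally:
        proc.wait()  # clean up the child either way

def safe_read(path):
    """Wait for the writer to finish before opening the file."""
    subprocess.run([sys.executable, "-c", WRITER, path], check=True)
    with open(path) as f:
        return f.read().strip()

def demo():
    with tempfile.TemporaryDirectory() as d:
        racy = racy_read(os.path.join(d, "racy"))
        safe = safe_read(os.path.join(d, "safe"))
    return racy, safe

racy, safe = demo()
print("racy:", racy)
print("safe:", safe)
```

The reader that opens the file before the writer has run gets nothing back, regardless of what the hostname contains, while the version that waits for the writer sees the full value.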

I put a revision of the script you showed at [1] that *should* perform the same function without a race condition (though at the moment I'm not confident that the internals of the raw input reader are race-free in all cases; I'm looking into some things).

Still, I don't really know if this was actually the cause of your memory issues.

- Jon

[1] https://gist.github.com/jsiwek/6222106

I *added* a ton of Reporter::warn messages. Before this, bro was issuing one interesting error (see below), but I was basically adding lines like "script <x> started with variables <y>", "script <x> finished", etc to the reporter.log.

So, the log messages looked like:

    0.000000 Reporter::WARNING making tempfile: /tmp/bro-hostname-ndOXgWQ3v52 /opt/bro/share/bro/securityonion/./hostname.bro, line 40
    0.000000 Reporter::WARNING wrote hostname to tempfile /opt/bro/share/bro/securityonion/./hostname.bro, line 42
    0.000000 Reporter::WARNING called event to add hostname reader /opt/bro/share/bro/securityonion/./hostname.bro, line 44
    0.000000 Reporter::WARNING hostname reader starting on file: /tmp/bro-hostname-ndOXgWQ3v52 /opt/bro/share/bro/securityonion/./hostname.bro, line 28
    1376401730.326379 Reporter::INFO processing suspended (empty)
    1376401730.326379 Reporter::INFO processing continued (empty)
    1376401730.370328 Reporter::INFO processing continued (empty)

What got me going this way was an error earlier that was:

    0.000000 Reporter::WARNING Template value remaining in BPFConf filename: /etc/nsm/{{hostname}}-{{interface}}/bpf-bro.conf /opt/bro/share/bro/securityonion/./bpfconf.bro, line 99

which said to me that either the "hostname" or "interface" variable hadn't been initialized in the bro setup.

aaron

I just looked through the scripts and I really don't know why that would happen. If anything, my guess is that Jon's probably right and there is a race condition that is causing it to fail in unpredictable ways. I'll start updating the scripts in that repository for 2.2 soon, which might help a little. If I just update those in the master branch, could that cause any problems for SO?

  .Seth

Hi Jon,

Thanks for the revised script! I'll try it out this week and see if
it's more consistent.

Thanks,
Doug

Updating in the master branch shouldn't cause any problems for SO
since I packaged a static copy of the files and we're not actively
pulling anything from the master branch.

Thanks,
Doug

I’ve had this problem for too long. Wish I knew too. It seems that each time it’s brought up on a mailing list, the discussion gets hijacked, turns into feature requests or debates on new concepts, and loses sight of the original problem.

Keep hammering away. Good luck.

Here’s a suggestion that has helped me in the past: disable all scripts except the SSH and SSH brute force detection. Basically you’re using process of elimination to find what aspect of Bro is not performing well in your environment. Turn on features of Bro one by one until you find which one is the culprit. It’s tricky to debug Bro from site to site because of different traffic profiles.
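The elimination procedure can be written down as a small loop. This is only a Python sketch of the bookkeeping: `find_culprit` and the script names are hypothetical, and `is_stable` stands in for the manual step of restarting Bro with the given scripts loaded and watching memory (for example with "broctl top") for a while.

```python
def find_culprit(scripts, is_stable):
    """Enable scripts one at a time; return the first one that breaks stability."""
    enabled = []
    for script in scripts:
        enabled.append(script)
        if not is_stable(enabled):
            return script
    return None  # every script could be enabled without trouble

# Hypothetical script names and a stand-in stability check.
scripts = ["ssh", "ssh-bruteforce", "dns", "http", "hostname"]
leaky = {"hostname"}
culprit = find_culprit(scripts, lambda enabled: not (leaky & set(enabled)))
print(culprit)
```

Adding scripts cumulatively, rather than testing each in isolation, also catches problems that only show up when two scripts are loaded together.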

–TC

Thanks. Of the two boxes I have, one got better when I changed the hostname (I have no idea why that helped, but it's been stable across reboots and restarts since then...perhaps luck). The other one I'm still working on.

aaron

Greetings,

Are you running Bro as part of Security Onion? I saw a discussion about SO issues with hostnames containing hyphens.

-David

Hi David,

I think the hyphenated hostname was circumstantial evidence as the
hostname/interface scripts were inconsistent even with non-hyphenated
hostnames.

Jon provided a workaround earlier in the thread that appears to be
more consistent so far. I've packaged the updated scripts and
uploaded to our "test" repo. Here's the email I sent to our testers
last night:
https://groups.google.com/d/topic/security-onion-testing/KR_Q-e-SjPQ/discussion

Thanks,
Doug