OOM-killer & Bro

Quick question for those of you running Bro clusters. I often run into situations where the OOM-killer is invoked and kills some Bro process. Do any of you do anything to tune the OOM-killer on Linux or otherwise tune memory management, such as disabling the OOM-killer, turning off swap, etc.?
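
(For concreteness, the sorts of knobs I have in mind look like the following; this is only a sketch of the idea, and <pid> is just a placeholder, not anything I'm actually running today:)

    # exempt a given Bro process from the OOM-killer entirely
    echo -1000 > /proc/<pid>/oom_score_adj

    # or take swap out of the picture altogether
    swapoff -a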

Background: I've had varying success tracking down the events that cause me to suddenly run out of memory and ultimately crash. Sometimes it seems to be the result of log rotation getting stuck on a really big file (an 8GB http or dns log), or a sudden 10G traffic spike overwhelming the cluster. I've pursued various avenues to mitigate the issue, such as shorter log rotation intervals, pruning known high-throughput compute traffic, scheduling daily restarts, etc. Ultimately I'm also looking to increase RAM, but I'm concerned that even with more RAM I'm just a traffic spike away from the OOM-killer, especially since we are unlikely to be able to buy cluster hardware fast enough to keep up with traffic volumes.

Regards,

Are you chasing a memory leak? broctl top will generally report <500MB of reserved memory (90% of the time even <256MB) per worker in a 40-worker cluster capable of handling spikes of 10Gb. Each worker has ~3GB of RAM available to it.

As I recall, the log rotation process is a separate cron-style job that shouldn't really be bringing down the cluster workers.

-Alex

I'm not sure. I'm running Bro 2.2 (release) with the default scripts, and most of the known memory leak issues don't seem to apply to me. Other than some limited testing I haven't been using any custom scripts, and none that depend on the input framework.

I've been thinking I may need more than 64G of RAM per node (16 cores, 3-5Gb of traffic, and 12 workers each). I seem to run with 100% of the RAM allocated but 20-30% of it cached, until something causes a sudden drop in cached memory (as seen on the Orca graphs), resulting in the OOM-killer dropping one or more Bro processes.

I've been reading a bit about the OOM-killer, and some high-performance situations seem to call for disabling it, so I'm investigating whether it makes sense to tweak the vm.overcommit settings to disallow allocating more than the total physical RAM + swap, but I don't know whether this is advisable for Bro or not, hence the query.
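
(Roughly, the settings I'm looking at are along these lines; just a sketch, and the ratio value is illustrative:)

    # strict overcommit: commit limit = swap + overcommit_ratio% of physical RAM
    sysctl -w vm.overcommit_memory=2
    sysctl -w vm.overcommit_ratio=100

    # made persistent by putting the same keys in /etc/sysctl.conf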

~ Gary

You should be fine with those specs... 12 workers should be using closer
to 12G of RAM, not anywhere near 64G.

Can you post the output of

    free -m # on one of the worker nodes
    broctl top # on the manager

and, to get an idea of your log message rate:

    cat bro/logs/current/* | wc -l ; sleep 1m ; cat bro/logs/current/* | wc -l

Can you also share the memory graph from this system over time,
particularly after a fresh restart of bro?

The output is below. I've been running the host longer than I have Orca graphs for. When looking at the graphs, you can identify the restarts by the sudden spikes in free memory (in blue). There is a series of restarts toward the end of last week and the beginning of this week where I was experimenting with a script and making changes; that script was only tested between last Thursday and this Monday. Traffic and log rates were taken between 11AM and 1PM. In some cases I tried to collect multiple samples.

The history of OOM events predates the graphs, so I thought it would be useful to list them as well. I have been reducing workload and worker counts over time as part of troubleshooting. I also sometimes have an issue where Bro log rotation fails and I need to rotate logs manually and restart. This usually happens when available/cached memory drops below about 8G.

Machines each have two E5-2670s (8 cores, 2.6GHz) and 64G of RAM, so 16 cores / 32 hyperthreads per machine.

OOM-Killer (host 1, manager + 2 proxies + x workers):
Nov 21 - 24 workers
Nov 22 - 24 workers
Nov 25 - 24 workers
Dec 4 - 20 workers
Dec 5 - 20 workers
Dec 6 - 20 workers
Dec 27 - 12 workers
Jan 26 - 12 workers
Jan 30 - 12 workers - might be related to script testing
Feb 1 - 12 workers - might be related to script testing

OOM-Killer (host2, 2 proxies + x workers):
Nov 20 - 24 workers
Nov 21 - 24 workers
Nov 22 - 24 workers
Nov 25 - 24 workers
Dec 4 - 20 workers
Dec 5 - 20 workers
Dec 6 - 20 workers
Jan 26 - 12 workers

broctl top 11:15AM:

[Attachments: host1 memory-use graphs (hourly, daily, weekly, monthly), 4 Feb 2014]

That is quite a lot of logs... Can you do just a `wc -l *` a minute
apart and diff that? I'm particularly wondering what rate of
notices/sec you are getting. I recently ran into and fixed an issue
with notice suppression using a lot of memory:

https://bro-tracker.atlassian.net/browse/BIT-1115
https://github.com/bro/bro/commit/ec3f684c610f084fdea8ed5cf85f9c4390eb58e6

I wonder if that could be the issue you are running into.
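
Something along these lines would give the per-file deltas (just a sketch; it assumes the set of log files doesn't change between the two snapshots, i.e. no rotation in that minute):

    cd bro/logs/current
    wc -l * > /tmp/counts.before ; sleep 60 ; wc -l * > /tmp/counts.after
    # pair the two snapshots line by line and print per-file growth over the minute
    paste /tmp/counts.before /tmp/counts.after | awk '{print $3 - $1, $2}'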

Here it is just after a log rotation:
         14 app_stats.log
         32 capture_loss.log
       3075 communication.log
   10515588 conn.log
    1463723 dns.log
      13760 dpd.log
    1562035 files.log
       1527 ftp.log
    1771968 http.log
         74 irc.log
        127 known_certs.log
      21540 known_hosts.log
       2696 known_services.log
        325 notice.log
        242 reporter.log
      37892 smtp.log
         13 socks.log
      78387 software.log
       3247 ssh.log
     552563 ssl.log
          4 stderr.log
          3 stdout.log
     672817 syslog.log
        556 traceroute.log
       5790 tunnel.log
     472964 weird.log
   17180962 total

1 min later:
         14 app_stats.log
         32 capture_loss.log
       3470 communication.log
   11859982 conn.log
    1619893 dns.log
      15468 dpd.log
    1760513 files.log
       1679 ftp.log
    1993477 http.log
         86 irc.log
        139 known_certs.log
      23839 known_hosts.log
       2881 known_services.log
        352 notice.log
        259 reporter.log
      42941 smtp.log
         13 socks.log
      88544 software.log
       3581 ssh.log
     622256 ssl.log
          4 stderr.log
          3 stdout.log
     750444 syslog.log
        561 traceroute.log
       6567 tunnel.log
     530259 weird.log
   19327257 total

And the diff:

0 app_stats.log
0 capture_loss.log
395 communication.log
1344394 conn.log
156170 dns.log
1708 dpd.log
198478 files.log
152 ftp.log
221509 http.log
12 irc.log
12 known_certs.log
2299 known_hosts.log
185 known_services.log
27 notice.log
17 reporter.log
5049 smtp.log
0 socks.log
10157 software.log
334 ssh.log
69693 ssl.log
0 stderr.log
0 stdout.log
77627 syslog.log
5 traceroute.log
777 tunnel.log
57295 weird.log
2146295 total

Regards,

Gary Faulkner
UW Madison
Office of Campus Information Security
608-262-8591

You only had 27 notices in that minute (well under one notice/sec), so it wasn't that problem.

I think @load'ing misc/profiling would be a good next troubleshooting
step. I believe the resulting prof.log can indicate which tables in
memory are growing too large.
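
Roughly (a sketch, assuming a stock broctl install under /usr/local/bro; adjust the prefix to match your install):

    # add the profiling script to the site policy and push it out to the cluster
    echo '@load misc/profiling' >> /usr/local/bro/share/bro/site/local.bro
    broctl check && broctl install && broctl restart

    # prof.log should then appear in each node's spool/working directory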