Quick question for those of you running Bro clusters. I often run into situations where the OOM-killer is invoked and kills some Bro process. Do any of you do anything to tune the OOM-killer on Linux, or otherwise tune memory management, such as disabling the OOM-killer, turning off swap, etc.?
Background: I've had varying success tracking down the events that cause me to suddenly run out of memory and ultimately crash. Sometimes it seems to be the result of log rotation getting stuck on a really big file (an 8Gig http or dns log), or a sudden 10G traffic spike overwhelming the cluster. I've pursued various avenues to mitigate the issue, such as shorter log rotation intervals, pruning known high-throughput compute traffic, scheduling daily restarts, etc. Ultimately I'm also looking to increase RAM, but I'm concerned that even with more RAM I'm just a traffic spike away from the OOM-killer, especially since we are unlikely to be able to buy cluster hardware fast enough to keep up with traffic volumes.
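For reference, these are the sorts of knobs I mean (illustrative values only, not something I've actually deployed; <bro_pid> is a placeholder):

echo -1000 > /proc/<bro_pid>/oom_score_adj   # exempt a given process from the OOM-killer
swapoff -a                                   # run the sensor without swap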
Regards,
Are you chasing a memory leak? broctl top will generally report <500MB of reserved memory (90% of the time even <256MB) per worker in a 40-worker cluster capable of handling spikes of 10Gb. Each worker has ~3GB of RAM dedicated to it.
I recall the log rotation process is a separate cron-style job that shouldn't really be bringing down the cluster workers.
-Alex
I'm not sure. I'm running Bro 2.2 (release) with the default scripts, and most of the known memory leak issues don't seem to apply to me. Other than some limited testing, I haven't been using any custom scripts, and none that depend on the input framework.
I've been thinking I may need more than 64G of RAM per node (16 cores, 3-5G of traffic, and 12 workers each). I seem to run with 100% of the RAM allocated, with 20-30% of it cached, until something happens that causes a sudden drop in cached memory (as seen on the Orca graphs), resulting in the OOM-killer killing one or more Bro processes.
I've been reading a bit about the OOM-killer, and some high-performance setups seem to call for disabling it, so I'm investigating whether it makes sense to tweak the vm.overcommit settings to disallow allocating more than the total physical RAM + swap, but I don't know if this is advisable for Bro or not, hence the query.
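For example, something along these lines (just a sketch of what I'm considering, not something I've tested under Bro):

sysctl -w vm.overcommit_memory=2   # refuse allocations rather than overcommit
sysctl -w vm.overcommit_ratio=100  # commit limit = swap + 100% of physical RAM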
~ Gary
You should be fine with those specs. 12 workers should be using closer to 12G of RAM, not anywhere near 64G.
Can you post the output of
free -m # on one of the worker nodes
broctl top # on the manager
and, to get an idea of your log message rate:
cat bro/logs/current/* | wc -l ; sleep 1m ; cat bro/logs/current/* | wc -l
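or, to print the per-minute delta directly, something along these lines should work:

a=$(cat bro/logs/current/* | wc -l); sleep 1m; b=$(cat bro/logs/current/* | wc -l); echo $((b - a))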
Can you also share the memory graph from this system over time,
particularly after a fresh restart of bro?
The output is below. I've been running the host longer than I have Orca graphs for. When looking at the graphs, you can identify the restarts by the sudden spike in free memory (in blue). There is a series of restarts toward the end of last week and the beginning of this week when I was experimenting with a script and making changes; that script was only tested between last Thursday and this Monday. Traffic and log rates were taken between 11AM and 1PM. In some cases I tried to collect multiple samples.
The history of OOM events predates the graphs, so I thought it would be useful to list them as well. I have been reducing the workload and worker count over time as part of troubleshooting. I also sometimes have an issue where Bro log rotation fails and I need to rotate logs manually and restart. This usually happens when available/cached memory drops below about 8G.
Each machine has two E5-2670s (8 cores, 2.6GHz) and 64G of RAM, so 16 cores / 32 HT per machine.
OOM-Killer (host 1, manager + 2 proxies + x workers):
Nov 21 - 24 workers
Nov 22 - 24 workers
Nov 25 - 24 workers
Dec 4 - 20 workers
Dec 5 - 20 workers
Dec 6 - 20 workers
Dec 27 - 12 workers
Jan 26 - 12 workers
Jan 30 - 12 workers - might be related to script testing
Feb 1 - 12 workers - might be related to script testing
OOM-Killer (host2, 2 proxies + x workers):
Nov 20 - 24 workers
Nov 21 - 24 workers
Nov 22 - 24 workers
Nov 25 - 24 workers
Dec 4 - 20 workers
Dec 5 - 20 workers
Dec 6 - 20 workers
Jan 26 - 12 workers
broctl top 11:15AM:
That is quite a lot of logs... Can you do just a `wc -l *` a minute
apart and diff that? I'm particularly wondering what rate of
notices/sec you are getting. I recently ran into and fixed an issue
with notice suppression using a lot of memory:
https://bro-tracker.atlassian.net/browse/BIT-1115
https://github.com/bro/bro/commit/ec3f684c610f084fdea8ed5cf85f9c4390eb58e6
I wonder if that could be the issue you are running into.
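Something like this should produce the per-file deltas (assuming the set of log files doesn't change between the two snapshots):

cd bro/logs/current
wc -l *.log > /tmp/counts.1 ; sleep 1m ; wc -l *.log > /tmp/counts.2
paste /tmp/counts.1 /tmp/counts.2 | awk '{print $3 - $1, $2}'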
Here it is just after a log rotation:
14 app_stats.log
32 capture_loss.log
3075 communication.log
10515588 conn.log
1463723 dns.log
13760 dpd.log
1562035 files.log
1527 ftp.log
1771968 http.log
74 irc.log
127 known_certs.log
21540 known_hosts.log
2696 known_services.log
325 notice.log
242 reporter.log
37892 smtp.log
13 socks.log
78387 software.log
3247 ssh.log
552563 ssl.log
4 stderr.log
3 stdout.log
672817 syslog.log
556 traceroute.log
5790 tunnel.log
472964 weird.log
17180962 total
1 min later:
14 app_stats.log
32 capture_loss.log
3470 communication.log
11859982 conn.log
1619893 dns.log
15468 dpd.log
1760513 files.log
1679 ftp.log
1993477 http.log
86 irc.log
139 known_certs.log
23839 known_hosts.log
2881 known_services.log
352 notice.log
259 reporter.log
42941 smtp.log
13 socks.log
88544 software.log
3581 ssh.log
622256 ssl.log
4 stderr.log
3 stdout.log
750444 syslog.log
561 traceroute.log
6567 tunnel.log
530259 weird.log
19327257 total
And the diff:
0 app_stats.log
0 capture_loss.log
395 communication.log
1344394 conn.log
156170 dns.log
1708 dpd.log
198478 files.log
152 ftp.log
221509 http.log
12 irc.log
12 known_certs.log
2299 known_hosts.log
185 known_services.log
27 notice.log
17 reporter.log
5049 smtp.log
0 socks.log
10157 software.log
334 ssh.log
69693 ssl.log
0 stderr.log
0 stdout.log
77627 syslog.log
5 traceroute.log
777 tunnel.log
57295 weird.log
2146295 total
Regards,
Gary Faulkner
UW Madison
Office of Campus Information Security
608-262-8591
You only had 27 notices, so it wasn't that problem.
I think @load'ing misc/profiling would be a good next troubleshooting
step. I believe the resulting prof.log can indicate which tables in
memory are growing too large.
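E.g., add "@load misc/profiling" to site/local.bro and then push it out (assuming a fairly standard broctl setup):

broctl check     # verify the updated policy parses
broctl install   # push it to the nodes
broctl restart

The prof.log should then show up in each node's spool directory.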