I've been running Bro 2.2 for just under a month, mostly without incident. After my most recent restart it ran for about two days before I received crash reports for several workers and proxies across multiple hosts. Upon investigation it looks like I may have run out of memory. I found entries such as the following in /var/log/messages on all of my nodes (manager and worker hosts):
bro invoked oom-killer
...
Out of memory: Kill process 7152 (bro) score 151 or sacrifice child
Has anyone seen this before? Is this just a sign that I need more RAM, or am I possibly running into a memory leak? I have run for up to a week without incident in the past before restarting of my own accord after making various changes to reporting, policy, etc. The only thing I changed prior to the last restart was to disable an email notice I had previously set up.
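For anyone else trying to tell a slow leak apart from plain undersizing, a minimal sketch of what can be watched on each node (plain shell; the log path and sampling interval are only illustrative):

    # Any OOM kills recorded so far?
    grep -i oom-killer /var/log/messages

    # Sample resident memory of the Bro processes every five minutes; a steady
    # climb while traffic stays flat points at a leak rather than undersizing.
    while true; do
        date
        ps -C bro -o pid,rss,vsz,etime,args --sort=-rss | head -20
        sleep 300
    done >> /tmp/bro-mem.log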
I've seen this behavior too (in addition to some really weird and massive packet drops). I've been trying to gather some perf.log data and track it down to the time(s) in question. One of the strange parts is that it's on a box that's only seeing 200Mbps with 8 Bro workers on 16 cores and 96GB of RAM.
I think you're definitely running into a memory leak. I've had 2.2 processes try to grab up to 100GB of RAM. 8 workers, 96GB of RAM, but the box splits time with another 8 Snort workers. My late 2.1 release (September 21, IIRC) was quite a bit more stable.
I think there is some particular traffic that you guys are running into that's causing it. A few other people have encountered that too but we haven't been able to nail down what it is yet.
It normally has a healthy diet of 7-8Gbps and 1Mpps of traffic to consume, and other than some initial growing pains (not enough proxies) it has mostly just worked, although when it has stopped receiving traffic, such as during network maintenance or code upgrades to the load-balancer, it has sometimes complained and started killing processes.
That was my assumption too. I upgraded on 8 November, leaked in the early AM on the 16th, and then again on the 29th. Traffic would have been at an ebb on the 16th and rising on the 29th, so I don't think it's sheer volume; as you say, there must be something *in* the traffic. Or, more likely, a sequence of things; otherwise I'd expect 2.2 to be vomiting all over my RAM far more often.
Please let me know if there's anything I can do to help. I got lucky with these: the first crash was the day before I started vacation (well, technically, my first day of it) and the second was the day immediately after I returned. Unfortunately, when it does happen it takes out my IDS entirely, as I need to cold-boot the server, so if live diagnostics are required they'll have to be timed for when people are around.
If we're able to get our hands on some of the traffic that causes these issues (pcaps spanning the time window of the memory growth/massive drops), what would be some good tests to run against it?
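As a rough sketch of one way to capture and replay such a window (interface, filenames, and rotation interval are only illustrative), one could keep a rolling capture on the sensor and later feed the suspect slice back through a single offline Bro process while watching its memory:

    # Rolling capture, new file every 5 minutes:
    tcpdump -i eth0 -s 0 -G 300 -w 'leak-%Y%m%d-%H%M%S.pcap'

    # Replay the suspect window offline and watch process memory while it runs:
    bro -r leak-20131129-020000.pcap local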
Nothing suspicious or weird shows up in the perf.log for each worker (or the manager).
I have little experience with it, but could running with valgrind be possible/advisable/useful? Perhaps best to run a build compiled with -O1 on a box that is overpowered for the expected load…
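For reference, a heap-profiling run with valgrind's massif tool against a captured trace might look roughly like the following (assuming a Bro build with debug symbols; massif runs far slower than real time, so offline pcap replay rather than live capture):

    valgrind --tool=massif --massif-out-file=bro.massif \
        bro -r leak-20131129-020000.pcap local
    ms_print bro.massif | less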
It was not all of them, but it was more than one, both times. I’ll take a closer count next time.
Maybe “amount of RAM I’m currently using” could be added per-worker to some stats output plugin?
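Something close to that may already exist, if memory serves: broctl has a top command that prints CPU and memory per node, and loading the stats policy script should produce periodic stats.log entries that include per-node memory (script name as I recall it; worth double-checking against your install):

    # One-off snapshot of CPU/memory per manager, proxy, and worker:
    broctl top

    # In local.bro, enables periodic stats.log entries (including memory):
    @load misc/stats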
It seems like oom-killer got invoked on one host and then, 2 minutes later, on another host in my cluster. The load-balancer should be sending each host unique traffic at any given moment. Only 3 workers out of around 40 got killed, but then 3 of my proxies crashed. Coincidentally, as I was writing this I just ran out of RAM again after 24 hours of operation.
Regards,
Gary Faulkner
UW Madison
Office of Campus Information Security
608-262-8591
I've had some proxy crashes in the past and it was suggested that I increase my number of proxies, which I did until my environment appeared stable. After being stable for about a week I started to run out of memory, and in subsequent restarts I have been running out of memory after about 24 hours of operation, typically during non-peak times (around 50% of normal traffic). Naturally I'm wondering if I'm just doing it wrong and whether my setup is appropriately sized and configured to handle the load I'm asking it to deal with.
I think I've seen folks on the list who are running Bro on similar hardware and who might be able to tell me if my configuration is anything close to what works for them. I'm also curious how other folks determine how many proxies they need, how many workers per host, etc.
I'm mostly running Bro 2.2 stock with default scripts, and only minor edits to local.bro to test out email notices. I'm only using these systems for Bro, although they were originally from another project so they weren't necessarily ordered with Bro specs in mind.
Realistically, for that load I would recommend looking into a cluster. My personal sizing criterion is 4-5Gb/s max per box. That's for a box that's roughly the same as yours, and it uses up about 40-50GB of RAM per box (although for as cheap as it is, I recommend always guessing high on the RAM). For a box of this size, you can probably improve your performance by reducing the number of workers (one per real core is a good benchmark). I am conservative, so I like to keep a couple of cores free for system tasks to ensure reliable performance (setting the number of workers to something like 12 or 14), but that's up to you.
As a caveat, the aforementioned recommendations assume that you're using a network card that's designed to do this work (like an Intel ixgbe card with pf_ring, or a Myricom). If your card can't bypass the kernel's interfaces, then you're going to need a lot more hardware to get the same performance, because you're spending CPU time shoving the packets through the kernel instead of just accessing them directly on the NIC.
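As a quick sanity check (assuming a pf_ring setup; the paths and interface name below are just what they typically look like), one can confirm that the workers really are capturing via pf_ring rather than plain libpcap:

    # Module loaded, version and ring info:
    cat /proc/net/pf_ring/info

    # Roughly one entry per open ring; with N workers on an interface
    # you'd expect to see about N rings here:
    ls /proc/net/pf_ring/

    # Which driver the capture interface is bound to:
    ethtool -i eth0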
Thank you for the feedback. To clarify, I am currently using two physical hosts clustered together (using broctl), so each box ends up with 64GB of RAM, 16 cores/32 threads, and Intel ixgbe 10G cards plus pf_ring/DNA/libzero for distributing packets on the host. Each physical host then sees between 2-4Gbps and has 20 workers + 2 proxies. I recall reading a blog entry by Martin Holste where he mentioned allocating only half as many workers as you have logical cores/threads, but I also seem to recall others (Vlad G.) pushing nearly as many workers as logical cores, though I could have read too much into it. Are you referring to physical cores or logical cores/threads? If it is the former, I think what you are saying is in line with what Martin suggested, although I had hoped I could push the worker count a bit higher based on what I thought I had read elsewhere.
Ah, my apologies for my misreading of your specs. Two hosts seeing 2-4Gb/s each should be just fine.
As far as my recommendation, I definitely meant real cores. I find, based on my own experimentation, that taking your number of workers higher than the number of physical cores in the box tends to hurt the performance instead of helping. In the case where a box was also doing manager/proxy duty, I'd subtract a physical core for each of those as well (I've broken those out into a separate box to avoid that issue on mine). I don't recall seeing anything from Vlad suggesting going higher, but I can't rule out the possibility that I missed it.
Also to answer a question that I overlooked earlier, I'm running 3 proxies on that workload. I got to that number by just increasing by one until it stabilized.
So, probably closer to 10-12 workers per host in my case, since the proxies and manager are there? It was suggested to me early on to try to acquire another box so I could separate out the manager and proxies, but I don't have one quite yet, so I've been trying to make it work as is. It didn't seem like the worker child processes needed much CPU time, so I thought I could push the worker count higher, and it also seemed like I got less loss per broctl netstat, but others have suggested that broctl netstat may not be the most reliable way to judge that. I ended up at 4 proxies mostly because two didn't seem stable, and I like symmetry, so I jumped straight to 4.
Twelve per box would be my plan. And to be honest, once you lower the number of workers, you could probably reduce the number of proxies as well. I'm running 3 for a cluster with 72 active workers, so you could probably do 24 with just one (and you could even put it on the non-manager box to maintain some symmetry). As always though, feel free to experiment with that to find what works best for you.
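For concreteness, a node.cfg along the lines of what's being suggested might look roughly like the sketch below (hostnames and the interface name are placeholders, and since you're on DNA/libzero the interface/load-balancing settings may need to differ; worth checking against the broctl documentation for your version):

    [manager]
    type=manager
    host=host1

    [proxy-1]
    type=proxy
    host=host2

    [worker-1]
    type=worker
    host=host1
    interface=eth2
    lb_method=pf_ring
    lb_procs=12

    [worker-2]
    type=worker
    host=host2
    interface=eth2
    lb_method=pf_ring
    lb_procs=12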