Hi all,
I have an issue with the cluster manager crashing when lots of log events are sent to it.
I set up a Bro cluster on a single server; the cluster has 32 workers and 1 proxy and handles about 5 Gb/s. After running for about an hour and a half, the cluster no longer produces logs, but the workers still extract files. So it seems that the manager has crashed.
Is it possible that the manager stops working when workers send it lots of log events? If so, what's the limit on log events? Or would the issue go away if I ran a real cluster across several servers?
By the way, if I want to handle 10 Gb/s, how much memory should I leave for each worker? If I impose memory usage restrictions, will that affect the performance of the cluster?
> I have an issue with the cluster manager crashing when lots of log events
> are sent to it.
> I set up a Bro cluster on a single server; the cluster has 32 workers and 1
> proxy and handles about 5 Gb/s. After running for about an hour and a half,
> the cluster no longer produces logs, but the workers still extract files.
> So it seems that the manager has crashed.
> Is it possible that the manager stops working when workers send it lots
> of log events? If so, what's the limit on log events? Or would the issue
> go away if I ran a real cluster across several servers?
Yes, it is possible to kill a manager by sending too much data to it,
though that is usually caused by event traffic and not by logs. There is
no definitive limit; it depends a bit on your hardware and traffic.
Generally, if your manager really crashes, it should be restarted by the
broctl cron process. If you have a lot of logging, starting with Bro 2.5
(currently in beta), you can also separate logging from the manager and
move it into a logger node. To enable this on 2.5, put the following into
your node.cfg (this is also part of the example configuration):
[logger]
type=logger
host=localhost
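
For context, here is a rough sketch of how the logger section might sit in a complete single-host node.cfg for a setup like the one described (32 workers, 1 proxy). The interface name, the load-balancing method, and the node names are assumptions for illustration, not values taken from the original poster's configuration; adjust them to your environment.

```ini
# Sketch of a single-host node.cfg with a dedicated logger (Bro 2.5+).
# Assumed for illustration: interface eth0, PF_RING load balancing.

[logger]
type=logger
host=localhost

[manager]
type=manager
host=localhost

[proxy-1]
type=proxy
host=localhost

[worker-1]
type=worker
host=localhost
# Assumed capture interface; replace with your own.
interface=eth0
# Assumed load-balancing method; requires Bro built with PF_RING support.
lb_method=pf_ring
# One worker entry expanded into 32 load-balanced worker processes.
lb_procs=32
```

With a logger node present, the workers send their log writes to the logger instead of the manager, so the manager only has to deal with event traffic.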
> By the way, if I want to handle 10 Gb/s, how much memory should I leave
> for each worker? If I do memory usage restrictions, will it affect
> the performance of the cluster?
The amount of memory depends on your traffic mix and is a bit difficult to
predict (I will let others chime in what their experiences are). If you
put in memory usage restrictions, it will kill the processes if they need
more memory than they are allowed to allocate.
To add to what Johanna said, what's the desire behind restricting memory use? Are you running other processes on the system and you'd like to avoid Bro processes consuming all of the memory?
This is what I call an architectural limitation of Bro. It's well known but not really formally acknowledged; you can read the archives and see for yourself. Faster CPUs will mask the problem because you need fewer workers (in theory). I'm not sure where the ideal worker threshold is, or how it relates to events per second.
Some people avoid this issue by segmenting the cluster per server; you lose some functionality, but at least your cluster runs without incident (mostly). Example: a four-server Bro cluster becomes four Bro clusters, each running its own manager and writing to local disk. If you're just using Bro as a network recorder, this is a fairly even trade-off.
I've never had a day where Bro didn't crash due to memory exhaustion, and I have to perform full restarts once per hour to prevent manager crashes. The only way to fix it is to become a Bro developer. :>
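
To make the per-server split concrete, here is a minimal sketch of the node.cfg that each of the four servers could carry in that arrangement. Every server runs its own broctl installation with a copy like this, so nothing crosses the network between servers. The interface name, load-balancing method, and worker count are assumptions for illustration only.

```ini
# Sketch: node.cfg for ONE server in a "four independent clusters" layout.
# Each server has its own copy and its own manager; logs stay on local disk.
# Assumed for illustration: interface eth0, PF_RING, 8 worker processes.

[manager]
type=manager
host=localhost

[proxy-1]
type=proxy
host=localhost

[worker-1]
type=worker
host=localhost
interface=eth0
lb_method=pf_ring
lb_procs=8
```

The functionality you lose is anything that relies on a single manager seeing the whole picture, since each manager now only hears from the workers on its own server.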
Unfortunately in some environments this is currently true. We are actively working on addressing these issues on multiple fronts though. We're hoping that these troubles can be eradicated for more and more people over time.
The logger node, as you commented in a follow-up, is one of those approaches. We are also still working on replacing the built-in communication mechanism with Broker, which will hopefully have some positive effects on stability (go Matthias!), and Justin is investigating SumStats and some scripts that have particularly negative effects to see what changes are needed to make them scale better horizontally. We have also extended the misc/stats.bro information in 2.5, and we will likely continue extending it in 2.6 to provide more information about what Bro is doing at runtime and to help understand its behavior better.
Stability problems are definitely something we're concerned about as much as you are.