We are still having a problem with our Bro cluster and logging. During peak times the manager will slowly consume all available memory while the logs sent to disk are delayed by an hour or more.
Does anyone know the official bug ID for this in bro-tracker.atlassian.net?
I’ve tracked this problem for a while now and tried all variations of the proposed fixes: the flare patch, the no-flare patch, a segmented cluster with one manager per box, and an architecture change from Linux+PF_RING to FreeBSD+Myricom. Currently we are using a standard build of bro-2.5-beta in a cluster configuration with one dedicated manager and three dedicated sensors, each using both ports of a Myricom card with 22 workers attached to each port. (In total: 1 manager, 1 logger, 12 proxies, and 6 worker nodes with 22 procs each, 132 workers total.)
Restarting the cluster on a regular basis is much easier without PF_RING, but that only partially treats the symptom. The last proposed solution, faster CPUs to reduce the worker count, is also the most expensive. But will that really solve the problem? I’m more interested in defining what the problem actually is.
FWIW, there’s some text below to illustrate. The dates are somewhat old, but it’s still a representative example.
- Manager node is nearly out of memory… 2800 MB left
- Workers have moderate CPU usage, 60%
- Logs on manager node are 25 minutes behind…
- 21:05 vs 20:40
- Initiated cluster restart at 21:06, completed at 21:11.
Workers have moderate CPU usage.
Logs are 16 minutes behind
Earlier the logs were roughly two hours behind.
[bro@mgr /opt/bro]$ date -r 1471373408 (most recent conn.log timestamp)
Tue Aug 16 18:50:08 UTC 2016
[bro@mgr /opt/bro]$ date
Tue Aug 16 20:43:45 UTC 2016
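For anyone who wants to track the lag over time rather than eyeballing two `date` calls, here is a minimal Python sketch of the same comparison. It assumes the default tab-separated ASCII log format, where header and footer lines start with `#` and the first column is the epoch `ts`. The helper names and the sample log lines are made up for illustration; only the timestamp and the wall-clock time come from the output above.

```python
from datetime import datetime, timedelta, timezone


def last_timestamp(lines):
    """Return the ts (first tab-separated field) of the last data line.

    Lines starting with '#' are treated as header/footer lines and skipped.
    Returns None if there is no data line at all.
    """
    last = None
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        last = float(line.split("\t", 1)[0])
    return last


def log_lag(last_ts, now):
    """Return how far the newest log entry is behind `now` as a timedelta."""
    written = datetime.fromtimestamp(last_ts, tz=timezone.utc)
    return now - written


# Hypothetical excerpt; only the timestamp is taken from the post.
sample = [
    "#fields\tts\tuid",
    "1471373408.000000\tCexample1",
    "#close\t2016-08-16-18-50-08",
]

# Wall-clock time from the second `date` call above.
now = datetime(2016, 8, 16, 20, 43, 45, tzinfo=timezone.utc)
lag = log_lag(last_timestamp(sample), now)
print(lag)  # 1:53:37
```

Run from cron against the tail of conn.log, something like this would give a lag series to put next to the memory graph.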
Bro manager process is using 70G of memory and the system is swapping:
last pid: 96557; load averages: 46.37, 53.09, 54.88 up 0+18:06:24 21:25:17
55 processes: 8 running, 47 sleeping
CPU: 7.7% user, 2.1% nice, 68.1% system, 0.2% interrupt, 21.9% idle
Mem: 103G Active, 2412M Inact, 19G Wired, 549M Cache, 331M Free
ARC: 15G Total, 89M MFU, 15G MRU, 29M Anon, 68M Header, 211M Other
Swap: 12G Total, 12G Used, 85M Free, 99% Inuse, 9248K In
PID USERNAME THR PRI NICE SIZE RES STATE C TIME WCPU COMMAND
7305 bro 34 20 0 40121M 39498M uwait 10 31.7H 280.27% bro
7337 bro 1 96 5 70653M 61577M CPU36 36 868:45 59.96% bro
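To log these numbers instead of copying them out of top by hand, the process lines can be parsed with a few lines of Python. This is only a sketch: it assumes the 12-column layout shown above (PID first, SIZE and RES in columns 6 and 7, COMMAND last) and sizes suffixed with K, M, or G. The function names are my own.

```python
def parse_size_mib(text):
    """Convert a top-style size such as '70653M' or '12G' to MiB."""
    units = {"K": 1 / 1024, "M": 1, "G": 1024}
    return float(text[:-1]) * units[text[-1]]


def bro_memory(top_lines):
    """Return (pid, size_mib, res_mib) for every bro process line."""
    rows = []
    for line in top_lines:
        fields = line.split()
        # Process lines have 12 columns and start with a numeric PID.
        if len(fields) == 12 and fields[0].isdigit() and fields[-1] == "bro":
            rows.append(
                (int(fields[0]), parse_size_mib(fields[5]), parse_size_mib(fields[6]))
            )
    return rows


# The header and the two process lines from the top output above.
top_output = [
    "PID USERNAME THR PRI NICE SIZE RES STATE C TIME WCPU COMMAND",
    "7305 bro 34 20 0 40121M 39498M uwait 10 31.7H 280.27% bro",
    "7337 bro 1 96 5 70653M 61577M CPU36 36 868:45 59.96% bro",
]

rows = bro_memory(top_output)
largest = max(rows, key=lambda row: row[1])
for pid, size, res in rows:
    print(f"pid {pid}: size {size / 1024:.1f} GiB, resident {res / 1024:.1f} GiB")
```

Here `largest` picks out PID 7337, the process with the 70653M SIZE.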
Currently, in this state, the logs are over two hours behind the current time.
bro@mgr:~ % date -r 1471374952 (most recent conn.log timestamp)
Tue Aug 16 19:15:52 UTC 2016
bro@mgr:~ % date
Tue Aug 16 21:27:04 UTC 2016
Memory usage over the past week: