I built our cluster with the following specs:
Two Bro worker hosts in the cluster, plus a separate server that runs the Bro manager and stores the logs, plus the Arista switch.
Each worker server has 20 cores and 128GB of memory, with a 10Gig Intel network card running PF_RING. We have now pinned 18 Bro processes to the cores, which leaves 2 for OS tasks.
Regarding campus bandwidth, we are pushing 4 Gb/s during normal peak times on I1, and up to 2 Gb/s on I2 at random times. So it is very possible to be pushing between 4 and 6 Gb/s of traffic at any one time.
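As a rough back-of-the-envelope check, the figures above can be turned into an average load per Bro worker process. This is only a sketch: it assumes traffic is spread perfectly evenly across all 36 pinned processes (2 hosts x 18), which real flow-based load balancing will not achieve, and the function and constant names are made up for illustration.

```python
# Average traffic per Bro worker process, using the numbers quoted in this post.
# Assumption: load is split perfectly evenly across processes (it will not be
# in practice, since load balancing is per flow and flows differ in size).
WORKER_HOSTS = 2
PROCS_PER_HOST = 18  # 18 of the 20 cores on each host run a pinned worker


def per_process_mbps(total_gbps, hosts=WORKER_HOSTS, procs_per_host=PROCS_PER_HOST):
    """Return the average Mb/s each worker process would see."""
    return total_gbps * 1000 / (hosts * procs_per_host)


for label, gbps in [("I1 peak", 4), ("I1 + I2 peak", 6)]:
    print(f"{label}: {per_process_mbps(gbps):.0f} Mb/s per process")
```

Passing `hosts=1` gives the same estimate for a single worker server carrying the whole load.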
I installed the cluster in a nearby datacenter to test it out before moving it downtown next to our edge routers. The edge routers span the traffic to our Bro cluster's Arista switch. While the cluster was in the datacenter it saw live traffic, though much less than the overall campus bandwidth; it was a convenient place near my office to make sure everything was working correctly. The packet loss in the datacenter was very low, the cluster was running great, and the logs looked complete, so we moved it downtown.
We immediately had a problem with capture loss of 60%-90%. The CPUs, though, weren’t all pegged at 100% like they were with our old underpowered Bro box. Only 6-7 processors were pegged at 100% and the rest were down around 30-40%.

We added a BPF filter to only process packets to/from our campus subnets. We haven’t tried a filter for large flows yet. We double-checked the PF_RING settings, double-checked the Arista switch settings, and tuned the network card, but the main thing we did was turn off hyper-threading. That immediately dropped the capture loss down to 0.0%. I guess we didn’t catch that earlier because the datacenter didn’t have as much traffic going through the cluster. In the datacenter test we had pinned Bro to 36 of the 40 hyper-threaded CPUs on each worker, so when we got rid of that, it worked great.

We had been running eight Bro proxies on the manager when it was hyper-threaded, and I think I can drop that down to four proxies now. I just haven’t tried that yet since it is working at the moment.
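To see why pinning to 36 of 40 hyper-threaded CPUs can hurt, it helps to look at which logical CPUs share a physical core. The sketch below is not what we ran; we simply turned hyper-threading off. It only illustrates how, on Linux, the sysfs file `thread_siblings_list` can be used to pick one logical CPU per physical core. The helper names and the 4-core example layout are hypothetical.

```python
def parse_cpu_list(text):
    """Parse a Linux CPU list such as '0,20' or '0-1' into a sorted list of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def one_cpu_per_core(siblings_by_cpu):
    """Given {cpu: siblings string}, return the lowest-numbered CPU of each core."""
    return sorted({parse_cpu_list(s)[0] for s in siblings_by_cpu.values()})


def read_sibling_lists(n_cpus):
    """Read thread_siblings_list from sysfs for CPUs 0..n_cpus-1 (Linux only)."""
    out = {}
    for cpu in range(n_cpus):
        path = f"/sys/devices/system/cpu/cpu{cpu}/topology/thread_siblings_list"
        with open(path) as f:
            out[cpu] = f.read()
    return out


# Hypothetical 4-core, 8-thread layout where CPU n and CPU n+4 share a core.
example = {cpu: f"{cpu % 4},{cpu % 4 + 4}" for cpu in range(8)}
physical = one_cpu_per_core(example)
print(physical)
```

On a real worker, `one_cpu_per_core(read_sibling_lists(40))` would list the candidate CPUs to pin to. Two processes pinned to sibling CPUs compete for the same physical core, which may be why the problem only showed up under full campus load.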
After we got the packet loss to 0.0%, we actually ran a test where we set up a separate instance of Bro on just one of our worker servers, and it was able to handle the entire load (at least during a short test). All the processors jumped from 20-40% usage to 40-60% when we ran it on a single box, but the capture loss was still basically zero. We are going to run with two boxes, though, because we expect our bandwidth needs to grow, and the cluster should be able to keep up for a while.
Hope that helps.
Information Security Manager