Bro performance & sizing question

We have a Bro cluster currently attempting to process up to 13Gbps
(1.4Mpps) partitioned over two 10Gbps Gigamon network taps.

Capture loss currently averages 44% - but before buying more hardware,
we'd like to sanity-check our plans with folks who have already
successfully sized their own installations.

Currently there are two Bro hosts in the cluster, each with 20 CPU
cores (3.1GHz), 128GB memory, and Myricom cards with the Sniffer V3
driver. Each host runs a proxy and 17 workers pinned to CPUs. The
manager runs on one of the worker hosts, and logs are written to SSD
drives. We're using restrict_filters to ignore the (large) flows
generated by four hosts.
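
For reference, the relevant pieces of our config look roughly like the
following (hostnames, interface names, CPU numbers, and addresses are
illustrative, not our real values):

  # node.cfg excerpt for one host (the one that also runs the manager)
  [manager]
  type=manager
  host=bro1

  [proxy-1]
  type=proxy
  host=bro1

  [worker-1]
  type=worker
  host=bro1
  interface=p1p1
  lb_method=myricom
  lb_procs=17
  pin_cpus=2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18

  # local.bro: exclude the four large-flow hosts before Bro sees them
  # (exact option namespace may differ between Bro versions)
  redef restrict_filters += {
      ["skip-large-flow-hosts"] =
          "not (host 10.0.0.1 or host 10.0.0.2 or host 10.0.0.3 or host 10.0.0.4)"
  };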

The current plan is to buy 2 more worker hosts (same specs), as well
as a NAS for storing logs after each hourly rotation.

If we're capturing 56% of 13Gbps, that's 7454Mbps. Given the 34 cores
used by Bro, that works out to 219Mbps/core and about 3.6Gbps/host.
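
(Working that out: 13Gbps is ~13,312Mbps if a Gbps is counted as
1024Mbps, and 56% of that is ~7,454Mbps; 7,454 / 34 workers is
~219Mbps per worker, and 7,454 / 2 hosts is ~3,727Mbps, or about
3.6Gbps per host.)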

Does that seem like expected performance, or might there be something
broken somewhere? Does it seem reasonable to buy two more worker hosts
(at least to handle current needs)?

Any thoughts or recommendations would be much appreciated.

Cheers,
Melissa

> We have a Bro cluster currently attempting to process up to 13Gbps
> (1.4Mpps) partitioned over two 10Gbps Gigamon network taps.
>
> Capture loss currently averages 44% - but before buying more hardware,
> we'd like to sanity-check our plans with folks who have already
> successfully sized their own installations.
>
> Currently there are two Bro hosts in the cluster, each with 20 CPU
> cores (3.1GHz), 128GB memory, and Myricom cards with the Sniffer V3
> driver. Each host runs a proxy and 17 workers pinned to CPUs. The
> manager runs on one of the worker hosts, and logs are written to SSD
> drives. We're using restrict_filters to ignore the (large) flows
> generated by four hosts.

Are those 20 real cores, or 10 cores with hyperthreading? We have some tests planned to look into this further, but I think most people either disable hyperthreading or avoid pinning workers to the 'extra' logical cores. If you are running 17 workers on 10 real cores, that could lead to problems.
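
If you're not sure, something like this on Linux will show the layout (a
"Thread(s) per core" value of 2 means hyperthreading is enabled):

  lscpu | grep -E 'Socket|Core|Thread'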

> The current plan is to buy 2 more worker hosts (same specs), as well
> as a NAS for storing logs after each hourly rotation.
>
> If we're capturing 56% of 13Gbps, that's 7454Mbps. Given the 34 cores
> used by Bro, that works out to 219Mbps/core and about 3.6Gbps/host.

That's not an extreme amount of traffic, but 44% capture loss does sound a bit high.

What does broctl netstats report? One thing to watch out for is that the Myricom driver reports its drop counter across the entire ring, so the dropped amount needs to be divided by the number of worker processes.

Step one should be to see if netstats reports a similar level of loss. If netstats is reporting something closer to 1-5% loss, you could have a problem elsewhere. If netstats agrees with capstats, then the workers are definitely not keeping up.
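
For example (worker names and counters below are made up):

  $ broctl netstats
  worker-1-1: 1453159000.123456 recvd=987654321 dropped=1234567 link=988888888
  worker-1-2: 1453159000.234567 recvd=985012345 dropped=1234567 link=986246912
  ...

If each of the 17 workers on a host shows roughly the same dropped value,
that's the ring-wide counter repeating, so the host-wide drop is about
1,234,567 packets rather than 17 x 1,234,567 - i.e. divide the summed
dropped amount by the number of workers, as noted above.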

> Does that seem like expected performance, or might there be something
> broken somewhere? Does it seem reasonable to buy two more worker hosts
> (at least to handle current needs)?

Hard to say. More boxes always help, but it can't hurt to see whether things can be optimized a bit with your current hardware first.

If the Gigamon you have is the kind that does aggregation/load balancing, you may be able to send each box 50% as much traffic as it gets now, to see how they would behave if you had two other boxes helping out.

Hi Melissa,
I built our cluster with the following specs:

Two Bro worker hosts in the cluster, plus a separate server that runs the Bro manager and stores the logs, plus an Arista switch.

Each worker server has 20 cores and 128GB memory, with a 10Gb Intel network card running PF_RING. We have now pinned 18 Bro worker processes to cores, which leaves 2 cores for OS tasks.
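
In node.cfg terms that looks roughly like this on each worker box
(interface name and CPU list are illustrative):

  [worker-1]
  type=worker
  host=worker1
  interface=eth2
  lb_method=pf_ring
  lb_procs=18
  pin_cpus=2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19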

Regarding campus bandwidth, we push about 4Gbps during normal peak times on I1 and up to 2Gbps on I2 at random times, so it is quite possible to be pushing between 4 and 6Gbps of traffic at any one time.

I installed the cluster in a nearby datacenter to test it out before moving it downtown next to our edge routers, which span the traffic to our Bro cluster's Arista switch. While the cluster was in the datacenter it saw live traffic, though much less than the overall campus bandwidth; it was a convenient place near my office to make sure everything was working correctly. Packet loss in the datacenter was very low, the cluster was running great, and the logs looked complete, so we moved it downtown.

We immediately had a problem with capture loss of 60-90%. The CPUs, though, weren't all pegged at 100% like they were with our old underpowered Bro box; only 6-7 processors were pegged at 100% and the rest were down around 30-40%. We added a BPF filter to only process packets to/from our campus subnets (we haven't tried a filter for large flows yet), double-checked the PF_RING settings, double-checked the Arista switch settings, and tuned the network card, but the main thing we did was turn off hyperthreading. That immediately dropped the capture loss down to 0.0%. I guess we didn't catch that earlier because the datacenter didn't have as much traffic going through the cluster. In the datacenter test we had pinned Bro to 36 of the 40 hyperthreaded CPUs on each worker, so once we got rid of that, it worked great.

We had been running eight Bro proxies on the manager when it was hyperthreaded, and I think I can drop that down to four proxies now; I just haven't tried that yet since it is working at the moment.
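
If you want to experiment with this before committing to a BIOS change,
Linux will also let you take the hyperthread siblings offline at runtime,
roughly like:

  # show sibling pairs; each physical core lists its two logical CPUs
  cat /sys/devices/system/cpu/cpu*/topology/thread_siblings_list | sort -u

  # take one logical sibling offline (CPU number is just an example)
  echo 0 > /sys/devices/system/cpu/cpu20/online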

After we got the packet loss to 0.0%, we ran a test where we set up a separate instance of Bro on just one of our worker servers, and it was able to handle the entire load (at least during a short test). All the processors jumped from 20-40% usage to 40-60% when we ran it on a single box, but the capture loss was still basically zero. We are going to run with two boxes, though, because we expect our bandwidth needs to grow, and the cluster should be able to keep up for a while.

Hope that helps.

Brian Allen

Information Security Manager
Washington University