building a new bro server

Hi All-

I currently have a server running Bro, and we are seeing a lot of packet loss. I am getting quotes for a new server to replace it, and I wanted to run some of the options by this group to see which would be better.

Current server specs:

-2 processors, 8 cores each at 2.4GHz (16 cores total). We run 14 Bro worker processes, one per core, and they run at 100% utilization all the time.

-128G memory

-Intel IXGBE 10Gig network card with pfring

We are seeing 3-4Gbps of traffic pretty much constantly, with spikes to 5Gbps. The Bro capture-loss log shows 30+% packet loss most of the time, but during the early morning hours, when traffic drops considerably, it falls to 0.01%.

For one test, we used a BPF filter to drop all traffic going to Bro except for one /24 subnet of campus traffic for about 15 minutes, and the packet loss dropped to 0.01%.
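
(In case anyone wants to repeat that test, here is a rough sketch. The subnet and interface are placeholders, and the redef below is from the Bro 2.x packet-filter framework, so check the exact name against your version's docs.)

# local.bro -- only hand Bro traffic for one /24 (192.0.2.0/24 is a placeholder)
redef restrict_filters += { ["test-one-subnet"] = "net 192.0.2.0/24" };

# sanity-check the BPF expression against live traffic first (interface is a placeholder)
tcpdump -i eth4 -c 100 'net 192.0.2.0/24'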

So we think our processors are too few and too slow to handle this amount of bandwidth.

Our question, as we get quotes to buy a new box, is which matters more for Bro: keeping roughly the same number of cores but getting faster ones, or getting more cores at the same or a slower clock speed?

I’m looking at the following two Dell server options, although I can adjust this if there are other, better possibilities:

Option1:
-Intel Xeon E5-2699, two processors, 18 cores each at 2.3GHz for 36 total
-256Gig RAM
-Intel IXGBE 10Gig network card with pfring

Option2:
-Intel Xeon E5-2687, two processors, 10 cores each at 3.1GHz for 20 total
-256Gig RAM
-Intel IXGBE 10Gig network card with pfring

I’m assuming the first option would be much better, but I’ve never researched this enough to know for sure, or how much better it would actually be. I think the difference in price is around $2,400.

I’d like to get one box to handle our bandwidth as it grows over the next couple of years, take the current underpowered box and use it as a Bro test box/Elasticsearch server, and build the infrastructure to move to a Bro cluster in a couple of years. Right now a single box would be better because of space constraints.

I would be really interested in talking to other companies/universities who are running Bro in the 3-7Gbps bandwidth range right now, so I can see what hardware works for you.

Thanks for your help,

Brian Allen, CISSP
Information Security Manager
Washington University
brianallen@wustl.edu
314-935-5380

A couple thoughts that might help the list better understand your topology/situation.

  1. Are the manager and/or proxies on the same host?

  2. What are you using to determine packet loss? (e.g. the Bro capture-loss script, broctl netstats, pf_ring counters, etc.)

  3. Are you running PF_RING with any of the enhanced drivers (DNA/ZC) and/or zero-copy libraries (Libzero/ZC)?

  4. Are you pinning your worker processes to individual cores (via node.cfg) or are you letting the OS handle things?
    I saw a marked improvement in average loss, as measured by the Bro capture-loss script, simply by pinning CPU cores on a server very similar to yours with similar traffic per host. Bursty traffic and mega-flows will still cause higher loss levels for individual workers at times, though. Also, if you are running the manager and proxies on the same host, they could be competing for the same cores that one or more workers are running on. Running htop might give you an idea of whether workers are being bounced between cores (if not pinned), as well as whether other processes are clobbering one or more of the cores your workers are on. Either could be an issue with workers running at 100% CPU usage.
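
For what it's worth, here is a minimal node.cfg sketch of that pinning scheme on a single 16-core host like yours (the IP, interface name, and core numbering are placeholders; the point is just to keep the manager and proxy off the worker cores):

[manager]
type=manager
host=10.0.0.1

[proxy-1]
type=proxy
host=10.0.0.1

[worker-1]
type=worker
host=10.0.0.1
interface=eth4
lb_method=pf_ring
lb_procs=14
# pin the 14 workers to cores 2-15, leaving cores 0-1 for the OS, manager, and proxy
pin_cpus=2,3,4,5,6,7,8,9,10,11,12,13,14,15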

Regards,
Gary

Good questions and suggestions.

  1. The manager and the workers are all on the same server.
  2. We have looked at all of those metrics, but the Bro capture-loss log is what we rely on most; that is the one reporting 30+% packet loss (see the sketch after this list).
  3. We got a license and tried PF_RING with DNA/zero copy, but it didn’t make a noticeable difference.
  4. We do use the node.cfg file to pin the 14 worker processes to the individual cores. That leaves 2 free cores for OS/System tasks.
    We saw a huge improvement when we went from 16GB RAM to 128GB RAM (that one was pretty obvious, so we did it first). We also saw improvement when we pinned the processes to the cores.
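
For anyone comparing notes, this is roughly how we look at both loss metrics (the /usr/local/bro prefix is an assumption; adjust to your install):

broctl netstats                                       # per-worker received vs. dropped counters from the capture layer
tail /usr/local/bro/logs/current/capture_loss.log     # the ACK-based heuristic, logged per worker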

Thanks,
-Brian

I run a cluster with 12 workers on 16 physical cores per host, 64GB RAM each (mandatory!!). I pin workers to physical cores; otherwise the OS will end up assigning two different workers to the same physical core on different virtual (HT) threads, and they end up competing for the same resources.
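
A quick sketch of how to check which logical CPU numbers share a physical core before writing pin_cpus (standard Linux tools, nothing Bro-specific):

lscpu --extended                                                   # one row per logical CPU with its CORE and SOCKET ids
cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list     # HT siblings of cpu0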

Are you looking at “netstats” or capture loss from a script (the one that does heuristics with ACKs)? If the latter, and you can see a lot of drops with the number more or less even across the workers, then the loss might be happening before Bro. You said that using DNA didn’t make a difference, so that might be worth looking into. How are you mirroring traffic: using taps or a switch?

Archiving to thread on behalf of Justin Azoff:

I think I had also suggested that you move to tcmalloc. Have you tried that yet? It’s not going to fix your issue with 30% packet loss, but I expect it would cut it down a bit further.
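
A minimal sketch of two common ways to get tcmalloc into the workers (assuming the gperftools package is installed; the configure flag, env_vars support, and library path all depend on your Bro/BroControl version and distro):

# Option 1: build Bro against tcmalloc
./configure --enable-perftools && make && make install

# Option 2: preload it via broctl.cfg (path is distro-specific)
env_vars=LD_PRELOAD=/usr/lib/libtcmalloc_minimal.so.4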

  .Seth

A few other memory-related things, for what they're worth:

* Make sure vm.swappiness is turned way down
* numactl / numastat could be useful to play with: memory locality can make a difference
* Related to memory locality: try tweaking vm.zone_reclaim_mode (a sketch of these knobs follows below)
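
A minimal sketch of those knobs (the values are just starting points, not recommendations; persist whatever helps in /etc/sysctl.conf):

sysctl -w vm.swappiness=1            # keep the kernel from swapping out Bro's working set
sysctl -w vm.zone_reclaim_mode=0     # often the better choice for large, long-lived processes
numastat                             # see how allocations are spread across NUMA nodes
numactl --hardware                   # show the NUMA topology and free memory per node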

If you're into lower-level tuning / analysis, I also like:

https://software.intel.com/en-us/articles/intel-performance-counter-monitor-a-better-way-to-measure-cpu-utilization

HTH,
Gilbert

For perspective, I currently have a Bro cluster comprised of 3 physical hosts. The first host runs the manager and proxies and has storage to handle lots of Bro logs and keep them for several months; the other two are dedicated to workers with relatively little storage. We have a hardware load balancer to distribute traffic as evenly as possible between the worker nodes, and some effort has been made to limit having to process really large, uninteresting flows before they reach the cluster.

I looked at one of our typically busier blocks of time today (10:00-14:00). During that window the cluster was seeing an average of 10Gbps of traffic, with peaks as high as 15Gbps; our traffic graphs and capstats showed each host typically seeing around 50% of that load, or around 5Gbps on average. During this time we saw an average capture loss of around 0.47%, with a max loss of 22.53%. In that same time frame, 18 snapshots out of 748 had individual workers reporting loss over 5%, and 2 were over 10%.

So I'd say each host is probably seeing about the same amount of traffic as you have described, but loaded scripts etc. may vary from your configuration. We have 22 workers per host for a total of 44 workers, and I believe the capture-loss script samples traffic over 15-minute intervals by default, so there are roughly 17 time slices for each worker. Here are some details of how those nodes are configured in terms of hardware and Bro.

2 worker hosts each with:
2x E5-2697 v2 (12 cores / 24 HT), 2.7GHz base / 3.5GHz Turbo
256GB RAM (probably overkill, but I used to have the manager and proxies running on one of the hosts and it skewed my memory use quite a bit)
Intel X520-DA2 NIC
Bro 2.3-7 (git master at the time I last updated)
22 workers
PF_RING 5.6.2 using DNA IXGBE drivers, and pfdnacluster_master script
CPUs pinned (used the OS to verify which logical core maps to which physical core, to avoid pinning 2 workers to the same physical core; didn't use the 1st core on each CPU)
HT is not disabled on these hosts and I'm still using the OS malloc.

Worker configs like this:
[worker-1]
type=worker
host=10.10.10.10
interface=dnacluster:21
lb_procs=22
lb_method=pf_ring
pin_cpus=2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23

I suspect the faster CPUs will handle bursty flows better, such as when a large volume of traffic load-balances to a single worker, while more cores will probably help when you can distribute the workload more evenly. That led me to try to pick something that balanced the two options (more cores vs. higher clock speed). Naturally YMMV, and your traffic may not look like mine.

Hope this helps.

Regards,
Gary

Bear in mind that there is a 32-application limit on the number of Bro workers/slaves that can attach to a single cluster ID with the pf_ring DNA/ZC drivers. Or you can get really crafty, bounce traffic from one ring to another interface/ring, and have up to 64 workers on a single box, provided you have the cores to work with :)

Looking at the current Intel chips, I’d say the 8-core, high-clock-speed (3.3GHz+) procs are a good option for a quad-socket system build that won’t break the bank. That would give you 32 cores to pin workers on at a nice high clock speed, which Bro seems to greatly appreciate. Look at the E5-2687W v2, E5-2667 v2, or E5-4627 v2, some of which can turbo up to 4GHz for traffic spikes (if you manage the power modes correctly! https://communities.intel.com/community/itpeernetwork/datastack/blog/2013/08/05/how-to-maximise-cpu-performance-for-the-oracle-database-on-linux )
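
A small sketch of the power-mode side of that (the cpupower tool ships with the kernel tools package on most distros; governor names vary):

cpupower frequency-set -g performance   # keep cores at full clock instead of letting them scale down
cpupower frequency-info                 # verify the governor and the available turbo range
cpupower idle-info                      # see which C-states the cores are allowed to enter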

-Alex