Bro with 10Gb NICs or higher

Does anyone have experience with higher-speed NICs and Bro? Will it sustain 10Gb speeds or more, provided the hardware is spec'd appropriately?

regards,

Coen

Succinctly, yes, although that proviso is a big one.

I'm running Bro on two 10 gig interfaces, an Intel X520 and an Endace DAG 9.2X2. Both perform reasonably well. Although my hardware is somewhat underspecced (Dell R710s of differing vintages), I still get tons of useful data.

If your next question would be "how should I spec my hardware", that's quite difficult to answer because it depends on a lot. Get the fastest CPUs you can afford, with as many cores as you can get. If you're actually sustaining 10+Gb, you'll probably want at least 20-30 cores. I'm sustaining 4.5Gb or so on 8 cores at 3.7GHz, but Bro reports 10% or so loss. Note that some hardware configurations will limit the number of streams you can feed to Bro; e.g., my DAG can only produce 16 streams, so even if I had it in a 24-core box, I'd only be making use of 2/3 of my CPUs.
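For illustration only (this isn't my actual config; the interface name and counts are made up), that stream-to-core matching is what you end up expressing in BroControl's node.cfg through lb_procs and pin_cpus:

# Illustrative worker entry: set lb_procs to the number of streams your
# capture hardware can produce, and pin_cpus to the cores you want the
# workers tied to.
[worker-1]
type=worker
host=10.0.0.1
interface=eth2
lb_method=pf_ring
lb_procs=16
pin_cpus=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15

(lb_method=pf_ring is just an example here; a DAG or Myricom card would be set up with its own load-balancing method.)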

Mike

How does one know if Bro is dropping (10% of) packets?

The capture_loss log; it's not enabled by default.

Turn on the capture-loss script by adding the following to your local.bro:

@load misc/capture-loss

I'd agree with all of this. We're monitoring a few 10Gbps network segments with DAG 9.2X2s, too. I'll add in that, when processing that much traffic on a single device, you'll definitely not want to skimp on memory.

I'm not sure which configurations you're using that might be limiting you to 16 streams -- we've run with at least 24 streams, and (at least with the 9.2X2s) you should be able to work with up to 32 receive streams.

v/r

John Donaldson

You're right, it's 32 on mine.

I posted some specs for my system a couple of years ago now, I think.

6-8GB per worker should give some headroom (my workers usually use about 5GB apiece, I think).

Mike

While we at LBNL continue to work towards formal documentation, I thought I'd reply now rather than cause further delays:

Here is the 100G cluster setup we've done:

- 5 nodes, each running 10 workers + 1 proxy
- 100G split by an Arista into 5x10G
- 10G on each node further split by Myricom into 10x1G/worker, with shunting enabled!

Note: Scott Campbell did some very early work on the concept of shunting
      (http://dl.acm.org/citation.cfm?id=2195223.2195788)

We are using the react framework, written by Justin Azoff, to talk to the Arista.

With shunting enabled, the cluster isn't even truly seeing 10G anymore.

Oh, by the way, capture_loss is a good policy to run for sure. With the above setup we get ~0.xx% packet drops.

(Depending on the kind of traffic you are monitoring, you may need slightly different shunting logic.)

Here are the hardware specs per node:

- Motherboard: Supermicro X9DRi-F
- Intel E5-2643 v2 3.5GHz Ivy Bridge (2 x 6 = 12 cores)
- 128GB DDR3 1600MHz ECC/REG (8 x 16GB modules installed)
- Myricom 10G-PCIE2-8C2-2S+: 10G "Gen2" (5 GT/s) PCI Express NIC with two SFP+ ports
- Myricom 10G-SR modules

On the tapping side we have:
- Arista 7504 (gets fed 100G TX/RX plus backup and other 10Gb links)
- Arista 7150 (symmetric hashing via DANZ, splitting TCP sessions one per link, 5 links to the nodes)

On the Bro side:
5 nodes accepting the 5 links from the 7150
Each node running 10 workers + 1 proxy
Myricom splitting/load balancing to each worker on the node.
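
Roughly, the per-node piece of that looks like the following in BroControl's node.cfg (the hostnames and interface name here are made up, not our actual ones; the manager entry and the other four nodes follow the same pattern):

# Illustrative fragment for one of the five nodes.
[proxy-node1]
type=proxy
host=node1.example.org

[worker-node1]
type=worker
host=node1.example.org
interface=eth4
# Myricom SNF splits the node's 10G feed across the 10 workers.
lb_method=myricom
lb_procs=10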

Hope this helps,

let us know if you have any further questions.

Thanks,
Aashish

In all of the 10G deployments I have done, I always put multiple boxes behind a flow-based load balancer. That way I can use commodity boxes without special NICs and keep them at a reasonable price point; the bang for the buck goes down when you're talking 4 x 12-core HT processors, etc., vs. a dual 10-core HT. You also get some fault tolerance: if you have hardware issues, you are not blind. I have a few deployments that are going from 10G to 100G, and the only thing we have to change is the inbound interfaces on the LB gear. The other positive is that as usage goes up, I can add capacity incrementally instead of having to re-architect the whole solution.

Thanks

Mike

Hi,
What is the name of the log, and where is it located?

Do you really see, and can you handle, 1Gbit/sec of traffic per core? I'm curious.

I would say, with a 2.6GHz CPU, my educated guess would be somewhere
around 250Mbit/sec per core with Bro. Of course, configuration is
everything here; I'm just looking at "given you do it right, that's
what's possible".

Do you really see, and can you handle, 1Gbit/sec of traffic per core? I’m curious.

Haven’t measured whether a core can handle 1Gbit/sec, but I highly doubt it.

What saves us is the shunting capability: basically, Bro identifies the big flows and cuts off the rest of each one by placing a src/src-port/dst/dst-port ACL on the Arista, while continuing to allow control packets (and it dynamically removes the ACL once the connection ends).

So each core doesn’t really see anything more than 20-40 Mbps (approximately).
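
As a very rough sketch of the idea (this is not our actual react-framework code; the shunt_flow/unshunt_flow functions below are just placeholders for whatever pushes and removes the ACL on the Arista, and the SSL trigger is only an example -- our real logic keys on flow size and type):

# Track which connections we have shunted.
global shunted_flows: set[conn_id];

# Placeholder: a real deployment would push a src/src-port/dst/dst-port
# ACL to the switch here.
function shunt_flow(c: connection)
    {
    print fmt("would shunt %s", id_string(c$id));
    }

# Placeholder: remove that ACL again.
function unshunt_flow(c: connection)
    {
    print fmt("would remove shunt for %s", id_string(c$id));
    }

# Example trigger: once the SSL handshake has been seen and logged, the
# rest of the (encrypted, bulky) flow can be cut off at the switch.
event ssl_established(c: connection)
    {
    if ( c$id !in shunted_flows )
        {
        add shunted_flows[c$id];
        shunt_flow(c);
        }
    }

# Control packets still reach Bro, so the ACL can be removed dynamically
# when Bro sees the connection end.
event connection_state_remove(c: connection)
    {
    if ( c$id in shunted_flows )
        {
        delete shunted_flows[c$id];
        unshunt_flow(c);
        }
    }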

(Note to self: it would be good to get these numbers into a plot.)

Thanks,
Aashish

capture_loss.log, and it should be with all your other logs once you turn
it on. Remember to do an install, check, and restart with broctl to get it
turned on.

After you add the following to local.bro of course…

@load misc/capture-loss

  .Seth

OK, I found it in / - along with “weird”.

How can I specify another directory?
What do the fields mean?

root@x64-01:/# cat cap*
1420832673.023244,900.000068,bro,0,0,0.0
1420833573.023279,900.000035,bro,0,6,0.0
1420833727.951157,154.927878,bro,0,0,0.0
1420833885.693988,154.676438,bro,0,0,0.0

My file has the headers in it:

$ head -8 capture_loss.log

#separator \x09
#set_separator ,
#empty_field (empty)
#unset_field -
#path capture_loss
#open 2015-01-09-16-33-16
#fields ts ts_delta peer gaps acks percent_lost
#types time interval string count count double

Hi,

I concur with Aashish; the biggest help is the shunting of large flows (and possibly encrypted flows).

We have a Cisco Nexus 3172 (6x40Gbps + 48x10Gbps copper) load balancing to 6 Dell 620s (E5-2695 v2 @ 2.40GHz, 24 cores), each with Intel X540-AT2s (2x10Gbps copper) running 20 workers (with PF_RING/DNA)… sustaining about 5Gbps… and we still see packet loss >5% on some workers due to the elephant flows in our environment.

Yee.

How can I specify another directory?

What do you mean?

What do the fields mean?


It’s documented:
  https://www.bro.org/sphinx/scripts/policy/misc/capture-loss.bro.html#type-CaptureLoss::Info

root@x64-01:/# cat cap*
1420832673.023244,900.000068,bro,0,0,0.0
1420833573.023279,900.000035,bro,0,6,0.0
1420833727.951157,154.927878,bro,0,0,0.0
1420833885.693988,154.676438,bro,0,0,0.0

That last number is the estimated percentage of packet loss. Unfortunately, I think I know enough to guess that your traffic leans heavily toward DNS, and capture-loss relies on having a lot of TCP available (it estimates loss from gaps seen in TCP ack sequences), so in your case the numbers might be misleading.

  .Seth

Yes, this has been high on our radar for quite a while. I suspect we’re getting closer and closer to getting this into Bro in a generic manner. Unfortunately, it just takes a lot of time and experience to get it right.

  .Seth

Hello,

We at UT Austin are fairly new to Bro and new to the list (been following, but never posted), but I thought I'd share my experience.

We have had good luck monitoring our traffic, which sustains ~17-20 Gbps during peak hours, with 2 devices made by a company called Netronome. The traffic is distributed between the 2 clustered devices using an integrated load balancer, which evenly spreads the traffic across all the processors, each pinned to a corresponding Bro worker.

We see very little traffic loss: random ~2-3% drops per Bro instance, with the occasional larger ~10% drop.

Our configuration:

- 2 clustered devices, 40 cores each, with 32 workers and 4 proxies
- Primary device with two 10-gig cards

Hope this is helpful.

-Kelly
UT Austin