I noticed the previous gentleman running 160 workers (I assume 16 boxes with 10 workers each??) in a cluster, and had a general question about this.
If I am pumping out well above 5 Gb/s, doesn’t that mean that, running in a cluster, I am pushing 5 Gb/s right back out the other side? If so, this doesn’t seem to scale well beyond roughly 5 Gb/s.
At what point, and at how many pps, should we move away from a single manager host talking to cluster hosts? Even if Bro does no processing on the manager, you still have bandwidth issues unless you are loading up your Bro manager with multiple 10 GigE NICs and load balancing upstream. And in that case, why aren’t you just load balancing to standalone boxes, each with its own manager, logger, and set of workers?
It seems to me that running multiple physical Bro hosts tied to a single manager is the poor man’s solution to running proper load-balancing hardware upstream. Am I mistaken?
You sound a little confused: multi-node scaling is a feature of Bro and really the only way to monitor high-volume locations. See the LBNL "Bro at 100G" paper for an example. When using a front-end load balancer, you are distributing the traffic directly to the worker nodes, which in turn produce metadata that is sent to the manager node.
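To make the front-end distribution concrete, here is a minimal sketch of flow-based load balancing: hash the connection tuple symmetrically so both directions of a flow land on the same worker. The hash function and `NUM_WORKERS = 10` are illustrative assumptions, not from the thread; real NICs and kernel fanout use their own (often Toeplitz-style) hashes, but the idea is the same.

```python
import hashlib

NUM_WORKERS = 10  # assumed cluster size, purely for illustration


def worker_for_flow(src_ip, src_port, dst_ip, dst_port, proto="tcp"):
    """Map a flow to a worker index with a symmetric tuple hash.

    Sorting the two endpoints first means A->B and B->A produce
    the same key, so each worker sees both directions of a
    connection -- a requirement for Bro's stateful analysis.
    """
    a, b = sorted([(src_ip, src_port), (dst_ip, dst_port)])
    key = f"{a[0]}:{a[1]}-{b[0]}:{b[1]}-{proto}".encode()
    return int(hashlib.md5(key).hexdigest(), 16) % NUM_WORKERS


# Both directions of the same connection hit the same worker:
fwd = worker_for_flow("10.1.1.1", 51000, "192.0.2.7", 443)
rev = worker_for_flow("192.0.2.7", 443, "10.1.1.1", 51000)
assert fwd == rev
```

The point is that the manager never touches raw packets; only per-worker metadata flows back to it.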
The decision to use more than one box comes down to processing requirements; the basic formula is something like one 3.0 GHz core per 250 Mbps of traffic.
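As a back-of-the-envelope check, the rule of thumb above turns into a simple worker count. This is just the arithmetic from the thread, with the per-core rate as a parameter since (as noted later in the thread) the real number varies widely:

```python
import math


def workers_needed(gbps, mbps_per_core=250.0):
    """Rough worker/core count from the 'one 3.0 GHz core per
    250 Mbps' rule of thumb. mbps_per_core is tunable because
    actual per-worker throughput depends heavily on hardware,
    traffic mix, and loaded scripts."""
    return math.ceil(gbps * 1000 / mbps_per_core)


# 5 Gb/s at the conservative 250 Mbps/core rule:
print(workers_needed(5))       # 20 cores
# at the ~500 Mbps/worker sometimes seen on Myricom:
print(workers_needed(5, 500))  # 10 cores
```

So the 160-worker cluster mentioned earlier is plausible for tens of Gb/s of monitored traffic.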
If you use multiple managers you break global visibility in the scripting context: proxies share state among the entire cluster, which operates as a sort of giant shared memory space. Multiple managers are essentially independent Bro clusters. A basic example would be a scanning or SQL-injection detection script: if the threshold is 25 and 10.1.1.1 attacks your entire network, each cluster sees only 1/n of that activity and may never fire an event because of the limited context.
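The split-visibility problem can be sketched in a few lines. This is a toy model, not Bro script code: it just shows how a fixed threshold behaves when the same attacker's probes are divided across independent clusters that do not share state.

```python
THRESHOLD = 25  # scan threshold from the example above


def clusters_that_fire(probes_per_cluster):
    """For each independent cluster, report whether it would
    raise a notice, given the probe count it saw locally.
    Independent clusters cannot pool their counts."""
    return [count >= THRESHOLD for count in probes_per_cluster]


# One unified cluster sees all 30 probes from 10.1.1.1 -> fires:
print(clusters_that_fire([30]))          # [True]
# Three independent clusters see 10 each -> nobody fires:
print(clusters_that_fire([10, 10, 10]))  # [False, False, False]
```

With a single manager and shared proxy state, the counts aggregate and the notice fires; with separate managers, the same 30 probes go undetected.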
As for the bandwidth concerns you mention, I’m not sure what you mean exactly. The metadata produced by the workers and sent to the manager (the logs) is a small fraction of the monitored raw traffic.
Is that formula based on a Myricom NIC or on PF_RING? What’s the best way to calculate the expected increase when switching to a specialized NIC?
The only formula valid for every NIC is "between 0 and 1 Gbit per worker." It depends on the traffic type; the CPU; BIOS settings; kernel settings and sometimes kernel version; OS type, version, and settings; the NIC; and the running scripts.
The old 250 Mbps figure comes from a dark past and was for unknown traffic on unknown hardware.
Sometimes I process 500 Mbit/s per worker on Myricom; sometimes I get quite a lot of packet drop. Oh well.
Also, AF_PACKET >> PF_RING. And there’s AF_PACKET, netmap, Myricom, Napatech, and a dozen others.
Any performance hit seen using software-based load balancing versus specialized NICs? (Does anyone have actual test results?)
Some of the specialized NICs are actually doing software-based load balancing as well; they just don’t make that terribly clear in their marketing material or documentation. (Just to muddy the waters more than they already are!)
I second what Seth said. Correctly configured AF_PACKET is as fast as Myricom. There, I said it.
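For reference, a "correctly configured" AF_PACKET worker in broctl's node.cfg looks something like the sketch below. This assumes the af_packet capture plugin is installed; the host, interface, and counts are placeholders, and `pin_cpus` should match cores local to the NIC's NUMA node.

```ini
; hypothetical node.cfg fragment -- assumes the af_packet plugin
[worker-1]
type=worker
host=10.0.0.1
interface=af_packet::eth0
lb_method=custom
lb_procs=10
pin_cpus=0,1,2,3,4,5,6,7,8,9
```

The kernel's PACKET_FANOUT does the per-flow distribution across the `lb_procs` processes, which is the "software load balancing" being compared against the NIC-based approaches here.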
If you have a big budget and no time, get Napatech. If a smaller budget and a little time, Myricom. It just works!
Intel has the best future potential, but right now it requires very careful tuning. And by Intel I mean the X710 these days.
Solarflare FTW! Being able to multi-thread single-threaded capture programs is sweet, with very little configuration needed for on-card load balancing to worker threads. Also, all you have to do is compile against libpcap 1.5.3 and slipstream the driver in, and poof, it works.