High packet drop rates

I've recently run into a problem with Bro 2.4.1 where I'm seeing extremely high packet loss.

I'm running on a server with dual quad-core Xeon processors (no hyperthreading) and 64GB of RAM, monitoring a few small links that average, in aggregate, about 100Mbps of traffic. There isn't much else running on this system, but we're seeing drop rates that average in the 70-99% range (via the capture-loss.bro policy), even though Bro's CPU utilization sits at around 20-30%.

Most of the traffic comes in from a bond interface running on top of some Intel NICs, but we're seeing similarly high drop rates when capturing directly from another, non-bonded interface. With other tools we're not seeing any dropped packets (even with a heavily-loaded Snort instance). We've tried PF_RING and load-balancing across several workers pinned to several CPUs, but all we end up with is multiple processes with 2-30% CPU utilization and 70-99% drop rates. PF_RING isn't showing any drops on its side and hasn't had issues with insufficient memory. We're pretty sure we're not just seeing TCP-related chaff that's throwing off our numbers, because records of known connections are showing up malformed.
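For reference, our node.cfg worker entry looks roughly like this (the interface name, process count, and CPU pinning below are illustrative, not our exact values):

    [worker-1]
    type=worker
    host=localhost
    interface=eth0       # illustrative interface name
    lb_method=pf_ring
    lb_procs=4           # illustrative worker count
    pin_cpus=0,1,2,3     # illustrative CPU pinning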

Any insights?

John Donaldson

A likely reason for this is that various NIC offloading features are enabled, which can keep Bro from capturing entire packets properly. Is your reporter.log complaining about this at all?
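A quick way to see which of those features are currently enabled (interface name assumed):

    # show the current offload settings for the capture interface
    ethtool -k eth0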

Can you try running

    for i in rx tx sg tso ufo gso gro lro; do ethtool -K eth0 $i off; done

to disable all of those optional features (replace eth0 with the appropriate interfaces)?
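If you're capturing from the bond, the offloads usually need to be turned off on each underlying slave NIC as well; a minimal sketch, assuming the slaves are eth0 and eth1:

    # disable offloads on every physical interface feeding the bond (names assumed)
    for nic in eth0 eth1; do
        for i in rx tx sg tso ufo gso gro lro; do ethtool -K $nic $i off; done
    done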

Also, if you are using jumbo frames you may need to add to local.bro:

    redef snaplen = 9000;  # potentially as high as 9216

and ensure that the MTU on the NIC is set appropriately as well.
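For example, to check and raise it (interface name and value assumed, requires root):

    ip link show eth0              # the current MTU is printed on the first line
    ip link set dev eth0 mtu 9216  # value assumed; match it to your snaplen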

Once you know you have capture loss, it can be a good idea to look at the lower-level records to see which connections are missing bytes:

    cat conn.log | bro-cut id.orig_h id.resp_h id.resp_p orig_bytes resp_bytes missed_bytes

It's possible that all of your missed_bytes are coming from a small subset of hosts. We ran into that issue a while ago because our MTU was just too small to properly capture traffic between two hosts. Since it was a large backup transfer, our capture loss would shoot up to 75%+ even though only one flow was being missed.
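A rough way to confirm that, assuming the default tab-separated conn.log fields, is to sum missed_bytes per host pair and look at the top offenders:

    cat conn.log | bro-cut id.orig_h id.resp_h missed_bytes \
        | awk '$3 > 0 { miss[$1" -> "$2] += $3 } END { for (k in miss) print miss[k], k }' \
        | sort -rn | head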

Since your traffic rate is relatively low, one potential troubleshooting option is to dump the full traffic to a file with something like 'tcpdump -s 0 -w dump.cap -i eth0', then run Bro against that pcap file and see what it reports.
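Something along these lines, assuming local.bro is on your BROPATH and loads the capture-loss policy:

    # -C ignores bad checksums, which offloading often mangles
    bro -C -r dump.cap local
    cat capture_loss.log | bro-cut ts peer gaps acks percent_lost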