Bro 2.0 packets dropped

Hello,

I am getting a lot of messages reporting dropped packets. This is on CentOS
6.2 running 3 workers on a quad-core machine with 8 GB of RAM and SAS disks.

The link I monitor has low traffic, 0.2k packets/s on average with peaks of 2.0k,
and the server load is very low. I would expect no packets to be dropped.

Besides tuning the receive buffer and queue length, is there anything else I
can do about this?

worker-1: 1328274953.996680 recvd=129059158 dropped=114860 link=129174018
worker-2: 1328274954.197859 recvd=129059218 dropped=115120 link=129174338
worker-3: 1328274954.397642 recvd=129052866 dropped=122170 link=129175036

Thank you, Machiel.

Are you monitoring 3 separate links on three interfaces? I'm a little suspicious that you may be monitoring the same traffic three separate times. You will need to load balance the traffic across those three workers if it's a single interface (I'm working on automating this now).

Could you add a line to load the misc/capture-loss script to your local.bro?
@load misc/capture-loss

After you do that, make sure you do "check", "install", "restart" in broctl. The capture-loss script will give you another measure of packet loss that is not based on information being received from the NIC. Oh, that brings up another question. What NICs are you using?
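
(For reference, the sequence in the broctl shell looks roughly like this:)

  [BroControl] > check
  [BroControl] > install
  [BroControl] > restart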

  .Seth

> Besides tuning the receive buffer and queue length is there anything else
> I can do about this?
>
> worker-1: 1328274953.996680 recvd=129059158 dropped=114860 link=129174018
> worker-2: 1328274954.197859 recvd=129059218 dropped=115120 link=129174338
> worker-3: 1328274954.397642 recvd=129052866 dropped=122170 link=129175036

> Are you monitoring 3 separate links on three interfaces? I'm a little
> suspicious that you may be monitoring the same traffic three separate
> times. You will need to load balance the traffic across those three
> workers if it's a single interface (I'm working on automating this now).

It is one interface; there might be a problem with load balancing. I've switched to
a standalone setup for now.

"listening on eth1, capture length 8192 bytes"

"bro: 1328281729.277621 recvd=3553337 dropped=4503 link=3557842"

The packet loss is still there, though.

> Could you add a line to load the misc/capture-loss script to your
> local.bro? @load misc/capture-loss
>
> After you do that, make sure you do "check", "install", "restart" in
> broctl. The capture-loss script will give you another measure of packet
> loss that is not based on information being received from the NIC.

From the alarm summary:

"2012-02-03-15:39:46 CaptureLoss::Too_Much_Loss
The capture loss script detected an estimated loss rate above 27.282%"

> Oh, that brings up another question. What NICs are you using?

Broadcom NetXtreme II BCM5708 1000Base-T (B2) PCI-X 64-bit 133MHz
driver: bnx2
version: 2.1.11
firmware-version: bc 4.6.0 ipms 1.6.0

>   .Seth

Thanks again, Machiel.

Are you using pf_ring?

It looks like it's 10.

> It is one interface; there might be a problem with load balancing. I've switched to
> a standalone setup for now.

If you aren't taking any steps to load balance the traffic then it definitely isn't working. We don't have automated load balancing configuration available in BroControl yet. :)

Today I did just write a script that automates a BPF-based load balancing technique on clusters, which will be merged in along with the rest of the automated load balancing code soon.
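
(As a rough illustration of the general idea only, not necessarily what that script does: traffic can be split across two workers with complementary BPF filters keyed on the low bits of the source and destination addresses. You can test such a split with tcpdump on a hypothetical interface eth1:)

  # the sum of the low 16 bits of src and dst IP is symmetric, so both
  # directions of a flow land on the same worker
  tcpdump -n -c 100 -i eth1 '(ip[14:2] + ip[18:2]) & 1 = 0'   # "even" flows -> worker A
  tcpdump -n -c 100 -i eth1 '(ip[14:2] + ip[18:2]) & 1 = 1'   # "odd" flows  -> worker B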

"bro: 1328281729.277621 recvd=3553337 dropped=4503 link=3557842"
"2012-02-03-15:39:46 CaptureLoss::Too_Much_Loss

The capture loss script detected an estimated loss rate above 27.282%"

Are you sniffing from a tap or a SPAN port? I'm a little suspicious because the first line indicates that the NIC was showing 0.1% packet loss, but the second line indicates much more loss. The misc/capture-loss.bro script can detect loss due to reasons beyond the monitoring host (like an overloaded SPAN port), so I'm just trying to figure out why there is such a huge disparity between the two measurements.

Oh, one other thought. Are you disabling all of the offload features of your NIC? Here's an article about it:
  Security Onion: When is full packet capture NOT full packet capture?
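
(For example, assuming the capture interface is eth1, you can check and disable the common offloads with ethtool; feature names vary a bit by driver:)

  ethtool -k eth1                                   # show current offload settings
  ethtool -K eth1 tso off gso off gro off lro off   # disable them for monitoring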

Is the MTU on your NIC larger than 8192 (Bro 2.0's default snaplen)? If there are packets larger than that, they won't be seen by default.
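
(You can check the interface MTU with something like the following, again assuming eth1:)

  ip link show eth1        # the "mtu NNNN" field should not exceed the snaplen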

> Oh, that brings up another question. What NICs are you using?

> Broadcom NetXtreme II BCM5708 1000Base-T (B2) PCI-X 64-bit 133MHz
> driver: bnx2
> version: 2.1.11
> firmware-version: bc 4.6.0 ipms 1.6.0

I usually recommend not using Broadcom NICs for monitoring. I've run into weird problems with various Broadcom NICs at times, so I tend to avoid them.

  .Seth

I'm not sure what you're asking. There is no hard coded number of workers that can be defined for a cluster. We have deployments of anywhere from 2 workers up to over a hundred.

  .Seth

Here's my problem. I have a single server and have defined 10 workers on it to divide up the load. Here's the output of "broctl status":

[root@homey manager]# broctl status
Name Type Host Status Pid Peers Started
manager manager homey.tacc.utexas.edu running 6202 11 05 Feb 15:22:15
proxy-1 proxy homey.tacc.utexas.edu running 6237 11 05 Feb 15:22:17
worker-1 worker mojo1.tacc.utexas.edu running 18356 2 05 Feb 15:26:41
worker-10 worker mojo1.tacc.utexas.edu running 18350 2 05 Feb 15:26:41
worker-2 worker mojo1.tacc.utexas.edu running 18348 2 05 Feb 15:26:41
worker-3 worker mojo1.tacc.utexas.edu running 18349 2 05 Feb 15:26:41
worker-4 worker mojo1.tacc.utexas.edu running 18357 2 05 Feb 15:26:41
worker-5 worker mojo1.tacc.utexas.edu running 18352 2 05 Feb 15:26:41
worker-6 worker mojo1.tacc.utexas.edu running 18353 2 05 Feb 15:26:41
worker-7 worker mojo1.tacc.utexas.edu running 18354 2 05 Feb 15:26:41
worker-8 worker mojo1.tacc.utexas.edu running 18355 2 05 Feb 15:26:41
worker-9 worker mojo1.tacc.utexas.edu running 18351 2 05 Feb 15:26:41

I now add worker-11 to the configuration and "broctl status" returns:

[BroControl] > status
Name Type Host Status Pid Peers Started
manager manager homey.tacc.utexas.edu running 29316 12 06 Feb 13:16:59
proxy-1 proxy homey.tacc.utexas.edu running 29351 12 06 Feb 13:17:01
worker-1 worker mojo1.tacc.utexas.edu running 25026 2 06 Feb 13:17:06
worker-10 worker mojo1.tacc.utexas.edu running 25028 2 06 Feb 13:17:06
worker-11 worker mojo1.tacc.utexas.edu running 25033 2 06 Feb 13:17:06
worker-2 worker mojo1.tacc.utexas.edu running 25032 2 06 Feb 13:17:06
worker-3 worker mojo1.tacc.utexas.edu running 25025 2 06 Feb 13:17:06
worker-4 worker mojo1.tacc.utexas.edu running 25031 2 06 Feb 13:17:06
worker-5 worker mojo1.tacc.utexas.edu running 25029 2 06 Feb 13:17:06
worker-6 worker mojo1.tacc.utexas.edu running 25027 2 06 Feb 13:17:06
worker-7 worker mojo1.tacc.utexas.edu running 25034 2 06 Feb 13:17:06
worker-8 worker mojo1.tacc.utexas.edu running 25030 2 06 Feb 13:17:06
worker-9 worker mojo1.tacc.utexas.edu running 25036 ??? 06 Feb 13:17:06

Notice the ???. It's an indication that something is not working correctly in the Bro communication library.

If you run "ps.bro" in broctl, what do you get? I'm suspicious that the old #9 process didn't die and it's still holding the communication port open which would result in that error.
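
(If ps.bro doesn't show anything obvious, one manual way to check, assuming the worker host and the old worker-9 PID from the earlier listing, would be something like:)

  ssh mojo1.tacc.utexas.edu 'ps aux | grep [b]ro'
  # if the old worker-9 process (PID 18351 in the first listing) is still
  # running, kill it and restart the cluster from broctl
  ssh mojo1.tacc.utexas.edu 'kill 18351'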

  .Seth

Sorry for replying so late, I was unexpectedly off-line for some time... I
tried all the suggestions yesterday and found the issue to be the Broadcom
NIC. The tap is on an Intel NIC now and dropped packets are at an acceptable
level.

bro: 1328772549.657147 recvd=22987443 dropped=9811 link=22997255

There have been no alerts regarding packet loss either. Thanks for your support
on this issue, back to clustering for me ;)

Great to hear! Broadcom NICs have the weirdest problems.

Are you doing network based clustering (multiple physical hosts)?

  .Seth

Not exactly related to Machiel's problem, but if people are stuck with
Broadcom adapters, you can try updating the firmware on the cards. The
4.x version shown above is extremely old. See
http://www.broadcom.com/support/ethernet_nic/netxtremeii.php for
software for your NetXtreme II adapters.

You can also increase the ring buffer from the default 255 to 1024 or
more so that your receive ring isn't overflowing: "/sbin/ethtool -G
eth0 rx 1024". The same applies to Intel adapters too, if you notice your
interface card dropping packets.
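
(You can see the current and maximum supported ring sizes first with:)

  /sbin/ethtool -g eth0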

      sridhar

I'm looking to run one physical host for now and add a second with several
workers later. However, load balancing has not worked yet: the module and
driver are loaded and pfcount shows traffic, but Bro reports everything in
triplicate, so there is no load balancing.

Is there any configuration needed in Bro, apart from configuring the manager,
proxy, and workers in node.cfg, to get this working?

As I understood it there should not be; as long as you link against the
PF_RING-enabled libpcap, it should work when PF_RING is active.

Machiel.

Could you send me the content of your node.cfg and broctl.cfg files? This is fortunate timing; I've been preparing a blog post about using PF_RING load balancing with Bro, and it would be good to find out if there are any problems with it.

  .Seth

I'm using the following in node.cfg:

[manager]
type=manager
host=10.0.0.42

[proxy-1]
type=proxy
host=10.0.0.42

[worker-1]
type=worker
host=10.0.0.42
interface=p1p1

[worker-2]
type=worker
host=10.0.0.42
interface=p1p1

[worker-3]
type=worker
host=10.0.0.42
interface=p1p1

And these settings in broctl.cfg:

MailTo = root@localhost
SitePolicyStandalone = local.bro
SpoolDir = /var/opt/bro/spool
LogDir = /var/opt/bro/logs
LogRotationInterval = 3600
MinDiskSpace = 5
Debug = 1

Regards, Machiel.

> Is there any configuration needed in Bro, apart from configuring the manager,
> proxy, and workers in node.cfg, to get this working?
>
> Could you send me the content of your node.cfg and broctl.cfg files? This
> is fortunate timing; I've been preparing a blog post about using PF_RING
> load balancing with Bro, and it would be good to find out if there are any
> problems with it.
>
> .Seth

> I'm using the following in node.cfg:

That seems fine.

> And these settings in broctl.cfg:
>
> MailTo = root@localhost
> SitePolicyStandalone = local.bro
> SpoolDir = /var/opt/bro/spool
> LogDir = /var/opt/bro/logs
> LogRotationInterval = 3600
> MinDiskSpace = 5
> Debug = 1

It looks like you are missing the setting that turns on the pf_ring clustering support. If you built against the pf_ring libpcap wrapper it should have been put in there automatically (unless you installed over top of a previous installation?).

Add this to your broctl.cfg and do "check", "install", "restart" in broctl.
PFRingClusterId = 21

  .Seth

I've added the option, but there is no difference. I did notice in the debug logs
before that this option had already been set by default. At startup I see the
following for all workers, the proxy, and the manager:

"PCAP_PF_RING_USE_CLUSTER_PER_FLOW=1 PCAP_PF_RING_CLUSTER_ID=21"

The bro binary does seem to use the correct lib:

$ ldd /opt/bro/bin/bro | grep pcap
libpcap.so.1 => /usr/local/lib/libpcap.so.1 (0x00007fae5cad2000)

I'll go ahead and do this again on Monday; perhaps I made a mistake during
the build process.

Thanks, Machiel.

What do you see in /proc/net/pf_ring/ ? If you cat a file matching
the PID of one of the Bro processes, it should say what the cluster_id
is. If they are all 21, then it is working.
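
(Roughly, with a hypothetical PID and interface name; the exact field name may vary by PF_RING version:)

  ls /proc/net/pf_ring/
  # each capture socket shows up as <pid>-<interface>.<ring id>
  cat /proc/net/pf_ring/12345-eth1.0 | grep -i cluster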

There is a relatively new behavior from the scanners. In order to work around the automatic scan blocking, they have increased the scan rate so that they can scan 30K-60K addresses in a second. This makes Bro go compute bound, I think due to creating a record for each connection pair, and it cannot keep up.

Using PF_RING helps, but not all attacks hash well and one worker can be overwhelmed.

Has anyone else seen this new behavior?

Bill Jones

It looks like only one worker uses PF_RING; the cluster id is 21.

$ ls -l /proc/net/pf_ring/
-r--r--r-- 1 root root 0 Feb 13 10:15 15489-p1p1.1
dr-xr-xr-x 5 root root 0 Feb 13 10:15 dev
-r--r--r-- 1 root root 0 Feb 13 10:15 info
-r--r--r-- 1 root root 0 Feb 13 10:15 plugins_info

[BroControl] > status
...
worker-1 worker 192.168.42.215 running 15489 2 13 Feb 10:12:02
...

When I set transparent_mode to 2 it shows:

[BroControl] > netstats
  worker-1: 1329124697.533460 recvd=144655 dropped=0 link=144655
  worker-2: 1329124697.733532 recvd=0 dropped=0 link=0
  worker-3: 1329124697.934520 recvd=0 dropped=0 link=0

The other two workers do not connect. The only thing I could find so far that could cause this is quick_mode; I've disabled that option.
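
(For what it's worth, the PF_RING kernel module parameters can also be set explicitly when reloading the module; the parameter names below are taken from the PF_RING documentation and may differ by version, so treat this as a sketch:)

  rmmod pf_ring
  # adjust transparent_mode to match your driver setup
  modprobe pf_ring quick_mode=0 transparent_mode=0 min_num_slots=32768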

Any idea what else could cause this?

Machiel.