Bro Packet Loss / 10Gb ixgbe / pf_ring

I’m trying to debug some packet drops that I’m experiencing and am turning to the list for help. The recorded packet loss is ~50 – 70% at times. The packet loss is recorded in broctl’s netstats as well as in the notice.log file.

Running netstats at startup – I’m dropping more than I’m receiving from the very start.

[BroControl] > netstats

worker-1-1: 1452200459.635155 recvd=734100 dropped=1689718 link=2424079

worker-1-10: 1452200451.830143 recvd=718461 dropped=1414234 link=718461

worker-1-11: 1452200460.036766 recvd=481010 dropped=2019289 link=2500560

worker-1-12: 1452200460.239585 recvd=720895 dropped=1805574 link=2526730

worker-1-13: 1452200460.440611 recvd=753365 dropped=1800827 link=2554453

worker-1-14: 1452200460.647368 recvd=784145 dropped=1800831 link=2585237

worker-1-15: 1452200460.844842 recvd=750921 dropped=1868186 link=2619368

worker-1-16: 1452200461.049237 recvd=742718 dropped=1908528 link=2651507

System information:

  • 64-core AMD Opteron system
  • 128 GB of RAM
  • Intel 10 Gb ixgbe interface (dual 10 Gb interfaces; eth3 is the sniffer)
  • Licensed copy of PF_RING ZC

I'm running Bro 2.4.1 and PF_RING 6.2.0 on CentOS 6 with the 2.6.32-431 kernel.

I have the proxy, manager, and 16 workers running on the same system; 16 CPUs are pinned (0-15).

Startup scripts load the various kernel modules (built from the PF_RING 6.2.0 source):

insmod /lib/modules/2.6.32-431.11.2.el6.x86_64/kernel/net/pf_ring/pf_ring.ko enable_tx_capture=0 min_num_slots=32768 quick_mode=1

insmod /lib/modules/2.6.32-431.11.2.el6.x86_64/kernel/drivers/net/ixgbe/ixgbe.ko numa_cpu_affinity=0,0 MQ=0,1 RSS=0,0

I checked the numa_node entries under /sys/bus/pci/devices to confirm that the interface is running on NUMA node 0. 'lscpu' shows that CPUs 0-7 are on node 0, socket 0, and CPUs 8-15 are on node 1, socket 0. I figured having the 16 RSS queues on the same socket is probably better than having them bounce around.
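For reference, a quick way to cross-check the NIC's NUMA placement from userspace (a sketch; the sysfs path assumes eth3 is the sniffing interface):

cat /sys/class/net/eth3/device/numa_node   # NUMA node the NIC hangs off
lscpu | grep -i numa                       # CPU-to-node mapping, should agree with the above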

I’ve disabled a bunch of the ixgbe offloading stuff:

ethtool -K eth3 rx off

ethtool -K eth3 tx off

ethtool -K eth3 sg off

ethtool -K eth3 tso off

ethtool -K eth3 gso off

ethtool -K eth3 gro off

ethtool -K eth3 lro off

ethtool -K eth3 rxvlan off

ethtool -K eth3 txvlan off

ethtool -K eth3 ntuple off

ethtool -K eth3 rxhash off

ethtool -G eth3 rx 32768
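As a sanity check (not from the original post), the lowercase query flags show whether these settings actually took:

ethtool -k eth3    # current offload settings
ethtool -g eth3    # current and maximum ring sizes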

I’ve also tuned the stack, per recommendations from SANS:

net.ipv4.tcp_timestamps = 0

net.ipv4.tcp_sack = 0

net.ipv4.tcp_rmem = 10000000 10000000 10000000

net.ipv4.tcp_wmem = 10000000 10000000 10000000

net.ipv4.tcp_mem = 10000000 10000000 10000000

net.core.rmem_max = 134217728

net.core.wmem_max = 134217728

net.core.netdev_max_backlog = 250000
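For completeness, a sketch of how these would typically be applied, assuming they live in /etc/sysctl.conf:

sysctl -p /etc/sysctl.conf                     # reload all values from the file
sysctl -w net.core.netdev_max_backlog=250000   # or set a single value immediately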

The node.cfg looks like this:

[manager]

type=manager

host=10.99.99.15

#

[proxy-1]

type=proxy

host=10.99.99.15

#

[worker-1]

type=worker

host=10.99.99.15

interface=eth3

lb_method=pf_ring

lb_procs=16

pin_cpus=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15

I have a license for ZC, and if I change the interface from eth3 to zc:eth3 it will spawn 16 workers, but only one of them receives any traffic. I'm assuming it is looking at zc:eth3@0 only; netstats bears that out. If I run pfcount -i zc:eth3, it shows that I'm receiving ~1 Gbps of traffic on the interface and not dropping anything.

Am I missing something obvious? I saw many threads about disabling hyper-threading, but that seems specific to Intel processors; I'm running AMD Opterons with their HyperTransport design, which doesn't create virtual CPUs.

Thanks,
-Paul

Change your min_num_slots to 65535. I would also add an additional proxy and another 8 workers.

Some thoughts inline...

I’m trying to debug some packet drops that I’m experiencing and am turning to the list for help. The recorded packet loss is ~50 – 70% at times. The packet loss is recorded in broctl’s netstats as well as in the notice.log file.

Running netstats at startup – I’m dropping more than I’m receiving from the very start.

Have you tried enabling the Bro capture-loss script in your local.bro as a way to double-check your loss numbers? It will give you per-worker loss at 15-minute intervals in a separate log file.

In local.bro:
@load policy/misc/capture-loss

insmod /lib/modules/2.6.32-431.11.2.el6.x86_64/kernel/net/pf_ring/pf_ring.ko enable_tx_capture=0 min_num_slots=32768 quick_mode=1

insmod /lib/modules/2.6.32-431.11.2.el6.x86_64/kernel/drivers/net/ixgbe/ixgbe.ko numa_cpu_affinity=0,0 MQ=0,1 RSS=0,0

I checked the numa_node entries under /sys/bus/pci/devices to confirm that the interface is running on NUMA node 0. 'lscpu' shows that CPUs 0-7 are on node 0, socket 0, and CPUs 8-15 are on node 1, socket 0. I figured having the 16 RSS queues on the same socket is probably better than having them bounce around.

The node.cfg looks like this:

[manager]

type=manager

host=10.99.99.15

#

[proxy-1]

type=proxy

host=10.99.99.15

#

[worker-1]

type=worker

host=10.99.99.15

interface=eth3

lb_method=pf_ring

lb_procs=16

pin_cpus=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15

I have a license for ZC, and if I change the interface from eth3 to zc:eth3 it will spawn 16 workers, but only one of them receives any traffic. I'm assuming it is looking at zc:eth3@0 only; netstats bears that out. If I run pfcount -i zc:eth3, it shows that I'm receiving ~1 Gbps of traffic on the interface and not dropping anything.

As far as ZC usage: when using ZC mode, did you specify which adapters to enable at the end of your ixgbe insmod statement, like this --> adapters_to_enable=<comma-separated list of the licensed MAC addresses you want to use>? Also, did you try setting RSS to match the number of workers instead of leaving it up to the NIC, e.g. RSS=16 instead of 0 (comma-separated per NIC if more than one NIC)? Did you try pfcount -i zc:eth3@0 (through 15) to test each RSS queue? Did you put the necessary license files in /etc/pf_ring? Also, just to be certain, are you using the ixgbe drivers that come with PF_RING, and have you compiled Bro against the PF_RING libpcap?
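A rough sketch of that per-queue test, assuming 16 RSS queues exposed as zc:eth3@0 through zc:eth3@15 (timeout simply stops each run after 10 seconds):

for q in $(seq 0 15); do timeout 10 pfcount -i zc:eth3@$q; done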

Am I missing something obvious? I saw many threads about disabling hyper-threading, but that seems specific to Intel processors; I'm running AMD Opterons with their HyperTransport design, which doesn't create virtual CPUs.

I'm not sure I understand AMD architecture well enough to know how cores map to nodes, so I can't comment on your pinning configuration in terms of workers per core. But assuming each worker is pinned to a physical core and you truly have 16 physical cores on that socket, have you left any cores unpinned somewhere else (maybe a processor in another socket) for the system, Bro manager, proxy, etc. to use? If not, you could have other processes stomping on your workers. If any workers are sharing physical cores that could be problematic as well. Do you have htop or something similar installed where you can easily watch whether processes seem to be competing for the same physical core?

Have you tried running capstats (broctl capstats if using broctl) to see what sort of traffic Bro thinks it is seeing across all workers when you are seeing loss? Depending on the clock speed and efficiency of each core you may be able to process anywhere from 100-300+ Mbps per core, but if that 1 Gbps of traffic was only representative of a single RSS queue on your 10G NIC you could be oversubscribed. If you have free cores on another socket it might be worth taking whatever small performance hit there is over the bus to have more workers running on those other cores. Also, I tend to leave the first couple of logical cores open for the system, as Linux at least seems to prefer them for system use. I do find that pinning workers to specific cores helps overall in the loss department versus letting workers bounce between cores, so I think you are on the right track.
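The capstats check is just the broctl subcommand, e.g.:

broctl capstats    # traffic rates per worker and in total, as Bro sees them
broctl netstats    # recvd/dropped/link counters per worker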

~Gary

Thanks Mike -
I'm using 16 workers because the ixgbe 10 Gb NIC supports hardware receive-side scaling; 16 is the max number of queues it supports. While monitoring traffic this afternoon, I was seeing ~700-800 Mb/s based on pfcount stats.

If I disabled the hardware RSS I'd have to switch over to using standard pf_ring or DNA/ZC. I have a license for ZC, but I've been unable to figure out how to get Bro to monitor all of the zc:eth3 queues. The current Bro load-balancing documentation only covers pf_ring+DNA, not the newer/supported zero-copy functionality, and I can't find the right "interface=" configuration for node.cfg.

- interface=zc:eth3 only monitors one of the queues.
- interface=zc:eth3@0,zc:eth3@1,etc… causes the workers to crash.
- interface=zc:eth3@0 -i zc:eth3@1 -i … didn't work either.

The PF_RING ZC documentation implies using zbalance_ipc to start up a set of queues under a cluster ID, and then pointing at zc:## where ## is the cluster ID. I ran into issues with that as well.

For tonight, I'll disable the hardware RSS and switch over to running straight pf_ring with 24 workers. I'll pin the first 8 so that they are on the same NUMA node as the NIC. I'm not sure what to do with the other 16 workers - does anyone have insight into whether it is better to pin them to the same socket? I'm on AMD, which isn't as well documented as the Intel world.

Thanks,
-Paul

Thanks Gary.
  Sorry to top post, I'm stuck on OWA at the moment. Thanks for your suggestions - here are some quick replies:

- capture_loss.bro - running it, every 15 minutes it reports ~70% (or greater) packet loss across all of the workers
- the 'adapters_to_enable' ixgbe.ko argument doesn't exist in the latest driver bundled with pf_ring 6.2.0
- I've enabled multi-queue on the 2nd interface (MQ=0,2) as well as the 16 hardware RSS queues (RSS=1,16)
- I have a license in /etc/pf_ring
- Bro is linked against the pf_ring-enabled libpcap
- I've confirmed that the .ko's I'm loading are the latest from pf_ring 6.2.0

Right now, pfcount says that eth3 is receiving 462 Mbit/s - I left it running for 5 minutes or so and there were zero dropped packets. As soon as I start up Bro, I'm already dropping 50%+ of packets per worker.
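One more data point that may help here: while Bro is running, the pf_ring kernel module exposes per-socket counters under /proc/net/pf_ring (the file and field names below are from memory, so treat this as a rough sketch):

cat /proc/net/pf_ring/info
grep -H "Lost" /proc/net/pf_ring/*eth3* 2>/dev/null    # per-ring drop counters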

The only other things I can think of are packet duplication from some new taps that we deployed, and possibly protocols that Bro isn't parsing.

-Paul

Ah, I'm still on 6.0.3 with DNA/Libzero, so I didn't realize that adapters_to_enable was changed/removed. I'm about to start testing 6.2 with ZC soon, but probably using zbalance_ipc instead of relying on RSS. If you are running into RSS limits but have more cores on another socket, zbalance_ipc should allow you to do things like aggregate NICs and do 2-, 4-, or 5-tuple hashing to as many worker queues/ring buffers as you can handle, as well as duplicate traffic to a secondary app such as capstats, since it takes over as the on-host load balancer and doesn't rely on RSS. In that case you set RSS=1 for each interface going into zbalance_ipc. Also, looking at the 2.4.1 broctl-related code in /lib/broctl/plugins/lb_pf_ring.py seems to imply that the ZC interface naming is possibly only supported when using zbalance_ipc, but perhaps I'm wrong; the relevant snippet is below:

if nn.interface.startswith("zc"):
    # For the case where a user is running zbalance_ipc
    nn.interface = "%s@%d" % (nn.interface, app_instance)

For DNA there are specific entries in the code for DNA using RSS and for DNA using pfdnacluster_master. So possibly try ZC with zbalance_ipc, using an interface name in node.cfg of zc:<whatever cluster ID you assigned>.

There is a thread in the Bro archives discussing zbalance_ipc usage, including syntax, that might be more helpful than I can be, starting with this post. It might be worth reading through that whole thread as it involves troubleshooting.

Bro probably isn't going to like duplicate packets, such as if you are tapping both the inside and outside interfaces of a firewall. Have you checked weird.log to see if it is complaining about that? Are the taps you refer to plugged directly into your Bro sensor, coming off some sort of tap aggregation load balancer, or are you really using a SPAN port (the latter can sometimes see performance hits due to sampling or router/switch CPU load)? If using an optical tap, is there any chance the fiber plant isn't installed such that you see both send and receive? Do you do any sort of packet slicing that might throw off loss numbers? Looking at weird.log might also give you an indication of whether you are seeing one-sided conversations or have other upstream network issues.
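A quick way to see which weird types dominate, assuming the default ASCII logs and the bro-cut utility:

bro-cut name < weird.log | sort | uniq -c | sort -rn | head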

Another thought: if you have jumbo frames enabled on your network, you may want to check MTU sizes. I currently have mine set to 9216 to match the max packet size on our upstream router. If you are collecting flows somewhere, it might also be worth looking to see whether you have any sources of large flows that might be impacting overall sensor performance.
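For the MTU check, something along these lines (9216 simply mirrors the value mentioned above; match your own upstream maximum):

ip link show eth3 | grep -o "mtu [0-9]*"    # current MTU
ip link set dev eth3 mtu 9216               # raise it if jumbo frames are in play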

~Gary


If you make the line "interface=zc:eth3", the pf_ring plugin for broctl should automatically change the interface that each Bro process is sniffing to the correct name, as you've indicated (zc:eth3@[0-15]). Configure it that way and then check with ps which interface is being sniffed (you will see it as part of the command line that broctl is executing).

I added support for ZC to that plugin for the 2.4 release and I got it working and validated. There are some issues with this path though because if a Bro process crashes or is shut down you will need to restart zbalance_ipc as well in order for that output ring to be reconnected.

  .Seth

Thanks Seth - I have my node.cfg pointing to zc:eth3:

interface=zc:eth3

Upon running broctl cleanup/deploy, I'm seeing that Bro is called with only "-i zc:eth3". I tried calling it with "zc:2" (the cluster ID), and zbalance_ipc handed out 8k packets before the Bro workers crashed.

-Paul

Oh! I forgot, that's the right way to use it. I forgot about the cluster ID thing. Forget what I said before. :-)

Could you send me a log from Bro crashing (hopefully you got a stack trace in your crash dump from broctl)? That could be some unrelated problem.
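The stack trace, if broctl captured one, should show up in the diag output for the crashed node (the worker name here is just an example):

broctl diag worker-1-1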

  .Seth

That last bit about only processing 8k packets sounds almost like PF_RING might be stuck in ZC demo mode (it runs for 10 seconds or so and then stops). This could mean that PF_RING isn't seeing your license file. Any chance it can't read the files in /etc/pf_ring/, or that there is a PATH problem somewhere?
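Two quick checks for the demo-mode theory: make sure the license files are readable, and look for PF_RING messages in the kernel log.

ls -l /etc/pf_ring/
dmesg | grep -i pf_ring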

Did you check that the broctl config option "pfringclusterid" has a non-zero value?

broctl config | grep pfring

You can also check that broctl is using the correct interface name by looking at the "interface=" field in the output of the following command:

broctl nodes

PF_RING ZC works; you have to set the cluster ID as the single interface in node.cfg (zc:2). What you are missing is the PF_RING ancillary application zbalance_ipc, which captures the packets and fans them out to virtual interfaces. You need to read the PF_RING docs for that, as you'll also need hugepages support.

Also, you do not need or want to use RSS with ZC, so disable it by configuring a single queue (RSS=1).

For example: start zbalance_ipc capturing from ZC interface eth4 with a cluster ID of 2 and 16 virtual interfaces, bind the app to core 3 on the NIC-attached NUMA node, use IP hashing, and set a queue size that won't exhaust local NUMA memory:

zbalance_ipc -i zc:eth4 -c 2 -n 16 -g 3 -m 1 -q 4096 (then start Bro)
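Since hugepages were mentioned above, a minimal sketch of reserving them before starting zbalance_ipc (the sizes are illustrative; the PF_RING ZC docs cover the details):

echo 1024 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
mkdir -p /dev/hugepages
mount -t hugetlbfs nodev /dev/hugepages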

As someone else mentioned, the current PF_RING integration requires an entire cluster restart when a worker crashes.