snaplen and drops

On a reasonably fast Linux box seeing (currently) <10 Mbps, I'm
getting lots of packet drops with current master, even though CPU
load is very low. I did the usual sysctl tuning, but that didn't help.
Then I reduced the snaplen (which now defaults to 65K) down to 8K, and
the drops disappeared.

That seems quite an extreme effect of the new default value. Should
we reconsider and (1) use a smaller default, and/or (2) make the
snaplen accessible from the scripting layer (right now, there's only
-s, which doesn't work well with BroControl)?

Is there other tuning to get around the problem (with standard kernel,
not PF_RING etc.)?

Robin

That's very weird. It's abnormal to get packets > 1514, right? Are
you monitoring a link with a lot of jumbo packets? Something is
wrong, as that small bandwidth shouldn't matter no matter what the
packet sizes are anyway.

On Linux there are two parameters that need to be tuned for Bro:

net.core.rmem_max
net.core.rmem_default

These control the buffer used by the raw socket interface. I would start at 20 megabytes if the interface is 1 GigE Ethernet, and at 200 megabytes if it is 10 GigE Ethernet. Keep increasing the value until it no longer has an effect on the drop rate. I use "tcpdump -i <interface> -w <file>" to check the drop rate. Let tcpdump run for about 10 to 20 seconds and hit ctrl-c. Tcpdump will report the packets dropped by the system and the total packets received.

Increase net.core.netdev_max_backlog to 100000

There are some 1 GigE Ethernet NICs that just can't be tuned.
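
To see what the kernel is currently using before and after tuning, the three settings named above can be read back from /proc/sys. The following is only a minimal sketch: read_sysctl_long and print_capture_sysctls are hypothetical helper names, and the paths assume a Linux system with procfs mounted in the usual place.

```c
#include <stdio.h>

/* Hypothetical helper: read a single integer from a /proc/sys file.
 * Returns -1 if the file cannot be opened or does not parse. */
long read_sysctl_long(const char *path)
{
    FILE *f = fopen(path, "r");
    long value = -1;

    if (!f)
        return -1;
    if (fscanf(f, "%ld", &value) != 1)
        value = -1;
    fclose(f);
    return value;
}

/* Print the three settings discussed in this message. */
void print_capture_sysctls(void)
{
    static const char *paths[] = {
        "/proc/sys/net/core/rmem_max",
        "/proc/sys/net/core/rmem_default",
        "/proc/sys/net/core/netdev_max_backlog",
    };
    size_t i;

    for (i = 0; i < sizeof(paths) / sizeof(paths[0]); i++)
        printf("%s = %ld\n", paths[i], read_sysctl_long(paths[i]));
}
```

Calling print_capture_sysctls() once before and once after a sysctl change is a quick way to confirm that the new values actually took effect.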

That's very weird. It's abnormal to get packets > 1514, right? Are
you monitoring a link with a lot of jumbo packets?

Not at all, but iirc, the kernel reserves space for packets of size
snaplen, which means that with a larger snaplen, fewer will fit into
its buffers.

Something is wrong, as that small bandwidth shouldn't matter no
matter what the packet sizes are anyway.

Yeah, I'm thinking so too. Something odd is going on; I need to look
into it more closely.

As long as others aren't seeing similar problems with the default 65K
snaplen, I'm fine.

Robin

Yeah, I had tuned these already, didn't help. But I'll take another
look.

Thanks,

Robin

We were seeing high loss with relatively low traffic, but that was with Myricom cards. I suppose it could have been due to the snap length, though; I never tried changing it.

  .Seth

The Linux kernel will enable Large Receive Offload on NICs that support it. This allows the NIC to present multiple contiguous TCP packets as one large packet to the kernel. I have seen tcpdump report packet sizes of 24K on interfaces with an MTU of 1500 bytes when LRO is on.

The only way to turn off this feature in Linux right now is to turn route forwarding on and reboot.

Hi,

That's very weird. It's abnormal to get packets > 1514, right? Are
you monitoring a link with a lot of jumbo packets?

On Linux you can observe packets > 1514 bytes, even if you monitor a link that does not carry a single jumbo frame.

You can have large packets if your NIC supports RSC (Receive Side Coalescing). RSC is implemented in some network cards (e.g. Intel 10GE with an 82599 chipset), and reassembles subsequent TCP segments into larger packets in order to reduce the number of packets that need to be handled by the kernel.

Even if your network card does not implement RSC, you might also see large packets due to LRO/GRO (Large Receive Offload / Generic Receive Offload) done in software (more information: http://lwn.net/Articles/358910/).

However, this needs to be supported by your NIC driver and enabled via ethtool.

ethtool -k <dev>

will show you if you have LRO or GRO enabled.

On a reasonably fast Linux box seeing (currently) <10 Mbps, I'm
getting lots of packet drops with current master, even though CPU
load is very low. I did the usual sysctl tuning, but that didn't help.
Then I reduced the snaplen (which now defaults to 65K) down to 8K, and
the drops disappeared.

Which kernel and libpcap version do you use?

Today's Linux kernels support memory mapping for packet exchange between userland and kernel. Since version 1.0.0, libpcap uses this feature by default:

libpcap requests a two megabyte sized shared buffer (default size) from the kernel. The snaplen is passed to the kernel which will align captured packets according to the snaplen. The snaplen is therefore the only parameter that decides how many packets fit into the buffer between kernel and application.

If you configure a snaplen of 64 KB, you will have space for only 32 packets in your buffer (2 MB / 64 KB).
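
The arithmetic above can be written down as a tiny C helper. This is a simplified model that follows the description in this message: it assumes each packet occupies exactly one snaplen-sized slot and ignores any per-packet header or alignment overhead. The name ring_slots is made up for illustration and is not a libpcap function.

```c
#include <stddef.h>

/* Simplified model: number of snaplen-sized slots that fit into the
 * shared kernel/userland buffer. Per-packet header and alignment
 * overhead are deliberately ignored (an assumption of this sketch). */
size_t ring_slots(size_t buffer_bytes, size_t snaplen_bytes)
{
    if (snaplen_bytes == 0)
        return 0;
    return buffer_bytes / snaplen_bytes;
}
```

With the default 2 MB buffer, ring_slots(2 * 1024 * 1024, 64 * 1024) gives 32 slots, while an 8 KB snaplen gives 256 slots in the same buffer.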

You could try to make libpcap allocate a bigger buffer with pcap_set_buffer_size(). However, this must be called before pcap_activate(), which means that you cannot use pcap_open_live() but have to call pcap_create(), pcap_set_snaplen(), pcap_set_timeout(), and pcap_activate() yourself.

Best regards,
  Lothar

Hi,

What version of Linux are you running? Red Hat 5 doesn't seem to have an LRO option, just the GRO option, which does not seem to affect the LRO behavior.

Bill Jones

Hi,

Thanks!

Which kernel and libpcap version do you use?

Fedora 15, kernel 2.6.40, libpcap 1.1.1, Intel NIC.

libpcap requests a two megabyte sized shared buffer (default size)
from the kernel.

That's it! I just patched libpcap to request a much larger buffer, and
right now it's looking like the drops are indeed gone even at the 65K
snaplen.

You could try to make libpcap allocate a bigger buffer with
pcap_set_buffer_size(). However, this must be called before
pcap_activate(), which means that you cannot use pcap_open_live() but
have to call pcap_create, pcap_set_snaplen, pcap_set_timeout,
pcap_activate yourself...

Argh. Are they serious? There's essentially no way to control the
buffer size? My patch now looks for an environment variable ...
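
For illustration, an environment-variable override of this kind could look roughly like the C sketch below. It is not the actual patch: the variable name PCAP_BUFFER_SIZE_BYTES and both function names are hypothetical. The result is an int because pcap_set_buffer_size() takes an int.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Hypothetical variable name; the name used by the real patch is not
 * given in this thread. */
#define BUFSIZE_ENV "PCAP_BUFFER_SIZE_BYTES"

/* Parse a decimal byte count. Returns default_bytes if s is NULL,
 * empty, not a plain positive integer, or too large for an int. */
int parse_buffer_size(const char *s, int default_bytes)
{
    char *end;
    long v;

    if (!s || *s == '\0')
        return default_bytes;

    errno = 0;
    v = strtol(s, &end, 10);
    if (errno != 0 || *end != '\0' || v <= 0 || v > INT_MAX)
        return default_bytes;
    return (int)v;
}

/* Look up the override in the environment, falling back to the
 * given default when it is unset or invalid. */
int buffer_size_from_env(int default_bytes)
{
    return parse_buffer_size(getenv(BUFSIZE_ENV), default_bytes);
}
```

Falling back to the default on any parse error keeps existing setups working unchanged when the variable is unset or mistyped.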

Thanks for pointing this out,

Robin

Hi,

You could try to make libpcap allocate a bigger buffer with
pcap_set_buffer_size(). However, this must be called before
pcap_activate(), which means that you cannot use pcap_open_live() but
have to call pcap_create, pcap_set_snaplen, pcap_set_timeout,
pcap_activate yourself...

Argh. Are they serious? There's essentially no way to control the
buffer size? My patch now looks for an environment variable ...

You can change the buffer size in Bro if you use the new API for opening an interface. However, this will not work with libpcap versions < 1.0.0.

But that's not a real problem (at least for Linux) because these versions do not support memory mapping, anyways.

If you want to use the new API and do not want to drop support for libpcap < 1.0.0, you have to check the pcap version in cmake and set some define for old versions (e.g. -DOLD_PCAP). Then you can have something like the following in PktSrc.cc:

#ifdef OLD_PCAP
    pd = pcap_open_live(...);
    if (!pd) {
        do_some_complaining();
        return;
    }
#else
    int status;

    pd = pcap_create(device, errorbuf);
    if (!pd) {
        do_some_complaining();
        return;
    }

    status = pcap_set_snaplen(pd, snaplen);
    if (status < 0)
        goto fail;

    status = pcap_set_promisc(pd, promisc);
    if (status < 0)
        goto fail;

    status = pcap_set_timeout(pd, to_ms);
    if (status < 0)
        goto fail;

    /* Increase the buffer size; this must happen before pcap_activate(). */
    status = pcap_set_buffer_size(pd, new_bigger_buffer_size);
    if (status < 0)
        goto fail;

    /* Negative return values are errors; positive ones are warnings. */
    status = pcap_activate(pd);
    if (status < 0)
        goto fail;
#endif

    do_some_more_useful_stuff_if_necessary();
    return;    /* success: do not fall through into the error path */

#ifndef OLD_PCAP
fail:
    do_some_complaining();
    pcap_close(pd);
#endif

Thanks for the code example, I hadn't really looked at the new API
yet. I'm not that concerned about dropping support for libpcap < 1.
The part I don't like is how the new parameter "buffer size" impacts
behaviour of existing programs without giving the user a hook to change
the default. That doesn't seem right to me.

Anyways, for Bro it probably makes most sense to address this as a
part of a larger piece we already have on our to-do list: overhauling
Bro's code for packet acquisition. It's in pretty bad shape right now:
(1) the main packet loop still works around problems with non-blocking
mode in older libpcap/OS versions; I would hope that's not necessary
anymore. (2) We don't have a nice interface for using other packet
sources than libpcap; we need an abstraction there. And finally (3),
if we got an interface to exploit further NIC-level features, like
load-balancing, that would be pretty cool.

Not sure when somebody will start working on all this though.

Robin

Glad you were able to sort this out. I use PF_RING exclusively for
packet capture, so I've not run into this before.

In the future, AF_PACKET support would be a great addition to Bro and
would bring it closer to Snort and Suricata as far as acquisition.
It's got performance reasonably close to PF_RING without having to
download anything extra. However, you need to be running a 3.0 Linux
kernel to do software load-balancing, which is one of the reasons I
use PF_RING.

Hi,

Anyways, for Bro it probably makes most sense to address this as a
part of a larger piece we already have on our to-do list: overhauling
Bro's code for packet acquisition. It's in pretty bad shape right now:
(1) the main packet loop still works around problems with non-blocking
mode in older libpcap/OS versions; I would hope that's not necessary
anymore. (2) We don't have a nice interface for using other packet
sources than libpcap; we need an abstraction there.

Snort has an abstraction layer called libdaq:

http://vrt-blog.snort.org/2010/08/snort-29-essentials-daq.html

I haven't had a look at it myself, so I cannot make a statement on whether it's a good abstraction layer. But maybe it can be used in Bro, too.

And finally (3),
if we got an interface to exploit further NIC-level features, like
load-balancing, that would be pretty cool.

Yes, these new features might be very cool. However, if you rely on these hardware features, you might run into hardware-specific problems.

Some NICs seem to use a per-flow scheme for distributing traffic onto multiple queues. This can lead to problems if you use such NICs for distributing traffic to multiple Bro instances: Client and server traffic of a single TCP connection might be forwarded to different worker nodes.

We therefore use software load-balancing for setups with multiple Bro worker nodes on a single machine.
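
One common way to keep both directions of a connection on the same worker in software is to make the flow hash symmetric: order the two endpoints before hashing, so that A->B and B->A produce the same value. The C sketch below illustrates that general idea only. It is not a description of our setup or of PF_RING; the struct, the function names, and the choice of FNV-1a as the hash are all assumptions made for the example, and it covers IPv4 addresses and ports only.

```c
#include <stdint.h>

/* One endpoint of an IPv4 TCP/UDP flow (illustrative type). */
struct endpoint {
    uint32_t addr;  /* IPv4 address */
    uint16_t port;  /* TCP or UDP port */
};

/* Total order on endpoints: by address, then by port. */
static int endpoint_less(struct endpoint a, struct endpoint b)
{
    if (a.addr != b.addr)
        return a.addr < b.addr;
    return a.port < b.port;
}

/* Map a packet to a worker index in [0, n_workers). The endpoints are
 * sorted first, so swapping src and dst yields the same worker. */
unsigned pick_worker(struct endpoint src, struct endpoint dst,
                     unsigned n_workers)
{
    struct endpoint lo = src, hi = dst;
    uint32_t words[4];
    uint32_t h = 2166136261u;   /* FNV-1a 32-bit offset basis */
    int i, shift;

    if (n_workers == 0)
        return 0;
    if (endpoint_less(dst, src)) {
        lo = dst;
        hi = src;
    }

    words[0] = lo.addr;
    words[1] = lo.port;
    words[2] = hi.addr;
    words[3] = hi.port;

    /* FNV-1a over the bytes of the ordered fields. */
    for (i = 0; i < 4; i++) {
        for (shift = 0; shift < 32; shift += 8) {
            h ^= (words[i] >> shift) & 0xffu;
            h *= 16777619u;     /* FNV-1a 32-bit prime */
        }
    }
    return h % n_workers;
}
```

Because only the ordered pair of endpoints enters the hash, client-to-server and server-to-client packets of one connection always land on the same worker index.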

Best regards,
  Lothar

Hi,

Glad you were able to sort this out. I use PF_RING exclusively for
packet capture, so I've not run into this before.

In the future, AF_PACKET support would be a great addition to Bro and
would bring it closer to Snort and Suricata as far as acquisition.
It's got performance reasonably close to PF_RING without having to
download anything extra.

I'm a bit puzzled. If I understand things correctly, libpcap-1.0.0 uses AF_PACKET by default (after checking that MMAP support is available in the running kernel).

As far as I understand, AF_PACKET is the kernel socket infrastructure that provides an mmapped buffer between the kernel and userspace, plus a socket that can be polled or waited on when no packets are stored in the buffer. Using a "new" libpcap with a modern kernel should already provide AF_PACKET support.

Am I missing something?

However, you need to be running a 3.0 Linux
kernel to do software load-balancing, which is one of the reasons I
use PF_RING.

Cool, I wasn't aware of load balancing features in the standard kernel. Did you do some experiments to compare the standard kernel load-balancing to the one provided by PF_RING?

Best regards,
  Lothar

I'm a bit puzzled. If I understand things correctly, libpcap-1.0.0 uses AF_PACKET by default (after checking that MMAP support is available in the running kernel).

I don't think that's how it works, but I'm not a kernel-hacking guru.
The reason I'm pretty sure it doesn't work that way is that both Snort
and Suricata IDS include separate data acquisition code for libpcap
and af_packet, which is nonsensical if you can get af_packet via
libpcap natively. I did a bit of Googling and cannot find anything
definitive one way or the other.

Cool, I wasn't aware of load balancing features in the standard kernel. Did you do some experiments to compare the standard kernel load-balancing to the one provided by PF_RING?

None of the major distros are using the 3.0 kernel yet, and I don't
have time to mess around with the dev kernels, so I'm without any
experimental data for you.

Snort has an abstraction layer called libdaq:
I haven't had a look at it myself, so I cannot make a statement on whether it's a good abstraction layer. But maybe it can be used in Bro, too.

Nope, it's GPL. I asked on their mailing list if they could relicense it as BSD right after they announced it a while back and I never heard back from anyone.

Some NICs seem to use a per-flow scheme for distributing traffic onto multiple queues. This can lead to problems if you use such NICs for distributing traffic to multiple Bro instances: Client and server traffic of a single TCP connection might be forwarded to different worker nodes.

The fairly common RSS feature in NICs does this sort of round-robin packet distribution across queues, but Intel's newer Flow Director feature on their high-end NICs does flow-based load balancing across the queues, so you actually get client and server traffic in the same queue. On Linux, the only way I know to take advantage of that is with TNAPI (from Luca Deri and the other NTOP guys).

Lately I've been very impressed with Myricom's sniffer drivers which do the hardware based load balancing and direct memory injection. Their drivers work on FreeBSD and Linux which is an added benefit too. We need to make some small modifications to broctl to better support clustering with them, but they've been very problem free so far. Charging extra for special drivers seems a bit underhanded to me though (Myricom's sniffer drivers cost extra).

We therefore use software load-balancing for setups with multiple Bro worker nodes on a single machine.

How are you doing this? PF_RING is also doing software-based load balancing in the kernel, but it's actually slightly wrong because it includes the vlan-id as one of the tuple fields it balances on, which can cause problems for networks where each direction of traffic is in a different vlan. I filed a ticket with them to make the load balancing configurable, though, so hopefully that will be fixed in the next release of PF_RING.

  .Seth