snaplen and drops

Hi,

I'm a bit puzzled. If I understand things correctly, libpcap-1.0.0 uses AF_PACKET by default (after checking that MMAP support is available in the running kernel).

I don't think that's how it works, but I'm not a kernel-hacking guru.
The reason I'm pretty sure it doesn't work that way is that both Snort
and the Suricata IDS include separate data acquisition code for libpcap
and AF_PACKET, which would be nonsensical if you could get AF_PACKET via
libpcap natively. I did a bit of Googling and couldn't find anything
definitive one way or the other.

I had a quick look at the libpcap (1.2.0) and the libdaq (0.6.2) code. It seems to me that both of them perform basically the same steps for packet acquisition.

Both create a PF_PACKET socket, and both request a shared memory area for the capture RX ring. And both perform similar operations during packet acquisition:

while (running) {
  if (packet_in_buffer) {
    consume_packet();
  } else {
    poll(socket);
  }
}

So if you use libpcap >= 1.0.0, you should have AF_PACKET support by default. Snort/Suricata probably implemented separate AF_PACKET support for systems that ship libpcap < 1.0.0.

Best regards,
  Lothar

Hi,

Some NICs seem to use a per-flow scheme for distributing traffic onto multiple queues. This can lead to problems if you use such NICs for distributing traffic to multiple Bro instances: Client and server traffic of a single TCP connection might be forwarded to different worker nodes.

The fairly common RSS feature in NICs does this sort of round-robin packet distribution across queues, but Intel's newer Flow Director feature on their high-end NICs does flow-based load balancing across the queues, so you actually get client and server traffic in the same queue. On Linux, the only way I know of to take advantage of that is with TNAPI (from Luca Deri and the other NTOP guys).

We use an Intel card (82598-based, PCI-E, 10 GbE) that supports flow-based distribution across queues (RSS). And flow-based on this chipset really means flow-based. Each flow seems to be hashed according to some hash function that works on:

H(srcIP, srcPort, dstIP, dstPort, proto)

instead of something like

H(srcIP + srcPort + dstIP + dstPort + proto)

which would be bi-directional.
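To make the difference concrete, here is a toy sketch in C. The struct and both hash functions are made up for illustration; the real function in the NIC is reportedly a Toeplitz-style hash:

#include <stdint.h>

/* Illustrative 5-tuple; the field names are made up for this sketch. */
struct five_tuple {
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
    uint8_t  proto;
};

/* Order-sensitive, like H(srcIP, srcPort, dstIP, dstPort, proto):
 * swapping source and destination changes the input order, so the
 * two directions of one connection generally hash to different queues. */
uint32_t hash_directional(const struct five_tuple *t)
{
    uint32_t h = t->src_ip;
    h = h * 31 + t->src_port;
    h = h * 31 + t->dst_ip;
    h = h * 31 + t->dst_port;
    h = h * 31 + t->proto;
    return h;
}

/* Direction-invariant, like H(srcIP + srcPort + dstIP + dstPort + proto):
 * a commutative combination of the endpoints yields the same value for
 * both directions of a connection, so both land on the same queue. */
uint32_t hash_bidirectional(const struct five_tuple *t)
{
    return t->src_ip + t->dst_ip + t->src_port + t->dst_port + t->proto;
}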

I just checked with one of our setups that employs 8 RX queues with PF_RING + TNAPI. This setup distributes client- and server-side traffic of individual connections onto different queues.

Since I don't have an 82599-based card, I don't know whether they changed this behavior there. However, from a network card engineer's point of view, it seems perfectly fine not to map bi-flows to a single queue, but to split the directions of the traffic on a per-flow basis:

If you operate a NIC in a server for normal network communication rather than traffic monitoring (which is probably the primary use case for NICs), you will have client traffic on the RX queues and server traffic on the TX queues. So both directions can be hashed independently.

Anyway, the original point I was trying to make is: if you employ the hardware features of current NICs directly, you might run into hardware-specific issues, such as different chipsets (or, even worse, different chipset revisions) that exhibit different behavior.

Lately I've been very impressed with Myricom's sniffer drivers which do the hardware based load balancing and direct memory injection.

Is this similar to the DNA driver developed by Luca Deri?

We therefore use software load-balancing for setups with multiple Bro worker nodes on a single machine.

How are you doing this? PF_RING also does software-based load balancing in the kernel, but it's actually slightly wrong because it includes the VLAN id as one of the tuple fields it balances on, which can cause problems for networks where each direction of the traffic is in a different VLAN. I filed a ticket with them to make the load balancing configurable, though, so hopefully that will be fixed in the next release of PF_RING.

Yes, we use this PF_RING feature for load balancing. I haven't run into the VLAN problem that you describe, as none of our PF_RING setups employ multiple VLANs.

But yes, this behavior is definitely a bug, and should be fixed.
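To illustrate the failure mode, a hypothetical sketch of such a balancing decision (pick_worker is made up for this mail; hash_bidirectional is the toy hash from my earlier sketch):

#include <stdint.h>

struct five_tuple;                                         /* as in the earlier sketch */
uint32_t hash_bidirectional(const struct five_tuple *t);   /* ditto */

/* Hypothetical kernel-side balancing in the spirit of PF_RING: pick a
 * worker from a hash of the flow tuple. If the VLAN id is mixed into
 * the hash and each direction of the traffic travels in a different
 * VLAN, the two directions of one connection land on different workers,
 * even though the 5-tuple part of the hash is direction-invariant. */
uint32_t pick_worker(const struct five_tuple *t, uint16_t vlan_id,
                     int include_vlan, uint32_t n_workers)
{
    uint32_t h = hash_bidirectional(t);
    if (include_vlan)
        h += vlan_id;   /* breaks bi-flow affinity with per-direction VLANs */
    return h % n_workers;
}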

Best regards,
  Lothar

H(srcIP, srcPort, dstIP, dstPort, proto)

instead of something like

H(srcIP + srcPort + dstIP + dstPort + proto)

Ugh... Well, that's annoying. So I guess the best option with a commodity NIC is currently still to do the load balancing on the CPU? (With PF_RING, or with AF_PACKET, since that seems to support load balancing now too; we don't support it yet, though.)
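(The AF_PACKET load balancing I mean is the new PACKET_FANOUT socket option; a minimal sketch, assuming fd is an already-created and bound PF_PACKET socket:)

#include <sys/socket.h>
#include <linux/if_packet.h>

/* Join fd to fanout group 42 with flow-hash balancing. Every socket
 * that joins the same group id gets a per-flow share of the traffic.
 * Requires a recent kernel (Linux >= 3.1). */
int join_fanout(int fd)
{
    int fanout_id  = 42;  /* arbitrary group id, must match across the sockets */
    int fanout_arg = fanout_id | (PACKET_FANOUT_HASH << 16);
    return setsockopt(fd, SOL_PACKET, PACKET_FANOUT,
                      &fanout_arg, sizeof(fanout_arg));
}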

Lately I've been very impressed with Myricom's sniffer drivers which do the hardware based load balancing and direct memory injection.

Is this similar to the DNA driver developed by Luca Deri?

In a way, since it does include direct memory injection like the DNA drivers, but it also does multi-queue, bidirectional, flow-based load balancing. I've just been very impressed with the Myricom drivers because they're easy to install and start using. We need to make some small changes to broctl to properly support doing the load balancing with the Myricom sniffer drivers, but nothing major.

  .Seth

Did you look into the os-daq-modules/daq_afpacket.c file? DAQ implements the AF_PACKET support there.

  .Seth

So if you use libpcap >= 1.0.0, you should have AF_PACKET support by default. Snort/Suricata probably implemented separate AF_PACKET support for systems that ship libpcap < 1.0.0.

I've used pcap > 1.0 and had much worse performance than with AF_PACKET,
so I'd be willing to bet that IRQ CPU utilization is higher with pcap
and that AF_PACKET uses a polling mechanism to decrease its IRQ
overhead. I can't speak to the mmap techniques and whether or not they
differ, but the IRQ load alone would be enough to make a noticeable
difference.

Hi,

I had a quick look at the libpcap (1.2.0) and the libdaq (0.6.2) code. It seems to me that both of them perform basically the same steps for packet acquisition.

Both create a PF_PACKET socket, and both request a shared memory area for the capture RX ring. And both perform similar operations during packet acquisition:

Did you look into the os-daq-modules/daq_afpacket.c file? DAQ implements the AF_PACKET support there.

Yes, I was comparing

daq-0.6.2/os-daq-modules/daq_afpacket.c

to

libpcap-1.2.0/pcap-linux.c

In my opinion, the important parts are:

Setup phase:

socket(PF_PACKET, SOCK_RAW, htons(ETH_P_ALL))

which creates the AF_PACKET socket

setsockopt(handle->fd, SOL_PACKET, PACKET_RX_RING, some_stuff_stuff);
mmap(...);

which create and map a shared buffer between kernel and user space. One important difference could be the default size of this buffer: libpcap chooses 2 MB; I have no idea what libdaq defaults to.
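Spelled out, the setup phase looks roughly like the following sketch. This is a condensed version of what both libraries do, with error handling omitted; the sizes mirror libpcap's 2 MB default, not anything I verified in libdaq:

#include <sys/socket.h>
#include <sys/mman.h>
#include <linux/if_packet.h>
#include <linux/if_ether.h>
#include <arpa/inet.h>

/* Condensed setup: create the socket, request a PACKET_RX_RING,
 * and mmap() it into user space. */
static void *setup_rx_ring(int *out_fd)
{
    int fd = socket(PF_PACKET, SOCK_RAW, htons(ETH_P_ALL));

    struct tpacket_req req;
    req.tp_frame_size = 2048;  /* per-packet slot, multiple of TPACKET_ALIGNMENT */
    req.tp_block_size = 4096;  /* multiple of the page size */
    req.tp_block_nr   = 512;   /* 512 * 4 KB = 2 MB total, libpcap's default */
    req.tp_frame_nr   = (req.tp_block_size / req.tp_frame_size) * req.tp_block_nr;
    setsockopt(fd, SOL_PACKET, PACKET_RX_RING, &req, sizeof(req));

    /* The shared buffer between kernel and user space. */
    void *ring = mmap(NULL, (size_t)req.tp_block_size * req.tp_block_nr,
                      PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    *out_fd = fd;
    return ring;
}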

Receiving packets is also done in a similar way:

As long as there are packets:
  consume packets
else
  call poll() on the socket and sleep until a new packet arrives
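In TPACKET_V1 terms, that loop looks roughly like this sketch (ring, fd and the ring geometry come from the setup above; memory barriers and error handling omitted):

#include <stddef.h>
#include <poll.h>
#include <linux/if_packet.h>

/* Walk the ring slot by slot. A slot whose tp_status has TP_STATUS_USER
 * set holds a packet; otherwise the ring is empty and we sleep in
 * poll() until the kernel fills the next slot. */
static void rx_loop(int fd, char *ring, unsigned frame_size, unsigned frame_nr)
{
    unsigned i = 0;

    for (;;) {
        struct tpacket_hdr *hdr =
            (struct tpacket_hdr *)(ring + (size_t)i * frame_size);

        if (hdr->tp_status & TP_STATUS_USER) {
            const unsigned char *pkt = (const unsigned char *)hdr + hdr->tp_mac;
            /* ... consume hdr->tp_snaplen bytes of packet data here ... */
            (void)pkt;

            hdr->tp_status = TP_STATUS_KERNEL;  /* hand the slot back */
            i = (i + 1) % frame_nr;
        } else {
            struct pollfd pfd = { .fd = fd, .events = POLLIN };
            poll(&pfd, 1, -1);  /* sleep until a new packet arrives */
        }
    }
}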

So, if I'm not overlooking something that is important for capturing performance, both implementations should result in similar capture rates.

Lothar

Hi,

So if you use libpcap >= 1.0.0, you should have AF_PACKET support by default. Snort/Suricata probably implemented separate AF_PACKET support for systems that ship libpcap < 1.0.0.

I've used pcap > 1.0 and had much worse performance than with AF_PACKET,

This is interesting. Could your result be related to the small default buffer size in libpcap (the 2 MB buffer, which has been problematic with Bro when a snaplen of 64 KB is used)? With 64 KB frames, a 2 MB ring holds only about 32 packets, so it overflows quickly under bursts.

Can you remember what your setup looked like?

so I'd be willing to bet that IRQ CPU utilization is higher with pcap
and that AF_PACKET uses a polling mechanism to decrease its IRQ
overhead. I can't speak to the mmap techniques and whether or not they
differ, but the IRQ load alone would be enough to make a noticeable
difference.

Hmm, I'm not sure I understand. What do you mean by IRQ? Hardware interrupts originating from the NIC?

I can see the following things that can influence the capturing performance:

1.) hardware interrupts
2.) software interrupts == availability of kernel threads to pull data into userspace
3.) packet copy operations
4.) packet exchange between kernel and userspace (e.g. mmap)
5.) synchronization between kernel and userspace (e.g. poll() on a socket)

1.) + 2.) are handled by the kernel, and to the best of my knowledge neither libpcap nor libdaq should have any influence on them.

3.)-5.) are done using the same mechanisms in both libraries.

When I'm back at our lab next week, I'll try to find some time to run some experiments. If I can reproduce your observations, maybe I can find an explanation for the differences.

Lothar

This is interesting. Could your result be related to the small default buffer size in libpcap (the 2 MB buffer, which has been problematic with Bro when a snaplen of 64 KB is used)?

Can you remember what your setup looked like?

The setup was a stock Ubuntu 10.04 LTS on both Intel and Broadcom
NICs, and the behavior was observed with any libpcap-based
application, including tcpdump.

Hmm, I'm not sure I understand. What do you mean by IRQ? Hardware interrupts originating from the NIC?

I can see the following things that can influence the capturing performance:

1.) hardware interrupts
2.) software interrupts == availability of kernel threads to pull data into userspace
3.) packet copy operations
4.) packet exchange between kernel and userspace (e.g. mmap)
5.) synchronization between kernel and userspace (e.g. poll() on a socket)

1.) + 2.) are handled by the kernel, and to the best of my knowledge neither libpcap nor libdaq should have any influence on them.

This is where PF_RING and AF_PACKET come in. They alter the way
polling takes place at the kernel level to save hardware
interrupts.

When I'm back at our lab next week, I'll try to find some time to run some experiments. If I can reproduce your observations, maybe I can find an explanation for the differences.

That would be great! I'm quite sure that AF_PACKET performs better
than stock libpcap on Ubuntu 10.04 LTS, but I can only guess at
why.