Greetings:
We were eager to explore Zerocopy BPF and after making sure bro was fully functional, we changed to 0-copy via:
sysctl net.bpf.zerocopy_enable=1
We should have known we were in for trouble when tcpdump immediately began coredumping on exit. We installed the latest and greatest tcpdump and libpcap (v 1.1.1) via FreeBSD ports, and had the same user experience. The following is offered in the hope that others may avoid the special type of fun that we enjoyed - keep in mind this fun is only to be had when 0-copy is enabled:
- As previously mentioned, tcpdump coredumps; gdb indicates that it called free() on exit, presumably trying to free a kernel-owned buffer. Didn’t debug it any further, but it was a portent of things to come. Later found a patch for this issue at http://sourceforge.net/tracker/?func=detail&aid=3290385&group_id=53067&atid=469579
- Bro failed to run with 0-copy - quite a bit of dithering indicated that it was freezing at pcap_next(), which reads the next packet from the interface.
- Wrote a test program using pcap_next() - it fails under 0-copy after several hundred packets. Well, since tcpdump does work (except for the coredump), let’s see what it’s doing:
- tcpdump uses pcap_next_ex() instead of pcap_next(), so I wrote a replacement pcap_next() in terms of pcap_next_ex(), and it correctly grabs packets.
- Grafted the replacement pcap_next() into Bro, and the user experience was the same as before.
- Lots of debugging using various cutlery on bro, eventually libpcap came into focus as a potential culprit
- Sliced and diced the 0-copy code inside of libpcap - found a few places where improvements could be made (but that’s a different story), which gave quite a bit of insight into its innards - here’s a presentation on 0-copy: http://www.seccuris.com/documents/whitepapers/20070517-devsummit-zerocopybpf.pdf
- Ran bro with a simplified policy of just conn, tcp & vlan - (our packets at this point in our network are vlan tagged) - it worked!
- Ran again with our policy, it freezes!
- After a rough binary search of the policy, discovered that loading remote.bro causes Bro to freeze under zero-copy. So, after all that, it turns out that Bro itself works with libpcap-1.1.1 on 0-copy, but it took a lot to figure that out.
So turning off the remote communication fixes the issue in the short term, but doesn’t solve it for us, since broctl uses the same mechanism.
Haven’t finished debugging yet, but it appears that broccoli may be causing the issue on 0-copy - when it becomes clearer, I will send more.
This is written in the hopes that folks won’t be tearing their hair out, like us, as they go forward in this direction. If anyone has any suggestions, etc. (particularly in going forward with solving this problem), I would appreciate it.
Hope this helps,
Jim Mellander
NERSC CyberSecurity
BTW - it appears that on zero-copy, net.bpf.maxbufsize & net.bpf.bufsize are effectively limited to 2 MB - they can be set larger, but apparently the extra won’t be used, per netstat -B, which is your friend when debugging these issues.
BTW #2: zero-copy seems to be worth doing, especially at the high bandwidths that we’re moving up to, so it’s important to us to solve this.
BTW #3: the problem doesn’t just manifest on a high-speed link - I pointed Bro towards our management port (100M), and it failed in the same way, so it’s not a capacity issue.
BTW #4: There’s no special config other than setting the sysctl to turn on 0-copy - libpcap detects that it is running 0-copy and follows a different code path, but the API is the same - except that the issue we’ve been having (and the coredump of tcpdump) indicates that 0-copy is not quite fully baked.