random performance degradation on FreeBSD 13

Hello,

recently we upgraded our FreeBSD servers from 12.2-RELEASE to 13.0-RELEASE. Zeek changed from 3.0.11 to 4.0.4.
We are now faced with a strange performance degradation: when started, Zeek runs as it should, maxing out one CPU.
Processing a 500 MB PCAP takes less than one minute. Then, after a random amount of time, it stalls, and a file of
the same size can take more than 3 hours. CPU load drops to approx. 5%.

We have our own setup to process multiple PCAPs, which works like this:

  1. a manager process is spawned
  2. manager spawns tcpslice
  3. tcpslice creates a fifo and opens a socket
  4. manager spawns zeek and passes the fifo to zeek via -r
  5. manager controls tcpslice via socket to send pcaps over the fifo
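In case it helps, here is a minimal, self-contained sketch of the fifo handoff, with `cat` standing in for zeek and a plain write standing in for tcpslice (in the real setup the manager passes the fifo path to `zeek -r` and drives tcpslice over a control socket):

```shell
# Simplified sketch of the manager/fifo handoff. 'cat' stands in for
# zeek and a plain write stands in for tcpslice; the real manager
# passes the fifo path to 'zeek -r' and controls tcpslice via socket.
dir=$(mktemp -d)
fifo="$dir/pcap.fifo"
mkfifo "$fifo"

cat "$fifo" > "$dir/out.dat" &      # reader side (zeek's role)
reader=$!

printf 'fake-pcap-bytes' > "$fifo"  # writer side (tcpslice's role)

wait "$reader"                      # reader exits on EOF
```

One fifo quirk worth keeping in mind: the reader sees EOF as soon as the last writer closes, so the writer side has to keep the fifo open across pcap boundaries.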

FreeBSD is running on VMware with 8 cores and 32 GB RAM. PCAPs are read via NFS. While the error happens, reading
the PCAPs with other programs is fast, so that is not the bottleneck.

Before the upgrade, this setup ran stably for months without ever restarting Zeek.

We tried to gather debug info via ktrace(1)/kdump(1) or truss(1), but nothing stands out in all that noise. What system calls would you look for?

Has anyone seen this kind of behavior? What metrics could I get from Zeek to find the bottleneck?
Running Zeek with -t blows up the 500 MB PCAP into a 4.2 GB log. How would you filter this?

Can this be related to changes in FreeBSD? Any limits like sysctls you can think of?

Thanks for any ideas!

Franky

> Hello,
>
> recently we upgraded our FreeBSD servers from 12.2-RELEASE to 13.0-RELEASE. Zeek changed from 3.0.11 to 4.0.4.
> We are now faced with a strange performance degradation: when started, Zeek runs as it should, maxing out one CPU.
> Processing a 500 MB PCAP takes less than one minute. Then, after a random amount of time, it stalls, and a file of
> the same size can take more than 3 hours. CPU load drops to approx. 5%.
>
> We have our own setup to process multiple PCAPs, which works like this:
> 1) a manager process is spawned
> 2) manager spawns tcpslice
> 3) tcpslice creates a fifo and opens a socket
> 4) manager spawns zeek and passes the fifo to zeek via -r
> 5) manager controls tcpslice via socket to send pcaps over the fifo

You might be the only one using zeek -r with a fifo. I'd guess it should
mostly work, but the way you are sending it packets might be causing
this issue.

> FreeBSD is running on VMware with 8 cores and 32 GB RAM. PCAPs are read via NFS. While the error happens, reading
> the PCAPs with other programs is fast, so that is not the bottleneck.
>
> Before the upgrade, this setup ran stably for months without ever restarting Zeek.
>
> We tried to gather debug info via ktrace(1)/kdump(1) or truss(1), but nothing stands out in all that noise. What system calls would you look for?

It's likely not an issue with system calls but with the IO loop
getting confused about something. That's why the CPU load is so low:
it's likely just waiting for packets.
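If you do want to squeeze something out of the truss output anyway, summarizing it by syscall name cuts through most of the noise; a zeek that is just waiting should collapse into a tight loop of the same few calls (e.g. poll/kevent/read on the fifo fd). A toy sketch of the idea on made-up sample lines (the field layout here is invented, so adjust the sed pattern to your actual truss output):

```shell
# Toy sketch: count syscall frequency in a truss-style log. With real
# output, a stalled zeek should show a tight loop of the same few calls.
# (The sample lines and field layout below are invented.)
cat > truss.sample <<'EOF'
zeek: read(3,0x7f...,4096) = 4096 (0x1000)
zeek: read(3,0x7f...,4096) = 0 (0x0)
zeek: poll({3/POLLIN},1,1000) = 0 (0x0)
zeek: poll({3/POLLIN},1,1000) = 0 (0x0)
zeek: poll({3/POLLIN},1,1000) = 0 (0x0)
EOF
# strip everything but the syscall name, then count occurrences
sed -E 's/^[^ ]+ ([a-z_]+)\(.*/\1/' truss.sample | sort | uniq -c | sort -rn
```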

> Has anyone seen this kind of behavior? What metrics could I get from Zeek to find the bottleneck?
> Running Zeek with -t blows up the 500 MB PCAP into a 4.2 GB log. How would you filter this?
>
> Can this be related to changes in FreeBSD? Any limits like sysctls you can think of?

Probably not changes in FreeBSD, but changes in the Zeek IO loop.

On Linux I would use `perf` to record what zeek is doing when it
stops processing the pcap data. I'm not sure what the equivalent
would be on FreeBSD.

You could potentially use `gdb` to interrupt the process a few times
and see what the stack traces indicate zeek is doing. Or just
interrupt it and step through the IO loop to see what it's doing. My
guess is that something is going wrong with ExtractNextPacket in the
pcap iosource.

Hi Justin,

thanks for your quick reply.

> You might be the only one using zeek -r with a fifo. I'd guess it should
> mostly work, but the way you are sending it packets might be causing
> this issue.

It should be transparent to Zeek whether it is reading from a file or from the pipe.
I compared processing a merged PCAP with reading the parts from the fifo; both
produced the same logs.

> It's likely not an issue with system calls but with the IO loop
> getting confused about something. That's why the CPU load is so low:
> it's likely just waiting for packets.

I will have a look for changes in the IO loop between the two versions
of zeek.

> On Linux I would use `perf` to record what zeek is doing when it
> stops processing the pcap data. I'm not sure what the equivalent
> would be on FreeBSD.

I think the right tool would be DTrace on FreeBSD. Looks like I can
no longer shy away from learning to use it! :-)

> You could potentially use `gdb` to interrupt the process a few times
> and see what the stack traces indicate zeek is doing. Or just
> interrupt it and step through the IO loop to see what it's doing. My
> guess is that something is going wrong with ExtractNextPacket in the
> pcap iosource.

It's so frustrating that I have not yet found a way to reliably reproduce
the error. It comes and goes.

Franky