Over the course of the past year we’ve made a number of changes to Zeek that have positively affected its performance, sometimes considerably. This blog post highlights a few places and changes that knowingly improved performance between Zeek 5.2 and Zeek 6.2.
Selected Pull Requests
Improvements for non-selectable Packet Sources (6.0)
Zeek’s handling of packet sources without selectable file descriptors used significant amounts of CPU time doing non-productive busy polling. This applies to certain high-performance cards like Napatech or older Myricom NICs. At least one bug in the IO loop caused polling to happen 10x more often than expected. The PktSrc::GetNextTimeout() method requested a packet just to decide the next timeout for the IO loop. Further, the poll frequency and packet source timeouts were hard-coded, requiring re-compilation to tweak and experiment.
The bug was squashed. The GetNextTimeout() implementation now uses a last-value heuristic: if Zeek just saw a packet, it expects another and returns no timeout; otherwise it returns a short timeout to the IO loop to relax busy polling. Zeek’s main loop poll behavior and the timeout have also become configurable, enabling NIC-specific tuning and experimentation without recompilation.
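The heuristic can be sketched as follows. This is a simplified illustration, not Zeek’s actual class layout; all names besides GetNextTimeout() are ours:

```cpp
// Sketch of the last-value heuristic for packet sources without a
// selectable file descriptor: if the previous poll yielded a packet,
// assume another is ready and return a zero timeout; otherwise back off
// with a short, configurable timeout so the IO loop stops busy polling
// an idle source.
class NonSelectablePktSrcSketch {
public:
    explicit NonSelectablePktSrcSketch(double idle_timeout)
        : idle_timeout_(idle_timeout) {}

    // Called by the IO loop to decide how long it may sleep.
    double GetNextTimeout() const {
        return had_packet_last_poll_ ? 0.0 : idle_timeout_;
    }

    // Record the outcome of the most recent packet extraction attempt.
    void NotePollResult(bool got_packet) { had_packet_last_poll_ = got_packet; }

private:
    bool had_packet_last_poll_ = false;
    double idle_timeout_;  // configurable, in the spirit of Pcap::non_fd_timeout
};
```

The point of the heuristic is that under load the loop never sleeps, while an idle source costs only one poll per timeout interval instead of a tight spin.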
References
- PktSrc: Avoid calling ExtractNextPacketInternal() in GetNextTimeout()
- PktSrc: Introduce Pcap::non_fd_timeout
- iosource/Manager: Fix poll_interval updating using not-yet valid IsLive()
- iomanager: Collect all sources with zero timeouts as ready
Increased PCAP File Read Buffer (6.2)
On most Linux systems, the default buffer size for a FILE opened via fopen() is 4096 bytes. When reading large PCAP files, the overhead of repeatedly issuing read() system calls to fill such a small buffer becomes noticeable. Zeek now uses a default buffer size of 128 KB, significantly reducing the number of read() system calls required to process a large PCAP file. You can configure the buffer size via the option PCAP::bufsize_offline_bytes. Run zeek -r file.pcap PCAP::bufsize_offline_bytes=4096 to determine the impact this makes on your own system. This change only applies to PCAP file reading.
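The underlying mechanism can be sketched with setvbuf(), which replaces a stream’s default stdio buffer. The helper name and signature here are ours, for illustration only:

```cpp
#include <cstdio>
#include <vector>

// Sketch: enlarge a FILE*'s stdio buffer before the first read, so each
// underlying read() system call fetches far more than the 4 KB default.
// The 128 KB default mirrors Zeek's new offline PCAP reading behavior.
// The caller must keep `buf` alive for as long as the stream is open.
bool open_with_large_buffer(const char* path, FILE*& f, std::vector<char>& buf,
                            std::size_t bufsize = 128 * 1024) {
    f = std::fopen(path, "rb");
    if ( ! f )
        return false;
    buf.resize(bufsize);
    // setvbuf() must be called before any other operation on the stream.
    return std::setvbuf(f, buf.data(), _IOFBF, buf.size()) == 0;
}
```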
References
Packet Analysis: Removal of shared_ptr copying (6.2)
The packet analysis framework’s Lookup() method used to return a std::shared_ptr instance of a packet analyzer for every network layer found in a packet (eth, vlan, IP, UDP/TCP, tunnel analyzers). Creating and destroying std::shared_ptr instances is expensive due to thread-safety guarantees and the use of atomic instructions. Doing so multiple times per packet only exacerbates the problem. Worse, Zeek did not need to do this work in the first place, since the packet analyzer instance pointed to exists over the lifetime of the Zeek process. We switched to returning a const reference to the std::shared_ptr instead. (We could possibly have used raw pointers directly, too.)
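The difference can be sketched as follows; this simplified Dispatcher is illustrative, not Zeek’s actual class:

```cpp
#include <memory>

struct Analyzer { };  // stand-in for a packet analyzer

// Sketch of the change: returning a shared_ptr by value bumps the atomic
// reference count on every call, while returning a const reference to the
// stored shared_ptr involves no refcount traffic at all. The reference is
// safe here because the analyzer lives for the lifetime of the process.
class DispatcherSketch {
public:
    DispatcherSketch() : analyzer_(std::make_shared<Analyzer>()) {}

    // Old style: copies the shared_ptr (atomic increment + later decrement).
    std::shared_ptr<Analyzer> LookupByValue() const { return analyzer_; }

    // New style: no copy, no atomic operations.
    const std::shared_ptr<Analyzer>& LookupByRef() const { return analyzer_; }

private:
    std::shared_ptr<Analyzer> analyzer_;
};
```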
References
- About avoiding std::shared_ptr on the packet path discussion
- Fix Dispatcher::Lookup shared_ptr copy + micro optimizations
Removal of Segment Profiling (6.2)
This one is embarrassing. Until version 6.2, Zeek instantiated a SegmentProfiler object (a historic feature to record user and system time) in certain very hot execution paths. Even with segment profiling disabled, creating the SegmentProfiler instance on the stack would zero-initialize 144 bytes of the contained struct rusage instance, resulting in significant and non-productive overhead for packet and event processing.
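The cost pattern looks roughly like this; the struct below is a simplified stand-in for the removed SegmentProfiler, not its actual definition:

```cpp
#include <sys/resource.h>

// Sketch of the problem: value-initializing the rusage member forces the
// compiler to zero the entire struct (144 bytes on x86-64 Linux) on every
// construction, even when segment profiling is disabled and the recorded
// times are never read.
struct SegmentProfilerSketch {
    SegmentProfilerSketch() : res{} {}  // zeroes sizeof(struct rusage) bytes
    struct rusage res;
};
```

Done once, this is harmless; done per packet or per event in a hot path, the memory zeroing adds up.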
References
- Perf fixes around FieldOffset(), EventMgr’s flare and SegmentProfiler
- segment_profiling: Remove SegmentProfiler and load_sample event
Removal of event manager’s flare (6.2)
Zeek’s main loop wakes up every time a packet becomes available, an IO source needs serving, or a timeout of an IO source expires. Zeek’s event manager operates as an IO source, too. Previously, when Zeek queued the first event during a loop iteration, a write() system call to a pipe marked the event manager ready. Subsequently, a read() from the pipe cleared readiness.
The way the event manager is integrated, however, removes the need to use a pipe. IO sources support indicating readiness via a timeout to the main loop. The event manager now communicates a timeout of 0.0 to the loop whenever events are available, indicating immediate readiness. No need for system calls here.
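The timeout-based readiness signal can be sketched as follows (illustrative names and timeout convention, not Zeek’s actual IO-source interface):

```cpp
#include <deque>

// Sketch: instead of writing to a wake-up pipe when the first event is
// queued, the event manager reports a timeout of 0.0 to the main loop
// while its queue is non-empty, which the loop treats as "service me
// immediately". A negative value means "no timeout to contribute".
// No write()/read() system calls are involved.
class EventMgrSketch {
public:
    void Enqueue(int ev) { queue_.push_back(ev); }

    double GetNextTimeout() const { return queue_.empty() ? -1.0 : 0.0; }

    void Drain() { queue_.clear(); }

private:
    std::deque<int> queue_;  // stand-in for the event queue
};
```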
References
Remove snprintf() for hex printing (6.2)
To format digest hashes or MAC address values, Zeek used snprintf("%02x"). While not overly hot, these places were visible in perf top output on production systems. For example, hashing X509 certificate files or formatting a connection’s endpoint MAC address involves hex formatting. Calling snprintf() to convert a single byte to its hex representation is not cheap. These places now use zeek::util::bytetohex(), which internally indexes a 16-byte hex lookup table directly and requires no parsing of a format string.
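Table-driven hex formatting looks roughly like this; the helper names below are ours, written in the spirit of zeek::util::bytetohex() rather than copied from it:

```cpp
#include <cstddef>
#include <string>

// Sketch: index a 16-entry table twice per byte instead of parsing a
// "%02x" format string on every call.
inline void byte_to_hex(unsigned char b, char out[2]) {
    static constexpr char tbl[] = "0123456789abcdef";
    out[0] = tbl[b >> 4];    // high nibble
    out[1] = tbl[b & 0x0f];  // low nibble
}

inline std::string to_hex(const unsigned char* data, std::size_t len) {
    std::string s(len * 2, '\0');
    for ( std::size_t i = 0; i < len; ++i )
        byte_to_hex(data[i], &s[i * 2]);
    return s;
}
```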
References
String built-in function improvements (6.2)
Functions for string handling like ends_with() or starts_with() often copied the input string or used CheckString() repeatedly. We improved these by leveraging std::string_view where possible.
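An allocation-free suffix check in the spirit of the improved built-ins can be sketched as follows (the function name is ours):

```cpp
#include <string_view>

// Sketch: std::string_view carries only a pointer and a length, so the
// suffix check involves no copy of the input string -- just a bounds
// check and a byte comparison over the tail.
inline bool ends_with_sv(std::string_view s, std::string_view suffix) {
    return s.size() >= suffix.size() &&
           s.compare(s.size() - suffix.size(), suffix.size(), suffix) == 0;
}
```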
References
Spicy Performance Improvements (6.2)
Version 1.10 of Spicy (bundled with the upcoming Zeek 6.2 release) comes with numerous improvements to the runtime throughput of generated parsers. We revisited both the code typically generated for parsers and the runtime libraries it uses. Coarsely summarized, this work centered around:
- reduction of allocations during parsing
- reduction of data copies during parsing
- use of dedicated, hand-checked implementations for automatically generated code to avoid overhead from safety checks in the runtime libraries
With these changes we see throughput improvements for some parsers in the range of 20-30%.
References
Tools and Methods
perf and Firefox Profiler
On Linux, we find many of these issues using the perf tooling and visualizing the results in the Firefox Profiler. This works particularly well when code is compiled with frame pointers enabled throughout, including external dependencies like libpcap or jemalloc. We record a profile of a running Zeek process or a one-off command using perf record -g. A subsequent invocation of perf script then converts the data into a format compatible with the Firefox Profiler:
$ perf record -g zeek -C -b -r 200k-syn-packets.pcap
$ perf script -F+pid > /tmp/200k-syn-packets.profile
Loading 200k-syn-packets.profile into an instance of the Firefox Profiler allows us to dig into the profile:
perf record --call-graph dwarf
With normal frame pointers, functions defined in C++ class headers are not visible in the flamegraph because they are inlined. For these cases, we find it useful to compile Zeek with DWARF debugging information and record a profile using perf record --call-graph dwarf. This also improves the visibility of functions from std::map, std::string, std::shared_ptr, etc.
To demonstrate the difference, we focus on the Analyzer::Analyzer() constructor below. Without DWARF, the result looks suspicious, but does not point at anything specific:
When using --call-graph dwarf, there is a lot more detail: roughly 75% of the samples in the Analyzer constructor fall into the ComponentManager::Lookup() method and its subsequent std::map lookups. There is now a clearer picture: we normalize an analyzer’s name to all upper-case and look up the analyzer’s tag in a std::map. This is not productive: an analyzer’s tag is constant, and we should be able to cache the lookup result. The conversion from const char* to std::string for the Lookup() call is additionally visible when hovering over one of the bars further to the left.
perf annotate
In addition to flamegraphs, the output of perf annotate can be useful. It is rather low-level, however, showing individual instructions within methods. At the time, it helped find clues about the std::shared_ptr overhead: drilling into the Dispatcher::Lookup() method showed ~72% of its samples in a lock addl instruction.
$ perf annotate zeek::packet_analysis::Dispatcher::Lookup
0.17 : 913d8f: mov %rcx,%rax
2.50 : 913d92: ret
0.00 : 913d93: nopl 0x0(%rax,%rax,1)
: 72 __gnu_cxx::__atomic_add(int volatile*, int):
: 71 { __atomic_fetch_add(__mem, __val, __ATOMIC_ACQ_REL); }
71.56 : 913d98: lock addl $0x1,0x8(%rdx)
0.00 : 913d9d: jmp 913d8f <zeek::packet_analysis::Dispatcher::Lookup(unsigned int) const+0x5f>
hyperfine
Lastly, to compare the PCAP processing runtime of two different Zeek versions, hyperfine is a great benchmarking tool. Primarily, it avoids custom shell scripting for timing command execution and computing statistics. To compare Zeek 6.0.3 with a recent master build, one example invocation is:
$ hyperfine -L zeek /opt/zeek-6.0.3/bin/zeek,/opt/zeek-dev-prod/bin/zeek 'taskset -c 1 {zeek} -C -b -r 200k-syn-packets.pcap'
Benchmark 1: taskset -c 1 /opt/zeek-6.0.3/bin/zeek -C -b -r 200k-syn-packets.pcap
Time (mean ± σ): 2.908 s ± 0.015 s [User: 2.303 s, System: 0.603 s]
Range (min … max): 2.886 s … 2.942 s 10 runs
Benchmark 2: taskset -c 1 /opt/zeek-dev-prod/bin/zeek -C -b -r 200k-syn-packets.pcap
Time (mean ± σ): 2.575 s ± 0.015 s [User: 1.994 s, System: 0.579 s]
Range (min … max): 2.557 s … 2.607 s 10 runs
Summary
'taskset -c 1 /opt/zeek-dev-prod/bin/zeek -C -b -r 200k-syn-packets.pcap' ran
1.13 ± 0.01 times faster than 'taskset -c 1 /opt/zeek-6.0.3/bin/zeek -C -b -r 200k-syn-packets.pcap'
This indicates that processing a PCAP file of just SYN packets in bare mode is 1.13x faster with the latest version compared to Zeek 6.0.
Summary
Compared to prior versions, Zeek 6.2 and later should provide you with a nice boost in performance, possibly allowing you to reduce required infrastructure.
If such changes excite you and you are technical enough to profile and hack on Zeek, we would love to see or hear where else in Zeek we can squeeze out more cycles. Pull requests would be amazing, of course.
Further, if you are interested in doing this kind of work as your day job, we are currently looking to grow the open source team at Corelight.
Please reach out if you have questions or input!