Compiling with --coverage flag

Greetings:

I’ve been tinkering with the --coverage flag to capture runtime statistics which can then be used to compile zeek with branch prediction hints. My preliminary tests indicate a substantial performance increase, enough to justify engaging the zeek community.

I noticed that the configure script includes --enable-coverage, which doesn’t quite do what I want, as it compiles with debug support, and I’m most interested in optimization for production use.

In brief, I’ve been testing:

CFLAGS='--coverage' CXXFLAGS='--coverage' ./configure

for the initial compile, then run against pcaps and live traffic, and use that profiling data to recompile:

CFLAGS='-fprofile-use -fprofile-correction -flto' CXXFLAGS='-fprofile-use -fprofile-correction -flto' ./configure

with a substantial performance boost over a regular compile (one can additionally pass --build-type=Release to compile with the -O3 flag).

Has anyone else tinkered with this? I’d be happy to elaborate and discuss with others.

Jim

Hi Jim,

interesting, could you send some numbers on the kind of improvements
you saw, and on what traffic?

Robin

Although I have quite a bit more work to do on this, I wrote a preliminary document, reproduced below:

Using Profile Guided Optimization with Zeek (draft)

Background

At BroCon 2017, Packetsled gave a lightning talk that included, among other items, information on strategically inserting likely() & unlikely() macros into (then) Bro source code, which reportedly boosted performance by 3%. This code was later released on GitHub as ‘Community Bro’, along with other modifications. For various reasons, detailed below, we did not use this strategy to test performance enhancements, electing instead to go with automatic code profiling, which yielded a performance increase greater than 14%.

Discussion

There are (at least) two strategies for enhancing runtime performance through compiler optimization of the code.

First, likely()/unlikely() macros can be manually inserted, typically in if statements, to signal the likelihood of the condition being met. This hinting gives the compiler the opportunity to arrange the generated assembly code so that branch mispredictions and cache misses are avoided in the most common cases. This capability is supported in the gcc & clang compilers (via the __builtin_expect builtin); for other compilers, the macros can be defined as no-ops.

The likely()/unlikely() macros are used extensively in the Linux kernel to (hopefully) improve efficiency. A 2008 blog post (https://bitsup.blogspot.com/2008/04/measuring-performance-of-linux-kernel.html) disputes the value of using these macros in the kernel, although both kernel & compiler technology have advanced since then. In any event, the macros have remained in the kernel. More information on them is available at: https://kernelnewbies.org/FAQ/LikelyUnlikely

The gcc documentation indicates the following: In general, you should prefer to use actual profile feedback for this (`-fprofile-arcs’), as programmers are notoriously bad at predicting how their programs actually perform. However, there are applications in which this data is hard to collect.

As the gcc documentation snippet above indicates, it is possible to automatically collect information on the branches taken during a test run with representative data, and use that information to compile production code with branch prediction built in. The test run is compiled to collect statistics on each branch and function call. Of course, the overhead of such a test run is significant; however, the data gathered is extremely valuable. Fortunately, performing these steps is relatively painless, as detailed below.

Instrumenting zeek

Step 1: Perform a baseline run in standalone mode using a default compile (./configure --build-type=Release), against a test pcap (I used a ~150 GB pcap cobbled together from public domain sources), capturing run times. (--build-type=Release compiles with -O3 optimization.)
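
For example, a baseline timing run of this kind might look like the following (the pcap path is a placeholder; in the 2.6.x tree the installed binary is still named bro):

time bro -r /data/test.pcap local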

Step 2: An instrumented version of zeek is compiled, using:

CFLAGS='--coverage' CXXFLAGS='--coverage' ./configure --build-type=Release

make

make install

  • (The --coverage flag to CFLAGS/CXXFLAGS includes -fprofile-arcs, as described above, as well as other data capture options.) The instrumented zeek can be run against the same pcap, or, if desired, against live network traffic in standalone mode, or both; example invocations follow this list. Running against live traffic will exercise the networking code. Multiple runs can be made against various sources, so both pcaps and live network traffic can be used, as the profiling code will update the profiling files if they already exist in the source tree.

  • Standalone mode is more convenient, as otherwise cluster nodes on the same physical box will overwrite each other’s profile data. This could possibly be overcome by passing environment variables to each cluster node (gcc’s coverage runtime honors GCOV_PREFIX & GCOV_PREFIX_STRIP), specifying a different location for each node’s profiling data; see the example after this list. I didn’t go down this road, as:

  • Communication overhead probably dwarfs any potential savings in branch prediction.

  • Communication is likely only a small fraction of total zeek CPU anyway.

  • Custom code would need to be written to merge profiling data from multiple sources.

  • The *.gcno files are written at compile time; when the instrumented zeek is stopped, it writes the corresponding *.gcda profile data next to them in the build directory of the source tree. It is probably a good idea to make a backup copy of the source tree, in case of problems with the following steps.
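
As an illustration, the profiling runs mentioned above might look like the following (the pcap path and interface name are placeholders):

bro -r /data/test.pcap local

bro -i eth0 local

If per-node profile locations were wanted in a cluster, something along these lines could redirect a node’s *.gcda output via the GCOV_PREFIX & GCOV_PREFIX_STRIP environment variables (the directory and strip count shown are arbitrary examples; GCOV_PREFIX_STRIP drops that many leading components from the compiled-in path before prepending GCOV_PREFIX):

GCOV_PREFIX=/data/profiles/worker-1 GCOV_PREFIX_STRIP=3 bro -i eth0 local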

Step 3: Zeek can then be recompiled to take advantage of this profiling information:

cd bro-2.6.x (top level of zeek source directory)

tar cvf gc.tar `find . -name '*.gc*'` (tarball of the *.gcno & *.gcda files)

make distclean (clear all vestiges of prior build)

CFLAGS='-fprofile-use -fprofile-correction -flto' CXXFLAGS='-fprofile-use -fprofile-correction -flto' ./configure --build-type=Release

tar xvf gc.tar (restore profiling information into build tree)

make

make install

Run the newly compiled zeek against the initial test pcap & capture run times.

  • -fprofile-correction is needed because the profiling code is not thread-safe, while zeek is a multi-threaded process. This corrupts the profiling data for a few functions, which gcc adjusts via a heuristic.

  • If recompiling zeek in a different source tree location than the original, you can expect many warnings about the source tree location changing; however, the optimization will still be applied.

  • If an object module wasn’t executed during the profiling run, a warning message will be output during compilation (“gcda not found, execution counts estimated”). If this is of concern to you, try to find a test case that will exercise that module. On the other hand, it may be that the particular module is rarely used in practice at your site, so optimization will provide minimal benefit anyway.

Link time optimization

An optimization included above is Link Time Optimization (LTO), invoked by adding -flto to CFLAGS & CXXFLAGS when running configure. It uses information from all compiled object modules during the final link of the binary, which provides additional runtime performance improvements due to improved locality of reference and possible inlining of functions across modules (the documentation is a bit opaque, but the above is what I gleaned from it).

Compiling for Native

By using -march=native, the compiler will generate machine code tailored to the processor it is currently running on. This produces the best possible code for that chipset, but the resulting binary will likely not run on older chipsets. -mtune=native will “tune” the optimized code to run best on the current chipset while still remaining backwards compatible with older chipsets.
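
By analogy with the -march=native configure lines in the timing tests below, a tuned-but-still-compatible build could be configured along these lines (not something measured here):

CFLAGS='-mtune=native' CXXFLAGS='-mtune=native' ./configure --build-type=Release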

Tests indicate that the original profiling compile (with --coverage) needs to use the same -march flag as the compile that uses the coverage data; otherwise warnings of the form “does not match its profile data (counter 'arcs') [-Werror=coverage-mismatch]” will occur. These can be suppressed with -Wno-error=coverage-mismatch, but they may indicate that the profiling data is not suitable.
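
If the mismatch is known to be benign, one way to proceed would be to append the suppression flag to the Step 3 flags, e.g.:

CFLAGS='-fprofile-use -fprofile-correction -flto -Wno-error=coverage-mismatch' CXXFLAGS='-fprofile-use -fprofile-correction -flto -Wno-error=coverage-mismatch' ./configure --build-type=Release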

The performance increase from this flag is unimpressive (nil to negative), probably due to the structure of the zeek code: most recent CPU feature additions appear geared toward numerical and/or video processing.

Compiler Versions

Although the tests have all been made with gcc 4.8.5, more recent versions may provide further optimizations.

Additionally, both clang & icc provide similar optimization capabilities. Further tests are indicated to determine the best compiler and options.

Timing tests

Against ~150GB test file, user time in seconds, average of 5 runs

Bro 2.6.2 standalone

Default local.bro policy

Xeon CPU E5-2687W v2 @ 3.40GHz - 32 core

gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-36)

./configure

time 2508.27 (default)

CFLAGS='-march=native' CXXFLAGS='-march=native' ./configure

time 2530.98 (worse!?)

./configure --build-type=Release

time 2464.38 (Release compiles with -O3)

CFLAGS='-march=native' CXXFLAGS='-march=native' ./configure --build-type=Release

time 2492.1 (worse than Release)

CFLAGS='-fprofile-use -fprofile-correction -flto' CXXFLAGS='-fprofile-use -fprofile-correction -flto' ./configure --build-type=Release

time 2221.58 (profile run against pcaps in the distribution only)

CFLAGS='-fprofile-use -fprofile-correction -flto' CXXFLAGS='-fprofile-use -fprofile-correction -flto' ./configure --build-type=Release

time 2103.88 (profile run against 2 days of ESNet border traffic)

(over 14% speedup over Release, over 16% speedup over default compile)