I particularly like the idea of an allocation pool in which per-packet information can be stored and then reused by the next packet.
It turns out Bro does this most of the time, unless you use the new_packet event. Normal connections use the session cache, which holds connection objects, but new_packet has its own code path that builds the IP header from scratch for each packet. I tried to pre-allocate PortVal objects, but I think I was mishandling 'Ref' and Bro would just segfault on the second connection.
There are also probably optimizations of frequent operations, now that we're in a 64-bit world, that could prove useful. The one's complement checksum calculation in net_util.cc is one that comes to mind, especially since it works effectively a byte at a time (and only handles even byte counts). Seeing as this runs per-packet on all TCP payload, optimizing it seems reasonable. Here's a discussion of doing the checksum calculation in 64-bit arithmetic: https://locklessinc.com/articles/tcp_checksum/ - the same site also has an x64 allocator that is claimed to be faster than tcmalloc, see: https://locklessinc.com/benchmarks_allocator.shtml (note: I haven't tried anything from this source, but I find it interesting).
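For illustration, here's a plain-C++ sketch of the 64-bit-at-a-time idea from that article (hypothetical function names, not the actual net_util.cc code; the article itself goes further with unrolling and assembly). It folds 8 bytes per iteration with end-around carry, and handles odd byte counts by zero-padding the last byte, which RFC 1071 says is legal since the sum can be computed in any word size and folded at the end:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Sketch: one's complement Internet checksum, accumulated 64 bits at a time.
// Not Bro code; names are made up for this example.
inline uint16_t checksum64(const uint8_t* p, size_t len)
    {
    uint64_t sum = 0;

    // Main loop: add 8 bytes per iteration, propagating the
    // end-around carry whenever the 64-bit add overflows.
    while ( len >= 8 )
        {
        uint64_t word;
        std::memcpy(&word, p, 8);
        sum += word;
        if ( sum < word ) // carry out of the 64-bit add
            ++sum;
        p += 8;
        len -= 8;
        }

    // Reduce to <= 33 bits so the tail additions below cannot overflow.
    sum = (sum & 0xffffffffULL) + (sum >> 32);

    // Tail: remaining 16-bit words, then a final odd byte, zero-padded.
    while ( len >= 2 )
        {
        uint16_t w;
        std::memcpy(&w, p, 2);
        sum += w;
        p += 2;
        len -= 2;
        }
    if ( len )
        {
        const uint8_t tail[2] = { p[0], 0 };
        uint16_t w;
        std::memcpy(&w, tail, 2);
        sum += w;
        }

    // Fold 64 -> 16 bits (RFC 1071 folding property).
    while ( sum >> 16 )
        sum = (sum & 0xffff) + (sum >> 16);

    return static_cast<uint16_t>(~sum);
    }

// Naive 16-bit-at-a-time reference, useful for checking the fast path.
inline uint16_t checksum16(const uint8_t* p, size_t len)
    {
    uint32_t sum = 0;
    while ( len >= 2 )
        {
        uint16_t w;
        std::memcpy(&w, p, 2);
        sum += w;
        p += 2;
        len -= 2;
        }
    if ( len )
        {
        const uint8_t tail[2] = { p[0], 0 };
        uint16_t w;
        std::memcpy(&w, tail, 2);
        sum += w;
        }
    while ( sum >> 16 )
        sum = (sum & 0xffff) + (sum >> 16);
    return static_cast<uint16_t>(~sum);
    }
```

Comparing the two on random buffers is an easy sanity check before trying to wire something like this into net_util.cc.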
I couldn't get this code to return the right checksums inside Bro (some casting issue?), but if it is faster it should increase performance by a small percentage. Comparing 'bro -b' runs on a pcap against 'bro -b -C' runs (which should show the gain we'd get if that function took 0s to run) shows a decent chunk of time spent computing checksums.
I'm guessing there are a number of such "small" optimizations that could provide significant performance gains.
I've been trying to figure out the best way to profile Bro. So far, attempts to use Linux perf or Google perftools haven't shed much light on anything. I think the approach I was using to benchmark certain operations in the Bro language is the better one.
Instead of running Bro and trying to profile it to figure out what is causing the most load, simply compare the execution of two Bro runs with slightly different scripts or settings. I think this will end up being the better approach because it answers real questions like, "If I load this script or change this setting, what is the performance impact on the Bro process?" Last time, I used this method to compare performance from one Bro commit to the next, but I never tried comparing Bro with one set of scripts loaded against Bro with a different set loaded.
For example, the simplest and most dramatic test I came up with so far:
$ time bro -r 2009-M57-day11-18.trace -b
$ cat np.bro
event new_packet(c: connection, p: pkt_hdr)
	{
	}
$ time bro -r 2009-M57-day11-18.trace -b np.bro
We've been saying for a while that adding that event is expensive, but I don't know if its cost has ever been quantified.
The main thing I still need to figure out is how to do this type of test in a cluster environment while replaying a long pcap.
Somewhat related, I came across this presentation yesterday:
CppCon 2017: Carl Cook “When a Microsecond Is an Eternity: High Performance Trading Systems in C++”
Among other things, he mentions using a memory pool for objects instead of creating and deleting them for every use.
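The core of that idea can be sketched as a minimal free-list pool (a generic sketch, not Bro's data structures or anything from the talk; ObjectPool and its methods are made-up names). Released objects park their memory on a free list, so the next acquire reuses it instead of hitting the allocator - the same pattern that could back per-packet objects:

```cpp
#include <cstddef>
#include <new>
#include <utility>
#include <vector>

// Minimal fixed-type object pool: release() destroys the object but
// keeps its memory on a free list for the next acquire().
template <typename T>
class ObjectPool
    {
public:
    // Assumes all objects have been release()d before the pool dies.
    ~ObjectPool()
        {
        for ( void* p : blocks_ )
            ::operator delete(p);
        }

    template <typename... Args>
    T* acquire(Args&&... args)
        {
        void* mem;
        if ( free_list_ )
            {
            mem = free_list_; // reuse recycled memory
            free_list_ = free_list_->next;
            }
        else
            {
            // Slot must be big enough to hold either a T or a free-list node.
            constexpr size_t size = sizeof(T) > sizeof(Node) ? sizeof(T) : sizeof(Node);
            mem = ::operator new(size);
            blocks_.push_back(mem);
            }
        return new (mem) T(std::forward<Args>(args)...); // placement-new
        }

    void release(T* obj)
        {
        obj->~T(); // destroy the object, but keep the memory
        Node* n = static_cast<Node*>(static_cast<void*>(obj));
        n->next = free_list_;
        free_list_ = n;
        }

private:
    struct Node { Node* next; };
    Node* free_list_ = nullptr;
    std::vector<void*> blocks_; // every slot ever allocated, for cleanup
    };
```

After a release, the very next acquire hands back the same memory, so steady-state per-packet churn does no heap allocation at all.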