Performance Enhancements

Hi all:

One item of particular interest to me from BroCon was this tidbit from Packetsled’s lightning talk:

“Optimizing core loops (like net_run() ) with preprocessor branch prediction macros likely() and unlikely() for ~3% speedup. We optimize for maximum load.”

After conversing with Leo Linsky of Packetsled, I wanted to initiate a conversation about performance improvements that may be within fairly easy reach:

  1. Obviously, branch prediction, as mentioned above. 3% speedup for (almost) free is nothing to sneeze at.

  2. Profiling bro to identify other hot spots that could benefit from optimization.

  3. Best practices for compiling Bro (compiler options, etc.)

  4. Data structure revisit (hash functions, perhaps?)

etc.
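For anyone unfamiliar with the macros mentioned in the talk: likely()/unlikely() are conventionally thin wrappers around GCC/Clang's __builtin_expect, which tells the compiler which side of a branch to lay out as the straight-line fall-through path. A minimal sketch (the macro names follow the Linux-kernel convention; process_packets is just an illustrative hot loop, not Bro code):

```cpp
// likely()/unlikely() in the usual kernel style: hint the compiler about
// which way a branch almost always goes, falling back to a no-op on
// compilers without __builtin_expect.
#if defined(__GNUC__) || defined(__clang__)
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)
#else
#define likely(x)   (x)
#define unlikely(x) (x)
#endif

// Illustrative hot loop: annotate the rare branch so the compiler keeps
// the common path as straight-line code and jumps out for the rare case.
int process_packets(int n)
    {
    int bad = 0;
    for ( int i = 0; i < n; ++i )
        {
        if ( unlikely(i % 10000 == 0) )  // e.g. a rare bookkeeping path
            ++bad;
        }
    return bad;
    }
```

The win comes from keeping the common path free of taken branches, which is presumably where the quoted ~3% figure comes from.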

Perhaps the Bro core team is working on some, all, or a lot more in this area. It might be nice to get the Bro community involved too. Is anyone else interested?

Jim Mellander

ESNet

Jon Siwek was optimizing the main event loop last February, but I believe it could only go so far without the new Broker API being integrated. Also, I believe there is a need to move off of the select() function. Anyway, there is definitely a lot of optimization that could be done there.

Howdy:

Not sure about the content of the BroCon talk … but a few years back, I did a bit of work on this. There was a plugin here:

https://github.com/cubic1271/bro-plugin-instrumentation

that allowed me to profile the execution of various bro scripts and figure out what was eating the most time. It also added a pretty braindead mechanism to expose script variables through a REST interface, which I wrapped in an HTML5 UI to get some real-time statistics … though I have no idea where that code went.

I also threw together this:

https://github.com/cubic1271/pybrig

which was intended to benchmark Bro on a specific platform, the idea being to get results that were relatively consistent. It could make some pretty pictures, which was kind of neat … but I’d probably do things a lot differently if I had it to do over again :-)

I’ll note that one of the challenges with profiling is that there are the bro scripts, and then there is the bro engine. The scripting layer has a completely different set of optimizations that might make sense than the engine does: turning off / turning on / tweaking different scripts can have a huge impact on Bro’s relative performance depending on the frequency with which those script fragments are executed. Thus, one way to look at speeding things up might be to take a look at the scripts that are run most often and seeing about ways to accelerate core pieces of them … possibly by moving pieces of those scripts to builtins (as C methods).

If I had to guess at one engine-related thing that would’ve sped things up when I was profiling this stuff back in the day, it’d probably be rebuilding the memory allocation strategy / management. From what I remember, Bro does do some malloc / free in the data path, which hurts quite a bit when one is trying to make things go fast. It also means that the selection of a memory allocator and NUMA / per-node memory management is going to be important. That’s probably not going to qualify as something small, though …

On a related note, a fun experiment is always to try running bro with a different allocator and seeing what happens …

Another thing that (I found) got me a few percentage points for more-or-less free was profile-guided optimization: I ran bro first with profiling enabled against a representative data set, then rebuilt it against the profile I collected. Of course, your mileage may vary …

Anyway, hope something in there was useful.

Cheers,

Gilbert Clark

I'll note that one of the challenges with profiling is that there are the bro scripts, and then there is the bro engine. The scripting layer has a completely different set of optimizations that might make sense than the engine does: turning off / turning on / tweaking different scripts can have a huge impact on Bro's relative performance depending on the frequency with which those script fragments are executed. Thus, one way to look at speeding things up might be to take a look at the scripts that are run most often and seeing about ways to accelerate core pieces of them ... possibly by moving pieces of those scripts to builtins (as C methods).

Re: scripts, I have some code I put together to do arbitrary benchmarks of templated bro scripts. I need to clean it up and publish it, but I found some interesting things. Function calls are relatively slow, so things like

    ip in Site::local_nets

is faster than calling

    Site::is_local_addr(ip);

Inlining short functions could speed things up a bit.

I also found that things like

    port == 22/tcp || port == 3389/tcp

is faster than checking "port in { 22/tcp, 3389/tcp }", up to about 10 ports. Having the hash class fall back to a linear search when the hash contains only a few items could speed things up there. Things like 'likely_server_ports' have 1 or 2 ports in most cases.
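The linear-fallback idea could look something like this sketch - a stand-in container, not Bro's actual hash class, with the crossover threshold of 10 taken from the observation above:

```cpp
// Sketch: a set of ports that stays in a flat vector (linear scan) while
// tiny, and migrates to a hash set only once it grows past a threshold.
#include <cstdint>
#include <unordered_set>
#include <vector>

class SmallPortSet {
public:
    static constexpr std::size_t kLinearMax = 10;  // crossover point

    void Insert(uint16_t port)
        {
        if ( ! big.empty() )
            {
            big.insert(port);
            return;
            }

        for ( uint16_t p : small )
            if ( p == port )
                return;

        small.push_back(port);

        if ( small.size() > kLinearMax )
            {
            // Grew past the threshold: migrate to the hash set.
            big.insert(small.begin(), small.end());
            small.clear();
            }
        }

    bool Contains(uint16_t port) const
        {
        if ( ! big.empty() )
            return big.count(port) != 0;

        for ( uint16_t p : small )  // cheap for the 1-2 entry common case
            if ( p == port )
                return true;

        return false;
        }

private:
    std::vector<uint16_t> small;       // used while the set is tiny
    std::unordered_set<uint16_t> big;  // used once it grows
};
```

For a 'likely_server_ports'-style set with one or two entries, every lookup is then just a couple of compares with no hashing at all.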

If I had to guess at one engine-related thing that would've sped things up when I was profiling this stuff back in the day, it'd probably be rebuilding the memory allocation strategy / management. From what I remember, Bro does do some malloc / free in the data path, which hurts quite a bit when one is trying to make things go fast. It also means that the selection of a memory allocator and NUMA / per-node memory management is going to be important. That's probably not going to qualify as something *small*, though ...

Ah! This reminds me of something I was thinking about a few weeks ago. I'm not sure to what extent bro uses memory allocation pools or interning for common immutable data structures, like port objects or small strings. There's no reason bro should be mallocing/freeing memory to create port objects when there are only 65536 × 2 (or 3?) possible port objects... but bro does things like

        tcp_hdr->Assign(0, new PortVal(ntohs(tp->th_sport), TRANSPORT_TCP));
        tcp_hdr->Assign(1, new PortVal(ntohs(tp->th_dport), TRANSPORT_TCP));

for every packet, as well as allocating a ton of TYPE_COUNT vals for things like packet sizes and header lengths, which will almost always be between 0 and 64K.
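A hedged sketch of what interning those port values could look like - PortVal and TransportProto here are simplified stand-ins for Bro's classes, just to show the shape of the idea:

```cpp
// Sketch: preallocate one immutable value per (port, proto) pair and hand
// out pointers, so the per-packet path never calls new/delete for ports.
#include <cstdint>

enum TransportProto { TRANSPORT_TCP = 0, TRANSPORT_UDP = 1 };

struct PortVal {
    uint16_t port;
    TransportProto proto;
};

class PortTable {
public:
    PortTable()
        {
        // Fill in all 2 x 65536 possible values once, up front.
        for ( int proto = 0; proto < 2; ++proto )
            for ( uint32_t p = 0; p <= 65535; ++p )
                {
                table[proto][p].port = static_cast<uint16_t>(p);
                table[proto][p].proto = static_cast<TransportProto>(proto);
                }
        }

    // Always returns the same object for the same (port, proto):
    // no allocation in the packet path.
    const PortVal* Get(uint16_t port, TransportProto proto) const
        { return &table[proto][port]; }

private:
    PortVal table[2][65536];
};
```

Since the entries are immutable and never freed, the per-packet Assign() calls above could hand out pointers into the table instead of calling new each time.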

For things that can't be interned, like IPv6 addresses, having an allocation pool could speed things up... Instead of freeing things like IPAddr objects, they could just be returned to a pool, and then when a new IPAddr object is needed, an already initialized object could be grabbed from the pool and 'refreshed' with the new value.

https://golang.org/pkg/sync/#Pool

Talks about that sort of thing.
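In C++ terms, the simplest version of that is a free-list pool - a sketch using a stand-in IPAddr, not Bro's real class:

```cpp
// Sketch of a free-list pool in the spirit of Go's sync.Pool: Put() returns
// objects for reuse instead of deleting them, and Get() reuses one if
// available before falling back to the allocator.
#include <cstdint>
#include <vector>

struct IPAddr {
    uint8_t bytes[16];  // enough room for an IPv6 address
};

class IPAddrPool {
public:
    ~IPAddrPool()
        {
        for ( IPAddr* a : free_list )
            delete a;
        }

    // Reuse a previously released object if one is available.
    IPAddr* Get()
        {
        if ( free_list.empty() )
            return new IPAddr;

        IPAddr* a = free_list.back();
        free_list.pop_back();
        return a;
        }

    // Instead of delete, hand the object back for reuse.
    void Put(IPAddr* a)
        { free_list.push_back(a); }

private:
    std::vector<IPAddr*> free_list;
};
```

The caller "refreshes" the returned object with the new address value; the malloc/free pair in the data path becomes a vector push/pop in the steady state.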

On a related note, a fun experiment is always to try running bro with a different allocator and seeing what happens ...

I recently noticed our boxes were using jemalloc instead of tcmalloc. Switching that caused malloc to drop a few places down in 'perf top' output.

I particularly like the idea of an allocation pool in which per-packet information can be stored and reused by the next packet.

There also are probably some optimizations of frequent operations, now that we’re in a 64-bit world, that could prove useful - the one’s complement checksum calculation in net_util.cc is one that comes to mind, especially since it works effectively a byte at a time (and works with even byte counts only). Seeing as this is done per-packet on all tcp payload, optimizing this seems reasonable. Here’s a discussion of doing the checksum calc in 64-bit arithmetic: https://locklessinc.com/articles/tcp_checksum/ - this website also has an x64 allocator that is claimed to be faster than tcmalloc, see: https://locklessinc.com/benchmarks_allocator.shtml (note: I haven’t tried anything from this source, but find it interesting).

I’m guessing there are a number of such “small” optimizations that could provide significant performance gains.

Take care,

Jim

I've been messing around with 'perf top'; the one's complement function often shows up fairly high, along with PriorityQueue::BubbleDown and BaseList::remove.

Something (on our configuration?) is doing a lot of PQ_TimerMgr::~PQ_TimerMgr... I don't think I've come across that class before in bro. I think a script may be triggering something that is hurting performance, but I can't think of what it would be.

Running perf top on a random worker right now with -F 19999 shows:

Samples: 485K of event 'cycles', Event count (approx.): 26046568975
Overhead Shared Object Symbol
  34.64% bro [.] BaseList::remove
   3.32% libtcmalloc.so.4.2.6 [.] operator delete
   3.25% bro [.] PriorityQueue::BubbleDown
   2.31% bro [.] BaseList::remove_nth
   2.05% libtcmalloc.so.4.2.6 [.] operator new
   1.90% bro [.] Attributes::FindAttr
   1.41% bro [.] Dictionary::NextEntry
   1.27% libc-2.17.so [.] __memcpy_ssse3_back
   0.97% bro [.] StmtList::Exec
   0.87% bro [.] Dictionary::Lookup
   0.85% bro [.] NameExpr::Eval
   0.84% bro [.] BroFunc::Call
   0.80% libtcmalloc.so.4.2.6 [.] tc_free
   0.77% libtcmalloc.so.4.2.6 [.] operator delete[]
   0.70% bro [.] ones_complement_checksum
   0.60% libtcmalloc.so.4.2.6 [.] tcmalloc::ThreadCache::ReleaseToCentralCache
   0.60% bro [.] RecordVal::RecordVal
   0.53% bro [.] UnaryExpr::Eval
   0.51% bro [.] ExprStmt::Exec
   0.51% bro [.] iosource::Manager::FindSoonest
   0.50% libtcmalloc.so.4.2.6 [.] operator new[]

Which sums up to 59.2%

BaseList::remove and BaseList::remove_nth seem particularly easy to optimize. Can't that loop be replaced by a memmove?
I think something may be broken if it's being called that much, though.
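The memmove version could look like this sketch (a stand-in modeled on the BaseList API, not the actual source):

```cpp
// Sketch: remove the n-th entry from a BaseList-style array by shifting the
// tail down with a single memmove instead of a per-element copy loop.
#include <cstring>

typedef void* ent;

struct BaseListSketch {
    ent* entry;
    int num_entries;

    ent remove_nth(int n)
        {
        ent old = entry[n];

        // One bulk move of the tail, instead of copying element by element.
        memmove(&entry[n], &entry[n + 1],
                sizeof(ent) * (num_entries - n - 1));

        entry[--num_entries] = 0;
        return old;
        }
};
```

remove() would still need its linear search to find the element first, which is the part the perf output suggests is actually hurting.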

Interesting info. The more-than-order-of-magnitude difference in time between BaseList::remove & BaseList::remove_nth suggests the possibility that the for loop in BaseList::remove is falling off the end in many cases (i.e. attempting to remove an item that doesn’t exist). Maybe that’s what’s broken.

I particularly like the idea of an allocation pool in which per-packet information can be stored and reused by the next packet.

Turns out bro does this most of the time - unless you use the new_packet event. Normal connections use the sessions cache, which holds connection objects, but new_packet has its own code path that creates the IP header from scratch for each packet. I tried to pre-allocate PortVal objects, but I think I was screwing something up with 'Ref' and bro would just segfault on the 2nd connection.

There also are probably some optimizations of frequent operations, now that we're in a 64-bit world, that could prove useful - the one's complement checksum calculation in net_util.cc is one that comes to mind, especially since it works effectively a byte at a time (and works with even byte counts only). Seeing as this is done per-packet on all tcp payload, optimizing this seems reasonable. Here's a discussion of doing the checksum calc in 64-bit arithmetic: https://locklessinc.com/articles/tcp_checksum/ - this website also has an x64 allocator that is claimed to be faster than tcmalloc, see: https://locklessinc.com/benchmarks_allocator.shtml (note: I haven't tried anything from this source, but find it interesting).

I couldn't get this code to return the right checksums inside bro (some casting issue?), but if it is faster it should increase performance by a small percentage. Comparing 'bro -b' runs on a pcap with 'bro -b -C' runs (which should show what kind of performance increase we would get if that function took 0s to run) shows a decent chunk of time taken computing checksums.

I'm guessing there are a number of such "small" optimizations that could provide significant performance gains.

I've been trying to figure out the best way to profile bro. So far attempting to use linux perf, or google perftools hasn't been able to shed much light on anything. I think the approach I was using to benchmark certain operations in the bro language is the better approach.

Instead of running bro and trying to profile it to figure out what is causing the most load, simply compare the execution of two bro runs with slightly different scripts/settings. I think this will end up being the better approach because it answers real questions like "If I load this script or change this setting what is the performance impact on the bro process". When I did this last I used this method to compare the performance from one bro commit to the next, but I never tried comparing bro with one set of scripts loaded to bro with a different set of scripts loaded.

For example, the simplest and most dramatic test I came up with so far:

    $ time bro -r 2009-M57-day11-18.trace -b
    real 0m2.434s
    user 0m2.236s
    sys 0m0.200s

    $ cat np.bro
    event new_packet(c: connection, p: pkt_hdr)
        {
        }

    $ time bro -r 2009-M57-day11-18.trace -b np.bro
    real 0m10.588s
    user 0m10.392s
    sys 0m0.204s

We've been saying for a while that adding that event is expensive, but I don't know if it's even been quantified.

The main thing I still need to figure out is how to do this type of test in a cluster environment while replaying a long pcap.

Somewhat related, came across this presentation yesterday:

https://www.youtube.com/watch?v=NH1Tta7purM&feature=youtu.be

CppCon 2017: Carl Cook “When a Microsecond Is an Eternity: High Performance Trading Systems in C++”

Among other things, he mentions using a memory pool for objects instead of creating/deleting them.

Well, I found pathological behavior with BaseList::remove

Instrumenting it with a printf of num_entries & i after the for loop, running against a test pcap then summarizing with awk gives:

Count, num_entries, i

1 3583 3536
1 3584 3537
1 3623 3542
1 3624 3543
1 3628 3620
1 3629 3621
1 3636 3562
1 3636 3625
1 3637 3563
1 3637 3626
1 3644 3576
1 3644 3641
1 3645 3577
1 3645 3642
1 3647 3641
1 3648 3642
1 3650 3629
1 3651 3630
1 3658 3647
1 3659 3648
1 3663 3655
1 3664 3656
1 3673 3629
1 3674 3630
1 3697 3686
1 3698 3687
1 3981 3595
1 3982 3596
1 4372 3978
1 4373 3979
1 4374 3656
1 4374 4371
1 4375 3657
1 4375 4372
1 4554 4371
1 4555 4372
1 4571 4367
1 4571 4551
1 4572 4368
1 4572 4552
1 4968 4566
1 4969 4567
1 5058 4566
1 5059 4567
1 5160 4963
1 5161 4964
1 5258 5157
1 5259 5158
1 5342 4566
1 5343 4567
1 5353 5253
1 5354 5254
1 5356 3638
1 5356 5337
1 5356 5350
1 5356 5351
1 5356 5353
1 5357 3639
1 5357 5338
1 5357 5351
1 5357 5352
1 5357 5354
1 5367 5351
1 5368 5352
1 5369 4556
1 5370 4557
1 5374 3675
1 5374 5366
1 5375 3676
1 5375 5367
1 5379 3664
1 5379 5045
1 5380 3665
1 5380 5046
1 5384 3601
1 5385 3602
1 5386 5354
1 5387 5355
1 5392 5370
1 5393 5371
1 5404 5363
1 5404 5381
1 5405 5364
1 5405 5382
1 5408 5341
1 5408 5368
1 5408 5399
1 5409 5342
1 5409 5369
1 5409 5400
1 5413 5401
1 5413 5403
1 5414 5402
1 5414 5404
1 5416 5408
1 5417 5409
1 5429 5395
1 5430 5396
1 5439 5381
1 5439 5406
1 5440 5382
1 5440 5407
1 5460 5436
1 5461 5437
1 5463 5407
1 5464 5408
1 5465 5397
1 5465 5460
1 5466 5398
1 5466 5461
1 5474 5359
1 5474 5451
1 5474 5456
1 5474 5471
1 5475 5360
1 5475 5452
1 5475 5457
1 5475 5472
1 5479 5456
1 5479 5476
1 5480 5457
1 5480 5477
1 5481 5416
1 5482 5417
1 5493 5426
1 5493 5474
1 5493 5488
1 5494 5427
1 5494 5475
1 5494 5489
1 5497 5357
1 5497 5367
1 5497 5461
1 5497 5462
1 5497 5480
1 5497 5488
1 5498 5358
1 5498 5368
1 5498 5462
1 5498 5463
1 5498 5481
1 5498 5489
1 5499 3682
1 5499 5460
1 5499 5476
1 5499 5478
1 5499 5480
1 5500 3683
1 5500 5461
1 5500 5477
1 5500 5479
1 5500 5481
2 3612 3609
2 3613 3610
2 3689 3686
2 3690 3687
2 3697 3694
2 3698 3695
2 5374 5371
2 5375 5372
2 5384 5381
2 5385 5382
2 5463 5460
2 5464 5461
2 5493 5465
2 5494 5466
2 5497 5484
2 5498 5485
2 5499 5482
2 5499 5488
2 5499 5490
2 5499 5492
2 5499 5494
2 5500 5483
2 5500 5489
2 5500 5491
2 5500 5493
2 5500 5495
3 4571 4568
3 4572 4569
3 5493 5490
3 5494 5491
3 5497 5490
3 5497 5492
3 5498 5491
3 5498 5493
3 5499 5486
3 5500 5487
4 3647 3644
4 3648 3645
5 5379 5376
5 5380 5377
7 5499 5496
7 5500 5497
10 5497 5494
10 5498 5495
26 3 2
5861 3 1
13714 4 4
14130 4 1
34914 4 0
74299 3 3
1518194 2 1
2648755 2 2
8166358 3 0
13019625 2 0
62953139 0 0
71512938 1 1
104294506 1 0

there are 286 instances where the list has over 3000 entries, and the desired entry is near the end… That linear search has got to be killing performance, even though it’s uncommon :-(

The case of num_entries being 0 can be optimized a bit, but is relatively minor.

Now, I’ll see if I can identify the offending List

Jim

If you look in one of the subdirectories or another, in ages past there was a little shell script to incrementally execute bro against a specific trace, loading one script at a time to see the effect each of them had on the overall runtime. I can't remember what it was called offhand, but it was useful for quick and dirty testing.

And yes, function calls in bro script land are pretty horrific in terms of overhead. Line per line, bro script in general isn't terribly efficient just because it doesn't do any of the things a modern interpreter might (e.g. just-in-time compilation). That's not a criticism, it's only a note - most folks rely on horizontal scaling to deal with faster ingress, which I think makes a whole lot of sense.

From what my little microbenchmarks have been showing me, bro scripts are mostly fast, but there are some operations that may be slower than they should be, like for loops or set/table lookups on small or empty tables.

Just in the event it's useful, I've attached some profiling I did on the script function overhead with the crap I wrote: these are some graphs on which script functions call which other script functions, how many times that happens, and how much time is spent in each function.

avggraph is the average time spent per-call, and resgraph is the aggregate time spent in each function across the entire run. The numbers' formatting needed some fixing, but never made it that far ...

Something like this from an hour's worth of bro runtime would be neat. One potential issue is that if a function is called a lot, it could either mean it's something that needs to be optimized so it is faster, or it could mean it's something that needs to be refactored so it's not called so much.

I know Robin et al. were working on different approaches for next-generation scripting kind of stuff, but haven't kept up well enough to really know where those are.

http://www.icir.org/hilti/ I believe.

One thing I played around with was using plugin hooks to integrate other scripting languages into the bro fast path (luajit was my weapon of choice) and seeing if conversion from bro script to one of those other languages might improve the run time. Other languages would still be less efficient than C, and anything garbage collected would need to be *really* carefully used, but ... it struck me as an idea that might be worth a look :-)

I'm not sure how hard it would be to write a transpiler for bro scripts and convert them completely to something like lua. Other than maybe ip address and subnets as data types, I think they overlap fairly well.

And yeah, generally speaking, most of the toolkits I've played with for software-based packet processing absolutely do use memory pools for the fast path. They also use burst fetch tricks (to amortize the cost of fetching packets over X packets, rather than fetching one packet at a time), and I've also seen quite a bit of prefetch / SIMD to try to keep things moving quickly once the packets make it to the CPU.

Things start to get pretty crazy as packet rates increase, though: once you hit about 10 - 15 Gbps, even a *cache miss* on a modern system is enough to force a drop ...

Data rate is just one part of it: the number of packets per second and new sessions per second have a huge impact as well. Handling 10Gbps of a few concurrent connections is easy, but 10Gbps of DNS will not work so well.

DNS is one analyzer/script that I think could benefit from being ported to C++. The script doesn't do much, and there aren't many scripts out there that want the lower-level DNS events. Having the analyzer merge the request/reply directly and bypass the scripts entirely could boost performance by quite a bit.

A for loop over an empty table/set causes the "0 0" entries. Something related to the "cookies" it uses for iteration.

Not sure what causes the "1 1" cases.

What I also find interesting are the "1 0" entries. I wonder how many of those cases are followed up by the list itself being destroyed.

Something is allocating and then destroying 104,294,506 lists that only ever have a single item in them. That's a lot of work for nothing.
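For lists like that, a small-buffer optimization might help: keep the first entry or two inline in the list object so singleton lists never touch the allocator. A rough sketch (again a stand-in, not the real BaseList; kInline = 2 is a guess based on the counts above):

```cpp
// Sketch: a list with inline storage for its first kInline entries. Lists
// that never grow past that never call malloc; larger ones spill to heap.
#include <cstdlib>
#include <cstring>

typedef void* ent;

class InlineList {
public:
    InlineList() : entry(inline_buf), max(kInline), num(0) { }

    ~InlineList()
        {
        if ( entry != inline_buf )
            free(entry);
        }

    void append(ent e)
        {
        if ( num == max )
            grow();
        entry[num++] = e;
        }

    int length() const { return num; }
    ent operator[](int i) const { return entry[i]; }

private:
    static const int kInline = 2;  // covers the 1-2 entry common case

    void grow()
        {
        // First growth copies the inline entries out to the heap.
        int new_max = max * 2;
        ent* bigger = (ent*) malloc(sizeof(ent) * new_max);
        memcpy(bigger, entry, sizeof(ent) * num);
        if ( entry != inline_buf )
            free(entry);
        entry = bigger;
        max = new_max;
        }

    ent inline_buf[kInline];
    ent* entry;
    int max, num;
};
```

With something like this, those 104 million single-item lists would cost one pointer store each instead of a malloc/free pair.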

So I still haven't gotten this to work, but I did some more tests that I think show it is worthwhile to look into replacing this function.

I generated a large pcap of a 3 minute iperf run:

    $ du -hs iperf.pcap
    9.6G iperf.pcap
    $ tcpdump -n -r iperf.pcap |wc -l
    reading from file iperf.pcap, link-type EN10MB (Ethernet)
    7497698

Then I ran either `bro -Cbr` or `bro -br` on it 5 times, tracking runtime as well as CPU instructions reported by `perf`:

    $ python2 bench.py 5 bro -Cbr iperf.pcap
    15.19 49947664388
    15.66 49947827678
    15.74 49947853306
    15.66 49949603644
    15.42 49951191958
    elapsed
    Min 15.18678689
    Max 15.7425909042
    Avg 15.5343231678
    
    instructions
    Min 49947664388
    Max 49951191958
    Avg 49948828194
    
    $ python2 bench.py 5 bro -br iperf.pcap
    20.82 95502327077
    21.31 95489729078
    20.52 95483242217
    21.45 95499193001
    21.32 95498830971
    elapsed
    Min 20.5184400082
    Max 21.4452238083
    Avg 21.083449173
    
    instructions
    Min 95483242217
    Max 95502327077
    Avg 95494664468

So this shows that for every ~7,500,000 packets bro processes, about 5.5 seconds is spent computing checksums.

According to https://locklessinc.com/articles/tcp_checksum/, they run their benchmark 2^24 times (16,777,216) which is about 2.2 times as many packets.

Their runtime starts out at about 11s, which puts it in line with the current implementation in bro. The other implementations they show are
between 7 and 10x faster depending on packet size. A 90% drop in time spent computing checksums would be a noticeable improvement.

Unfortunately I couldn't get their implementation to work inside of bro and get the right result, and even if I could, it's not clear what the license for the code is.

Yeah, the lockless implementation has a bug:

    if (size)

should be

    if (size & 1)

I ended up writing a checksum routine that sums 32 bits at a time into a 64-bit register, which avoids the need to check for overflow - it seems to be faster than the full 64-bit implementation - will test with Bro and report results.
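Something along the lines of this standalone sketch (my reconstruction of the idea, untested inside Bro; it assumes a little-endian host):

```cpp
// Sketch: one's complement checksum summing 32-bit chunks into a 64-bit
// accumulator. With 32 bits of headroom, no mid-loop carry check is needed
// for any realistic packet; carries are folded once at the end.
#include <cstdint>
#include <cstring>

uint16_t checksum64(const uint8_t* data, size_t len)
{
    uint64_t sum = 0;

    while ( len >= 4 )
        {
        uint32_t chunk;
        memcpy(&chunk, data, 4);  // unaligned-safe 32-bit load
        sum += chunk;
        data += 4;
        len -= 4;
        }

    if ( len >= 2 )
        {
        uint16_t chunk;
        memcpy(&chunk, data, 2);
        sum += chunk;
        data += 2;
        len -= 2;
        }

    if ( len )  // odd trailing byte, treated as padded with a zero byte
        sum += data[0];  // little-endian: lone byte is the low byte

    // Fold the 64-bit sum back down to 16 bits, adding carries in.
    while ( sum >> 16 )
        sum = (sum & 0xFFFF) + (sum >> 16);

    uint16_t result = static_cast<uint16_t>(sum);

    // The words were summed in host (little-endian) order; per RFC 1071's
    // byte-order independence, byte-swapping the folded sum yields the
    // network-order checksum. A big-endian host would skip this swap and
    // shift the odd trailing byte into the high half instead.
    result = static_cast<uint16_t>((result << 8) | (result >> 8));

    return static_cast<uint16_t>(~result);
}
```

The folding and final byte swap happen once per packet, so the inner loop is nothing but loads and adds.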