Performance Issues after the fe7e1ee commit?

We are running into performance issues (30x slower) since the Broker patch (fe7e1ee) –

We have 40G connections tapped from our storage filers feeding multiple Bro instances, which analyze only NFS and SMB traffic; all other analyzers are disabled. With the Broker patch we are seeing processing times for a ~1GB pcap jump from around 2 seconds to over 1 minute. Profiling Bro, it looks like the culprit is the new Actor functions –

Before patch

Overhead  Shared Object  Symbol
  14.57%  [kernel]       [k] copy_user_enhanced_fast_string
   3.20%  bro            [.] EventHandler::operator bool
   2.99%  bro            [.] _siphash
   2.89%  bro            [.] Dictionary::Lookup

After patch

Overhead  Shared Object          Symbol
   5.71%  [kernel]               [k] native_write_msr_safe
   3.84%  libcaf_core.so.0.15.7  [.] caf::scheduler::worker<caf::policy::work_stealing>::run
   3.71%  libcaf_core.so.0.15.7  [.] caf::detail::double_ended_queue<caf::resumable>::take_head
   3.29%  [kernel]               [k] _raw_spin_lock

Is the Bro development team still optimizing the Broker/Actor framework? It might be helpful to have a way to disable Broker for those of us who haven’t migrated to it yet.

# ~1GB file time (old)
$ time /hostname/bro-devel/bin/bro -r 20180606-1049-prodfilers-truncated_00000_20180606104904.pcap master.bro

real 0m2.294s
user 0m1.862s
sys 0m0.385s

# ~1GB file time (new)
$ time /hostname/bro-devel/bin/bro -r 20180606-1049-prodfilers-truncated_00000_20180606104904.pcap master.bro

real 1m11.458s
user 0m58.933s
sys 1m34.074s

Thanks!

–Tim

Well, I can at least reproduce that here on one of the standard test pcap files, with no scripts loaded:

$ /usr/local/bin/bro --version
/usr/local/bin/bro version 2.5.2
$ time /usr/local/bin/bro -b -r ~/src/bro/testing/external/bro-testing/2009-M57-day11-18.trace

real 0m2.317s
user 0m1.874s
sys 0m0.327s

$ . ~/src/bro/build/bro-path-dev.sh
$ bro --version
bro version 2.5-648
$ time bro -b -r ~/src/bro/testing/external/bro-testing/2009-M57-day11-18.trace

real 1m5.523s
user 0m30.565s
sys 0m32.661s

and the same thing in perf:

Samples: 270K of event 'cycles:u', Event count (approx.): 131984875188
Overhead Command Shared Object Symbol
   8.95% bro libcaf_core.so.0.15.7 [.] caf::detail::double_ended_queue<caf::resumable>::take_tail
   5.90% bro libcaf_core.so.0.15.7 [.] caf::detail::double_ended_queue<caf::resumable>::take_head
   4.92% bro libcaf_core.so.0.15.7 [.] caf::scheduler::worker<caf::policy::work_stealing>::run
   2.67% bro libc-2.17.so [.] __GI___libc_nanosleep
   1.98% bro libtcmalloc.so.4.4.5 [.] operator new[]
   1.79% bro libc-2.17.so [.] __sleep
   1.68% bro libtcmalloc.so.4.4.5 [.] tc_deletearray_nothrow
   1.47% bro libc-2.17.so [.] __libc_disable_asynccancel
   1.28% bro libstdc++.so.6.0.19 [.] std::this_thread::__sleep_for
   1.10% bro libc-2.17.so [.] usleep
   1.05% bro bro [.] ones_complement_checksum

I haven't noticed a huge difference on a real multi-process cluster, at least not a 30x change, so it's odd that pcap processing is so much slower.
Especially since broker should be completely disabled when pcap files are being read and CAF shouldn't even be doing anything.

Is the Bro development team still optimizing the Broker/Actor framework?

Yes, in the sense that optimizations will be done according to
feedback. No, in the sense that no one was actively looking into it
at the moment you asked.

It might be helpful to have a way to disable Broker for those of us who haven’t migrated to it yet.

That's unlikely at this point, as the scripts have now been completely
ported to Broker and all usages of the old communication system have
been removed from them. There's no simple switch back to it, so your
feedback is especially important/helpful.

# ~1GB file time (old)
$ time /hostname/bro-devel/bin/bro -r 20180606-1049-prodfilers-truncated_00000_20180606104904.pcap master.bro

real 0m2.294s
user 0m1.862s
sys 0m0.385s

# ~1GB file time (new)
$ time /hostname/bro-devel/bin/bro -r 20180606-1049-prodfilers-truncated_00000_20180606104904.pcap master.bro

real 1m11.458s
user 0m58.933s
sys 1m34.074s

Try pulling in the change I just did at [1]; it addresses what was a
big part of the problem in my own test:

# 2.5.3
$ time bro -r 2009-M57-day11-18.trace

real 0m16.187s
user 0m16.312s
sys 0m1.865s

# master before [1]
$ time bro -r ../testing/external/bro-testing/Traces/2009-M57-day11-18.trace

real 1m31.434s
user 1m44.925s
sys 1m4.947s

# master after [1]
$ time bro -r 2009-M57-day11-18.trace

real 0m24.595s
user 0m25.574s
sys 0m5.816s

- Jon

[1] Improve Broker performance (bro/bro@9822fc2)

It's not that simple.

You could think of broker as being enabled in this case of simply
reading a pcap, because Bro was querying the library for whether there
were any peers (and that was actually at least part of the perf.
problem).

Beyond that, you can still think of broker as being enabled when
you're reading a pcap if you're also using data stores (which the
Known* scripts in Bro now do by default). Communication with a local
master data store is still asynchronous communication (with local
threads/actors) that ends up going through CAF messaging. There's
also essentially a per-packet synchronization being done between Bro
and Broker/CAF now to ensure that this type of asynchronous workload
ends up producing the same results from one Bro run to the next. So
there's now also that bit of extra overhead in pcap processing.
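
To make that concrete, here's a rough sketch of what that per-packet
synchronization amounts to (PacketSource, Packet,
flush_pending_store_queries(), and dispatch_analyzers() are made-up
placeholders, not Bro's actual packet loop, and advance_time() just
stands in for Broker's per-packet clock update):

    // Sketch only: every name below except broker::endpoint is a placeholder.
    void process_pcap(PacketSource& src, broker::endpoint& ep) {
      Packet pkt;
      while (src.next(pkt)) {
        ep.advance_time(pkt.timestamp);  // push the packet's time into Broker/CAF
        flush_pending_store_queries();   // wait for outstanding data store replies
        dispatch_analyzers(pkt);         // regular protocol analysis
      }
    }

That per-packet call into the Broker/CAF layer is the extra
pcap-processing overhead mentioned above.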

- Jon

Hmm, I'm still seeing much larger runtimes on that M57 trace using our
test configuration, even with yesterday's change:

; Master, pre-Broker
# zcat 2009-M57-day11-18.trace.gz | time bro -r - tests/m57-long.bro
73.72user 7.90system 1:06.53elapsed 122%CPU (0avgtext+0avgdata 199092maxresident)

; Current master
# zcat 2009-M57-day11-18.trace.gz | time bro -r - tests/m57-long.bro
2191.59user 1606.69system 12:39.92elapsed 499%CPU (0avgtext+0avgdata 228400maxresident)

Bro must still be blocking somewhere when reading from trace files.

Robin

Thanks. If you pull the master changes [1] again, there's likely some improvement.

You can also try comparing the timing of:

# zcat 2009-M57-day11-18.trace.gz | time bro -r - tests/m57-long.bro
Known::use_host_store=F Known::use_service_store=F
Known::use_cert_store=F

With that, I get the same timings as pre-Broker. At least a good chunk
of the difference when using data stores is that, for every query, Bro
will immediately block until getting a response back from the data
store thread/actor. Not doing this type of synchronous data store
access when reading pcaps leads to the possibility of results
differing between runs (and I recall such differences being likely,
from experience running the unit test suite).
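
For illustration, the blocking pattern looks roughly like this
(async_get_fn and lookup_blocking are made-up names, not Bro/Broker's
real API):

    #include <functional>
    #include <future>
    #include <string>

    // Sketch only: fire an asynchronous request at the store actor, then
    // immediately wait for its reply, so every run sees the same answer at
    // the same point in the packet stream.
    using async_get_fn =
        std::function<void(const std::string&, std::function<void(std::string)>)>;

    std::string lookup_blocking(async_get_fn async_get, const std::string& key) {
      std::promise<std::string> reply;
      auto result = reply.get_future();
      // The store actor fulfills the promise from its own thread when done.
      async_get(key, [&reply](std::string value) {
        reply.set_value(std::move(value));
      });
      return result.get();  // the main loop stops here until the actor answers
    }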

- Jon

[1] https://github.com/bro/broker/commit/9b56fea4999d4e11a5cd2caaafd934759015fab5

Thanks. If you pull the master changes [1] again, there's likely some improvement.

Yeah, a little bit, not much though.

# zcat 2009-M57-day11-18.trace.gz | time bro -r - tests/m57-long.bro
Known::use_host_store=F Known::use_service_store=F
Known::use_cert_store=F

That indeed gets it way down, though still not back to the same level
on my box:

170.49user 58.14system 1:55.94elapsed 197%CPU

(pre-master: 73.72user 7.90system 1:06.53elapsed 122%CPU)

Are there more places where Bro's waiting for Broker in pcap mode?

With that, I get the same timings as pre-Broker. At least
a good chunk of the difference when using data stores is that, for
every query, Bro will immediately block until getting a response back
from the data store thread/actor.

Yeah, I remember that discussion. It's the trade-off between
performance and consistency. Where's the code that's doing this
blocking? Would it be possible to not block right away, but sync up
with Broker when events are flushed the next time? (Like we had
discussed before as a general mechanism for making async operations
consistent)

Robin

> # zcat 2009-M57-day11-18.trace.gz | time bro -r - tests/m57-long.bro
> Known::use_host_store=F Known::use_service_store=F
> Known::use_cert_store=F

That indeed gets it way down, though still not back to the same level
on my box:

170.49user 58.14system 1:55.94elapsed 197%CPU

(pre-master: 73.72user 7.90system 1:06.53elapsed 122%CPU)

Just double-checking: same configure/build flags were used between the two?

Here are the more precise results I got from running on a macOS and a Linux system:

# Linux master (b51e6f3) --build-type=debug
$ export BROPATH=$BROPATH:../testing/external/scripts; time bro -r 2009-M57-day11-18.trace test-all-policy testing-setup
real 0m32.572s
user 0m40.926s
sys 0m7.873s

# Linux master (b51e6f3) --build-type=debug
$ export BROPATH=$BROPATH:../testing/external/scripts; time bro -r 2009-M57-day11-18.trace test-all-policy testing-setup Known::use_host_store=F Known::use_cert_store=F Known::use_service_store=F
real 0m25.603s
user 0m23.311s
sys 0m7.952s

# Linux pre-broker (7a6f502) --build-type=debug
$ export BROPATH=$BROPATH:../testing/external/scripts; time bro -r 2009-M57-day11-18.trace test-all-policy testing-setup
real 0m24.785s
user 0m21.230s
sys 0m8.341s

# macOS master (b51e6f3) --build-type=debug
$ export BROPATH=$BROPATH:../testing/external/scripts; time bro -r 2009-M57-day11-18.trace test-all-policy testing-setup.bro
real 0m38.775s
user 0m42.365s
sys 0m8.950s

# macOS master (b51e6f3) --build-type=debug
$ export BROPATH=$BROPATH:../testing/external/scripts; time bro -r 2009-M57-day11-18.trace test-all-policy testing-setup.bro Known::use_host_store=F Known::use_cert_store=F Known::use_service_store=F
real 0m32.975s
user 0m33.774s
sys 0m4.409s

# macOS pre-broker (7a6f502) --build-type=debug
$ export BROPATH=$BROPATH:../testing/external/scripts; time bro -r 2009-M57-day11-18.trace test-all-policy testing-setup.bro
real 0m30.785s
user 0m31.840s
sys 0m3.788s

Are there more places where Bro's waiting for Broker in pcap mode?

Not that I can think of.

Yeah, I remember that discussion. It's the trade-off between
performance and consistency. Where's the code that's doing this
blocking?

I just know that Manager::AdvanceTime() and also
Manager::TrackStoreQuery() -> FlushPendingQueries() can block/spin
waiting on actors/threads in pcap mode.
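
Conceptually, that spin/wait amounts to something like this (a sketch
with made-up names, not the actual implementation):

    #include <atomic>
    #include <chrono>
    #include <cstddef>
    #include <thread>

    // Sketch only: the main thread yields until the store actor has answered
    // every outstanding query. If the OS doesn't schedule the actor's thread
    // promptly, this loop is where the wall-clock time goes.
    void flush_pending_queries(const std::atomic<std::size_t>& pending) {
      while (pending.load(std::memory_order_acquire) > 0)
        std::this_thread::sleep_for(std::chrono::microseconds(10));
    }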

Would it be possible to not block right away, but sync up
with Broker when events are flushed the next time? (Like we had
discussed before as a general mechanism for making async operations
consistent)

I think the way to make an async operation most consistent is to model
it as a synchronous operation? That's what I'm doing now, at least,
and if I try moving FlushPendingQueries() to later (before the
AdvanceTime() call) in an attempt to possibly have more queries in the
queue/buffer, I get failures in the external test suites. At least on
my own test systems, I don't have much performance to recover by going
this route anyway, so maybe we need to dig more into why your results
are different. The only thing I can think of is that your test system
maybe does a poorer job of scheduling the right thread to run in order
for the FlushPendingQueries() spin-loop to actually finish.

- Jon

Actually, I realized you still had bad timings with the data stores
off, so it's more likely a problem with the AdvanceTime() code path.
There's some mutex locking going on there, but without data stores
involved there shouldn't be anyone competing for them.

- Jon

Ok, I'll dig around a bit more and also double-check that I didn't
change any settings in the meantime.

Robin

Hi Jon,

Thanks for looking into this! I tried this again after pulling from master and at least for our case it is looking good!

### old
# /hostname/bro/bin/bro --version
/hostname/bro/bin/bro version 2.5-519
# time /hostname/bro/bin/bro -r 20180606-1242-prodfilers.pcap master.bro

real 0m32.772s
user 0m28.821s
sys 0m3.873s

### new
# /hostname/bro-devel/bin/bro --version
/hostname/bro-devel/bin/bro version 2.5-658
# time /hostname/bro-devel/bin/bro -r 20180606-1242-prodfilers.pcap master.bro

real 0m34.684s
user 0m31.502s
sys 0m4.266s

Thanks again!
--Tim

Confirmed that I'm still seeing that difference even when using the
options to turn off the stores. Tried it on two different Fedora 28
systems, with similar results.
systems, with similar results.

I haven't been able to nail down what's going on, though. The
AdvanceTime() method does seem to do a lot of locking in pcap mode,
independent of whether there are actually any data store operations.
However, I tried a quick hack to cut that down, and it didn't change
anything.

I then ran it through oprofile. Output is attached. There's indeed
some locking showing up high in there, but I can't tell if that's just
expected idling in CAF. I am a bit surprised to see a number of
std::chrono() methods, which I would expect to be cheap, showing up
rather prominently.

Robin

report (107 KB)

I had previously tried on a Fedora 28 VM with 4 CPUs and didn't
reproduce it, but trying again on a system with 32 CPUs, I can
reproduce your results. This has led to the fix/workaround at [1],
now in master, which should bring things back to baseline when not
using data stores, and to more sane timings when using them.

General changes/improvements in CAF itself may be warranted here
because I can recreate the same performance overhead if I artificially
remove all broker functionality from Bro, but then simply create and
never use a caf::actor_system via the default configuration.
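
A minimal sketch of that reproduction, assuming CAF 0.15's default
work-stealing scheduler (the busy loop is just a stand-in for Bro's
single-threaded pcap processing):

    #include <caf/actor_system.hpp>
    #include <caf/actor_system_config.hpp>

    int main() {
      // Constructing an actor_system with the default (work-stealing)
      // scheduler starts a pool of worker threads right away.
      caf::actor_system_config cfg;
      caf::actor_system sys{cfg};

      // No actors are ever spawned and `sys` is never used, yet on a
      // many-core box the idle workers still poll/sleep, similar to the
      // sleep/usleep and lock symbols in the profiles above.
      volatile unsigned long sum = 0;
      for (unsigned long i = 0; i < 2000000000UL; ++i)
        sum += i;  // stand-in for single-threaded packet processing
      return 0;
    }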

- Jon

[1] https://github.com/bro/bro/commit/c9fe9a943c4d78b18ffbae13c562b93349a5f951

This has led to the fix/workaround at [1], now in master, which

Cool, that indeed solved it! It also helps significantly when data
stores *are* being used; that still takes about 2x the time, but
that's much less than before. Now I'm wondering if we can get that
back down to normal, too ...

One question about Broker's endpoint::advance_time(): that's locking
on each call when in pcap mode, i.e., once per packet. Could we limit
that to cases where something actually needs to be done? Quick idea:
use an atomic<> for current_time plus another atomic<> counter
tracking whether there are any pending messages, and go into the
locked case only if the latter is non-zero?
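
Roughly something like this (just a sketch with made-up names, not
Broker's actual endpoint code):

    #include <atomic>
    #include <cstddef>
    #include <mutex>

    // Sketch only: publish the clock through an atomic and take the mutex
    // only when something pending could actually be released by the new time.
    class clock_state {
    public:
      void advance_time(double t) {
        current_time_.store(t, std::memory_order_release);  // lock-free fast path
        if (pending_.load(std::memory_order_acquire) == 0)
          return;  // nothing is waiting on the clock
        std::lock_guard<std::mutex> guard(mtx_);
        // ... slow path: dispatch whatever became ready at time t ...
      }

      void add_pending()    { pending_.fetch_add(1, std::memory_order_acq_rel); }
      void remove_pending() { pending_.fetch_sub(1, std::memory_order_acq_rel); }

    private:
      std::atomic<double> current_time_{0.0};
      std::atomic<std::size_t> pending_{0};
      std::mutex mtx_;
    };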

General changes/improvements in CAF itself may be warranted here

Yeah, sounds like it.

Robin

One question about Broker's endpoint::advance_time(): that's locking
on each call when in pcap mode, i.e., once per packet. Could we limit
that to cases where something actually needs to be done? Quick idea:
use an atomic<> for current_time plus another atomic<> counter
tracking whether there are any pending messages, and go into the
locked case only if the latter is non-zero?

Yes, I've done that in master, but didn't measure a notable difference.

> General changes/improvements in CAF itself may be warranted here

Made an issue for that here:

- Jon