Performance Issues after the fe7e1ee commit?

We are running into performance issues (30x slower) since the Broker patch (fe7e1ee) –

We have 40G connections tapped from our storage filers feeding multiple Bro instances, which analyze only NFS and SMB traffic; all other analyzers are disabled. With the Broker patch we are seeing processing times for a ~1GB pcap jump from around 2 seconds to over 1 minute. Profiling Bro, it looks like the culprit is the new Actor functions –

Before patch

Overhead  Shared Object  Symbol
  14.57%  [kernel]       [k] copy_user_enhanced_fast_string
   3.20%  bro            [.] EventHandler::operator bool
   2.99%  bro            [.] _siphash
   2.89%  bro            [.] Dictionary::Lookup

After patch

Overhead  Shared Object          Symbol
   5.71%  [kernel]               [k] native_write_msr_safe
   3.84%  libcaf_core.so.0.15.7  [.] caf::scheduler::worker<caf::policy::work_stealing>::run
   3.71%  libcaf_core.so.0.15.7  [.] caf::detail::double_ended_queue<caf::resumable>::take_head
   3.29%  [kernel]               [k] _raw_spin_lock

Is the Bro development team still optimizing the Broker/Actor framework? It might be helpful to have a way to disable Broker for those of us who haven’t migrated to it yet.

# ~1GB file time (old)
$ time /hostname/bro-devel/bin/bro -r 20180606-1049-prodfilers-truncated_00000_20180606104904.pcap master.bro

real 0m2.294s
user 0m1.862s
sys 0m0.385s

# ~1GB file time (new)
$ time /hostname/bro-devel/bin/bro -r 20180606-1049-prodfilers-truncated_00000_20180606104904.pcap master.bro

real 1m11.458s
user 0m58.933s
sys 1m34.074s

Thanks!

–Tim

Well, I can at least reproduce that here on one of the standard test pcap files, with no scripts loaded:

$ /usr/local/bin/bro --version
/usr/local/bin/bro version 2.5.2
$ time /usr/local/bin/bro -b -r ~/src/bro/testing/external/bro-testing/2009-M57-day11-18.trace

real 0m2.317s
user 0m1.874s
sys 0m0.327s

$ . ~/src/bro/build/bro-path-dev.sh
$ bro --version
bro version 2.5-648
$ time bro -b -r ~/src/bro/testing/external/bro-testing/2009-M57-day11-18.trace

real 1m5.523s
user 0m30.565s
sys 0m32.661s

and the same thing in perf:

Samples: 270K of event 'cycles:u', Event count (approx.): 131984875188
Overhead Command Shared Object Symbol
   8.95% bro libcaf_core.so.0.15.7 [.] caf::detail::double_ended_queue<caf::resumable>::take_tail
   5.90% bro libcaf_core.so.0.15.7 [.] caf::detail::double_ended_queue<caf::resumable>::take_head
   4.92% bro libcaf_core.so.0.15.7 [.] caf::scheduler::worker<caf::policy::work_stealing>::run
   2.67% bro libc-2.17.so [.] __GI___libc_nanosleep
   1.98% bro libtcmalloc.so.4.4.5 [.] operator new[]
   1.79% bro libc-2.17.so [.] __sleep
   1.68% bro libtcmalloc.so.4.4.5 [.] tc_deletearray_nothrow
   1.47% bro libc-2.17.so [.] __libc_disable_asynccancel
   1.28% bro libstdc++.so.6.0.19 [.] std::this_thread::__sleep_for
   1.10% bro libc-2.17.so [.] usleep
   1.05% bro bro [.] ones_complement_checksum

I haven't noticed a huge difference on a real multi-process cluster, at least not a 30x change, so it's odd that pcap processing is so much slower.
Especially since broker should be completely disabled when pcap files are being read and CAF shouldn't even be doing anything.

Is the Bro development team still optimizing the Broker/Actor framework?

Yes, in the sense that optimizations will be done according to
feedback. No, in the sense that no one was actively looking into it
at the moment you asked.

It might be helpful to have a way to disable Broker for those of us who haven’t migrated to it yet.

That's unlikely at this point, as the scripts have now been completely
ported to Broker and all usages of the old communication system have
been removed from them. There's no simple switch back to it, so your
feedback is especially important/helpful.

# ~1GB file time (old)
$ time /hostname/bro-devel/bin/bro -r 20180606-1049-prodfilers-truncated_00000_20180606104904.pcap master.bro

real 0m2.294s
user 0m1.862s
sys 0m0.385s

# ~1GB file time (new)
$ time /hostname/bro-devel/bin/bro -r 20180606-1049-prodfilers-truncated_00000_20180606104904.pcap master.bro

real 1m11.458s
user 0m58.933s
sys 1m34.074s

Try pulling in the change I just did at [1]; it addresses what was a
big part of the problem in my own test:

# 2.5.3
$ time bro -r 2009-M57-day11-18.trace

real 0m16.187s
user 0m16.312s
sys 0m1.865s

# master before [1]
$ time bro -r ../testing/external/bro-testing/Traces/2009-M57-day11-18.trace

real 1m31.434s
user 1m44.925s
sys 1m4.947s

# master after [1]
$ time bro -r 2009-M57-day11-18.trace

real 0m24.595s
user 0m25.574s
sys 0m5.816s

- Jon

[1] Improve Broker performance (bro/bro@9822fc2)

It's not that simple.

You could think of broker as being enabled in this case of simply
reading a pcap, because Bro was querying the library for whether there
were any peers (and that was actually at least part of the perf.
problem).

Beyond that, you can still think of broker as being enabled when
you're reading a pcap if you're also using data stores (which the
Known* scripts in Bro now do by default). Communication with a local
master data store is still asynchronous communication (with local
threads/actors) that ends up going through CAF messaging. There's
also essentially a per-packet synchronization being done between Bro
and Broker/CAF now to ensure that this type of asynchronous workload
ends up producing the same results from one Bro run to the next. So
there's now also that bit of extra overhead in pcap processing.
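
To make that concrete, here's a rough sketch of what that per-packet
synchronization amounts to (PacketSource, Packet,
flush_pending_store_queries(), and dispatch_analyzers() are made-up
placeholders, not Bro's actual packet loop, and advance_time() just
stands in for Broker's per-packet clock update):

    // Sketch only: every name below except broker::endpoint is a placeholder.
    void process_pcap(PacketSource& src, broker::endpoint& ep) {
      Packet pkt;
      while (src.next(pkt)) {
        ep.advance_time(pkt.timestamp);  // push the packet's time into Broker/CAF
        flush_pending_store_queries();   // wait for outstanding data store replies
        dispatch_analyzers(pkt);         // regular protocol analysis
      }
    }

That per-packet call into the Broker/CAF layer is the extra
pcap-processing overhead mentioned above.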

- Jon

Hmm, I'm still seeing much larger runtimes on that M57 trace using our
test configuration, even with yesterday's change:

; Master, pre-Broker
# zcat 2009-M57-day11-18.trace.gz | time bro -r - tests/m57-long.bro
73.72user 7.90system 1:06.53elapsed 122%CPU (0avgtext+0avgdata 199092maxresident)

; Current master
# zcat 2009-M57-day11-18.trace.gz | time bro -r - tests/m57-long.bro
2191.59user 1606.69system 12:39.92elapsed 499%CPU (0avgtext+0avgdata 228400maxresident)

Bro must still be blocking somewhere when reading from trace files.

Robin

Thanks. If you pull the master changes [1] again, there's likely some improvement.

You can also try comparing the timing of:

# zcat 2009-M57-day11-18.trace.gz | time bro -r - tests/m57-long.bro
Known::use_host_store=F Known::use_service_store=F
Known::use_cert_store=F

With that, I get the same timings as pre-Broker. At least a good chunk
of the difference when using data stores is that, for every query, Bro
will immediately block until getting a response back from the data
store thread/actor. Not doing this type of synchronous data store
access when reading pcaps leads to the possibility of results
differing between runs (and I recall such differences being likely,
from experience running the unit test suite).
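
For illustration, the blocking pattern looks roughly like this
(async_get_fn and lookup_blocking are made-up names, not Bro/Broker's
real API):

    #include <functional>
    #include <future>
    #include <string>

    // Sketch only: fire an asynchronous request at the store actor, then
    // immediately wait for its reply, so every run sees the same answer at
    // the same point in the packet stream.
    using async_get_fn =
        std::function<void(const std::string&, std::function<void(std::string)>)>;

    std::string lookup_blocking(async_get_fn async_get, const std::string& key) {
      std::promise<std::string> reply;
      auto result = reply.get_future();
      // The store actor fulfills the promise from its own thread when done.
      async_get(key, [&reply](std::string value) {
        reply.set_value(std::move(value));
      });
      return result.get();  // the main loop stops here until the actor answers
    }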

- Jon

[1] https://github.com/bro/broker/commit/9b56fea4999d4e11a5cd2caaafd934759015fab5

Thanks. If you pull the master changes [1] again, there's likely some improvement.

Yeah, a little bit, not much though.

# zcat 2009-M57-day11-18.trace.gz | time bro -r - tests/m57-long.bro
Known::use_host_store=F Known::use_service_store=F
Known::use_cert_store=F

That indeed gets it way down, though still not back to the same level
on my box:

170.49user 58.14system 1:55.94elapsed 197%CPU

(pre-master: 73.72user 7.90system 1:06.53elapsed 122%CPU)

Are there more places where Bro's waiting for Broker in pcap mode?

With that, I get the same timings as pre-Broker. At least
a good chunk of the difference when using data stores is that, for
every query, Bro will immediately block until getting a response back
from the data store thread/actor.

Yeah, I remember that discussion. It's the trade-off between
performance and consistency. Where's the code that's doing this
blocking? Would it be possible to not block right away, but sync up
with Broker when events are flushed the next time? (Like we had
discussed before as a general mechanism for making async operations
consistent)

Robin

> # zcat 2009-M57-day11-18.trace.gz | time bro -r - tests/m57-long.bro
> Known::use_host_store=F Known::use_service_store=F
> Known::use_cert_store=F

That indeed gets it way down, though still not back to the same level
on my box:

170.49user 58.14system 1:55.94elapsed 197%CPU

(pre-master: 73.72user 7.90system 1:06.53elapsed 122%CPU)

Just double-checking: same configure/build flags were used between the two?

Here are the more precise results I got from running on a macOS and a Linux system:

# Linux master (b51e6f3) --build-type=debug
$ export BROPATH=$BROPATH:../testing/external/scripts; time bro -r 2009-M57-day11-18.trace test-all-policy testing-setup
real 0m32.572s
user 0m40.926s
sys 0m7.873s

# Linux master (b51e6f3) --build-type=debug
$ export BROPATH=$BROPATH:../testing/external/scripts; time bro -r 2009-M57-day11-18.trace test-all-policy testing-setup Known::use_host_store=F Known::use_cert_store=F Known::use_service_store=F
real 0m25.603s
user 0m23.311s
sys 0m7.952s

# Linux pre-broker (7a6f502) --build-type=debug
$ export BROPATH=$BROPATH:../testing/external/scripts; time bro -r 2009-M57-day11-18.trace test-all-policy testing-setup
real 0m24.785s
user 0m21.230s
sys 0m8.341s

# macOS master (b51e6f3) --build-type=debug
$ export BROPATH=$BROPATH:../testing/external/scripts; time bro -r 2009-M57-day11-18.trace test-all-policy testing-setup.bro
real 0m38.775s
user 0m42.365s
sys 0m8.950s

# macOS master (b51e6f3) --build-type=debug
$ export BROPATH=$BROPATH:../testing/external/scripts; time bro -r 2009-M57-day11-18.trace test-all-policy testing-setup.bro Known::use_host_store=F Known::use_cert_store=F Known::use_service_store=F
real 0m32.975s
user 0m33.774s
sys 0m4.409s

# macOS pre-broker (7a6f502) --build-type=debug
$ export BROPATH=$BROPATH:../testing/external/scripts; time bro -r 2009-M57-day11-18.trace test-all-policy testing-setup.bro
real 0m30.785s
user 0m31.840s
sys 0m3.788s

Are there more places where Bro's waiting for Broker in pcap mode?

Not that I can think of.

Yeah, I remember that discussion. It's the trade-off between
performance and consistency. Where's the code that's doing this
blocking?

I just know that Manager::AdvanceTime() and also
Manager::TrackStoreQuery() -> FlushPendingQueries() can block/spin
waiting on actors/threads in pcap mode.
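
Conceptually, that spin/wait amounts to something like this (a sketch
with made-up names, not the actual implementation):

    #include <atomic>
    #include <chrono>
    #include <cstddef>
    #include <thread>

    // Sketch only: the main thread yields until the store actor has answered
    // every outstanding query. If the OS doesn't schedule the actor's thread
    // promptly, this loop is where the wall-clock time goes.
    void flush_pending_queries(const std::atomic<std::size_t>& pending) {
      while (pending.load(std::memory_order_acquire) > 0)
        std::this_thread::sleep_for(std::chrono::microseconds(10));
    }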

Would it be possible to not block right away, but sync up
with Broker when events are flushed the next time? (Like we had
discussed before as a general mechanism for making async operations
consistent)

I think the way to make an async operation most consistent is to model
it as a synchronous operation? That's what I'm doing now, at least,
and if I try moving FlushPendingQueries() to later (before the
AdvanceTime() call) in an attempt to possibly have more queries in the
queue/buffer, I get failures in the external test suites. At least on
my own test systems, I don't have much performance to recover by going
this route anyway, so maybe we need to dig more into why your results
are different. The only thing I can think of is that your test system
maybe does a poorer job of scheduling the right thread to run in order
for the FlushPendingQueries() spin-loop to actually finish.

- Jon

Actually, I realized you still had bad timings with the data stores
off, so it's more likely a problem with the AdvanceTime() code path.
There's some mutex locking going on there, but without data stores
involved there shouldn't be anyone competing for them.

- Jon

Ok, I'll dig around a bit more and also double-check that I didn't
change any settings in the meantime.

Robin

Hi Jon,

Thanks for looking into this! I tried this again after pulling from master and at least for our case it is looking good!

### old
# /hostname/bro/bin/bro --version
/hostname/bro/bin/bro version 2.5-519
# time /hostname/bro/bin/bro -r 20180606-1242-prodfilers.pcap master.bro

real 0m32.772s
user 0m28.821s
sys 0m3.873s

### new
# /hostname/bro-devel/bin/bro --version
/hostname/bro-devel/bin/bro version 2.5-658
# time /hostname/bro-devel/bin/bro -r 20180606-1242-prodfilers.pcap master.bro

real 0m34.684s
user 0m31.502s
sys 0m4.266s

Thanks again!
--Tim

Confirmed that I'm still seeing that difference even when using the
options to turn off the stores. Tried it on two different Fedora 28
systems, with similar results.
systems, with similar results.

I haven't been able to nail down what's going on, though. The
AdvanceTime() method does seem to do a lot of locking in pcap mode,
independent of whether there are actually any data store operations.
However, I tried a quick hack to cut that down, and it didn't change
anything.

I then ran it through oprofile. Output is attached. There's indeed
some locking showing up high in there, but I can't tell if that's just
expected idling in CAF. I am a bit surprised to see a number of
std::chrono() methods, which I would expect to be cheap, showing up
rather prominently.

Robin

report (107 KB)

I had previously tried on a Fedora 28 VM with 4 CPUs and didn't
reproduce it, but trying again on a system with 32 CPUs, I can
reproduce your results. This has led to the fix/workaround at [1],
now in master, which should bring things back to baseline when not
using data stores, and to more sane timings when using them.

General changes/improvements in CAF itself may be warranted here
because I can recreate the same performance overhead if I artificially
remove all broker functionality from Bro, but then simply create and
never use a caf::actor_system via the default configuration.
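
A minimal sketch of that reproduction, assuming CAF 0.15's default
work-stealing scheduler (the busy loop is just a stand-in for Bro's
single-threaded pcap processing):

    #include <caf/actor_system.hpp>
    #include <caf/actor_system_config.hpp>

    int main() {
      // Constructing an actor_system with the default (work-stealing)
      // scheduler starts a pool of worker threads right away.
      caf::actor_system_config cfg;
      caf::actor_system sys{cfg};

      // No actors are ever spawned and `sys` is never used, yet on a
      // many-core box the idle workers still poll/sleep, similar to the
      // sleep/usleep and lock symbols in the profiles above.
      volatile unsigned long sum = 0;
      for (unsigned long i = 0; i < 2000000000UL; ++i)
        sum += i;  // stand-in for single-threaded packet processing
      return 0;
    }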

- Jon

[1] https://github.com/bro/bro/commit/c9fe9a943c4d78b18ffbae13c562b93349a5f951

This has led to the fix/workaround at [1], now in master, which

Cool, that indeed solved it! It also helps significantly when data
stores *are* being used; that still takes about 2x the time, but
that's much less than before. Now I'm wondering if we can get that
back down to normal, too ...

One question about Broker's endpoint::advance_time(): that's locking
on each call when in pcap mode, i.e., once per packet. Could we limit
that to cases where something actually needs to be done? Quick idea:
use an atomic<> for current_time plus another atomic<> counter
tracking whether there are any pending messages, and go into the
locked case only if the latter is non-zero?
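
Roughly something like this (just a sketch with made-up names, not
Broker's actual endpoint code):

    #include <atomic>
    #include <cstddef>
    #include <mutex>

    // Sketch only: publish the clock through an atomic and take the mutex
    // only when something pending could actually be released by the new time.
    class clock_state {
    public:
      void advance_time(double t) {
        current_time_.store(t, std::memory_order_release);  // lock-free fast path
        if (pending_.load(std::memory_order_acquire) == 0)
          return;  // nothing is waiting on the clock
        std::lock_guard<std::mutex> guard(mtx_);
        // ... slow path: dispatch whatever became ready at time t ...
      }

      void add_pending()    { pending_.fetch_add(1, std::memory_order_acq_rel); }
      void remove_pending() { pending_.fetch_sub(1, std::memory_order_acq_rel); }

    private:
      std::atomic<double> current_time_{0.0};
      std::atomic<std::size_t> pending_{0};
      std::mutex mtx_;
    };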

General changes/improvements in CAF itself may be warranted here

Yeah, sounds like it.

Robin

One question about Broker's endpoint::advance_time(): that's locking
on each call when in pcap mode, i.e., once per packet. Could we limit
that to cases where something actually needs to be done? Quick idea:
use an atomic<> for current_time plus another atomic<> counter
tracking whether there are any pending messages, and go into the
locked case only if the latter is non-zero?

Yes, I've done that in master, but didn't measure a notable difference.

> General changes/improvements in CAF itself may be warranted here

Made an issue for that here:

- Jon