Final Broker branch testing

The latest version of the new Broker-ized cluster/communication system for Bro in the 'topic/actor-system' branch is wrapping up and, in my opinion, ready to be merged into Bro's 'master' branch.

However, since it's such a big change, I'd like a last round of feedback before merging. If you want to test, the build process should now be as simple as:

$ git clone --recursive --branch=topic/actor-system git://git.bro.org/bro
$ cd bro && ./configure && make

Configuring BroControl is not any different from before.

If you had custom scripts, they may require porting. There's a guide and examples for that at [1] and [2] (hyperlinks in those docs will render more nicely when it's up on bro.org).

Though, for this round of testing, I'd be most interested just in any general stability issues or major feature breakages from a vanilla Bro installation. Mild performance issues, minor bugs, or other issues w/ porting custom scripts are things I think we can iron out even after merging into 'master'.

- Jon

[1] https://github.com/bro/bro/blob/topic/actor-system/doc/frameworks/broker.rst
[2] https://github.com/bro/bro/tree/topic/actor-system/doc/frameworks/broker

Trying this I noticed a few things (ordered by urgency from my point of view).

With this change, Bro can no longer be compiled out of the box on RedHat/CentOS 7. Since that is the latest release of RedHat and is probably used in production by quite a few people, a potentially significant number of people might not be able to (easily) compile Bro with this merge.

It aborts in configure, with:

-- Performing Test cxx11_header_works - Success
CMake Error at aux/broker/CMakeLists.txt:4 (cmake_minimum_required):
   CMake 3.0.2 or higher is required. You are running version 2.8.12.2

--snip

Compiling on Debian 8 gives some CAF warnings that are a tad ugly:

In file included from /root/bro/aux/broker/3rdparty/caf/libcaf_core/caf/serializer.hpp:32:0,
                  from /root/bro/aux/broker/3rdparty/caf/libcaf_core/caf/detail/tuple_vals.hpp:25,
                  from /root/bro/aux/broker/3rdparty/caf/libcaf_core/caf/make_message.hpp:28,
                  from /root/bro/aux/broker/3rdparty/caf/libcaf_core/caf/mailbox_element.hpp:27,
                  from /root/bro/aux/broker/3rdparty/caf/libcaf_core/caf/abstract_actor.hpp:37,
                  from /root/bro/aux/broker/3rdparty/caf/libcaf_core/caf/actor.hpp:32,
                  from /root/bro/aux/broker/broker/data.hh:11,
                  from /root/bro/aux/broker/broker/broker.hh:8,
                  from /root/bro/src/broker/Data.h:4,
                  from /root/bro/src/broker/Data.cc:1:
/root/bro/aux/broker/3rdparty/caf/libcaf_core/caf/data_processor.hpp: In function ‘typename std::enable_if<std::is_same<caf::error, decltype (declval<caf::deserializer&>().caf::data_processor<caf::deserializer>::apply(declval<T&>()))>::value>::type caf::operator&(caf::deserializer&, T&) [with T = std::chrono::time_point<std::chrono::_V2::system_clock, std::chrono::duration<long int, std::ratio<1l, 1000000000l> > >; typename std::enable_if<std::is_same<caf::error, decltype (declval<caf::deserializer&>().caf::data_processor<caf::deserializer>::apply(declval<T&>()))>::value>::type = void]’:
/root/bro/aux/broker/3rdparty/caf/libcaf_core/caf/data_processor.hpp:478:7: warning: ‘dur’ may be used uninitialized in this function [-Wmaybe-uninitialized]
        t = std::chrono::time_point<std::chrono::system_clock,
{dur};
        ^
/root/bro/aux/broker/3rdparty/caf/libcaf_core/caf/data_processor.hpp:476:16: note: ‘dur’ was declared here
        Duration dur;
                 ^
--snip

/root/bro/aux/broker/3rdparty/caf/libcaf_core/src/scheduled_actor.cpp:892:55: warning: unused parameter ‘sender’ [-Wunused-parameter]
                                            actor_addr& sender,

--snip

I noticed one small thing while building with make -j4: you get several different % numbers simultaneously (one for CAF and one for Broker).

Example:

[ 25%] Built target plugin-Bro-BackDoor
[ 25%] Building CXX object src/analyzer/protocol/bittorrent/CMakeFiles/plugin-Bro-BitTorrent.dir/bittorrent_pac.cc.o
[ 85%] Building CXX object libcaf_io/CMakeFiles/libcaf_io_shared.dir/src/interfaces.cpp.o
[ 25%] Building CXX object src/analyzer/protocol/bittorrent/CMakeFiles/plugin-Bro-BitTorrent.dir/events.bif.cc.o

While this is obviously cosmetic, it still looks weird to me :).

Apart from that it compiled and ran all tests on all systems I tried it on.

There were a few test failures on the first run (that succeeded on a rerun) though.

These were (from different systems):
macOS:
[ 76%] scripts.base.frameworks.logging.field-extension-cluster ... failed
[ 21%] broker.disconnect ... failed
[ 56%] broker.ssl_auth_failure ... failed
[ 89%] scripts.base.frameworks.control.shutdown ... failed
[ 99%] scripts.base.frameworks.openflow.log-cluster ... failed

There were also a couple that did not succeed for me even after several reruns. This was on a DigitalOcean 4-CPU optimized Debian 8 instance; the reruns were not parallel:

root@debian-c-4-8gib-sfo2-01:~/bro/testing/btest# ../../aux/btest/btest -r -d
[ 0%] scripts.base.frameworks.control.configuration_update ... failed
   % 'btest-bg-wait 10' failed unexpectedly (exit code 1)
   % cat .stderr
   The following processes did not terminate:

   BROPATH=.:/root/bro/scripts:/root/bro/scripts/policy:/root/bro/scripts/site:/root/bro/build/scripts:.. bro /root/bro/testing/btest/.tmp/scripts.base.frameworks.control.configuration_update/configuration_update.bro frameworks/control/controller Control::host=127.0.0.1 Control::host_port=65531/tcp Control::cmd=shutdown

I noticed that Bro no longer builds on any version of RHEL/CentOS:

CMake Error at aux/broker/CMakeLists.txt:4 (cmake_minimum_required):
   CMake 3.0.2 or higher is required. You are running version 2.8.12.2

I threw this on our test cluster, and whatever the issue was with rotation breaking, which caused the logger
to buffer and then OOM, is fixed now. Logs have rotated twice without issue.

CPU usage is still higher, but I think it is just busy waiting like you suggested. perf top on a proxy shows:

   5.32% [kernel] [k] system_call_after_swapgs
   3.48% libcaf_core.so.0.15.7 [.] caf::scheduler::worker<caf::policy::work_stealing>::run
   3.12% libc-2.17.so [.] __GI___libc_nanosleep
   3.10% [kernel] [k] sysret_check
   3.05% libcaf_core.so.0.15.7 [.] caf::detail::double_ended_queue<caf::resumable>::take_head
   2.61% [kernel] [k] __schedule
   2.20% libc-2.17.so [.] __sleep
   2.19% [kernel] [k] timerqueue_add
   2.06% [kernel] [k] __audit_syscall_exit
   1.89% [kernel] [k] native_write_msr_safe
   1.85% [kernel] [k] cpuacct_charge
   1.84% [kernel] [k] __audit_syscall_entry
   1.74% [kernel] [k] hrtimer_start_range_ns
   1.50% libstdc++.so.6.0.19 [.] std::this_thread::__sleep_for
   1.40% libc-2.17.so [.] __libc_disable_asynccancel
   1.37% [kernel] [k] _raw_spin_unlock_irqrestore
   1.37% [kernel] [k] do_nanosleep
   1.25% libc-2.17.so [.] usleep
   1.22% [kernel] [k] rb_insert_color
   1.20% [kernel] [k] update_curr
   1.18% [kernel] [k] idle_cpu
   1.14% [kernel] [k] copy_user_generic_string
   1.09% [kernel] [k] finish_task_switch
   1.07% [kernel] [k] __x86_indirect_thunk_rax
   1.06% [kernel] [k] ktime_get
   0.93% [kernel] [k] native_sched_clock
   0.92% [kernel] [k] sys_nanosleep

which seems almost entirely related to timers and sleeping.

Other than that, things are working great. Cluster::publish_hrw is distributing data across proxies perfectly:

# for x in 1 2 3; do broctl print Scan::attacks proxy-$x|grep attempts= -c;done
3304
3405
3397

# cat /bro/logs/current/notice.log |bro-cut note peer_descr|grep Scan::|cut -f 2|sort|uniq -c
    454 proxy-1
    463 proxy-2
    417 proxy-3
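
For context, HRW is rendezvous (highest random weight) hashing: each key goes to the node whose hash of (node, key) is largest, which is why the counts above come out roughly even. The following is only a minimal Python sketch of that idea, not Bro's or Broker's actual implementation. The proxy names come from the output above; `hrw_pick`, the SHA-256 weighting, and the synthetic keys are assumptions made for illustration.

```python
import hashlib
from collections import Counter

def hrw_pick(key, nodes):
    """Return the node with the highest hash weight for this key."""
    def weight(node):
        # Weight = first 8 bytes of SHA-256 over "node|key" (illustrative choice).
        digest = hashlib.sha256(f"{node}|{key}".encode()).digest()
        return int.from_bytes(digest[:8], "big")
    return max(nodes, key=weight)

proxies = ["proxy-1", "proxy-2", "proxy-3"]

# Synthetic keys standing in for scan sources; this is not real data.
keys = [f"10.0.{i // 256}.{i % 256}" for i in range(9000)]

# Count how many keys each proxy would own.
counts = Counter(hrw_pick(k, proxies) for k in keys)
for node in proxies:
    print(node, counts[node])
```

Because the choice depends only on the key and the set of node names, every sender picks the same proxy for the same key without any coordination.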

Once this is stable for a bit, I'll start trying things like killing a proxy and verifying that things fail over.

Trying this I noticed a few things (ordered by urgency from my point of
view).

With this change, Bro can no longer be compiled out of the box on
RedHat/CentOS 7. Since that is the latest release of RedHat and is
probably used in production by quite a few people, a potentially
significant number of people might not be able to (easily) compile Bro
with this merge.

It aborts in configure, with:

-- Performing Test cxx11_header_works - Success
CMake Error at aux/broker/CMakeLists.txt:4 (cmake_minimum_required):
   CMake 3.0.2 or higher is required. You are running version 2.8.12.2

Using EPEL (which should be quite easy), a cmake3 package is available.

With this change, Bro can no longer be compiled out of the box on RedHat/CentOS 7. Since that is the latest release of RedHat and is probably used in production by quite a few people, a potentially significant number of people might not be able to (easily) compile Bro with this merge.

It aborts in configure, with:

-- Performing Test cxx11_header_works - Success
CMake Error at aux/broker/CMakeLists.txt:4 (cmake_minimum_required):
CMake 3.0.2 or higher is required. You are running version 2.8.12.2

Is "use cmake3 from EPEL" an acceptable answer?

The main reason for it (IIRC) is for embedding CAF as a CMake ExternalProject, which I was struggling to hack around with lack of features in CMake 2.8.

Compiling on Debian 8 gives some CAF warnings that are a tad ugly:

It's CAF's master branch at the moment, so I don't feel much pressure to report/patch these unless they're still there when we're close to moving to beta/release version.

I noticed one small thing while building with make -j4: you get several different % numbers simultaneously (one for CAF and one for Broker).

Not sure I can help that. The thing that comes to mind would be enforcing that CAF does not build in parallel and thus wasting your time. Or else try to patch CMake to do a better job when using external projects.

There were a few test failures on the first run (that succeeded on a rerun) though.

Thanks, I also still see occasional failures that pass when re-running -- it's lower priority on my list to stabilize these. And I think it doesn't prevent merging to master, since they are most often problems with the unit test itself.

There were also a couple that did not succeed for me even after several reruns. This was on a DigitalOcean 4-CPU optimized Debian 8 instance; the reruns were not parallel:

Ok, will take a closer look at those.

I should also mention that I've run unit tests myself on the following:

* MacOS 10.13.4
* FreeBSD 11
* CentOS 7
* Debian 8

I find the tests are usually stable with a few needing some reruns. I haven't had any single test persistently fail on any of those systems.

I was also testing --build-type=debug everywhere and only a few places without, so that could be a difference.

- Jon

It might be. I am honestly not sure - I suspect that this still will mean that some places might not be able to easily use Bro anymore--adding external package sources does not seem to be a viable option everywhere.

As a side-note, it also looks like that means that we cannot provide binary packages for RedHat/CentOS anymore.

Johanna

Is it a feasible compromise to allow cmake 2.8 if we don't need to
build CAF? So either people have cmake 3.0 or they need to build CAF
themselves and say --with-caf=...?

Robin

It might be. I am honestly not sure - I suspect that this still will mean that some places might not be able to easily use Bro anymore--adding external package sources does not seem to be a viable option everywhere.

They could still build CMake themselves? (CMake itself is easy to build)

The options to go forward are:

(1) Users whose OS has insufficient CMake will need to compile/obtain a newer one. This would mostly be Ubuntu 14.04 (LTS until April 2019) and RHEL/CentOS 6+7 (LTS for these is in the 2020-2024 range).

(2) We go back to CMake 2.8.12 and have people compile CAF themselves. (Or maybe we could conditionally require only 2.8.12 users to compile CAF and others get the embedded CAF).

(3) I need to try to hack our CMake system more to try to get back down to 2.8.12 while still being able to embed CAF.

I can give (3) a bit more time to see if I didn't miss something (the line I was drawing before was having to manually do platform-dependent RPATH manipulation), though it would be nice to hear more feedback on the approaches.
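
To make the conditional variant of (2) concrete, here is a small Python sketch of the rule it implies: embedding CAF needs CMake 3.0.2 (the version from the configure error), while a user-supplied CAF would only need 2.8.12. This is a sketch of the policy only and not part of Bro's build system; the function names and the `external_caf` flag are hypothetical.

```python
def parse_version(text):
    """Turn a dotted version such as '2.8.12.2' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))

def required_cmake(external_caf):
    """Minimum CMake version under the assumed conditional policy."""
    # 2.8.12 when the user supplies CAF, 3.0.2 when CAF is embedded.
    return (2, 8, 12) if external_caf else (3, 0, 2)

def can_configure(cmake_version, external_caf):
    """True if the installed CMake satisfies the requirement."""
    return parse_version(cmake_version) >= required_cmake(external_caf)

# CentOS 7's stock CMake, with and without a user-supplied CAF.
print(can_configure("2.8.12.2", external_caf=False))
print(can_configure("2.8.12.2", external_caf=True))
```

Tuple comparison handles the four-component version string reported by CentOS 7 without special casing.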

As a side-note, it also looks like that means that we cannot provide binary packages for RedHat/CentOS anymore.

We should be able to use whatever we want to create the binary package, shouldn't we? Or do you mean it wouldn't be accepted as part of the official repos even if it's just a build-time dependency?

- Jon

I tested this and it works great! I killed proxy-3, and cluster.log immediately logged it as 'node down'

The publish_hrw sent the new data to proxy 1 and 2 and when proxy 3 was restarted it rejoined and started receiving data again.
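
That matches what rendezvous (HRW) hashing should give in principle: when a node disappears, only the keys it owned move to other nodes, and they move back once it returns. A minimal Python sketch of that property follows. It is not Bro's or Broker's code; `hrw_pick`, the SHA-256 weighting, and the synthetic keys are assumptions for illustration.

```python
import hashlib

def hrw_pick(key, nodes):
    """Pick the node whose hash of (node, key) compares largest."""
    return max(nodes, key=lambda n: hashlib.sha256(f"{n}|{key}".encode()).digest())

all_proxies = ["proxy-1", "proxy-2", "proxy-3"]
survivors = ["proxy-1", "proxy-2"]            # proxy-3 has been killed
keys = [f"key-{i}" for i in range(3000)]      # synthetic keys, not real data

before = {k: hrw_pick(k, all_proxies) for k in keys}
during = {k: hrw_pick(k, survivors) for k in keys}
after = {k: hrw_pick(k, all_proxies) for k in keys}   # proxy-3 restarted

# Keys whose owner changed while proxy-3 was down.
moved = [k for k in keys if before[k] != during[k]]

# Only keys that lived on proxy-3 should have moved.
print(all(before[k] == "proxy-3" for k in moved))
# After the restart, every key is back on its original owner.
print(after == before)
```

Keys owned by proxy-1 and proxy-2 never move, so state those proxies already hold for their keys stays where it is.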

The next step is 2+ managers and 2+ loggers and we can finally have a Bro cluster with no SPOF :)

If there's something quick that ends up making (3) work, that'd be
ideal of course, but I don't think it's worth spending much time on,
given that there are reasonable ways to get a more recent cmake.

I wouldn't want to go back to not shipping CAF at all, but if we can
tell cmake that 2.8.12 is fine if users build CAF themselves, that
would be the 2nd best option I think. (1) is the worst case, which still
isn't too bad IMO, unless it does actually prevent us from building
binary packages for RH; that would be a problem.

Robin

I’m able to get this built and running on FreeBSD 11.1.

- Keith

Yup, 2 would be ok I guess. One should still be able to just compile the CAF in the Bro subdirectory in that case, right?

1 I would rather avoid if possible.

Johanna

I think (hope!) I was mistaken and everything already works with 2.8.12 (structure of CMake docs previously led me to think it wouldn't) and just needed the version check moved back down, sorry for the noise.

Otherwise, I've stabilized some unit tests and made a merge request [1] for the broker branch.

- Jon

[1] https://bro-tracker.atlassian.net/browse/BIT-1653

Yay, that is really good news, thanks :)

Johanna

I'm really late on this, but congratulations Jon! That's super exciting, and it's been a long slog getting to this point, but now I can't wait for you to be able to turn your attention to other stuff soon. :)

Thanks!
   .Seth