Bro 2.6-beta plans

There are no significant code changes or features planned to be added to
the master branch between now and the 2.6-beta release (likely in
about a week). Until then, please help test the latest master
branch and provide any feedback about how it's working if you can.

- Jon

To be clear, Cluster::relay_rr() is gone forever? I’ll need to rewrite some policies, but also update the blog to be correct.

Dop

Yes.

- Jon

I just got 2 clusters upgraded from fa7fa5aa to 452eb0cb, and now everything is broken...

CPU and memory are through the roof across the board, as is network traffic, but it's not logging much.

I may have created a message loop replacing the relay_rr stuff, but it's kind of hard to tell.

I'll do some more testing, but so far this is the first issue I've run into in months.

I guess one observation is that it is really hard to tell what bro/broker are doing. Before, you could at least
tcpdump the communication and see what events were being sent back and forth, but now that traffic is encrypted.

I tested an almost stock local.bro (a few additional things disabled) and saw the same thing.

fa7fa5aa is fine, but with 452eb0cb everything is working really hard to do something.

The most noticeable thing is the network traffic on the manager, which changed from being almost idle at
300KiB TX / 3MiB RX to 20MiB TX / 12MiB RX.

> I just got 2 clusters upgraded from fa7fa5aa to 452eb0cb, and now everything is broken...
>
> CPU and memory are through the roof across the board, as is network traffic, but it's not logging much.
>
> I may have created a message loop replacing the relay_rr stuff, but it's kind of hard to tell.

The recent forwarding changes would be my main suspect and, at least
in the default scripts, there are no communication patterns that
actually make use of the automatic forwarding, so can you check
whether adding "redef Broker::forward_messages = F;" to site/local.bro
makes a difference?

If it does fix things, then yeah, either I missed a forwarding loop in
the default scripts or you potentially introduced one when replacing
relay_rr (feel free to point me at stuff to look over).

(Generally, we may want to just leave message forwarding turned off due
to these types of dangers, if that's what this turns out to be...)

> I guess one observation is that it is really hard to tell what bro/broker are doing. Before, you could at least
> tcpdump the communication and see what events were being sent back and forth, but now that traffic is encrypted.

You can redef Broker::disable_ssl=T. I don't recall how readable the
non-encrypted communications are, but I think I did that at least once
or twice and was still able to spot event names.
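
Putting the two debugging knobs together, something like this in site/local.bro should do it (a minimal sketch; both redefs come straight from the suggestions in this thread):

  # Rule out forwarding loops by disabling automatic message forwarding.
  redef Broker::forward_messages = F;

  # Disable SSL so events show up readable in a packet capture.
  redef Broker::disable_ssl = T;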

- Jon

Thanks for that, I'll start looking into it, but it would still be
helpful if you could try disabling message forwarding (or disabling ssl
and looking at some captured traffic to see if you can understand what
might be happening). Thanks.

- Jon

Yeah, that fixed it!

I re-enabled that and then disabled ssl, and I am looking at the comm traffic going to the logger, which should just be logs.

This seems to work for basic quick analysis:

[root@bro40-dev ~]# tcpdump -n -i em1 port 47761 -A | sed "s/\.\.\.\.\./\n/g" | egrep -io 'broker.*' | head -n 10000 | sort | uniq -c | sort -nr
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on em1, link-type EN10MB (Ethernet), capture size 262144 bytes
tcpdump: Unable to write output: Broken pipe
   8842 broker::topic+broker::internal_command+@u32.bro/known/certs/<$>/data/clone
   1124 broker::topic+broker::internal_command+@u32.bro/known/hosts/<$>/data/clone
      8 broker::internal_command+@u32.bro/known/certs/<$>/data/clone
      5 broker::topic+broker::internal_command+@

Thanks, there was an unintended forwarding loop in data store
communication. It's fixed in master now, but I've also just reverted
to disabling the forwarding mechanisms by default (any
specific/advanced use cases can always choose to selectively enable
it, but the default cluster config doesn't currently need it for
anything).
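
For those advanced use cases, opting back in would look roughly like this (a sketch, not from this thread; it assumes Broker::forward() is the call for registering a topic prefix to forward, and the topic name is made up):

  redef Broker::forward_messages = T;

  event bro_init()
      {
      # Forward anything published under this (hypothetical) topic
      # prefix on to our other peers.
      Broker::forward("bro/my/forwarded/topic");
      }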

For anyone doing testing, please pull the latest change from master
and give feedback on that version.

- Jon

Latest master works much better.

One thing I'm still seeing when I switch from an old version to the latest master is a huge spike
in context switches/interrupts and CPU time spent in the kernel.

On one test worker box (that is receiving a full 10G and is a bit overloaded):

before: context switches 200k/sec, interrupts 150k/sec
after: context switches 1M/sec, interrupts 700k/sec

Before, CPU maxed out but spent 60% in user and 30% in system.
After, CPU maxed out but spent 12% in user and 80% in system.

On the manager side, CPU went up a little, but not by much.

I tried redef Broker::max_threads=2; but that didn't make much of a difference.

I just updated the default tuning parameters for CAF's scheduling
policy and exposed them all in Bro [1], and I now get within 10% of the
number of context switches that 2.5.5 had in a very simple test case.

Let me know how that goes for everyone.

[1] https://github.com/bro/bro/blob/4bd6da71864b4ab68b372954c6268a023d69e52b/scripts/base/frameworks/broker/main.bro#L64-L94

That helped a little with the context switches, but not much with the CPU load.

Do many of those options do anything? I tried looking in the CAF source to figure out how they are used, and it looks like they are all defined in
libcaf_core/caf/actor_system_config.hpp as

  // -- work-stealing parameters -----------------------------------------------

  size_t work_stealing_aggressive_poll_attempts CAF_DEPRECATED;
  size_t work_stealing_aggressive_steal_interval CAF_DEPRECATED;
  size_t work_stealing_moderate_poll_attempts CAF_DEPRECATED;
  size_t work_stealing_moderate_steal_interval CAF_DEPRECATED;
  size_t work_stealing_moderate_sleep_duration_us CAF_DEPRECATED;
  size_t work_stealing_relaxed_steal_interval CAF_DEPRECATED;
  size_t work_stealing_relaxed_sleep_duration_us CAF_DEPRECATED;

Scratch that... the deprecation refers to something else; they are still used here:

work_stealing::worker_data::worker_data(scheduler::abstract_coordinator* p)
    : rengine(std::random_device{}()),
      // no need to worry about wrap-around; if `p->num_workers() < 2`,
      // `uniform` will not be used anyway
      uniform(0, p->num_workers() - 2),
      strategies{{
        {CONFIG("aggressive-poll-attempts", aggressive_poll_attempts), 1,
         CONFIG("aggressive-steal-interval", aggressive_steal_interval),
         timespan{0}},
        {CONFIG("moderate-poll-attempts", moderate_poll_attempts), 1,
         CONFIG("moderate-steal-interval", moderate_steal_interval),
         CONFIG("moderate-sleep-duration", moderate_sleep_duration)},
        {1, 0, CONFIG("relaxed-steal-interval", relaxed_steal_interval),
         CONFIG("relaxed-sleep-duration", relaxed_sleep_duration)}}} {
  // nop
}

So strategies is an array of structs of [poll attempts, step size, steal interval, sleep duration].

So, as I understand it, the current defaults are:

5 polls, 1 step, interval 4, 0 sleep
5 polls, 1 step, interval 2, 16msec sleep
1 poll, 0 step, interval 1, 64msec sleep

Which means that if a thread has no work to do, it will poll 5 times with no sleep, then 5 more times with a 16msec sleep between attempts, then settle into one attempt per 64msec sleep.

This doesn't seem that bad... the worst case is 11 attempts every ~144msec (5x0 + 5x16 + 1x64), or roughly 75/second.

I went as far as trying

redef Broker::aggressive_polls = 1;
redef Broker::moderate_polls = 1;
redef Broker::moderate_sleep = 100msec;
redef Broker::relaxed_sleep = 200msec;

but that didn't make a noticeable change, so the scheduler may not be the problem. I'll have to do some more testing.

I did some more testing and profiling and figured out what is going on...

The new version is much more efficient, so it's spending a lot less time in user space
and a lot more time in the kernel fetching packets.

On the managers, overall CPU usage is a bit lower, but I think some of that is from removing all uses
of &synchronized, especially for large input tables.

If anything, there's just a slight overhead in the bro-myricom plugin in how it uses snf_ring_recv to
receive a single packet at a time instead of snf_ring_recv_many to grab multiple packets the way af_packet does.

I'd look into fixing it, but we are moving to Intel 40G cards anyway.

Hi again!

Just finished the migration to master across the board, and it’s looking REALLY good.

No crashes, memory is stable, and CPU is pretty good. The only thing I noticed is that the manager userspace CPU utilization
on one cluster jumped up a bit after the switch.

This graph shows the system and userspace utilization across an 8-physical-node cluster. The lower lines are the system usage
and the higher lines are userspace. The Y axis is CPU seconds per second, so 4 is a full 4 cores' worth of usage.

After the switchover:

  • the worker userspace CPU utilization was unchanged
  • the worker system CPU utilization increased a bit, but I believe that is due to more time spent in the myricom driver waiting for packets
  • the manager system CPU utilization dropped a bunch
  • the manager userspace CPU increased 1-3x

The manager box in this cluster only runs the manager and logger processes, no proxies. It also has something like 20 idle cores,
so this isn’t a problem at all, but it could affect people who run a cluster-in-a-box.

I do seem to be seeing a bunch of reporter errors like

Reporter::ERROR string with embedded NUL: “\x00\x00\x00\x00OPTIONS”

but I’m not sure if that is a new thing.

> Just finished the migration to master across the board, and it's looking REALLY good.

Great, thanks for helping test and provide performance data.

> The manager box in this cluster only runs the manager and logger processes, no proxies. It also has something like 20 idle cores,
> so this isn't a problem at all, but it could affect people who run a cluster-in-a-box.

An idea in this type of situation could be to tune Broker::max_threads
per node type, e.g. leave it at 1 for workers and bump it to ~4 for
manager/logger, since there are idle cores on their host and they're
inherently in a less-scalable/centralized location. That may not
lower overall CPU usage, but it may help prevent some bottlenecks in
the processing of remote messages. In particular, the work of
processing data store communication should distribute among threads;
potentially each data store could be processing messages independently
on a separate thread. (The default scripts have 3 stores, one for each
known-* script.)
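
A rough sketch of that per-node tuning, using the @if directive that comes up later in this thread (it assumes Cluster::local_node_type() from the cluster framework; the thread count is just the ~4 suggested above):

  # Give the centralized manager/logger nodes more Broker threads;
  # workers keep the default of 1.
  @if ( Cluster::local_node_type() == Cluster::MANAGER || Cluster::local_node_type() == Cluster::LOGGER )
  redef Broker::max_threads = 4;
  @endif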

> I do seem to be seeing a bunch of reporter errors like
>
> Reporter::ERROR string with embedded NUL: "\x00\x00\x00\x00OPTIONS"

Ok, that's not necessarily so bad, but it would be nice to find where
it's coming from so it can be handled more properly.
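
One hedged way to chase that down, assuming the standard reporter_error event fires for these (a sketch, not something tested in this thread):

  # Print the script location whenever an embedded-NUL reporter error
  # shows up, to narrow down where the string originates.
  event reporter_error(t: time, msg: string, location: string)
      {
      if ( /embedded NUL/ in msg )
          print fmt("%s: %s", location, msg);
      }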

- Jon

For those that haven’t yet dug into all that is Broker, this brings up a question on the general architecture. Are there still parent/child processes handling comms/work? I’d assume the fork is gone, so the new code is dropping a process but gaining a thread, in essence a wash.

Is there a mechanism today for per node type tuneables?

…alan

> Are there still parent/child processes handling comms/work?

No. Single process, configurable number of threads (default 1).

> Is there a mechanism today for per node type tuneables?

One should be able to use an @if directive [1] to tune things
differently per node if they want.

- Jon

[1] https://www.bro.org/sphinx-git/script-reference/directives.html

Could anyone update Bro on docker to include 2.5.5, 2.6-beta and master? https://hub.docker.com/r/broplatform/bro/tags/

We use this with our internal trybro instance, which is fantastic for quickly collaborating on and testing scripts. :)

Working on it now :)

There isn't an official 2.6 beta, but I will build and upload 2.5.5 and current master.