[Bro-Blue] scan detector related sumstats load

First - wouldn't this be a topic for bro-dev instead of bro-blue? I don't
really see a reason to keep this private.

Indeed... Moving this to bro-dev, as there aren't any internal logs or addresses in it anymore.

To catch everyone up: I noticed some performance issues with how scan.bro and sumstats interact.
scan.bro creates a large number of unique sumstats keys, and the mechanism the manager uses to fetch them from the workers is not very efficient. I implemented some ideas to improve things and then realized that I had basically re-implemented how it used to work in 2013.

In any case... Seth might remember this better, but as far as I remember,
we had some huge, quite difficult-to-debug problems at bigger sites (I
think especially at Indiana) when running Bro with the old code that used
batching. I think I remember something about this causing _huge_ memory
spikes in those circumstances, and that the best way around that was this
switch.

I could see a batch size of 50 being a problem, but if batching itself was the issue, simply setting the batch size to 1 would still be better than what we have now.

If worker1 has key1 and worker2 has key2, right now what happens is:

    manager sends out get_a_key to get a key from each worker
    worker1 replies with send_a_key for key1
    worker2 replies with send_a_key for key2

At this point stats_keys[uid] contains [key1, key2]

Now, this is where everything goes wrong:

    manager sends out cluster_get_result for key1
    worker1 sends a cluster_send_result reply for key1
    worker2 sends an empty cluster_send_result reply

    manager sends out cluster_get_result for key2
    worker1 sends an empty cluster_send_result reply
    worker2 sends a cluster_send_result reply for key2

With 56 workers, you end up with:

    manager sends out cluster_get_result for key1
    worker1 sends a cluster_send_result reply for key1
    worker2 sends an empty cluster_send_result reply
    worker3 sends an empty cluster_send_result reply
    worker4 sends an empty cluster_send_result reply
    ...
    worker56 sends an empty cluster_send_result reply

    manager sends out cluster_get_result for key2
    worker1 sends an empty cluster_send_result reply
    worker2 sends a cluster_send_result reply for key2
    worker3 sends an empty cluster_send_result reply
    worker4 sends an empty cluster_send_result reply
    ...
    worker56 sends an empty cluster_send_result reply

For 56 workers to send the SAME key up to the manager, you get 1 get_a_key, 56 send_a_key, 1 cluster_get_result, and 56 cluster_send_result events. This is the best-case scenario for the current system. It works fine if your keys are things like country codes or MIME types that do not grow unbounded; there is a little overhead, but not much.

However, for 56 workers to send 56 different keys up to the manager, you get 1 get_a_key, 56 send_a_key, 56 cluster_get_result, and 3136 cluster_send_result events. This is the worst-case scenario, and it is exactly what scan.bro triggers.
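
In general, for W workers and K distinct keys, the current scheme costs

    1 get_a_key + W send_a_key + K cluster_get_result + (W * K) cluster_send_result

events in total. For the two cases above: W = 56, K = 1 gives 1 + 56 + 1 + 56 = 114 events, while W = 56, K = 56 gives 1 + 56 + 56 + 3136 = 3249, with the W * K term swamping everything else.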

You also have to be a bit careful when going back to old code (or changing
the sumstats code) - the code has a bit of a... sad interaction with the
Bro message cache (or whatever it is called) - if you forget to call
copy() at all the right places that exchange messages about data in
tables, you are not actually going to exchange data but just references,
which can lead to stale data on the manager (and also reduce message load
as a side effect - while leading to wrong results). I am not sure if that
is the case here, I am just saying you have to be quite careful changing
things :-)

I saw those copy()'s... I didn't understand them, but I left them alone :-)

And - one thing from your older email - removing the Cluster::worker_count
== done_with[uid] check is also a bit problematic, because it makes it
difficult to verify the correctness of the results. That can become an
issue with sumstats - sometimes single nodes reply surprisingly slowly.

Yeah.. I realized that was important. What my code currently does is:

When the manager wants the results, it sends out a single get_some_key_data event. This is similar to the old send_data event.

get_some_key_data sends one key to the manager using the existing cluster_send_result and does a

    schedule 0.001 sec { SumStats::get_some_key_data(uid, ss_name, cleanup) };

to re-call itself if there is more data to be sent. When there is no more data for the current ss_name, it sends a 'send_no_more_data' event up to the manager.

I moved the Cluster::worker_count check to count the send_no_more_data events. So rather than being done once per key, the unit of work is the entire table. I believe this is almost identical to the 2013 code.
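
For reference, the worker side of this ends up looking roughly like the following. This is a from-memory sketch, not the actual patch - result_store and the exact event signatures are my assumptions about the surrounding cluster.bro code:

    # Rough sketch only. Assumes this lives inside the existing
    # base/frameworks/sumstats/cluster.bro (module SumStats), where Key,
    # Result, the cluster event declarations, and a per-worker
    # result_store: table[string] of table[Key] of Result already exist.
    event SumStats::get_some_key_data(uid: string, ss_name: string, cleanup: bool)
        {
        if ( ss_name !in result_store || |result_store[ss_name]| == 0 )
            {
            # Nothing left for this sumstat; tell the manager this worker
            # is done so it can count replies against Cluster::worker_count.
            event SumStats::send_no_more_data(uid, ss_name);
            return;
            }

        # Grab one arbitrary key out of the table.
        for ( key in result_store[ss_name] )
            break;

        # Ship that single result with the existing event, then drop it locally.
        event SumStats::cluster_send_result(uid, ss_name, key, result_store[ss_name][key], cleanup);
        delete result_store[ss_name][key];

        # Come back for the next key (or for the send_no_more_data case).
        schedule 0.001 sec { SumStats::get_some_key_data(uid, ss_name, cleanup) };
        }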

Compared to the 3000+ events before, for 56 workers to send 56 different keys up to the manager this mechanism uses 1 get_some_key_data, 56 cluster_send_result events, and 56 send_no_more_data events. The send_no_more_data count will always be 56, so the overhead is a small constant.
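
In the notation from above, that is 1 get_some_key_data + K cluster_send_result + W send_no_more_data, i.e. 1 + 56 + 56 = 113 events for the same example - linear in keys and workers instead of proportional to their product. (The rescheduled get_some_key_data events stay local to each worker, so they never cross the cluster.)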

It is possible that the more efficient method of transferring data up to the manager was what was causing the memory spikes. I think the current code may appear to behave better, but it is also spending 97.5% of its time sending around extra events and never making any progress.

It may be that the reason it was changed to transfer one key at a time was so that bro would never have to build a copy of the entire sumstats table in memory on the manager. Fixing that issue and the event amplification at the same time would be a little harder. I know of two ways to solve that problem:

* Do an n-way merge of streams of sorted keys from each worker... implementing that in bro script would not be fun.

* Shard the sumstats table itself. If each sumstats table was first bucketed on each worker by a hash of the key into 37 buckets, you could transfer and process each bucket serially, which would cut the memory usage on the manager to 1/37th. This is probably really easy to implement. I am not sure how the hash would be computed in bro script (a rough sketch is below), but the rest is trivial.
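
For the hash, something along these lines might work in bro script. This is purely illustrative - num_buckets, hex_digit, and key_bucket are made-up names, and it assumes you render the key to a string first (I think SumStats::key2str() would do):

    # Illustrative only: shard a key into one of num_buckets buckets by
    # folding the first two hex digits of its MD5 digest into a number.
    const num_buckets = 37;

    global hex_digit: table[string] of count = {
        ["0"] = 0, ["1"] = 1, ["2"] = 2, ["3"] = 3,
        ["4"] = 4, ["5"] = 5, ["6"] = 6, ["7"] = 7,
        ["8"] = 8, ["9"] = 9, ["a"] = 10, ["b"] = 11,
        ["c"] = 12, ["d"] = 13, ["e"] = 14, ["f"] = 15
    };

    function key_bucket(key_str: string): count
        {
        # md5_hash() returns a 32-character lowercase hex digest.
        local digest = md5_hash(key_str);
        local v = hex_digit[sub_bytes(digest, 1, 1)] * 16 +
                  hex_digit[sub_bytes(digest, 2, 1)];
        # 256 values mod 37 is not perfectly uniform, but plenty good
        # enough for spreading transfer work across buckets.
        return v % num_buckets;
        }

Each worker would then keep one smaller table per bucket (or walk its one table 37 times), and the manager would request bucket 0 from everyone, process it, then bucket 1, and so on.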

Oh, and I believe I also found a small inefficiency in how cluster_key_intermediate_response works: recent_global_view_keys lives on the workers, so each worker can independently kick off a cluster_key_intermediate_response for the same key. This small patch keeps track of recent_global_view_keys on the manager too and should cut down on repeated events:

+global recent_global_view_keys: table[string, Key] of count &create_expire=1min &default=0;
 # Managers handle intermediate updates here.
 event SumStats::cluster_key_intermediate_response(ss_name: string, key: Key)
         {
         #print fmt("MANAGER: receiving intermediate key data from %s", get_event_peer()$descr);
         #print fmt("MANAGER: requesting key data for %s", key);
+        if ( [ss_name, key] in recent_global_view_keys )
+                return;

         if ( ss_name in outstanding_global_views &&
              |outstanding_global_views[ss_name]| > max_outstanding_global_views )
@@ -451,6 +458,7 @@
                 return;
                 }

+        ++recent_global_view_keys[ss_name, key];
         ++outstanding_global_views[ss_name];

         local uid = unique_id("");