Sumstats bugs & fixes

In a previous email, I asked the question: Does anyone use sumstats outside of a cluster context?

In my case, the answer is yes: I do development on a laptop, and run Bro standalone to test new Bro policies.

I found several different bugs in share/bro/base/frameworks/sumstats/non-cluster.bro, specifically SumStats::process_epoch_results

  1. The policy returns sumstats results 50 at a time, then reschedules itself for the next 50 after 0.01 seconds. Unfortunately, the reschedule is:
    schedule 0.01 secs { process_epoch_result(ss, now, data1) };
    instead of:
    schedule 0.01 secs { **SumStats::**process_epoch_result(ss, now, data1) };
    so it silently fails after the first 50 results. It would be nice to get a warning when a script schedules an event that doesn’t exist.

  2. The more serious issue with the policy is that the ‘for’ loop over the result table is the main loop: up to 50 items are processed and deleted within the loop, on the expectation that doing so will not disturb the iteration.

The attached program (hash_test.bro) demonstrates that this is not the case (it should output 1000, 0, but the second value comes back non-zero), in line with the documented caveat: “Currently, modifying a container’s membership while iterating over it may result in undefined behavior, so do not add or remove elements inside the loop.” I didn’t examine the Bro source code to pin down the reason, but surmise that table resizing and rehashing would account for the issue.
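For illustration, here is a Python analogue of the hazard (this is a hypothetical sketch, not the attached hash_test.bro): mutating a container while iterating over it is unsafe. Python happens to detect the mutation and raises an exception, whereas Bro's behavior is simply undefined.

```python
# Build a table of 1000 entries, then try to delete entries while
# iterating over the same table.
table = {i: i * 2 for i in range(1000)}

try:
    for key in table:
        del table[key]  # mutation while iteration is in flight
except RuntimeError as err:
    print("iteration aborted:", err)

# Entries remain because the loop was cut short after the first delete.
print(len(table))  # 999
```

The point is the same as in Bro: the loop never reaches all the entries, so work silently goes undone.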

The consequences of this issue are that, under certain circumstances:

  • Not all data will be returned by SumStats at the epoch
  • SumStats::finish_epoch may not be run.

The issue can be addressed by rearranging the code, along the lines of the following pseudocode (boundary conditions, etc. ignored).

original (modifies table inside ‘for’ loop):

i = 50;
for (foo in bar)
{
    process bar[foo];
    delete bar[foo];
    --i;
    if (i == 0) break;
}

to (modifies table outside ‘for’ loop):

i = 50;
while (i > 0)
{
    for (foo in bar)
    {
        break;
    }
    process bar[foo];
    delete bar[foo];
    --i;
}
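The rearranged loop can be sketched in Python as follows (names like `process_batch` are illustrative, not from the patch): each pass peels off a single key via a for/break, so the delete always happens with no iteration in flight.

```python
def process(value):
    pass  # stand-in for the per-result work

def process_batch(bar, batch_size=50):
    i = batch_size
    while i > 0 and len(bar) > 0:
        for foo in bar:   # grab one key...
            break         # ...and leave the loop immediately
        process(bar[foo])
        del bar[foo]      # safe: no iteration is in progress
        i -= 1

results = {n: n * n for n in range(120)}
process_batch(results)
print(len(results))  # 70 keys left for the next scheduled batch
```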

… there are a few other subtleties in the code (keeping a closure on the result table so that sumstats can clear the original table & proceed to the next epoch, and not running SumStats::finish_epoch if the result table was empty to begin with).

A bit of rearrangement fixes the bugs while preserving the original behavior, with the help of a wrapper event that checks for an empty result table and, if it is not empty, makes an explicit copy for further processing by the actual event doing the work.

An additional ‘for’ loop around the result table could be used to keep it all in one event, but that looks too much like black magic (and still depends, albeit probably in a safe way, on undefined behavior). I prefer clear, understandable code (ha!) over “dark corner” features. Six months later, when I look at the code, I won’t be able to remember the clever trick I was using :)

Attached please find hash_test.bro & (patched) non-cluster.bro

Jim Mellander
ESNet


> Unfortunately, the reschedule is:
> schedule 0.01 secs { process_epoch_result(ss, now, data1) };
> instead of:
> schedule 0.01 secs { SumStats::process_epoch_result(ss, now, data1) };
> so it silently fails after the first 50 results.

Thanks, you're right about that.

> Would be nice to have a warning if a script schedules an event that doesn't exist.

Right again, it would be nice since it has resulted in bugs like this,
though I recall it might not be an easy change to the parser to clear
up the differences in namespacing rules for event identifiers.

> Attached please find hash_test.bro & (patched) non-cluster.bro

Thanks for those. I remember you pointing out the potential problem
in the earlier mail and meant to respond to indicate we should fix it
and I must have just forgotten, so sorry for that. I had a bit of a
different idea on how to address the iterator invalidation that might
be more understandable: keep a separate list of keys to delete later,
outside the loop. I have my version of your proposed fixes at [1].
Can you take a look and let me know if that works for you?

- Jon

[1] https://github.com/bro/bro/commit/3495b2fa9d84e8105a79e24e4e9a2f9181318f1a
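The keep-a-separate-list approach can be sketched in Python like this (a hypothetical illustration; names like `process_epoch_batch` are not from the actual commit): keys to remove are recorded in a side list during iteration, and the table is mutated only after the loop has finished.

```python
def process(value):
    pass  # stand-in for the per-result work

def process_epoch_batch(results, batch_size=50):
    done = []
    i = batch_size
    for key in results:
        process(results[key])
        done.append(key)   # remember the key; do not delete yet
        i -= 1
        if i == 0:
            break
    for key in done:       # safe: the iteration above has finished
        del results[key]

table = {n: str(n) for n in range(120)}
process_epoch_batch(table)
print(len(table))  # 70
```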

Seems like either way would work. I’ve used the idea of keeping a list of keys to delete after loop exit before (not in Bro, actually, but in awk, which has a similar paradigm). Since we’re peeling off keys and discarding them after use, avoiding the additional loop seemed more natural and efficient, although I suppose my way of looping over the keys just to grab the first item and breaking out may not seem very natural or be particularly efficient either.

I’ll try your changes to see if they accomplish the same things. Based on Justin’s post, I’m now more concerned about running my code on a cluster, but will tackle that next.