I’m thinking of writing a SumStats plugin, probably with the initial implementation in Bro scriptland, followed by a re-implementation as BIFs if the initial tests are successful.
From examining several plugins, it appears that I need to:
- Add NAME of my plugin as an enum to Calculation
- Add optional tunables to Reducer
- Add my data structure to ResultVal
- In register_observe_plugins, register the function to take an observation.
- In init_resultval_hook, add code to initialize the data structure.
- In compose_resultvals_hook, add code to merge multiple data structures
- Create a function to extract the results from the data structure, either in epoch_result or in epoch_finished
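The steps above can be sketched as a small script-level plugin. This is only a minimal sketch: the calculation name SUMSQ, the tunable sumsq_scale, and the field sumsq are hypothetical, and the hook and callback signatures follow my reading of the stock plugins (sum.bro, max.bro), so they should be checked against main.bro in the Bro version in use.

```bro
@load base/frameworks/sumstats

module SumStats;

export {
	redef enum Calculation += {
		## Hypothetical calculation: sum of squared observation values.
		SUMSQ
	};

	redef record Reducer += {
		## Hypothetical tunable: factor applied to each value before squaring.
		sumsq_scale: double &default=1.0;
	};

	redef record ResultVal += {
		## Hypothetical plugin state; here it doubles as the result.
		sumsq: double &default=0.0;
	};
}

hook register_observe_plugins()
	{
	# Called once per observation for reducers that apply SUMSQ.
	register_observe_plugin(SUMSQ, function(r: Reducer, val: double, obs: Observation, rv: ResultVal)
		{
		local v = val * r$sumsq_scale;
		rv$sumsq += v * v;
		});
	}

hook compose_resultvals_hook(result: ResultVal, rv1: ResultVal, rv2: ResultVal)
	{
	# Merging two partial results is a plain addition for this state.
	result$sumsq = rv1$sumsq + rv2$sumsq;
	}
```

Because the state has a &default value, this sketch needs no initialization hook. State that must be allocated explicitly (for example an opaque value, as in hll_unique.bro) would be set up in init_resultval_hook. The result would then be read in epoch_result as result["some.stream"]$sumsq, where "some.stream" stands for whatever stream name the reducer uses.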
Anything else I should be aware of?
Thanks in advance,
It seems that there’s some inconsistency in SumStats plugin usage and implementation. There appear to be two classes of plugins with differing calling mechanisms and actions:
1. The item to be measured is in the Key, and the measurement is in the Observation.
   - These include Average, Last X Observations, Max, Min, Sample, Standard Deviation, Sum, Unique, Variance.
   - These are exact measurements.
   - Some of these have dependencies: StdDev depends on Variance, which depends on Average.
2. The item to be measured is in the Observation, the measurement is implicitly 1, and the Key is generally null.
   - These include HyperLogLog (number of unique items) and TopK (top counts).
   - These are probabilistic data structures.
The Key is not passed to the plugin, but is used to allocate a table that includes, among other things, the processed observations. Both classes call the epoch_result function once per key at the end of the epoch. Since class 2 plugins often use a null key, there is only one call to epoch_result, and a special function is used to extract the results from the probabilistic data structure (https://www.bro.org/current/exercises/sumstats/sumstats-5.bro). The epoch_finished function is called when all keys have been returned to finish up. This is unneeded with this sort of class 2 plugin, since all the work can be done in the sole call to epoch_result. Multiple keys could be used with class 2 plugins, which allows for groupings (https://www.bro.org/current/exercises/sumstats/sumstats-4.bro).
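For illustration, a class 2 setup along the lines of the linked exercise might look like the sketch below. The stream name "dns.lookup" and the SumStat name are illustrative, and the calls (topk_get_top, topk_count, the $topk_size tunable) are used as I understand them from the stock TopK plugin, so treat this as an assumption-laden sketch rather than the exercise's actual code.

```bro
@load base/frameworks/sumstats

event bro_init()
	{
	local r1 = SumStats::Reducer($stream="dns.lookup",
	                             $apply=set(SumStats::TOPK),
	                             $topk_size=50);

	SumStats::create([$name="top_dns_lookups",
	                  $epoch=1min,
	                  $reducers=set(r1),
	                  # With an empty key this callback runs once per epoch.
	                  $epoch_result(ts: time, key: SumStats::Key, result: SumStats::Result) =
	                  	{
	                  	local r = result["dns.lookup"];
	                  	# The "special function" that pulls results out of
	                  	# the probabilistic structure.
	                  	local top: vector of SumStats::Observation;
	                  	top = topk_get_top(r$topk, 10);
	                  	for ( i in top )
	                  		print fmt("%s: %d", top[i]$str, topk_count(r$topk, top[i]));
	                  	}]);
	}

event dns_request(c: connection, msg: dns_msg, query: string, qtype: count, qclass: count)
	{
	# The item being measured travels in the Observation; the Key is empty.
	if ( query != "" )
		SumStats::observe("dns.lookup", SumStats::Key(), SumStats::Observation($str=query));
	}
```

Using a non-empty key here (for example the originator address) would give one TopK structure per key, which is the grouping behaviour mentioned above.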
I have a use case where I want to pass both a key and a measurement to a plugin maintaining a probabilistic data store. I don’t want to allocate a table for each key, since many or most of them will not be reflected in the final results. Since the Observation is a record that can hold both a string and a number, a hack is to coerce the key to a string and pass both in the Observation to a class 2 plugin with a null key, which is what I am doing currently.
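A minimal sketch of that workaround, assuming a hypothetical stream name "demo.bytes_by_orig" and using the originator address and byte count purely as example data:

```bro
@load base/frameworks/sumstats

event connection_state_remove(c: connection)
	{
	# The Key is left empty, so only one result table is allocated.
	# The "real" key (here the originator address) is coerced to a
	# string and carried in the Observation next to the measurement.
	SumStats::observe("demo.bytes_by_orig",
	                  SumStats::Key(),
	                  SumStats::Observation($str=cat(c$id$orig_h),
	                                        $num=c$orig$size));
	}
```

Since the observe callback registered by a plugin receives the whole Observation record, the plugin can read both obs$str and obs$num there, even though it never sees the Key.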
It would be nice to have a conversation on how to unify these two classes of plugins. A few thoughts on this:
Pass Key to the plugins - maybe Key could be added to the Observation structure.
Provide a mechanism to not allocate the table structure with every new Key (this and the previous can possibly be done with some hackiness with the normalize_key function in the reducer record)
Some sort of epoch_result factory function that by default just performs the class 1 plugin behavior. For class 2 plugins, the function would feed the results one by one into epoch_result.
Incidentally, I think there’s a bug in the observe() function.
These two lines are run in the loop through the reducers:
if ( r?$normalize_key )
key = r$normalize_key(copy(key));
which has the effect of modifying the key for subsequent iterations of the loop, rather than just for the one reducer it applies to. The fix is easy and obvious…
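To make the effect concrete, here is a standalone sketch that mimics the loop with stand-in types. DemoKey, DemoReducer, upper_key, and both observe_* functions are hypothetical and exist only for this demonstration; the real observe() iterates over a set of reducers, so there the leak also depends on iteration order. The "fixed" variant is one possible shape of the fix, not necessarily the one committed upstream.

```bro
# Stand-ins for SumStats::Key and SumStats::Reducer (demo only).
type DemoKey: record {
	str: string &optional;
};

type DemoReducer: record {
	name: string;
	normalize_key: function(key: DemoKey): DemoKey &optional;
};

function upper_key(key: DemoKey): DemoKey
	{
	key$str = to_upper(key$str);
	return key;
	}

# Buggy shape: the normalized key overwrites the parameter, so it
# leaks into every later iteration.
function observe_buggy(key: DemoKey, reducers: vector of DemoReducer)
	{
	for ( i in reducers )
		{
		local r = reducers[i];
		if ( r?$normalize_key )
			key = r$normalize_key(copy(key));
		print fmt("buggy: reducer %s sees key %s", r$name, key$str);
		}
	}

# One possible fix: leave the caller's key untouched and derive a
# per-reducer key inside the loop.
function observe_fixed(orig_key: DemoKey, reducers: vector of DemoReducer)
	{
	for ( i in reducers )
		{
		local r = reducers[i];
		local key = r?$normalize_key ? r$normalize_key(copy(orig_key)) : orig_key;
		print fmt("fixed: reducer %s sees key %s", r$name, key$str);
		}
	}

event bro_init()
	{
	local reducers = vector(DemoReducer($name="normalizing", $normalize_key=upper_key),
	                        DemoReducer($name="plain"));
	observe_buggy(DemoKey($str="abc"), reducers);
	observe_fixed(DemoKey($str="abc"), reducers);
	}
```

In the buggy variant the "plain" reducer, which has no normalize_key, still ends up with the upper-cased key from the previous iteration; in the fixed variant it sees the original key.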
 Implementation of algorithms 4&5 (with enhancements) of https://arxiv.org/pdf/1705.07001.pdf
Yeah, looked wrong to me also. Fixed via  in master branch now.
Sorry I don't have much knowledge of the existing sumstats code to
drive the other discussion/suggestions forward.