Hi all:
I’m thinking of writing a SumStats plugin, probably with the initial implementation in bro scriptland, followed by a re-implementation as BIFs if initial tests are successful.
From examining several plugins, it appears that I need to:
- Add NAME of my plugin as an enum to Calculation
- Add optional tunables to Reducer
- Add my data structure to ResultVal
- In register_observe_plugins, register the function to take an observation.
- In init_resultval_hook, add code to initialize data structure.
- In compose_resultvals_hook, add code to merge multiple data structures
- Create function to extract results from data structure either at epoch_result, or epoch_finished
Anything else I should be aware of?
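The steps above can be sketched roughly as follows. This is a minimal illustration only: EXAMPLE_TOTAL, example_scale and example_total are hypothetical names invented for the sketch, and the hook and function signatures are assumed to match those used by the plugins shipped with the framework (where the init hook is spelled init_resultval_hook).

```bro
@load base/frameworks/sumstats

module SumStats;

export {
    redef enum Calculation += {
        ## Hypothetical calculation name used only in this sketch.
        EXAMPLE_TOTAL
    };

    redef record Reducer += {
        ## Hypothetical tunable: factor applied to every observed value.
        example_scale: double &default=1.0;
    };

    redef record ResultVal += {
        ## Hypothetical per-key state maintained by the plugin.
        example_total: double &optional;
    };
}

# Initialize the plugin's state when a ResultVal is created.
hook init_resultval_hook(r: Reducer, rv: ResultVal)
    {
    if ( EXAMPLE_TOTAL in r$apply && ! rv?$example_total )
        rv$example_total = 0.0;
    }

# Register the function that takes a single observation.
hook register_observe_plugins()
    {
    register_observe_plugin(EXAMPLE_TOTAL,
        function(r: Reducer, val: double, obs: Observation, rv: ResultVal)
            {
            # Defensive: do not rely on the init hook having run.
            if ( ! rv?$example_total )
                rv$example_total = 0.0;

            rv$example_total += r$example_scale * val;
            });
    }

# Merge two partial results (e.g. from different workers).
hook compose_resultvals_hook(result: ResultVal, rv1: ResultVal, rv2: ResultVal)
    {
    if ( rv1?$example_total && rv2?$example_total )
        result$example_total = rv1$example_total + rv2$example_total;
    else if ( rv1?$example_total )
        result$example_total = rv1$example_total;
    else if ( rv2?$example_total )
        result$example_total = rv2$example_total;
    }
```

A script using such a plugin would then read result[stream]$example_total in its epoch_result callback.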
Thanks in advance,
Jim
It seems that there’s some inconsistency in SumStats plugin usage and implementation. There appear to be 2 classes of plugins with differing calling mechanisms and actions:
1. Item to be measured is in the Key, and the measurement is in the Observation
   - These include Average, Last X Observations, Max, Min, Sample, Standard Deviation, Sum, Unique, Variance
   - These are exact measurements.
   - Some of these have dependencies: StdDev depends on Variance, which depends on Average
2. Item to be measured is in the Observation, the measurement is implicitly 1, and the Key is generally null
   - These include HyperLogLog (number of Unique), TopK (top count)
   - These are probabilistic data structures.
The Key is not passed to the plugin, but is used to allocate a table that includes, among other things, the processed observations. Both classes call the epoch_result function once per key at the end of the epoch. Since class 2 plugins often use a null key, there is only one call to epoch_result, and a special function is used to extract the results from the probabilistic data structure (https://www.bro.org/current/exercises/sumstats/sumstats-5.bro). The epoch_finished function is called when all keys have been returned to finish up. This is unneeded with this sort of class 2 plugin, since all the work can be done in the sole call to epoch_result. Multiple keys could be used with class 2 plugins, which allows for groupings (https://www.bro.org/current/exercises/sumstats/sumstats-4.bro).
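For reference, class 2 usage along the lines of the linked sumstats-5.bro exercise might look like the sketch below. The stream and SumStat names are hypothetical, and the details (the topk_size tunable, the topk field in the result, the topk_get_top and topk_count BIFs) are assumed to match the TopK plugin as shipped.

```bro
@load base/frameworks/sumstats

event bro_init()
    {
    local r1 = SumStats::Reducer($stream="example.lookup",
                                 $apply=set(SumStats::TOPK),
                                 $topk_size=50);

    SumStats::create([$name="example.top_lookups",
                      $epoch=1hr,
                      $reducers=set(r1),
                      # With a null key this runs once per epoch.
                      $epoch_result(ts: time, key: SumStats::Key, result: SumStats::Result) =
                          {
                          local rv = result["example.lookup"];
                          local top: vector of SumStats::Observation;
                          # Pull the top entries out of the probabilistic structure.
                          top = topk_get_top(rv$topk, 10);
                          for ( i in top )
                              print fmt("%s (estimated count %d)",
                                        top[i]$str, topk_count(rv$topk, top[i]));
                          }]);
    }

event dns_request(c: connection, msg: dns_msg, query: string, qtype: count, qclass: count)
    {
    # Null key: the item being measured travels in the Observation.
    if ( query != "" )
        SumStats::observe("example.lookup", [], [$str=query]);
    }
```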
I have a use case where I want to pass both a key and measurement to a plugin maintaining a probabilistic data store [1]. I don’t want to allocate a table for each key, since many/most will not be reflected in the final results. Since the Observation is a record containing both a string & a number, a hack would be to coerce the key to a string, and pass both in the Observation to a class 2 plugin, with a null key - which is what I am doing currently.
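The workaround described above could look something like this. It is only a sketch: the stream name is hypothetical, and the plugin consuming it is assumed to read both obs$str (the coerced key) and obs$num (the measurement).

```bro
@load base/frameworks/sumstats

event connection_state_remove(c: connection)
    {
    # Null key, so no per-key table is allocated by the framework.
    # The real key (here the originator address) is coerced to a string
    # and travels with the measurement inside the Observation.
    SumStats::observe("example.heavy_hitters", [],
                      [$str=cat(c$id$orig_h), $num=c$orig$size]);
    }
```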
It would be nice to have a conversation on how to unify these two classes of plugins. A few thoughts on this:
- Pass Key to the plugins - maybe Key could be added to the Observation structure.
- Provide a mechanism to not allocate the table structure with every new Key (this and the previous can possibly be done with some hackiness with the normalize_key function in the reducer record)
- Some sort of epoch_result factory function that by default just performs the class 1 plugin behavior. For class 2 plugins, the function would feed the results one by one into epoch_result.
Incidentally, I think there’s a bug in the observe() function:
These two lines are run in the loop through the reducers:
    if ( r?$normalize_key )
        key = r$normalize_key(copy(key));
which has the effect of modifying the key for subsequent iterations of the loop, rather than just for the one reducer it applies to. The fix is easy and obvious…
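One way to picture the fix is the standalone sketch below. It does not use the framework itself: DemoKey, DemoReducer, upper_key and demo_observe are hypothetical stand-ins for illustration. The point is that each iteration starts again from the caller's original key, so a normalized key never leaks into the next reducer.

```bro
type DemoKey: record {
    str: string &optional;
};

type DemoReducer: record {
    name: string;
    normalize_key: function(key: DemoKey): DemoKey &optional;
};

# Example normalizer: upper-case the key string.
function upper_key(key: DemoKey): DemoKey
    {
    key$str = to_upper(key$str);
    return key;
    }

function demo_observe(reducers: vector of DemoReducer, orig_key: DemoKey)
    {
    for ( i in reducers )
        {
        local r = reducers[i];

        # Reset to the caller's key on every iteration, then normalize
        # a copy for this reducer only.
        local key = orig_key;
        if ( r?$normalize_key )
            key = r$normalize_key(copy(orig_key));

        print fmt("%s sees key %s", r$name, key$str);
        }
    }

event bro_init()
    {
    local rs: vector of DemoReducer = vector(
        DemoReducer($name="normalizing", $normalize_key=upper_key),
        DemoReducer($name="plain"));

    demo_observe(rs, DemoKey($str="example.com"));
    }
```

With this arrangement the second reducer should still see the key as the caller passed it, regardless of what the first reducer's normalize_key did.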
Jim
[1] Implementation of algorithms 4&5 (with enhancements) of https://arxiv.org/pdf/1705.07001.pdf
Yeah, looked wrong to me also. Fixed via [1] in master branch now.
Sorry I don't have much knowledge of the existing sumstats code to
drive the other discussion/suggestions forward.
- Jon
[1] https://github.com/bro/bro/commit/5821c16490e731a68c0efc9c1aaba2d7aec28f48