Bloom filter merging

I noticed a slight complication while implementing the Bloom filter
merge functionality: for two Bloom filters to be mergeable, they must
have been filled with the same hash functions. This means we need to
preallocate the hash functions, because their seeds depend on calls
to bro_random(): if we instantiate a hash function later in the game,
we may not get the same one across all cluster nodes. It's also
unclear how many hash functions we need. I've seen corner cases of
over 40 instances due to poor parametrization, though the common case
is around 4-8. An H3 hash function occupies about 2304 bytes, so
maintaining a global pool of, say, 100 hash functions to be on the
safe side would cost us only around 230 KB. For example, if a Bloom
filter needs 5 hash functions, it always grabs the first 5 from the
pool to ensure a deterministic setup. The pool needs to be
instantiated directly after the seed has been initialized.
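
As a rough sketch (all names below are hypothetical, not Bro's actual
classes), I picture the pool being filled once after seeding and
always handing out the first k functions:

    // Hypothetical sketch of a global hash function pool. Constructing
    // it right after the seed is initialized guarantees that every
    // cluster node builds the same sequence of hash functions.
    #include <cassert>
    #include <cstddef>
    #include <vector>

    struct HashFunction { /* e.g., an H3 instance seeded via bro_random() */ };

    class HashFunctionPool {
    public:
      explicit HashFunctionPool(std::size_t size) : pool_(size) {}

      // A Bloom filter needing k hash functions always takes the first
      // k, which keeps the setup deterministic across nodes.
      std::vector<const HashFunction*> get(std::size_t k) const {
        assert(k <= pool_.size());
        std::vector<const HashFunction*> result;
        for (std::size_t i = 0; i < k; ++i)
          result.push_back(&pool_[i]);
        return result;
      }

    private:
      std::vector<HashFunction> pool_;
    };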

Instead of allocating a fixed number of hash functions, we could also
set aside a fixed number of seeds and use them to instantiate hash
functions as we need them. Does that sound like a good strategy to
pursue?
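
For comparison, a minimal sketch of the seed-reservation variant
(hypothetical names; bro_random's signature is assumed from Bro's
util.h):

    // Hypothetical seed pool: reserve PRNG output at startup, then
    // construct hash functions lazily from the stored seeds.
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    long int bro_random();  // Bro's seeded PRNG, declared in util.h

    static std::vector<uint64_t> seeds;  // filled once, right after seeding

    void init_seed_pool(std::size_t n) {
      for (std::size_t i = 0; i < n; ++i)
        seeds.push_back(static_cast<uint64_t>(bro_random()));
    }

    // Later: construct the i'th hash function as, e.g., H3(seeds[i]).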

    Matthias

I had a chat with Vern about this earlier today and the discussion
yielded the following idea:

Rather than reserving a fixed number of seeds at startup, we use a
keyed hashing scheme that computes a seed to construct the i'th hash
function as follows:

    seed_i = h(initial_seed || name || i)

One can fix "initial_seed" at startup via an environment variable. The
new element "name" is a parameter that the user provides (optionally?)
as part of the BiF bloomfilter_init. The counter "i" allows for
generating an arbitrary number of hash functions. The function "h"
shall be a consistent hash function, e.g., H3 seeded with the initial
seed.
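
As a sketch of this derivation (the concatenation and the FNV-1a
stand-in for h are purely illustrative; the real h would be, e.g., a
seeded H3):

    // Hypothetical sketch of seed_i = h(initial_seed || name || i).
    #include <cstdint>
    #include <string>

    // Illustrative stand-in for a consistent hash function h; the
    // real implementation would use, e.g., H3 seeded with initial_seed.
    uint64_t h(const std::string& data) {
      uint64_t hash = 14695981039346656037ULL;  // FNV-1a over bytes
      for (unsigned char c : data) {
        hash ^= c;
        hash *= 1099511628211ULL;
      }
      return hash;
    }

    uint64_t derive_seed(uint64_t initial_seed, const std::string& name,
                         std::size_t i) {
      return h(std::to_string(initial_seed) + name + std::to_string(i));
    }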

Feedback welcome,

    Matthias

What about:

    seed_i = h( ((name.length() == 0) ? initial_seed : name) || i)

In my mind, it'd be cool if identical names would produce identical
results (in this case) without the need to set an environment
variable. Maybe there's a reason that's a bad idea, though?
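
In code, reusing the illustrative h() from the sketch above, that'd
be something like (hypothetical, not the actual implementation):

    // Fall back to initial_seed only when no name was given, so that
    // identical names yield identical seeds across deployments.
    uint64_t derive_seed_v2(uint64_t initial_seed, const std::string& name,
                            std::size_t i) {
      std::string key = name.empty() ? std::to_string(initial_seed) : name;
      return h(key + std::to_string(i));
    }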

--Gilbert

I'm not entirely sure how much of a concern this is, but seeding
solely based on the name would make the Bloom filters vulnerable to
attack, since the hash values become more predictable. Ideally, the
names would be something simple that easily identifies what the
filter is counting (from how I picture the names being used).

That would put pressure on the user to choose names that are random or
salted to keep the filters secure. I do like the idea of not setting
an environment variable, but I'm not sure how much of a benefit it is.

Thanks,
Soumya Basu

    seed_i = h( ((name.length() == 0) ? initial_seed : name) || i)

This is actually how I currently implemented it internally. However, I
do not think it will make a difference, because CompHash is also
seeded by initial_seed [1], and the Bloom filter implementation uses
CompHash to hash Val instances. To generate k different hash values, I
just stuff the CompHash output into k H3 instances that are seeded
according to the formula above. Let x be an instance of a Val. Then
the i'th hash value is:

    h_i(x) = H3_i(CompHash(x))
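
A minimal sketch of that composition (comp_hash and h3 below are
illustrative stand-ins for Bro's CompHash and seeded H3, not their
real interfaces):

    // Hypothetical sketch: one CompHash pass, then k seeded H3 passes.
    #include <cstdint>
    #include <vector>

    // Illustrative stand-ins, not Bro's actual CompHash/H3.
    uint64_t comp_hash(uint64_t x) { return x * 0x9E3779B97F4A7C15ULL; }
    uint64_t h3(uint64_t seed, uint64_t x) {
      return (x ^ seed) * 0xFF51AFD7ED558CCDULL;
    }

    std::vector<uint64_t> hash_k(uint64_t val,
                                 const std::vector<uint64_t>& seeds) {
      uint64_t digest = comp_hash(val);   // hash the Val once
      std::vector<uint64_t> result;
      for (uint64_t seed : seeds)         // i'th value: H3_i(digest)
        result.push_back(h3(seed, digest));
      return result;
    }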

In my mind, it'd be cool if identical names would produce identical
results (in this case) without the need to set an environment
variable. Maybe there's a reason that's a bad idea, though?

Soumya pointed out the problem: there's a trade-off between making
Bloom filters shareable on the one hand and keeping them robust
against attackers on the other. If we want to distribute Bloom filters
at some point, we'll face the problem of running into different seeds
in different deployments. Conceptually, we have two tuning knobs: the
Bloom filter name, which is unique to the Bloom filter, and the
environment variable BRO_SEED_FILE, which determines initial_seed.
Ideally, our users will not have to mess with their seed to use a
Bloom filter someone wants to share. At this point, however, it is
impossible to get rid of the initial_seed dependency due to the
problem noted above. If there's a straightforward way to hash an
arbitrary Val differently, please let me know.

    Matthias

[1] According to the documentation in Hash.cc, hashing a Val uses H3
for short data and HMAC-MD5 for arbitrarily long data. Both the seed
of the H3 instance and the HMAC key are a function of initial_seed.