Cluster state synchronization

I’m having some trouble wrapping my head around synchronization of set values in a cluster.

We use a relatively simple Bro script that correlates sets of whitelisted/blacklisted DNS names with new connections. To accomplish this, we maintain sets of the IP addresses returned by DNS lookups, which we then check new connections against.

For example: host “foo.internal” looks up “blacklist.example.com” and receives the response “10.0.0.1”. Bro then adds the IP address “10.0.0.1” to the set named “blacklisted_ips”. “foo.internal” then proceeds to contact “10.0.0.1” on TCP/443. Bro looks up “10.0.0.1” in “blacklisted_ips” and, as there is a match, raises a notice.
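
In script terms, the core of it looks roughly like this (a stripped-down sketch, not our actual script; the notice name is made up):

redef enum Notice::Type += { Blacklisted_IP_Contacted };

global blacklisted_ips: set[addr];

# Remember the addresses that blacklisted names resolve to.
event dns_A_reply(c: connection, msg: dns_msg, ans: dns_answer, a: addr)
    {
    if ( ans$query == "blacklist.example.com" )
        add blacklisted_ips[a];
    }

# Flag any later connection to one of those addresses.
event connection_established(c: connection)
    {
    if ( c$id$resp_h in blacklisted_ips )
        NOTICE([$note=Blacklisted_IP_Contacted, $conn=c,
                $msg=fmt("%s contacted blacklisted IP %s", c$id$orig_h, c$id$resp_h)]);
    }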

After migrating from a standalone to a single-node cluster configuration (manager, proxy, worker), it now appears as though the sets containing IP addresses are updated after the TCP connection is initialized. As a result, our notice log is now growing with entries that should never have been raised in the first place, and is missing entries that should have been raised.

Does this theory make sense? Is there a way to speed up set additions/removals, or otherwise force synchronization whenever a modification is made, before processing any further traffic? Alternatively, does the Bro scripting language have any concept of a ‘sleep’?

I’m not sure about forcing synchronization. In reply to your question about sleep, you may want to look at scriptland’s suspend_processing() and continue_processing().
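
The usual pattern with those (just a sketch; the input file path and stream name are placeholders) is to pause packet processing in bro_init() and resume once your initialization, e.g. reading an input file, has finished:

@load base/frameworks/input

type Idx: record {
    ip: addr;
};

global preloaded_ips: set[addr];

event bro_init()
    {
    # Don't process any packets until the table below has been read.
    suspend_processing();
    Input::add_table([$source="/path/to/preload.tsv", $name="preload",
                      $idx=Idx, $destination=preloaded_ips]);
    }

event Input::end_of_data(name: string, source: string)
    {
    if ( name == "preload" )
        continue_processing();
    }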

-AK

I think I may have mis-spoken when I used the term ‘synchronization’: our cluster is a single-node cluster, with one each of manager, proxy, and worker. So the sets are not tagged with ‘&synchronized’.

What we’re seeing is:

  1. DNS lookup is performed. Query matches whitelist. Bro is told to place the query response into the ‘whitelisted_ips’ set.
  2. Connection is established to the IP address. Connection is verified against the ‘whitelisted_ips’ set. No match is found, so a notice is raised.
  3. I confirm that the destination IP exists in the ‘whitelisted_ips’ set.

I’ve just done a quick test and confirmed that the address for one whitelisted DNS name took upwards of 10 seconds to show up in the whitelisted_ips set. While this was happening, multiple DNS lookups were performed and recorded by Bro in the DNS log.

Why does it take so long for a set to be updated, if that set is not marked for synchronization?

I think I understand the question now. What does the &synchronized attribute do for values in a single-node cluster? Someone else on the list will have to jump in on this one…
I’m curious if changing the priority of the event where you add to the whitelist set would make any difference.
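
I.e. something along these lines (sketch only; the event and set names are just illustrative):

global whitelisted_ips: set[addr];

# Raise the priority of the handler that populates the set so it runs
# before other handlers of the same event (the default priority is 0).
event dns_A_reply(c: connection, msg: dns_msg, ans: dns_answer, a: addr) &priority=5
    {
    add whitelisted_ips[a];
    }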

-AK


Is this a script that you wrote locally or are you using the Broala script?

  https://github.com/broala/bro-snippets/blob/master/intel-dns.bro
  (this script works like it sounds like yours does, but it uses data you have fed into the intel framework)

If you're curious about your script though, post it somewhere and someone can take a look. :)

  .Seth

> Is this a script that you wrote locally or are you using the Broala
> script?
>
>   https://github.com/broala/bro-snippets/blob/master/intel-dns.bro
>   (this script works like it sounds like yours does, but it uses data
>   you have fed into the intel framework)

It's a script that I inherited, originally written locally (I believe). It
is quite similar to the Broala script, but we're not using the intel
framework.

> If you're curious about your script though, post it somewhere and someone
> can take a look. :)

A shortened version of the script I'm using for testing is at
https://gist.github.com/mutemule/a36f49b16db51eccd159. If I move the 'add'
commands into their own functions, and then prioritize the 'add_' over the
'is_' functions, would that be a reasonable way to ensure my sets are
updated before being used for lookups? I'm already planning to migrate
some of our stuff over to Intel, but I'm not quite there yet.

Oh, nice. I like the idea behind that script. I think I understand the rationale behind it too.

I made some updates to your script (also attached to the email)...

I don't see any reason why this script wouldn't work on a single worker (it won't work well on a cluster, though). You'll need to add your own list of authorized fqdns (probably in local.bro after you load this script), like this...

@load connection_validation
redef ConnectionValidation::authorized_fqdns += {
  "a.example.com",
  "b.example.com",
};
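
For reference, that redef works because the script exports the set as redef-able, roughly like this (just the declaration, not the whole attached script):

module ConnectionValidation;

export {
    # The FQDNs you consider legitimate; extend this via redef in local.bro.
    const authorized_fqdns: set[string] = {} &redef;
}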

If you try it, let me know how it works for you!

connection_validation.bro (2.24 KB)

Thanks! That's the first step in my cleanup done. ;)

But I don't see much of a difference between these scripts, as it relates
to my problem with the timeliness of set updates. Is there anything in
particular you've done that might result in sets being updated before being
referenced for lookups?

Not really. You had some code in there that was a little difficult to follow, which was why I started doing the reorganization and cleanup. I'm wondering if you might have had a bug in there before. Could you try running my updated script and seeing if that works? (I wish I had a concrete answer, but I don't yet.)

  .Seth

Done, and the end result is the same: sets are being updated slowly. For
one CNAME in question, it took multiple queries and a few minutes for it to
show up in the CNAMEs set.


Weird. Could you try removing the &persistent attribute from those variables with it set? That attribute hasn't been used much ever, especially in the past couple of years. It's probably going to be removed before too many more releases too (replaced by something else).

It would help if you had some traffic that showed this effect too. It's possible that there is something weird going on with your traffic that is giving this appearance.

  .Seth

>> Done, and the end result is the same: sets are being updated slowly.
>> For one CNAME in question, it took multiple queries and a few minutes
>> for it to show up in the CNAMEs set.

> Weird. Could you try removing the &persistent attribute from those
> variables with it set? That attribute hasn't been used much ever,
> especially in the past couple of years. It's probably going to be removed
> before too many more releases too (replaced by something else).

I can try, but we kind of depend on persistence: without it, Bro goes a bit
nuts on startup until it sees all the DNS queries again. I've been toying
with the idea of resolving each member of the list and adding to the
appropriate address set during bro_init(), but that would still leave some
gaps.
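
Roughly what I have in mind (sketch only; the names are placeholders, and lookup_hostname() is asynchronous, so it has to run inside a when block):

global whitelisted_fqdns: set[string] = { "a.example.com", "b.example.com" };
global whitelisted_ips: set[addr];

event bro_init()
    {
    for ( fqdn in whitelisted_fqdns )
        {
        when ( local addrs = lookup_hostname(fqdn) )
            {
            for ( a in addrs )
                add whitelisted_ips[a];
            }
        timeout 10sec
            {
            Reporter::warning("timed out resolving a whitelist entry");
            }
        }
    }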

> It would help if you had some traffic that showed this effect too. It's
> possible that there is something weird going on with your traffic that is
> giving this appearance.

The CNAME I mentioned above was a bad example: looks like the host was
caching some aspects of it (namely, that it was a CNAME), so Bro never had
a chance. I'll have another look at the network: we just had another false
positive arrive, but the DNS log shows the query happening after the
connection.

Thanks! Looks like the issue may be in the network itself.

As far as persistence goes, I had a similar need and ignored the
&persistent attribute; instead I manually append the data to a file
and then read the file into a table on bro_init. This seemed to work
well on some large-scale (10 GB pcap) tests, but I haven't finished
the script to run in prod yet.
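
The write side is nothing fancy (sketch only; the path is a placeholder, and I'm glossing over the file format the read side expects):

const cache_file = "/var/tmp/dns_cache.tsv";  # placeholder path

# Append each new entry as it's learned; the file gets read back into a
# table at bro_init (e.g. with the Input framework, as sketched earlier
# in the thread).
function remember(ip: addr)
    {
    local f = open_for_append(cache_file);
    print f, ip;
    close(f);
    }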

Ok, that makes me feel better about the persistence at least. I'd be curious about what's causing things on your network to show up in a weird order though.

  .Seth