bro cluster in containers

Has anyone implemented a bro cluster in containers? The reason I ask is that we are looking to build a cluster on top of Mesos / DC/OS so that we can have high availability as we are processing tons of traffic, and it is just easier to deploy things on top of it if we do it in containers. I understand how to do most of it, but the configuration so that the cluster master knows about all the other instances is kind of my sticking point. Is there a way to utilize a tool like zookeeper so that it can dynamically manage the instances in case one of them crashes and then gets spun up on a different host?

Has anyone implemented a bro cluster in containers?

I've been meaning to try to build this out using k8s, just haven't had time.

The reason I ask is that we are looking to build a cluster on top of Mesos / DC/OS so that we can have high availability as we are processing tons of traffic, and it is just easier to deploy things on top of it if we do it in containers.

To really be useful, you also need to automate the configuration of the tapagg layer.

I understand how to do most of it, but the configuration so that the cluster master knows about all the other instances is kind of my sticking point.

Right now it would break because of how this is written:

event Cluster::hello(name: string, id: string) &priority=10
    {
    if ( name !in nodes )
        {
        Reporter::error(fmt("Got Cluster::hello msg from unexpected node: %s", name));
        return;
        }

    local n = nodes[name];

    if ( n?$id )
        {
        if ( n$id != id )
            Reporter::error(fmt("Got Cluster::hello msg from duplicate node:%s",
                                name));
        }
    else
        event Cluster::node_up(name, id);

    n$id = id;
    Cluster::log(fmt("got hello from %s (%s)", name, id));

    if ( n$node_type == WORKER )
        {
        add active_worker_ids[id];
        worker_count = |active_worker_ids|;
        }
    }

but I'm sure you could have a variation of that function that doesn't care if the node is unexpected.

Is there a way to utilize a tool like zookeeper so that it can dynamically manage the instances in case one of them crashes and then gets spun up on a different host?

k8s and Mesos should just do that for you, but what environment are you running in where that would be useful?

The deployment I was thinking of would involve a k8s operator to manage the Arista switch, so that as a cluster is created or scaled up and down it would automatically manage the tool port etherchannel groups for me. Without that it wouldn't be useful at all.

I've been meaning to try to build this out using k8s, just haven't had time.

We do plan to migrate to k8s at some point. In fact, I'd prefer doing it that way. The reason we are using Mesos is that it was chosen by two of the original developers before I joined the project. They were not so much focused on the containers part of it as they were on hosting MapR.

To really be useful, you also need to automate the configuration of the tapagg layer.

Can you elaborate on what that means? :)

Right now it would break because of how this is written:
...
but I'm sure you could have a variation of that function that doesn't care if the node is unexpected.

Can you override event handlers in Bro for a core function like that? Presumably I could also just have something that adds the node before it gets there?

k8s and Mesos should just do that for you, but what environment are you running in where that would be useful?

Well, I was just thinking more from the standpoint of discovering the nodes. You are right, zookeeper wouldn't really be needed because I can get all the info about the nodes from Mesos itself. I actually came up with an idea of running my own little service that monitors what nodes are running and makes sure the master config is up to date. If anyone has done that before, I would be interested in talking to you.
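Just to sketch that idea out a bit: here is a very rough take on what that little service could look like, assuming the workers run as a Marathon app on DC/OS and that you regenerate a broctl-style node.cfg. The Marathon URL, the app id, the manager host, and the config path are all placeholders for illustration, not anything from an actual deployment.

import time
import requests

MARATHON = "http://marathon.example.com:8080"   # assumed Marathon endpoint
APP_ID = "bro-worker"                           # assumed Marathon app id
NODE_CFG = "/opt/bro/etc/node.cfg"              # assumed broctl config path

def running_workers():
    # Marathon's GET /v2/apps/<id>/tasks lists the tasks it is running,
    # including the agent host each one landed on.
    r = requests.get("%s/v2/apps/%s/tasks" % (MARATHON, APP_ID), timeout=10)
    r.raise_for_status()
    return sorted(task["host"] for task in r.json().get("tasks", []))

def write_node_cfg(hosts):
    # Rewrite a broctl-style node.cfg; the manager entry is a placeholder.
    sections = ["[manager]\ntype=manager\nhost=manager.example.com\n"]
    for i, host in enumerate(hosts, 1):
        sections.append("[worker-%d]\ntype=worker\nhost=%s\ninterface=eth0\n"
                        % (i, host))
    with open(NODE_CFG, "w") as f:
        f.write("\n".join(sections))

def main():
    known = None
    while True:
        hosts = running_workers()
        if hosts != known:
            write_node_cfg(hosts)
            # ...trigger whatever reload/redeploy step your setup uses here.
            known = hosts
        time.sleep(30)

if __name__ == "__main__":
    main()

The interesting part is really just "diff the set of running workers against what the manager currently knows about and rewrite the config when it changes"; the reload step at the end is the piece you would have to work out for your environment.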

As for environment, we are feeding in large volumes of network traffic from some Gigamons for further analysis by our data scientists. We feel that bro will give us the flexibility we need and will also help in categorizing the data when it comes in the door, not to mention simply giving us some additional checking through feeds like Critical Stack.

A tap aggregator is a network switch that provides a layer of indirection between your bro cluster and your network taps. It allows you to “route around” or duplicate data (maybe you also want the same data sent to Snort and full PCAP) and slice it up into smaller, more manageable network flows. In a bro cluster, it is what load balances between bro worker nodes.

Automation of the tapagg layer means that you would use the API your tap aggregator has to turn on and turn off the switch ports that connect to containers. The orchestration component would spin up/tear down containers/pods and simultaneously turn on/off the corresponding switchport on the tap aggregator. The way I’m thinking of it, you wouldn’t use Kubernetes/Mesos/etc. to route network traffic around; rather, your orchestration platform would control the dataplane of your tap aggregator. My experience is that handling high-throughput traffic in software is a recipe for traffic loss and frustration. Do as much as you can in hardware.
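To make the switch side a bit more concrete, here is a rough sketch of what that could look like against a tap aggregator that speaks Arista's eAPI (JSON-RPC over HTTPS). The hostname, credentials, interface names, and channel-group number are all placeholders, and the exact config commands would need to be checked against how your tapagg is actually set up.

import requests

SWITCH = "https://tapagg.example.com/command-api"   # assumed eAPI endpoint
AUTH = ("admin", "password")                        # placeholder credentials

def run_cmds(cmds):
    # Arista eAPI is JSON-RPC over HTTP(S): POST a "runCmds" request.
    payload = {
        "jsonrpc": "2.0",
        "method": "runCmds",
        "params": {"version": 1, "cmds": cmds, "format": "json"},
        "id": "bro-orchestration",
    }
    r = requests.post(SWITCH, json=payload, auth=AUTH, verify=False, timeout=10)
    r.raise_for_status()
    return r.json()

def enable_tool_port(interface, channel_group=10):
    # Bring the tool port up and attach it to the (assumed) channel group
    # when a new worker container starts receiving traffic.
    return run_cmds([
        "enable",
        "configure",
        "interface %s" % interface,
        "channel-group %d mode on" % channel_group,
        "no shutdown",
    ])

def disable_tool_port(interface):
    # Shut the tool port when the corresponding worker goes away.
    return run_cmds([
        "enable",
        "configure",
        "interface %s" % interface,
        "shutdown",
    ])

# e.g. called from a k8s operator or a Marathon event subscriber:
# enable_tool_port("Ethernet12")

The point is that the orchestration layer only drives the switch's control plane; the packets themselves never touch the orchestrator.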

Either way, I am very interested in this conversation. We run Red Hat’s OpenShift, which is a Kubernetes distribution. Running bare metal works really well for us, but if I could get on OpenShift, I could get rid of most of our remaining physical machines. My question is whether or not broctl can be made to gracefully handle the rapid elastic characteristics of container orchestration. For example, can I add/remove bro worker nodes/containers without doing a “broctl deploy” or “broctl restart”? I’m sure I can just restart the service, but that seems like a disruptive and inelegant solution.

Best Regards,
-Stefan

It wouldn't need to; broctl would not be used at all in this environment.

A tap aggregator is a network switch that provides a layer of indirection between your bro cluster and your network taps. It allows you to “route around” or duplicate data (maybe you also want the same data sent to Snort and full PCAP) and slice it up into smaller, more manageable network flows. In a bro cluster, it is what load balances between bro worker nodes.

Automation of the tapagg layer means that you would use the API your tap aggregator has to turn on and turn off the switch ports that connect to containers. The orchestration component would spin up/tear down containers/pods and simultaneously turn on/off the corresponding switchport on the tap aggregator. The way I’m thinking of it, you wouldn’t use Kubernetes/Mesos/etc. to route network traffic around; rather, your orchestration platform would control the dataplane of your tap aggregator. My experience is that handling high-throughput traffic in software is a recipe for traffic loss and frustration. Do as much as you can in hardware.

Either way, I am very interested in this conversation. We run Red Hat's OpenShift, which is a Kubernetes distribution. Running bare metal works really well for us, but if I could get on OpenShift, I could get rid of most of our remaining physical machines. My question is whether or not broctl can be made to gracefully handle the rapid elastic characteristics of container orchestration. For example, can I add/remove bro worker nodes/containers without doing a “broctl deploy” or “broctl restart”? I’m sure I can just restart the service, but that seems like a disruptive and inelegant solution.

I happen to be somewhat of a subject matter expert on OpenShift, and I helped a great deal with our deployment in my company. My team doesn't use it, but that's mostly because they came up with this architecture before I joined the team. I'm definitely pushing them towards Kubernetes, but we already have promised deliverables and time is tight, so I can't really re-architect it right now.

I'll ask the guy who designed the current architecture about the tap aggregator. If I understand you correctly, it is really about configuring that to deal with the containers coming up and down, not so much bro itself. Does the bro cluster manager need to be updated when instances come and go then? I've seen that it acts as a log aggregator (although you can configure a different node type to do that). I am of course going to do my homework and start building it out, but I'm still kind of new, so any guardrails are appreciated.

My question is whether or not broctl can be made to gracefully handle the rapid elastic characteristics of container orchestration. For example, can I add/remove bro worker nodes/containers without doing a “broctl deploy” or “broctl restart”? I’m sure I can just restart the service, but that seems like a disruptive and inelegant solution.

It wouldn’t need to; broctl would not be used at all in this environment.


Justin Azoff

Okay — that makes sense. I’d have to think the details through, but I bet I could figure it out.

I think Justin just answered this question: the orchestration would handle the broctl/bro management things. So no, there is no need to update the cluster manager. There is no need for broctl. If I’m understanding Justin correctly, I believe you would need to figure out all the command line options to pass to the bro binaries. I don’t think that would be too hard if you have an example bro cluster to look at.

I don't know anything about Mesos, but k8s provides internal cluster DNS-based service discovery. You would just point all the workers at "logger" and it will do the right thing.

There's probably a whole bunch of corner cases and minor issues that would come up if you actually tried to build this out, but nothing that should prevent it from working.

I've been meaning to try setting it up, just haven't had time.
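To make the no-broctl idea a little more concrete, here is a rough sketch of what a worker container's entrypoint could look like. The service names, interface, and paths are invented. The Bro-specific bits I'm leaning on are that the cluster framework reads the CLUSTER_NODE environment variable to learn which node it is (as far as I know, that's what broctl sets for each process it launches) and that any script you pass on the bro command line gets loaded.

import os
import socket

def resolve(service):
    # k8s cluster DNS: a Service name resolves to its ClusterIP from inside
    # the cluster, which is the "point the workers at 'logger'" part.
    return socket.gethostbyname(service)

def main():
    # Assumed k8s Service names for the manager and logger nodes.
    manager_ip = resolve("manager")
    logger_ip = resolve("logger")
    print("resolved manager=%s logger=%s" % (manager_ip, logger_ip))

    # A generated cluster-layout.bro (not shown; the exact Cluster::Node
    # record fields depend on the bro version) would redef Cluster::nodes
    # using these addresses and get written to /bro/cluster-layout.bro.

    env = dict(os.environ)
    # Tell the cluster framework which node this process is; POD_NAME is an
    # assumed downward-API env var giving each worker a unique name.
    env["CLUSTER_NODE"] = os.environ.get("POD_NAME", "worker-1")

    # Run bro directly, no broctl. "local" loads local.bro as usual.
    os.execvpe("bro", ["bro", "-i", "eth0",
                       "/bro/cluster-layout.bro", "local"], env)

if __name__ == "__main__":
    main()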

I don't know anything about Mesos, but k8s provides internal cluster DNS-based service discovery. You would just point all the workers at "logger" and it will do the right thing.
There's probably a whole bunch of corner cases and minor issues that would come up if you actually tried to build this out, but nothing that should prevent it from working.
I've been meaning to try setting it up, just haven't had time.

Ah, ok. If that's the case, I guarantee I can make it work then. Thank you for your help!