Cluster Controller Framework Thoughts

Rather lengthy post follows. In case people didn’t see it, much of this references this document: https://docs.google.com/document/d/1r0wXnihx4yESOpLJ87Wh2g1V-aHOFUkbgiFe8RHJpZo/edit

Over the past couple of weeks, I upgraded all of our standalone Zeek clusters (about a dozen) to Zeek 3.2, and I also wanted to use this as an opportunity to try moving over to the new Supervisor framework.

I think there are cases where the Supervisor framework is a great fit, and for being at an early stage, it really does work well. However, I ended up not using the Supervisor framework at all, and just running everything directly from systemd (with no zeekctl). I realize that “just use systemd” isn’t a solution for everyone, but I’d like to share some thoughts in order to suss out whether there’s something inherently different about our environment/setup, or whether the Supervisor and Cluster Controller frameworks need some tweaking or rethinking.

Configuring the Supervisor was more difficult than configuring systemd. In both cases, I need to define my cluster layout, which I do via an Ansible template. Then, I need to tell the Zeek processes their working directory, CPU pinning assignment, environment variables, interface to sniff on, etc. These are all things that systemd can do natively, and I found it much easier to do them via systemd templates.
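For illustration, here’s a trimmed-down sketch of the kind of worker template unit I mean. This is illustrative rather than our exact config; the paths, the interface name, and the CLUSTER_NODE naming are assumptions:

# /etc/systemd/system/zeek_worker@.service (sketch; paths and names are
# illustrative). The instance number (%i) picks the working directory,
# the node name, and the CPU to pin to.
[Unit]
Description=Zeek worker %i

[Service]
Slice=zeek-worker.slice
WorkingDirectory=/usr/local/zeek/spool/worker-%i
Environment=CLUSTER_NODE=worker-%i
CPUAffinity=%i
ExecStart=/usr/local/zeek/bin/zeek site/local -i ens1
Restart=on-failure

[Install]
WantedBy=multi-user.target

Starting worker 2 is then just “systemctl start zeek_worker@2”, which is where the zeek_worker@2.service in the status output below comes from.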

Monitoring the cluster was also more difficult with the Supervisor. When I check the status of the cluster, I just see a bunch of identically-named processes. I can’t tell what’s using excessive memory, and if I want to check the logs for a process, I have to go manually look at stderr and stdout in its working directory.

With systemd, I see something like this:

Active: active since Fri 2020-08-28 08:13:19 PDT; 4 days ago
Tasks: 247
Memory: 3.1G
CGroup: /zeek.slice
├─zeek-worker.slice
│ ├─zeek_worker@1.service
│ │ └─55473 /usr/local/zeek/bin/zeek site/local -i ens1
│ ├─zeek_worker@2.service
│ │ └─55253 /usr/local/zeek/bin/zeek site/local -i ens1

├─zeek_cluster@control.service
│ └─55074 /usr/local/zeek/bin/zeek site/local
├─zeek_cluster@logger.service
│ └─55125 /usr/local/zeek/bin/zeek site/local
├─zeek_cluster@manager.service
│ └─55114 /usr/local/zeek/bin/zeek site/local
└─zeek_cluster@proxy.service
└─55116 /usr/local/zeek/bin/zeek site/local

I immediately know which process is which, per-node logs go to my standard logging infrastructure, and I can restart a single node if it’s misbehaving. I even took this a step further and configured systemd to kill and restart any worker if it ever swaps, while still allowing the logger, manager, etc. to swap.
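The mechanism is simple: assuming cgroup v2, a resource-control drop-in on the worker slice along these lines captures the idea (a sketch; the 3G budget is a made-up number):

# /etc/systemd/system/zeek-worker.slice.d/resources.conf (sketch; assumes
# cgroup v2 and a made-up 3G budget). A worker that exceeds its memory
# budget is OOM-killed rather than allowed to swap, and Restart=on-failure
# in the worker unit brings it right back up. The logger/manager units
# live outside this slice, so they may still swap.
[Slice]
MemoryMax=3G
MemorySwapMax=0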

Ultimately, given the choice between systemd + supervisor versus just systemd, for our use case, just systemd gave us some distinct benefits and reduced complexity. My intent is not to be critical of the existing work; like I said, I think there are cases where it’s a great fit. However, I did also want to share my experiences and thoughts after running it for a while.

That experience gave me a new perspective on the Cluster Controller framework. It is essentially another layer of orchestration over the cluster, except that there are specific things it does not do: namely, it is not responsible for keeping Zeek binaries, packages, and scripts in sync. So, in order to use it, I need some layer that distributes those files and ensures they match across the cluster. Once again, it feels like I’m left with a choice between one orchestration tool + the cluster controller framework, versus just using a single orchestration tool to keep these files in sync and handle cluster stop/start/restarts.

Take, for instance, a cluster restart. This could either be planned (e.g., I upgraded a binary or script and want that change to take effect) or unplanned (something crashed and the cluster is now degraded). In the first case, I’d need to use my orchestration tool to push out the file changes anyway, and then it becomes trivial to have it handle the restart as well. In fact, having my orchestration tool do that also enables things like falling back to the previous version should the upgrade go wrong. In the second case, rather than build another layer on top, I’d like to see more resiliency in the Cluster framework itself.

There are other capabilities, such as coordinated starting and stopping of the cluster, but the only reason I perform those actions is necessity (e.g., the system running the logger is going to reboot, and I know that if I don’t stop the whole cluster, logs will queue in memory and workers will crash).

So, I’m left wondering if the Cluster Controller framework is really solving the right problems. Rather than worry about coordinated restarts, I’d like to see the need for such restarts go away with better resiliency. If I need to reboot a system in a cluster, and it’s running the manager and logger, I’d like to see another system in the cluster get promoted to being the manager and logger, and all the nodes to start talking to that instead. I could see a design where all systems run managers and loggers, with only one active at a time. If the active manager shares state with the standby managers, a failover can happen seamlessly.

It’s true that the world has moved on from the model of broctl and zeekctl, but I feel like that’s because it has moved towards better resiliency in distributed systems, and it feels a bit like the Cluster Controller framework is trying to take the old zeekctl features and get them to fit into a new model.

–Vlad

Hi Vlad,

thanks for the feedback, that's quite helpful. I'll dig a bit into
some of your points below.

As a general point, there's nothing wrong with choosing a different
deployment model than whatever becomes the new default. Quite the
opposite: Part of the thinking here has been that there's no single
approach that'll work for everybody, hence we want to offer multiple
layers that people can hook into depending on their needs and
expertise. The Controller would be the highest level abstraction that
gives you an experience not too far from current ZeekControl (more on
that below). On the other end of the spectrum, skipping everything
altogether and going with a manual systemd config is the lowest-level
way of doing it. In between those two we have: using the Supervisor
API through a custom Zeek management script (i.e., no Cluster
Agent/Controller), and writing a custom controller to interface with
the Agent API while doing your own state management.
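To give a flavor of that middle layer: driving the Supervisor from a
script looks roughly like the following. This is only a sketch; the
node names, ports, and interface are made up, and you'd launch it via
"zeek -j":

# Minimal supervised-cluster sketch; run as "zeek -j cluster.zeek".
event zeek_init()
    {
    if ( ! Supervisor::is_supervisor() )
        return;

    local cluster: table[string] of Supervisor::ClusterEndpoint = {
        ["manager"] = [$role=Supervisor::MANAGER, $host=127.0.0.1, $p=10000/tcp],
        ["logger"] = [$role=Supervisor::LOGGER, $host=127.0.0.1, $p=10001/tcp],
        ["worker-1"] = [$role=Supervisor::WORKER, $host=127.0.0.1, $p=10002/tcp,
                        $interface="ens1"],
    };

    for ( n in cluster )
        {
        local ep = cluster[n];
        # Every node gets the full layout plus its own working directory.
        local nc = Supervisor::NodeConfig($name=n, $directory=n, $cluster=cluster);

        if ( ep?$interface )
            nc$interface = ep$interface;

        # Supervisor::create() returns an empty string on success.
        local res = Supervisor::create(nc);
        if ( res != "" )
            print fmt("failed to create node '%s': %s", n, res);
        }
    }

A custom management script would then layer its own policy (restarts,
health checks, and so on) on top of that same API.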

> Ultimately, given the choice between systemd + supervisor versus just
> systemd, for our use case, just systemd gave us some distinct benefits and
> reduced complexity.

Ack, I can see that for you guys, especially with the current state of
things. I'll just add here that (1) not everybody can or wants to use
systemd, so we'll need to have ways to build clusters in other
settings; and (2) some of the technical advantages you mention should
be addressable with the Supervisor/Controller, too; that's just not
there yet (e.g., nicer process visualization).

> it feels like I'm left with a choice between one orchestration tool +
> the cluster controller framework, versus just using a single
> orchestration tool to keep these files in sync and handle cluster
> stop/start/restarts.

I think part of the question here is how much effort one is willing to
invest into installing and maintaining Zeek. If you (1) are very
familiar with Zeek, and (2) have a current orchestration tool in place
that's easy to extend with all the necessary pieces (incl. management
of restarts, logging, health monitoring), I agree that one tool sounds
better than two. However, if we look at it from the perspective of a
new user who wants to get Zeek running on their network quickly,
figuring out all those pieces is probably quite a hurdle. That
trade-off seems similar to ZeekControl today: people already have the
option to go through systemd, but ZeekControl remains the standard way
to run Zeek, even with all its quirks.

Re/ putting files in place everywhere: Per the design doc, I
definitely see distribution of packages and site-specific scripts as
in scope for future versions of the Controller. That would then leave
people with just the task of installing the same Zeek version
everywhere, which seems a reasonable expectation to me.

> If I need to reboot a system in a cluster, and it's running the manager and
> logger, I'd like to see another system in the cluster get promoted to being
> the manager and logger, and all the nodes to start talking to that instead.

I would like to see that, too. :-) However, this seems to be quite a
different thing from the systemd approach you are describing. How
would such a dynamic scheme operate without some kind of control layer
in between doing the coordination? In some future version, the Cluster
Controller would be the management component that can initiate changes
like dynamic failover. We can argue about whether that control layer
should be a central component (as the Controller proposes) vs. some
distributed consensus scheme, and also whether we should really
implement this ourselves or rather go with a 3rd-party tool for
coordination. But either way, I think something needs to be there.

> it feels a bit like the Cluster Controller framework is trying to take
> the old zeekctl features and get them to fit into a new model.

The proposal for the Supervisor/Controller model has been out for a
while, and the main point of feedback so far has been from folks who
wanted to ensure that we don't lose functionality that ZeekControl
offers today. So yes, that has been a starting point for fleshing out
a bunch of this: Can we retain what people like about ZeekControl, but
move it over into a new architecture that removes what they don't like
(e.g., copying binaries around), and do all that while facilitating a
more dynamic future world that increases Zeek's flexibility and
resilience? I'm not saying that the current Controller design achieves
all that already, but it has indeed been designed as an incremental
path forward rather than a let's-redo-it-from-scratch approach. Happy
to discuss whether that's the right trade-off.

Robin

Thanks, Robin. Your comments clarified some things, and were overall very helpful.

The main thing I somehow missed originally is that the plan is to enable multiple deployment models, while at the same time making it as easy as possible to get up and running. I was concerned that we were straying too far afield from the supported model, which we try to avoid, especially since anything we might share with the community would then no longer be applicable/useful.

There were a couple of places where the Ansible/systemd approach didn’t work well out of the box, due to assumptions built into the Supervisor framework. For instance, it’s assumed that if a process is running as a cluster node, it’s supervised, which has some undesirable implications. I can open up issues for those instances, to track finding a better way to do those specific things.

Otherwise, we’re working on spinning up a cluster that we’d like to use to test/develop some HA capabilities, and will report back.

–Vlad