Rather lengthy post follows. In case people didn’t see it, much of this references this document: https://docs.google.com/document/d/1r0wXnihx4yESOpLJ87Wh2g1V-aHOFUkbgiFe8RHJpZo/edit
Over the past couple of weeks, I upgraded all of our standalone Zeek clusters (about a dozen) to Zeek 3.2, and I wanted to use this as an opportunity to try moving over to the new Supervisor framework.
I think there are cases where the Supervisor framework is a great fit, and for being at an early stage, it really does work well. However, I ended up not using the Supervisor framework at all, and just running everything directly from systemd (with no zeekctl). I realize that “just use systemd” isn’t a solution for everyone, but I’d like to share some thoughts in order to suss out whether there’s something inherently different about our environment/setup, or whether the Supervisor and Cluster Controller frameworks need some tweaking or rethinking.
Configuring the cluster via Supervisor was more difficult. With either approach, I need to define my cluster layout, which I do via an Ansible template. Then I need to tell the Zeek processes their working directory, CPU pinning assignment, environment variables, interface to sniff on, etc. These are all things that systemd can do natively, and I found it was much easier to do them via systemd templates.
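To make that concrete, here is a sketch of the kind of template unit I mean. The unit name, paths, CPU list, and environment variable values are illustrative assumptions, not taken from our actual configuration; the directives themselves (ExecStart, WorkingDirectory, CPUAffinity, Environment) are standard systemd:

```ini
# Hypothetical template unit: /etc/systemd/system/zeek-worker@.service
# Instantiated per sniffing interface, e.g. zeek-worker@ens1.service (%i = ens1).
[Unit]
Description=Zeek worker sniffing %i
After=network.target

[Service]
ExecStart=/usr/local/zeek/bin/zeek site/local -i %i
# Per-node working directory and CPU pinning, handled natively by systemd.
WorkingDirectory=/var/spool/zeek/worker-%i
CPUAffinity=2 3
# CLUSTER_NODE tells Zeek's cluster framework which node in the layout this is;
# the node name here is an example.
Environment=CLUSTER_NODE=worker-1
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

With a template like this, Ansible only has to render the cluster layout and the unit file; everything else is plain `systemctl enable`/`start`.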
Monitoring the cluster was also more difficult with Supervisor. When I check the status of the cluster, I just see a bunch of identically-named processes. I can't tell what's using excessive memory, and if I want to check the logs for a process, I need to go manually look at its stderr and stdout in its working directory.
With systemd, I see something like this:
Active: active since Fri 2020-08-28 08:13:19 PDT; 4 days ago
│ │ └─55473 /usr/local/zeek/bin/zeek site/local -i ens1
│ │ └─55253 /usr/local/zeek/bin/zeek site/local -i ens1
│ └─55074 /usr/local/zeek/bin/zeek site/local
│ └─55125 /usr/local/zeek/bin/zeek site/local
│ └─55114 /usr/local/zeek/bin/zeek site/local
└─55116 /usr/local/zeek/bin/zeek site/local
I immediately know what process is what, per-node logs go to my standard logging infrastructure, and I can restart a single node if it’s misbehaving. I even took this a step further, and configured systemd to kill and restart any worker if it ever swaps, but to allow the logger, manager, etc. to swap.
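One way to get that swap behavior is through systemd's cgroup resource controls. This is a sketch, not our exact configuration; the limits are illustrative and it assumes cgroup v2, but MemoryMax, MemorySwapMax, and Restart are real directives:

```ini
# Hypothetical drop-in for the worker units, e.g.
# /etc/systemd/system/zeek-worker@.service.d/memory.conf
[Service]
# Cap the worker's memory and forbid swap for its cgroup, so the kernel
# OOM-kills the worker instead of letting it swap...
MemoryMax=8G
MemorySwapMax=0
# ...and have systemd bring it straight back up.
Restart=always
RestartSec=5
```

The logger and manager units simply omit MemorySwapMax, so they're allowed to swap rather than being killed.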
Ultimately, given the choice between systemd + Supervisor versus just systemd, for our use case, just systemd gave us some distinct benefits and reduced complexity. My intent is not to be critical of the existing work; as I said, I think there are cases where it's a great fit. However, I did also want to share my experiences and thoughts after running it for a while.
That experience gave me a new perspective on the Cluster Controller framework. This is essentially another layer of orchestration over the cluster, except there are specific things it does not do: namely, it's not responsible for keeping binaries, packages, and scripts in sync across the cluster. So, in order to use it, I need some layer which will distribute these and ensure that they match across a cluster. Once again, it feels like I'm left with a choice between one orchestration tool + the Cluster Controller framework, versus just using a single orchestration tool to keep these files in sync and handle cluster stop/start/restarts.
Take, for instance, a cluster restart. This could either be planned (e.g. I upgraded a binary or script and want to have that change take effect), or unplanned (something crashed and the cluster is now degraded). In the first case, I’d need to use my orchestration tool to push out the file changes, and then it becomes trivial to have that handle the restart as well. In fact, having my orchestration tool do that also enables me to do things like fallback to the previous version should the upgrade go wrong. In the second case, rather than build another layer on top of it, I’d like to see more resiliency in the Cluster framework.
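Since the orchestration tool is already pushing the files, the restart-with-fallback flow is only a few tasks on top. A rough sketch in Ansible terms (the paths, unit name, and rollback source are hypothetical; the modules and block/rescue construct are standard Ansible):

```yaml
# Hypothetical tasks: push updated Zeek scripts, restart, roll back on failure.
- name: Push updated Zeek scripts to the cluster nodes
  ansible.builtin.synchronize:
    src: site/
    dest: /usr/local/zeek/share/zeek/site/

- name: Restart the Zeek units, rolling back if that fails
  block:
    - name: Restart Zeek worker
      ansible.builtin.systemd:
        name: "zeek-worker@{{ sniff_iface }}.service"
        state: restarted
  rescue:
    - name: Roll back to the previous script tree
      ansible.builtin.synchronize:
        src: site-previous/
        dest: /usr/local/zeek/share/zeek/site/
```

The point isn't this particular playbook; it's that once the sync layer exists, the coordinated restart comes nearly for free, and the rollback path lives in the same place.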
There are other capabilities, such as coordinated starting and stopping of the cluster, but the only time I perform those actions is when I have to (e.g. the system running the logger is going to reboot, and I know that if I don't stop the whole thing, logs will queue in memory and workers will crash).
So, I’m left wondering if the Cluster Controller framework is really solving the right problems. Rather than worry about coordinated restarts, I’d like to see the need for such restarts go away with better resiliency. If I need to reboot a system in a cluster, and it’s running the manager and logger, I’d like to see another system in the cluster get promoted to being the manager and logger, and all the nodes to start talking to that instead. I could see a design where all systems run managers and loggers, with only one active at a time. If the active manager shares state with the standby managers, a failover can happen seamlessly.
It’s true that the world has moved on from the model of broctl and zeekctl, but I feel like that’s because they’ve moved towards better resiliency in distributed systems, and it feels a bit like the Cluster Controller framework is trying to take the old zeekctl features and get them to fit into a new model.