Zeek Supervisor Command-Line Client

Don't recall any basic "project infrastructure" discussions happening
yet for the upcoming replacement/alternative for ZeekControl that we
want to introduce in Zeek 3.2 (roadmap/design links found at [1]), so
here are some starting questions.

# What to Name It ?

Suggestion: `zeekcl`, Zeek (Command-Line) CLient.

Open to ideas, but will use `zeekcl` below.

# What Programming Language ?

`zeekcl` has different/narrower scope than ZeekControl. It's more
clearly a "client" with sole job of handling requests/responses via
Broker without many (any?) system-level operations/integrations.
Meaning there may be less of an approachability/convenience gap
between C++ versus Python with `zeekcl` than there was with
ZeekControl.

Also nice if `zeekcl` doesn't require more dependencies beyond what
`zeek` needs since they're expected to be used together.

Is use of Python still desirable for other reasons? Otherwise, I lean
towards `zeekcl` being C++.

For reference/sanity-check in terms of what people expect `zeekcl` to
be: in my testing of the SupervisorControl framework [2], I had a
sloppy Zeek script implementing the full "client side" (essentially
the majority of what `zeekcl` will do) in ~100 LOC. Most operations
are that simple: send request and display response.
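To make the "send request and display response" shape concrete, here is a minimal sketch in Python. It is not the test script mentioned above and does not use Broker: the `Transport` callable, the `<command>_request` event naming, the response fields, and `fake_supervisor` are all hypothetical stand-ins chosen only so the example runs on its own.

```python
import uuid
from typing import Callable, Dict

# Hypothetical transport: takes (event_name, request_id, args) and returns a
# response dict. A real client would publish/subscribe through Broker instead.
Transport = Callable[[str, str, tuple], Dict[str, object]]


def run_command(transport: Transport, command: str, *args: str) -> str:
    """Send one request and format the matching response for display."""
    request_id = uuid.uuid4().hex  # lets the client pair response with request
    response = transport(f"{command}_request", request_id, args)
    if response.get("request_id") != request_id:
        raise RuntimeError("response does not match request")
    return f"{command}: {response.get('result')}"


def fake_supervisor(event: str, request_id: str, args: tuple) -> Dict[str, object]:
    """Stand-in for a supervisor (assumption), used only to make this runnable."""
    return {"request_id": request_id, "result": "ok"}


if __name__ == "__main__":
    print(run_command(fake_supervisor, "status"))
```

Whatever language the client ends up in, most of the work would be in the transport; the per-command logic stays about this small.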

That does mean the third option to consider besides either Python or
C++ is Zeek's scripting language (e.g. `ctl.zeek`), but I don't
suggest that since (1) using a full `zeek` process is way more than we
need and (2) the command-line interface is awkward (`zeek ctl
Supervisor::cmd="status"` versus `zeekcl status`).

# Where's the Source Code Live ?

Past experiences with ZeekControl being in a separate repo from Zeek
are negative in terms of CI/testing: changes in Zeek have broken
ZeekControl, but go uncaught for a while since it is tested
independently.

Since common use/maintenance will involve both `zeek` and `zeekcl`,
and I also don't expect the latter to accrue large amounts of code
deserving of a separate project, I plan to have `zeekcl` code/tests
live inside the main Zeek repo.

- Jon

[1] https://github.com/zeek/zeek/issues/582
[2] https://github.com/zeek/zeek/blob/689a242836092fba7818ba24724b74a7a7902e48/scripts/base/frameworks/supervisor/control.zeek

I’m still fuzzy on the Supervisor framework, as we’re still in the process of upgrading systems to the point of supporting the new C++ requirements.

As a concrete example, what does a cluster upgrade look like? Today, that means install the new version on the manager, and then do zeekctl deploy, which copies the files to the nodes and restarts the cluster. All of that is done without Broker.

What does that look like with zeekcl + Broker? Let’s say I install the new version on the manager. If I then tell zeekcl to destroy the running instance, will that work, or will the newer zeekcl be incompatible with the Broker version of the running Zeek?

Reading the script linked in [2], I notice that zeekcl would not support copying files from one node to another? Other features that would be missing that we routinely use are zeekctl print and zeekctl exec. I’m assuming zeekcl would be running in some uber-bare mode if it’s written in Zeek?

–Vlad

Suggestion: `zeekcl`, Zeek (Command-Line) CLient.

"zeekcl" is very close to "zeekctl", which could lead to confusion.
"zcl" maybe?

Is use of Python still desirable for other reasons? Otherwise, I lean
towards `zeekcl` being C++.

No particular preference from my side, I can see either. Effort is
probably about the same in this model, and C++ does have the advantage
of less dependency issues.

Zeek's scripting language (e.g. `ctl.zeek`), but I don't suggest that

Ack, agree.

I plan to have `zeekcl` code/tests live inside the main Zeek repo.

Makes sense to me as well.

Robin

As a concrete example, what does a cluster upgrade look like?

The idea is to handle this more like other system services: you'll be
in charge of getting the new Zeek version onto all your systems
yourself, using whatever method you use for other software as well.
For example, if you're installing through a package manager, you'd
just run "update" on all systems. If you're installing from source,
you'll either need to compile on each system, or copy the installation
over manually.

The underlying assumption is that people will already have a mechanism
in place for administration of their systems, and we shouldn't be
trying to reinvent the wheel, as ZeekControl oddly does. From a
sysadmin perspective, ZeekControl is really doing a lot more right now
than it should be doing; other tools don't work that way. We don't
want it to look like an APT anymore (https://github.com/zeek/zeek/issues/259). :)

Today, that means install the new version on the manager, and then do
`zeekctl deploy`, which copies the files to the nodes and restarts the
cluster. All of that is done without Broker.

There are two parts here: (1) deploying the Zeek installation itself,
and (2) deploying any configuration changes (incl. new Zeek scripts).

For (1), the above applies: we'll rely on standard sysadmin processes
for updating. That means you'd use "zeekcl" to shut down the cluster
processes, then run "yum update" (or whatever), then use "zeekcl"
again to start things up again. (The Zeek supervisor will be running
already at that point, managed through systemd or whatever you're
using).

(2) is still a bit up in the air. With 3.2, there won't be any support
for distributing configurations automatically, but we could add that
so that config files/scripts/packages do get copied around over
Broker. Feedback would be appreciated here: What's better, having
zeekcl manage that, or leave it to standard sysadmin process as well?

Reading the script linked in [2], I notice that zeekcl would not support
copying files from one node to another?

Correct right now; (2) may or may not change that.

zeekctl print

"print" will be supported (roadmap says not in 3.2 yet, but it should
be easy to do, maybe we can get it in still).

zeekctl exec.

"exec" will likely not be supported. We *could* support it, no
technical reason for not doing that over Broker. It just seems like
another thing that's better handled with different tools.

Robin

Thanks Robin, that helps.

I have a slightly different take: isn't it more common to expect
"start" and "stop" operations here to be done by the service-manager
rather than the Zeek client? I'm assuming "update/deploy Zeek
installation" could involve a change in the `zeek` binary, which
implements the supervisor process itself, so you'd want, at the level
of system services, to stop the entire Zeek process tree, including
the root supervisor.

That doesn't exclude the possibility of the client having operations
like "start" (spawn `zeek -j <config>`), "stop" (kill the root `zeek`
supervisor process), or even others that dynamically add/remove
cluster nodes from the tree, but that's probably not the
common/expected usage to prioritize since it's again back to the model of
the process tree being managed manually by the user, independent from
a system's service-manager.

- Jon

A clarification that may help you: the "orphaning" behavior isn't
related to Broker connections, it's related to the parent-child
relationship between processes. So there's a process tree here with
`zeek` in supervisor-mode at the root and child processes that are
individual cluster nodes (worker, manager, logger, proxy).

The normal termination behavior for the supervisor process is to
gracefully kill and wait for all children to exit. In the very
exceptional case of the supervisor exiting/crashing without having
cleaned up all children, those children will self-terminate upon
noticing they are no longer parented to the supervisor.
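As a rough illustration of that last case, a child can detect a lost parent by remembering its parent PID at startup and comparing later. This is a sketch of the general technique in Python, not a description of how `zeek` implements it; `make_orphan_check` is a hypothetical helper name.

```python
import os
from typing import Callable


def make_orphan_check(get_ppid: Callable[[], int] = os.getppid) -> Callable[[], bool]:
    """Record the parent PID now; the returned check reports whether it changed."""
    original_ppid = get_ppid()

    def is_orphaned() -> bool:
        # When the parent exits, the OS re-parents the child (commonly to
        # PID 1 or a subreaper), so the reported parent PID changes.
        return get_ppid() != original_ppid

    return is_orphaned


if __name__ == "__main__":
    check = make_orphan_check()
    # A node process would poll this periodically and exit once it is True.
    print("orphaned:", check())
```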

- Jon

Suggestion: `zeekcl`, Zeek (Command-Line) CLient.

"zeekcl" is very close to "zeekctl", which could lead to confusion.
"zcl" maybe?

Is use of Python still desirable for other reasons? Otherwise, I lean
towards `zeekcl` being C++.

No particular preference from my side, I can see either. Effort is
probably about the same in this model, and C++ does have the advantage
of less dependency issues.

I agree - I actually kind of like the idea that zeekcl does not have python as a dependency.

I plan to have `zeekcl` code/tests live inside the main Zeek repo.

Makes sense to me as well.

Agreed here too.

Johanna

I believe we're pretty close to saying the same thing. I'm making a
distinction between the supervisor Zeek process (which the service
manager starts & stops), and the cluster's node processes (manager,
workers, etc). The supervisor manages the latter and will by default
shut them down when it gets the "stop" from its service-manager. But I
think we also want their state controllable from the client as well,
so that one can have an orderly shutdown of a multi-system cluster
without loss of data (e.g., one probably wants to shut down workers
first to collect remaining log data). This is what I meant above by
"shutdown the cluster processes": "zeek-client stop" would tell the
supervisors to shut down their node processes (or rather: "zeek-client
stop workers", or maybe "zeek-client" would know the order in which to
stop nodes or systems). And I imagine one would do that before
starting a cluster-wide upgrade to the next Zeek version.
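A minimal sketch of what "knowing the order" could look like, in Python. Only "workers first" comes from the reasoning above; the rest of `STOP_ORDER`, the role names as plain strings, and the `shutdown_plan` helper are assumptions made for illustration.

```python
from typing import Dict, List

# Assumed stop order: workers first (so remaining log data can still be
# collected); the ordering of the other roles is a guess for illustration.
STOP_ORDER = ["worker", "proxy", "manager", "logger"]


def shutdown_plan(nodes: Dict[str, str]) -> List[List[str]]:
    """Group node names into batches to stop, one batch per role, in order.

    `nodes` maps a node name to its role.
    """
    unknown = sorted(name for name, role in nodes.items() if role not in STOP_ORDER)
    if unknown:
        raise ValueError(f"nodes with unknown role: {unknown}")
    plan = []
    for role in STOP_ORDER:
        batch = sorted(name for name, r in nodes.items() if r == role)
        if batch:
            plan.append(batch)
    return plan


if __name__ == "__main__":
    cluster = {"manager": "manager", "worker-1": "worker",
               "worker-2": "worker", "logger-1": "logger"}
    for batch in shutdown_plan(cluster):
        print("stop:", ", ".join(batch))
```

The client would stop one batch, wait for confirmation, then move on to the next.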

That said, your note on Slack sounds right: let's figure out the
single-system operation first and get that usable. I'm pretty
confident that we will then be able to build the multi-system model on
top of that without too much trouble, and it'll be easier to collect
requirements for administration/management of multi-system setups once
we've got some experience with single-system setups.

Robin

Ack, got it and agree that the distinction is likely helpful: the
supervisor node implements the low-level "dirty work" of stopping
processes and can ensure shutdown of its entire process tree if it
really has to, but the client can carry out shutdown logic with a
higher level of insight into directing a shutdown process (possibly
across many hosts) in orderly fashion.

Also, based on "naming" feedback: plan to use `zeekc`.

- Jon

Sorry for chiming in late on this...

Ack, got it and agree that the distinction is likely helpful: the
supervisor node implements the low-level "dirty work" of stopping
processes and can ensure shutdown of its entire process tree if it
really has to, but the client can carry out shutdown logic with a
higher level of insight into directing a shutdown process (possibly
across many hosts) in orderly fashion.

I think that the script we ship with zeek that effectively implements the supervisor behavior should understand the business logic of shutting down a cluster in the correct order. One way to think about it is that the supervisor script will presumably understand the business logic for starting a cluster in the right order so consequently it would seem that it should understand how to shut down the cluster as well.

We talked about it recently and now that I've had some more time to think about it I'm really starting to think that the business logic for correctly starting and stopping a cluster should be fully implemented in the supervisor script. The zeekc tool could then just be a dumb tool that says to start and stop and doesn't end up causing us to spread our logic around to other tooling.

   .Seth

How would that then work across multiple systems?

Robin

Maybe the important observation is that the logic can be performed
anywhere that has access to the Zeek-Supervisor process.

* The Supervisor process itself would be able to perform the logic via
direct BIF access.

* External processes, like zeekc, have access to a Zeek-event
interface to indirectly access those same BIFs, so they can also
execute equivalent logic (either via multiple events, or a single
"convenience" event that implements a sequence of BIF calls on remote)

When we bring multi-hosting into the mix, it's still a similar
situation, just with beefed up logic for orchestrating
node-type-specific steps across many peers: anyone with access to the
Zeek-event interface could implement this logic. You could pick zeekc
to orchestrate, or you could pick a single Zeek-Supervisor process to
orchestrate between other Supervisors, or you could pick a regular
Zeek process, or you could write a Python script just using Broker
Python bindings, etc.

So where we put the logic at this point may not be important. If we
can find a single-best-place for the logic to live, that's great, but
if there's utility for others to have their own
independent-yet-equivalent logic, I don't see a problem with that.

- Jon

Maybe the important observation is that the logic can be performed
anywhere that has access to the Zeek-Supervisor process.

Agree.

So where we put the logic at this point may not be important. If we
can find a single-best-place for the logic to live, that's great

I believe that's what Seth is arguing for: have a Zeek-side script be
the single point of that logic, rather than implement it multiple
times and/or outside of Zeek.

I can see doing that in Zeek but I think there's a trade-off here: if
we want to do the single-place approach with a multi-system setup, we'd
need an authoritative place to run this logic and hence depend on
*that* Zeek supervisor being up and running for performing the
operation. That may be a reasonable assumption (say if we dedicated
the supervisor running the manager to also be the cluster
coordinator), but it's different from a world where the client can
execute higher-level operations on its own.

Robin