Zeek Supervisor: designing client and log archival behavior

Looking for feedback on the design/plan for these two Zeek Supervisor
components:

* https://github.com/zeek/zeek/wiki/Zeek-Supervisor-Client
* https://github.com/zeek/zeek/wiki/Zeek-Supervisor-Log-Handling

- Jon

* https://github.com/zeek/zeek/wiki/Zeek-Supervisor-Log-Handling

This overall sounds good to me. Some notes & questions:

Log Rotation

> To help bridge/replace Step (4) and (5), suggest adding a new option:
> Log::default_rotation_dir. The Log::rotation_format_func() will use
> this as part of its default return value.

Seems we should then set this to "." by default, and have the cluster
framework override it.

> The log_mgr will attempt to create necessary dirs just-in-time,
> failing to do so emits an error, but otherwise continues with rotation
> using working directory instead.

I'd extend this to any error case: if moving from current location to
Log::default_rotation_dir fails (e.g., because the latter is on a
different file system), continue with new name inside the current
working directory (and report the error).
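
To make the fallback concrete, here is a rough Python sketch of the behavior described above. It is not Zeek's implementation; `rotate_into` and its parameters are hypothetical names used only for illustration. It relies on `os.rename`, which raises an error instead of copying when the destination is on another file system, so that case takes the same fallback path as a directory that cannot be created.

```python
import os
import sys


def rotate_into(src, rotation_dir, rotated_name):
    """Move src to rotation_dir/rotated_name; on any error, fall back to
    renaming it inside the current working directory.

    Returns the final path of the rotated file.
    """
    try:
        # Create the rotation directory just-in-time.
        os.makedirs(rotation_dir, exist_ok=True)
        dst = os.path.join(rotation_dir, rotated_name)
        # os.rename does not copy across file systems; it raises OSError.
        os.rename(src, dst)
        return dst
    except OSError as err:
        # Report the error, then continue with the new name in the cwd.
        print(f"rotation into {rotation_dir!r} failed: {err}", file=sys.stderr)
        fallback = os.path.join(".", rotated_name)
        os.rename(src, fallback)
        return fallback
```

The point of the design is that a rotation never gets skipped: the log always ends up under its rotated name somewhere, and the error report says why it is not in the expected directory.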

Once moved, I suppose we would continue to optionally run a
post-processor, right? For a supervised cluster, we wouldn't use that
and suggest that people go with "zeek-archiver" instead; but with
ZeekControl we'd keep the current gzipping behavior so
that we don't break any setups.

We can implement that distinction through the post-processor function:
the new default function would just do the rename according to the new
scheme, and a separate legacy function for ZeekControl spawns the
"archive-log" script.

zeek-archiver

I like making this a standard tool, but it seems like something we could
postpone for now so that we can prioritize getting the Zeek-side
infrastructure in place.

> We can potentially have the Zeek Supervisor process configurable to
> auto-start and keep a zeek-archiver child alive.

I'd say that's a job for systemd (or whatever service manager). I know
Seth disagrees. :)

Leftover Log Rotation

> The rotation for such a leftover log file uses the metadata in the
> shadowfile to help try to go through the exact rotation that should
> have occurred, including running the postprocessor function.

Not sure it's worth retaining the information about the post-processor
function, and it could potentially lead to trouble if the function
changed somehow in between (or disappeared). We could instead just run
the leftovers through whatever the restarted config says to do with
files.

Do we even need any other metadata at all in the new scheme? I'm
wondering if we could simplify this all to: "If at open() time, X.log
exists, first rotate it away through the currently configured
postprocessor function". If we did that, we should probably have a
global boolean that lets one choose between that and just overwriting
existing files. The latter would be the default to retain current
command-line behavior, and the cluster framework would enable leftover
recovery.

Hmm, actually, there's a piece of metadata that we'll need: the opening
timestamp, so that one can incorporate that into the name of the
rotated file (assuming we want to retain that capability). Unless we
parsed that out of the X.log itself ...

Robin

* https://github.com/zeek/zeek/wiki/Zeek-Supervisor-Client

Some thoughts on the commands:

$ zeekc status [all | <node_name>]

Do we need to include any other metrics in the returned status?

That information is mostly static; it would be nice to get some dynamic
information in there as well, like uptime and CPU/memory/traffic stats.
No need to have that right away, but worth keeping in mind.

# Do we need more categories to filter by (e.g. node type) ?

I'd skip for now.

# If there's downed nodes at this point, what do we expect users to do?
# Check the standard services logs for stderr/stdout info? Check reporter.log ?

Yeah, would be cool if zeekc had access to the stderr/stdout from the
nodes through their supervisors. The supervisors could buffer that for
a while and return it on request. More generally, the supervisor could
get a "diagnostics buffer" that, over time, we could use for more
things, like storing backtraces.
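
As a sketch of what such a buffer could look like, here is a bounded per-node line buffer in Python. This is only an illustration of the idea, not a proposal for the actual supervisor code; the class name, method names, and the default limit are all made up.

```python
from collections import deque


class DiagnosticsBuffer:
    """Keep the most recent output lines per node, dropping the oldest."""

    def __init__(self, max_lines=1000):
        self.max_lines = max_lines
        self._lines = {}  # node name -> deque of recent lines

    def append(self, node, line):
        """Record one line of stderr/stdout output for the given node."""
        buf = self._lines.setdefault(node, deque(maxlen=self.max_lines))
        buf.append(line)

    def tail(self, node, n):
        """Return up to the last n lines recorded for node."""
        buf = self._lines.get(node, ())
        return list(buf)[-n:] if n > 0 else []
```

A client request for the last N lines of a downed node's output would then map onto something like `buffer.tail("worker-1", 20)`, with memory use capped by `max_lines` per node.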

"reporter.log" is out I'd say, that will go through the normal log
rotation & archival, and be accessible that way.

# A `zeekc diag` command could help gather information, like ask Zeek supervisor
# to find core dumps and extract stack trace. Would it do more than that, like
# show last N lines of downed nodes' stderr, or last N lines of reporter.log?

$ zeekc check

I'm wondering which supervisor that would be talking to in a
multi-system setup? All?

$ zeekc terminate
...

# Normally wouldn't terminate the supervisor if a service-manager is handling
# the Zeek supervisor process itself and will just restart it, but `terminate`
# would be helpful for anyone running a supervised Zeek cluster "manually".

Another use case: If for some reason one wants to restart the
supervisor itself, "terminate" would kill it and the service
manager would then restart it.

Robin

> Log::default_rotation_dir

> Seems we should then set this to "." by default, and have the cluster
> framework override it.

Yes, exactly.

> Once moved, I suppose we would continue to optionally run a
> post-processor, right? For a supervised cluster, we wouldn't use that
> and suggest that people go with "zeek-archiver" instead; but with
> ZeekControl we'd keep the current gzipping behavior so
> that we don't break any setups.

Yes, with the proposed changes, custom postprocessors still work the
same as before and everything is backwards compatible / equivalent in
non-supervised-mode.

Supervised-mode is just picking some different default settings from
non-supervised-mode:

* don't use a postprocessing script (archive-log)
* rotate into a `Log::default_rotation_dir` of "log-queue" instead of "."

> Not sure it's worth retaining the information about the post-processor
> function, and it could potentially lead to trouble if the function
> changed somehow in between (or disappeared). We could instead just run
> the leftovers through whatever the restarted config says to do with
> files.

* Disappeared: easy to notice the function no longer exists and
fall back to the default post-processor

* Changed: running through a function of the same name that happened to
change between restarts is probably still going to be closer to
what the user expects than running it through the default post-processor,
which is completely different?
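
The "disappeared" case is cheap to handle. A minimal Python sketch of the lookup, assuming the shadow file records the post-processor by name (the function and parameter names here are hypothetical):

```python
def resolve_postprocessor(recorded_name, available, default):
    """Return the function registered under recorded_name, or default if
    no function of that name exists any more."""
    func = available.get(recorded_name)
    return func if func is not None else default
```

A function that changed between restarts is simply picked up under its old name, which is the behavior argued for in the second bullet.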

> Do we even need any other metadata at all in the new scheme? I'm
> wondering if we could simplify this all to: "If at open() time, X.log
> exists, first rotate it away through the currently configured
> postprocessor function".

What if an open() rarely or never happens again for a given log?

I'm thinking the rotation of leftover logs needs to happen once at
startup rather than lazily.

> Hmm, actually, there's a piece of metadata that we'll need: the opening
> timestamp, so that one can incorporate that into the name of the
> rotated file (assuming we want to retain that capability). Unless we
> parsed that out of the X.log itself ...

Don't think we'd have the opening timestamp to parse from the log when
LogAscii::use_json=T.

So I still think it's necessary to obtain the open-time metadata from a
`.shadow.X.log`: either it's explicitly in there, or we use the file's
modification time (essentially its creation time).

The close-time of X.log is just taken as last-modified time of X.log.
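
Roughly, the startup pass could look like the following Python sketch. It only illustrates the timestamp logic and is not the actual implementation: the function names and the rotated-name format are made up, and it assumes the shadow file is written once when the log is opened, so its modification time approximates the open time.

```python
import os
import time


def find_leftover_logs(directory):
    """Return (log_path, open_time, close_time) for each log that still has
    a shadow file, i.e. was not rotated before the previous process exited.
    """
    leftovers = []
    prefix = ".shadow."
    for entry in sorted(os.listdir(directory)):
        if not entry.startswith(prefix):
            continue
        shadow_path = os.path.join(directory, entry)
        log_path = os.path.join(directory, entry[len(prefix):])
        if not os.path.isfile(log_path):
            continue  # shadow file without a matching log: nothing to rotate
        open_time = os.path.getmtime(shadow_path)  # shadow written at open
        close_time = os.path.getmtime(log_path)    # last write to the log
        leftovers.append((log_path, open_time, close_time))
    return leftovers


def rotated_name(log_path, open_time, close_time):
    """Build a rotated file name carrying both timestamps (illustrative format)."""
    base, ext = os.path.splitext(os.path.basename(log_path))
    fmt = "%Y-%m-%d-%H-%M-%S"
    start = time.strftime(fmt, time.gmtime(open_time))
    end = time.strftime(fmt, time.gmtime(close_time))
    return f"{base}.{start}__{end}{ext}"
```

Running this once at startup, before any new log is opened, covers logs that are never opened again and works the same way for JSON logs, since nothing is parsed out of the log itself.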

- Jon

> What if an open() rarely or never happens again for a given log?

Ah, right, forgot about that case. So yeah, agree, the shadow files
are useful for this and to retain whatever information we need.

> * Changed: running through a function of the same name that happened to
> change between restarts is probably still going to be closer to
> what the user expects than running it through the default post-processor,
> which is completely different?

I was thinking not the default post-processor, but whatever is
configured for the log file we are just opening (if we did it at
open() time). But yeah, that won't work when the cleanup has already
happened before the new open.

Robin