Request for Feedback - Zeek Process Supervision Model

I just published some design thoughts related to a major new Zeek
feature that's planned/upcoming: a process supervision model that may
act as an alternative (successor) to BroControl. Find that here:

https://blog.zeek.org/2019/03/beyond-brocontrol-new-process.html

Feel free to use this mailing list / thread to provide feedback, thanks.

- Jon

I’m excited to see this. I think it’s a great design choice. This sentence is my favorite, “We need to make it easy to test, from the command-line, using just PCAP files, a complete cluster deployment (scaled down) as it would work in production.”

I’m looking forward to it!

-AK

This would be awesome to have, especially in a cluster environment. Testing new scripts before we push them to production is a bit challenging sometimes, so being able to reliably and repeatably test them in a clustered environment would be awesome.

Personally, I think it would be poor design to rebuild host OS monitoring inside the Zeek supervisor. That should be left to the many other projects specifically designed to monitor disk usage, etc. That said, exposing some metrics about Zeek at the application layer sounds like it would be a win, though even that might be outside the scope of a supervisor.

Overall, I’m in agreement with what I’m reading in these responses as well as the design docs. I think this is much needed and I’m glad it is getting the focus it deserves.

- Sam

High resource usage of broctl prevents me from running it at home… so hopefully that can be improved upon.

James

Thanks a lot for doing this. Those who don’t want to replace broctl shall do a triple back salto. No one? I see.

I have only one request so far, still reading the proposal. Can we make sure that we support a configuration where in a stable state (after initialization has been done) there is only one worker process per core, without all those run-bro scripts and the like?

1 process per core = timer ticking disabled = trips to kernel and back minimized, no partial cache flushing, no partial TLB flushing and higher performance.
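To make the one-worker-per-core idea concrete, here is a minimal Python sketch. It is not part of the proposal: `plan_pinning` and `pin_current_process` are hypothetical helper names, and `os.sched_setaffinity` is only available on Linux. Pinning by itself does not disable the timer tick; that also depends on kernel configuration.

```python
import os


def plan_pinning(num_workers, cores):
    """Assign each worker index its own core; refuse to oversubscribe."""
    ordered = sorted(cores)
    if num_workers > len(ordered):
        raise ValueError("more workers than available cores")
    return {worker: ordered[worker] for worker in range(num_workers)}


def pin_current_process(core):
    """Restrict the calling process to a single core (Linux-only call)."""
    os.sched_setaffinity(0, {core})
```

Under these assumptions, each worker would call `pin_current_process(plan[i])` once after it has been started, so that in the stable state exactly one worker runs on each core in the plan.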

That's planned...
  Improve main event loop (zeek/zeek issue #264 on GitHub)

   .Seth

Excellent... looking forward to it!

James

Just tell yourself that all of the processes being spawned and supervised are just threads, and you may think about this project differently. The fact that we will be spawning and monitoring child processes is merely an implementation detail. If we chose to offload the responsibility for starting and managing all of the processes to something like systemd, it would tie us specifically to systemd (and we definitely don't want to maintain compatibility with multiple supervisors).

The benefit of this approach is that, from the OS perspective, it's easy to run under any system supervisor and in Docker, since it effectively has the same model of "run in the foreground and monitor that the process is still alive". There is an additional benefit: we've been discussing doing an "early fork" of the supervisor process so that all of the processes derive from the same binary (same initial memory image). You can think of that early fork like a stem cell: the supervisor can tell it to fork again and specialize into a particular cluster process. This ensures that all of the processes are the same. Otherwise, if systemd restarted one of the workers and the binary on disk had changed in the intervening time, the worker would end up being a different process (a different version of Zeek?). I know it's a somewhat contrived example, but it's always surprising to see the problems that come up in the real world, so the more potential problems we can avoid up front in the design, the better.

Another benefit to this approach is that a full cluster can be started from the command line really easily and will run in the foreground. It's been really fascinating using the prototype as it is.
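As a rough illustration of the early-fork ("stem cell") idea, here is a minimal Python sketch for Unix-like systems. It is not the Zeek prototype: the names `run_cluster`, `stem_main` and `node_main` are hypothetical, and the pipe-based command channel is just one assumed way for a supervisor to tell the stem to fork and specialize.

```python
import os


def node_main(role, report_fd):
    """Hypothetical node body; a real node would run its event loop here."""
    # Report our role and parent PID so the supervisor can see who forked us.
    os.write(report_fd, f"{role} {os.getppid()}\n".encode())


def stem_main(cmd_fd, report_fd):
    """Stem process: fork one specialized child per role name read from cmd_fd."""
    children = []
    with os.fdopen(cmd_fd) as commands:
        for line in commands:
            pid = os.fork()
            if pid == 0:
                try:
                    node_main(line.strip(), report_fd)
                finally:
                    os._exit(0)  # never fall back into the stem's loop
            children.append(pid)
    for pid in children:
        os.waitpid(pid, 0)


def run_cluster(roles):
    """Supervisor: fork the stem early, request nodes, collect their reports."""
    cmd_r, cmd_w = os.pipe()
    rep_r, rep_w = os.pipe()
    stem_pid = os.fork()
    if stem_pid == 0:
        try:
            os.close(cmd_w)
            os.close(rep_r)
            stem_main(cmd_r, rep_w)
        finally:
            os._exit(0)
    os.close(cmd_r)
    os.close(rep_w)
    with os.fdopen(cmd_w, "w") as commands:
        for role in roles:
            commands.write(role + "\n")
    # EOF arrives once the stem and every node have exited.
    with os.fdopen(rep_r) as reports:
        lines = reports.read().splitlines()
    os.waitpid(stem_pid, 0)
    return stem_pid, [(line.split()[0], int(line.split()[1])) for line in lines]
```

Calling `run_cluster(["manager", "proxy", "worker-1"])` returns the stem's PID and one `(role, parent_pid)` pair per node. Every node is forked from the stem's memory image rather than re-executed from disk, so a binary that changes on disk later cannot produce a mismatched node.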

   .Seth

All of that optimization should be possible. It wasn't included in the proposal because it should be possible to build it on top of whatever this ends up looking like, and I don't think any of us quite know what the actual config would look like. If you have suggestions about how the configuration should look or work, or thoughts about the mechanism it should use, feel free to speak up. :)

   .Seth

One thing I haven’t seen specifically called out yet (perhaps I missed it) is making sure we keep the functionality of broctl commands that aren’t really about managing processes, like ‘check’, ‘print’, ‘diag’, etc. I could see them being part of a separate tool, but I find them extremely valuable for debugging.

-Dop

We haven't specified or deeply discussed what some of the extra tooling will look like yet, but one goal we have is to simplify everything and cut out features that aren't utterly critical or that can be done better by other system tools. (We'll obviously be watching community feedback and discussion on what stays and goes as we keep moving forward!)

   .Seth