Zeek monitoring

Hey!

How do you monitor your Zeek instance?

What are the things you would be looking for?

I'd like to compile a list of 'standard' checks that are considered
good practice (and have it added to the official documentation
later).

This is what we are currently doing, or planning to do:

Hardware checks - nothing exciting here
- interface status on the packet broker side
- is the traffic on each interface within expected range? alert if too
low (someone unplugged a tap?)
- is the signal level what it should be? alert if signal level is too
low (important for fibre)
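
To make the "traffic within expected range" check concrete, here is a minimal sketch in Python. It assumes you can sample a receive byte counter twice (on a Linux sensor, /sys/class/net/<iface>/statistics/rx_bytes is one source; on the packet broker you would need whatever SNMP or vendor interface it offers). The function names and thresholds are made up for illustration, not taken from any existing tool.

```python
def rx_rate_bps(prev_bytes, cur_bytes, interval_s):
    """Average receive rate in bits per second between two counter samples."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    delta = cur_bytes - prev_bytes
    if delta < 0:
        # Counter was reset or wrapped; this sample pair is unusable.
        return None
    return delta * 8 / interval_s


def classify_rate(rate_bps, low_bps, high_bps):
    """Compare a rate against hypothetical per-interface thresholds."""
    if rate_bps is None:
        return "unknown"
    if rate_bps < low_bps:
        return "too_low"   # e.g. someone unplugged a tap
    if rate_bps > high_bps:
        return "too_high"
    return "ok"
```

The thresholds would have to be tuned per interface, probably per time of day as well, since "expected range" at 3am is not the same as at noon.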

OS level checks - is each sensor up, memory usage, CPU usage, disk space, etc.

App level checks
- how many processes with the name 'bro' are running?
- have we seen 'I am alive' events in the last N minutes?
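
A rough Python sketch of these two app level checks. It assumes the process list comes from `ps -eo comm=` and that you record the timestamp of the last 'I am alive' event somewhere yourself; the function names, the expected process count and the maximum age are placeholders.

```python
import subprocess


def list_process_names():
    """Return the command name of every running process (one per line of ps)."""
    out = subprocess.run(
        ["ps", "-eo", "comm="], capture_output=True, text=True, check=True
    )
    return out.stdout.splitlines()


def count_named(process_names, target="bro"):
    """Count processes whose command name is exactly `target`."""
    return sum(1 for name in process_names if name.strip() == target)


def process_count_ok(process_names, expected, target="bro"):
    """True if exactly `expected` processes named `target` are running."""
    return count_named(process_names, target) == expected


def heartbeat_stale(last_seen, now, max_age_minutes):
    """True if the last heartbeat (epoch seconds) is older than the limit."""
    return (now - last_seen) > max_age_minutes * 60
```

Usage would be something like `process_count_ok(list_process_names(), expected=10)`, where 10 is whatever number of workers, proxies, manager and logger you run on that box.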

What I am also thinking about
- monitoring the log lag (Justin has a script for that)
- have we seen at least N events of type M from sensor K in a time
window of O minutes? Repeat for each log file, for each sensor
- packet loss (expect me to publish some code soon, because you are
doing it wrong :wink:)
- errors (hardware counters for each card, for example)
- crashing processes - e.g. process Y has crashed N times in the
last 15 minutes
- weird.log size or just how fast it is growing
- reporter.log size or just the growth velocity (useful to catch
errors in scripts and undefined variable access)
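
The "at least N events in O minutes" check and the "crashed N times in 15 minutes" check are really the same thing: count timestamps in a sliding window and compare against a threshold. A minimal Python sketch, assuming you already have the event or crash timestamps as epoch seconds (how you collect them per sensor and per log type is up to you; all names here are hypothetical):

```python
def count_in_window(timestamps, now, window_s):
    """Count timestamps that fall in the half-open window (now - window_s, now]."""
    start = now - window_s
    return sum(1 for ts in timestamps if start < ts <= now)


def too_few_events(timestamps, now, window_minutes, min_count):
    """True if fewer than `min_count` events were seen in the window."""
    return count_in_window(timestamps, now, window_minutes * 60) < min_count


def too_many_crashes(timestamps, now, window_minutes, max_count):
    """True if more than `max_count` crashes were seen in the window."""
    return count_in_window(timestamps, now, window_minutes * 60) > max_count
```

You would call `too_few_events` once per (sensor, log type) pair with its own N and O, and `too_many_crashes` once per process with a 15 minute window.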

I'm also thinking about monitoring how frequently the broctl cron has
to respawn processes. Is there any useful way to tell that?

Feel free to comment, critique or add to the list.