Has anyone else ever had trouble with broctl getting confused about the status of a process? I ran into it a little while ago: when I issued a restart command, broctl thought all of my workers were dead. The new worker processes then failed to start because the old ones were in fact still running, and with the Myricom sniffer drivers you can only sniff an interface once.
We need to do some debugging on this, but it happens sporadically enough that it might be tough to track down. I wonder if there are issues with how broctl.dat gets written; I've run into problems before where that file wasn't written correctly. Would it make sense to move away from broctl.dat (which tracks cluster state) entirely and toward something like an SQLite database?
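Just to illustrate what I mean (a rough sketch only; the file name, table, and columns are made up, and broctl does not do any of this today):

    # Hypothetical SQLite-backed replacement for the key/value state that
    # broctl currently keeps in spool/broctl.dat. All names here are
    # invented purely for illustration.
    import sqlite3

    def open_state(path="state.db"):
        conn = sqlite3.connect(path)
        conn.execute("""CREATE TABLE IF NOT EXISTS node_state (
                            node   TEXT PRIMARY KEY,
                            pid    INTEGER,
                            host   TEXT,
                            status TEXT)""")
        return conn

    def set_state(conn, node, pid, host, status):
        # One transactional write per node: either the whole row updates
        # or nothing does, so there is no half-written file to corrupt.
        with conn:
            conn.execute("INSERT OR REPLACE INTO node_state VALUES (?, ?, ?, ?)",
                         (node, pid, host, status))

    def get_state(conn, node):
        return conn.execute("SELECT pid, host, status FROM node_state WHERE node = ?",
                            (node,)).fetchone()

The main appeal is that SQLite gives you atomic updates for free, so a crash mid-write can't leave the state half-written the way a plain text file can.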
I had the same issue a time or two. Running ‘broctl ps.bro’ right after ‘broctl status’ has become part of my new ritual before stopping/starting or just restarting any of my clusters.
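In case anyone wants to script that ritual, it's really just running the two commands back to back and eyeballing the output together; for example (this doesn't try to parse either command's output, it only prints both):

    # Run "broctl status" and "broctl ps.bro" back to back and print both,
    # so a mismatch (status says stopped, ps.bro still shows a worker
    # process) stands out. Assumes broctl is on the PATH.
    import subprocess

    def run(cmd):
        out = subprocess.run(cmd, capture_output=True, text=True)
        return out.stdout + out.stderr

    if __name__ == "__main__":
        print("=== broctl status ===")
        print(run(["broctl", "status"]))
        print("=== broctl ps.bro ===")
        print(run(["broctl", "ps.bro"]))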
That's actually perfect! Could you make a copy of your spool/broctl.dat before your next restart?
I'd like two copies of that file: one from before the restart and one from after it (assuming the restart fails). Comparing them could show whether corruption is creeping into broctl.dat at some point. This problem drives me crazy and I'd love to fix it.
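Something as simple as this (just an example helper, not part of broctl) would do the trick; run it once right before the restart and once right after:

    # Make a timestamped copy of spool/broctl.dat so the before/after
    # versions can be compared later. Paths are examples only.
    import shutil, time

    def snapshot(src="spool/broctl.dat", label="before"):
        dst = "broctl.dat.%s.%s" % (label, time.strftime("%Y%m%d-%H%M%S"))
        shutil.copy2(src, dst)
        return dst

    # e.g. snapshot(label="before"), then broctl restart, then snapshot(label="after")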
> I had the same issue a time or two. Running 'broctl ps.bro' right after 'broctl status' has become part of my new ritual before stopping/starting or just restarting any of my clusters.
Yes, but there are other failure modes you can look for even when the bro process is running:
- Your network interfaces may not be seeing any data.
- Data may be arriving on the interfaces and bro may be running, yet not actually processing the incoming traffic (hopefully this won't happen), so you also want to check that the logs are growing regularly.
Additional complexities:
- Your logs may be growing overall while the situation above affects only one worker node, in which case you would see the manager's logs growing but never notice that one node is consistently contributing nothing.
- ...and so on for various other failures. I must say these are rare, but on a production system you have to account for them, and over time they have happened to us.
We have a cron script which watches for these conditions.
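In case it's useful to anyone else, here is a stripped-down sketch of just the log-growth part (the path and threshold are examples; the real script also checks the interfaces and per-worker activity, which isn't shown here):

    # Warn if a log we expect to grow (e.g. the manager's current conn.log)
    # hasn't been written to recently. Path and threshold are made up.
    import os, sys, time

    LOG = "/usr/local/bro/logs/current/conn.log"   # example path
    MAX_IDLE = 300                                  # seconds without growth

    def main():
        age = time.time() - os.path.getmtime(LOG)
        if age > MAX_IDLE:
            sys.stderr.write("WARNING: %s idle for %d seconds\n" % (LOG, int(age)))
            sys.exit(1)

    if __name__ == "__main__":
        main()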