I upgraded my Zeek cluster to LTS 5.0 last week; since then, log generation stops 6-7 hours after a restart. Whenever I restart the cluster, it logs for a few hours and then stops. There are no errors in the stderr log or the diag output; everything seems normal, so I'm unable to figure it out.
Huh, this is weird; I've never heard of this specific error case before.
Is the output of zeekctl status and zeekctl peerstatus also normal? Could you also check broker.log? It might contain information about this.
Finally, can you try stopping and restarting the logger node (leaving the other nodes running) and see if that gets logs flowing again? That's obviously not a real fix, but it might help determine whether the problem is solely on the logger's side.
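A minimal sketch of that diagnostic sequence, assuming a default zeekctl setup, a logger node named "logger", and logs under /opt/zeek/logs (adjust the node name and paths to match your node.cfg and installation):

```shell
# Check that all nodes are up and that they are peered with each other.
zeekctl status
zeekctl peerstatus

# Look for connection or peering errors in the current broker.log.
# (The path is an assumption; adjust to your spool/log directory.)
tail -n 50 /opt/zeek/logs/current/broker.log

# Restart only the logger node, leaving the rest of the cluster running.
zeekctl stop logger
zeekctl start logger
```

If logs start flowing again after restarting just the logger, that points at the logger side rather than the workers.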
Can you also check the output of reporter.log? Sometimes errors are reported there too.
Thanks for the response.
There are no errors in any of the logs, except one in reporter.log. zeekctl status and peerstatus both seem normal.
Reporter::ERROR no such index (Notice::tmp_notice_storage[Notice::uid]) /opt/zeek/share/zeek/policy/frameworks/notice/extend-email/hostnames.zeek, line 39
# Extend email alerting to include hostnames
I have some local additions in local.zeek that were also working fine before the upgrade. I've restarted the cluster with the default configuration, and it's logging so far.
Has this reproduced again?
Yes, it's happening again. The cluster stopped generating logs after a couple of days. I see only these three logs, stats.log, stderr.log, and stdout.log, with no useful information in them.
When I stop the cluster manually, it kills each worker process; a normal stop does not happen.
Due to the lack of errors in the logs, I haven't been able to figure out the root cause of the issue.
Just so it's in public (we've had some discussion on Slack): I'm seeing the same thing. Oddly, I have two clusters seeing the same traffic with the same configuration, and only one of them is affected. The difference is that the affected cluster is production, with a physical manager plus two physical worker systems (16 workers apiece); the testing cluster has a VM manager plus one physical worker system (16 workers).
In the production cluster I have commented out extend-email/hostnames.zeek in local.zeek and done a deploy. Previously, logging would stop anywhere from 3 hours to 3 days after a deploy, so we'll see.
That made no difference; it fell over again within about 2 hours of starting.
Found an ‘out of memory’ error on all the nodes. The cluster was initially configured with jemalloc.
Out of memory: Killed process 2566999 (zeek) total-vm:14135852kB, anon-rss:11044432kB, file-rss:129972kB, shmem-rss:0kB, UID:501 pgtables:25956kB oom_score_adj:0
oom_reaper: reaped process 2566999 (zeek), now anon-rss:0kB, file-rss:131072kB, shmem-rss:0kB
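For anyone else chasing this, kernel OOM-killer messages like the ones above can be found with either of the following (assuming access to the kernel ring buffer or a systemd journal; host specifics may vary):

```shell
# Search the kernel ring buffer for OOM-killer activity with timestamps.
dmesg -T | grep -i -E 'out of memory|oom_reaper'

# Or, on systemd hosts, search the kernel messages in the journal.
journalctl -k | grep -i 'oom'
```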
I do occasionally have a worker OOM, and I also use jemalloc. But only on the production cluster, not the testing one.
I should say, the OOMs aren't usually associated with the logs stopping. But when a worker crashes from an OOM, it doesn't seem to restart correctly.
Tim and I spent some time with my production cluster again and verified that everything should be working; it just isn't. The worker hosts are sending no traffic to the manager host. I've made my prod cluster match my dev cluster, with just one worker host now, and we'll see if the problem repeats.
@akash If you're not using them, could you try disabling the known-* scripts? Add the following to local.zeek:
redef Known::use_host_store = F;
redef Known::use_service_store = F;
redef Known::use_cert_store = F;
We’ve had some problems with those scripts recently, specifically in the round of testing for 5.0. We thought we’d ironed out all of the issues but early testing on @earlesswondercat’s system seems to indicate that they might be a culprit.
Thank you for the response.
I reverted to version 4 last week. We do use these scripts; however, I can test whether disabling them helps in v5.