I upgraded my zeek cluster to LTS 5.0 last week,since then the logs have stopped being generated after 6-7 hours of restart. Whenever I restart the cluster it logs for a few hours and then stops logging. There are no errors in the stderr log and diag output, everything seems normal, unable to figure it out.
Huh, this is weird, I never have heard of this specific error case before.
Is the output of
zeekctl status and
zeekctl peerstatus also normal?
Could you also check the output of
broker.log - they might have information about this.
Finally, can you try stopping/starting the logger node (leaving the other nodes running) and see if this gets logs flowing again? (That obviously is no great fix, but might help determining if the problem is solely on the side of the logger).
Can you check output of
reporter.log? Sometimes errors are reported there too.
Thanks for the response.
There are no errors in any logs, except one in reporter.log. zeekctl status, peer status all seems normal.
Reporter::ERROR no such index (Notice::tmp_notice_storage[Notice::uid]) /opt/zeek/share/zeek/policy/frameworks/notice/extend-email/hostnames.zeek, line 39
# Extend email alerting to include hostnames
I’ve some local additions in local.zeek that too were working fine before the upgrade. I’ve restarted the cluster with the default configuration and it’s logging so far.
Has this reproduced again?
Yes, it’s happening again. The cluster stopped logging/generating logs after a couple days. I see only these 3 logs, stats.log stderr.log stdout.log with no information.
On manually stopping the cluster, it kills each worker process, normal stop is not happening.
Due to lack of errors in logs, I’ve not been able to figure out the root cause of the issue.
Just so it’s in public (we’ve had some discussion on Slack) I’m seeing the same thing. Oddly, I have two clusters seeing the same traffic, same configuration, only one cluster is seeing this. Difference between clusters is the one seeing this is production has a physical manager + two physical worker systems (16 workers apiece); the testing one has a VM manager + one physical worker system (16 workers).
In the production cluster I have commented out the extend-email/hostnames.zeek from local.zeek and done a deploy. Previously I would see the no logs generated starting anywhere from 3 hours to 3 days after a deploy, so we’ll see.
That made no difference, it fell over again within about 2 hours of starting.
Found an ‘out of memory’ error on all the nodes. The cluster was initially configured with jemalloc.
Out of memory: Killed process 2566999 (zeek) total-vm:14135852kB, anon-rss:11044432kB, file-rss:129972kB, shmem-rss:0kB, UID:501 pgtables:25956kB oom_score_adj:0
oom_reaper: reaped process 2566999 (zeek), now anon-rss:0kB, file-rss:131072kB, shmem-rss:0kB
I do occasionally have a worker OOM, I also use jemalloc. But - only on the production cluster, not the testing one.
I should say - the OOMs aren’t usually associated with the logs stopping. But if a worker crashes from an OOM, it doesn’t seem to restart correctly.
Tim and I spent some time with my production cluster again, and verified that everything should be working, it just isn’t. Worker hosts are sending no traffic to the manager host. I’ve made my prod cluster be like my dev, with just one worker host now, and we’ll see if it repeats.
@akash If you’re not using them, could you try disabling the
known-* scripts? Add the following to local.zeek:
redef Known::use_host_store = F; redef Known::use_service_store = F; redef Known::use_cert_store = F ;
We’ve had some problems with those scripts recently, specifically in the round of testing for 5.0. We thought we’d ironed out all of the issues but early testing on @earlesswondercat’s system seems to indicate that they might be a culprit.
Thank you for the response.
I reverted back to version 4 last week. We do use these scripts, however, I can test if disabling these scripts help in v5.
@akash - could you share a few details about your setup and infrastructure? Are you running workers, loggers, proxies on separate hosts or are all nodes co-located on the same physical host?
We have a potential lead, but so far could only reproduce when workers and loggers are located on separate hosts.
This issue sounds like the issue I have run into. Posted in another thread with details there.
Short description: Ubuntu 20.04 nodes. One node runs manager, logger, proxy. Other node is configured with four workers. Running zeek 5.0.0.
I only get the following log files generated on the manager node:
zeek@elkstack:/opt/zeek/logs/current$ ls -l
-rw-rw-r-- 1 zeek zeek 1047 Aug 24 18:06 reporter.log
-rw-rw-r-- 1 zeek zeek 1360 Aug 24 18:10 stats.log
-rw-rw-r-- 1 zeek zeek 0 Aug 23 19:55 stderr.log
-rw-rw-r-- 1 zeek zeek 188 Aug 23 19:55 stdout.log
After some period of time the workers end up in crashed state. Estimate several hours before this happens.
stats.log looks like data was collected.
netstats, peerstats, and capstats look right. packets look like they are being seen.
[ZeekControl] > netstats
worker-zeek-ch-1: 1661364847.281574 recvd=84983 dropped=36 link=85267
worker-zeek-ch-2: 1661364847.295818 recvd=85059 dropped=79 link=85259
worker-zeek-ch-3: 1661364847.308851 recvd=85114 dropped=29 link=85425
worker-zeek-ch-4: 1661364847.323851 recvd=85574 dropped=60 link=85835
[ZeekControl] > netstats
worker-zeek-ch-1: 1661364853.923669 recvd=236101 dropped=36 link=236400
worker-zeek-ch-2: 1661364853.937458 recvd=236191 dropped=79 link=236377
worker-zeek-ch-3: 1661364853.949551 recvd=236145 dropped=29 link=236362
worker-zeek-ch-4: 1661364853.963070 recvd=236599 dropped=60 link=236763
But no log files created and after some time the workers crash.
This was a new install of the zeek software on both nodes.
I tried modifying local.zeek as mentioned in this thread.
deployed after the change there is one difference noted so far.
In the logs directory “current” a new file was created, loaded_scripts.log was not there before.
zeek@elkstack:/opt/zeek/logs/current$ ls -l
-rw-rw-r-- 1 zeek zeek 697 Aug 24 18:22 broker.log
-rw-rw-r-- 1 zeek zeek 742 Aug 24 18:22 cluster.log
-rw-rw-r-- 1 zeek zeek 31014 Aug 24 18:22 loaded_scripts.log
-rw-rw-r-- 1 zeek zeek 0 Aug 24 18:22 stderr.log
-rw-rw-r-- 1 zeek zeek 188 Aug 24 18:22 stdout.log
will monitor to see it starts logging data as expected.
Reporter.log file was created.
Contents may point to an issue:
zeek@elkstack:/opt/zeek/logs/current$ cat reporter.log
#fields ts level message location
#types time enum string string
1661365684.150128 Reporter::WARNING SumStat key request for the sMKulHLKb58 SumStat uid took longer than 1 minute and was automatically cancelled. /opt/zeek/share/zeek/base/frameworks/sumstats/./cluster.zeek, line 226
1661365684.150128 Reporter::WARNING SumStat key request for the eWcinLxo4t1 SumStat uid took longer than 1 minute and was automatically cancelled. /opt/zeek/share/zeek/base/frameworks/sumstats/./cluster.zeek, line 226
We’ve 5 worker nodes each on a separate physical host, while logger, manager and proxy on one physical host.
The cluster is working fine after I reverted to version 4.
I do use jemalloc and occasionally have a worker OOM. However, only on the production cluster; the testing cluster was not affected.