Hi all,
I have Zeek v5.1.1 running in a Debian container. The configuration is (almost) identical to a standalone setup running an older Zeek version, and that standalone setup works just fine. However, within the container, logging from the workers suddenly stops and I cannot figure out why.
In my current directory there are only these files: stats.log, stderr.log, stdout.log and telemetry.log, and no more logging from the workers. The workers are still running; zeekctl diag and zeekctl status show nothing extraordinary, all workers are running.
The only thing I found in the stderr logs was:
/usr/local/zeek/share/zeekctl/scripts/run-zeek: line 110: 430000 Segmentation fault nohup ${pin_command} $pin_cpu "$myzeek" "$@"
listening on pcap0
[broker/ERROR] 2023-01-16T16:35:14.213 unable to find a master for zeek/known/certs
[broker/ERROR] 2023-01-16T16:35:14.213 unable to find a master for zeek/known/hosts
[broker/ERROR] 2023-01-16T16:35:14.214 unable to find a master for zeek/known/services
[broker/ERROR] 2023-01-16T16:35:14.214 proxy 17 received an unexpected message: message(caf::sec::broken_promise)
(There are 28 of these last messages, which is exactly the number of workers running on the NUMA node/CPU that pcap0 is attached to.)
Does anyone have any clues or hints on how to solve or debug this?
Regards, John
Does logging initially work and then stop after some time, or are there no traffic-related logs produced whatsoever? Which node names do you find in telemetry.log?
More questions: do the worker processes still consume CPU after logging has stopped? Do you see packets for the mirrored traffic when running tcpdump -n -i pcap0 -c 1024 within the container?
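In case it helps, something along these lines (run inside the container, from the directory with the current logs) should answer those questions. This is just a sketch: it assumes the default TSV log format and that the node name column in telemetry.log is called "peer", so check the #fields header line if that doesn't match:

    # Which nodes still show up in telemetry.log?
    zeek-cut peer < telemetry.log | sort | uniq -c

    # Are the worker processes still consuming CPU after logging stopped?
    top -b -n 1 | grep -i zeek

    # Is mirrored traffic still arriving on the capture interface inside the container?
    tcpdump -n -i pcap0 -c 1024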
Do you know what’s segfaulting here?
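If the kernel log is visible from inside the container, it usually names the binary and PID of whatever crashed, so a rough sketch along these lines might already tell you:

    # Kernel log entries for recent segfaults (only works if the container can see the kernel log)
    dmesg -T | grep -i segfault

    # On a systemd host with coredumpctl available, list recent crashes and inspect one
    coredumpctl list
    coredumpctl info 430000    # the PID from the stderr line above

    # zeekctl also collects crash reports per node, so this may already name the crashed process
    zeekctl diag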
We've had the following ticket for tracking logging issues with Zeek 5.1 / 5.0: #2389. That should've been fixed with Zeek 5.1.1, though (the fix went in via a PR in Broker).
If you're building from source: could you double-check that the auxil/broker subtree is at the v2.4.1 tag? If not, you might have missed a git submodule update --init --recursive somewhere to pick up the fix. Given you're using/building with Docker, that seems unlikely, though.
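For completeness, verifying the subtree from the source checkout could look roughly like this (the checkout path below is just a placeholder):

    cd /path/to/zeek-source                    # wherever the Zeek 5.1.1 sources live
    git submodule status auxil/broker          # shows the pinned commit and nearest tag
    git -C auxil/broker describe --tags        # should report v2.4.1
    git submodule update --init --recursive    # re-sync the submodules if it doesn't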
Thanks,
Arne
Hi Arne,
To answer your questions:
- Initially, logging comes in for several hours and then stops; that is, the various per-protocol logs and conn.log are no longer updated. Only stats.log and telemetry.log are still written to.
- In the telemetry.log only “logger-1” is present after all other logging has stopped.
- Yes, when there is no more logging the workers still consume CPU time, so they are doing something…
- Data is still coming in on the NICs. As soon as I restart Zeek, per-protocol logging starts immediately.
- Unfortunately I don’t know which process crashed; still trying to figure that out.
- As you already mentioned/expected, I’m running auxil/broker version 2.4.1.
However, I found another interesting thing: there is a giant memory leak somewhere in one of our plugins. You can see that on Sunday afternoon the OOM killer kicked in. Please note that logging stopped well before the kernel started killing processes; in this case it stopped on Friday evening at 17:23, because that is when I see the last worker entries in telemetry.log and no more per-protocol log entries.
On Monday morning I restarted with the same configuration, and memory usage again grew quickly.
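Something like the following simple loop (just a sketch; the interval and output file name are arbitrary) can record per-node memory over time via zeekctl top, to see which process is actually growing:

    # Snapshot per-node CPU/memory every 5 minutes so the leaking process can be identified later
    while true; do
        date
        zeekctl top
        sleep 300
    done >> zeek-memory.log 2>&1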
Yesterday (Tuesday) I restarted Zeek without any plugins enabled (except for afpacket and community-id). It is now much more stable and memory usage increases only a little over time. Besides this, logging is still OK.
So, we will keep this running for another day or so and will then restart with half of the plugins enabled to see if the problem reoccurs. If not, we will try the other half, and so on, to find the one or ones that cause the trouble (a rough bisection sketch follows the plugin list below).
We had the following plugins enabled:
zeek/salesforce/ja3
zeek/corelight/cve-2021-44228
zeek/hosom/file-extraction
zeek/zeek/spicy-analyzers
zeek/corelight/zeek-spicy-ipsec
zeek/corelight/zeek-spicy-openvpn
zeek/corelight/zeek-spicy-wireguard
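Assuming all of these were installed via zkg under the names above, the bisection itself can be as simple as unloading half of them and redeploying (zkg load/unload only toggles whether a package is loaded, it does not uninstall anything), roughly:

    # Disable one half of the packages and redeploy, then watch memory and logging for a day
    zkg unload zeek/corelight/zeek-spicy-ipsec
    zkg unload zeek/corelight/zeek-spicy-openvpn
    zkg unload zeek/corelight/zeek-spicy-wireguard
    zkg unload zeek/zeek/spicy-analyzers
    zeekctl deploy

    # If the problem disappears, re-enable these with "zkg load" and disable the other half,
    # repeating until the culprit is found.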
To be continued…
Regards, John