Hi,
We have a Zeek node that sees high traffic volumes on working days. Because of our internal network configuration, certain endpoints generate a very large number of connections to our internal DNS servers: our DNS does not resolve any external domains, and certain applications keep repeating their DNS requests at astronomical rates. The node is a 16-core, 128GB VM and we use the ASCII log writer.
We have observed that under high load (~40k log writes/s), the logger process starts lagging behind and its memory usage goes up. Once the machine is using more than 60% of its memory, Zeek starts dropping packets and overall performance degrades. The only fix we have found is to restart the Zeek process.
My understanding is that the logger buffers the not-yet-written lines in memory, which is why its memory usage keeps growing.
To work around this, I split the output so that all connections to the DNS servers and all DNS requests for high-velocity domains are logged to separate files (conn-noise.log and dns-noise.log). These two files account for roughly 80% of the disk usage under the current directory (e.g. after 30 minutes the current directory holds 4.9G, of which these two files use 4.0G). My hope was that any lag would be confined to these two files, so that I would lose less data on a restart, and that giving the heavily written files their own writer threads might improve performance. The idea has partially worked: lag for the other files is generally low now, although we still need to restart Zeek if memory usage goes beyond 55%.
The problem is that the logger's memory usage does not decrease on its own when the load drops (e.g. at night). For example, if Zeek was using 40G of memory on Friday evening and dns-noise was showing a lag of 1800 seconds, the memory usage on Monday morning is still 40G even though the lag is only around 1 second. Has anyone experienced anything similar? I am running Zeek 4.1.1.
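For reference, here is the arithmetic behind the disk figures as a small Python sketch. It only restates the 4.9G / 4.0G / 30 minute numbers from above; the function names are made up for illustration, and I am assuming that 1G means 1024 MB (as reported by du -h). The share works out to about 82%.

```python
# Figures quoted in the post: size of the current directory after
# 30 minutes, and the part used by conn-noise.log plus dns-noise.log.
TOTAL_G = 4.9
NOISE_G = 4.0
WINDOW_S = 30 * 60


def noise_share(noise_g: float, total_g: float) -> float:
    """Fraction of the directory taken up by the two noise logs."""
    return noise_g / total_g


def write_rate_mb_per_s(size_g: float, window_s: int) -> float:
    """Average on-disk write rate in MB/s (assumption: 1G = 1024 MB)."""
    return size_g * 1024 / window_s


if __name__ == "__main__":
    print(f"noise share: {noise_share(NOISE_G, TOTAL_G):.1%}")
    print(f"all logs:    {write_rate_mb_per_s(TOTAL_G, WINDOW_S):.2f} MB/s")
    print(f"noise logs:  {write_rate_mb_per_s(NOISE_G, WINDOW_S):.2f} MB/s")
```

These are averages over the 30 minute window, so the peak rates are presumably higher.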
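In case someone wants to compare numbers, this is roughly how the memory figure can be sampled over time. It is a minimal Python sketch, assuming a Linux host where /proc/<pid>/status contains a VmRSS line; the function names and the idea of passing in the logger's PID are my own illustration and not part of Zeek.

```python
import re
import time


def parse_vmrss_kib(status_text: str) -> int:
    """Extract the resident set size (in kB) from /proc/<pid>/status text."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    if match is None:
        raise ValueError("no VmRSS line found")
    return int(match.group(1))


def read_rss_kib(pid: int) -> int:
    """Read the current resident set size of a process (Linux only)."""
    with open(f"/proc/{pid}/status") as fh:
        return parse_vmrss_kib(fh.read())


def monitor(pid: int, interval_s: float = 60.0, samples: int = 10) -> None:
    """Print a timestamped RSS reading every interval_s seconds."""
    for _ in range(samples):
        stamp = time.strftime("%Y-%m-%dT%H:%M:%S")
        rss_gib = read_rss_kib(pid) / 1024 ** 2
        print(f"{stamp} rss_gib={rss_gib:.2f}", flush=True)
        time.sleep(interval_s)

# Hypothetical usage: monitor(12345), where 12345 is the logger's PID.
```

Logging this next to the per-file lag should show whether the memory stays flat after the lag has recovered.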
Thanks,
Dheeraj