Possible memory leak in logger process?

Hi,

We have a Zeek node that sees high volumes on working days. Due to our internal network configuration, a lot of connections to our internal DNS servers are generated by certain endpoints (our DNS does not resolve any external domains, and certain applications keep repeating their DNS requests at astronomical rates). The node is a 16-core, 128 GB VM and we use the ASCII logger.

We have observed that under high loads (~40k writes/s), the logger process starts lagging behind and its memory usage goes up. Once the machine is using more than 60% of its memory, Zeek starts dropping packets and a general drop in performance is observed. The only solution is to restart the Zeek process.

My understanding is that the logger is buffering the unwritten lines in memory, which is why memory usage goes up.

To work around this, I split the output files so that all connections to the DNS servers and all DNS requests for high-velocity domains are logged to separate files (conn-noise.log and dns-noise.log). These two files account for nearly 80% of the disk usage under the current directory (e.g. in 30 minutes the current directory uses 4.9G, of which these two files use 4.0G). Doing this, I hoped that any lags would be confined to these two files and that I would lose less data on a restart. Also, by giving the heavily written files their own writer threads, I might get better performance. The idea has worked partially: lags for the other files are generally low now, although we still need to restart Zeek if memory usage goes beyond 55%.
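For context, this kind of split can be done with a path_func on the default log filter. The snippet below is only a minimal sketch, not my exact configuration: the server addresses are placeholders, and an equivalent filter on DNS::LOG (keyed on the query name) produces dns-noise.log.

    # Sketch: route connections to the internal DNS servers into conn-noise.log
    # by overriding the default conn filter's path_func. Addresses are placeholders.
    const noisy_dns_servers: set[addr] = { 10.0.0.53, 10.0.0.54 } &redef;

    function conn_noise_path(id: Log::ID, path: string, rec: Conn::Info): string
        {
        # Entries whose responder is one of the noisy DNS servers go to
        # conn-noise.log; the ASCII writer handles that file in its own thread.
        if ( rec$id$resp_h in noisy_dns_servers )
            return "conn-noise";

        # Everything else keeps the default path ("conn").
        return path;
        }

    event zeek_init()
        {
        # Fetch the default filter, attach the path function, and re-add it
        # (re-adding a filter with the same name replaces it).
        local f = Log::get_filter(Conn::LOG, "default");
        f$path_func = conn_noise_path;
        Log::add_filter(Conn::LOG, f);
        }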

The problem is that the logger's memory usage does not decrease on its own when the load drops (e.g. at night). For example, if Zeek was using 40G of memory on Friday evening and dns-noise was showing a lag of 1800 seconds, the memory usage on Monday morning is still 40G even though the lag is only around 1 second. Has anyone experienced anything similar? I am running Zeek 4.1.1.

Thanks,
Dheeraj

We also have another report of the same issue in https://github.com/zeek/zeek/issues/1856. Is it possible for you to rebuild with jemalloc support and run the jemalloc profiling plugin on your logger node? That should give more information about what's causing the bloat. We can use that issue to discuss in more depth what's going on, if that's easier than email.

Tim

Thanks for the pointer Tim.

I will try to run jemalloc profiling and post back on the Github issue.

- Dheeraj

Dheeraj,

What's the OS on the host running the logger node?

I've seen the same issue of a bloating logger node with tcmalloc on FreeBSD. Mine
crashes after 180+ GB, and it takes a couple of weeks to get there!

Since last week I have been running with jemalloc and things seem better, but
let's see; I may be speaking too soon here.

(On a side note)

I've been trying jemalloc and hit a few hiccups (struggles) building Zeek with
jemalloc on FreeBSD:

1) A fix for building Zeek + jemalloc on FreeBSD: https://github.com/zeek/zeek/pull/1878

and,

2) A fix for building jemalloc itself on FreeBSD with --enable-profiling.

We (Craig Leres) have put out a patch to be able to do so as well.

(2) is mostly needed so that I can build Zeek against jemalloc with
--enable-profiling in order to run Justin's zeekctl jemalloc profiler.

Aashish

Hi Aashish,

The OS is CentOS 7, and we didn't enable tcmalloc/jemalloc during the build. All processes (logger, workers, proxy, manager) run on the same machine. We have four more "sensors" with a similar setup but somewhat less traffic, and they do not exhibit this problem.

- Dheeraj