We must have crossed some threshold yesterday. Suddenly we are suffering an epidemic of workers dying with “out of memory in new” even though we made no changes. Previously, we would have a few die each day. Now we have had 250 alerts of workers dying and being restarted from 00:00 to 10:00. I have no idea where to start debugging the problem. Any suggestions?
What causes a worker to die by running out of memory? The sensors have lots of memory (see below), so I would not expect any out-of-memory deaths. (To monitor the problem, I am in the process of setting up collectd and Grafana.)
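Until the collectd graphs are available, a stopgap along these lines could log each worker's resident memory, to show whether the workers grow steadily or spike just before dying. This is only a minimal sketch: it assumes Linux (it reads the `VmRSS` field of `/proc/<pid>/status`), and the process name `bro` passed at the bottom is an assumption that should be adjusted to whatever the worker processes are actually called on the sensor.

```python
import os
import re


def parse_vmrss_kb(status_text):
    """Return the VmRSS value (in kB) from /proc/<pid>/status text, or None."""
    m = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(m.group(1)) if m else None


def rss_by_pid(name_substring, proc_root="/proc"):
    """Map pid -> resident set size in kB for processes whose name matches."""
    try:
        entries = os.listdir(proc_root)
    except OSError:
        return {}  # no procfs here (not Linux)
    result = {}
    for entry in entries:
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "status")) as f:
                text = f.read()
        except OSError:
            continue  # process exited or is not readable
        name = re.search(r"^Name:\s+(.*)$", text, re.MULTILINE)
        rss = parse_vmrss_kb(text)
        if name and rss is not None and name_substring in name.group(1):
            result[int(entry)] = rss
    return result


if __name__ == "__main__":
    # "bro" is an assumed process name; change it to match the workers.
    for pid, kb in sorted(rss_by_pid("bro").items()):
        print(f"pid {pid}: {kb / 1024:.0f} MiB resident")
```

Run from cron every minute or so and appended to a file, this would give a rough per-worker memory history to compare against the restart alerts.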
Some details:
- 5 sensors, each with 16-core, AMD Epyc 7351P, 128 GB RAM, Intel X520-T2
- Zeek 2.6.1
- node.cfg: lb_procs=15, pin_cpus=1-15, af_packet_buffer_size=1*1024*1024*1024
- broctl.cfg: setcap enabled
- Not shunting any traffic
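
For a sense of scale, here is a rough calculation of how much memory the AF_Packet ring buffers alone would take on one sensor. It is a sketch under two assumptions that are not confirmed above: that af_packet_buffer_size is meant to be 1 GiB (1*1024*1024*1024 bytes), and that each of the 15 workers allocates its own buffer of that size. The 128 GB of RAM is treated as 128 GiB for simplicity.

```python
GIB = 1024 ** 3


def ring_buffer_total_bytes(workers, buffer_bytes):
    """Total ring-buffer memory if every worker has its own buffer."""
    return workers * buffer_bytes


workers = 15                           # lb_procs from node.cfg
buffer_bytes = 1 * 1024 * 1024 * 1024  # assumed af_packet_buffer_size (1 GiB)
ram_bytes = 128 * GIB                  # sensor RAM, treated as 128 GiB

total = ring_buffer_total_bytes(workers, buffer_bytes)
share = 100 * total / ram_bytes

print(f"ring buffers: {total / GIB:.0f} GiB ({share:.1f}% of RAM)")
```

If those assumptions hold, the buffers account for about 15 GiB per sensor, which leaves most of the RAM for the worker processes themselves.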
Mark