We are setting up a Zeek cluster consisting of a manager/logger and five sensors. Each node uses the same hardware:
- 2.4 GHz AMD Epyc 7351P (16 cores, 32 threads)
- 256 GB DDR4 ECC RAM
- Intel X520-T2 10 Gbps to Arista with 0.5m DAC
- Arista 7150S hashing on 5-tuple
- Gigamon sends to Arista via 4x10 Gbps
- Zeek v2.6-167 with AF_Packet
- 16 workers per sensor (total: 5x16=80 workers)
The capture loss was 50-70% until I remembered to turn off offloading. Now it averages about 0.8%, except that in a one-hour summary, anywhere from 0 to 4 cores often spike to 60-70% capture loss. There doesn’t appear to be a pattern to which core suffers the high loss. Searching for ways to identify and fix the cause of such large losses has not turned up any debugging suggestions. Suggestions?

Once you have a high capture loss value, you need to stop focusing
on it and look at the missed_bytes column in the conn.log instead.
The capture loss value is like a check engine light. It only tells
you that something is wrong, but the conn.log tells you what is wrong.
Look for entries in the conn.log where missed_bytes is nonzero, or
even start by looking for any records where it is > 100000. You may
find that you simply have a few completely broken connections that
skew the capture loss towards that 60% value.
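
As a rough sketch of that search, the Python below lists the connections whose missed_bytes exceeds a threshold, largest first. It assumes the default tab-separated conn.log with a "#fields" header line and the standard uid, id.orig_h, id.resp_h, id.resp_p and missed_bytes columns. The function names and the sample records are made up for illustration.

```python
def read_zeek_tsv(lines):
    """Yield one dict per record from a Zeek/Bro tab-separated log."""
    fields = None
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("#fields"):
            # Column names follow the "#fields" token on the same line.
            fields = line.split("\t")[1:]
        elif line and not line.startswith("#") and fields:
            yield dict(zip(fields, line.split("\t")))


def big_gaps(lines, threshold=100_000):
    """Return (missed_bytes, uid, orig_h, resp_h, resp_p), largest first."""
    hits = []
    for rec in read_zeek_tsv(lines):
        raw = rec.get("missed_bytes", "0")
        missed = int(raw) if raw.isdigit() else 0  # "-" means unset
        if missed > threshold:
            hits.append((missed, rec["uid"], rec["id.orig_h"],
                         rec["id.resp_h"], rec["id.resp_p"]))
    return sorted(hits, reverse=True)


# Hypothetical records, reduced to the columns used above.
SAMPLE = [
    "#fields\tuid\tid.orig_h\tid.resp_h\tid.resp_p\tmissed_bytes",
    "C1\t10.0.0.1\t10.0.0.9\t443\t0",
    "C2\t10.0.0.2\t10.0.0.9\t445\t250000",
    "C3\t10.0.0.3\t10.0.0.8\t2049\t4800000",
    "C4\t10.0.0.4\t10.0.0.7\t80\t512",
]

for hit in big_gaps(SAMPLE):
    print(*hit, sep="\t")
```

To run it against a real log, pass an open file instead of SAMPLE, for example big_gaps(open("conn.log")). If a handful of uids account for most of the missed bytes, those connections are the place to start.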
A much better metric that I like to use is 'percent of connections
with loss'. It's a completely different problem if you have 40%
overall capture loss but only 0.01% of connections with loss, compared
to 40% overall capture loss with loss on 20% of connections.
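
One way to compute that metric by hand is sketched below in Python. It assumes a tab-separated conn.log with a "#fields" header and a missed_bytes column, and counts a connection as "with loss" when missed_bytes is greater than zero. The function name and the sample lines are illustrative only.

```python
def percent_connections_with_loss(lines):
    """Return (total, with_loss, percent) for a tab-separated conn.log."""
    idx = None          # position of the missed_bytes column
    total = with_loss = 0
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("#fields"):
            idx = line.split("\t")[1:].index("missed_bytes")
            continue
        if not line or line.startswith("#") or idx is None:
            continue
        cols = line.split("\t")
        total += 1
        # Unset values are written as "-", which is treated as no loss.
        if cols[idx].isdigit() and int(cols[idx]) > 0:
            with_loss += 1
    percent = 100.0 * with_loss / total if total else 0.0
    return total, with_loss, percent


# Hypothetical log: four connections, one of them with a gap.
LOG = [
    "#fields\tuid\tmissed_bytes",
    "C1\t0",
    "C2\t0",
    "C3\t4800000",
    "C4\t-",
]

total, lossy, pct = percent_connections_with_loss(LOG)
print(f"{lossy} of {total} connections with loss ({pct:.2f}%)")
```

A low percentage here alongside a high capture loss figure points at a few broken connections rather than a sensor that is dropping packets across the board.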
If you install bro-doctor from bro-pkg that will do a lot of analysis
like this for you.
I'd also run one fewer worker on each of those boxes. With 16 workers
and 16 cores, you're not leaving any spare cores for cron jobs and
other background tasks.