capture_loss vs. pkts_dropped vs. missed_bytes

I am still tuning our new Zeek cluster: an Arista switch for load balancing (4x10 Gbps links in from a Gigamon, 10 Gbps links out to the sensors); five sensors (16 physical cores and 128 GB RAM each) using af_packet, with 15 workers per sensor; and a separate management node running the manager, logger, and proxy, with storage on XFS over RAID-0 (eight 7200 RPM spindles, 256 GB RAM). Output is JSON (for feeding into an Elastic Stack later).

Early on, the average capture loss was <1%, but with spikes to 50-70%. We increased af_packet_buffer_size from the default (128 MB) to 2 GB, and capture_loss is now gone:
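For reference, that knob lives in node.cfg (option names as documented by the zeek-af_packet plugin; the host, interface name, and fanout id below are made up, and the buffer size is written out in bytes):

```ini
[worker-1]
type=worker
host=10.0.0.1                        # hypothetical sensor address
interface=af_packet::ens1f0          # hypothetical capture interface
lb_method=custom
lb_procs=15                          # 15 workers per sensor, as above
af_packet_fanout_id=23               # any id unique per interface
af_packet_buffer_size=2147483648     # 2 GB, up from the 128 MB default
```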

$ zcat capture_loss.10:00:00-11:00:00.log.gz | jq .percent_lost | statgen
Count Min Max Avg StdDev
300 0.0000 0.0000 0.0000 0.0000
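(statgen is just a little helper that summarizes one number per line; for anyone who doesn't have it, a hypothetical awk stand-in that prints the same Count/Min/Max/Avg/StdDev values, using the population standard deviation, would be:)

```shell
# Summary stats over a stream of one number per line:
# count, min, max, mean, population stddev.
stats() {
  awk 'NR == 1 { min = $1; max = $1 }
       { s += $1; ss += $1 * $1
         if ($1 < min) min = $1
         if ($1 > max) max = $1 }
       END { n = NR; avg = s / n
             v = ss / n - avg * avg; if (v < 0) v = 0
             printf "%d %.4f %.4f %.4f %.4f\n", n, min, max, avg, sqrt(v) }'
}

printf '1\n2\n3\n' | stats   # -> 3 1.0000 3.0000 2.0000 0.8165
```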

Next, I looked at missed_bytes in the conn.log, which doesn't look too bad:

$ zcat conn.10:00:00-11:00:00.log.gz | jq .missed_bytes | statgen
Count Min Max Avg StdDev
5488 0.0000 5802.0000 1.7332 92.9547

Out of the 5488 records, only two were non-zero (5802 and 3710), and for both of those missed_bytes == resp_bytes (service: ssl).
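(In case it's useful, this is how I pulled those records out; the two sample lines in the here-doc are made-up stand-ins for the real conn.log entries, and in practice the filter is fed from zcat as in the commands above:)

```shell
# Keep only conn.log records that actually lost bytes, and flag the ones
# where the entire responder side was missed (missed_bytes == resp_bytes).
jq -c 'select(.missed_bytes > 0)
       | {uid, service, missed_bytes, resp_bytes,
          all_resp_lost: (.missed_bytes == .resp_bytes)}' <<'EOF'
{"uid":"C1","service":"ssl","missed_bytes":5802,"resp_bytes":5802,"orig_bytes":100}
{"uid":"C2","service":"http","missed_bytes":0,"resp_bytes":4000,"orig_bytes":50}
EOF
```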

But even with the above, the pkts_dropped in stats.log is extremely high:

$ zcat stats.10:00:00-11:00:00.log.gz | jq .pkts_dropped | grep -v null | statgen
Count Min Max Avg StdDev
900 3564854 18216752 5762446.99 1591145.34
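(To put the raw counts in perspective, I've been computing a rough per-record drop rate, dropped as a percentage of what each worker saw — this assumes the JSON stats.log carries pkts_proc and pkts_dropped, and the sample records in the here-doc are made up; against the real log the filter is fed from zcat as above:)

```shell
# Per-record drop rate from a JSON stats.log:
# 100 * pkts_dropped / (pkts_proc + pkts_dropped), per peer.
jq -r 'select(.pkts_dropped != null)
       | "\(.peer) \(100 * .pkts_dropped / (.pkts_proc + .pkts_dropped))%"' <<'EOF'
{"peer":"worker-1-1","pkts_proc":900,"pkts_dropped":100}
{"peer":"manager","pkts_proc":0,"pkts_dropped":null}
EOF
```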

So even though there was no capture_loss and almost no missed_bytes, pkts_dropped is huge. Is this something to be concerned about? If so, I am not sure how to go about finding the problem. What should I do next?

Hey Mark,

First of all, I really like your setup and I don’t see any obvious errors there. Cool.

Jan (also on this list) might know more about the way drops are calculated in stats.log. It looks like they are just af_packet statistics.

Can you run Justin’s troubleshooting tool and send us results?

BTW, while monitoring for drops, take a look here, where we describe several other places drops might happen (and all of them should be monitored).
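As a quick sketch of a couple of those places: NIC/driver counters come from `ethtool -S <iface>` (the exact counter names vary by driver, so grep for drop/miss/discard), per-worker af_packet counters from `zeekctl netstats`, and kernel-level interface drops from /proc/net/dev. For the last one, a one-liner that sums the RX "drop" column across all interfaces (assuming the usual two header lines and counter order):

```shell
# Sum kernel RX drops across interfaces from /proc/net/dev.
# After the "iface:" prefix, the RX counters are bytes, packets, errs,
# drop, ... -- so the drop count is the 4th number.
awk -F: 'NR > 2 { split($2, f, " "); drops += f[4] } END { print drops + 0 }' /proc/net/dev
```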