bro cluster packet loss with pf_ring_zc

Dear list,

I’m using Bro 2.4.1 (stable) with PF_RING ZC to analyze network traffic. Peak traffic is close to 1 Gbps (the NIC’s full line rate), and the packet rate can reach 200,000 pps. PF_RING ZC’s zbalance_ipc reports no packet loss, but broctl netstats shows that the Bro cluster has dropped most of the packets, even though the link count equals the received-packet count.
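To quantify this per worker, the counters in the `broctl netstats` output can be turned into loss percentages. A sketch, assuming the Bro 2.x netstats line format shown in the comment (the sample numbers are made up for illustration):

```shell
# Sketch: compute a per-worker loss percentage from `broctl netstats`.
# In practice you would pipe the real output into awk:
#   broctl netstats | awk '...'
# Here a sample line (made-up numbers) stands in for the real output.
sample='worker-1: 1453226789.123456 recvd=1000000 dropped=600000 link=1000000'

echo "$sample" | awk '{
  recvd = dropped = 0
  # Pull the recvd= and dropped= counters out of the line.
  for (i = 1; i <= NF; i++) {
    if ($i ~ /^recvd=/)   { sub(/recvd=/, "", $i);   recvd   = $i }
    if ($i ~ /^dropped=/) { sub(/dropped=/, "", $i); dropped = $i }
  }
  # Loss is drops as a fraction of everything the worker saw or missed.
  if (recvd + dropped > 0) {
    printf "%s loss=%.2f%%\n", $1, 100 * dropped / (recvd + dropped)
  }
}'
# prints: worker-1: loss=37.50%
```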

While handling the packets, almost all of the CPUs are at full load, so I suspect that the server’s processor limits the packet-processing speed and the Bro cluster has to drop packets.

So my question is: given this server’s performance, is packet loss in Bro at 200,000 pps normal or not?

Here is the CPU info:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 32
On-line CPU(s) list: 0-31
Thread(s) per core: 2
Core(s) per socket: 8
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 45
Model name: Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz
Stepping: 6
CPU MHz: 1317.937
BogoMIPS: 4419.58
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 20480K
NUMA node0 CPU(s): 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30
NUMA node1 CPU(s): 1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31

Here is the memory info:

              total        used        free      shared  buff/cache   available
Mem:       65759080    13018468    31412324      132364    21328288    52079776
Swap:      29241340           0    29241340

Is anyone able to help me?

Thanks in advance,

Bowen Li

I guess the first thing to ask is: how much packet loss? You’ll never fully eliminate it, but a good cluster setup can keep your average loss under 1%.

I could ask a whole lot of questions about your cluster setup, but it is probably easier if you can share a redacted version of your node.conf (redact public IPs, DNS names, any sensitive info, etc.).

You could also try running the capture-loss script and doing some analysis on which workers are dropping packets, and how many, over time. Keep in mind that if you are doing any kind of partial flow shunting, this could skew the results. You could also look at stats.log if you have it enabled.

If one or two workers are really dropping packets during an interval of time but the rest look OK, this could be traffic-related (some large flows). If it is across the board, you may need to look more closely at your cluster setup for sub-optimal configuration or over-subscription.
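For reference, a node.conf for a PF_RING ZC deployment fed by zbalance_ipc typically looks something like the sketch below. The ZC cluster ID (99), worker count, and CPU pinning are placeholders, not your actual values; lb_procs should match the number of queues zbalance_ipc creates, and pin_cpus should use cores on the NUMA node local to the capture NIC:

```
[manager]
type=manager
host=localhost

[proxy-1]
type=proxy
host=localhost

[worker-1]
type=worker
host=localhost
interface=zc:99
lb_method=pf_ring
lb_procs=10
pin_cpus=0,2,4,6,8,10,12,14,16,18
```

With lb_method=pf_ring and a zc: interface, broctl spawns lb_procs worker processes, each attached to one queue of the ZC cluster. On a two-socket box like yours, pinning all workers for a NIC to one NUMA node avoids cross-socket memory traffic, which is a common cause of across-the-board drops.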