Bad load balancing with PF_RING cluster

Hi.

I’m having issues with a sensor. I’m running Zeek 4.2.1 configured with PF_RING 8.0. The pf_ring.ko module is loaded, workers are pinned to specific cores, and the number of RSS queues is set to match the number of workers.
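
For reference, that setup can be double-checked with something like this (the /proc entry comes from the pf_ring module; its exact contents vary by PF_RING version):

# RSS queues in use vs. what each NIC supports
ethtool -l ens2f0
ethtool -l ens2f1

# confirm the pf_ring module is loaded and check its settings
lsmod | grep pf_ring
cat /proc/net/pf_ring/info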

I’m processing ~5.2 Gbps of traffic. About 30% of my workers are maxing their CPU core at 100% (and dropping a ton of packets), while the rest sit comfortably at ~40-50%. In case it’s relevant: I also have a capture filter that excludes some high-volume flows.
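
If it’s useful, the per-core picture can be checked with something like the commands below (mpstat needs the sysstat package installed):

zeekctl top          # per-node CPU/memory as reported by zeekctl
mpstat -P ALL 5      # per-core utilisation, to compare against pin_cpus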

Things I’ve tried that seem to have no effect:

  • disabling reassembly offload (as suggested in Bro-2.5.2 and PF_RING 6.7 not load balancing properly). Doesn’t seem to affect anything.
  • setting PFRINGClusterType = 5-tuple in zeekctl.cfg (the relevant lines are sketched after this list).
  • changing the number of workers (tried 12, 14)
  • setting PFRINGClusterType to round-robin. I expected Zeek might stop working, but would at least change its behaviour; the results are exactly the same.
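
For reference, the zeekctl.cfg lines being toggled look roughly like this. The cluster-type values are the ones from the list above; the cluster id line is an assumption, shown at what I believe is zeekctl's default:

# in zeekctl.cfg (values shown are what was tested, not a recommendation)
PFRINGClusterType = 5-tuple    # also tried: round-robin and 6-tuple
PFRINGClusterID = 21           # left at the default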

Zeek’s stats.log shows that the affected workers are seeing about twice as many packets (in .pkts_link) but also about twice as many connections. This is what leads me to believe it’s a clustering issue, rather than a few CPU-heavy flows hogging some workers.

root@S40-0002:/opt/zeek# cat logs/current/stats.log |grep worker  |jq -c '[.peer, .pkts_proc, .pkts_dropped, .pkts_link, .bytes_recv, .tcp_conns, .udp_conns]|@tsv' -r |sort -n
worker-1	11926937	0	11926937	11821666034	9705	7568
worker-10	20304874	0	20304874	25494365339	9887	6571
worker-11	15982274	0	15982274	17382237822	12721	7560
worker-12	5569983	22038010	27607993	3998967861	28960	6724
worker-13	9880313	0	9880313	10404686108	12841	10090
worker-14	6431003	13202965	19633968	4980727202	23737	5635
worker-2	12780457	0	12780457	13877151339	12552	8607
worker-3	8981430	0	8981430	8152736373	10535	6884
worker-4	22568900	0	22568900	26497928709	9129	6921
worker-5	5240202	25481354	30721556	3621222996	20892	5015
worker-6	25090072	38963	25129035	29919849386	8952	6595
worker-7	11789623	0	11789623	12828034403	9187	6960
worker-8	5547602	27071557	32619159	3821877698	20696	5237
worker-9	17891859	0	17891859	21401727729	11683	8174
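
As a sanity check on the "more packets and more connections" point, the same stats.log can give a packets-per-connection figure per worker; if the hot workers' ratio is similar to the quiet ones, the imbalance is in flow distribution rather than a handful of heavy flows. This is just a variation of the command above (pkts_link is used because it counts packets seen on the link, including dropped ones):

cat logs/current/stats.log | grep worker | jq -r '[.peer, (.pkts_link / (.tcp_conns + .udp_conns) | floor)] | @tsv' | sort -n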

Here is my node.cfg:

[manager]
type=manager
host=127.0.0.1
[proxy-1]
type=proxy
host=127.0.0.1
[logger-1]
type=logger
host=127.0.0.1

[worker]
type=worker
host=127.0.0.1
interface=ens2f0,ens2f1
lb_method=pf_ring
lb_procs=14
pin_cpus=0,2,4,6,8,10,12,14,15,13,11,9,7,5

I’ve also tried 6-tuple, but that seems to only make things worse.
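
One more thing that can be pulled from the sensor is the per-socket counters under /proc/net/pf_ring, which should show how the cluster is spreading packets across the 14 rings on each interface. Each ring typically shows up as a <pid>-<interface>.<ring> file, but file names and field labels vary between PF_RING versions, so treat this as a sketch:

ls /proc/net/pf_ring/
cat /proc/net/pf_ring/*-ens2f0.*    # per-ring stats, including cluster id and packet counters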

Hi,

With just the stats you can see that there’s an issue, but you can’t determine what is really happening.

Install this package: https://github.com/J-Gras/add-node-names (it adds the cluster node name to logs).

That will add the worker name to each conn.log entry. Once you have that, you can dig in a little further and see which connections are being sent to which worker; from there you may be able to identify a pattern.
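
A rough sketch of what that could look like, assuming JSON logs and assuming the field the package adds is called _node_name (check the package README for the actual field name, and adjust the zkg invocation to however you normally install packages):

zkg install https://github.com/J-Gras/add-node-names

# connections per worker (field name _node_name is an assumption, see above)
cat conn.log | jq -r '._node_name' | sort | uniq -c | sort -rn

# top responder hosts landing on one of the hot workers (worker-5 here, as an example)
cat conn.log | jq -r 'select(._node_name == "worker-5") | .["id.resp_h"]' | sort | uniq -c | sort -rn | head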