Performance Issues after upgrade to Zeek-6

Hi,

We are running Zeek in cluster mode on a single machine using PF_RING (v8.2.0). The node.cfg looks like this:

[manager]
type=manager
host=localhost

[logger]
type=logger
host=localhost

[proxy-1]
type=proxy
host=localhost

[worker-1]
type=worker
host=localhost
interface=ens10
lb_method=pf_ring
lb_procs=5

We have noticed a significant performance degradation (up to 90% packet loss) after upgrading from Zeek-5.2.2 to Zeek-6.0.0 on the same machine/configuration, to the point where we had to revert to Zeek-5.2.2.

For Zeek-5.2.2, the PF_RING info for a single worker process is:

Bound Device(s)        : ens10
Active                 : 1
Breed                  : Standard
Appl. Name             : zeek-ens10
Socket Mode            : RX+TX
Capture Direction      : RX+TX
Sampling Rate          : 1
Filtering Sampling Rate: 0
IP Defragment          : No
BPF Filtering          : Enabled
Sw Filt Hash Rules     : 0
Sw Filt WC Rules       : 0
Sw Filt Hash Match     : 0
Sw Filt Hash Miss      : 0
Sw Filt Hash Filtered  : 0
Hw Filt Rules          : 0
Poll Pkt Watermark     : 1
Num Poll Calls         : 5
Poll Watermark Timeout : 0
Channel Id Mask        : 0xFFFFFFFFFFFFFFFF
VLAN Id                : 65535
Cluster Id             : 21
Slot Version           : 20 [8.2.0]
Min Num Slots          : 7315
Bucket Len             : 9216
Slot Len               : 9272 [bucket+header]
Tot Memory             : 67842048
Tot Packets            : 126093
Tot Pkt Lost           : 0
Tot Insert             : 126093
Tot Read               : 126068
Insert Offset          : 24036888
Remove Offset          : 24029152
Num Free Slots         : 7290
TX: Send Ok            : 0
TX: Send Errors        : 0
Reflect: Fwd Ok        : 0
Reflect: Fwd Errors    : 0

The stats.log is nice and linear too

1689673088.948389	logger	151	0	0	-	-	-	940	384	0	0	0	0	0	0	2048	89	0	0	0	0	0	0	0	0
1689673090.540163	manager	103	0	0	-	-	-	7019	6467	0	0	0	0	0	0	2137	63	0	0	0	0	0	0	0	0
1689673092.124542	proxy-1	91	0	0	-	-	-	176690	176134	0	0	0	0	0	0	1719	59	0	0	0	0	0	0	0	0
1689673093.844903	worker-1-5	290	2724366	2195088093	0	2724366	0.000606	1941787	1941785	2249	2700	150	31166	76081	723	499807	125434	12511	82	0	0	780853	475078	1078275	0
1689673093.890729	worker-1-4	287	2320015	1619416526	0	2320015	0.000097	1883343	1883342	2176	2669	143	31093	76157	707	499680	124347	12381	67	0	0	804117	463304	238523	0
1689673093.855832	worker-1-1	286	2069789	1443405113	0	2069789	0.000120	1880384	1880384	2193	2765	142	30845	76106	696	496321	124011	12250	86	0	0	701742	541965	16001	0
1689673093.925209	worker-1-2	283	1893755	1280572861	0	1893755	0.000076	1882802	1882800	2165	2713	147	30661	75975	704	519899	124042	12266	72	0	0	696494	541729	901994	0
1689673093.897475	worker-1-3	286	1840839	1242641106	0	1840839	0.000110	1964678	1964670	2219	2712	159	30599	76131	734	494667	124522	12365	82	0	0	814335	555651	457608	0
1689673388.948476	logger	151	0	0	-	-	-	363	363	0	0	0	0	0	0	2013	89	0	0	0	0	0	0	0	0
1689673390.540846	manager	103	0	0	-	-	-	3493	3491	0	0	0	0	0	0	2136	62	0	0	0	0	0	0	0	0
1689673392.124647	proxy-1	93	0	0	-	-	-	160647	160646	0	0	0	0	0	0	1715	61	0	0	0	0	0	0	0	0
1689673393.855856	worker-1-1	331	3362873	2815507136	0	3362873	0.000072	1837817	1837813	2098	2538	117	29002	75150	695	525583	227924	11623	85	0	0	517671	27176	32091	0
1689673393.844947	worker-1-5	338	1648761	1108523273	0	1648761	0.000116	1855376	1855380	2076	2559	137	28830	75933	729	532624	229876	11596	106	0	0	477789	277150	1146270	0
1689673393.925209	worker-1-2	329	1490401	965625528	0	1490401	0.000074	1826702	1826700	2048	2429	124	28740	74828	687	551752	227072	11769	89	0	0	521065	3630	988203	0
1689673393.890797	worker-1-4	336	1933832	1340839201	0	1933832	0.000111	1905318	1905316	2097	2626	115	28913	75073	715	527678	228638	12029	96	0	0	523117	110731	259305	0
1689673393.897583	worker-1-3	332	2053098	1460076779	0	2053098	0.000492	1899970	1899973	2095	2580	133	28805	75206	725	527748	228125	11584	67	0	0	640793	0	485294	0
1689673688.949326	logger	151	0	0	-	-	-	363	363	0	0	0	0	0	0	2014	90	0	0	0	0	0	0	0	0
1689673690.541238	manager	103	0	0	-	-	-	2171	2173	0	0	0	0	0	0	2137	62	0	0	0	0	0	0	0	0
1689673692.124714	proxy-1	94	0	0	-	-	-	156467	156469	0	0	0	0	0	0	1711	59	0	0	0	0	0	0	0	0
1689673693.845019	worker-1-5	348	2609874	2016559593	0	2609874	0.000095	2076648	2076642	1823	2436	89	27867	76349	543	519900	226112	11021	71	0	0	555726	337290	1060520	0
1689673693.855920	worker-1-1	340	1944933	1338137324	0	1944933	0.000103	2008271	2008272	1809	2419	82	28164	76435	516	516145	224930	11056	76	0	0	679186	38704	38687	0
1689673693.925291	worker-1-2	342	1465935	909106868	0	1465935	0.000105	2000249	2000250	1848	2490	86	28113	76396	527	541327	224869	11117	67	0	0	568298	336331	910187	0
1689673693.890857	worker-1-4	348	1836140	1257882500	0	1836140	0.000223	2038914	2038916	1858	2487	87	28131	76249	535	517739	225119	11355	118	0	0	626095	1933811	268142	0
1689673693.897670	worker-1-3	343	1671119	1010249583	0	1671119	0.000093	2091418	2091416	1927	2463	83	28185	76132	537	522347	225316	11002	59	0	0	594599	0	492906	0
1689673988.950265	logger	152	0	0	-	-	-	363	363	0	0	0	0	0	0	2014	91	0	0	0	0	0	0	0	0
1689673990.541520	manager	103	0	0	-	-	-	4115	4113	0	0	0	0	0	0	2138	64	0	0	0	0	0	0	0	0
1689673992.124785	proxy-1	95	0	0	-	-	-	153749	153748	0	0	0	0	0	0	1714	60	0	0	0	0	0	0	0	0
1689673993.845092	worker-1-5	348	1790141	1140473966	0	1790141	0.000121	1940427	1940427	1954	2584	152	28604	73976	804	500852	225623	10918	70	0	0	471313	27773	1124798	0
1689673993.856001	worker-1-1	340	3260145	2585018935	0	3260145	0.000090	1915993	1915993	1999	2696	153	28541	74335	783	502756	225331	11106	62	0	0	538834	27032	10122	0
1689673993.925308	worker-1-2	342	1582863	1056736314	0	1582863	0.000058	1897449	1897450	1999	2613	147	28724	74178	796	526744	225627	11114	87	0	0	462121	561302	901605	0
1689673993.890926	worker-1-4	348	2094632	1561020151	0	2094632	0.000146	2045062	2045059	1956	2576	157	28375	74017	779	503563	224704	11363	76	0	0	524205	180654	241585	0
1689673993.897735	worker-1-3	343	1607045	1007812267	0	1607045	0.000063	1973494	1973495	1899	2576	139	28227	73770	782	500908	224153	11053	70	0	0	449315	0	471086	0

After the upgrade to Zeek-6.0.0 there is no appreciable CPU or memory increase (in fact, CPU usage decreases).

This is what the PF_RING info for a single worker process looks like when using Zeek-6.0.0:

Bound Device(s)        : ens10
Active                 : 1
Breed                  : Standard
Appl. Name             : zeek-ens10
Socket Mode            : RX+TX
Capture Direction      : RX+TX
Sampling Rate          : 1
Filtering Sampling Rate: 0
IP Defragment          : No
BPF Filtering          : Enabled
Sw Filt Hash Rules     : 0
Sw Filt WC Rules       : 0
Sw Filt Hash Match     : 0
Sw Filt Hash Miss      : 0
Sw Filt Hash Filtered  : 0
Hw Filt Rules          : 0
Poll Pkt Watermark     : 1
Num Poll Calls         : 665
Poll Watermark Timeout : 0
Channel Id Mask        : 0xFFFFFFFFFFFFFFFF
VLAN Id                : 65535
Cluster Id             : 21
Slot Version           : 20 [8.2.0]
Min Num Slots          : 7315
Bucket Len             : 9216
Slot Len               : 9272 [bucket+header]
Tot Memory             : 67842048
Tot Packets            : 776413
Tot Pkt Lost           : 660401
Tot Insert             : 116024
Tot Read               : 12736
Insert Offset          : 9749592
Remove Offset          : 9774208
Num Free Slots         : 0
TX: Send Ok            : 0
TX: Send Errors        : 0
Reflect: Fwd Ok        : 0
Reflect: Fwd Errors    : 0

stats.log becomes non-linear and messy, with workers reporting late:

1689665990.583017	logger	325	0	0	-	-	-	-	947	384	0	0	0	0	0	02046	87	0	0	0	0	0	0	0	0
1689665992.160743	manager	168	0	0	-	-	-	-	6669	6109	0	0	0	0	0	02138	64	0	0	0	0	0	0	0	0
1689665993.820183	proxy-1	143	0	0	-	-	-	-	80788	80225	0	0	0	0	0	01719	59	0	0	0	0	0	0	0	0
1689665995.749078	worker-1-5	334	1517951	909319007	0	1517951	0.000074	-	1806814	1806811	1799	2438	38	27536	82035	729	490907	127634	10567	71	0	0	381022	618268	1924534	0
1689665995.716962	worker-1-1	311	848554	487461472	1519209	2367763	0.374560	-	1153972	1153968	2152	2371	33	17641	46739	285	312799	81282	6048	76	0	0	849879	32792	1224808	0
1689665995.718581	worker-1-2	269	166082	92919503	4489850	4655932	50.207620	-	231399	231397	1918	1787	12	5354	10622	154	109471	25035	1036	10	0	0	886489	0	254857	0
1689666290.583110	logger	326	0	0	-	-	-	-	363	363	0	0	0	0	0	02015	89	0	0	0	0	0	0	0	0
1689666292.161305	manager	168	0	0	-	-	-	-	3173	3173	0	0	0	0	0	02138	65	0	0	0	0	0	0	0	0
1689666293.820261	proxy-1	144	0	0	-	-	-	-	132879	132878	0	0	0	0	0	01715	61	0	0	0	0	0	0	0	0
1689666295.749121	worker-1-5	379	3194441	2610559636	0	3194441	0.000075	-	1738975	1738976	1869	2550	110	27118	73571	521	499412	229168	10568	63	0	0	317955	0	2085997	0
1689666295.718795	worker-1-2	339	1433823	899745950	934	1434757	0.000188	-	1683376	1683377	2168	2584	104	25326	66461	499	464683	127777	9676	71	0	0	613979	0	1961929	0
1689666295.720408	worker-1-1	358	1663067	1115119437	3807	1666874	0.000148	-	1811448	1811465	1916	2594	102	26664	73508	545	500881	183698	10731	90	0	0	265676	0	2047659	0
1689666590.583305	logger	326	0	0	-	-	-	-	363	363	0	0	0	0	0	02015	91	0	0	0	0	0	0	0	0
1689665996.059727	worker-1-4	241	96332	78257454	18494939	18591271	596.212214	-	61073	61072715	20	0	1760	2877	49	44141	6515	239	3	0	0	482022	0	5140	0
1689666592.161775	manager	168	0	0	-	-	-	-	1782	1783	0	0	0	0	0	02133	62	0	0	0	0	0	0	0	0
1689666593.820329	proxy-1	147	0	0	-	-	-	-	126948	126948	0	0	0	0	0	01713	61	0	0	0	0	0	0	0	0
1689666595.749254	worker-1-5	381	6530084	5740179828	0	6530084	0.000070	-	1831942	1831940	1954	2658	85	26576	73868	756	500686	221413	10091	84	0	0	367133	0	2005657	0
1689666595.718838	worker-1-2	381	1830578	1283171424	252747	2083325	0.019678	-	1817988	1817987	2380	2606	84	26032	72557	756	511779	211629	10379	106	0	0	507179	0	1963502	0
1689666595.720606	worker-1-1	373	1671716	1176140418	18551	1690267	5.095542	-	1763935	1763918	1916	2624	80	26211	73952	770	499710	219954	9896	93	0	0	487499	4796	1773685	0
1689666296.412325	worker-1-4	241	31689	19420776	2124988	2156677	405.070991	-	33879	33997	701	68	31361	1494	24	40363	9504	93	5	0	0	571243	0	10598	0
1689666596.566161	worker-1-4	249	75704	44841174	996368	1072072	177.229402	-	96204	96086	1496	297	73235	4288	61	76415	15491	315	20	0	0	1075730	0	53963	0
1689666890.583509	logger	328	0	0	-	-	-	-	363	363	0	0	0	0	0	02013	91	0	0	0	0	0	0	0	0
1689666892.162084	manager	168	0	0	-	-	-	-	2979	2978	0	0	0	0	0	02138	63	0	0	0	0	0	0	0	0
1689666893.820429	proxy-1	147	0	0	-	-	-	-	143140	143140	0	0	0	0	0	01713	61	0	0	0	0	0	0	0	0
1689666895.723584	worker-1-1	376	1481959	939371895	0	1481959	0.000488	-	1795831	1795832	1825	2646	33	27172	72046	644	511643	219479	10702	58	0	0	296163	66640	1626471	0
1689666895.749393	worker-1-5	384	5828942	5291618670	0	5828942	0.000083	-	1866109	1866109	1872	2708	35	27293	71715	637	511385	219957	10892	76	0	0	367351	66640	1702553	0
1689666895.718901	worker-1-2	389	2066399	1469797942	74890	2141289	4.142399	-	1965421	1965431	1997	2702	29	26938	71323	652	530447	217642	10286	64	0	0	380822	0	1965573	0
1689666896.582199	worker-1-4	296	358184	227000933	2210716	2568900	24.103091	-	396927	396926	3249	706	22	10728	18019	177	227485	48773	1634	94	0	0	2435022	66640	300480	0
1689665995.730098	worker-1-3	224	79777	78413709	22974980	23054757	933.746794	-	24602	24602406	6	0	801	1082	21	22461	2869	74	0	0	0	393285	0	6447	0
1689666296.300109	worker-1-3	226	27861	21002965	1112905	1140766	730.235311	-	24530	24703	571	35	51096	1198	17	29583	5728	66	5	0	0	512228	0	43848	0
1689666596.610617	worker-1-3	242	42103	20710360	673414	715517	488.876490	-	70446	70284	1406	413	18	2666	3365	86	64433	13224	240	18	0	0	949777	0	37330	0
1689666896.728680	worker-1-3	246	32230	15999183	319419	351649	223.234622	-	57683	57676	979	90	02026	2343	42	79123	14049	183	9	0	0	777349	2631	68473	0
1689667190.583757	logger	328	0	0	-	-	-	-	363	363	0	0	0	0	0	02013	91	0	0	0	0	0	0	0	0
1689667192.162172	manager	168	0	0	-	-	-	-	2660	2661	0	0	0	0	0	02138	64	0	0	0	0	0	0	0	0
1689667193.820507	proxy-1	147	0	0	-	-	-	-	152504	152504	0	0	0	0	0	01713	61	0	0	0	0	0	0	0	0
1689667195.724268	worker-1-1	377	1314268	799209166	39488	1353756	0.000783	-	1741699	1741703	1875	2388	137	25022	73626	498	483656	217134	8727	80	0	0	342824	291098	1977426	0
1689667195.749673	worker-1-5	385	3136137	2672363215	0	3136137	0.000242	-	1769909	1769913	1805	2304	145	25522	73118	514	488066	216860	8931	69	0	0	364753	693059	2173149	0
1689667195.718903	worker-1-2	392	4210580	3691968830	465463	4676043	4.113310	-	1722161	1722150	2514	2325	139	24820	70433	492	497302	214421	7889	108	0	0	792600	0	2138074	0
1689667166.480769	worker-1-3	285	184686	103105728	510936	695622	35.534405	-	284808	313724	2491	70	82	7784	13572	189	158475	0	983	61	0	0	1703517	0	0	0
1689667192.996349	worker-1-4	354	2107692	1701573134	1898311	4006003	8.940853	-	1214532	1214550	3474	1045	107	21457	54765	459	514212	120782	5493	140	0	0	1478287	135880	1145634	0
1689667201.829916	worker-1-5	395	32787	22349091	0	32787	0.688453	-	140295	243137	1626	127	146	542	1472	4	10087	0	170	70	0	0	375742	0	0	0
1689667197.675986	worker-1-2	399	26080	21116474	0	26080	4.852965	-	118121	215936	2302	115	139	179	477	1	3630	0	57	85	0	0	773034	0	0	0
1689667201.824249	worker-1-1	388	20455	10610665	0	20455	0.744583	-	140617	243384	1633	127	136	521	1499	1	10184	0	159	76	0	0	364371	193789	0	0
1689667204.945037	proxy-1	147	0	0	-	-	-	-	6744	6746	0	0	0	0	0	0107	0	0	0	0	0	0	0	0	0
1689667206.852718	manager	168	0	0	-	-	-	-	79	83	0	0	0	0	0	0133	0	0	0	0	0	0	0	0	0

There are no entries in reporter.log and no errors in stderr.log

We use custom RPMs built from the official source tarballs, and the build sequence for both Zeek-5.2.2 and 6.0.0 is the same:

./configure --prefix=%{_prefix} --binary-package --enable-static-broker --disable-broker-tests --disable-btest --disable-btest-pcaps --with-python=/usr/bin/python3 --enable-jemalloc --disable-cpp-tests --build-type=Release
make
make install

PF_RING also reports the correct number of rings (with the same cluster ID) in both cases.

zeekctl.cfg

SendMail = 
MailTo = root@localhost
MailConnectionSummary = 0
MinDiskSpace = 5
MailHostUpDown = 0


LogRotationInterval = 3600
LogExpireInterval = 550

StatsLogEnable = 1
StatsLogExpireInterval = 90


StatusCmdShowAll = 0
SitePolicyScripts = local.zeek
LogDir = /zeeklog/logs
SpoolDir = /zeeklog/spool
CfgDir = /opt/zeek/etc

PFRINGClusterID = 21
ZeekArgs = -f "<LONGISH BPF FILTER>"

Any help in resolving this issue is appreciated.

EDIT: This appears to be an issue with live traffic. Timing tests on a 1.7GB PCAP containing around 2M packets show no appreciable difference between Zeek-5 and Zeek-6.

It appears that one or more workers “stall” at startup when using PF_RING, and this seems to happen at slightly elevated loads (around 100k packets per second). These workers consume no CPU and read no packets from PF_RING, so PF_RING starts dropping packets as that worker's queue fills up.

I have also noticed that some of the stalled workers will start working after some time, so that packets are no longer dropped, but this behaviour is random. When a stalled worker starts running, it writes to capture_loss.log with the correct timestamp (i.e. the timestamp matches what was earlier output by a healthy worker).

A running worker that was earlier stalled may stall again for some time, but a worker that started smoothly does not stall later.

I looked at the lsof and zeekctl diag output for the stalled workers but could not find any useful pointers. The /proc entries for the stalled workers show very few context switches, but otherwise the output does not appear any different from that of healthy workers (apart from low memory use); a quick check is sketched below.
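
Roughly the kind of comparison involved, for reference (the PID is a placeholder for a stalled worker's PID taken from zeekctl top or ps):

PID=123456   # placeholder: PID of a stalled worker
# Stalled workers show almost no voluntary context switches and a comparatively low RSS
grep -e ctxt_switches -e VmRSS /proc/$PID/status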

zeekctl diag output for a stalled worker

[worker-1-4]

No core file found.

Zeek 6.0.0
Linux 4.18.0-425.3.1.el8.x86_64

Zeek plugins:
Conn::Burst - <Insert description> (dynamic, version 0.1.0)
Conn::LongConnections - <Insert description> (dynamic, version 0.5.0)
HTTP::POSTBody - <Insert description> (dynamic, version 0.1.0)
Site::IsDarkNet - <Insert description> (dynamic, version 2.1.0)
SSL::JA3Fingerprint - <Insert description> (dynamic, version 0.5.0)

==== No reporter.log

==== stderr.log
listening on ens10


==== stdout.log
max memory size         (kbytes, -m) unlimited
data seg size           (kbytes, -d) unlimited
virtual memory          (kbytes, -v) unlimited
core file size          (blocks, -c) unlimited

==== .cmdline
-i ens10 -U .status -p zeekctl -p zeekctl-live -p local -p worker-1-4 local.zeek zeekctl base/frameworks/cluster zeekctl/auto -f <LONG BPF FILTER>

==== .env_vars
PATH=/opt/zeek/bin:/opt/zeek/share/zeekctl/scripts:/opt/zeek/bin:/home/admin/.local/bin:/home/admin/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin
ZEEKPATH=/zeeklog/spool/installed-scripts-do-not-touch/site::/zeeklog/spool/installed-scripts-do-not-touch/auto:/opt/zeek/share/zeek:/opt/zeek/share/zeek/policy:/opt/zeek/share/zeek/site:/opt/zeek/share/zeek/builtin-plugins
CLUSTER_NODE=worker-1-4

==== .status
RUNNING [run_loop]

==== No prof.log

==== No packet_filter.log

==== No loaded_scripts.log

Hi there,

Thanks for the detailed write-up. I can’t think of any PF_RING-specific changes between 5.2 and 6.0 that would explain what you’re seeing, so I’m thinking the stalls are a side-effect of something else.

Normally problems that come up only in clusterized operation are due to hotspots in the inter-node communication, usually specific nodes getting overwhelmed by Zeek events or log-writes. These tend to become visible as excessive memory use, but I’m not seeing that in your stats.log — at least nothing glaring, and this also wouldn’t fit your startup stall observation.

A few questions/ideas:

  • It’d be nice to rule out PF_RING, regardless. Are you able to temporarily switch to an alternative packet capture toolchain like AF_PACKET?

  • Do you use large intel framework input files? I’m asking because the distribution of these across the cluster can cause delays at startup (though here too I can’t think of 5.2->6.0 changes that would matter).

  • Zeek’s telemetry might be able to offer more clues than stats.log. We have a new troubleshooting section in our documentation that you could take a look at for this.

  • You could attach gdb to a worker in the stalled state (though you’d need to switch your build type to get debug symbols) and see what it’s doing; see the sketch after this list. A flamegraph could also be useful.

  • It’s great that you’re on 5.2.2 because it makes the window of changes relatively manageable. If you’re comfortable with git we could try a git bisect session to pinpoint the change (rough outline below). I did one of these recently in another context, and it took around 10 iterations to pinpoint the offending commit between 5.2 and 6.0. Happy to provide more details.
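
For the gdb route, a minimal sketch of what I have in mind (the PID is a placeholder for a stalled worker's PID from zeekctl top, and you'd want debug symbols, e.g. a RelWithDebInfo build instead of Release, for readable stacks):

# Dump backtraces of all threads in the stalled worker, then detach
gdb -p 123456 -batch -ex "thread apply all bt"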
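
And a rough outline of the bisect workflow, in case it helps (the build/deploy/check steps are schematic and would need adapting to your RPM-based setup):

git clone --recursive https://github.com/zeek/zeek && cd zeek
git bisect start v6.0.0 v5.2.2    # v6.0.0 is known bad, v5.2.2 known good
# At each step: git submodule update --init --recursive, build, deploy to the
# sensor, check for packet drops, then run "git bisect good" or "git bisect bad".
# Repeat until git names the first bad commit, then clean up with:
git bisect reset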

Out of curiosity, what are the Zeek plugins in the diag output? Their names look more like script-only Zeek packages. For PF_RING users we normally recommend the ntop/bro-pf_ring plugin on GitHub ("Zeek (formerly Bro) plugin to capture packets through PF_RING").

Many of us are out on vacation right now, so apologies for slow response times until around the beginning of August.

Best,
Christian

Thanks for looking at this Christian.

I am thinking that the way we use PF_RING with Zeek may be a problem.

We just add the path to the PF_RING libpcap (/usr/local/lib) to the environment before we run Zeek, and then the correct number of PF_RING rings (with the configured cluster ID) is formed and the workers receive packets.

Is this approach perhaps broken after the upgrade to 6? If I remove the path from the environment (and link Zeek against the native libpcap), the worker (or workers, if I don't change node.cfg) drops some packets initially but quickly starts using the CPU, and further packet drops are reduced.
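
(For reference, a quick way to confirm which libpcap the zeek binary resolves in either case, assuming the PF_RING path is injected via LD_LIBRARY_PATH and our /opt/zeek prefix:)

# With the PF_RING path set, libpcap should resolve from /usr/local/lib;
# without it, zeek falls back to the system libpcap.
export LD_LIBRARY_PATH=/usr/local/lib
ldd /opt/zeek/bin/zeek | grep -i pcap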

(All workers here receive same copy of traffic so loads are high)

# zeekctl top
Name         Type    Host             Pid     VSize  Rss  Cpu   Cmd
logger       logger  localhost        1852514 775M   126M  25%  zeek
manager      manager localhost        1852563 366M   103M   0%  zeek
proxy-1      proxy   localhost        1852611 372M   104M  12%  zeek
worker-1-1   worker  localhost        1852711 782M   513M  81%  zeek
worker-1-2   worker  localhost        1852707 782M   511M  93%  zeek
worker-1-3   worker  localhost        1852709 782M   511M  93%  zeek
worker-1-4   worker  localhost        1852713 782M   511M  87%  zeek
worker-1-5   worker  localhost        1852714 782M   511M  81%  zeek

# zeekctl netstats
 worker-1-1: 1689918246.843961 recvd=1998983 dropped=247537 link=2249083
 worker-1-2: 1689918246.915474 recvd=2000722 dropped=250288 link=2254091
 worker-1-3: 1689918246.963278 recvd=2021961 dropped=224048 link=2249006
 worker-1-4: 1689918247.012879 recvd=2010319 dropped=245095 link=2258279
 worker-1-5: 1689918247.082430 recvd=2023355 dropped=225161 link=2250941

With the PF_RING libpcap in the path, this is what happens:

# zeekctl netstats
 worker-1-1: 1689916548.856383 recvd=3857 dropped=0 link=3857
 worker-1-2: 1689916548.263639 recvd=1917 dropped=371100 link=373017
 worker-1-3: 1689916548.818753 recvd=2940 dropped=0 link=2940
 worker-1-4: 1689916548.485396 recvd=2074 dropped=247439 link=249513
 worker-1-5: 1689916564.997089 recvd=170885 dropped=0 link=170885

# zeekctl top
Name         Type    Host             Pid     VSize  Rss  Cpu   Cmd
logger       logger  localhost        1826236 680M   290M   0%  zeek
manager      manager localhost        1826285 360M   140M   0%  zeek
proxy-1      proxy   localhost        1826333 360M   141M   6%  zeek
worker-1-1   worker  localhost        1826431 420M   216M   0%  zeek
worker-1-2   worker  localhost        1826428 394M   212M   0%  zeek
worker-1-3   worker  localhost        1826435 400M   215M   0%  zeek
worker-1-4   worker  localhost        1826434 400M   212M   0%  zeek
worker-1-5   worker  localhost        1826436 466M   273M  25%  zeek

When PF_RING is in use with Zeek-6, some or all workers process packets very slowly (<100 per second), so new packets keep getting dropped. After a non-deterministic time (a few minutes to an hour), the workers start processing packets at speed and packets are no longer dropped. The processing rate may later drop back below 100 packets per second in one or more workers, but this slowdown seems temporary and random.

When using 5.2.2 with the same PF_RING setup, this is what I see:

# zeekctl netstats
 worker-1-1: 1689918793.994643 recvd=871245 dropped=0 link=871245
 worker-1-2: 1689918794.050900 recvd=435704 dropped=0 link=435704
 worker-1-3: 1689918794.107826 recvd=528243 dropped=0 link=528243
 worker-1-4: 1689918794.162082 recvd=726034 dropped=0 link=726034
 worker-1-5: 1689918794.217483 recvd=615310 dropped=0 link=615310

# zeekctl top
Name         Type    Host             Pid     VSize  Rss  Cpu   Cmd
logger       logger  localhost        1864888 790M   114M   0%  zeek
manager      manager localhost        1864937 412M    92M   0%  zeek
proxy-1      proxy   localhost        1864992 352M    92M   6%  zeek
worker-1-1   worker  localhost        1865091 469M   250M  31%  zeek
worker-1-2   worker  localhost        1865090 469M   249M  18%  zeek
worker-1-3   worker  localhost        1865092 469M   250M  31%  zeek
worker-1-4   worker  localhost        1865087 469M   253M  37%  zeek
worker-1-5   worker  localhost        1865095 475M   250M  18%  zeek

EDIT - With AF_PACKET the workers also get up to speed quickly. A small number of packets gets dropped at startup, but no new drops are seen later on and CPU usage also increases:

# zeekctl netstats
 worker-1-1: 1689926909.085256 recvd=16099195 dropped=57420 link=29428961
 worker-1-2: 1689926909.148833 recvd=5783459 dropped=50292 link=18120351
 worker-1-3: 1689926909.208387 recvd=6177769 dropped=43832 link=14341727
 worker-1-4: 1689926909.265238 recvd=8388867 dropped=17044 link=27166837
 worker-1-5: 1689926909.324552 recvd=6529677 dropped=51817 link=21679472

# zeekctl top
Name         Type    Host             Pid     VSize  Rss  Cpu   Cmd
logger       logger  localhost        1917146 859M   216M  17%  zeek
manager      manager localhost        1917195 408M   107M   5%  zeek
proxy-1      proxy   localhost        1917247 362M   113M   5%  zeek
worker-1-1   worker  localhost        1917342 712M   440M  23%  zeek
worker-1-2   worker  localhost        1917348 656M   430M  23%  zeek
worker-1-3   worker  localhost        1917350 712M   432M  23%  zeek
worker-1-4   worker  localhost        1917352 712M   440M  29%  zeek
worker-1-5   worker  localhost        1917349 656M   430M  29%  zeek

Really interesting observations. Zeek knows nothing about PF_RING, so whatever causes this would seem to be a side-effect of other changes. Zeek 6 contains changes to how it manages network vs. current time, as well as subtle time-related changes to its I/O loop. I'm wondering if those might be related.

We’ll dig in and follow up.

Best,
Christian

@redbaron - thank you very much for reporting. I’m able to reproduce something similar in a VM:

  • pf_ring installed via ntop repositories (adds /usr/local/lib to ld.conf setup)
  • lb_procs = 4, lb_method = pf_ring
  • Replaying 10kpps mostly SSL traffic

With Zeek 5.2.2, there are no packet drops visible through the status files in /proc/net/pf_ring:

root@ubuntu-01:/proc/net/pf_ring# for f in  *enp6s0* ;  do echo "$f $(cat $f | grep Lost) $(cat $f | grep App)" ; done
63289-enp6s0.65 Tot Pkt Lost           : 0 Appl. Name             : zeek-enp6s0
63292-enp6s0.64 Tot Pkt Lost           : 0 Appl. Name             : zeek-enp6s0
63293-enp6s0.63 Tot Pkt Lost           : 0 Appl. Name             : zeek-enp6s0
63294-enp6s0.62 Tot Pkt Lost           : 0 Appl. Name             : zeek-enp6s0

With Zeek 6.0, the current observation is that some of the workers start dropping packets immediately or after a short period of time:

root@ubuntu-01:/proc/net/pf_ring# for f in  *enp6s0* ;  do echo "$f $(cat $f | grep Lost) $(cat $f | grep App)" ; done
73720-enp6s0.78 Tot Pkt Lost           : 29138 Appl. Name             : zeek-enp6s0
73721-enp6s0.81 Tot Pkt Lost           : 13508 Appl. Name             : zeek-enp6s0
73722-enp6s0.79 Tot Pkt Lost           : 0 Appl. Name             : zeek-enp6s0
73725-enp6s0.80 Tot Pkt Lost           : 0 Appl. Name             : zeek-enp6s0

CPU usage of workers dropping packets plummets.

I'll look at this in more detail - this is an obvious regression, and I suspect it's caused by the I/O loop or timing changes introduced in 6.0.

Thanks again!

@redbaron - could you try the following in your local.zeek with Zeek 6.0 (assuming you have a testing environment around) and see if this changes things drastically?

redef io_poll_interval_live = 100;

The reverse experiment on my end was to cherry-pick just the PR that changed the io_poll_interval_live determination from 100 to 10 into the Zeek 5.2 branch. Once that is done, the packet drops with PF_RING reproduce with Zeek 5.2.2 as well.

My current understanding is that PF_RING provides a selectable FD to Zeek, but it seems that polling it stops (Num Poll Calls stops increasing) and this is not detected/flagged by Zeek's I/O loop, resulting in Manager::Poll() -> kevent() actually blocking until a timer expires or a Broker message comes in. We may previously have gotten lucky with io_poll_interval_live = 100 accidentally being high enough to consume enough packets.
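
If you want to watch for this on your sensor, the per-ring status files under /proc/net/pf_ring make the stall visible; something along these lines (the glob assumes the <pid>-<interface>.<ring> naming shown above and your ens10 workers):

# On a stalled worker's ring, "Num Poll Calls" stays flat while "Tot Pkt Lost" keeps growing
watch -n 5 'grep -H -e "Num Poll Calls" -e "Tot Pkt Lost" /proc/net/pf_ring/*-ens10.*'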

@awelzel

Many thanks for following this up.

I upgraded to Zeek-6 on the same sensor and added the following setting to local.zeek:

redef io_poll_interval_live = 100;

After zeekctl deploy there were no packet drops, and behaviour was consistent with what was seen with Zeek-5.2:

# zeekctl netstats
 worker-1-1: 1690865326.106985 recvd=3507653 dropped=0 link=3507653
 worker-1-2: 1690865326.158915 recvd=1396391 dropped=0 link=1396391
 worker-1-3: 1690865326.225469 recvd=1687950 dropped=0 link=1687950
 worker-1-4: 1690865326.274738 recvd=1507115 dropped=0 link=1507115
 worker-1-5: 1690865326.332403 recvd=1890484 dropped=0 link=1890484

# zeekctl top
Name         Type    Host             Pid     VSize  Rss  Cpu   Cmd
logger       logger  localhost        2475545 811M   348M   0%  zeek
manager      manager localhost        2475604 408M   162M   0%  zeek
proxy-1      proxy   localhost        2475672 360M   142M   5%  zeek
worker-1-1   worker  localhost        2475917 592M   400M  17%  zeek
worker-1-2   worker  localhost        2475924 586M   394M  11%  zeek
worker-1-3   worker  localhost        2475921 592M   416M  17%  zeek
worker-1-4   worker  localhost        2475930 592M   390M  17%  zeek
worker-1-5   worker  localhost        2475927 606M   409M  29%  zeek

As a follow-up, we believe this is an issue in PF_RING when used with epoll. We have opened an issue there: "epoll() may require unconditional poll_wait() invocation in ring_poll()" (ntop/PF_RING issue #878 on GitHub).
