Debugging high CPU on a single worker

I’m seeing a repeatable issue where the first of ten workers consistently shows higher CPU usage than the rest.

11 3014 bro 20 0 2078M 1250M 502M R 97.4 1.0 18h34:48 /opt/bro/bin/bro -i netmap::bro}0 -U .status -p broctl -p broctl-live -p
14 3005 bro 20 0 1885M 1056M 501M R 83.5 0.8 13h09:44 /opt/bro/bin/bro -i netmap::bro}4 -U .status -p broctl -p broctl-live -p
18 3013 bro 20 0 1746M 983M 501M R 67.3 0.8 13h25:41 /opt/bro/bin/bro -i netmap::bro}8 -U .status -p broctl -p broctl-live -p
17 3009 bro 20 0 2139M 977M 502M S 63.5 0.8 13h11:37 /opt/bro/bin/bro -i netmap::bro}7 -U .status -p broctl -p broctl-live -p
19 3008 bro 20 0 2168M 1333M 501M S 63.0 1.0 13h30:12 /opt/bro/bin/bro -i netmap::bro}9 -U .status -p broctl -p broctl-live -p
13 3006 bro 20 0 2164M 1329M 502M S 62.5 1.0 13h16:05 /opt/bro/bin/bro -i netmap::bro}3 -U .status -p broctl -p broctl-live -p
12 3012 bro 20 0 2194M 1287M 501M R 60.6 1.0 13h15:14 /opt/bro/bin/bro -i netmap::bro}2 -U .status -p broctl -p broctl-live -p
15 3011 bro 20 0 2177M 982M 502M R 60.6 0.8 13h03:26 /opt/bro/bin/bro -i netmap::bro}5 -U .status -p broctl -p broctl-live -p
16 3007 bro 20 0 2317M 1237M 502M R 60.1 1.0 13h12:35 /opt/bro/bin/bro -i netmap::bro}6 -U .status -p broctl -p broctl-live -p
20 3010 bro 20 0 2328M 991M 501M R 58.2 0.8 13h14:40 /opt/bro/bin/bro -i netmap::bro}1 -U .status -p broctl -p broctl-live -p

Even after a fresh restart, the “bro}0” worker uses about twice the CPU of the other workers:

11 1799 bro 20 0 1422M 668M 501M S 73.5 0.5 0:12.28 /opt/bro/bin/bro -i netmap::bro}0 -U .status -p broctl -p broctl-live -p
15 1813 bro 20 0 1420M 530M 364M S 37.9 0.4 0:07.04 /opt/bro/bin/bro -i netmap::bro}5 -U .status -p broctl -p broctl-live -p
17 1831 bro 20 0 1420M 527M 362M R 37.5 0.4 0:06.92 /opt/bro/bin/bro -i netmap::bro}7 -U .status -p broctl -p broctl-live -p
12 1805 bro 20 0 1424M 525M 356M R 36.5 0.4 0:07.04 /opt/bro/bin/bro -i netmap::bro}2 -U .status -p broctl -p broctl-live -p
18 1830 bro 20 0 1420M 514M 350M R 35.1 0.4 0:06.67 /opt/bro/bin/bro -i netmap::bro}8 -U .status -p broctl -p broctl-live -p
19 1826 bro 20 0 1421M 533M 367M S 34.6 0.4 0:07.15 /opt/bro/bin/bro -i netmap::bro}9 -U .status -p broctl -p broctl-live -p
20 1832 bro 20 0 1421M 529M 363M R 34.6 0.4 0:07.06 /opt/bro/bin/bro -i netmap::bro}1 -U .status -p broctl -p broctl-live -p
14 1827 bro 20 0 1420M 530M 365M S 34.6 0.4 0:06.87 /opt/bro/bin/bro -i netmap::bro}4 -U .status -p broctl -p broctl-live -p
13 1802 bro 20 0 1418M 520M 356M S 34.6 0.4 0:06.79 /opt/bro/bin/bro -i netmap::bro}3 -U .status -p broctl -p broctl-live -p
16 1824 bro 20 0 1417M 502M 339M R 33.1 0.4 0:06.60 /opt/bro/bin/bro -i netmap::bro}6 -U .status -p broctl -p broctl-live -p

A quick spot check of an hour’s worth of conn logs shows the traffic is balanced as expected:

MID_EXT-1 = 397479
MID_EXT-2 = 409771

MID_EXT-10 = 408699
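(A per-worker count like this can be produced with standard tools; the sketch below assumes the worker name is available on each log line, which is an assumption rather than a default conn.log field, and uses inline sample data in place of real logs.)

```shell
# Tally log lines per worker name. On a real sensor you would feed this
# from the conn logs; the per-line worker tag is an assumption, and the
# printf here just stands in for that input.
printf 'MID_EXT-1\nMID_EXT-2\nMID_EXT-1\nMID_EXT-10\n' \
  | sort | uniq -c | sort -rn
```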

The real issue is that it’s the only worker that’s reporting capture_loss, which I assume is due to it hitting 100% CPU:

#fields ts ts_delta peer gaps acks percent_lost
1487866049.759038 600.000033 MID_EXT-2 4713 4710856 0.100046
1487866049.759333 600.000015 MID_EXT-9 5427 5193821 0.10449
1487866049.760596 600.000068 MID_EXT-5 5768 5548436 0.103957
1487866049.762628 600.000107 MID_EXT-8 5190 5224691 0.099336
1487866049.764048 600.000125 MID_EXT-3 5036 4947221 0.101795
1487866111.312071 600.000666 MID_EXT-6 8007 5959706 0.134352
1487866111.313803 600.000803 MID_EXT-4 6581 4981572 0.132107
1487866111.314788 600.000126 MID_EXT-7 6579 4905261 0.134121
1487866111.327390 600.000047 MID_EXT-10 7693 6369836 0.120772
1487866111.584908 600.000566 MID_EXT-1 823616 4227324 19.483153
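Reading each row as timestamp, interval, peer, gaps, acks, percent lost, the outlier checks out: 823616 gaps against 4227324 acks is about 19.5% loss, versus roughly 0.1% everywhere else. A quick check of the arithmetic:

```shell
# percent_lost = gaps / acks * 100, using the outlier worker's numbers
awk 'BEGIN { printf "%.6f\n", 823616 / 4227324 * 100 }'
```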

Any tips on how to debug/profile further would be greatly appreciated.

Thanks!
-Dave

Hi Dave,

> Any tips on how to debug/profile further would be greatly appreciated.

Using https://github.com/0xxon/bro-scripts/blob/master/conn-workers.bro
to look at the types of connections handled by that worker might help.

Jan

I'd guess it is one of two things:

The pin_cpus setting pinning that worker to the same CPU that is being used for things like interrupts

The fact that it is the first worker on netmap may mean it receives all of a certain kind of traffic that can't otherwise be load balanced to the other workers.

What pin_cpus setting are you using and what does /proc/interrupts contain?
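For the /proc/interrupts part, something like the following shows which CPU is absorbing the NIC's interrupts. The interface name (eth0) and the sample data are placeholders; on the live box, point the awk at /proc/interrupts itself:

```shell
# Find the busiest CPU for each NIC IRQ in a /proc/interrupts-style
# listing. Sample data stands in for the real file; "eth0" is a
# placeholder for the capture interface's driver name.
cat <<'EOF' > /tmp/interrupts.sample
           CPU0       CPU1       CPU2
  64:   91822345        102         98   PCI-MSI  eth0-rx-0
  65:        120   90031773        101   PCI-MSI  eth0-rx-1
EOF
awk '/eth0/ {
  sub(/:/, "", $1)                      # IRQ number
  max = 0; cpu = -1
  for (i = 2; i <= 4; i++)              # 3 CPU columns in the sample
    if ($i + 0 > max) { max = $i + 0; cpu = i - 2 }
  printf "IRQ %s -> busiest on CPU%d (%d interrupts)\n", $1, cpu, max
}' /tmp/interrupts.sample
```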

--
- Justin Azoff

You nailed it. I missed the IRQ triage: the NIC's interrupts are pinned to the first CPU in its NUMA node, which is the same core the first Bro worker is pinned to.

The dilemma now is that this hardware only has 10 cores per CPU, so either I reduce the number of Bro workers to nine and leave a core dedicated to NIC interrupts, or I pin the NIC to a core on the other CPU and take a latency hit from crossing NUMA nodes.
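If I go with the first option, the changes would look roughly like this. Core numbers, worker count, and the interface name are illustrative for my layout, not a recommendation:

```shell
# Sketch: run 9 workers pinned to cores 1-9 of the NIC's NUMA node and
# leave core 0 for interrupts. All values below are illustrative.
#
# broctl node.cfg excerpt:
#   [worker-1]
#   type=worker
#   lb_procs=9
#   pin_cpus=1,2,3,4,5,6,7,8,9
#
# Then steer every NIC IRQ to CPU0 (affinity mask 0x1). This loop only
# prints the commands (dry run); "eth0" is a placeholder interface name.
for irq in $(awk '/eth0/ {sub(/:/, "", $1); print $1}' /proc/interrupts); do
  printf 'echo 1 > /proc/irq/%s/smp_affinity\n' "$irq"
done
```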

-Dave