High CPU on just a single worker in the cluster

I'm in the process of trying to debug an odd high-CPU issue and am looking for guidance.

The deployment is as follows:
  - The cluster has two nodes, each with 10 workers, and the workers are pinned to specific CPU cores (a node.cfg sketch is below).
  - x520 NICs with PF_RING
  - Traffic to each node is load balanced equally
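
For reference, the worker section of node.cfg on each box looks roughly like this (the hostname, interface name, and CPU list here are placeholders rather than my real values):

    [win-int]
    type=worker
    host=192.0.2.10
    interface=p1p1
    lb_method=pf_ring
    lb_procs=10
    pin_cpus=2,3,4,5,6,7,8,9,10,11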

The issue is that one worker on one of the nodes is always at 100% CPU while all the other workers are around 50%. If I restart Bro, a different worker will pin at 100%, but it's always on the same node.

I ran 'strace' on both a "bad" and a "good" worker, and one anomaly I spotted was that the "bad" worker never called 'nanosleep', whereas the "good" worker made about 84,000 'nanosleep' calls in the same amount of time.
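
In case it's useful, here is how I collected those counts (the PID is a placeholder for the actual worker process):

    # Attach to one worker, let it run for ~30 seconds, then hit Ctrl-C to get
    # the per-syscall summary; repeat for a "good" and a "bad" worker and compare.
    sudo strace -c -p <worker_pid>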

I'm wondering if it's possible for a queue to go bad on the x520, which might explain why it's a random worker on the same node after restarting.

Is there a way to determine which x520 queue a specific worker is reading from?

Thanks,
-Dave

Can you load this script? It will add a node column to conn.log that says which node handled each connection:

https://github.com/broala/bro-snippets/blob/master/add-node-to-conn.bro

Also, the output of 'broctl netstats' would be useful to see.
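
As for figuring out which x520 queue a specific worker is reading from, a couple of things are worth looking at (the interface name below is a guess, and the exact counter/file names vary by driver and PF_RING version):

    # Per-RX-queue packet counters on the NIC
    ethtool -S p1p1 | grep rx_queue

    # PF_RING creates one proc entry per open ring; each file shows which
    # device/channel and cluster the owning process is bound to
    ls /proc/net/pf_ring/
    cat /proc/net/pf_ring/*-p1p1*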

I'm already capturing the node in conn.log but haven't been able to spot anything out of the ordinary compared to the other nodes.

Below is a fresh 'netstats' from this morning (WIN_INT-9 is obviously the culprit).

  MID_INT-1: 1460636199.609909 recvd=327807625 dropped=1 link=327807625
 MID_INT-10: 1460636199.813720 recvd=339804389 dropped=4 link=339804389
  MID_INT-2: 1460636200.010047 recvd=313901304 dropped=0 link=313901304
  MID_INT-3: 1460636200.210033 recvd=323507786 dropped=1 link=323507786
  MID_INT-4: 1460636200.413951 recvd=322338069 dropped=0 link=322338069
  MID_INT-5: 1460636200.613996 recvd=314681107 dropped=1 link=314681107
  MID_INT-6: 1460636200.814761 recvd=325488973 dropped=1 link=325488973
  MID_INT-7: 1460636201.017945 recvd=328830658 dropped=3 link=328830658
  MID_INT-8: 1460636201.218113 recvd=338250015 dropped=0 link=338250015
  MID_INT-9: 1460636201.417949 recvd=387979776 dropped=0 link=387979776
  WIN_INT-1: 1460636288.903341 recvd=142474122 dropped=1 link=142474122
 WIN_INT-10: 1460636289.103648 recvd=232076131 dropped=1 link=232076131
  WIN_INT-2: 1460636289.303290 recvd=145451659 dropped=2 link=145451659
  WIN_INT-3: 1460636289.507242 recvd=182345947 dropped=0 link=182345947
  WIN_INT-4: 1460636289.707591 recvd=140378820 dropped=1 link=140378820
  WIN_INT-5: 1460636289.911410 recvd=140342198 dropped=0 link=140342198
  WIN_INT-6: 1460636290.111178 recvd=138961706 dropped=0 link=138961706
  WIN_INT-7: 1460636290.315433 recvd=198792251 dropped=0 link=198792251
  WIN_INT-8: 1460636290.515158 recvd=170824302 dropped=3 link=170824302
  WIN_INT-9: 1460636287.108095 recvd=2414368833 dropped=438939600 link=2414368833

Ahh, well, that last worker is seeing almost twice as many packets as all of the other workers on that host combined, so that explains the CPU usage.

What does a frequency distribution of the node column from your conn.log from around that time show?

    zcat conn.....gz | bro-cut node | sort | uniq -c

Adding this to local.bro

    @load misc/stats

If you don't have it loaded already, this will give you a stats.log, which will contain some helpful information too.
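
For example, something like this will total the packets processed per worker (the 'peer' and 'pkts_proc' field names are from memory, so check the #fields header in your stats.log):

    # Sum packets processed per worker across the whole log
    bro-cut peer pkts_proc < stats.log \
        | awk '{ n[$1] += $2 } END { for (p in n) print n[p], p }' \
        | sort -rn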

If I had to guess, there's probably something going on traffic-wise: the WIN box is seeing a ton of non-IP traffic that all gets load balanced to the same worker.

Can you run capstats on the two boxes and compare the output?

    # capstats -i p1p1 -I 2
    1460639895.723665 pkts=6642 kpps=2.8 kbytes=2100 mbps=7.2 nic_pkts=6642 nic_drops=0 u=0 t=0 i=0 o=0 nonip=6642

I just noticed that capstats doesn't properly handle VLAN-encapsulated packets, so all of our traffic shows up as nonip.. I'll look into fixing that, but if you are not using VLANs, looking at the breakdown of udp, tcp, ip, other, nonip (u, t, i, o, nonip) would help.

Here is the frequency distribution (I'm logging in JSON, so the command is different but the result is the same):

    $ zcat 2016-04-14/conn.08:00:00-09:00:00.log.gz | jq -r '.node' | sort | uniq -c

192470 MID_GLR-1
192325 MID_GLR-2
193491 MID_GLR-3
192444 MID_GLR-4
192288 MID_GLR-5
654908 MID_INT-1
655252 MID_INT-10
652749 MID_INT-2
657361 MID_INT-3
655236 MID_INT-4
656477 MID_INT-5
654518 MID_INT-6
656199 MID_INT-7
656069 MID_INT-8
655798 MID_INT-9
770362 WIN_INT-1
772679 WIN_INT-10
773520 WIN_INT-2
772197 WIN_INT-3
768066 WIN_INT-4
771089 WIN_INT-5
771734 WIN_INT-6
772599 WIN_INT-7
772359 WIN_INT-8
721526 WIN_INT-9

You may be on to something with the non-IP traffic... there is a drastic difference between the two datacenters (roughly 64% of the packets on WIN are non-IP, versus almost none on MID):

WIN

1460641772.239436 pkts=10414545 kpps=208.2 kbytes=5732528 mbps=938.6 nic_pkts=10414545 nic_drops=0 u=104675 t=3627503 i=307 o=405 nonip=6681655

MID
1460641723.573448 pkts=9553569 kpps=178.9 kbytes=6561123 mbps=1006.6 nic_pkts=9553569 nic_drops=0 u=174140 t=9373195 i=267 o=934 nonip=5033

-Dave

You are seeing the side effects of hot spotting. It's likely that you have a connection which is massively larger than all of the others that host is seeing. You could filter that flow out of your traffic, which would cut down on the problems you are seeing.

I just pushed a new repository that might help you address your issue. It still needs a lot of cleanup and documentation, but if you just load the module as-is, it might help you automatically. :) It identifies large and fast flows and places BPF filters to shunt them while still watching for SYN/ACK/FIN/RST flags.

  https://github.com/sethhall/bro-bulkflows

There are a couple of configuration options you'll want to pay attention to, so I'll just give you an example blurb you could put into your local.bro:

    @load bro-bulkflows

    # Set a minimum flow size before considering shunting (32MB).
    redef BulkFlows::size_threshold = 33554432;

    # Set a minimum flow data rate before considering shunting (50Mbps).
    redef BulkFlows::speed_threshold = 6250000;

    # If you have VLAN tags on your traffic, you will also have to add the
    # following line, because packet filtering with BPF is generally broken
    # otherwise...
    #redef PacketFilter::restricted_filter="vlan";

  .Seth

I'm not 100% familiar with the load balancing methods, but I know that if you're doing a 5-tuple load balancing algorithm, I have seen IPsec stick to a single location: there are effectively only three tuple fields available (src IP, dst IP and proto), and they don't change for most static IPsec connections. That results in the traffic not moving from worker to worker the way normal connections would.

Could there be a high-use IPsec tunnel between that datacenter and another location?
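
A quick way to check would be something like this (the interface name is a guess):

    # Sample ESP (and NAT-T) traffic and count packets per src/dst pair
    sudo tcpdump -n -i eth6 -c 1000 'esp or udp port 4500' \
        | awk '{ print $3, $5 }' | sort | uniq -c | sort -rn | head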

Great.. just what I was thinking. At this point you should be able to just run something like

    tcpdump -n -c 1000 'not ip'

on the WIN box

and see exactly what this traffic is.. then we can figure out what to do about it...

    $ sudo tcpdump -n -i eth6 not ip and not arp -c10000 | grep ethertype | cut -f 2 -d ',' | sort | uniq -c
    9980 ethertype Unknown (0x8903)

A quick Google points to Cisco FabricPath Switching (http://www.cisco.com/en/US/docs/switches/datacenter/sw/5_x/nx-os/fabricpath/configuration/guide/fp_switching.html):

"The FabricPath hierarchical MAC address carries the reserved EtherType 0x8903."

I suppose now is a good time to reach out to the Network Engineering team and ask about the SPAN placement in that datacenter.

Thanks for helping me quickly navigate this issue!

-Dave

This repository will probably go away at some point. I need to integrate some small changes into Justin's bro-react[1] repository to make it work with BPF filters. Apparently I had gotten that code from Justin a long time ago and subsequently forgotten about it. It makes more sense to just integrate it into his, but this is functionality that I think could eventually make sense to integrate into Bro as a nice core pairing with the upcoming NetControl framework.

  .Seth

1. https://github.com/JustinAzoff/bro-react

Ah.. so there are probably two issues here: Bro and pf_ring.

Based on this image from your link

https://www.cisco.com/en/US/docs/switches/datacenter/sw/5_x/nx-os/fabricpath/configuration/guide/fp_switching_files/fp_switching-1.jpg

the FP header is fixed size, so adding support for it to Bro should be easy enough: handle that ethertype and skip the right number of bytes (see iosource/Packet.cc).
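
If you want to eyeball that header before touching any code, something like this will dump a few of those frames in hex (the interface name is a guess):

    # Dump a handful of FabricPath frames including the link-layer header,
    # to confirm the fixed-size outer header in front of the inner Ethernet frame
    sudo tcpdump -n -i eth6 -c 5 -XX 'ether proto 0x8903'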

That won't help with the pf_ring issue, though; you're probably best off reaching out to the pf_ring people about that.

It's entirely possible you can fix the issue with a different SPAN configuration, though.