I'm in the process of trying to debug an odd high-cpu issue and looking for guidance.
The deployment is a follows:
- Cluster has with two nodes, each with 10 workers and the workers are pinned to specific cpu cores.
- x520 with PF_RING
- Traffic to each node is load balanced equally
The issue is that one worker on one of the nodes is always at 100% CPU while all other workers are around 50%. If I restart Bro a different worker will pin to 100%, but always on the same node.
I ran 'strace' on both a "bad" and "good" worker and one anomaly I spotted was that the "bad" worker never called 'nanosleep', whereas the "good" worker had about 84,000 'nanosleep' calls in the same amount of time.
I'm wondering if its possible for a queue to go bad on the x520, which might explain why its a random worker on the same node after restarting.
Is there a way to determine which x520 queue a specific worker is reading from?
Can you load this script that will add a node column to the conn.log that says which node handled that connection:
also, what 'broctl netstats' outputs would be useful to see.
I'm already capturing the node in conn.log but haven't been able to spot anything out off the ordinary compared to the other nodes.
Below is a fresh 'netstats' from this morning (WIN_INT-9 is obviously the culprit).
MID_INT-1: 1460636199.609909 recvd=327807625 dropped=1 link=327807625
MID_INT-10: 1460636199.813720 recvd=339804389 dropped=4 link=339804389
MID_INT-2: 1460636200.010047 recvd=313901304 dropped=0 link=313901304
MID_INT-3: 1460636200.210033 recvd=323507786 dropped=1 link=323507786
MID_INT-4: 1460636200.413951 recvd=322338069 dropped=0 link=322338069
MID_INT-5: 1460636200.613996 recvd=314681107 dropped=1 link=314681107
MID_INT-6: 1460636200.814761 recvd=325488973 dropped=1 link=325488973
MID_INT-7: 1460636201.017945 recvd=328830658 dropped=3 link=328830658
MID_INT-8: 1460636201.218113 recvd=338250015 dropped=0 link=338250015
MID_INT-9: 1460636201.417949 recvd=387979776 dropped=0 link=387979776
WIN_INT-1: 1460636288.903341 recvd=142474122 dropped=1 link=142474122
WIN_INT-10: 1460636289.103648 recvd=232076131 dropped=1 link=232076131
WIN_INT-2: 1460636289.303290 recvd=145451659 dropped=2 link=145451659
WIN_INT-3: 1460636289.507242 recvd=182345947 dropped=0 link=182345947
WIN_INT-4: 1460636289.707591 recvd=140378820 dropped=1 link=140378820
WIN_INT-5: 1460636289.911410 recvd=140342198 dropped=0 link=140342198
WIN_INT-6: 1460636290.111178 recvd=138961706 dropped=0 link=138961706
WIN_INT-7: 1460636290.315433 recvd=198792251 dropped=0 link=198792251
WIN_INT-8: 1460636290.515158 recvd=170824302 dropped=3 link=170824302
WIN_INT-9: 1460636287.108095 recvd=2414368833 dropped=438939600 link=2414368833
Ahh, well that last worker is seeing almost double the number of packets than all of the other workers on that host combined, so that explains the CPU usage.
What does a frequency distribution of the node column from your conn.log from around that time show?
zcat conn.....gz | bro-cut node | sort | uniq -c
Adding this to local.bro
If you don't have it already will give you a stats.log which will contain some helpful information too.
If I had to guess, there's probably something going on traffic wise.. if I had to guess the WIN box is seeing a ton of non-ip traffic that all gets load balanced to the same worker.
Can you run capstats on the two boxes and compare the output?
# capstats -i p1p1 -I 2
1460639895.723665 pkts=6642 kpps=2.8 kbytes=2100 mbps=7.2 nic_pkts=6642 nic_drops=0 u=0 t=0 i=0 o=0 nonip=6642
I just noticed that capstats doesn't properly handle vlan encapsulated packets, so all of our traffic shows up as nonip.. I'll look into fixing that, but if you are not using vlans looking at the breakdown of udp,tcp,ip,other,nonip (u,t,i,o,nonip) would help.
Here is the frequency distribution (I’m logging in JSON so different command but same results):
$ zcat 2016-04-14/conn.08:00:00-09:00:00.log.gz | jq -r ‘.node’ | sort | uniq -c
You may be on to something with the non-ip traffic… there is a drastic difference between the two datacenters:
1460641772.239436 pkts=10414545 kpps=208.2 kbytes=5732528 mbps=938.6 nic_pkts=10414545 nic_drops=0 u=104675 t=3627503 i=307 o=405 nonip=6681655
1460641723.573448 pkts=9553569 kpps=178.9 kbytes=6561123 mbps=1006.6 nic_pkts=9553569 nic_drops=0 u=174140 t=9373195 i=267 o=934 nonip=5033
You are seeing the side effects of hot spotting. It's likely that you have a connection which is massively larger than all others that your host is seeing. You could filter your traffic to stop seeing that flow which would cut down on the problems you are seeing.
I just pushed a new repository that might help you address your issue. It will still require a lot of clean up and documentation but if you just load the module as-is, it might automatically help you. It automatically identifies large and fast flows and places BPF filters to automatically shunt them while still watching for SYN/ACK/FIN/RST flags.
There are a couple of configuration options you'll want to pay attention to, so I'll just give you an example blurb you could put into your local.bro:
# Set a minimum flow size before considering shunting (32MB)
redef BulkFlows::size_threshold = 33554432;
# Set a minimum flow data rate before considering shunting (50Mbps)
BulkFlows::speed_threshold = 6250000;
# If you have VLAN tags on your traffic, you will also have to add the following line because packet filtering with BPF is generally broken otherwise...
I'm not 100% familiar with the load balancing methods, but know if your doing a 5 tuple load balance algorithm, I have seen IPsec stick to a single location, as there are effectively only three tuple available (Src IP, Dst IP and proto), which don't change for most static IPSec connections. Which results in the traffic not changing and sliding from worker to worker like normal connections would.
Could there be a high use IPSec tunnel between that Datacenter and another location?
Great.. just what I was thinking. At this point you should be able to just run something like
tcpdump -n -c 1000 'not ip'
on the WIN box
and see exactly what this traffic is.. then we can figure out what to do about it...
$ sudo tcpdump -n -i eth6 not ip and not arp -c10000 | grep ethertype | cut -f 2 -d ‘,’ | sort | uniq -c
9980 ethertype Unknown (0x8903)
A quick Google points to Cisco FabricPath Switching ( http://www.cisco.com/en/US/docs/switches/datacenter/sw/5_x/nx-os/fabricpath/configuration/guide/fp_switching.html)
“The FabricPath hierarchical MAC address carries the reserved EtherType 0x8903.”
I suppose now is a good time to reach out to the Network Engineering team and ask about the SPAN placement in that datacenter.
Thanks for helping me quickly navigate this issue!
This repository will probably go away at some point. I need to just integrate some small changes into Justin's bro-react repository to make it work with BPF filters. Apparently I had gotten that code from Justin a long time ago and subsequently forgotten about it. Makes more sense to just integrate it into his, but this is functionality that I think could make sense to integrate into Bro eventually as a nice core pairing with the upcoming NetControl framework.
Ah.. so there are probably two issues here: bro and pf_ring
Based on this image from your link
the FP header is fixed size, so adding support for it to bro should be easy enough: handle that ether type, skip the right number of bytes (see iosource/Packet.cc)
That won't help with the pf_ring issue though, but you're probably best reaching out to the pf_ring people about this issue.
It's entirely possibly you can fix the issue with a different span configuration though.