Dropping packets

Good morning,

I’m new to the list, and have inherited an existing zeek deployment here. I’m trying to track down some (to me) excessive packet dropping.

We had an older version of zeek (bro) installed and mostly functional, though as I recall they were having issues with workers occasionally crashing.

Before I started looking into things, a new version of zeek was deployed (from a binary) and is mostly vanilla. We’ve included the bhr and myricom plugins, but that’s about it.

Zeek master and workers run bare metal on 3 pretty big Intel hosts (192GB memory, 2x Xeon E5-2690 with 14 cores/socket, Debian 9). The workers have myricom interfaces. There are span ports at the edge that feed into Arista switches, which feed the Myricom interfaces in the workers.

We have a few issues:

  1. If I try to start up the workers with any more than ~8 threads, packet drop and memory usage goes through the roof in pretty short order. If I try to pin them, the first “worker” cpu’s get pegged pretty high and the others stay more or less idle (though that could be due to the amount of traffic the second worker interface is receiving).

  2. If I try to start up “1” worker (per worker node), using the “myricom::*” interface, the worker node goes unresponsive and needs to be hardware bounced. (Driver issue?)

  3. I can start workers nodes with multiple workers and ~5 threads each (currently “unpinned”), but after a few days, Packet drop is still excessive.

My current node.cfg is below [1]. Output from ‘zeekctl netstats’ is also below [2]. It’s been up since Friday ~2:00pm Eastern. Load average is higher than I would think it should be (given how much cpu these workers actually have, and how idle most of the cpu’s actually are). Htop output included [3].

I understand we should probably be pinning the worker threads, but the output of ‘lstopo-no-graphics --of txt’ is terrible to try and trace with 56 threads available. Also, do I want to use the “P” or the “L” listings? I can include that as a follow up if necessary.

Please help!

[1]

  1. If I try to start up the workers with any more than ~8 threads, packet drop and memory usage goes through the roof in pretty short order. If I try to pin them, the first “worker” cpu’s get pegged pretty high and the others stay more or less idle (though that could be due to the amount of traffic the second worker interface is receiving).

8 workers on each card should work fine. Based on your netstats output your load balancing might not be working that well. The total received by interface is:

worker-1 7484629230
worker-2 13622938689
worker-3 4497737524
worker-4 4479277818

which is a bit skewed.

  1. If I try to start up “1” worker (per worker node), using the “myricom::*” interface, the worker node goes unresponsive and needs to be hardware bounced. (Driver issue?)

Could be… I never used that feature.

  1. I can start workers nodes with multiple workers and ~5 threads each (currently “unpinned”), but after a few days, Packet drop is still excessive.

Not super surprising with only 5 workers per interface… with those cpus I’d run 10-12

My current node.cfg is below [1]. Output from ‘zeekctl netstats’ is also below [2]. It’s been up since Friday ~2:00pm Eastern. Load average is higher than I would think it should be (given how much cpu these workers actually have, and how idle most of the cpu’s actually are). Htop output included [3].

I understand we should probably be pinning the worker threads, but the output of ‘lstopo-no-graphics --of txt’ is terrible to try and trace with 56 threads available. Also, do I want to use the “P” or the “L” listings? I can include that as a follow up if necessary.

Turning off HT will simplify that quite a bit. You want the “P” (physical) numbering, I believe. In htop you’ll end up with cores 1-28 busy and 29-56 idle since those are the fake ones. Using lstopo-no-graphics --of ascii (and making the font really small) may make things easier to understand.
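If it helps untangle the numbering, here’s a rough sketch (just something I’d throw together, assuming lscpu is available) that prints one OS cpu id per physical core, grouped by socket, so you can build pin_cpus lines without squinting at the lstopo text output. With HT off, every id it prints is a real core anyway:

#!/usr/bin/env python
# Sketch: list one logical CPU per physical core using "lscpu -p".
# Useful for building pin_cpus= lines; assumes Linux with lscpu installed.
import subprocess
from collections import OrderedDict

out = subprocess.check_output(["lscpu", "-p=CPU,CORE,SOCKET"]).decode()

first_cpu = OrderedDict()               # (socket, core) -> first logical cpu id seen
for line in out.splitlines():
    if line.startswith("#"):
        continue                        # lscpu header comments
    cpu, core, socket = (int(x) for x in line.split(","))
    first_cpu.setdefault((socket, core), cpu)

by_socket = {}
for (socket, core), cpu in first_cpu.items():
    by_socket.setdefault(socket, []).append(cpu)

for socket, cpus in sorted(by_socket.items()):
    print("socket %d: pin_cpus=%s" % (socket, ",".join(str(c) for c in cpus)))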

I’d change your node.cfg to be more like this, the output will make more sense:
(remember to stop the cluster before changing worker names)

[worker-1-a]
type=worker
host=WORKER 1
lb_method=custom
lb_procs=5
interface=myricom::eth4

[worker-1-b]
type=worker
host=WORKER 1
lb_method=custom
lb_procs=5
interface=myricom::eth5

[worker-2-a]
type=worker
host=WORKER 2
lb_method=custom
lb_procs=5
interface=myricom::eth4

[worker-2-b]
type=worker
host=WORKER 2
lb_method=custom
lb_procs=5
interface=myricom::eth5

The last time I worked out a node.cfg for a myricom based cluster this is what I ended up with:

[foo-a]
type=worker
host=foo
interface=myricom::p1p1
lb_method=custom
lb_procs=8
pin_cpus=3,5,7,9,11,13,15,17
env_vars=SNF_APP_ID=1,SNF_DATARING_SIZE=16384MB,SNF_DESCRING_SIZE=4096MB

[foo-b]
type=worker
host=foo
interface=myricom::p2p1
lb_method=custom
lb_procs=8
pin_cpus=2,4,6,8,10,12,14,16
env_vars=SNF_APP_ID=2,SNF_DATARING_SIZE=16384MB,SNF_DESCRING_SIZE=4096MB

I’m about 90% sure I did the pinning right :slight_smile:
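If you do end up pinning, a quick sanity check on the node.cfg can’t hurt. This is just a throwaway script (the /opt/zeek path is a guess — point it at wherever your node.cfg actually lives) that flags any CPU pinned by two workers on the same host:

#!/usr/bin/env python
# Throwaway check, not part of zeekctl: warn if two workers on the same host
# share a pinned CPU.
try:
    from configparser import ConfigParser    # python 3
except ImportError:
    from ConfigParser import ConfigParser    # python 2
from collections import defaultdict

cfg = ConfigParser()
cfg.read("/opt/zeek/etc/node.cfg")            # adjust for your install

pinned = defaultdict(list)                    # host -> all pinned cpu ids
for section in cfg.sections():
    if cfg.has_option(section, "host") and cfg.has_option(section, "pin_cpus"):
        host = cfg.get(section, "host")
        pinned[host] += [int(c) for c in cfg.get(section, "pin_cpus").split(",")]

for host, cpus in sorted(pinned.items()):
    dupes = sorted(set(c for c in cpus if cpus.count(c) > 1))
    if dupes:
        print("%s: cpus pinned more than once: %s" % (host, dupes))
    else:
        print("%s: %d cpus pinned, no overlap" % (host, len(cpus)))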

[2]

================

bro@bro-master-1:~$ zeekctl netstats

worker-1-1: 1581949346.194441 recvd=2178149468 dropped=2260820124 link=15063051356

worker-1-2: 1581949346.194473 recvd=274557259 dropped=2260820124 link=13159459147

worker-1-3: 1581949346.168558 recvd=1888926901 dropped=2260820124 link=14773828789

worker-1-4: 1581949346.081130 recvd=2110377092 dropped=2260820124 link=14995278980

worker-1-5: 1581949346.234478 recvd=1032618510 dropped=2260820124 link=13917520398

worker-2-1: 1581949346.269794 recvd=1551167612 dropped=640636540 link=14436069500

worker-2-2: 1581949346.271224 recvd=2811566586 dropped=640636540 link=15696468474

worker-2-3: 1581949346.292474 recvd=3295536154 dropped=640636540 link=16180438042

worker-2-4: 1581949346.314556 recvd=2505663441 dropped=640636540 link=15390565329

worker-2-5: 1581949343.011855 recvd=3459004896 dropped=640636540 link=20638874080

One thing to keep in mind with the myricom driver is that the drops are shared… so you aren’t dropping 2260820124 per worker, you dropped 2260820124 total.
The per-worker number is right, but to work out the total drop % you need to sum the recvd/link counts while only counting the dropped number once per interface.

Here’s a script I wrote a while ago that does that… you can pipe the netstats output to it:

#!/usr/bin/env python
from __future__ import print_function
import sys
import re
from collections import defaultdict

totals = defaultdict(int)
host_dropped = {}

total_rx = total_drop = 0

for line in sys.stdin:
    parts = re.split('[ =:]+', line.strip())
    node, time, _, recvd, _, dropped, _, link = parts
    host = node[:-2]                 # strip the "-N" suffix: worker-1-3 -> worker-1
    totals[host] += int(link)

    total_rx += int(link)
    if host not in host_dropped:
        # the dropped counter is shared per interface, so only count it once per host
        total_drop += int(dropped)
        host_dropped[host] = int(dropped)

for host, total in sorted(totals.items()):
    if not total: continue
    d = host_dropped[host]
    print("%s dropped=%d rx=%d %0.2f%%" % (host, d, total, 100.0*d/total))

print()
print("Totals dropped=%d rx=%d %0.2f%%" % (total_drop, total_rx, 100.0*total_drop/total_rx))

may need to tweak the host = node[:-2] line if you run more than 9 workers

running that on your output I get

worker-1 dropped=2260820124 rx=71909138670 3.14%
worker-2 dropped=640636540 rx=82342415425 0.78%

which isn’t so bad… without accounting for the shared drops it looks more like 15% and 4%. Double the number of workers and fix the load balancing and you should be able to get that to zero.
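Spelling out the worker-1 arithmetic from the netstats output above, in case the shared-counter accounting isn’t obvious:

# worker-1: the five link= values summed, the shared dropped counter counted once
link_total = 15063051356 + 13159459147 + 14773828789 + 14995278980 + 13917520398   # 71,909,138,670
dropped    = 2260820124

print(100.0 * dropped / link_total)        # ~3.14% -- what the script reports
print(100.0 * 5 * dropped / link_total)    # ~15.7% -- what it looks like if you count the drop per worker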

Thanks for the response Justin.

I’ll start with disabling HT on the workers, and will reconfigure the workers with 8 threads each. I’ll also pin the processes.

The myricom configuration you quoted included environment variables:

env_vars=SNF_APP_ID=1,SNF_DATARING_SIZE=16384MB,SNF_DESCRING_SIZE=4096MB

What are those for/where can I find documentation/Should I include any of those?

You also said the packets dropped are inclusive for the entire worker, but that the packets recvd are per thread:

worker-1-1: 1581949346.194441 recvd=2178149468 dropped=2260820124 link=15063051356

worker-1-2: 1581949346.194473 recvd=274557259 dropped=2260820124 link=13159459147

worker-1-3: 1581949346.168558 recvd=1888926901 dropped=2260820124 link=14773828789

worker-1-4: 1581949346.081130 recvd=2110377092 dropped=2260820124 link=14995278980

worker-1-5: 1581949346.234478 recvd=1032618510 dropped=2260820124 link=13917520398

Those add up to ~7.5B recvd with 2.2B dropped. That’s 30%, but your script said 3%. Am I looking at some numbers wrong?

Joe

Thanks for the response Justin.

I’ll start with disabling HT on the workers, and will reconfigure the workers with 8 threads each. I’ll also pin the processes.

The myricom configuration you quoted included environment variables:

env_vars=SNF_APP_ID=1,SNF_DATARING_SIZE=16384MB,SNF_DESCRING_SIZE=4096MB

What are those for/where can I find documentation/Should I include any of those?

It’s in the myricom documentation… searching for those settings should find you a PDF of it.

You also said the packets dropped are inclusive for the entire worker, but that the packets recvd are per thread:

worker-1-1: 1581949346.194441 recvd=2178149468 dropped=2260820124 link=15063051356

worker-1-2: 1581949346.194473 recvd=274557259 dropped=2260820124 link=13159459147

worker-1-3: 1581949346.168558 recvd=1888926901 dropped=2260820124 link=14773828789

worker-1-4: 1581949346.081130 recvd=2110377092 dropped=2260820124 link=14995278980

worker-1-5: 1581949346.234478 recvd=1032618510 dropped=2260820124 link=13917520398

Those add up to ~7.5B recvd with 2.2B dropped. That’s 30%, but your script said 3%. Am I looking at some numbers wrong?

Oh, yes! I forgot another detail that is specific to the myricom nics. The link and drop counters don’t reset when you restart zeek… possibly this could be worked around in the plugin by snapshotting them at startup and reporting the differences instead of the absolute numbers the card reports…

Try running this:

zeekctl stop
zeekctl exec /opt/snf/bin/myri_counters -c -p 0 >/dev/null
zeekctl exec /opt/snf/bin/myri_counters -c -p 1 >/dev/null
zeekctl start

that will clear the counters and give you clean numbers.
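(For completeness: the rest of the 30% vs 3% gap you asked about is just the denominator. The script divides by the summed link counter, which is everything the card has seen since the counters were last cleared, while your 30% divides by the summed recvd, which is only what zeek actually got:

# worker-1 again, from the netstats output above
recvd_total = 2178149468 + 274557259 + 1888926901 + 2110377092 + 1032618510   # ~7.48B
link_total  = 71909138670                                                      # ~71.9B
dropped     = 2260820124

print(100.0 * dropped / recvd_total)   # ~30%   -- dropped vs. what zeek received
print(100.0 * dropped / link_total)    # ~3.14% -- dropped vs. what the card saw, the script's number

Once the counters are cleared, link should be roughly recvd plus dropped and the two percentages stop disagreeing so wildly.)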

Can’t clear the counters, I get an “Operation not permitted”:

Lanai uptime (seconds): 59830

Counters uptime (seconds): 59830

Net send KBytes: 0

Net recv KBytes: 16291892294

Ethernet send: 8

Ethernet Small recv: 12077187

Ethernet Big recv: 0

Ethernet recv down: 0

Ethernet recv overrun: 236

SNF send pkts: 0

SNF recv pkts: 21273184129

SNF drop ring full: 73133

Interrupts: 361990

Net bad PHY/CRC32 drop: 8

Net overflow drop: 10266

Net Recv PAUSEs: 0

Ethernet Multicast filter drop: 153131

Ethernet Unicast filter drop: 616646672

Cannot clear counters: Operation not permitted

For some reason with the workers named as you suggested, everything starts up fine but after ~15 minutes, zeekctl status reports the workers as “stopped”, though all the processes are still running on them and the logger is still receiving data. We reverted back to the worker-[1234] notation that we had before and they start and stay running.

Any time we try to use pin_cpus, the workers crash. I was able to get things to start and stay running with 10 lb_procs each (unpinned), though I’m going to try and figure out how to get these counters to clear.

Note, we’ve disabled hyperthreading and included the environment variables that you suggested. Those are the main changes that we’ve made so far.

Can’t clear the counters, I get an “Operation not permitted”:

Hmm… are you running zeek as root or a regular user? may need to use sudo, or tweak the permissions on the /dev/myri… (i think?) files.

For some reason with the workers named as you suggested, everything starts up fine but after ~15 minutes, zeekctl status reports the workers as “stopped”, though all the processes are still running on them and the logger is still receiving data. We reverted back to the worker-[1234] notation that we had before and they start and stay running.

That doesn’t make a lot of sense… It’s just the name, it doesn’t affect anything else… Maybe it was just a coincidence?

Any time we try and pin_cpu’s the workers crash. I was able to get things to start and stay running with 10 lb_procs each (unpinned), though I’m going to try and figure out how to get these counters to clear.

That also doesn’t make a lot of sense. Pinning the workers just causes them to be started via cpuset locked to a core, everything else is the same.

How exactly are the workers crashing? Are you getting a crash report from zeekctl? or in dmesg?
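If you want to check whether the affinity is actually being applied before they fall over, something like this quick debug aid (not a zeekctl feature, just reading /proc) will show what each zeek process is allowed to run on:

#!/usr/bin/env python
# Show the CPU affinity of every running zeek/bro process by reading
# /proc/<pid>/status. Purely a debugging aid.
import glob

for status in glob.glob("/proc/[0-9]*/status"):
    try:
        fields = dict(line.split(":\t", 1) for line in open(status) if ":\t" in line)
    except IOError:
        continue                      # process exited while we were reading
    if fields.get("Name", "").strip() in ("zeek", "bro"):
        pid = status.split("/")[2]
        print("pid %s  cpus_allowed %s" % (pid, fields.get("Cpus_allowed_list", "").strip()))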

Hmm… are you running zeek as root or a regular user? may need to use sudo, or tweak the permissions on the /dev/myri… (i think?) files.

  • We’re running zeek as an unprivileged user. I was able to go onto each worker as root and issue the commands to clear the counters.

That doesn’t make a lot of sense… It’s just the name, it doesn’t affect anything else… Maybe it was just a coincidence?

  • When we named the workers “worker-1-A”, etc, the processes that were started up were “worker-1-A-1” through “worker-1-A-N”. Is it possible that zeekctl doesn’t parse something correctly when it checks to make sure the processes are running? Zeekctl stop failed to stop the workers, zeeksctl status showed them as stopped. They were still running and the logger was still working. The only way to kill them was to do a “killall zeek” from the workers.

That also doesn’t make a lot of sense. Pinning the workers just causes them to be started via cpuset locked to a core, everything else is the same.

How exactly are the workers crashing? Are you getting a crash report from zeekctl? or in dmesg?

  • No crash reports that I recall. I can try and force it to happen again tomorrow. I’d like to let things go for 24 hours and see where we end up. We’ve been running for 5 hours and our stats currently look like this [1]. Note, I modified the script that you provided to work with more than 9 threads. (host = re.sub(r'-[0-9]*$', '', node))

  • We’re currently running 10 threads per worker unpinned

  • Medium load average on both boxes is around 6

  • Memory usage on both boxes is around 75G.

  • I’m going to find out if there’s something we can do in the Aristas so they send some more traffic to the secondary interface. I need to find out how they’re configured and why.

[1]

bro@bro-master-1:~$ zeekctl netstats | ~/calcDrop.py

Warning: ZeekControl plugin uses legacy BroControl API. Use
'import ZeekControl.plugin' instead of 'import BroControl.plugin'

worker-1 dropped=9051079 rx=6537521948 0.14%

worker-2 dropped=0 rx=6766088683 0.00%

worker-3 dropped=0 rx=437828006 0.00%

worker-4 dropped=0 rx=466874444 0.00%

Totals dropped=9051079 rx=14208313081 0.06%

Hmm… are you running zeek as root or a regular user? may need to use sudo, or tweak the permissions on the /dev/myri… (i think?) files.

  • We’re running zeek as an unprivileged user. I was able to go onto each worker as root and issue the commands to clear the counters.

I see… I bet we could fix the myricom plugin to work around this problem… basically just treat the initial drop/link numbers as zero. I had worked around it so long ago just by always clearing the counters when the cluster is restarted… but never looked into a permanent fix.
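The idea would be something like this (sketched in Python for clarity; the real change would live in the plugin’s C++ code):

# Snapshot the card's absolute counters when the worker starts, then report
# deltas from that baseline instead of the raw values the card returns.
class CounterBaseline(object):
    def __init__(self, link_at_start, dropped_at_start):
        self.link0 = link_at_start
        self.dropped0 = dropped_at_start

    def stats(self, link_now, dropped_now):
        return {"link": link_now - self.link0,
                "dropped": dropped_now - self.dropped0}

# baseline taken at startup, later readings reported relative to it
base = CounterBaseline(link_at_start=15063051356, dropped_at_start=2260820124)
print(base.stats(link_now=15065051356, dropped_now=2260820124))   # {'link': 2000000, 'dropped': 0}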

That doesn’t make a lot of sense… It’s just the name, it doesn’t affect anything else… Maybe it was just a coincidence?

  • When we named the workers “worker-1-A”, etc., the processes that started up were “worker-1-A-1” through “worker-1-A-N”. Is it possible that zeekctl doesn’t parse something correctly when it checks to make sure the processes are running? Zeekctl stop failed to stop the workers, and zeekctl status showed them as stopped. They were still running and the logger was still working. The only way to kill them was to do a “killall zeek” from the workers.

Yeah… that’s the weird part… it doesn’t parse anything or really care about the names at all, it just keeps track of the pid. You stopped the old workers before renaming and starting the new ones right? not doing so can cause similar sounding problems, but I’m not sure if that’s what you ran into.

That also doesn’t make a lot of sense. Pinning the workers just causes them to be started via cpuset locked to a core, everything else is the same.

How exactly are the workers crashing? Are you getting a crash report from zeekctl? or in dmesg?

  • No crash reports that I recall. I can try and force it to happen again tomorrow. I’d like to let things go for 24 hours and see where we end up. We’ve been running for 5 hours and our stats currently look like this [1]. Note, I modified the script that you provided to work with more than 9 threads. (host = re.sub(r'-[0-9]*$', '', node))

Nice :slight_smile: Looks like your overall stats are pretty good with the cleared counters and the extra workers.

  • We’re currently running 10 threads per worker unpinned

  • Medium load average on both boxes is around 6

  • Memory usage on both boxes is around 75G.

https://github.com/corelight/zeek-ssl-clear-state and https://github.com/corelight/zeek-smb-clear-state may help a bit with the memory usage… 75G is a bit high for only 10 workers… though some of that reported number is the myricom buffer, I think.

  • I’m going to find out if there’s something we can do in the Aristas so they send some more traffic to the secondary interface. I need to find out how they’re configured and why.

It does look like something is set up weirdly if the traffic is not evenly balanced… Maybe you have 2 different capture groups set up instead of just dumping everything into one group split across 4 nics? The closer you look the more problems you find :slight_smile:

I contacted myricom support and the first thing they suggested was turning debug logging on (via the environment variable in the node.cfg) so they could get some info. I did that last Wednesday and we’ve had none of the memory issues that we were experiencing before that. So at least that part of the situation is in check.

Here’s the packet drop stats over the last ~4.5 days:

worker-1 dropped=534746334 rx=138087967089 0.39%

worker-2 dropped=340526849 rx=147729484064 0.23%

worker-3 dropped=0 rx=9064403485 0.00%

worker-4 dropped=0 rx=9183660589 0.00%

Totals dropped=875273183 rx=304065515227 0.29%

Just to refresh. We’re running unpinned with 10 lb_procs per worker.

Should I… give it more processes or try to pin the CPU’s? Load average on the workers is hovering around 5.5 and since we’re unpinned we’re using all 28 cores (but they’re only around 15-20% load).

I’ve read through a good portion of the Berkeley 100G doc and I’m wondering if I should start looking at shunting as well.

I contacted myricom support and the first thing they suggested was turning debug logging on (via the environment variable in the node.cfg) so they could get some info. I did that last Wednesday and we’ve had none of the memory issues that we were experiencing before that. So at least that part of the situation is in check.

I wouldn’t have expected that to be the cause or the fix… Did you load those packages I had suggested?

Here’s the packet drop stats over the last ~4.5 days:

worker-1 dropped=534746334 rx=138087967089 0.39%

worker-2 dropped=340526849 rx=147729484064 0.23%

worker-3 dropped=0 rx=9064403485 0.00%

worker-4 dropped=0 rx=9183660589 0.00%

Totals dropped=875273183 rx=304065515227 0.29%

Just to refresh. We’re running unpinned with 10 lb_procs per worker.

Should I… give it more processes or try to pin the CPU’s? Load average on the workers is hovering around 5.5 and since we’re unpinned we’re using all 28 cores (but they’re only around 15-20% load).

Load average isn’t a terribly useful metric… what really helps is having per cpu utilization graphs.
Those drop numbers are already looking a lot better, but the rx packets per worker is still really skewed and points to something weird with your load balancing. The first 2 nics are seeing 15 times the number of packets as the other 2.
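If you don’t have graphing handy, even a crude sample is better than load average; something like this (just a sketch, sample interval and formatting to taste):

#!/usr/bin/env python
# Rough per-CPU utilization over a 5 second window, straight from /proc/stat.
from __future__ import print_function
import time

def snapshot():
    cpus = {}
    for line in open("/proc/stat"):
        if line.startswith("cpu") and line[3].isdigit():   # skip the aggregate "cpu " line
            fields = line.split()
            name, counters = fields[0], [int(x) for x in fields[1:]]
            idle = counters[3] + counters[4]               # idle + iowait
            cpus[name] = (sum(counters), idle)
    return cpus

before = snapshot()
time.sleep(5)
after = snapshot()
for cpu in sorted(before, key=lambda c: int(c[3:])):
    total = after[cpu][0] - before[cpu][0]
    idle = after[cpu][1] - before[cpu][1]
    if total:
        print("%s %5.1f%% busy" % (cpu, 100.0 * (total - idle) / total))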

I’ve read through a good portion of the Berkeley 100G doc and I’m wondering if I should start looking at shunting as well.

If you have enough elephant flows, it couldn’t hurt… but likely want to fix the load balancing first.

I contacted myricom support and the first thing they suggested was turning debug logging on (via the environment variable in the node.cfg) so they could get some info. I did that last Wednesday and we’ve had none of the memory issues that we were experiencing before that. So at least that part of the situation is in check.

I wouldn’t have expected that to be the cause or the fix… Did you load those packages I had suggested?

I didn’t. I was trying to keep it as close to a vanilla configuration as possible (especially after looping support in). I probably will this week though.

Here’s the packet drop stats over the last ~4.5 days:

worker-1 dropped=534746334 rx=138087967089 0.39%

worker-2 dropped=340526849 rx=147729484064 0.23%

worker-3 dropped=0 rx=9064403485 0.00%

worker-4 dropped=0 rx=9183660589 0.00%

Totals dropped=875273183 rx=304065515227 0.29%

Just to refresh. We’re running unpinned with 10 lb_procs per worker.

Should I… give it more processes or try to pin the CPU’s? Load average on the workers is hovering around 5.5 and since we’re unpinned we’re using all 28 cores (but they’re only around 15-20% load).

Load average isn’t a terribly useful metric… what really helps is having per cpu utilization graphs.
Those drop numbers are already looking a lot better, but the rx packets per worker is still really skewed and points to something weird with your load balancing. The first 2 nics are seeing 15 times the number of packets as the other 2.

I would think the load average should at least indicate how the processors are keeping up with the traffic. That said, I know there’s an imbalance. I need access to the Arista switches. I guess it’s possible that if we can balance them properly I can get away without doing any more CPU/pinning work.

I’ve read through a good portion of the Berkeley 100G doc and I’m wondering if I should start looking at shunting as well.

If you have enough elephant flows, it couldn’t hurt… but likely want to fix the load balancing first.

Will do.

Thank you

Joe

The imbalance in traffic on the interfaces is apparently due to the way the Arista switches are physically connected. Networking is going to change that:

We have span ports on 2 different links feeding 2 different Arista switches. One Arista switch goes to one set of interfaces (eth4 on both boxes). And the other Arista switch goes to the other set of interfaces (eth5 on both boxes). One link is used as primary so it sees a lot more traffic than the other. In addition to that, the Arista switch with the heavy link has span ports from the core.

We’re going to move the core spans over to the Arista with the light link… that should hopefully work to rebalance things.

That said… Will the Arista switches do any dedup on the packets that they’re getting? I wasn’t involved in the design/setup of the system. It would seem to me that if we have some vlans spanning to the Aristas, and ports from the edge spanning to the Aristas… packets that go through the edge and then through those vlans in the core will be spanned to the Arista twice (and sent to zeek twice).

The imbalance in traffic on the interfaces is apparently due to the way the Arista switches are physically connected. Networking is going to change that:

We have span ports on 2 different links feeding 2 different Arista switches. One Arista switch goes to one set of interfaces (eth4 on both boxes). And the other Arista switch goes to the other set of interfaces (eth5 on both boxes). One link is used as primary so it sees a lot more traffic than the other. In addition to that, the Arista switch with the heavy link has span ports from the core.

We’re going to move the core spans over to the Arista with the light link… that should hopefully work to rebalance things.

That said… Will the Arista switches do any dedup on the packets that they’re getting? I wasn’t involved in the design/setup of the system. It would seem to me that if we have some vlans spanning to the Aristas, and ports from the edge spanning to the Aristas… packets that go through the edge and then through those vlans in the core will be spanned to the Arista twice (and sent to zeek twice).