- If I try to start up the workers with any more than ~8 threads, packet drop and memory usage go through the roof in pretty short order. If I try to pin them, the first “worker” CPUs get pegged pretty high and the others stay more or less idle (though that could be due to the amount of traffic the second worker interface is receiving).
8 workers on each card should work fine. Based on your netstats output, your load balancing might not be working that well. The totals received by interface are:
worker-1 7484629230
worker-2 13622938689
worker-3 4497737524
worker-4 4479277818
which is a bit skewed.
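If you want to put a number on the skew, here’s a rough sketch (mine, not part of zeekctl; it just assumes the standard netstats output format) that pulls the recvd count per worker process and shows how far each one sits from the mean:
#!/usr/bin/env python
# Rough sketch, not a zeekctl tool: pipe `zeekctl netstats` into this to see
# how evenly traffic is spread across the worker processes.
from __future__ import print_function
import re
import sys

recvd_by_node = {}
for line in sys.stdin:
    parts = re.split('[ =:]+', line.strip())
    if len(parts) < 8:
        continue
    # parts: node, timestamp, 'recvd', value, 'dropped', value, 'link', value
    recvd_by_node[parts[0]] = int(parts[3])

if recvd_by_node:
    mean = sum(recvd_by_node.values()) / float(len(recvd_by_node))
    for node, recvd in sorted(recvd_by_node.items()):
        print("%s recvd=%d (%+0.1f%% vs mean)" % (node, recvd, 100.0 * (recvd - mean) / mean))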
- If I try to start up “1” worker (per worker node), using the “myricom::*” interface, the worker node goes unresponsive and needs to be hardware bounced. (Driver issue?)
Could be… I never used that feature.
- I can start worker nodes with multiple workers and ~5 threads each (currently “unpinned”), but after a few days, packet drop is still excessive.
Not super surprising with only 5 workers per interface… with those CPUs I’d run 10-12
My current node.cfg is below [1]. Output from ‘zeekctl netstats’ is also below [2]. It’s been up since Friday ~2:00pm Eastern. Load average is higher than I would think it should be (given how much CPU these workers actually have, and how idle most of the CPUs actually are). Htop output included [3].
I understand we should probably be pinning the worker threads, but the output of ‘lstopo-no-graphics --of txt’ is terrible to try and trace with 56 threads available. Also, do I want to use the “P” or the “L” listings? I can include that as a follow up if necessary.
Turning off HT will simplify that quite a bit. You want the P cores I believe. In htop you’ll end up with cores 1-28 busy and 29-56 idle since those are the fake ones. Using lstopo-no-graphics --of ascii (make font really small) may make things easier to understand.
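If picking out pin_cpus from lstopo is still painful, a quick sketch like the one below (my own hack; it assumes the standard Linux sysfs layout under /sys/devices/system/cpu) groups the logical CPUs by physical core, so you can pick one logical CPU per core and skip the HT siblings:
#!/usr/bin/env python
# Sketch (assumes Linux sysfs): group logical CPUs by (package, core) so you
# can pick one logical CPU per physical core for pin_cpus and avoid HT siblings.
from __future__ import print_function
import glob
from collections import defaultdict

cores = defaultdict(list)
for path in glob.glob('/sys/devices/system/cpu/cpu[0-9]*'):
    cpu = int(path.rsplit('cpu', 1)[1])
    with open(path + '/topology/physical_package_id') as f:
        pkg = int(f.read())
    with open(path + '/topology/core_id') as f:
        core = int(f.read())
    cores[(pkg, core)].append(cpu)

for (pkg, core), cpus in sorted(cores.items()):
    print("package %d core %2d -> logical CPUs %s" % (pkg, core, sorted(cpus)))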
I’d change your node.cfg to be more like this, the output will make more sense:
(remember to stop the cluster before changing worker names)
[worker-1-a]
type=worker
host=WORKER 1
lb_method=custom
lb_procs=5
interface=myricom::eth4
[worker-1-b]
type=worker
host=WORKER 1
lb_method=custom
lb_procs=5
interface=myricom::eth5
[worker-2-a]
type=worker
host=WORKER 2
lb_method=custom
lb_procs=5
interface=myricom::eth4
[worker-2-b]
type=worker
host=WORKER 2
lb_method=custom
lb_procs=5
interface=myricom::eth5
The last time I worked out a node.cfg for a myricom based cluster this is what I ended up with:
[foo-a]
type=worker
host=foo
interface=myricom::p1p1
lb_method=custom
lb_procs=8
pin_cpus=3,5,7,9,11,13,15,17
env_vars=SNF_APP_ID=1,SNF_DATARING_SIZE=16384MB,SNF_DESCRING_SIZE=4096MB
[foo-b]
type=worker
host=foo
interface=myricom::p2p1
lb_method=custom
lb_procs=8
pin_cpus=2,4,6,8,10,12,14,16
env_vars=SNF_APP_ID=2,SNF_DATARING_SIZE=16384MB,SNF_DESCRING_SIZE=4096MB
I’m about 90% sure I did the pinning right
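If you ever want to double-check that kind of pinning, a rough sanity check along these lines (my sketch, not anything from the Myricom docs; the two lists are just the example above) confirms the pin_cpus sets don’t overlap and don’t land on the same physical cores:
#!/usr/bin/env python
# Sketch: check that two pin_cpus lists are disjoint and don't share physical
# cores (i.e. no HT siblings pinned against each other). Reads Linux sysfs.
from __future__ import print_function

def phys_core(cpu):
    # (package_id, core_id) for a logical CPU
    base = '/sys/devices/system/cpu/cpu%d/topology/' % cpu
    with open(base + 'physical_package_id') as f:
        pkg = int(f.read())
    with open(base + 'core_id') as f:
        core = int(f.read())
    return (pkg, core)

pins_a = [3, 5, 7, 9, 11, 13, 15, 17]   # foo-a
pins_b = [2, 4, 6, 8, 10, 12, 14, 16]   # foo-b

overlap = set(pins_a) & set(pins_b)
shared = {phys_core(c) for c in pins_a} & {phys_core(c) for c in pins_b}
print("logical CPUs pinned twice: %s" % (sorted(overlap) or "none"))
print("physical cores shared between the two workers: %s" % (sorted(shared) or "none"))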
[2]
================
bro@bro-master-1:~$ zeekctl netstats
worker-1-1: 1581949346.194441 recvd=2178149468 dropped=2260820124 link=15063051356
worker-1-2: 1581949346.194473 recvd=274557259 dropped=2260820124 link=13159459147
worker-1-3: 1581949346.168558 recvd=1888926901 dropped=2260820124 link=14773828789
worker-1-4: 1581949346.081130 recvd=2110377092 dropped=2260820124 link=14995278980
worker-1-5: 1581949346.234478 recvd=1032618510 dropped=2260820124 link=13917520398
worker-2-1: 1581949346.269794 recvd=1551167612 dropped=640636540 link=14436069500
worker-2-2: 1581949346.271224 recvd=2811566586 dropped=640636540 link=15696468474
worker-2-3: 1581949346.292474 recvd=3295536154 dropped=640636540 link=16180438042
worker-2-4: 1581949346.314556 recvd=2505663441 dropped=640636540 link=15390565329
worker-2-5: 1581949343.011855 recvd=3459004896 dropped=640636540 link=20638874080
One thing to keep in mind with the myricom driver is that the drops are shared… so you aren’t dropping 2260820124 per worker, you dropped 2260820124 total.
The number is right, but to work out the total drop % you need to sum the per-worker totals while counting the dropped number only once per host.
Here’s a script I wrote a while ago that does that… you can pipe netstats output to this:
#!/usr/bin/env python
from __future__ import print_function
import sys
import re
from collections import defaultdict

totals = defaultdict(int)
host_dropped = {}
total_rx = total_drop = 0
for line in sys.stdin:
    parts = re.split('[ =:]+', line.strip())
    node, time, _, recvd, _, dropped, _, link = parts
    host = node[:-2]
    totals[host] += int(link)
    total_rx += int(link)
    # the myricom drop counter is shared across the workers on a host,
    # so only count it once per host
    if host not in host_dropped:
        total_drop += int(dropped)
        host_dropped[host] = int(dropped)

for host, total in sorted(totals.items()):
    if not total: continue
    d = host_dropped[host]
    print("%s dropped=%d rx=%d %0.2f%%" % (host, d, total, 100.0*d/total))
print()
print("Totals dropped=%d rx=%d %0.2f%%" % (total_drop, total_rx, 100.0*total_drop/total_rx))
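That is, save it somewhere (the filename below is just a placeholder) and run:
zeekctl netstats | python netstats_totals.py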
You may need to tweak the host = node[:-2] line if you run more than 9 processes per worker
running that on your output I get
worker-1 dropped=2260820124 rx=71909138670 3.14%
worker-2 dropped=640636540 rx=82342415425 0.78%
which isn’t so bad… without accounting for the shared drops it looks more like 15% and 4%. Double the number of workers and fix the load balancing and you should be able to get that to zero.
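(Worked out from the numbers above: counting the shared drop counter once gives 2260820124 / 71909138670 ≈ 3.1% for worker-1, whereas summing it across all 5 processes would give 5 × 2260820124 / 71909138670 ≈ 15.7%; same idea for worker-2, 640636540 / 82342415425 ≈ 0.8% vs ≈ 3.9%.)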