5 node cluster

Hello

The Myricom cards in my cluster nodes are dropping packets, and I am not getting any log information in prefix/logs. Did I miss something during the setup process? Please see below for initial info, and let me know what else is needed. Thank you.

Darrain

I compiled bro using the option below.
--with-pcap=/opt/snf/

[bromgr@bromgr ~]$ ldd /usr/local/bro/bin/bro | grep pcap

libpcap.so.1 => /opt/snf/lib/libpcap.so.1 (0x00007faf9c3d5000)

Looks good

I get the following when I run capstats

[BroControl] > capstats

Interface kpps mbps (10s average)

----------------------------------------

worker-1-1: capstats failed (error: eth2: snf_ring_open_id(ring=-1) failed: Device or resource busy)

worker-2-1: capstats failed (error: eth2: snf_ring_open_id(ring=-1) failed: Device or resource busy)

worker-3-1: capstats failed (error: eth2: snf_ring_open_id(ring=-1) failed: Device or resource busy)

worker-4-1: capstats failed (error: eth2: snf_ring_open_id(ring=-1) failed: Device or resource busy)

worker-5-1: capstats failed (error: eth2: snf_ring_open_id(ring=-1) failed: Device or resource busy)

This is normal. capstats never worked right with snf: it could never work with snfv2, and with snfv3 it needs to set a different app id than bro, otherwise it can't capture at the same time as bro. As long as bro is running and not failing with the same error, you're OK. There are also better ways to get data out of a Myricom card using the Myricom tools.

Your node.cfg looks mostly ok. I would switch to only running 1 or 2 proxies and just run them on the manager node.

Why are you using 7,8,9,10,11,18,19,20,21,22 in particular? What CPUs do you have? This is potentially not doing what you intend. Most likely 7/19 8/20 9/21 10/22 are the same cpu.

Your underlying problem is probably that a firewall is enabled on your hosts and the worker processes can't reach the manager. Daniel just wrote a good section on this for the manual:

This section summarizes the network communication between Bro and BroControl,
which is useful to understand if you need to reconfigure your firewall. If
your firewall is preventing Bro communication, then either the "deploy"
command or the "peerstatus" command will fail.

For a cluster setup, BroControl uses ssh to run commands on other hosts in
the cluster, so the manager host needs to connect to TCP port 22 on each
of the other hosts in the cluster. Note that BroControl never attempts
to ssh to the localhost, so in a standalone setup BroControl does not use ssh.

Each instance of Bro in a cluster needs to communicate directly with other
instances of Bro regardless of whether these instances are running on the same
host or not. Each proxy and worker needs to connect to the manager,
and each worker needs to connect to one proxy. If a logger node is defined,
then each of the other nodes needs to connect to the logger.

Note that you can change the port that Bro listens on by changing the value
of the "BroPort" option in your ``broctl.cfg`` file (this should be needed
only if your system has another process that listens on the same port). By
default, a standalone Bro listens on TCP port 47760. For a cluster setup,
the logger listens on TCP port 47761, and the manager listens on TCP port 47762
(or 47761 if no logger is defined). Each proxy is assigned its own port
number, starting with one number greater than the manager's port. Likewise,
each worker is assigned its own port starting one number greater than the
highest port number assigned to a proxy.

Finally, a few BroControl commands (such as "print" and "peerstatus") rely
on broccoli to communicate with Bro. This means that for those commands to
function, BroControl needs to connect to each Bro instance.
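
The port numbering described in the manual excerpt above can be sketched in a few lines of Python. This is only an illustration of the rule as quoted, not BroControl's actual code: the function name, the node names, and the idea that everything shifts with `base_port` are assumptions for the example.

```python
def assign_ports(num_proxies, num_workers, has_logger=False, base_port=47760):
    """Return {node_name: tcp_port} following the numbering quoted above.

    base_port is the standalone default (47760); cluster nodes start one above it.
    Node names such as "proxy-1" are illustrative only.
    """
    ports = {}
    next_port = base_port + 1
    if has_logger:
        # With a logger defined, the logger takes 47761 and the manager 47762.
        ports["logger"] = next_port
        next_port += 1
    ports["manager"] = next_port
    next_port += 1
    # Proxies start one above the manager's port.
    for i in range(1, num_proxies + 1):
        ports[f"proxy-{i}"] = next_port
        next_port += 1
    # Workers start one above the highest proxy port.
    for i in range(1, num_workers + 1):
        ports[f"worker-{i}"] = next_port
        next_port += 1
    return ports
```

For a cluster with one proxy and no logger this gives the manager 47761 and the proxy 47762, with workers counting up from 47763. Those are the ports a firewall between the hosts would have to allow, in addition to TCP port 22 for ssh.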

Thanks for the quick reply. I put proxy on everything because I was grabbing at straws. I did only have 1 proxy and it was on the manager with the same results.

Why are you using 7,8,9,10,11,18,19,20,21,22 in particular? What CPUs do you have? This is potentially not doing what you intend. Most likely 7/19 8/20 9/21 10/22 are the same cpu.

Those are the core that are with node 1 and node 1 is associated with the myricom card.

[bromgr@bromgr 2016-10-07]$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 24
On-line CPU(s) list: 0-23
Thread(s) per core: 2
Core(s) per socket: 6
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 63
Model name: Intel(R) Xeon(R) CPU E5-2643 v3 @ 3.40GHz
Stepping: 2
CPU MHz: 1200.000
BogoMIPS: 6799.00
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 20480K
NUMA node0 CPU(s): 0-5,12-17
NUMA node1 CPU(s): 6-11,18-23

Your underlying problem is probably that a firewall is enabled on your hosts and the worker processes can’t reach the manager.

I have ip6 & iptables off

peerstatus

[BroControl] > peerstatus

manager

1475875039.738664 peer=worker-2-2 host=10.0.40.17 events_in=3165 events_out=3165 ops_in=0 ops_out=3472 bytes_in=? bytes_out=?

1475875039.738664 peer=worker-1-3 host=10.0.40.18 events_in=3165 events_out=3165 ops_in=0 ops_out=3472 bytes_in=? bytes_out=?

1475875039.738664 peer=proxy-2 host=10.0.40.17 events_in=3165 events_out=3165 ops_in=0 ops_out=3472 bytes_in=? bytes_out=?

1475875039.738664 peer=proxiy-5 host=10.0.40.19 events_in=3165 events_out=3165 ops_in=0 ops_out=3472 bytes_in=? bytes_out=?

1475875039.738664 peer=worker-3-4 host=10.0.40.16 events_in=3165 events_out=3165 ops_in=0 ops_out=3472 bytes_in=? bytes_out=?

1475875039.738664 peer=worker-3-3 host=10.0.40.16 events_in=3165 events_out=3165 ops_in=0 ops_out=3472 bytes_in=? bytes_out=?

1475875039.738664 peer=worker-2-4 host=10.0.40.17 events_in=3165 events_out=3165 ops_in=0 ops_out=3472 bytes_in=? bytes_out=?

1475875039.738664 peer=worker-3-8 host=10.0.40.16 events_in=3165 events_out=3165 ops_in=0 ops_out=3472 bytes_in=? bytes_out=?

1475875039.738664 peer=worker-2-9 host=10.0.40.17 events_in=3165 events_out=3165 ops_in=0 ops_out=3472 bytes_in=? bytes_out=?

1475875039.738664 peer=worker-3-1 host=10.0.40.16 events_in=3165 events_out=3165 ops_in=0 ops_out=3472 bytes_in=? bytes_out=?

1475875039.738664 peer=worker-4-1 host=10.0.40.15 events_in=3165 events_out=3165 ops_in=0 ops_out=3472 bytes_in=? bytes_out=?

1475875039.738664 peer=worker-3-9 host=10.0.40.16 events_in=3165 events_out=3165 ops_in=0 ops_out=3472 bytes_in=? bytes_out=?

1475875039.738664 peer=worker-5-8 host=10.0.40.19 events_in=3165 events_out=3165 ops_in=0 ops_out=3472 bytes_in=? bytes_out=?

1475875039.738664 peer=worker-3-6 host=10.0.40.16 events_in=3165 events_out=3165 ops_in=0 ops_out=3472 bytes_in=? bytes_out=?

1475875039.738664 peer=worker-4-9 host=10.0.40.15 events_in=3165 events_out=3165 ops_in=0 ops_out=3472 bytes_in=? bytes_out=?

1475875039.738664 peer=worker-2-3 host=10.0.40.17 events_in=3165 events_out=3165 ops_in=0 ops_out=3472 bytes_in=? bytes_out=?

1475875039.738664 peer=proxy-3 host=10.0.40.16 events_in=3165 events_out=3165 ops_in=0 ops_out=3472 bytes_in=? bytes_out=?

1475875039.738664 peer=worker-2-7 host=10.0.40.17 events_in=3165 events_out=3165 ops_in=0 ops_out=3472 bytes_in=? bytes_out=?

1475875039.738664 peer=worker-5-7 host=10.0.40.19 events_in=3165 events_out=3165 ops_in=0 ops_out=3472 bytes_in=? bytes_out=?

1475875039.738664 peer=proxy-4 host=10.0.40.15 events_in=3165 events_out=3165 ops_in=0 ops_out=3472 bytes_in=? bytes_out=?

1475875039.738664 peer=worker-1-8 host=10.0.40.18 events_in=3165 events_out=3165 ops_in=0 ops_out=3472 bytes_in=? bytes_out=?

1475875039.738664 peer=proxy-1 host=10.0.40.18 events_in=3165 events_out=3165 ops_in=0 ops_out=3472 bytes_in=? bytes_out=?

1475875039.738664 peer=worker-3-2 host=10.0.40.16 events_in=3165 events_out=3165 ops_in=0 ops_out=3472 bytes_in=? bytes_out=?

1475875039.738664 peer=worker-5-4 host=10.0.40.19 events_in=3165 events_out=3165 ops_in=0 ops_out=3472 bytes_in=? bytes_out=?

1475875039.738664 peer=worker-1-6 host=10.0.40.18 events_in=3165 events_out=3165 ops_in=0 ops_out=3472 bytes_in=? bytes_out=?

1475875039.738664 peer=worker-5-1 host=10.0.40.19 events_in=3165 events_out=3165 ops_in=0 ops_out=3472 bytes_in=? bytes_out=?

1475875039.738664 peer=worker-1-10 host=10.0.40.18 events_in=3165 events_out=3165 ops_in=0 ops_out=3472 bytes_in=? bytes_out=?

1475875039.738664 peer=worker-5-9 host=10.0.40.19 events_in=3165 events_out=3165 ops_in=0 ops_out=3472 bytes_in=? bytes_out=?

1475875039.738664 peer=worker-3-10 host=10.0.40.16 events_in=3165 events_out=3165 ops_in=0 ops_out=3472 bytes_in=? bytes_out=?

1475875039.738664 peer=worker-5-2 host=10.0.40.19 events_in=3165 events_out=3165 ops_in=0 ops_out=3472 bytes_in=? bytes_out=?

1475875039.738664 peer=worker-5-3 host=10.0.40.19 events_in=3165 events_out=3165 ops_in=0 ops_out=3472 bytes_in=? bytes_out=?

1475875039.738664 peer=worker-1-1 host=10.0.40.18 events_in=3165 events_out=3165 ops_in=0 ops_out=3472 bytes_in=? bytes_out=?

1475875039.738664 peer=worker-2-8 host=10.0.40.17 events_in=3165 events_out=3165 ops_in=0 ops_out=3472 bytes_in=? bytes_out=?

1475875039.738664 peer=worker-4-5 host=10.0.40.15 events_in=3165 events_out=3165 ops_in=0 ops_out=3472 bytes_in=? bytes_out=?

1475875039.738664 peer=worker-4-6 host=10.0.40.15 events_in=3165 events_out=3165 ops_in=0 ops_out=3472 bytes_in=? bytes_out=?

1475875039.738664 peer=worker-2-6 host=10.0.40.17 events_in=3165 events_out=3165 ops_in=0 ops_out=3472 bytes_in=? bytes_out=?

1475875039.738664 peer=worker-4-8 host=10.0.40.15 events_in=3165 events_out=3165 ops_in=0 ops_out=3472 bytes_in=? bytes_out=?

1475875039.738664 peer=worker-3-7 host=10.0.40.16 events_in=3165 events_out=3165 ops_in=0 ops_out=3472 bytes_in=? bytes_out=?

1475875039.738664 peer=worker-4-7 host=10.0.40.15 events_in=3165 events_out=3165 ops_in=0 ops_out=3472 bytes_in=? bytes_out=?

1475875039.738664 peer=worker-5-6 host=10.0.40.19 events_in=3165 events_out=3165 ops_in=0 ops_out=3472 bytes_in=? bytes_out=?

1475875039.738664 peer=worker-2-1 host=10.0.40.17 events_in=3165 events_out=3165 ops_in=0 ops_out=3472 bytes_in=? bytes_out=?

1475875039.738664 peer=worker-5-5 host=10.0.40.19 events_in=3165 events_out=3165 ops_in=0 ops_out=3472 bytes_in=? bytes_out=?

1475875039.738664 peer=worker-4-10 host=10.0.40.15 events_in=3165 events_out=3165 ops_in=0 ops_out=3472 bytes_in=? bytes_out=?

1475875039.738664 peer=worker-2-10 host=10.0.40.17 events_in=3165 events_out=3165 ops_in=0 ops_out=3472 bytes_in=? bytes_out=?

1475875039.738664 peer=worker-1-7 host=10.0.40.18 events_in=3165 events_out=3165 ops_in=0 ops_out=3472 bytes_in=? bytes_out=?

1475875039.738664 peer=worker-4-3 host=10.0.40.15 events_in=3165 events_out=3165 ops_in=0 ops_out=3472 bytes_in=? bytes_out=?

1475875039.738664 peer=worker-1-9 host=10.0.40.18 events_in=3165 events_out=3165 ops_in=0 ops_out=3472 bytes_in=? bytes_out=?

1475875039.738664 peer=worker-1-5 host=10.0.40.18 events_in=3165 events_out=3165 ops_in=0 ops_out=3472 bytes_in=? bytes_out=?

1475875039.738664 peer=worker-4-2 host=10.0.40.15 events_in=3165 events_out=3165 ops_in=0 ops_out=3472 bytes_in=? bytes_out=?

1475875039.738664 peer= host=10.0.40.19 events_in=3165 events_out=3165 ops_in=0 ops_out=3472 bytes_in=? bytes_out=?

1475875039.738664 peer=proxy-5 host=10.0.40.19 events_in=3165 events_out=3165 ops_in=0 ops_out=3472 bytes_in=? bytes_out=?

1475875039.738664 peer=worker-3-5 host=10.0.40.16 events_in=3165 events_out=3165 ops_in=0 ops_out=3472 bytes_in=? bytes_out=?

1475875039.738664 peer=worker-1-4 host=10.0.40.18 events_in=3165 events_out=3165 ops_in=0 ops_out=3472 bytes_in=? bytes_out=?

1475875039.738664 peer=worker-5-10 host=10.0.40.19 events_in=3165 events_out=3165 ops_in=0 ops_out=3472 bytes_in=? bytes_out=?

1475875039.738664 peer=worker-2-5 host=10.0.40.17 events_in=3165 events_out=3165 ops_in=0 ops_out=3472 bytes_in=? bytes_out=?

1475875039.738664 peer=worker-4-4 host=10.0.40.15 events_in=3165 events_out=3165 ops_in=0 ops_out=3472 bytes_in=? bytes_out=?

1475875039.738664 peer=worker-1-2 host=10.0.40.18 events_in=3165 events_out=3165 ops_in=0 ops_out=3472 bytes_in=? bytes_out=?

I see. You have two 6-core CPUs with hyper-threading, so those are the two sets of logical CPUs that make up each hyper-threading pair. We haven't gotten to do performance testing for this yet, but you might get better performance by just using 2,3,4,5,6,7,8,9,10,11. It's a tradeoff: half of the packets have to be copied across to the other NUMA node, but you use more of the 'real' cores and fewer of the hyper-threaded ones.
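
One way to check which pinned CPUs are hyper-threading siblings is to read the kernel's topology files. The sketch below assumes a Linux host that exposes `/sys/devices/system/cpu/cpuN/topology/thread_siblings_list`; the function names are made up for this example.

```python
def parse_cpu_list(text):
    """Expand a Linux CPU list such as '6-11,18-23' into a sorted list of ints."""
    cpus = set()
    for part in text.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def read_siblings(cpu):
    """Read the hyper-threading siblings of one logical CPU from sysfs (Linux only)."""
    path = f"/sys/devices/system/cpu/cpu{cpu}/topology/thread_siblings_list"
    with open(path) as f:
        return parse_cpu_list(f.read())


def shared_cores(pinned, siblings_of=read_siblings):
    """Return sorted pairs of pinned CPUs that sit on the same physical core."""
    pinned = set(pinned)
    pairs = set()
    for cpu in pinned:
        for sib in siblings_of(cpu):
            if sib != cpu and sib in pinned:
                pairs.add((min(cpu, sib), max(cpu, sib)))
    return sorted(pairs)
```

On a real worker you would call `shared_cores([7, 8, 9, 10, 11, 18, 19, 20, 21, 22])` directly. If the siblings on this box are n and n+12 (an assumption based on the lscpu layout, not something verified here), that call reports 7/19, 8/20, 9/21 and 10/22 as shared cores, while the list 2 through 11 reports none.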

Your underlying problem is probably that a firewall is enabled on your hosts and the worker processes can't reach the manager.
I have ip6 & iptables off

On all the machines? "Everything is working but there are no logs" almost always turns out to be firewall rules. The last time, it turned out that another admin had re-enabled the firewall. :)

One thing to check for that is the logs written to spool/ on each worker. There will be a local communication.log for each worker that may be complaining about something.

Now that I reread your first message I see "I am not getting any log information in prefix/logs". Do you mean that there are literally no log files in there? Under current/ you should at least have stderr.log and communication.log. If you literally have no log files, you may have some permission issues if you are not running bro as root.

You can also run tcpdump on the manager and see if the workers are even trying to send it anything.
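
As a complement to tcpdump, a quick reachability test from a worker host can show whether a firewall is in the way. This is a minimal sketch using only the Python standard library; the manager address in the usage comment is a placeholder, and 47761 is the default manager port only when no logger is defined.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


# Example, run on a worker (MANAGER_IP is a placeholder for your manager's address):
#   can_connect("MANAGER_IP", 47761)
```

A result of False while bro is running on the manager would point at filtering between the hosts. A result of True only shows that the port is reachable, not that the cluster is otherwise healthy.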

peerstatus

[BroControl] > peerstatus

    manager

1475875039.738664 peer=worker-2-2 host=10.0.40.17 events_in=3165 events_out=3165 ops_in=0 ops_out=3472 bytes_in=? bytes_out=?

1475875039.738664 peer=worker-1-3 host=10.0.40.18 events_in=3165 events_out=3165 ops_in=0 ops_out=3472 bytes_in=? bytes_out=?

1475875039.738664 peer=proxy-2 host=10.0.40.17 events_in=3165 events_out=3165 ops_in=0 ops_out=3472 bytes_in=? bytes_out=?

1475875039.738664 peer=proxiy-5 host=10.0.40.19 events_in=3165 events_out=3165 ops_in=0 ops_out=3472 bytes_in=? bytes_out=?

1475875039.738664 peer=worker-3-4 host=10.0.40.16 events_in=3165 events_out=3165 ops_in=0 ops_out=3472 bytes_in=? bytes_out=?

1475875039.738664 peer=worker-3-3 host=10.0.40.16 events_in=3165 events_out=3165 ops_in=0 ops_out=3472 bytes_in=? bytes_out=?

That appears normal. I'm not sure what bytes_in and bytes_out were supposed to be; it doesn't look like we output that anymore.

What does 'broctl netstats' show?

Sorry, yeah, I am getting communication.log and stderr.log on the manager. I do have two NICs enabled on each system: one for management with an IP, and the other is the Myricom with no IP and in sniffer mode.

Each of the workers does have the spool worker directories, but they are empty.

I used to be able to run this on the manager:

[bromgr@bromgr etc]$ sudo tcpdump -i eth2

tcpdump: snf_ring_open_id(ring=-1) failed: Device or resource busy

[BroControl] > netstats

worker-1-1: 1475878452.092051 recvd=1 dropped=17260812 link=17260813

worker-1-10: 1475878452.292009 recvd=1 dropped=17260812 link=17260813

worker-1-2: 1475878452.493003 recvd=1 dropped=17260812 link=17260813

worker-1-3: 1475878452.693975 recvd=1 dropped=17260812 link=17260813

worker-1-4: 1475878452.895009 recvd=1 dropped=17260812 link=17260813

worker-1-5: 1475878453.095000 recvd=1 dropped=17260812 link=17260813

worker-1-6: 1475878453.296049 recvd=1 dropped=17260812 link=17260813

worker-1-7: 1475878453.497139 recvd=1 dropped=17260812 link=17260813

worker-1-8: 1475878453.697990 recvd=1 dropped=17260812 link=17260813

worker-1-9: 1475878453.897974 recvd=1 dropped=17260812 link=17260813

worker-2-1: 1475878450.084311 recvd=1 dropped=43750502 link=43750503

worker-2-10: 1475878450.285335 recvd=1 dropped=43750502 link=43750503

worker-2-2: 1475878450.485317 recvd=1 dropped=43750502 link=43750503

worker-2-3: 1475878450.686430 recvd=1 dropped=43750502 link=43750503

worker-2-4: 1475878450.887373 recvd=1 dropped=43750502 link=43750503

worker-2-5: 1475878451.088348 recvd=1 dropped=43750502 link=43750503

worker-2-6: 1475878451.288262 recvd=1 dropped=43750502 link=43750503

worker-2-7: 1475878451.489370 recvd=1 dropped=43750502 link=43750503

worker-2-8: 1475878451.689311 recvd=1 dropped=43750502 link=43750503

worker-2-9: 1475878451.890323 recvd=1 dropped=43750502 link=43750503

worker-3-1: 1475878448.077118 recvd=1 dropped=9847880 link=9847881

worker-3-10: 1475878448.278158 recvd=1 dropped=9847880 link=9847881

worker-3-2: 1475878448.479115 recvd=1 dropped=9847880 link=9847881

worker-3-3: 1475878448.679110 recvd=1 dropped=9847880 link=9847881

worker-3-4: 1475878448.880134 recvd=1 dropped=9847880 link=9847881

worker-3-5: 1475878449.081098 recvd=1 dropped=9847880 link=9847881

worker-3-6: 1475878449.281137 recvd=1 dropped=9847880 link=9847881

worker-3-7: 1475878449.482134 recvd=1 dropped=9847880 link=9847881

worker-3-8: 1475878449.683136 recvd=1 dropped=9847880 link=9847881

worker-3-9: 1475878449.884120 recvd=1 dropped=9847880 link=9847881

worker-4-1: 1475878446.070765 recvd=1 dropped=14367380 link=14367381

worker-4-10: 1475878446.271782 recvd=1 dropped=14367380 link=14367381

worker-4-2: 1475878446.472749 recvd=1 dropped=14367380 link=14367381

worker-4-3: 1475878446.672736 recvd=1 dropped=14367380 link=14367381

worker-4-4: 1475878446.873773 recvd=1 dropped=14367380 link=14367381

worker-4-5: 1475878447.074779 recvd=1 dropped=14367380 link=14367381

worker-4-6: 1475878447.274758 recvd=1 dropped=14367380 link=14367381

worker-4-7: 1475878447.475787 recvd=1 dropped=14367380 link=14367381

worker-4-8: 1475878447.676719 recvd=1 dropped=14367380 link=14367381

worker-4-9: 1475878447.876731 recvd=1 dropped=14367380 link=14367381
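
To make the scale of the loss in that netstats output explicit, the lines can be parsed and a drop fraction computed per worker. This is a rough sketch: the regular expression is written against the output format pasted above, and treating dropped/link as the loss fraction is an assumption of this example rather than something BroControl reports.

```python
import re

# Matches lines like:
#   worker-1-1: 1475878452.092051 recvd=1 dropped=17260812 link=17260813
LINE_RE = re.compile(
    r"^(?P<node>\S+):\s+(?P<ts>\d+\.\d+)\s+recvd=(?P<recvd>\d+)\s+"
    r"dropped=(?P<dropped>\d+)\s+link=(?P<link>\d+)"
)


def parse_netstats(text):
    """Parse pasted 'broctl netstats' output into a list of per-worker dicts."""
    rows = []
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip blank lines and anything that is not a stats line
        recvd, dropped, link = (int(m[k]) for k in ("recvd", "dropped", "link"))
        rows.append({
            "node": m["node"],
            "recvd": recvd,
            "dropped": dropped,
            "link": link,
            # Fraction of packets seen on the link that were dropped.
            "drop_ratio": dropped / link if link else 0.0,
        })
    return rows
```

For the numbers above every worker comes out with recvd=1 and a drop ratio just under 1.0, so the workers are seeing essentially none of the traffic rather than being somewhat overloaded.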

Ah, OK, so this isn't the firewall issue. That's when "everything is working but there are no logs", but in your case nothing is working :)

I'd stop bro and then make sure everything is stopped. You can use 'broctl ps.bro' to ensure that there are no stray procs lying around. Then at that point with nothing else running you should be able to run things like 'tcpdump' or 'broctl capstats' and verify that you can capture packets.

You should also be able to run tools like

/opt/snf/bin/myri_nic_info
/opt/snf/bin/myri_counters
/opt/snf/bin/myri_bandwidth
/opt/snf/sbin/myri_license

to ensure that the card and drivers are working properly. Also check the dmesg output to see if the driver is complaining about anything.

I don't recall ever seeing that particular netstats output, but I bet you'll be able to reproduce the problem with regular tcpdump. Generally speaking, if tcpdump -w foo.pcap writes out packets that look OK, and you can use bro -r against foo.pcap, bro should work in real time.

The snf issues on the manager may be due to trying to use the snf libs against a regular NIC. I've had to use things like

LD_PRELOAD=/usr/lib64/libpcap.so.1 tcpdump ...

to force it to use standard libpcap.

Turns out it was a simple config issue (like most times, RTFM), and traffic is now flowing to snf0. My workers were not using the snf0 interface, as you must if you compiled using the Myricom sniffer. I also changed the CPU pinning, so thanks for that info. I also turned off the time source in the snf driver. Now I need to add an Arista time source. Thanks for your time.

[worker-3]
type=worker
host=10.0.40.16
# interface used to be eth2
interface=snf0
lb_method=myricom
lb_procs=10
# the CPU pinning was not right either
pin_cpus=2,3,4,5,6,7,8,9,10,11
#env_vars=LD_LIBRARY_PATH=/opt/snf/lib:$PATH, SNF_FLAGS=0x1, SNF_DATARING_SIZE=0x100000000, SNF_NUM_RINGS=10

From:

[worker-3]
type=worker
host=10.0.40.16
interface=eth2
lb_method=myricom
lb_procs=10
pin_cpus=7,8,9,10,11,18,19,20,21,22
env_vars=LD_LIBRARY_PATH=/opt/snf/lib:$PATH, SNF_FLAGS=0x1, SNF_DATARING_SIZE=0x100000000, SNF_NUM_RINGS=10

To:

[worker-3]
type=worker
host=10.0.40.16
interface=snf0
lb_method=myricom
lb_procs=10
pin_cpus=2,3,4,5,6,7,8,9,10,11
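
Since node.cfg is an INI-style file, a small sanity check can catch pinning mistakes before deploying. The sketch below uses Python's standard `configparser`. The rule it applies (one distinct pinned CPU per load-balanced process) is a convention chosen for this example, not something BroControl enforces, and the function name is made up.

```python
import configparser


def check_workers(cfg_text):
    """Return a list of warnings about lb_procs / pin_cpus in node.cfg text."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    problems = []
    for name in parser.sections():
        sec = parser[name]
        # Only look at load-balanced worker sections.
        if sec.get("type") != "worker" or "lb_procs" not in sec:
            continue
        procs = sec.getint("lb_procs")
        pins = [c.strip() for c in sec.get("pin_cpus", "").split(",") if c.strip()]
        if pins and len(pins) != procs:
            problems.append(f"{name}: lb_procs={procs} but {len(pins)} pinned CPUs")
        if len(set(pins)) != len(pins):
            problems.append(f"{name}: pin_cpus lists the same CPU more than once")
    return problems
```

Passing the contents of node.cfg to `check_workers` returns an empty list for the corrected section above, which has 10 processes and 10 distinct CPUs. It says nothing about whether the chosen CPUs are on the right NUMA node or share physical cores.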

You don’t seem to use native Myricom support, there’s a plugin for that.

Great, would you point me in the right direction? Thank you.

http://lmgtfy.com/?q=Bro+Myricom

Finds me

https://www.bro.org/sphinx-git/components/bro-plugins/myricom/README.html

You’re welcome ;)

Tested on 2.5 master and beta. Haven’t tried on 2.4 although by the time you will build your cluster 2.5 will have been released.

Interesting.. which cards and which version of the snf drivers are you using?

I use interface=p1p1 on our clusters and have never had an issue. As I understood things, using snf0 was just an alias for 'the first myricom card'.

   Serial  MAC                ProductCode            Driver    Version      License
0  482741  00:60:dd:43:84:4a  10G-PCIE2-8C2-2S-SYNC  myri_snf  3.0.9.50782  Valid
1  482741  00:60:dd:43:84:4b  10G-PCIE2-8C2-2S-SYNC  myri_snf  3.0.9.50782  Valid

It may not make any difference; see below. I will try eth2, and that would reduce it to the time source default settings on the cards.

myri_snf INFO: myriC0: my ether interface name is eth2

myri_snf INFO: eth2: Will use skbuf frags (4096 bytes, order=0)

myri_snf INFO: Enabling host timestamping.

I did compile using below. Perhaps that is not enough, as I was following the myricom instructions. See attached

BRO App Note.pdf (274 KB)

Do you have a time source hooked up? If you do not, you’ll need to start the kernel module with myri_timesource=0. I learned this the hard way because I received a SYNC device when I ordered a regular.

My startup script does this on my SYNC device. Unload your module with /opt/snf/sbin/myri_start_stop stop, then run:

/opt/snf/sbin/rebuild.sh

/opt/snf/sbin/myri_start_stop start myri_timesource=0

Yeah, I already did that and determined that it was the issue. I responded as much, and the sysadmin blocked my response because I had myricom instructions attached to it.

Thanks for your response, it is helpful.