BlueField-2 NIC 100G live capture not creating intel.log

Hello!

I am struggling with Zeek not generating an intel.log file from the 100G interfaces on my cluster. A live capture of a 10G link on the same device, with the same scripts loaded as for the 100G interfaces, does produce an intel.log when a hit occurs.

I am not seeing anything in reporter.log or stderr.log on start with zeekctl deploy, nor do I get errors when I manually start a live capture on the 10G or 100G link with zeek -i.

My setup:

  • 2x bare metal machines with 2x Mellanox BlueField-2 NICs for the capture.
  • The 10G NIC that successfully made an intel log from a capture is a Broadcom BCM57414 on one of the machines.
  • One is the manager and logger.
  • They both have 12 worker processes assigned to each interface on their two BlueField-2 NICs.
  • Zeek version 6.0.0
  • AlmaLinux 9 kernel 5.14.0-284.25.1

What I see:
The Zeek capture on the 100G BlueField-2 capture interfaces creates and fills ssh, conn, dns, etc. logs as expected. I can see the IPs that I added to an intelligence file for “test bad guys” when I grep for them in those logs. But an intel.log is never generated, and nothing is added to the notice.log regarding those IPs. This happens both when I start Zeek using zeekctl deploy and with a manual zeek -i <interface> local run on those interfaces.

When I do a live capture on the 10G interface with zeek -i <interface> local, all the logs populate with connections made to/from the 10G NIC as expected. When I make a connection to/from a “test bad guy” IP that I’ve added to my intelligence file, an intel.log is created and I see every connection attempt I made pop up as an intel hit. A notice.log event is also made for that intelligence hit as expected.

What I want to see:
I want intel.log generated for the live capture data from the four 100G BlueField-2 interfaces on the two nodes when I force a test. I am currently not getting one. There is no difference in the scripts used for intelligence between my 10G and 100G test. I ran zeek -i <interface> local on both and the 10G created an intel.log and the 100G did not.

Other

  • The two types of NICs have different configuration, but the packets are being captured in both instances.
  • The reporter.log and stderr.log are both empty for the live 100G capture processes. I do see a CaptureLoss::Too_Little_Traffic notice, “Only observed 0 TCP ACKs and was expecting at least 1”, in the notice.log of the 100G capture when started with zeekctl deploy. This did not come up with zeek -i for the duration I tested.
  • All processes in the cluster are running
  • I can reproduce the results above 100% of the time. The 10G link always creates an intel.log file on hits from capture, but the 100G links fail to every time.

Any help troubleshooting would be appreciated. I have gone through the Zeek docs, the Zeek and network configs, and various tests, and only today identified that it is specifically the 100G interface captures that aren’t behaving as expected.

Thanks!

Mostly to double check, but given your explanation that must be true: local.zeek contains @load frameworks/intel/seen in both setups?

Given you see conn.log entries for “bad ip” with either interface, that should create a notice.log if the connection was observed to have established successfully.

Could you provide/compare the history and conn_state fields of the conn.log entries that have “bad ip” between the 10G and 100G interfaces? If they are the same, there should also be a notice.log entry for both; otherwise the difference should provide information about what might be wrong.
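One way to do that comparison is sketched below in Python. It assumes conn.log is in Zeek's default tab-separated format (a #fields header line followed by one row per connection), not JSON. The function name conn_summary and the file path in the usage comment are hypothetical, chosen only for illustration.

```python
def conn_summary(lines, ip):
    """Yield (uid, conn_state, history) for conn.log rows that involve `ip`.

    `lines` is any iterable of text lines from a TSV-format conn.log.
    """
    fields = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("#fields"):
            # The header is "#fields<TAB>ts<TAB>uid<TAB>..."; drop the marker.
            fields = line.split("\t")[1:]
            continue
        if not line or line.startswith("#"):
            # Skip other metadata lines such as #separator, #types, #close.
            continue
        row = dict(zip(fields, line.split("\t")))
        if ip in (row.get("id.orig_h"), row.get("id.resp_h")):
            yield row.get("uid"), row.get("conn_state"), row.get("history")


# Usage (hypothetical path; substitute the log directory of each run):
# with open("/path/to/run/conn.log") as f:
#     for entry in conn_summary(f, "67.176.240.185"):
#         print(entry)
```

Running it once against the 10G log directory and once against the 100G one puts the two fields side by side for the same IP.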

I double checked, and @load policy/frameworks/intel/seen is in local.zeek that ran on both interfaces. The command to run the test for both interfaces (the 10G and 100G are on the same machine) was zeek -i <interface> Log::default_logdir=<some dir> local

I don’t see a notice.log entry or intel.log file at all for the 100G run that saw the bad IP. I do see an intel.log for the 10G interface, and a notice.log entry for the IP as well.

As for the conn_state, I noticed a difference.
The 100G conn.log entry for the bad IP has a conn_state of OTH, while the 10G entry has RSTO.

100G

#fields ts      uid     id.orig_h       id.orig_p       id.resp_h       id.resp_p       proto   service duration        orig_bytes      resp_bytes      conn_state  local_orig      local_resp      missed_bytes    history orig_pkts       orig_ip_bytes   resp_pkts       resp_ip_bytes   tunnel_parents
1695743985.797192       CSsOSJsA8tOT6Jrzl       67.176.240.185  61307   192.170.240.8   22      tcp     -       0.329537        2373    0       OTH     F           F       0       SADR    12      2865    0       0       -

10G

#fields ts      uid     id.orig_h       id.orig_p       id.resp_h       id.resp_p       proto   service duration        orig_bytes      resp_bytes      conn_state  local_orig      local_resp      missed_bytes    history orig_pkts       orig_ip_bytes   resp_pkts       resp_ip_bytes   tunnel_parents
1695743984.316926       C7KMdC4waOtWMifVO7      67.176.240.185  61306   192.170.240.8   22      tcp     ssh     0.336991        2373    1653    RSTO    F           F       0       ShADadR 12      2865    12      2145    -

Hey @djordan66, thanks for providing the conn.log entries.

The SADR history from the 100G conn.log indicates that Zeek is only seeing traffic sent by the client: in the history field, upper-case letters denote packets from the originator and lower-case letters denote packets from the responder. Compare with ShADadR from the 10G conn.log, which contains both.

Due to seeing only “one side” of the traffic, Zeek never raises the connection_established event required by the intel/seen scripts.
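As a rough illustration of that reading of the history field, here is a minimal Python sketch that flags one-sided entries purely from the letter case. The helper names seen_sides and is_one_sided are made up for this example, and the check ignores any non-letter characters that may appear in a history string.

```python
from typing import Tuple


def seen_sides(history: str) -> Tuple[bool, bool]:
    """Return (originator_seen, responder_seen) for a conn.log history string."""
    # Upper-case letters record originator packets, lower-case responder packets.
    orig_seen = any(ch.isupper() for ch in history)
    resp_seen = any(ch.islower() for ch in history)
    return orig_seen, resp_seen


def is_one_sided(history: str) -> bool:
    """True when packets from exactly one direction were recorded."""
    orig_seen, resp_seen = seen_sides(history)
    return orig_seen != resp_seen


# The two history values quoted in this thread.
for label, history in (("100G", "SADR"), ("10G", "ShADadR")):
    verdict = "one-sided" if is_one_sided(history) else "both directions"
    print(label, history, verdict)
```

Applied to a whole conn.log, a check like this gives a quick estimate of how many connections a capture interface only sees half of.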

You should confirm that other tools (e.g., tcpdump -i <interface>) show the same behavior. If they do, consult the manual, documentation, or support for the card and/or driver to see whether there are settings around flow hashing / load balancing that may need to be tweaked. You could also try zeek -i af_packet::<interface> and see if that makes a difference, but I suspect not.

Let us know what you find, thanks!

Hello,

tcpdump gives me the same results. A dump from the 10G gives me an intel.log, but once I found the connection on a 100G link, I did not get an intel.log from that interface.

Would this have something to do with our capture setup?

The 100G links are capture only. They receive a copy of the production packets before they enter/leave our network to the WAN.

For example, packets coming into our network could be seen on interface ens2f0, while the response could be seen on ens2f1. Would this cause an issue with intel/seen?

The 10G interface is used for administration of the node, so it carries both transmit and receive traffic when I run zeek -i <interface> and tcpdump on it.

I haven’t completely followed this whole thread, but I think the use of a different interface for each direction of traffic is your problem. To make sure I have it right, you have two worker processes like:
zeek -i ens2f0 …
zeek -i ens2f1 …

Zeek workers do not share connection information, so you have to get all packets from a given flow to a single worker. I’ve never done it, but I believe you can create a bonded interface and listen to that one. I’m not sure how that interacts with AF_PACKET for load balancing across workers though.
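To illustrate why both directions have to reach the same worker, here is a toy Python sketch of a symmetric flow hash. It is not the algorithm used by Zeek, AF_PACKET, or the NIC; worker_for_packet and the CRC32-based hash are assumptions made purely for demonstration.

```python
import zlib


def worker_for_packet(src_ip, src_port, dst_ip, dst_port, proto, num_workers):
    """Pick a worker index so that both directions of a flow map to the same worker."""
    # Order the two endpoints so that A->B and B->A produce the same key.
    low, high = sorted(((src_ip, src_port), (dst_ip, dst_port)))
    key = f"{low[0]}:{low[1]}|{high[0]}:{high[1]}|{proto}".encode()
    # CRC32 is deterministic across runs, unlike Python's salted hash().
    return zlib.crc32(key) % num_workers


# Request and reply of the SSH connection from this thread.
request = worker_for_packet("67.176.240.185", 61307, "192.170.240.8", 22, "tcp", 12)
reply = worker_for_packet("192.170.240.8", 22, "67.176.240.185", 61307, "tcp", 12)
print(request == reply)
```

A symmetric hash only helps when a single distributor sees both directions. With requests arriving on ens2f0 and replies on ens2f1, each interface feeds its own set of workers, so the two halves of a flow never meet regardless of how the hash is computed.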

A lot of people utilize tap/aggregation switches in front of their tools to do some of this traffic merging.

-Dop

Yes.

Adding to what @dopheide said, would maybe BlueField Link Aggregation work for you?

  • As a result, only a single Physical function would be available to the host side.

That said, I’m not even sure this is the right documentation for your card :)

Thanks for the replies. I am going to try bonding the interfaces on the host side with one single worker (multiple threads) to see if that helps. I’ll update once I get a chance to test.

Hello again!

I tried bonding the interfaces, but was running into new issues that I didn’t want to troubleshoot involving the bonds flapping, etc. So I looked at Intelligence Framework — Book of Zeek (git/master)

Specifically this:

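# Report both endpoints to the Intel framework as soon as a connection is first seen,
# instead of waiting for connection_established.
event new_connection(c: connection)
        {
        Intel::seen([$host=c$id$orig_h, $conn=c, $where=Conn::IN_ORIG]);
        Intel::seen([$host=c$id$resp_h, $conn=c, $where=Conn::IN_RESP]);
        }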

This flags even the one-sided traffic as it just looks for any connection. I figured there should be a way for Zeek to flag intel hits on that as well. I also get notifications and everything works as I expect now!

Thanks for the replies and for pointing out that Zeek was only seeing one side of the traffic. I wouldn’t have thought to try this without that info.

You should consider fixing your packet capture/mirroring setup. It appears to me that BlueField Link Aggregation is the suggested solution. If Zeek sees only one side of each connection, you’ll likely run into a lot more issues, and the missing Intel hits were just the first one you happened to observe.
