BRO Logger crashing due to large DNS log files

All,

Having an issue with the bro logger crashing due to large volumes of DNS log traffic, 20-30GB an hour. This is completely a local configuration, on a system with super-fast flash storage, 64 cores, 256GB RAM running BRO 2.5.4. If I disable DNS logging, everything works fine without issue. When I enable it, I get the results below. I thought it might be an issue with gzipping the old logs, so I replaced the standard gzip with pigz and I can manually compress the 30+ gig files in seconds, so I don’t think that is the issue. I also tried pinning dedicated cores to the logger, currently 6 cores, which should be way more than enough. Any thoughts or suggestions?

Thanks,

Ron

current]# ll -h
total 43G
-rw-r--r--. 1 root root 3.2K Aug 18 12:00 capture_loss-18-08-18_11.00.00.log
-rw-r--r--. 1 root root 3.2K Aug 18 12:18 capture_loss-18-08-18_12.00.00.log
-rw-r--r--. 1 root root 2.3M Aug 18 12:00 communication-18-08-18_11.00.00.log
-rw-r--r--. 1 root root 1.4M Aug 18 12:18 communication-18-08-18_12.00.00.log
-rw-r--r--. 1 root root 4.8K Aug 18 12:18 communication.log
-rw-r--r--. 1 root root  19G Aug 18 11:39 dns-18-08-18_10.11.22.log
-rw-r--r--. 1 root root  16G Aug 18 12:26 dns-18-08-18_11.00.00.log
-rw-r--r--. 1 root root  12M Aug 18 12:00 files-18-08-18_11.00.00.log
-rw-r--r--. 1 root root 5.2M Aug 18 12:18 files-18-08-18_12.00.00.log
-rw-r--r--. 1 root root  15K Aug 18 12:00 known_certs-18-08-18_11.00.00.log
-rw-r--r--. 1 root root  15K Aug 18 12:18 known_certs-18-08-18_12.00.00.log
-rw-r--r--. 1 root root  98K Aug 18 12:00 known_hosts-18-08-18_11.00.00.log
-rw-r--r--. 1 root root  24K Aug 18 12:18 known_hosts-18-08-18_12.00.00.log
-rw-r--r--. 1 root root  71K Aug 18 12:00 known_services-18-08-18_11.00.00.log
-rw-r--r--. 1 root root 5.2K Aug 18 12:18 known_services-18-08-18_12.00.00.log
-rw-r--r--. 1 root root 1.6K Aug 18 12:00 notice-18-08-18_11.00.00.log
-rw-r--r--. 1 root root  954 Aug 18 12:18 notice-18-08-18_12.00.00.log
-rw-r--r--. 1 root root  262 Aug 18 12:18 reporter-18-08-18_12.00.00.log
-rw-r--r--. 1 root root  23M Aug 18 12:00 smtp-18-08-18_11.00.00.log
-rw-r--r--. 1 root root 9.2M Aug 18 12:18 smtp-18-08-18_12.00.00.log
-rw-r--r--. 1 root root 1.2M Aug 18 12:00 snmp-18-08-18_11.00.00.log
-rw-r--r--. 1 root root 415K Aug 18 12:18 snmp-18-08-18_12.00.00.log
-rw-r--r--. 1 root root  81K Aug 18 12:00 software-18-08-18_11.00.00.log
-rw-r--r--. 1 root root 8.4K Aug 18 12:18 software-18-08-18_12.00.00.log
-rw-r--r--. 1 root root  30K Aug 18 12:00 ssh-18-08-18_11.00.00.log
-rw-r--r--. 1 root root  13K Aug 18 12:18 ssh-18-08-18_12.00.00.log
-rw-r--r--. 1 root root 217K Aug 18 12:00 ssl-18-08-18_11.00.00.log
-rw-r--r--. 1 root root  78K Aug 18 12:18 ssl-18-08-18_12.00.00.log
-rw-r--r--. 1 root root  37K Aug 18 12:00 stats-18-08-18_11.00.00.log
-rw-r--r--. 1 root root  16K Aug 18 12:18 stats-18-08-18_12.00.00.log
-rw-r--r--. 1 root root   28 Aug 18 12:18 stderr.log
-rw-r--r--. 1 root root  188 Aug 18 10:11 stdout.log
-rw-r--r--. 1 root root 6.8G Aug 18 12:00 weird-18-08-18_11.00.00.log
-rw-r--r--. 1 root root 2.5G Aug 18 12:18 weird-18-08-18_12.00.00.log
-rw-r--r--. 1 root root 178K Aug 18 12:00 x509-18-08-18_11.00.00.log
-rw-r--r--. 1 root root  80K Aug 18 12:18 x509-18-08-18_12.00.00.log

/usr/local/bro/bin/bro --version

/usr/local/bro/bin/bro version 2.5.4

All,

                Having an issue with the bro logger crashing due to large volumes of DNS log traffic, 20-30GB an hour.

Is it actually crashing? Are you getting a crash report at all? From the filenames you listed it looks more like log rotation is failing.

This is completely a local configuration, on a system with super-fast flash storage, 64 cores, 256GB RAM running BRO 2.5.4. If I disable DNS logging, everything works fine without issue. When I enable it, I get the results below. I thought it might be an issue with gzipping the old logs, so I replaced the standard gzip with pigz and I can manually compress the 30+ gig files in seconds, so I don’t think that is the issue.

It could be related to the gzipping. The way log rotation works is not great.. all log files get compressed at the same time, which can cause some thrashing.

If you set

    compresslogs = 0

in broctl.cfg so that broctl does not gzip the logs at all, does the problem go away?

You could do something like that, and then run a script like:

while true; do
    for f in /usr/local/bro/logs/201*/*.log ; do
        gzip "$f"
    done
    sleep 60
done

to compress the logs in the background serially.
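For illustration, here is a rough python sketch of the same idea (mine, not something that ships with bro). It assumes the default /usr/local/bro/logs/201*/ archive layout from the loop above and that the compressor is on the PATH:

#!/usr/bin/env python
# Rough sketch of the same serial-compression idea, not from broctl itself.
# Assumes the default /usr/local/bro/logs/201*/ archive layout and that the
# chosen compressor (gzip, or pigz as Ron is using) is on the PATH.
import glob
import subprocess
import time

COMPRESSOR = "gzip"   # could be swapped for "pigz"

while True:
    # the *.log glob only matches files that have not been compressed yet
    for path in sorted(glob.glob("/usr/local/bro/logs/201*/*.log")):
        # each call blocks until that file is done, so only one compressor runs at a time
        subprocess.call([COMPRESSOR, path])
    time.sleep(60)

Swapping gzip for pigz there behaves the same way; it just uses more cores per file.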

Another thing to keep an eye on is whether your logger is able to keep up with the volume of data. This script is a plugin for munin, but you can run it directly:

#!/usr/bin/env python
import os
import sys
import time

DEFAULT_LOG = "/usr/local/bro/logs/current/dns.log"

def config():
    print """
graph_category network

graph_title Bro log lag
graph_vlabel lag
graph_args --base 1000 --vertical-label seconds --lower-limit 0
graph_info The bro log lag

lag.label lag
lag.info log message lag in seconds
lag.min 0
lag.warning 0:15
lag.critical 0:60
""".strip()

    return 0

def get_latest_time(fn):
    f = open(fn)

    f.seek(-4096, os.SEEK_END)
    end = f.read().splitlines()[1:-1] #ignore possibly incomplete first and last lines
    times = [line.split()[0] for line in end]
    timestamps = map(float, times)
    latest = max(timestamps)
    return latest

def lag(fn):
    lag = 500
    for x in range(3):
        try :
            latest = get_latest_time(fn)
            now = time.time()
            lag = now - latest
            break
        except (IOError, ValueError):
            #File could be rotating, wait and try again
            time.sleep(5)
    print "lag.value %f" % lag

if __name__ == "__main__":

    filename = os.getenv("BRO_LAG_FILENAME", DEFAULT_LOG)

    if sys.argv[1:] and sys.argv[1] == 'config':
        config()
    else:
        lag(filename)

It will output something like

lag.value 2.919352

A normal value should be about 5, anything under 20 is probably ok. If it's 500 and climbing, that's a problem.
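One usage note: the script honors the BRO_LAG_FILENAME environment variable (see the bottom of the script), so if you save it as, say, bro-log-lag (the name is just an example), you can point it at a different log:

$ BRO_LAG_FILENAME=/usr/local/bro/logs/current/conn.log ./bro-log-lag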

Also..

-rw-r--r--. 1 root root 6.8G Aug 18 12:00 weird-18-08-18_11.00.00.log
-rw-r--r--. 1 root root 2.5G Aug 18 12:18 weird-18-08-18_12.00.00.log

That's a LOT of weird.log, what's going on there?

Hi Ron,

I had a similar issue where both the Bro DNS and weird logs were huge. In my case, it turned out that our taps were primarily seeing inbound DNS replies but not all of the outbound DNS requests, which triggered weird log entries. Something to consider if the majority of your weird logs are DNS-related entries.

Justin,

  Thanks, I turned off compression and so far, for 2+ hours, everything has been working well. I kinda had an idea it was related to the compression, but I thought the pigz replacement would take care of that; guess not. Appreciate the help. Will let everyone know how it goes over the long term. I think you and Chris hit the nail on the head about the weird logs. I haven't really started tuning much; I wanted to get the system nice and stable first and then start tuning and looking at the weird stuff, which is heavy DNS.

Thanks Again,

Ron

[root@ current]# cat weird.log | bro-cut name|sort|uniq -c|sort -rn
34264380 dns_unmatched_msg
16696030 dns_unmatched_reply
330912 DNS_RR_unknown_type
  62288 possible_split_routing
  59512 data_before_established
  38396 NUL_in_line
  21210 inappropriate_FIN
  21209 line_terminated_with_single_CR
  18978 DNS_RR_length_mismatch
   1852 bad_TCP_checksum
   1060 dnp3_corrupt_header_checksum
    922 truncated_tcp_payload
    326 dnp3_header_lacks_magic
    230 DNS_truncated_RR_rdlength_lt_len
     92 non_ip_packet_in_ethernet
     92 above_hole_data_without_any_acks
     48 SYN_seq_jump
     46 window_recision
     46 dns_unmatched_msg_quantity
     46 DNS_truncated_ans_too_short
     46 DNS_RR_bad_length
     46 DNS_Conn_count_too_large
     46 ayiya_tunnel_non_ip

Update:

  Worked for almost 3 hours, but then started failing again. I even changed the log rotation to every 15 minutes and it still crashes. Any other suggestions? Has anyone ever tried configuring syslog-ng to handle the logging?

Warning: broctl config has changed (run the broctl "deploy" command)
Name         Type     Host        Status        Pid    Started
logger       logger   localhost   terminating   28295  20 Aug 12:30:03
manager      manager  localhost   running       28336  20 Aug 12:30:05
proxy-1      proxy    localhost   running       28375  20 Aug 12:30:06
worker-1-1   worker   localhost   running       28565  20 Aug 12:30:08

Thanks,

Ron

I hope I’m not asking the obvious, but was the warning heeded?

Warning: broctl config has changed (run the broctl “deploy” command)

That's really interesting.. because "terminating" means something very specific, and not something you would see if it was crashing.

Unfortunately broctl throws away the 2nd part of the status file that would narrow that down further, but there are only a few reasons:

src/main.cc
275: set_processing_status("TERMINATING", "done_with_network");
331: set_processing_status("TERMINATING", "terminate_bro");
392: set_processing_status("TERMINATING", "termination_signal");
413: set_processing_status("TERMINATING", "sig_handler");

src/Net.cc
432: set_processing_status("TERMINATING", "net_finish");
457: set_processing_status("TERMINATING", "net_delete");

done_with_network, net_finish, and net_delete wouldn't apply to a logger node that has no network interfaces.

termination_signal and sig_handler happen when bro gets a SIGINT or SIGTERM, and terminate_bro happens
when bro exits normally.

If it does happen again and stays like that, could you run

$ sudo cat /usr/local/bro/spool/logger/.status
RUNNING [net_run]

that should show

TERMINATING [one of those reasons]

which would definitively narrow it down.

Is there anything on your system that would be killing bro? If it were the kernel OOM killer I'd expect that to show up as crashed and not terminating.

Jim,

Never hurts to ask, but I did use deploy.

Thanks,

Ron

Justin,

  Nothing really on the system that would be killing the logger; the system is a base CentOS 7 box, recently built just for BRO. The .status file shows "TERMINATED [atexit]".

Ron

[root@ ron]# sudo cat /logs/bro/spool/logger/.status
TERMINATED [atexit]

Name         Type     Host        Status    Pid    Started
logger       logger   localhost   crashed
manager      manager  localhost   running   55680  20 Aug 15:25:47
proxy-1      proxy    localhost   running   55719  20 Aug 15:25:49
worker-1-1   worker   localhost   running   55893  20 Aug 15:25:50
worker-1-2   worker   localhost   running   55897  20 Aug 15:25:50
worker-1-40  worker   localhost   running   56411  20 Aug 15:25:50
worker-1-41  worker   localhost   crashed
worker-1-42  worker   localhost   running   56444  20 Aug 15:25:50
worker-1-43  worker   localhost   running   56446  20 Aug 15:25:50

Ah.. well now that says 'crashed', which is what you'd expect if it was crashing (not 'terminating').

If it is crashing then something should say why...

Is broctl sending you a crash report when that happens? What does broctl diag say?

Are there any kernel OOM messages in dmesg or syslog?

Or any messages that look like

bro[60506]: segfault at 0 ip 00000000005fcf8d sp 00007fffaf9d2f40 error 6 in bro[400000+624000]

Justin,

  The first 5 lines are consistent; the last 2 lines were seen for the first time today. The crash report wasn't very useful (see below), and diag was pretty much the same. Hopefully the OOM messages help.

Ron

Aug 21 09:45:18 aosoc kernel: Out of memory: Kill process 6610 (bro) score 507 or sacrifice child
Aug 21 09:45:18 aosoc kernel: Killed process 6610 (bro) total-vm:139995144kB, anon-rss:137467264kB, file-rss:0kB, shmem-rss:0kB
Aug 21 11:32:23 aosoc kernel: bro invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
Aug 21 11:32:23 aosoc kernel: bro cpuset=/ mems_allowed=0-1
Aug 21 11:32:23 aosoc kernel: CPU: 57 PID: 21655 Comm: bro Kdump: loaded Not tainted 3.10.0-862.11.6.el7.x86_64 #1
Aug 21 11:32:23 aosoc kernel: Out of memory: Kill process 20158 (bro) score 544 or sacrifice child
Aug 21 11:32:23 aosoc kernel: Killed process 20158 (bro) total-vm:150275592kB, anon-rss:147621508kB, file-rss:0kB, shmem-rss:0kB

===============Crash Report===================

This crash report does not include a backtrace. In order for crash reports to be useful when Bro crashes, a backtrace is needed.

No core file found and gdb is not installed. It is recommended to install gdb so that BroControl can output a backtrace if Bro crashes.

Bro 2.5.4
Linux 3.10.0-862.11.6.el7.x86_64

Bro plugins: (none found)

==== No reporter.log

==== stderr.log
received termination signal

==== stdout.log
max memory size (kbytes, -m) unlimited
data seg size (kbytes, -d) unlimited
virtual memory (kbytes, -v) unlimited
core file size (blocks, -c) unlimited

==== .cmdline
-U .status -p broctl -p broctl-live -p local -p logger local.bro broctl base/frameworks/cluster local-logger.bro broctl/auto

==== .env_vars
PATH=/usr/local/bro/bin:/usr/local/bro/share/broctl/scripts:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/usr/local/bro/bin:/home/ron/.local/bin:/home/ron/bin:/usr/local/bro/bin
BROPATH=/logs/bro/spool/installed-scripts-do-not-touch/site::/logs/bro/spool/installed-scripts-do-not-touch/auto:/usr/local/bro/share/bro:/usr/local/bro/share/bro/policy:/usr/local/bro/share/bro/site
CLUSTER_NODE=logger

==== .status
TERMINATED [atexit]

==== No prof.log

==== No packet_filter.log

==== No loaded_scripts.log

Ah, this is great.. well, not great in that it is crashing, but great in that now we know what is wrong: you're running out of RAM.

So you said you had 256GB, which should normally be more than enough as long as everything is working properly, but I have a feeling some things are not working quite right.

Have you had a chance to run that python program I posted? If you have a high amount of log lag, something is not keeping up well.

Do you have any graphs of memory usage on your host?

What exactly does this output:

$ cat /proc/cpuinfo |grep 'model name'|sort|uniq -c
     40 model name : Intel(R) Xeon(R) CPU E5-2470 v2 @ 2.40GHz

The fact that you are seeing

34264380 dns_unmatched_msg
16696030 dns_unmatched_reply
62288 possible_split_routing
59512 data_before_established

in your weird.log points to something being very very wrong with your traffic. This can cause bro to work many times harder than it needs to.

How is your load balancing set up in your node.cfg?

Can you try running bro-doctor from bro-pkg: https://packages.bro.org/packages/view/74d45e8c-4fb7-11e8-88be-0a645a3f3086

If you can't run bro-pkg you just need to grab

https://raw.githubusercontent.com/ncsa/bro-doctor/master/doctor.py

and drop it in

/usr/local/bro/lib/broctl/plugins/doctor.py

and run broctl doctor.bro

Thanks Justin, here is the info.

Ron

######################Memory CPU################################################################

[root@ ron]# free
              total        used        free      shared  buff/cache   available
Mem:      263620592   241469748      861800        4464    21289044    21038584
Swap:       4194300       80508     4113792

[root@ current]# cat /proc/cpuinfo |grep 'model name'|sort|uniq -c
     72 model name : Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz

######################Python Script Output################################################################

lag.value 500.000000

######################node.cfg################################################################

cat /usr/local/bro/etc/node.cfg
# Example BroControl node configuration.

Thanks Justin, here is the info.

Ron

OK, so there's a lot of stuff wrong here, but it's all fixable.

######################Memory CPU################################################################

[root@ ron]# free
              total        used        free      shared  buff/cache   available
Mem:      263620592   241469748      861800        4464    21289044    21038584
Swap:       4194300       80508     4113792

[root@ current]# cat /proc/cpuinfo |grep 'model name'|sort|uniq -c
    72 model name : Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz

[worker-1]
type=worker
host=localhost
interface=ens1f0
lb_method=pf_ring
lb_procs=48
pin_cpus=12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60

This is an absolute beast of a CPU, but you only have 36 real cores. I would run something like 30 workers, not 48.
You also have to ensure you don't pin workers to 'fake' hyperthreading cores. Depending on how the CPU gets
enumerated you either need to use something like 6 through 35, or every other CPU. 6-35 is probably right for your CPU though.
This can be verified using hwloc-ps in the hwloc package.. I want to add this to bro-doctor but it's.. tricky :)

You may also have 2 NUMA nodes; in that case you would probably want to use 15 workers on each NUMA node, which would be something like
3-18 and 21-36... very tricky. Ideally broctl could just support pin_cpus=auto as well :)
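If it helps, here is a rough sketch (mine, not something shipped with bro or hwloc) that reads the kernel's sysfs topology files and prints one CPU id per physical core, which is essentially the set you would build pin_cpus from:

#!/usr/bin/env python
# Rough sketch, not from bro or hwloc: list one CPU id per physical core by
# reading /sys/devices/system/cpu/cpu*/topology/thread_siblings_list, so that
# hyperthread siblings can be left out of pin_cpus. Linux only.
import glob
import re

def cpu_id(path):
    return int(re.search(r"cpu(\d+)/", path).group(1))

def expand(spec):
    # the file looks like "0,36" or "0-1"; expand it to a list of CPU ids
    ids = []
    for part in spec.strip().split(","):
        if "-" in part:
            lo, hi = part.split("-")
            ids.extend(range(int(lo), int(hi) + 1))
        else:
            ids.append(int(part))
    return ids

seen = set()
physical = []
for path in sorted(glob.glob("/sys/devices/system/cpu/cpu[0-9]*/topology/thread_siblings_list"), key=cpu_id):
    siblings = expand(open(path).read())
    if seen.isdisjoint(siblings):
        # the first thread we see on a core counts as the "real" one
        physical.append(min(siblings))
    seen.update(siblings)

print("physical cores: " + ",".join(str(c) for c in sorted(physical)))

It doesn't know about NUMA, so you'd still want hwloc-ls for that part, but it's enough to keep the hyperthread siblings out of the list.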

######################Python Script Output################################################################

lag.value 500.000000

This isn't great.. but it's about what would be expected given that you're running out of RAM.

[root@ ron]# broctl doctor.bro
#################################################################
# Checking if many recent connections have a SAD or had history #
#################################################################
error: No conn log files in the past day???

If you disabled the conn.log then most of the checks that bro-doctor does can't be run :(

###############################################################################
# Checking if bro is linked against a custom malloc like tcmalloc or jemalloc #
###############################################################################
error: configured to use a custom malloc=False

I would install gperftools-devel and rebuild bro

##################################
# Checking pf_ring configuration #
##################################
error: bro binary on node worker-1-1 is neither linked against pf_ring libpcap or using the bro pf_ring plugin
error: bro binary on node worker-1-2 is neither linked against pf_ring libpcap or using the bro pf_ring plugin
..
error: configured to use pf_ring=True pcap=False plugin=False

This is super super bad! This means that every bro process is seeing all of your traffic.. all 48 of them!

You can fix pf_ring, but on a recent CentOS 7 system AF_Packet works great out of the box, so I would try using that first
instead. You just need to

bro-pkg install j-gras/bro-af_packet-plugin

and then in your node.cfg use

[worker-1]
type=worker
host=localhost
interface=af_packet::ens1f0
lb_method=custom
lb_procs=30
pin_cpus=6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35
af_packet_fanout_id=22
af_packet_fanout_mode=AF_Packet::FANOUT_HASH
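
Once the plugin is installed, you can sanity-check that bro actually picked it up (if I remember the plugin README right, bro -N Bro::AF_Packet will list it) before pointing the workers at the af_packet:: interfaces.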

I double-checked this to verify I was saying the right thing, and looking closer at all our boxes that are NUMA it seems the CPUs get allocated round-robin between them, so

0,2,4,6,8,10....34 go to NUMA node 0, then
1,3,5,7,9,11....35 go to NUMA node 1

But I think that can depend on kernel version, and I swear this changed between CentOS 6 and 7.

This matters more if you have 2 cards you can capture from though, since most likely one card is attached to each NUMA node.
If you only have one 10G interface you are capturing from, it doesn't matter as much.

$ hwloc-ls -p
Machine (64GB total)
NUMANode P#0 (32GB)
   Package P#0 + L3 (14MB)
     L2 (1024KB) + L1d (32KB) + L1i (32KB) + Core P#0
       PU P#0
       PU P#20
     L2 (1024KB) + L1d (32KB) + L1i (32KB) + Core P#4
       PU P#2
       PU P#22
     L2 (1024KB) + L1d (32KB) + L1i (32KB) + Core P#1
       PU P#4
       PU P#24
     L2 (1024KB) + L1d (32KB) + L1i (32KB) + Core P#3
       PU P#6
       PU P#26
     L2 (1024KB) + L1d (32KB) + L1i (32KB) + Core P#2
       PU P#8
       PU P#28
     L2 (1024KB) + L1d (32KB) + L1i (32KB) + Core P#12
       PU P#10
       PU P#30
     L2 (1024KB) + L1d (32KB) + L1i (32KB) + Core P#8
       PU P#12
       PU P#32
     L2 (1024KB) + L1d (32KB) + L1i (32KB) + Core P#11
       PU P#14
       PU P#34
     L2 (1024KB) + L1d (32KB) + L1i (32KB) + Core P#9
       PU P#16
       PU P#36
     L2 (1024KB) + L1d (32KB) + L1i (32KB) + Core P#10
       PU P#18
       PU P#38
   HostBridge P#2
     PCIBridge
       PCI 8086:1584
         Net "p2p1"
NUMANode P#1 (32GB)
   Package P#1 + L3 (14MB)
     L2 (1024KB) + L1d (32KB) + L1i (32KB) + Core P#0
       PU P#1
       PU P#21
     L2 (1024KB) + L1d (32KB) + L1i (32KB) + Core P#4
       PU P#3
       PU P#23
     L2 (1024KB) + L1d (32KB) + L1i (32KB) + Core P#1
       PU P#5
       PU P#25
     L2 (1024KB) + L1d (32KB) + L1i (32KB) + Core P#3
       PU P#7
       PU P#27
     L2 (1024KB) + L1d (32KB) + L1i (32KB) + Core P#2
       PU P#9
       PU P#29
     L2 (1024KB) + L1d (32KB) + L1i (32KB) + Core P#12
       PU P#11
       PU P#31
     L2 (1024KB) + L1d (32KB) + L1i (32KB) + Core P#8
       PU P#13
       PU P#33
     L2 (1024KB) + L1d (32KB) + L1i (32KB) + Core P#11
       PU P#15
       PU P#35
     L2 (1024KB) + L1d (32KB) + L1i (32KB) + Core P#9
       PU P#17
       PU P#37
     L2 (1024KB) + L1d (32KB) + L1i (32KB) + Core P#10
       PU P#19
       PU P#39
   HostBridge P#6
     PCIBridge
       PCI 8086:1584
         Net "p3p1"

This gets really confusing, but it shows that p2p1 is connected to the first NUMA node and has real CPUs 0,2,4,6,8... and hyperthread cores 20,22,24,26...

For boxes like that, the configuration would look something like

[...p2p1]
type=worker
host=..
interface=af_packet::p2p1
lb_method=custom
lb_procs=8
pin_cpus=4,6,8,10,12,14,16,18
af_packet_fanout_id=21
af_packet_fanout_mode=AF_Packet::FANOUT_HASH

[...p3p1]
type=worker
host=..
interface=af_packet::p3p1
lb_method=custom
lb_procs=8
pin_cpus=5,7,9,11,13,15,17,19
af_packet_fanout_id=22
af_packet_fanout_mode=AF_Packet::FANOUT_HASH

Justin,

  I finished most of your recommendations; I just need to rebuild bro, but I was going to let it run overnight and see how it is running now. I really appreciate all the help.

Thanks,

Ron

]# hwloc-ls -p
Machine (256GB total)
  NUMANode P#0 (128GB)
    Package P#0 + L3 (25MB)
      L2 (1024KB) + L1d (32KB) + L1i (32KB) + Core P#0
        PU P#0
        PU P#36
      L2 (1024KB) + L1d (32KB) + L1i (32KB) + Core P#1
        PU P#1
        PU P#37
      L2 (1024KB) + L1d (32KB) + L1i (32KB) + Core P#2
        PU P#2
        PU P#38
      L2 (1024KB) + L1d (32KB) + L1i (32KB) + Core P#3
        PU P#3
        PU P#39
      L2 (1024KB) + L1d (32KB) + L1i (32KB) + Core P#4
        PU P#4
        PU P#40
      L2 (1024KB) + L1d (32KB) + L1i (32KB) + Core P#8
        PU P#5
        PU P#41
      L2 (1024KB) + L1d (32KB) + L1i (32KB) + Core P#9
        PU P#6
        PU P#42
      L2 (1024KB) + L1d (32KB) + L1i (32KB) + Core P#10
        PU P#7
        PU P#43
      L2 (1024KB) + L1d (32KB) + L1i (32KB) + Core P#11
        PU P#8
        PU P#44
      L2 (1024KB) + L1d (32KB) + L1i (32KB) + Core P#16
        PU P#9
        PU P#45
      L2 (1024KB) + L1d (32KB) + L1i (32KB) + Core P#17
        PU P#10
        PU P#46
      L2 (1024KB) + L1d (32KB) + L1i (32KB) + Core P#18
        PU P#11
        PU P#47
      L2 (1024KB) + L1d (32KB) + L1i (32KB) + Core P#19
        PU P#12
        PU P#48
      L2 (1024KB) + L1d (32KB) + L1i (32KB) + Core P#20
        PU P#13
        PU P#49
      L2 (1024KB) + L1d (32KB) + L1i (32KB) + Core P#24
        PU P#14
        PU P#50
      L2 (1024KB) + L1d (32KB) + L1i (32KB) + Core P#25
        PU P#15
        PU P#51
      L2 (1024KB) + L1d (32KB) + L1i (32KB) + Core P#26
        PU P#16
        PU P#52
      L2 (1024KB) + L1d (32KB) + L1i (32KB) + Core P#27
        PU P#17
        PU P#53
    HostBridge P#0
      PCIBridge
        PCI 14e4:1657
          Net "eno1"
        PCI 14e4:1657
          Net "eno2"
        PCI 14e4:1657
          Net "eno3"
        PCI 14e4:1657
          Net "eno4"
      PCIBridge
        PCI 102b:0538
          GPU "card0"
          GPU "controlD64"
    HostBridge P#1
      PCIBridge
        PCI 8086:1572
          Net "ens1f0"
        PCI 8086:1572
          Net "ens1f1"
      PCIBridge
        PCI 8086:1572
          Net "ens3f0"
        PCI 8086:1572
          Net "ens3f1"
    HostBridge P#2
      PCIBridge
        PCI 8086:1572
          Net "ens2f0"
        PCI 8086:1572
          Net "ens2f1"
    HostBridge P#3
      PCIBridge
        PCI 9005:028f
          Block(Disk) "sda"
          Block(Disk) "sdc"
  NUMANode P#1 (128GB) + Package P#1 + L3 (25MB)
    L2 (1024KB) + L1d (32KB) + L1i (32KB) + Core P#0
      PU P#18
      PU P#54
    L2 (1024KB) + L1d (32KB) + L1i (32KB) + Core P#1
      PU P#19
      PU P#55
    L2 (1024KB) + L1d (32KB) + L1i (32KB) + Core P#2
      PU P#20
      PU P#56
    L2 (1024KB) + L1d (32KB) + L1i (32KB) + Core P#3
      PU P#21
      PU P#57
    L2 (1024KB) + L1d (32KB) + L1i (32KB) + Core P#4
      PU P#22
      PU P#58
    L2 (1024KB) + L1d (32KB) + L1i (32KB) + Core P#8
      PU P#23
      PU P#59
    L2 (1024KB) + L1d (32KB) + L1i (32KB) + Core P#9
      PU P#24
      PU P#60
    L2 (1024KB) + L1d (32KB) + L1i (32KB) + Core P#10
      PU P#25
      PU P#61
    L2 (1024KB) + L1d (32KB) + L1i (32KB) + Core P#11
      PU P#26
      PU P#62
    L2 (1024KB) + L1d (32KB) + L1i (32KB) + Core P#16
      PU P#27
      PU P#63
    L2 (1024KB) + L1d (32KB) + L1i (32KB) + Core P#17
      PU P#28
      PU P#64
    L2 (1024KB) + L1d (32KB) + L1i (32KB) + Core P#18
      PU P#29
      PU P#65
    L2 (1024KB) + L1d (32KB) + L1i (32KB) + Core P#19
      PU P#30
      PU P#66
    L2 (1024KB) + L1d (32KB) + L1i (32KB) + Core P#20
      PU P#31
      PU P#67
    L2 (1024KB) + L1d (32KB) + L1i (32KB) + Core P#24
      PU P#32
      PU P#68
    L2 (1024KB) + L1d (32KB) + L1i (32KB) + Core P#25
      PU P#33
      PU P#69
    L2 (1024KB) + L1d (32KB) + L1i (32KB) + Core P#26
      PU P#34
      PU P#70
    L2 (1024KB) + L1d (32KB) + L1i (32KB) + Core P#27
      PU P#35
      PU P#71

  I finished most of your recommendations; I just need to rebuild bro, but I was going to let it run overnight and see how it is running now. I really appreciate all the help.

Great! There may be more things to fix, but once that load balancing is working properly things will be in a lot better shape.

This was really helpful to see as well:

]# hwloc-ls -p
Machine (256GB total)
NUMANode P#0 (128GB)
   Package P#0 + L3 (25MB)
     L2 (1024KB) + L1d (32KB) + L1i (32KB) + Core P#0
       PU P#0 <----
       PU P#36
     L2 (1024KB) + L1d (32KB) + L1i (32KB) + Core P#1
       PU P#1 <---
       PU P#37

You have CPUs 0,1,2,3.. on the same NUMA node, but every box I have puts 0,2,4... on one and 1,3,5... on the other.

Machine (64GB total)
NUMANode P#0 (32GB)
  Package P#0 + L3 (14MB)
    L2 (1024KB) + L1d (32KB) + L1i (32KB) + Core P#0
      PU P#0 <---
      PU P#20
    L2 (1024KB) + L1d (32KB) + L1i (32KB) + Core P#4
      PU P#2 <---
      PU P#22

All the more reason for me to get bro-doctor to do this analysis and confirm the proper pin_cpus values are being used.

Justin,

  I've got good news and solid progress with your help. BRO is running on both boxes and hasn't crashed since 10pm last night. If I read the NUMA data from my systems correctly, I don't really need to split the load between 2 workers as you did, right? I'm working on tuning some now and also trying to address the really high lag (500) that I'm still seeing. Currently seeing some loss on it, but will continue to tune and see if I can get that under control. Let me know if you need help testing the doctor script.

Ron

# cat capture_loss.log
#separator \x09
#set_separator ,
#empty_field (empty)
#unset_field -
#path capture_loss
#open 2018-08-22-10-01-21
#fields ts ts_delta peer gaps acks percent_lost
#types time interval string count count double
1534946481.938006 900.000084 worker-1-20 33 696 4.741379
1534946481.941548 900.000000 worker-1-24 20 2722 0.734754
1534946481.938533 900.000059 worker-1-21 630 40222 1.566307
1534946481.938396 900.000070 worker-1-9 89 1470 6.054422
1534946481.941452 900.000044 worker-1-8 156 1821 8.566722
1534946481.941323 900.000017 worker-1-12 1062 232679 0.456423
1534946481.939547 900.000037 worker-1-27 1023 216063 0.473473
1534946481.937269 900.000040 worker-1-10 749 5465 13.705398
1534946481.937517 900.000111 worker-1-3 87 15720 0.553435
1534946481.941367 900.000649 worker-1-16 117 2187 5.349794
1534946481.939451 900.000079 worker-1-7 870 195358 0.445336
1534946481.940450 900.000041 worker-1-5 111 626 17.731629
1534946481.931345 900.000019 worker-1-4 44 885 4.971751
1534946481.941268 900.000074 worker-1-17 131 1641 7.982937
1534946481.946945 900.000039 worker-1-18 189 1350 14.0
1534946481.941532 900.000083 worker-1-25 118 9414 1.253452
1534946481.942680 900.000094 worker-1-30 1375 2635 52.182163
1534946481.937385 900.000074 worker-1-1 1050 232183 0.452229
1534946481.939621 900.000062 worker-1-26 20 1973 1.013685
1534946481.942331 900.000127 worker-1-2 1236 240350 0.51425
1534946481.938535 900.000003 worker-1-29 133 2923 4.55012
1534946481.938737 900.000077 worker-1-13 1463 223976 0.653195
1534946481.937868 900.000121 worker-1-15 278 2360 11.779661
1534946481.937738 900.000006 worker-1-28 36 765 4.705882
1534946481.940076 900.000039 worker-1-23 43 3749 1.146973
1534946481.940530 900.000008 worker-1-22 1151 4798 23.989162
1534946481.944632 900.000030 worker-1-19 510 88481 0.576395
1534946481.937329 900.000045 worker-1-6 891 146039 0.610111
1534946481.938533 900.000095 worker-1-14 206 2276 9.050967
1534946481.937384 900.000074 worker-1-11 222 2176 10.202206
1534947381.938548 900.000013 worker-1-29 1135 241449 0.470079
1534947381.942682 900.000002 worker-1-30 399 13150 3.034221
1534947381.939458 900.000007 worker-1-7 332 66504 0.499218
1534947381.937742 900.000004 worker-1-28 31 711 4.360056
1534947381.940622 900.000092 worker-1-22 77 1728 4.456019
1534947381.938073 900.000067 worker-1-20 103 2343 4.396073
1534947381.941622 900.000074 worker-1-24 90 7394 1.217203
1534947381.941549 900.000017 worker-1-25 1259 235553 0.534487
1534947381.941454 900.000087 worker-1-16 231 5455 4.234647
1534947381.942399 900.000068 worker-1-2 69 1293 5.336427
1534947381.941324 900.000056 worker-1-17 152 759 20.02635
1534947381.931395 900.000050 worker-1-4 1310 240018 0.545792
1534947381.938810 900.000073 worker-1-13 109 17301 0.630021
1534947381.938606 900.000073 worker-1-14 305 2184 13.965201
1534947381.937398 900.000069 worker-1-6 67 3465 1.933622
1534947381.940457 900.000007 worker-1-5 118 1280 9.21875
1534947381.937470 900.000085 worker-1-1 24 1581 1.518027
1534947381.940195 900.000119 worker-1-23 189 20872 0.905519
1534947381.937614 900.000097 worker-1-3 1167 213001 0.547885
1534947381.944751 900.000119 worker-1-19 160 4249 3.765592
1534947381.937943 900.000075 worker-1-15 593 2541 23.337269
1534947381.947066 900.000121 worker-1-18 809 160344 0.50454
1534947381.939548 900.000001 worker-1-27 219 2612 8.38438
1534947381.938628 900.000095 worker-1-21 302 1627 18.56177
1534947381.937326 900.000057 worker-1-10 107 1763 6.0692
1534947381.938497 900.000101 worker-1-9 1599 238664 0.66998
1534947381.941398 900.000075 worker-1-12 201 2936 6.846049
1534947381.937399 900.000015 worker-1-11 1382 236433 0.584521
1534947381.939677 900.000056 worker-1-26 52 1100 4.727273
1534947381.941453 900.000001 worker-1-8 224 1601 13.991255
1534948281.939548 900.000090 worker-1-7 1088 235524 0.461949
1534948281.941678 900.000129 worker-1-25 202 32683 0.618058
1534948281.947198 900.000132 worker-1-18 284 6208 4.574742
1534948281.937477 900.000079 worker-1-6 70 14679 0.476872
1534948281.937532 900.000062 worker-1-1 57 1621 3.516348
1534948281.937477 900.000078 worker-1-11 71 24940 0.284683
1534948281.938938 900.000128 worker-1-13 111 12288 0.90332
1534948281.941679 900.000057 worker-1-24 731 121315 0.602564
1534948281.938621 900.000015 worker-1-14 1056 230109 0.458913
1534948281.942751 900.000069 worker-1-30 34 448 7.589286
1534948281.938548 900.000000 worker-1-29 219 1033 21.200387
1534948281.941325 900.000001 worker-1-17 671 111097 0.603977
1534948281.937348 900.000022 worker-1-10 145 1917 7.563902
1534948281.938055 900.000112 worker-1-15 859 187429 0.458307
1534948281.939622 900.000074 worker-1-27 50 3453 1.448016
1534948281.931396 900.000001 worker-1-4 193 3759 5.134344
1534948281.937780 900.000038 worker-1-28 230 6086 3.779165
1534948281.938109 900.000036 worker-1-20 1135 230316 0.492801
1534948281.938512 900.000015 worker-1-9 44 3888 1.131687
1534948281.940323 900.000128 worker-1-23 30 1212 2.475248
1534948281.939677 900.000000 worker-1-26 165 6336 2.604167
1534948281.940527 900.000070 worker-1-5 96 5162 1.859744
1534948281.937736 900.000122 worker-1-3 1123 249305 0.450452
1534948281.941454 900.000001 worker-1-8 67 1910 3.507853
1534948281.940679 900.000057 worker-1-22 115 4310 2.668213
1534948281.938677 900.000049 worker-1-21 25 2141 1.167679
1534948281.944879 900.000128 worker-1-19 29 1637 1.771533
1534948281.942454 900.000055 worker-1-2 36 2033 1.770782
1534948281.941453 900.000055 worker-1-12 26 991 2.623613
1534948281.941454 900.000000 worker-1-16 1127 230791 0.488321

cat capture_loss.log
#separator \x09
#set_separator ,
#empty_field (empty)
#unset_field -
#path capture_loss
#open 2018-08-22-10-06-13
#fields ts ts_delta peer gaps acks percent_lost
#types time interval string count count double
1534946772.685666 900.000108 worker-1-9 71276 209039 34.096987
1534946772.682117 900.000110 worker-1-20 43286 430827 10.047188
1534946772.686758 900.000020 worker-1-22 58337 172653 33.788582
1534946772.689750 900.000013 worker-1-17 61579 422200 14.585268
1534946772.683422 900.000599 worker-1-4 62846 224500 27.993764
1534946772.692533 900.000076 worker-1-13 56519 190555 29.660203
1534946772.684749 900.000086 worker-1-15 41612 129870 32.041272
1534946772.684889 900.000230 worker-1-27 76559 187163 40.904987
1534946772.683731 900.000001 worker-1-25 74450 188407 39.515517
1534946772.681934 900.000111 worker-1-5 50253 153355 32.769065
1534946772.682021 900.000012 worker-1-28 52191 151854 34.369197
1534946772.682825 900.000074 worker-1-8 52037 190660 27.293087
1534946772.699409 900.000084 worker-1-16 88137 266670 33.050962
1534946772.685734 900.000100 worker-1-30 51271 238600 21.488265
1534946772.682739 900.000022 worker-1-6 66273 250566 26.449319
1534946772.682741 900.000063 worker-1-26 49902 153687 32.46989
1534946772.681960 900.000006 worker-1-1 89188 255018 34.973218
1534946772.682631 900.000622 worker-1-29 60705 210476 28.841768
1534946772.681953 900.000075 worker-1-2 38281 125211 30.573192
1534946772.682673 900.000007 worker-1-3 67450 187531 35.967387
1534946772.686732 900.000060 worker-1-23 55932 191885 29.148709
1534946772.681828 900.000005 worker-1-7 66947 445007 15.044033
1534946772.681886 900.000007 worker-1-11 48944 138084 35.445091
1534946772.693528 900.000000 worker-1-14 65762 188557 34.876456
1534946772.681885 900.000006 worker-1-10 62149 428124 14.516589
1534946772.685697 900.000017 worker-1-21 48039 147640 32.53793
1534946772.683753 900.000022 worker-1-19 59660 157172 37.958415
1534946772.705397 900.000127 worker-1-24 71820 223813 32.089289
1534946772.688718 900.000117 worker-1-18 48410 452562 10.696877
1534946772.685511 900.000137 worker-1-12 46673 145455 32.087587
1534947672.682048 900.000114 worker-1-5 68107 180382 37.757093
1534947672.683025 900.000286 worker-1-6 45761 183027 25.002322
1534947672.685750 900.000053 worker-1-21 50836 422213 12.040368
1534947672.683879 900.000126 worker-1-19 53010 178899 29.631244
1534947672.693643 900.000115 worker-1-14 92038 425392 21.636044
1534947672.682825 900.000084 worker-1-26 55076 176437 31.215675
1534947672.682008 900.000123 worker-1-10 73148 207138 35.313656
1534947672.699475 900.000066 worker-1-16 72461 223957 32.354872
1534947672.684952 900.000063 worker-1-27 47858 167864 28.509984
1534947672.686884 900.000126 worker-1-22 65305 192727 33.884718
1534947672.681973 900.000020 worker-1-2 60511 181325 33.37157
1534947672.682136 900.000176 worker-1-1 109592 280275 39.101597
1534947672.682749 900.000118 worker-1-29 64164 192112 33.399267
1534947672.689756 900.000006 worker-1-17 61667 166246 37.093825
1534947672.683803 900.000072 worker-1-25 56366 464877 12.124928
1534947672.682152 900.000035 worker-1-20 49701 148229 33.529876
1534947672.685826 900.000092 worker-1-30 54071 160228 33.746287
1534947672.684823 900.000074 worker-1-15 60758 204305 29.738871
1534947672.685527 900.000016 worker-1-12 51410 166297 30.914569
1534947672.688722 900.000004 worker-1-18 73693 218226 33.76912
1534947672.682082 900.000061 worker-1-28 62184 198747 31.288019
1534947672.686826 900.000094 worker-1-23 57861 221752 26.092662
1534947672.682903 900.000078 worker-1-8 48482 219779 22.059432
1534947672.685711 900.000045 worker-1-9 53372 172244 30.986275
1534947672.692602 900.000069 worker-1-13 62358 502957 12.398277
1534947672.682167 900.000281 worker-1-11 48767 198101 24.617241
1534947672.705447 900.000050 worker-1-24 55112 186729 29.51443
1534947672.682731 900.000058 worker-1-3 56891 162845 34.935675
1534947672.683487 900.000065 worker-1-4 78602 255868 30.719746
1534947672.681880 900.000052 worker-1-7 51099 541967 9.428434
1534948572.682094 900.000086 worker-1-10 82032 524780 15.631693
1534948572.693667 900.000024 worker-1-14 85369 297217 28.722785
1534948572.682472 900.000499 worker-1-2 53654 221056 24.271678
1534948572.686886 900.000002 worker-1-22 55666 467706 11.901921
1534948572.685008 900.000056 worker-1-27 86916 263647 32.966808
1534948572.682279 900.000127 worker-1-20 89828 256003 35.088651
1534948572.682223 900.000087 worker-1-1 62337 344970 18.070267
1534948572.685750 900.000000 worker-1-21 70389 510644 13.784359
1534948572.684880 900.000057 worker-1-15 67459 206447 32.676183
1534948572.685740 900.000029 worker-1-9 57163 227031 25.1785
1534948572.682752 900.000021 worker-1-3 61958 204039 30.365763
1534948572.682835 900.000010 worker-1-26 54506 196350 27.759613
1534948572.683153 900.000128 worker-1-6 60501 190365 31.781577
1534948572.682183 900.000016 worker-1-11 63835 191625 33.312459
1534948572.682208 900.000126 worker-1-28 91876 284589 32.28375
1534948572.683828 900.000025 worker-1-25 44239 139128 31.797338
1534948572.685880 900.000054 worker-1-30 55616 172434 32.2535
1534948572.689884 900.000128 worker-1-17 69725 178142 39.140124
1534948572.681961 900.000081 worker-1-7 53776 220472 24.391306
1534948572.683937 900.000058 worker-1-19 50184 163270 30.736816
1534948572.685538 900.000011 worker-1-12 60185 260306 23.120865
1534948572.686889 900.000063 worker-1-23 59788 194439 30.748975
1534948572.682908 900.000005 worker-1-8 60904 532647 11.434214
1534948572.692674 900.000072 worker-1-13 67152 216975 30.949188
1534948572.688750 900.000028 worker-1-18 70383 235710 29.859997
1534948572.705484 900.000037 worker-1-24 57008 201189 28.335545
1534948572.682147 900.000099 worker-1-5 61878 194825 31.760811
1534948572.699536 900.000061 worker-1-16 76385 256671 29.759887
1534948572.682829 900.000080 worker-1-29 52464 188150 27.884135
1534948572.683536 900.000049 worker-1-4 110222 314119 35.08925

[root@aosoc current]# broctl netstats
worker-1-1: 1534949053.166850 recvd=813997 dropped=0 link=813997
worker-1-2: 1534949053.366803 recvd=873351 dropped=0 link=873353
worker-1-3: 1534949053.567778 recvd=1770808 dropped=0 link=1770810
worker-1-4: 1534949053.767852 recvd=865443 dropped=0 link=865449
worker-1-5: 1534949053.968873 recvd=349355 dropped=0 link=349361
worker-1-6: 1534949054.168785 recvd=1152160 dropped=0 link=1152161
worker-1-7: 1534949054.368825 recvd=1358553 dropped=0 link=1358553
worker-1-8: 1534949054.569808 recvd=345267 dropped=0 link=345272
worker-1-9: 1534949054.769982 recvd=856725 dropped=0 link=856732
worker-1-10: 1534949054.969811 recvd=351148 dropped=0 link=351148
worker-1-11: 1534949055.170855 recvd=883897 dropped=0 link=883897
worker-1-12: 1534949055.370950 recvd=820117 dropped=0 link=820125
worker-1-13: 1534949055.571899 recvd=1132465 dropped=0 link=1132473
worker-1-14: 1534949055.771751 recvd=823249 dropped=0 link=823249
worker-1-15: 1534949055.972921 recvd=754342 dropped=0 link=754343
worker-1-16: 1534949056.173778 recvd=822102 dropped=0 link=822106
worker-1-17: 1534949056.373806 recvd=570905 dropped=0 link=570911
worker-1-18: 1534949056.573815 recvd=1033845 dropped=0 link=1033846
worker-1-19: 1534949056.774737 recvd=648977 dropped=0 link=649001
worker-1-20: 1534949056.974823 recvd=816836 dropped=0 link=816838
worker-1-21: 1534949057.175858 recvd=423896 dropped=0 link=423901
worker-1-22: 1534949057.375894 recvd=761794 dropped=0 link=761796
worker-1-23: 1534949057.576737 recvd=415151 dropped=0 link=415153
worker-1-24: 1534949057.776887 recvd=604342 dropped=0 link=604349
worker-1-25: 1534949057.978046 recvd=911772 dropped=0 link=911785
worker-1-26: 1534949058.177749 recvd=358386 dropped=0 link=358395
worker-1-27: 1534949058.379062 recvd=1283463 dropped=0 link=1283465
worker-1-28: 1534949058.578751 recvd=364801 dropped=0 link=364807
worker-1-29: 1534949058.778735 recvd=930041 dropped=0 link=930042
worker-1-30: 1534949058.979938 recvd=857963 dropped=0 link=857967

Justin,

  I've got good news and solid progress with your help. BRO is running on both boxes and hasn't crashed since 10pm last night. If I read the NUMA data from my systems correctly, I don't really need to split the load between 2 workers as you did, right?

If you can get another NIC so each box has 2, then you could divide the workers between each NIC and NUMA node. Otherwise it doesn't matter so much.

I'm working on tuning some now and also trying to address the really high lag (500) that I'm still seeing. Currently seeing some loss on it, but will continue to tune and see if I can get that under control. Let me know if you need help testing the doctor script.

Ron

1534948572.682908 900.000005 worker-1-8 60904 532647 11.434214
1534948572.692674 900.000072 worker-1-13 67152 216975 30.949188
1534948572.688750 900.000028 worker-1-18 70383 235710 29.859997
1534948572.705484 900.000037 worker-1-24 57008 201189 28.335545
1534948572.682147 900.000099 worker-1-5 61878 194825 31.760811
1534948572.699536 900.000061 worker-1-16 76385 256671 29.759887
1534948572.682829 900.000080 worker-1-29 52464 188150 27.884135
1534948572.683536 900.000049 worker-1-4 110222 314119 35.08925

[root@aosoc current]# broctl netstats
worker-1-1: 1534949053.166850 recvd=813997 dropped=0 link=813997
worker-1-2: 1534949053.366803 recvd=873351 dropped=0 link=873353
worker-1-3: 1534949053.567778 recvd=1770808 dropped=0 link=1770810
worker-1-4: 1534949053.767852 recvd=865443 dropped=0 link=865449
worker-1-5: 1534949053.968873 recvd=349355 dropped=0 link=349361
worker-1-6: 1534949054.168785 recvd=1152160 dropped=0 link=1152161
worker-1-7: 1534949054.368825 recvd=1358553 dropped=0 link=1358553
worker-1-8: 1534949054.569808 recvd=345267 dropped=0 link=345272
worker-1-9: 1534949054.769982 recvd=856725 dropped=0 link=856732
worker-1-10: 1534949054.969811 recvd=351148 dropped=0 link=351148
worker-1-11: 1534949055.170855 recvd=883897 dropped=0 link=883897
worker-1-12: 1534949055.370950 recvd=820117 dropped=0 link=820125
worker-1-13: 1534949055.571899 recvd=1132465 dropped=0 link=1132473
worker-1-14: 1534949055.771751 recvd=823249 dropped=0 link=823249
worker-1-15: 1534949055.972921 recvd=754342 dropped=0 link=754343
worker-1-16: 1534949056.173778 recvd=822102 dropped=0 link=822106
worker-1-17: 1534949056.373806 recvd=570905 dropped=0 link=570911
worker-1-18: 1534949056.573815 recvd=1033845 dropped=0 link=1033846
worker-1-19: 1534949056.774737 recvd=648977 dropped=0 link=649001
worker-1-20: 1534949056.974823 recvd=816836 dropped=0 link=816838
worker-1-21: 1534949057.175858 recvd=423896 dropped=0 link=423901
worker-1-22: 1534949057.375894 recvd=761794 dropped=0 link=761796
worker-1-23: 1534949057.576737 recvd=415151 dropped=0 link=415153
worker-1-24: 1534949057.776887 recvd=604342 dropped=0 link=604349
worker-1-25: 1534949057.978046 recvd=911772 dropped=0 link=911785
worker-1-26: 1534949058.177749 recvd=358386 dropped=0 link=358395
worker-1-27: 1534949058.379062 recvd=1283463 dropped=0 link=1283465
worker-1-28: 1534949058.578751 recvd=364801 dropped=0 link=364807
worker-1-29: 1534949058.778735 recvd=930041 dropped=0 link=930042
worker-1-30: 1534949058.979938 recvd=857963 dropped=0 link=857967

If you're seeing a high percentage of capture loss but netstats is showing 0 dropped packets, that means one of two things:

* Something still isn't right with the load balancing. It could be that your NIC isn't doing symmetric hashing properly.
* There's an issue with the traffic upstream of bro.

A bunch of the checks that bro-doctor does can help diagnose this, but you'd need to re-enable the conn.log
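
In the meantime, if it helps with the tuning, here is a quick sketch (mine, not a bro tool) that rolls capture_loss.log up per worker, using the tab-separated fields shown above (ts, ts_delta, peer, gaps, acks, percent_lost):

#!/usr/bin/env python
# Rough sketch, not a bro tool: total up gaps vs acks per worker from a
# capture_loss.log in the default tab-separated ASCII format
# (#fields ts ts_delta peer gaps acks percent_lost).
import sys
from collections import defaultdict

gaps = defaultdict(int)
acks = defaultdict(int)

path = sys.argv[1] if len(sys.argv) > 1 else "/usr/local/bro/logs/current/capture_loss.log"
for line in open(path):
    if line.startswith("#"):
        continue
    fields = line.rstrip("\n").split("\t")
    peer = fields[2]
    gaps[peer] += int(fields[3])
    acks[peer] += int(fields[4])

# worst workers first
for peer in sorted(acks, key=lambda p: gaps[p] * 100.0 / max(acks[p], 1), reverse=True):
    pct = gaps[peer] * 100.0 / max(acks[peer], 1)
    print("%-12s %7.3f%% lost (%d gaps / %d acks)" % (peer, pct, gaps[peer], acks[peer]))

Workers that are consistently much worse than the others usually point at the load balancing rather than at upstream packet loss.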

Sorry, forgot to send that, I did re-enable the conn.log.

Ron