Logging and memory leak

Noticed this today, and I believe it’s related to a recent cluster crash: the virtual memory size of the Bro manager keeps growing slowly (56G right now) while its resident set size stays around 4G, which points to a memory leak. The system is FreeBSD.

We are using Kafka-only output with local file logging disabled and a modified number of loggers. A single logger works fine for the most part, but I’m running 4 here for testing after the recent cluster crash.
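
For reference, the Kafka-only setup amounts to something like the following sketch (not my exact config; the redef names are what I recall the metron logs-to-kafka plugin exporting, and may differ by plugin version):

    # Load the Kafka writer plugin (same script as in the config further down).
    @load Bro/Kafka/logs-to-kafka.bro

    # Assumed exports from the metron plugin; check its README for exact names.
    redef Kafka::topic_name = "bro";
    redef Kafka::logs_to_send = set(Conn::LOG, HTTP::LOG, DNS::LOG, SSL::LOG, Notice::LOG);
    redef Kafka::kafka_conf = table(
        ["metadata.broker.list"] = "10.1.1.5:9092"
    );

    # Hypothetical knob for dropping the default local ASCII files entirely;
    # whether Log::default_writer can point at the Kafka writer depends on the plugin.
    # redef Log::default_writer = Log::WRITER_KAFKAWRITER;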

Regarding the crash: the cluster was running fine, then active memory spiked sharply until it hit the ceiling and started consuming swap. I suspect the manager’s VSize hit some limit and triggered the failure.

Name Type Host Pid Proc VSize Rss Cpu Cmd
logger-1 logger 10.1.1.1 85959 parent 176M 67M 6% bro
logger-1 logger 10.1.1.1 86115 child 194M 62M 0% bro
logger-2 logger 10.1.1.1 85962 parent 709M 175M 19% bro
logger-2 logger 10.1.1.1 86017 child 194M 65M 1% bro
logger-3 logger 10.1.1.1 85965 parent 663M 164M 13% bro
logger-3 logger 10.1.1.1 86114 child 202M 71M 0% bro
logger-4 logger 10.1.1.1 85967 parent 663M 157M 17% bro
logger-4 logger 10.1.1.1 86113 child 194M 63M 0% bro
manager manager 10.1.1.1 86204 child 878M 649M 100% bro
manager manager 10.1.1.1 86109 parent 56G 4G 16% bro

last pid: 1482; load averages: 4.73, 4.73, 4.58 up 3+23:25:56 00:12:45
52 processes: 2 running, 50 sleeping
CPU: 3.4% user, 0.3% nice, 3.3% system, 0.0% interrupt, 92.9% idle
Mem: 538M Active, 6674M Inact, 17G Wired, 26M Cache, 101G Free
ARC: 14G Total, 2325M MFU, 11G MRU, 144K Anon, 57M Header, 548M Other
Swap: 12G Total, 17M Used, 12G Free

PID USERNAME THR PRI NICE SIZE RES STATE C TIME WCPU COMMAND
86204 bro 1 108 5 878M 649M CPU27 27 26.0H 100.00% bro
85962 bro 172 20 0 709M 175M select 25 22.4H 25.63% bro
86109 bro 7 20 0 57682M 4879M uwait 13 499:16 22.17% bro
85967 bro 162 20 0 663M 157M select 13 19.7H 20.80% bro
85965 bro 162 20 0 663M 164M select 23 18.2H 16.31% bro
85959 bro 21 20 0 176M 69352K select 40 193:32 6.30% bro
86017 bro 1 25 5 194M 66772K select 36 28:18 1.27% bro
86113 bro 1 25 5 194M 65068K select 42 23:28 0.98% bro
86114 bro 1 25 5 202M 73308K select 38 13:43 0.39% bro

Are you loading misc/scan or misc/detect-traceroute?

Oh! You got multiple loggers working? I thought you had the broctl changes in place before, but you were seeing those weird issues.. did you figure out what was causing them?

No, both are disabled.

Not really, I was going to reply to the old thread regarding this.

I’m in the process of switching back to a single logger and considering Kafka export from each worker directly. Right now the logs are backlogged by about 20 minutes, which I suspect is the bottleneck. With Kafka-only export it took 24+ hours to reach that state, whereas previously with file-based logging the logs would fall 20 minutes behind within an hour.

I’m sure the weird issues still exist with multiple loggers; I just can’t see them as easily right now because my Logstash parser doesn’t handle them yet. The priority has been to get the cluster stable; after that I’ll have time to work on optimization. With the current configuration it looks like a cluster restart once per day is going to be required.

I’m also about to add another 44 workers to resolve the 9-17% packet loss per worker during peak. I expect per-worker export to have its own set of challenges, so my time may be better spent reworking the logger node for high volume. Right now I don’t know how to reconfigure Bro to export to Kafka from the workers directly; I have to read more of the source.

Do you have any other custom scripts loaded that are using sumstats?

With a dedicated logger process the manager doesn't really do anything other than sumstats.

Look in your cluster-layout.bro to see what port your manager process is assigned.. with 4 loggers I'd imagine it is around 47765/tcp
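
(For anyone following along, cluster-layout.bro is generated by broctl and its node entries look roughly like the sketch below; the ports and fields here are illustrative only, not taken from this cluster:)

    redef Cluster::nodes = {
        ["manager"]  = [$node_type=Cluster::MANAGER, $ip=10.1.1.1, $p=47762/tcp],
        ["logger-1"] = [$node_type=Cluster::LOGGER,  $ip=10.1.1.1, $p=47761/tcp],
        # ...plus one entry per proxy and worker...
    };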

Then, run this command on the manager, on the interface that it talks to workers:

    tcpdump -n -i em1 port 47765 -A | egrep -io '[A-Za-z_:-]{10,}'

That will output the names of the events that are bouncing between the workers and the manager.

And see what you see.. It SHOULD be almost nothing, maybe a trickle of events.

No, no custom scripts. I found the cluster overwhelmed again today with massive virtual memory usage, and within an hour of restarting it the same condition returned.

I’ve been using a single logger since last time. With Kafka-only export a single logger seems to work fine, as the event timestamps arriving at Kafka are near-realtime (the “ts” for conn, http, etc.).

The logger is using 47761, manager is 47762. I took a sample of 5000 packets for each during the high memory usage and it looks like the manager is still receiving logs of some sort dealing with x509 certificates.

tcpdump -A -r Bro__Manager_port_47662.pcap | egrep -io '[A-Za-z_:-]{10,}' | sort | uniq -c | sort -rn | head -10
reading from file Bro__Manager_port_47662.pcap, link-type EN10MB (Ethernet)
2852 certificate
2064 Notice::begin_suppressionA
1800 Conn::IN_ORIG
739 Authentication
685 Identifier
681 Encipherment
660 validation
646 Corporation

Config:

event bro_init()
    {
    Analyzer::disable_analyzer(Analyzer::ANALYZER_SYSLOG);
    }
@load misc/loaded-scripts
@load tuning/defaults
@load misc/stats
@load misc/capture-loss
@load frameworks/software/vulnerable
@load frameworks/software/version-changes
@load-sigs frameworks/signatures/detect-windows-shells
@load protocols/ftp/software
@load protocols/smtp/software
@load protocols/ssh/software
@load protocols/http/software
@load protocols/dns/detect-external-names
@load protocols/ftp/detect
@load protocols/conn/known-hosts
@load protocols/conn/known-services
@load protocols/ssl/known-certs
@load protocols/ssl/validate-certs
@load protocols/ssl/log-hostcerts-only
@load protocols/ssh/geo-data
@load protocols/ssh/detect-bruteforcing
@load protocols/ssh/interesting-hostnames
@load protocols/http/detect-sqli
@load frameworks/files/hash-all-files
@load frameworks/files/detect-MHR
@load frameworks/intel/seen
@load frameworks/intel/do_notice
@load local-intel.bro
@load Bro/Kafka/logs-to-kafka.bro

Snapshots before the first restart:

@ Wed Feb 1 19:57:28 UTC 2017

[bro@mgr /opt/bro]$ bin/broctl top manager logger-1 && echo "" && date
Name Type Host Pid Proc VSize Rss Cpu Cmd
logger-1 logger 10.1.1.1 25469 parent 849M 330M 171% bro
logger-1 logger 10.1.1.1 25523 child 458M 69M 38% bro
manager manager 10.1.1.1 25685 child 494M 261M 100% bro
manager manager 10.1.1.1 25543 parent 9G 1G 27% bro

@ Fri Feb 3 04:05:04 UTC 2017

[bro@mgr /opt/bro]$ bin/broctl top manager logger-1 && echo "" && date
Name Type Host Pid Proc VSize Rss Cpu Cmd
logger-1 logger 10.1.1.1 25469 parent 793M 284M 99% bro
logger-1 logger 10.1.1.1 25523 child 466M 80M 25% bro
manager manager 10.1.1.1 25685 child 494M 261M 100% bro
manager manager 10.1.1.1 25543 parent 9G 1G 35% bro

@ Fri Feb 3 21:15:51 UTC 2017

[bro@mgr /opt/bro]$ bin/broctl top manager logger-1 && echo "" && date
Name Type Host Pid Proc VSize Rss Cpu Cmd
logger-1 logger 10.1.1.1 25469 parent 813M 316M 178% bro
logger-1 logger 10.1.1.1 25523 child 466M 83M 38% bro
manager manager 10.1.1.1 25685 child 8G 8G 100% bro
manager manager 10.1.1.1 25543 parent 1222G 87G 99% bro

last pid: 33713; load averages: 7.35, 6.19, 5.69 up 6+20:45:54 21:32:43
49 processes: 4 running, 45 sleeping
CPU: 8.0% user, 0.3% nice, 5.7% system, 0.2% interrupt, 85.8% idle
Mem: 31G Active, 69G Inact, 20G Wired, 1884K Cache, 5167M Free
ARC: 14G Total, 2327M MFU, 11G MRU, 155K Anon, 58M Header, 555M Other
Swap: 12G Total, 17M Used, 12G Free

PID USERNAME THR PRI NICE SIZE RES STATE C TIME WCPU COMMAND
25469 bro 172 20 0 1089M 591M select 11 118.3H 258.45% bro
25685 bro 1 108 5 9174M 9035M CPU22 22 63.0H 100.00% bro
25543 bro 7 20 0 1275G 93626M uwait 2 21.1H 100.00% bro
25523 bro 1 92 5 466M 85900K CPU3 3 18.0H 51.27% bro
33713 bro 1 78 0 97192K 17284K CPU32 32 0:02 21.09% python2.7
33709 bro 1 52 0 97192K 17300K piperd 31 0:01 2.69% python2.7

I restarted the cluster at 21:42 and see the same behavior almost immediately, within an hour:

@ Fri Feb 3 22:27:38 UTC 2017

[bro@mgr /opt/bro]$ bin/broctl top manager logger-1 && echo "" && date
Name Type Host Pid Proc VSize Rss Cpu Cmd
logger-1 logger 10.1.1.1 37525 parent 647M 124M 15% bro
logger-1 logger 10.1.1.1 37576 child 458M 65M 2% bro
manager manager 10.1.1.1 37653 child 506M 199M 100% bro
manager manager 10.1.1.1 37600 parent 250G 18G 100% bro

@ Sat Feb 4 00:22:01 UTC 2017

[bro@mgr /opt/bro]$ bin/broctl top manager logger-1 && echo "" && date
Name Type Host Pid Proc VSize Rss Cpu Cmd
logger-1 logger 10.1.1.1 37525 parent 663M 134M 14% bro
logger-1 logger 10.1.1.1 37576 child 458M 65M 2% bro
manager manager 10.1.1.1 37653 child 506M 237M 100% bro
manager manager 10.1.1.1 37600 parent 640G 46G 38% bro

last pid: 75833; load averages: 2.31, 2.47, 2.57 up 6+23:49:38 00:36:27
47 processes: 2 running, 45 sleeping
CPU: 1.2% user, 0.2% nice, 2.4% system, 0.1% interrupt, 96.1% idle
Mem: 20G Active, 26G Inact, 22G Wired, 1752K Cache, 57G Free
ARC: 8159M Total, 2242M MFU, 5323M MRU, 64K Anon, 49M Header, 545M Other
Swap: 12G Total, 16M Used, 12G Free

PID USERNAME THR PRI NICE SIZE RES STATE C TIME WCPU COMMAND
37653 bro 1 108 5 506M 237M CPU33 33 172:00 100.00% bro
37600 bro 7 20 0 640G 47116M uwait 25 143:23 27.88% bro
37525 bro 162 20 0 663M 134M select 45 106:45 12.60% bro
37576 bro 1 25 5 458M 67404K select 6 5:03 1.46% bro

@ Sat Feb 4 00:38:50 UTC 2017

Name Type Host Pid Proc VSize Rss Cpu Cmd
logger-1 logger 10.1.1.1 37525 parent 663M 134M 13% bro
logger-1 logger 10.1.1.1 37576 child 458M 65M 1% bro
manager manager 10.1.1.1 37653 child 506M 237M 99% bro
manager manager 10.1.1.1 37600 parent 640G 46G 37% bro

With pure Kafka export there are very few logs on the manager filesystem. I don’t think this is related to the manager issue, but FWIW there’s a steady stream of Kafka errors even though logging seems to be working:

From /opt/bro_data/logs/current/stderr.log → (/opt/bro_data/spool/logger-1)

%3|1486168762.913|ERROR|rdkafka#producer-15| 10.1.1.5:9092/bootstrap: Receive failed: Disconnected
%3|1486168773.922|FAIL|rdkafka#producer-18| 10.1.1.5:9092/bootstrap: Receive failed: Disconnected
%3|1486168773.922|ERROR|rdkafka#producer-18| 10.1.1.5:9092/bootstrap: Receive failed: Disconnected
%3|1486168775.662|FAIL|rdkafka#producer-19| 10.1.1.5:9092/bootstrap: Receive failed: Disconnected
%3|1486168775.663|ERROR|rdkafka#producer-19| 10.1.1.5:9092/bootstrap: Receive failed: Disconnected
%3|1486168778.818|FAIL|rdkafka#producer-21| 10.1.1.5:9092/bootstrap: Receive failed: Disconnected
%3|1486168778.819|ERROR|rdkafka#producer-21| 10.1.1.5:9092/bootstrap: Receive failed: Disconnected
(…)
%3|1486169053.779|ERROR|rdkafka#producer-8| 10.1.1.5:9092/bootstrap: Receive failed: Disconnected
%3|1486169082.658|FAIL|rdkafka#producer-23| 10.1.1.5:9092/bootstrap: Receive failed: Disconnected
%3|1486169082.659|ERROR|rdkafka#producer-23| 10.1.1.5:9092/bootstrap: Receive failed: Disconnected

[bro@mgr /opt/bro_data/logs]$ date -r 1486169082
Sat Feb 4 00:44:42 UTC 2017

[bro@mgr /opt/bro_data/logs]$ date
Sat Feb 4 00:45:08 UTC 2017
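
(Side note on those rdkafka lines: "Receive failed: Disconnected" is usually just the broker closing idle connections rather than data being dropped. If that's what's happening here, librdkafka's log.connection.close property should quiet it; passing it through Kafka::kafka_conf is an assumption based on the metron plugin's config table:)

    redef Kafka::kafka_conf = table(
        ["metadata.broker.list"] = "10.1.1.5:9092",
        # librdkafka option: don't log broker-initiated closes of idle connections
        ["log.connection.close"] = "false"
    );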

What timeframe was that pcap for? If the pcap was from an hour or so, it's probably nothing.. but if that was from a few seconds you could have a problem there.

The 2000+ things are definitely related to SSL, as are the other strings in there.. if you look at the raw tcpdump output they make more sense in their normal order.. The numbers are a little inflated because when the manager sends out something like the Notice::begin_suppression event, it has to send it once to each worker (which is also something that needs to be addressed for bro to scale better). ~2064 events would have been sent out for only ~14 notice events if you had 150 workers.

What does your notice.log contain related to SSL? Do you have a TON of notices for Invalid_Server_Cert or something like it? Is your known_certs.log file growing rapidly?

Using a larger cutoff for head may have shown SSL::Invalid_Server_Cert.

The 1800 Conn::IN_ORIG are from workers -> manager from policy/frameworks/intel/seen/conn-established.bro
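
(From memory, that script boils down to something like the handler below, which is why every established connection on every worker can produce Intel::seen activity; this is paraphrased, so check the shipped source:)

    event connection_established(c: connection)
        {
        Intel::seen([$host=c$id$orig_h, $conn=c, $where=Conn::IN_ORIG]);
        Intel::seen([$host=c$id$resp_h, $conn=c, $where=Conn::IN_RESP]);
        }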

One thing you could try is commenting out anything in your config related to ssl or intel, and see if that's stable. That would help narrow down what the problem is.
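
In the config you posted, that would mean commenting out roughly this set of lines:

    # @load protocols/ssl/known-certs
    # @load protocols/ssl/validate-certs
    # @load protocols/ssl/log-hostcerts-only
    # @load frameworks/intel/seen
    # @load frameworks/intel/do_notice
    # @load local-intel.bro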

In general, the manager just isn't doing much anymore, so for it to be using that much RAM that fast, it would have to be doing something extremely frequently. That's why knowing the timeframe is really important :-)

If your cluster is doing something like generating Invalid_Server_Cert notices at an extremely high rate, then it's possible that the manager parent is trying to tell all the workers about it and the manager child is not able to keep up. That kind of fits with this output:

manager manager 10.1.1.1 37653 child 506M 237M 100% bro
manager manager 10.1.1.1 37600 parent 640G 46G 38% bro

What timeframe was that pcap for? If the pcap was from an hour or so, it’s probably nothing… but if that was from a few seconds you could have a problem there.

The pcap covers 9 seconds during the problem timeframe, at the peak of system usage.

The 2000+ things are definitely related to SSL, as are the other strings in there… if you look at the raw tcpdump output they make more sense in their normal order… The numbers are a little inflated because when the manager sends out something like the Notice::begin_suppression event, it has to send it once to each worker (which is also something that needs to be addressed for bro to scale better). ~2064 events would have been sent out for only ~14 notice events if you had 150 workers.

Good to know. 132 workers.
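(That fan-out roughly matches here: 2064 begin_suppression strings / 132 workers ≈ 15-16 notices being suppressed during the 9-second capture.)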

What does your notice.log contain related to SSL? Do you have a TON of notices for Invalid_Server_Cert or something like it? Is your known_certs.log file growing rapidly?

Not sure at the moment.

Using a larger cutoff for head may have shown SSL::Invalid_Server_Cert.

The 1800 Conn::IN_ORIG are from workers → manager from policy/frameworks/intel/seen/conn-established.bro

It did not, but I think that type of message would have been sent to the logger on a different port.

2064 Notice::begin_suppression
1800 Conn::IN_ORIG
396 Notice::cluster_notice
26 SumStats::cluster_key_intermediate_response
1 Intel::match_no_items
1 Conn::Info

One thing you could try is commenting out anything in your config related to ssl or intel, and see if that’s stable. That would help narrow down what the problem is.

In general, the manager just isn’t doing much anymore, so for it to be using that much RAM that fast, it would have to be doing something extremely frequently. That’s why knowing the timeframe is really important :-)

If your cluster is doing something like generating Invalid_Server_Cert notices at an extremely high rate, then it’s possible that the manager parent is trying to tell all the workers about it and the manager child is not able to keep up. That kind of fits with this output:

I’ll try that. It seems like I’ll be able to narrow down the issue this week. There’s a weekly pattern to the failure, starting late Thursday and continuing most of the day Friday, so I’m guessing either a researcher or an automated scan is the cause.

…The saga continues…

Disabling protocols/ssl/validate-certs relieved the rapid memory growth, but the problem still shows up over longer periods of time.

I implemented the change on Feb 6th, and by Feb 18th the manager’s VSize / RSS was 32G / 2G. By Feb 23rd it was 337G / 24G.
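(Assuming a restart when the change went in, that is roughly 2-3G of VSize growth per day over the first 12 days, then around 60G per day over the following five, so the growth accelerates rather than staying linear.)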

FWIW, I’m considering migrating the cluster to Intel CPUs, but they’re the same speed, so I’m not sure it’s worth the trouble since the worker count will remain the same.