No, no custom scripts. I found the cluster overwhelmed again today with massive virtual memory usage, and within an hour of restarting it the same condition returned.
I've been using a single logger since last time. With a Kafka-only export a single logger seems to work fine, as the event timestamps arriving at Kafka (the "ts" for conn, http, etc.) are near realtime.
The logger is using port 47761, the manager 47762. I took a sample of 5000 packets on each port during the high memory usage, and it looks like the manager is still receiving logs of some sort dealing with x509 certificates.
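For reference, each per-port sample was captured with something along these lines (the interface name is only an example; one run per port):
tcpdump -i em0 -c 5000 -w Bro__Manager_port_<port>.pcap 'tcp port <port>'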
tcpdump -A -r Bro__Manager_port_47662.pcap | egrep -io '[A-Za-z_:-]{10,}' | sort | uniq -c | sort -rn | head -10
reading from file Bro__Manager_port_47662.pcap, link-type EN10MB (Ethernet)
2852 certificate
2064 Notice::begin_suppressionA
1800 Conn::IN_ORIG
739 Authentication
685 Identifier
681 Encipherment
660 validation
646 Corporation
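For comparison, the same pipeline can be run against the logger-port capture (filename assumed to follow the same pattern) to check whether the certificate/validation strings are specific to the manager traffic:
tcpdump -A -r Bro__Logger_port_47761.pcap | egrep -io '[A-Za-z_:-]{10,}' | sort | uniq -c | sort -rn | head -10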
Config:
event bro_init()
	{
	Analyzer::disable_analyzer(Analyzer::ANALYZER_SYSLOG);
	}
@load misc/loaded-scripts
@load tuning/defaults
@load misc/stats
@load misc/capture-loss
@load frameworks/software/vulnerable
@load frameworks/software/version-changes
@load-sigs frameworks/signatures/detect-windows-shells
@load protocols/ftp/software
@load protocols/smtp/software
@load protocols/ssh/software
@load protocols/http/software
@load protocols/dns/detect-external-names
@load protocols/ftp/detect
@load protocols/conn/known-hosts
@load protocols/conn/known-services
@load protocols/ssl/known-certs
@load protocols/ssl/validate-certs
@load protocols/ssl/log-hostcerts-only
@load protocols/ssh/geo-data
@load protocols/ssh/detect-bruteforcing
@load protocols/ssh/interesting-hostnames
@load protocols/http/detect-sqli
@load frameworks/files/hash-all-files
@load frameworks/files/detect-MHR
@load frameworks/intel/seen
@load frameworks/intel/do_notice
@load local-intel.bro
@load Bro/Kafka/logs-to-kafka.bro
Snapshots before the first restart:
@ Wed Feb 1 19:57:28 UTC 2017
[bro@mgr /opt/bro]$ bin/broctl top manager logger-1 && echo "" && date
Name Type Host Pid Proc VSize Rss Cpu Cmd
logger-1 logger 10.1.1.1 25469 parent 849M 330M 171% bro
logger-1 logger 10.1.1.1 25523 child 458M 69M 38% bro
manager manager 10.1.1.1 25685 child 494M 261M 100% bro
manager manager 10.1.1.1 25543 parent 9G 1G 27% bro
@ Fri Feb 3 04:05:04 UTC 2017
[bro@mgr /opt/bro]$ bin/broctl top manager logger-1 && echo "" && date
Name Type Host Pid Proc VSize Rss Cpu Cmd
logger-1 logger 10.1.1.1 25469 parent 793M 284M 99% bro
logger-1 logger 10.1.1.1 25523 child 466M 80M 25% bro
manager manager 10.1.1.1 25685 child 494M 261M 100% bro
manager manager 10.1.1.1 25543 parent 9G 1G 35% bro
@ Fri Feb 3 21:15:51 UTC 2017
[bro@mgr /opt/bro]$ bin/broctl top manager logger-1 && echo "" && date
Name Type Host Pid Proc VSize Rss Cpu Cmd
logger-1 logger 10.1.1.1 25469 parent 813M 316M 178% bro
logger-1 logger 10.1.1.1 25523 child 466M 83M 38% bro
manager manager 10.1.1.1 25685 child 8G 8G 100% bro
manager manager 10.1.1.1 25543 parent 1222G 87G 99% bro
last pid: 33713; load averages: 7.35, 6.19, 5.69 up 6+20:45:54 21:32:43
49 processes: 4 running, 45 sleeping
CPU: 8.0% user, 0.3% nice, 5.7% system, 0.2% interrupt, 85.8% idle
Mem: 31G Active, 69G Inact, 20G Wired, 1884K Cache, 5167M Free
ARC: 14G Total, 2327M MFU, 11G MRU, 155K Anon, 58M Header, 555M Other
Swap: 12G Total, 17M Used, 12G Free
PID USERNAME THR PRI NICE SIZE RES STATE C TIME WCPU COMMAND
25469 bro 172 20 0 1089M 591M select 11 118.3H 258.45% bro
25685 bro 1 108 5 9174M 9035M CPU22 22 63.0H 100.00% bro
25543 bro 7 20 0 1275G 93626M uwait 2 21.1H 100.00% bro
25523 bro 1 92 5 466M 85900K CPU3 3 18.0H 51.27% bro
33713 bro 1 78 0 97192K 17284K CPU32 32 0:02 21.09% python2.7
33709 bro 1 52 0 97192K 17300K piperd 31 0:01 2.69% python2.7
I restarted the cluster at 21:42 and saw the same behavior almost immediately, within an hour:
@ Fri Feb 3 22:27:38 UTC 2017
[bro@mgr /opt/bro]$ bin/broctl top manager logger-1 && echo "" && date
Name Type Host Pid Proc VSize Rss Cpu Cmd
logger-1 logger 10.1.1.1 37525 parent 647M 124M 15% bro
logger-1 logger 10.1.1.1 37576 child 458M 65M 2% bro
manager manager 10.1.1.1 37653 child 506M 199M 100% bro
manager manager 10.1.1.1 37600 parent 250G 18G 100% bro
@ Sat Feb 4 00:22:01 UTC 2017
[bro@mgr /opt/bro]$ bin/broctl top manager logger-1 && echo "" && date
Name Type Host Pid Proc VSize Rss Cpu Cmd
logger-1 logger 10.1.1.1 37525 parent 663M 134M 14% bro
logger-1 logger 10.1.1.1 37576 child 458M 65M 2% bro
manager manager 10.1.1.1 37653 child 506M 237M 100% bro
manager manager 10.1.1.1 37600 parent 640G 46G 38% bro
last pid: 75833; load averages: 2.31, 2.47, 2.57 up 6+23:49:38 00:36:27
47 processes: 2 running, 45 sleeping
CPU: 1.2% user, 0.2% nice, 2.4% system, 0.1% interrupt, 96.1% idle
Mem: 20G Active, 26G Inact, 22G Wired, 1752K Cache, 57G Free
ARC: 8159M Total, 2242M MFU, 5323M MRU, 64K Anon, 49M Header, 545M Other
Swap: 12G Total, 16M Used, 12G Free
PID USERNAME THR PRI NICE SIZE RES STATE C TIME WCPU COMMAND
37653 bro 1 108 5 506M 237M CPU33 33 172:00 100.00% bro
37600 bro 7 20 0 640G 47116M uwait 25 143:23 27.88% bro
37525 bro 162 20 0 663M 134M select 45 106:45 12.60% bro
37576 bro 1 25 5 458M 67404K select 6 5:03 1.46% bro
@ Sat Feb 4 00:38:50 UTC 2017
Name Type Host Pid Proc VSize Rss Cpu Cmd
logger-1 logger 10.1.1.1 37525 parent 663M 134M 13% bro
logger-1 logger 10.1.1.1 37576 child 458M 65M 1% bro
manager manager 10.1.1.1 37653 child 506M 237M 99% bro
manager manager 10.1.1.1 37600 parent 640G 46G 37% bro
With a pure Kafka export there are very few logs on the manager filesystem. I don't think this is related to the manager issue, but fwiw there's a steady stream of Kafka errors even though logging seems to be working:
From /opt/bro_data/logs/current/stderr.log → (/opt/bro_data/spool/logger-1)
%3|1486168762.913|ERROR|rdkafka#producer-15| 10.1.1.5:9092/bootstrap: Receive failed: Disconnected
%3|1486168773.922|FAIL|rdkafka#producer-18| 10.1.1.5:9092/bootstrap: Receive failed: Disconnected
%3|1486168773.922|ERROR|rdkafka#producer-18| 10.1.1.5:9092/bootstrap: Receive failed: Disconnected
%3|1486168775.662|FAIL|rdkafka#producer-19| 10.1.1.5:9092/bootstrap: Receive failed: Disconnected
%3|1486168775.663|ERROR|rdkafka#producer-19| 10.1.1.5:9092/bootstrap: Receive failed: Disconnected
%3|1486168778.818|FAIL|rdkafka#producer-21| 10.1.1.5:9092/bootstrap: Receive failed: Disconnected
%3|1486168778.819|ERROR|rdkafka#producer-21| 10.1.1.5:9092/bootstrap: Receive failed: Disconnected
(…)
%3|1486169053.779|ERROR|rdkafka#producer-8| 10.1.1.5:9092/bootstrap: Receive failed: Disconnected
%3|1486169082.658|FAIL|rdkafka#producer-23| 10.1.1.5:9092/bootstrap: Receive failed: Disconnected
%3|1486169082.659|ERROR|rdkafka#producer-23| 10.1.1.5:9092/bootstrap: Receive failed: Disconnected
[bro@mgr /opt/bro_data/logs]$ date -r 1486169082
Sat Feb 4 00:44:42 UTC 2017
[bro@mgr /opt/bro_data/logs]$ date
Sat Feb 4 00:45:08 UTC 2017
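If it's useful to line the whole error stream up against the memory growth, the epoch in each rdkafka line can be converted in bulk with the same date -r (stderr.log path as above):
grep -oE '^%3[|][0-9]+' /opt/bro_data/logs/current/stderr.log | cut -d'|' -f2 | sort -un | while read ts; do date -r "$ts"; done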