Logger Child Memory Leak (logger crashing often)

Hey All,

Running Bro 2.5 on a single server with 20 cores and some 240 GB of memory.

node.cfg specifies 14 workers, 2 proxies, 1 manager, and 1 logger process.

We are running a custom build of Bro with tcmalloc and PF_RING enabled.

I’m working to get my Bro cluster stable. As it stands, the logger process often crashes, causing us to lose a period of log files. Looking at the output of broctl top, it seems likely that the system is killing the Bro logger process because of the amount of memory it is consuming.

==== stderr.log

listening on p5p2

1484325490.230681 received termination signal

broctl top

Name Type Host Pid Proc VSize Rss Cpu Cmd

logger logger localhost 47880 parent 4G 3G 82% bro

logger logger localhost 47902 child 38G 37G 13% bro

As I’ve been writing this email I have watched the logger process’s memory utilization slowly climb from 16% to 17% (broctl top now indicates 41G of memory used by the logger child).
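
If it really is the kernel's OOM killer stepping in, something like the following should confirm it from the kernel log (a sketch, assuming a Linux host; the exact message wording varies by kernel version):

    dmesg | grep -i -E 'out of memory|killed process'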

I’ve been investigating whether the bottleneck goes back to our storage solution, which is just a bunch of disks. Based on the utilization reported by iostat and iotop, the Bro logger process is writing around 4 MB/s to our disks, which seems reasonable and does not indicate a bottleneck to me.
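
The checks were along these lines (approximate flags; device names will differ on other systems):

    iostat -xm 5          # per-device utilization and throughput in MB/s, every 5 seconds
    iotop -o -b -n 3      # batch mode, only processes actually doing I/O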

Aside: there is a tangential problem in that we are currently seeing a very high drop rate reported by netstats:

broctl netstats

worker-1-1: 1484334410.387744 recvd=1087209145 dropped=3525526435 link=317784575

worker-1-10: 1484334410.711691 recvd=2916765681 dropped=1696150851 link=317965517
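
I still need to compare those numbers against PF_RING's own counters; assuming the kernel module exposes its usual files under /proc/net/pf_ring (our module version may differ), something like this should show the per-ring drop counts:

    cat /proc/net/pf_ring/info
    # per-ring stats files are named <pid>-<interface>.<ring id>
    grep -H -i 'lost' /proc/net/pf_ring/*-p5p2.*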

Thanks for any insights or suggestions!

-Ryan

Most likely this isn't a leak, but rather that the logger process isn't able to process the data fast enough.

What model CPUs does this server have? Can you show what this command outputs after bro has been running for a bit:

    top -b -n 1 -H -o TIME | fgrep bro: | head -n 20

The last column will be truncated; don't worry about that.

Hey Justin,

Here are the results of running top (the "-o TIME" option was unavailable on my system):

# top -b -n 1 -H | fgrep bro: | head -n 20
48059 root 20 0 4496m 4.0g 260m S 27.7 1.7 38:36.49 bro: conn/Log::
47908 root 20 0 4496m 4.0g 260m S 5.5 1.7 8:29.99 bro: weird/Log:
47911 root 20 0 4496m 4.0g 260m S 5.5 1.7 8:32.29 bro: dns/Log::W
47907 root 20 0 4496m 4.0g 260m S 1.8 1.7 4:36.68 bro: syslog/Log
10331 root 20 0 19.1g 15g 257m S 0.0 6.7 0:06.92 bro: /opt/bro/f
29622 root 20 0 134m 59m 5396 S 0.0 0.0 0:00.10 bro: loaded_scr
29634 root 20 0 593m 520m 5444 S 0.0 0.2 0:00.12 bro: loaded_scr
29624 root 20 0 175m 59m 5476 S 0.0 0.0 0:00.26 bro: /opt/bro/f
29632 root 20 0 175m 59m 5476 S 0.0 0.0 0:00.19 bro: loaded_scr
47901 root 20 0 4496m 4.0g 260m S 0.0 1.7 0:00.15 bro: packet_fil
47903 root 20 0 4496m 4.0g 260m S 0.0 1.7 0:00.07 bro: loaded_scr
47904 root 20 0 4496m 4.0g 260m S 0.0 1.7 0:00.06 bro: reporter/L
47905 root 20 0 4496m 4.0g 260m S 0.0 1.7 0:00.26 bro: communicat
47906 root 20 0 4496m 4.0g 260m S 0.0 1.7 0:00.08 bro: stats/Log:
47909 root 20 0 4496m 4.0g 260m S 0.0 1.7 0:00.76 bro: known_serv
47910 root 20 0 4496m 4.0g 260m S 0.0 1.7 0:00.30 bro: known_host
47912 root 20 0 4496m 4.0g 260m S 0.0 1.7 0:00.88 bro: software/L
47913 root 20 0 4496m 4.0g 260m S 0.0 1.7 0:00.07 bro: dce_rpc/Lo
47914 root 20 0 4496m 4.0g 260m S 0.0 1.7 0:03.22 bro: files/Log:
47915 root 20 0 4496m 4.0g 260m S 0.0 1.7 0:10.19 bro: http/Log::

For some more information -- the processors we are running are the following:
# lstopo -v | grep Socket
    Socket L#0 (P#0 CPUModel="Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz" CPUType=x86_64)
    Socket L#1 (P#1 CPUModel="Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz" CPUType=x86_64)

Hyper threading is disabled on this server.

Thanks!
-Ryan

> Hey Justin,
>
> Here are the results of running top (the "-o TIME" option was unavailable on my system):

Ah, it looks like the default is close enough.

> # top -b -n 1 -H | fgrep bro: | head -n 20
> 48059 root 20 0 4496m 4.0g 260m S 27.7 1.7 38:36.49 bro: conn/Log::
> 47908 root 20 0 4496m 4.0g 260m S 5.5 1.7 8:29.99 bro: weird/Log:

This shows that most of your time is spent writing the conn.log and the weird.log.

Does your conn.log look normal? The main thing to check when using pf_ring is whether traffic is actually being load balanced properly. If you make a single TCP connection, does it get logged to the conn.log once, or 14 times?
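
One rough way to check (hypothetical addresses; conn.log entries only show up after the connection ends and the writer flushes):

    # from a host whose traffic the sensor sees; replace the placeholders
    curl -s http://<some-external-site>/ > /dev/null
    # once the connection has ended and the log has flushed:
    grep '<test-host-ip>' /bro/logs/current/conn.log | grep '<destination-ip>' | wc -l

If the load balancing is broken, that count tends to come back as the number of workers rather than 1.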

What does this command output:

    cat /bro/logs/current/conn.log |bro-cut history|sort|uniq -c|sort -rn|head -n 50

The weird.log shouldn't be very large; what does this output?

    cat /bro/logs/current/weird.log|bro-cut name|sort|uniq -c|sort -rn

> For some more information -- the processors we are running are the following:
> # lstopo -v | grep Socket
>     Socket L#0 (P#0 CPUModel="Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz" CPUType=x86_64)
>     Socket L#1 (P#1 CPUModel="Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz" CPUType=x86_64)
>
> Hyper threading is disabled on this server.

Well, those are good CPUs; things should be keeping up a bit better than they are.

Are you using the pin_cpus setting in your node.cfg?

Thank you for the response, Justin!
I will continue investigating whether PF_RING is misconfigured.
Last week I experimented with CPU pinning, with no change in the logger crashing behavior. The current configuration (sketched as node.cfg entries below) is:
Manager: pinned to 0
Logger: pinned to 1
Workers: pinned to 2-15
Proxies: unpinned
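
For reference, that pinning corresponds to node.cfg entries roughly like the following (interface name and worker count taken from earlier in the thread; a sketch, not our exact file):

    [manager]
    type=manager
    host=localhost
    pin_cpus=0

    [logger]
    type=logger
    host=localhost
    pin_cpus=1

    [proxy-1]
    type=proxy
    host=localhost

    [proxy-2]
    type=proxy
    host=localhost

    [worker-1]
    type=worker
    host=localhost
    interface=p5p2
    lb_method=pf_ring
    lb_procs=14
    pin_cpus=2,3,4,5,6,7,8,9,10,11,12,13,14,15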

cat /home/bro/logs/current/conn.log |bro-cut history|sort|uniq -c|sort -rn|head -n 50

1691591 D
1393495 ^d
988587 A
587617 S
490552 ^a
207430 -
205059 Dd
193311 ^f
183811 ^h
180860 F
163651 ^dA
130203 Ad
114927 DA
114207 ^dD
111956 AD
98371 Da
82048 R
69047 ^c
68888 Aa
56273 ^ad
54982 ^aA
49199 ^aD
49005 ^da
41245 Adc
40925 ^dAc
38275 ^r
36261 SA
35243 Acd
32155 ^dcA
31712 ^hA
30753 DdA
30186 ^cA
28592 AF
27648 SD
27378 Ac
26598 ^df
23810 DF
23686 ^cAd
23194 Af
22530 ADd
22168 ^cdA
21580 ^hd
21373 ^hD
20249 Sd
19126 ^dAD
19051 Df
18797 ^cd
17837 Sa
17813 ^ha
17813 ^dc

cat /home/bro/logs/current/weird.log|bro-cut name|sort|uniq -c|sort -rn

297534 dns_unmatched_msg
159574 dns_unmatched_reply
85718 above_hole_data_without_any_acks
51007 truncated_tcp_payload
31704 TCP_ack_underflow_or_misorder
28605 possible_split_routing
18759 data_before_established
10649 TCP_seq_underflow_or_misorder
4952 bad_TCP_checksum
1915 active_connection_reuse
1729 line_terminated_with_single_CR
1535 inappropriate_FIN
1149 DNS_RR_unknown_type
1042 excessive_data_without_further_acks
569 SYN_seq_jump
511 unknown_dce_rpc_auth_type_68
333 window_recision
313 SYN_inside_connection
313 DNS_RR_length_mismatch
215 bad_HTTP_request
203 DNS_Conn_count_too_large
173 connection_originator_SYN_ack
132 NUL_in_line
123 binpac exception: out_of_bound: SSLRecord:length: 13 > 2
57 FIN_advanced_last_seq
51 bad_UDP_checksum
33 SYN_after_close
32 data_after_reset
31 binpac exception: out_of_bound: Syslog_Priority:lt: 1 > 0
30 DNS_truncated_len_lt_hdr_len
29 ssl_early_application_data
21 DNS_truncated_RR_rdlength_lt_len
20 SYN_with_data
18 non_ip_packet_in_encap
12 unknown_HTTP_method
10 dnp3_corrupt_header_checksum
8 SYN_after_reset
7 bad_SYN_ack
6 empty_http_request
5 Teredo_bubble_with_payload
5 binpac exception: out_of_bound: SSLRecord:rec: 61752 > 39
(Abbreviated; see attached for full output)

weird_log_uniq (20.1 KB)