Worker OOM crashes

Has anyone experienced this crash scenario that I’m in the process of debugging? This just started in the last couple of days on a cluster that has been in production for just shy of two years without issue.

tcmalloc: large alloc 18446744072956297216 bytes == (nil) @ 0x7f08e85b5dcb 0x7f08e85b5f1b 0x7f08e85b6965 0x7f08e85e89c5 0x7f1fff 0x867016 0x86781e 0x7f1a96 0x7f1db6 0x7f121a 0x7efbbe 0x7ed334 0x866eb5 0x56441c 0x600327 0x60124e 0x5cfa14 0x837db3 0x5cfecf 0x52ecb0 0x7f08e759fb45 0x5373ad (nil) out of memory in new.
1477631586.955885 fatal error: out of memory in new.

The server itself still has some ceiling height:

$ free -h
total used free shared buffers cached
Mem: 126G 23G 102G 16M 67M 5.1G
-/+ buffers/cache: 18G 107G
Swap: 33G 0B 33G

But I’m wondering if the processes themselves are hitting a maximum size for memory allocation, as some are nearing 4GB.

PID USER PRI NI VIRT RES SHR S CPU% MEM% TIME+ Command
24925 bro 20 0 3964M 3262M 269M S 78.1 2.5 2h47:55 /opt/bro/bin/bro -i eth6 -U .status -p broctl -p broctl-live -p local -p WIN_EXT-5 loc
618 bro 20 0 1217M 1160M 269M S 77.0 0.9 1h57:12 /opt/bro/bin/bro -i eth6 -U .status -p broctl -p broctl-live -p local -p WIN_EXT-2 loc
26073 bro 20 0 3989M 3156M 269M S 76.4 2.4 2h41:38 /opt/bro/bin/bro -i eth6 -U .status -p broctl -p broctl-live -p local -p WIN_EXT-7 loc
23938 bro 20 0 2610M 2318M 269M R 75.9 1.8 2h45:00 /opt/bro/bin/bro -i eth6 -U .status -p broctl -p broctl-live -p local -p WIN_EXT-9 loc
27816 bro 20 0 1554M 1455M 269M S 74.8 1.1 2h37:40 /opt/bro/bin/bro -i eth6 -U .status -p broctl -p broctl-live -p local -p WIN_EXT-1 loc
27817 bro 20 0 1532M 1435M 269M S 72.1 1.1 2h33:42 /opt/bro/bin/bro -i eth6 -U .status -p broctl -p broctl-live -p local -p WIN_EXT-4 loc
26677 bro 20 0 1421M 1372M 269M R 70.5 1.1 2h40:42 /opt/bro/bin/bro -i eth6 -U .status -p broctl -p broctl-live -p local -p WIN_EXT-6 loc
622 bro 20 0 1165M 1108M 269M R 70.5 0.9 1h54:24 /opt/bro/bin/bro -i eth6 -U .status -p broctl -p broctl-live -p local -p WIN_EXT-8 loc
613 bro 20 0 1231M 1166M 269M R 69.9 0.9 1h55:45 /opt/bro/bin/bro -i eth6 -U .status -p broctl -p broctl-live -p local -p WIN_EXT-10 lo
621 bro 20 0 1242M 1177M 269M R 68.9 0.9 1h53:02 /opt/bro/bin/bro -i eth6 -U .status -p broctl -p broctl-live -p local -p WIN_EXT-3 loc
1175 bro 20 0 234M 178M 10776 S 2.2 0.1 2h42:07 /opt/bro/bin/bro -U .status -p broctl -p broctl-live -p local -p WIN_EXT_PXY_1 local.b

-Dave

Yes.. there seems to be an issue where something goes wrong and tries to allocate too much. I don't think it has anything to do with 4G

Can you apply this patch and see if the next time it happens you get a core dump + backtrace?

diff --git a/src/util.cc b/src/util.cc
index acfcb19..661f8ac 100644
--- a/src/util.cc
+++ b/src/util.cc
@@ -1624,7 +1624,7 @@ extern "C" void out_of_memory(const char* where)

        if ( reporter )
                // Guess that might fail here if memory is really tight ...
- reporter->FatalError("out of memory in %s.\n", where);
+ reporter->FatalErrorWithCore("out of memory in %s.\n", where);

        abort();
        }

I worked with Dave on this a bit.. had no luck getting bro to dump a core, but then I realized that the tcmalloc log line WAS the stack trace. Still working on getting a real core dump, but this is a start.

18446744072956297216 is FFFFFFFFD31A4000 in hex

stderr.log:tcmalloc: large alloc 18446744072956297216 bytes == (nil) @ 0x7fc9e487bdcb 0x7fc9e487bf1b 0x7fc9e487c965 0x7fc9e48ae9c5 0x7f1fef 0x867006 0x86780e 0x7f1a86 0x7f1da6 0x7f120a 0x7efbae 0x7ed324 0x866ea5 0x56440c 0x600317 0x60123e 0x5cfa04 0x837da3 0x5cfebf 0x52ecb0 0x7fc9e3865b45 0x5373ad (nil)

(gdb) info symbol 0x7fc9e487bdcb
No symbol matches 0x7fc9e487bdcb.

(gdb) info symbol 0x7fc9e487bf1b
No symbol matches 0x7fc9e487bf1b.

(gdb) info symbol 0x7fc9e487c965
No symbol matches 0x7fc9e487c965.

(gdb) info symbol 0x7fc9e48ae9c5
No symbol matches 0x7fc9e48ae9c5.

(gdb) info symbol 0x7f1fef
analyzer::tcp::ContentLine_Analyzer::DeliverStream(int, unsigned char const*, bool) + 207 in section .text

(gdb) info symbol 0x867006
analyzer::Analyzer::NextStream(int, unsigned char const*, bool) + 102 in section .text

(gdb) info symbol 0x86780e
analyzer::Analyzer::ForwardStream(int, unsigned char const*, bool) + 158 in section .text

(gdb) info symbol 0x7f1a86
analyzer::tcp::TCP_Reassembler::DeliverBlock(unsigned long, int, unsigned char const*) + 118 in section .text

(gdb) info symbol 0x7f1da6
analyzer::tcp::TCP_Reassembler::BlockInserted(DataBlock*) + 102 in section .text

(gdb) info symbol 0x7f120a
analyzer::tcp::TCP_Reassembler::DataSent(double, unsigned long, int, unsigned char const*, bool) + 106 in section .text

(gdb) info symbol 0x7efbae
analyzer::tcp::TCP_Endpoint::DataSent(double, unsigned long, int, int, unsigned char const*, IP_Hdr const*, tcphdr const*) + 110 in section .text

(gdb) info symbol 0x7ed324
analyzer::tcp::TCP_Analyzer::DeliverPacket(int, unsigned char const*, bool, unsigned long, IP_Hdr const*, int) + 4932 in section .text

(gdb) info symbol 0x866ea5
analyzer::Analyzer::NextPacket(int, unsigned char const*, bool, unsigned long, IP_Hdr const*, int) + 149 in section .text

(gdb) info symbol 0x56440c
Connection::NextPacket(double, int, IP_Hdr const*, int, int, unsigned char const*&, int&, int&, pcap_pkthdr const*, unsigned char const*, int) + 188 in section .text

(gdb) info symbol 0x600317
NetSessions::DoNextPacket(double, pcap_pkthdr const*, IP_Hdr const*, unsigned char const*, int, EncapsulationStack const*) + 2279 in section .text

(gdb) info symbol 0x60123e
NetSessions::NextPacket(double, pcap_pkthdr const*, unsigned char const*, int) + 798 in section .text

(gdb) info symbol 0x5cfa04
net_packet_dispatch(double, pcap_pkthdr const*, unsigned char const*, int, iosource::PktSrc*) + 244 in section .text

(gdb) info symbol 0x837da3
iosource::PktSrc::Process() + 323 in section .text

(gdb) info symbol 0x5cfebf
net_run() + 111 in section .text

(gdb) info symbol 0x52ecb0
main + 7232 in section .text

(gdb) info symbol 0x7fc9e3865b45
No symbol matches 0x7fc9e3865b45.

(gdb) info symbol 0x5373ad
_start + 41 in section .text