Bro's limitations with high worker count and memory exhaustion

There’s some sort of association between memory exhaustion and a high number of workers. The poor man’s fix would be to purchase new servers with faster CPUs, since that would reduce the worker count. Issues with a high worker count and/or memory exhaustion appear to be a well-known problem based on the mailing list archives.

In the current version, bro-2.4, my previous configuration of 15 proxies and 155 workers immediately causes the manager to crash. To work around this I’ve lowered the count to 10 proxies and 140 workers. However, even with this configuration the manager process exhausts all memory and crashes within about 2 hours.

The manager is threaded; I think this is an issue with the threading behavior between the manager, proxies, and workers. Debugging threading problems is complex and I’m a complete novice… so far I’ve been working from information in a Stack Overflow thread:

http://stackoverflow.com/questions/981011/c-programming-debugging-with-pthreads

Does anyone else have this problem? What have you tried, and what do you suggest?

Thanks

1435347409.458185 worker-2-18 parent - - - info [#10000/10.1.1.1:36994] peer sent class “control”

1435347409.458185 worker-2-18 parent - - - info [#10000/10.1.1.1:36994] phase: handshake

1435347409.661085 worker-2-18 parent - - - info [#10000/10.1.1.1:36994] request for unknown event save_results

1435347409.661085 worker-2-18 parent - - - info [#10000/10.1.1.1:36994] registered for event Control::peer_status_response

1435347409.694858 worker-2-18 parent - - - info [#10000/10.1.1.1:36994] peer does not support 64bit PIDs; using compatibility mode

1435347409.694858 worker-2-18 parent - - - info [#10000/10.1.1.1:36994] peer is a Broccoli

1435347409.694858 worker-2-18 parent - - - info [#10000/10.1.1.1:36994] phase: running

I experienced similar problems (memory gets eaten up quickly and workers crash with segfault) using tcmalloc. Which malloc do you use?

Regards,

Jan

Looks like malloc from glibc, default on Ubuntu. I will try jemalloc and others.

Switching to jemalloc fixed the stability issue but not the worker count limitation.

Never mind… it was a new box with the default nofile limits. Thanks for the malloc tip.

The manager still crashes, though. Interesting that the backtrace mentions a buffer overflow.

[manager]

Bro 2.4

Linux 3.16.0-38-generic

core

[New LWP 18834]

[Thread debugging using libthread_db enabled]

Using host libthread_db library “/lib/x86_64-linux-gnu/libthread_db.so.1”.

Core was generated by `/usr/local/3rd-party/bro/bin/bro -U .status -p broctl -p broctl-live -p local -’.

Program terminated with signal SIGABRT, Aborted.

#0 0x00007f163bb46cc9 in __GI_raise (sig=sig@entry=6) at …/nptl/sysdeps/unix/sysv/linux/raise.c:56

Thread 1 (Thread 0x… (LWP 18834)):

#0 0x00007f163bb46cc9 in __GI_raise (sig=sig@entry=6) at …/nptl/sysdeps/unix/sysv/linux/raise.c:56

#1 0x00007f163bb4a0d8 in __GI_abort () at abort.c:89

#2 0x00007f163bb83394 in __libc_message (do_abort=do_abort@entry=2, fmt=fmt@entry=0x… “*** %s ***: %s terminated\n”) at …/sysdeps/posix/libc_fatal.c:175

#3 0x00007f163bc1ac9c in __GI___fortify_fail (msg=, msg@entry=0x… “buffer overflow detected”) at fortify_fail.c:37

#4 0x00007f163bc19b60 in __GI___chk_fail () at chk_fail.c:28

#5 0x00007f163bc1abe7 in __fdelt_chk (d=) at fdelt_chk.c:25

#6 0x00000000005e962a in Set (set=0x…, this=0x…) at /home/bro/Bro-IDS/bro-2.4/src/iosource/FD_Set.h:59

#7 SocketComm::Run (this=0x…) at /home/bro/Bro-IDS/bro-2.4/src/RemoteSerializer.cc:3406

#8 0x00000000005e9c31 in RemoteSerializer::Fork (this=0x…) at /home/bro/Bro-IDS/bro-2.4/src/RemoteSerializer.cc:687

#9 0x00000000005e9d4f in RemoteSerializer::Enable (this=0x…) at /home/bro/Bro-IDS/bro-2.4/src/RemoteSerializer.cc:575

#10 0x00000000005b6943 in BifFunc::bro_enable_communication (frame=, BiF_ARGS=) at bro.bif:4480

#11 0x00000000005b431d in BuiltinFunc::Call (this=0x…, args=0x…, parent=0x…) at /home/bro/Bro-IDS/bro-2.4/src/Func.cc:586

#12 0x0000000000599066 in CallExpr::Eval (this=0x…, f=0x…) at /home/bro/Bro-IDS/bro-2.4/src/Expr.cc:4544

#13 0x000000000060ceb4 in ExprStmt::Exec (this=0x…, f=0x…, flow=@0x…: FLOW_NEXT) at /home/bro/Bro-IDS/bro-2.4/src/Stmt.cc:352

#14 0x000000000060b174 in IfStmt::DoExec (this=0x…, f=0x…, v=, flow=@0x…: FLOW_NEXT) at /home/bro/Bro-IDS/bro-2.4/src/Stmt.cc:456

#15 0x000000000060ced1 in ExprStmt::Exec (this=0x…, f=0x…, flow=@0x…: FLOW_NEXT) at /home/bro/Bro-IDS/bro-2.4/src/Stmt.cc:356

#16 0x000000000060b211 in StmtList::Exec (this=0x…, f=0x…, flow=@0x…: FLOW_NEXT) at /home/bro/Bro-IDS/bro-2.4/src/Stmt.cc:1696

#17 0x000000000060b211 in StmtList::Exec (this=0x…, f=0x…, flow=@0x…: FLOW_NEXT) at /home/bro/Bro-IDS/bro-2.4/src/Stmt.cc:1696

#18 0x00000000005c042e in BroFunc::Call (this=0x…, args=, parent=0x0) at /home/bro/Bro-IDS/bro-2.4/src/Func.cc:403

#19 0x000000000057ee2a in EventHandler::Call (this=0x…, vl=0x…, no_remote=no_remote@entry=false) at /home/bro/Bro-IDS/bro-2.4/src/EventHandler.cc:130

#20 0x000000000057e035 in Dispatch (no_remote=false, this=0x…) at /home/bro/Bro-IDS/bro-2.4/src/Event.h:50

#21 EventMgr::Dispatch (this=this@entry=0x… ) at /home/bro/Bro-IDS/bro-2.4/src/Event.cc:111

#22 0x000000000057e1d0 in EventMgr::Drain (this=0xbbd720 ) at /home/bro/Bro-IDS/bro-2.4/src/Event.cc:128

#23 0x00000000005300ed in main (argc=, argv=) at /home/bro/Bro-IDS/bro-2.4/src/main.cc:1147

A guess is that you’re bumping into an FD_SETSIZE limit: the way remote I/O is currently structured uses at least 5 file descriptors per remote connection from what I can see at a glance (a pair of pipes, 2 fds each, for signaling read/write readiness related to ChunkedIO, plus one fd for the actual socket). Typically FD_SETSIZE is 1024, so with ~150-200 remote connections at 5 fds each, plus whatever other descriptors Bro needs to have open (e.g. for file I/O), it seems reasonable to guess that’s the problem. But you could easily verify with some code modifications that check whether any FD_SET call is being handed an fd >= FD_SETSIZE.
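
For example, a guard along these lines (the helper name is made up, not Bro’s actual API in FD_Set.h) would report the condition instead of letting glibc’s fortified FD_SET abort, which is what frame #5 (__fdelt_chk) in the backtrace shows:

    // Sketch only: bounds-check a descriptor before FD_SET to confirm
    // whether anything ever reaches FD_SETSIZE (1024 on typical glibc).
    // With _FORTIFY_SOURCE, an out-of-range fd makes __fdelt_chk abort
    // with "buffer overflow detected", matching the backtrace above.
    #include <sys/select.h>
    #include <cstdio>

    static bool checked_fd_set(int fd, fd_set* set)
        {
        if ( fd < 0 || fd >= FD_SETSIZE )
            {
            std::fprintf(stderr, "fd %d is outside FD_SETSIZE (%d)\n", fd, FD_SETSIZE);
            return false;
            }

        FD_SET(fd, set);
        return true;
        }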

Other than making involved code changes to Bro (e.g. to move away from select() for I/O event handling), the only suggestions I have are to 1) reduce the number of remote connections, or 2) see if you can increase FD_SETSIZE via the preprocessor or CFLAGS/CXXFLAGS when ./configure’ing (I’ve never done this myself to know whether it works, but from googling around before, the implication seemed to be that it may work on Linux).

- Jon

Thanks. Some limited reading says it’s not possible to increase FD_SETSIZE on Linux, and that it’s time to migrate to poll().
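
For reference, a minimal sketch of the idiom a poll() migration would move toward (illustrative only, not Bro code): poll() takes a caller-sized array of pollfd rather than a fixed-size bitmap, so the FD_SETSIZE ceiling goes away.

    // Minimal poll()-based readiness wait; unlike select(), the pollfd
    // array is sized by the caller, so there is no FD_SETSIZE limit.
    #include <poll.h>
    #include <vector>

    // Returns the number of ready descriptors, 0 on timeout, -1 on error.
    int wait_for_readable(const std::vector<int>& fds, int timeout_ms)
        {
        std::vector<pollfd> pfds;
        pfds.reserve(fds.size());

        for ( int fd : fds )
            {
            pollfd p;
            p.fd = fd;         // descriptor to watch
            p.events = POLLIN; // wake when readable
            p.revents = 0;
            pfds.push_back(p);
            }

        return poll(pfds.data(), static_cast<nfds_t>(pfds.size()), timeout_ms);
        }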

Do you think a high worker count with the current implementation of select() would cause high memory usage?

I’m trying to figure out why the manager always exhausts all memory:

top - 18:36:13 up 1 day, 14:42, 1 user, load average: 12.67, 10.83, 10.95

Tasks: 606 total, 5 running, 601 sleeping, 0 stopped, 0 zombie

%Cpu(s): 15.3 us, 6.4 sy, 1.3 ni, 76.3 id, 0.0 wa, 0.0 hi, 0.7 si, 0.0 st

KiB Mem: 65939412 total, 65251768 used, 687644 free, 43248 buffers

KiB Swap: 67076092 total, 54857880 used, 12218212 free. 4297048 cached Mem

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND

35046 logstash 20 0 10.320g 511600 3784 S 782.1 0.8 5386:34 java

9925 bro 25 5 97.504g 0.045t 1508 R 99.7 73.8 814:58.88 bro

9906 bro 20 0 22.140g 3.388g 3784 S 73.2 5.4 1899:18 bro

2509 root 20 0 308440 44064 784 R 48.5 0.1 1029:56 redis-server

2688 bro 30 10 4604 1440 1144 R 44.8 0.0 0:00.49 gzip

180 root 20 0 0 0 0 S 8.2 0.0 4:26.54 ksoftirqd/8

2419 debug 20 0 25376 3564 2600 R 7.3 0.0 0:00.76 top

2689 logstash 20 0 8 4 0 R 5.5 0.0 0:00.06 bro

You are not trying to run 140 workers on a single machine with 64GB memory, right?

No. The cluster is six 48-core/64GB servers, with the manager on one and 3 proxies and 28 workers on the others. I enabled profiling but didn’t see anything wrong. The bro process that is consuming all the memory is not the same process detailed by prof.log.

Each worker should also have its own prof.log under /bro/spool/ on the actual hosts where the workers are running.

Ok, but the problem isn’t with a worker; it’s with the secondary manager process that collects the logs from the workers.

I think the memory exhaustion is related to worker count only in the sense that more threads (workers) cause more frequent leaks. I’ve read accounts of others on this list restarting the manager once per month or so… I’m thinking it’s the same bug (memory leak), just with a lower worker count.

Ah, fair enough. I was potentially one of those people who struggled with memory exhaustion in the past, but the issue ended up largely being due to memory leaks in various Bro builds between 2.2 and 2.3. I haven't seen similar leaks since 2.3 unless I made a mistake in a script, such as not validating that Bro populated a value. The only other issue I've seen really tended to crash the proxies (or have them oom-killed), not the manager. That one was also due to a form of communication overload, but involved software asset tracking, high IP turnover, and lots of table updates getting shared out to the cluster.

I’d guess the slowness of disk I/O is more of a contributing factor. The general problem is that if the manager can’t consume logs (i.e. write them to disk) at a rate greater than or equal to the rate at which workers produce them, then the manager buffers them up until it runs out of memory. If you enable profiling, prof.log would contain details indicating whether this is the situation.

Any slowness of select() due to a large number of fds probably also puts pressure back on the workers and the remote connections, since the select() loop is involved in actually pulling data off the wire and in processing some of the chunks pulled off the wire, and those intermediate buffers have a limit before Bro shuts down connections with peers that talk too much. This could cause thrashing of the actual connections between workers and manager (communication.log may reference things “filling up” or something like that), but it’s probably still possible to get into a situation where the manager’s logging thread queues can’t dequeue entries (i.e. write them to disk) fast enough and eventually hit OOM.
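
Roughly, that limit-and-shed behavior amounts to something like the following sketch (illustrative only; the names and the limit are invented, not RemoteSerializer’s actual internals):

    // When the total data queued toward the parent exceeds a limit, pick
    // the peer with the most buffered data ("heaviest connection") to
    // disconnect, which is the behavior the communication.log errors report.
    #include <cstddef>
    #include <map>

    struct PeerQueues
        {
        std::map<int, std::size_t> bytes_pending; // peer id -> bytes queued to parent

        // Returns the peer to shed, or -1 if the total is still under the limit.
        int pick_peer_to_shed(std::size_t limit) const
            {
            std::size_t total = 0;
            for ( const auto& p : bytes_pending )
                total += p.second;

            if ( total <= limit )
                return -1;

            int heaviest = -1;
            std::size_t heaviest_bytes = 0;

            for ( const auto& p : bytes_pending )
                if ( p.second >= heaviest_bytes )
                    {
                    heaviest = p.first;
                    heaviest_bytes = p.second;
                    }

            return heaviest;
            }
        };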

And if you haven’t already, take a look at reporter.log for any scripting errors as certain types may cause memory leaks.

- Jon

> I’d guess the slowness of disk I/O is more of a contributing factor. The general problem is that if the manager can’t consume logs (i.e. write them to disk) at a rate greater than or equal to the rate at which workers produce them, then the manager buffers them up until it runs out of memory. If you enable profiling, prof.log would contain details indicating whether this is the situation.

This is something I checked before and it seemed a non-issue. I'm using a RAM-backed filesystem and rotating logs every 5 minutes. In prof.log the pending counter is always 0/0. Is there something else I should be looking for?

    /**
     * Statistics about inter-thread communication.
     */
    struct Stats
        {
        uint64_t sent_in;     //! Number of messages sent to the child thread.
        uint64_t sent_out;    //! Number of messages sent from the child thread to the main thread.
        uint64_t pending_in;  //! Number of messages sent to the child but not yet processed.
        uint64_t pending_out; //! Number of messages sent from the child but not yet processed by the main thread.

        /// Statistics from our queues.
        Queue<BasicInputMessage *>::Stats queue_in_stats;
        Queue<BasicOutputMessage *>::Stats queue_out_stats;
        };

1435868941.791125 TCP-States:Fin.

1435868941.791125 TCP-States:Rst.

1435868941.791125 Connections expired due to inactivity: 0

1435868941.791125 Total reassembler data: 0K

1435868941.791125 Timers: current=69 max=71 mem=5K lag=505.67s

1435868941.791125 DNS_Mgr: requests=0 succesful=0 failed=0 pending=0
cached_hosts=0 cached_addrs=0

1435868941.791125 Triggers: total=0 pending=0

1435868941.791125 RotateTimer = 34

1435868941.791125 ScheduleTimer = 10

1435868941.791125 TableValTimer = 25

1435868941.791125 Threads: current=34

1435868941.791125 loaded_scripts/Log::WRITER_ASCII in=114 out=112
pending=0/0 (#queue r/w: in=114/114 out=112/112)

1435868941.791125 communication/Log::WRITER_ASCII in=147 out=112
pending=0/0 (#queue r/w: in=147/147 out=112/112)

1435868941.791125 files/Log::WRITER_ASCII in=10522 out=54 pending=0/0
(#queue r/w: in=10522/10522 out=54/54)

1435868941.791125 reporter/Log::WRITER_ASCII in=82 out=54 pending=0/0
(#queue r/w: in=82/82 out=54/54)

1435868941.791125 weird/Log::WRITER_ASCII in=1083 out=54 pending=0/0
(#queue r/w: in=1083/1083 out=54/54)

1435868941.791125 packet_filter/Log::WRITER_ASCII in=55 out=54
pending=0/0 (#queue r/w: in=55/55 out=54/54)

1435868941.791125 dpd/Log::WRITER_ASCII in=180 out=54 pending=0/0
(#queue r/w: in=180/180 out=54/54)

1435868941.791125 software/Log::WRITER_ASCII in=404 out=54 pending=0/0
(#queue r/w: in=404/404 out=54/54)

1435868941.791125 tunnel/Log::WRITER_ASCII in=109 out=54 pending=0/0
(#queue r/w: in=109/109 out=54/54)

1435868941.791125 conn/Log::WRITER_ASCII in=20583 out=54 pending=0/0
(#queue r/w: in=20583/20583 out=54/54)

1435868941.791125 dns/Log::WRITER_ASCII in=4369 out=54 pending=0/0
(#queue r/w: in=4369/4369 out=54/54)

1435868941.791125 ftp/Log::WRITER_ASCII in=94 out=54 pending=0/0
(#queue r/w: in=94/94 out=54/54)

1435868941.791125 ssl/Log::WRITER_ASCII in=2566 out=54 pending=0/0
(#queue r/w: in=2566/2566 out=54/54)

1435868941.791125 x509/Log::WRITER_ASCII in=1625 out=54 pending=0/0
(#queue r/w: in=1625/1625 out=54/54)

1435868941.791125 http/Log::WRITER_ASCII in=7894 out=54 pending=0/0
(#queue r/w: in=7894/7894 out=54/54)

1435868941.791125 irc/Log::WRITER_ASCII in=57 out=54 pending=0/0
(#queue r/w: in=57/57 out=54/54)

1435868941.791125 kerberos/Log::WRITER_ASCII in=65 out=54 pending=0/0
(#queue r/w: in=65/65 out=54/54)

1435868941.791125 mysql/Log::WRITER_ASCII in=78 out=54 pending=0/0
(#queue r/w: in=78/78 out=54/54)

1435868941.791125 rdp/Log::WRITER_ASCII in=140 out=54 pending=0/0
(#queue r/w: in=140/140 out=54/54)

1435868941.791125 sip/Log::WRITER_ASCII in=108 out=54 pending=0/0
(#queue r/w: in=108/108 out=54/54)

1435868941.791125 snmp/Log::WRITER_ASCII in=104 out=54 pending=0/0
(#queue r/w: in=104/104 out=54/54)

1435868941.791125 smtp/Log::WRITER_ASCII in=232 out=54 pending=0/0
(#queue r/w: in=232/232 out=54/54)

1435868941.791125 socks/Log::WRITER_ASCII in=69 out=54 pending=0/0
(#queue r/w: in=69/69 out=54/54)

1435868941.791125 ssh/Log::WRITER_ASCII in=167 out=54 pending=0/0
(#queue r/w: in=167/167 out=54/54)

1435868941.791125 pe/Log::WRITER_ASCII in=78 out=54 pending=0/0
(#queue r/w: in=78/78 out=54/54)

1435868941.791125 capture_loss/Log::WRITER_ASCII in=56 out=54 pending=0/0
(#queue r/w: in=56/56 out=54/54)

1435868941.791125 known_hosts/Log::WRITER_ASCII in=137 out=54 pending=0/0
(#queue r/w: in=137/137 out=54/54)

1435868941.791125 known_services/Log::WRITER_ASCII in=108 out=54
pending=0/0 (#queue r/w: in=108/108 out=54/54)

1435868941.791125 known_certs/Log::WRITER_ASCII in=77 out=54 pending=0/0
(#queue r/w: in=77/77 out=54/54)

1435868941.791125 radius/Log::WRITER_ASCII in=78 out=53 pending=0/0
(#queue r/w: in=78/78 out=53/53)

1435868941.791125 dhcp/Log::WRITER_ASCII in=53 out=52 pending=0/0
(#queue r/w: in=53/53 out=52/52)

1435868941.791125 notice/Log::WRITER_ASCII in=96 out=50 pending=0/0
(#queue r/w: in=96/96 out=50/50)

1435868941.791125 modbus/Log::WRITER_ASCII in=50 out=49 pending=0/0
(#queue r/w: in=50/50 out=49/49)

1435868941.791125 traceroute/Log::WRITER_ASCII in=51 out=47 pending=0/0
(#queue r/w: in=51/51 out=47/47)

> Any slowness of select() due to a large number of fds probably also puts pressure back on the workers and the remote connections, since the select() loop is involved in actually pulling data off the wire and in processing some of the chunks pulled off the wire, and those intermediate buffers have a limit before Bro shuts down connections with peers that talk too much. This could cause thrashing of the actual connections between workers and manager (communication.log may reference things “filling up” or something like that), but it’s probably still possible to get into a situation where the manager’s logging thread queues can’t dequeue entries (i.e. write them to disk) fast enough and eventually hit OOM.

There are a few messages in communication.log about filling up, but I'm not sure they are frequent enough to account for it:

#fields ts peer src_name connected_peer_desc connected_peer_addr connected_peer_port level message

#types time string string string addr port string string

1435870417.629212 manager child - - - info
[#10005/10.1.1.40:53113] connection closed

1435870417.629212 manager child - - - error
queue to parent filling up; shutting down heaviest connection

1435870417.629212 manager script - - - info
connection closed

1435870417.629212 manager parent - - - info
[#10094/10.1.1.37:49822] peer disconnected

1435870417.629212 manager child - - - info
[#10094/10.1.1.37:49822] connection closed

1435870417.629212 manager child - - - error
queue to parent filling up; shutting down heaviest connection

1435870417.629212 manager script - - - info
connection closed

1435870417.629212 manager parent - - - info
[#10090/10.1.1.41:33718] peer disconnected

1435870423.145758 manager child - - - info
[#10090/10.1.1.41:33718] connection closed

1435870423.145758 manager child - - - error
queue to parent filling up; shutting down heaviest connection

1435870423.145758 manager script - - - info
connection closed

1435870423.145758 manager parent - - - info
[#10038/10.1.1.38:34331] peer disconnected

1435870423.145758 manager child - - - info
[#10038/10.1.1.38:34331] connection closed

1435870423.145758 manager child - - - error
queue to parent filling up; shutting down heaviest connection

1435870423.145758 manager script - - - info
connection closed

1435870423.145758 manager parent - - - info
[#10128/10.1.1.39:37465] peer disconnected

1435870423.519705 manager child - - - info
[#10128/10.1.1.39:37465] connection closed

1435870423.519705 manager script - - - info
connection closed

1435870423.519705 manager child - - - error
queue to parent filling up; shutting down heaviest connection

1435870423.519705 manager parent - - - info
[#10147/10.1.1.38:34353] peer disconnected

1435870423.873331 manager child - - - info
[#10147/10.1.1.38:34353] connection closed

1435870423.873331 manager child - - - error
queue to parent filling up; shutting down heaviest connection

1435870423.873331 manager script - - - info
connection closed

1435870429.806242 manager parent - - - info
[#10116/10.1.1.37:49813] peer disconnected

1435870429.806242 manager child - - - info
[#10116/10.1.1.37:49813] connection closed

1435869706.106072 proxy-4 parent - - - info
parent statistics: pending=0 bytes=515841K/3942969K chunks=2770217/19597747
io=1104183/1199656 bytes/io=0.47K/3.29K events=0/0
operations=1383192/9797004

1435869706.106072 proxy-4 parent - - - info
child statistics: [0] pending=0 bytes=0K/94K chunks=8/101 io=4/51
bytes/io=0.05K/1.85K

1435869706.106072 proxy-4 parent - - - info
child statistics: [1] pending=0 bytes=58392K/436287K chunks=306097/2178102
io=137164/1089052 bytes/io=0.43K/0.40K

1435869706.106072 proxy-4 parent - - - info
child statistics: [2] pending=0 bytes=57694K/436873K chunks=302777/2179288
io=135313/1089645 bytes/io=0.43K/0.40K

1435869706.106072 proxy-4 parent - - - info
child statistics: [3] pending=0 bytes=54266K/440600K chunks=305363/2180192
io=136884/1090097 bytes/io=0.40K/0.40K

For yesterday:

bro@manager:/usr/local/bro$ zgrep -i fill logs/2015-07-01/communication.*

logs/2015-07-01/communication.00:15:40-00:20:54.log.gz:1435709777.104131
    manager child - - - error queue to parent filling
up; shutting down heaviest connection

logs/2015-07-01/communication.00:15:40-00:20:54.log.gz:1435709777.104131
    manager child - - - error queue to parent filling
up; shutting down heaviest connection

logs/2015-07-01/communication.07:26:19-07:31:10.log.gz:1435735579.660321
    manager child - - - error queue to parent filling
up; shutting down heaviest connection

logs/2015-07-01/communication.07:26:19-07:31:10.log.gz:1435735799.681583
    manager child - - - error queue to parent filling
up; shutting down heaviest connection

logs/2015-07-01/communication.10:11:06-10:15:02.log.gz:1435745466.524085
    manager child - - - error queue to parent filling
up; shutting down heaviest connection

logs/2015-07-01/communication.10:11:06-10:15:02.log.gz:1435745689.339628
    manager child - - - error queue to parent filling
up; shutting down heaviest connection

logs/2015-07-01/communication.10:11:06-10:15:02.log.gz:1435745689.339628
    manager child - - - error queue to parent filling
up; shutting down heaviest connection

logs/2015-07-01/communication.10:11:06-10:15:02.log.gz:1435745690.081825
    manager child - - - error queue to parent filling
up; shutting down heaviest connection

logs/2015-07-01/communication.10:11:06-10:15:02.log.gz:1435745690.888470
    manager child - - - error queue to parent filling
up; shutting down heaviest connection

logs/2015-07-01/communication.10:11:06-10:15:02.log.gz:1435745690.888470
    manager child - - - error queue to parent filling
up; shutting down heaviest connection

logs/2015-07-01/communication.10:11:06-10:15:02.log.gz:1435745690.888470
    manager child - - - error queue to parent filling
up; shutting down heaviest connection

logs/2015-07-01/communication.10:15:02-10:21:46.log.gz:1435745702.430527
    manager child - - - error queue to parent filling
up; shutting down heaviest connection

logs/2015-07-01/communication.10:15:02-10:21:46.log.gz:1435745702.430527
    manager child - - - error queue to parent filling
up; shutting down heaviest connection

logs/2015-07-01/communication.10:15:02-10:21:46.log.gz:1435745702.430527
    manager child - - - error queue to parent filling
up; shutting down heaviest connection

logs/2015-07-01/communication.10:15:02-10:21:46.log.gz:1435745734.889231
    manager child - - - error queue to parent filling
up; shutting down heaviest connection

logs/2015-07-01/communication.10:15:02-10:21:46.log.gz:1435745734.889231
    manager child - - - error queue to parent filling
up; shutting down heaviest connection

logs/2015-07-01/communication.10:15:02-10:21:46.log.gz:1435745736.777859
    manager child - - - error queue to parent filling
up; shutting down heaviest connection

logs/2015-07-01/communication.10:15:02-10:21:46.log.gz:1435745736.777859
    manager child - - - error queue to parent filling
up; shutting down heaviest connection

logs/2015-07-01/communication.10:15:02-10:21:46.log.gz:1435745736.777859
    manager child - - - error queue to parent filling
up; shutting down heaviest connection

logs/2015-07-01/communication.10:15:02-10:21:46.log.gz:1435745737.839721
    manager child - - - error queue to parent filling
up; shutting down heaviest connection

logs/2015-07-01/communication.10:15:02-10:21:46.log.gz:1435745737.839721
    manager child - - - error queue to parent filling
up; shutting down heaviest connection

logs/2015-07-01/communication.10:15:02-10:21:46.log.gz:1435745738.539531
    manager child - - - error queue to parent filling
up; shutting down heaviest connection

logs/2015-07-01/communication.10:15:02-10:21:46.log.gz:1435745738.821476
    manager child - - - error queue to parent filling
up; shutting down heaviest connection

logs/2015-07-01/communication.16:05:06-16:10:40.log.gz:1435766942.335851
    manager child - - - error queue to parent filling
up; shutting down heaviest connection

logs/2015-07-01/communication.19:39:56-19:45:50.log.gz:1435779596.437060
    manager child - - - error queue to parent filling
up; shutting down heaviest connection

logs/2015-07-01/communication.19:39:56-19:45:50.log.gz:1435779596.437060
    manager child - - - error queue to parent filling
up; shutting down heaviest connection

logs/2015-07-01/communication.19:45:50-19:53:35.log.gz:1435779950.225082
    manager child - - - error queue to parent filling
up; shutting down heaviest connection

logs/2015-07-01/communication.19:45:50-19:53:35.log.gz:1435779950.225082
    manager child - - - error queue to parent filling
up; shutting down heaviest connection

logs/2015-07-01/communication.19:45:50-19:53:35.log.gz:1435779990.764378
    manager child - - - error queue to parent filling
up; shutting down heaviest connection

logs/2015-07-01/communication.19:45:50-19:53:35.log.gz:1435779990.764378
    manager child - - - error queue to parent filling
up; shutting down heaviest connection

logs/2015-07-01/communication.19:45:50-19:53:35.log.gz:1435779990.764378
    manager child - - - error queue to parent filling
up; shutting down heaviest connection

logs/2015-07-01/communication.19:45:50-19:53:35.log.gz:1435780017.662903
    manager child - - - error queue to parent filling
up; shutting down heaviest connection

> And if you haven’t already, take a look at reporter.log for any scripting errors as certain types may cause memory leaks.

There's a regular stream of:

1435796386.319295 Reporter::ERROR incomplete base64 group, padding
with 12 bits of 0 (empty)

1435796386.701326 Reporter::ERROR character 45 ignored by Base64
decoding (empty)

1435796386.701326 Reporter::ERROR incomplete base64 group, padding
with 18 bits of 0 (empty)

1435796388.426591 Reporter::ERROR extra base64 groups after '='
padding are ignored (empty)

1435796388.426591 Reporter::ERROR incomplete base64 group, padding
with 18 bits of 0 (empty)

1435796397.701170 Reporter::ERROR character 45 ignored by Base64
decoding (empty)

1435796397.701170 Reporter::ERROR incomplete base64 group, padding
with 18 bits of 0 (empty)

1435796400.416380 Reporter::ERROR extra base64 groups after '='
padding are ignored (empty)

1435796400.416380 Reporter::ERROR incomplete base64 group, padding
with 6 bits of 0 (empty)

And occasionally messages for SumStat (port-scan, addr-scan,
detect-sql-attackers, traceroute-detection):

0.000000 Reporter::WARNING SumStat key request for the KoufzswOJrf SumStat uid took longer than 1 minute and was automatically cancelled. /usr/local/bro/share/bro/base/frameworks/sumstats/./cluster.bro, line 218

0.000000 Reporter::WARNING SumStat key request for the tVVpSQcL7Ac SumStat uid took longer than 1 minute and was automatically cancelled. /usr/local/bro/share/bro/base/frameworks/sumstats/./cluster.bro, line 218

0.000000 Reporter::WARNING SumStat key request for the fvaD58lPrkh SumStat uid took longer than 1 minute and was automatically cancelled. /usr/local/bro/share/bro/base/frameworks/sumstats/./cluster.bro, line 218

0.000000 Reporter::WARNING SumStat key request for the mHN4TlfikJ6 SumStat uid took longer than 1 minute and was automatically cancelled. /usr/local/bro/share/bro/base/frameworks/sumstats/./cluster.bro, line 218

0.000000 Reporter::WARNING Manager received a uid for an unknown request. SumStat: addr-scan, Key: [str=55554/tcp, host=x.x.x.x] /usr/local/bro/share/bro/base/frameworks/sumstats/./cluster.bro, line 379

0.000000 Reporter::WARNING Manager received a uid for an unknown request. SumStat: addr-scan, Key: [str=445/tcp, host=x.x.x.x] /usr/local/bro/share/bro/base/frameworks/sumstats/./cluster.bro, line 379

0.000000 Reporter::WARNING Manager received a uid for an unknown request. SumStat: addr-scan,

No, I recall that being the thing to watch, and the prof.log snippet looked sane to me if it is from a time period when memory usage was high. You might also graph memory usage over time to see if it’s spiky or just a steady increase. The latter may suggest a leak, or script-layer state (e.g. tables/sets) not having an appropriate expiration interval. I think prof.log also has info about memory usage of script-layer tables that might be worth looking at. Other than trying to get things running under a memory profiler, sorry, I don’t have many other ideas at the moment.

- Jon

Looks like you were right: after watching this more closely I do see incidents of pending counters above zero. I tried reducing log volume by disabling files::log, but that didn’t help.

If Bro is having problems writing to a RAM-backed FS, I don’t think adding faster disks is going to help.

I still don’t understand enough about the architecture from reading src/RemoteSerializer.cc and src/input/readers/raw/Raw.cc. What else should I look at if the child’s buffering is not related to disk I/O?