I'd guess the slowness of disk I/O is more of a contributing factor. The general problem is that if the manager can't consume logs (i.e. write them to disk) at a rate greater than or equal to the rate at which the workers produce them, it buffers them in memory until it runs out. If you enable profiling, prof.log will contain details that indicate whether this is the situation.
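To make that concrete, here is a minimal, hypothetical C++ sketch (not Bro's actual code) of the failure mode: a producer emitting log entries faster than the consumer can "write them to disk" leaves a backlog that only grows, which on a long-running manager ends in OOM.

// Minimal, hypothetical sketch of the failure mode described above: a fast
// producer (workers emitting log entries) outruns a slow consumer (the
// manager "writing to disk"), so the in-memory backlog keeps growing.
#include <chrono>
#include <cstdio>
#include <mutex>
#include <queue>
#include <string>
#include <thread>

int main()
{
    std::queue<std::string> backlog; // stand-in for the manager's unbounded buffer
    std::mutex mtx;

    // Producer thread: one "log line" every millisecond.
    std::thread producer([&] {
        for (int i = 0; i < 2000; ++i) {
            {
                std::lock_guard<std::mutex> lk(mtx);
                backlog.push("log line");
            }
            std::this_thread::sleep_for(std::chrono::milliseconds(1));
        }
    });

    // Consumer thread: "writes to disk" ten times slower than the producer emits.
    std::thread consumer([&] {
        for (int i = 0; i < 200; ++i) {
            {
                std::lock_guard<std::mutex> lk(mtx);
                if (!backlog.empty())
                    backlog.pop();
            }
            std::this_thread::sleep_for(std::chrono::milliseconds(10));
        }
    });

    producer.join();
    consumer.join();

    // Roughly 1800 entries are still buffered; nothing bounds this but memory.
    std::printf("messages still buffered: %zu\n", backlog.size());
    return 0;
}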
This is something I checked before and it seemed to be a non-issue. I'm using a RAM-backed filesystem and rotating logs every 5 minutes. With prof.log, the pending counter is always 0/0. Is there something else I should be looking for?
/**
 * Statistics about inter-thread communication.
 */
struct Stats
{
    uint64_t sent_in;     //! Number of messages sent to the child thread.
    uint64_t sent_out;    //! Number of messages sent from the child thread to the main thread.
    uint64_t pending_in;  //! Number of messages sent to the child but not yet processed.
    uint64_t pending_out; //! Number of messages sent from the child but not yet processed by the main thread.

    /// Statistics from our queues.
    Queue<BasicInputMessage *>::Stats queue_in_stats;
    Queue<BasicOutputMessage *>::Stats queue_out_stats;
};
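For reading the per-writer lines in the prof.log excerpt below, my working assumption is that they are printed from these counters: in/out map to sent_in/sent_out, pending=X/Y to pending_in/pending_out, and "#queue r/w" to the read/write counts of the two queues. A hedged sketch of that formatting (the QueueStats member names num_reads/num_writes are my assumption, not taken from the source):

// Hedged sketch only: shows how a line such as
//   "conn/Log::WRITER_ASCII in=20583 out=54 pending=0/0 (#queue r/w: in=20583/20583 out=54/54)"
// could be assembled from the Stats fields above. The QueueStats member names
// are assumptions for illustration, not Bro's actual definitions.
#include <cstdint>
#include <cstdio>

struct QueueStats { uint64_t num_reads; uint64_t num_writes; }; // assumed shape

struct Stats {
    uint64_t sent_in, sent_out, pending_in, pending_out;
    QueueStats queue_in_stats, queue_out_stats;
};

void print_writer_line(const char* name, const Stats& s)
{
    std::printf("%s in=%llu out=%llu pending=%llu/%llu "
                "(#queue r/w: in=%llu/%llu out=%llu/%llu)\n",
                name,
                (unsigned long long)s.sent_in,
                (unsigned long long)s.sent_out,
                (unsigned long long)s.pending_in,
                (unsigned long long)s.pending_out,
                (unsigned long long)s.queue_in_stats.num_reads,
                (unsigned long long)s.queue_in_stats.num_writes,
                (unsigned long long)s.queue_out_stats.num_reads,
                (unsigned long long)s.queue_out_stats.num_writes);
}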
1435868941.791125 TCP-States:Fin.
1435868941.791125 TCP-States:Rst.
1435868941.791125 Connections expired due to inactivity: 0
1435868941.791125 Total reassembler data: 0K
1435868941.791125 Timers: current=69 max=71 mem=5K lag=505.67s
1435868941.791125 DNS_Mgr: requests=0 succesful=0 failed=0 pending=0 cached_hosts=0 cached_addrs=0
1435868941.791125 Triggers: total=0 pending=0
1435868941.791125 RotateTimer = 34
1435868941.791125 ScheduleTimer = 10
1435868941.791125 TableValTimer = 25
1435868941.791125 Threads: current=34
1435868941.791125 loaded_scripts/Log::WRITER_ASCII in=114 out=112 pending=0/0 (#queue r/w: in=114/114 out=112/112)
1435868941.791125 communication/Log::WRITER_ASCII in=147 out=112 pending=0/0 (#queue r/w: in=147/147 out=112/112)
1435868941.791125 files/Log::WRITER_ASCII in=10522 out=54 pending=0/0 (#queue r/w: in=10522/10522 out=54/54)
1435868941.791125 reporter/Log::WRITER_ASCII in=82 out=54 pending=0/0 (#queue r/w: in=82/82 out=54/54)
1435868941.791125 weird/Log::WRITER_ASCII in=1083 out=54 pending=0/0 (#queue r/w: in=1083/1083 out=54/54)
1435868941.791125 packet_filter/Log::WRITER_ASCII in=55 out=54 pending=0/0 (#queue r/w: in=55/55 out=54/54)
1435868941.791125 dpd/Log::WRITER_ASCII in=180 out=54 pending=0/0 (#queue r/w: in=180/180 out=54/54)
1435868941.791125 software/Log::WRITER_ASCII in=404 out=54 pending=0/0 (#queue r/w: in=404/404 out=54/54)
1435868941.791125 tunnel/Log::WRITER_ASCII in=109 out=54 pending=0/0 (#queue r/w: in=109/109 out=54/54)
1435868941.791125 conn/Log::WRITER_ASCII in=20583 out=54 pending=0/0 (#queue r/w: in=20583/20583 out=54/54)
1435868941.791125 dns/Log::WRITER_ASCII in=4369 out=54 pending=0/0 (#queue r/w: in=4369/4369 out=54/54)
1435868941.791125 ftp/Log::WRITER_ASCII in=94 out=54 pending=0/0 (#queue r/w: in=94/94 out=54/54)
1435868941.791125 ssl/Log::WRITER_ASCII in=2566 out=54 pending=0/0 (#queue r/w: in=2566/2566 out=54/54)
1435868941.791125 x509/Log::WRITER_ASCII in=1625 out=54 pending=0/0 (#queue r/w: in=1625/1625 out=54/54)
1435868941.791125 http/Log::WRITER_ASCII in=7894 out=54 pending=0/0 (#queue r/w: in=7894/7894 out=54/54)
1435868941.791125 irc/Log::WRITER_ASCII in=57 out=54 pending=0/0 (#queue r/w: in=57/57 out=54/54)
1435868941.791125 kerberos/Log::WRITER_ASCII in=65 out=54 pending=0/0 (#queue r/w: in=65/65 out=54/54)
1435868941.791125 mysql/Log::WRITER_ASCII in=78 out=54 pending=0/0 (#queue r/w: in=78/78 out=54/54)
1435868941.791125 rdp/Log::WRITER_ASCII in=140 out=54 pending=0/0 (#queue r/w: in=140/140 out=54/54)
1435868941.791125 sip/Log::WRITER_ASCII in=108 out=54 pending=0/0 (#queue r/w: in=108/108 out=54/54)
1435868941.791125 snmp/Log::WRITER_ASCII in=104 out=54 pending=0/0 (#queue r/w: in=104/104 out=54/54)
1435868941.791125 smtp/Log::WRITER_ASCII in=232 out=54 pending=0/0 (#queue r/w: in=232/232 out=54/54)
1435868941.791125 socks/Log::WRITER_ASCII in=69 out=54 pending=0/0 (#queue r/w: in=69/69 out=54/54)
1435868941.791125 ssh/Log::WRITER_ASCII in=167 out=54 pending=0/0 (#queue r/w: in=167/167 out=54/54)
1435868941.791125 pe/Log::WRITER_ASCII in=78 out=54 pending=0/0 (#queue r/w: in=78/78 out=54/54)
1435868941.791125 capture_loss/Log::WRITER_ASCII in=56 out=54 pending=0/0 (#queue r/w: in=56/56 out=54/54)
1435868941.791125 known_hosts/Log::WRITER_ASCII in=137 out=54 pending=0/0 (#queue r/w: in=137/137 out=54/54)
1435868941.791125 known_services/Log::WRITER_ASCII in=108 out=54 pending=0/0 (#queue r/w: in=108/108 out=54/54)
1435868941.791125 known_certs/Log::WRITER_ASCII in=77 out=54 pending=0/0 (#queue r/w: in=77/77 out=54/54)
1435868941.791125 radius/Log::WRITER_ASCII in=78 out=53 pending=0/0 (#queue r/w: in=78/78 out=53/53)
1435868941.791125 dhcp/Log::WRITER_ASCII in=53 out=52 pending=0/0 (#queue r/w: in=53/53 out=52/52)
1435868941.791125 notice/Log::WRITER_ASCII in=96 out=50 pending=0/0 (#queue r/w: in=96/96 out=50/50)
1435868941.791125 modbus/Log::WRITER_ASCII in=50 out=49 pending=0/0 (#queue r/w: in=50/50 out=49/49)
1435868941.791125 traceroute/Log::WRITER_ASCII in=51 out=47 pending=0/0 (#queue r/w: in=51/51 out=47/47)
Any slowness of select() due to a large number of fds probably puts more pressure back on the workers and their remote connections, since the same loop is involved both in pulling data off the wire and in processing some of the chunks it pulls off, and those intermediate buffers have a limit beyond which connections with peers that talk too much get shut down. This could cause thrashing of the actual connections between workers and manager (communication.log may reference things "filling up" or something like that), but it's probably still possible to get into a situation where the manager's logging thread queues can't dequeue entries (i.e. write them to disk) fast enough, and the manager eventually hits OOM.
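As an illustration only (a hypothetical sketch, not Bro's actual implementation), the shutdown behaviour described above amounts to something like this: once the child-to-parent backlog crosses a limit, the peer contributing the most queued data gets dropped, which is what the "queue to parent filling up; shutting down heaviest connection" entries in communication.log correspond to. The limit value and data structures below are assumptions for illustration.

// Hypothetical sketch of the back-pressure behaviour described above; the
// 64 MB limit and the data structures are assumptions, not Bro's code.
#include <cstddef>
#include <cstdio>
#include <map>
#include <string>

struct ChildToParentQueue {
    std::map<std::string, std::size_t> buffered_bytes; // per-peer backlog
    std::size_t limit = 64 * 1024 * 1024;              // assumed limit

    std::size_t total() const
    {
        std::size_t n = 0;
        for (const auto& p : buffered_bytes)
            n += p.second;
        return n;
    }

    void enqueue(const std::string& peer, std::size_t bytes)
    {
        buffered_bytes[peer] += bytes;

        if (total() <= limit)
            return;

        // Find the peer contributing the most backlog and drop it,
        // mirroring the error message seen in communication.log.
        auto heaviest = buffered_bytes.begin();
        for (auto it = buffered_bytes.begin(); it != buffered_bytes.end(); ++it)
            if (it->second > heaviest->second)
                heaviest = it;

        std::fprintf(stderr,
                     "queue to parent filling up; shutting down heaviest connection (%s)\n",
                     heaviest->first.c_str());
        buffered_bytes.erase(heaviest);
    }
};

int main()
{
    ChildToParentQueue q;
    q.limit = 100;             // tiny limit so the drop triggers in this demo
    q.enqueue("worker-1", 60);
    q.enqueue("worker-2", 30);
    q.enqueue("worker-3", 50); // pushes the total past the limit; worker-1 is dropped
    return 0;
}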
There are a few messages in communication.log about filling up, but I'm not sure they're frequent enough to account for it:
#fields ts peer src_name connected_peer_desc connected_peer_addr connected_peer_port level message
#types time string string string addr port string string
1435870417.629212 manager child - - - info [#10005/10.1.1.40:53113] connection closed
1435870417.629212 manager child - - - error queue to parent filling up; shutting down heaviest connection
1435870417.629212 manager script - - - info connection closed
1435870417.629212 manager parent - - - info [#10094/10.1.1.37:49822] peer disconnected
1435870417.629212 manager child - - - info [#10094/10.1.1.37:49822] connection closed
1435870417.629212 manager child - - - error queue to parent filling up; shutting down heaviest connection
1435870417.629212 manager script - - - info connection closed
1435870417.629212 manager parent - - - info [#10090/10.1.1.41:33718] peer disconnected
1435870423.145758 manager child - - - info [#10090/10.1.1.41:33718] connection closed
1435870423.145758 manager child - - - error queue to parent filling up; shutting down heaviest connection
1435870423.145758 manager script - - - info connection closed
1435870423.145758 manager parent - - - info [#10038/10.1.1.38:34331] peer disconnected
1435870423.145758 manager child - - - info [#10038/10.1.1.38:34331] connection closed
1435870423.145758 manager child - - - error queue to parent filling up; shutting down heaviest connection
1435870423.145758 manager script - - - info connection closed
1435870423.145758 manager parent - - - info [#10128/10.1.1.39:37465] peer disconnected
1435870423.519705 manager child - - - info [#10128/10.1.1.39:37465] connection closed
1435870423.519705 manager script - - - info connection closed
1435870423.519705 manager child - - - error queue to parent filling up; shutting down heaviest connection
1435870423.519705 manager parent - - - info [#10147/10.1.1.38:34353] peer disconnected
1435870423.873331 manager child - - - info [#10147/10.1.1.38:34353] connection closed
1435870423.873331 manager child - - - error queue to parent filling up; shutting down heaviest connection
1435870423.873331 manager script - - - info connection closed
1435870429.806242 manager parent - - - info [#10116/10.1.1.37:49813] peer disconnected
1435870429.806242 manager child - - - info [#10116/10.1.1.37:49813] connection closed
1435869706.106072 proxy-4 parent - - - info parent statistics: pending=0 bytes=515841K/3942969K chunks=2770217/19597747 io=1104183/1199656 bytes/io=0.47K/3.29K events=0/0 operations=1383192/9797004
1435869706.106072 proxy-4 parent - - - info child statistics: [0] pending=0 bytes=0K/94K chunks=8/101 io=4/51 bytes/io=0.05K/1.85K
1435869706.106072 proxy-4 parent - - - info child statistics: [1] pending=0 bytes=58392K/436287K chunks=306097/2178102 io=137164/1089052 bytes/io=0.43K/0.40K
1435869706.106072 proxy-4 parent - - - info child statistics: [2] pending=0 bytes=57694K/436873K chunks=302777/2179288 io=135313/1089645 bytes/io=0.43K/0.40K
1435869706.106072 proxy-4 parent - - - info child statistics: [3] pending=0 bytes=54266K/440600K chunks=305363/2180192 io=136884/1090097 bytes/io=0.40K/0.40K
For yesterday:
bro@manager:/usr/local/bro$ zgrep -i fill logs/2015-07-01/communication.*
logs/2015-07-01/communication.00:15:40-00:20:54.log.gz:1435709777.104131 manager child - - - error queue to parent filling up; shutting down heaviest connection
logs/2015-07-01/communication.00:15:40-00:20:54.log.gz:1435709777.104131 manager child - - - error queue to parent filling up; shutting down heaviest connection
logs/2015-07-01/communication.07:26:19-07:31:10.log.gz:1435735579.660321 manager child - - - error queue to parent filling up; shutting down heaviest connection
logs/2015-07-01/communication.07:26:19-07:31:10.log.gz:1435735799.681583 manager child - - - error queue to parent filling up; shutting down heaviest connection
logs/2015-07-01/communication.10:11:06-10:15:02.log.gz:1435745466.524085 manager child - - - error queue to parent filling up; shutting down heaviest connection
logs/2015-07-01/communication.10:11:06-10:15:02.log.gz:1435745689.339628 manager child - - - error queue to parent filling up; shutting down heaviest connection
logs/2015-07-01/communication.10:11:06-10:15:02.log.gz:1435745689.339628 manager child - - - error queue to parent filling up; shutting down heaviest connection
logs/2015-07-01/communication.10:11:06-10:15:02.log.gz:1435745690.081825 manager child - - - error queue to parent filling up; shutting down heaviest connection
logs/2015-07-01/communication.10:11:06-10:15:02.log.gz:1435745690.888470 manager child - - - error queue to parent filling up; shutting down heaviest connection
logs/2015-07-01/communication.10:11:06-10:15:02.log.gz:1435745690.888470 manager child - - - error queue to parent filling up; shutting down heaviest connection
logs/2015-07-01/communication.10:11:06-10:15:02.log.gz:1435745690.888470 manager child - - - error queue to parent filling up; shutting down heaviest connection
logs/2015-07-01/communication.10:15:02-10:21:46.log.gz:1435745702.430527 manager child - - - error queue to parent filling up; shutting down heaviest connection
logs/2015-07-01/communication.10:15:02-10:21:46.log.gz:1435745702.430527 manager child - - - error queue to parent filling up; shutting down heaviest connection
logs/2015-07-01/communication.10:15:02-10:21:46.log.gz:1435745702.430527 manager child - - - error queue to parent filling up; shutting down heaviest connection
logs/2015-07-01/communication.10:15:02-10:21:46.log.gz:1435745734.889231 manager child - - - error queue to parent filling up; shutting down heaviest connection
logs/2015-07-01/communication.10:15:02-10:21:46.log.gz:1435745734.889231 manager child - - - error queue to parent filling up; shutting down heaviest connection
logs/2015-07-01/communication.10:15:02-10:21:46.log.gz:1435745736.777859 manager child - - - error queue to parent filling up; shutting down heaviest connection
logs/2015-07-01/communication.10:15:02-10:21:46.log.gz:1435745736.777859 manager child - - - error queue to parent filling up; shutting down heaviest connection
logs/2015-07-01/communication.10:15:02-10:21:46.log.gz:1435745736.777859 manager child - - - error queue to parent filling up; shutting down heaviest connection
logs/2015-07-01/communication.10:15:02-10:21:46.log.gz:1435745737.839721 manager child - - - error queue to parent filling up; shutting down heaviest connection
logs/2015-07-01/communication.10:15:02-10:21:46.log.gz:1435745737.839721 manager child - - - error queue to parent filling up; shutting down heaviest connection
logs/2015-07-01/communication.10:15:02-10:21:46.log.gz:1435745738.539531 manager child - - - error queue to parent filling up; shutting down heaviest connection
logs/2015-07-01/communication.10:15:02-10:21:46.log.gz:1435745738.821476 manager child - - - error queue to parent filling up; shutting down heaviest connection
logs/2015-07-01/communication.16:05:06-16:10:40.log.gz:1435766942.335851 manager child - - - error queue to parent filling up; shutting down heaviest connection
logs/2015-07-01/communication.19:39:56-19:45:50.log.gz:1435779596.437060 manager child - - - error queue to parent filling up; shutting down heaviest connection
logs/2015-07-01/communication.19:39:56-19:45:50.log.gz:1435779596.437060 manager child - - - error queue to parent filling up; shutting down heaviest connection
logs/2015-07-01/communication.19:45:50-19:53:35.log.gz:1435779950.225082 manager child - - - error queue to parent filling up; shutting down heaviest connection
logs/2015-07-01/communication.19:45:50-19:53:35.log.gz:1435779950.225082 manager child - - - error queue to parent filling up; shutting down heaviest connection
logs/2015-07-01/communication.19:45:50-19:53:35.log.gz:1435779990.764378 manager child - - - error queue to parent filling up; shutting down heaviest connection
logs/2015-07-01/communication.19:45:50-19:53:35.log.gz:1435779990.764378 manager child - - - error queue to parent filling up; shutting down heaviest connection
logs/2015-07-01/communication.19:45:50-19:53:35.log.gz:1435779990.764378 manager child - - - error queue to parent filling up; shutting down heaviest connection
logs/2015-07-01/communication.19:45:50-19:53:35.log.gz:1435780017.662903 manager child - - - error queue to parent filling up; shutting down heaviest connection
And if you haven't already, take a look at reporter.log for any scripting errors, as certain kinds of errors can cause memory leaks.
There's a regular stream of:
1435796386.319295 Reporter::ERROR incomplete base64 group, padding with 12 bits of 0 (empty)
1435796386.701326 Reporter::ERROR character 45 ignored by Base64 decoding (empty)
1435796386.701326 Reporter::ERROR incomplete base64 group, padding with 18 bits of 0 (empty)
1435796388.426591 Reporter::ERROR extra base64 groups after '=' padding are ignored (empty)
1435796388.426591 Reporter::ERROR incomplete base64 group, padding with 18 bits of 0 (empty)
1435796397.701170 Reporter::ERROR character 45 ignored by Base64 decoding (empty)
1435796397.701170 Reporter::ERROR incomplete base64 group, padding with 18 bits of 0 (empty)
1435796400.416380 Reporter::ERROR extra base64 groups after '=' padding are ignored (empty)
1435796400.416380 Reporter::ERROR incomplete base64 group, padding with 6 bits of 0 (empty)
And occasionally messages for SumStat (port-scan, addr-scan,
detect-sql-attackers, traceroute-detection):
0.000000 Reporter::WARNING SumStat key request for the KoufzswOJrf SumStat uid took longer than 1 minute and was automatically cancelled. /usr/local/bro/share/bro/base/frameworks/sumstats/./cluster.bro, line 218
0.000000 Reporter::WARNING SumStat key request for the tVVpSQcL7Ac SumStat uid took longer than 1 minute and was automatically cancelled. /usr/local/bro/share/bro/base/frameworks/sumstats/./cluster.bro, line 218
0.000000 Reporter::WARNING SumStat key request for the fvaD58lPrkh SumStat uid took longer than 1 minute and was automatically cancelled. /usr/local/bro/share/bro/base/frameworks/sumstats/./cluster.bro, line 218
0.000000 Reporter::WARNING SumStat key request for the mHN4TlfikJ6 SumStat uid took longer than 1 minute and was automatically cancelled. /usr/local/bro/share/bro/base/frameworks/sumstats/./cluster.bro, line 218
0.000000 Reporter::WARNING Manager received a uid for an unknown request. SumStat: addr-scan, Key: [str=55554/tcp, host=x.x.x.x] /usr/local/bro/share/bro/base/frameworks/sumstats/./cluster.bro, line 379
0.000000 Reporter::WARNING Manager received a uid for an unknown request. SumStat: addr-scan, Key: [str=445/tcp, host=x.x.x.x] /usr/local/bro/share/bro/base/frameworks/sumstats/./cluster.bro, line 379
0.000000 Reporter::WARNING Manager received a uid for an unknown request. SumStat: addr-scan,