Stand-alone cluster problems

Hi,

Just wondering if anyone has run into this problem. I'm running Robin's 1.4 cluster code on a stand-alone AMD dual-core machine that is monitoring a 200 Mbit connection. I've been running this setup for a couple months, and it has been working well.

I noticed that Bro seemed to be missing some packets, and Seth told me to look for DroppedPackets in the notice.log. From what I recall, I was dropping upwards of 90% of the packets after filtering. The strange thing was that the primary bro process's CPU usage seemed low (20-30%), even though it was dropping most of the packets. I would have expected the CPU to be high trying to keep up. Is there some throttling mechanism to prevent the CPU from being maxed out?

I turned on some restrict filters to bring the DroppedPackets down. I turned off various things like HTTP, HTTPS, DNS, IMAP, SMTP, and a few others. After that, the ratio of packets dropped after filtering to packets received was less than one percent.

So that is the history, and here is my problem. When I start the cluster, the above-mentioned ratio will be less than 1%. It remains less than 1% for several days to a week. Then, for no explicable reason, it will jump as high as 99% and stay stuck there. This has happened twice in the last couple of weeks. Whatever caused this problem also caused the logs to stop being rotated; they are showing a timestamp of June 7th (MST). When I run a cluster status, it shows the cluster is running, and I can see both bro-1.4-robin processes running, but their CPU usage has dropped to 0.00%. I think the CPU usage for one of those processes had typically been around 16-20% when the dropped ratio was less than 1%.

Restarting the cluster should clear the problem again for a few days. Is there any other troubleshooting I can do before restarting to determine the cause of the problem?

Below are a few lines showing the high ratio of dropped packets. These are some of the last lines logged, so based on the timestamps, everything stopped around 01:48 6/8/09 GMT.

1244425212.229043:DroppedPackets:NOTICE_FILE:bro::::::::::317741 packets dropped after filtering, 387325 received, 1117174 on link::@26-c775-10c8ed
1244425222.229050:DroppedPackets:NOTICE_FILE:bro::::::::::16000 packets dropped after filtering, 68358 received, 192124 on link::@26-c775-10c91b
1244425236.737126:DroppedPackets:NOTICE_FILE:bro::::::::::311615 packets dropped after filtering, 341963 received, 979510 on link::@26-c775-10c939
1244425247.488693:DroppedPackets:NOTICE_FILE:bro::::::::::79770 packets dropped after filtering, 172843 received, 477468 on link::@26-c775-10c981
1244425261.942659:DroppedPackets:NOTICE_FILE:bro::::::::::315110 packets dropped after filtering, 315358 received, 878859 on link::@26-c775-10c987
1244425690.711860:DroppedPackets:NOTICE_FILE:bro::::::::::10878167 packets dropped after filtering, 10878305 received, 31316595 on link::@26-c775-10c994

Here are some earlier logs showing what the ratio normally looks like.

1243649373.579172:DroppedPackets:NOTICE_FILE:bro::::::::::1752 packets dropped after filtering, 333996 received, 612544 on link::@88-1bd3-bd5af
1243649383.579295:DroppedPackets:NOTICE_FILE:bro::::::::::1333 packets dropped after filtering, 342521 received, 627890 on link::@88-1bd3-bd5c2
1243649393.579339:DroppedPackets:NOTICE_FILE:bro::::::::::920 packets dropped after filtering, 326511 received, 605614 on link::@88-1bd3-bd5d1
1243649403.579424:DroppedPackets:NOTICE_FILE:bro::::::::::1336 packets dropped after filtering, 318679 received, 615016 on link::@88-1bd3-bd5ec
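For what it's worth, the ratio can be pulled straight out of these notice lines; a sketch (the sample line below is copied from the high-drop notices above, and in practice you would feed the whole notice.log through the same awk program):

```shell
# Sketch: compute the dropped/received ratio from a DroppedPackets notice.
# The field layout is taken from the sample notice lines in this thread.
awk '/DroppedPackets/ {
    n = split($0, w, /[ ,:]+/)                     # counters become standalone tokens
    for (i = 1; i <= n; i++) {
        if (w[i] == "dropped")  dropped  = w[i-2]  # "<N> packets dropped"
        if (w[i] == "received") received = w[i-1]  # "<M> received"
    }
    if (received > 0)
        printf "%.1f%% dropped (%s of %s)\n", 100 * dropped / received, dropped, received
}' <<'EOF'
1244425212.229043:DroppedPackets:NOTICE_FILE:bro::::::::::317741 packets dropped after filtering, 387325 received, 1117174 on link::@26-c775-10c8ed
EOF
```

On the sample line this prints "82.0% dropped (317741 of 387325)".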

Tyler

Hi Tyler,

if I understand you correctly, there are actually two problems:

- Bro is dropping many packets even when running at rather low CPU

- after a few days, Bro hangs with 99% CPU and stalls.

Is that correct?

Regarding the former: generally, at 20-30% CPU Bro shouldn't drop any
significant number of packets, and there's no throttling mechanism or
anything like that. One guess here would be the operating system. What
kind of system are you running on? Have you tried the tuning described
at http://www.net.t-labs.tu-berlin.de/research/bpcs/? Another question:
is there any regularity in the timestamps of when the drops occur, like
at regular intervals? (But intervals longer than 10s, as that's just
the reporting interval.)
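One way to check that (a sketch; the timestamps below are shortened copies of notices from the previous mail, and in practice you would feed the whole notice.log in):

```shell
# Sketch: print the gap between consecutive DroppedPackets notices, to spot
# intervals well above the 10s reporting interval.
awk -F: '/DroppedPackets/ {
    if (prev)                                     # skip the very first notice
        printf "%.1fs since previous notice\n", $1 - prev
    prev = $1                                     # field 1 is the epoch timestamp
}' <<'EOF'
1244425212.229043:DroppedPackets:sample
1244425222.229050:DroppedPackets:sample
1244425236.737126:DroppedPackets:sample
EOF
```

On this sample it prints 10.0s and then 14.5s.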

Regarding the latter, it would be good to know in which part of the
code Bro hangs. To find that out, can you do the following the next
time it happens:

- Attach a gdb with "gdb /path/to/bro/binary process-id-of-the-main-Bro-process"
- Send me a stack backtrace from gdb's "bt" command

I wouldn't be totally surprised if the state checkpointing is the
culprit. To test that, can you remove the line "@load checkpoint"
from cluster.bro?
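One low-risk way to do that (a sketch; the two-line file here is just a stand-in, since the location and contents of cluster.bro depend on the install):

```shell
# Sketch: comment out the checkpoint load rather than deleting the line,
# so it is easy to re-enable later. /tmp/cluster.bro is a stand-in file.
printf '@load weird\n@load checkpoint\n' > /tmp/cluster.bro
sed 's/^@load checkpoint$/# @load checkpoint/' /tmp/cluster.bro
```

On a real install you would edit cluster.bro in place (keeping a backup) and restart the cluster.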

Robin

Robin Sommer wrote:

> if I understand you correctly, there are actually two problems:
>
> - Bro is dropping many packets even when running at rather low CPU

Hi Robin,
Yes, that is the way it seemed when I didn't have the restrict filters turned on. When the cluster started, the CPU for the Bro process would be high, but it would drop to 20-40% even though many packets were being dropped after filtering.

> - after a few days, Bro hangs with 99% CPU and stalls.

Partially correct. Bro appears to be hanging, but the CPU is at 0%, and the DroppedPackets/received ratio was banging against 99% just before it started to hang. I haven't restarted the cluster yet, so here is the backtrace.

Lines from top:
   PID USERNAME THR PRI NICE SIZE RES STATE C TIME WCPU COMMAND
51061 XXXXXX 1 -20 0 1207M 843M swread 1 606:53 0.00% bro-1.4-rob
51082 XXXXXX 1 44 5 31556K 228K select 0 19:29 0.00% bro-1.4-rob

I tried attaching to the process with the large TIME value. Is that the primary one?

$gdb `which bro-1.4-robin` 51061
(gdb) bt
#0 0x081d0e96 in free (mem=0xd724e28) at malloc.c:4229
#1 0x285cfc01 in operator delete () from /usr/lib/libstdc++.so.6
#2 0x080a8f0a in ~Dictionary (this=0x99cd4a0) at Dict.cc:101
#3 0x081c7348 in ~TableEntryValPDict (this=0x99cd4a0) at Val.h:49
#4 0x081c42ac in ~TableVal (this=0x99cd408) at Val.cc:1697
#5 0x081c0e28 in TableVal::DoExpire (this=0x8669d60, t=1244434191.756459)
     at Obj.h:213
#6 0x081a9be2 in PQ_TimerMgr::DoAdvance (this=0x82f2a18,
     new_t=1244434191.756459, max_expire=300) at Timer.cc:164
#7 0x0813ff09 in expire_timers (src_ps=0x90495a0) at Net.cc:392
#8 0x0813ffbd in net_packet_dispatch (t=1244434191.756459, hdr=0x90495d8,
     pkt=0x9049a6a "", hdr_size=14, src_ps=0x90495a0, pkt_elem=0x0)
     at Net.cc:412
#9 0x08140549 in net_packet_arrival (t=1244434191.756459, hdr=0x90495d8,
     pkt=0x9049a6a "", hdr_size=14, src_ps=0x90495a0) at Net.cc:496
#10 0x0814ef1f in PktSrc::Process (this=0x90495a0) at PktSrc.cc:199
#11 0x081402b5 in net_run () at Net.cc:526
#12 0x080501be in main (argc=454545480, argv=0xbfbfeb28) at main.cc:1056

Here is the bt from the other process just in case it helps.
$gdb `which bro-1.4-robin` 51082
(gdb) bt
#0 0x286f8da3 in select () from /lib/libc.so.7
#1 0x081617fa in SocketComm::Run (this=0xbfbfe770) at RemoteSerializer.cc:2743
#2 0x0816629a in RemoteSerializer::Fork (this=0x82fa580)
     at RemoteSerializer.cc:600
#3 0x081664aa in RemoteSerializer::Init (this=0x82fa580)
     at RemoteSerializer.cc:525
#4 0x0804fbab in main (argc=-2147483647, argv=0xbfbfeb28) at main.cc:956

Is that correct?

> Regarding the former, generally at 20-30% CPU Bro shouldn't drop any
> significant amount of packets, there's no throttling mechanism or
> such. One guess here would be the operating system. What kind of
> system are you running on? Have you tried the tuning described on
> http://www.net.t-labs.tu-berlin.de/research/bpcs/?

I'm running FreeBSD 7.1 on i386. I had tried tuning based on the Bro Wiki, but the following page showed sysctl debug.bpf_bufsize and sysctl debug.bpf_maxbufsize, and those commands didn't work in FreeBSD 7.1:
http://www.bro-ids.org/wiki/index.php/User_Manual:_Performance_Tuning

The above tu-berlin.de link shows the following:

sysctl -w net.bpf.bufsize=10485760 (10M)
sysctl -w net.bpf.maxbufsize=10485760 (10M)

The Bro-Workshop-July07-tierney.ppt showed that the following should be added to /etc/sysctl.conf:

net.bpf.bufsize=4194304 (4M)
net.bpf.maxbufsize=8388608 (8M)

Based on these two examples, I am guessing bufsize is the initial buffer size and maxbufsize is how large it can grow.

Here are my default values:
$sysctl -a |grep net.bpf
net.bpf.maxbufsize: 524288
net.bpf.bufsize: 4096

According to the FreeBSD 7.1 manpage for sysctl, "The -w option has been deprecated and is silently ignored". I'll try setting both to 10M, as in the link you sent.

$sysctl net.bpf.bufsize=10485760
$sysctl net.bpf.maxbufsize=10485760

I also added those values to the /etc/sysctl.conf so they get set on reboot.
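For reference, the persistent form in /etc/sysctl.conf looks like this (same 10M values as the commands above):

```
# /etc/sysctl.conf additions: BPF buffer tuning (10 MB each)
net.bpf.bufsize=10485760
net.bpf.maxbufsize=10485760
```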

I just restarted the cluster, and the bro-1.4-robin process is sitting at 11-13%. The DroppedPackets/received ratio is fluctuating between 3 and 25%. Shouldn't the CPU be maxing out before packets get dropped?

> Another question:
> is there any regularity in the timestamps of when the drops occur?
> Like in regular intervals? (But longer intervals than 10s as that's
> just the reporting interval).

In the previous email, it looks like the intervals were 10s, but there was a gap of about seven minutes after epoch 1244425261.942659, which is right before the cluster froze. I'll try to keep an eye out for that if it happens again.

> I wouldn't be totally surprised if the state checkpointing is the
> culprit. To test that, can you remove the line "@load checkpoint"
> from cluster.bro?

I haven't tried this yet. I'll see if the bpf buffer increase helps. If not, I'll try unloading checkpoint.bro.

Tyler

Hi Tyler,

a number of thoughts:

- increasing the BPF buffer sizes is definitely important and should
generally reduce drops.

- the top output shows that the Bro process is actually swapping out
memory to disk:

  PID USERNAME THR PRI NICE SIZE RES STATE C TIME WCPU COMMAND
51061 XXXXXX 1 -20 0 1207M 843M swread 1 606:53 0.00%

Once that happens, pretty much all bets are off regarding drops: the
I/O load will dominate everything else, and Bro will almost certainly
drop tons of packets.

- the stack backtrace shows a piece of code where we just found a
problem in the version checked into my branch, which has the
potential to cause drops with large tables. I can't tell whether
it's a coincidence or indeed causing trouble in your case. In any
case, I've committed a fix; please update your working copy and
recompile to see whether that makes any difference.
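To keep an eye on the swapping condition, a quick check (a sketch; $$, i.e. the current shell, stands in for the main Bro PID):

```shell
# Sketch: compare a process's virtual size (VSZ) to its resident size (RSS).
# A large gap, together with a "swread" state in top as in the output above,
# means the process is paging. Replace $$ with the Bro process id.
ps -o pid= -o vsz= -o rss= -p $$
```

If RSS stays far below VSZ while the state shows swread, reducing Bro's memory footprint (or adding RAM) matters more than any BPF tuning.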

Robin

Robin,

I pulled the svn code from 6/15, and got the following error when running make:

Generating code for SSLRecordLayerFlow
make all-am
g++ -DHAVE_CONFIG_H -I. -I.. -I. -I../aux/binpac/lib -I../src -I. -I.. -O -I/usr/local/include -W -Wall -Wno-unused -g -O2 -MT main.o -MD -MP -MF .deps/main.Tpo -c -o main.o main.cc
In file included from Frame.h:11,
                  from main.cc:31:
Val.h:853: error: extra qualification 'TableVal::' on member 'RecursiveSize'
main.cc: In function 'int main(int, char**)':
main.cc:452: warning: deprecated conversion from string constant to 'char*'
*** Error code 1

Am I missing some dependency?

In the meantime, I'll try unloading some of the scripts to drop the memory usage.

Thanks,

Tyler

Robin Sommer wrote:

That got introduced by the recent merge; depending on the gcc
version, it does or does not generate an error. Now fixed.

Robin