BroPing Connection Failure

Hi all,

I am using Bro 1.4 stable on Linux and I'm having problems with Broccoli. On one machine running Ubuntu, everything works fine. But on another machine with a custom Linux distribution, I have trouble connecting to Bro. The behaviour is not very consistent.

My configure options are --without-openssl, --disable-select-loop and --enable-debug.

After compilation, I run in one terminal:
$ src/bro -i eth0 -C aux/broccoli/test/broping.bro

And in a second terminal:
$ aux/broccoli/test/broping -c 1

Most of the time, this fails and the error message is:
"Could not connect to Bro at 127.0.0.1:47758."

The TCP connection, however, is fully established, as I can verify with tcpdump. The client is the one that sends the first FIN to tear the connection down.

Sometimes the connection can be established. Attached you will find the remote.log of a successful (first) and an unsuccessful attempt. It looks like the handshake could not be completed.

I further tried to debug by running
$ aux/broccoli/test/broping -d -c 1
or
$ strace aux/broccoli/test/broping -c 1
but in both cases, it was not possible to reproduce the error.

It looks like some kind of race condition. Does anyone have an explanation for this behaviour or a clue about what the cause could be? In case you need more information, just let me know.

Regards,

Fabian

libc 2.9
libm 2.9
Linux 2.6.26
i686

remote.log (2.45 KB)

Sounds like a nasty race condition of some sort. The remote.log only
shows that something weird is going on but isn't detailed enough to
understand what's causing it. Please enable debugging output on both
sides. For Broccoli, see here:
http://www.icir.org/christian/broccoli/manual/c84.html#AEN814

For Bro, configure with --enable-debug and then run with "-B comm".
That should produce a debug.log with lots of information.

It would also be good if you could try it with the current
development version from SVN to see if the problem still occurs with
that one.

Robin

Thanks for the fast response, Robin.

Robin Sommer wrote:

Sounds like a nasty race condition of some sort. The remote.log only
shows that something weird is going on but isn't detailed enough to
understand what's causing it. Please enable debugging output on both
sides. For Broccoli, see here:
http://www.icir.org/christian/broccoli/manual/c84.html#AEN814

For Bro, configure with --enable-debug and then run with "-B comm".
That should produce a debug.log with lots of information.

I enabled debugging on the client with "-d -d" (bropingc.txt) and on the server with "-B comm" (bropings.txt). I also attached a tcpdump trace of the communication.

Even though the server seems to finish the handshake, it doesn't send anything back to the client (see tcpdump trace). As a result, the client times out.
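
The symptom described here, a TCP connection that is fully established while the client still reports a failed connect, can be reproduced in miniature. The following is a minimal Python sketch and not Broccoli's actual handshake: `silent_server` and `try_handshake` are hypothetical names, and the "hello" message is made up for illustration. It only shows that a peer which accepts the connection but never answers leaves the client waiting until its timeout expires.

```python
import socket
import threading


def silent_server(listener: socket.socket, hold: threading.Event) -> None:
    """Accept one connection, send nothing, and keep it open until released."""
    conn, _ = listener.accept()
    hold.wait(timeout=5)  # stay silent, like a peer that never answers
    conn.close()


def try_handshake(port: int, timeout: float = 0.5) -> str:
    """Connect over TCP, send a hello, and wait for an application-level reply."""
    with socket.create_connection(("127.0.0.1", port), timeout=timeout) as sock:
        # Reaching this point means the TCP three-way handshake succeeded.
        sock.sendall(b"hello\n")  # hypothetical handshake message
        try:
            reply = sock.recv(64)
        except socket.timeout:
            return "tcp established, handshake timed out"
        return "handshake ok" if reply else "peer closed"


if __name__ == "__main__":
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)
    hold = threading.Event()
    threading.Thread(target=silent_server, args=(listener, hold), daemon=True).start()
    print(try_handshake(listener.getsockname()[1]))
    hold.set()
    listener.close()
```

In a packet trace of this sketch the client is also the side that sends the first FIN, because it gives up and closes the socket when the timeout fires.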

I found that a build compiled without "--disable-select-loop" doesn't show these connection problems. I'm not sure whether this is related or a coincidence. I can use this as a workaround, but would prefer another solution.

It would also be good if you could try it with the current
development version from SVN to see if the problem still occurs with
that one.

The current development version of Bro 1.5 still shows the same behaviour.

Fabian

bropingc.txt.gz (2.97 KB)

bropings.txt.gz (871 Bytes)

broping.pcap.gz (338 Bytes)

Fabian Hugelshofer wrote:

Robin Sommer wrote:

Sounds like a nasty race condition of some sort. The remote.log only
shows that something weird is going on but isn't detailed enough to
understand what's causing it. Please enable debugging output on both
sides. For Broccoli, see here:

I enabled debugging on the client with "-d -d" (bropingc.txt) and on the server with "-B comm" (bropings.txt). I also attached a tcpdump trace of the communication.

Are you still looking at this? Do you have an idea about the cause of
this behaviour? Or a hint on where to start looking?

Fabian

Before digging into this, what's the reason for compiling with
--disable-select-loop? Seeing that the problem disappears without
that switch, the problem likely lies somewhere in the alternative
code path it enables. That code is rather old and rarely used these
days, and I'm thinking it can actually be removed completely in
future versions. So, what's your use case here?

Robin

Robin Sommer wrote:

Before digging into this, what's the reason for compiling with
--disable-select-loop? Seeing that the problem disappears without
that switch, the problem likely lies somewhere in the alternative
code path it enables. That code is rather old and rarely used these
days, and I'm thinking it can actually be removed completely in
future versions. So, what's your use case here?

I use it because without it, Bro causes high CPU usage, even when there isn't any traffic to analyse. That's about 30% CPU usage on a 2.4 GHz Pentium 4.

I read about using --disable-select-loop on http://bro-ids.org/wiki/index.php/User_Manual:_Performance_Tuning. It says there that Phil Wood's libpcap is buggy in non-blocking mode. I'm using Phil's libpcap (0.9.8.20081128).

Fabian

If you are running Bro 1.4 and loading listen-clear.bro, you could be encountering this issue:
   http://tracker.icir.org/bro/ticket/31

If that's where your problem is, it's fixed in trunk.

  .Seth

I see. That's actually pretty old and I don't remember what
specifically the problem was; it might be fixed with newer versions
of Phil's libpcap, don't know. Do you see the high CPU when using a
standard pcap?

The select-based loop was introduced specifically to combine pcap
input with simultaneous remote communication (like via Broccoli);
which is what you're doing it seems. The suggestion in the Wiki is
meant for configurations not involving any communication (even
though it doesn't say so ...), I wouldn't recommend using
--disable-select-loop when doing both as that won't work reliably.
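
The general idea of servicing packet input and a remote peer from a single loop can be sketched as follows. This is a minimal Python illustration and not Bro's actual code: `run_loop`, `packet_src` and `peer` are hypothetical names, and `socketpair()` merely stands in for the capture device and the Broccoli connection.

```python
import select
import socket


def run_loop(packet_src: socket.socket, peer: socket.socket,
             idle_timeout: float = 0.2) -> list:
    """Service both inputs from one loop until neither is readable for idle_timeout."""
    handled = []
    open_sources = {packet_src: "packet", peer: "peer"}
    while open_sources:
        # Sleep in the kernel until at least one descriptor is readable, so an
        # idle loop costs no CPU and neither input can starve the other.
        readable, _, _ = select.select(list(open_sources), [], [], idle_timeout)
        if not readable:
            break  # nothing arrived within idle_timeout
        for src in readable:
            data = src.recv(4096)
            if not data:  # the other end closed its side
                del open_sources[src]
                continue
            handled.append((open_sources[src], data))
    return handled


if __name__ == "__main__":
    # Stand-ins (assumption): one pair for packets, one for the remote peer.
    pkt_in, pkt_out = socket.socketpair()
    peer_in, peer_out = socket.socketpair()
    pkt_out.sendall(b"packet-1")
    peer_out.sendall(b"ping")
    for kind, data in run_loop(pkt_in, peer_in):
        print(kind, data)
    for s in (pkt_in, pkt_out, peer_in, peer_out):
        s.close()
```

A loop that instead blocks on only one of the two inputs handles the other one only when the first happens to wake it up, which is one way such a setup can become timing dependent.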

Robin

P.S.: And yes, please try Seth's fix to see whether that already helps.

Hi all,

Robin Sommer wrote:

I see. That's actually pretty old and I don't remember what
specifically the problem was; it might be fixed with newer versions
of Phil's libpcap, don't know. Do you see the high CPU when using a
standard pcap?

[...]

P.S.: And yes, please try Seth's fix to see whether that already helps.

I did some more experiments regarding this performance issue. Without
debug mode enabled, the CPU usage is not as high as I described
earlier. I am using our own policy scripts that include listen-clear.

From a shell script, I ran Bro with different binaries for 10 minutes and then read the CPU time it consumed with top. There wasn't a lot of traffic in the network.
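
A measurement of this kind can be approximated as follows. This is a rough Python sketch rather than the shell script and top readings actually used: `cpu_seconds` is a hypothetical helper, it relies on the Unix-only `resource` module, and it reports one combined figure instead of separate parent and child times.

```python
import resource
import subprocess
import sys
import time


def cpu_seconds(cmd: list, run_for: float) -> float:
    """Run cmd for run_for wall-clock seconds, stop it, and return its CPU time."""
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    proc = subprocess.Popen(cmd)
    time.sleep(run_for)
    proc.terminate()  # sends SIGTERM unless the process has already exited
    proc.wait()       # reaping the process adds its usage to RUSAGE_CHILDREN
    after = resource.getrusage(resource.RUSAGE_CHILDREN)
    # User plus system time. Processes forked by cmd are included only if
    # cmd itself waited for them before exiting.
    return (after.ru_utime - before.ru_utime) + (after.ru_stime - before.ru_stime)


if __name__ == "__main__":
    # A busy loop as a stand-in workload; replace with the command under test.
    busy = [sys.executable, "-c", "while True: pass"]
    print(f"{cpu_seconds(busy, 1.0):.2f}s CPU in 1s of wall-clock time")
```

For a run like the ones described here, `cmd` would be the Bro command line and `run_for` would be 600.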

Bro 1.4 noloop 13.90s (parent 13.76s, child 0.14s)
Bro 1.4 default 51.06s (parent 26.56s, child 24.50s)
Bro 1.4 Patch 115.41s (parent 115.28s, child 0.13s)

The best result overall is with "--disable-select-loop". With the default options, the child process uses quite a lot of CPU time. Seth's patch fixes that for the child, but the CPU usage of the parent increases.

Bro 1.4 PCAP+default 39.21s (parent 21.05s, child 18.16s)
Bro 1.4 PCAP+Patch 82.70s (parent 81.96s, child 0.74s)

As I mentioned, I am using Phil Wood's libpcap (the current one). With the default version of libpcap, the performance is a bit better than with Phil Wood's. Still, it's not great, and Seth's patch again increased the CPU usage of the parent process.

Bro 1.5 84.66s (parent 84.41s, child 0.25s)

For the current version from trunk, it is a bit better than with 1.4+patch. While being mainly idle, using blocking sockets is still the only acceptable solution for me. Bro can't use more than 10% CPU time for basically nothing.
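
To relate these totals to the 10% figure, the CPU seconds can be converted into an average share of one CPU over the run. The short Python sketch below does that, assuming each run lasted exactly 10 minutes of wall-clock time and that usage is measured against a single CPU; `cpu_percent` is a hypothetical helper name.

```python
RUN_SECONDS = 10 * 60  # assumption: each run lasted exactly 10 minutes

# Total CPU seconds per build, copied from the measurements above.
cpu_totals = {
    "Bro 1.4 noloop": 13.90,
    "Bro 1.4 default": 51.06,
    "Bro 1.4 Patch": 115.41,
    "Bro 1.4 PCAP+default": 39.21,
    "Bro 1.4 PCAP+Patch": 82.70,
    "Bro 1.5": 84.66,
}


def cpu_percent(cpu_seconds: float, wall_seconds: float = RUN_SECONDS) -> float:
    """Average share of one CPU used over the run, in percent."""
    return 100.0 * cpu_seconds / wall_seconds


for build, total in cpu_totals.items():
    print(f"{build:<22} {cpu_percent(total):5.1f}%")
```

Under those assumptions the noloop build averages a little over 2%, while both patched 1.4 builds and 1.5 average more than 10%.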

Is there a way of reducing the negative effects from Seth's patch? It's a bit irritating that the CPU usage for the parent increases. I wouldn't have expected that.

Fabian

What scripts are you loading? Any custom ones? I'd like to attempt to replicate the effects you're seeing.

Thanks,
   .Seth

Seth Hall wrote:

Is there a way of reducing the negative effects from Seth's patch? It's
a bit irritating that the CPU usage for the parent increases. I wouldn't
have expected that.

What scripts are you loading? Any custom ones? I'd like to attempt to replicate the effects you're seeing.

The results I showed you before were from my custom scripts. To exclude any unnecessary influence, I ran it again on the loopback device with just broping.bro:

src/bro -i lo -C aux/broccoli/test/broping.bro

The results:
noloop 13.26
default 89.68
patch 77.14
pcap 32.7
pcap+patch 50.2
1.5 87.45

The regression of the patch is visible with the default version of libpcap, but not with Phil Wood's. I'm not sure whether I had a calculation error there before (I don't have the results from the first run anymore).

Fabian