A more parallel Bro

Robin Sommer wrote:

I've been doing some performance profiling on Bro.

Interesting! Do you have any results you could share?

I saw Vern's talk a while back at UIUC about how Bro works. As I recall, there
were three major components:

1. A packet capture engine, based on libpcap, which makes use of BPF to
selectively discard packets as early as practical (to reduce the overall quantity of
traffic),
2. Protocol parsers, which turn blocks of data from the network into
semantically-meaningful PDU-like data structures, and
3. An event engine, which applies constraints to the lexed protocol data, and
possibly raises alarms if the traffic is suspicious or otherwise "interesting".
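
To make the capture stage concrete, here is roughly what that first component boils down to in libpcap terms (my own standalone sketch, not Bro's actual code; the interface name and filter string are just placeholders):

    // Open an interface and install a BPF filter so uninteresting packets are
    // dropped in the kernel, before they ever reach user space.
    #include <pcap.h>
    #include <cstdio>

    int main() {
        char errbuf[PCAP_ERRBUF_SIZE];
        pcap_t* p = pcap_open_live("eth0", 65535, 1, 100, errbuf);
        if (!p) { std::fprintf(stderr, "pcap_open_live: %s\n", errbuf); return 1; }

        struct bpf_program prog;
        if (pcap_compile(p, &prog, "tcp port 80 or udp port 53", 1, 0) < 0
            || pcap_setfilter(p, &prog) < 0) {
            std::fprintf(stderr, "filter: %s\n", pcap_geterr(p));
            return 1;
        }

        // Everything that survives the filter gets handed to the next stage
        // (the protocol parsers, in Bro's case).
        pcap_loop(p, -1,
            [](u_char*, const pcap_pkthdr* hdr, const u_char*) {
                std::printf("captured %u bytes\n", hdr->caplen);
            },
            nullptr);
        return 0;
    }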

My group initially decided to pursue protocol parsing, because our micro-
benchmarks (many of which appeared in our RAID '08 paper) showed that
protocol parser-generators (e.g. binpac and Nikita et al.'s GAPA) leave
significant performance on the table versus a hand-coded parser. In particular,
binpac's use of C++ objects to represent PDUs really suffers on protocols
which feature small, deeply-nested messages, like DNS.
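
As a caricature of where that overhead comes from (my illustration of the allocation pattern, not binpac's actual generated code): a generated parser tends to materialize one heap-allocated object per parsed unit, so a single DNS reply with dozens of records and nested names turns into hundreds of small allocations per packet.

    #include <cstdint>
    #include <memory>
    #include <string>
    #include <vector>

    // One object per name, per resource record, per message -- a reply with
    // 50+ answers means a burst of small allocations for every packet.
    struct DNSName { std::vector<std::string> labels; };
    struct DNSRR {
        std::unique_ptr<DNSName> name;
        uint16_t qtype = 0, qclass = 0;
        uint32_t ttl = 0;
        std::vector<uint8_t> rdata;
    };
    struct DNSMessage {
        std::vector<std::unique_ptr<DNSRR>> answers;
        // ... questions, authority, additional ...
    };

A hand-coded parser can instead walk the wire format in place and copy out only the fields the analysis actually needs.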

However, the ultimate question is how much the performance gap between a
hand-coded parser vs. a machine-generated one matters in the context of a
real system. So I built a simple tool I call "dns-packet-gen", which generates
pathological DNS packets constructed for their difficulty in parsing. The tool
emits pcap traces containing DNS traffic with many records (50+), all of which
feature deeply-recursive DNS names. For the record, Wireshark routinely took
an hour or more to load traces of 1M of these packets.
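
The core of it is the DNS name-compression trick: a name may end in a 2-byte pointer back into the message, so each record's name can add one label and then point at the previous name, forcing any parser that fully expands names to re-walk an ever-longer chain for every record. A stripped-down illustration of the idea (not the actual generator code; headers, record types, and rdata are omitted):

    #include <cstdint>
    #include <vector>

    // Build a chain of compressed names: name i is one label followed by a
    // pointer to name i-1, so expanding name i costs i pointer hops.
    std::vector<uint8_t> build_name_chain(int n_names) {
        std::vector<uint8_t> msg;
        std::vector<uint16_t> offsets;
        for (int i = 0; i < n_names; ++i) {
            uint16_t off = static_cast<uint16_t>(msg.size());
            msg.push_back(3);                       // label length
            msg.push_back('a'); msg.push_back('b'); msg.push_back('c');
            if (offsets.empty()) {
                msg.push_back(0);                   // first name ends at the root
            } else {
                uint16_t prev = offsets.back();
                msg.push_back(static_cast<uint8_t>(0xC0 | (prev >> 8)));  // pointer to the
                msg.push_back(static_cast<uint8_t>(prev & 0xFF));         // previous, longer name
            }
            offsets.push_back(off);
        }
        return msg;
    }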

Based on my understanding of the Bro code, it looks like parsing "normal" DNS
traffic (real DNS scraped from my own web browsing) only accounts for 4-5%
of all CPU time (I used Shark on OS X with a debug build, and checked the
sampled CPU time of all descendants of UDP_Analyzer::DeliverPacket()). I'm
using a version of the "detect backdoors" script which includes DPD, which
might have skewed the results a little. (If anyone has comments on this, I'd
appreciate hearing them). In either case, my bad DNS traffic only pushed the
binpac-based analyzers up to 13-14% of CPU time, which is still comparatively
tiny in relation to the EventMgr's CPU time.

What I take away from this is that, even if binpac is slow, it's not the piece that
most demands attention. So I started thinking about the systems-level issues
(the Shunt paper took this same direction), and realized the main event loop is
single-threaded. And that got me to where I am today :-)

I'm wondering what the ICSI folks' position is on threads vs. clustering.

In short: we consider them orthogonal and are pursuing both. :-)

To elaborate a bit: The cluster is the approach that can provide
significantly increased performance right now; ignoring a few
implementation glitches we still need to sort out, it's working and
ready to use[1]. The cluster is also an approach that will work
long-term: there will always be limits to what a single box can do
(multi-core or not) and the cluster offers the capability to go
beyond that.

You're right about scale, modulo communications issues. However, I think it's
important for a large-scale system to be designed in such a way as to gain
more performance as the architecture guys throw faster and faster chips over
the wall. The way I read the tea leaves, the number of CPU cores in commodity
PCs is only going to grow, so software that isn't built around a multi-
threaded event loop is going to hit a performance wall even as CPUs get
faster.

That said, we are in parallel (no pun intended :-) working on a
multi-threaded Bro implementation. While we have already made quite
a bit of progress on that, it will still take a while to get
into any kind of usable state; we need to restructure Bro internally
quite a bit to exploit its concurrency potential. Once finished
though, we expect Bro's performance to scale pretty well with the
number of available cores (not considering other aspects of the
hardware which might turn into bottlenecks at some point). The
Sarnoff paper has some of the basic ideas for the multi-threaded
Bro, and there'll be an extended journal version of that coming out
soon with more details on the parallel execution model (let me know
if you'd like to get a draft).

Does that answer your question? If you have any particular thoughts
on either clustering or the multi-threaded Bro, I'd be happy to hear
them.

I haven't explored how threadable Bro would be, but it seems you're already on
it. Truthfully, I spent a few weeks picking through the code to convince myself
I knew roughly how to profile it; I'm no expert on Bro internals. I'm going to
have a chat with Nikita soon (my academic adviser) to see what he thinks about
all this, and I'll be in touch. Please send the draft of the Sarnoff paper to my
UIUC email address; I'd love to have a look.

I'm wondering how you plan to address cache behavior in your multi-threaded
implementation of Bro. Obviously, this is a really hard problem, and I'm not
sure how well-studied it is in the literature, especially in the context of IDS.
Fortunately, I have a lot of friends here that do high-performance computing
work (their long-range research agenda includes a 3000-core CPU tapeout), so
I'll talk to them about it.

David

It is possible to take advantage of a multi-CPU system with the current Bro. I am running a 4-CPU test system now that runs 4 Bro instances on a 10GigE interface, using a modification of the odd/even filter that was suggested on the Bro wiki in the User Manual: Performance Tuning section.

It spreads the load across the CPUs.
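
For reference, the two-way version of the split is just a pair of complementary BPF filters along these lines (offsets quoted from memory, so double-check against the wiki page); the four-instance version masks more bits of the sum:

    "even" Bro:  (ip[15] + ip[19]) & 1 = 0
    "odd"  Bro:  (ip[15] + ip[19]) & 1 = 1

Since the sum of the source and destination address bytes is symmetric, both directions of a connection end up at the same Bro instance.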

Bill Jones

I'm running literally dozens of Bros on an 8-core system, each listening
to its own lightly loaded virtual interface. It's a fairly specialized
internal monitoring application where lightweight probes send internal
traffic to a centralized system, which then pushes each probe's traffic
onto its own virtual interface, which is then listened to by a Bro. In
the future, I may implement a master Bro that coordinates all
the baby Bros, but I'll probably wait for Robin to get the kinks out of
the cluster code first.

Hope this helps.

William L. Jones wrote:

real system. So I built a simple tool I call "dns-packet-gen", which generates
pathological DNS packets constructed for their difficulty in parsing. The tool
emits pcap traces containing DNS traffic with many records (50+), all of which
feature deeply-recursive DNS names.

Neat. Is that tool available?

appreciate hearing them). In either case, my bad DNS traffic only pushed the
binpac-based analyzers up to 13-14% of CPU time, which is still comparatively
tiny in relation to the EventMgr's CPU time.

Yeah. Some of our recent measurements indicate that for some
protocols (including DNS) script interpretation dominates the
performance, with the analyzers being less of an issue. That's why
our first target for the multi-threaded Bro is to parallelize
script interpretation.

You're right about scale, modulo communications issues. However, I think it's
important for a large-scale system to be designed in such a way as to gain
more performance as the architecture guys throw faster and faster chips over
the wall.

Definitely. As I said, we're interested in both, from a research
perspective as well as for operations.

I'm wondering how you plan to address cache behavior in your multi-threaded
implementation of Bro. Obviously, this is a really hard problem, and I'm not
sure how well-studied it is in the literature, especially in the context of IDS.

That's an excellent question and I don't think it has been studied
in the IDS context yet. Our main strategy is to keep related state
together, e.g., by running all analysis involving the same
connection on the same core. It's hard to predict though how cache
effects will eventually turn out to hurt us, and we expect to do a
number of measurements in this regard once the multi-threaded Bro is
in a shape that we can actually measure something. Going from there,
I'm sure there'll be lots of further optimization potential.
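
Very roughly, the kind of dispatch we have in mind looks like the following sketch (just to illustrate the flow-affinity idea; the names and the hash are made up, and it's not actual Bro code):

    #include <algorithm>
    #include <cstdint>
    #include <functional>

    struct ConnTuple {
        uint32_t src_ip, dst_ip;
        uint16_t src_port, dst_port;
        uint8_t  proto;
    };

    // Pick a worker from the connection 5-tuple. Endpoints are ordered first,
    // so both directions of a connection map to the same worker and its
    // per-connection state stays local to one core's cache.
    size_t worker_for(const ConnTuple& c, size_t n_workers) {
        uint64_t a = (uint64_t(c.src_ip) << 16) | c.src_port;
        uint64_t b = (uint64_t(c.dst_ip) << 16) | c.dst_port;
        uint64_t lo = std::min(a, b);
        uint64_t hi = std::max(a, b);
        uint64_t h  = std::hash<uint64_t>{}(lo ^ (hi * 0x9E3779B97F4A7C15ULL) ^ c.proto);
        return h % n_workers;
    }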

Fortunately, I have a lot of friends here that do high-performance computing
work (their long-range research agenda includes a 3000-core CPU tapeout), so
I'll talk to them about it.

I'd be glad to hear what they have to say as I'm not very familiar
with this space myself either.

Robin

That's indeed a solution too, and in fact the cluster shell supports
such a setup out of the box: if you configure it to install multiple
backends on a single host, it will set things up accordingly (except
for your custom BPF filter, which one would still need to configure
manually).

I don't have much experience with such a setup though. One thing I'd
like to know is how the capture performance is when running multiple
Bros. Is it a problem to have multiple, high-load packet capturing
processes running at the same time, or is the kernel able to handle
that just fine?

Robin