Robin Sommer wrote:
I've been doing some performance profiling on Bro.
Interesting! Do you have any results you could share?
I saw Vern's talk a while back at UIUC about how Bro works. As I recall, there
were three major components:
1. A packet capture engine, based on libpcap, which makes use of BPF to
selectively discard packets as early as practical (to reduce the overall quantity of
traffic),
2. Protocol parsers, which turn blocks of data from the network into
semantically-meaningful PDU-like data structures, and
3. An event engine, which applies constraints to the lexed protocol data, and
possibly raises alarms if the traffic is suspicious or otherwise "interesting".
My group initially decided to pursue protocol parsing, because our micro-
benchmarks (many of which appeared in our RAID '08 paper) showed that
protocol parser-generators (e.g. binpac and Nikita et al.'s GAPA) leave
significant performance on the table versus a hand-coded parser. In particular,
binpac's use of C++ objects to represent PDUs really suffers on protocols
which feature small, deeply-nested messages, like DNS.
However, the ultimate question is how much the performance gap between a
hand-coded parser vs. a machine-generated one matters in the context of a
real system. So I built a simple tool I call "dns-packet-gen", which generates
pathological DNS packets constructed for their difficulty in parsing. The tool
emits pcap traces containing DNS traffic with many records (50+), all of which
feature deeply-recursive DNS names. For the record, Wireshark routinely took
an hour or more to load traces of 1M of these packets.
Based on my understanding of the Bro code, it looks like parsing "normal" DNS
traffic (real DNS scraped from my own web browsing) only accounts for 4-5%
of all CPU time (I used Shark on OS X with a debug build, and checked the
sampled CPU time of all descendants of UDP_Analyzer::DeliverPacket()). I'm
using a version of the "detect backdoors" script which includes DPD, which
might have skewed the results a little. (If anyone has comments on this, I'd
appreciate hearing them). In either case, my bad DNS traffic only pushed the
binpac-based analyzers up to 13-14% of CPU time, which is still comparatively
tiny in relation to the EventMgr's CPU time.
What I take away from this is that, even if binpac is slow, it's not the piece that
most demands attention. So I started thinking about the systems-level issues
(the Shunt paper took this same direction), and realized the main event loop is
single-threaded. And that got me to where I am today
I'm wondering what the ICSI folks' position is on threads vs. clustering.
In short: we consider them orthogonal and are pursuing both.
To elaborate a bit: The cluster is the approach that can provide
significantly increased performance right now; ignoring a few
implementation glitches we still need to sort out, it's working and
ready to use[1]. The cluster is also an approach that will work
long-term: there will always be limits to what a single box can do
(multi-core or not) and the cluster offers the capability to go
beyond that.
You're right about scale, modulo communications issues. However, I think it's
important for a large-scale system to be designed in such a way as to gain
more performance as the architecture guys throw faster and faster chips over
the wall. The way I read the tea leaves, the number of CPU cores in commodity
PCs isn't dropping anytime soon, so software that isn't written with multi-
threaded event loops is going to hit a performance wall even as CPUs get
faster.
That said, we are in parallel (no pun intended working on a
multi-threaded Bro implementation. While we have already made quite
a bit of progress on that, that will however still take a bit to get
into any kind of usable state; we need to restructure Bro internally
quite a bit to exploit its concurrency potential. Once finished
though, we expect Bro's performance to scale pretty well with the
number of available cores (not considering other aspects of the
hardware which might turn into bottlenecks at some point). The
Sarnoff paper has some of the basic ideas for the multi-threaded
Bro, and there'll be an extended journal version of that coming out
soon with more details on the parallel execution model (let me know
if you'd to like to get a draft).Does that answer your question? If you have any particular thoughts
on either clustering or the multi-threaded Bro, I'd be happy to hear
them.
I haven't explored how threadable Bro would be, but it seems you're already on
it. Truthfully, I spent a few weeks picking through the code to convince myself
I knew roughly how to profile it; I'm no expert on Bro internals. I'm going to
have a chat with Nikita soon (my academic adviser) to see what he thinks about
all this, and I'll be in touch. Please send the draft of the Sarnoff paper to my
UIUC email address; I'd love to have a look.
I'm wondering how you plan to address cache behavior in your multi-threaded
implementation of Bro. Obviously, this is a really hard problem, and I'm not
sure how well-studied it is in the literature, especially in the context of IDS.
Fortunately, I have a lot of friends here that do high-performance computing
work (their long-range research agenda includes a 3000-core CPU tapeout), so
I'll talk to them about it.
David