feature proposal: bro batch processing


As a follow-up to my question about "delayed bro operation", I would
like to propose a new feature for bro. I call it batch mode, and it
helps to run bro over a large number of pcap files.

Until now I have used a modified version of tcpslice to send multiple
pcaps to bro through a pipe. This setup works but is a bit complicated.
It also has the downside that bro blocks while reading from the pipe,
which breaks the usual event loop.

What I hacked together is a simple change to PcapSource: When passed a
directory instead of a file it looks for files with the suffix pcap and
processes them. When EOF is reached, it closes the file and renames it.
But instead of also closing the IOSource it looks for the next file.
When there are no files to process, it behaves just like live capture
mode when there are no packets available.
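
To make the described behaviour concrete, here is a minimal Python sketch
of the directory loop. It is not the actual PcapSource patch (which lives
in bro's C++ code); the names next_pcap, mark_done, drain and the ".done"
rename suffix are made up for illustration, and picking files in
alphabetical order is an assumption.

```python
import os
from typing import Callable, Optional

DONE_SUFFIX = ".done"  # hypothetical marker appended to finished files


def next_pcap(directory: str) -> Optional[str]:
    """Return the next unprocessed *.pcap file, or None if there is none."""
    # Alphabetical order is an assumption; the patch just takes the first
    # file it can find.
    names = sorted(n for n in os.listdir(directory) if n.endswith(".pcap"))
    return os.path.join(directory, names[0]) if names else None


def mark_done(path: str) -> str:
    """Rename a fully read file so it is not picked up again."""
    done = path + DONE_SUFFIX
    os.rename(path, done)
    return done


def drain(directory: str, process: Callable[[str], None]) -> int:
    """Process all pending files once and return how many were handled.

    A real source would not exit here; it would stay open and report
    "no packets available" until a new file shows up.
    """
    count = 0
    while (path := next_pcap(directory)) is not None:
        process(path)    # read the file until EOF
        mark_done(path)  # close and rename, then look for the next one
        count += 1
    return count
```

Calling drain() again after new files have been copied into the directory
picks them up, which mirrors the "behaves like live capture with no
packets" idle state.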

My patch is quite basic right now: just grab the first pcap you can
find and work with it, but one could think of extended features:

1) Read files to work on from a text file: This would also come in handy
when the source files are distributed across the file system, e.g.
sorted by date, or just to avoid too many files in one directory.
Compared to passing multiple file names to bro on the command line, this
also works around the problem of a "too long argument list".

2) Sort mode: Check the timestamps of all available files in the
directory and process them in the right order. This mode would have to
be smart enough not to open all files at the same time, which (as with
mergecap) can run out of file descriptors. So check the timestamps first
and open only the files needed. This could be more than one file when
there are separate pcaps for rx/tx.

3) For each flow, save the name/path of the first and last file it
was read from. So when detailed analysis is necessary, you know exactly
which files to open.
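
Ideas 1) and 2) could be combined roughly as in the Python sketch below:
read the candidate paths from a text file, peek at the first packet
timestamp of each file, and sort. Each file is opened only briefly, so
only one descriptor is in use at a time. The function names are
hypothetical, and the sketch assumes classic libpcap files (24-byte file
header followed by 16-byte record headers), not pcapng.

```python
import struct

PCAP_MAGIC_USEC = 0xA1B2C3D4  # classic pcap, microsecond timestamps
PCAP_MAGIC_NSEC = 0xA1B23C4D  # classic pcap, nanosecond timestamps


def first_timestamp(path):
    """Return the first packet's timestamp in seconds, or None."""
    with open(path, "rb") as f:
        # 24-byte file header plus the ts_sec/ts_frac fields of record 1.
        head = f.read(24 + 8)
    if len(head) < 32:
        return None  # empty, truncated, or not a pcap at all
    for endian in ("<", ">"):  # the magic number reveals the byte order
        (magic,) = struct.unpack(endian + "I", head[:4])
        if magic in (PCAP_MAGIC_USEC, PCAP_MAGIC_NSEC):
            sec, frac = struct.unpack(endian + "II", head[24:32])
            divisor = 1e6 if magic == PCAP_MAGIC_USEC else 1e9
            return sec + frac / divisor
    return None


def read_file_list(list_path):
    """Read one pcap path per line, skipping blank lines."""
    with open(list_path) as f:
        return [line.strip() for line in f if line.strip()]


def sort_by_start(paths):
    """Order paths by first packet timestamp; unreadable files are dropped."""
    stamped = [(first_timestamp(p), p) for p in paths]
    return [p for ts, p in sorted(s for s in stamped if s[0] is not None)]
```

Files whose time ranges overlap (e.g. separate rx/tx captures) would
still need to be merged packet by packet; the sketch only decides the
order in which files are opened.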

I would like to know if others would be interested in this kind of
feature. Better ideas on how to solve this "the bro way" are also
welcome.


This is a feature that we actually expect packet-bricks to be able to solve shortly. We’ve been thinking about this for a while, and with packet-bricks I expect we’ll be able to take it a bit further and process large sets of traces as a cluster as well.

You could have packet-bricks essentially play the role of coordinator: it passes packets to the Bro processes and continues passing packets as more traces show up (I know of some people who are doing “real-time” sniffing by copying traces from a remote location to their Bro installation). This allows the Bro processes to remain up and keep their state without any changes to Bro itself. I can see making one small change to Bro so that it takes its timestamps from the trace, by having packet-bricks annotate the packets with timestamps.

It’s nice to see that other people are concretely thinking about solving this same problem. Definitely keep an eye on packet-bricks over the next couple of months. :)