Strategies for ingesting a large number of PCAP files


I am working with a Storm topology to process a large number of PCAP files, which vary in size but tend to fall in the range of 100MB to 200MB, give or take. My current batch contains about 42K files…I am aiming to process them with as much parallelism as possible while avoiding the issue of sessions that span more than one file (so you know why I am doing this).

my main constraints/focus:

  1. take advantage of large number of cores (56) and RAM (~750GB) on my node(s)
  2. Avoid disk as much as possible (I have relatively slow spinning disks, though quite a few of them that can be addressed individually, which could mitigate the disk IO bottleneck to some degree)
  3. Prioritize completeness above all else…get as many sessions reconstructed as possible by stitching the packets back together in one of the ways below…or another if you folks have a better idea…

my thinking…and hope for suggestions on the best approach…or a completely different one if you have a better solution:

  1. Run mergecap and set up Bro to run as a cluster and hope for the best.
     - upside: relatively simple and lowest level of effort
     - downside: not sure it will scale the way I want. I’d prefer to isolate Bro to running on no more than two nodes in my cluster…each node has 56 cores and ~750GB RAM. Also, it will be one more hack to have to work into my Storm topology.
  2. Use a Storm topology (or something else) to re-write packets to individual files based on SIP/DIP/SPORT/DPORT or similar.
     - upside: this will ensure a certain level of parallelism and keep the logic inside my topology, where I can control it to the greatest extent
     - downside: This seems horribly inefficient because I will have to read the PCAP files twice: once to split and once again when Bro gets them, and then read the Bro logs on top of that (if I don’t get the Kafka plugins to do what I want). Also, this will require some sort of load balancing to ensure that IPs that represent a disproportionate percentage of traffic don’t gum up the works, and that IPs with relatively little traffic don’t take up too many resources. My thought here is to simply keep track of approximate file sizes and send IPs in rough balance (though still always sending any given IP/port pair to the same file). Also, this makes me interact with the HDDs at least three times (once to read PCAP, next to write PCAP, again to read Bro logs), which is undesirable.
  3. Use a Storm topology or tcpreplay (or similar) to read in PCAP files, then write to virtual interfaces (a pool set up manually) so that Bro can simply listen on each interface and process as appropriate.
     - upside: Seems like this could be the most efficient option, as it probably avoids disk the most. It seems like it could scale very well and would support clustering by simply creating pools of interfaces on multiple nodes. Session-ization takes care of itself, and I just need to tell Bro to wait longer for packets to show up so it doesn’t think the interface went dead if there are lulls in traffic.
     - downside: Most complex of the bunch, and I am uncertain of my ability to preserve timestamps when sending the packets over the interface to Bro.
  4. Extend Bro to not only write directly to Kafka topics, but also to read from them, such that I could use one of the methods above to split traffic and load balance and then have Bro simply spit out logs to another topic of my choosing.
     - upside: This could be the most elegant solution because it will allow me to handle failures and hiccups using Kafka offsets
     - downside: This is easily the most difficult for me to implement, as I have not messed with extending Bro at all.
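
The balancing idea in the second option (track approximate file sizes, but always send a given IP/port pair to the same file) could be sketched roughly as follows. This is only an illustration: `StickyBalancer` and its method names are made up here, and it assumes all assignment state lives in one place (in a Storm topology that would mean a single bolt instance or some shared store).

```python
class StickyBalancer:
    """Assign flow keys to output buckets, keeping each key on one bucket.

    Hypothetical helper: a new key goes to the bucket with the fewest bytes
    so far; a key seen before always returns to the bucket it was first given.
    """

    def __init__(self, num_buckets):
        self.bytes_per_bucket = [0] * num_buckets
        self.bucket_of = {}

    def assign(self, key, packet_len):
        bucket = self.bucket_of.get(key)
        if bucket is None:
            # First sighting of this key: pick the currently lightest bucket.
            bucket = min(range(len(self.bytes_per_bucket)),
                         key=self.bytes_per_bucket.__getitem__)
            self.bucket_of[key] = bucket
        self.bytes_per_bucket[bucket] += packet_len
        return bucket
```

The trade-off is that a key is placed before its eventual volume is known, so a heavy talker can still skew one bucket; the balance is only approximate.
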

Any suggestions or feedback would be greatly appreciated! Thanks in advance…


P.S. sorry for the verbose message…but was hoping to give as complete a problem/solution statement as I can

Is this for that Metron thing? I had chatted with someone on IRC about this a while back. I think the simplest way to integrate bro would be to write a 'pcapdir' packet source plugin that works similarly to the 'live' pcap mode, but instead reads packets from sequentially numbered pcap files in a directory. That way, fetching the next packet would boil down to the following, in pythonish pseudocode:

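        while not current_file:
            current_file = find_next_pcap_file()
            if not current_file:
                sleep(0.1)    # nothing new yet; poll again in 100ms
        packet = get_next_packet(current_file)
        if packet:
            return packet
        # current file is exhausted: clean up so the next call picks up the next file
        close(current_file)
        delete(current_file)
        current_file = None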

packet source plugins are pretty easy to write, you could probably get this working in a few hours. It would be a lot easier to implement than interfacing with kafka directly.

Then you just need to atomically move pcap files into the directory that bro is watching. Since a single instance of bro is running you don't have to worry about sessions that span more than one file... that should just work normally.
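
One common way to get the atomic move is to write the file somewhere else on the same filesystem and rename it into the watched directory. A minimal sketch, assuming a POSIX-like system; `publish_pcap`, `staging_dir` and `watch_dir` are names made up for the example, and both directories could sit on a RAM-backed filesystem if the goal is to stay off the spinning disks:

```python
import os
import tempfile


def publish_pcap(data, staging_dir, watch_dir, name):
    """Write data to a temp file in staging_dir, then rename it into watch_dir.

    staging_dir and watch_dir are assumed to be on the same filesystem;
    otherwise the rename is not atomic and os.replace raises OSError.
    """
    fd, tmp_path = tempfile.mkstemp(dir=staging_dir)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
        final_path = os.path.join(watch_dir, name)
        os.replace(tmp_path, final_path)  # the file appears under its final name all at once
        return final_path
    except BaseException:
        # Do not leave half-written temp files behind on failure.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

The reader then never sees a partially written pcap, because the name only shows up in the watched directory once the content is complete.
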

To use more than one bro process you would just need to write a tool that can read pcaps, hash by 2 or 4 tuple, and output sliced pcaps to different places.
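
The hashing step of such a tool might look something like the sketch below. The important property is that both directions of a conversation map to the same slice, so the endpoints are put in a canonical order before hashing. `flow_key` and `bucket_for` are illustrative names, and using SHA-1 is just one assumption for getting a hash that is stable across processes and runs.

```python
import hashlib
import ipaddress


def flow_key(src_ip, dst_ip, src_port=0, dst_port=0):
    """Direction-independent key: both directions of a flow yield the same tuple.

    Leave the ports at 0 to hash on the 2-tuple (IP pair) only.
    """
    a = (ipaddress.ip_address(src_ip).packed, src_port)
    b = (ipaddress.ip_address(dst_ip).packed, dst_port)
    return (a, b) if a <= b else (b, a)


def bucket_for(key, num_buckets):
    """Map a flow key to a bucket with a stable hash.

    Python's built-in hash() of bytes is salted per process, so it is avoided here.
    """
    digest = hashlib.sha1(repr(key).encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets
```

For example, `bucket_for(flow_key("10.0.0.1", "192.168.1.5", 443, 51000), 8)` and the same call with source and destination swapped give the same bucket number.
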

You should be able to do everything but the last part without pcaps ever touching the disk.

You probably want to avoid using something like tcpreplay since you'd lose a lot of the performance benefits of bro reading from pcap files.

This is not for Metron…I am doing something different…research…I could see the read from a directory approach, but unfortunately, I cannot control the file naming scheme nor can I be certain the “last-modified” timestamp will be entirely reliable. That is why I was trying to deal with the packets and their respective timestamps directly…
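
If neither file names nor modification times can be trusted, one option is to order the files by the timestamp of their first packet, which only requires reading the first 40 bytes of each file. A rough sketch for classic pcap files (not pcapng); the function names are made up for the example, and it assumes files do not overlap much in time, since first-packet order says nothing about interleaving:

```python
import struct

# Magic number as read little-endian -> (byte order of the file, fractional units per second).
_MAGICS = {
    0xA1B2C3D4: ("<", 1_000_000),      # little-endian file, microsecond timestamps
    0xD4C3B2A1: (">", 1_000_000),      # big-endian file, microsecond timestamps
    0xA1B23C4D: ("<", 1_000_000_000),  # little-endian file, nanosecond timestamps
    0x4D3CB2A1: (">", 1_000_000_000),  # big-endian file, nanosecond timestamps
}


def first_packet_time(path):
    """Return the first packet's timestamp in seconds, or None if the file has no packets."""
    with open(path, "rb") as f:
        header = f.read(24)   # the global file header is 24 bytes
        if len(header) < 24:
            raise ValueError(f"{path}: too short to be a pcap file")
        magic = struct.unpack("<I", header[:4])[0]
        if magic not in _MAGICS:
            raise ValueError(f"{path}: not a classic pcap file")
        order, frac = _MAGICS[magic]
        record = f.read(16)   # per-packet header: ts_sec, ts_frac, incl_len, orig_len
        if len(record) < 16:
            return None
        ts_sec, ts_frac = struct.unpack(order + "II", record[:8])
    return ts_sec + ts_frac / frac


def order_by_first_packet(paths):
    """Sort paths by first-packet time, dropping files that contain no packets."""
    timed = ((first_packet_time(p), p) for p in paths)
    return [p for _, p in sorted(tp for tp in timed if tp[0] is not None)]
```
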

I write code in Scala/Java and Perl, but have never written anything in Python…which is also a hurdle I will have to deal with…(but am willing to deal with if needed)

Sorry, missed the last part…I already have code written (a Storm topology) that reads the PCAP files and sends them to the follow-on bolt to write to disk…or whatever…it already hashes on the 2/4/5 tuple (configurable). I am using pcap4j for that…So, writing files to disk is trivial and mostly done…but I kept getting heartburn when I looked at the disk IO implications (read, write, read, write, read, send to Kafka…which is another write once it persists). This approach also means that I am probably stuck with reading the PCAP twice…once to process it for Bro and once to process the PCAP itself, part of which involves an in-memory JOIN of sorts that associates each Bro log entry (conn.log) with the individual packets that made up the session.
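
For illustration only, the in-memory JOIN described here could be shaped roughly like the sketch below: index the conn.log entries by a direction-independent 5-tuple, then look up each packet by tuple and timestamp. The dict keys are the usual conn.log field names (`ts`, `uid`, `id.orig_h`, `id.orig_p`, `id.resp_h`, `id.resp_p`, `proto`, `duration`); `ConnIndex`, `canonical` and the `slack` tolerance are assumptions of this sketch, and it assumes IP addresses are formatted identically on both sides.

```python
import bisect
from collections import defaultdict


def canonical(ip_a, port_a, ip_b, port_b, proto):
    """Direction-independent 5-tuple key."""
    ends = sorted([(ip_a, port_a), (ip_b, port_b)])
    return (ends[0], ends[1], proto)


class ConnIndex:
    """In-memory index from 5-tuple to that tuple's connections, sorted by start time."""

    def __init__(self, conns, slack=0.001):
        # conns: iterable of dicts keyed by conn.log field names.
        # slack: assumed tolerance (seconds) for timestamp rounding differences.
        self.slack = slack
        self.by_key = defaultdict(list)
        for c in conns:
            key = canonical(c["id.orig_h"], c["id.orig_p"],
                            c["id.resp_h"], c["id.resp_p"], c["proto"])
            end = c["ts"] + (c.get("duration") or 0.0)
            self.by_key[key].append((c["ts"], end, c["uid"]))
        for spans in self.by_key.values():
            spans.sort()

    def uid_for(self, src_ip, src_port, dst_ip, dst_port, proto, pkt_ts):
        """Return the uid of the connection whose time span contains the packet, or None."""
        key = canonical(src_ip, src_port, dst_ip, dst_port, proto)
        spans = self.by_key.get(key, [])
        # Index of the last connection that started at or before the packet (plus slack).
        i = bisect.bisect_right(spans, (pkt_ts + self.slack, float("inf"), "")) - 1
        if i >= 0 and pkt_ts <= spans[i][1] + self.slack:
            return spans[i][2]
        return None
```

Keeping a sorted list per 5-tuple means port reuse over time is handled by the timestamp lookup rather than by the key alone.
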