All,
I am working with a Storm topology to process a large number of PCAP files of variable size, though they tend to be in the range of 100MB to 200MB, give or take. My current batch contains about 42K files. I am aiming to process them with as much parallelism as possible while avoiding the issue of sessions that span more than one file (so you know why I am doing this).
My main constraints/focus:
- Take advantage of the large number of cores (56) and the RAM (~750GB) on my node(s)
- Avoid disk as much as possible (I have relatively slow spinning disks, though quite a few of them that can be addressed individually, which could mitigate the disk I/O bottleneck to some degree)
- Prioritize completeness above all else: get as many sessions reconstructed as possible by stitching the packets back together in one of the ways below, or another if you folks have a better idea
My thinking is below, and I am hoping for suggestions on the best of these approaches, or a completely different one if you have a better solution:
1. Run mergecap and set up Bro to run as a cluster and hope for the best.
   - upside: relatively simple and the lowest level of effort
   - downside: not sure it will scale the way I want. I'd prefer to isolate Bro to no more than two nodes in my cluster (each node has 56 cores and ~750GB RAM). Also, it will be one more hack to work into my Storm topology.
2. Use a Storm topology (or something else) to rewrite packets into individual files based on SIP/DIP/SPORT/DPORT or similar.
   - upside: this ensures a certain level of parallelism and keeps the logic inside my topology, where I can control it to the greatest extent
   - downside: this seems horribly inefficient, because I will have to read the PCAP files twice: once to split them and once again when Bro gets them, plus again to read the Bro logs (if I don't get the Kafka plugins to do what I want). It also requires some sort of load balancing, so that IPs representing a disproportionate share of the traffic don't gum up the works, and IPs with relatively little traffic don't tie up too many resources. My thought here is to simply track approximate file sizes and distribute IPs in rough balance (while still always sending any given IP/port pair to the same file). Also, this makes me interact with the HDDs at least three times (read the PCAPs, write the split PCAPs, read the Bro logs), which is undesirable.
3. Use a Storm topology or tcpreplay (or similar) to read in the PCAP files, then write to virtual interfaces (a pool set up manually) so that Bro can simply listen on each interface and process as appropriate.
   - upside: this seems like it could be the most efficient option, as it probably avoids disk the most and could scale very well. It would support clustering by simply creating pools of interfaces on multiple nodes, session-ization takes care of itself, and I just need to tell Bro to wait longer for packets to show up so it doesn't think an interface went dead during lulls in traffic.
   - downside: the most complex of the bunch, and I am uncertain of my ability to preserve timestamps when sending the packets over an interface to Bro.
4. Extend Bro to not only write directly to Kafka topics but also to read from them, so that I could use one of the methods above to split and load-balance the traffic and then have Bro simply spit out logs to another topic of my choosing.
   - upside: this could be the most elegant solution, because it would let me handle failures and hiccups using Kafka offsets
   - downside: this is easily the most difficult for me to implement, as I have not messed with extending Bro at all.
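To make the load-balancing idea in the split-to-files approach concrete (track approximate output sizes, but keep any given IP/port pair pinned to one file), here is a minimal Python sketch. The class and method names are my own illustration, not an existing API, and the byte-count heuristic is just one way to do the balancing:

```python
class FlowPartitioner:
    """Assign packets to output files: a given flow always goes to the
    same file, and each new flow goes to the file with the fewest bytes
    written so far. (Hypothetical sketch, not an existing API.)"""

    def __init__(self, num_files):
        self.num_files = num_files
        self.bytes_written = [0] * num_files   # approximate size per output file
        self.flow_to_file = {}                 # sticky flow -> file assignment

    @staticmethod
    def flow_key(sip, dip, sport, dport):
        # Canonicalize so both directions of a session map to the same key.
        a, b = (sip, sport), (dip, dport)
        return (a, b) if a <= b else (b, a)

    def assign(self, sip, dip, sport, dport, pkt_len):
        key = self.flow_key(sip, dip, sport, dport)
        if key not in self.flow_to_file:
            # New flow: send it to the currently smallest output file.
            self.flow_to_file[key] = min(
                range(self.num_files), key=lambda i: self.bytes_written[i])
        idx = self.flow_to_file[key]
        self.bytes_written[idx] += pkt_len
        return idx
```

Note that a single heavy talker still lands entirely in one file (which is what the completeness requirement demands), so the balance is only approximate; balancing on observed bytes rather than a plain hash of the 5-tuple is what blunts the worst of the skew.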
Any suggestions or feedback would be greatly appreciated! Thanks in advance…
Aaron
P.S. Sorry for the verbose message, but I was hoping to give as complete a problem/solution statement as I can.