Performance hit with long-lived flows

Hi Zeek,

Hope you’re all doing well.
I have a large 4 GB PCAP and I am replaying it over many iterations at 10 Gbps through a load balancer, with multiple instances of Bro running on top of it.
One iteration takes around 3.5 seconds to finish. I have to run many iterations of the same pcap because I don't have a test network that can pump 10 Gbps of traffic into my software, and I don't have a pcap large enough to run a single iteration for a long time.

Except for properly terminated TCP flows (SYN - SYN/ACK - ACK ... FIN/RST), Bro holds on to all other connections. If the run lasts, say, an hour, it only reports those connections after the test is over. This particular scenario is test specific, but the need to handle long-lived flows is a valid one.

I tried the "connection_status_update" way of handling this. If the update interval is configured to 10 minutes, Bro starts dropping packets right around the 40-minute mark; if the interval is set to 1 minute, the problems start at around 4 minutes. I could not figure out why Bro behaves this way or what is happening at (interval * 4) minutes (I think the time a single PCAP iteration takes plays some role, but that still doesn't lead me to any theory). Once the drops start, the number of Broccoli sockets seems to keep increasing and Bro then goes into an unresponsive state.
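A minimal sketch of what I mean by that approach (not my exact script; it assumes the standard connection_status_update_interval redef and the connection_status_update event from the base scripts):

redef connection_status_update_interval = 10min;

event connection_status_update(c: connection)
    {
    # Fires periodically for every live connection, so long-lived flows
    # become visible before the trace run ends.
    print fmt("still active: %s duration=%s", c$uid, c$duration);
    }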

I tried connection polling using ConnPolling::watch(). From what I observed, this approach is clearly better than connection_status_update: it takes a little longer before drops start, and it doesn't degrade into the very bad state of ever-increasing sockets and unresponsiveness.
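Roughly what I used there (a sketch, assuming ConnPolling's interface where the callback returns the next polling interval and a negative interval stops the polling):

function poll_cb(c: connection, cnt: count): interval
    {
    print fmt("poll #%d: %s duration=%s", cnt, c$uid, c$duration);
    return 1min;   # check again in a minute; a negative interval would stop
    }

event connection_established(c: connection)
    {
    # Start polling each established connection, first check after 1 minute.
    ConnPolling::watch(c, poll_cb, 0, 1min);
    }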

I also tried schedule, and it didn’t serve my purpose either.
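The schedule variant I tried looked roughly like this (the check_conn event and the watched table are just illustrative names, not an existing API):

global watched: set[string] = set();

event check_conn(uid: string)
    {
    if ( uid !in watched )
        return;
    print fmt("still watching %s", uid);
    schedule 1min { check_conn(uid) };
    }

event connection_established(c: connection)
    {
    add watched[c$uid];
    schedule 1min { check_conn(c$uid) };
    }

event connection_state_remove(c: connection)
    {
    delete watched[c$uid];
    }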

After trying out everything Bro offers for handling this, I came up with my own implementation.

I haven't explored whether schedule routines run in the same main thread?

All script code is currently executed on the main thread.

If so, then obviously this can hold up Bro's main packet-processing thread, and walking through a big list of such table entries could do serious damage.

But I also thought of scanning the table in some batches.

Right, there's potential for scripts that do a lot of work at one time
to interfere with packet processing, and batching the work across time is
a possible solution. Generalized coroutine support might also make it
a bit easier to structure such batch-and-yield logic, but I don't think
there are near-term plans to add that feature.
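E.g., a rough sketch of the batch-and-yield pattern in script land (the names, the batch size, and the 10-second cadence are arbitrary; "watched" stands in for whatever per-connection state table your script keeps):

global watched: set[string] = set();
global pending: vector of string = vector();
const batch_size = 100 &redef;

event scan_batch()
    {
    # Refill the work queue once a full pass over the table has finished.
    if ( |pending| == 0 )
        for ( uid in watched )
            pending[|pending|] = uid;

    # Touch at most batch_size entries, then yield until the next round so
    # a single handler never monopolizes the packet loop for long.
    local n = 0;
    while ( |pending| > 0 && n < batch_size )
        {
        local next_uid = pending[|pending| - 1];
        resize(pending, |pending| - 1);
        if ( next_uid in watched )
            print fmt("checking %s", next_uid);
        ++n;
        }

    schedule 10secs { scan_batch() };
    }

event bro_init()
    {
    schedule 10secs { scan_batch() };
    }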

- Jon