'for loop' variable modification

Hi all:

I’m working on an interesting Bro policy, and I want to be able to begin a ‘for loop’ at some point where it previously left off.

Pseudo-code:

for ( foo in bar )
    {
    if ( foo == "baz" )
        break;
    ...
    process bar[foo]
    ...
    }
...
... Do other work (not changing bar)
...

first_time = T;
for ( foo in bar )
    {
    if ( first_time )
        {
        first_time = F;
        foo = "baz";
        }
    ...
    process bar[foo]
    ...
    }

If the loop variable could be reassigned inside the loop, with iteration continuing from that point, it would facilitate some of the processing I'm doing. The above synthetic code could be refactored to avoid the issue, but…

My real-world issue is that I have a large table to process, and want to amortize the processing of it on the time domain:

A. Process first N items in table
B. Schedule processing of next N items via an event
C. When event triggers, pick up where we left off, and process next N items, etc.

(There are inefficient ways of processing these that solve some, but not all, of the issues, such as putting the indices in a vector and then going through that - I won't go into the problems with that right now.)
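For concreteness, here is a hedged sketch of steps A-C using Bro's event scheduling and the index-vector approach mentioned above (the names `bar`, `process_one`, and `chunk_size` are hypothetical stand-ins, and the extra pass to build the vector is exactly the inefficiency being objected to):

```bro
# Hypothetical sketch: process `bar` in chunks, picking up where we left off.
global bar: table[string] of count;

const chunk_size = 1000;

function process_one(k: string, v: count)
    {
    # ... real per-item processing goes here ...
    }

event process_chunk(keys: vector of string, start: count)
    {
    local i = start;

    # Step A: process up to chunk_size items starting at `start`.
    while ( i < |keys| && i < start + chunk_size )
        {
        if ( keys[i] in bar )   # the entry may have been removed meanwhile
            process_one(keys[i], bar[keys[i]]);
        ++i;
        }

    # Steps B/C: schedule the next chunk and resume from index i later.
    if ( i < |keys| )
        schedule 1sec { process_chunk(keys, i) };
    }
```

The vector of keys has to be built up front (one full pass over the table), which is the cost Jim wants to avoid.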

I haven't checked whether my desired behavior works, but since it's not documented, I wouldn't want to rely on it in any event.

I would be interested in hearing comments or suggestions on this issue.

Jim

> I haven't checked whether my desired behavior works, but since it's not
> documented, I wouldn't want to rely on it in any event.

Yeah, I doubt the example you gave currently works -- it would just
change the local value in the frame without modifying the internal
iterator.

> I would be interested in hearing comments or suggestions on this issue.

What you want, the ability to split the processing of large data
tables/sets over time, makes sense. I've probably also run into at
least a couple cases where I've been concerned about how long it would
take to iterate over a set/table and process all keys in one go. The
approach that comes to mind for doing that would be adding coroutines.
Robin has some ongoing work with adding better support for async
function calls, and I wonder if the way that's done would make it
pretty simple to add general coroutine support as well. E.g. stuff
could look like:

event process_stuff()
    {
    local num_processed = 0;

    for ( item in foo )
        {
        process_item(item);

        if ( ++num_processed % 1000 == 0 )
            yield; # resume next time events get drained (e.g. next packet)
        }
    }

There could also be other types of yield instructions, like "yield 1
second" or "yield wait_for_my_signal()" which would, respectively,
resume after arbitrary amount of time or a custom function says it
should.

- Jon

Thanks, Jon:

I've decided to split the data (a table of IP addresses with statistics captured over a time period) based on a modulo calculation against the IP address. The important characteristic is that it can be done on the fly, without an additional pass through the table. With an average distribution of traffic this gives buckets of relatively equal size, each of which can be processed during a single event, as I described.
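A hedged sketch of such a modulo split (the bucket count and table layout here are hypothetical; `addr_to_counts()` is a Bro built-in that returns the address as 32-bit words):

```bro
# Hypothetical sketch: route each address into one of N_BUCKETS sub-tables
# as entries are created, so no extra pass over the table is needed.
const N_BUCKETS = 10;

global buckets: table[count] of table[addr] of count;

function bucket_of(a: addr): count
    {
    # addr_to_counts() yields one 32-bit word for IPv4, four for IPv6;
    # the low word mod N_BUCKETS spreads typical traffic roughly evenly.
    local words = addr_to_counts(a);
    return words[|words| - 1] % N_BUCKETS;
    }
```

Each bucket is then small enough to process within a single scheduled event.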

I like the idea of coroutines - they would help address issues like these in a more natural manner.

Jim

I got the following idea while perusing SumStats::process_epoch_result in non_cluster.bro:

local i = 1;

while ( i <= 1000 && |bar| > 0 )
    {
    for ( foo in bar )
        break;    # grab one arbitrary key, then leave the loop immediately

    process bar[foo]

    # Optional: baz[foo] = bar[foo], if we need to preserve the original data.

    delete bar[foo];
    ++i;
    }

This will allow iteration through the table as I originally desired, although it destroys the original table.
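Tying this back to the chunked-processing goal, a hedged sketch of how the break/delete trick might combine with a self-rescheduling event (the names `bar`, `process_one`, and the one-second delay are hypothetical stand-ins):

```bro
# Hypothetical sketch: each invocation consumes up to `n` entries of `bar`,
# then reschedules itself while work remains.
global bar: table[string] of count;

function process_one(k: string, v: count)
    {
    # ... per-item processing ...
    }

event process_batch(n: count)
    {
    local i = 0;

    while ( i < n && |bar| > 0 )
        {
        local foo = "";
        for ( k in bar )
            {
            foo = k;
            break;    # grab one arbitrary key; no mutation inside the for loop
            }

        process_one(foo, bar[foo]);
        delete bar[foo];    # deletion happens outside the iteration
        ++i;
        }

    if ( |bar| > 0 )
        schedule 1sec { process_batch(n) };
    }
```

Since the deletion happens after the `for` loop has already exited, this stays clear of the documented undefined behavior around modifying a container mid-iteration.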

SumStats::process_epoch_result deletes the current item inside the for loop, so it is relying on undefined behavior, per the documentation: “Currently, modifying a container’s membership while iterating over it may result in undefined behavior, so do not add or remove elements inside the loop.” The above example avoids that. Does anyone use sumstats outside of a cluster context?

> Robin has some ongoing work with adding better support for async
> function calls, and I wonder if the way that's done would make it
> pretty simple to add general coroutine support as well.

Yes, actually it would; pretty sure we could use that infrastructure
for a yield keyword.

Robin