'async' update and proposal

A while ago we discussed adding a new "async" keyword to run
certain functions asynchronously; Bro would proceed with other script
code until they complete. A summary is here:
https://www.bro.org/development/projects/broker-lang-ext.html#asynchronous-executions-without-when

After my original proof-of-concept version, I now have a 2nd-gen
implementation mostly ready that internally unifies the machinery for
"when" and "async". That simplifies the code and, more importantly,
makes the two work together more smoothly. The branch for that work is
topic/robin/sync, I'll clean that up a bit more and would then be
interested in seeing somebody test drive it.

In the meantime I want to propose a slight change to the original
plan. In earlier discussions, we ended up concluding that introducing
a new keyword to trigger the asynchronous behaviour is useful for the
script writer, as it signals that semantics are different for these
calls. Example using the syntax we arrived at:

    event bro_init()
        {
        local x = async lookup_hostname("www.icir.org"); # A
        print "IP of www.icir.org is", x; # B
        }

Once the DNS request is issued in line A, the event handler will be
suspended until the answer arrives. That means that other event
handlers may execute before line B, i.e., execution order is no longer
fully deterministic. The use of "async" points out that possibility.

However, look at the following example. Let's say we want to outsource
such DNS functionality into a separate framework, like in this toy
example:

    # cat test.bro
    @load my-dns

    event bro_init()
        {
        local x = MyCoolDNSFramework::lookup("www.icir.org"); # A
        print "IP of www.icir.org is", x; # B
        }

    # cat my-dns.bro
    module MyCoolDNSFramework;

    export {
        global lookup: function(name: string) : set[addr];
    }

    function lookup(name: string) : set[addr] {
        local addrs = async lookup_hostname(name); # C
        return addrs; # D
    }

That example behaves exactly like the first: execution may suspend
between lines A and B because the call to MyCoolDNSFramework::lookup()
executes an asynchronous function call internally (it will hold
between C and D). But in this second example that behaviour is not
apparent at the call site in line A.

We could require using "async" in line A as well, but that would be
extremely painful: whenever calling some function, one would need to
know whether the callee may internally end up using "async" somewhere
(potentially buried deep inside its call stack).

I think we should instead just skip the "async" keyword altogether.
Requiring it at some places, but not others, hurts more than it helps
in my opinion. The first example would then just go back to looking
like this:

      event bro_init()
        {
        local x = lookup_hostname("www.icir.org"); # A
        print "IP of www.icir.org is", x; # B
        }

This would still behave the same as before: potential suspension
between A and B.

I don't think skipping "async" would be a big deal for anything, as
the cases where the new behaviour may actually lead to significant
differences should be rare.

Thoughts?

Robin

I don't think skipping "async" would be a big deal for anything, as
the cases where the new behaviour may actually lead to significant
differences should be rare.

After pondering this for a while, I am a bit afraid that skipping async
completely might lead to quite hard-to-debug problems in scripts.

At the moment, Bro scripts are basically predictable - e.g., once an
event handler is executing, you know that everything in that handler
happens before anything else is executed.

Having asynchronous functions (obviously) changes that - now other things
will execute while you wait for something to happen. Without an async
keyword, users basically have no direct way to determine if a function
will run to the end without interruption. This can be especially
problematic if you manipulate data structures in an event that have to be
present in later events (as is now commonly done in scripts).

Consider e.g. an example like the following

event protocol_event_1(...)
  {
  c$proto$la = function_call;
  }

event protocol_event_end(...)
  {
  Log::write([....c$proto$la...]);
  }

If I understand everything correctly, c$proto$la is not guaranteed to
be present in this case - it still might be waiting for function_call
(if it is asynchronous). This might not be a problem for cases in which
functions that obviously need DNS lookups are used - but if this is hidden
behind a few layers of indirection it will get really hard to reason about.

I actually think that it makes sense to be more explicit there. So -
require async in front of all BiFs, etc., that are asynchronous. And if a
user creates a function that in turn calls an asynchronous function, I
think we should require that function to be called using async too. Either
a user knows that a function uses an asynchronous function, or the script
interpreter will raise an error message telling them that the async
keyword is required here because an async function is in the call stack.
While this might be a bit painful at times, I think this still is better
because it makes the script writer aware that things might be interrupted
at this point.

So - my argument is basically exactly the reverse of yours - if an async
function is somewhere in the call stack, I definitely would want to know
about this when writing my script - otherwise I can see really nasty bugs
happening.

If we want to make it more explicit that a function is potentially
asynchronous, we also could consider requiring the function definition to
mark this explicitly, e.g. just by adding the async keyword in front:

async function lookup(name: string) : set[addr] {
  local addrs = async lookup_hostname(name);
  return addrs;
}

...but that is not strictly necessary and does not look that pretty.

Johanna

I was thinking the same thing :-)

The notice framework currently has this problem, and needs to work around it:

        ## Adding a string "token" to this set will cause the notice
        ## framework's built-in emailing functionality to delay sending
        ## the email until either the token has been removed or the
        ## email has been delayed for :bro:id:`Notice::max_email_delay`.
        email_delay_tokens: set[string] &optional;

So that the extend-email hostnames script can do

        add n$email_delay_tokens["hostnames-src"];
        when ( local src_name = lookup_addr(n$src) )
            {
            output = string_cat("orig/src hostname: ", src_name, "\n");
            tmp_notice_storage[uid]$email_body_sections[|tmp_notice_storage[uid]$email_body_sections|] = output;
            delete tmp_notice_storage[uid]$email_delay_tokens["hostnames-src"];
            }
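The delay-token workaround can be sketched outside of Bro. The following is a hypothetical Python illustration (the names are mine, not the notice framework's actual API): an action is held back until every registered token has been removed, which is exactly the kind of manual synchronization the framework currently has to implement by hand.

```python
# Hypothetical sketch of the delay-token pattern quoted above; names
# are illustrative, not the notice framework's actual API. An action
# is held back until every token is removed (the real framework
# additionally enforces a maximum delay, omitted here).

delay_tokens: set = set()
sent: list = []

def try_send(body: str) -> bool:
    """Send only once no delay tokens remain pending."""
    if not delay_tokens:
        sent.append(body)
        return True
    return False

delay_tokens.add("hostnames-src")      # async lookup still in flight
held = try_send("notice email")        # held back: a token is pending
delay_tokens.discard("hostnames-src")  # lookup completed, token removed
done = try_send("notice email")        # now the email goes out
print(held, done, sent)
```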

So 'async' or no 'async' keyword, I think as soon as Bro starts doing more things asynchronously, a lot of synchronization/ordering issues come into play.

Even stuff like

event protocol_event_1(...) &priority=1
  {
  c$proto$la = function_call;
  }

event protocol_event_1(...)
  {
  ...
  }

Currently the 2nd event handler is guaranteed to run only after the first finishes running, right? But what if the first handler does an async operation?
Does the 2nd event handler wait for the async operation to finish, or does it run as soon as the higher-priority event function hits an async operation?

If that works, it would be because there's an explicit dependency on the higher priority event with the lower priority event.

For 'protocol_event_1' and 'protocol_event_end' there's no explicit dependency other than that the analyzer raises the events with a known ordering.
If an earlier event can trigger an async operation then all of the assumed ordering goes out the window.

First of all, this async keyword reminds me of asynchronous programming
in C#:
https://docs.microsoft.com/en-us/dotnet/csharp/programming-guide/concepts/async/

As I only used it in a student project, I don't have real experience, but
maybe someone who is used to that paradigm in C# can provide some
valuable hints.

So - my argument is basically exactly the reverse of yours - if an async
function is somewhere in the call stack, I definitely would want to know
about this when writing my script - otherwise I can see really nasty bugs
happening.

I agree on that! For the C# async paradigm, people say that async is
like a zombie plague, as a single asynchronous function can start
"infecting" your code base by propagating async through the call graph.
However, I would prefer being explicit, in particular as in the case of
Bro the plague syntactically stops at the event level.
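That "infection" can be seen concretely in Python's asyncio, whose model is close to C#'s. This is a hypothetical illustration (not Bro code): a single async function at the bottom of a call chain forces every caller that awaits its result to become async as well, all the way up to the event loop, which here plays the role of Bro's event engine.

```python
import asyncio

# A single "async def" at the bottom of the call chain forces every
# caller that awaits its result to be "async" too -- the "infection"
# described above. The event loop plays the role of Bro's event engine.

async def lookup_hostname(name: str) -> str:
    await asyncio.sleep(0)   # stand-in for waiting on a real DNS answer
    return "1.2.3.4"         # placeholder address, not a real lookup

async def lookup(name: str) -> str:
    # Must itself be async because it awaits an async callee ...
    return await lookup_hostname(name)

async def handler() -> str:
    # ... and so must every frame above it, up to the event loop.
    return await lookup("www.icir.org")

print(asyncio.run(handler()))
```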

So 'async' or no 'async' keyword, I think as soon as Bro starts doing more things asynchronously, a lot of synchronization/ordering issues come into play.

Even stuff like

event protocol_event_1(...) &priority=1
  {
  c$proto$la = function_call;
  }

event protocol_event_1(...)
  {
  ...
  }

[...]

For 'protocol_event_1' and 'protocol_event_end' there's no explicit dependency other than that the analyzer raises the events with a known ordering.
If an earlier event can trigger an async operation then all of the assumed ordering goes out the window.

That's a really good example I think. However, with async around, I
would argue that the priority can be used to determine when an event is
scheduled but not when it is finished. In cases where ordering is
important, wouldn't something like the following work?

event protocol_event_1(...)
  {
  c$steps_left = 2;
  c$proto$la = async function_call;
  c$steps_left--;
  if ( c$steps_left <= 0 )
    schedule 0sec { protocol_event_end(...) };
  }

event protocol_event_2(...)
  {
  # do stuff
  c$steps_left--;
  if ( c$steps_left <= 0 )
    schedule 0sec { protocol_event_end(...) };
  }

event protocol_event_end(...)
  {
  # Log write or whatever
  }

While this is explicit as well, the pattern could significantly blow up
scripts. Anyway, I think asynchronous control flow is just more complex
than sequential, no matter how much syntactic sugar is put on top.

Jan

Jan wrote:

First of all, this async keyword reminds me of asynchronous programming
in C#:

Nice, didn't know that.

For the C# async paradigm, people say that async is like a zombie
plague as a single asynchronous function can start "infecting" your
code base by propagating async through the call graph.

That's exactly my point. If we require a keyword to be used for all
asynchronous behavior, it will need to be put in place across the
whole call stack whenever there's even the slightest chance of
"async" being used somewhere far down inside a framework. I hear you
all on the advantages of making asynchronous behavior explicit, and I
fully agree with that. I just don't see it as practical from the
script writer's perspective.

Johanna wrote:

And if a user creates a function that in turn calls an asynchronous
function, I think we should require that function to be called using
async too. Either a user knows that a function uses an asynchronous
function, or the script interpreter will raise an error message
telling them that the async keyword is required here because an async
function is in the call stack.

The problem is that the interpreter cannot determine that statically
(because control flow isn't static); we'd have to resort to runtime
errors -- and that means that code that forgets to use "async" may run
fine for a while until it happens to go down that one branch that
does, e.g., a DNS lookup.

If we required that all the relevant functions (and function
declarations) get declared as "async", like in C#, then I believe we
could detect errors statically. But we'd end up having to put that
async declaration into a lot of places just on the chance that
asynchronous behavior could be used somewhere. Consider for example
the plugin functions in NetControl: they'd need to be "async" just so
that someone *could* do DNS lookups in there. Same for hooks: by
definition we don't know what they'll do, so they'll need to be
"async". And that in turn means that NOTICE() for example must become
"async" because it's running a hook. Now every time we do a NOTICE, we
need to put an "async" in front. And every time we call a function that
might generate a NOTICE, we'd write "async" there, too.

The point of dependencies/order becoming harder to understand is valid
of course. We already have that challenge with "when", and maybe we
need to find different solutions there to express sequentiality
requirements somehow.

Justin wrote:

event protocol_event_1(...) &priority=1
event protocol_event_1(...)

Currently the 2nd event handler is guaranteed to run only after the
first finishes running, right?

Correct, and that's actually something we could ensure even with
"async": we could treat the whole set of all handlers as one block
that gets suspended as a whole if an asynchronous function runs. But
as you point out, that wouldn't solve inter-event dependencies. Per
Jan's mail, one can work around that with custom code, yet it would be
much nicer if we had built-in support for that. Actually, I think one
possible solution has been floating around for a while already: event
*scopes* that express serialization requirements in terms of shared
context. Most common example: serialize all events that are triggered
by the same connection. Originally this came up in the context of
running event handlers concurrently. I believe it would solve this
problem here too: when a function suspends, halt all handlers that
depend on the same context (e.g., same connection). More on that idea
in this paper: http://www.icir.org/robin/papers/ccs14-concurrency.pdf
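To make the idea concrete, here is a hypothetical sketch in Python's asyncio (not Bro internals, and the names are mine): one lock per context value serializes handlers that share that context, while handlers for other contexts keep running even when a handler suspends on an async operation.

```python
import asyncio
from collections import defaultdict

# Hypothetical sketch (not Bro internals): one lock per context value
# serializes handlers sharing that context; handlers for other contexts
# keep running even while a handler is suspended on an async operation.

locks = defaultdict(asyncio.Lock)
order = []

async def run_scoped(context, name, suspends):
    async with locks[context]:         # same-context handlers queue here
        order.append(name + ":start")
        if suspends:
            await asyncio.sleep(0.01)  # stand-in for an async DNS lookup
        order.append(name + ":end")

async def main():
    await asyncio.gather(
        run_scoped("conn-A", "ev1", suspends=True),
        run_scoped("conn-A", "ev2", suspends=False),  # waits for ev1
        run_scoped("conn-B", "ev3", suspends=False),  # different context
    )

asyncio.run(main())
print(order)
```

ev3 finishes while ev1 is suspended, but ev2 (same context as ev1) is held back until ev1 completes.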

Robin

as you point out, that wouldn't solve inter-event dependencies.
Per Jan's mail, one can work around that with custom code

The inter-event dependencies + code understandability/readability
issue that Johanna points out is maybe something that would also
bother me if it weren't addressed. And taking Jan's idea a bit
further, if we had more complete coroutine/yield support in scripts
you could write:

event protocol_event_1(...)
        {
        c$proto$la = function_call;
        }

event protocol_event_end(...)
        {
        yield WaitForProtoFields();
        Log::write([....c$proto$la...]);
        }

There, WaitForProtoFields() would be an arbitrary/custom function you
write, e.g., one that returns false unless all fields of c$proto have
been set.

So here, maybe the code is more readable because the depender is now
explicit about their dependencies. Though I think it's still
problematic to know when to write that code because you would still
have to rely on either memory (or documentation) to tell you that
'function_call' is actually not synchronous. And if 'function_call'
starts as a synchronous function and later changes, that's also kind
of a problem, so you might see people cautiously implementing the same
type of code patterns everywhere even if not required for some cases.

Actually, I think one
possible solution has been floating around for a while already: event
*scopes* that express serialization requirements in terms of shared
context. Most common example: serialize all events that are triggered
by the same connection. Originally this came up in the context of
running event handlers concurrently. I believe it would solve this
problem here too: when a function suspends, halt all handlers that
depend on the same context (e.g., same connection). More on that idea
in this paper: http://www.icir.org/robin/papers/ccs14-concurrency.pdf

Yeah, I think this approach could actually work really well to aid in
reasoning about event ordering. The first example that came to mind
was adding an attribute to event handlers that a user wants to be
serialized as part of a context:

event protocol_event_1(c: connection ...) &context = { return c$id; } { ... }

I only skimmed the paper, though it seemed like it outlined a similar
way of generalizing contexts/scopes?

- Jon

I like this idea a lot! Do you foresee that causing trouble if we went that direction, though? It seems like it could cause events to back up waiting for some other event to finish executing.

   .Seth

And if 'function_call' starts as a synchronous function and later
changes, that's also kind of a problem, so you might see people
cautiously implementing the same type of code patterns everywhere
even if not required for some cases.

That's a good point more generally: if we require "async" at call
sites, an internal change to a framework can break existing code.

event protocol_event_1(c: connection ...) &context = { return c$id; } { ... }

I only skimmed the paper, though it seemed like it outlined a similar
way of generalizing contexts/scopes?

Yeah, that's pretty much the idea there. For concurrency, we'd hash
that context value and use that to determine a target thread to
schedule execution to, just like in a cluster the process/machine is
determined.

An attribute can work if we're confident that the relevant information
can always be extracted from the event parameters. In a concurrent
prototype many years ago we instead used a hardcoded set of choices
based on the underlying connection triggering the event (5-tuple, host
pair, src IP, dst IP). So you'd write (iirc):

    event protocol_event_1(c: connection ...) &scope = connection

That detaches the context calculation from the event parameters, with the
obvious disadvantage that it can't be customized any further. Maybe
there's some middle ground where we'd get both.

(To clarify terminology: in that paper, "scope" is the scheduling
granularity, e.g., "by connection", and "context" is the current
instantiation of that scope, e.g., "1.2.3.4:1234,2.3.4.5:80" for
connection scope.)
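A minimal sketch of that hashing step, in Python for illustration (the function name and setup are hypothetical, not Bro internals): a context value is hashed deterministically, so the same context always lands on the same worker.

```python
import hashlib

# Hypothetical sketch: map a context value (here a connection 5-tuple
# rendered as a string) to one of N workers by hashing it, so the same
# context is always scheduled to the same worker.

def target_worker(context: str, num_workers: int) -> int:
    digest = hashlib.sha256(context.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_workers

ctx = "1.2.3.4:1234,2.3.4.5:80"  # the "context" for connection scope
w = target_worker(ctx, 4)
print(w)  # stable across calls and processes for the same context
```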

Robin

I like this idea a lot!

Yeah, I like it, too. Additional benefit: it actually opens the door
for parallelization again ...

  Do you foresee that causing trouble if we went that direction,
  though? It seems like it could cause events to back up waiting for
  some other event to finish executing.

It could. The async operations all time out, so there's a cap on how
long things can get stalled, but still: if that happens to many async
operations simultaneously, we could end up with lots of stuff in
flight. On the other hand, I don't think this can be avoided: either
we want dependencies or we don't. You can't have your cake and eat it
too, I guess. :-)

Robin

Yeah, it seems open to having multiple methods available for the user
to choose from: dynamic call to script-land, dynamic calculation in
core (select from predefined list), or even a static value (not that I
can think of a particular place that would actually use that right
now).

Was there more benefit of using the predefined choice than saving the
overhead of calling out to script-land to do the context calculation?

- Jon

No, don't think so. It mainly came out of an analysis of existing
scripts, and those 5-tuple based subsets were the main use case
anyway. Actually, I'm not even sure anymore whether there might have
been a custom execute-my-own-function scope as well; I'll see if I can
find the old code somewhere.

Robin

I just wanted to quickly chime in here to say that I generally like the
idea of having these contexts. I have no idea how complex it would be to
implement something like this, but it seems like it might be a
relatively clean solution to our problem :-)

Johanna

Sounds like we all like that idea. Now the question is whether we want
to wait for that to materialize (which will take quite a bit more
brainstorming and then implementation, obviously), or whether we want
to get async in in its current state and then put that on the TODO
list? I can see arguments either way; curious what others think.

Robin

Releasing the async feature by itself sounds unsafe. Due to the "code
understandability" concern people had, it didn't sound like there was
a big demand to start using it without a solution to that anyway, and
releasing async by itself risks that solution falling through or being
delayed/unusable for an extended period, resulting in "weird code"
being written in the meantime and possibly never rewritten later on
because "it still technically works".

I'd maybe even go about implementing context/scope by branching off
the async branch, just in case details arise that make it less
suitable than originally thought and we have to go back to the
drawing board.

- Jon

I have one obvious fear if we put it in now: that it will be more or less put off forever :).

And we might end up with a pretty annoying situation where people (and we) start using it, and you end up with all these really hard-to-debug corner cases with event ordering. This is already kind of hard to get right, especially with protocols that have a bunch of events - I managed to get this wrong a few times in the case of SSL, where some parts of the connections were not logged correctly in some circumstances.

Basically - I fear that this will make it really hard to write scripts that work correctly in all cases.

Johanna

Ok, agreed that it's best to postpone "async". My gut isn't quite as
skeptical about its safety, but I see the argument. (Although then
nobody will be allowed to complain about "when" anymore. ;-))

Jon, if you want to think more about context/scoping it would be great
to keep the concurrency aspects of this in mind as well, as this could
eventually get us there, too. For more context here are a couple more
pointers to past ideas that eventually led to that CCS paper I had
pointed to earlier:

     - The scope-based concurrency model was originally described in
       Section 5.1 of this paper:
       http://www.icir.org/robin/papers/cc-multi-core.pdf.

     - I actually found the old concurrency code from the Bro 1.x
       era. I'll send you a pointer to that. "grep '&scope'" over
       those policy scripts yields the output at the end of this mail.

     - I now also remember that we indeed needed the internal tracking
       of the current context; just relying on event parameters wasn't
       sufficient. For concurrency the most tricky parts are events
       triggered indirectly, like through timers, as they will often
       need to follow the same scheduling constraints as the original
       one (not sure if that applies to "async").

Robin

--------- cut -------------------------------------------------------

superlinear/policy//bro.init:# &scope=<scope>".
superlinear/policy//conn.bro: # determine_service() runs with &scope=pair.
superlinear/policy//demux.bro:event _demux_conn(id: conn_id, tag: string, otag: string, rtag: string) &scope=connection(id)
superlinear/policy//dns.bro: msg: dns_msg, query: string) &scope=connection(c$id)
superlinear/policy//firewall.bro:event report_violation(c: connection, r:rule) &scope=connection(c$id)
superlinear/policy//ftp.bro:event add_to_first_seen_cmds(command: string) &scope=custom(command)
superlinear/policy//hot.bro:event check_hot_event(c: connection, state: count) &scope=connection(c$id)
superlinear/policy//icmp.bro:event update_flow(icmp: icmp_conn, id: count, is_orig: bool, payload: string) &scope=hostpair(icmp$orig_h, icmp$resp_h)
superlinear/policy//interconn.bro:event _remove_from_demuxed_conn(id: conn_id) &scope=connection(id)
superlinear/policy//nfs.bro:event nfs_request_getattr(n: connection, fh: string, attrs: nfs3_attrs) &scope=custom(fh)
superlinear/policy//nfs.bro:event nfs_attempt_getattr(n: connection, status: count, fh: string) &scope=custom(fh)
superlinear/policy//nfs.bro:event nfs_request_fsstat(n: connection, root_fh: string, stat: nfs3_fsstat) &scope=custom(root_fh)
superlinear/policy//nfs.bro:event nfs_attempt_fsstat(n: connection, status: count, root_fh: string) &scope=custom(root_fh)
superlinear/policy//notice-action-filters.bro:event notice_alarm_per_orig_tally(n: notice_info, host: addr) &scope=originator(host)
superlinear/policy//notice.bro:# will run with &scope connection so that we can store notice_tags.
superlinear/policy//notice.bro:event NOTICE_conn(n: notice_info) &scope=connection(n$conn$id)
superlinear/policy//portmapper.bro:event _do_pm_request(r: connection, proc: string, addl: string, log_it: bool) &scope=originator(r$id$orig_h)
superlinear/policy//scan.bro:event do_rpts_check(c: connection, num: count) &scope=hostpair(c$id$orig_h, c$id$resp_h)
superlinear/policy//scan.bro:event check_scan(c: connection, established: bool, reverse: bool) &scope=originator(reverse? c$id$resp_h : c$id$orig_h)
superlinear/policy//scan.bro:event account_tried(c: connection, user: string, passwd: string) &scope=originator(c$id$orig_h)
superlinear/policy//signatures.bro:event _do_signature_match_notice(state: signature_state, msg: string, data: string) &scope=originator(state$conn$id$orig_h)
superlinear/policy//signatures.bro:event _do_count_per_resp(state: signature_state, msg: string, data: string) &scope=responder(state$conn$id$resp_h)
superlinear/policy//signatures.bro:event _check_alarm_once(state: signature_state, msg: string, data: string) &scope=custom(state$id)
superlinear/policy//trw-impl.bro:event add_to_friendly_remotes(a: addr) &scope=originator(a)
superlinear/policy//trw-impl.bro:event check_TRW_scan(c: connection, state: string, reverse: bool) &scope=originator(reverse? c$id$resp_h : c$id$orig_h)
superlinear/policy//weird.bro:event report_weird_conn_once(t: time, name: string, addl: string, c: connection, action: NoticeAction) &scope=custom(name)
superlinear/policy//weird.bro:event net_weird(name: string) &scope=custom(name)
superlinear/policy//worm.bro:global worm_list: table[addr] of count &default=0 &read_expire = 2 days; #&scope=originator;
superlinear/policy//worm.bro: &default=0 &read_expire = 2 days &expire_func=expi; # &scope=originator;

I think we should instead just skip the "async" keyword altogether.
Requiring it at some places, but not others, hurts more than it helps
in my opinion.

This sounds fine to me. Given that Bro is inherently an asynchronous language, it makes sense for those semantics to trickle down to the function call level. In my opinion, it's in line with user expectations: processing is network-event driven. A function finishes when it has all the data it needs for its processing - whether synchronous or asynchronous.

On a separate note: for asynchronous operations to be truly useful, they need to propagate to the lowest level. That concerns particularly file I/O in addition to network I/O. Do you have any plans to go there as well (perhaps later down the line)? I am asking because there's a natural fit to do asynchronous file I/O with CAF. It just hasn't been tackled yet.

    Matthias