Consistent error handling for scripting mistakes

Trying to broaden the scope of:

https://github.com/bro/bro/issues/208

I recently noticed there's a range of behaviors in how various
scripting mistakes are treated. They may (1) abort, as in case of bad
subnet mask or incompatible vector element assignment (2) skip over
evaluating (sub)expression(s), but otherwise continue current function
body, as in case of non-existing table index access or (3) exit the
current function body, as in the classic case of uninitialized record
field access.

1st question: should these be made more consistent? I'd say yes.

2nd question: what is the expected way for these to be handled? I'd
argue that (3) is close to expected behavior, but it's still weird
that it's only the *current function body* (yes, *function*, not
event) that exits early -- hard to reason about what sort of arbitrary
code was depending on that function to be fully evaluated and what
other sort of inconsistent state is caused by exiting early.

I propose, for 2.7, to aim for consistent error handling for scripting
mistakes and that the expected behavior is to unwind all the way to
exiting the current event handler (all its function bodies). That
makes it easier to explain how to write event handlers such that they
won't enter too wild/inconsistent of a state should a scripting error
occur: "always write an event handler such that it makes no
assumptions about order/priority of other events handlers". That's
already close to current suggestions/approaches.

One exception may be within bro_init(), if an error happens at that
time, I'd say it's fine to completely abort -- it's unlikely or hard
to say whether Bro would operate well if it proceeded after an error
that early in initialization.

Thoughts?

- Jon

Along the same vein of sensible Bro script error handling, I’m resending an issue I found in January:

I was tinkering with the sumstats code, and inadvertantly deleted the final “}” closing out the last function. When running the code, the misleading error message is received:

error in bro/share/bro/base/frameworks/tunnels/./main.bro, line 8: syntax error, at or near “module”

presumably due to the function still being open when the next policy script is loaded. Wouldn’t it be more reasonable to check at the end of each script when loaded that there are no dangling functions, expressions, etc. ???

I was tinkering with the sumstats code, and inadvertantly deleted the final "}" closing out the last function. When running the code, the misleading error message is received:

Yes, that's a bit of a different topic, but still tracked (at
low-normal priority):

https://github.com/bro/bro/issues/167

There are also silent fails which probably should give a warning, such as failing to include the fully-qualified event name silently preventing the event from being triggered.

Also a bit different that what I was talking about, but also tracked
(at higher priority since it's a common mistake):

https://github.com/bro/bro/issues/163

My idea on runtime scripting errors would be to apply a sensible default to the offending expression (null or 0, as the case may be, might be sufficient), log the error, and continue....

In the following example (comments reflect current behavior) you'd
expect the "false" branch in foo() to be taken?

I recently noticed there's a range of behaviors in how various
scripting mistakes are treated.

There's a 4th: InterpreterException.

1st question: should these be made more consistent? I'd say yes.

Yes, definitely.

that it's only the *current function body* (yes, *function*, not
event) that exits early -- hard to reason about what sort of arbitrary
code was depending on that function to be fully evaluated and what
other sort of inconsistent state is caused by exiting early.

... and what happens if the function is supposed to return a value,
but now doesn't?

I propose, for 2.7, to aim for consistent error handling for scripting
mistakes and that the expected behavior is to unwind all the way to
exiting the current event handler (all its function bodies).

Agree with that. Can we do that cleanly though? The problem with
InterpreterException is that it may leak memory. We'd need to do the
unwinding manually throughout the interpreter code, but that means
touching a number of places to pass the error information back.

One exception may be within bro_init(), if an error happens at that
time, I'd say it's fine to completely abort -- it's unlikely or hard
to say whether Bro would operate well if it proceeded after an error
that early in initialization.

Yeah, that makes sense.

Robin

> that it's only the *current function body* (yes, *function*, not
> event) that exits early -- hard to reason about what sort of arbitrary
> code was depending on that function to be fully evaluated and what
> other sort of inconsistent state is caused by exiting early.

... and what happens if the function is supposed to return a value,
but now doesn't?

Accesses to it emit a "value used but not set" error, but subsequent
statements within same function/event still get evaluated (unless they
are themselves now incoherent and trigger various cascading error
behaviors).

> I propose, for 2.7, to aim for consistent error handling for scripting
> mistakes and that the expected behavior is to unwind all the way to
> exiting the current event handler (all its function bodies).

Agree with that. Can we do that cleanly though? The problem with
InterpreterException is that it may leak memory. We'd need to do the
unwinding manually throughout the interpreter code, but that means
touching a number of places to pass the error information back.

Should be possible, just a question of effort/difficulty. An
alternative to manually passing error information back via return
values is migrating from explicit reference counting to shared_ptr.
Either approach requires touching similar code locations, but also the
later may be easier to proceed with after the old serialization system
gets removed from the BroObj class hierarchy.

Since we already have a class of errors that may induce leaks, we
could still move forward with applying consistent error handling
behavior via InterpreterException, but then later expect to resolve
the leakage issue independently via implementation detail
improvements.

The final resolution is still for people to correct underlying
scripting mistakes, it's just that having more consistent and improved
error handling makes it easier to reason about the subsequent
operational state of Bro with more confidence.

- Jon

I like what you & Robin sketch. FWIW, it's hard for me to get excited
over the issue of leaks-in-the-face-of-error-recovery. Presumably it would
take in practice a lot of error recovery before this actually hoses the
execution due to running out of memory. At that point, it's not unreasonable
for things to keel over anyway.

An alternative/additional approach would be to introduce a notion of
"failure" as a first-class object. In another life I used a language that
did that, and it worked remarkably well. But clearly that's a bigger
undertaking than the valuable near-term notion of regularizing how Zeek
deals with errors.

    Vern

Yeah, it's not great, but intention would still be to eventually fix
the leakage problem, too.

Anyway, made a new GH issue to track the broader error handling enhancement:

https://github.com/bro/bro/issues/211

- Jon