Proposed change to lambda semantics - shallow copying rather than references

Hi Folks,

For the script optimization/compilation work I’ve been doing, I’ve been looking into what it will take to compile lambdas (anonymous functions). Currently, these use “reference” semantics when referring to local variables. For example, this code:

function demo(): function()
    {
    local i = 3;
    local f = function() { print i; };
    i = 5;
    return f;
    }

event zeek_init()
    {
    demo()();
    }

will print 5, because the anonymous function assigned to f holds a reference to demo’s local variable i, and so reflects the change made to i after the anonymous function was instantiated. This continues to work even after demo exits (which requires some extra work internally to support).

Due to how the script compiler represents values, the above semantics would require some fairly extensive support.

My proposal is to change the semantics to instead be shallow-copying, which means that atomic values are copied, but aggregate values (tables, records, and vectors) are shared. With this change, the above code would print 3. However, this code would still print 5:

function demo(): function()
    {
    local r: record { i: count; };
    r$i = 3;
    local f = function() { print r$i; };
    r$i = 5;
    return f;
    }

event zeek_init()
    {
    demo()();
    }

This change also brings the functionality closer to that used in when blocks, which use deep-copying. (Arguably, it could be a good idea to change when to use shallow-copying, too, but that’s a different discussion.) If one wants deep-copy semantics, that can be done with explicit copy()s, like this:

function demo(): function()
    {
    local r: record { i: count; };
    r$i = 3;
    local r2 = copy(r);
    local f = function() { print r2$i; };
    r$i = 5;
    return f;
    }

event zeek_init()
    {
    demo()();
    }

— Vern

I think the current functionality would be better, since it's how I'd expect it to behave (which probably just reflects my bias toward procedural languages). That said, since Zeek only gained support for closures relatively recently there's likely little precedent, and there's a way out via records. So I'd be okay with it.

It'd be good to explore whether Zeek could warn if the change affects existing code. Not sure how feasible that is...

Hth,
Christian

> My proposal is to change the semantics to instead be shallow-copying, which means that atomic values are copied, but aggregate values (tables, records, and vectors) are shared. With this change, the above code would print 3. However, this code would still print 5:

I for one don’t really like this; in my opinion, atomic values and aggregate types should behave the same. Everything else feels at least unintuitive to me.

I also agree with Christian that, given the choice between a deep copy and the current functionality, I like the current functionality more.

Though I would still prefer deep copies to having different behavior for aggregate/nonaggregate types.

Johanna

I agree with Johanna that whatever is done, it would be best if it were consistent, and not vary with the type of the captured variable.

IIRC, in the ‘70s, it was up to FORTRAN compiler writers whether arguments passed to procedures and functions were passed by value or by reference, and some individual compilers varied their behavior with the argument type. This led to all kinds of bugs and headaches.

Personally, I like the approach used for C++ lambdas, where captured variables are passed by value unless they’re explicitly marked to be passed by reference. But I don’t know how much work something like that would be.

> I agree with Johanna that whatever is done, it would be best if it were consistent, and not vary with the type of the captured variable.

It already varies with type of variable for assignments and function call parameters. Thus I’m puzzled at the desire for deep-copy over shallow-copy, given that Zeek is already primarily shallow-copy.
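
A minimal sketch of that existing assignment and call-parameter behavior, for readers who want to see it side by side. The record type R and the helper bump are hypothetical names used only for illustration:

```zeek
# Hypothetical names (R, bump) chosen for illustration only.
type R: record { i: count; };

function bump(r: R)
    {
    # r refers to the caller's record, so the caller sees this update.
    r$i = r$i + 1;
    }

event zeek_init()
    {
    # Atomic values are copied on assignment.
    local n = 3;
    local m = n;
    m = 5;
    print n;    # 3

    # Aggregates are shared on assignment ...
    local a = R($i=3);
    local b = a;
    b$i = 5;
    print a$i;  # 5

    # ... and when passed as function arguments.
    bump(a);
    print a$i;  # 6
    }
```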

— Vern

Same for me: wondering what's actually (in)consistent given the
behavior of assignment/call-params.

Otherwise, I don't have a strong opinion about what the lambda
semantics should be, but if a change does occur, the one thing that's
for sure on my wishlist is consideration for some deprecation-path,
differentiating-syntax (maybe even just temporary), or other
warning/notice that can help users along instead of potentially
breaking their code outright.

- Jon

Access to the variable here is so natural (in the script code, not the implementation) that I don't immediately think of assignment semantics at all. So I intuitively compared it to other languages, where I'd tend to expect i to be a (deep) reference. If we cover the behavior in the docs I'm fine with either.

Regardless of where this lands, your point strikes me as a great one to work into our current docs push. I just did a bit of digging and I can't find much that describes Zeek's shallow/deep or reference/value semantics. I see a bit in the description of copy(), in that of closures, and in CHANGES by some guy in 2005:

- The manual has been updated to clarify that aggregate values in events
   are passed as shallow copies, so that modifications to elements of the
   values after posting the event but before it's handled will be visible
   to the handlers for the events (Christian Kreibich).

:-)
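
A small sketch of what that CHANGES entry describes, assuming a hypothetical record type R and a hypothetical event named show. Going by the entry, the handler should see the value assigned after the event was posted:

```zeek
# Hypothetical names (R, show) chosen for illustration only.
type R: record { i: count; };

global show: event(r: R);

event show(r: R)
    {
    # Runs later, once the queued event is dispatched.
    print r$i;
    }

event zeek_init()
    {
    local r = R($i=3);
    event show(r);  # posts (queues) the event; r is not deep-copied
    r$i = 5;        # modified after posting, before the handler runs
    }
```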

Best,
Christian

There’s also some weirdness with the current implementation:

This outputs 4 and 5, like I’d expect:

function demo(): function()
    {
    local i = 3;
    local f = function() { ++i; print i; };
    return f;
    }

event zeek_init()
    {
    local d = demo();
    d();
    d();
    }

But this outputs 2 and 3 instead of 2 and 7:

type Counter: record {
    i: function();
    g: function(): count;
};

function demo(): Counter
    {
    local n = 1;
    local c: Counter;
    c$i = function() { ++n; };
    c$g = function(): count { ++n; return n; };
    return c;
    }

event zeek_init()
    {
    local c = demo();
    print(c$g());
    c$i();
    c$i();
    c$i();
    c$i();
    print(c$g());
    }

Each lambda looks like it gets its own ‘n’.

> for sure on my wishlist is consideration for some deprecation-path,
> differentiating-syntax (maybe even just temporary), or other
> warning/notice that can help users along instead of potentially
> breaking their code outright.

Good point. Seems a natural way to do this is to add C++-style [] capture syntax, and a deprecation warning (and the current semantics) if it’s missing. (And maybe no warning if the body doesn’t use any of the outer variables, since that form will continue to work.)

— Vern

> There’s also some weirdness with the current implementation:

Urp. Yeah, things get wonky in the current implementation when the original outer function exits. (Your example works as expected if we move the zeek_init statements into demo.) There’s a bunch of code in that case to salvage access to the now-otherwise-reclaimed frame, and it’s more complicated than one would like because of the need to avoid pointer cycles that will lead to leaks. No doubt it’s duplicating the frame at that point, rather than arranging to still share it.
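
For reference, a sketch of the rearrangement described in the parenthetical: the statements from zeek_init are moved into demo, so both lambdas are called while demo’s frame is still live. This is untested and assumes the reference semantics in place at the time of this thread:

```zeek
type Counter: record {
    i: function();
    g: function(): count;
};

function demo()
    {
    local n = 1;
    local c: Counter;
    c$i = function() { ++n; };
    c$g = function(): count { ++n; return n; };

    # Formerly in zeek_init: all calls now happen before demo exits,
    # so both lambdas still refer to the same live n.
    print(c$g());
    c$i();
    c$i();
    c$i();
    c$i();
    print(c$g());
    }

event zeek_init()
    {
    demo();
    }
```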

If lambdas had their scope limited to the outer function, it all gets so much easier … but also then doesn’t address some basic use cases.

— Vern

It’s interesting how different people have different intuitions on semantics here. I also see it as consistent with function arguments, which is why I’d be fine with it. That said, I was also thinking along the same lines of adding explicit capture specifications: deprecate the current, capture-spec-less syntax, and generally just require people to list what they want to capture; seems like a useful practice to me. And then we let them tell Zeek if they want deep or shallow copies (but always copies, not references). “when” could then move in the same direction as well; maybe it could even change to take a lambda instead of its own body. That would simplify the implementation, too.

Robin

Sounds like a way forward then to both address the current concern, and improve this overall. Does this work for everybody?

Robin

Yes for me, just one comment: I'm a bit nervous about getting too inspired by C++ syntax. With every new standard round it's looking more like control character soup. Vern, I'm not sure what you had in mind here ... but perhaps instead of something very close to C++, like

   local f = function[=]() { print r$i; };

to capture that I'd like r to be a deep copy, maybe we could consider something more Zeek-style, perhaps via attributes:

   local f = function() &deepcopy { print r$i; };

for all variables or

   local f = function() &deepcopy=r { print r$i; };

just for r, with shallow-copy the default. More typing, yes, but arguably less cryptic, and this is likely a relatively rarely used feature.

Best,
Christian

> Sounds like a way forward then to both address the current concern,
> and improve this overall. Does this work for everybody?

Sounds good to me. I intend to add captures for when too, though if that requires further discussion, perhaps whoever thinks so can start up a sub-thread. Will now follow up re syntax in response to Christian’s subsequent note.

— Vern

> Yes for me, just one comment: I’m a bit nervous about getting too inspired by C++ syntax. With every new standard round it’s looking more like control character soup. Vern, I’m not sure what you had in mind here … but perhaps instead of something very close to C++, like

I was thinking a much-reduced subset of what C++ supports, no [=] or such, just strict listing of the captures. So rather than:

local f = function[=]() { print r$i; };

this would be

local f = function [r]() { print r$i; };

or, for deep-copy

local f = function [copy r]() { print r$i; };

because it turns out that copy is already a keyword.

> local f = function() &deepcopy=r { print r$i; };
>
> just for r, with shallow-copy the default.

(1) There can’t be a default since at least for now, for deprecation purposes we need the default to be the current reference semantics.
(2) Given that, going the attribute way struck me as too clunky, for example:

local f = function() &copy=r1, &deepcopy=r2 { print r1$i + r2$i; };

I think instead

local f = function() [r1, copy r2] { print r1$i + r2$i; };

is readable and more streamlined.
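
To make the proposal concrete, here is a sketch of a complete script using the bracketed capture list. The capture syntax is the one proposed in this thread (written in the `function [r]()` form shown earlier) and is hypothetical at this point; the record type R is likewise an illustrative name:

```zeek
# Sketch only: the capture list below is proposed syntax, not something
# confirmed to parse as of this thread.
type R: record { i: count; };

function demo(): function()
    {
    local r1 = R($i=3);
    local r2 = R($i=3);

    # r1 is captured as a shallow copy (the record itself is shared);
    # r2 is captured as a deep copy taken when the lambda is created.
    local f = function [r1, copy r2]() { print r1$i + r2$i; };

    r1$i = 5;  # visible through the shared r1
    r2$i = 5;  # not visible through the deep-copied r2
    return f;
    }

event zeek_init()
    {
    # Under the proposed semantics this would print 8:
    # 5 from the shared r1 plus 3 from the deep-copied r2.
    demo()();
    }
```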

— Vern