set and vector operators

I'm working on some scripts that use sets and vectors, sometimes together,
and am finding it clunky that Bro doesn't offer much in the way of operators
for this. To that end, I'm thinking of implementing some along the following
lines, where values starting with 's' are sets, 'v' are vectors, and 'e'
are type-compatible elements:

  s1 + s2     Set union (for sets of the same type, of course)
  s1 || s2    Set union

  s1 && s2    Set intersection
  s1 - s2     Set difference

  s += e      Add the element 'e' to set 's'
              (same as the current "add s[e]")
  s -= e      Remove the 'e' element from 's', if present
              (same as the current "delete s[e]")

  s1 += s2    Same as "s1 = s1 + s2"
  s1 -= s2    Same as "s1 = s1 - s2"

  v += e      Append the element 'e' to the vector 'v'
  v += s      Append the elements of 's' to the vector 'v',
              with the order not being defined

  s += v      Add the elements of 'v' to 's'
  s -= v      Remove the elements of 'v' from 's', if present

These strike me as pretty straightforward, but please chime in if you
have comments!

    Vern

That's very similar to what python does, except they use & and | instead of && and ||.
I think they do that because 'set or' is closer to 'bitwise or' than 'logical or'

They also use ^ for symmetric difference.

    >>> a = set([1, 2, 3])
    >>> b = set([2, 3, 4])
    >>> a & b
    {2, 3}
    >>> a | b
    {1, 2, 3, 4}
    >>> a - b
    {1}
    >>> b - a
    {4}
    >>> a ^ b
    {1, 4}

> That's very similar to what python does, except they use & and | instead of && and ||.
> I think they do that because 'set or' is closer to 'bitwise or' than 'logical or'

Yeah, I thought of that, but Bro currently doesn't have any '&' or '|'
operators, which makes me reluctant to add them just for this. '&' is
particularly problematic, as it would introduce ambiguity as to whether
"&redef" means the redef attribute, or "use the 'and' operator on the value
of the variable 'redef'". We'd have to add a bunch of reserved words
to accommodate this.

> They also use ^ for symmetric difference.

(same here re being a new operator)

    Vern

> s1 + s2     Set union (for sets of the same type, of course)
> s1 || s2    Set union

(What's the difference between the two? Or do you mean either one or
the other?)

Like Justin, I was also thinking "|" and "&" might be more intuitive.
"||"/"&&" is really typically associated with boolean contexts, and
other languages might also coerce set operands into booleans in such a
context, so that, e.g., "s1 || s2" evaluates to true if either is
non-empty.
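
To make the coercion concern concrete, here is a small Python sketch. Python is used only because it already has both a logical `or` and a set `|`; the variable names are made up for illustration and nothing here is Bro syntax.

```python
# Logical "or" looks at truthiness (empty vs. non-empty),
# while "|" computes the actual set union.
s1 = set()
s2 = {1, 2}

logical = bool(s1 or s2)   # True, because s2 is non-empty
union = s1 | s2            # the set containing 1 and 2

print(logical)
print(union)
```

The same two operands give a boolean in one case and a set in the other, which is the kind of double meaning a "||" set union could run into.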

I see the problem with the parser but maybe adding keywords is the way
to go.

> s += e      Add the element 'e' to set 's'
>             (same as the current "add s[e]")
> s -= e      Remove the 'e' element from 's', if present
>             (same as the current "delete s[e]")

I'd skip these. I don't think we should add an additional set of
operators for things that Bro already supports, that seems confusing
to me (like Perl :-)

> s1 += s2    Same as "s1 = s1 + s2"

(Or s1 |= s2 if we pick "|" for union.)

> v += e      Append the element 'e' to the vector 'v'

That's probably the most requested Bro operator ever! :-)

> v += s      Append the elements of 's' to the vector 'v',
>             with the order not being defined

This one I'm unsure about. The point about the order being undefined
seems odd. If I don't care about order, wouldn't I stay with a set?

Robin

> s1 + s2 Set union (for sets of the same type, of course)
> s1 || s2 Set union

> (What's the difference between the two? Or do you mean either one or
> the other?)

No difference. It just seems to me that we need something for intersection,
and using existing operators, the natural choice for that is "&&". Once we have
that, might as well support "||" for union. But given symmetry with other
operators, "+" should work too.

> Like Justin, I was also thinking "|" and "&" might be more intuitive.

If we didn't have the keyword issue with &attributes, then I could see that.
But that strikes me as a significant drawback. (Also, if we do add these,
then a user might reasonably expect them to work bitwise for counts. We
could then consider implementing that too I guess.)

> other languages might also coerce set operands into booleans in such a
> context, so that, e.g., "s1 || s2" evaluates to true if either is
> non-empty.

Hey I don't care about other seriously busted languages! ;-)

> I see the problem with the parser but maybe adding keywords is the way
> to go.

Yuck.

> s += e Add the element 'e' to set 's'
> (same as the current "add s[e]")
> s -= e Remove the 'e' element from 's', if present
> (same as the current "delete s[e]")

> I'd skip these. I don't think we should add an additional set of
> operators for things that Bro already supports

I actually feel the opposite, that "add" is clunky ("delete" a bit less so)
and thus these are more natural. But in particular it seems we ought to
support these due to needing to support "v += e" (which is the one that
I most want!).

> s1 += s2 Same as "s1 = s1 + s2"

> (Or s1 |= s2 if we pick "|" for union.)

Yeah, if we bite off the '&'-keyword ugliness. Ugh.

> v += e Append the element 'e' to the vector 'v'

> That's probably the most requested Bro operator ever! :-)

Yee-up, per my note above!

> v += s Append the elements of 's' to the vector 'v',
> with the order not being defined

> This one I'm unsure about. The point about the order being undefined
> seems odd. If I don't care about order, wouldn't I stay with a set?

I do have a use case, but I agree it's odd; let me revisit it to see if
I really do need it. I might instead settle for "vector of set[xxx]".

    Vern

Hmmm thinking about it, we can get away with '&' with minimal keyword
conflict because there's such an easy (and natural-to-presume) fix -
namely, rather than "x&attrkeyword" you use "x & attrkeyword". Now
there's no problem, since the lexer only recognizes "&attrkeyword"
as a unit, with no whitespace allowed.
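
A toy tokenizer can illustrate the whitespace rule described above. This is only a sketch in Python, not Bro's actual lexer: the token names, the `tokenize` helper, and the three-entry keyword list are assumptions chosen for illustration.

```python
import re

# Small illustrative subset of attribute keywords (assumption: the real
# lexer knows many more).
ATTR_KEYWORDS = ("redef", "optional", "default")
ATTR_ALTERNATIVES = "|".join(ATTR_KEYWORDS)

TOKEN_RE = re.compile(
    r"\s*(?:"
    r"(?P<attr>&(?:" + ATTR_ALTERNATIVES + r")\b)"  # '&' glued to a keyword
    r"|(?P<ident>[A-Za-z_]\w*)"                      # plain identifier
    r"|(?P<op>[&|^+-])"                              # single-character operator
    r")"
)

def tokenize(text):
    """Split text into (kind, lexeme) pairs, trying the attribute rule first."""
    text = text.rstrip()
    tokens = []
    pos = 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise ValueError("cannot tokenize at position %d" % pos)
        kind = m.lastgroup
        tokens.append((kind, m.group(kind)))
        pos = m.end()
    return tokens

print(tokenize("x&redef"))    # '&redef' comes out as one attribute token
print(tokenize("x & redef"))  # '&' comes out as an operator, 'redef' as a name
```

In this sketch the only conflict is the no-whitespace spelling, so writing the operator with spaces around it is enough to avoid it.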

Given that, here's my updated proposal, with 'c' standing for values of
type count:

  s1 + s2     Set union
  s1 - s2     Set difference
  s1 | s2     Set union
  s1 & s2     Set intersection
  s1 ^ s2     Set symmetric difference

  s + e       The set resulting from adding the element 'e' to
              the set 's'
  s - e       The set resulting from removing the element 'e' from
              the set 's', if present

  s1 {+=, -=, |=, &=, ^=} s2
              Perform the corresponding set operation between
              s1 and s2 and put the result in s1.
  s {+=, -=} e
              Add or remove the element e from the set s

  c1 | c2
  c1 & c2     Bitwise or/and/xor of two count values
  c1 ^ c2

  c1 {|=, &=, ^=} c2
              Perform the corresponding bitwise operation between
              c1 and c2 and put the result in c1.

  v += e      Append the element 'e' to the vector 'v'

  s += v      Add the elements of 'v' to 's'
  s -= v      Remove the elements of 'v' from 's', if present
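
As a rough cross-check of the set rows, here is how they map onto Python's built-in sets. This is an illustrative sketch only, not Bro syntax: Python has no '+' for sets, so '|' stands in for both union spellings, and the "s += v" / "s -= v" rows are approximated with the update() and difference_update() methods.

```python
s1 = {1, 2, 3}
s2 = {2, 3, 4}

union = s1 | s2          # s1 + s2 and s1 | s2 in the proposal
difference = s1 - s2     # s1 - s2
intersection = s1 & s2   # s1 & s2
symmetric = s1 ^ s2      # s1 ^ s2

# "s + e" / "s - e": a new set with one element added or removed.
plus_e = s1 | {9}
minus_e = s1 - {1}

# "s1 |= s2": perform the operation and store the result in s1.
acc = {1, 2}
acc |= {2, 5}

# "s += v" / "s -= v": add or remove every element of a vector (list here).
s = {1, 2}
s.update([2, 3, 3])            # duplicates in the list collapse
s.difference_update([1, 7])    # 7 is absent, which is not an error

print(union, difference, intersection, symmetric)
print(plus_e, minus_e, acc, s)
```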

How does that sound?

    Vern

> Hmmm thinking about it, we can get away with '&' with minimal keyword
> conflict because there's such an easy (and natural-to-presume) fix -
> namely, rather than "x&attrkeyword" you use "x & attrkeyword". Now
> there's no problem, since the lexer only recognizes "&attrkeyword"
> as a unit, with no whitespace allowed.

Yeah, I think it could turn out ok. Using '&' and '|' in the set operations seems more natural/consistent to me and so maybe worthwhile to try that approach first.

It's also a nice time to introduce the bitwise operations between count values. I thought about adding that on one occasion and also know others have wanted it.

> How does that sound?

Sounds good to me. I'd find most of them more convenient and wish I'd had them in the past.

- Jon

> Now there's no problem, since the lexer only recognizes "&attrkeyword"
> as a unit, with no whitespace allowed.

Good idea, sounds right. And in case it did turn out to be
problematic, we could still go the way of adding all as keywords
later.

> How does that sound?

Sounds good to me, the bitwise operations will be great to have, too.

Just one more thing still: I'm actually feeling pretty strongly
against having multiple different operators for the same operation
(set union, set addition/removal). I just see that as leading to
confusion: scripts will inconsistently use one or the other, people
will have different preferences which version to prefer; they may not
even remember the other one. And we'd end up having to explain why
there are two versions, without having much of a great explanation
("one is ugly" doesn't sound great to me :-). Is it just me feeling
that way?

Robin

> Just one more thing still: I'm actually feeling pretty strongly
> against having multiple different operators for the same operation
> (set union, set addition/removal).

I'm fine with removing "add" and "delete" for sets! (But seems we gotta
keep them for a good while for backward compatibility. Plus, what would
be the remove operator for tables? "t -= index" seems pretty weird to me.)

But I don't think we should forego '+' and '-' for sets. It would be too
weird that "v += e" works but "s += e" does not (and "add v[e]" blows, so
let's not consider going down that path :-P). Once we have s += e, we
certainly should have s -= e. And once we have "s + e", "s1 + s2" seems
very natural to me too; I don't relish having to explain "oh *that* doesn't
work, you have to use s1 | s2" :-P.

> will have different preferences which version to prefer; they may not
> even remember the other one.

Seems to me they're both sufficiently mnemonic that this isn't a big worry.

    Vern

> Just one more thing still: I'm actually feeling pretty strongly
> against having multiple different operators for the same operation
> (set union, set addition/removal).

I'm maybe convincing myself that it's at least not that useful, or that there are alternative ways forward that don't introduce redundancies.

> I'm fine with removing "add" and "delete" for sets! (But seems we gotta
> keep them for a good while for backward compatibility. Plus, what would
> be the remove operator for tables? "t -= index" seems pretty weird to me.)

A nice thing about "add" and "delete" for sets is that you can infer the data type that you're operating on just looking at the local code/line. E.g. say you come back to some code after a few months and see "foo += 1". Not obvious what 'foo' is anymore. Could be vector, set, or count, etc.

> But I don't think we should forego '+' and '-' for sets. It would be too
> weird that "v += e" works but "s += e" does not (and "add v[e]" blows, so
> let's not consider going down that path :-P).

Yeah, I agree with that sentiment (on the condition that we did add +/- for sets).

I do also notice that you had "s + e" in the proposal and not "v + e". Isn't that weird by the same logic or is it just an accidental omission?

> Once we have s += e, we
> certainly should have s -= e. And once we have "s + e", "s1 + s2" seems
> very natural to me too; I don't relish having to explain "oh *that* doesn't
> work, you have to use s1 | s2" :-P.

That also makes sense, though it's worth seeing what a minimal proposal looks like without the contentious '+' operations:

     s1 - s2, s1 -= s2    Set difference
     s1 | s2, s1 |= s2    Set union
     s1 & s2, s1 &= s2    Set intersection
     s1 ^ s2, s1 ^= s2    Set symmetric difference

     c1 | c2, c1 |= c2    Bitwise-or 'count' values
     c1 & c2, c1 &= c2    Bitwise-and 'count' values
     c1 ^ c2, c1 ^= c2    Bitwise-xor 'count' values
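
For the three 'count' rows, the intended results can be sketched with Python integers. This assumes a 'count' behaves like a non-negative integer; the binary literals are arbitrary sample values.

```python
c1 = 0b1100   # 12
c2 = 0b1010   # 10

bit_or = c1 | c2    # bits set in either operand
bit_and = c1 & c2   # bits set in both operands
bit_xor = c1 ^ c2   # bits set in exactly one operand

# Compound form: "c1 ^= c2" stores the result back into c1.
c1 ^= c2

print(bin(bit_or), bin(bit_and), bin(bit_xor), bin(c1))
```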

The only one now missing that I'd probably find myself using is:

     v += e Append element to vector

And for that, a BiF or generic script-layer function call (if that were possible) would even make me happy:

     push(v, e)

That also could go back to what I was saying before about readability: it would then be more obvious than "v += e" regarding what data types are involved.

Same idea would apply to "s += v" and "s -= v" (if we were inclined):

     add_to_set(s, v)
     delete_from_set(s, v)
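
To show what these three calls would do, here is a sketch of their semantics in Python. The function names come from the lines above; the bodies and the use of Python lists and sets as stand-ins for Bro vectors and sets are assumptions for illustration.

```python
def push(v, e):
    """Append element e to the end of vector v (modifies v in place)."""
    v.append(e)

def add_to_set(s, v):
    """Add every element of vector v to set s (modifies s in place)."""
    s.update(v)

def delete_from_set(s, v):
    """Remove every element of vector v from set s, ignoring absent ones."""
    s.difference_update(v)

vec = [1, 2]
push(vec, 3)

members = {1, 2}
add_to_set(members, [2, 3, 4])
delete_from_set(members, [1, 9])

print(vec)
print(members)
```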

Could that all be an alternative way forward? Or is it missing other important aspects?

- Jon

> A nice thing about "add" and "delete" for sets is that you can infer the
> data type that you're operating on just looking at the local code/line.

Only sort of. For delete, you don't know whether it's a table or a set,
and for neither do you know what type of table/set if you can't immediately
apprehend the type of the index (e.g., "delete foo[foo2]" doesn't tell
me what type of index the set/table foo uses).

> E.g. say you come back to some code after a few months and see "foo +=
> 1". Not obvious what 'foo' is anymore.

I don't think it's reasonable to have the bar be "can you tell what's going
on in isolation". It should include consideration of associated context,
variable names, and comments. In fact, even now you don't know whether
for "foo += 1" foo is an integer, a count, or a double.

> I do also notice that you had "s + e" in the proposal and not "v + e".
> Isn't that weird by the same logic or is it just an accidental omission?

This is because "v + e" already has a meaning: apply "+ e" to each element
of v. (Note though that "v += e" is not currently allowed.)
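
The two meanings being contrasted here can be sketched side by side in Python, with a list standing in for a Bro vector. The variable names are illustrative only.

```python
v = [1, 2, 3]
e = 10

# Existing "v + e" meaning: apply "+ e" to each element of v.
elementwise = [x + e for x in v]

# Proposed "v += e" meaning: append e to the end of the vector.
appended = list(v)
appended.append(e)

print(elementwise)
print(appended)
```

The first result has the same length as v with every value shifted; the second has one extra element and the original values untouched.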

>      v += e    Append element to vector

> And for that, a BiF or generic script-layer function call (if that were
> possible) would even make me happy:

>      push(v, e)
...
>      add_to_set(s, v)
>      delete_from_set(s, v)

Yuck. I would hate this. Might as well use Lisp!

A basic tenet of Bro's language design has been to minimize verbosity.
(For example, its use of type inference, and its support of C-style
operators like "++".) Let's please not move in that direction.

    Vern

Yeah, it was maybe a bit of a stretch -- more just an observation I was trying to see if we could run with. It also relates to my recent experience of trying to read/debug some C++ code with a lot of operator usage: I found myself wishing some were just named function calls, so I could more easily navigate the code and even just find where certain operators were defined/declared.

So the point I'm at now is that it would probably be nice not to have multiple operators for the same thing, though I don't have a strong feeling about it.

- Jon

I was actually not aware of this - and if we keep this behavior I am a bit
opposed to adding a few of the others. It especially feels weird to me if v +
e and v += e are operations that perform something completely different. I
also think it is a bad idea to have s + e and v + e perform completely
different operations.

Johanna

Perhaps my thinking is too driven by how this is handled internally - but
thinking that way I am kind of opposed to getting rid of add and delete for
sets. For me sets were always a special case of tables - and it made
complete sense that they operate in the same way.

I think I actually would prefer just keeping add/delete, at least for
sets, and not introduce the plus-syntax. While it is a bit shorter,
add/delete is also not that ugly. And while you did not like Jon's
argument that it lets you more easily determine what is going on in the
code, I kind of think it has a bit of merit.

Johanna

> It especially feels weird to me if v + e and
> v += e are operations that perform something completely different.

Yeah, I hear you. OTOH, I *really* would like a succinct way to say "add
this to the end of this vector", it's such a common idiom.

Robin and I discussed this a bit. Our ultimate thinking was along the
lines of:

(1) likely there's no significant use right now of "v op e" semantics,
    given how none of us initially remembered it

(2) down the line such semantics could be quite handy if we start pushing
    on vector operations for doing statistical or ML computations (that's why
    I added "v op e" in the first place, inspired by how easy R makes these),

(3) "v += e" really is a nice append-to-vector idiom

(4) so how about we change "v op e" into something else to avoid the
    conceptual clash w/ v += e, while still having it available for
    the possible uses in (2) above?

The question then was what would be the new "v op e" syntax.
The best we could come up with (which we both found not-too-awful) is
"vector(v op e)". Wrapped in "vector(...)", the operation becomes the
current semantics (apply "op e" separately to each element of v).
"v op e" by itself would now be an error (which could point the user
at the "vector(...)" syntax as possibly providing what they're looking
for). "v += e" would be "append e to v".
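
One way to picture the proposed split is a small Python class. This is a sketch of the idea only: `Vec` and `elementwise` are hypothetical names, and `elementwise(v, op, e)` merely stands in for the proposed "vector(v op e)" syntax, which Python cannot express directly.

```python
import operator

class Vec:
    """Toy vector: '+=' appends, bare '+' is rejected with a hint."""

    def __init__(self, items=()):
        self.items = list(items)

    def __iadd__(self, e):
        # "v += e" means "append e to v".
        self.items.append(e)
        return self

    def __add__(self, e):
        # "v op e" by itself is an error that points at the explicit form.
        raise TypeError('"v + e" is not allowed here; '
                        'use elementwise(v, operator.add, e) instead')

def elementwise(v, op, e):
    """Apply 'op e' separately to each element of v, returning a new Vec."""
    return Vec(op(x, e) for x in v.items)

v = Vec([1, 2, 3])
v += 4
shifted = elementwise(v, operator.add, 10)

print(v.items)
print(shifted.items)
```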

Do you buy that?

    Vern

> I think I actually would prefer just keeping add/delete, at least for
> sets, and not introduce the plus-syntax.

Okay, I can live with this as long as '|' and '-' support add-to-set and
remove-from-set. But I think those have to work, given we'll enable them
for operations on two sets.

    Vern

One additional piece of context here: That vector(...) syntax could
then be used more broadly in the sense of creating a different
semantic context for the operations inside. That kind of opens up a
whole new set of type-specific operator meanings, without affecting
current/standard ones (it's like introducing R inside parentheses :-).
It's not super-great, but at least it's explicit and we couldn't
come up with anything better if we want to have such operations as
operators. Might work for some other types as well.

Robin

> The question then was what would be the new "v op e" syntax.
> The best we could come up with (which we both found not-too-awful) is
> "vector(v op e)". Wrapped in "vector(...)", the operation becomes the
> current semantics (apply "op e" separately to each element of v).

Maybe "vectorize(v op e)" ?

Implies implementing via SIMD instructions.

> "v op e" by itself would now be an error (which could point the user
> at the "vector(...)" syntax as possibly providing what they're looking
> for). "v += e" would be "append e to v".

That still seems odd to me. If "v += e" means "append", then I might expect "v + e" to do the same, except producing a new value w/ original vector not modified.

Maybe that's a less common use-case, though, and so "v op e" being an error would be less weird than suddenly changing the meaning of that operation.

- Jon

Well, my vote then remains not adding new set operators for
add/delete, so that we don't have multiple ways to do the same thing.
Just looked at Python again, as a data point: That's what they do,
too. There are '|'/'&'/'-' for set/set operations, but no versions of
those for individual elements (they do that through methods instead;
add/delete are kind of our version of methods). Same for Ruby. I
looked around for a few more minutes for other languages, but didn't
immediately find any that even have any set operators at all (only
methods/functions for union/intersection/etc.).
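
The Python behavior mentioned above is easy to check; the following short sketch uses only built-in set operations, with made-up sample values.

```python
s = {1, 2, 3}
t = {3, 4}

# Operators are defined between two sets.
both = s | t
common = s & t
only_s = s - t

# Single elements go through methods instead.
s.add(5)
s.discard(1)    # like remove(), but no error if the element is absent

# Combining a set with a bare element via an operator is rejected.
try:
    s | 7
    mixed_ok = True
except TypeError:
    mixed_ok = False

print(both, common, only_s, s, mixed_ok)
```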

Robin

Yup, I think I concur.

Johanna