Per item expiration for tables

Jan · April 6, 2016, 1:11pm

Hi,

I have a few things I am planning to add to the intel-framework. One of
them is expiration for intelligence items. To achieve per item
expiration in a table there is a little hack that is used in the
notice-framework and in the new netcontrol-framework: By setting
&create_expire=0 and returning the intended timeout for each item in the
corresponding expire_func, one can achieve per item expiration (see e.g.
scripts/base/frameworks/netcontrol/catch-and-release.bro).

This approach however does not work for &read_expire and &write_expire,
because accessing the item resets the expiration timeout based on the
&read/write_expire attribute of the table (in this case 0) instead of
the value that was previously returned by the expire_func. The following
script demonstrates this effect:
https://gist.github.com/J-Gras/061983dac59224a03d3bfad4476a1dd9

The straight-forward solution would be to allow each item to hold its
own expiration timeout. Talking to Seth about this, we came up with two
possible approaches to achieve this:
1) Use the return value of the expire_func to set this value.
2) Use a bif or language feature (e.g. expire 10sec { tbl[idx] }; ) to
set this value.

I would prefer the second approach, as the intention of the expire_func
return value is to provide a delay for a single expiration event. This
would e.g. allow to set an individual expire timeout of e.g. 1 hour for
a single item. Once the expire_func is called one could set a delay of
e.g. 10min. In case the item is accessed, the timeout would be reset to
the originally intended 1 hour instead of 10min.

What are your opinions on that? Which approach would you prefer or do
you think per item expiration is a bad idea in general?

Best regards,
Jan

robin · April 7, 2016, 3:58pm

I understand the motivation but I would prefer to stick with existing
mechanisms, as per item expiration times can get expensive (that would
require storing an additional float for all table entries). It might
also be a bit too specialized a use case to add new syntax to support
it.

Let me try an idea: could you limit the set if expiration times to a
predefined list of choices (e.g., 10mins, 1hr, 1d, 1w, 1m)? Then you
could work with a set of tables with corresponding expiration
intervals.

Robin

Jan · April 11, 2016, 9:05pm

Hi Robin,

I understand the motivation but I would prefer to stick with existing
mechanisms, as per item expiration times can get expensive (that would
require storing an additional float for all table entries). It might
also be a bit too specialized a use case to add new syntax to support
it.

While I think adding a float for table entries would not be too
expensive (considering the common dimensions of Bro-deployments), I can
follow that this is an edge case, which might not justify to introduce
new bifs or even syntax support.

Let me try an idea: could you limit the set if expiration times to a
predefined list of choices (e.g., 10mins, 1hr, 1d, 1w, 1m)? Then you
could work with a set of tables with corresponding expiration
intervals.

I am not sure I get that right. Wouldn't that require a lot of duplicate
code (at least for the table declarations)?

My alternative would be to implement a (configurable) timeout and allow
timeout values that are multiples of this value. Another approach could
be to allow any timeout values, use a single table timeout for "garbage
collection" of expired entries and check validity on every match. But I
think the last approach would introduce significant overhead.

Best regards,
Jan

Jan · May 26, 2016, 9:42pm

Hi,

I ran into an issue while trying to make the &write_expire interval
configurable: Using a redefable constant does not work here, as the
expression only gets evaluated when the table is initialized and thus
later redefs do not influence the value. I thought about circumventing
this by setting the value to 0 and maintain an extra variable to check
against in my expire_func and return the right value. Unfortunately this
won't work with write/read_expire as a write or read will reset the
expiration to the initial value of 0.

A solution could be to evaluate the interval expression every time it is
used inside the table implementation. The drawback would be that there
is no fixed value for serialization (I am not sure about the effects
here). Another solution would be to provide a bif (or implement a
language feature) to change the expire_time value from inside the
expire_func.

There was a somehow similar discussion about per item expiration (see
http://mailman.icsi.berkeley.edu/pipermail/bro-dev/2016-April/011731.html)
in which Robin came up with the solution of multiple tables with
different expiration values. Again this would be a solution but doesn't
feel right (duplicate code, static and somehow counterintuitive for the
user). By the way: This is again about intel expiration. This time I
thought I'll keep it simple and just have one user-defined expiration
value for all items

Maybe I am missing something regarding the loading sequence of scripts
and this problem could be solved easier. So I am open for any
suggestions or feedback!

Best regards,
Jan

Jan · June 8, 2016, 6:30pm

My explanations might be hard to follow without examples. So I am adding
some pseudo code:

I ran into an issue while trying to make the &write_expire interval
configurable: Using a redefable constant does not work here, as the
expression only gets evaluated when the table is initialized and thus
later redefs do not influence the value.

# base script:
const exp_val = 5min &redef;
data: table[addr] of string &write_expire=exp_val;

# user script:
redef exp_val = 20min; # has no effect

I thought about circumventing
this by setting the value to 0 and maintain an extra variable to check
against in my expire_func and return the right value. Unfortunately this
won't work with write/read_expire as a write or read will reset the
expiration to the initial value of 0.

# base script:
const exp_val = 5min &redef;

function do_exp(data: table[addr] of string, idx: addr): interval
  {
  if ( is_first_call() )
    return exp_val;
    # in case of a write, expire timer will be reset to 0
  else
    ...
  }

data: table[addr] of string &write_expire=0 expire_func=do_exp;

A solution could be to evaluate the interval expression every time it is
used inside the table implementation. The drawback would be that there
is no fixed value for serialization (I am not sure about the effects
here). Another solution would be to provide a bif (or implement a
language feature) to change the expire_time value from inside the
expire_func.

# base script:
function do_exp(data: table[addr] of string, idx: addr): interval
  {
  if ( is_first_call() )
    expire exp_val; # sets expire timer instead of delay
  else
    ...
  }

There was a somehow similar discussion about per item expiration (see
[Bro-Dev] Per item expiration for tables)
in which Robin came up with the solution of multiple tables with
different expiration values. Again this would be a solution but doesn't
feel right (duplicate code, static and somehow counterintuitive for the
user).

# base script:
type exp_interval enum { ei1m, ei10m, ... };
const exp_val = ei1m &redef;

data1m: table[addr] of string &write_expire=1min;
data10m: table[addr] of string &write_expire=10min;
...
data1d: table[addr] of string &write_expire=1day;

function insert(...)
  {
  switch ( exp_val )
    {
    case ei1m:
      data1m[...] = ...
      break;
    case ei10m:
      data10m[...] = ...
      break;
    ...
    }
  }

# user script:
redef exp_val = ei30m;

Jan · June 10, 2016, 5:08pm

As there was no feedback, I decided to use a bif (see
https://github.com/bro/bro/commit/16b1032beeaaf681763785ddac1eed4128430965).
It might not be the cleanest solution with respect to the bro language
but it is a straight forward approach.

Aashish_Sharma1 · June 10, 2016, 6:05pm

HI Jan,

> A solution could be to evaluate the interval expression every time it is
> used inside the table implementation. The drawback would be that there

For all of my needs above has worked fairly well. including using exp_val= 0 secs as default.

Based on the value of item in the table, I could return a variable time back from the expire function thus either keeping it longer or expiring it.

> used inside the table implementation. The drawback would be that there
> is no fixed value for serialization (I am not sure about the effects

Sorry, I don't quite seem to understand this drawback.

Aashish

Jan · June 10, 2016, 11:28pm

Hi Aashish,

For all of my needs above has worked fairly well. including using exp_val= 0 secs as default.

Based on the value of item in the table, I could return a variable time back from the expire function thus either keeping it longer or expiring it.

Trying this, I ran into an issue with &write_expire:
1) I set &write_expire=0 min
2) In the first call of the expire_func I return a delay, e.g. 10 min
3) After e.g. 5 min there is a write, the timer is set back to 0 min
(instead of 10 min) and the item immediately expires.

From my understanding, this is a result of how the delay is implemented.

As far as I understand the code, a delay is realized by setting the time
of the last access so that the next expiration occurs exactly after the
given delay. The expiration value isn't touched. As a consequence,
accessing the item within the delay time window will reset the
expiration to its original value. To be clear: I think that's perfectly
fine and works how I would expect it to work. It just prevents the
0-expire workaround for read/write_expire.

used inside the table implementation. The drawback would be that there
is no fixed value for serialization (I am not sure about the effects

Sorry, I don't quite seem to understand this drawback.

That's an implementation detail again. The expression given with the
expire attribute is evaluated and stored when the table value is
created. Thus using a const and doing a later redef doesn't lead to the
intended behavior. Meanwhile I realized that the expire_func expression
is also serialized. I assume this would be an option for the expire
value, too. However, I could miss something.

Best regards,
Jan

robin · June 11, 2016, 3:52am

Sorry for being late to the party here. Your examples did actually
help, I'm seeing now that the current behavior really doesn't seem
right:

const exp_val = 5min &redef;
data: table[addr] of string &write_expire=exp_val;

# user script:
redef exp_val = 20min; # has no effect

I'd call this a bug actually. Redefs are supposed to take effect
before anything else, so having the timeout use the original value
here seems quite wrong.

My immediate thought (without looking at the code) would be delaying
evaluating the expression until the value is needed for the first
time. At that point, the redef will have taken effect, so we should be
fine. Essentially, we'd cache the evaluated value for the future once
we have it.

Serialization is an interesting question though. I believe there'd be
nothing wrong with simply serializing the expression itself here
(rather than its value). When deserializing, we'd restore the
expression and make sure the cache value remains unset, so that on
first use it will get evaluated.

In principle, we could even go further and allow a non-constant
expression for the timeout that would get evaluated every time. My
main concern there would be performance, although I'm not sure if that
would actually cause much overhead in the still most common case of a
constant (i.e., for existing scripts).

Robin

Jan · June 11, 2016, 1:36pm

I'd call this a bug actually. Redefs are supposed to take effect
before anything else, so having the timeout use the original value
here seems quite wrong.

I agree that the behavior is at least counterintuitive

My immediate thought (without looking at the code) would be delaying
evaluating the expression until the value is needed for the first
time. At that point, the redef will have taken effect, so we should be
fine. Essentially, we'd cache the evaluated value for the future once
we have it.

I will have a look. If I am able to fix this I will include this in the
pull request for the intel updates.

Serialization is an interesting question though. I believe there'd be
nothing wrong with simply serializing the expression itself here
(rather than its value). When deserializing, we'd restore the
expression and make sure the cache value remains unset, so that on
first use it will get evaluated.

Yep, meanwhile I realized serialization is already done for the
expire_func statement. In case the table is serialized having a cached
value set, it would be preferable to use this value, wouldn't it?

In principle, we could even go further and allow a non-constant
expression for the timeout that would get evaluated every time. My
main concern there would be performance, although I'm not sure if that
would actually cause much overhead in the still most common case of a
constant (i.e., for existing scripts).

I am not sure what the actual performance impact would be. My idea would
be to cache the value in case of a constant and evaluate it every time
otherwise. That should combine the best of both approaches.

Best regards,
Jan

robin · June 11, 2016, 3:10pm

I will have a look. If I am able to fix this I will include this in the
pull request for the intel updates.

Please create a separate pull request for this one first (you can
merge it into the intel update branch, that'll be fine).

expire_func statement. In case the table is serialized having a cached
value set, it would be preferable to use this value, wouldn't it?

It's a question of semantics: what should happen if the Bro
unserializing it has redef'ed the constant to a different value? I
think my intuition would expect to use that modified value after
unserialization.

I am not sure what the actual performance impact would be. My idea would
be to cache the value in case of a constant and evaluate it every time
otherwise. That should combine the best of both approaches.

Yeah, I was thinking about that too. I'd still be curious if the
overhead of re-evaluating the constant overhead becomes noticable. If
you're game, you could try a little benchmark just manipulating a
table plenty times and measure if that changes execution time much.

Robin

Jan · June 13, 2016, 7:41pm

Please create a separate pull request for this one first (you can
merge it into the intel update branch, that'll be fine).

I have opened a new pull request

expire_func statement. In case the table is serialized having a cached
value set, it would be preferable to use this value, wouldn't it?

It's a question of semantics: what should happen if the Bro
unserializing it has redef'ed the constant to a different value? I
think my intuition would expect to use that modified value after
unserialization.

My intuition would be that the deserialized table is identical to the
serialized one. However, this is a matter of style and shouldn't be
relevant now...

Yeah, I was thinking about that too. I'd still be curious if the
overhead of re-evaluating the constant overhead becomes noticable. If
you're game, you could try a little benchmark just manipulating a
table plenty times and measure if that changes execution time much.

... I did a quick measurement of the execution time for re-evaluation of
a constant and couldn't detect any performance impact. So I went for
this solution.

A side note: I suspect that the table syncing did and still does not
work as reliable as one would expect. But according to Johanna this will
become deprecated soon, so I did not touch that code.

Best regards,
Jan

robin · June 13, 2016, 11:57pm

If there's any obvious problem, we should still take a look, but more
generally: yeah Broker will take that over as soon as things are in
sufficient shape.

Robin

Jan · June 14, 2016, 12:35am

A side note: I suspect that the table syncing did and still does not
work as reliable as one would expect. But according to Johanna this will
become deprecated soon, so I did not touch that code.

If there's any obvious problem, we should still take a look, but more
generally: yeah Broker will take that over as soon as things are in
sufficient shape.

The thing is I don't get why reads only need to be propagated once per
(half) &read_expire interval (see
https://github.com/bro/bro/blob/master/src/Val.cc#L2326). If there was a
read at the beginning of the (half) interval and another one close to
the end, a peer might expire a value too early based on the first read.
Seems someone put some thought into this so maybe I miss something here

Best regards,
Jan

robin · June 17, 2016, 3:57pm

That must have been me. I need to look at that for a bit to see if
I can remember my reasoning from many years ago (which might very well
have been flawed!). Please file a ticket with your thoughts and assign
it to me. Thanks,

Robin

Jan · June 17, 2016, 10:53pm

That must have been me. I need to look at that for a bit to see if
I can remember my reasoning from many years ago (which might very well
have been flawed!). Please file a ticket with your thoughts and assign
it to me. Thanks,

I have created BIT-1631 but couldn't assign it to you. Either I am just
unable to find the right button or I don't have assign privileges

Jan

Topic		Replies	Views
Using create_expire and expire_func Zeek	3	299	May 6, 2022
using &*_expire with interval values defined before Zeek	1	86	May 6, 2022
write_expire computational time Zeek	6	103	May 6, 2022
&expire_func/&create_expire question Zeek	7	192	May 6, 2022
&write_expire and &expire_Func Zeek	1	110	May 6, 2022

Per item expiration for tables

Related topics