active_conns

Seth, just looking through the new conn.bro, and I have a
philosophical question about this piece:

       # This is where users can get access to the active Log record for a
       # connection so they can extend and enhance the logged data.
       global active_conns: table[conn_id] of Log;

What kind of data do you see this extended with?

I'm asking because one thing that always struck me as suboptimal is
how many scripts currently maintain their own session table.
E.g., the HTTP analyzer has http_sessions[conn_id] where it stores
its stuff.

With the new record extension mechanisms we could instead do the other
extreme: no script gets its own table anymore, the additional things
just get added to a central record, like this Conn::Log. I'm not sure
whether I really want to advocate that change but I was wondering what
your (or anybody's) thoughts are.

(Note that if it were really Conn::*Log* that gets extended, this
would obviously interfere with logging. But we could separate the two
notions, and just have a central Connection record which everybody
extends.)
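
For illustration, a minimal sketch of what "everybody extends a central
record" could look like for HTTP, assuming the extension mechanism is
`redef record ... +=`. The module name HTTP_Sketch, the field name
http_session, and the single `method` field are made up for this
example and are not taken from any shipped script:

```bro
module HTTP_Sketch;    # hypothetical module name

export {
    # Hypothetical per-connection HTTP state.
    type Info: record {
        method: string &optional;
    };
}

# Attach the HTTP state to the connection record itself instead of
# keeping it in a separate table[conn_id].
redef record connection += {
    http_session: Info &optional;
};

event http_request(c: connection, method: string, original_URI: string,
                   unescaped_URI: string, version: string)
    {
    if ( c?$http_session )
        c$http_session$method = method;
    else
        c$http_session = [$method=method];
    }
```

Since the field is &optional, connections that never see HTTP traffic
carry only an unset field.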

Robin

> I'm asking because one thing that always struck me as suboptimal is
> how many scripts currently maintain their own session table.
> E.g., the HTTP analyzer has http_sessions[conn_id] where it stores
> its stuff.

I agree.

> With the new record extension mechanisms we could instead do the other
> extreme: no script gets its own table anymore, the additional things
> just get added to a central record, like this Conn::Log. I'm not sure
> whether I really want to advocate that change but I was wondering what
> your (or anybody's) thoughts are.

I like the idea, however, I guess one question would be what the memory
overhead would be. Assume that you have many analyzers or scripts that
want to add stuff to *some* connections. If the connection record is
extended, every connection needs to have the additional field. Probably
as an optional field, so it's only a NULL pointer, but still.

cu
gregor

>       # This is where users can get access to the active Log record for a
>       # connection so they can extend and enhance the logged data.
>       global active_conns: table[conn_id] of Log;
>
> What kind of data do you see this extended with?

Initially the stuff that would have gone into the $addl field before. I don't like that field much and a lot of different scripts seem to want to add various and fairly unstructured things to it. I'm not *totally* sure how the extension would play out with the conn.bro script, but it also didn't feel right for that script to not be extensible like all of the other scripts are and will be. I've really been trying to push myself toward consistency across all of the scripts so that without even reading the documentation or source, someone would be able to guess how each of the shipped scripts works and what variables will be named if they know the general convention.

I think that the existence of the active_conns variable is documented in one of the manuals too, but that may have been from the active.bro script that was removed with release 1.1. I don't exactly like the extension model applied to the conn.bro script either, but it has the benefit of being regular (i.e. like the other new scripts).

> I'm asking because one thing that always struck me as suboptimal is
> how many scripts currently maintain their own session table.
> E.g., the HTTP analyzer has http_sessions[conn_id] where it stores
> its stuff.

Hm, I guess I hadn't even thought about that at all.

> With the new record extension mechanisms we could instead do the other
> extreme: no script gets its own table anymore, the additional things
> just get added to a central record, like this Conn::Log.

Hm, I hadn't even considered it from that approach. I'll think about it some more.

> (Note that if it were really Conn::*Log* that gets extended, this
> would obviously interfere with logging. But we could separate the two
> notions, and just have a central Connection record which everybody
> extends.)

This is actually the model I've already moved to with most of the other scripts. I create an ::Info type that is kept in a conn_id-indexed table; inside the Info variable, there is an instance of a ::Log type. Data is stored sort of haphazardly (which I don't particularly like) in either ::Info or ::Log, depending on whether it's an internal state tracking detail or something that would conceivably ever need to be logged.

What if we make a Conn::Info (maybe different name? I'm awful at naming) type that contains a Conn::Log record and that we could then extend with all of the ::Info typed variables for each individual script? I think I may have to implement it to see how it looks and how it would be used.

  .Seth

Wouldn't be very concerned about that. It's not much per connection
(compared to how much we already store ...), and just by not having
some of the other high volume tables (like http_sessions) we'd
probably compensate for quite a bit already.

That said, I'm still not arguing for this. I also like Seth's current
model.

Robin

> > I like the idea, however, I guess one question would be what the memory
> > overhead would be.
>
> Wouldn't be very concerned about that. It's not much per connection
> (compared to how much we already store ...)

I'm sort of tempted to keep logging in the base scripts fairly minimal and then have other scripts that add lots of extra state if people want it, but I also really want to avoid making things complicated again. I like being able to do HTTP logging by just loading "http". Perhaps we could have these state extension scripts loaded by default, but if you load some other script (minimal-logging.bro?), the extended state scripts would not be loaded, giving you very minimal state tracking for cases where you don't need the full load of information and have some memory constraint?

Of course, that's increasing the complexity of the base scripts again. If someone really has the need to conserve memory that badly, they could always trim code out of the shipped scripts which should be quite a bit easier with the new scripts (not that I'd *want* people to edit the base scripts, it's just that they could). Most of those state tables don't occupy *that* much memory at any one point in time anyway since they're cleaned up after the connection.

> ... and just by not having
> some of the other high volume tables (like http_sessions) we'd
> probably compensate for quite a bit already.

I was just about to write here that I would like to keep it as it is for now, but then I started writing some example code to see how it would look if it was stored in the connection record and I sort of like it.

State stored in connection record:
  print c$http_session$log$method;

State stored in separate global table:
  print HTTP::active_conns[c$id]$log$method;

What I *don't* like about this model is that it breaks down when you have data that is stored about something outside of a connection. My current example for this is TLS session IDs, which are tracked in a state table so that a previously established TLS session ID can be reused in a different TCP connection. You would then have to resort to storing the session IDs in a global table, which makes the data storage less regular. It may be worth it to get the syntax benefits though.
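
To make the irregularity concrete, here is a rough sketch of the kind of global table that would still be needed next to the connection record. Everything in it (the module name, the SessionState type, the note_session function, the one-hour expiry) is invented for illustration and is not how any shipped script does it:

```bro
module TLS_Sketch;    # hypothetical module name

# Hypothetical state that belongs to a TLS session ID, not to any
# single connection, so it cannot live in the connection record.
type SessionState: record {
    first_seen: time;
    orig_h: addr;
};

# Indexed by session ID so that a later TCP connection reusing the
# same ID can find the state. The expiry interval is an arbitrary choice.
global sessions: table[string] of SessionState &read_expire = 1 hr;

function note_session(c: connection, session_id: string)
    {
    if ( session_id !in sessions )
        sessions[session_id] = [$first_seen=network_time(),
                                $orig_h=c$id$orig_h];
    }
```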

  .Seth

Yeah, that's bothering me a bit too, but I'm not sure how to solve it
either.

Well, here's an idea: with the new record-to-sub-record coercion, I
believe we could store *all* information in the Info record, and still
define a separate Log record for determining what gets logged (by
default):

  type Info: record { a: count; b: count; c: count; };
  type Log: record { a: count; b: count; };

We use Log to create the logging stream, and Info to fill in the data.
When time for logging comes around, we simply pass the Info instance
into the Log::write() function and it will log whatever is defined in
Log, via coercion. (Actually I don't believe this would work right now
because of internals of the log bif, but it should be doable)
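
As long as that coercion is not available for logging, the same
projection can be written out by hand. A small sketch using the two
record types from above; the module name Sketch and the helper to_log
are hypothetical, and this only illustrates what the coercion would do
automatically:

```bro
module Sketch;    # hypothetical module name

type Info: record {
    a: count;
    b: count;
    c: count;    # internal state only, never logged
};

type Log: record {
    a: count;
    b: count;
};

# Hypothetical helper: copy over exactly the fields that Log declares.
function to_log(i: Info): Log
    {
    local l: Log = [$a=i$a, $b=i$b];
    return l;
    }
```

The helper also shows the redundancy: every logged field is named in
Info, in Log, and once more in the copy.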

Advantage: Only one type of record to fill in.

Disadvantage: Two record types to define and maintain, both of which
              need to match for the logged fields (i.e., redundancy).

Robin

This is kind of neat ... We could even extend c?$http_session to check
whether the record type has that field at all, and then use that as a
replacement for all the "is this script loaded?" hacks currently in
use ...
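
A sketch of how such a check would read. As far as I know, ?$
currently only tests whether an &optional field has a value, so this
example has to declare the field itself to be valid; with the
extension, the same test could also be written when no script has
added the field. The module and field names are made up:

```bro
module HTTP_Sketch;    # hypothetical module name

export {
    type Info: record {
        method: string &optional;
    };
}

redef record connection += {
    http_session: Info &optional;
};

event connection_state_remove(c: connection)
    {
    # Only touch the HTTP state if it is present for this connection.
    if ( c?$http_session && c$http_session?$method )
        print fmt("%s used HTTP method %s", c$id$orig_h,
                  c$http_session$method);
    }
```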

Robin

Yeah, that's a good point. The approach I've been taking with the "is this script loaded?" hacks is to solve the problem a different way. It seems that many of those hacks are due to one of two things:

1. There is some sort of general library functionality that should probably always be loaded as a sort of base library.
2. The functionality was hacked into an existing script the fastest way possible.

I think the script extension model should make it possible for us to extract a lot of the circular dependencies into separate scripts, but like you said, in the cases where it does make sense to use an "is this script loaded?" hack, checking based on the existence of the protocol-specific field certainly makes things cleaner and more regular across all of the scripts.

  .Seth

> > print c$http_session$log$method;
>
> This is kind of neat ... We could even extend c?$http_session to check
> whether the record type has that field at all, and then use that as a
> replacement for all the "is this script loaded?" hacks currently in use ...

Why can't the "is this script loaded?" functionality be implemented as a BIF that queries some global state that the lexical scanner built up at scan/parse-time?

- Jon

I could, but I'd still count that as a hack because to be more
precise, it's not really about the *script* being loaded but more
about the functionality being available.

Robin

> > Why can't the "is this script loaded?" functionality be implemented
> > as a BIF that queries some global state that the lexical scanner built
> > up at scan/parse-time ?
>
> I could, but I'd still count that as a hack because to be more
> precise, it's not really about the *script* being loaded but more
> about the functionality being available.

Ah ok. Yeah, when you rephrase it like that, this solution becomes less intuitive than I thought.

- Jon