Call for opinions on logging framework syntax problem

Hello,

I am currently working on the input framework for bro -- which allows reading previously written log files back into bro -- and have encountered a little problem when reading port fields. There are several different methods to solve this problem and I wanted to get a little bit of feedback before implementing any of these solutions.

First to describe the problem...
When the logging framework is used to log port information, the information does not include the protocol -- this is usually stored in a second column.
Hence, a logfile storing port information will usually look something like this

#fields host_a host_p proto
12.12.12.12 80 tcp

The input framework uses record types to define which fields should be read from a previously written logfile.
To read the fields, one could e.g. define a record like this:

type Values: record {
  host_a: addr;
  host_p: port;
}

The problem with this approach is that host_p now does not contain the protocol of the port, because the protocol is stored in a different, unrelated column.
Hence, the input framework needs a (preferably syntactically nice, easy to understand) way to identify the column that stores the protocol for a given port field.

The easiest solution would be just to assume a fixed column name (e.g. host_p_port if the port is stored in host_p), but this is probably not a very good idea for a number of reasons.

The nicest way we could think of at the moment is to use annotations for this; for our example one could e.g. use

type Values: record {
  host_a: addr;
  host_p: port &protocol_column=proto;
}

This has the disadvantage of introducing a new, very specialized annotation that is only used for this one case.
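
To make the idea concrete, here is a minimal Python sketch (not Bro code) of what a reader would have to do with such an annotation: look up the protocol column named for a port field and combine the two values. The function name `read_ports` and the `protocol_columns` mapping are hypothetical and only mirror the proposed annotation; the log is assumed to be whitespace-separated with a `#fields` header, as in the example above.

```python
def read_ports(log_text, protocol_columns):
    """Parse a tiny '#fields' log and attach protocols to port fields.

    protocol_columns maps a port column name to the column holding its
    protocol (a hypothetical stand-in for the proposed annotation).
    """
    fields = []
    rows = []
    for line in log_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("#fields"):
            fields = line.split()[1:]
            continue
        if line.startswith("#"):
            continue  # skip any other header lines
        values = dict(zip(fields, line.split()))
        row = {}
        for name, value in values.items():
            if name in protocol_columns:
                # Combine the port number with its protocol column.
                row[name] = (int(value), values[protocol_columns[name]])
            elif name not in protocol_columns.values():
                # Ordinary column: keep as-is. Protocol columns are
                # dropped because they are folded into the port value.
                row[name] = value
        rows.append(row)
    return rows


log = """#fields host_a host_p proto
12.12.12.12 80 tcp
"""
rows = read_ports(log, {"host_p": "proto"})
```

With this mapping, `host_p` ends up as a (number, protocol) pair and the separate `proto` column disappears from the resulting record.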

Does anyone else have any ideas / suggestions?

Johanna

I'm wondering whether we should maybe start including the protocol in the same column in the log files. I.e., the column would then be "80/tcp"....

Or we could handle ports similar to embedded records in the log file. I.e., if we log a port variable named "orig_p" we would get two columns:
   orig_p.port orig_p.proto
I actually like this variant!

cu
Gregor

Adding the protocol to the port column in the logging framework would be a (for me) very easy solution - and probably also the cleanest one. However, apparently that is not desired, because it makes the port field more difficult to parse.

It would also fix another problem. At the moment it is possible to have sets or vectors of ports in a single line of the log file. That could, for instance, be used to note several ports a host is active on, e.g.

#fields host services
12.12.12.12 53,80,8080

When adding the protocol directly to the port information, the log line would e.g. look like

12.12.12.12 53/udp,80/tcp,8080/tcp

When using a second column one could probably do something like

12.12.12.12 53,80,8080 udp,tcp,tcp

but that is not really easy to read.
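
As a rough illustration of the difference between the two layouts, the following Python sketch parses both into the same list of (port, protocol) pairs. The helper names `parse_combined` and `parse_parallel` are made up for this example, and the comma as element separator is taken from the log lines above.

```python
def parse_combined(cell):
    """Parse '53/udp,80/tcp' into [(53, 'udp'), (80, 'tcp')]."""
    ports = []
    for item in cell.split(","):
        number, proto = item.split("/")
        ports.append((int(number), proto))
    return ports


def parse_parallel(ports_cell, protos_cell):
    """Parse '53,80' plus 'udp,tcp' into the same list of pairs."""
    numbers = ports_cell.split(",")
    protos = protos_cell.split(",")
    if len(numbers) != len(protos):
        raise ValueError("port and protocol lists differ in length")
    return [(int(n), p) for n, p in zip(numbers, protos)]


combined = parse_combined("53/udp,80/tcp,8080/tcp")
parallel = parse_parallel("53,80,8080", "udp,tcp,tcp")
```

The parallel form only works if both columns keep their elements in the same order, which is part of why it is harder to read and easier to get wrong.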

Johanna

> I'm wondering whether we should maybe start including the protocol in
> the same column in the log files. I.e., the column would then be
> "80/tcp"….

I don't like this since no databases have an analogous data type and integers are nice and searchable.

> Or we could handle ports similar to embedded records in the log file.
> I.e., if we log a port variable named "orig_p" we would get two columns:
>   orig_p.port orig_p.proto
> I actually like this variant!

I like this too, but I get the sense that the protocol should actually be an attribute of the conn_id type and not a part of each port value. If we started using counts for port values (and get rid of the port type?) and add a $proto field to conn_id does that break any existing assumptions within the language? There are a number of cases where the port type has caused me grief for various reasons but I'm not sure if there is some deeper functionality I'm missing that we would lack with this change.

  .Seth

This is definitely one place where the email I just sent breaks down. It's the port value used outside of the context of a conn_id value. Do you have a concrete example of when you'd want to do something like this? I suspect that if you wanted to do that it would actually be better to organize your data in a different way. Like this:

#fields host port proto
12.12.12.12 53 udp
12.12.12.12 80 tcp
12.12.12.12 8080 tcp
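
If someone still wants the per-host set of ports, it can be rebuilt from this one-row-per-port layout. A small Python sketch, assuming whitespace-separated columns named as in the example above (the function name `services_by_host` is only illustrative):

```python
from collections import defaultdict


def services_by_host(log_text):
    """Group (port, proto) pairs by host from a one-row-per-port log."""
    fields = []
    services = defaultdict(set)
    for line in log_text.splitlines():
        parts = line.split()
        if not parts:
            continue
        if parts[0] == "#fields":
            fields = parts[1:]
            continue
        if parts[0].startswith("#"):
            continue  # skip any other header lines
        row = dict(zip(fields, parts))
        services[row["host"]].add((int(row["port"]), row["proto"]))
    return dict(services)


log = """#fields host port proto
12.12.12.12 53 udp
12.12.12.12 80 tcp
12.12.12.12 8080 tcp
"""
grouped = services_by_host(log)
```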

  .Seth

No, I have no real concrete example… I just tried to think of things people might want to do, and the use case of having a set of ports for one IP did not seem too far-fetched.

Johanna

Vern might know whether there's anything that would severely break but
my guess is that in principle we could do that. However, I like the
port type, it provides additional context for type checking and
understanding a script. Also, changing that would be quite a large
internal change: the type is used in many places, and in some we'd
now be left without port information where we need it.

Robin

Hmm ... I like that too but it breaks when ports are in sets/vectors,
and it gets ugly with ports in records. We could do it just for
top-level fields but that would be quite inconsistent.

Robin

I like it too, until I try to use it too much. :) It provides a very nice convenience within the language but creates a large complication if you try to use the values outside of the language. If you really think about it, it seems a little silly that we have to have a BiF for accessing the protocol of a port. If it was a record or a field in a conn_id we would be able to reuse existing knowledge to access the protocol.

  .Seth

Yeah. I can see sets, vectors being a problem, however, wouldn't it break in the same way as sets/vectors of record types?

For ports in records it would just be:
cid.orig_p.port cid.orig_p.proto

That doesn't seem too bad.
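
For a log consumer, dotted column names like these are straightforward to turn back into nested records. A minimal Python sketch, assuming "." is only ever used as the nesting separator in column names (the helper `nest_columns` is hypothetical):

```python
def nest_columns(fields, values):
    """Turn dotted column names into nested dictionaries."""
    record = {}
    for name, value in zip(fields, values):
        node = record
        *parents, leaf = name.split(".")
        for part in parents:
            # Walk down, creating intermediate records as needed.
            node = node.setdefault(part, {})
        node[leaf] = value
    return record


fields = ["cid.orig_p.port", "cid.orig_p.proto"]
record = nest_columns(fields, ["80", "tcp"])
```

Values are kept as strings here; converting the port to an integer would be a separate step once the column types are known.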

I don't think the logging framework supports sets or vectors of records.

.Seth

> Yeah. I can see sets, vectors being a problem, however, wouldn't it
> break in the same way as sets/vectors of record types?

> I don't think the logging framework supports sets or vectors of records.

That's right, it does not. One could simply choose to disable the logging of sets/vectors of ports and use different predefined column names - however this somehow does not feel right, since a port is a basic type.

Agreed.

  .Seth

So looks like we don't really have much of a better idea than using
the attribute Johanna originally proposed? (At least nothing short of
removing the port type altogether ...)

Robin

I think that's correct.

.Seth

> If we started using counts for port values (and get rid of the port
> type?) and add a $proto field to conn_id does that break any existing
> assumptions within the language?

> Vern might know whether there's anything that would severely break but
> my guess is that in principle we could do that.

I suspect it would be a mess. But, more fundamentally, I really resist
getting rid of ports as a built-in type.

    Vern

Heh, although this thread may not indicate it I was intensely conflicted when I wrote about removing the type. If you think keeping it is the right thing, that's enough for me. Now the conversation can turn to how to work with the type correctly. I'll think about it more.

  .Seth

I would still opt for making the logging framework log port and protocol as foo.port foo.proto!

Vectors and sets of ports might be problematic but:

* It doesn't appear that vectors/sets of ports are currently used.
* How do I specify the attribute for sets/vectors of ports? For the
   whole vector at once?
* What if I want to add ports with different protocols to a set/vector
   (e.g., logging the now obsolete port_names or a set of sensitive
   ports).
* It feels really hack-y!
* Non-ASCII backends should be able to handle it fairly easily. (E.g.,
   vector of ports in a relational DB would probably be modeled as an
   n:m relationship anyway)
* Need to find a solution for ASCII output of vectors/sets of ports.
   Maybe special case them
* BTW: if you have sets, vectors in the output, then the log file must
   also have an annotation to say what type is in the vector/set, right?
* Maybe we could use two columns in general but use the 80/tcp notation
   for sets/vectors? Or we simply use a space or some other
   character to separate the port number and the protocol.

If you think that two columns don't work, then I would still prefer something like "80/tcp" in ASCII. Yes it duplicates the protocol but it's IMHO the cleaner solution than using the attribute. One argument for that is that it's printed in the same way a script writer would have to write it if it were a constant.

cu
Gregor

> * It feels really hack-y!

Maybe, but all the two-column solutions feel even more so!

> If you think that two columns don't work, then I would still prefer
> something like "80/tcp" in ASCII. Yes it duplicates the protocol but
> it's IMHO the cleaner solution than using the attribute. One argument
> for that is that it's printed in the same way a script writer would have
> to write it if it were a constant.

The argument against that is that now everybody reading the logs needs
to parse the ports (rather than being able to just read integers).

Robin