$tag in notice_info

Is there anyone around that can explain the purpose of the $tag field in the notice_info type?

  .Seth

It uniquely identifies the NOTICE and can then be used at other
locations to refer to it. The only use of it I recall right now is in
conn.log: the relevant connection shows the tag in the addl field.

I'm actually not sure how helpful having the tag is, I don't think
I've ever used the tag but always grep for the 4-tuple right away.

Robin

> It uniquely identifies the NOTICE and can then be used at other
> locations to refer to it. The only use of it I recall right now is in
> conn.log: the relevant connection shows the tag in the addl field.

Ah, ok.

> I'm actually not sure how helpful having the tag is, I don't think
> I've ever used the tag but always grep for the 4-tuple right away.

That's sort of what I tend towards as well. My first inclination was to remove it since it's not used much. It should be possible to extend the connection logging with the logging framework to modularly add it back in later if someone wants to use it.

  .Seth

... hmm. This actually reminds me about our discussion about having
unique connection IDs (e.g., 64bit ints) in bro, that can then be used
to locate a connection across log files.

It seems we never filed a ticket for that. Do you think it's worth it? I
could then summarize the mail thread and plug it into a ticket.

cu
gregor

Yes, definitely worth recording. Whether we go for it is another question,
but I already don't remember the outcome of the email thread.

Robin

Just filed a ticket for that.

The thread basically just "fell asleep" after discussing particular
ideas on how to assign such an ID and how to make it unique across bro
runs.

cu
gregor

Oh yeah. What's your thought on this? Would you like to have that value printed along with the IP addresses and ports in the connection log and other logs?

I think we may be able to work something out with the logging framework that makes it a little easier to work with. I can imagine choosing to output that value instead of the 4-tuple for database logging since it should be easy to do the join to tie data back together. As I think about it, I'm liking that idea more and more. Especially if we can pull it off cleanly.

  .Seth
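A minimal sketch of the database join described above, assuming a hypothetical schema in which every log table carries the same `conn_id` column and only the `conn` table stores the 4-tuple. The table and column names are illustrative and are not part of Bro's logging framework.

```python
import sqlite3

# Hypothetical schema: every log table shares a conn_id column,
# and only the conn table stores the 4-tuple.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE conn (conn_id INTEGER PRIMARY KEY, orig_h TEXT, orig_p INTEGER,
                       resp_h TEXT, resp_p INTEGER);
    CREATE TABLE http (conn_id INTEGER, method TEXT, uri TEXT);
""")
db.execute("INSERT INTO conn VALUES (1001, '10.0.0.5', 51234, '192.0.2.10', 80)")
db.execute("INSERT INTO conn VALUES (1002, '10.0.0.6', 40000, '192.0.2.11', 80)")
db.execute("INSERT INTO http VALUES (1001, 'GET', '/index.html')")
db.execute("INSERT INTO http VALUES (1001, 'GET', '/logo.png')")


def http_with_endpoints(db):
    """Recover the 4-tuple for each HTTP entry via a join on conn_id."""
    return db.execute("""
        SELECT h.uri, c.orig_h, c.orig_p, c.resp_h, c.resp_p
        FROM http AS h JOIN conn AS c ON c.conn_id = h.conn_id
        ORDER BY h.uri
    """).fetchall()


rows = http_with_endpoints(db)
for row in rows:
    print(row)
```

With this layout the HTTP table never needs to repeat the addresses and ports; the join restores them on demand.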

I do!
My thinking is that I find something interesting in one of the logfiles
(e.g., http.log, alarm.log, conn.log, whatever) and now I want to look
up the connection responsible for that log-entry in other log files.
Using such an ID I could just grep for it (assuming text based logs,
but it should apply similarly to binary logs).

cu
gregor
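The lookup described above could be sketched as follows, assuming text logs with tab-separated fields and a connection ID stored as one of those fields. The log layout, the ID format (`C9f3a`), and the `find_conn` helper are all hypothetical.

```python
def find_conn(logs, conn_id):
    """Return (log name, line) pairs for every line that has conn_id as a field."""
    hits = []
    for name, lines in logs.items():
        for line in lines:
            # Match whole fields only, so an ID never matches inside a longer value.
            if conn_id in line.split("\t"):
                hits.append((name, line))
    return hits


# Hypothetical log contents; real logs would be read from files.
logs = {
    "conn.log": ["1300000000.1\tC9f3a\t10.0.0.5\t51234\t192.0.2.10\t80"],
    "http.log": ["1300000000.2\tC9f3a\tGET\t/index.html",
                 "1300000001.0\tC77b1\tGET\t/other"],
    "alarm.log": ["1300000000.3\tC9f3a\tSensitiveURI"],
}

hits = find_conn(logs, "C9f3a")
for name, line in hits:
    print(name, line, sep=": ")
```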

> I'm actually not sure how helpful having the tag is, I don't think
> I've ever used the tag but always grep for the 4-tuple right away.

Hmmm, I think I not-uncommonly grep on the tag to map from a notice.log
entry to a conn.log entry, unless I'm missing some context here.

That said, having a more general connection identifier, as subsequently
discussed, would work for this, too.

    Vern

That's what I was going to aim for. Centralizing on a connection identifier makes it possible to remove some other code. There is a sort-of-tag notion in ftp.bro which I'm removing now that goes the other direction by identifying an FTP session.

  .Seth

For notices that's true.
I would like to have the same / a similar mechanism for other log files
(e.g., http.log) as well.

cu
gregor

That's what I'm working towards. I'm not too concerned about disk space so I was thinking of just including the identifier alongside the connection 4-tuple in every log. It would actually be kind of nice. If someone is particularly concerned about disk space in their environment, they'd be able to reconfigure the logging framework locally to either not include the 4-tuple or not include the connection identifier (or include neither if they're crazy).

  .Seth

I like switching from notice tags to a generic conn id used
consistently across logs. My only request is that we make sure we can
identify a connection uniquely even across Bro runs. Then one can just
scan a whole log archive for a specific connection without needing to
worry about when Bro started etc.

Robin

What do you think about using UUID/GUID? I don't know about the overhead to create those values and they're probably quite a bit larger than we need (128 bits displayed as hex), but it would be interesting to be able to have unique values per run and per instance. It'd end up being globally unique log identifiers. :) The length would be pretty annoying though.

What sort of uniqueness are we aiming for here? I don't think that was ever very clearly laid out in the previous thread. With GUID we could do uniqueness for eternity (or close to it), but if we do something like hash the bytes for the $start_time timestamp and the 4-tuple that may be unique enough for most cases. I also don't know the relative overheads of generating that hash versus the GUID, which could be a concern.

  .Seth
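A minimal sketch of the "hash the start time and the 4-tuple" idea, assuming MD5 as the hash and a 64-bit truncation of the digest. Both choices, and the `conn_hash_id` name, are illustrative assumptions rather than anything the thread settled on.

```python
import hashlib
import ipaddress
import struct


def conn_hash_id(start_time, orig_h, orig_p, resp_h, resp_p):
    """Hash the start timestamp and 4-tuple into a 64-bit identifier."""
    data = struct.pack("!d", start_time)                      # timestamp as 8 bytes
    data += ipaddress.ip_address(orig_h).packed + struct.pack("!H", orig_p)
    data += ipaddress.ip_address(resp_h).packed + struct.pack("!H", resp_p)
    digest = hashlib.md5(data).digest()
    return int.from_bytes(digest[:8], "big")                  # keep 64 of the 128 bits


a = conn_hash_id(1300000000.123456, "10.0.0.5", 51234, "192.0.2.10", 80)
b = conn_hash_id(1300000000.123457, "10.0.0.5", 51234, "192.0.2.10", 80)
print(f"{a:016x}")
print(f"{b:016x}")
```

One property of this approach is that the identifier is reproducible from the log fields alone, with no per-run state. The trade-off is that two connections with the same 4-tuple and the same recorded start time would receive the same identifier.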

I don't think we have to go that far. However, I think that using
128bits might be helpful. We could then have a 64-bit counter and
generate a 64bit Bro run-ID. We can then concatenate the two 64bit values.
This way there's pretty much no cost to create a new conn-id.

Another small advantage is that this way, one could just strip the
run-ID, if one is only searching through the logs of single run. (or
there could be a flag to force the run-ID to be 0 for testing)

To get the run-ID we could use information like hostname, PID,
time-of-day, Bro's host-id-name (for cluster deployments), etc. and hash
them together using md5 or sha1 or something. (Or use GUID/UUID to
generate the runid and then only use the 64bits with most entropy).

just my 2ct
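The scheme above could be sketched like this, assuming SHA-1 over the hostname, PID, time of day, and an optional node name for the run ID, truncated to 64 bits. The `make_run_id` and `ConnIdAllocator` names and the separator format are hypothetical.

```python
import hashlib
import itertools
import os
import socket
import time

MASK64 = 0xFFFFFFFFFFFFFFFF


def make_run_id(hostname, pid, now, node_name=""):
    """Hash per-run facts into a 64-bit run ID; computed once at startup."""
    seed = f"{hostname}|{pid}|{now!r}|{node_name}".encode()
    return int.from_bytes(hashlib.sha1(seed).digest()[:8], "big")


class ConnIdAllocator:
    def __init__(self, run_id):
        self.run_id = run_id & MASK64
        self._counter = itertools.count(1)

    def next_id(self):
        """Concatenate the run ID (high 64 bits) and the counter (low 64 bits)."""
        return (self.run_id << 64) | (next(self._counter) & MASK64)


alloc = ConnIdAllocator(make_run_id(socket.gethostname(), os.getpid(), time.time()))
first = alloc.next_id()
second = alloc.next_id()
low64 = first & MASK64            # strip the run ID when searching a single run
print(f"{first:032x}")
```

Creating an ID costs one counter increment and a shift. Passing a run ID of 0 gives the fixed, reproducible IDs suggested for testing.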

I'm not convinced we need the separate run-id. Note that while it
would allow getting all connections from the same run, it doesn't get
all the *logs* from the same run (because some logs may not have
connection-level semantics). That doesn't seem worth storing an
additional 64-bit value with every connection in almost every log to
me. Also, 128-bit is really long and ugly.

So I propose we go with a single 64-bit value that combines the run-id
and the conn-id into a likely unique value, something like in this
pseudo-code:

    struct { uint64 run_id; uint64 conn_count; } id;
    id.run_id = md5(hostname, timeofday, pid);  // computed once at startup
    id.conn_count = ++global_conn_counter;      // bumped for each new connection

    uint64 unique_val = crc64(id);

Robin

I think I would prefer to leave out the md5. Do you think that we'd ever see conflicts by just adding those values together?

Also, would CRC-64 provide enough reliability against collisions considering that some installations may run for a very long time? I don't know the characteristics of CRC algorithms, but I know they weren't designed for this use and I'd be a little worried about collisions. Maybe this is ok though?

  .Seth
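A rough way to size the collision risk is the birthday approximation, sketched below. It assumes the 64-bit values behave like uniformly random numbers, which is an idealization: CRC-64 was designed for error detection, and how it behaves on structured input such as a run ID plus counter is not established in this thread.

```python
import math


def collision_probability(n_ids, bits=64):
    """Birthday approximation: P(any collision) ~ 1 - exp(-n^2 / 2^(bits+1))."""
    return -math.expm1(-(n_ids ** 2) / 2.0 ** (bits + 1))


for n in (10**6, 10**8, 10**9, 10**10):
    print(f"{n:>14,d} connections: {collision_probability(n):.3g}")
```

Under that assumption, an archive of a billion connections has roughly a 2.7% chance of containing at least one colliding pair, and at ten billion connections a collision becomes likely. Whether that is acceptable depends on how the ID is used: a collision makes a grep return one extra connection rather than silently losing data.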

I would hash them. What's the reason you don't like md5? It would run
exactly once at startup.

cu
Gregor

>> uint64 unique_val = crc64(id);

(could just take the bottom 64 bits of the MD5 here)

How often would these hashes be computed? You may need to worry about performance.
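A minimal sketch of the variant that takes the bottom 64 bits of an MD5 instead of a CRC-64, assuming "bottom" means the last 8 bytes of the digest read as a big-endian integer. The `ConnIdGenerator` class and the seed format are hypothetical. In this sketch the run-ID hash runs once at startup, and one small MD5 over 16 bytes runs per new connection, not per packet.

```python
import hashlib
import struct


class ConnIdGenerator:
    def __init__(self, hostname, time_of_day, pid):
        seed = f"{hostname}|{time_of_day!r}|{pid}".encode()
        # MD5 over the per-run facts: computed exactly once, at startup.
        self.run_id = hashlib.md5(seed).digest()[-8:]
        self.conn_count = 0

    def next_id(self):
        """Hash (run_id, conn_count) and keep the bottom 64 bits of the MD5."""
        self.conn_count += 1
        packed = self.run_id + struct.pack("!Q", self.conn_count)
        return int.from_bytes(hashlib.md5(packed).digest()[-8:], "big")


gen = ConnIdGenerator("sensor1", 1300000000.0, 4242)
id1 = gen.next_id()
id2 = gen.next_id()
print(f"{id1:016x} {id2:016x}")
```

Whether one MD5 per connection matters for performance would need measuring on a loaded sensor; the sketch only shows where the hashing happens.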