Unique connection ID for bro <-> logging framework

Hi,

I was wondering whether it would make sense to assign each connection an
ID that's unique for this bro run. This ID can just be a 64-bit counter
that gets incremented on every new connection.

Why: If we add this ID to log outputs, it would be much easier to
correlate activity across logs (e.g., find the connection in http.log,
alarm.log, and conn.log, without having to match 5-tuples and timestamps)

I think this would be a rather nice (and very easy to implement) feature.

Cluster considerations: maybe add a nodeID or something to the
connection ID. E.g., in the high-order 8 or 16 bits.

Thoughts?
Comments?
cu
Gregor

I was wondering whether it would make sense to assign each connection an
ID that's unique for this bro run. This ID can just be a 64-bit counter
that gets incremented on every new connection.

That's an interesting idea.

Why: If we add this ID to log outputs, it would be much easier to
correlate activity across logs (e.g., find the connection in http.log,
alarm.log, and conn.log, without having to match 5-tuples and timestamps)

My only question is: under what circumstances do you do that activity correlation within a single connection? I'm unable to think of a single time when I've needed to do something like that where I wasn't able to just search for the single IP address I was interested in, because I was interested in anything that IP address was referenced in, not just that single connection.

  .Seth

Why: If we add this ID to log outputs, it would be much easier to
correlate activity across logs (e.g., find the connection in http.log,
alarm.log, and conn.log, without having to match 5-tuples and timestamps)

My only question is: under what circumstances do you do that activity correlation within a single connection? I'm unable to think of a single time when I've needed to do something like that where I wasn't able to just search for the single IP address I was interested in, because I was interested in anything that IP address was referenced in, not just that single connection.

Some examples:

* I want to count the number of HTTP requests per connection

* I do per-connection stats (e.g., number of packets, number of
  bytes, retransmissions, RTTs), store them in their own log files,
  and then want to correlate with conn.log or http.log

* Easier debugging / analysis:
  I can just grep for the connection ID, instead of
  having to map between different connection formats (e.g., notices
  have origIP:origPort -> respIP:respPort, but when I want to grep for
  them in conn.log, I have to do some awk to get there)

* ...
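As a hypothetical illustration of the awk reshaping in that debugging bullet (the conn.log field layout assumed here is made up for the example):

```shell
# Turn a notice-style "origIP:origPort -> respIP:respPort" string into a
# loose regex that can then be grepped for in conn.log. The dots in the
# IPs are left unescaped, which is fine for a quick interactive search.
echo '1.2.3.4:35231 -> 4.3.2.1:80' |
  awk -F'[: ]' '{ printf "%s.*%s.*%s.*%s\n", $1, $2, $4, $5 }'
```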

I guess I have more of a measurement point of view here....

cu
Gregor

* I want to count the number of HTTP requests per connection

Ah, ok. Now that you mention it, I have done searches for that before too. :-)

(e.g., notices
have origIP:origPort -> respIP:respPort but when I want to grep for
them in conn.log, I have to do some awk to get there)

If the logging framework proceeds in the direction that Robin and I have been outlining, most of this trouble will go away.

I guess I have more of a measurement point of view here....

Yeah, makes sense. I just wasn't understanding that before. :-) Reading the things you need to do does remind me that Justin Azoff and I need to get back to the metrics framework we've been talking about. It could help you output logs with a lot of the measurement-type data you're looking to get, instead of having to do post-processing on the existing logs.

Getting back to your question though, it's an interesting idea but I wonder if it will still be necessary once the "normal" logging output changes. At the very least, if you output tab separated value data, you should be able to do something like this....

cat whatever.log | grep "1.2.3.4<tab>35231<tab>4.3.2.1<tab>80"
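One way to get literal tabs into that pattern in a bash-style shell (the sample log line written here is made up):

```shell
# $'...' makes the shell expand \t into real tab characters, so the
# pattern matches the actual field separators in a tab-separated log.
printf '1.2.3.4\t35231\t4.3.2.1\t80\thttp\n' > whatever.log  # sample line
grep $'1.2.3.4\t35231\t4.3.2.1\t80' whatever.log
```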

The binary log output may make that even easier too.

  .Seth

Getting back to your question though, it's an interesting idea but I wonder if it will still be necessary once the "normal" logging output changes. At the very least, if you output tab separated value data, you should be able to do something like this....

cat whatever.log | grep "1.2.3.4<tab>35231<tab>4.3.2.1<tab>80"

In general yes, as long as the 5-tuple isn't reused.

(I can basically do this right now, if I use awk to reorder the
connection tuple so that I can grep for it. My thought was that
having a single numeric ID might make life easier.)

The binary log output may make that even easier too.

Being able to use grep, sed, awk, and co. is still very nice, so I'll
probably end up using a binary-to-ASCII converter quite frequently.

cu
gregor

BTW, the addl field in conn.log is sometimes used for something similar.
E.g., http.bro will create a unique ID for each HTTP session and put
this session ID into the connection's addl field....

cu
gregor

Two more thoughts on this:

- generally, I see that such an ID could be quite helpful. While
with the new logging, filtering for 4-tuples should be quite easy,
having a per-conn ID gives us crisper semantics and generally
simplifies things.

- however, I think there'd need to be one more piece to that story:
the IDs should be unique across Bro runs. Otherwise, crunching
information from a big log archive wouldn't be much better than it
is today. But that would probably mean we'd need to go beyond
64-bit integers, perhaps to a string prefixed with something likely
to be unique.

Robin

- however, I think there'd need to be one more piece to that story:
the IDs should be unique across Bro runs. Otherwise, crunching
information from a big log archive wouldn't be much better than it
is today. But that would probably mean we'd need to go beyond
64-bit integers, perhaps to a string prefixed with something likely
to be unique.

We can probably keep a 64-bit counter internally and also add a
bro_instance_ID that's globally unique across Bro runs. For logging, we
can then log the 64-bit counter and the instance_ID, or concatenate the
two (I would guess that the instance_ID will be handy in other
situations too). Doesn't the cluster already have/need something like that?

In order to generate such an instance_ID, we could:

a) make sure it's truly globally unique, e.g., by using a
   cryptographically secure, long (128 bit, maybe even 160 or more)
   random number. Possibly from an entropy pool (can we use OpenSSL for
   that?)

b) the user supplies a "hostID", we can then add time and PID
   and hash all that together to get the instance ID, e.g.,
   md5(hostID + PID + gettimeofday())
   (this should probably be fairly tolerant even if the hostID gets
   reused across machines).
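A rough sketch of option (b) with standard Unix tools (the hostID value here is a made-up stand-in for whatever the user supplies; md5sum is the GNU coreutils tool):

```shell
host_id="sensor-1"              # hypothetical user-supplied hostID
pid=$$                          # current process ID
now=$(date +%s)                 # current time, standing in for gettimeofday()
instance_id=$(printf '%s%s%s' "$host_id" "$pid" "$now" | md5sum | cut -d' ' -f1)
echo "$instance_id"             # 32 hex digits, i.e. a 128-bit MD5
```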

cu
Gregor

We can probably keep a 64-bit counter internally and also add a
bro_instance_ID that's globally unique across Bro runs. For logging, we
can then log the 64-bit counter and the instance_ID, or concatenate the
two (I would guess that the instance_ID will be handy in other
situations too). Doesn't the cluster already have/need something like that?

There's a global peer_description (string) that, if set, will be used
as a prefix for IDs in logs; see prefixed_id() in bro.init. The
cluster sets that differently for each node.

However, the cluster currently also doesn't give unique IDs across
runs, just IDs that are unique across nodes within a single run.

In order to generate such an instance_ID, we could:

My main concern is not wasting too many bytes for these IDs, as I
imagine they would be included in pretty much every log entry. On
the other hand, I don't think we need to be 100% sure that the IDs
are unique as long as the probability of a collision is small. Seems
that a single 64-bit int should be able to achieve that already if
we hash all information in.

b) the user supplies a "hostID", we can then add time and PID
   and hash all that together to get the instance ID, e.g.,
   md5(hostID + PID + gettimeofday())

I generally like this, and the hostID can be the peer_description.
But I think we can hash into 64-bit instead and probably take a
simpler hash function as well. And then we can just add the 64-bit
counter to that value.
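A loose shell sketch of that scheme (a truncated MD5 of 15 hex digits, i.e. 60 bits, stands in for the "simpler" 64-bit hash so that shell arithmetic stays positive; peer_description value and counter are made up):

```shell
# Hash peer_description + PID + time down to a short run hash, then add
# the per-connection counter to it to form the logged connection ID.
run_hash=$(printf '%s%s%s' "worker-1" "$$" "$(date +%s)" | md5sum | cut -c1-15)
counter=42
conn_id=$(( 0x$run_hash + counter ))
echo "$conn_id"
```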

Robin

b) the user supplies a "hostID", we can then add time and PID
   and hash all that together to get the instance ID, e.g.,
   md5(hostID + PID + gettimeofday())

I generally like this, and the hostID can be the peer_description.
But I think we can hash into 64-bit instead and probably take a
simpler hash function as well. And then we can just add the 64-bit
counter to that value.

I'd prefer to keep the counter and the runID separate. E.g., by making
the runID n bits and the counter 64-n bits and then OR-ing them together.
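With n = 16, for example, that packing could look like this in shell arithmetic (runID and counter values are made up):

```shell
n=16                                   # bits reserved for the runID
run_id=1                               # hypothetical runID, must be < 2^n
counter=5                              # per-connection counter, < 2^(64-n)
conn_id=$(( (run_id << (64 - n)) | counter ))
echo "$conn_id"
```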

OTOH, I don't think we need to worry too much about wasting bytes by
using, say, a 32-bit runID + 64-bit unique ID. We now use ASCII for
logging; if we move to binary (and possibly compression), this will
save *way* more space than our ID will ever add.

(we could also just add the run-id once per log file, or per log group
inside a file, but that might be too cumbersome)

cu
gregor

I'd prefer to keep the counter and the runID separate. E.g., by making
the runID n bits and the counter 64-n bits and then OR-ing them together.

What's the advantage of keeping them separate?

using, say, a 32-bit runID + 64-bit unique ID. We now use ASCII for
logging; if we move to binary (and possibly compression), this will
save *way* more space than our ID will ever add.

Well, that's right, but still no excuse for waste if we don't need
it. :-)

(I'm also thinking about potentially piping logs into a DB and
perhaps indexing on such an ID as a key.)

Robin