Bro Types Not Following Bro Types Documentation

I am working on some automation with Bro, Avro, and Kafka, and I am a little bit frustrated. (Or I am looking at the wrong documentation, hence my post here; I am perfectly fine with being extremely wrong because I am looking at the wrong thing.)

Specifically, I am looking at the default conn.log. The type specified for some fields, such as
orig_bytes or resp_bytes, is count.

Based on the docs I am using here:

https://www.bro.org/sphinx/script-reference/builtins.html

a count is:

`count`

A numeric type representing a 64-bit unsigned integer. A count constant is a string of digits, e.g. 1234 or 0. A count can also be written in hexadecimal notation (in which case “0x” must precede the hex digits), e.g. 0xff or 0xABC123.

The count type supports the same operators as the int type. A unary plus or minus applied to a count results in an int.

This is well and good; however, looking at some of the data in my log I see the character “-” as a value. Based on my reading of a count, that shouldn’t exist: a - is not an unsigned integer, nor is it a string of digits, whether in base 10 or hexadecimal.

Thus my frustration, I’d like to develop some generic bindings to push bro logs into Avro Serialized Kafka messages, but looking at this, I can’t even trust the documentation to be accurate? Am I missing something? Is there another documentation reference that more fully represents the data types that would explain why - is a valid integer?

John,

The “-” you’re seeing in this case isn’t meant to represent a count. The “-” is used in Bro logs to represent a field with missing data. For whatever reason, Bro couldn’t determine the bytes sent in this case. You’ll see -'s more commonly in other logs, like http.log, ssl.log, etc.

-Dop

(Mike I am putting this on list, I replied only to you)

I found https://www.bro.org/sphinx-git/logs/index.html, which is helpful in that it explains that - represents an unset field. I still think, from a data nerd perspective, that having a character that doesn’t fit the type represent something is dangerous; however, I can parse the values and replace them programmatically with the information provided, so now I can continue on my merry way. Thanks for the insight.
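For anyone else doing the same thing, the "parse and replace" step can be quite small. Here is a minimal sketch in Python, assuming the default unset marker of - is in effect. The helper name parse_count and the sample values are hypothetical, not anything Bro ships:

```python
from typing import Optional

UNSET_FIELD = "-"  # assumed default marker; Bro lets you redef this


def parse_count(raw: str, unset_field: str = UNSET_FIELD) -> Optional[int]:
    """Turn the text of a count column into an int, or None when unset."""
    if raw == unset_field:
        return None  # the field was never set, so there is no count at all
    value = int(raw)  # assumes counts are written as plain decimal digits
    if value < 0:
        raise ValueError(f"count cannot be negative: {raw!r}")
    return value


# Hypothetical orig_bytes values as they might appear in conn.log
print([parse_count(v) for v in ["1234", "0", "-"]])  # [1234, 0, None]
```

The unset check has to come before the int() call, since int("-") raises a ValueError.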

That all said… why put anything in a field (as a default) to represent unset or empty? Are we at risk of evasion? Besides obviously breaking typing, what about when the type actually accepts the unset character… what if the user-agent is - or (empty) couldn’t that cause downstream errors? “You can change the logs to log however you want” is likely the answer, and correct I can, but shouldn’t we try to be logical in our approach so assumptions aren’t made on the default material?

[...]

That all said... why put anything in a field (as a default) to represent
unset or empty? Are we at risk of evasion?

I am not sure what you should do instead. From a protocol point of
view, there is often a huge difference between "an empty string was
transferred" and "this was not seen at all". For example, in HTTP a
Referrer of "-" means that no referrer header was set at all. "" (the
empty string) instead means that it was seen, but empty. Same for sets,
there is a difference between the set was not seen at all ("-"), the set
was seen but empty ("(empty)") and the set was seen and contains one
element with an empty string ("").
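The three set cases above can be kept apart on the consuming side. This is a rough Python sketch, assuming the default markers ("-", "(empty)") and a comma as the set separator. The function name parse_string_set is made up for illustration, and the sketch does not handle escaped characters inside elements:

```python
from typing import List, Optional


def parse_string_set(
    raw: str,
    unset_field: str = "-",        # assumed default
    empty_field: str = "(empty)",  # assumed default
    set_separator: str = ",",      # assumed default
) -> Optional[List[str]]:
    """Map a set column to None (not seen), [] (seen, empty) or its elements."""
    if raw == unset_field:
        return None  # the set was not seen at all
    if raw == empty_field:
        return []  # the set was seen but had no elements
    # Anything else is a list of elements; "" yields one empty-string element
    return raw.split(set_separator)


for raw in ["-", "(empty)", "", "a,b"]:
    print(repr(raw), "->", parse_string_set(raw))
```

A list is returned rather than a Python set so that the order in the log line is preserved.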

Besides obviously breaking typing, what about when the type actually
accepts the unset character... what if the user-agent is - or (empty)
couldn't that cause downstream errors?

In that case, the character should be replaced by the escaped version of
it (i.e. you should find \x[ascii-code] or similar) in the log-file
instead of the -. Hence, it should still be decidable which of the two
cases happened.
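On the reading side, that means checking for the unset marker first and undoing the escapes only afterwards. Below is a small Python sketch of that order of operations. It assumes the escape takes the \xNN form described above and handles nothing else, and decode_string_field is a hypothetical helper name:

```python
import re
from typing import Optional

# Matches a backslash, an "x" and exactly two hex digits, e.g. \x2d
_HEX_ESCAPE = re.compile(r"\\x([0-9A-Fa-f]{2})")


def decode_string_field(raw: str, unset_field: str = "-") -> Optional[str]:
    """Return None for an unset field, otherwise the value with \\xNN undone."""
    if raw == unset_field:
        return None  # a bare marker means the field was never set
    # Only now undo the escapes, so an escaped "-" stays a real value
    return _HEX_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), raw)


print(decode_string_field("-"))       # None: no user-agent header was seen
print(decode_string_field("\\x2d"))   # "-": the header value really was a dash
```

Escaped bytes above 0x7f are mapped to single code points here, which may not be what you want for non-ASCII data.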

"You can change the logs to log however you want" is likely the
answer, and correct I can, but shouldn't we try to be logical in our approach
so assumptions aren't made on the default material?

I hope this helps,
Johanna

I’ll just add one high level point. It’s important to remember that, for a lot of people, the logs are the final output. They must be human readable and easily processed with simple unix command line tools.

-Dop

I think that another important point is that this is something that's
occurring in the ASCII writer (a bit of a misnomer, really a
tab-separated value writer). The fact that Bro has the concept of
optional fields means that there's a difference between a field that was
set to a non-empty string, a field that was set to an empty string, and
a field that was never set. Other output formats (e.g. JSON) have a
better way of differentiating between these, but this was the solution
developed for TSV output.

As with much of Bro, you can redef what exactly is written out in these
cases (see:
https://www.bro.org/sphinx-git/scripts/base/frameworks/logging/writers/ascii.bro.html#id-LogAscii::empty_field
and
https://www.bro.org/sphinx-git/scripts/base/frameworks/logging/writers/ascii.bro.html#id-LogAscii::unset_field).
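Since these values can be redef'd, a generic parser probably should not hard-code them. The sketch below, in Python, picks the markers up from the log's own header instead. It assumes the header layout I remember from the default ASCII writer (lines such as #unset_field and #empty_field, with a tab between key and value); the function name and the sample header are hypothetical:

```python
from typing import Dict, Iterable


def read_log_markers(lines: Iterable[str]) -> Dict[str, str]:
    """Collect the unset/empty/set-separator markers from a log header."""
    # Fall back to the assumed defaults if a header line is missing
    markers = {"unset_field": "-", "empty_field": "(empty)", "set_separator": ","}
    for line in lines:
        if not line.startswith("#"):
            break  # header is over; data rows follow
        # Drop the leading "#", then split key and value on the first tab
        key, _, value = line.rstrip("\n")[1:].partition("\t")
        if key in markers:
            markers[key] = value
    return markers


# Hypothetical header from a log whose unset_field was redef'd to "(unset)"
sample = [
    "#separator \\x09\n",
    "#set_separator\t,\n",
    "#empty_field\t(empty)\n",
    "#unset_field\t(unset)\n",
    "#path\tconn\n",
]
print(read_log_markers(sample)["unset_field"])  # (unset)
```

The returned values can then be passed to whatever does the per-field parsing, so a site that changes the markers does not break the bindings.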

As Johanna mentioned, there should always be a 1:1 mapping between the
log and the record that's being logged. If you're seeing ambiguity,
that's something that we should fix. Fundamentally, a Bro log line
should be as clear as possible.

  --Vlad

Solid point on the difference. Thanks for clarifying. This is a tough problem. One of our systems has a concept of null vs. empty strings in its final storage, but as pointed out, that makes things difficult from a human-readability standpoint. (What I mean is that if the referer doesn’t exist, the field is NULL; if it does exist and is empty, it’s a “”.)
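On the Avro side, the same null vs. empty distinction can be carried by making each optional field a union with null. This is only a sketch in Python of what such a schema might look like: the type mapping, the record name ConnLog and the helper nullable_field are all my own assumptions, not an existing binding:

```python
import json

# Assumed mapping from Bro log types to Avro primitives (illustrative only)
BRO_TO_AVRO = {
    "count": "long",  # caveat: Avro long is signed 64-bit, count is unsigned
    "int": "long",
    "double": "double",
    "bool": "boolean",
    "string": "string",
}


def nullable_field(name: str, bro_type: str) -> dict:
    """Describe an optional Bro field as an Avro union of null and a type."""
    # "null" goes first in the union so that the default of null is valid
    return {"name": name, "type": ["null", BRO_TO_AVRO[bro_type]], "default": None}


schema = {
    "type": "record",
    "name": "ConnLog",  # hypothetical record name
    "fields": [
        nullable_field("orig_bytes", "count"),
        nullable_field("resp_bytes", "count"),
    ],
}
print(json.dumps(schema, indent=2))
```

With this shape an unset field becomes null and an empty string stays "", so nothing in the message needs a placeholder character.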

I wonder if, like (empty), the unset marker could also be more verbose. I know that seems counterintuitive, especially given my point on types, but - may be error prone, especially on string fields (the problem is more obvious on non-string fields). What if we went with (unset) instead? At least in that case, if the log gets downstream to someone who isn’t clear on how Bro does things, there is more of a chance that they will understand that the value didn’t exist rather than assuming - is the value that was passed. The issues here are obviously backwards compatibility and larger log files.