Bro's escaping of non-printable characters behaves unexpectedly

Hello everyone,

I'm encountering a problem where I am unable to reconstruct original
inputs from bro log files. This example summarizes the problem:

Hello,

That was a poor example, as it used \0 which is special cased by the
bro escape functionality.

This problem also extends beyond non-printable characters to non-ASCII
(Unicode) characters. Here's another example using the registered sign ®
(\xc2\xae in UTF-8).

Hello Paul,

I think the reason the ascii writer of Bro's logging framework does not
support arbitrary binary data is that it was conceived as a way to write
human-readable log files, not arbitrary binary data.

If you want to write binary data to log files, I would recommend just
base64-encoding it first, using the encode_base64 bif.
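
A minimal sketch of that round trip, written in Python rather than Bro
script. The raw_uri value is made up for illustration, and the sketch
assumes the logged field holds standard base64:

```python
import base64

# Hypothetical URI bytes containing the UTF-8 encoding of the registered sign.
raw_uri = b"/path/\xc2\xae"

# What would end up in the log: plain ASCII, safe for a text-oriented writer.
logged = base64.b64encode(raw_uri).decode("ascii")

# What a consumer of the log does to get the original bytes back.
recovered = base64.b64decode(logged)

print(logged)
print(recovered == raw_uri)
```

The cost is that the field is no longer readable by eye, which is the
trade-off against the escaping done by the ascii writer.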

If you are ok with just using the standard methods for writing to files
outside of the logging framework, you can put them into binary mode, as
you probably are aware.

Johanna

Hey Johanna,

Thanks for taking the time to respond.

> I think the reason the ascii writer of Bro's logging framework does not
> support arbitrary binary data is that it was conceived as a way to write
> human-readable log files, not arbitrary binary data.

I'm going to push back a bit on characterizing this as supporting
arbitrary binary data. These are unicode characters appearing in URIs
($http$URI) that I'm encountering in actual network traffic. I'm
actually encountering them somewhat frequently. The problem manifests
itself in the standard http.log, as well as the extensions I'm working
on.

I realize the RFC does not permit unicode in URLs, but given that they
do occur in practice (browsers will just silently handle them), this
seems like something worth supporting.

I'll also point out that Bro's ascii logging facilities do currently
support logging these characters, they simply do so in an
unrecoverable/non-canonical way. What I'm proposing is
standardization/cleanup for the escaping that Bro already performs.
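
To make "recoverable" concrete, here is a minimal Python sketch of one
reversible scheme. It is not Bro's current behavior and not a patch; the
function names escape and unescape are hypothetical. The point it
illustrates is that the escape character itself gets escaped, so every
logged string maps back to exactly one input.

```python
def escape(data: bytes) -> str:
    """Escape backslashes and every byte outside printable ASCII as \\xNN."""
    out = []
    for b in data:
        if b == 0x5C or b < 0x20 or b > 0x7E:
            out.append("\\x%02x" % b)
        else:
            out.append(chr(b))
    return "".join(out)


def unescape(text: str) -> bytes:
    """Invert escape(): turn each \\xNN sequence back into a single byte."""
    out = bytearray()
    i = 0
    while i < len(text):
        if text[i] == "\\":
            # A backslash is always followed by 'x' and two hex digits.
            out.append(int(text[i + 2:i + 4], 16))
            i += 4
        else:
            out.append(ord(text[i]))
            i += 1
    return bytes(out)


# Literal text "\xc2" followed by the real bytes c2 ae.
original = b"/a\\xc2\xc2\xae"
logged = escape(original)
print(logged)
print(unescape(logged) == original)
```

Without escaping the backslash, the literal text and the real bytes in
the example would produce the same log output and could not be told
apart afterwards.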

Thanks.
-Paul

Hey Paul,

> I realize the RFC does not permit unicode in URLs, but given that they
> do occur in practice (browsers will just silently handle them), this
> seems like something worth supporting.

I think what you are looking for is this.

http://en.wikipedia.org/wiki/Internationalized_resource_identifier
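
As a rough illustration of how an IRI maps to a plain ASCII URI, the
non-ASCII characters are UTF-8 encoded and each resulting byte is
percent-encoded. A small Python sketch, with a path invented for the
example:

```python
from urllib.parse import quote, unquote

# Hypothetical path containing the registered sign (U+00AE).
iri_path = "/products/acme\u00ae"

# Percent-encode the UTF-8 bytes of every character outside the safe set.
uri_path = quote(iri_path, safe="/")

print(uri_path)                       # /products/acme%C2%AE
print(unquote(uri_path) == iri_path)  # True
```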

Best

Christian