Good morning everyone.
I’m researching compression of Zeek data. I’m currently dumping Zeek data into Parquet files, and one of the most challenging fields to compress is uid because of its high entropy.
I’m wondering if there’s any interest in changing the uid format to something like ULID, for which a C++ implementation already exists.
A ULID-based uid implementation would:
- allow uids to be sorted (ULIDs sort lexicographically by creation time), which isn’t helpful in and of itself but is very helpful for compression
- still be URL-safe
- always be 26 characters long, for simpler storage
- be case-insensitive
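To make those properties concrete, here is a minimal Python sketch of ULID encoding as I understand the ULID spec (48-bit millisecond timestamp followed by 80 random bits, encoded as 26 Crockford base32 characters). This is not Zeek code and not the C++ library mentioned above; the function names `encode_ulid`, `new_ulid`, and `timestamp_of` are made up for illustration.

```python
import os
import time

# Crockford base32 alphabet used by ULID (omits I, L, O, U).
# The characters are in ascending ASCII order, so string order
# matches numeric order.
ALPHABET = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"


def encode_ulid(timestamp_ms: int, randomness: bytes) -> str:
    """Encode a 48-bit timestamp and 80 random bits as 26 characters."""
    if not 0 <= timestamp_ms < 2**48:
        raise ValueError("timestamp must fit in 48 bits")
    if len(randomness) != 10:
        raise ValueError("randomness must be exactly 10 bytes")
    # Timestamp occupies the high bits, so it dominates the sort order.
    value = (timestamp_ms << 80) | int.from_bytes(randomness, "big")
    chars = []
    for _ in range(26):  # 26 * 5 bits covers the 128-bit value
        chars.append(ALPHABET[value & 0x1F])
        value >>= 5
    return "".join(reversed(chars))


def new_ulid() -> str:
    """Generate an id from the current wall clock and OS randomness."""
    return encode_ulid(time.time_ns() // 1_000_000, os.urandom(10))


def timestamp_of(ulid: str) -> int:
    """Recover the millisecond timestamp from the first 10 characters."""
    value = 0
    for ch in ulid[:10].upper():  # upper() makes decoding case-insensitive
        value = value * 32 + ALPHABET.index(ch)
    return value
```

Ids created in the same time window share a long common prefix, which is what should help a columnar format like Parquet. The sketch does not handle the monotonic counter some ULID implementations use for ids generated within the same millisecond.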
Looking through the code (UID.h and UID.cc) and its usages, the change doesn’t look technically difficult, but I’m sure there are reasons for the current format that I’m missing. For example, I noticed that prefixes such as the letter ‘C’ are used to denote kinds of connections. Perhaps that information could be moved to a separate field instead?
Anyway, I’m looking for thoughts, comments, suggestions, and anything else. Thank you!