Good morning everyone.
I’m researching compression of Zeek data. I’m currently dumping the logs into Parquet files, and one of the most challenging fields to compress is uid because of its high entropy.
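To make the entropy point concrete, here’s a stdlib-only illustration (not Zeek code; the alphabet and 17-character length are just approximations of Zeek’s uid shape) showing that a column of random uid-like strings barely compresses:

```python
# Illustration: fixed-length strings of random alphanumerics, shaped roughly
# like Zeek uids, resist compression because nearly every byte is entropy.
import random
import string
import zlib

random.seed(0)
ALPHABET = string.ascii_letters + string.digits  # rough stand-in for Zeek's uid alphabet

# 5,000 fake 17-character uids, one per line, mimicking a dumped uid column.
raw = "\n".join(
    "C" + "".join(random.choices(ALPHABET, k=16)) for _ in range(5000)
).encode()
compressed = zlib.compress(raw, 9)
print(f"raw: {len(raw)} bytes, compressed: {len(compressed)} bytes")
```

On my understanding, the 16 random characters carry close to 6 bits of entropy each, so even gzip-style compression only shaves off roughly a third — which is why a sortable, shared-prefix uid would help so much.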
I’m wondering if there’s any interest in changing the format of the uid to something like ULID, of which there is a C++ implementation already.
A ULID-based uid implementation would:
- allow uids to be sorted, which isn’t helpful in and of itself but is very helpful for compression
- remain URL-safe
- always be 26 characters long, for simpler storage
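For reference, the format is simple enough to sketch in a few lines. This is a minimal stdlib Python illustration of the ULID layout (48-bit millisecond timestamp plus 80 random bits, Crockford base32-encoded), not the C++ implementation mentioned above:

```python
# Minimal ULID sketch: 48-bit millisecond timestamp + 80 random bits,
# encoded as 26 Crockford base32 characters.
import os
import time

# Crockford's base32 alphabet: URL-safe, ASCII-ascending, no I/L/O/U.
CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"

def ulid() -> str:
    ts = int(time.time() * 1000)  # 48-bit millisecond timestamp
    value = (ts << 80) | int.from_bytes(os.urandom(10), "big")  # 128 bits total
    chars = []
    for _ in range(26):           # 26 chars * 5 bits covers the 128 bits
        chars.append(CROCKFORD[value & 0x1F])
        value >>= 5
    return "".join(reversed(chars))
```

Because the timestamp occupies the leading characters, the alphabet is in ascending ASCII order, and the length is fixed, lexicographic order matches creation order — exactly the property that lets a columnar encoder exploit shared prefixes.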
Looking through the code (UID.h and UID.cc) and its usages, it doesn’t look technically difficult, but I’m sure there are reasons I’m missing. For example, I noticed that prefixes such as the letter ‘C’ are used to denote kinds of connections. Perhaps that data could be moved into a separate field instead?
Anyway, I’m looking for thoughts, comments, suggestions, and anything else. Thank you!
I don’t have much feedback on the uid bits, but I’m very interested in Parquet! I had looked into doing this a while back, but the tooling around Parquet was very Java/big-data focused and not very CLI-friendly. Are you using the new C++ implementation in a log writer, or are you converting JSON to Parquet?
I’d say the tooling is still Java-focused, but I found some decent CLI tooling at https://github.com/apache/parquet-mr/tree/master/parquet-tools
Specifically, I used the convert command to go from JSON → Parquet. Going from JSON.gz to Parquet (with the gzip compression codec) saved us about 35%.
When you say “log writer”, do you mean a custom Zeek writer that writes to Parquet directly?
The major issue we’re facing is that the schema of Zeek output can change over time (more columns can be added). That’s an issue for Parquet, since each file embeds a fixed schema.
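To make that problem concrete, here’s a plain-Python sketch (no Parquet library involved; the field names, including community_id, are made up for illustration) of the normalization step schema drift forces on a writer — union the columns seen so far and null-fill the gaps before rows with differing columns can share one file:

```python
# Sketch: records whose columns drifted over time are normalized by
# unioning the column sets (preserving first-seen order) and filling
# missing fields with None, so all rows share one schema.
def unify(records):
    columns = []
    for rec in records:
        for key in rec:
            if key not in columns:
                columns.append(key)
    return columns, [{c: rec.get(c) for c in columns} for rec in records]

# Hypothetical example: a column appears partway through the data.
old = {"ts": 1, "uid": "CHhAvVGS1DHFjwGM9"}
new = {"ts": 2, "uid": "C4J4Th3PJpwUYZZ6gc", "community_id": "1:abc"}
cols, rows = unify([old, new])
```

A real writer would additionally have to decide whether to rewrite earlier files or rely on the reader to merge schemas across files, but the null-filling step above is the core of it.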