changing format of uid to ULID?

Karl_Pietrzak · August 30, 2019, 6:10pm

Good morning everyone.

I’m researching compression of Zeek data. I’m currently dumping Zeek data into Parquet files, and one of the most challenging fields to compress is uid because of its high entropy.
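To illustrate the entropy problem, here is a rough sketch rather than a measurement of real Zeek logs. It compresses a column of random alphanumeric strings, standing in for uids, alongside a low-entropy column. The 18-character length, the alphabet, and the protocol values are assumptions made for the example.

```python
import random
import string
import zlib

def compression_ratio(values):
    """Return compressed size / raw size for a newline-joined column."""
    raw = "\n".join(values).encode("ascii")
    return len(zlib.compress(raw, 9)) / len(raw)

rng = random.Random(0)  # fixed seed so runs are repeatable
alphabet = string.ascii_letters + string.digits

# Stand-in for a uid column: random alphanumeric strings (assumed shape).
random_ids = ["".join(rng.choices(alphabet, k=18)) for _ in range(10_000)]

# Stand-in for a low-entropy column, such as a protocol name.
protocols = [rng.choice(["tcp", "udp", "icmp"]) for _ in range(10_000)]

id_ratio = compression_ratio(random_ids)
proto_ratio = compression_ratio(protocols)
print(f"random ids: {id_ratio:.2f}, protocols: {proto_ratio:.2f}")
```

The random column should stay close to its raw size, while the repetitive column shrinks to a small fraction of it.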

I’m wondering if there’s any interest in changing the format of the uid to something like ULID, of which there is a C++ implementation already.

A ULID-based uid implementation would:

allow uids to be sorted, which isn’t helpful in and of itself, but is very helpful for compression
still be URL-safe
always be 26 characters long, for simpler storage
be case-insensitive
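A minimal sketch of the properties listed above, following the ULID layout (a 48-bit millisecond timestamp followed by 80 random bits, encoded as 26 Crockford base32 characters). The function names, the start time, and the zlib comparison are assumptions for illustration; this is not the C++ implementation mentioned above, and zlib on a text column is only a proxy for what Parquet would do.

```python
import os
import random
import time
import zlib

# Crockford base32 alphabet: digits and uppercase letters without I, L, O, U.
CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"

def make_ulid(timestamp_ms=None, randomness=None):
    """Encode a 48-bit timestamp and 10 random bytes as a 26-character string."""
    if timestamp_ms is None:
        timestamp_ms = time.time_ns() // 1_000_000
    if randomness is None:
        randomness = os.urandom(10)
    if not 0 <= timestamp_ms < 2**48:
        raise ValueError("timestamp must fit in 48 bits")
    if len(randomness) != 10:
        raise ValueError("randomness must be exactly 10 bytes")
    value = (timestamp_ms << 80) | int.from_bytes(randomness, "big")
    chars = []
    for _ in range(26):  # 26 characters * 5 bits covers the 128-bit value
        chars.append(CROCKFORD[value & 0x1F])
        value >>= 5
    return "".join(reversed(chars))

def compressed_size(values):
    """Size in bytes of the newline-joined column after zlib compression."""
    return len(zlib.compress("\n".join(values).encode("ascii"), 9))

rng = random.Random(0)  # fixed seed so runs are repeatable
start_ms = 1_567_188_600_000  # arbitrary start time in milliseconds

# One id per millisecond, each with its own 80 random bits.
ulids = [
    make_ulid(start_ms + i, rng.getrandbits(80).to_bytes(10, "big"))
    for i in range(10_000)
]
# Fully random ids of the same length and alphabet, for comparison.
random_ids = ["".join(rng.choices(CROCKFORD, k=26)) for _ in range(10_000)]

print(len(ulids[0]), ulids == sorted(ulids))
print(compressed_size(ulids), compressed_size(random_ids))
```

Because the timestamp occupies the leading characters and the alphabet is in ascending ASCII order, ids generated later sort after earlier ones, and neighbouring ids share a long common prefix that a compressor can exploit.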

Looking through the code (UID.h and UID.cc) and its usages, it doesn’t look technically difficult but I’m sure I’m missing some reasons. For example, I noticed that prefixes such as the letter ‘C’ are used to denote kinds of connections. Perhaps that data can be extracted to another field instead?

Anyways, looking for thoughts, comments, suggestions, and anything else. Thank you!

JustinAzoff · August 30, 2019, 6:21pm

I don’t have much feedback on the uid bits, but I’m very interested in Parquet! I had looked into doing this a while back, but the tooling around Parquet was very Java/big-data focused and not very CLI-friendly. Are you using the new C++ implementation in a log writer, or are you converting JSON to Parquet?

Karl_Pietrzak · August 30, 2019, 6:47pm

I’d say the tooling is still Java-focused, but I found some decent CLI tooling at https://github.com/apache/parquet-mr/tree/master/parquet-tools

Specifically, I used the convert command to go from JSON → Parquet. Going from JSON.gz to Parquet (with the gzip compression codec) saved us about 35%.

When you say “log writer”, do you mean a custom Zeek writer that writes to Parquet directly?

The major issue we’re facing is that the schema for Zeek output can change over time (more columns can be added). That’s an issue for Parquet.
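One possible workaround, sketched under assumptions: before writing a batch, pad every record to the union of the columns seen in that batch, so older records carry nulls for newer columns. The helper name and the field names below are hypothetical, and this does not address schemas that differ between already-written files.

```python
def unify_schema(records):
    """Return (columns, rows) where every row has every column seen in the batch."""
    columns = []
    for record in records:
        for name in record:
            if name not in columns:
                columns.append(name)  # keep first-seen column order
    # Missing fields become None, which maps naturally to a null value.
    rows = [{name: record.get(name) for name in columns} for record in records]
    return columns, rows

# Hypothetical records: the second one has a column the first one lacks.
old_record = {"ts": 1567188600.0, "uid": "C1", "proto": "tcp"}
new_record = {"ts": 1567188601.0, "uid": "C2", "proto": "udp", "extra_field": "x"}

columns, rows = unify_schema([old_record, new_record])
print(columns)
print(rows[0])
```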
