Script to extract Elasticsearch mapping from the header of Bro logs


Many of us use Elasticsearch as a sink for Bro logs. I am thinking
about writing a script to extract the correct mapping from the Bro
log header.

This would mean:
* mapping data types:
  string, addr, enum -> string
  int, count, port -> long
  interval, double -> double
  time -> epoch_millis
* setting 'not_analyzed' for types like addr where analysis makes no sense
* handling container types (table, set, vector)
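The mapping above could be driven by the #fields/#types lines that Bro writes at the top of each ASCII log. Here is a minimal sketch, assuming tab-separated header lines; the function and table names are my own, and the 'string'/'not_analyzed' settings follow the pre-2.x Elasticsearch mapping syntax discussed here:

```python
# Sketch: derive an Elasticsearch mapping from the #fields/#types
# header lines of a Bro ASCII log. Type table follows the mapping
# proposed above; names are hypothetical, not an existing tool.

BRO_TO_ES = {
    "string":   {"type": "string", "index": "not_analyzed"},
    "addr":     {"type": "string", "index": "not_analyzed"},
    "enum":     {"type": "string", "index": "not_analyzed"},
    "int":      {"type": "long"},
    "count":    {"type": "long"},
    "port":     {"type": "long"},
    "interval": {"type": "double"},
    "double":   {"type": "double"},
    "time":     {"type": "date", "format": "epoch_millis"},
}

def mapping_from_header(lines):
    """Build an ES 'properties' dict from Bro header lines."""
    fields, types = [], []
    for line in lines:
        if line.startswith("#fields"):
            fields = line.rstrip("\n").split("\t")[1:]
        elif line.startswith("#types"):
            types = line.rstrip("\n").split("\t")[1:]
    props = {}
    for name, btype in zip(fields, types):
        # container types look like "set[addr]" or "vector[string]";
        # ES indexes JSON arrays with the element's mapping, so unwrap
        if "[" in btype:
            btype = btype[btype.index("[") + 1 : -1]
        props[name] = dict(BRO_TO_ES.get(btype, {"type": "string"}))
    return {"properties": props}
```

The result could be PUT into an index template so every new daily index picks it up automatically.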

Any ideas? Has anyone done this before?



In case you are talking about importing a Bro ASCII log into the database:
I did something like that for Postgres once. My script automatically
created tables with the right types (including ones like inet) and
converted sets and vectors to Postgres arrays.
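The set/vector-to-array step is roughly this; a sketch under the assumption that the log uses Bro's default set separator ',' and the usual '-'/'(empty)' markers for unset and empty fields (the function name is mine):

```python
# Sketch of converting a Bro set/vector field into a Postgres array
# literal. Bro's ASCII writer joins elements with the #set_separator
# (',' by default); '-' marks an unset field, '(empty)' an empty one.

def bro_set_to_pg_array(value, sep=","):
    """Turn a Bro container field into a Postgres array literal."""
    if value == "-":          # unset field -> SQL NULL
        return None
    if value == "(empty)":    # empty container -> empty array
        return "{}"
    items = value.split(sep)
    # quote each element so commas and braces inside values stay literal
    quoted = ",".join('"%s"' % i.replace('"', '\\"') for i in items)
    return "{" + quoted + "}"
```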

Source is at



Elasticsearch gets difficult because there's a lot of context-specific
data that should be captured too, especially when it comes to indexing.
For example, I liked to index domain names with a reverse-path
tokenization on '.' as the delimiter, so that a full hostname shows up
in searches for the name itself and for each of its parent domains.
Capturing this context can be very tricky, and I don't think that it's
currently available in the ASCII logs.
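The tokenization described above can be sketched client-side like this (the hostname is a hypothetical example); Elasticsearch's path_hierarchy tokenizer with delimiter '.' and reverse set to true produces the same tokens server-side:

```python
# Sketch of reverse-path tokenization on '.': emit one token per
# domain suffix, so a search for any parent domain matches the
# full name. Function name is mine, for illustration only.

def reverse_path_tokens(hostname):
    labels = hostname.split(".")
    # build suffixes from the right: "a.b.c" -> ["c", "b.c", "a.b.c"]
    return [".".join(labels[i:]) for i in range(len(labels) - 1, -1, -1)]
```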

I'd be curious if anyone has thoughts on how to improve this.


Frank Meier <> writes: