elastic search / bro questions

Hey all,

Just going to throw this out there and hope some people are willing to share their learning experiences, if they have any.

We have a system which generates around 15k-30k Bro events/sec, and we are trying to ingest these logs into a fairly beefy Elasticsearch cluster. Total cluster memory is ~300GB, storage ~300TB.

Long story short, we’re having some problems keeping up with this feed. Does anyone have any performance-tuning advice for this module? I’ve played a lot with rsyslog batch sizes with Elasticsearch and was hoping there would be some simple directive I could try to apply to Bro.

Does anyone here have experience with this? Does this module batch anything?

Thanks in advance.

Cheers,

JB

Unless it’s changed within the past month or so, the ElasticSearch writer that comes with Bro is very alpha-level code. For the most part it fires and forgets, and it’s prone to losing messages if your cluster isn’t able to keep up or some other situation keeps it from ingesting the data properly.

Your best bet, as of now, is to write the logs out to disk and use some intermediary program to process them and ingest them into ES. Logstash can help, but with the default custom format Bro uses, it can’t parse the data properly. If you’re using Bro 2.3, you can modify the output format of the ASCII writer to use JSON instead and then use Logstash to feed the data into ES relatively easily. Further, I’d recommend using a RabbitMQ river so ES can ingest the data at its leisure.
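If it helps, switching the ASCII writer to JSON should just be a one-line redef in local.bro (a minimal sketch from memory, so double-check the option name on your install):

# Bro 2.3: have the default ASCII writer emit JSON instead of the
# tab-separated Bro format, which Logstash can then parse natively.
redef LogAscii::use_json = T;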

If you’re stuck with the non-JSON format, well, your options are kind of limited. You can write a crazy custom Logstash conf using grok (which is super inefficient) or figure out some other mechanism.

As an aside, I’ve written a custom Logstash filter that processes the custom Bro format and is, to a limited extent, Bro-type-aware, so it can take old-style Bro logs relatively easily and make them more usable (numbers are turned into numbers, and sets, vectors and tables are turned into arrays, the same as how I’ve seen the ES writer output data). There are some caveats to its usage, though. I’m putting the finishing touches on it and plan to release it when I get a chance (hopefully within the next week or two).

Yuck. I was really hoping this wasn’t the way. From everything you said, the river is where I’m focusing. I really, really dislike Logstash (I’d rather bend rsyslog + the ES output plugin to my liking any day).

I’ve written a few custom ES output/input parsers and many Solr parsers that will parse Bro logs, proxy logs, etc., but I would rather focus on something more native for output to ES if possible.

I guess it might be time to dig into some src…

Thanks for the feedback.

Cheers,

JB

There is a solution that has been in development for some time. We've done some work on having Bro write directly to NSQ (a disk-backed, HTTP-based queuing daemon), and there is another tool that pulls from NSQ and inserts into Elasticsearch. So far it seems that this can keep up with quite high-volume networks.

Thanks for reporting this to the list. More people surfacing problems like this can certainly prompt development on features like this. ;)

  .Seth

If you’re looking at the source for 2.3, it’ll be in src/logging/writers/ElasticSearch.cc.

In master, the ES writer has been moved to the new plugin architecture, although the code remains largely the same (as far as I can tell), so you can take a look at the bro-plugins repo on GitHub.

As a little preview: the writer uses the ES bulk interface to send data via libcurl, and if there’s any error, it basically ignores it and continues on.
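To circle back to the earlier batching question: yes, it batches, since the bulk interface exists for exactly that. If your copy of elasticsearch.bro exposes the usual &redef constants, the batching can be tuned from local.bro. A rough sketch (the names here are from memory, so verify them against your install):

# Assumed knob names from the LogElasticSearch module; check
# base/frameworks/logging/writers/elasticsearch.bro before relying on them.
redef LogElasticSearch::max_batch_size = 10000;      # documents per bulk request
redef LogElasticSearch::max_batch_interval = 30sec;  # flush at least this often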

Also, I wrote two custom writers a few months back, an AMQP writer and an ElasticSearch River writer, both of which wrote to an AMQP server (the latter making river-compliant messages for direct ingestion into ES). They worked well under testing, but I didn’t go any further with them since my pull request to the Bro repo wasn’t accepted.

We tried the ElasticSearch writer with mixed results. Our Bro cluster (two workers, one manager/proxy) processes roughly the same number of events you’re dealing with, and we write all Bro data to a RAID10 array. We then have a Logstash shipper grab the logs as fast as they come in and ship them off to a couple of Redis systems. A Logstash indexer then pulls the data from Redis and mutates it in various ways (renaming attributes, geolocation) before it is shipped off to our ElasticSearch cluster. We have limited hardware, but we’ve been able to pump data fast enough for a 10Gbps link that peaks around 3.5Gbps. The architecture is more complicated, but it’s scalable.

Having Redis in place is nice when you have traffic bursts and Logstash needs time to catch up; it acts as a good buffer. Logstash can also be tuned to have multiple worker processes and filter threads, which took us some time to get right. It’s a bit of a balancing act.

… And I just saw your response. Sounds like Logstash is not a good option for you.

Nick

Could you remind me of the ticket number? I don't recall that we rejected your patches; it's possible that we've just not had a motivator to drive them forward.

  .Seth

How about using Heka to read and parse the logs, and MozDef to collect them? That’s what we do here with, I believe, 7k eps, soon to be more. Or just Heka. I’d go for both; we’re working on a plug-and-play configuration.

One of the good things about Heka is that it’s insanely fast. Tests showed a 10Gbit/sec pipe saturated with logs.

Heka

http://blog.mozilla.org/services/2013/04/30/introducing-heka/
https://github.com/mozilla-services/heka
https://hekad.readthedocs.org/en/v0.8.0/

MozDef

https://github.com/jeffbryner/MozDef
http://mozdef.readthedocs.org/en/latest/

Hey all,

We’re still fighting some troubles, but we’ve finished isolating all data nodes from our master nodes. That seemed to help with the general availability of the cluster and with data throughput (we think some of the data nodes couldn’t talk to the masters, creating a bunch of stability issues).

Now, all of that is really an ES thing, but I thought it might be valuable, so I’ve added it.

I now have a question for the Bro folks regarding indexing to the ‘bro’ index (as opposed to ‘bro-201410242100’). Our ‘bro’ index is at over 10B records. When the index needs to be brought back up after (routine) catastrophic failures, we find ourselves waiting a really, really long time while the massive ‘bro’ primary shards initialize.

My question is this: many of these ES issues look like they could be alleviated if we were shoving all of the Bro logs into ‘bro-YYYYmmddHHMM’ indices, instead of some there and some in the giant ‘bro’ index. Is there any reason why we can’t force all of the ES logging into the time-based indices instead of the one giant ‘bro’ index? Would anyone know where to start hacking the Bro code to try and make this possible?

By the way, thanks a ton for the help, everyone. I’ll definitely post a full lessons-learned once we get everything up the way we’re expecting.

Cheers,

JB


Are you processing tracefiles? If you are processing live traffic from an interface, it should already be sharding into indexes like you want.

  .Seth

I’m not processing offline files, if that’s what you mean (still a bit new to Bro, feel free to expand on the tracefiles).

I’m sniffing many interfaces, but it appears most (not all, but most) logs are going into the ‘bro’ index, without the timestamp.

I was going to try to hack something around this in ‘share/bro/base/frameworks/logging/writers/elasticsearch.bro’ to change the index to be dynamic with the date:

## Name of the ES index.
const index_prefix = "bro" &redef;
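Something like this in local.bro is what I had in mind (just a sketch; I’m assuming the writer script declares these under module LogElasticSearch and that an initializer is allowed to call a function like strftime()):

# Hypothetical: build the index prefix with the current date at startup.
redef LogElasticSearch::index_prefix = fmt("bro-%s", strftime("%Y%m%d%H%M", current_time()));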

Not sure if that would only get read on program instantiation though…

I might also be way out in left field… Any shove in the right direction helps :).

Cheers,

JB

Ohh, I know what's happening. You're running Bro directly at the command line without using broctl, aren't you? Bro doesn't have log rotation enabled by default, and the index name rotation is based on log rotation.

Set this in a script you're loading...

redef Log::default_rotation_interval = 1hr;

I haven't double-checked and I'm not sure what that will do to the ASCII logs, but it should at least give you partitioned index names in ES.

  .Seth

Nope, I invoke Bro using broctl like this:

su snort -c "export https_proxy='https://$PROXY:$PROXYPORT'; /opt/data/bro/bin/broctl restart --clean"

Which usually shows things like this:

cleaning up …
cleaning up nodes …
checking configurations…
manager scripts are ok.
proxy-0 scripts are ok.
worker-0-1 scripts are ok.
worker-0-2 scripts are ok.
worker-0-3 scripts are ok.
worker-0-4 scripts are ok.
worker-1-1 scripts are ok.
worker-1-2 scripts are ok.
worker-1-3 scripts are ok.
worker-2-1 scripts are ok.
worker-2-2 scripts are ok.
worker-2-3 scripts are ok.
worker-3-1 scripts are ok.
worker-3-10 scripts are ok.
worker-3-11 scripts are ok.
worker-3-12 scripts are ok.
worker-3-2 scripts are ok.
worker-3-3 scripts are ok.
worker-3-4 scripts are ok.
worker-3-5 scripts are ok.
worker-3-6 scripts are ok.
worker-3-7 scripts are ok.
worker-3-8 scripts are ok.
worker-3-9 scripts are ok.
worker-4-1 scripts are ok.
worker-4-2 scripts are ok.
worker-4-3 scripts are ok.
worker-5-1 scripts are ok.
worker-5-2 scripts are ok.
worker-5-3 scripts are ok.
worker-5-4 scripts are ok.
installing …
removing old policies in /opt/data/bro/spool/installed-scripts-do-not-touch/site … done.
removing old policies in /opt/data/bro/spool/installed-scripts-do-not-touch/auto … done.
creating policy directories … done.
installing site policies … done.
generating cluster-layout.bro … done.
generating local-networks.bro … done.
generating broctl-config.bro … done.
updating nodes … done.
starting …
starting manager …
starting proxy-0 …
starting worker-0-1 …
starting worker-0-2 …
starting worker-0-3 …
starting worker-0-4 …
starting worker-1-1 …
starting worker-1-2 …
starting worker-1-3 …
starting worker-2-1 …
starting worker-2-2 …
starting worker-2-3 …
starting worker-3-1 …
starting worker-3-10 …
starting worker-3-11 …
starting worker-3-12 …
starting worker-3-2 …
starting worker-3-3 …
starting worker-3-4 …
starting worker-3-5 …
starting worker-3-6 …
starting worker-3-7 …
starting worker-3-8 …
starting worker-3-9 …
starting worker-4-1 …
starting worker-4-2 …
starting worker-4-3 …
starting worker-5-1 …
starting worker-5-2 …
starting worker-5-3 …
starting worker-5-4 …

Our node.cfg looks like this:

[manager]
type=manager
host=$IP
[proxy-0]
type=proxy
host=$IP
[worker-0]
type=worker
host=$IP
interface=eth2
lb_method=pf_ring
lb_procs=4
pin_cpus=0,1,2,3
[worker-1]
type=worker
host=$IP
interface=eth3
lb_method=pf_ring
lb_procs=3
pin_cpus=5,6,7
[worker-2]
type=worker
host=$IP
interface=eth4
lb_method=pf_ring
lb_procs=3
pin_cpus=4,8,9
[worker-3]
type=worker
host=$IP
interface=eth5
lb_method=pf_ring
lb_procs=12
pin_cpus=10,11,12,13,14,15,23,24,25,26,27,28
[worker-4]
type=worker
host=$IP
interface=eth6
lb_method=pf_ring
lb_procs=3
pin_cpus=16,17,18
[worker-5]
type=worker
host=$IP
interface=eth7
lb_method=pf_ring
lb_procs=4
pin_cpus=19,20,21,22

Our logs-to-elasticsearch.bro has this:

const rotation_interval = 24hr &redef;

We add custom country logging doing stuff like this (this is smtp/savecountry.bro):

redef record SMTP::Info += {
    orig_cc: string &log &optional;
    resp_cc: string &log &optional;
};

event smtp_reply(c: connection, is_orig: bool, code: count, cmd: string,
                 msg: string, cont_resp: bool) &priority=3
    {
    local orig_loc = lookup_location(c$id$orig_h);
    if ( orig_loc?$country_code )
        c$smtp$orig_cc = orig_loc$country_code;

    local resp_loc = lookup_location(c$id$resp_h);
    if ( resp_loc?$country_code )
        c$smtp$resp_cc = resp_loc$country_code;
    }

This shouldn’t need the redef for log rotation, should it? The only non-stock stuff we do is adding countries to conn and smtp. Everything else should be stock.

Any ideas?

Cheers,

JB

Weird… As Seth mentioned, the writer uses the time and the rotation interval to name the indexes. It should also create an @ index for metadata. I thought the time format was hard-coded in the ES writer, but it’s been a while since I read the code…

Also, with regard to ES restarts, there are some tunables. For one, optimizing the indexes should help. Also, if you have the bandwidth, you can increase the number of concurrent recoveries and the allowed network throughput.

So, for the record, this is what happens when you configure Bro with a log rotation interval of 0 in broctl and still send logs to Elasticsearch: most of the logs end up in the ‘bro’ index, but some still get sent to the bro-$DATETIME index. This was the result of some legacy configs (no logrotate for rsyslog, so as not to lose file handles) which sent the data to a homebrew ES plugin. I had forgotten to remove these configs when setting up Bro for the more native ES path.
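For anyone who hits the same thing, the fix for us was the rotation setting in broctl.cfg. Putting it back to something sane restored the time-based index names (I believe the option is named as below, but check your broctl docs):

# broctl.cfg: rotation interval in seconds; 0 disables rotation, which is
# what was sending everything to the plain 'bro' index.
LogRotationInterval = 3600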

Thanks tons for the quick response in the IRC channel.

Cheers,

JB

One more thing I wanted to share… In ‘bro/share/bro/base/frameworks/logging/writers/elasticsearch.bro’ it says:
##! There is one known memory issue. If your elasticsearch server is
##! running slowly and taking too long to return from bulk insert
##! requests, the message queue to the writer thread will continue
##! growing larger and larger giving the appearance of a memory leak.

Interesting to see this queuing graphed out on a box with 96GB of RAM… It ran into swap pretty quickly… :)

[inline image: memory usage graph]

All in good fun, I suppose…

Cheers,

JB

Yeah, unfortunately, ES frequently has a hard time keeping up for people. This is where having the logs go to an external queuing system first can be beneficial.

  .Seth