We are looking to set up a proper ES cluster and dump Bro logs into it via Logstash. The thought is to have 6 ES nodes (2 dedicated masters, 4 data nodes). If we are dumping 15 TB of data into the cluster a year (possibly as high as 20 or 25 TB) from Bro, are 4 data nodes sufficient? The boxen will only have 64 GB of RAM (30 for the Java heap, 34 for system use) and probably 16 discrete cores. I have a feeling that this is horribly insufficient for a data cluster of that size.
1 year retention, 1x replication.
From my adventures with two ES clusters receiving bro logs:
The big “sizing” issues relate to how many Bro events are being inserted into ES. One of my ES clusters does about 25k events/sec from Bro. We found this is more than a simple bro -> logstash -> ES pipeline can take. We ended up putting a Kafka buffer between Bro and Logstash, so now it’s bro -> kafka -> logstash -> ES. We tried other buffer-type things, like syslog-ng and RabbitMQ, but settled on Kafka since there is a bro-pkg for it, too. I hear some people like Redis for this function. Bro can now burst event logs out, and Kafka lets the Logstash-to-ES process run at a more sustained pace. Kafka and Logstash run on some old box we had lying around, so the Bro box doesn't have to do anything but slop logs out.
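For the Logstash side of that buffer, a minimal pipeline sketch looks like the below. The broker address, topic name, consumer group, ES host, and index pattern are all placeholders I've made up for illustration; only the kafka input and elasticsearch output plugins themselves are standard Logstash pieces.

```conf
# Hypothetical Logstash pipeline: pull Bro logs off a Kafka topic and
# bulk-index them into ES. Hostnames, topic, and index pattern are
# assumptions -- substitute your own.
input {
  kafka {
    bootstrap_servers => "kafka01:9092"   # assumed broker address
    topics            => ["bro_logs"]     # assumed topic name
    group_id          => "logstash-bro"   # assumed consumer group
    codec             => "json"           # assumes Bro emits JSON logs
  }
}
output {
  elasticsearch {
    hosts => ["http://ingest01:9200"]     # point at your insertion node(s)
    index => "bro-%{+YYYY.MM.dd}"         # daily indices, easy to expire later
  }
}
```

Daily indices make retention trivial, since you delete whole indices rather than individual documents.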
The other issue was the rate at which ES can insert events while it’s doing other stuff. We ended up making the data nodes just do data, the masters just be masters, and specifically crafting “insertion” nodes that are the only ones Logstash talks to. This takes the index-loading and computational work off the data nodes. A couple of our systems have a master node and a data node on them, since the masters use very little in the way of resources. Note that when you search in ES, it pauses its other activities, since ES thinks search is its primary function in life, so the insertion rate drops. What we found is that having dedicated insertion nodes lets ES keep taking inserts even when you run heavy searches. Keeps my OPS people happy.
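The role split above is done in each node's elasticsearch.yml. A sketch, assuming the pre-7.x boolean role settings (node.master / node.data / node.ingest):

```yaml
# Dedicated master (cluster coordination only):
node.master: true
node.data: false
node.ingest: false

# Dedicated data node (holds shards, serves queries):
#   node.master: false
#   node.data: true
#   node.ingest: false

# "Insertion" node for Logstash to target (no shards, no master duty;
# it handles ingest pipelines and fans bulk requests out to data nodes):
#   node.master: false
#   node.data: false
#   node.ingest: true
```

Setting all three to false gives a pure coordinating node, which also works as an insertion target if you don't use ingest pipelines.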
When I saw your "25TB of data per year" I chuckled... My *small* ES cluster is three Dell R380s (each with 20 cores, 64 GB RAM, 35 TB disk). (I should have gotten more memory.) But our 95TB of disk lets us keep about 120 days of Bro logs before Curator deletes old indices to free up disk space. This cluster has been continuously up for about 4 months since I last played with the configs, so I'm content. I think your 4 data nodes are fine. If your insertion rate is high, make an insertion node on one of your masters.
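A Curator action file for that kind of age-out might look like this sketch, assuming daily indices named with a bro- prefix; the prefix, timestring, and 120-day count are illustrative, not taken from my actual config:

```yaml
# Curator (4.x/5.x style) action file: delete bro-* indices older than
# 120 days, judged by the date embedded in the index name.
actions:
  1:
    action: delete_indices
    description: "Expire old Bro indices to reclaim disk"
    options:
      ignore_empty_list: true
    filters:
      - filtertype: pattern
        kind: prefix
        value: bro-          # assumed index prefix
      - filtertype: age
        source: name
        direction: older
        timestring: '%Y.%m.%d'
        unit: days
        unit_count: 120      # retention window; tune to your disk
```

Run it from cron with `curator --config curator.yml action.yml` (paths are whatever you choose).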
p.s. Keep your java heap under 26GB; 23GB if you hate compressed java pointers.
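Concretely, that means pinning min and max heap in ES's jvm.options; the 26g figure here just restates the advice above, not a universal constant:

```conf
# jvm.options heap sketch: set Xms == Xmx so the heap never resizes,
# and keep it below the compressed-oops cutoff for your JVM.
-Xms26g
-Xmx26g
```

Rather than trusting a fixed number, verify on your own JVM; ES logs at startup whether compressed ordinary object pointers are in use for the heap size you picked.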
Thank you for your response, Patrick! To be honest, I am not sure just how large our data set is. One of the problems we have is that we just don’t have enough disk space to unpack our gzip’d logs to see what a year would look like. Do you happen to have a good document on how you are interfacing Kafka with Logstash?