Are intel files loaded into memory or statically evaluated? We have a 7.4 MB intel file we are looking to push; however, out of 400 GB of RAM we are already using all 400 GB, with a load average well over 10, and this is only a 3.5 Gb/s sustained link. We currently have about 2,000 lines of intel (cert hashes, file hashes, domains); this new addition would drive that up to ~35,000 lines. We are trying to determine whether this is practical given the current load on the box.
Also, why does Bro continuously chew up RAM? When first started, Bro uses about 80 GB, then climbs through the day to about 120-175 GB. However, if we leave it running for a few days, it ends up at the maximum memory available on the system…
What process is using memory? Workers? Proxies? Manager? If you can include the output of 'broctl top' that would be helpful.
Are intel files loaded into memory or statically evaluated?
It's loaded into memory. It's just using normal Bro data types which have some overhead.
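For reference, here is a minimal sketch of how a feed like that is usually wired in (the file name, path, and source label below are made-up examples). The data file is tab-separated with a #fields header, e.g.:

    #fields  indicator            indicator_type  meta.source
    evil.example.com              Intel::DOMAIN   local-feed
    badhost.example.net           Intel::DOMAIN   local-feed

(the columns must be actual tab characters), and it is pulled in from local.bro with something like:

    @load frameworks/intel/seen
    redef Intel::read_files += {
        "/opt/bro/share/bro/site/local-feed.dat",
    };

The indicators end up in an in-memory data store, and in a cluster the workers typically hold at least the indicator set as well so they can match against live traffic, so per-indicator overhead is multiplied across processes.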
We currently have about 2,000 lines of intel (cert hashes, file hashes, domains); this new addition would drive that up to ~35,000 lines. We are trying to determine whether this is practical given the current load on the box.
Generally I would expect that amount of intelligence to be fine. It seems as though you may have some other trouble in your deployment though.
Also, why does Bro continuously chew up RAM? When first started, Bro uses about 80 GB, then climbs through the day to about 120-175 GB.
How many workers are you running?
.Seth
We are running pfring with lb_procs=20. We have 40 cores on the box.
Is that 40 cores with hyperthreading or without? It's possible you're overwhelming the system if 20 of those cores are hyperthreaded (this is definitely a guess, though, since there are so many things that could cause trouble).
.Seth
We have two 10-physical-core CPUs, each presenting 20 logical cores with hyperthreading, for a total of 40 logical cores. Bro shows a capture loss of under 0.5% across all workers, so it seems unlikely that the box is overloaded. The capture rate of the box, per pf_ring, is about 3.5 Gb/s. We reported memory issues in the past, but those were written off as unrelated to the memory leak recently patched in the 2.4 and 2.5 branches.
What process is using memory? Workers? Proxies? Manager? If you can include the output of 'broctl top' that would be helpful. Otherwise it is pretty hard to determine what the issue may even be.
If you have a dual 10-core system and are running 20 workers, that leaves no room for the manager or for tasks like log rotation. For a 20-core system I would run at most 18 workers.
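As an illustration only (the interface name and CPU pinning here are assumptions, not taken from your setup), node.cfg for that kind of layout might look like:

    [manager]
    type=manager
    host=localhost

    [proxy-1]
    type=proxy
    host=localhost

    [worker-1]
    type=worker
    host=localhost
    interface=eth0
    lb_method=pf_ring
    lb_procs=18
    pin_cpus=2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19

which reserves cores 0-1 for the manager, proxies, and housekeeping tasks like log rotation.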
With hyperthreading that's actually 40 logical cores, not 20, so running 20 workers should leave plenty of headroom. At the time broctl top was run, 355 of the 390 GB of RAM were in use. The only things running on this box are Bro and a Splunk forwarder, and the forwarder is only using about 15 GB. We see this excessive memory consumption on all of our Bro boxes, regardless of the input stream; even on boxes only seeing 500 Mb/s, memory creeps up until it is exhausted. At no point is the OOM killer called, however, so Bro is not exceeding available memory, just consuming all of the available memory.
broctl top
Cool, thanks for the details. If you don't load your intelligence data, do you see any memory trouble? That seems like the next logical step to take for testing.
.Seth
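If it helps, one way to run that test (a sketch, assuming the feed is loaded through Intel::read_files in local.bro and that broctl manages the cluster) is to temporarily comment the feed out:

    # redef Intel::read_files += {
    #     "/opt/bro/share/bro/site/local-feed.dat",
    # };

then push the change and restart:

    /opt/bro/bin/broctl install
    /opt/bro/bin/broctl restart

and compare the memory growth in broctl top / free -m over the next day against what you see with the feed loaded.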
Can you show the output of
free -m
total used free shared buffers cached
Mem: 371336 340383 30952 0 300 111823
-/+ buffers/cache: 228259 143076
Swap: 15999 191 15808
Ah, I think you have been looking at the wrong numbers.
You are only using 228259M (~222G, not 355G).
111823M is unallocated and currently used for buffer/disk cache.
This amount will always grow until it ends up using almost all the 'free' memory on the machine.
The reason why the OOM killer isn't killing anything is because you still have over 128G of ram free.
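Spelling out the arithmetic from the free -m output above:

    340383 (used) - 300 (buffers) - 111823 (cached) = 228260M actually allocated by processes
     30952 (free) + 300 (buffers) + 111823 (cached) = 143075M effectively available

which, give or take rounding, is exactly the -/+ buffers/cache line.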
I added up all the RAM usage from the output of broctl top and, adding some overhead for the amounts rounded to whole gigs, came to 56184M.
Minus splunk, that does still leave about 150G unaccounted for.
I believe some of that will be used by packet buffers in the kernel, depending on how you have configured pf_ring.
But even at a huge 1G buffer for each of 20 workers (which I think is much much more than it uses by default) that is only another 20G.
That was about 2 hours after Bro was restarted. Here is the output from a Bro box with nearly identical throughput that has had Bro up and running for the past 48 hours. As you can see, at this rate the memory footprint has shot up to 4G per worker parent. If I did not restart Bro on the first box I reported every 8 hours or so, we would see the same result there.
free -m
total used free shared buffers cached
Mem: 387495 375736 11758 0 1073 248216
-/+ buffers/cache: 126446 261048
Swap: 15999 267 15732
/opt/bro/bin/broctl top
waiting for lock (owned by PID 71999) …
Name Type Host Pid Proc VSize Rss Cpu Cmd
manager manager localhost 69286 parent 853M 325M 108% bro
manager manager localhost 69313 child 381M 171M 21% bro
proxy-1 proxy localhost 69363 parent 1G 845M 34% bro
proxy-1 proxy localhost 69394 child 215M 112M 3% bro
proxy-2 proxy localhost 69391 child 210M 94M 7% bro
proxy-2 proxy localhost 69364 parent 935M 829M 3% bro
worker-1-1 worker localhost 69517 parent 4G 4G 79% bro
worker-1-1 worker localhost 70127 child 712M 627M 1% bro
worker-1-10 worker localhost 69526 parent 4G 4G 98% bro
worker-1-10 worker localhost 70123 child 712M 626M 1% bro
worker-1-11 worker localhost 69537 parent 4G 4G 83% bro
worker-1-11 worker localhost 70095 child 712M 627M 1% bro
worker-1-12 worker localhost 69545 parent 4G 4G 86% bro
worker-1-12 worker localhost 70098 child 712M 628M 1% bro
worker-1-13 worker localhost 69563 parent 4G 4G 92% bro
worker-1-13 worker localhost 70027 child 712M 628M 1% bro
worker-1-14 worker localhost 69564 parent 4G 4G 98% bro
worker-1-14 worker localhost 70140 child 712M 626M 1% bro
worker-1-15 worker localhost 69582 parent 4G 4G 98% bro
worker-1-15 worker localhost 70143 child 712M 628M 1% bro
worker-1-16 worker localhost 69577 parent 4G 4G 100% bro
worker-1-16 worker localhost 70125 child 712M 628M 0% bro
worker-1-17 worker localhost 69595 parent 4G 4G 98% bro
worker-1-17 worker localhost 70135 child 712M 629M 1% bro
worker-1-18 worker localhost 69600 parent 4G 4G 79% bro
worker-1-18 worker localhost 70141 child 712M 628M 0% bro
worker-1-19 worker localhost 69618 parent 4G 4G 77% bro
worker-1-19 worker localhost 70106 child 712M 624M 1% bro
worker-1-2 worker localhost 69615 parent 4G 4G 79% bro
worker-1-2 worker localhost 70138 child 712M 628M 0% bro
worker-1-20 worker localhost 69620 parent 4G 4G 88% bro
worker-1-20 worker localhost 70131 child 712M 628M 1% bro
worker-1-3 worker localhost 69631 parent 4G 4G 81% bro
worker-1-3 worker localhost 70025 child 712M 626M 1% bro
worker-1-4 worker localhost 69639 parent 4G 4G 86% bro
worker-1-4 worker localhost 70139 child 712M 628M 1% bro
worker-1-5 worker localhost 69636 parent 4G 4G 98% bro
worker-1-5 worker localhost 70108 child 712M 626M 1% bro
worker-1-6 worker localhost 69646 parent 4G 4G 100% bro
worker-1-6 worker localhost 70107 child 712M 625M 1% bro
worker-1-7 worker localhost 69647 parent 4G 4G 67% bro
worker-1-7 worker localhost 70097 child 712M 622M 1% bro
worker-1-8 worker localhost 69649 parent 4G 4G 84% bro
worker-1-8 worker localhost 70026 child 712M 626M 1% bro
worker-1-9 worker localhost 69651 parent 4G 4G 67% bro
worker-1-9 worker localhost 70134 child 712M 628M 1% bro
This box has 256G of ram free.
I'm sorry but I just don't see where you have a problem here.
Because on boxes where we aren't regularly restarting Bro, we end up with the OOM killer nuking Splunk and Bro.
OK… because before you said "at no point is the OOM killer called".
I'm assuming that you have a cron job or something running broctl restart every 8 hours.
Can you add a script that runs the following once per hour or so (set to run at a particular minute so it fires before the job that restarts Bro):
date
free -m
top -a -b -n 1
broctl top
and appends the output to a file, then show us what that file contains after a day or so?
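For example, something along these lines (the script path, log location, and cron minute are placeholders):

    #!/bin/sh
    # bro-memstats.sh: append a timestamped memory snapshot to a log file
    OUT=/var/tmp/bro-memstats.log
    {
        echo "===== $(date) ====="
        free -m
        top -a -b -n 1
        /opt/bro/bin/broctl top
    } >> "$OUT" 2>&1

driven from cron once an hour, e.g.:

    5 * * * * /usr/local/bin/bro-memstats.sh

with the minute picked so the snapshot lands shortly before the job that restarts Bro.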
If you've been showing us system information from immediately after Bro is restarted, and not from while the problem is occurring, then that data isn't very useful.