Memory Consumption

Hello Everyone:

I have a question concerning Bro memory utilization. I have two servers, both with the following specs…

Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz x 32
48GB RAM

Running: CentOS 6.5

w/ PF_RING, running 16 workers with 65536 ring slots.

I have Suricata and Bro running fine together on one server (server A), and Bro running by itself on the other server (server B).

Both servers see about the same amount of data at peak load (~1 Gb/s).

On server A, memory utilization operates at capacity, with Bro and Suricata taking up just about all there is.

On server B, memory utilization operates at capacity as well, with Bro slowly but surely consuming just about all available memory until it leaves me with about 200K free for the rest of the system. The same behavior is seen on server A too; however, server A is seemingly stable with the two apps running. At certain points about a gig of memory is freed up, but then it gets consumed all over again down to that 200K threshold.

What is going on here? Is it normal for Bro to take up as much memory as it can? Are others experiencing this too? Is there something I can do to restrict Bro from being such a memory hog?

I was considering getting more memory for the servers; however, based on these tests I cannot be certain Bro will not just run away with the new memory I give it in the same fashion :slight_smile:

FWIW, I have been experiencing this with both Bro 2.2 and 2.3.

Many thanks in advance for the help!

Thanks,
Jason

> What is going on here? Is it normal for Bro to take up as much memory as it can? Are others experiencing this too? Is there something I can do to restrict Bro from being such a memory hog?

Bro typically does consume quite a bit of memory and you're a bit tight on memory for the number of workers you're running.

A common problem is that sometimes state tables grow too large. If you load misc/profiling, you should get a prof.log in each of your processes' spool directories (they don't log centrally like other logs). Every so often there will be an indication in there of the largest global variables, and you can look at that for a start to see if some script is just holding on to more state than you'd like it to.

> I was considering getting more memory for the servers; however, based on these tests I cannot be certain Bro will not just run away with the new memory I give it in the same fashion :slight_smile:

At least memory upgrades are just about the cheapest and easiest upgrade you could possibly do with your system. :slight_smile:

  .Seth

> At least memory upgrades are just about the cheapest and easiest upgrade you could possibly do with your system. :slight_smile:

Very true indeed :slight_smile:

> Bro typically does consume quite a bit of memory and you're a bit tight on memory for the number of workers you're running.

Curious: what would you recommend for just Bro itself? Double or triple this?

> A common problem is that sometimes state tables grow too large. If you load misc/profiling, you should get a prof.log in each of your processes' spool directories (they don't log centrally like other logs). Every so often there will be an indication in there of the largest global variables, and you can look at that for a start to see if some script is just holding on to more state than you'd like it to.

I will definitely take a look, thanks for the info!

>> Bro typically does consume quite a bit of memory and you're a bit tight on memory for the number of workers you're running.
> Curious: what would you recommend for just Bro itself? Double or triple this?

It seems like most people just put 128G of memory in Bro boxes now, because the savings just aren't worth going any lower if there's even a remote possibility you might use it.

> I will definitely take a look, thanks for the info!

Feel free to ask again if you're having trouble. We really should write up some debugging documentation for this process sometime. Anyone with experience doing this memory debugging activity up for it? Doesn't have to be anything fancy, just the steps and various things to look at to figure out what exactly is happening.

  .Seth

Thanks Seth:

I'm not sure I have a license as an experienced Bro memory debugger; however, I will document what I've done here for folks in hopes it proves useful!

I’ve enabled profiling by adding the following.

vim /opt/bro/share/bro/site/local.bro
@load misc/profiling

Then applied the changes…

broctl stop
broctl install
broctl start

> Every so often there will be an indication in there of the largest global variables

Is this what you mean (taken from one worker)…?

1403803224.322453 Global_sizes > 100k: 0K
1403803224.322453 Known::known_services = 469K (3130/3130 entries)
1403803224.322453 Cluster::manager2worker_events = 137K
1403803224.322453 Weird::weird_ignore = 31492K (146569/146569 entries)
1403803224.322453 Known::certs = 58K (310/310 entries)
1403803224.322453 SumStats::threshold_tracker = 668K (4/2916 entries)
1403803224.322453 FTP::ftp_data_expected = 181K (46/46 entries)
1403803224.322453 Notice::suppressing = 595K (2243/2243 entries)
1403803224.322453 Communication::connected_peers = 156K (2/2 entries)
1403803224.322453 SumStats::sending_results = 8028K (3/5545 entries)
1403803224.322453 Software::tracked = 33477K (12424/31111 entries)
1403803224.322453 FTP::cmd_reply_code = 48K (325/325 entries)
1403803224.322453 SumStats::result_store = 27962K (5/19978 entries)
1403803224.322453 SSL::cipher_desc = 97K (356/356 entries)
1403803224.322453 RADIUS::attr_types = 44K (169/169 entries)
1403803224.322453 Weird::actions = 35K (163/163 entries)
1403803224.322453 Known::known_hosts = 3221K (21773/21773 entries)
1403803224.322453 Weird::did_log = 54K (287/287 entries)
1403803224.322453 SSL::recently_validated_certs = 8667K (24752/24752 entries)
1403803224.322453 Communication::nodes = 188K (4/4 entries)
1403803224.322453 SSL::root_certs = 204K (144/144 entries)
1403803224.322453 Global_sizes total: 116727K

1403803224.322453 Total number of table entries: 213548/260715
1403803239.322685 ------------------------
1403803239.322685 Memory: total=1185296K total_adj=1137108K malloced: 1144576K

Any other pointers on how to interpret this data?

FWIW, here are some additional statistics from the worker prof.log…

grep "Memory: " prof.log | awk ‘NR % 10 == 1’
0.000000 Memory: total=48188K total_adj=0K malloced: 47965K
1403802189.315606 Memory: total=614476K total_adj=566288K malloced: 614022K
1403802339.316381 Memory: total=938380K total_adj=890192K malloced: 938275K
1403802489.317426 Memory: total=1006168K total_adj=957980K malloced: 1003385K
1403802639.318199 Memory: total=1041288K total_adj=993100K malloced: 1035422K
1403802789.319107 Memory: total=1063544K total_adj=1015356K malloced: 1058229K
1403802939.320170 Memory: total=1140652K total_adj=1092464K malloced: 1139608K
1403803089.321327 Memory: total=1184540K total_adj=1136352K malloced: 1179411K
1403803239.322685 Memory: total=1185296K total_adj=1137108K malloced: 1144576K
1403803389.323680 Memory: total=1185296K total_adj=1137108K malloced: 1118961K
1403803539.324677 Memory: total=1185296K total_adj=1137108K malloced: 1092719K
1403803689.325763 Memory: total=1185296K total_adj=1137108K malloced: 1091447K

A small follow-up to this as well, since it may be relevant: I notice the timer count for stale connections seems to increase in parallel with memory…

grep 'ConnectionInactivityTimer' prof.log | awk 'NR % 10 == 1'
1403802069.314888 ConnectionInactivityTimer = 5844
1403802219.315759 ConnectionInactivityTimer = 21747
1403802369.316387 ConnectionInactivityTimer = 32275
1403802519.317613 ConnectionInactivityTimer = 32716
1403802669.318303 ConnectionInactivityTimer = 32597
1403802819.319193 ConnectionInactivityTimer = 34207
1403802969.320204 ConnectionInactivityTimer = 39176
1403803119.321978 ConnectionInactivityTimer = 40394
1403803269.323058 ConnectionInactivityTimer = 38631
1403803419.323688 ConnectionInactivityTimer = 35847
1403803569.324716 ConnectionInactivityTimer = 34432
1403803719.325888 ConnectionInactivityTimer = 34591
1403803869.326713 ConnectionInactivityTimer = 34716
1403804019.327664 ConnectionInactivityTimer = 35361
1403804169.329254 ConnectionInactivityTimer = 35915
1403804319.330507 ConnectionInactivityTimer = 34994
1403804469.331842 ConnectionInactivityTimer = 33212
1403804619.332236 ConnectionInactivityTimer = 32290
1403804769.332993 ConnectionInactivityTimer = 32513
1403804919.333717 ConnectionInactivityTimer = 32592
1403805069.334477 ConnectionInactivityTimer = 32388
1403805219.334875 ConnectionInactivityTimer = 32932
1403805369.335753 ConnectionInactivityTimer = 31771
1403805519.337054 ConnectionInactivityTimer = 28749
1403805669.337563 ConnectionInactivityTimer = 26509
1403805819.339240 ConnectionInactivityTimer = 26654
1403805969.340812 ConnectionInactivityTimer = 26297
1403806119.341841 ConnectionInactivityTimer = 25362
1403806269.344342 ConnectionInactivityTimer = 24435
1403806419.345146 ConnectionInactivityTimer = 24954
1403806569.346057 ConnectionInactivityTimer = 24088
1403806719.347671 ConnectionInactivityTimer = 30207
1403806869.349643 ConnectionInactivityTimer = 34276

Notice the steady increase, then slight decrease, then steady increase again. Is there a way to control these settings for performance testing purposes?

I know that while I was tuning Suricata I needed to be mindful of connection timeouts, and due to the volume of flows I am seeing I needed to be pretty aggressive.

Thanks,
Jason

Hi:

I believe this particular timer is a general timer used to track inactivity for all protocols (but someone can correct me if I’m wrong :). Maybe try tuning the following:

const tcp_inactivity_timeout = 5 min &redef;
const udp_inactivity_timeout = 1 min &redef;
const icmp_inactivity_timeout = 1 min &redef;

Reference: http://www.notary.icsi.berkeley.edu/sphinx-git/scripts/base/init-bare.bro.html#id-udp_inactivity_timeout

Also, I believe it’s possible to set timeouts per-connection based on properties of the established connections. For an example of doing this / how this might be useful, take a look at:

https://bro.org/sphinx/scripts/base/protocols/conn/inactivity.bro.html
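
If you want to experiment with per-connection timeouts directly, something along these lines ought to work (an untested sketch; the 30-second value and the SSH carve-out are just placeholders, so check the names against your version's documentation):

event connection_established(c: connection)
    {
    # Hypothetical policy: leave SSH sessions alone, but give everything
    # else a much tighter inactivity timeout than the global default.
    if ( c$id$resp_p != 22/tcp )
        set_inactivity_timeout(c$id, 30 secs);
    }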

Re: interpreting prof.log output – a few notes from my experience:

There should be lines that include the number of connections currently active for each major protocol type, e.g.:

Conns: tcp=1/130 udp=1/70 icmp=0/0

Syntax here is: tcp=<active>/<in state table> udp=<active>/<in state table> icmp=<active>/<in state table>

The line following the above includes more detailed connection overhead information:

Conns: total=6528 current=2/2 ext=0 mem=9312K avg=4656.0 table=24K connvals=6K

A few notes about fields that might be useful there:

  • total=total number of connections (aggregate: not just at this particular moment)
  • current=X/Y: X and Y are two counts that will usually differ to some extent, but both count the number of connections observed
      • X: the number of active connections in total (not necessarily all of which are kept in state tables)
      • Y: the number of connections stored in Bro's state tables (tcp + udp + icmp) at this moment in time
  • avg=average memory use (in bytes) per active connection
  • table=total amount of memory used by the state tables (tcp + udp + icmp)

‘avg’ and ‘table’ are only recorded occasionally because computing these values can be expensive. When that “Global_sizes …” output is included in a log entry, these numbers will be accurate. Otherwise, they will be 0.
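
If memory serves, the cadence of all this is controlled by a couple of redef-able tunables that misc/profiling sets when loaded; here is a rough sketch if you wanted to adjust them (names from memory, so verify them against init-bare.bro / misc/profiling.bro first):

# How often a regular entry is written to prof.log.
redef profiling_interval = 30 secs;

# The expensive measurements (Global_sizes, avg, table, ...) are only taken
# every Nth profiling interval.
redef expensive_profiling_multiple = 10;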

For an idea of the overhead associated with the Timer objects themselves (read: the overhead for the timers isn’t included in the overhead computed for the connection state), take a look at the line that looks something like:

Timers: current=19 max=19 mem=1K lag=0.00s

  • current=number of timers currently active in total
  • max=maximum number of timers ever active at once
  • mem=total memory consumed by all of the currently active timers (usually pretty small compared to other things, though)

Also, one other note: under ‘Threads’, there’s a bunch of lines that look something like:

http/Log::WRITER_ASCII in=11318 out=10882 pending=0/0 (#queue r/w: in=11318/11318 out=10882/10882)
ssl/Log::WRITER_ASCII in=10931 out=10878 pending=0/0 (#queue r/w: in=10931/10931 out=10878/10878)
files/Log::WRITER_ASCII in=10989 out=10792 pending=0/0 (#queue r/w: in=10989/10989 out=10792/10792)
dhcp/Log::WRITER_ASCII in=1031 out=1029 pending=0/0 (#queue r/w: in=1031/1031 out=1029/1029)

Generally, pending X/Y describes how much memory is currently being consumed (relatively speaking) by messages waiting to be written to a log file, or messages that have been read from an input source but not yet processed by Bro.

A pending X/Y that grows steadily over time is an indication that Bro could eventually run out of room to store outstanding log / input framework messages, and that these messages could eventually come to consume so much memory that the worker would thrash the machine into sweet digital oblivion.

Hope something in there is useful,
Gilbert

Thanks Gilbert!

I think I am getting close to at least isolating the issue.

I redefined some of the inactivity timeout values to something pretty aggressive…
redef tcp_inactivity_timeout = 15 sec;
redef udp_inactivity_timeout = 5 sec;
redef icmp_inactivity_timeout = 5 sec;

After committing the changes and restarting I am still seeing the same kind of slow memory consumption behavior.

I checked the I/O statistics you mentioned above and think this is where I am getting backed up. Below is a brief look at the escalation for just the HTTP logs themselves.

grep 'http/Log::WRITER_ASCII' prof.log | awk 'NR % 10 == 1'
1403880551.747191 http/Log::WRITER_ASCII in=40 out=11 pending=0/0 (#queue r/w: in=40/40 out=11/11)
1403880701.759282 http/Log::WRITER_ASCII in=632 out=160 pending=0/0 (#queue r/w: in=632/632 out=160/160)
1403880851.764553 http/Log::WRITER_ASCII in=1254 out=310 pending=0/0 (#queue r/w: in=1254/1254 out=310/310)
1403881001.794827 http/Log::WRITER_ASCII in=1881 out=459 pending=0/0 (#queue r/w: in=1881/1881 out=459/459)
1403881151.907771 http/Log::WRITER_ASCII in=2496 out=607 pending=0/0 (#queue r/w: in=2496/2496 out=607/607)
1403881302.133110 http/Log::WRITER_ASCII in=3140 out=754 pending=0/0 (#queue r/w: in=3140/3140 out=754/754)
1403881452.684259 http/Log::WRITER_ASCII in=3781 out=900 pending=0/0 (#queue r/w: in=3781/3781 out=900/900)
1403881611.446692 http/Log::WRITER_ASCII in=4321 out=1000 pending=0/0 (#queue r/w: in=4321/4321 out=1000/1000)
1403881783.945970 http/Log::WRITER_ASCII in=4816 out=1069 pending=0/0 (#queue r/w: in=4816/4816 out=1069/1069)
1403881991.154812 http/Log::WRITER_ASCII in=5435 out=1105 pending=0/0 (#queue r/w: in=5435/5435 out=1105/1105)
1403882156.814938 http/Log::WRITER_ASCII in=6066 out=1190 pending=0/0 (#queue r/w: in=6066/6066 out=1190/1190)
1403882336.215055 http/Log::WRITER_ASCII in=6690 out=1267 pending=0/0 (#queue r/w: in=6690/6690 out=1267/1267)
1403882494.089058 http/Log::WRITER_ASCII in=7350 out=1377 pending=0/0 (#queue r/w: in=7350/7350 out=1377/1377)

If I am interpreting this correctly, as time goes on I am far exceeding my ability to write out logs, presumably resulting in a backup of that data in memory. The same kind of behavior is seen in other log types as well.

Am I interpreting this correctly? If so the solution seems to be I need faster drives and/or more memory :slight_smile:

Hi Jason:

It's only when pending begins to grow that there's a large queue of messages waiting to be written (as far as I know). In this case, pending stays at 0/0 for each of those log updates, so I don't think that's an issue (for this log, at least) :slight_smile:

One possibility here is that Bro is actually leaking memory somewhere, and you're lucky enough to have found a bug :slight_smile: Usually this would show up as a steady rise in memory usage over time with a relatively constant traffic volume / mix. One thing to try (if it's something you feel comfortable with) might be to pull down the current copy of Bro from master, build that, and push it out to the nodes to see what kind of an effect that has (if anything).

Another thing to try might be to start tweaking which scripts are actually loaded in local.bro: comment out all the @load statements in local.bro (add a '#' to the beginning of each line) and see what kind of an effect that has on memory utilization. Assuming memory usage drops, you can then slowly start removing the # characters / restarting Bro to re-load scripts one at a time. This is going to be pretty tedious, though.
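
For example, a stripped-down local.bro might wind up looking something like this while testing (the script names here are just illustrative; use whatever your local.bro actually loads):

# @load misc/loaded-scripts
# @load misc/capture-loss
# @load protocols/ssl/validate-certs
# @load protocols/ssh/detect-bruteforcing
@load misc/profiling    # keep profiling enabled while debugging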

A third thing to try might be doing a bit of sampling so that Bro only sees some percentage of incoming packets. From there, slowly start to bring traffic back to see how Bro's memory utilization rises as traffic is restored. You might want to define a few different levels (drop 90% of connections, drop 75%, drop 50%, drop 25%) and just leave the drop rate at each step for a while (read: a few hours would probably be a good start) to see what happens to memory utilization. If memory utilization stays pretty constant at each level, and the overall memory pattern ends up looking like a staircase, it might be time to consider that RAM upgrade :slight_smile:

*** Note that it’s important that traffic be connection-sampled when trying the above: packet-sampling will lead to dropped packets in the middle of connections, which might not work quite as expected. Seth: how does connection sampling work for the packet filter framework? I haven’t ever really used it, but I think it can do that, right? Also, any other ideas / things I might be missing here?

One other question: did adjusting those timeouts change the number of inactivity timers reported? Probably not relevant to this issue, but just wondering if the change had any measurable effect.

-Gilbert

Thanks again Gilbert.

I put in an aggressive BPF filter that eliminates 90% of traffic and memory flatlined, so I do not believe there to be a memory leak.

I then removed the old restrictive filter and replaced it with a broader one that nixes out multicast and broadcast traffic (a rough sketch of the exclusion is at the bottom of this message). With that in place, I noticed the ConnectionInactivityTimer count is actually much more stable.

grep 'ConnectionInactivityTimer' prof.log | awk 'NR % 10 == 1'
1404155766.423613 ConnectionInactivityTimer = 4412
1404155916.424489 ConnectionInactivityTimer = 4776
1404156066.425171 ConnectionInactivityTimer = 4627
1404156216.426120 ConnectionInactivityTimer = 4974
1404156366.426889 ConnectionInactivityTimer = 4784
1404156516.428065 ConnectionInactivityTimer = 4750
1404156666.429125 ConnectionInactivityTimer = 4687
1404156816.431119 ConnectionInactivityTimer = 5006
1404156966.431837 ConnectionInactivityTimer = 4830

So far with multicast and broadcast traffic being filtered out, I have noticed memory still gradually increasing, but at a much slower rate. It has not come close to using all resources yet, but I will check again tomorrow. It would seem that the prescription for more RAM needs to be written based on the observables collected so far. Many thanks for the helpful tips.
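
(For reference, the broadcast/multicast exclusion itself is just a BPF expression; in Bro script it can be expressed roughly like the snippet below, assuming the packet-filter framework's redef-able restrict_filters table is the right knob in your version. Adjust for however you normally apply filters.)

# Exclude broadcast and multicast traffic before Bro processes it.
redef restrict_filters += { ["no-bcast-mcast"] = "not broadcast and not multicast" };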

I wanted to circle back on this real quick because after doing a little more poking into this matter I believe I have found the root cause of my issues.

What was happening ultimately is that the high memory usage I was seeing was due to large amounts of memory, previously allocated to (presumably) Bro processes, becoming inactive. Inactive memory is memory that was previously allocated to some process that is no longer running. While this pool of memory is one the OS will go back to when it needs more memory (to accommodate other processes), it is NOT included in 'free memory'.

For example…

cat /proc/meminfo
MemTotal: 49376004 kB
MemFree: 1909988 kB
Buffers: 231036 kB
Cached: 17096308 kB
SwapCached: 75124 kB
Active: 21040696 kB
Inactive: 16141408 kB
Active(anon): 17410144 kB
Inactive(anon): 2445380 kB
Active(file): 3630552 kB
Inactive(file): 13696028 kB

Very little free memory (relatively speaking) here after running Bro on a system that sees high volumes of traffic over the weekend.

This memory can be freed however, and one need not reboot the server to make this happen…

# clear the page cache, dentries and inodes (run as root)

sync && echo 3 > /proc/sys/vm/drop_caches

cat /proc/meminfo
MemTotal: 49376004 kB
MemFree: 19784904 kB
Buffers: 3316 kB
Cached: 135400 kB
SwapCached: 75120 kB
Active: 17439080 kB
Inactive: 2554152 kB
Active(anon): 17410136 kB
Inactive(anon): 2445380 kB
Active(file): 28944 kB
Inactive(file): 108772 kB

Now that’s more like it :slight_smile:

At the end of the day I am not sure if this is something we need to be concerned about, since the OS will allocate memory from the 'inactive' pool as well as the 'free' pool as warranted. However, if you are running any monitoring apps that complain when system resources hit a certain threshold, this may certainly be something to consider.

Of course, in my mind anyone who is experiencing issues similar to mine and is curious ought to be aware of this.

Helpful link explaining:
http://www.tinylan.com/article/how-to-clear-inactive-memory-in-Linux

One note on performance: at least one article I have found explains how clearing this may actually be detrimental.

http://apple.stackexchange.com/questions/67031/isnt-inactive-memory-a-waste-of-resources

However, there are offerings out there that claim to free inactive memory under the banner of performance optimization.

https://www.youtube.com/watch?v=qd9V9eSdjLQ

I am not really sure what the case is with Bro, but I am interested if anyone cares to weigh in on that point. Is it worth setting up a cron job to run the command above every so often?

Many thanks,
Jason

Sounds like this is what you are talking about:

http://www.linuxatemyram.com/