I am having an issue with Bro and memory exhaustion. Currently I'm using Click on a system with 8 CPU cores to break up a network tap into three virtual interfaces (tap0, tap1, and tap2). I'm then running my Bro cluster on the same machine, with three workers operating on different CPU cores and virtual interfaces. The system has 16GB of physical RAM. After running for about 24 hours, all of the physical RAM is exhausted and Bro begins to go after swap. I increased swap to 8GB, but this is a never-ending battle, as Bro will eventually eat everything it can find and crash the system.
How do I go about diagnosing which scripts/policies are causing this, or whether it is an internal memory leak somewhere? I have seen references to reduce-memory.bro and profile.bro in Wiki and mailing list searches, but these don't appear to be in the current 1.5.1 release.
I am running a large number of scripts from Seth Hall’s script repository in addition to the ones that are enabled by default. Below are the policies I’m loading in local.bro:
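(Abridged; the full list is a few dozen lines, and some of the names below are approximate:)

    @load http-ext
    @load http-ext-identified-files
    @load smtp-ext
    @load dns-passive-replication
    @load logging.http-ext
    @load logging.smtp-ext
    # ...plus the rest of the ext and logging. scripts, on top of the
    # stock policies that are enabled by default.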
I'm having a similar problem, but it usually takes about 4 days to get that bad here. I've been considering just going back to restarting Bro in the middle of the night like I did before I installed broctl, when that was the easiest way to rotate the logs every day.
Hi all, following up on this..
4 days ago I merged my Bro policy with the latest updates from Seth, and since then the memory usage on my Bro machine has flatlined (it used to look like a sawtooth wave from being restarted all the time). I'm not sure what the cause was, but my guess is something to do with the HTTP file identification. The latest version of the script uses Bro signatures instead of libmagic for file identification. I wonder if the libmagic code has a memory leak in it somewhere?
If you are still having memory problems, it would be really interesting to see if updating fixes things for you as well.
I synced my scripts up with the latest and greatest from Seth's repository, but am still seeing Bro consume all 16GB of memory after only an hour or two. When time permits, I will try to debug further to see if I can narrow it down to a particular script/policy.
I forgot to mention: the name of the policy for the file detection changed. Are you still loading http-identified-files or are you loading http-ext-identified-files?
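In local.bro terms, the rename is just:

    # old name:
    # @load http-identified-files
    # new name:
    @load http-ext-identified-files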
If it's using that much memory that quickly, my guess would be that there is a state table growing out of control. Load the "profiling" script; it will print out the sizes of all globals every 20 minutes or so to a file named prof.log, and then you'll be able to see which variable(s) are so huge.
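If you want the dumps more or less often, the knobs can be tuned from local.bro; something like this should work (the values here are examples, not the defaults):

    @load profiling

    # Take a lightweight profile every 5 minutes, and write the full
    # (expensive) profile, including the global sizes, on every 4th one.
    redef profiling_interval = 5 min;
    redef expensive_profiling_multiple = 4;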
If you are able to find out what variable is causing the memory consumption issue, please reply and let us know. It may be an issue that needs to be resolved or at least addressed in some way.
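If it does turn out to be a script-level table, the usual fix is to give the table an expiration attribute so entries age out on their own. A toy example of the pattern (the variable is hypothetical, not one from the actual scripts):

    # Entries are deleted an hour after they were last read.
    global requests_per_host: table[addr] of count &default=0 &read_expire=1 hr;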
Instead of loading each of the "logging." scripts, you could just load enable-ext-logging at the top.
Oh! I think I just noticed your problem (and it's my fault!). Remove dns-passive-replication.bro from your list of scripts and I think your memory problems will go away. The two dns scripts need work still. I may merge the two together at some point, but they don't clean up after themselves very well yet and they *do* cause bad memory consumption problems. Sorry about that! I really need to get all of the documentation written for my scripts.
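In other words, just comment it out of local.bro:

    # Disabled for now; it doesn't clean up its state tables yet.
    # @load dns-passive-replication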
I just moved both of the DNS scripts into the testing/ directory to clear up any confusion about their stability. When I get time to make them better about memory, I'll move them back to the main directory.
Thanks. I'm now running without the DNS scripts and have profiling enabled; I will see how it goes. Right now Bro is using about 4.5GB between the manager, proxy, and my three workers (all running on the same system, with Click splitting up the tap). I was restarting each day at 1am, but I have commented out the cron job. I'll check it in the morning and see whether things are cleaning up after themselves.
My memory consumption is better but still growing and not shrinking. I've been examining the globals in the prof.log files for each of the various components (workers, manager, etc.) but am not sure what is causing so much memory to be allocated. Below is an example from one of my workers. There is ~3.6GB of memory allocated, total, but the globals are only 214MB. This is replicated across my three workers... plus the memory being used by the manager and proxy... so grand total I'm now up to ~12GB of allocated memory and it continues to grow.
Mar 19 13:42:20 ------------------------
Mar 19 13:42:20 Memory: total=3821576K total_adj=3765668K malloced: 3814486K
Mar 19 13:42:20 Run-time: user+sys=55905.4 user=53574.3 sys=2331.1 real=99872.4
Mar 19 13:42:20 Conns: total=12370755 current=4998/859 ext=0 mem=3372528K avg=3926.1 table=3430K connvals=2328K
Mar 19 13:42:20 ConnCompressor: pending=36 pending_in_mem=582 full_conns=-4895 pending+real=4175 mem=48K avg=1368.7/84.7
Mar 19 13:42:20 Conns: tcp=0/0 udp=844/1984 icmp=15/50
Mar 19 13:42:20 TCP-States: Inact. Syn. SA Part. Est. Fin. Rst.
Mar 19 13:42:20 TCP-States:Inact. 76 2 4
Mar 19 13:42:20 TCP-States:Syn.
Mar 19 13:42:20 TCP-States:SA
Mar 19 13:42:20 TCP-States:Part. 12 755 1 26
Mar 19 13:42:20 TCP-States:Est. 2412 98 7
Mar 19 13:42:20 TCP-States:Fin. 8 5 62 416 3
Mar 19 13:42:20 TCP-States:Rst. 95 12 90 54 1
Mar 19 13:42:20 Connections expired due to inactivity: 2012770
Mar 19 13:42:20 Total reassembler data: 236K
Mar 19 13:42:20 RuleMatcher: matchers=2 dfa_states=599 ncomputed=9765 mem=1309K avg_nfa_states=19
Mar 19 13:42:20 Timers: current=12852 max=19240 mem=1004K lag=0.00s
Mar 19 13:42:20 ConnectionDeleteTimer = 590
Mar 19 13:42:20 ConnectionInactivityTimer = 6874
Mar 19 13:42:20 DNSExpireTimer = 385
Mar 19 13:42:20 NetworkTimer = 1
Mar 19 13:42:20 NTPExpireTimer = 60
Mar 19 13:42:20 RotateTimer = 35
Mar 19 13:42:20 ScheduleTimer = 840
Mar 19 13:42:20 TableValTimer = 79
Mar 19 13:42:20 TCPConnectionAttemptTimer = 255
Mar 19 13:42:20 TCPConnectionExpireTimer = 3733
Mar 19 13:42:20 Global_sizes > 100k: 0K
Mar 19 13:42:20 SSH::did_ssh_version = 24K (109/109 entries)
Mar 19 13:42:20 Login::login_sessions = 122K (140/140 entries)
Mar 19 13:42:20 SMTP::smtp_sessions = 973K (17/17 entries)
Mar 19 13:42:20 KnownServices::established_conns = 191K (386/386 entries)
Mar 19 13:42:20 ssl_cipher_desc = 30K (106/106 entries)
Mar 19 13:42:20 dpd_analyzer_ports = 128K (35/700 entries)
Mar 19 13:42:20 Scan::rops_idx = 39K (171/171 entries)
Mar 19 13:42:20 notice_tags = 262K (690/690 entries)
Mar 19 13:42:20 KnownHosts::known_hosts = 1861K (14160/14160 entries)
Mar 19 13:42:20 Login::output_trouble = 399K
Mar 19 13:42:20 DNS::distinct_PTR_requests = 481K (648/648 entries)
Mar 19 13:42:20 Scan::distinct_ports = 5880K (5376/20084 entries)
Mar 19 13:42:20 HTTP::http_sessions = 9018K (1697/1697 entries)
Mar 19 13:42:20 ssl_connections = 2436K (905/905 entries)
Mar 19 13:42:20 ftp_cmd_reply_code = 40K (273/273 entries)
Mar 19 13:42:20 Weird::weird_ignore = 99K (94/188 entries)
Mar 19 13:42:20 DNS::distinct_answered_PTR_requests = 45K (145/145 entries)
Mar 19 13:42:20 SMTP::reject_counter = 5115K (9475/9475 entries)
Mar 19 13:42:20 Scan::distinct_backscatter_peers = 269K (126/724 entries)
Mar 19 13:42:20 DetectProtocolHTTP::conns = 438K (470/940 entries)
Mar 19 13:42:20 HTTP::sql_injection_regex = 603K
Mar 19 13:42:20 Scan::accounts_tried = 94K (96/222 entries)
Mar 19 13:42:20 Portmapper::rpc_programs = 35K (129/129 entries)
Mar 19 13:42:20 HTTP::known_user_agents = 10475K (8027/29020 entries)
Mar 19 13:42:20 Scan::possible_scan_sources = 14K (106/106 entries)
Mar 19 13:42:20 IRC::active_channels = 334K (47/47 entries)
Mar 19 13:42:20 ssl_sessionIDs = 117981K (27276/27276 entries)
Mar 19 13:42:20 FTP::hot_files = 112K
Mar 19 13:42:20 Scan::pre_distinct_peers = 31560K (35230/72640 entries)
Mar 19 13:42:20 HTTP::sensitive_URIs = 519K
Mar 19 13:42:20 DetectProtocolHTTP::protocols = 278K (7/7 entries)
Mar 19 13:42:20 Scan::distinct_low_ports = 89K (98/196 entries)
Mar 19 13:42:20 IRC::active_users = 525K (96/96 entries)
Mar 19 13:42:20 Scan::scan_triples = 7386K (106/17547 entries)
Mar 19 13:42:20 Software::host_software = 9502K (5079/10272 entries)
Mar 19 13:42:20 DNS::dns_sessions = 1011K (629/629 entries)
Mar 19 13:42:20 Scan::distinct_peers = 4584K (571/30458 entries)
Mar 19 13:42:20 HTTP::suspicious_http_posts = 733K
Mar 19 13:42:20 KnownServices::known_services = 42K (261/261 entries)
Mar 19 13:42:20 Login::input_trouble = 108K
Mar 19 13:42:20 Weird::weird_action = 39K (170/170 entries)
Mar 19 13:42:20 HTTP::conn_info = 3007K (759/759 entries)
Mar 19 13:42:20 Global_sizes total: 219225K
Mar 19 13:42:20 Total number of table entries: 115411/243104
Mar 19 13:42:35 ------------------------
Your issue has been nagging me for the past few days because I couldn't explain why your memory use is so high. Today I finally realized what it could be: did you provide the '--enable-brov6' flag when you built Bro? Even better, could you provide the full configure line you used when you built Bro? (It's in the config.log file in the directory extracted from the tar.gz.)
Yes, I did include '--enable-brov6', because we are getting ready to roll out IPv6 in our perimeter and I was also seeing messages from Bro (via "broctl diag") that it was not compiled with IPv6 support.
Rebuild Bro without brov6 and int64 for now (i.e., leave '--enable-brov6' and the int64 option off your configure line). Currently, when you enable IPv6, all IP addresses consume 128 bits of memory, even IPv4 addresses! You can see that this is what's happening by looking at the line in your prof.log that starts with "Conns:": it indicates that the memory consumed just by connection state is over 3GB (3372528K).
There has been talk about changing things around so that IPv4 addresses still take up only 32 bits of memory even when IPv6 is enabled, but I don't know where those discussions ended up, nor how difficult a change that would be to make. Maybe Robin or Vern will comment on that?
The IPv6 code has not been tested all that well either, so it's also possible that there are some memory leaks or other bugs lurking that could lead to high memory use.
I recompiled without IPv6 and int64 today and so far my memory footprint is considerably lower, as expected. I will keep an eye on it over the next few days (I have disabled my nightly restart cron) and see how it behaves.
We have just brought IPv6 to our border router and will soon be testing it in the perimeter. Hopefully, by the time we get anywhere close to widespread usage, Bro will have better support for it. Wishful thinking, huh?
Seth, I wanted to circle back around on this. This was definitely the issue, as my memory usage has now flatlined. I have not restarted Bro in 4 days, and my total memory usage is < 3GB for all workers, proxy, and manager combined.
Awesome!
Thanks for the help.
No problem, I'm glad that helped. I'm taking a look at some of the IPv6 code now to see if there is anything I can do to help reduce memory usage because I'd also like to be able to run Bro with IPv6 enabled.