Manager swapping

Hey all,

We have the logger and the manager running on the same node; it started consuming all of the swap, and the Bro logs in the current directory stopped rotating.

We ran into this type of issue before when running Bro 2.4, and it turned out that moving the proxies to the worker nodes solved the high load on the manager, and things started working normally.

Now we have all the proxies on the worker nodes (4 in total) and the logger is running on the same node as the manager, so my guess is that this might be causing the high load on the manager.

The Bro processes on the manager are really big:

   PID USER PR NI    VIRT    RES  SHR S %CPU %MEM     TIME+ COMMAND
104772 bro  20  0 24.926g 0.017t 1300 S 45.7 25.0   4542:04 bro
125346 bro  20  0  0.221t 0.027t 3444 S 40.4 39.4 187:28.80 bro
125366 bro  25  5 1510856 275516  728 R 40.1  0.4 222:22.58 bro
104776 bro  25  5  540736 228920  360 S  8.9  0.3 893:42.05 bro

Also, the free -g output looks like this:

$ free -g
       total  used  free  shared  buff/cache  available
Mem:      70    47     0       0          22         21
Swap:      7     7     0

The next thing I am going to try is to disable logging for some of the protocols (I don't know how much it will help) and restart Bro.
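Concretely, I am thinking of putting something along these lines into local.bro (just a sketch; the exact set of log streams to disable is still to be decided):

event bro_init()
	{
	# Stop writing these logs; the analyzers keep running,
	# only the log output is suppressed.
	Log::disable_stream(RDP::LOG);
	Log::disable_stream(Syslog::LOG);
	Log::disable_stream(SNMP::LOG);
	}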

Any other suggestions/best practices to follow to avoid this situation in the future? (I'm really not looking forward to the quick-and-dirty fix of restarting Bro whenever this happens :slight_smile: )

Also, I have proper ethtool settings (tso off gso off gro off rx off tx off sg off) on the manager as well (as suggested in some of the posts for better performance).

Thanks,
Fatema.

I was just brainstorming: could multi-threading be used for the logger as well, just like worker threads?
Since a single Bro logger process is becoming so big, why not distribute the workload across multiple logger processes?
Is it possible to do, and would it impact the manager on the same node?
Has anybody tried that?

The Bro processes on the manager are really big:

   PID USER PR NI    VIRT    RES  SHR S %CPU %MEM     TIME+ COMMAND
104772 bro  20  0 24.926g 0.017t 1300 S 45.7 25.0   4542:04 bro
125346 bro  20  0  0.221t 0.027t 3444 S 40.4 39.4 187:28.80 bro
125366 bro  25  5 1510856 275516  728 R 40.1  0.4 222:22.58 bro
104776 bro  25  5  540736 228920  360 S  8.9  0.3 893:42.05 bro

Which process is which in this output? Can you use broctl top manager logger instead? That will show which process is which along with the CPU/memory usage. How to troubleshoot the issue depends a lot on whether it is the manager process or the logger process causing the problems.

Also, the free -g output looks like this:
$ free -g
       total  used  free  shared  buff/cache  available
Mem:      70    47     0       0          22         21
Swap:      7     7     0

Looks like you have some headroom there, but not much.

The next thing I am going to try is to disable logging for some of the protocols (I don't know how much it will help) and restart Bro.

Well, if it's the logger node, reducing the log volume can help. Depending on how many CPU cores you have, one thing that can help is using logging filters to split logs out into multiple files. That lets the logger node dedicate more threads to writing logs.
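As a rough sketch of that idea (the filter name and the local/remote split are just illustrative), something like this splits conn.log into two output files, each of which gets its own writer thread on the logger:

function split_conn_path(id: Log::ID, path: string, rec: Conn::Info): string
	{
	# Locally-originated connections go to conn_local.log,
	# everything else to conn_remote.log.
	if ( rec?$local_orig && rec$local_orig )
		return "conn_local";
	return "conn_remote";
	}

event bro_init()
	{
	# Swap the default conn.log filter for the splitting one.
	Log::remove_default_filter(Conn::LOG);
	Log::add_filter(Conn::LOG, [$name="conn-split", $path_func=split_conn_path]);
	}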

Any other suggestions/best practices to follow to avoid this situation in the future? (I'm really not looking forward to the quick-and-dirty fix of restarting Bro whenever this happens :slight_smile: )

Also, I have proper ethtool settings (tso off gso off gro off rx off tx off sg off) on the manager as well (as suggested in some of the posts for better performance).

That shouldn't really matter on the manager, but it can't hurt.

I was just brainstorming: could multi-threading be used for the logger as well, just like worker threads?
Since a single Bro logger process is becoming so big, why not distribute the workload across multiple logger processes?
Is it possible to do, and would it impact the manager on the same node?
Has anybody tried that?

The logger does distribute the work across multiple threads, but it has a central component that has to receive all the messages.

Someone else on the mailing list was having issues with logger scaling, and I pointed them to the parts of broctl that need to be tweaked to let you run multiple logger nodes. If you're currently using something like the Kafka log writer, or something like Logstash to ship Bro logs off to another system, it will more or less work. The 'issue' is that you end up with two log directories, each containing the logs from half the workers. As long as you have something else that can merge them back together and correlate everything, that's not a problem.
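For reference, the node.cfg side of that setup would presumably look something like this (hostnames invented); keep in mind that stock broctl only drives a single logger, so it needs the tweaks mentioned above:

[logger-1]
type=logger
host=logger1.example.com

[logger-2]
type=logger
host=logger2.example.com

# ...plus the usual manager, proxy, and worker sections.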

Hopefully multiple logger nodes can be supported officially at some point.

And right after I sent this, I see that Daniel has a branch of broctl with the initial changes needed to make this work.

Thanks Justin for the input :slight_smile:

I restarted Bro yesterday afternoon after disabling logging for some of the protocols (like RDP, syslog, SNMP, etc.), as the machine is in production and needed to be fixed “ASAP”. Hence I didn't get a chance to run broctl top while the issue was occurring. I know you have mentioned a couple of times in the past to use “broctl top” instead of the normal “top”, but somehow I keep forgetting to do that. I think I should put together my own Bro troubleshooting guide listing the basic commands you guys suggest in these emails :slight_smile:

Anyway, I did run the command today, and it looks like the manager process is the one that is overwhelmed. I had assumed it would be the logger having trouble catching up with the load, but I was wrong:

$ sudo -u bro /usr/local/bro/2.5/bin/broctl top manager logger
Name     Type     Host  Pid    Proc    VSize  Rss   Cpu  Cmd
logger   logger   IDS   60928  parent  2G     90M   17%  bro
logger   logger   IDS   60932  child   522M   246M  5%   bro
manager  manager  IDS   60990  child   1G     257M  35%  bro
manager  manager  IDS   60973  parent  222G   31G   23%  bro

It makes me wonder if there is some memory leak issue with the manager.

Thanks,

Fatema.

Are you loading misc/detect-traceroute or misc/scan in your local.bro?

Nope. Based on our previous discussion in another thread, I disabled misc/scan and loaded the scan-NG script instead. I always thought that the scripts would put more load on the workers than on the manager; when I was seeing memory issues on the workers, I stopped using misc/scan and switched to scan-NG. I didn't know that it would impact manager performance as well, hmm.
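(For context, the relevant local.bro lines look roughly like this; scan-NG-master is just the directory name the package was unpacked under in site/ here:)

#@load misc/scan          # stock scan detection, disabled
@load scan-NG-master      # scan-NG, unpacked under site/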

Try disabling the SSL/TLS cert verification. I'm not sure why, but that helped; with it enabled, the manager would slowly climb to massive memory usage. Now it works fine for one or two weeks before unexpectedly using all memory.

#@load protocols/ssl/validate-certs

Good:

Name      Type     Host      Pid   Proc    VSize  Rss   Cpu   Cmd
logger-1  logger   10.1.1.1  6241  parent  701M   163M  20%   bro
logger-1  logger   10.1.1.1  6261  child   458M   69M   3%    bro
manager   manager  10.1.1.1  6345  child   510M   377M  100%  bro
manager   manager  10.1.1.1  6292  parent  890M   804M  24%   bro

Bad:

Name      Type     Host      Pid    Proc    VSize  Rss   Cpu  Cmd
logger-1  logger   10.1.1.1  52731  parent  1G     806M  0%   bro
logger-1  logger   10.1.1.1  52951  child   8G     8G    0%   bro
manager   manager  10.1.1.1  53127  child   1G     742M  0%   bro
manager   manager  10.1.1.1  52979  parent  1573G  100G  0%   bro

Thanks Sanjay for the suggestion. I already have @load protocols/ssl/validate-certs disabled in local.bro. :slight_smile:

I was looking into the reporter logs and I see some entries like these:

Some INFO logs:

1490288453.884071 Reporter::INFO Got counters: [new_conn_counter=4394103, is_catch_release_active=7433937, known_scanners_counter=0, not_scanner=2439888, darknet_counter=64358, not_darknet_counter=3114626, already_scanner_counter=0, filteration_entry=0, filteration_success=1543038, c_knock_filterate=3548445, c_knock_checkscan=0, c_knock_core=0, c_land_filterate=22317, c_land_checkscan=0, c_land_core=0, c_backscat_filterate=3548445, c_backscat_checkscan=0, c_backscat_core=0, c_addressscan_filterate=3548445, c_addressscan_checkscan=0, c_addressscan_core=0, check_scan_counter=0, worker_to_manager_counter=0, run_scan_detection=0, check_scan_cache=1543038, event_peer=worker-1-15] manager

1490288454.925040 Reporter::INFO known_scanners_inactive: [scanner=94.51.38.120, status=T, detection=KnockKnockScan, detect_ts=1490202054.11266, event_peer=manager, expire=F] manager
1490288454.925040 Reporter::INFO known_scanners_inactive: [scanner=171.249.5.188, status=T, detection=KnockKnockScan, detect_ts=1490202053.07045, event_peer=manager, expire=F] manager

And these error logs:

0.000000 Reporter::ERROR field value missing [Scan::geoip_info$country_code] /usr/local/bro/2.5/share/bro/site/scan-NG-master/scripts/./scan-summary.bro, line 292
0.000000 Reporter::ERROR value used but not set (Scan::c_landmine_scan_summary) /usr/local/bro/2.5/share/bro/site/scan-NG-master/scripts/./check-landmine.bro, line 33
0.000000 Reporter::ERROR value used but not set (Scan::c_landmine_scan_summary) /usr/local/bro/2.5/share/bro/site/scan-NG-master/scripts/./check-landmine.bro, line 33

Are they related to the issue in any way?

Thanks,
Fatema.

The issue got resolved. :slight_smile:
I rebuilt Bro on the cluster with tcmalloc, for more efficient memory usage, and that seems to have resolved the heavy memory usage on the manager.
After that, with the scan scripts disabled in the cluster, the memory usage dropped to ~5%; with them enabled, it hovers around ~25% (i.e. ~25-28GB on the manager) and around ~31GB on the workers. So far the cluster seems to be stable with regard to memory usage.
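For anyone wanting to do the same, the rebuild was roughly along these lines (assuming the gperftools package, which provides tcmalloc, is installed; on Linux the configure script picks it up when present, and the prefix matches where our Bro lives):

# e.g. on CentOS/RHEL: yum install gperftools gperftools-devel
./configure --prefix=/usr/local/bro/2.5 --enable-perftools
make && sudo make install
sudo -u bro /usr/local/bro/2.5/bin/broctl install
sudo -u bro /usr/local/bro/2.5/bin/broctl restart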

Thank you all for the help in resolving the issue. Appreciate it :slight_smile:

-Fatema.

Oh, that is interesting.

Just to check - this was on 2.5?

Would you potentially be up for a little bit of digging to see what is
causing this? I am not aware of anyone encountering this problem before,
and I really would like this not to happen :slight_smile:

If you are ok with it, I will supply you with an instrumented version of
the validation script that outputs a bit of debugging information to help
me check what is going on here.

Johanna