BROKER + CLUSTER - stuck (Mike Dopheide)

However when I try to run broctl deploy, its stuck on “checking
configurations …” and never finish executing.

I have exact same issue, but not using any part of Broker, hence no “exit_only_after_terminate”
flag solution for me.
The only work around I have currently on our Bro cluster is to use:
broctl install
broctl restart
instead of directly using “broctl deploy”, because for some reason it just hangs on the “checking configuration…” and never finishes, and hence eventually I have to kill the process. :frowning:

$ /usr/local/bro/2.5/bin/broctl deploy
checking configurations …
^C
$

Just a clarification, “exit_only_after_terminate” doesn’t depend on Broker, it could just be buried in a script picked up from someone else as it’s very common for debugging.

-Dop

Thanks Mike for clarification. Does it means I can use it and redef it in my local.bro ?
I was thinking is there anyone else encountering same issues with deploy cmd, and
if it can be added to Bro documentation on how to solve the hanging issue.
Because thinking of it as a normal user perspective: I upgrade the Bro cluster to 2.5, and when use broctl deploy, it just hangs there, while user having no clue why it would happen.

Thanks,
Fatema.

I used to see this issue on heavily overloaded clusters, but haven't been able to reproduce it in a while. All check does is run a sc

One thing you could try is changing the check.bro script that comes with broctl to do

terminate();

instead of

terminate_communication();

I think Daniel also had a branch of broctl that just used bro -a instead of the check.bro script.

Thanks Justin for the input!
Yeah, you are right, tested the deploy cmd on a standalone node, and it does not hang there.
I will test out the check.bro suggestions on the prod cluster.

The cluster nodes use an average of ~30-35Gigs of memory (having ~125G in total)
And the capture loss also doesn’t report any loss i.e 0.025% etc
Hence thought that the nodes were doing Ok, not sure if they are getting loads of traffic and hence getting overloaded.

Also, I have noticed that when doing a restart on the cluster, it takes longer now (in 2.5) than it used to take when running the old version (2.4.1),
maybe the custom scripts can be the culprit, but had same scripts in the old version as well.

Ah, I should have said manager not cluster.

Check actually runs 100% on the manager. I think the hang is due to a race condition of some sort that prevents it from exiting like it is supposed to. It seems to only occur when the load is high, which is why deploy has an issue but stop first+check works ok.

Hmm, looks like the manager is also running with low memory:

$ free -g
total used free shared buff/cache available
Mem: 70 9 46 0 14 60
Swap: 7 4 3

$ top
top - 15:30:04 up 47 days, 22:33, 8 users, load average: 1.26, 1.25, 1.37
Tasks: 495 total, 2 running, 491 sleeping, 2 stopped, 0 zombie
%Cpu(s): 2.9 us, 1.7 sy, 0.4 ni, 94.9 id, 0.1 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 73949688 total, 48927592 free, 9836872 used, 15185224 buff/cache
KiB Swap: 8388600 total, 4192868 free, 4195732 used. 63369176 avail Mem

Anyways, not going into that rabbit hole :slight_smile:
So the correct sequence to deploy any config changes in a cluster would be:
stop → check → install → start
I was looking at the cmds available and looks like “restart --clean” would do the trick?
or I can just script the above sequence in my restart-bro script :slight_smile:

Thanks,
Fatema.

Hmm, looks like the manager is also running with low memory:
$ free -g
              total used free shared buff/cache available
Mem: 70 9 46 0 14 60
Swap: 7 4 3

That's fine.. you have 46G of ram free.

$ top
top - 15:30:04 up 47 days, 22:33, 8 users, load average: 1.26, 1.25, 1.37
Tasks: 495 total, 2 running, 491 sleeping, 2 stopped, 0 zombie
%Cpu(s): 2.9 us, 1.7 sy, 0.4 ni, 94.9 id, 0.1 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 73949688 total, 48927592 free, 9836872 used, 15185224 buff/cache
KiB Swap: 8388600 total, 4192868 free, 4195732 used. 63369176 avail Mem

Anyways, not going into that rabbit hole :slight_smile:
So the correct sequence to deploy any config changes in a cluster would be:
stop -> check -> install -> start
I was looking at the cmds available and looks like "restart --clean" would do the trick?
or I can just script the above sequence in my restart-bro script :slight_smile:

Yeah.. deploy basically automates check+install+restart.

stop+check+install+start is basically the same as stop+deploy.

The reason why deploy does check first is because if you accidentally broke your configuration, you need to start it again with the old configuration, fix the configuration, and then retry - effectively wasting a restart.

Doing check first means that you don't stop bro unless you know the new configuration will work.

The worst thing you can do is stop+install+check . If you do that, you can end up with a broken bro installation that needs to be fixed before you can start it up again. A lot of people were doing this which is why added deploy.

Cool! Thanks Justin for the explanation :slight_smile:

Tried using stop first+deploy but hangs and never completes.
Also tried doing stop+check+install+start, but hangs again.

Hence, ended up doing install+restart, to bring back the cluster up. :frowning:
The manager doesn’t look overloaded though.
I think next will try out the check.bro change you suggested at the beginning, and see if that helps…

Thanks!

When "broctl deploy" hangs, were there any bro processes
running (before you ran "deploy")?

Also, when you ran "stop+check+install+start", which
command hangs? Also, in this case, did you verify that
all bro processes were stopped by the "stop" command?

There were bro processes running before I used deploy cmd ( manager acts as Proxies-4, logger as well as manager, so around 5-6 bro processes run on manager machine).

When I ran stop+check+install+start, all the bro processes stopped properly (atleast that’s what console output said), and it hung on check.

After restoring the working cluster, I ran ‘broctl check’ on manager just to see what it does when run standalone, while the cluster is running:
it just never completed, and turned out, after couple of minutes, that the manager started swapping and hung, so I had to run ‘sudo kill -9 bro’
to get back the hold of machine, then restarted bro normally by doing a ‘install and restart’… :confused:

Thanks,
Fatema.

Although I couldn't reproduce this problem, I have a
possible fix. If you decide to try it, let me know
if it fixes the problem.

Apply the following patch to
/usr/local/bro/share/bro/broctl/check.bro

--- check.bro.orig 2017-03-08 17:49:53.000000000 -0600
+++ check.bro 2017-03-08 17:49:37.000000000 -0600
@@ -17,3 +17,6 @@
      Log::remove_filter(LoadedScripts::LOG, "default");
      Log::add_filter(LoadedScripts::LOG, f);
      }

Deploy works fine when I run it on our standalone test instance, but hangs when run on prod cluster, so it might be the cluster specific issue?
Thanks Dan for providing the patch, will try it on Bro cluster and then let you know. :slight_smile:

Hi Dan,

Thanks for the patch, finally tested on production cluster and seems to be working fine.

Thanks,
Fatema.