Manager and logger threads crash immediately on deploy

Chris_Herdt · July 12, 2017, 6:38pm

I’ve successfully run smaller Bro clusters, but now that I’m scaling out I’m seeing the manager and logger threads crash immediately when I deploy the configuration.

What I’m trying to run:

1 manager, 1 logger on 1 host
8 proxies and 32 workers on 8 hosts

I’m using Bro 2.5.1. Each worker host has 2 Myricom 10G NICs w/2 ports each, using the 3.0.10 Myricom SNF driver. I’m attempting to run 9 processes (lb_procs) per worker node, each pinned to its own CPU core.

What I’m finding is that any time the number of worker processes exceeds ~160 (not a magic number–not consistent, but around that value based on observation), the manager and logger threads crash. If I keep the number of worker processes at or below ~160 (either by reducing processes per node, reducing nodes per host, or reducing hosts in the cluster) it runs successfully. Ideally, the cluster would have 288 worker processes.

This does not seem to be related to packet volume, as the manager and logger threads crash even if I am not sending any traffic to the worker nodes.

Any troubleshooting or optimization suggestions are appreciated.

dopheide · July 12, 2017, 6:56pm

Is that (8 proxies and 32 workers) on ALL 8 hosts? For 64 total proxies?

That seems like a lot to me.

Chris_Herdt · July 12, 2017, 7:01pm

8 total proxies, 32 total workers.
(1 proxy node + 4 worker nodes) * 8 hosts
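
For reference, the totals work out as below. This is only a quick sketch of the arithmetic using the numbers from this thread; the function and variable names are made up for illustration and are not BroControl settings.

```python
# Figures quoted in this thread; names are illustrative only.
HOSTS = 8
PROXIES_PER_HOST = 1
WORKER_NODES_PER_HOST = 4   # 2 NICs x 2 ports each
LB_PROCS_PER_NODE = 9       # lb_procs per worker node


def cluster_totals(hosts, proxies_per_host, worker_nodes_per_host, lb_procs):
    """Return proxy, worker-node and worker-process counts for the cluster."""
    worker_nodes = hosts * worker_nodes_per_host
    return {
        "proxies": hosts * proxies_per_host,
        "worker_nodes": worker_nodes,
        "worker_processes": worker_nodes * lb_procs,
    }


totals = cluster_totals(
    HOSTS, PROXIES_PER_HOST, WORKER_NODES_PER_HOST, LB_PROCS_PER_NODE
)
print(totals)
```

The worker-process count is the number that matters here, since it is the one that crosses the ~160 mark.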

dopheide · July 12, 2017, 7:32pm

Okay, I think I’m following now, but I want to restate it so that others with more large-cluster experience can chime in.

1 physical host = manager and logger

8 x physical hosts = proxy + 4*(worker w/ 9 lb_procs)

I’m not sure if having multiple worker nodes per physical host is all that common. I assume you’re doing that so each ‘worker’ node only monitors one of your four 10G links per physical host. Ignoring the actual traffic capture aspect, have you tried running only one worker node w/ 36 lb_procs per physical host?

-Dop

Azoff_Justin_S · July 12, 2017, 7:49pm

Yes.. this is a problem.

Bro currently uses select() internally for the IO loop and select can't handle more than 1024 file descriptors.

Around 170 worker processes is where the manager will accumulate more than 1024 fds.
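
As a rough back-of-the-envelope check, those two figures (1024 fds, trouble at around 170 workers) imply about six fds per worker process on the manager. That per-worker number is only inferred from the figures in this thread, not read out of the Bro source, and the helper names below are hypothetical.

```python
SELECT_FD_LIMIT = 1024    # select() limit quoted above
WORKERS_AT_LIMIT = 170    # approximate point where the manager runs out


def implied_fds_per_worker(fd_limit, workers_at_limit):
    """Average fds per worker implied by the two figures (an estimate only)."""
    return fd_limit / workers_at_limit


def estimated_manager_fds(workers, fds_per_worker):
    """Linear estimate of manager fds; ignores any fixed baseline of open fds."""
    return workers * fds_per_worker


per_worker = implied_fds_per_worker(SELECT_FD_LIMIT, WORKERS_AT_LIMIT)
fds_at_160 = estimated_manager_fds(160, per_worker)
fds_at_288 = estimated_manager_fds(288, per_worker)

print(f"about {per_worker:.1f} fds per worker")
print(f"160 workers fit under the limit: {fds_at_160 < SELECT_FD_LIMIT}")
print(f"288 workers fit under the limit: {fds_at_288 < SELECT_FD_LIMIT}")
```

Under that linear assumption, 160 workers stay below 1024 while the full 288 would need well over it, which matches what was observed.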

There are a few options here:

* Run fewer lb_procs per port to stay under the limit.

* Run two separate bro manager installations so that each manager/logger only handles half the workers. You can currently run more than one logger, but that doesn't help for the manager.

* Wait until the broker work is done and the old select code is removed.

* Swap out all the uses of select in the communication code with poll - I had started doing this a while back, but it got put on hold. It's probably not that much work to update it for 2.5. From what I remember it seemed to work but I didn't have a chance to do much testing on it.

https://github.com/bro/bro/commit/99dbd850b4caa0ed1af351cb7e2695b318ee54ad
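
To illustrate the select-to-poll idea in general terms: poll() is handed a list of descriptors instead of a fixed-size bitmap, so it is not tied to the 1024 limit. The following is a minimal Python sketch of that pattern on a POSIX system, not the code from the commit above; `wait_readable` is a hypothetical helper.

```python
import os
import select


def wait_readable(fds, timeout_ms):
    """Return the fds from `fds` that are readable, using poll() not select()."""
    poller = select.poll()
    for fd in fds:
        poller.register(fd, select.POLLIN)
    # poll() returns (fd, event mask) pairs for descriptors with activity.
    return [fd for fd, events in poller.poll(timeout_ms) if events & select.POLLIN]


# Small demonstration with a pipe standing in for a worker connection.
read_end, write_end = os.pipe()
os.write(write_end, b"x")
ready = wait_readable([read_end], 100)
print(ready == [read_end])
os.close(read_end)
os.close(write_end)
```

The calling code barely changes shape, which is part of why the swap is plausible; the real work is in testing every place the communication code waits on descriptors.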
