Manager and logger threads crash immediately on deploy

I’ve successfully run smaller Bro clusters, but now that I’m scaling out I’m seeing the manager and logger threads crash immediately when I deploy the configuration.

What I’m trying to run:

  • 1 manager, 1 logger on 1 host

  • 8 proxies and 32 workers on 8 hosts

I’m using Bro 2.5.1. Each worker host has 2 Myricom 10G NICs w/2 ports each, using the 3.0.10 Myricom SNF driver. I’m attempting to run 9 processes (lb_procs) per worker node, each pinned to its own CPU core.

What I’m finding is that any time the number of worker processes exceeds ~160 (not a magic number–not consistent, but around that value based on observation), the manager and logger threads crash. If I keep the number of worker processes at or below ~160 (either by reducing processes per node, reducing nodes per host, or reducing hosts in the cluster) it runs successfully. Ideally, the cluster would have 288 worker processes.
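As a quick sanity check on those numbers, here is a small sketch of the arithmetic. The layout values come from this post; the ceiling of 160 is only the approximate figure observed above, and the function names are made up for illustration.

```python
# Layout described in this thread.
HOSTS = 8
WORKER_NODES_PER_HOST = 4   # 32 worker nodes spread over 8 hosts
LB_PROCS_PER_NODE = 9

# Approximate ceiling observed before the manager/logger crash (not exact).
OBSERVED_CEILING = 160


def total_worker_processes(hosts, nodes_per_host, lb_procs):
    """Total worker processes the manager has to talk to."""
    return hosts * nodes_per_host * lb_procs


def max_lb_procs_under(ceiling, hosts, nodes_per_host):
    """Largest lb_procs value per worker node that stays at or under the ceiling."""
    return ceiling // (hosts * nodes_per_host)


desired = total_worker_processes(HOSTS, WORKER_NODES_PER_HOST, LB_PROCS_PER_NODE)
safe_lb_procs = max_lb_procs_under(OBSERVED_CEILING, HOSTS, WORKER_NODES_PER_HOST)

print(desired)        # 288
print(safe_lb_procs)  # 5
```

With the same 8 hosts and 4 worker nodes per host, about 5 lb_procs per node is the most that stays under the observed ceiling.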

This does not seem to be related to packet volume, as the manager and logger threads crash even if I am not sending any traffic to the worker nodes.

Any troubleshooting or optimization suggestions are appreciated.

Is that (8 proxies and 32 workers) on ALL 8 hosts? For 64 total proxies?

That seems like a lot to me.

8 total proxies, 32 total workers.
(1 proxy node + 4 worker nodes) * 8 hosts

Okay, I think I’m following now, but I want to restate it so that others with more large-cluster experience can chime in.

1 physical host = manager and logger

8 x physical hosts = proxy + 4*(worker w/ 9 lb_procs)

I’m not sure if having multiple worker nodes per physical host is all that common. I assume you’re doing that so each ‘worker’ node only monitors one of your four 10G links per physical host. Ignoring the actual traffic capture aspect, have you tried running only one worker node w/ 36 lb_procs per physical host?

-Dop

Yes, this is a problem.

Bro currently uses select() internally for the IO loop, and select() can't handle more than 1024 file descriptors (the usual FD_SETSIZE, which also rules out any descriptor numbered 1024 or higher).

Around 170 worker processes is where the manager will accumulate more than 1024 fds.
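A rough back-of-the-envelope version of that estimate, as a sketch only: the per-worker descriptor count below is simply what the two numbers in this reply imply (1024 / 170), not something measured on a real manager, and it ignores whatever baseline descriptors the manager holds for its own logs and sockets.

```python
# Usual FD_SETSIZE on Linux; an assumption about the build, not read from Bro.
SELECT_FD_LIMIT = 1024


def implied_fds_per_worker(fd_limit, workers_at_limit):
    """Descriptors per worker process implied by 'limit reached at N workers'."""
    return fd_limit / workers_at_limit


def max_workers(fd_limit, fds_per_worker, baseline_fds=0):
    """Worker processes that fit under the limit for a given per-worker cost."""
    return (fd_limit - baseline_fds) // fds_per_worker


per_worker = implied_fds_per_worker(SELECT_FD_LIMIT, 170)
print(round(per_worker, 1))             # 6.0
print(max_workers(SELECT_FD_LIMIT, 6))  # 170
print(288 * 6)                          # 1728, well over 1024
```

At roughly 6 descriptors per worker process, the desired 288 workers would need on the order of 1700 descriptors on the manager, which is consistent with the crashes starting somewhere past 160.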

There are a few options here:

* Run fewer lb_procs per port to stay under the limit.

* Run two separate bro manager installations so that each manager/logger only handles half the workers. You can currently run more than one logger, but that doesn't help for the manager.

* Wait until the broker work is done and the old select code is removed.

* Swap out all the uses of select() in the communication code with poll(). I had started doing this a while back, but it got put on hold. It's probably not that much work to update it for 2.5. From what I remember it seemed to work, but I didn't have a chance to do much testing on it.

https://github.com/bro/bro/commit/99dbd850b4caa0ed1af351cb7e2695b318ee54ad
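To illustrate why the last option helps, here is a minimal sketch using Python's select module (this is not Bro's code, and it is unrelated to the commit above). select() tracks descriptors in a fixed-size bitmap indexed by descriptor number, so anything at or above FD_SETSIZE can't be watched; poll() takes a list of descriptors, so the descriptor number doesn't matter. The value 1024 and the helper names are assumptions for illustration, and select.poll is not available on Windows.

```python
import os
import select

# Typical FD_SETSIZE; assumed here rather than read from the C headers.
ASSUMED_FD_SETSIZE = 1024


def usable_with_select(fd):
    """True if this descriptor number fits in select()'s fixed-size fd_set."""
    return 0 <= fd < ASSUMED_FD_SETSIZE


def readable_fds(fds, timeout_ms=0):
    """Return the descriptors that are ready for reading, using poll()."""
    poller = select.poll()
    for fd in fds:
        poller.register(fd, select.POLLIN)
    ready = poller.poll(timeout_ms)
    return sorted(fd for fd, event in ready if event & select.POLLIN)


def demo():
    """Watch one pipe with poll(): empty first, readable after a write."""
    read_end, write_end = os.pipe()
    try:
        before = readable_fds([read_end])
        os.write(write_end, b"x")
        after = readable_fds([read_end])
        return before, after == [read_end]
    finally:
        os.close(read_end)
        os.close(write_end)


before, became_readable = demo()
print(before, became_readable)
```

The same readable_fds() call would work unchanged for a descriptor numbered 1500, provided the process's open-file limit (ulimit -n) allows that many descriptors in the first place.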