Thanks. I’ve found the BroControl scripts to modify for the logger setup and will be testing soon.
Here’s the problem I’m having so far. I’ve included most of the modified code below. I think I’m using the wrong data type for logger_str_array?
[bro@mgr /opt/bro]$ bin/broctl start
starting logger-1 …
starting logger-2 …
starting logger-3 (was crashed) …
starting logger-4 …
logger-3 terminated immediately after starting; check output with “diag”
logger-2 terminated immediately after starting; check output with “diag”
logger-1 terminated immediately after starting; check output with “diag”
logger-4 terminated immediately after starting; check output with “diag”
==== stderr.log
error in /opt/bro_data/spool/installed-scripts-do-not-touch/auto/cluster-layout.bro, line 146: not a record (logger-1$manager)
error in /opt/bro_data/spool/installed-scripts-do-not-touch/auto/cluster-layout.bro, line 147: not a record (logger-2$manager)
error in /opt/bro_data/spool/installed-scripts-do-not-touch/auto/cluster-layout.bro, line 148: not a record (logger-3$manager)
error in /opt/bro_data/spool/installed-scripts-do-not-touch/auto/cluster-layout.bro, line 149: not a record (logger-4$manager)
error in /opt/bro_data/spool/installed-scripts-do-not-touch/auto/cluster-layout.bro, line 150: not a record (logger-1$manager)
error in /opt/bro_data/spool/installed-scripts-do-not-touch/auto/cluster-layout.bro, line 151: not a record (logger-2$manager)
error in /opt/bro_data/spool/installed-scripts-do-not-touch/auto/cluster-layout.bro, line 152: not a record (logger-3$manager)
error in /opt/bro_data/spool/installed-scripts-do-not-touch/auto/cluster-layout.bro, line 153: not a record (logger-4$manager)
error in /opt/bro_data/spool/installed-scripts-do-not-touch/auto/cluster-layout.bro, line 22: uninitialized list value ($node_type=Cluster::WORKER, $ip=10.1.1.1, $zone_id=, $p=47778/tcp, $interface=myri0, $logger=logger-2$ = manager, $proxy=proxy-10)
error in /opt/bro_data/spool/installed-scripts-do-not-touch/auto/cluster-layout.bro, line 22: bad record initializer ([$node_type=Cluster::WORKER, $ip=10.1.1.1, $zone_id=, $p=47778/tcp, $interface=myri0, $logger=logger-2$ = manager, $proxy=proxy-10])
from install.py…
OK, I fixed that… loggers[logger_index].name should be used; it’s a reference to the previously defined logger_str_array[] entries.
Now it’s…
error in /opt/bro_data/spool/installed-scripts-do-not-touch/auto/cluster-layout.bro, line 22: unknown identifier logger, at or near “logger”
The format string was the problem. The value of logger_str_array[i] would be $logger="logger-1". When I tried to set it differently, the %s was substituted incorrectly and didn't produce the correct template of $logger="logger-name". (My hack needs to be rewritten.)
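For reference, here's roughly the kind of substitution I mean. This is a simplified sketch, not the actual install.py code; the variable names and the abbreviated worker entry are placeholders.

# Simplified sketch of the substitution, not the real install.py code.
loggers = ["logger-1", "logger-2", "logger-3", "logger-4"]

def logger_field(logger_index):
    # The worker entry needs the literal text $logger="logger-N", quotes included.
    return '$logger="%s"' % loggers[logger_index]

# Abbreviated worker entry for cluster-layout.bro; most fields omitted.
worker_entry = '["worker-1-1"] = [$node_type=Cluster::WORKER, $ip=10.1.1.1, %s],' % logger_field(1)
print(worker_entry)
# ["worker-1-1"] = [$node_type=Cluster::WORKER, $ip=10.1.1.1, $logger="logger-2"],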
Now it’s "fatal error in /opt/bro_data/spool/installed-scripts-do-not-touch/site/local.bro, line 126: can't find logs-to-kafka.bro", which I think means I'm past the multiple-logger configuration and closer to getting this running.
I'm using four loggers and the memory usage remains stable. When I re-enable writing logs to disk there's a difference, since logs/current is a symlink to the first logger, spool/logger-1; the other loggers write into their own spool directories (ex: "spool/logger-3"). I think you mentioned this before.
For some reason logger-1 and logger-3 are doing all of the work: there are no logs in logger-2 and logger-4, and the communication.log files for each don't show any worker communications. At startup there was "peer sent worker-1-1" but nothing afterwards. I'm not sure yet if this happens when Kafka-only logging is enabled. The cluster-layout.bro looks correct and shows the 4 loggers are distributed among the workers correctly, so it's not that.
When I reduced the number of loggers to 2 it's the same phenomenon: logger-1 is working OK, but logger-2 seems to be stalled. Only one worker has sent data, and it's very low volume.
Overall the multiple-logger setup shows promise for fixing the issue, but there are a few more things to discover and tune. It seems the reason the cluster is stable is that only half of the logs are being received when using multiple loggers.
Looks like you worked out the broctl changes the right way. That code is a bit crufty, but what you have will work. There's a much easier way to distribute the workers/proxies:
>>> import itertools
>>> loggers = ['logger-1', 'logger-2']
>>> logger_cycler = itertools.cycle(loggers)
>>> next(logger_cycler)
'logger-1'
>>> next(logger_cycler)
'logger-2'
>>> next(logger_cycler)
'logger-1'
>>> next(logger_cycler)
'logger-2'
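For illustration, here's a rough sketch of how the cycler could drive the worker-to-logger assignment; the node names below are made up, not your real node list.

import itertools

# Round-robin assignment of workers to loggers using itertools.cycle.
loggers = ["logger-1", "logger-2", "logger-3", "logger-4"]
workers = ["worker-1-%d" % i for i in range(1, 9)]

logger_cycler = itertools.cycle(loggers)
assignments = {w: next(logger_cycler) for w in workers}

for worker in workers:
    print("%s -> %s" % (worker, assignments[worker]))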
I'm using four loggers and the memory usage remains stable. When I re-enable writing logs to disk there's a difference, since logs/current is a symlink to the first logger, spool/logger-1; the other loggers write into their own spool directories (ex: "spool/logger-3"). I think you mentioned this before.
Yep. It's an issue for purely local logging, and I'm not sure if rotation would work (but maybe it does? you tell me :-))
For people that use splunk/logstash/kafka it's mostly a non-issue since it will get re-aggregated anyway.
For some reason logger-1 and logger-3 are doing all of the work: there are no logs in logger-2 and logger-4, and the communication.log files for each don't show any worker communications. At startup there was "peer sent worker-1-1" but nothing afterwards. I'm not sure yet if this happens when Kafka-only logging is enabled. The cluster-layout.bro looks correct and shows the 4 loggers are distributed among the workers correctly, so it's not that.
When I reduced the number of loggers to 2 it's the same phenomenon: logger-1 is working OK, but logger-2 seems to be stalled. Only one worker has sent data, and it's very low volume.
Overall the multiple-logger setup shows promise for fixing the issue, but there are a few more things to discover and tune. It seems the reason the cluster is stable is that only half of the logs are being received when using multiple loggers.
It's very promising that you were seeing traffic to logger-1 and logger-3, so it is at least proving that multiple loggers will work. If you ran 4 loggers but only one was doing anything, I'd be worried. I'd be interested in knowing what happens if you ran 6 or 8 loggers.
Can you post what the resulting cluster-layout looked like for 2 and 4 loggers? Maybe it's a simple problem and it's just not evenly distributing things.
To be clear… the code that was already there is crufty because the easier ways didn't exist back when it was written.
Actually, file rotation does work, but it's prone to failure because of timestamp collisions. Each rotated file is named based on the timestamp when its rotation started… so the names differ by about 10-20 seconds (ex: x509.22:51:59… x509.22:52:20… x509.22:52:30). I guess the fix would be to change the filenames to be relative to each logger, ex: "logger-1_x509…", or something more clever like merging all the logger files into a single zip file.
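As a rough illustration of the rename idea (this is not broctl's actual rotation/archive code; the spool path and filename pattern are assumptions on my part):

import glob
import os

# Prefix each logger's rotated files with the logger name so aggregated
# filenames can't collide. Not broctl's real archive logic; paths are guesses.
SPOOL = "/opt/bro_data/spool"

for logger_dir in glob.glob(os.path.join(SPOOL, "logger-*")):
    logger_name = os.path.basename(logger_dir)              # e.g. "logger-3"
    # Only touch files that already carry a rotation timestamp in their name,
    # e.g. "x509.22:51:59.log"; live logs like "conn.log" are left alone.
    for rotated in glob.glob(os.path.join(logger_dir, "*.*:*:*.log")):
        base = os.path.basename(rotated)
        if base.startswith(logger_name):
            continue                                        # already renamed
        os.rename(rotated, os.path.join(logger_dir, logger_name + "_" + base))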
Cluster layouts for 2 loggers and 8 loggers are attached. I don't think there's anything to fix here based on the comments below.
When I configure 8 loggers, only 3 are working (logger-3, logger-4, and logger-8). I restarted the cluster, and this time 5 of the loggers are working (2, 3, 4, 6, 8). I'm still looking into why this happens. This problem would affect the Kafka export, since each logger would be exporting. Restarting the failed loggers didn't fix the log flow. It looks like the workers are associating with their assigned logger correctly after startup, and there's nothing indicative in the workers' stderr or stdout logs.
From logger-1/communication.log after restarting logger-1 post-cluster startup:
1483746743.134338 logger-1 parent - - - info [#10005/10.1.1.2:51512] peer sent class “worker-1-8”
1483746743.134338 logger-1 parent - - - info [#10005/10.1.1.2:51512] phase: handshake
1483746743.135891 logger-1 child - - - info [#10006/10.1.1.3:17887] accepted clear connection
1483746743.137351 logger-1 parent - - - info [#10006/10.1.1.3:17887] added peer
1483746743.137351 logger-1 parent - - - info [#10006/10.1.1.3:17887] peer connected
1483746743.137351 logger-1 parent - - - info [#10006/10.1.1.3:17887] phase: version
1483746743.137351 logger-1 script - - - info connection established
1483746743.139263 logger-1 parent - - - info [#10006/10.1.1.3:17887] peer sent class “worker-3-12”
1483746743.139263 logger-1 parent - - - info [#10006/10.1.1.3:17887] phase: handshake
cluster-layout__2-loggers.bro (47.9 KB)
Here’s the 8-logger cluster-layout.
cluster-layout__8-loggers.bro (48.5 KB)
It also appears that not all workers are connecting to the loggers.
This time, after restarting, 6 of 8 loggers are active, but only 18 workers are actively sending data.
[bro@mgr /opt/bro]$ grep worker spool/logger-*/communication.log | awk '{print $2}' | sort -u | grep -v logger | wc -l
18
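In case it's useful, here's a rough per-logger version of that count. The spool path and the "peer sent class" message format are taken from the communication.log lines above, so treat it as a sketch.

import glob
import os
import re
from collections import defaultdict

# Count the distinct worker peers seen in each logger's communication.log.
SPOOL = "/opt/bro_data/spool"
workers_by_logger = defaultdict(set)

for path in glob.glob(os.path.join(SPOOL, "logger-*", "communication.log")):
    logger_name = os.path.basename(os.path.dirname(path))
    with open(path) as f:
        for line in f:
            # e.g.: ... info [#10006/10.1.1.3:17887] peer sent class "worker-3-12"
            m = re.search(r'peer sent class "?(worker-[\w-]+)"?', line)
            if m:
                workers_by_logger[logger_name].add(m.group(1))

for logger_name in sorted(workers_by_logger):
    print("%s: %d distinct workers" % (logger_name, len(workers_by_logger[logger_name])))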
I went back to testing Kafka-only output with logs to disk disabled, and it seems to work fine with a single logger; the memory usage is stable, but it will take a few days of testing to be sure.
Running
broctl print Communication::nodes
may shed some light on that.
If it times out, you can do
broctl print Communication::nodes logger-1
broctl print Communication::nodes logger-2
broctl print Communication::nodes worker-1-1
broctl print Communication::nodes worker-1-2
broctl print Communication::nodes worker-1-3
to display it from individual nodes.
You may also just want to try running tcpdump when the workers start up; you should see TCP connections from the worker nodes to 10.1.1.1 on ports 47761 and 47762.
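If tcpdump isn't convenient, a quick Python connectivity check run from a worker host would show roughly the same thing. The address and ports below are just the ones mentioned above; substitute whatever your cluster-layout.bro actually assigns to the loggers.

import socket

# Try to open a TCP connection to each logger port from a worker host.
# The host/ports are assumptions; replace them with your real logger endpoints.
LOGGER_HOST = "10.1.1.1"
LOGGER_PORTS = [47761, 47762]

for port in LOGGER_PORTS:
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.settimeout(3)
    try:
        s.connect((LOGGER_HOST, port))
        print("connected to %s:%d" % (LOGGER_HOST, port))
    except socket.error as exc:
        print("could not reach %s:%d (%s)" % (LOGGER_HOST, port, exc))
    finally:
        s.close()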