Thanks. I’ve found the BroControl scripts to modify for the logger setup and will be testing soon.
Here’s the problem I’m having so far. I’ve included most of the modified code below. I think I’m using the wrong data type for logger_str_array?
[bro@mgr /opt/bro]$ bin/broctl start
starting logger-1 …
starting logger-2 …
starting logger-3 (was crashed) …
starting logger-4 …
logger-3 terminated immediately after starting; check output with “diag”
logger-2 terminated immediately after starting; check output with “diag”
logger-1 terminated immediately after starting; check output with “diag”
logger-4 terminated immediately after starting; check output with “diag”
==== stderr.log
error in /opt/bro_data/spool/installed-scripts-do-not-touch/auto/cluster-layout.bro, line 146: not a record (logger-1$manager)
error in /opt/bro_data/spool/installed-scripts-do-not-touch/auto/cluster-layout.bro, line 147: not a record (logger-2$manager)
error in /opt/bro_data/spool/installed-scripts-do-not-touch/auto/cluster-layout.bro, line 148: not a record (logger-3$manager)
error in /opt/bro_data/spool/installed-scripts-do-not-touch/auto/cluster-layout.bro, line 149: not a record (logger-4$manager)
error in /opt/bro_data/spool/installed-scripts-do-not-touch/auto/cluster-layout.bro, line 150: not a record (logger-1$manager)
error in /opt/bro_data/spool/installed-scripts-do-not-touch/auto/cluster-layout.bro, line 151: not a record (logger-2$manager)
error in /opt/bro_data/spool/installed-scripts-do-not-touch/auto/cluster-layout.bro, line 152: not a record (logger-3$manager)
error in /opt/bro_data/spool/installed-scripts-do-not-touch/auto/cluster-layout.bro, line 153: not a record (logger-4$manager)
error in /opt/bro_data/spool/installed-scripts-do-not-touch/auto/cluster-layout.bro, line 22: uninitialized list value ($node_type=Cluster::WORKER, $ip=10.1.1.1, $zone_id=, $p=47778/tcp, $interface=myri0, $logger=logger-2$ = manager, $proxy=proxy-10)
error in /opt/bro_data/spool/installed-scripts-do-not-touch/auto/cluster-layout.bro, line 22: bad record initializer ([$node_type=Cluster::WORKER, $ip=10.1.1.1, $zone_id=, $p=47778/tcp, $interface=myri0, $logger=logger-2$ = manager, $proxy=proxy-10])
from install.py…
OK, I fixed that… loggers[logger_index].name should be used; it’s a reference to the previously defined logger_str_array[] entries.
Now it’s…
error in /opt/bro_data/spool/installed-scripts-do-not-touch/auto/cluster-layout.bro, line 22: unknown identifier logger, at or near “logger”
The format string was the problem. The value of logger_str_array[i] would be $logger="logger-1". When I tried to set it differently, the %s was substituted incorrectly and didn't produce the correct template of $logger="logger-name". (My hack needs to be rewritten.)
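For reference, here's roughly the kind of substitution I mean. This is a simplified sketch, not the actual install.py code; the variable names and the abbreviated worker entry are placeholders.

# Simplified sketch of the substitution, not the real install.py code.
loggers = ["logger-1", "logger-2", "logger-3", "logger-4"]

def logger_field(logger_index):
    # The worker entry needs the literal text $logger="logger-N", quotes included.
    return '$logger="%s"' % loggers[logger_index]

# Abbreviated worker entry for cluster-layout.bro; most fields omitted.
worker_entry = '["worker-1-1"] = [$node_type=Cluster::WORKER, $ip=10.1.1.1, %s],' % logger_field(1)
print(worker_entry)
# ["worker-1-1"] = [$node_type=Cluster::WORKER, $ip=10.1.1.1, $logger="logger-2"],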
Now it’s "fatal error in /opt/bro_data/spool/installed-scripts-do-not-touch/site/local.bro, line 126: can't find logs-to-kafka.bro", which I think means I'm past the multiple-logger configuration and closer to getting this running.
I'm using four loggers and the memory usage remains stable. When I re-enable writing logs to disk there's a difference, since logs/current is a symlink to the first logger, spool/logger-1; the other loggers write into their own spool directories (ex: "spool/logger-3"). I think you mentioned this before.
For some reason logger-1 and logger-3 are doing all of the work: there are no logs in logger-2 and logger-4, and the communication.log files for each don't show any worker communications. At startup there was "peer sent worker-1-1" but nothing afterwards. I'm not sure yet if this happens when Kafka-only logging is enabled. The cluster-layout.bro looks correct and shows the 4 loggers are distributed among the workers correctly, so it's not that.
When I reduced the number of loggers to 2 it's the same phenomenon: logger-1 is working OK, but logger-2 seems to be stalled. Only one worker has sent data, and it's very low volume.
Overall the multiple-logger setup shows promise for fixing the issue, but there are a few more things to discover and tune. It seems the reason the cluster is stable is that only half of the logs are being received when using multiple loggers.
Looks like you worked out the broctl changes the right way. That code is a bit crufty, but what you have will work. There's a much easier way to distribute the workers/proxies:
>>> import itertools
>>> loggers = ['logger-1', 'logger-2']
>>> logger_cycler = itertools.cycle(loggers)
>>> next(logger_cycler)
'logger-1'
>>> next(logger_cycler)
'logger-2'
>>> next(logger_cycler)
'logger-1'
>>> next(logger_cycler)
'logger-2'
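For illustration, here's a rough sketch of how the cycler could drive the worker-to-logger assignment; the node names below are made up, not your real node list.

import itertools

# Round-robin assignment of workers to loggers using itertools.cycle.
loggers = ["logger-1", "logger-2", "logger-3", "logger-4"]
workers = ["worker-1-%d" % i for i in range(1, 9)]

logger_cycler = itertools.cycle(loggers)
assignments = {w: next(logger_cycler) for w in workers}

for worker in workers:
    print("%s -> %s" % (worker, assignments[worker]))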
I'm using four loggers and the memory usage remains stable. When I re-enable writing logs to disk there's a difference, since logs/current is a symlink to the first logger, spool/logger-1; the other loggers write into their own spool directories (ex: "spool/logger-3"). I think you mentioned this before.
Yep. It's an issue for purely local logging, and I'm not sure if rotation would work (but maybe it does? you tell me :-))
For people that use splunk/logstash/kafka it's mostly a non-issue since it will get re-aggregated anyway.
For some reason logger-1 and logger-3 are doing all of the work: there are no logs in logger-2 and logger-4, and the communication.log files for each don't show any worker communications. At startup there was "peer sent worker-1-1" but nothing afterwards. I'm not sure yet if this happens when Kafka-only logging is enabled. The cluster-layout.bro looks correct and shows the 4 loggers are distributed among the workers correctly, so it's not that.
When I reduced the number of loggers to 2 it's the same phenomenon: logger-1 is working OK, but logger-2 seems to be stalled. Only one worker has sent data, and it's very low volume.
Overall the multiple-logger setup shows promise for fixing the issue, but there are a few more things to discover and tune. It seems the reason the cluster is stable is that only half of the logs are being received when using multiple loggers.
It's very promising that you were seeing traffic to logger-1 and logger-3, so it is at least proving that multiple loggers will work. If you ran 4 loggers but only one was doing anything, I'd be worried. I'd be interested in knowing what happens if you ran 6 or 8 loggers.
Can you post what the resulting cluster-layout looked like for 2 and 4 loggers? Maybe it's a simple problem and it's just not evenly distributing things.
To be clear… the code that was already there is crufty because the easier ways didn't exist back when it was written.
Actually, file rotation does work, but it's prone to failure because of timestamp collisions. Each rotated file is named based on the timestamp when its rotation started… so the names differ by about 10-20 seconds (ex: x509.22:51:59… x509.22:52:20… x509.22:52:30). I guess the fix would be to change the filenames to be relative to each logger, ex: "logger-1_x509…", or something more clever like merging all the logger files into a single zip file.
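As a rough illustration of the rename idea (this is not broctl's actual rotation/archive code; the spool path and filename pattern are assumptions on my part):

import glob
import os

# Prefix each logger's rotated files with the logger name so aggregated
# filenames can't collide. Not broctl's real archive logic; paths are guesses.
SPOOL = "/opt/bro_data/spool"

for logger_dir in glob.glob(os.path.join(SPOOL, "logger-*")):
    logger_name = os.path.basename(logger_dir)              # e.g. "logger-3"
    # Only touch files that already carry a rotation timestamp in their name,
    # e.g. "x509.22:51:59.log"; live logs like "conn.log" are left alone.
    for rotated in glob.glob(os.path.join(logger_dir, "*.*:*:*.log")):
        base = os.path.basename(rotated)
        if base.startswith(logger_name):
            continue                                        # already renamed
        os.rename(rotated, os.path.join(logger_dir, logger_name + "_" + base))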
Cluster layouts for 2 loggers and 8 loggers are attached. I don't think there's anything to fix here based on the comments below.
When I configure 8 loggers, only 3 are working (logger-3, logger-4, and logger-8). I restarted the cluster, and this time 5 of the loggers are working (2, 3, 4, 6, 8). I'm still looking into why this happens. This problem would affect the Kafka export, since each logger would be exporting. Restarting the failed loggers didn't fix the log flow. It looks like the workers are associating with their assigned logger correctly after startup, and there's nothing indicative in the workers' stderr or stdout logs.
From logger-1/communication.log after restarting logger-1 post-cluster startup:
1483746743.134338 logger-1 parent - - - info [#10005/10.1.1.2:51512] peer sent class “worker-1-8”
1483746743.134338 logger-1 parent - - - info [#10005/10.1.1.2:51512] phase: handshake
1483746743.135891 logger-1 child - - - info [#10006/10.1.1.3:17887] accepted clear connection
1483746743.137351 logger-1 parent - - - info [#10006/10.1.1.3:17887] added peer
1483746743.137351 logger-1 parent - - - info [#10006/10.1.1.3:17887] peer connected
1483746743.137351 logger-1 parent - - - info [#10006/10.1.1.3:17887] phase: version
1483746743.137351 logger-1 script - - - info connection established
1483746743.139263 logger-1 parent - - - info [#10006/10.1.1.3:17887] peer sent class “worker-3-12”
1483746743.139263 logger-1 parent - - - info [#10006/10.1.1.3:17887] phase: handshake
cluster-layout__2-loggers.bro (47.9 KB)
Here’s the 8-logger cluster-layout.
cluster-layout__8-loggers.bro (48.5 KB)
It also appears that not all workers are connecting to the loggers.
This time, after restarting, 6 of 8 loggers are active, but only 18 workers are actively sending data.
[bro@mgr /opt/bro]$ grep worker spool/logger-*/communication.log | awk '{print $2}' | sort -u | grep -v logger | wc -l
18
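In case it's useful, here's a rough per-logger version of that count. The spool path and the "peer sent class" message format are taken from the communication.log lines above, so treat it as a sketch.

import glob
import os
import re
from collections import defaultdict

# Count the distinct worker peers seen in each logger's communication.log.
SPOOL = "/opt/bro_data/spool"
workers_by_logger = defaultdict(set)

for path in glob.glob(os.path.join(SPOOL, "logger-*", "communication.log")):
    logger_name = os.path.basename(os.path.dirname(path))
    with open(path) as f:
        for line in f:
            # e.g.: ... info [#10006/10.1.1.3:17887] peer sent class "worker-3-12"
            m = re.search(r'peer sent class "?(worker-[\w-]+)"?', line)
            if m:
                workers_by_logger[logger_name].add(m.group(1))

for logger_name in sorted(workers_by_logger):
    print("%s: %d distinct workers" % (logger_name, len(workers_by_logger[logger_name])))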
I went back to testing Kafka-only output with logs to disk disabled, and it seems to work fine with a single logger; the memory usage is stable, but it will take a few days of testing to be sure.
Running
broctl print Communication::nodes
may shed some light on that.
If it times out, you can do
broctl print Communication::nodes logger-1
broctl print Communication::nodes logger-2
broctl print Communication::nodes worker-1-1
broctl print Communication::nodes worker-1-2
broctl print Communication::nodes worker-1-3
to display it from individual nodes.
You may also just want to try running tcpdump when the workers start up; you should see TCP connections from the worker nodes to 10.1.1.1 on ports 47761 and 47762.
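If tcpdump isn't convenient, a quick Python connectivity check run from a worker host would show roughly the same thing. The address and ports below are just the ones mentioned above; substitute whatever your cluster-layout.bro actually assigns to the loggers.

import socket

# Try to open a TCP connection to each logger port from a worker host.
# The host/ports are assumptions; replace them with your real logger endpoints.
LOGGER_HOST = "10.1.1.1"
LOGGER_PORTS = [47761, 47762]

for port in LOGGER_PORTS:
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.settimeout(3)
    try:
        s.connect((LOGGER_HOST, port))
        print("connected to %s:%d" % (LOGGER_HOST, port))
    except socket.error as exc:
        print("could not reach %s:%d (%s)" % (LOGGER_HOST, port, exc))
    finally:
        s.close()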