The manager keeps crashing, and I can't tell why.
One master, 26 worker systems, 200 worker processes total. All CentOS 6, Bro 2.5.
The crashes just started last night; the system had been running with zero issues since the 2.5 release.
Any way to tell why it's crashing? So far, all I have is the email from broctl, and it's not very helpful.
I'm actually surprised that works at all. Because Bro currently (but not for much longer) uses select() for handling connections from all the workers, the manager will fail as soon as it has enough connections for a file descriptor to go above 1024. You used to hit that limit at around 175 workers, though now that I think of it, we fixed a .bro script leak in 2.5, so the new limit may be around 220 for Bro 2.5. The next version of Bro should hopefully not have a limit at all.
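For the curious, here's a minimal sketch (not Bro's actual code) of why the ceiling sits at 1024: select() operates on fd_set, a fixed-size bitmap of FD_SETSIZE bits (1024 on Linux), so a descriptor numbered 1024 or higher simply can't be registered, and calling FD_SET() on one writes past the end of the bitmap:

/* fd_set_limit.c -- illustrates the FD_SETSIZE ceiling that select()-based
 * servers run into; a hypothetical demo, not taken from the Bro source. */
#include <stdio.h>
#include <sys/select.h>

int main(void)
{
    printf("FD_SETSIZE = %d\n", FD_SETSIZE);  /* prints 1024 on Linux */

    fd_set readfds;
    FD_ZERO(&readfds);

    int worker_fd = 1500;  /* e.g. a worker connection whose descriptor exceeds the limit */
    if (worker_fd >= FD_SETSIZE)
        fprintf(stderr, "fd %d cannot be monitored with select()\n", worker_fd);
    else
        FD_SET(worker_fd, &readfds);  /* only safe for descriptors below FD_SETSIZE */

    return 0;
}

Polling mechanisms like epoll and kqueue don't have this fixed ceiling, which is presumably what removes the limit in later versions.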
This message:
received termination signal
means something killed it, probably the kernel OOM killer. Does syslog show anything?
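If the OOM killer is the culprit, it leaves a clear trail in the kernel log. Assuming the default CentOS 6 syslog layout (kernel messages going to /var/log/messages), something along these lines should turn up "Out of memory: Kill process ..." entries:

grep -iE 'out of memory|oom' /var/log/messages

dmesg should show the same lines if they're recent enough to still be in the kernel ring buffer.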
I've tried commenting out everything in node.cfg except the master, one proxy, and one worker system running 6 worker processes. It still crashes after around 15 seconds.
If it says "received termination signal " it's not crashing, something is killing it.
I ran several manual bro start commands, and I always get "received termination signal." It writes to a nohup.out file that contains that string.
broctl diag says…
==== .status
TERMINATING [done_with_network]
Not sure what to do here. The manager doesn't have a sniffer NIC; its purpose is accepting data in from the worker nodes.
Do you have an agent or cron job or something running on the machine that could be killing bro for some reason?
Lots of these.
0.000000 Reporter::ERROR no such index (Cluster::nodes[Intel::p$descr]) /opt/bro/share/bro/base/frameworks/intel/./cluster.bro, line 35
0.000000 Reporter::ERROR no such index (Cluster::nodes[Intel::p$descr]) /opt/bro/share/bro/base/frameworks/intel/./cluster.bro, line 35
0.000000 Reporter::ERROR no such index (Cluster::nodes[Intel::p$descr]) /opt/bro/share/bro/base/frameworks/intel/./cluster.bro, line 35
0.000000 Reporter::ERROR no such index (Cluster::nodes[Intel::p$descr]) /opt/bro/share/bro/base/frameworks/intel/./cluster.bro, line 35
So I commented out that section just for grins, and it still crashes.
[mclemons@bromaster-kcc:~/logs/current ] $ tail -f reporter.log
1489101446.599386 Reporter::INFO processing continued (empty)
1489101446.582511 Reporter::INFO processing continued (empty)
1489101446.565019 Reporter::INFO processing suspended (empty)
1489101446.565019 Reporter::INFO processing continued (empty)
1489101446.637924 Reporter::INFO processing suspended (empty)
1489101446.637924 Reporter::INFO processing continued (empty)
1489101446.728349 Reporter::INFO processing continued (empty)
1489101446.681030 Reporter::INFO processing continued (empty)
1489101446.751914 Reporter::INFO processing continued (empty)
1489101446.755815 Reporter::INFO processing continued (empty)
0.000000 Reporter::INFO received termination signal (empty)
#close 2017-03-09-23-19-16
And "child died" in the communication.log.
And a segfault:
2017-03-09T18:34:06.409225+00:00 HOSTNAME kernel: bro[60506]: segfault at 0 ip 00000000005fcf8d sp 00007fffaf9d2f40 error 6 in bro[400000+624000]
Just wanted to give an update to show how crazy this has been.
The segfaults made me think "memory issue," so I ran memtest on the system. It has a lot of memory, so this took many hours to complete, and it finished with zero errors. I pulled power on the system, and on boot everything came up fine with a limited set of workers. I added all 200+ worker processes back in, and now it's running like a champ again.
The only other thing it could have been was a power outage on one of the 10 gig worker boxes. It kept blipping and coming back up, Bro cron kept restarting processes, and then that worker system would crash again from the lack of power. That could have caused the manager to fail, but I can't really tell what the root cause was.
Thanks for the responses.
Ah.. I dropped the ball on this, sorry.
That's really interesting that a full restart fixed things. One thing I was thinking could have caused it was a stray/hung bro process somehow still listening on the port, but that usually shows up as a much more explicit issue in the logs.
It may be possible to use gdb to see where this is in the bro binary:
2017-03-09T18:34:06.409225+00:00 HOSTNAME kernel: bro[60506]: segfault at 0 ip 00000000005fcf8d sp 00007fffaf9d2f40 error 6 in bro[400000+624000]
I'm not sure if the usual method would work, but you can try
gdb `which bro`
and then at the (gdb) prompt, see if
info symbol 0x00000000005fcf8d
info symbol 0x00007fffaf9d2f40
show anything useful. There may be a more correct command to get gdb to tell you where in the bro binary the segfault occurred.
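If the binary was built with debugging info (an assumption about that particular build), addr2line can also resolve the faulting address. The kernel line shows bro mapped at 400000, i.e. a non-PIE binary, so the raw instruction pointer should work as-is:

addr2line -f -C -e `which bro` 0x00000000005fcf8d

The sp value is just the stack pointer at the time of the fault, so it won't map back to a symbol, and "segfault at 0 ... error 6" reads (if I'm decoding the page-fault flags right) as a user-mode write through a NULL pointer.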