2.5 Beta cluster issue

Previously, I successfully ran a Bro 2.4.1 cluster with 160 worker processes. After updating to the 2.5 beta, I'm suddenly seeing an issue where I'm unable to start that same number of worker processes without the manager and logger crashing, either immediately or shortly after restarting the cluster. I can successfully get to 140 worker processes, but when I try to add the last two nodes (10 worker procs each) back into the mix, things go wonky quickly. No crash report is generated, as I would normally have expected. I have checked for orphaned processes within the cluster, and none exist.

I'm wondering if this is a re-manifestation of an issue Justin Azoff helped me with in the past (on a Bro 2.4.1 cluster), where he noted that this sort of problem can occur at around 180 worker procs. In that previous case, after finding orphaned worker processes and killing them, I was able to start my cluster at full strength.

Any input regarding this issue would be greatly appreciated.

Respectfully,

-Erin Shelton

Program Manager: Incident Response and Network Security
Office of Information Technology
University of Colorado Boulder

Yes, this is likely the same issue. There was a change committed just after the 2.5 beta that fixed a file descriptor leak. With the fix in place, 2.5 should support more workers than 2.4.1 did. With Broker in 2.6, the issue should hopefully be gone completely.
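To see what that leak looks like in practice, here is a minimal standalone C sketch of the failure mode (not Bro's actual code; already_seen() and load_file_leaky() are just illustrative names): a file is opened, an early-return path skips the fclose(), and every call quietly burns one descriptor until the process runs into its open-file limit.

#include <stdio.h>
#include <stdlib.h>

/* Illustrative stand-in for a "have we loaded this file already?" check. */
static int already_seen(const char *path)
{
    (void) path;
    return 1;   /* force the early-return path every time */
}

static int load_file_leaky(const char *path)
{
    FILE *f = fopen(path, "r");
    if ( ! f )
        return -1;

    if ( already_seen(path) )
        return 0;   /* BUG: early return without fclose(f); one descriptor leaks per call */

    /* ... parse the file here ... */
    fclose(f);
    return 0;
}

int main(void)
{
    /* Each iteration leaks one descriptor; once the per-process limit is
       reached, fopen() starts failing (EMFILE), which would look a lot
       like the manager/logger dying shortly after startup. */
    for ( int i = 0; i < 100000; ++i )
    {
        if ( load_file_leaky("/etc/hosts") < 0 )
        {
            perror("fopen");
            exit(1);
        }
    }
    return 0;
}

The patch below simply adds the missing fclose() on that early-return path.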

To fix it, you can switch to running Bro from git, or apply this small change to the beta:

commit b3a7d07e66b027a56e57ba010998639ff0d6da86
Author: Daniel Thayer <dnthayer@illinois.edu>

    Added a missing fclose in scan.l

    On OS X, Bro was failing to startup without first using the "ulimit -n"
    command to increase the allowed number of open files (OS X has a much
    lower default limit than Linux or FreeBSD).

diff --git a/src/scan.l b/src/scan.l
index a6e37a6..a026b31 100644
--- a/src/scan.l
+++ b/src/scan.l
@@ -599,7 +599,12 @@ static int load_files(const char* orig_file)
        ino_t i = get_inode_num(f, file_path);

        if ( already_scanned(i) )
+               {
+               if ( f != stdin )
+                       fclose(f);
+
                return 0;
+               }
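
Related to the "ulimit -n" note in the commit message: if you want to verify what open-file limit your manager and logger processes actually run under, a small standalone program along these lines (again, not part of Bro) prints the soft and hard RLIMIT_NOFILE values and raises the soft limit to the hard limit, which is roughly what running "ulimit -n <hard limit>" in the shell does before starting Bro:

#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
    struct rlimit rl;

    if ( getrlimit(RLIMIT_NOFILE, &rl) != 0 )
    {
        perror("getrlimit");
        return 1;
    }

    printf("open files: soft limit %llu, hard limit %llu\n",
           (unsigned long long) rl.rlim_cur,
           (unsigned long long) rl.rlim_max);

    /* Raise the soft limit to the hard limit -- roughly what
       "ulimit -n <hard limit>" does in the shell before starting Bro. */
    rl.rlim_cur = rl.rlim_max;
    if ( setrlimit(RLIMIT_NOFILE, &rl) != 0 )
        perror("setrlimit");

    return 0;
}

Even with the fclose fix applied, a cluster this size keeps a lot of descriptors open on the manager and logger, so this is a quick way to confirm you are not simply bumping into the limit.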