intel.log file stops getting generated.

Hi All,

Running Bro 2.5; everything is working except the intel.log file, which has stopped getting generated.
The last event in that file was around 12:45pm today; after it got rotated,
I didn't see an intel.log for the 1pm hour, and there is still no intel.log in the current log dir.

I don't know why intel.log all of a sudden stopped getting generated.

I checked:

  1. The conn.log shows connections from IPs listed as bad in the intel feed:

$ less bad-IP.intel | grep "61.240.xx.yy"
61.240.xx.yy Intel::ADDR scanner 85 csirtg.io

$ less conn.log | grep "61.240.xx.yy"
1485280794.930507 CzUCmv3TFKLcYxFps1 61.240.xx.yy 40805 128.4.107.206 8081 tcp - - - - S0 F T 0 S 1 40 0 0 (empty)

  2. Permissions on the intel input files are fine, i.e. they are readable by Bro.
  3. No major activity related to Bro happened around 12:45 that could have impacted Bro processing.

Any leads/suggestions?

Thanks,
Fatema.

Hi Fatema,

Running Bro 2.5; everything is working except the intel.log file, which has stopped getting generated.

Just to be sure: you haven't configured intel expiration, right?
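
That is, something like the following in local.bro:

    # Expire intel items that are older than the configured interval.
    @load frameworks/intel/do_expire
    redef Intel::item_expiration = 1day;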

The last event in that file was around 12:45pm today; after it got rotated,
I didn't see an intel.log for the 1pm hour, and there is still no intel.log in the
current log dir.

I don't know why intel.log all of a sudden stopped getting generated.

How long had that instance been running, and is the behavior reproducible?
Have you noticed anything in reporter.log? To debug whether this is a
logging issue or an intel framework issue, you might add a debug print to
the matching event.
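
For example, a quick and dirty handler like this (just a sketch) would show whether matches are still happening even while intel.log stays empty:

    # Debug only: print every intel hit so we can tell whether the intel
    # framework is still matching even though intel.log is not written.
    event Intel::match(s: Intel::Seen, items: set[Intel::Item])
        {
        print fmt("intel match: %s (%s)", s$indicator, s$where);
        }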

Jan

It turned out to be a performance issue.
I restarted the Bro cluster and intel.log started getting generated again, but now I have another issue:
the Bro sensors are utilizing almost 100% of memory as well as part of swap.

We recently upgraded the kernel and CentOS to 7.3 on the Bro cluster, and are using the latest PF_RING v6.4.1.
We have 4 Bro sensors, each with 132G of memory and a 24-core CPU @ 2.50GHz (48 on-line CPUs), and each running 22 Bro processes.

The memory usage on the sensors is around 129G:

              total        used        free      shared  buff/cache   available
Mem:      131921372   129912824      801672       11224     1206876     1270972
Swap:       8388600     3378312     5010288

The peak traffic we see usually hovers around 6-7Gbps.
I don't know if this started happening after the upgrade to Bro 2.5, but the sensors become unresponsive because of this.

I have checked that the ethtool settings on the sensors are set to: rx off tx off tso off sg off gso off gro off.
Also, I have commented out some scripts that I used to run with 2.4.1, but no luck with the memory usage.

Any leads/suggestions?

Thanks,
Fatema.

With pf_ring, the first thing to check would be to verify that Bro is using pf_ring correctly. If it's not, you end up analyzing 100% of the traffic 22 times.

If you do a

    ls -l /proc/net/pf_ring/

and

    cat /proc/net/pf_ring/info

it should show the rings in use and one file per Bro process, like:

-r--r--r--. 1 root root 0 Jan 25 11:23 36549-p1p1.376
-r--r--r--. 1 root root 0 Jan 25 11:23 36552-p1p1.369
-r--r--r--. 1 root root 0 Jan 25 11:23 36561-p1p1.377
-r--r--r--. 1 root root 0 Jan 25 11:23 36581-p1p1.372
-r--r--r--. 1 root root 0 Jan 25 11:23 36594-p1p1.375
-r--r--r--. 1 root root 0 Jan 25 11:23 36600-p1p1.378
-r--r--r--. 1 root root 0 Jan 25 11:23 36608-p1p1.371
-r--r--r--. 1 root root 0 Jan 25 11:23 36611-p1p2.373

Hi Justin,

Thanks for suggestions.
Here are the stats (it looks like Bro is using pf_ring correctly though):

$ ls -l /proc/net/pf_ring/
total 0
-r--r--r-- 1 root root 0 Jan 25 12:40 74966-em1.1612
-r--r--r-- 1 root root 0 Jan 25 12:40 74968-em1.1616
-r--r--r-- 1 root root 0 Jan 25 12:40 74969-em1.1618
-r--r--r-- 1 root root 0 Jan 25 12:40 74970-em1.1620
-r--r--r-- 1 root root 0 Jan 25 12:40 74977-em1.1615
-r--r--r-- 1 root root 0 Jan 25 12:40 74998-em1.1621
-r--r--r-- 1 root root 0 Jan 25 12:40 75001-em1.1614
-r--r--r-- 1 root root 0 Jan 25 12:40 75026-em1.1629
-r--r--r-- 1 root root 0 Jan 25 12:40 75027-em1.1631
-r--r--r-- 1 root root 0 Jan 25 12:40 75040-em1.1622
-r--r--r-- 1 root root 0 Jan 25 12:40 75042-em1.1619
-r--r--r-- 1 root root 0 Jan 25 12:40 75051-em1.1627
-r--r--r-- 1 root root 0 Jan 25 12:40 75072-em1.1633
-r--r--r-- 1 root root 0 Jan 25 12:40 75076-em1.1613
-r--r--r-- 1 root root 0 Jan 25 12:40 75077-em1.1623
-r--r--r-- 1 root root 0 Jan 25 12:40 75097-em1.1625
-r--r--r-- 1 root root 0 Jan 25 12:40 75102-em1.1632
-r--r--r-- 1 root root 0 Jan 25 12:40 75105-em1.1624
-r--r--r-- 1 root root 0 Jan 25 12:40 75106-em1.1630
-r--r--r-- 1 root root 0 Jan 25 12:40 75107-em1.1626
-r--r--r-- 1 root root 0 Jan 25 12:40 75109-em1.1628
-r--r--r-- 1 root root 0 Jan 25 12:40 75110-em1.1617

$ cat /proc/net/pf_ring/info
PF_RING Version : 6.4.1 (unknown)
Total rings : 22

Standard (non ZC) Options
Ring slots : 32768
Slot version : 16
Capture TX : No [RX only]
IP Defragment : No
Socket Mode : Standard
Total plugins : 0
Cluster Fragment Queue : 14140
Cluster Fragment Discard : 0

$ free
              total        used        free      shared  buff/cache   available
Mem:      131921372   130028924      684760       11916     1207688     1161016
Swap:       8388600     3253200     5135400

Yes.. that is how it should look.. very important to verify that before checking anything else :slight_smile:

What does your 'broctl top' output look like?

That will break things down by process.

Forgot to mention the architecture of the cluster:
1 manager node (which is defined as logger as well)
4 worker nodes (which are defined as proxies as well)

Before (in 2.4.1) we used to have the manager act as a proxy, but because of performance issues (i.e. Bro being unable to rotate logs on the manager), we moved the proxy functionality to the workers.

Attaching the output of ‘broctl top’, as it would swamp this email with text if pasted in the body :slight_smile:

Thanks,
Fatema

broctl_top.txt (13.5 KB)

Interesting, so all of your workers are pretty much the same, at:

worker-1-12 worker wrk1.xx.xx.xx 78972 parent 5G 5G 6% bro
worker-1-12 worker wrk1.xx.xx.xx 78994 child 428M 269M 0% bro

Do you have any system monitoring graphs that would show memory usage over time? I wonder if they are quickly growing to 5G at startup, or if they are slowly growing over time. In a pinch, you can throw something like (date; broctl top) in cron and send the output to a file.

Are you loading misc/detect-traceroute or misc/scan.bro?

Yeah, all procs are pretty much the same. Not sure why there is a parent/child pair for each process; I thought it would just be 22 processes per node. Hmm, interesting.

I think we don't have any system monitoring graphs on the workers (I'm looking into installing some tool to do that; I was googling about the same :)).
I can set up a cron job to do broctl top and send the output to a file.

The misc/detect-traceroute script isn't loaded, but misc/scan is loaded in local.bro. I was just about to configure Aashish's scan-NG script to detect other kinds of scans as well, but
seeing the boxes already swapping, I chucked that plan :frowning:

Thanks,
Fatema.

Yeah, all procs are pretty much the same. Not sure why there is a parent/child pair for each process; I thought it would just be 22 processes per node. Hmm, interesting.

The child process handles the communication to the manager/proxies. These will go away once the conversion to Broker is done.

I think we don't have any system monitoring graphs on the workers (I'm looking into installing some tool to do that; I was googling about the same :)).
I can set up a cron job to do broctl top and send the output to a file.

Munin is crazy easy to get up and running and does the job, but it's not the best monitoring system out there. You can also use things like sar to collect data and use something else to graph it.

The misc/detect-traceroute script isn't loaded, but misc/scan is loaded in local.bro. I was just about to configure Aashish's scan-NG script to detect other kinds of scans as well, but
seeing the boxes already swapping, I chucked that plan :frowning:

Ah... if your network sees a lot of scan traffic, scan.bro could be what is killing your cluster.

If you run these commands, what values do you get?

    wc -l conn.log
    cat conn.log|bro-cut id.resp_p |fgrep -cw 23
    cat conn.log|bro-cut history|sort|uniq -c |sort -rn|head

Thanks, Justin, for suggesting some tools :slight_smile: I will try those (maybe Munin first).

Here's the output of the commands:

$ wc -l conn.log
12913751 conn.log

$ cat conn.log|bro-cut id.resp_p |fgrep -cw 23
3

$ cat conn.log|bro-cut history|sort|uniq -c |sort -rn|head
4230547 S
2938925 Dd
1059285 ShADadFf
968902 ShADadfF
915401 D
212507 ShAFf
177731 SAF
177359 ShADadFfR
159024 ShADadfFr
140911 ShADdaFf

Interesting: you're not seeing port 23 scans, but you are seeing a lot of scans... 1/3 of your connections are unanswered SYN packets.

This would show what port is being scanned:

    cat conn.log |bro-cut id.resp_p history|fgrep -w S|sort|uniq -c|sort -nr|head

Disabling scan.bro would likely help a lot.
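
Disabling it is just a matter of commenting out the corresponding load in your local.bro, i.e. something like:

    # stop loading the stock scan detection script
    # @load misc/scan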

Thanks Justin!

It's happening again: no intel.log file is getting generated (I don't know why the poor intel log is the one getting impacted, and not any other log file :-/ ).

Here are the stats (before I go ahead and disable scan.bro and restart the cluster):

$ cat conn.log |bro-cut id.resp_p history|fgrep -w S|sort|uniq -c|sort -nr|head
398587 2323 S
256953 5358 S
205109 7547 S
115442 6789 S
101712 22 S
97051 81 S
90099 5800 S
44297 40884 S
43943 40876 S
35522 80 S

$ free
              total        used        free      shared  buff/cache   available
Mem:      131921372   131069700      562628       18476      289044      223264
Swap:       8388600     4443208     3945392

As can be seen above, worker1 is using almost 100% of memory :frowning:

I am going to disable scan.bro and restart the cluster.
Also, I will set up Munin to have system monitoring graphs on the sensors.

Thanks,
Fatema.

Ah... that looks about right for the constant flood of IoT scan crap. Are you filtering port 23 before Bro can see it? 23 would be about 10x the volume of 2323.

Ah, that makes sense; yes, port 23 is getting blocked at the border, hence Bro wouldn't be seeing any traffic to port 23… :slight_smile:
I disabled the scan.bro file. Is there any other script that can be used in place of scan.bro, i.e. would scan-NG have the same effect as well?
Thanks, Justin, for the help troubleshooting the issue; I will keep an eye on the sensors for any performance hit over the next 24 hours.

scan-NG will work a lot better than scan.bro. I have a version that is kind of like 'scan-ng-lite', but from a user's point of view it doesn't add much over scan-NG, so you should just use that.

Alrighty, yeah, I was looking into how to configure the script for our environment.
It appears that we have to define the list of allocated subnets in the network,
as landmine detection works by watching connections to addresses that are not in allocated subnets.

Defining the allocated subnets is a pain: we have a whole long list of subnets that are allocated and
just a couple of subnets that constitute the darknet, hence I was tweaking the scripts to change that setting
from defining allocated subnets to defining un-allocated subnets instead, which is much easier.

Thanks,
Fatema.

That part is optional (but extremely useful). I'm glad you brought this up; the darknet configuration problem is something I've been thinking about how to fix:

* Some people define darknet as NOT allocated.
* Some people know exactly which subnets are dark.

I did write a version of the darknet code that auto-tunes itself based on allocated subnets; it's part of my scan code:

https://gist.github.com/JustinAzoff/80f97af4f4fbb91ae26492b919a50434

One can let it run for a while, and then do a

    broctl print Site::used_address_space manager

to dump out what it figures out as active, and then put it in a file that does

    @load ./dark-nets
    redef Site::used_address_space = {
    ...
    }

It's not perfect but it's a start. Broker, with its persistent data store support, may be what is needed to make it more useful.

The only issue is that it doesn't support something like a honeynet that does technically exist: the auto-tuning code will flag it as an allocated subnet. I need to work out how it should be overridden in cases like that.

Aside from the auto-detection, the function just comes down to

    return (a in local_nets && a !in used_address_space);

In your case you want this instead

    return (a in dark_address_space);

so I think the simplest thing that may possibly work for everyone is something like

    global dark_address_space: set[subnet] &redef;

and change the is_darknet logic to be

    if ( |dark_address_space| > 0 )
        return (a in dark_address_space);
    else
        return (a in local_nets && a !in used_address_space);

Or maybe just

    return (a in local_nets && (a in dark_address_space || a !in used_address_space));

but I could see a different user wanting this instead:

    return (a in local_nets && a in dark_address_space && a !in used_address_space);

for the use case of "dark_address_space is my darknet subnets, but something may be allocated without us knowing, so double check!"

I haven't quite figured this out yet... Maybe the answer is that there isn't a one-size-fits-all implementation and I just need to have 4 is_darknet functions, depending on how people want it to work:

    return (a in dark_address_space); #mode=darknet

    return (a in local_nets && a !in used_address_space); #mode=not_allocated

    return (a in local_nets && (a in dark_address_space || a !in used_address_space)); #mode=darknet_or_not_allocated

    return (a in local_nets && a in dark_address_space && a !in used_address_space); #mode=darknet_and_not_allocated

Actually, now that I finally wrote all this out, I see that it's just the 4 combinations of 2 boolean flags.
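
So, roughly, a single configurable function could look something like this (just a sketch; the DarknetMode/darknet_mode names are made up here, and Site::used_address_space is the set maintained by the auto-tuning code above):

    type DarknetMode: enum {
        DARKNET,                     # only explicitly listed dark subnets
        NOT_ALLOCATED,               # local but not in used_address_space
        DARKNET_OR_NOT_ALLOCATED,
        DARKNET_AND_NOT_ALLOCATED,
    };

    global dark_address_space: set[subnet] &redef;
    const darknet_mode = NOT_ALLOCATED &redef;

    function is_darknet(a: addr): bool
        {
        if ( darknet_mode == DARKNET )
            return (a in dark_address_space);
        else if ( darknet_mode == NOT_ALLOCATED )
            return (a in Site::local_nets && a !in Site::used_address_space);
        else if ( darknet_mode == DARKNET_OR_NOT_ALLOCATED )
            return (a in Site::local_nets && (a in dark_address_space || a !in Site::used_address_space));
        else
            return (a in Site::local_nets && a in dark_address_space && a !in Site::used_address_space);
        }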

Great! Glad to know that I am not the only one dealing with this glitch in scan-NG :slight_smile:
Now it totally makes sense. I was thinking that in our case the darknet and the un-allocated subnets are the same, but we will have to be careful, as you mentioned, since un-allocated subnets can get assigned without us knowing!

I knew that there were two improved scan scripts available other than the one that ships with Bro by default:
the scan.bro that you wrote, and the scan-NG script.

Hence I was thinking of migrating to one of those, and stumbled across this issue of defining the darknet and allocated nets.
(There was no way I would be able to define the complete list of allocated subnets in the scan-NG config without missing anything :slight_smile:
At the time I thought defining the allocated subnets was required, as there's a comment in scan-config.bro that reads:

"#### Important to configure for Landmine detection
if subnet_feed is empty then LandMine detection wont work"

hence I thought of tweaking that setting to define the darknet instead, but never got around to it.)

Great to know that there's already some code written by you that works around this issue!
Thanks a ton for all the explanation and the link to your scan script; great help!
I will go through it and see if I can get it up and running in our cluster :slight_smile:

Thanks Justin!

I think I know what was causing the intel log not to get generated.

As suggested by Justin, I disabled scan.bro, and the sensors look to be in a pretty stable state,
with 35-40% memory usage overall.

I realized (and tested) that when the following lines are enabled in local.bro, the intel.log stops getting generated after a day or so:
    @load frameworks/intel/do_expire
    redef Intel::item_expiration = 1day;

And when I comment out the above lines, intel.log has no problems and gets generated again.

I have been testing this for the past two days and verified it, with scan.bro disabled and the sensors in a pretty good state in terms of resource usage.

Hence, my hypothesis is that something is not working correctly that causes the intel items to expire after 1 day, so Bro no longer has valid intel data in memory to compare against traffic, and hence no intel logs get generated.

We are pulling down the feeds every day around 6:45am into the Bro feed dir.
I was thinking that if the feeds were not getting updated (i.e. if the feeds were the same as they were before pulling), then that might cause all the old entries (older than 1 day) to expire, and hence Bro would not generate intel.log.

But I compared the old feeds with the new feeds and verified that every day some IPs get added and some get removed, so the feeds do get modified every day and don't remain the same.
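
(For reference, the feeds are hooked in via the usual intel framework configuration, roughly like the following; the path below is just a placeholder for our feed dir:)

    @load frameworks/intel/seen
    # placeholder path; the real feed dir is where the daily pull lands
    redef Intel::read_files += {
        "/path/to/feeds/bad-IP.intel",
    };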

I will still try to troubleshoot the issue, but for the time being I have disabled the do_expire script so that the intel.log file gets generated.

Has anyone observed this kind of issue/behavior?

Thanks,

Fatema.