Making scan.bro great again.

I took a closer look at scan-NG and at the scan.bro that shipped with Bro 1.5 to understand how the detection could be better than what we have now. 1.5 wasn’t fundamentally better, but compared to what we are doing now it has an unfair advantage :slight_smile:

I found that it used tables like this:

global distinct_ports: table[addr] of set[port]
&read_expire = 15 mins &expire_func=port_summary &redef;

Not only does it use a default timeout of 15 minutes instead of 5, it uses read_expire. That means an attacker can send one packet every 14 minutes, 25 times, and still be tracked.

In other words, scan.bro as shipped with 1.5 can pick up slow scans spread over as much as a 6-hour period.
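To make the difference concrete, here is a small Python simulation (my own sketch, not Bro internals) contrasting the two expiry modes on exactly that slow scanner:

```python
# My own Python sketch (not Bro internals): read_expire restarts the
# clock on every access, create_expire measures only from insertion.

MIN = 60  # seconds

class ExpiringTable:
    def __init__(self, timeout, mode):
        self.timeout = timeout   # seconds
        self.mode = mode         # "read" or "create"
        self.entries = {}        # key -> {"created", "last", "count"}

    def observe(self, key, now):
        e = self.entries.get(key)
        if e is not None:
            ref = e["last"] if self.mode == "read" else e["created"]
            if now - ref > self.timeout:
                e = None         # state expired; start over
        if e is None:
            e = {"created": now, "last": now, "count": 0}
            self.entries[key] = e
        e["last"] = now
        e["count"] += 1
        return e["count"]        # probes accumulated in this entry

# One probe every 14 minutes, 25 times (~5.6 hours total):
probes = [i * 14 * MIN for i in range(25)]
read_t = ExpiringTable(15 * MIN, "read")
create_t = ExpiringTable(15 * MIN, "create")
read_max = max(read_t.observe("attacker", t) for t in probes)
create_max = max(create_t.observe("attacker", t) for t in probes)
print(read_max, create_max)  # 25 2: read_expire sees the whole scan,
                             # create_expire never accumulates past 2
```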

The sumstats-based scan.bro can only detect scans that fit in its fixed time window (it is effectively using create_expire, but, as Aashish points out, limited even further, since the ‘creation time’ is a fixed interval boundary regardless of when the attacker is first seen).

The tracking that 1.5 scan.bro has isn’t doing anything inherently better than what we have now, it’s just doing it over a much longer period of time. The actual detection it uses has the same limitations the current sumstats-based scan.bro has: it does not detect fully randomized port scans. It would benefit from the same “unification” changes.

Since fixing sumstats and adding new functionality to solve this problem in a generic way is a huge undertaking, I tried instead to just have scan.bro do everything itself. We may not be able to easily fix sumstats, but I think we can easily fix scan.bro by making it not use sumstats.

To see if this was even viable or a waste of time I wrote the script: it works. It sends new scan attempts to the manager and stores them in a similar ‘&read_expire = 15 mins’ table. This should detect everything that the 1.5-based version did, plus all the fully random scans that were previously missed. And with the simpler unified data structure and capped set sizes it will use almost zero resources.

Attached is the code I just threw on our dev cluster. It’s the implementation of “What is the absolute simplest thing that could possibly work”. It uses one event and two tables: one for the workers and one for the manager.
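In rough terms the design looks like this; a hypothetical Python mock-up (the real script is Bro, and the names and the threshold of 25 here are my assumptions):

```python
# Hypothetical mock-up of the "simplest thing that could possibly work"
# design: workers forward each new scan attempt to the manager via one
# cluster event; the manager keeps a capped set of (dst, port) pairs
# per scanner. Threshold and names are invented for illustration.

THRESHOLD = 25        # distinct targets before raising a notice (assumed)
CAP = 2 * THRESHOLD   # cap the set so one noisy scanner can't eat memory

recent_scan_attempts = {}   # manager side: scanner -> set of (dst, port)

def scan_attempt(scanner, dst, port):
    """Manager-side handler for the single cluster event."""
    targets = recent_scan_attempts.setdefault(scanner, set())
    if len(targets) < CAP:
        targets.add((dst, port))
    return len(targets) >= THRESHOLD   # True -> raise a notice

# Simulate a worker reporting 30 hosts probed on port 23:
hit = False
for i in range(30):
    hit = scan_attempt("198.51.100.7", "10.0.0.%d" % i, 23) or hit
print(hit)  # True: the threshold was crossed
```

Because the key is always the (dst, port) pair, address scans, port scans, and fully random scans all land in the same capped set, which is what keeps the resource usage near zero.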

What does this look like from a CPU standpoint?

This graph shows a number of experiments.

  • The first block around 70% is the unified sumstats-based scan.bro plus a hacked-up sumstats/cluster.bro to do data transfer more efficiently.
  • The next block at 40% was the unified scan.bro hacked up to make the manager do all the sumstats work (worked, but had issues).
  • The small spike back up to 70% was a return to the unified scan.bro that is in git, with the threshold changed back to 25.
  • The spike up to 170-200% was a return to stock sumstats/cluster.bro. This is what 2.5 would be with the sumstats-based scan.bro.
  • The drop back down to 40% is the switch to the attached scan.bro, which does not use sumstats at all.

The ‘duration’ is TODO in the notices, but otherwise everything works. I want to get the start time directly from the timing information in the table, but I’m not sure if bro exposes it or even stores it in a usable way. If there’s no way to get it out of the table, I just need to track when an attacker is first seen separately, which is easy enough to do.
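The first-seen fallback is simple enough to sketch; here is a hypothetical Python illustration of the idea (names are made up):

```python
# Sketch of the fallback mentioned above: track when each attacker is
# first seen in a separate table, so a notice can report a duration.
# Pure illustration in Python; the real version would be a Bro table.
import time

first_seen = {}  # addr -> timestamp of first scan attempt

def attempt_duration(scanner, now=None):
    now = time.time() if now is None else now
    first_seen.setdefault(scanner, now)
    return now - first_seen[scanner]   # duration of the scan so far

print(attempt_duration("192.0.2.1", now=100.0))  # 0.0 on first sight
print(attempt_duration("192.0.2.1", now=898.0))  # 798.0 -> "13m18s"
```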

scan.bro (6 KB)

I know… I send too many emails :slight_smile:

I let the rewritten script run over the weekend; CPU and memory were stable.

I added one additional table to store known scanners so it can completely purge a scanner's state. This cut the total amount of data stored in half, as measured by:

while true;do echo $(date) $(broctl print Scan::recent_scan_attempts |sort -u| wc -l);sleep 30m;done | tee -a keys.log

Currently this is around 155,000 for us, i.e. 155,000 (addr, port) records. At approximately 16 bytes for each IP and 2 bytes for each port, that gives ~3 MB of raw data, times whatever the overhead in bro is.
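A quick back-of-the-envelope check of that arithmetic:

```python
# Back-of-the-envelope check of the numbers above.
records = 155_000
bytes_per_record = 16 + 2   # 16 bytes per (IPv6-sized) addr, 2 per port
raw_mb = records * bytes_per_record / 1e6
print(round(raw_mb, 1))     # ~2.8, i.e. roughly 3 MB of raw key data
```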

I also fixed the duration and port formatting issues, so now it properly shows things like:

… scanned at least 100 unique hosts on port 3306/tcp in 13m18s
… scanned at least 70 unique hosts on ports 23/tcp, 2222/tcp, 22/tcp in 102m27s
… scanned at least 100 unique hosts on port 23/tcp in 0m1s
… scanned at least 99 hosts on 80 ports in 0m52s

I also further simplified the connection filtering that feeds into scan detection. I think it now finally has the bare minimum needed to detect scans, and it no longer flags connections affected by capture loss as scans.

The last graph I included was a bit of a mess; this one is a little clearer.

It shows 3 experiments, from left to right:

  • Stock scan.bro
  • Unified scan.bro that still uses sumstats
  • Unified scan.bro rewritten to not use sumstats and to work like the 1.5 version did (attached)

Also interesting is a graph of the network traffic during the same timeframe:

The positive line is manager → worker traffic, and the negative line is worker → manager traffic.

The negative line includes log writes, so the floor there won’t be zero.

scan.bro (6.54 KB)

Indeed :slight_smile: I realized one of the bigger issues in the sumstats-based code is not really the detection of scans, but what happens AFTER the detection. Once a scan is detected it keeps accumulating data, or, possibly only slightly better, keeps trying to accumulate data.

  • Connection events are used to generate sumstats observations,
  • the observations feed into the sumstats framework,
  • which may cross a threshold,
  • which generates notices that are fed into the notice framework,
  • which are often suppressed for at least 1 hr by default.

However, sumstats has no idea that the only reason it is collecting observations is to raise a notice that may currently be suppressed for an entire day.

The observations don't stop once an attacker has already triggered a notice. The whole machine keeps running even though nothing visible will ever come out of it.

I managed to fix this in the sumstats based unified scan.bro, but it is only a partial fix.

The code that does this is in this version:

Basically this part:

global known_scanners: table[addr] of interval &create_expire=10secs &expire_func=adjust_known_scanner_expiration;

# There's no way to set a key to expire at a specific time, so we
# first set the key's value to the duration we want, and then
# use the expire_func to return the desired remaining time.
event Notice::begin_suppression(n: Notice::Info)
    {
    if ( n$note == Port_Scan || n$note == Address_Scan || n$note == Random_Scan )
        known_scanners[n$src] = n$suppress_for;
    }

function adjust_known_scanner_expiration(s: table[addr] of interval, idx: addr): interval
    {
    local duration = s[idx];
    s[idx] = 0secs;
    return duration;
    }
and then later, the checks are aborted early with:

    if ( scanner in known_scanners )
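The two-step expiration can be modeled outside of Bro; here is a small Python sketch of my own (not Bro semantics, just the mechanism) showing how the expire callback hands the stored lifetime back exactly once:

```python
# Python model (my sketch, not Bro) of the &expire_func trick above:
# a key can't be given an absolute expiry directly, so its value holds
# the desired extra lifetime and the expire callback hands it back once.

known_scanners = {}   # addr -> remaining lifetime in seconds

def adjust_known_scanner_expiration(table, idx):
    """Called when the short create_expire fires; the return value is
    how much longer the key should be kept alive."""
    remaining = table[idx]
    table[idx] = 0.0       # the next expiration really deletes the key
    return remaining

known_scanners["203.0.113.9"] = 3600.0   # n$suppress_for from the notice

first = adjust_known_scanner_expiration(known_scanners, "203.0.113.9")
second = adjust_known_scanner_expiration(known_scanners, "203.0.113.9")
print(first, second)   # 3600.0 0.0: extended once, then allowed to die
```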

This works, but the reason it is incomplete is that if the notice was triggered by an intermediate update, sumstats still contains data for this attacker. That data will be sent over to the manager at the end of the epoch, even though by then it isn't needed anymore, since there was only one threshold and it was already crossed. The fix for that would require more changes inside of sumstats:

if (intermediate update crossed a threshold && the number of thresholds is 1)
    instruct all worker nodes to purge any data associated with this key

My non-sumstats-based scan.bro and Aashish's scan-NG both handle 'known scanner' suppression directly with a cluster event. Because there's no middleman (sumstats), it ends up being a bit simpler.

The Notice::begin_suppression trick itself is nice, though, since it lets you configure suppression intervals in one place instead of having to configure policy-specific suppression intervals and keep them in sync.