File extraction after checking hash.

Hi,

I was reading about Bro's file analysis framework, and know that there are file analyzers that can be attached to files Bro sees on network connections.
I am currently extracting all the ‘application/x-dosexec’ files from HTTP connections, and realized that
there are a lot of files that are just duplicates (i.e., with the same hashes).

Hence I was thinking of writing a Bro script that uses the file analysis framework to check the hash of a file against a set of hashes already seen (extracted) by Bro, and skips extraction if the hash is already in the set.

I tried adding Files::add_analyzer(f, Files::ANALYZER_EXTRACT, …) in the file_new, file_sniff and file_state_removed events (it didn’t work in the last one), but it turns out that the file_hash event triggers later than all of these, so the hashes get calculated after the file extraction analyzer has run.

Hence I wanted to ask: is it possible to add Files::ANALYZER_EXTRACT after the Files::ANALYZER_MD5 analyzer, so that I can get the hash first and compare it against the set before deciding whether to extract the file?

Thanks,
Fatema.

Unfortunately not. Since we don't know the hash of the file when we see the beginning we can't yet determine that we don't want to extract the file. Sort of a chicken and egg problem. :slight_smile:
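
To illustrate the chicken-and-egg point outside Bro: a streaming hash is only final once the last byte has been fed in, so nothing computed from the first chunk can say whether the whole file is a duplicate. A minimal Python sketch follows; the chunked payloads are made up for illustration, and this is not a description of how Bro's analyzers are implemented.

```python
import hashlib


def streaming_md5(chunks):
    """Feed chunks to MD5 one at a time, recording the digest after each."""
    md5 = hashlib.md5()
    digests = []
    for chunk in chunks:
        md5.update(chunk)
        # copy() lets us peek at the digest without finalizing the stream.
        digests.append(md5.copy().hexdigest())
    return digests


# Two hypothetical files that share the same first chunk (e.g. a PE header).
file_a = [b"MZ\x90\x00", b"payload-one"]
file_b = [b"MZ\x90\x00", b"payload-two"]

digests_a = streaming_md5(file_a)
digests_b = streaming_md5(file_b)

print(digests_a[0] == digests_b[0])    # same prefix, same intermediate digest
print(digests_a[-1] == digests_b[-1])  # the final digests differ
```

The decision to extract has to be made while the first chunk is arriving, which is exactly when the two files are still indistinguishable.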

  .Seth

Thanks Seth for confirming!
I think we can go through the extractions afterwards and write some sort of script to delete the dups. :slight_smile:
And the same for hashes: I asked Wes Young about the query limit on the CIF server (REN-ISAC).
I know that we can query the CIF server for a given hash and get back results with the CIF confidence rating and other fields.
Hence I will be writing some scripts to get unique hashes and malware execs from traffic :slight_smile:
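
A script to delete the dups after the fact could look roughly like the Python sketch below. It assumes all extracted files sit in one directory and that keeping the first file per hash (in name order) is acceptable; the function names and the dry_run flag are hypothetical, not part of Bro.

```python
import hashlib
import os


def file_sha1(path, chunk_size=65536):
    """Hash a file in chunks so large extractions are not read into memory at once."""
    sha1 = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            sha1.update(chunk)
    return sha1.hexdigest()


def remove_duplicates(directory, dry_run=True):
    """Keep the first file seen for each hash; return the duplicate paths.

    With dry_run=True nothing is deleted, the duplicates are only reported.
    """
    seen = {}
    duplicates = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue
        digest = file_sha1(path)
        if digest in seen:
            duplicates.append(path)
            if not dry_run:
                os.remove(path)
        else:
            seen[digest] = path
    return duplicates
```

Running it once with dry_run=True against the extraction directory shows what would be removed before anything is actually deleted.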

Thanks!
Fatema.

So here’s a simple script that adds a ‘uniq_hash’ column to files.log showing
whether Bro has seen that hash before (within the last day).

module Uniq_hashes;

redef record Files::Info += {
    # Add host and uniq_hash columns to show where the file was downloaded
    # from and whether it is seen for the first time or is a duplicate.
    host: string &optional &log;
    uniq_hash: bool &optional &log;
};

#global uniq_hashes: set[string];
global uniq_hashes: set[string] &create_expire=1day;

event file_hash(f: fa_file, kind: string, hash: string)
{
    print "file_hash", f$id, kind, hash;

    if ( f?$http && f$http?$host )
        f$info$host = f$http$host;

    if ( hash in uniq_hashes )
        f$info$uniq_hash = F;
    else
    {
        add uniq_hashes[hash];
        f$info$uniq_hash = T;
    }
}

Then I can grep the hashes with uniq_hash=T and query the CIF server for analysis.
I can also script getting the name of the extracted file from the ‘extracted’ field in files.log where uniq_hash=F
and delete that file in near real time, after Bro has extracted it.
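
The grep step could be sketched in Python along these lines. It assumes the default tab-separated files.log with a #fields header, "-" for unset values, T/F for booleans, and that the md5 column is populated; the sample records are made up and show only the columns used here.

```python
def parse_bro_log(lines):
    """Yield one dict per record from a Bro tab-separated log."""
    fields = None
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("#fields"):
            fields = line.split("\t")[1:]
        elif line and not line.startswith("#") and fields:
            yield dict(zip(fields, line.split("\t")))


def split_by_uniqueness(records):
    """Return (first-seen hashes, extracted file names of duplicates)."""
    new_hashes, duplicate_files = [], []
    for rec in records:
        if rec.get("uniq_hash") == "T" and rec.get("md5", "-") != "-":
            new_hashes.append(rec["md5"])
        elif rec.get("uniq_hash") == "F" and rec.get("extracted", "-") != "-":
            duplicate_files.append(rec["extracted"])
    return new_hashes, duplicate_files


# Made-up sample records for illustration.
sample = [
    "#fields\tfuid\tmd5\textracted\tuniq_hash",
    "F1\taaa\textract-1.exe\tT",
    "F2\taaa\textract-2.exe\tF",
    "F3\t-\t-\t-",
]
new_hashes, duplicate_files = split_by_uniqueness(parse_bro_log(sample))
print(new_hashes, duplicate_files)
```

The first list is what would be sent to the CIF server, and the second is the set of extracted files that could be deleted.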

Before I test it in production, I want to ask whether there is a way to clear the uniq_hashes set right at midnight,
so that unique files and hashes are logged in files.log on a daily basis. I don’t want that variable to grow without bound
and consume a lot of memory, so I thought one day would be a reasonable period for flushing the set. An exact cutoff
will also give an idea of the unique hashes queried and the number of execs extracted per day.

Any help appreciated!

Thanks,
Fatema.

And, then I can grep the hashes with uniq_hash=T and query the cif server for analysis.
Also, can script to get the name of the extracted file from the 'extracted' field in files.log with uniq_hash=F
and delete that file almost realtime, after Bro has extracted that file.

Do you know that the intel framework supports hashes? If you export a feed of hashes from CIF you can load that into Bro and alert on known-bad hashes in real time.
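
As a rough sketch of the export step, the Python below renders a list of hashes in the tab-separated layout the intel framework reads (a #fields header, then indicator, indicator_type and meta.source columns, with Intel::FILE_HASH as the type). The function name and the "cif-export" source label are hypothetical, and how the hashes are pulled out of CIF is not covered.

```python
def build_intel_file(hashes, source):
    """Render hashes as Intel framework input: tab-separated, with a #fields header."""
    lines = ["#fields\tindicator\tindicator_type\tmeta.source"]
    # Normalize case and drop blanks and repeats before writing.
    for digest in sorted(set(h.strip().lower() for h in hashes if h.strip())):
        lines.append("\t".join([digest, "Intel::FILE_HASH", source]))
    return "\n".join(lines) + "\n"


# Placeholder input: the MD5 of an empty file, given twice in different case.
example = build_intel_file(
    ["D41D8CD98F00B204E9800998ECF8427E", "d41d8cd98f00b204e9800998ecf8427e"],
    source="cif-export",
)
print(example, end="")
```

The resulting file would then be listed in Intel::read_files so that Bro loads it.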

Before I can test it in production, I want to ask if there is a way I can delete the contents of set uniq_hashes right at the midnight
so that we can get unique files and hashes on a daily basis logged in files.log? I don't want that variable to grow out of bound,
consuming lot of memory, hence thought 1 day should be reasonable period of time to flush the contents of the set and exact time line
will give an idea of uniq hashes queried daily and no. of execs extracted on daily basis.

You can probably do it using something like this:

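global SECONDS_IN_DAY = 60*60*24;

# Returns the next midnight (UTC) after the current network time.
function midnight(): time
{
    local now = network_time();
    local dt = time_to_double(now);
    # floor() gives the start of the current day, so add one day to land in
    # the future. double_to_count() rounds to the nearest integer, which
    # would only pick the next midnight after noon UTC.
    local mn = (floor(dt / SECONDS_IN_DAY) + 1.0) * SECONDS_IN_DAY;
    return double_to_time(mn);
}

function interval_to_midnight(): interval
{
    return midnight() - network_time();
}

event reset_hashes()
{
    uniq_hashes = set(); #I think this is the proper way to clear a set?
    # Re-arm the timer so the set is cleared every midnight, not just once.
    schedule interval_to_midnight() { reset_hashes() };
}

event bro_init()
{
    print "Time to midnight:", interval_to_midnight();
    schedule interval_to_midnight() { reset_hashes() };
}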

I think that might work properly except for the timezone being UTC, so it might need to be adjusted, or replaced with something different altogether.
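
The arithmetic is easy to check outside Bro. The subtraction only yields a positive interval if the target is the next midnight rather than the most recent one, and the timezone can be handled by shifting to local time before flooring. A small Python sketch, assuming a fixed UTC offset (which ignores daylight saving changes); the function name is hypothetical:

```python
import math

SECONDS_IN_DAY = 60 * 60 * 24


def seconds_to_next_midnight(now, utc_offset=0):
    """Seconds from `now` (epoch seconds, UTC) until the next local midnight.

    utc_offset is local time minus UTC in seconds, e.g. -4 * 3600 for UTC-4.
    """
    local_now = now + utc_offset
    # Floor to the start of the local day, then step to the following day.
    next_local_midnight = (math.floor(local_now / SECONDS_IN_DAY) + 1) * SECONDS_IN_DAY
    return next_local_midnight - local_now


now = 1500000000  # 2017-07-14 02:40:00 UTC
print(seconds_to_next_midnight(now))             # 76800 (21h20m to 00:00 UTC)
print(seconds_to_next_midnight(now, -4 * 3600))  # 4800 (1h20m to 00:00 at UTC-4)
```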

Seth has this plugin: https://github.com/sethhall/bro-approxidate

which would let you do

local md = approxidate("midnight");

If it was packaged for bro-pkg it would be easier to install though :slight_smile:

The known hosts/services/certs scripts need a framework to do things like this, so 2.6 may end up having this as a built-in feature.

Hi Justin,

Do you know that the intel framework supports hashes? If you export a feed of hashes from CIF you can load that into bro and do the alerting on known hashes bad in real time.

Yes. And that was the plan, but unfortunately I couldn’t get a list of hash feeds pulled down from REN-ISAC. It’s interesting that they provide other feeds but not hashes (I will ask on the REN-ISAC mailing list to confirm).
But I figured out that you can query their database to get information about a particular hash.
I also tried looking for a good open source hash feed but couldn’t find one, so I don’t have any hash feeds in intel currently :frowning:

Thank you for the code, works perfect! :slight_smile:
Made a little tweak: replaced network_time() with current_time() in both places.
For some reason I was getting 0.0 as the network_time() value when I ran the code on try.bro.org with a sample HTTP pcap.

Also added "local mn_EST = mn + 14400.0;" in the midnight() function to get local EST in a quick and dirty way. :slight_smile: (I know the best way to do it is to use Seth’s plugin, will try that next.)

Hence, the complete script looks like this now:

module Uniq_hashes;

redef record Files::Info += {
    # Add host and uniq_hash columns to show where the file was downloaded
    # from and whether it is seen for the first time or is a duplicate.
    host: string &optional &log;
    uniq_hash: bool &optional &log;
};

global SECONDS_IN_DAY = 60*60*24;
global uniq_hashes: set[string];

# Returns the next local midnight, using a fixed offset of 14400 s
# (4 hours behind UTC).
function midnight(): time
{
    local now = current_time();
    local dt = time_to_double(now);
    # Shift to local time, floor to the start of the local day and add one
    # day so the result is always in the future. double_to_count() rounds
    # to the nearest integer, so it can land on a midnight already passed.
    local mn = (floor((dt - 14400.0) / SECONDS_IN_DAY) + 1.0) * SECONDS_IN_DAY;
    local mn_EST = mn + 14400.0;
    return double_to_time(mn_EST);
}

function interval_to_midnight(): interval
{
    return midnight() - current_time();
}

event reset_hashes()
{
    uniq_hashes = set(); #I think this is the proper way to clear a set?
    # Re-arm the timer so the set is cleared every midnight, not just once.
    schedule interval_to_midnight() { reset_hashes() };
}

event file_hash(f: fa_file, kind: string, hash: string)
{
    #print "file_hash", f$id, kind, hash;

    if ( f?$http && f$http?$host )
        f$info$host = f$http$host;

    if ( hash in uniq_hashes )
        f$info$uniq_hash = F;
    else
    {
        add uniq_hashes[hash];
        f$info$uniq_hash = T;
    }
}

event bro_init()
{
    #print "current_time", current_time();
    #print "midnight", midnight();
    #print "Time to midnight:", interval_to_midnight();
    schedule interval_to_midnight() { reset_hashes() };
}

Thanks,
Fatema.

Do you know that the intel framework supports hashes? If you export a
feed of hashes from CIF you can load that into bro and do the alerting on
known hashes bad in real time.

Yes. And that was the plan, but unfortunately, I couldn't get the list of
the feeds (hashes) pulled down from REN-ISAC

If you come up with a feed, using the intel framework should be
straightforward. We did a POC that extracts files (I think below 100MB) and
preserves them only in case of an intel hit (see
https://github.com/J-Gras/intel-extensions/blob/master/scripts/preserve_files.bro).
The only thing to set up besides extraction and this script is a cron job
deleting the extracted files that aren't of interest. To avoid dups, one
might want to name the extracted files according to their hash or
something like that.
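
The cron job could be as small as the Python sketch below. It assumes the files worth keeping are named in a hypothetical keep set and that everything else older than a cutoff can go; how preserve_files.bro itself marks or stores preserved files is not covered here.

```python
import os
import time


def purge_old_files(directory, max_age_seconds, keep=(), now=None):
    """Delete regular files older than max_age_seconds, except names in keep."""
    now = time.time() if now is None else now
    removed = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if name in keep or not os.path.isfile(path):
            continue
        # Age is judged by modification time, i.e. when extraction last wrote the file.
        if now - os.path.getmtime(path) > max_age_seconds:
            os.remove(path)
            removed.append(name)
    return removed
```

Choosing a cutoff comfortably longer than the time a large file takes to extract avoids deleting files that are still being written.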

Jan