File Extraction Related Scripting Questions

Hello:

I would like a quick way to get the size of the extract_files directory. If it meets a certain threshold, I don't want to extract the file. I looked for a builtin function that does this but could not find one. I then tried the following system command:

local somevar = system(fmt("du -b %s | cut -f1", FileExtract::prefix));

However, I am unable to capture the output (since it goes directly to stdout). Does anyone have any advice on how to tackle this?

Additionally, I was wondering whether Bro can identify the MIME types of modern Office documents down to the application they belong to (Excel, PowerPoint, etc.). From my testing, the only MIME type one gets for a modern Office document is 'application/zip'. This is technically correct, but I was hoping to zero in on this a little more by being able to specify 'application/vnd.openxmlformats-officedocument.presentationml.presentation' (if I wanted pptx files). Does Bro's MIME detection support this in any way?

Many thanks,
Jason

FWIW - I managed to cobble together the following PoC once I stumbled across 'exec' :)

when ( local dir_size = Exec::run([$cmd=fmt("du -b %s | cut -f1", FileExtract::prefix)]) )
    {
    if ( to_int(dir_size$stdout[0]) < dir_size_limit )
        print "file can be written";
    else
        print "file cannot be written";
    }

I am interested in whether this is the 'best' way or not. The drawback is that it requires the use of 'when', which means I have to wait a little before I can use the returned result. It also seems that if I place an 'extract' analyzer inside the if statement for the case where a file can be written, I get the error 'field value missing [dir_size$stdout]'. I am guessing this relates to a timing issue with the issued command? Back to the drawing board I suppose, but that is as far as I've gotten so far :)

Also interested as well in the MIME type question with respect to Office documents.

Thanks!
Jason

> FWIW - I managed to cobble together the following poc once I stumbled across 'exec' :)

Yep, that's probably the correct thing to do for now.

> The drawback is this required the use of 'when' which requires me to wait a little bit before I can utilize the returned result.

Since Bro needs to keep running in a non-blocking manner all the time, basically any solution you aim for will use 'when', since looking at the file system is almost intrinsically a blocking operation.

What I would recommend is that you have a scheduled event that regularly checks the size of the directory and modifies a global value to let you know if you're safe to extract or not. That will combine the benefit of the asynchronous operation with the benefit of being able to check in an if statement if your extraction directory is overly full.
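A minimal sketch of the scheduled-check approach described above, assuming the Exec module from base/utils/exec and GNU du. The names dir_size_limit, check_interval, extraction_allowed and check_extract_dir are hypothetical, and the limit and interval values are placeholders. It uses 'du -sb' rather than 'du -b' so that a single total line comes back even when the directory has subdirectories. The ?$stdout guard is there because the stdout field of the result is optional and is absent when the command prints nothing.

```bro
@load base/utils/exec

# Hypothetical tunables; the values are placeholders.
const dir_size_limit: count = 1073741824 &redef;   # bytes
const check_interval: interval = 30sec &redef;

# Updated asynchronously, read synchronously.
global extraction_allowed = T;

event check_extract_dir()
    {
    when ( local r = Exec::run([$cmd=fmt("du -sb %s | cut -f1", FileExtract::prefix)]) )
        {
        # stdout is only present if the command produced output.
        if ( r?$stdout && |r$stdout| > 0 )
            extraction_allowed = to_count(r$stdout[0]) < dir_size_limit;
        }

    # Re-arm the check regardless of when the command finishes.
    schedule check_interval { check_extract_dir() };
    }

event bro_init()
    {
    event check_extract_dir();
    }

event file_new(f: fa_file)
    {
    # Plain if statement: no 'when' needed at extraction time.
    if ( extraction_allowed )
        Files::add_analyzer(f, Files::ANALYZER_EXTRACT,
                            [$extract_filename=fmt("extract-%s", f$id)]);
    }
```

The flag can lag the real directory size by up to one check interval, so the limit should leave some headroom.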

> It also seems that if I place an 'extract' analyzer inside the if statement when a file can be written, I get the error 'field value missing [dir_size$stdout]'. This probably relates to a timing issue on the part of the issued command I am guessing?


I'm not sure why you're seeing that problem; that seems weird. However, I wouldn't expect that to work in general, because by the time the when statement returns, quite a bit of the file could already have been transferred.

> Also interested as well in the MIME type question with respect to Office documents.

Yeah, what would help a lot there is for someone to pull together files that they don't feel are being detected with accurate mime types and to provide those files or links to files on the internet that don't get detected accurately.

  .Seth

Thanks Seth,

Is it expected for Bro to detect specific Office documents? Older versions of Office documents (2003 and back I believe) had easily identifiable file magic that one could use to at least inform the observation that the file is an MS office document (ex: D0 CF 11 E0 A1 B1 1A E1).

Now, however, with 'Office Open XML', the initial file magic for modern Office documents is, at face value, equivalent to that of a ZIP file. With the right instrumentation, though, you can distinguish between a regular ZIP file and an Office document, down to being able to state whether one is a pptx, docx, etc.

When extracting files off the wire that are what I believe to be modern MS Office documents, Bro tends to classify them as ZIP files via the MIME type. This is technically correct, however there are higher fidelity attributes that may be absent, such as the fact that the file is an Office document for Word, Powerpoint, etc.

I did a little playing around to see if perhaps Bro was simply treating the 'ZIP' file magic match as the strongest, with the other, more application-centric matches buried in f$mime_types, the mime_matches vector defined here:

https://www.bro.org/sphinx-git/scripts/base/init-bare.bro.html#type-mime_matches

After testing, it doesn't seem that Bro is identifying this. Below is the snippet of code I am using to attempt this:

if ( f?$mime_types && |f$mime_types| == 2 )
    {
    if ( f$mime_types[0]$mime == "application/zip" )
        {
        ext = ext_map_zip_subset[f$mime_types[1]$mime];
        }
    }

The subset mime types I have defined to map the specific versions of MS Office are as follows:

application/vnd.openxmlformats-officedocument.wordprocessingml.document
application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
application/vnd.openxmlformats-officedocument.presentationml.presentation
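For illustration, the lookup above could be wrapped in a helper that scans every entry of the mime_matches vector instead of assuming exactly two matches with 'application/zip' first. This is only a sketch: the table name is taken from the snippet above, the function name zip_subtype_ext and the extension strings are hypothetical, and it only helps once a signature actually reports one of these types.

```bro
# Map of the three OOXML MIME types listed above to illustrative extensions.
const ext_map_zip_subset: table[string] of string = {
    ["application/vnd.openxmlformats-officedocument.wordprocessingml.document"] = "docx",
    ["application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"] = "xlsx",
    ["application/vnd.openxmlformats-officedocument.presentationml.presentation"] = "pptx"
};

# Hypothetical helper: return the extension for the first match found in
# the table, falling back to "zip" when none of the matches is known.
function zip_subtype_ext(matches: mime_matches): string
    {
    for ( i in matches )
        {
        if ( matches[i]$mime in ext_map_zip_subset )
            return ext_map_zip_subset[matches[i]$mime];
        }

    return "zip";
    }
```

Checking membership with 'in' before indexing avoids a runtime error when a match is not in the table.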

Also curious about JAR files too, those seem to fold into the broader ZIP file magic but may be detected with something a little more specific.

The magic/type list posted here (under ‘zip’) perhaps better illustrates the issue:
http://www.garykessler.net/library/file_sigs.html

Hope that helps explain better,
Jason

Seth and group:

As a tangential (but related) topic, it appears that the MIME type 'application/msword' is possibly being used to universally classify all legacy MS Office file types (doc, ppt, xls, etc.), instead of just Office documents with the .doc extension. From what I understand, the MIME type 'application/msword' is specifically used for 'doc' files and is not a universal description of any or all Office documents.

The following table supports this notion; as you can see, different extensions within Office have their own respective MIME types.

http://filext.com/faq/office_mime_types.php

In testing this, I had a script that pulled files off the wire that matched the ‘msword’ MIME type, waited, then reviewed. I ended up with a diverse collection of office documents (doc/xls).

It may be purposeful, since all OLECF files have the same magic (D0 CF 11 E0 A1 B1 1A E1). Is this the case? Would it be more appropriate/clear to have a MIME type such as ‘application/ole’? Additionally, if you look 512 bytes in you can determine the type of file for older office documents. Is this an opportunity to create clearer, more specific file type signatures?

I am certainly not an authority on this matter, but would appreciate any insight into the topic as it will help drive the direction of a solution I am developing.

Thanks,
Jason

> It may be purposeful, since all OLECF files have the same magic (D0 CF 11 E0 A1 B1 1A E1). Is this the case? Would it be more appropriate/clear to have a MIME type such as 'application/ole'? Additionally, if you look 512 bytes in you can determine the type of file for older office documents. Is this an opportunity to create clearer, more specific file type signatures?

I view this as the opportunity. We can make type signatures and indicators that fit our use case. Are you interested in leading an effort to clean up the MS Office document identification? That's a nice, tightly defined problem scope and it sounds like it's in an area that you need to address for yourself anyway.

> I am certainly not an authority on this matter, but would appreciate any insight into the topic as it will help drive the direction of a solution I am developing.

The general problem with this stuff is that everyone ends up saying the same thing. I'm sure that even the libmagic developers would say the same thing, because they are just trying to show MIME types that are defined and allocated by IANA. This is an area where we're just going to have to let ourselves be free to extend and expand beyond libmagic or even IANA in some cases (they have a mechanism for unallocated extensions that we should evaluate closely).

  .Seth

> I view this as the opportunity. We can make type signatures and indicators that fit our use case. Are you interested in leading an effort to clean up the MS Office document identification? That's a nice, tightly defined problem scope and it sounds like it's in an area that you need to address for yourself anyway.

I would be :). Would you mind pointing me in the right direction on how I might make type signatures and indicators as you describe? If it is as simple as adding more detailed content to an existing file or library, could you point me to the file I should be tinkering with? I've done this sort of thing before with Yara but have not explored doing so with Bro.

Thanks,
Jason

> I would be :).

Woo!

> Would you mind pointing me in the right direction to how I might make type signatures and indicators as you describe.

https://github.com/bro/bro/tree/master/scripts/base/frameworks/files/magic

Any attention to those file detections would be great. I would also like to start getting some tests in place that verify we are detecting these files correctly going into the future. Feel free to ask if you have any questions.

  .Seth

Just FYI to the group, I created the following after having spent some time looking at magic.sig. I placed them in general.sig and so far they seem to do the trick on identifying OLECF (legacy MS Office) and OOXML (modern MS Office) documents.

Seth indicated to me offline this would be reviewed and folded into the next release.

For your immediate use.

# Jason Batchelor Edits, 9/19/2014
# Signatures informed by the following resource:
# http://www.garykessler.net/library/file_sigs.html

signature file-olecf {
    file-magic /(\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1)/
    file-mime "application/olecf", 150
}

signature file-ooxml {
    file-magic /(\x50\x4b\x03\x04\x14\x00\x06\x00)/
    file-mime "application/vnd.openxmlformats-officedocument", 100
}