Thanks Seth,
Is it expected for Bro to detect specific Office documents? Older versions of Office documents (2003 and earlier, I believe) had easily identifiable file magic that could at least tell you the file is an MS Office document (e.g. D0 CF 11 E0 A1 B1 1A E1).
Now, however, with Office Open XML, the initial file magic for modern Office documents is, at face value, equivalent to that of a ZIP file. With proper instrumentation, though, you can delineate between a regular ZIP file and an Office document, down to being able to state whether it is a pptx, docx, etc.
When extracting files off the wire that I believe to be modern MS Office documents, Bro tends to classify them as ZIP files via the MIME type. This is technically correct, but higher-fidelity attributes may be absent, such as the fact that the file is an Office document for Word, PowerPoint, etc.
I did a little playing around to see whether Bro was simply reporting the ZIP file magic match as the strongest, with the more application-specific matches buried in the f$mime_types field (a mime_matches vector), as defined here:
https://www.bro.org/sphinx-git/scripts/base/init-bare.bro.html#type-mime_matches
After running this check, it doesn’t seem that Bro is identifying the more specific types. Below is the snippet of code I am using to attempt this:
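if ( f?$mime_types && |f$mime_types| == 2 )
    {
    # Only look at the second match when the first one is plain ZIP,
    # and only index the table when the key is actually present.
    if ( f$mime_types[0]$mime == "application/zip" &&
         f$mime_types[1]$mime in ext_map_zip_subset )
        {
        ext = ext_map_zip_subset[f$mime_types[1]$mime];
        }
    }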
The subset of MIME types I have defined to map the specific kinds of MS Office documents is as follows:
application/vnd.openxmlformats-officedocument.wordprocessingml.document
application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
application/vnd.openxmlformats-officedocument.presentationml.presentation
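For illustration, a mapping table along these lines could hold the three MIME types above. This is only a sketch: the extension strings and the &default value are assumptions chosen for the example, not necessarily what the actual ext_map_zip_subset contains.

```bro
# Hypothetical definition of ext_map_zip_subset: maps the more specific
# Office Open XML MIME types to a file extension. The extension values
# and the "zip" fallback are illustrative assumptions.
global ext_map_zip_subset: table[string] of string = {
    ["application/vnd.openxmlformats-officedocument.wordprocessingml.document"] = "docx",
    ["application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"] = "xlsx",
    ["application/vnd.openxmlformats-officedocument.presentationml.presentation"] = "pptx",
} &default="zip";
```

With a &default attribute, looking up a MIME type that is not in the table yields "zip" instead of raising a runtime error.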
I'm also curious about JAR files; those seem to fold into the broader ZIP file magic but could perhaps be detected with something a little more specific.
The magic/type list posted here (under ‘zip’) perhaps better illustrates the issue:
http://www.garykessler.net/library/file_sigs.html
Hope that helps explain better,
Jason