File Extraction: doc/xls=ok, docx/xlsx=ko

Hello,
I am trying to find out whether I made a mistake in my extract.bro script.
Basically, I am able to extract doc, pdf, and xls files, but not docx, xlsx, xlsm, etc. (all the new Office formats).
The script looks like this:
 
global ext_map: table[string] of string = {
    ["application/msword"] = "doc",
    ["application/vnd.openxmlformats-officedocument.wordprocessingml.document"] = "docx",
    ["application/vnd.openxmlformats-officedocument.wordprocessingml.template"] = "dotx",
    ["application/vnd.ms-word.document.macroEnabled.12"] = "docm",
    ["application/vnd.ms-word.template.macroEnabled.12"] = "dotm",
    ["application/vnd.ms-excel"] = "xls",
    ["application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"] = "xlsx",
    ["application/vnd.openxmlformats-officedocument.spreadsheetml.template"] = "xltx",
    ["application/vnd.ms-excel.sheet.macroEnabled.12"] = "xlsm",
    ["application/vnd.ms-excel.template.macroEnabled.12"] = "xltm",
    ["application/vnd.ms-excel.addin.macroEnabled.12"] = "xlam",
    ["application/vnd.ms-excel.sheet.binary.macroEnabled.12"] = "xlsb",
    ["application/vnd.ms-powerpoint"] = "ppt",
    ["application/vnd.openxmlformats-officedocument.presentationml.presentation"] = "pptx",
    ["application/vnd.openxmlformats-officedocument.presentationml.template"] = "potx",
    ["application/vnd.openxmlformats-officedocument.presentationml.slideshow"] = "ppsx",
    ["application/vnd.ms-powerpoint.addin.macroEnabled.12"] = "ppam",
    ["application/vnd.ms-powerpoint.presentation.macroEnabled.12"] = "pptm",
    ["application/vnd.ms-powerpoint.template.macroEnabled.12"] = "potm",
    ["application/vnd.ms-powerpoint.slideshow.macroEnabled.12"] = "ppsm",
} &default ="";

 
event file_new(f: fa_file)
    {
    if ( ! f?$mime_type )
        return;
    local ext = "";
    if ( f?$mime_type )
        ext = ext_map[f$mime_type];
    #if ( ext !="pdf" && ext !="exe" && ext !="swf" )
    if ( ext !="doc" && ext !="docx" && ext !="dotx" && ext !="docm" && ext !="dotm" && ext !="xls" && ext !="xlsx" && ext !="xltx" && ext !="xlsm" && ext !="xltm" && ext !="xlam" && ext !="xlsb" && ext !="ppt" && ext !="pptx" && ext !="potx" && ext !="ppsx" && ext !="ppam" && ext !="pptm" && ext !="potm" && ext !="ppsm" )
        return;

    local fname = fmt("/bro/extracted/%s-%s.%s", f$source, f$id, ext);
    Files::add_analyzer(f, Files::ANALYZER_EXTRACT, [$extract_filename=fname]);
    break;
    }

In files.log I can see entries when the extraction matches:

1455104772.317535       FiWH8E2GK4LZmK8kYg      12.23.29.13  194.1.1.22     C9Ujyw1HodyV6hrs4f      SMTP    5       DATA_EVENT,EXTRACT,SHA1,MD5     application/msword      SPCH_100658601_1_Skillsupdatefebruary2016.doc  0.056888        T       F       44973   -       2736    0       F       -       -       -       -       /bro/extracted/SMTP-FiWH8E2GK4LZmK8kYg.doc
1455105508.920691       FiqR9N1j5G1JlUlDe       12.23.29.13  12.3.16.5   COsYzjbE2bCVGewz1       SMTP    7       SHA1,DATA_EVENT,MD5,EXTRACT     application/msword      SCD List - SS101-612a.vsd     0.148642 T       F       91656   -       2696    0       F       -       -       -       -       /bro/extracted/SMTP-FiqR9N1j5G1JlUlDe.doc
1455105575.354126       FmnQbA19ShsuCDh0bk      12.23.29.13  16.2.23.2   CXYSjQx0YmTqhDagf       SMTP    3       DATA_EVENT,MD5,SHA1,EXTRACT     application/msword      00336582.doc    0.378492        T       F       177152  -       0       0       F       -       c7c213a316143494115c905fd28938f9        8b7d7c28b0d2c28ad1287db60e7c26925181ab07        -       /bro/extracted/SMTP-FmnQbA19ShsuCDh0bk.doc

But there are no matches for the new Office files.

Do you have any idea why?

I have another question: to keep track of extracted files, how can I give them a traceable name, such as the real filename?

Thanks in advance.

I have never done this myself but it seems like f$info$filename could be a possible solution to your second question.

/Peter
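
On the second question, whichever field ends up supplying the original name, that name comes from the sender and should be cleaned before it becomes part of a path on disk. The following is a minimal sketch of that cleaning step only, written in Python rather than Bro script so it can be run on its own. The function name safe_extract_name and its parameters are hypothetical, and the character whitelist is an assumption, not something taken from Bro.

```python
import re


def safe_extract_name(source: str, file_id: str, original_name: str, ext: str) -> str:
    """Hypothetical helper: build a traceable, path-safe extraction filename."""
    # Keep only the last path component, whichever separator the sender used.
    base = original_name.replace("\\", "/").rsplit("/", 1)[-1]
    # Drop the sender's extension; the one derived from the MIME type is appended below.
    stem = base.rsplit(".", 1)[0] if "." in base else base
    # Replace anything outside a conservative whitelist, then trim stray dots.
    stem = re.sub(r"[^A-Za-z0-9._-]", "_", stem).strip(".")
    if not stem:
        stem = "unnamed"
    return f"{source}-{file_id}-{stem}.{ext}"


print(safe_extract_name("SMTP", "FmnQbA19ShsuCDh0bk", "00336582.doc", "doc"))
```

Keeping the source and file ID in front of the cleaned name means the file on disk can still be matched to its files.log entry even when two attachments share the same original name.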

I believe the problem is that, as far as Bro is concerned, the new Office files are really .zip archives.

Aside from the issue that docx shows up as a zip file, here is a fixed up version of that file_new event:

event file_new(f: fa_file)
    {
    # Skip files whose MIME type has not been identified.
    if ( ! f?$mime_type )
        return;

    # Skip MIME types that are not listed in ext_map.
    if ( f$mime_type !in ext_map )
        return;

    local ext = ext_map[f$mime_type];
    local fname = fmt("/bro/extracted/%s-%s.%s", f$source, f$id, ext);
    Files::add_analyzer(f, Files::ANALYZER_EXTRACT, [$extract_filename=fname]);
    }

Justin's correct that the headers of new Office files are very similar
to those of Zip files, but they do differ slightly. If you look at the
code that identifies a file as
application/vnd.openxmlformats-officedocument.wordprocessingml.document,
it uses this regular expression:

/^PK\x03\x04.{26}(\[Content_Types\]\.xml|_rels\x2f\.rels|word\x2f).*PK\x03\x04.{26}word\x2f/

Here is the regular expression that identifies Zip files:

/^PK\x03\x04.{2}/

It's possible that your new Office files don't fit the more specific
pattern. In that case, I'd suggest adding this mime_type to your table:
application/vnd.openxmlformats-officedocument

It uses the basic new Office file header as a regular expression:

/^PK\x03\x04\x14\x00\x06\x00/

If you use application/vnd.openxmlformats-officedocument, then you
can't assume that the file has a specific extension; all you know is
that the file is a new Office file.
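
To see how the three patterns relate, here is a small sketch that applies them to hand-built Zip headers. It is written in Python rather than Bro script so it can be run on its own, and it makes several assumptions: the helpers local_header and classify are hypothetical, the sample byte strings are made up rather than captured traffic, and re.DOTALL is used so that "." matches any byte, which may not be exactly how Bro's regex engine behaves.

```python
import re
import struct

# Python approximations of the three signatures quoted in this thread.
# re.DOTALL lets "." match any byte; Bro's own regex engine may differ.
DOCX_RE = re.compile(
    rb"^PK\x03\x04.{26}(\[Content_Types\]\.xml|_rels\x2f\.rels|word\x2f)"
    rb".*PK\x03\x04.{26}word\x2f",
    re.DOTALL,
)
OOXML_RE = re.compile(rb"^PK\x03\x04\x14\x00\x06\x00")
ZIP_RE = re.compile(rb"^PK\x03\x04.{2}", re.DOTALL)


def local_header(name: bytes, version: int = 20, flags: int = 6) -> bytes:
    """Hypothetical helper: a bare 30-byte Zip local file header plus the name."""
    return struct.pack(
        "<4sHHHHHIIIHH",
        b"PK\x03\x04", version, flags, 0, 0, 0, 0, 0, 0, len(name), 0,
    ) + name


def classify(data: bytes) -> list:
    """Return the label of every pattern that matches, most specific first."""
    checks = [("docx", DOCX_RE), ("ooxml", OOXML_RE), ("zip", ZIP_RE)]
    return [label for label, rx in checks if rx.search(data)]


# Made-up samples, not real documents.
office_like = local_header(b"[Content_Types].xml") + local_header(b"word/document.xml")
reordered = local_header(b"docProps/app.xml") + local_header(b"word/document.xml")
plain_zip = local_header(b"readme.txt", flags=0)

for label, sample in [("office_like", office_like),
                      ("reordered", reordered),
                      ("plain_zip", plain_zip)]:
    print(label, classify(sample))
```

In this sketch the office_like sample matches all three patterns, the reordered sample (whose first entry is not one of the names the specific pattern expects) matches only the generic Office header and the Zip pattern, and plain_zip matches only the Zip pattern. Which type Bro finally reports when several signatures match is not modelled here.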

Hope that clears up some confusion (and doesn't cause more)!

Josh

Thanks Josh!

One more tiny note, if anyone discovers files that they think should be matching a particular type and they aren't, please reach out. We are maintaining all of the file type identification ourselves at this point and I'd like to make sure that we're doing a really nice job at identifying file types.

  .Seth