http incomplete file extraction (Files::ANALYZER_EXTRACT)

Hi!

I am relatively new to Bro, so please excuse me if I missed the obvious solution.

I want to extract files downloaded via HTTP from a pcap file, but the files are never extracted completely.
They seem to be truncated at about 1 MB. My Bro script is quite simple:

# Attach the extraction analyzer to every file Bro sees.
event file_new(f: fa_file)
{
    Files::add_analyzer(f, Files::ANALYZER_EXTRACT);
}

Are there any other events I have to catch to get the complete file?

When I download a test file from [1] with size 3521964 bytes, only 960204 bytes are extracted. I checked with
Wireshark and tcpflow that the download was completely captured in the pcap.

I tested with Bro 2.3.2 and the current dev version from git.

Have a nice weekend!

Franky

[1] http://ipv4.download.thinkbroadband.com/5MB.zip

event file_new(f: fa_file)
{
    Files::add_analyzer(f, Files::ANALYZER_EXTRACT);
}

Nope, that should work.

Are there any other events I have to catch to get the complete file?

When I download a test file from [1] with size 3521964 bytes, only 960204 bytes are extracted. I checked with
Wireshark and tcpflow that the download was completely captured in the pcap.

Could you show me the files.log entry and the associated conn.log entry?

  .Seth

Hi again!

Thanks for the quick reply!

Asking for the logs is fair; I should have sent them in my initial mail.
I was also wondering why the correct size is in the logs. If data were missing, I would
at least have expected a warning or some missing_bytes.

I hope the logs are readable inline in the mail, attachments seem to be filtered.

Thanks!

Franky

conn.log:
#fields ts uid id.orig_h id.orig_p id.resp_h id.resp_p proto service duration orig_bytes resp_bytes conn_state local_orig local_resp missed_bytes history orig_pkts orig_ip_bytes resp_pkts resp_ip_bytes tunnel_parents
#types time string addr port addr port enum string interval count count string bool bool count string count count count count set[string]
1427461795.952391 CiJ3X2Tf0O0EVCX6a 192.168.2.103 32880 80.249.99.148 80 tcp - - - - OTH - - 0 C 0 0 0 0 (empty)
1427461798.647371 CIS6ae2iV8YZoi8wa3 192.168.2.103 37219 173.194.116.186 80 tcp - 0.016545 0 0 OTH - - 0 Ca 0 0 1 52 (empty)
1427461795.983496 CXNI8E2HLRrrW8qOh1 192.168.2.103 32880 80.249.99.148 80 tcp - 2.369540 0 5243156 SHR - - 0 hCadcf 0 0 3637 5422092 (empty)
1427461798.167374 C9kHyj4HJdMN1lMwtd 192.168.2.103 45447 74.125.136.94 80 tcp - 0.044061 0 0 OTH - - 0 Ca 0 0 1 52 (empty)
1427461798.999381 ClgBVx3a9pQkee3uHf 192.168.2.103 34635 173.194.116.169 80 tcp - 0.016103 0 0 OTH - - 0 Ca 0 0 1 52 (empty)
#close 2015-03-27-15-02-36

files.log:
#fields ts fuid tx_hosts rx_hosts conn_uids source depth analyzers mime_type filename duration local_orig is_orig seen_bytes total_bytes missing_bytes overflow_bytes timedout parent_fuid extracted md5 sha1 sha256
#types time string set[addr] set[addr] set[string] string count set[string] string string interval bool bool count count count count bool string string string string string
1427461796.014318 FbVw4P1oMybfKCu0Wg 80.249.99.148 192.168.2.103 CXNI8E2HLRrrW8qOh1 HTTP 0 EXTRACT - - 0.540901 - F 960204 5242880 0 0 F - extract-1427461796.555219-HTTP-FbVw4P1oMybfKCu0Wg - - -

http.log:

#fields ts uid id.orig_h id.orig_p id.resp_h id.resp_p trans_depth method host uri referrer user_agent request_body_len response_body_len status_code status_msg info_code info_msg filename tags username password proxied orig_fuids orig_mime_types resp_fuids resp_mime_types
#types time string addr port addr port count string string string string string count count count string count string string set[enum] string string set[string] vector[string] vector[string] vector[string] vector[string]
1427461796.014318 CXNI8E2HLRrrW8qOh1 192.168.2.103 32880 80.249.99.148 80 0 - - - - - 0 960204 200 OK - - - (empty) - - - - - FbVw4P1oMybfKCu0Wg -

Hi Kevin,

Thanks for your mail. I will have a look at the examples. As for your hint about the extraction process:
I still doubt that the root of the problem lies there, because other tools successfully extract the
files from the same pcap.

Franky

In files.log, the value of total_bytes is just taken from the HTTP Content-Length header. Since the value of seen_bytes is less than total_bytes, you can suspect Bro didn’t see the full file for some reason. Do you have a weird.log containing any obvious clues? Else, I may need the original pcap to understand what went wrong.

- Jon
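
The comparison Jon describes (seen_bytes versus total_bytes) can be automated. What follows is only a minimal sketch, not part of Bro: the helper names `read_bro_log` and `incomplete_files` are made up for illustration, and it assumes the log is tab-separated the way Bro writes it to disk (the copies pasted inline in this thread have lost their tabs). The sample record uses the fuid and byte counts from the files.log entry above, trimmed to a few columns.

```python
import io

def read_bro_log(lines):
    """Yield one dict per record of a tab-separated Bro log, keyed by the #fields header."""
    fields = None
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("#fields"):
            # The first token is the literal "#fields"; the rest are column names.
            fields = line.split("\t")[1:]
        elif line and not line.startswith("#") and fields:
            yield dict(zip(fields, line.split("\t")))

def incomplete_files(lines):
    """Return (fuid, seen_bytes, total_bytes) for records where seen_bytes < total_bytes."""
    result = []
    for rec in read_bro_log(lines):
        seen = rec.get("seen_bytes", "-")
        total = rec.get("total_bytes", "-")
        # Unset fields are written as "-", so only compare real numbers.
        if seen.isdigit() and total.isdigit() and int(seen) < int(total):
            result.append((rec["fuid"], int(seen), int(total)))
    return result

# Trimmed sample built from the files.log entry quoted in this thread.
SAMPLE = (
    "#fields\tfuid\tseen_bytes\ttotal_bytes\tmissing_bytes\n"
    "FbVw4P1oMybfKCu0Wg\t960204\t5242880\t0\n"
)

print(incomplete_files(io.StringIO(SAMPLE)))
# prints: [('FbVw4P1oMybfKCu0Wg', 960204, 5242880)]
```

To run it against a real log, pass an open file object instead of the sample, for example `incomplete_files(open("files.log"))`.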

Hi!

The weird.log contains some “above_hole_data_without_any_acks” entries, but why does it work with tcpflow?

Here is what I did:

  1. I downloaded the test file: wget http://ipv4.download.thinkbroadband.com/5MB.zip
  2. Captured the pcap: tcpdump -s0 -i eth0 -w download.pcap port http
  3. Checked with tcpflow that the file was completely captured:
    tcpflow -FT -e http -r download.pcap
    The md5sums match:
    ~/bro-liste$ md5sum 2015-04-01T07:45:00Z080.249.099.148.00080-192.168.002.103.42716-HTTPBODY-001.zip
    b3215c06647bc550406a9c8ccc378756 2015-04-01T07:45:00Z080.249.099.148.00080-192.168.002.103.42716-HTTPBODY-001.zip
    ~/bro-liste$ md5sum 5MB.zip
    b3215c06647bc550406a9c8ccc378756 5MB.zip
  4. Ran Bro (revision 32ae94de9ae36060651240a0ee11838e3e572223) with a simple script:
    ~/bro-liste$ cat extract.bro
    event file_new(f: fa_file)
    {
        Files::add_analyzer(f, Files::ANALYZER_EXTRACT);
    }
    ~/bro-liste$ /usr/local/bro/bin/bro -r download.pcap extract.bro
    1427874309.892545 warning in /usr/local/bro/share/bro/base/misc/find-checksum-offloading.bro, line 54: Your trace file likely has invalid TCP checksums, most likely from NIC checksum offloading.
  5. Logs from Bro and the pcap (14 MB):
    http://www.xup.in/dl,19594721/extract.tar.bz2/

Thanks!

Franky
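
The md5sum check from the steps above can be repeated for any extracted file. This is a rough sketch using only Python's standard hashlib; the function names `md5_of_file` and `same_content` are invented for this example and do not come from Bro or tcpflow.

```python
import hashlib

def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks to keep memory use low."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # iter() with a sentinel stops once read() returns an empty bytes object.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def same_content(path_a, path_b):
    """True if both files produce the same MD5 digest."""
    return md5_of_file(path_a) == md5_of_file(path_b)
```

Calling `same_content("5MB.zip", extracted_path)` with the path of the file Bro extracted (a placeholder here) would be expected to return False for a truncated extraction such as the one reported in this thread.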

~/bro-liste$ /usr/local/bro/bin/bro -r download.pcap extract.bro
1427874309.892545 warning in /usr/local/bro/share/bro/base/misc/find-checksum-offloading.bro, line 54: Your trace file likely has invalid TCP checksums, most likely from NIC checksum offloading.

You’ll have to address this problem to get the results you expect. See:

https://www.bro.org/documentation/faq.html#why-isn-t-bro-producing-the-logs-i-expect-a-note-about-checksums

The weird.log states some “above_hole_data_without_any_acks”

In this case, this seems to be just a side effect of the bad checksums, but in case you’re interested in how that type of situation can affect file extraction in Bro, there’s a discussion of how and why here:

https://bro-tracker.atlassian.net/browse/BIT-1255

- Jon

Thanks to all who answered!

The -C switch did the trick. Sometimes warnings should be taken seriously…

Have a nice day!

Franky
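
For anyone scripting the same workflow, the fix amounts to adding -C, which tells Bro to ignore invalid checksums, to the command line used earlier in the thread. The wrapper below is a hypothetical convenience sketch: `bro_command` and `run_bro` are names invented here, and the `bro_bin` default assumes that `bro` is on the PATH. Setting `redef ignore_checksums = T;` in a script is the in-script equivalent.

```python
import subprocess

def bro_command(pcap, script, ignore_checksums=True, bro_bin="bro"):
    """Build the argument list for reading a pcap with Bro; -C skips checksum validation."""
    cmd = [bro_bin]
    if ignore_checksums:
        cmd.append("-C")
    cmd += ["-r", pcap, script]
    return cmd

def run_bro(pcap, script, **kwargs):
    """Run Bro on the pcap; raises subprocess.CalledProcessError on a non-zero exit."""
    return subprocess.run(bro_command(pcap, script, **kwargs), check=True)

print(" ".join(bro_command("download.pcap", "extract.bro")))
# prints: bro -C -r download.pcap extract.bro
```

Ignoring checksums is appropriate for a trace captured on the downloading host itself, as here; on a link where corrupted packets are a real possibility, fixing the offloading settings on the capture interface is the more careful option.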