File extracted by ANALYZER_EXTRACT has a different hash than the original file

Hello, everyone.
I'm new to Bro, and I'm using the File Analysis Framework (FAF) to extract certain file types from traffic to disk for further analysis.
I have run into a problem that I can't explain:

  • the file Bro extracts is one byte bigger than my original file
  • or the extracted file is the same size as my original file, but its MD5 differs from the original (and differs between extracted files)

Below are my test environment, test steps, and test results:

My test environment

Bro version:

  • bro version 2.5-156

OS (32 cores, 64 GB RAM):

  • CentOS Linux release 7.3.1611 (Core)

CPU model:

  • Model name: Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
  • CPU(s): 32
  • CPU MHz: 2334.445

NIC:

  • 03:00.0 Ethernet controller: Intel Corporation I350 Gigabit Network

My test Bro script


# NOTE: mime_to_ext (MIME type -> file extension table) and path (output
# directory prefix) are not defined in this excerpt; they must be declared
# elsewhere in the script.
event file_sniff(f: fa_file, meta: fa_metadata)
    {
    print "file sniff event by Myth";

    if ( meta?$mime_type ) #&& hook FileExtraction::extract(f, meta) )
        {
        if ( meta$mime_type in mime_to_ext )
            {
            local fext = mime_to_ext[meta$mime_type];
            if ( fext == "txt" )
                {
                #print "txt";
                # Text files are only extracted when they arrive over SMTP.
                if ( f$source != "SMTP" )
                    {
                    #print "NOT SMTP";
                    return;
                    }
                }
            }
        else
            return;
        #fext = split_string(meta$mime_type, /\//)[1];

        # Full output path of the extracted file.
        local fname = fmt("%s%s-%s.%s", path, f$source, f$id, fext);
        #print fname;
        Files::add_analyzer(f, Files::ANALYZER_MD5);
        Files::add_analyzer(f, Files::ANALYZER_SHA1);
        Files::add_analyzer(f, Files::ANALYZER_SHA256);
        Files::add_analyzer(f, Files::ANALYZER_EXTRACT, [$extract_filename=fname]);
        }
    }

My test steps

  0. generate test file

[root@sensor ~]# dd if=/dev/urandom of=test.for.bro.txt bs=1024 count=512

[root@sensor ~]# tar -cvzf test.for.bro.tar.gz test.for.bro.txt

  1. original file size and MD5 value

[root@sensor ~]# ls -lt test.for.bro.tar.gz
-rw-r--r-- 1 root root 524608 Aug  7 13:59 test.for.bro.tar.gz

[root@sensor ~]# md5sum test.for.bro.tar.gz
6e755b5c0a7754c7066ca6db5f0f90ba test.for.bro.tar.gz

  2. start test web server using Python

[root@sensor ~]# python -m SimpleHTTPServer 8998 > ws.log 2>&1

  3. start Bro

[root@sensor myth]# /usr/local/bro/bin/bro -i eno1 -C bro-scripts/tophant.entrypoint.bro > myth.log 2>&1

  4. use ab from another machine to make many HTTP requests for the test file

[root@localhost ~]# ab -n 2000 -c 4 'http://10.0.81.54:8998/test.for.bro.tar.gz'

  5. results (after all requests are done)

5.1 web server request count

[root@sensor ~]# cat ws.log | grep test.for.bro | wc -l
2000

5.2 Bro file_sniff event count

[root@sensor myth]# cat myth.log | grep "file sniff event by Myth" | wc -l
976

5.3 extracted file count

[root@sensor sensor_files_by_myth]# ls | wc -l
973

5.4 file count with different file size:

[root@sensor sensor_files_by_myth]# ls -lt | grep -v 524608 | wc -l
193

5.5 file count with same file size:

[root@sensor sensor_files_by_myth]# ls -lt | grep 524608 | wc -l
780

5.6 count of files whose MD5 matches the original:

[root@sensor sensor_files_by_myth]# ls -lt | awk '{print $NF}' | xargs md5sum | grep 6e755b5c0a7754c7066ca6db5f0f90ba | wc -l
19

5.7 count of files with the same size but a different MD5 (!!! NOTICE: every one of them has a distinct MD5)

[root@sensor sensor_files_by_myth]# ls -lt | grep 524608 | awk '{print $NF}' | xargs md5sum | grep -v 6e755b5c0a7754c7066ca6db5f0f90ba | awk '{print $1}' | sort | uniq -c | wc -l
761

5.8 extracted file size distribution:

[root@sensor sensor_files_by_myth]# ls -lt | awk '{print $5}' | sort -rn | uniq -c
136 524609 <<<<<<<<<<<<<<< this is one byte bigger than my original test file !!!
780 524608
3 523990
3 522542
8 521094
1 520208
1 519646
2 518198
1 515302
1 513854
1 512968
1 512406
1 510958
1 509510
2 503718
1 502176
1 501384
1 497926
1 490296
1 488808
1 487040
1 486342
1 480550
1 473310
1 467518
1 464622
1 458830
1 453038
1 442902
1 441454
1 396566
1 382408
1 377742
1 358918
1 354574
1 318240
1 283312
1 263350
1 256110
1 250318
1 234952
1 189502
1 164886
1 79454
2 2710
1
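
The pipelines in 5.4 to 5.8 parse `ls -lt` output, which also contains a `total` line (most likely the trailing `1` with an empty size in 5.8), so the counts are easy to skew. Below is a minimal sketch of the same comparison done file by file, assuming GNU coreutils (`md5sum`, `wc`) are available. The function name `compare_extracted` is made up for illustration and is not part of Bro.

```shell
#!/bin/sh
# Compare every file in a directory against one reference file.
# Usage: compare_extracted REFERENCE_FILE EXTRACT_DIR
compare_extracted() {
    ref=$1
    dir=$2
    ref_md5=$(md5sum < "$ref" | awk '{print $1}')
    ref_size=$(wc -c < "$ref" | tr -d ' ')
    identical=0
    same_size_diff_md5=0
    diff_size=0
    for f in "$dir"/*; do
        [ -f "$f" ] || continue
        size=$(wc -c < "$f" | tr -d ' ')
        if [ "$size" -ne "$ref_size" ]; then
            # Size differs, so the content cannot match.
            diff_size=$((diff_size + 1))
        elif [ "$(md5sum < "$f" | awk '{print $1}')" = "$ref_md5" ]; then
            identical=$((identical + 1))
        else
            # Same length but different content.
            same_size_diff_md5=$((same_size_diff_md5 + 1))
        fi
    done
    echo "identical=$identical same_size_diff_md5=$same_size_diff_md5 diff_size=$diff_size"
}
```

With the names from this post it would be called as `compare_extracted test.for.bro.tar.gz sensor_files_by_myth`. For a single same-size file, `cmp -l test.for.bro.tar.gz <extracted file> | head` lists the first differing byte offsets, which may help show where the content diverges.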

Thanks for reading this far. I hope someone can help me with this :)

Myth

> - the file Bro extracts is one byte bigger than my original file
> - or the extracted file is the same size as my original file, but its MD5 differs from the original (and differs between extracted files)

Ugh, that's not a good behavior.

> Below are my test environment, test steps, and test results:

Could you capture traffic and replay that with Bro instead of sniffing
the interface directly? If you did that you could at least verify
that the problem is deterministically replicable and then we could
possibly look into the problem with you. I have several thoughts
about what the problem could be but they're ultimately fairly long
shots and could likely be wrong.

.Seth
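
A sketch of the capture-and-replay check suggested above, assuming the interface `eno1`, port 8998, and the script path from the original post. The pcap file name and the function names are placeholders, not anything prescribed by Bro.

```shell
#!/bin/sh
# Placeholder values taken from the original post; adjust to the local setup.
IFACE=eno1
PCAP=test.pcap
SCRIPT=bro-scripts/tophant.entrypoint.bro

# Record full packets (-s 0) for the test web server's port into a pcap file.
# Stop with Ctrl-C once the ab run has finished.
capture_traffic() {
    tcpdump -i "$IFACE" -s 0 -w "$PCAP" 'tcp port 8998'
}

# Run the same Bro script offline against the capture.
# -r reads packets from a file instead of an interface;
# -C ignores invalid checksums, as in the live run.
replay_capture() {
    bro -C -r "$PCAP" "$SCRIPT"
}
```

Run `capture_traffic` while `ab` is generating load, then `replay_capture` afterwards. When tcpdump exits it prints how many packets were dropped by the kernel; if that number is not zero, the pcap itself is incomplete and the replay result says little about Bro.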

I have tested with pcaps, and there is no problem there.
But when I listen on the interface directly, the problem appears,
and it appears most of the time.

I opened an issue at https://bro-tracker.atlassian.net/projects/BIT/issues/BIT-1832.

I uploaded three files to that issue:

  1. test4faf.bro - the Bro script I use for the test
  2. test4faf.tar.gz - the file I download over HTTP, generated with the command dd if=/dev/urandom of=test4faf.dat bs=1024 count=128 && tar -cvzf test4faf.tar.gz test4faf.dat && rm -f test4faf.dat
  3. test4faf.pcap - generated with tcpdump. If I test with this pcap, no problem occurs and everything is fine.

Anyone who is interested in this problem can test with that Bro script, but remember to sniff traffic directly from an interface.

Myth Ren wrote:

> Anyone who is interested in this problem can test with that Bro script,
> but remember to sniff traffic directly from an interface.

If you are only seeing the problem when sniffing from an interface, it's
likely that the problem is actually that you are dropping packets. When
you sniff from an interface, what is your traffic rate that is being
monitored?

  .Seth
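
One way to look for evidence of dropped packets on the Bro side is the `missing_bytes` column of `files.log`, which, as far as I know, records how many bytes of a file Bro never saw. Below is a minimal sketch that counts log rows with a nonzero value, assuming the default tab-separated ASCII log format with a `#fields` header line. The function name `count_files_with_gaps` is made up for illustration.

```shell
#!/bin/sh
# Count rows in a Bro TSV log whose missing_bytes column is greater than zero.
# Usage: count_files_with_gaps FILES_LOG
count_files_with_gaps() {
    awk -F '\t' '
        /^#fields/ {
            # The header reads "#fields<TAB>name1<TAB>name2...", so header
            # field i corresponds to data column i - 1.
            for (i = 2; i <= NF; i++)
                if ($i == "missing_bytes")
                    col = i - 1
            next
        }
        /^#/ { next }                      # skip the other metadata lines
        col && $col + 0 > 0 { n++ }        # unset values ("-") count as 0
        END { print n + 0 }
    ' "$1"
}
```

Called as `count_files_with_gaps files.log` in the directory where Bro was started. A nonzero result on the live run, together with zero on the pcap replay, would be consistent with packet loss at capture time. Loading the `misc/capture-loss` policy script on the live run is another option for estimating loss.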