Hashing incomplete files

Hey everyone,

Hopefully anyone who has looked at or worked on the hashing component of the file analysis framework can help out with my request. I need Bro to hash all files, including incomplete ones. I looked at the file hashing source code, and making Bro hash incomplete files seemed straightforward (comment out the lines that stop file hashing when there is an undelivered chunk), but I’m getting an odd result: the hash Bro reports for an incomplete file does not match the hash of the file Bro extracts.

For example, here’s a files.log entry for an incomplete file with hashing enabled:

1493035575.544634 Fb19KI1OvvCjlT49eg CKcFdN2BuVOe1wiFB HTTP 0 EXTRACT,SHA1,MD5 - - 0.036770 - F 32221 59247 27026 0 T - 62f2c17b427ab54f9a8e30f384ba2a5e 6cba20d301dde6d7cbc4f41c689c1ecd108d7bef - extract-1493035575.544634-HTTP-Fb19KI1OvvCjlT49eg

Here is the MD5 hash as reported by the file system:

f0d987adb1015a05aabfcbade38751b1 extract-1493035575.544634-HTTP-Fb19KI1OvvCjlT49eg

Any thoughts on why these hashes don’t match? I’m guessing that enabling this functionality isn’t as simple as not breaking the hashing function when an undelivered chunk is found.


I’m guessing that Bro doesn’t pass a string of nulls to the hash function when there’s an undelivered chunk. But that’s what ends up in the file (I don’t know if that’s a side effect or intentional – but it is useful as all the other bits end up in the right place and you can find the holes after the fact). So I wouldn’t expect that the hash would be the same.

If you want them to match you probably need to figure out how to pass a block of nulls (of the appropriate length) to the hash function whenever there is undelivered data.
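
To make that concrete, here is a rough Python sketch of the idea. This is not Bro code: the function name and the (offset, data) chunk layout are made up for illustration, and it assumes the extracted file on disk has nulls in the holes. It feeds zero bytes to MD5 for every gap, so the digest covers the same bytes the extracted file would contain.

```python
import hashlib


def md5_with_null_gaps(chunks, total_len, block=65536):
    """Hash delivered chunks, substituting null bytes for undelivered ranges.

    chunks: iterable of (offset, data) pairs, sorted by offset and
            non-overlapping (hypothetical layout, for illustration only).
    total_len: expected size of the complete file in bytes.
    """
    h = hashlib.md5()
    pos = 0  # next file offset the hash expects

    def pad(n):
        # Feed n null bytes in bounded blocks so large gaps stay cheap.
        while n > 0:
            step = min(n, block)
            h.update(b"\x00" * step)
            n -= step

    for offset, data in chunks:
        if offset < pos:
            raise ValueError("chunks must be sorted and non-overlapping")
        pad(offset - pos)   # gap before this chunk
        h.update(data)
        pos = offset + len(data)

    pad(total_len - pos)    # trailing gap, if the tail never arrived
    return h.hexdigest()


# Two delivered chunks with a 3-byte hole between them and a 2-byte hole at the end.
digest = md5_with_null_gaps([(0, b"ab"), (5, b"cd")], total_len=9)
print(digest)
```

The same approach should work for SHA1 by swapping in hashlib.sha1. Inside Bro the equivalent change would have to happen wherever the hash analyzer is told about an undelivered range, which I have not checked.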


Just to add a bit to this - I think this behavior is intentional and is used,
e.g., when a file is downloaded over multiple streams simultaneously.


Kevin was correct – filling in the incomplete space with nulls produces the same MD5 hash.

Johanna, in the case of an “incomplete” file, could multiple simultaneous streams produce an inconsistent hash? I’m not sure I understand how multiple streams might affect a file’s completeness, but I would be happy to hear your thoughts.


Actually, I think I understand what you mean now – with some of the PCAPs I have, the hash for incomplete files changes from run to run.

I take it back, my code was off – seems fine now. It would be nice if this could be enabled as an option via an argument.