File events question

Currently we are using file_sniff events plus some conditional logic to selectively invoke Files::ANALYZER_EXTRACT on some subset of files for further analysis. However, this approach leads to a lot of duplication and we would like to hone it down - specifically, exclude files based on known hashes. So I tried:

file_sniff event -> sometimes invoke Files::ANALYZER_MD5
file_hash event -> after using some logic to make sure this is the hashing event triggered by the previous step, then try to invoke the full Files::ANALYZER_EXTRACT
but this approach results in
"Reporter::WARNING","message":"Analyzer Files::ANALYZER_EXTRACT not added successfully to file
which, based on these threads I found...
https://lists.zeek.org/archives/list/zeek@lists.zeek.org/thread/UUIA4PN4D24PNG5FG6TFRVGCC3VJTDN3/#UUIA4PN4D24PNG5FG6TFRVGCC3VJTDN3
https://lists.zeek.org/archives/list/zeek@lists.zeek.org/thread/AVODJCKRGC34JJOMTYUPZR2C76FOSDHS/#AVODJCKRGC34JJOMTYUPZR2C76FOSDHS

...this sounds like the expected result: the ANALYZER_EXTRACT call comes "too late" in the event lifecycle for the file, presumably because a maximum of one ANALYZER submission per file is supported. But I'm still not clear on exactly why.

Is there documentation somewhere on the file analysis / event lifecycle that documents when and how file analysis can be triggered, and the limitations that appear to be implicit?

It also seems like this is a common enough use case that someone must have solved this problem at some point in a more elegant way than the threads I found have proposed (extract all, then delete some).

Hoping someone has some insights here they would be willing to share?

In general you cannot do this without building a time machine. The hash of the file is not known until the entire file has been seen. At that point it is too late to extract it, because most likely the earlier bytes of the file no longer exist. The easiest solution is to extract all files and then, in file_state_remove, delete the extracted files whose hashes you don’t care about.
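A rough sketch of that "extract everything, then delete" approach might look like the following. It is untested and makes several assumptions: ignored_md5s is a made-up name for whatever hash list you maintain, the extract analyzer is left to choose its default file name, and the extracted file is assumed to live under FileExtract::prefix relative to Zeek's working directory.

```zeek
@load base/frameworks/files
@load base/files/extract
@load base/files/hash
@load base/utils/paths

# Hypothetical list of MD5 hashes whose extracted files should be discarded.
const ignored_md5s: set[string] = {
    "d41d8cd98f00b204e9800998ecf8427e",  # MD5 of empty input, as a placeholder
} &redef;

event file_sniff(f: fa_file, meta: fa_metadata)
    {
    # Attach both analyzers up front, while the beginning of the file
    # is still buffered and can be handed to them.
    Files::add_analyzer(f, Files::ANALYZER_MD5);
    Files::add_analyzer(f, Files::ANALYZER_EXTRACT);
    }

event file_state_remove(f: fa_file)
    {
    # Both fields are optional; they are only set if hashing finished
    # and the extract analyzer was attached successfully.
    if ( ! f?$info || ! f$info?$md5 || ! f$info?$extracted )
        return;

    if ( f$info$md5 in ignored_md5s )
        {
        # Assumption: the file was written under FileExtract::prefix.
        local path = build_path(FileExtract::prefix, f$info$extracted);

        if ( ! unlink(path) )
            Reporter::warning(fmt("could not remove extracted file %s", path));
        }
    }
```

Your existing conditional logic from file_sniff would still decide which files get the two analyzers in the first place; the only change is that the hash check moves to the end of the file's lifetime instead of gating the extraction.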

It’s possible that this could be made to work for small files. There’s a 4k buffer that is used for mime matching, which is why you can still extract a file a few packets into a connection:

    ## Default amount of bytes that file analysis will buffer in order to use
    ## for mime type matching.  File analyzers attached at the time of mime type
    ## matching or later, will receive a copy of this buffer.
    option default_file_bof_buffer_size: count = 4096;

So potentially, if the file is smaller than 4k, it could be extracted based on the hash, but there would probably have to be a lot of changes, and maybe a hook added, to ensure that when the file ends there’s a way to tell Zeek to hold onto that buffer.

OK, fair enough - and thanks for the answer. From your description it sounds like Zeek is acting as a stream processor, hence the timing limitations on which events can be raised and when. Is there a more detailed discussion of this somewhere that I have managed to miss? I would just like to understand the limitations on events a little better.