However, there is an issue with how the entropy is being computed. Also, I needed a special bif to compute entropy for strings of a specified length, so I added that too.

Two conclusions here (referencing the following two lines of the patch):

- ent += prob[i] * rt_log2(1 / prob[i]);
+ ent += prob[i] * rt_log2(prob[i]);

(1) That the patched code would need to be:

ent -= prob[i] * rt_log2(prob[i]);

to be technically correct (unless the sign is flipped later and I just missed it :). Without the minus sign, the computed entropy comes out negative. That said, I don't know how much the sign in front of the entropy matters in practice.

(2) If the sign were correct, the submitted patch would not change the result of the code (i.e. the code in there at the moment is technically correct). [1]

That said, the patch *might* make the code's function a little more obvious; it took me a minute to recognize what was going on there, since the form for entropy that I was taught matches what's described in the patch.

Related: there is an interesting effect described in 2.1 of [2] (called "cancellation"; see the footnote): as the individual probabilities decrease, I'd imagine that the accuracy of our calculation would fall (since the entropy is accumulated incrementally, and individual contributions can be vanishingly small). It might be good to look for a similar alternative formulation here as well. [3]