Changes in entropy computation code.

Hi,

These changes went in a while ago: http://tracker.bro-ids.org/bro/ticket/265

However, there is an issue with how the entropy is being computed. Also, I needed a special bif to compute the entropy of strings of a specified length, so I added that too.

0001-Fixing-entopy-computation-code-and-adding-bif-for-co.patch (691 Bytes)

0002-Adding-a-new-bif-for-computing-entropy-for-specified.patch (1.34 KB)

However, there is an issue with how the entropy is being computed.

This is well outside of my expertise, but your change is in opposition to how ENT[1] does it.

- ent += prob[i] * rt_log2(1 / prob[i]);
+ ent += prob[i] * rt_log2(prob[i]);

I just went back and verified and it looks like the original line is how it's done.

Also, I needed a special bif to compute entropy for strings of a specified length, so added that too.

I would rather not integrate this. My suggestion would be to trim the string with the sub_bytes BiF before passing it to the find_entropy function.

  .Seth

1. http://www.fourmilab.ch/random/

Hi,

This is well outside of my expertise, but your change is in opposition to how ENT[1] does it.

- ent += prob[i] * rt_log2(1 / prob[i]);
+ ent += prob[i] * rt_log2(prob[i]);

I just went back and verified and it looks like the original line is how it’s done.

I checked it out. I think rt_log2(prob[i]) is the correct way to do this. It is the sum, over the entire alphabet, of the probability multiplied by the log of the probability: http://en.wikipedia.org/wiki/Entropy_%28information_theory%29#Definition

I would rather not integrate this. My suggestion would be to trim the string with the sub_bytes BiF before passing it to the find_entropy function.

I see, thanks for pointing that out; I've just started scripting. :-)

Unfortunately, e-mail is not the best medium for arguing stuff like this. Forgive the terrible formatting.

Let SUM(i, f(i)) be the sum of f(i) for all values of i that exist in some data set (in our case, the array prob).

Base Definition: H(X) = -1 * SUM(i, prob[i] * rt_log2(prob[i]))

What we want to prove:

(1) H(X) = -1 * SUM(i, prob[i] * rt_log2(prob[i]))

is equivalent to

(2) H(X) = SUM(i, prob[i] * rt_log2(1 / prob[i]))

Assuming I remember how to do math, I believe we can get from (2) to (1) with a little bit of algebraic manipulation (as follows):

SUM(i, prob[i] * rt_log2(1 / prob[i])) // we’re starting at (2)

SUM(i, prob[i] * (rt_log2(1) - rt_log2(prob[i])) ) // rule of logarithms: log(1 / x) = log(1) - log(x)

SUM(i, prob[i] * -rt_log2(prob[i]) ) // log(1) = 0

SUM(i, -1 * prob[i] * rt_log2(prob[i]) ) // pull the -1 to the front to make the next step more obvious

-1 * SUM(i, prob[i] * rt_log2(prob[i]) ) // Pull the constant -1 out of the SUM

… and we should be good.

–Gilbert

Gilbert, I'm not sure what your conclusion is?

Robin

Two conclusions here (referencing the following two lines of the patch):

- ent += prob[i] * rt_log2(1 / prob[i]);
+ ent += prob[i] * rt_log2(prob[i]);

(1) That the patched code would need to be:

ent -= prob[i] * rt_log2(prob[i]);

to be technically correct (unless the sign is flipped later and I just missed it :). That said, I don't really know that the negative sign in front of the entropy matters much.

(2) If the sign were corrected, the submitted patch would not change the result of the code (i.e., the code that's in there at the moment is technically correct). [1]

That said, the patch *might* make the code's function a little more obvious; it took me a minute to recognize what was going on there, since the form for entropy that I was taught matches what's described in the patch.

Related: there is an interesting effect, called "cancellation", described in section 2.1 of [2] (see the footnote): as the individual probabilities decrease, I'd imagine that the accuracy of our calculation would fall, since the entropy is accumulated incrementally and the individual contributions can be vanishingly small. It might be good to try to find a similarly robust alternative formulation here as well. [3]

--Gilbert

[1] I've since found a reference here: http://astarte.csustan.edu/~tom/SFI-CSSS/info-theory/info-lec.pdf that explicitly defines entropy with the prob[i] * rt_log2(1 / prob[i]) formula.
[2] http://www.eecs.berkeley.edu/~mhoemmen/cs194/Tutorials/variance.pdf
[3] Assuming I'm understanding this effect correctly...

So, this is what we should do with this line?

  .Seth