Bug on Anon.cc

Hi,

I think I found a bug in the IP anonymizer code, more concretely in
the PREFIX_PRESERVING_MD5 mode (well, considering that the anonymized
addresses do not preserve prefixes, I'd say it is a bug). I include
a patch.

-Chema

bro.current.20050910.fix_for_ip_anonymizer_in_mode_preserving_md5.diff (1000 Bytes)

FYI, the IP anonymizer code in mode PREFIX_PRESERVING_MD5 didn't do
what it was supposed to do. Such mode is based on "On the Design and
Performance of Prefix-Preserving IP Traffic Trace Anonymization", by
Xu et al (IMW 2001), where it is suggested to anonymize
X=x_0...x_{n-1} as X'=x'_0...x'_{n-1}, where:
  x_i' = x_i ^ f_{i-1},
  f_{i-1} = LSB(HK(PAD(x_0 ... x_{i-1}), hashkey))
  LSB: less significative bit function
  HK: cryptographic hash function (using hashkey)
  PAD(x_0 ... x_{i-1}) = x_0 ... x_{i-1} 0 ... 0

Two bugs in the old code: (a) it used addresses in network order, so
prefixes didn't make too much sense. (b) it did the hash of an 8-byte
struct composed of the prefix length and the input, instead of the
prefix of the input (4 bytes).

I also added some comments explaining the function.

-Chema

PS: The patch didn't include context. I enclose a context patch.

I wrote:

bro.current.20050910.fix_for_ip_anonymizer_in_mode_preserving_md5.diff (1.32 KB)

FYI, the IP anonymizer code in mode PREFIX_PRESERVING_MD5 didn't do
what it was supposed to do. Such mode is based on "On the Design and
Performance of Prefix-Preserving IP Traffic Trace Anonymization", by
Xu et al (IMW 2001), where it is suggested to anonymize
X=x_0...x_{n-1} as X'=x'_0...x'_{n-1}, where:
  x_i' = x_i ^ f_{i-1},
  f_{i-1} = LSB(HK(PAD(x_0 ... x_{i-1}), hashkey))
  LSB: less significative bit function
  HK: cryptographic hash function (using hashkey)
  PAD(x_0 ... x_{i-1}) = x_0 ... x_{i-1} 0 ... 0

Two bugs in the old code: (a) it used addresses in network order, so
prefixes didn't make too much sense.

That's right. I forgot to add an ntohl. Thanks for finding it!

(b) it did the hash of an 8-byte
struct composed of the prefix length and the input, instead of the
prefix of the input (4 bytes).

I think this is not correct. For example, prefix 10.0.0.0/8 is different from 10.0.0.0/9, and should be hashed differently; otherwise the 9th and 10th most significant bits will always be flipped the same way. (Or, try to anonymize 128.0.0.0 with 1000 different keys, and see how many distinct results one can get.)

Ruoming

Ruoming Pang wrote:

>(b) it did the hash of an 8-byte
>struct composed of the prefix length and the input, instead of the
>prefix of the input (4 bytes).
I think this is not correct. For example, prefix 10.0.0.0/8 is
different from 10.0.0.0/9, and should be hashed differently; otherwise

If you add the prefix length, then you have a "fixed-prefix length
anonymizer." By this, I mean an anonymizer for each /X class. It's
equivalent to have a hash key for /4 addresses, and another for /5
addresses. This scheme only preserves prefixes when the /X suffix is
the same. For example, 10.0.0.0/8 and 10.0.0.0/9 will not have the
same prefix. Is this what the function is supposed to do?

With this scheme, you may also have collisions. It may be the case
that 0.0.0.0/24 hashes to 4.4.4.0/24, and that 255.255.0.0/16 hashes
to 4.4.0.0/16. The hashed addresses share a 16-bit prefix, while
the input addresses share no prefix.

BTW, if this is the case, for the code to work, you cannot change
prefix.len during the "for" loop in AnonymizeIPAddr_PrefixMD5::anonymize().

the 9th and 10th most significant bits will always be flipped the same
way. (Or, try to anonymize 128.0.0.0 with 1000 different keys, and see
how many distinct results one can get.)

That's an artifact of the padding being 0...0. The f_0 function is
a constant dependent on the hash key, so x'_0 is random. x'_1 to
x'_{31} are all the same, as PAD(x_0) = PAD(x_0 x_1) = ... =
PAD(x_0 ... x_{31}).

Therefore, you can get 0.0.0.0, 128.0.0.0, 127.255.255.255, and
255.255.255.255.

-Chema

Ruoming Pang wrote:

(b) it did the hash of an 8-byte
struct composed of the prefix length and the input, instead of the
prefix of the input (4 bytes).

I think this is not correct. For example, prefix 10.0.0.0/8 is
different from 10.0.0.0/9, and should be hashed differently; otherwise

If you add the prefix length, then you have a "fixed-prefix length
anonymizer." By this, I mean an anonymizer for each /X class. It's
equivalent to have a hash key for /4 addresses, and another for /5
addresses. This scheme only preserves prefixes when the /X suffix is
the same. For example, 10.0.0.0/8 and 10.0.0.0/9 will not have the
same prefix. Is this what the function is supposed to do?

I see. There is some misunderstanding here. We do not take the hash value of 10.0.0.0/8 as the result prefix, but use it to determine whether we flip the 9th bit. And hash value of 10.0.0.0/9 determines whether to flip the 10th bit, and so on. For your reference, the paper by Xu et. al. can be found at: One-Stop Blog for International News, Press Releases, Media, and Conferences - IMConf.

(Or, try to anonymize 128.0.0.0 with 1000 different keys, and see
how many distinct results one can get.)

That's an artifact of the padding being 0...0. The f_0 function is
a constant dependent on the hash key, so x'_0 is random. x'_1 to
x'_{31} are all the same, as PAD(x_0) = PAD(x_0 x_1) = ... =
PAD(x_0 ... x_{31}).

By the way, the PAD function in the paper does include the prefix length, instead of padding with only 0.

Ruoming

Ruoming Pang wrote:

I see. There is some misunderstanding here. We do not take the hash
value of 10.0.0.0/8 as the result prefix, but use it to determine
whether we flip the 9th bit. And hash value of 10.0.0.0/9 determines
whether to flip the 10th bit, and so on.

Oh, I thought you were meaning that you wanted to anonymize the
prefix 10.0.0.0/8, as compared to the address 10.0.0.0, from which
hashing 10.0.0.0/8 is one of 32 steps.

I'd say the discussion concerns only the padding function, which was
wrong in both the old code and my patch.

By the way, the PAD function in the paper does include the prefix
length, instead of padding with only 0.

You're right: the padding requires adding the i value (the prefix
length). That's actually what makes 128.0.0.0 hash to something
different than 4 possible values.

Still, you have to get rid of all the bits after the prefix you're
currently hashing. Moreover, the padding also includes a 1 before the
zeros (the padding is the same than in RFC 1321. I wonder which is
the reason for adding a 1. Anyway, I'd respect it). For the old code,
the prefix should be calculated as:

prefix.prefix = (input & ~(prefix_mask>>(31-i))) | (1<<i);

The code would be now OK. A minor quirk: the way the old code uses i
is different from the way the paper uses i (they are related as
i\_\{paper\} = 31 \- i\_\{old\_code\}). I changed the old code so that i
is now the same than in the paper. Patch is enclosed.

-Chema

bro.current.20050910.fix_for_ip_anonymizer_in_mode_preserving_md5.diff (1.31 KB)

You're right: the padding requires adding the i value (the prefix
length). That's actually what makes 128.0.0.0 hash to something
different than 4 possible values.

I'm glad that now we agree upon this.

Still, you have to get rid of all the bits after the prefix you're
currently hashing.

Yes, that's another bug in my old code, which, apparently, wasn't well tested. :frowning:

Moreover, the padding also includes a 1 before the
zeros (the padding is the same than in RFC 1321. I wonder which is
the reason for adding a 1. Anyway, I'd respect it).

That is fine.

The code would be now OK. A minor quirk: the way the old code uses i
is different from the way the paper uses i (they are related as
i\_\{paper\} = 31 \- i\_\{old\_code\}). I changed the old code so that i
is now the same than in the paper. Patch is enclosed.

That's fine, too.

However, there's another minor tweak we need to make (I learned that in writing and testing the same anonymization function in the TCPMKPUB code). The byte-by-byte value of struct prefix currently depends on byte order of the machine. In order for Bro to produce the same results for both byte orders, we have to put the 32 bit values back to network order when assigning the struct prefix.

Thanks for looking into the code and finding the bugs!

Ruoming

Ruoming Pang wrote:

However, there's another minor tweak we need to make (I learned that in
writing and testing the same anonymization function in the TCPMKPUB
code). The byte-by-byte value of struct prefix currently depends on
byte order of the machine. In order for Bro to produce the same results
for both byte orders, we have to put the 32 bit values back to network
order when assigning the struct prefix.

That's right. I enclose the patch.

-Chema

bro.current.20050910.fix_for_ip_anonymizer_in_mode_preserving_md5.diff (1.32 KB)