minor documentation error

Came up on the SO list.

http://www.bro-ids.org/bro-workshop-2011/solutions/logs/index.html

Solution for:

Exercise

What are the top 10 hosts (originators) that send the most traffic?

The final sort should be “sort -rnk 2”

Credit: Shane Castle

Happy Holidays All,

Liam

I found another issue with this script. The Unix/POSIX sort command will not sort IP addresses correctly unless it is told to explicitly, with one numeric key per octet:
"sort -t '.' -k 1,1n -k 2,2n -k 3,3n -k 4,4n". This defect causes the script to lie about who is using how many bytes.

If you want a nice example, just access a reasonably busy Bro system, go to one of the compressed log directories, and try:

"zcat conn.*.gz | bro-cut id.orig_h orig_bytes | sort | less"

You will see it sorting addresses like 192.168.6.48 and 192.168.64.8 as if they were the same. The subsequent awk script relies on all lines for one host being adjacent, so it fails rather badly on this input.
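
A minimal sketch of the explicit per-octet sort, using made-up addresses and byte counts rather than real conn.log data. The function names sample_records and sort_by_ip are hypothetical, and LC_ALL=C is an added assumption that only makes tie-breaking between equal addresses repeatable:

```shell
# Hypothetical sample records, tab-separated like bro-cut output.
sample_records() {
    printf '192.168.64.8\t100\n'
    printf '192.168.6.48\t200\n'
    printf '192.168.64.8\t300\n'
    printf '192.168.6.48\t400\n'
}

# Sort on each octet numerically so that all lines for one address
# end up next to each other. The fourth key also contains the byte
# count after the tab, but -n only reads the leading number.
sort_by_ip() {
    LC_ALL=C sort -t '.' -k 1,1n -k 2,2n -k 3,3n -k 4,4n
}

sample_records | sort_by_ip
```

With this ordering both 192.168.6.48 lines come first, followed by both 192.168.64.8 lines, which is the grouping the awk stage needs.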

And that brings up another point: the orig_bytes field is often nonnumeric, containing a "-" or a blank instead of a number. I don't know offhand how the awk script deals with these. I am trying to find out, and to create a true toptalkers script that really works.
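
One quick way to check is to feed awk a few hand-made records. This is only a sketch: the addresses are invented and coerce_bytes is a hypothetical helper name. In awk, a string that does not look like a number evaluates to 0 in arithmetic, so adding a "-" or an empty field to a sum contributes nothing, whereas a plain assignment such as "size=$2" would store the string itself:

```shell
# Print each host with its byte field forced into numeric context.
coerce_bytes() {
    awk -F '\t' '{ print $1, $2 + 0 }'
}

# Hypothetical records: a real count, a "-", and an empty field.
printf '10.0.0.1\t10\n10.0.0.2\t-\n10.0.0.3\t\n' | coerce_bytes
```

The second and third records both come out with a count of 0, so an explicit test for "-" mainly matters where the value is assigned rather than added.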

I think I may have this script working correctly now. There were several errors in the original script: the first sort, the last sort, and in the awk script.

Here is the final and, I believe, correct version:

bro-cut id.orig_h orig_bytes < conn.log \
    | sort -t '.' -k 1,1n -k 2,2n -k 3,3n -k 4,4n \
    | awk 'BEGIN { size = 0; host = "" }
           {
               if (host != $1) {
                   # New host: print the total for the previous one.
                   if (size != 0)
                       print host, size
                   host = $1
                   # Reset even when the first value is "-", so the
                   # previous total is never carried over.
                   size = 0
               }
               # Skip unset values ("-" or blank).
               if ($2 != "-" && $2 != "")
                   size += $2
           }
           END {
               if (size != 0)
                   print host, size
           }' \
    | sort -rnk 2 \
    | head -n 10

Note the "print" command in the awk script. Originally, it was "print $1, size". This is incorrect, since $1 is the address on the *current* line and not the *previous* host, so the sum for a host was printed against the next address rather than the one it belongs to. The first sort has been changed to order addresses numerically by octet, which is what we really want, and the last sort has been changed to sort in reverse numeric order. I added the test for the bytes being "-", but that might be superfluous.

My old PA senses were tweaked by the lack of variable initialization, and the first assignment to size glared at me as well. As it was originally written, the first time the IP address changed, the size would be set to zero and the first value of orig_bytes would be thrown away. Testing has shown that the above script works correctly.
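
For anyone who wants to repeat the check without a live Bro system, here is a rough sketch that runs hand-made records through a variant of the sort and awk stages above, with the running total reset explicitly on each host change. The function name top_talkers and the sample data are hypothetical, and bro-cut is left out so the input can be typed by hand:

```shell
# Reads "host<TAB>orig_bytes" records on standard input and prints
# the ten hosts with the largest totals.
top_talkers() {
    LC_ALL=C sort -t '.' -k 1,1n -k 2,2n -k 3,3n -k 4,4n |
    awk 'BEGIN { size = 0; host = "" }
         {
             if (host != $1) {
                 # Host changed: report the previous host, then reset.
                 if (size != 0) print host, size
                 host = $1
                 size = 0
             }
             # Ignore unset byte counts.
             if ($2 != "-" && $2 != "") size += $2
         }
         END { if (size != 0) print host, size }' |
    sort -rnk 2 |
    head -n 10
}

# Hypothetical input: two hosts with interleaved lines and one "-".
printf '192.168.64.8\t100\n192.168.6.48\t200\n192.168.64.8\t-\n192.168.6.48\t50\n192.168.64.8\t300\n' |
    top_talkers
```

With this input, 192.168.64.8 should be reported with 400 bytes and 192.168.6.48 with 250, in that order. Swapping "print host, size" back to "print $1, size" in the main block attaches the 250 to the wrong address.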