Dropped Packets

Hi all,

I recently upgraded 3 standalone Bro nodes. Two of them run Ubuntu and one runs CentOS 6.2.

On the two Ubuntu 11.10 boxes I see a lot of dropped packets in notice.log.

Update:

I have found a way to lessen the amount of packets being dropped.

Here is what I have:
Dell R310 - 3.2GHz - 4GB RAM - Dell hardware RAID controller - two 1TB 7.2k drives in RAID 1

Test scenario:
Two Bro 2.0 servers running virtually identical configs on Ubuntu 11.10.
One server for testing and one as a control.
Both monitoring 2 Network Taps of live traffic.

Test 1 : increased RAM to 8GB
Result : same number of packets dropped

Test 2 : replaced hard drives with two 10k drives in RAID 1
Result : 10% fewer packet drops in the Bro logs compared to the control server

Test 3 : replaced hard drives with two SSDs in RAID 1
Result : 80% fewer packet drops than the control server

Test 4 : switched the SSDs to RAID 0
Result : 90% fewer packet drops than the control server

I have heard that SSD drives have a shorter life span if they are written to heavily, so this is probably not the best solution.

But from now on I will order servers with the fastest possible hard drives, which for the Dell R310 are 15K SAS drives.

When the 15K SAS drives come in I will run the same tests and post the results.

Will

That's really interesting! What about using a ramdisk (e.g. /dev/shm)
file system for the logs currently being written, then at the hour mark
(when the logs roll over), moving them to disk? That should
theoretically take disk performance out of the equation, and I'd be
really interested in your numbers then.
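
(One way to sketch that out, for anyone who wants to try it: the paths,
tmpfs size, and option names below are my assumptions for a standalone
broctl install, so double-check them against your broctl.cfg.)

  # Create a tmpfs big enough for one hour of logs plus headroom
  mkdir -p /mnt/bro-spool
  mount -t tmpfs -o size=2g tmpfs /mnt/bro-spool

  # Point the directory Bro actively writes to at the tmpfs while leaving
  # the archive on real disk; I believe SpoolDir and LogDir in broctl.cfg
  # are the relevant settings:
  #   SpoolDir = /mnt/bro-spool
  #   LogDir   = /data/bro-logs

broctl's hourly rotation should then move the finished logs from the
tmpfs onto real disk, so only the actively written logs live in RAM.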

What network speeds were you seeing during these tests?

Per broctl capstats, I was averaging between 15 and 45 Mbps.

Hm, that seems like an oddly low amount of traffic to see drops on a box like you have.

Could you also try the current master branch in our repository? The logging framework has been threaded and it's likely that disk latency issues have been resolved to some degree. Also, I assume these were running Bro in standalone mode? When running in a cluster this shouldn't have nearly so much effect because the manager process does all of the log writing and it doesn't do any packet processing.

  .Seth
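
(For anyone who wants to try the cluster route Seth describes, a minimal
BroControl node.cfg for a single box might look roughly like the sketch
below; the hostname and the interface name are made-up placeholders.)

  [manager]
  type=manager
  host=localhost

  [proxy-1]
  type=proxy
  host=localhost

  [worker-1]
  type=worker
  host=localhost
  interface=eth1

The manager then does all of the log writing and the worker only captures
packets, which is what takes disk latency off the capture path.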

How much disk IO are these boxes actually doing while the test is
running?

A good tool for showing this is dstat (apt-get install dstat)

    dstat --disk-tps -a --mem 5

> That's really interesting! What about using a ramdisk (e.g. /dev/shm)
> file system for the logs currently being written, then at the hour mark
> (when the logs roll over), moving them to disk? That should
> theoretically take disk performance out of the equation, and I'd be
> really interested in your numbers then.

Along those lines, you could experiment with various filesystem
robustness-performance tradeoffs. For instance, assuming you're running
ext4 filesystems, you could try any/all of these mount-time/fstab options:

  noatime,barrier=0,data=writeback

...with the related caveats about what you're giving up (man mount).

Hank
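
(To make that concrete, a hypothetical /etc/fstab entry carrying those
options might look like the line below; the device and mount point are
placeholders, and data=writeback weakens crash consistency, so it only
makes sense on a partition that holds nothing but Bro logs.)

  /dev/sdb1  /data/bro-logs  ext4  defaults,noatime,barrier=0,data=writeback  0  2

The data= mode generally can't be changed on a live remount, so plan to
unmount and mount the filesystem again for it to take effect.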

I don't mean to hijack the thread, but I just tried cluster versus
standalone and got some interesting results. I tuned the cFlow to send
about 45 Mbps to an interface on a particular worker. Based on the
Dropped_Packets notices in notice.log, I see about a 35% drop on the
cluster for this particular worker. When I run standalone against the same
interface with the same filter I see between 0 and 1% dropped. Very odd.

Here is some output from dstat -a --mem 5

----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system-->
usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw >
 14   9  76   0   0   1|  26k  430k|   0     0 |   0     0 | 24k  132k >
 11   7  82   0   0   1|   0  8192B|  17M  141k|   0     0 | 65k   91k >
 12   6  81   0   0   1|   0  9011B|  10M   32k|   0     0 | 67k   91k >
 12   6  81   0   0   1|   0   282k|  19M  142k|   0     0 | 69k   90k >
 12   7  81   0   0   0|   0    33k|  13M   35k|   0     0 | 69k   89k >
 11   6  82   0   0   1|   0  1558k|  17M  124k|   0     0 | 63k   92k >
 12   7  81   0   0   0|   0  3403k|  11M   26k|   0     0 | 67k   92k >
 12   7  80   0   0   1|   0  9011B|  23M  158k|   0     0 | 69k   91k >
 12   6  82   0   0   1|   0  9011B|  11M   33k|   0     0 | 67k   90k >
 12   6  80   0   0   1|   0   258k|  23M  125k|   0     0 | 69k   89k >
 13   7  80   0   0   0|   0  9011B|  13M   26k|   0     0 | 66k   91k >
 11   7  81   0   0   0|   0   490k|  17M  155k|   0     0 | 66k   91k >
 12   6  81   0   0   1|   0  4980k|  10M   29k|   0     0 | 67k   91k >^C

My dstat doesn't have the --disk-tps parameter (plug-in) that Justin
mentioned. Do any of the values look too large?

Tyler

If you're running Bro on Linux without PF_RING, try increasing net.core.rmem_default:

For a 10GigE NIC:

net.core.rmem_max = 500000000
net.core.rmem_default = 500000000

For 1GigE NICs:

net.core.rmem_max = 50000000
net.core.rmem_default = 50000000

Bill Jones
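
(For the record, those sysctls can be tried on the fly and then made
persistent; the 1GigE values from above are reused here just as an example.)

  # apply immediately, without a reboot
  sysctl -w net.core.rmem_max=50000000
  sysctl -w net.core.rmem_default=50000000

  # to make them persistent, add the same two lines to /etc/sysctl.conf
  # and reload with:
  sysctl -p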

Thank you all for the suggestions.

I will begin testing the different options and email out the results.

Will