Need some thoughts from the LINUX/BRO gifted....
Hardware:
CPU: two - Intel(R) Xeon(TM) CPU 2.40GHz
MEM: 2gig
NICs: Intel(R) PRO/1000 Network Driver - version 7.3.21-k8-NAPI
We peak around 130mbps and at this time we are running around 10mbps.
No matter what speed we run at we continue to drop packets. We have
loaded pf_ring and load balanced across two NICs based on Martin's
BLOG: http://ossectools.blogspot.com/2011/09/bro-quickstart-cluster-edition.html
The only change I made was to use an additional NIC in node.cfg rather
than the same one.
I have also made the following NIC changes based on some threads I
found on SO and BRO lists.
ethtool -K eth0 rx off
ethtool -K eth0 tx off
ethtool -K eth0 sg off
ethtool -K eth0 tso off
ethtool -K eth0 gso off
ethtool -K eth0 lro off
and
echo 33554432 > /proc/sys/net/core/rmem_default
echo 33554432 > /proc/sys/net/core/rmem_max
echo 10000 > /proc/sys/net/core/netdev_max_backlog
as well as
Changed the MTU size on the NICs to match BRO
Still no love. I then went back to a standalone setup and the packet
drops are not as bad, but again we are running very low bandwidth at
this time. Any ideas? Update NIC maybe? Drop Kick G200 in dumpster!
Thanks
Tom
Can you post the contents of the files in /proc/net/pf_ring/ for the bro
processes? You should have one per bro worker.
On moderate hardware, I've found that it takes about one CPU per 100
Mb/sec, so you shouldn't be dropping at anything under that. You
probably also don't need PF_RING or any special kernel tunings at
anything less than 200-300 Mb/sec, so that shouldn't be the problem
either. When you say dropped packets, is that per the Bro drop log,
or the nic stats?
That is from the netstats via Bro. Zero dropped packets via NIC stats.
Here are some stats from this AM with the following setup.
root@ptlsecsensor1:/home/secarch# /usr/local/bro/bin/broctl capstats
Interface kpps mbps (10s average)
Via tcpdump
1995 packets captured
1995 packets received by filter
14731 packets dropped by kernel
Meant to reply all.
I stop bro and then run tcpdump on the same interface and I get no drops.
Time to retire this old gear..
worker-0: 1336126625.749682 recvd=263871 dropped=30023 link=293912
worker-1: 1336126625.997021 recvd=262510 dropped=30656 link=293227
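To put the two netstats lines above into perspective, here is a minimal sketch that turns them into a drop percentage per worker. It assumes the line layout shown above (worker name, timestamp, then recvd=, dropped=, link=) and divides dropped by link, which is only one reasonable choice of denominator. The function name drop_rates is made up for this example.

```python
import re

# The two lines copied from the netstats output above.
NETSTATS = """\
worker-0: 1336126625.749682 recvd=263871 dropped=30023 link=293912
worker-1: 1336126625.997021 recvd=262510 dropped=30656 link=293227
"""

LINE_RE = re.compile(
    r"^(\S+):\s+(\d+\.\d+)\s+recvd=(\d+)\s+dropped=(\d+)\s+link=(\d+)"
)

def drop_rates(text):
    """Return {worker: percent of link packets reported as dropped}."""
    rates = {}
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip anything that is not a netstats line
        worker, _ts, _recvd, dropped, link = m.groups()
        link = int(link)
        # Assumption: use the link counter as the denominator.
        rates[worker] = 100.0 * int(dropped) / link if link else 0.0
    return rates

for worker, pct in drop_rates(NETSTATS).items():
    print(f"{worker}: {pct:.1f}% dropped")
```

Both workers come out at roughly 10 percent by this measure.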
Are you running "misc/capture-loss"? That should provide a much more holistic view of packet loss because it's not relying on anything other than characteristics of the actual traffic to tell you if packets are being lost. It doesn't tell you where the packet loss is happening and could mean a very large number of things, but it's a good place to start.
We were unsure as the documentation mentioned 80mbps per CPU, so we
thought we would give pf_ring a run. But at these rates I would not
think we would see drops.
I was really conflicted when I wrote 80Mbps in that documentation. There is really no good way to figure out what that will be. With reasonably fast, modern Xeon CPUs people seem to be getting ~150Mbps per core now, but you need to take that value with a grain of salt too, since it depends so heavily on your traffic mix.
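As a rough illustration of how the per-core figures quoted in this thread (80, 100 and ~150 Mbps) translate into core counts for the 130 mbps peak mentioned at the top, here is a small back-of-the-envelope sketch. The helper name cores_needed is made up, and the result is only as good as the per-core guess, which depends heavily on traffic mix and CPU generation.

```python
import math

def cores_needed(peak_mbps, mbps_per_core):
    """Whole cores needed for a given peak rate under a per-core rule of thumb."""
    return math.ceil(peak_mbps / mbps_per_core)

PEAK_MBPS = 130  # peak rate reported in the original post

# Per-core figures mentioned in this thread; none of them is a guarantee.
for per_core in (80, 100, 150):
    print(f"{per_core} Mbps/core -> {cores_needed(PEAK_MBPS, per_core)} core(s)")
```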
Is netstats not telling the truth?
That question is really hard to answer, especially if you are running pf_ring where the normal Linux packet processing pipeline is being bypassed.
We are just trying to get an idea of what this old IBM hardware can
do for us and are running into this.
You didn't mention that it's old hardware. What's the architecture? How many cores does the box have total?
.Seth
Thanks for the reply, Seth. I was out of pocket all day yesterday, but I will load up capture-loss and see what other details we get. We've got something weird going on. We have another G200 in another location seeing about the same amount of traffic, and it does the same thing. It is not running pf_ring. The boxes, I believe, are dual-processor Xeons, old technology, but there is no reason they should not handle the traffic. I might switch OS, as Ubuntu server sometimes has flaky NIC drivers, then load up Bro by itself and see how that goes.
Thanks again
Tom
Well I finally got some time to work on this dude.
I started with a fresh build of Ubuntu 10.10 server all up to snuff.
Loaded only Bro with a single tap. Drops started right off the bat.
So I updated my intel driver to the latest and restarted bro. Drops
still happening. I loaded capture-loss and I assume you wanted some
data out of the notice.log about the packet drops?
Here is a small snippet of a couple. They are pretty frequent.
1336583347.593837 - - - - - - PacketFilter::Dropped_Packets 93989 packets dropped after filtering, 249946 received, 249946 on link - - - - bro Notice::ACTION_LOG 6 3600.000000 F - - - - - - - -
1336583357.594487 - - - - - - PacketFilter::Dropped_Packets 73508 packets dropped after filtering, 227808 received, 227808 on link - - - - bro Notice::ACTION_LOG 6 3600.000000 F - - - - - - - -
1336583367.594936 - - - - - - PacketFilter::Dropped_Packets 82349 packets dropped after filtering, 234476 received, 234476 on link - - - - bro Notice::ACTION_LOG 6 3600.000000 F - - - - - - - -
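For a quick feel of how bad those notices are, here is a minimal sketch that pulls the counters out of the Dropped_Packets message text and reports dropped as a fraction of received. The message strings are copied from the notice.log entries above. Dividing by the received counter is an assumption made for this example, and the function name dropped_fraction is hypothetical.

```python
import re

# Message text copied from the three notice.log entries above.
MESSAGES = [
    "93989 packets dropped after filtering, 249946 received, 249946 on link",
    "73508 packets dropped after filtering, 227808 received, 227808 on link",
    "82349 packets dropped after filtering, 234476 received, 234476 on link",
]

MSG_RE = re.compile(
    r"(\d+) packets dropped after filtering, (\d+) received, (\d+) on link"
)

def dropped_fraction(message):
    """Return dropped/received for one Dropped_Packets notice message."""
    m = MSG_RE.search(message)
    if m is None:
        raise ValueError(f"not a Dropped_Packets message: {message!r}")
    dropped, received, _link = map(int, m.groups())
    # Assumption: 'received' is the denominator we care about.
    return dropped / received if received else 0.0

for msg in MESSAGES:
    print(f"{100.0 * dropped_fraction(msg):.1f}% of received packets dropped")
```

By this measure each 10 second interval shows a drop ratio somewhere above 30 percent.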
Current traffic on the monitor port:
Interface kpps mbps (10s average)
Just for giggles and cause I can. I am upgrading to latest Ubuntu on
the same box. WTH
You will have a new log named capture_loss.log. Could we see some lines from that?
.Seth
hmmm...I did not see that one. I am in the middle of an upgrade. Soon
as that is done I will send it along. Maybe the script did not load.
Thanks Seth
Tom
Well new version of Ubuntu:
Distributor ID: Ubuntu
Description: Ubuntu 11.04
Release: 11.04
Codename: natty
I do not see that log file being created. I do see:
/usr/local/bro/share/bro/policy/misc/capture-loss.bro
in loaded_scripts.log
Do I feed it to the birds now?
Tom
It takes 5 minutes before the log shows up.
.Seth
Well I just checked again and have the following in the file.
#separator \x09
#set_separator ,
#empty_field (empty)
#unset_field -
#path capture_loss
#fields ts ts_delta peer gaps acks percent_lost
#types time interval string count count string
1336608708.135106 900.000206 bro 996 721708 0.138%
1336609608.135122 900.000016 bro 805 705801 0.114%
Now that actually looks really nice. Did you say that you are running PF_Ring? I trust the data from the NIC even less when using any of the various things that bypass the normal OS data flow (but I'm not saying that's a bad thing!).
.Seth
hehe
Well that does seem exciting, but at the time we were running around
13mbps and no we are not running pf_ring. Here is a snippet of the log
when we were running close to 100mbps.
#separator \x09
#set_separator ,
#empty_field (empty)
#unset_field -
#path capture_loss
#fields ts ts_delta peer gaps acks percent_lost
#types time interval string count count string
1336586727.588158 900.000168 bro 289518 644040 44.953%
1336587627.588220 900.000062 bro 306102 746812 40.988%
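In the rows posted in this thread, percent_lost matches gaps divided by acks, times 100. Here is a minimal sketch that reads the #fields header, parses the data rows, and recomputes the percentage to compare against the logged value. It splits on any whitespace so it works on the rows as pasted here (the real log is tab separated), and the function name parse_capture_loss is made up for this example.

```python
# Header and data rows copied from the two capture_loss.log snippets in this thread.
SAMPLE = """\
#fields ts ts_delta peer gaps acks percent_lost
1336608708.135106 900.000206 bro 996 721708 0.138%
1336609608.135122 900.000016 bro 805 705801 0.114%
1336586727.588158 900.000168 bro 289518 644040 44.953%
1336587627.588220 900.000062 bro 306102 746812 40.988%
"""

def parse_capture_loss(text):
    """Return a list of dicts, one per data row, keyed by the #fields names."""
    fields = []
    rows = []
    for line in text.splitlines():
        if line.startswith("#fields"):
            fields = line.split()[1:]
        elif line and not line.startswith("#"):
            rows.append(dict(zip(fields, line.split())))
    return rows

for row in parse_capture_loss(SAMPLE):
    gaps, acks = int(row["gaps"]), int(row["acks"])
    computed = 100.0 * gaps / acks
    print(f'{row["ts"]}: logged {row["percent_lost"]}, gaps/acks = {computed:.3f}%')
```

The first two rows are from the quiet period around 13mbps and the last two from the period near 100mbps, which is where the loss jumps to over 40 percent.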
Ah, ok. So you said that it's an old dual Xeon box? Does that mean you have 2 cores total?
It's very likely that your box is just underpowered if you only have two cores and it's an older cpu architecture.
.Seth
yeah I suspect you are correct. we are at 91mbps atm and bro is
consuming pretty much one whole CPU.
Thanks for all your help peops. Time to throw a bigger horse at it!
Tom