Hello everyone:
I have Bro installed on two Dell R720s, each with the following specs…
Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz x 32
48GB RAM
Running: CentOS 6.5
Both have the following PF_RING configuration:
PF_RING Version : 6.0.2 ($Revision: 7746$)
Total rings : 16
Standard (non DNA) Options
Ring slots : 32768
Slot version : 15
Capture TX : No [RX only]
IP Defragment : No
Socket Mode : Standard
Transparent mode : Yes [mode 0]
Total plugins : 0
Cluster Fragment Queue : 1917
Cluster Fragment Discard : 26648
The only PF_RING difference between the two is the revision: Server A is on revision 7601, while Server B is on revision 7746.
I’ve tuned the NIC to the following settings…
ethtool -K p4p2 tso off
ethtool -K p4p2 gro off
ethtool -K p4p2 lro off
ethtool -K p4p2 gso off
ethtool -K p4p2 rx off
ethtool -K p4p2 tx off
ethtool -K p4p2 sg off
ethtool -K p4p2 rxvlan off
ethtool -K p4p2 txvlan off
ethtool -N p4p2 rx-flow-hash udp4 sdfn
ethtool -N p4p2 rx-flow-hash udp6 sdfn
ethtool -n p4p2 rx-flow-hash udp6
ethtool -n p4p2 rx-flow-hash udp4
ethtool -C p4p2 rx-usecs 1000
ethtool -C p4p2 adaptive-rx off
ethtool -G p4p2 rx 4096
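The `rx-flow-hash udp4 sdfn` setting tells the NIC to hash on source/destination IP plus source/destination port (s, d, f, n in ethtool's terms) when spreading packets across RX rings. A rough illustrative model of that behavior in Python — this is not the NIC's actual Toeplitz hash, and note that a simple hash like this is not symmetric (swapping src/dst changes the result), whereas for an IDS you ultimately want both directions of a flow on the same worker, which is what PF_RING's clustering provides:

```python
# Sketch of sdfn-style RX flow hashing: the 4-tuple determines the ring,
# so every packet of a given flow lands on the same ring (and thus the
# same Bro worker). Illustrative only -- not the real NIC hash function.
import hashlib

NUM_RINGS = 16  # matches "Total rings: 16" / lb_procs=16

def ring_for_flow(src_ip, dst_ip, src_port, dst_port):
    """Hash the 4-tuple to a ring index in [0, NUM_RINGS)."""
    key = "{}|{}|{}|{}".format(src_ip, dst_ip, src_port, dst_port).encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:4], "big") % NUM_RINGS

# Same flow -> same ring, every time:
flow = ("10.0.0.1", "192.168.1.5", 49152, 443)
assert ring_for_flow(*flow) == ring_for_flow(*flow)

# Many distinct flows spread across the rings:
rings = {ring_for_flow("10.0.0.%d" % i, "192.168.1.5", 40000 + i, 443)
         for i in range(200)}
print(len(rings))  # with 200 flows, nearly all 16 rings are hit
```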
I’ve got the following sysctl settings on each.
# turn off selective ACK and timestamps
net.ipv4.tcp_sack = 0
net.ipv4.tcp_timestamps = 0
# memory allocation min/pressure/max for the
# read buffer, write buffer, and total buffer space
net.ipv4.tcp_rmem = 10000000 10000000 10000000
net.ipv4.tcp_wmem = 10000000 10000000 10000000
net.ipv4.tcp_mem = 10000000 10000000 10000000
net.core.rmem_max = 524287
net.core.wmem_max = 524287
net.core.rmem_default = 524287
net.core.wmem_default = 524287
net.core.optmem_max = 524287
net.core.netdev_max_backlog = 300000
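One unit detail worth double-checking in the values above: `tcp_rmem`/`tcp_wmem` are in bytes, but `tcp_mem` is measured in pages (4096 bytes on x86_64), so the same number means very different things in each. A quick sanity calculation, assuming a 4 KB page size:

```python
# Units sanity check for the sysctl values above.
# tcp_rmem / tcp_wmem are in bytes; tcp_mem is in pages.
PAGE_SIZE = 4096  # bytes per page on x86_64

tcp_rmem_max_bytes = 10000000          # per-socket read buffer, in bytes
tcp_mem_max_pages = 10000000           # total TCP memory threshold, in pages
tcp_mem_max_bytes = tcp_mem_max_pages * PAGE_SIZE

print(round(tcp_rmem_max_bytes / 2**20, 1))  # ~9.5 MiB per socket
print(round(tcp_mem_max_bytes / 2**30, 1))   # ~38.1 GiB total TCP ceiling
```

On a 48 GB box, a ~38 GiB `tcp_mem` ceiling effectively means memory pressure never kicks in, which may or may not be the intent here.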
Each Bro node configuration is using the following…
[manager]
type=manager
host=localhost
[proxy-1]
type=proxy
host=localhost
[worker-1]
type=worker
host=localhost
interface=p4p2
lb_method=pf_ring
lb_procs=16
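With `lb_procs=16`, broctl spawns 16 load-balanced worker processes sharing the PF_RING cluster, plus the manager and proxy. A quick core-budget sketch (simple arithmetic, assuming one process per logical core and ignoring the child processes Bro forks):

```python
# Process-count / core-budget sketch for the node.cfg above.
logical_cores = 32   # E5-2670 @ 2.60GHz x 32 (from the specs)
lb_procs = 16        # PF_RING load-balanced workers
manager = 1
proxy = 1

bro_processes = manager + proxy + lb_procs
print(bro_processes)                   # 18 Bro processes total
print(logical_cores - bro_processes)   # 14 logical cores of headroom
```

That headroom is what would be left for the other applications mentioned at the end of this post.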
Both have the same NIC driver version (ixgbe):
3.15.1-k
Same services installed (min install).
Slightly different Kernel versions…
Server A (2.6.32-431.11.2.el6.x86_64)
Server B (2.6.32-431.17.1.el6.x86_64)
At the moment Server A is seeing about 700 Mb/s and Server B is seeing about 600 Mb/s.
What I don’t understand is why Server A shows markedly better performance than Server B.
TOP from A (included a few bro workers):
top - 12:48:45 up 1 day, 17:03, 2 users, load average: 5.30, 3.99, 3.13
Tasks: 706 total, 19 running, 687 sleeping, 0 stopped, 0 zombie
Cpu(s): 33.9%us, 6.6%sy, 1.1%ni, 57.2%id, 0.0%wa, 0.0%hi, 1.2%si, 0.0%st
Mem: 49376004k total, 33605828k used, 15770176k free, 93100k buffers
Swap: 2621432k total, 9760k used, 2611672k free, 9206880k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
5768 root 20 0 1808m 1.7g 519m R 100.0 3.6 32:24.92 bro
5760 root 20 0 1688m 1.6g 519m R 99.7 3.4 34:08.36 bro
3314 root 20 0 2160m 269m 4764 R 96.1 0.6 30:14.12 bro
5754 root 20 0 1451m 1.4g 519m R 82.8 2.9 36:40.02 bro
TOP from B (included a few bro workers):
top - 12:49:33 up 14:24, 2 users, load average: 10.28, 9.31, 8.06
Tasks: 708 total, 25 running, 683 sleeping, 0 stopped, 0 zombie
Cpu(s): 41.6%us, 6.0%sy, 1.0%ni, 50.4%id, 0.0%wa, 0.0%hi, 1.1%si, 0.0%st
Mem: 49376004k total, 31837340k used, 17538664k free, 147212k buffers
Swap: 2621432k total, 0k used, 2621432k free, 13494332k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
3178 root 20 0 1073m 1.0g 264m R 100.0 2.1 401:47.31 bro
3188 root 20 0 881m 832m 264m R 100.0 1.7 377:48.90 bro
3189 root 20 0 1247m 1.2g 264m R 100.0 2.5 403:22.95 bro
3193 root 20 0 920m 871m 264m R 100.0 1.8 429:45.98 bro
Both have the same number of Bro workers. I just do not understand why Server A runs at literally half the utilization while seeing more traffic. The only real and consistent difference between the two I see is that Server A has about twice the SHR (shared memory) of Server B.
Could this be part of the issue, if not the root cause? How might I go about rectifying the issue?
FWIW, neither is dropping packets, and both are keeping up. However, I want to run other apps on top of this, and Server B's poorer performance will likely affect them.
Thanks in advance for the advice!
-Jason