Unexplained Performance Differences Between Like Servers

Hello everyone:

I have Bro installed on two Dell R720s, each with the following specs:

Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz x 32
48GB RAM

Running: CentOS 6.5

Both have the following PF_RING configuration:

PF_RING Version : 6.0.2 ($Revision: 7746$)
Total rings : 16
Standard (non DNA) Options
Ring slots : 32768
Slot version : 15
Capture TX : No [RX only]
IP Defragment : No
Socket Mode : Standard
Transparent mode : Yes [mode 0]
Total plugins : 0
Cluster Fragment Queue : 1917
Cluster Fragment Discard : 26648

The only difference in PF_RING is that Server A is running revision 7601, while Server B is running revision 7746.

I’ve tuned the NIC to the following settings…

ethtool -K p4p2 tso off
ethtool -K p4p2 gro off
ethtool -K p4p2 lro off
ethtool -K p4p2 gso off
ethtool -K p4p2 rx off
ethtool -K p4p2 tx off
ethtool -K p4p2 sg off
ethtool -K p4p2 rxvlan off
ethtool -K p4p2 txvlan off
ethtool -N p4p2 rx-flow-hash udp4 sdfn
ethtool -N p4p2 rx-flow-hash udp6 sdfn
ethtool -n p4p2 rx-flow-hash udp6
ethtool -n p4p2 rx-flow-hash udp4
ethtool -C p4p2 rx-usecs 1000
ethtool -C p4p2 adaptive-rx off
ethtool -G p4p2 rx 4096

I’ve got the following sysctl settings on each.

# turn off selective ACK and timestamps
net.ipv4.tcp_sack = 0
net.ipv4.tcp_timestamps = 0

# memory allocation min/pressure/max.
# read buffer, write buffer, and buffer space

net.ipv4.tcp_rmem = 10000000 10000000 10000000
net.ipv4.tcp_wmem = 10000000 10000000 10000000
net.ipv4.tcp_mem = 10000000 10000000 10000000
net.core.rmem_max = 524287
net.core.wmem_max = 524287
net.core.rmem_default = 524287
net.core.wmem_default = 524287
net.core.optmem_max = 524287
net.core.netdev_max_backlog = 300000

Each bro configuration is using the following…

[manager]
type=manager
host=localhost
[proxy-1]
type=proxy
host=localhost
[worker-1]
type=worker
host=localhost
interface=p4p2
lb_method=pf_ring
lb_procs=16

Both have the same NIC driver version (ixgbe):
3.15.1-k

Same services installed (min install).

Slightly different Kernel versions…
Server A (2.6.32-431.11.2.el6.x86_64)
Server B (2.6.32-431.17.1.el6.x86_64)

At the moment Server A is getting about 700 Mb/s and Server B is getting about 600 Mb/s.

What I don't understand is why Server A is performing so much better than Server B.

TOP from A (included a few bro workers):

top - 12:48:45 up 1 day, 17:03, 2 users, load average: 5.30, 3.99, 3.13
Tasks: 706 total, 19 running, 687 sleeping, 0 stopped, 0 zombie
Cpu(s): 33.9%us, 6.6%sy, 1.1%ni, 57.2%id, 0.0%wa, 0.0%hi, 1.2%si, 0.0%st
Mem: 49376004k total, 33605828k used, 15770176k free, 93100k buffers
Swap: 2621432k total, 9760k used, 2611672k free, 9206880k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
5768 root 20 0 1808m 1.7g 519m R 100.0 3.6 32:24.92 bro
5760 root 20 0 1688m 1.6g 519m R 99.7 3.4 34:08.36 bro
3314 root 20 0 2160m 269m 4764 R 96.1 0.6 30:14.12 bro
5754 root 20 0 1451m 1.4g 519m R 82.8 2.9 36:40.02 bro

TOP from B (included a few bro workers)

top - 12:49:33 up 14:24, 2 users, load average: 10.28, 9.31, 8.06
Tasks: 708 total, 25 running, 683 sleeping, 0 stopped, 0 zombie
Cpu(s): 41.6%us, 6.0%sy, 1.0%ni, 50.4%id, 0.0%wa, 0.0%hi, 1.1%si, 0.0%st
Mem: 49376004k total, 31837340k used, 17538664k free, 147212k buffers
Swap: 2621432k total, 0k used, 2621432k free, 13494332k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
3178 root 20 0 1073m 1.0g 264m R 100.0 2.1 401:47.31 bro
3188 root 20 0 881m 832m 264m R 100.0 1.7 377:48.90 bro
3189 root 20 0 1247m 1.2g 264m R 100.0 2.5 403:22.95 bro
3193 root 20 0 920m 871m 264m R 100.0 1.8 429:45.98 bro

Both have the same number of Bro workers. I just do not understand why Server A has roughly half the utilization while also seeing more traffic. The only real and consistent difference I see between the two is that Server A seems to have twice as much SHR (shared memory) as Server B.

Could this be part of the issue, if not the root cause? How might I go about rectifying the issue?

FWIW, neither server is dropping packets and both are doing well. However, I want to run other apps on top of this, and the poor performance on Server B is likely to affect them.

Thanks in advance for the advice!

-Jason

Hey Jason,

What version of Bro are you running, and is it the same on both? I had some serious resource issues with one of the beta versions recently, and switched back to the stable version.

-John

Hi Jason:

Is the type of traffic in the 600 Mbps stream similar to the type of traffic in the 700 Mbps stream?

Cheers,
Gilbert Clark

Different traffic profiles can cause different performance. My guess is that you are seeing more traffic of a certain type on one of the boxes than on the other. To really know, you would need to profile the traffic, but if you think about it, more HTTP traffic, for instance, would mean more file processing, etc.

Mike

> At the moment Server A is getting about 700 Mb/s and Server B is getting about
> 600 Mb/s.

> What I don't understand is why Server A is performing so much better than
> Server B.

> TOP from A (included a few bro workers):

> top - 12:48:45 up 1 day, 17:03, 2 users, load average: 5.30, 3.99, 3.13
> Tasks: 706 total, 19 running, 687 sleeping, 0 stopped, 0 zombie
> Cpu(s): 33.9%us, 6.6%sy, 1.1%ni, 57.2%id, 0.0%wa, 0.0%hi, 1.2%si, 0.0%st
> Mem: 49376004k total, 33605828k used, 15770176k free, 93100k buffers
> Swap: 2621432k total, 9760k used, 2611672k free, 9206880k cached
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 5768 root 20 0 1808m 1.7g 519m R 100.0 3.6 32:24.92 bro
> 5760 root 20 0 1688m 1.6g 519m R 99.7 3.4 34:08.36 bro
> 3314 root 20 0 2160m 269m 4764 R 96.1 0.6 30:14.12 bro
> 5754 root 20 0 1451m 1.4g 519m R 82.8 2.9 36:40.02 bro

Server A Bro CPU utilization = 378.6%

> TOP from B (included a few bro workers)

> top - 12:49:33 up 14:24, 2 users, load average: 10.28, 9.31, 8.06
> Tasks: 708 total, 25 running, 683 sleeping, 0 stopped, 0 zombie
> Cpu(s): 41.6%us, 6.0%sy, 1.0%ni, 50.4%id, 0.0%wa, 0.0%hi, 1.1%si, 0.0%st
> Mem: 49376004k total, 31837340k used, 17538664k free, 147212k buffers
> Swap: 2621432k total, 0k used, 2621432k free, 13494332k cached
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 3178 root 20 0 1073m 1.0g 264m R 100.0 2.1 401:47.31 bro
> 3188 root 20 0 881m 832m 264m R 100.0 1.7 377:48.90 bro
> 3189 root 20 0 1247m 1.2g 264m R 100.0 2.5 403:22.95 bro
> 3193 root 20 0 920m 871m 264m R 100.0 1.8 429:45.98 bro

> Both have the same number of Bro workers. I just do not understand why Server
> A has roughly half the utilization while also seeing more traffic. The only
> real and consistent difference I see between the two is that Server A seems to
> have twice as much SHR (shared memory) as Server B.

Server B Bro cpu utilization = 400%
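
The two utilization figures above are simply the %CPU column added up over the bro rows that were pasted. A minimal Python sketch of that arithmetic, assuming the 12-column top layout shown in the listings; the function name total_cpu is made up for illustration, and since only four of the sixteen workers were pasted this is a sample rather than a full total:

```python
# Sum the %CPU column for bro processes in a pasted top listing.
# Assumed column layout (as in the listings in this thread):
# PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND

def total_cpu(top_rows, command="bro"):
    """Return the summed %CPU of rows whose COMMAND column equals `command`."""
    total = 0.0
    for row in top_rows.strip().splitlines():
        fields = row.split()
        # Skip headers or anything that is not a 12-column process row.
        if len(fields) == 12 and fields[11] == command:
            total += float(fields[8])
    return total

SERVER_A = """
5768 root 20 0 1808m 1.7g 519m R 100.0 3.6 32:24.92 bro
5760 root 20 0 1688m 1.6g 519m R 99.7 3.4 34:08.36 bro
3314 root 20 0 2160m 269m 4764 R 96.1 0.6 30:14.12 bro
5754 root 20 0 1451m 1.4g 519m R 82.8 2.9 36:40.02 bro
"""

SERVER_B = """
3178 root 20 0 1073m 1.0g 264m R 100.0 2.1 401:47.31 bro
3188 root 20 0 881m 832m 264m R 100.0 1.7 377:48.90 bro
3189 root 20 0 1247m 1.2g 264m R 100.0 2.5 403:22.95 bro
3193 root 20 0 920m 871m 264m R 100.0 1.8 429:45.98 bro
"""

print("Server A:", round(total_cpu(SERVER_A), 1))
print("Server B:", round(total_cpu(SERVER_B), 1))
```

Feeding it the complete top output for all sixteen workers would give a fairer comparison than the four rows shown here.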

Are you only running 4 workers or did you truncate the output? Is that
running at 100% 24/7 or does it vary with the traffic?

Are you doing 4-tuple load balancing or 2-tuple load balancing between
the two servers? Most likely Server B is seeing more flows.
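
To illustrate the 2-tuple versus 4-tuple question: a 2-tuple balancer keys on the source and destination addresses only, while a 4-tuple balancer also includes the ports. The Python sketch below is purely illustrative. The Flow type, the function names, the CRC32 hash and the sample addresses are all assumptions made for the example, and it does not reproduce the hash used by PF_RING or by whatever device splits traffic between the two servers.

```python
import zlib
from collections import namedtuple

# Hypothetical flow record used only for this illustration.
Flow = namedtuple("Flow", "src sport dst dport")

def _bucket(endpoints, n_buckets):
    """Hash two endpoint strings symmetrically into one of n_buckets."""
    # Sorting makes both directions of a connection produce the same key.
    key = "|".join(sorted(endpoints)).encode()
    return zlib.crc32(key) % n_buckets

def two_tuple_bucket(flow, n_buckets):
    """Balance on (source address, destination address) only."""
    return _bucket([flow.src, flow.dst], n_buckets)

def four_tuple_bucket(flow, n_buckets):
    """Balance on (source address, source port, destination address, destination port)."""
    return _bucket([f"{flow.src}:{flow.sport}", f"{flow.dst}:{flow.dport}"], n_buckets)

# Made-up traffic: one client opening 1000 connections to one server.
flows = [Flow("10.0.0.1", 40000 + i, "10.0.0.2", 80) for i in range(1000)]

two = {two_tuple_bucket(f, 2) for f in flows}
four = {four_tuple_bucket(f, 2) for f in flows}
print("2-tuple buckets used:", len(two))
print("4-tuple buckets used:", len(four))
```

With a 2-tuple key, every connection between the same pair of hosts lands in the same bucket, so a single busy host pair can load one server much more heavily than the other. A 4-tuple key can spread those connections across buckets.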

Wow, thanks for all the quick replies :)

> What version of Bro are you running, and is it the same on both?

I am using the same version of Bro for each server (1.2).

> Is the type of traffic in the 600 Mbps stream similar to the type of traffic in the 700 Mbps stream?

I’m not 100% sure but I think that is a really good question to ask. Do you know of any good tools that might help inform an answer? I know of iptraf for example, is there one that folks generally prefer the most?

> Are you only running 4 workers or did you truncate the output?
Yes, I truncated the output to show four workers each (I have 16 total).

> Are you doing 4 tuple load balancing or 2 tuple load balancing between the two servers?

Sorry, I am not sure what you mean by this or what the implications of one over the other are. Is there an easy way I can find out (I am kinda new to this)? I agree that B is likely receiving more flows.

Thanks!
Jason

FWIW:

I just ran iptraf for a bit on both and one thing really stuck out to me:

Server A:
Other IP: 5273 633087 5273 633087 0 0

Server B:
Other IP: 952797 445867K 952797 445867K 0 0

So server A is seeing 633087 bytes of ‘other’ traffic, while B is seeing 445867 kilobytes of ‘other’ traffic. Do you think this other traffic could be the root cause of the issues here? If so, would a bpf filter looking for only tcp/udp/ipv4 traffic be sufficient? How might I apply that within Bro?

Here is the full view taken some time after the metrics above:

Server A:

              Total     Total  Incoming  Incoming  Outgoing  Outgoing
            Packets     Bytes   Packets     Bytes   Packets     Bytes
Total:     80187229    51270M  80187229    51270M         0         0
IPv4:      80187193    50026M  80187193    50026M         0         0
IPv6:            36      1296        36      1296         0         0
TCP:       70040618    47342M  70040618    47342M         0         0
UDP:       10052947     2676M  10052947     2676M         0         0
ICMP:         85189   6652550     85189   6652550         0         0
Other IP:      8475   1060993      8475   1060993         0         0

Server B:

              Total     Total  Incoming  Incoming  Outgoing  Outgoing
            Packets     Bytes   Packets     Bytes   Packets     Bytes
Total:     89718860    53317M  89718860    53317M         0         0
IPv4:      89712988    51882M  89712988    51882M         0         0
IPv6:          5872     51778      5872     51778         0         0
TCP:       79615124    49170M  79615124    49170M         0         0
UDP:        7627607     1682M   7627607     1682M         0         0
ICMP:         86620   5619078     86620   5619078         0         0
Other IP:   2389509     1023M   2389509     1023M         0         0
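
To put the "Other IP" counters in proportion, they can be expressed as a share of each server's totals. A small Python sketch using the figures from the two iptraf listings above; the helper names are made up, and treating the K and M suffixes as decimal multipliers is an assumption about how iptraf abbreviates its counters:

```python
# Assumption: iptraf's K/M suffixes are treated as decimal multipliers here.
SUFFIXES = {"K": 1_000, "M": 1_000_000}

def parse_count(text):
    """Convert a counter such as '8475' or '1023M' into an integer."""
    if text[-1] in SUFFIXES:
        return int(text[:-1]) * SUFFIXES[text[-1]]
    return int(text)

def share(part, total):
    """Return `part` as a percentage of `total`."""
    return 100.0 * parse_count(part) / parse_count(total)

# (packets, bytes) copied from the iptraf output posted in this thread.
TOTALS = {"A": ("80187229", "51270M"), "B": ("89718860", "53317M")}
OTHER_IP = {"A": ("8475", "1060993"), "B": ("2389509", "1023M")}

for server in ("A", "B"):
    pkt_pct = share(OTHER_IP[server][0], TOTALS[server][0])
    byte_pct = share(OTHER_IP[server][1], TOTALS[server][1])
    print(f"Server {server}: Other IP = {pkt_pct:.2f}% of packets, "
          f"{byte_pct:.2f}% of bytes")
```

On these numbers, "Other IP" is roughly 2.7% of packets on Server B versus about 0.01% on Server A. That is a clear difference between the two feeds, though by itself it does not show that this traffic is what costs the extra CPU.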

Many thanks in advance for the quick and helpful replies!

Hi Jason:

>> Is the type of traffic in the 600 Mbps stream similar to the type of traffic in the 700 Mbps stream?
> I'm not 100% sure but I think that is a really good question to ask. Do you know of any good tools that might help inform an answer? I know of iptraf for example, is there one that folks generally prefer the most?

Bro ships with a utility called 'trace-summary' that prints some useful information about a trace. It is written in Python (I think 2.6+ should work fine, though someone can feel free to correct me if I'm wrong :). An example run with its output is available below at [1], though the formatting is terrible without a monospace font.

Note that it's possible to run trace-summary against either bro log files (-C option) *or* a captured trace, so capturing a trace is not necessarily required in order to use the tool. Additionally, if it's desired to run trace-summary against a trace directly, ipsumdump is required (http://www.read.seas.harvard.edu/~kohler/ipsumdump/).

Cheers,
Gilbert

[1] clarkg1-osx:trace-summary clarkg1$ python trace-summary ~/net.2009.12.06.1159.dmp

>== Total === 2009-12-06-15-00-10 - 2009-12-07-14-59-40
    - Bytes 150.6m - Payload 144.1m - Pkts 169.0k - Frags 88.5% - MBit/s 0.0 -
      Ports        | Sources                | Destinations           | Services | Protocols |
      80     88.0% | 198.189.255.76   22.5% | 192.168.1.103    45.0% |   100.0% | 6   90.5% |
      1119   23.0% | 192.168.1.105    13.2% | 192.168.1.105    19.6% |          | 17   8.2% |
      1115    9.6% | 192.168.1.103    12.9% | 198.189.255.76    4.1% |          | 1    0.0% |
      1817    5.9% | 198.189.255.74   10.8% | 192.168.1.255     2.3% |          |           |
      49638   3.5% | 151.207.243.129   3.8% | 198.189.255.74    2.3% |          |           |
      1117    3.4% | 192.168.1.1       3.4% | 151.207.243.129   2.2% |          |           |
      626     3.4% | 74.125.164.32     2.8% | 192.168.1.1       1.8% |          |           |
      137     3.2% | 74.125.164.91     2.1% | 224.0.0.1         1.7% |          |           |
      53      2.9% | 192.168.1.104     1.9% | 192.168.1.104     1.5% |          |           |
      49378   2.5% | 192.168.1.106     1.6% | 0.0.0.0           1.3% |          |           |

First: 2009-12-06-15-00-10 (1260129610.426233) Last: 2009-12-07-14-59-40 1260215980.426237

Hi Jason:

I believe one way to set a BPF filter is to modify site/local.bro to include:

redef cmd_line_bpf_filter = "ip or not ip";

I think there's also a packet filter framework (http://www.bro.org/sphinx/scripts/base/frameworks/packet-filter/main.html) which supports more elaborate filtering schemes, but I don't really know much about it offhand :)

Regarding the “other” traffic being the root cause of the issues: I think it’s pretty difficult to say. A few ideas:

  • Check the size of the log files for significant differences. If http.log / reporter.log / weird.log / etc. is much longer on one system than on the other, that might be a place to start looking.
  • Try setting a filter to only accept a certain type of traffic (e.g. HTTP, SSH) to see the relative load for that specific traffic type.
  • Try playing with which scripts Bro loads (e.g. tweak local.bro and/or try running Bro in bare mode with a very small set of loaded scripts) to see if that has any effect.
  • Bro can be told to dump performance statistics into a human-readable ASCII log by including the "misc/profiling.bro" script; some of the information included there might be useful to have.
  • Try capturing a trace and playing that trace back to a standalone Bro process. Using tools like 'time' and 'perf' could help identify how performance changes based on the trace and the scripts currently being loaded.
    ◦ This has the benefit of not dropping packets while scripts are being tweaked.
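
As a rough way to carry out the first suggestion (comparing log file sizes), the Python sketch below lists the *.log files whose sizes differ by some factor between two directories. The function names and the 2x threshold are arbitrary choices for illustration, and it assumes the logs from both servers have been copied somewhere they can be read side by side.

```python
import os

def log_sizes(directory):
    """Map each *.log file name in `directory` to its size in bytes."""
    sizes = {}
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        if name.endswith(".log") and os.path.isfile(path):
            sizes[name] = os.path.getsize(path)
    return sizes

def compare_logs(dir_a, dir_b, min_ratio=2.0):
    """Return (name, size_a, size_b) for logs whose sizes differ by >= min_ratio.

    A log that exists (non-empty) on only one side is always reported.
    """
    a, b = log_sizes(dir_a), log_sizes(dir_b)
    flagged = []
    for name in sorted(set(a) | set(b)):
        size_a, size_b = a.get(name, 0), b.get(name, 0)
        small, large = sorted((size_a, size_b))
        if large > 0 and (small == 0 or large / small >= min_ratio):
            flagged.append((name, size_a, size_b))
    return flagged
```

Calling compare_logs() with the two log directories (for the same time window on each server) returns the files worth a closer look; a weird.log or http.log that is several times larger on one side would stand out immediately.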

As some food for thought: in general, bro does a few things every time there’s a new packet:

  • Retrieve the packet from the NIC
  • Dissect the packet and generate events
  • Spend time in script-land processing events that have been generated
  • Spend time handling administrative overhead (e.g. check timers, check triggers)

Thus, in general, making bro go faster is probably going to mean making one of those things take less time.

Anyway, hope something in there is useful :)

Cheers,
Gilbert

Thanks for the reply, Gilbert! I will take a closer look at these items tomorrow, but off the top of my head I do not recall there being any great difference in the size of the log files. On the surface (using iptraf), it seems like there is a significant amount of non-IP traffic, so I modified local.bro to include the following:

redef cmd_line_bpf_filter = "ip";

I am hoping that it has the desired effect.

One other question I had was about the effect of TCP sequence randomization on performance (if it was enabled on an ASA, for example). What impact would this have on flows (presumably a large increase)? How might I best quantify the number of flows being processed compared to the other server?

Sorry for all the questions, I am very much a novice at this, but very willing to learn so I appreciate the help so far!
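
On the question of quantifying flows: one option is to count the entries Bro writes to conn.log on each server over the same time window. The Python sketch below assumes Bro's default tab-separated ASCII log format, where a "#fields" header line names the columns. The function name and the sample records are made up for illustration.

```python
from collections import Counter

def count_flows(lines):
    """Count conn.log records per transport protocol.

    Assumes the tab-separated ASCII format with a '#fields' header line.
    """
    fields = []
    per_proto = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("#fields"):
            # Column names follow the '#fields' token.
            fields = line.split("\t")[1:]
            continue
        if not line or line.startswith("#"):
            continue  # other header/footer lines such as #separator or #close
        record = dict(zip(fields, line.split("\t")))
        per_proto[record.get("proto", "unknown")] += 1
    return per_proto

# Made-up sample records, trimmed to a handful of columns.
SAMPLE = [
    "#separator \\x09",
    "#fields\tts\tuid\tid.orig_h\tid.orig_p\tid.resp_h\tid.resp_p\tproto",
    "1400000000.0\tCaaa\t10.0.0.1\t40000\t10.0.0.2\t80\ttcp",
    "1400000001.0\tCbbb\t10.0.0.1\t40001\t10.0.0.2\t80\ttcp",
    "1400000002.0\tCccc\t10.0.0.3\t5353\t10.0.0.4\t53\tudp",
    "#close\t2014-05-10-00-00-00",
]

counts = count_flows(SAMPLE)
print(sum(counts.values()), dict(counts))
```

For a real log, passing an open file works the same way, e.g. count_flows(open("conn.log")). Comparing the totals from Server A and Server B for the same hour gives a direct flow-count comparison.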