Update on using PF_RING/TNAPI with Bro

Hi,

I have managed to get TNAPI/PF_RING configured and working with a PF_RING-aware libpcap (http://www.ntop.org/TNAPI.html). It looks like this will be very well suited to the multiprocessing version of Bro.

  1. At the device-driver level, RSS (and, on Intel cards, Flow Director) allows packets to be multiplexed across multiple receive queues of an I/OAT-supporting network card, and packets belonging to the same connection can be steered to the same RX_Queue.
  2. With TNAPI, these RX_Queues are polled concurrently (one kernel thread per queue), and the packets are handed to PF_RING along with information about which queue each packet came from.
  3. PF_RING provides a user-space API that applications like Bro can use to read directly from the individual RX_Queues of a network interface, using notation such as eth0@1, eth0@2, etc. for RX_Queues 1 and 2 of interface eth0. By assigning one thread per RX_Queue, we ensure that all packets of a connection are processed by the same core (see the sketch after this list).
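
To make item 3 concrete, here is a minimal sketch (not Bro code) of one capture thread per RX_Queue using a PF_RING-aware libpcap. The interface name, queue count, queue numbering, and packet budget are my assumptions; a real integration would hand each packet to the analysis path instead of just counting it.

/* One capture thread per RX_Queue via a PF_RING-aware libpcap.
 * Build (assuming libpcap is the PF_RING-aware one):
 *   gcc -o queue_reader queue_reader.c -lpcap -lpthread
 */
#include <pcap.h>
#include <pthread.h>
#include <stdio.h>

#define NUM_QUEUES 4                       /* assumed queue count on eth0 */

static unsigned long pkt_count[NUM_QUEUES + 1];   /* index 1..NUM_QUEUES */

/* Called by pcap for every packet read from one queue. */
static void handler(u_char *user, const struct pcap_pkthdr *h,
                    const u_char *bytes)
{
    int queue = *(int *)user;
    pkt_count[queue]++;            /* real code would hand the packet to Bro */
    (void)h; (void)bytes;
}

static void *capture_thread(void *arg)
{
    int queue = *(int *)arg;
    char dev[32], errbuf[PCAP_ERRBUF_SIZE];

    /* eth0@N selects RX_Queue N when libpcap is built against PF_RING. */
    snprintf(dev, sizeof(dev), "eth0@%d", queue);

    pcap_t *p = pcap_open_live(dev, 65535, 1, 1000, errbuf);
    if (p == NULL) {
        fprintf(stderr, "pcap_open_live(%s): %s\n", dev, errbuf);
        return NULL;
    }

    /* Capture a fixed number of packets from this queue, then stop. */
    pcap_loop(p, 100000, handler, (u_char *)&queue);
    pcap_close(p);
    return NULL;
}

int main(void)
{
    pthread_t threads[NUM_QUEUES];
    int ids[NUM_QUEUES];

    for (int i = 0; i < NUM_QUEUES; i++) {
        ids[i] = i + 1;                    /* queues numbered eth0@1..eth0@4 */
        pthread_create(&threads[i], NULL, capture_thread, &ids[i]);
    }
    for (int i = 0; i < NUM_QUEUES; i++)
        pthread_join(threads[i], NULL);

    for (int q = 1; q <= NUM_QUEUES; q++)
        printf("eth0@%d: %lu packets\n", q, pkt_count[q]);
    return 0;
}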

PF_RING and TNAPI can drastically improve the performance of any multiprocessing packet-processing application, but they need to be properly tuned and used by the application. The gains come from three things: for Bro, packets can bypass the kernel's network stack altogether; TNAPI polls each RX_Queue with its own kernel thread; and PF_RING reduces the cost of moving packets from kernel space to user space by copying payloads directly from the RX_Queue rings.

Configuration-wise, it took a bit of work to change Bro's configure files to use a PF_RING-aware libpcap instead of the libpcap that Bro ships with. With TNAPI and PF_RING running, there is a clear improvement in the kernel's ability to receive packets at higher packet rates (results are on the TNAPI website; I verified them as well). However, using PF_RING with the existing Bro actually degrades performance, because Bro runs in a single user thread: even though packets reach user space on different threads, they all still have to be processed by the one core running Bro. Based on my understanding of TNAPI/PF_RING, though, a multi-threaded Bro can be adapted to PF_RING and should see large performance gains.

Here is a summary of a brief experiment that I performed on an 8-core Intel Xeon with 32 GB of RAM, running Linux, with an Intel 82598EB 10 Gbps Ethernet card:

Goal: Compare a conventional Bro installation against Bro with TNAPI and PF_RING (which I call Bro-Ring).
Conclusion: Bro-Ring shows a performance drop.
Observations: The table shows, for varying packet rates, how many packets were accepted by the machine running Bro (the rest were lost).

Packets/sec | Bro-Ring (packets accepted) | Bro (packets accepted)
34000       | 1368791                     | 1368003
50000       | 1368546                     | 1367707
65000       | 1368614                     | -
120000      | -                           | 1224761
130000      | -                           | 1168734
166000      | 596667                      | -
170000      | 561702                      | -
171000      | 681104                      | -
173000      | 618100                      | -
175000      | 740137                      | -
178000      | 864706                      | -
210000      | -                           | 753700
215000      | -                           | 728450
230000      | 494637                      | -
240000      | -                           | 636287

(Note: there was a difference between tcpreplay's packet-rate input parameter and the actual packet rate achieved, so I could not supply exact packet-rate values.)

Sunjeet Singh

3. PF_RING provides a user API which can be used by user-applications
like Bro to directly read from the multiple RX_Queues of a network
interface by using notation like eth0@1, eth0@2, etc. for RX_Queues 1
and 2 belonging to interface eth0.

...

But using PF_RING with the existing Bro leads to a performance
degradation of Bro because Bro runs on one user-thread, and when all
these packets reach user-space on different user-threads, they need to
be processed by the core that is running Bro.

Why are you only running one Bro process? You can set up a single-node
Bro cluster and run multiple Bro processes, each listening on one of
eth0@1, eth0@2, ...

Yes, that's a great idea. But I'm not sure how Bro would handle manager-proxy-worker communication between different RX_Queues instead of different interfaces. It can't be as simple as writing eth0@1, etc., in the cluster's node.cfg file, can it? Maybe some changes to the Bro code are needed?

Sunjeet

Putting eth0@1, eth0@2, eth0@3, eth0@4, etc. in node.cfg should work just fine.
No changes to Bro are needed, but you may have to rebuild Bro with
./configure --enable-cluster...

The config I use with Click just has:

[manager]
type=manager
host=10.10.1.12

[proxy-1]
type=proxy
host=10.10.1.12

[worker-1]
type=worker
host=10.10.1.12
interface=tap0

[worker-2]
type=worker
host=10.10.1.12
interface=tap1

[worker-3]
type=worker
host=10.10.1.12
interface=tap2

[worker-4]
type=worker
host=10.10.1.12
interface=tap3
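
For PF_RING the worker entries would presumably look the same with only the interface lines changed, something like the following (a sketch, untested; the host address is just the placeholder from above, and the manager/proxy entries stay as they are):

[worker-1]
type=worker
host=10.10.1.12
interface=eth0@1

[worker-2]
type=worker
host=10.10.1.12
interface=eth0@2

# ...and likewise worker-3 and worker-4 on eth0@3 and eth0@4.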

Okay. Even if this works, I don't think you'll see a gain in performance.

To leverage performance in this case, let's say you have 4 cores:
Core 0 running Bro Manager,
Core 1 running Bro Proxy,
Core 2 running Bro Worker1,
Core 3 running Bro Worker2.

For maximum performance through cache localization (and really for any performance gain at all), you would want all traffic destined for Bro to go to either Core 2 or Core 3, with each of those cores coupled to its own RX_Queue. You would then somehow need to split the traffic coming off the wire, at the driver layer, so that it goes to one of those two queues, and to do it intelligently, so that packets sharing state land in the same RX_Queue. This has to happen in RSS, and I have no idea how to do it on my network card, an Intel 82598EB. (You couldn't use Click for this, because it has to be done at the driver level.) A sketch of the flow-to-queue mapping I have in mind follows.
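
To illustrate the kind of splitting I mean, here is a purely conceptual C sketch of a symmetric flow hash, where both directions of a connection map to the same queue index. This is not the card's actual RSS algorithm (RSS hardware typically uses a Toeplitz hash over the tuple); the structure and names here are made up for illustration only.

#include <stdint.h>
#include <stdio.h>

#define NUM_QUEUES 2   /* e.g. one queue for Core 2, one for Core 3 */

/* A flow identified by the usual 5-tuple (values in host order for
 * simplicity; a real implementation would work on network order). */
struct flow {
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
    uint8_t  proto;
};

/* Symmetric hash: combining the endpoints with commutative operations
 * (+ and ^) makes hash(A->B) == hash(B->A), so both directions of a
 * connection land in the same RX_Queue. */
static unsigned queue_for_flow(const struct flow *f)
{
    uint32_t h = (f->src_ip + f->dst_ip)
               ^ (uint32_t)(f->src_port + f->dst_port)
               ^ f->proto;
    return h % NUM_QUEUES;
}

int main(void)
{
    struct flow fwd = { 0x0a0a010c, 0xc0a80101, 45678, 80, 6 };  /* A -> B */
    struct flow rev = { 0xc0a80101, 0x0a0a010c, 80, 45678, 6 };  /* B -> A */

    /* Both directions map to the same queue index. */
    printf("forward -> queue %u, reverse -> queue %u\n",
           queue_for_flow(&fwd), queue_for_flow(&rev));
    return 0;
}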

What do you think?

Sunjeet