Workers fail to process traffic due to a PF_RING problem

Hi all,

I’ve noticed a problem with the cluster where some workers start but do not process any traffic. I can tell this by looking at the output of the broctl commands “netstats”, “status”, and “ps.bro”.

It seems there is a problem either with how PF_RING is used or with PF_RING itself (or maybe with my setup). Has anyone else encountered this problem and started trying to isolate it?

Restarting the worker seems to make it “work” again. This happens almost every time I start the cluster. Sometimes, on some nodes, more than one worker is affected.

If it helps to know, I am using PF_RING from SVN 2013-03-19 and have experienced the issue with all previous versions. Bro is 2.1-380.

I searched the bug/problem tracker at http://tracker.bro.org/bro without finding anything. If this cannot be resolved on the mailing list and is worth tracking in a ticket, I will open one.

Thanks,

–TC

Examples follow. In the netstats output, notice that worker-1-4 is lame: recvd=0, dropped=0, link=0.

[BroControl] > netstats
worker-1-1: 1363895818.282736 recvd=29542788 dropped=4 link=29542788
worker-1-10: 1363895818.482747 recvd=20389244 dropped=1 link=20389244
worker-1-11: 1363895818.682289 recvd=24803977 dropped=1 link=24803977
worker-1-12: 1363895818.882953 recvd=28730644 dropped=1 link=28730644
worker-1-13: 1363895819.082850 recvd=19810612 dropped=0 link=19810612
worker-1-14: 1363895819.290962 recvd=22651710 dropped=0 link=22651710
worker-1-15: 1363895819.490876 recvd=27415776 dropped=0 link=27415776
worker-1-16: 1363895819.694541 recvd=21634742 dropped=0 link=21634742
worker-1-17: 1363895819.895422 recvd=20572973 dropped=0 link=20572973
worker-1-18: 1363895820.095018 recvd=25490613 dropped=2 link=25490613
worker-1-19: 1363895820.298648 recvd=19699362 dropped=0 link=19699362
worker-1-2: 1363895820.499099 recvd=23931030 dropped=1 link=23931030
worker-1-20: 1363895820.699632 recvd=21769411 dropped=0 link=21769411
worker-1-3: 1363895820.899525 recvd=21604270 dropped=1 link=21604270
worker-1-4: 1363895821.102857 recvd=0 dropped=0 link=0
worker-1-5: 1363895821.307124 recvd=22320056 dropped=0 link=22320056
(…cut…)
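For anyone who wants to script this check rather than eyeball it, here is a rough Python sketch that scans netstats output for workers with recvd=0. The function name `find_idle_workers` and the regular expression are my own, and they assume the line layout shown above. A worker that has only just started could legitimately show zero for a moment, so treat a hit as a hint and not as proof.

```python
import re

# Assumed layout, taken from the netstats output in this post:
#   worker-1-4: 1363895821.102857 recvd=0 dropped=0 link=0
NETSTATS_LINE = re.compile(
    r"^(?P<node>\S+):\s+(?P<ts>\d+\.\d+)"
    r"\s+recvd=(?P<recvd>\d+)\s+dropped=(?P<dropped>\d+)\s+link=(?P<link>\d+)"
)

def find_idle_workers(netstats_text):
    """Return the names of workers whose recvd counter is zero."""
    idle = []
    for line in netstats_text.splitlines():
        match = NETSTATS_LINE.match(line.strip())
        if match and int(match.group("recvd")) == 0:
            idle.append(match.group("node"))
    return idle

# Three lines copied from the output above.
sample = """\
worker-1-3: 1363895820.899525 recvd=21604270 dropped=1 link=21604270
worker-1-4: 1363895821.102857 recvd=0 dropped=0 link=0
worker-1-5: 1363895821.307124 recvd=22320056 dropped=0 link=22320056
"""

print(find_idle_workers(sample))  # ['worker-1-4']
```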

Find the PID of worker-1-4 by checking broctl “status”.

[BroControl] > status
Name        Type    Host      Status   Pid    Peers  Started
(…cut…)
worker-1-4  worker  10.1.1.1  running  17618  2      21 Mar 12:20:21
(…cut…)
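The lookup from worker name to PID, and from PID to the PF_RING stats file, could be scripted roughly as follows. This is only a sketch: `pid_for_node` and `pf_ring_stat_files` are hypothetical helper names, the column order is assumed to match the status header above, and the `<pid>-<device>.<n>` file naming is inferred from the paths in this post.

```python
import glob

def pid_for_node(status_text, node):
    """Return the Pid column for `node` from broctl status output, or None."""
    for line in status_text.splitlines():
        fields = line.split()
        # Assumed column order: Name Type Host Status Pid Peers Started...
        if len(fields) >= 5 and fields[0] == node and fields[4].isdigit():
            return int(fields[4])
    return None

def pf_ring_stat_files(pid, proc_dir="/proc/net/pf_ring"):
    """List per-socket stat files whose name starts with '<pid>-'."""
    return sorted(glob.glob(f"{proc_dir}/{pid}-*"))

# Status row copied from the output above.
sample = "worker-1-4 worker 10.1.1.1 running 17618 2 21 Mar 12:20:21"

pid = pid_for_node(sample, "worker-1-4")
print(pid)                      # 17618
print(pf_ring_stat_files(pid))  # empty list on a host without a matching file
```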

Now check the PF_RING stats for PID 17618:

root@bro:/home/bro# cat /proc/net/pf_ring/17618-eth5.9
Bound Device(s) : eth5
Active : 1
Breed : Non-DNA
Sampling Rate : 1
Capture Direction : RX+TX
Socket Mode : RX+TX
Appl. Name :
IP Defragment : No
BPF Filtering : Enabled
Sw Filt. Rules : 0
Hw Filt. Rules : 0
Poll Pkt Watermark : 1
Num Poll Calls : 16161864
Channel Id Mask : 0xFFFFFFFF
Cluster Id : 20
Slot Version : 15 [5.5.3]
Min Num Slots : 6966
Bucket Len : 9600
Slot Len : 9632 [bucket+header]
Tot Memory : 67108864
Tot Packets : 0
Tot Pkt Lost : 0
Tot Insert : 0
Tot Read : 0
Insert Offset : 0
Remove Offset : 0
TX: Send Ok : 0
TX: Send Errors : 0
Reflect: Fwd Ok : 0
Reflect: Fwd Errors: 0
Num Free Slots : 6966
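The fields that matter here are Active and Tot Packets: the socket is active, yet it has never seen a packet. A small parser for this file might look like the sketch below. The names `parse_pf_ring_stats` and `is_starved` are mine. It splits each line on the last colon, because some keys such as “TX: Send Ok” contain a colon themselves, and it assumes that values never do.

```python
def parse_pf_ring_stats(text):
    """Parse 'Key : Value' lines into a dict of stripped strings."""
    stats = {}
    for line in text.splitlines():
        # Split on the LAST colon so keys like "TX: Send Ok" stay intact.
        key, sep, value = line.rpartition(":")
        if sep:
            stats[key.strip()] = value.strip()
    return stats

def is_starved(stats):
    """True if the socket is active but has never received a packet."""
    return stats.get("Active") == "1" and stats.get("Tot Packets") == "0"

# A few lines copied from the stats file above.
sample = """\
Bound Device(s) : eth5
Active : 1
Appl. Name :
Tot Packets : 0
Tot Insert : 0
Tot Read : 0
TX: Send Ok : 0
Reflect: Fwd Errors: 0
"""

stats = parse_pf_ring_stats(sample)
print(stats["Bound Device(s)"], is_starved(stats))  # eth5 True
```

On a real sensor the text would come from reading the file under /proc/net/pf_ring for the worker's PID.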

No packets, huh. It must be something with how PF_RING is used or with PF_RING itself. What does restarting the worker do?

[BroControl] > restart worker-1-4
stopping …
stopping worker-1-4 …
starting …
starting worker-1-4 …

[BroControl] > status
Name        Type    Host      Status   Pid    Peers  Started
(…cut…)
worker-1-4  worker  10.1.1.1  running  18854  2      21 Mar 12:58:00
(…cut…)

[BroControl] > netstats
(…cut…)
worker-1-4: 1363896589.166826 recvd=6413632 dropped=112989 link=6413632
(…cut…)

On checking the PF_RING stats again, it looks like things are working now. There was a brief burst of dropped packets during the restart (112989, the same figure netstats reports), but that counter has not incremented since.

root@bro:/home/bro# cat /proc/net/pf_ring/18854-eth5.21
Bound Device(s) : eth5
Active : 1
Breed : Non-DNA
Sampling Rate : 1
Capture Direction : RX+TX
Socket Mode : RX+TX
Appl. Name :
IP Defragment : No
BPF Filtering : Enabled
Sw Filt. Rules : 0
Hw Filt. Rules : 0
Poll Pkt Watermark : 1
Num Poll Calls : 6637605
Channel Id Mask : 0xFFFFFFFF
Cluster Id : 20
Slot Version : 15 [5.5.3]
Min Num Slots : 6966
Bucket Len : 9600
Slot Len : 9632 [bucket+header]
Tot Memory : 67108864
Tot Packets : 7711193
Tot Pkt Lost : 112989
Tot Insert : 7598204
Tot Read : 7598197
Insert Offset : 4454256
Remove Offset : 4446288
TX: Send Ok : 0
TX: Send Errors : 0
Reflect: Fwd Ok : 0
Reflect: Fwd Errors: 0
Num Free Slots : 6959
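As a sanity check on the counters above, the numbers are at least self-consistent: Tot Packets equals Tot Insert plus Tot Pkt Lost, and the 7 packets inserted but not yet read match the 7 slots in use. The sketch below checks just that. I am not claiming these are documented PF_RING invariants, only that they hold for this snapshot. `check_counters` is a hypothetical helper.

```python
def check_counters(tot_packets, tot_lost, tot_insert, tot_read,
                   min_slots, free_slots):
    """Check simple relationships observed in the PF_RING counters above."""
    return {
        # Every packet seen was either inserted into the ring or lost.
        "packets_accounted": tot_packets == tot_insert + tot_lost,
        # Packets inserted but not yet read should occupy the used slots.
        "queued_matches_slots": (tot_insert - tot_read) == (min_slots - free_slots),
        # Share of all packets seen that were lost, in percent.
        "loss_pct": 100.0 * tot_lost / tot_packets if tot_packets else 0.0,
    }

# Values copied from the stats file for PID 18854.
result = check_counters(
    tot_packets=7711193,
    tot_lost=112989,
    tot_insert=7598204,
    tot_read=7598197,
    min_slots=6966,
    free_slots=6959,
)
print(result)
```

Both checks come out true for this snapshot, and the loss works out to roughly 1.5 percent of all packets seen since the restart.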