Memory Issue with Bro

Hello,

I have been using Bro 2.4 to test the performance of the SFC driver. I have observed the following issue, because of which I am unable to proceed with any analysis:

There seems to be a memory leak somewhere, as there are times when Bro runs out of memory far too soon. These are also the instances when packet drops appear very early, even at very low packet rates.

When Bro is started, the available free memory keeps going down until the server is extremely sluggish and drops are seen.

An instance of Bro running out of memory (with 16 workers, no CPU pinning, after sending 155K pps for 7-8 minutes):

[root@dellr620c skathare]# free -m
             total       used       free     shared    buffers     cached
Mem:         32129      31917        211          3          1        376
-/+ buffers/cache:      31539        589
Swap:         1907       1687        219
[root@dellr620c skathare]#

-> 211MB free: that's very low, considering the system started with some 26GB free memory (and this drop happens just within the first 2 minutes of running the traffic). System becomes very slow at this point and, of course, it has started dropping packets already.

Swap: 1907 1764 142

[root@dellr620c skathare]# cat /proc/meminfo
MemTotal: 32900200 kB
MemFree: 193384 kB
MemAvailable: 480956 kB
Buffers: 2464 kB
Cached: 471260 kB
SwapCached: 74860 kB
Active: 23439908 kB
Inactive: 3120012 kB
Active(anon): 23179296 kB
Inactive(anon): 2914628 kB
Active(file): 260612 kB
Inactive(file): 205384 kB
Unevictable: 0 kB
Mlocked: 0 kB
SwapTotal: 1953076 kB
SwapFree: 152264 kB
Dirty: 22548 kB
Writeback: 8 kB
AnonPages: 26017216 kB
Mapped: 15200 kB
Shmem: 4692 kB
Slab: 190556 kB
SReclaimable: 93648 kB
SUnreclaim: 96908 kB
KernelStack: 4288 kB
PageTables: 71380 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 15638376 kB
Committed_AS: 29119672 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 374920 kB
VmallocChunk: 34342144308 kB
HardwareCorrupted: 0 kB
AnonHugePages: 243712 kB
HugePages_Total: 2700
HugePages_Free: 46
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
DirectMap4k: 345024 kB
DirectMap2M: 18483200 kB
DirectMap1G: 16777216 kB

[root@dellr620c skathare]# top

top - 22:48:06 up 70 days, 19:43, 5 users, load average: 18.25, 13.49, 10.04
Tasks: 17 total, 1 running, 16 sleeping, 0 stopped, 0 zombie
%Cpu(s): 31.5 us, 2.9 sy, 0.7 ni, 43.7 id, 21.1 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem: 32900200 total, 32731868 used, 168332 free, 336 buffers
KiB Swap: 1953076 total, 1262956 used, 690120 free. 14248 cached Mem

  PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND P
20504 root 20 0 404644 32724 1464 D 0.0 0.1 6:32.62 bro 31
20552 root 20 0 404692 50108 1496 D 52.1 0.2 6:37.18 bro 29
20564 root 20 0 404824 50960 1508 D 49.8 0.2 6:37.16 bro 27
20574 root 20 0 404652 48748 1476 D 52.0 0.1 6:36.52 bro 24
20567 root 20 0 404684 49948 1456 D 0.0 0.2 6:33.32 bro 23
20561 root 20 0 421440 66672 1412 D 51.9 0.2 6:37.14 bro 18
20569 root 20 0 404708 31904 1508 D 41.1 0.1 6:34.77 bro 16
20495 root 20 0 404620 49936 1408 D 27.1 0.2 6:34.91 bro 13
20515 root 20 0 404684 46324 1500 D 21.9 0.1 6:33.25 bro 13
20548 root 20 0 404704 50188 1504 D 43.6 0.2 6:35.16 bro 13
20474 root 20 0 404736 32704 1508 D 0.0 0.1 6:32.79 bro 12
20502 root 20 0 404636 29300 1464 D 52.0 0.1 6:36.13 bro 12
20539 root 20 0 404748 32784 1484 D 44.7 0.1 6:34.08 bro 11
20537 root 20 0 404668 29284 1464 D 0.0 0.1 6:32.03 bro 4
20559 root 20 0 404684 32644 1444 R 54.6 0.1 6:38.12 bro 3
20542 root 20 0 404728 32704 1504 D 25.1 0.1 6:33.84 bro 1
20289 root 20 0 196768 412 412 S 0.0 0.0 0:00.03 solar_clusterd 0

After stopping the Bro workers (especially after the manager is killed/stopped), memory recovers:

top - 22:53:01 up 70 days, 19:48, 5 users, load average: 3.06, 8.83, 9.20
Tasks: 1 total, 0 running, 1 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.1 us, 0.3 sy, 2.2 ni, 96.6 id, 0.8 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem: 32900200 total, 6702252 used, 26197948 free, 8308 buffers
KiB Swap: 1953076 total, 237216 used, 1715860 free. 528032 cached Mem

  PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND P
20289 root 20 0 196768 1028 484 S 0.0 0.0 0:00.03 solar_clusterd

-> ~26GB free: this is almost what the system originally started with.

At very high packet rates, the available free memory goes down very fast and packets start dropping. At lower packet rates, the drop in available free memory is comparatively slower, but it is still there and packets are dropped eventually. When the Bro workers are stopped, the available free memory recovers. During the few successful runs where I was able to go up to 150Kpps without seeing any packet drops, the available free memory remained constant at ~23GB for the entire duration of the test (more than an hour) and no drops were seen.

The above data is a few days old. When I tried running Bro again today, I saw the memory drop from 18GB to 4GB within a few seconds of starting Bro (16 workers, each pinned to a CPU). Is it possible that Bro is accumulating some per-flow state and not freeing it? If so, is there any tuning that should be done to avoid this?
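To narrow down which process is actually growing, I am planning to snapshot the resident memory of all Bro processes every 30 seconds with something like this (a rough sketch; the interval and the output file name are arbitrary):

    while true; do
        date >> bro_rss.log
        # GNU ps: all processes with command name "bro", largest RSS first
        ps -C bro -o pid,ppid,rss,vsz,pcpu,args --sort=-rss >> bro_rss.log
        sleep 30
    done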

Appreciate any help on this!

Sampada,

I’ve observed Bro utilizing more memory when the PF_RING modules haven’t been recompiled for the newest kernel. Once I recompiled for the current kernel, my memory usage got more in line with the numbers I would expect to see. Could that be a problem for you, or is PF_RING not in the equation?
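One quick sanity check, assuming the kernel module is installed under the name pf_ring, is to compare the module's vermagic against the running kernel:

    # The kernel version embedded in the module should match what `uname -r` reports;
    # if modinfo can't find the module for the running kernel at all, that's a hint too
    modinfo pf_ring | grep -i vermagic
    uname -r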

Mark

Mark,

I have tried Bro both with and without PF_RING and saw the memory issue in both cases. When I am not using PF_RING, I use the SolarCapture libpcap and flow balancer.

In any case, I think I can give it a shot, at least for the PF_RING test case. How do I recompile PF_RING for the newest kernel?
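Is it roughly a matter of rebuilding and reloading the kernel module from the PF_RING source tree, something like the following (the paths are just my guess), or is there more to it, e.g. rebuilding libpfring and the PF_RING-aware libpcap that Bro links against as well?

    cd PF_RING/kernel
    make clean && make
    sudo make install        # installs pf_ring.ko for the currently running kernel
    sudo rmmod pf_ring       # unload the stale module, if it is loaded
    sudo modprobe pf_ring    # load the freshly built one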

[root@dellr620c skathare]# top

top - 22:48:06 up 70 days, 19:43, 5 users, load average: 18.25, 13.49, 10.04
Tasks: 17 total, 1 running, 16 sleeping, 0 stopped, 0 zombie
%Cpu(s): 31.5 us, 2.9 sy, 0.7 ni, 43.7 id, 21.1 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem: 32900200 total, 32731868 used, 168332 free, 336 buffers
KiB Swap: 1953076 total, 1262956 used, 690120 free. 14248 cached Mem

  PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND P
20504 root 20 0 404644 32724 1464 D 0.0 0.1 6:32.62 bro 31
20552 root 20 0 404692 50108 1496 D 52.1 0.2 6:37.18 bro 29
20564 root 20 0 404824 50960 1508 D 49.8 0.2 6:37.16 bro 27
20574 root 20 0 404652 48748 1476 D 52.0 0.1 6:36.52 bro 24
20567 root 20 0 404684 49948 1456 D 0.0 0.2 6:33.32 bro 23
20561 root 20 0 421440 66672 1412 D 51.9 0.2 6:37.14 bro 18
20569 root 20 0 404708 31904 1508 D 41.1 0.1 6:34.77 bro 16
20495 root 20 0 404620 49936 1408 D 27.1 0.2 6:34.91 bro 13
20515 root 20 0 404684 46324 1500 D 21.9 0.1 6:33.25 bro 13
20548 root 20 0 404704 50188 1504 D 43.6 0.2 6:35.16 bro 13
20474 root 20 0 404736 32704 1508 D 0.0 0.1 6:32.79 bro 12
20502 root 20 0 404636 29300 1464 D 52.0 0.1 6:36.13 bro 12
20539 root 20 0 404748 32784 1484 D 44.7 0.1 6:34.08 bro 11
20537 root 20 0 404668 29284 1464 D 0.0 0.1 6:32.03 bro 4
20559 root 20 0 404684 32644 1444 R 54.6 0.1 6:38.12 bro 3
20542 root 20 0 404728 32704 1504 D 25.1 0.1 6:33.84 bro 1
20289 root 20 0 196768 412 412 S 0.0 0.0 0:00.03 solar_clusterd 0

Are those the only bro processes you are running?

The %MEM is only 0.1% or 0.2% for all of those processes, and even if I round most of them up to 0.2%, that only adds up to about 3.2%.

Can you press the M key in top to sort by memory usage?
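Or, if it's easier to paste the output into the thread, something like this gives the same sorted-by-memory view from the shell:

    # Top 20 processes by resident memory (GNU ps)
    ps -eo pid,ppid,rss,pcpu,args --sort=-rss | head -20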

Hi,

These are the bro workers. I haven't shown the memory usage of the Bro manager and proxy processes. I believe the manager is the one that takes up most of the memory as when I stop the manager, the available free memory goes up by almost 10G! I will send out that log shortly.

Could the manager in any way be accumulating per-worker or per-flow state and not freeing it?

Thanks!

Hi,

These are the bro workers. I haven't shown the memory usage of the Bro manager and proxy processes. I believe the manager is the one that takes up most of the memory as when I stop the manager, the available free memory goes up by almost 10G! I will send out that log shortly.

There's actually a manager parent and child process, knowing which one is using the memory can help figure this out.

Could the manager in any way be accumulating per-worker or per-flow state and not freeing it?

Thanks!

There is sort of a known issue if the manager can't keep up with logging the amount of data it is being sent.
What sort of data are you sending Bro? Is it something like random data that will cause a LOT of logging?
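One quick way to gauge that is to watch how fast the live logs on the manager grow while the traffic is running, e.g. (the path assumes a default broctl install prefix):

    # Re-list the live logs every 10 seconds to see how quickly they grow
    watch -n 10 'ls -lh /usr/local/bro/logs/current/'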

Hi,

I re-ran the same test case just now to get the data for you: 16 Bro workers, each pinned to a core, no CPU pinning for the manager and proxy processes, data sent at 155000 pps.

Before running traffic:

top - 21:48:24 up 2:11, 5 users, load average: 0.84, 1.14, 0.88
Tasks: 20 total, 0 running, 20 sleeping, 0 stopped, 0 zombie
%Cpu(s): 1.6 us, 2.1 sy, 0.0 ni, 96.3 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem: 32900200 total, 8381556 used, 24518644 free, 55736 buffers
KiB Swap: 1953076 total, 0 used, 1953076 free. 249556 cached Mem

  PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND P
18301 root 25 5 129980 46096 884 S 0.0 0.1 0:00.00 bro 1
18299 root 25 5 130020 46116 892 S 0.0 0.1 0:00.00 bro 11
18294 root 25 5 129988 46116 892 S 0.0 0.1 0:00.00 bro 12
18289 root 25 5 130012 46076 840 S 0.0 0.1 0:00.00 bro 15
18285 root 25 5 130024 46128 896 S 0.0 0.1 0:00.00 bro 10
18284 root 25 5 129960 46132 892 S 0.0 0.1 0:00.00 bro 14
18283 root 25 5 130048 46136 888 S 0.0 0.1 0:00.00 bro 13
17863 root 20 0 387408 51964 5548 S 9.7 0.2 0:31.44 bro 16
17860 root 20 0 387344 51968 5548 S 9.3 0.2 0:29.73 bro 15
17836 root 20 0 387416 51956 5548 S 8.7 0.2 0:29.62 bro 14
17835 root 20 0 387348 51960 5548 S 9.7 0.2 0:30.02 bro 13
17834 root 20 0 387348 51948 5552 S 9.0 0.2 0:29.31 bro 11
17833 root 20 0 387292 51952 5548 S 9.3 0.2 0:29.26 bro 12
17832 root 20 0 387340 51960 5548 S 9.3 0.2 0:29.85 bro 10
17738 root 20 0 387296 51920 5548 S 10.0 0.2 0:30.59 bro 1
17651 root 25 5 145956 75268 960 S 0.0 0.2 0:00.00 bro 2
17650 root 20 0 109988 43080 5096 S 1.3 0.1 0:04.50 bro 0
17613 root 25 5 146096 75396 944 S 0.0 0.2 0:00.01 bro 9
17604 root 20 0 405392 45428 5116 S 0.7 0.1 0:03.71 bro 26
17457 root 20 0 196772 9016 3448 S 0.0 0.0 0:00.02 solar_clusterd 0

PIDs 17613 (child) and 17604 (parent) are the Bro manager processes.

After running traffic at 150000 pps for 3 minutes:

top - 21:53:58 up 2:16, 5 users, load average: 20.90, 11.89, 5.36
Tasks: 20 total, 10 running, 10 sleeping, 0 stopped, 0 zombie
%Cpu(s): 55.3 us, 3.4 sy, 0.8 ni, 37.5 id, 3.0 wa, 0.0 hi, 0.1 si, 0.0 st
KiB Mem: 32900200 total, 32703156 used, 197044 free, 892 buffers
KiB Swap: 1953076 total, 626996 used, 1326080 free. 331096 cached Mem

  PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND P
18301 root 25 5 134092 46032 676 S 0.3 0.1 0:00.78 bro 1
18299 root 25 5 134132 46052 676 S 0.3 0.1 0:00.79 bro 11
18294 root 25 5 134100 2024 676 S 0.7 0.0 0:00.88 bro 12
18289 root 25 5 134124 46000 676 S 0.3 0.1 0:00.82 bro 15
18285 root 25 5 134136 2016 676 S 0.3 0.0 0:00.87 bro 10
18284 root 25 5 134072 1992 676 S 0.7 0.0 0:00.87 bro 14
18283 root 25 5 134160 46064 676 S 0.7 0.1 0:00.80 bro 13
17863 root 20 0 404636 24628 3540 R 0.0 0.1 4:36.63 bro 16
17860 root 20 0 404668 67472 3544 R 100.0 0.2 4:29.44 bro 15
17836 root 20 0 404612 24580 3540 R 100.0 0.1 4:35.53 bro 14
17835 root 20 0 404608 67412 3544 R 0.0 0.2 4:33.82 bro 13
17834 root 20 0 404608 67388 3544 R 100.0 0.2 4:35.69 bro 11
17833 root 20 0 404636 24648 3540 R 0.0 0.1 4:32.05 bro 12
17832 root 20 0 404632 24588 3540 R 100.0 0.1 4:36.37 bro 10
17738 root 20 0 404624 67368 3544 R 100.0 0.2 4:38.29 bro 1
17651 root 25 5 145956 7488 648 S 0.0 0.0 0:00.00 bro 17
17650 root 20 0 109988 3892 2916 S 1.7 0.0 0:09.39 bro 29
17613 root 25 5 871884 763968 668 R 99.9 2.3 3:39.04 bro 17
17604 root 20 0 22.869g 0.022t 2932 R 195.4 71.2 7:41.59 bro 18
17457 root 20 0 196772 2272 2272 S 0.0 0.0 0:00.02 solar_clusterd 0

As you can see, the manager parent process (17604) seems to be using 71% of the memory, and its CPU utilization is also around 195%.

Continuing the previous test case: after another 7-8 minutes, I don’t see the Bro manager process running anymore, and memory seems to have been restored:

top - 22:04:54 up 2:27, 5 users, load average: 16.04, 16.06, 11.24
Tasks: 18 total, 8 running, 10 sleeping, 0 stopped, 0 zombie
%Cpu(s): 47.7 us, 0.1 sy, 0.0 ni, 52.2 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem: 32900200 total, 8363464 used, 24536736 free, 4152 buffers
KiB Swap: 1953076 total, 743412 used, 1209664 free. 608176 cached Mem

  PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND P
18301 root 25 5 132036 19816 540 S 0.0 0.1 0:00.88 bro 1
18299 root 25 5 132076 25040 468 S 0.0 0.1 0:00.83 bro 11
18294 root 25 5 132044 4208 480 S 0.0 0.0 0:01.04 bro 12
18289 root 25 5 132068 22424 548 S 0.0 0.1 0:00.84 bro 15
18285 root 25 5 132080 2276 468 S 0.0 0.0 0:00.89 bro 10
18284 root 25 5 132016 2276 444 S 0.0 0.0 0:00.92 bro 14
18283 root 25 5 132104 20792 520 S 0.0 0.1 0:00.85 bro 13
17863 root 20 0 404724 24268 1568 R 100.0 0.1 15:23.31 bro 16
17860 root 20 0 404592 44080 1568 R 91.2 0.1 14:33.31 bro 15
17836 root 20 0 405076 24604 1592 R 94.5 0.1 15:09.50 bro 14
17835 root 20 0 404632 42456 1568 R 0.0 0.1 15:06.73 bro 13
17834 root 20 0 404616 42740 1592 R 95.2 0.1 15:02.40 bro 11
17833 root 20 0 404612 24252 1592 R 94.9 0.1 15:00.49 bro 12
17832 root 20 0 404740 24052 1528 R 100.0 0.1 15:20.78 bro 10
17738 root 20 0 404676 41420 1528 R 0.0 0.1 15:27.26 bro 1
17651 root 25 5 143900 4820 360 S 0.0 0.0 0:00.01 bro 19
17650 root 20 0 109988 1916 736 S 1.3 0.0 0:19.00 bro 19
17457 root 20 0 196772 272 272 S 0.0 0.0 0:00.02 solar_clusterd 0

PIDs 17604 and 17613 are missing above?

Is this expected behavior? During the initial few minutes, does the manager do some sort of stabilization to get everything in order, possibly due to the high traffic rate? It is during these few minutes that I saw packets being dropped; I don’t see any drops right now.

What is the meaning of this error?

SolarCapture session=4382/1 log=/var/tmp/solar_capture_root_4382/1
ERROR: errno=-114 from core/sc_ef_vi.c:1323 in sc_ef_vi_alloc_ts():
ERROR: sc_ef_vi_alloc_ts: ef_vi_alloc_from_pd failed (-114)
sfsc_activate: sc_vi_alloc failed (ERROR: sc_ef_vi_alloc_ts: ef_vi_alloc_from_pd failed (-114)
) errno=Unknown error -114
fatal error: problem with interface enp65s0f0 (enp65s0f0: sfsc_activate: sc_vi_alloc failed (ERROR: sc_ef_vi_alloc_ts: ef_vi_alloc_from_pd failed (-114)
) errno=Unknown error -114)

I see this in the diag output while creating 16 workers (sometimes even with a smaller number). As a result, some of the workers are not started.

That appears to be a problem with your own NIC (I’m assuming... solarflare error messages, solarflare.com email address).

  .Seth

I’m super interested in this thread, as I believe I’m experiencing the same memory leak using the Solarflare cards.

I’m running a similar setup, with 20 workers and lots of traffic, but I’m having to bounce the entire NIC once Bro goes haywire. It doesn’t take long before Bro has wiped the whole box out of memory (all 192GB).

Please let me know how the troubleshooting goes. I’m happy to provide logs.

Cheers,

JB

Any update on this? If not, I think I’m going to submit a support case to Solarflare.

Cheers,

JB

Memory leaks are tricky. It is important to make a distinction about which component is using a lot of memory:

1) the workers - analyzer issues and leaks in general would show up here.
2) the proxies - communication related.
3) the manager child - if the manager is overloaded, the child will buffer log data.
4) the manager parent - if a logging destination is overloaded, the parent will buffer log writes.

If your manager processes are using a lot of RAM, that doesn't have anything to do with the capture library in use.
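If I remember right, broctl also has a "top" command that breaks CPU and memory usage down per node, which makes it quick to see which of those four it is:

    # Run from the manager host; shows each node's processes with memory and CPU
    # (I believe you can also name a specific node, e.g. "broctl top manager")
    broctl top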

It’s my manager processes using tons of memory…

How would you suggest debugging the manager processes?

Cheers,

JB

Is it the manager parent or the child process?
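The PPID column is enough to tell them apart: the child is forked by the parent, so the manager child's PPID will be the manager parent's PID. A sketch, assuming the processes show up with the command name "bro":

    # Largest bro processes first; compare the PID/PPID of the two manager
    # processes to see which one is the parent and which is the child
    ps -C bro -o pid,ppid,rss,pcpu,etime,args --sort=-rss | head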

OK, so it looks like it’s the parent manager process. I’m pointing Bro at an HAProxy pool for ES, and ES isn’t terribly bogged down. Is there any way to specify multiple ES nodes? Or should we be using a pool like I’m using?

Cheers,

JB

I know we are still seeing issues with the manager child process. The process will consume over 200GB of RAM in 8 hours.

Give the attached patch a try.

Monitor by using

cat logs/current/communication.log | egrep 'manager.child'

And check to see if pending=0 or at least not growing.
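To see the trend at a glance, you can also pull out just the pending counter from the same log, e.g.:

    # Most recent pending= values reported for the manager child
    grep 'manager.child' logs/current/communication.log | grep -o 'pending=[0-9]*' | tail -20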

flare_fix.patch (839 Bytes)

Patch applied; after 15 minutes I am seeing:

Oct 23 09:00:43 manager child - - - info selects=300000 canwrites=216206 pending=0
Oct 23 09:01:29 manager child - - - info selects=400000 canwrites=216206 pending=0
Oct 23 09:02:08 manager child - - - info selects=500000 canwrites=216552 pending=0
Oct 23 09:02:49 manager child - - - info selects=600000 canwrites=216557 pending=0
Oct 23 09:03:34 manager child - - - info selects=700000 canwrites=216557 pending=0
Oct 23 09:04:29 manager child - - - info selects=800000 canwrites=255305 pending=4007
Oct 23 09:05:21 manager child - - - info selects=900000 canwrites=355305 pending=6593
Oct 23 09:06:13 manager child - - - info selects=1000000 canwrites=455305 pending=6003
Oct 23 09:07:04 manager child - - - info selects=1100000 canwrites=555305 pending=3077
Oct 23 09:07:55 manager child - - - info selects=1200000 canwrites=640438 pending=3399
Oct 23 09:08:45 manager child - - - info selects=1300000 canwrites=740438 pending=3163
Oct 23 09:09:36 manager child - - - info selects=1400000 canwrites=840438 pending=5245
Oct 23 09:10:25 manager child - - - info selects=1500000 canwrites=940438 pending=6027
Oct 23 09:11:15 manager child - - - info selects=1600000 canwrites=1040438 pending=6713
Oct 23 09:12:01 manager child - - - info selects=1700000 canwrites=1140438 pending=5713
Oct 23 09:12:50 manager child - - - info selects=1800000 canwrites=1240438 pending=6747
Oct 23 09:13:39 manager child - - - info selects=1900000 canwrites=1340438 pending=7417
Oct 23 09:14:32 manager child - - - info selects=2000000 canwrites=1440438 pending=13117
Oct 23 09:15:10 manager child - - - info selects=2100000 canwrites=1540438 pending=20825
Oct 23 09:15:59 manager child - - - info selects=2200000 canwrites=1640438 pending=18539
Oct 23 09:16:47 manager child - - - info selects=2300000 canwrites=1740438 pending=15881
Oct 23 09:17:35 manager child - - - info selects=2400000 canwrites=1840438 pending=15389
Oct 23 09:18:28 manager child - - - info selects=2500000 canwrites=1940438 pending=16685
Oct 23 09:19:18 manager child - - - info selects=2600000 canwrites=2040438 pending=17031

I will let you know about the memory usage after a bit.