Workers occasionally using 102% CPU

I’m seeing an interesting problem on zeek 3.0.1 (running stock SecurityOnion sensor setup) where the main thread suddenly spikes to 100% CPU and stays there.

The base setup: Ubuntu 16.04, kernel 4.15.0-88-generic; Zeek is using AF_PACKET to distribute packets across 8 worker processes. Baseline load for each worker is around 18-22% CPU. Running strace against a normal worker for 1 second and summarizing with ‘sort strace.out | uniq -c’ looks like this:

31 futex(0x7f9586384a38, FUTEX_WAKE_PRIVATE, 1) = 1
33 futex(0x7f9586384a64, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7f9586384a60, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
1 futex(0x7f9586d03a0c, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7f9586d03a08, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
6671 nanosleep({0, 1000}, NULL) = 0
1 read(12, “@”, 1) = 1
1 read(14, “@”, 1) = 1
26 read(16, “@”, 1) = 1
6699 select(17, [6 8 10 11 12 14 16], [0], [0], {0, 0}) = 1 (out [0], left {0, 0})
198 select(17, [6 8 10 11 12 14 16], [0], [0], {0, 0}) = 2 (in [11], out [0], left {0, 0})
1 select(17, [6 8 10 11 12 14 16], [0], [0], {0, 0}) = 2 (in [12], out [0], left {0, 0})
1 select(17, [6 8 10 11 12 14 16], [0], [0], {0, 0}) = 2 (in [14], out [0], left {0, 0})
21 select(17, [6 8 10 11 12 14 16], [0], [0], {0, 0}) = 2 (in [16], out [0], left {0, 0})
1 select(17, [6 8 10 11 12 14 16], [0], [0], {0, 0}) = 3 (in [11 16], out [0], left {0, 0})
1 select(17, [6 8 10 11 12 14 16], [0], [0], {0, 0} <detached …>
101 select(17, [6 8 10 12 14 16], [], [], {0, 0}) = 0 (Timeout)
92 stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=127, …}) = 0

Notice there are close to the same number of nanosleep calls as there are select calls.
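
(For anyone who wants to compare on their own sensors, the capture was roughly along these lines; the exact strace invocation here is from memory, so adjust as needed:)

# attach to one worker’s main pid for about a second, then summarize
sudo timeout 1 strace -p <worker pid> -o strace.out
sort strace.out | uniq -c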

After some time (I’ve seen it happen anywhere from 4 minutes after start to several hours later), the usage suddenly spikes to 100%. Packets continue to be processed, and the load on the remaining workers stays about the same, as does the load on the logger, manager, and proxy. Changing the number of worker processes doesn’t seem to prevent it. There is no degradation in output logging, since I have enough cores to compensate for that single hot process. Running strace for 1 second and summarizing with ‘sort strace.out | uniq -c’ looks like this:

1 futex(0x7f270ef7ea38, FUTEX_WAKE_PRIVATE, 1) = 0
30 futex(0x7f270ef7ea38, FUTEX_WAKE_PRIVATE, 1) = 1
35 futex(0x7f270ef7ea64, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7f270ef7ea60, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
1 futex(0x7f270f900a6c, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7f270f900a68, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
1 poll([{fd=16, events=POLLIN}], 1, -1) = 1 ([{fd=16, revents=POLLIN}])
1 read(12, “@”, 1) = 1
1 read(14, “@”, 1) = 1
28 read(16, “@”, 1) = 1
1 select(12, [10 11], [0], [0], {0, 0}) = 1 (out [0], left {0, 0})
21703 select(17, [6 8 10 11 12 14 16], [0], [0], {0, 0}) = 2 (in [16], out [0], left {0, 0})
141 select(17, [6 8 10 11 12 14 16], [0], [0], {0, 0}) = 3 (in [11 16], out [0], left {0, 0})
1 select(17, [6 8 10 11 12 14 16], [0], [0], {0, 0}) = 3 (in [12 16], out [0], left {0, 0})
109 select(17, [6 8 10 12 14 16], [], [], {0, 0}) = 1 (in [16], left {0, 0})
106 stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=127, …}) = 0

Note the complete lack of nanosleeps and the jump in the number of calls to select. Also notice that fds 12, 14, and 16 continue to be read, but select always returns with data available on fd16. procfs shows this (also showing fd17 since that appears to be the other end of the pipe):

root:/proc/32327# ls -l fd/16 fd/17
lr-x------ 1 root root 64 Feb 20 20:43 fd/16 → pipe:[48070104]
l-wx------ 1 root root 64 Feb 20 20:43 fd/17 → pipe:[48070104]
root:/proc/32327# cat fdinfo/16
pos: 0
flags: 02004000
mnt_id: 12
root:/proc/32327# cat fdinfo/17
pos: 0
flags: 02000001
mnt_id: 12

lsof output shows these file descriptors in use on the main thread (and thus on all):

zeek 32327 sguil 3u a_inode 0,13 0 12847 [eventpoll]
zeek 32327 sguil 4r FIFO 0,12 0t0 48059893 pipe
zeek 32327 sguil 5w FIFO 0,12 0t0 48059893 pipe
zeek 32327 sguil 6r FIFO 0,12 0t0 48059895 pipe
zeek 32327 sguil 7w FIFO 0,12 0t0 48059895 pipe
zeek 32327 sguil 8r FIFO 0,12 0t0 48059896 pipe
zeek 32327 sguil 9w FIFO 0,12 0t0 48059896 pipe
zeek 32327 sguil 10u IPv4 48059899 0t0 UDP localhost:55072->localhost:domain
zeek 32327 sguil 11u pack 48059900 0t0 ALL type=SOCK_RAW
zeek 32327 sguil 12r FIFO 0,12 0t0 48048835 pipe
zeek 32327 sguil 13w FIFO 0,12 0t0 48048835 pipe
zeek 32327 sguil 14r FIFO 0,12 0t0 48048836 pipe
zeek 32327 sguil 15w FIFO 0,12 0t0 48048836 pipe
zeek 32327 sguil 16r FIFO 0,12 0t0 48070104 pipe
zeek 32327 sguil 17w FIFO 0,12 0t0 48070104 pipe
zeek 32327 sguil 18u IPv6 48038789 0t0 TCP *:47770 (LISTEN)
zeek 32327 sguil 19u IPv4 48038790 0t0 TCP localhost:47996->localhost:47761 (ESTABLISHED)
zeek 32327 sguil 20u IPv4 48038791 0t0 TCP localhost:42076->localhost:47763 (ESTABLISHED)
zeek 32327 sguil 21u IPv4 48038792 0t0 TCP localhost:34920->localhost:47762 (ESTABLISHED)

Of the 7 threads created, the 4th (judging by the thread’s pid) is calling ‘write(17, “@”, 1)’ 10x per second and getting a return value of 1.

Any ideas what might be wrong? Any suggestions for further diagnosis? These are in production, so I can’t do too much other than a restart and an occasional strace. I cannot reproduce in lab conditions.

I’ve just upgraded 2 of my sensors from 2.6.4 to 3.0.1, and this is happening on only one of them, but it’s the more heavily loaded one. I’m hoping to resolve this before upgrading the remaining 10 sensors, as I don’t want to see it on others…

I’m seeing an interesting problem on zeek 3.0.1 (running stock SecurityOnion sensor setup) where the main thread suddenly spikes to 100% CPU and stays there.

Could be an elephant flow…

Any ideas what might be wrong? Any suggestions for further diagnosis? These are in production, so I can’t do too much other than a restart and an occasional strace. I cannot reproduce in lab conditions.

Use perf! Start with this:

sudo perf record -C 3 -g -F199 --call-graph dwarf sleep 30

where 3 is the cpu core that worker is running on (zero indexed)
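
If you’re not sure which core a given worker is currently on, something like

ps -o pid,psr,args -p <worker pid>

will show the CPU it last ran on in the psr column.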

Then run

sudo perf script | c++filt > perf.out

Then process that with https://github.com/brendangregg/FlameGraph or even better https://github.com/Netflix/flamescope
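
With FlameGraph, the rest of the pipeline is roughly this (assuming you cloned the repo into the current directory):

./FlameGraph/stackcollapse-perf.pl perf.out | ./FlameGraph/flamegraph.pl > worker.svg

and then open the SVG in a browser.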

You may be able to skip all that and just run

sudo perf top -C 3 -F199

to see what is going on, but recording helps if it’s hard to catch or if you want to automate it.

Thanks, Justin, but I don’t think it’s an elephant flow. I’m running snort on the same host with the same global BPF rules (it still uses pf-ring for clustering), which is even more sensitive to elephant flows, and it’s not spiking CPU. Besides, it’s fd16 that’s triggering select, not fd11, which is the raw socket. The counts of read calls on fd11 are pretty similar between the good and bad strace summaries.

I don’t have the workers pinned to a CPU, and there are other heavy-duty processes on that host, so I’m not sure if perf will work for me. I’ll read up on it and see if I can get some sort of results from it. I’ll also strace some more and try to catch the right moment to at least figure out what’s happening around trigger time… The load on that worker isn’t so high that strace interferes, and streaming its output through gzip sips disk space.

Also, it’s now happening on the other node where I upgraded to v3.0.1. I only have 3 workers there, but load is light enough I could get by with 1 or 2. When I kill the malfunctioning worker, it appears to respawn within a couple minutes, which is good. I didn’t realize Zeek did that!

Any other ideas? I can’t be the only one seeing this, can I?

I don’t have the workers pinned to a CPU, and there are other heavy-duty processes on that host, so I’m not sure if perf will work for me. I’ll read up on it and see if I can get some sort of results from it. I’ll also strace some more and try to catch the right moment to at least figure out what’s happening around trigger time… The load on that worker isn’t so high that strace interferes, and streaming its output through gzip sips disk space.

Oh… you can use -p with perf and give it a pid. strace is really not helpful at all for this sort of thing.
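
For example, something along these lines, where the pid is the hot worker’s main process:

sudo perf record -p <worker pid> -g -F 199 --call-graph dwarf -- sleep 30

then the same perf script / FlameGraph steps as before.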

Also, it’s now happening on the other node where I upgraded to v3.0.1. I only have 3 workers there, but load is light enough I could get by with 1 or 2. When I kill the malfunctioning worker, it appears to respawn within a couple minutes, which is good. I didn’t realize Zeek did that!

zeekctl cron does that when it runs. The new supervisor stuff https://docs.zeek.org/en/master/frameworks/supervisor.html aims to have things restarted immediately.

Any other ideas? I can’t be the only one seeing this, can I?

Hard to say until we know what the ‘this’ is :-)

Ah, thanks. I'm not familiar with perf, so I had to do some research.
Here's a sample of running perf top against it for a bit:

Samples: 2K of event 'cycles:ppp', Event count (approx.): 23397706264
  Children   Self   Shared Object     Symbol
+   36.85%  0.72%   libc-2.23.so      [.] __select
+   34.78%  1.18%   [kernel]          [k] do_syscall_64
+   34.53%  0.83%   [kernel]          [k] entry_SYSCALL_64_after_hwframe
+   32.88%  0.56%   [kernel]          [k] sys_select
+   29.65%  1.15%   [kernel]          [k] core_sys_select
+   20.18%  9.99%   [kernel]          [k] do_select
+   10.12%  0.00%   [unknown]         [k] 0x00007fe5752e0b20
+    9.14%  8.91%   libjemalloc.so.1  [.] free
+    7.60%  7.43%   libjemalloc.so.1  [.] malloc
+    7.05%  6.86%   zeek              [.] iosource::Manager::FindSoonest
+    5.67%  1.20%   [kernel]          [k] __fdget
+    5.42%  5.42%   zeek              [.] std::_Rb_tree<int, int, std::_Identity<int>, std::less<int>, std::allocator<int> >::_M_insert_unique<in
+    4.44%  0.93%   [kernel]          [k] __fget_light
+    4.08%  3.94%   [kernel]          [k] __fget

I also captured call-graph data as you requested earlier, but I don't
know what I'm looking at... how should I get that to you? Direct
email, or attach to reply and let the list manager strip and provide a
link? It's only 350k once gzipped.

Perf output mailed to you, Justin. I'm checking to see if other
SecurityOnion users are seeing this, but haven't heard anything yet.
The good news is that it doesn't break things, as it just consumes
more CPU than usual.

I appreciate you taking a look at this. Please let me know if you
need anything else.

Ah, sorry. The profiling confirms it’s spending most of its time in iosource::Manager::FindSoonest. As you noticed, this code is completely gone in the next version of Zeek; from the NEWS file:

  • Replace old select-based IO loop with a new architecture that doesn’t
    spin checking for active IO sources. The new architecture now waits for
    the sources to actively notify it when activity occurs and only processes
    data once it’s ready. This helps heavily reduce the CPU usage on idle
    network connections.

I’m not sure if what you are seeing is a bug, or just an instance of the older select loop not performing well. Since you said it seems to be working fine otherwise and not dropping packets, I’m leaning towards just a performance issue.

The new IO loop should perform much better, especially on lightly utilized links.

Justin,

We agree that a worker process of Zeek using 10-20% of the CPU when
essentially idle is a performance issue. Thank you for your work on
that. I'm looking forward to the day that it's rolled out in
SecurityOnion.

That said...

That a worker process of Zeek is taking 100% of the CPU when
essentially idle, but only sometimes and only after running for an
indeterminate amount of time, is a bug. It's not just not performing
well, as is demonstrated by the strace output; select() is always
returning with a file descriptor ready for IO, which is not cleared by
reading it in the main thread, and thus the main thread never sleeps.
Its behavior is very different from the baseline with just the
performance issue.

To bolster that conclusion, I was doing my best to understand the
strace output I'd captured earlier, and sifting through the series of
select(), epoll_ctl(), futex(), rt_sigprocmask() and rt_sigtimedwait()
syscalls reminded me just how HARD multi-threaded programming can be.
And race conditions can be subtle with no discernible pattern in when
they manifest. So I tried something: I tried pinning each worker
process to a CPU.
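
(In zeekctl terms that's the pin_cpus option in node.cfg; roughly like the sketch below, where the interface, load-balancing settings, and CPU numbers are just placeholders for whatever your existing worker stanza already has:)

# example values; keep your existing settings and just add pin_cpus
[worker-1]
type=worker
host=localhost
interface=af_packet::enp0s8
lb_method=custom
lb_procs=8
pin_cpus=2,3,4,5,6,7,8,9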

None of the threads have jumped up to 100% since.

FWIW, I've found another SecurityOnion user that is seeing the same
issue. He reported that load averages skyrocketed after the latest
upgrade, and saw that 9 out of his 11 Zeek worker processes were
pegged at 100% CPU until he restarted it. I'm going to suggest he pin
workers to specific CPUs to see if it goes away for him too.

From strace output, it appears that the main thread and one of the
child threads get out of sync, in that there's an extra character in
the pipe used for intra-thread signalling that the main thread never
clears with a read. Even when a new packet comes in, I think I see
that there's something else (e.g. a signal being unmasked?) that
triggers the main thread to read from the pipe, but it still only
reads what it thinks are the right number of characters, rather than
all that are available in the pipe, leaving extra in the pipe so that
select() always returns that fd as ready for IO.

This may be specific to SecurityOnion's configuration or set of
Zeek scripts, or it may be happening on all instances where Zeek
workers have no CPU affinity. perf wouldn't help there, as we need
the specific sequence that triggers the race condition rather than a
heatmap of function calls. I'd still like to get a trace that shows
the function call sequence at the moment this bug is triggered, but
I haven't had any luck so far.

Does this make sense to you? Trigger any memories of changes that
went into 3.0 that might have altered behavior in that area? Do you
run your tests without pinning CPUs?

I saw in a ZeekWeek2019 slide that 3.0.x is considered a long-term
stable release. Please let me know if there's anything I can do to
help diagnose this further and make sure it's stable for everyone.

These are some differences between 2.6 and 3.0 that come to mind as
possibly related:

* https://github.com/zeek/zeek/issues/346 and
https://github.com/zeek/zeek/pull/358
* https://github.com/zeek/broker/commit/0abe5277

Your description makes me generally suspect the Broker flare_actor
code: either the behavior was introduced in that later commit, or the
patch that attempted to fix behavior similar to what you're noticing
is incomplete (and possibly it wasn't an issue in 2.6 because the
underlying CAF library was different there, too). And if there's
still a bug there, then I'd expect it would also break the 3.1.0 IO
loop despite that loop generally being rewritten.

If you have more ideas to narrow it down or can pinpoint which pipe
isn't getting fully drained, that will be helpful.

- Jon

Thanks, Jon.

I'll try to digest those links and dig into the code. Unfortunately,
it seems running strace on the process keeps it from occurring... I
may try to get dtrace working in place, but I need to improve my lab
setup first before I go too crazy.

I’ve been able to duplicate this issue so I’m passing along some notes in the hope that others are able to duplicate the issue as well and perhaps pinpoint what’s going on.

  • I’ve seen this issue on both physical boxes and virtual machines and on both single CPU socket and multiple CPU socket systems

  • I’ve been able to trigger this issue fairly consistently using VMware Workstation with the VM set to 4 processors (seems easier to duplicate when using processors rather than cores)

  • 8GB RAM and 2 NICs (one set to NAT for management and the other set to a custom network and configured for sniffing)

  • running our latest Security Onion ISO image which contains Zeek 3.0.1 (I’ve also duplicated this behavior using Zeek 3.0.2 compiled manually):
    https://blog.securityonion.net/2020/02/security-onion-160464-iso-image-now.html

  • run sosetup-minimal and choose Evaluation Mode

  • once Setup is complete, create some traffic on the sniffing interface:
    while :; do sudo so-replay; done

  • on my box, Zeek normally runs at about 10% to 20% CPU usage when running so-replay but after a certain period of time (seems inconsistent, could be minutes or over an hour), Zeek will go to 100% CPU usage and remain there even if you kill the so-replay while loop from above

  • you can restart Zeek with “sudo so-zeek-restart” and it will go back to normal operation and normal CPU usage, but after a while of processing traffic it will go back to 100% CPU usage

  • as mentioned above, you can also download Zeek 3.0.2 and compile it manually according to https://docs.zeek.org/en/v3.0.2/install/install.html and duplicate the issue there, so this would seem to rule out any possible issues with our Zeek package or scripts

Please let me know if I can provide any further information to assist in duplicating and pinpointing this issue.

Thanks!

Hi all,
not sure yet, but this looks very similar to the issue I started here:
https://github.com/zeek/zeek/issues/838. I will take a look at which Zeek
scripts are enabled by default in SO to correlate these issues.

Petr

Hi Petr,

It does sound similar to issue 838. However, please note that I was able to duplicate this issue on a manual compile of Zeek 3.0.2 with no additional scripts loaded.

- I've been able to trigger this issue fairly consistently using VMware Workstation with the VM set to 4 processors (seems easier to duplicate when using processors rather than cores)

Thanks, I've reproduced with a similar configuration. First thing I
notice is `perf top` showing much time spent in
`threading::Manager::NextTimestamp`, so now off to try and understand
how it can get into that state.

(I've also duplicated this behavior using Zeek 3.0.2 compiled manually):

I'm expecting to have to go this way either to add more
instrumentation or to test potential patches, so for consistency, can
you provide the exact commands used from the point of ./configure
until the point of running/restarting zeek?

- Jon

Hi Jon,

Replies inline.

  • I’ve been able to trigger this issue fairly consistently using VMware Workstation with the VM set to 4 processors (seems easier to duplicate when using processors rather than cores)

Thanks, I’ve reproduced with a similar configuration. First thing I
notice is perf top showing much time spent in
threading::Manager::NextTimestamp, so now off to try and understand
how it can get into that state.

I’m glad you were able to reproduce this!

(I’ve also duplicated this behavior using Zeek 3.0.2 compiled manually):

I’m expecting to have to go this way either to add more
instrumentation or to test potential patches, so for consistency, can
you provide the exact commands used from the point of ./configure
until the point of running/restarting zeek?

- Jon

Going off of memory, but it should have been rather standard:

sudo apt-get install cmake make gcc g++ flex bison libpcap-dev libssl-dev python-dev swig zlib1g-dev

./configure
make
sudo make install
modify /usr/local/zeek/etc/node.cfg and configure the standalone stanza to sniff from the sniffing interface
sudo /usr/local/zeek/bin/zeekctl deploy

Please let me know if you need anything else.

Thanks!

Just a quick note to confirm that I’m seeing much time spent in threading::Manager::NextTimestamp here as well.

Can we disable scripts and run Zeek with a minimalist configuration and see if the result changes?

The timestamp thing looks like a packet timestamp?

Do you use libpcap by chance?

That didn't turn out to be directly related, but I did ultimately hunt
down a race condition in Broker as the explanation for it getting stuck
at full CPU load. This is the incoming patch:

    https://github.com/zeek/broker/pull/97

Thanks for the assists in tracking this down. Any further help in
verifying the patch would also be great.

- Jon

Good morning Jon,

I applied the relevant patch to the Zeek 3.0.2 tarball, compiled, and have been running for the last 3 hours or so with the same pcap replay loop described previously. Zeek still seems to be averaging 10% to 20% CPU usage, so the patch is looking good so far.
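
For anyone else who wants to try it before an official release, something along these lines should work; adjust the path to wherever the bundled Broker sources live in your source tree:

# fetch the PR as a patch and apply it to the bundled broker sources
curl -L https://github.com/zeek/broker/pull/97.patch -o broker-97.patch
patch -p1 -d zeek-3.0.2/aux/broker < broker-97.patch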

Would you recommend that we apply the patch to our Zeek 3.0.2 package OR hold off on 3.0.2 and wait for 3.0.3 assuming it will be included there?

Thanks!

Hi Doug, I do now have this on the schedule for fixing in Zeek 3.0.3
(and 3.1.1), but I'm not sure what time frame to expect for those
releases. I'll suggest to the team that we aim at doing one in ~2-3
weeks. If you'd like to patch ASAP, you may still want to wait at
least until that pull request gets a code review and is merged into
the master branch.

- Jon