manager-child cpu utilization?

I'm seeing what appears to be extremely high CPU utilization by the manager's child process in a rather large (23 worker?) cluster. I believe the deserialization is being done by the child process, and I'm guessing that it's being overwhelmed by all of the deserialization it has to do for the logging framework's log records.

Anyone have thoughts on whether I'm correct, or on ways this could be optimized? I'm worried this could become a problem because logs are written much faster at other sites, and I can see those sites overloading the communication framework very easily. At the least, I think it's something we need to keep a close eye on.

  .Seth

Yep, this is definitely happening. Our cluster's manager and its
child, as well as the proxy and its child chew up about 20-25% CPU
each, so that's almost a full CPU for the manager/child and
proxy/child.

I ran strace on the manager parent and child processes during a period of
relatively high use and came up with the following sets of system calls:

Parent:

 98337 kill
143871 read
128700 select
  8058 write

(over ~70 seconds)

Child:

 56193 read
     1 rt_sigreturn
684399 select
333803 write

(over ~75 seconds)

For the parent, the kill() calls look like this:

1316290215.456133 kill(10062, SIG_0) = 0

with 10062 being the manager's child process.
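
For reference, kill() with signal 0 doesn't deliver a signal at all; it
just checks whether the target PID still exists. So those ~98k calls look
like the parent periodically polling whether its child is still alive.
Here's a minimal sketch of that pattern (illustrative only, not Bro's
actual code; the PID is simply the one from the trace above):

#include <sys/types.h>
#include <signal.h>
#include <cerrno>
#include <cstdio>

// Probe whether a process exists, without delivering any signal.
static bool child_is_alive(pid_t child)
{
    if ( kill(child, 0) == 0 )
        return true;            // process exists and we may signal it

    return errno == EPERM;      // EPERM: exists but not ours; ESRCH: gone
}

int main()
{
    pid_t child = 10062;        // PID taken from the strace line above
    std::printf("child %d alive: %s\n", (int) child,
                child_is_alive(child) ? "yes" : "no");
    return 0;
}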

There is a select loop in RemoteSerializer.cc that may be a candidate
for tuning:

// FIXME: Fine-tune this (timeouts, flush, etc.)
struct timeval small_timeout;
small_timeout.tv_sec = 0;
small_timeout.tv_usec = io->CanWrite() || io->CanRead() ? 1 : 10;

int a = select(max_fd + 1, &fd_read, &fd_write, &fd_except,
               &small_timeout);

if ( a == 0 )
    ++timeouts;

if ( selects % 100000 == 0 )
    Log(fmt("selects=%ld canwrites=%ld timeouts=%ld",
            selects, canwrites, timeouts));

if ( a < 0 )
    // Ignore errors for now.
    continue;

or, in ChunkedIO.cc: small_timeout.tv_usec = 50;
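
With tv_usec in the 1-10 range, select() returns almost immediately
every time, so the loop effectively busy-polls even when the link is
idle, which would fit the huge select counts above. Here's a
self-contained sketch of that shape (made-up pipe and one-second
window, not Bro's actual loop), just to show why such a small timeout
keeps a core busy:

#include <sys/select.h>
#include <unistd.h>
#include <ctime>
#include <cstdio>

int main()
{
    int fds[2];
    if ( pipe(fds) != 0 )       // nothing will ever be written to it
        return 1;

    long selects = 0, timeouts = 0;
    std::time_t start = std::time(0);

    // Spin for roughly one second with nothing to read, mirroring the
    // shape of the loop quoted above.
    while ( std::difftime(std::time(0), start) < 1.0 )
    {
        fd_set fd_read;
        FD_ZERO(&fd_read);
        FD_SET(fds[0], &fd_read);

        // A microsecond-scale timeout means select() wakes up again
        // and again even though there is no data.
        struct timeval small_timeout;
        small_timeout.tv_sec = 0;
        small_timeout.tv_usec = 10;

        int a = select(fds[0] + 1, &fd_read, 0, 0, &small_timeout);

        ++selects;
        if ( a == 0 )
            ++timeouts;
    }

    std::printf("selects=%ld timeouts=%ld in ~1 second\n",
                selects, timeouts);
    return 0;
}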

These are just a pair of guesses - the current version is 1.6-dev-1261
running on Linux golumn.nersc.gov 2.6.38.6-26.rc1.fc15.x86_64 #1 SMP
Mon May 9 20:45:15 UTC 2011 x86_64 x86_64 x86_64 GNU/Linux, via PF_RING
5.01 and the ixgbe+DNA driver.

scott

> I'm seeing what appears to be extremely high CPU utilization by the
> manager's child process in a rather large (23 worker?) cluster. I
> believe the deserialization is being done by the child process

It's not, actually; the child process just relays the raw data between
the remote side and the main process. The latter does the
(de-)serialization.
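
For context, here's a conceptual sketch of what such a relay-only child
amounts to (not the actual RemoteSerializer/ChunkedIO code): it just
shuffles raw bytes between the peer socket and the pipe to the parent,
which is also why its strace profile above is almost entirely
read/write/select.

#include <sys/select.h>
#include <unistd.h>

// Forward one chunk of bytes from 'from' to 'to' without interpreting
// them. Returns false on EOF or error (partial writes ignored for
// brevity).
static bool forward(int from, int to)
{
    char buf[8192];
    ssize_t n = read(from, buf, sizeof(buf));
    if ( n <= 0 )
        return false;

    return write(to, buf, n) == n;
}

// Relay loop: block until either side has data, then copy it across.
// All parsing/deserialization happens on the other end of 'parent_fd'.
void relay(int peer_fd, int parent_fd)
{
    for ( ; ; )
    {
        fd_set fd_read;
        FD_ZERO(&fd_read);
        FD_SET(peer_fd, &fd_read);
        FD_SET(parent_fd, &fd_read);

        int max_fd = peer_fd > parent_fd ? peer_fd : parent_fd;
        if ( select(max_fd + 1, &fd_read, 0, 0, 0) < 0 )
            break;

        if ( FD_ISSET(peer_fd, &fd_read) && ! forward(peer_fd, parent_fd) )
            break;

        if ( FD_ISSET(parent_fd, &fd_read) && ! forward(parent_fd, peer_fd) )
            break;
    }
}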

Also, the logging actually does not use (most of) the normal
serialization framework. It has its own code for (un-)marshalling its
data, which is less complex. And I'm just realizing that once we have
merged in the threaded logging, we could actually shift the
demarshalling into the threads, I believe.

> Yep, this is definitely happening. Our cluster's manager and its
> child, as well as the proxy and its child chew up about 20-25% CPU
> each, so that's almost a full CPU for the manager/child and
> proxy/child.

Note that the proxy doesn't see the logging data; only &synchronized
data passes through it. For the manager it's the opposite: it sees the
logs, but not the &synchronized data. Seth, didn't you say we actually
don't have much &synchronized at all anymore now?

The communication performance with the new scripts is something I'm
still planning to play with a bit more myself. Soon, hopefully.

It turns out that at least some of this was caused by the inefficiency
described in #498. That's fixed now, and I'm seeing much lower CPU
utilization for the child process.

Robin