manager-child cpu utilization?

I'm seeing what appears to be extremely high CPU utilization by the manager's child process in a rather large (23 worker?) cluster. I believe the deserialization is being done by the child process, and I'm guessing that it's being overwhelmed by all of the deserialization it has to do for the logging framework's log records.

Anyone have thoughts on whether I'm correct, or on ways this could be optimized? I'm worried this could become a problem because logs are written much faster at other sites, and I can see those sites overloading the communication framework very easily. At the least, I think it's something we need to keep a close eye on.

  .Seth

Yep, this is definitely happening. Our cluster's manager and its
child, as well as the proxy and its child chew up about 20-25% CPU
each, so that's almost a full CPU for the manager/child and
proxy/child.

I ran strace on the manager parent and child processes during a period of
relatively high use and came up with the following sets of system calls:

Parent:

 98337 kill
143871 read
128700 select
  8058 write

(over ~70 seconds)

Child:

 56193 read
     1 rt_sigreturn
684399 select
333803 write

(over ~75 seconds)

For the parent, the kill() calls look like this:

1316290215.456133 kill(10062, SIG_0) = 0

with 10062 being the manager's child process.
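
For reference, kill() with signal 0 doesn't deliver a signal at all; it
just checks whether the target PID still exists. So those ~98k calls look
like the parent periodically polling whether its child is still alive.
Here's a minimal sketch of that pattern (illustrative only, not Bro's
actual code; the PID is simply the one from the trace above):

#include <sys/types.h>
#include <signal.h>
#include <cerrno>
#include <cstdio>

// Probe whether a process exists, without delivering any signal.
static bool child_is_alive(pid_t child)
{
    if ( kill(child, 0) == 0 )
        return true;            // process exists and we may signal it

    return errno == EPERM;      // EPERM: exists but not ours; ESRCH: gone
}

int main()
{
    pid_t child = 10062;        // PID taken from the strace line above
    std::printf("child %d alive: %s\n", (int) child,
                child_is_alive(child) ? "yes" : "no");
    return 0;
}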

There is a select loop in RemoteSerializer.cc that may be a candidate
for tuning:

// FIXME: Fine-tune this (timeouts, flush, etc.)
struct timeval small_timeout;
small_timeout.tv_sec = 0;
small_timeout.tv_usec = io->CanWrite() || io->CanRead() ? 1 : 10;

int a = select(max_fd + 1, &fd_read, &fd_write, &fd_except,
               &small_timeout);

if ( a == 0 )
    ++timeouts;

if ( selects % 100000 == 0 )
    Log(fmt("selects=%ld canwrites=%ld timeouts=%ld",
            selects, canwrites, timeouts));

if ( a < 0 )
    // Ignore errors for now.
    continue;

or, in ChunkedIO.cc: small_timeout.tv_usec = 50;
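
With tv_usec in the 1-10 range, select() returns almost immediately
every time, so the loop effectively busy-polls even when the link is
idle, which would fit the huge select counts above. Here's a
self-contained sketch of that shape (made-up pipe and one-second
window, not Bro's actual loop), just to show why such a small timeout
keeps a core busy:

#include <sys/select.h>
#include <unistd.h>
#include <ctime>
#include <cstdio>

int main()
{
    int fds[2];
    if ( pipe(fds) != 0 )       // nothing will ever be written to it
        return 1;

    long selects = 0, timeouts = 0;
    std::time_t start = std::time(0);

    // Spin for roughly one second with nothing to read, mirroring the
    // shape of the loop quoted above.
    while ( std::difftime(std::time(0), start) < 1.0 )
    {
        fd_set fd_read;
        FD_ZERO(&fd_read);
        FD_SET(fds[0], &fd_read);

        // A microsecond-scale timeout means select() wakes up again
        // and again even though there is no data.
        struct timeval small_timeout;
        small_timeout.tv_sec = 0;
        small_timeout.tv_usec = 10;

        int a = select(fds[0] + 1, &fd_read, 0, 0, &small_timeout);

        ++selects;
        if ( a == 0 )
            ++timeouts;
    }

    std::printf("selects=%ld timeouts=%ld in ~1 second\n",
                selects, timeouts);
    return 0;
}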

These are just a pair of guesses - the current version is 1.6-dev-1261
running on Linux golumn.nersc.gov 2.6.38.6-26.rc1.fc15.x86_64 #1 SMP
Mon May 9 20:45:15 UTC 2011 x86_64 x86_64 x86_64 GNU/Linux, via PF_RING
5.01 and the ixgbe+DNA driver.

scott

> I'm seeing what appears to be extremely high CPU utilization by the
> manager's child process in a rather large (23 worker?) cluster. I
> believe the deserialization is being done by the child process

It's not, actually; the child process just relays the raw data between
the remote side and the main process. The latter does the
(de-)serialization.
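
For context, here's a conceptual sketch of what such a relay-only child
amounts to (not the actual RemoteSerializer/ChunkedIO code): it just
shuffles raw bytes between the peer socket and the pipe to the parent,
which is also why its strace profile above is almost entirely
read/write/select.

#include <sys/select.h>
#include <unistd.h>

// Forward one chunk of bytes from 'from' to 'to' without interpreting
// them. Returns false on EOF or error (partial writes ignored for
// brevity).
static bool forward(int from, int to)
{
    char buf[8192];
    ssize_t n = read(from, buf, sizeof(buf));
    if ( n <= 0 )
        return false;

    return write(to, buf, n) == n;
}

// Relay loop: block until either side has data, then copy it across.
// All parsing/deserialization happens on the other end of 'parent_fd'.
void relay(int peer_fd, int parent_fd)
{
    for ( ; ; )
    {
        fd_set fd_read;
        FD_ZERO(&fd_read);
        FD_SET(peer_fd, &fd_read);
        FD_SET(parent_fd, &fd_read);

        int max_fd = peer_fd > parent_fd ? peer_fd : parent_fd;
        if ( select(max_fd + 1, &fd_read, 0, 0, 0) < 0 )
            break;

        if ( FD_ISSET(peer_fd, &fd_read) && ! forward(peer_fd, parent_fd) )
            break;

        if ( FD_ISSET(parent_fd, &fd_read) && ! forward(parent_fd, peer_fd) )
            break;
    }
}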

Also, the logging actually does not use (most of) the normal
serialization framework. It has its own code for (un-)marshalling its
data, which is less complex. And I'm just realizing that once we have
merged in the threaded logging, we could actually shift the
demarshalling into the threads, I believe.

> Yep, this is definitely happening. Our cluster's manager and its
> child, as well as the proxy and its child chew up about 20-25% CPU
> each, so that's almost a full CPU for the manager/child and
> proxy/child.

Note that the proxy doesn't see the logging data; only &synchronized
data passes through it. For the manager it's the opposite: it sees the
logs, but not the &synchronized data. Seth, didn't you say we actually
don't have much &synchronized at all anymore now?

The communication performance with the new scripts is something I'm
still planning to play with a bit more myself. Soon, hopefully.

It turns out that at least some of this was caused by the inefficiency
described in #498. That's fixed now, and I'm seeing much lower CPU
utilization for the child process.

Robin