high cpu usage and strange select(2) behavior

Hiya,

here with a standalone bro-2.0, I'm seeing 2 bro processes
using about 30% of the time of a cpu each regarless of whether
there's traffic or not to watch.

Doing a strace on them I see, for one process only select system
calls in a loop with null or ridiculously low timeouts:

select(12, [0 11], [], [], {0, 10}) = 0 (Timeout)
select(12, [11], NULL, NULL, {0, 0}) = 0 (Timeout)
select(12, [11], NULL, NULL, {0, 0}) = 0 (Timeout)
select(12, [0 11], [], [], {0, 10}) = 0 (Timeout)
select(12, [11], NULL, NULL, {0, 0}) = 0 (Timeout)
select(12, [11], NULL, NULL, {0, 0}) = 0 (Timeout)
select(12, [0 11], [], [], {0, 10}) = 0 (Timeout)
select(12, [11], NULL, NULL, {0, 0}) = 0 (Timeout)

0 being tcp *:47760 (LISTEN), and 11 a unix domain socket, probably
a socket pair to communicate with the other bro process. Why
those 3 selects in an infinite loop, why a timeout? Can't the
select just sit on [0 11] if it's the only inputs it gets? Here
it seems it is just wasting 30% of the time of the CPU.

The other process behavior is even weirder:

select(0, NULL, NULL, NULL, {0, 20}) = 0 (Timeout)
select(11, [4 7 10], [0], [0], {0, 0}) = 1 (out [0], left {0, 0})
kill(15287, SIG_0) = 0
read(10, 0xb6ebc008, 1048576) = -1 EAGAIN (Resource temporarily unavailable)
gettimeofday({1328798101, 327305}, NULL) = 0
select(0, NULL, NULL, NULL, {0, 20}) = 0 (Timeout)
select(11, [4 7 10], [0], [0], {0, 0}) = 1 (out [0], left {0, 0})
kill(15287, SIG_0) = 0
read(10, 0xb6ebc008, 1048576) = -1 EAGAIN (Resource temporarily unavailable)
gettimeofday({1328798101, 328513}, NULL) = 0
select(0, NULL, NULL, NULL, {0, 20}) = 0 (Timeout)
select(11, [4 7 10], [0], [0], {0, 0}) = 1 (out [0], left {0, 0})
kill(15287, SIG_0) = 0
read(10, 0xb6ebc008, 1048576) = -1 EAGAIN (Resource temporarily unavailable)

A select with no fd to watch (why not use nanosleep?) a read()
on a non-blocking fd following a select that returns on timeout
(null timeout with nothing for that fd). And again, why not have
the select() sit waiting for input?

That doesn't seem quite right.

Is that the expected behavior, or is there any way to configure
it so that it behaves itself?

IMO looping through select isn't an issue. On an idle/lightly loaded
system, select is going to be the dominate usage of cpu but when you
start seeing increased load, the rest of the system begins to pick up
in usage with select taking a smaller and smaller portion of the CPU.
Is it really an issue using 30% CPU on an otherwise idle system?

About using nanosleep vs select, i don't really see any difference in
terms of what you want to achieve. In either case the process doing
the select or nanosleep is going to be put to sleep till the timer
expires so they both end up achieving the same thing. It isn't
uncommon for one to use select/poll in that way to sleep.

Sridhar

2012-02-09 12:31:29 -0500, sridhar basam:
[...]

IMO looping through select isn't an issue. On an idle/lightly loaded
system, select is going to be the dominate usage of cpu but when you
start seeing increased load, the rest of the system begins to pick up
in usage with select taking a smaller and smaller portion of the CPU.
Is it really an issue using 30% CPU on an otherwise idle system?

[...]

Thanks Sridhar for the feedback.

That's assuming a system where only bro is running.

Here, bro is running in a VM, and I've got other VMs running
there (with CPU overcommit) and other software running on that
VM (other IDS which also do offline deep inspection in the
background), so those CPU cycles wasted by bro are not available
to those other processes and to the other VMs. (and it's 60%
CPU, not 30).

But the reason I asked was because I thought it was a
configuration problem of mine, because I found it abnormal for
bro to use that much CPU when idle, and thought that could
explain the alerts about dropped packets where the other IDSes
are fine.

About using nanosleep vs select, i don't really see any difference in
terms of what you want to achieve. In either case the process doing
the select or nanosleep is going to be put to sleep till the timer
expires so they both end up achieving the same thing. It isn't
uncommon for one to use select/poll in that way to sleep.

[...]

OK, I agree that's a minor point, but why not use the right tool
for the job? it makes the code easier to read and saves a few
CPU cycles.

Cheers,
Stephane

The high overhead is due to the Bro communication loop. If you run in standalone mode (or just manually run a Bro process without loading frameworks/communication/listen.bro) you won't see the high cpu load.

  .Seth

2012-02-10 09:16:52 -0500, Seth Hall:

> But the reason I asked was because I thought it was a
> configuration problem of mine, because I found it abnormal for
> bro to use that much CPU when idle, and thought that could
> explain the alerts about dropped packets where the other IDSes
> are fine.

The high overhead is due to the Bro communication loop. If
you run in standalone mode (or just manually run a Bro process
without loading frameworks/communication/listen.bro) you won't
see the high cpu load.

[...]

Thanks Seth,

I think we're onto something here. node.cfg has "standalone" and
yet the cpu load is high. Where should I be looking? (I have to
admit I'm quite new to bro).

# broctl status
waiting for lock ...... ok
Name Type Host Status Pid Peers Started
bro standalone localhost running 4508 0 10 Feb 14:16:37

policy/auto/standalone-layou.bro has

# Automatically generated. Do not edit.
redef Communication::listen_port = 47760/tcp;
redef Communication::nodes += {
        ["control"] = [$host=127.0.0.1, $class="control", $events=Control::controller_events],
};

# lsof -i tcp -ac bro
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
bro 4526 root 0u IPv4 3729419 0t0 TCP *:47760 (LISTEN)

# broctl config | grep -e polic -e standalone
defsitepolicypath = /usr/local/share/bro/site
policydir = /usr/local/share/bro
policydirbroctl = /nsm/bro/spool/policy/broctl
policydirsiteinstall = /nsm/bro/spool/policy/site
policydirsiteinstallauto = /nsm/bro/spool/policy/auto
sitepolicymanager = local-manager.bro
sitepolicypath = /usr/local/share/bro/site
sitepolicystandalone = local.bro
sitepolicyworker = local-worker.bro
standalone = 1

Thanks,
Stephane

Oops! I completely forgot that the communications framework is always enabled when you use BroControl so that the control framework is available. It makes it possible to communicate with and control Bro at runtime through broctl.

There aren't any knobs to disable that at the moment since BroControl's functionality would be sporadically broken without it. I've been bugging Robin about the high cpu load with the communications framework enabled and low traffic for quite a few years but you can actually compensate for it quite a bit now by increasing the timeout values that you were writing about before.

  .Seth

Coming late to this thread, two remarks:

    - Bro's I/O loop definitly needs some work, there's some stuff in
      there that's clearly suboptimal. High CPU load with low traffic
      is one of them.

    - That said, while some things may look odd, there are reasons for
      why it's done this way. It's mostly platform artifacts: at the
      time the code was written, non-blocking pcap file descriptors
      weren't (correctly) supported on some OSs; and sometimes capture
      performance turned out to be better with a less intuitive
      version of the code.

Reworking this is on the todo list but it needs care to not make
things worse.

Robin