manager crash

Got this:
internal error: unknown msg type 101 in Poll()
/usr/local/bro-git-20110925/share/broctl/scripts/run-bro: line 60: 4445 Aborted (core dumped) nohup $mybro $@

from revision v1.6-dev-1302-g827dcea

I'll try with the latest.

Hmm... I would have suggested trying the latest logging fix, but
that's already in there. :-(

In the past, we have seen this error in one of two cases: (1)
communication overload, i.e., a node receives more messages than it
can handle, usually noticeable as extremely high CPU load; or (2)
*another* node crashes for whatever reason, and that then causes some
peers to crash as well with this error.

Does it look like one of these two?

Robin

Well, there is high CPU and high volume almost all of the time, and
all of the workers are still up and running, so this seems to be a
volume issue.

What's the typical CPU load on the manager with the fixes from the
weekend applied?

Robin

The manager processes are among the lowest in CPU utilization. The
workers are at the top.

top - 11:06:43 up 12 days, 22:51, 1 user, load average: 6.30, 5.90, 5.46
Tasks: 195 total, 16 running, 179 sleeping, 0 stopped, 0 zombie
Cpu(s): 1.7%us, 1.9%sy, 0.3%ni, 95.3%id, 0.0%wa, 0.1%hi, 0.7%si, 0.0%st
Mem: 66176992k total, 6142712k used, 60034280k free, 155124k buffers
Swap: 17358840k total, 0k used, 17358840k free, 1129988k cached

  PID USER PR NI  VIRT  RES  SHR S %CPU %MEM    TIME+ COMMAND
15361 root 20  0  315m 275m 134m R   57  0.4 18:44.79 bro
15357 root 20  0  317m 282m 137m S   55  0.4 16:32.71 bro
15363 root 20  0  312m 276m 137m R   47  0.4 16:55.25 bro
15355 root 20  0  314m 277m 137m R   43  0.4 18:22.26 bro
15360 root 20  0  308m 268m 134m R   43  0.4 16:30.32 bro
15359 root 20  0  326m 290m 137m R   39  0.5 20:11.75 bro
15364 root 20  0  321m 284m 137m R   39  0.4 17:54.01 bro
15362 root 20  0  309m 274m 134m R   37  0.4 18:31.64 bro
15193 root 20  0 82336  44m 3836 S   25  0.1  9:17.12 bro
15227 root 25  5 76420  25m  508 R   18  0.0  7:17.88 bro
15365 root 25  5  188m 145m 128m R   18  0.2  5:15.67 bro
15367 root 25  5  188m 145m 128m R   16  0.2  5:11.59 bro
15226 root 20  0 81832  43m 3812 S   14  0.1  9:01.03 bro
15368 root 25  5  188m 145m 128m S   14  0.2  5:11.62 bro
15372 root 25  5  188m 146m 128m S   14  0.2  5:26.85 bro
15369 root 25  5  188m 145m 128m S   12  0.2  5:14.59 bro
15370 root 25  5  188m 145m 128m R   12  0.2  5:12.91 bro
15371 root 25  5  188m 145m 128m S   12  0.2  5:19.80 bro
15194 root 25  5 76452  22m  508 R   10  0.0  7:03.27 bro
15366 root 25  5  188m 145m 128m R    4  0.2  5:18.45 bro

That's what I thought, and normally I wouldn't attribute the crash to
the manager's load. However, I forgot that this is all running on a
single box, where the manager may just not be getting enough cycles to
process what it receives.

If you see these crashes regularly, it may be worth playing with some
scheduling parameters, like pinning the manager process to its own
core.
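
(On Linux, for example, taskset can do that. A hypothetical invocation,
with the manager's PID left as a placeholder, would be:

    taskset -pc 0 <manager pid>

which restricts that process to CPU 0; the workers would then be kept
off that core.)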

Robin

PS: To be very clear, these 101 crashes should never happen at all;
it's a bug somewhere in the communication code that has evaded
detection for a while now. If anything, Bro should cleanly tear down
the connection if it can't deal with what it's getting.

Is your weird.log file oddly large? I've occasionally been seeing communication overload that is partly due to a large number of weird log messages. I'm becoming really tempted to turn off weird logging but still measure some of the weirds through the metrics framework, to find where there might be issues (checksum offloading, asymmetric routing, load-balancing problems, an abundance of out-of-order traffic, etc.).
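
Very roughly, I'm picturing something like the sketch below. To keep it
simple, this just counts weird names in a plain table instead of going
through the actual metrics framework, and it only catches the
non-connection weirds via net_weird; the real thing would hook the
other weird events as well:

    global weird_counts: table[string] of count;

    event net_weird(name: string)
        {
        if ( name !in weird_counts )
            weird_counts[name] = 0;
        ++weird_counts[name];
        }

    event bro_done()
        {
        for ( name in weird_counts )
            print fmt("%s: %d", name, weird_counts[name]);
        }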

.Seth

FWIW,
I often get tons of weirds from the DNS scripts. (The scripts appear to get confused about the number of answers to expect.)

cu
gregor

I disable weird logs through disable_stream().
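
For reference, assuming the 2.0-style logging framework and that the
weird stream's ID is Weird::LOG, that boils down to something like the
following in local.bro:

    event bro_init()
        {
        Log::disable_stream(Weird::LOG);
        }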

I'm planning on addressing that soon; those weirds aren't very good. :-)

  .Seth

Ah, that's right.

Robin, if a stream is disabled, that causes a remote logging host to stop trying to send the log to its log-accepting peer, correct? I just want to verify that the stream is as disabled as I'm assuming it is.

  .Seth