internal error: unknown msg type 101 in Poll()

I have been seeing several crashes per day due to 'internal error:
unknown msg type 101 in Poll()' in the manager process of a bro cluster
handling ~2.5 Gb/s of traffic. Here is a typical stack trace:

Program terminated with signal 6, Aborted.
#0 0x000000080158ef6c in kill () from /lib/libc.so.6
#0 0x000000080158ef6c in kill () from /lib/libc.so.6
#1 0x000000080158ddfd in abort () from /lib/libc.so.6
#2 0x000000000040b329 in internal_error () at SSLInterpreter.cc:31
#3 0x000000000050efde in RemoteSerializer::InternalCommError (this=0x8fd3,
msg=0x8fd3 <Address 0x8fd3 out of bounds>) at RemoteSerializer.cc:2714
#4 0x000000000051668b in RemoteSerializer::Poll (this=0x7cb7e0,
may_block=false) at RemoteSerializer.cc:1477
#5 0x0000000000516c83 in RemoteSerializer::NextTimestamp (this=0x7cb7e0,
local_network_time=0x7fffffffe330) at RemoteSerializer.cc:1294
#6 0x00000000004d6575 in IOSourceRegistry::FindSoonest (this=0x79a310,
ts=0x7fffffffe518) at stl_list.h:131
#7 0x00000000004f2df3 in net_run () at Net.cc:509
#8 0x0000000000408938 in main (argc=36152552, argv=0x0) at main.cc:999

This seems to be the same problem as ticket #203. Robin's comment (see
<http://tracker.icir.org/bro/ticket/203#comment:1> suggests this may be
caused by high system load, but that doesn't seem to be the case.

To check this, I have set up two clusters fed by the same input traffic.
The first is a cluster of seven machines with a single bro instance
running on each. The cluster has four workers, two proxies, and the
manager node. In broctl, 'top' rarely reports CPU utilization over 10%
for any node, and memory consumption is typically < 250 MB per process.
The manager process in this cluster crashes several times per day.

The second cluster is just one machine: a dual quad-core Xeon system
with 16 GB of RAM. It is running six instances of bro: four workers
each listening to a different network interface, one proxy, and one
manager. CPU utilization is often ~50% on the workers, and as high as
20% on the manager. Although 'netstats' reports more packet loss for
this cluster, the manager does not crash.

Is there some other line of investigation I should pursue? A
single-machine Bro cluster won't handle much more traffic, so this isn't
a useful workaround for the long term.

Just as a data point, we are seeing the same thing here at NERSC in a
two machine cluster. Manager diag output looks like:

%broctl diag manager
[manager]
Could not find the frame base for "RemoteSerializer::InternalCommError(char const*)".
Cannot access memory at address 0x5
==== stderr.log
pcap bufsize = 2097152
listening on em1
1266607176.728276 internal error: unknown msg type 101 in Poll()
/bro/share/broctl/scripts/run-bro: line 73: 5695 Abort trap: 6 (core dumped) nohup $tmpbro $@
==== stdout.log

==== .status
TERMINATED [internal_error]

==== No prof.log.

bro.core
Core was generated by `bro'.
Program terminated with signal 6, Aborted.
#0 0x2870c017 in kill () from /lib/libc.so.7
#0 0x2870c017 in kill () from /lib/libc.so.7
#1 0x2870bf76 in raise () from /lib/libc.so.7
#2 0x2870ab8a in abort () from /lib/libc.so.7
#3 0x08051894 in internal_error () at SSLInterpreter.cc:30
#4 0x08164d61 in RemoteSerializer::InternalCommError (this=) at RemoteSerializer.cc:2714
#5 0x0816c68b in RemoteSerializer::Poll (this=0xbfbfe57c, may_block=116) at RemoteSerializer.cc:1478
#6 0x0816c87b in RemoteSerializer::NextTimestamp (this=0x82df3c8, local_network_time=0xbfbfe7f8) at RemoteSerializer.cc:1294
#7 0x08129a7b in IOSourceRegistry::FindSoonest (this=0x82b8f58, ts=0xbfbfe838) at IOSource.cc:61
#8 0x081465ce in net_run () at Net.cc:509
#9 0x0804fcef in main (argc=) at main.cc:999

The memory address is consistent across crashes.

This is the stock 1.5.1 with the only "unusual" thing running on the
system being Seth's policy scripts for DNS, SMTP and HTTP logging.

thanks,
scott

Try two things.

1. Apply this patch...
http://tracker.icir.org/bro/ticket/220#comment:13

2. Add the following to your local.bro script:
redef notice_action_filters += {
         [Weird::ContentGap] = ignore_notice,
         [Weird::AckAboveHole] = ignore_notice,
};
redef suppress_notice_actions += {
         Weird::ContentGap,
         Weird::AckAboveHole,
};

Hopefully those will help. There are bugs buried deeper in some of that code, but those two changes should help to mitigate them on cluster deployments.

Let me know how it goes.

   .Seth

That's is interesting. Thanks for comparing the two setups. I've
seen this problem myself so far only when one of the systems
appeared to be under too much load or crashed for some other reason.
However, in that case it was then *not* the node with the overload
reporting the 101 message, but (I think) always manager or proxy.
Therefore, I've been thinking that this message is the result of
manager/proxy not being able to deal nicely with other nodes having
trouble.

But it seems in your tests, none of the other systems showed any
kind of unusual behaviour, is that right?

To fix this, I'll need some way of reliably reproducing it. I
haven't actually seen it myself in a while now. I should build some
similar test setup over here I guess.

Robin

P.S.: Can you send me remote.logs from your setups, ideally
one with a crash from the first cluster and the other from the
second without a crash, both for (roughly) the same time interval.

Robin Sommer wrote:

To check this, I have set up two clusters fed by the same input traffic.

It seems that the single machine Bro cluster also suffers from manager
crashes, but much less often. Over the past 48 hours the manager in the
7-machine cluster has crashed 13 times, while the manager in the
single-machine cluster has crashed twice.

[...]

But it seems in your tests, none of the other systems showed any
kind of unusual behaviour, is that right?

Yes, I have only seen the manager crash. In both clusters, the error
has been the same.

Thanks for the suggestions. If I'm understanding correctly, the policy
changes should help prevent load spikes from missing packets in the
captured traffic. Since I am capturing traffic that includes flows that
exceed 1 Gb/s, the workers will see periods of heavy load that are
missing a lot of packets.

Tweaking small_timeout down should also help prevent buffer overruns
during a period of heavy load, at the cost of increasing the overall
system load. Will these changes affect Bro in other ways as well?

Thanks for the suggestions. If I'm understanding correctly, the policy
changes should help prevent load spikes from missing packets in the
captured traffic. Since I am capturing traffic that includes flows that
exceed 1 Gb/s, the workers will see periods of heavy load that are
missing a lot of packets.

It should prevent spikes in the number of events that your workers are sending to your manager which should help. There are some problems with the pipes between the parent (the real worker) and child (communication) processes being filled and causing other strange issues. I used to see this exact problem, but it has been a fairly long time.

If you can try and reduce the number of events your manager is receiving that should help to mitigate the problem until the root problem is found.

Tweaking small_timeout down should also help prevent buffer overruns
during a period of heavy load, at the cost of increasing the overall
system load. Will these changes affect Bro in other ways as well?

I think that increased system load should be the only change, but you won't actually see a change in cpu usage if you're running this on an active cluster. The only time you should see any difference is if you are running Bro that isn't seeing any activity. I'll leave it up to Robin or Vern to say for sure though. :slight_smile:

   .Seth

FYI, I tested it, with a vanilla config, i.e. nothing but the stock
scripts, and was getting the 101 Poll message. I just migrated my
cluster from FreeBSD to Linux, and the error seems to be happening
slightly more often on Linux.

I just tried Seth's suggestions about filtering ContentGap and
AckAboveHole, and it has been quiet for the last couple hours. I had
been noticing the error more often during our peak times, so I should
see crashes right now. My second proxy seems to be crashing a bit more
often after implementing the filter, but it is on a machine with only 4
GB of RAM, so it is using up all of the physical memory.

Tyler

I ran into general load issues when I switched to running a single node
cluster.. I traced it back to the same problem with ContentGap and
AckAboveHole.. I also ignored Weird::WeirdActivity, which helped too.

If I ran capstats on the 'lo' interface, I would see Bro doing about 10mbps and
a few thousand packets/sec for what seemed like no reason. After ignoring
those two event types lo now has under .1 mbps and about 20 packets/sec.

FWIW, I've also observed a significant decrease in CPU load on the
manager parent process as displayed by 'broctl top'. The workers still
report high utilization, but the manager seems to be largely unaffected.
Without Seth's policy changes in local.bro I would see a spike in CPU
load on the manager whenever any worker was under heavy load.

I have been seeing several crashes per day due to 'internal error:
unknown msg type 101 in Poll()' in the manager process of a bro
cluster
handling ~2.5 Gb/s of traffic. Here is a typical stack trace:

Try two things.

[...]

2. Add the following to your local.bro script:
redef notice_action_filters += {
         [Weird::ContentGap] = ignore_notice,
         [Weird::AckAboveHole] = ignore_notice,
};
redef suppress_notice_actions += {
         Weird::ContentGap,
         Weird::AckAboveHole,
};

FYI, this policy change was sufficient to fix the stability problems in
my cluster. I now believe the underlying problem is upstream packet
loss, and I have been working to eliminate that problem in my traffic
capture and distribution network. In hindsight, I've had similar
problems in the past with overloaded standalone Bro processes. Missing
packets generate a significant amount of additional load, and this tends
to turn a small problem into a more serious one. Perhaps we can start a
'debugging Bro problems' page on the wiki, or add this policy to an
existing one?

Can somebody please file a ticket with the recommendations collected
in this thread? I'll see what makes sense to integrate into the
default cluster config.

(The Wiki page is still a good idea, and it's fine for the ticket to
just link there.)

Thanks,

Robin

Done. http://tracker.icir.org/bro/ticket/245

   .Seth