Bro 2 has been crashing for me regularly and frequently for several
months. [ http://bit.ly/JJQVVf ]
Although I configured Bro in a way that works for me, it would still
be nice to use it as it is intended to be used.
I studied a number of crash dumps, and have looked through the code.
I was seeing crashes with the Bro 2.0 release, but I am now using a
version of Bro 2.0 from the git repositories that I checked out on
April 30. I saw very similar crashes in both versions.
Line 78 in PktSrc.cc is consistently related to issues in the
backtraces I'm getting from bro core dumps. I really haven't written
much code with libpcap, so there's probably a good reason to use
pcap_next() there. I'm just wondering, why not use pcap_next_ex()
there and do a bit of error checking before passing packet data along?
The way it is right now, it looks like the code just trusts that
pcap_next() read a packet successfully and then hands it off.
I think that in my case, something is going wrong with the call to
pcap_next() -- it's returning a pointer that doesn't make any sense.
If there was a little error checking around pcap_next() by using
pcap_next_ex() instead, maybe that would prevent the crash I'm seeing.
On the other hand, maybe there is some code that does some error
checking on if the value returned by pcap_next() makes sense and I'm
just not finding it.
Can anyone help me understand the choice to use pcap_next() vs pcap_next_ext()?
Line 78 in PktSrc.cc is consistently related to issues in the
backtraces I'm getting from bro core dumps. I really haven't written
much code with libpcap, so there's probably a good reason to use
pcap_next() there. I'm just wondering, why not use pcap_next_ex()
there and do a bit of error checking before passing packet data along?
The way it is right now, it looks like the code just trusts that
pcap_next() read a packet successfully and then hands it off.
pcap_next() returns NULL if an error occurs or no packets are read from a live capture. The call to it in PktSrc::ExtractNextPacket() that you mention does look like it checks the validity of the return value in several places and its own return value is based on it (which is also checked whenever it's called).
But I don't know why pcap_next_ex() isn't used to get information about errors so some text can be relayed to the user, maybe that function didn't exist at the time the code was written.
I think that in my case, something is going wrong with the call to
pcap_next() -- it's returning a pointer that doesn't make any sense.
If there was a little error checking around pcap_next() by using
pcap_next_ex() instead, maybe that would prevent the crash I'm seeing.
Do you have a stack trace you can send? If pcap_next() were returning a bogus pointer, I don't think you'd see the call to it in the stack, you'd be at a later point in the code where it attempted to access it and crashes. That is, if pcap_next() is in your stack trace, something bad is probably happening within the pcap library and the caller would never have the opportunity to check the return value.
Yes, I think that's the reason, we were trying to stay compatible to
older pcaps for a while. It should be safe to switch to pcap_next_ex()
now, but I'd like to understand the underlying problem here. I don't
think we have seen this problem reported from elsewhere yet. Could
this be an artifact of the VM setting?
I've attached a fairly recent backtrace from a core dump.
As you can see, the value being passed as pkt to net_packet_dispatch
triggers an "Address out of bounds" error. PktSrc::Process calls
PktSrc::ExtractNextPacket which calls pcap_next. The return value
from pcap_next sets the value for data, which PktSrc::Process
ultimately passes into net_packet_arrival as pkt. So, either
pcap_next is returning an out of bounds pointer, or something happens
to data between the point in time when pcap_next returns a values and
PktSrc::Process calls net_packet_dispatch.
I've tried to identify what type of traffic might cause this to
happen, but unfortunately, nothing is jumping out at me.
Whatever it is, I don't think that data should be an out of bounds
address. That's why I'm thinking that some preemptive error checking
may help.
So, either
pcap_next is returning an out of bounds pointer, or something happens
to data between the point in time when pcap_next returns a values and
PktSrc::Process calls net_packet_dispatch.
Could be, but then I'd expect a segfault much earlier, at least before NetSessions::DoNextPacket() when the packet data is accessed. In the end, your crash doesn't look related to accessing that data at all, it's an unhandled exception thrown from operator new (failure to allocate memory).
Another possibility could be that since GDB is working with optimized code (the reason why some arguments are "<value optimized out>"), the "out of bounds" indicators don't necessarily indicate a problem yet. If you `./configure --enable-debug` then rebuild/reinstall to disable optimizations, you can see if stack traces for future crashes start reporting valid addresses for the pointer.
block = new u_char[size];
if ( ! block )
reporter->InternalError("out of memory");
Since the new operator is throwing a std::bad_alloc error, that line
should be surrounded with a bit of error handling like in the attached
patch file. (To apply, copy to bro/src and use "patch
--ignore-whitespace < Reassem.cc.patch").
The result would look something like this:
try
{
block = new u_char[size];
}
catch(std::bad_alloc& exc)
{
reporter->InternalError("out of memory");
}
That way, if there is no memory available, at least this error can be
handled gracefully.
As it is right now, if new fails to allocate memory, the following if
statement will never execute and the end user will never see the "out
of memory" message in their logs.
block = new u_char[size];
if ( ! block )
reporter->InternalError("out of memory");
Since the new operator is throwing a std::bad_alloc error, that line
should be surrounded with a bit of error handling like in the attached
patch file.
I thought 'new' is defined to return a nil pointer upon memory exhaustion.
Is that wrong / dates me horribly?
I did a bit a of Googling to find some documentation that would say
that the new operator returns a NULL pointer by defaut, versus
returning std::bad_alloc. The only thing I could really find was a an
bug in Visual C++ that was documented in this KB article: http://support.microsoft.com/?kbid=167733
The funny thing is that I thought we did that already; and in fact
there's a function bro_new_handler() that reports an out of memory
error and aborts. However, we must have lost the call to the
corresponding set_new_handler() at some point: I don't find
bro_new_handler() used anywhere. So what we should do is add that back
in; I'll do that.
The if-statement that follows still doesn't make sense to me, though:
if ( ! block )
reporter->InternalError("out of memory");
If the handler in bro_new_handler() aborts, then this never executes.
If the handler succeeds in making more storage available and returns,
it seems to me that this if block still wouldn't execute because block
would not be null. If this if-statement did execute, then it seems
like the logic would be off -- you're not out of memory yet, because
the handler just returned with some after a second try.
It seems like a logical place to print an "out of memory" message to
the logs would be just before an abort in the function registered in
bro_new_handler() to handle catching std::bad_alloc errors.
The if-statement that follows still doesn't make sense to me, though:
Yes, agreed.
It seems like a logical place to print an "out of memory" message to
the logs would be just before an abort in the function registered in
bro_new_handler() to handle catching std::bad_alloc errors.
Indeed, that's what should happen once the handler gets installed
(with the caveat that the logging might not work if it needs to
allocate memory itself).
The if-statement that follows still doesn't make sense to me, though:
Right, it's vestigial from back when my understanding of new's behavior
was actually correct. (Now do me a favor and don't explicitly date my
coding skillz from that comment.