[BRO-ISSUE]: bro crashes under many Reporter::Error calls

Hi all,

I'm using bro 2.5.1 for network security monitoring; the message queue component is Kafka (bro-to-kafka plugin v0.5.0, librdkafka v0.9.5).

When network traffic rises to about 1.6 Gbps, bro crashes with a segmentation fault at src/Event.cc#90.

Our test environment:

CPU: 32 core Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz

Memory: 64G

NIC: Speed 10000Mb/s

Storage: 2TB SATA, 100GB SSD

Below is the backtrace from the core dump (more on gist).

    #0  SetNext (this=0x0, n=0x7fe292ebd490) at /opt/download/bro/src/Event.h:21
    #1  EventMgr::QueueEvent (this=0xc302c0, event=event@entry=0x7fe292ebd490) at /opt/download/bro/src/Event.cc:90
    #2  0x00000000005fe6a7 in QueueEvent (obj=0x0, mgr=0x0, aid=0, src=0, vl=0x7fe2e2bedb80, h=..., this=<optimized out>) at /opt/download/bro/src/Event.h:88
    #3  Reporter::DoLog (this=0x29aabb0, prefix=prefix@entry=0x908cd7 "error", event=..., out=0x0, conn=conn@entry=0x0, addl=addl@entry=0x0, location=location@entry=true, time=time@entry=true, postfix=postfix@entry=0x0, fmt=fmt@entry=0x7fe36c719d70 "Kafka send failed: %s", ap=ap@entry=0x7fe36aa3eaf8) at /opt/download/bro/src/Reporter.cc:350
    #4  0x00000000005fee8f in Reporter::Error (this=<optimized out>, fmt=fmt@entry=0x7fe36c719d70 "Kafka send failed: %s") at /opt/download/bro/src/Reporter.cc:76
    #5  0x00007fe36c717fa9 in logging::writer::KafkaWriter::DoWrite (this=0x6369270, num_fields=<optimized out>, fields=<optimized out>, vals=0x69d2080) at /opt/download/bro/aux/plugins/kafka/src/KafkaWriter.cc:156
    #6  0x000000000089e495 in logging::WriterBackend::Write (this=0x6369270, arg_num_fields=<optimized out>, num_writes=1000, vals=0x6dc7bf0) at /opt/download/bro/src/logging/WriterBackend.cc:301
    #7  0x0000000000662180 in threading::MsgThread::Run (this=0x6369270) at /opt/download/bro/src/threading/MsgThread.cc:371
    #8  0x000000000065eaa8 in threading::BasicThread::launcher (arg=0x6369270) at /opt/download/bro/src/threading/BasicThread.cc:205
    #9  0x00007fe36e8ce2b0 in ?? () from /lib64/libstdc++.so.6
    #10 0x00007fe36ed2ce25 in start_thread () from /lib64/libpthread.so.0
    #11 0x00007fe36e03634d in clone () from /lib64/libc.so.6

Variables in frame 1:

    (gdb) f 1
    #1  EventMgr::QueueEvent (this=0xc302c0, event=event@entry=0x7fe292ebd490) at /opt/download/bro/src/Event.cc:90
    90              tail->SetNext(event);
    (gdb) info args
    this = 0xc302c0
    event = 0x7fe292ebd490
    (gdb) info locals
    done = <optimized out>
    (gdb) p head
    $1 = (Event *) 0x7fe3540c81c0
    (gdb) p tail
    $2 = (Event *) 0x0
    (gdb) p event
    $3 = (Event *) 0x7fe292ebd490

Every time bro crashed during testing, tail was a NULL pointer, while head was sometimes NULL and sometimes not.

From my research: under heavy traffic, KafkaWriter writes logs to Kafka faster than librdkafka allows, exceeding queue.buffering.max.messages (default 100000) or queue.buffering.max.kbytes (default 4000000, i.e. 4 GB). librdkafka then raises a QUEUE_FULL error, and KafkaWriter calls Reporter::Error to report the runtime error, so KafkaWriter::DoWrite ends up triggering a very large number of Reporter::Error calls.
My guess is that concurrent access to the variable tail without a lock leaves it set to a NULL pointer, though I don't know exactly why.

Calling SetNext through that NULL pointer then raises the segmentation fault.

The above is my guesswork; maybe there is another reason.

Wish someone could help.

Best regards,
Myth

This would be a problem if bro is configured to send the reporter.log to Kafka.

Reporter::Error generates a reporter_error event which then calls

    Log::write(Reporter::LOG, [$ts=t, $level=ERROR, $message=msg, $location=location]);

So you're probably also ending up in the situation where bro is trying to log to Kafka the fact that Kafka is broken.

If you tell bro to not send the reporter.log to Kafka, does your problem go away?
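One way to try that, sketched below. The Kafka::logs_to_send name is the metron plugin's knob and is an assumption here (older bro-to-kafka versions may spell it differently); Log::remove_default_filter is standard Bro:

```bro
# Only ship selected streams to Kafka; Reporter::LOG is deliberately absent.
# Kafka::logs_to_send is the metron plugin's variable -- treat the name as
# an assumption for your plugin version.
redef Kafka::logs_to_send = set(Conn::LOG, HTTP::LOG, DNS::LOG);

# Alternatively, drop the default filter on the reporter stream entirely:
event bro_init()
    {
    Log::remove_default_filter(Reporter::LOG);
    }
```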

Hi, Justin
Currently I have a bro code snippet that reproduces the crash reliably
(no Reporter log goes to Kafka; the default filter is explicitly removed in bro_init).

I recorded the backtrace and thread info, along with the snippet, in this gist:
https://gist.github.com/MythRen/3d77111fb810cac941c48311dc273289

steps to reproduce:
0. prepare an environment with bro 2.5.1 and the Kafka plugin installed

  1. upload test.crash.bro to your server
  2. run $BRO_BIN/bro -i INTERFACE -C test.crash.bro
  3. open another terminal and log in to your server
  4. use ab to generate more bandwidth: ab -n 100000000 -c 10 'https://www.google.com' (better with a local HTTP server; make sure the traffic goes out through the INTERFACE from step 2)
  5. wait for the crash; in my tests it usually happens within seconds

I think the problem is unsynchronized access to head and tail between Event.cc#86-91 and Event.cc#118-119 from multiple threads,
but I don't know what side effects adding a lock there would have.

Wish you and others could help.

Best regards,

Myth

 I'm using bro 2.5.1 for network security monitoring; the message queue component is Kafka (bro-to-kafka plugin v0.5.0, librdkafka v0.9.5).

The plugins from Bro v2.5.1 may be a bit old to continue using. I'd generally suggest trying to update to the newest version of everything.

Below is the backtrace from the core dump (more on gist <https://gist.github.com/MythRen/b55220647ca28654c6f7e1db12ee6036>):

    #4 0x00000000005fee8f in Reporter::Error (this=<optimized out>,
       fmt=fmt@entry=0x7fe36c719d70 "Kafka send failed: %s")
       at /opt/download/bro/src/Reporter.cc:76
    #5 0x00007fe36c717fa9 in logging::writer::KafkaWriter::DoWrite (this=0x6369270,
       num_fields=<optimized out>, fields=<optimized out>, vals=0x69d2080)
       at /opt/download/bro/aux/plugins/kafka/src/KafkaWriter.cc:156
    #6

This is basically the problem: this version of KafkaWriter calls the Reporter directly from a writer thread, and that's not thread-safe. Here is the fix, for reference (in case you simply can't update things):

https://github.com/apache/metron-bro-plugin-kafka/commit/4968b6537f663c1de061d0cf0aedb42f43ab12ee

- Jon

Currently I'm using Apache/metron-kafka-plugin v0.1 and the problem is gone.
Thanks, Jon.