Help reading the backtrace

So I am running a new detection package and everything seemed fine, but somehow, since yesterday, each worker is running at only 5.7% to 6.3% CPU and not generating logs.

The backtrace below shows which functions the CPU time is being spent in and how much (%) each accounts for.

Can someone help me understand why Bro might spend 17.5% of its time in

bro-2.5/src/Dict.cc:void* Dictionary::NextEntry(HashKey*& h, IterCookie*& cookie, int return_hash) const

Here are the functions and the time spent in each of them:

bro`_ZN8iosource4pcap10PcapSource17ExtractNextPacketEP6Packet 1 0.1%
bro`_ZNK13PriorityQueue3TopEv 1 0.1%
bro`_ZNK7BroFunc4CallEP8ValPListP5Frame 1 0.1%
bro`_Z15net_update_timed 1 0.1%
bro`_ZN16RemoteSerializer6GetFdsEPN8iosource6FD_SetES2_S2_ 1 0.1%
bro`_ZN8EventMgr5DrainEv 1 0.1%
bro`_ZNK15EventHandlerPtrcvbEv 1 0.1%
bro`_ZN8iosource6FD_Set6InsertEi 1 0.1%
bro`_ZNK11ChunkedIOFd12ExtraReadFDsEv 1 0.1%
bro`_ZN13PriorityQueue10BubbleDownEi 1 0.1%
bro`0x699d60 2 0.1%
bro`_ZNK8iosource8IOSource6IsOpenEv 2 0.1%
bro`_ZN8iosource6FD_Set6InsertERKS0_ 2 0.1%
bro`_ZNK8iosource6FD_Set5ReadyEP6fd_set 3 0.2%
bro`_ZNK14DictEntryPListixEi 3 0.2%
bro`_ZN8iosource6PktSrc25ExtractNextPacketInternalEv 4 0.3%
bro`_ZNSt3__16__treeIiNS_4lessIiEENS_9allocatorIiEEE15__insert_uniqueERKi 4 0.3%
bro`_ZNK8iosource6FD_Set3SetEP6fd_set 5 0.3%
bro`0x69a610 5 0.3%
bro`_ZNSt3__16__treeIiNS_4lessIiEENS_9allocatorIiEEE7destroyEPNS_11__tree_nodeIiPvEE 5 0.3%
bro`0x699c00 6 0.4%
bro`_ZNSt3__16__treeIiNS_4lessIiEENS_9allocatorIiEEE16__construct_nodeIJRKiEEENS_10unique_ptrINS_11__tree_nodeIiPvEENS_22__tree_node_destructorINS3_ISC_EEEEEEDpOT_ 6 0.4%
bro`0x69ad50 7 0.5%
bro`_ZN7HashKeyD2Ev 7 0.5%
bro`_ZN8iosource7Manager11FindSoonestEPd 7 0.5%
bro`_ZN7HashKeyC2EPKvim 11 0.7%
bro`_ZNK18TableEntryValPDict9NextEntryERP7HashKeyRP10IterCookie 12 0.8%
bro`_ZN8TableVal8DoExpireEd 16 1.1%
bro`_ZNK7HashKey7CopyKeyEPKvi 16 1.1%
bro`_ZNK13TableEntryVal16ExpireAccessTimeEv 164 11.1%
bro`_ZNK8BaseList6lengthEv 170 11.5%
bro`_ZNK8BaseListixEi 173 11.7%
bro`_ZNK10Dictionary9NextEntryERP7HashKeyRP10IterCookiei 259 17.5%

Aashish

Was it stable before you added the new scripts? Are the new scripts publicly available?

Hi Aashish,

> So I am running a new detection package and everything seemed fine, but somehow, since yesterday, each worker is running at only 5.7% to 6.3% CPU and not generating logs.

My guess would be that the script makes (heavy) use of tables and table
expiration, right? Can you share the script?

Jan

Yes, I have been making heavy use of tables (think a million entries a day and a million expires a day).
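
Roughly, the pattern is a big table with a creation-based expiry and an
expire callback. A simplified, hypothetical sketch (not the actual
script; the names and timeouts below are placeholders):

# Hypothetical sketch only; names and timeouts are made up.
function expire_host(t: table[addr] of count, idx: addr): interval
        {
        # Returning 0 secs lets the entry expire now; a positive interval
        # would keep it around longer.
        return 0 secs;
        }

global seen_hosts: table[addr] of count
        &create_expire = 1 day
        &expire_func = expire_host;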

Let me figure out a way to upload the scripts to GitHub, or otherwise send them your and Justin's way.

Strangely, this code had been running fine and reasonably stable for the last month. I am not sure what little thing I added or changed that has caused Bro to keep running but with all workers in the uwait state at 6% CPU. (I'll be doing svn diffs to figure it out.)

Seems like Bro is stuck in:

0x00000004039ccadc in _umtx_op_err () from /lib/libthr.so.3
(gdb) bt
#0 0x00000004039ccadc in _umtx_op_err () from /lib/libthr.so.3
#1 0x00000004039c750b in _thr_umtx_timedwait_uint () from /lib/libthr.so.3
#2 0x00000004039cea06 in ?? () from /lib/libthr.so.3
#3 0x00000000009042ee in threading::Queue<threading::BasicInputMessage*>::Get (this=0x404543038) at /home/bro/install/bro-2.5/src/threading/Queue.h:173
#4 0x0000000000902a31 in threading::MsgThread::RetrieveIn (this=0x404543000) at /home/bro/install/bro-2.5/src/threading/MsgThread.cc:349
#5 0x0000000000902ce4 in threading::MsgThread::Run (this=0x404543000) at /home/bro/install/bro-2.5/src/threading/MsgThread.cc:366
#6 0x00000000008fb952 in threading::BasicThread::launcher (arg=0x404543000) at /home/bro/install/bro-2.5/src/threading/BasicThread.cc:201
#7 0x00000004039c6260 in ?? () from /lib/libthr.so.3
#8 0x0000000000000000 in ?? ()
Backtrace stopped: Cannot access memory at address 0x7fffffbfe000
(gdb)

Aashish

Yeah, lots of expires may have something to do with it; your trace shows

TableEntryVal16ExpireAccessTimeEv

But I also wonder what you are doing that is triggering Dictionary9NextEntryERP7HashKeyRP10IterCookiei

which would be

void* Dictionary::NextEntry(HashKey*& h, IterCookie*& cookie, int return_hash) const

Tables should be fine, but I wonder what you're doing that is triggering so much iteration.
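
For reference, c++filt will demangle those names; the top four entries
in your listing (Dictionary::NextEntry, BaseList::operator[],
BaseList::length and TableEntryVal::ExpireAccessTime) together account
for roughly half of the samples, and they all look like parts of walking
a table/dictionary. At the script level, the two usual sources of that
kind of iteration are explicit for-loops over a table and the expiration
machinery itself walking the table. A minimal sketch of the former
(hypothetical table and function, just to show the pattern):

# Hypothetical example: each call walks the whole underlying Dictionary,
# one Dictionary::NextEntry() per element -- expensive on a ~1M-entry table.
function any_stale(t: table[addr] of time, cutoff: time): bool
        {
        for ( idx in t )
                if ( t[idx] < cutoff )
                        return T;
        return F;
        }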

> But I also wonder what you are doing that is triggering Dictionary9NextEntryERP7HashKeyRP10IterCookiei
>
> which would be
>
> void* Dictionary::NextEntry(HashKey*& h, IterCookie*& cookie, int return_hash) const
>
> Tables should be fine, but I wonder what you're doing that is triggering so much iteration.

If I am not mistaken, tables are basically dictionaries (with the
exception of subnet-indexed tables). The iteration should be related to
how expiration works: as soon as the expiration timer fires, the table
is walked looking for expired entries. To avoid blocking when too many
expired entries have to be handled, only table_incremental_step entries
are processed at a time, followed by a delay of table_expire_delay
before the next chunk. Adjusting these parameters might help (see
https://www.bro.org/sphinx/scripts/base/init-bare.bro.html?highlight=expire#id-table_expire_delay).
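
For reference, these (together with table_expire_interval, which
controls how often an expiration pass starts) are plain redef-able
options. A minimal sketch -- the values below are only illustrative,
check init-bare.bro for the shipped defaults:

# Illustrative values only; see init-bare.bro for the actual defaults.
redef table_expire_interval = 10 secs;   # how often an expiration pass starts
redef table_incremental_step = 5000;     # entries examined per chunk of a pass
redef table_expire_delay = 0.01 secs;    # pause between chunks within a pass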

Jan

> https://www.bro.org/sphinx/scripts/base/init-bare.bro.html?highlight=expire#id-table_expire_delay).

This is very helpful.

I have been testing out table_expire_delay but somehow wasn't looking at table_incremental_step. I'll test out a million expires while tweaking table_incremental_step to see what the limits are here.

Still, to clarify: is there a possibility that, because table_incremental_step is currently 5000, continuously expiring >> 5000 entries every moment might cause the Queue to deadlock, resulting in Bro stopping packet processing?

Aashish

It shouldn't deadlock. What I can see happening, depending on load and
these parameters, is Bro spending most of its time going through the
table to expire entries and only getting to a few packets in between (so
not a complete stop of processing, but not getting much done either).
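
To put rough numbers on it (assuming the stock settings and a table on
the order of a million entries, and if I'm reading the mechanism right):
with table_incremental_step = 5000 and table_expire_delay = 0.01 sec,
one expiration pass is about 1,000,000 / 5,000 = 200 chunks, i.e. at
least 200 * 0.01 sec = 2 sec of scheduled delay plus the actual
per-entry scan for each chunk, and a new pass starts every
table_expire_interval (10 sec by default). That would be consistent with
roughly half of the profile samples above landing in the iteration path
(Dictionary::NextEntry, BaseList::length/operator[],
TableEntryVal::ExpireAccessTime).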

Robin

So this doesn't (at the moment) seem to be related to table expiration. My table is maintained on the manager and the expire_func only runs on the manager.

But I see a single worker stall at 99-100% CPU for a good while, while all other workers go down to 5-6% CPU. conn.log continues to grow, though.

GDB points to Get() in install/bro-2.5/src/threading/Queue.h:

template<typename T>
inline T Queue<T>::Get()
        {
        safe_lock(&mutex[read_ptr]);

        int old_read_ptr = read_ptr;

        if ( messages[read_ptr].empty() && ! ((reader && reader->Killed()) || (writer && writer->Killed())) )
                {
                struct timespec ts;
                ts.tv_sec = time(0) + 5;
                ts.tv_nsec = 0;

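                // Wait (up to 5 seconds) for the writer side to signal
                // that new messages have arrived; for an idle thread,
                // waking up here with nothing to do is the normal case.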
-----> pthread_cond_timedwait(&has_data[read_ptr], &mutex[read_ptr], &ts);

On a side note: why is this on bro-dev? Not entirely sure. :-) I think eventually this might come down to what my script is messing up and what a better way to script the code would be, I suppose.

Aashish

Just to be sure: are you sure this is the troublesome spot? I'm asking
because this is likely running inside a logging thread, and expected
to block frequently if there's nothing to log (remember we have one
logging thread per output file, so for any low-volume log it'll block
regularly). Have you tried switching to other threads in GDB to see
where they are? Also, at that location in gdb above, try to figure
out what the queue is for: you should be able to get the name of the
thread through 'this->reader->name' (haven't tried that, just took a
quick look at the code).
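
Concretely, that would be something like "info threads" to list the
threads, then "thread <N>" followed by "bt" to look at each of them (or
"thread apply all bt" to dump them all in one go), and from the
Queue<>::Get frame a "print this->reader->name" to see which thread's
queue it is (untested; the exact member name may differ).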

Robin