Help reading the backtrace

So I am running a new detection package and everything seemed fine, but somehow, since yesterday, each worker is running at only 5.7% to 6.3% CPU and not generating logs.

The backtrace below shows which functions the CPU time is being spent in and how much (%) each accounts for.

Can someone help me understand why Bro might spend 17.5% of its time in

bro-2.5/src/Dict.cc:void* Dictionary::NextEntry(HashKey*& h, IterCookie*& cookie, int return_hash) const

Here are the functions and the time spent in each of them:

bro`_ZN8iosource4pcap10PcapSource17ExtractNextPacketEP6Packet 1 0.1%
bro`_ZNK13PriorityQueue3TopEv 1 0.1%
bro`_ZNK7BroFunc4CallEP8ValPListP5Frame 1 0.1%
bro`_Z15net_update_timed 1 0.1%
bro`_ZN16RemoteSerializer6GetFdsEPN8iosource6FD_SetES2_S2_ 1 0.1%
bro`_ZN8EventMgr5DrainEv 1 0.1%
bro`_ZNK15EventHandlerPtrcvbEv 1 0.1%
bro`_ZN8iosource6FD_Set6InsertEi 1 0.1%
bro`_ZNK11ChunkedIOFd12ExtraReadFDsEv 1 0.1%
bro`_ZN13PriorityQueue10BubbleDownEi 1 0.1%
bro`0x699d60 2 0.1%
bro`_ZNK8iosource8IOSource6IsOpenEv 2 0.1%
bro`_ZN8iosource6FD_Set6InsertERKS0_ 2 0.1%
bro`_ZNK8iosource6FD_Set5ReadyEP6fd_set 3 0.2%
bro`_ZNK14DictEntryPListixEi 3 0.2%
bro`_ZN8iosource6PktSrc25ExtractNextPacketInternalEv 4 0.3%
bro`_ZNSt3__16__treeIiNS_4lessIiEENS_9allocatorIiEEE15__insert_uniqueERKi 4 0.3%
bro`_ZNK8iosource6FD_Set3SetEP6fd_set 5 0.3%
bro`0x69a610 5 0.3%
bro`_ZNSt3__16__treeIiNS_4lessIiEENS_9allocatorIiEEE7destroyEPNS_11__tree_nodeIiPvEE 5 0.3%
bro`0x699c00 6 0.4%
bro`_ZNSt3__16__treeIiNS_4lessIiEENS_9allocatorIiEEE16__construct_nodeIJRKiEEENS_10unique_ptrINS_11__tree_nodeIiPvEENS_22__tree_node_destructorINS3_ISC_EEEEEEDpOT_ 6 0.4%
bro`0x69ad50 7 0.5%
bro`_ZN7HashKeyD2Ev 7 0.5%
bro`_ZN8iosource7Manager11FindSoonestEPd 7 0.5%
bro`_ZN7HashKeyC2EPKvim 11 0.7%
bro`_ZNK18TableEntryValPDict9NextEntryERP7HashKeyRP10IterCookie 12 0.8%
bro`_ZN8TableVal8DoExpireEd 16 1.1%
bro`_ZNK7HashKey7CopyKeyEPKvi 16 1.1%
bro`_ZNK13TableEntryVal16ExpireAccessTimeEv 164 11.1%
bro`_ZNK8BaseList6lengthEv 170 11.5%
bro`_ZNK8BaseListixEi 173 11.7%
bro`_ZNK10Dictionary9NextEntryERP7HashKeyRP10IterCookiei 259 17.5%

Aashish

Was it stable before you added the new scripts? Are the new scripts publicly available?

Hi Aashish,

> So I am running a new detection package and everything seemed fine, but somehow, since yesterday, each worker is running at only 5.7% to 6.3% CPU and not generating logs.

My guess would be that the script makes (heavy) use of tables and table
expiration, right? Can you share the script?

Jan

Yes, I have been making heavy use of tables (think a million entries a day and a million expires a day).
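
Roughly, the pattern is a big table with a creation-based expiry and an
expire callback. A simplified, hypothetical sketch (not the actual
script; the names and timeouts below are placeholders):

# Hypothetical sketch only; names and timeouts are made up.
function expire_host(t: table[addr] of count, idx: addr): interval
        {
        # Returning 0 secs lets the entry expire now; a positive interval
        # would keep it around longer.
        return 0 secs;
        }

global seen_hosts: table[addr] of count
        &create_expire = 1 day
        &expire_func = expire_host;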

Let me figure out a way to upload the scripts to GitHub, or otherwise send them your and Justin's way.

Strangely, this code had been running fine and reasonably stable for the last month. I am not sure what little thing I added or changed that has caused Bro to keep running but with all workers in the uwait state at 6% CPU. (I'll be doing svn diffs to figure it out.)

Seems like Bro is stuck in:

0x00000004039ccadc in _umtx_op_err () from /lib/libthr.so.3
(gdb) bt
#0 0x00000004039ccadc in _umtx_op_err () from /lib/libthr.so.3
#1 0x00000004039c750b in _thr_umtx_timedwait_uint () from /lib/libthr.so.3
#2 0x00000004039cea06 in ?? () from /lib/libthr.so.3
#3 0x00000000009042ee in threading::Queue<threading::BasicInputMessage*>::Get (this=0x404543038) at /home/bro/install/bro-2.5/src/threading/Queue.h:173
#4 0x0000000000902a31 in threading::MsgThread::RetrieveIn (this=0x404543000) at /home/bro/install/bro-2.5/src/threading/MsgThread.cc:349
#5 0x0000000000902ce4 in threading::MsgThread::Run (this=0x404543000) at /home/bro/install/bro-2.5/src/threading/MsgThread.cc:366
#6 0x00000000008fb952 in threading::BasicThread::launcher (arg=0x404543000) at /home/bro/install/bro-2.5/src/threading/BasicThread.cc:201
#7 0x00000004039c6260 in ?? () from /lib/libthr.so.3
#8 0x0000000000000000 in ?? ()
Backtrace stopped: Cannot access memory at address 0x7fffffbfe000
(gdb)

Aashish

Yeah, lots of expires may have something to do with it; your trace shows

TableEntryVal16ExpireAccessTimeEv

But I also wonder what you are doing that is triggering Dictionary9NextEntryERP7HashKeyRP10IterCookiei

which would be

void* Dictionary::NextEntry(HashKey*& h, IterCookie*& cookie, int return_hash) const

Tables should be fine, but I wonder what you're doing that is triggering so much iteration.
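
For reference, c++filt will demangle those names; the top four entries
in your listing (Dictionary::NextEntry, BaseList::operator[],
BaseList::length and TableEntryVal::ExpireAccessTime) together account
for roughly half of the samples, and they all look like parts of walking
a table/dictionary. At the script level, the two usual sources of that
kind of iteration are explicit for-loops over a table and the expiration
machinery itself walking the table. A minimal sketch of the former
(hypothetical table and function, just to show the pattern):

# Hypothetical example: each call walks the whole underlying Dictionary,
# one Dictionary::NextEntry() per element -- expensive on a ~1M-entry table.
function any_stale(t: table[addr] of time, cutoff: time): bool
        {
        for ( idx in t )
                if ( t[idx] < cutoff )
                        return T;
        return F;
        }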

> But I also wonder what you are doing that is triggering Dictionary9NextEntryERP7HashKeyRP10IterCookiei
>
> which would be
>
> void* Dictionary::NextEntry(HashKey*& h, IterCookie*& cookie, int return_hash) const
>
> Tables should be fine, but I wonder what you're doing that is triggering so much iteration.

If I am not mistaken, tables are basically dictionaries (with the
exception of subnet-indexed tables). The iteration should be related to
how expiration works: as soon as the expiration timer fires, the table
is walked looking for expired entries. To avoid blocking when too many
expired entries have to be handled, only table_incremental_step entries
are processed at a time, followed by a delay of table_expire_delay
before the next chunk. Adjusting these parameters might help (see
https://www.bro.org/sphinx/scripts/base/init-bare.bro.html?highlight=expire#id-table_expire_delay).
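
For reference, these (together with table_expire_interval, which
controls how often an expiration pass starts) are plain redef-able
options. A minimal sketch -- the values below are only illustrative,
check init-bare.bro for the shipped defaults:

# Illustrative values only; see init-bare.bro for the actual defaults.
redef table_expire_interval = 10 secs;   # how often an expiration pass starts
redef table_incremental_step = 5000;     # entries examined per chunk of a pass
redef table_expire_delay = 0.01 secs;    # pause between chunks within a pass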

Jan

> https://www.bro.org/sphinx/scripts/base/init-bare.bro.html?highlight=expire#id-table_expire_delay).

This is very helpful.

I have been testing out table_expire_delay but somehow wasn't looking at table_incremental_step. I'll test out a million expires while tweaking table_incremental_step to see what the limits are here.

Still, to clarify: is there a possibility that, because table_incremental_step is currently 5000, continuously expiring >> 5000 entries every moment might cause the Queue to deadlock, resulting in Bro stopping packet processing?

Aashish

It shouldn't deadlock. What I can see happening, depending on load and
these parameters, is Bro spending most of its time going through the
table to expire entries and only getting to a few packets in between (so
not a complete stop of processing, but not getting much done either).
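
To put rough numbers on it (assuming the stock settings and a table on
the order of a million entries, and if I'm reading the mechanism right):
with table_incremental_step = 5000 and table_expire_delay = 0.01 sec,
one expiration pass is about 1,000,000 / 5,000 = 200 chunks, i.e. at
least 200 * 0.01 sec = 2 sec of scheduled delay plus the actual
per-entry scan for each chunk, and a new pass starts every
table_expire_interval (10 sec by default). That would be consistent with
roughly half of the profile samples above landing in the iteration path
(Dictionary::NextEntry, BaseList::length/operator[],
TableEntryVal::ExpireAccessTime).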

Robin

So this doesn't (at the moment) seem to be related to table expiration. My table is maintained on the manager and the expire_func only runs on the manager.

But I see a single worker stall at 99-100% CPU for a good while, while all other workers go down to 5-6% CPU. conn.log continues to grow, though.

GDB points to Get() in install/bro-2.5/src/threading/Queue.h:

template<typename T>
inline T Queue<T>::Get()
        {
        safe_lock(&mutex[read_ptr]);

        int old_read_ptr = read_ptr;

        if ( messages[read_ptr].empty() && ! ((reader && reader->Killed()) || (writer && writer->Killed())) )
                {
                struct timespec ts;
                ts.tv_sec = time(0) + 5;
                ts.tv_nsec = 0;

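                // Wait (up to 5 seconds) for the writer side to signal
                // that new messages have arrived; for an idle thread,
                // waking up here with nothing to do is the normal case.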
-----> pthread_cond_timedwait(&has_data[read_ptr], &mutex[read_ptr], &ts);

On a side note: why is this on bro-dev? Not entirely sure. :-) I think eventually this might come down to what my script is messing up and what a better way to script the code would be, I suppose.

Aashish

Just to be sure: are you sure this is the troublesome spot? I'm asking
because this is likely running inside a logging thread, and expected
to block frequently if there's nothing to log (remember we have one
logging thread per output file, so for any low-volume log it'll block
regularly). Have you tried switching to other threads in GDB to see
where they are? Also, at that location in gdb above, try to figure
out what the queue is for: you should be able to get the name of the
thread through 'this->reader->name' (haven't tried that, just took a
quick look at the code).
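
Concretely, that would be something like "info threads" to list the
threads, then "thread <N>" followed by "bt" to look at each of them (or
"thread apply all bt" to dump them all in one go), and from the
Queue<>::Get frame a "print this->reader->name" to see which thread's
queue it is (untested; the exact member name may differ).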

Robin