Worker being "killed nohup"

I’m running Zeek 2.6.4 and I have been seeing occasional error messages of the form:

run-bro: line 110: 42574 Killed nohup ${pin_command} $pin_cpu “$mybro” “$@”

The workers seem to be restarted fine and other than the error message, I haven’t noticed any ill behavior. What should I do about the error messages?

Mark

I’m running Zeek 2.6.4 and I have been seeing occasional error messages of the form:

run-bro: line 110: 42574 Killed nohup ${pin_command} $pin_cpu “$mybro” “$@”

The workers seem to be restarted fine and other than the error message, I haven’t noticed any ill behavior. What should I do about the error messages?

I would check your syslog. Assuming you are running linux, if your system runs out of memory, the kernel will go find the biggest process and kill it. This can often be a zeek process as they tend to grow large as more connections are tracked (depending on many factors: what scripts you are running, what you’re logging, what kind of traffic is being seen, etc.). If that’s happening, you should see something in syslog containing the string “invoked oom-killer:” If you look at the surrounding lines, there should be some info on process sizes showing why it was selected.

Good catch Jeff. A crash email arrived at 10:10 and the journal entry starting around 10:07 contained the following:

Nov 06 10:07:08 sensor1 kernel: oom_reaper: reaped process 43437 (bro), now anon-rss:0kB, file-rss:1048576kB, shmem-rss:0kB
Nov 06 10:07:05 sensor1 kernel: Killed process 43437 (bro) total-vm:67116888kB, anon-rss:57339476kB, file-rss:1042296kB, shmem-rss:0kB
Nov 06 10:07:05 sensor1 kernel: Out of memory: Kill process 43437 (bro) score 444 or sacrifice child
Nov 06 10:07:05 sensor1 kernel: [ 54824] 47 54824 1516502 1221530 11284480 80 0 bro
Nov 06 10:07:05 sensor1 kernel: [ 54817] 47 54817 1661 56 49152 16 0 run-bro
Nov 06 10:07:05 sensor1 kernel: [ 51543] 47 51543 1516502 1281056 11886592 3616 0 bro
Nov 06 10:07:05 sensor1 kernel: [ 51542] 47 51542 1516502 1285841 11964416 2296 0 bro
Nov 06 10:07:05 sensor1 kernel: [ 51539] 47 51539 1516502 1277957 11890688 2817 0 bro
Nov 06 10:07:05 sensor1 kernel: [ 51537] 47 51537 1516502 1282107 11829248 2543 0 bro
Nov 06 10:07:05 sensor1 kernel: [ 51533] 47 51533 1516502 1276417 11939840 7591 0 bro
Nov 06 10:07:05 sensor1 kernel: [ 51532] 47 51532 1516502 1290264 12042240 8006 0 bro
Nov 06 10:07:05 sensor1 kernel: [ 51523] 47 51523 1516502 1281160 11894784 7479 0 bro
Nov 06 10:07:05 sensor1 kernel: [ 51521] 47 51521 1516502 1285423 11984896 7957 0 bro
Nov 06 10:07:05 sensor1 kernel: [ 51520] 47 51520 1745878 1370574 12668928 6908 0 bro
Nov 06 10:07:05 sensor1 kernel: [ 51501] 47 51501 1516502 1286290 11788288 6967 0 bro
Nov 06 10:07:05 sensor1 kernel: [ 51499] 47 51499 1661 22 49152 49 0 run-bro
Nov 06 10:07:05 sensor1 kernel: [ 51498] 47 51498 1516502 1276891 11882496 7233 0 bro
Nov 06 10:07:05 sensor1 kernel: [ 51496] 47 51496 1516502 1287064 11948032 7685 0 bro
Nov 06 10:07:05 sensor1 kernel: [ 51491] 47 51491 1661 18 53248 55 0 run-bro
Nov 06 10:07:05 sensor1 kernel: [ 51487] 47 51487 1661 18 57344 54 0 run-bro
Nov 06 10:07:05 sensor1 kernel: [ 51474] 47 51474 1661 21 49152 52 0 run-bro
Nov 06 10:07:05 sensor1 kernel: [ 51471] 47 51471 1516502 1277820 11890688 7964 0 bro
Nov 06 10:07:05 sensor1 kernel: [ 51465] 47 51465 1661 20 53248 53 0 run-bro
Nov 06 10:07:05 sensor1 kernel: [ 51455] 47 51455 1661 23 57344 49 0 run-bro
Nov 06 10:07:05 sensor1 kernel: [ 51440] 47 51440 1661 18 49152 54 0 run-bro
Nov 06 10:07:05 sensor1 kernel: [ 51438] 47 51438 1661 23 49152 49 0 run-bro
Nov 06 10:07:05 sensor1 kernel: [ 51428] 47 51428 1661 18 49152 54 0 run-bro
Nov 06 10:07:05 sensor1 kernel: [ 51414] 47 51414 1661 16 53248 56 0 run-bro
Nov 06 10:07:05 sensor1 kernel: [ 51405] 47 51405 1661 20 53248 52 0 run-bro
Nov 06 10:07:05 sensor1 kernel: [ 51397] 47 51397 1661 14 45056 59 0 run-bro
Nov 06 10:07:05 sensor1 kernel: [ 51375] 47 51375 1661 15 49152 58 0 run-bro
Nov 06 10:07:05 sensor1 kernel: [ 43437] 47 43437 16779222 14595446 122212352 129972 0 bro
Nov 06 10:07:05 sensor1 kernel: [ 43343] 47 43343 1661 2 49152 70 0 run-bro
Nov 06 10:07:05 sensor1 kernel: [ 43183] 47 43183 42625 4 106496 608 0 (sd-pam)
Nov 06 10:07:05 sensor1 kernel: [ 43181] 47 43181 5290 0 77824 342 0 systemd
Nov 06 10:07:05 sensor1 kernel: [ 996] 0 996 1346 0 49152 33 0 agetty
Nov 06 10:07:05 sensor1 kernel: [ 995] 0 995 1403 0 45056 32 0 agetty
Nov 06 10:07:05 sensor1 kernel: [ 994] 0 994 3963 38 69632 178 -1000 sshd
Nov 06 10:07:05 sensor1 kernel: [ 991] 0 991 2119 20 53248 35 0 cron
Nov 06 10:07:05 sensor1 kernel: [ 990] 105 990 2210 97 57344 55 -900 dbus-daemon
Nov 06 10:07:05 sensor1 kernel: [ 977] 0 977 56456 90 86016 378 0 rsyslogd
Nov 06 10:07:05 sensor1 kernel: [ 968] 0 968 4875 78 77824 189 0 systemd-logind
Nov 06 10:07:05 sensor1 kernel: [ 944] 0 944 2021 1 57344 799 0 haveged
Nov 06 10:07:05 sensor1 kernel: [ 931] 100 931 23270 26 86016 207 0 systemd-timesyn
Nov 06 10:07:05 sensor1 kernel: [ 640] 0 640 5579 14 69632 309 -1000 systemd-udevd
Nov 06 10:07:05 sensor1 kernel: [ 615] 0 615 74225 2624 638976 568 0 systemd-journal
Nov 06 10:07:05 sensor1 kernel: [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
Nov 06 10:07:05 sensor1 kernel: Tasks state (memory values in pages):
Nov 06 10:07:05 sensor1 kernel: 0 pages hwpoisoned
Nov 06 10:07:05 sensor1 kernel: 551979 pages reserved
Nov 06 10:07:05 sensor1 kernel: 0 pages HighMem/MovableOnly
Nov 06 10:07:05 sensor1 kernel: 33528566 pages RAM
Nov 06 10:07:05 sensor1 kernel: Total swap = 976892kB
Nov 06 10:07:05 sensor1 kernel: Free swap = 0kB
Nov 06 10:07:05 sensor1 kernel: Swap cache stats: add 759161, delete 711025, find 25951/48832
Nov 06 10:07:05 sensor1 kernel: 48128 pages in swap cache
Nov 06 10:07:05 sensor1 kernel: 52774 total pagecache pages

Nov 06 10:07:05 sensor1 kernel: Node 3 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Nov 06 10:07:05 sensor1 kernel: Node 3 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Nov 06 10:07:05 sensor1 kernel: Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Nov 06 10:07:05 sensor1 kernel: Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Nov 06 10:07:05 sensor1 kernel: Node 3 Normal: 734kB (U) 1508kB (UH) 5916kB (UH) 132532kB (UH) 564kB (H) 2128kB (H) 0256kB 0512kB 01024kB 02048kB 04096kB = 45412kB
Nov 06 10:07:05 sensor1 kernel: Node 1 Normal: 78
4kB (UMEH) 2138kB (UEH) 7016kB (UH) 123232kB (UMH) 764kB (H) 0128kB 0256kB 0512kB 01024kB 02048kB 04096kB = 43008kB
Nov 06 10:07:05 sensor1 kernel: Node 1 DMA32: 6974kB (UME) 6478kB (UME) 39216kB (UME) 19032kB (UME) 9464kB (UE) 51128kB (UE) 63256kB (UME) 240512kB (UME) 721024kB (UME) 02048kB 04096kB = 245596kB
Nov 06 10:07:05 sensor1 kernel: Node 1 DMA: 1
4kB (U) 18kB (U) 216kB (U) 132kB (U) 364kB (U) 0128kB 1256kB (U) 0512kB 11024kB (U) 12048kB (M) 34096kB (M) = 15884kB
Nov 06 10:07:05 sensor1 kernel: lowmem_reserve[]: 0 0 0 0 0
Nov 06 10:07:05 sensor1 kernel: Node 3 Normal free:44844kB min:45116kB low:111144kB high:177172kB active_anon:48629496kB inactive_anon:3250728kB active_file:64kB inactive_file:668kB unevictable:0kB writepending:0kB present:67094528kB managed:66031300kB mlocked:0kB kernel_stack:5352kB pagetables:146976kB bounce
Nov 06 10:07:05 sensor1 kernel: lowmem_reserve[]: 0 0 0 0 0
Nov 06 10:07:05 sensor1 kernel: Node 1 Normal free:41932kB min:42604kB low:104956kB high:167308kB active_anon:55633904kB inactive_anon:4082748kB active_file:560kB inactive_file:0kB unevictable:0kB writepending:0kB present:63438848kB managed:62359848kB mlocked:0kB kernel_stack:4776kB pagetables:133008kB bounce:
Nov 06 10:07:05 sensor1 kernel: lowmem_reserve[]: 0 0 60894 60894 60894
Nov 06 10:07:05 sensor1 kernel: Node 1 DMA32 free:245356kB min:2372kB low:5844kB high:9316kB active_anon:3199852kB inactive_anon:44556kB active_file:100kB inactive_file:0kB unevictable:0kB writepending:0kB present:3564892kB managed:3499316kB mlocked:0kB kernel_stack:48kB pagetables:3092kB bounce:0kB free_pcp:4
Nov 06 10:07:05 sensor1 kernel: lowmem_reserve[]: 0 3392 64286 64286 64286
Nov 06 10:07:05 sensor1 kernel: Node 1 DMA free:15884kB min:8kB low:20kB high:32kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15996kB managed:15884kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
Nov 06 10:07:05 sensor1 kernel: Node 3 active_anon:48629496kB inactive_anon:3250728kB active_file:108kB inactive_file:700kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:13636088kB dirty:0kB writeback:0kB shmem:9996kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 1337344kB writeback_tmp:0kB unst
Nov 06 10:07:05 sensor1 kernel: Node 1 active_anon:58833756kB inactive_anon:4127304kB active_file:548kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:2101152kB dirty:0kB writeback:0kB shmem:7828kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 13608960kB writeback_tmp:0kB unstab
Nov 06 10:07:05 sensor1 kernel: active_anon:26865813 inactive_anon:1844508 isolated_anon:0
active_file:163 inactive_file:0 isolated_file:0
unevictable:0 dirty:0 writeback:0 unstable:0
slab_reclaimable:25057 slab_unreclaimable:65745
mapped:3934289 shmem:4456 pagetables:70769 bounce:0
free:87073 free_pcp:3080 free_cma:0
Nov 06 10:07:05 sensor1 kernel: Mem-Info:
Nov 06 10:07:05 sensor1 kernel: R13: 0000000000001800 R14: 00007f375bd4f780 R15: 0000000000000022
Nov 06 10:07:05 sensor1 kernel: R10: 00007f3618971000 R11: 00007f36189727e0 R12: 00007f375bd4f940
Nov 06 10:07:05 sensor1 kernel: RBP: 00007f361424b000 R08: 0000000000000000 R09: 00007f36189727e0
Nov 06 10:07:05 sensor1 kernel: RDX: 0000000000001780 RSI: 00007f361424b000 RDI: 00007f3618971000
Nov 06 10:07:05 sensor1 kernel: RAX: 00007f3618971000 RBX: 0000000000001800 RCX: 00007f361424c760
Nov 06 10:07:05 sensor1 kernel: RSP: 002b:00007ffd609fbc58 EFLAGS: 00010202
Nov 06 10:07:05 sensor1 kernel: Code: Bad RIP value.
Nov 06 10:07:05 sensor1 kernel: RIP: 0033:0x7f375bf1f925
Nov 06 10:07:05 sensor1 kernel: page_fault+0x1e/0x30
Nov 06 10:07:05 sensor1 kernel: ? page_fault+0x8/0x30
Nov 06 10:07:05 sensor1 kernel: __do_page_fault+0x249/0x4f0
Nov 06 10:07:05 sensor1 kernel: handle_mm_fault+0xd6/0x200
Nov 06 10:07:05 sensor1 kernel: ? __switch_to_asm+0x41/0x70
Nov 06 10:07:05 sensor1 kernel: __handle_mm_fault+0x958/0x1270
Nov 06 10:07:05 sensor1 kernel: alloc_pages_vma+0x74/0x1c0
Nov 06 10:07:05 sensor1 kernel: __alloc_pages_nodemask+0x28b/0x2b0
Nov 06 10:07:05 sensor1 kernel: __alloc_pages_slowpath+0xbd8/0xcb0
Nov 06 10:07:05 sensor1 kernel: out_of_memory+0x1a5/0x430
Nov 06 10:07:05 sensor1 kernel: oom_kill_process.cold.30+0xb/0x1cf
Nov 06 10:07:05 sensor1 kernel: dump_header+0x6b/0x283
Nov 06 10:07:05 sensor1 kernel: dump_stack+0x5c/0x80
Nov 06 10:07:05 sensor1 kernel: Call Trace:
Nov 06 10:07:05 sensor1 kernel: Hardware name: Supermicro Super Server/H11SSL-i, BIOS 1.0b 04/27/2018
Nov 06 10:07:05 sensor1 kernel: CPU: 3 PID: 51496 Comm: bro Not tainted 4.19.0-6-amd64 #1 Debian 4.19.67-2+deb10u1
Nov 06 10:07:05 sensor1 kernel: bro cpuset=/ mems_allowed=1,3
Nov 06 10:07:05 sensor1 kernel: bro invoked oom-killer: gfp_mask=0x6280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=(null), order=0, oom_score_adj=0

It looks like bro was using 64 GB of RAM (summation of even though the system has 128 GB installed. (It also looks like there was no free swap.) Searching the Internet for ways to solve OOM, people suggested turning off over-commit, and configuring the overcommit ratio a bit higher. Currently, the values are:

sysctl vm.overcommit_memory

vm.overcommit_memory = 0

sysctl vm.overcommit_ratio

vm.overcommit_ratio = 50

In one case the suggestion was to change those to vm.overcommit_memory = 1 and vm.overcommit_ratio = 80 which seems reasonable based on the description at https://www.kernel.org/doc/Documentation/vm/overcommit-accounting.

Or is there a better way to handle OOM?

Mark

Let’s dig into the root cause.
Your OOM most likely happens because either

  • you have tons of state and hit something that Justin neatly described as “unbound state growth”
  • or your logger threads cannot write messages quickly enough

Given the OOM output it looks to me like workers are to blame. Do you have any custom scripts? Do you use the stock scan.bro or any other SumStats scripts?

Can you observe the system for a while, in something like htop, or top with threads, to see what kind of processes eat most of memory and report back?

Can you expand on “tons of state”? It sounds like something I’m familiar with but want to make sure. I work with Mark and we’ve been talking a lot about this.

Sure!

Zeek internally keeps a state for each connection - and quite possibly information in other data structures that your scripts created. If you have lots of events and each of them adds somehow data into those data structures, it might eventually eat all your RAM.

To give you an example. An intern of mine, a while ago, designed a brilliant script

event http_reply(c: connection, version: string, code: count,
reason: string) &priority=3

(…)

if (code >= 400) {
add c$http$tags[HTTP_ERROR];

SumStats::observe(“http.excessive_errors.attacker_badhits”, [$host=c$id$orig_h],
[$str=fmt("%d %s%s", code, c$http$host, c$http$uri)]);
SumStats::observe(“http.excessive_errors.victim_badhits”, [$host=c$id$resp_h],
[$str=fmt("%d %s%s", code, c$http$host, c$http$uri)]);
}
else if (code < 400) {
SumStats::observe(“http.excessive_errors.attacker_goodhits”, [$host=c$id$orig_h], []);
SumStats::observe(“http.excessive_errors.victim_goodhits”, [$host=c$id$resp_h], []);

This script will store, per inbound connection, a ratio of good vs bad HTTP transactions. That can be used to do a behaviour profiling of a client. A scanner would easily have mostly “bad hits” with return code >400.

This all worked well for 15 minutes and then 11 nodes x 64GB RAM each cluster ran out of memory. See, this cluster monitored (among other things) the Firefox update system, with 500 000 000 clients talking to it.

Why did it crash? Because the SumStats framework was adding data, into internal data structures, per connection.

Notice how there is technically no memory leak. Given 2TB of RAM this maybe would have worked.

Basically, if you have tons of connections (not simply bandwidth or packets) the amount of memory necessary to keep all of them might be simply too much.

Now, this data expires (unless you have a script that never does that), but it might be the amount of state grows too quickly and the expiration is not quick enough, to free up some memory.

My quick suspect would be the scan.bro / scan.zeek old script that comes bundled with Zeek. If you have it enabled, disable and see if you’re still crashing.

You can then take a look at your scripts and see if there is some data structure that will grow per connection, over time - and how quickly you purge data from it.

BTW is your Zeek built with tcmalloc? I heard horror stories about that now I just use stock glibc malloc or jemalloc.

I’ve heard those stories too. I build with jemalloc.

Mark

Your comments have been most helpful, Michał. I am very appreciative.

Now, this data expires (unless you have a script that never does that), but it might be the amount of state grows too quickly and the expiration is not quick enough, to free up some memory.

One hypothesis is that we have lots of very long running connections. Looking through the first 1M connections in the largest 1 hour conn.log for 2019-10-21 (the biggest log in the past month), the min, max, and average duration of connections are 0.0, 16750, and 37.4s (with stdev of 285.5) respectively. 66.9% of the connections last less than a second and the percentage of connections lasting to the mean duration is 86.4%. So the number of long running connections seems small.

So it doesn’t seem like there are tons of long running connections. We just have many connections. (73,403,294 during the single hour mentioned above.)

My quick suspect would be the scan.bro / scan.zeek old script that comes bundled with Zeek. If you have it enabled, disable and see if you’re still crashing.

We commented out @load misc/scan in bro-2.6.4/share/bro/site/local.bro. Is that what you meant?

You can then take a look at your scripts and see if there is some data structure that will grow per connection, over time - and how quickly you purge data from it.

The only extra configuration we added to local.bro was @load tuning/json-logs. But I wouldn’t think that would cause a large increase in memory use (even though it does cause an increase in written file size).

One of the things I did yesterday was add 128 GB of swap file on NVMe to the workers to augment the 1 GB swap partition already in place. That seems to be helping. The one sensor I checked this morning was using 6.5 GB of swap. Yesterday it would have crash exceeding 1 GB swap. So maybe I don’t need to worry about long running connections if I have enough swap.

I need to set up monitoring on the cluster to make it easier to diagnose these kinds of problems.

Mark