Weird log rotation issues in 2.5

My site was an early tester of 2.5_beta, and during most of our testing, we did not experience much in the way of issues related to log rotation. As soon as we started using 2.5_beta we were utilizing logger mode, as we run an extremely busy cluster, and had been experiencing memory utilization exhaustion issues with the manager being single threaded At one point near the end of the 2.5 beta, we started seeing some irregular rotation intervals, which we noted were potentially related to us starting to use the Intel framework.

I have since moved to 2.5 stable, and I’m still seeing irregular rotation issues which appear to be related to how busy the manager is at the time of log rotation. I do notice that the main local-manager process seems to stay loaded at or near 100% CPU utilization when we are seeing a lot of traffic. As an example, I rotated logs perfectly from midnight up to 07:00-08:00. After that, my current logs in spool/logger are running well over into 9 o’clocks data set. I frequently see .rotated files well after a log rotation interval has run (the following was observed 09:41 today):

-rw-r–r-- 1 root root 18 Nov 30 08:00 .rotated.capture_loss
-rw-r–r-- 1 root root 18 Nov 30 08:00 .rotated.communication
-rw-r–r-- 1 root root 18 Nov 30 08:00 .rotated.conn
-rw-r–r-- 1 root root 18 Nov 30 08:00 .rotated.conn-summary
-rw-r–r-- 1 root root 18 Nov 30 08:00 .rotated.dce_rpc
-rw-r–r-- 1 root root 18 Nov 30 03:00 .rotated.dhcp
-rw-r–r-- 1 root root 18 Nov 30 08:00 .rotated.dns
-rw-r–r-- 1 root root 18 Nov 30 08:00 .rotated.dpd
-rw-r–r-- 1 root root 18 Nov 30 08:00 .rotated.files
-rw-r–r-- 1 root root 18 Nov 30 08:00 .rotated.ftp
-rw-r–r-- 1 root root 18 Nov 30 08:00 .rotated.http
-rw-r–r-- 1 root root 18 Nov 30 08:00 .rotated.intel
-rw-r–r-- 1 root root 18 Nov 30 08:00 .rotated.irc
-rw-r–r-- 1 root root 18 Nov 30 08:00 .rotated.kerberos
-rw-r–r-- 1 root root 18 Nov 30 08:00 .rotated.known_certs
-rw-r–r-- 1 root root 18 Nov 30 08:00 .rotated.known_hosts
-rw-r–r-- 1 root root 18 Nov 30 08:00 .rotated.known_services
-rw-r–r-- 1 root root 18 Nov 29 22:00 .rotated.loaded_scripts
-rw-r–r-- 1 root root 18 Nov 30 08:00 .rotated.mysql
-rw-r–r-- 1 root root 18 Nov 30 08:00 .rotated.notice
-rw-r–r-- 1 root root 18 Nov 29 22:00 .rotated.packet_filter
-rw-r–r-- 1 root root 18 Nov 30 08:00 .rotated.pe
-rw-r–r-- 1 root root 18 Nov 30 08:00 .rotated.radius
-rw-r–r-- 1 root root 18 Nov 30 08:00 .rotated.rdp
-rw-r–r-- 1 root root 18 Nov 29 23:00 .rotated.reporter
-rw-r–r-- 1 root root 18 Nov 30 08:00 .rotated.rfb
-rw-r–r-- 1 root root 18 Nov 30 08:00 .rotated.sip
-rw-r–r-- 1 root root 18 Nov 30 08:00 .rotated.smtp
-rw-r–r-- 1 root root 18 Nov 30 08:00 .rotated.snmp
-rw-r–r-- 1 root root 18 Nov 30 01:00 .rotated.socks
-rw-r–r-- 1 root root 18 Nov 30 08:00 .rotated.software
-rw-r–r-- 1 root root 18 Nov 30 08:00 .rotated.ssh
-rw-r–r-- 1 root root 18 Nov 30 08:00 .rotated.ssl
-rw-r–r-- 1 root root 18 Nov 30 08:00 .rotated.stats
-rw-r–r-- 1 root root 18 Nov 30 08:00 .rotated.syslog
-rw-r–r-- 1 root root 18 Nov 30 08:00 .rotated.tunnel
-rw-r–r-- 1 root root 18 Nov 30 08:00 .rotated.weird
-rw-r–r-- 1 root root 18 Nov 30 08:00 .rotated.x509

Any tips or pointers on how to track down the culprit here and get things back to normally scheduled log rotation intervals?

FWIW, I’m running on a brand new R730xd with SSD’s for the OS, an NVMe drive for the Bro installation, and RAID 10 & RAID 5 respectively for the 60 day archive and long term storage archive volumes. I’m linting my intel files to make sure they are properly formatted, and there doesn’t seem to be an issue here…

Respectfully,

-Erin Shelton

Program Manager: Incident Response and Network Security
Office of Information Technology
University of Colorado Boulder

Not sure if your case is going to be the same, but I used to experience a similar issue on pre-2.5 versions of Bro. Upon checking my process list I usually found that gzip was choking on a relatively large multi-gigabyte log file and would not finish compressing fast enough if at all. This seemed to cause a sort of chain reaction that broke log rotation. Usual culprits were http.log, dns.log, or conn.log. The solution I found at the time was to decrease the log rotation interval to keep the file sizes down. I still seem to recall at least some of the logs would rotate before it got stuck, but I’m not sure that is happening based on your example.

~Gary