Proposal: Make Zeek's debug logging thread-safe

Right now, if you try to use Zeek’s debug logging facilities in DebugLogger.h concurrently from multiple threads, the contents of debug.log can get mixed up and look like like “word salad”.

I’ve been working on log writers for Zeek. Those operate in different threads, and using Zeek’s current open-source debug logging implementation, trying to make sense of debug logs from those was a real headache.

So in my own code, I’ve made debug logging thread-safe, so log text from different threads winds up on different lines in the debug.log file. I’ve also added more convenience macros to make logging some kinds of debug information easier.

This proposal is to integrate those debug logging changes into open-source Zeek. I’d welcome any questions, suggestions or feedback.

Bob Murphy | Corelight, Inc. | bob.murphy at corelight.com | www.corelight.com

Is there a reason why you didn’t just use the Debug call of the threading framework (which goes through the message queues then ends up in debug.log)?

Johanna

Hi Johanna,

I wasn’t aware of that call, but it also wouldn’t have done what I needed.

If I understand the code correctly, each MsgThread has a FIFO queue that it pushes messages onto. Later on, the main thread occasionally runs a loop where it handles all the queued messages from the first MsgThread, then all the queued messages from the second MsgThread, etc.

The development I was doing sometimes required me to examine the debug messages from different threads in the chronological order they were generated. But if I understand it correctly, the threading framework’s logging doesn’t maintain that ordering.

Also, that work sometimes generated a LOT of debug messages - thousands or millions of lines of them - when only a tiny fraction of them were interesting. To cut down on the garbage, I used the DebugLogger class’s member functions to selectively enable and disable individual streams when particular conditions occurred. However, those member functions have immediate effect, and because the threading framework’s Debug member function emits log lines after a delay, it seems likely I would have not seen debug output I wanted to see, and seeing debug output I didn’t want to see.

Best regards,
Bob

Yeah, or at least the time associated with a Debug message is its
time-of-processing, not time-of-generation. Can see how the latter is
more useful, but want to discuss the proposed solution with a bit more
detail? Does it involve a locked mutex around only the underlying
fprintf() or something more? I imagine it should be "something more"
if the requirement is to make debug.log a convenient way of
understanding operation ordering among many threads.

- Jon

My current implementation does just use a mutex to control access to the output file, and reports the time of generation.

Outside of this email thread, one person has suggested adding something to each debugging log line to identify its source thread. That could potentially be the thread ID, or the thread name, or both.

Another person who runs multiple Zeek instances concurrently also suggested adding the process ID to each log line.

So I was planning to add those to each debug log line before doing a pull request to merge my changes to Zeek master.

- Bob

P.S. If Zeek were to emit a lot of debugging log lines from enough threads very quickly, it’s possible the mutex would add excessive overhead. Boost has a lockfree interthread queue that could be the nucleus of a solution for that, but that would be a lot more complicated than just using a mutex. So I don’t want to look further into that unless and until we know it’s really needed.

I was also trying to break down a couple distinct requirements and
wondered if that actually covers the 2nd:

(1) Fix the "word salad"
(2) Ability to examine debug output from multiple threads in chronological order

Is it fine to just be able to understand the ordering of "when the
fprintf() happened" or is what's really needed is to understand
ordering of "when operations associated with debug messages happened"
?

Thread 1:
  Foo();
  LockedDebugMsg("I did Foo.");

Thread 2:
  Bar();
  LockedDebugMsg("I did Bar.");

debug.log
  [Timestamp 1] I did Foo.
  [Timestamp 2] I did Bar.

That debug.log doesn't really tell us whether Foo() happened before
Bar(), right?

- Jon

The version I have definitely fixes #1, the word salad. It also fixes #2 in the sense that the output is in the same chronological order the calls to LockedDebugMsg occur.

The code you show should give correct ordering on when Foo() and Bar() finish.

If you also want to know when they start, you could do:
Thread 1:
LockedDebugMsg(“About to do Foo.");
Foo();
LockedDebugMsg("I did Foo.");

Thread 2:
LockedDebugMsg(“About to do Bar.");
Bar();
LockedDebugMsg("I did Bar.”);

Wondering what's meant by "correct ordering" here. Bar() can finish
before Foo() and yet debug.log can report "I did Foo" before "I did
Bar" for whatever thread-scheduling reasons happened to make that the
case. Or Foo() and Bar() can execute together in complete concurrency
and it's just the LockedDebugMsg() picking an arbitrary "winner".

- Jon

I see your point.

For example:
a. Foo() in thread 1 finishes before Bar() in thread 2 finishes
b. The scheduler deactivates thread 1 for a while between the return from Foo() and the execution of LockedDebugMsg("I did Foo.”)
c. Thread 2 proceeds from the return from Bar() without interruption

Then debug.log would contain the message “I did Bar” before “I did Foo”.

So the ordering in the log file really reflects how the kernel sees the temporal order of mutex locking inside LockedDebugMsg. That’s an inexact approximation of the temporal order of calls to LockedDebugMsg, and that’s an even more inexact approximation of the temporal order of code executed before LockedDebugMsg.

For what I was doing, though, that proved to be good enough. :slight_smile:

I’d be very interested in ideas about how to improve that, especially if they’re simple. I can think of a way to improve it, but it would be substantially more complicated than just a mutex.

Reading through this thread, I'm wondering if we should focus on
improving identification of log lines in terms of where they come from
and when they were generated, while keeping to go through the existing
mechanism of sending messages back to main process for output (so that
we don't need the mutex). If we sent timestamps & thread IDs along
with the Debug() messages, one could later post-process debug.log to,
get things sorted/split as desired.

This wouldn't support the use case of "millions of lines" very well,
but I'm not convinced that's what we should be designing this for. A
mutex becomes potentially problematic at that volume as well, and it
also seems like a rare use case to begin with. In cases where it's
really needed, a local patch to get logs into files directly (as you
have done already) might just do the trick, no?

Robin

Reading through this thread, I'm wondering if we should focus on
improving identification of log lines in terms of where they come from
and when they were generated, while keeping to go through the existing
mechanism of sending messages back to main process for output (so that
we don't need the mutex). If we sent timestamps & thread IDs along
with the Debug() messages, one could later post-process debug.log to,
get things sorted/split as desired.

This wouldn't support the use case of "millions of lines" very well,
but I'm not convinced that's what we should be designing this for. A
mutex becomes potentially problematic at that volume as well, and it
also seems like a rare use case to begin with. In cases where it's
really needed, a local patch to get logs into files directly (as you
have done already) might just do the trick, no?

Robin

We could definitely change DebugLogger to improve the log line identification, and route it through the threading framework’s Debug() call. That will avoid turning debug.log into "word salad”.

However, that would also cause a delay in writing the log lines, and I've run into situations working on Zeek where that kind of delay would make debugging harder.

For example, sometimes I run tail on the log file in a terminal window. Then, when the code hits a breakpoint in a debugger, I can analyze the program state by looking at log lines emitted right before the breakpoint triggers, and compare them to variable contents, the stack trace, etc. That won't work if logging is delayed.

There are multiple, conflicting use cases for logging in Zeek. Sometimes a developer might think:
- Maximized throughput is important, but a delay is okay
- No delay can be tolerated, but slower throughput is okay
- Correct temporal ordering in the log is (or isn’t) important
- fflush() after every write is (or isn’t) important
- Debug logging output should go to the debug.log file, or stdout, or somewhere else

This is a pretty common situation around logging, in my experience.

One way to solve it, as Robin says, is for a developer with a use case Zeek doesn't support to apply a temporary local patch. Unfortunately, that doesn't help other developers who might have the same use case. Also, I personally hate to spend time writing code and getting it to work well, and then throw it away.

On other projects, I've used a different approach that's worked really well: use a single, common logging API, but let it send its output to different output mechanisms that support different use cases. Then a developer could pick the output mechanism that works best for their use case at runtime, using a command line option or environment variable. I think it wouldn’t be very complicated to add that to Zeek.

- Bob

I get that in general. It's just that afaik this is the first time
this need comes up. Adding a full-featured, thread-safe logging
framework is a trade-off against complexity and maintainance costs.
Not saying it's impossible, but I'd like to hear more people thinking
this is a good idea before committing to such a route.

Robin

I completely agree about that trade-off, which is why the work I’ve done so far is pretty simple. It doesn’t change the existing DebugLogger system other than adding thread safety. Then on the side, there are a few optional features like a scoping utility class and some preprocessor macros.

That said, different developers have different debugging styles, and I'm a big fan of using feature-rich debug logging frameworks with multiple operating modes and destinations, because they let me fix bugs and write new code much faster than I could otherwise.

Writing a powerful debug logging system does take time and effort, but my experience has been that once it’s finished, it usually doesn't require much ongoing maintenance. Working on open-source and commercial projects with lifetimes of more than a few years, I’ve always seen that time and effort pay for itself many, many times over by making it quicker and easier to diagnose bugs, write new features, and do performance enhancements.

That’s especially been true when I’ve worked on code that handled large volumes of data, like Zeek does. If I need to track down a bug in a stream of data that doesn’t manifest until megabytes have gone by, I usually find it the quickest approach is to run the software and search for a diagnostic pattern in a gigantic log file, compared to other approaches like spending hours hitting the same debugger breakpoint over and over again.

- Bob