0MQ security considerations

But anyway, mulling over this a bit more, let's just go with pthreads.

Will do.

--Gilbert

JFYI, pthread locking/unlocking can be expensive (my impression from using them for the TimeMachine). So, doing processing stuff in (small) batches while holding the lock might be a good idea to help performance. (Then, for the Log framework the number of messages being passed is probably also not that high compared to the TM)
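
A minimal sketch of the batching idea, assuming a single mutex-protected queue of log lines. `LogQueue`, `push`, and `pop_batch` are hypothetical names, not part of Bro or the TimeMachine. This variant moves several messages out of the queue per lock acquisition and leaves the actual processing to the caller after the lock is released, which is one way to amortize the locking cost.

```cpp
#include <pthread.h>
#include <cassert>
#include <cstddef>
#include <deque>
#include <string>
#include <utility>
#include <vector>

// Hypothetical message queue guarded by a single pthread mutex.
struct LogQueue {
    pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    std::deque<std::string> pending;
};

// Producer side: one lock/unlock pair per message.
void push(LogQueue& q, const std::string& msg) {
    pthread_mutex_lock(&q.lock);
    q.pending.push_back(msg);
    pthread_mutex_unlock(&q.lock);
}

// Consumer side: take up to max_batch messages per lock acquisition.
// The caller processes the returned batch without holding the lock.
std::vector<std::string> pop_batch(LogQueue& q, std::size_t max_batch) {
    assert(max_batch > 0);
    std::vector<std::string> batch;
    pthread_mutex_lock(&q.lock);
    while (!q.pending.empty() && batch.size() < max_batch) {
        batch.push_back(std::move(q.pending.front()));
        q.pending.pop_front();
    }
    pthread_mutex_unlock(&q.lock);
    return batch;
}
```

With a batch size of, say, 64, the consumer pays for one lock/unlock pair per 64 messages instead of one pair per message.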

cu
Gregor

Thanks for the note; will keep that in mind.

--Gilbert

Although it seems that folks have settled on pthreads, I still would
like to add my two cents, biased by my own experience with C++11 (aka
C++0x).

I forgot the obvious one yesterday: Intel's TBB. That's what the
multi-core Bro prototype is already using, and its main thread
abstraction is (almost?) compatible with C++0x.

I've been using C++11 for quite a while now and can only say that it
feels like it was overdue. One writes much less boilerplate for what one
wants to achieve (e.g., functors) and many mature components from the
Boost libraries found their way into the C++11 standard library:
threads, tasks/futures, smart pointers, SFINAE helpers like
boost::enable_if, RNGs, etc. Overall, I find myself needing less time to
write more code that actually does something.

Except for the thread-safe data structures (which we can wrap ourselves,
e.g., thread-safe queues) and the TBB scheduler (which we don't plan to
use IIRC), C++11 meets our needs from a feature standpoint. The benefit
would be avoiding licensing hassles and reaping the, erm, somewhat
underappreciated improvements that come for free :-). Now that the
standard is sealed, everybody will use C++11 in a few years and widely
used compilers like gcc will have implemented the full standard. I just
want to point out that we might overlook a free lunch down the road.
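
To illustrate the "wrap ourselves" remark: a rough sketch of a thread-safe queue built only from C++11 standard library pieces (`std::mutex`, `std::condition_variable`). The class name `ThreadSafeQueue` and its interface are assumptions made for this example, not existing Bro code.

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <utility>

template <typename T>
class ThreadSafeQueue {
public:
    // Enqueue an item and wake one waiting consumer.
    void push(T value) {
        {
            std::lock_guard<std::mutex> guard(mutex_);
            items_.push(std::move(value));
        }
        ready_.notify_one();
    }

    // Blocks until an item is available, then returns it.
    T pop() {
        std::unique_lock<std::mutex> guard(mutex_);
        ready_.wait(guard, [this] { return !items_.empty(); });
        T value = std::move(items_.front());
        items_.pop();
        return value;
    }

    // Non-blocking variant: returns false if the queue is empty.
    bool try_pop(T& out) {
        std::lock_guard<std::mutex> guard(mutex_);
        if (items_.empty())
            return false;
        out = std::move(items_.front());
        items_.pop();
        return true;
    }

private:
    std::mutex mutex_;
    std::condition_variable ready_;
    std::queue<T> items_;
};
```

The predicate passed to `wait` guards against spurious wakeups, so `pop` only returns once the queue really holds an item.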

But somebody has to maintain the code that's *using* Boost ... My main
concern is actually that once we have Boost, folks will immediately
start using pretty much any feature it provides. :slight_smile:

The main problem I see here is that Boost is a mixed bag: some libraries
are really high quality (and thus made their way into the new standard)
while others still need time to mature and are of experimental nature.
If we really want to use Boost (say to have a platform-independent
networking library), we could whitelist the Boost components that we
allow in Bro.

While we're at it, Boost Asio is a nice library not only for
platform-independent networking, but also to structure computation at
the granularity of tasks. Unlike 0mq, it is not a messaging subsystem;
we would need to obtain this functionality elsewhere. I have not looked
at it in detail, but 0mq is just one of many options in the space of
messaging middle layers [1].

    Matthias

[1] jms - ActiveMQ or RabbitMQ or ZeroMQ or - Stack Overflow

Although it seems that folks have settled on pthreads, I still would
like to add my two cents, biased by my own experience with C++11 (aka
C++0x).

I forgot the obvious one yesterday: Intel's TBB. That's what the
multi-core Bro prototype is already using, and its main thread
abstraction is (almost?) compatible with C++0x.

I've been using C++11 for quite a while now and can only say that it
feels like it was overdue. One writes much less boilerplate for what one
wants to achieve (e.g., functors) and many mature components from the
Boost libraries found their way into the C++11 standard library:
threads, tasks/futures, smart pointers, SFINAE helpers like
boost::enable_if, RNGs, etc. Overall, I find myself needing less time to
write more code that actually does something.

Except for the thread-safe data structures (which we can wrap ourselves,
e.g., thread-safe queues) and the TBB scheduler (which we don't plan to
use IIRC), C++11 meets our needs from a feature standpoint. The benefit
would be avoiding licensing hassles and reaping the, erm, somewhat
underappreciated improvements that come for free :-). Now that the
standard is sealed, everybody will use C++11 in a few years and widely
used compilers like gcc will have implemented the full standard. I just
want to point out that we might overlook a free lunch down the road.

But somebody has to maintain the code that's *using* Boost ... My main
concern is actually that once we have Boost, folks will immediately
start using pretty much any feature it provides. :slight_smile:

The main problem I see here is that Boost is a mixed bag: some libraries
are really high quality (and thus made their way into the new standard)
while others still need time to mature and are of experimental nature.
If we really want to use Boost (say to have a platform-independent
networking library), we could whitelist the Boost components that we
allow in Bro.

While we're at it, Boost Asio is a nice library not only for
platform-independent networking, but also to structure computation at
the granularity of tasks. It also plays nicely with C++11, i.e.,
it facilitates the implementation of user-space thread scheduling. Unlike
0mq, it does not feature a messaging subsystem; we would need to obtain
this functionality elsewhere.

    Matthias

threads, tasks/futures, smart pointers, SFINAE helpers like
boost::enable_if, RNGs, etc. Overall, I find myself needing less time to
write more code that actually does something.

I believe all of that!

However, my main concern is that I don't want to introduce such
functionality through the backdoor by starting to use it here and
there. That leads to code that's really hard to maintain because one
suddenly has a mix of different styles, idioms, levels of abstraction,
etc.

For Bro, we had this in the past already with STL containers: they
slowly started being used more and more, and now we have quite a
mix all over the code. I like the STL containers and I certainly
prefer to use them over the macro-based alternatives (and I do use
them where I can these days), but I still don't think that having both
versions in the same code base is a good thing.

Containers aren't really that critical in this context as it's rather
clear to see what's going on (though there are cases where we now need
to convert from container type A to B). But I'm not sure that also
applies to all the shiny new features in C++11 ...

So, generally, I'd be fine switching to using more modern stuff but I
think that needs to be a conscious decision and we need to adapt the
existing code base accordingly. We also need to make sure we have the
platform support everywhere we need it; don't think we are there yet
(or are we?).

Coming back to the original question, in my view triggering all this
is not really justified just for doing threaded logging. Longer term,
it may be fine.

If we really want to use Boost (say to have a platform-independent
networking library), we could whitelist the Boost components that we
allow in Bro.

That's a slippery slope. With that, every time somebody wants to use
something else (which is always tempting of course :), we'd need to
argue about it and potentially find arguments for *not* using it;
which will be hard I'm sure. Or somebody might just start using it and
"will change that later" ...

(Btw, DataSeries uses Boost. So folks using that will need that
dependency already. And it's quite a big one as I noticed when
installing the Boost FreeBSD port the other day ...)

Robin

However, my main concern is that I don't want to introduce such
functionality through the backdoor by starting to use it here and
there. That leads to code that's really hard to maintain because one
suddenly has a mix of different styles, idioms, levels of abstraction,
etc.

I fully agree.

For Bro, we had this in the past already with STL containers: they
slowly started being used more and more, and now we have quite a
mix all over the code. I like the STL containers and I certainly
prefer to use them over the macro-based alternatives (and I do use
them where I can these days), but I still don't think that having both
versions in the same code base is a good thing.

Assuming there is no performance/functionality issue, is it too much of
an effort to replace the ancient containers?

So, generally, I'd be fine switching to using more modern stuff but I
think that needs to be a conscious decision and we need to adapt the
existing code base accordingly.

Eventually, it could be worth going through the new list of language and
library features, comparing them to the status quo, and deciding if a
change makes sense. For example, there could be a smart pointer branch
that replaces (Un)Ref() calls with std::shared_ptr.
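
A rough side-by-side of what such a branch would change. `Obj`, `Ref`, and `Unref` below are simplified stand-ins written for this example, not Bro's actual classes; the point is only to contrast manual reference counting with `std::shared_ptr`.

```cpp
#include <memory>

// Simplified stand-in for an intrusively ref-counted object
// (hypothetical; not Bro's actual class).
struct Obj {
    int ref_count = 1;  // the creator holds the first reference
    virtual ~Obj() = default;
};

void Ref(Obj* o) { ++o->ref_count; }

void Unref(Obj* o) {
    if (o && --o->ref_count == 0)
        delete o;
}

// Manual style: every Ref() must be paired with exactly one Unref().
int manual_style() {
    Obj* o = new Obj;         // count: 1
    Ref(o);                   // count: 2, e.g. stored in a second place
    int peak = o->ref_count;
    Unref(o);                 // count: 1
    Unref(o);                 // count: 0, object deleted here
    return peak;
}

// shared_ptr style: copies of the pointer do the counting.
long shared_style() {
    std::shared_ptr<Obj> first = std::make_shared<Obj>();
    std::shared_ptr<Obj> second = first;  // count: 2
    return first.use_count();             // both owners release on scope exit
}
```

In the manual version a forgotten `Unref` leaks and an extra one is a use-after-free; in the `shared_ptr` version the pairing is enforced by scope.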

We also need to make sure we have the platform support everywhere we
need it; don't think we are there yet (or are we?).

At least all platforms that support gcc 4.7+ should have no trouble, but
I'm sure it will still take a while until we have solid C++11 compilers
that are stable on all platforms.

Coming back to the original question, in my view triggering all this
is not really justified just for doing threaded logging. Longer term,
it may be fine.

For just threaded logging, I agree. But this seems to be a rather
general question. I believe that just enhancing one component of Bro
with thread support can lead to a problematic architecture in the long
run, when other components also jump on the concurrency train. If this
is not a coordinated effort, the components may crossfire and impede
themselves. This could be as simple as thread thrashing but also manifest
more subtly as poor caching and memory performance. Maybe it's not
warranted at this point, but I think we should at some point think about
a task-based approach to concurrency, rather than thinking about
concurrency in terms of threads. (This is completely independent of the
multi-core Bro effort but rather about general software engineering
practices in many-core land.)

we'd need to argue about it and potentially find arguments for *not*
using it; which will be hard I'm sure. Or somebody might just start
using it and "will change that later" ...

Fair enough. But at the end of the day we have to make a decision about
what platform-independent network and I/O framework to use. The Boost
networking library is of high quality and used widely. If we find a
slimmer alternative, that would be great.

    Matthias

Coming back to the original question, in my view triggering all this
is not really justified just for doing threaded logging. Longer term,
it may be fine.
    
For just threaded logging, I agree. But this seems to be a rather
general question. I believe that just enhancing one component of Bro
with thread support can lead to a problematic architecture in the long
run, when other components also jump on the concurrency train. If this
is not a coordinated effort, the components may crossfire and impede
themselves.

Okay, since this came up and I'm working on it, I guess I'll try to address the architecture issue :slight_smile:

Originally, we were discussing using 0mq, which uses a message-based architecture. This struck me as a very clean way to segment a program into threads, and would logically extend rather well to cover other things (e.g. IPC). As such, I borrowed that model.

Bro threads, as I built them (read: my fault if this is bad / stupid :), are simply discrete event processors; they accept messages that descend from a parent EventMessage (or is it MessageEvent? can't remember offhand) and are passed to the individual threads through a thread-safe queue. Because there's a large degree of complexity involved with ensuring any individual event can be processed on any thread, especially given that log / flush / rotate messages have possibly complex ordering dependencies to deal with, and further given that a log writer (or, in bro's case, most of the logwriter-related events) should spend the majority of its time blocking for IO, I don't necessarily agree that logging stuff would be a good candidate for task-oriented execution.
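
The model described above can be sketched roughly as follows. This is an illustration only: `Message`, `LogMessage`, `FlushMessage`, `FinishMessage`, and `WriterThread` are hypothetical names, and the real class hierarchy may look quite different. The essential property is that one thread handles its messages strictly in arrival order, which is what keeps log / flush / rotate ordering simple.

```cpp
#include <condition_variable>
#include <deque>
#include <memory>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// Base class for everything sent to a writer thread (hypothetical name).
struct Message {
    virtual ~Message() = default;
    // Returns false when the receiving thread should stop.
    virtual bool Process(std::vector<std::string>& sink) = 0;
};

struct LogMessage : Message {
    explicit LogMessage(std::string l) : line(std::move(l)) {}
    bool Process(std::vector<std::string>& sink) override {
        sink.push_back(line);
        return true;
    }
    std::string line;
};

struct FlushMessage : Message {
    bool Process(std::vector<std::string>& sink) override {
        sink.push_back("<flush>");  // marker standing in for a real flush
        return true;
    }
};

struct FinishMessage : Message {
    bool Process(std::vector<std::string>&) override { return false; }
};

class WriterThread {
public:
    // Called by other threads to hand over a message.
    void Send(std::unique_ptr<Message> msg) {
        {
            std::lock_guard<std::mutex> guard(mutex_);
            inbox_.push_back(std::move(msg));
        }
        ready_.notify_one();
    }

    // Thread body: block for the next message, process it, repeat
    // until a message asks to stop.
    void Run() {
        for (;;) {
            std::unique_ptr<Message> msg;
            {
                std::unique_lock<std::mutex> guard(mutex_);
                ready_.wait(guard, [this] { return !inbox_.empty(); });
                msg = std::move(inbox_.front());
                inbox_.pop_front();
            }
            if (!msg->Process(sink_))
                return;
        }
    }

    // Only safe to read after Run() has returned.
    const std::vector<std::string>& Sink() const { return sink_; }

private:
    std::mutex mutex_;
    std::condition_variable ready_;
    std::deque<std::unique_ptr<Message>> inbox_;
    std::vector<std::string> sink_;
};
```

In a real setup `Run()` would be the function handed to `pthread_create` (or `std::thread`), with one such object per log writer.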

Re: task-oriented execution for bro in general: this seems like it is already accomplished to a large degree by e.g. hardware that splits packets by the connection they belong to and routes them to the appropriate processing node in the bro cluster. In this case, the hardware connection splitter thingy acts as the task scheduler, the packet acts as a task, and each individual bro instance acts as the thread of execution. If we wanted to see aggregate performance gains, I guess we could write ubro: micro scripting language, rules are entirely resident in a piece of specialized hardware (CUDA?), processes only certain types of packet streams (thus freeing other general-purpose bro instances to handle other stuff). We could even call it BroCluMP (Bro Cluster Multi-Processing) or something.

Naturally, in the case such hardware is not available, it might make sense to build a piece of software that could accomplish the same; I don't know enough about bro's cluster architecture to know whether or not something of this nature already exists, however.

This could be as simple as thread thrashing but also manifest
more subtly as poor caching and memory performance.

Yes. That said, this seems relevant: http://www.fullduplex.org/humor/2006/10/how-to-shoot-yourself-in-the-foot-in-any-programming-language/

Maybe it's not
warranted at this point, but I think we should at some point think about
a task-based approach to concurrency, rather than thinking about
concurrency in terms of threads. (This is completely independent of the
multi-core Bro effort but rather about general software engineering
practices in many-core land.)

See above :slight_smile:

we'd need to argue about it and potentially find arguments for *not*
using it; which will be hard I'm sure. Or somebody might just start
using it and "will change that later" ...
    
Fair enough. But at the end of the day we have to make a decision about
what platform-independent network and I/O framework to use. The Boost
networking library is of high quality and used widely. If we find a
slimmer alternative, that would be great.
  

Just for the record, boost tends to be less of a library and more of a label; once a library has existed for long enough at a high enough level of quality, it'll often roll itself into boost. As such, one of the libraries bro uses could eventually end up as a piece of boost, at which point the team would have to revisit this particular issue.

Just my $0.02. I guarantee neither relevance nor quality.

--Gilbert

Okay, since this came up and I'm working on it, I guess I'll try to
address the architecture issue :slight_smile:

Thanks for chiming in.

Originally, we were discussing using 0mq, which uses a message-based
architecture. This struck me as a very clean way to segment a program
into threads, and would logically extend rather well to cover other
things (e.g. IPC). As such, I borrowed that model.

I like the message-passing model as well. How do you use the term
"thread?" Do you mean a hardware thread (managed by the OS) or a
virtual/logical thread (a user-space abstraction)? I am asking because
in general there should be a (close to) 1:1 ratio between
available cores and number of user-level threads, mainly to avoid
thrashing and increase cache performance. With I/O-bound applications
this is of course less of an issue, but nonetheless a prudent software
engineering practice in the manycore era.

Because there's a large degree of complexity involved with ensuring
any individual event can be processed on any thread, especially given
that log / flush / rotate messages have possibly complex ordering
dependencies to deal with, and further given that a log writer (or, in
bro's case, most of the logwriter-related events) should spend the
majority of its time blocking for IO, I don't necessarily agree that
logging stuff would be a good candidate for task-oriented execution.

You bring up a good point, namely blocking I/O, which I haven't thought
of. Just out of curiosity, could all blocking operations be replaced with
their asynchronous counterparts? I am just asking because I am using
Boost Asio in a completely asynchronous fashion. This lends itself well
to a task-based architecture with asynchronous "components," each of
which has a task queue that accumulates non-interfering function calls.
Imagine that each component has some dynamic number of threads,
depending on the available number of cores.

Let's assume there exists an asynchronous component for each log
backend. (Sorry if I am misusing terminology, I'm not completely
up-to-date regarding the logging architecture.) If the log/flush/rotate
messages are encapsulated as a single task, then the ordering issues
would go away, but you still get a "natural" scale-up at event
granularity (assuming events can arrive out-of-order) by assigning more
threads to a component. Does that make sense? Maybe you use this sort of
architecture already?! My point is essentially that there is a
difference between tasks and events that have different notions of
concurrency.

Re: task-oriented execution for bro in general: this seems like it is
already accomplished to a large degree by e.g. hardware that splits
packets by the connection they belong to and routes them to the
appropriate processing node in the bro cluster.

Yeah, one can think about it this way. The only thing that gives me
pause is that the term "task" has a very specific, local meaning in
parallel computation lingo.

If we wanted to see aggregate performance gains, I guess we could
write ubro: micro scripting language, rules are entirely resident in a
piece of specialized hardware (CUDA?), processes only certain types of
packet streams (thus freeing other general-purpose bro instances to
handle other stuff).

Nice thought, that's something for HILTI where available hardware is
transparently used by the execution environment. For example, if a
hardware regex matcher is available, the execution environment offloads
the relevant instructions to the special card but would otherwise use
its own implementation.

We could even call it BroCluMP (Bro Cluster Multi-Processing) or
something.

While we're creating new names, what about Bruda for the GPU-based
version of Bro that offloads regex matching to CUDA? :wink:

    Matthias

We don't know how many cores the user will have and how many are virtual (hyperthreading). Due to the direction hardware is heading, I think if we estimate and hard code it in the design, it will soon be an underestimate. Therefore, I would probably err on the side of more threads than is typical in a server today when designing a multi-threaded architecture.

We don't know how many cores the user will have and how many are
virtual (hyperthreading). Due to the direction hardware is heading, I
think if we estimate and hard code it in the design, it will soon be
an underestimate. Therefore, I would probably err on the side of more
threads than is typical in a server today when designing a
multi-threaded architecture.

One way to get around this is to just have a command line switch that
specifies the number of threads to use in total. This would override the
default concurrency level as detected by the application (i.e., the
number of cores exposed by the OS). The application then somehow
internally distributes the number of available threads over the
asynchronous components.
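
A small sketch of that scheme, assuming the command line value has already been parsed into an integer (0 meaning "not given"). `total_threads` and `distribute` are hypothetical helpers, and the even split is just one possible distribution policy.

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// Returns the user's override if given (> 0); otherwise the number of
// hardware threads reported by the OS, falling back to 1 if unknown.
unsigned total_threads(unsigned requested) {
    if (requested > 0)
        return requested;
    unsigned detected = std::thread::hardware_concurrency();  // 0 if unknown
    return detected > 0 ? detected : 1;
}

// Splits the thread budget evenly across components; any remainder goes
// to the first components. Every component gets at least one thread, so
// the sum can exceed `total` when there are more components than threads.
std::vector<unsigned> distribute(unsigned total, std::size_t components) {
    std::vector<unsigned> share(components, 0);
    for (std::size_t i = 0; i < components; ++i) {
        unsigned base = static_cast<unsigned>(total / components);
        unsigned extra = (i < total % components) ? 1u : 0u;
        share[i] = base + extra;
        if (share[i] == 0)
            share[i] = 1;
    }
    return share;
}
```

For example, a budget of 8 threads over 3 components yields shares of 3, 3, and 2.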

In fact, one often uses a slightly tweaked ratio with a few more
threads than the hardware exposes, as it achieves better
saturation. (I guess that's what you mean by "err on the side of more
threads.")

    Matthias

It wasn't clear to me whether or not the number of threads was really defined by the architecture. If there is always more work that you can give to more threads, then this works.

Hi:

Inline:

Originally, we were discussing using 0mq, which uses a message-based
architecture. This struck me as a very clean way to segment a program
into threads, and would logically extend rather well to cover other
things (e.g. IPC). As such, I borrowed that model.

I like the message-passing model as well. How do you use the term
"thread?" Do you mean a hardware thread (managed by the OS) or a
virtual/logical thread (a user-space abstraction)? I am asking because
in general there should be a (close to) 1:1 ratio between
available cores and number of user-level threads, mainly to avoid
thrashing and increase cache performance. With I/O-bound applications
this is of course less of an issue, but nonetheless a prudent software
engineering practice in the manycore era.

Whichever model pthread happens to use :slight_smile: I think the implementation might be slightly platform-dependent.

That said, keep in mind that some libraries (e.g. DataSeries) actually spawn additional worker threads. This makes it very difficult to place a hard limit on the number of threads that exist within a bro process. If this does turn out to be an issue, it may be time to look at fork()'ing the loggers and / or using some kind of remote logging scheme.

Because there's a large degree of complexity involved with ensuring
any individual event can be processed on any thread, especially given
that log / flush / rotate messages have possibly complex ordering
dependencies to deal with, and further given that a log writer (or, in
bro's case, most of the logwriter-related events) should spend the
majority of its time blocking for IO, I don't necessarily agree that
logging stuff would be a good candidate for task-oriented execution.

You bring up a good point, namely blocking I/O, which I haven't thought
of. Just out of curiosity, could all blocking operations be replaced with
their asynchronous counterparts?

For the ASCII and the DataSeries loggers, yes; the latter seems to already do this (one of the worker threads is an output thread).

I am just asking because I am using
Boost Asio in a completely asynchronous fashion. This lends itself well
to a task-based architecture with asynchronous "components," each of
which has a task queue that accumulates non-interfering function calls.
Imagine that each component has some dynamic number of threads,
depending on the available number of cores.

Okay.

Let's assume there exists an asynchronous component for each log
backend. (Sorry if I am misusing terminology, I'm not completely
up-to-date regarding the logging architecture.)

I think I follow you.

  If the log/flush/rotate
messages are encapsulated as a single task, then the ordering issues
would go away, but you still get a "natural" scale-up at event
granularity (assuming events can arrive out-of-order) by assigning more
threads to a component.

I don't understand this bit. Say we have 10 things to log: {A, B, C, ... J}, and we queue 8 messages before we flush + rotate (for the sake of argument)

Log+Flush+Rotate task = {A, B, C, D} ---> Thread 1
Log+Flush task = {E, F, G, H} ---> Thread 2
Log+Flush+Finish task = {I, J} ---> Thread 3

In this case, it seems possible that thread 3 could complete before threads 1 and 2, unless we forced tasks to lock the log file upon execution. . . but if we did that, I think the locking order would become less predictable as the system's load increased, leading to oddly placed chunks of log data and / or dropped messages if the file were to be closed before a previous rotate message made it to the logger.

Does that make sense? Maybe you use this sort of
architecture already?!

Kind of. Let me throw together a diagram and send it out; it's something that should probably end up in the documentation anyway :slight_smile:

My point is essentially that there is a
difference between tasks and events that have different notions of
concurrency.

I need to do some more reading before I'd be prepared to offer a coherent argument here.

Re: task-oriented execution for bro in general: this seems like it is
already accomplished to a large degree by e.g. hardware that splits
packets by the connection they belong to and routes them to the
appropriate processing node in the bro cluster.

Yeah, one can think about it this way. The only thing that gives me
pause is that the term "task" has a very specific, local meaning in
parallel computation lingo.

I don't claim to be a parallel computation expert, so this does not surprise me :slight_smile:

I was more attempting to illustrate the parallel between bro's existing cluster architecture and a more traditional task-oriented model (as I understand it) than I was trying to argue that packets should necessarily be classifiable as "tasks" in the strictest sense of the word.

If we wanted to see aggregate performance gains, I guess we could
write ubro: micro scripting language, rules are entirely resident in a
piece of specialized hardware (CUDA?), processes only certain types of
packet streams (thus freeing other general-purpose bro instances to
handle other stuff).

Nice thought, that's something for HILTI where available hardware is
transparently used by the execution environment. For example, if a
hardware regex matcher is available, the execution environment offloads
the relevant instructions to the special card but would otherwise use
its own implementation.

Ah, right then.

Thanks,
Gilbert

That said, keep in mind that some libraries (e.g. DataSeries) actually
spawn additional worker threads.

Is it possible to configure the number of threads DataSeries uses?

I don't understand this bit. Say we have 10 things to log: {A, B, C,
... J}, and we queue 8 messages before we flush + rotate (for the sake
of argument)

Log+Flush+Rotate task = {A, B, C, D} ---> Thread 1
Log+Flush task = {E, F, G, H} ---> Thread 2
Log+Flush+Finish task = {I, J} ---> Thread 3

In this case, it seems possible that thread 3 could complete before
threads 1 and 2, unless we forced tasks to lock the log file upon
execution. . . but if we did that, I think the locking order would
become less predictable as the system's load increased, leading to oddly
placed chunks of log data and / or dropped messages if the file were to
be closed before a previous rotate message made it to the logger.

This is a good example of how not to use task-based parallelism. As
there exist inter-task dependencies, they have to be processed
sequentially. For example, say these 10 things to log are generated for
18 http_reply events. What I was saying is that you have one task for
all dependent actions/messages/steps:

    Event 1: Log+Flush+Rotate+Log+Flush+Finish task = {A, ..., J}
    ...
    Event 18: Log+Flush+Rotate+Log+Flush+Finish task = {A, ..., J}

Say the Log Manager component has 3 threads associated, then each thread
would process 6 events, assuming a simple uniform scheduling strategy.
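
A sketch of that arrangement, with all names (`EventTask`, `run_task`, `run_worker`) invented for illustration. Each task bundles the dependent steps of one event and runs them in order on whichever worker picks it up; round-robin assignment stands in for the "simple uniform scheduling strategy".

```cpp
#include <cstddef>
#include <string>
#include <vector>

// All dependent steps for one event, bundled into a single task.
struct EventTask {
    int event_id;
    std::vector<std::string> steps;  // e.g. Log, Flush, Rotate, ..., Finish
};

// Executes one task: its steps run strictly in order on the calling thread.
// The returned trace stands in for the real side effects.
std::vector<std::string> run_task(const EventTask& task) {
    std::vector<std::string> trace;
    for (const std::string& step : task.steps)
        trace.push_back(std::to_string(task.event_id) + ":" + step);
    return trace;
}

// Body for worker `index` out of `workers`: round-robin assignment, so
// worker 0 handles tasks 0, 3, 6, ... when there are three workers.
// Returns the number of tasks this worker handled.
std::size_t run_worker(const std::vector<EventTask>& tasks,
                       std::size_t index, std::size_t workers) {
    if (workers == 0)
        return 0;
    std::size_t handled = 0;
    for (std::size_t i = index; i < tasks.size(); i += workers) {
        run_task(tasks[i]);
        ++handled;
    }
    return handled;
}
```

Each call to `run_worker` would be the body of one thread. Ordering is only guaranteed within a task, not across tasks, which is why all dependent steps have to live in the same task.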

Kind of. Let me throw together a diagram and send it out; it's
something that should probably end up in the documentation anyway :slight_smile:

Sounds good. This stuff is much better to comprehend with a visual aid!

I was more attempting to illustrate the parallel between bro's existing
cluster architecture and a more traditional task-oriented model (as I
understand it) than I was trying to argue that packets should
necessarily be classifiable as "tasks" in the strictest sense of the word.

Yup, I understood it that way and think it's a nice analogy.

    Matthias

Matthias Vallentin wrote:

That said, keep in mind that some libraries (e.g. DataSeries) actually spawn additional worker threads.
    
Is it possible to configure the number of threads DataSeries uses?
  

Yes, to some extent; DataSeries will always spawn at least two threads (one output thread and one compression / worker thread).

The number of worker threads DataSeries spawns can be controlled by tweaking a variable in dataseries.bro, but this brings up a more general problem: what if we use a library that doesn't offer us the ability to do this?

Then again, if the library doesn't let us do this, maybe we just shouldn't be using that particular library...?

I don't understand this bit. Say we have 10 things to log: {A, B, C, ... J}, and we queue 8 messages before we flush + rotate (for the sake of argument)

Log+Flush+Rotate task = {A, B, C, D} ---> Thread 1
Log+Flush task = {E, F, G, H} ---> Thread 2
Log+Flush+Finish task = {I, J} ---> Thread 3

In this case, it seems possible that thread 3 could complete before threads 1 and 2, unless we forced tasks to lock the log file upon execution. . . but if we did that, I think the locking order would become less predictable as the system's load increased, leading to oddly placed chunks of log data and / or dropped messages if the file were to be closed before a previous rotate message made it to the logger.
    
This is a good example of how not to use task-based parallelism. As
there exist inter-task dependencies, they have to be processed
sequentially. For example, say these 10 things to log are generated for
18 http_reply events. What I was saying is that you have one task for
all dependent actions/messages/steps:

    Event 1: Log+Flush+Rotate+Log+Flush+Finish task = {A, ..., J}
    ...
    Event 18: Log+Flush+Rotate+Log+Flush+Finish task = {A, ..., J}

Say the Log Manager component has 3 threads associated, then each thread
would process 6 events, assuming a simple uniform scheduling strategy.

Sure. In this case, though, A-J essentially compose the entire log file, which is what's giving me some trouble; even if the LogManager has 3 threads associated, it still has to pass entire LogWriters around between those threads and make sure no two threads are using the same LogWriter (assuming nothing was using e.g. thread-local storage, which might stipulate that thread assignments would need to be made at initialization and, once made, could never change).

So, I suppose it might be good to spawn a finite number of Bro worker threads and assign LogWriters to those (rather than spawning one bro worker thread per logwriter), but I don't see a way around binding an entire logwriter to a thread. What I don't like about this is that it means any scheduling the task manager does would be coarse-grained at best.

Kind of. Let me throw together a diagram and send it out; it's something that should probably end up in the documentation anyway :slight_smile:
    
Sounds good. This stuff is much better to comprehend with a visual aid!
  

Rough cut of a general overview is attached. Probably going to change quite a bit, but I think it might help illustrate the way the system works overall.

The architecture, at the moment, is geared toward supporting transparent replacement of threaded log writers with writers that work across sockets and / or IPC; anything that implements QueueInterface can be used to pass messages back and forth. Before this would actually work in a network / IPC context, though, the messages would need to support serialization; haven't done that yet because it hasn't really been a priority.
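
To make the idea concrete, here is a hedged sketch. Only the name `QueueInterface` comes from the description above; the methods `Put` and `TryGet`, the `InProcQueue` class, and the use of plain strings as already-serialized messages are assumptions made for this example.

```cpp
#include <deque>
#include <mutex>
#include <string>

// Abstract transport for messages. Senders and receivers only see this
// interface, so the implementation behind it can be swapped.
class QueueInterface {
public:
    virtual ~QueueInterface() = default;
    virtual void Put(const std::string& msg) = 0;
    // Returns false if no message is currently available.
    virtual bool TryGet(std::string& msg) = 0;
};

// In-process implementation: a mutex-protected deque shared by threads.
class InProcQueue : public QueueInterface {
public:
    void Put(const std::string& msg) override {
        std::lock_guard<std::mutex> guard(mutex_);
        messages_.push_back(msg);
    }

    bool TryGet(std::string& msg) override {
        std::lock_guard<std::mutex> guard(mutex_);
        if (messages_.empty())
            return false;
        msg = messages_.front();
        messages_.pop_front();
        return true;
    }

private:
    std::mutex mutex_;
    std::deque<std::string> messages_;
};

// Sender code depends only on the interface, not on the transport.
void send_log_line(QueueInterface& queue, const std::string& line) {
    queue.Put("log:" + line);
}
```

A socket- or IPC-backed class could implement the same two methods by writing to and reading from a file descriptor, which is where message serialization would become necessary.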

Feedback / corrections / "THAT'S A STUPID MODEL!" are always appreciated.

--Gilbert

LogSystemOverview.pdf.bz2 (876 KB)

Gilbert and I talked about this a bit more, but for the record: while
generally there's nothing wrong with that, I'm not sure we actually
need it here. The writer threads are long-lived and many of them won't
even have much load. My guess is that the OS will actually be doing
just fine scheduling them for us. So my suggestion is to go with
individual threads for now and see if we run into problems. If we
really do, it should still be quite straight-forward to later switch
to a worker thread model.

Robin

(I read Robin's follow-up and have no objection to keeping the
hardware thread model instead of a task-based worker thread model.
My comments below are just for the record.)

[..] this brings up a more general problem: what if we use a library
that doesn't offer us the ability to do this?

That's a big issue. Some research at the Berkeley ParLab tried to
address this by providing a broker library; sort of a meta scheduler
that seeks to make sure cross-library calls do not hurt concurrency.
Alas, I cannot remember the name of the project but if you are
interested I can find it out.

Then again, if the library doesn't let us do this, maybe we just
shouldn't be using that particular library...?

If the library is uncontrollable with respect to the number of threads
it spawns, that's tough. Though most well-written libraries (not
stand-alone applications) I came across have some way to configure or
limit the number of used threads. Even if we used 4 different libraries,
each of which can be trimmed down to two threads, I'd say we have a
tractable situation because the number of cores will probably grow
faster than the minimum number of threads in libraries we use.

What I don't like about this is that it means any scheduling the task
manager does would be coarse-grained at best.

Agreed, the logging framework is not a component of Bro that would reap
major benefits from the worker thread model. The reason why I proposed
the task-based architecture is that it is a strong spine for modern
concurrent applications. However, Bro has enough muscle at this point,
which is why focusing on other body parts is probably a better idea. (I
hope this metaphor is not too awkward. :-) )

Rough cut of a general overview is attached. Probably going to change
quite a bit, but I think it might help illustrate the way the system
works overall.

Cool, that's a nice diagram. Having similar ones for the big parts of
Bro is a valuable resource for developers. Some quick
questions/comments:

    - In the LogMgr description, what's a log stream object?

    - I'm not sure if I understand the crossed-out CONTEXT BOUNDARY
      text. Does this essentially mean that the queue interface can (in
      the future) adapt based on how the distributed system is
      configured? I.e., if reader and writer are part of the same
      program, it is some sort of IN_PROC queue, and for IPC and
      networking it will transparently serialize the messages?

    - What is the Pull FIFO for? Does the LogEmissary receive
      feedback from the LogWriter?

    - Just out of curiosity, what type of thread-safe queue is it?
      Single-writer/single-reader? Single-writer/multiple-reader? There
      are zillions of thread-safe queue implementations out there, each
      optimized for some different use case! While we're at it, is the
      LogEmissary to LogWriter multiplicity 1:1 or 1:n?

    - About BasicThread: it seems there is one channel for both control
      and data messages, i.e., the queue can include a special terminate
      event to shutdown the BasicThread. Are these control messages
      filtered out in the base class (BasicThread) code before the
      message is passed to the child (e.g., LogWriter) code?

Feedback / corrections / "THAT'S A STUPID MODEL!" are always
appreciated.

I like the architecture (+1 with your words ;-) ) and am looking forward
to seeing it in action. What I am really interested in are profiling
benchmarks (very easy with Google perftools) to see where the non-I/O
bottlenecks occur.

Sorry if I have missed it, but do we now use 0mq as the messaging
middle layer, or is all message passing based on custom code?

    Matthias

(I read Robin's follow up and have no objections with keeping the
hardware thread model instead of a task-based worker thread model.
My comments below are just for the record.)

[..] this brings up a more general problem: what if we use a library
that doesn't offer us the ability to do this?

That's a big issue. Some research at the Berkeley ParLab tried to
address this by providing a broker library; sort of a meta scheduler
that seeks to make sure cross-library calls do not hurt concurrency.
Alas, I cannot remember the name of the project but if you are
interested I can find it out.

Sure, if it wouldn't be too much trouble :-) I'd love to see how they manage to do that.

Cool, that's a nice diagram.

Thanks :-)

Having similar ones for the big parts of
Bro is a valuable resource for developers. Some quick
questions/comments:

     - In the LogMgr description, what's a log stream object?

Oh, yeah. That probably needs to go in there.

To answer the question: a bro log stream is defined by three things -- a path (where do I write), a record type (what do I write), and a writer type (how do I write). The stream object holds a little bit of state ("am I pushing my logs through a remote serializer?" "is this log stream enabled or disabled?" etc). Also, since the types used by the log writers differ somewhat from the types used by bro internally, the stream holds some information used to help with that mapping.
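
As a rough picture of the stream object described here, a sketch follows. All names (LogStream, WriterType, the field names) are made up for illustration; the real class may well look different, and the type-mapping information is left out entirely.

```cpp
#include <cassert>
#include <string>

// Hypothetical writer kinds, for illustration only.
enum WriterType { WRITER_ASCII, WRITER_OTHER };

// Assumed layout: the three defining properties plus a little state.
struct LogStream {
    std::string path;         // where do I write
    std::string record_type;  // what do I write (name of the record type)
    WriterType writer;        // how do I write
    bool enabled;             // is this log stream enabled or disabled?
    bool remote;              // pushing logs through a remote serializer?

    LogStream(const std::string& p, const std::string& r, WriterType w)
        : path(p), record_type(r), writer(w), enabled(true), remote(false) {}

    // A disabled stream drops its records instead of handing them on.
    bool ShouldWrite() const { return enabled; }
};
```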

That said, stream objects could (and probably eventually should) be rolled into the LogEmissary type.

     - I'm not sure if I understand the crossed-out CONTEXT BOUNDARY
       text. Does this essentially mean that the queue interface can (in
       the future) adapt based on how the distributed system is
       configured? I.e., if reader and writer are part of the same
       program, it is some sort of IN_PROC queue, and for IPC and
       networking it will transparently serialize the messages?

Yeah, that was the intent. Wasn't really sure quite how to illustrate that; any thoughts on a better way?

     - What is the Pull FIFO for? Does the LogEmissary receive
       feedback from the LogWriter?

Yes. Currently, that feedback is limited to error messages, but there are other messages planned as well.

     - Just out of curiosity, what type of thread-safe queue is it?
       Single-writer/single-reader? Single-writer/multiple-reader? There
       are zillions of thread-safe queue implementations out there, each
       optimized for some different use case!

I don't know if I'd really call this one "optimized" :-) The queue was thrown together pretty quickly; it's largely targeted for single producer / single consumer, but I believe multiple producer / single consumer should work as well.

Trying to use multiple consumers with this queue would likely result in some kind of universe-ending quantum event involving the LHC, the deflector dish on the USS Enterprise, and a relatively cute kitten with gray fur and black tiger stripes.

Alternatively, the program could just crash and / or deadlock, but that's not nearly as much fun to contemplate.

Regardless, verifying the above and documenting the properties of said queue would probably be an excellent idea. Thanks :-)
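
For reference while documenting those properties, here is a minimal sketch of a pthread-mutex-protected queue. It is not the queue discussed above: LockedQueue, Put, and Drain are hypothetical names, and because every operation takes the same lock this version would tolerate multiple consumers, at the price of a lock on every call. Drain hands back a small batch per lock acquisition, which keeps the number of lock / unlock pairs down.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <pthread.h>
#include <vector>

template <typename T>
class LockedQueue {
public:
    LockedQueue() { pthread_mutex_init(&lock_, NULL); }
    ~LockedQueue() { pthread_mutex_destroy(&lock_); }

    // Producer side: append one item under the lock.
    void Put(const T& item) {
        pthread_mutex_lock(&lock_);
        items_.push_back(item);
        pthread_mutex_unlock(&lock_);
    }

    // Consumer side: move up to max_items into out with a single lock
    // acquisition. Returns the number of items moved (0 if the queue is empty).
    std::size_t Drain(std::vector<T>& out, std::size_t max_items) {
        pthread_mutex_lock(&lock_);
        std::size_t n = 0;
        while (n < max_items && !items_.empty()) {
            out.push_back(items_.front());
            items_.pop_front();
            ++n;
        }
        pthread_mutex_unlock(&lock_);
        return n;
    }

private:
    LockedQueue(const LockedQueue&);             // non-copyable
    LockedQueue& operator=(const LockedQueue&);  // non-assignable

    pthread_mutex_t lock_;
    std::deque<T> items_;
};
```

Items come out in the order they went in. The consumer processes the drained batch after the lock is released, so producers are only blocked for the duration of the copy.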

While we're at it, is the
       LogEmissary to LogWriter multiplicity 1:1 or 1:n?

1:1; it wouldn't be difficult to make it otherwise, but I haven't been able to think of a good reason to do so.

     - About BasicThread: it seems there is one channel for both control
       and data messages, i.e., the queue can include a special terminate
       event to shutdown the BasicThread. Are these control messages
       filtered out in the base class (BasicThread) code before the
       message is passed to the child (e.g., LogWriter) code?

More or less. An additional piece of the message (not shown on the diagram... not quite sure how to illustrate this yet) is a reference to the LogWriter that is used to process the message. For example:

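class FlushMessage : public EventMessage {
public:
     FlushMessage(LogWriter &ref) : _ref(ref) { }
     // Called by BasicThread after it pulls the message off the queue.
     void process() { _ref.flush(); }
private:
     LogWriter &_ref;  // the LogWriter this message acts on
};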

For the purposes of serialization, the reference can be expressed as a <LogWriter type, LogWriter path> pair.

Then, when BasicThread pulls the event off of the queue and calls process(), the appropriate LogWriter function is called.

For a terminate message, then:

class TerminateThread : public MessageEvent
{
public:
     TerminateThread(ThreadInterface &ref)
     : ref(ref) { }

     bool process()
         {
         ref.stop();
         return true;
         }

private:
     ThreadInterface &ref;
};

where BasicThread inherits from ThreadInterface, and thus the BasicThread is stopped when the event is processed.
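
Putting the two message types together, the dispatch loop can be sketched roughly as follows. This is a single-threaded stand-in, not the actual BasicThread: DemoThread, DemoWriter, post(), and run() are hypothetical names, and a plain deque replaces the thread-safe queue so the control flow is easy to follow.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>

class MessageEvent {
public:
    virtual ~MessageEvent() {}
    virtual bool process() = 0;
};

class ThreadInterface {
public:
    virtual ~ThreadInterface() {}
    virtual void stop() = 0;
};

// Hypothetical writer with just enough state to observe calls.
struct DemoWriter {
    int flushes;
    DemoWriter() : flushes(0) {}
    void flush() { ++flushes; }
};

// Data-path message: forwards to the writer it references.
class FlushMessage : public MessageEvent {
public:
    explicit FlushMessage(DemoWriter& w) : writer_(w) {}
    bool process() { writer_.flush(); return true; }
private:
    DemoWriter& writer_;
};

// Control message: stops the thread it references.
class TerminateThread : public MessageEvent {
public:
    explicit TerminateThread(ThreadInterface& t) : thread_(t) {}
    bool process() { thread_.stop(); return true; }
private:
    ThreadInterface& thread_;
};

// Stand-in for BasicThread: same loop shape, no real thread.
class DemoThread : public ThreadInterface {
public:
    DemoThread() : running_(true) {}
    void stop() { running_ = false; }
    void post(MessageEvent* m) { queue_.push_back(m); }

    // Processes messages until stopped or the queue is empty.
    // Returns the number of messages processed.
    std::size_t run() {
        std::size_t handled = 0;
        while (running_ && !queue_.empty()) {
            MessageEvent* m = queue_.front();
            queue_.pop_front();
            m->process();  // the loop never inspects the message type
            ++handled;
        }
        return handled;
    }

    std::size_t pending() const { return queue_.size(); }

private:
    bool running_;
    std::deque<MessageEvent*> queue_;  // non-owning pointers
};
```

Control and data messages share one channel, and the loop treats them identically; a terminate message takes effect only because its process() flips the flag the loop checks. Anything queued behind it stays unprocessed.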

I'm always open to better ideas :-)

Feedback / corrections / "THAT'S A STUPID MODEL!" are always
appreciated.

I like the architecture (+1 with your words ;-) ) and am looking forward
to seeing it in action. What I am really interested in are profiling
benchmarks (very easy with Google perftools) to see where the non-I/O
bottlenecks occur.

Probably going to start working on that after a bit; there's a crossover cable hooked up between two development machines (thanks Robin!) and a few GB of traces ready to go, but I took a break to address testing and tool-related stuff.

If you'd happen to know of a good way to measure contention on a Linux system, I'd love to hear about it; I'm planning on writing a few stap scripts to help out here, but it'd save me a lot of time if there were something that existed already and seemed to work pretty well.

Sorry if I have missed it, but do we now use 0mq as the messaging
middle layer, or is all message passing based on custom code?

All custom code. It didn't seem to make sense to require 0mq as a dependency when the logging infrastructure was the only thing that would use it.

--Gilbert

Sure, if it wouldn't be too much trouble :-) I'd love to see how they
manage to do that.

The project is called Lithe and has its own web presence here:

    http://parlab.eecs.berkeley.edu/lithe

I haven't looked into it yet, but the linked paper is probably the best way
to absorb the details.

To answer the question: a bro log stream is defined by three things -- a
path (where do I write), a record type (what do I write), and a writer
type (how do I write).

Got it.

Yeah, that was the intent. Wasn't really sure quite how to illustrate
that; any thoughts on a better way?

Not really. Maybe several boxes inside the QueueInterface to illustrate
that "there is something going on" inside this intelligent queue?!

The queue was thrown together pretty quickly; it's largely targeted
for single producer / single consumer, but I believe multiple producer
/ single consumer should work as well.

If it works, it works. I was just curious about the implementation.

Trying to use multiple consumers with this queue would likely result in
some kind of universe-ending quantum event involving the LHC, the
deflector dish on the USS Enterprise, and a relatively cute kitten with
gray fur and black tiger stripes.

I see... the kitten could cause real trouble [1].

If you'd happen to know of a good way to measure contention on a Linux
system, I'd love to hear about it;

I've only used Google perftools thus far to profile CPU and heap, but
haven't come across something similar for I/O.

    Matthias

[1] http://www.psiopradio.com/wp-content/uploads/sniper_kitten1.jpg

Hm, are logging streams defined slightly differently in the "core" than they are in the scripting language? Streams are only defined by their record type. Where to write and how to write are both handled by filters.

  .Seth