Broker::publish API

Yeah, I realize that. A direct port of the old logic was of course the
goal so far, with the drawbacks of that approach accepted &
understood. That's what's in place now; that's great and exactly as
planned. We can get 2.6 out this way, and it'll be fine.

My point is that now also seems like a good time to take stock of what
we got that way. That direct porting is finally getting us some sense
of where things aren't an ideal match between API and use cases yet.
And if there's something easy we can do about that before people start
relying on the new API, it seems that would be beneficial to do. But
we can see.

Robin

There's also a bunch of places that I think were written standalone first and then updated to work on a cluster in
place, resulting in some awkwardness...

Yeah, indeed, that's another source of complexity with these
scripts.

But if this was written in a more 'cluster by default' way, it would just look like:

Nice example. That's the kind of thing I hope we can do during the
next cycle: streamline the scripts to unify these kinds of logic.

Broker::publish could possibly be optimized for standalone use to raise the event directly when not run in a cluster.

Or we generally raise published events locally as well if the node is
subscribed to the destination topic. There are pros and cons for that
I think.

Robin

Yeah, I realize that. A direct port of the old logic was of course the
goal so far, with the drawbacks of that approach accepted &
understood. That's what's in place now; that's great and exactly as
planned. We can get 2.6 out this way, and it'll be fine.

I'm earnestly probing to try to get a better decomposition of the
issues that make it hard to understand cluster communication patterns.

There's the exercise of trying to answer "what *is* this script
doing?" and then there's also trying to answer "what *should* it be
doing?".

I seldom felt like I had definitive answers for the latter, but I can
see how it would be beneficial to do that and also broader
script/framework makeovers, possibly before 2.6, because it would help
inform whether new APIs are catering to "good" use-cases. Though my
thinking is it's not critical to get a 100% API/use-case match off the
bat and that there's some actionable stuff to take away from this
thread that is at least going to have us heading in a better direction
sooner rather than later...

My point is that now also seems like a good time to take stock of what
we got that way. That direct porting is finally getting us some sense
of where things aren't an ideal match between API and use cases yet.
And if there's something easy we can do about that before people start
relying on the new API, it seems that would be beneficial to do. But
we can see.

Yeah, agreed. What I've taken away from your earlier points is that
these smaller changes are seeming like they'd be beneficial to do
before 2.6:

* publish() API simplifications/compressions (pending decision on
exactly what those should be)
* enable message forwarding by default (meaning re-implement the one
or two subscription patterns that might create a cycle)
* see if any script-specific topics can instead use a pre-existing
"cluster" topic

What do you think?

A separate question/idea I just had: how much of the difficulty in
auditing the subscriptions and communication patterns came from having
to hunt down things in various scripts, and whether a more centralized
config could help? E.g. I don't know how the details would work out,
but I'm imagining a workflow where one edits a centralized config file
with subscription/node info in it and that auto-generates the code to
set them up. Sort of like working backward from the info in the PDF
you shared.

- Jon

Right, here's a crude graphic of the cluster layout from the docs:

https://github.com/bro/bro/blob/master/doc/frameworks/broker/cluster-layout.png

- Jon

* publish() API simplifications/compressions (pending decision on
exactly what those should be)

Yeah, with an eye on the semantics for forwarding (now and later),
and whether to raise published events locally as well if the host is
subscribed itself.

And maybe with a second eye on this: can we define these semantics so
that we can get rid of some of the "what node type am I?" checks? I'm
not sure what that would look like, but generally it would be nice if
one could just publish stuff liberally without worrying too much, and
the subscriptions and forwarding semantics do the right thing (not
always, but often).

* enable message forwarding by default (meaning re-implement the one
or two subscription patterns that might create a cycle)

Haven't quite made up my mind on this one. In principle yes, but
right now a host needs to be subscribed to a topic to forward it, if I
remember that right. That may limit how we use topics, not sure.
(E.g., if a worker wanted to talk to other workers, with "real"
forwarding/routing they'd just publish to the worker topic and that
message would get routed there, but not be processed at the
intermediary hops as well. With our current forwarding, the hops would
need to subscribe to the worker topic as well, and hence the event
would get raised there, too.)

* see if any script-specific topics can instead use a pre-existing
"cluster" topic

Yep.

difficult due to having to hunt down things in various scripts and
whether a more centralized config could help?

Yeah, that sounds useful for the cluster case: it could be part of the
cluster framework to define all the relevant node types with their
characteristics. That would also centralize how topics and connections
are set up, making later changes easier.

For other use cases, it should still be possible to configure things
independently, too, though (say, for talking to external Broker
applications).
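
To make the idea a bit more concrete, a rough sketch of what such a
centralized definition could look like (all names hypothetical, loosely
in the style of the existing cluster framework tables):

    # Hypothetical sketch: node classes and the topics they subscribe
    # to, defined in one place; the cluster framework would generate
    # the actual Broker::subscribe() calls (and peerings) from this.
    const node_subscriptions: table[string] of set[string] = {
        ["manager"] = set("bro/cluster/manager"),
        ["proxy"]   = set("bro/cluster/proxy"),
        ["worker"]  = set("bro/cluster/worker"),
    } &redef;

External Broker applications would simply not be listed here and keep
configuring their subscriptions directly.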

Robin

Yeah, that's how I also understand the current mechanisms would work.

Maybe can split it into two separate questions:

(1) enable the "explicit/manual" forwarding by default?
(2) re-implement any existing subscription cycles?

Answer to (2) may pragmatically be "yes" because they'd be known to
cause problems if ever (1) did become enabled (and also could be
problematic for a more sophisticated/automatic/implicit routing system
should that become available in the future... at least I think it's a
problem, but then again maybe connection-cycles would also still be a
problem at that point, not quite sure).

Answer to (1) may be "no" because we don't have a use for it at the
moment -- having the forwarding-nodes also raise events is not ideal,
but if we solved that, would it be useful? Maybe an idea would be to
extend the subscribe() API in Bro:

    function Broker::subscribe(topic_prefix: string, forward_only: bool &default=F);

I recall that we have access to both the message/event as well as the
topic string on the receiver side, so could be possible to detect
whether or not to raise the event depending on whether the topic only
has a matching subscription prefix that is marked as forward_only.

With that you could do something like:

    # On Manager
    Broker::subscribe(worker_to_worker_topic, T);

    # On Worker
    Broker::subscribe(worker_to_worker_topic);
    Broker::publish(worker_to_worker_topic, my_event);

There, my_event would be distributed from one worker to all workers
via the manager, but not sure that's as usable/dynamic as the current
"relay" mechanism because you also get a load-balancing scheme to go
along with it. Here, you'd only ever want to pick a single manager or
proxy to do the forwarding (subscribing like this on all proxies
causes all proxies to forward to all workers, resulting in undesired
event duplication).

So I guess that's still to say I'm not sure what the use of the
current forwarding mechanism would be if it were enabled. Also maybe
begs the question for later regarding the "real" routing mechanism: I
suppose that would need to be smart enough to do automatic
load-balancing in the case of there being more than one route to a
subscriber.

- Jon

Yeah, and let me add one thing: What if as a starting point for
modeling things, we assumed that we have global topic-based routing
available. Meaning if node A publishes to topic X, the message will
show up at all nodes that are subscribed to topic X anywhere, no
matter what the topology --- Broker will somehow take care of that. I
believe that's where we want to get eventually, through whatever
mechanism; it's not trivial, but also not rocket science.

Then we (A) design the API from that perspective and adapt our
standard scripts accordingly, and (B) see how we can get an
approximation of that assumption for today's Broker and our simple
clusters, by having the cluster framework hardcode what's needed.

(1) enable the "explicit/manual" forwarding by default?

Coming from that assumption above, I'd say yes here, doing it like you
suggest: differentiate between forwarding and locally raising an event
by topic. Maybe instead of adding it to Broker::subscribe() as a
boolean, we add a separate "Broker::forward(topic_prefix)" function,
and use that to essentially hardcode forwarding on each node just like
we want/need for the cluster. Behind the scenes Broker could still
just store the information as a boolean, but API-wise it means we can
later (once we have real routing) just rip out the forward() calls and
let Magic take its role. :slight_smile:
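
To sketch what that could look like (Broker::forward() doesn't exist
yet; the topic and event names here are just illustrative):

    # On each forwarding node (e.g. the manager): pass messages on
    # this topic along to peers, but don't raise the events locally.
    Broker::forward("bro/cluster/worker2worker");

    # On workers: normal subscribe and publish; the manager relays.
    Broker::subscribe("bro/cluster/worker2worker");
    Broker::publish("bro/cluster/worker2worker", my_event);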

As you say, we don't get load-balancing that way (today), but we still
have pools for distributing analyses (like the known-* scripts do).
And if distributing message load (like the Intel scripts do) is
necessary, I think pools can solve that as well: we could use a RR
proxy pool and funnel it through script-land there: send to one proxy
and have an event handler there that triggers a new event to publish
it back out to the workers. For proxies, that kind of additional load
should be fine (if load-balancing is even necessary at all; just going
through a single forwarding node might just as well be fine).

(2) re-implement any existing subscription cycles?

Now, here I'm starting to change my mind a bit. Maybe in the end, in
large topologies, it would be futile to insist on not having cycles
after all. The assumption above doesn't care about it, putting Broker
in charge of figuring it out. So with that, if we can set up
forwarding through (1) in a way that cycles in subscriptions don't
matter, it may be fine to just leave them in. But I guess in the end
it doesn't matter, removing them can only make things better/easier.

Also maybe begs the question for later regarding the "real" routing
mechanism: I suppose that would need to be smart enough to do
automatic load-balancing in the case of there being more than one
route to a subscriber.

Yeah, I'm becoming more and more convinced that in the end we won't
get around adding a "real" routing layer that takes of such things
under the hood.

Robin

On 08/08/18 17:48, Robin Sommer wrote:

I think it's safe to assume we have the cluster structure under our
own control; it's whatever we configure it to be. That's something
that's easier to change later than the API itself. Said differently:
we can always adjust the connections and topics that we set up by
default; it's much harder to change how the publish() function works.

I think in an earlier discussion (could be [Bro-Dev] Scaling out bro cluster communication) there was the idea of different types of data nodes that would serve different purposes. If that is still a design goal, it feels like the structure of a cluster could be more volatile than it used to be. Not sure how that fits to the current assumptions. Just wanted to bring that back into the discussion.

Let me try to phrase it differently: If there's already a topic for a
use case, it's better to use it. That's easier and less error-prone.
So if, e.g., I want to send my script's data to all workers,
publishing to bro/cluster/worker will do the job. And that will even
automatically adapt if things get more complex later.

Maybe a silly question: Would that work using further "specialized" topics like bro/cluster/worker/intel? From my understanding, one feature of topics is that one would be able to subscribe only to the things that one is interested in. Having a bunch of events just published to bro/cluster/worker seems counterproductive.

Maybe it's a *necessary* design, but that doesn't make it nice. :wink: It
makes it very hard to follow the logic; when reading through the
scripts I got lost multiple times because some "@if I-am-a-manager"
was somewhere half a page earlier, disabling the code I was currently
looking at for most nodes. We probably can't totally avoid that, but
the less the better.

I agree! One thing that could also help here is clear separation. In the intel framework that kind of code is encapsulated in a cluster.bro file, which is basically divided into a worker and a manager part. In the end it's a tradeoff between abstraction and flexibility.

Jan

different purposes. If that is still a design goal, it feels like the
structure of a cluster could be more volatile than it used to be.

It is, and we have some of that, and I think it fits in with the
discussion here too. In my mind, I see two separate things in this
discussion: one is a general Broker API that facilitates some very
different applications; and the 2nd is our cluster framework that uses
that API for a specific use-case. The latter is much easier to tune
for us in terms of how it uses Broker, as we can hide much of it
internally and adjust later, i.e., by adding a new node type. The
question for the cluster framework, then, is what API *it* provides
for scripts to share state in a cluster. And a part of the answer to
that could be "standardized topics" that are guaranteed to get the
information to where it needs to go.

Maybe a silly question: Would that work using further "specialized" topics
like bro/cluster/worker/intel? From my understanding, one feature of topics
is that one would be able to subscribe only to the things that one is
interested in. Having a bunch of events just published to bro/cluster/worker
seems counterproductive.

I hear you, but I think I haven't quite understood the concern yet.
Can you give me an example where the difference matters? What's
different between publishing intel events to bro/cluster/worker/intel
vs bro/cluster/worker if both go to all workers? Or is it so that some
workers can decide not to receive the intel events?

(And technically, subscriptions are prefix-based, so anybody
subscribing to bro/cluster/worker automatically gets
bro/cluster/worker/intel as well; not sure if that helps or hurts
here?)
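
For illustration (event names hypothetical), the prefix matching means:

    # Subscribing to the general worker prefix...
    Broker::subscribe("bro/cluster/worker");

    # ...also matches anything published to longer topics below it:
    Broker::publish("bro/cluster/worker/intel", my_intel_event);  # delivered
    Broker::publish("bro/cluster/worker", my_event);              # delivered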

Robin

> (1) enable the "explicit/manual" forwarding by default?

Coming from that assumption above, I'd say yes here, doing it like you
suggest: differentiate between forwarding and locally raising an event
by topic. Maybe instead of adding it to Broker::subscribe() as a
boolean, we add a separate "Broker::forward(topic_prefix)" function,
and use that to essentially hardcode forwarding on each node just like
we want/need for the cluster. Behind the scenes Broker could still
just store the information as a boolean, but API-wise it means we can
later (once we have real routing) just rip out the forward() calls and
let Magic take its role. :slight_smile:

Not sure there'd be anywhere we'd currently use Broker::forward()?
Or is it a matter of "if a user needed it for something, then it's
available"?

The only intra-cluster communication that's more than 1 hop at the
moment is worker-worker, but setting up a Broker::forward() route
wouldn't be my first thought as it's not currently a scalable
approach. I'd instead take the cautious approach of relaying via a
RR-proxy so one can add proxies to handle more load as needed.

However, I can see Broker::forward() could make it a bit easier for a
user wanting to manually set up a forwarding route between clusters or
other external applications. Is that a clear use-case we need to
cater to now? If so, then it would indeed be just saying "hey,
Broker::forward() is now a no-op since Broker has real routing
mechanisms and you can remove them".

As you say, we don't get load-balancing that way (today), but we still
have pools for distributing analyses (like the known-* scripts do).
And if distributing message load (like the Intel scripts do) is
necessary, I think pools can solve that as well: we could use a RR
proxy pool and funnel it through script-land there: send to one proxy
and have an event handler there that triggers a new event to publish
it back out to the workers. For proxies, that kind of additional load
should be fine (if load-balancing is even necessary at all; just going
through a single forwarding node might just as well be fine).

Seems more prudent not to guess whether a single, hardcoded forwarding
node is good enough when writing the default cluster-enabled scripts.
RR via proxy is not just load-balancing either, but fault-tolerance as
well.

But here you're talking more about removing the relay() functions and
doing the RR-via-proxy "manually", right? That seems ok to me -- once
"real" routing is available, you then have the option to simplify your
script and get a minor optimization by not having to manually
handle+forward the event on proxies.

> (2) re-implement any existing subscription cycles?

Now, here I'm starting to change my mind a bit. Maybe in the end, in
large topologies, it would be futile to insist on not having cycles
after all. The assumption above doesn't care about it, putting Broker
in charge of figuring it out. So with that, if we can set up
forwarding through (1) in a way that cycles in subscriptions don't
matter, it may be fine to just leave them in. But I guess in the end
it doesn't matter, removing them can only make things better/easier.

Again, I think we wouldn't have any Broker::forward() usages in the
default cluster setup, but simply enabling the forwarding of messages
at the Broker-layer would currently cause some messages to route in a
cycle. Enabling the current message forwarding means we need to
re-implement existing subscription cycles. If we instead waited for
the "real" routing, then it doesn't matter if we leave them in.

- Jon

Yeah, topic use-cases may need clarification. There's one desire to
use topics as a way to specify known destination(s) within a cluster.
Another desire could be using the topic name to hierarchically
summarize/describe a quality of the message content in order to share
with the external world. Maybe the thing that's currently unclear is
what the intended borders are for information sharing? I break it
down like:

(1) if the event you're publishing just facilitates scalable cluster
analysis: you'd tend to use the topic names which target node classes
within a cluster (eventually this might be "bro/<cluster-id>/worker")

(2) if the event you're publishing is intended for external
consumption, then you should use a topic which describes some specific
qualities of the message (e.g. "jan/intel")

Events that fall under (1) don't need to be descriptive since we don't
want to encourage people to arbitrarily start subscribing to events
that act as the details for how cluster analysis is implemented. Or I
guess if they do subscribe, then they are the kind of person that's
more interested in inspecting the cluster's performance/communication
characteristics anyway.

I'd also say that (2) is a user decision -- they need to be the one to
decide if their cluster has produced some bit of information worthy of
sharing to the external world and then publish it under a suitable
topic name.
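
Sketched in code (event and topic names hypothetical), the split would be:

    # (1) Cluster-internal plumbing: the topic names the destination
    #     node class, not the content.
    Broker::publish(Cluster::worker_topic, my_analysis_event);

    # (2) External sharing: the user picks a descriptive topic that
    #     summarizes the content.
    Broker::publish("myorg/intel/hits", my_intel_hit_event);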

That make sense?

- Jon

Or is it a matter of "if a user needed it for something, then it's
available"?

Yeah, including matching expectations: if there's a
"bro/cluster/worker" topic, I'd expect I can publish there to reach
all the workers (from anywhere). However, I think I'm with you now
that maybe we just shouldn't do/support any forwarding in the
cluster right now. Pools and manual relaying are a (currently better)
alternative, and we can change things later. And at least it's a clear
message: no forwarding across cluster nodes.

However, I can see Broker::forward() could make it a bit easier for a
user wanting to manually set up a forwarding route between clusters or
other external applications. Is that a clear use-case we need to
cater to now?

Well, if it were easy to add the forward() function, that could indeed
be quite useful for external integrations still. With that, one could
selectively forward custom topics (at one's own risk), without causing
a mess for the cluster. I'm thinking osquery integration for example,
where messages might go through an intermediary Bro. One advantage
that Broker-internal forwarding has compared to manual relaying is
that messages won't be propagated back to the sender.

But it's a matter of effort at this point I'd say.

RR via proxy is not just load-balancing either, but fault-tolerance as
well.

Yeah, that's right.

But here you're talking more about removing the relay() functions and
doing the RR-via-proxy "manually", right? That seems ok to me -- once
"real" routing is available, you then have the option to simplify your
script and get a minor optimization by not having to manually
handle+forward the event on proxies.

Ok, let's make that change then, I think removing relay() will help
for sure making the API easier.

Robin

If relay is removed how does a script writer efficiently get an event from one worker (or manager)
to all of the other workers?

I hear you, but I think I haven't quite understood the concern yet.
Can you give me an example where the difference matters? What's
different between publishing intel events to bro/cluster/worker/intel
vs bro/cluster/worker if both go to all workers? Or is it so that some
workers can decide not to receive the intel events?

The use case I had in my mind is an external application that is interested in interfacing with the intelligence framework, either for querying it similar to workers or for management purposes. If possible, it could be beneficial for such an application to receive only the relevant parts of cluster communication.

Old Worker:

  Cluster::relay_rr(Cluster::proxy_pool, my_event);

New Worker:

  Broker::publish(Cluster::rr_topic(Cluster::proxy_pool), my_event);

New Proxy:

  event my_event() { Broker::publish(Cluster::worker_topic, my_event); }

So the proxy has the additional overhead of running its event handler. I
doubt that's much of a problem from the "efficiency" standpoint, but if
it were, then just having more proxies helps. Once real routing were
available the code would still work or you could opt to change to
just:

Even Newer Worker:

  Broker::publish(Cluster::worker_topic, my_event);

See any problems there?

- Jon

I'm generally thinking there's nothing stopping one from picking a new
topic name to re-publish some set of events under. Would that be
possible in the case you're imagining?

I don't think we're going to come up with a general (or enforce-able)
way of picking topic names such that they'll be useful for any
arbitrary, external use-case. So we pick the topic name that is best
for the use-case we have at time of writing a script (e.g. we just
want to get it working on a cluster so use the pre-existing topics
that are available for that), and then let others re-publish a subset
of events under different topics dependent on their specific use-case.

- Jon

That's nice and simple :slight_smile:

Assuming that can send the events around in the most efficient way possible, that's perfect.

The one tricky case is doing that on the manager. While the manager is fully connected to all workers,
you really want to offload the fanning out of messages to one of the proxies.

Yeah, I don't know exactly how it would be implemented yet, but it seems
to warrant a policy/flag that one sets on the manager that means
"prefer sending along a 2-hop route rather than a 1-hop route if it
minimizes our own workload", or else a way to mark proxy nodes such
that any connected peers always prefer to send one routable message to
it rather than N direct messages. Maybe falls under "load-balancing"
of the prospective routing implementation, which I've tracked as
requiring these features:

* cycle detection/prevention
* network-wide subscription knowledge per-node
* load-balancing + proxying policies

Let me know if I missed any. I have implementation ideas/notes
already which basically requires associating node IDs with
subscription state and also message state (push node IDs into messages
upon receipt before forwarding), but we can maybe discuss and flesh it
out in a later design thread once we decide what exactly to do. As
for deciding what to do in the near term, seems like we will arrive at
agreeing upon:

(1) Remove relay(...) functions
(2) Reduce unique topic names (use pre-existing cluster topics where possible)
(3) Add Broker::forward(topic_prefix) function + enable Broker forwarding

An alternative to (3) would be implementing "real" routing in Broker
right from the start. No strong opinion there, but seems like it
could fall under nice-to-have at this point and, while it would
obsolete Broker::forward(), I don't expect that's much effort wasted.
Any other ideas?

- Jon

associating node IDs with subscription state and also message state
(push node IDs into messages upon receipt before forwarding),

Yeah, that sounds like the right direction. Some reading might be
worthwhile doing here; there are quite a few papers out there on
routing in overlay networks.

(1) Remove relay(...) functions
(2) Reduce unique topic names (use pre-existing cluster topics where possible)
(3) Add Broker::forward(topic_prefix) function + enable Broker forwarding

Yes, that sounds good to me, plus whatever that means for "publish()"
itself. I like what we have arrived at here.

One more question: what about raising published events locally as well
if the sending node is subscribed to the topic? I'm kind of torn on
that. I don't think we want that as a default, but perhaps as an
option, either with the publish() call or, likely better, with the
subscribe() call? I can see that being helpful in cases like unifying
standalone vs cluster operation; and more generally, for running
multiple node types inside the same Bro instance.

An alternative to (3) would be implementing "real" routing in Broker
right from the start.

In an ideal world, yes, that would certainly be nice to have. But it's
a larger task that I don't think we would be able to finish for 2.6
anymore. So, I'd put that on the list for later.

Robin

Not sure, is Broker::auto_publish() currently able to do the same thing?

e.g. if I want an event to be raised locally, I raise it via "event"
and it automatically gets published.

I can also see the opposite being intuitive: if I told
Broker::subscribe() to raise locally, then I could just always use
Broker::publish() and not think about the difference between using
"event" versus "publish". Would Broker::auto_publish() be removable
then?
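
For comparison, a sketch of the two styles (the boolean argument to
subscribe() is hypothetical, as discussed above):

    # Today: auto_publish() couples a local "event" statement with a
    # publish to the topic.
    Broker::auto_publish(Cluster::worker_topic, my_event);
    event my_event();  # raised locally *and* sent to workers

    # Hypothetical alternative: subscribe() opts into raising locally,
    # so a plain publish() covers both.
    Broker::subscribe(Cluster::worker_topic, T);  # T: also raise locally
    Broker::publish(Cluster::worker_topic, my_event);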

- Jon