Broker port status

Hi all, this is just a status update on porting Bro to use the new
Broker communication system. I'd say the topic/actor-system branch is
now functionally complete, with a few weeks of internal
cleanup/improvements still left.

Open questions:

* Rename "proxy" nodes?

The previous idea was to call them "data nodes", though I don't see
the benefit. It's awkward to talk about because there's no shorthand
for that node type: you can say "logger", "manager", or "worker",
but you'd have to say the whole "data node" phrase to make any
sense. That also suggests it's helpful to name nodes after a
personifiable role they play: "data" isn't a role/action
performed by a node/person. IMO, "proxy" is still accurate to the
role these nodes perform: they are intermediaries for offloading
analysis and/or storage. Are there other ideas, or does everyone want
to go ahead with "data node"?

* How much of the old comm. system to deprecate vs. remove?

(1) &synchronized, &mergeable, &persistent. Seems fine to deprecate these now.

(2) Communication framework scripts. It's awkward to deprecate things
here since the scripts would internally be using what is already
deprecated. I think it makes more sense to just remove them and let
users manually make a copy if they still need them. Thoughts?

(3) Old C++ comm. code, e.g. RemoteSerializer. I think we'll leave
this untouched for the 2.6 release?

(4) BIFs associated with old comm. system. Depends on (3) (and also
actually (2)), though if the internal core code remains, I think it
makes sense to add &deprecated to these.
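
For (4), the tagging would look something like the following. This is
a sketch only: the name/signature shown are just illustrative, and the
real change happens where the BIFs themselves are declared.

# Any use of the tagged identifier then emits a deprecation warning:
global send_ping: function(p: event_peer, seq: count): bool &deprecated;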

Anything else?

And just a quick summary of planned internal improvements:

There are still several usability issues with Broker data stores:

* Online vs. offline operation discrepancies. Expiration times in
Broker are driven by wall time, though for a more familiar/expected
offline usage experience, Bro would need to be able to drive
expiration via network time. There are also some wall-time-based data
store initialization delays that make offline Bro usage difficult,
since the Bro-side query timeouts are network-time-based.

* The way clone snapshots are distributed needs to be optimized.
Currently, for any clone that connects, the master floods *all* clones
with a snapshot.

* Certain store operations, like "increment", fail if the key they
are to operate on does not exist. It makes more sense to me to assume
the user wants to insert the key with a sane default value (e.g.
zero/empty) in those cases; otherwise it's awkward/race-prone to
require that they do it themselves (see the sketch below).
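
Here's roughly the check-then-insert dance scripts have to do today
(the store and key names are made up, and the exact API details are
illustrative):

global h: opaque of Broker::Store;

event bro_init()
{
        h = Broker::create_master("counters");

        when ( local r = Broker::get(h, "seen") )
        {
                # If the key is missing, insert a default first. Another
                # node can race us between this get and the increment.
                if ( r$status != Broker::SUCCESS )
                        Broker::put(h, "seen", 0);

                Broker::increment(h, "seen");
        }
}

With the proposed change, the get/put dance goes away and a bare
Broker::increment on a missing key would just start counting from zero.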

Let me know if there are thoughts on anything else still left to add/improve.

- Jon

> Hi all, this is just a status update on porting Bro to use the new
> Broker communication system. I'd say the topic/actor-system branch is
> now functionally complete, with a few weeks of internal
> cleanup/improvements still left.

Awesome, I'll get it deployed here on a test cluster.

> Open questions:
>
> * Rename "proxy" nodes?
>
> The previous idea was to call them "data nodes", though I don't see
> the benefit. It's awkward to talk about because there's no shorthand
> for that node type: you can say "logger", "manager", or "worker",
> but you'd have to say the whole "data node" phrase to make any
> sense. That also suggests it's helpful to name nodes after a
> personifiable role they play: "data" isn't a role/action
> performed by a node/person. IMO, "proxy" is still accurate to the
> role these nodes perform: they are intermediaries for offloading
> analysis and/or storage. Are there other ideas, or does everyone want
> to go ahead with "data node"?

I think I was the one calling them data nodes, but I only did that
because that's what they were called in the original Broker integration
branch that Mathias Fischer started.

I don't care about the name; as long as it's documented that a proxy
node is for "offloading analysis and storage", that works for me.

> * How much of the old comm. system to deprecate vs. remove?
>
> (1) &synchronized, &mergeable, &persistent. Seems fine to deprecate these now.

I'm fine with it going away, but there needs to be some sort of replacement for &synchronized,
minimally a how-to for porting existing scripts to something else.

I don't think anyone currently uses mergeable or persistent.

The answer is generally "use the new store API" to replace
&synchronized or &persistent. It's difficult to give a canonical
porting strategy as it depends on how the data is being
accessed/modified, though you can find a preview of the porting
how-to document at [1]. Links and stuff will get rendered more nicely
when it's live on bro.org.
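
To make that concrete, here's a minimal before/after sketch (the store
name and key types are made up; the document at [1] has the real
details):

# Before: implicit, cluster-wide synchronization.
global hosts_seen: table[addr] of count &synchronized;

# After (sketch): an explicit Broker data store. The manager creates
# the authoritative copy; workers would instead attach with
# Broker::create_clone("hosts_seen") and then read/write it
# asynchronously via Broker::get/Broker::put inside "when" blocks.
global host_store: opaque of Broker::Store;

event bro_init()
{
        host_store = Broker::create_master("hosts_seen");
}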

- Jon

[1] https://github.com/bro/bro/blob/topic/actor-system/doc/frameworks/broker.rst

The most common use of &synchronized here is for loading files into
tables using input+reread: a cron job updates a file on the manager,
the input framework rereads it into the in-memory table, and
&synchronized syncs it over to all the workers.

These are read-only tables, so I think the best replacement for those
may be to switch to the new configuration framework, which would
probably mean the deletion of a ton of verbose Input::add_table
statements.
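
For that pattern, I'd expect the port to look roughly like this sketch
(the option and file names are made up, and I'm assuming the config
framework behaves as the current dev docs describe):

# Declare the read-only table as an option instead of wiring it up
# with Input::add_table + &synchronized.
option blocked_hosts: set[addr] = {};

# Point the config framework at the file the cron job rewrites; the
# framework rereads it on change and syncs the value cluster-wide.
redef Config::config_files += { "/etc/bro/blocked_hosts.cfg" };

The config file itself would then just contain lines like:

blocked_hosts   10.1.2.3,10.4.5.6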

I got the actor-system branch running on a small test cluster:
manager+logger+proxy on one box and 32 workers on 2 physical boxes.

It seems to be working for the most part. Known hosts and friends are
working great now, with no duplicate entries at startup.

One thing I notice is that the traffic to/from the manager box, and
its CPU usage, have increased quite a bit.

Ignoring the large CPU spikes from building the new branch just before
the switch at 15:30, the overall CPU load on the manager box is about
3x higher. The bandwidth isn't terribly excessive, but it increased
from about 16mbit to 40mbit (stupid graph is in mebibytes).

Based on some capstats runs, it's all going to the logger port, 47761.
I have another graph that shows the total log volume being written to
disk, and that hasn't changed. So it looks like this branch uses 2x
the CPU and bandwidth to write the same volume of logs.


Interesting, I would have thought maybe the manager could be utilized
more, though not the logger. Will have to think about that.

An interesting experiment you could try is switching to an alternate
implementation of the Known scripts. E.g. stick this in local.bro:

redef Known::use_host_store = F;
redef Known::use_cert_store = F;
redef Known::use_device_store = F;
redef Known::use_service_store = F;

I'd expect that could reduce bandwidth a bit.

Using data stores, the Known scripts do:

worker -> manager: put this key into your store if it doesn't exist
manager -> worker: the key was accepted (or rejected)
worker -> logger: log the key (sent only if manager accepted key as unique)

Otherwise, the alternate version of the Known scripts does:

worker -> proxy: here's a key
proxy -> logger: log the key (if proxy hasn't seen it recently)

Maybe that cuts out 1/3 of the message volume contributed by the Known scripts.
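
In script terms, the alternate path is roughly the following sketch
(the event name and dedup table here are made up for illustration; the
real scripts differ):

module Known;

export {
        ## Hypothetical event carrying a host from a worker to a proxy.
        global host_seen: event(host: addr);
}

# On a proxy: dedup locally, so only hosts not seen recently get logged.
global recently_seen: set[addr] &create_expire = 1 day;

event Known::host_seen(host: addr)
{
        if ( host in recently_seen )
                return;

        add recently_seen[host];
        # ... then Log::write() the new host, as the scripts do today.
}

# On a worker: route each host to one proxy, chosen consistently by
# key, so a given host always lands on the same proxy's dedup table.
event connection_established(c: connection)
{
        Cluster::publish_hrw(Cluster::proxy_pool, c$id$orig_h,
                             Known::host_seen, c$id$orig_h);
}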

- Jon

>> One thing I notice is that the traffic to/from the manager box, and
>> its CPU usage, have increased quite a bit.
>>
>> Ignoring the large CPU spikes from building the new branch just before
>> the switch at 15:30, the overall CPU load on the manager box is about
>> 3x higher. The bandwidth isn't terribly excessive, but it increased
>> from about 16mbit to 40mbit (stupid graph is in mebibytes).
>>
>> Based on some capstats runs, it's all going to the logger port, 47761.
>> I have another graph that shows the total log volume being written to
>> disk, and that hasn't changed. So it looks like this branch uses 2x
>> the CPU and bandwidth to write the same volume of logs.

> Interesting, I would have thought maybe the manager could be utilized
> more, though not the logger. Will have to think about that.

Yeah.. it's definitely worker -> logger:

logger=47761, manager=47762, proxy=47763

[root@bro-test ~]# capstats -i enp3s0f0 -I 1 -f 'port 47761' -n 5
1520553009.488934 pkts=1173 kpps=1.1 kbytes=2837 mbps=22.6 nic_pkts=1175 nic_drops=0 u=0 t=1173 i=0 o=0 nonip=0
1520553010.489053 pkts=1280 kpps=1.3 kbytes=3218 mbps=26.4 nic_pkts=2455 nic_drops=0 u=0 t=1280 i=0 o=0 nonip=0
1520553011.489177 pkts=1086 kpps=1.1 kbytes=2912 mbps=23.9 nic_pkts=3541 nic_drops=0 u=0 t=1086 i=0 o=0 nonip=0
1520553012.489310 pkts=1382 kpps=1.4 kbytes=3757 mbps=30.8 nic_pkts=4923 nic_drops=0 u=0 t=1382 i=0 o=0 nonip=0
1520553013.489471 pkts=1563 kpps=1.6 kbytes=3816 mbps=31.3 nic_pkts=6486 nic_drops=0 u=0 t=1563 i=0 o=0 nonip=0
[root@bro-test ~]# capstats -i enp3s0f0 -I 1 -f 'port 47762' -n 5
1520553018.874154 pkts=346 kpps=0.3 kbytes=92 mbps=0.7 nic_pkts=348 nic_drops=0 u=0 t=346 i=0 o=0 nonip=0
1520553019.875111 pkts=1044 kpps=1.0 kbytes=283 mbps=2.3 nic_pkts=1391 nic_drops=0 u=0 t=1044 i=0 o=0 nonip=0
1520553020.875448 pkts=759 kpps=0.8 kbytes=199 mbps=1.6 nic_pkts=2150 nic_drops=0 u=0 t=759 i=0 o=0 nonip=0
1520553021.875575 pkts=654 kpps=0.7 kbytes=196 mbps=1.6 nic_pkts=2804 nic_drops=0 u=0 t=654 i=0 o=0 nonip=0
1520553022.875920 pkts=470 kpps=0.5 kbytes=123 mbps=1.0 nic_pkts=3274 nic_drops=0 u=0 t=470 i=0 o=0 nonip=0
[root@bro-test ~]# capstats -i enp3s0f0 -I 1 -f 'port 47763' -n 5
1520553025.700919 pkts=26 kpps=0.0 kbytes=8 mbps=0.1 nic_pkts=31 nic_drops=0 u=0 t=26 i=0 o=0 nonip=0
1520553026.701047 pkts=27 kpps=0.0 kbytes=9 mbps=0.1 nic_pkts=58 nic_drops=0 u=0 t=27 i=0 o=0 nonip=0
1520553027.701185 pkts=24 kpps=0.0 kbytes=8 mbps=0.1 nic_pkts=82 nic_drops=0 u=0 t=24 i=0 o=0 nonip=0
1520553028.701306 pkts=55 kpps=0.1 kbytes=18 mbps=0.2 nic_pkts=137 nic_drops=0 u=0 t=55 i=0 o=0 nonip=0
1520553029.701434 pkts=9 kpps=0.0 kbytes=3 mbps=0.0 nic_pkts=146 nic_drops=0 u=0 t=9 i=0 o=0 nonip=0
[root@bro-test ~]#

On another cluster seeing the same traffic, with 56 workers running 2.5.3ish, the logger port shows this at the same time:

1520553725.875230 pkts=681 kpps=0.7 kbytes=1946 mbps=15.5 nic_pkts=768 nic_drops=0 u=0 t=681 i=0 o=0 nonip=0
1520553726.875374 pkts=508 kpps=0.5 kbytes=1605 mbps=13.1 nic_pkts=1392 nic_drops=0 u=0 t=508 i=0 o=0 nonip=0
1520553727.875520 pkts=554 kpps=0.6 kbytes=1681 mbps=13.8 nic_pkts=2247 nic_drops=0 u=0 t=554 i=0 o=0 nonip=0
1520553728.875713 pkts=536 kpps=0.5 kbytes=1708 mbps=14.0 nic_pkts=2902 nic_drops=0 u=0 t=536 i=0 o=0 nonip=0
1520553729.875845 pkts=518 kpps=0.5 kbytes=1669 mbps=13.7 nic_pkts=3641 nic_drops=0 u=0 t=518 i=0 o=0 nonip=0

[root@bro-test ~]# broctl top manager logger
Name Type Host Pid Proc VSize Rss Cpu Cmd
logger logger bro-test 2016 parent 938M 283M 81% bro
manager manager bro-test 2107 parent 619M 244M 25% bro
[root@bro-test ~]#

[root@bro-dev ~]# broctl top manager logger
Name Type Host Pid Proc VSize Rss Cpu Cmd
logger logger bro-dev 10704 child 492M 150M 6% bro
logger logger bro-dev 10571 parent 1G 221M 0% bro
manager manager bro-dev 10734 parent 909M 497M 0% bro
manager manager bro-dev 10782 child 482M 110M 0% bro

The CPUs are about the same: test has an X5650 @ 2.67GHz and dev has an E5-2420 @ 1.90GHz.
According to PassMark, the older X5650 has slightly faster single-core performance.

> An interesting experiment you could try is switching to an alternate
> implementation of the Known scripts. E.g. stick this in local.bro:
>
> redef Known::use_host_store = F;
> redef Known::use_cert_store = F;
> redef Known::use_device_store = F;
> redef Known::use_service_store = F;

I tried doing that for an unrelated reason, but I'm not sure it works right.

You can't redef Known::use_service_store before the script is loaded, because then you get:

Can't document redef of Known::use_service_store, identifier lookup failed

but if you load the script first, the

@if ( Known::use_service_store )

block is already evaluated before you change the value.

[jazoff@bro-test ~]$ cat foo.bro
@load protocols/conn/known-services
redef Known::use_service_store=F;

event bro_init()
{
        print Known::services;
}
[jazoff@bro-test ~]$ bro foo.bro
error in ./foo.bro, line 6: unknown identifier Known::services, at or near "Known::services"

[jazoff@bro-test ~]$ cat foo.bro
redef Known::use_service_store=F;
@load protocols/conn/known-services

event bro_init()
{
        print Known::services;
}
[jazoff@bro-test ~]$ bro foo.bro
error in ./foo.bro, line 1: "redef" used but not previously defined (Known::use_service_store)
internal warning in ./foo.bro, line 1: Can't document redef of Known::use_service_store, identifier lookup failed
error in ./foo.bro, line 1 and /srv/bro/share/bro/policy/protocols/conn/known-services.bro, line 34: already defined (Known::use_service_store)
error in /srv/bro/share/bro/policy/protocols/conn/known-services.bro, line 40: value used but not set (Known::use_service_store)
error in /srv/bro/share/bro/policy/protocols/conn/known-services.bro, line 40: invalid expression in @if (Known::use_service_store)
error in /srv/bro/share/bro/policy/protocols/conn/known-services.bro, line 93: value used but not set (Known::use_service_store)
error in /srv/bro/share/bro/policy/protocols/conn/known-services.bro, line 93: invalid expression in @if (Known::use_service_store)

Yeah, I realized that and fixed it yesterday. Have you pulled the
latest commits from the topic/actor-system branch?

- Jon

> * Rename "proxy" nodes?

"compute" maybe?

> (1) &synchronized, &mergeable, &persistent. Seems fine to deprecate these now.

Agree.

> (2) Communication framework scripts. It's awkward to deprecate things
> here since the scripts would internally be using what is already
> deprecated. I think it makes more sense to just remove them and let
> users manually make a copy if they still need them.

Likewise agree.

> (3) Old C++ comm. code, e.g. RemoteSerializer. I think we'll leave
> this untouched for the 2.6 release?

Yep, I'm looking forward to ripping that out for 2.7. :-)

> (4) BIFs associated with old comm. system. Depends on (3) (and also
> actually (2)), though if the internal core code remains, I think it
> makes sense to add &deprecated to these.

Deprecating them sounds good.

> It makes more sense to me to assume the user wants to insert the key
> with a sane default value (e.g. zero/empty) in those cases; otherwise
> it's awkward/race-prone to require that they do it themselves.

Agree with this as well; it has always felt a bit awkward to me too.

Robin

Ah... I grabbed the latest commit from the wrong checkout and was using a commit from a few weeks ago.

OK, running the latest commit now (105fc386ef0f65a91839706641abae664c7f3e49).

I have noticed a problem where, at the top of the hour, the logger runs into some issue
while rotating the logs; all logging stops, but the logger buffers everything in memory until the box OOMs:

[root@bro-test ~]# broctl top manager logger proxy-{1,2,3}
Name Type Host Pid Proc VSize Rss Cpu Cmd
logger logger bro-test 14267 parent 5G 5G 93% bro
manager manager bro-test 15478 parent 649M 277M 25% bro
proxy-1 proxy bro-test 32061 parent 643M 309M 0% bro
proxy-2 proxy bro-test 32063 parent 642M 351M 12% bro
proxy-3 proxy bro-test 15732 parent 619M 322M 18% bro
[root@bro-test ~]# cat /bro/logs/current/*.log|wc -l
2627
[root@bro-test ~]# broctl top manager logger proxy-{1,2,3}
Name Type Host Pid Proc VSize Rss Cpu Cmd
logger logger bro-test 14267 parent 6G 5G 66% bro
manager manager bro-test 15478 parent 649M 277M 33% bro
proxy-1 proxy bro-test 32061 parent 643M 309M 26% bro
proxy-2 proxy bro-test 32063 parent 642M 351M 20% bro
proxy-3 proxy bro-test 15732 parent 619M 322M 26% bro
[root@bro-test ~]# cat /bro/logs/current/*.log|wc -l
2627
[root@bro-test ~]# broctl top manager logger proxy-{1,2,3}
Name Type Host Pid Proc VSize Rss Cpu Cmd
logger logger bro-test 14267 parent 7G 6G 75% bro
manager manager bro-test 15478 parent 649M 277M 25% bro
proxy-1 proxy bro-test 32061 parent 643M 309M 12% bro
proxy-2 proxy bro-test 32063 parent 642M 353M 25% bro
proxy-3 proxy bro-test 15732 parent 619M 322M 12% bro
[root@bro-test ~]# cat /bro/logs/current/*.log|wc -l
2627
[root@bro-test ~]# broctl top manager logger proxy-{1,2,3}
Name Type Host Pid Proc VSize Rss Cpu Cmd
logger logger bro-test 14267 parent 12G 11G 68% bro
manager manager bro-test 15478 parent 649M 277M 31% bro
proxy-1 proxy bro-test 32061 parent 643M 313M 18% bro
proxy-2 proxy bro-test 32063 parent 642M 359M 12% bro
proxy-3 proxy bro-test 15732 parent 619M 322M 6% bro
[root@bro-test ~]# ls -tl /bro/logs/current/|head
total 412
-rw-r--r--. 1 root root 682 Mar 9 11:00 stats.log
-rw-r--r--. 1 root root 14470 Mar 9 11:00 conn.log
[root@bro-test ~]# date
Fri Mar 9 11:06:59 CST 2018

Up until the point log rotation is supposed to occur, the log rate is normal.