package manager progress

The package manager client is at a point now where I think it would be usable. Documentation is here:

https://bro.github.io/package-manager/

There is a branch in the ‘bro’ repo called ‘package-manager’ that simply changes CMake scripts to install ‘bro-pkg’ along with bro. Here’s an example usage/session:

$ git clone --recursive --branch=package-manager git://bro.org/bro
...
$ cd bro && ./configure && make install
...
$ /usr/local/bro/bin/bro-pkg list all
default/jsiwek/bro-test-package
$ /usr/local/bro/bin/bro-pkg install bro-test-package
installed "bro-test-package"
loaded "bro-test-package"
$ /usr/local/bro/bin/bro packages
loaded bro-test-package plugin
loaded bro-test-package scripts
$ /usr/local/bro/bin/broctl
Test package: initialized

That test package shows that bro-pkg was able to install a package containing Bro scripts, a Bro plugin, and a BroControl plugin and everything should “just work” without needing any configuration.

Roadmap/TODO/Questions:

* Add a way for packages to define “discoverability metadata”.

E.g. following the original plan for this would involve putting something like a “tags” field in each package’s pkg.meta file, but the problem with this is the client would need to either download every package to be able to search this data or have a third-party periodically aggregate it.

My current idea is that instead of putting this type of data inside the package’s metadata, the user puts it in the package source’s metadata. They do this on first registration and may update it whenever. That way, bro-pkg always has access to latest discoverability metadata, no need for a separate aggregation process. It’s also something that will rarely change, so not a problem for that data to live in a repo not owned by the package author and not much increased burden for Bro Team to accept pull requests to update this data. Thoughts?
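
To make the idea concrete, here is a minimal sketch of a tag search that only needs data kept in the package source. The index layout, the "tags" field, the placeholder URLs and the search_by_tag helper are all hypothetical, not what bro-pkg does today; the point is just that nothing beyond the already-cloned source repo has to be downloaded.

```python
# Hypothetical source-level index: package path -> metadata that lives in
# the package source repo, not in the package's own pkg.meta.
# URLs are placeholders for illustration only.
SOURCE_INDEX = {
    "jsiwek/bro-test-package": {
        "url": "https://example.com/jsiwek/bro-test-package.git",
        "tags": ["test", "example"],
    },
    "alice/foo": {
        "url": "https://example.com/alice/foo.git",
        "tags": ["log writer", "example"],
    },
}


def search_by_tag(index, tag):
    """Return sorted package paths whose tag list contains `tag` (case-insensitive)."""
    wanted = tag.lower()
    return sorted(
        name
        for name, meta in index.items()
        if wanted in (t.lower() for t in meta.get("tags", []))
    )


if __name__ == "__main__":
    print(search_by_tag(SOURCE_INDEX, "example"))
```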

* Automatic inter-package dependency analysis

Simply a TODO. I put it at lower priority since I don’t think it will be common right off the bat to have complex package dependencies and users can always manually resolve dependencies at the moment.

* Is it acceptable to depend on GitPython and semantic_version python packages?

Both are replaceable implementation details, just didn’t want to write something myself if not necessary and in interest of time.

* Documentation is hosted on GitHub at the moment, move to bro.org?

Mostly just on GitHub now to be able to show something without having to touch any of the master bro/www doc generation processes, but maybe it’s a nice thing to start keeping docs more compartmentalized? The current doc/www setup feels like it’s getting rather large/monolithic and maybe that contributes to the difficulty of approaching/understanding it when there are breakages. Just an idea.

* Thoughts on when to merge ‘package-manager’ branch in ‘bro’ ?

IMO, it can be done now or soon after I address responses/feedback to this email.

- Jon

Amazing work! I really like the package manager and I am looking forward
to contributing a script.

* Add a way for packages to define “discoverability metadata”.

E.g. following the original plan for this would involve putting something like a “tags” field in each package’s pkg.meta file, but the problem with this is the client would need to either download every package to be able to search this data or have a third-party periodically aggregate it.

I think this is a question about who should deal with the extra effort:
On the one hand requiring to spread and sync information between two
places introduces a burden for the contributors, on the other hand
(automatic) aggregation of information makes it harder to maintain a
source including metadata. I am in favor of putting that information
into pkg.meta to make contributing as easy as possible.

One note: I think the documentation should contain a tremendous warning
pointing out that the users are responsible for what they are
installing. One scenario that came instantly to my mind: Someone is
contributing a small and useful script, waits for its distribution and
then updates his repository, adding e.g. a malicious build command. In
that context it would be nice if the package manager would ask the user
before executing the build command. For the official repository also
some automatic checks would be nice (e.g. indicating in case a script
executes shell commands). I think that was discussed before.

All in all I think the package manager design is intuitive and really
easy to use. Having central repositories will be great!

Thanks,
Jan

The package manager client is at a point now where I think it would be usable.

Cool!

* Add a way for packages to define “discoverability metadata”.

E.g. following the original plan for this would involve putting
something like a “tags” field in each package’s pkg.meta file, but the
problem with this is the client would need to either download every
package to be able to search this data or have a third-party
periodically aggregate it.

What does "downloading" a package mean? If the package is in the
.gitmodules of the repo bro/packages, won't it be automatically
downloaded once the user updates their submodules?

I put it at lower priority since I don’t think it will be common right
off the bat to have complex package dependencies and users can always
manually resolve dependencies at the moment.

Agreed on inter-package dependencies. How about specifying a specific
Bro version as "dependency"?

* Documentation is hosted on GitHub at the moment, move to bro.org?

A key benefit of hosting it at github is reliability and that clients
get good viewing performance thanks to their CDN.

The current doc/www setup feels like it’s getting rather
large/monolithic and maybe that contributes to the difficulty of
approaching/understanding it when there’s breakages. Just an idea.

Keeping it separate could be an advantage for users, because the current
documentation is a bit unwieldy and confusing. Since you've written it
in RST, have you thought about publishing it via read-the-docs? Their
documentation really reads very smoothly out of the box. CAF, for
example, recently switched to it [1].

Some minor feedback:

- Is the "refresh" command essentially what "update" is to Homebrew? The
  documentation says:

    Update local package source clones to retrieve information about new
    packages that are available. Also fetches updated package
    information about any installed packages to determine if new
    versions are available.

  It sounds like this means it's doing submodule update.

- The documentation of the "list" command says:
  
    Filters available/installed packages by a chosen category and then
    outputs that filtered package list.

  I don't understand what "available" means here. It could also mean
  "packages that exist remotely but not installed locally" as opposed to
  "available for use right now." To avoid ambiguity and clearly
  distinguish it from "search", I would make that clear in the
  documentation.

- Regarding pkg.meta: this is more of a nit/style thing, but I like
  minimalistic naming of configuration options, e.g.:

    [package]
    version = 1.0.0
    scripts = /path/to/scripts
    plugins = /path/to/plugins

  I find them easier to remember. But Robin would probably disagree with
  me here :-).

Looking forward to seeing it shape up!

    Matthias

[1] CAF User Manual — CAF 0.19.4 documentation

* Add a way for packages to define “discoverability metadata”.

E.g. following the original plan for this would involve putting something like a “tags” field in each package’s pkg.meta file, but the problem with this is the client would need to either download every package to be able to search this data or have a third-party periodically aggregate it.

I think this is a question about who should deal with the extra effort:
On the one hand requiring to spread and sync information between two
places introduces a burden for the contributors

The idea was not for contributors to have to keep syncing the information between two places, the “discoverability” metadata would just be located within the “package source” instead of the package itself.

My thinking is that discoverability metadata should be more of a property of a package source than the package itself — e.g. if a user is looking at discoverability data in a package’s pkg.meta file, it’s not that helpful because they’ve already found the package.

Also, some people may initially have no intention of sharing their package, so there’s no reason to put discoverability metadata in its pkg.meta. If they later change their mind, and care enough to take the time to register it to a package source, then likely they don’t mind adding a few keywords to a new meta file as an optional part of the one-time registration process.

One note: I think the documentation should contain a tremendous warning
pointing out that the users are responsible for what they are
installing

Thanks for the suggestion, I’ll do that.

- Jon

* Add a way for packages to define “discoverability metadata”.

E.g. following the original plan for this would involve putting
something like a “tags” field in each package’s pkg.meta file, but the
problem with this is the client would need to either download every
package to be able to search this data or have a third-party
periodically aggregate it.

What does "downloading" a package mean? If the package is in the
.gitmodules of the repo bro/packages, won't it be automatically
downloaded once the user updates their submodules?

Right now, packages don’t get downloaded via the submodule, they are cloned directly from the package’s full git URL (which git just happens to encode within the submodule).

So this means only packages a user is interested in end up getting downloaded. I think it also helps in cases where a user installs a package and later it gets removed from the package source: the submodule is gone, but the user’s installed version is not affected because it’s cloned directly from the package’s git URL. I.e. the package manager doesn’t distinguish between packages installed from a package source and packages installed directly from a git URL.

If we wanted, we could actually use something completely different from git submodules to register a package to a package source. The package source just has to have some sort of database that links nodes in a package hierarchy (e.g. alice/foo, bob/bar, eve/baz) to their actual URLs. Git submodules just happen to perform this role. Maybe another added benefit of submodules is that if someone (e.g. a web frontend) does want to download the “universe of packages” (maybe to do some global stats/analysis on it) it’s easy to do that with a single builtin git command.
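
As a rough illustration of the “database” role, the sketch below reads a .gitmodules file and builds the mapping from package path to URL without touching the submodules themselves. The sample content, the placeholder URLs and the package_urls helper are made up for this example and are not taken from bro-pkg.

```python
import configparser
import re

# Made-up .gitmodules content for a package source; URLs are placeholders.
GITMODULES = '''
[submodule "alice/foo"]
    path = alice/foo
    url = https://example.com/alice/foo.git
[submodule "bob/bar"]
    path = bob/bar
    url = https://example.com/bob/bar.git
'''


def package_urls(gitmodules_text):
    """Map each submodule path (the package's place in the hierarchy) to its git URL."""
    # Strip git's leading indentation so configparser sees plain "key = value" lines.
    cleaned = "\n".join(line.strip() for line in gitmodules_text.splitlines())
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(cleaned)
    urls = {}
    for section in parser.sections():
        # Only sections of the form: submodule "<name>"
        if re.fullmatch(r'submodule "(.+)"', section):
            urls[parser[section]["path"]] = parser[section]["url"]
    return urls


if __name__ == "__main__":
    for path, url in sorted(package_urls(GITMODULES).items()):
        print(path, "->", url)
```

Any other file format carrying the same path-to-URL pairs would serve the client equally well.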

Agreed on inter-package dependencies. How about specifying a specific
Bro version as "dependency”?

Yep, that’s also on the TODO list.

have you thought about publishing it via read-the-docs?

Yeah, just haven’t looked into it. I’ll do that unless consensus is to host the docs on bro.org.

Some minor feedback:

- Is the "refresh" command essentially what "update" is to Homebrew? The
documentation says:

   Update local package source clones to retrieve information about new
   packages that are available. Also fetches updated package
   information about any installed packages to determine if new
   versions are available.

It sounds like this means it's doing submodule update.

I’ll try to clarify it in the docs. It doesn’t do a recursive submodule update, it just updates the source repo itself (so submodule additions/removals are visible, but nothing further is actually downloaded).

- The documentation of the "list" command says:

   Filters available/installed packages by a chosen category and then
   outputs that filtered package list.

I don't understand what "available" means here. It could also mean
"packages that exist remotely but not installed locally" as opposed to
"available for use right now.”

It means the former — “list” operates on the combined set of installed and not-yet-installed packages.

Does wording it like “Filters known packages...” make it clearer for you?

- Regarding pkg.meta: this is more of a nit/style thing, but I like
minimalistic naming of configuration options, e.g.:

   [package]
   version = 1.0.0
   scripts = /path/to/scripts
   plugins = /path/to/plugins

Open to changing it, but seeing “scripts” as an option, without reading any further documentation, implies to me that you might be able to specify a list of paths/files there, which you can’t.

- Jon

Kind of related to this, I think we need to define some basic rules for package naming. This can help discoverability and also namespacing issues. Right now we have plugins named:

af_packet
elasticsearch
kafka
myricom
netmap
pf_ring
redis
tcprs

But I think they need to be renamed using prefixes like:

af_packet - pktsrc-af_packet
elasticsearch - log-writer-elasticsearch
kafka - log-writer-kafka
myricom - pktsrc-myricom
netmap - pktsrc-netmap
pf_ring - pktsrc-pf_ring
redis - log-writer-redis
tcprs - analyzer-tcprs

In one aspect the pktsrc- prefix acts like a tag, but can also help disambiguate plugins... i.e., a redis log writer plugin vs. a redis data store plugin vs. a redis protocol analyzer.

Right now, packages don’t get downloaded via the submodule, they are
cloned directly from the package’s full git URL (which git just
happens to encode within the submodule).

So this means only packages a user is interested in end up getting
downloaded.

I'm not 100% following. Isn't every package recorded as submodule? Is
there any use case where you would do a submodule update? Or are the
packages just recorded there instead of recording them in a separate
file?

The package source just has to have some sort of database that links
nodes in a package hierarchy (e.g. alice/foo, bob/bar, eve/baz) to
their actual URLs. Git submodules just happen to perform this role.

(Yeah, reusing this makes sense)

> Filters available/installed packages by a chosen category and then
> outputs that filtered package list.
>
> I don't understand what "available" means here. It could also mean
> "packages that exist remotely but not installed locally" as opposed to
> "available for use right now.”

It means the former — “list” operates on the combined set of installed and not-yet-installed packages.

Does wording it like “Filters known packages...” make it clearer for you?

I think "known" is also ambiguous, because it doesn't clearly convey
the local aspect. How about just saying "filters installed packages"?

[..] but seeing “scripts” as an option, without reading any further
documentation, implies to me that you might be able to specify a list
of paths/files there, which you can’t.

Fair point. The reduction certainly omits some semantics. To simplify
reading the options, maybe add an underscore, e.g., script_path and
plugin_path?

    Matthias

Right now, packages don’t get downloaded via the submodule, they are
cloned directly from the package’s full git URL (which git just
happens to encode within the submodule).

So this means only packages a user is interested in end up getting
downloaded.

I'm not 100% following. Isn't every package recorded as submodule?

Every package within a package source is recorded as a git submodule and that recording happens at the time the package author registers their package with a source. The bro-pkg client itself makes no changes to submodules.

Is there any use case where you would do a submodule update?

Depends on who “you” refers to:

- a regular bro-pkg user: no, they don’t need to be aware that submodules are used

- a package author: no, they only care that submodules are used when they do the one-time registration process to add their package to a source

- the bro-pkg developer/maintainer: not currently, but that’s maybe an implementation detail. I don’t currently ever update submodules and instead clone packages directly via their full git URL to a separate location because I think that’s the more robust implementation.

- some other entity that does periodic analysis on all packages (e.g. web frontend): I’d probably expect them to not be using bro-pkg at all, but they clone a package source and do recursive submodule updates on it as the easiest way of downloading the latest versions of everything.

Or are the
packages just recorded there instead of recording them in a separate file?

Right, using git submodules isn’t a requirement for the bro-pkg client to work, we could make up a different file/format for registering packages. But maybe submodules do provide some convenience for the last use case mentioned above.

I think "known" is also ambiguous, because it doesn't clearly convey
the local aspect. How about just saying "filters installed packages”?

Not all subcategories of “list” are working with just the locally “installed” packages. E.g. “list all” is looking at both installed packages (local git repos) and not-installed packages (remote git repos, but we know about them because they are registered with a source). How about this description:

“The ‘list’ command outputs a list of packages that match a given category”
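
A minimal sketch of that behavior, assuming three categories. Only “all” is mentioned in this thread; the names “installed” and “not_installed” and the list_packages helper are hypothetical.

```python
def list_packages(known, installed, category="all"):
    """Return the sorted package paths that match `category`.

    known     -- every package registered with a configured source
    installed -- packages with a local clone; may include ones that are
                 not (or no longer) registered with any source
    """
    known, installed = set(known), set(installed)
    if category == "all":
        selected = known | installed
    elif category == "installed":
        selected = installed
    elif category == "not_installed":
        selected = known - installed
    else:
        raise ValueError("unknown category: %r" % category)
    return sorted(selected)


if __name__ == "__main__":
    known = ["default/jsiwek/bro-test-package", "default/alice/foo"]
    installed = ["default/jsiwek/bro-test-package"]
    print(list_packages(known, installed, "not_installed"))
```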

maybe add an underscore, e.g., script_path and plugin_path?

Yeah, can do that. And maybe “dir” is more meaningful than “path” since the latter may mean file or directory?
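
For illustration, this is roughly how a metadata file with the “_dir” names would read from Python. The option names script_dir and plugin_dir are only the ones floated in this thread, and the sample file and read_pkg_meta helper are hypothetical.

```python
import configparser

# Hypothetical pkg.meta using the "_dir" option names discussed above.
PKG_META = """
[package]
version = 1.0.0
script_dir = scripts
plugin_dir = plugins
"""


def read_pkg_meta(text):
    """Parse pkg.meta text and return the [package] options as a plain dict."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    return dict(parser["package"])


if __name__ == "__main__":
    print(read_pkg_meta(PKG_META))
```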

- Jon

> I'm not 100% following. Isn't every package recorded as submodule?

Every package within a package source is recorded as a git submodule
and that recording happens at the time the package author registers
their package with a source. The bro-pkg client itself makes no
changes to submodules.

Got it, thanks! Also, this page of the manual really helped me fill in
the missing pieces:

    https://bro.github.io/package-manager/source.html

[..] How about this description:

“The ‘list’ command outputs a list of packages that match a given
category”

Yep, my favorite so far!

> maybe add an underscore, e.g., script_path and plugin_path?

Yeah, can do that. And maybe “dir” is more meaningful than “path”
since the latter may mean file or directory?

Also agreeing here.

    Matthias

I actually don't like this that much because some of these can cross boundaries and do all sorts of different things in a single plugin. It makes more sense to me to leave the naming open. If people want to name a plugin with a prefix, they're free to, but I wouldn't want to discourage people from maintaining individual plugins that provide a variety of features.

  .Seth

We really need to do this though, the end result otherwise will be chaos. Package names shouldn't have a generic name just because it was the first one in the repository.

Leaving it open will lead to:

The first person that writes a redis plugin for log writing calls it 'redis'.
Then a redis analyzer is called 'redis-analyzer'
Then someone writes a redis input source and that gets called 'input-source-redis'

Then a postgres analyzer is written and named 'postgresql'.
Then a postgres log writer plugin is named 'postgresql-log-writer'.
Then an input source is written named 'postgresql-input-source'.

So a year later we end up with packages named:

redis
redis-analyzer
input-source-redis
postgresql
postgresql-log-writer
postgresql-input-source

Where 'redis' is a log writer plugin and 'postgresql' is an analyzer.

Where the input source plugins are interchangeably named input-source-redis and postgresql-input-source.

If someone wanted to write a redis plugin that was both an input source, an analyzer, and a log writer, that could be called 'redis'... letting anything else be called 'redis' is confusing and misleading.

I actually don't like this that much because some of these can cross
boundaries and do all sorts of different things in a single plugin.
It makes more sense to me to leave the naming open.

I'm with Seth on this one. The reason why I think we should keep the
naming open is that it's the job of the meta data tags to take care of
the grouping. If someone writes a redis package, then they should apply
the redis tag. Encoding this meta data into the package name is
quite limited, however.

    Matthias

And to add a me three to this - I am also with him on this one. On top of things - I might misremember this, but didn't we plan package names to include the github user name at one point in time? So a package name would be user/redis, for example, and there also could be user2/redis?

Johanna

Yes, package sources support hierarchical package names, but don’t require it. The hierarchy for the default package source is currently “github_user_name/package_name”. I’m the only one w/ a package at the moment, but you can see the structure here:

https://github.com/bro/packages

Right now, a user of bro-pkg can refer to my package as simply “bro-test-package”. If another user, say “bob”, creates “bob/bro-test-package”, then the client will no longer accept “bro-test-package” for commands where it is ambiguous and tell the user to clarify between either “bob/bro-test-package” or “jsiwek/bro-test-package”.
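
The lookup rule described here can be sketched as below. This is a simplified stand-in rather than the client’s actual code; the resolve helper and the error wording are hypothetical.

```python
def resolve(name, known):
    """Resolve a full or shortened package name against known full paths.

    Returns the single matching full path. Raises LookupError if nothing
    matches or if a shortened name matches more than one package.
    """
    if name in known:
        return name  # a full path is always accepted as-is
    # A shortened name matches any path that ends in "/<name>".
    matches = sorted(p for p in known if p.endswith("/" + name))
    if not matches:
        raise LookupError("no such package: %s" % name)
    if len(matches) > 1:
        raise LookupError(
            "'%s' is ambiguous, please clarify: %s" % (name, ", ".join(matches))
        )
    return matches[0]


if __name__ == "__main__":
    known = ["default/jsiwek/bro-test-package"]
    print(resolve("bro-test-package", known))
```

Once “default/bob/bro-test-package” is added to the known list, the short name raises the “please clarify” error while both longer forms keep resolving.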

It’s also not allowed to have two packages with the same shortened name (e.g. “bro-test-package”) installed simultaneously. Interested to hear if people have use cases for that, but I expect the common case for same-name packages to be forks (either hard forks or just forking to contribute bugfixes) and allowing multiple packages of the same name to be installed may make that case more confusing/complicated for users/developers.

- Jon

Make it four. :) I'm with Seth, too, better not to enforce any naming
scheme because the boundaries are unclear. Also, note that a single
binary Bro plugin can provide multiple quite different things (say, a
reader and an analyzer and a packet source all at the same time, if
one so desires :).

Also agree with Johanna: the username is part of the package name if I
follow correctly, so there's disambiguation there.

I have some more feedback on the package manager and Jon's questions
starting this thread, will send soon.

Robin

Ah, I see. Would it be better to generally use the full path as the
name, and not search for submatches, to make it consistent/unambiguous
what a name refers to?

Robin

At least from my usage it’s been convenient to be able to use a short name. It still always accepts full path names for packages even if they’re unambiguous when shortened, and the full path is what gets displayed in package listings so it’s never inconsistent in that regard. A user is free to always type full paths and for those that like to use short names an occasional “please clarify” may be more helpful than annoying: e.g. “oh I didn’t realize there were two, I should look into which one is more appropriate for me to use”.

- Jon

The package manager client is at a point now where I think it would be
usable.

Finally got a chance to play with it a bit. Excellent work, I really
like it!

Below is a list of just smaller things I noticed. The only larger
question I have is regarding the use of submodules, also following up
on other parts of this thread. In principle, I actually quite like the
idea of using submodules; git already offers the mechanism, so why not
build on that. That said, seeing how the package manager ends up using
submodules, it's not quite the pure git model actually. If I
understood it right, it's using them really only to *find* the
external repos, but not to pinpoint a particular commit in there; the
package source never even updates the submodules. Given that approach,
I'm now wondering if a custom scheme wouldn't be the more intuitive
solution. My concern is that whoever looks at this submodule usage,
will take a while to understand what's actually happening. One could
argue that's only an implementation detail and shouldn't really matter
to anybody. On the other hand, if, for example, somebody ends up
browsing the package source repository on GitHub, I'm sure they'd be
confused by all the packages pointing to very old versions. So I'm
wondering if it would be worth switching to a custom index instead of
submodules; seems that wouldn't be difficult either if indeed all we
need to do is track the external URLs somehow. Also, if you want to
track discoverability metadata there already as well, seems that the
URL could just become part of that, no?

Here's my list of other random things I noticed:

- Would suggest to rename “pkg.meta” to, say, “bro-pkg.meta”, just to
  make it more explicit that it's a Bro package. That's something one
  can also then search for on GitHub.

- Does "upgrade" show the packages affected and ask for confirmation?
  I would suggest either doing that or require an --all option for
  upgrading everything, as that's a potentially dangerous operation.

- I suppose upgrading does (better: will do) dependency checking
  again, including making sure the Bro version matches the one that
  update now requires?

- When installing the package manager as part of Bro, could we pull in
  the Python dependencies automatically, for example by installing
  them into the same prefix? Both GitPython and semantic_version are
  pretty non-standard. Using them is ok I think but it would be nice
  if "bro-pkg" wouldn't abort first thing because they aren't
  installed yet.

- How about adding a note to either packages.bro or the whole
  packages/ directory that it's automatically maintained and not
  supposed to be manually messed with?

- In bro-pkg.conf, has "default" in "[sources]" a special meaning, or
  could it be any tag? Assuming the latter, I would just call it
  "bro": "bro/jsiwek/bro-test-package" is more intuitive than
  "default/jsiwek/bro-test-package".

- For our default package source, do we want to support non-GitHub
  repositories? If so, a naming scheme by GitHub user won't work.

- Suggest to rename "/opt/bro/var/lib/package-manager" to
  "../bro-package-manager" or "../bro-pkg".

- Once we support dependencies on Bro versions, would be nice if that
  worked also with the "x.y-z" scheme that git master uses (and maybe
  it just does anyways).

- I like the Python API!

- Documentation (nice!):

    - Python 3.x works, right? Then I'd list that explicitly.

    - A quick-start guide would be helpful that just mentions the most
      important steps, including basic installation along with Bro
      itself (once that's merged).

    - The "Installation" section becomes a bit confusing towards the
      end with all those paths. Maybe split some parts out into an
      advanced section or so?

    - How-tos would be helpful that show by example how to create
      (1) a pure script package, (2) a binary Bro plugin, and (3) a
      BroControl plugin.
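
On the "x.y-z" item in the list above, here's a rough sketch of what I have in mind. It assumes z counts commits after release x.y, so that "x.y-z" sorts after "x.y"; that ordering, the sample version strings, and both helper names are only illustrative assumptions, not existing bro-pkg code.

```python
import re


def parse_bro_version(text):
    """Turn "x.y" or "x.y-z" into a comparable (x, y, z) tuple; z defaults to 0."""
    match = re.fullmatch(r"(\d+)\.(\d+)(?:-(\d+))?", text)
    if match is None:
        raise ValueError("unrecognized version: %r" % text)
    major, minor, extra = match.groups()
    return (int(major), int(minor), int(extra or 0))


def satisfies_minimum(installed, required):
    """True if the installed Bro version is at least the required one."""
    return parse_bro_version(installed) >= parse_bro_version(required)


if __name__ == "__main__":
    print(satisfies_minimum("2.4-887", "2.4"))
```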

Finally, my take on your questions:

My current idea is that instead of putting this type of data inside
the package’s metadata, the user puts it in the package source’s
metadata.

Yeah, I like that.

Automatic inter-package dependency analysis

Agree on lower priority.

* Is it acceptable to depend on GitPython and semantic_version python packages?

It is, it just would be nice if we could help people a bit getting
these installed, see above.

* Documentation is hosted on GitHub at the moment, move to bro.org?

Keeping these docs separate makes sense, although it would also be
nice to have the option to integrate into bro.org at least. For now,
I think it's fine to just do your own Sphinx and publish it wherever
(GitHub, RTD). We can later see what to do about bro.org. Generally,
our bro.org setup certainly needs work, it's become hard to maintain
and extend. We have been talking about some options here recently but
not settled on anything in particular yet.

* Thoughts on when to merge ‘package-manager’ branch in ‘bro’ ?

The main question is if we see this as a 2.5 feature? If so, we should
merge soon; if not, postpone until that's out the door. Given how far
you are already, I vote for making it part of 2.5. We could declare it
experimental still for 2.5 to get some more time to iron out the
workflow before we tell people it's ok to start relying on it.

Again, all very nice work!

Robin

I may have lost track of the design so I don't know where things stand now, but I think this would make sense too.

  .Seth

- Would suggest to rename “pkg.meta” to, say, “bro-pkg.meta”, just to
  make it more explicit that it's a Bro package. That's something one
  can also then search for on GitHub.

Just throwing in two more permutations: bro.meta or bro.pkg.

- For our default package source, do we want to support non-GitHub
  repositories? If so, a naming scheme by GitHub user won't work.

- Suggest to rename "/opt/bro/var/lib/package-manager" to
  "../bro-package-manager" or "../bro-pkg".

Yeah, especially if users don't install into the /opt/bro prefix, not
having "bro" as part of the filename might be confusing.

For now, I think it's fine to just do your own Sphinx and publish it
wherever (GitHub, RTD).

In case you're going with github, here's something non-intuitive that
took me a while to figure out: you need to put an (empty) file .nojekyll
in the document root, otherwise github runs the site through Jekyll,
which ignores directories starting with underscores, and sphinx uses
those (e.g., _static and _images).
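
If the publish step is scripted, something like this small sketch would do it; the add_nojekyll helper and the output path in the usage line are hypothetical.

```python
from pathlib import Path


def add_nojekyll(doc_root):
    """Create an empty .nojekyll file in the published document root."""
    marker = Path(doc_root) / ".nojekyll"
    marker.touch(exist_ok=True)  # leaves an existing file untouched
    return marker


if __name__ == "__main__":
    import tempfile

    # A temporary directory stands in for the real HTML output directory.
    with tempfile.TemporaryDirectory() as doc_root:
        print(add_nojekyll(doc_root))
```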

Given how far you are already, I vote for making it part of 2.5.

+1

    Matthias