Package manager meta data

Playing with Bro packages and bundles, there's one thing where I ended
up a bit confused and that's the meta data. We have two places right
now that store meta data about a package: there's the central set
stored with the package source (bro-pkg.index), and then there's the
set coming with the package itself (bro-pkg.meta). However, I'm a bit
lost what goes where, and which information should remain accessible
once things are bundled up.

Seems bro-pkg.index generally has "description" and "tags" at least.
However, that information seems lost once I bundle a package up and
don't have access to the index repository anymore. There doesn't seem
to be an expectation that bro-pkg.meta would have similar information,
so there's nothing there to fill the gap.

Then, bro-pkg.meta has a "version" field (the docs show that as the
one field to put in there). I believe that version isn't really used
anywhere other than showing it as part of the meta data output in
"info". In Python it also maps to PackageInfo.metadata_version. But
then there's also PackageStatus.current_version, which is from git
(although I'm not sure how I would actually set that; is it just any
tag?).

Once packages go through bundle/unbundle I cannot run "info" when I'm
offline, and hence I don't get some of the information anymore (same
from Python with PackageInfo I believe)

Ideally, what I'd like to have is a single interface (speaking in
Python API terms) for a package that gives me all the meta-data
consistently, pulling it out from where it's stored and maintaining
it when bundling. For things like "description", "tags", it could pull
them out of bro-pkg.index by default, and maybe override them from
bro-pkg.meta if they are defined there (so that one can set them even
if there's no package source to begin with). And bundles would then
carry the information through, probably inside their manifest. For
version information, bro-pkg could start with the git branch/tag but
override it with values from bro-pkg.index and bro-pkg.meta if defined
there. That way people could choose how to do their versioning. The
API would just offer a single version() doing the right thing, and
"bro-pkg info" would show that.

Does this make sense? I'm not quite sure about the reasons for the
current structure, I'm mostly thinking from the perspective of
information about the package I'd like to have access to easily, and
where it's coming from.
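
The lookup order proposed above (start with bro-pkg.index, let bro-pkg.meta override, and fall back to the git branch/tag for the version) could be sketched roughly as follows. This is only an illustration: the function names and the plain-dict representation are hypothetical and are not part of the existing bro-pkg Python API.

```python
from typing import Optional


def merged_metadata(index_entry: Optional[dict], meta_file: Optional[dict]) -> dict:
    """Combine both metadata sets (hypothetical helper).

    Fields from bro-pkg.index are the defaults; fields that are also
    defined in the package's own bro-pkg.meta override them.
    """
    merged = dict(index_entry or {})
    merged.update(meta_file or {})
    return merged


def version(git_ref: str, index_entry: Optional[dict], meta_file: Optional[dict]) -> str:
    """Single version lookup: git branch/tag unless a file defines "version"."""
    return merged_metadata(index_entry, meta_file).get("version", git_ref)
```

A bundle manifest could then store the result of merged_metadata() so that the same data stays available without access to the package source.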

Robin

Then, bro-pkg.meta has a "version" field (the docs show that as the
one field to put in there). I believe that version isn't really used
anywhere other than showing it as part of the meta data output in
"info”.

Right, the “version” field in bro-pkg.meta doesn’t serve an actual function at the moment. I was thinking about removing it (more on that in a response below).

then there's also PackageStatus.current_version, which is from git
(although I'm not sure how I would actually set that; is it just any
tag?).

If you mean what control does the author of a package have over how that field gets populated: they can use git tags or branches.

With respect to the way users follow package updates, I expected the two common cases to be either tracking a particular git branch or by only updating to release versions. A “release version” here is any git tag that looks like a SemVer (e.g. 1.0.0). So a user might do things like:

bro-pkg install --version 1.0.0 foo

or

bro-pkg install --version master foo
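
As a rough illustration of what "looks like a SemVer" could mean, here is a minimal Python check. It assumes a strict x.y.z pattern; this is a simplification of the full SemVer grammar, and the function name is hypothetical rather than taken from bro-pkg.

```python
import re

# Simplified assumption: a release version is exactly three dot-separated
# integers, e.g. "1.0.0". Full SemVer also allows pre-release/build suffixes.
_RELEASE_PATTERN = re.compile(r"\d+\.\d+\.\d+")


def is_release_version(ref: str) -> bool:
    """Return True if a git tag name looks like an x.y.z release version."""
    return _RELEASE_PATTERN.fullmatch(ref) is not None
```

Under this assumption, "1.0.0" counts as a release version, while "master" is treated as a branch name to track.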

Ideally, what I'd like to have is a single interface (speaking in
Python API terms) for a package that gives me all the meta-data
consistently, pulling it out from where it's stored and maintaining
it when bundling. For things like "description", "tags", it could pull
them out of bro-pkg.index by default, and maybe override them from
bro-pkg.meta if they are defined there (so that one can set them even
if there's no package source to begin with).

That could be misleading. The purpose of putting those tags in bro-pkg.index is because they are related to package “discoverability” and it’s easier to search that data if it’s located in a single repo, the package source, rather than having to download the entire universe of packages up front.

If bro-pkg.meta is allowed to “override” those fields just for the purpose of displaying metadata in the “info” command, it’s misleading because that is not the actual data that’s searchable via the “search” command.

If it’s important to always have some sort of “description” field to display from the “info” command, which would help situations where a package was never located via a package source (e.g. installed directly by git URL), then I’d suggest maybe either displaying contents of the README or just having a differently-named field. Since the user has already found the package, maybe it’s more useful for them to have a “usage” or “quickstart” field that gives them some high-level view of the API or options to tweak.

Technically, people can put any field they want in bro-pkg.meta and “info” will display it, so it’s just a matter of documenting suggested practices.

And bundles would then
carry the information through, probably inside their manifest.

Yeah, it would be fine to bind whatever metadata is available in the current bro-pkg.index to a package as it gets bundled or installed and to show that in the “info” command.

For
version information, bro-pkg could start with the git branch/tag but
override it with values from bro-pkg.index and bro-pkg.meta if defined
there. That way people could choose how to do their versioning. The
API would just offer a single version() doing the right thing, and
"bro-pkg info" would show that.

How would those other versioning schemes actually work, though?

A user needs to be able to install a particular package version, so bro-pkg needs to be able to associate a version with a git commit.

If versioning is done in bro-pkg.meta within a package, I’m not sure how to locate a particular version. Wouldn’t I have to brute force search through all commits? Even then, it’s possible the author screwed up and has multiple commits with the same version in bro-pkg.meta, which one does bro-pkg choose?

If versioning is done in bro-pkg.index, the author could create mappings of “version -> commit hash”. But if we’re talking about the official bro “packages” source, they have to submit a pull request each time they want to change that information and someone on the Bro team needs to approve it.

Sticking with standard git mechanisms seems like the best thing to do here. Tracking updates to branches “just works” and if people create tags that conform to semantic versioning then tracking updates to stable releases is also straight forward. This method also works for installing packages directly via explicit git URL (not via a package source). There’s also no way to accidentally create versioning ambiguities because git is the sole authority on versioning.

Does this make sense?

I'll need to see your responses to above points, but the point about “info” showing consistent metadata and working when offline makes sense. Set of potential changes:

1) Documentation updates: get rid of references to “version” in bro-pkg.meta, make it clear to use SemVer-formatted git tags for release versioning.

2) When a package is installed or bundled/unbundled, include a copy of its data from the currently-available bro-pkg.index. This data is bound to that particular installation of the package.

3) The “info” command should first check if a package is installed and display all metadata bound to that installation before trying to download direct from the origin repo or looking it up in package source. This makes “info” work offline when the package is installed.
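
The lookup order in point 3 could be sketched like this in Python. The names are hypothetical: "installed" stands in for whatever metadata bro-pkg binds to an installation, and "fetch_remote" stands in for the existing download/package-source lookup.

```python
from typing import Callable, Dict, Optional


def lookup_info(
    name: str,
    installed: Dict[str, dict],
    fetch_remote: Callable[[str], Optional[dict]],
) -> Optional[dict]:
    """Return metadata for a package, preferring the local installation.

    The remote lookup is only attempted when the package is not installed,
    so an installed package can be inspected while offline.
    """
    if name in installed:
        return installed[name]
    return fetch_remote(name)
```

Because the remote callable is never invoked for installed packages, a network failure only affects packages that are not installed.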

Let me know how that looks.

- Jon

That could be misleading. The purpose of putting those tags in bro-pkg.index is because they are related to package “discoverability” and it’s easier to search that data if it’s located in a single repo, the package source, rather than having to download the entire universe of packages up front.

If bro-pkg.meta is allowed to “override” those fields just for the purpose of displaying metadata in the “info” command, it’s misleading because that is not the actual data that’s searchable via the “search” command.

I totally agree that overriding information can be misleading but it
might be worth it compared to the other options.

If it’s important to always have some sort of “description” field to display from the “info” command, which would help situations where a package was never located via a package source (e.g. installed directly by git URL), then I’d suggest maybe either displaying contents of the README or just having a differently-named field. Since the user has already found the package, maybe it’s more useful for them to have a “usage” or “quickstart” field that gives them some high-level view of the API or options to tweak.

For me, introducing another field in bro-pkg.meta just increases
confusion. Displaying the README seems to be the easiest solution but I
think the behavior is still a bit confusing. Correct me if I am wrong
but bro-pkg.meta contains stuff like script_dir and dependencies (so
rather technical), whereas bro-pkg.index contains the descriptive
information like info text and tags (which is metadata, too, one could
even argue it's "more meta" than script_dir etc.).

From the user/developer perspective and excluding the technical
realization, I think the most desirable solution would be to have a
single file to put the meta data in, so that a package is completely
self-describing. This would also allow providing different descriptions
for different versions.

Regarding the technical solution, I'll try to sum up: Using a
distributed structure implies that important information is distributed,
too. I think the first question is, where to aggregate the information?
One could either maintain a cache in every client or integrate it into
the list of packages aka the public repository (current implementation).
The second question would be, whether and how to synchronize the
information? If the info is part of the repository this can be either
done manually (more or less the overriding solution of the current
implementation, assuming that the developers keep meta data in sync) or
automatically (e.g., by a script that fetches meta data of packages once
a day). If the cache is part of the client, this could be done based on
an expiration threshold or intentionally by the user (similar to dnf).
Finally one could drop the requirement of synced package and repository
meta data, risking to confuse the users. In that case the information
contained in the package should be used whenever possible (e.g., the
info command for a not installed package could obtain the most recent
information from the package's git repo).

The current implementation avoids synchronization by splitting the
information at the cost of non-self-describing packages. In general I
would vote for completely self-describing packages (might also make
bundling easier). The different aggregation and synchronization options
have their own pros and cons.

Another question: Now that repositories only contain bro-pkg.index files
with links instead of submodules, how are deleted/unavailable packages
detected/removed?

Best regards,
Jan

Correct me if I am wrong
but bro-pkg.meta contains stuff like script_dir and dependencies (so
rather technical), whereas bro-pkg.index contains the descriptive
information like info text and tags (which is metadata, too, one could
even argue it's "more meta" than script_dir etc.).

That’s right. The way I was thinking about how it’s split up is: if the metadata is related to how users will search for and discover new packages, then put it bro-pkg.index. Else it’s likely related to how the package will interoperate with bro, bro-pkg, other packages, etc., and that goes in bro-pkg.meta.

I think the most desirable solution would be to have a
single file to put the meta data in, so that a package is completely
self-describing. This would also allow providing different descriptions
for different versions.

Yes, I also think each package maintaining just its own, single metadata file is better. It also means that if the package author ever registered their package with multiple sources, they don’t have to maintain the same bro-pkg.index in multiple places.

I don’t remember if we just settled on the current implementation because it was quick/easy or there were objections to other more complicated technical solutions.

Regarding the technical solution, I'll try to sum up: Using a
distributed structure implies that important information is distributed,
too. I think the first question is, where to aggregate the information?
One could either maintain a cache in every client or integrate it into
the list of packages aka the public repository

Aggregating it into the package source is a better solution than having every client do it. The latter isn’t going to scale well: the client will take longer and longer over time as more and more packages get registered to a source. It also takes longer as a function of the total number of release versions a package has because we are collecting metadata for each version. I’d rather not ask users to just get used to developing more patience over time.

The second question would be, whether and how to synchronize the
information? If the info is part of the repository this can be either
done manually (more or less the overriding solution of the current
implementation, assuming that the developers keep meta data in sync) or
automatically (e.g., by a script that fetches meta data of packages once
a day).

I’d opt for a daily cron job to aggregate metadata into package sources.

If the cache is part of the client, this could be done based on
an expiration threshold or intentionally by the user (similar to dnf).
Finally one could drop the requirement of synced package and repository
meta data, risking to confuse the users. In that case the information
contained in the package should be used whenever possible (e.g., the
info command for a not installed package could obtain the most recent
information from the package's git repo).

It’s not a problem for the metadata to be out of sync for a day since only the “search” command is going to be using the aggregated data. Other commands would have direct access to accurate metadata since they’ve already cloned the package locally.

It would also be trivial to give users access to the aggregation tool if they have a problem with potentially using day-old metadata in their searches and are prepared to wait however long the aggregation process takes.

E.g. we add this command/flag: `bro-pkg refresh --aggregate-metadata`

Then the only difference between the daily aggregation process and a user is that the daily process does a `git commit && git push` in the locally cloned package source that bro-pkg is using internally.

Another question: Now that repositories only contain bro-pkg.index files
with links instead of submodules, how are deleted/unavailable packages
detected/removed?

At the moment, they’d have to be removed manually whenever someone notices or reports it.

If we switch to automated metadata aggregation, removal of nonexistent packages could naturally be a part of that.

- Jon

I'll just jump in at the end of the thread: Switching to fully
self-describing packages sounds great to me, that should solve all the
issues I noticed. I also don't quite recall the reasoning for arriving
at the current scheme, but it was probably a combination of iterating
over the design a few times, along with a desire to keep it simple.
But having something like a cronjob, or git hook, trigger a rebuild of
a central cache seems easy enough, and would be a major usability
improvement.

I believe the main thing to consider is making it really easy for
package sources (in particular external ones not maintained by the
project) to run the meta data aggregation. Maybe that additional "git
commit && push" could even be integrated into an additional
server-side bro-pkg command. One could then drive that from either
cron or git hooks (if a source operator can do hooks that will avoid
any delays at all).

Robin

Ah, that's not true of course, it only applies to initial setup of a
package.

Robin

Aggregating it into the package source is a better solution than having every client do it. The latter isn’t going to scale well: the client will take longer and longer over time as more and more packages get registered to a source. It also takes longer as a function of the total number of release versions a package has because we are collecting metadata for each version. I’d rather not ask users to just get used to developing more patience over time.

I am not sure how strong the effect might be but this is definitely a
good point!

I’d opt for a daily cron job to aggregate metadata into package sources.

For me this seems to be the easiest solution providing a maximum of
usability, too.

It’s not a problem for the metadata to be out of sync for a day since only the “search” command is going to be using the aggregated data. Other commands would have direct access to accurate metadata since they’ve already cloned the package locally.

What about the "info" command? If a package is not installed it would be
nice if the command would obtain the latest meta data from the package's
repository as well.

It would also be trivial to give users access to the aggregation tool if they have a problem with potentially using day-old metadata in their searches and are prepared to wait however long the aggregation process takes.

E.g. we add this command/flag: `bro-pkg refresh --aggregate-metadata`

Then the only difference between the daily aggregation process and a user is that the daily process does a `git commit && git push` in the locally cloned package source that bro-pkg is using internally.

That sounds great to me! This would be exactly the behavior I would
expect based on other package managers I have used so far.

Another question: Now that repositories only contain bro-pkg.index files
with links instead of submodules, how are deleted/unavailable packages
detected/removed?

At the moment, they’d have to be removed manually whenever someone notices or reports it.

If we switch to automated metadata aggregation, removal of nonexistent packages could naturally be a part of that.

Thanks for the clarification. Automatic removal might be a bit
aggressive but a warning would be very helpful I guess.

Best regards,
Jan

What about the "info" command? If a package is not installed it would be
nice if the command would obtain the latest meta data from the package's
repository as well.

Yep, I imagined “info” would still grab the latest metadata as well.

Thanks for the clarification. Automatic removal might be a bit
aggressive but a warning would be very helpful I guess.

From the user’s perspective, if they already installed a package that later becomes unavailable, they’d see a warning when they do “refresh”, but otherwise they could continue using the installed package.

But within the package source, what reason is there to keep old packages listed if their git URL no longer points to a valid package? What would a user do with that information? Theoretically, if the package was just temporarily unavailable, the next time the aggregation process runs, it would get listed again and users can seamlessly start receiving updates for it again.

- Jon

Thanks for the clarification. Automatic removal might be a bit
aggressive but a warning would be very helpful I guess.

From the user’s perspective, if they already installed a package that later becomes unavailable, they’d see a warning when they do “refresh”, but otherwise they could continue using the installed package.

But within the package source, what reason is there to keep old packages listed if their git URL no longer points to a valid package? What would a user do with that information? Theoretically, if the package was just temporarily unavailable, the next time the aggregation process runs, it would get listed again and users can seamlessly start receiving updates for it again.

I wasn't precise, sorry. I thought of a temporarily unavailable package
repository and had in mind that it would be deleted from the upstream
package source repository by a cron job. I guess you are talking about a
local copy of that source repo. In that case deleting unavailable
packages wouldn't harm. I thought about supporting the operator of the
upstream source repository in cleaning the repo. However, this will
require manual interaction anyway and hopefully isn't a use case that
will be encountered soon :)

Best regards,
Jan

I was thinking that even the upstream source repo could clean out invalid packages automatically since their metadata/listing is no longer useful to anyone. Did you have more thoughts/concerns about why that might require manual interaction?

- Jon

I was thinking that even the upstream source repo could clean out invalid packages automatically since their metadata/listing is no longer useful to anyone. Did you have more thoughts/concerns about why that might require manual interaction?

I just don't get this:

Theoretically, if the package was just temporarily unavailable, the next time the aggregation process runs, it would get listed again

How, if it is completely removed?

Jan

Oh, duh, I see what you mean. I guess the answer is related to something we haven’t yet spec’d out: how should the structure of a package source’s index files change to adapt to the new scheme of aggregating metadata?

A package source could look like:

https://github.com/bro/packages
  0xxon/
    packages.index
    bro-sumstats-counttable.meta
  sethhall/
    packages.index
    credit-card-exposure.meta
    ssn-exposure.meta
    domain-tld.meta

Contents of sethhall/packages.index:

  https://github.com/sethhall/credit-card-exposure
  https://github.com/sethhall/ssn-exposure
  https://github.com/sethhall/domain-tld

Contents of sethhall/ssn-exposure.meta:

  # Automatically generated, do not edit.
  [master]
  url = https://github.com/sethhall/ssn-exposure
  tags = file analysis, social security number, ssn, dlp, data loss
  description = Detect and log US Social Security numbers.
  script_dir = scripts

  [1.0.0]
  …

  [2.0.0]
  …

The packages.index files are manually modified by users during the act of package registration. The *.meta files are automatically created by the metadata aggregation process as it crawls the URLs listed in packages.index.

If a package is in packages.index, we say that its state is “registered”. Then, once it has a *.meta file, we say that its state is “listed”. If a package is “listed”, then bro-pkg users can see it show up from “search” and “list” commands. If the metadata aggregation process finds an invalid/unreachable package, it removes its *.meta file, but keeps it “registered” in packages.index, so the next crawl will still attempt to list the package in case it was just temporarily unavailable.
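
The “registered” vs. “listed” distinction can be illustrated with a small Python sketch of one aggregation pass. Everything here is hypothetical: the returned dict stands in for the set of *.meta files, and "fetch" stands in for cloning a package and reading its metadata, returning None when the package is invalid or unreachable.

```python
from typing import Callable, Dict, List, Optional


def aggregation_pass(
    registered: List[str],
    fetch: Callable[[str], Optional[dict]],
) -> Dict[str, dict]:
    """Crawl all registered URLs and return the packages that are "listed".

    The registered list (packages.index) is never modified here, so a
    package that is unreachable in one pass is simply retried in the next.
    """
    listed: Dict[str, dict] = {}
    for url in registered:
        metadata = fetch(url)
        if metadata is not None:
            listed[url] = metadata  # corresponds to (re)writing its *.meta file
        # Otherwise: no entry, i.e. its *.meta file is removed, but the
        # package stays registered.
    return listed
```

Since each pass rebuilds the listing from the unchanged registration list, a temporarily unavailable package reappears on its own once it is reachable again.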

Thoughts? Is it useful to collect metadata for each version or just the latest? “Latest" here would mean the latest release version tag or, if none exist, the latest master branch commit.

If per-version metadata collection isn’t needed, the structure outlined above still works, but the existing structure would also work: just stick the latest metadata directly into bro-pkg.index (mixing autogenerated data w/ user-entered data).

- Jon

Here’s a summary of the set of changes I plan to make related to package metadata, let me know if there’s objections or alternate ideas:

1) remove mentions of ‘version' field from bro-pkg.meta as versioning is done entirely by git tags

2) packages should now put ‘description’ and ‘tags' fields into bro-pkg.meta so they are fully self-describing

3) change "info" command to first check if package is already installed before trying to download it. This allows metadata to be displayed when working offline.

4) new command/flag `bro-pkg refresh --aggregate-metadata <pkg source>`: crawls a package source’s index files and, for each package, takes the master branch’s bro-pkg.meta and aggregates it into a single file, aggregate.meta, at the top-level of the package source

5) new command/flag `bro-pkg refresh --aggregate-metadata <pkg source> --push`: same as (4), but also commit/push changes to remote repo

6) change structure of package source index files:

bro-pkg.index will now contain just a simple list of package URLs:

    https://github.com/sethhall/credit-card-exposure
    https://github.com/sethhall/ssn-exposure
    https://github.com/sethhall/domain-tld

Aggregated metadata from packages in the source all lives within a single file at the top-level named “aggregate.meta”. Only metadata from current master branches of packages is included — I don’t think it’s that useful to collect metadata for each release version and doing that even makes it harder for package authors to change the way users discover their package.
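
For illustration, writing such a single aggregated file could look like the Python sketch below. The thread does not specify the layout of aggregate.meta, so the INI-style format with one section per package is an assumption, as are the function name and the use of the package path as the section name.

```python
import configparser
import io
from typing import Dict


def build_aggregate(packages: Dict[str, Dict[str, str]]) -> str:
    """Render per-package metadata as one INI-style document.

    packages maps a package name to the fields read from that package's
    bro-pkg.meta. The one-section-per-package layout is an assumption.
    """
    # Disable interpolation so "%" in descriptions is stored literally.
    parser = configparser.ConfigParser(interpolation=None)
    for name in sorted(packages):
        parser[name] = packages[name]
    buffer = io.StringIO()
    parser.write(buffer)
    return buffer.getvalue()
```

Sorting the package names keeps the generated file stable between runs, which keeps the diffs small when the result is committed to the package source.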

- Jon

Here’s a summary of the set of changes I plan to make related to
package metadata, let me know if there’s objections or alternate
ideas:

Sounds all good to me. That should make the Python API easier to use
for accessing meta information, too.

Some questions for my understanding:

remove mentions of ‘version' field from bro-pkg.meta as versioning is
done entirely by git tags

What will be the relationship between version tags and the master
branch? Is master just another version I can select to install?

4) new command/flag `bro-pkg refresh --aggregate-metadata <pkg source>`:

A shortcut "bro-pkg refresh" might be useful here that simply does
that for all configured sources.

Also, what if I update the meta information locally, but then later
the remote repository updates its version as well, so that mine gets
stale. Would the newer remote version replace the local version?

single file at the top-level named “aggregate.meta”. Only metadata
from current master branches of packages is included

Depending on the answer to the version question above, an alternative
could be using the most recent version (i.e., the commit that that
version tag refers to). That way people could control a bit better
when meta data gets updated (it's when they set a new version). It
would still fall back to master if no version tag is set.

Robin

What will be the relationship between version tags and the master
branch? Is master just another version I can select to install?

Yes, any tag or branch can be selected with `bro-pkg install --version <version or branch>`.

When selecting a version, `bro-pkg update` only considers newer x.y.z release tags for that package. When selecting a branch, it tracks updates along that branch.

4) new command/flag `bro-pkg refresh --aggregate-metadata <pkg source>`:

A shortcut "bro-pkg refresh" might be useful here that simply does
that for all configured sources.

I’ll make the --aggregate flag operate on all sources, then add an optional “--source <source>” flag for those that just want to operate on a single one.

`bro-pkg refresh` without any flags is currently like a `git pull` from the remote source and what most users will be doing. Likely only the operators of sources will want to --aggregate (if the package ecosystem becomes large, this operation could take a while to complete).

Also, what if I update the meta information locally, but then later
the remote repository updates its version as well, so that mine gets
stale. Would the newer remote version replace the local version?

Yes, newer data from the remote source overwrites local changes.

single file at the top-level named “aggregate.meta”. Only metadata
from current master branches of packages is included

Depending on the answer to the version question above, an alternative
could be using the most recent version (i.e., the commit that that
version tag refers to). That way people could control a bit better
when meta data gets updated (it's when they set a new version). It
would still fall back to master if no version tag is set.

Sounds good, I’ll have it do that.

- Jon

When selecting a version, `bro-pkg update` only considers newer x.y.z
release tags for that package. When selecting a branch, it tracks
updates along that branch.

Ok, sounds like I should think about it not as "master" then, but as
"HEAD on the master branch". So let's say the first time I install a
package, there're tags v1.0 and v1.1 on master, but master's HEAD is
already a bit ahead, without a new version set yet. What gets
installed when I don't specify a version? 1.1? And what does update do
when (1) later a new version got set, and (2) master then moves beyond
that version?

It would be good to spell this all out clearly in the documentation,
including in the quick start / examples, so that people understand how
to use the versioning effectively.

I’ll make the --aggregate flag operate on all sources, then add an optional
“--source <source>” flag for those that just want to operate on a
single one.

Makes sense.

Robin

When selecting a version, `bro-pkg update` only considers newer x.y.z
release tags for that package. When selecting a branch, it tracks
updates along that branch.

Ok, sounds like I should think about it not as "master" then, but as
"HEAD on the master branch”.

For packages tracking a git branch, bro-pkg will update to the latest commit at the end of the branch.

Maybe I’m getting git terminology wrong, but I thought ‘master’ is always a reference to the latest commit at the end of the branch and HEAD usually means whatever commit is locally checked out (maybe not latest commit). So thinking of it as ‘master’ sounds right to me, but whatever you think of as the “end of the branch” is what bro-pkg is tracking.

So let's say the first time I install a
package, there're tags v1.0 and v1.1 on master, but master's HEAD is
already a bit ahead, without a new version set yet. What gets
installed when I don't specify a version? 1.1?

1.1 is installed by default. (rationale is to give preference to latest stable release over dev versions)

And what does update do
when (1) later a new version got set, and (2) master then moves beyond
that version?

(1) it installs the newer version tag (“newer” meaning the version number is greater than before)
(2) it doesn’t care what gets added on master, the local install isn’t updated

In other words, if you install a specific release tag of a package (by default or by choice), you only receive stable release updates.

If you install a package’s branch (by default or by choice), updates will be tracking the latest commit at the end of that branch.
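
The default-install and update rules described above could be sketched in Python as follows. This is an illustration rather than bro-pkg's actual code: the function names are hypothetical, release tags are assumed to be plain x.y.z, and "master" is assumed to be the branch used when no release tag exists.

```python
import re
from typing import List, Optional, Tuple


def parse_release(ref: str) -> Optional[Tuple[int, ...]]:
    """Return (x, y, z) for an x.y.z tag, or None for anything else."""
    match = re.fullmatch(r"(\d+)\.(\d+)\.(\d+)", ref)
    return tuple(int(part) for part in match.groups()) if match else None


def latest_release(tags: List[str]) -> Optional[str]:
    """Pick the highest release tag, comparing numerically."""
    releases = [tag for tag in tags if parse_release(tag) is not None]
    return max(releases, key=parse_release) if releases else None


def default_install(tags: List[str]) -> str:
    """Prefer the latest stable release; otherwise track master."""
    return latest_release(tags) or "master"


def update_target(tracking: str, tags: List[str]) -> str:
    """Decide what an update should move an installed package to."""
    current = parse_release(tracking)
    if current is None:
        return tracking  # a branch: keep following the end of that branch
    newest = latest_release(tags)
    if newest is not None and parse_release(newest) > current:
        return newest  # a newer release tag exists
    return tracking  # new commits on master alone do not trigger an update
```

Versions are compared as integer tuples rather than strings so that, for example, 1.10.0 is correctly treated as newer than 1.9.0.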

It would be good to spell this all out clearly in the documentation,
including in the quick start / examples, so that people understand how
to use the versioning effectively.

Just added another section and also linked to it in the quickstart:

  http://bro-package-manager.readthedocs.io/en/latest/package.html#package-versioning

The upgrade command docs also have a brief description of how it works.

- Jon

Thanks, that and the new text help, I think I got it now. That all
makes sense. Maybe add to the text that tags are not pushed by default
by git, so one shouldn't forget "git push --tags".

Robin

The metadata changes discussed in this thread are now in bro-pkg 0.8 (and the structure of the bro/packages source on GitHub is adapted for it).

- Jon

I've tried this out now, it's all working great, thanks! I just
noticed one tiny little thing, for which I filed a ticket:
https://bro-tracker.atlassian.net/browse/BIT-1766

Robin