Moving policy scripts into packages

Looking for some thoughts here. One of the items on the roadmap for
4.0 is moving scripts that currently live in policy/ over into Zeek
packages. The goals here are to (1) facilitate maintaining & testing
them independently of Zeek releases; and (2) come to a more flexible
notion of "default scripts" that can incorporate community-maintained
packages as well. This is tracked by issue
https://github.com/zeek/zeek/issues/414, including a 1st pass over the
existing policy scripts to understand what should/can be moved.
(Thanks, Vlad!)

Before we can begin working on this, we need to figure out how to
organize this new world. One particular question is where the moved
packages will live. I see the following options so far:

    1. Move each into a separate repository on the zeek/ GitHub
       account.

    2. Similar, but to avoid cluttering zeek/, create a new GitHub
       organization "zeek-packages".

    3. Put them all into a single mono-repository (e.g.,
       zeek/standard-packages), i.e., treat them as one package.

    4. Do (1) or (2), and additionally create "zeek-standard-packages"
       that's full of submodules pointing to them (and also to
       community packages).

    5. Do (1) or (2), and teach zkg to understand "collections" of
       packages that can be installed/managed as a group, defined
       through some meta data somewhere.

Along with all of this comes a question of how to make it easy for
people to install a set of default packages now that these won't come
with Zeek itself anymore. Some of the schemes above make that easier
than others.

Thoughts/opinions/more ideas?

Robin

I like (2) for cleanliness.

I think some people would be fine with one large package, but in other cases, you may want the ability to easily enable/disable various standard scripts. Since we probably do not want to maintain the same script in multiple places, I think that eliminates (3). As for (4) & (5): given the number of standard scripts, there should be an easy way to distinguish them from other packages when doing a ‘zkg list’. Maybe that’s just done via a tag or in the naming scheme.

-Dop

I like (2) for cleanliness.

Vote counted!

there should be an easy way to distinguish them from other packages
when doing a 'zkg list'.

Good point.

Also, one additional thought: Jon reminded me that zkg can manage
dependencies already. So the "collection" I mentioned could be a
meta-package that depends on all the ones we want. We might need to
make that a bit more explicit for this use case (like you say for
example in the output of "list"), but the basic functionality is
there.
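
A minimal sketch of what such a meta-package could look like, assuming it is nothing more than a zkg.meta whose "depends" field lists the packages we want. The package names and version specs below are hypothetical; only the "[package]" section and the "depends" / "description" fields are existing zkg metadata.

```python
import configparser
import io

# Hypothetical contents of the collection; none of these packages exist yet.
STANDARD_PACKAGES = {
    "zeek": ">=4.0.0",  # zkg also accepts a Zeek version requirement here
    "zeek-packages/ssl-validation": ">=1.0.0",
    "zeek-packages/detect-traceroute": ">=1.0.0",
}


def render_meta(description, depends):
    """Render zkg.meta text for a package that only carries dependencies."""
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep option names exactly as given
    # One "name version-spec" entry per line, as zkg's depends field expects.
    entries = [f"{name} {spec}" for name, spec in sorted(depends.items())]
    parser["package"] = {
        "description": description,
        "depends": "\n" + "\n".join(entries),
    }
    out = io.StringIO()
    parser.write(out)
    return out.getvalue()


if __name__ == "__main__":
    print(render_meta("Hypothetical set of standard Zeek packages",
                      STANDARD_PACKAGES))
```

Installing the meta-package would then pull in everything it depends on, which is the part zkg already handles.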

Robin

Hi,

just a few thoughts about this. Generally - I like the idea of breaking this up.

I would like to list a few thoughts about additional technical points that
we should perhaps think about and that play into this decision.

* Testing:

  Currently, some of the policy scripts have tests that use Zeek
  functionality in rather unique ways / or are the only tests for some
  Zeek functionality. The SSL validation scripts are one example.

  Thus, from my point of view, it would be neat to have a way to still
  easily install a rather large set of packages (potentially nearly
  everything that is in policy at the moment) and run tests on them.

  This also comes with a fun problem. Sometimes we perform changes to Zeek
  that change a lot of the test baselines - especially when we touch
  something that affects connection-ID hashing, or the order of elements
  in hashmaps. These cases might now require an update to the test-cases
  in a large number of packages. It would be neat to have an easy way to
  perform this.

* Versioning:

  I think if we want to do this, we need to have a better story for
  versioning than we, at the moment, have with zkg. To expand on this - at
  the moment the policy scripts just work with the version of Zeek that we
  distribute them with.

  It would be nice if, afterwards, it would still be possible to install a
  working set of a script for the running version of Zeek. Meaning that -
  if someone happens to run a version of Zeek that is 12 months out of
  date - they should probably get the version of the policy script that is
  known to work with this version of Zeek and where the tests pass with
  this version of Zeek.

  It would be super nice if this worked at a rather fine granularity - so
  even for development versions of Zeek.

* Documentation:

  At the moment the documentation of the policy scripts just lives
  together with the Zeek documentation. This has a few advantages - it
  e.g. shows redefs that are performed in policy scripts. It would be neat
  to have a place that contains the combined documentation of these
  scripts.

Currently, especially given the "update-tons-of-test-baselines-simultaneously"
problem, I am kind of tempted by 3. 3 would also enable relatively easy
versioning and mapping to Zeek versions that the packages are known to
work with. This should also allow us to keep the current testing
infrastructure more or less working as it is. It does, however, not give
really fine-grained access to individual packages.

Johanna

Yeah, agreed -- I prefer #2 for the same reason.

Best,
Christian

* Testing:

   Currently, some of the policy scripts have tests that use Zeek
   functionality in rather unique ways / or are the only tests for some
   Zeek functionality. The SSL validation scripts are one example.

   Thus, from my point of view, it would be neat to have a way to still
   easily install a rather large set of packages (potentially nearly
   everything that is in policy at the moment) and run tests on them.

On a related note, I think it would be quite helpful if the Zeek install tree would include some of the standard btest helpers, such as random.seed and the various canonification scripts.

   This also comes with a fun problem. Sometimes we perform changes to Zeek
   that change a lot of the test baselines - especially when we touch
   something that affects connection-ID hashing, or the order of elements
   in hashmaps. These cases might now require an update to the test-cases
   in a large number of packages. It would be neat to have an easy way to
   perform this.

I agree that this would be really handy. We could build some tooling that would still let you maintain the packages as individual repos but that enables such use cases. It'd be pretty handy to have a zkg command that clones a given package plus all of its dependencies for development work, for example. Since zkg already speaks git rather well, it could even synthesize a submodule structure.
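
As a rough sketch of the "clone a package plus all of its dependencies" idea: the core is just a traversal of the dependency graph that visits each package once. The package names and the in-memory dependency table are hypothetical; a real tool would read them from each package's metadata and run git for every entry.

```python
from collections import deque

# Hypothetical dependency metadata, standing in for what zkg would look up.
DEPENDS = {
    "standard-packages": ["ssl-validation", "http-extras"],
    "ssl-validation": ["x509-helpers"],
    "http-extras": ["x509-helpers"],
    "x509-helpers": [],
}


def clone_order(root, depends):
    """Return root plus all transitive dependencies, each listed once."""
    seen = {root}
    order = []
    queue = deque([root])
    while queue:
        pkg = queue.popleft()
        order.append(pkg)
        for dep in depends.get(pkg, []):
            if dep not in seen:  # shared dependencies are cloned only once
                seen.add(dep)
                queue.append(dep)
    return order


if __name__ == "__main__":
    for pkg in clone_order("standard-packages", DEPENDS):
        print("would clone:", pkg)
```

The resulting list is also what one would iterate over to update baselines across all of the checked-out packages in one go.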

Thanks,
Christian

    1. Move each into a separate repository on the zeek/ GitHub
       account.

    2. Similar, but to avoid cluttering zeek/, create a new GitHub
       organization "zeek-packages".

I'm thinking (2). Technically, either one can likely be made to work,
but (1) has a slight downside in terms of hurting a primitive
browsability use-case: I imagine people want to simply scroll through
a list of packages on GH, and the existing non-packages in zeek/ would
distract from that goal.

    3. Put them all into a single mono-repository (e.g.,
       zeek/standard-packages), i.e., treat them as one package.

The shortcoming of that structure is the lack of customizability. If
a user only wants a specific set of functionality, or wants to avoid a
set of scripts due to intrinsic overhead/conflict, it's better if
they're able to start from blank slate and add only the pieces they
want.

    4. Do (1) or (2), and additionally create "zeek-standard-packages"
       that's full of submodules pointing to them (and also to
       community packages).

    5. Do (1) or (2), and teach zkg to understand "collections" of
       packages that can be installed/managed as a group, defined
       through some meta data somewhere.

Between those, (5) may fit more naturally -- zkg may already be fit to
handle "meta packages" via simple package metadata dependency
configuration. Plus, that's a similar pattern to other package
management environments AFAIK.

  Thus, from my point of view, it would be neat to have a way to still
  easily install a rather large set of packages (potentially nearly
  everything that is in policy at the moment) and run tests on them.

Agree that type of integration test is helpful -- may find conflicts /
bad-interactions / versioning-issues that way.

  Sometimes we perform changes to Zeek
  that change a lot of the test baselines - especially when we touch
  something that affects connection-ID hashing, or the order of elements
  in hashmaps. These cases might now require an update to the test-cases
  in a large number of packages. It would be neat to have an easy way to
  perform this.

Might be a chance to go another direction and revamp the test-writing
guidelines/patterns (or provide internal config options that help
produce "canonical" outputs) such that fewer test-cases are fragile to
these kinds of widespread, low-level changes in the first place. E.g.
if the order of elements in hashmaps is not defined, it's the fault of
any given test-case that chose to rely on a specific order being
produced. Or if an irrelevant change in UID breaks a test, the test
either needs to canonify or exclude UID from its pass/fail criteria.
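
To illustrate what "canonify" could mean here, a minimal sketch that masks connection UIDs and timestamps and sorts the remaining lines before comparing against a baseline. The regular expressions are assumptions about what UIDs and timestamps look like in tab-separated logs, not a definition of the format, and this is not one of the existing btest helper scripts.

```python
import re

# Assumed shapes: UIDs as "C" plus a run of alphanumerics, timestamps as
# epoch seconds with six fractional digits.
UID_RE = re.compile(r"\bC[0-9A-Za-z]{10,}\b")
TS_RE = re.compile(r"\b\d{9,10}\.\d{6}\b")


def canonify(log_text):
    """Return a form of the log that ignores UIDs, timestamps, and order."""
    lines = []
    for line in log_text.splitlines():
        if line.startswith("#"):
            continue  # header lines (e.g. #open/#close) carry wall-clock times
        line = UID_RE.sub("XXXXXXXXXX", line)
        line = TS_RE.sub("XXXXXXXXXX.XXXXXX", line)
        lines.append(line)
    # Sorting removes any dependence on hashmap iteration order.
    return "\n".join(sorted(lines))
```

A test that compares canonify(output) against canonify(baseline) would survive a change in UID hashing or element order, at the cost of no longer checking those properties at all.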

  I think if we want to do this, we need to have a better story for
  versioning than we, at the moment, have with zkg. To expand on this - at
  the moment the policy scripts just work with the version of Zeek that we
  distribute them with.

I see the current story for versioning as this: package metadata
advertises Zeek version compatibility, zkg knows how to enforce that
dependency, and package authors are left in control of deciding their
compatibility requirements and implementing them (likely via @if or
#if directives).

Once scripts that used to get distributed along with Zeek become
independent packages, I agree they should start abiding by that story,
but I'm not sure if there were any further ideas to help minimize the
overall maintenance burden/friction associated with this new
requirement.

  It would be nice if, afterwards, it would still be possible to install a
  working set of a script for the running version of Zeek. Meaning that -
  if someone happens to run a version of Zeek that is 12 months out of
  date - they should probably get the version of the policy script that is
  known to work with this version of Zeek and where the tests pass with
  this version of Zeek.

Seems like a couple distinct questions:

* What's the LTS policy for packages? It can be made different/longer
than LTS policy for Zeek-proper if that's desirable. But if a
12-month LTS cycle is decided for packages, too, any extra effort
spent to support older Zeeks sends mixed messages.

* How to obtain an aggregate set of packages that were validated with
Zeek X.Y.Z: seems like a job for the (meta)package dependency metadata
to point directly to specific package versions (in the realm of git
tags/branches/commits, etc.).
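
A small sketch of the second point, assuming a lookup table that maps a Zeek release series to the meta-package tag validated against it. The table contents and tag names are made up, and falling back to the closest older series is only one possible policy.

```python
# Hypothetical table: Zeek release series -> validated meta-package git tag.
BLESSED = {
    (3, 0): "v3.0.2",
    (3, 2): "v3.2.1",
    (4, 0): "v4.0.0",
}


def blessed_tag(zeek_version, table=BLESSED):
    """Pick the meta-package tag for a Zeek version such as '3.2.4'."""
    parts = zeek_version.split(".")
    series = (int(parts[0]), int(parts[1]))
    # Assumed policy: use the newest entry that is not newer than this Zeek.
    candidates = [s for s in table if s <= series]
    if not candidates:
        raise LookupError(f"no validated package set for Zeek {zeek_version}")
    return table[max(candidates)]


if __name__ == "__main__":
    print(blessed_tag("3.2.4"))
```

The tag returned here is what the meta-package's dependency metadata would in turn pin each individual package against.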

- Jon

  Thus, from my point of view, it would be neat to have a way to still
  easily install a rather large set of packages (potentially nearly
  everything that is in policy at the moment) and run tests on them.

While I agree that integration testing is useful, too, ideally the new
packages would primarily rely on tests that are standalone. Do you see
a problem with that for, e.g., the SSL functionality?

  that change a lot of the test baselines - especially when we touch
  something that affects connection-ID hashing, or the order of elements
  in hashmaps.

Agree with Jon here: This might be an opportunity to make the tests
less fragile, more like what we'd recommend for external packages
anyways.

  It would be nice if, afterwards, it would still be possible to install a
  working set of a script for the running version of Zeek.

Yeah, if we worked with a meta-package, we could "bless" a specific
version of that for a given Zeek release. People could update further,
but with less of a guarantee, though we'd try hard to ensure they work
with different versions, can even CI them against a bunch of recent
releases.

Overall, our current policy/ scripts haven't required version-specific
changes very often, so I'm not too worried here. The most common use
case is probably some script starting to use a newly introduced
feature, and that's pretty easy to catch / guard against.

It would be neat to have a place that contains the combined
documentation of these scripts.

Agree, and I'd extend that to packages in general; it could be a job for an
extended packages.zeek.org to provide autogen'ed documentation.

Robin

Good question. I think I would tie it to the Zeek LTS policy, with a
"blessed" version of the meta-package that we recommend (and maintain)
for each currently maintained Zeek version.

Robin

Agree, and I'd extend that to packages in general; it could be a job for an
extended packages.zeek.org to provide autogen'ed documentation.

I like that idea! It would be great to have a standard process for generating docs for each package.

I’ve been working[1] on a template for Zeek packages that consist only of scripts. It’s based heavily on the plugin-support[2] in zeek-aux, but it goes a few steps further by running the tests via GitHub Actions with the current {zeek, zeek-lts, zeek-nightly} packages. It also integrates with GitHub Pages to auto-generate the Sphinx documentation. You can see an example here: https://grigorescu.github.io/external_dns/. (Note: This is my “working” package, and deviates from the upstream cookiecutter as I play around with new things).

It tries to combine the user-provided README, with autogenerated documentation, and some useful things that I bundled into the docs (for example, explaining how plugins get loaded). One neat feature is using intersphinx to link to the official Zeek docs, so that all the references to types, enums, base scripts will still be clickable links.

There are a couple of open questions with this. For instance, Zeekygen does not generate documentation for redefs within the script that’s doing the redefs. So this[3] will tell you that your script is adding some new notice types, but the documentation is missing. Also, I don’t have a good way to generate documentation for all your packages into a single page, which would be nice. I also made some changes to the Sphinx scripts in zeek-docs that I should probably figure out how to upstream. We might want to just publish the Sphinx module separately, so that people can just use it.

–Vlad

[1] - <https://github.com/esnet/cookiecutter-zeekpackage>
[2] - <https://github.com/zeek/zeek-aux/tree/master/plugin-support>
[3] - <https://grigorescu.github.io/external_dns/scripts/NCSA/external_dns/main.zeek.html#redefinitions>

Good point, yeah. I took a look at this. The btests in the Zeek distribution use 352 different pcaps. The tests for the policy folder use 30; 18 of them are also used by other tests. That set is small, less than 1MB. Use of the same pcap across the policy tests is pretty limited and would likely occur only within one new package. So it's not too bad -- if those tests migrate into new Zeek packages, we would duplicate those 18 pcaps, but mostly just once.
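
For reference, a count like this can be reproduced roughly as follows. This is a sketch, assuming tests reference their pcaps as $TRACES/<path> and that policy tests can be told apart by a name prefix; the test names in the usage example are hypothetical.

```python
import re
from collections import defaultdict

# Tests are assumed to reference pcaps as "$TRACES/<relative path>".
TRACE_RE = re.compile(r"\$TRACES/([\w./-]+)")


def pcap_usage(tests):
    """Map each referenced pcap to the set of test names that use it.

    tests: mapping of test name -> text of the test file.
    """
    usage = defaultdict(set)
    for name, text in tests.items():
        for pcap in TRACE_RE.findall(text):
            usage[pcap].add(name)
    return usage


def shared_pcaps(usage, prefix="policy/"):
    """Pcaps used both by tests under the prefix and by tests outside it."""
    shared = set()
    for pcap, users in usage.items():
        inside = any(u.startswith(prefix) for u in users)
        outside = any(not u.startswith(prefix) for u in users)
        if inside and outside:
            shared.add(pcap)
    return shared
```

The pcaps returned by shared_pcaps() are the ones that would end up duplicated if the policy tests move into their own packages.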

Other Zeek packages seem pretty unlikely to need a pcap from the Zeek distribution, but I'd argue they probably benefit from a standard random.seed and some of the canonicalizers.

An alternative would be to install the full set of pcaps (that's 23MB), probably optionally, and make the availability of the pcap a test requirement.

What do you think? I personally could live with duplicating those pcaps.

Best,
Christian

To summarize this a bit, below is what I think I heard so far.
Feel free to respond further, I'll move this over into the ticket
later once we have consensus.

Robin

- General preference to keep packages in individual repositories
  hosted inside a new GitHub organization "zeek-packages".

- Management through a meta-package that lists all desired
  packages as dependencies. The meta-package can version content by
  pinning packages to blessed versions.

- Tie these meta-packages to the current Zeek releases. To take this a
  bit further:

    - I imagine this means that we'll have three meta-packages at any
      point of time: "zeek-packages-current", "zeek-packages-lts", and
      "zeek-packages-devel". When a new release comes out, these rotate
      through.

    - People can install packages for older (now unsupported) Zeek
      versions by picking an older version of the corresponding
      meta-package.

    - The Zeek distribution can either download the current version of
      the meta package on install; or even just include the full
      content somehow (also see "zkg" below).

- Testing
    - Need to make tests standalone and less dependent on Zeek versions.

    - Should make standard btest infrastructure available to tests
      (e.g., Zeek's btest helpers, pcaps).

    - Provide integration tests that execute across the full set of
      "zeek-packages".

- Development
    - Make it easy to work on multiple packages at once (e.g., to
      update baselines; get all dependencies in place)

- Documentation
    - Use Zeekygen to document the full content of a meta package at
      once; can host either on docs.zeek.org or packages.zeek.org.

    - Make it easy to autogen docs for individual packages (ideas:
      GitHub Pages through Vlad's cookie-cutter; autogen on
      packages.zeek.org)

- zkg:
    - Distinguish standard/recommended packages from others.
    
    - Could we add a way to "prime" zkg's package cache so that a Zeek
      distribution could distribute a snapshot of "zeek-packages" for
      direct use; but zkg would still pull in updates if online access
      is available?

Great summary, thanks Robin.

- zkg:
    - Could we add a way to "prime" zkg's package cache so that a Zeek
      distribution could distribute a snapshot of "zeek-packages" for
      direct use; but zkg would still pull in updates if online access
      is available?

Related to this, I was wondering about zkg's status as an affiliated project ... if we strengthen the notion of packages from the core distribution, we may want to ensure zkg can be available from the outset (as a core component)?

I tried to look at some equivalents in other environments ... for example, it looks like when you install Python/Ruby from the official tarballs you get pip/gem out of the box.

Best,
Christian

- zkg:
    - Distinguish standard/recommended packages from others.

Not sure what's meant there, but names/listings would all use
`zeek/zeek-packages/` as a prefix, which would be one way of
distinguishing them.

If it's more about organizing meta-packages to have a secondary-level
of optional/recommended content, there's maybe already a `suggests`
metadata field that helps:

https://docs.zeek.org/projects/package-manager/en/stable/package.html#suggests-field

    - Could we add a way to "prime" zkg's package cache so that a Zeek
      distribution could distribute a snapshot of "zeek-packages" for
      direct use; but zkg would still pull in updates if online access
      is available?

It's possible. Needs more planning in terms of what's
desired/required for the distribution and integration with CMake
build/install logic.

* Will `zkg` now be a required dependency for installing `zeek` ?

* In any case, might assume CMake install logic can at least use `zkg`
if available. Then, it may be convenient if the package distribution
format is already something `zkg` knows well. Say, a "bundle".

* After a `zkg unbundle`, packages should be set up to track the
real/online git repo URLs such that `zkg refresh && zkg upgrade` could
be used to receive updates.

* In the case `zkg` isn't required/available when installing `zeek`,
I'm sure there's some duplication/re-implementation of install logic
we could add in the `zeek` CMake logic to install packages into usable
locations, but it may generally be tricky/fragile to allow `zkg` to
take over such an installation "after the fact". If there were such a
change of mind in the person doing an install, it might be easiest to
"do the `zeek` install process again, this time with `zkg` available".

- Jon

Yeah, I'd be in favor of tying zkg more closely to Zeek itself so that
it's always available as a required dependency. I think that also
makes sense more generally as it has become a key part of our
ecosystem. We can add it to auxil/ as another submodule. The Zeek-side
can then use the code to get packages in place directly, either
through the command-line client or through the zkg Python API.
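
A rough sketch of the command-line variant, following the unbundle / refresh / upgrade flow described earlier in the thread. The bundle file name is hypothetical, and the helper only prints the commands unless told otherwise; how this would hook into the CMake install logic is left open.

```python
import shlex
import subprocess


def install_plan(bundle_path, online):
    """List the zkg invocations for installing a shipped package snapshot."""
    plan = [["zkg", "unbundle", bundle_path]]
    if online:
        # With network access, pick up anything newer than the snapshot.
        plan += [["zkg", "refresh"], ["zkg", "upgrade"]]
    return plan


def run_plan(plan, dry_run=True):
    """Print each command, or execute it when dry_run is False."""
    for cmd in plan:
        if dry_run:
            print(shlex.join(cmd))
        else:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # "zeek-packages.bundle" is a made-up file name for illustration.
    run_plan(install_plan("zeek-packages.bundle", online=True))
```

Note that "zkg upgrade" may ask for confirmation when run interactively, so a non-interactive install step would need to account for that.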

Robin