script loading implementation

With the way the new policy scripts use (abuse?) loading scripts from subdirectories of BROPATH, it seems easier to confuse the mechanism by which the scanner knows whether a given script has already been loaded. Consider a policy tree like:

.
├── A
│ └── B
│ └── script.bro
├── other scripts/directories (e.g. loader.bro)

If BROPATH contains ".:A" (the policy root plus A/), then "script.bro" could potentially be loaded twice (thus causing errors) when both the "A/B/script" and "B/script" input forms are @load'd either from other scripts or from user input.

The common pitfall would probably be when "loader.bro" is some script that always gets loaded by default and itself does a "@load B/script". A user attempting to run bro will see errors when they try `bro A/B/script` (or even `bro B/script.bro`), but `bro B/script` will work.

I want to change it so that:

1) all paths branching off BROPATH that lead to the script-being-loaded should be considered as loaded

2) inputs to @loads (and user arguments) without a .bro extension should also consider the .bro input form as loaded (currently it only does the reverse)

Does this sound good, or were there any other ideas to simplify the script loading process (the current scan.l already seems to be somewhat complicated in the way this happens) ?

- Jon

1) all paths branching off BROPATH that lead to the script-being-loaded should be considered as loaded

I'm not sure I understand what you're saying here.

2) inputs to @loads (and user arguments) without a .bro extension should also consider the .bro input form as loaded (currently it only does the reverse)

Does that gain much benefit? Wouldn't the file with the same name as a directory already be inaccessible due to path ordering?

Does this sound good, or were there any other ideas to simplify the script loading process (the current scan.l already seems to be somewhat complicated in the way this happens) ?

I've been thinking about this a bit some too and I've was considering removing the subpaths from the default BROPATH. It would make it so that if you wanted to load the core http analysis, you'd have to load protocols/http. I know that doesn't really address what you're talking about here, but it should help avoid any potential path problems in the default install.

  .Seth

> 1) all paths branching off BROPATH that lead to the
> script-being-loaded should be considered as loaded

I'm not sure I understand what you're saying here.

If I have a script: policy/A/B/script.bro

and a BROPATH containing: policy:policy/A

then when a script is loaded via "@load B/script", the scanner needs to consider both "B/script" and "A/B/script" as already loaded (they're the same file, but we need to know all possible subpaths of paths in BROPATH that can lead to the file).

Another way to do it could be to canonicalize inputs to the shortest subpath of a path in BROPATH that can load the script.

If you want a real example to play around with, try figuring out the right way to load "policy/protocols/conn/base.bro" as a command line argument to bro in the policy-scripts-new branch. (pretend you don't know that the script actually gets loaded by default because of some crazy packet-filter -> notice -> conn dependencies)

Here's the order of attempts I made

fail: `./src/bro protocols/conn/base.bro`
fail: `./src/bro protocols/conn/base`
success: `./src/bro conn/base`

> 2) inputs to @loads (and user arguments) without a .bro extension
> should also consider the .bro input form as loaded (currently it
> only does the reverse)

Does that gain much benefit? Wouldn't the file with the same name as
a directory already be inaccessible due to path ordering?

This doesn't have to do with scripts and directories having the same name.

This has to do with why this succeeds:

    `./src/bro conn/base`

while this fails:

    `./src/bro conn/base.bro`

I've been thinking about this a bit some too and I've was considering
removing the subpaths from the default BROPATH. It would make it so
that if you wanted to load the core http analysis, you'd have to
load protocols/http. I know that doesn't really address what you're
talking about here, but it should help avoid any potential path
problems in the default install.

Actually, I think that would address the first problem I'm talking about (not the second regarding .bro extension canonicalization). But seems like something someone might try to do again the future only to run into the same problems I did.

- Jon

Yeah, I think your examples show that there's a problem.

I'm not sure I fully remember how the @load mechanism currently tracks
what's loaded, but could it just generally canonicalize all scripts
to their full *absolute* paths of the actual files being loaded (i.e.,
always including the .bro extensions) and then remember those?

We do the same before loading a script and if the absolute path is
already known, we don't load it again.

Robin

I'm not sure I fully remember how the @load mechanism currently
tracks what's loaded,

It looks like it tracks whatever string is given as an argument to @load (or as an arg to bro), and if it ends in .bro, it also tracks the version with that stripped.

but could it just generally canonicalize all scripts
to their full *absolute* paths of the actual files being loaded
(i.e., always including the .bro extensions) and then remember those?

Yes, I think that works best/easiest. I can work on this.

- Jon

> but could it just generally canonicalize all scripts
> to their full *absolute* paths of the actual files being loaded
> (i.e., always including the .bro extensions) and then remember
> those?

Err, I don't think it's trivial in C/C++ without extra libraries (e.g. Boost) to get absolute paths from relative ones (we can't rely just on BROPATH since that may also have relative paths in it) let alone compare that two lexical file representations actual resolve to the same file (duplicate '/', embedded './'s, etc.).

I think it works to just track/remember inode numbers, though.

- Jon

Yeah, nice thought, that should work as well (and even catches hard
links :).

For getting an absolute path, there's realpath(). However, the glibc
man page says:

    "Avoid using this function. It is broken by design ..."

Perhaps there are other C versions of that function out there (with a
suitable licence) that are safer to use and we could borrrow? (In
fact, the glibc version has a non-standard feature that avoids the
problem it seems.)

But either way seems fine.

Robin

Instead of going for the path you can go for the inode of the file. See
man fstat(2). That will even take care of links.

cu
Gregor

Instead of going for the path you can go for the inode of the file.
See man fstat(2). That will even take care of links.

This is what I ended up doing on Friday, just didn't commit yet because I was trying to figure out why one of the unit tests failed w/ that change. Turns out the baseline was just wrong because the way the test file was named "conn-id.bro", it was loaded twice since the old path comparison method failed to detect a duplicate.

- Jon