open a pipe?

How hard would it be to modify files so that we could use them like a pipe to another program? I'll give an example…

global myfile = open("| shasum > output");
print myfile, "hello";
close(myfile);

Then the output file would contain the SHA-1 hash of "hello". I don't think it would be a major change to files to support this, but I've been wrong before. :-)

  .Seth

I'm very reluctant to allow that without having script-level async
I/O. What if the pipe blocks?

Robin

Explosions and fire as you'd expect. :-)

Don't we already theoretically have this problem with the print statement?

  .Seth

Yes, but it's harder to make that block. :-)

Once we have the pipe, the next thing you'll be doing is printing to
netcat. :-)

Robin

Dammit, that *is* one of my use cases. (not the only one though)

You know me too well!

  .Seth

Robin's comments aside, also note that you can implement a single pipe with popen() et al., but in order to add further output redirection or multiple pipes (e.g., your "> output") you need to run a shell (or do a lot of work to manage without one).
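For illustration, here's a minimal C sketch of the popen() route (hypothetical, not anything in Bro): popen() hands its command string to "/bin/sh -c", which is exactly why the "> output" redirection comes for free there.

#include <stdio.h>

int main(void)
{
    /* popen() runs the command via "/bin/sh -c", so the shell
       parses the pipeline and the "> output" redirection. */
    FILE *p = popen("shasum > output", "w");
    if ( ! p )
        return 1;

    fputs("hello\n", p);

    /* pclose() waits for the command to exit. */
    return pclose(p) == -1 ? 1 : 0;
}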

cu
Gregor

> global myfile = open("| shasum > output");

I'm coming around on this if we move the execution of the pipe into a
thread. With the new thread infrastructure, that should be mostly
straightforward and get us around all the blocking problems.
Actually, I'm wondering if non-packet IOSources should generally move
into threads (e.g., the DNS Mgr).

Robin

We just have to be a little bit careful about that, because it mixes threading and forks. Seth and I have been testing that in the input framework for a bit, and it seems to work.

But it is not that easy to find straight answers on whether it is a good idea to call popen() (which essentially does a fork and passes the command to "sh -c") in a thread; opinions on the thread-safety of popen() seem to be divided.
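One alternative that keeps coming up for spawning commands from threaded programs (just a sketch, not something we have implemented) is posix_spawn(), which avoids running arbitrary library code in the child between fork() and exec():

#include <spawn.h>
#include <sys/wait.h>
#include <unistd.h>

extern char **environ;

int main(void)
{
    int fds[2];
    if ( pipe(fds) < 0 )
        return 1;

    /* In the child: make the pipe's read end stdin, then close both
       original pipe fds so the child sees EOF once we close ours. */
    posix_spawn_file_actions_t fa;
    posix_spawn_file_actions_init(&fa);
    posix_spawn_file_actions_adddup2(&fa, fds[0], STDIN_FILENO);
    posix_spawn_file_actions_addclose(&fa, fds[0]);
    posix_spawn_file_actions_addclose(&fa, fds[1]);

    char *argv[] = { "sh", "-c", "shasum > output", NULL };
    pid_t pid;
    if ( posix_spawnp(&pid, "sh", &fa, NULL, argv, environ) != 0 )
        return 1;

    posix_spawn_file_actions_destroy(&fa);
    close(fds[0]);

    /* Feed the command's stdin, then close to signal EOF. */
    write(fds[1], "hello\n", 6);
    close(fds[1]);

    return waitpid(pid, NULL, 0) < 0;
}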

Johanna

global myfile = open("| shasum > output");

I'm coming around on this if we move the execution of the pipe into a
thread. With the new thread infrastructure, that should be mostly
straightforward and get us around all the blocking problems.

Woo. We definitely need to think about this a bit more though. I don't really like the perl-ism inherent in writing commands with pipes in the "file name". I've been looking at node.js's Stream API[1] lately. We may be able to borrow some ideas from that, but I already see some things about their API that I don't like.

Actually, I'm wondering if non-packet IOSources should generally move
into threads (e.g., the DNS Mgr).


Do you have any intuitions yet on where we might run into issues with too many threads? When I was playing around a few nights ago, I actually created about a hundred threads. :-)

1. Stream | Node.js documentation, https://nodejs.org/api/stream.html

.Seth

Yeah, we'd need to understand that. My intuition says it should be
doable, but I might be missing something here. Do you know a web page
or something that explains where the trouble is?

Robin

Woo. We definitely need to think about this a bit more though. I
don't really like the perl-ism inherent in writing commands with pipes
in the "file name".

I'm open to ideas. :-) The "classic" non-Perl way would obviously be
just providing a popen() that returns an instance of type file.

Do you have any intuitions yet on where we might run into issues with
too many threads?

No. My hope is that current OSs can deal well with *many* threads, in
particular if they are all under low load. What we'll eventually need,
though, is a way to clean up threads no longer in use. That's not done
currently.

That said, adding one thread per IOSource doesn't make much of a
difference, it's the logging that creates potentially many threads.

If things really don't work out, we'll need to switch to
one-thread-per-stream mode, but that gets more tricky on the backend
side. I would prefer to avoid that.

  When I was playing around a few nights ago, I actually created about
  a hundred threads. :-)

I'm not surprised. :-)

Robin

Woo. We definitely need to think about this a bit more though. I don't
really like the perl-ism inherent in writing commands with pipes in the
"file name".

I agree. One particular concern I have is that it makes it easier to screw
up and not properly escape/sanitize untrusted input that goes into the
"filename", which in this case instead allows shell command injection :-(.

Also, Robin, from what you sketch I'm not understanding how threading is
going to help. Are you moving away from the model that script execution
is atomic (other than "when" statements) and serialized? Wouldn't using
"when" statements of some form better fit here?

    Vern

I agree. One particular concern I have is that it makes it easier to screw
up and not properly escape/sanitize untrusted input that goes into the
"filename", which in this case instead allows shell command injection :-(.

Yeah, I'm sure there are nicer interfaces, though I'm not sure we can
really avoid the injection problem; in the end, we give the user the
power to run a shell one way or another (but we already do that with
system()).

Also, Robin, from what you sketch I'm not understanding how threading is
going to help. Are you moving away from the model that script execution
is atomic (other than "when" statements) and serialized? Wouldn't using
"when" statements of some form better fit here?

There are two different questions here: what the script-level
interface looks like, and how the implementation achieves that. I was
primarily talking about the latter: rather than manually interleaving
reading the pipe's output with the packet processing (which gets
cumbersome in particular if we need to support a potentially large
number of open pipes), we can have a thread execute the command and
take care of I/O. We already have the infrastructure to send results
back asynchronously into the main thread, where it can turn into
whatever we need. (Assuming any potential pthread/fork problems can be
solved.)
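
To make that concrete, here's a toy C sketch of the mechanism (names like run_command are hypothetical, just to show the shape of it): a worker thread owns the child process and its I/O, so it can block freely, and whatever it reads would be handed back to the main thread.

#include <pthread.h>
#include <stdio.h>
#include <string.h>

/* Worker thread: runs the command and blocks on its output without
   stalling the main loop. */
static void* run_command(void* arg)
{
    FILE* p = popen((const char*)arg, "r");
    if ( ! p )
        return 0;

    char line[4096];
    while ( fgets(line, sizeof(line), p) ) {
        line[strcspn(line, "\n")] = '\0';
        /* This is where the result would be queued back to the main
           thread as an event; printing stands in for that here. */
        printf("got: %s\n", line);
    }

    pclose(p);
    return 0;
}

int main(void)
{
    pthread_t worker;
    pthread_create(&worker, 0, run_command, (void*)"echo hello | shasum");

    /* ... the main loop would keep processing packets here ... */

    pthread_join(worker, 0);
    return 0;
}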

Regarding what the interface looks like, there are a number of
options. Using "when" is one, we could indeed feed in there. But I'm
not sure it's right model here: when would work best for simple
one-request-one-reply style I/O but with pipes we may want more: keep
writing into it, and keep reading out. That would work better with a
file-like object ones prints to, and any output turning into events.
But there may be still better models that that.

Robin

Yeah, I'm sure there are nicer interfaces, though I'm not sure we can
really avoid the injection problem

Right. My point is how *easy* it is. The issue with building piping
into open() is that the script writer might not even remember that the
feature is there. Thus, if they construct a filename from untrusted
input, it could wind up starting with '|', which was never anticipated.
At least with something like popen() it's clear up-front: "whoa, this
is running a command".

but with pipes we may want more: keep
writing into the pipe, and keep reading out of it. That would work
better with a file-like object one prints to, with any output turning into events.

I see. Yeah, for that, what you sketch makes more sense.

    Vern

I keep thinking that we just need to provide the connection between the input framework and subprocesses. From the script-land perspective, maybe something like this?

# The sub process is defined
SubProcess::new("sha_hash", [$cmd="shasum"]);

# STDIN is connected to a Bro file.
local sha_command = SubProcess::get_stdin("sha_hash");

# STDERR and STDOUT are connected to inputs
Input::add_event([$name="sha_hash",
                  $source=SubProcess::get_stdout("sha_hash"),
                  $fields=ShaVal, $ev=sha_line,
                  $mode=Input::STREAM, $reader=Input::READER_RAW]);
Input::add_event([$name="sha_hash",
                  $source=SubProcess::get_stderr("sha_hash"),
                  $fields=ShaVal, $ev=sha_line,
                  $mode=Input::STREAM, $reader=Input::READER_RAW]);

# The subprocess is actually executed.
SubProcess::run("sha_hash");

# Send data to STDIN of the command.
print sha_command, "some data";

# The command dies and the input streams are destroyed.
close(sha_command);

I think that SubProcess::get_stdout and SubProcess::get_stderr would return strings (file names) and SubProcess::get_stdin would return a file-typed variable. It's all pretty verbose, but I don't think we have many options if we want to keep things asynchronous and playing nicely with the event loop. Thoughts?

  .Seth

Yeah, that's my main concern. It's pretty complex for "just" opening a
pipe. Though perhaps we could hide some of the boilerplate in a
function that takes care of the most common case, like reading a list
of lines as strings.

Robin