metrics framework

(This is mostly for Robin and Jon, but I'm CC-ing bro-dev as well to solicit others' thoughts on the topic.)

I'd appreciate if you guys took a look at the metrics framework and let me know what you think about it. There's a lot missing still, but it's a decent start. Running it on a tracefile with those metrics/*-example.bro scripts and taking a look at the output is probably one of the best ways to learn what it's doing. I'll document some features I'm still planning on adding to it here too.

Currently, in the metrics framework a metric is just a key (or keys) connected to a number that is collected over some interval before being written to disk and reset. One metric you could collect is the number of established TCP connections. Going further, you can imagine only wanting to collect the metric for local addresses, or collecting it separately for every /24 or every known internally allocated subnet. Changing the break interval per metric could be useful too: some values you may only care about once per hour, while for others you want to collect the metric every 15 seconds.

A subnet is one aggregation technique and results in one of the possible indexes; a user could configure that they want all metrics aggregated at a /24 netmask length instead of calculating the metric per individual IP address. The other index is a freeform string field which can represent anything. It's possible to use either index, both, or no indexes to aggregate the data.
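
Roughly, the shape I have in mind looks like this (illustrative only, not the framework's actual API):

# Hypothetical sketch: a count keyed by an address-derived index and a
# freeform string index, bumped as events arrive and reset on each break
# interval.
global established_tcp: table[addr, string] of count &default = 0;

event connection_established(c: connection)
    {
    # The framework would optionally collapse the address into a /24 (or a
    # configured subnet) before using it as the index; the string index is
    # unused for this particular metric.
    ++established_tcp[c$id$orig_h, ""];
    }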

Here are some thoughts about missing features:

- Missing support for cluster deployment. The way this will work (I think) is that the manager will handle the break interval (the time between metrics collection/logging and reset) and it will send a metrics collection event to the workers, which will send their metrics data to the manager, where the data from each worker will be added together and logged. This would essentially act as a lazy synchronization technique.

- Missing statistical support. I want to be able to define when notices should happen based on the rate of change of a metric (per the break interval) increasing much faster than you think it should (e.g., SSH failed logins). There's probably a lot of other stuff in this area I haven't thought of.

- Need a way to refine metrics (adding things like netmask aggregation or an aggregation table) without defining it in the Metrics::create function, so that we can ship a metrics collection script that the user can subsequently reconfigure. For example, maybe they want to track successful outbound TCP connections by department. They could supply a table like:
  const subnets_to_depts = table(
      [1.2.3.0/24] = "Accounting",
      [4.3.0.0/16] = "Sales",
      [4.3.2.0/24] = "Accounting"
  );
and using that as the address aggregation table would aggregate the metrics per department instead of just per /24 or some other arbitrary netmask length.
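
For instance, a hypothetical helper (not part of the framework) could map an originating address to its department using that table:

function dept_for(ip: addr): string
    {
    for ( s in subnets_to_depts )
        {
        if ( ip in s )
            return subnets_to_depts[s];
        }
    return "other";
    }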

- I need to write a command line tool to convert the log into something that Graphviz can understand because I'd like to be able to generate time-series graphs from these metrics really easily.

Here's the list of metrics I'm working towards currently...
1. Youtube video views.
2. HTTPS requests per host header (using a new SSL analyzer that provides the information from the SSL establishment); this is an example of a non-IP-address-based metric too.
3. TCP connections originated and responded to.
4. HTTP SQL injection requests (raise a notice when there are too many attacks).
5. HTTP requests subindexed by the status code returned from the server. How many 404, 200, and 500 status codes are seen per client IP address?
6. SSH failed connections (too many looks like scanning, obviously).
7. DNS requests (watch for spiking; could possibly find DNS tunnels or DNS amplification attacks).

Sorry for the rambling email, but my thoughts on this script are still mixing around a bit. :)

Thanks,
  .Seth

Maybe this is a long-term / out-of-scope feature, but what about adding an ability to bind OIDs to certain metrics and make them available via SNMP / NETCONF / etc.? This could integrate nicely with existing network monitoring / management infrastructure wherever bro might be deployed, and could also let existing tools handle the collection / graphing / visualization of metrics.

--Gilbert Clark

- I need to write a command line tool to convert the log into
something that Graphviz can understand because I'd like to be able to
generate time-series graphs from these metrics really easily.

Any particular reason for Graphviz?

If not, I've frequently used matplotlib, a Python-based graphing/plotting library, to whip up stuff like this.

Though, maybe we could get away with an even lighter weight library than either to do this.

- Jon

I'd appreciate if you guys took a look at the metrics framework and
let me know what you think about it.

I'd love to do so, yet my cycles only allow for brief inline feedback.

Currently, in the metrics framework a metric is just a key (or keys)
connected to a number that is collected over some interval before being
written to disk and reset.

If it is really just a sequence of numbers, why not call it a time
series? The word metric (in networking) implies some sort of property of
a path and, more generally, some sort of performance measure. This would
also make more sense in the statistical context, where time series
analysis is a well-defined field of its own. I prefer this term not only
because I have taken a statistics course, but mainly because it is more
neutral, maybe even more general, since it only describes the format of
the data.

The way this will work (I think) is that the manager will handle the
break interval (the time between metrics collection/logging and reset)
and it will send a metrics collection event to the workers which will
send their metrics data to the manager where the data from each worker
will be added together and logged. This would essentially act as a
lazy synchronization technique.

To preserve the temporal ordering, timestamps need to be part of the
synchronization game. It looks like a mergeable table indexed by
timestamp will do the trick.
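
E.g., something like this, assuming Bro's &synchronized/&mergeable attributes behave as needed here:

# One entry per break interval; with &mergeable, remote updates to the inner
# tables would be merged rather than overwritten.
global counts_over_time: table[time] of table[addr, string] of count
    &synchronized &mergeable;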

- Missing statistical support. I want to be able to define when
notices should happen based on rate of change of a metric (per the
break interval) increasing much faster than you think it should (SSH
failed logins). There's probably a lot of other stuff in this area I
haven't thought of.

There's a whole subfield of statistics waiting for you. The natural
question is how much of this should be in Bro versus offline log munging.
Clearly you're talking Bro. It seems what you would like to have is a
variance analysis on a detrended series (i.e., on the first-order
differences between consecutive data points). Another analysis would be
to check for seasonal components.

- I need to write a command line tool to convert the log into
something that Graphviz can understand because I'd like to be able to
generate time-series graphs from these metrics really easily.

Why not use R? It has brilliant time series support! (And there also
exist scripting language bindings if you really want a separate tool; I
tested the Ruby bindings once and they work well.)

2. HTTPS requests per host header (using a new SSL analyzer that
provides the information from the SSL establishment), this is an
example of a non-IP address based metric too.

Along those lines, one could (mis)use this new framework to count the
number of unique certificates per host as a crude way to identify TLS
MITM attacks.

    Matthias

I'd love to do so, yet my cycles only allow for brief inline feedback.

All thoughts are welcome. :)

Currently, in the metrics framework a metric is just a key (or keys)
connected to a number that is collected over some interval before being
written to disk and reset.

If it is really just a sequence of numbers, why not call it a time
series? The word metric (in networking) implies some sort of property of
a path and, more generally, some sort of performance measure. This would
also make more sense in the statistical context, where time series
analysis is a well-defined field of its own. I prefer this term not only
because I have taken a statistics course, but mainly because it is more
neutral, maybe even more general, since it only describes the format of
the data.

I think I agree with this. I'll probably change the name at some point.

To preserve the temporal ordering, timestamps need to be part of the
synchronization game. It looks like a mergeable table indexed by
timestamp will do the trick.

Hm... there isn't a whole lot of attention given to temporal ordering. The manager would just be asking for a particular measurement (conns originated, for example) along with the index or indexes (possibly each local /24 where a connection was originated?) and their counts. Once the workers send their values off to the manager, they reset back to zero and start counting up until the manager asks for the numbers again.

There is no actual attribute-based synchronization (&synchronize) going on because workers don't care about the values on other workers, and doing full variable synchronization would cause too much overhead.
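
In cluster terms, I'm picturing an exchange roughly like this (event names are hypothetical):

# Manager -> workers on each break interval.
global collect_metrics: event(metric_id: string);

# Worker -> manager with its local counts; the manager adds the tables from
# all of the workers together, logs the result, and each worker resets its
# counts back to zero.
global metrics_results: event(metric_id: string,
                              data: table[addr, string] of count);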

There's a whole subfield of statistics waiting for you. The natural
question is how much of this should be in Bro versus offline log munging.
Clearly you're talking Bro. It seems what you would like to have is a
variance analysis on a detrended series (i.e., on the first-order
differences between consecutive data points). Another analysis would be
to check for seasonal components.

You lost me at "variance analysis on a detrended series". :)

Anyway, what I'm searching for is just enough statistics (even if it's fake-ish pseudo-statistics) to be able to raise notices when statistically significant changes happen in the time series data. I just don't even know where to start with it.
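
The most I can picture right now is something dead simple like remembering the previous interval's value and flagging a big jump (made-up rule and numbers, obviously):

global prev_count: table[string] of count &default = 0;

function check_jump(metric_id: string, current: count)
    {
    local previous = prev_count[metric_id];

    # Hypothetical rule: flag if the value more than tripled since the last
    # break interval and isn't trivially small.
    if ( previous > 10 && current > 3 * previous )
        print fmt("possible anomaly in %s: %d -> %d", metric_id, previous, current);

    prev_count[metric_id] = current;
    }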

Why not use R? It has brilliant time series support! (And there also
exist scripting language bindings if you really want a separate tool; I
tested the Ruby bindings once and they work well.)

Someone mentioned to me recently that graphviz will even do ascii art based graphs in your terminal and it's already installed everywhere. This would just be a companion script so people would be free to write whatever works best for themselves.

2. HTTPS requests per host header (using a new SSL analyzer that
provides the information from the SSL establishment), this is an
example of a non-IP address based metric too.

Along those lines, one could (mis)use this new framework to count the
number of unique certificates per host as a crude way to identify TLS
MITM attacks.

Oof. Yes. :) I'm trying to implement this in a way that people will be able to do things like that though, so I suppose it's all good.

  .Seth

Someone mentioned to me recently that graphviz will even do ascii art
based graphs in your terminal and it's already installed everywhere.

That would be cool if it did, but I'm not so sure...

From what I remember, graphviz's strength was more for (un)directed graphs.

This would just be a companion script so people would be free to write whatever works best for themselves.

Ok, I think I understand better -- you just need some way to massage the logs into a form that's easily sucked up by some plotting software (user's choice).

FWIW, gnuplot can just read a file of one "x y" datapoint per line. And I know it does do ASCII plots just as easily as X11 ones. R should also easily read files formatted like that.

- Jon

I'd appreciate if you guys took a look at the metrics framework and
let me know what you think about it.

Pretty neat.

Thoughts:

    - I'd split configuration of the metrics framework from adding
      data. Currently the data producer also configures things via
      create(), but it seems that's something better left to the user
      of the metrics framework. Doing so would also answer your point
      on setting up aggregation without using the create() function.

      Can you just skip the create() function altogether? From the
      producer's perspective, that function isn't really doing
      anything, right?

      You would then instead provide a configure() function that a
      user of the metrics framework calls to define
      aggregation/break_interval/etc., either globally or optionally
      on a per-ID basis (see the sketch after this list).

      In the absence of any call to configure(), just pick some
      default, like aggregation per /24 and 10s intervals, or
      whatever.

    - I'd move the $increment field out of DataPlug and make it a
      separate argument to add_data(). It has different semantics than
      the other fields, and you could then rename DataPlug to just
      Index.

    - When no subnet aggregation is set but $host is passed in, I
      think it won't work correctly. Your example for
      HTTP_REQUESTS_BY_HOST uses $index for per-host aggregation, but
      that looks like cheating. :)

    - I'm wondering whether executing log_it() gets expensive when it
      needs to iterate through too many entries. An alternative would
      be to schedule a number of finer-grained timers (one per
      ID, or even one per aggregation unit); but then the log
      intervals would become desynchronized, which may not be
      desirable.
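
Here's roughly what I mean by the configure() point above, with completely made-up names and defaults:

module Metrics;

export {
    # Hypothetical sketch only.
    type Config: record {
        aggregation_mask: count &default = 24;        # aggregate hosts per /24
        break_interval:   interval &default = 10 secs;
    };

    # Called by the user of the metrics framework, either for a specific
    # metric ID or (with some wildcard ID) globally.
    global configure: function(id: string, conf: Config);
}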

- Missing support for cluster deployment.

Yeah, that's a tough one. Full &synchronize would be overkill, but
sending the data via events, like you suggest, also sounds quite
expensive if there are lots of entities for which something's counted.

Here's an alternative idea: don't do any communication at all, and
just let the workers log their metrics data separately (into the same
log file but including a node id column). Then provide a script that
postprocesses metrics.log by adding up all the workers' counts for the
same unit/time interval. This might cause slight time
desynchronizations, but I'm not sure how much impact that would have if
we set sufficiently large break intervals.

Perhaps the manager could trigger logging by sending the log_it()
events, and only then would all the workers go ahead and do their
output. If the log_it() event comes with a unique interval ID, the
workers can write that out as well, and then offline aggregation will be
really easy later (and if they also log their local timestamps, one can
see how well the timing matches).
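
Sketching that (record fields and event name hypothetical):

# What each worker would write on receiving log_it().
type Info: record {
    ts:          time;    # worker's local timestamp
    interval_id: count;   # assigned by the manager for each break interval
    node:        string;  # worker's node id, for offline summing
    metric_id:   string;
    index:       string;
    num:         count;
};

# Broadcast by the manager; each worker writes its counts tagged with the
# interval_id and its node name, then resets them.
global log_it: event(interval_id: count);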

- Missing statistical support.

I'd leave that out for the first version. Or just do a very simple
piece: static thresholds relative to the break intervals (i.e.,
provide a function add_threshold(id, value) that alarms if the counter
for ID id exceeds value).
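
I.e., something as minimal as this sketch (names made up):

global thresholds: table[string] of count;

function add_threshold(id: string, value: count)
    {
    thresholds[id] = value;
    }

# Then wherever a counter gets bumped:
#     if ( id in thresholds && counter > thresholds[id] )
#         ... raise a notice ...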

- I need to write a command line tool to convert the log into
something that Graphviz can understand because I'd like to be able to
generate time-series graphs from these metrics really easily.

As everybody is mentioning their favorite tools, let me throw in mine. :)
I also like matplotlib and R, in that order. But anything is fine with
me.

Robin

Nice idea, I could see that as an external tool that takes the
metrics.log as its input (or, at some point in the future, receives a
real-time feed of the data going into metrics.log).

Robin

One more regarding the TODO: "create a metrics config logging stream":

Indeed, having a record of how things were configured will be really
helpful for building long-term archives.

Robin

I like where this is going in general.

> - Missing support for cluster deployment.

Yeah, that's a tough one. Full &synchronize would be overkill, but
sending the data via events, like you suggest, also sounds quite
expensive if there are lots of entities for which something's counted.

What about a notion of "reduce", similar to the reduce operation in
map-reduce? It seems for a lot of metrics/statistics/time-series there
will be a natural way of combining parallelized computation of a given
sort over the sequence of values.

    Vern

I'm not completely sure that would apply because the only reduce operation that I'm currently envisioning is straight addition. It's basically taking the following example structure from all of the workers (with different counts on each worker obviously) and adding the values together on some break interval.

{
  [1.2.3.0/24] = { ["GET"] = 20, ["POST"] = 1 },
  [4.3.2.0/24] = { ["GET"] = 5304, ["POST"] = 45 },
  .... and on, and on, and on....
}

That would be an example of HTTP verbs used per /24 in requests. Each worker would have its own table, and on the break interval for the metric all of the values would be added together on the manager.
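
The addition on the manager would then be something like this (assuming, just for the sketch, a flat table indexed by [subnet, verb] rather than the nested one above):

function merge_counts(dst: table[subnet, string] of count,
                      src: table[subnet, string] of count)
    {
    for ( [net, verb] in src )
        {
        if ( [net, verb] !in dst )
            dst[net, verb] = 0;

        dst[net, verb] += src[net, verb];
        }
    }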

It's certainly possible that I'm just plain missing your point too. :)

  .Seth

I'm not completely sure that would apply because the only reduce operation
that I'm currently envisioning is straight addition.

Right, straight addition is very likely a predominant case. However,
I was considering that with the ability to apply a reduce operation, we
can support other computations, too. A simple example would be computing
maxima/minima.

    Vern

Ahhh, good point. I'll have to think about this a bit more.

Related to this, do you have any thoughts on syntax for being able to dynamically call functions at runtime? If we had this capability, a number of things would be easier. In this case, for instance, we could store the reduce operations as anonymous functions and call them dynamically when they're needed.
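
E.g., what I'd love to be able to write is something along these lines (assuming anonymous function values and calling through a table worked like this, which is exactly the syntax question):

type Reducer: function(a: count, b: count): count;

global reducers: table[string] of Reducer = {
    ["sum"] = function(a: count, b: count): count { return a + b; },
    ["max"] = function(a: count, b: count): count { return a > b ? a : b; },
};

# On the break interval, something like:
#     local combined = reducers["max"](worker_a_val, worker_b_val);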

.Seth