(This is mostly for Robin and Jon, but I'm CC-ing bro-dev as well to solicit others' thoughts on the topic.)
I'd appreciate it if you guys took a look at the metrics framework and let me know what you think about it. There's a lot missing still, but it's a decent start. Running it on a tracefile with the metrics/*-example.bro scripts and taking a look at the output is probably one of the best ways to learn what it's doing. I'll also document some features I'm still planning to add below.
Currently, a metric in the framework is just a key (or keys) connected to a number, which is collected over some interval before being written to disk and reset. One metric you could collect is the number of established TCP connections. Going further, you can imagine only collecting the metric for local addresses, or collecting it separately for every /24 or for every known internally allocated subnet. Changing the break time per metric could be useful too, because some values you may only care about per hour while others you want every 15 seconds.

A subnet is one aggregation technique and results in one of the possible indexes: a user could configure that all metrics be aggregated to a /24 netmask length instead of being calculated per individual IP address. The other index is a freeform string field which can represent anything. It's possible to use either index, both, or neither to aggregate the data.
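To make the model concrete, here's a rough Python stand-in (not Bro script, and all names are invented for illustration) for the bookkeeping I'm describing: counters keyed by a string index and/or an aggregated subnet, flushed and reset at each break interval:

```python
import ipaddress
from collections import defaultdict

class Metric:
    """One metric: counters keyed by (string index, aggregated subnet)."""
    def __init__(self, aggregation_mask=24):
        self.aggregation_mask = aggregation_mask
        self.counters = defaultdict(int)

    def add(self, index="", host=None, amount=1):
        # Either index may be unused; a host is collapsed into its
        # enclosing /mask network before counting.
        net = None
        if host is not None:
            net = ipaddress.ip_network(
                "%s/%d" % (host, self.aggregation_mask), strict=False)
        self.counters[(index, net)] += amount

    def break_interval(self):
        # At each break interval the counters are logged and reset.
        logged = dict(self.counters)
        self.counters.clear()
        return logged

m = Metric(aggregation_mask=24)
m.add(host="192.168.1.10")
m.add(host="192.168.1.77")   # same /24, so same bucket
m.add(host="192.168.2.5")
```

The point of the sketch is just that aggregation happens at add time, so the per-interval state stays small.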
Here are some thoughts about missing features:
- Missing support for cluster deployment. The way this will work (I think) is that the manager will handle the break interval (the time between metrics collection/logging and reset): it will send a metrics collection event to the workers, each worker will send its metrics data back to the manager, and the manager will add the data from all workers together and log the result. This would essentially act as a lazy synchronization technique.
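The manager-side merge step is simple enough to sketch. Assuming each worker ships back a plain table of counters (hypothetical shape, Python stand-in), the manager would just sum them before logging:

```python
from collections import Counter

def merge_worker_results(worker_tables):
    """Manager-side merge: sum the per-worker counters for one metric.
    The combined total is what gets logged for the break interval."""
    total = Counter()
    for table in worker_tables:
        total.update(table)
    return dict(total)

worker1 = {"192.168.1.0/24": 10, "192.168.2.0/24": 3}
worker2 = {"192.168.1.0/24": 7}
```

Since counters are additive, the merge is order-independent, which is what makes the lazy approach viable.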
- Missing statistical support. I want to be able to define when notices should be raised based on the rate of change of a metric (per break interval), e.g. when it increases much faster than you think it should (SSH failed logins). There's probably a lot of other stuff in this area I haven't thought of yet.
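A crude version of that check might just compare each interval's count against the previous one (Python sketch; the threshold values and function name are invented):

```python
def check_rate_of_change(previous, current, factor=5.0, minimum=20):
    """Return True if a notice should be raised: the metric grew by more
    than `factor` over one break interval and isn't just noise."""
    if current < minimum:
        return False          # ignore tiny counts
    if previous == 0:
        return True           # went from nothing to something significant
    return current / previous > factor

# e.g. SSH failed logins: 4 last interval, 40 this interval -> notice
```

Anything smarter (moving averages, standard deviations) would follow the same shape: state carried across break intervals plus a predicate.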
- Need a way to refine metrics (e.g. to add netmask aggregation or an aggregation table) without defining it in the Metrics::create function, so that we can ship a metrics collection script that the user can subsequently reconfigure. For example, maybe they want to track successful outbound TCP connections by department. They could supply a table like:
const subnets_to_depts = table(
    [22.214.171.0/24] = "Accounting",
    [126.96.0.0/16]   = "Sales",
    [188.8.131.0/24]  = "Accounting"
);
and using that as the address aggregation table would aggregate the metrics per department instead of just per /24 or some other arbitrary netmask length.
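In other words, the per-address lookup the framework would do is a longest-prefix match into that table. A Python stand-in (using the same example subnets, normalized to network addresses; the function name is made up):

```python
import ipaddress

subnets_to_depts = {
    ipaddress.ip_network("22.214.171.0/24"): "Accounting",
    ipaddress.ip_network("126.96.0.0/16"):   "Sales",
    ipaddress.ip_network("188.8.131.0/24"):  "Accounting",
}

def department_for(host):
    """Aggregate an address into its department; longest prefix wins.
    Returns None if no configured subnet matches."""
    addr = ipaddress.ip_address(host)
    best = None
    for net, dept in subnets_to_depts.items():
        if addr in net:
            if best is None or net.prefixlen > best[0].prefixlen:
                best = (net, dept)
    return best[1] if best else None
```

Note that two subnets mapping to the same department ("Accounting") naturally merge into one bucket, which is exactly the point.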
- I need to write a command-line tool to convert the log into something that Graphviz can understand, because I'd like to be able to generate time-series graphs from these metrics really easily.
Here's the list of metrics I'm working towards currently...
1. Youtube video views.
2. HTTPS requests per host header (using a new SSL analyzer that provides the information from the SSL establishment). This is also an example of a metric that isn't based on IP addresses.
3. TCP connections originated and responded to.
4. HTTP SQL injection requests (raise a notice when too many attacks are seen).
5. HTTP requests sub-indexed by the status code returned from the server. How many 404, 200, and 500 status codes are seen per client IP address?
6. SSH failed connections (too many failures obviously looks like scanning).
7. DNS requests (watching for spikes could reveal DNS tunnels or DNS amplification attacks).
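Number 5 is a good stress test for the indexing scheme, since it needs both an address index (the client) and a string sub-index (the status code) at the same time. A Python stand-in for that composite key (names invented for illustration):

```python
from collections import defaultdict

http_status_metric = defaultdict(int)

def count_http_response(client_ip, status_code):
    """Index by client address, sub-index by the server's status code."""
    http_status_metric[(client_ip, str(status_code))] += 1

count_http_response("10.0.0.5", 404)
count_http_response("10.0.0.5", 404)
count_http_response("10.0.0.5", 200)
```

If the framework's two-index scheme (address + freeform string) can express this cleanly, most of the other metrics on the list fall out for free.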
Sorry for the rambling email, but my thoughts on this script are still mixing around a bit.