Questions about Bro Capabilities

I am looking at extending Bro to help with traffic isolation. What I need is to differentiate between traffic that matches a given set of criteria and traffic that does not. In general, I know this can be done through policies, and I believe I can do most of what I want within a policy. There are a few things that, after reading the documentation and doing some initial policy testing, I am still not certain about.

1) Is it possible to denote particular packets in a capture? I know most of the analysis is done on a flow/connection basis, but I was wondering if any information regarding the pcap was kept in the streams/records that are passed?

2) Is it possible to get the content from HTTP sessions? I want to be able to validate that the content is what I know to be on a given site. I know there are content_length and data_length values in the http_message record type, but I do not see much relating to the actual content.

Thanks for any help,
-Reed

> 1) Is it possible to denote particular packets in a capture?

No, not really. The main problem here is that the link between most
event handlers and the actual packets is pretty weak. In general,
Bro does not give guarantees about when a particular event is raised
and also doesn't keep track of which packet triggered it. There's a
function called get_current_packet() which returns the packet Bro is
currently munching on, but when script code is running it's hard to
predict which packet that actually is.

The only event which directly refers to packets is new_packet() but
using that is expensive because it is raised for *all* packets.

That said, perhaps we might be able to come up with some idea if you
sketch in a bit more detail what you're trying to achieve.

> 2) Is it possible to get the content from http sessions?

Yes, that's possible. The event for this is http_entity_data(); see
http-body.bro for an example that logs HTTP content into http.log.

Robin

In trying to get the contents of HTTP sessions, I ran http-body.bro against a pcap, and there is no output to http.log. The same is true of most http-<x> scripts, except http-reply. Looking at the traces, I see only 'connection_state_remove' from the main HTTP module for http-body, http, etc. With http-reply, I see the events in http-reply and the 'connection_state_remove', as expected.

I am running 1.3.2 on Ubuntu.

I have looked at the pcap and run it against tcptrace, and all seems good.

Any thoughts?

-Reed

The HTTP scripts are a bit different from other analyzers in the
sense that they are "incremental", i.e., you typically need to load
more than one depending on which parts of the HTTP sessions you want
to analyze.

The minimum is http-request.bro which analyzes client-side requests.
You can add http-reply.bro to also see the response code of the
servers. Then you can further add, e.g., http-body.bro, to get the
session payload and/or http-header.bro to see all HTTP headers.

So, in your case, this should do the trick:

    bro -r trace http-request http-reply http-body

Robin

That worked. Thanks.

-Reed

I am working on a Traffic Generator (TG) project. Our TG has static content for webpages and fileshares. In addition, we know when our TG hosts attempt to access that data. Given those two things, I want to be able to take a network capture, run it through a system, and separate the traffic that we know our TG generated (by correlating intent and traffic content) from the other traffic on the network. The end goal is smaller and more relevant network captures for an analyst. To do this, I want to leverage others' protocol analyzers and parsers. Bro seems to be a good choice, as I believe that through a policy and some pregenerated variables (based on the content and host intent) I can validate given traffic as coming from our TG system and leave the rest for others to analyze. I believe that in order to do this I need to get the relevant packets out of Bro, identified by either packet number or timestamp. Given that information, I would be able to run it through a script that splits the pcap based on the output. The added benefit of Bro is that it does some additional analysis that could be useful for capture analysis.

Is that a better sketch? Any thoughts?

Thanks in advance,
-Reed

On Wed, Oct 03, 2007 at 10:04:33AM -0400, Reed Porada composed:

> No, not really. The main problem here is that the link between most
> event handlers and the actual packets is pretty weak. In general,
> Bro does not give guarantees about when a particular event is raised
> and also doesn't keep track of which packet triggered it. There's a
> function called get_current_packet() which returns the packet Bro is
> currently munching on, but when script code is running it's hard to
> predict which packet that actually is.

For events which are effectively instantaneous (e.g., the ones
created directly and indirectly by the protocol parsing), though,
the packet returned by get_current_packet() will be PART of the
current stream. Due to reordering issues, however, it may not be the
last packet in the current stream in TCP sequence order.


What exactly are the defining characteristics of your synthetic traffic?

Our synthetic traffic is no different from what a normal user on a machine would generate: we use IE to navigate to a page, and we use Windows File Browsing to look at network file shares. Our TG is designed to be run on an isolated network, à la DETER, so we set up a simulated internet and other simulated networks. Since we are creating these networks, we control server content, IP addresses, and hostnames. Our belief is that since we know what our content is (i.e., what is at a given website or on a given file share) and we know when we tried to access the given data (our host agents log intent), we can separate out our TG traffic. In theory there is no defining characteristic of our synthetic traffic in the packet captures that we could make Bro, or really any other packet analyzer, look for; basically, we do not set the evil bit. However, with the additional knowledge of what the content is and what a synthetic user was doing, we believe we can find our traffic. After looking at the variables and other things that the Bro policy language has, I believe I can construct the lookup tables for host_agent_events and web_content, and therefore that I can create a policy script to "find" our traffic. What I am not sure of is whether, from the policy, I can provide the information necessary to get our traffic out of the capture, i.e., make a smaller capture with just the non-TG traffic.

Not sure if that answered your question, Nicholas, but hopefully it clears some things up.

-Reed
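
To make the correlation idea above concrete, here is a minimal Python sketch (outside Bro) of the check being described: a request counts as TG traffic only if a host agent logged the intent shortly beforehand and the returned body matches the known content. The record layouts, the 5-second window, and the use of SHA-256 are all assumptions made for illustration; only the table names host_agent_events and web_content come from the message.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Intent:
    """Hypothetical host-agent log entry: host intended to fetch url at ts."""
    ts: float
    host: str
    url: str


@dataclass(frozen=True)
class Observed:
    """Hypothetical record for one HTTP transfer seen in the capture."""
    ts: float
    host: str
    url: str
    body: bytes


def is_tg_request(obs, host_agent_events, web_content, window=5.0):
    """Return True if obs matches both known content and a logged intent.

    web_content maps url -> hex SHA-256 of the expected body (assumed layout).
    window is the allowed delay in seconds between intent and observation
    (an arbitrary value chosen for this sketch).
    """
    expected = web_content.get(obs.url)
    if expected is None or hashlib.sha256(obs.body).hexdigest() != expected:
        return False
    return any(
        ev.host == obs.host
        and ev.url == obs.url
        and 0.0 <= obs.ts - ev.ts <= window
        for ev in host_agent_events
    )
```

Anything for which is_tg_request() returns False, whether because the content differs or because no agent announced it, would be left in the capture for the analyst.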

On Wed, Oct 03, 2007 at 11:14:45AM -0400, Reed Porada composed:

>What exactly are the defining characteristics of your synthetic
>traffic?


One thought:

For offline processing, do a two-pass approach. In the first pass,
you use Bro to find the TG flows based on the higher-level attributes,
and write out the flow IDs. For the second pass, only capture the
flows which don't correspond.

Yeah, that was my thought too. (This is an offline scheme, isn't it?)

If I understood your approach correctly, you depend on
application-layer analysis to find "your" traffic. In that case,
doing it in a single pass would likely miss packets because you
might only be able to take the decision some way into the stream.

At the same time it also sounds like you're always cutting out
complete flows rather than just individual packets. So, a two-pass,
flow-based approach sounds indeed reasonable.

Does this make any sense?

Robin

> For offline processing, do a two-pass approach. In the first pass,
> you use Bro to find the TG flows based on the higher-level attributes,
> and write out the flow IDs. For the second pass, only capture the
> flows which don't correspond.
>
> Yeah, that was my thought too. (This is an offline scheme, isn't it?)

Yes, this is an offline scheme at this point.

> If I understood your approach correctly, you depend on
> application-layer analysis to find "your" traffic. In that case,
> doing it in a single pass would likely miss packets because you
> might only be able to take the decision some way into the stream.

For HTTP, yes, we depend on the application layer to validate the session, as we otherwise have no good way to validate individual packets.

> At the same time it also sounds like you're always cutting out
> complete flows rather than just individual packets. So, a two-pass,
> flow-based approach sounds indeed reasonable.

I believe that, in general, entire flows will be cut out, given that if a single packet is off it is hard to validate the rest as being ours. However, we would still like to make an educated guess as to the culprit packet if possible.

> Does this make any sense?

In general I understand what you and Nick have proposed, but I do not know how to get the flow IDs out. Are the http_request_stream$id's unique? One thing a co-worker suggested after looking at the output is that we have a timestamp, src ip/port, and dst ip/port, which within a pcap is generally sufficient for identifying a packet. My guess as to why you have not suggested this option is that the network_time() used in the output does not relate to the stream. Is there any way to make it correlate more closely with the stream? I am also curious how to interpret the output from http-body. What does each printout from an http_entity_data event represent? Is it a new packet, or an update to the stream that could be the sum of an arbitrary number of packets?

Thanks again for your time and help,

-Reed

On Thu, Oct 04, 2007 at 11:03:07AM -0400, Reed Porada composed:

>Does this make any sense?


With most hosts, the 5-tuple (src IP/port, dst IP/port, proto) should
be unique. So just record the 5-tuple of anything to exclude in a
file, and then use that file in the second pass to filter out those
connections.
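
The second pass could be sketched roughly as follows, in Python using only the standard library. This is an illustration under several assumptions, not something from the thread: the exclusion file is assumed to hold one "src sport dst dport proto" entry per line (a made-up format that the first pass would have to emit), the capture is assumed to be a classic little-endian pcap with Ethernet framing, and only unfragmented IPv4 TCP/UDP is matched. Everything else is passed through untouched.

```python
import socket
import struct

PCAP_GLOBAL_HDR = struct.Struct("<IHHiIII")  # classic pcap file header (24 bytes)
PCAP_PKT_HDR = struct.Struct("<IIII")        # ts_sec, ts_usec, incl_len, orig_len
PROTO_NAMES = {6: "tcp", 17: "udp"}


def flow_key(src, sport, dst, dport, proto):
    """Direction-independent 5-tuple, so both halves of a connection match."""
    a, b = (src, sport), (dst, dport)
    return (min(a, b), max(a, b), proto)


def load_excluded(lines):
    """Parse 'src sport dst dport proto' lines (assumed format) into a set."""
    excluded = set()
    for line in lines:
        fields = line.split()
        if len(fields) != 5 or line.lstrip().startswith("#"):
            continue
        src, sport, dst, dport, proto = fields
        excluded.add(flow_key(src, int(sport), dst, int(dport), proto.lower()))
    return excluded


def parse_five_tuple(frame):
    """Return the flow key of an Ethernet/IPv4 TCP or UDP frame, else None."""
    if len(frame) < 34 or frame[12:14] != b"\x08\x00":
        return None
    l4 = 14 + (frame[14] & 0x0F) * 4          # skip IPv4 header incl. options
    proto = PROTO_NAMES.get(frame[23])
    if proto is None or len(frame) < l4 + 4:
        return None
    src = socket.inet_ntoa(frame[26:30])
    dst = socket.inet_ntoa(frame[30:34])
    sport, dport = struct.unpack("!HH", frame[l4:l4 + 4])
    return flow_key(src, sport, dst, dport, proto)


def filter_pcap(infile, outfile, excluded):
    """Copy every packet whose flow is not excluded; return (kept, dropped)."""
    header = infile.read(PCAP_GLOBAL_HDR.size)
    magic, _, _, _, _, _, linktype = PCAP_GLOBAL_HDR.unpack(header)
    if magic != 0xA1B2C3D4 or linktype != 1:
        raise ValueError("sketch handles little-endian Ethernet pcap only")
    outfile.write(header)
    kept = dropped = 0
    while True:
        rec = infile.read(PCAP_PKT_HDR.size)
        if len(rec) < PCAP_PKT_HDR.size:
            break
        incl_len = PCAP_PKT_HDR.unpack(rec)[2]
        frame = infile.read(incl_len)
        if parse_five_tuple(frame) in excluded:
            dropped += 1
            continue
        outfile.write(rec + frame)
        kept += 1
    return kept, dropped
```

To use it, open the exclusion list as text and both captures in binary mode, then call filter_pcap(src, dst, load_excluded(listfile)). Because the key ignores direction, one line per connection is enough. If the same 5-tuple is reused later in a long capture, a timestamp range per entry would be needed as well, which this sketch does not attempt.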