Email Link Extraction

Hi list,

Is there an easy way to extract links from emails, similar to the way smtp_entities processes attachments?

Thanks in advance!

Jason

Yeah, I'll second that. Email packet captures make finding links a challenge, because quoted-printable emails split the links across lines. This would really help to correlate a user click to the actual email in a fraction of the time. Thank you.

James

+1 on that.

At first glance this seems like all it needs is an appropriate regex. But then consider: any string containing both "." and "/" might be a candidate. (Actually, just a string containing "." with no space around it.)

So, this might range from the full regex to detect '<a href=".+">.+</a>' to just '\s.+\..+\s' (Perl regex used).

I'd welcome attempts to work on this. Even if the result does not catch everything, anything it does catch would be better than what we have now.
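
To make the two ends of that range concrete, here is a rough sketch outside of Bro, in Python. The pattern names, the helper functions, and the example.com sample text are all made up for illustration, and neither pattern is meant as a finished extractor. The strict pattern only sees well-formed anchor tags, while the loose one accepts any whitespace-delimited token with a dot in the middle.

```python
import re

# Strict end of the range: capture the href value of an anchor tag.
# (Hypothetical pattern; it assumes double-quoted, already-decoded HTML.)
HREF_RE = re.compile(r'<a\s[^>]*href="([^"]+)"', re.IGNORECASE)

# Loose end of the range: any run of non-space characters that has a dot
# with at least one non-space character on each side of it.
DOTTED_RE = re.compile(r'\S+\.\S+')


def strict_links(text):
    """Return href values found in anchor tags."""
    return HREF_RE.findall(text)


def loose_candidates(text):
    """Return every dotted token, including a lot of noise."""
    return DOTTED_RE.findall(text)


sample = 'See <a href="http://example.com/a/b">this</a> or example.org/x today.'
print(strict_links(sample))
print(loose_candidates(sample))
```

The loose pattern does pick up the bare example.org/x that the strict one misses, but it also returns the whole href="..." token with its markup attached, so its output would still need cleaning up.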

Here's a special just from this morning (xx's added):

Hello,

Please view the document i uploaded for you using Google docs.
*VIEW <hxxp://mensmentis.hu/godocs/index.htm> HERE *just sign in with your
email to view the document its very important

Regards

And the quoted-printable content (it's a hoot):

<a rel=3D"nofollow" href=3D"hxxp://mensmenti=
s.hu/godocs/index.htm" target=3D"_blank" style=3D"color:rgb(40,98,197);outl=
ine-width:0px">VIEW=A0</a>

I'm guessing that some normalization will be needed to strip the =3D escapes and any soft line-break ='s that land inside links. Alternatively, just match on "http://" and call it good. Hope the above shows up right.
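
As a rough illustration of that normalization step, done outside of Bro, here is a Python sketch that runs the sample above through the standard library's quopri decoder before looking for href values. The helper name decode_qp, the HREF_RE pattern, and the latin-1 charset are assumptions made for the example; a real message declares its charset in its Content-Type header.

```python
import quopri
import re


def decode_qp(raw: bytes) -> str:
    """Decode quoted-printable bytes into text.

    quopri drops soft line breaks ("=" at end of line) and turns =XX
    escapes such as =3D back into the bytes they stand for.
    latin-1 is an assumption for this sample, not a general rule.
    """
    return quopri.decodestring(raw).decode("latin-1")


# Hypothetical pattern: grab double-quoted href values from decoded HTML.
HREF_RE = re.compile(r'href="([^"]+)"', re.IGNORECASE)

# The quoted-printable sample from the message above, defanged as posted.
raw = (
    b'<a rel=3D"nofollow" href=3D"hxxp://mensmenti=\n'
    b's.hu/godocs/index.htm" target=3D"_blank" '
    b'style=3D"color:rgb(40,98,197);outl=\n'
    b'ine-width:0px">VIEW=A0</a>\n'
)

decoded = decode_qp(raw)
links = HREF_RE.findall(decoded)
print(links)
```

Once the soft line breaks are gone the link is whole again, so even a simple pattern finds it. Matching on "http://" alone against the raw text would have stopped at the break in the middle of the hostname.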

James

This is far from perfect, for the reason you pointed out, but it's a start. The code snippet below is from the next release of Bro; you just call find_all_urls_without_scheme with the string that you want to extract URLs from:

const url_regex = /^([a-zA-Z\-]{3,5})(:\/\/[^\/?#"'\r\n><]*)([^?#"'\r\n><]*)([^[:blank:]\r\n"'><]*|\??[^"'\r\n><]*)/ &redef;

## Extracts URLs discovered in arbitrary text.
function find_all_urls(s: string): string_set
  {
  return find_all(s, url_regex);
  }

## Extracts URLs discovered in arbitrary text without
## the URL scheme included.
function find_all_urls_without_scheme(s: string): string_set
  {
  local urls = find_all_urls(s);
  local return_urls: set[string] = set();
  for ( url in urls )
    {
    local no_scheme = sub(url, /^([a-zA-Z\-]{3,5})(:\/\/)/, "");
    add return_urls[no_scheme];
    }

  return return_urls;
  }

  .Seth

Thanks, Seth. As I'm still a complete newbie with Bro, I'm guessing the above can go in local.bro? Thank you.

James