Some issues with find_all_urls() function

Hi Everyone,

I’m using the “find_all_urls()” function from urls.zeek to extract all URLs from HTTP bodies. I occasionally get errors such as these:

1485557634.826679 error in /usr/local/zeek/share/zeek/base/utils/urls.zeek, line 122: bad conversion to count (to_count(parts[1]) and answers:PersonalBing:EZBubbleClose) no-repeat center;width:11px;height:11px;background-position-y:-10px}#hp_bottomCell #ezp_notification #ezp_bubble .ezp_bubble_close:hover{background-position-y:0}.ezp_location{font:14px)

1485557634.826679 error in /usr/local/zeek/share/zeek/base/utils/urls.zeek, line 122: bad conversion to count (to_count(parts[1]) and answers:PersonalBing:EZPanelClose) no-repeat center;width:11px;height:11px}.ezp_module{float:left;height:269px;width:255px;margin:25px 0;padding:0 42px}.ezp_module.ezp_module_narrow{width:122px}.ezp_module_leftseparator{border-left:1px solid #222}.ezp_module_title{font-size:20px;line-height:24px;margin-bottom:11px}.ezp_module_desc{font-size:16px;line-height:20px;margin-bottom:20px}.ezp_interests_icon{vertical-align:middle}.ezp_option_control{background:url(rms:)

1485557634.826679 error in /usr/local/zeek/share/zeek/base/utils/urls.zeek, line 122: bad conversion to count (to_count(parts[1]) and answers:PersonalBing:EZPanelClose) no-repeat center;width:11px;height:11px;position:relative;top:-22px;left:-10px}#hp_tbar.ezp_signin_message{background-image:-webkit-gradient(linear,left top,left bottom,from(rgba(0,0,0,.55)),to(rgba(0,0,0,.85)));background-image:-moz-linear-gradient(rgba(0,0,0,.55) 0,rgba(0,0,0,.85) 80%);background-image:-ms-linear-gradient(rgba(0,0,0,.55) 0,rgba(0,0,0,.85) 80%);background-image:-o-linear-gradient(rgba(0,0,0,.55) 0,rgba(0,0,0,.85) 80%);background-image:linear-gradient(rgba(0,0,0,.55) 0,rgba(0,0,0,.85) 80%)}.ezp_opened .ezp_barrier{display:block;background-color:#000;height:111px;margin:0 40px;position:relative;top:-185px;opacity:0}#sc_mdc.loading+.ezp_panelopened{margin-top:-46px}.ezp_icon{position:relative;top:-5px;left:0;cursor:pointer;background-color:rgba(34,34,34,.75);margin-right:1px;margin-bottom:-7px;-webkit-margin-after:-5px}#ezp_bubble_message{position:absolute;left:30px;background-color:rgba(0,0,0,.8);color:#fff;border:1px solid #333;padding:0 12px;font-size:13px;line-height:40px;height:40px;opacity:0}#ezp_bubble_message .ezp_info{vertical-align:middle;margin-right:12px}#ezp_bubble_message .ezp_bubble_down{background:url(rms:)

1378597102.912603 error in /usr/local/zeek/share/zeek/base/utils/urls.zeek, line 122: bad conversion to count (to_count(parts[1]) and )

www.iec.ch\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x16IEC http

I have a couple of questions regarding this:

  1. When trying to resolve some of these issues, should I directly modify urls.zeek or will this have unintended consequences for other scripts/functionality in Zeek? The reason I ask is that when printing URLs extracted with the find_all_urls() function I get some results which are clearly not valid URLs e.g. “http://www.yootheme.com/license) */” - this should have been cut off before the “)”, which I believe is a bug in urls.zeek rather than intended functionality that I’d like to change.

  2. Assuming I don’t manage to fix all of these errors and choose to accept some, how can I stop them from printing to console each time I process a PCAP?

  3. While trying to fix some of these errors with regex, I ran into the example “www.iec.ch\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x16IEC http”. I’ve tried to strip everything after the first “\” but this doesn’t work due to it being hex (I guess) rather than an actual “\”. Any ideas for this specific case?

  4. Finally, a regex related question I’ve been meaning to ask for a while. Because I’m trying to extract URLs from HTML/JS, I need to deal with cases where whitespace and multiple types of quote character may be used. When I’ve written projects in Python, I would create a variable with all of the possible characters in it and then use this variable in the regex e.g.

    import re

    # q: the quote/whitespace fragment reused throughout the pattern
    q = r"[\'\"\s](?:&quot|')"
    pattern = q + r"userTokens" + q + r"(?::|=)" + q + r"(\w+)" + q
    if re.search(pattern, data):
        pass  # do something…

I can’t work out how to do this with regex in Bro/Zeek scripts, so I’m having to create incredibly long patterns to ensure all possible cases are met. If anybody can recommend a better way (like how I did it in Python), that would be awesome!

Thanks in Advance,

Jonah (CryptoCat)

1) When trying to resolve some of these issues, should I directly modify urls.zeek or will this have unintended consequences for other scripts/functionality in Zeek? The reason I ask is that when printing URLs extracted with the find_all_urls() function I get some results which are clearly not valid URLs e.g. "http://www.yootheme.com/license) */" - this should have been cut off before the ")", which I believe is a bug in urls.zeek rather than intended functionality that I'd like to change.

Since the types of changes you'd be making are purely bug fixes, it
would be great if you fork Zeek on GitHub, modify urls.zeek directly
with your fixes, and then submit a Pull Request so everyone can
benefit from the improvements.

2) Assuming I don't manage to fix all of these errors and choose to accept some, how can I stop them from printing to console each time I process a PCAP?

You could do:

    redef Reporter::errors_to_stderr=F;

(But these errors are symptoms of real bugs that we'd want to
eventually fix in urls.zeek)

3) While trying to fix some of these errors with regex, I ran into the example "www.iec.ch\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x16IEC http". I've tried to strip everything after the first "\" but this doesn't work due to it being hex (I guess) rather than an actual "\". Any ideas for this specific case?

"Strip everything after first non-printable character" could look like:

    print gsub(input_string, /[^[:print:]].*$/, "");
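As a rough sketch of how that could be wrapped up for reuse (the helper name strip_nonprintable_tail is made up for illustration and is not part of urls.zeek, and the sample string is a shortened version of the one from the error output above):

```zeek
# Hypothetical helper, not part of urls.zeek: drop everything from the
# first non-printable byte to the end of the string.
function strip_nonprintable_tail(s: string): string
    {
    return gsub(s, /[^[:print:]].*$/, "");
    }

event zeek_init()
    {
    # Shortened sample modelled on the value in the error output.
    local raw = "www.iec.ch\x00\x00\x00\x16IEC http";
    print strip_nonprintable_tail(raw);
    }
```

The idea would be to run candidate strings through a helper like this before handing them to anything that calls to_count() on a piece of them.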

4) Finally, a regex related question I've been meaning to ask for a while. Because I'm trying to extract URLs from HTML/JS, I need to deal with cases where whitespace and multiple types of quote character may be used. When I've written projects in Python, I would create a variable with all of the possible characters in it and then use this variable in the regex e.g.

The pattern conjunction/concatenation operator, &, may do what you
want, docs at:

    Types — Book of Zeek (git/master)

The example shown there for combining patterns:

    /foo/ & /bar/ in "foobar"
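Applied to the Python example from the question, a minimal sketch might look like the following. The fragment q, the test string, and the choice of hex escapes and POSIX classes (instead of \s and \w) are assumptions for illustration, not something taken from urls.zeek:

```zeek
event zeek_init()
    {
    # Reusable fragment: any run of double quotes (\x22), single
    # quotes (\x27) or whitespace. Extend it as needed, e.g. for &quot;.
    local q = /[\x22\x27[:space:]]*/;

    # Concatenate the fragments with '&' to build the full pattern.
    local pat = q & /userTokens/ & q & /(:|=)/ & q & /[[:alnum:]_]+/ & q;

    # 'pattern in string' tests for a match anywhere in the string.
    print pat in "var cfg = { 'userTokens' : 'abc123' };";
    }
```

Keeping the shared fragment in one variable means a newly discovered quote variant only has to be added in one place.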

- Jon

Thanks, I think that’s just what I was looking for with the regex variables. Does that mean I need to add ‘i’ after each of the concatenated patterns for it to be case insensitive?

e.g.

q = /[\'\"\s](?:&quot|')/i

q* & /test/i & q & /test2/i & q & /test3/i

The string_to_pattern function will be very handy too.

Regarding my last message, I realised I can also use find_all instead of match_pattern to find all occurrences, so that’s awesome.

Thanks,

Jonah