case-insensitive patterns

Vern · June 29, 2018, 7:00pm

Once I wound up monkeying around with the internals of the pattern-matching
code (to fix leaks, because Johanna [correctly] pushed back on adding the
&/| operators for general use if they leaked, which an old ticket indicated
they would) ... I thought what-the-heck, it's time for supporting
case-insensitive patterns.

This turned out to be tricky to implement, as I gleaned from talking with
Seth about an approach he had tried a while back but abandoned. But I now
have it working. Here's the blurb from the NEWS entry in the
topics/vern/case-insensitive-patterns branch:

- You can now specify that a pattern matches in a case-insensitive
  fashion by adding 'i' to the end of its specification. So for example
  /fOO/i == "Foo" yields T, as does /fOO/i in "xFoObar". Characters
  enclosed in quotes however keep their casing, so /"fOO"/i in "xFoObar"
  yields F, though it yields T for "xfOObar".

  You can achieve the same functionality for a subpattern enclosed in
  parentheses by adding "+i" to the open parenthesis, optionally followed
  by whitespace. So for example "/foo|(+i bar)/" will match "BaR", but
  not "FoO".

  For both ways of specifying case-insensitivity, characters enclosed in
  double quotes maintain their case-sensitivity. So for example /"foo"/i
  will not match "Foo", but it will match "foo".

The funky (+i ...) syntax isn't meant for general user consumption (though
it's okay if a user wants to use it directly), but rather is how I implemented
/pattern/i functionality. Basically, /pattern/i turns into /(+i pattern)/.
That switch is necessary because the robust way to implement case-insensitive
patterns, such that they can be composed with the & and | operators and
behave as expected, is to modify the parsing of REs to turn any instance
of a letter into a character class (so that /foo/ becomes /[Ff][Oo][Oo]/,
just like people have been doing by hand for years), and also to modify
the parsing of character classes. That requires alerting the RE scanner
that it's doing a case-insensitive (sub)pattern, which in turn requires
a prefix operator that specifies case-insensitivity.
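
The letter-to-character-class rewriting described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the code in the branch: the helper name expand_case_insensitive is hypothetical, it handles only ASCII letters, and it does not rewrite letters inside existing [...] character classes, which a real implementation also has to handle.

```python
import re
import string


def expand_case_insensitive(pattern: str) -> str:
    """Rewrite each bare ASCII letter as a two-character class, e.g. f -> [Ff].

    Hypothetical helper for illustration. Text inside double quotes keeps
    its case, mirroring the quoting rule in the NEWS entry.
    """
    out = []
    in_quotes = False
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            # Copy escape sequences such as \. through untouched.
            out.append(pattern[i:i + 2])
            i += 2
            continue
        if ch == '"':
            # Quotes toggle literal mode and are dropped from the output.
            in_quotes = not in_quotes
        elif in_quotes:
            # Quoted characters are matched literally, case preserved.
            out.append(re.escape(ch))
        elif ch in string.ascii_letters:
            out.append(f"[{ch.upper()}{ch.lower()}]")
        else:
            out.append(ch)
        i += 1
    return "".join(out)


expanded = expand_case_insensitive("fOO")
quoted = expand_case_insensitive('"fOO"')
```

Here expanded is the string [Ff][Oo][Oo], so re.search(expanded, "xFoObar") finds a match, while the quoted form stays fOO and only matches input with exactly that casing.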

Let me know if you have any concerns. Otherwise, I'll tee this up
for merging early next week.

Vern

johanna · June 29, 2018, 7:14pm

> Once I wound up monkeying around with the internals of the pattern-matching
> code (to fix leaks, because Johanna [correctly] pushed back on adding the
> &/| operators for general use if they leaked, which an old ticket indicated
> they would) ... I thought what-the-heck, it's time for supporting
> case-insensitive patterns.

Thanks a lot for tracking down the memory leaks - I know that has been a pain.

> This turned out to be tricky to implement, as I gleaned from talking with
> Seth about an approach he had tried a while back but abandoned. But I now
> have it working.

This is great - case-insensitive patterns have been something that I wanted
to have for a long time.

>   You can achieve the same functionality for a subpattern enclosed in
>   parentheses by adding "+i" to the open parenthesis, optionally followed
>   by whitespace. So for example "/foo|(+i bar)/" will match "BaR", but
>   not "FoO".

Hum. Is there a reason why we come up with our own syntax for this? Other
implementations already have this, using just a slightly different syntax.

To do the same in perl, you would use "/foo|(?i:bar)/". It also supports
turning off case insensitivity for part of a pattern by doing
"/foo|(?-i:bar)/". Furthermore you can also switch it on for the rest of
the pattern by doing (?i) - after that everything is insensitive.
The perlre documentation (Perl regular expressions, on Perldoc Browser) has
more details.

Python supports the exact same syntax. And - to make things easier for
users I think it would be way nicer if we just also would do this.
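
For reference, here is a short sketch of how these three forms behave in Python's re module (Python 3.6 or later for the scoped forms). The variable names are only for illustration, and this shows Python's behavior, not what the branch implements.

```python
import re

# Scoped case-insensitivity: only "bar" ignores case.
scoped_on = re.compile(r"foo|(?i:bar)")

# Scoped case-sensitivity: the whole pattern ignores case except "bar".
scoped_off = re.compile(r"foo|(?-i:bar)", re.IGNORECASE)

# Inline flag at the start: everything after it ignores case.
global_on = re.compile(r"(?i)foo|bar")

results = {
    "scoped_on BaR": scoped_on.fullmatch("BaR") is not None,
    "scoped_on FoO": scoped_on.fullmatch("FoO") is not None,
    "scoped_off FOO": scoped_off.fullmatch("FOO") is not None,
    "scoped_off BAR": scoped_off.fullmatch("BAR") is not None,
    "global_on FoO": global_on.fullmatch("FoO") is not None,
    "global_on BaR": global_on.fullmatch("BaR") is not None,
}
```

With (?i:bar), "BaR" matches and "FoO" does not, which is the same behavior as the (+i bar) example quoted above.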

> The funky (+i ...) syntax isn't meant for general user consumption (though
> it's okay if a user wants to use it directly), but rather is how I implemented
> /pattern/i functionality.

And this is fine - but if we support it, I would actually prefer just
making it explicit and doing it like everyone else.

Johanna

Vern · June 29, 2018, 7:23pm

> Hum. Is there a reason why we come up with our own syntax for this?

No, just that I didn't have the other syntax on my radar. I was looking
at Snort & Suricata and didn't find this; I'm not a PCRE user myself.
It's simple to change, will do so now (though I think mine is slightly
more cool ;-).

> Python supports the exact same syntax. And - to make things easier for
> users I think it would be way nicer if we just also would do this.

Sure.

Just so I have this right: it looks like the preferred form would not be
/(?i foo)/ but rather /(?i)foo/, yes?

Vern
