Boost xpressive library supports user-defined assertions

See this doc

Basically, this regex library (Boost.Xpressive) supports what I like about Perl's regex engine: the (??{ code }) feature, though with different syntax. This means that what I had to hack around in grok-perl I can easily express in C++ code. Awesome.

The docs only show examples of using static regexes with this great feature. I'm going to try using it with dynamic regexes. If it works, I'll be converting grok to C++.

Oniguruma - named capture example

For whatever reason, I decided to play with Oniguruma tonight (a newish regular expression library). I'm considering an effort to port some of grok's functionality to C or C++ for speed reasons. Doing it in C++ would require me to re-learn C++.

The API docs are quite complete, but they are short on examples, and I wasn't able to find many useful examples on Google either. What wasn't answered by the docs was answered by reading the header files. Excellent.

The result of this adventure is this:

# regex: ^(?<test>.*?)( (?<word2>.*))?$
# input: "hello there"

% gcc -I/usr/local/include -L/usr/local/lib oniguruma_named_captures.c -lonig
% ./a.out "hello there"
word2 = there
test = hello
% ./a.out "foobarbazfizz"
word2 = 
test = foobarbazfizz

Download the code
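
For readers without Oniguruma installed, here is a rough sketch of the same named-capture experiment in Python. It is not the C program linked above; the function name `named_captures` is made up for illustration, and Python's `re` module spells named groups `(?P<name>...)` rather than the `(?<name>...)` form used in the regex above.

```python
import re

# Same regex as above, rewritten with Python's (?P<name>...) group syntax.
PATTERN = re.compile(r"^(?P<test>.*?)( (?P<word2>.*))?$")

def named_captures(text):
    """Return the named groups as a dict; unmatched groups become ''."""
    match = PATTERN.match(text)
    if match is None:
        return {}
    return match.groupdict(default="")

if __name__ == "__main__":
    for line in ("hello there", "foobarbazfizz"):
        for name, value in named_captures(line).items():
            print(f"{name} = {value}")
```

The lazy `.*?` in `test` is what lets the optional second group claim everything after the first space; without a space in the input, `word2` simply stays empty.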

Grok, smarter predicates, and outrageous perl regex fu.

The past two days have been a frustrating exercise in working around a Perl bug while putting strong pressure on the limits of Perl's regex system. The battle was won, but Perl left some scarring.

In the last batch of updates in grok, you are now able to specify additional match predicates to the patterns. For example, if you have: "%IP~/^192/%" it will match an IP, but only IPs starting with 192.

This works great if it's the only IP on the line, but what if you want to grep for an IP starting with 192 on a line with multiple IP addresses? The current implementation works something like this:

  1. Perform the match for %IP%
  2. Assert that the matched IP also matches the predicate /^192/
  3. If the previous assertion succeeded, continue as normal (react)
  4. If the assertion failed, drop this line and keep going
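
The four steps above can be sketched roughly as follows. This is not grok's code: `WORD` is a stand-in for a grok pattern such as %WORD%, and `match_then_assert` is a hypothetical name used only for this illustration.

```python
import re

# Assumption: a simple stand-in for a grok pattern like %WORD%.
WORD = re.compile(r"\b\w+\b")

def match_then_assert(line, predicate):
    """Match first, then check the predicate once, after the match."""
    match = WORD.search(line)            # step 1: perform the match
    if match is None:
        return None
    if re.search(predicate, match.group(0)):
        return match.group(0)            # steps 2-3: assertion held, react
    return None                          # step 4: assertion failed, drop the line
```

With this shape, `match_then_assert("foo foobar foobaz", "bar")` returns `None`: the first word is "foo", the predicate fails on it, and the line is dropped even though "foobar" appears later.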

Let's consider a simple example:

I want to use %WORD% to match a word. However, I only want a word that has 'foo' in it. Under the current implementation, I might consider using "%WORD~/foo/%", but this would not work, because it would match only the first word on the line; if that word does not contain 'foo', the predicate fails and the line is dropped. Regex predicates should ideally be involved during the match process, not after. Perl has some crazy code eval features that let you do this. The following code should work:

echo "foo foobar foobaz" \
| perl -nle 'm/(\b\w+\b)(??{ ($^N =~ m@bar@ ? "" : "(?=.\\A)") })/'

Basically, the (??{ }) part checks whether the captured group also matches 'bar'. If it doesn't, it injects a regular expression that can never match (a guaranteed negative match). The negative match is a lookahead for any character followed by the beginning of the string (\A), which clearly isn't possible (neat hack!). The end result is that $1 should be 'foobar'.
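
Python's `re` module has no equivalent of (??{ }), so the sketch below only approximates the idea: it applies the predicate to each candidate word in turn instead of only the first, which gives the same answer for this simple pattern. The names `match_with_predicate` and `NEVER` are made up for this illustration; `NEVER` is the same cannot-match lookahead described above.

```python
import re

# Assumption: a simple stand-in for a grok pattern like %WORD%.
WORD = re.compile(r"\b\w+\b")

# The "guaranteed negative match": any character followed by start of string.
NEVER = re.compile(r"(?=.\A)")

def match_with_predicate(line, predicate):
    """Return the first word that also satisfies the predicate, else None."""
    for candidate in WORD.finditer(line):
        if re.search(predicate, candidate.group(0)):
            return candidate.group(0)
        # In Perl, injecting the never-matching pattern here is what forces
        # the engine to give up on this candidate and try the next one.
    return None

if __name__ == "__main__":
    print(match_with_predicate("foo foobar foobaz", "bar"))
    print(NEVER.search("foo foobar foobaz"))
```

The difference from a real in-regex predicate is that Perl would also backtrack within a candidate; for `\b\w+\b` that makes no practical difference, but for other patterns it might.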

However, if you actually run the Perl one-liner above:

$ echo "foo foobar foobaz" \
  | perl -nle 'm/(\b\w+\b)(??{ ($^N =~ m@bar@ ? "" : "(?=.\\A)") })/'
> Segmentation fault (core dumped)

Perl crashes. It crashes on Linux, FreeBSD, and Solaris, across various versions and various permutations of the code. It must have something to do with running other regular expressions inside code evals that are themselves running from another regex. It seems to smash the stack in some unpleasant way.

Frustrating! I spent a few hours trying to fix this without solving the problem. No luck. Shortly after, an outrageous idea hit me - if it is unsafe to do regex within regex, why not fork and do the inner regex in another process?

Sounds crazy, and stupid, and silly? That's because it is. What bug work-arounds aren't? :)

That's exactly what I did. I wrote a short prototype script to test this new subprocess theory. It seems to work! Despite its outrageous complexity, it gets the job done and skirts around the Perl bug.

The script matches an IP and ensures that the IP contains a '5'. Here's an example run:

scorn(~) % perl regexp-fu.pl "1.2.3.4 8.1.1.4 1.2.3.5"
1.2.3.4 / (?-xism:5)
8.1.1.4 / (?-xism:5)
1.2.3.5 / (?-xism:5)
Match: 1
Group: 1.2.3.5

You can see that it tested 3 IPs, but only the 3rd one had a '5' in it. The result of the single regular expression is that it matched the one I wanted. Finally!
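
The process-isolation idea can be sketched in Python as well. This is not regexp-fu.pl, and Python does not have the crash being worked around; the sketch only shows the shape of "run the inner regex in another process". The names `IP`, `CHILD`, `predicate_in_child`, and `first_ip_matching` are all hypothetical, and the IP pattern is deliberately simplified.

```python
import re
import subprocess
import sys

# Assumption: a simplified IPv4 pattern; it does not range-check the octets.
IP = re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b")

# Child program: exit status 0 if argv[1] (a regex) matches argv[2], else 1.
CHILD = "import re, sys; sys.exit(0 if re.search(sys.argv[1], sys.argv[2]) else 1)"

def predicate_in_child(pattern, text):
    """Run the inner regex in a separate process; True if it matched."""
    result = subprocess.run([sys.executable, "-c", CHILD, pattern, text])
    return result.returncode == 0

def first_ip_matching(line, pattern):
    """Return the first IP on the line whose text satisfies the predicate."""
    for candidate in IP.finditer(line):
        if predicate_in_child(pattern, candidate.group(0)):
            return candidate.group(0)
    return None

if __name__ == "__main__":
    print(first_ip_matching("1.2.3.4 8.1.1.4 1.2.3.5", "5"))
```

Spawning one process per candidate is as expensive as it sounds, which is the same trade-off the Perl prototype makes: correctness first, speed later.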

I used perlbug(1) in an attempt to report this bug, but I have no idea if it actually sent any email.

This solution is totally suboptimal, but it works. Given the alternative, Perl crashing, I'll take this solution.