Search this site


Metadata

Articles

Projects

Presentations

Flashback 2004: Too much about perl and regexp

Continuing the flashback series. Here's 2004.

Still in school. Hacking on countless projects in perl.

While writing a command-line interpreter, I wanted you to be able to type 'st' or 'sta' instead of 'status' and have the code find a best match for your command. This is often called 'ambiguous commands' or somesuch. I came up with a clever way to do this using regular expressions (May 2004).

I also posted about using regular expression to match quoted and nonquoted strings (June 2004).

Finally, I experiment with using regular expressions to find word boundaries near the cursor for my command line interpreter (Oct 2004).

Are you drowning in perl and regular expressions, yet? ;)

Grok (pcre grok) nested predicates

I've spent the past few days refactoring and redesigning some of grok (the C version). Some of the methodology was using lazy test-driven design (writing tests in parallel, rather than before), which seemed to help me get the code working quicker.

We can now nest predicates, so you could ask to match an ip or host which has a word in it that matches 'google'. This example is a little silly, but it does show nested expressions.

% echo "www.cnn.com something google.com" \
  | ./main '%{IPORHOST=~/%{WORD=~,google,}/}' IPORHOST
google.com
I switched away from using tsearch(3) and over to using in-memory bdb; I've been happy ever since. Predicates can now live in an external library in preparation for allowing you to write predicates in a scripting language like Python or Ruby.

I'm using CUnit plus a few script hacks to do the testing. It's working pretty well. I have a few hacks (check svn for these), but the results look like this:

% make test
  Test: grok_capture.test.c:test_grok_capture_encode_and_decode ... passed
  Test: grok_capture.test.c:test_grok_capture_encode_and_decode_large ... passed
  Test: grok_capture.test.c:test_grok_capture_get_db ... passed
  Test: grok_capture.test.c:test_grok_capture_get_by_id ... passed
  Test: grok_capture.test.c:test_grok_capture_get_by_name ... passed
  Test: grok_pattern.test.c:test_grok_pattern_add_and_find_work ... passed
...

Boost xpressive dynamic regexp with custom assertions

As it turns out, xpressive is (so far) exactly what I'm looking for.

'Dynamic regular expression' in Xpressive's docs are means that the regex object comes from compiling a regex string, not from using the static regular expression (aka coded in C++) that is the alternative. Very fortunately, you can mix the uses of dynamic and static expressions, since both end up turning into the same objects!

What I wanted was dynamic regexps with custom assertions, and here's how you do it:

struct is_private {
  bool operator()(ssub_match const &sub) const {
    /* Some test on 'sub' */
  }
};

/* somewhere in your code ... */
sregex ip_re = sregex::compile("(?:[0-9]+\\.){3}(?:[0-9]+)");
sregex priv_ip_re = ip_re[ check(is_private()) ];
This is excellent because this was one of the features of perl that kept me from making grok available in any other language.

I have a working demo you can download. I've tested on Linux and FreeBSD with success. It requires boost 1.34.1 and the xpressive 2.0.1. The version of xpressive that comes with boost 1.34.1 is insufficient, you must separately download the latest version of xpressive. I installed it by unzipping it and copying boost/xpressive/* to /usr/local/include/boost/xpressive/ - this overwrote the old copy of xpressive I had installed.

Compile with (on freebsd, the -I and :

g++ -I/usr/local/include -c -o boost_xpressive_test.o boost_xpressive_test.cpp
g++  boost_xpressive_test.o -o xpressivetest
Running it:
% ./xpressivetest 
RFC1918 test on '1.2.3.4': fail
RFC1918 test on '4.5.6.7': fail
RFC1918 test on '192.168.0.5': pass
Match on test1: 192.168.0.5
RFC1918 test on '129.21.60.0': fail
RFC1918 test on '29.21.60.0': fail
RFC1918 test on '9.21.60.0': fail
RFC1918 test on '172.17.44.25': pass
Match on test2: 172.17.44.25
This is exactly the behavior I expected.

Oniguruma - named capture example

For whatever reason, I decided to play with oniguruma tonight (a newish regular expression library). I'm considering an effort to port some of grok's functionality to C or C++ for speed reasons. Doing it in C++ would require me to re-learn C++.

The docs are pretty complete, but not very helpful with respect to examples. I wasn't able to find very many useful examples on google, but the API docs are quite good. What wasn't answered by the docs was answered by reading header files. Excellent.

The result of this adventure is this:

# regex: ^(?<test>.*?)( (?<word2>.*))?$
# input: "hello there"

% gcc -I/usr/local/include -L/usr/local/lib -lonig oniguruma_named_captures.c
% ./a.out "hello there"
word2 = there
test = hello
% ./a.out "foobarbazfizz"
word2 = 
test = foobarbazfizz

Download the code

Ruby/Oniguruma code block patches

I love perl's (?{ code }) feature. I want it in other languages.

I spent some time on hacking this into ruby a few weeks ago. I finally got around to making patches.

In FreeBSD ports, I select to build Ruby 1.8.6 with oniguruma for the regex engine. After doing 'make configure' you can apply these patches:

I haven't tested this on other platforms, and it's not feature complete, but it's close.

Query parsing in JavaScript

For pimp, I want to be able to search a specific column, say, artist, without needing multiple fields for searching. The ability to specify more advanced searches than simple keywords is quite useful. How do we leverage this on the client and turn a search query into a set of key-value pairs?

I must confess I was hesitant to put this kind of logic into Javascript instead of Python. Furthermore, it makes me feel a little uneasy using /foo/ in anything other than Perl. Nonetheless, doing this in Javascript was simple and it's still fast (as it should be).

The particular type of query I want to parse come in the following (hopefully intuitive) formats:

  • foo
  • artist:Eminem
  • album:"Across a Wire"
  • artist:"Counting Crows" album:august
The following code does this for me. The parse_query function will return a dictionary of query terms. Values are lists.

Here's an example:

Query
rain baltimore artist:"Counting Crows" album:august
Results of parse_query
    { "artist": ["Counting Crows"],
      "album": ["august"],
      "any": ["rain", "baltimore"],
    }
  
I take the dictionary returned and pass it to jQuery's $.post function to execute an AJAX (I hate that term, it's such a misnomer these days) request. Here's the code:
query_re = /(?:"([^"]+)")|(?:\b((?:[a-z:A-Z_-]|(?:"([^"]+)"))+))/gi,

function parse_query(string) {
  dict = {}
  while (m = query_re.exec(string)) {
    val = (m[1] || m[2]).split(":",2)
    if (val[1]) { key = val[0]; val = val[1]; }
    else { key = "any"; val = val[0]; }

    val = val.replace(/"/g,"");

    dict[key] = dict[key] || [];
    // the following should be .append(val) but 
    // I don't think javascript lists have them
    dict[key][dict[key].length] = val;
  }

  return dict;
}