new grok version available (1.20101030)

Another grok release is available. Major changes include:
  • Pattern discovery as described here.
  • Doxygen (C) and RDoc (Ruby) docs now available.
  • Much improved ruby support (gem install jls-grok).

Hop on over to the grok project page and download the new version.

Changes since last announced release:

1.20101030.*
  - Add 'make package-debian' to produce a .deb build of grok.

1.20101018.*
  - API docs via doxygen
  - rdoc for the Ruby module
  - Add Grok::Pile to Ruby

1.20100419.*
  - Fix tests
  - Add a ruby example for pattern discovery
  - Add grok-web example (runs grok in ruby via sinatra to show pattern discovery)
  - Add more time formats (US, EU, ISO8601)
  - Fix bug that prevented multiple patterns with the same complexity from being
    used in discovery.

1.20100416.*
  - Add pattern discovery through grok_discover (C) and Grok#discover (Ruby)
    Idea for this feature documented here:
    http://www.semicomplete.com/blog/geekery/grok-pattern-autodiscovery.html
  - The ruby gem is now called 'jls-grok' since someone already had the 'grok'
    gem name on gemcutter.
  - Fix some pattern errors found in the test suite.
  - New version numbering to match my other tools.

new grok version available (20091227.01)

The latest release is another important step in grok's life. Most major changes were outside of the code:
  • FreeBSD users can install grok via ports: sysutils/grok. Thanks to sahil and wxs for making this happen.
  • The project has online documentation and also ships with a manpage.

Hop on over to the grok project page and download the new version.

Changes since last announced release:

20091227.01
 - Add function to get the list of loaded patterns.
 - Ruby: new method Grok#patterns returns a Hash of known patterns.
 - Added flags to grok: -d and --daemon to daemonize on startup (after config
   parsing). Also added '-f configfile' for specifying the config file.
 - Added manpage (grok.1, generated from grok.pod)

20091110
 - match {} blocks can now have multiple 'pattern:' instances
 - Include samples/ directory of grok configs in release package.

Grok (pcre grok) nested predicates

I've spent the past few days refactoring and redesigning some of grok (the C version). Part of my approach was a lazy form of test-driven design (writing tests alongside the code, rather than before it), which seemed to help me get the code working more quickly.

We can now nest predicates, so you could ask to match an IP or host that contains a word matching 'google'. This example is a little silly, but it does show nested expressions.

% echo "www.cnn.com something google.com" \
  | ./main '%{IPORHOST=~/%{WORD=~,google,}/}' IPORHOST
google.com
I switched away from using tsearch(3) and over to using in-memory bdb; I've been happy ever since. Predicates can now live in an external library in preparation for allowing you to write predicates in a scripting language like Python or Ruby.

I'm using CUnit plus a few script hacks to do the testing (check svn for these). It's working pretty well, and the results look like this:

% make test
  Test: grok_capture.test.c:test_grok_capture_encode_and_decode ... passed
  Test: grok_capture.test.c:test_grok_capture_encode_and_decode_large ... passed
  Test: grok_capture.test.c:test_grok_capture_get_db ... passed
  Test: grok_capture.test.c:test_grok_capture_get_by_id ... passed
  Test: grok_capture.test.c:test_grok_capture_get_by_name ... passed
  Test: grok_pattern.test.c:test_grok_pattern_add_and_find_work ... passed
...

Revision 2000

I spent some time putting love into cgrok (uses libpcre) tonight.
  • Logging facility to help in debugging. Lets you choose which features you want logged (instead of lame warn/info/number log levels)
  • Added string and number comparison predicates
  • Wrote a few more tests which uncovered some bugs
I also broke 2000 revisions in subversion. Yay.
Sending        test/Makefile
Transmitting file data .
Committed revision 2001.

PCRE, Grok, and match predicates

I finished the first function in pcre-grok's predicate feature.
% ifconfig | ./grokre '%{IP}' IP
Entry: IP => 192.168.0.5
Entry: IP => 127.0.0.1

# Now, with a predicate:
% ifconfig | ./grokre '%{IP =~ /^192/}' IP
Entry: IP => 192.168.0.5
The eventual plan is to allow users to register their own predicates. The first target of this will be a python module wrapping grok allowing you to use grok and additionally write predicate functions in python, executed inside the regular expression.
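As a rough illustration of what a predicate like `=~ /^192/` does to a captured value, here is a minimal sketch in plain C. It uses POSIX `regcomp`/`regexec` rather than PCRE, and the function name `predicate_matches` is hypothetical; it is not how pcre-grok itself is implemented.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if the captured text satisfies the
 * predicate regex, 0 if it does not, and -1 if the predicate regex
 * fails to compile. Uses POSIX extended regexes, not PCRE. */
static int predicate_matches(const char *captured, const char *predicate) {
    regex_t re;
    int rc;

    assert(captured != NULL && predicate != NULL);
    if (regcomp(&re, predicate, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, captured, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

Applied to the two addresses above, `predicate_matches("192.168.0.5", "^192")` accepts the capture and `predicate_matches("127.0.0.1", "^192")` rejects it, which is why only one entry is printed.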

So far, PCRE has not let me down.

Grok + PCRE

Perl grok was great. I learned a lot about how far beyond normal I could take regular expressions. I ported much of perl grok to C++ using Boost Xpressive, but Boost comes with a lot of baggage: I didn't like the feel of Xpressive, Boost is huge, compiling takes forever and a day (thanks, C++ templates!), and binaries are almost guaranteed to be 1 MB or more.

That said, I think I might be reinventing the wheel again by trying to see what grok in C with libpcre feels like. Sample code line:

  re = pcre_compile("([0-9]+)(?C1)", 0, &errptr, &erroffset, NULL);
(?C1) is PCRE syntax for "call callback #1". The callback I wrote converts the last capture into a number and only succeeds if the value is greater than 5, so the match succeeds as soon as a capture passes that test:
% ./a.out "foo 2 4 6 8"     
Trying: 2
Trying: 4
Trying: 6
Found: 6
All with a single regular expression + callouts. This feature (called callouts by PCRE) is what allows me (and you) to use predicates in grok. PCRE passes the first test.
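The "greater than 5" check can be sketched in plain C without linking libpcre. The return values follow PCRE's callout convention (0 lets matching proceed, a positive value fails the current attempt so the engine keeps looking). The names `greater_than_five` and `first_accepted` are hypothetical, and `first_accepted` is only a stand-in for the regex engine walking `([0-9]+)` captures, not the actual callback from the post.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Predicate in the style of a PCRE callout return value: 0 accepts the
 * capture, a positive value rejects this attempt.
 * subject[start..end) is the text of the last capture. */
static int greater_than_five(const char *subject, int start, int end) {
    char buf[32];
    int len = end - start;

    assert(subject != NULL && len >= 0);
    if (len == 0 || len >= (int)sizeof(buf))
        return 1;
    memcpy(buf, subject + start, (size_t)len);
    buf[len] = '\0';
    return strtol(buf, NULL, 10) > 5 ? 0 : 1;
}

/* Stand-in for the regex engine: visit each run of digits and return the
 * offset of the first run the predicate accepts, or -1 if none does. */
static int first_accepted(const char *subject) {
    int i = 0;

    while (subject[i] != '\0') {
        if (!isdigit((unsigned char)subject[i])) {
            i++;
            continue;
        }
        int start = i;
        while (isdigit((unsigned char)subject[i]))
            i++;
        if (greater_than_five(subject, start, i) == 0)
            return start;
    }
    return -1;
}
```

With the subject `"foo 2 4 6 8"`, the runs 2 and 4 are rejected and the run starting at offset 8 (the `6`) is the first one accepted, mirroring the Trying/Found output above.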

A few hours later, I had pattern injection working (turning %FOO% into its regular expression) and could parse logs with ease.
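Pattern injection of this kind can be sketched as a simple string expansion. The pattern table, the names `lookup` and `expand`, and the choice to wrap each expansion in a capture group are all assumptions made for illustration; this is not grok's actual code.

```c
#include <assert.h>
#include <string.h>

struct pattern { const char *name; const char *regex; };

/* Hypothetical pattern table; real grok loads its patterns elsewhere. */
static const struct pattern PATTERNS[] = {
    { "NUMBER", "[0-9]+" },
    { "WORD",   "[A-Za-z0-9_]+" },
};

/* Find the regex for a name of the given length, or NULL if unknown. */
static const char *lookup(const char *name, size_t len) {
    size_t i;
    for (i = 0; i < sizeof(PATTERNS) / sizeof(PATTERNS[0]); i++) {
        if (strlen(PATTERNS[i].name) == len &&
            strncmp(PATTERNS[i].name, name, len) == 0)
            return PATTERNS[i].regex;
    }
    return NULL;
}

/* Expand every known %NAME% in src into "(regex)" in out. Unknown names
 * and stray '%' characters are copied literally.
 * Returns 0 on success, -1 if out is too small. */
static int expand(const char *src, char *out, size_t outlen) {
    size_t o = 0;

    assert(src != NULL && out != NULL && outlen > 0);
    while (*src != '\0') {
        const char *close = (*src == '%') ? strchr(src + 1, '%') : NULL;
        const char *regex =
            close ? lookup(src + 1, (size_t)(close - src - 1)) : NULL;

        if (regex != NULL) {
            size_t n = strlen(regex);
            if (o + n + 2 >= outlen)
                return -1;
            out[o++] = '(';
            memcpy(out + o, regex, n);
            o += n;
            out[o++] = ')';
            src = close + 1;   /* skip past the closing '%' */
        } else {
            if (o + 1 >= outlen)
                return -1;
            out[o++] = *src++;
        }
    }
    out[o] = '\0';
    return 0;
}
```

For example, expanding `"id=%NUMBER% user=%WORD%"` yields `"id=([0-9]+) user=([A-Za-z0-9_]+)"`, which could then be handed to the regex compiler.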

I couldn't help pitting the Boost and PCRE versions against each other, even though the feature sets aren't the same yet. pcregrok processed 37,000 lines/sec of apachelog (the most complex regexp I have), versus 6,200 lines/sec from C++/Boost grok.

PCRE, and how to not write an API.

From the pcreapi(3) manpage:
The first two-thirds of the vector is used to pass back captured substrings,
each substring using a pair of integers. The remaining third of the vector is
used as workspace by pcre_exec() while matching capturing subpatterns, and is
not available for passing back information. The length passed in ovecsize
should always be a multiple of three. If it is not, it is rounded down.
The 'vector' in question is used by pcre to store offset information for captured groups. It's a good and simple way to figure out where each capture starts and ends.

What doesn't make sense is the part about the remaining third of the vector being workspace for pcre_exec(). Why wouldn't pcre_exec simply allocate that scratch space itself? In the meantime, I'm left wondering why I am allocating parts of an array I am told are unusable. I hope there's a good reason. Perhaps some unknown efficiency is gained from doing it this way.
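To make the layout concrete, here is a small sketch in plain C (no libpcre calls) of how the vector divides up according to the manpage text quoted above. The helper names `usable_pairs` and `copy_capture` are hypothetical and not part of PCRE.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Number of (start, end) capture pairs that fit in an offset vector of
 * ovecsize ints. The size is rounded down to a multiple of three and only
 * the first two-thirds carry results, so this is ovecsize / 3. */
static int usable_pairs(int ovecsize) {
    assert(ovecsize >= 0);
    return ovecsize / 3;
}

/* Copy capture number n out of subject into buf, using the pair
 * ovector[2n], ovector[2n + 1]. Returns the length copied, or -1 if the
 * capture is unset (negative offsets) or does not fit in buf. */
static int copy_capture(const char *subject, const int *ovector, int n,
                        char *buf, size_t buflen) {
    int start = ovector[2 * n];
    int end = ovector[2 * n + 1];
    size_t len;

    if (start < 0 || end < start)
        return -1;
    len = (size_t)(end - start);
    if (len + 1 > buflen)
        return -1;
    memcpy(buf, subject + start, len);
    buf[len] = '\0';
    return (int)len;
}
```

So a caller who wants room for 10 captures (the whole match plus 9 groups) declares `int ovector[30]`, of which only the first 20 entries ever hold offsets.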