Search this site


Page 1 of 4  [next]

Metadata

Articles

Projects

Presentations

new grok version available (1.20101030)

Another grok release is available. Major changes include:
  • Pattern discovery as described here.
  • Doxygen (C) and RDoc (Ruby) docs now available.
  • Much improved ruby support (gem install jls-grok).

Hop on over to the grok project page and download the new version.

Changes since last announced release:

1.20101030.*
  - Add 'make package-debian' to produce a .deb build of grok.

1.20101018.*
  - API docs via doxygen
  - rdoc for the Ruby module
  - Add Grok::Pile to Ruby

1.20100419.*
  - Fix tests
  - Add a ruby example for pattern discovery
  - Add grok-web example (runs grok in ruby via sinatra to show pattern discovery)
  - Add more time formats (US, EU, ISO8601)
  - Fix bug that prevented multiple patterns with the same complexity from being
    used in discovery.

1.20100416.*
  - Add pattern discovery through grok_discover (C) and Grok#discover (Ruby)
    Idea for this feature documented here:
    http://www.semicomplete.com/blog/geekery/grok-pattern-autodiscovery.html
  - The ruby gem is now called 'jls-grok' since someone already had the 'grok'
    gem name on gemcutter.
  - Fix some pattern errors found in the test suite.
  - New version numbering to match my other tools.

new grok version available (20091227.01)

The latest release is another important step in grok's life. Most major changes were outside of the code:
  • FreeBSD users can install grok via ports: sysutils/grok. Thanks to sahil and wxs for making this happen.
  • The project has online documentation and also ships with a manpage.

Hop on over to the grok project page and download the new version.

Changes since last announced release:

20091227.01
 - Add function to get the list of loaded patterns.
 - Ruby: new method Grok#patterns returns a Hash of known patterns.
 - Added flags to grok: -d and --daemon to daemonize on startup (after config
   parsing). Also added '-f configfile' for specifying the config file.
 - Added manpage (grok.1, generated from grok.pod)

20091110
 - match {} blocks can now have multiple 'pattern:' instances
 - Include samples/ directory of grok configs in release package.

grok 20091103 release

Lots of changes since the last announced release. Grok should get some more activity now that I'm actually using it in a few places. If you find bugs or have feature requests, please file them on googlecode issue tracker (see below)

The largest changes are:

  • we ship with Ruby and C API.
  • lots of new testing code.
  • we now use tokyocabinet internally instead of bdb.
Grok documentation:
http://code.google.com/p/semicomplete/wiki/Grok
Download:
http://semicomplete.googlecode.com/files/grok-20091103.tar.gz
File bugs/features:
http://code.google.com/p/semicomplete/issues/list

This release has all tests passing in these configurations:

  • FreeBSD 7.1. tokyocabinet 1.4.30, pcre 8.00, libevent 1.4.12
  • Ubuntu 9.04. tokyocabinet 1.4.35, pcre 7.8-2, libevent 1.3e-3
  • CentOS 5.3. tokyocabinet 1.4.9-1, pcre 7.8-2, libevent 1.1a-3.2.1
Thanks to Pete Fritchman, grok also ships with an RPM spec so you can 'rpmbuild -tb grok-20091103.tar.gz' for simple build and deployment. The spec builds grok, grok-devel, and grok-ruby.

I'm using this version of grok myself with good success. It's also being used in the new logstash (log indexing tool) project for doing log parsing.

Full changelist since the last announced release:

20091103
 - New: ruby bindings are now really supported.
 - Change 'WORD' pattern to be word bounded (\b)
 - Move grok-patterns to patterns/base
 - update rpm spec to install patterns/base in /usr/share/grok

20091102
 - Add a bunch of tests, mostly in ruby, to exercise grok. This uncovered a
   few bugs which are fixed.
   All tests currently pass (both CUnit and Ruby Test::Unit) on:
   * FreeBSD 7.1. tokyocabinet 1.4.30, pcre 8.00, libevent 1.4.12
   * Ubuntu 9.04. tokyocabinet 1.4.35, pcre 7.8-2, libevent 1.3e-3
   * CentOS 5.3. tokyocabinet 1.4.9-1, pcre 7.8-2, libevent 1.1a-3.2.1
 - When making strings in ruby, we now make them tainted per ruby C docs.
 - "Too many replacements" error will now occur if you have cyclic patterns,
   such as defining 'FOO' to be '%{FOO}'. Max replacements is 500.

20091030
 - Make 'grok' main take a config for an argument.
 - Add grok rpm spec.
 - Updated Makefile to work on Linux and FreeBSD without modification.
 - Fixed bug introduced in 20091022 where capture_by_(name,subname) didn't
   work properly.
 - Add default values for match {} grok.conf blocks:
   shell: stdout
   reaction: "[email protected]}"
 - Have grok exit nonzero if there were no reactions executed, akin to grep(1)
   not matching anything. 'reactions' are important here; matches with no
   reaction will not count as a reaction.

20091023
 - Fix libgrok accidentally sharing it's parser/lexer functions. Turns out,
   libgrok doesn't actually need to parse the grok.conf, so we don't build
   against it anymore for the library.

20091022:
 - Convert to using tokyocabinet instead of berkeley db.
   * Berkeley DB isn't easy to target across platforms (4.x versions vary
     wildly in bugs)
   * tokyo cabinet should be faster
   * tokyo cabinet is less code to write, and slightly more readable in the
     author's opinion.
   * we don't have to serialize with xdr anymore

20091019:
 - include pregenerated bison/flex output since gnu flex varies much from
 non-gnu flex, and many important platforms don't have gnu flex available
 easily from packages (freebsd, centos, etc) but come with the other flex.

 No functional changes.

Grok + Lucene

I mentioned last night some ideas about an open source data analytics tool. I spent a few minutes today cleaning up the code I used to test grok and lucene.

I used the latest HEAD version of grok to turn Apache logs into JSON and wrote a Java program to read the JSON output into Lucene. The last step was to write a simple search tool to query the data in Lucene.

For a test case, I used a 10000-line apache access log. To populate, I just ran this:
% ./grok | java GrokJSONImport
Grok (per the config above) will output json objects for each match and GrokJSONImport will read each line and parse it as json, telling Lucene that each new log entry is a new document with fields matched by grok.

Let's search for all successful HTTP POSTs (well, the first 100 hits, since LogSearch.java only asks for 100 hits):

% java LogSearch '+response:200 +verb:post' timestamp verb request response
Found 5794 hits.
timestamp: 18/Jan/2009:04:01:00 -0500
verb: POST
request: /hackday08/randomtags.py
response: 200

timestamp: 18/Jan/2009:04:01:05 -0500
verb: POST
request: /hackday08/randomtags.py
response: 200

< remainder of output cut >
Most of the hits are related to 'randomtags.py' which is a CGI script used by my yahoo pipes hack, SnackUpon. Let's filter out all of those requests:
% java LogSearch '+response:200 +verb:post NOT request:/hackday08/randomtags.py' timestamp verb request response
Found 91 hits.
timestamp: 18/Jan/2009:09:12:04 -0500
verb: POST
request: /blog/geekery/217
response: 200

timestamp: 18/Jan/2009:09:16:02 -0500
verb: POST
request: /blog/static/about#comment_anchor
response: 200

< remainder of output cut >
What if I want to see some non-200 response code GETs? Turn the query into 'verb:get NOT response:200' and you're done.

Pretty cool, eh? :)

Grok beta 20081228 available

The new C version of grok is ready for beta testing.

The requirements are listed in the INSTALL file. There are piles of differences between the new C version and the old perl version, including a different config file syntax to let you more easily batch common input sets through the same set of matches. I'll publish a complete feature list when I get around to it, which isn't right now.

The tarball comes with a sample grok.conf that shows you a a few different things you can do with the new version.

To run it, once you've built it, you must have a 'grok.conf' in the same directory from which you are running the 'grok' binary.

Please send any questions you have to [email protected]

Download: grok-beta-20081228.tar.gz

Grok (pcre grok) nested predicates

I've spent the past few days refactoring and redesigning some of grok (the C version). Some of the methodology was using lazy test-driven design (writing tests in parallel, rather than before), which seemed to help me get the code working quicker.

We can now nest predicates, so you could ask to match an ip or host which has a word in it that matches 'google'. This example is a little silly, but it does show nested expressions.

% echo "www.cnn.com something google.com" \
  | ./main '%{IPORHOST=~/%{WORD=~,google,}/}' IPORHOST
google.com
I switched away from using tsearch(3) and over to using in-memory bdb; I've been happy ever since. Predicates can now live in an external library in preparation for allowing you to write predicates in a scripting language like Python or Ruby.

I'm using CUnit plus a few script hacks to do the testing. It's working pretty well. I have a few hacks (check svn for these), but the results look like this:

% make test
  Test: grok_capture.test.c:test_grok_capture_encode_and_decode ... passed
  Test: grok_capture.test.c:test_grok_capture_encode_and_decode_large ... passed
  Test: grok_capture.test.c:test_grok_capture_get_db ... passed
  Test: grok_capture.test.c:test_grok_capture_get_by_id ... passed
  Test: grok_capture.test.c:test_grok_capture_get_by_name ... passed
  Test: grok_pattern.test.c:test_grok_pattern_add_and_find_work ... passed
...

Revision 2000

I spent some time putting love into cgrok (uses libpcre) tonight.
  • Logging facility to help in debugging. Lets you choose what features you want logging (instead of lame warn/info/number log levels)
  • Added string and number comparison predicates
  • Wrote a few more tests which uncovered some bugs
I also broke 2000 revisions in subversion. Yay.
Sending        test/Makefile
Transmitting file data .
Committed revision 2001.

PCRE, Grok, and match predicates

I finished the first function in pcre-grok's predicate feature.
% ifconfig | ./grokre '%{IP}' IP
Entry: IP => 192.168.0.5
Entry: IP => 127.0.0.1

# Now, with a predicate:
% ifconfig | ./grokre '%{IP =~ /^192/}' IP
Entry: IP => 192.168.0.5
The eventual plan is to allow users to register their own predicates. The first target of this will be a python module wrapping grok allowing you to use grok and additionally write predicate functions in python, executed inside the regular expression.

So far, PCRE has not let me down.

Grok + PCRE

Perl grok was great. I learned a lot about how far beyond normal I could take regular expressions. I ported much of perl grok to C++ using Boost Xpressive, but Boost has a lot of baggage with it. I didn't like the feel of Xpressive, Boost is huge, compiling takes forever and a day (thanks C++ Templates!), and binaries are almost guaranteed to be 1meg or more.

That said, I think I might be reinventing the wheel again by trying to see what grok in C with libpcre feels like. Sample code line:

  re = pcre_compile("([0-9]+)(?C1)", 0, &errptr, &erroffset, NULL);
(?C1) is PCRE-syntax for "call callback #1" - the callback I wrote converts the last capture into a number and only succeeds if the value is greater than 5. It'll succeed once that precondition passes:
% ./a.out "foo 2 4 6 8"     
Trying: 2
Trying: 4
Trying: 6
Found: 6
All with a single regular expression + callouts. This feature (called callouts by PCRE) is what allows me (and you) to use predicates in grok. PCRE passes the first test.

A few hours later, I had pattern injection working (Turning %FOO% into it's regular expression) and could parse logs with ease.

I couldn't help pitting the boost and pcre versions against eachother, even though the feature set isn't the same, yet. pcregrok processed 37000lines/sec of apachelog (the most complex regexp I have), versus 6200/sec from c++/boost grok.

C++Grok bindings working in Python

% python example.py "%SYSLOGDATE%" < /var/log/messages | head -1
{'MONTH': 'Mar', '=LINE': 'Mar 23 06:47:03 snack syslogd 1.4.1#21ubuntu3: restart.', '=MATCH': 'Mar 23 06:47:03', 'TIME': '06:47:03', 'SYSLOGDATE': 'Mar 23 06:47:03', 'MONTHDAY': '23'}
That's right. I can now use C++Grok from python.

After I saw it work, I immediately ran a time check against the perl version:

% seq 20000 > /tmp/x
% time python example.py "%NUMBER>5000%" < /tmp/x > /tmp/x.python
0.59s user 0.00s system 99% cpu 0.595 total
% time perl grok -m "%NUMBER>5000%" -r "%NUMBER%" < /tmp/x  > /tmp/x.perl
4.86s user 0.94s system 18% cpu 31.647 total
The same basic operation is 50x faster in python with c++grok bindings than the pure perl version. Excellent. Sample python code:
g = pygrok.GrokRegex()
g.add_patterns( <dictionary of patterns> )
g.set_regex("%NUMBER>5000%")
match = g.search("hello there 123 456 7890 pants")
if match:
  print match["NUMBER"]
# prints '7890'
I knew I wasn't doing reference counting properly, so to test that I ran the python code against an input set of 1000000 lines and watched the memory usage, which clearly showed leaking. I quickly read up on ref counting in Python and what functions return new or borrowed references. A few keystrokes later my memory leaks were gone. After that I put python in the test suite and am read to push a new version of c++grok.

Download: cgrok-20080327.tar.gz

Python Build instructions:

% cd pygrok
% python setup.py install

# make sure it's working properly
% python -c 'import pygrok'
There is an example and some docs in the pygrok directory.

Let me know what you think :)