Grok + Lucene

Last night I mentioned some ideas about an open source data analytics tool. I spent a few minutes today cleaning up the code I used to test grok and Lucene.

I used the latest HEAD version of grok to turn Apache logs into JSON and wrote a Java program to read the JSON output into Lucene. The last step was to write a simple search tool to query the data in Lucene.

For a test case, I used a 10000-line Apache access log. To populate the index, I just ran this:
% ./grok | java GrokJSONImport
Grok (per the config above) outputs one JSON object per match. GrokJSONImport reads each line, parses it as JSON, and adds each log entry to Lucene as a new document whose fields are the ones grok matched.
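
A rough sketch of that import step, in Python rather than Java. This is not the GrokJSONImport code: the function name `import_json_lines` and the sample field values are assumptions for illustration, and a plain list of dicts stands in for the Lucene index.

```python
import io
import json

def import_json_lines(stream):
    """Parse one JSON object per line; each becomes a 'document' (a dict of fields)."""
    documents = []
    for line in stream:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        documents.append(json.loads(line))
    return documents

# Hypothetical sample of grok-style JSON output, one object per line.
sample = io.StringIO(
    '{"verb": "POST", "request": "/hackday08/randomtags.py", "response": "200"}\n'
    '{"verb": "GET", "request": "/blog/geekery/217", "response": "404"}\n'
)
docs = import_json_lines(sample)
print(len(docs), docs[0]["verb"])
```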

Let's search for all successful HTTP POSTs (well, the first 100 hits, since LogSearch.java only asks for 100 hits):

% java LogSearch '+response:200 +verb:post' timestamp verb request response
Found 5794 hits.
timestamp: 18/Jan/2009:04:01:00 -0500
verb: POST
request: /hackday08/randomtags.py
response: 200

timestamp: 18/Jan/2009:04:01:05 -0500
verb: POST
request: /hackday08/randomtags.py
response: 200

< remainder of output cut >
Most of the hits are related to 'randomtags.py', a CGI script used by my Yahoo Pipes hack, SnackUpon. Let's filter out all of those requests:
% java LogSearch '+response:200 +verb:post NOT request:/hackday08/randomtags.py' timestamp verb request response
Found 91 hits.
timestamp: 18/Jan/2009:09:12:04 -0500
verb: POST
request: /blog/geekery/217
response: 200

timestamp: 18/Jan/2009:09:16:02 -0500
verb: POST
request: /blog/static/about#comment_anchor
response: 200

< remainder of output cut >
What if I want to see some non-200 response code GETs? Turn the query into 'verb:get NOT response:200' and you're done.
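
To make the '+' (required) and 'NOT' (prohibited) parts of those queries concrete, here is a minimal Python sketch of the same boolean filtering over plain dicts. It is not Lucene's API and does no tokenizing or scoring; the `matches` helper, the sample documents, and the case-insensitive comparison (assumed because 'verb:post' matched 'POST' above) are all illustrative assumptions.

```python
def matches(doc, required=(), prohibited=()):
    """True if doc has every required field:value and none of the prohibited ones."""
    def has(field, value):
        # Compare case-insensitively, assuming the index lowercases terms.
        return str(doc.get(field, "")).lower() == value.lower()
    return (all(has(f, v) for f, v in required)
            and not any(has(f, v) for f, v in prohibited))

# Hypothetical documents with the same fields as the log entries above.
docs = [
    {"verb": "POST", "request": "/hackday08/randomtags.py", "response": "200"},
    {"verb": "POST", "request": "/blog/geekery/217", "response": "200"},
    {"verb": "GET", "request": "/blog/geekery/217", "response": "404"},
]

# Rough equivalent of '+response:200 +verb:post NOT request:/hackday08/randomtags.py'
hits = [d for d in docs
        if matches(d,
                   required=[("response", "200"), ("verb", "post")],
                   prohibited=[("request", "/hackday08/randomtags.py")])]
print(hits)
```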

Pretty cool, eh? :)

Grok beta 20081228 available

The new C version of grok is ready for beta testing.

The requirements are listed in the INSTALL file. There are piles of differences between the new C version and the old perl version, including a different config file syntax to let you more easily batch common input sets through the same set of matches. I'll publish a complete feature list when I get around to it, which isn't right now.

The tarball comes with a sample grok.conf that shows you a few different things you can do with the new version.

Once you've built it, run the 'grok' binary from a directory that contains a 'grok.conf'.

Please send any questions you have to grok-users@googlegroups.com.

Download: grok-beta-20081228.tar.gz

Grok (pcre grok) nested predicates

I've spent the past few days refactoring and redesigning some of grok (the C version). Part of my approach was a lazy form of test-driven development (writing tests in parallel with the code, rather than before it), which seemed to help me get the code working more quickly.

We can now nest predicates, so you could ask to match an ip or host which has a word in it that matches 'google'. This example is a little silly, but it does show nested expressions.

% echo "www.cnn.com something google.com" \
  | ./main '%{IPORHOST=~/%{WORD=~,google,}/}' IPORHOST
google.com
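
The nesting can be sketched in Python as an outer match whose candidate must pass an inner match. This is only an approximation: the `HOST` and `WORD` regexes are simplified stand-ins for grok's IPORHOST and WORD patterns, the function names are made up, and the check runs after each candidate is found instead of inside the regex engine as grok does.

```python
import re

HOST = re.compile(r"[A-Za-z0-9][A-Za-z0-9.-]*")  # simplified stand-in for IPORHOST
WORD = re.compile(r"\w+")                        # simplified stand-in for WORD

def word_contains(text, needle):
    """Inner predicate: some WORD inside text matches needle."""
    return any(re.search(needle, w.group()) for w in WORD.finditer(text))

def first_host_with_word(line, needle):
    """Outer match: first host-like token whose inner predicate passes."""
    for m in HOST.finditer(line):
        if word_contains(m.group(), needle):
            return m.group()
    return None

print(first_host_with_word("www.cnn.com something google.com", "google"))
# prints 'google.com'
```
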
I switched away from using tsearch(3) and over to using in-memory bdb; I've been happy ever since. Predicates can now live in an external library in preparation for allowing you to write predicates in a scripting language like Python or Ruby.

I'm using CUnit plus a few script hacks to do the testing. It's working pretty well. I have a few hacks (check svn for these), but the results look like this:

% make test
  Test: grok_capture.test.c:test_grok_capture_encode_and_decode ... passed
  Test: grok_capture.test.c:test_grok_capture_encode_and_decode_large ... passed
  Test: grok_capture.test.c:test_grok_capture_get_db ... passed
  Test: grok_capture.test.c:test_grok_capture_get_by_id ... passed
  Test: grok_capture.test.c:test_grok_capture_get_by_name ... passed
  Test: grok_pattern.test.c:test_grok_pattern_add_and_find_work ... passed
...

Revision 2000

I spent some time putting love into cgrok (uses libpcre) tonight.
  • Logging facility to help in debugging. Lets you choose which features you want logged (instead of lame warn/info/number log levels)
  • Added string and number comparison predicates
  • Wrote a few more tests which uncovered some bugs
I also broke 2000 revisions in subversion. Yay.
Sending        test/Makefile
Transmitting file data .
Committed revision 2001.

PCRE, Grok, and match predicates

I finished the first function in pcre-grok's predicate feature.
% ifconfig | ./grokre '%{IP}' IP
Entry: IP => 192.168.0.5
Entry: IP => 127.0.0.1

# Now, with a predicate:
% ifconfig | ./grokre '%{IP =~ /^192/}' IP
Entry: IP => 192.168.0.5
The eventual plan is to allow users to register their own predicates. The first target of this will be a python module wrapping grok allowing you to use grok and additionally write predicate functions in python, executed inside the regular expression.

So far, PCRE has not let me down.

Grok + PCRE

Perl grok was great. I learned a lot about how far beyond normal I could take regular expressions. I ported much of perl grok to C++ using Boost Xpressive, but Boost has a lot of baggage with it. I didn't like the feel of Xpressive, Boost is huge, compiling takes forever and a day (thanks C++ Templates!), and binaries are almost guaranteed to be 1meg or more.

That said, I think I might be reinventing the wheel again by trying to see what grok in C with libpcre feels like. Sample code line:

  re = pcre_compile("([0-9]+)(?C1)", 0, &errptr, &erroffset, NULL);
(?C1) is PCRE-syntax for "call callback #1" - the callback I wrote converts the last capture into a number and only succeeds if the value is greater than 5. It'll succeed once that precondition passes:
% ./a.out "foo 2 4 6 8"     
Trying: 2
Trying: 4
Trying: 6
Found: 6
All with a single regular expression + callouts. This feature (called callouts by PCRE) is what allows me (and you) to use predicates in grok. PCRE passes the first test.
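
Python's re module has no callouts, so the sketch below only emulates the behavior of the example above by testing each candidate match with a predicate function. The `search_with_predicate` name is made up for illustration. A real PCRE callout runs inside the matcher and can force backtracking within a single match attempt, which this does not reproduce.

```python
import re

def search_with_predicate(pattern, text, predicate):
    """Return the first match of pattern for which predicate(match) is true."""
    for m in re.finditer(pattern, text):
        print("Trying:", m.group(1))
        if predicate(m):
            return m
    return None

# Accept a number only if its value is greater than 5.
m = search_with_predicate(r"([0-9]+)", "foo 2 4 6 8",
                          lambda m: int(m.group(1)) > 5)
if m:
    print("Found:", m.group(1))
```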

A few hours later, I had pattern injection working (turning %FOO% into its regular expression) and could parse logs with ease.
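
Pattern injection is essentially recursive text substitution. Here is a minimal Python sketch, assuming a tiny made-up pattern dictionary (the real grok pattern definitions differ) and a hypothetical `expand` helper:

```python
import re

# Hypothetical pattern dictionary; the real grok pattern definitions differ.
PATTERNS = {
    "NUMBER": r"[0-9]+",
    "IP": r"%NUMBER%\.%NUMBER%\.%NUMBER%\.%NUMBER%",
}

def expand(expr, patterns, depth=0):
    """Replace each %NAME% with a non-capturing group holding its expanded regex."""
    if depth > 10:
        raise ValueError("pattern nesting too deep (possible cycle)")
    def substitute(m):
        name = m.group(1)
        if name not in patterns:
            return m.group(0)  # leave unknown names untouched
        return "(?:" + expand(patterns[name], patterns, depth + 1) + ")"
    return re.sub(r"%([A-Z]+)%", substitute, expr)

regex = expand("%IP%", PATTERNS)
print(regex)
print(re.search(regex, "inet 192.168.0.5 netmask").group(0))
```

The depth limit is just a cheap guard against a pattern that refers to itself.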

I couldn't help pitting the Boost and PCRE versions against each other, even though the feature sets aren't the same yet. pcregrok processed 37000 lines/sec of Apache log (the most complex regexp I have), versus 6200 lines/sec from C++/Boost grok.

C++Grok bindings working in Python

% python example.py "%SYSLOGDATE%" < /var/log/messages | head -1
{'MONTH': 'Mar', '=LINE': 'Mar 23 06:47:03 snack syslogd 1.4.1#21ubuntu3: restart.', '=MATCH': 'Mar 23 06:47:03', 'TIME': '06:47:03', 'SYSLOGDATE': 'Mar 23 06:47:03', 'MONTHDAY': '23'}
That's right. I can now use C++Grok from python.

After I saw it work, I immediately ran a time check against the perl version:

% seq 20000 > /tmp/x
% time python example.py "%NUMBER>5000%" < /tmp/x > /tmp/x.python
0.59s user 0.00s system 99% cpu 0.595 total
% time perl grok -m "%NUMBER>5000%" -r "%NUMBER%" < /tmp/x  > /tmp/x.perl
4.86s user 0.94s system 18% cpu 31.647 total
The same basic operation is about 50x faster in wall-clock time (0.6s versus 31.6s), and roughly 10x faster in CPU time, in Python with the c++grok bindings than in the pure Perl version. Excellent. Sample Python code:
g = pygrok.GrokRegex()
g.add_patterns( <dictionary of patterns> )
g.set_regex("%NUMBER>5000%")
match = g.search("hello there 123 456 7890 pants")
if match:
  print match["NUMBER"]
# prints '7890'
I knew I wasn't doing reference counting properly, so to test that I ran the Python code against an input set of 1000000 lines and watched the memory usage, which clearly showed leaking. I quickly read up on ref counting in Python and which functions return new or borrowed references. A few keystrokes later my memory leaks were gone. After that I put Python in the test suite and am ready to push a new version of c++grok.

Download: cgrok-20080327.tar.gz

Python Build instructions:

% cd pygrok
% python setup.py install

# make sure it's working properly
% python -c 'import pygrok'
There is an example and some docs in the pygrok directory.

Let me know what you think :)

Python C++ Grok bindings

I've gotten quite a bit further tonight on making c++grok's functionality available in python.

Tonight's efforts have mostly been spent learning the Python C API and how to add new objects and methods. I'm planning to have this ready for BarCampRochester3 in two weeks.

So far I can make new GrokRegex objects and call set_regex() and search() on them. Next time I'll be implementing GrokMatch objects (like in the C++ version) and a few other small things. Fun fun :)

Looks scary, actually simple: grammar parsing.

I hit a mental roadblock a few days ago; I was afraid to write a grammar parser for c++grok that supported the same basic format as the perl grok config format.

Perl grok's config grammar was super easy to write thanks to Parse::RecDescent. In the C++ version, I wanted similar ease. However, the tools I had available didn't appear to be expressive enough to support what I wanted. The config object in C++ was going to be a class, so you were free to have multiple config objects, which meant I couldn't have any global variables. Both Boost Xpressive and Boost Spirit support grammar parsing almost trivially, but they require awkward wrapping and are very hard to use when you want to update values in a class instance instead of some global variables.

Eventually, I gave up and wrote my own recursive descent bits using Xpressive to do the pattern matching and some trivial in-object state management to keep track of what was going on. It was really simple, despite my fears.

I'm not really sure what made me afraid of doing it, but the fear was totally unfounded.
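
For a sense of how small such a parser can be, here is a sketch in Python (not the C++/Xpressive code). The grammar is an assumption guessed from a grok-style config sample and is not the real grok grammar; the class and method names are made up. All parser state lives on the instance, so several parsers can coexist without globals.

```python
import re

# One token: a double-quoted string, a bare word, or a punctuation character.
TOKEN = re.compile(r'\s*(?:"((?:[^"\\]|\\.)*)"|(\w+)|([{}=;]))')

class ConfigParser:
    """Tiny recursive-descent parser; all state lives on the instance."""

    def __init__(self, text):
        self.text = text
        self.pos = 0  # current offset into text

    def _next(self):
        """Consume the next token as (kind, value); return None at end of input."""
        m = TOKEN.match(self.text, self.pos)
        if not m:
            if self.text[self.pos:].strip():
                raise SyntaxError("unexpected input at offset %d" % self.pos)
            return None
        self.pos = m.end()
        if m.group(1) is not None:
            return ("string", m.group(1))
        if m.group(2) is not None:
            return ("word", m.group(2))
        return ("punct", m.group(3))

    def _peek(self):
        """Look at the next token without consuming it."""
        saved = self.pos
        tok = self._next()
        self.pos = saved
        return tok

    def _expect(self, kind, value=None):
        tok = self._next()
        if tok is None or tok[0] != kind or (value is not None and tok[1] != value):
            raise SyntaxError("expected %s %r, got %r" % (kind, value, tok))
        return tok[1]

    def parse(self):
        """config := entry*"""
        entries = []
        while self._peek() is not None:
            entries.append(self._entry())
        return entries

    def _entry(self):
        """entry := WORD '=' STRING ';'  |  WORD STRING '{' entry* '}' ';'"""
        name = self._expect("word")
        if self._peek() == ("punct", "="):
            self._next()
            value = self._expect("string")
            self._expect("punct", ";")
            return (name, value)
        label = self._expect("string")
        self._expect("punct", "{")
        body = []
        while self._peek() != ("punct", "}"):
            body.append(self._entry())
        self._expect("punct", "}")
        self._expect("punct", ";")
        return (name, label, body)

SAMPLE = '''
exec "tail -1 /var/log/auth.log" {
  type "syslog" {
    match = ".*";
    reaction = "echo %=MATCH|shellescape%";
  };
};
'''
result = ConfigParser(SAMPLE).parse()
print(result)
```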

C++ Grok has working filters and exec sections now.

I finished implementing exec and filters:
exec "tail -1 /var/log/auth.log" {
  type "syslog" {
    match = ".*";
    reaction = "echo %=MATCH|shellescape%";
  };
};
I've made a point of having perl-grok's config format work, because I think it was a reasonable format (you're free to disagree!). At any rate, filters are now working, and the result of the above code is:
Reaction: echo Feb  8 23:25:01 snack CRON\[21596\]: pam_unix\(cron:session\): session closed for user root
Checking for input: tail -1 /var/log/auth.log(0x74b100)
Reading from: tail -1 /var/log/auth.log

Feb 8 23:25:01 snack CRON[21596]: pam_unix(cron:session): session closed for user root
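
The '%=MATCH|shellescape%' expansion can be pictured as looking up a captured value and passing it through a chain of named filters. A minimal Python sketch follows; the set of characters that `shellescape` escapes here is a guess, chosen only so that it reproduces the escaping seen in the output above, and `render` and `FILTERS` are hypothetical names, not grok internals.

```python
import re

def shellescape(value):
    """Backslash-escape a hypothetical set of shell metacharacters."""
    return re.sub(r"([\[\](){}$`\"'\\*?<>|&;!~#])", r"\\\1", value)

FILTERS = {"shellescape": shellescape}

def render(template, captures):
    """Expand %NAME% and %NAME|filter|...% using captures and FILTERS."""
    def substitute(m):
        value = captures[m.group(1)]
        # m.group(2) looks like '|filter1|filter2'; apply filters left to right.
        for name in filter(None, (m.group(2) or "").split("|")):
            value = FILTERS[name](value)
        return value
    return re.sub(r"%(=?[A-Z]+)((?:\|[a-z]+)*)%", substitute, template)

line = ("Feb  8 23:25:01 snack CRON[21596]: "
        "pam_unix(cron:session): session closed for user root")
print(render("echo %=MATCH|shellescape%", {"=MATCH": line}))
```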