Grok + Lucene

I mentioned last night some ideas about an open source data analytics tool. I spent a few minutes today cleaning up the code I used to test grok and Lucene.

I used the latest HEAD version of grok to turn Apache logs into JSON and wrote a Java program to read the JSON output into Lucene. The last step was to write a simple search tool to query the data in Lucene.

For a test case, I used a 10000-line Apache access log. To populate the index, I just ran this:
% ./grok | java GrokJSONImport
Grok (per the config above) outputs a JSON object for each match. GrokJSONImport reads each line, parses it as JSON, and tells Lucene that each log entry is a new document whose fields are the ones grok matched.
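
To make the "one line of JSON becomes one document" idea concrete, here is a minimal sketch of the parsing half in plain Java. It is not the actual GrokJSONImport: the class name JsonLineSketch and the parseLine method are made up for illustration, and it assumes each line is a flat JSON object with string values only (the real grok output may be shaped differently). It also leaves out Lucene entirely and just builds a field map per line.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for the importer's parsing step; not the real GrokJSONImport.
class JsonLineSketch {
    // Matches "key": "value" pairs. Escaped characters are skipped over but not
    // unescaped, and non-string values are ignored (assumption: flat, string-only JSON).
    private static final Pattern PAIR = Pattern.compile(
        "\"((?:[^\"\\\\]|\\\\.)*)\"\\s*:\\s*\"((?:[^\"\\\\]|\\\\.)*)\"");

    /** Turns one line of JSON into an ordered field-name -> value map (one "document"). */
    public static Map<String, String> parseLine(String line) {
        Map<String, String> doc = new LinkedHashMap<>();
        Matcher m = PAIR.matcher(line);
        while (m.find()) {
            doc.put(m.group(1), m.group(2));
        }
        return doc;
    }

    public static void main(String[] args) throws IOException {
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        int count = 0;
        String line;
        while ((line = in.readLine()) != null) {
            Map<String, String> doc = parseLine(line);
            if (doc.isEmpty()) {
                continue; // skip lines with no recognizable fields
            }
            // A real importer would hand these fields to Lucene as one document here.
            count++;
        }
        System.out.println("Read " + count + " documents.");
    }
}
```

A real importer would use a proper JSON library instead of a regex; the regex only keeps the sketch free of dependencies.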

Let's search for all successful HTTP POSTs (well, the first 100 hits, since LogSearch.java only asks for 100 hits):

% java LogSearch '+response:200 +verb:post' timestamp verb request response
Found 5794 hits.
timestamp: 18/Jan/2009:04:01:00 -0500
verb: POST
request: /hackday08/randomtags.py
response: 200

timestamp: 18/Jan/2009:04:01:05 -0500
verb: POST
request: /hackday08/randomtags.py
response: 200

< remainder of output cut >
Most of the hits are for 'randomtags.py', which is a CGI script used by my Yahoo Pipes hack, SnackUpon. Let's filter out all of those requests:
% java LogSearch '+response:200 +verb:post NOT request:/hackday08/randomtags.py' timestamp verb request response
Found 91 hits.
timestamp: 18/Jan/2009:09:12:04 -0500
verb: POST
request: /blog/geekery/217
response: 200

timestamp: 18/Jan/2009:09:16:02 -0500
verb: POST
request: /blog/static/about#comment_anchor
response: 200

< remainder of output cut >
What if I want to see some non-200 response code GETs? Turn the query into 'verb:get NOT response:200' and you're done.
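
If the '+' and 'NOT' bits look like magic, here is a rough sketch of what they mean, written as a tiny matcher in plain Java rather than with Lucene itself. Everything in it (QuerySketch, Clause, matches) is a hypothetical name, and the case-insensitive string comparison is only a stand-in for what a Lucene analyzer does when 'verb:post' matches 'POST'. The rules it follows are: '+' clauses must match, 'NOT' (or '-') clauses must not match, and when there are no '+' clauses at least one plain clause has to match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of the boolean query semantics; this is not Lucene code.
class QuerySketch {
    /** One field:value clause plus how it participates in the match. */
    static final class Clause {
        final String field;
        final String value;
        final char kind; // '+' required, '-' prohibited, '?' optional

        Clause(String field, String value, char kind) {
            this.field = field;
            this.value = value;
            this.kind = kind;
        }

        boolean hits(Map<String, String> doc) {
            String actual = doc.get(field);
            // Case-insensitive equality stands in for analyzed term matching.
            return actual != null && actual.equalsIgnoreCase(value);
        }
    }

    /** Splits a query on whitespace into clauses; "NOT" negates the clause after it. */
    public static List<Clause> parse(String query) {
        List<Clause> clauses = new ArrayList<>();
        boolean negateNext = false;
        for (String token : query.trim().split("\\s+")) {
            if (token.equals("NOT")) {
                negateNext = true;
                continue;
            }
            char kind = '?';
            if (token.startsWith("+")) {
                kind = '+';
                token = token.substring(1);
            } else if (token.startsWith("-")) {
                kind = '-';
                token = token.substring(1);
            }
            if (negateNext) {
                kind = '-';
                negateNext = false;
            }
            int colon = token.indexOf(':');
            if (colon < 0) {
                continue; // this sketch has no default field, so bare terms are skipped
            }
            clauses.add(new Clause(token.substring(0, colon), token.substring(colon + 1), kind));
        }
        return clauses;
    }

    /** True if the document satisfies the query under the rules described above. */
    public static boolean matches(Map<String, String> doc, String query) {
        boolean sawRequired = false;
        boolean optionalHit = false;
        for (Clause c : parse(query)) {
            boolean hit = c.hits(doc);
            if (c.kind == '+') {
                sawRequired = true;
                if (!hit) return false;
            } else if (c.kind == '-') {
                if (hit) return false;
            } else {
                optionalHit |= hit;
            }
        }
        // With no required clauses, at least one optional clause has to match.
        return sawRequired || optionalHit;
    }

    public static void main(String[] args) {
        List<Map<String, String>> docs = List.of(
            Map.of("verb", "POST", "request", "/hackday08/randomtags.py", "response", "200"),
            Map.of("verb", "POST", "request", "/blog/geekery/217", "response", "200"),
            // Hypothetical entry, made up so the GET query has something to find.
            Map.of("verb", "GET", "request", "/missing", "response", "404"));
        String query = args.length > 0
            ? args[0]
            : "+response:200 +verb:post NOT request:/hackday08/randomtags.py";
        for (Map<String, String> doc : docs) {
            if (matches(doc, query)) {
                System.out.println(doc.get("request"));
            }
        }
    }
}
```

Run with no arguments, it uses the filtered POST query from above and prints only /blog/geekery/217. Pass 'verb:get NOT response:200' as the argument and only the made-up 404 entry comes back.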

Pretty cool, eh? :)