Grok + Lucene
Posted Sun, 18 Jan 2009
Last night I mentioned some ideas about an open source data analytics tool. I spent a few minutes today cleaning up the code I used to test grok and Lucene.
I used the latest HEAD version of grok to turn Apache logs into JSON and wrote a Java program to read the JSON output into Lucene. The last step was to write a simple search tool to query the data in Lucene.
For a test case, I used a 10,000-line Apache access log. To populate the index, I just ran this:
% ./grok | java GrokJSONImport

Grok (per the config above) will output JSON objects for each match, and GrokJSONImport will read each line and parse it as JSON, telling Lucene that each new log entry is a new document with fields matched by grok.
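To make the import step concrete, here is a rough sketch of the per-line loop. This is not the actual GrokJSONImport code: the class name ImportSketch is hypothetical, an in-memory list stands in for the Lucene index, and a small regex stands in for a real JSON parser. It assumes each input line is a flat JSON object whose values are all strings, and it does not unescape backslash sequences.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for the import step: one JSON line in, one "document" out.
final class ImportSketch {
    // Matches "key": "value" pairs in a flat JSON object.
    // Assumption: string values only; escape sequences are kept as-is, not decoded.
    private static final Pattern PAIR = Pattern.compile(
        "\"((?:[^\"\\\\]|\\\\.)*)\"\\s*:\\s*\"((?:[^\"\\\\]|\\\\.)*)\"");

    // Turn one line of JSON into a field -> value map (the "document").
    static Map<String, String> parseLine(String json) {
        Map<String, String> doc = new LinkedHashMap<>();
        Matcher m = PAIR.matcher(json);
        while (m.find()) {
            doc.put(m.group(1), m.group(2));
        }
        return doc;
    }

    // Read every non-blank line and add one document per line to the index.
    // A real importer would hand each document to Lucene here instead of a list.
    static List<Map<String, String>> importAll(BufferedReader in) {
        List<Map<String, String>> index = new ArrayList<>();
        in.lines()
          .filter(line -> !line.trim().isEmpty())
          .forEach(line -> index.add(parseLine(line)));
        return index;
    }

    public static void main(String[] args) {
        BufferedReader stdin = new BufferedReader(
            new InputStreamReader(System.in, StandardCharsets.UTF_8));
        List<Map<String, String>> index = importAll(stdin);
        System.out.println("Imported " + index.size() + " documents.");
    }
}
```

The point is only the shape of the loop: each log entry becomes its own document, and each matched field becomes a named field on that document.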
Let's search for all successful HTTP POSTs (well, the first 100 hits, since LogSearch.java only asks for 100 hits):
% java LogSearch '+response:200 +verb:post' timestamp verb request response
Found 5794 hits.

timestamp: 18/Jan/2009:04:01:00 -0500
verb: POST
request: /hackday08/randomtags.py
response: 200

timestamp: 18/Jan/2009:04:01:05 -0500
verb: POST
request: /hackday08/randomtags.py
response: 200

< remainder of output cut >

Most of the hits are related to 'randomtags.py', which is a CGI script used by my Yahoo Pipes hack, SnackUpon. Let's filter out all of those requests:
% java LogSearch '+response:200 +verb:post NOT request:/hackday08/randomtags.py' timestamp verb request response
Found 91 hits.

timestamp: 18/Jan/2009:09:12:04 -0500
verb: POST
request: /blog/geekery/217
response: 200

timestamp: 18/Jan/2009:09:16:02 -0500
verb: POST
request: /blog/static/about#comment_anchor
response: 200

< remainder of output cut >

What if I want to see some non-200 response code GETs? Turn the query into 'verb:get NOT response:200' and you're done.
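For anyone unfamiliar with the query syntax: a leading + marks a clause as required, and NOT excludes any document that matches the clause. The sketch below mimics that filtering over a few in-memory documents. It is a simplified illustration, not Lucene and not LogSearch.java: the QuerySketch and Clause names are hypothetical, matching is assumed to be a case-insensitive comparison of the whole field value (real Lucene matching depends on the analyzer), and the third sample document is made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory illustration of required (+) and prohibited (NOT) clauses.
final class QuerySketch {
    // One field:value clause. Assumption: case-insensitive whole-value match.
    static final class Clause {
        final String field;
        final String value;

        Clause(String field, String value) {
            this.field = field;
            this.value = value;
        }

        boolean matches(Map<String, String> doc) {
            String actual = doc.get(field);
            return actual != null && actual.equalsIgnoreCase(value);
        }
    }

    // Keep documents that match every required clause and no prohibited clause.
    static List<Map<String, String>> search(List<Map<String, String>> docs,
                                            List<Clause> required,
                                            List<Clause> prohibited) {
        List<Map<String, String>> hits = new ArrayList<>();
        for (Map<String, String> doc : docs) {
            boolean allRequired = required.stream().allMatch(c -> c.matches(doc));
            boolean anyProhibited = prohibited.stream().anyMatch(c -> c.matches(doc));
            if (allRequired && !anyProhibited) {
                hits.add(doc);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<Map<String, String>> docs = List.of(
            Map.of("verb", "POST", "request", "/hackday08/randomtags.py", "response", "200"),
            Map.of("verb", "POST", "request", "/blog/geekery/217", "response", "200"),
            // Made-up entry, included only so a non-matching document exists.
            Map.of("verb", "GET", "request", "/missing", "response", "404"));

        // Equivalent of '+response:200 +verb:post NOT request:/hackday08/randomtags.py'
        List<Map<String, String>> hits = search(
            docs,
            List.of(new Clause("response", "200"), new Clause("verb", "post")),
            List.of(new Clause("request", "/hackday08/randomtags.py")));

        for (Map<String, String> hit : hits) {
            // prints "POST /blog/geekery/217 200"
            System.out.println(hit.get("verb") + " " + hit.get("request") + " " + hit.get("response"));
        }
    }
}
```

Swapping the clause lists is all it takes to express the other queries here, which is a big part of why this kind of field-based search is pleasant for log data.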
Pretty cool, eh? :)