Random thoughts: Log analytics with open source
Posted Sun, 18 Jan 2009
With all the awesome open source projects available to date that focus on tight features and perform well, how much work would it be to tie them together and produce a tool that's able to compete with Splunk?
I hooked grok and Lucene together last night to parse and index logs, and the results were pretty slick. I could query for any keyword I wanted, and if I wanted logs matching specific fields like IP address or Apache response code, I could do that too. Grok does the hard part of eating a log line and outputting key:value pairs, while Lucene does the hard part of indexing the field values.
Indexing logs in Lucene required using it in a somewhat strange way: We treat every log entry as a unique document. This way, each log line can have several key:value pairs (fields) associated with it, and searching becomes easy.
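To make the "one log line, one document" idea concrete, here is a rough sketch in plain Python. It is not the grok-plus-Lucene code from last night: a regex with named groups stands in for a grok pattern, and a small in-memory inverted index stands in for Lucene. The pattern, the field names, and the `LogIndex` class are all made up for illustration.

```python
import re
from collections import defaultdict

# Hypothetical pattern for an Apache-style access log line.
# It plays the role a grok pattern would play; the field names are assumptions.
ACCESS_LOG = re.compile(
    r'(?P<clientip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<verb>\S+) (?P<request>\S+) [^"]*" (?P<response>\d{3}) (?P<bytes>\S+)'
)


def parse_line(line):
    """Return the line's fields as a dict, or None if the pattern does not match."""
    match = ACCESS_LOG.match(line)
    return match.groupdict() if match else None


class LogIndex:
    """Toy stand-in for Lucene: every log line is its own document."""

    def __init__(self):
        self.docs = []                    # doc id -> (raw line, fields)
        self.postings = defaultdict(set)  # (field, value) -> set of doc ids

    def add(self, line):
        """Parse and index one line; return its doc id, or None if unparsed."""
        fields = parse_line(line)
        if fields is None:
            return None
        doc_id = len(self.docs)
        self.docs.append((line, fields))
        for key, value in fields.items():
            self.postings[(key, value)].add(doc_id)
        return doc_id

    def search(self, **criteria):
        """Return the raw lines whose fields match every key=value given."""
        sets = [self.postings.get(item, set()) for item in criteria.items()]
        hits = set.intersection(*sets) if sets else set(range(len(self.docs)))
        return [self.docs[i][0] for i in sorted(hits)]
```

With something like this, `index.search(response="404")` returns every indexed line whose response code was 404, and adding `clientip="10.0.0.2"` narrows it to one client. A real indexer would also tokenize the raw line for keyword search, which this sketch skips.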
- Log parsing: grok and other tools have this done.
- Log indexing: Lucene
- On-demand graph tools: Python matplotlib, JavaScript flot, etc
- Alerting: Nagios
- Fancy web interface: Ruby on Rails, or whatever
The hard part, from an implementation perspective, is only as hard as taking output (logs, data, whatever) and feeding your indexer with the fields you want to store.
Parsing all kinds of log formats isn't a trivial task, since different log formats will require new pattern matching. However, grok's automatic pattern discovery could be used to help fill in gaps where you may not yet have defined patterns.
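One simple way to cope with several formats is to try a list of known patterns in order and set aside whatever none of them match, so those lines can be looked at later (by hand, or with something like grok's pattern discovery). The sketch below assumes two made-up regexes, one Apache-style and one syslog-style; it does not use grok itself, and the names `PATTERNS`, `parse_any`, and `parse_all` are hypothetical.

```python
import re

# Hypothetical named patterns, tried in order. Real grok patterns would replace these.
PATTERNS = [
    ("apache_access", re.compile(
        r'(?P<clientip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
        r'"(?P<request>[^"]*)" (?P<response>\d{3}) (?P<bytes>\S+)'
    )),
    ("syslog", re.compile(
        r'(?P<timestamp>\w{3} +\d+ \d\d:\d\d:\d\d) (?P<host>\S+) '
        r'(?P<program>[^:\[\s]+)(?:\[(?P<pid>\d+)\])?: (?P<message>.*)'
    )),
]


def parse_any(line):
    """Return fields from the first pattern that matches, or None."""
    for name, pattern in PATTERNS:
        match = pattern.match(line)
        if match:
            # Drop optional groups that did not take part in the match.
            fields = {k: v for k, v in match.groupdict().items() if v is not None}
            fields["format"] = name
            return fields
    return None


def parse_all(lines):
    """Split lines into parsed field dicts and raw lines no pattern matched."""
    parsed, unmatched = [], []
    for line in lines:
        fields = parse_any(line)
        if fields is None:
            unmatched.append(line)
        else:
            parsed.append(fields)
    return parsed, unmatched
```

The `unmatched` list is the interesting part: it is the pile of lines that still need a pattern written for them.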
Time and energy permitting, I might pursue this project.