Random thoughts: Log analytics with open source

Over the past few years, I've tinkered on and off with various projects to help me do log analysis, data aggregation, graphing, etc. Recently, I had a discussion with a coworker about alternatives to Splunk (specifically, free ones). Turns out there aren't any projects, as far as I can tell, that provide most of what Splunk does.

With all the awesome open source projects available to date that focus on tight features and perform well, how much work would it be to tie them together and produce a tool that's able to compete with Splunk?

I hooked grok and Lucene together last night to parse and index logs, and the results were pretty slick. I could query for any keyword I wanted, and if I wanted logs matching specific fields like IP address or Apache response code, I could do that too. Grok does the hard part of eating a log line and outputting key:value pairs, while Lucene does the hard part of indexing the field values.
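
To make the "log line in, key:value pairs out" step concrete, here is a minimal sketch in Python. It is not grok: it uses a plain named-group regular expression as a stand-in, and the pattern, the field names, and the `parse_line` helper are all assumptions made up for illustration.

```python
import re

# Hypothetical stand-in for a grok pattern: a named-group regex for the
# Apache common log format. Grok's own pattern syntax is different.
APACHE_COMMON = re.compile(
    r'(?P<clientip>\S+) (?P<ident>\S+) (?P<auth>\S+) '
    r'\[(?P<timestamp>[^\]]+)\] '
    r'"(?P<verb>\S+) (?P<request>\S+) (?P<httpversion>[^"]+)" '
    r'(?P<response>\d{3}) (?P<bytes>\d+|-)'
)


def parse_line(line):
    """Return a dict of field -> value, or None if the line does not match."""
    match = APACHE_COMMON.match(line)
    return match.groupdict() if match else None


# Made-up sample line in the common log format.
sample = ('127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
          '"GET /apache_pb.gif HTTP/1.0" 200 2326')
fields = parse_line(sample)
```

Each key in `fields` is something an indexer could store as a separate field, which is what makes queries like "all lines with this response code" possible later on.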

Indexing logs in Lucene required using it in a somewhat strange way: We treat every log entry as a unique document. This way, each log line can have several key:value pairs (fields) associated with it, and searching becomes easy.
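
The one-document-per-log-line idea can be sketched without Lucene at all. The toy index below is an assumption-laden illustration, not Lucene's API: `LogIndex` and its methods are hypothetical names, and it only supports exact field=value matches.

```python
from collections import defaultdict


class LogIndex:
    """Toy inverted index: every log entry is its own document."""

    def __init__(self):
        self.docs = []                      # doc id -> dict of fields
        self.postings = defaultdict(set)    # (field, value) -> set of doc ids

    def add(self, fields):
        """Store one log entry and index each of its key:value pairs."""
        doc_id = len(self.docs)
        self.docs.append(fields)
        for key, value in fields.items():
            self.postings[(key, value)].add(doc_id)
        return doc_id

    def search(self, **criteria):
        """Return the documents matching every field=value pair given."""
        if not criteria:
            return []
        id_sets = [self.postings.get((k, v), set()) for k, v in criteria.items()]
        matches = set.intersection(*id_sets)
        return [self.docs[i] for i in sorted(matches)]


# Made-up entries standing in for parsed log lines.
index = LogIndex()
index.add({"clientip": "10.0.0.1", "response": "200", "request": "/"})
index.add({"clientip": "10.0.0.2", "response": "404", "request": "/missing"})
index.add({"clientip": "10.0.0.1", "response": "404", "request": "/old"})

not_found_from_one = index.search(clientip="10.0.0.1", response="404")
```

Because every entry carries its own set of fields, a multi-field query is just an intersection of the posting sets for each field=value pair.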

  • Log parsing: grok and other tools have this done.
  • Log indexing: Lucene
  • On-demand graph tools: Python matplotlib, JavaScript flot, etc.
  • Alerting: Nagios
  • Fancy web interface: Ruby on Rails, or whatever

Indexing non-log data, such as SNMP queries, only requires that you feed Lucene the right data.

The hard part, from an implementation perspective, is only as hard as taking output (logs, data, whatever) and feeding your indexer with the fields you want to store.

Parsing all kinds of log formats isn't a trivial task, since different log formats will require new pattern matching. However, grok's automatic pattern discovery could be used to help fill in gaps where you may not yet have defined patterns.

Time and energy permitting, I might pursue this project.

fancydb performance

Various trials with basically the same input set: 2.5 million row entries, maximum 1 entry per second. The insertion rate drops by 60% if you add rule evaluations, which is an unfortunate performance loss. I'll work on making rules less invasive. Unfortunately, Python threads will never run on two processors at once, so I can't gain significant performance by sharding rule processing across separate threads; most unfortunate. Maybe fork+ipc is necessary here, but I am somewhat loath to do that.

The slowdown when rules are present is due to the record keeping that is done to notify that a rule should be evaluated again (rule evaluations are queued). Basically, the loop 'is this row being watched by a rule' is the slowdown. I'll try attacking this first.
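
One possible way to attack the "is this row being watched by a rule" check is to replace the per-insert loop with a dictionary lookup. The sketch below is only an illustration of that idea; fancydb's real internals are not shown here, and `RuleQueue`, `add_rule`, `on_insert`, and `drain` are hypothetical names.

```python
from collections import defaultdict, deque


class RuleQueue:
    """Hypothetical bookkeeping: map each watched row to its rules up front."""

    def __init__(self):
        self.watchers = defaultdict(list)   # source row -> rule names
        self.pending = deque()              # rules waiting for evaluation
        self.queued = set()                 # avoids queueing a rule twice

    def add_rule(self, name, source_row):
        self.watchers[source_row].append(name)

    def on_insert(self, row):
        # One dict lookup per insert instead of a scan over every rule.
        for rule in self.watchers.get(row, ()):
            if rule not in self.queued:
                self.queued.add(rule)
                self.pending.append(rule)

    def drain(self):
        """Return the queued rules in order and reset the queue."""
        rules = list(self.pending)
        self.pending.clear()
        self.queued.clear()
        return rules


queue = RuleQueue()
queue.add_rule("hits.mean.1hour", "hits.minute")
queue.add_rule("hits.mean.1day", "hits.mean.1hour")

for _ in range(1000):
    queue.on_insert("hits.minute")   # watched row, rule queued only once
queue.on_insert("unwatched.row")     # no rule watches this row

pending = queue.drain()
```

With this layout, an insert into a row that nothing watches costs a single failed lookup, and a rule that is already queued is not queued again.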

With 2 rules (unoptimized rules):
    hits.minute => hits.mean.1hour @ 60*60
    hits.minute => hits.mean.1day @ 60*60*24
  insertion rate = 7600/sec

With 2 rules (optimized chaining):
    hits.minute => hits.mean.1hour @ 60*60
    hits.mean.1hour => hits.mean.1day @ 60*60*24
  insertion rate = 12280/sec

With 9 rules (optimized chaining):
  insertion rate = 10000/sec

With 0 rules:
  trial 1: 40000/sec
  trial 2: 26700/sec
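
To illustrate what the chained rules change, here is a small sketch that assumes the rule notation means "source => target @ window in seconds" and that each target is a mean over its window. That reading is an assumption, and the data is made up. With chaining, the daily mean is fed from 24 hourly means instead of 1440 minute values.

```python
from statistics import mean

MINUTES_PER_HOUR = 60
HOURS_PER_DAY = 24

# Made-up data: one hits.minute value per minute for a full day.
hits_minute = [(i * 7) % 50 for i in range(MINUTES_PER_HOUR * HOURS_PER_DAY)]

# hits.minute => hits.mean.1hour: one mean per 60-minute window.
hits_mean_1hour = [
    mean(hits_minute[start:start + MINUTES_PER_HOUR])
    for start in range(0, len(hits_minute), MINUTES_PER_HOUR)
]

# Unchained: hits.minute => hits.mean.1day reads all 1440 minute values.
day_from_minutes = mean(hits_minute)

# Chained: hits.mean.1hour => hits.mean.1day reads only 24 hourly values.
day_from_hours = mean(hits_mean_1hour)
```

The two daily means agree (up to floating-point rounding) only because every hourly window holds the same number of samples; with uneven windows, a mean of means would need to be weighted by the sample counts.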

Week of unix tools; day 4: data source tools

Day 4 is finally ready for consumption, a bit late ;)

This article touches: cat, nc, ssh, openssl, GET, wget, w3m, and others. It's designed to show you a pile of tools you can use to pull data from various places.

day 4; data sources