
Random thoughts: Log analytics with open source

Over the past few years, I've tinkered on and off with various projects to help me do log analysis, data aggregation, graphing, etc. Recently, I had a discussion with a coworker about alternatives to Splunk (specifically, free ones). Turns out there aren't any projects, as far as I can tell, that provide most of what Splunk does.

With all the awesome open source projects available today, each focusing on a tight feature set and performing well, how much work would it be to tie them together into a tool that can compete with Splunk?

I hooked grok and Lucene together last night to parse and index logs, and the results were pretty slick. I could query for any keyword I wanted, etc. If I wanted logs involving specific fields like IP address, apache response code, etc, I could do it. Grok does the hard part of eating a log line and outputting key:value pairs while Lucene does the hard part of indexing field values and such.

Indexing logs in Lucene required using it in a somewhat strange way: We treat every log entry as a unique document. This way, each log line can have several key:value pairs (fields) associated with it, and searching becomes easy.
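
As a rough sketch (and only a sketch), indexing one document per log line might look like this with PyLucene, assuming the Lucene 2.x-era API; grok_parse() is a hypothetical stand-in for grok's parsing step, not a real API:

# One Lucene document per log line. Assumes PyLucene with the 2.x API;
# grok_parse() is a hypothetical helper returning key:value pairs.
import lucene
lucene.initVM(lucene.CLASSPATH)

store = lucene.FSDirectory.getDirectory("/tmp/logindex", True)
writer = lucene.IndexWriter(store, lucene.StandardAnalyzer(), True)

for line in open("/var/log/auth.log"):
    doc = lucene.Document()
    # Keep the raw line searchable as free text.
    doc.add(lucene.Field("line", line, lucene.Field.Store.YES,
                         lucene.Field.Index.TOKENIZED))
    # Store each grok-extracted key:value pair as its own field.
    for key, value in grok_parse(line).items():
        doc.add(lucene.Field(key, value, lucene.Field.Store.YES,
                             lucene.Field.Index.UN_TOKENIZED))
    writer.addDocument(doc)
writer.close()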

  • Log parsing: grok and other tools have this done.
  • Log indexing: lucene
  • On-demand graph tools: python matplotlib, javascript flot, etc
  • Alerting: nagios
  • Fancy web interface: Ruby on Rails, or whatever
Indexing non-log data, such as SNMP queries, only requires feeding Lucene the right data.

The hard part, from an implementation perspective, is only as hard as taking output (logs, data, whatever) and feeding your indexer with the fields you want to store.

Parsing all kinds of log formats isn't a trivial task, since different log formats will require new pattern matching. However, grok's automatic pattern discovery could be used to help fill in gaps where you may not yet have defined patterns.

Pending time and energy, I might pursue this project.

Impulse-driven computing

Muscle memory is great. Are there flexible, programmable tools which let you turn a set of potentially-complex actions into something muscle-memory trainable?

I suspect making a generic tool to do this would be difficult. keynav and xdotool aim to solve some of these problems, but what about some of the more complex ones? Is it worth trying to solve these edge cases with automation? Specifically, I mean solutions where, programmatically, you'd be talking to two or more separate systems (or APIs).

One specific set of problems stems from the fact that X11's default clipboard buffer is not the same thing as GTK's clipboard buffer. So, in firefox, using 'middle click' to paste gives me X11's clipboard, while CTRL+V gives me GTK's (firefox's) clipboard contents. It's likely I'm calling this thing "X11's clipboard" when it's really the "X11 selection". It seems simple to write a tool that would copy X11's current selection to GTK's clipboard.

You could have code that looked like this, but it wouldn't be efficient:

while true:
  if gtk_clipboard_changed():
    set_x11_clipboard(gtk_clipboard_contents())
  else if x11_clipboard_changed():
    set_gtk_clipboard(x11_clipboard_contents())
You could keep that from chewing up CPU by adding a small sleep at the end of each iteration, but polling like that still sucks. From what I can tell, GTK has a way to block until the clipboard changes, but X11 may not.
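
For what it's worth, here's roughly what that sleepy loop might look like for real, using PyGTK to mirror the X11 PRIMARY selection into the clipboard (one direction only, and the polling interval is an arbitrary choice of mine):

# Poll the PRIMARY selection and copy changes into the CLIPBOARD.
# Assumes PyGTK 2.x is installed.
import time
import gtk

primary = gtk.clipboard_get("PRIMARY")
clipboard = gtk.clipboard_get("CLIPBOARD")

last = None
while True:
    text = primary.wait_for_text()  # current selection contents, or None
    if text is not None and text != last:
        clipboard.set_text(text)
        last = text
    time.sleep(0.2)  # the sleepy part; without it, this chews CPU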

If the X11 application uses cut buffers, then the root X window gets notifications about cut buffer changes. However, copying stuff in firefox doesn't show any cut buffers being used.

Alternately, we could hack our own "ctrl+v" functionality by grabbing that keystroke, or by grabbing a different, unrelated keystroke, which would do:

  1. copy primary selection to clipboard
  2. send literal "ctrl+v"
  3. restore clipboard
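Those three steps are scriptable with existing tools. A rough sketch, assuming xsel and xdotool are installed and this script is bound to some hotkey:

# Save the clipboard, overwrite it with the PRIMARY selection,
# send a literal ctrl+v, then restore the old clipboard.
import subprocess
import time

def xsel(args, data=None):
    proc = subprocess.Popen(["xsel"] + args, stdin=subprocess.PIPE,
                            stdout=subprocess.PIPE)
    out, _ = proc.communicate(data)
    return out

saved = xsel(["--clipboard", "--output"])             # remember clipboard
xsel(["--clipboard", "--input"], xsel(["--output"]))  # 1. selection -> clipboard
subprocess.call(["xdotool", "key", "ctrl+v"])         # 2. literal ctrl+v
time.sleep(0.2)  # let the target app read the clipboard first
xsel(["--clipboard", "--input"], saved)               # 3. restore clipboard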
Update: An existing tool will keep your selection and clipboard buffers in sync: autocutsel. Looks like it uses the sleepy-loop approach I mentioned, but it does work. Awesome!

What next?

I've still been swamped with work, life, more work and more life. As a result, I've posted much less frequently lately than I have in the past.

This is mostly due to lacking time to spend on research. I've learned many new things in recent weeks that I'm eager to post about, but I haven't had time to sit down and actually write.

So. This site has very few regular visitors, but to those of you who do read: What should I post about next? Ideas would be much appreciated. :)

I'm also working on a new site layout and planning on putting more time and effort into research and posts here. We'll see how far that actually goes, though. Site layout should be done soon. My awesome (?) GIMP-fu created a new logo. Thoughts and suggestions welcome.

Project Ideas for the near future

  • Write/find a non-crappy, very simple comment plugin for pyblosxom.
  • Write/find a better tab-searchy extension than Reveal
  • Rework grok's config syntax to do more with less typing
  • Sleep
If you have any suggestions for the 'find'-type items, let me know :)

Grok plans and ideas

I'm almost done at RIT and about to graduate, which is why I haven't posted about anything recently. Finally done with this school. Wee!

Some noise woke me up from my pleasant slumber, and I can't seem to get back to sleep, so I might as well do some thinking.

I think grok's config syntax is far too cluttered. I would much rather make it more simplified, somehow. Something like:

file /var/log/auth.log: syslog
{blockuser, t=3, i=60} Invalid user %USER% from %IP%

file /var/log/messages: syslog
{tracksudo, prog=su} BAD SU %USER:F% to %USER:T% on %TTY%

reaction blockuser "pfctl -t naughty -T add %IP%"
reaction tracksudo "echo '%USER:F% failed su. (to %USER:T%)' >> /var/log/su.log"
Seems like it's less writing than the existing version. Less writing is good. A more relaxed syntax would be nicer as well, such as not requiring quotes around filenames.

This new syntax also splits reaction definitions from pattern matches, allowing you to call reactions from different files and other matches. I'll be keeping the perlblock and other features that reactions have, because they are quite useful.

I've also come up with a simple solution for reactions that execute shell commands. The current version of grok will run system("your command") every time a shell reaction fires. This is tragically suboptimal due to shell startup overhead. The simple solution is to keep a single "/bin/sh -" process running, so that there is already a shell accepting standard input and waiting for my commands. Each reaction's shell command is simply written to that process.
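
The idea, sketched in Python here purely for illustration (grok itself is Perl):

# Start one long-lived shell and write reaction commands to its stdin,
# paying the shell startup cost once instead of once per reaction.
import subprocess

shell = subprocess.Popen(["/bin/sh", "-"], stdin=subprocess.PIPE,
                         universal_newlines=True)

def react(command):
    shell.stdin.write(command + "\n")
    shell.stdin.flush()

react("echo 'reaction fired' >> /tmp/reactions.log")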

I wrote a short program to test the speed of using system() vs printing to the input of a shell. You can view the testsh.pl script and the profiled output.

An abbreviated form of the profiled output follows:

%Time    Sec.     #calls   sec/call  F  name
92.24    1.5127     1000   0.001513     main::sys
 2.04    0.0335     1000   0.000033     main::sh
sys() calls system(), while sh() simply sends data to an existing shell process. The results speak for themselves: even though this example only runs "echo", the shell's startup time is obviously a huge factor in runtime. The difference is incredible. I am definitely implementing this in the next version of grok. I've already run into many cases where I'm processing extremely large logs post-hoc and end up using perl block reactions to speed up execution. This shell execution speed-up will help make grok even faster, and it can always use more speed.

Still no time to work on projects right now, perhaps soon! I'm moving to California in less than a week, so this will have to wait until after the move.

Event DB - temporal event storage

Travelling always gives me lots of time to think about new ideas. Today's 12 hours of flight gave me some time to brainstorm ideas for my "sysadmin time machine" project.

I've come up with the following so far:

  • The concept of an event is something which has "when, where, and what"-ness. Other properties of events such as significance and who-reported-it are trivial. The key bits are when the event occurred, where it occurred, and what the event was.
  • Software logs happen to have these three key properties. Put simply: store it in a database that lets you search over a range of times and you have yourself a time machine.
  • Couple this with visualizations and statistical analysis. Trends are important. Automatic novelty detection is important.
  • Trends can be seen by viewing data over time - whether visual or formulaic (though the former is easier for Joe Average to see). An example trend would be showing a gradual increase in disk usage over a period of time.
  • Novelty detection can occur a number of ways. Something as simple as a homoskedasticity test could show if data were "normal" - though homoskedasticity only works well for linear models, iirc. I am not a statistician.
  • Trend calculation can provide useful models predicting resource exhaustion, MTBNF, and other important data (see the sketch after this list).
  • Novelty detection aids in fire fighting and post-hoc "Oops it's broken" forensics.
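To make the trend bullet concrete, here's a toy example (all numbers invented): fit a least-squares line to disk usage samples, then extrapolate to guess when the disk fills.

# Least-squares fit over (seconds, percent-used) samples, extrapolated
# to estimate when disk usage hits 100%.
def fit_line(points):
    n = float(len(points))
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept

samples = [(0, 50.0), (86400, 51.2), (172800, 52.5)]  # one sample per day
slope, intercept = fit_line(samples)
if slope > 0:
    seconds_left = (100.0 - intercept) / slope
    print("disk full in about %.1f days" % (seconds_left / 86400.0))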
I'm hoping to find time to get an event storage prototype working soon. The next step would be to leverage RRDtool as a numeric storage medium and perform novelty and trend detection/analysis on its data.

The overall goal of this is to somewhat automate problem detection and significantly aid in problem cause/effect searching.

The eventdb system will likely support many interfaces:

  • syslog - hosts can log directly to eventdb
  • command line - scriptably/manually push data to the eventdb
  • generic numeric data input - a lame frontend to rrdtool, perhaps
Thus, all data would be pushed through eventdb, which would figure out which on-disk data medium to store it in. Queries could then ask eventdb things such as "show me yesterday's mysql activity around 3am" or "compare average syscall usage across this week and last week".
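
A minimal sketch of that storage core, using sqlite3 (the schema and names here are invented just to show the when/where/what shape):

# Toy eventdb core: every event is (when, where, what); range queries
# over "when" give you the time machine behavior.
import sqlite3
import time

db = sqlite3.connect("eventdb.sqlite3")
db.execute("CREATE TABLE IF NOT EXISTS events "
           "(when_ REAL, where_ TEXT, what TEXT)")

def record(where, what, when=None):
    db.execute("INSERT INTO events VALUES (?, ?, ?)",
               (when or time.time(), where, what))
    db.commit()

record("db1.example.com", "mysqld: too many connections")

# A "yesterday's mysql activity" style query is just a time-range SELECT:
start = time.time() - 86400
stop = start + 3600
rows = db.execute("SELECT when_, where_, what FROM events "
                  "WHERE when_ BETWEEN ? AND ? AND what LIKE ?",
                  (start, stop, "%mysql%")).fetchall()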

This sort of trend and novelty mapping would be extremely useful in a production software environment for comparing configuration or software changes. That is, last month's syscall averages might be much lower than this month's, and perhaps the only change was a configuration file change or new software being pushed to production. You would be able to track the problem back to when it first showed up, hopefully correlating it to some known change. After all, the first step in solving a problem is knowing of its existence.

My experience with data analysis techniques is not extensive, so I wouldn't expect the data analysis tools in the prototype to sport anything fancy.

I need more hours in a day! Oh well, back to doing homework. Hopefully I'll have some time to prototype something soon.

Note to self, use autoindent instead of cindent.

I use cindent in vim. cindent is very useful. However, it breaks a lot of the time, especially in non-C-style languages (perl, etc). It indents "how I want" 90% of the time; the other 10% of the time it indents incorrectly, or indents when I don't want it to at all.

If I can find time tomorrow, I'd like to write a few magic-indentation keybindings so I can easily do certain kinds of indentation the way I want, and only when I want. Most of the time, autoindent is all I need anyway.

I'll post any scripts I come up with here when they're done.

Xvfb + Firefox

Resig has a bunch of unit tests he runs to make sure jQuery works properly on whatever browser. Manually running and checking unit test results is annoying and time-consuming. Let's automate this.

Update (May 2010): See this post for more details on automating xserver startup without having to worry about display numbers (:1, :2, etc).

Combine something simple like Firefox and Xvfb (X Virtual Frame Buffer), and you've got a simple way to run Firefox without a visible display.

Let's start Xvfb:

startx -- `which Xvfb` :1 -screen 0 1024x768x24

# Or with Xvnc (also headless)
startx -- `which Xvnc` :1 -geometry 1024x768 -depth 24

# Or with Xephyr (nested X server, requires X)
startx -- `which Xephyr` :1 -screen 1024x768x24

This starts Xvfb running on :1 with a screen size of 1024x768 and 24 bits/pixel color depth. Now, let's run firefox:
DISPLAY=:1 firefox

# Or, if you run csh or tcsh
env DISPLAY=:1 firefox
Seems simple enough. What now? We want to tell firefox to go to google.com, perhaps.
DISPLAY=:1 firefox-remote http://www.google.com/
Now, let's take a screenshot (requires ImageMagick's import command):
DISPLAY=:1 import -window root googledotcom.png
Let's see what that looks like: googledotcom.png

While this isn't complicated, we could very easily automate lots of magic using something like the Selenium extension, all without requiring a visible display (monitor). Hopefully I'll find time to work on something cool using this soon.

A problem with screen scraping and other website interaction automation is that it almost always needs to be done without a browser. For instance, all of my screen scraping adventures have been in Perl. Browsers already know how to speak to the web, so why reinvent the wheel?

Firefox has lots of javascript-magic extensions such as greasemonkey and Selenium to let you execute browser-side javascript and activity automatically. Combine these together with Xvfb, and you can automate lots of things behind the scenes.

Tie this back to unit tests: instead of simply displaying the results, have the page also report them to a CGI script on the webserver. This lets you test websites automatically using a real web browser and have the results reported back to a server.

Projects for Spring Break

The following are projects I'd like to put time into during spring break, in no particular order:
  • templating system for the stats aggregator with logwatcher
  • jquery sort() plugin
  • get newpsm/newmoused into FreeBSD -CURRENT
  • pam_captcha code audit
Whether or not I actually get around to doing any of this is still in question.

New ideas, et al.

I think I'm going to add a few new sections to the website, namely a tips-and-tricks section for whenever I discover something neat.

I need to seriously sit down and learn python, and possibly ruby. I've played with pygame a little bit and it's way cool. I haven't been able to do much with it yet (I want to build some sort of information-kiosk interface for another project) due to my lack of python knowledge. Perhaps I'll invest in a good book; I'm not sure yet.

I'm also going to start work on an XML schema for my posts, so instead of doing silly things like File: directives, I can simply use in-line <file> tags. Flexibility++

This will also make doing post -e much simpler, in that I won't need any specialness to unparse the changed directives.