Search this site


Page 1 of 4  [next]

Metadata

Articles

Projects

Presentations

c++ grok vs perl grok on pattern discovery

I finished up work on the pattern discovery feature for the C++ port of grok. As soon as it was finished, I wanted to see the dpeed differences between the perl and C++ versions.

  • Perl grok: 6 lines analyzed per second
  • C++ grok: 130 lines analyzed per second
The feature tested here was the one detailed in this post.

130 lines per second isn't fantastic, but it's 21.66 times faster than the perl version, and that's huge.

I still have to implement a few other features to make the C++ version equivalent to the perl version:

  • config file (same format, ideally, as the perl version)
  • filters, like %SYSLOGDATE|parsedate%

Grok predicates - Perl vs C++

I just finished implementing predicates in c++grok (tentative name) and wanted to compare the performance against perl grok.

An input of 50000 lines of apache logfile amounting to 9.7megs of data.

I initially attempted this using the regex predicate %IP~/^129% but I realized that perl grok compiles the predicate regex every time it is executed, and wasn't a fair test. So I switched to %IP>=129% instead, which converts the match to an integer first (so 129.21.60.9 turns into 129, for example), which seems like more equal ground based on the implementations in both perl and C++.

# C++ Grok
% /usr/bin/time ./test_patterns "%IP>=129%" < /tmp/access.50klines > /dev/null
2.56user 0.14system 0:02.92elapsed 92%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+408minor)pagefaults 0swaps

# Perl Grok
% /usr/bin/time perl grok -m "%IP>=129/%" -r "%IP%" < /tmp/access.50klines > /dev/null
8.87user 1.24system 0:25.94elapsed 39%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+17721minor)pagefaults 0swaps
What still remains consistent is the trend that the more complexity I add in C++ equates to a greater margin of speed from the perl version.
  • Using strict %FOO% patterns with no predicates, the C++ version is 6 to 7 times faster than the perl equivalent in grok.
  • Using predicates shows the C++ version running 10 times faster.
I still need to write test cases for the C++ version in addition to porting the pattern discovery portion from perl.

Exciting :)

Grok speed improvements

The benchmark for testing speed improvements was to process 3000 lines for '%SYSLOGBASE .*? %IPORHOST%'. Profiler was Devel::Profile.

Before:

time running program:  11.9738  (85.07%)
number of calls:       409362

%Time    Sec.     #calls   sec/call  F  name
64.52    7.6336     2870   0.002660     main::meta2string
10.00    1.1838     5740   0.000206     main::filter
 8.88    1.0507     3000   0.000350     main::handle
 3.14    0.3710    54828   0.000007     main::handle_capture
 0.77    0.0908     2870   0.000032     main::react
After
time running program:  2.5216  (82.73%)
number of calls:       105152

%Time    Sec.     #calls   sec/call  F  name
40.56    1.0228     3000   0.000341     main::handle
15.22    0.3838    54828   0.000007     main::handle_capture
 4.47    0.1128     2870   0.000039     main::meta2string
 3.31    0.0834     2870   0.000029     main::react
 2.61    0.0658    14747   0.000004     main::debug
 1.81    0.0456      237   0.000192     main::readlog
Primary changes were mostly to pregenerate a few regular expressions. Previously, I was generating the same regex every time filter() or meta2string() was being called. These small changes gave grok a serious boost in speed: what was taking 12 seconds now takes 2.5 seconds.

One example of a simple optimization is this:

before: my $re = join("|", map { qr/(?:$_)/ } keys(%$filters));
after:  our $re ||= join("|", map { qr/(?:$_)/ } keys(%$filters));
This may decrease readability, but it only sets $re once no matter how many times that line of code is executed. This line of code change was in the filter() function. Just doing this one simple change reduced the runtime of each filter() call by 97%; which reduces its runtime to something trivially small by comparison.

At this point, this is where I was very happy to have written tests for grok. To verify that grok still behaved properly after making these speed improvements, I just ran the tests suite. It passed, giving me some confidence in the stability of the changes.

new grok version available (20070915)

Hop on over to the grok project page and download the new version.

The changelist from the previous announced release is as follows:

Changes for 20070915:

  * Added 'grok_patfind.pl' which adds 'grok -F' functionality. Read about it
    here:
    http://www.semicomplete.com/blog/geekery/grok-pattern-autodiscovery.html

  * Proper shutdown is called to kill the "hack" subprocess
  * Add 'shell' option to 'type' sections; not currently used.
  * Warn if we're trying to override an existing filter.
  * Added more perlthoughts to t/theory/

  * Numerous pattern regex changes:
    - NUMBER no longer uses Regexp::Common's real number regex since that one
      matches '.' and my thoughts are that '.' is not a number
    - Added POSITIVENUM pattern
    - Fix HOSTNAME to match hostnames starting with a number (Again,
      Regexp::Common fails me)
    - Add path patterns: UNIXPATH, WINPATH, URI
    - MONTH no longer matches 0 through 12
    - DAY no longer matches 0 through 6
    - SYSLOGPROG is more specific now, since valid prog names can have dashes.

Grok and automatic log pattern discovery

My todo list has had "grok - magical 'pattern finder' thing" since May of this year. I added it after I wrote up some thoughts on pattern matching, string sequencing, and automation.

I spent many hours on that problem tonight.

Initially, I wrote a python script which would compare each line in a file against every other line in the file. Using difflib and string split, I could figure out what words were changed between two lines. If the lines were similar, I could compute the difference and replace the differences with a token placeholder, such as "WORD".

Download this python script

Here's a sample output of the tool where it compares one line against another, computes the difference and makes a new string that will match both lines:

0.91: error: PAM: authentication error for illegal user karinebeers
    : error: PAM: authentication error for illegal user leshana
  ==> error: PAM: authentication error for illegal user %WORD%
This script is fairly primitive in execution. It only compares whole tokens, which are delimited by space. This was good, but not enough. It doesn't know about compound patterns such as quoted strings, or complex patterns such as those matching an IP address or a file path.

How do we consider complex patterns like hostnames and quoted strings? Turns out, most of the pattern matching wizardry is in grok. Grok knows about named patterns, and we can abuse this to use grok in a different way - instead of parsing text, we're going to use it to turn input text into a grok pattern.

Example:

% echo '/foo/bar/baz "foo bar baz" hello ther 1234 www.google.com' \
| perl grok -F 
%UNIXPATH% %QS% hello ther 1234 %IPORHOST%
What did it do?
input:  /foo/bar/baz "foo bar baz" hello ther 1234 www.google.com
output: %UNIXPATH%   %QS%          hello ther 1234 %IPORHOST%
Using a new hack on top of grok, we can now turn an unknown plaintext input into a pattern that is reusable and human-meaningful. This is totally awesome.

This hack only considers complex tokens for simplicity's sake; that is, tokens only containing letters and numbers are ignored. Why? Is 'foo1234' a word or a hostname? Is 1997 a number or a year? Grok allows you to make these distinctions, but I skip simple tokens so I don't have to programatically weight random patterns. Note the above example, where '1234' was not replaced with '%NUMBER%' or something similar.

So, I run this new grok hack on my /var/log/messages file. Here's a sampling of output from grok:

# grok -F < /var/log/messages
%SYSLOGBASE% logfile turned over due to size>100K
%SYSLOGBASE% kernel time sync enabled 6001
%SYSLOGBASE% error: PAM: authentication error for testuser from %IPORHOST%
%SYSLOGBASE% Limiting closed port RST response from 380 to 200 packets/sec
%SYSLOGBASE% error: PAM: authentication error for admin from %IPORHOST%
%SYSLOGBASE% error: PAM: authentication error for pants from %IPORHOST%
%SYSLOGBASE% kernel time sync enabled 6001
%SYSLOGBASE% kernel time sync enabled 2001
This is pretty close to a reusable pattern that captures all the changing data. You'll notice there are commonalities among certain lines such as the 'authentication error' lines where only the username is different. Let's run the 'authentication error' lines only through the python script and see what happens:
# messages.patterns contains the output of 'grok -F'
% grep 'authentication error' messages.patterns > testinput
% python distance_sort.py  testinput | sort | uniq
%SYSLOGBASE% error: PAM: authentication error for %WORD% from %IPORHOST%
%SYSLOGBASE% error: PAM: authentication error for illegal user %WORD% from %IPORHOST%
Wow, that's exactly what I wanted out of this process. Now we have some useful patterns, generated with almost zero effort. Let's apply this to our log file and see what we can get. How about counting, by user, failed login attempts?
pattern="%SYSLOGBASE% error: PAM: authentication error for %WORD% from %IPORHOST%"
./grok -m "$pattern" -r "%WORD%" < messages \
| sort | uniq -c | sort -n | tail -5
   5 backup
   9 angel
  10 pgsql
  12 mail
  17 ident

So what have we really accomplished?

You can take any log file and run it through this process. The output of this process is a set of patterns you can use to query data in that log file (and others like it). The real power here is not that patterns are generated, it's that named, queryable patterns are generated. That's awesome.

A letter to Ruby and Python.

Dear Ruby and Python,

Please implement Perl's (?{}) and (??{}) in your regular expression engines so I can do outrageous state machine and pattern matching in your languages. Thank you.

Love,
Jordan

Grok gets more ridiculous

In my last post, I discussed a way to perform additional assertions on a given captured group while still inside of the regex state machine.

I spent some time today and implemented it in grok and it works like a charm! This kind of functionality gives you extreme power in the kind of matches you can specify.

Here's an example: How can we find the first number on a line that's greater than 20?

% jot 50 | xargs -n15
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
31 32 33 34 35 36 37 38 39 40 41 42 43 44 45
46 47 48 49 50
Using the above as our sample input:
% jot 50 | xargs -n15 | perl grok -m "%NUMBER>20%" -r "%NUMBER%"
21
31
46
The assertion '>20' is applied during the regular expression match. This is sweet!

Another example would be to 'grep' the output of 'ls -l' for any lines containing a number greater than 5000:

% ls -l | perl grok -m "%INT>5000%"
-rw-r--r--  1 jls  jls   21590 Apr  2 20:17 foo
-rw-r--r--  1 jls  jls   22451 Apr  2 23:52 grok
-rw-r--r--  1 jls  jls  129536 Apr  2 23:47 grok-20070402.tar.gz
-rw-r--r--  1 jls  jls   13539 Mar 22 14:22 grok.1

The equivalent awk statement would look like this:

% ls -l | awk '$5 > 5000'
But if you look at the awk statement, you have to count columns. With grok's solution, all you need to know is you want any line with an integer greater than 5000. You are able to specify the specifics of your match without having to know the precise layout of the data you are matching.

What do you get if you can chain predicates? I haven't added that functionality yet, but it would be trivial to add, so expect it soon.

If you're interested, please try the latest version.

The CSH Bawls Programming Competition

Yesterday, I participated in a 12-hour coding-binge competition. It started at 7pm Friday night and ran until 7am Saturday morning. It was fueled by Computer Science House and Bawls, both sponsors of the event. Needless to say, I haven't gotten much sleep today.

The competition website is here. Go there if you want to view this year's objectives.

The Dream Team consisted of John Resig, Darrin Mann, Matt Bruce, and myself. Darrin, Resig, and I are all quite proficient at web development, so we decided this year we would represent ourselves as "Team JavaScript" - and do everything possible in javascript. Bruce is not a programmer, but I enlisted his graphical art skills because I figured with our team doing some web-based project, we definitely needed an artist.

After reviewing all the objectives, we came up with a significant modification upon the Sudoku objective. The sudoku objective was a problem that lacked much room for innovation, so we went further and instead of solving Sudoku, wrote a web-based version of an extremely popular game in Second Life. The contest organizer approved of our new objective, so we did just that.

Resig worked on game logic, I worked on chat features, Darrin worked on scoring and game generation, and Bruce worked on the interface graphics. Becuase our tasks were all mostly unrelated, we could develop them independently. Most of the game was completed in about 6 hours, and the remainder of the time was spent fixing bugs, refactoring, and some minor redesign.

The backends were minimal. The chat backend was only 70 lines of perl, and the score backend was 9 lines of /bin/sh. Everything else was handled in the browser. We leveraged Resig's jQuery to make development faster. Development went extremely smooth, a testament to the "Dream Team"-nature of our team, perhaps? ;)

The game worked by presenting everyone with the same game - so you can compete for the highest score. You could also chat during and between games, if you wanted to.

A screenshot can be found here. At the end of the competition, we only had one known bug left. That bug didn't affect gameplay, and we were all tired, so it didn't get fixed. There were a few other issues that remained unresolved that may or may not be related to our code. Firefox was having issues with various things we were doing, and we couldn't tell if it was our fault or not.

Despite the fact that I probably shouldn't have attended the competition due to scholastic time constraints, I was glad I went. We had a blast writing the game.

We may get some time in the near future to improve the codebase and put it up online so anyone can play. There are quite a few important features that need to be added before it'll be useful as a public game.

RRDTool to graph log-originating data.

I need to relearn rrdtool, again, for this sysadmin time machine project. Today's efforts were spent testing for features I hoped were in RRDTool. So far, my feature needs are met :)

Take something simple, like webserver logs. Let's graph the hits.

Create the RRD:

rrdtool create webhits.rrd --start 1128626000 -s 60 \
   DS:hits:GAUGE:120:0:U RRA:AVERAGE:.5:5:600000 \
   RRA:AVERAGE:.5:30:602938 RRA:AVERAGE:.5:60:301469 \
   RRA:AVERAGE:.5:240:75367 RRA:AVERAGE:.5:1440:12561
My logs start *way* back in November of last year, so I create the rrd with a start date of sometime in Novemeber. The step is 60, so it expects data every minute. I then specify one data type, hits, which is a gaugue (rate), and ranges from 0 to infinity (U). The rest of the command is RRA's defining how data is stored. The first one says take 5 samples and average them, and store 600,000 of these samples, at a maximum.

Now that we have the database, we need a "hits-per-minute" data set. I wrote a short perl script, parsehttp that will read from standard input and calculate hits-per-minute and output rrdtool update statements. Capture this output and run it through sh:

./parsehttp < access.log | sh -x
Simple enough. This will calculate hits-per-minute for all times in the logs and store it in our RRD.

Now that we have the data, we can graph it. However, since I want to view trends and compare time periods, I'll need to do something fancier than simple graphs.

RRDTool lets you graph multiple data sets on the same graph. So, I want to graph this week's hits and last week's hits. However, since the data sets are on different time intervals, I need to shift last week's set forward by one week. Here's the rrdtool command that graphs it for us, with last week's and this week's data on the same graph, displayed at the same time period:

rrdtool graph webhits.png -s "-1 week" \
   DEF:hits=webhits.rrd:hits:AVERAGE  \
   DEF:lastweek=webhits.rrd:hits:AVERAGE:start="-2 weeks":end="start + 1 week" \
   SHIFT:lastweek:604800 \
   LINE1:lastweek#00FF00:"last week" LINE1:hits#FF0000:"this week"
That'll look like line noise if you've never used RRDTool before. I define two data sets with DEF: hits and lastweek. They both read from the 'hits' data set in webhits.rrd. One starts at "-1 week" (one week ago, duh) and the other starts 2 weeks ago and ends last week. I then shift last week's data forward by 7 days (604800 seconds). Lastly, I draw two lines, one for last weeks (green), the other for this weeks (red).

That graph looks like this:

That's not really useful, becuase there's so many data points the graph almost becomes meaningless. This is due to my poor creation of RRAs. We can fix that by redoing the database, or using the TREND feature. Change our graph statement to be:

rrdtool graph webhits.png -s "-1 week" \
   DEF:hits=webhits.rrd:hits:AVERAGE  \
   DEF:lastweek=webhits.rrd:hits:AVERAGE:start="-2 weeks":end="start + 1 week" \
   SHIFT:lastweek:604800 \
   CDEF:t_hits=hits,86400,TREND CDEF:t_lastweek=lastweek,86400,TREND \
   LINE1:lastweek#CCFFCC:"last week" LINE1:hits#FFCCCC:"this week" \
   LINE1:t_lastweek#00FF00:"last week" LINE1:t_hits#FF0000:"this week"
I added only two CDEF statements. They take a data set and "trend" it by one day (86400 seconds). This creates a sliding average across time. I store these in new data sets called t_hits and t_lastweek and graph those aswell.

The new graph looks like this:

You'll notice the slide values are chopped off on the left, that's becuase it doesn't have enough data points at those time periods to make an average. However, including the raw data makes the graph scale as it did before, making viewing the trend difference awkward. So, let's fix it by not graphing the raw data. Just cut out the LINE1:lastweek and LINE1:hits options.

Fixing the sliding average cutoff, add a title, and a vertical label:

rrdtool graph webhits.png -s "-1 week" \
   -t "Web Server Hits - This week vs Last week" \
   -v "hits/minute" \
   DEF:hits=webhits.rrd:hits:AVERAGE:start="-8 days":end="start + 8 days"  \
   DEF:lastweek=webhits.rrd:hits:AVERAGE:start="-15 days":end="start + 8 days" \
   SHIFT:lastweek:604800 \
   CDEF:t_hits=hits,86400,TREND CDEF:t_lastweek=lastweek,86400,TREND \
   LINE1:t_lastweek#00FF00:"last week" LINE1:t_hits#FF0000:"this week" \
The graph is still from one week ago until now, but our data sets used extend beyond those boundaries, so that sliding averages can be calculated throughout. The new, final graph, looks like this:

Now I can compare this week's hits against last weeks, quickly with a nice visual. This is what I'm looking for.

This would become truely useful if we had lots of time periods (days, weeks, whatever) to look at. Then we could calculate standard deviation, etc. A high outlier could be marked automatically with a label, giving an instant visual cue that something is potentially novel. It might be simple to create a sort-of sliding "standard deviation" curve. I haven't tried that yet.

New event recording database prototype

I finally managed to find time today to work on my events database project. In the processes of doing so, I found a few bugs in grok that needed to get fixed. Some of my regular expressions were being a bit greedy, so certain pattern expansion was breaking.

To summarize the event recording system, it is a webserver listening for event publish requests. It accepts the "when" "where" and "what" of an event, and stores it in a database.

To have my logs pushed to the database, I'll leverage the awesome power of Grok. Before I do that, I gathered all of the auth.log files and archives and compiled them into their respective files.

The grok.conf for this particular maneuver:

exec "cat ./logs/nightfall.auth.log ./logs/sparks.auth.log ./logs/whitefox.auth.log" {
   type "all syslog" {
      match_syslog = 1;
      reaction = 'fetch -qo - "http://localhost:8080/?when=%SYSLOGDATE|parsedate%&where=%HOST%/%PROG|urlescape|shdq%&what=%DATA:GLOB|urlescape|shdq%"';
   };
};
This is farily simple. I added a new standard filter, 'urlescape' to grok becuase I needed it. it will url escape a data piece. Hurray!

Run grok, and it sends event notifications to the webserver for every syslog-matching line. Using FreeBSD's command-line web client, fetch.

sqlite> select count(*) from events;
8085
Now, let's look for something meaningful. I want to know what happened on all sshd services between 1am and 3am this morning (Today, May 3rd):
nightfall(~/projects/eventdb) % date -j 05030100 +%s
1146632400
nightfall(~/projects/eventdb) % date -j 05030400 +%s
1146643200
Now I know the Unix epoch times for May 3rd at 1am and 4am.
sqlite> select count(*) from events where time >= 1146632400 
   ...> and time <= 1146643200 and location like "%/sshd" 
   ...> and data like "Invalid user%";
2465
This query is instant. Much faster than doing 'grep -c' on N log files across M machines. I don't care how good your grep-fu is, you aren't going to be faster.This speed feature is only the beginning. Think broader terms. Nearly instantly zoom to any point in time to view "events" on a system or set of systems. Filter out particular events by keyword or pattern. Look for the last time a service was restarted. I could go on, but you probably get the idea. It's grep, but faster, and with more features.

As far as the protocol and implementation goes, I'm not sure how well this web-based concept is going to prevail. At this point, I am not interested in protocol or database efficiency. The prototype implementation is good enough. From what I've read about Splunk in the past months in the form of advertisements and such, it seems I already have the main feature Splunk has: searching logs easily. Perhaps I should incorporate and sell my own, better-than-Splunk, product? ;)

Bear in mind that I have no idea what Splunk actually does beyond what I've gleaned from advertisements for the product. I'm sure it's atleast somewhat useful, or no one would invest.

Certainly, a pipelined HTTP client could perform this much faster than doing 10000 individual http requests. A step further would be having the web server accept any number of events per page request. The big test is going to see how well HTTP scales, but that can be played with later.

At this point, we have come fairly close to the general idea of this project: Allowing you to zoom to particular locations in time and view system events.

The server code for doing this was very easy. I chose Python and started playing with CherryPy (a webserver framework). I had a working event reciever server in about 30 minutes. 29 minutes of that time was spent writing a threadsafe database class to front for pysqlite. The CherryPy bits only amount to about 10 lines of code, out of 90ish.

The code to do the server can be found here: /scripts/cherrycollector.py