Search this site


[prev]  Page 2 of 4  [next]

Metadata

Articles

Projects

Presentations

Python C++ Grok bindings

I've gotten quite a bit further tonight on making c++grok's functionality available in python.

Mostly tonight's efforts have been spent learning the python C api and learning how to add new objects and methods. I'm planning to have this ready for BarCampRochester3 in two weeks.

So far I can make new GrokRegex objects and call set_regex() and search() on them. Next time I'll be implementing GrokMatch objects (like in the C++ version) and a few other small things. Fun fun :)

Looks scary, actually simple: grammar parsing.

I hit a mental roadblock a few days ago; I was afraid to write a grammar parser for c++grok that supported the same basic format as the perl grok config format.

Perl grok's config grammar was super easy to write thanks to Parse::RecDescent. In the C++ version, I wanted similar ease. However, the tools I had available didn't appear to be expressive enough to support what I wanted. The config object in C++ was going to be a class, so you were free to have multiple config objects, and which meant I couldn't have any global variables. Both Boost Xpressive and Boost Spirit support grammar parsing almost trivially, but they require awkward wrapping and basically make it very hard to use when you want to update values in a class instance instead of some global variables.

Eventually, I gave up and wrote my own recursive descent bits using Xpressive to do the pattern matching and some trivial in-object state management to keep track of what was going on. It was really simple, despite my fears.

I'm not really sure what made me afraid of doing it, but the fear was totally unfounded.

C++ Grok has working filters and exec sections now.

I finished implementing exec and filters:
exec "tail -1 /var/log/auth.log" {
  type "syslog" {
    match = ".*";
    reaction = "echo %=MATCH|shellescape%";
  };
};
I've made a point of having perl-grok's config format work, because I think it was a reasonable format (you're free to disagree!). At any rate, filters are now working, and the result of the above code is:
Reaction: echo Feb  8 23:25:01 snack CRON\[21596\]: pam_unix\(cron:session\): session closed for user root
Checking for input: tail -1 /var/log/auth.log(0x74b100)
Reading from: tail -1 /var/log/auth.log

Feb 8 23:25:01 snack CRON[21596]: pam_unix(cron:session): session closed for user root

c++ grok vs perl grok on pattern discovery

I finished up work on the pattern discovery feature for the C++ port of grok. As soon as it was finished, I wanted to see the dpeed differences between the perl and C++ versions.

  • Perl grok: 6 lines analyzed per second
  • C++ grok: 130 lines analyzed per second
The feature tested here was the one detailed in this post.

130 lines per second isn't fantastic, but it's 21.66 times faster than the perl version, and that's huge.

I still have to implement a few other features to make the C++ version equivalent to the perl version:

  • config file (same format, ideally, as the perl version)
  • filters, like %SYSLOGDATE|parsedate%

Grok predicates - Perl vs C++

I just finished implementing predicates in c++grok (tentative name) and wanted to compare the performance against perl grok.

An input of 50000 lines of apache logfile amounting to 9.7megs of data.

I initially attempted this using the regex predicate %IP~/^129% but I realized that perl grok compiles the predicate regex every time it is executed, and wasn't a fair test. So I switched to %IP>=129% instead, which converts the match to an integer first (so 129.21.60.9 turns into 129, for example), which seems like more equal ground based on the implementations in both perl and C++.

# C++ Grok
% /usr/bin/time ./test_patterns "%IP>=129%" < /tmp/access.50klines > /dev/null
2.56user 0.14system 0:02.92elapsed 92%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+408minor)pagefaults 0swaps

# Perl Grok
% /usr/bin/time perl grok -m "%IP>=129/%" -r "%IP%" < /tmp/access.50klines > /dev/null
8.87user 1.24system 0:25.94elapsed 39%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+17721minor)pagefaults 0swaps
What still remains consistent is the trend that the more complexity I add in C++ equates to a greater margin of speed from the perl version.
  • Using strict %FOO% patterns with no predicates, the C++ version is 6 to 7 times faster than the perl equivalent in grok.
  • Using predicates shows the C++ version running 10 times faster.
I still need to write test cases for the C++ version in addition to porting the pattern discovery portion from perl.

Exciting :)

Grok porting to C++

I've got pattern generation working.
% ./test '%NUMBER%' "hello 45.04" "-1.34"
Testing: %NUMBER%
Appending pattern 'NUMBER'
Test str: '(?:[+-]?(?:(?:[0-9]+(?:\.[0-9]*)?)|(?:\.[0-9]+)))'
regexid: 0x692840
Match: 1 / '45.04'
Match: 1 / '-1.34'
I'm pretty sure this is the 4th time I've at least started implementing grok in any given language. The total so far has been: perl, python, ruby, C++. I stopped working on the one in ruby because ruby's regexp engine is lacking in some useful features (*). The python port of grok was written before I added advanced predicates, which is why the ruby port was halted quickly.

(*) I opened a ruby feature request explaining a few problems I'd found with ruby's regexp feature. I even offered to help fix some of them. Circular discussions happened and I basically gave up on the idea of moving to ruby after ruby's own creator expressed a defeatist attitude about adding such a feature. My patches are still available. I don't particularly care that my request hasn't gone anywhere, so don't ask me about it, as I've happily moved on :)

Assuming I do this right, this should give grok a serious boost in speed.

Grok speed improvements

The benchmark for testing speed improvements was to process 3000 lines for '%SYSLOGBASE .*? %IPORHOST%'. Profiler was Devel::Profile.

Before:

time running program:  11.9738  (85.07%)
number of calls:       409362

%Time    Sec.     #calls   sec/call  F  name
64.52    7.6336     2870   0.002660     main::meta2string
10.00    1.1838     5740   0.000206     main::filter
 8.88    1.0507     3000   0.000350     main::handle
 3.14    0.3710    54828   0.000007     main::handle_capture
 0.77    0.0908     2870   0.000032     main::react
After
time running program:  2.5216  (82.73%)
number of calls:       105152

%Time    Sec.     #calls   sec/call  F  name
40.56    1.0228     3000   0.000341     main::handle
15.22    0.3838    54828   0.000007     main::handle_capture
 4.47    0.1128     2870   0.000039     main::meta2string
 3.31    0.0834     2870   0.000029     main::react
 2.61    0.0658    14747   0.000004     main::debug
 1.81    0.0456      237   0.000192     main::readlog
Primary changes were mostly to pregenerate a few regular expressions. Previously, I was generating the same regex every time filter() or meta2string() was being called. These small changes gave grok a serious boost in speed: what was taking 12 seconds now takes 2.5 seconds.

One example of a simple optimization is this:

before: my $re = join("|", map { qr/(?:$_)/ } keys(%$filters));
after:  our $re ||= join("|", map { qr/(?:$_)/ } keys(%$filters));
This may decrease readability, but it only sets $re once no matter how many times that line of code is executed. This line of code change was in the filter() function. Just doing this one simple change reduced the runtime of each filter() call by 97%; which reduces its runtime to something trivially small by comparison.

At this point, this is where I was very happy to have written tests for grok. To verify that grok still behaved properly after making these speed improvements, I just ran the tests suite. It passed, giving me some confidence in the stability of the changes.

new grok version available (20070915)

Hop on over to the grok project page and download the new version.

The changelist from the previous announced release is as follows:

Changes for 20070915:

  * Added 'grok_patfind.pl' which adds 'grok -F' functionality. Read about it
    here:
    http://www.semicomplete.com/blog/geekery/grok-pattern-autodiscovery.html

  * Proper shutdown is called to kill the "hack" subprocess
  * Add 'shell' option to 'type' sections; not currently used.
  * Warn if we're trying to override an existing filter.
  * Added more perlthoughts to t/theory/

  * Numerous pattern regex changes:
    - NUMBER no longer uses Regexp::Common's real number regex since that one
      matches '.' and my thoughts are that '.' is not a number
    - Added POSITIVENUM pattern
    - Fix HOSTNAME to match hostnames starting with a number (Again,
      Regexp::Common fails me)
    - Add path patterns: UNIXPATH, WINPATH, URI
    - MONTH no longer matches 0 through 12
    - DAY no longer matches 0 through 6
    - SYSLOGPROG is more specific now, since valid prog names can have dashes.

Grok and automatic log pattern discovery

My todo list has had "grok - magical 'pattern finder' thing" since May of this year. I added it after I wrote up some thoughts on pattern matching, string sequencing, and automation.

I spent many hours on that problem tonight.

Initially, I wrote a python script which would compare each line in a file against every other line in the file. Using difflib and string split, I could figure out what words were changed between two lines. If the lines were similar, I could compute the difference and replace the differences with a token placeholder, such as "WORD".

Download this python script

Here's a sample output of the tool where it compares one line against another, computes the difference and makes a new string that will match both lines:

0.91: error: PAM: authentication error for illegal user karinebeers
    : error: PAM: authentication error for illegal user leshana
  ==> error: PAM: authentication error for illegal user %WORD%
This script is fairly primitive in execution. It only compares whole tokens, which are delimited by space. This was good, but not enough. It doesn't know about compound patterns such as quoted strings, or complex patterns such as those matching an IP address or a file path.

How do we consider complex patterns like hostnames and quoted strings? Turns out, most of the pattern matching wizardry is in grok. Grok knows about named patterns, and we can abuse this to use grok in a different way - instead of parsing text, we're going to use it to turn input text into a grok pattern.

Example:

% echo '/foo/bar/baz "foo bar baz" hello ther 1234 www.google.com' \
| perl grok -F 
%UNIXPATH% %QS% hello ther 1234 %IPORHOST%
What did it do?
input:  /foo/bar/baz "foo bar baz" hello ther 1234 www.google.com
output: %UNIXPATH%   %QS%          hello ther 1234 %IPORHOST%
Using a new hack on top of grok, we can now turn an unknown plaintext input into a pattern that is reusable and human-meaningful. This is totally awesome.

This hack only considers complex tokens for simplicity's sake; that is, tokens only containing letters and numbers are ignored. Why? Is 'foo1234' a word or a hostname? Is 1997 a number or a year? Grok allows you to make these distinctions, but I skip simple tokens so I don't have to programatically weight random patterns. Note the above example, where '1234' was not replaced with '%NUMBER%' or something similar.

So, I run this new grok hack on my /var/log/messages file. Here's a sampling of output from grok:

# grok -F < /var/log/messages
%SYSLOGBASE% logfile turned over due to size>100K
%SYSLOGBASE% kernel time sync enabled 6001
%SYSLOGBASE% error: PAM: authentication error for testuser from %IPORHOST%
%SYSLOGBASE% Limiting closed port RST response from 380 to 200 packets/sec
%SYSLOGBASE% error: PAM: authentication error for admin from %IPORHOST%
%SYSLOGBASE% error: PAM: authentication error for pants from %IPORHOST%
%SYSLOGBASE% kernel time sync enabled 6001
%SYSLOGBASE% kernel time sync enabled 2001
This is pretty close to a reusable pattern that captures all the changing data. You'll notice there are commonalities among certain lines such as the 'authentication error' lines where only the username is different. Let's run the 'authentication error' lines only through the python script and see what happens:
# messages.patterns contains the output of 'grok -F'
% grep 'authentication error' messages.patterns > testinput
% python distance_sort.py  testinput | sort | uniq
%SYSLOGBASE% error: PAM: authentication error for %WORD% from %IPORHOST%
%SYSLOGBASE% error: PAM: authentication error for illegal user %WORD% from %IPORHOST%
Wow, that's exactly what I wanted out of this process. Now we have some useful patterns, generated with almost zero effort. Let's apply this to our log file and see what we can get. How about counting, by user, failed login attempts?
pattern="%SYSLOGBASE% error: PAM: authentication error for %WORD% from %IPORHOST%"
./grok -m "$pattern" -r "%WORD%" < messages \
| sort | uniq -c | sort -n | tail -5
   5 backup
   9 angel
  10 pgsql
  12 mail
  17 ident

So what have we really accomplished?

You can take any log file and run it through this process. The output of this process is a set of patterns you can use to query data in that log file (and others like it). The real power here is not that patterns are generated, it's that named, queryable patterns are generated. That's awesome.

grok gets a mailing list

For a few days now I've felt that grok should have some kind of mailing list. Why? It's maturing and now is the time to start building a community.

What hosting provider should I use? Sourceforge, berlios, and other open source hosting sites may provide mailing lists, but they also provide things I don't need: CVS repository, web hosting, etc. All I needed was mailing lists.

I had forgotten about Google Groups! I created a new list for grok, [email protected]. There are lots of administrative options available to me as an admin, and the interface to those options doesn't suck.

Free and easy mailing list hosting? I'll take it.

Please join me the on the grok-users mailing list :)

Group URL: http://groups.google.com/group/grok-users/