photo
Jordan Sissel
geek. sysadmin. blogger.

Fri, 28 Mar 2008

C++Grok bindings working in Python

% python example.py "%SYSLOGDATE%" < /var/log/messages | head -1
{'MONTH': 'Mar', '=LINE': 'Mar 23 06:47:03 snack syslogd 1.4.1#21ubuntu3: restart.', '=MATCH': 'Mar 23 06:47:03', 'TIME': '06:47:03', 'SYSLOGDATE': 'Mar 23 06:47:03', 'MONTHDAY': '23'}
That's right. I can now use C++Grok from python.

After I saw it work, I immediately ran a time check against the perl version:

% seq 20000 > /tmp/x
% time python example.py "%NUMBER>5000%" < /tmp/x > /tmp/x.python
0.59s user 0.00s system 99% cpu 0.595 total
% time perl grok -m "%NUMBER>5000%" -r "%NUMBER%" < /tmp/x  > /tmp/x.perl
4.86s user 0.94s system 18% cpu 31.647 total
The same basic operation is 50x faster in python with c++grok bindings than the pure perl version. Excellent. Sample python code:
g = pygrok.GrokRegex()
g.add_patterns( <dictionary of patterns> )
g.set_regex("%NUMBER>5000%")
match = g.search("hello there 123 456 7890 pants")
if match:
  print match["NUMBER"]
# prints '7890'
I knew I wasn't doing reference counting properly, so to test that I ran the python code against an input set of 1000000 lines and watched the memory usage, which clearly showed leaking. I quickly read up on ref counting in Python and what functions return new or borrowed references. A few keystrokes later my memory leaks were gone. After that I put python in the test suite and am read to push a new version of c++grok.

Download: cgrok-20080327.tar.gz

Python Build instructions:

% cd pygrok
% python setup.py install

# make sure it's working properly
% python -c 'import pygrok'
There is an example and some docs in the pygrok directory.

Let me know what you think :)

Comments: 0 (view comments)
Tags: , , , , ,
Permalink: /geekery/python-cppgrok-bindings
posted at: 01:31

Mon, 24 Mar 2008

Python C++ Grok bindings

I've gotten quite a bit further tonight on making c++grok's functionality available in python.

Mostly tonight's efforts have been spent learning the python C api and learning how to add new objects and methods. I'm planning to have this ready for BarCampRochester3 in two weeks.

So far I can make new GrokRegex objects and call set_regex() and search() on them. Next time I'll be implementing GrokMatch objects (like in the C++ version) and a few other small things. Fun fun :)

Comments: 0 (view comments)
Tags: , , , ,
Permalink: /geekery/python-cgrok-bindings-2
posted at: 13:36

Sat, 09 Feb 2008

Looks scary, actually simple: grammar parsing.

I hit a mental roadblock a few days ago; I was afraid to write a grammar parser for c++grok that supported the same basic format as the perl grok config format.

Perl grok's config grammar was super easy to write thanks to Parse::RecDescent. In the C++ version, I wanted similar ease. However, the tools I had available didn't appear to be expressive enough to support what I wanted. The config object in C++ was going to be a class, so you were free to have multiple config objects, and which meant I couldn't have any global variables. Both Boost Xpressive and Boost Spirit support grammar parsing almost trivially, but they require awkward wrapping and basically make it very hard to use when you want to update values in a class instance instead of some global variables.

Eventually, I gave up and wrote my own recursive descent bits using Xpressive to do the pattern matching and some trivial in-object state management to keep track of what was going on. It was really simple, despite my fears.

I'm not really sure what made me afraid of doing it, but the fear was totally unfounded.

Comments: 0 (view comments)
Tags: , ,
Permalink: /geekery/parsing-config-files
posted at: 17:08

C++ Grok has working filters and exec sections now.

I finished implementing exec and filters:
exec "tail -1 /var/log/auth.log" {
  type "syslog" {
    match = ".*";
    reaction = "echo %=MATCH|shellescape%";
  };
};
I've made a point of having perl-grok's config format work, because I think it was a reasonable format (you're free to disagree!). At any rate, filters are now working, and the result of the above code is:
Reaction: echo Feb  8 23:25:01 snack CRON\[21596\]: pam_unix\(cron:session\): session closed for user root
Checking for input: tail -1 /var/log/auth.log(0x74b100)
Reading from: tail -1 /var/log/auth.log

Feb 8 23:25:01 snack CRON[21596]: pam_unix(cron:session): session closed for user root

Comments: 0 (view comments)
Tags: ,
Permalink: /geekery/c-grok-filters-working
posted at: 02:28

Mon, 21 Jan 2008

c++ grok vs perl grok on pattern discovery

I finished up work on the pattern discovery feature for the C++ port of grok. As soon as it was finished, I wanted to see the dpeed differences between the perl and C++ versions.

  • Perl grok: 6 lines analyzed per second
  • C++ grok: 130 lines analyzed per second
The feature tested here was the one detailed in this post.

130 lines per second isn't fantastic, but it's 21.66 times faster than the perl version, and that's huge.

I still have to implement a few other features to make the C++ version equivalent to the perl version:

  • config file (same format, ideally, as the perl version)
  • filters, like %SYSLOGDATE|parsedate%

Comments: 1 (view comments)
Tags: , , , ,
Permalink: /geekery/cgrok-vs-grok-pattern-discovery
posted at: 19:07

Tue, 15 Jan 2008

Grok predicates - Perl vs C++

I just finished implementing predicates in c++grok (tentative name) and wanted to compare the performance against perl grok.

An input of 50000 lines of apache logfile amounting to 9.7megs of data.

I initially attempted this using the regex predicate %IP~/^129% but I realized that perl grok compiles the predicate regex every time it is executed, and wasn't a fair test. So I switched to %IP>=129% instead, which converts the match to an integer first (so 129.21.60.9 turns into 129, for example), which seems like more equal ground based on the implementations in both perl and C++.

# C++ Grok
% /usr/bin/time ./test_patterns "%IP>=129%" < /tmp/access.50klines > /dev/null
2.56user 0.14system 0:02.92elapsed 92%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+408minor)pagefaults 0swaps

# Perl Grok
% /usr/bin/time perl grok -m "%IP>=129/%" -r "%IP%" < /tmp/access.50klines > /dev/null
8.87user 1.24system 0:25.94elapsed 39%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+17721minor)pagefaults 0swaps
What still remains consistent is the trend that the more complexity I add in C++ equates to a greater margin of speed from the perl version.
  • Using strict %FOO% patterns with no predicates, the C++ version is 6 to 7 times faster than the perl equivalent in grok.
  • Using predicates shows the C++ version running 10 times faster.
I still need to write test cases for the C++ version in addition to porting the pattern discovery portion from perl.

Exciting :)

Comments: 0 (view comments)
Tags: , , , ,
Permalink: /geekery/grok-predicates-perl-vs-cplusplus
posted at: 02:31

Fri, 11 Jan 2008

Grok porting to C++

I've got pattern generation working.
% ./test '%NUMBER%' "hello 45.04" "-1.34"
Testing: %NUMBER%
Appending pattern 'NUMBER'
Test str: '(?:[+-]?(?:(?:[0-9]+(?:\.[0-9]*)?)|(?:\.[0-9]+)))'
regexid: 0x692840
Match: 1 / '45.04'
Match: 1 / '-1.34'
I'm pretty sure this is the 4th time I've at least started implementing grok in any given language. The total so far has been: perl, python, ruby, C++. I stopped working on the one in ruby because ruby's regexp engine is lacking in some useful features (*). The python port of grok was written before I added advanced predicates, which is why the ruby port was halted quickly.

(*) I opened a ruby feature request explaining a few problems I'd found with ruby's regexp feature. I even offered to help fix some of them. Circular discussions happened and I basically gave up on the idea of moving to ruby after ruby's own creator expressed a defeatist attitude about adding such a feature. My patches are still available. I don't particularly care that my request hasn't gone anywhere, so don't ask me about it, as I've happily moved on :)

Assuming I do this right, this should give grok a serious boost in speed.

Comments: 0 (view comments)
Tags: , , ,
Permalink: /geekery/grok-porting-to-c++
posted at: 06:38

Sat, 29 Sep 2007

Grok speed improvements

The benchmark for testing speed improvements was to process 3000 lines for '%SYSLOGBASE .*? %IPORHOST%'. Profiler was Devel::Profile.

Before:

time running program:  11.9738  (85.07%)
number of calls:       409362

%Time    Sec.     #calls   sec/call  F  name
64.52    7.6336     2870   0.002660     main::meta2string
10.00    1.1838     5740   0.000206     main::filter
 8.88    1.0507     3000   0.000350     main::handle
 3.14    0.3710    54828   0.000007     main::handle_capture
 0.77    0.0908     2870   0.000032     main::react
After
time running program:  2.5216  (82.73%)
number of calls:       105152

%Time    Sec.     #calls   sec/call  F  name
40.56    1.0228     3000   0.000341     main::handle
15.22    0.3838    54828   0.000007     main::handle_capture
 4.47    0.1128     2870   0.000039     main::meta2string
 3.31    0.0834     2870   0.000029     main::react
 2.61    0.0658    14747   0.000004     main::debug
 1.81    0.0456      237   0.000192     main::readlog
Primary changes were mostly to pregenerate a few regular expressions. Previously, I was generating the same regex every time filter() or meta2string() was being called. These small changes gave grok a serious boost in speed: what was taking 12 seconds now takes 2.5 seconds.

One example of a simple optimization is this:

before: my $re = join("|", map { qr/(?:$_)/ } keys(%$filters));
after:  our $re ||= join("|", map { qr/(?:$_)/ } keys(%$filters));
This may decrease readability, but it only sets $re once no matter how many times that line of code is executed. This line of code change was in the filter() function. Just doing this one simple change reduced the runtime of each filter() call by 97%; which reduces its runtime to something trivially small by comparison.

At this point, this is where I was very happy to have written tests for grok. To verify that grok still behaved properly after making these speed improvements, I just ran the tests suite. It passed, giving me some confidence in the stability of the changes.

Comments: 0 (view comments)
Tags: , ,
Permalink: /geekery/grok-speed-improvements
posted at: 03:27

Sat, 15 Sep 2007

new grok version available (20070915)

Hop on over to the grok project page and download the new version.

The changelist from the previous announced release is as follows:

Changes for 20070915:

  * Added 'grok_patfind.pl' which adds 'grok -F' functionality. Read about it
    here:
    http://www.semicomplete.com/blog/geekery/grok-pattern-autodiscovery.html

  * Proper shutdown is called to kill the "hack" subprocess
  * Add 'shell' option to 'type' sections; not currently used.
  * Warn if we're trying to override an existing filter.
  * Added more perlthoughts to t/theory/

  * Numerous pattern regex changes:
    - NUMBER no longer uses Regexp::Common's real number regex since that one
      matches '.' and my thoughts are that '.' is not a number
    - Added POSITIVENUM pattern
    - Fix HOSTNAME to match hostnames starting with a number (Again,
      Regexp::Common fails me)
    - Add path patterns: UNIXPATH, WINPATH, URI
    - MONTH no longer matches 0 through 12
    - DAY no longer matches 0 through 6
    - SYSLOGPROG is more specific now, since valid prog names can have dashes.

Comments: 0 (view comments)
Tags: , ,
Permalink: /geekery/grok-20070915
posted at: 04:13

Sun, 02 Sep 2007

Grok and automatic log pattern discovery

My todo list has had "grok - magical 'pattern finder' thing" since May of this year. I added it after I wrote up some thoughts on pattern matching, string sequencing, and automation.

I spent many hours on that problem tonight.

Initially, I wrote a python script which would compare each line in a file against every other line in the file. Using difflib and string split, I could figure out what words were changed between two lines. If the lines were similar, I could compute the difference and replace the differences with a token placeholder, such as "WORD".

Download this python script

Here's a sample output of the tool where it compares one line against another, computes the difference and makes a new string that will match both lines:

0.91: error: PAM: authentication error for illegal user karinebeers
    : error: PAM: authentication error for illegal user leshana
  ==> error: PAM: authentication error for illegal user %WORD%
This script is fairly primitive in execution. It only compares whole tokens, which are delimited by space. This was good, but not enough. It doesn't know about compound patterns such as quoted strings, or complex patterns such as those matching an IP address or a file path.

How do we consider complex patterns like hostnames and quoted strings? Turns out, most of the pattern matching wizardry is in grok. Grok knows about named patterns, and we can abuse this to use grok in a different way - instead of parsing text, we're going to use it to turn input text into a grok pattern.

Example:

% echo '/foo/bar/baz "foo bar baz" hello ther 1234 www.google.com' \
| perl grok -F 
%UNIXPATH% %QS% hello ther 1234 %IPORHOST%
What did it do?
input:  /foo/bar/baz "foo bar baz" hello ther 1234 www.google.com
output: %UNIXPATH%   %QS%          hello ther 1234 %IPORHOST%
Using a new hack on top of grok, we can now turn an unknown plaintext input into a pattern that is reusable and human-meaningful. This is totally awesome.

This hack only considers complex tokens for simplicity's sake; that is, tokens only containing letters and numbers are ignored. Why? Is 'foo1234' a word or a hostname? Is 1997 a number or a year? Grok allows you to make these distinctions, but I skip simple tokens so I don't have to programatically weight random patterns. Note the above example, where '1234' was not replaced with '%NUMBER%' or something similar.

So, I run this new grok hack on my /var/log/messages file. Here's a sampling of output from grok:

# grok -F < /var/log/messages
%SYSLOGBASE% logfile turned over due to size>100K
%SYSLOGBASE% kernel time sync enabled 6001
%SYSLOGBASE% error: PAM: authentication error for testuser from %IPORHOST%
%SYSLOGBASE% Limiting closed port RST response from 380 to 200 packets/sec
%SYSLOGBASE% error: PAM: authentication error for admin from %IPORHOST%
%SYSLOGBASE% error: PAM: authentication error for pants from %IPORHOST%
%SYSLOGBASE% kernel time sync enabled 6001
%SYSLOGBASE% kernel time sync enabled 2001
This is pretty close to a reusable pattern that captures all the changing data. You'll notice there are commonalities among certain lines such as the 'authentication error' lines where only the username is different. Let's run the 'authentication error' lines only through the python script and see what happens:
# messages.patterns contains the output of 'grok -F'
% grep 'authentication error' messages.patterns > testinput
% python distance_sort.py  testinput | sort | uniq
%SYSLOGBASE% error: PAM: authentication error for %WORD% from %IPORHOST%
%SYSLOGBASE% error: PAM: authentication error for illegal user %WORD% from %IPORHOST%
Wow, that's exactly what I wanted out of this process. Now we have some useful patterns, generated with almost zero effort. Let's apply this to our log file and see what we can get. How about counting, by user, failed login attempts?
pattern="%SYSLOGBASE% error: PAM: authentication error for %WORD% from %IPORHOST%"
./grok -m "$pattern" -r "%WORD%" < messages \
| sort | uniq -c | sort -n | tail -5
   5 backup
   9 angel
  10 pgsql
  12 mail
  17 ident

So what have we really accomplished?

You can take any log file and run it through this process. The output of this process is a set of patterns you can use to query data in that log file (and others like it). The real power here is not that patterns are generated, it's that named, queryable patterns are generated. That's awesome.

Comments: 0 (view comments)
Tags: , ,
Permalink: /geekery/grok-pattern-autodiscovery
posted at: 02:23

Search this site

Navigation

Page 1 of 3  [next]

Metadata

Home About Resume My Code (SVN Web)

Articles

ARP Security Dynamic DNS with DHCP OpenLDAP+Kerberos+SASL PPP over SSH SSH Security: /bin/false Week of Unix Tools Work Efficiency

Projects

fex firefox tabsearch firefox urledit grok keynav liboverride newpsm (FreeBSD) nis2ldap pam_captcha poor man's backup Solaris audio utility xboxproxy xdotool xmlpresenter xpathtool misc scripts

Presentations

Yahoo! Hack Day '06 Unix Essentials Vi/Vim Essentials

Tag Cloud

Calendar

< March 2008 >
SuMoTuWeThFrSa
       1
2 3 4 5 6 7 8
9101112131415
16171819202122
23242526272829
3031     

Friends

BarCamp Kent Brewster Tantek Çelik John Resig Wesley Shields Tyler Shields

Technorati