C++Grok bindings working in Python

% python example.py "%SYSLOGDATE%" < /var/log/messages | head -1
{'MONTH': 'Mar', '=LINE': 'Mar 23 06:47:03 snack syslogd 1.4.1#21ubuntu3: restart.', '=MATCH': 'Mar 23 06:47:03', 'TIME': '06:47:03', 'SYSLOGDATE': 'Mar 23 06:47:03', 'MONTHDAY': '23'}
That's right. I can now use C++Grok from python.

After I saw it work, I immediately ran a time check against the perl version:

% seq 20000 > /tmp/x
% time python example.py "%NUMBER>5000%" < /tmp/x > /tmp/x.python
0.59s user 0.00s system 99% cpu 0.595 total
% time perl grok -m "%NUMBER>5000%" -r "%NUMBER%" < /tmp/x  > /tmp/x.perl
4.86s user 0.94s system 18% cpu 31.647 total
Measured by wall-clock time, the same basic operation is roughly 50x faster in python with c++grok bindings than in the pure perl version. Excellent. Sample python code:
import pygrok

g = pygrok.GrokRegex()
g.add_patterns( <dictionary of patterns> )
g.set_regex("%NUMBER>5000%")
match = g.search("hello there 123 456 7890 pants")
if match:
  print(match["NUMBER"])
# prints '7890'
I knew I wasn't doing reference counting properly, so to test that I ran the python code against an input set of 1000000 lines and watched the memory usage, which clearly showed leaking. I quickly read up on ref counting in Python and on which functions return new or borrowed references. A few keystrokes later my memory leaks were gone. After that I put python in the test suite and am ready to push a new version of c++grok.

Download: cgrok-20080327.tar.gz

Python Build instructions:

% cd pygrok
% python setup.py install

# make sure it's working properly
% python -c 'import pygrok'
There is an example and some docs in the pygrok directory.

Let me know what you think :)

Python C++ Grok bindings

I've gotten quite a bit further tonight on making c++grok's functionality available in python.

Mostly tonight's efforts have been spent learning the python C api and learning how to add new objects and methods. I'm planning to have this ready for BarCampRochester3 in two weeks.

So far I can make new GrokRegex objects and call set_regex() and search() on them. Next time I'll be implementing GrokMatch objects (like in the C++ version) and a few other small things. Fun fun :)

Looks scary, actually simple: grammar parsing.

I hit a mental roadblock a few days ago; I was afraid to write a grammar parser for c++grok that supported the same basic format as the perl grok config format.

Perl grok's config grammar was super easy to write thanks to Parse::RecDescent. In the C++ version, I wanted similar ease. However, the tools I had available didn't appear to be expressive enough to support what I wanted. The config object in C++ was going to be a class, so you were free to have multiple config objects, which meant I couldn't have any global variables. Both Boost Xpressive and Boost Spirit support grammar parsing almost trivially, but they require awkward wrapping that makes them very hard to use when you want to update values in a class instance instead of some global variables.

Eventually, I gave up and wrote my own recursive descent bits using Xpressive to do the pattern matching and some trivial in-object state management to keep track of what was going on. It was really simple, despite my fears.
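To show how small such a parser can be, here is a rough sketch of the same idea in Python rather than C++: a regex does the token matching and all parser state lives on the instance, so several parsers can coexist without globals. The grammar is my guess from a sample perl-grok style config (nested blocks of the form name "label" { ... }; containing key = "value"; settings), and the class and field names are invented for illustration. It is not the c++grok code.

```python
import re

# One token per match: a quoted string, a bare name, or one punctuation mark.
TOKEN_RE = re.compile(
    r'\s*(?:(?P<string>"(?:\\.|[^"\\])*")|(?P<name>[A-Za-z_]\w*)|(?P<punct>[{}=;]))'
)
END = (None, None)

SAMPLE = '''
exec "tail -1 /var/log/auth.log" {
  type "syslog" {
    match = ".*";
    reaction = "echo %=MATCH|shellescape%";
  };
};
'''


class ConfigParser:
    """Recursive descent parser whose state lives on the instance, not in globals."""

    def __init__(self, text):
        self.tokens = self._tokenize(text.rstrip())
        self.index = 0

    @staticmethod
    def _tokenize(text):
        tokens, pos = [], 0
        while pos < len(text):
            m = TOKEN_RE.match(text, pos)
            if m is None:
                raise SyntaxError("unexpected input at offset %d" % pos)
            tokens.append((m.lastgroup, m.group(m.lastgroup)))
            pos = m.end()
        return tokens

    def _peek(self, ahead=0):
        i = self.index + ahead
        return self.tokens[i] if i < len(self.tokens) else END

    def _expect(self, kind, value=None):
        tok_kind, tok_value = self._peek()
        if tok_kind != kind or (value is not None and tok_value != value):
            raise SyntaxError("expected %s %r, got %r" % (kind, value, tok_value))
        self.index += 1
        return tok_value

    def parse(self):
        blocks = []
        while self._peek() != END:
            blocks.append(self._block())
        return blocks

    def _block(self):
        # block := NAME STRING "{" (setting | block)* "}" ";"
        kind = self._expect("name")
        label = self._expect("string")[1:-1]  # strip quotes; escapes left as-is
        self._expect("punct", "{")
        settings, children = {}, []
        while self._peek() != ("punct", "}"):
            # One token of lookahead tells `key = "v";` apart from a nested block.
            if self._peek(1) == ("punct", "="):
                key = self._expect("name")
                self._expect("punct", "=")
                settings[key] = self._expect("string")[1:-1]
                self._expect("punct", ";")
            else:
                children.append(self._block())
        self._expect("punct", "}")
        self._expect("punct", ";")
        return {"kind": kind, "label": label, "settings": settings, "children": children}
```

Calling ConfigParser(SAMPLE).parse() gives a list with one "exec" block whose single child is the "syslog" type block with its match and reaction settings.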

I'm not really sure what made me afraid of doing it, but the fear was totally unfounded.

C++ Grok has working filters and exec sections now.

I finished implementing exec and filters:
exec "tail -1 /var/log/auth.log" {
  type "syslog" {
    match = ".*";
    reaction = "echo %=MATCH|shellescape%";
  };
};
I've made a point of having perl-grok's config format work, because I think it was a reasonable format (you're free to disagree!). At any rate, filters are now working, and the result of the above code is:
Reaction: echo Feb  8 23:25:01 snack CRON\[21596\]: pam_unix\(cron:session\): session closed for user root
Checking for input: tail -1 /var/log/auth.log(0x74b100)
Reading from: tail -1 /var/log/auth.log

Feb 8 23:25:01 snack CRON[21596]: pam_unix(cron:session): session closed for user root
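For anyone wondering what a filter like shellescape amounts to, here is a small Python sketch of the substitution step: look up the captured value, then run it through each named filter in order. The function names and the exact set of escaped characters are my assumptions for illustration; the real implementation may escape a different set.

```python
import re

# Assumption: the characters treated as shell metacharacters and escaped.
SHELL_META = re.compile(r"""([\\`'"$&|;<>()\[\]{}*?!#~])""")

def shellescape(value):
    """Backslash-escape shell metacharacters in value."""
    return SHELL_META.sub(r"\\\1", value)

FILTERS = {"shellescape": shellescape}

# Matches %NAME% or %NAME|filter1|filter2%.
REFERENCE_RE = re.compile(r"%([^%|]+)((?:\|[a-z]+)*)%")

def expand_reaction(template, captures, filters=FILTERS):
    """Substitute each %NAME|filter% reference with its filtered capture."""
    def substitute(m):
        value = captures[m.group(1)]
        # group(2) looks like "|shellescape|other"; skip the empty first piece.
        for name in m.group(2).split("|")[1:]:
            value = filters[name](value)
        return value
    return REFERENCE_RE.sub(substitute, template)
```

Given a capture of "CRON[21596]: pam_unix(cron:session)" for =MATCH, the reaction "echo %=MATCH|shellescape%" expands with the brackets and parentheses backslash-escaped, the same shape as the Reaction line above.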

c++ grok vs perl grok on pattern discovery

I finished up work on the pattern discovery feature for the C++ port of grok. As soon as it was finished, I wanted to see the speed differences between the perl and C++ versions.

  • Perl grok: 6 lines analyzed per second
  • C++ grok: 130 lines analyzed per second
The feature tested here was the one detailed in this post.

130 lines per second isn't fantastic, but it's roughly 21.7 times faster than the perl version, and that's huge.

I still have to implement a few other features to make the C++ version equivalent to the perl version:

  • config file (same format, ideally, as the perl version)
  • filters, like %SYSLOGDATE|parsedate%

Grok predicates - Perl vs C++

I just finished implementing predicates in c++grok (tentative name) and wanted to compare the performance against perl grok.

The input was 50000 lines of apache logfile, amounting to 9.7 megs of data.

I initially attempted this using the regex predicate %IP~/^129% but I realized that perl grok compiles the predicate regex every time it is executed, so it wasn't a fair test. I switched to %IP>=129% instead, which converts the match to an integer first (so 129.21.60.9 turns into 129, for example). That seems like more equal ground, given the implementations in both perl and C++.
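To make the integer predicate concrete, here is a rough Python sketch of what %IP>=129% boils down to: match an IP, take the leading digits as an integer, and compare. The helper names, the simplified IP regex, and the list of supported operators are assumptions for illustration, not the grok implementation.

```python
import operator
import re

# Simplified IP pattern, assumed for this sketch.
IP_RE = re.compile(r"(?:[0-9]+\.){3}[0-9]+")
PREDICATE_RE = re.compile(r"%(?P<name>[A-Z]+)(?P<op>>=|<=|==|>|<)(?P<operand>[0-9]+)%")
OPS = {">=": operator.ge, "<=": operator.le, "==": operator.eq,
       ">": operator.gt, "<": operator.lt}

def leading_int(text):
    """Integer value of the leading digits, so '129.21.60.9' becomes 129."""
    m = re.match(r"[0-9]+", text)
    return int(m.group(0)) if m else 0

def compile_predicate(spec):
    """Turn a spec such as '%IP>=129%' into a function that tests a capture."""
    m = PREDICATE_RE.fullmatch(spec)
    if m is None:
        raise ValueError("unsupported predicate: %r" % spec)
    compare = OPS[m.group("op")]
    operand = int(m.group("operand"))
    return lambda captured: compare(leading_int(captured), operand)

def matching_ips(line, spec):
    """Return the IPs in line that satisfy the predicate."""
    check = compile_predicate(spec)
    return [ip for ip in IP_RE.findall(line) if check(ip)]
```

The predicate is compiled once and reused for every line, which is the property that made the integer comparison a fairer benchmark than the regex predicate.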

# C++ Grok
% /usr/bin/time ./test_patterns "%IP>=129%" < /tmp/access.50klines > /dev/null
2.56user 0.14system 0:02.92elapsed 92%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+408minor)pagefaults 0swaps

# Perl Grok
% /usr/bin/time perl grok -m "%IP>=129/%" -r "%IP%" < /tmp/access.50klines > /dev/null
8.87user 1.24system 0:25.94elapsed 39%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+17721minor)pagefaults 0swaps
One trend remains consistent: the more complexity I add, the greater the C++ version's speed margin over the perl version.
  • Using strict %FOO% patterns with no predicates, the C++ version is 6 to 7 times faster than the perl equivalent in grok.
  • Using predicates shows the C++ version running 10 times faster.
I still need to write test cases for the C++ version in addition to porting the pattern discovery portion from perl.

Exciting :)

Vim function to make g++ errors readable.

If you've ever used templates in C++, you've probably gone blind trying to read the compiler errors.
grokmatch.hpp:7: error: 'typedef class std::map<std::basic_string<char,
std::char_traits<char>, std::allocator<char> >, std::basic_string<char,
std::char_traits<char>, std::allocator<char> >,
std::less<std::basic_string<char, std::char_traits<char>, std::allocator<char>
> >, std::allocator<std::pair<const std::basic_string<char,
std::char_traits<char>, std::allocator<char> >, std::basic_string<char,
std::char_traits<char>, std::allocator<char> > > > >
GrokMatch<boost::xpressive::basic_regex<__gnu_cxx::__normal_iterator<const
char*, std::basic_string<char, std::char_traits<char>, std::allocator<char> > >
> >::match_map_type' is private
I'm supposed to read all that crap? Especially since 99% of the data isn't useful in most cases. The following vim script sanitizes this output:
function! GPPErrorFilter()
  silent! %s/->/ARROW/g
  while search("<", "wc")
    let l:line = getline(".")
    let l:col = col(".")
    let l:char = l:line[l:col - 1]
    if l:char == "<"
      normal d%
    else
      break
    endif
  endwhile
  silent! %s/ARROW/->/g
  silent %!awk '/: In/ { print "---------------"; print }; \!/: In/ {print }'
endfunction
If I dump the output of make to a file (including stderr), and run the function while in vim, using ':call GPPErrorFilter()', the output turns into this:
g++ -g -I/usr/local/include -c -o main.o main.cpp
---------------
grokmatch.hpp: In function 'int main(int, char**)':
grokmatch.hpp:7: error: 'typedef class std::map GrokMatch::match_map_type' is private
main.cpp:43: error: within this context
make: *** [main.o] Error 1
So much better... Now I know I'm clearly trying to access a private typedef. Sanity++
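If you'd rather filter outside of vim, the core trick (delete everything between matching angle brackets, while leaving -> alone) is easy to sketch in Python. This is an illustrative equivalent with a made-up function name, not a port of the vim function, and like the vim version it will mangle things such as operator< in error text.

```python
def strip_template_args(text):
    """Drop everything nested inside <...>, keeping '->' intact."""
    out = []
    depth = 0
    i = 0
    while i < len(text):
        if text.startswith("->", i):
            # Arrow operator: not a closing bracket. Keep it unless we are
            # inside template arguments that are being dropped anyway.
            if depth == 0:
                out.append("->")
            i += 2
            continue
        ch = text[i]
        if ch == "<":
            depth += 1
        elif ch == ">" and depth > 0:
            depth -= 1
        elif depth == 0:
            out.append(ch)
        i += 1
    return "".join(out)
```

You could pipe make's combined output through a small script that applies this line by line before reading it.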

Grok porting to C++

I've got pattern generation working.
% ./test '%NUMBER%' "hello 45.04" "-1.34"
Testing: %NUMBER%
Appending pattern 'NUMBER'
Test str: '(?:[+-]?(?:(?:[0-9]+(?:\.[0-9]*)?)|(?:\.[0-9]+)))'
regexid: 0x692840
Match: 1 / '45.04'
Match: 1 / '-1.34'
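The idea behind pattern generation can be sketched in a few lines of Python: each %NAME% in the input is replaced by the regex registered under that name, and the result is compiled as an ordinary regex. The NUMBER regex is the one printed above; the function name and the choice to wrap each pattern in a named group are my assumptions for illustration.

```python
import re

PATTERNS = {
    "NUMBER": r"(?:[+-]?(?:(?:[0-9]+(?:\.[0-9]*)?)|(?:\.[0-9]+)))",
}

def expand(template, patterns):
    """Replace each %NAME% with a named group wrapping that pattern's regex."""
    def substitute(m):
        name = m.group(1)
        return "(?P<%s>%s)" % (name, patterns[name])
    return re.sub(r"%([A-Z]+)%", substitute, template)

number_re = re.compile(expand("%NUMBER%", PATTERNS))
```

Searching "hello 45.04" and "-1.34" with number_re captures '45.04' and '-1.34' under the NUMBER group. Using the same pattern name twice in one template would need unique group names, which this sketch does not handle.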
I'm pretty sure this is the 4th time I've at least started implementing grok in any given language. The total so far has been: perl, python, ruby, C++. I stopped working on the one in ruby because ruby's regexp engine is lacking in some useful features (*). The python port of grok was written before I added advanced predicates, which is why the ruby port was halted quickly.

(*) I opened a ruby feature request explaining a few problems I'd found with ruby's regexp feature. I even offered to help fix some of them. Circular discussions happened and I basically gave up on the idea of moving to ruby after ruby's own creator expressed a defeatist attitude about adding such a feature. My patches are still available. I don't particularly care that my request hasn't gone anywhere, so don't ask me about it, as I've happily moved on :)

Assuming I do this right, this should give grok a serious boost in speed.

Boost xpressive dynamic regexp with custom assertions

As it turns out, xpressive is (so far) exactly what I'm looking for.

In Xpressive's docs, a 'dynamic regular expression' means that the regex object comes from compiling a regex string, as opposed to a static regular expression, which is coded directly in C++. Very fortunately, you can mix dynamic and static expressions, since both end up turning into the same objects!

What I wanted was dynamic regexps with custom assertions, and here's how you do it:

struct is_private {
  bool operator()(ssub_match const &sub) const {
    /* Some test on 'sub' */
  }
};

/* somewhere in your code ... */
sregex ip_re = sregex::compile("(?:[0-9]+\\.){3}(?:[0-9]+)");
sregex priv_ip_re = ip_re[ check(is_private()) ];
This is excellent because this was one of the features of perl that kept me from making grok available in any other language.

I have a working demo you can download. I've tested it on Linux and FreeBSD with success. It requires boost 1.34.1 and xpressive 2.0.1. The version of xpressive that comes with boost 1.34.1 is insufficient; you must separately download the latest version of xpressive. I installed it by unzipping it and copying boost/xpressive/* to /usr/local/include/boost/xpressive/ - this overwrote the old copy of xpressive I had installed.

Compile with (on freebsd, the -I and :

g++ -I/usr/local/include -c -o boost_xpressive_test.o boost_xpressive_test.cpp
g++  boost_xpressive_test.o -o xpressivetest
Running it:
% ./xpressivetest 
RFC1918 test on '1.2.3.4': fail
RFC1918 test on '4.5.6.7': fail
RFC1918 test on '192.168.0.5': pass
Match on test1: 192.168.0.5
RFC1918 test on '129.21.60.0': fail
RFC1918 test on '29.21.60.0': fail
RFC1918 test on '9.21.60.0': fail
RFC1918 test on '172.17.44.25': pass
Match on test2: 172.17.44.25
This is exactly the behavior I expected.
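For comparison, here is roughly how the same behavior looks when emulated in Python, which has no in-regex assertion hook: find a candidate, run the check, and on failure retry one character further along. That retry is why the output above tests '129.21.60.0', then '29.21.60.0', then '9.21.60.0'. The function names are invented for this sketch, and it uses the standard ipaddress module for the RFC1918 ranges.

```python
import ipaddress
import re

IP_RE = re.compile(r"(?:[0-9]+\.){3}(?:[0-9]+)")
RFC1918_NETWORKS = [
    ipaddress.ip_network(n)
    for n in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
]

def is_private(text):
    """True if text is a valid IPv4 address inside an RFC1918 range."""
    try:
        address = ipaddress.ip_address(text)
    except ValueError:
        return False  # e.g. an octet above 255
    return any(address in network for network in RFC1918_NETWORKS)

def search_with_check(regex, text, check):
    """Return the first match whose text passes check, else None."""
    pos = 0
    while True:
        m = regex.search(text, pos)
        if m is None:
            return None
        if check(m.group(0)):
            return m
        # Rejected: retry from the next character, like a failed assertion
        # forcing the regex engine to move on.
        pos = m.start() + 1
```

For example, search_with_check(IP_RE, "129.21.60.0 172.17.44.25", is_private) skips the public address and returns the match for 172.17.44.25.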

Boost xpressive library supports user-defined assertions

See this doc

Basically this regex library (Boost.Xpressive) supports what I like about perl's regex engine: The (??{ code }) feature (except with different syntax). This means what I had to hack around in grok-perl I can easily express in C++ code. Awesome.

The docs only show examples of using static regexes with this great feature. I'm going to try using it with dynamic regexes. If it works, I'll be converting grok to C++.