The input is 50,000 lines of Apache log file, amounting to 9.7 MB of data.
I initially attempted this using the regex predicate %IP~/^129%, but I realized that perl grok compiles the predicate regex every time it is executed, so it wasn't a fair test. I switched to %IP>=129% instead, which converts the match to an integer first (so 129.21.60.9 turns into 129, for example). That seems like more equal ground given the implementations in both perl and C++.
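To make the integer conversion concrete, here is a minimal sketch of how a predicate like %IP>=129% could be evaluated. This is not grok's actual code: `leading_int` and `predicate_ge` are hypothetical names made up for illustration, and the sketch assumes the conversion simply reads the leading decimal digits of the match.

```cpp
#include <cstdlib>
#include <string>

// Hypothetical helper (not from grok): read the leading integer of a match.
// strtol stops at the first character that cannot be part of the number,
// so "129.21.60.9" yields 129.
long leading_int(const std::string &match) {
  return std::strtol(match.c_str(), nullptr, 10);
}

// Hypothetical helper (not from grok): evaluate a ">=" predicate such as
// %IP>=129% against a captured match.
bool predicate_ge(const std::string &match, long threshold) {
  return leading_int(match) >= threshold;
}
```

With these, `predicate_ge("129.21.60.9", 129)` is true and `predicate_ge("64.12.0.1", 129)` is false. No regex is compiled per line, which is why this predicate compares the two implementations more evenly.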
    # C++ Grok
    % /usr/bin/time ./test_patterns "%IP>=129%" < /tmp/access.50klines > /dev/null
    2.56user 0.14system 0:02.92elapsed 92%CPU (0avgtext+0avgdata 0maxresident)k
    0inputs+0outputs (0major+408minor)pagefaults 0swaps

    # Perl Grok
    % /usr/bin/time perl grok -m "%IP>=129/%" -r "%IP%" < /tmp/access.50klines > /dev/null
    8.87user 1.24system 0:25.94elapsed 39%CPU (0avgtext+0avgdata 0maxresident)k
    0inputs+0outputs (0major+17721minor)pagefaults 0swaps

What remains consistent is the trend: the more complexity I add, the greater the C++ version's speed margin over the perl version.
- Using strict %FOO% patterns with no predicates, the C++ version is 6 to 7 times faster than the perl equivalent in grok.
- Using predicates, the C++ version runs about 10 times faster.
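For anyone who wants to check the ratio for the predicate run shown above, here is a small sketch that turns the elapsed fields into seconds and divides them. The function names are hypothetical, and it assumes the `m:ss.ss` elapsed format seen in that output.

```cpp
#include <cassert>
#include <string>

// Convert an "m:ss.ss" elapsed field (e.g. "0:25.94") to seconds.
// Assumes exactly one ':' separating minutes from seconds.
double elapsed_seconds(const std::string &field) {
  const std::size_t colon = field.find(':');
  assert(colon != std::string::npos);
  const double minutes = std::stod(field.substr(0, colon));
  const double seconds = std::stod(field.substr(colon + 1));
  return minutes * 60.0 + seconds;
}

// Speedup of the fast run over the slow run, as a plain ratio.
double speedup(double slow_seconds, double fast_seconds) {
  assert(fast_seconds > 0.0);
  return slow_seconds / fast_seconds;
}
```

Calling `speedup(elapsed_seconds("0:25.94"), elapsed_seconds("0:02.92"))` gives a little under 9 for that single run, by wall-clock time.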
Exciting :)