
Grok gets more ridiculous

In my last post, I discussed a way to perform additional assertions on a given captured group while still inside of the regex state machine.

I spent some time today and implemented it in grok and it works like a charm! This kind of functionality gives you extreme power in the kind of matches you can specify.

Here's an example: How can we find the first number on a line that's greater than 20?

% jot 50 | xargs -n15
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
31 32 33 34 35 36 37 38 39 40 41 42 43 44 45
46 47 48 49 50
Using the above as our sample input:
% jot 50 | xargs -n15 | perl grok -m "%NUMBER>20%" -r "%NUMBER%"
21
31
46
The assertion '>20' is applied during the regular expression match. This is sweet!
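Grok does this inside the regex engine itself, but the effect can be sketched in Python (an illustration only, not grok's implementation; the function name and sample lines are made up):

```python
import re

def first_number_over(line, threshold):
    # Scan left to right; the predicate is evaluated per candidate,
    # so earlier numbers that fail it do not end the search.
    for m in re.finditer(r"\d+", line):
        if int(m.group()) > threshold:
            return int(m.group())
    return None

for line in ("1 2 3 4 5", "16 17 21 22", "31 32", "46 47"):
    print(first_number_over(line, 20))
```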

Another example would be to 'grep' the output of 'ls -l' for any lines containing a number greater than 5000:

% ls -l | perl grok -m "%INT>5000%"
-rw-r--r--  1 jls  jls   21590 Apr  2 20:17 foo
-rw-r--r--  1 jls  jls   22451 Apr  2 23:52 grok
-rw-r--r--  1 jls  jls  129536 Apr  2 23:47 grok-20070402.tar.gz
-rw-r--r--  1 jls  jls   13539 Mar 22 14:22 grok.1

The equivalent awk statement would look like this:

% ls -l | awk '$5 > 5000'
But with the awk statement, you have to count columns. With grok, all you need to know is that you want any line with an integer greater than 5000. You can be specific about what to match without knowing the precise layout of the data.

What do you get if you can chain predicates? I haven't added that functionality yet, but it would be trivial to add, so expect it soon.

If you're interested, please try the latest version.

Grok, smarter predicates, and outrageous perl regex fu.

The past two days have been a frustrating exercise in working around a perl bug while putting strong pressure on the limits of perl's regex system. The battle was won, but perl left some scarring.

In the last batch of updates in grok, you are now able to specify additional match predicates to the patterns. For example, if you have: "%IP~/^192/%" it will match an IP, but only IPs starting with 192.

This works great if it's the only IP on the line, but what if you want to grep for an IP starting with 192 on a line with multiple IP addresses? The current implementation works something like this:

  1. Perform the match for %IP%
  2. Assert that the matched IP also matches the predicate /^192/
  3. If the previous assertion succeeded, continue as normal (react)
  4. If the assertion failed, drop this line and keep going
Let's consider a simple example:

I want to use %WORD% to match a word. However, I only want a word that has 'foo' in it. Under the current implementation, I might consider using "%WORD~/foo/%", but this would not work, because it would match only the first word, which may not contain 'foo', and fail. Regex predicates should ideally be applied during the match process, not after. Perl has some crazy code-eval features that let you do this. The following code should work:

echo "foo foobar foobaz" \
| perl -nle 'm/(\b\w+\b)(??{ ($^N =~ m@bar@ ? "" : "(?=.\\A)") })/'
Basically, the (??{ }) part checks if the captured group also matches 'bar'. If it doesn't match 'bar' it will inject a regular expression that cannot ever match (guaranteed negative match). The negative match is accomplished with a forward-lookahead looking for any character followed by the beginning of the string (\A), which clearly isn't possible (Neat hack!). The end result is that $1 should be 'foobar'.
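Python's re engine has no equivalent of (??{ }) code evaluation, but the effect, rejecting a candidate capture mid-scan and moving on, can be approximated by iterating candidates and applying the inner pattern as a filter (a sketch only; the function name is made up):

```python
import re

def first_word_matching(text, outer, inner):
    # Try each candidate match of `outer` in turn; keep the first
    # whose captured group also satisfies the `inner` pattern.
    for m in re.finditer(outer, text):
        if re.search(inner, m.group(1)):
            return m.group(1)
    return None

print(first_word_matching("foo foobar foobaz", r"(\b\w+\b)", r"bar"))
```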

However, if you try to run this code:

$ echo "foo foobar foobaz" \
  | perl -nle 'm/(\b\w+\b)(??{ ($^N =~ m@bar@ ? "" : "(?=.\\A)") })/'
> Segmentation fault (core dumped)
Perl crashes. It crashes on linux, freebsd, and solaris in various versions and various permutations of code. It must have something to do with running regular expressions inside code evals that are themselves running from another regex. It seems to smash the stack in some unpleasant way.

Frustrating! I spent a few hours trying to fix this without solving the problem. No luck. Shortly after, an outrageous idea hit me - if it is unsafe to do regex within regex, why not fork and do the inner regex in another process?

Does that sound crazy, stupid, and silly? That's because it is. What bug workarounds aren't? :)

That's exactly what I did. I wrote a short prototype script to test this new subprocess theory. It seems to work! Despite its outrageous complexity, it gets the job done and skirts around the perl bug.

The script matches an IP and ensures that the IP contains a '5'. Here's an example run:

scorn(~) % perl regexp-fu.pl "1.2.3.4 8.1.1.4 1.2.3.5"
1.2.3.4 / (?-xism:5)
8.1.1.4 / (?-xism:5)
1.2.3.5 / (?-xism:5)
Match: 1
Group: 1.2.3.5
You can see that it matched 3 IPs, but only the 3rd one had a '5' in it. The net result of the single regular expression is that it matched the one I wanted. Finally!

I used perlbug(1) in an attempt to report this bug, but I have no idea if it actually sent any email.

This solution is totally suboptimal, but it works. Given the alternative, perl crashing, I'll take this solution.

grok pattern match predicates

I've added predicate tests to grok's pattern match system. These predicates allow you to specify an additional requirement on any matched patterns. Here's the grammar:
  '%' pattern_name [ ':' subname ] [ operator value ] '%'
The difference is that now you can put operator and values on the end of the pattern. The following are valid operators: < > <= >= == ~

== < > <= >=
Match equals, less than, etc. Should be obvious. One special note: if both the match and predicate values are numbers, the comparison is done using perl's numeric comparison operators. Otherwise, string comparators (eq, lt, gt, etc.) are used.
~
Regular expression match.
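As a rough illustration, a pattern-plus-predicate token in the grammar above could be parsed with a regex like this (a hypothetical Python sketch; grok's real parser may differ):

```python
import re

# Hypothetical parser for: '%' pattern_name [ ':' subname ] [ operator value ] '%'
PATTERN = re.compile(
    r"%(?P<name>\w+)"                               # pattern name, e.g. NUMBER
    r"(?::(?P<subname>\w+))?"                       # optional :subname
    r"(?:(?P<op><=|>=|==|~|<|>)(?P<value>[^%]+))?"  # optional predicate
    r"%"
)

for token in ("%NUMBER>100000%", "%IP~/^192/%", "%NUMBER:RESPONSE==404%"):
    m = PATTERN.match(token)
    print(m.group("name"), m.group("subname"), m.group("op"), m.group("value"))
```

Note that multi-character operators like <= must be tried before < in the alternation, or the value would start with a stray '='.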

Still confused? Let's run through some examples.

  1. Let's find out what's going on in our auth.log on any day from 20:00 to 20:09:
    % sudo cat /var/log/auth.log | ./grok -m '%TIME~/^20:0[0-9]/%'
    Sep 15 20:05:24 nightfall sshd[503]: Server listening on :: port 22.
    Sep 15 20:05:24 nightfall sshd[503]: Server listening on 0.0.0.0 port 22.
    Sep 15 20:07:31 nightfall login: login on ttyv0 as jls
    Nov 12 20:09:42 nightfall xscreensaver[647]: FAILED LOGIN 1 ON DISPLAY ":0.0", FOR "jls"
    Nov 26 20:07:18 nightfall sshd[494]: Server listening on :: port 22.
    Nov 26 20:07:18 nightfall sshd[494]: Server listening on 0.0.0.0 port 22.
      
  2. How about looking through 'netstat -s' output for big numbers? Yes, you can use awk for this particular example.
    % netstat -s | ./grok -m "%NUMBER>100000%"
            130632 total packets received
            130465 packets for this host
            114759 packets sent from this host
      
  3. Let's look in "all.log" (all syslog stuff goes here) for sshd lines with an IP starting with '83.'
    % ./grok -m "%SYSLOGBASE~/sshd/% .* %IP~/^83\./%" -r "%SYSLOGDATE% %IP%" < all.log
    Oct 17 09:54:37 83.170.72.199
    Oct 17 09:54:53 83.170.72.199
    Oct 17 09:56:02 83.170.72.199
    <snip some output>
    Apr 16 06:54:52 83.14.104.202
    Apr 16 06:54:53 83.14.104.202
    Apr 16 06:54:54 83.14.104.202
If you're interested in playing with this new feature, download grok-20070226.

This seems pretty powerful. Next feature I need to add is the ability to add predicates to patterns after they've been specified. Something like this would be sweet:

% ./grok -m "%APACHELOG%" -p "%NUMBER:RESPONSE==404%"
< some output showing you all apache log entries with response code 404 >
Something like that, which would let you modify the %NUMBER:RESPONSE% pattern to add a predicate requiring that it be 404.

grok 20070224 released.

It's been almost a year since the first release of grok. I've finally found some energy to put into the project and it's time for another release.

Download: grok-20070224.tar.gz

A quick summary of the changelist (which comes with the tarball):

  • Lots of doc updates. More examples in the manpage.
  • Lots of new builtin patterns
  • More new filters like strftime, ip2host, and uid2user.
  • Fancier syslog matching options
  • New flags -m and -r. See this post about this change
  • filelist, catlist, and filecmd, thanks mostly to Canaan Silberberg.
  • More tests to make sure that it works. Find these in the 't' directory in the grok tarball.
Email me if the tests provided don't work.

grok + netstat

What hosts is this machine connected to?
% netstat -anfinet | perl grok -m "%IP:S%.*?%IP:D%" -r "%IP:D|ip2host%" | sort | uniq
fury.csh.rit.edu
mc-in-f104.google.com
mc-in-f147.google.com
scorn.csh.rit.edu
I have no idea what the mc-in-f104 stuff is, but firefox is open to 'www.google.com' right now. Let's find out what 'www.google.com' points at:
% host www.google.com | perl grok -m "%IP%" -r "%IP|ip2host%"
mc-in-f147.google.com
mc-in-f99.google.com
mc-in-f104.google.com
I keep finding more uses for grok now that you can use it on the commandline easily.

grok for apache log analysis

I recently made a small change to my rss and atom feeds: I added a tracker image to the content. It looks like this:
<img src="/images/spacer.gif?path_of_a_blog_entry">
Any time someone views a post in an RSS feed, that image is loaded and the client browser happily reports the referrer url and I get to track you! Wee.

This is in an effort to find out how many people actually read my blog. Now that I can track viewership of the rss/atom feeds, how do I go about analyzing it? grok to the rescue:

% grep 'spacer.gif\?' accesslog \
   | perl grok -m '%APACHELOG%' -r '%IP% %QUOTEDSTRING:REFERRER%' \
   | sort | uniq -c | sort -n
<IP addresses blotted out, only a few entries shown>
  1 XX.XX.XX.XX "http://www.google.com/reader/view/"
  9 XX.XX.XX.XX "http://whack.livejournal.com/friends"
  10 XX.XX.XXX.XXX "http://www.bloglines.com/myblogs_display?sub=44737984&site=6302113"
  10 XX.XXX.XXX.XX "http://www.semicomplete.com/?flav=rss20"
  27 XX.XXX.XXX.XX "http://whack.livejournal.com/friends"
Each line represents a unique viewer, and tells me what the reader was using to view the feed.
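The same sort | uniq -c aggregation can be done in Python with collections.Counter (a rough sketch, not a grok replacement; the field layout and sample lines are made up, assuming apache combined log format):

```python
import re
from collections import Counter

# Count unique (IP, referrer) pairs from combined-format apache log lines.
logline = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "[^"]*" \d+ \S+ ("[^"]*")')

lines = [
    '1.2.3.4 - - [17/Feb/2007:10:00:00 -0500] "GET /images/spacer.gif?post HTTP/1.1" 200 43 "http://example.com/reader"',
    '1.2.3.4 - - [17/Feb/2007:10:05:00 -0500] "GET /images/spacer.gif?post HTTP/1.1" 200 43 "http://example.com/reader"',
]
counts = Counter()
for line in lines:
    m = logline.match(line)
    if m:
        counts[(m.group(1), m.group(2))] += 1

for (ip, referrer), n in counts.most_common():
    print(n, ip, referrer)
```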

Yay grok.

grok - now with more steroids

I added a new filter to grok: strftime. It takes the same format strings as strftime(3), but you need to use & instead of %, e.g. strftime("&D"). This is useful when combined with parsedate. I also added a new default pattern, APACHELOG, which will match a standard apache log entry.

Along with that little addition comes another, way cooler one, which lets you use grok entirely from the command line in a way resembling grep on crack. For this we needed 2 new command line flags: -m and -r.

  • -m : specify a match string
  • -r : specify a reaction string. Defaults to "%=LINE%" if omitted.

The reaction string specifies what is printed on a match. There is no support (yet?) for specifying reactions other than printing out data. If you want a command to be executed, you could use a clever combination of the shdq filter and have grok output shell commands. More on that later on.
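For a sense of what a shell-quoting filter like shdq has to do, Python's shlex.quote performs a similar job (an analogy only; grok's actual shdq escaping rules may differ):

```python
import shlex

# Quote an arbitrary log line so it survives the shell unchanged,
# then emit an executable command, as the shdq examples do.
line = "Feb  9 05:11:17 kenya sshd[22002]: error: it's \"odd\""
print("echo %s >> /tmp/log" % shlex.quote(line))
```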

The implementation of this is somewhat clunky, but it works. Under the hood, here's what happens:

% grok -m FOO -r BAR
grok takes this and generates the following config in memory:
  exec "cat" {
    type "all" {
      match = "FOO";
      reaction = { print meta2string("BAR", $v); };
    };
  };
The data source being read from is output from cat, which is just a lame hack to trick grok into reading a file from stdin. This is really useful. Let's try a few examples:

Grep a file for anything looking like an IP:

We have a log file with IPs in it. Writing a regex to grab any line with an IP on it is annoying. Let's use grok:
% perl grok -m "%IP%" < /var/log/messages | head -5
Feb  7 19:50:52 kenya dhcpd: Forward map from D962WZ71.home to 192.168.10.189 FAILED: Has an A record but no DHCID, not mine.
Feb  7 19:50:52 kenya dhcpd: Forward map from D962WZ71.home to 192.168.10.189 FAILED: Has an A record but no DHCID, not mine.
Feb  7 22:17:16 kenya named[17044]: stopping command channel on 127.0.0.1#953
Feb  7 22:17:18 kenya named[16239]: command channel listening on 127.0.0.1#953
Feb  9 05:11:17 kenya sshd[22002]: error: PAM: authentication error for root from 211.147.17.110
At this point, grok is behaving much like grep, but you get all of the easy matching power of grok.

Syslog messages with IPs + extra text processing

How about if we want any syslog message with an IP in it, and we want to know what date and program logged it?
% perl grok -m "%SYSLOGBASE% .* %IP%" -r "%SYSLOGDATE|parsedate|strftime('&D')% %PROG% %IP%\n" < /var/log/messages | head -5
02/07/07 dhcpd 192.168.10.189
02/07/07 dhcpd 192.168.10.189
02/07/07 named 127.0.0.1
02/07/07 named 127.0.0.1
02/09/07 sshd 211.147.17.110

Process apache logs

What about the new APACHELOG pattern? Here's a sample usage of it:
% tail -5 access | perl grok -m "%APACHELOG%" -r "%HTTPDATE|parsedate% %QUOTEDSTRING:URL|httpfilter%\n"
1171799519 /blog/geekery/grok-like-grep.html?source=rss20
1171799581 /projects/solaudio
1171799624 /projects/solaudio/
1171799651 /~psionic/seminars/vi/viseminar.html
1171799652 /seminars/vi/viseminar.html

Break a file into parts grouped by IP

What if you want to find out what IP causes the most log chatter? Use the shdq filter and have grok output shell commands which you then pipe to /bin/sh.
% cat /var/log/messages | perl grok -m '%IP%' -r 'echo "%=LINE|shdq%" >> /tmp/log.%IP%'  | sh
% ls /tmp/log.*
/tmp/log.127.0.0.1              /tmp/log.211.147.17.110
/tmp/log.192.168.0.254          /tmp/log.71.70.243.218
% wc -l /tmp/log.*
       4 /tmp/log.127.0.0.1
       1 /tmp/log.192.168.0.254
      70 /tmp/log.211.147.17.110
       1 /tmp/log.71.70.243.218

Latest version (potentially unstable, but the above examples work):

Download and enjoy: grok-20070218.tar.gz

Grok to be rewritten in python

I spent a few hours tonight working on new features for grok and kept running into problems keeping track of data structures in my head. Grok currently makes heavy use of hash-of-hash-of-hash-of-ha...-type datastructures in perl. Remembering context is annoying and slows development.

I decided that grok could use some serious refactoring. So much refactoring, that I could probably get away with rewriting it faster than redesigning portions of it. Since I need to know Python better, and I am more familiar with OO in Python than I am in Perl, I figure I should rewrite grok in Python. Python already has one critical feature that I need in grok: named captures. The hack for this in perl is unwieldy and potentially unmaintainable if future perl versions change the format or deprecate it; it is listed in 'perldoc perlre' as experimental.
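For comparison, here is what Python's named captures look like (a plain illustration; the log line is made up):

```python
import re

# Named groups are spelled (?P<name>...) and retrieved by name.
log = "Feb  9 05:11:17 kenya sshd[22002]: authentication error"
m = re.search(r"(?P<prog>\w+)\[(?P<pid>\d+)\]", log)
print(m.group("prog"), m.group("pid"))  # sshd 22002
```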

At any rate, I made a little prototype that tries to be very OO. My experience with Good(tm) object oriented programming is still limited. The CS curriculum at RIT sucked for teaching proper OO; too many professors taught wildly different styles or were unclear about what Good(tm) OOP should look like.

Therefore, rewriting grok is a good opportunity to explore test-driven development and maintainable object-orientation. Oh, and synergism too. *shifty eyes*

I've got a bit of code up and running already, and writing "for tests" seems to be a very cool way to think about programming. If I force myself to write easily-testable code, then writing tests is easy. Furthermore, initial experience seems to show that adding new features is much easier when all of the code is compartmentalized.

If nothing else, I wrote a somewhat cool debug method that accesses the call stack for function, class, module, etc. Check out the 'debuglib.py' file. The output looks something like this:

grok/groklib.py:52] RegexGenerator.saveRegex: happy message here
The file, line number, class and function name are all discovered magically in debug(). I like this.
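A minimal sketch of how such a debug() can be built with the inspect module (my reconstruction, not the actual debuglib.py; it omits the class-name lookup for brevity):

```python
import inspect
import os

def debug(message):
    # Walk one frame up the call stack and report file, line, and
    # function name automatically, in a debuglib.py-like format.
    frame = inspect.stack()[1]
    return "%s:%d] %s: %s" % (
        os.path.basename(frame.filename), frame.lineno,
        frame.function, message)

def save_regex():
    return debug("happy message here")

print(save_regex())
```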

If you get bored, you can look at the original stuff here: scripts/grok-py-test

Project Ideas for the near future

  • Write/find a non-crappy, very simple comment plugin for pyblosxom.
  • Write/find a better tab-searchy extension than Reveal
  • Rework grok's config syntax to do more with less typing
  • Sleep
If you have any suggestions for the 'find'-type items, let me know :)

Grok plans and ideas

I'm almost done and graduated from RIT, which is why I haven't posted anything recently. Finally done with this school. Wee!

Some noise woke me up from my pleasant slumber, and I can't seem to get back to sleep, so I might as well do some thinking.

I think grok's config syntax is far too cluttered. I would much rather make it more simplified, somehow. Something like:

file /var/log/auth.log: syslog
{blockuser, t=3, i=60} Invalid user %USER% from %IP%

file /var/log/messages: syslog
{tracksudo, prog=su} BAD SU %USER:F% to %USER:T% on %TTY%

reaction blockuser "pfctl -t naughty -T add %IP%"
reaction tracksudo "echo '%USER:F% failed su. (to %USER:T%)' >> /var/log/su.log"
Seems like it's less writing than the existing version. Less writing is good. A more relaxed syntax would be nicer as well, such as not requiring quotes around filenames.

This new syntax also splits reaction definitions from pattern matches, allowing you to call reactions from different files and other matches. I'll be keeping the perlblock and other features that reactions have, because they are quite useful.

I've also come up with a simple solution for reactions that execute shell commands. The current version of grok will run system("your command") every time a shell reaction fires. This is tragically suboptimal due to startup overhead. The simple solution is to run "/bin/sh -" so that there is already a shell accepting standard input and waiting for commands; then simply write each shell command to that process.

I wrote a short program to test the speed of using system() vs printing to the input of a shell. You can view the testsh.pl script and the profiled output.

An abbreviated form of the profiled output follows:

%Time    Sec.     #calls   sec/call  F  name
92.24    1.5127     1000   0.001513     main::sys
 2.04    0.0335     1000   0.000033     main::sh
sys() calls system(), while sh() simply sends data to an existing shell process. The results speak for themselves; even though this example only runs "echo", the startup time of the shell is obviously a huge factor in runtime. The difference is incredible. I am definitely implementing this in the next version of grok. I've already run into many cases where I am processing extremely large logs post-hoc and end up using perl block reactions to speed up execution. This shell execution speed-up will help make grok even faster, and it can always use more speed.
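The same experiment is easy to reproduce in Python (a sketch, not the original testsh.pl; the iteration count and command are arbitrary):

```python
import subprocess
import time

N = 100

# Spawn a fresh shell per command, like calling system() each time.
t0 = time.time()
for _ in range(N):
    subprocess.call(["sh", "-c", "echo hi"], stdout=subprocess.DEVNULL)
per_call = time.time() - t0

# One persistent shell reading commands from stdin (the "/bin/sh -" trick).
t0 = time.time()
sh = subprocess.Popen(["sh", "-"], stdin=subprocess.PIPE,
                      stdout=subprocess.DEVNULL)
for _ in range(N):
    sh.stdin.write(b"echo hi\n")
sh.stdin.close()
sh.wait()
persistent = time.time() - t0

print("per-call: %.3fs  persistent: %.3fs" % (per_call, persistent))
```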

Still no time to work on projects right now, perhaps soon! I'm moving to California in less than a week, so this will have to wait until after the move.